Thursday, May 2, 2013

Good Old R-Squared!

My students are often horrified when I tell them, truthfully, that one of the last pieces of information I look at when evaluating the results of an OLS regression is the coefficient of determination (R2), or its "adjusted" counterpart. Fortunately, it doesn't take long to change their perspective!

After all, we all know that with time-series data it's really easy to get a "high" R2 value, because of the trend components in the data. With cross-section data, very low R2 values are common. For most of us, the signs, magnitudes, and significance of the estimated parameters are of primary interest. Then we worry about testing the assumptions underlying our analysis. R2 is at the bottom of the list of priorities.

Our students learn that R2 represents the proportion of the sample variation in the data for the dependent variable that's "explained" by the regression model. Even so, they don't always think of R2 as a statistic - a random variable that has a distribution.

That's what it is, of course. If we define the coefficient of determination as R2 = [ESS/TSS], where ESS is the "explained sum of squares", and TSS is the "total sum of squares", then clearly the denominator is a function of the random "y" data. Moreover, the numerator is a function of the OLS estimator for the coefficient vector, β, so it's a random quantity too. Yes, R2 is indeed a statistic. So is the familiar "adjusted" R2, of course.

This implies that it has a sampling distribution. How often do students think about this?
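To make the point concrete, here's a small Monte Carlo sketch (the design matrix and parameter values are illustrative choices of my own, not anything special): hold the X data fixed, redraw the errors, and watch R2 bounce around from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design
beta = np.array([1.0, 0.5, -0.5])                           # illustrative values
sigma = 1.0

def r_squared(y, X):
    """Coefficient of determination from an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

# Hold X fixed and redraw the errors: R2 varies from sample to sample
r2_draws = np.array([r_squared(X @ beta + sigma * rng.normal(size=n), X)
                     for _ in range(5000)])
print(f"mean = {r2_draws.mean():.3f}, std. dev. = {r2_draws.std():.3f}")
```

The non-trivial standard deviation across replications is the whole point: R2 has a sampling distribution, just like any other statistic.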

We're going to focus on the sampling distribution of the coefficient of determination for a linear regression model

                  y = Xβ + u    ;    u ~ N[0 , σ2In]  ,

estimated by OLS.

If we recall the relationship between R2 and the F-statistic that is used in the OLS context for testing the hypothesis that all of the "slope coefficients" are zero, namely:

             F = [R2 /(1 - R2)][(n - k) / (k - 1)]  ,

then we start to see something about the possible form of this sampling distribution. If you know something about the connections between random variables that are F-distributed and ones that follow a Beta distribution, you won't be surprised if I tell you that Cramer (1987) showed that the density function for R2 can be expressed as a messy infinite weighted sum of Beta densities, with Poisson weights. It's also a function of the X data and the unknown parameters, β and σ.
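The R2-F identity itself is easy to verify numerically. Here's a sketch with an arbitrary simulated data set; the F-statistic for the "all slopes zero" hypothesis is built both from R2 and directly from the restricted and unrestricted residual sums of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 4                       # k counts the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.3, -0.2, 0.5]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - e @ e / tss

# F-statistic for H0: all slope coefficients are zero, built two ways
f_from_r2 = (r2 / (1.0 - r2)) * ((n - k) / (k - 1))

rss_r = tss                        # restricted model: intercept only
rss_u = e @ e
f_direct = ((rss_r - rss_u) / (k - 1)) / (rss_u / (n - k))

print(f_from_r2, f_direct)         # identical up to rounding
```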

Yuk! Let's try and keep things a little simpler.

Koerts and Abrahamse (1971) explored various aspects of the behaviour of the statistic, R2.

Let E = I - (1/n)(i i'), where i is a column vector with every element equal to one. It is easily shown that E is idempotent, and that it is an operator that transforms a series into deviations about the sample mean. Similarly (because of the idempotency), the matrix (X'EX) is the "deviations about means" counterpart to the (X'X) matrix.
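A tiny numerical check of these two properties of the centering matrix, E = I - (1/n)ii' (note the 1/n factor):

```python
import numpy as np

n = 6
i = np.ones((n, 1))
E = np.eye(n) - (i @ i.T) / n                # centering matrix, E = I - (1/n)ii'

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
print(np.allclose(E @ x, x - x.mean()))      # deviations about the mean: True
print(np.allclose(E @ E, E))                 # idempotent: True
```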

Let's assume that (X'EX) is well-behaved when n becomes infinitely large. Specifically, let's make the standard assumption that

             Limit[(1/n)(X'EX)] = Q,

where Q is finite and non-singular.  Then, Koerts and Abrahamse (1971, pp. 133-136) show that

            plim(R2) = 1 - σ2 / [β'Qβ + σ2] = φ , say.

The quantity, φ, is usually regarded as the "population R2" in this context. The sample R2 is a consistent estimator of the population R2.

Also, asymptotically, R2 converges to a value less than one. Moreover, its large-sample behaviour depends on the unknown values of the parameters, and also on the (unobservable) matrix, Q. The nature of the latter matrix - the degree of multicollinearity in the X data (asymptotically) - plays a role in the probability limit of R2.
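A quick simulation illustrates the point. In this sketch I assume iid standard normal regressors (so that Q = I) and illustrative parameter values; R2 settles down near φ as n grows, not near one:

```python
import numpy as np

rng = np.random.default_rng(1)
slopes = np.array([0.5, -0.5])     # illustrative slope values
sigma = 1.0

# With iid N(0,1) regressors, Q = I, so the "population R2" is:
phi = slopes @ slopes / (slopes @ slopes + sigma**2)   # = 1/3 here

for n in (50, 500, 50000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.concatenate(([1.0], slopes)) + sigma * rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    r2 = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)
    print(n, round(r2, 3))         # heads towards phi = 0.333..., not 1
```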

Now, let's put the large-n asymptotic case behind us, and let's focus on the sampling distribution of R2 in finite samples. First, what can be said about the first two moments of the sampling distribution for the coefficient of determination?

Following earlier work by Barten (1962), Cramer (1987) derived the mean and the standard deviation of R2, and also those of its "adjusted" counterpart. As you'd guess from my earlier description of his result for the p.d.f. of R2, the results are pretty messy, and they depend on the X data and the unobserved parameters, β and σ. However, there are some pretty clear results that emerge:
  • R2 is an upward-biased estimator of the "population R2", φ.
  • The bias vanishes quite quickly as n grows.
  • Whenever the bias of R2 is noticeable, its standard deviation is several times larger than this bias.
  • "Adjusting" R2 (for degrees of freedom) in the usual way virtually eliminates the positive bias in R2 itself.
  • However, the adjusted R2 has even more variability than R2 itself.
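Cramer's findings are easy to reproduce by simulation. In the sketch below (illustrative parameter values, with iid normal regressors so that Q = I by construction), R2 is biased upwards relative to φ, the degrees-of-freedom adjustment removes most of that bias, and the adjusted statistic is the more variable of the two:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 3
slopes = np.array([0.3, -0.3])
sigma = 1.0
phi = slopes @ slopes / (slopes @ slopes + sigma**2)   # population R2 (Q = I)

r2s = np.empty(10000)
for j in range(10000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.concatenate(([1.0], slopes)) + sigma * rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    r2s[j] = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

r2bars = 1.0 - (1.0 - r2s) * (n - 1) / (n - k)         # adjusted R2

print(f"phi            = {phi:.3f}")
print(f"R2:     mean {r2s.mean():.3f}, s.d. {r2s.std():.3f}")
print(f"adj R2: mean {r2bars.mean():.3f}, s.d. {r2bars.std():.3f}")
```

With this small sample (n = 20), the upward bias in R2 is clearly visible, while the adjusted version sits almost on top of φ, at the cost of a larger standard deviation.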
Koerts and Abrahamse (1971) used Imhof's (1961) algorithm to numerically evaluate the full sampling distribution of R2, for various choices of the X matrix, and different parameter values. They found that:
  • This sampling distribution is sensitive to the form of the X matrix.
  • If the disturbances in the regression model follow a positive (negative) AR(1) process, then this shifts the sampling distribution of R2 to the right (left) - that is, towards one (zero). In other words, if the errors follow a positive AR(1) process, the probability of observing a large sample R2 value is increased, ceteris paribus.
However, the effects of autocorrelation on the sampling distribution of R2 were studied more extensively by Carrodus and Giles (1992). We considered a range of X matrices, and both AR(1) and MA(1) processes for the regression errors, and obtained results that did not fully support Koerts and Abrahamse. Using Davies' (1980) algorithm, we computed the c.d.f.'s and p.d.f.'s of R2, and found that:
  • Almost without exception, negative AR(1) errors shift the distribution of R2 to the left.
  • Positive AR(1) errors can shift the distribution of R2 either to the left or to the right, depending on the X data.
  • The direction of the shift in the distribution with either positive or negative MA(1) errors is very "mixed", and exhibits no clear pattern.
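Here's a minimal simulation along these lines (a sketch only, with one arbitrary fixed regressor; the AR(1) innovations are scaled so that the error variance is the same for every value of rho, so any shift in the distribution reflects the correlation structure rather than a change in the error variance):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])  # one fixed regressor
beta = np.array([1.0, 0.5])

def mean_r2(rho, reps=4000):
    """Average R2 when the errors follow a stationary AR(1) process.

    Innovations are scaled so Var(u_t) = 1 for every rho."""
    scale = np.sqrt(1.0 - rho**2)
    out = np.empty(reps)
    for j in range(reps):
        eps = scale * rng.normal(size=n)
        u = np.empty(n)
        u[0] = eps[0] / scale          # stationary start: u_0 ~ N(0, 1)
        for t in range(1, n):
            u[t] = rho * u[t - 1] + eps[t]
        y = X @ beta + u
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        out[j] = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)
    return out.mean()

print(mean_r2(0.0), mean_r2(-0.8))
```

Re-running this with different (fixed) X matrices and different values of rho is a good way to see for yourself how data-dependent the shifts in the distribution can be.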
So, there are a few things for students to think about when they look at the value of R2 in their regression results.

First, the coefficient of determination is a sample statistic, and as such it is a random variable with a sampling distribution. Second, the form of this sampling distribution depends on the X data, and on the unknown parameters, β and σ. Third, this sampling distribution gets distorted if the regression errors are autocorrelated. Finally, even if we have a very large sample of data, R2 converges in probability to a value less than one, regardless of the data values or the values of the unknown parameters.

The bottom line - it's just another statistic!


References

Barten, A. P., 1962. Note on the unbiased estimation of the squared multiple correlation coefficient. Statistica Neerlandica, 16, 151-163.

Carrodus, M. L. and D. E. A. Giles, 1992. The exact distribution of R2 when the regression disturbances are autocorrelated. Economics Letters, 38, 375-380.

Cramer, J. S., 1987. Mean and variance of R2 in small and moderate samples. Journal of Econometrics, 35, 253-266.

Davies, R. B., 1980. Algorithm AS 155: The distribution of a linear combination of χ2 random variables. Journal of the Royal Statistical Society, Series C (Applied Statistics), 29, 323-333.

Imhof, J. P., 1961. Computing the distribution of quadratic forms in normal variables. Biometrika, 48, 419-426.

Koerts, J. and A. P. J. Abrahamse, 1971. On the Theory and Application of the General Linear Model. Rotterdam University Press, Rotterdam.



© 2013, David E. Giles

19 comments:

  1. Suggested future article: Describe and illustrate real examples of R-squared's "dumb-bell" effect! ... and the corollary: sparsity of data in the tails.

  2. Nice post! Something that I haven't thought too much about.

  3. Good post, I also changed my perspective. Thanks.

  4. very good article, thank you for that. I have one question: for my own research I try to compare R-squared measures of one model that is fed with data derived under different accounting regimes (e.g. cash flow and earnings derived under local and international accounting standards). So, I get two R-squared measures, one if the model is fed with international accounting data and one if it is fed with the local accounting data.

    Is there a statistical test to compare both R-squared measures? If R-squared has a distribution one could use a z-test; the z-statistic could be: (R-squared1 - R-squared2) / sqrt[var(R-squared1) + var(R-squared2)] .

    The problem is: Stata would not give me the variances of the R-squared measures.

    I would appreciate your help,
    thanks!

    Replies
    1. Thank you. See my more recent post on the distribution of R-squared at: http://davegiles.blogspot.ca/2013/10/more-on-distribution-of-r-squared.html#more

      Note that the results given there apply only if the null hypothesis (of no linear relationship between y and X) is TRUE. In general, the F distribution will have to be replaced with a non-central F distribution, and the Beta distribution will become a non-central Beta distribution. This could then form the basis for constructing a test along the lines that you have in mind, although I haven't seen this done. Of course, it won't be a z-test! And Stata isn't going to be of any help!

  5. Thanks. The test I have in mind was performed in the following paper: "The value relevance of German accounting measures: an empirical analysis" by Harris, Lang, and Möller (Journal of Accounting Research, Vol 32, No. 2 (1994)). The Z-statistic used is described in FN 38 on p. 198. The authors state that their test is based on Cramer (1987) even though I don't see any specific test proposed there.

    If Stata is not going to be of any help is there any other package that could give me the variances?

    Replies
    1. Hate to say it, but that Z-test is nonsensical. The main point to keep in mind is the following. A statistical test relates to a statement (hypothesis) about something in the POPULATION, not the sample. We use one or more sample statistics to test that hypothesis. Going back to your original comment, what you really want is a test of the hypothesis that the POPULATION R-squared for one model equals that for the other model, presumably against a one-sided alternative hypothesis. This test could be based on the 2 sample R-squared values, using knowledge of the respective distributions.

      There is a well-established statistics literature (going back at least to the 1930's) on the problem of testing the equality of simple (Pearson) correlations associated with Normal populations. Presumably this can be extended to the multiple regression case you're interested in.

      I can't think of any "canned" software that's going to give you what you want.

    2. Here are 2 references that might be of some help:

      http://166.111.121.20:9080/mathjournal/DBSX200004/dbsx200004005.caj.pdf

      http://www.tandfonline.com/doi/abs/10.1080/03610918908812798#.Ulg_iVCkr1o

  6. In this case, would using the bootstrap to calculate R2's standard error provide a viable alternative?

  7. I am studying the value relevance of earnings per share and book value of equity after the adoption of IFRS. I measure value relevance using the adjusted R2. I would like to compare the adjusted R2 of the pre-adoption period with that of the post-adoption period, and to know whether there is a difference and whether that difference is significant. I would like to know the different tests that I can use in Stata. Thank you.

  8. I am currently pursuing Master in Applied Statistics.
    To meet the continuous assessment requirements of Applied Econometrics course, I was required to perform an OLS regression research project on cross sectional data.
    I would be most grateful if Prof. could provide the appropriate assistance to enable me to have a cross sectional data set for this econometric assignment.

    I am looking forward to hearing from Prof. soon.

    Replies
    1. You obviously have access to the internet.........................

  9. I would like to test if the shared variance of two predictors x1 and x2 with a criterion y is non-zero. I can calculate the shared variance by taking the R-squared minus the squared semi-partial correlations of x1 and x2 to give me the shared variance. My first question is should I use the R squared or is it a better idea to use the adjusted r squared to compute the shared variance? More importantly, I would like to test if this shared variance is > 0. Can I still use the Beta distribution?

    Replies
    1. That's an interesting question. I would use the (unadjusted) R-squared for this purpose. It's not clear, though, what the distribution of the shared variance will be - I doubt if it is still Beta. You could always bootstrap the test of the hypothesis that you're interested in.
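      For what it's worth, a pairs bootstrap for the standard error of R2 takes only a few lines. The data here are purely illustrative (a hypothetical one-regressor model), and this is a sketch of the resampling mechanics, not an endorsement of any particular test:

      ```python
      import numpy as np

      rng = np.random.default_rng(11)

      # Hypothetical data: y regressed on an intercept and one x
      n = 40
      x = rng.normal(size=n)
      y = 1.0 + 0.5 * x + rng.normal(size=n)
      X = np.column_stack([np.ones(n), x])

      def r_squared(y, X):
          b = np.linalg.lstsq(X, y, rcond=None)[0]
          e = y - X @ b
          return 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

      # Pairs bootstrap: resample (y, x) rows with replacement
      boot = np.empty(2000)
      for j in range(2000):
          idx = rng.integers(0, n, size=n)
          boot[j] = r_squared(y[idx], X[idx])

      print(f"R2 = {r_squared(y, X):.3f}, bootstrap s.e. = {boot.std(ddof=1):.3f}")
      ```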

  10. Dear Prof Dave,

    I ran a regression on a cross-sectional firm level data across different countries, but my R2 is about 0.05.

    This worries me, as it seems my model does not explain much of the variation in the dependent variable.

    What do you suggest I do about this please?

    Replies
    1. Very low R-squared values often arise when cross-section data are used. It's very common.
