Thursday, May 2, 2013

Good Old R-Squared!

My students are often horrified when I tell them, truthfully, that one of the last pieces of information I look at when evaluating the results of an OLS regression is the coefficient of determination (R2), or its "adjusted" counterpart. Fortunately, it doesn't take long to change their perspective!

After all, we all know that with time-series data it's really easy to get a "high" R2 value, because of the trend components in the data. With cross-section data, very low R2 values are common. For most of us, the signs, magnitudes, and significance of the estimated parameters are of primary interest. Then we worry about testing the assumptions underlying our analysis. R2 is at the bottom of the list of priorities.

Our students learn that R2 represents the proportion of the sample variation in the data for the dependent variable that's "explained" by the regression model. Even so, they don't always think of R2 as a statistic - a random variable that has a distribution.

That's what it is, of course. If we define the coefficient of determination as R2 = [ESS/TSS], where ESS is the "explained sum of squares", and TSS is the "total sum of squares", then clearly the denominator is a function of the random "y" data. Moreover, the numerator is a function of the OLS estimator for the coefficient vector, β, so it's a random quantity too. Yes, R2 is indeed a statistic. So is the familiar "adjusted" R2, of course.

This implies that it has a sampling distribution. How often do students think about this?
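To make the point concrete, here's a small Monte Carlo sketch (the design matrix and parameter values are illustrative choices of my own, not anything special): hold the X data fixed, redraw the errors, and watch R2 bounce around from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design
beta = np.array([1.0, 0.5, -0.5])                           # illustrative values
sigma = 1.0

def r_squared(y, X):
    """Coefficient of determination from an OLS fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

# Hold X fixed and redraw the errors: R2 varies from sample to sample
r2_draws = np.array([r_squared(X @ beta + sigma * rng.normal(size=n), X)
                     for _ in range(5000)])
print(f"mean = {r2_draws.mean():.3f}, std. dev. = {r2_draws.std():.3f}")
```

The non-trivial standard deviation across replications is the whole point: R2 has a sampling distribution, just like any other statistic.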

We're going to focus on the sampling distribution of the coefficient of determination for a linear regression model

                  y = Xβ + u    ;    u ~ N[0 , σ2In]  ,

estimated by OLS.

If we recall the relationship between R2 and the F-statistic that is used in the OLS context for testing the hypothesis that all of the "slope coefficients" are zero, namely:

             F = [R2 /(1 - R2)][(n - k) / (k - 1)]  ,

then we start to see something about the possible form of this sampling distribution. If you know something about the connections between random variables that are F-distributed and ones that follow a Beta distribution, you won't be surprised if I tell you that Cramer (1987) showed that the density function for R2 can be expressed as a messy infinite weighted sum of Beta densities, with Poisson weights. It's also a function of the X data and the unknown parameters, β and σ.
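The R2-F identity itself is easy to verify numerically. Here's a sketch with an arbitrary simulated data set; the F-statistic for the "all slopes zero" hypothesis is built both from R2 and directly from the restricted and unrestricted residual sums of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 4                       # k counts the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.3, -0.2, 0.5]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - e @ e / tss

# F-statistic for H0: all slope coefficients are zero, built two ways
f_from_r2 = (r2 / (1.0 - r2)) * ((n - k) / (k - 1))

rss_r = tss                        # restricted model: intercept only
rss_u = e @ e
f_direct = ((rss_r - rss_u) / (k - 1)) / (rss_u / (n - k))

print(f_from_r2, f_direct)         # identical up to rounding
```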

Yuk! Let's try and keep things a little simpler.

Koerts and Abrahamse (1971) explored various aspects of the behaviour of the statistic, R2.

Let E = I - (1/n)(i i'), where i is a column vector with every element equal to one. It is easily shown that E is idempotent, and that it is an operator that transforms a series into deviations about the sample mean. Similarly (because of the idempotency), the matrix (X'EX) is the "deviations about means" counterpart to the (X'X) matrix.
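A tiny numerical check of these two properties of the centering matrix, E = I - (1/n)ii' (note the 1/n factor):

```python
import numpy as np

n = 6
i = np.ones((n, 1))
E = np.eye(n) - (i @ i.T) / n                # centering matrix, E = I - (1/n)ii'

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
print(np.allclose(E @ x, x - x.mean()))      # deviations about the mean: True
print(np.allclose(E @ E, E))                 # idempotent: True
```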

Let's assume that (X'EX) is well-behaved when n becomes infinitely large. Specifically, let's make the standard assumption that

             Limit[(1/n)(X'EX)] = Q,

where Q is finite and non-singular.  Then, Koerts and Abrahamse (1971, pp. 133-136) show that

            plim(R2) = 1 - σ2 / [β'Qβ + σ2] = φ , say.

The quantity, φ, is usually regarded as the "population R2" in this context. The sample R2 is a consistent estimator of the population R2.

Also, asymptotically, R2 converges to a value less than one. Moreover, its large-sample behaviour depends on the unknown values of the parameters, and also on the (unobservable) matrix, Q. The nature of the latter matrix - the degree of multicollinearity in the X data (asymptotically) - plays a role in the probability limit of R2.
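A quick simulation illustrates the point. In this sketch I assume iid standard normal regressors (so that Q = I) and illustrative parameter values; R2 settles down near φ as n grows, not near one:

```python
import numpy as np

rng = np.random.default_rng(1)
slopes = np.array([0.5, -0.5])     # illustrative slope values
sigma = 1.0

# With iid N(0,1) regressors, Q = I, so the "population R2" is:
phi = slopes @ slopes / (slopes @ slopes + sigma**2)   # = 1/3 here

for n in (50, 500, 50000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.concatenate(([1.0], slopes)) + sigma * rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    r2 = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)
    print(n, round(r2, 3))         # heads towards phi = 0.333..., not 1
```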

Now, let's put the large-n asymptotic case behind us, and let's focus on the sampling distribution of R2 in finite samples. First, what can be said about the first two moments of the sampling distribution for the coefficient of determination?

Following earlier work by Barten (1962), Cramer (1987) derived the mean and the standard deviation of R2, and also those of its "adjusted" counterpart. As you'd guess from my earlier description of his result for the p.d.f. of R2, the results are pretty messy, and they depend on the X data and the unobserved parameters, β and σ. However, there are some pretty clear results that emerge:
  • R2 is an upward-biased estimator of the "population R2", φ.
  • The bias vanishes quite quickly as n grows.
  • Whenever the bias of R2 is noticeable, its standard deviation is several times larger than this bias.
  • "Adjusting" R2 (for degrees of freedom) in the usual way virtually eliminates the positive bias in R2 itself.
  • However, the adjusted R2 has even more variability than R2 itself.
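Cramer's findings are easy to reproduce by simulation. In the sketch below (illustrative parameter values, with iid normal regressors so that Q = I by construction), R2 is biased upwards relative to φ, the degrees-of-freedom adjustment removes most of that bias, and the adjusted statistic is the more variable of the two:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 3
slopes = np.array([0.3, -0.3])
sigma = 1.0
phi = slopes @ slopes / (slopes @ slopes + sigma**2)   # population R2 (Q = I)

r2s = np.empty(10000)
for j in range(10000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.concatenate(([1.0], slopes)) + sigma * rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    r2s[j] = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

r2bars = 1.0 - (1.0 - r2s) * (n - 1) / (n - k)         # adjusted R2

print(f"phi            = {phi:.3f}")
print(f"R2:     mean {r2s.mean():.3f}, s.d. {r2s.std():.3f}")
print(f"adj R2: mean {r2bars.mean():.3f}, s.d. {r2bars.std():.3f}")
```

With this small sample (n = 20), the upward bias in R2 is clearly visible, while the adjusted version sits almost on top of φ, at the cost of a larger standard deviation.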
Koerts and Abrahamse (1971) used Imhof's (1961) algorithm to numerically evaluate the full sampling distribution of R2, for various choices of the X matrix, and different parameter values. They found that:
  • This sampling distribution is sensitive to the form of the X matrix.
  • If the disturbances in the regression model follow a positive (negative) AR(1) process, then this shifts the sampling distribution of R2 to the right (left) - that is, towards one (zero). In other words, if the errors follow a positive AR(1) process, the probability of observing a large sample R2 value is increased, ceteris paribus.
However, the effects of autocorrelation on the sampling distribution of R2 were studied more extensively by Carrodus and Giles (1992). We considered a range of X matrices, and both AR(1) and MA(1) processes for the regression errors, and obtained results that did not fully support Koerts and Abrahamse. Using Davies' (1980) algorithm, we computed the c.d.f.'s and p.d.f.'s of R2, and found that:
  • Almost without exception, negative AR(1) errors shift the distribution of R2 to the left.
  • Positive AR(1) errors can shift the distribution of R2 either to the left or to the right, depending on the X data.
  • The direction of the shift in the distribution with either positive or negative MA(1) errors is very "mixed", and exhibits no clear pattern.
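Here's a minimal simulation along these lines (a sketch only, with one arbitrary fixed regressor; the AR(1) innovations are scaled so that the error variance is the same for every value of rho, so any shift in the distribution reflects the correlation structure rather than a change in the error variance):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])  # one fixed regressor
beta = np.array([1.0, 0.5])

def mean_r2(rho, reps=4000):
    """Average R2 when the errors follow a stationary AR(1) process.

    Innovations are scaled so Var(u_t) = 1 for every rho."""
    scale = np.sqrt(1.0 - rho**2)
    out = np.empty(reps)
    for j in range(reps):
        eps = scale * rng.normal(size=n)
        u = np.empty(n)
        u[0] = eps[0] / scale          # stationary start: u_0 ~ N(0, 1)
        for t in range(1, n):
            u[t] = rho * u[t - 1] + eps[t]
        y = X @ beta + u
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        out[j] = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)
    return out.mean()

print(mean_r2(0.0), mean_r2(-0.8))
```

Re-running this with different (fixed) X matrices and different values of rho is a good way to see for yourself how data-dependent the shifts in the distribution can be.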
So, there are a few things for students to think about when they look at the value of R2 in their regression results.

First, the coefficient of determination is a sample statistic, and as such it is a random variable with a sampling distribution. Second, the form of this sampling distribution depends on the X data, and on the unknown parameters, β and σ. Third, this sampling distribution gets distorted if the regression errors are autocorrelated. Finally, even if we have a very large sample of data, R2 converges in probability to a value less than one, regardless of the data values or the values of the unknown parameters.

The bottom line - it's just another statistic!


References

Barten, A. P., 1962. Note on the unbiased estimation of the squared multiple correlation coefficient. Statistica Neerlandica, 16, 151-163.

Carrodus, M. L. and D. E. A. Giles, 1992. The exact distribution of R2 when the regression disturbances are autocorrelated. Economics Letters, 38, 375-380.

Cramer, J. S., 1987. Mean and variance of R2 in small and moderate samples. Journal of Econometrics, 35, 253-266.

Davies, R. B., 1980. Algorithm AS 155: The distribution of a linear combination of χ2 random variables. Journal of the Royal Statistical Society, Series C (Applied Statistics), 29, 323-333.

Imhof, J. P., 1961. Computing the distribution of quadratic forms in normal variables. Biometrika, 48, 419-426.

Koerts, J. and A. P. J. Abrahamse, 1971. On the Theory and Application of the General Linear Model. Rotterdam University Press, Rotterdam.



© 2013, David E. Giles

19 comments:

  1. Suggested future article: Describe and illustrate real examples of R-squared's "dumb-bell" effect! ... and the corollary: sparsity of data in the tails.

  2. Nice post! Something that I haven't thought too much about.

  3. Good post, I also changed my perspective. Thanks.

  4. very good article, thank you for that. I have one question: for my own research I try to compare R-squared measures of one model that is fed with data derived under different accounting regimes (e.g. cash flow and earnings derived under local and international accounting standards). So, I get two R-squared measures, one if the model is fed with international accounting data and one if it is fed with the local accounting data.

    Is there a statistical test to compare both R-squared measures? If R-squared has a distribution one could use a z-test; the z-statistic could be: (R-squared1 - R-squared2) / sqrt[var(R-squared1) + var(R-squared2)] .

    The problem is: Stata would not give me the variances of the R-squared measures.

    I would appreciate your help,
    thanks!

    Replies
    1. Thank you. See my more recent post on the distribution of R-squared at: http://davegiles.blogspot.ca/2013/10/more-on-distribution-of-r-squared.html#more

      Note that the results given there apply only if the null hypothesis (of no linear relationship between y and X) is TRUE. In general, the F distribution will have to be replaced with a non-central F distribution, and the Beta distribution will become a non-central Beta distribution. This could then form the basis for constructing a test along the lines that you have in mind, although I haven't seen this done. Of course, it won't be a z-test! And Stata isn't going to be of any help!

  5. Thanks. The test I have in mind was performed in the following paper: "The value relevance of German accounting measures: an empirical analysis" by Harris, Lang, and Möller (Journal of Accounting Research, Vol 32, No. 2 (1994)). The Z-statistic used is described in FN 38 on p. 198. The authors state that their test is based on Cramer (1987) even though I don't see any specific test proposed there.

    If Stata is not going to be of any help is there any other package that could give me the variances?

    Replies
    1. Hate to say it, but that Z-test is nonsensical. The main point to keep in mind is the following. A statistical test relates to a statement (hypothesis) about something in the POPULATION, not the sample. We use one or more sample statistics to test that hypothesis. Going back to your original comment, what you really want is a test of the hypothesis that the POPULATION R-squared for one model equals that for the other model, presumably against a one-sided alternative hypothesis. This test could be based on the 2 sample R-squared values, using knowledge of the respective distributions.

      There is a well-established statistics literature (going back at least to the 1930's) on the problem of testing the equality of simple (Pearson) correlations associated with Normal populations. Presumably this can be extended to the multiple regression case you're interested in.

      I can't think of any "canned" software that's going to give you what you want.

    2. Here are 2 references that might be of some help:

      http://166.111.121.20:9080/mathjournal/DBSX200004/dbsx200004005.caj.pdf

      http://www.tandfonline.com/doi/abs/10.1080/03610918908812798#.Ulg_iVCkr1o

  6. In this case, would using the bootstrap to calculate R2's standard error provide a viable alternative?

  7. I am studying the value relevance of earnings per share and book value of equity after the adoption of IFRS. I measure value relevance using the adjusted R2. I would like to compare the adjusted R2 of the pre-adoption period with that of the post-adoption period, and to know whether there is a difference and whether that difference is significant. I would like to know the different tests that I can use in Stata. Thank you.

  8. I am currently pursuing Master in Applied Statistics.
    To meet the continuous assessment requirements of Applied Econometrics course, I was required to perform an OLS regression research project on cross sectional data.
    I would be most grateful if Prof. could provide the appropriate assistance to enable me to have a cross sectional data set for this econometric assignment.

    I am looking forward to hearing from Prof. soon.

    Replies
    1. You obviously have access to the internet.........................

  9. I would like to test if the shared variance of two predictors x1 and x2 with a criterion y is non-zero. I can calculate the shared variance by taking the R-squared minus the squared semi-partial correlations of x1 and x2 to give me the shared variance. My first question is should I use the R squared or is it a better idea to use the adjusted r squared to compute the shared variance? More importantly, I would like to test if this shared variance is > 0. Can I still use the Beta distribution?

    Replies
    1. That's an interesting question. I would use the (unadjusted) R-squared for this purpose. It's not clear, though, what the distribution of the shared variance will be - I doubt if it is still Beta. You could always bootstrap the test of the hypothesis that you're interested in.
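      For what it's worth, a pairs bootstrap for the standard error of R2 takes only a few lines. The data here are purely illustrative (a hypothetical one-regressor model), and this is a sketch of the resampling mechanics, not an endorsement of any particular test:

      ```python
      import numpy as np

      rng = np.random.default_rng(11)

      # Hypothetical data: y regressed on an intercept and one x
      n = 40
      x = rng.normal(size=n)
      y = 1.0 + 0.5 * x + rng.normal(size=n)
      X = np.column_stack([np.ones(n), x])

      def r_squared(y, X):
          b = np.linalg.lstsq(X, y, rcond=None)[0]
          e = y - X @ b
          return 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

      # Pairs bootstrap: resample (y, x) rows with replacement
      boot = np.empty(2000)
      for j in range(2000):
          idx = rng.integers(0, n, size=n)
          boot[j] = r_squared(y[idx], X[idx])

      print(f"R2 = {r_squared(y, X):.3f}, bootstrap s.e. = {boot.std(ddof=1):.3f}")
      ```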

  10. Dear Prof Dave,

    I ran a regression on a cross-sectional firm level data across different countries, but my R2 is about 0.05.

    This worries me, as it seems my model does not explain much of the variation in the dependent variable.

    What do you suggest I do about this please?

    Replies
    1. Very low R-squared values often arise when cross-section data are used. It's very common.
