A while back I wrote about the fact that $R^2$ (the coefficient of determination for a linear regression model) is a sample statistic, and as such it has a sampling distribution. In that post, and in follow-up posts here and here, I discussed some of the properties of that sampling distribution, including the mean and variance of $R^2$ in certain circumstances.
Let's take that discussion a step further by comparing the MSE's of $R^2$ and its "adjusted" counterpart. Throughout, the model is the usual linear regression set-up:
$y = X\beta + \varepsilon$ ; $\varepsilon \sim N[0, \sigma^2 I_n]$ .
The coefficient of determination can be expressed in various (equivalent) ways. Let's write it as:
$R^2 = 1 - (e'e) / (y^{*\prime}y^{*})$ ,
where $y^{*}$ is the $y$ vector, but expressed as deviations about the sample mean; and $e$ is the OLS residual vector, $e = y - Xb$, where $b = (X'X)^{-1}X'y$.
The "adjusted" R2 is:
RA2 = 1 - [(e'e ) / (n - k)] / [(y*'y*) / (n - 1)],
where k is the number of regressors (including the intercept).
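As an aside, both measures are easy to compute directly from these matrix formulas. Here is a minimal sketch in Python/NumPy (the function name is just illustrative; it assumes the regressor matrix already contains a column of ones for the intercept):

```python
import numpy as np

def r2_and_adjusted_r2(y, X):
    """R^2 and adjusted R^2, computed from the matrix formulas above.

    X is the full (n x k) regressor matrix and is assumed to include
    a column of ones for the intercept.
    """
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)     # b = (X'X)^{-1} X'y
    e = y - X @ b                             # OLS residual vector
    y_star = y - y.mean()                     # y in deviations from its sample mean
    r2 = 1.0 - (e @ e) / (y_star @ y_star)
    r2_adj = 1.0 - ((e @ e) / (n - k)) / ((y_star @ y_star) / (n - 1))
    return r2, r2_adj
```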
Each of the sums of squares in the original $R^2$ formula is divided by the appropriate degrees of freedom. You'll recall the following results:
- $R_A^2 \leq R^2$.
- Unlike $R^2$, $R_A^2$ can take negative values.
- Although $R^2$ cannot decrease if we add a regressor to the model, $R_A^2$ will decrease if the (usual) t-statistic associated with that regressor is less than one in absolute value. (See here and here, and the sketch immediately below this list.)
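As a quick illustration of that last point, here is a minimal sketch (Python, using statsmodels; the sample size, seed, and variable names are just arbitrary choices for the illustration). The extra regressor x2 is generated independently of y, so its t-statistic will typically be small; $R^2$ cannot fall when x2 is added, but $R_A^2$ falls whenever that t-statistic is less than one in absolute value:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # irrelevant candidate regressor
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("t-statistic on x2:", big.tvalues[2])
print("R^2:      ", small.rsquared, "->", big.rsquared)          # never decreases
print("adj. R^2: ", small.rsquared_adj, "->", big.rsquared_adj)  # falls iff |t| < 1
```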
Now, note that the relationship between $R^2$ and $R_A^2$ can be written in various ways, including:
$R_A^2 = 1 - [(n - 1) / (n - k)](1 - R^2)$ .    (1)
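For completeness, (1) follows in one line from the two definitions, because $1 - R^2 = (e'e)/(y^{*\prime}y^{*})$:

```latex
\begin{align*}
R_A^2 \;=\; 1 - \frac{(e'e)/(n-k)}{(y^{*\prime}y^{*})/(n-1)}
      \;=\; 1 - \left(\frac{n-1}{n-k}\right)\frac{e'e}{y^{*\prime}y^{*}}
      \;=\; 1 - \left(\frac{n-1}{n-k}\right)\left(1 - R^2\right) .
\end{align*}
```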
In one of the earlier, related posts I showed that the following results hold in the special situation where there is no linear relationship between y and the (non-intercept) regressors:
$E[R^2] = (k - 1) / (n - 1)$    (2)
$\text{Var}[R^2] = [(k - 1)(n - k)] / [n(n - 1)^2]$    (3)
These results were obtained by exploiting the relationship between $R^2$ and the F-statistic that we use to test the joint significance of the regressors. Notice that when the null hypothesis for this F-test is true (and there is no linear relationship), the population coefficient of determination is zero.
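For reference, the link in question is just the standard identity connecting $R^2$ and that F-statistic:

```latex
\begin{align*}
F \;=\; \frac{R^2/(k-1)}{(1-R^2)/(n-k)}
\qquad\Longleftrightarrow\qquad
R^2 \;=\; \frac{(k-1)F}{(k-1)F + (n-k)} .
\end{align*}
```

Because $R^2$ is a monotonically increasing function of $F$, the null (central $F_{k-1,\,n-k}$) distribution of the test statistic pins down the null distribution of $R^2$, and that is what is exploited to obtain the moments in (2) and (3).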
So, we see that the usual sample $R^2$ is an upwards-biased estimator of the population $R^2$ (in this special case), and its MSE is:
$\text{MSE}[R^2] = k(k - 1) / [n(n - 1)]$ .    (4)
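Spelling out the arithmetic behind (4): because the population $R^2$ is zero in this case, the bias of $R^2$ is just its mean in (2), so

```latex
\begin{align*}
\text{MSE}[R^2] \;=\; \text{Var}[R^2] + \left(\text{Bias}[R^2]\right)^2
  \;=\; \frac{(k-1)(n-k)}{n(n-1)^2} + \frac{(k-1)^2}{(n-1)^2}
  \;=\; \frac{(k-1)\left[(n-k) + n(k-1)\right]}{n(n-1)^2}
  \;=\; \frac{k(k-1)}{n(n-1)} ,
\end{align*}
```

where the last step uses $(n - k) + n(k - 1) = k(n - 1)$.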
Using the results in (2) and (3), it follows immediately from (1) that:
$E[R_A^2] = 0$ ,    (5)
$\text{Var}[R_A^2] = (k - 1) / [n(n - k)]$ ,    (6)
and
$\text{MSE}[R_A^2] = \text{Var}[R_A^2] = (k - 1) / [n(n - k)]$ .    (7)
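In case that step is too quick: from (1), $R_A^2$ is a linear function of $R^2$, so

```latex
\begin{align*}
E[R_A^2]   &= 1 - \frac{n-1}{n-k}\left(1 - E[R^2]\right)
            = 1 - \frac{n-1}{n-k}\cdot\frac{n-k}{n-1} = 0 , \\[4pt]
\text{Var}[R_A^2] &= \left(\frac{n-1}{n-k}\right)^{2}\text{Var}[R^2]
            = \left(\frac{n-1}{n-k}\right)^{2}\frac{(k-1)(n-k)}{n(n-1)^{2}}
            = \frac{k-1}{n(n-k)} ,
\end{align*}
```

and (7) follows because an unbiased estimator's MSE is just its variance.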
From (5), we can see that "adjusting" the coefficient of determination eliminates the bias associated with $R^2$ (when there is no linear relationship between $y$ and the regressors). Comparing the variances of $R^2$ and $R_A^2$, we can see that the elimination of this bias comes at the expense of increased variability, because $\text{Var}[R^2] \leq \text{Var}[R_A^2]$ .
To see that this inequality holds, note that
$\text{Var}[R^2] / \text{Var}[R_A^2] = [(n - k) / (n - 1)]^2 \leq 1$ ,
because $k \geq 1$.
Comparing the MSE's of $R^2$ and $R_A^2$, given in (4) and (7), we have the following result:
$\Delta = \text{MSE}[R^2] - \text{MSE}[R_A^2] = (k - 1)[k(n - k) - (n - 1)] / [n(n - k)(n - 1)]$ ,
and because $k(n - k) - (n - 1) = kn - k^2 - n + 1 = (k - 1)(n - k - 1)$, it follows that
$\text{sgn}(\Delta) = \text{sgn}[k(n - k) - (n - 1)] = \text{sgn}[(k - 1)(n - k - 1)]$ .
The last expression is non-negative whenever $(n - k) \geq 1$. So, as long as we have at least one degree of freedom, the "adjusted" coefficient of determination has an MSE no larger than that of the usual (sample) $R^2$ when the true population $R^2$ is zero (and strictly smaller whenever $k > 1$ and $n - k > 1$).
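If you want to check that factorization symbolically, here is a tiny sketch using Python's sympy (purely a verification of the algebra, nothing new):

```python
import sympy as sp

n, k = sp.symbols('n k', positive=True)
delta_numerator = k * (n - k) - (n - 1)                      # bracketed term in Delta
print(sp.simplify(delta_numerator - (k - 1) * (n - k - 1)))  # prints 0: the two forms agree
```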
In addition, notice that both of these MSE's go to zero as $n \to \infty$ (with $k$ fixed), so both of these goodness-of-fit measures are mean-square consistent, and hence weakly consistent, estimators of the unobserved population $R^2$ (which is zero in this special case).
When there is a linear relationship between $y$ and the regressors in our model, the F-statistic noted above has a non-central F distribution; $R^2$ and $R_A^2$ can be written as functions of a non-central Beta random variable (see here); and the associated non-centrality parameter is a function of the $X$ data and the true values of all of the parameters in the regression model.
In that case, any bias, variance, and MSE comparisons between the unadjusted and adjusted coefficients of determination will depend on the values of all of these quantities, not all of which are observable, of course!
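Finally, if you'd like to see the "no relationship" case in action, here is a small Monte Carlo sketch (Python/NumPy; the choices of n, k, the seed, and the number of replications are arbitrary, purely for illustration). It checks the unbiasedness results in (2) and (5) by simulation, and compares the simulated MSE's of the two measures about the true (zero) population $R^2$:

```python
import numpy as np

rng = np.random.default_rng(2024)
n, k, n_rep = 40, 4, 20000          # k counts the intercept as one of the regressors

r2 = np.empty(n_rep)
r2_adj = np.empty(n_rep)
for i in range(n_rep):
    # No linear relationship: y is generated independently of the regressors.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = rng.normal(size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    y_star = y - y.mean()
    r2[i] = 1 - (e @ e) / (y_star @ y_star)
    r2_adj[i] = 1 - ((e @ e) / (n - k)) / ((y_star @ y_star) / (n - 1))

# The population R^2 is zero here, so the MSE of each measure is just
# the mean of its squared value across replications.
print("mean of R^2:     ", r2.mean(), " (theory:", (k - 1) / (n - 1), ")")
print("mean of adj. R^2:", r2_adj.mean(), " (theory: 0)")
print("MSE of R^2:      ", np.mean(r2 ** 2))
print("MSE of adj. R^2: ", np.mean(r2_adj ** 2))
```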
Great post! I was looking for some tips on which of the two, $R^2$ or $R^2_{adj}$, is a better estimator (say, in the MSE sense) of the population $R^2$, and here it is. Of course, the really interesting case is when at least some (if not all) of the regressors truly belong in the model, but I understood that there is no general result (one that would not depend on the true slope coefficients) for that case. On a side note, it does not seem that you have made use of the normality assumption anywhere in the derivation. If so, why include it at all? And if you do include it, a note on its irrelevance could be handy. It is my perception that too many people do not realize how little the normality assumption matters in deriving the standard results for OLS estimators. In any case, thank you for the great post!
The Normality assumption is in fact needed and used for the bias and variance expressions. See the reference to Cramer's paper in this earlier post: http://davegiles.blogspot.ca/2013/05/good-old-r-squared.html