
Thursday, May 15, 2014

More on the Properties of the "Adjusted" Coefficient of Determination

A while back I wrote about the fact that R2 (the coefficient of determination for a linear regression model) is a sample statistic, and as such it has a sampling distribution. In that post, and in follow-up posts here and here, I discussed some of the properties of that sampling distribution, including the mean and variance of R2 in certain circumstances.

Let's take that discussion a step further by comparing the MSE's of R2 and its "adjusted" counterpart.

First, let's be clear about the framework for this discussion, and the assumptions that I'll be using. We'll be dealing with a linear regression model, with a full-rank non-random regressor matrix (that includes an intercept), and with errors that are serially independent, homoskedastic, and normally distributed. That is:

                            y = Xβ + ε    ;   ε ~ N[0 , σ²In] .

The coefficient of determination can be expressed in various (equivalent) ways. Let's write it as:

                           R2 = 1 - (e'e) / (y*'y*) ,

where y* is the y vector, but expressed as deviations about the sample mean; and e is the OLS residual vector, e = y - Xb, where b = (X'X)⁻¹X'y.
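As a minimal sketch of this setup and of the formula above (numpy, the seed, n, k, and the coefficient values below are all arbitrary illustrative choices of mine, not anything from the post itself):

import numpy as np

rng = np.random.default_rng(42)
n, k = 40, 4                                   # sample size; number of regressors, intercept included
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])   # fixed, full-rank regressor matrix
beta = np.array([1.0, 0.5, -0.3, 0.2])         # arbitrary coefficient values
eps = rng.standard_normal(n)                   # N(0, I_n) errors (sigma = 1)
y = X @ beta + eps

b = np.linalg.solve(X.T @ X, X.T @ y)          # OLS: b = (X'X)^(-1) X'y
e = y - X @ b                                  # residual vector
y_star = y - y.mean()                          # y in deviations-from-mean form
R2 = 1.0 - (e @ e) / (y_star @ y_star)
print(R2)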

The "adjusted" R2 is:

                          RA2 = 1 - [(e'e ) / (n - k)] / [(y*'y*) / (n - 1)],

where k is the number of regressors (including the intercept).

Each of the sums of squares in the original R2 formula is divided by the appropriate degrees of freedom. You'll recall the following results:
  • RA2 ≤ R2.
  • Unlike R2, RA2 can take negative values.
  • Although R2 cannot decrease if we add a regressor to the model, RA2 will decrease if the (usual) t-statistic associated with that regressor is less than one in absolute value. (See here and here.) A small numerical check of this last point appears just below.  
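Here is that check, together with a direct computation of RA2 from its definition (again, the data, seed, and design are made-up choices of my own, and plain numpy is simply my tool of choice here):

import numpy as np

def ols_r2(y, X):
    # Return (coefficients, residuals, R2, adjusted R2) for an OLS fit of y on X.
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    y_star = y - y.mean()
    R2 = 1.0 - (e @ e) / (y_star @ y_star)
    R2_adj = 1.0 - ((e @ e) / (n - k)) / ((y_star @ y_star) / (n - 1))
    return b, e, R2, R2_adj

rng = np.random.default_rng(123)
n = 30
X1 = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X1 @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)

_, _, R2_a, R2adj_a = ols_r2(y, X1)

# Add one more (irrelevant) regressor and re-fit.
X2 = np.column_stack([X1, rng.standard_normal(n)])
b2, e2, R2_b, R2adj_b = ols_r2(y, X2)

# t-statistic of the added regressor in the enlarged model.
s2 = (e2 @ e2) / (n - X2.shape[1])
t_new = b2[-1] / np.sqrt(s2 * np.linalg.inv(X2.T @ X2)[-1, -1])

print(R2_b >= R2_a)                                 # True: R2 never falls when a regressor is added
print((R2adj_b < R2adj_a) == (abs(t_new) < 1.0))    # True: adjusted R2 falls exactly when |t| < 1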
Now, note that the relationship between R2 and RA2 can be written in various ways, including:

                         RA2 = R2 - (1 - R2)(k - 1) / (n - k)                                                  (1)
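Identity (1) is just algebra on the two definitions, using the fact that e'e / (y*'y*) = 1 - R2. A quick symbolic sanity check (sympy is purely my own choice of tool):

import sympy as sp

R2, n, k = sp.symbols('R2 n k')

RA2_def = 1 - (1 - R2) * (n - 1) / (n - k)       # adjusted R2, straight from its definition
RA2_eq1 = R2 - (1 - R2) * (k - 1) / (n - k)      # the right-hand side of (1)
print(sp.simplify(RA2_def - RA2_eq1))            # 0, so the two expressions coincide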

In one of the earlier, related posts I showed that the following results hold in the special situation where there is no linear relationship between y and the (non-intercept) regressors:
                        E[R2] = (k - 1) / (n - 1)                                                                  (2)

                       Var.[R2] = 2(k - 1)(n - k) / [(n + 1)(n - 1)²]                                 (3)

These results were obtained by exploiting the relationship between R2 and the F-statistic that we use to test the joint significance of the regressors. Notice that when the null hypothesis for this F-test is true (and there is no linear relationship), the population coefficient of determination is zero.
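As a rough Monte Carlo check of (2) and (3) under this null (the design, the seed, and the number of replications below are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(2014)
n, k, n_reps = 25, 4, 200_000
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # fixed across replications
H = X @ np.linalg.inv(X.T @ X) @ X.T                                # "hat" matrix

R2_draws = np.empty(n_reps)
for r in range(n_reps):
    y = rng.standard_normal(n)            # all slope coefficients zero, so y is pure noise
    e = y - H @ y                         # OLS residuals
    y_star = y - y.mean()
    R2_draws[r] = 1.0 - (e @ e) / (y_star @ y_star)

print(R2_draws.mean(), (k - 1) / (n - 1))                                # compare with (2)
print(R2_draws.var(), 2 * (k - 1) * (n - k) / ((n + 1) * (n - 1)**2))    # compare with (3)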

So, we see that the usual sample R2 is an upwards-biased estimator of the population R2 (in this special case), and its MSE is:

                      MSE[R2] = (k² - 1) / (n² - 1) .                                                  (4)
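The MSE in (4) is just the squared bias implied by (2), plus the variance in (3). A one-line symbolic check (again, sympy is simply my own sanity-checking tool):

import sympy as sp

n, k = sp.symbols('n k', positive=True)
bias = (k - 1) / (n - 1)                               # E[R2] - 0, from (2)
var = 2 * (k - 1) * (n - k) / ((n + 1) * (n - 1)**2)   # from (3)
print(sp.factor(bias**2 + var))                        # (k - 1)(k + 1) / [(n - 1)(n + 1)], i.e. (4)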

Using the results in (2) and (3), it follows immediately from (1) that:

                      E[RA2] =  0 ,                                                                                   (5)

                      Var.[RA2] =  2(k - 1) / [(n + 1)(n - k)]    ,                                    (6)
and
                      MSE[RA2] =  Var.[RA2] =  2(k - 1) / [(n + 1)(n - k)] .                           (7)
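Because (1) makes RA2 a linear function of R2, results (5) to (7) follow from (2) and (3) by routine expectation and variance calculations. A symbolic sketch of that step (not the post's own derivation):

import sympy as sp

n, k = sp.symbols('n k', positive=True)
E_R2 = (k - 1) / (n - 1)                               # (2)
V_R2 = 2 * (k - 1) * (n - k) / ((n + 1) * (n - 1)**2)  # (3)

# From (1): RA2 = a*R2 + c, with a = (n - 1)/(n - k) and c = -(k - 1)/(n - k).
a = (n - 1) / (n - k)
c = -(k - 1) / (n - k)

print(sp.simplify(a * E_R2 + c))     # 0, as in (5)
print(sp.simplify(a**2 * V_R2))      # 2(k - 1) / [(n + 1)(n - k)], as in (6) and (7)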

From (5), we can see that "adjusting" the coefficient of determination eliminates the bias associated with R2 (when there is no linear relationship between y and the regressors). Comparing the variances of R2 and RA2, we can see that the elimination of this bias comes at the expense of increased variability, because  Var.[R2] ≤  Var.[RA2] . 

To see that this inequality holds, note that

                    Var.[R2] / Var.[RA2] = [(n - k) / (n - 1)]² ≤ 1.
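That ratio drops straight out of (3) and (6); for example:

import sympy as sp

n, k = sp.symbols('n k', positive=True)
V_R2 = 2 * (k - 1) * (n - k) / ((n + 1) * (n - 1)**2)   # (3)
V_RA2 = 2 * (k - 1) / ((n + 1) * (n - k))               # (6)
print(sp.simplify(V_R2 / V_RA2))                        # (n - k)**2 / (n - 1)**2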

Comparing the MSE's of R2 and RA2, given in (4) and (7), we have the following result:
           
                    Δ = MSE[R2] - MSE[RA2] = (k - 1)[(k + 1)(n - k) - 2(n - 1)] / [(n - 1)(n + 1)(n - k)] ,

and

     sgn(Δ) = sgn[(k + 1)(n - k) - 2(n - 1)] = sgn[(k - 1)(n - k - 2)].

This expression is non-negative whenever (n - k) ≥ 2, and so as long as we have at least two degrees of freedom, the "adjusted" coefficient of determination has an MSE that is no larger than that of the usual (sample) R2, when the true population R2 is zero.
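Equivalently, Δ = (k - 1)²(n - k - 2) / [(n - 1)(n + 1)(n - k)]. That factorization can be checked mechanically (sympy again, purely as my own sanity check):

import sympy as sp

n, k = sp.symbols('n k', positive=True)
delta = (k**2 - 1) / (n**2 - 1) - 2 * (k - 1) / ((n + 1) * (n - k))     # MSE[R2] - MSE[RA2]
target = (k - 1)**2 * (n - k - 2) / ((n - 1) * (n + 1) * (n - k))       # the factorized form
print(sp.simplify(delta - target))                                      # 0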

In addition, notice that both of these MSE's go to zero as n increases, so both of these goodness-of-fit measures are mean-square consistent, and hence weakly consistent, estimators of the population R2.
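Both MSE expressions do indeed vanish as n grows with k held fixed; for instance:

import sympy as sp

n, k = sp.symbols('n k', positive=True)
print(sp.limit((k**2 - 1) / (n**2 - 1), n, sp.oo))               # 0: MSE[R2] -> 0
print(sp.limit(2 * (k - 1) / ((n + 1) * (n - k)), n, sp.oo))     # 0: MSE[RA2] -> 0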

When there is a linear relationship between y and the regressors in our model, the F-statistic noted above has a non-central F distribution; R2 and RA2 can be written as functions of a non-central Beta statistic (see here); and the associated non-centrality parameter is a function of the X data and the true values of all of the parameters in the regression model.

In this case any bias, variance, and MSE comparisons between the unadjusted and adjusted coefficients of determination will depend on the values of all of these quantities, not all of which are observable, of course!
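To get a feel for the non-null case, here is a small simulation sketch (the design, slope values, and replication count are entirely arbitrary choices of mine); the point is simply that the means of R2 and RA2, and hence any bias or MSE comparison, now move with the unknown coefficients:

import numpy as np

def r2_pair(y, X):
    # Return (R2, adjusted R2) for an OLS fit of y on X (X includes an intercept column).
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    y_star = y - y.mean()
    R2 = 1.0 - (e @ e) / (y_star @ y_star)
    return R2, 1.0 - (1.0 - R2) * (n - 1) / (n - k)

rng = np.random.default_rng(7)
n, k, n_reps = 25, 4, 50_000
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])

for slope in (0.1, 0.5):                                  # two arbitrary signal strengths
    beta = np.r_[1.0, np.full(k - 1, slope)]
    draws = np.array([r2_pair(X @ beta + rng.standard_normal(n), X)
                      for _ in range(n_reps)])
    print(slope, draws.mean(axis=0))                      # mean R2 and mean RA2 shift with beta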


© 2014, David E. Giles

2 comments:

  1. Great post! I was looking for some tips on which of the two, $R^2$ or $R^2_{adj}$ is a better estimator (say, in MSE sense) of the population $R^2$, and here it is. Of course, the really interesting case is when at least some (if not all) of the regressors truly belong in the model, but I understood that there is no general result (that would not depend on the true slope coefficients) for that case. On a side note, it does not seem you have made use of the normality assumption anywhere in the derivation. If so, why include it at all? And if you include it, a note on its irrelevance could be handy. It is my perception that too many people do not realize how little the normality assumption matters in deriving the standard results for OLS estimators. In any case, thank you for the great post!

    Reply: The Normality assumption is in fact needed and used for the bias and variance expressions. See the reference to Cramer's paper in this earlier post: http://davegiles.blogspot.ca/2013/05/good-old-r-squared.html

