## Wednesday, October 2, 2013

### In What Sense is the "Adjusted" R-Squared Unbiased?

In a post yesterday, I showed that the usual coefficient of determination (R2) is an upward-biased estimator of the "population R2", in the following sense. If there is really no linear relationship between y and the (non-constant) regressors in a linear multiple regression model, then the population R2 is zero, but E[R2] > 0. However, both E[R2] and Var.[R2] → 0 as n → ∞. So, R2 is a consistent estimator of the (zero-valued) population R2.

At the end of that post I posed the following questions:
"You might ask yourself, what emerges if we go through a similar analysis using the "adjusted" coefficient of determination? Is the "adjusted R2" more or less biased than R2 itself, when there is actually no linear relationship between y and the columns of X?"

We have the following linear multiple regression model:

y = Xβ + ε      ;    ε ~ N[0 , σ2In]                                                                    (1)

where X is non-random and of full rank, k, and includes an intercept variable as its first column.

Consider the null hypothesis,  H0:  "β2 = β3 = ... = βk = 0"    vs.    HA:  "Not H0", and let F be the F-statistic for testing H0. In yesterday's post, we noted that we can write

R2 = [(k - 1)F] / [(n - k) + (k - 1)F] ,

where R2 is the usual coefficient of determination.
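As a quick numerical sanity check (my own illustration, not part of the original post), this identity between R2 and the F-statistic can be verified directly. The sketch below, in Python with numpy, fits OLS to data generated under H0 and computes R2 both from the residuals and from F:

```python
import numpy as np

# Verify numerically that R^2 = (k-1)F / [(n-k) + (k-1)F]
# on one data set simulated under H0 (y unrelated to X).
rng = np.random.default_rng(42)
n, k = 50, 4                       # n observations; k columns of X, incl. the intercept
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = rng.standard_normal(n)         # H0 is true: beta_2 = ... = beta_k = 0

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
rss = resid @ resid                         # residual sum of squares
tss = ((y - y.mean()) ** 2).sum()           # total sum of squares
r2 = 1.0 - rss / tss

# F-statistic for testing H0: beta_2 = ... = beta_k = 0
F = (r2 / (k - 1)) / ((1.0 - r2) / (n - k))
r2_from_F = (k - 1) * F / ((n - k) + (k - 1) * F)

print(abs(r2 - r2_from_F))         # agrees to machine precision
```

The two expressions agree to rounding error, as the algebra requires.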

Now, recall that the "adjusted" R2 can be written as:

R*2 = R2 - (1 - R2)[(k - 1) / (n - k)] = [R2(n - 1) / (n - k)] - [(k - 1) / (n - k)].       (2)

From the previous post, if H0 is true, then:

E[R2] = [(k - 1) / (n - 1)]     and     Var.[R2] = [2(k - 1)(n - k)] / [(n - 1)2(n + 1)] .

(These are just the mean and variance of the Beta[(k - 1)/2 , (n - k)/2] distribution that R2 follows when H0 is true.)

Immediately, it follows from (2) that E[R*2] = 0, and Var.[R*2] = [2(k - 1)] / [(n - k)(n + 1)]. (The adjusted R2 in (2) is a linear function of R2, so its variance is just [(n - 1) / (n - k)]2 times Var.[R2].)

So, if there is no linear relationship between y and X (and the "population R2" is zero), the adjusted R2 is both an unbiased and consistent estimator of that population measure.
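To illustrate these results (again, my own sketch rather than anything from the original post), a small Monte Carlo experiment with y generated independently of a fixed X shows the average R2 settling near (k - 1)/(n - 1), while the average adjusted R2 settles near zero:

```python
import numpy as np

# Monte Carlo under H0: y is pure noise, unrelated to the fixed X.
# The mean of R^2 should be near (k-1)/(n-1); the mean of the
# adjusted R^2 should be near zero.
rng = np.random.default_rng(123)
n, k, reps = 25, 5, 20000
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix (X held fixed across replications)

r2s = np.empty(reps)
for i in range(reps):
    y = rng.standard_normal(n)           # H0 true: beta_2 = ... = beta_k = 0
    yhat = H @ y
    tss = ((y - y.mean()) ** 2).sum()
    rss = ((y - yhat) ** 2).sum()
    r2s[i] = 1.0 - rss / tss

# Adjusted R^2, using the second form in equation (2)
adj = r2s * (n - 1) / (n - k) - (k - 1) / (n - k)

print(r2s.mean())    # close to (k-1)/(n-1) = 4/24
print(adj.mean())    # close to 0
```

With n = 25 and k = 5, the simulated mean of R2 lands near 4/24 ≈ 0.167, and the mean adjusted R2 near zero, in line with the expressions above.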

So, adjusting the usual R2 for degrees of freedom does more than just penalize the addition of regressors: in this setting, it delivers an exactly unbiased estimator of the population R2.
