Thursday, April 10, 2014

In a post last year I discussed the conditions under which the "adjusted" coefficient of determination (RA2) will increase or decrease when regressors are added to (or deleted from) a regression model. Without going over the full discussion again, here is one of the key results:

Adding a group of regressors to the model will increase (decrease) RA2 if the F-statistic for testing that their coefficients are all zero is greater (less) than one in value. RA2 is unchanged if that F-statistic is exactly equal to one.

A few days ago, "Zeba" reminded me that I had promised to post a simple proof of this result, but still hadn't done so. Shame on me! The proof is given below. As a bonus, I've proved a more general result - we don't have to impose "zero" restrictions on some of the coefficients; any exact linear restrictions will do.

Let's take a look at the proof.

The model we're going to look at is the standard, k-regressor, linear multiple regression model:

y = Xβ + ε    .                                                                                     (1)

We have n observations in our sample.

The result that follows is purely algebraic, not statistical, so I don't need to assume anything in particular about the errors in the model, and the regressors can be random. To ensure that the coefficient of determination is uniquely defined, I'll assume that the model includes an intercept term.

The adjusted coefficient of determination when model (1) is estimated by OLS is

RA2 = 1 - [e'e / (n - k)] / [(y*'y*) / (n - 1)] ,                                         (2)

where e is the OLS residual vector, and y* is the y vector, but with each element expressed as a deviation from the sample mean of the y data.
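As a quick sanity check on equation (2), here's a short numpy sketch that computes RA2 directly from the OLS residuals and the centered y data. The sample size, coefficient values, and seed are all made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 4  # n observations, k regressors (including the intercept)

# Simulated data for the model y = X beta + eps (hypothetical values)
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 0.5, -0.3, 0.2])
y = X @ beta + rng.standard_normal(n)

# OLS residual vector e, and y* = y in deviations from its sample mean
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols
y_star = y - y.mean()

# Adjusted R-squared, exactly as in equation (2)
RA2 = 1.0 - (e @ e / (n - k)) / (y_star @ y_star / (n - 1))
print(RA2)
```

Equation (2) is, of course, just the familiar identity RA2 = 1 - (1 - R2)(n - 1)/(n - k), written in terms of the residual and total sums of squares.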

Now consider J independent exact linear restrictions on the elements of β, namely Rβ = r, where R is a known non-random (J x k) matrix of rank J; and r is a known non-random (J x 1) vector. The F-statistic that we would use to test the validity of these restrictions can be written as:

F = [(eR'eR - e'e) / J] / [e'e / (n - k)] ,                                                   (3)

where eR is the residual vector when the restrictions on β are imposed, and the model is estimated by RLS.
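For concreteness, here's a numpy sketch of equation (3): it computes the RLS estimator via the usual closed form, b_R = b - (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(Rb - r), and then forms F from the two residual sums of squares. The particular restrictions (the last two coefficients equal zero) and all data values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, J = 60, 4, 2  # sample size, regressors (incl. intercept), restrictions

X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([1.0, 0.4, 0.0, 0.0]) + rng.standard_normal(n)

# Restrictions R beta = r: here, the last two coefficients are zero
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(J)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                                       # unrestricted OLS
A = R @ XtX_inv @ R.T
b_R = b - XtX_inv @ R.T @ np.linalg.solve(A, R @ b - r)     # RLS estimator

e = y - X @ b        # unrestricted residuals
e_R = y - X @ b_R    # restricted residuals

F = ((e_R @ e_R - e @ e) / J) / (e @ e / (n - k))           # equation (3)
print(F)
```

The residual-based form in (3) agrees exactly with the equivalent Wald form of the F-statistic, (Rb - r)'[R(X'X)⁻¹R']⁻¹(Rb - r) / (J s²), which is a handy way to check the computation.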

In the latter case, the adjusted coefficient of determination is

RAR2 = 1 - [eR'eR / (n - k + J)] / [(y*'y*) / (n - 1)] .                      (4)

From equation (3), F ≥ 1 if and only if

(n - k) eR'eR ≥ (n - k + J) e'e .                                                                   (5)

From (2) and (4), RA2 ≥ RAR2 if and only if

(n - k) eR'eR  ≥ (n - k + J) e'e.

But this is just the condition in (5).

So, we have the following result:

Imposing a set of exact linear restrictions on the coefficients of a linear regression model will decrease (increase) the adjusted coefficient of determination if the F-statistic for testing the validity of those restrictions is greater (less) than one in value. If this statistic is exactly equal to one, the adjusted coefficient of determination will be unchanged.

Notice that the result quoted at the beginning of this post is a special case of this result, in which the restrictions are all "zero" restrictions. Recalling that the square of a t-statistic with v degrees of freedom is just an F-statistic with 1 and v degrees of freedom, the other principal result given in the earlier post is also obviously a special case, with just one zero restriction:

Adding a regressor will increase (decrease) RA2 if the absolute value of the t-statistic associated with that regressor is greater (less) than one. RA2 is unchanged if that absolute t-statistic is exactly equal to one.
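This single-restriction case can be checked numerically too. The sketch below adds one (pure-noise, hypothetical) regressor to a simple model, computes its t-statistic, and compares the adjusted R-squared values with and without it:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)
z = rng.standard_normal(n)  # candidate extra regressor (pure noise here)

def adj_r2(X, y):
    """Adjusted R-squared as in equation (2)."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    ys = y - y.mean()
    return 1 - (e @ e / (n - k)) / (ys @ ys / (n - 1))

# Fit the augmented model and get the t-statistic on the added regressor
X_big = np.column_stack([X, z])
b, *_ = np.linalg.lstsq(X_big, y, rcond=None)
e = y - X_big @ b
s2 = e @ e / (n - X_big.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X_big.T @ X_big)[-1, -1])
t = b[-1] / se

print(abs(t) > 1, adj_r2(X_big, y) > adj_r2(X, y))  # the two flags agree
```

Since t² is an F-statistic with 1 and (n - k) degrees of freedom, |t| > 1 is exactly the F > 1 condition from the general result, so the two flags must always match.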