Saturday, August 3, 2013

Unbiased Model Selection Using the Adjusted R-Squared

The coefficient of determination (R2), and its "adjusted" counterpart, really don't impress me much! I often tell students that this statistic is one of the last things I look at when appraising the results of estimating a regression model.

Previously, I've had a few things to say about this measure of goodness-of-fit  (e.g., here and here). In this post I want to say something positive, for once, about "adjusted" R2. Specifically, I'm going to talk about its use as a model-selection criterion.

I decided to prepare this particular post as a result of some comments/questions that came from one particular reader of my recent piece, Information Criteria Unveiled. The question was, "what do we mean when we talk about a model-selection criterion being unbiased?"

I think I finally responded adequately, but I promised to put together a follow-up post with more information. A good way of illustrating the concept in question is to see why choosing between alternative regression model specifications, by maximizing the adjusted R2, can be described as an "unbiased" model-selection criterion. What follows is due, originally, to Theil (1957).

Suppose we have two linear regression models, each explaining the same dependent variable, y:

       M1:    y = X1β1 + ε1

       M2:    y = X2β2 + ε2   .

In each case, the same sample of n observations is available for y. Suppose that X1 and X2 are each non-stochastic and of full column rank (k1 and k2, respectively). Finally, suppose that ε1 has a zero mean, and is serially independent and homoskedastic (with a variance of σ12).

Notice that no assumptions are being made about the error term in M2, namely ε2. So, what we are doing here is setting up M1 to be the data-generating process (or "true model"), while M2 is a "false model".

Now, recall that the adjusted coefficient of determination is the quantity,

       RA2 = 1 - [e'e / (n - k)] / sy2  ,

where sy2 is the (unbiased) sample variance of the y data, and e is the OLS residual vector.

Clearly, for a given sample of y data, RA2 increases monotonically as [e'e /(n - k)] decreases. So, choosing the model with the larger RA2 value is equivalent to choosing the model with the smaller value of s2 = [e'e /(n - k)].
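To make the equivalence concrete, here is a small NumPy sketch (the data and variable names are my own illustration, not from the post). It computes RA2 directly from the definition above, and also via the more familiar textbook formula, 1 - (1 - R2)(n - 1)/(n - k); the two agree exactly:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 3

# Illustrative data: an intercept plus two regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS coefficients
e = y - X @ b                               # OLS residual vector
s2 = (e @ e) / (n - k)                      # s^2 = e'e / (n - k)
sy2 = np.var(y, ddof=1)                     # unbiased sample variance of y

R2_adj = 1.0 - s2 / sy2                     # the definition used in the post

# Textbook version, for comparison
tss = np.sum((y - y.mean()) ** 2)
R2 = 1.0 - (e @ e) / tss
R2_adj_textbook = 1.0 - (1.0 - R2) * (n - 1) / (n - k)

print(R2_adj, R2_adj_textbook)
```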

Let's focus on this latter quantity, in the case of M2:

        s22 = (e2'e2) / (n - k2) = (y'P2y) / (n - k2)  ;    where  P2 = I - X2(X2'X2)-1X2' .

So,
       (n - k2)s22 = (X1β1 + ε1)'P2(X1β1 + ε1)                                                              (*)
                          = β1'X1'P2X1β1 + 2β1'X1'P2ε1 + ε1'P2ε1
                          ≥ 2β1'X1'P2ε1 + ε1'P2ε1 ,

because P2 is symmetric and idempotent, so that β1'X1'P2X1β1 = (P2X1β1)'(P2X1β1) ≥ 0.

Then, using the results that E[ε1] = 0,  and E[ε1ε1'] = σ12In, we have:

       E[(n - k2)s22] ≥ E[ε1'P2ε1]
                               = E{tr.[ε1'P2ε1]}
                               = tr.(P2E[ε1ε1'])
                               = σ12tr.(P2)
                               = σ12(n - k2),

since the trace of the idempotent matrix P2 is (n - k2).
So,
                    E[s22] ≥ σ12 = E[s12] .

In other words, if we choose the smaller s2 (or, larger RA2), we'll select the true model, M1, on average.

It's in this particular sense that the "maximize RA2" rule is an unbiased model-selection criterion.
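A quick Monte Carlo sketch makes the point (the design is my own illustration, not from the post): generate data from M1, fit both the true model and a false model built on a different regressor, and compare the average values of s2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 40, 5000
sigma1 = 1.0                               # std. dev. of the true error, so sigma1^2 = 1

x1 = rng.normal(size=n)                    # regressor in the true model, M1
x2 = 0.5 * x1 + rng.normal(size=n)         # a different regressor, for the false model M2
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])

def s2(X, y):
    """Unbiased error-variance estimate, e'e / (n - k)."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (e @ e) / (n - X.shape[1])

s2_true, s2_false = [], []
for _ in range(reps):
    y = X1 @ np.array([1.0, 2.0]) + rng.normal(scale=sigma1, size=n)  # the DGP is M1
    s2_true.append(s2(X1, y))
    s2_false.append(s2(X2, y))

# E[s2] = sigma1^2 for the true model; E[s2] >= sigma1^2 for the false one
print(np.mean(s2_true), np.mean(s2_false))
```

The average of s2 under the true model sits right at σ12 = 1, while the false model's average is well above it, which is exactly the inequality derived above.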

Now, there are (at least) three things to notice about the derivation of this result:

  1. One of the models under consideration had to be the true model, in the sense that its error term was "well-behaved".
  2. If we'd replaced y with (X2β2 + ε2) at line (*), this would have been correct, but totally unhelpful, as we know nothing about the properties of ε2.
  3. If the columns of X2 include all of the columns of X1, then P2X1 = 0, and we'd end up with the result that E[s22] = σ12 = E[s12] . Of course, in this case of "nested" models, presumably we'd select between them by testing the restrictions that make M2 collapse to M1.
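Point 3 is easy to verify numerically. In this sketch (again, illustrative data of my own), X2 contains every column of X1, so the matrix P2 annihilates anything in the column space of X2, including X1 itself:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([X1, rng.normal(size=n)])     # X2 nests all columns of X1

# P2 = I - X2 (X2'X2)^(-1) X2'
P2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T

# P2 projects off the column space of X2; since X1's columns lie in that
# space, P2 @ X1 is (numerically) the zero matrix
print(np.max(np.abs(P2 @ X1)))
```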

This basic result has been extended in several different directions by various authors over the years. For example, Kloek (1975) shows that minimizing s2 is a strongly consistent model-selection criterion, as long as the correct model specification is one of those being considered; and Schmidt (1974) shows that it is a weakly consistent selection criterion when the models have autocorrelated errors. Giles and Smith (1977) prove that this selection rule retains its property of unbiasedness when there are exact linear restrictions on the models' parameters; and it retains its property of weak consistency if, in addition, the models' errors are autocorrelated.

Giles and Sturmfels (1979) show that Theil's "unbiased model selection" result holds if the regressors are perfectly collinear, and a generalized inverse is used to estimate "estimable functions" of the parameters of the models being compared. Finally, Giles and Low (1981) prove that minimizing s2 is a weakly consistent model-selection rule if models with random regressors are estimated using the method of Instrumental Variables.

So, indeed, there is some basis for choosing among competing regression models on the basis of a large (adjusted) R2 value.

As you might have guessed, though, it's not all good news!

Apart from the very strong requirement that the true model specification has to be among those considered, getting things right on average isn't necessarily much comfort! The situation is analogous to that of having an unbiased estimator - which may have a very large variance.

Schmidt (1973) and Ebberler (1975) have explored the probability of selecting the true model by using the "maximize adjusted R2" rule. The distribution of RA2 depends (among other things) on the regressors that appear in the models - see my post here. So, only illustrative results can be obtained for these probabilities. The results are not particularly encouraging. As you'd no doubt guess, you can select the correct model on average, but the probability of making a mistake can be quite high!
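To see how high that mistake probability can be, here is a small simulation in the spirit of those papers (the design is my own illustration): the false model uses a close substitute for the true regressor, the signal is weak, and the "maximize adjusted R2" rule picks the wrong model a substantial fraction of the time:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 25, 4000

x1 = rng.normal(size=n)                    # the true regressor
x2 = x1 + 0.7 * rng.normal(size=n)         # a close substitute, used by the false model
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])

def s2(X, y):
    """Unbiased error-variance estimate, e'e / (n - k)."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (e @ e) / (n - X.shape[1])

correct = 0
for _ in range(reps):
    y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # weak signal from M1
    correct += s2(X1, y) < s2(X2, y)       # smaller s2  <=>  larger adjusted R2

print(correct / reps)   # well below 1: the rule errs a non-trivial share of the time
```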


References

Ebberler, D. H., 1975. On the probability of correct model selection using the maximum (adjusted) R2 choice criterion. International Economic Review, 16, 516-520.

Giles, D. E. A. and C. K. Low, 1981. Choosing between alternative structural equations estimated by instrumental variables. Review of Economics and Statistics, 63, 476-478.

Giles, D. E. A. and R. G. Smith, 1977. A note on the minimum error variance rule and the restricted regression model. International Economic Review, 18, 247-251.

Giles, D. E. A. and B. M. Sturmfels, 1979. Choosing between rank-deficient restricted models. New Zealand Economic Papers, 13, 202-210.

Kloek, T., 1975. Note on a large-sample result in specification analysis. Econometrica, 43, 933-936.

Schmidt, P., 1973. Calculating the power of the minimum standard error choice criterion. International Economic Review, 14, 253-255.

Schmidt, P., 1974. A note on Theil's minimum standard error criterion when the disturbances are autocorrelated. Review of Economics and Statistics, 56, 122-123.

Theil, H., 1957. Specification errors and the estimation of econometric relationships. Review of the International Statistical Institute, 25, 41-51.


© 2013, David E. Giles

2 comments:

  1. Item 3 cannot be emphasized enough. Otherwise, the statement that it is a consistent procedure can be rather misleading because, without that condition, it would be inconsistent with the definition of consistency in more recent literature (see Shao (1997), for example).
