
Wednesday, May 22, 2013

Minimum MSE Estimation of a Regression Model

Students of econometrics encounter the Gauss-Markov Theorem (GMT) at a fairly early stage - even if they don't see a formal proof to begin with. This theorem deals with a particular property of the OLS estimator of the coefficient vector, β, in the following linear regression model:


                        y = Xβ + ε  ;  ε ~ [0 , σ²I_n] ,

where X is (n x k), non-random, and of rank k.

The GMT states that among all linear estimators of β that are also unbiased estimators, the OLS estimator of β is most efficient. That is, OLS is the BLU estimator for β.


A "linear" estimator is simply one that can be written as a linear function of the random sample data. Here, these are the observed values of the elements of y. We can write the OLS estimator of β as b = (X'X)-1X'y = Ay; where A = (X'X)-1X' is a non-random matrix. So, each element of b is a linear combination of the elements of y, with weights that aren't random. It's a linear estimator.

Notice that the GMT holds without having to assume that the errors in the model are Normally distributed. (If they do happen to be Normal, then the OLS estimator of β is "Best Unbiased" - that is, we don't have to restrict ourselves to the family of linear estimators.) Also, if ε has a non-scalar (but known) covariance matrix, then the GLS estimator of β is BLU.

When I introduce students to the GMT I usually emphasize that it's a result that's of only limited interest. There are lots of interesting and important estimators that aren't linear estimators. Moreover, why on earth would we want to constrain ourselves to considering only estimators that are unbiased? Econometricians use non-linear and biased estimators all of the time. For example, most Instrumental Variables and GMM estimators fall into this category.

Putting this important point aside for the moment, it's also interesting to ask students to derive the linear minimum MSE estimator of β in the above model. That is, think of the family of estimators of β that are of the form β* = Ay, where A is a non-random (k x n) matrix, and determine what choice of A leads to the estimator with the smallest MSE. Strictly speaking, as β* and β are vectors, we need to think of MSE matrices, defined as MSE = V + (Bias)(Bias)', where V is the covariance matrix of the estimator and Bias is the (k x 1) bias vector. This quantity can be made scalar by taking trace(MSE).
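
Written out explicitly, for any such estimator β* of β,

            MSE(β*) = E[(β* - β)(β* - β)'] = V(β*) + Bias(β*) Bias(β*)' ,

and trace(MSE) is just the sum of its diagonal elements, Σ E[(β*_i - β_i)²] - the scalar MSEs of the individual coefficient estimators added up.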

If you want to try and answer this question, you need to know a bit of matrix differential calculus - specifically, you need to know how to differentiate functions of matrices with respect to the elements of a matrix. There are plenty of books on this, but the free download from Steven Nydick is all you need for this particular problem.

However, to make life a little easier, but without altering the message, we can simplify the model to the case where k = 1:

               y_i = βx_i + ε_i      ;    ε_i ~ i.i.d. [0 , σ²]

and consider the family of estimators,

              β* = [a_1y_1 + a_2y_2 + ......... + a_ny_n] ,

where the a_i weights are non-random.

Immediately, E[β*] = Σ(a_iβx_i) = βΣ(a_ix_i).

So,       Bias[β*] = E[β*] - β = β[Σ(a_ix_i) - 1].

In addition, given the i.i.d. assumption for the errors,

           Var.[β*] = Σ(a_i²Var.(y_i)) = σ²Σ(a_i²) .

So,      M = MSE[β*] = σ²Σ(a_i²) + β²[Σ(a_ix_i) - 1]².
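
Before grinding through the algebra, a quick and purely illustrative numerical check is possible: minimize M directly over the weights for one hypothetical choice of x, β and σ², and compare the implied value of Σ(a_ix_i) with the shrinkage factor, β² / [(σ²/Σ(x_i²)) + β²], that the derivation below delivers. (The specific numbers here are my own, not from the post.)

```python
# Numerical sanity check (illustrative values only): minimize
# M(a) = σ²Σ(a_i²) + β²[Σ(a_i x_i) - 1]² over the weights a_1,...,a_n.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=20)
beta, sigma2 = 2.0, 1.5                      # assumed "true" parameter values

def M(a):
    return sigma2 * np.sum(a**2) + beta**2 * (np.sum(a * x) - 1.0)**2

a_opt = minimize(M, x0=np.zeros_like(x)).x   # the MSE-minimizing weights

shrink_numeric = np.sum(a_opt * x)
shrink_formula = beta**2 / (sigma2 / np.sum(x**2) + beta**2)
print(shrink_numeric, shrink_formula)        # agree up to optimization tolerance
```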

If we differentiate M, partially, with respect to each of the a_j's, and set these derivatives to zero, we get:

           2σ²a_j + 2β²x_j[Σ(a_ix_i)] = 2β²x_j    ;   j = 1, 2, ....., n.        (1)

Then, dividing by 2 and multiplying each side of each of these "n" equations, (1), by x_j, we get:

           σ²a_jx_j + β²x_j²[Σ(a_ix_i)] = β²x_j²    ;     j = 1, 2, ....., n .

Now, sum over all j:

           σ²Σ(a_jx_j) + β²Σ(x_j²)Σ(a_ix_i) = β²Σ(x_j²)    .                      (2)

Similarly, dividing each equation in (1) by 2, multiplying each side by y_j, and summing over all j, we get:

          σ²Σ(a_jy_j) + β²Σ(x_jy_j)Σ(a_ix_i) = β²Σ(x_jy_j).                    (3)

Notice that (3) can be re-written as:

          σ²β* + β²Σ(x_jy_j)Σ(a_ix_i) = β²Σ(x_jy_j) .                        (4)

It's now a simple matter to solve equations (2) and (4) for Σ(a_ix_i) and β*. The solution for the latter is:

           β* = {β² / [(σ² / Σ(x_i²)) + β²]} b ,

where b = [Σ(x_iy_i) / Σ(x_i²)] is the OLS estimator of β.
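
For completeness, the intermediate algebra is as follows. Because Σ(a_jx_j) = Σ(a_ix_i), equation (2) gives

           Σ(a_ix_i) = β²Σ(x_i²) / [σ² + β²Σ(x_i²)] ,

and substituting this into (4),

           σ²β* = β²Σ(x_iy_i)[1 - Σ(a_ix_i)] = σ²β²Σ(x_iy_i) / [σ² + β²Σ(x_i²)] ,

which rearranges to the expression for β* given above.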

There are three important things to notice about β*:
  1. It isn't really an "estimator" because it's a function of the unknown parameters, β and σ². It can't actually be used!
  2. |β*| < |b|. This MMSE "estimator" of β "shrinks" the OLS estimator towards the origin.
  3. The "estimator", β*, is both non-linear and biased.
Non-linear, biased, shrinkage estimators - ones that are genuine estimators and don't involve the unknown parameters - are often used in regression analysis. Examples are the Stein, James-Stein, Ridge, and Bayes estimators. The last of these can be especially appealing, as Bayes estimators allow us to shrink the OLS estimator towards a point that reflects our prior beliefs - not necessarily towards the origin.
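
For instance, here is a minimal sketch (my own, not from the post) of the ridge estimator in this single-regressor setting: it shrinks OLS towards zero using a user-chosen penalty λ in place of the unknown β and σ², which is what makes it a genuine, feasible estimator.

```python
# Ridge shrinkage in the one-regressor (no intercept) model: a feasible
# shrinkage estimator, since λ is chosen by the user rather than depending
# on the unknown β and σ².
import numpy as np

def ols(x, y):
    return np.sum(x * y) / np.sum(x**2)

def ridge(x, y, lam):
    # b_ridge = Σ(x_i y_i) / (Σ(x_i²) + λ); reduces to OLS when λ = 0
    return np.sum(x * y) / (np.sum(x**2) + lam)

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)            # illustrative data with β = 2

print(ols(x, y), ridge(x, y, lam=5.0))       # the ridge estimate is pulled towards zero
```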

So, we can now see why trying to "free up" one of the conditions associated with the Gauss-Markov Theorem, by considering estimators that are linear but not necessarily unbiased, really doesn't lead us very far if our objective is to minimize MSE. This also serves as another example of a situation where the MMSE estimator isn't feasible (computable).
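
As a final illustration of that infeasibility point, here's a small Monte Carlo sketch (again my own, with made-up parameter values) comparing the sampling MSE of OLS with that of the infeasible β*, computed by plugging in the true β and σ² that a practitioner would never actually know:

```python
# Monte Carlo comparison: the infeasible β* has smaller MSE than OLS, but only
# because its shrinkage factor uses the true β and σ².
import numpy as np

rng = np.random.default_rng(123)
n, beta, sigma = 20, 1.0, 2.0
x = rng.normal(size=n)                               # regressors held fixed across replications
shrink = beta**2 / (sigma**2 / np.sum(x**2) + beta**2)

b_ols, b_star = [], []
for _ in range(20000):
    y = beta * x + sigma * rng.normal(size=n)
    b = np.sum(x * y) / np.sum(x**2)                 # OLS
    b_ols.append(b)
    b_star.append(shrink * b)                        # infeasible MMSE linear "estimator"

mse = lambda est: np.mean((np.array(est) - beta)**2)
print(mse(b_ols), mse(b_star))                       # MSE(β*) is the smaller of the two
```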

 

© 2013, David E. Giles

2 comments:

  1. An interesting observation is that the amount of shrinkage is a function of the noise-to-signal ratio, σ²/Σ(x_i²).

  2. David, I do hope you are gaining more Australian visitors.

    Thank you for your work!!

