Wednesday, May 22, 2013

Minimum MSE Estimation of a Regression Model

Students of econometrics encounter the Gauss-Markov Theorem (GMT) at a fairly early stage - even if they don't see a formal proof to begin with. This theorem deals with a particular property of the OLS estimator of the coefficient vector, β, in the following linear regression model:

                        y = Xβ + ε  ;  ε ~ [0 , σ²In] ,

where X is (n x k), non-random, and of rank k.

The GMT states that among all linear estimators of β that are also unbiased estimators, the OLS estimator of β is most efficient. That is, OLS is the BLU estimator for β.

A "linear" estimator is simply one that can be written as a linear function of the random sample data. Here, these are the observed values of the elements of y. We can write the OLS estimator of β as b = (X'X)⁻¹X'y = Ay; where A = (X'X)⁻¹X' is a non-random matrix. So, each element of b is a linear combination of the elements of y, with weights that aren't random. It's a linear estimator.
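We can confirm this linearity numerically. The following sketch (in NumPy, with made-up data) forms A from X alone and checks that b = Ay matches the usual least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up design matrix (n = 50, k = 2), treated as non-random
n = 50
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(0.0, 0.5, size=n)

# A = (X'X)^{-1} X' depends only on X, not on y
A = np.linalg.solve(X.T @ X, X.T)

# b = Ay: each element of b is a fixed linear combination of the y's
b = A @ y

# Agrees with the standard least-squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b, b_lstsq))  # True
```

Any other choice of non-random A would also give a linear estimator; OLS corresponds to this particular A.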

Notice that the GMT holds without having to assume that the errors in the model are Normally distributed. (If they do happen to be Normal, then the OLS estimator of β is "Best Unbiased" - that is, we don't have to restrict ourselves to the family of linear estimators.) Also, if ε has a non-scalar (but known) covariance matrix, then the GLS estimator of β is BLU.

When I introduce students to the GMT I usually emphasize that it's a result that's of only limited interest. There are lots of interesting and important estimators that aren't linear estimators. Moreover, why on earth would we want to constrain ourselves to considering only estimators that are unbiased? Econometricians use non-linear and biased estimators all of the time. For example, most Instrumental Variables and GMM estimators fall into this category.

Putting this important point aside for the moment, it's also interesting to ask students to derive the linear minimum MSE estimator of β in the above model. That is, think of the family of estimators of β that are of the form, b = Ay, and determine what choice of A leads to the estimator with the smallest MSE. Strictly speaking, as b and β are vectors, we need to think of the MSE matrices, defined as MSE = V + (Bias Bias'), where V is the covariance matrix of the estimator and Bias is the (k x 1) bias vector. This quantity can be made scalar by taking trace(MSE).

If you want to try to answer this question, you need to know a bit of matrix differential calculus - specifically, you need to know how to differentiate functions of matrices with respect to the elements of a matrix. There are plenty of books on this, but the free download from Steven Nydick is all you need for this particular problem.

However, to make life a little easier, but without altering the message, we can simplify the model to the case where k = 1:

               yi = βxi + εi      ;    εi ~ i.i.d. [0 , σ²]

and consider the family of estimators,

              β* = [a1y1 + a2y2 + ......... + anyn] ,

where the weights, ai, are non-random.

Immediately, E[β*] = Σ(aiβxi) = βΣ(aixi).

So,       Bias[β*] = E[β*] - β = β[Σ(aixi) - 1].

In addition, given the i.i.d. assumption for the errors,

           Var.[β*] = Σ(ai²Var.(yi)) = σ²Σ(ai²) .

So,      M = MSE[β*] = σ²Σ(ai²) + β²[Σ(aixi) - 1]².
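Before differentiating, it's worth verifying that M can indeed be pushed below the value achieved by the (unbiased) OLS weights, ai = xi / Σ(xi²). The sketch below (NumPy, with made-up values for x, β, and σ²) evaluates M at the OLS weights and at those same weights scaled down slightly; the small bias introduced is more than offset by the reduction in variance:

```python
import numpy as np

def mse_linear(a, x, beta, sigma2):
    """M = sigma^2 * sum(a_i^2) + beta^2 * (sum(a_i * x_i) - 1)^2."""
    return sigma2 * np.sum(a**2) + beta**2 * (np.sum(a * x) - 1.0)**2

# Illustrative (made-up) values
x = np.array([1.0, 2.0, 3.0, 4.0])
beta, sigma2 = 2.0, 1.0

# OLS weights: unbiased, so M reduces to the variance term
a_ols = x / np.sum(x**2)

# Scale the OLS weights by a factor slightly below one
c = beta**2 / (sigma2 / np.sum(x**2) + beta**2)
a_scaled = c * a_ols

print(mse_linear(a_ols, x, beta, sigma2))     # variance-only MSE of OLS
print(mse_linear(a_scaled, x, beta, sigma2))  # strictly smaller
```

The scaled weights also satisfy the first-order conditions derived next, which is how that particular scaling factor arises.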

If we differentiate M, partially, with respect to each of the aj's, and set these derivatives to zero, we get:

           2σ²aj + 2β²xj[Σ(aixi)] = 2β²xj    ;   j = 1, 2, ....., n.        (1)

Then, dividing by 2 and multiplying each side of each of these "n" equations, (1), by xj, we get:

           σ²ajxj + β²xj²[Σ(aixi)] = β²xj²    ;     j = 1, 2, ....., n .

Now, sum over all j:

           σ²Σ(ajxj) + β²Σ(xj²)Σ(aixi) = β²Σ(xj²)    .                      (2)

Similarly, dividing each equation in (1) by 2, multiplying each side by yj, and summing over all j, we get:

          σ²Σ(ajyj) + β²Σ(xjyj)Σ(aixi) = β²Σ(xjyj).                    (3)

Notice that (3) can be re-written as:

          σ²β* + β²Σ(xjyj)Σ(aixi) = β²Σ(xjyj) .                        (4)

It's now a simple matter to solve equations (2) and (4) for Σ(aixi) and β*. The solution for the latter is:

           β* = {β² / [(σ² / Σ(xi²)) + β²] } b ,

where b = [Σ(xiyi) / Σ(xi²)] is the OLS estimator of β.
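A small Monte Carlo exercise makes the shrinkage and the MSE gain concrete. This is only a sketch (NumPy, with made-up parameter values), and of course it uses the true β and σ in forming β*, which is exactly why β* is infeasible in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design and (true) parameters
x = np.linspace(1.0, 2.0, 20)
beta, sigma = 1.5, 4.0
Sxx = np.sum(x**2)

# The shrinkage factor in braces above: beta^2 / [(sigma^2 / Sxx) + beta^2]
shrink = beta**2 / (sigma**2 / Sxx + beta**2)

# Simulate many samples; compute OLS b and beta* = shrink * b in each
reps = 20000
noise = rng.normal(0.0, sigma, size=(reps, x.size))
b_draws = (beta * x + noise) @ x / Sxx   # OLS estimator, sample by sample
beta_star = shrink * b_draws             # the infeasible MMSE "estimator"

mse_b = np.mean((b_draws - beta)**2)
mse_star = np.mean((beta_star - beta)**2)
print(mse_star < mse_b)   # True: beta* achieves a smaller empirical MSE
print(0 < shrink < 1)     # True: beta* shrinks b towards the origin
```

Because 0 < shrink < 1, every realization of β* lies strictly between 0 and the corresponding b.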

There are three important things to notice about β*:
  1. It isn't really an "estimator" because it's a function of the unknown parameters, β and σ2. It can't actually be used!
  2. |β*| < |b| . This MMSE "estimator" of β "shrinks" the OLS estimator towards the origin.
  3. The "estimator", β*, is both non-linear and biased.
Non-linear, biased, shrinkage estimators - ones that are genuine estimators and don't involve the unknown parameters - are often used in regression analysis. Examples are the Stein, James-Stein, Ridge, and Bayes estimators. The last of these can be especially appealing, as Bayes estimators allow us to shrink the OLS estimator towards a point that reflects our prior beliefs - not necessarily towards the origin.
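Of the feasible shrinkage estimators just mentioned, ridge is the easiest to illustrate in the k = 1 model above. A minimal sketch (NumPy, made-up data; the penalty value is chosen arbitrarily, not optimally):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data from the one-regressor model y_i = beta * x_i + e_i
x = np.linspace(1.0, 2.0, 20)
y = 1.5 * x + rng.normal(0.0, 1.0, size=x.size)

Sxx = np.sum(x**2)
b_ols = np.sum(x * y) / Sxx

# Ridge for this no-intercept model: minimize sum(y_i - beta*x_i)^2 + lam*beta^2
lam = 5.0  # arbitrary illustrative penalty
b_ridge = np.sum(x * y) / (Sxx + lam)

# Ridge shrinks OLS towards the origin, like beta*, but without
# needing to know beta or sigma^2
print(abs(b_ridge) < abs(b_ols))  # True
```

Note the structural similarity to β*: ridge replaces the unknown quantity σ²/β² in the shrinkage factor with a user-chosen constant, λ, which is what makes it a genuine estimator.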

So, we can now see why trying to "free up" one of the conditions associated with the Gauss-Markov Theorem, by considering estimators that are linear, but not necessarily unbiased, really doesn't lead us very far if we have in mind that we want to minimize MSE. This also serves as another example of a situation where the MMSE estimator isn't feasible (computable).


© 2013, David E. Giles


  1. An interesting observation is that the amount of shrinkage is a function of the noise-to-signal ratio, σ²/Σ(xi²), relative to β².
