## Thursday, October 10, 2013

### Beyond MSE - "Optimal" Linear Regression Estimation

In a recent post I discussed the fact that there is no linear minimum MSE estimator for the coefficients of a linear regression model. Specifically, if you try to find one, you end up with an "estimator" that is non-operational, because it is itself a function of the unknown parameters of the model. It's note really an estimator at all, because it can't be computed.

However, by changing the objective of the exercise slightly, a computable "optimal estimator" can be obtained. Let's take a look at this.

We'll consider the same model as in the earlier post. Although it's the simple linear regression model without an intercept, the basic result generalizes to the usual multiple linear regression model. So, our model, with a non-random regressor, is:

yi = βxi + εi    ;     εi ~ i.i.d. [0 , σ2]       ;       i = 1, 2, ...., n.

Let β* be any linear estimator of β, so that we can write β* = Σaiyi, where the ai's are non-random weights, and all summations (here and below) are taken over i (or, later j) = 1 to n.

So, E[β*] = βΣ(aixi) , and

Bias[β*] = β[Σ(aixi) - 1].                                                              (1)

Similarly,

var.[β*] = Σ[ai2var.(yi)] = σ2Σ(ai2).                                               (2)

Now, instead of trying to find the ai weights (and hence the β*) that minimizes MSE, let's try and find the ai's that lead to a β* that minimizes the quantity

Q = α[ var.(β*) /  σ2] + (1 - α)[Bias(β*) / β]2    ,                          (3)

where α is any number satisfying 0 < α < 1. The quantity we're going to minimize is a weighted sum of the relative variance and the squared relative bias of our linear estimator.

Notice that if we choose α = 0, then β* is just the OLS estimator; and we choose α = 1, then β* = 0 (which is not exactly an interesting "estimator").

From (3), we get:

(∂Q / ∂aj) = 2αaj + 2(1 - α)xj[∑(aixi) - 1]                                        (4)

Setting all of the equations in (4) to zero, multiplying by yj, and summing over j, we get:

αβ* + (1 - α)∑(xjyj)[∑(aixi) - 1] = 0  .                                            (5)

Similarly, setting all of the equations in (4) equal to zero, multiplying by xj, and summing over all j, we get:

α∑(ajxj) + (1 - α)∑(xj2)[∑(aixi) - 1] = 0   .                                      (6)

From (6),

∑(ajxj) = ∑(aixi ) = (1 - α)∑(xj2) / [α +  (1 - α)∑(xj2)]  .                  (7)

Using (7) in (5), and solving for β*, we get:

β* = b [(1 - α)∑(xj2)] / [α +  (1 - α)∑(xj2)]  ,                                 (8)

where b is the OLS estimator of β.

The estimator in (8) can be computed for any chosen value of α; and for 0 < α < 1 β* is a shrinkage estimator - it shrinks the OLS estimator towards the origin.

1. that's interesting dave. if you had included an intercept, would it still act as a shrinkage estimator ?

1. Mark - thanks. Yes - just take deviations about sample means for x and y. The same result will then apply for the estimation of the slope coefficient (but not for the estimation of the intercept).

2. Dave,

Thanks for a really interesting couple of posts - good stuff!

Can we say something concrete about how the two estimators are related? If I'm not mistaken, the feasible estimator #2 is identical to the infeasible estimator #1 if you set alpha = sigma^2 / (sigma^2 + beta^2).

Of course we don't know sigma^2 or beta^2, but then alpha has to come from somewhere. Can we use the OLS estimates, set alpha = s^2 / (s^2 + b^2), and say our estimator (a) minimizes the weighted sum of the relative variance and the squared relative bias, and (b) uses an estimate of the "optimal" weight (where "optimal" means we would like to minimize the MSE if we could). Kind of hand-wavey, I know.

You suggested this option at the end of your first post, and I wonder if there's anything more concrete that can be said about it. (I suspect not - otherwise you probably would have said it! - but hope springs eternal.)

--Mark

1. Mark - thanks. If you could set alpha equal to sigma^2 /(sigma^2 + beta^2), this will indeed be the choice that actually minimizes MSE. Replacing sigma^2 with s^2 and beta^2 with b^2 will alter the MSE (as before). However, at least these estimators are consistent, so in a large enough sample everything would be fine.

2. And alphahat = s^2 / (s^2 + b^2) is consistent for alpha. But consistency is kind of weak tea in this context, since the MSE is going to zero in the limit anyway.

Is there any Monte Carlo (or other) evidence about these estimators? I'm tempted to set some students loose on this as a project, but not if there's a paper or book out there with all the answers!

And thanks again for the blogging - good stuff indeed.

--Mark

3. Mark - absolutely right. I'm not aware of any studies - let them loose!