## Sunday, December 15, 2013

### Proxy Variables and Biased Estimation

Here's a problem from the exam that one of my econometrics classes sat recently. It deals with some of the consequences of mis-specifying a regression model, and then applying OLS estimation.

Specifically, let's suppose that the data-generating process (the correct model specification) is actually of the form:

$$y = X\beta + \varepsilon \; ; \quad \varepsilon \sim [0 , \sigma^2 I_n] . \tag{1}$$

However, we can't observe the k variables in the X matrix, and instead we replace them with k "proxy variables" (substitutes) that we can observe. So, the model that we actually estimate is:

$$y = X^*\beta + v . \tag{2}$$

The students were asked to show that the usual (unbiased) estimator of $\sigma^2$ is actually biased in this case, and they were asked if they could determine the "direction" of the bias.

If $v^*$ is the residual vector after we estimate (2) by OLS, then the estimator of $\sigma^2$ that we'd construct would be

$$\sigma^{*2} = \frac{v^{*\prime} v^*}{n - k} = \frac{y' M^* y}{n - k} ,$$

where

$$M^* = I_n - X^*(X^{*\prime}X^*)^{-1}X^{*\prime} .$$
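For readers who like to see the matrix concretely, here's a minimal NumPy sketch that builds $M^*$ for a made-up $X^*$ and checks three standard properties of this kind of matrix: symmetry, idempotency, and a trace equal to $n - k$. The data, the seed and the function name `annihilator` are illustrative assumptions, not part of the exam question.

```python
import numpy as np


def annihilator(X_star):
    """Return M* = I_n - X*(X*'X*)^{-1}X*' for a full-column-rank X*."""
    n = X_star.shape[0]
    # Solve (X*'X*) B = X*' rather than forming the inverse explicitly.
    B = np.linalg.solve(X_star.T @ X_star, X_star.T)
    return np.eye(n) - X_star @ B


# Hypothetical proxy data: n = 30 observations on k = 3 proxies.
rng = np.random.default_rng(0)
n, k = 30, 3
X_star = rng.normal(size=(n, k))
M_star = annihilator(X_star)

print("symmetric: ", np.allclose(M_star, M_star.T))
print("idempotent:", np.allclose(M_star @ M_star, M_star))
print("trace:     ", np.trace(M_star), "vs n - k =", n - k)
```

Using `np.linalg.solve` instead of an explicit inverse is just a numerical-stability habit; the algebra is the same.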

Now, the correct expression for y is given by (1), so

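$$\sigma^{*2} = \frac{(X\beta + \varepsilon)' M^* (X\beta + \varepsilon)}{n - k} = \frac{\beta'X'M^*X\beta + \varepsilon'M^*\varepsilon + 2\varepsilon'M^*X\beta}{n - k} ,$$

and, as $E[\varepsilon] = 0$ and $X$ and $X^*$ are treated as non-random, the cross-product term has a zero expectation, so

$$E[\sigma^{*2}] = \frac{\beta'X'M^*X\beta + E[\varepsilon'M^*\varepsilon]}{n - k} . \tag{3}$$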

Note that each term in (3) is scalar, so

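$$\begin{aligned}
E[\varepsilon'M^*\varepsilon] &= E\{\operatorname{tr}[\varepsilon'M^*\varepsilon]\} = E\{\operatorname{tr}[M^*\varepsilon\varepsilon']\} = \operatorname{tr}\{E[M^*\varepsilon\varepsilon']\} \\
&= \operatorname{tr}\{M^*\sigma^2 I_n\} = \sigma^2 \operatorname{tr}(M^*) = \sigma^2 (n - k) .
\end{aligned}$$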

So,

$$E[\sigma^{*2}] = \frac{\beta'X'M^*X\beta}{n - k} + \sigma^2 ,$$

and our estimator of $\sigma^2$ is biased, with a bias of $\beta'X'M^*X\beta / (n - k)$.
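If you'd like to see this bias in action, here's a small Monte Carlo sketch. It holds a made-up $X$ and a made-up set of noisy proxies $X^*$ fixed, draws the errors repeatedly, and compares the average of the estimator with $\sigma^2$ plus the bias term $\beta'X'M^*X\beta / (n - k)$. Everything specific here (the sample size, $\beta$, the way the proxies are generated, normal errors, the seed) is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k, sigma2 = 50, 2, 1.0
beta = np.array([1.0, -0.5])                 # hypothetical true coefficients

# Regressors are held fixed across replications.
X = rng.normal(size=(n, k))                  # unobserved "true" regressors
X_star = X + 0.8 * rng.normal(size=(n, k))   # hypothetical noisy proxies

# M* = I_n - X*(X*'X*)^{-1}X*'
M_star = np.eye(n) - X_star @ np.linalg.solve(X_star.T @ X_star, X_star.T)

# Bias implied by the derivation: beta'X'M*X beta / (n - k)
bias_theory = beta @ X.T @ M_star @ X @ beta / (n - k)

# Simulate y = X beta + eps many times; each row of Y is one sample of y.
reps = 20_000
eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
Y = X @ beta + eps

# Row-wise quadratic form y'M*y, then the "usual" estimator y'M*y / (n - k).
est = ((Y @ M_star) * Y).sum(axis=1) / (n - k)
bias_mc = est.mean() - sigma2

print("theoretical bias:", bias_theory)
print("simulated bias:  ", bias_mc)
```

The two numbers should agree up to simulation noise, and both should be positive.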

Finally, note that as $M^*$ is symmetric and idempotent it is (at least) positive semi-definite, so $\beta'X'M^*X\beta \ge 0$. That is, our estimator has a non-negative bias.
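For anyone who wants the non-negativity spelled out, here's one way to write it, using only the facts that $M^*$ is symmetric ($M^{*\prime} = M^*$) and idempotent ($M^*M^* = M^*$):

$$\beta'X'M^*X\beta = \beta'X'M^{*\prime}M^*X\beta = (M^*X\beta)'(M^*X\beta) = \lVert M^*X\beta \rVert^2 \ge 0 .$$

In words, the numerator of the bias is the sum of squared residuals we'd get from regressing $X\beta$ on the proxies.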

The exercise can be taken a step further by asking "under what condition(s), if any, will this bias be zero?"

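Putting to one side the uninteresting situation in which $X\beta = 0$, we're left with the following condition: the estimator will be unbiased if $M^*X = 0$ (or, equivalently, if $X'M^* = 0$). Let's interpret this condition. $M^*X = 0$ means that every column of $X$ lies in the column space of $X^*$. Given that the problem has been set up so that models (1) and (2) each have the same number (k) of regressors, and taking both matrices to have full column rank, this happens only if $X = X^*A$ for some nonsingular $k \times k$ matrix $A$, with $X = X^*$ as the obvious case. The proxies then carry exactly the same information as the correct variables, so in effect the correct variables have been used for estimation purposes.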
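Here's a quick numerical sketch of that zero-bias condition. $M^*X = 0$ holds whenever each column of $X$ is an exact linear combination of the columns of $X^*$ (with $X = X^*$ as the simplest case), and it fails for proxies that are merely correlated with $X$. The matrix `A`, the data, and the helper names `annihilator` and `bias_term` are all hypothetical choices for illustration.

```python
import numpy as np


def annihilator(Z):
    """Return I_n - Z(Z'Z)^{-1}Z' for a full-column-rank Z."""
    return np.eye(Z.shape[0]) - Z @ np.linalg.solve(Z.T @ Z, Z.T)


def bias_term(X, X_star, beta):
    """Bias of the variance estimator: beta'X'M*X beta / (n - k)."""
    n, k = X_star.shape
    M_star = annihilator(X_star)
    return beta @ X.T @ M_star @ X @ beta / (n - k)


rng = np.random.default_rng(1)
n, k = 40, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, 2.0, -1.0])

# Case 1: "proxies" that are a nonsingular recombination of X (det(A) = 7).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
X_star_same_space = X @ A

# Case 2: genuine proxies, i.e. X measured with added noise.
X_star_noisy = X + rng.normal(size=(n, k))

print("max |M*X|, same space:", np.abs(annihilator(X_star_same_space) @ X).max())
print("bias, same space:     ", bias_term(X, X_star_same_space, beta))
print("bias, noisy proxies:  ", bias_term(X, X_star_noisy, beta))
```

In the first case the bias vanishes up to rounding error; in the second it is strictly positive.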

So, replacing all of the regressors with proxy variables implies that the usual unbiased estimator of $\sigma^2$ will be biased upwards, unless the proxies span exactly the same column space as the true regressors.

You might check out the following variation on the problem. What if there are k* > k proxy variables in model (2)? What if there are k* < k proxy variables? Do you get such an unambiguous result in these cases?