Econometrics Beat: Dave Giles' Blog: What NOT To Do When Data Are Missing

Thursday, October 1, 2015

What NOT To Do When Data Are Missing

Here's something that's very tempting, but it's not a good idea.

Suppose that we want to estimate a regression model by OLS. We have a full sample of size n for the regressors, but one of the values for our dependent variable, y, isn't available. Rather than estimating the model using just the (n - 1) complete observations, we might think that it would be preferable to use all of the available data, and impute the missing value for y.

Fair enough, but what imputation method are you going to use?

For simplicity, and without any loss of generality, suppose that the model has a single regressor,

y_i = β x_i + ε_i , (1)

and it's the nth value of y that's missing. We have values for x₁, x₂, ..., x_n; and for y₁, y₂, ..., y_(n-1).

Here's a great idea! OLS will give us the Best Linear Predictor of y, so why don't we just estimate (1) by OLS, using the available (n - 1) sample values for x and y; use this model (and x_n) to get a predicted value (y*_n) for y_n; and then re-estimate the model with all n data-points: x₁, x₂, ..., x_n; y₁, y₂, ..., y_(n-1), y*_n.

Unfortunately, this is actually a waste of time. Let's see why.

Using our sample of (n - 1) observations, the OLS estimator of β in (1) is:

b_(n-1) = Σ (x_i y_i) / Σ(x_i²) , (2)

where the summations in (2) run from i = 1 to i = (n - 1).

The predictor for y_n is:

y*_n = b_(n-1) x_n = x_n Σ (x_i y_i) / Σ(x_i²) . (3)

If we now estimate (1) using x₁, x₂, ..., x_n; y₁, y₂, ..., y_(n-1), y*_n, our OLS estimator of β will be:

b_(n) = Σ (x_i y'_i) / Σ(x_i²) , (4)

where y'_i = y_i for i = 1 to (n-1), and y'_n = y*_n. The summations in (4) run from i = 1 to i = n.

We can re-write (4) as:

b_(n) = [Σ (x_i y_i) + x_n y*_n] / [Σ(x_i²) + x_n²], (5)

where the summations in (5) run from i = 1 to i = (n - 1).

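Substituting (3) into (5), the numerator becomes Σ (x_i y_i) + x_n² Σ (x_i y_i) / Σ(x_i²) = Σ (x_i y_i)[Σ(x_i²) + x_n²] / Σ(x_i²). Dividing this by the denominator of (5), we obtain: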

b_(n) = Σ (x_i y_i)[Σ(x_i²) + x_n²] / {Σ(x_i²) [Σ(x_i²) + x_n²]}, (6)

where the summations in (6) run from i = 1 to i = (n - 1).

Equation (6) reduces to:

b_(n) = Σ (x_i y_i) / Σ(x_i²) = b_(n-1) .

In other words, nothing at all is gained by imputing the missing value of y_n by using its OLS predicted value. You don't get something for nothing!
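
The algebra can also be checked numerically. What follows is only a minimal sketch in plain Python: the simulated data, the seed, and the function names ols_slope and impute_and_refit are assumptions made for illustration, not anything from the original post. It fits the no-intercept model on the (n - 1) complete pairs, imputes y_n as in (3), and refits on all n pairs as in (4).

```python
import math
import random


def ols_slope(x, y):
    """OLS slope for the no-intercept model y_i = beta * x_i + e_i, as in (2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def impute_and_refit(x, y_obs):
    """x holds n values; y_obs holds the first n - 1 values of y.

    Returns (b_(n-1), y*_n, b_(n)), following equations (2), (3) and (4).
    """
    b_short = ols_slope(x[:-1], y_obs)       # (2): fit on the n - 1 complete pairs
    y_star = b_short * x[-1]                 # (3): predicted value for the missing y_n
    b_full = ols_slope(x, y_obs + [y_star])  # (4): refit on all n pairs
    return b_short, y_star, b_full


# Simulated data (an assumption for illustration; not from the post).
rng = random.Random(0)
n = 20
x = [rng.uniform(1.0, 10.0) for _ in range(n)]
y_obs = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x[:-1]]

b_short, y_star, b_full = impute_and_refit(x, y_obs)
print(f"b_(n-1) = {b_short:.12f}")
print(f"b_(n)   = {b_full:.12f}")
print("equal up to rounding:", math.isclose(b_short, b_full, rel_tol=1e-9))
```

Any difference between the two printed estimates should be floating-point rounding only, which is why the comparison uses a tolerance rather than exact equality.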

9 comments:

Daumantas, October 1, 2015 at 9:40 AM
In addition, one has to be careful interpreting the variance of the estimated regression coefficient(s) and the R-squared if imputation was done before running the estimation. In your example I suspect the R-squared would be spuriously higher and the variance of beta would be lower than it should be.
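
This suspicion can be explored with a small simulation. The sketch below is hedged: the data are simulated, the function name fit_no_intercept is hypothetical, and because model (1) has no intercept it reports the uncentered R-squared, which is an assumption about which R-squared is meant. It compares the fit on the (n - 1) complete observations with the fit after imputing y_n by its OLS prediction.

```python
import math
import random


def fit_no_intercept(x, y):
    """Return (b, standard error of b, uncentered R^2) for y_i = beta * x_i + e_i."""
    sxx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    ssr = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (len(x) - 1)                    # one estimated parameter
    se = math.sqrt(s2 / sxx)
    r2 = 1.0 - ssr / sum(yi * yi for yi in y)  # uncentered: the model has no intercept
    return b, se, r2


# Simulated data (an assumption for illustration).
rng = random.Random(1)
n = 20
x = [rng.uniform(1.0, 10.0) for _ in range(n)]
y_obs = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x[:-1]]

# Fit on the complete pairs, impute y_n, then refit on all n pairs.
b_obs, se_obs, r2_obs = fit_no_intercept(x[:-1], y_obs)
y_star = b_obs * x[-1]
b_imp, se_imp, r2_imp = fit_no_intercept(x, y_obs + [y_star])

print(f"slope:     {b_obs:.6f} (observed only)  {b_imp:.6f} (with imputed y_n)")
print(f"std error: {se_obs:.6f} (observed only)  {se_imp:.6f} (with imputed y_n)")
print(f"R-squared: {r2_obs:.6f} (observed only)  {r2_imp:.6f} (with imputed y_n)")
```

The imputed point lies exactly on the fitted line, so it adds nothing to the residual sum of squares while adding one degree of freedom and increasing both Σ(x_i²) and Σ(y_i²). In this setup that pushes the reported standard error down and the uncentered R-squared up, even though the slope is unchanged.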
JC, October 1, 2015 at 11:14 PM
Can you put this in the context of multiple imputation, which accounts for the uncertainty in y*_n? There's a lot of literature arguing for it, but the intuition has always escaped me.
Evan Soltas, October 5, 2015 at 12:01 PM
I had always thought the temptation with missing data was to impute missing values of independent variables, so that one does not have to drop a large number of rows. My understanding is that there are a number of open questions in econometrics as to how best to do this, e.g. for principal components analysis of large datasets.
Anonymous, October 7, 2015 at 4:55 AM
Thanks for another great post! I'm assuming this result holds for any number of missing Y values provided the data on X are not missing, and of course we don't have n<k?

Also, I was curious about your general feelings on imputing missing data. Are you ever in favor of it? Or do you think what's missing should always be left missing?

Thanks!
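
On the first question, the same reasoning appears to carry over: every imputed point sits exactly on the fitted line, so it contributes a zero residual and the normal equations are still satisfied at the original coefficients. A minimal sketch, assuming simulated data, a model with an intercept and one regressor, and five missing y values (the function name ols_line and all numbers are hypothetical):

```python
import math
import random


def ols_line(x, y):
    """Return (intercept, slope) from OLS of y on a constant and x."""
    n_obs = len(x)
    x_bar = sum(x) / n_obs
    y_bar = sum(y) / n_obs
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Simulated data (an assumption): the last n_missing values of y are unobserved.
rng = random.Random(2)
n, n_missing = 30, 5
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y_obs = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x[:n - n_missing]]

# Fit on the complete rows, impute each missing y by its fitted value, then refit.
a_obs, b_obs = ols_line(x[:n - n_missing], y_obs)
y_imputed = [a_obs + b_obs * xi for xi in x[n - n_missing:]]
a_all, b_all = ols_line(x, y_obs + y_imputed)

print(f"intercept: {a_obs:.10f} (observed only)  {a_all:.10f} (after imputation)")
print(f"slope:     {b_obs:.10f} (observed only)  {b_all:.10f} (after imputation)")
```

This relies on there being enough complete rows to estimate the model in the first place, which is the n < k caveat raised in the question.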