Econometrics Beat: Dave Giles' Blog: Oct 1, 2015

Thursday, October 1, 2015

What NOT To Do When Data Are Missing

Here's something that's very tempting, but it's not a good idea.

Suppose that we want to estimate a regression model by OLS. We have a full sample of size n for the regressors, but one of the values for our dependent variable, y, isn't available. Rather than estimate the model using just the (n - 1) complete data-points, we might think that it would be preferable to use all of the available data, and impute the missing value for y.

Fair enough, but what imputation method are you going to use?

For simplicity, and without any loss of generality, suppose that the model has a single regressor,

y_i = β x_i + ε_i , (1)

and it's the nth value of y that's missing. We have values for x₁, x₂, ..., x_n; and for y₁, y₂, ..., y_{n-1}.

Here's a great idea! OLS will give us the Best Linear Predictor of y, so why don't we just estimate (1) by OLS, using the available (n - 1) sample values for x and y; use this model (and x_n) to get a predicted value (y*_n) for y_n; and then re-estimate the model with all n data-points: x₁, x₂, ..., x_n; y₁, y₂, ..., y_{n-1}, y*_n?

Unfortunately, this is actually a waste of time. Let's see why.
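
A quick numerical check makes the claim concrete. The sketch below, in Python with NumPy, follows the procedure described above on simulated data. The seed, the sample size, and the "true" β = 2 are arbitrary assumptions, and `ols_slope` is a hypothetical helper written for this illustration only. It fits model (1) on the first (n - 1) observations, imputes y*_n, and then re-fits on all n points.

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope for the no-intercept model y_i = beta * x_i + e_i."""
    return float(x @ y / (x @ x))


# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # treat y[-1] as the missing y_n

# Step 1: estimate (1) from the (n - 1) complete observations.
b_short = ols_slope(x[:-1], y[:-1])

# Step 2: impute the missing value using the fitted model and x_n.
y_star = b_short * x[-1]

# Step 3: re-estimate with all n points, the last one being imputed.
y_filled = np.append(y[:-1], y_star)
b_full = ols_slope(x, y_filled)

print("slope from n - 1 points:      ", b_short)
print("slope after imputing y_n:     ", b_full)
print("residual at the imputed point:", y_star - b_full * x[-1])
```

The two slopes agree up to floating-point rounding. The imputed point sits exactly on the line fitted in step 1, so it has a zero residual there, and that line still minimizes the sum of squared residuals over all n points.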
