Thursday, October 1, 2015

What NOT To Do When Data Are Missing

Here's something that's very tempting, but it's not a good idea.

Suppose that we want to estimate a regression model by OLS. We have a full sample of size n for the regressors, but one of the values for our dependent variable, y, isn't available. Rather than estimate the model using just the (n - 1) available data-points, you might think that it would be preferable to use all of the available data, and impute the missing value for y.

Fair enough, but what imputation method are you going to use?

For simplicity, and without any loss of generality, suppose that the model has a single regressor,
                y_i = β x_i + ε_i ,                                                                       (1)

 and it's the nth value of y that's missing. We have values for x_1, x_2, ..., x_n; and for y_1, y_2, ..., y_(n-1).

Here's a great idea! OLS will give us the Best Linear Predictor of y, so why don't we just estimate (1) by OLS, using the available (n - 1) sample values for x and y; use this model (and x_n) to get a predicted value (y*_n) for y_n; and then re-estimate the model with all n data-points: x_1, x_2, ..., x_n; y_1, y_2, ..., y_(n-1), y*_n.

Unfortunately, this is actually a waste of time. Let's see why.

Using our sample of (n - 1) observations, the OLS estimator of β in (1) is:

            b(n-1) = Σ(x_i y_i) / Σ(x_i^2) ,                                                          (2)

where the summations in (2) run from i = 1 to i = (n - 1).

The predictor for yn is:

          y*_n = b(n-1) x_n = x_n Σ(x_i y_i) / Σ(x_i^2) .                                         (3)

If we now estimate (1) using x_1, x_2, ..., x_n; y_1, y_2, ..., y_(n-1), y*_n, our OLS estimator of β will be:

          b(n) = Σ(x_i y'_i) / Σ(x_i^2) ,                                                             (4)

where y'_i = y_i for i = 1 to (n - 1), and y'_n = y*_n. The summations in (4) run from i = 1 to i = n.

We can re-write (4) as:

           b(n) = [Σ(x_i y_i) + x_n y*_n] / [Σ(x_i^2) + x_n^2],                                    (5)

where the summations in (5) run from i = 1 to i = (n - 1).

Substituting (3) into (5), and gathering up terms, we obtain:

            b(n) = Σ(x_i y_i)[Σ(x_i^2) + x_n^2] / {Σ(x_i^2)[Σ(x_i^2) + x_n^2]},                  (6)

where the summations in (6) run from i = 1 to i = (n - 1).

Equation (6) reduces to:

             b(n) = Σ(x_i y_i) / Σ(x_i^2) = b(n-1) .
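This reduction can also be checked symbolically. Treating the two sums over the first (n - 1) observations as single symbols (a sketch using sympy; the symbol names are just placeholders):

```python
import sympy as sp

# S_xy stands for Σ(x_i y_i) and S_xx for Σ(x_i^2), both over i = 1..(n - 1)
S_xy, S_xx, x_n = sp.symbols('S_xy S_xx x_n', positive=True)

b_prev = S_xy / S_xx                               # equation (2)
y_star = b_prev * x_n                              # equation (3)
b_new = (S_xy + x_n * y_star) / (S_xx + x_n**2)    # equation (5)

# Equations (5)/(6) collapse back to (2): the difference simplifies to zero
print(sp.simplify(b_new - b_prev))                 # prints 0
```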

In other words, nothing at all is gained by imputing the missing value of yn by using its OLS predicted value. You don't get something for nothing!
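The result is easy to confirm numerically. Here's a quick sketch (with made-up data, using numpy) that fits on the (n - 1) complete observations, imputes y_n from that fit, and then re-estimates with all n points:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 20
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # made-up data; pretend y[n-1] is missing

# Equation (2): OLS slope (no intercept) from the first (n - 1) observations
b_n_minus_1 = (x[:-1] @ y[:-1]) / (x[:-1] @ x[:-1])

# Equation (3): impute the missing y_n with its OLS prediction
y_imputed = y.copy()
y_imputed[-1] = b_n_minus_1 * x[-1]

# Equation (4): re-estimate using all n observations, imputed value included
b_n = (x @ y_imputed) / (x @ x)

# The two estimates coincide exactly, as the algebra says they must
print(np.isclose(b_n_minus_1, b_n))   # prints True
```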

© 2015, David E. Giles


  1. In addition, one has to be careful interpreting the variance of the estimated regression coefficient(s) and the R-squared if imputation was done before running the estimation. In your example I suspect the R-squared would be spuriously higher and the variance of beta would be lower than it should be.

    1. You have to be careful, yes. The standard errors may be greater or smaller, and the same is true of the R-squared. You can easily check this with an empirical example.

    2. I am not sure what the intuition for a potential decrease in R-squared and variance of beta could be. But here is my argument for a decrease. The imputed y*_n will lie on the regression line. The associated epsilon*_n will be zero. The estimated sigma^2 will be lower (an extra zero term in the sum, but an extra unit in the denominator), which will decrease both the R-squared (unless y*_n equals the mean of y) and the variance of beta. Could you provide (an idea for) an example where the effect would be the opposite?

    3. Mistake in the second-to-last line, should be "_increase_ the R-squared (unless y*_n equals the mean of y) and _decrease_ the variance of beta".
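The claims in this sub-thread are easy to check with a small simulation on made-up data. A sketch, sticking with the no-intercept model of the post (so R-squared here is the uncentred version, 1 - RSS/Σ y_i^2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)   # made-up data; pretend y[n-1] is missing

def ols_stats(x, y):
    """Slope, Var(b), and uncentred R-squared for the model y = b*x + e."""
    b = (x @ y) / (x @ x)
    resid = y - b * x
    sigma2 = (resid @ resid) / (len(x) - 1)   # one estimated coefficient
    var_b = sigma2 / (x @ x)
    r2 = 1.0 - (resid @ resid) / (y @ y)
    return b, var_b, r2

# Fit on the (n - 1) real points, impute y_n on the fitted line, refit
b1, var_b1, r2_1 = ols_stats(x[:-1], y[:-1])
y_imp = y.copy()
y_imp[-1] = b1 * x[-1]
b2, var_b2, r2_2 = ols_stats(x, y_imp)

# The slope is unchanged; the imputed residual is exactly zero, so the
# estimated sigma^2 (and hence Var(b)) falls while R-squared rises
print(np.isclose(b1, b2), var_b2 < var_b1, r2_2 > r2_1)   # True True True
```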

  2. Can you put this on the context of multiple imputation, which accounts for the uncertainty in y*_n? There's a lot of literature arguing for it, but the intuition has always escaped me.

    1. Not sure that I can. Maybe another reader can help?

  3. I had always thought the temptation with missing data was to impute missing values of independent variables, so that one does not have to drop a large number of rows. My understanding is that there are a number of open questions in econometrics as to how best to do this, e.g. for principal components analysis of large datasets.

  4. Thanks for another great post! I'm assuming this result holds for any number of missing Y values provided the data on X are not missing, and of course we don't have n<k?

    Also, I was curious about your general feelings on imputing missing data. Are you ever in favor of it? Or do you think what's missing should always be left missing?


    1. Yes, that's right, and any number of X variables. Generally, I'm in favour of imputation, as long as it is done well.