Friday, June 1, 2012

Yet Another Reason for Avoiding the Linear Probability Model

Oh dear, here we go again. Hard on the heels of this post, as well as an earlier one here, I'm moved to share even more misgivings about the Linear Probability Model (LPM). That's just a fancy name for a linear regression model in which the dependent variable is a binary dummy variable, and the coefficients are estimated by OLS.

What is it this time?

I'll keep it brief - I promise!

First, a general observation about measurement errors and any regression model. Measurement errors associated with the regressors are usually a cause for concern. Measurement errors associated with the dependent variable are (usually) less of a problem, because their effects get absorbed into the error term.

Now let's think about measurement errors associated with the binary dependent variable in a LPM. The assigned values are either zero or one. What errors can arise?
  1. A value that should be coded as zero gets coded as one instead.
  2. Or, a value that should be coded as one gets coded as zero.

This sort of mis-classification of the data has been investigated by a number of authors. Notably, Hausman et al. (1998) looked at the implications for the Logit and Probit models, as well as for the LPM.

The adverse implications of mis-classification are fundamentally worse for the LPM than they are for the other models. Specifically, the parameters of the LPM aren't identifiable. That's really serious! If the parameters can't be identified, then no amount of ingenuity is going to yield estimates of them.
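To make the identification problem concrete, here is a small simulation sketch. Everything in it is an illustrative assumption of mine rather than something taken from the paper: the names a0 (probability that a true zero is recorded as one), a1 (probability that a true one is recorded as zero), the LPM coefficients b0 and b1, and the uniform regressor. If the true probability is b0 + b1*x, then the expected value of the recorded outcome is a0 + (1 - a0 - a1)*(b0 + b1*x), which is still linear in x. OLS can therefore only recover the composite intercept and slope, and many different combinations of (b0, b1, a0, a1) imply exactly the same pair.

```python
import numpy as np


def implied_ols_limits(b0, b1, a0, a1):
    """Intercept and slope that OLS converges to when the outcome is misclassified.

    a0 and a1 are hypothetical misclassification probabilities (0 -> 1 and 1 -> 0).
    """
    scale = 1.0 - a0 - a1
    return a0 + scale * b0, scale * b1


def simulate_lpm_misclassification(n=200_000, b0=0.2, b1=0.5,
                                   a0=0.10, a1=0.20, seed=0):
    """Fit an LPM by OLS to simulated, misclassified binary data."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)       # keeps b0 + b1*x inside [0.2, 0.7]
    p_true = b0 + b1 * x                    # the LPM is correctly specified here
    y_true = rng.random(n) < p_true         # correctly classified outcomes

    # Record true ones as one with probability 1 - a1,
    # and true zeros as one with probability a0.
    u = rng.random(n)
    y_obs = np.where(y_true, u >= a1, u < a0).astype(float)

    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    return coef                             # array: [intercept, slope]


intercept, slope = simulate_lpm_misclassification()
print("OLS estimates:           ", intercept, slope)
print("Implied by (b, a):       ", implied_ols_limits(0.2, 0.5, 0.10, 0.20))
# A different parameter set with no misclassification at all
# implies the same intercept and slope.
print("Implied by another (b, a):", implied_ols_limits(0.24, 0.35, 0.0, 0.0))
```

The last two lines are the point: the data cannot tell these two parameter sets apart, however large the sample.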


Yes, there are issues associated with the maximum likelihood estimators for the Logit and Probit models when the data are mis-classified, but these can be addressed. Identification is ensured by imposing a simple restriction: the sum of the probabilities of the two types of "mis-classification errors" noted above must be less than one.
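For reference, the setup can be written compactly. The notation below is my own shorthand for the general form of the model studied by Hausman et al. (1998), not a quotation from the paper: y is the true outcome, y-tilde is the recorded one, alpha_0 and alpha_1 are the two mis-classification probabilities, and F is the Logit or Probit distribution function.

```latex
\begin{align*}
\alpha_0 &= \Pr(\tilde{y}_i = 1 \mid y_i = 0), \qquad
\alpha_1 = \Pr(\tilde{y}_i = 0 \mid y_i = 1), \\[4pt]
\Pr(\tilde{y}_i = 1 \mid x_i)
  &= \alpha_0 \bigl[1 - F(x_i'\beta)\bigr] + (1 - \alpha_1)\, F(x_i'\beta) \\
  &= \alpha_0 + (1 - \alpha_0 - \alpha_1)\, F(x_i'\beta),
  \qquad \alpha_0 + \alpha_1 < 1 .
\end{align*}
```

When F is nonlinear, its curvature lets the data separate the alphas from beta, and the restriction rules out the "mirror image" solution in which the labels are effectively swapped. When F(z) = z, as in the LPM, the right-hand side is linear in x, so only the composites alpha_0 + (1 - alpha_0 - alpha_1)*beta_0 and (1 - alpha_0 - alpha_1)*beta_j can be recovered.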

So, ask yourself the following question:

"When I have binary choice data, can I be absolutely sure that every one of the observations has been classified correctly into zeroes and ones?"

If your answer is "Yes", then I have to say that I don't believe you. Sorry!

If your answer is "No", then forget about using the LPM. You'll just be trying to do the impossible - namely, estimate parameters that aren't identified.

And that's not going to impress anybody!


Hausman, J. A., J. Abrevaya & F. M. Scott-Morton, 1998. Misclassification of the dependent variable in a discrete-response setting. Journal of Econometrics, 87, 239-269.

© 2012, David E. Giles


  1. Prof. Giles,

    Steve Pischke answered a question from Mark Schaffer in the Mostly Harmless Econometrics blog concerning your gripes about the LPM. Would you care to respond? I feel like this is truly an exchange from which a lot of people can learn.

    1. Alfredo: Thanks for drawing my attention to this post - it's not a blog I follow. I'll post some follow-up comments on this blog in the near future.