Friday, June 1, 2012

Another Gripe About the Linear Probability Model

NOTE: This post was revised significantly on 15 February, 2019, as a result of correcting an error in my original EViews code. The code file and the EViews workfile that are available on separate pages of this blog were also revised. I would like to thank Federico Belotti for drawing my attention to the earlier coding error.

So you're still thinking of using a Linear Probability Model (LPM) - also known in the business as good old OLS - to estimate a binary dependent variable model?

Well, I'm stunned!

Yes, yes, I've heard all of the "justifications" (excuses) for using the LPM, as opposed to using a Logit or Probit model. Here are a few of them:

  • It's computationally simpler.
  • It's easier to interpret the "marginal effects".
  • It avoids the risk of mis-specification of the "link function".
  • There are complications with Logit or Probit if you have endogenous dummy regressors.
  • The estimated marginal effects from the LPM, Logit and Probit models are usually very similar, especially if you have a large sample size.

Oh really? That don't impress me much!

Why not? Well, in almost all circumstances, the LPM yields biased and inconsistent estimates. You didn't know that? Then take a look at the paper by Horrace and Oaxaca (2006), and some previous results given by Amemiya (1977)!

It's not the bias that worries me in this particular context - it's the inconsistency. After all, the MLEs for the Logit and Probit models are also biased in finite samples - but they're consistent. Given the sample sizes that we usually work with when modelling binary data, it's consistency and asymptotic efficiency that are of primary importance.

Why would you choose to use a modelling/estimation strategy that will give you the wrong answer, with probability one, even if you have an infinitely large sample? Wouldn't you rather use one that will give you the right answer, with certainty, if the sample size is very, very large?

If you're not feeling up to reading the Horrace and Oaxaca paper - bad news is always unpleasant - here's their key result. Let x_i denote the i-th observation on the vector of covariates, and let β be the vector of associated coefficients. So, x_iβ is the true value of the i-th element of the "prediction vector".

OLS estimation of the Linear Probability Model will be both biased and inconsistent, unless it happens to be the case that 0 ≤ x_iβ ≤ 1, for every i.

How likely is that? In addition, notice that x_iβ isn't observed in practice.

The bottom line, in their words (H & O, p.326):

"Although it is theoretically possible for OLS on the LPM to yield unbiased estimation, this generally would require fortuitous circumstances. Furthermore, consistency seems to be an exceedingly rare occurrence as one would have to accept extraordinary restrictions on the joint distribution of the regressors. Therefore, OLS is frequently a biased estimator and almost always an inconsistent estimator of the LPM."

Perhaps a small Monte Carlo simulation experiment will be helpful?

Here's what I've done. I generated binary data for my dependent variable, y, as follows:

y_i* = β1 + β2 x_i + ε_i     ;     ε_i ~ iid N[0 , 1]

y_i = 1  ;  if y_i* > 0
y_i = 0  ;  if y_i* ≤ 0     ;     i = 1, 2, ..., n
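
For readers who would like to experiment, this data-generating process can be sketched in a few lines of Python. This is only an illustration, not the EViews code behind the results in this post. The values β1 = 0.5 and β2 = 2, the N(0, 1) distribution for x, and the function name simulate_probit_data are all assumptions made for the example; the post does not report the true parameter values or how x was generated.

```python
import random

def simulate_probit_data(n, beta1, beta2, rng):
    """Draw n (x, y) pairs from the latent-variable Probit DGP."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)    # ASSUMPTION: x ~ N(0, 1), not stated in the post
        eps = rng.gauss(0.0, 1.0)  # eps ~ iid N(0, 1), as in the DGP above
        y_star = beta1 + beta2 * x + eps
        xs.append(x)
        ys.append(1 if y_star > 0 else 0)
    return xs, ys

rng = random.Random(12345)
# ASSUMPTION: beta1 = 0.5 and beta2 = 2.0 are illustrative values only.
x, y = simulate_probit_data(1000, beta1=0.5, beta2=2.0, rng=rng)
print("share of ones in y:", sum(y) / len(y))
```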

Then, I estimated both a Probit model and an LPM, using the y and x data. In each case I computed the estimated marginal effect for x. (In the case of the LPM, this is just the OLS estimate of the coefficient of x.) Of course, as I know the true values of β1 and β2 in the data-generating process, I also know the true value of the marginal effect. For the Probit model I calculated the marginal effect at the sample mean of x. An alternative would have been to have averaged the partial effects, computed at each sample point. This is something I'll write a post about at a later date.
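
The two ways of summarizing the Probit marginal effect mentioned above can be written down compactly. The sketch below is illustrative Python, not the EViews code used for this post, and the function names are hypothetical. It uses the standard result that, for a Probit model, the derivative of P(y = 1 | x) with respect to x is φ(β1 + β2 x)β2, where φ is the standard Normal density.

```python
import math

def norm_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def probit_me_at_mean(b1, b2, xs):
    """Marginal effect of x, evaluated at the sample mean of x."""
    xbar = sum(xs) / len(xs)
    return norm_pdf(b1 + b2 * xbar) * b2

def probit_average_me(b1, b2, xs):
    """Average of the marginal effects computed at each sample point."""
    return sum(norm_pdf(b1 + b2 * x) * b2 for x in xs) / len(xs)

# Toy inputs, chosen only to show that the two summaries can differ a lot.
xs = [-1.0, 1.0]
print("at the mean:", round(probit_me_at_mean(0.0, 2.0, xs), 4))
print("averaged   :", round(probit_average_me(0.0, 2.0, xs), 4))
```

With a symmetric, spread-out x like this toy one, the effect at the mean is evaluated where the density peaks, so it is much larger than the average over the sample points.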

I've focused on marginal effects because they are more interesting than the parameters themselves in a Logit or Probit model. In addition, unlike the parameter estimates, they can be compared meaningfully between the LPM and Logit or Probit models.

For a fixed sample size, n, I replicated this 5,000 times. Various sample sizes were considered.

The EViews workfile and program file are on the code page for this blog. (I'm sorry that I didn't do this one in R! However, a text version of the EViews program file is also there, for ease of viewing.)
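
The EViews program itself is not reproduced here. As a rough stand-in, the following Python sketch runs a scaled-down experiment of the same kind, using only the standard library. Everything specific in it is an assumption made for illustration: β1 = 0.5, β2 = 2, x drawn from N(0, 1), n = 500, and only 200 replications rather than 5,000. The Probit model is fitted by a hand-written Fisher scoring routine (probit_mle, a hypothetical helper), so the numbers it prints will not match those reported in this post.

```python
import math
import random

# ASSUMPTIONS (not taken from the post): beta1 = 0.5, beta2 = 2.0, x ~ N(0, 1),
# n = 500 and 200 replications (the post used 5,000) to keep the run short.
BETA1, BETA2 = 0.5, 2.0
SQRT2 = math.sqrt(2.0)

def norm_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ols_slope(xs, ys):
    """OLS slope of y on x (with intercept): the LPM marginal effect."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def probit_mle(xs, ys, max_iter=50, tol=1e-8):
    """Probit MLE (intercept plus one regressor) by Fisher scoring."""
    b1 = b2 = 0.0
    for _ in range(max_iter):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for x, y in zip(xs, ys):
            z = b1 + b2 * x
            p = max(0.5 * math.erfc(-z / SQRT2), 1e-300)  # P(y = 1 | x)
            q = max(0.5 * math.erfc(z / SQRT2), 1e-300)   # P(y = 0 | x)
            f = norm_pdf(z)
            s = f / p if y == 1 else -f / q  # score contribution
            w = f * f / (p * q)              # expected-information weight
            g1 += s
            g2 += s * x
            h11 += w
            h12 += w * x
            h22 += w * x * x
        det = h11 * h22 - h12 * h12
        if det <= 0.0:
            break  # information matrix is singular; give up on this sample
        d1 = (h22 * g1 - h12 * g2) / det
        d2 = (h11 * g2 - h12 * g1) / det
        b1, b2 = b1 + d1, b2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    return b1, b2

def mean(values):
    return sum(values) / len(values)

def monte_carlo(n, reps, seed):
    """Average true, Probit and LPM marginal effects over the replications."""
    rng = random.Random(seed)
    true_me, probit_me, lpm_me = [], [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [1 if BETA1 + BETA2 * x + rng.gauss(0.0, 1.0) > 0 else 0 for x in xs]
        xbar = mean(xs)
        b1, b2 = probit_mle(xs, ys)
        true_me.append(norm_pdf(BETA1 + BETA2 * xbar) * BETA2)  # at sample mean of x
        probit_me.append(norm_pdf(b1 + b2 * xbar) * b2)
        lpm_me.append(ols_slope(xs, ys))
    return mean(true_me), mean(probit_me), mean(lpm_me)

true_avg, probit_avg, lpm_avg = monte_carlo(n=500, reps=200, seed=2012)
print(f"true ME at mean of x : {true_avg:.4f}")
print(f"Probit (MLE) average : {probit_avg:.4f}")
print(f"LPM (OLS) average    : {lpm_avg:.4f}")
```

Under these assumed settings the Probit average should sit close to the true marginal effect at the mean of x, while the LPM average should fall well below it. The exact figures depend on the seed and on the assumed design.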

Here are some summary results (Probit followed by LPM), first for n = 100:

Comparing the means of the sampling distributions (0.6115 and 0.3197) for the estimated marginal effects with the true value of 0.6051, you can see that the LPM (OLS) result is biased downwards, whereas the Probit result has virtually no bias. 

Remember, this is for a sample size of n = 100. Now let's increase the sample size to n = 250:

Notice that the mean of the sampling distribution for the Probit-estimated marginal effect (0.5228) is very close to the true value of 0.5263. In contrast, the OLS/LPM estimator of the marginal effect is biased downwards (being 0.3241, on average).

Now, let's explore the asymptotics a little by considering n = 5,000, and n = 10,000:


We can see that the marginal effect implied by OLS estimation of the LPM is converging to a value of approximately 0.32, as n increases. The standard deviation of the sampling distribution for that estimator is getting smaller and smaller. The inconsistency of the estimator is being driven by asymptotic bias.

On the other hand, the marginal effect implied by maximum likelihood estimation of the Probit model is converging to the true marginal effect - even though the latter value changes with n, because it is evaluated at the sample mean of x, which differs from one sample size to the next. In fact, this convergence is quite rapid. The (small) bias vanishes, and the standard deviation of the estimator continues to decrease, as the sample size grows.
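
One way to see where the OLS limit comes from is the following textbook argument. It is a sketch under the assumption that the (x_i, y_i) pairs are i.i.d. with 0 < Var(x) < ∞, and it is not taken from the Horrace and Oaxaca paper. Because E[y | x] = Φ(β1 + β2 x) under the Probit data-generating process, the law of iterated expectations gives:

```latex
\operatorname*{plim}_{n \to \infty} \hat{b}_{2}^{\,\mathrm{OLS}}
  = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
  = \frac{\operatorname{Cov}\bigl(x,\ \Phi(\beta_1 + \beta_2 x)\bigr)}{\operatorname{Var}(x)},
\qquad \text{whereas} \qquad
\left. \frac{\partial \Pr(y = 1 \mid x)}{\partial x} \right|_{x = \bar{x}}
  = \phi(\beta_1 + \beta_2 \bar{x})\, \beta_2 .
```

The first expression is a ratio of population moments that depends on the whole distribution of x, while the second is a derivative evaluated at a single point, so there is no general reason for the two to coincide.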

So much for the last of the "reasons" for using the LPM in the bulleted list near the start of this post!

When we look at the proportion of sample values for which the true value of (β1 + β2 x_i) lies in the unit interval, the results for n = 100, 250, 5000, and 10000 are 20%, 15.2%, 19.2%, and 19.1% respectively. These values are a long way from 100%.
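
This check is only possible in a simulation, where β1 and β2 are known. A minimal Python sketch of it follows. The function name share_in_unit_interval, the parameter values β1 = 0.5 and β2 = 2, and the N(0, 1) draws for x are assumptions made for illustration, not the settings behind the percentages quoted above.

```python
import random

def share_in_unit_interval(xs, beta1, beta2):
    """Proportion of observations with 0 <= beta1 + beta2 * x <= 1."""
    inside = sum(1 for x in xs if 0.0 <= beta1 + beta2 * x <= 1.0)
    return inside / len(xs)

# Small hand-checkable case: the index values are -1.5, 0, 0.5, 1 and 2.5.
toy_share = share_in_unit_interval([-1.0, -0.25, 0.0, 0.25, 1.0], 0.5, 2.0)
print("toy share:", toy_share)

# ASSUMPTION: x ~ N(0, 1) with beta1 = 0.5 and beta2 = 2.0 (illustrative only).
rng = random.Random(2012)
xs = [rng.gauss(0.0, 1.0) for _ in range(10000)]
sim_share = share_in_unit_interval(xs, 0.5, 2.0)
print("simulated share:", round(sim_share, 3))
```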

So, recalling Horrace and Oaxaca's results, that's why OLS estimation of the Linear Probability Model is both biased and inconsistent here. The maximum likelihood estimator is consistent, and its bias is quite small in finite samples, in this experiment.

Yes, there are some situations where I'd consider using an LPM. For example, with a complex panel data-set, or when there are endogenous dummy covariates. However, in the typical case of cross-section data, and no other complications, you'll have to work pretty hard to convince me!



References

Amemiya, T., 1977. Some theorems in the linear probability model. International Economic Review, 18, 645-650.

Horrace, W. C. & R. L. Oaxaca, 2006. Results on the bias and inconsistency of ordinary least squares for the linear probability model. Economics Letters, 90, 321-327.


© 2012, 2019, David E. Giles

24 comments:

  1. Dave: Can you suggest a reference for the endogenous dummy covariate case?

    1. Brian: See J. D. Angrist, "Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice", Journal of Business & Economic Statistics, 2001, 19, 2-28 (includes discussion & response). Email me directly if you have trouble getting this.

  2. Would you still do this if heteroscedasticity was an issue?

    1. Dimitriy: We'd need to model the het. in the Logit or Probit model, because we know that the MLE for the PARAMETERS in these models is inconsistent if there is het. I think you've already seen my earlier post on this point. I'd still avoid the LPM!

  3. This post has some very useful information about LPM. Thanks for the effort you put into this and your other interesting (and often entertaining!) blog posts.

    I try to do my bit by being as DEMONSTRATIVE as possible in telling my Econ 345 students why they must use probit or logit instead of LPM.

  4. Alan: Thanks! Glad it was helpful.

  5. Well, Dave, you mention that with panel data you would consider using the LPM, but not with cross-section data... but there are many cases where lots of FE are used in cross-section data, and I think it is *really* hard to defend non-linear models in cases such as these. FE account for variation in the data in a completely general way -- the non-linear model relies on the functional form.

    Also, your Monte Carlo example is a bit of a cherry pick; the LPM is the "wrong" model in this case. MLE will be inconsistent if the probit model is wrong, too.

    1. Scott: Thanks for the comments. First one - fair enough.

      Second one - Yes, MLE will also be inconsistent if the Probit DGP is wrong. However, other work I've played around with shows that the asymptotic bias associated with the LPM is often greater than that associated with an incorrect nonlinear model - for example, when the data are generated according to Probit, and we then fit either an LPM or a Logit model.

      This is something I'm currently working on more seriously - see one of my responses at http://davegiles.blogspot.ca/2012/07/more-comments-on-use-of-lpm.html#comment-form

      This certainly deserves proper investigation.

  6. Hi Dave,
    What about when you have many interacted independent variables and a binary dependent variable (for example, interacting many independent variables with a set of dummies)? Given that calculating marginal effects of interactions is complex when there are so many, could you be justified in using LPM in this situation?
    Rose

    1. Rose - I have some sympathy with that, but there's a good literature on how to do things properly, even in that case. For example:

      http://www.unc.edu/~enorton/AiNorton.pdf

      and

      http://www.sciencedirect.com/science/article/pii/S0165176510000777

  7. Dave, if I use a logistic link function but minimize MSE instead of using MLE, do I still have the same problems?

  8. Personally, I'd see more sense in that than just using OLS.

    1. Thanks. A follow-up if I could...

      If I use the logistic link function, then maximizing (1-a)*(1-p) + ap (MLE?) in the binary response case seems to be identical to minimizing absolute error. Is this true?

  9. How did your estimates of B2 turn out?

  10. Try simulating a model in which there is a true "absolute treatment effect", e.g.

    y = a0 + a1*x + e in which e ~ bernoulli

    Then run LPM and logit.

    Better yet add a covariate to the equation above (e.g. a2*w) and show how logit will suggest that "a1" varies with w when it actually doesn't.

  11. Very useful posting since I am now dealing with a binary dependent variable. I am still learning from your other post about robust standard errors for Probit and Logit. Very helpful!! Tony

    1. Great posting! I have a quick question for you. What if there is no a priori reason for preferring the probit model (e.g., we are not doing a simulation and not knowing it to be the true model)..how can we choose between the probit model and the logit model?

    2. I'd discriminate between the two using one of the available information criteria. A useful paper on this is: G. Chen & H. Tsurumi, "Probit and Logit Model Selection", Communications in Statistics - Theory & Methods, 2010, 40, 159-175. Here's the abstract:

      Abstract:
      "Monte Carlo experiments are conducted to compare the Bayesian and sample theory model selection criteria in choosing the univariate probit and logit models. We use five criteria: the deviance information criterion (DIC), predictive deviance information criterion (PDIC), Akaike information criterion (AIC), weighted, and unweighted sums of squared errors. The first two criteria are Bayesian while the others are sample theory criteria. The results show that if data are balanced none of the model selection criteria considered in this article can distinguish the probit and logit models. If data are unbalanced and the sample size is large the DIC and AIC choose the correct models better than the other criteria. We show that if unbalanced binary data are generated by a leptokurtic distribution the logit model is preferred over the probit model. The probit model is preferred if unbalanced data are generated by a platykurtic distribution. We apply the model selection criteria to the probit and logit models that link the ups and downs of the returns on S&P500 to the crude oil price."

  12. This comment has been removed by a blog administrator.

  13. Dear Dave,

    I'd like to jump in here, even though this is an old thread.
    In particular, I'd like to ask for two clarifications on the EViews code you used for the Monte Carlo analysis. I might be wrong or missing something since I don't know the EViews syntax very well, but it seems to me that the marginal effect of x (at means) in a probit model should be

    @dnorm(c(1)+c(2)*@mean(x))*c(2)

    instead of

    @cnorm(c(1)+c(2)*@mean(x))*(1-@cnorm(c(1)+c(2)*@mean(x)))*c(2)

    Am I wrong?
    Second: Why did you consider the marginal effect at the mean instead of the average marginal effect? How is the regressor x generated? I wasn't able to find it looking at the code.

    Many thanks,
    Federico

    1. Federico - you are right! How silly of me. The structure I used would have been correct if it had been the Logit model, and I'd used the cumulative logistic instead of the cumulative Normal. I'll have to fix this at some stage, even though this post is ancient history. Second, it's moot as to whether one reports the marginal effect at the mean, or the average of the marginal effects. Of course, you do get different answers. Finally, the X variable was just artificially generated - it's in the EViews workfile already and wasn't generated in the program. DG

    2. Federico - I have now amended the EViews code and updated the blog post. Again - much appreciated.

    3. Dave: thank you for your prompt reply and feedback.

  14. There is no such thing as the "true marginal effect", because the marginal effect depends on x. It is the researcher who chooses to calculate it for the mean x, but why should we care about the derivative at this particular point? This is just one possible x, no better and no worse than any other. LPM (OLS) gives you a weighted average of marginal effects at different values of x. Of course it will be a different number! (And even more so if the distribution of x is ugly.) This is like an apples-to-oranges comparison. But this is not a problem of OLS per se, it is a problem of the choice of mean x as the point where the marginal effect was calculated. A somewhat fairer test could be to at least calculate the average partial effect of MLE and then compare it to OLS - these are the two competing ways of aggregating marginal effects into one single parameter of interest. OLS can be seen as a more convenient one, especially since MLE (and hence average partial effects) relies on untestable distributional assumptions to identify the parameter of interest: here you even simulated a normal epsilon and selectively picked a model that assumes a normal epsilon, but in real data we would never know...
