Wednesday, May 8, 2013

Robust Standard Errors for Nonlinear Models

André Richter wrote to me from Germany, commenting on the reporting of robust standard errors in the context of nonlinear models such as Logit and Probit. He said he'd been led to believe that this doesn't make much sense. I told him that I agree, and that this is another of my "pet peeves"!

Yes, I do get grumpy about some of the things I see so-called "applied econometricians" doing all of the time. For instance, see my Gripe of the Day post back in 2011. Sometimes I feel as if I could produce a post with that title almost every day!

Anyway, let's get back to André's point.

The following facts are widely known (e.g., check any recent edition of Greene's text) and it's hard to believe that anyone could get through a grad. level course in econometrics and not be aware of them:
  • In the case of a linear regression model, heteroskedastic errors render the OLS estimator, b, of the coefficient vector, β, inefficient. However, this estimator is still unbiased and weakly consistent. 
  • In this same linear model, and still using OLS, the usual estimator of the covariance matrix of b is an inconsistent estimator of the true covariance matrix of b. Consequently, if the standard errors of the elements of b are computed in the usual way, they will be inconsistent estimators of the true standard deviations of the elements of b.
  • For this reason, we often use White's "heteroskedasticity consistent" estimator for the covariance matrix of b, if the presence of heteroskedastic errors is suspected.
  • This covariance estimator is still consistent, even if the errors are actually homoskedastic.
  • In the case of the linear regression model, this makes sense. Whether the errors are homoskedastic or heteroskedastic, both the OLS coefficient estimators and White's standard errors are consistent.
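
The linear-model facts above can be illustrated with a small simulation. This is only a sketch under assumed settings: the data-generating process, the sample size, the seed, and the function name `ols_with_white` are illustrative choices and do not come from the sources cited here. It computes the OLS estimate together with the conventional standard errors and White's (HC0) standard errors when the error standard deviation depends on the regressor.

```python
import numpy as np

def ols_with_white(X, y):
    """Return OLS coefficients, conventional SEs, and White (HC0) SEs."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    # Conventional estimator s^2 (X'X)^{-1}: relies on homoskedasticity.
    s2 = e @ e / (n - k)
    se_conv = np.sqrt(np.diag(s2 * XtX_inv))
    # White's HC0 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}.
    meat = (X * e[:, None] ** 2).T @ X
    se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return b, se_conv, se_white

rng = np.random.default_rng(0)            # illustrative seed
n = 20000                                 # illustrative sample size
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])          # assumed "true" coefficients
# Error standard deviation grows with |x|: heteroskedastic by construction.
u = rng.normal(size=n) * (0.5 + 1.5 * np.abs(x))
y = X @ beta_true + u

b, se_conv, se_white = ols_with_white(X, y)
print("OLS estimate:    ", b)
print("conventional SEs:", se_conv)
print("White (HC0) SEs: ", se_white)
```

In a design like this the slope estimate should sit close to its true value, while the conventional and White standard errors for the slope disagree noticeably.
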
However, in the case of a model that is nonlinear in the parameters:
  • The MLE of the parameter vector is biased and inconsistent if the errors are heteroskedastic (unless the likelihood function is modified to correctly take into account the precise form of heteroskedasticity).
  • This stands in stark contrast to the situation above, for the linear model.
  • The MLE of the asymptotic covariance matrix of the MLE of the parameter vector is also inconsistent, as in the case of the linear model.
  • Obvious examples of this are Logit and Probit models, which are nonlinear in the parameters, and are usually estimated by MLE.
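
A companion sketch for the nonlinear case, again under assumed settings. The latent-variable design, the coefficient values, the seed, and the helper names `fit_logit` and `simulate` are all hypothetical. The latent error is logistic, and in the heteroskedastic design its scale is three times larger when the dummy regressor z equals 1. An ordinary Logit model is then fitted by Newton-Raphson, ignoring the heteroskedasticity.

```python
import numpy as np

BETA_TRUE = np.array([0.5, 1.0, 0.5])  # assumed intercept, slope on x, slope on z

def fit_logit(X, y, iters=25):
    """Ordinary Logit MLE by Newton-Raphson (homoskedastic latent error assumed)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        grad = X.T @ (y - p)                          # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # minus the Hessian
        b = b + np.linalg.solve(hess, grad)
    return b

def simulate(n, hetero, rng):
    """Draw (X, y) from a latent-variable Logit design."""
    x = rng.normal(size=n)
    z = rng.integers(0, 2, size=n).astype(float)
    X = np.column_stack([np.ones(n), x, z])
    # Under `hetero`, the latent error's scale is 3 when z = 1 and 1 otherwise.
    scale = np.where(z == 1.0, 3.0, 1.0) if hetero else np.ones(n)
    eps = rng.logistic(size=n) * scale
    y = (X @ BETA_TRUE + eps > 0).astype(float)
    return X, y

rng = np.random.default_rng(1)   # illustrative seed
n = 20000                        # illustrative sample size
b_hom = fit_logit(*simulate(n, False, rng))
b_het = fit_logit(*simulate(n, True, rng))
print("homoskedastic design:  ", b_hom)
print("heteroskedastic design:", b_het)
```

With a sample this large, sampling noise is small, so a slope on x that stays well away from its true value in the heteroskedastic design reflects inconsistency rather than bad luck.
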
I've made this point in at least one previous post. The results relating to nonlinear models are really well-known, and this is why it's extremely important to test for model mis-specification (such as heteroskedasticity) when estimating models such as Logit, Probit, Tobit, etc. Then, if need be, the model can be modified to take the heteroskedasticity into account before we estimate the parameters. For more information on such tests, and the associated references, see this page on my professional website.

Unfortunately, it's unusual to see "applied econometricians" pay any attention to this! They tend to just do one of two things. They either
  1. use Logit or Probit, but report the "heteroskedasticity-consistent" standard errors that their favourite econometrics package conveniently (but misleadingly) computes for them. This involves a covariance estimator along the lines of White's "sandwich estimator". Or, they
  2. estimate a "linear probability model" (i.e., just use OLS, even though the dependent variable is a binary dummy variable), and report the "het.-consistent standard errors".
If they follow approach 2, these folks defend themselves by saying that "you get essentially the same estimated marginal effects if you use OLS as opposed to Probit or Logit." I've said my piece about this attitude previously (here, here, here, and here), and I won't go over it again here.

My concern right now is with approach 1 above.

The "robust" standard errors are being reported to cover the possibility that the model's errors may be heteroskedastic. But if that's the case, the parameter estimates are inconsistent. What use is a consistent standard error when the point estimate is inconsistent? Not much!!
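
For concreteness, here is the generic form of the objects involved, written in standard quasi-ML notation. The symbols are chosen for illustration rather than taken from any one of the sources cited in this post: ℓ_i is observation i's log-likelihood contribution under the assumed model, θ_0 is the parameter of interest, and θ* is whatever the estimator converges to.

```latex
\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \ell_i(\theta), \qquad
g_i = \frac{\partial \ell_i(\hat{\theta})}{\partial \theta}, \qquad
\hat{A} = -\sum_{i=1}^{n} \frac{\partial^2 \ell_i(\hat{\theta})}{\partial \theta \, \partial \theta'}, \qquad
\hat{B} = \sum_{i=1}^{n} g_i g_i'

\widehat{\mathrm{Var}}(\hat{\theta}) = \hat{A}^{-1} \hat{B} \hat{A}^{-1}

\hat{\theta} \xrightarrow{p} \theta^{*}, \qquad
\theta^{*} \neq \theta_0 \ \text{in general when the assumed likelihood is wrong}
```

The sandwich formula describes the sampling variability of the estimator around θ*. It says nothing about the gap between θ* and θ_0.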

This point is laid out pretty clearly in Greene (2012, pp. 692-693), for example. Here's what he has to say:
"...the probit (Q-) maximum likelihood estimator is not consistent in the presence of any form of heteroscedasticity, unmeasured heterogeneity, omitted variables (even if they are orthogonal to the included ones), nonlinearity of the form of the index, or an error in the distributional assumption [with some narrow exceptions as described by Ruud (198)]. Thus, in almost any case, the sandwich estimator provides an appropriate asymptotic covariance matrix for an estimator that is biased in an unknown direction." (My underlining; DG.) "White raises this issue explicitly, although it seems to receive very little attention in the literature." ... "His very useful result is that if the QMLE converges to a probability limit, then the sandwich estimator can, under certain circumstances, be used to estimate the asymptotic covariance matrix of that estimator. But there is no guarantee that the QMLE will converge to anything interesting or useful. Simply computing a robust covariance matrix for an otherwise inconsistent estimator does not give it redemption. Consequently, the virtue of a robust covariance matrix in this setting is unclear."
Back in July 2006, on the R Help feed, Robert Duval had this to say:
"This discussion leads to another point which is more subtle, but more important...
You can always get Huber-White (a.k.a robust) estimators of the standard errors even in non-linear models like the logistic regression. However, if you believe your errors do not satisfy the standard assumptions of the model, then you should not be running that model as this might lead to biased parameter estimates.
For instance, in the linear regression model you have consistent parameter estimates independently of whether the errors are heteroskedastic or not. However, in the case of non-linear models it is usually the case that heteroskedasticity will lead to biased parameter estimates (unless you fix it explicitly somehow).
Stata is famous for providing Huber-White std. errors in most of their regression estimates, whether linear or non-linear. But this is nonsensical in the non-linear models since in these cases you would be consistently estimating the standard errors of inconsistent parameters.
This point and potential solutions to this problem is nicely discussed in Wooldridge's Econometric Analysis of Cross Section and Panel Data."
Amen to that!

Regrettably, it's not just Stata that encourages questionable practices in this respect. These same options are also available in EViews, for example.


Greene, W. H., 2012. Econometric Analysis. Prentice Hall, Upper Saddle River, NJ.

© 2013, David E. Giles


  1. This post focuses on how the MLE estimator for probit/logit models is biased in the presence of heteroskedasticity. Assume you know there is heteroskedasticity, what is the best approach to estimating the model if you know how the variance changes over time (is there a GLS version of probit/logit)? Is this also true for autocorrelation?

    1. John - absolutely - you just need to modify the form of the likelihood function to accommodate the particular form of het. and/or autocorrelation.

    2. Yes, Stata has a built-in command, hetprob, that allows for specification of the error variances as exp(w*d), where w is the vector of variables assumed to affect the variance.

    3. Stata has a downloadable command, oglm, for modelling the error variance in ordered multinomial models.
      In the R environment there is the glmx package for the binary case and oglmx for ordered multinomial.

  2. In characterizing White's theoretical results on QMLE, Greene is of course right that "there is no guarantee that the QMLE will converge to anything interesting or useful [note that the operative point here isn't the question of convergence, but rather the interestingness/usefulness of the converged-to object]."

    But it is not crazy to think that the QMLE will converge to something like a weighted average of observation-specific coefficients (how crazy it is surely depends on the degree of mis-specification--suppose there is epsilon deviation from a correctly specified probit model, for example, in which case the QMLE would be so close to the MLE that sample variation would necessarily dominate mis-specification in any real-world empirical application). It would be a good thing for people to be more aware of the contingent nature of these approaches.

    If, whenever you use the probit/logit/whatever-MLE, you believe that your model is perfectly correctly specified, and you are right in believing that, then I think your purism is defensible. If that's the case, then you should be sure to use every model specification test that has power in your context (do you do that? does anyone?).

    1. Jonah - thanks for the thoughtful comment. Regarding your last point - I find it amazing that so many people DON'T use specification tests very much in this context, especially given the fact that there is a large and well-established literature on this topic. That's the reason that I made the code available on my website. I'll repeat that link, not just for the code, but also for the references:

  3. Dear David, would you please add the links to your blog when you discuss the linear probability model. You said "I've said my piece about this attitude previously (here and here), and I won't go over it again here."

    But on here and here you forgot to add the links.

    Thanks for that

    1. Jorge - whoops! My apologies. And by way of recompense I've put 4 links instead of 2. :-)

    2. Wow, really good reward! That is info you don't usually get in your metrics class. Best regards

  4. Dave -- there's a section in Deaton's Analysis of Household Surveys on this that has always confused me. He discusses the issue you raise in this post (his p. 85) and then goes on to say the following (pp. 85-86):

    "The point of the previous paragraph is so obvious and so well understood that
    it is hardly of practical importance; the confounding of heteroskedasticity and "structure" is unlikely to lead to problems of interpretation. It is standard procedure in estimating dichotomous models to set the variance in (2.38) to be unity,
    and since it is clear that all that can be estimated is the effects of the covariates on the probability, it will usually be of no importance whether the mechanism works through the mean or the variance of the latent "regression" (2.38). While it is
    correct to say that probit or logit is inconsistent under heteroskedasticity, the
    inconsistency would only be a problem if the parameters of the function f were
    the parameters of interest. These parameters are identified only by the homoskedasticity assumption, so that the inconsistency result is both trivial and obvious."

    I understand why we normalise the variance to 1, but I've never really understood Deaton's point as to why this makes the inconsistency result under heteroskedasticity "trivial" (he then states the same issue is more serious in, for instance, a tobit model). Do you perhaps have a view? (You can find the book here, in case you don't have a copy:

    Thanks for your blog posts, I learn a lot from them and they're useful for teaching as well.

  5. Two comments. First, while I have no stake in Stata, they have very smart econometricians there. I would not characterize them as "encouraging" any practice. They provide estimators and it is incumbent upon the user to make sure what he/she applies makes sense.

    Second, there is one situation I am aware of (albeit not an expert) where robust standard errors seem to be called for after probit/logit and that is in the context of panel data. Wooldridge discusses in his text the use of a "pooled" probit/logit model when one believes one has correctly specified the marginal probability of y_it, but the likelihood is not the product of the marginals due to a lack of independence over time. Here, I believe he advocates a partial MLE procedure using a pooled probit model, but using robust standard errors.

  6. I've said my piece about this attitude previously (here and here)

    You bolded, but did not put any links in this line. As it stands, it appears that you have not previously expressed yourself about this attitude. If you indeed have, please correct this so I can easily find what you've said.


    1. Marcel - thank you. Please see above. DG

  7. DLM - thanks for the good comments. You'll notice that the word "encouraging" was a quote, and that I also expressed the same reservation about EViews. I do worry a lot about the fact that there are many practitioners out there who treat these packages as "black boxes". It's hard to stop that, of course. Regarding your second point - yes, I agree. Thanks!

    1. In line with DLM, Stata has long had a FAQ on this:

      but I agree that people often use them without thinking. I have students read that FAQ when I teach this material.

    2. Dear David,

      I came across your post while looking for an answer to the question of whether the robust standard errors that Wooldridge suggests in 13.8.2 are correct without assuming strict exogeneity.

      To be more precise, is it sufficient to assume that:

      (1) D(y_it|x_it) is correctly specified and
      (2) E(x_it|e_it)=0 (contemporaneous exogeneity)

      in the case of pooled Probit, for 13.53 (in Wooldridge p. 492) to be applicable?


    3. Thanks for the reply!

      Are the same assumptions sufficient for inference with clustered standard errors?

    4. Thanks for the prompt reply!

  8. David,

    I do trust you are getting some new readers downunder and this week I have spelled your name correctly!!

  9. Great post! Grad student here. I like to consider myself one of those "applied econometricians" in training, and I had not considered this. Thank you, thank you, thank you. So obvious, so simple, so completely overlooked. The likelihood function depends on the CDFs, which are parameterized by the variance. An incorrect assumption about the variance leads to the wrong CDFs, and the wrong likelihood function. Hence, a potentially inconsistent estimator. How is this not a canonized part of every first year curriculum?!

  10. I'm confused by the very notion of "heteroskedasticity" in a logit model.

    The model I have in mind is one where the outcome Y is binary, and we are using the logit function to model the conditional mean: E(Y(t)|X(t)) = Lambda(beta*X(t)). We can rewrite this model as Y(t) = Lambda(beta*X(t)) + epsilon(t). But then epsilon is a centered Bernoulli variable with a known variance.

    Of course the assumption about the variance will be wrong if the conditional mean is misspecified, but in this case you need to define what exactly you even mean by the estimator of beta being "consistent."

    What am I missing here?

    1. You could still have heteroskedasticity in the equation for the underlying LATENT variable. This is discussed, for example in the Davidson-MacKinnon paper on testing for het. in such models, in their book (pp. 526-527), and in various papers cited here:
      I hope this helps.
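
      In symbols, and taking the Probit case with a normal latent error purely for illustration (σ_i is notation introduced here for the latent error's standard deviation):

```latex
y_i^{*} = x_i'\beta + \varepsilon_i, \qquad
y_i = \mathbf{1}\{ y_i^{*} > 0 \}, \qquad
\varepsilon_i \mid x_i \sim N(0, \sigma_i^{2})

\Pr(y_i = 1 \mid x_i)
  = \Pr(\varepsilon_i > -x_i'\beta \mid x_i)
  = \Phi\!\left( \frac{x_i'\beta}{\sigma_i} \right)
```

      The standard model imposes σ_i = 1 for every i. If σ_i actually varies with x_i, the response probability is no longer Φ(x_i'β), so heteroskedasticity in the latent equation shows up as a misspecified conditional mean for the observed binary outcome.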

    2. Ah yes, I see, thanks.

      My view is that the vast majority of people who fit logit/probit models are not interested in the latent variable, and/or the latent variable is not even well defined outside of the model. They are generally interested in the conditional mean for the binary outcome variable. When I teach students, I emphasize the conditional mean interpretation as the main one, and only mention the latent variable interpretation as of secondary importance. I think the latent variable model can just confuse people, leading to the kind of conceptual mistake described in your post.

      I'll admit, though, that there are some circumstances where a latent variable logit model with heteroskedasticity might be interesting, and I now recall that I've even fitted such a model myself.

    3. I have some questions following this line:

      1. One motivation of the Probit/Logit model is to give the functional form for Pr(y=1|X), and the variance does not even enter the likelihood function, so how does it affect the point estimator in terms of intuition?

      2. Any evidence that this bias is large, if our focus is on sign of the coefficient or sometimes the marginal effect?

      3. Is non-linear least squares, i.e. minimizing sum (y_i - Phi(x_i'b))^2, combined with robust standard errors, robust to the presence of heteroscedasticity?

      Thanks a lot!

    4. 1. I have put together a new post for you at

      2. Yes it can be - it will depend, not surprisingly on the extent and form of the het.

      3. Yes, it usually is.

  11. Dave, thanks for this very good post! I have been looking for a discussion of this for quite some time, but I could not find clear and concisely outlined arguments as you provide them here. Thanks a lot for that, even though it is a bit disheartening that so many applied econometricians should be wrong...

    However, please let me ask two follow up questions:

    First: in one of your related posts you mention that looking at both robust and homoskedastic standard errors could be used as a crude rule of thumb to evaluate the appropriateness of the likelihood function. That is, when they differ, something is wrong. This simple comparison has also recently been suggested by Gary King (1). If I understood you correctly, then you are very critical of this approach. Do you have an opinion of how crude this approach is? Do you have any guess how big the error would be based on this approach? It is obvious that in the presence of heteroskedasticity, neither the robust nor the homoskedastic variances are consistent for the "true" one, implying that they could be relatively similar due to pure chance, but is this likely to happen?

    Second: In a paper by Papke and Wooldridge (2) on fractional response models, which are very much like binary choice models, they propose an estimator based on the wrong likelihood function, together with robust standard errors to get rid of heteroskedasticity problems. Their argument that their estimation procedure yields consistent results relies on quasi-ML theory. While I have never really seen a discussion of this for the case of binary choice models, I more or less assumed that one could make similar arguments for them. I guess that my presumption was somewhat naive (and my background is far from sufficient to understand the theory behind the quasi-ML approach), but I am wondering why. Is there a fundamental difference that I overlooked?

    Thanks a lot!


  12. You remark "This covariance estimator is still consistent, even if the errors are actually homoskedastic." (meaning, of course, the White heteroskedastic-consistent estimator). What about estimators of the covariance that are consistent with both heteroskedasticity and autocorrelation? Which ones are also consistent with homoskedasticity and no autocorrelation? I'm thinking about the Newey-West estimator and related ones. I would say the HAC estimators I've seen in the literature are not but would like to get your opinion.

    I've read Greene and googled around for an answer to this question. The paper "Econometric Computing with HC and HAC Covariance Matrix Estimators" from JSS is a very useful summary but doesn't answer the question either. I've also read a few of your blog posts such as

    The King et al paper is very interesting and a useful check on simply accepting the output of a statistics package. However, we live with real data which was not collected with our models in mind. The data collection process distorts the data reported. Dealing with this is a judgement call, but accepting a model with problems is sometimes better than throwing up your hands and complaining about the data.

    Please keep these posts coming. They are very helpful and illuminating. Thanks.

  13. Dear Professor Giles,

    thanks a lot for this informative post. I think it is very important, so let me try to rephrase it to check whether I got it right: The main difference here is that OLS coefficients are unbiased and consistent even with heteroscedasticity present, while this is not necessarily the case for any ML estimates, right? And, yes, if my parameter coefficients are already false why would I be interested in their standard errors. My conclusion would be that - since heteroskedasticity is the rule rather than the exception and with ML mostly being QML - the use of the sandwich estimator is only sensible with OLS when I use real data. Am I right here?
    Best wishes,

  14. Dear Professor Giles,
    Could you please clear up the confusion in my mind: you state that the problem is for "the case of a model that is nonlinear in the parameters" but then you also state that "obvious examples of this are Logit and Probit models". But Logit and Probit are linear in parameters; they belong to a class of generalized linear models.
    Thank you

    1. Think about the estimation of these models (and, for example, count data models such as Poisson and NegBin, which are also examples of generalized LM's). The likelihood equations (i.e., the 1st-order conditions that have to be solved to get the MLE's) are non-linear in the parameters. This stands in contrast to (say) OLS (= MLE if the errors are Normal).
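
      To illustrate with the Logit case (Λ denotes the logistic CDF; the notation is introduced here), compare the two sets of first-order conditions:

```latex
\text{Logit:} \quad
\sum_{i=1}^{n} \left[ y_i - \Lambda(x_i'\hat{\beta}) \right] x_i = 0,
\qquad \Lambda(t) = \frac{e^{t}}{1 + e^{t}}

\text{OLS:} \quad
\sum_{i=1}^{n} \left( y_i - x_i' b \right) x_i = 0
\;\; \Longrightarrow \;\;
b = (X'X)^{-1} X'y
```

      The OLS equations are linear in b and have a closed-form solution. The Logit equations involve the parameters through Λ(x_i'β), so they are non-linear in the parameters and have to be solved iteratively.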