## Friday, July 22, 2011

### On the Importance of Going Global

Since being proposed by Sir Ronald Fisher in a series of papers during the period 1912 to 1934 (Aldrich, 1997), Maximum Likelihood Estimation (MLE) has been one of the "workhorses" of statistical inference, and so it plays a central role in econometrics. It's not the only game in town, of course, especially if you're of the Bayesian persuasion, but even then the likelihood function (in the guise of the joint data density) is a key ingredient in the overall recipe.

MLE provides one of the core "unifying themes" in econometric theory and practice. Many of the particular estimators that we use are just applications of MLE, and many of the tests we use are simply special cases of the core tests that go with MLE - the likelihood ratio, score (Lagrange multiplier), and Wald tests.

The statistical properties that make MLE (and the associated tests) appealing are mostly "asymptotic" in nature. That is, they hold if the sample size becomes infinitely large. There are no guarantees, in general, that MLEs will have "good" properties if the sample size is small. It will depend on the problem in question. So, for example, in some cases MLEs are unbiased, but in others they are not.

More specifically, you'll be aware that (in general) MLEs have the following desirable large-sample properties - they are:
• (At least) weakly consistent.
• Asymptotically efficient.
• Asymptotically normally distributed.

Just what does "in general" mean here? ..........

Well, first of all, the proofs of these three core results rest on a set of conditions being satisfied. In statistics, econometrics, and life in general, you rarely get something for nothing! These conditions, in this case, are usually referred to as the "regularity conditions", and although they're generally attributed to Cramér (1946), they actually date back to Dugué (1937). I discuss this briefly in an earlier post, here.

In short, the regularity conditions require the existence of the first three derivatives of the logarithm of the density function of the random data, and impose certain constraints on these derivatives. For the most part, in econometric applications, the regularity conditions are pretty innocuous. But not always!

The second thing that we have to watch out for is the fact that the desirable asymptotic properties of MLEs won't hold if the estimation problem in question has any one of the following characteristics:
• The number of parameters to be estimated increases, as the sample size grows, at a rate at least as fast as the sample is growing.
• The maximum of the likelihood function occurs exactly on the boundary of the parameter space.
• There's a singularity in the likelihood surface.

These might sound like strange situations. However, there are some good examples of each of them that arise in econometrics, so they're well worth knowing about.

By way of some quick examples, the first situation arises with the classic "errors-in-variables" model. The second arises with stochastic frontier production models; and the third arises in mixtures of normals models, such as models of markets in disequilibrium, and other "switching" models. There are plenty of other "problematic" examples that are of interest to econometricians too, so be really careful!

The third thing that we have to be careful about is that when we maximize the (log-) likelihood function, we need to locate the global maximum, and not just some local maximum. Now, this isn't just some nicety to do with the calculus of the problem. It can actually be really, really important as far as the statistical side of the problem is concerned. Those three large-sample properties of MLEs won't necessarily hold if we treat some local maximum of the likelihood function as if it's the MLE.

It's this last point that I want to elaborate on here.

Let's see if we can summarize the situation, while throwing in a bit of relevant history at the same time. To begin with, recall that when we set the derivative (vector) of the log-likelihood function with respect to the parameter (vector) equal to zero, we call this the "likelihood equation" (LE). We then solve this equation for the parameter value(s) corresponding to a maximum of the original log-likelihood function (LLF).
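In symbols, and assuming (purely for the sake of the illustration) a sample of n independent observations with density f, the LLF and the LE can be written as follows. The notation is mine, and isn't taken from any of the papers cited in this post.

$$
\ell(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta), \qquad \frac{\partial \ell(\theta)}{\partial \theta} = 0 .
$$

Any solution to the second expression is a "root of the LE". It may correspond to a local maximum, a local minimum, or a saddle point of the LLF.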

That is, we try to find all of the roots of the likelihood equation, and then check to see which root corresponds to the global maximum of the LLF. Much of the statistical theory relates to these roots of the LE. Let's begin with the one-parameter case. The key reference for the asymptotic properties of the MLE when the parameter is a scalar is the paper by Huzurbazar (1948). His main result tells us that there's just one consistent root of the LE. But which root is the consistent one if there are many roots? We don't know.
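To make "finding all of the roots" a bit more concrete, here's a minimal sketch for a one-parameter problem. Everything in it is an assumption made for illustration: the model is a Cauchy location model with the scale fixed at one (a standard example of a likelihood that can have several local maxima), the five data points are made up, and the function names are hypothetical. It scans a grid for sign changes in the score, refines each bracketed root with SciPy's `brentq`, and then picks the root with the highest log-likelihood.

```python
import numpy as np
from scipy.optimize import brentq

# Made-up data: two clusters of observations, chosen so that the
# likelihood equation has more than one root.
x = np.array([-6.0, -5.5, 4.0, 4.5, 5.0])

def loglik(theta):
    """Cauchy location log-likelihood (unit scale), constant term dropped."""
    return -np.sum(np.log1p((x - theta) ** 2))

def score(theta):
    """First derivative of loglik; the LE is score(theta) = 0."""
    u = x - theta
    return np.sum(2.0 * u / (1.0 + u ** 2))

# For this model the score is positive to the left of all the data and
# negative to the right of it, so every root lies inside the data range.
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 2601)
s = np.array([score(t) for t in grid])

# Bracket each sign change of the score, then refine with Brent's method.
roots = np.array([
    brentq(score, a, b)
    for a, b, sa, sb in zip(grid[:-1], grid[1:], s[:-1], s[1:])
    if sa * sb < 0
])

values = np.array([loglik(r) for r in roots])
theta_hat = roots[np.argmax(values)]  # root with the highest log-likelihood

for r, v in zip(roots, values):
    print(f"root = {r: .4f}   log-likelihood = {v: .4f}")
print("root chosen as the MLE:", theta_hat)
```

A grid scan like this is only workable with one (or perhaps two) parameters, and it can miss roots that are closer together than the grid spacing. That's exactly why the multi-parameter case is so much harder.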

O.K. - now what about the multi-parameter case? Wald (1949) proved the (strong) consistency of the MLE, but his proof required that the LLF is in fact globally maximized. (Strictly, this condition isn't necessary if the parameter space is finite, but this is rarely the case.) Relaxing this requirement, and focusing on the LE, Chanda (1954) proved that there exists a consistent root of the latter equation, and that this root is unique.

He also observed that this consistent root yields an estimator that is asymptotically normal and asymptotically efficient. However, not surprisingly, Chanda's proof is achieved only by substituting an alternative assumption in place of Wald's requirement of global maximization. But if there are many roots to the LE, which is the unique consistent one? Again, we don't know for sure.

If you want a definitive discussion of (most of) the statistical content of this, you can't do better than take a look at chapter 6 (especially sections 6.3 and 6.4) of Lehmann (1983).

Barnett (1966) took up the computational issues associated with all of this. His paper has a lot of good, practical, advice. Here's a quote (p.150; his emphasis):

"In any practical problem, however, we are not necessarily concerned with relative orders of convergence, utility of application or asymptotic probabilistic properties of the different methods. Given a single sample of observations x1, x2, ..., xn, of fixed finite size n, from a distribution with parameter θ, we wish to evaluate the M.L.E. of θ for that sample. Regularity conditions and the associated existence of a unique consistent root are no guarantee that a single root of the likelihood equation will exist for this sample. In fact there will often exist multiple roots, corresponding to multiple relative maxima of the likelihood function, even if the regularity conditions are satisfied. The results described above do not consider this effect specifically, either because (as in the case of Kale, 1961) the author is not basically concerned with finite samples, or (as Norton, 1956) particular examples discussed quite fortuitously have a unique root for the likelihood equation.

In general, then, we have a more fundamental problem of whether or not a particular method will actually locate a root of the likelihood equation, rather than how well it will do so in terms of the various criteria described above. Further, we require a method which can be applied systematically to locate all roots of the likelihood equation, enabling us to choose as the M.L.E. that one which corresponds to the absolute maximum of the likelihood function."

(BTW: I'll leave you to follow up on the references to Kale and Norton, if you're interested.)

All of this should get you thinking. As it can often be important to ensure that the likelihood function has been globally maximized, what can we do to check that this has in fact been achieved in practice?

In some cases we know, analytically, that the likelihood function is strictly concave. So, if we locate a local maximum, it must be the global maximum. Econometric examples of this include maximum likelihood estimation of the logit, probit, and Poisson regression models, among many others. In these cases there's not too much to worry about. We just check the gradient vector, and when it's zero we've found the (unique) global maximum of the likelihood function.
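To see why, take the logit model as a worked case. The notation below is mine: y_i is the binary outcome, x_i is the regressor vector, and Λ_i is the logistic c.d.f. evaluated at x_i'β.

$$
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log \Lambda_i + (1 - y_i) \log (1 - \Lambda_i) \right], \qquad \Lambda_i = \frac{1}{1 + \exp(-x_i'\beta)},
$$

$$
\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta'} = - \sum_{i=1}^{n} \Lambda_i (1 - \Lambda_i) \, x_i x_i' .
$$

Each weight Λ_i(1 - Λ_i) is strictly positive, so as long as the regressor matrix has full column rank this Hessian is negative definite for every value of β. The LLF is then strictly concave, and any stationary point is the unique global maximum.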

But, what if the likelihood function is not strictly concave, so there may be several local maxima and minima? The latter are easily ruled out - we just check the definiteness of the Hessian matrix at the local extremum. As for local versus global maxima, the standard response is that we can use a range of different starting values for the maximization algorithm; check that the gradient vector is zero at each "solution" we find; and then look at the corresponding maximized values of the likelihood function to find the "highest" of the maxima.
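Here's a minimal sketch of that "standard response", under assumptions that are entirely my own: a Cauchy location model with unit scale, five made-up observations, an evenly spaced set of starting values, and hypothetical function names. Each run of SciPy's BFGS routine is kept only if the gradient is (numerically) zero and the second derivative of the LLF is negative, and the surviving "solutions" are then ranked by their log-likelihood values.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data with two clusters, so the likelihood is multimodal.
x = np.array([-6.0, -5.5, 4.0, 4.5, 5.0])

def neg_loglik(theta):
    """Negative Cauchy location log-likelihood (unit scale, constant dropped)."""
    return np.sum(np.log1p((x - theta[0]) ** 2))

def neg_score(theta):
    """Gradient of neg_loglik with respect to theta."""
    u = x - theta[0]
    return np.array([-np.sum(2.0 * u / (1.0 + u ** 2))])

def loglik_hessian(t):
    """Second derivative of the log-likelihood (a scalar in this problem)."""
    u = x - t
    return np.sum(2.0 * (u ** 2 - 1.0) / (1.0 + u ** 2) ** 2)

starts = np.linspace(x.min(), x.max(), 12)  # a spread of starting values
candidates = []
for s0 in starts:
    res = minimize(neg_loglik, x0=[s0], jac=neg_score, method="BFGS")
    t = res.x[0]
    gradient_is_zero = abs(neg_score(res.x)[0]) < 1e-4
    is_local_max = loglik_hessian(t) < 0  # rules out minima and flat points
    if gradient_is_zero and is_local_max:
        candidates.append((-res.fun, t))  # (log-likelihood, estimate)

# The "highest" of the local maxima that were found.
best_loglik, theta_hat = max(candidates)
print("distinct local maxima found:", sorted({round(t, 3) for _, t in candidates}))
print("best log-likelihood:", best_loglik, "at theta =", theta_hat)
```

With more than one parameter, the scalar second-derivative check would be replaced by a check that all of the eigenvalues of the Hessian matrix are negative.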

Of course, even if we do this thoroughly, there's no guarantee that we haven't still missed the global maximum. Some other choice of initial values that we haven't considered might lead to this. This is a very real problem when the dimension of the parameter space is relatively large. It's also worth noting that there are some problems where the number of roots of the LE is actually infinite (e.g., Barnett, 1966)!

This raises an interesting question that might not have occurred to you before. Can we perhaps test to see if a particular maximum of the likelihood function that we've located is the global maximum, and not just a local maximum? This seems like a reasonable question to ask. In fact, it's been asked and answered in several published statistical papers.

This being the case, how many times have you seen this testing problem discussed in econometrics textbooks? I can't think of a single occasion (but please correct me if I'm wrong). And yet it's a question that's been addressed by several authors, including Canadian economist Mike Veall (1990). Moreover, the principal statistical papers that discuss it rely heavily on Hal White's (1982) seminal Econometrica paper. It deserves a lot more attention from econometricians than it seems to have received to date.

Here's a quick run-down on some of this, and on what you can do in practice.

There are two basic approaches that are worth noting. The first, based on extreme value theory, was proposed by de Haan (1981) and extended by Veall (1990). It involves constructing a confidence interval for the global maximum, and is based on a grid search method. It can become computationally infeasible if the support of the parameter space is very large, or there are many parameters in the problem. A related methodology is presented by Dorsey and Mayer (2000).

The second basic approach "inverts" the following basic result from White (1982). Suppose we have obtained the global maximum of the likelihood function. If the model is correctly specified, then (at the true parameter values) the expected value of the Hessian of the log-likelihood function equals the negative of the expected value of the outer product of the gradient vector of the LLF.
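Written out for a single observation with density f and true parameter vector θ_0, this "information matrix equality" is as follows. The symbols are my own shorthand for the two expectations, so treat them as assumptions rather than as a quotation from the paper.

$$
E\left[ \frac{\partial^2 \log f(x; \theta_0)}{\partial \theta \, \partial \theta'} \right]
= - E\left[ \frac{\partial \log f(x; \theta_0)}{\partial \theta} \, \frac{\partial \log f(x; \theta_0)}{\partial \theta'} \right] .
$$

In practice, each expectation is replaced by the corresponding sample average over the n observations, evaluated at the estimate of θ.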

So, if we find that the sum of the Hessian and the outer product of the gradient vector (evaluated at the MLE) differs from a null matrix, this suggests that the model is mis-specified. White takes this basic idea and constructs an asymptotically valid test of the hypothesis that the sum is null (i.e., that the model is correctly specified).

As was pointed out by Gan and Jiang (1999), this logic can be reversed. If the model is correctly specified, then if we evaluate the Hessian and outer product of the gradient vector at some maximum of the log-likelihood function, their sum should be zero (asymptotically) if the maximum in question is actually the global maximum.

So, you evaluate the two matrices, convert their sum into a test statistic whose asymptotic distribution is known, and there you are! You have a test for a global maximum.
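To give a feel for the raw ingredient of such a test, here's a small simulation sketch. To be clear, this is not the Gan-Jiang test statistic itself, which also requires standardizing by an estimate of the asymptotic variance. It just computes the sample average of "Hessian plus squared score" for a correctly specified one-parameter model. The model (Cauchy location, unit scale), the sample size, the seed, and the function names are all my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(12345)
x = rng.standard_cauchy(5000)  # data drawn from the assumed model; true theta = 0

def neg_loglik(theta):
    """Negative Cauchy location log-likelihood (unit scale, constant dropped)."""
    return np.sum(np.log1p((x - theta[0]) ** 2))

def score_i(t):
    """Per-observation first derivative of the log-density."""
    u = x - t
    return 2.0 * u / (1.0 + u ** 2)

def hessian_i(t):
    """Per-observation second derivative of the log-density."""
    u = x - t
    return 2.0 * (u ** 2 - 1.0) / (1.0 + u ** 2) ** 2

def im_discrepancy(t):
    """Sample average of (Hessian + squared score), evaluated at t."""
    return np.mean(hessian_i(t) + score_i(t) ** 2)

# Start from the sample median, which is a consistent estimator here.
res = minimize(neg_loglik, x0=[np.median(x)], method="BFGS")
theta_hat = res.x[0]

print("theta_hat:", theta_hat)
print("discrepancy at theta_hat:", im_discrepancy(theta_hat))
# For contrast only: theta_hat + 2 is NOT a local maximum. It just shows how
# the same average behaves at a parameter value far from the truth.
print("discrepancy at theta_hat + 2:", im_discrepancy(theta_hat + 2.0))
```

The average should be close to zero at the consistent root, and noticeably different from zero when it's evaluated well away from the true parameter value. A formal test then has to decide how far from zero is "too far", given the sampling variability.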

Of course, there is a bit of a catch here. White's approach tests for model mis-specification, assuming that you have the global maximum. Gan and Jiang's test is for a global maximum, but it assumes that the model is correctly specified.

Related contributions include those of Biernacki (2005), and Blatt and Hero (2007). The last of these authors go some way towards dealing with the conundrum noted in the last paragraph above.

So, where does this leave us?
• If you're using the Maximum Likelihood Estimator because you are interested in the desirable (large-sample) asymptotic properties that this estimator can have, then you'd better put in some serious effort to ensure that you've globally maximized the likelihood function.
• This may involve some pretty costly "brute-force" computing.
• The large-sample test for a global maximum suggested by Gan and Jiang (1999) is something you should be aware of, and consider using.
• If you're going to apply their test, keep in mind that you have to be really confident about the specification of your model. Otherwise, you won't know if a rejection of the null hypothesis is due to having located just a local maximum, or to a mis-specified model.
• There are plenty of tests that you can use to assess various aspects of the model's specification, and you should be using them in any case.
• On the positive side, a rejection of the null hypothesis when applying the Gan-Jiang test certainly suggests strongly that something's wrong!

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written Reference section is provided.

References

Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12, 162-176.

Barnett, V. D. (1966). Evaluation of the maximum likelihood estimator where the likelihood equation has multiple roots. Biometrika, 53, 151-166.

Biernacki, C. (2005). Testing for a global maximum of the likelihood. Journal of Computational and Graphical Statistics, 14, 657-674.

Blatt, D. and A. O. Hero, III (2007). On tests for global maximum of the log-likelihood function. IEEE Transactions on Information Theory, 53, 2510-2526.

Chanda, K. C. (1954). A note on the consistency and maxima of the roots of likelihood equations. Biometrika, 41, 56-61.

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.

de Haan, L. (1981). Estimation of the minimum of a function using order statistics. Journal of the American Statistical Association, 76, 467-469.

Dorsey, R. E. and W. J. Mayer (2000). Detection of spurious maxima through random draw tests and specification tests. Computational Economics, 16, 237-256.

Dugué, D. (1937). Application des propriétés de la limite au sens du calcul des probabilités à l’étude de diverses questions d’estimation. Journal de l’École Polytechnique, 3, 305-372.

Gan, L. and J. Jiang (1999). A test for global maximum. Journal of the American Statistical Association, 94, 847-854.

Huzurbazar, V. S. (1948). The likelihood equation, consistency, and the maxima of the likelihood function. Annals of Eugenics, London, 14, 185.

Lehmann, E. L. (1983). Theory of Point Estimation. New York: Wiley.

Veall, M. R. (1990). Testing for a global maximum in an econometric context. Econometrica, 58, 1459-1465.

Wald, A. (1949). On the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20, 595-601.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-26.

1. Hi, I recently started reading your blog and my attention has been drawn to your post about the MLE. I wonder if there's a chance you would elaborate more on the 3 characteristics where the MLE doesn't hold, and the cases where these situations arises. I'm more interest on the stochastic frontier models but I'd like to get the full picture of this. If not, would you mind pointing out some documents where I can learn more about the subject.

By the way, its a great blog, I use Google Reader as a feed aggregator and I always leave your posts last because I'd like to get full attention to them. I'm a BA in economics with a strong interest in statistics and econometrics, so your blog is one of the few blogs I like to read carefully.

2. Christian: Thanks for the kind comment. Yes, I can certainly post some more on the failure of MLE - especially in relation to stochastic frontier models.

DG