## Monday, July 25, 2011

### Maximum Likelihood Estimation is Invariably Good!

In a recent post I talked a bit about some of the (large sample) asymptotic properties of Maximum Likelihood Estimators (MLEs). With some care in its construction, the MLE will be consistent, asymptotically efficient, and asymptotically normal. These are all desirable statistical properties.

Most of you will be well aware that MLEs also enjoy an important, and very convenient, algebraic property - we usually call it "invariance". However, you may not know that this property holds in more general circumstances than those that are usually mentioned in econometrics textbooks. I'll come to that shortly.

In case the concept of invariance is news to you, here's what this property is about. Let's suppose that the underlying joint data density, for our vector of data, y, is parameterized in terms of a vector of parameters, θ. The likelihood function (LF) is just the joint data density, p(y | θ), viewed as if it were a function of the parameters, not the data. That is, the LF is L(θ | y) = p(y | θ). We then find the value of θ that (globally) maximizes L, given the sample of data, y.

That's all fine, but what if our main interest is not in θ itself, but instead we're interested in some function of  θ, say φ = f(θ)? For instance, suppose we are estimating a k-regressor linear regression model of the form:

y = Xβ + ε  ;  ε ~ N[0 , σ²Iₙ] .

Here, θ' = (β' , σ²). You'll know that in this case the MLE of β is just the OLS estimator of that vector, and the MLE of σ² is the sum of the squared residuals, divided by n (not by n - k). The first of these estimators is minimum variance unbiased, while the second is biased. Both estimators are consistent and "best asymptotically normal".
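To make the "divide by n, not by n - k" point concrete, here is a minimal sketch for the simplest case of an intercept plus one regressor (k = 2). The function names (`ols_simple`, `variance_estimates`) and the simulated data values are illustrative assumptions, not anything taken from a particular package.

```python
import random

def ols_simple(x, y):
    """Closed-form OLS for y = b1 + b2*x + error (k = 2 coefficients)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2

def variance_estimates(x, y, k=2):
    """Return (SSR/n, SSR/(n-k)): the MLE of sigma^2 and the unbiased estimator."""
    n = len(x)
    b1, b2 = ols_simple(x, y)
    ssr = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    return ssr / n, ssr / (n - k)

# Simulated data: the "true" values (1.0, 0.5, sigma = 2.0) are assumptions for illustration.
rng = random.Random(0)
x = [rng.uniform(0.0, 10.0) for _ in range(50)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 2.0) for xi in x]

sigma2_mle, s2_unbiased = variance_estimates(x, y)
print("MLE of sigma^2 (SSR/n):        ", sigma2_mle)
print("Unbiased estimator (SSR/(n-k)):", s2_unbiased)
```

The two estimates differ only by the factor (n - k)/n, so the MLE is always the smaller of the two, and the gap vanishes as n grows.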

Now, what if we are interested in estimating the non-linear function, φ = f(θ) = (β₁ + β₂β₃)?

Well, first of all, by Slutsky's Theorem, we can replace each of the parameters in this function with any consistent estimator, and the resulting function of the estimators will be a consistent estimator of the function. In this example, if bᵢ is the MLE (OLS estimator) of βᵢ (i = 1, 2, ..., k), then (b₁ + b₂b₃) is a consistent estimator of (β₁ + β₂β₃). O.K. - that gives us a consistent estimator for the function of interest. But does this estimator have any other desirable properties?

Given the non-linear nature of the function, you can guess that the estimator is, in general, biased, even though the individual bᵢ's are unbiased for the individual βᵢ's. Are there any general properties we can be sure the estimator (b₁ + b₂b₃) will have?
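To see where the bias comes from, we can take expectations directly. Here's a short sketch, conditioning on X and using only the unbiasedness of the individual estimators and the usual covariance matrix of the OLS estimator, σ²(X'X)⁻¹:

```latex
\begin{aligned}
E[b_1 + b_2 b_3] &= E[b_1] + E[b_2]\,E[b_3] + \mathrm{Cov}(b_2, b_3) \\
                 &= \beta_1 + \beta_2 \beta_3 + \sigma^2 \left[ (X'X)^{-1} \right]_{23}
\end{aligned}
```

So the bias is just the covariance between the second and third coefficient estimators. It's non-zero unless the (2, 3) element of (X'X)⁻¹ happens to be zero.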

This is where "invariance" comes in. It simply tells us that MLEs are invariant to functional transformations. In words: The MLE of a function of the parameters is just that function of the MLEs. In the example above, this means that (b₁ + b₂b₃) is not just a consistent estimator of (β₁ + β₂β₃). It is the MLE of (β₁ + β₂β₃). This implies, in turn, that it is not just a consistent estimator, but it is also asymptotically efficient, and asymptotically normal!

Note the practical advantage of this. Let's go back to our generic problem, where the LF was a function of θ, but we were really interested in estimating φ = f(θ). If we couldn't appeal to invariance, and if the inverse function were unique, we could write θ = f⁻¹(φ). Then we could replace θ in the LF with f⁻¹(φ), and maximize the LF with respect to φ, instead of with respect to θ. In our regression model example, if we are interested in estimating φ = exp(β₁), we'd replace β₁ with ln(φ) in the LF, and maximize with respect to φ, β₂, β₃, ..., βₖ and σ². Sounds like a bit of a pain, doesn't it? How would you deal with the function (β₁ + β₂β₃)?
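Here is a rough numerical check of the exp(β₁) case, under some simplifying assumptions: an intercept-only model (so there are no other β's), σ² concentrated out of the log-likelihood, and a hand-rolled golden-section search standing in for a proper optimizer. All of the function names and the simulated data are hypothetical choices for illustration.

```python
import math
import random

def neg_concentrated_loglik(beta1, y):
    """Negative log-likelihood of y_i ~ N(beta1, sigma^2), with sigma^2 concentrated out."""
    n = len(y)
    sigma2_hat = sum((yi - beta1) ** 2 for yi in y) / n
    return 0.5 * n * (math.log(2.0 * math.pi) + 1.0 + math.log(sigma2_hat))

def golden_section_min(func, lo, hi, tol=1e-9):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]: shrink from the right.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:
            # Minimum lies in [c, b]: shrink from the left.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return 0.5 * (a + b)

# Simulated data: true beta1 = 1.0 and sigma = 0.5 are assumptions for illustration.
rng = random.Random(1)
y = [rng.gauss(1.0, 0.5) for _ in range(30)]

# The hard way: re-parameterize the LF in terms of phi = exp(beta1) and maximize over phi.
phi_hat = golden_section_min(
    lambda phi: neg_concentrated_loglik(math.log(phi), y), 0.01, 50.0
)

# The easy way: apply the function to the MLE of beta1 (the sample mean here).
phi_plugin = math.exp(sum(y) / len(y))

print("Maximizing over phi directly:", phi_hat)
print("exp(MLE of beta1):           ", phi_plugin)
```

The two numbers should agree up to the tolerance of the search, which is all that invariance promises in this 1-1 case.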

Of course, there would seem to be a problem with the discussion above if the inverse function is not unique. That is, if f is not one-to-one. And this is what's led to some confusion in some textbooks.

In the end, it's actually just a matter of going back to basics, and recalling just what the MLE really is. If you're interested, Zehna (1966) provides a formal resolution of this in just four very readable paragraphs. There's a bit of a twist that you should note, though. Berk (1967) provides a review of Zehna's paper. He agrees that the function φ = f(θ) need not be one-to-one (or even continuous, for that matter), but if it's not 1-1 then we may also need to set up one or more additional (artificial) transformations of the form ψ = u(θ), ω = v(θ), etc., in such a way that the mapping from θ to (f(θ), u(θ), v(θ), ...) as a whole is indeed 1-1. He gives a simple example.
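Roughly speaking, the device that makes this work is an "induced" likelihood for φ, obtained by taking the best value of the original likelihood over all of the θ values that map to a given φ. The sketch below uses L* purely as a label of convenience; it is not necessarily the notation used in the paper.

```latex
\begin{aligned}
L^{*}(\phi \mid y) &= \sup_{\{\theta \,:\, f(\theta) = \phi\}} L(\theta \mid y) \\
L^{*}\big(f(\hat{\theta}) \mid y\big) &\ge L(\hat{\theta} \mid y)
  = \sup_{\theta} L(\theta \mid y) \ge L^{*}(\phi \mid y)
  \quad \text{for all } \phi
\end{aligned}
```

The second line says that f evaluated at the MLE of θ maximizes the induced likelihood, and nothing in that argument requires f to be 1-1.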

(Those of you who are familiar with the problem of obtaining the density of a function, say W, of two random variables, X and Y, will recall that we do something similar in that context. We map (X , Y ) → (W, Z), where Z is of no direct interest. It's included to ensure that the Jacobian of the transformation involves a "square" matrix, and is chosen in a way that will make the integration needed to obtain the marginal density of W from the joint density of (W, Z) relatively easy.)

Ultimately, when it comes to the MLE of a function of the original parameters, the function of interest need not be 1-1. You'll find this point noted in standard statistics monographs, such as Lehmann (1983, p.410). It's a pity that more econometricians haven't taken the time to read them!
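A small numerical illustration may help here. Take a function that is clearly not 1-1, φ = μ², in a N(μ, 1) model, and maximize an "induced" likelihood that, for each φ, keeps the better of the two roots μ = ±√φ. The model, the function names, the grid, and the simulated data are all assumptions chosen just for this sketch.

```python
import math
import random

def loglik(mu, y):
    """Log-likelihood of y_i ~ N(mu, 1), dropping the additive constant."""
    return -0.5 * sum((yi - mu) ** 2 for yi in y)

def induced_loglik(phi, y):
    """Induced log-likelihood for phi = mu^2: the better of mu = +sqrt(phi) and mu = -sqrt(phi)."""
    root = math.sqrt(phi)
    return max(loglik(root, y), loglik(-root, y))

# Simulated data with a negative mean, so the sign of the root matters.
rng = random.Random(42)
y = [rng.gauss(-1.5, 1.0) for _ in range(40)]

mu_hat = sum(y) / len(y)      # MLE of mu is the sample mean
phi_plugin = mu_hat ** 2      # invariance: apply the function to the MLE

# Crude grid search over phi in [0, 9] for the maximizer of the induced likelihood.
step = 0.001
grid = [i * step for i in range(9001)]
phi_grid = max(grid, key=lambda phi: induced_loglik(phi, y))

print("Square of the MLE of mu:        ", phi_plugin)
print("Grid maximizer of induced log-LF:", phi_grid)
```

Up to the grid spacing, the maximizer of the induced likelihood is just the square of the sample mean, even though squaring is not 1-1.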

While hesitating to single out a particular (excellent) text - I use it in my own courses - I see that Greene (2011, p.521) explicitly states that invariance of the MLE holds only under 1-1 transformations. He shows that if we re-parameterize the LF for the normal distribution in terms of its mean (μ) and precision (φ = 1/σ²), rather than its mean and variance (σ²), then the MLE of φ is just the reciprocal of the usual MLE of σ². That's invariance at work - in this case with a 1-1 mapping.
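For completeness, here is a sketch of the algebra behind that result for a random sample of size n from the normal distribution. It isn't necessarily laid out the way the textbook presents it.

```latex
\begin{aligned}
\ln L(\mu, \phi \mid y) &= \frac{n}{2}\ln\phi - \frac{n}{2}\ln(2\pi)
   - \frac{\phi}{2}\sum_{i=1}^{n}(y_i - \mu)^2 \\
\frac{\partial \ln L}{\partial \mu} &= \phi \sum_{i=1}^{n}(y_i - \mu) = 0
   \;\Rightarrow\; \hat{\mu} = \bar{y} \\
\frac{\partial \ln L}{\partial \phi} &= \frac{n}{2\phi} - \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2 = 0
   \;\Rightarrow\; \hat{\phi} = \frac{n}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{1}{\hat{\sigma}^2}
\end{aligned}
```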

One could then add that the MLE of σ is the positive square root of the MLE of σ², even though neither squaring nor taking the square root is a 1-1 transformation when taken in isolation.

In summary, the invariance property of MLEs can save us a huge amount of work in practice. Moreover, if used carefully, this property has much wider applicability than you may have thought.

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written Reference section is provided.

References

Berk, R. H. (1967). Review of Zehna (1966). MR0193707. Mathematical Reviews, 33, #1922.

Greene, W. H. (2011). Econometric Analysis (7th Edition). Upper Saddle River, NJ: Prentice Hall.

Lehmann, E. L. (1983). Theory of Point Estimation. New York: Wiley.

Zehna, P. W. (1966). Invariance of maximum likelihood estimation. Annals of Mathematical Statistics, 37, 744.