## Sunday, September 9, 2012

### Using Integrated Likelihoods to Deal With Nuisance Parameters

There are more possibilities open to you when using maximum likelihood estimation than you might think.

When we're conducting inference, it's often the case that our primary interest lies with a subset of the parameters, and the other parameters are essentially what we call "nuisance parameters". They're part of the data-generating process, but we're not that interested in learning about them.

We can't just ignore these other parameters - that would amount to mis-specifying the model we're working with. However, in the context of maximum likelihood estimation, there are several things that we can do to make life a little easier.

To be specific, let's suppose that we have a vector of parameters, θ, in which we are primarily interested, and another vector, λ, of nuisance parameters. For a vector of sample observations, y, the likelihood function is just the joint data density, viewed as a function of θ and λ, rather than as a function of y. That is, it's a function of the form L(θ , λ | y).

Now, if we're really lucky, it may be the case that the likelihood function is separable with respect to θ and λ. That is, we can write it as L(θ , λ | y) = L1(θ | y)L2(λ | y); where L1 doesn't depend on λ, and L2 doesn't depend on θ. If this unusual situation occurs, then we can simply maximize L1 with respect to θ.
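As a toy illustration (the model and all the names here are my own, purely for exposition), suppose one independent sample depends only on θ and another only on λ. Then the log-likelihood splits into two pieces, and θ can be estimated from its own piece alone:

```python
import numpy as np

# Toy separable likelihood: y1 depends only on theta (a normal mean, with
# known unit variance) and y2 only on lam (an exponential rate). Because the
# samples are independent, log L = log L1(theta) + log L2(lam).
rng = np.random.default_rng(3)
y1 = rng.normal(2.0, 1.0, size=50)
y2 = rng.exponential(1.0 / 0.5, size=50)   # true rate lam = 0.5

def log_l1(theta):
    return -0.5 * np.sum((y1 - theta) ** 2)      # up to additive constants

def log_l2(lam):
    return len(y2) * np.log(lam) - lam * np.sum(y2)

# theta can be estimated from L1 alone, ignoring lam entirely (and vice versa).
theta_hat = y1.mean()          # maximizes log_l1
lam_hat = len(y2) / y2.sum()   # maximizes log_l2
```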

Usually, we're not that lucky!

In an earlier post I talked about working with the "concentrated" likelihood function. This is obtained by maximizing the likelihood function (partially) with respect to the nuisance parameters (the elements of λ), and then substituting the maximizing values of those parameters back into the likelihood, so that the latter becomes a function only of the parameters that are of interest to us. Statisticians refer to the resulting "concentrated likelihood function" as the "profile likelihood function". We then proceed to maximize the concentrated likelihood with respect to the remaining parameters - the elements of θ in our present notation.
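Here's a quick numerical sketch of profiling (the function names and numbers are my own, just for illustration). For a N[μ , σ²] sample, fix μ, maximize over σ² analytically, and substitute back; the resulting profile likelihood for μ peaks at the sample mean:

```python
import numpy as np

def profile_loglik_mu(mu, y):
    # For fixed mu, the maximizing value of sigma^2 is mean((y - mu)^2);
    # substituting it back gives the profile (concentrated) log-likelihood.
    n = len(y)
    s2_hat = np.mean((y - mu) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * s2_hat) + 1)

rng = np.random.default_rng(42)
y = rng.normal(5.0, 2.0, size=200)

# Grid search: the profile likelihood is maximized at the sample mean.
grid = np.linspace(4.0, 6.0, 2001)
best = grid[np.argmax([profile_loglik_mu(m, y) for m in grid])]
print(best, y.mean())   # best lands (to grid accuracy) on y.mean()
```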

Generally speaking, this works well if the sample size is very large. Unfortunately, there are some situations in which this is not the best thing to do. One example arises with measurement errors and the "incidental parameters" problem (Neyman and Scott, 1948). That's why we tend to use instrumental variables methods in this context. Another arises with certain time-series problems - e.g., see Cruddas et al. (1989). Other examples, including one where the likelihood function has a very sharp "ridge", are presented by Berger et al. (1999).

Fortunately, there are some important alternative approaches, and these seem to overcome many of the potential drawbacks associated with the profile likelihood.

These alternative methods, based on "pseudolikelihoods", were pioneered by Kalbfleisch and Sprott (1970, 1973), and others, and are well worth knowing about. Berger et al. (1999) provide a really good overview.

Here, I want to focus on just one of these methods - inference based on the so-called "integrated likelihood function".

This is obtained by averaging the full likelihood function across all possible values of the nuisance parameter(s). If we want to attach more weight to some values of the nuisance parameters than to others, we can use a weighting function when we average. Computationally, although not philosophically, this sounds rather Bayesian. So, the integrated likelihood function is:

LI(θ | y) = ∫ L(θ , λ | y) π(λ) dλ ,

where π(λ) is the weighting function. Often, we'll just use an unweighted average, so the integrated likelihood function is simply:

LU(θ | y) = ∫ L(θ , λ | y) dλ .

We then estimate θ by maximizing LU or LI.

(Parenthetically, notice that by averaging over λ, we avoid the unrepresentative result obtained by profiling if the original likelihood function has a sharp "ridge" in the region associated with the maximum in the λ direction(s).)
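As a numerical sketch (the function names and the grid-based integration are my own, purely illustrative), we can approximate LU(θ | y) by summing the likelihood over a grid of nuisance-parameter values, working on the log scale to avoid underflow:

```python
import numpy as np

def integrated_loglik(theta, y, loglik, lam_grid):
    # Approximate log LU(theta | y): an unweighted Riemann sum of
    # exp(loglik) over a grid of nuisance-parameter values, lam.
    vals = np.array([loglik(theta, lam, y) for lam in lam_grid])
    m = vals.max()                      # log-sum-exp shift for stability
    dlam = lam_grid[1] - lam_grid[0]
    return m + np.log(np.sum(np.exp(vals - m)) * dlam)

# Toy model: theta = sigma^2 is of interest, lam = mu is the nuisance.
def normal_loglik(sig2, mu, y):
    return (-0.5 * len(y) * np.log(2 * np.pi * sig2)
            - np.sum((y - mu) ** 2) / (2 * sig2))

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.5, size=100)
mu_grid = np.linspace(-5.0, 5.0, 2001)
lu = integrated_loglik(2.0, y, normal_loglik, mu_grid)
```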

Now, let's take a look at a simple estimation problem that not only illustrates the use of the integrated likelihood function, but also demonstrates that sometimes this approach results in an estimator with better finite-sample properties than the corresponding (regular) maximum likelihood estimator.

Suppose that we have a sample of n independent observations (yi) drawn from a N[μ , σ²] population. Then the likelihood function for μ and σ² is:

L(μ , σ² | y) = (2πσ²)^(-n/2) exp[-(1/(2σ²)) Σ(yi - μ)²] .                    (1)

You'll recall that the maximum likelihood estimators (MLEs) of μ and σ² are ybar = (1/n)Σyi and (1/n)Σ(yi - ybar)², respectively. So, the MLE for μ is unbiased, but the MLE for σ² is biased. (An unbiased estimator can be obtained by replacing "n" with "(n - 1)" in the formula for the MLE of σ².)
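A quick Monte Carlo check of that bias (the sample size and parameter values here are my own illustrative choices):

```python
import numpy as np

# The MLE of sigma^2 divides by n, and its expectation is
# ((n - 1)/n) * sigma^2, so on average it understates the true variance.
rng = np.random.default_rng(123)
n, sigma2, reps = 10, 4.0, 200_000
y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mle = ((y - y.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(mle.mean())                 # close to (n - 1)/n * sigma2 = 3.6, not 4.0
print(mle.mean() * n / (n - 1))   # bias-corrected: close to 4.0
```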

Now, let's treat σ² as the parameter of interest, and μ as a nuisance parameter. (Yes, I know it's usually the other way around, but trust me!)

Using a uniform weighting function, the integrated likelihood function is:

LU = ∫ L(μ , σ² | y) dμ ,                                                               (2)

where L(μ , σ² | y) is given in (1) above.

Now, note that, in (1), we can write:

Σ(yi - μ)² = Σ[(yi - ybar) + (ybar - μ)]² = Σ(yi - ybar)² + n(ybar - μ)² .      (3)
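It's easy to verify this decomposition numerically, for arbitrary data and an arbitrary value of μ:

```python
import numpy as np

# Numerical check of the decomposition in (3).
rng = np.random.default_rng(0)
y = rng.normal(size=50)
mu = 1.3
lhs = np.sum((y - mu) ** 2)
rhs = np.sum((y - y.mean()) ** 2) + len(y) * (y.mean() - mu) ** 2
print(np.isclose(lhs, rhs))  # True
```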

Recalling that the sampling distribution of ybar is N[μ , σ²/n], and that this density integrates to unity, it follows immediately that:

LU = n^(-1/2) (2πσ²)^(-(n - 1)/2) exp[-(1/(2σ²)) Σ(yi - ybar)²] .

If we now maximize LU with respect to σ², the first-order condition is:

(n - 1) / (2σ²) = (1/(2σ⁴)) Σ(yi - ybar)² ,

and the solution yields (1/(n - 1))Σ(yi - ybar)² as the estimator of σ². That is, the minimum variance unbiased estimator (MVUE) of this parameter emerges directly!
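We can confirm this numerically with a rough grid search over σ² (the data and grid here are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(10.0, 3.0, size=40)
n = len(y)
S = np.sum((y - y.mean()) ** 2)

def log_lu(sig2):
    # Log of the integrated likelihood LU derived above,
    # as a function of sigma^2 alone.
    return (-0.5 * np.log(n) - 0.5 * (n - 1) * np.log(2 * np.pi * sig2)
            - S / (2 * sig2))

grid = np.linspace(0.1 * S / n, 5.0 * S / n, 200_001)
sig2_hat = grid[np.argmax(log_lu(grid))]
print(sig2_hat, S / (n - 1))   # the grid maximizer matches S/(n - 1)
```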

(If we use the "concentrated" ("profile") likelihood approach here we get the usual maximum likelihood estimates of the two parameters.)

You can easily check that if we reverse the roles of μ and σ², so that the former is the parameter of interest, and the latter is the nuisance parameter, then maximizing the (new) integrated likelihood function with respect to μ still yields ybar as the estimator of μ. Again, for this problem, this happens to be the MVUE of that parameter.

Finally, notice that all of this carries over trivially to the linear multiple regression model with normal errors,

y = Xβ + ε.

Here, the (conditional) mean of the y data is just Xβ, and σ² is the variance of each εi value, and hence of each yi observation. Whether you use the full, profile, or integrated likelihood, the MLE of β is the OLS estimator, b = (X'X)^(-1)X'y. In the first two cases, the MLE of σ² is (e'e) / n, where e is the residual vector. However, if you use the integrated likelihood approach (integrating over the k elements of β), you get the MVUE of σ², namely (e'e) / (n - k), where k is the number of regressors.
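A small simulation (the design matrix and parameter values are my own illustrative choices) makes the contrast concrete: dividing e'e by n understates σ², while dividing by the degrees of freedom, n - k, with k the number of regressors (so n - 1 when the mean involves a single parameter), gives an unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(2024)
n, k, sigma2, reps = 25, 3, 2.0, 100_000
beta = np.array([1.0, -0.5, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # fixed design
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual-maker matrix

eps = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
e = (X @ beta + eps) @ M        # OLS residuals for every replication
ss = np.sum(e ** 2, axis=1)     # e'e
print(np.mean(ss / n))          # below 2.0: the MLE is biased downward
print(np.mean(ss / (n - k)))    # close to 2.0: unbiased
```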

So, keep in mind that there's more than one way to conduct inference, even if you're dead-set on using maximum likelihood.

References

Berger, J. O., B. Liseo, and R. Wolpert, 1999. Integrated likelihood functions for eliminating nuisance parameters (with discussion). Statistical Science, 14, 1–28.

Cruddas, A. M., N. Reid, and D. R. Cox, 1989. A time series illustration of approximate conditional likelihood. Biometrika, 76, 231–237.

Kalbfleisch J. D. and D. A. Sprott, 1970. Application of likelihood methods to models involving large numbers of parameters (with discussion). Journal of the Royal Statistical Society, B, 32, 175–208.

Kalbfleisch, J. D. and D. A. Sprott, 1973. Marginal and conditional likelihoods. Sankhya, A, 35, 311–328.

Neyman, J. and E. L. Scott, 1948. Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.

Severini, T. A., 2007. Integrated likelihood functions for non-Bayesian inference. Biometrika, 94, 529-542.