Tuesday, July 10, 2012

Concentrating, or Profiling, the Likelihood Function

We call it "concentrating", they (the statisticians) call it "profiling" - the likelihood function, that is.

Different language - same thing.

So what's this all about, anyway?

Let's suppose that we have an (n x 1) vector of sample data (y) that are generated according to some random process whose density function can be written in the form p(y | θ), where θ is a (k x 1) vector of parameters. Moreover, suppose that we can partition θ into θ = (θ1 : θ2), where θ1 is (k1 x 1) and θ2 is (k2 x 1), and (k1 + k2) = k.

By definition, the likelihood function is the joint data density, viewed as a function of the parameters. That is,  L(θ | y) = p(y | θ) = p(y | θ1, θ2).

Now, we want to draw inferences about both θ1 and θ2. If k is relatively large, then the problem of obtaining the Maximum Likelihood Estimator (MLE) of θ = (θ1 : θ2) can be quite problematic if the first-order conditions associated with the maximization of L(θ | y) are non-linear in the elements of  θ.

In that case, we have to resort to a numerical optimization algorithm (such as the Newton-Raphson algorithm) to solve these first-order conditions (the so-called "likelihood equations") for the values of the parameters.

Finding the global maximum of L(θ | y) is vitally important - see the earlier post here. This can be a real challenge, in practice. The larger the value of k (the greater the dimension of the parameter space), the greater this challenge is.

Now, what if we can find a way to reduce the magnitude of this challenge, without sacrificing anything in the process? Sounds good to me!

So, this is where the partitioning of the θ vector into its two sub-vectors, θ1 and θ2, comes into play.

Suppose that we can maximize L(θ | y) = L(θ1, θ2 | y) with respect to θ1 analytically. That is, we can solve the (vector) first-order condition  ∂L(θ1, θ2 | y) / ∂θ1 = 0 analytically, to yield the (partial) solution θ1 = θ1*. Note that, in general, θ1* will be a function of the remaining parameters, the elements of θ2. That is, θ1* = θ1*(θ2).

Now, we can substitute the expression for θ1* in place of θ1 in L(θ1, θ2 | y). This will generate what we call the "concentrated" (or "profile") likelihood function. Because θ1* = θ1*(θ2), this new version of the likelihood function will be a function only of θ2 (and the data, of course).

That is, we'll now have the function, L(θ | y) = L(θ1*(θ2), θ| y) =  Lc(θ| y), say.

Even if the remaining first-order conditions, ∂Lc(θ| y) / ∂θ = 0 are non-linear in the parameters, the computational problem that we now face will have been reduced, especially if k2 is much less than k1.

Of course, the above discussion applies equally to the log-likelihood function, rather than the original likelihood function.

Now, let's look at a simple econometric example that will illustrate this. The example is one where there is actually no need to concentrate the likelihood function, because the MLEs for all of the parameters can be obtained analytically. However, because of its familiarity, the problem I'm going to "solve" should be useful in convincing you that this works rather nicely.

Consider a standard linear multiple regression model, with a full-rank and non-random regressor matrix, and an error term that is normally distributed, has a zero mean, is serially independent, and is homoskedastic. That is:

                    y = Xβ + ε    ;    ε ~ N[0 , σ2In] .

Assuming that we have an independent sample of n observations, the likelihood function is:

             L(β , σ| y) = p(y | β , σ2) = (2πσ2)-n/2 exp[-½ (y - Xβ)'(y - Xβ) / σ2],

and so the log-likelihood function is:

       LogL(β , σ| y) = -(n/2)log(σ2) -(n/2)log(2π) - (y - Xβ)'(y - Xβ) / (2σ2)   (1)

Differentiating LogL, partially, with respect to σ2 we obtain the following first-order condition:
        LogL(β , σ| y) / σ = -n/(2σ2) + (y - Xβ)'(y - Xβ) / (2σ4) = 0 .

Solving this last equation for σ2, we get:

           σ2*(y - Xβ)'(y - Xβ) / n.                                                                   (2)

Now, we can concentrate the log-likelihood function, (1), by substituting  σ2* in place of σ2. This gives:

       LogLc(β | y) = -(n / 2)log[(y - Xβ)'(y - Xβ) / n] .                                         (3)

We can now maximize (3) with respect to the vector, β:

LogLc(β | y) / β = -(n / 2)[n / (y - Xβ)'(y - Xβ)(∂ / β)[(y - Xβ)'(y - Xβ) / n] = 0,

        (∂ / β)[(y - Xβ)'(y - Xβ) / n] = 0,

and solving for β, we get:

       β* = [X'X]-1 X'y .                                                                                    (4)

Finally, substituting β* for  β in (2), we have the MLE for σ2:

      σ2* (y - Xβ*)'(y - Xβ*) / n.                                                                    (5)

You'll recognize the expressions in (4) and (5) as being the usual MLEs for the parameters of our model.

As I emphasized at the start of this example, in this particular case there was really no need to concentrate the likelihood function, as the entire maximization problem could be solved analytically. However, you can now see the steps that are involved in solving a problem of this general type.

Also, notice that in this particular problem we could have differentiated the log-likelihood function partially with respect to β, first and obtained β*. We could then have substituted β* for β in (1) to obtain a different concentrated log-likelihood function, which we could then have differentiated with respect to σ2 to obtain the solution,  σ2*

There are some great examples in econometrics where concentrating the likelihood function can be extremely advantageous from a computational viewpoint. Some of these arise in the estimation of systems of equations by the method of maximum likelihood.

First, suppose that we have a Seemingly Unrelated Regression Equations (SURE) model. (See my recent post.) If there are M (possibly non-linear) equations in the system, then allowing for its symmetry, the covariance matrix for the errors  has M(M + 1)/2 distinct unknown elements. All of these parameters have to be estimated, jointly with the coefficients of the regressors in the model.

Suppose that M = 6, and that on average there are 5 regressors in each equation. Then, the total number of parameters that have to be estimated is (21 + 30) = 51. If we are using MLE, then there would be 51 (non-linear) likelihood equations that have to be solved to obtain the parameter estimates.

However, in this particular problem it is easy to differentiate the log-likelihood function (partially) with respect the (elements of) the covariance matrix and to solve analytically for this matrix as a function of the regressors coefficients. We can then concentrate the log-likelihood function. Finally, we are left with the problem of maximizing the concentrated log-likelihood, and although this is typically going to require a numerical algorithm, we are now dealing with a problem whose dimension has been reduced substantially.

Take the example above, where originally there were 51 parameters in the log-likelihood function. After concentrating the latter, we have a function of only 30 parameters that has to be maximized.

Second, suppose that we have a Simultaneous Equations Model (SEM). Again, if there are M equations in the system, then the number of unknown error covariance matrix parameters is M(M + 1)/2. If we set up the full log-likelihood function in oder to obtain the Full Information Maximum Likelihood (FIML) estimator, this function can be concentrated with respect to the covariance matrix parameters, in exactly the same way as for the SURE model.

Although the FIML estimates of the remaining structural parameters will have to be obtained by using a non-linear algorithm, once again we'll have a substantial reduction in the complexity of the latter computational problem.

I've couched the discussion in this post in terms of maximum likelihood estimation. However, it's worth noting that the general idea of "concentrating" can be applied to any objective function that we're seeking to optimize. This being the case, it's a good idea to look for opportunities to concentrate when using any "extremum estimator". Generalized Method of Moments (GMM) estimation is a case in point.

Happy computing!

© 2012, David E. Giles


  1. Another reason for profiling a likelihood function is to obtain confidence intervals for a subset of parameters when you don't think the usual normal-based asymptotic ones will have the right coverage. Here's a paper that describes the process:


  2. Brian - absolutely right - indeed this is probably the primary reason for profiling in the stats. literature. Thanks for pointing that out.