Monday, March 17, 2014

MCMC for Econometrics Students - I

This is the first of a short sequence of posts that discuss some material that I use when teaching Bayesian methods in my graduate econometrics courses.

This material focuses on Markov Chain Monte Carlo (MCMC) methods - especially the use of the Gibbs sampler to obtain marginal posterior densities. This first post discusses some of the computational issues associated with Bayesian econometrics, and introduces the Gibbs sampler. The follow-up posts will illustrate this technique with some specific examples.

So, what's the computational issue here?

To begin with, what do we mean by MCMC, and what is the Gibbs sampler?

Loosely speaking, a Markov chain is a stochastic process in which the value at any step depends on the immediately preceding value, but doesn't depend on any values prior to that. In other words, if X denotes a random variable, and x denotes a value of that random variable, a Markov chain satisfies the condition:

        Pr[X_{n+1} = x_{n+1} | X_n = x_n , X_{n-1} = x_{n-1} , ... , X_1 = x_1] = Pr[X_{n+1} = x_{n+1} | X_n = x_n] .
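
A familiar econometric example of a Markov chain is a Gaussian AR(1) process: given the previous value, the earlier history is irrelevant. Here's a tiny R simulation (the persistence parameter, 0.8, is chosen purely for illustration):

    # Simulate a Gaussian AR(1) process - a simple example of a Markov chain:
    # each new value depends only on the immediately preceding one.
    set.seed(1)
    n   <- 1000
    phi <- 0.8                            # illustrative persistence parameter
    x   <- numeric(n)                     # x[1] starts at zero
    for (t in 2:n) {
      x[t] <- phi * x[t - 1] + rnorm(1)   # x[t] depends only on x[t - 1]
    }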

The Gibbs sampler (described below) exploits the characteristics of a Markov chain, and in doing so it revolutionized the computational side of Bayesian statistics and econometrics - for the following reasons.

Suppose that we have a model with two parameters, θ1 and θ2. Let p(θ1 , θ2) be the prior pdf, and let L(θ1, θ2 | y) = p(y | θ1 , θ2) be the likelihood function. Then, by Bayes' Theorem, the posterior pdf for the parameters is:

                          p(θ1 , θ2 | y) ∝  p(θ1 , θ2) L(θ1 , θ2 | y) .                                         (1)

Typically, the joint posterior pdf won't be of any standard form, and neither will the associated marginal pdf's for θ1 and θ2. In that case, marginalizing the expression in (1) to get the individual posterior pdf's will require numerical integration of the joint posterior, p(θ1 , θ2 | y): integrating out θ2 gives the marginal for θ1, and integrating out θ1 gives the marginal for θ2. The resulting marginal posterior pdf's will then have to be normalized so that they integrate to one, and obtaining the normalizing constants will again require numerical integration. Finally, if we want to obtain the means and variances of the marginal posteriors, yet another round of numerical integration will be needed.
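
In symbols, the two marginalizations that have to be carried out numerically are

                          p(θ1 | y) ∝ ∫ p(θ1 , θ2 | y) dθ2 ;          p(θ2 | y) ∝ ∫ p(θ1 , θ2 | y) dθ1 ,

followed by a further integration of each result to obtain its normalizing constant.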

You can find an example of this in my (much) earlier post, A Bayesian Consumption Function.

What I've just described is for a model with just two parameters - but who would want to use a model as simple as that? If there are three parameters in the model, then obtaining the marginal posterior densities will require two-dimensional numerical integration. If there are six parameters in the model, then five-dimensional numerical integration will be required. It's not feasible to do this using standard numerical quadrature methods.
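
To get a feel for the scale of the problem: a quadrature rule with, say, 100 nodes per dimension would require 100^5 = 10,000,000,000 evaluations of the integrand for each five-dimensional integral - and that's before normalizing the marginals or computing their moments.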

One response to this problem was to simulate the integrals using "importance sampling", or "Monte Carlo integration". However, there's a much more appealing solution. Don't even try to evaluate the integrals at all!

Instead, what the Gibbs sampler does is to draw random values from the conditional distributions for each of the parameters in the model. Often, these distributions are univariate, with the conditioning being on all of the other parameters in the model. Sometimes, these conditional distributions are of a standard form, which makes it trivial to generate random draws from them. Even if these distributions aren't of a standard form, there are lots of computationally simple ways of generating random values from them - the inversion method, the acceptance-rejection method, the table look-up method, and so on. (More on this later.)
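
As a small illustration of the inversion method, here's a sketch in R for a case where the distribution we need to draw from happens to be exponential with rate lambda - an assumption made purely because its cdf inverts in closed form; any distribution with an invertible cdf is handled in exactly the same way:

    # Inversion method: if U ~ Uniform(0,1) and F is a cdf, then F^(-1)(U) has cdf F.
    set.seed(123)
    lambda <- 2                       # illustrative rate parameter
    u <- runif(10000)                 # uniform draws
    x <- -log(1 - u) / lambda         # F^(-1)(u) = -log(1 - u)/lambda for the exponential cdf
    c(mean(x), 1 / lambda)            # the sample mean should be close to the true mean, 1/lambda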

We're not really that interested in these conditional distributions - we want to obtain the marginal distributions of the various parameters in the model. However, that's where the "magic" of the Gibbs sampler comes in.

Let's suppose that our model has the two parameters, θ1 and θ2, as before, and that the joint posterior, p(θ1 , θ2 | y), is of a non-standard form. Suppose that the conditional posteriors, p(θ1 | θ2 , y) and p(θ2 | θ1 , y), are either standard distributions, or at least ones that we can generate random values from efficiently. Then, here's what we do:

  1. Assign initial values, θ1(0) and θ2(0) to the two parameters.
  2. Draw a random value θ1(1) from p(θ1 | θ2(0) , y).
  3. Then draw a random value θ2(1) from p(θ2 | θ1(1) , y).
  4. Next, draw a random value θ1(2) from p(θ1 | θ2(1) , y).
  5. Repeat steps 3 and 4 many, many times.
We'll then have a long sequence of values for θ1 and θ2. It turns out that each of these sequences is a Markov chain that (under mild conditions) converges to a stationary distribution, so the initial values won't matter. Even more importantly, after many replications the successive values in these two chains start to behave as if they were random draws from the marginal posterior distributions, p(θ1 | y) and p(θ2 | y), rather than from the conditional posterior distributions, p(θ1 | θ2 , y) and p(θ2 | θ1 , y).
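
Here's a minimal R sketch of steps 1 to 5. Purely for illustration, it assumes that the joint posterior of (θ1 , θ2) happens to be bivariate normal with zero means, unit variances, and correlation rho, so that each conditional posterior is univariate normal; in a real application the two conditional posteriors would come from your own model, as in the examples in the follow-up posts.

    # Gibbs sampler sketch for a two-parameter model.
    # Illustrative assumption: the joint posterior is bivariate standard normal with
    # correlation rho, so theta1 | theta2 ~ N(rho*theta2, 1 - rho^2), and symmetrically.
    set.seed(42)
    rho    <- 0.6                     # illustrative value
    n.reps <- 20000
    theta1 <- numeric(n.reps)
    theta2 <- numeric(n.reps)
    theta1[1] <-  2                   # step 1: (arbitrary) initial values
    theta2[1] <- -2
    for (i in 2:n.reps) {
      # steps 2 and 4: draw theta1 given the most recent value of theta2
      theta1[i] <- rnorm(1, mean = rho * theta2[i - 1], sd = sqrt(1 - rho^2))
      # step 3: draw theta2 given the freshly drawn theta1
      theta2[i] <- rnorm(1, mean = rho * theta1[i], sd = sqrt(1 - rho^2))
    }

Because the true marginals in this toy case are just N(0 , 1), you can check the "magic" directly: a histogram of the theta1 values (after discarding the early draws, as discussed below) will match the standard normal density.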

Isn't that neat?

The best part - we didn't have to do any integrating at all. Not to marginalize the posterior, nor to normalize the marginal posteriors! In my books, any day that we don't have to do any integrating is a good day!

Notice that we have to let the Markov chain develop properly before the successive draws are effectively from the marginal posterior distributions. In practice, this means letting the process run, flip-flopping back and forth between steps 3 and 4 above, and then discarding the first few thousand values. These early values are associated with what we term the "burn-in period". We then continue on and generate thousands of values from the marginal posterior densities.

There are various measures that can be used to determine when the "burn-in" should end. (For example, see the survey by Cowles and Carlin (1996).) Also, notice how well the Gibbs sampler lends itself to parallel processing - for instance, several independent chains can be run at the same time.

Once we have a big "sample" of values from p(θ1 | y), say, the mean of the marginal posterior can be approximated by taking the simple average of these thousands of values. The variance, median, mode, or any other measure we're interested in can be obtained in the same way. For example, if we were using a quadratic loss function, then the mean of the marginal posterior distribution is the Bayes estimator of θ1, so that simple average is our point estimate. The same thing applies to the values drawn from the marginal posterior for θ2.
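
Continuing the illustrative sketch above (and assuming, as there, that the draws are stored in theta1, with, say, the first 5,000 treated as the burn-in), these summaries are just sample statistics of the retained draws:

    # Discard the burn-in, then summarize the retained draws from p(theta1 | y).
    burn  <- 5000
    keep1 <- theta1[(burn + 1):n.reps]
    mean(keep1)                              # Bayes estimator of theta1 under quadratic loss
    var(keep1)                               # posterior variance
    quantile(keep1, c(0.025, 0.5, 0.975))    # posterior median and a 95% interval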

What if we had three, four, or more parameters in the model, as is likely to be the case?

Well, we just generalize the five steps above:
  1. Assign initial values, θ1(0), θ2(0), θ3(0), θ4(0), ..., etc. to all of the parameters.
  2. Draw a random value θ1(1) from p(θ1 | θ2(0) , θ3(0), θ4(0), ..... , y).
  3. Then draw a random value θ2(1) from p(θ2 | θ1(1) , θ3(0), θ4(0), ...., y).
  4. Next, draw a random value θ3(1) from p(θ3 | θ1(1) , θ2(1), θ4(0), ...., y).
  5. Then draw a random value θ4(1) from p(θ4 | θ1(1), θ2(1), θ3(1), ...., y).
  6. Continue until we have a first draw for each of the parameters.
  7. Then draw a random value θ1(2) from p(θ1 | θ2(1), θ3(1), θ4(1), ..., y)
  8. Draw a random value for θ2(2) from p(θ2 | θ1(2), θ3(1), θ4(1), ...., y)
  9. etc.
  10. Repeat steps 7 to 9 many, many times.
Again, after a suitable "burn-in period" we'll start to draw values from the marginal posteriors for all of the parameters in the model. These posteriors then provide us with the information we need to make Bayesian inferences about the parameters.

The Gibbs sampler was named after the physicist J. W. Gibbs, and was introduced by Geman and Geman (1984) in the pattern recognition literature. It was brought into the statistics literature by Gelfand and Smith (1990). It's a special case of the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).

Casella and George (1992) provide a very useful and readable introduction to the Gibbs sampler, and a proof of its convergence.

The next post in this sequence will go through an example where we know the answer analytically, but it should convince you that the "magic" of the Gibbs sampler actually works!

Biographical note

You can find an interesting bio. of Keith Hastings here. He and I didn't (quite) overlap at the University of Canterbury, or here at the University of Victoria. We missed each other by two years in each case.


References 

Casella, G. and E. I. George, 1992. Explaining the Gibbs sampler. American Statistician, 46, 167-174.

Cowles, M. K. and B. P. Carlin, 1996. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91, 883-904.

Gelfand, A. E. and A. F. M. Smith, 1990. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Hastings, W. K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, 1953. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.



© 2014, David E. Giles

5 comments:

  1. I think this was a good explanation, but that's coming from someone who already understands MCMC. I had a difficult time learning it and what you've written is very similar to the material that I had difficulty understanding. My econometrics classes didn't cover Bayesian statistics, so I taught myself MCMC about 2-3 years ago. I had tried to learn about Gibbs sampling several times before it actually took. The one thing that really clarified it for me was working through an example of the normal inverse gamma Gibbs. Seeing the equations and why it was so much easier to work with the conditional distributions just made everything click.

    Replies
    1. John - thanks for the comment. The third of the posts coming in this sequence is effectively the normal-inverse gamma. I use the precision rather than the variance, so it is normal-gamma. I'll be posting the R code with it.

  2. Excellent post! I also had issues learning these methods until someone showed me 1) how to find full conditional distributions and 2) how to complete the square in the normal kernel. If you allow links, I have a paper on SSRN that explains these ideas in the context of spatial econometric models available here:

    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1244261

    I also have MATLAB code for anyone interested in these spatial econometric models or the normal linear model.

    Again, thank you for this blog and these topics. I never fail to learn something really wonderful.

    Replies
    1. Don - thanks for your very helpful comment. I hope you enjoy the upcoming posts on MCMC.

  3. Simply brilliant, finally I understood this problem :)

