In a recent post I mentioned my long-standing interest in Bayesian Econometrics. When I teach this material I usually include a simple application that involves estimating a consumption function using U.S. time-series data. I used to have this coded up for the SHAZAM package. EViews isn't appropriate as it doesn't include a numerical integration routine.
You could use BUGS, or some other package, but it's nice to see what is going on, step-by-step, when you're encountering this stuff for the first time.
The other day, I thought, "It's time to code this up in R". So, here we are!
Here's how I deal with the example. We have annual U.S. data on real private consumption expenditure and real disposable income, for the period 1950 to 1985. (See the data page for this blog.) We're going to estimate a really simple consumption function. To reduce the number of parameters for which prior information has to be assigned, we'll measure the data as deviations about sample means, thereby eliminating the intercept. So, the model is:
Ct = βYt + εt ; t = 1, 2,......., 36
where C and Y denote consumption and income.
Let's assume that the errors are independently and normally distributed.
Now, let's put on our Bayesian hat! We know that the m.p.c. (β) must lie between zero and one in value. Otherwise the results make no economic sense. As far as σ is concerned, all I know is that it must be positive and finite in value.
I'm going to put an improper ("diffuse") marginal prior on σ, and a Beta marginal prior on β. The latter reflects the fact that this parameter must lie between zero and one. Assuming that the prior information about σ is independent of that about β, the joint prior for β and σ is:
p(β, σ) = p(β) p(σ)
p(σ) α (1 / σ) ; 0 < σ < ∞
p(β) α β (α - 1)(1 - β)(γ - 1) ; 0 < β < 1 ; α, γ > 0
where "α" denotes "is proportional to".
The parameters, α and γ, of the marginal prior for β will be assigned values to reflect our prior information about the m.p.c., before we see the current sample of data.
Here's how we'll do that. Recall that the mean and variance of a Beta distribution are:
Mean = m = α / (α + γ)
Variance = v = (α γ) / [ (α + γ)2 (α + γ + 1) ] .
If we assign values to the mean and the variance, to reflect our prior beliefs, the two equations above can be solved for the implied values of α and γ. Specifically:
α = (m / v) (m (1 - m) - v)
γ = (1 - m) [m (1 - m) - v] / v .
Having chosen values for m and v, we have now fully specified the joint prior for β and σ.
Given the normality of the errors, the likelihood function takes the form
L(β, σ | Data) = p(Data | β, σ) α (1 / σ)n exp[ -Σ (Ct - βYt)2 ] / (2σ2) .
Here, "Data" refers to the random consumption data. Then, by Bayes' Theorem, the joint posterior p.d.f. for β and σ is:
p(β, σ | Data) α p(β, σ) L(β, σ | Data)
α (1 / σ)(n + 1) β (α - 1)(1 - β)(γ - 1) exp[ -Σ (Ct - βYt)2 ] / (2σ2) .
We can marginalize this joint posterior pd.f. by integrating it with respect to σ, over the range (0 , ∞). This integration can be performed analytically, quite easily. The details can be found here.
The marginal posterior for β is then given by:
p(β | Data) α β (α - 1)(1 - β)(γ - 1) [ Σ (Ct - βYt)2 ] (-n / 2) .
We now need to compute the normalizing constant for this marginal posterior density - the constant that ensures that this p.d.f. integrates to unity. We simply take the last expression above, integrate it with respect to β over the interval [0, 1], and then the reciprocal of the value of this integral is the normalizing constant.
The integration with respect to β is readily done numerically, given the closed interval for the β values.
To illustrate this, using the U.S. consumption data, and setting the prior mean and variance for β to m = 0.75 and v = 0.0005 respectively. The latter values imply choices for the parameters of the prior for β of α = 280.5, and γ = 93.5.
This is what we get when we run the R code available in the code page for this blog:
The nice thing about being a Bayesian is that I don't have to justify to you where I got the values for the prior mean and variance from! (Just kidding, of course :-) .)
Once we have the normalized marginal posterior for β, we can compute the mean and variance of this p.d.f. as:
E[β | Data] = ∫ β p(β | Data) dβ
Var[β | Data] = ∫ (β - E[β | Data] )2 p(β | Data) dβ .
Again, these integrals are easy to compute numerically, given the range for β.
In our example, we get E[β | Data] = 0.8810, and Var[β | Data] = 0.000219.
Under a quadratic loss function, E[β | Data] is the Bayes estimator of β. Other loss functions imply different Bayes estimators of β.
For our example, under a zero-one loss function, the posterior mode is the Bayes estimator. It's simple to check that the mode of p(β | Data) occurs at β = 0.8792. We get essentially the same estimator under either loss function in this particular example.
Now, lets suppose that non-Bayesian (yes, there really are such people!) comes along and decides to use OLS estimation. Given the normality of our errors, that's the same as MLE as far as the estimation of β is concerned. This, in turn, is identical to the Bayes estimator we'd get if we used the following totally "diffuse" prior for β and σ:
p(β, σ) = p(β) p(σ),
where p(β) α constant ; -∞ < β < ∞
and p(σ) α (1 / σ) ; 0 < σ < ∞
We could consider this to be our "reference prior". The associated results will enable us to see how much "influence" our chosen prior information had on the results given above.
Here are the OLS (MLE) regression results from R:
Also notice that while the prior variance for β was 0.005, and the variance of the posterior distribution for β, is 0.00219. Adding the sample information to the prior information leads to a gain in estimation precision, as we'd expect.
A Bayesian posterior interval for β can also be constructed, but I won't pursue that here.
Here are some results you get when you vary the prior information:
Hopefully, this example will be useful to some of you. Put on your Bayesian hat, and have fun!
[Technical Note: The likelihood function that's plotted above is actually the marginal (or integrated) likelihood function. The full likelihood function is the joint data density, viewed as a function of the two parameters, β and σ. So that I can plot the likelihood here, I've marginalized it by integrating out σ. The integration to get the kernel of this function can be done analytically, using a similar approach to that used above to marginalize the posterior with respect to σ. Then the normalizing constant is computed numerically.]
© 2012, David E. Giles
Why on earth would anyone ever go to this much trouble? Without assuming that the errors are normally distributed (or even that they are IID), I can use the method of moments estimator (which will be the OLS estimator, in this case) and get an unbiased, root-n consistent estimate of B.ReplyDelete
The only benefit I can see here is imposing the restriction that B must lie on [0,1], but it would seem much simpler to use some type of restricted least squares estimate. That being said, if I ever got an estimate outside of this range using OLS, I would suspect that there is something seriously wrong with the data I am using.
I Guess you weren't born to be a Bayesian!Delete
Keep in mond that your OLS estimator and RLS estimators are inadmissible under a wide range of loss functions if there are more than 2 regressors in your model, whereas the Bayes estimator is admisible, by construction.
Also, the purpose of the prior here isn't to constrain the estimator to be in [0 , 1], it's simply introducing additional information about B, and that information is consistent with B being the m.p.c. If I got an OLS estimate outside this range, I'd be concerned too - not just about the data, but also about the specification of the model.
Dave, what exactly is a loss function in Bayesian estimation? I suspect a quadratic loss function is analogous to minimizing the sum of squares, but what, for example, is the zero-one loss function. Also, what does it mean for an estimator to be admissible or inadmissible?ReplyDelete
Brian: The loss function is a function of the estimator and parameter that takes positive values if the two don't coincide, and zero value only if they do. When we use OLS we're implicitly using a quadratic loss function. We might also have an absolute error loss structure, for example. A zero-one loss is where the loss ia zero if the estimator is within an epsilon of the target parameter, but it jumps to one (or any constant positive value) as soon as we are any amount further away than that.ReplyDelete
Loss functions take their values randomly, because they are a function of the random estimator (or, more generally, the random decision rule). So, we typically take the expected value of the loss, and this gives us what we call the risk function. The risk function associated with a quadratic loss is what we usually call mean squared error.
Now, suppose we agree on a loss function, to keep the playing field level. Now consider an estimator and evaluate its risk. This will be a function of the unknown parameter. If there exists ANY other estimator whose risk is no greater than the risk of our estimtor for any value of the parameter, and whose risk is striclt less than that of our stimator for at least some value of the parameter, then this makes our estimator "inadmissible". It's dominated by the other estimator. We wouldn't want to use it.
An estimator is "admissible" if it's not inadmissible. The second estimator ablove need not be admissible - it could be dominated in turn by some other estimator. But it makes our estimator look pretty bad.
All Bayes rules (estimators) that use a "proper" prior (the prior density integrates to one) are admissible by construction. So, that's a rather nice property.
Back in the 60's Stein showed that under quadatic loss the OLS estimator is inadmissible if you have more than 2 regressors. The family of Stein-rule estimators all dominate OLS in this sense. SR estimators are in turn dominated by the "positive-part" SR estimator. The inadmissibiliy of OLS was subsequently shown to hold under a range of other loss function. In fact, inadmissibility under quadratic loss is enough to be troubliny, as it's a quadratic loss that we have in mind when we minimize the sum of the SQUARED residuals!
I hope this answers your queries.
Thanks Dave, that made sense.Delete
Brian - I should have also said that Bayes estimators are selected so as to minimize posterior expected loss. This quantity is the average loss, where the average is taken using the posterior distribution, not the data density as the "weighting function". So it's different from risk.ReplyDelete
If we have a quadratic loss function the the posterior expected loss is minimized by selecting by estimating the parameter using the mean of the posterior density. With an absolute error loss, the Bayes estimator is the median of the posterior density; and with a zero-one loss, the Bayes estimator is the mode of the posterior.
Nice article, thanks for the information.ReplyDelete
thanks for great website!ReplyDelete
I would like to thank you for the efforts you have made in writing this article.ReplyDelete