## Thursday, August 22, 2013

### Forecasting From Log-Linear Regressions

I was in (yet another) session with my analyst, "Jane", the other day, and quite unintentionally the conversation turned, once again, to the subject of "semi-log" regression equations.

After my previous rant to discussion with her about this matter, I've tried to stay on the straight and narrow. It's better for my blood pressure, apart from anything else! Anyway, somehow how we got back this topic, and she urged me to get some related issues off my chest. This is therapy, after all!

Right at the outset, let me state quite categorically that lots of people estimate semi-logarithmic regressions for the wrong reasons. I'm not condoning what they do. That's their problem - I have enough of my own! However, if they're going to insist on doing this, then I'm going to insist that they be consistent when it comes to using their estimated model for forecasting purposes!

I mean, we wouldn't use an inconsistent estimator, would we? So why should we put up with an inconsistent modeller?

Let me tell you all about it.........

Suppose that we're using a regression model of the form

log(yt) = β1 + β2x2t + β3x3t + ......+ βkxkt + εt    ,

where log(.) denotes the natural logarithm. One or more of the regressors may also be in logarithmic form, but that's irrelevant for the purposes of this post. We just need them to be non-random.

Having estimated the model by (say) OLS, we have the "fitted" (predicted) values for the dependent variable. That is, we have values of [log(yt)]* = b1 + b2x2t + ..... + bkxkt, where bi is the OLS estimator for βi (i = 1, 2, ......, k).

Usually, of course, we'd be more interested in fitted values expressed in terms of the original data - that is, in terms of y itself, rather than log(y).

So, all we have to do is take the inverse of the logarithmic transformation, and the fitted values of interest will be yt* = exp[log(yt)]* . Right?

Actually, No!

Sure, we could do that, but we'd be introducing a bias into the predictions that we can easily avoid.

So, what's going on here? Well, you'll notice that I didn't state any assumptions about the random error term, εt, in the regression model. That was bad - those assumptions are really just as important as the specification of the structural part of the equation. So, let's tidy things up a bit. Specifically, let's assume that the errors have a zero mean, and are homoskedastic and serially independent. Crucially, let's also assume that they are Normally distributed.

This last assumption is key to what follows. If εt is Normally distributed, then log(yt) is also Normally distributed, assuming that the regressors are non-random. In turn, this implies that yt itself must follow a Log-Normal distribution.

We need to be aware of the following key relationships between these Normal and Log-Normal distributions. In general, if X ~ N[μ , σ2], then Y = exp[X] ~ Log-N[(m , v)], where "m" and "v" are the mean and variance of the Log-Normal distribution. Its well-known that:
• m = exp[μ + σ2 / 2]
• v = [exp(σ2) - 1] exp[2μ + σ2
In our case, μ = E[εt] = 0 ; and σ2 = var.[εt].

In particular, this implies that E[yt] = exp[β0 + β1x1t + β2x2t + ......+ βkxkt + (σ2 / 2)].

So, when we generate our predictions ("fitted values") of yt, based on our log-linear model, really we should create them as:

yt* = exp{[log(yt)]* + ( s2 / 2)},

where
[log(yt)]* = [ b0 + b1x1t + ... + bkxkt ],

and s2 is the usual unbiased estimator of σ2, based on the OLS estimates of the semi-log model.

To be more specific,

s2 = Σ[log(yt) - b1 - b2x2t - ....... - bkxkt]2 / (n - k),

where the range of summation is for t = 1, 2, ...., n.

The naive approach of generating the predictions simply as yt* = exp[log(yt)]* ignores a term, and this will distort the predictions. In fact, it will distort them in a downwards direction, as the term we're ignoring must be positive in sign. Of course, just how important this distortion is, in practice, will depend on the scale of our data, and the "signal-to-noise" ratio for our model.

Let's take a look at a couple of simple empirical examples, first using R, and then using EViews. (Jane has been encouraging me to be more "open" in my choice of software!)

The R example uses the well-known "Airplane Passengers" (AP) time-series, and is based loosely on the analysis of Cowperwait and Metcalf (2009, pp. 109-118). The R script is available on this blog's code page, and it can be opened with any text editor. The logarithm of AP is regressed against a quadratic time trend and a bunch of Sine and Cosine terms of the form SIN(2πit) and COS(2πit); i = 1, 2, ..., 5:

The time series "APF" is the series of naive within-sample predictions, obtained by simply exponentiating the fitted values for log(AP). The time-series "APFAD" incorporates the adjustment term discussed above. In this particular case, s = 0.04811, so there's not much difference between APF and APFAD:

However, the sum of the squared (within-sample) prediction errors, based on AP and APF is 24936.24, while that based on AP and APFAD is 24840.21. So, there's a bit of improvement when we take the adjustment into account.

Now let's repeat this exercise using EViews. The associated workfile is on the code page for this blog, and the data are in a csv file on the data page. Here are the regression results using EViews:

(There are some small differences between the R and EViews results - this probably reflects the fact that the data were created in R, written to a file to only a limited number of decimal places, and then read into EViews.)

Then, I've selected the "Forecast" tab, and then chosen to forecast "AP", rather than "log(AP)", over the sample period:

The "modified" forecasts are easily obtained with the command:

and here are the results:

There are several interesting questions that I haven't dealt with here.

For example, what does the result, v = [exp(σ2) - 1] exp[2μ + σ2] , imply for possible modifications to the calculation of forecast intervals? [See Rob Hyman's comment, below.]

For instance, what if the errors are non-normal? Can (how should) we modify our naive forecasts in such cases?

So many questions....... so little time!

Reference

Cowperwait, P. S. P. and A. V. Metcalf, 2009. Introductory Time Series With R. Springer, Dordrecht.

1. Wooldridge's intro to econometrics has a very nice discussion about this issue. One thing to keep in mind is that if the model contains a lag dependent variable, things are more complicated.
Lutkpohl's recent paper argues that it is not always beneficial to do this adjustment in a VAR model ...

1. We need the regressors to be non-random. So, yes, that precludes lagged dependent variables. And that rules out VAR's.

2. What if we are interested in the expectation of the dependent variable (both in logs and in levels) that is conditional on the regressors? It seems to me that this would allow for all kinds of regressors, provided that we evaluate the expectation at particular values of theirs, no?

3. If that's what you're interested in. With lagged dependent variables, though, there's still a problem if you are doing dynamic forecasting.

2. Here are 2 nice, closely related blog posts from Arthur Charpentier, on the "Freakonometrics" blog:
http://freakonometrics.hypotheses.org/1311
http://freakonometrics.hypotheses.org/2447

3. Nice work David. But you don't have to worry about forecast intervals. The exponential of the usual intervals on the log scale will work fine because the probability coverage is preserved with monotonic transformations.

1. Thanks Rob!

4. Excellent post! But I have two doubts:
1 - Is it the same in case of a mixed effects model? And in this case “k” (the number of parameter to take into account to calculate degrees of freedom [n-k] for obtaining s2) is given just by the number of fixed effects?
2 – In Gamma Regressions on can just apply the exponential, right?

1. 1. Yes
2. Right

5. David, Have you confirmed that the errors from your model are free of structure. One way to do that is to submit it to AUTOBOX and see if any ARIMA structure is detectable or if any Pulses/Level Shifts,Seasonal Pulses or Local Time Trends are present and need to be remedied. Furthermore is the variance of the errors constant over time and are the parameters of the model constant over time. Please post your model residuals. Please post all of your analysis to "prove" that the Gaussian Assumptions are met and that you decomposed your series to signal and noise. Please send your results to me at dave@autobox.com.

1. David - there are certainly issues with this particular series, so perhaps I chose a poor "illustrative example"! Thanks for the great telephone conversation though!

6. Dave,

Thanks for this post. I am currently working exactly on an issue like this and I couldn't explain a strange residual term: Exactly the missing exp(sigma^2) term.

Thanks again.

7. Very nice post indeed. This issue has been studied in detail in the health economics literature under the name of "the retransformation problem". Accounting for non-normality under homoskedasticity is not a problem; the so-called smearing estimator of Duan should work well in this case and that is what is discussed in Wooldridge's introductory textbook. Under heteroskedasticity, however, things are much more serious and even the estimates of the coefficients do not have their usual interpretation as elasticities or semi-elasticities. That was the point made by Silva and Tenreyro in their "log of gravity" paper.

1. 8. Sorry for a stupid question, but what if we have a mixed log level model
let's say, log(y)=b0+b1*log(x1)+b2*x2+b3*log(x3), how do you forecast then?

1. HI Katherine - nothing changes in this situation. The key thing is the log of the dependent variable.

9. Thank you for the excellent explanation!

Out of curiosity - I suspect that problems caused by other transformations on the dependent variable don't resolve as cleanly. Is there a paper or book you'd recommend that deals with these back-transformations?

1. Hi - You're right. Here, things work because of the connection between the Normal and log-Normal distributions. I can't think of a general reference off-hand. Is there a particular transformation you're concerned about?

2. Yes, actually - I'm quite curious about the square root (so root(Y) = a + Xb + e).

3. O.K. using the link between a standard normal random variable and a Chi-Square with 1 d.o.f, you would square the prediction of root(y), and then add s-squared. I'll do a short post for you.

4. Here is a new post that fills in the details for you -
https://davegiles.blogspot.com/2019/03/forecasting-from-regression-with-square.html

10. To get around this, I typically interpret y_t^* = exp(log(y_t)) values in terms of the median (original scale) response.

11. That may be a useful approximation, but it's not exact. The adjustment that I discuss here has been well-known for a long time.

1. You are correct that what you show in the post is the way to get an estimator for the mean of the response on the original scale. But, if interest is in the median of the response on the original scale, taking the plain exponential should be exact because if log(Y) ~ Normal(m, v), then exp(m) is the median of Y.

2. That's correct. If that's your interest.

12. Shouldn't we use the standard error of the forecast rather than the rmse to de-bias?

1. Hi - for the life of me, I don't understand what you're referring to here. By all means elaborate, and I'll be happy to respond.

13. Hi. My first post went into a cyber black hole (!). I have a model as described above, ln(y)=XB. I understand that when I want to make a prediction of linear y I need to "de-bias" the prediction by including the s^2/2 term. My question is when we estimate the standard error of a forecast (stdf), stdf=s^2*x(X'X)invx' + s^2), where little x is a vector of independent variables, and big X is our data matrix, shouldn't we use the stdf to de-bias rather than the rmse from the regression? That is, yhat(linear)=exp(yhatln+stdf^2/2)?

Ref on stdf: https://www.stata.com/manuals13/p_predict.pdf

1. Hi - no. What we need is a suitable estimate of the variance of the error term - not the variance of the forecast.

14. Very useful article. I only wonder, what is the appropriate s^2 in case of a random effects (mixed effects) model?
I'd be inclined to use the mean of [ln(y)-Xb]^2 (effectively the sum of within and between-groups variation), would that be correct?