Thursday, November 21, 2013

Forecasting from a Regression Model

There are several reasons why we estimate regression models, one of them being to generate forecasts of the dependent variable. I'm certainly not saying that this is the most important or the most interesting use of such models. Personally, I don't think this is the case.

So, why is this post about forecasting? Well, a few comments and questions that I've had from readers of this blog suggest to me that not all students of econometrics are completely clear about certain issues when it comes to using regression models for forecasting.

Let's see if we can clarify some terms that are used in this context, and in the process clear up any misunderstandings.

First, we need a concrete model that we can refer to. Although "forecasting" doesn't necessarily require the use of time-series data, I'm going to assume we're using the latter. This will enable me to draw some important distinctions in what follows. Let's begin with a static linear regression model - that is, one in which there are no lagged values of the dependent variable entering as regressors:

y_t = β_1 + β_2 x_{2t} + β_3 x_{3t} + ... + β_k x_{kt} + ε_t .                                                   (1)

I'll assume that the error term, ε_t, is fully "well-behaved", and that our model is estimated from a sample of T observations.

I'll use b_i to denote the OLS estimator of β_i ;   i = 1, 2, ..., k.

When we look at the "fitted values", for our estimated model, namely:

y*_t = b_1 + b_2 x_{2t} + b_3 x_{3t} + ... + b_k x_{kt}   ;   t = 1, 2, ..., T                            (2)

we're just looking at the "within-sample" predictions of the estimated model. Notice that these predictions are constructed using the point estimates of the regression coefficients, and the actual observed values of the regressors. This information is fully available to us for all time observations, t = 1, 2, ...., T. Also, notice that (implicitly) in obtaining the fitted values, the error term has been set to its assumed mean value of zero.
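
As a minimal sketch of equation (2), here is a small Python illustration for the simplest case of a single regressor (k = 2). The data values and the helper name ols_simple are made up for illustration; they are not taken from the examples in this post.

```python
# Made-up sample: T = 6 observations on one regressor x and on y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]


def ols_simple(x, y):
    """Return the OLS estimates (b1, b2) for y_t = b1 + b2 * x_t + e_t."""
    T = len(y)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx
    b1 = y_bar - b2 * x_bar
    return b1, b2


b1, b2 = ols_simple(x, y)

# Fitted values: point estimates plus observed regressors,
# with the error term set to its assumed mean of zero.
fitted = [b1 + b2 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

Because the model includes an intercept, the OLS residuals sum to zero (up to rounding), which is a handy sanity check on the fitted values.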

Now, let's suppose that time passes by, and we have an additional n observations on y and all of the x variables. However, we still retain our OLS parameter estimates based on the original T observations. (Alternatively, it may have been the case that we had T+n sample values to start with, but we "held back" the last n of them for post-estimation model checking.)

In this case we can generate what we usually call "ex post forecasts" of the n additional observations. We know the values of these data, but they haven't been used in the estimation of the model. We can actually see how well our model performs when it comes to forecasting these n values, because we know exactly what actually happened.
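
The "hold back the last n observations" idea can be sketched in Python as follows. The numbers are invented, the model has a single regressor, and the percentage error is taken to be 100 × (actual − forecast) / actual, which is an assumption about the sign convention rather than something stated in this post.

```python
def ols_simple(x, y):
    """Return the OLS estimates (b1, b2) for y_t = b1 + b2 * x_t + e_t."""
    T = len(y)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx
    return y_bar - b2 * x_bar, b2


# Made-up data: T + n = 10 observations in total.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.3, 17.8, 20.4]

T, n = 7, 3  # estimate on the first T points, hold back the last n

# Estimation uses only the first T observations.
b1, b2 = ols_simple(x[:T], y[:T])

# Ex post forecasts: the regressor values for T+1, ..., T+n are observed.
forecasts = [b1 + b2 * xi for xi in x[T:]]

# The held-back y values let us measure how well the model did.
actuals = y[T:]
pct_err = [100.0 * (a - f) / a for a, f in zip(actuals, forecasts)]

for a, f, pe in zip(actuals, forecasts, pct_err):
    print(f"actual={a:6.2f}  forecast={f:6.2f}  pct_err={pe:6.2f}")
```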

Here's an example, using EViews. (The data for this example, and the other one below, are on the data page for this blog; and the EViews workfiles are on the code page.) REALCONS and REALDPI are real private consumption expenditure and real disposable personal income respectively. The two series are seasonally adjusted, and it is easily verified that they are each I(1) and they are cointegrated. So, we can regress one series on the other without the need for any differencing, to estimate the long-run relationship:

I have estimated the model with the sample ending in 1983Q4, even though I also have data for 1984Q1 to 1985Q4. I'll use these other 8 observations for ex post forecasting:

Now let's compare REALCONS and REALCONSF over the forecast period:

PERCERR is the percentage forecast error. Here's a plot of the same data:

In contrast to ex post forecasting, let's think about a situation that's more "real-life" in nature. Suppose that we've estimated our model, as before, using a sample of T observations. Then, we want to forecast for another n observations. At this point we don't know the actual values of y for these data-points. This is usually referred to as "ex ante forecasting". If we've estimated our model with forecasting in mind, this is exactly the situation in which we're going to find ourselves, in practice.

With this type of forecasting there's a practical problem that arises. It's obvious once you think about it. If we're going to apply equation (2), say for period (T + 1), then we need to have values of each of the x variables - in period (T + 1)! In practice we'll either have to insert "educated guesses" for these values, or (better still) we'll have to generate forecasts for the future values of these regressors from some other source. Often, simple ARIMA models are used for this purpose, provided that we are, indeed, using time-series data. Generating predictions of the regressors in order to facilitate forecasts of the dependent variable can be a major source of forecasting error.
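
As a rough illustration of this two-stage procedure, the Python sketch below first forecasts the regressor and then plugs those forecasts into the estimated equation. A first-order autoregression fitted by OLS stands in here for the "simple ARIMA model"; that choice, the data, and the function names are all assumptions made for the sketch, and a real application would want a properly specified time-series model.

```python
def ols_simple(x, y):
    """Return the OLS estimates (b1, b2) for y_t = b1 + b2 * x_t + e_t."""
    T = len(y)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx
    return y_bar - b2 * x_bar, b2


def ar1_forecast(series, n):
    """Fit x_t = a + rho * x_{t-1} + u_t by OLS and iterate it n steps ahead.

    This is a crude stand-in for an ARIMA forecast of the regressor.
    """
    a, rho = ols_simple(series[:-1], series[1:])
    path, last = [], series[-1]
    for _ in range(n):
        last = a + rho * last  # each step feeds on the previous forecast
        path.append(last)
    return path


# Made-up sample of T = 8 observations.
x = [10.0, 10.8, 11.5, 12.1, 12.9, 13.4, 14.2, 14.9]
y = [8.1, 8.9, 9.3, 10.0, 10.4, 11.0, 11.5, 12.2]

b1, b2 = ols_simple(x, y)        # estimated regression of y on x
x_future = ar1_forecast(x, 3)    # forecasts of x for T+1, T+2, T+3

# Ex ante forecasts of y: forecast x values replace the unknown future x's.
y_ex_ante = [b1 + b2 * xf for xf in x_future]
```

Any error in x_future feeds straight into y_ex_ante, scaled by b2, on top of the usual estimation and disturbance errors.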

Now, the discussion so far has (implicitly) been phrased in terms of what we usually call "static" forecasting. This term refers to the fact that our regression model (1) is "static" (rather than "dynamic") because none of the regressors are lagged values of y. Now let's amend model (1) to include a lagged value of the dependent variable among the regressors:

y_t = β_1 + β_2 y_{t-1} + β_3 x_{3t} + ... + β_k x_{kt} + ε_t .                                                   (3)

[We could have additional lags of y as regressors, and this wouldn't alter the following story, except in terms of the details.]

In this case, the estimated model can be used to obtain either ex post or ex ante forecasts for observation (T + 1) as follows:

y*_{T+1} = b_1 + b_2 y_T + b_3 x_{3,T+1} + ... + b_k x_{k,T+1} .                                            (4)

This is because at time (T + 1) we already know the value of y_{t-1}. (It's the observed value of y_T.)

However, when we get to the point of forecasting y for period (T + 2), there are actually two options open to us in the case of ex post forecasting.
1. We could insert the known value of y_{T+1} for y_{t-1} in the forecasting equation (together with values for x_{3,T+2}, x_{4,T+2}, ..., etc.).
2. Alternatively, we could insert the previously predicted value of y_{T+1}, namely y*_{T+1}, from (4), together with appropriate x values.
The same options remain for forecasting in periods (T + 3), (T + 4), ...., etc.

The first option above amounts to "static forecasting"; while the second option is called "dynamic forecasting", for what should now be obvious reasons.
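
The two options can be sketched in Python for the simplest version of model (3), with one lagged y and a single x. The coefficient values in b and the data below are hypothetical, chosen only to show the mechanics; they are not estimates from the consumption example.

```python
def static_forecasts(b, y_T, y_actual, x_future):
    """Option 1: each forecast uses the OBSERVED lagged value of y."""
    b1, b2, b3 = b
    lags = [y_T] + list(y_actual[:-1])  # y_T, y_{T+1}, ..., y_{T+n-1}
    return [b1 + b2 * lag + b3 * x for lag, x in zip(lags, x_future)]


def dynamic_forecasts(b, y_T, x_future):
    """Option 2: each forecast uses the PREVIOUS FORECAST as the lagged y."""
    b1, b2, b3 = b
    out, lag = [], y_T
    for x in x_future:
        f = b1 + b2 * lag + b3 * x
        out.append(f)
        lag = f  # feed the forecast back in for the next period
    return out


# Hypothetical estimates for y_t = b1 + b2 * y_{t-1} + b3 * x_t.
b = (0.5, 0.6, 0.4)

y_T = 10.0                            # last in-sample value of y
x_future = [14.0, 14.5, 15.0, 15.5]   # x values for T+1, ..., T+4
y_actual = [12.3, 13.5, 14.4, 15.2]   # held-back y values (ex post case)

static = static_forecasts(b, y_T, y_actual, x_future)
dynamic = dynamic_forecasts(b, y_T, x_future)

for s, d, a in zip(static, dynamic, y_actual):
    print(f"actual={a:6.2f}  static={s:6.2f}  dynamic={d:6.2f}")
```

The first-period forecasts coincide because both options start from the observed y_T. Note also that dynamic_forecasts needs no observed values of y beyond period T.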

When we're undertaking ex ante forecasting for two or more periods ahead, we actually have to use dynamic forecasting. There's no choice in the matter - in this situation we don't know the true values of the dependent variable outside the sample. Once again, future values for the x variables will have to be obtained in some way or other, and this can be a major exercise in itself.

Here's an extension of the consumption function example, this time with some (far too simple) dynamics:

I'm not saying that I'm particularly happy with this model. In particular, the residuals exhibit some autocorrelation. However, to illustrate the points discussed above, let's still generate some ex post static forecasts:

and some ex post dynamic forecasts:

Here are the results, again with percentage forecast errors:

Notice that the static and dynamic forecasts are identical in the first forecast period, as expected, but after that they differ. On average, the dynamic percentage forecast errors are greater than their static counterparts. They are also slightly more variable:

This is pretty typical - dynamic forecasting is usually more challenging than static forecasting, for fairly obvious reasons.

There's more that could be considered, of course. In particular, could we improve the quality of our forecasts by estimating an Error-Correction Model, given that the data are cointegrated? That's something I'll consider in a later post.

1. Many thanks for this post Prof. Giles. Could I request a follow-up post on in-sample forecast accuracy tests for 1 to n-step ahead forecasts?

1. I'll see what I can do :-)

2. Dave: What does your observation that C and Y are cointegrated imply for the underlying causal model?

1. Brian - if the variables are cointegrated, then there must be Granger causality either from C to Y, or vice versa, or both ways. In contrast, the existence of causality does NOT imply that there has to be cointegration.

3. This is extremely helpful, very clear explanation. Thank you very much!

4. Thank you very much for your explanation. So, there is no option when there are no observations after period T. Dynamic forecasting has to be used, although the prediction error is higher. My question is: what about indicators such as (i) root mean squared error, (ii) mean absolute percent error, (iii) bias proportion, (iv) variance proportion? Are they helpful in order to know the forecast performance?

1. Julio - yes, they certainly are.