Friday, October 2, 2015

Illustrating Spurious Regressions

I've talked about spurious regressions in some earlier posts (here and here). I was updating an example for my time-series course the other day, and I thought that some readers might find it useful.

Let's begin by reviewing what is usually meant when we talk about a "spurious regression".

In short, it arises when we have several non-stationary time-series variables, which are not cointegrated, and we regress one of these variables on the others.

In general, the results that we get are nonsensical, and the problem only gets worse as the sample size increases. This phenomenon was observed by Granger and Newbold (1974), among others. Phillips (1986) developed the asymptotic theory and used it to prove that, in a spurious regression, the Durbin-Watson statistic converges in probability to zero; the OLS parameter estimators and R² converge to non-standard limiting distributions; and the t-ratios and F-statistic diverge in distribution, as T ↑ ∞.

Let's look at some of these results by means of a simple simulation experiment.
The data that I've used for this experiment are on the data page for this blog, and the EViews workfile is on the code page. (See the first page in the workfile.) The two series are generated as follows:

        X_t = X_{t-1} + u_t    ;    u_t ~ i.i.d. N[0 , σ² = 4]

        Y_t = Y_{t-1} + v_t    ;    v_t ~ i.i.d. N[0 , σ² = 1]    ;   cov.(u_t , v_τ) = 0   ;   all t, τ = 1, ..., T.

So, each of the two series has a unit root, they are totally independent of each other, and they aren't cointegrated. Here's what they look like for T = 50,000:
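
If you don't have EViews, the data-generating process above is easy to mimic. Here's a minimal Python sketch, assuming NumPy. The function name `simulate_random_walks`, the seed, and the `burn_in` argument are illustrative choices, not taken from the workfile, so the draws won't match the series used in this post.

```python
import numpy as np

def simulate_random_walks(T, burn_in=1000, seed=0):
    """Return (x, y): two independent driftless random walks of length T.

    Hypothetical helper for illustration. Both walks start from zero and
    the first `burn_in` values are dropped.
    """
    rng = np.random.default_rng(seed)
    n = T + burn_in
    u = rng.normal(0.0, 2.0, size=n)  # standard deviation 2, so variance 4
    v = rng.normal(0.0, 1.0, size=n)  # variance 1, drawn independently of u
    x = np.cumsum(u)                  # X_t = X_{t-1} + u_t
    y = np.cumsum(v)                  # Y_t = Y_{t-1} + v_t
    return x[burn_in:], y[burn_in:]

# 50,000 simulated values, of which the first 1,000 are discarded
x, y = simulate_random_walks(T=49_000)
print(x.shape, y.shape)
print(np.var(np.diff(x)), np.var(np.diff(y)))  # should be close to 4 and 1
```

Differencing each series recovers the innovations, which is a quick check that the variances are roughly what the design calls for.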

Now, let's see what happens when we regress Y on X, using OLS:

        Y_t = β_0 + β_1 X_t + ε_t    ;    t = 1, 2, ..., T.        (1)

First, when T = 200:

You'll see that I've discarded the first 1,000 values in the data so that the effects of the start-up condition (X1 = Y1 = 0) are "washed out".

The "spurious regression" effects show up clearly.

Even though there's actually no relationship between X and Y:
  • The value of R² is over 20%.
  • The t-statistic associated with the estimate of β1 is more than 7 in value (p = 0.0000).
Even though the true regression errors in the data-generating process must be serially independent:
  • The Durbin-Watson (DW) statistic is only 0.9, suggesting serious positive first-order autocorrelation.
  • The DW statistic is less than the R² value - a typical "spurious regression" result.
(As an aside, de Jong (2003) shows that similar results arise if the original series are log-transformed - something that we often see in practice.)
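
The statistics quoted above can all be computed directly from the OLS residuals. The following is a rough Python sketch (NumPy only) for one freshly simulated pair of series with T = 200. `ols_summary` is a hypothetical helper, and because the seed is arbitrary the numbers it prints will differ from the EViews output discussed above.

```python
import numpy as np

def ols_summary(y, x):
    """OLS of y on a constant and x: slope, its t-ratio, R-squared and DW."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = resid @ resid
    s2 = rss / (n - 2)                          # error variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)           # usual OLS covariance matrix
    t_slope = beta[1] / np.sqrt(cov[1, 1])
    r2 = 1.0 - rss / np.sum((y - y.mean()) ** 2)
    dw = np.sum(np.diff(resid) ** 2) / rss      # Durbin-Watson statistic
    return {"slope": beta[1], "t": t_slope, "r2": r2, "dw": dw}

# Assumed setup: 1,000 start-up values discarded, then T = 200 retained
rng = np.random.default_rng(42)
burn_in, T = 1000, 200
x = np.cumsum(rng.normal(0, 2, burn_in + T))[burn_in:]
y = np.cumsum(rng.normal(0, 1, burn_in + T))[burn_in:]
print(ols_summary(y, x))
```

Changing the seed a few times gives a feel for how often a "significant" slope and a low DW value turn up with unrelated series.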

Now let's increase the sample size to T = 49,000. Here are the results:

You can see that the R² has increased to 48%; the t-statistic for testing that β1 = 0 is now in excess of 212 in value; and the DW statistic is essentially zero.

In short, the spurious effects become even more pronounced, asymptotically!

In addition, let's look at a couple of the results established in Giles (2007). First, the value of the Jarque-Bera (JB) statistic, for testing the hypothesis that the regression errors are normally distributed, increases without limit as the sample size grows (here, from T = 200 to T = 49,000):
Further, the value of the Breusch-Pagan-Godfrey (BP) test statistic (Breusch and Pagan, 1979; Godfrey, 1978) for testing the homoskedasticity of the regression errors grows in value as the sample size is increased.
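
Both diagnostics are simple functions of the residuals from regression (1). Here's a minimal Python sketch that evaluates them at T = 200 and T = 49,000 for one simulated pair of series. The helper names are hypothetical, the seed is arbitrary, and the BP statistic is computed in its n·R² (Koenker) form, which is an assumption: the post does not say which variant of the test was reported.

```python
import numpy as np

def ols_residuals(y, X):
    """Residuals from an OLS regression of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def jarque_bera(resid):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), using moment-based S and K."""
    n = len(resid)
    e = resid - resid.mean()
    m2 = np.mean(e ** 2)
    skew = np.mean(e ** 3) / m2 ** 1.5
    kurt = np.mean(e ** 4) / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

def breusch_pagan_nr2(resid, X):
    """n * R^2 from regressing the squared residuals on X (Koenker form)."""
    e2 = resid ** 2
    aux = ols_residuals(e2, X)
    r2 = 1.0 - aux @ aux / np.sum((e2 - e2.mean()) ** 2)
    return len(resid) * r2

# Assumed setup: 50,000 simulated values, first 1,000 discarded
rng = np.random.default_rng(7)
n_total, burn_in = 50_000, 1000
x_full = np.cumsum(rng.normal(0, 2, n_total))[burn_in:]
y_full = np.cumsum(rng.normal(0, 1, n_total))[burn_in:]

for T in (200, 49_000):
    x, y = x_full[:T], y_full[:T]
    X = np.column_stack([np.ones(T), x])
    e = ols_residuals(y, X)
    print(T, jarque_bera(e), breusch_pagan_nr2(e, X))
```

With a single draw the comparison between the two sample sizes is only suggestive; the divergence result is asymptotic.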

All of the results above are based on just a single pair of artificial Y and X time-series.

To reinforce the points made here, I've also run a simple Monte Carlo experiment in which the two time series are replicated 5,000 times. The regression in (1) is then estimated by OLS with each set of data. Here's a summary of the results that I obtained:

The EViews program that I used for this is also available on the code page for this blog, and can be run with the second page of the original EViews workfile. (If you don't have access to EViews, you can open the program file with any text editor to take a look at the code.)
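
For readers who would rather experiment in Python, a Monte Carlo along the same lines can be sketched as follows. This is not a translation of the EViews program: the sample sizes, the 500 replications (fewer than the 5,000 used above, to keep it quick), the seed, and the function names are all assumptions made for illustration.

```python
import numpy as np

def spurious_stats(T, rng, burn_in=1000):
    """One replication: t-ratio on the slope, R-squared and DW for (1)."""
    n = T + burn_in
    x = np.cumsum(rng.normal(0, 2, n))[burn_in:]
    y = np.cumsum(rng.normal(0, 1, n))[burn_in:]
    xd, yd = x - x.mean(), y - y.mean()   # demeaning absorbs the intercept
    b1 = xd @ yd / (xd @ xd)
    resid = yd - b1 * xd
    rss = resid @ resid
    se = np.sqrt(rss / (T - 2) / (xd @ xd))
    r2 = 1.0 - rss / (yd @ yd)
    dw = np.sum(np.diff(resid) ** 2) / rss
    return b1 / se, r2, dw

def monte_carlo(T, reps=500, seed=123):
    """Summarise the replications for a given sample size T."""
    rng = np.random.default_rng(seed)
    out = np.array([spurious_stats(T, rng) for _ in range(reps)])
    t, r2, dw = out.T
    return {
        "reject_5pct": np.mean(np.abs(t) > 1.96),  # nominal 5% two-sided test
        "mean_abs_t": np.mean(np.abs(t)),
        "mean_r2": r2.mean(),
        "mean_dw": dw.mean(),
    }

for T in (50, 200, 1000):
    print(T, monte_carlo(T))
```

The rejection rate uses the standard normal critical value of 1.96, so it shows how far the actual size of the t-test is from its nominal 5% level when the regression is spurious.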


References

Breusch, T. S. and A. R. Pagan, 1979. A simple test for heteroskedasticity and random coefficient variation. Econometrica, 47, 1287-1294.

de Jong, R. M., 2003. Logarithmic spurious regressions. Economics Letters, 81, 13-21.

Giles, D. E. A., 2007. Spurious regressions with time-series data: Further asymptotic results. Communications in Statistics - Theory and Methods, 36, 967-979. (Free download here.)

Godfrey, L. G., 1978. Testing for multiplicative heteroscedasticity. Journal of Econometrics, 8, 227-236.

Granger, C. W. J. and P. Newbold, 1974. Spurious regressions in econometrics. Journal of Econometrics, 2, 111-120.

Phillips, P. C. B., 1986. Understanding spurious regressions in econometrics. Journal of Econometrics, 33, 311-340.

© 2015, David E. Giles


  1. Hi Dave, thanks for your illustrations. It's been a long time since I took a time-series course. Could you please tell me whether this problem goes away if I regress not only on the other time series but also on time as a variable?

    Sorry if this comment appears multiple times; when I click "publish" it just disappears.

    1. Achim - no, the problem definitely does NOT go away. The time variable addresses any (linear) "deterministic" trend in the data. Unit roots introduce "stochastic" trends - something entirely different. Indeed, if you see any regression where time is a regressor, be awfully suspicious of the results.