Thursday, June 16, 2011

Obesity, Driving, & Unbalanced Regressions

The Economist magazine's daily chart of 15 June 2011 dealt with a recent study on the relationship between obesity and the amount of driving. This study is reported in a paper by Jacobson et al. (2011), which is "in press" at the journal Transport Policy. This paper will bring tears to all econometric eyes! Not tears of joy, either.

There are so many things that one could say about it that it's really difficult to know where to start - but I'll try!

Before we start, though, a word about the charts in the 15 June chart blog in The Economist. There are two of them - look at the one on the left. Are they really plotting the correlation between the two variables in question? I don't think so.  Anyway, that's not my real gripe. My problem is with the paper that's being published by Transport Policy (TP). And judging by the nature of the comments on The Economist's blog, I'm not alone.

From its website, I see that TP has an impact factor of 1.024. This just goes to show how misleading impact factors can be - it's quite possible to get the impact factor above unity simply by publishing material that infuriates people so much that everyone has to cite it in order to take issue with it! Nice trick if you can get away with it!

Anyway, to the paper. My gripes are about the analysis, not the authors, who undoubtedly are decent folk. First, here is a plot of the data that are used to get the main results in the Jacobson et al. paper. Just two series, and only for the years 1995 to 2007 inclusive:

The main idea of the paper is that the amount of driving may have an immediate or lagged effect on the obesity rate.  Something along the lines of: "More driving implies less exercise, implies a higher obesity rate". The quantitative analysis is based on simple (one regressor) regressions with the obesity rate as the dependent variable. The assumed causality is explicitly in the direction just stated. What about the possibility that highly obese people can't get around too quickly on foot, and are more inclined to drive? (I'll come back to this.) And what about all of the other relevant covariates that are not controlled for? Diet, family history, for instance?

The "partial" nature of the empirical analysis is more reminiscent of one of my tongue-in-cheek blog postings than of a serious piece of published research! And it only gets better (or worse, depending on your perspective).

In a nutshell (which is where this paper belongs), the authors regress the Obesity Rate in a given year on (only) vehicle miles travelled (VMT) per Licensed Driver (LD). The regressor (VMT/LD) is either for the current year, or lagged one year, or lagged (just) two years, and so on - up to a ten-year lag.

The authors look at the R² values for these simple regressions and find that the largest such value arises when the regressor has a 6-year lag. It's hardly surprising that quite large R² values are obtained in the original paper, given the strong upward-sloping trends in the 2 variables! What do you expect?
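To see why trends alone can produce this pattern, here is a minimal sketch of the lag scan. It does not use the Jacobson et al. data: the two series are synthetic stand-ins (a linear trend plus small independent noise), and the names `ols_r2`, `lagged_fit`, `x_full` and `y` are hypothetical. The only features borrowed from the post are the shape of the exercise (13 observations on the dependent variable, a longer history for the regressor, lags of 0 to 10).

```python
import random


def ols_r2(x, y):
    """Simple OLS of y on x with an intercept; returns (slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def lagged_fit(y, x, lag):
    """Regress y[t] on x[t - lag].

    Assumes x ends in the same period as y and is at least `lag`
    observations longer, so every lag uses all of y.
    """
    start = len(x) - len(y) - lag
    if start < 0:
        raise ValueError("regressor history is too short for this lag")
    return ols_r2(x[start:start + len(y)], y)


# Synthetic data (an assumption, NOT the paper's numbers):
# 23 periods of the regressor, 13 periods of the dependent variable.
rng = random.Random(0)
x_full = [10.0 + 0.2 * t + rng.uniform(-0.05, 0.05) for t in range(23)]
y = [15.0 + 0.8 * t + rng.uniform(-0.2, 0.2) for t in range(13)]

r2_by_lag = {lag: lagged_fit(y, x_full, lag)[1] for lag in range(11)}
for lag, r2 in r2_by_lag.items():
    print(f"lag {lag:2d}: R-squared = {r2:.3f}")
```

The noise terms in the two series are independent, so the only thing they share is the trend. Every lag still fits well, and which lag "wins" is decided by noise.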

All of the data are supplied in Table 1 of the Jacobson et al. paper. They're also in an Excel workbook that is linked on the Data page for this blog.

All of the data in Table 1 of the paper are reported to just one decimal place. I used the data as supplied, so it may be the case that some of the numbers that the authors used in their regression analysis were actually recorded and stored to more decimal places than this. I get slightly different results than they do, and this may be the reason why. However, none of this alters the main points that I want to make here.

Unless stated otherwise, the following results were obtained using the EViews workfile on the Code page for this blog.

Table 1

Lag   Slope Coeff.   J-B (p-val)   D-W (+ve p-val)   R-Squared
  0       8.479         0.016           0.001           0.863
  1       7.469         0.193           0.003           0.945
  2       6.796         0.802           0.129           0.960
  3       6.237         0.595           0.021           0.971
  4       5.926         0.799           0.068           0.972
  5       5.823         0.565           0.712           0.984
  6       5.775         0.812           0.196           0.989
  7       5.684         0.471           0.353           0.972
  8       5.047         0.560           0.047           0.953
  9       4.521         0.556           0.022           0.936
 10       4.256         0.702           0.001           0.952


In Table 1, I report the estimated slope coefficients, the R² values, and the p-values for the Jarque-Bera (J-B) normality test and the Durbin-Watson (D-W) test for independent errors (against the alternative of positive AR(1) errors). Keep in mind that the first of these tests is valid only asymptotically, and we have a ridiculously small sample size here: n = 13. The p-values for the D-W test are exact, for this sample size and this set of data, and are computed using the SHAZAM package.
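For readers who want to see what sits behind those two columns, here is a minimal sketch of the two test statistics, using their standard textbook definitions. The function names and the example residuals are hypothetical. The J-B p-value uses the asymptotic chi-square(2) distribution, so it is only a rough guide with n = 13. The sketch does not attempt the exact D-W p-values, which in the post came from SHAZAM.

```python
import math


def durbin_watson(residuals):
    """D-W statistic: sum of squared changes in the residuals over their sum of squares.

    Values well below 2 point to positive first-order autocorrelation,
    values well above 2 to negative autocorrelation.
    """
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def jarque_bera(residuals):
    """J-B statistic: (n / 6) * (skewness^2 + (kurtosis - 3)^2 / 4)."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


def jarque_bera_pvalue(jb_stat):
    """Asymptotic p-value: the upper tail of a chi-square(2) is exp(-x / 2)."""
    return math.exp(-jb_stat / 2.0)


# Made-up residuals (an assumption for illustration): long runs of the same
# sign, which is what positively autocorrelated errors tend to look like.
example_residuals = [0.4, 0.5, 0.3, 0.1, -0.2, -0.4, -0.5, -0.3, -0.1, 0.1, 0.2, 0.3, 0.2]
print("D-W statistic:", round(durbin_watson(example_residuals), 3))
jb = jarque_bera(example_residuals)
print("J-B statistic:", round(jb, 3), " asymptotic p-value:", round(jarque_bera_pvalue(jb), 3))
```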

You can compare my estimated coefficients and R² values with those reported by Jacobson et al. As I mentioned already, they're a little different, but the maximum R² value does indeed arise with a 6-year lag on the regressor. Although not reported, the p-values for the t-tests of the hypothesis that the slope coefficient is zero are really tiny. We appear to have estimated highly significant relationships here. But so what?

Looking at the diagnostic tests for normality and independence of the errors, we can see that there are problems with non-normality when the regressor is current-period (VMT/LD); and there are widespread problems with serial correlation. Interestingly, there's no discussion of any diagnostic testing at all in the paper. Even getting to this stage, it's clear that the t-tests will be highly suspect! And why might the serial correlation be arising? Almost certainly because of omitted regressors - all of those control variables that simply aren't there.

These results should also start you thinking about "spurious regressions"! In Table 1 we have high R² values, and low D-W values (as reflected by some very small p-values for the latter). This sounds like a situation where the data are non-stationary, but not cointegrated. Again, stationarity is not something that was considered in the paper, but let's take up this point - just for funzies.

The values of the statistics for the augmented Dickey-Fuller tests are -4.296 (p = 0.0306) and -0.3844 (p = 0.9815) for OBESITY and (VMT/LD) respectively. Here, a drift and trend are included in the Dickey-Fuller regressions, and I've used the full sample of data available - so that's 1985 to 2007 in the case of (VMT/LD). If I use the shorter sample of 1995 to 2007 for this variable to match the sample for the OBESITY data, the ADF test statistic is 1.9525 (p = 1.0000).

These results suggest that OBESITY is I(0), while (VMT/LD) is I(1). Applying the KPSS test leads to the same conclusions. Keep in mind, though, that we have a really small sample here!
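As a rough illustration of what a unit-root test is doing, here is a minimal sketch of the simplest Dickey-Fuller regression. It is deliberately cut down, so it will not reproduce the numbers above: it includes an intercept only (the tests in this post include drift and trend), it has no augmentation lags, and it returns only the test statistic. The function name and the two simulated series are hypothetical.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-ratio on rho in the regression: dy[t] = alpha + rho * y[t-1] + error.

    A large negative value is evidence against a unit root. The statistic has
    to be compared with Dickey-Fuller critical values, not Student-t ones.
    """
    y_lag = series[:-1]
    dy = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(dy)
    mean_lag = sum(y_lag) / n
    mean_dy = sum(dy) / n
    sxx = sum((v - mean_lag) ** 2 for v in y_lag)
    sxy = sum((v - mean_lag) * (d - mean_dy) for v, d in zip(y_lag, dy))
    rho = sxy / sxx
    alpha = mean_dy - rho * mean_lag
    ssr = sum((d - alpha - rho * v) ** 2 for v, d in zip(y_lag, dy))
    se_rho = math.sqrt(ssr / (n - 2) / sxx)
    return rho / se_rho


# Simulated series (assumptions for illustration, not the paper's data).
rng = random.Random(1)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(23)]  # stationary by construction

random_walk = [0.0]
for _ in range(22):
    # Unit root with drift, by construction.
    random_walk.append(random_walk[-1] + 0.2 + rng.gauss(0.0, 0.1))

print("DF statistic, white noise:", round(dickey_fuller_stat(white_noise), 3))
print("DF statistic, random walk:", round(dickey_fuller_stat(random_walk), 3))
```

In practice you would use a packaged routine that handles the trend term, the lag selection and the p-values. In Python, for example, `statsmodels.tsa.stattools.adfuller` with `regression="ct"` fits the drift-and-trend version.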

This isn't quite the standard "spurious regression" situation, but it's close.

Now, we have a bit of a problem. Or, more correctly, there's a BIG problem with this paper - one that the authors either didn't want to grapple with, or that they simply weren't aware of. Given that there's no mention of non-stationarity, and the paper draws strong conclusions from a tiny sample of strongly trended data, and even goes so far as to report out-of-sample forecasts, I strongly suspect that it's the latter reason.

In any event, the various OLS regressions that they have estimated are "unbalanced" - the dependent variable is an I(0) variable and the regressor is I(1). Naughty! Naughty! You're not going to get any meaningful results that way. In addition, there are all of the obviously important factors that they haven't controlled for in their regressions, but we don't even need to go there. The point is, you have to know an ADF test from a multiple choice test when working with time-series data.

Now, we have two options (if we really want to play this rather foolish game of fitting regressions of this type, with so little data). One option is simply to first-difference the (VMT/LD) data, but leave the OBESITY data in levels. Then the fitted regressions will be "balanced" - both variables will be I(0). A second option is to difference both (VMT/LD) and OBESITY. If we difference the OBESITY variable it won't be I(0) any more, but it will still be stationary, as will the regressor (after the differencing). There's a risk in this second case of over-differencing, but we can see how the two approaches work out.
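The mechanics of the two options can be sketched as follows. The data here are constructed, not taken from the paper: each series is a steady trend plus a small repeating wobble, and the wobbles are chosen so that the period-to-period changes in the two series are exactly uncorrelated. In other words, the two series share nothing but a trend. All of the names are hypothetical.

```python
def first_difference(series):
    """Return the period-to-period changes; the result is one observation shorter."""
    return [b - a for a, b in zip(series[:-1], series[1:])]


def r_squared(x, y):
    """R-squared from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def cumulate(start, steps):
    """Build a levels series from a starting value and a list of changes."""
    levels = [start]
    for step in steps:
        levels.append(levels[-1] + step)
    return levels


# Constructed changes (an assumption): a constant drift plus wobbles whose
# sign patterns (+,-,...) and (+,+,-,-,...) are orthogonal to each other.
x_steps = [0.2 + 0.05 * sign for sign in (1, -1) * 6]
y_steps = [0.8 + 0.2 * sign for sign in (1, 1, -1, -1) * 3]
x_levels = cumulate(10.0, x_steps)  # 13 observations
y_levels = cumulate(15.0, y_steps)  # 13 observations

# Both variables in levels.
r2_levels = r_squared(x_levels, y_levels)
# Option 1: difference the regressor only; drop the first y to line the series up.
r2_option1 = r_squared(first_difference(x_levels), y_levels[1:])
# Option 2: difference both variables.
r2_option2 = r_squared(first_difference(x_levels), first_difference(y_levels))

print(f"levels:   R-squared = {r2_levels:.3f}")
print(f"option 1: R-squared = {r2_option1:.3f}")
print(f"option 2: R-squared = {r2_option2:.3f}")
```

With these made-up series the fit in levels is close to perfect, and it vanishes once the common trend is differenced out. Real data need not behave this cleanly, but this is the mechanism to watch for.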

Tables 2 and 3 provide the answers. Table 2 is for the case where just (VMT/LD) is differenced; and Table 3 is for the case where both variables are differenced. Where did those nice big juicy R² values go to? Surprise, surprise!

Table 2

Lag   Slope Coeff.   p-val   R-Squared   D-W (+ve p-val)
  0     -24.018      0.000     0.594         0.030
  1     -24.723      0.005     0.366         0.036
  2     -27.460      0.003     0.395         0.040
  3     -21.051      0.002     0.176         0.001
  4      -9.636      0.497     0.048         0.000
  5      -5.047      0.652     0.011         0.000
  6       0.383      0.975     6E-05         0.000
  7     -12.435      0.060     0.220         0.000
  8     -12.605      0.052     0.228         0.000
  9     -12.027      0.040     0.202         0.000
 10     -10.560      0.037     0.193         0.000


In Tables 2 and 3, the p-values associated with the t-test on the slope coefficient are 2-sided p-values. In Table 2, the Newey-West correction is used for computing the standard errors in all cases, to allow for the obvious serial correlation. Here, this has the effect of reducing the p-values for the t-test on the slope coefficient. The smallest p-value for the J-B normality test is 0.631 (when the lag length is 7), so we can take the errors to be normally distributed.

Even if you were to buy into the results in Table 2 (and I'm certainly not), look what you now find. The largest R² value is now when there is no lag on the regressor. Ironically, with the previously preferred lag of six years, we see the smallest R² value. It's essentially zero! Gee, I can't guess why the results are so sensitive to the transformation of the data!

Table 3

Lag   Slope Coeff.   p-val   R-Squared   D-W (+ve p-val)   D-W (-ve p-val)
  0      -0.609      0.624     0.014         0.913             0.087
  1       1.041      0.404     0.028         0.854             0.146
  2      -3.424      0.134     0.216         0.882             0.118
  3       1.071      0.665     0.020         0.895             0.105
  4      -0.491      0.870     0.004         0.933             0.067
  5      -0.135      0.945     3E-04         0.921             0.079
  6       3.790      0.082     0.251         0.870             0.130
  7      -3.866      0.054     0.263         0.932             0.068
  8      -0.432      0.469     0.010         0.946             0.054
  9      -1.799      0.213     0.194         0.822             0.178
 10       1.080      0.289     0.070         0.857             0.143



In Table 3, we see that in (at least) half of the regressions there is significant negative autocorrelation. The smallest D-W value is 2.56 (when the lag length is 9). This is a signal of the over-differencing of the dependent variable that I mentioned earlier. In this table the Newey-West correction for the standard errors has been used again, just to be on the safe side. Although not reported, the smallest p-value for the J-B normality test is 0.107 (when the lag length is 10). All of the other p-values for this test exceed 0.4. Non-normality of the errors doesn't appear to be an issue.

Also in Table 3, very few of the slope coefficients are significantly different from zero, and the R² values are (unsurprisingly) small, with the largest value occurring for a lag length of seven years. "Explaining" the variation in changes isn't as easy as "explaining" the variation in levels when you have trended time-series data, is it? Another way to think about these R² values in Table 3 is that only two of them are significantly different from zero. This is a far cry from the conclusions reached in the paper!

So where does this leave us? Basically, exactly where we should be if we have a negligible amount of trended time-series data, and a model that is hopelessly under-specified - with absolutely nothing! It seems clear to me that the correlations "discovered" by Jacobson et al. are totally spurious artifacts of the non-stationary data. Once this problem is dealt with, the sensitivity of the results becomes clear, as does the gross mis-specification of the regressions.

I'll leave you with one last thought. Remember my comment about the regressions being set up in a way that assumes that causality runs from distance travelled to the obesity rate? Could it be the other way around? Or even bi-directional? One simple exercise that you could now go through, using the EViews file I've supplied,  would be to test the exogeneity of the regressor. If it's not exogenous, then the OLS estimator that Jacobson et al. (and I) used would be totally inappropriate. Oh dear! Now there's another set of issues to worry about - and still not enough data to handle them properly!

Another case of "Much Ado About Nothing".


© 2011, David E. Giles

