Tuesday, April 26, 2011

Drawing Inferences From Very Large Data-Sets

It's not that long ago that one of the biggest complaints to be heard from applied econometricians was that there were never enough data of sufficient quality to enable us to do our job well. I'm thinking back to a time when I lived in a country that had only annual national accounts, with the associated publication delays. Do you remember when monthly data were hard to come by; and when surveys were patchy in quality, or simply not available to academic researchers? As for longitudinal studies - well, they were just something to dream about!

Now, of course, that situation has changed dramatically. The availability, quality and timeliness of economic data are all light years away from what a lot of us were brought up on. Just think about our ability, now, to access vast cross-sectional and panel data-sets, electronically, from our laptops via wireless internet.

Obviously things have changed for the better, in terms of us being able to estimate our models and provide accurate policy advice. Right? Well, before we get too complacent, let's think a bit about what this flood of data means for the way in which we interpret our econometric/statistical results. Could all of these data possibly be too much of a good thing? 

To give us something concrete to focus on, let's consider some of the results from a paper by Mellor (2011) published recently in the well-known journal, Health Economics. I've computed and included the p-values that go with the estimated coefficients, because I want to focus on these in the following discussion. What follows is an adaptation of part of Table II from Mellor's paper. There are 8 sets of results in that table, and I'm reporting the two sets that have the most significant estimated coefficients. The models explain body mass index (BMI) for children aged 2 to 11 years, for the case where the mother smoked 100+ cigarettes by 1992:

Covariates:        Tax      Tax*Treatment      Price      Price*Treatment

Model 1:         -0.716         0.886
                 (2.94)        (2.31)
                [0.002]       [0.010]

Model 2:                                      -0.418          0.406
                                              (2.78)         (3.60)
                                             [0.003]       [0.0002]

The t-values are given in parentheses, and the implied one-sided p-values are in square brackets. A range of additional control variables are included in the models, and the 'significance' of these is not reported by the author. The extremely small R² values (0.083 and 0.075, respectively) are typical of models based on large cross-section data-sets. (Two of the 8 models reported in Mellor's Table II have R² = 0.0008.)
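For readers who want to check the bracketed figures, here is a minimal Python sketch of the calculation. It assumes a standard normal approximation to the t-distribution (reasonable with a sample of this size) and takes the upper tail beyond the absolute t-value. The function name `one_sided_p` is just an illustrative choice, not anything from Mellor's paper.

```python
from statistics import NormalDist


def one_sided_p(t_value: float) -> float:
    """One-sided p-value for a t-statistic, using a standard normal
    approximation (an assumption that is sensible only for large n)."""
    return 1.0 - NormalDist().cdf(abs(t_value))


# t-values as reported in the adapted table above (Mellor, 2011, Table II).
reported_t_values = {
    "Model 1, Tax": 2.94,
    "Model 1, Tax*Treatment": 2.31,
    "Model 2, Price": 2.78,
    "Model 2, Price*Treatment": 3.60,
}

for label, t_value in reported_t_values.items():
    print(f"{label}: t = {t_value:.2f}, one-sided p = {one_sided_p(t_value):.4f}")
```

Rounded to the precision used in the table, these should agree with the values in square brackets.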

I've chosen this paper as an example, not because I have any problem with it - I don't - but because I think it's an excellent example of careful applied microeconometrics. I hope that you'll find the time to read it. Importantly, this particular paper is typical of a lot of empirical micro studies - it's based on a pretty large set of data. Specifically, the above regression results are for a sample of 17,754 observations. This augurs well for any estimators or tests whose reliability depends on large-n asymptotics. I have no problem with that at all. However, when we look at the estimated coefficients and their associated standard errors, what do we see, and what should we conclude?

Granger (1998; 2003) has reminded us that if the sample size is sufficiently large, then it's virtually impossible not to reject almost any hypothesis. So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this is really bad news. Much, much smaller p-values are needed before we get all excited about 'statistically significant' results when the sample size is in the thousands, or even bigger. So, the p-values reported above are mostly pretty marginal, as far as significance is concerned. When you work out the p-values for the other 6 models I mentioned, they range from 0.005 to 0.460. I've been generous in the models I selected.
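To put Granger's point in rough numerical terms, here is a back-of-the-envelope Python sketch. It assumes a simple regression in which the t-statistic for the slope is approximately N(beta * sd_x * sqrt(n) / sd_error, 1) when n is large, so that the chance of rejecting a zero slope with a one-sided test depends on n only through sqrt(n). The slope of 0.01, the unit standard deviations and the function name `rejection_probability` are all illustrative assumptions, not values taken from any of the papers discussed here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rejection_probability(beta, n, sd_x=1.0, sd_error=1.0, alpha=0.05):
    """Approximate probability of rejecting H0: slope = 0 against a positive
    alternative, using the large-sample normal approximation."""
    # Mean of the t-statistic when the true slope is beta (assumed setup).
    noncentrality = beta * sd_x * sqrt(n) / sd_error
    critical_value = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(critical_value - noncentrality)


# A tiny, fixed departure from the null, and ever larger samples.
for n in (100, 10_000, 100_000, 1_000_000):
    prob = rejection_probability(beta=0.01, n=n)
    print(f"n = {n:>9,}: P(reject at 5%) is roughly {prob:.3f}")
```

Under these assumptions the rejection probability starts close to the nominal 5% and climbs towards one as n grows, even though the departure from the null never changes.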

Here's another set of results taken from a second, really nice, paper by Ciecierski et al. (2011) in the same issue of Health Economics:

Covariates:            Current Expend.    Lagged Expend.    Current + Lagged Expend.

Full sample

Past Year Smokers          -0.0001             0.01                 0.01
(n = 13,099)               (-0.00)            (0.32)               (0.26)
                            [0.50]            [0.37]               [0.40]

Daily Smokers              -0.02               0.06                 0.01
(n = 8,754)                (-0.67)            (1.67)               (0.43)
                            [0.25]            [0.05]               [0.33]


I've reproduced the results for 2 of the 6 Probit models that these authors report in Table II of their paper. Again, the sample sizes are quite large. The models look at the effect of tobacco control program expenditures on individuals' attempts to quit smoking in the previous 12 months. Again, I've computed and added the p-values in square brackets to put things in context. The t-statistics are in parentheses. You can see that none of the p-values are less than 5%. The p-values for the other 4 models range from 0.005 to 0.45, so again, I've tried to be fair in what I've selected here.

The authors of this paper are careful to note the lack of association implied by many of their results. In this study, these 'negative' results are actually even more convincing than they appear to be when we interpret them with the usual 10%, 5%, and 1% significance levels in the back of our mind.

Now, in contrast, let's consider some results drawn from a recently published paper that I co-authored some time back with Janice Xie. The study in question (Xie & Giles, 2011) is a duration analysis of the time delay between the application for a patent in the U.S., and the final award of that patent. Various 'accelerated failure time' survival models were estimated by maximum likelihood, using a sample based on nearly 2 million patents. There should be enough data here to keep anyone happy, shouldn't there?

The preferred model (based on AIC) used a log-normal distribution. I'm not going to reproduce the estimates for the 25 parameters here. However, I can tell you that 23 of the associated p-values are zero, to four decimal places; one is 0.00015; and the largest one is 0.025. With the exception of the last one, these p-values are much more of the order of magnitude we'd be hoping for with a sample this large. The EViews workfile for estimating our preferred model is pretty large, but if you're keen you can download it from the Code page that goes with this blog.

I, for one, am very happy that we have easy access to large data-sets. Of course we're better off overall, and it's just great to see some of the creative research that can now be undertaken. The two papers from Health Economics are good examples of this. However, with more data comes more responsibility - namely, to interpret our inferences appropriately. It's something to keep in mind when you read people's interpretations of their empirical research - some results really aren't quite as significant as they're made out to be.

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.

References

Ciecierski, C. C., P. Chatterji, F. J. Chaloupka, & H. Wechsler (2011). Do state expenditures on tobacco control programs decrease use of tobacco products among college students? Health Economics, 20, 253-272.

Granger, C. W. J. (1998). Extracting information from mega-panels and high frequency data. Statistica Neerlandica, 52, 258-272.

Granger, C. W. J.  (2003). Some methodological questions arising from large data sets. In D. E. A. Giles (ed.), Computer-Aided Econometrics, CRC Press, 1-8.

Mellor, J. M. (2011). Do cigarette taxes affect children's body mass index? The effect of household environment on health. Health Economics, 20, 417-431.

Xie, Y. & D. E. Giles (2011). A survival analysis of the approval of U.S. patent applications. Applied Economics, 43, 1375-1384.


© 2011, David E. Giles

5 comments:

  1. As a reader of the blog I should probably understand p-values better. Having said that, a p-value of 0.01, with any number of samples, is just as indicative of the falseness of the null hypothesis (as, given the null, the p-value has the same distribution). For large samples, large p-values should accompany minuscule alternative hypotheses, which can make them uninteresting. A combination of large effects and large p-values can indicate looseness of the model (regardless of sample size), which is prone to over-fitting (another familiar malaise). Do I make sense?

  2. Fantastic post - I often get into arguments about large sample size and barely statistically significant p-values, so it's good to have an easily accessible blog post to point to.

  3. Anonymous: Thanks for the very useful comment. And yes, you do make sense. I think I could have expressed a couple of points a little better in the posting.

    Let me try again. I agree (and discussed why, in an earlier posting) that the null distribution of the p-value is independent of the sample size – indeed it is simply uniform on [0, 1]. This is not the case under the alternative.

    Quoting from Granger (1998, p.260) : “In simple cases where confidence intervals are O(1/n) they will be effectively zero [in length; D.G.], so that virtually any parsimonious parametric model will produce a very low p-value and will be strongly rejected by any standard hypothesis test using the usual confidence levels. Virtually all specific null hypotheses will be rejected using present standards [5%, 1% significance levels; D.G.].”

    It’s worth adding that in the case of cointegrated data, where the convergence rate for the asymptotics is O(n), rather than O(n^(1/2)), the confidence interval lengths are O(n^(-2)), and the situation is even more extreme!

    The point that we should take away from the posting is that if the null is just a tiny bit false – not to the extent that it really makes any economic difference – then we’ll always reject that null if we have a big enough sample size, & we use conventional significance levels. To guard against this it would be wise to dramatically reduce the magnitude of the p-value that we require to reject the null.

    Take an example – estimating a simple OLS regression model under ideal assumptions. We'll test if the slope coefficient is zero, against the alternative that it's positive, using the t-test. This test is UMP against 1-sided alternatives. The test statistic is standard normally distributed, under the null, if n is large.

    I’ve conducted a Monte Carlo experiment (1,000 replications), generating the y data using an intercept coefficient of unity, a slope coefficient of 0.01, and i.i.d. N[0,1] errors. The null we’re testing is slightly false. If the data were measured in logarithms, the slope coefficient would be an elasticity, and in economic terms it probably makes very little difference if that elasticity is zero or 0.01.

    The EViews workfile and program file for the experiment are on this blog's Code page.

    For n = 100; 5,000; 50,000; and 100,000, the averages of the 1,000 t-statistics and of their 1-sided p-values are, respectively: 0.088 and 0.252 (n = 100); 0.676 and 0.217 (n = 5,000); 2.270 and 0.054 (n = 50,000); and 3.193 and 0.022 (n = 100,000). (Note that I have reported the average of the 1,000 p-values, not the p-value of the average of the 1,000 t-statistics.)

    For this example, with a sample of 50,000 observations, the false null is not quite rejected at the 5% significance level, but we’d reject it at the 10% level, on average. We finally get a rejection at both these significance levels when n = 100,000. But is there any real economic sense in rejecting? After all, the true value of the parameter is 0.01 (rather than exactly zero).

    If we then increase the sample size to n = 250,000, the averages of the t-statistics and their p-values are 4.993 and 0.001, respectively. Now I’d say there’s a clear rejection of the null.

    It’s in this sense that we probably should be insisting on really, really small p-values before we reject the null when the sample size is extremely large.

    Incidentally, I see that Andrew Gelman was blogging on a similar point a couple of years ago, albeit in a slightly different context. See http://www.stat.columbia.edu/~cook/movabletype/archives/2009/06/the_sample_size.html

    Anonymous – if I’m reading your comment correctly, I don’t think we’re in disagreement on any of this. Again, thanks for the helpful input.

  4. Sinclair: Thanks for the kind feedback. Glad that the post was helpful.

  5. I think the problem is the confusion between statistical and economic significance. In your example of 0.01 you are not wrong in rejecting the null. After all, 0.01 is not 0, and that difference could be important in some context. Maybe we must change the nulls we test (e.g., H0: b < 0.1, or whatever we think is economically relevant) instead of changing the way we interpret p-values.

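For anyone without EViews, the Monte Carlo experiment described in comment 3 above can be sketched in plain Python along the following lines. This is not the program from the Code page, and several details are assumptions made purely for illustration: the regressor is drawn as i.i.d. N[0,1] and redrawn in every replication, the p-values use the standard normal approximation, the seed is arbitrary, and the default number of replications and the sample sizes in the demonstration loop are kept small so that the pure-Python code runs quickly. The function names are hypothetical, and the numbers produced need not match those reported in the comment.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def slope_t_statistic(x, y):
    """t-statistic for H0: slope = 0 in an OLS regression of y on an
    intercept and x."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(rss / (n - 2) / s_xx)
    return slope / std_error


def monte_carlo(n, slope=0.01, intercept=1.0, replications=200, seed=12345):
    """Average t-statistic and average 1-sided p-value over the replications."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    t_stats, p_values = [], []
    for _ in range(replications):
        # Assumed design: x ~ N[0,1], redrawn each time; errors ~ i.i.d. N[0,1].
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [intercept + slope * xi + rng.gauss(0.0, 1.0) for xi in x]
        t = slope_t_statistic(x, y)
        t_stats.append(t)
        # 1-sided p-value against the alternative of a positive slope.
        p_values.append(1.0 - STD_NORMAL.cdf(t))
    return fmean(t_stats), fmean(p_values)


for n in (100, 2_000):
    mean_t, mean_p = monte_carlo(n)
    print(f"n = {n:>6,}: average t = {mean_t:.3f}, average p = {mean_p:.3f}")
```

Setting `slope=0.0` gives a quick check on the point about the null distribution: the average p-value should then sit near 0.5 whatever the sample size, as it would for p-values that are uniform on [0, 1]. Sample sizes in the hundreds of thousands are feasible with this code but slow; a vectorised library would be the natural choice there.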