Wednesday, October 30, 2019

Everything's Significant When You Have Lots of Data

Well........, not really!

It might seem that way on the face of it, but that's because you're probably using a totally inappropriate measure of what's (statistically) significant, and what's not.

I talked a bit about this issue in a previous post, where I said:
"Granger (1998, 2003) has reminded us that if the sample size is sufficiently large, then it's virtually impossible not to reject almost any hypothesis. So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this is really bad news. Much, much, smaller p-values are needed before we get all excited about 'statistically significant' results when the sample size is in the thousands, or even bigger."
This general point, namely that our chosen significance level should be decreased as the sample size grows, is pretty well understood by most statisticians and econometricians. (For example, see Good, 1982.) However, it's usually ignored by the authors of empirical economics studies based on samples of thousands (or more) of observations. Moreover, a lot of practitioners seem to be unsure of just how much they should revise their significance levels (or re-interpret their p-values) in such circumstances.

There's really no excuse for this, because there are some well-established guidelines to help us. In fact, as we'll see, some of them have been around since at least the 1970s.

Let's take a quick look at this, because it's something that all students need to be made aware of as we work more and more with "big data". Students certainly won't gain this awareness by looking at the interpretation of the results in the vast majority of empirical economics papers that use even sort-of-large samples!

The main result that I want to highlight is one that was brought into the econometrics literature by Leamer (1978).  (Take a look at Chapter 4 of his book, referenced below - and especially p.116.)

Let's set the scene by quoting from Deaton (2018, Chap. 2):
"The effect most noted by empirical researchers is that the null hypothesis seems to be more frequently rejected in large samples than in small. Since it is hard to believe that the truth depends on the sample size, something else must be going on.......... As the sample size increases, and provided we are using a consistent estimation procedure, our estimates will be closer and closer to the truth, and less dispersed around it, so that discrepancies that were undetectable with small samples will lead to rejections in large samples...........
Over-rejection in large samples can also be thought of in terms of Type I and Type II errors. When we hold Type I error fixed and increase the sample size, all the benefits of increased precision are implicitly devoted to the reduction of Type II error.........
Repairing these difficulties requires that the critical values of test statistics be raised with the sample size, so that the benefits of increased precision are more equally allocated between reduction in Type I and Type II errors. That said, it is a good deal more difficult to decide exactly how to do so, and to derive the rule from basic principles. Since classical procedures cannot provide such a basis, Bayesian alternatives are the obvious place to look."
And that's precisely what Leamer does. Also, see Schwarz (1978).

Suppose that we have a linear multiple regression model with k regressors and n observations, and we want to test the null hypothesis that a set of q independent linear restrictions on the regression coefficients is satisfied. The alternative hypothesis is that at least one of the restrictions is violated. Under the very restrictive assumptions that we usually begin with in this context, an F-test would be used, and the associated statistic would be F-distributed with q and (n - k) degrees of freedom if the null is true.

We would reject the null hypothesis if $F > F_c(\alpha)$, where $\alpha$ is the chosen significance level, and $F_c(\alpha)$ is the associated critical value. Alternatively, we would calculate the p-value associated with the observed value of F, and reject if this p-value is "small enough".

We're interested in situations where n is large - probably very large. So, we can ignore the distinction between n and (n - k). Asymptotically, qF converges to a Chi-Square statistic with q degrees of freedom if the null is true. This is equivalent to the Wald test. Moreover, this convergence still holds if the restrictions under test are nonlinear, rather than linear. We would reject the null if $qF > \chi^2_c(\alpha)$, where again the "c" subscript denotes the appropriate critical value.

It's the choice of α that has to be questioned here. Should we still set α = 10%, 5%, 1% if n is very, very large? (No, we shouldn't!)

Equivalently, if n is very big, what is the appropriate magnitude of the p-value below which we should decide to reject the null hypothesis? Or, equivalently again, how should the critical value for this test be modified in very large samples?

Leamer's result tells us that we should reject the null if $F > (n/q)(n^{q/n} - 1)$; or equivalently, if $qF = \chi^2 > n(n^{q/n} - 1)$.

It's important to note that this result is based on a Bayesian analysis with a particular approach to the diffuseness of the prior distribution.
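Here's a minimal sketch of how Leamer's critical values might be computed. The function names are my own (hypothetical), and the code simply evaluates the formula above; it uses math.expm1 only to avoid rounding error in $n^{q/n} - 1$ when q/n is tiny.

```python
import math


def leamer_f_critical(n: int, q: int) -> float:
    """Leamer's critical value for the F-statistic: (n/q) * (n**(q/n) - 1)."""
    # n**(q/n) - 1 == exp(q*log(n)/n) - 1; expm1 keeps precision when q/n is tiny
    return (n / q) * math.expm1(q * math.log(n) / n)


def leamer_chi2_critical(n: int, q: int) -> float:
    """Equivalent critical value for qF (the chi-square form): n * (n**(q/n) - 1)."""
    return q * leamer_f_critical(n, q)


def leamer_t_critical(n: int) -> float:
    """Critical value for |t| when a single restriction is tested (q = 1)."""
    return math.sqrt(leamer_f_critical(n, 1))


if __name__ == "__main__":
    # Sample sizes chosen purely for illustration
    for n in (1_000, 100_000, 10_000_000):
        print(n, round(leamer_f_critical(n, 5), 4), round(leamer_t_critical(n), 3))
```

The sample sizes in the loop are illustrative choices, not values taken from the tables in this post.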

Also, recall that if we have a t-statistic with v degrees of freedom, then $t^2$ is F-distributed with 1 and v degrees of freedom. So, if we are testing the significance of a single regressor (i.e., we are testing just one restriction), then Leamer's result tells us that we should reject the null that this coefficient is zero if $t^2 > n(n^{1/n} - 1)$. That is, we should reject against a 2-sided alternative hypothesis if $|t| > \sqrt{n(n^{1/n} - 1)}$. (Remember, q = 1 in this case.)

It's actually easy to check that $n(n^{1/n} - 1)$ is approximately equal to $\log_e(n)$ for large values of n. Indeed, if n is very, very large, this approximation is still excellent even if q > 1 (as long as q is finite). Consider the following numerical examples:

Table 1
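For anyone who'd like to see why the approximation works, here's a short sketch of the algebra. It just writes $n^{q/n}$ as an exponential and expands it; the remainder terms are stated loosely.

```latex
\begin{align*}
\frac{n}{q}\left(n^{q/n} - 1\right)
  &= \frac{n}{q}\left(e^{\,q\log_e(n)/n} - 1\right) \\
  &= \frac{n}{q}\left(\frac{q\log_e(n)}{n}
       + \frac{q^2\,[\log_e(n)]^2}{2n^2} + \cdots\right) \\
  &= \log_e(n) + \frac{q\,[\log_e(n)]^2}{2n} + \cdots
\end{align*}
```

The second term vanishes as n grows with q fixed, so the critical value for F approaches $\log_e(n)$. Setting q = 1 gives the t-test case.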

So, what this means is that for very big samples Leamer's rule amounts to the following:

Reject $H_0$: "q independent restrictions on the model's parameters are true", if $F > \log_e(n)$; or equivalently, if $\chi^2 > q\log_e(n)$.

How does this differ from what we'd do traditionally? (Remember, for large n we can ignore the distinction between n and (n - k).) Here are the corresponding critical F-values:

Table 2
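If you'd like to generate conventional large-sample critical values yourself, here's one possible sketch using only the Python standard library. It assumes n is large enough that qF can be treated as Chi-Square with q degrees of freedom, and the function names are hypothetical. The Chi-Square tail area is built up from the standard recurrence for the incomplete gamma function, and the critical value is found by bisection.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        tail, k = math.erfc(math.sqrt(half)), 1  # closed form for df = 1
    else:
        tail, k = math.exp(-half), 2             # closed form for df = 2
    while k < df:
        k += 2
        # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), with a = k/2 - 1
        tail += half ** (k / 2 - 1) * math.exp(-half) / math.gamma(k / 2)
    return tail


def chi2_critical(alpha: float, df: int) -> float:
    """Value c such that P(X > c) = alpha, located by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:  # expand until the tail area drops below alpha
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def conventional_f_critical(alpha: float, q: int) -> float:
    """Large-n critical value for F: the chi-square critical value divided by q."""
    return chi2_critical(alpha, q) / q


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        print(alpha, round(conventional_f_critical(alpha, 5), 2))
```

With q = 5, the three values printed should round to the 1.8, 2.2 and 3.0 quoted in the text. Notice that none of them depends on n, which is exactly the problem.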

We see that if n = 100,000 and q = 5, then using the F-test with conventional significance levels we'd reject the validity of the restrictions at the 10%, 5% and 1% significance levels if the F-statistic exceeded 1.8, 2.2, or 3.0 respectively. From Table 1, we see that the critical value in this case actually should be 11.5! You can quickly check for yourself that if we're applying a 2-sided t-test (q = 1), with n = 100,000, then we should reject the null hypothesis if $|t| > \sqrt{11.5162} = 3.394$.

So, using conventional measures, we'll reject the validity of the restrictions far more often than we should do. Youch!

To look at things from a different perspective we can ask, "what sort of significance levels are being implied by Leamer's suggestion, relative to the levels (10%, 5%, etc.) that we typically use in practice?"

Let's go back to Table 1, and focus on the last column of $\log_e(n)$ critical values. The associated significance levels are as follows:

Table 3

And in the case of the t-test example given beneath Table 2, the tail area beyond a critical value of 3.394 is 0.000345 in each tail, which corresponds to a significance level of roughly 0.0007 for the 2-sided test.
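For the single-coefficient case, the implied significance level is easy to sketch in code. This assumes that n is large enough to use the standard normal distribution for the t-statistic, and that the critical value is taken to be the square root of log(n). The function name is hypothetical. It returns the 2-sided tail area; halve it if you want the area in a single tail.

```python
import math


def implied_two_sided_level(n: int) -> float:
    """Two-sided N(0, 1) tail area beyond |t| = sqrt(log(n))."""
    t_crit = math.sqrt(math.log(n))
    # erfc(t / sqrt(2)) equals 2 * (1 - Phi(t)) for the standard normal cdf Phi
    return math.erfc(t_crit / math.sqrt(2.0))


if __name__ == "__main__":
    # Sample sizes chosen purely for illustration
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(n, implied_two_sided_level(n))
```

The implied level keeps shrinking as n grows, which is the whole point: there's no single "small enough" p-value that works for every sample size.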

As we can see, when n is very large, the significance levels that we should be using (or, equivalently, the p-value thresholds below which we should reject) are much smaller than the conventional levels that we tend to think of!

As an exercise, why don't you take a look back at one of your favourite applied econometrics papers that uses a very large sample size, and ask yourself, "do I really believe the conclusions that the author has reached?"

If you want to read more on this topic, I suggest that you take a look at Lin et al. (2013), and Lakens (2018).


Deaton, A. S., 2018. The Analysis of Household Surveys: A Microeconometric Approach to Development Policy. (Reissue edition with a new preface.) World Bank, Washington, D.C.

Good, I. J., 1982. C140. Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16, 65-66.

Granger, C. W. J., 1998. Extracting information from mega-panels and high frequency data. Statistica Neerlandica, 52, 258-272.

Granger, C. W. J., 2003. Some methodological questions arising from large data sets. In D. E. A. Giles (ed.), Computer-Aided Econometrics, CRC Press, Boca Raton, FL, 1-8.

Lakens, D., 2018. Justify your alpha by decreasing alpha levels as a function of the sample size. The 20% Statistician Blog.

Leamer, E. E., 1978. Specification Searches: Ad Hoc Inference With Nonexperimental Data. Wiley, New York. (Legitimate free download.)

Lin, M., H. C. Lucas Jr., and G. Shmueli, 2013. Too big to fail: Large samples and the p-value problem. Information Systems Research, 24, 906-917.

Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
© 2019, David E. Giles


  1. Thanks Dave. Very insightful. I want to add that the significance-or-not discussion should be complemented by (1) the impact and (2) the rationale. I have shared some thoughts on this here:

  2. Great article as always, Dave. Just for my own understanding, if we're running a regression with a large sample, we can check the significance of a regression coefficient (roughly) by using a t-statistic of (ln(N))^.5? Where N corresponds to sample size. Is my understanding correct?

