Friday, April 8, 2011

May I Show You My Collection of p-Values?

Tom Thorn kindly sent me the following link, which is certainly pertinent to this posting:

It's always fun to start things off with a snap quiz - it wakes you up, if nothing else. So here we go - multiple choice, so you have at least a 20% chance of getting the correct answer:

Question:    Which of the following statements about p-values is correct?

1. A p-value of 5% implies that the probability of the null hypothesis being true is (no more than) 5%.
2. A p-value of 0.005 implies much more "statistically significant" results than does a p-value of 0.05.
3. The p-value is the likelihood that the findings are due to chance.
4. A p-value of 1% means that there is a 99% chance that the data were sampled from a population that's consistent with the null hypothesis.
5. None of the above.

Well, the answer, of course, is #5. If you disagree, then send me a comment, by all means. Option #1 suggests that the null hypothesis is a random variable, with some probability of being true! Sorry, but the null is not random, and it's definitely either true or it's false - we are just trying to infer which case holds. I know that #2 looks tempting too, but keep in mind that, when the null is false, a p-value will tend to decrease if the sample size increases, provided that the null distribution is the same for both sample sizes. If the null distribution changes then anything can happen. So, when comparing p-values we need to be sure that we have all of the facts. You don't learn much by comparing apples with oranges. [Added, 6/12/2012: But see the comments below.]

Options #3 and #4 are plain silly, but pretty typical of some of the stuff you run into all of the time. Just Google® the term "p-value misinterpret" and see all of the attention that's paid to this in medical journals, for instance. You'll never want to be admitted to hospital if you can avoid it. You'll find that there's a whole cottage industry that's been built up by medical scientists who spend their days writing endless little pieces explaining to each other how one should (should not) interpret a p-value in clinical trials. See Blume & Peipert (2003), for example. Quite scary, actually!

For the record, let's have a clear statement of what a p-value is, and then we can move on.
The p-value is the probability of observing a value for our test statistic that is as extreme as (or more extreme than) the value we have calculated from our sample data, given that the null hypothesis is true.
So, it's a conditional probability. Very small p-values speak against the null hypothesis, but keep in mind my earlier comments about option #2 in the quiz, and now take another look at the link from Tom Thorn right back at the start of this posting.

I used to have a terrific ASA T-shirt (there's a point to this - trust me) with the message "May I Show You My Collection of Random Numbers" emblazoned on the front. It caused quite a stir in class, I can tell you. Not that I ever actually had such a collection - it's the thought that counts. But.... Yes, Virginia, I do have a collection of random p-values! If you're patient, I'll even show you some of the more exciting and brightest-coloured ones.

Now, I'll bet that most of you probably never thought of a p-value as being a statistic - a function of the random sample data, and hence a random variable itself. Well, it is, and so a p-value has its own sampling distribution. We'll get to the details of this in due course, but to begin with, let's ponder this possible revelation.

When we see a p-value reported, it's just an innocent-looking number - like 0.045. When we report an estimated regression coefficient it's also a number - like 1.473. However, we'd be pretty upset if someone reported just that point estimate, and didn't give us a standard error (or the same information in the form of a t-statistic). So, if p-values are random variables, why do we see them reported without some indication of their dispersion, or "uncertainty" if you like? Worse yet, suppose that we're told that an estimated regression coefficient is 1.473, and that the p-value (for the t-test of the hypothesis that the true coefficient is zero) is 0.045. How do we know if the latter value is significantly different from 0.05 (say), when we've been told nothing about the probability distribution that generated the single realized value, 0.045? It sounds as if we should be calling these quantities "estimated p-values", or maybe "p-hat-values".

Confused? Let's go back to our definition of a so-called p-value, and recall how we go about calculating it in practice. For simplicity, suppose that we're testing the null hypothesis that a (true) regression coefficient is zero, against the alternative hypothesis that it's positive. (It doesn't matter if the alternative hypothesis is one-sided or two-sided; positive or negative.) Under standard assumptions, the test statistic is Student-t distributed, with known degrees of freedom, if the null is true. Under the (sequence of) alternative(s), the test statistic follows a non-central Student-t distribution, with the same degrees of freedom, and an unobserved non-centrality parameter that changes as the true value of the coefficient changes, and also depends on the variance of the regression errors. So, we condition on the null being true. In that single case, the distribution of our calculated t-statistic (let's call it t*) is known, because we know the degrees of freedom. So, it's then a trivial matter to compute the area under the tail of the Student-t distribution to the right of t*. That's our p-value in this particular case, and it's a probability, conditioned on the null being true.
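To make that last step concrete, here is a minimal Python sketch of the tail-area calculation. It assumes 4 degrees of freedom, purely because the Student-t c.d.f. then has a simple closed form that needs no statistics library. The function names and the t* values are made up for illustration.

```python
import math


def t_cdf_df4(t: float) -> float:
    """C.d.f. of the central Student-t distribution with 4 degrees of freedom.

    Assumption: this closed form is specific to 4 degrees of freedom;
    any other value would need a numerical routine instead.
    """
    u = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(u)) * (1.0 - (t * t / u) / 12.0)


def one_sided_p_value(t_star: float) -> float:
    """Area under the null density to the right of the observed t_star."""
    return 1.0 - t_cdf_df4(t_star)


# Hypothetical observed t-statistics, chosen only for illustration.
for t_star in (0.0, 1.0, 2.132, 4.0):
    print(f"t* = {t_star:5.3f}   p-value = {one_sided_p_value(t_star):.4f}")
```

The larger the observed t*, the smaller the area to its right, which is all that "small p-values speak against the null" amounts to in this one-sided case.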

More formally, let's use the usual convention of letting P and T be the names of the random variables (the p-value and the t-statistic) and let p and t be their values. Then, by definition, $P = \Pr(T > t \mid H_0) = 1 - F_T(t)$, where $F_T(\cdot)$ is the cumulative distribution function (c.d.f.) for the random variable, T, given that $H_0$ is true. The p-value is a function of a random variable, so it's a random variable itself. It's just another summary statistic associated with our sample of data. Breaking news? Not really - this has been fully understood at least as far back as Pearson (1938), through his concept of the "probability transform", and he mentions that this idea was used by Fisher (1932).

Now, what can we say about the distribution of this statistic? What does its sampling distribution look like? Not surprisingly, just as the distribution of the t-statistic itself changes as we move from the null hypothesis being true, to the alternative hypothesis being true, the same thing happens with the p-value. Conditional on the null being true, any p-value (for any problem involving a test statistic with a continuous distribution) has a sampling distribution that is Uniform on [0,1].

At least that's easy to remember, but where does such a general result come from? To find out, let $F_P(p \mid H_0)$ be the (conditional) c.d.f. of P. Immediately, we have the following result:

$$
\begin{aligned}
F_P(p \mid H_0) &= \Pr(P < p \mid H_0) = \Pr(1 - F_T(t) < p \mid H_0) \\
                &= \Pr[F_T(t) > (1 - p) \mid H_0] = 1 - \Pr[F_T(t) < (1 - p) \mid H_0] \\
                &= 1 - F_T[F_T^{-1}(1 - p) \mid H_0] = 1 - (1 - p) = p.
\end{aligned}
$$

The random variable, P, has a distribution that is uniform on [0,1]. Notice that this was established quite generally - the statistic, T, could have been any test statistic at all!

Now here are some really interesting p-values - there are 10,000 of them, so a graph is a better choice than a table:
These simulated p-values are for the problem of testing that the unknown mean of a Normal random variable is zero, against a one-sided positive alternative, when the variance is unknown. An EViews workfile and program file are available on the Code page that goes with this blog, so you can play around with this to your heart's content! The UMP test in this case is based on the usual t-statistic, whose distribution is Student-t with n-1 degrees of freedom under the null, where n is the sample size. I set n = 5 and σ² = 1. The values seem to be reasonably uniformly distributed on [0,1]. The mean value is 0.503, the standard deviation is 0.286, the skewness is -0.013, and the kurtosis is 1.821. For a U(0,1) distribution, the corresponding values are 0.5, 0.289, 0 and 1.8. The Q-Q plot for these data against Uniform (0,1) is as good as it gets:

and when we apply the usual bunch of non-parametric tests for uniformity, based on the empirical c.d.f., we can't reject uniformity. The p-values (whoops!) for the Kolmogorov-D and Anderson-Darling tests are 0.31 and 0.26, for instance. So, hopefully you're convinced about the form of the sampling distribution of a p-value when the null hypothesis is true.
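If you don't have EViews handy, an experiment of this kind is easy to mimic. What follows is a rough Python sketch, not a translation of the EViews program on the Code page. It assumes the same design (n = 5, unit variance, 10,000 replications, one-sided t-test of a zero mean), uses an arbitrary seed, and relies on the closed form of the Student-t c.d.f. that happens to exist for 4 degrees of freedom. The function names are invented here, and the numbers it prints will not match the ones quoted above exactly.

```python
import math
import random
import statistics

N = 5  # sample size, giving n - 1 = 4 degrees of freedom (assumed design)


def t_cdf_df4(t: float) -> float:
    """Closed-form c.d.f. of Student-t with 4 degrees of freedom only."""
    u = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(u)) * (1.0 - (t * t / u) / 12.0)


def simulate_p_values(true_mean: float, n_reps: int = 10_000, seed: int = 12345):
    """One-sided p-values for H0: mean = 0 against a positive alternative."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    p_values = []
    for _ in range(n_reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(N)]
        std_err = statistics.stdev(sample) / math.sqrt(N)
        t_stat = statistics.fmean(sample) / std_err
        p_values.append(1.0 - t_cdf_df4(t_stat))
    return p_values


# Data generated with the null hypothesis true (mean of zero).
p_null = simulate_p_values(true_mean=0.0)
print("mean          :", round(statistics.fmean(p_null), 3))
print("std. deviation:", round(statistics.stdev(p_null), 3))
print("share < 0.05  :", sum(p < 0.05 for p in p_null) / len(p_null))
```

If the uniformity result holds, the mean and standard deviation should land near 0.5 and 0.289, and roughly 5% of the simulated p-values should fall below 0.05.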

It might seem that we can now simply report that p-value of 0.045, with its standard deviation of 0.289, to convey some information about the randomness of p. Unfortunately, that just won't do - unless we are sure that the null hypothesis is actually true - and isn't that what we're in the middle of testing? Perhaps we should consider what form the sampling distribution of the p-value takes when the null is false? It would be remarkable if the standard deviation is still 0.289 in this case.

We should use our intuition first, and then we can follow up with a bit of math.  If the test is any good (& it's UMP in the present case), as the null hypothesis becomes more and more false, we should tend to get values for our test statistic that are increasingly "extreme" (larger, in the case of the positive one-sided t-test). This implies that the p-values should be increasingly likely to be smaller and smaller. (This is all for a fixed null distribution and a fixed value of n, of course.) So, we should start to see a sampling distribution which is no longer uniform, but instead has a lot of values close to zero, and fewer values close to unity. This suggests a positively skewed distribution, and in fact this intuition is dead right. Here are the sampling distributions for the p-values in my simulation experiment, when the data are actually generated under two different scenarios: first, with a true mean of 0.25; and second with a true mean of 1.0. We're now looking at things when the null hypothesis of a zero mean is false, and in these cases the p-values are even more interesting, as we can see from their colours:

Additional examples are given by Murdoch et al. (2008).
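The same kind of experiment can be sketched in Python for the case where the null is false. Again, this is only an illustrative sketch under assumptions: n = 5, unit variance, true means of 0, 0.25 and 1.0, an arbitrary seed, and the closed-form Student-t c.d.f. for 4 degrees of freedom. The names are hypothetical.

```python
import math
import random
import statistics

N = 5  # sample size, giving 4 degrees of freedom (assumed design)


def t_cdf_df4(t: float) -> float:
    """Closed-form c.d.f. of Student-t with 4 degrees of freedom only."""
    u = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(u)) * (1.0 - (t * t / u) / 12.0)


def simulate_p_values(true_mean: float, n_reps: int = 10_000, seed: int = 12345):
    """One-sided p-values for H0: mean = 0, computed as if the null were true."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(N)]
        std_err = statistics.stdev(sample) / math.sqrt(N)
        p_values.append(1.0 - t_cdf_df4(statistics.fmean(sample) / std_err))
    return p_values


# (mean, std. deviation, share below 0.05) of the p-values for each true mean.
summary = {}
for mu in (0.0, 0.25, 1.0):
    p = simulate_p_values(mu)
    summary[mu] = (
        statistics.fmean(p),
        statistics.stdev(p),
        sum(x < 0.05 for x in p) / len(p),
    )
    print(mu, [round(v, 3) for v in summary[mu]])
```

As the true mean moves away from zero, the p-values should pile up near zero: their average falls, and the share below 0.05 (the power of a 5% test) rises.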

We can easily do the math. to get a general expression for the c.d.f. of a p-value when the alternative hypothesis is true. Here we go again. Let $G_P(p \mid H_A)$ be the (conditional) c.d.f. of P when the alternative hypothesis holds, and let $G_T(\cdot)$ be the c.d.f. of the test statistic, T, in that case. Immediately, we have the following result:

$$
\begin{aligned}
G_P(p \mid H_A) &= \Pr(P < p \mid H_A) = \Pr(1 - F_T(t) < p \mid H_A) \\
                &= \Pr[F_T(t) > (1 - p) \mid H_A] = 1 - \Pr[F_T(t) < (1 - p) \mid H_A] \\
                &= 1 - G_T[F_T^{-1}(1 - p) \mid H_A].
\end{aligned}
$$

So, unless the test statistic has the same distribution under the null as it has under all alternatives - which it won't - the last expression won't equal p. The distribution of the p-value under the alternative hypothesis is not uniform. Moreover, it changes with the form of the alternative distribution, G. Recall that in our t-test example, the distribution of the test statistic is non-central Student-t, with a non-centrality that changes as the alternative hypothesis changes. Hence the difference between the orange and pink distributions we've just seen. This can all be extended further, and you can get some really neat results relating to the probability density function (as opposed to the c.d.f.) of the p-value under the alternative distribution, as is discussed by Donahue (1999).
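The general expression is easiest to check in a simpler setting than the t-test. The Python sketch below swaps in a one-sided z-test (known variance), where the test statistic is N(0,1) under the null and, by assumption, N(delta, 1) under the alternative, so that both c.d.f.s are available in the standard library. The shift delta, the cut-off 0.05, the seed and the function name are all illustrative choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # null distribution of the z-statistic


def p_value_cdf_under_alternative(p: float, delta: float) -> float:
    """1 - G_T[F_T^{-1}(1 - p)] for a one-sided z-test, with 0 < p < 1.

    Assumption: T ~ N(0, 1) under the null and T ~ N(delta, 1) under
    the alternative.
    """
    critical = STD_NORMAL.inv_cdf(1.0 - p)              # F_T^{-1}(1 - p)
    return 1.0 - NormalDist(mu=delta, sigma=1.0).cdf(critical)


# Compare the formula with a brute-force simulation (illustrative values).
delta, cutoff = 1.0, 0.05
rng = random.Random(2011)
draws = [1.0 - STD_NORMAL.cdf(rng.gauss(delta, 1.0)) for _ in range(20_000)]
empirical = sum(p < cutoff for p in draws) / len(draws)
analytic = p_value_cdf_under_alternative(cutoff, delta)
print("analytic :", round(analytic, 4))
print("empirical:", round(empirical, 4))
```

Setting delta to zero makes the two distributions coincide, and the function then simply returns p, which is the uniform result again.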

You should now be able to see why we can't just report a standard deviation of 0.289 (= (1/12)^0.5) with any old p-value. That value will be correct if the null is true, but not otherwise. As there's an infinity of ways for the null to be false, and each implies a different sampling distribution (and different standard deviation) for the p-value in question, we have a bit of a problem on our hands. We can't supply a DVD with a gazillion possible standard deviations every time we report a p-value!

With this in mind, some authors have suggested that we pay attention to the expected value of the p-value under the alternative hypothesis, and some interesting discussions of this idea are given by Dempster & Schatzoff (1965), Schatzoff (1966), Hung et al. (1997), and Sackrowitz & Samuel-Cahn (1999). Others, such as Joiner (1969), have promoted the use of the median of the p-value's sampling distribution; and Bahadur (1960) and Lambert & Hall (1982) showed how a log-normal distribution can be used to approximate this sampling distribution under the alternative(s).
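To get a feel for these two summary measures, here is a small Python sketch for the simplest case one can write down: a one-sided z-test whose statistic is assumed to be N(delta, 1) under the alternative. In that special case the expected p-value works out to Φ(-delta/√2) and the median to Φ(-delta). These closed forms are derived for this toy setting only and are not presented as the expressions used in the papers cited above. The value of delta, the seed and the function names are arbitrary.

```python
import math
import random
import statistics
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_p_value(delta: float) -> float:
    """E[P] for a one-sided z-test when T ~ N(delta, 1) (assumed setting)."""
    return STD_NORMAL.cdf(-delta / math.sqrt(2.0))


def median_p_value(delta: float) -> float:
    """Median of P in the same setting: the p-value at the median of T."""
    return STD_NORMAL.cdf(-delta)


# Check both closed forms against simulated p-values (illustrative delta).
delta = 1.0
rng = random.Random(7)
sim = [1.0 - STD_NORMAL.cdf(rng.gauss(delta, 1.0)) for _ in range(20_000)]
sim_mean = statistics.fmean(sim)
sim_median = statistics.median(sim)
print("expected p-value:", round(expected_p_value(delta), 4), round(sim_mean, 4))
print("median p-value  :", round(median_p_value(delta), 4), round(sim_median, 4))
```

Both measures equal 0.5 when delta is zero and shrink as the alternative moves further from the null, which is why they have been proposed as one-number summaries of how a test behaves when the null is false.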

Notice that none of this is "hot off the press" - this literature goes back at least 50 years, and appears in mainstream journals. Sounds like the sort of thing that's worth knowing about!

So, the next time you're reading something where someone reports a p-value, just remember that what you're seeing is just a single realized value of a random variable. It's not a deterministic constant. Just when you thought it was safe to go back in the water................!

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.


Bahadur, R. R. (1960). Simultaneous comparison of the optimum and sign tests of a normal mean. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (ed. I. Olkin et al.), 77-88. Stanford University Press.

Blume, J. & J. F. Peipert (2003). What your statistician never told you about p-values. Journal of the American Association of Gynecologic Laparoscopists, 10, 439-444.

Dempster, A. P. & M. Schatzoff (1965). Expected significance level as a sensitivity index for test statistics. Journal of the American Statistical Association, 60, 420-436.

Donahue, R. M. J. (1999). A note on information seldom reported via the p value. American Statistician, 53, 303-306.

Fisher, R. A. (1932). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Hung, H. M. J., R. T. O'Neill, P. Bauer & K. Köhne (1997). The behavior of the p-value when the alternative hypothesis is true. Biometrics, 53, 11-22.

Joiner, B. L. (1969). The median significance level and other small sample measures of test efficacy. Journal of the American Statistical Association, 64, 971-985.

Lambert, D. & W. J. Hall (1982). Asymptotic log-normality of p-values. Annals of Statistics, 10, 44-64.

Murdoch, D. J., Y-L. Tsai & J. Adcock (2008). p-values are random variables. American Statistician, 62, 242-245.

Pearson, E. S. (1938). The probability transformation for testing goodness of fit and combining independent tests of significance. Biometrika, 30, 134-148.

Sackrowitz, H. & E. Samuel-Cahn (1999). P values as random variables - expected p values. American Statistician, 53, 326-331.

Schatzoff, M. (1966). Sensitivity comparisons among tests of the general linear hypothesis. Journal of the American Statistical Association, 61, 415-435.

© 2011, David E. Giles


  1. You write

    "You should now be able to see why we can't just report a standard deviation of 0.289 (= 1/12)"

    But 1/12 is the variance of U[0,1], not the SD. I presume you mean sqrt(1/12). The math works out better this way, too ;-)

  2. Thanks Mycroft - a silly typo which I have fixed! Much appreciated.

  5. I don't understand any of this. Maybe I should have paid attention in high school math class?

  6. Answer 2 is correct. You have misconstrued the P-values as being error rates from the Neyman-Pearson hypothesis testing framework. When P-values are more properly used within Fisher's significance testing framework, the size of the P-values is the level of significance.
    It is common to work within a hybrid of the N-P and F approaches, but the hybrid is not consistent. Either test for a test statistic beyond a pre-defined critical value, or calculate the P-value as an index of evidence. Don't do both.

    1. OK - but the two numbers being compared need to relate to the same problem and be based on the same information set. That's my point.

    2. So, is your point about P-values or about an alternative meaning for 'significant'? The question is badly incomplete and thus ambiguous. I assumed that you meant to compare two potential results from a single experiment. What would the point of the comparison be if you expand its scope to include different types of experiments with different sample sizes etc.?

      P-values are being criticised all over the place, but most of the criticisms are based on misunderstandings.

    3. I also agree that number 2 is correct. Even if it's not the same problem and not the same data set. The p-value is basically the definition of the statistical significance metric.

      It just doesn't work as a comparison of effect sizes or practical significance. But I don't see the problem in calling a very small effect on a large data set more statistically significant than a large effect on a small data set, as long as one upholds the distinctions between effect sizes, statistical significance and practical significance.

      All this is of course conditional on all the assumptions needed to calculate the p-value being correct.

    4. OK - I see the point! Thanks to both of you!

  7. The one I like is that the p-value is "the probability that we would have gotten this result by chance."

    Since the result was obtained by a chance process, the probability that it was obtained by chance is 1.

  8. Can't we say that #2 is correct under certain conditions? Say, holding sample size/test power constant? In clinical trials it's common to have extremely low test size (<= 0.001), which implies that there's some truth to #2. However I'd agree that if we're talking about economic significance, there's little to no difference because other factors would become much more important (sample size, misspecification, use of asymptotic results in finite samples, etc).