## Wednesday, September 14, 2011

### P-Values of Differences of P-Values of Differences of....

An interesting post on Andrew Gelman's blog last week reminded us that we have to be careful to distinguish between the statistical significance of the difference between two "effects", and the difference between the significances of two separate effects. They're not the same thing, and it's the first of these that's usually relevant.

Let's re-phrase this, and put it in baby econometrics terms.

Suppose that I fit a regression model, using data for the U.S. and for Canada. I "pool" the data, and I include as a regressor a dummy variable that takes the value unity for the Canadian observations, and zero for the U.S. observations. This dummy variable enters the model multiplicatively: that is, in the form (D*X), where D is the dummy variable and X is the regressor whose significance is of particular interest to me. X is also included by itself as a regressor in the model.

This also ties in with "differences in differences" models, and ECONJEFF took up a related point a few days ago.

Now, we all know that the t-statistic associated with the (D*X) coefficient can be used to test if there is a difference between the effect of X on y in Canada, and the effect of X on y in the U.S. To be specific, suppose that the p-value associated with this test statistic turns out to be 0.03. Then, if (implicitly in the back of my mind) I'm a 5% significance level person, I would reject the hypothesis that the two effects are the same. If you're a 1% person, you wouldn't reject this hypothesis.

O.K. - that's fine. I could take an alternative approach. I could estimate the equation using the pooled data, but without the dummy variable interaction term. Then I could conduct a Chow test of the hypothesis that there is no structural change in (only) the coefficient for X. With just one restriction being tested, the Chow test F-statistic would equal the square of the t-statistic in the earlier equation with the multiplicative dummy variable (so its square root matches the t-statistic in absolute value). The associated p-value would again be 0.03. The two approaches are equivalent to each other - they're testing the same null hypothesis, and you get the same result.
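If you'd like to see this equivalence numerically, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the simulated data, the coefficient values, the extra regressor `Z`, the sample size and the seed are all made up, and the F-test shown is the usual restricted-versus-unrestricted sum-of-squares comparison for a single restriction.

```python
import numpy as np

# Hypothetical simulated data: first half "U.S." (D = 0), second half "Canada" (D = 1).
rng = np.random.default_rng(42)
n = 200
D = (np.arange(n) >= n // 2).astype(float)
X = rng.normal(size=n)
Z = rng.normal(size=n)  # one "other regressor", purely illustrative
y = 1.0 + 0.5 * X + 0.3 * D * X + 0.8 * Z + rng.normal(size=n)


def ols(y, M):
    """Return OLS coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    resid = y - M @ beta
    return beta, resid @ resid


const = np.ones(n)
M_u = np.column_stack([const, X, Z, D * X])  # unrestricted: includes (D*X)
M_r = np.column_stack([const, X, Z])         # restricted: no interaction term

beta_u, rss_u = ols(y, M_u)
_, rss_r = ols(y, M_r)

# t-statistic on the (D*X) coefficient in the unrestricted model
df = n - M_u.shape[1]
s2 = rss_u / df
cov = s2 * np.linalg.inv(M_u.T @ M_u)
t_stat = beta_u[3] / np.sqrt(cov[3, 3])

# F-statistic for the single restriction "coefficient of (D*X) is zero"
f_stat = (rss_r - rss_u) / (rss_u / df)

print("t^2 =", t_stat**2)
print("F   =", f_stat)
```

The two printed numbers should agree up to rounding error, which is why the two p-values coincide as well.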

I've put some artificial data and an EViews workfile on the Data and Code pages of this blog if you want to check this for yourself.

Now consider something a little bit different. Assuming that the U.S. and Canadian sub-samples are large enough, I could estimate two separate OLS regressions involving y, X, the other regressors, but no dummy variable. Suppose that I found that the p-value for the t-statistic associated with X was 0.40 in the U.S. equation, and 0.03 in the Canadian equation. Being a 5% person, I guess I'd conclude that the effect of X is significant in Canada, but not in the U.S.

But what if these p-values were, say, 0.052 and 0.048 respectively? Well, if I'm a really hard-liner, I wouldn't alter my conclusion.

However, at what point do we say to ourselves: "The numerical difference between those two p-values is so tiny that I really should come to the same conclusion in each case"? But then do we go with the 0.052 and conclude that there's really no X-effect in either country? Or, do we go with the 0.048 and conclude that there is an X-effect in both the U.S. and Canada?

Now we're no longer talking about the significance of a difference between two effects. We're talking about the difference between two measures of significance. That's something altogether different.

Is there anything sensible that we can say about the difference of two p-values? Well, keep in mind that p-values are random variables and they have a sampling distribution. See my earlier posts on this - here, and here. The numbers 0.052 and 0.048 are just realized values of two different random variables. There's some random variation in the background. Shouldn't we take this into account when we look at the difference of the two values?

For a single p-value, whatever the test (provided the test statistic has a continuous distribution), the distribution of that p-value if the null hypothesis is true is Uniform on [0,1]. However, if the null hypothesis is false, the distribution of the p-value is skewed to the right, with its mass piled up near zero, and its precise form depends on the nature of the testing problem, and on the actual values of the parameters.
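A quick simulation makes the contrast concrete. This is only a sketch under assumptions of my choosing: a two-sided z-test for a normal mean with known unit variance (rather than the regression t-test discussed above), a sample size of 25, an alternative mean of 0.5, and an arbitrary seed.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 25, 20000  # hypothetical sample size and number of replications


def two_sided_p(mu):
    """Simulate p-values for H0: mean = 0 when the true mean is mu (variance known = 1)."""
    xbar = rng.normal(loc=mu, scale=1.0, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * xbar
    # Two-sided normal p-value: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])


p_null = two_sided_p(0.0)  # null hypothesis true
p_alt = two_sided_p(0.5)   # null hypothesis false

print("null: mean, variance   =", p_null.mean(), p_null.var())  # close to 1/2 and 1/12
print("null: share below 0.05 =", np.mean(p_null < 0.05))       # close to 0.05
print("alt:  share below 0.05 =", np.mean(p_alt < 0.05))        # the power, well above 0.05
print("alt:  median p-value   =", np.median(p_alt))
```

Under the null the simulated p-values behave like U[0,1] draws; under the alternative they crowd towards zero, and how tightly they do so depends on the (made-up) true mean and sample size.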

Now, getting back to our comparison of two p-values, note that the difference between two p-values is also a (different) random variable, of course. It has its own distribution. We're concerned with two independent t-tests. One is testing that the coefficient of X is zero in the Canadian equation; and the other that it is zero in the U.S. equation. Suppose that both of these null hypotheses are true. Then the difference between the two p-values is just the difference, Δ, between two independent U[0,1] random variables.

Homework: Prove that the density function for Δ is triangular (!), with a base running from (-1,0) to (1,0), and an apex at (0,1).

So, if in fact there is no X-effect in either the U.S. or Canada, then the difference of the p-values has an expected value of zero, and a variance of 1/6 (s.d. = 0.40825).

More Homework: Prove that the values ΔL = -0.776 and ΔR = 0.776 cut off areas of 2.5% in each of the left and right tails, respectively, of the distribution for Δ.
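A numerical check is no substitute for the proofs, but it's a handy way to confirm the numbers. The sketch below draws Δ as the difference of two independent U[0,1] variables and compares the simulated moments and tail areas with the values quoted above. The seed and number of replications are arbitrary choices, and the closed form used for the cut-off relies on the triangular density: for 0 ≤ c ≤ 1 the right-tail area is (1 - c)²/2.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
reps = 1_000_000  # arbitrary number of replications

# Difference of two independent U[0,1] random variables
delta = rng.uniform(size=reps) - rng.uniform(size=reps)

# Right-tail area of the triangular density is (1 - c)^2 / 2 for 0 <= c <= 1.
# Setting this equal to 0.025 gives c = 1 - sqrt(0.05).
c = 1.0 - math.sqrt(0.05)
sd = math.sqrt(1.0 / 6.0)

print("cut-off c       =", round(c, 3))    # 0.776
print("s.d. of delta   =", round(sd, 5))   # 0.40825
print("simulated mean  =", delta.mean())
print("simulated var   =", delta.var())
print("right-tail area =", np.mean(delta > c))
print("left-tail area  =", np.mean(delta < -c))
```

The simulated mean should be close to zero, the variance close to 1/6, and each tail area close to 2.5%.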

Assuming you've done your homework, you can now construct a 95% confidence interval for the difference between the two true p-values, for the case where there is no X-effect in either country. Using our two empirical p-values of 0.048 and 0.052, whose difference is 0.004, this interval is [0.004 +/- (0.776)(0.40825)], or [-0.3128 , 0.3208]. This interval covers the value zero, leading us to conclude that indeed there is no difference between the p-values.

Remember that the interval has been constructed correctly only if the two X coefficients are zero, and hence are the same in a rather trivial way! So I guess this particular interval might lead us to conclude that there's no X-effect in either country, but that would be a rather dubious outcome!

(There's a circularity to this that makes me feel a little queasy.)

But what if the observed p-values were, say, 0.40 for the U.S. and 0.03 for Canada? In that case the 95% confidence interval would be [0.0532 , 0.6868], and we'd conclude that the difference between the p-values is non-zero. We'd probably be tempted to conclude that there's a difference between the two X-effects; and that there's no X-effect in the U.S., but there is one in Canada.

Hang on! The confidence interval, [0.0532 , 0.6868], has been constructed correctly only if there is no X-effect in either country. So what sense does it make to use this interval to reach a conclusion that there's an X-effect in one of the countries? None at all!

Of course, if one or other of the null hypotheses (no X-effect in a particular country) is false, then things go from bad to worse. Then, the p-values themselves have distributions that depend on the true (unobservable) coefficients of the X regressors, among other things. The same is true (in an even messier way) for Δ. In this case, we're not going to get anywhere by asking if the observed difference between the two empirical p-values is "significant".

Take-home points:
• It's important to distinguish between the statistical significance of the difference between two "effects", and the difference between the significances of two separate effects.
• The difference between a significant effect and an insignificant effect is not necessarily significant.
• Trying to draw conclusions by comparing observed p-values is a dangerous pastime.
• By using interaction dummy variables in a regression model you can usually create a sound basis for testing the difference between effects in a meaningful way.
