## Saturday, January 28, 2017

### Hypothesis Testing Using (Non-) Overlapping Confidence Intervals

Here's something (else!) that annoys the heck out of me. I've seen it come up time and again in economics seminars over the years.

It usually goes something like this:

There are two estimates of some parameter, based on two different models.

Question from Audience: "I know that the two point estimates are numerically pretty similar, but is the difference statistically significant?"

Speaker's Response: "Well, if you look at the two standard errors and mentally compute separate 95% confidence intervals, these intervals overlap, so there's no significant difference, at least at the 5% level."

My Reaction: "What utter crap!" (Eye roll!)

So, what's going on here?

Let's just focus on a few key points, or else I'll be raving on forever here!

First, we can re-cast the scenario involving estimated regression coefficients into one where we're comparing the means of two populations. If the models have normally distributed errors, then we're comparing the means of two Normal populations. However, normality is not needed for what follows.

Second, testing the equality of the means of two populations, when the variances of those populations are different and unknown, is one of the great unsolved problems in statistics. It's known as the Behrens-Fisher problem, and it has been around since the 1930s. Although it's usually stated in terms of Normal populations, that's actually unduly restrictive: the Behrens-Fisher problem arises in any situation where the two populations belong to the same location-scale family.

So, what exactly is the "problem"?

Well, there's no exact classical test for this problem. The unknown variances are "nuisance parameters", and they prevent the construction of an exact test with a known significance level. Sure, there are lots of approximate solutions to this problem, but no exact one.
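For instance, one of the best-known approximate solutions is Welch's t-test, which avoids pooling the variances and instead uses the Satterthwaite degrees-of-freedom approximation. Here's a minimal sketch in Python; the data are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two samples from Normal populations with the same mean but
# different (unknown) variances -- the Behrens-Fisher setting.
x = rng.normal(loc=5.0, scale=1.0, size=30)
y = rng.normal(loc=5.0, scale=3.0, size=40)

# Welch's t-test: equal_var=False tells SciPy NOT to pool the
# variances, and to use the Satterthwaite d.o.f. approximation.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

It's still only an approximate test, of course; the true significance level isn't known exactly.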

This should be enough to worry you, right from the start, when you hear the speaker's response to the question from the audience!

But let's get back to the two confidence intervals that are being used to try to resolve this riddle.

I know that, on the face of it, the answer seems rather plausible, but unfortunately it's totally misleading. And we've known this for decades - so why do we still hear this rubbish?

(I have my own answer(s) to this question, but let's leave it as being rhetorical - at least in this post!)

Here's a well-known result that bears on this use of the confidence intervals. Recall that we're effectively testing H0: μ1 = μ2, against HA: μ1 ≠ μ2. If we construct the two 95% confidence intervals, and they fail to overlap, then this does not imply rejection of H0 at the 5% significance level. In fact, the effective significance level is roughly one-tenth of that. Yes, about 0.5%!
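If you don't believe it, a quick Monte Carlo experiment makes the point. The sketch below (my own illustrative settings, simulated data) draws pairs of samples from the same Normal population, so that H0 is true in every replication, and counts how often the two 95% intervals fail to overlap:

```python
import numpy as np

rng = np.random.default_rng(123)

n, reps = 50, 20_000
z = 1.96  # 95% confidence multiplier

# Pairs of samples from the SAME Normal population,
# so H0: mu1 = mu2 is true in every replication.
x = rng.normal(0.0, 1.0, size=(reps, n))
y = rng.normal(0.0, 1.0, size=(reps, n))

se_x = x.std(axis=1, ddof=1) / np.sqrt(n)
se_y = y.std(axis=1, ddof=1) / np.sqrt(n)

# The two 95% intervals fail to overlap iff the distance between
# the point estimates exceeds the sum of the two half-widths.
non_overlap = np.abs(x.mean(axis=1) - y.mean(axis=1)) > z * (se_x + se_y)

print(f"'Rejection' rate via non-overlap: {non_overlap.mean():.4f}")
```

The "rejection" rate comes out at roughly 0.5%, not the 5% that the speaker has in mind.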

If you want to learn why, there are plenty of references to help you. For instance, check out McGill et al. (1978), Andrews et al. (1980), Schenker and Gentleman (2001), Masson and Loftus (2003), and Payton et al. (2003) - to name a few. The last of these papers also demonstrates that a rough rule-of-thumb would be to use 84% confidence intervals if you want to achieve an effective 5% significance level when you "try" to test H0 by looking at the overlap/non-overlap of the intervals.
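That rule-of-thumb is easy to verify, at least in the simple case of equal standard errors. Two intervals with multiplier z* fail to overlap only when the gap between the point estimates exceeds 2z* standard errors, while the proper two-sided 5% test rejects when the gap exceeds 1.96√2 standard errors. Matching the two criteria gives z* = 1.96/√2, and a quick Python check shows which confidence level that corresponds to:

```python
import numpy as np
from scipy.stats import norm

# With equal standard errors, the two CIs with multiplier z* fail to
# overlap iff |gap| > 2 * z* * se, while a two-sided 5% test of
# H0: mu1 = mu2 rejects iff |gap| > 1.96 * sqrt(2) * se.
# Matching the two criteria gives z* = 1.96 / sqrt(2).
z_star = norm.ppf(0.975) / np.sqrt(2)
conf = 1 - 2 * norm.sf(z_star)  # confidence level implied by z*

# Conversely, the effective significance level of "reject iff the
# two 95% intervals don't overlap" (again with equal s.e.'s):
alpha_eff = 2 * norm.sf(norm.ppf(0.975) * np.sqrt(2))

print(f"z* = {z_star:.3f}, confidence level = {conf:.1%}")
print(f"effective significance level = {alpha_eff:.4f}")
```

The confidence level works out to about 83.4% (hence the "84%" rule-of-thumb), and the effective significance level of the non-overlap rule is about 0.0056, which is the 0.5% figure mentioned above.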

So, what do we conclude when we look back at the original question and answer, namely:

Question from Audience: "I know that the two point estimates are numerically pretty similar, but is the difference statistically significant?"

Speaker's Response: "Well, if you look at the two standard errors and mentally compute separate 95% confidence intervals, these intervals overlap, so there's no significant difference, at least at the 5% level."

What we can conclude in this situation is that there's no significant difference at the 0.5% level. But not rejecting H0 at the 0.5% level does not imply that we can't reject H0 at the 5% level. Just draw yourself a picture!
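If you'd rather not draw the picture, here's a made-up numerical example (the estimates and standard error are purely illustrative). With equal standard errors, the 5% test rejects once the gap between the estimates exceeds about 2.77 standard errors, but the two 95% intervals keep overlapping until the gap exceeds about 3.92 standard errors. Any gap in between gives overlapping intervals and a rejection of H0 at the 5% level:

```python
import math

# Hypothetical point estimates with a common standard error
# (illustrative numbers only).
b1, b2, se = 1.00, 1.60, 0.18

gap = abs(b1 - b2)
z = 1.96

# The two 95% intervals overlap iff the gap is smaller than the
# sum of the two half-widths.
overlap = gap < z * (se + se)

# Proper two-sided 5% test of equality (independent estimates).
z_stat = gap / math.sqrt(se**2 + se**2)
reject_at_5pct = z_stat > z

print(f"intervals overlap: {overlap}")
print(f"z = {z_stat:.2f}, reject H0 at the 5% level: {reject_at_5pct}")
```

Here the intervals overlap, yet z ≈ 2.36 > 1.96, so the difference is statistically significant at the 5% level. Overlapping intervals simply don't settle the question.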

If you're looking for a more technical discussion of all of this, you might check out the recent paper by Noguchi and Marmolejo-Ramos (2016).

A final question - what if the two intervals had not overlapped?

### References

Andrews, H. P., R. D. Snee, and M. H. Sarner, 1980. Graphical display of means. American Statistician, 34, 195–199.

Masson, M. E., and G. R. Loftus, 2003. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology, 57, 203–220.

McGill, R., J. W. Tukey, and W. A. Larsen, 1978. Variations of box plots. American Statistician, 32, 12–16.

Noguchi, K. and F. Marmolejo-Ramos, 2016. Assessing equality of means using the overlap of range-preserving confidence intervals. American Statistician, 70, 325–334.

Payton, M. E., M. H. Greenstone, and N. Schenker, 2003. Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance? Journal of Insect Science, 3, 1–6.

Schenker, N., and J. F. Gentleman, 2001. On judging the significance of differences by examining the overlap between confidence intervals. American Statistician, 55, 182–186.