Saturday, March 15, 2014

Research on the Interpretation of Confidence Intervals

Like a lot of others, I follow Andrew Gelman's blog with great interest, and today I was especially pleased to see this piece relating to a recent study on the extent to which researchers do or do not interpret confidence intervals correctly.

If you've ever taught an introductory course on statistical inference (from a frequentist, rather than Bayesian, perspective), then I don't need to tell you how difficult it can be for students to really understand what a confidence interval is, and (perhaps more importantly) what it isn't!

It's not only students who have this problem. Statisticians acting as "expert witnesses" in court cases have no end of trouble getting judges to understand the correct interpretation of a confidence interval. And I'm sure we've all seen or heard empirical researchers misinterpret confidence intervals! For a specific example of the latter, involving a subsequent Nobel laureate, see my old post here!

The study that's mentioned by Andrew today was conducted by four psychologists (Hoekstra et al., 2014) and involved a survey of academic psychologists at three European Universities. The participants included 442 Bachelor students, 34 Master students, and 120 researchers (Ph.D. or faculty members).

Yes, the participants in this survey are psychologists, but we won't hold that against them, and my hunch is that if we changed "psychologist" to "economist" the results wouldn't alter that much!

Before summarizing the findings of this study, let's see what the authors have to say about the correct interpretation of a confidence interval (CI) constructed from a particular sample of data:

"Before proceeding, it is important to recall the correct definition of a CI. A CI is a numerical interval constructed around the estimate of a parameter. Such an interval does not, however, directly indicate a property of the parameter; instead, it indicates a property of the procedure, as is typical for a frequentist technique. Specifically, we may find that a particular procedure, when used repeatedly across a series of hypothetical data sets (i.e., the sample space), yields intervals that contain the true parameter value in 95 % of the cases. When such a procedure is applied to a particular data set, the resulting interval is said to be a 95 % CI. The key point is that the CIs do not provide for a statement about the parameter as it relates to the particular sample at hand; instead, they provide for a statement about the performance of the procedure of drawing such intervals in repeated use. Hence, it is incorrect to interpret a CI as the probability that the true value is within the interval (e.g., Berger & Wolpert, 1988). As is the case with p-values, CIs do not allow one to make probability statements about parameters or hypotheses."  (Hoekstra et al., 2014, 2nd. page of online pre-print.)
For what it's worth, I agree that this description and interpretation of a CI is correct. 

I'm not saying that we should be using CIs. Specifically, when I'm wearing my Bayesian hat, CIs make no sense at all, and the very term is banished from my vocabulary. But I digress.........

So, what are the findings of the study in question? Very briefly (because you should read the paper yourself):

  • Participants were given 6 incorrect statements about a confidence interval, and were asked which ones, if any, were correct.
  • 8 undergraduate students (1.8%), 0 Masters students, and 3 (2.5%) Ph.D./faculty correctly said that all six statements were incorrect.
  • The claimed level of experience of the respondents had a slight positive correlation with the extent to which misinterpretations of CIs were made.
  • Researchers (Ph.D. and faculty) scored about as well as first-year students without any training in statistics.

  • Very much a case of "read it and weep"!

    However,....... check the survey questions in the Appendix of the Hoekstra et al. paper, and see how you score.


    Berger, J. O. and R. L. Wolpert, 1988. The Likelihood Principle (2nd. ed.), Institute of Mathematical Statistics, Hayward, CA.

    Hoekstra, R., R. D. Morey, J. N. Rouder, and E-J. Wagenmakers, 2014. Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, in press.

    © 2014, David E. Giles


    1. I found this description of CIs on a Yale stats site. It seems to me that it makes the common but incorrect interpretation that there is an n% probability that the true parameter lies within the interval - in the section beside the first figure. It doesn't mention repeated sampling. Maybe I got a bad draw from the population of websites, but it seems to me there is wide variation in how CIs are interpreted.

      1. If it doesn't mention repeated sampling then it's simply wrong. Go back and take a look at my post on the Nobel Laureate, mentioned in the post.

      2. Looks to me like there were six statements, not five.

        I'm surprised that number six got so many people to agree with it, because it is incoherent gibberish. "Not even wrong," as they say. I wonder what in the world they think "the true mean" is supposed to denote?

        QUOTE: "If we were to repeat the experiment over and over, then 95 % of the time the true mean falls between 0.1 and 0.4."

      3. By the way, I don't see anything wrong with the description at the Yale site. It doesn't explicitly use the term "repeated sampling," but this is inherent in the frequentist understanding of probability. The key thing is to only make probabilistic statements about the random endpoints of the intervals, not about the *realizations* of those random variables.

      4. ed - five corrected to six. Thanks!
        You're right about the piece on the Yale site. Frequentists rely on the notion of the sampling distribution for pretty much everything - e.g., unbiasedness - and by definition that's what's happening in repeated sampling.

    2. Dave,

      There's a crowded discussion on all this at Andrew Gelman's blog, and if I may I'd like to raise something in the friendly environment here instead (fellow members of the economics tribe etc.).

      The definition of a CI in the Yale piece,

      "The level C of a confidence interval gives the probability that the interval produced by the method employed includes the true value of the parameter."

      seems OK to me too. It doesn't say "under repeated sampling", but that's covered by the wording. The probability is the probability, however many times you sample.

      My question: how can one correctly combine this definition with the information that the CI is [0.1, 0.4], as in the questionnaire used in the Hoekstra et al. paper? (I confess the paper has left me a bit reluctant to try myself!)


      NB: There is an inconsequential error in the definition of a CI in the paper. They say "A CI is a numerical interval constructed around the estimate of a parameter" but strictly speaking a point estimate isn't always necessary for the construction of a CI. I have in mind Anderson-Rubin (1949) CIs as the counterexample.

      1. Mark - thanks for the comment. Re. your N.B. - quite right.

        The main point - the notion of a C.I. rests on the concept of the sampling distribution. For that matter, so does the notion of the unbiasedness of an estimator, its efficiency, etc. If we have any sample statistic, S, then the sampling distribution of S is defined as follows. We (hypothetically) sample from the population, again and again (with a fixed sample size), and we re-calculate S every time. Then we look at the empirical distribution of all of the infinity of values of S. This is the sampling distribution of S. If the mean of this distribution equals the true, unknown, value of some population parameter, then S is an unbiased estimator of that parameter.
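
        The repeated-sampling idea can be mimicked on a computer. What follows is only a rough sketch under assumptions that are not part of the discussion above: a normal population with a "true" mean of 0.25 and standard deviation of 1, a sample size of 30, and a finite number of replications standing in for the "infinity" of samples. The function name is made up for illustration.

```python
import random
import statistics


def sampling_distribution_of_mean(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size `n` and return the sample mean of each.

    The normal population and all parameter values are illustrative
    assumptions; a finite `reps` only approximates the sampling distribution.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


# Hypothetical population: true mean 0.25, standard deviation 1.
means = sampling_distribution_of_mean(mu=0.25, sigma=1.0, n=30, reps=5000)

# If the sample mean is unbiased, the average of the simulated sample
# means should sit close to the true mean of 0.25.
avg_of_means = statistics.fmean(means)
print(f"average of {len(means)} sample means: {avg_of_means:.4f}")
```

        Individual sample means scatter around 0.25, but their average should settle close to it as the number of replications grows, which is all that "unbiased" claims.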

        Let's return to C.I.'s. As we sample again and again, we can construct an interval, such as xbar +/- 1.96 s.d./(n^0.5), every time. We'll then have lots and lots of intervals of this type. Of all of these many intervals, 95% will cover the true unknown value of the population mean. The particular interval that we construct with our one sample may or may not cover the true population mean - we'll never know. It either does, or it doesn't (prob. =1 or 0) - we don't know.

        So, if I report a (say 95%) C.I. of [ 0.1 , 0.4], all we can say is the following. If I were to construct intervals of this type, in the same way, based on the same sample size, many many times, then 95% of such intervals would cover the true population mean. As for this particular interval, it may or may not cover that parameter's true value. I can't tell.
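
        The same point can be checked by simulation. This is a minimal sketch, assuming a normal population with a known "true" mean of 0.25 (something we never have in practice), a standard deviation of 1, and samples of size 100; the function names are hypothetical. It builds the interval xbar +/- 1.96 s.d./(n^0.5) from each sample and counts how often the interval covers the true mean.

```python
import math
import random
import statistics


def ci_95(sample):
    """Return the interval xbar +/- 1.96 * s.d. / sqrt(n) for one sample."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def coverage(mu, sigma, n, reps, seed=0):
    """Fraction of `reps` intervals that cover the true mean `mu`.

    The normal population and the parameter values are assumptions
    made purely for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = ci_95(sample)
        # Each single interval either covers mu or it does not.
        hits += lower <= mu <= upper
    return hits / reps


rate = coverage(mu=0.25, sigma=1.0, n=100, reps=5000)
print(f"share of intervals covering the true mean: {rate:.3f}")
```

        The share should come out close to 0.95 (a touch below, since 1.96 is the normal rather than the Student-t critical value). Notice that inside the loop each interval is simply a hit or a miss; the 95% only emerges across the whole collection of intervals.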

        If this makes you uneasy, you're probably a (latent) Bayesian! And I don't blame you!

      2. Thanks, Dave. That's about where I ended up (which is a bit of a relief).

        Is it helpful to think of the analogy with frequentist estimators? Say the true mean is theta (as in the example) and the sample mean is xbar. For this particular sample - the same sample that gave us the CI of [ 0.1 , 0.4 ] - xbar=0.25.

        We can say (with some assumptions) that the sample mean is an unbiased estimator of the true mean, i.e., E(xbar)=theta, i.e., we say (A) something about the method. We can also say that for this sample, xbar=0.25, i.e., we say (B) something about this particular sample. But we obviously can't say that theta=0.25, or put a probability on this being true (unless the parameter space is continuous, when maybe we can say the probability is zero!).

        When we say "0.25 is an unbiased estimate of theta", this is what we mean: in general the sample mean is an unbiased estimator of the true mean, and the sample mean for this particular sample is 0.25.

        Same for CIs. We can say (A) something about the method: if I plan to construct a CI, the probability that it will have theta in it is 95%. We can say (B) something about this particular sample: it gives me a CI of [ 0.1 , 0.4 ]. But we can't say anything about the probability that theta lies in [ 0.1 , 0.4 ]. As you say, we can't tell.

        I guess what I would like is a wording for the CI [ 0.1 , 0.4 ] that is analogous to, and as kosher as, "0.25 is an unbiased estimate of theta". But I haven't found a good one yet.

        And thanks again for the blog hosting - a great public service.