## Tuesday, August 30, 2011

### An Overly Confident (Future) Nobel Laureate

For some reason, students often have trouble interpreting confidence intervals correctly. Suppose they're presented with an OLS estimate of 1.1 for a regression coefficient, and an associated 95% confidence interval of [0.9,1.3]. Unfortunately, you sometimes see interpretations along the following lines:
• There's a 95% probability that the true value of the regression coefficient lies in the interval [0.9,1.3].
• This interval includes the true value of the regression coefficient 95% of the time.

So, what's wrong with these statements?

Well, something pretty fundamental, actually, but if you keep the following three things straight in your mind, you'll be fine:

1. The distinction between a parameter and a point estimate of that parameter.
2. The distinction between an estimator and an estimate.
3. What is meant by the sampling distribution of an estimator.
Let's get this over with, and then I'll tell you a rather good story. ☺

The true value of the coefficient in this regression model is a parameter. It's a constant, whose value we just don't happen to know. On the other hand, the (point) estimate of 1.1 is just one particular "realized" value of a random variable - generated using this one particular sample of data.

An estimator is a formula - like the OLS formula in our example. Except in rather silly cases, this formula involves using the (random) sample data. So, an estimator is a function of the sample data - in other words, what we call a statistic. When we apply this formula using a particular sample of data, we generate a number - a point estimate.

Because an estimator is a function of the random data, it's random itself. Being a random variable, an estimator has a distribution function. We reserve a special name for distributions of statistics (including estimators and test statistics): we call them "sampling distributions".

In our OLS example, under standard assumptions (including normally distributed errors) the estimator of the coefficient has a sampling distribution which is Normal. Why is this? Well, linear functions of normal random variables are themselves normal. (This doesn't work for lots of other distributions, so be careful.) In the linear regression model, $y = X\beta + \varepsilon$, if the values of $\varepsilon$ are normally distributed, so are the values of $y$. The OLS estimator, $b = (X'X)^{-1}X'y$, is again a linear function of the $y$ values, so $b$ is also (multivariate) normal in its distribution.
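For anyone who wants the linearity argument spelled out, here is a short sketch. It assumes a non-random X with full column rank and errors that are independent normal with common variance; the symbol σ² is my notation for that variance and doesn't appear elsewhere in the post.

```latex
\begin{aligned}
b &= (X'X)^{-1}X'y
   = (X'X)^{-1}X'(X\beta + \varepsilon)
   = \beta + (X'X)^{-1}X'\varepsilon, \\
\varepsilon &\sim N(0,\ \sigma^{2} I)
   \;\Longrightarrow\;
   b \sim N\!\left(\beta,\ \sigma^{2}(X'X)^{-1}\right).
\end{aligned}
```

So b is just the constant β plus a linear function of the normal errors, which is why its sampling distribution is normal and centred on the true parameter.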

So, every time we use the estimator (apply the OLS formula) with a different sample of data, we're likely to get a different result (estimate). Just what result we get depends not only on the values of the input data (the X values and the random y values), but also on the fact that the estimates are being generated by a random process - a normal distribution in this case.

But exactly what is a "sampling distribution"? In essence, it's a hypothetical construct. We have to imagine the following experiment:
1. We fix the sample size, n.
2. We take a random sample of n values of y.
3. We estimate the regression model by OLS (in our example).
4. We keep the point estimate of the coefficient vector.
5. We repeat steps 2 to 4, again, and again.........an infinite number of times.
6. We look at all of the estimates that we have obtained. If we were to plot them as a histogram, the latter would represent the "sampling distribution" of the OLS estimator.
Of course, in practice, we can't undertake an infinity of repetitions of this experiment. But we can always approximate the end result by using a simple Monte Carlo experiment with a large number of replications.
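Here is a minimal sketch of such a Monte Carlo experiment in Python (using NumPy). Everything specific in it is an assumption made purely for illustration: a simple regression with an intercept and one regressor, true coefficients of 1.0 and 1.1, standard normal errors, n = 50, and 5,000 replications. The function name `simulate_ols_slopes` is hypothetical.

```python
import numpy as np

def simulate_ols_slopes(n=50, reps=5000, beta=(1.0, 1.1), sigma=1.0, seed=0):
    """Return `reps` OLS slope estimates, one per simulated sample of size n."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta)
    x = rng.uniform(0.0, 10.0, size=n)       # regressor values, held fixed across replications
    X = np.column_stack([np.ones(n), x])     # design matrix with an intercept column
    slopes = np.empty(reps)
    for r in range(reps):
        eps = rng.normal(0.0, sigma, size=n)            # step 2: a fresh random sample
        y = X @ beta + eps
        b, *_ = np.linalg.lstsq(X, y, rcond=None)       # step 3: estimate by OLS
        slopes[r] = b[1]                                # step 4: keep the point estimate
    # Theoretical std. dev. of the slope estimator: sigma * sqrt([(X'X)^{-1}]_{22})
    theo_sd = sigma * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])
    return slopes, theo_sd

slopes, theo_sd = simulate_ols_slopes()
print(f"mean of the estimates:    {slopes.mean():.4f}")
print(f"std. dev. of the estimates: {slopes.std(ddof=1):.4f} (theory: {theo_sd:.4f})")
```

A histogram of `slopes` approximates the sampling distribution of the OLS slope estimator; with these settings it should look normal and be centred close to the true value of 1.1.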

Using this idea, we can get the correct interpretation of a confidence interval. Modify the above experiment as follows:
1. We fix the sample size, n.
2. We take a random sample of n values of y.
3. We estimate the regression model by OLS (in our example).
4. We construct a (say) 95% confidence interval for the true coefficient of interest to us.
5. We repeat steps 2 to 4, again, and again.........an infinite number of times.
6. We look at all of the confidence intervals (interval estimates) that we have obtained. Out of all of these many confidence intervals, 95% of them will cover the true (unknown) value of the coefficient.
However, any single one of these intervals may, or may not, cover it. We'll never know!
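The modified experiment can be sketched the same way. Again, this is only an illustration under assumptions of my own choosing: the same hypothetical simple regression (true slope 1.1), and, to keep the code free of extra libraries, an error standard deviation that is treated as known, so the interval uses a normal critical value rather than the usual Student-t one. The function name `ci_coverage` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def ci_coverage(n=50, reps=10000, beta=(1.0, 1.1), sigma=1.0, level=0.95, seed=1):
    """Return a boolean array: did each replication's interval cover the true slope?"""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta)
    x = rng.uniform(0.0, 10.0, size=n)       # regressor values, held fixed across replications
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    se = sigma * np.sqrt(XtX_inv[1, 1])      # std. dev. of the slope estimator (sigma assumed known)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    covered = np.empty(reps, dtype=bool)
    for r in range(reps):
        y = X @ beta + rng.normal(0.0, sigma, size=n)   # step 2: a fresh random sample
        b = XtX_inv @ X.T @ y                           # step 3: OLS estimate
        lower, upper = b[1] - z * se, b[1] + z * se     # step 4: the interval estimate
        covered[r] = lower <= beta[1] <= upper          # does this interval cover the truth?
    return covered

covered = ci_coverage()
print(f"share of intervals covering the true slope: {covered.mean():.3f}")
print("first ten intervals covered it?", covered[:10])
```

Each entry of `covered` is simply True or False, which is the "zero or 100%" point for any single interval. It is only the proportion across all of the replications that should come out close to 0.95.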

So, the first interpretation I gave for the confidence interval in the opening paragraph above is clearly wrong. The correct probability there is not 95% - it's either zero or 100%! The second interpretation is also wrong. "This interval" doesn't include the true value 95% of the time. Instead, 95% of such intervals will cover the true value.

And notice that we really should talk about the (random) intervals "covering" the (fixed) value of the parameter. If, as some people do, we talk about the parameter "falling in the interval", it sounds as if it's the parameter that's random and the interval that's fixed. Not so!

In a recent post (here) I talked about Wolfram's new cdf file format. They have a nice app. to illustrate confidence intervals here.

Incidentally, anyone who has served in court as an expert witness in statistics will tell you that judges have a lot of trouble understanding the concept of a confidence interval. Students may find some comfort in this!

So, with all of this under our belts, it's time for that story I promised you.

My colleague, Malcolm Rutherford, who specializes in HET (the history of economic thought), has a forthcoming paper about the history of the U.S. Department of Agriculture Graduate School. A year or two ago, he kindly lent me some archival material he had acquired. This related to the visiting lectures and "conferences" (we'd now call them "seminars") that the eminent statistician Jerzy Neyman gave at the Graduate School in April 1937.

This material makes fascinating reading. Pertinent to this post is the transcript of one of Neyman's conferences, titled "An Outline of the Theory of Confidence Intervals". It's a gem! As with the other conferences, not only is Neyman's presentation itself recorded in full detail, but so are the questions from the attendees, and Neyman's responses.

Keep the date in mind, as well as the fact that Neyman was Polish, and did not move to Berkeley until 1938. Confidence intervals were at the cutting edge at this time, leading to the following interchange (p.157):

Mrs. Kantor:
"Is there anything in English literature on confidence intervals?"
Dr. Neyman:
"It is a relatively new subject. Apart from my paper in the Phil Trans. already cited on page 28, where I give the theory as discussed here, the references are as follows......."
Neyman then proceeded to reference 5 papers published in 1934 and 1935, and one to be published in 1937.

For me, the most interesting interchange is the following one (pp.144-145), where Neyman summarizes the steps associated with constructing a confidence interval:

Dr. Neyman:
"The practical statistician is able to observe the xi, and he wishes to know how these should be used for making some statement concerning the value of θ1. We may advise him to perform the following three steps which, together, are equivalent to a single random experiment:
(i) to observe the values of the xi, called E.
(ii) to calculate the corresponding values of the functions θL (E) and θU (E).
(iii) to state that θL (E) ≤ θ1 ≤ θU (E).

You will notice that in this statement he may be correct or he may be wrong. But, owing to the properties of the functions θL and θU as expressed by Eq.(1) [not shown here; DG], the probability of his being correct will be equal to α (e.g. 0.99). It follows that, if the experiment is so arranged that the xi do follow the elementary law that served for constructing the functions θL and θU, then the empirical law of big numbers will guarantee that the practical statistician following the above advice will be correct in his statements concerning the value of θ1 in 99 percent of all cases."
(immediately followed by)

Mr. F:
"Your statement of probability that he will be correct in 99 per cent of the cases is also equivalent to the statement, is it not, that the probability is 99 out of 100 that θ10 [the true value of the parameter; DG] lies between the limits θL and θU?"
Dr. Neyman:
"No. This is the point I tried to emphasize in my first two lectures both in theoretical discussions and in examples. [OUCH!@#!; DG] θ10 is not a random variable. It is an unknown constant. In consequence, if you consider the probability of θ10 falling within any limits, this may be either zero or unity, according to whether the actual value of θ1 happens to [be] outside of those limits or within."
Ex post, W. Edwards Deming (who assisted Neyman in the editing of the transcript) added a footnote (p.145):
"...The point is that we must not speak of the probability of θ1 lying within fixed limits, nor limits that are not random variables. Ed."
Succinctly put, Dr. Deming! But who was the mysterious "Mr. F." who just couldn't get it straight, and got his knuckles rapped by Neyman?

Actually, it was none other than Milton Friedman!

Students - take note. And take heart!

References

Neyman, J. (1937). An outline of the theory of confidence intervals. (A conference with Dr. Neyman in the auditorium of the Department of Agriculture, 9th April 1937, 10 a.m., Dr. Frederick V. Waugh presiding). In Lectures and Conferences in Mathematical Statistics, Delivered by J. Neyman at the United States Department of Agriculture in April 1937, Graduate School of the United States Department of Agriculture, Washington D.C., pp. 143-160.

Rutherford, M. A. (2011). "The USDA Graduate School: Government training in statistics and economics, 1921-1945." Journal of the History of Economic Thought, forthcoming.

1. Lovely anecdote about Neyman and Friedman. As for your students who say "there's a 95% probability the true value lies in the interval", they are probably just Bayesians implicitly using noninformative priors and conditioning on their model and observed data...can't blame them too much!

2. Ben: Absolutely right! Of course a REAL Bayesian wouldn't use the words "confidence interval"...
:-)

3. Great post and great story! In addition to judges, doctors also have a great deal of trouble properly interpreting the confidence intervals presented in medical literature.

I think students' difficulties (my own, anyway) stem from an inadequate understanding of the sampling distribution before diving into moments and OLS regression. Peter Kennedy's Guide to Econometrics opens with an excellent, intuitive treatment of sampling distributions that helped me immensely.

4. Jeremy: Thanks. I agree about the medics. There are some gems in the med. journals! You're also right about the key role the sampling distribution plays in understanding what follows. I like to get students doing MC simulations nice and early.

1. «I like to get students doing MC simulations nice and early.»

That is a really, really good point. For example, I found that I and others only understand ("somewhat") the dreaded p-value if it is computed from a MC simulation, because the classic definition is in effect a double negative, not a constructive one.

5. Does the story (i.e. the interpretation of the confidence intervals) change if we are considering an estimator where only the asymptotic distribution is known? E.g T^0.5 (b-beta) is asymptotically normal.

6. Rasmus: Thanks for the comment. No, the story (interpretation of the confidence interval) doesn't change in the asymptotic case.

Of course, if you were a Bayesian, then the whole idea of a confidence interval will be meaningless, regardless of the sample size - because you'd have no interest in "repeated sampling", or the associated idea of the "sampling distribution".

Glad you are enjoying the blog.

1. «if you were a Bayesian, then the whole idea of a confidence interval will be meaningless, regardless of the sample size - because you'd have no interest in "repeated sampling", or the associated idea of the "sampling distribution".»

I think that is a bizarre misdescription of Bayesian approaches, as if they were "well one sample is plenty and then we bet the farm, because priors!".

7. Very nice post.
About Mr. F's question: I am ok with the fact that θ10 is a constant, and as such the probabilities are zero or 1. But the intervals are random variables, and we can ask about the probability of one of those intervals covering θ10 or not. Can't we?

1. Yes we can. And of course given that the confidence interval is constructed using the sampling distribution of the point estimator, the notion of "probability" in this context (whether we like it or not), is based on "repeated sampling". We'll never know if our single interval covers the unknown parameter or not.

8. But in this case, the probability of the single interval covering the unknown parameter would be 95%, wouldn't it?

1. Yes, but only in the sense that if we repeated the exercise again and again, with randomly drawn samples of the same size, then 95% of all of the intervals that we constructed would cover the parameter. In practice, we're not (usually) able to do this.

9. Brilliant! Now, I just can't link the two things:
(1) the probability of a single interval covering the unknown parameter is 95%.
(2) the probability of the unknown parameter be within a single interval is either zero or 1.

1. The interval is random; the parameter is constant.

2. Ok, I just wanted to convince myself this is the only reason.
All doubts clarified. Thanks for that.

10. Thanks, professor. So what can we say about a single interval? In your example, how would you interpret the confidence interval of [0.9,1.3]?

11. Anonymous: You can't say anything about a single interval. As Neyman (1952) said, "[all the CI] does assert is that the probability of success in estimation using [any] formula[] is equal to [95%]." You can read more on this topic in our upcoming paper, "The Fallacy of Placing Confidence in Confidence Intervals" (http://learnbayes.org/papers/confidenceIntervalsFallacy/index.html).

12. Anonymous: What you can say about the single interval is, barring other available information that would make such a statement absurd, that you believe the true value is in it. People run into the most trouble trying to wrap such statements in probabilities not understanding that after the interval is calculated there aren't any known ones for the interval (without additional work). But consider the confidence in the procedure. If you perform a procedure that is correct 95% of the time it's perfectly rational to then just act as if the procedure gave you the correct answer even if you have no idea the exact probability that you're correct this particular time. I always find it fascinating that people have no problem acting as if their decision following a typical test is correct even though they may be wrong at a much higher rate than for a CI but can't do the same with a CI. The difference is that you're not stating sig./non-sig. but instead saying that mu is here.

13. I am grateful to all of you (Dave Giles and Richard Morey et al) for explaining this so clearly. Thank you!

14. Excellent explanation... but sorry, but you're really just parsing words here. If 95% of the intervals would cover the true value, then IMO it's not illogical at all to say that there's a 95% chance that any particular interval selected contains the true value. Yes, I get that the specific one we estimated either does or does not, but on average 95% is the best estimate we have of whether it does or does not.

15. Actually, there IS a way to interpret realized CIs. The concept is “bet-proofness”. We had quite a good discussion about it over at Andrew Gelman's blog several months ago. I learned about the concept from a recent paper by Mueller-Norets (Econometrica 2016).

Mueller-Norets (2016, published version, p. 2185):

“Following Buehler (1959) and Robinson (1977), we consider a formalization of “reasonableness” of a confidence set by a betting scheme: Suppose an inspector does not know the true value of θ either, but sees the data and the confidence set of level 1−α. For any realization, the inspector can choose to object to the confidence set by claiming that she does not believe that the true value of θ is contained in the set. Suppose a correct objection yields her a payoff of unity, while she loses α/(1−α) for a mistaken objection, so that the odds correspond to the level of the confidence interval. Is it possible for the inspector to be right on average with her objections no matter what the true parameter is, that is, can she generate positive expected payoffs uniformly over the parameter space? … The possibility of uniformly positive expected winnings may thus usefully serve as a formal indicator for the “reasonableness” of confidence sets.”

“The analysis of set estimators via betting schemes, and the closely related notion of a relevant or recognizable subset, goes back to Fisher (1956), Buehler (1959), Wallace (1959), Cornfield (1969), Pierce (1973), and Robinson (1977). The main result of this literature is that a set is “reasonable” or bet-proof (uniformly positive expected winnings are impossible) if and only if it is a superset of a Bayesian credible set with respect to some prior. In the standard problem of inference about an unrestricted mean of a normal variate with known variance, which arises as the limiting problem in well behaved parametric models, the usual [realized confidence] interval can hence be shown to be bet-proof."

Full reference:

Credibility of Confidence Sets in Nonstandard Econometric Problems
Ulrich K. Mueller and Andriy Norets (2016)
https://www.princeton.edu/~umueller/cred.pdf
http://onlinelibrary.wiley.com/doi/10.3982/ECTA14023/abstract

Interesting stuff!

--Mark

16. Mark - thanks for pointing this out! I'll check it out. (Note that my blog post was from 2011 - I promoted it recently because it was the 100th anniversary of Friedman's birth.)

17. Ah... hadn't noticed that! In 2011 I wasn't aware of "bet-proofness" either - I only learned about it from the M-N 2016 paper. But the concept has been around for decades, apparently. It's curious that it isn't more widely known.

1. I will definitely be looking into this - thanks again for alerting me (and other readers).

2. The best discussion on Andrew Gelman's blog is in connection with this entry:

http://andrewgelman.com/2017/03/04/interpret-confidence-intervals/

Some good contributions there, esp. by Carlos Ungil and Daniel Lakeland.

3. Thanks Mark!

18. «The true value of the coefficient in this regression model is a parameter. It's a constant, whose value we just don't happen to know. On the other hand, the (point) estimate of 1.1 is just one particular "realized" value of a random variable - generated using this one particular sample of data. An estimator is a formula - like the OLS formula in our example. Except in rather silly cases, this formula involves using the (random) sample data. So, an estimator is a function of the sample data - in other words, what we call a statistic. When we apply this formula using a particular sample of data, we generate a number - a point estimate. Because an estimator is a function of the random data, it's random itself. Being a random variable, an estimator has a distribution function.»

To me this looks like extremely loose and obfuscating terminology that gets so many people in trouble, for example can a "formula" be a "random variable" and have a "distribution function"? That's simply ridiculous. The way I learned it from some very clear thinking definettian subjectivists (but it is not a subjectivist point of view) is:

* There is an algebra of arithmetic number and an algebra of stochastic numbers, and they are fundamentally different.

* A "statistic" is a measure over a set of numbers, whether they be arithmetic or stochastic. The same formula for a measure can portend two different functions, one over arithmetic numbers, one over stochastic numbers.

* Arithmetic numbers arise from populations, stochastic numbers from samples (under the hypothesis that the sampling process is ergodic, but I am not sure that is what a definettian subjectivist would say).

* A measure on a sample is at the same time an arithmetic number with respect to the sample, and a stochastic number if *interpreted* as an estimate of the same measure on the population, while a measure on a population is always and only an arithmetic number.

* Bonus point: it is fantastically important (especially in studies of the political economy) to always ask what is the population from which a sample has been drawn, and whether the sampling process was indeed ergodic. And if you consider those two questions deeply enough, you end up a definettian subjectivist I guess :-).

I do hope that I was not that loose conceptually or in terminology in the above, and that it reflects the insights I got from those clear thinking people.