We've all experienced it. You go to use a statistical table - Standard Normal, Student-t, F, Chi-square - and the line that you need simply isn't there. That's to say, the table simply isn't detailed enough for your purposes.
One question that always comes up when students are first being introduced to such tables is:
"Do I just interpolate linearly between the nearest entries on either side of the desired value?"
Not that these exact words are used, typically. For instance, a student might ask if they should take the average of the two closest values. How should you respond?
The correct answer to the question is: "No - not in general". To which we should add that the correct way to deal with this situation depends on:
- Which distribution we're dealing with.
- Whether we are trying to retrieve a "missing" quantile for the distribution, or trying to determine a tail area (probability) associated with a given quantile.
For example, suppose that we wish to find the 70th percentile for a Chi-square distribution with 67 degrees of freedom; or the p-value associated with an F-value of 0.65, when the numerator and denominator degrees of freedom are 17 and 33, respectively.
With regard to the first of these calculations, suppose that we have access to a typical table of percentiles for the Chi-square distribution. Using that table alone, the best that we could do by way of reporting the 70th percentile when the degrees of freedom are 67 would be to say that it lies somewhere between 46.459 and 85.527. Let's face it - that's not particularly helpful!
A similar situation arises if we want to compute the p-value for the F-distribution mentioned above, and we have access only to the tables that we usually find in text books.
Of course, if we have access to our favourite econometric/statistical package, it's a simple matter to compute the desired value. To get the exact answers with EViews, we'd use the command:
scalar q70 = @qchisq(0.70, 67)
and the answer would be 72.554.
Similarly, the p-value referred to above could be computed exactly by using the command:
scalar pval = 1 - @cfdist(0.65,17,33)
and the answer would be 0.826.
In R, we'd use the commands:
qchisq(0.70, df=67) and pf(0.65, 17, 33, lower.tail=FALSE)
to produce exactly the same results.
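If you happen to work in Python rather than EViews or R, the equivalent calculations can be sketched with SciPy's distribution functions (this is my addition - the post itself uses only EViews and R):

```python
# A Python (SciPy) sketch of the same two calculations.
from scipy.stats import chi2, f

# 70th percentile of a Chi-square distribution with 67 degrees of freedom
q70 = chi2.ppf(0.70, df=67)
print(q70)   # approximately 72.554

# p-value (upper-tail area) for an F-value of 0.65 with (17, 33) d.o.f.
pval = f.sf(0.65, 17, 33)
print(pval)  # approximately 0.826
```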
But what about the answer to your student's question about interpolation? Suppose you're stuck with the statistical table, and you don't have access to your econometrics or statistical package. What do you do?
Some of you will remember when this was a daily occurrence! We didn't have a laptop, a tablet, or even a pocket calculator.
We could look at various common distributions here. However, by way of illustration, let's take just the Student-t Distribution. We'll consider appropriate ways to compute (interpolate) a quantile (percentile), first for a non-tabulated tail area probability; and secondly, for a non-tabulated degrees of freedom.
1. Interpolating Between Tail Areas
Suppose that we have v = 20 degrees of freedom and we want to find the quantile (percentile) that will give us an area of 0.075 under the right tail of the Student-t density function.
Note that 0.075 is mid-way between 0.05 and 0.10, so any standard t-table will tell us that the quantile we want lies between 1.325 and 1.725. The (arithmetic) average of these two percentiles is (1.325 + 1.725) / 2 = 1.525.
What we're actually doing is linearly (straight-line) interpolating between the two original values:
1.525 = 1.325 + (1.725 - 1.325)[(0.075 - 0.1) / (0.05 - 0.1)]
However, using either the EViews command, scalar qval = @qtdist(0.925,20) or the R command, qt(0.925, df=20), we get the answer, 1.497036. This isn't the result that we got by linearly interpolating between the percentiles on either side!
(Recall that we wanted a right-tail area of 0.075, which implies a left-tail area of 0.925.)
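In code, the naive straight-line interpolation looks like this (plain Python; the exact value reported above for qt(0.925, df=20) is hard-coded for comparison):

```python
# Naive straight-line interpolation between the two tabulated t quantiles
# for v = 20: right-tail areas 0.10 and 0.05, quantiles 1.325 and 1.725.
lo_area, hi_area = 0.10, 0.05
lo_q, hi_q = 1.325, 1.725
target_area = 0.075

w = (target_area - lo_area) / (hi_area - lo_area)  # interpolating weight
q_linear = lo_q + (hi_q - lo_q) * w
print(q_linear)          # 1.525

EXACT = 1.497036         # qt(0.925, df=20), as reported above
print(q_linear - EXACT)  # the linear answer overshoots by about 0.028
```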
Why isn't the linear interpolation working? Well, intuitively, it's a consequence of the curvature of the density function (or the distribution function).
So, how can we deal with this?
Hoaglin et al. (1991) show that we can get a very good approximation to the correct value for the quantile (1.497036) by interpolating using the (base 10) logarithms of the tail areas to construct the interpolating weights.
Noting that log10(0.075) = -1.1249387366083 ; log10(0.1) = -1.0 ; and log10(0.05) = -1.301029995663981, the appropriate calculation for the desired quantile becomes:
q = 1.325 + (1.725 - 1.325)[(-1.1249387366083 - (-1)) / (-1.301029995663981 - (-1))]
= 1.49101
That's more like it, even though it isn't perfect!
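The same calculation, written out in plain Python (again, with the exact value from the text hard-coded for comparison):

```python
import math

# Hoaglin et al. (1991): interpolate in log10(tail area) rather than in
# the tail area itself, using the same two tabulated quantiles as before.
lo_area, hi_area = 0.10, 0.05
lo_q, hi_q = 1.325, 1.725
target_area = 0.075

w = (math.log10(target_area) - math.log10(lo_area)) / (
    math.log10(hi_area) - math.log10(lo_area))
q_log = lo_q + (hi_q - lo_q) * w
print(q_log)   # about 1.49101 -- much closer to the exact 1.497036
```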
2. Interpolating Between Degrees of Freedom

Now let's consider a different situation - one where the table that we're using for the t-distribution includes the (right-hand) tail area that we want - say, 5%. However, our particular degrees of freedom (v) are nowhere to be found in the table.
Actually, this isn't likely to be as much of a problem as the situation we've just considered above. Most t-tables cover every degree of freedom, for small-to-moderate values of v; and when v is large the standard-Normal approximation will usually suffice.
However, let's suppose that we want an accurate answer, and by way of an example, consider a 5% (right) tail area, and 53 degrees of freedom.
In this case we can use an harmonic interpolation, rather than a linear one. This is like switching from an arithmetic mean to an harmonic mean, so we use the values of (1 / v) to construct the interpolation weights.
(Isn't it nice to come across a situation where that harmonic mean that you learned about actually gets used! Another situation arises with certain index numbers.)
Let's go back to our example. From any Student-t table you can find that the quantiles that determine 5% in the right tail of the density are 1.671 and 1.684 for v = 60 and v = 40, respectively. Here's what happens if you just use a naive linear interpolation to get the quantile for v = 53:
q = 1.671 + (1.684 - 1.671)[(53 - 60) / (40 - 60)] = 1.6755
Using the EViews command scalar q = @qtdist(0.95, 53), or the R code qt(0.95, df = 53), you can verify that the exact answer is 1.674116.
Applying the harmonic interpolation, we get:
q = 1.671 + (1.684 - 1.671)[((1/53) - (1/60)) / ((1/40) - (1/60))] = 1.67443.
This is certainly a little more accurate than the result obtained above by linear interpolation. However, the gain in accuracy will vary, depending on the significance level and degrees of freedom that we're considering.
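Both interpolations for this example can be sketched side by side in plain Python (the exact value from the text is hard-coded for comparison):

```python
# Harmonic interpolation in the degrees of freedom: form the interpolating
# weight from 1/v, not from v itself.
v_lo, v_hi = 60, 40          # tabulated degrees of freedom
q_lo, q_hi = 1.671, 1.684    # 5% right-tail t quantiles for v = 60, 40
v = 53

w_linear = (v - v_lo) / (v_hi - v_lo)
w_harmonic = (1 / v - 1 / v_lo) / (1 / v_hi - 1 / v_lo)

q_linear = q_lo + (q_hi - q_lo) * w_linear
q_harmonic = q_lo + (q_hi - q_lo) * w_harmonic

EXACT = 1.674116             # qt(0.95, df=53), as reported above
print(q_linear)    # about 1.6755
print(q_harmonic)  # about 1.67443 -- closer to the exact value
```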
When we're working with the Chi-Square, F, or other distributions, similar methods are available. However, the details differ a little from those used for the Student-t distribution. The references given below will give you some guidance.
In a sense, the take-away message here is very simple - don't use "straight-line" interpolation when the function is curved!
References
Hoaglin, D. C., F. Mosteller, and J. W. Tukey, 1991. Fundamentals of Exploratory Analysis of Variance. Wiley, New York.
Salton, G., 1959. The use of the central limit theorem for interpolating in tables of probability distribution functions. Mathematical Tables and Other Aids to Computation, 13, 213-216.
Zinger, A., 1964. On interpolation in tables of the F-distribution. Journal of the Royal Statistical Society, Series C (Applied Statistics), 13, 51-53.
© 2018, David E. Giles
Happy New Year Dave
Thanks - and to you too! Here's to some regular blogging this year!
"In a sense, the take-away message here is very simple" - for me, not the one you give at the end, but rather the one you give in between: any econometrics teacher who has not switched yet should be making the new year's resolution to teach students to use statistical software for tasks such as finding quantiles.
The confusion in my students when performing exactly the task you are describing here was one of the reasons for me to ditch tables for good.
This is not the advice I would give. If ordinary interpolation (not averaging the two numbers, actual interpolation) isn't accurate enough, the problem is too complex for simple fixes like interpolation in some other variable. The student's time would be better spent looking for a friend with a working computer than puzzling over such an uninteresting question. Your interpolation suggestions are nice, but it's bad advice to ask someone to actually use them -- IMHO.
Sorry that you found the issue uninteresting. Of course I agree that we should just use a standard package: see my paragraph that begins, "Of course.........", and the passages that follow.
I'm sorry. I didn't mean "uninteresting" in the sense of not interesting. I meant that it wouldn't contribute to human knowledge because smart people like you already know it, and that it's more technical than fundamental. I thought your suggestions were interesting and clever.
I sometimes wish people could be licensed to give technical advice on computational methods the way people giving financial advice are. Then you could lose your license for suggesting rectangle rule integration rather than the integration package in R, say. The rectangle rule isn't wrong, but high order integration rules are better. Influencing your colleague to write his/her own integration program from scratch is as wrong as suggesting that a retired person invest her/his savings in bitcoin -- the wrong thing for almost everyone.