Wednesday, September 7, 2011

A Tale of Two Tests

Here's a puzzle for you. It relates to two very standard tests that you usually encounter in a first (proper) course in econometrics. One is the Chow (1960) test for a structural break in the regression model's coefficient vector; and the other is the Goldfeld and Quandt (1965) test for homoskedasticity, against a particularly simple form of heteroskedasticity.

What's the puzzle, exactly?

These tests are very straightforward. Indeed, the Chow test is just a particular application of the usual F-test for the validity of exact linear restrictions on the coefficient vector; and that's the way I always teach it. So, there's nothing special about it, in that sense, as Fisher (1970) pointed out.

Let's take a look at the underlying framework for these two tests. In the case of the Chow test we're looking at a situation where there are two sub-samples of data, with $n_1$ and $n_2$ observations, respectively:

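$$y_1 = X_1\beta_1 + \varepsilon_1 ; \quad \varepsilon_1 \sim N[0, \sigma^2 I_{n_1}]$$

$$y_2 = X_2\beta_2 + \varepsilon_2 ; \quad \varepsilon_2 \sim N[0, \sigma^2 I_{n_2}] .$$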

Note that we have the same dependent variable, y, in each sub-sample. It's just that we are using subscripts to signal which sub-sample we're using. The same applies to the regressor matrix, X. So, each equation has the same k regressors. Very importantly, notice that the variance of the errors is the same ($\sigma^2$) across both sub-samples.

The Chow test is for the null hypothesis, $H_0: \beta_1 = \beta_2$ (and $\sigma^2$ is the same for both sub-samples); against $H_A: \beta_1 \neq \beta_2$ (and $\sigma^2$ is the same for both sub-samples). Under the null hypothesis, the Chow test statistic is distributed as F, with $k$ and $(n_1 + n_2 - 2k)$ degrees of freedom.
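
To make the "exact linear restrictions" view concrete, here is one way to write it down. The symbols $R$, $RSS_P$, $RSS_1$ and $RSS_2$ are labels introduced just for this sketch: $RSS_P$ is the residual sum of squares from the pooled regression (which imposes the restrictions), while $RSS_1$ and $RSS_2$ come from the two separate sub-sample regressions.

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} =
\begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} ,
\qquad H_0: R\beta = 0 , \quad R = [\, I_k \;\; -I_k \,]
$$

$$
F = \frac{(RSS_P - RSS_1 - RSS_2)/k}{(RSS_1 + RSS_2)/(n_1 + n_2 - 2k)}
$$

There are $k$ restrictions, and the unrestricted (stacked) model has $2k$ coefficients, which is where the two degrees of freedom parameters come from.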

(Actually, this null distribution is unaltered if we relax the assumption of normality and just require that the errors follow any elliptically symmetric distribution - see my earlier post here.)
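
If you'd like to see the mechanics, here is a minimal sketch of the Chow statistic in Python with NumPy. The function names and the simulated data are hypothetical, chosen only for illustration; the calculation is just the usual F-statistic that compares the pooled (restricted) regression with the two separate (unrestricted) regressions.

```python
import numpy as np


def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def chow_statistic(y1, X1, y2, X2):
    """Chow F-statistic with its degrees of freedom, (k, n1 + n2 - 2k)."""
    n1, k = X1.shape
    n2 = X2.shape[0]
    # Restricted model: one coefficient vector for the pooled sample.
    rss_pooled = rss(np.concatenate([y1, y2]), np.vstack([X1, X2]))
    # Unrestricted model: a separate coefficient vector for each sub-sample.
    rss_split = rss(y1, X1) + rss(y2, X2)
    df_num, df_den = k, n1 + n2 - 2 * k
    f_stat = ((rss_pooled - rss_split) / df_num) / (rss_split / df_den)
    return f_stat, df_num, df_den


# Simulated data (an assumption for illustration): equal error variances,
# but a slope that differs between the two sub-samples.
rng = np.random.default_rng(0)
n1, n2 = 30, 40
X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
X2 = np.column_stack([np.ones(n2), rng.normal(size=n2)])
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n1)
y2 = X2 @ np.array([1.0, -1.0]) + rng.normal(size=n2)

f_stat, df_num, df_den = chow_statistic(y1, X1, y2, X2)
print(f_stat, df_num, df_den)
```

If SciPy is installed, `scipy.stats.f.sf(f_stat, df_num, df_den)` should give the corresponding p-value.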

Now let's look at the framework for the (simplest form of the) Goldfeld-Quandt test. In this case, we have:

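$$y_1 = X_1\beta + \varepsilon_1 ; \quad \varepsilon_1 \sim N[0, \sigma_1^2 I_{n_1}]$$

$$y_2 = X_2\beta + \varepsilon_2 ; \quad \varepsilon_2 \sim N[0, \sigma_2^2 I_{n_2}] .$$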

Again, the same dependent variable and regressors appear in both equations. The subscripts are used just to distinguish between the two sub-samples of data.

The Goldfeld-Quandt test is for the null hypothesis, $H_0: \sigma_1^2 = \sigma_2^2$ (and $\beta$ is the same for both sub-samples); against $H_A: \sigma_1^2 > \sigma_2^2$ (and $\beta$ is the same for both sub-samples). The inequality in the statement of the alternative hypothesis can be reversed without affecting anything that's said here.

Under the null hypothesis, and in the case where no "central" observations are dropped to improve the power of the test, the Goldfeld-Quandt test statistic is distributed as F, with $(n_1 - k)$ and $(n_2 - k)$ degrees of freedom, if the errors follow any distribution in the elliptically symmetric family.
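
As a rough sketch, the statistic in this simplest case is just the ratio of the two sub-sample estimates of the error variance, each taken from its own OLS regression. The function names and simulated data below are hypothetical, and no central observations are dropped.

```python
import numpy as np


def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def goldfeld_quandt_statistic(y1, X1, y2, X2):
    """Ratio of sub-sample variance estimates, with (n1 - k, n2 - k) d.f."""
    n1, k = X1.shape
    n2 = X2.shape[0]
    s2_1 = rss(y1, X1) / (n1 - k)   # unbiased estimate of sigma_1^2
    s2_2 = rss(y2, X2) / (n2 - k)   # unbiased estimate of sigma_2^2
    return s2_1 / s2_2, n1 - k, n2 - k


# Simulated data (an assumption for illustration): a common coefficient
# vector, but an error standard deviation of 3 in the first sub-sample
# and 1 in the second.
rng = np.random.default_rng(1)
n1, n2 = 30, 40
beta = np.array([1.0, 2.0])
X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
X2 = np.column_stack([np.ones(n2), rng.normal(size=n2)])
y1 = X1 @ beta + 3.0 * rng.normal(size=n1)
y2 = X2 @ beta + 1.0 * rng.normal(size=n2)

gq_stat, df1, df2 = goldfeld_quandt_statistic(y1, X1, y2, X2)
print(gq_stat, df1, df2)
```

Large values of the ratio are evidence against the null in favour of the one-sided alternative; reversing the inequality just means swapping which sub-sample goes in the numerator.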

So, in a sense, the Chow test and the Goldfeld-Quandt test are the flip sides of each other. The first is used to test that the coefficient vector is constant; against a composite alternative hypothesis that says that the coefficient vector changes discretely, although the variance of the error term is constant. The second is used to test that the variance of the error term is constant; against the composite alternative hypothesis that this variance changes discretely, while the coefficient vector is constant.

Usually, when I'm covering this in class, several questions arise naturally. Notably:
  1. Which of these tests should you perform first? 
  2. Why does the order of testing matter?
  3. Why don't we have a more general version of the Chow test that allows for the possibility that the variance of the error term may also change from one regime to the other?
  4. Why don't we have a more general version of the Goldfeld-Quandt test that allows for the possibility that the coefficient vector may also change from one regime to the other? 
These are really good questions, so let's take a look at them.

I'm going to dispense with the first two (related) questions really quickly. They're important, for sure, but they deserve separate attention and I plan to go into more detail in a subsequent post. Here, I want to focus primarily on Questions 3 and 4.

With regard to Questions 1 and 2, the first thing to note is that the test statistics for the Chow and Goldfeld-Quandt tests are not independent of each other. So, we have an example of a "preliminary testing" problem if we apply one test and then the other. Just as preliminary testing tends to bias estimators (or more generally, completely alter their sampling properties), it also affects the properties of subsequent tests.

In particular, if we apply a Chow test, and then apply a Goldfeld-Quandt test to the same model, both the size and power of the latter test are distorted. For instance, if we nominally assign (say) a 5% significance level for the second test, the true rejection rate may end up being (say) 30%. If we reverse the order of the testing, the magnitudes (and possibly directions) of the size and power distortions will change.

See Giles and Giles (1993) for an overview of pre-test testing results in econometrics. As I mentioned, I'll return to some of these issues in a subsequent post.

Now, let's look at Questions 3 and 4. The short answer to Question 3 is that this is one of the big unanswered questions in inferential statistics. Remembering that the (conditional) means of $y_1$ and $y_2$ in the set-up for the Chow test are $X_1\beta_1$ and $X_2\beta_2$, respectively, we see that the Chow test amounts to testing if the means of two normal populations are equal, when the variances of those populations are unknown but equal in value.

As soon as these unknown variances are allowed to be different from each other we have a problem. In fact this problem has a name - it's called the "Behrens-Fisher problem" (Behrens, 1929; Fisher, 1935), and there is no known exact solution to it. If you think back to when you learned about the 2-sample t-test, you should have been taught that it applies only when the two population variances are equal.

There are lots of standard approximations that are used when the variances may be unequal - notably Welch's approximate t-test (Welch, 1938, 1947). However, none of these tests are exact in the following sense. If you choose some nominal significance level, such as 5%, and construct the test, there is no way of guaranteeing that the test will actually reject the null hypothesis when it is true, exactly 5% of the time.
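
For reference, here is a minimal sketch of Welch's approximate t-test, using only the Python standard library. The function name and the data are made up for illustration. The degrees of freedom come from the Welch-Satterthwaite formula, and that is exactly where the approximation enters: the statistic does not follow a Student-t distribution exactly.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1   # estimated variance of the first sample mean
    v2 = variance(sample2) / n2   # estimated variance of the second sample mean
    t_stat = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Approximate d.f.; generally not an integer.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Made-up samples with visibly different spreads and unequal sizes.
a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
b = [4.2, 7.9, 3.1, 8.4, 5.0, 2.7, 6.6, 9.3]

t_stat, df = welch_t(a, b)
print(round(t_stat, 3), round(df, 2))
```

The approximate degrees of freedom always lie between min(n1 - 1, n2 - 1) and n1 + n2 - 2, so the test is never more liberal than the pooled-variance t-test in its choice of reference distribution.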

You'd think we'd have solved this problem by now, but we haven't! Not fully. As I said, we have various approximate tests, and the problem has, of course, also been analyzed from a Bayesian perspective. Recently, one of my Ph.D. students, Lauren Dong, made some good progress by using an empirical likelihood procedure to address the Behrens-Fisher problem, and its regression counterpart - the Chow test. (See Dong, 2003.)

So, that's why the set-up for the Chow test assumes that the error term in the underlying regression model is homoskedastic across the two sub-samples. The other tests that have been proposed for testing for a structural break in the regression context all involve approximations of some sort, fail to be exact in finite samples, or are "inefficient" - i.e., the associated confidence intervals are too wide. Such tests include those discussed by Jayatissa (1977), Watt (1979), Ohtani and Toyoda (1985), and Weerahandi (1987), for example.

Now, what about Question 4? This can be viewed as asking, "How do we construct an exact test for the equality of the variances from two (or more) Normal populations, when the means are unknown and unequal, and we have unequal sample sizes?"

Once again, this is a much tougher question than it might look!

Bartlett (1937) provided us with a widely used test for this problem. It has at least two important limitations, though. It is known to be sensitive to departures from normality (e.g., Box, 1953);  and exact critical values are available only for the very special case where the sub-samples are of equal size - see Glaser (1976). The normality robustness issue has been addressed via alternative tests, but these are not exact.
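
To fix ideas, here is a minimal sketch of Bartlett's statistic in its usual textbook form, using only the Python standard library. The function name and data are hypothetical. Under the null of equal variances (and normality) the statistic is only approximately chi-square with g - 1 degrees of freedom, where g is the number of samples.

```python
from math import log
from statistics import variance


def bartlett_statistic(*samples):
    """Bartlett's statistic and its approximate chi-square degrees of freedom."""
    g = len(samples)
    dfs = [len(s) - 1 for s in samples]          # n_i - 1 for each sample
    variances = [variance(s) for s in samples]   # unbiased sample variances
    total_df = sum(dfs)                          # N - g
    pooled = sum(d * v for d, v in zip(dfs, variances)) / total_df
    numerator = total_df * log(pooled) - sum(
        d * log(v) for d, v in zip(dfs, variances)
    )
    # Bartlett's correction factor for the chi-square approximation.
    correction = 1 + (sum(1 / d for d in dfs) - 1 / total_df) / (3 * (g - 1))
    return numerator / correction, g - 1


# Made-up samples of unequal size with different spreads.
a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
b = [4.2, 7.9, 3.1, 8.4, 5.0, 2.7, 6.6, 9.3]

stat, df = bartlett_statistic(a, b)
print(stat, df)
```

If SciPy is installed, `scipy.stats.bartlett` offers a ready-made version that also relies on the chi-square approximation rather than exact critical values.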

Chao and Glaser (1978, p.425) provide an exact expression for the null density of Bartlett's test statistic for the general case of unequal sub-sample sizes. However, this expression is extremely complex and as it depends on many parameters, the task of using it to compute exact critical values is daunting, and has not been accomplished. Indeed, Chao and Glaser (1978, p. 426) note:

"The construction of a table of exact Bartlett critical values seems prohibitive, however, due to the enormity of the number of conceivably interesting situations."

So, we still seem to lack a practical answer to Question 4.

I think these two tests help to remind us that often there are some really strong (unrealistic?) conditions that have to be satisfied before some of our standard results hold. These two tests are also interesting because it turns out that there are some pretty good reasons why strong conditions are being imposed in their construction.

Finally, sometimes even problems that look really simple are still awaiting an adequate solution. There's always work to be done!

Note: The links to the following references may require that your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.


References

Bartlett, M. S. (1937), "Properties of sufficiency and statistical tests", Proceedings of the Royal Society of London, A, 160, 268-282.

Behrens, W. V. (1929), "Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen" (transl: A contribution to error estimation with few observations), Landwirtschaftliche Jahrbücher, 68, 807-37.

Box, G. E. P. (1953), "Non-normality and tests on variances", Biometrika, 40, 318-335.

Chao, M-T. and R. E. Glaser (1978), "The exact distribution for Bartlett's test statistic for homogeneity of variances with unequal sample sizes", Journal of the American Statistical Association, 73, 422-426.

Chow, G. C. (1960), "Tests of equality between sets of coefficients in two linear regressions", Econometrica, 28, 591-605.

Dong, L. B. (2003), "Empirical Likelihood in Econometrics", Ph.D. Dissertation, Department of Economics, University of Victoria.

Fisher, F. M. (1970), "Tests of equality between sets of coefficients in two linear regressions: An expository note", Econometrica, 38, 361-366.

Fisher, R. A. (1935), "The fiducial argument in statistical inference", Annals of Eugenics, 8, 391-398.

Giles, J. A. and D. E. A. Giles (1993), "Pre-test estimation and testing in econometrics: Recent developments", Journal of Economic Surveys, 7, 145-197.

Glaser, R. E. (1976), "Exact critical values for Bartlett's test for homogeneity of variances", Journal of the American Statistical Association, 71, 488-490.

Goldfeld, S. M. and R. E. Quandt (1965), "Some tests for homoscedasticity", Journal of the American Statistical Association, 60, 539-547.

Jayatissa, W. A. (1977), "Tests of equality between sets of coefficients in two linear regressions when disturbance variances are unequal", Econometrica, 45, 1291-1292.

Ohtani, K. and T. Toyoda (1985), "Small sample properties of tests of equality between sets of coefficients in two linear regressions under heteroscedasticity", International Economic Review, 26, 37-44.

Watt, P. A. (1979), "Tests of equality between sets of coefficients in two linear regressions when disturbance variances are unequal: Some small sample properties", The Manchester School, 47, 391-396.

Weerahandi, S. (1987), "Testing regression equality with unequal variances", Econometrica, 55, 1211-1215.

Welch, B. L. (1938), "The significance of the difference between two means when the population variances are unequal", Biometrika, 29, 350-362.

Welch, B. L. (1947), "The generalization of "Student's" problem when several different population variances are involved", Biometrika, 34, 28-35.

© 2011, David E. Giles



  1. i actually like your entries that involve almost undergraduate level econometrics but may not have practical implementations (yet students may have thought of these questions--which test to apply first?). for instance, extreme bounds analysis is one of the many practical ways to cope with the choice of variables in a linear regression framework. the underlying econometric issue (choice of variables, i.e. which variables to include or exclude) has an answer in econometrics (found in most undergrad textbooks) but require knowledge not typically available to the econometrician.

    i am looking forward to your post on pretesting testing problems.

  2. Andrew: Thanks for the comment. I'll get to the pre-testing material ASAP.

  3. Why not create a fully interacted model and test the null that the interaction terms are all 0 using a Wald test with robust standard errors?

  4. Charlie: That's just fine, of course, if all you want is an asymptotically valid test; but this discussion is all about exact (finite-sample) testing.

  5. Okay, so how about doing a fully interacted model, which would be robust to having different coefficients, though it would be inefficient if the coefficients are all the same. Then do the GQ test. If there's heteroskedasticity, do WLS and apply the Chow test to the weighted model. Does that work? The inefficiency would come into play for the GQ test, but it seems to me that it would be inefficient in the same way (i.e., due to too few degrees of freedom) for both/all groups, so it wouldn't matter.

  6. Charlie: First, the point is that there is no exact solution to the Behrens-Fisher problem. Period.

    Second, what you suggest may be "reasonable", asymptotically, but in finite samples the properties of this pre-test testing procedure would have to be worked out properly. The significance level of the overall procedure would definitely not be the same as you nominally assign it to be for either of the tests. Also, if you reversed the order of the tests you'd change the true significance level and could get a different final outcome.

    There are lots of ad hoc procedures such as this that you could use, but that's not the point.