## Monday, June 24, 2013

### Can You Actually TEST for Multicollinearity?

When you're undertaking a piece of applied econometrics, something that's always on your mind is the need to test the specification of your model, and to test the validity of the various underlying assumptions that you're making. At least - I hope it's always on your mind!

This is an important aspect of any modelling exercise, whether you're working with a linear regression model, or with some nonlinear model such as Logit, Probit, Poisson regression, etc. Most people are pretty good when it comes to such testing in the context of the linear regression model. They seem to be more lax once they move away from that framework. That makes me grumpy, but that's not what this particular post is about.

It's actually about a rather silly question that you sometimes encounter, namely: "Have you tested to see if multicollinearity is a problem for your results?"

I'll explain why this isn't really a sensible question, and why the answer to the question in the title for this post is a resounding "No!"

First of all, let's stop and think for a moment about what is actually going on when we perform a statistical test of some hypothesis. The null and alternative hypotheses are statements, or conjectures, about some feature of the underlying population. That is, they're associated with the data-generating process that supposedly gave rise to the sample data that we actually observe. For example, the hypothesis that we're testing may be a statement to the effect that one of the parameters in the population takes a particular value.

Hypothesis testing is an example of statistical inference. We use specific sample information to try and learn (infer) something about the unobserved characteristics of the population at large.

To state the obvious, in the context of a regression model, it's meaningful to have null and alternative hypotheses of the form H0: β2 = 0 and H1: β2 > 0. Under standard conditions we might then test the validity of H0 by using the t-statistic, t2 = (b2 - β2) / s.e.(b2). We'd reject H0 in favour of H1 if t2 > cα, where cα is the 100(1 - α)'th percentile of Student's t-distribution with (n - k) degrees of freedom.
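As a minimal sketch of this one-sided test, here is the decision rule in code. All of the numbers (the estimate, its standard error, n, and k) are made up purely for illustration:

```python
# One-sided t-test of H0: beta2 = 0 against H1: beta2 > 0.
# Every number below is a hypothetical value, assumed for illustration.
from scipy import stats

b2 = 0.85        # hypothetical OLS estimate of beta2
se_b2 = 0.30     # hypothetical standard error of b2
n, k = 50, 3     # hypothetical sample size and number of coefficients
alpha = 0.05     # chosen significance level

t2 = (b2 - 0.0) / se_b2                      # t-statistic under H0: beta2 = 0
c_alpha = stats.t.ppf(1 - alpha, df=n - k)   # 100(1 - alpha)'th percentile of t(n - k)

reject = t2 > c_alpha
print(f"t = {t2:.3f}, critical value = {c_alpha:.3f}, reject H0: {reject}")
```

The key point for what follows: both hypotheses refer to the fixed population parameter β2, while the statistic t2 is computed from the sample.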

On the other hand, it's nonsensical to write down null and alternative hypotheses of the form, H0: b2 = 0 and H1: b2 > 0. These "hypotheses" are actually statements about the random variable, b2, which is a function of the observed sample data. They are not statements about a fixed, unobserved, population parameter! They don't form any basis for making inferences.

What do we mean by the term "multicollinearity", anyway? This turns out to be the key question!

Multicollinearity is a phenomenon associated with our particular sample of data when we're trying to estimate a regression model. Essentially, it's a situation where there is insufficient information in the sample of data to enable us to draw "reliable" inferences about the individual parameters of the underlying (population) model.
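A toy simulation (assumed purely for illustration, using numpy only) makes this "it's the sample, not the population" point concrete: the same population model is estimated on two samples, and only the sample X matrix differs between them.

```python
# Same population model, two samples: one with nearly collinear regressors,
# one without. All simulated values are assumed toy data for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 100
beta = np.array([1.0, 2.0, 3.0])   # population parameters: intercept, beta1, beta2

def ols_std_errors(x1, x2):
    """Generate y from the SAME population model, then return the OLS
    standard errors sqrt(s^2 * diag((X'X)^{-1}))."""
    X = np.column_stack([np.ones(n), x1, x2])
    y = X @ beta + rng.normal(size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    s2 = resid @ resid / (n - X.shape[1])
    return np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

x1 = rng.normal(size=n)
se_ok = ols_std_errors(x1, rng.normal(size=n))              # independent regressors
se_bad = ols_std_errors(x1, x1 + 0.01 * rng.normal(size=n)) # nearly collinear sample

print("SEs, independent x2:   ", se_ok.round(3))
print("SEs, near-collinear x2:", se_bad.round(3))
# The slope standard errors balloon in the near-collinear sample, even though
# the population model is identical in both cases.
```

Nothing about the population changed between the two runs; the sample simply carries too little independent variation to pin down the individual slopes.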

I'll be elaborating more on the "informational content" aspect of this phenomenon in a follow-up post. Yes, there are various sample measures that we can compute and report, to help us gauge how severe this data "problem" may be. But they're not statistical tests, in any sense of the word.

Because multicollinearity is a characteristic of the sample, and not a characteristic of the population, you should immediately be suspicious when someone starts talking about "testing for multicollinearity". Right?

Apparently not everyone gets it!

There's an old paper by Farrar and Glauber (1967) which, on the face of it, might seem to take a different stance. In fact, if you were around when this paper was published (or if you've bothered to actually read it carefully), you'll know that this paper makes two contributions. First, it provides a very sensible discussion of what multicollinearity is all about. Second, the authors take some well-known results from the statistics literature (notably, by Wishart, 1928; Wilks, 1932; and Bartlett, 1950) and use them to give "tests" of the hypothesis that the regressor matrix, X, is orthogonal.

How can this be? Well, there's a simple explanation if you read the Farrar and Glauber paper carefully, and note what assumptions are made when they "borrow" the old statistics results. Specifically, there's an explicit (and necessary) assumption that in the population the X matrix is random, and that it follows a multivariate normal distribution.

This assumption is, of course, totally at odds with what is usually assumed in the linear regression model! The "tests" that Farrar and Glauber gave us aren't really tests of multicollinearity in the sample. Unfortunately, this point wasn't fully appreciated by everyone.

There are some sound suggestions in this paper, including looking at the sample multiple correlations between each regressor, and all of the other regressors. These, and other sample measures such as variance inflation factors, are useful from a diagnostic viewpoint, but they don't constitute tests of "zero multicollinearity".
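As a sketch of those diagnostics (toy simulated data, assumed purely for illustration), the variance inflation factor for each regressor is 1/(1 - R²) from regressing that column on all the others, and a scaled condition number summarises the same information for the whole X matrix:

```python
# Diagnostic (not test!) measures for collinearity: VIFs and a condition
# number. X here is assumed toy data; in practice it would be your
# regressor matrix, excluding the intercept column for the VIFs.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=200)   # deliberately correlated with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (with an intercept)."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1.0 - np.sum((y - Z @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"VIF for column {j}: {vif(X, j):.2f}")

# Condition number of the column-scaled regressor matrix:
Xs = X / np.linalg.norm(X, axis=0)
print("condition number:", round(float(np.linalg.cond(Xs)), 2))
```

Large VIFs (a common rule of thumb is 10) or a large condition number flag a weak-data situation, but there is no null hypothesis being tested here; these are descriptions of the sample in hand.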

So, why am I even mentioning the Farrar and Glauber paper now?

Well, I was intrigued to come across some Stata code (Shehata, 2012) that allows one to implement the Farrar and Glauber "tests". I'm not sure that this is really very helpful. Indeed, this seems to me to be a great example of applying someone's results without understanding (bothering to read?) the assumptions on which they're based!

Be careful out there - and be highly suspicious of strangers bearing gifts!

References

Bartlett, M. S., 1950. Tests of significance in factor analysis. British Journal of Psychology, Statistical Section, 3, 77-85.

Farrar, D. E. and R. R. Glauber, 1967. Multicollinearity in regression analysis: The problem revisited. Review of Economics and Statistics, 49, 92-107.

Shehata, E. A. E., 2012. FGTEST: Stata module to compute Farrar-Glauber Multicollinearity Chi2, F, t tests.

Wilks, S. S., 1932. Certain generalizations in the analysis of variance. Biometrika, 24, 477-494.

Wishart, J., 1928. The generalized product moment distribution in samples from a multivariate normal population. Biometrika, 20A, 32-52.

1. Hi Dave, as I recall, quite a few econometrics texts note that the assumption of fixed, rather than stochastic, variables/regressors is often made for convenience (of derivation and exposition) rather than because it has some substantive basis. Doesn't that affect your statement above about the Farrar and Glauber paper?

1. Hi - thanks for the comment. That's right. However, we wouldn't use OLS unless the random regressors were (at least asymptotically) uncorrelated with the errors. The FG tests are based on OLS. In addition, the FG tests require multivariate NORMALITY of the regressors. That's a really strong assumption. Finally, even if their assumptions were satisfied, all you get are tests relating to lack of orthogonality of the regressors in the POPULATION. They aren't (and cannot be) tests of a sample phenomenon - namely multicollinearity.

2. Why do you say that multicollinearity is only a sample phenomenon? Two independent variables can have multicollinearity in the population and we can test this using the sample data. If there is no multicollinearity in the population then there should be no multicollinearity in the 'random' sample.

3. I guess because that's what it is - by definition.

4. You mean two variables cannot be collinear in the population?

5. They could be - but that's not what the FG test is testing.

2. See Goldberger, A Course in Econometrics, 1991, pp. 248-50, for a pretty scathing critique of "testing" for multicollinearity (which he analogizes to "testing for small sample size," aka "micronumerosity")

3. It seems to me that the main issues here are loose language, and loose concepts. I say this as someone who has actually asked stuff similar to "Have you tested to see if multicollinearity is a problem for your results?" from time to time. It's true that generally I don't see what the usefulness of statistical hypothesis testing for multicollinearity would be, but I think that there are implied questions here from such a statement that are informative and worth looking at. For instance:

Q1: Is [phenomenon] easily explained by multicollinearity?

Such phenomenon include things like an algorithm taking a very long time to converge, or being very sensitive to initial parameters, or giving gigantic error estimates. Sometimes people throw their arms up at such things without asking why, or they go to explanations like 'this must mean there's a big noise component' without exhausting all the possibilities.

Q2: Is the model designed such that multicollinearity arises frequently? Is the study designed such that this can happen?

This can happen with models involving transformations with covariates. Sometimes, even with arbitrarily large samples, your model necessarily has a multicollinearity problem! Surveys can also have this issue.

Q3: Does your method survive multicollinearity well?

For example, working with principal components is one way to tackle the multicollinearity issue.

1. I have no problem with any of these points, or with reporting variance inflation factors, conditioning numbers, etc. I object to people thinking that they're testing for something, when they really aren't, and when they're not aware of the assumptions on which the procedure they're using are based.

2. Yes, I totally agree.

4. Here's a test: does Stata drop one of your independent variables when it estimates your model? If yes, then they are collinear. If not, then they are not.

1. That misses the point. That's just an ad hoc numerical response to PERFECT collinearity, caused by an unfortunate specification of the model.
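(What the software is reacting to in that case can be sketched with toy data, assumed purely for illustration: under perfect collinearity the X matrix is rank-deficient, so X'X is singular and the normal equations have no unique solution.)

```python
# Perfect collinearity: one column is an exact linear combination of others,
# so X is rank-deficient and X'X cannot be inverted. Toy data for illustration.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 2 * x1 - x2                       # exact linear combination of x1 and x2
X = np.column_stack([np.ones(50), x1, x2, x3])

print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
# rank < number of columns, so packages like Stata respond by dropping
# one of the offending columns rather than "testing" anything.
```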

5. I am doing sGMM; do I have to ensure that there is no multicollinearity among the explanatory variables, even though I am using instruments in this case?

1. There will always be some degree of multicollinearity. It would not be of high concern to me personally.

6. Thanks for the nice tutorial on the relationship between explanatory variables.

7. Hello Professor, do we worry about multicollinearity issue in ARDL estimation?

8. Dear Professor, I am using sGMM to investigate the effect of executive compensation (independent variable) on bank risk (dependent variable). In the model, I also use some other control variables. My problem is that the coefficients of the control variables are not consistent when I replace the compensation variable. For example, if I use Salary as the compensation variable I find a significant effect of bank capital; however, when Bonus is used as the compensation variable, the effect of bank capital is not significant. Is this caused by multicollinearity? Thank you in advance.

1. Mai - it's impossible to tell from the information you provide. It could be due to a multitude of possible reasons.

2. Would you mind if I send you through email more detail about the research model and database? Thank you so much.

3. Mai - sorry, but I just don't have time. Please see http://davegiles.blogspot.ca/2015/06/readers-forum-page.html