Wednesday, March 9, 2011

The Second-Longest Word in the Econometrics Dictionary

The other day I paid a surprise visit to my University's library - it was an honest-to-goodness physical visit that involved putting one foot in front of the other, and straining those remaining neurons that deal with my long-term memory - not one of the virtual on-line visits that now pass for the real thing. Back in the day, when all the world and most of you were young and beautiful, an occasional visit to the library was actually quite therapeutic - a nice break from all of those interruptions in the office. It was great to be back, even though I had to focus hard to avoid tripping over old rabbit burrows while en route, and I was somewhat confused by the fact that my favourite large red book had apparently been moved from the very end of the 3rd shelf in the 4th row from the left on Level 2, sometime since September of 2002.

So, what drove me out into the bitter wasteland of Gordon Head in the dead of March? Well, I wanted to take a look at a dictionary that (shame on me) I thought I didn't own. Did you know that there really is A Dictionary of Econometrics (Darnell, 1994)? It's a very nice volume, and - for the record - the longest word in this dictionary is "heteroskedasticity", with 18 letters. Yes, this word should indeed be spelled with a "k", and not another "c" (see McCulloch, 1985). The second-longest word in that dictionary, with 17 letters, is - you guessed it - "multicollinearity". Quite a mouthful, I know, but if you've read this far then I assume that I don't need to bore you by explaining what this blockbuster of a word means.

Some years ago, a very perceptive colleague of mine offered the opinion that the quality of an introductory or intermediate econometrics textbook is inversely proportional to the number of pages that it devotes to discussing "multicollinearity". (I also recall my equally perceptive high school math teacher commenting that the number of marks you score in a geography exam is directly proportional to the number of different coloured pencils that you own. Not that I'd ever repeat that outrageous suggestion!) Now, I certainly don't want to pick on any particular authors of undergraduate econometrics textbooks. After all, there are some wonderful texts out there, and writing one of them requires enormous effort and a special talent that I know that I don't have myself. However, the place of multicollinearity in our teaching of econometrics is something that has to be taken seriously, and warrants our scrutiny.

I'm not sure how you actually measure the "quality" of a textbook - it's a fairly subjective concept, but I guess we could conduct an opinion poll. I also hasten to add that I don't have a scrap of evidence regarding the sales figures for different econometrics texts. That's proprietary information that no sane publisher is going to provide just so that I can blog about it. In any case, quality and sales are not the same thing, and I can think of some instances where they are actually negatively correlated, at least in my view. However, the amount of attention that has been paid to the concept of "multicollinearity" in textbooks over the years provides us with an interesting case-study of fashions and priorities in econometrics, and the way that the discipline has developed over the past half century or so. When I recalled my colleague's comment on the subject, my prior belief was that the path for multicollinearity had been pretty much downhill since Ragnar Frisch (1934) coined the term. We'll see in due course if I was right, once we've let the data have their say on the matter.

Before proceeding, I should make one thing clear. Frisch's contributions to the foundations of econometrics were absolutely fundamental (Arrow, 1960). His work on what he called "confluence analysis" dealt with the essential pillars of econometrics - simultaneity, identification, and measurement error. Most of Frisch's work on confluence analysis developed into what we now deal with under the headings of "identification", "errors-in-variables", and "endogeneity". The other (smaller) part is what we now think of as "multicollinearity". Interestingly, there's also a very important connection between Frisch's confluence analysis and the modern concept of "cointegration" in non-stationary time-series analysis - see Hendry and Morgan (1989).

Clearly, if we're going to look at the decline and fall of multicollinearity, it's not going to be good enough to simply count the number of pages that deal with this topic, book by book. This information has to be put into perspective by allowing for both the length of the book in question and the size of its pages. So, I'll focus on the percentage of pages in a book that deal with this topic. The majority of the observations in the sample that I've compiled relate to texts written by North American (U.S. and Canadian) authors, but there are also several written by European and British authors. All of the books selected for my sample relate to undergraduate-level econometrics, and books that are explicitly marketed as being "applied" are, for better or worse, excluded from the sample. (For this reason alone, a great book by my colleague, Ken Stewart, is not included.)

Many of the best known econometrics texts have run to several editions; this being a testimony to their impact and endurance. Different editions of several of these hardy perennials are included in the sample. It may seem as if there is some "double-counting" going on here. However, the inclusion of books that have made it to multiple editions provides the opportunity for us to track the amount of attention paid to the topic of multicollinearity over time, while automatically controlling for a variety of other factors. My data are available on the Data page that forms part of this blog, and the EViews workfile that can be used to replicate the following results is linked through the accompanying Code page. The sample size is n = 48, and the sample values for the fraction of a book's pages that are devoted to a core discussion of multicollinearity (MULTI) range from 0.16% to 5.07%, with a mean and median that are each 2.1%.

Figure 1 provides a scatter-plot of MULTI against the year of publication, YEAR. We can see that there does indeed seem to be a negative (partial) relationship between these two variables. However, we need to control for other factors, and we also need to see if this relationship is actually statistically significant.

Now, let's consider a really simple model that explains MULTI as a function of YEAR, the edition number of the book (ED), and a dummy variable (DNA) for North American authorship. Those of you who are sharp-eyed will have noticed already that the dependent variable I am going to use (MULTI) is bounded, in the population, between zero and 100. Usually, this would bring into question the assumption that the errors of the model are normally distributed. However, nothing that I'm going to do here will actually require or use the latter assumption. Incidentally, transforming the data and fitting a log-log model might seem like a smart idea, because then the dependent variable would be bounded between -∞ and 4.605 (= ln(100)). However, if you play with the EViews code that I've provided you'll find that if you fit a log-log model, then various issues arise relating to heteroskedasticity of the errors. I presume that we don't want to have to deal with both of the two longest words in A Dictionary of Econometrics at the same time!

When I estimated my basic model, the coefficient of the DNA dummy variable was statistically insignificant (p = 0.892). Re-estimating the model without the dummy variable I got the following results:

MULTI = 62.274 - 0.031 YEAR + 0.429 ED + residual ;   R² = 0.133
       (29.378)  (0.015)      (0.145)

White's heteroskedasticity-consistent standard errors appear in parentheses below the estimated OLS regression coefficients, and the following results were obtained when various Lagrange multiplier (nR²) tests for homoskedasticity of the errors were applied to the residuals:

Breusch-Pagan-Godfrey: 1.704 (p = 0.427)
Harvey: 1.364 (p = 0.506)
Glejser: 1.522 (p = 0.467)
White (with cross-products): 7.412 (p = 0.192)
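If you'd like to check those p-values without firing up EViews, here is a minimal Python sketch. It assumes (this is my reading, not something stated above) that each statistic is referred to a chi-square distribution: 2 degrees of freedom for the first three tests (one per regressor, YEAR and ED), and 5 for White's test with cross-products (the two regressors, their squares, and their product). The function name `chi2_sf` is purely illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_j x^(j - 1/2) / (2j - 1)!!
    total = 0.0
    term = math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

# Statistics as reported in the post; the degrees of freedom are assumptions.
reported = [
    ("Breusch-Pagan-Godfrey", 1.704, 2),
    ("Harvey", 1.364, 2),
    ("Glejser", 1.522, 2),
    ("White (with cross-products)", 7.412, 5),
]
for name, stat, df in reported:
    print(f"{name}: LM = {stat}, df = {df}, p = {chi2_sf(stat, df):.3f}")
```

Under those assumed degrees of freedom the sketch reproduces the reported p-values to three decimal places, and none of the four statistics comes close to rejecting homoskedasticity at conventional significance levels.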

Good news! As a brief nod in the direction of the dreaded multicollinearity issue I should note that the (centered) variance inflation factors for YEAR and ED were 1.362 and 1.392 respectively when DNA was included as a regressor, and the latter decreased to 1.362 when DNA was omitted. Variance inflation factors lie in the range [1, ∞), by construction. The greater the impact of multicollinearity on the variance of an estimated regression coefficient, the greater is the associated variance inflation factor. So, these numbers are also most encouraging.
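To put those numbers in context, recall that the centered variance inflation factor for a regressor is 1 / (1 - R²), where R² comes from regressing that variable on the other regressors (plus an intercept). With only two regressors that auxiliary R² is just their squared sample correlation, so both regressors share the same factor. The following Python sketch illustrates the arithmetic; the function names are hypothetical and the little data set in the usage lines is made up, not taken from my sample.

```python
import math

def vif_from_r2(r2):
    """Variance inflation factor implied by an auxiliary-regression R-squared."""
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    """Auxiliary-regression R-squared implied by a variance inflation factor."""
    return 1.0 - 1.0 / vif

def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def two_regressor_vif(x1, x2):
    """Common VIF of x1 and x2 in a model with an intercept and just these two regressors."""
    return vif_from_r2(pearson_r(x1, x2) ** 2)

# The reported VIF of 1.362 (DNA omitted) implies this auxiliary R-squared and |correlation|.
implied_r2 = r2_from_vif(1.362)
print(round(implied_r2, 3), round(math.sqrt(implied_r2), 3))

# Made-up illustration: two regressors with a sample correlation of 0.8.
print(round(two_regressor_vif([1, 2, 3, 4], [1, 3, 2, 4]), 3))
```

In other words, a factor of 1.362 corresponds to an auxiliary R² of only about 0.27, which is a long way from the near-perfect fit that would signal trouble.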

Now, how might we interpret the results of this estimated model? The significantly negative slope coefficient for YEAR accords with my prior belief - other things being equal, over the years (1963 to 2011) relatively less book space has been allocated to discussing multicollinearity. The significantly positive coefficient for ED may indicate that, notwithstanding this overall declining trend, successful authors tend to stick with what sells (and what requires fewer changes). You can probably come up with other ways to spin these results.
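The "significantly" in that paragraph rests on the White standard errors reported with the estimated equation: the implied ratios are roughly -0.031/0.015 = -2.1 for YEAR and 0.429/0.145 = 3.0 for ED. For anyone curious about the mechanics, here is a minimal pure-Python sketch of OLS with the basic (HC0) form of White's covariance estimator, (X'X)⁻¹ [Σ e²ᵢ xᵢxᵢ'] (X'X)⁻¹. It is an illustration only: the function names are hypothetical, the data are made up rather than drawn from my sample, and packages such as EViews may apply a finite-sample scaling that this sketch omits.

```python
import math

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfect collinearity?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_white(X, y):
    """Return (beta, se): OLS coefficients and White HC0 standard errors."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    bread = invert(xtx)                                   # (X'X)^-1
    beta = [sum(bread[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    # "Meat" of the sandwich: sum of e_i^2 * x_i x_i'
    meat = [[sum(resid[i] ** 2 * X[i][a] * X[i][b] for i in range(n))
             for b in range(k)] for a in range(k)]
    bm = [[sum(bread[a][c] * meat[c][b] for c in range(k)) for b in range(k)] for a in range(k)]
    cov = [[sum(bm[a][c] * bread[c][b] for c in range(k)) for b in range(k)] for a in range(k)]
    se = [math.sqrt(max(cov[a][a], 0.0)) for a in range(k)]
    return beta, se

# Made-up data: years since an arbitrary base year, edition number, and page share (%).
years = [0, 5, 10, 15, 20, 25, 30, 35]
editions = [1, 1, 2, 1, 3, 2, 4, 1]
share = [3.1, 2.8, 3.0, 2.2, 3.1, 2.0, 2.9, 1.2]
X = [[1.0, float(yr), float(ed)] for yr, ed in zip(years, editions)]
beta, se = ols_white(X, share)
for name, b, s in zip(["const", "YEAR", "ED"], beta, se):
    print(f"{name}: coef = {b:.4f}, White s.e. = {s:.4f}, ratio = {b / s:.2f}")
```

Measuring the year from a base year, as the made-up data do, keeps X'X well conditioned for this hand-rolled inversion; a proper linear algebra library would not need that precaution.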

When will we cease to see a formal discussion of multicollinearity in our undergraduate econometrics texts? You're going to have to be rather patient because my estimated model predicts that books in their first (or only) edition will devote 0.0014% of their page-space to this topic in April of 2046, before succumbing entirely. With luck, I'll live to celebrate that day, but I'm not sure. Halfway through writing this I discovered that I have a copy of A Dictionary of Econometrics on my shelf after all - so much for my ailing neurons!
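A word of warning if you try to reproduce that date from the coefficients as printed in this post: they are rounded to three decimal places, and the zero-crossing year is very sensitive to the YEAR coefficient. The Python sketch below (the function name is hypothetical) solves 0 = intercept + b·YEAR + c·ED for YEAR with ED = 1, first with the printed coefficients and then with the YEAR coefficient at either end of the interval that rounds to -0.031.

```python
def zero_crossing_year(intercept, year_coef, ed_coef, edition=1):
    """Year at which intercept + year_coef*YEAR + ed_coef*edition equals zero."""
    return -(intercept + ed_coef * edition) / year_coef

# Coefficients as printed in the post (rounded to three decimals).
printed = zero_crossing_year(62.274, -0.031, 0.429)

# Any YEAR coefficient between -0.0315 and -0.0305 would be printed as -0.031.
earliest = zero_crossing_year(62.274, -0.0315, 0.429)
latest = zero_crossing_year(62.274, -0.0305, 0.429)

print(round(printed, 1), round(earliest, 1), round(latest, 1))
```

The printed coefficients put the crossing in about 2023, while the rounding interval alone spans roughly 1991 to 2056. A date in 2046 sits comfortably inside that range, so the figure quoted above presumably comes from the unrounded estimates in the workfile rather than from the three-decimal values shown here.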

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.

References

Arrow, K. J. (1960). The work of Ragnar Frisch, econometrician. Econometrica, 28, 175-192.

Darnell, A. C. (1994). A Dictionary of Econometrics. Edward Elgar, Cheltenham.

Frisch, R. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Publication no. 5, Oslo University Institute of Economics, Oslo.

Hendry, D. F. and M. S. Morgan (1989). A re-analysis of confluence analysis. Oxford Economic Papers, 41, 35-52.

McCulloch, J. H. (1985). On heteros*edasticity. Econometrica, 53, 403.