Wednesday, July 24, 2013

Information Criteria Unveiled

Most of you will have used, or at least encountered, various "information criteria" when estimating a regression model, an ARIMA model, or a VAR model. These criteria provide us with a way of comparing alternative model specifications, and selecting between them. 

They're not test statistics. Rather, they're minus twice the maximized value of the underlying log-likelihood function, adjusted by a "penalty factor" that depends on the number of parameters being estimated. The more parameters, the more complicated is the model, and the greater the penalty factor. For a given level of "fit", a more parsimonious model is rewarded more than a more complex model. Changing the exact form of the penalty factor gives rise to a different information criterion.

However, did you ever stop to ask "why are these called information criteria?" Did you realize that these criteria - which are, after all, statistics - have different properties when it comes to the probability that they will select the correct model specification? In this respect, they are typically biased, and some of them are even inconsistent.

This sounds like something that's worth knowing more about!

To begin with, let's consider the best known information criterion - Akaike's (1973) Information Criterion (AIC) - and use it to address the points raised above. Then we can look at various other common information criteria, and talk about their relative merits.

Suppose that we are estimating a model with k unknown parameters. Let θ be the (k x 1) parameter vector, and let y be the (n x 1) vector of random observations drawn from the population distribution that has a density function, p(y | θ). Then, the likelihood function is just this joint data density, viewed not as a function of y, but as a function of θ, given the sample. That is, L(θ | y) = p(y | θ).

Let θ* be the MLE of θ, and let l* = l(θ*) = log[L(θ | y)|θ*] be the maximized value of the log-likelihood function. Then AIC is usually defined as:

                    AIC = -2l*  + 2k .

Some authors (packages) scale both of the terms in the AIC formula, by dividing each of them by the sample size, n. With this scaling, the name AICk is sometimes used. Obviously, it won't matter whether you use AIC or AICk, as long as:  (i) you adopt a fixed convention when comparing different values of the information criterion; (ii) you compute them using the same sample of data; and (iii) you use them only for ranking competing models, and not as cardinal measures.

The "more likely" is the model, the smaller is AIC. The second term is the penalty factor. The more complex is the model, the bigger AIC will be. To use this criterion to select among alternative model specifications, we'd choose the model with the smallest AIC value.

Here's a very simple regression example. If the errors are normally distributed, then the OLS estimator of the coefficient vector is the same as the MLE of those parameters. (This is a sufficient, but not necessary condition.) Consider the following regression output:

In the EViews package, which I've used here, the information measures are scaled by dividing by the sample size, so the "Akaike info criterion" is actually AICk. The dependent variable is expenditure (price times quantity) on wine, and the regressors are the relative price of wine to spirits, and income, M. There's plenty amiss with this simple model, but for now let's just focus on two numbers in the output. The "log likelihood" value of -194.7502 is what I called l*, above; and the AICk value is 12.75808. Using our AICk formula above, and noting that n = 31, and k = 3, we have:

            AICk  = -2(-194.7502 / 31) + 2(3 / 31) = 12.75808.

If you've been following carefully, you might wonder why I've used k = 3, rather than k = 4 to allow for the unknown variance, σ², of the error term. In the case of a regression model, σ² always has to be estimated, so EViews has the convention that this parameter isn't counted, and "k" refers to the number of parameters associated with the regressors. This is not a common convention, and you should always be very careful if you're comparing AIC values from different econometrics packages!
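As a quick check of the arithmetic, here is a minimal Python sketch that reproduces the scaled criterion from the two numbers quoted above (the log-likelihood and the sample size). The function name `aic_scaled` is purely illustrative and is not something EViews provides. The second call shows what the value would be under the alternative convention that also counts the error variance as a parameter.

```python
def aic_scaled(loglik, n, k):
    """Scaled AIC: (-2*loglik + 2*k) / n."""
    return (-2.0 * loglik + 2.0 * k) / n

loglik = -194.7502   # maximized log-likelihood quoted in the text
n = 31               # sample size

# Convention 1: k counts only the regression coefficients.
aick_coefficients_only = aic_scaled(loglik, n, k=3)
# Convention 2 (assumed alternative): k also counts the error variance.
aick_with_variance = aic_scaled(loglik, n, k=4)

print(round(aick_coefficients_only, 5))
print(round(aick_with_variance, 5))
```

The two conventions differ by a constant 2/n for every regression model fitted to the same sample, so the ranking of models within one package is unaffected. The difference only bites when values from different packages are mixed.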

(Also, note that "the number of parameters associated with the regressors" is not necessarily the same as the number of regressors. For instance, think of the case of a regression model that's non-linear in the parameters.)

Let's consider a different specification for our "demand for wine" equation. I'm going to use the same dependent variable and sample. This model, and the one we've just seen, are "non-nested" - you can't get from one of them to the other simply by imposing restrictions on the coefficients of either model. Here it is:


Allowing for the fact that the reported p-values are for a two-sided alternative, all of the regressors are statistically significant, at least at the 10% level, and have the anticipated signs. However, the value of the DW statistic is still small enough to signal that the model's specification is suspect. Let's not go there right now, though.

In this case, the AICk value is 12.83302, which is greater than that for the first model. So, if we were using this information criterion as the sole basis for model selection, the first model would be preferred.
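The selection rule is just "take the minimum". A small sketch, using only the two scaled values quoted in the text (the dictionary labels are illustrative):

```python
# Scaled AIC values quoted in the text for the two wine-demand models.
aick = {"Model 1": 12.75808, "Model 2": 12.83302}
n = 31  # both models are estimated with the same sample

# Pick the model with the smallest criterion value.
best = min(aick, key=aick.get)

# Undo the 1/n scaling to express the gap on the usual AIC scale.
gap = n * (aick["Model 2"] - aick["Model 1"])

print(best)
print(round(gap, 3))
```

Only the ordering matters here. The size of the gap is not a test statistic and has no significance level attached to it.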

As an aside, you might note that AIC statistics are closely related to the likelihood ratio statistic associated with Model 1 and Model 2. In fact, the usual likelihood ratio test statistic is:

          LRT = -2(l1* - l2*) = (AIC1 - AIC2)  -  2(k1 - k2)   .

However, in general the distribution of this statistic is not known (even asymptotically), unless the two models are nested. In that case it's asymptotically chi-square distributed, if the restrictions defining the "nesting" are valid. We don't have this result to help us when we are comparing non-nested models.
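For completeness, the identity linking the likelihood ratio statistic to the two AIC values follows directly from the definition AIC = -2l* + 2k. A short derivation, in the notation used above:

```latex
\begin{align*}
\mathrm{AIC}_1 - \mathrm{AIC}_2
  &= \left(-2 l_1^* + 2 k_1\right) - \left(-2 l_2^* + 2 k_2\right) \\
  &= -2\left(l_1^* - l_2^*\right) + 2\left(k_1 - k_2\right), \\[4pt]
\text{so} \quad \mathrm{LRT} = -2\left(l_1^* - l_2^*\right)
  &= \left(\mathrm{AIC}_1 - \mathrm{AIC}_2\right) - 2\left(k_1 - k_2\right).
\end{align*}
```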

Before we proceed, it's worth emphasising once again that we're just ranking alternative model specifications here. We're not concerned with the raw values of the AIC statistics. These statistics can take any value, and can be of either sign, depending on the model and the data we're working with.

Now, we know what the AIC statistic is, and how it's defined. However, why is it termed an "information criterion"?

The word "information", here, refers to the Kullback and Leibler (1951) "information discrepancy" measure that's used in information theory. This theory was developed originally in the telecommunications literature, notably by Shannon (1948). The last line of this path-breaking paper is a prophetic gem: "(To be continued.)"!

(Notice, in the references below, where Akaike's 1973 paper was published.)

Now, in what sense is the AIC statistic an information criterion?

Information discrepancy, as the name suggests, measures the difference in informational content. Here, that difference, or distance, is taken between the expected values of the random vector (say, Y) when (i) Y is determined by the true data-generating process (DGP); and (ii) Y is determined by some candidate model. Minimizing this distance, when considering several candidate models, gets us "close" to the true DGP.

The only glitch with this idea is that the expectations of Y are generally unobservable. For instance, in the case of a standard linear regression model, E[Y] = Xβ, and β is unknown. So, these expectations need to be estimated. For a sample of size n, the information associated with Y is essentially given by the joint density function of its random elements. Viewed as a function of the parameters, this density is just the likelihood function. Not surprisingly, then, the AIC measure involves the log-likelihood function, evaluated at the estimated parameter values. Excellent discussions of the way in which the "penalty term" then enters the picture are given by Linhart and Zucchini (1986), and McQuarrie and Tsai (1998), for example.
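To make the idea of an information discrepancy concrete, here is a minimal sketch of the Kullback-Leibler measure for a discrete random variable, where it is the expected value (under the true distribution) of the log ratio of the true probabilities to the candidate model's probabilities. The three-outcome "true DGP" and the two candidate models are made-up numbers, chosen only for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler discrepancy: sum_i p_i * log(p_i / q_i), for discrete p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "true" DGP over three outcomes, and two hypothetical candidate models.
true_dgp = [0.5, 0.3, 0.2]
candidate_a = [0.45, 0.35, 0.20]   # close to the truth
candidate_b = [0.20, 0.30, 0.50]   # further from the truth

kl_a = kl_divergence(true_dgp, candidate_a)
kl_b = kl_divergence(true_dgp, candidate_b)

print(kl_a, kl_b)
```

The candidate with the smaller discrepancy is "closer" to the true DGP. In practice the true distribution is unknown, which is why the discrepancy has to be estimated from the sample via the maximized log-likelihood.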

This is all well and good, but does the AIC statistic have any desirable properties? Being a statistic, it has a sampling distribution, and the latter can be used to address questions of the type: "Is AIC an unbiased or a consistent model-selection criterion?" Essentially, this comes down to considering situations where we have several competing models, one of which is the "true" DGP. We then ask if choosing the model that minimizes AIC will lead us to select the correct model "on average"; or at least with a probability that approaches one as n becomes infinitely large.

Here's what we know about this aspect of AIC. First, it is an inconsistent model-selection criterion. Usually, inconsistency is seen as a fatal flaw when it comes to estimators or tests. Here, it is certainly not good news, but this result has to be tempered with the recognition that, in all likelihood, the set of competing models that we're considering won't include the "true" DGP.

AIC is also a biased model-selection criterion. Several studies, including that of Hurvich and Tsai (1989), have shown that the use of AIC tends to lead to an "over-fitted" model. In the case of a regression model, this would mean the retention of too many regressors; and in the case of a time-series model it would mean selecting a lag length that is longer than is optimal.

The problem of over-fitting has led several authors to suggest "corrected" AIC measures, such as the AICc measure proposed by Sugiura (1978) and Hurvich and Tsai (1989).  AICc out-performs AIC in finite samples, in terms of reducing the tendency to over-fit the model, but in large samples its properties are the same as those of AIC itself. So, AICc is also an inconsistent model-selection criterion.

Moving on, let's remember that information criteria have a "goodness-of-fit" component, based on a high log-likelihood value; and they have a "model complexity" factor, that penalizes according to the number of parameters being estimated. Parsimony is rewarded, ceteris paribus.

Let's consider some different penalty functions that are added to -2l* to get various information criteria. (Here, I'm not scaling everything by 1/n, even though this scaling is used in EViews for the HQ and SIC measures, as well as for AIC.)

The "corrected" AIC measure has been mentioned already. Some other commonly used information criteria are tabulated below, though the list is not exhaustive.

  Measure      Penalty                       Author(s)
  1. AIC       2k                            Akaike (1973)
  2. HQ        2k log[log(n)]                Hannan and Quinn (1979)
  3. SIC*      k log(n)                      Schwarz (1978)
  4. BIC*      k log(n)                      Akaike (1978)
  5. AICc      n(n + k - 1)/(n - k + 1)      Sugiura (1978); Hurvich and Tsai (1989)
* These two criteria were developed concurrently but independently, and are equivalent in terms of their properties.
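Here is a minimal sketch of how the AIC, HQ and SIC/BIC penalties in the table compare for a given n and k. The function names are illustrative, and AICc is left out to keep the sketch short. The example call uses the log-likelihood, sample size, and parameter count from the first wine-demand regression, with k counting the regression coefficients only.

```python
import math

def penalties(n, k):
    """Penalty terms added to -2*loglik (unscaled), for AIC, HQ and SIC/BIC."""
    return {
        "AIC": 2 * k,
        "HQ": 2 * k * math.log(math.log(n)),
        "SIC/BIC": k * math.log(n),
    }

def criteria(loglik, n, k):
    """Unscaled information criteria: -2*loglik + penalty."""
    return {name: -2.0 * loglik + pen for name, pen in penalties(n, k).items()}

# Values from the first wine-demand regression quoted earlier in the post.
pen = penalties(n=31, k=3)
ic = criteria(loglik=-194.7502, n=31, k=3)

for name in ic:
    print(name, round(pen[name], 3), round(ic[name], 3))
```

With n = 31 the penalties are already ordered AIC < HQ < SIC/BIC, and the gap widens as n grows because the AIC penalty doesn't depend on the sample size at all.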

In contrast to AIC, the SIC (BIC) and HQ measures are consistent model-selection criteria. The SIC/BIC criteria were each derived from a Bayesian perspective, and include a much stronger penalty for over-fitting the model than does AIC.

McQuarrie and Tsai (1998, pp. 36-43) provide detailed information about the probability of over-fitting a model when using the above (and some other) information criteria. Among their results:

  • The probability of over-fitting when using AIC decreases as n increases.
  • The probability of over-fitting when using AICc increases as n increases.
  • The probabilities of over-fitting when using SIC/BIC or HQ decrease as n increases, and decrease faster for SIC/BIC than for the HQ criterion.
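A stylized calculation helps to show where the consistency results come from. This is a toy setting of my own choosing, not one taken from McQuarrie and Tsai: the data are N(μ, 1) with known variance, the true model has μ = 0, and the only alternative is a model with μ estimated. In that case minus twice the log-likelihood difference is exactly chi-square with one degree of freedom, and a criterion over-fits whenever that statistic exceeds the extra penalty for one more parameter.

```python
import math

def prob_overfit(extra_penalty):
    """P(chi-square(1) > extra_penalty): chance of keeping one redundant parameter."""
    return math.erfc(math.sqrt(extra_penalty / 2.0))

def overfit_table(n):
    """Over-fitting probabilities for one extra parameter, by criterion."""
    return {
        "AIC": prob_overfit(2.0),                            # penalty 2k, k = 1
        "HQ": prob_overfit(2.0 * math.log(math.log(n))),     # penalty 2k log[log(n)]
        "SIC/BIC": prob_overfit(math.log(n)),                # penalty k log(n)
    }

for n in (50, 500, 5000, 50000):
    row = overfit_table(n)
    print(n, {name: round(p, 4) for name, p in row.items()})
```

In this toy case the AIC probability doesn't depend on n at all and stays a little under 16%, which is the sense in which AIC is inconsistent. The HQ and SIC/BIC probabilities shrink towards zero, with SIC/BIC shrinking faster. Because the variance is treated as known, the sketch says nothing about the finite-sample differences between AIC and AICc described in the first two bullet points.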

Some general points are worth keeping in mind.

First, if you're comparing information criteria values from different econometrics packages and/or different authors, be sure that you know what definitions have been used. For instance, does the package really report AIC, or is it in fact reporting AICc? Has the complete maximized log-likelihood function been used in the construction of the information criterion, or have constant terms, such as ones involving (2π), been "dropped"? This won't matter if you're comparing values from the same package, but it may be crucially important if you're comparing values from different packages!

Second, be careful not to compare the values of information criteria in situations where they're not really comparable. For instance, if you have two regression models that are estimated using the same sample of data, but where one has a dependent variable of y and the other has a dependent variable of log(y), direct comparisons of the AIC values aren't meaningful. (The same is true of coefficients of determination, of course.) Often, this problem is easily resolved by transforming the likelihood function for one of the dependent variables into one that is consistent with the units of the other dependent variable. You just need to take into account the Jacobian of the transformation from one density function to the other. Examples of this, in the context of systems of Engel curves, are given by Giles and Hampton (1985), for instance.
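For the log(y) versus y case, the Jacobian adjustment is simple. If z = log(y), the density of y is the density of z evaluated at log(y), multiplied by |dz/dy| = 1/y. So the log-likelihood expressed in the units of y equals the log-likelihood of the log(y) model minus the sum of the log(y) values. A minimal sketch follows. The data and the log-likelihood value are hypothetical, and the function name is illustrative.

```python
import math

def loglik_in_levels(loglik_log_model, y):
    """Re-express a log-likelihood for log(y) as a log-likelihood for y.

    With z = log(y), p_y(y) = p_z(log y) / y, so the Jacobian term
    subtracts sum(log(y_i)) from the log-likelihood.
    """
    return loglik_log_model - sum(math.log(yi) for yi in y)

# Hypothetical sample and hypothetical maximized log-likelihood for the log(y) model.
y = [12.0, 15.5, 9.8, 20.1, 17.3]
loglik_log_model = -1.25

adjusted = loglik_in_levels(loglik_log_model, y)
print(adjusted)
```

The adjusted value can then be used to construct an AIC that is comparable with the AIC from the model whose dependent variable is y, provided that both are computed with the same sample and the same parameter-counting convention.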

And a final comment. The acronym "AIC" is usually taken to mean "Akaike's Information Criterion". In fact, Hirotugu Akaike intended it to mean "An Information Criterion", as he mentions here!

Postscript - while I was in the process of preparing this post, Rob Hyndman posted a related piece on his Hyndsight blog. I urge you to read what he has to say on this topic.


References

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (eds.), 2nd. International Symposium on Information Theory. Akademia Kiado, Budapest, 267-281.

Akaike, H., 1978. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30, Part A,  9-14.

Giles, D. E. A. and P. Hampton, 1985. An Engel curve analysis of household expenditure in New Zealand. Economic Record, 61, 450-462.

Hannan, E. J. and B. G. Quinn, 1979. The determination of the order of an autoregression. Journal of the Royal Statistical Society, B, 41, 190-195.

Hurvich, C. M. and C-L. Tsai, 1989. Regression and time series model selection in small samples. Biometrika, 76, 297-307.

Linhart, H. and W. Zucchini, 1986. Model Selection. Wiley, New York.

McQuarrie, A. D. R. and C-L. Tsai, 1998. Regression and Time Series Model Selection. World Scientific, Singapore.

Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Shannon, C. E., 1948. A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.

Sugiura, N., 1978. Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics - Theory and Methods, 7, 13-26.


© 2013, David E. Giles

17 comments:

  2. Dave,

    you are getting lazy during your holidays.

    please note Canada is no longer a province of the USA.

    Keep it up

  3. Professor, can you explain what you mean when you say a model selection criterion is "biased"? What does it mean "it selects the correct model on average"? I cannot relate the concept of estimation bias to this. Any reference would be very helpful.

    Replies
    1. The term "biased" is used in various ways. Apart from its usual meaning in estimation, we say a test is biased if, at any point in the parameter space, its power falls below the assigned significance level.

      Here, a model-selection procedure is unbiased if, when used repeatedly, it selects the correct model on average. For example, using maximum adjusted R-squared as a model selection procedure is an unbiased selection process, in this sense. Of course, the correct model has to be among those being chosen between.

    2. Thanks, Professor. But can you be more explicit about "selecting the correct model on average"? Suppose I do 1000 MC experiments, if it selects an underfitting model 50% of the time, and an overfitting model 50% of the time, does it mean the criterion is unbiased? Or if it selects the correct model 50% of the time, and an underfitting model and an overfitting model each 25% of the time, does it mean it is unbiased? Or if it selects the correct model 90% of the time, and an overfitting model 10% of the time, does it mean it is biased?
      Thanks, Professor, for the clarification.

    3. I think you should take a look at the McQuarrie and Tsai book, or the paper by Hurvich and Tsai.

    4. Thanks, Professor. The bias they talked about (in Hurvich and Tsai's Biometrika paper) is the bias in AIC as an estimator of the KL information. Are you saying that since AICc is an unbiased (or bias-corrected) estimator of the KL information, it will "select the correct model on average"? Thanks again and please feel free to ignore my reply if this is too trivial.

    5. AICc uses a bias-corrected estimator of the KL divergence, in the sense you say, but it's still not fully unbiased. You'll see that I say in the post that AICc reduces the tendency to over-fit the model (relative to AIC). It doesn't fully solve the problem.

    6. Thanks, Professor. But could you please give a mathematical definition of "selecting the correct model on average"? (For example, a criterion is (weakly) consistent if Pr(correct model is selected) converges to 1) If any of the reference has such a definition, could you please point me to it? I am genuinely not getting what it means and eager to learn its formal definition. Thank you again.

    7. Fair enough! Here is what we usually mean. For simplicity, suppose we are using a criterion that involves calculating and ranking the values of a statistic, S, for two versions of the model, using the same random sample. Call these statistics (random variables) S1 and S2. Suppose, further, that the model-selection criterion is to select Model 1 if S1 < S2. Otherwise we select Model 2.

      If, when Model 1 is the correct model, E[S1] < E[S2] for all values of the parameters of the models, we say that the model selection criterion is "unbiased".

      Note that, as with the concept of "unbiasedness" in the context of estimation or hypothesis testing, the concept involves the POPULATION expectations of the statistics we're using.

      In the econometrics literature, I believe Theil was the first to use this notion in the following context. Note that if we select between two models by maximizing adjusted R-squared (with the same sample), then this is equivalent to choosing the model with the smaller estimator of the error variance, si^2=(e'e)/(n-ki); for i = 1, 2. It's easy to show that if Model 1 is the correct model, then E[s1^2]<E[s2^2]. So, using this criterion we select the correct model "on average", and we say that this model-selection criterion is "unbiased".

      Maybe I should do a really short blog post on this.

      Thanks for persisting!

    8. Thank you, Professor. I was afraid my persistence would have become annoying to you...
      This is one of those jargons that you don't quite get unless you see the formal definitions. Anyway, it appears that, perhaps just like estimation unbiasedness, this unbiasedness is a rather weak standard. I don't know if this is why I haven't seen this being studied in recent literature at all (or maybe I haven't looked hard enough).

  5. Nice post. For my part, I consider the fact that AIC is inconsistent a feature rather than a bug. What is and is not desirable in a criterion very much depends on how you plan to use it. But I would argue that in economics we're seldom interested in determining which of our candidate models is the truth. For one thing, as you point out above, it's pretty unlikely that any of our models is entirely correctly specified. But even more fundamentally, estimating the correct specification does not necessarily provide the best forecast or estimate of a parameter of interest: there's a bias-variance trade-off. If you're interested in estimators (or forecasts) with low risk, which I think is a much more typical situation in applied work, consistent model selection is a particularly bad idea. BIC, for example, has *unbounded* minimax risk. In contrast, efficient/conservative criteria such as AIC are much better-behaved in this regard. There's a good discussion of this in Chapter 4 of "Model Selection and Model Averaging" by Claeskens and Hjort (2008, Cambridge University Press). Larry Wasserman also has some recent discussion on this point:

    https://normaldeviate.wordpress.com/2013/07/27/the-steep-price-of-sparsity/

    As you point out, it's important to remember that AIC (like all other non-trivial information criteria) has its own sampling distribution. Among other things, this means that estimators associated with the selected model will *not* share the asymptotic risk properties of the selection criterion itself. Indeed, the post-selection estimator is a *randomly weighted average* of all the candidate estimators. Model selection certainly is a challenging econometric problem!


  6. "The probability of over-fitting when using AIC decreases as n increases.
    The probability of over-fitting when using AICc increases as n increases.
    The probabilities of over-fitting when using SIC/BIC or HQ decreases as n increases, and decreases faster for SIC/BIC than for the HQ criterion."

    Is it not the other way round with the first two criteria? Isn't AICC consistent?

    Replies
    1. No - it is correct as stated. I say clearly, and correctly, in the post (3 paragraphs above the numbered table), that AICc is inconsistent.

  7. Hello Professor!

    Is it something meaningful or wrong if one gets negative values for the information criteria like AIC, SIC/BIC or HQ?

    Regards,

    Raluca

    Replies
    1. Raluca - no, not at all. They can be positive or negative, depending on the problem and the sample values.
