Sunday, October 27, 2019

Reporting an R-Squared Measure for Count Data Models

This post was prompted by an email query that I received some time ago from a reader of this blog. I thought that a more "expansive" response might be of interest to other readers...

In spite of its many limitations, it's standard practice to include the value of the coefficient of determination (R2) - or its "adjusted" counterpart - when reporting the results of a least squares regression. Personally, I think that R2 is one of the least important statistics to include in our results, but we all do it. (See this previous post.)

If the regression model in question is linear (in the parameters) and includes an intercept, and if the parameters are estimated by Ordinary Least Squares (OLS), then R2 has a number of well-known properties. These include:
  1. 0 ≤ R2 ≤ 1.
  2. The value of R2 cannot decrease if we add regressors to the model.
  3. The value of R2 is the same, whether we define this measure as the ratio of the "explained sum of squares" to the "total sum of squares" (RE2); or as one minus the ratio of the "residual sum of squares" to the "total sum of squares" (RR2).
  4. There is a correspondence between R2 and a significance test on all slope parameters; and there is a correspondence between changes in (the adjusted) R2 as regressors are added, and significance tests on the added regressors' coefficients.   (See here and here.)
  5. R2 has an interpretation in terms of information content of the data.  
  6. R2 is the square of the (Pearson) correlation (RC2) between actual and "fitted" values of the model's dependent variable. 
However, as soon as we're dealing with a model that excludes an intercept or is non-linear in the parameters, or we use an estimator other than OLS, none of the above properties are guaranteed.

For example, when reporting a linear model that's been estimated by Instrumental Variables, we get different R2 values depending on which of the two  definitions noted in property 3 above is adopted. Similarly, when estimating Logit and Probit models (for instance), most econometrics packages report several "pseudo-R2" statistics, because there's no single measure that has all of the desirable features that we're used to in the linear model/OLS case.

So-called "count" data arise frequently in empirical economics. These are data that take values that are only non-negative integers, namely 0, 1, 2, 3, 4, ........ Models for such data are often based on the Poisson or negative binomial distributions, although other distributions may also be used. Regressors enter the model by equating the mean of the chosen distribution to a positive function of these variables and their coefficients.

For instance, if the yi data (i = 1, 2, ...., n) are being modelled using a Poisson distribution with a mean of μ, then we typically assign μi = exp[xi'β], using familiar regression notation. The resulting non-linear model is then estimated by MLE (or quasi-MLE).
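To make this concrete, here is a minimal sketch (in Python, using statsmodels with simulated data, so all variable names are purely illustrative) of fitting such a Poisson regression by MLE:

```python
# A minimal sketch (simulated data, illustrative names) of estimating a
# Poisson regression in which the conditional mean is mu_i = exp(x_i'beta).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)                       # intercept plus one regressor
y = rng.poisson(np.exp(0.5 + 0.8 * x))       # count-valued dependent variable

poisson_fit = sm.Poisson(y, X).fit(disp=0)   # MLE of beta
mu_hat = poisson_fit.predict(X)              # fitted means, mu_i* = exp(x_i'beta*)
print(poisson_fit.params)
```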

What's a sensible way of reporting an R2 measure for an estimated Poisson regression?

As with the Logit-Probit case noted above, several possibilities suggest themselves. However, unlike that other case, when modelling "count" data there is actually one definition of R2 that really stands out as the obvious choice.

What is it?

Before answering this question, let's look at how RR2, RE2, and RC2 behave when applied in the context of Poisson, or negative binomial, regression. Some key facts include:
  • The three measures will generally differ in value from one another.
  • We still have 0 ≤ RC2 ≤ 1. However, although RR2 ≤ 1 it can be negative (even if an intercept is included in the model); and although RE2 ≥ 0 it can be greater than one (even with an intercept).
  • All three measures can decrease as regressors are added to the model. 
When we compare these results with the six properties noted above for the OLS case, they suggest that these R2 measures are probably best avoided with count data models. Interestingly, it's RR2 that's reported as a matter of course by the EViews package. Stata, on the other hand, reports McFadden's "pseudo-R2" for these models, but its properties are no better.
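For readers who want to see this behaviour for themselves, the following continuation of the earlier sketch computes RR2, RE2, and RC2 from the fitted Poisson means (again, just an illustration with simulated data):

```python
# The three "OLS-style" R-squared measures, computed from y and the fitted
# means mu_hat obtained above. In general the three values differ, and
# R_R^2 can even be negative if the model fits poorly.
ybar = y.mean()
rss = np.sum((y - mu_hat) ** 2)             # "residual" sum of squares
ess = np.sum((mu_hat - ybar) ** 2)          # "explained" sum of squares
tss = np.sum((y - ybar) ** 2)               # total sum of squares

R2_R = 1.0 - rss / tss                      # one minus RSS/TSS
R2_E = ess / tss                            # ESS/TSS
R2_C = np.corrcoef(y, mu_hat)[0, 1] ** 2    # squared correlation of y and fitted values
print(R2_R, R2_E, R2_C)
```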

Cameron and Windmeijer (1996) effectively answer the question that I posed above.

They consider various R2-type measures for count data models. These measures differ primarily in the type of residuals (from the estimated model) that are used in their construction. As in the case of a linear regression, the usual, or "raw", residuals are the differences between the actual yi values and their "predicted" mean values. That is, they're of the form (yi - μi*), where μi* = exp[xi'β*], and β* is the MLE of the β vector. These residuals give us RR2, noted above.

In regression analysis in general, there are actually lots of different forms of residuals that can be constructed, and these can be useful in various situations - especially with generalized linear models (of which the Poisson count model is an example). Some examples include the Pearson (standardized) residuals and the so-called "deviance" residuals. (For more on the notion of "deviance" and goodness-of-fit, see this post.)
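As a brief aside, these alternative residuals are readily available in standard software. For example, statsmodels' GLM interface exposes them directly; a sketch, re-using the simulated y and X from above:

```python
# Pearson and deviance residuals from a Poisson GLM fit.
glm_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
pearson_resid = glm_fit.resid_pearson     # (y_i - mu_i*)/sqrt(mu_i*) for the Poisson family
deviance_resid = glm_fit.resid_deviance   # the deviance residuals discussed below
```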

Cameron and Windmeijer (1996) consider the properties of R2 measures for Poisson and negative binomial models based on both of these other types of residuals, as well as on the "raw" residuals. (Cameron and Windmeijer (1997) extend these results to a variety of other non-linear models.)

They make a convincing case for constructing an R2 measure using the deviance residuals, when working with a Poisson regression model or the negative binomial (NegBin2) model.

(As an aside, when the model is linear and we use OLS, the deviance residuals are just the usual residuals.)

For the Poisson model, the ith deviance residual is defined as

di = sign(yi - μi*)[2{yilog(yi / μi*) - (yi - μi*)}]½       ;     i = 1, 2, ...., n

and the deviance R2 for that model is defined as:

RD,P2 = 1 - Σ{yilog(yi / μi*) - (yi - μi*)} / Σ{yilog(yi / ybar)},

where here and below all summations are for i = 1, 2, ...., n.

If the model includes an intercept, then this formula simplifies to:

RD,P2 = 1 - Σ{yilog(yi / μi*)} / Σ{yilog(yi / ybar)}.

(Note: if yi = 0, then yilog(yi) = 0. In this case, di = - [2μi*]½.)

Importantly, RD,P2 satisfies the properties 1 to 5 noted earlier.
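Here is a sketch of how these quantities can be computed directly from the formulas above (re-using y, mu_hat, and ybar from the earlier simulated example; scipy's xlogy handles the yi = 0 convention automatically):

```python
# Deviance residuals and the deviance R-squared for the Poisson model.
from scipy.special import xlogy            # xlogy(a, b) = a*log(b), with 0*log(0) = 0

dev_terms = xlogy(y, y / mu_hat) - (y - mu_hat)
d = np.sign(y - mu_hat) * np.sqrt(2.0 * dev_terms)            # deviance residuals

R2_D_P = 1.0 - np.sum(dev_terms) / np.sum(xlogy(y, y / ybar))
print(R2_D_P)
```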

In the case of the NegBin2 model, the corresponding R2 takes the form:

RD,NB2 = 1 - (A / B) ,

where

A = Σ{yilog(yi / μi*) - (yi + 1/α*)log[(yi + 1/α*) / (μi* + 1/α*)]}

and

B = Σ{yilog(yi / ybar) - (yi + 1/α*)log[(yi + 1/α*) / (ybar + 1/α*)]}.

("ybar" is the sample average of the yi values;  and α* is the MLE of the dispersion parameter for the NegBin2 distribution.)

The RD,NB2 goodness-of-fit measure satisfies properties 1, 3 and 4 noted earlier.
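And, for completeness, a corresponding sketch for the NegBin2 case. Here I use statsmodels' NB2 model, for which (as an assumption about the package's parameter layout worth checking) the estimated dispersion parameter appears as the last element of the parameter vector; again, purely illustrative:

```python
# Deviance R-squared for the NegBin2 model, following A and B above.
nb2_fit = sm.NegativeBinomial(y, X, loglike_method='nb2').fit(disp=0)
mu_nb = nb2_fit.predict(X)                 # fitted means
alpha_hat = nb2_fit.params[-1]             # MLE of the dispersion parameter (assumed layout)
a_inv = 1.0 / alpha_hat

A = np.sum(xlogy(y, y / mu_nb)
           - (y + a_inv) * np.log((y + a_inv) / (mu_nb + a_inv)))
B = np.sum(xlogy(y, y / ybar)
           - (y + a_inv) * np.log((y + a_inv) / (ybar + a_inv)))
R2_D_NB2 = 1.0 - A / B
print(R2_D_NB2)
```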

So, when it comes to reporting an R2 for count data models, the usual such measure - based on the "raw" residuals - is generally a very poor choice. Of the other options that are available, the R2 measures constructed using the so-called "deviance residuals" stand out as excellent contenders.


References

Cameron, A. C. & F. A. C. Windmeijer, 1996. R-squared measures for count data regression models with applications to health-care utilization. Journal of Business and Economic Statistics, 14, 209-220.

Cameron, A. C. & F. A. C. Windmeijer, 1997. An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics, 77, 329-342.

© 2019, David E. Giles

2 comments:

  1. Thanks for the informative post Professor.
    I have a question about the allocation of R² among regressors. Some say that there is little point in doing so, but others say the opposite and have developed algorithms for this purpose.
    What do you think about this topic Professor?

    Replies
    1. R-squared is a positive monotonic function of the usual F-statistic for testing that all of the slope coefficients are zero. In the same way, you can "allocate" R-squared so that its parts are expressed in terms of similar functions of the F-statistics for testing that different sub-sets of the coefficients are jointly zero. So, there's really nothing to be added by such an "allocation". Just test the sub-set hypotheses that are of economic interest.

