Friday, July 18, 2014

Step-wise Regression

Some time ago, Haynes Goddard emailed me suggesting that I post something about step-wise regression. He also put me on to an interesting paper by Peter Flom and David Cassell.

Many statistical and econometrics packages include stepwise regression. I wish they didn't! Here's why.
Flom and Cassell list several "strikes" against step-wise regression:

1. R² values are biased high.
2. The F and χ² test statistics do not have the claimed distribution.
3. The standard errors of the parameter estimates are too small.
4. Consequently, the confidence intervals around the parameter estimates are too narrow.
5. p-values are too low, due to multiple comparisons, and are difficult to correct.
6. Parameter estimates are biased high in absolute value.
7. Collinearity problems are exacerbated.

I've quoted the authors directly - I'd have expressed some of these points a little differently, but I'm sure you get their drift.
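To make a couple of these points concrete, here is a minimal simulation sketch. It is not taken from Flom and Cassell's paper, and the design is purely an assumption for illustration: the dependent variable is pure noise, there are k = 20 unrelated candidate regressors, and only the first step of a forward selection is performed (keep the candidate with the largest |t|). The function names, sample size, and the approximate 5% critical value of 2.01 (48 degrees of freedom) are all hypothetical choices.

```python
import math
import random

def simple_ols(x, y):
    """Fit y = a + b*x + e by OLS; return (slope, t-statistic, R^2)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    rss = syy - b * sxy                     # residual sum of squares
    se = math.sqrt(rss / (n - 2) / sxx)     # usual OLS standard error of b
    return b, b / se, 1.0 - rss / syy

def first_forward_step(n, k, rng):
    """y is pure noise; keep the candidate regressor with the largest |t|."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best = None
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]   # unrelated to y by design
        fit = simple_ols(x, y)
        if best is None or abs(fit[1]) > abs(best[1]):
            best = fit
    return best

def simulate(reps=1000, n=50, k=20, crit=2.01, seed=12345):
    """Return (share of selected regressors that look 'significant', mean R^2)."""
    rng = random.Random(seed)
    rejections = 0
    r2_sum = 0.0
    for _ in range(reps):
        _, t, r2 = first_forward_step(n, k, rng)
        rejections += abs(t) > crit
        r2_sum += r2
    return rejections / reps, r2_sum / reps

if __name__ == "__main__":
    rate, mean_r2 = simulate()
    print(f"share of selected regressors 'significant' at 5%: {rate:.3f}")
    print(f"average R^2 of the selected one-regressor model:  {mean_r2:.3f}")
```

Under these assumptions no regressor belongs in the model, so a valid 5% test should reject about 5% of the time and the expected R² of a single pre-specified regressor is 1/(n-1), roughly 0.02. The selected regressor should do far "better" than that on both counts, purely as a result of the search.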

The crux of the matter is this. Step-wise regression involves a search for a model specification based on the apparent significance (or otherwise) of various covariates. It's an example of model simplification, rather than model specification. The end result may make little economic sense.

Searches of this type involve fundamental "pre-testing" issues, unless it happens to be the case that the successive tests that are used are all statistically independent. This independence is absent with conventional step-wise regression procedures. When we test the significance of one potential regressor (regardless of the outcome), this affects the properties of the next (and all subsequent) tests that we perform. The true significance levels of the subsequent tests are not what we nominally assign them to be. Moreover, the extent to which these values are distorted depends on the true (unknown) values of the parameters in a very complicated way.

In short, the sequential testing associated with step-wise regression is fatally flawed.

In addition, the properties of the estimates of the parameters are no longer what we might think they are. For example, after even a single pre-test, the OLS estimator loses its usual property of unbiasedness. And things just get worse and worse as we continue with more testing.
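The effect of a single pre-test can be sketched with a small Monte Carlo experiment. Everything in the design below is an assumption chosen for illustration: a two-regressor model with no intercept, true coefficients beta1 = 1 and beta2 = 0.3, a correlation of about 0.7 between the regressors, n = 50, and an approximate 5% critical value of 2.01. The pre-test keeps x2 only if its t-statistic is significant; the coefficient on x1 is then estimated from whichever model survives.

```python
import math
import random

def ols2(x1, x2, y):
    """OLS of y on (x1, x2), no intercept. Returns (b1, b2, se_b1, se_b2)."""
    n = len(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12             # determinant of X'X
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))
    s2 = rss / (n - 2)
    return b1, b2, math.sqrt(s2 * s22 / det), math.sqrt(s2 * s11 / det)

def ols1(x1, y):
    """OLS of y on x1 alone, no intercept. Returns (b1, se_b1)."""
    n = len(y)
    s11 = sum(a * a for a in x1)
    b1 = sum(a * c for a, c in zip(x1, y)) / s11
    rss = sum((c - b1 * a) ** 2 for a, c in zip(x1, y))
    return b1, math.sqrt(rss / (n - 1) / s11)

def simulate(reps=2000, n=50, beta1=1.0, beta2=0.3, rho=0.7, crit=2.01, seed=2014):
    """Return (mean pre-test estimate of beta1, mean unrestricted estimate,
    rejection rate of the TRUE hypothesis b1 = beta1 after the pre-test)."""
    rng = random.Random(seed)
    w = math.sqrt(1.0 - rho * rho)
    sum_pre = sum_unr = 0.0
    rejections = 0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + w * rng.gauss(0, 1) for a in x1]
        y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        b1_u, b2_u, se1_u, se2_u = ols2(x1, x2, y)
        if abs(b2_u / se2_u) > crit:        # pre-test keeps x2
            b1, se1 = b1_u, se1_u
        else:                               # pre-test drops x2
            b1, se1 = ols1(x1, y)
        sum_pre += b1
        sum_unr += b1_u
        rejections += abs((b1 - beta1) / se1) > crit
    return sum_pre / reps, sum_unr / reps, rejections / reps

if __name__ == "__main__":
    pre, unr, rej = simulate()
    print(f"mean of unrestricted OLS estimate of beta1: {unr:.3f}")
    print(f"mean of pre-test estimate of beta1:         {pre:.3f}")
    print(f"rejection rate of the true H0 (nominal 5%): {rej:.3f}")
```

In this assumed design the unrestricted estimator should average close to the true value of 1, while the pre-test estimator should be pulled away from it, because x2 is often (wrongly) dropped and its effect is then loaded on to x1. The follow-up t-test on the coefficient of x1 should also reject a true null hypothesis noticeably more often than the nominal 5%. How large these distortions are depends on beta2 and on the correlation between the regressors, which in practice are unknown.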

If you want a (now out-of-date) survey of the pre-testing literature, with an econometrics slant, then take a look at Giles and Giles (1993).

One package that I use a lot is EViews. Although it has a facility for both forward and reverse step-wise regression, the documentation includes the following statement:
"The p-values listed in the final regression output and all subsequent testing procedures do not account for the regressions that were run during the selection process. One should take care to interpret results accordingly.

Invalid inference is but one of the reasons that stepwise regression and other variable selection methods have a large number of critics amongst statisticians. Other problems include an upwardly biased final R-squared, possibly upwardly biased coefficient estimates, and narrow confidence intervals. It is also often pointed out that the selection methods themselves use statistics that do not account for the selection process."
I totally agree with all of this, and the team at EViews should be applauded for including this warning.

As econometricians, we're concerned with more than just a parsimonious model. Usually, we're very interested in the economic implications of our model specification and the quality of our parameter estimates.

Unfortunately, with the advent of "big data", and situations where there are far more potential covariates than there are data-points, techniques such as step-wise regression seem to be back in vogue again. Grrrrr!!!!!

References

Flom, P. L. and D. L. Cassell, 2007. Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use. NESUG 2007 Proceedings.

Giles, J. A. and D. E. A. Giles, 1993. Pre-test estimation and testing in econometrics: Recent developments. Journal of Economic Surveys, 7, 145-197.

1. There's some nice work by Leeb and Pötscher showing that not only do you get misleading results if you ignore the selection, but also that there isn't a (complete) fix:

"We consider the problem of estimating the conditional distribution of a post-model-selection estimator where the conditioning is on the selected model.... We show that it is impossible to estimate this distribution with reasonable accuracy even asymptotically."
http://projecteuclid.org/euclid.aos/1169571807

1. Thanks for this. Also, see my post coming up later today (21 July, 2014).

2. Stepwise regression (and other similar methods to reduce the dimensionality of one's data) was taught to me as a method exclusively for prediction purposes. As such, problems with inference on parameters are irrelevant. Additionally, for prediction purposes, one never evaluates the fit of the model using the same sample used to estimate the parameters.

1. Pre-testing leads to inferior predictions, as is well documented in the pre-test literature.

3. All model selection procedures, from stepwise regression to machine learning methods, including the general-to-specific methodology, have the same problems related to familywise error rates.

However, analysis of the results from various model selection procedures suggests that the false discovery rate is a more relevant and useful way of looking at model selection. In fact, it seems that familywise error rates are usually too conservative, especially in light of the incredible number of statistical analyses and tests that have been done in the literature on pretty much any topic.

While I agree that not all model selection procedures are great (some are surely more effective than others), their use is essential to the financial industry. They help reduce the need for "expert judgement", which was previously used to select models that gave the desired results, creating models that ignored key variables. This biased expert selection is also common in the economic literature.

Furthermore, model selection procedures are also easy to replicate, which simplifies model validation, a key new standard for statistical models in the financial industry.

I myself often prefer solutions that average the results of multiple models and methodologies to reduce model risk. A single model, which is just a simplified version of the world, cannot be expected to hold the entire truth about an issue.

1. Although I agree with your blog, I wonder if it's also useful to differentiate between selection strategies (some are probably worse than others?). I've seen many stepwise procedures that use AIC or BIC during the search process. Have you looked at the R package "bestglm", for example?

2. A lot depends on what the intended end-use of the model is. In econometrics, it isn't simply prediction. I'm all in favour of model-averaging - especially BMA - as I've mentioned in earlier posts.

3. Keep in mind that AIC, BIC, etc. are also statistics. See my earlier posts on this point. Their sampling properties are also distorted by pre-testing.

4. Anonymous, can you please tell me some references where stepwise procedures use AIC or BIC during the search process? I have been looking for such papers for a long time with no luck. Thanks a lot!

5. I don't have any - maybe some readers do.

4. It seems that many of these comments are applicable to the lasso as well.

5. Hendry and Krolzig (2005) showed that their improved Gets algorithms with bias correction can overcome the biases you noted. I'd be curious if you have any reactions to that particular work/program.

http://onlinelibrary.wiley.com/doi/10.1111/j.0013-0133.2005.00979.x/abstract

1. Thanks - I'll be interested to check this out.