Wednesday, October 30, 2019

Everything's Significant When You Have Lots of Data

Well........, not really!

It might seem that way on the face of it, but that's because you're probably using a totally inappropriate measure of what's (statistically) significant, and what's not.

I talked a bit about this issue in a previous post, where I said:
"Granger (1998, 2003) has reminded us that if the sample size is sufficiently large, then it's virtually impossible not to reject almost any hypothesis. So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this really bad news. Much, much, smaller p-values are needed before we get all excited about 'statistically significant' results when the sample size is in the thousands, or even bigger."
This general point, namely that our chosen significance level should be decreased as the sample size grows, is pretty well understood by most statisticians and econometricians. (For example, see Good, 1982.) However, it's usually ignored by the authors of empirical economics studies based on samples of thousands (or more) observations. Moreover, a lot of practitioners seem to be unsure of just how much they should revise their significance levels (or re-interpret their p-values) in such circumstances.

There's really no excuse for this, because there are some well-established guidelines to help us. In fact, as we'll see, some of them have been around since at least the 1970's.

Let's take a quick look at this, because it's something that all students need to be made aware of as we work more and more with "big data". Students certainly won't gain this awareness by looking at the interpretation of the results in the vast majority of empirical economics papers that use even sort-of-large samples!
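To make the point concrete, here's a minimal simulation sketch of my own (the tiny "true" slope of 0.01, the noise level, and the sample sizes are all just illustrative assumptions, not anything from the post above). It fits a simple regression in which the slope is economically negligible, and shows how the conventional p-value on that slope collapses towards zero as the sample size grows:

    import numpy as np
    import statsmodels.api as sm

    # Illustrative only: a "true" slope of 0.01 is economically negligible,
    # yet its p-value shrinks dramatically as the sample size grows.
    rng = np.random.default_rng(42)
    true_slope = 0.01                      # assumed, trivially small effect
    for n in (100, 10_000, 1_000_000):
        x = rng.normal(size=n)
        y = 1.0 + true_slope * x + rng.normal(size=n)   # noise std. dev. = 1
        fit = sm.OLS(y, sm.add_constant(x)).fit()
        print(f"n = {n:>9,}   slope p-value = {fit.pvalues[1]:.4g}")

At n = 1,000,000 the slope will typically be "significant" at any conventional level, even though it's of no practical importance whatsoever.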

Friday, August 2, 2019

Suggested Reading for August

Here are my suggestions for this month:
  • Bun, M. J. G. & T. D. Harrison, 2019. OLS and IV estimation of regression models including endogenous interaction terms. Econometric Reviews, 38, 814-827.
  • Dufour, J-M., E. Flachaire, & L. Khalaf, 2019. Permutation tests for comparing inequality measures. Journal of Business and Economic Statistics, 37, 457-470.
  • Jiao, X. & F. Pretis, 2018. Testing the presence of outliers in regression models. Available at SSRN: https://ssrn.com/abstract=3217213.
  • Stanton, J. M., 2001. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education, 9, 1-13.
  • Trafimow, D., 2019. A frequentist alternative to significance testing, p-values and confidence intervals. Econometrics, 7, 26.
© 2019, David E. Giles

Monday, April 1, 2019

Some April Reading for Econometricians

Here are my suggestions for this month:
  • Hyndman, R. J., 2019. A brief history of forecasting competitions. Working Paper 03/19, Department of Econometrics and Business Statistics, Monash University.
  • Kuffner, T. A. & S. G. Walker, 2019. Why are p-values controversial? American Statistician, 73, 1-3.
  • Sargan, J. D., 1958. The estimation of economic relationships using instrumental variables. Econometrica, 26, 393-415. (Read for free online.)
  • Sokal, A. D., 1996. Transgressing the boundaries: Towards a transformative hermeneutics of quantum gravity. Social Text, 46/47, 217-252.
  • Zeng, G. & Zeng, E., 2019. On the relationship between multicollinearity and separation in logistic regression. Communications in Statistics - Simulation and Computation, published online.
  • Zhang, X., S. Paul, & Y-G. Yang, 2019. Small sample bias correction or bias reduction? Communications in Statistics - Simulation and Computation, published online.
© 2019, David E. Giles

Thursday, March 21, 2019

A World Beyond p < 0.05


The American Statistician has just released a special supplementary issue on exactly this theme. The entire issue is open-access. In addition to an excellent editorial, Moving to a World Beyond "p < 0.05" (by Ronald Wasserstein, Allen Schirm, and Nicole Lazar), it comprises 43 articles whose titles alone give you a good idea of what this supplementary issue is largely about.

But look back at its title - Statistical Inference in the 21st Century: A World Beyond p < 0.05. It's not simply full of criticisms. There's a heap of excellent, positive, and constructive material in there.

Highly recommended reading!


© 2019, David E. Giles

Tuesday, February 5, 2019

Misinterpreting Tests, P-Values, Confidence Intervals & Power

There are so many things in statistics (and hence in econometrics) that are easily, and frequently, misinterpreted. Two really obvious examples are p-values and confidence intervals.

I've devoted some space in earlier posts to each of these concepts, and their mis-use. For instance, in the case of p-values, see the posts here and here; and for confidence intervals, see here and here.

Today I was reading a great paper by Greenland et al. (2016) that deals with some common misconceptions and misinterpretations that arise not only with p-values and confidence intervals, but also with statistical tests in general and the "power" of such tests. These comments by the authors in the abstract for their paper set the tone of what's to follow rather nicely:
"A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so - and yet these misinterpretations dominate much of the scientific literature." 
The paper then goes through various common interpretations of the four concepts in question, and systematically demolishes them!

The paper is extremely readable and informative. Every econometrics student, and most applied econometricians, would benefit from taking a look!


Reference

Greenland, S., S. J. Senn, K. R. Rothman, J. B. Carlin, C. Poole, S. N. Goodman, & D. G. Altman, 2016. Statistical tests, p values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31, 337-350.  

© 2019, David E. Giles

Wednesday, October 4, 2017

Recommended Reading for October

  • Andor, N. & C. Parmeter, 2017. Pseudolikelihood estimation of the stochastic frontier model. Ruhr Economic Papers #693.
  • Chalak, K., 2017. Instrumental variables methods with heterogeneity and mismeasured instruments. Econometric Theory, 33, 69-104.
  • Kim, J. H. & I. Choi, 2017. Unit roots in economic and financial time series: A re-evaluation at the decision-based significance levels. Econometrics, 56 (3), 41.
  • Owen, A. B., 2017. Statistically efficient thinning of a Markov chain sampler. Journal of Computational and Graphical Statistics, 26, 738-744. 
  • Owen, P. D., 2017. Evaluating ingenious instruments for fundamental determinants of long-run economic growth and development. Econometrics, 5 (3), 38.
  • Richard, P., 2017. Robust heteroskedasticity-robust tests. Economics Letters, 159, 28-32.

© 2017, David E. Giles

Sunday, September 10, 2017

Econometrics Reading List for September

A little belatedly, here is my September reading list:
  • Benjamin, D. J. et al., 2017. Redefine statistical significance. Pre-print.
  • Jiang, B., G. Athanasopoulos, R. J. Hyndman, A. Panagiotelis, and F. Vahid, 2017. Macroeconomic forecasting for Australia using a large number of predictors. Working Paper 2/17, Department of Econometrics and Business Statistics, Monash University.
  • Knaeble, D. and S. Dutter, 2017. Reversals of least-square estimates and model-invariant estimations for directions of unique effects. The American Statistician, 71, 97-105.
  • Moiseev, N. A., 2017. Forecasting time series of economic processes by model averaging across data frames of various lengths. Journal of Statistical Computation and Simulation, 87, 3111-3131.
  • Stewart, K. G., 2017. Normalized CES supply systems: Replication of Klump, McAdam and Willman (2007). Journal of Applied Econometrics, in press.
  • Tsai, A. C., M. Liou, M. Simak, and P. E. Cheng, 2017. On hyperbolic transformations to normality. Computational Statistics and Data Analysis, 115, 250-266.


© 2017, David E. Giles

Saturday, July 1, 2017

Canada Day Reading List

I was tempted to offer you a list of 150 items, but I thought better of it!
  • Hamilton, J. D., 2017. Why you should never use the Hodrick-Prescott filter. Mimeo., Department of Economics, UC San Diego.
  • Jin, H. and S. Zhang, 2017. Spurious regression between long memory series due to mis-specified structural breaks. Communications in Statistics - Simulation and Computation, in press.
  • Kiviet, J. F., 2016. Testing the impossible: Identifying exclusion restrictions. Discussion Paper 2016/03, Amsterdam School of Economics, University of Amsterdam.
  • Lenz, G. and A. Sahn, 2017. Achieving statistical significance with covariates. BITSS Preprint. (H/T Arthur Charpentier.)
  • Sephton, P., 2017. Finite sample critical values of the generalized KPSS test. Computational Economics, 50, 161-172.
© 2017, David E. Giles

Monday, December 26, 2016

Specification Testing With Very Large Samples

I received the following email query a while back:
"It's my understanding that in the event that you have a large sample size (in my case, > 2million obs) many tests for functional form mis-specification will report statistically significant results purely on the basis that the sample size is large. In this situation, how can one reasonably test for misspecification?" 
Well, to begin with, that's absolutely correct - if the sample size is very, very large then almost any null hypothesis will be rejected (at conventional significance levels). For instance, see this earlier post of mine.

Shmueli (2012) also addresses this point from the p-value perspective.

But the question was, what can we do in this situation if we want to test for functional form mis-specification?

Shmueli offers some general suggestions that could be applied to this specific question:
  1. Present effect sizes.
  2. Report confidence intervals.
  3. Use (certain types of) charts.
This is followed by an empirical example relating to auction prices for camera sales on eBay, using a sample size of n = 341,136.

To this, I'd add, consider alternative functional forms and use ex post forecast performance and cross-validation to choose a preferred functional form for your model.

You don't always have to use conventional hypothesis testing for this purpose.
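Here's a minimal sketch of that cross-validation idea (my own, and not anything taken from Shmueli's paper; the data-generating process, sample size, and number of folds are invented purely for illustration). It uses k-fold cross-validation to compare a linear and a quadratic specification on their out-of-sample mean squared error, with no hypothesis test in sight:

    import numpy as np

    # Invented example: the true relationship is quadratic in x.
    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.uniform(0.0, 10.0, size=n)
    y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=5.0, size=n)

    def cv_mse(design, y, k=5):
        """Average out-of-sample MSE over k folds for a given design matrix."""
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, k)
        mse = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
            mse.append(np.mean((y[fold] - design[fold] @ beta) ** 2))
        return np.mean(mse)

    ones = np.ones_like(x)
    linear    = np.column_stack([ones, x])         # y = b0 + b1*x
    quadratic = np.column_stack([ones, x, x**2])   # y = b0 + b1*x + b2*x^2

    print("CV MSE, linear form:   ", cv_mse(linear, y))
    print("CV MSE, quadratic form:", cv_mse(quadratic, y))

The specification with the smaller cross-validated MSE is the one you'd lean towards, regardless of how many asterisks a t-test would attach to the extra term.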

Reference

Shmueli, G., 2012. Too big to fail: Large samples and the p-value problem. Mimeo., Institute of Service Science, National Tsing Hua University, Taiwan.


© 2016, David E. Giles

Sunday, April 12, 2015

How (Not) to Interpret That p-Value

Thanks to my colleague, Linda Welling, for bringing this post to my attention: Still Not Significant.

I just love it! 

(Take some of the comments with a grain of salt, though.) 



© 2015, David E. Giles

Monday, November 3, 2014

Central and Non-Central Distributions

Let's imagine that you're teaching an econometrics class that features hypothesis testing. It may be an elementary introduction to the topic itself; or it may be a more detailed discussion of a particular testing problem. We're not talking here about a course on Bayesian econometrics, so in all likelihood you'll be following the "classical" Neyman-Pearson paradigm.

You set up the null and alternative hypotheses. You introduce the idea of a test statistic, and hopefully, you explain why we try to find one that's "pivotal". You talk about Type I and Type II errors; and the trade-off between the probabilities of these errors occurring. 

You might talk about the idea of assigning a significance level for the test in advance of implementing it; or you might talk about p-values. In either case, you have to emphasize to the class that in order to apply the test itself, you have to know the sampling distribution of your test statistic for the situation where the null hypothesis is true.

Why is this?
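As the post's title hints, the answer involves the distinction between central and non-central distributions. Here's a small numerical sketch of where each one enters (my own illustration, with an arbitrary effect size, sample size, and significance level, not something from the post itself): the critical value of a t-test comes from the central Student-t distribution that holds when the null is true, while the power of the test is computed from the non-central t distribution that holds under the alternative.

    from scipy import stats

    # Sketch: one-sided t-test of H0: mu = 0 against H1: mu = 0.2, with
    # sigma = 1 and n = 50.  All of these numbers are illustrative only.
    n, mu1, sigma, alpha = 50, 0.2, 1.0, 0.05
    df = n - 1
    ncp = mu1 / (sigma / n**0.5)              # non-centrality parameter

    crit = stats.t.ppf(1 - alpha, df)         # critical value: CENTRAL t (null true)
    power = 1 - stats.nct.cdf(crit, df, ncp)  # rejection prob.: NON-CENTRAL t (alternative true)

    print(f"critical value = {crit:.3f}, power = {power:.3f}")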

Tuesday, August 5, 2014

The 7 Pillars of Statistical Wisdom

Yesterday, Stephen Stigler presented the (ASA) President's Invited Address to an overflow, and appreciative, audience at the 2014 Joint Statistical Meetings in Boston. The title of his talk was, "The Seven Pillars of Statistical Wisdom".

I'd been looking forward to this presentation by our foremost authority on the history of statistics, and it surpassed my (high) expectations.

The address will be published in JASA at some future date, and I urge you to read it when it appears. In the meantime, here are the "seven pillars" - the supporting pillars of statistical science - with some brief comments:

Wednesday, June 11, 2014

Do You Use P-Values and Confidence Intervals?

Unless your econometrics training has been true-blue Bayesian in nature, you'll have reported a lot of p-values, and constructed heaps of confidence intervals in your time.

Both of these concepts have been the centre of widespread controversy in the statistics literature since their inception. It's probably good to be aware of this - just so you don't go and "shoot yourself in the foot" at some stage.

Economist/econometrician Aris Spanos has published an interesting and readable piece about all of this in a recent issue of the journal Ecology. His paper is titled, "Recurring Controversies About P Values and Confidence Intervals Revisited". You can read a summary on the Error Statistics blog, here.

I strongly recommend this paper.

© 2014, David E. Giles

Friday, February 14, 2014

P-Values ...... Again!

I've had posts about p-values in the past - e.g., see here, here, here, and here. Well, this pesky little devil is back in the news again. Every now and then the "p-value bashers" emerge from the swamp, and this past week it happened again - in Nature.

When I read this piece by Regina Nuzzo (and once the eye-rolling had subsided) I was very tempted to put together a post. I'm glad I didn't, because today Jeff Leek published a post on the Simply Statistics blog that is way, way better than anything I could have put together.

It's a must-read piece!



© 2014, David E. Giles

Friday, January 24, 2014

Testing Up, or Testing Down?

Students are told that if you're going to go in for sequential testing, when determining the specification of a model, then the sequence that you follow should be "from the general to the specific". That is, you should start off with a "large" model, and then simplify it - not vice versa.

At least, I hope this is what they're told!

But are they told why they should "test down", rather than "test up"? Judging by some of the things I read and hear, I think the answer to the last question is "no"!

The "general-to-specific" modelling strategy is usually attributed to David Hendry, and an accessible overview of the associated literature is provided by Campos et al. (2005).

Let's take a look at just one aspect of this important topic. 
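To fix ideas first, here's a deliberately crude caricature of "testing down" (my own sketch; it is emphatically not Hendry's methodology in full, which also involves diagnostic testing, encompassing, and careful choice of significance levels at each step). We start from a general model with several candidate regressors and repeatedly delete the least significant one until everything that remains survives a pre-set threshold:

    import numpy as np
    import statsmodels.api as sm

    # Crude "testing down" loop: begin with a general model and delete the
    # least significant regressor until all survivors beat a fixed threshold.
    # The data and the 5% threshold are invented for illustration only.
    rng = np.random.default_rng(1)
    n = 500
    X = rng.normal(size=(n, 5))
    y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)  # only x0, x1 matter

    names = ["x0", "x1", "x2", "x3", "x4"]
    keep = list(range(5))
    threshold = 0.05

    while keep:
        fit = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
        pvals = fit.pvalues[1:]                  # p-values, excluding the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= threshold:
            break
        print(f"dropping {names[keep[worst]]} (p-value = {pvals[worst]:.3f})")
        keep.pop(worst)

    print("retained regressors:", [names[i] for i in keep])

None of the subtleties about why we should test down rather than up are captured by this little loop, of course - that's what the discussion that follows is about.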

Saturday, December 28, 2013

Statistical Significance - Again

With all of this emphasis on "Big Data", I was pleased to see this post on the Big Data Econometrics blog, today.

When you have a sample that runs to the thousands (billions?), the conventional significance levels of 10%, 5%, 1% are completely inappropriate. You need to be thinking in terms of tiny significance levels.

I discussed this in some detail back in April of 2011, in a post titled, "Drawing Inferences From Very Large Data-Sets". If you're one of those (many) applied researchers who use large cross-sections of data, and then sprinkle their results tables with asterisks to signal "significance" at the 5%, 10% levels, etc., then I urge you to read that earlier post.

It's sad to encounter so many papers and seminar presentations in which the results, in reality, are totally insignificant!
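Just how tiny? One rough back-of-the-envelope way to see it (this is my own sketch of a familiar rule of thumb, not anything prescribed in that earlier post) is to note that a BIC-type model-selection rule retains a single extra parameter only when its squared t-statistic exceeds log(n). The significance level implied by that rule keeps falling as the sample grows:

    import numpy as np
    from scipy import stats

    # Implied significance levels from the rule of thumb t^2 > ln(n),
    # using a normal approximation for the two-sided tail probability.
    for n in (100, 10_000, 1_000_000, 100_000_000):
        crit_t = np.sqrt(np.log(n))                       # implied |t| critical value
        implied_alpha = 2 * (1 - stats.norm.cdf(crit_t))  # two-sided level
        print(f"n = {n:>11,}   |t| critical value = {crit_t:.2f}   "
              f"implied level = {implied_alpha:.2g}")

So with a million observations this rule points to a critical |t| of about 3.7 and a significance level of roughly 0.0002, not the 1.96 and 5% that the asterisks in most results tables are based on.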


© 2013, David E. Giles

Monday, November 25, 2013

A Bayesian View of P-Values

"I have always considered the arguments for the use of P (p-value) absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial was improbable: that is, that it has not rejected something that has not happened'"
H. Jeffreys, 1980. Some general points in probability theory. In A. Zellner (ed.), Bayesian Analysis in Probability and Statistics. North-Holland, Amsterdam, p. 453.


© 2013, David E. Giles

Saturday, November 16, 2013

How Science (Econometrics?) is Really Done

If you tweet, you may be familiar with #OverlyHonestMethods. If not, this link to Popular Science will set you on the right track. As it says: "In 140 characters or less, the info that didn't get through peer review."

Here are some beauties that may strike a chord with certain applied econometricians:
  • "Our results were non-significant at p > 0.05, but they're humdingers at p > 0.1"
  • "Experiment was repeated until we had three statistically significant similar results and could discard the outliers"
  • "We decided to use Technique Y because it's new and sexy, plus hot and cool. And because we could."
  • "I can't send you the original data because I don't remember what my excel file names mean anymore."
  • "Non-linear regression analysis was performed in Graph Pad Prism because SPSS is a nightmare."
  • "We made a thorough comparison of all post-hoc tests while our statistician wasn't looking."
  • "Our paper lacks post-2010 references as it's taken the co-authors that long to agree on where to submit the final draft."
  • "If you pay close attention to our degrees-of-freedom you will realize we have no idea what test we actually ran."
  • "Additional variables were not considered because everyone involved is tired of working on this paper."
  • "We used jargon instead of plain English to prove that a decade of grad school and postdoc made us smart."

Oh yes!!!!

© 2013, David E. Giles

Thursday, September 19, 2013

P-Values, Statistical Significance, and Logistic Regression

Yesterday, William M. Briggs ("Statistician to the Stars") posted on his blog a piece titled "How to Mislead With P-values: Logistic Regression Example".

Here are some extracts which, hopefully, will encourage you to read the full post:

"It’s too easy to generate “significant” answers which are anything but significant. Here’s yet more—how much do you need!—proof. The pictures below show how easy it is to falsely generate “significance” by the simple trick of adding “independent” or “control variables” to logistic regression models, something which everybody does...............

Logistic regression is a common method to identify whether exposure is “statistically significant”. .... (The) Idea is simple enough: data showing whether people have the malady or not and whether they were exposed or not is fed into the model. If the parameter associated with exposure has a wee p-value, then exposure is believed to be trouble.
So, given our assumption that the probability of having the malady is identical in both groups, a logistic regression fed data consonant with our assumption shouldn’t show wee p-values. And the model won’t, most of the time. But it can be fooled into doing so, and easily. Here’s how.
Not just exposed/not-exposed data is input to these models, but “controls” are, too; sometimes called “independent” or “control variables.” These are things which might affect the chance of developing the malady. Age, sex, weight or BMI, smoking status, prior medical history, education, and on and on. Indeed models which don’t use controls aren’t considered terribly scientific.
Let’s control for things in our model, using the same data consonant with probabilities (of having the malady) the same in both groups. The model should show the same non-statistically significant p-value for the exposure parameter, right? Well, it won’t. The p-value for exposure will on average become wee-er (yes, wee-er). Add in a second control and the exposure p-value becomes wee-er still. Keep going and eventually you have a “statistically significant” model which “proves” exposure’s evil effects. Nice, right?"
Oh yes - don't forget to read the responses/comments for this post, here.
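If you'd like to see the mechanism for yourself, here's a rough simulation along the lines Briggs describes (my own sketch, not his code; the sample size, number of replications, and use of pure-noise controls are all arbitrary choices of mine). The outcome probability is identical in the exposed and unexposed groups, so exposure is truly irrelevant, and the script simply tracks what happens to the exposure p-value as junk controls are piled in:

    import numpy as np
    import statsmodels.api as sm

    # Outcome probability is 0.5 for everyone, so "exposure" has no effect.
    # The "controls" are pure noise.  All settings are illustrative.
    rng = np.random.default_rng(123)
    n, reps = 200, 500

    for n_controls in (0, 5, 10, 20):
        pvals = []
        for _ in range(reps):
            exposure = rng.integers(0, 2, size=n)
            y = rng.integers(0, 2, size=n)             # P(malady) = 0.5 regardless
            cols = [np.ones(n), exposure]
            if n_controls:
                cols.append(rng.normal(size=(n, n_controls)))
            X = np.column_stack(cols)
            try:
                fit = sm.Logit(y, X).fit(disp=0)
                pvals.append(fit.pvalues[1])           # p-value on exposure
            except Exception:
                pass                                   # skip any non-converged fits
        print(f"{n_controls:>2} controls: mean exposure p-value = {np.mean(pvals):.3f}")

Whether your run reproduces the steadily shrinking p-values in Briggs' pictures is an empirical matter, but that's exactly the experiment he's describing.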


© 2013, David E. Giles

Wednesday, April 17, 2013

Star Wars

Today, Ryan MacDonald, a UVic Economics grad. who works with Statistics Canada, sent me an interesting paper by Abel Brodeur et al.: "Star Wars: The Empirics Strike Back". Who can resist a title like that!

The "stars" that are being referred to in the title are those single, double (triple!) asterisks that authors just love to put against the parameter estimates in their tables of results, to signal statistical significance at the 10%, 5% (1%!) levels. A table without stars is like champagne without bubbles!