Monday, December 1, 2014

Statistical Controls Are Great - Except When They're Not!

A blog post today, titled "How Race Discrimination in Law Enforcement Actually Works", caught my eye. It seemed like an important topic. The post, by Ezra Klein, appeared on Vox.

I'm not going to discuss it in any detail, but I think that some readers of this blog will enjoy reading it. Here are a few selected passages, to whet your collective appetite:

"You see it all the time in studies. 'We controlled for...' And then the list starts. The longer the better." (Oh boy, can I relate to that. Think of all of those seminars you've sat through...)
"The problem with controls is that it's often hard to tell the difference between a variable that's obscuring the thing you're studying and a variable that is the thing you're studying."
"The papers brag about their controls. They dismiss past research because it had too few controls." (How many seminars was that?)
"Statistical Controls Are Great - Except When They're Not!"

© 2014, David E. Giles

1 comment:

  1. For an even more egregious example of treating a mediating variable as a confounder: researchers claimed to demonstrate that runners who ran less than 20 miles a week had lower mortality than those who ran more. To reach this conclusion, it was necessary to condition on BMI, blood pressure, and cholesterol. But running influences mortality in large part through its effects on BMI, blood pressure, and cholesterol.

    I was disappointed by "researchers know this. There's just nothing they can do about it." How about not conditioning?

    The root of the problem, in my view, is the combination of a desire to avoid imposing unnecessary assumptions, which is understandable, with a belief that conditioning is a purely neutral activity that imposes no assumptions, which is heinous. There are many things in life which it is reasonable to want but not to expect to have (hey, I'd like a pony!), and assumption-free statistical inference is one of them. Yes; every time you choose not to condition on a variate, you are implicitly assuming that variate is not a confounder. But it is equally true that every time you choose to condition on a variate, you are assuming that it IS a confounder, or else unrelated. Much reported research would be improved if authors had a clearer grasp of this fact. For that matter, I think most papers would be improved if you had to explicitly draw one of those little causal diagrams of your assumptions, along with your p-values.
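    The point about conditioning on a mediator can be illustrated with a tiny simulation (not from the original post; the variable names and effect sizes below are made up for illustration). Here running has no direct effect on mortality at all — its entire benefit flows through BMI — yet "controlling for" BMI makes the benefit vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all numbers invented):
# running lowers BMI, and BMI raises mortality risk. Running has
# no direct effect on mortality; its whole effect is mediated.
running = rng.normal(size=n)
bmi = -0.8 * running + rng.normal(size=n)      # mediator
mortality = 0.5 * bmi + rng.normal(size=n)     # outcome

# Total effect of running on mortality: slope should be near
# -0.8 * 0.5 = -0.4, the true causal effect.
X = np.column_stack([np.ones(n), running])
beta_total = np.linalg.lstsq(X, mortality, rcond=None)[0]

# "Controlling for" the mediator: the running coefficient is
# driven toward zero, because BMI soaks up the causal pathway.
Xc = np.column_stack([np.ones(n), running, bmi])
beta_cond = np.linalg.lstsq(Xc, mortality, rcond=None)[0]

print("total effect:      ", beta_total[1])
print("'controlled' effect:", beta_cond[1])
```

    Nothing here is wrong with the regressions themselves; the error is in reading the second one as the total effect of running.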

    In that connection, I recall that Judea Pearl has a wonderful diagram of a "Simpson's paradox machine"; conditioning successively on {X1}, {X1,X2}, {X1,X2,X3} ... reverses the conclusion each time. There is nothing wrong with testing more than one causal model, of course. If your conclusions stand up, then you can say "our results are robust to ..." etc, and if they don't, that indicates a very interesting direction for future research. But as the paradox machine makes clear, it is the combination of conditioners that matters. And if you have many potential confounders of which you are uncertain, such that this combination grows uncomfortably large? Why, then that is a fair indication of the weight to put on your results.
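    A sign reversal of the Simpson's-paradox kind is easy to manufacture in a few lines (again a made-up simulation, not Pearl's actual diagram): a group variable drives both treatment and outcome, the within-group slope is negative, and the marginal slope comes out positive.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical setup (all numbers invented): z is a binary
# confounder (say, case severity) that raises both the treatment
# level x and the outcome y. Within each z-group, x lowers y.
z = rng.binomial(1, 0.5, n)
x = rng.normal(loc=2.0 * z, size=n)
y = -0.5 * x + 3.0 * z + rng.normal(size=n)   # true within-group slope: -0.5

# Marginal regression of y on x: the slope flips sign
# (analytically it is 0.25 here), because z is omitted.
X = np.column_stack([np.ones(n), x])
b_marg = np.linalg.lstsq(X, y, rcond=None)[0]

# Conditioning on z recovers the within-group slope of -0.5.
Xz = np.column_stack([np.ones(n), x, z])
b_cond = np.linalg.lstsq(Xz, y, rcond=None)[0]

print("marginal slope:   ", b_marg[1])
print("conditional slope:", b_cond[1])
```

    Whether the marginal or the conditional slope is the "right" answer depends entirely on the assumed causal structure — here z is a genuine confounder, so conditioning is correct; in the running/BMI example above, conditioning is exactly wrong. The data alone cannot tell you which case you are in.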