Comments on Econometrics Beat: Dave Giles' Blog: Can Your Results be Replicated?

Andrew - fair comment! You need to understand what...

2013-09-27T11:02:59.503-07:00

Andrew - fair comment! You need to understand what's in the can, and many users don't even seem to be interested in knowing.

I once had to convince a co-worker that Stata woul...

2013-09-27T10:23:51.144-07:00

I once had to convince a co-worker that Stata would do this. "But it generated coefficients, so they must be right." No: these and an infinite set of other coefficients would do equally well.

When Stata tells you that the F-Stat is . there is likely something really wrong (also happens when you use a vce with not clusters)

I don't blame Stata. It seems to be more a problem with canned procedures in general.

Thanks Dimitriy. DG

2013-09-13T15:26:49.742-07:00

Thanks Dimitriy.
DG

Here's a nice summary and commentary of this f...

2013-09-13T14:30:32.941-07:00

Here's a nice summary and commentary of this from Jeff Pitblado at http://www.stata.com/statalist/archive/2013-09/msg00618.html.

OK - Struck out!

2013-09-13T12:14:33.760-07:00

OK - Struck out!

I agree, it is a good idea to replicate other'...

2013-09-13T11:38:37.608-07:00

I agree, it is a good idea to replicate others' and one's own work with different packages (or do some other validation, e.g., simulate data that closely resemble some key features of the data at hand and see if the command/package used recovers the parameters correctly, etc.).

I got the insinuation from the following sentence in your original post:

"Especially if you favour the Stata package!"

This sentence, in combination with your posted excerpts, seems to present the problem as being caused by faulty software, when in fact the problem seems to have arisen from researchers choosing an inappropriate model.

Best,
Joerg
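
The simulation check suggested in the comment above can be sketched roughly as follows. This is only an illustration in Python using the standard library: the true parameter values, the sample size, the seed, and the helper names `simulate` and `fit_logit` are all assumptions made for the example, and the hand-written Newton-Raphson fit stands in for whatever command or package is actually being validated.

```python
import math
import random

def simulate(n, b0, b1, rng):
    """Draw x ~ N(0, 1) and y ~ Bernoulli(logistic(b0 + b1 * x))."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) else 0
          for x in xs]
    return xs, ys

def fit_logit(xs, ys, iters=25):
    """Logit MLE for an intercept and one slope via Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical "true" values chosen for the illustration.
TRUE_B0, TRUE_B1 = -0.5, 1.0
rng = random.Random(12345)
xs, ys = simulate(5000, TRUE_B0, TRUE_B1, rng)
b0_hat, b1_hat = fit_logit(xs, ys)
print(f"true: ({TRUE_B0}, {TRUE_B1})  estimated: ({b0_hat:.3f}, {b1_hat:.3f})")
```

If the estimates land far from the values used to generate the data, that points to a problem with the model or the routine before any real data are involved.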

Joerg - thank you for correcting my spelling. I...

2013-09-13T11:08:25.662-07:00

Joerg - thank you for correcting my spelling. I'm not sure where you get the "insinuation" from. I stand by the point that if you are trying to replicate someone's results (including your own!) then it's a darn good idea to run the data on more than one piece of software.
DG

Hi Dave, In the same vein as the previous poster:...

2013-09-13T09:41:48.759-07:00

Hi Dave,

In the same vein as the previous poster: I don't think there is any need to blame a particular software here (BTW note it's Stata, not STATA). Apparently, the mistake of the original authors from the 2009 study was to use the wrong model for their purposes, but your post seems to insinuate that their mistake was to use the "wrong" software package. For example, using data generated in Stata (see http://www.stata.com/statalist/archive/2013-09/msg00595.html) the -geeglm- function from the -geepack- package (version 1.1-6) in R also provides estimates without any warning messages etc.:

> #--------------------------------
> require(Hmisc)
> require(geepack)
>
> dat = stata.get("gee_check1.dta")
>
> M1 <- geeglm( y ~ x.1 + x.3, data=dat, id=id,
+ family=binomial(link="logit"),
+ corstr="exchangeable")
> summary(M1)

Call:
geeglm(formula = y ~ x.1 + x.3, family = binomial(link = "logit"),
data = dat, id = id, corstr = "exchangeable")

Coefficients:
             Estimate   Std.err     Wald Pr(>|W|)
(Intercept)  4.75e-01  1.18e-01 1.63e+01  5.4e-05 ***
x.1         -1.73e+07  3.80e+04 2.08e+05  < 2e-16 ***
x.3         -2.05e-01  1.50e-01 1.86e+00     0.17
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)    0.667  0.0134

Correlation: Structure = exchangeable Link = identity

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.0187  0.0161
Number of clusters: 100 Maximum cluster size: 10
> #--------------------------------

It is the user's responsibility to choose reasonable models for their data, not the responsibility of a computer program.

Best,
Joerg

Yikes!! I don't like that either!!!

2013-09-12T14:49:17.653-07:00

Yikes!! I don't like that either!!!

Stata does at least tell you about the problem and...

2013-09-12T14:26:13.843-07:00

Stata does at least tell you about the problem and what it did. Before the table of coefficients it might, for example, say "X == 0 predicts success perfectly" and then tell you that it dropped that variable and some number of observations. Of course, it's up to the user to look at ALL the warning messages and understand what they mean.
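
The kind of check behind a message like "X == 0 predicts success perfectly" can be approximated by hand before fitting anything. The sketch below is in Python with the standard library only; it is not Stata's actual algorithm, and the function name `perfect_prediction_levels` and the toy data are made up for the illustration. It simply flags any level of a regressor at which every observation has the same outcome.

```python
from collections import defaultdict

def perfect_prediction_levels(x, y):
    """Map each level of x whose observations all share one outcome to that outcome."""
    outcomes = defaultdict(set)
    for xi, yi in zip(x, y):
        outcomes[xi].add(yi)
    return {level: next(iter(seen))
            for level, seen in outcomes.items()
            if len(seen) == 1}

# Toy data (hypothetical): every observation with x == 0 is a success.
x = [0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 1, 0, 1, 0, 1]

flagged = perfect_prediction_levels(x, y)
for level, outcome in flagged.items():
    print(f"x == {level} always has y == {outcome}")
```

A non-empty result is a cue to read the package's warnings carefully, since the coefficient on that regressor is not finite under maximum likelihood.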

"More generally, packages that produce any re...

2013-09-12T12:53:11.503-07:00

"More generally, packages that produce any results in this case (or the exact multicollinearity example you cited) still worry me. I'd rather they stopped, and produced an intelligible message that highlights the "problem"."

The glm function in R for example will not stop and
instead gives an answer under complete separation.
Here's an example from an lme4 github issue
(https://github.com/lme4/lme4/issues/124):

> set.seed(101)
> d <- data.frame(y=rbinom(1000,size=1,p=0.5),
+ x=runif(1000),
+ f=factor(rep(1:20,each=50)),
+ x2=rep(0:1,c(999,1)))
> glm(y ~ x+x2, data=d, family=binomial)

Call: glm(formula = y ~ x + x2, family = binomial, data = d)

Coefficients:
(Intercept)            x           x2
   -0.10037      0.03549    -12.50117

Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 1385
Residual Deviance: 1383 AIC: 1389

That amazing coefficient of -12.50117 is just a symptom of
complete separation.

Angelo - that's a very good point. I interpret...

2013-09-12T10:47:12.809-07:00

Angelo - that's a very good point. I interpreted "separation" in the same way you did - it's a pretty common way of describing that phenomenon. The original authors certainly should have provided that warning.

More generally, packages that produce any results in this case (or the exact multicollinearity example you cited) still worry me. I'd rather they stopped, and produced an intelligible message that highlights the "problem".

In addition there remains the point that genuine replication should preferably include alternative software. When you really get down to the computational side of things, it should really include different Operating Systems too, but not for the sort of things we have in mind.

I took a look at the paper by the two grad student...

2013-09-12T10:06:11.975-07:00

I took a look at the paper by the two grad students. I think this is what happened.

With logit and probit models, ML requires that predicted probabilities for different "cells" (i.e. subsamples identified by dummy variables) match empirical frequencies. So if all observations in the same cell make the same choice (I think this is what they mean by "separation in the data"), then coefficients aren't identified (because linear combinations of them have to shoot off to positive or negative infinity). Apparently, Stata has some way of resolving the lack of identification so that some answer comes out of the package. It's kind of like having a perfect multicollinearity problem in a linear regression, where the package picks a particular solution for you from the space of solutions to the LS problem: the choice would be ad hoc, and different identification schemes could generate very different parameter estimates, although the predicted value of the dependent variable would be the same in all cases. (FWIW, my quick reading didn't make it clear how the graduate students resolved the lack of identification.)

My bottom line is that the problem wasn't in the software (I don't know if it allows you to check for this problem and the authors didn't or if it should have printed a warning regardless). The problem was the failure of the original authors to warn the reader that the parameter estimates weren't identified in the data, only the predicted probabilities.
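
The identification point in the comment above can be seen in a small numerical sketch. This is Python with the standard library only; the six data points and the no-intercept model are assumptions chosen to keep the example tiny, not anything taken from the paper under discussion. When y = 1 exactly where x > 0, the logit log-likelihood keeps rising toward zero as the slope grows, so no finite maximizer exists.

```python
import math

def log_lik(beta, xs, ys):
    """Logit log-likelihood without intercept: P(y = 1 | x) = 1 / (1 + exp(-beta * x))."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = 1.0 if y == 1 else -1.0
        # log P(y | x) = -log(1 + exp(-s * beta * x))
        total += -math.log1p(math.exp(-s * beta * x))
    return total

# Completely separated toy data (hypothetical): y = 1 exactly when x > 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

betas = [1.0, 5.0, 10.0, 20.0]
lls = [log_lik(b, xs, ys) for b in betas]
for b, ll in zip(betas, lls):
    print(f"beta = {b:5.1f}  log-likelihood = {ll:.6f}")
```

Any coefficient a package reports in this situation reflects where its iterations happened to stop, while the fitted probabilities are already close to 0 or 1.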