Comments on Econometrics Beat: Dave Giles' Blog: Can Your Results be Replicated?
Dave Giles
Updated 2023-10-24T03:16:41.009-07:00

2013-09-27T11:02:59.503-07:00
Andrew - fair comment! You need to understand what's in the can, and many users don't even seem to be interested in knowing.
Posted by Dave Giles

2013-09-27T10:23:51.144-07:00
I once had to convince a co-worker that Stata would do this. "But it generated coefficients, so they must be right." No, these and infinitely many other sets of coefficients.

When Stata tells you that the F-stat is "." there is likely something really wrong (this also happens when you use a vce with not enough clusters).

I don't blame Stata. It seems to be more a problem with canned procedures in general.
Posted by Andrew Usher

2013-09-13T15:26:49.742-07:00
Thanks Dimitriy.
DG
Posted by Dave Giles

2013-09-13T14:30:32.941-07:00
Here's a nice summary and commentary of this from Jeff Pitblado at http://www.stata.com/statalist/archive/2013-09/msg00618.html.
Posted by Dimitriy

2013-09-13T12:14:33.760-07:00
OK - Struck out!
Posted by Dave Giles

2013-09-13T11:38:37.608-07:00
I agree, it is a good idea to replicate others' and one's own work with different packages (or do some other validation, e.g., simulate data that closely resemble some key features of the data at hand and see whether the command/package used recovers the parameters correctly, etc.).

I got the insinuation from the following sentence in your original post:

"Especially if you favour the Stata package!"

This sentence, in combination with your posted excerpts, seems to present the problem as being caused by faulty software, when in fact the problem seems to have arisen from researchers choosing an inappropriate model.

Best,
Joerg
Posted by Joerg Luedicke

2013-09-13T11:08:25.662-07:00
Joerg - thank you for correcting my spelling. I'm not sure where you get the "insinuation" from. I stand by the point that if you are trying to replicate someone's results (including your own!)
then it's a darn good idea to run the data on more than one piece of software.
DG
Posted by Dave Giles

2013-09-13T09:41:48.759-07:00
Hi Dave,

In the same vein as the previous poster: I don't think there is any need to blame a particular software package here (BTW, note it's Stata, not STATA). Apparently, the mistake of the original authors of the 2009 study was to use the wrong model for their purposes, but your post seems to insinuate that their mistake was to use the "wrong" software package. For example, using data generated in Stata (see http://www.stata.com/statalist/archive/2013-09/msg00595.html), the -geeglm- function from the -geepack- package (version 1.1-6) in R also provides estimates without any warning messages etc.:

> #--------------------------------
> require(Hmisc)
> require(geepack)
>
> dat = stata.get("gee_check1.dta")
>
> M1 <- geeglm( y ~ x.1 + x.3, data=dat, id=id,
+ family=binomial(link="logit"),
+ corstr="exchangeable")
> summary(M1)

Call:
geeglm(formula = y ~ x.1 + x.3, family = binomial(link = "logit"),
    data = dat, id = id, corstr = "exchangeable")

 Coefficients:
             Estimate   Std.err     Wald Pr(>|W|)
(Intercept)  4.75e-01  1.18e-01 1.63e+01  5.4e-05 ***
x.1         -1.73e+07  3.80e+04 2.08e+05  < 2e-16 ***
x.3         -2.05e-01  1.50e-01 1.86e+00     0.17
---
Signif.
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)    0.667  0.0134

Correlation: Structure = exchangeable Link = identity

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.0187  0.0161
Number of clusters: 100 Maximum cluster size: 10
> #--------------------------------

It is the user's responsibility to choose reasonable models for their data, not the responsibility of a computer program.

Best,
Joerg
Posted by Joerg Luedicke

2013-09-12T14:49:17.653-07:00
Yikes!! I don't like that either!!!
Posted by Dave Giles

2013-09-12T14:26:13.843-07:00
Stata does at least tell you about the problem and what it did. Before the table of coefficients it might, for example, say "X == 0 predicts success perfectly" and then tell you that it dropped that variable and some number of observations. Of course, it's up to the user to look at ALL the warning messages and understand what they mean.
Posted by Anonymous

2013-09-12T12:53:11.503-07:00
"More generally, packages that produce any results in this case (or the exact multicollinearity example you cited) still worry me.
I'd rather they stopped, and produced an intelligible message that highlights the "problem"."

The glm function in R, for example, will not stop and instead gives an answer under complete separation. Here's an example from an lme4 GitHub issue (https://github.com/lme4/lme4/issues/124):

> set.seed(101)
> d <- data.frame(y=rbinom(1000,size=1,p=0.5),
+ x=runif(1000),
+ f=factor(rep(1:20,each=50)),
+ x2=rep(0:1,c(999,1)))
> glm(y ~ x+x2, data=d, family=binomial)

Call: glm(formula = y ~ x + x2, family = binomial, data = d)

Coefficients:
(Intercept)            x           x2
   -0.10037      0.03549    -12.50117

Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 1385
Residual Deviance: 1383 AIC: 1389

That amazing coefficient of -12.50117 is just a symptom of complete separation.
Posted by Anonymous

2013-09-12T10:47:12.809-07:00
Angelo - that's a very good point. I interpreted "separation" in the same way you did - it's a pretty common way of describing that phenomenon. The original authors certainly should have provided that warning.

More generally, packages that produce any results in this case (or the exact multicollinearity example you cited) still worry me. I'd rather they stopped, and produced an intelligible message that highlights the "problem".

In addition there remains the point that genuine replication should preferably include alternative software.
When you really get down to the computational side of things, it should really include different operating systems too, but not for the sort of things we have in mind.
Posted by Dave Giles

2013-09-12T10:06:11.975-07:00
I took a look at the paper by the two grad students. I think this is what happened.

With logit and probit models, ML requires that predicted probabilities for different "cells" (i.e. subsamples identified by dummy variables) match empirical frequencies. So if all observations in the same cell make the same choice (I think this is what they mean by "separation in the data"), then coefficients aren't identified (because linear combinations of them have to shoot off to positive or negative infinity). Apparently, Stata has some way of resolving the lack of identification so that some answer comes out of the package. It's kind of like having a perfect multicollinearity problem in a linear regression where the package picks a particular solution out for you from the space of solutions to the LS problem: the choice would be ad hoc, and different identification schemes could generate very different parameter estimates, although the predicted value of the dependent variable would be the same in all cases. (FWIW, my quick reading didn't make it clear how the graduate students resolved the lack of identification.)

My bottom line is that the problem wasn't in the software (I don't know whether it allows you to check for this problem and the authors didn't, or whether it should have printed a warning regardless). The problem was the failure of the original authors to warn the reader that the parameter estimates weren't identified in the data, only the predicted probabilities.
Posted by Angelo
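
A minimal sketch of the separation argument in the last comment, in Python with made-up toy data. The function name `log_likelihood`, the no-intercept model, and the six data points are all assumptions for illustration; none of it comes from the paper or from any package. When x separates the outcomes perfectly, the logit log-likelihood keeps rising as the slope grows, so no finite maximum exists:

```python
import math


def log_likelihood(beta, xs, ys):
    """Logit log-likelihood for the no-intercept model P(y=1|x) = 1/(1+exp(-beta*x))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta * x
        # log P(y|x) equals -log(1 + exp(-s)), where s = z if y == 1 and s = -z if y == 0.
        s = z if y == 1 else -z
        total += -math.log1p(math.exp(-s))
    return total


# Hypothetical toy data: y is 1 exactly when x > 0, so x separates the outcomes perfectly.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Evaluate the log-likelihood at ever larger slopes.
betas = [1.0, 5.0, 10.0, 20.0]
values = [log_likelihood(b, xs, ys) for b in betas]

for b, v in zip(betas, values):
    print(f"beta = {b:5.1f}   log-likelihood = {v:.6f}")
```

Each larger slope gives a log-likelihood closer to zero without ever reaching it. An iterative optimizer run on such data has to stop somewhere, which is presumably why packages report a large but finite coefficient rather than failing.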