Thursday, August 25, 2011

Reproducible Econometric Research

I doubt if anyone would deny the importance of being able to reproduce one's econometric results. More importantly, other researchers should be able to reproduce our results: (a) to verify that we've done what we said we did; (b) to investigate the sensitivity of our results to the various choices we made (e.g., the functional form of our model, the choice of sample period, etc.); and (c) to satisfy themselves that they understand our analysis.

However, if you've ever tried to literally reproduce someone else's econometric results, you'll know that it's not always that easy to do so - even if they supply you with their data set. You really need to have their code (R, EViews, Stata, Gauss) as well. That's why I include both Data and Code pages with this blog.

Students of econometrics really shouldn't underestimate the importance of the replicability of results. As tedious as it can be, it's really important to fully document the steps you take when "cleaning" your data prior to your empirical modelling. It's also sensible to document "bad" results as well as "good" results, if only for your own benefit when you inevitably have to re-visit your work at a later date. (Generally a much later date if you've submitted your work to a typical economics journal!)

There's been a move on the part of some academic journals towards asking or requiring authors of empirical papers to supply their data and code as a condition of acceptance of their work for publication. This material is then housed in an on-line repository that anyone can access.

Of course, exceptions sometimes have to be made - for example, when the data are proprietary and can't be released publicly. But these should be exceptions.

For example, the Journal of Applied Econometrics made this mandatory several years ago, and now has a very valuable Data Archive. I know that at least one of my colleagues here at UVic makes good use of this archive in her graduate teaching. That same journal also has a section for replication studies - we need more of this.

At the Journal of International Trade & Economic Development we introduced a data and code repository earlier this year. It's managed by Judith Clarke, and at this point it operates on a voluntary basis. I think we're going to have to make it mandatory, though - so far, no authors have volunteered to upload their files! I guess that incentives have something to do with this.

As far as I'm concerned, it's also perfectly reasonable to ask to see data and code when you're refereeing a paper for a journal. It should go without saying that such requests need to be made via the handling editor/associate editor. I ask for data and code quite frequently in connection with refereeing tasks, and it can lead to some interesting outcomes - believe me!

A while back, Jeff Racine drew my attention to Sweave, and kindly demonstrated some of its capabilities. So what's Sweave? I can't do better than to quote from the associated website:

"Sweave is a tool that allows to embed the R code for complete data analyses in latex documents. The purpose is to create dynamic reports, which can be updated automatically if data or analysis change. Instead of inserting a prefabricated graph or table into the report, the master document contains the R code necessary to obtain it. When run through R, all data analysis output (tables, graphs, etc.) is created on the fly and inserted into a final latex document. The report can be automatically updated if data or analysis change, which allows for truly reproducible research."

Notice that part of the appeal of Sweave is that it lends itself to the replicability of research results.
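To give a flavour of what this looks like in practice, here's a minimal sketch of a Sweave document (my own illustrative example, using R's built-in cars data set - not one of Jeff's). It's just an ordinary LaTeX file with R code chunks delimited by `<<>>=` and `@`, plus `\Sexpr{}` for inline results:

```latex
\documentclass{article}
\begin{document}

We regress stopping distance on speed, using R's built-in
\texttt{cars} data set.

% An R code chunk: Sweave runs this and inserts the output
<<>>=
fit <- lm(dist ~ speed, data = cars)
summary(fit)$coefficients
@

% Inline R: the estimated slope appears in the compiled text
The estimated slope coefficient is \Sexpr{round(coef(fit)[2], 3)}.

\end{document}
```

Running `Sweave("example.Rnw")` in R produces a .tex file with all of the R output inserted, which you then compile with LaTeX in the usual way. Change the data, re-run, and every table, graph, and inline number in the paper is regenerated automatically.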

Incidentally, Jeff has a piece scheduled to appear in the Journal of Applied Econometrics about RStudio (see my earlier post here), R, and Sweave. Watch out for it.

© 2011, David E. Giles


  1. One problem with reproduction of empirical results in macroeconomics is that published data get revised. Therefore, it is possible to go to the same data source, download the same series, use the same sample period and same software as the original authors and get different results.

    The FRB St. Louis has a tool that can be used to get around this. Their ALFRED site keeps track of all published versions of several thousand US statistical series. Users can use the site not only to get the latest data, but also to publish a data list that other users can then use to obtain precisely the same data, regardless of subsequent revisions.

  2. Simon: Very good point. How many of us have run into this problem with Statistics Canada's CANSIM database?! All the more reason to supply the actual data - not just the source and series numbers. The ALFRED site is a great resource!
    Thanks for the comment.


