Saturday, April 25, 2015

Introductory Statistics for Data Science

The latest issue of Chance contains a very timely article by Nicholas Horton, Benjamin Baumer, and Hadley Wickham. It's titled, "Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics".

Ask yourself - "Is the traditional way that we teach introductory and second-level statistics courses really suited for preparing students for future work in modern data science?"

More specifically, do our undergraduate courses provide the data-related skills that are increasingly needed? The same question could be asked of undergraduate training in econometrics.

Horton et al. itemize five things which, in their opinion, deserve more attention in this context:

1. "Thinking creatively, but constructively, about data. This “data tidying” includes the ability to move data not only between different file formats, but also into different shapes. There are elements of data-storage design (e.g., normal forms) that students need to learn, along with an understanding about how data should be arranged based on how it will likely be used.
2. Facility with data sets of varying sizes, and some understanding of scalability issues when working with data. This includes an elementary understanding of basic computer architecture (e.g., memory vs. hard disk space), and the ability to query a relational database management system (RDBMS).
3. Statistical computing skills in a command-driven environment (e.g., R, Python, or Julia). Coding skills (in any language) are highly valued and increasingly necessary. They provide freedom from the un-reproducible point-and-click application paradigm.
4. Experience wrestling with large, messy, complex, challenging data sets, for which there is no obvious goal or specially curated statistical method (see What’s in a Word). While perhaps sub-optimal for teaching specific statistical methods, these data are more similar to what analysts actually see in the wild.
5. An ethos of reproducibility. This is a major challenge for science in general, and we have the comparatively easy task of simply reproducing computations and analysis."

I concur entirely.
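
To make the first two items concrete, here is a minimal sketch (not taken from the Horton et al. article) of what "moving data into different shapes" and "querying an RDBMS" might look like in a command-driven environment. It uses only Python's standard library. The table name `growth`, the helper `wide_to_long`, and the numbers themselves are made up purely for illustration.

```python
import sqlite3

# Hypothetical "wide" data: one row per country, one column per year.
# The values are invented for illustration only.
wide_rows = [
    {"country": "Canada", "2013": 1.5, "2014": 2.5},
    {"country": "Mexico", "2013": 1.0, "2014": 4.0},
]


def wide_to_long(rows, id_key):
    """Reshape wide records into long ("tidy") form: one observation per row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                # Each non-identifier column name is a year label.
                long_rows.append((row[id_key], int(key), value))
    return long_rows


long_rows = wide_to_long(wide_rows, "country")

# Load the tidy rows into an in-memory SQLite database and query it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE growth (country TEXT, year INTEGER, rate REAL)")
conn.executemany("INSERT INTO growth VALUES (?, ?, ?)", long_rows)

avg_by_country = conn.execute(
    "SELECT country, AVG(rate) FROM growth GROUP BY country ORDER BY country"
).fetchall()
conn.close()

print(avg_by_country)
```

Because every step is written down as code, the whole exercise can be rerun from scratch, which is the "ethos of reproducibility" in the fifth item.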

The development of "computing skills in a command-driven environment" in introductory economic statistics courses is actually something that's been on my mind of late. I've always preferred having those students work with a user-friendly package so that they can focus their attention and energy on the statistical content of what they're doing. Coding has been something to get to at a later stage.

However, I'm becoming increasingly convinced that coding skills should be developed right from the outset. Yes, this is going to be very challenging for a lot of students. But the payoff can be very high.

Food for thought.

© 2015, David E. Giles


  1. As a UVic economics grad who went on to graduate school, I always felt that there should be more of a focus on programming in undergrad econ. I agree that there is an accessibility issue, and it is tough to learn coding in one-hour lab blocks, but I think the department would be serving its students quite well by making a programming course mandatory for the B.Sc. program, or maybe the Honours program. This would be helpful not only for prospective graduate students, but also for those who want to boost their resume to work in industry.

  2. Agree with the post and had been thinking along the same lines. Is there a textbook that follows this approach?
    However, teaching coding is a large obstacle for US undergraduates. I think coding had better be left for graduate school.

    1. Not aware of a suitable text, especially for ECON students. Very challenging with Canadian undergrads as well.

  3. We should be teaching kids to code as early as 4th grade. Math and Science teachers should have responsibility for this and, quite frankly, we could start with $35 (US) calculators if not every student could afford an R prompt, which can be loaded on most smart phones with ARM processors by this point. By the time they are in high school, R or Matlab, or Scilab or Gretl or whatever should be consoles they are used to, along with mathematics software like Microsoft Math, Octave, Julia, whatever is useful. 80 years ago John von Neumann realized the field of math, statistics, and prediction was forever changed by electronic computation. Somehow, our educational systems take a long time to catch up...