The latest issue of Chance contains a very timely article by Nicholas Horton, Benjamin Baumer, and Hadley Wickham. It's titled "Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics".
Ask yourself: "Is the traditional way that we teach introductory and second-level statistics courses really suited to preparing students for future work in modern data science?"
More specifically, do our undergraduate courses provide the data-related skills that are increasingly needed? The same question could be asked of undergraduate training in econometrics.
Horton et al. list five skills that, in their opinion, deserve more attention in this context:
"Thinking creatively, but constructively, about data. This “data tidying” includes the ability to move data not only between different file formats, but also into different shapes. There are elements of data-storage design (e.g., normal forms) that students need to learn, along with an understanding about how data should be arranged based on how it will likely be used.
Facility with data sets of varying sizes, and some understanding of scalability issues when working with data. This includes an elementary understanding of basic computer architecture (e.g., memory vs. hard disk space), and the ability to query a relational database management system (RDBMS).
Statistical computing skills in a command-driven environment (e.g., R, Python, or Julia). Coding skills (in any language) are highly valued and increasingly necessary. They provide freedom from the un-reproducible point-and-click application paradigm.
Experience wrestling with large, messy, complex, challenging data sets, for which there is no obvious goal or specially curated statistical method (see What’s in a Word). While perhaps sub-optimal for teaching specific statistical methods, these data are more similar to what analysts actually see in the wild.
An ethos of reproducibility. This is a major challenge for science in general, and we have the comparatively easy task of simply reproducing computations and analysis."
I concur entirely.
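To make the first two items a little more concrete, here is a minimal Python sketch of what "moving data into different shapes" and "querying an RDBMS" can look like in a command-driven environment. It is not taken from the Chance article. The data values, the `growth` table, and the `wide_to_long` helper are all hypothetical and purely illustrative, and the in-memory SQLite database is a stand-in for whatever database system a course might actually use.

```python
import sqlite3

# Hypothetical "wide" data: one row per country, one column per year.
# The values are made up for illustration only.
wide = [
    {"country": "A", "2012": 1.5, "2013": 2.0},
    {"country": "B", "2012": 0.5, "2013": 1.0},
]


def wide_to_long(rows, id_col):
    """Reshape wide rows into tidy (id, year, value) records."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                long_rows.append((row[id_col], int(key), value))
    return long_rows


long_rows = wide_to_long(wide, "country")

# Load the tidy records into an in-memory SQLite database (an assumed
# stand-in for a full RDBMS) and summarise them with a SQL query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE growth (country TEXT, year INTEGER, rate REAL)")
conn.executemany("INSERT INTO growth VALUES (?, ?, ?)", long_rows)

avg_by_country = conn.execute(
    "SELECT country, AVG(rate) FROM growth GROUP BY country ORDER BY country"
).fetchall()
conn.close()

print(avg_by_country)
```

Because every step is written down as code, rerunning the script reproduces the same reshaped data and the same summary, which is the kind of reproducibility that a point-and-click workflow makes hard to guarantee.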
The development of "computing skills in a command-driven environment" in introductory economic statistics courses is actually something that's been on my mind of late. I've always preferred having those students work with a user-friendly package so that they can focus their attention and energy on the statistical content of what they're doing. Coding has been something to get to at a later stage.
However, I'm becoming increasingly convinced that coding skills should be developed right from the outset. Yes, this is going to be very challenging for a lot of students. But the payoff can be very high.
Food for thought.