Tuesday, August 19, 2014

David Mimno on "Data Carpentry"

There's a post on David Mimno's blog today titled "Data Carpentry".

I like it a lot, because it emphasises just how much effort, time and creativity can be required to get our data in order before we can get on with the fun stuff - estimating models, testing hypotheses, making forecasts, and so on. I know that this was something that I didn't fully appreciate when I was starting my career. And when I did get the message, I found it rather irksome!

However, the message isn't going to change, so we just have to live with it, and accept the realities of working with "real" data.

In his post, David explains why he doesn't like the oft-used term "data cleaning" (which makes us sound like "data janitors"), and why he prefers the term "data carpentry". Certainly, the latter has more constructive overtones.

As he says:
"To me these imply that there is some kind of pure or clean data buried in a thin layer of non-clean data, and that one need only hose the dataset off to reveal the hard porcelain underneath the muck. In reality, the process is more like deciding how to cut into a piece of material, or how much to plane down a surface. It’s not that there’s any real distinction between good and bad, it’s more that some parts are softer or knottier than others. Judgement is critical.
The scale of data work is more like woodworking, as well. Sometimes you may have a whole tree in front of you, and only need a single board. There’s nothing wrong with the rest of it, you just don’t need it right now."
A nice post, and a very nice "take" on a crucial part of the work that we do.

© 2014, David E. Giles
