Data Janitor or Archeologist

Steve Lohr, writing in The New York Times on 18 August 2014, summarized a key challenge in working with data: “The ‘Janitor Work’ for Data Scientists is a Limit to Insights.”

While the article heads right into the world of big data* in paragraph two, what Lohr describes applies to many other data sets that wouldn’t make any supercomputer hiccup these days.

Lohr’s topic is data quality, hence the janitor metaphor. Data are never clean—despite our use of computers with precise operating characteristics, many records are contaminated with junk. We need to separate junk from good stuff before moving on to effective analysis.

A current project with a couple of colleagues involves a small data set, only a few hundred thousand records. We’ll spend more than half our budget getting the data table into the right format for summary and insight: clarifying the meaning of the variables and the content of the records.

I think of data cleaning as part of my job in projects, not a step to skip over, though certainly a step to make more efficient. We act more like data detectives or data archeologists than janitors; the more we can understand about the source and nature of the apparent junk and its relationship to the supposed clean stuff, the better job we’ll do this time and the next time we use similar data.
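One way to act on the detective idea is to flag suspect records with a reason instead of deleting them, then count the reasons by source. The sketch below is only an illustration under assumptions of mine: the records, the field names (`source`, `age`, `visit_date`), and the two rules are hypothetical and do not come from the project described above.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"id": 1, "source": "clinic_a", "age": 34, "visit_date": "2014-03-02"},
    {"id": 2, "source": "clinic_a", "age": -1, "visit_date": "2014-03-05"},
    {"id": 3, "source": "clinic_b", "age": 51, "visit_date": ""},
    {"id": 4, "source": "clinic_b", "age": 212, "visit_date": ""},
]


def flag_record(rec):
    """Return the reasons a record looks like junk (empty list if it looks clean)."""
    reasons = []
    if not (0 <= rec["age"] <= 120):  # assumed plausible range for age
        reasons.append("age_out_of_range")
    if not rec["visit_date"]:  # empty string treated as missing
        reasons.append("missing_visit_date")
    return reasons


# Keep every record, annotated, instead of dropping the suspect ones.
flagged = [dict(rec, reasons=flag_record(rec)) for rec in records]

# Tally junk reasons by source to see where the problems come from.
junk_by_source = Counter(
    (rec["source"], reason) for rec in flagged for reason in rec["reasons"]
)

# Records with no reasons attached are the "supposed clean stuff".
clean = [rec for rec in flagged if not rec["reasons"]]

for (source, reason), count in sorted(junk_by_source.items()):
    print(source, reason, count)
```

In this toy data the missing dates all come from one source, which is the kind of clue that points back to how the data were collected and helps the next time similar data arrive.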

Other posts discussing the New York Times article: Getting Genetics Done and the Revolutions blog.

*About big data: There is a lot of debate about what precisely big data are. The main thing is that the definition is a moving target, and big is relative to the tools you use, an insight provided by Jeff Leek in his online data science MOOC “The Data Scientist’s Toolbox.”

So if you are bamboozled by how to make meaning from a data source or sources, the data are big to you. You can expand your tool kit until it can tackle the data, or find someone for whom the data are not that big.
