Reminder from an early generation of big data

[Image: the Hawthorne Works, 1925]

High-volume manufacturing in the early 20th century provided a rich context for developing data and causal methods. Walter Shewhart, the inventor of the method and philosophy of statistical process control, thought deeply about causality and about how to use data to guide action.

As Shewhart considered the control of quality of manufactured product in the first quarter of the twentieth century, he realized that no amount of data sufficed to enable prediction in his setting without additional considerations.

Faced with the large numbers of parts and components used to build the equipment and network of the Bell telephone system, Shewhart identified a state of statistical control as the foundation for prediction.  Fundamentally, statistical control of a production process requires the system of causes embodied in the process to be independent of time, ‘essentially constant.’

In standard applications, statistical control is demonstrated by an appropriate control chart that shows no evidence of ‘assignable’ causes.

What does Shewhart’s industrial setting have to do with 21st-century data analysis?

Today, analysts with access to ‘big’ data know to split their data into random chunks at the start of their work to develop relationships and predictions.  After using data in a training chunk to develop a model that summarizes important relationships, analysts check model performance on other chunks. This approach protects analysts from being fooled by specific features of the data set used to establish the model.
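The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `split_into_chunks`, the 60/20/20 proportions, and the fixed seed are all assumptions made for the example.

```python
import random


def split_into_chunks(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle records and split them into chunks sized by `fractions`.

    The name, the default fractions, and the seed are illustrative
    assumptions, not part of any standard.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)

    n = len(shuffled)
    chunks, start = [], 0
    for i, frac in enumerate(fractions):
        # The last chunk takes whatever remains so no record is dropped.
        end = n if i == len(fractions) - 1 else start + round(frac * n)
        chunks.append(shuffled[start:end])
        start = end
    return chunks


# Hypothetical data set of 100 record ids.
train, validate, test = split_into_chunks(range(100))
```

The model is built on `train` only; `validate` and `test` are held back to check its performance. Note that a random split guards against overfitting to one data set, but every chunk still comes from the same cause system.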

To the extent that the cause system that generated the big data set is essentially constant, models carefully crafted from that data set can yield useful predictions of future performance. In Shewhart’s terms, the cause system needs to exhibit a state of control as a condition for prediction.

“A phenomenon will be said to be controlled when, through the use of past experience, we can predict, at least within limits, how the phenomenon may be expected to vary in the future.  Here it is understood that prediction within limits means that we can state, at least approximately, the probability that the observed phenomenon will fall within the given limits.” (W.A. Shewhart, Economic Control of Quality of Manufactured Product, New York: D. Van Nostrand Company, 1931; republished 1981 by the American Society for Quality Control, p. 6).

The basic question remains:  How will we know if the cause system is essentially constant?

Ultimately, the demonstration of control requires the comparison of prediction with actual performance.  In many cases, a control chart of the differences between prediction and actual values will enable assessment of a state of control.
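One way to sketch such a check is an individuals (XmR) chart applied to the differences between actual and predicted values. The post does not say which chart to use, so the chart type is an assumption here, as are the function names and the data, which are made up for illustration. Limits are set from a baseline period and later differences are compared against them.

```python
from statistics import mean


def individuals_chart_limits(values):
    """Return (lower, center, upper) limits for an individuals (XmR) chart.

    Uses the conventional constant 2.66 (3 / 1.128) applied to the
    average moving range of consecutive points.
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    half_width = 2.66 * mean(moving_ranges)
    return center - half_width, center, center + half_width


def points_outside_limits(values, lower, upper):
    """Return the indices of values that fall outside the limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical differences (actual minus predicted) from a baseline period.
baseline = [0.1, -0.2, 0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.0]
lower, center, upper = individuals_chart_limits(baseline)

# Hypothetical differences observed after the model went into use.
new_differences = [0.1, -0.1, 2.3]
signals = points_outside_limits(new_differences, lower, upper)
```

With this made-up data, the third new difference falls outside the limits, so `signals` flags it as evidence that the cause system may have changed. A fuller treatment would also apply run rules, which this sketch omits.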

No matter the level of sophistication of the modeling or the size of the data used to generate predictions, Shewhart’s insight still holds.   Our predictions depend on the constancy of the cause system that underlies the data used to generate those predictions.

Note on Big Data

The bigness of ‘big data’ is relative to the computational tools an analyst has available to process data values, an insight I first learned from Dr. Jeff Leek in an online data science course in 2014. Data are ‘big’ when they require special tools for manipulation and study relative to your current practice: you seek out new tools because the number and nature of records overwhelm your usual data tools.

The manufacturing data in Shewhart’s world certainly constituted big data for his time: millions of parts flowed year after year through the Hawthorne Works in Cicero, Illinois, the main Bell System manufacturing plant, operated by Bell’s Western Electric Company. At peak employment, 45,000 employees assembled a vast range of telephone and appliance parts and products. Shewhart and Western Electric quality engineers regularly analyzed thousands of records with mechanical calculators and hand tabulations. Shewhart’s invention of informative (rational) sampling enabled insight into the strength of metal and wood raw materials, the dimensions of metal parts, and the electrical properties of relays and switches.

The Hawthorne Works as of 1925 is shown at the top of this post. Photo source: Western Electric Company, Western Electric Company Photograph Album, 1925. Public domain. https://commons.wikimedia.org/w/index.php?curid=37704076
