Two weeks ago, Dr Robert Wachter wrote an op-ed piece in The New York Times, “How Measurement Fails Doctors and Teachers” (http://www.nytimes.com/2016/01/17/opinion/sunday/how-measurement-fails-doctors-and-teachers.html).
Ninety-three readers offered comments before the site closed the comment section. Many were first-hand accounts by doctors and teachers of measurement that reduces time for direct interaction with patients or students and, worse, generates data that may be of dubious relevance and quality.
As someone educated in statistics and active in quality improvement applied to healthcare for the past 16 years (and as a former high school math teacher), I find that the article and comments offer a sobering perspective.
I sympathize with Dr Wachter’s assertion that the activities of medicine and teaching are different from the business of consumer products like cars and smartphones, just as experiments in the physical or chemical sciences offer a different basis for generalization than experiments in social settings (blogs here and here and here).
As is often the case in measurement of system performance important to people, we need to find measures that are just “good enough” to tell us whether we are getting closer to our aim.
People like me have to help those doing the work of healing patients and educating students find meaningful measures that yield more benefit than cost. While the benefit-cost trade-off itself is subject to debate and illusory attempts at precision, people of good-will and ordinary intelligence can almost always describe measurement ideas worth considering. Then we measurement folks have the obligation to turn those descriptions into feasible measurements, checking back frequently with the professionals who will use them—applying our own sermons about continuous improvement and driving out waste to the measurement system itself.
(A related objection to the problems of measurement in medicine was published the same week as Wachter’s piece, in the NEJM perspective “Medical Taylorism” by Drs Pamela Hartzband and Jerome Groopman, N Engl J Med 374;2, January 14, 2016, 106-108, http://www.nejm.org/doi/full/10.1056/NEJMp1512402. Hartzband and Groopman go beyond measurement issues, challenging the universal application of Toyota-inspired Lean methods to health care. Spirited rebuttals that describe a people-centered application of scientific management in general, and of the Toyota Production System perspective specifically, may be found in the comments to the NEJM paper on their site, as well as more extensively here, here and here.)
My blogs on the application of randomized experiments (here and here) have pushed me to review more articles that examine the nature of randomized controlled trials (RCTs).
Nancy Cartwright (https://en.wikipedia.org/wiki/Nancy_Cartwright_(philosopher)) has considered the structure of RCTs in several articles. I draw upon two of them in this post:
(1) “What are randomised controlled trials good for?”, Philosophical Studies (2010) 147, 59–70, http://escholarship.org/uc/item/42v4w8k1
(2) “A philosopher’s view of the long road from RCTs to effectiveness”, The Lancet, 377 April 23, 2011 1400-1401, http://www.thelancet.com/pdfs/journals/lancet/PIIS0140-6736%2811%2960563-1.pdf
As Cartwright (2010) states, the logic of RCTs is compelling when we seek to demonstrate whether or not a cause and effect relationship exists:
“The RCT is neat because it allows us to learn causal conclusions without knowing what the possible confounding factors actually are. By definition of an ideal RCT, these are distributed equally in both the treatment and control wing [through the operation of randomization], so that when a difference in probability of the effect between treatment and control wings appears, we can infer that there is an arrangement of confounding factors in which C and E are probabilistically dependent and hence in that arrangement C causes E because no alternative explanation is left.” (p. 64)
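A minimal simulation sketch of this logic, under assumptions of my own: a single hidden confounder, a made-up true effect of 1.0, and a hypothetical function name `simulate_trial`. None of this comes from Cartwright’s paper. The analysis never looks at the confounder, yet random assignment balances it across the two wings, so the simple difference in mean outcomes lands near the true effect.

```python
import random
import statistics


def simulate_trial(n=10_000, true_effect=1.0, seed=1):
    """Simulate one idealized RCT with a hidden confounder (illustrative only)."""
    rng = random.Random(seed)

    # Hidden confounder that strongly drives the outcome.
    confounder = [rng.gauss(0, 1) for _ in range(n)]

    # Random assignment: shuffle the unit indices and treat the first half.
    order = list(range(n))
    rng.shuffle(order)
    treated = set(order[: n // 2])

    # Outcome depends on the confounder, the treatment, and noise.
    outcome = [
        2.0 * confounder[i] + (true_effect if i in treated else 0.0) + rng.gauss(0, 1)
        for i in range(n)
    ]

    t_idx = [i for i in range(n) if i in treated]
    c_idx = [i for i in range(n) if i not in treated]

    # How unequal is the confounder across the two wings?
    imbalance = statistics.mean([confounder[i] for i in t_idx]) - statistics.mean(
        [confounder[i] for i in c_idx]
    )
    # Difference in mean outcome, computed without using the confounder.
    estimate = statistics.mean([outcome[i] for i in t_idx]) - statistics.mean(
        [outcome[i] for i in c_idx]
    )
    return imbalance, estimate


imbalance, estimate = simulate_trial()
print(f"confounder imbalance: {imbalance:.3f}")
print(f"estimated effect:     {estimate:.3f}")
```

The sketch only illustrates the internal logic of one idealized trial; it says nothing about whether the effect would carry over to a different population.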
If we judge the treated and control groups as different on some measured characteristic, what can we conclude?
Cartwright (2011) notes:
“The circumstances [defined by an RCT] are ideal for ensuring ‘the treatment caused the outcome in some members of the study’—i.e., they are ideal for supporting ‘it-works-somewhere’ claims. But they are in no way ideal for other purposes; in particular they provide no better base for extrapolating or generalising than knowledge that the treatment caused the outcome in any other individuals in any other circumstances.” (p. 1401)
This is the nub of the inferential problem--the tension between the internal and external validity of a study. External validity is what we need in improvement applications: will the change we observed in the contrived situation of the RCT work in other circumstances?
(Cartwright also points out that ‘the treatment caused the outcome in some members of the study’ may mask effects in the opposite direction for a relatively small number of treated individuals. That’s another reason to be modest in our inferential leaps from one study.)
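A small arithmetic sketch of that masking, using invented individual effects (in a real trial we never observe each person’s effect directly, so these numbers are purely hypothetical):

```python
import statistics

# Hypothetical individual treatment effects: 90 people helped, 10 harmed.
effects = [2.0] * 90 + [-3.0] * 10

average_effect = statistics.mean(effects)     # what a trial summary reports
harmed = sum(1 for e in effects if e < 0)     # what the average conceals

print(f"average effect: {average_effect}")
print(f"number harmed:  {harmed} of {len(effects)}")
```

The average is comfortably positive even though one person in ten is worse off, which is why a favorable trial summary alone does not tell us the treatment helps everyone.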
Cartwright (2010) outlines several requirements to justify generalization from our study to other situations.
Even if we suppose that individuals enrolled in our RCT are ‘representative’ of the larger population to which we want to apply our change, Cartwright lists three assumptions that need to hold to support a logical bridge from trial to wider application:
As Cartwright (2010) says: “These are heavy demands.” (p. 67). This seems especially true in social or management situations, which are common in quality improvement.
As far as I can tell, we have only two options, which are bound together in practice.
Option (1): we have to develop a provisional theory, a causal explanation that reaches beyond the immediate testing circumstances that we are ready to modify given new evidence.
Option (2): we have to repeat the test of change in additional settings, building up our degree of belief incrementally.
Angus Deaton makes the case for theory clearly in a discussion of experiments to demonstrate effective social programs:
“…[RCTs] even when done without error or contamination, are unlikely to be helpful for policy, or to move beyond the local, unless they tell us something about why the program worked, something to which they are often neither targeted nor well-suited…For an RCT to produce ‘useful knowledge’ beyond its local context, it must illustrate some general tendency, some effect that is the result of mechanism that is likely to apply more broadly.” (“Instruments, Randomization, and Learning about Development”, Journal of Economic Literature, Vol. XLVIII (June 2010), p. 448, https://www.princeton.edu/rpds/papers/Deaton_Instruments_randomization_and_learning_about_development_JEL.pdf )
Cartwright (2011) weaves the two options together:
“For policy and practice we do not need to know ‘it works somewhere’. We need evidence for ‘it-will-work-for-us’ claims: the treatment will produce the desired outcome in our situation as implemented there. How can we get from it-works-somewhere to it-will-work-for-us? Perhaps by simple enumerative induction: swan 1 is white; swan 2 is white…so the next swan will be white. For this we need a large and varied inductive base—lots of swans from lots of places; lots of RCTs from different populations—plus reason to believe the observations are projectable, plus an account of the range across which they project.” (p. 1401)
Cartwright (2011) then reminds us that naïve analogies between experiments in physics and experiments in social settings may lead us to think our treatment effects have more general application than is warranted:
“Electron charge is projectable everywhere—one good experiment is enough to generalise to all electrons; bird colour sometimes is; causality is dicey. Many causal connections depend on intimate, complex interactions among factors present so that no special role for the factor of interest can be prised out and projected to new situations.” (p. 1401).
Knowledge of what works in social settings is hard won, typically requiring many cycles of experience, whether informal or rigorously designed.
Experimenters derive benefit from the challenges in designing a trial—the design process invites careful thought about causal relations and measurement systems.
Randomized trials should explore the limits of our theories. The aim of a randomized trial should be to help us to understand why differences in treatment exist, not just whether differences exist.
More complex designs than the single treatment versus control structure of a basic RCT are attractive. As Parry and Power advised in the BMJ Quality and Safety commentary that kicked off this series of blogs, revisiting Fisher’s work on experimental design seems in order. Randomized block designs that account for variation across experimental units and screening designs that put into play a large number of factors are both tools for our toolkit. These alternatives have more complexity than the simple RCT and consequently can yield more insights, contributing to better theories and stronger degrees of belief.
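As one illustration of the blocking idea, here is a minimal sketch of randomization within blocks. The clinic names, unit labels, and the function `randomize_within_blocks` are all hypothetical, chosen only to show the mechanics: each block (say, a clinic) contributes equally to both arms, so block-to-block variation does not get tangled up with the treatment comparison.

```python
import random


def randomize_within_blocks(units_by_block, seed=0):
    """Assign half of the units in each block to treatment, half to control."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)            # random order within this block only
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit] = "treatment"
        for unit in shuffled[half:]:
            assignment[unit] = "control"
    return assignment


# Hypothetical experimental units grouped by clinic (the blocking factor).
clinics = {
    "clinic_A": ["A1", "A2", "A3", "A4"],
    "clinic_B": ["B1", "B2", "B3", "B4"],
}

assignment = randomize_within_blocks(clinics)
for block, units in clinics.items():
    treated = sorted(u for u in units if assignment[u] == "treatment")
    print(block, "treated:", treated)
```

With a simple, unblocked randomization of these eight units, one clinic could by chance end up with most of the treated units; blocking rules that out by design.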
David Donoho, on the occasion of the John Tukey Centennial workshop last September, summarized “50 years of Data Science”.
The article is well worth reading for anyone interested in future developments in data science and the practice of statistics: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
After reviewing recent developments (and hype) in the intertwined discussion of Big Data and Data Science, Donoho looks back 50 years at seminal contributions by Tukey, Cleveland, Chambers and Breiman.
Along the way, Donoho formulates a description of Greater Data Science (GDS), in contrast to a narrower version (Lesser Data Science) driven by commercial developments and interests currently in the news.
Donoho starts his history with a discussion of Tukey’s 1962 article, “The Future of Data Analysis”, in the Annals of Mathematical Statistics (https://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711). Tukey defined Data Analysis and described how it should be seen as a science, not as a branch of mathematics.
Donoho then jumps ahead several decades to cite 1993 insights from John Chambers, developer of the S statistical language (ancestor to today’s R). In the article “Greater or Lesser Statistics: A Choice for Future Research” (Statistics and Computing, 3:4, pp. 182-184, https://statweb.stanford.edu/~jmc4/papers/greater.ps), Chambers argues for a broader aim of statistical research, “based on an inclusive concept of learning from data.” (from the abstract.)
Donoho next discusses two articles from 2001.
The first, William Cleveland’s “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics” (International Statistical Review, 69, 21-26, https://utexas.instructure.com/files/35465950/download), proposed that heavily mathematical statistical theory should comprise only 20% of an academic preparation in Data Science, a far cry from the structure of any university department in statistics over the past 50 years.
The second 2001 article, Leo Breiman’s “Statistical Modeling: The Two Cultures” (Statistical Science, 16:3, 199-231, http://projecteuclid.org/euclid.ss/1009213726), contrasted two primary goals in working with data: a focus on information (inference and an insistence on specific mathematical models) and a focus on prediction. Breiman claimed that only 2% of academic statisticians spent their time and energy on problems of prediction.
Despite the impressive credentials and insights of Donoho’s quartet of statisticians, the academic field of statistics has not moved very far or fast in the direction they’ve outlined. Academic statistics risks being eclipsed by Lesser Data Science, the Data Science in the popular press and in university Deans’ minds that has less integrity and potential impact than Donoho’s GDS alternative.
Donoho’s historical review and Data Science recommendations reminded me of W. Edwards Deming’s 1975 article, “On Probability as a Basis for Action” (The American Statistician, 29:4, 146-152, https://www.deming.org/media/pdf/145.pdf). Deming also failed to move academic statistics departments very far from their traditional focus.
Deming distinguished between enumerative studies and analytic studies, roughly the same dichotomy flagged by Breiman; Deming cited work by industrial statisticians in the 1940s as progenitors of the distinction.
Enumerative studies are an arena for mathematical modeling for a fixed population, operationalized by a sampling frame. Analytic studies, in contrast, are focused on future performance, the problem of prediction.
Despite advances in machine learning and access to ever-larger data sets, Deming’s distinction remains important: unless the mechanisms (set of causes) of the system under study remain essentially the same in the future, “best” predictive models and compelling insights from a batch of data (large or small) can fail badly as a guide for decisions and actions tomorrow. Shewhart built his control chart theory and tools to assess the evidence that a set of causes is “about the same” over time, which suggests a role for control chart thinking in any new Data Science.
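To make the control chart idea concrete, here is a minimal sketch of one common Shewhart-style chart, the individuals (XmR) chart, which sets limits at the center line plus or minus 2.66 times the average moving range. The data and function names are invented for illustration, and this is only one of several chart constructions. Limits come from a baseline period; later points outside them are evidence that the set of causes has changed.

```python
import statistics


def individuals_chart_limits(baseline):
    """Return (lower, center, upper) limits for an individuals (XmR) chart."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    center = statistics.mean(baseline)
    # Moving ranges: absolute differences between successive points in time order.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = statistics.mean(moving_ranges)
    # 2.66 is the conventional constant for individuals charts (3 / 1.128).
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar


def points_outside(values, lower, upper):
    """Indices of points falling outside the limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements, listed in time order.
baseline = [50, 52, 49, 51, 50, 48, 51, 50, 49, 50]
new_data = [51, 49, 50, 58, 59, 60]

lower, center, upper = individuals_chart_limits(baseline)
print(f"limits: {lower:.2f} to {upper:.2f}, center {center}")
print("signals in new data at positions:", points_outside(new_data, lower, upper))
```

A model fit to the baseline period would be a poor guide for the last three points, and the chart flags exactly those.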
Deming noted:
“It is important to remember that the mean, the variance, the standard error, likelihood, and many other functions of a set of numbers, are symmetric. Interchange of any two observations x_i and x_j leaves unchanged the mean, the variance, and even the distribution itself. Obviously, then, use of variance and elaborate methods of estimation buries the information contained in the order of appearance in the original data, and must therefore be presumed inefficient until cleared.” (p. 149)
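Deming’s point can be checked directly. In the sketch below (made-up numbers and hypothetical function names), the same twenty values appear in two orders: once as a steady upward drift and once interleaved. The mean and variance cannot tell the two apart, while a simple order-sensitive summary, the longest run of consecutive points on one side of the median, separates them immediately.

```python
import math
import statistics


def summarize(values):
    """Mean and sample variance: symmetric functions that ignore order."""
    return statistics.mean(values), statistics.variance(values)


def longest_run_one_side(values):
    """Longest run of consecutive points on the same side of the median."""
    median = statistics.median(values)
    longest = current = 0
    previous_side = 0
    for v in values:
        side = (v > median) - (v < median)   # +1 above, -1 below, 0 on the median
        if side == 0:
            continue                         # skip points exactly on the median
        if side == previous_side:
            current += 1
        else:
            current = 1
            previous_side = side
        longest = max(longest, current)
    return longest


# Hypothetical series that drifts upward in time order.
drifting = [10 + t for t in range(20)]

# The same numbers, reordered by alternating low and high values.
reordered = [v for pair in zip(drifting[:10], drifting[10:]) for v in pair]

print("summaries equal:", summarize(drifting) == summarize(reordered))
print("longest run, drifting: ", longest_run_one_side(drifting))
print("longest run, reordered:", longest_run_one_side(reordered))
```

The run length is just one possible order-aware summary; a plot of the data in time order makes the same point visually.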
Donoho cites an example from the Tukey Centennial that illustrates the issue in a completely modern setting: “Rafael Irizarry gave a convincing example of exploratory data analysis of GWAS [Genome Wide Association Study] data, studying how the data row mean varied with the date on which each row was collected, convince[d] the field of gene expression analysis [has] to face up to some data problems that were crippling their studies.” (footnote 29, p. 23).
Donoho’s quartet worked extensively on real data problems; Deming worked on national-scale data surveys in the mid-20th century at the U.S. Census and in his consulting practice.
Perhaps the ideas of Greater Data Science sketched by Donoho naturally arise when very smart, skilled people interact with challenging data problems!