## Random Sampling and the p-chart: Related but Distinct Applications of the Binomial Distribution

In the previous post, I discussed a problem in applying control charts to per cent data. One type of control chart, the p-chart, uses the binomial probability model to define the upper and lower control limits.

We need to distinguish a related but different use of the binomial probability model, which arises from random sampling:

Suppose a clinic draws a simple random sample of size n (say, n = 100) from all patients served in September, and the sample size is small relative to the total number of patients served. We ask each of those 100 patients a question about how well staff listened; we want to know the per cent of patients who answer “Always” to our question.

We can use the binomial probability model to characterize the per cent of all patients in September who would have responded “Always” to the listening question, if we were able to ask every patient and get a reply.

Simple random sampling starts with a frame that lists each of the patients served in September. We apply a specific random procedure to select the 100 patients from the frame.

Here’s an example of a simple random sampling procedure:

(1) amend the frame of patients served in September by numbering the patients from 1 to N;

(2) go to www.random.org and generate 100 distinct random integers between 1 and N (no repeats, so that no patient is selected twice);

(3) select the patients from the frame according to the list of integers in step (2);

(4) ask the patients in step (3) the listening question.
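Steps (1) to (3) can also be sketched in a few lines of Python instead of using a web service. This is only an illustration: the frame size `N = 2500`, the seed, and the function name `simple_random_sample` are assumptions made for the example, not part of the clinic scenario.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Select n distinct entries from the frame; every subset of size n is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no patient is picked twice.
    return rng.sample(frame, n)

# Hypothetical frame: patients numbered 1 to N (N = 2500 is an assumed value).
N = 2500
frame = list(range(1, N + 1))

# The 100 patient numbers to contact with the listening question.
sample_ids = simple_random_sample(frame, n=100, seed=2016)
print(sorted(sample_ids)[:10])
```

Fixing the seed makes the selection reproducible, which helps if someone later needs to audit how the sample was drawn.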

In practice, there will be issues that affect the sampling procedure, such as patients in the sample who refuse to respond or who can’t be located. Let’s set those aside for now, even though we should recognize that such issues often turn out to be more important than the math details we’re discussing.

(A short, clear article on the effects of sampling bias, applied to political polling, appeared in The New York Times on 5 October 2016.)

To illustrate the binomial calculation: if 60 patients in the random sample answer “Always”, our simple estimate of the per cent of all patients served in September who would answer “Always” is 60%.

We can estimate two- or three-sigma limits for the simple estimate. We use the standard deviation formula of the binomial distribution to get a value for σ-hat, our estimate of σ.

In our example, p-hat = 0.60 and n = 100, so σ-hat ≈ 0.05.

If we use three-sigma limits, we can make an interval: (0.60 – 3 x 0.05, 0.60 + 3 x 0.05) or (0.45, 0.75).
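Written out, the arithmetic behind these numbers uses the usual standard deviation of a binomial proportion, with $\hat{p}$ and $\hat{\sigma}$ standing for p-hat and σ-hat:

```latex
\hat{\sigma} = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
             = \sqrt{\frac{0.60 \times 0.40}{100}}
             = \sqrt{0.0024} \approx 0.049 \approx 0.05

\hat{p} \pm 3\hat{\sigma} \approx 0.60 \pm 3 \times 0.05 = (0.45,\ 0.75)
```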

The interpretation of this interval is a little tricky, as the interval depends on numbers generated from the random sample. We have to imagine repeating the sampling procedure many times, each time calculating the interval in the same way. In that series of intervals, about 99% or more of the intervals will contain the actual per cent of September patients who would respond “Always” to our question (the normal approximation to the binomial gives roughly 99.7% for three-sigma limits).

So, when we look at the interval (0.45, 0.75), it would be surprising for that interval to miss the actual per cent from the entire September population; missing the actual value would happen only about 1% of the time or less.
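The "imagine repeating the sampling procedure many times" idea can be tried directly with a small simulation. The sketch below assumes a true population proportion of 0.60 and treats each answer as an independent draw, which matches the binomial model when the sample is small relative to the frame; the function names and the number of repetitions are choices made for this example.

```python
import math
import random

def three_sigma_interval(successes, n):
    """Three-sigma interval for a proportion, using the binomial standard deviation."""
    p_hat = successes / n
    sigma_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 3 * sigma_hat, p_hat + 3 * sigma_hat

def coverage(true_p, n, repetitions, seed=0):
    """Fraction of repeated samples whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # One simulated sample: count how many of n patients answer "Always".
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = three_sigma_interval(successes, n)
        if low <= true_p <= high:
            hits += 1
    return hits / repetitions

# true_p = 0.60 is an assumed value for the whole September population.
print(coverage(true_p=0.60, n=100, repetitions=10_000))
```

The printed value is the share of simulated intervals that captured the assumed true proportion; changing `true_p` or `n` shows how that share moves when the proportion is near 0 or 1 or the sample is small.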

**The Size of the Random Sample Relative to the Number of Items in the Frame Has to Be “Small” to Use the Binomial Formula**

The assumption that the sample size is small relative to the total number of patients served in September allows us to say that the probability of selecting a patient who would answer “Always” does not change appreciably from one selection to the next, even though we sample without replacement.

If the sample size is larger than about 10% of the population, the estimate of variability derived from the binomial distribution will over-estimate the sampling variation.
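To see the size of the over-estimate, one can compare the binomial standard deviation with the standard deviation for sampling without replacement (the hypergeometric case), which multiplies the binomial variance by the finite population correction (N - n)/(N - 1). The sketch below uses the example values p = 0.60 and n = 100; the population sizes N are assumed values chosen for illustration.

```python
import math

def binomial_sd(p, n):
    """Standard deviation of a sample proportion under the binomial model."""
    return math.sqrt(p * (1 - p) / n)

def hypergeometric_sd(p, n, N):
    """Standard deviation of a sample proportion when sampling without replacement."""
    fpc = (N - n) / (N - 1)  # finite population correction
    return math.sqrt(p * (1 - p) / n * fpc)

p, n = 0.60, 100
for N in (10_000, 1_000, 500, 200):  # assumed frame sizes
    ratio = hypergeometric_sd(p, n, N) / binomial_sd(p, n)
    print(f"N={N:>6}  sample fraction={n / N:.0%}  ratio={ratio:.3f}")
```

When the sample is 10% of the frame (N = 1,000 here), the ratio is about 0.95, so the binomial figure is roughly 5% too large; the gap widens quickly as the sample fraction grows.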

See this blog post that includes a simulator of the binomial and alternative (hypergeometric) sampling distributions.

**Connection to the p-chart**

If you have a time series of per cents of patients who answer “Always” to the listening question, two conditions will allow you to construct a valid p-chart:

(1) the sample size must be small relative to the frame size;

(2) each member of the series must be generated by simple random sampling as described above.

In terms of the four assumptions described in the previous post, these two conditions assure assumptions (3) and (4), respectively.
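For readers who want to see the p-chart calculation itself, here is a minimal sketch of the conventional three-sigma limits for a constant sample size. The monthly counts are made-up numbers used only to exercise the function, and `p_chart_limits` is a name invented for this example.

```python
import math

def p_chart_limits(counts, n):
    """Return (lower limit, center line, upper limit) for a p-chart.

    counts: number of "Always" answers in each period
    n: sample size, assumed to be the same in every period
    """
    p_bar = sum(counts) / (n * len(counts))  # overall proportion = center line
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    # Proportions cannot fall outside [0, 1], so clip the limits.
    lcl = max(0.0, p_bar - 3 * sigma)
    ucl = min(1.0, p_bar + 3 * sigma)
    return lcl, p_bar, ucl

# Made-up counts of "Always" answers for six monthly samples of 100 patients.
always_counts = [60, 57, 63, 58, 62, 61]
lcl, center, ucl = p_chart_limits(always_counts, n=100)
print(f"LCL={lcl:.3f}  center={center:.3f}  UCL={ucl:.3f}")
```

The limits are only meaningful when the conditions above hold; if the monthly samples are not simple random samples, the binomial formula may understate or overstate the real month-to-month variation.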

**Note on Enumerative versus Analytic Studies**

W.E. Deming distinguished between two types of studies that benefit from statistical methods.

Enumerative studies focus on characterizing a particular situation, for example, the properties of patients seen by an organization in one month. We are concerned with counting or assessing attributes of those patients. We are not concerned with the system of causes that generated the pattern of attributes exhibited by the population.

The simple random sample application, which also uses the binomial probability distribution to characterize the proportion of patients answering “Always” to a question, is an example of an enumerative study.

On the other hand, analytic studies purposely focus on a system of causes; calculations of a specific instance help the analyst gain insight into that system. Typically, analytic studies inquire about the behavior of a system of causes over time. A key question in analytic studies is whether the system of causes remains essentially the same over time; if so, analysts may make a prediction about future performance, with the prediction derived from study of past performance.

In our discussion of p-charts in the previous post, the study of patient experience over time is an example of an analytic study.

**Reference**

W.E. Deming (1975), “On Probability as a Basis For Action”, *The American Statistician*, 29, 4, pp 146-152, https://deming.org/media/pdf/145.pdf