Popcorn & Lung Disease, Chance and Rossman (2018)
In May 2000, eight people who had worked at the same microwave-popcorn production plant reported to the Missouri Department of Health with fixed obstructive lung disease. These workers had become ill between 1993 and 2000 while employed at the plant. On the basis of these cases, in November 2000 researchers began conducting medical examinations and environmental surveys of workers employed at the plant to assess their occupational exposures to certain compounds (Kreiss et al., 2002). As part of the study, current employees at the plant underwent spirometric testing. This measures FVC – forced vital capacity – the volume of air that can be maximally, forcefully exhaled. On this test, 31 employees had abnormal results, including 21 with airway obstruction. A total of 116 employees were tested.
Researchers were curious as to whether the risk of lung disease was elevated among microwave-popcorn workers. The plant itself was broken into several areas, some of which were separate from the popcorn production area. Air and dust samples in each job area were measured to determine the exposure to diacetyl, a marker of organic chemical exposure. Employees were classified into two groups: a “low-exposure group” and a “high-exposure group,” determined by how long an employee spent at different jobs within the plant and the average exposure in that job area. Of the 21 participants with airway obstruction, 6 were from the low-exposure group and 15 were from the high-exposure group. Both exposure groups contained 58 total employees.
How can we tell if popcorn production is related to lung disease? Consider High / Low exposure:
|               | airway obstruction | no airway obstruction | total |
|---------------|--------------------|-----------------------|-------|
| low exposure  | 6                  | 52                    | 58    |
| high exposure | 15                 | 43                    | 58    |
| total         | 21                 | 95                    | 116   |
Is 21 a lot of people? Can we compare 6 vs. 15 directly? What should we look at? Proportions (always a number between 0 and 1). Look at your data, graphically and numerically. Segmented bar graph (mosaic plot):
Is there a difference in the two groups? Look at the difference in proportions or risk:
\[\begin{eqnarray*}
6/58 = 0.103 & 15/58=0.2586 & \Delta = 0.156\\
p_1 = 0.65 & p_2 = 0.494 & \Delta = 0.156\\
p_1 = 0.001 & p_2 = 0.157 & \Delta = 0.156\\
\end{eqnarray*}\]
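As a quick check of the first line above, the two risks and their difference can be computed directly from the 2×2 table counts. A minimal Python sketch:

```python
# Risks of airway obstruction in the popcorn-plant data
# (counts taken from the 2x2 table above).
low_obstructed, n_low = 6, 58
high_obstructed, n_high = 15, 58

p_low = low_obstructed / n_low      # 6/58, risk in the low-exposure group
p_high = high_obstructed / n_high   # 15/58, risk in the high-exposure group
diff = p_high - p_low               # difference in proportions

print(round(p_low, 3), round(p_high, 4), round(diff, 3))
```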
Differences in Proportions
It turns out that the sampling distribution of the difference in the sample proportions (of success) across two independent groups can be modeled by the normal distribution if we have reasonably large sample sizes (CLT).
To ensure the accuracy of the test, checking whether \(np\) and \(n(1-p)\) are bigger than 5 in both samples is usually adequate. A more precise check is that \(n_s \hat{p}_c\) and \(n_s(1-\hat{p}_c)\) are both greater than 5, where \(n_s\) is the smaller of the two sample sizes and \(\hat{p}_c\) is the sample proportion when the two samples are combined into one.
Note: \[\begin{eqnarray*}
\hat{p}_1 - \hat{p}_2 \sim N\Bigg(p_1 - p_2, \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\Bigg)
\end{eqnarray*}\]
When testing independence, we assume that \(p_1=p_2,\) so we use the pooled estimate of the proportion to calculate the SE: \[\begin{eqnarray*}
SE(\hat{p}_1 - \hat{p}_2) = \sqrt{ \hat{p}_c(1-\hat{p}_c) \bigg(\frac{1}{n_1} + \frac{1}{n_2}\bigg)}
\end{eqnarray*}\]
So, when testing, the appropriate test statistic is: \[\begin{eqnarray*}
Z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{ \sqrt{ \hat{p}_c(1-\hat{p}_c) (\frac{1}{n_1} + \frac{1}{n_2})}}
\end{eqnarray*}\]
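The test statistic above can be computed for the popcorn data with only the standard library; a minimal sketch (the normal CDF is built from `math.erf`):

```python
from math import sqrt, erf

# Two-proportion z-test with pooled SE, applied to the popcorn counts.
x1, n1 = 15, 58   # high-exposure: number with airway obstruction, group size
x2, n2 = 6, 58    # low-exposure
p1_hat, p2_hat = x1 / n1, x2 / n2
p_c = (x1 + x2) / (n1 + n2)                      # pooled proportion, 21/116

se = sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))   # pooled SE under H0: p1 = p2
z = (p1_hat - p2_hat) / se                       # z ≈ 2.17

# Two-sided p-value via the standard normal CDF, Phi(z) = (1 + erf(z/sqrt 2))/2.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```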
Confidence interval for \(p_1 - p_2\)
We can’t pool our estimate for the SE, but everything else stays the same…
\[\begin{eqnarray*}
SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\end{eqnarray*}\]
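A sketch of the corresponding interval for the popcorn data, using the unpooled SE above (95% level and \(z_{0.975} \approx 1.96\) assumed for illustration):

```python
from math import sqrt

# 95% CI for p1 - p2 with the unpooled SE (popcorn data).
x1, n1 = 15, 58   # high-exposure group
x2, n2 = 6, 58    # low-exposure group
p1, p2 = x1 / n1, x2 / n2

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled SE
z_crit = 1.96                                       # z_{0.975}
lo = (p1 - p2) - z_crit * se
hi = (p1 - p2) + z_crit * se
print(f"95% CI for p1 - p2: ({lo:.3f}, {hi:.3f})")
```

Since the interval excludes 0, it agrees with the test's conclusion that the two risks differ.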
The main idea here is to determine whether two categorical variables are independent. That is, does knowledge of the value of one variable tell us something about the probability of the other variable (e.g., gender and pregnancy)? We’re going to talk about two different ways to approach this problem.
Relative Risk
The relative risk (RR) is the ratio of risks for the two groups. We say, “The risk of success is RR times higher for those in group 1 compared to those in group 2.”
\[\begin{eqnarray*}
\mbox{relative risk} &=& \frac{\mbox{risk group 1}}{\mbox{risk group 2}}\\
&=& \frac{\mbox{proportion of successes in group 1}}{\mbox{proportion of successes in group 2}}\\
\mbox{RR} &=& \frac{p_1}{p_2}\\
\widehat{\mbox{RR}} &=& \frac{\hat{p}_1}{\hat{p}_2}
\end{eqnarray*}\]
\(\widehat{RR}\) in the popcorn example is \(\frac{15/58}{6/58} = 2.5.\) We say, “The risk of airway obstruction is 2.5 times higher for those in the high-exposure group compared to those in the low-exposure group.” What about
- sample size?
- baseline risk?
Confidence Interval for RR
Due to some theory that we won’t cover, we use the fact that:
\[SE(\ln (\widehat{RR})) \approx \sqrt{\frac{(1 - \hat{p}_1)}{n_1 \hat{p}_1} + \frac{(1-\hat{p}_2)}{n_2 \hat{p}_2}}\]
And more theory we won’t cover that tells us:
\[
\ln(\widehat{RR}) \stackrel{\mbox{approx}}{\sim} N\Bigg( \ln(RR), \sqrt{\frac{(1 - \hat{p}_1)}{n_1 \hat{p}_1} + \frac{(1-\hat{p}_2)}{n_2 \hat{p}_2}} \Bigg)
\]
A \((1-\alpha)100\%\) CI for the \(\ln(RR)\) is: \[\begin{eqnarray*}
\ln(\widehat{RR}) \pm z_{1-\alpha/2} SE(\ln(\widehat{RR}))
\end{eqnarray*}\]
Which gives a \((1-\alpha)100\%\) CI for the \(RR\): \[\begin{eqnarray*}
(e^{\ln(\widehat{RR}) - z_{1-\alpha/2} SE(\ln(\widehat{RR}))}, e^{\ln(\widehat{RR}) + z_{1-\alpha/2} SE(\ln(\widehat{RR}))})
\end{eqnarray*}\]
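Putting the pieces together for the popcorn example: a sketch that builds the interval on the log scale and exponentiates back (95% level and \(z \approx 1.96\) assumed for illustration):

```python
from math import sqrt, log, exp

# 95% CI for the relative risk in the popcorn example.
x1, n1 = 15, 58   # high-exposure group
x2, n2 = 6, 58    # low-exposure group
p1, p2 = x1 / n1, x2 / n2

rr_hat = p1 / p2                                   # estimated RR = 2.5
se_log = sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
z_crit = 1.96
lo = exp(log(rr_hat) - z_crit * se_log)            # CI is asymmetric around RR
hi = exp(log(rr_hat) + z_crit * se_log)
print(f"RR = {rr_hat:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

Note the back-transformed interval is not symmetric around 2.5, a consequence of building it on the log scale.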
Odds Ratios
A concept related to risk is odds. Odds are often used in horse racing, where “success” is typically defined as losing: if the odds are 3 to 1, we would expect to lose 3/4 of the time. The odds ratio (OR) is the ratio of odds for the two groups. We say, “The odds of success are OR times higher for those in group 1 compared to those in group 2.”
\[\begin{align}
\mbox{odds} &= \frac{\mbox{proportion of successes}}{\mbox{proportion of failures}}\\
&= \frac{\mbox{number of successes}}{\mbox{number of failures}} = \theta\\
\widehat{\mbox{odds}} &= \hat{\theta}\\
\mbox{odds ratio} &= \frac{\mbox{odds group 1}}{\mbox{odds group 2}} \\
\mbox{OR} &= \frac{\theta_1}{\theta_2} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}= \frac{p_1/(1-p_1)}{p_2/(1-p_2)}\\
\widehat{\mbox{OR}} &= \frac{\hat{\theta}_1}{\hat{\theta}_2} = \frac{\hat{p}_1/(1-\hat{p}_1)}{\hat{p}_2/(1-\hat{p}_2)}\\
\end{align}\]
\(\widehat{OR}\) in the popcorn example is \(\frac{15/43}{6/52} = 3.02.\) We say, “The odds of airway obstruction are 3 times higher for those in the high exposure group compared to those in the low exposure group.”
OR is more extreme than RR
Without loss of generality, assume the true \(RR > 1,\) implying \(p_1 / p_2 > 1\) and \(p_1 > p_2.\)
Note the following sequence of consequences:
\[\begin{eqnarray*}
RR = \frac{p_1}{p_2} &>& 1\\
\frac{1 - p_1}{1 - p_2} &<& 1\\
\frac{ 1 / (1 - p_1)}{1 / (1 - p_2)} &>& 1\\
\frac{p_1}{p_2} \cdot \frac{ 1 / (1 - p_1)}{1 / (1 - p_2)} &>& \frac{p_1}{p_2}\\
OR &>& RR
\end{eqnarray*}\]
Other considerations:
- Observational study (who worked in each place?)
- Cross-sectional (only one point in time)
- Healthy worker effect (who stayed home sick?)
- Explanatory variable is one that is a potential explanation for any changes (here, exposure level).
- Response variable is the measured outcome of interest (here, airway obstruction).
Smoking & Lung Cancer, Chance and Rossman (2018)
After World War II, evidence began mounting that there was a link between cigarette smoking and pulmonary carcinoma (lung cancer). In the 1950s, two now classic articles were published on the subject. One of these studies was conducted in the United States by Wynder and Graham (1950). They found records from a large number (684) of patients with proven bronchiogenic carcinoma (a specific form of lung cancer) in hospitals in California, Colorado, Missouri, New Jersey, New York, Ohio, Pennsylvania, and Utah. They personally interviewed 634 of the subjects to identify their smoking habits, occupation, exposure to dust and fumes, alcohol intake, education, and cause of death of parents and siblings. Thirty-three subjects completed mailed questionnaires, and information for the other 17 was obtained from family members or close acquaintances. Of those in the study, the researchers focused on 605 male patients with the same form of lung cancer. Another 1332 hospital patients with similar age and economic distribution (including 780 males) without lung cancer were interviewed by these researchers in St. Louis and by other researchers in Boston, Cleveland, and Hines, Illinois.
The following two-way table replicates the counts for the 605 male patients with the same form of cancer and for the “control-group” of 780 males.
| smoking habit | cigarettes/day | lung cancer patients | control group |
|---------------|----------------|----------------------|---------------|
| none          | \(<\) 1/day    | 8                    | 114           |
| light         | 1-9/day        | 14                   | 90            |
| mod heavy     | 10-15/day      | 61                   | 148           |
| heavy         | 16-20/day      | 213                  | 278           |
| excessive     | 21-34/day      | 187                  | 90            |
| chain         | 35\(+\)/day    | 122                  | 60            |
Given the results of the study, do you think we can generalize from the sample to the population? Explain and make it clear that you know the difference between a sample and a population.
|               | lung cancer | control | total |
|---------------|-------------|---------|-------|
| chain smoking | 122         | 60      | 182   |
| no smoking    | 8           | 114     | 122   |
| total         | 130         | 174     | 304   |
- Causation?
- Case-control study (605 with lung cancer, 780 without… baseline rate?)
| expl = smoking status | expl = lung cancer    |
|-----------------------|-----------------------|
| resp = lung cancer    | resp = smoking status |
- If lung cancer is considered a success and no smoking is baseline:
\[\begin{eqnarray*}
\widehat{RR} &=& \frac{122/182}{8/122} = 10.22\\
\widehat{OR} &=& \frac{122/60}{8/114} = 28.9\\
\end{eqnarray*}\] The risk of lung cancer is 10.22 times higher for chain smokers than for those who don’t smoke.
The odds of lung cancer are 28.9 times higher for chain smokers than for those who don’t smoke.
- If chain smoking is considered a success and healthy is baseline:
\[\begin{eqnarray*}
\widehat{RR} &=& \frac{122/130}{60/174} = 2.7\\
\widehat{OR} &=& \frac{122/8}{60/114} = 28.9\\
\end{eqnarray*}\] The risk of chain smoking is 2.7 times higher for those who have lung cancer than for those who don’t have lung cancer.
The odds of chain smoking are 28.9 times higher for those who have lung cancer than for those who don’t have lung cancer.
Important Take-Away
We want to interpret the results as the probability of lung cancer given smoking status. But in a case-control study those individual proportions and individual odds are not estimable; therefore the relative risk is also not estimable.
BUT because the odds ratio is invariant to which is the response variable and which is the explanatory variable, the odds ratio is estimable!
AND the interpretation for the study will be:
The odds of lung cancer are 28.9 times higher for chain smokers than for those who don’t smoke.
We know the risk of being a light smoker if you have lung cancer but we do not know the risk of lung cancer if you are a light smoker. Let’s say we have a population of 1,000,000 people:
|               | lung cancer | no lung cancer | total     |
|---------------|-------------|----------------|-----------|
| light smoking | 49,000      | 51,000         | 100,000   |
| no smoking    | 1,000       | 899,000        | 900,000   |
| total         | 50,000      | 950,000        | 1,000,000 |
\[\begin{eqnarray*}
P(\mbox{light} | \mbox{lung cancer}) &=& \frac{49,000}{50,000} = 0.98\\
P(\mbox{lung cancer} | \mbox{light}) &=& \frac{49,000}{100,000} = 0.49\\
\end{eqnarray*}\]
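The two conditional probabilities come from different margins of the table, and a two-line computation makes the asymmetry concrete (counts from the hypothetical population above):

```python
# Conditional probabilities from the hypothetical population of 1,000,000.
light_and_cancer = 49_000   # light smokers with lung cancer
cancer_total = 50_000       # column total: everyone with lung cancer
light_total = 100_000       # row total: all light smokers

p_light_given_cancer = light_and_cancer / cancer_total  # P(light | lung cancer)
p_cancer_given_light = light_and_cancer / light_total   # P(lung cancer | light)
print(p_light_given_cancer, p_cancer_given_light)       # 0.98 vs. 0.49
```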
- What is the explanatory variable?
- What is the response variable?
- relative risk?
- odds ratio?
| expl = smoking status | expl = lung cancer    |
|-----------------------|-----------------------|
| resp = lung cancer    | resp = smoking status |
- If lung cancer is considered a success and no smoking is baseline:
\[\begin{eqnarray*}
RR &=& \frac{49/100}{1/900} = 441\\
OR &=& \frac{49/51}{1/899} = 863.75\\
\end{eqnarray*}\]
- If light smoking is considered a success and healthy is baseline:
\[\begin{eqnarray*}
RR &=& \frac{49/50}{51/950} = 18.25\\
OR &=& \frac{49/1}{51/899} = 863.75\\
\end{eqnarray*}\]
OR is the same no matter which variable you choose as explanatory versus response! Though, in general, we still prefer to know baseline odds or baseline risk (which we can’t know with a case-control study).
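The invariance is easy to see numerically: however the 2×2 table is sliced, the OR reduces to the same cross-product ratio \(ad/bc\). A short check on the hypothetical-population counts:

```python
# 2x2 table from the hypothetical population:
#                lung cancer   no lung cancer
# light smoking       49_000           51_000
# no smoking           1_000          899_000
a, b = 49_000, 51_000
c, d = 1_000, 899_000

or_rows = (a / b) / (c / d)   # odds of cancer: light vs. no smoking
or_cols = (a / c) / (b / d)   # odds of light smoking: cancer vs. no cancer
cross = (a * d) / (b * c)     # cross-product form: both reduce to ad/bc

print(round(or_rows, 2), round(or_cols, 2), round(cross, 2))
```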
More on Smoking & Lung Cancer, Chance and Rossman (2018)
Now we have a prospective cohort study (previously we had a retrospective case-control study). Do we now have baseline risk estimates? Yes! But be careful: we still can’t conclude causation, because the study is still observational.
Confidence Interval for OR
Due to some theory that we won’t cover:
\[\begin{eqnarray*}
SE(\ln (\widehat{OR})) &\approx& \sqrt{\frac{1}{n_1 \hat{p}_1 (1-\hat{p}_1)} + \frac{1}{n_2 \hat{p}_2 (1-\hat{p}_2)}}
\end{eqnarray*}\]
And more theory we won’t cover that tells us:
\[
\ln(\widehat{OR}) \stackrel{\mbox{approx}}{\sim} N\Bigg( \ln(OR), \sqrt{\frac{1}{n_1 \hat{p}_1 (1-\hat{p}_1)} + \frac{1}{n_2 \hat{p}_2 (1-\hat{p}_2)}} \Bigg)
\]
Note that your book introduces \(SE(\ln(\widehat{OR}))\) in the context of hypothesis testing where the null, \(H_0: p_1 = p_2,\) is assumed to be true. If the null is true, you’d prefer an estimate for the proportion of success to be based on the entire sample:
\[\begin{eqnarray*}
SE(\ln (\widehat{OR})) &\approx& \sqrt{\frac{1}{n_1 \hat{p} (1-\hat{p})} + \frac{1}{n_2 \hat{p}(1-\hat{p})}}
\end{eqnarray*}\]
So, a \((1-\alpha)100\%\) CI for the \(\ln(OR)\) is: \[\begin{eqnarray*}
\ln(\widehat{OR}) \pm z_{1-\alpha/2} SE(\ln(\widehat{OR}))
\end{eqnarray*}\]
Which gives a \((1-\alpha)100\%\) CI for the \(OR\): \[\begin{eqnarray*}
(e^{\ln(\widehat{OR}) - z_{1-\alpha/2} SE(\ln(\widehat{OR}))}, e^{\ln(\widehat{OR}) + z_{1-\alpha/2} SE(\ln(\widehat{OR}))})
\end{eqnarray*}\]
Back to the example… \(OR = 28.9.\) \[\begin{eqnarray*}
SE(\ln(\widehat{OR})) &=& \sqrt{\frac{1}{182 \cdot 0.67 \cdot (1-0.67)} + \frac{1}{122 \cdot 0.0656 \cdot (1-0.0656)}}\\
&=& 0.398\\
90\% \mbox{ CI for } \ln(OR) && \ln(28.9) \pm 1.645 \cdot 0.398\\
&& 3.366 \pm 1.645 \cdot 0.398\\
&& (2.71, 4.02)\\
90\% \mbox{ CI for } OR && (e^{2.71}, e^{4.02})\\
&& (15.03, 55.70)\\
\end{eqnarray*}\]
We are 90% confident that the true \(\ln(OR)\) is between 2.71 and 4.02. We are 90% confident that the true \(OR\) is between 15.03 and 55.70. That is, the true odds of lung cancer for chain smokers are somewhere between 15.03 and 55.70 times the odds for those who don’t smoke, with 90% confidence.
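The arithmetic in the worked example can be reproduced in a few lines with the standard library (counts from the chain-smoking vs. no-smoking table above):

```python
from math import sqrt, log, exp

# 90% CI for the OR: chain smokers vs. non-smokers.
n1, p1 = 182, 122 / 182   # chain smokers; proportion with lung cancer
n2, p2 = 122, 8 / 122     # non-smokers

or_hat = (p1 / (1 - p1)) / (p2 / (1 - p2))   # ≈ 28.97
se_log = sqrt(1 / (n1 * p1 * (1 - p1)) + 1 / (n2 * p2 * (1 - p2)))
z_crit = 1.645            # z_{0.95} for a 90% interval
lo = exp(log(or_hat) - z_crit * se_log)
hi = exp(log(or_hat) + z_crit * se_log)
print(f"OR = {or_hat:.1f}, SE(ln OR) = {se_log:.3f}, 90% CI: ({lo:.1f}, {hi:.1f})")
```

Using the unrounded proportions (rather than 0.67 and 0.0656) gives endpoints very close to the hand computation.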
Note 1: we use the theory which allows us to understand the sampling distribution for the \(\ln(\widehat{OR}).\) We use the process for creating CIs to transform back to \(OR.\)
Note 2: We do not use the t-distribution here because we are not estimating the population standard deviation.
Note 3: There are not good general guidelines for checking whether the sample sizes are large enough for the normal approximation. Most authorities agree that one can get away with smaller sample sizes here than for the differences of two proportions. If the sample sizes pass the rough check discussed for \(\chi^2,\) they should be large enough to support inferences based on the approximate normality of the log of the estimated odds ratio, too. (Ramsey and Schafer 2012, 541)
For the normal approximation to hold, we need the expected counts in each cell to be at least 5. (Pagano and Gauvreau 2000, 355)
Note 4: If any of the cells are zero, many people will add 0.5 to that cell’s observed value.
Note 5: The OR will always be more extreme than the RR (one more reason to be careful…)
\[\begin{eqnarray*}
\mbox{assume } && \frac{X_1 / n_1}{X_2 / n_2} = RR > 1\\
& & \\
\frac{X_1}{n_1} &=& RR \ \ \frac{X_2}{n_2}\\
\frac{X_1}{n_1 - X_1} &=& RR \ \ \bigg( \frac{n_1}{n_2} \frac{n_2 - X_2}{n_1 - X_1} \bigg) \frac{X_2}{n_2-X_2}\\
OR &=& RR \ \ \bigg(\frac{n_1}{n_2} \bigg) \frac{n_2 - X_2}{n_1 - X_1}\\
&=& RR \ \ \bigg(\frac{1/n_2}{1/n_1} \bigg) \frac{n_2 - X_2}{n_1 - X_1}\\
&=& RR \ \ \frac{1 - X_2/n_2}{1 - X_1/n_1}\\
& > & RR
\end{eqnarray*}\] [\(1 - \frac{X_2}{n_2} > 1 - \frac{X_1}{n_1} \rightarrow \frac{1 - \frac{X_2}{n_2}}{1 - \frac{X_1}{n_1}} > 1\)]
Note 6: \(RR \approx OR\) if the risk is small (the denominator of the OR will be very similar to the denominator of the RR).
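Note 6 is easy to illustrate with assumed small risks: when \(p\) is tiny in both groups, the odds denominators \(1-p\) are nearly 1, so the OR barely exceeds the RR.

```python
# Small-risk illustration of RR ≈ OR (risks chosen for illustration only).
p1, p2 = 0.002, 0.001      # assumed small risks in groups 1 and 2

rr = p1 / p2                                       # 2.0
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))     # just slightly above 2.0
print(f"RR = {rr:.4f}, OR = {odds_ratio:.4f}")
```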