3 t-tests vs SLR
We are going to build on a very basic model of the following form:
data = deterministic model + random error
planned variability: your experimental conditions, hopefully represented by an interesting deterministic model.
random error: natural variability due to individuals.
systematic error: error that is not contained within the model. It can arise from poor sampling or poor experimental conditions.
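To make the decomposition concrete, here is a minimal simulation sketch. The group means (10 and 12), the error standard deviation (3), and the sample size are hypothetical values chosen only for illustration; they are not taken from any data set in these notes.

```r
# Sketch: data = deterministic model + random error (all numbers are hypothetical)
set.seed(47)                                   # arbitrary seed, for reproducibility
n     <- 100                                   # observations per condition
group <- rep(c(0, 1), each = n)                # planned variability: two conditions
mu    <- 10 + 2 * group                        # deterministic model: means 10 and 12
eps   <- rnorm(2 * n, mean = 0, sd = 3)        # random error: individual variability
y     <- mu + eps                              # observed data

# Sample means per condition; they estimate the deterministic part of the model
tapply(y, group, mean)
```

Subtracting `eps` from `y` recovers the deterministic part exactly, which is only possible here because the data are simulated; with real data the errors are never observed.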
Surgery Timing
The study, “Operation Timing and 30-Day Mortality After Elective General Surgery”, tested the hypotheses that the risk of 30-day mortality associated with elective general surgery: 1) increases from morning to evening throughout the routine workday; 2) increases from Monday to Friday through the workweek; and 3) is more frequent in July and August than during other months of the year. As a presumed negative control, the investigators also evaluated mortality as a function of the phase of the moon. Secondarily, they evaluated these hypotheses as they pertain to a composite in-hospital morbidity endpoint.
The related data set contains 32,001 elective general surgical patients. Age, gender, race, BMI, several comorbidities, several surgical risk indices, the surgical timing predictors (hour, day of week, month, moon phase) and the outcomes (30-day mortality and in-hospital complication) are provided. The dataset is cleaned and complete (no missing data except for BMI). There are no outliers or data problems. The data are from (Sessler et al. 2011).
Note that in the example, mortality rates are compared for patients electing to have surgery in July and August vs. other months of the year. We’d like to compare the average age of the participants in the July and August group with the average age of those in the rest of the year. Even if the mortality difference is significant, we can’t conclude causation (of the treatment) because it was an observational study. However, the more similar the groups are based on clinical variables (e.g., age), the more likely any differences in mortality are due to timing (i.e., the treatment). Let’s start by asking: how different are the groups based on clinical variables? Here we assess age.
age | gender | race | hour | dow | month | complication | bmi | asa_status | baseline_cancer | baseline_cvd | baseline_dementia | baseline_diabetes | baseline_digestive | baseline_osteoart | baseline_psych | baseline_pulmonary | baseline_charlson | mortality_rsi | complication_rsi | ccsmort30rate | ccscomplicationrate | moonphase | mort30 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
67.8 | M | Caucasian | 9.03 | Mon | Nov | No | 28.0 | I-II | No | Yes | No | No | Yes | No | No | No | 0 | -0.63 | -0.26 | 0.004 | 0.072 | Full Moon | No |
39.5 | F | Caucasian | 18.48 | Wed | Sep | No | 37.9 | I-II | No | Yes | No | No | No | No | No | No | 0 | -0.63 | -0.26 | 0.004 | 0.072 | New Moon | No |
56.5 | F | Caucasian | 7.88 | Fri | Aug | No | 19.6 | I-II | No | No | No | No | No | No | No | No | 0 | -0.49 | 0.00 | 0.004 | 0.072 | Full Moon | No |
71.0 | M | Caucasian | 8.80 | Wed | Jun | No | 32.2 | III | No | Yes | No | No | No | No | No | No | 0 | -1.38 | -1.15 | 0.004 | 0.072 | Last Quarter | No |
56.3 | M | African American | 12.20 | Thu | Aug | No | 24.3 | I-II | Yes | No | No | No | No | No | No | No | 0 | 0.00 | 0.00 | 0.004 | 0.072 | Last Quarter | No |
57.7 | F | Caucasian | 7.67 | Thu | Dec | No | 40.3 | I-II | No | Yes | No | No | No | No | Yes | No | 0 | -0.77 | -0.84 | 0.004 | 0.072 | First Quarter | No |
surgery |>
  dplyr::mutate(summer = dplyr::case_when(
    month %in% c("Jul", "Aug") ~ TRUE,
    !(month %in% c("Jul", "Aug")) ~ FALSE)) |>
  dplyr::group_by(summer) |>
  dplyr::summarize(age_mean = mean(age, na.rm = TRUE),
                   age_sd = sd(age, na.rm = TRUE),
                   age_n = sum(!is.na(age)))
#> # A tibble: 2 × 4
#> summer age_mean age_sd age_n
#> <lgl> <dbl> <dbl> <int>
#> 1 FALSE 57.6 15.0 26498
#> 2 TRUE 57.8 15.3 5501
3.1 t-test
(Section 2.1 in Kuiper and Sklar (2013).)
A t-test is a test of means. For the surgery timing data, the groups would ideally have similar age distributions. Why? What are the advantages and disadvantages of running a retrospective cohort study?
The two-sample t-test starts with the assumption that the population means of the two groups are equal, \(H_0: \mu_1 = \mu_2.\) The sample means \(\overline{y}_1\) and \(\overline{y}_2\) will always be different. How different must the \(\overline{y}\) values be in order to reject the null hypothesis?
Model 1:
\[\begin{align} y_{1j} &= \mu_{1} + \epsilon_{1j} \ \ \ \ j=1, 2, \ldots, n_1\\ y_{2j} &= \mu_{2} + \epsilon_{2j} \ \ \ \ j=1, 2, \ldots, n_2\\ \epsilon_{ij} &\sim N(0,\sigma^2)\\ E[Y_i] &= \mu_i \end{align}\]
That is, we are assuming that for each group the true population average is fixed and an individual that is randomly selected will have some amount of random error away from the true population mean. Note that we have assumed that the variances of the two groups are equal. We have also assumed that there is independence between and within the groups.
Note: we will assume the population variances are equal if neither sample variance is more than twice as big as the other.
Example 3.1 Are the mean ages of the July/August patients and the patients from other months statistically different? (Why two-sided?)
\[\begin{align} H_0: \mu_1 = \mu_2\\ H_a: \mu_1 \ne \mu_2 \end{align}\]
\[\begin{align} t &= \frac{(\overline{y}_1 - \overline{y}_2) - 0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\\ s_p &= \sqrt{ \frac{(n_1 - 1)s_1^2 + (n_2-1) s_2^2}{n_1 + n_2 -2}}\\ df &= n_1 + n_2 -2\\ &\\ t &= \frac{(57.62 - 57.84) - 0}{15.04 \sqrt{\frac{1}{26498} + \frac{1}{5501}}}\\ &= -0.99\\ s_p &= \sqrt{ \frac{(26498-1)14.98^2 + (5501-1) 15.34^2}{26498 + 5501 -2}}\\ &= 15.04\\ df &= n_1 + n_2 -2\\ &= 31997\\ \mbox{p-value} &= 2 \cdot pt(-0.99,31997) = 0.322\\ \end{align}\]
The same analysis can be done in R (with and without tidying the output):
d <- surgery |>
  dplyr::mutate(summer = dplyr::case_when(
    month %in% c("Jul", "Aug") ~ TRUE,
    !(month %in% c("Jul", "Aug")) ~ FALSE))
t.test(formula = age ~ summer, var.equal = TRUE, data = d)
#>
#> Two Sample t-test
#>
#> data: age by summer
#> t = -1, df = 31997, p-value = 0.3
#> alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
#> 95 percent confidence interval:
#> -0.662 0.212
#> sample estimates:
#> mean in group FALSE mean in group TRUE
#> 57.6 57.8
t.test(formula = age ~ summer, var.equal = TRUE, data = d) |>
  broom::tidy()
#> # A tibble: 1 × 10
#> estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.225 57.6 57.8 -1.01 0.312 31997 -0.662 0.212
#> # ℹ 2 more variables: method <chr>, alternative <chr>
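The hand calculation in Example 3.1 can also be reproduced from the summary statistics alone, which is a useful check on the arithmetic. The sketch below uses only the rounded means, standard deviations, and sample sizes reported in the summary table, so its results can differ from the `t.test()` output in the second or third decimal place. The variable names are chosen for this illustration.

```r
# Pooled two-sample t-test from (rounded) summary statistics
n1 <- 26498; ybar1 <- 57.62; s1 <- 14.98   # other months
n2 <- 5501;  ybar2 <- 57.84; s2 <- 15.34   # July and August

# Rule of thumb from the notes: neither sample variance more than twice the other
var_ratio <- max(s1, s2)^2 / min(s1, s2)^2
var_ratio < 2

df      <- n1 + n2 - 2
sp      <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled SD
se      <- sp * sqrt(1 / n1 + 1 / n2)                      # SE of ybar1 - ybar2
t_stat  <- (ybar1 - ybar2) / se
p_value <- 2 * pt(-abs(t_stat), df)                        # two-sided p-value

round(c(var_ratio = var_ratio, sp = sp, t = t_stat, p = p_value), 3)
```

Using `-abs(t_stat)` inside `pt()` gives the two-sided p-value regardless of which group is subtracted from which.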
- Look at SD and SEM
- What is the statistic? What is the sampling distribution of the statistic?
- Why do we use the t-distribution?
- Why is the big p-value important? (It’s a good thing!) How do we interpret the p-value?
- What can we conclude?
- applet from (Chance and Rossman 2018): sampling from two populations
- What are the model assumptions? (basically all the assumptions are given in the original linear model: independence between & within groups, random sample, pop values don’t change, additive error, \(\epsilon_{ij} \sim \text{iid } N(0, \sigma^2)\))
Considerations when running a t-test:
- one-sample vs two-sample t-test
- one-sided vs. two-sided hypotheses
- t-test with unequal variance (less powerful, more conservative)
\[\begin{align} t &= \frac{(\overline{y}_1 - \overline{y}_2) - (\mu_1 - \mu_2)}{ \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\\ df &= \min(n_1-1, n_2-1) \end{align}\]
- two dependent (paired) samples – one sample t-test!
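As a sketch of the unequal-variance version, the statistic can be computed from the same rounded summary statistics for the age data. Two choices of degrees of freedom are shown: the conservative \(\min(n_1-1, n_2-1)\) given above, and the Welch-Satterthwaite approximation, which is what `t.test()` uses when `var.equal = FALSE` (its default). The variable names are assumptions made for this illustration.

```r
# Unequal-variance t-test from (rounded) summary statistics
n1 <- 26498; ybar1 <- 57.62; s1 <- 14.98   # other months
n2 <- 5501;  ybar2 <- 57.84; s2 <- 15.34   # July and August

v1   <- s1^2 / n1                          # estimated variance of ybar1
v2   <- s2^2 / n2                          # estimated variance of ybar2
se_w <- sqrt(v1 + v2)                      # unpooled standard error
t_w  <- (ybar1 - ybar2) / se_w             # null value of mu1 - mu2 is 0

df_cons  <- min(n1 - 1, n2 - 1)                                  # conservative df
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))    # Welch-Satterthwaite df

p_cons  <- 2 * pt(-abs(t_w), df_cons)
p_welch <- 2 * pt(-abs(t_w), df_welch)

c(t = t_w, df_cons = df_cons, df_welch = df_welch, p_cons = p_cons, p_welch = p_welch)
```

The Welch degrees of freedom always fall between \(\min(n_1-1, n_2-1)\) and \(n_1 + n_2 - 2\), so the conservative choice can only make the p-value larger. With samples this large the two p-values are practically identical.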
Example 3.2 Assume we have two very small samples: \((y_{11}=3, y_{12} = 9, y_{21} = 5, y_{22}=1, y_{23}=9).\) Find \(\hat{\mu}_1, \hat{\mu}_2, \hat{\epsilon}_{11}, \hat{\epsilon}_{12}, \hat{\epsilon}_{21}, \hat{\epsilon}_{22}, \hat{\epsilon}_{23}, n_1, n_2.\)
By considering the estimates of the parameters, we can see that Model 1 expands to include both the parameter model as well as the statistic model:
Model 1:
\[\begin{align} y_{1j} &= \mu_{1} + \epsilon_{1j} \ \ \ \ j=1, 2, \ldots, n_1\\ y_{2j} &= \mu_{2} + \epsilon_{2j} \ \ \ \ j=1, 2, \ldots, n_2\\ \epsilon_{ij} &\sim N(0,\sigma^2)\\ E[Y_i] &= \mu_i\\ y_{1j} &= \hat{\mu}_{1} + \hat{\epsilon}_{1j} \ \ \ \ j=1, 2, \ldots, n_1\\ &= \overline{y}_{1} + \hat{\epsilon}_{1j} \ \ \ \ j=1, 2, \ldots, n_1\\ y_{2j} &= \hat{\mu}_{2} + \hat{\epsilon}_{2j} \ \ \ \ j=1, 2, \ldots, n_2\\ &= \overline{y}_{2} + \hat{\epsilon}_{2j} \ \ \ \ j=1, 2, \ldots, n_2\\ \end{align}\]
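For the two small samples in Example 3.2, the estimates in the statistic model can be checked with a few lines of R. This is only a sketch; the object names are chosen for the illustration.

```r
# Example 3.2: estimated means and residuals for the two small samples
y1 <- c(3, 9)        # y_11, y_12
y2 <- c(5, 1, 9)     # y_21, y_22, y_23

n1 <- length(y1)
n2 <- length(y2)

mu1_hat <- mean(y1)  # estimate of mu_1
mu2_hat <- mean(y2)  # estimate of mu_2

res1 <- y1 - mu1_hat # residuals (estimated errors) in group 1
res2 <- y2 - mu2_hat # residuals (estimated errors) in group 2

list(n = c(n1, n2), mu_hat = c(mu1_hat, mu2_hat), res1 = res1, res2 = res2)
```

Within each group the residuals sum to zero, which is a consequence of estimating each \(\mu_i\) by the group's sample mean.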
3.1.1 What is an Alternative Hypothesis?
Consider the brief video from the movie Slacker, an early movie by Richard Linklater (director of Boyhood, School of Rock, Before Sunrise, etc.). You can view the video here, starting at 2:22 and ending at 4:30: [https://www.youtube.com/watch?v=b-U_I1DCGEY]
In the video, a rider in the back of a taxi (played by Linklater himself) muses about alternate realities that could have happened as he arrived in Austin on the bus. What if instead of taking a taxi, he had found a ride with a woman at the bus station? He could have taken a different road into a different alternate reality, and in that reality his current reality would be an alternate reality. And so on.
What is the point? Why watch the video? How does it relate to the material from class? What is the relationship to sampling distributions? [Thanks to Ben Baumer at Smith College for the pointer to the specific video.]
3.2 ANOVA
Skip ANOVA in your text (2.4 and part of 2.9 in Kuiper and Sklar (2013)).
3.3 Simple Linear Regression
(Section 2.3 in Kuiper and Sklar (2013).)
Simple Linear Regression is a model (hopefully discussed in introductory statistics) used for describing a linear relationship between two variables. It typically has the form of:
\[\begin{align} y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \ \ \ \ i = 1, 2, \ldots, n\\ \epsilon_i &\sim N(0, \sigma^2)\\ E(Y|x) &= \beta_0 + \beta_1 x \end{align}\]
For this model, the deterministic component \((\beta_0 + \beta_1 x)\) is a linear function of the two parameters, \(\beta_0\) and \(\beta_1,\) and the explanatory variable \(x.\) The random error terms, \(\epsilon_i,\) are assumed to be independent and to follow a normal distribution with mean 0 and variance \(\sigma^2.\)
How can we use this model to describe the two sample means case we discussed on the ages of the patients from the elective surgery data? Consider \(x\) to be a dummy variable that takes on the value 0 if the observation is a control and 1 if the observation is a case. Assume we have \(n_1\) controls and \(n_2\) cases. It turns out that, coded in this way, the regression model and the two-sample t-test model are mathematically equivalent!
(For the color game in the text, the natural way to code is 1 for the color distracter and 0 for the standard game. Why? What does \(\beta_0\) represent? What does \(\beta_1\) represent?)
\[\begin{align} \mu_1 &= \beta_0 + \beta_1 (0) = \beta_0 \\ \mu_2 &= \beta_0 + \beta_1 (1) = \beta_0 + \beta_1\\ \mu_2 - \mu_1 &= \beta_1 \end{align}\]
Why are they the same?
You may remember that to find estimates for \(\beta_0\) and \(\beta_1\) we minimized the sum of the squared error terms (we’ll see that in the next section) and came up with estimates of:
\[\begin{align} b_1 = \hat{\beta}_1 &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i )^2}\\ &= r \frac{s_y}{s_x}\\ b_0 = \hat{\beta}_0 &= \frac{\sum y_i - b_1 \sum x_i}{n}\\ &= \overline{y} - b_1 \overline{x} \end{align}\]
Some simplifying of terms gets us to see that the model estimates (of the deterministic part of the model) are the same whether we use model 1 or model 2.
\[\begin{align} b_1= \hat{\beta}_1 &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i )^2}\\ &= \frac{n \sum_2 y_i - n_2 \sum y_i}{(n n_2-n_2^2)}\\ &= \frac{ n \sum_2 y_i - n_2 (\sum_1 y_i + \sum_2 y_i)}{n_2(n-n_2)}\\ &= \frac{(n_1 + n_2) \sum_2 y_i - n_2 \sum_1 y_i - n_2 \sum_2 y_i}{n_1 n_2}\\ &= \frac{n_1 \sum_2 y_i - n_2 \sum_1 y_i}{n_1 n_2}\\ &= \frac{n_1 n_2 \overline{y}_2 - n_2 n_1 \overline{y}_1}{n_1 n_2}\\ &= \overline{y}_2 - \overline{y}_1\\ b_0 = \hat{\beta}_0 &= \frac{\sum y_i - b_1 \sum x_i}{n}\\ &= \frac{\sum_1 y_i + \sum_2 y_i - b_1 n_2}{n}\\ &= \frac{n_1 \overline{y}_1 + n_2 \overline{y}_2 - n_2 \overline{y}_2 + n_2 \overline{y}_1}{n}\\ &= \frac{(n_1 + n_2) \overline{y}_1 + n_2 \overline{y}_2 - n_2 \overline{y}_2}{n}\\ &= \frac{n \overline{y}_1}{n} = \overline{y}_1 \end{align}\]
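A quick numerical check of the algebra, using the five observations from Example 3.2 with a 0/1 dummy variable (0 for the first sample, 1 for the second). This is a sketch for illustration; the coding of `x` is the assumption that makes \(b_0\) the mean of the first group.

```r
# Least-squares estimates with a dummy variable reduce to group means
y <- c(3, 9, 5, 1, 9)     # Example 3.2 data: group 1 first, then group 2
x <- c(0, 0, 1, 1, 1)     # dummy variable: 0 = group 1, 1 = group 2
n <- length(y)

# Closed-form least-squares estimates
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) - b1 * sum(x)) / n

ybar1 <- mean(y[x == 0])
ybar2 <- mean(y[x == 1])

c(b0 = b0, ybar1 = ybar1)            # intercept equals the group 1 mean
c(b1 = b1, diff = ybar2 - ybar1)     # slope equals the difference in means
coef(lm(y ~ x))                      # lm() gives the same estimates
```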
Model 2:
\[\begin{align} y_{i} &= \beta_0 + \beta_1 x_i + \epsilon_i \ \ \ \ i=1, 2, \ldots, n\\ \epsilon_{i} &\sim N(0,\sigma^2)\\ E[Y_i] &= \beta_0 + \beta_1 x_i\\ \hat{y}_i &= b_0 + b_1 x_i \end{align}\]
That is, we are assuming that for each observation the true population average is fixed and an individual that is randomly selected will have some amount of random error away from the true population mean at their value for the explanatory variable, \(x_i.\) Note that we have assumed that the variance is constant across any level of the explanatory variable. We have also assumed that there is independence across individuals. [Note: there are no assumptions about the distribution of the explanatory variable, \(X.\)]
Note the similarity in running a t.test() and a linear model (lm()):
d <- surgery |>
  dplyr::filter(month %in% c("Jul", "Aug"))
t.test(formula = age ~ month, data = d) |>
  broom::tidy()
#> # A tibble: 1 × 10
#> estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.486 58.1 57.6 1.16 0.247 4954. -0.337 1.31
#> # ℹ 2 more variables: method <chr>, alternative <chr>
lm(formula = age ~ month, data = d) |>
  broom::tidy()
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 58.1 0.272 213. 0
#> 2 monthJul -0.486 0.419 -1.16 0.245
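The two p-values above (0.247 and 0.245) are close but not identical because `t.test()` uses the unequal-variance (Welch) test by default, while `lm()` assumes a common variance. The sketch below uses simulated data (the means, standard deviation, and group sizes are hypothetical) to illustrate that the pooled t-test, requested with `var.equal = TRUE`, matches the regression on a 0/1 dummy variable up to the sign of the statistic.

```r
# Pooled two-sample t-test vs. regression on a dummy variable (simulated data)
set.seed(1)                                      # arbitrary seed
sim <- data.frame(
  x = rep(c(0, 1), times = c(40, 25)),           # dummy variable, unequal group sizes
  y = c(rnorm(40, mean = 58, sd = 15),           # hypothetical group 0
        rnorm(25, mean = 57, sd = 15))           # hypothetical group 1
)

tt  <- t.test(y ~ x, data = sim, var.equal = TRUE)   # pooled t-test
fit <- summary(lm(y ~ x, data = sim))$coefficients   # regression coefficient table

# t.test() computes (group 0 mean) - (group 1 mean); the slope is the reverse,
# so the statistics have opposite signs and the same magnitude
c(t_test = unname(tt$statistic), lm = fit["x", "t value"])
c(t_test = tt$p.value,           lm = fit["x", "Pr(>|t|)"])
```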
- What are the similarities in the t-test vs. SLR models?
  - predicting average
  - assuming independent, constant errors
  - errors follow a normal distribution with zero mean and variance \(\sigma^2\)
- What are the differences in the two models?
  - one subscript versus two (or similarly, two models for the t-test)
  - two samples for the t-test (two variables for the regression… or is that a similarity??)
  - both variables are quantitative in SLR
3.4 Confidence Intervals
(Section 2.11 in Kuiper and Sklar (2013).)
A fantastic applet for visualizing what it means to have 95% confidence: [http://www.rossmanchance.com/applets/2021/confsim/ConfSim.html]
In general, the format of a confidence interval is given below… what is the interpretation? Remember, the interval is for a given parameter and the “coverage” happens in alternative universes with repeated sampling. We’re 95% confident that the interval captures the parameter.
estimate +/- critical value x standard error of the estimate
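The “repeated sampling” interpretation can be sketched with a short simulation: draw many samples from a population with a known mean, build a 95% interval from each, and record how often the interval captures the mean. The population values, sample size, and number of repetitions below are hypothetical choices for illustration.

```r
# Coverage of 95% t-intervals under repeated sampling (hypothetical population)
set.seed(2024)                       # arbitrary seed
mu    <- 57.6                        # true population mean (known only in simulation)
sigma <- 15                          # true population SD
n     <- 50                          # sample size per "universe"
reps  <- 2000                        # number of repeated samples

covered <- replicate(reps, {
  y    <- rnorm(n, mean = mu, sd = sigma)
  half <- qt(0.975, df = n - 1) * sd(y) / sqrt(n)   # critical value x standard error
  (mean(y) - half <= mu) && (mu <= mean(y) + half)  # did this interval capture mu?
})

mean(covered)                        # proportion of intervals that captured mu
```

The proportion should be close to 0.95, though any single interval either contains \(\mu\) or does not.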
Age data: \[\begin{align} 90\% \mbox{ CI for } \mu_1: & \overline{y}_1 \pm t^*_{26498 - 1} \times \hat{\sigma}_{\overline{y}_1}\\ & 57.62 \pm 1.645 \times 14.98/\sqrt{26498}\\ & (57.47, 57.77)\\ 98\% \mbox{ CI for }\mu_1 - \mu_2: & \overline{y}_1 - \overline{y}_2 \pm t^*_{31997} s_p \sqrt{1/n_1 + 1/n_2}\\ & 57.62 - 57.84 \pm 2.33 \times 15.04\cdot \sqrt{\frac{1}{26498} + \frac{1}{5501}}\\ & (-0.739, 0.299) \end{align}\]
We are 98% confident that the true difference in mean ages (other months minus July/August) for all people in the population who get elective surgery is between -0.739 years and 0.299 years. Note that our CI contains zero and so the true difference in parameters might be zero. Therefore, we have no evidence to claim that the July/August group is significantly younger (or significantly older!) than the rest of the patients.
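Both intervals for the age data can be recomputed from the rounded summary statistics, which is a useful way to check the critical values and the arithmetic. This sketch uses `qt()` for the exact t critical values rather than the rounded table values, so the endpoints may differ slightly in the third decimal place; the object names are chosen for the illustration.

```r
# Confidence intervals for the age data from (rounded) summary statistics
n1 <- 26498; ybar1 <- 57.62; s1 <- 14.98   # other months
n2 <- 5501;  ybar2 <- 57.84                # July and August
sp <- 15.04                                # pooled SD from Example 3.1

# 90% CI for mu_1: estimate +/- critical value x standard error
ci_mu1 <- ybar1 + c(-1, 1) * qt(0.95, df = n1 - 1) * s1 / sqrt(n1)

# 98% CI for mu_1 - mu_2 with the pooled SD and n1 + n2 - 2 degrees of freedom
ci_diff <- (ybar1 - ybar2) +
  c(-1, 1) * qt(0.99, df = n1 + n2 - 2) * sp * sqrt(1 / n1 + 1 / n2)

round(ci_mu1, 2)
round(ci_diff, 3)
```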
Note that the CI on pgs 54/55 contains a typo. The correct interval for \(\mu_1 - \mu_2\) for the games data should be:
\[\begin{align} 95\% \mbox{ CI for } \mu_1 - \mu_2: & \overline{y}_1 - \overline{y}_2 \pm t^*_{38} \hat{\sigma}_{\overline{y}_1 - \overline{y}_2}\\ & \overline{y}_1 - \overline{y}_2 \pm t^*_{38} s_p \sqrt{1/n_1 + 1/n_2}\\ & 38.1 - 35.55 \pm 2.02 \times \sqrt{\frac{(19)3.65^2 + (19)3.39^2}{20+20-2}} \sqrt{\frac{1}{20} + \frac{1}{20}}\\ & (0.29 s, 4.81 s) \end{align}\]
3.5 Random Sample vs. Random allocation
Recall what you’ve learned about how good random samples lead to inference about a population. On the other hand, in order to make a causal conclusion, you need a randomized experiment with random allocation of the treatments (impossible to happen in many settings). Random sampling and random allocation are DIFFERENT ideas that should be clear in your mind.
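The distinction can be sketched in a few lines of R. Random sampling decides who ends up in the study; random allocation decides which treatment each study participant receives. The population size, sample size, and treatment labels below are hypothetical.

```r
# Random sampling vs. random allocation (hypothetical ids and treatment labels)
set.seed(10)                                   # arbitrary seed
population <- 1:1000                           # ids for a hypothetical population

# Random sample: choose 20 individuals from the population (supports generalizing)
sampled <- sample(population, size = 20)

# Random allocation: shuffle 10 "A" and 10 "B" labels across the 20 sampled
# individuals (supports causal conclusions about A vs. B)
treatment <- sample(rep(c("A", "B"), each = 10))

head(data.frame(id = sampled, treatment = treatment))
table(treatment)
```

A study can have either feature without the other: the surgery timing data, for example, involve no random allocation, because patients were not assigned to months at random.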
Note: no ANOVA (section 2.4 in Kuiper and Sklar (2013)) or normal probability plots (section 2.8 in Kuiper and Sklar (2013)).