14  Permutation tests

Permutation tests are a class of statistical hypothesis tests that use computational work to create what we call a sampling distribution. You may be familiar with sampling distributions from your introductory statistics course, in which case you likely ran into the sampling distribution modeled as a t-distribution (for use in a t-test or a t confidence interval). We will not discuss t-tests in this course, but we will set up the structure of a hypothesis test. Before we see the structure, however, let’s do an example.

14.1 Example: Friend or Foe

This example comes from Investigation 1.1: Friend or Foe? in Chance and Rossman (2018). The idea is to use simulation to determine how likely our data would be if nothing interesting were going on.

In a study reported in the November 2007 issue of Nature, researchers investigated whether infants take into account an individual’s actions towards others in evaluating that individual as appealing or aversive, perhaps laying the foundation for social interaction (Hamlin, Wynn, and Bloom 2007). In other words, do children who aren’t even yet talking still form impressions as to someone’s friendliness based on their actions? In one component of the study, 10-month-old infants were shown a “climber” character (a piece of wood with “googly” eyes glued onto it) that could not make it up a hill in two tries. Then the infants were shown two scenarios for the climber’s next try, one where the climber was pushed to the top of the hill by another character (the “helper” toy) and one where the climber was pushed back down the hill by another character (the “hinderer” toy). The infant was alternately shown these two scenarios several times. Then the child was presented with both pieces of wood (the helper and the hinderer characters) and asked to pick one to play with. Videos demonstrating this component of the study can be found at http://campuspress.yale.edu/infantlab/media/.

One important design consideration to keep in mind is that in order to equalize potential influencing factors such as shape, color, and position, the researchers varied the colors and shapes of the wooden characters and even on which side the toys were presented to the infants. The researchers found that 14 of the 16 infants chose the helper over the hinderer.

Always Ask

  • What are the observational units?
    • infants
  • What is the variable? What type of variable?
    • choice of helper or hinderer: categorical
  • What is the statistic?
    • \(\hat{p}\) = proportion of infants who chose helper = 14/16 = 0.875
  • What is the parameter?
    • p = proportion of all infants who might choose helper (not measurable!)

Hypotheses

\(H_0\): Null hypothesis. Babies (or rather, the population of babies under consideration) have no inherent preference for the helper or the hinderer shape.

\(H_A\): Alternative hypothesis. Babies (or rather, the population of babies under consideration) are more likely to prefer the helper shape over the hinderer shape.

The p-value is the probability of our data or more extreme if nothing interesting is going on.

| completely arbitrary cutoff | generally accepted conclusion |
|---|---|
| p-value \(>\) 0.10 | no evidence against the null model |
| 0.05 \(<\) p-value \(<\) 0.10 | moderate evidence against the null model |
| 0.01 \(<\) p-value \(<\) 0.05 | strong evidence against the null model |
| p-value \(<\) 0.01 | very strong evidence against the null model |
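These cutoffs can be encoded as a small helper function (a convenience sketch for this course; the function name `evidence` is ours, not from the text):

```r
# map a p-value to the evidence language in the table above
evidence <- function(p) {
  if (p > 0.10) "no evidence against the null model"
  else if (p > 0.05) "moderate evidence against the null model"
  else if (p > 0.01) "strong evidence against the null model"
  else "very strong evidence against the null model"
}

evidence(0.0022)
# [1] "very strong evidence against the null model"
```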

Computation

# load the tidyverse for the wrangling, iteration, and plotting functions below
library(tidyverse)

# to control the randomness
set.seed(47)

# first create a data frame with the Infant data
Infants <- read.delim("http://www.rossmanchance.com/iscam3/data/InfantData.txt")

# find the observed number of babies who chose the helper
help_obs <- Infants |> 
  summarize(prop_help = mean(choice == "helper")) |> 
  pull()
help_obs
[1] 0.875
# write a function to simulate a set of infants who are 
# equally likely to choose the helper or the hinderer

random_choice <- function(rep, num_babies){
  choice = sample(c("helper", "hinderer"), size = num_babies,
                  replace = TRUE, prob = c(0.5, 0.5))
  return(mean(choice == "helper"))
}

# repeat the function many times
map_dbl(1:10, random_choice, num_babies = 16)
 [1] 0.688 0.375 0.438 0.375 0.500 0.500 0.625 0.438 0.688 0.625
num_exper <- 5000
help_random <- map_dbl(1:num_exper, random_choice, 
                            num_babies = 16)

# the p-value!
sum(help_random >= help_obs) / num_exper
[1] 0.0022
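As a cross-check (not part of the original analysis), the simulated p-value can be compared to the exact binomial tail probability of seeing 14 or more helper choices out of 16 when each infant is equally likely to choose either toy:

```r
# exact P(X >= 14) when X ~ Binomial(16, 0.5); close to the simulated 0.0022
sum(dbinom(14:16, size = 16, prob = 0.5))
# [1] 0.002090454
```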
# visualize null sampling distribution
help_random |> 
  data.frame() |> 
  ggplot(aes(x = help_random)) + 
  geom_histogram() + 
  geom_vline(xintercept = help_obs, color = "red") + 
  labs(x = "proportion of babies who chose the helper",
       title = "sampling distribution when null hypothesis is true",
       subtitle = "that is, no inherent preference for helper or hinderer")

Histogram of sample proportions calculated under the setting where babies have no inherent preference for the helper or the hinderer shape. The red line is at the observed proportion from the Hamlin, Wynn, and Bloom (2007) study.

Logic for what we believe

  1. If we look back to the study, we can tell that the researchers varied color, shape, side, etc. to make sure there was nothing systematic about how the infants chose the block (e.g., if they all watch Blue’s Clues they might love the color blue, so we wouldn’t always want the helper shape to be blue).

The excellent study design rules out outside influence as the reason so many of the infants chose the helper shape.

  2. We ruled out random chance as the mechanism for the larger number of infants who chose the helper shape. (We reject the null hypothesis.)

  3. We conclude that babies are inclined to be helpful. That is, they are more likely to choose the helper than the hinderer. [Note: we don’t have any evidence for why they choose the helper. That is, they might be predisposed. They might be modeling their parents. They might notice that they need a lot of help, etc.]

14.2 Structure of Hypothesis testing

14.2.1 Hypotheses

  • Hypothesis Testing compares data to the expectation of a specific null hypothesis. If the data are unusual, assuming that the null hypothesis is true, then the null hypothesis is rejected.

  • The Null Hypothesis, \(H_0\), is a specific statement about a population made for the purposes of argument. A good null hypothesis is a statement that would be interesting to reject.

  • The Alternative Hypothesis, \(H_A\), is a specific statement about a population that is in the researcher’s interest to demonstrate. Typically, the alternative hypothesis contains all the values of the population that are not included in the null hypothesis.

  • In a two-sided (or two-tailed) test, the alternative hypothesis includes values on both sides of the value specified by the null hypothesis.

  • In a one-sided (or one-tailed) test, the alternative hypothesis includes parameter values on only one side of the value specified by the null hypothesis. \(H_0\) is rejected only if the data depart from it in the direction stated by \(H_A\).

14.2.2 Other pieces of the process

  • A statistic is a numerical measurement we get from the sample, a function of the data. [Also sometimes called an estimate.]

  • A parameter is a numerical measurement of the population. We never know the true value of the parameter.

  • The test statistic is a quantity calculated from the data that is used to evaluate how compatible the data are with the result expected under the null hypothesis.

  • The null distribution is the sampling distribution of outcomes for a test statistic under the assumption that the null hypothesis is true.

  • The p-value is the probability of obtaining the data (or data showing as great or greater difference from the null hypothesis) if the null hypothesis is true. The p-value is a number calculated from the dataset.

Examples of Hypotheses

Identify whether each of the following statements is more appropriate as the null hypothesis or as the alternative hypothesis in a test:

  • The number of hours preschool children spend watching TV affects how they behave with other children when at day care. Alternative

  • Most genetic mutations are deleterious. Alternative

  • A diet of fast foods has no effect on liver function. Null

  • Cigarette smoking influences risk of suicide. Alternative

  • Growth rates of forest trees are unaffected by increases in carbon dioxide levels in the atmosphere. Null

  • The number of hours that grade-school children spend doing homework predicts their future success on standardized tests. Alternative

  • King cheetahs on average run the same speed as standard spotted cheetahs. Null

  • The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those from mothers who do not. Null

  • The mean length of African elephant tusks has changed over the last 100 years. Alternative

  • Caffeine intake during pregnancy affects mean birth weight. Alternative

14.2.3 All together: structure of a hypothesis test

  • decide on a research question (which will determine the test)
  • collect data, specify the variables of interest
  • state the null (and alternative) hypothesis values (often statements about parameters)
    • the null claim is the science we want to reject
    • the alternative claim is the science we want to demonstrate
  • generate a (null) sampling distribution to describe the variability of the statistic that was calculated along the way
  • visualize the distribution of the statistics under the null model
  • calculate a p-value to measure the consistency of the observed statistic with the possible values of the statistic under the null model
  • make a conclusion using words that describe the research setting
Example: randomization test on Gerrymandering

Note that the idea of creating a null distribution can apply to a wide range of possible settings. The key is to swap observations around in a manner consistent with the null hypothesis; “randomizing under the null hypothesis” is what gets the researcher to a conclusion.

Below is a YouTube video describing how permuting (i.e., randomizing) different voting boundaries can produce a null distribution of districts. The problem (as stated) is not possible to describe using mathematical functions, but we can derive a solution using computational approaches. [https://www.youtube.com/watch?v=gRCZR_BbjTo]

Example: randomization test on beer and mosquitoes

A great video of how and why computational statistical methods can be extremely useful. And it’s about beer and mosquitoes! John Rauser from Pinterest gives the keynote address at the Strata + Hadoop World Conference, October 16, 2014. David Smith, Revolution Analytics blog, October 17, 2014. http://blog.revolutionanalytics.com/2014/10/statistics-doesnt-have-to-be-that-hard.html

The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come. (David Smith)

Rauser says that in order to follow a statistical argument that uses simulation, you need three things:

  1. Ability to follow a simple logical argument.
  2. Random number generation.
  3. Iteration.

14.3 Hypotheses

\(H_0\): Null hypothesis. The variables beer/water and number of mosquito bites are independent. They have no relationship, and therefore any observed difference between the number of bites on those who drank beer versus those who drank water is due to chance.

\(H_A\): Alternative hypothesis. The variables beer/water and number of mosquito bites are not independent. Any observed difference between the number of bites on those who drank beer versus those who drank water is not due to chance.

14.3.1 Permutation Tests Algorithm

To evaluate the p-value for a permutation test, estimate the sampling distribution of the test statistic when the null hypothesis is true by resampling in a manner that is consistent with the null hypothesis (the number of resamples is finite but can be large!).

  1. Choose a test statistic

  2. Shuffle the data (force the null hypothesis to be true)

  3. Create a null sampling distribution of the test statistic (under \(H_0\))

  4. Find the observed test statistic on the null sampling distribution and compute the p-value (observed data or more extreme). The p-value can be one or two-sided.
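The four steps above can be sketched on made-up data (a toy example; the variables and group labels are ours, not from any study in this chapter):

```r
set.seed(47)

# toy data: two groups whose means truly differ
group <- rep(c("A", "B"), each = 10)
y <- c(rnorm(10, mean = 0), rnorm(10, mean = 1))

# 1. choose a test statistic: difference in group means
obs_stat <- mean(y[group == "B"]) - mean(y[group == "A"])

# 2. & 3. shuffle the group labels (forcing the null hypothesis to be true)
#         and collect the shuffled statistics into a null distribution
null_stats <- replicate(1000, {
  shuffled <- sample(group)
  mean(y[shuffled == "B"]) - mean(y[shuffled == "A"])
})

# 4. one-sided p-value: proportion of shuffled statistics at least as large
#    as the observed statistic
mean(null_stats >= obs_stat)
```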

Technical Conditions

Permutation tests fall into a broad class of tests called “non-parametric” tests. The label indicates that there are no distributional conditions required of the data (i.e., no condition that the data come from a normal or binomial distribution). However, a “non-parametric” test does not mean that there are no conditions on the data, simply that there are no distributional or parametric conditions. By contrast, parameters are at the heart of almost all parametric tests.

For permutation tests, we are not basing the test on population parameters, so we don’t need to make any claims about them (i.e., that they are the mean of a particular distribution).

  • Permutation The different treatments have the same effect. [Note: exchangeability, same population, etc.] If the null hypothesis is true, the labels assigning groups are interchangeable with respect to the probability distribution.
    • Note that it is our choice of statistic which makes the test more sensitive to some kinds of difference (e.g., difference in mean) than other kinds (e.g., difference in variance).
  • Parametric For example, the different populations have the same mean.

IMPORTANT KEY IDEA the point of technical conditions for parametric or permutation tests is to create a sampling distribution that accurately reflects the null sampling distribution for the statistic of interest (the statistic which captures the relevant research question information).

14.4 Permutation tests in practice

How is the test interpreted given the different types of sampling which are possibly used to collect the data?

  • Random Sample The concept of a p-value usually comes from the idea of taking a sample from a population and comparing it to a sampling distribution (from many many random samples).

  • Random Experiment In the context of a randomized experiment, the p-value represents the observed data compared to “happening by chance.”

    • The interpretation is direct: if there is only a very small chance that the observed statistic would take such an extreme value as a result only of the randomization of cases, then we reject the null hypothesis of no treatment effect. CAUSAL!
  • Observational Study In the context of observational studies the results are less strong, but it is reasonable to conclude that the effect observed in the sample reflects an effect present in the population.

    • In a sample, consider the difference (or ratio) and ask “Is this difference so large it would rarely occur by chance in a particular sample constructed under the null setting?”
    • If the data come from a random sample, then the sample (or results from the sample) are probably consistent with the population [i.e., we can infer the results back to the larger population].

14.4.1 Two sample permutation tests

Statistics Without the Agonizing Pain

Image of John Rauser, who gave a keynote address on permutation tests at the Strata + Hadoop conference in 2014.

John Rauser of Pinterest (now Amazon), speaking at Strata + Hadoop 2014. https://blog.revolutionanalytics.com/2014/10/statistics-doesnt-have-to-be-that-hard.html

Logic of hypothesis tests

  1. Choose a statistic that measures the effect.

  2. Construct the sampling distribution under \(H_0\).

  3. Locate the observed statistic in the null sampling distribution.

  4. p-value is the probability of the observed data or more extreme if the null hypothesis is true

Logic of permutation tests

  1. Choose a test statistic.

  2. Shuffle the data (force the null hypothesis to be true). Using the shuffled statistics, create a null sampling distribution of the test statistic (under \(H_0\)).

  3. Find the observed test statistic on the null sampling distribution.

  4. Compute the p-value (observed data or more extreme). The p-value can be one or two-sided.

Applet for two sample permutation tests

High School & Beyond survey

Data: 200 randomly selected observations from the High School and Beyond survey, conducted on high school seniors by the National Center for Education Statistics.

Research Question: in the population, do private school kids have a higher math score on average?

\[H_0: \mu_{private} = \mu_{public}\] \[H_A: \mu_{private} > \mu_{public}\]

\(\mu\) is the average math score in the population.

# A tibble: 200 × 11
     id gender race  ses    schtyp prog        read write  math science socst
  <int> <chr>  <chr> <fct>  <fct>  <fct>      <int> <int> <int>   <int> <int>
1    70 male   white low    public general       57    52    41      47    57
2   121 female white middle public vocational    68    59    53      63    61
3    86 male   white high   public general       44    33    54      58    31
4   141 male   white high   public vocational    63    44    47      53    56
5   172 male   white middle public academic      47    52    57      53    61
6   113 male   white middle public academic      44    52    51      63    61
# ℹ 194 more rows

Summary of the variables

hsb2 |> 
  group_by(schtyp) |> 
  summarize(ave_math = mean(math),
            med_math = median(math))
# A tibble: 2 × 3
  schtyp  ave_math med_math
  <fct>      <dbl>    <dbl>
1 public      52.2     52  
2 private     54.8     53.5

Visualize the relationship of interest

hsb2 |> 
  ggplot(aes(x = schtyp, y = math)) + 
  geom_boxplot()

Calculate the observed statistic(s)

For fun, we are calculating both the difference in averages as well as the difference in medians. That is, we have two different observed summary statistics to work with.

hsb2 |> 
  group_by(schtyp) |> 
  summarize(ave_math = mean(math),
            med_math = median(math))
# A tibble: 2 × 3
  schtyp  ave_math med_math
  <fct>      <dbl>    <dbl>
1 public      52.2     52  
2 private     54.8     53.5
hsb2 |> 
  group_by(schtyp) |> 
  summarize(ave_math = mean(math),
            med_math = median(math)) |> 
  summarize(ave_diff = diff(ave_math),
            med_diff = diff(med_math))
# A tibble: 1 × 2
  ave_diff med_diff
     <dbl>    <dbl>
1     2.51      1.5

Generate a null sampling distribution.

perm_data <- function(rep, data){
  data |> 
    select(schtyp, math) |> 
    mutate(math_perm = sample(math, replace = FALSE)) |> 
    group_by(schtyp) |> 
    summarize(obs_ave = mean(math),
              obs_med = median(math),
              perm_ave = mean(math_perm),
              perm_med = median(math_perm)) |> 
    summarize(obs_ave_diff = diff(obs_ave),
              obs_med_diff = diff(obs_med),
              perm_ave_diff = diff(perm_ave),
              perm_med_diff = diff(perm_med),
              rep = rep)
}

map(1:10, perm_data, data = hsb2) |> 
  list_rbind()
# A tibble: 10 × 5
  obs_ave_diff obs_med_diff perm_ave_diff perm_med_diff   rep
         <dbl>        <dbl>         <dbl>         <dbl> <int>
1         2.51          1.5         2.62            2.5     1
2         2.51          1.5         0.757          -1       2
3         2.51          1.5         1.65            2       3
4         2.51          1.5        -0.805          -0.5     4
5         2.51          1.5         1.80            2.5     5
6         2.51          1.5         2.92            3       6
# ℹ 4 more rows

Visualize the null sampling distribution (average)

set.seed(47)
perm_stats <- 
  map(1:500, perm_data, data = hsb2) |> 
  list_rbind() 

perm_stats |> 
  ggplot(aes(x = perm_ave_diff)) + 
  geom_histogram() + 
  geom_vline(aes(xintercept = obs_ave_diff), color = "red")

Visualize the null sampling distribution (median)

perm_stats |> 
  ggplot(aes(x = perm_med_diff)) + 
  geom_histogram() + 
  geom_vline(aes(xintercept = obs_med_diff), color = "red")

p-value

perm_stats |> 
  summarize(p_val_ave = mean(perm_ave_diff > obs_ave_diff),
            p_val_med = mean(perm_med_diff > obs_med_diff))
# A tibble: 1 × 2
  p_val_ave p_val_med
      <dbl>     <dbl>
1     0.086      0.27

Conclusion

From these data, the observed differences seem to be consistent with the distribution of differences in the null sampling distribution.

There is no evidence to reject the null hypothesis.

We cannot claim that in the population the average math score for private school kids is larger than the average math score for public school kids (p-value = 0.086).

We cannot claim that in the population the median math score for private school kids is larger than the median math score for public school kids (p-value = 0.27).

Two-sided hypothesis test

\(H_0: \mu_{private} = \mu_{public}\) and \(H_A: \mu_{private} \ne \mu_{public}\)

Two-sided p-value

perm_stats |> 
    summarize(p_val_ave = 
                mean(perm_ave_diff > obs_ave_diff | 
                       perm_ave_diff < -obs_ave_diff),
              p_val_med = 
              mean(perm_med_diff > obs_med_diff | 
                     perm_med_diff < -obs_med_diff))
# A tibble: 1 × 2
  p_val_ave p_val_med
      <dbl>     <dbl>
1     0.154     0.534

Two-sided conclusion

From these data, the observed differences seem to be consistent with the distribution of differences in the null sampling distribution.

There is no evidence to reject the null hypothesis.

We cannot claim that there is a difference in average math scores in the population (p-value = 0.154).

We cannot claim that there is a difference in median math scores in the population (p-value = 0.534).

14.4.2 Stratified two-sample permutation test

MacNell Teaching Evaluations

Boring et al. (2016) reanalyze data from MacNell et al. (2014). Students were randomized to 4 online sections of a course. In two sections, the instructors swapped identities. Was the instructor who identified as female rated lower on average? (https://www.math.upenn.edu/~pemantle/active-papers/Evals/stark2016.pdf)

[Figures omitted: slides from Kraj (2017); Mengel, Sauermann, and Zölitz (2019); and MacNell, Driscoll, and Hunt (2015) on gender bias in teaching evaluations.]
14.4.2.0.1 R code
# The data come from `permuter` which is no longer kept up as a package
macnell <- readr::read_csv("https://raw.githubusercontent.com/statlab/permuter/master/data-raw/macnell.csv")
#library(permuter)
#data(macnell)
library(ggridges)
macnell |> 
  mutate(TAID = ifelse(taidgender==1, "male", "female")) |>
  mutate(TAGend = ifelse(tagender==1, "male", "female")) |>
ggplot(aes(y=TAGend, x=overall, 
           group = interaction(TAGend, TAID), 
           fill=TAID)) +
  geom_point(position=position_jitterdodge(jitter.height=0.3, jitter.width = 0, dodge.width = 0.4), 
             aes(color = TAID)) +
  stat_summary(fun="mean", geom="crossbar", 
               size=.3, width = 1,
               aes(color = TAID),
               position=position_dodge(width=0.4)) +
  stat_summary(fun="mean", geom="point", shape = "X",
               size=5, aes(color = TAID),
               position=position_dodge(width=0.4)) +
  coord_flip() +
  labs(title = "Overall teaching effectiveness score",
       x = "",
       y = "TA gender",
       color = "TA identifier",
       fill = "TA identifier")

14.4.2.1 Analysis goal

We want to know whether the average score differs by perceived gender.

\[H_0: \mu_{ID.Female} = \mu_{ID.Male}\]

> Although the hypotheses are written in terms of means, under the null hypothesis of the permutation test not only are the means of the population distributions the same, but so are the variances and all other aspects of the distributions across perceived gender.

14.4.2.2 MacNell Data without permutation

macnell |>
  select(overall, tagender, taidgender) |> head(15)
# A tibble: 15 × 3
  overall tagender taidgender
    <dbl>    <dbl>      <dbl>
1       4        0          1
2       4        0          1
3       5        0          1
4       5        0          1
5       5        0          1
6       4        0          1
# ℹ 9 more rows

14.4.2.3 Permuting MacNell data

Conceptually, there are two levels of randomization:

  1. \(N_m\) students are randomly assigned to the male instructor and \(N_f\) are assigned to the female instructor.

  2. Of the \(N_j\) assigned to instructor \(j\), \(N_{jm}\) are told that the instructor is male, and \(N_{jf}\) are told that the instructor is female for \(j=m,f\).

macnell |>
  group_by(tagender, taidgender) |>
  summarize(n())
# A tibble: 4 × 3
# Groups:   tagender [2]
  tagender taidgender `n()`
     <dbl>      <dbl> <int>
1        0          0    11
2        0          1    12
3        1          0    13
4        1          1    11

Stratified two-sample test:

  • For each instructor, permute perceived gender assignments.
  • Use difference in mean ratings for female-identified vs male-identified instructors.
macnell |> 
  group_by(tagender) |>
  mutate(permTAID = sample(taidgender, replace=FALSE)) |>
  select(overall, tagender, taidgender, permTAID) 
# A tibble: 47 × 4
# Groups:   tagender [2]
  overall tagender taidgender permTAID
    <dbl>    <dbl>      <dbl>    <dbl>
1       4        0          1        1
2       4        0          1        0
3       5        0          1        1
4       5        0          1        0
5       5        0          1        0
6       4        0          1        0
# ℹ 41 more rows
macnell |> 
  group_by(tagender) |>
  mutate(permTAID = sample(taidgender, replace=FALSE)) |>
  ungroup(tagender) |>
  group_by(permTAID) |>
  summarize(pmeans = mean(overall, na.rm=TRUE)) |>
  summarize(diff(pmeans))
# A tibble: 1 × 1
  `diff(pmeans)`
           <dbl>
1          0.468
diff_means_func <- function(.x){
  macnell |> 
    group_by(tagender) |>
    mutate(permTAID = sample(taidgender, replace = FALSE)) |>
    ungroup() |>
    group_by(permTAID) |>
    summarize(pmeans = mean(overall, na.rm = TRUE)) |>
    summarize(diff_mean = diff(pmeans))
}

map(1:5, diff_means_func) |> 
  list_rbind()
# A tibble: 5 × 1
  diff_mean
      <dbl>
1  -0.188  
2   0.180  
3   0.00216
4  -0.184  
5   0.00216

14.4.2.4 Observed vs. Permuted statistic

# observed
macnell |> 
  group_by(taidgender) |>
  summarize(pmeans = mean(overall, na.rm=TRUE)) |>
  summarize(diff_mean = diff(pmeans))
# A tibble: 1 × 1
  diff_mean
      <dbl>
1     0.474
# permuted
set.seed(47)
reps = 1000
perm_diff_means <- map(1:reps, diff_means_func) |> list_rbind()

14.4.2.5 Permutation sampling distribution

# permutation p-value
perm_diff_means |>
  summarize(p_val = 
      sum(diff_mean > 0.474) / 
      reps)
# A tibble: 1 × 1
  p_val
  <dbl>
1 0.048

14.4.2.6 Actual MacNell results

14.4.2.7 Other Test Statistics

The example in class used a modification of the ANOVA F-statistic to compare the observed data with the permuted data test statistics. Depending on the data and question, the permuted test statistic can take on any of a variety of forms.

| Data | Hypothesis Question | Statistic |
|---|---|---|
| 2 categorical variables | diff in prop | \(\hat{p}_1 - \hat{p}_2\) or \(\chi^2\) |
| | ratio of prop | \(\hat{p}_1 / \hat{p}_2\) |
| 1 numeric, 1 binary | diff in means | \(\overline{X}_1 - \overline{X}_2\) |
| | ratio of means | \(\overline{X}_1 / \overline{X}_2\) |
| | diff in medians | \(\mbox{median}_1 - \mbox{median}_2\) |
| | ratio of medians | \(\mbox{median}_1 / \mbox{median}_2\) |
| | diff in SD | \(s_1 - s_2\) |
| | diff in var | \(s^2_1 - s^2_2\) |
| | ratio of SD or var | \(s_1 / s_2\) |
| 1 numeric, k groups | diff in means | \(\sum n_i (\overline{X}_i - \overline{X})^2\) or F stat |
| paired or repeated measures | (permute within row) | \(\overline{X}_1 - \overline{X}_2\) |
| regression | correlation | least squares slope |
| time series | no serial correlation | lag 1 autocorrelation |

Depending on the data, hypotheses, and original data collection structure (e.g., random sampling vs random allocation), the choice of statistic for the permutation test will vary.
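To illustrate that the shuffling scheme stays the same while only the statistic changes, here is a sketch using the ratio of standard deviations on made-up data (a toy example of ours, not from the text):

```r
set.seed(47)

# toy data: two groups that differ in spread rather than center
group <- rep(c("A", "B"), each = 15)
y <- c(rnorm(15, sd = 1), rnorm(15, sd = 2))

# statistic: ratio of the group standard deviations
sd_ratio <- function(labels) sd(y[labels == "A"]) / sd(y[labels == "B"])

obs <- sd_ratio(group)
null_stats <- replicate(2000, sd_ratio(sample(group)))

# two-sided p-value on the log scale, since a ratio is asymmetric around 1
mean(abs(log(null_stats)) >= abs(log(obs)))
```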

14.5 Reflection questions

14.6 Ethics considerations

Chance, Beth, and Allan Rossman. 2018. Investigating Statistics, Concepts, Applications, and Methods. 3rd ed. http://www.rossmanchance.com/iscam3/.
Hamlin, J. Kiley, Karen Wynn, and Paul Bloom. 2007. “Social Evaluation by Preverbal Infants.” Nature 450: 557–59.
Kraj, Tori. 2017. “Research Suggests Students Are Biased Against Female Lecturers.” The Economist.
MacNell, Lillian, Adam Driscoll, and Andrea Hunt. 2015. “What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching.” Innovative Higher Education 40: 291–303.
Mengel, Friederike, Jan Sauermann, and Ulf Zölitz. 2019. “Gender Bias in Teaching Evaluations.” Journal of the European Economic Association 17: 535–66.