12 Functions + iteration

12.1 Learning goals:

At the end of this section, you should be able to:

Identify the core components of a function definition and explain their role (the function() directive, arguments, argument defaults, function body, return value)
Describe the difference between argument matching by position and by name
Write if-else, if-else if-else statements to conditionally execute code
Write your own function to carry out a repeated task
Replicate your function multiple times using map()

12.2 Functions

12.2.1 Why functions?

Getting really good at writing useful and reusable functions is one of the best ways to increase your expertise in data science. It requires a lot of practice.

If you’ve copied and pasted code 3 or more times, it’s time to write a function. Try to avoid repeating yourself.

Reducing errors: Copy + paste + modify is prone to errors (e.g., forgetting to change a variable name)
Efficiency: If you need to update code, you only need to do it one place. This allows reuse of code within and across projects.
Readability: Encapsulating code within a function with a descriptive name makes code more readable.

12.2.2 An example

Consider the following code (taken from https://r4ds.hadley.nz/iteration). What does it do?

df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(a, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / 
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)

# A tibble: 5 × 4
      a       b     c     d
  <dbl>   <dbl> <dbl> <dbl>
1 0.707 -0.204  0     0.758
2 0.252  0.796  0.229 0.115
3 0.190  0.235  0.655 0.772
4 1      0.0457 0.157 1    
5 0     -0.0266 1     0

You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? (Example from R4DS, and…) When Hadley wrote the code he made an error when copying-and-pasting and forgot to change an a to a b. Preventing exactly this type of mistake is one very good reason to learn how to write functions.

The key to the work above is that we want to repeat a set of code multiple times. The code we want to replicate can be written as:

(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))

where █ represents the part of the code that changes each time the function is run.

12.2.3 Parts of a function

To create a function you need three things:

A name. Here we’ll use rescale01() because this function rescales a vector to lie between 0 and 1.
The arguments. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.
The body. The body is the code that’s repeated across all the calls.

Then you create a function by following the template:

name <- function(arguments) {
  body
}

Which leads to the function:

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:

rescale01(c(-10, 0, 10))

[1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, NA, 5))

[1] 0.00 0.25 0.50   NA 1.00

Then you can rewrite the call to mutate() as:

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)

# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1 0.707 0     0     0.758
2 0.252 1     0.229 0.115
3 0.190 0.439 0.655 0.772
4 1     0.250 0.157 1    
5 0     0.178 1     0

12.2.4 Ordering and arguments

When calling a function, if you don’t name the arguments, R assumes that you passed them in the order defined inside the function.

my_power <- function(x, y){
  return(x^y)
}

my_power(x = 2, y = 3)

[1] 8

my_power(y = 3, x = 2)

[1] 8

my_power(2, 3)

[1] 8

my_power(3, 2)

[1] 9

Argument matching

In general, it is safest to match arguments by name and position for your peace of mind. For functions that you are very familiar with (and know the argument order), it’s okay to just use positional matching.

my_power <- function(x, y){
  return(x^y)
}

What will happen when I run the following code?

my_power(3)

my_power(3)

Error in my_power(3): argument "y" is missing, with no default

Function
result

my_power <- function(x, y = 2){
  return(x^y)
}

What will happen when I run the following code?

my_power(3)

my_power(3)

[1] 9

Function
result

my_power <- function(x, y = 2){
  return(x^y)
}

What will happen when I run the following code?

my_power(2, 3)

my_power(2, 3)

[1] 8

Function
result

my_power <- function(x = 2, y = 3){
  return(x^y)
}

What will happen when I run the following code?

my_power()

my_power()

[1] 8

12.2.6 Returning a value

functions
results

average1 <- function(x, remove_nas) {
    sum(x, na.rm = remove_nas)/sum(!is.na(x))
}

average2 <- function(x, remove_nas) {
    return(sum(x, na.rm = remove_nas)/sum(!is.na(x)))
}

average3 <- function(x, remove_nas = TRUE) {
    return(sum(x, na.rm = remove_nas)/sum(!is.na(x)))
    return(sum(x^2, na.rm = remove_nas)/sum(!is.na(x)))
}

some_data <- c(3, NA, 2, 13, 2, NA, 47)

average1(some_data)

Error in average1(some_data): argument "remove_nas" is missing, with no default

average1(some_data, remove_nas = TRUE)

[1] 13.4

average2(some_data)

Error in average2(some_data): argument "remove_nas" is missing, with no default

average2(some_data, remove_nas = TRUE)

[1] 13.4

average3(some_data)

[1] 13.4

without return(): the function returns the last value which gets computed and isn’t stored as an object (using <-).
with return(): the function will return an object that is explicitly included in the return() call. (Note: if you (accidentally?) have two return() calls, the function will return the object in the first return() call.)

12.3 Control flow

Often inside functions, you will want to execute code conditionally. In a programming language, control structures are parts of the language that allow you to control what code is executed.

12.3.1 `if` and `else` and `ifelse`

By far the most common is the if-else if-else structure.

if (logical_condition) {
    # some code
} else if (other_logical_condition) {
    # some other code
} else {
    # yet more code
}

Note that inside the curly else brackets, { }, you can have additional lines of code computing objects or conditions, or you can return desired objects.
You can include as many } else if { conditions as your problem calls for.

function
result

middle <- function(x) {
    mean_x <- mean(x, na.rm = TRUE)
    median_x <- median(x, na.rm = TRUE)
    seems_skewed <- (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x)
    if (seems_skewed) {
        median_x
    } else {
        mean_x
    }
}

Note that (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x) is a TRUE or FALSE question.

some_data <- c(3, NA, 2, 13, 2, NA, 47)

mean(some_data, na.rm = TRUE)

[1] 13.4

median(some_data, na.rm = TRUE)

[1] 3

middle(some_data)

[1] 3

If the work can be done on one line, often the easiest construction is ifelse().

`ifelse()`

data.frame(value = c(-2:2)) |>
  mutate(abs_value = ifelse(value >=0, value, -value))  # abs val

  value abs_value
1    -2         2
2    -1         1
3     0         0
4     1         1
5     2         2

12.3.2 Functions in the tidyverse

Functions that return the same number of rows as the original data frame are good to use inside mutate() and filter(). For example, you might want to capitalize the first word of every string:

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

first_upper("hello")

[1] "Hello"

Functions that collapse into a single value will work well in the summarize() step of the pipeline. For example, you may want to calculate the coefficient of variation which is the standard deviation divided by the mean.

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(runif(100, min = 0, max = 50))

[1] 0.625

cv(runif(100, min = 0, max = 500))

[1] 0.624

Shorthand lets us get away from pre-defining the function (which will be useful). Use the tilde ~ to indicate that you have a function:

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(~{data.frame(.x + 10)}) |> 
  list_cbind()

  .x...10 .x...10 .x...10
1      12      15      20

Mostly, the tilde will be used for functions we already know but want to modify (if we don’t modify, and it has a simple name, we don’t use the tilde):

library(palmerpenguins)
library(broom)

penguins_split <- split(penguins, penguins$species)
penguins_split |>
  purrr::map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) |>
  purrr::map(tidy) |>  
  list_rbind()

# A tibble: 6 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -2536.     965.       -2.63 9.48e- 3
2 flipper_length_mm     32.8      5.08      6.47 1.34e- 9
3 (Intercept)        -3037.     997.       -3.05 3.33e- 3
4 flipper_length_mm     34.6      5.09      6.79 3.75e- 9
5 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
6 flipper_length_mm     54.6      5.03     10.9  1.33e-19

penguins |>
  group_by(species) |>
  group_map(~lm(body_mass_g ~ flipper_length_mm, data = .x)) |>
  purrr::map(tidy)

[[1]]
# A tibble: 2 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        -2536.     965.       -2.63 0.00948      
2 flipper_length_mm     32.8      5.08      6.47 0.00000000134

[[2]]
# A tibble: 2 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        -3037.     997.       -3.05 0.00333      
2 flipper_length_mm     34.6      5.09      6.79 0.00000000375

[[3]]
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
2 flipper_length_mm     54.6      5.03     10.9  1.33e-19

12.3.3 Functions summary

Functions can be used to avoid repeating code
Arguments allow us specify the inputs when we call a function
If inputs are not named when calling the function, R uses the ordering from the function definition
All arguments must be specified when calling a function
Default arguments can be specified when the function is defined
The input to a function can be a function!

12.4 Iterating functions

There will be times when you will need to iterate a function multiple times.

12.4.1 purrr for functional programming

We will see the R package purrr in greater detail as we go, but for now, let’s get a hint for how it works.

We are going to focus on the map family of functions which will just get us started. Lots of other good purrr functions like pluck() and accumulate() and across() from dplyr.

Much of below is taken from a tutorial by Rebecca Barter.

The map functions are named by the output the produce. For example:

map(.x, .f) is the main mapping function and returns a list
map_df(.x, .f) returns a data frame
map_dbl(.x, .f) returns a numeric (double) vector
map_chr(.x, .f) returns a character vector
map_lgl(.x, .f) returns a logical vector

Note that the first argument is always the data object and the second object is always the function you want to iteratively apply to each element in the input object.

The input to a map function is always either a vector (like a column), a list (which can be non-rectangular), or a dataframe (like a rectangle).

A list is a way to hold things which might be very different in shape:

a_list <- list(a_number = 5,
               a_vector = c("a", "b", "c"),
               a_dataframe = data.frame(a = 1:3, 
                                        b = c("q", "b", "z"), 
                                        c = c("bananas", "are", "so very great")))

print(a_list)

$a_number
[1] 5

$a_vector
[1] "a" "b" "c"

$a_dataframe
  a b             c
1 1 q       bananas
2 2 b           are
3 3 z so very great

Consider the following function:

add_ten <- function(x) {
  return(x + 10)
  }

We can map() the add_ten() function across a vector. Note that the output is a list (the default).

library(purrr)
purrr::map(.x = c(2, 5, 10),
    .f = add_ten)

[[1]]
[1] 12

[[2]]
[1] 15

[[3]]
[1] 20

What if we use a different type of input? The default behavior is to still return a list!

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten)

$a
[1] 12

$b
[1] 15

$c
[1] 20

What if we want a different type of output? If the function outputs a data frame, we can combine the values in the list using a row-bind, list_rbind().

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten) |> 
  list_rbind()

Error in `list_rbind()`:
! Each element of `x` must be either a data frame or `NULL`.
ℹ Elements 1, 2, and 3 are not.

Darn! We get an error because the output of the add_ten() function is a scalar, not a data frame. In order to use list_rbind() we need to edit the add_ten() function.

add_ten_df <- function(x) {
  return(data.frame(x + 10))
}

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten_df) |> 
  list_rbind()   # output bound by rows

c(2, 5, 10) |>
  purrr::map(add_ten_df) |> 
  list_rbind()   # output bound by rows

data.frame(a = 2, b = 5, c = 10) |>
  purrr::map(add_ten_df) |> 
  list_cbind()   # output bound by columns

  x...10 x...10 x...10
1     12     15     20

c(2, 5, 10) |>
  purrr::map(add_ten_df) |> 
  list_cbind()   # output bound by columns

  x...1 x...2 x...3
1    12    15    20

12.5 Simulating models

12.5.1 Goals of Simulating Complicated Models

The goal of simulating a complicated model is not only to create a program which will provide the desired results. We also hope to be able to write code such that:

The problem is broken down into small pieces.
The problem has checks in it to see what works (run the lines inside the functions!).
uses simple code (as often as possible).

12.5.2 Simulate to…

… estimate probabilities (easier than calculus).
… understand complicated models.

Aside: `ifelse()`

data.frame(value = c(-2:2)) |>
  mutate(abs_value = ifelse(value >=0, value, -value))  # abs val

  value abs_value
1    -2         2
2    -1         1
3     0         0
4     1         1
5     2         2

Aside: `case_when()`

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium", 
    TRUE ~ "inexpensive"))

# A tibble: 20 × 5
  carat cut       color price price_cat  
  <dbl> <ord>     <ord> <int> <chr>      
1  1.23 Very Good F     10276 expensive  
2  0.35 Premium   H       706 inexpensive
3  0.7  Good      E      2782 medium     
4  0.4  Ideal     D      1637 medium     
5  0.53 Ideal     G      1255 inexpensive
6  2.22 Ideal     G     14637 expensive  
# ℹ 14 more rows

Aside: `sample()`

sampling, shuffling, and resampling: sample()

set.seed(47)
alph <- letters[1:10]
alph

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

sample(alph, 5, replace = FALSE) # sample (from a population)

[1] "i" "b" "g" "d" "a"

sample(alph, 5, replace = TRUE) # sample (from a population)

[1] "j" "g" "f" "i" "f"

sample(alph, 10, replace = FALSE)  # shuffle

 [1] "f" "h" "i" "e" "g" "d" "c" "j" "b" "a"

sample(alph, 10, replace = TRUE)  # resample

 [1] "e" "j" "e" "b" "e" "c" "f" "a" "e" "a"

Aside: `set.seed()`

What if we want to be able to generate the same random numbers (here on the interval from 0 to 1) over and over?

set.seed(4747)
runif(4, 0, 1) # random uniform numbers

[1] 0.195 0.339 0.515 0.452

set.seed(123)
runif(4, 0, 1) # random uniform numbers

[1] 0.288 0.788 0.409 0.883

set.seed(4747)
runif(4, 0, 1) # random uniform numbers

[1] 0.195 0.339 0.515 0.452

12.5.3 Probability example 1: hats

10 people are at a party, and all of them are wearing hats. They each place their hat in a pile; when they leave, they choose a hat at random. What is the probability at least one person selected the correct hat?

Step 1: representing the hats

hats <- 1:10
hats

 [1]  1  2  3  4  5  6  7  8  9 10

Step 2: everyone draws a hat

random_hats <- sample(hats, size = 10, replace = FALSE)

hats

 [1]  1  2  3  4  5  6  7  8  9 10

random_hats

 [1]  4  6  9  3  5 10  2  8  1  7

Step 3: who got their original hat?

hats == random_hats

 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

# TRUE is 1, FALSE is 0
sum(hats == random_hats)

[1] 2

# did at least one person get their hat?
sum(hats == random_hats) > 0

[1] TRUE

Code so far

hats <- 1:10
random_hats <- sample(hats, size = 10, replace = FALSE)
sum(hats == random_hats) > 0

[1] TRUE

Do we have a good estimate of the probability of interest?

Function

remove the magic number 10

hat_match <- function(n){
  hats <- 1:n
  random_hats <- sample(hats, size = n, replace = FALSE)
  sum(hats == random_hats) > 0
}

hat_match(10)

[1] TRUE

hat_match(47)

[1] FALSE

Iterate

map_lgl(1:10, ~hat_match(n = 10))

 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

map_lgl(1:10, ~hat_match(n = 10)) |> 
  mean()

[1] 0.7

Reproducible

set.seed(4747)
num_iter <- 10000

map_lgl(1:num_iter, ~hat_match(n = 10)) |> 
  mean()

[1] 0.633

Vary number of hats

num_iter <- 10
hat_match_prob <- function(n, reps){
  prob <- map_lgl(1:reps, ~hat_match(n = n)) |> 
    mean()
  return(data.frame(match_prob = prob, num_hats = n))
}

hat_match_prob(47, 3)

  match_prob num_hats
1      0.667       47

set.seed(47)
map(1:20, hat_match_prob, reps = num_iter) |> 
  list_rbind()

   match_prob num_hats
1         1.0        1
2         0.6        2
3         0.7        3
4         0.7        4
5         0.4        5
6         0.4        6
7         0.8        7
8         0.6        8
9         0.4        9
10        0.9       10
11        0.7       11
12        0.6       12
13        0.7       13
14        0.7       14
15        0.5       15
16        0.7       16
17        0.6       17
18        0.8       18
19        0.5       19
20        0.5       20

Visualizing

set.seed(47)
num_iter <- 10000
map(1:20, hat_match_prob, reps = num_iter) |> 
  list_rbind() |> 
  ggplot(aes(x = num_hats, y = match_prob)) + 
  geom_line() + 
  labs(y = "probability of at least one match",
       x = "number of hats")

12.5.4 Probability example 2: Birthday Problem

What is the probability that in a room of 29 people, at least 2 of them have the same birthday?

Step 1: representing the birthdays

set.seed(47)
class_birthdays <- sample(1:365, size=29, replace=TRUE)
class_birthdays

 [1]  27 242 143 212 257  13 279 198 365 232 136  69 139  12  41 117 330 165 341
[20] 176 118 144 273 261 213 304   1 360 144

Step 2: any duplicates?

duplicated(class_birthdays)

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE

sum(duplicated(class_birthdays))

[1] 1

Function

class_duplicates <- function(unicorn){
  class_birthdays <- sample(1:365, size = 29, replace = TRUE)
  num_duplicates <- sum(duplicated(class_birthdays))
  return(ifelse(num_duplicates > 0, 1, 0))
}

class_duplicates()

[1] 1

class_duplicates()

[1] 0

Iterate

map_dbl(1:10, class_duplicates)

 [1] 0 1 1 1 1 0 0 1 0 0

Improve

class_duplicates <- function(class_size){
  class_birthdays <- sample(1:365, size = class_size, replace = TRUE)
  num_duplicates <- sum(duplicated(class_birthdays))
  return(ifelse(num_duplicates > 0, 1, 0))
}

set.seed(47)
num_stud <- 29
map_dbl(1:10, ~class_duplicates(class_size = num_stud))

 [1] 1 1 0 0 1 1 1 1 0 0

Many classrooms

set.seed(47)
num_stud <- 29
num_class <- 1000
map_dbl(1:num_class, ~class_duplicates(class_size = num_stud)) |> 
  mean()

[1] 0.654

Vary number of students

set.seed(47)
num_stud <- 29
num_class <- 100

birth_match_prob <- function(num_stud, reps){
  prob <- map_dbl(1:reps, ~class_duplicates(class_size = num_stud)) |> 
    mean()
  return(data.frame(match_prob = prob, num_stud = num_stud))
}

map(1:num_stud, birth_match_prob, reps = num_class) |> 
  list_rbind()

   match_prob num_stud
1        0.00        1
2        0.00        2
3        0.00        3
4        0.00        4
5        0.03        5
6        0.03        6
7        0.04        7
8        0.06        8
9        0.11        9
10       0.13       10
11       0.14       11
12       0.18       12
13       0.18       13
14       0.25       14
15       0.25       15
16       0.32       16
17       0.37       17
18       0.43       18
19       0.44       19
20       0.40       20
21       0.43       21
22       0.47       22
23       0.48       23
24       0.47       24
25       0.61       25
26       0.67       26
27       0.63       27
28       0.66       28
29       0.73       29

Visualize

set.seed(123)
num_stud <- 40
num_class <- 1000

map(1:num_stud, birth_match_prob, reps = num_class) |> 
  list_rbind()  |> 
  ggplot(aes(x = num_stud, y = match_prob)) + 
  geom_line() + 
  labs(y = "probability of birthday match",
       x = "number of students in class")

12.5.5 Good simulation practice

avoid magic numbers
set a seed for reproducibility
use meaningful names
add comments

12.5.6 Simulate to…

… estimate probabilities (easier than calculus).
… understand complicated models.

12.5.7 Model example 1: Bias in a model

Population:

talent ~ Normal (100, 15)
grades ~ Normal (talent, 15)
SAT ~ Normal (talent, 15)

The example is taken directly (and mostly verbatim) from a blog by Aaron Roth Algorithmic Unfairness Without Any Bias Baked In.

Goal for admitting students

College wants to admit students with

talent > 115

… but they only have access to grades and SAT which are noisy estimates of talent.

Plan for accepting students

Run a regression on a training dataset (talent is known for existing students)
Find a model which predicts talent based on grades and SAT
Choose students for whom predicted talent is above 115

Flaw in the plan …

there are two populations of students, the Reds and Blues.
Reds are the majority population (99%), and Blues are a small minority population (1%)
the Reds and the Blues are no different when it comes to talent: they both have the same talent distribution, as described above.
there is no bias baked into the grading or the exams: both the Reds and the Blues also have exactly the same grade and exam score distributions

What is really different?

But there is one difference: the Blues have more money than the Reds, so they each take the SAT twice, and report only the highest of the two scores to the college. This results in a small but noticeable bump in their average SAT scores, compared to the Reds.

Key aspect:

The value of SAT means something different for the Reds versus the Blues

(They have different feature distributions.)

Let’s see what happens …

Two models:

Red model (SAT taken once):

# A tibble: 3 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   33.2     0.152        218.       0
2 SAT            0.334   0.00149      224.       0
3 grades         0.334   0.00149      225.       0

Blue model (SAT is max score of two):

# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   25.3      1.60        15.8 2.04e- 50
2 SAT            0.432    0.0161      26.7 3.35e-119
3 grades         0.277    0.0154      18.0 6.83e- 63

New data

Generate new data, use the two models separately.
Can the models predict if a student has talent > 115?

New data

# A tibble: 2 × 5
  color   tpr    fpr   fnr error
  <chr> <dbl>  <dbl> <dbl> <dbl>
1 blue  0.503 0.0379 0.497 0.109
2 red   0.509 0.0378 0.491 0.109

TWO models doesn’t seem right????

What if we fit only one model to the entire dataset?

After all, there are laws against using protected classes to make decisions (housing, jobs, money loans, college, etc.)

# A tibble: 3 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   33.1     0.151        219.       0
2 SAT            0.334   0.00148      225.       0
3 grades         0.334   0.00148      226.       0

(The coefficients kinda look like the red model…)

How do the error rates change?

One model:

# A tibble: 2 × 5
  color   tpr    fpr   fnr error
  <chr> <dbl>  <dbl> <dbl> <dbl>
1 blue  0.619 0.0627 0.381 0.112
2 red   0.507 0.0375 0.493 0.109

Two separate models:

# A tibble: 2 × 5
  color   tpr    fpr   fnr error
  <chr> <dbl>  <dbl> <dbl> <dbl>
1 blue  0.503 0.0379 0.497 0.109
2 red   0.509 0.0378 0.491 0.109

What did we learn?

with two populations that have different feature distributions, learning a single classifier (that is prohibited from discriminating based on population) will fit the bigger of the two populations

depending on the nature of the distribution difference, it can be either to the benefit or the detriment of the minority population
no explicit human bias, either on the part of the algorithm designer or the data gathering process
the problem is exacerbated if we artificially force the algorithm to be group blind
well-intentioned “fairness” regulations prohibiting decision makers form taking sensitive attributes into account can actually make things less fair and less accurate at the same time

Simulate?

different varying proportions
effect due to variability
effect due to SAT coefficient
different number of times the blues get to take the test
etc.

12.5.8 Model example 2: Asset allocation

repeatedly run a model on a simulated outcome based on varying inputs
inputs are uncertain and variable.
output is a large set of results, creating a distribution of outcomes rather than a single point estimate.
easy to change the conditions of the models by varying the distribution type or properties of the inputs.

Taken from Risk Analysis Using Monte Carlo Simulations in R.

Aside: `rnorm()`

rnorm() calculates random normal values

rnorm(3)

[1] 1.060 0.281 0.148

rnorm(3, mean = 0, sd = 1)

[1] -1.522 -0.261  0.139

rnorm(4, mean = 47, sd = 5)

[1] 48.9 45.1 50.7 42.1

Asset allocation

Task: to model an asset allocation problem where you decide what portion of wealth should be allocated to risk-free investment or high-risk investment

the function calculates returns based on different asset allocations.

calculate_return <- function(step, alpha) {
  risk_free_rate <- 1.03
  risky_rate <- rnorm(1, mean = 0, sd = 1) * 0.05 + 1
  return(data.frame(step = step, 
                    return = (1 - alpha) * risk_free_rate + alpha * risky_rate))
}

calculate_return("unicorn", 0.47)

     step return
1 unicorn   1.03

Let alpha = 0.5

(half risky, half risk free)

decision_steps <- 12

pmap(data.frame(step = 1:decision_steps,
                alpha = rep(0.5, decision_steps)),
     calculate_return) |> 
  list_rbind()

   step return
1     1  1.024
2     2  0.974
3     3  1.011
4     4  1.038
5     5  1.002
6     6  0.985
7     7  1.030
8     8  0.992
9     9  1.038
10   10  1.009
11   11  1.037
12   12  1.011

Many runs

runs <- 10
decision_steps <- 12

run_func <- function(run, steps, alpha){
  pmap(data.frame(step = 1:steps,
                  alpha = rep(alpha, steps)),
       calculate_return) |> 
    list_rbind() |> 
    cbind(run_number = as.factor(run))
}

run_func("happy", 12, 0.5)

   step return run_number
1     1  0.997      happy
2     2  1.050      happy
3     3  1.013      happy
4     4  1.031      happy
5     5  1.042      happy
6     6  1.028      happy
7     7  1.009      happy
8     8  1.028      happy
9     9  1.027      happy
10   10  1.050      happy
11   11  1.048      happy
12   12  1.009      happy

map(1:runs, run_func, 
    steps = decision_steps,
    alpha = 0.5) |> 
  list_rbind()

    step return run_number
1      1  1.004          1
2      2  1.031          1
3      3  1.018          1
4      4  0.985          1
5      5  1.020          1
6      6  0.978          1
7      7  1.014          1
8      8  1.039          1
9      9  1.031          1
10    10  1.011          1
11    11  1.060          1
12    12  0.953          1
13     1  1.051          2
14     2  1.019          2
15     3  1.055          2
16     4  1.029          2
17     5  1.009          2
18     6  1.012          2
19     7  0.998          2
20     8  1.040          2
21     9  0.975          2
22    10  0.999          2
23    11  1.033          2
24    12  1.027          2
25     1  0.985          3
26     2  1.011          3
27     3  1.009          3
28     4  1.025          3
29     5  0.988          3
30     6  1.076          3
31     7  1.010          3
32     8  1.042          3
33     9  1.030          3
34    10  1.048          3
35    11  1.004          3
36    12  1.018          3
37     1  0.996          4
38     2  1.045          4
39     3  1.015          4
40     4  1.021          4
41     5  1.014          4
42     6  1.024          4
43     7  0.964          4
44     8  1.033          4
45     9  1.071          4
46    10  1.034          4
47    11  1.008          4
48    12  1.044          4
49     1  1.017          5
50     2  1.042          5
51     3  0.994          5
52     4  0.967          5
53     5  1.015          5
54     6  1.015          5
55     7  1.037          5
56     8  1.041          5
57     9  0.981          5
58    10  1.021          5
59    11  0.996          5
60    12  1.052          5
61     1  0.980          6
62     2  1.031          6
63     3  1.046          6
64     4  1.002          6
65     5  0.985          6
66     6  1.017          6
67     7  1.034          6
68     8  1.022          6
69     9  0.980          6
70    10  1.016          6
71    11  1.042          6
72    12  1.058          6
73     1  1.075          7
74     2  1.006          7
75     3  0.993          7
76     4  1.001          7
77     5  1.006          7
78     6  1.057          7
79     7  0.958          7
80     8  0.996          7
81     9  1.089          7
82    10  1.014          7
83    11  1.015          7
84    12  1.033          7
85     1  1.030          8
86     2  1.011          8
87     3  1.008          8
88     4  1.003          8
89     5  1.051          8
90     6  0.971          8
91     7  1.010          8
92     8  1.056          8
93     9  0.999          8
94    10  1.023          8
95    11  1.018          8
96    12  0.999          8
97     1  1.025          9
98     2  1.030          9
99     3  0.989          9
100    4  1.040          9
101    5  1.015          9
102    6  1.024          9
103    7  0.996          9
104    8  1.078          9
105    9  0.975          9
106   10  1.019          9
107   11  1.038          9
108   12  1.051          9
109    1  1.056         10
110    2  0.985         10
111    3  1.040         10
112    4  0.971         10
113    5  1.019         10
114    6  1.026         10
115    7  0.989         10
116    8  1.042         10
117    9  1.006         10
118   10  1.016         10
119   11  1.035         10
120   12  0.992         10

Cummulative product

numbers <- sample(1:5)
numbers

[1] 5 4 2 1 3

cumprod(numbers)

[1]   5  20  40  40 120

Cummulative rates

map(1:runs, run_func, 
    steps = decision_steps,
    alpha = 0.5) |> 
  list_rbind() |> 
  group_by(run_number) |> 
  mutate(compound_return = cumprod(return))

# A tibble: 120 × 4
# Groups:   run_number [10]
   step return run_number compound_return
  <int>  <dbl> <fct>                <dbl>
1     1  0.984 1                    0.984
2     2  1.01  1                    0.990
3     3  1.04  1                    1.03 
4     4  0.987 1                    1.02 
5     5  1.04  1                    1.05 
6     6  1.04  1                    1.09 
# ℹ 114 more rows

Returns over time

graph
code

runs <- 20
map(1:runs, run_func, 
    steps = decision_steps,
    alpha = 0.5) |> 
  list_rbind() |> 
  group_by(run_number) |> 
  mutate(compound_return = cumprod(return)) |> 
  ggplot(aes(x = step, y = compound_return)) +
  geom_line(aes(color = run_number)) +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1,12)) + 
  labs(title = "Simulations of returns from asset allocation, alpha = 0.5")

Change the allocation

graph
code

runs <- 20
map(1:runs, run_func, 
    steps = decision_steps, 
    alpha = 0.1) |> 
  list_rbind() |> 
  group_by(run_number) |> 
  mutate(compound_return = cumprod(return)) |> 
  ggplot(aes(x = step, y = compound_return)) +
  geom_line(aes(color = run_number)) +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1,12)) + 
  labs(title = "Simulations of returns from asset allocation, alpha = 0.1")

Change the allocation

graph
code

runs <- 20
map(1:runs, run_func, 
    steps = decision_steps, 
    alpha = 0.9) |> 
  list_rbind() |> 
  group_by(run_number) |> 
  mutate(compound_return = cumprod(return)) |> 
  ggplot(aes(x = step, y = compound_return)) +
  geom_line(aes(color = run_number)) +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1,12)) + 
  labs(title = "Simulations of returns from asset allocation, alpha = 0.9")

Change the return

calculate return
run function

calculate_return_sd10 <- function(step, alpha) {
  risk_free_rate <- 1.03
  risky_rate <- rnorm(1, mean = 0, sd = 10) * 0.05 + 1
  return(data.frame(step = step, 
                    return = (1 - alpha) * risk_free_rate + alpha * risky_rate))
}

calculate_return("unicorn", 0.47)

     step return
1 unicorn  0.987

run_func_sd10 <- function(run, steps, alpha){
  pmap(data.frame(step = 1:steps,
                  alpha = rep(alpha, steps)),
       calculate_return_sd10) |> 
    list_rbind() |> 
    cbind(run_number = as.factor(run))
}

Does the change impact long-term?

graph
code

runs <- 20
map(1:runs, run_func_sd10, 
    steps = decision_steps,
    alpha = 0.5) |> 
  list_rbind() |> 
  group_by(run_number) |> 
  mutate(compound_return = cumprod(return)) |> 
  ggplot(aes(x = step, y = compound_return)) +
  geom_line(aes(color = run_number)) +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1,12)) + 
  labs(title = "Simulations of returns from asset allocation, alpha = 0.5")

12.1 Learning goals:

12.2 Functions

12.2.1 Why functions?

12.2.2 An example

12.2.3 Parts of a function

12.2.4 Ordering and arguments

12.2.5 Function defaults

12.2.6 Returning a value

12.3 Control flow

12.3.1 if and else and ifelse

ifelse()

12.3.2 Functions in the tidyverse

12.3.3 Functions summary

12.4 Iterating functions

12.4.1 purrr for functional programming

12.5 Simulating models

12.5.1 Goals of Simulating Complicated Models

12.5.2 Simulate to…

Aside: ifelse()

Aside: case_when()

Aside: sample()

Aside: set.seed()

12.5.3 Probability example 1: hats

Step 1: representing the hats

Step 2: everyone draws a hat

Step 3: who got their original hat?

Code so far

Function

Iterate

Reproducible

Vary number of hats

Visualizing

12.5.4 Probability example 2: Birthday Problem

Step 1: representing the birthdays

Step 2: any duplicates?

Function

Iterate

Improve

Many classrooms

Vary number of students

Visualize

12.5.5 Good simulation practice

12.5.6 Simulate to…

12.5.7 Model example 1: Bias in a model

Goal for admitting students

Plan for accepting students

Flaw in the plan …

What is really different?

Key aspect:

Let’s see what happens …

Two models:

New data

New data

TWO models doesn’t seem right????

How do the error rates change?

What did we learn?

Simulate?

12.5.8 Model example 2: Asset allocation

Aside: rnorm()

Asset allocation

Let alpha = 0.5

Many runs

Cummulative product

Cummulative rates

Returns over time

Change the allocation

Change the allocation

Change the return

Does the change impact long-term?

12.6 Reflection questions

12.7 Ethics considerations

12.3.1 `if` and `else` and `ifelse`

`ifelse()`

Aside: `ifelse()`

Aside: `case_when()`

Aside: `sample()`

Aside: `set.seed()`

Aside: `rnorm()`