3 Data Wrangling

As with data visualization, data wrangling is a fundamental part of being able to accurately, reproducibly, and efficiently work with data. The approach taken in the following chapter is based on the philosophy of tidy data and takes many of its precepts from database theory. If you have done much work in SQL, the functionality and approach of tidy data will feel very familiar. The more adept you are at data wrangling, the more effective you will be at data analysis.

Information is what we want, but data are what we’ve got. (Kaplan 2015)

Embrace all the ways to get help!

cheat sheets: https://posit.co/resources/cheatsheets/
tidyverse vignettes: https://www.tidyverse.org/articles/2019/09/tidyr-1-0-0/
pivoting: https://tidyr.tidyverse.org/articles/pivot.html
google what you need and include R tidy or tidyverse

3.1 Coding Style

An important aspect of the data wrangling tools in this chapter includes best practices for writing code. The tidyverse style of coding allows you to implement code that is clean and communicates ideas easily. The more your code conforms to a style, the easier it will be to debug, and the fewer mistakes you will make. Understanding the rules of programming discourse is part of what makes excellent coders. (Soloway and Ehrlich 1984)

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. The tidyverse style guide

3.1.1 Why do style and format matter?

Consistency and communication are very important aspects of coding. While the computer does not usually care about code style, the computer is not the only entity working in the coding space. Humans play a huge role in making code work, and they absolutely do better in situations where style is consistent and clear.

In chess, experienced players have excellent memories for recall of specific games and positions they have seen previously. However, recall is substantially better for situations with regular patterns as opposed to remembering random positions. (Gobet and Simon 1996) The related psychological idea, chunking, states that small pieces of information improve long-term retention of the material. When code is random or unwieldy, the coder has to use more brainpower to process the intent of the code.

In particular, good code style:

highlights where the errors are.
makes it easier to fix errors.
focuses the reader’s eye on what the code is doing (bad style obscures the information into the code itself).
helps the future reader of the code (which is likely you in the future!) understand what the code does.
follows the rules that specify conventions in programming.

3.1.2 Best practices for writing good code.

A style guide is a set of rules and best practices that help coders write code. There are no absolute truths when it comes to style, but consistency is important for being able to communicate your work.

Use a style guide. For example, try The tidyverse style guide
Use the same guide that your collaborators are using – remember, communication is key!

3.2 Structure of Data

For plotting, analyses, model building, etc., it’s important that the data be structured in a very particular way. Hadley Wickham provides a thorough discussion and advice for cleaning up the data in Wickham (2014).

Tidy Data: rows (cases/observational units) and columns (variables). The key is that every row is a case and *every} column is a variable. No exceptions.
Creating tidy data is not trivial. We work with objects (often data tables), functions, and arguments (often variables).

The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?

https://docs.google.com/spreadsheets/d/1Ow6Cm4z-Z1Yybk3i352msulYCEDOUaOghmo9ALajyHo/edit#gid=1811988794

Problem: totals and different sheets

Better for R: longer format with columns - grade, gender, status, service, count (case is still the total pay grade)

Case is individual (?): grade, gender, status, service (no count because each row does the counting)

3.2.1 Building Tidy Data

Within R (really within any type of computing language, Python, SQL, Java, etc.), we need to understand how to build data using the patterns of the language. Some things to consider:

object_name <- function_name(arguments) is a way of using a function to create a new object.
object_name <- data_table |> function_name(arguments) uses chaining syntax as an extension of the ideas of functions. In chaining, the value on the left side of |> becomes the first argument to the function on the right side.

object_name <- data_table |>
  function_name(arguments) |> 
  function_name(arguments)

is extended chaining. |> is never at the front of the line, it is always connecting one idea with the continuation of that idea on the next line. * In R, all functions take arguments in round parentheses (as opposed to subsetting observations or variables from data objects which happen with square parentheses). Additionally, the spot to the left of |> is always a data table. * The pipe syntax should be read as then, |>.

3.2.2 Examples of Chaining

The pipe syntax (|>) takes a data frame (or data table) and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:

x |> f() is the same as f(x)

x |> f(y) is the same as f(x, y)

3.2.2.1 Little Bunny Foo Foo

From Hadley Wickham, how to think about tidy data.

Little bunny Foo Foo Went hopping through the forest Scooping up the field mice And bopping them on the head

The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.

foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_2, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)

Another approach is to concatenate the functions so that there is only one output.

bop(
   scoop(
      hop(foo_foo, through = forest),
      up = field_mice),
   on = head)

Or even worse, as one line:

bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)))

Instead, the code can be written using the pipe in the order in which the function is evaluated:

foo_foo |>
   hop(through = forest) |>
       scoop(up = field_mice) |>
           bop(on = head)

babynames Each year, the US Social Security Administration publishes a list of the most popular names given to babies. From 2014 until 2017, Social Security Administration’s website showed the names Emma and Olivia leading for girls, Noah and Liam for boys.

The babynames data table in the babynames package comes from the Social Security Administration’s listing of the names givens to babies in each year, and the number of babies of each sex given that name. (Only names with 5 or more babies are published by the SSA.)

3.2.3 Data Verbs (on single data frames)

Important

Super great resource: The Posit dplyr cheat sheet

Data verbs take data tables as input and give data tables as output (that’s how we can use the chaining syntax!). We will use the R package dplyr to do much of our data wrangling. Below is a list of verbs which will be helpful in wrangling many different types of data. See the Data Wrangling cheat from Posit for additional help.

sample_n() take a random row(s)
head() grab the first few rows
tail() grab the last few rows
filter() removes unwanted cases
arrange() reorders the cases
select() removes unwanted variables (and rename() )
distinct() returns the unique values in a table
mutate() transforms the variable (and transmute() like mutate, returns only new variables)
group_by() tells R that SUCCESSIVE functions keep in mind that there are groups of items. So group_by() only makes sense with verbs later on (like summarize()).
summarize() collapses a data frame to a single row. Some functions that are used within summarize() include:
- min(), max(), mean(), sum(), sd(), median(), and IQR()
- n(): number of observations in the current group
- n_distinct(x): count the number of unique values in x
- first_value(x), last_value(x) and nth_value(x, n): work similarly to x[1], x[length(x)], and x[n]

3.3 R examples, basic verbs

3.3.1 Datasets

The starwars dataframe is from dplyr , although the data are originally from SWAPI, the Star Wars API.

NHANES From ?NHANES: NHANES is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination center (MEC).

babynames Each year, the US Social Security Administration publishes a list of the most popular names given to babies. From 2014 until 2017, Social Security Administration’s website showed the names Emma and Olivia leading for girls, Noah and Liam for boys. (Only names with 5 or more babies are published by the SSA.)

3.3.2 Examples of Chaining

library(babynames)
babynames |> nrow()

[1] 1924665

babynames |> names()

[1] "year" "sex"  "name" "n"    "prop"

babynames |> glimpse()

Rows: 1,924,665
Columns: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
$ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
$ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.016…

babynames |> head()

# A tibble: 6 × 5
   year sex   name          n   prop
  <dbl> <chr> <chr>     <int>  <dbl>
1  1880 F     Mary       7065 0.0724
2  1880 F     Anna       2604 0.0267
3  1880 F     Emma       2003 0.0205
4  1880 F     Elizabeth  1939 0.0199
5  1880 F     Minnie     1746 0.0179
6  1880 F     Margaret   1578 0.0162

babynames |> tail()

# A tibble: 6 × 5
   year sex   name       n       prop
  <dbl> <chr> <chr>  <int>      <dbl>
1  2017 M     Zyhier     5 0.00000255
2  2017 M     Zykai      5 0.00000255
3  2017 M     Zykeem     5 0.00000255
4  2017 M     Zylin      5 0.00000255
5  2017 M     Zylis      5 0.00000255
6  2017 M     Zyrie      5 0.00000255

babynames |> sample_n(size=5)

# A tibble: 5 × 5
   year sex   name        n       prop
  <dbl> <chr> <chr>   <int>      <dbl>
1  2013 F     Keilynn     6 0.00000312
2  1930 F     Paris       6 0.00000514
3  2006 F     Aleysia     5 0.00000239
4  1987 F     Adonis      6 0.0000032 
5  2001 M     Idris      40 0.0000194

3.3.3 Data Verbs

Taken from the dplyr tutorial

3.3.3.1 Starwars

library(dplyr)

starwars |> dim()

[1] 87 14

starwars |> names()

 [1] "name"       "height"     "mass"       "hair_color" "skin_color"
 [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
[11] "species"    "films"      "vehicles"   "starships"

starwars |> head()

# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

starwars |> 
  dplyr::filter(species == "Droid")

# A tibble: 6 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
4 IG-88     200   140 none       metal       red               15 none  masculi…
5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
6 BB8        NA    NA none       none        black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

starwars |> 
  dplyr::select(name, ends_with("color"))

# A tibble: 87 × 4
   name               hair_color    skin_color  eye_color
   <chr>              <chr>         <chr>       <chr>    
 1 Luke Skywalker     blond         fair        blue     
 2 C-3PO              <NA>          gold        yellow   
 3 R2-D2              <NA>          white, blue red      
 4 Darth Vader        none          white       yellow   
 5 Leia Organa        brown         light       brown    
 6 Owen Lars          brown, grey   light       blue     
 7 Beru Whitesun Lars brown         light       blue     
 8 R5-D4              <NA>          white, red  red      
 9 Biggs Darklighter  black         light       brown    
10 Obi-Wan Kenobi     auburn, white fair        blue-gray
# ℹ 77 more rows

starwars |> 
  dplyr::mutate(name, bmi = mass / ((height / 100)  ^ 2)) |>
  dplyr::select(name:mass, bmi)

# A tibble: 87 × 4
   name               height  mass   bmi
   <chr>               <int> <dbl> <dbl>
 1 Luke Skywalker        172    77  26.0
 2 C-3PO                 167    75  26.9
 3 R2-D2                  96    32  34.7
 4 Darth Vader           202   136  33.3
 5 Leia Organa           150    49  21.8
 6 Owen Lars             178   120  37.9
 7 Beru Whitesun Lars    165    75  27.5
 8 R5-D4                  97    32  34.0
 9 Biggs Darklighter     183    84  25.1
10 Obi-Wan Kenobi        182    77  23.2
# ℹ 77 more rows

starwars |> 
  dplyr::arrange(desc(mass))

# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Jabba D…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
 2 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 3 IG-88       200   140 none       metal      red             15   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Bossk       190   113 none       green      red             53   male  mascu…
 8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 9 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
10 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

starwars |>
  dplyr::group_by(species) |>
  dplyr::summarize(
    num = n(),
    mass = mean(mass, na.rm = TRUE)
  ) |>
  dplyr::filter(num > 1)

# A tibble: 9 × 3
  species    num  mass
  <chr>    <int> <dbl>
1 Droid        6  69.8
2 Gungan       3  74  
3 Human       35  81.3
4 Kaminoan     2  88  
5 Mirialan     2  53.1
6 Twi'lek      2  55  
7 Wookiee      2 124  
8 Zabrak       2  80  
9 <NA>         4  81

3.3.3.2 NHANES

library(NHANES)
names(NHANES)

 [1] "ID"               "SurveyYr"         "Gender"           "Age"             
 [5] "AgeDecade"        "AgeMonths"        "Race1"            "Race3"           
 [9] "Education"        "MaritalStatus"    "HHIncome"         "HHIncomeMid"     
[13] "Poverty"          "HomeRooms"        "HomeOwn"          "Work"            
[17] "Weight"           "Length"           "HeadCirc"         "Height"          
[21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"           
[25] "BPSysAve"         "BPDiaAve"         "BPSys1"           "BPDia1"          
[29] "BPSys2"           "BPDia2"           "BPSys3"           "BPDia3"          
[33] "Testosterone"     "DirectChol"       "TotChol"          "UrineVol1"       
[37] "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
[41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad" 
[45] "LittleInterest"   "Depressed"        "nPregnancies"     "nBabies"         
[49] "Age1stBaby"       "SleepHrsNight"    "SleepTrouble"     "PhysActive"      
[53] "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"   
[57] "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
[61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"        
[65] "Marijuana"        "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"     
[69] "HardDrugs"        "SexEver"          "SexAge"           "SexNumPartnLife" 
[73] "SexNumPartYear"   "SameSex"          "SexOrientation"   "PregnantNow"

# find the sleep variables
NHANESsleep <- NHANES |> 
  select(Gender, Age, Weight, Race1, Race3,
         Education, SleepTrouble, SleepHrsNight,
         TVHrsDay, TVHrsDayChild, PhysActive)
names(NHANESsleep)

 [1] "Gender"        "Age"           "Weight"        "Race1"        
 [5] "Race3"         "Education"     "SleepTrouble"  "SleepHrsNight"
 [9] "TVHrsDay"      "TVHrsDayChild" "PhysActive"

dim(NHANESsleep)

[1] 10000    11

# subset for college students
NHANESsleep <- NHANESsleep |> 
  filter(Age %in% c(18:22)) |> 
  mutate(Weightlb = Weight*2.2)

names(NHANESsleep)

 [1] "Gender"        "Age"           "Weight"        "Race1"        
 [5] "Race3"         "Education"     "SleepTrouble"  "SleepHrsNight"
 [9] "TVHrsDay"      "TVHrsDayChild" "PhysActive"    "Weightlb"

dim(NHANESsleep)

[1] 655  12

NHANESsleep |> 
  ggplot(aes(x = Age, y = SleepHrsNight, color = Gender)) + 
  geom_point(position=position_jitter(width=.25, height=0) ) + 
  facet_grid(SleepTrouble ~ TVHrsDay)

3.3.4 `summarize` and `group_by`

# number of people (cases) in NHANES
NHANES |> summarize(n())

# A tibble: 1 × 1
  `n()`
  <int>
1 10000

# total weight of all the people in NHANES (silly)
NHANES |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(sum(Weightlb, na.rm=TRUE))

# A tibble: 1 × 1
  `sum(Weightlb, na.rm = TRUE)`
                          <dbl>
1                      1549419.

# mean weight of all the people in NHANES
NHANES |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(mean(Weightlb, na.rm=TRUE))

# A tibble: 1 × 1
  `mean(Weightlb, na.rm = TRUE)`
                           <dbl>
1                           156.

# repeat the above but for groups

# males versus females
NHANES |> 
  group_by(Gender) |> 
  summarize(n())

# A tibble: 2 × 2
  Gender `n()`
  <fct>  <int>
1 female  5020
2 male    4980

NHANES |> 
  group_by(Gender) |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(mean(Weightlb, na.rm=TRUE))

# A tibble: 2 × 2
  Gender `mean(Weightlb, na.rm = TRUE)`
  <fct>                           <dbl>
1 female                           146.
2 male                             167.

# smokers and non-smokers
NHANES |> 
  group_by(SmokeNow) |> 
  summarize(n())

# A tibble: 3 × 2
  SmokeNow `n()`
  <fct>    <int>
1 No        1745
2 Yes       1466
3 <NA>      6789

NHANES |> 
  group_by(SmokeNow) |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(mean(Weightlb, na.rm=TRUE))

# A tibble: 3 × 2
  SmokeNow `mean(Weightlb, na.rm = TRUE)`
  <fct>                             <dbl>
1 No                                 186.
2 Yes                                177.
3 <NA>                               144.

# people with and without diabetes
NHANES |> 
  group_by(Diabetes) |> 
  summarize(n())

# A tibble: 3 × 2
  Diabetes `n()`
  <fct>    <int>
1 No        9098
2 Yes        760
3 <NA>       142

NHANES |> 
  group_by(Diabetes) |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(mean(Weightlb, na.rm=TRUE))

# A tibble: 3 × 2
  Diabetes `mean(Weightlb, na.rm = TRUE)`
  <fct>                             <dbl>
1 No                                155. 
2 Yes                               202. 
3 <NA>                               21.6

# break down the smokers versus non-smokers further, by sex
NHANES |> 
  group_by(SmokeNow, Gender) |> 
  summarize(n())

# A tibble: 6 × 3
# Groups:   SmokeNow [3]
  SmokeNow Gender `n()`
  <fct>    <fct>  <int>
1 No       female   764
2 No       male     981
3 Yes      female   638
4 Yes      male     828
5 <NA>     female  3618
6 <NA>     male    3171

NHANES |> 
  group_by(SmokeNow, Gender) |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(mean(Weightlb, na.rm=TRUE))

# A tibble: 6 × 3
# Groups:   SmokeNow [3]
  SmokeNow Gender `mean(Weightlb, na.rm = TRUE)`
  <fct>    <fct>                           <dbl>
1 No       female                           167.
2 No       male                             201.
3 Yes      female                           167.
4 Yes      male                             185.
5 <NA>     female                           138.
6 <NA>     male                             151.

# break down the people with diabetes further, by smoking
NHANES |> 
  group_by(Diabetes, SmokeNow) |> 
  summarize(n())

# A tibble: 8 × 3
# Groups:   Diabetes [3]
  Diabetes SmokeNow `n()`
  <fct>    <fct>    <int>
1 No       No        1476
2 No       Yes       1360
3 No       <NA>      6262
4 Yes      No         267
5 Yes      Yes        106
6 Yes      <NA>       387
7 <NA>     No           2
8 <NA>     <NA>       140

NHANES |> 
  group_by(Diabetes, SmokeNow) |> 
  mutate(Weightlb = Weight*2.2) |> 
  summarize(mean(Weightlb, na.rm=TRUE))

# A tibble: 8 × 3
# Groups:   Diabetes [3]
  Diabetes SmokeNow `mean(Weightlb, na.rm = TRUE)`
  <fct>    <fct>                             <dbl>
1 No       No                                183. 
2 No       Yes                               175. 
3 No       <NA>                              143. 
4 Yes      No                                204. 
5 Yes      Yes                               204. 
6 Yes      <NA>                              199. 
7 <NA>     No                                193. 
8 <NA>     <NA>                               19.1

3.3.5 babynames

babynames |> group_by(sex) |>
  summarize(total=sum(n))

# A tibble: 2 × 2
  sex       total
  <chr>     <int>
1 F     172371079
2 M     175749438

babynames |> 
  group_by(year, sex) |>
  summarize(name_count = n_distinct(name)) |> 
  head()

# A tibble: 6 × 3
# Groups:   year [3]
   year sex   name_count
  <dbl> <chr>      <int>
1  1880 F            942
2  1880 M           1058
3  1881 F            938
4  1881 M            997
5  1882 F           1028
6  1882 M           1099

babynames |> 
  group_by(year, sex) |>
  summarize(name_count = n_distinct(name)) |> 
  tail()

# A tibble: 6 × 3
# Groups:   year [3]
   year sex   name_count
  <dbl> <chr>      <int>
1  2015 F          19074
2  2015 M          14024
3  2016 F          18817
4  2016 M          14162
5  2017 F          18309
6  2017 M          14160

babysamp <- babynames |> 
  sample_n(size=50)
babysamp |> 
  select(year) |> 
  distinct() |> 
  table()

year
1902 1907 1909 1913 1915 1919 1921 1922 1923 1926 1928 1929 1933 1938 1948 1950 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
1951 1953 1954 1955 1965 1972 1974 1975 1977 1979 1981 1985 1987 1988 1989 1995 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
1997 1999 2001 2008 2013 2014 2016 
   1    1    1    1    1    1    1

babysamp |> 
  distinct() |> 
  select(year) |> 
  table()

year
1902 1907 1909 1913 1915 1919 1921 1922 1923 1926 1928 1929 1933 1938 1948 1950 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
1951 1953 1954 1955 1965 1972 1974 1975 1977 1979 1981 1985 1987 1988 1989 1995 
   1    1    2    1    2    2    1    1    2    1    2    1    1    1    1    2 
1997 1999 2001 2008 2013 2014 2016 
   1    1    2    2    2    2    2

Frances <- babynames |>
  filter(name== "Frances") |>
  group_by(year, sex) |>
  summarize(yrTot = sum(n))

Frances |> 
  ggplot(aes(x = year, y = yrTot)) +
  geom_point(aes(color = sex)) + 
  geom_vline(xintercept=2006) + scale_y_log10() +
  labs(y = "Yearly total on log10 scale",
       title = "Number of babies named Frances, over time")

3.4 Higher Level Data Verbs

There are more complicated verbs which may be important for more sophisticated analyses. See the RStudio dplyr cheat sheet, https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf}.

pivot_longer makes many columns into 2 columns: pivot_longer(data, cols, names_to = , value_to = )
pivot_wider makes one column into multiple columns: pivot_wider(data, names_from = , values_from = )
left_join returns all rows from the left table, and any rows with matching keys from the right table.
inner_join returns only the rows in which the left table have matching keys in the right table (i.e., matching rows in both sets).
full_join returns all rows from both tables, join records from the left which have matching keys in the right table.

Good practice: always specify the by argument when joining data frames.

If you ever need to understand which join is the right join for you, try to find an image that will lay out what the function is doing. I found this one that is quite good and is taken from Statistics Globe blog: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

3.5 R examples, higher level verbs

tidyr 1.0.0 has just been released! The new release means that you need to update tidyr. You will know if you have the latest version if the following command works in the console (window below):

?tidyr::pivot_longer

If you are familiar with spread and gather, you should acquaint yourself with pivot_longer() and pivot_wider(). The idea is to go from very wide dataframes to very long dataframes and vice versa.

3.5.1 `pivot_longer()`

pivot the military pay grade to become longer?

https://docs.google.com/spreadsheets/d/1Ow6Cm4z-Z1Yybk3i352msulYCEDOUaOghmo9ALajyHo/edit# gid=1811988794

library(googlesheets4)
gs4_deauth()

navy_gs = read_sheet("https://docs.google.com/spreadsheets/d/1Ow6Cm4z-Z1Yybk3i352msulYCEDOUaOghmo9ALajyHo/edit#gid=1877566408", 
                     col_types = "ccnnnnnnnnnnnnnnn")

glimpse(navy_gs)

Rows: 38
Columns: 17
$ ...1                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Active Duty Family` <chr> NA, "Marital Status Report", NA, "Data Reflect Se…
$ ...3                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 31229, 53094, 131…
$ ...4                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5717, 8388, 21019…
$ ...5                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 36946, 61482, 152…
$ ...6                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 563, 1457, 4264, …
$ ...7                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 122, 275, 1920, 4…
$ ...8                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 685, 1732, 6184, …
$ ...9                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 139, 438, 3579, 8…
$ ...10                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 141, 579, 4902, 9…
$ ...11                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 280, 1017, 8481, …
$ ...12                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5060, 12483, 5479…
$ ...13                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 719, 1682, 6641, …
$ ...14                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5779, 14165, 6143…
$ ...15                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 36991, 67472, 193…
$ ...16                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 6699, 10924, 3448…
$ ...17                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 43690, 78396, 228…

names(navy_gs) = c("X","pay.grade", "male.sing.wo", "female.sing.wo",
                   "tot.sing.wo", "male.sing.w", "female.sing.w", 
                   "tot.sing.w", "male.joint.NA", "female.joint.NA",
                   "tot.joint.NA", "male.civ.NA", "female.civ.NA",
                   "tot.civ.NA", "male.tot.NA", "female.tot.NA", 
                   "tot.tot.NA")
navy = navy_gs[-c(1:8), -1]
dplyr::glimpse(navy)

Rows: 30
Columns: 16
$ pay.grade       <chr> "E-1", "E-2", "E-3", "E-4", "E-5", "E-6", "E-7", "E-8"…
$ male.sing.wo    <dbl> 31229, 53094, 131091, 112710, 57989, 19125, 5446, 1009…
$ female.sing.wo  <dbl> 5717, 8388, 21019, 16381, 11021, 4654, 1913, 438, 202,…
$ tot.sing.wo     <dbl> 36946, 61482, 152110, 129091, 69010, 23779, 7359, 1447…
$ male.sing.w     <dbl> 563, 1457, 4264, 9491, 10937, 10369, 6530, 1786, 579, …
$ female.sing.w   <dbl> 122, 275, 1920, 4662, 6576, 4962, 2585, 513, 144, 2175…
$ tot.sing.w      <dbl> 685, 1732, 6184, 14153, 17513, 15331, 9115, 2299, 723,…
$ male.joint.NA   <dbl> 139, 438, 3579, 8661, 12459, 8474, 5065, 1423, 458, 40…
$ female.joint.NA <dbl> 141, 579, 4902, 9778, 11117, 6961, 3291, 651, 150, 375…
$ tot.joint.NA    <dbl> 280, 1017, 8481, 18439, 23576, 15435, 8356, 2074, 608,…
$ male.civ.NA     <dbl> 5060, 12483, 54795, 105556, 130944, 110322, 70001, 210…
$ female.civ.NA   <dbl> 719, 1682, 6641, 9961, 8592, 5827, 3206, 820, 291, 377…
$ tot.civ.NA      <dbl> 5779, 14165, 61436, 115517, 139536, 116149, 73207, 218…
$ male.tot.NA     <dbl> 36991, 67472, 193729, 236418, 212329, 148290, 87042, 2…
$ female.tot.NA   <dbl> 6699, 10924, 34482, 40782, 37306, 22404, 10995, 2422, …
$ tot.tot.NA      <dbl> 43690, 78396, 228211, 277200, 249635, 170694, 98037, 2…

# get rid of total columns & rows:

navyWR = navy |> 
  select(-contains("tot")) |>
  filter(substr(pay.grade, 1, 5) != "TOTAL" & 
            substr(pay.grade, 1, 5) != "GRAND" ) |>
  pivot_longer(-pay.grade,
                values_to = "numPeople",
                names_to = "status") |>
  separate(status, into = c("sex", "marital", "kids"))

navyWR |> head()

# A tibble: 6 × 5
  pay.grade sex    marital kids  numPeople
  <chr>     <chr>  <chr>   <chr>     <dbl>
1 E-1       male   sing    wo        31229
2 E-1       female sing    wo         5717
3 E-1       male   sing    w           563
4 E-1       female sing    w           122
5 E-1       male   joint   NA          139
6 E-1       female joint   NA          141

Does a graph tell us if we did it right? what if we had done it wrong…?

navyWR |>
  ggplot(aes(x = pay.grade, y = numPeople, color = sex)) + 
  geom_point()  + 
  facet_grid(kids ~ marital) +
  theme_minimal() +
  scale_color_viridis_d() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   hjust = 1, size = rel(.5)))

3.5.2 `pivot_wider`

library(babynames)
babynames |> 
  dplyr::select(-prop) |>
  tidyr::pivot_wider(names_from = sex, values_from = n)

# A tibble: 1,756,284 × 4
    year name          F     M
   <dbl> <chr>     <int> <int>
 1  1880 Mary       7065    27
 2  1880 Anna       2604    12
 3  1880 Emma       2003    10
 4  1880 Elizabeth  1939     9
 5  1880 Minnie     1746     9
 6  1880 Margaret   1578    NA
 7  1880 Ida        1472     8
 8  1880 Alice      1414    NA
 9  1880 Bertha     1320    NA
10  1880 Sarah      1288    NA
# ℹ 1,756,274 more rows

babynames |> 
  select(-prop) |> 
  pivot_wider(names_from = sex, values_from = n) |>
  filter(!is.na(F) & !is.na(M)) |>
  arrange(desc(year), desc(M))

# A tibble: 168,381 × 4
    year name         F     M
   <dbl> <chr>    <int> <int>
 1  2017 Liam        36 18728
 2  2017 Noah       170 18326
 3  2017 William     18 14904
 4  2017 James       77 14232
 5  2017 Logan     1103 13974
 6  2017 Benjamin     8 13733
 7  2017 Mason       58 13502
 8  2017 Elijah      26 13268
 9  2017 Oliver      15 13141
10  2017 Jacob       16 13106
# ℹ 168,371 more rows

babynames |> 
  pivot_wider(names_from = sex, values_from = n) |>
  filter(!is.na(F) & !is.na(M)) |>
  arrange(desc(prop))

# A tibble: 12 × 5
    year name            prop     F     M
   <dbl> <chr>          <dbl> <int> <int>
 1  1986 Marquette 0.0000130     24    25
 2  1996 Dariel    0.0000115     22    23
 3  2014 Laramie   0.0000108     21    22
 4  1939 Earnie    0.00000882    10    10
 5  1939 Vertis    0.00000882    10    10
 6  1921 Vernis    0.00000703     9     8
 7  1939 Alvia     0.00000529     6     6
 8  1939 Eudell    0.00000529     6     6
 9  1939 Ladell    0.00000529     6     6
10  1939 Lory      0.00000529     6     6
11  1939 Maitland  0.00000529     6     6
12  1939 Delaney   0.00000441     5     5

3.5.3 `join` (use `join` to merge two datasets)

3.5.3.1 First get the data (GapMinder)

Both of the following datasets come from GapMinder. The first represents country, year, and female literacy rate. The second represents country, year, and GDP (in fixed 2000 US$).

gs4_deauth()
litF = read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")

litF <- litF |> 
  select(country = starts_with("Adult"),
         starts_with("1"), starts_with("2")) |>
  pivot_longer(-country,
               names_to = "year",
               values_to = "litRateF") |>
  filter(!is.na(litRateF))

gs4_deauth()
GDP <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

GDP <- GDP |> 
  select(country = starts_with("Income"),
         starts_with("1"), starts_with("2")) |>
  pivot_longer(-country,
               names_to = "year",
               values_to = "gdp") |>
  filter(!is.na(gdp))

head(litF)

# A tibble: 6 × 3
  country     year  litRateF
  <chr>       <chr>    <dbl>
1 Afghanistan 1979      4.99
2 Afghanistan 2011     13   
3 Albania     2001     98.3 
4 Albania     2008     94.7 
5 Albania     2011     95.7 
6 Algeria     1987     35.8

head(GDP)

# A tibble: 6 × 3
  country year    gdp
  <chr>   <chr> <dbl>
1 Albania 1980  1061.
2 Albania 1981  1100.
3 Albania 1982  1111.
4 Albania 1983  1101.
5 Albania 1984  1065.
6 Albania 1985  1060.

# left
litGDPleft = left_join(litF, GDP, by = c("country", "year"))
dim(litGDPleft)

[1] 571   4

sum(is.na(litGDPleft$gdp))

[1] 66

head(litGDPleft)

# A tibble: 6 × 4
  country     year  litRateF   gdp
  <chr>       <chr>    <dbl> <dbl>
1 Afghanistan 1979      4.99   NA 
2 Afghanistan 2011     13      NA 
3 Albania     2001     98.3  1282.
4 Albania     2008     94.7  1804.
5 Albania     2011     95.7  1966.
6 Algeria     1987     35.8  1902.

# right
litGDPright = right_join(litF, GDP, by = c("country", "year"))
dim(litGDPright)

[1] 7988    4

sum(is.na(litGDPright$gdp))

[1] 0

head(litGDPright)

# A tibble: 6 × 4
  country year  litRateF   gdp
  <chr>   <chr>    <dbl> <dbl>
1 Albania 2001      98.3 1282.
2 Albania 2008      94.7 1804.
3 Albania 2011      95.7 1966.
4 Algeria 1987      35.8 1902.
5 Algeria 2002      60.1 1872.
6 Algeria 2006      63.9 2125.

# inner
litGDPinner = inner_join(litF, GDP, by = c("country", "year"))
dim(litGDPinner)

[1] 505   4

sum(is.na(litGDPinner$gdp))

[1] 0

head(litGDPinner)

# A tibble: 6 × 4
  country year  litRateF   gdp
  <chr>   <chr>    <dbl> <dbl>
1 Albania 2001      98.3 1282.
2 Albania 2008      94.7 1804.
3 Albania 2011      95.7 1966.
4 Algeria 1987      35.8 1902.
5 Algeria 2002      60.1 1872.
6 Algeria 2006      63.9 2125.

# full
litGDPfull = full_join(litF, GDP, by = c("country", "year"))
dim(litGDPfull)

[1] 8054    4

sum(is.na(litGDPfull$gdp))

[1] 66

head(litGDPfull)

# A tibble: 6 × 4
  country     year  litRateF   gdp
  <chr>       <chr>    <dbl> <dbl>
1 Afghanistan 1979      4.99   NA 
2 Afghanistan 2011     13      NA 
3 Albania     2001     98.3  1282.
4 Albania     2008     94.7  1804.
5 Albania     2011     95.7  1966.
6 Algeria     1987     35.8  1902.

3.5.4 lubridate

lubridate is a another R package meant for data wrangling (Grolemund and Wickham 2011). In particular, lubridate makes it very easy to work with days, times, and dates. The base idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want. The lubridate cheatsheet provides many of the basic functionality.

Example from https://lubridate.tidyverse.org/reference/lubridate-package.html

3.5.4.1 If anyone drove a time machine, they would crash

The length of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation, January 31st + one month. Should the answer be:

February 31st (which doesn’t exist)
March 4th (31 days after January 31), or
February 28th (assuming its not a leap year)

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.

3.5.4.2 R examples, lubridate

Some basics in lubridate

library(lubridate)
rightnow <- now()

day(rightnow)

[1] 19

week(rightnow)

[1] 33

month(rightnow, label=FALSE)

[1] 8

month(rightnow, label=TRUE)

[1] Aug
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

year(rightnow)

[1] 2025

minute(rightnow)

[1] 19

hour(rightnow)

[1] 19

yday(rightnow)

[1] 231

mday(rightnow)

[1] 19

wday(rightnow, label=FALSE)

[1] 3

wday(rightnow, label=TRUE)

[1] Tue
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

But how do I create a date object?

jan31 <- ymd("2021-01-31")
jan31 + months(0:11)

 [1] "2021-01-31" NA           "2021-03-31" NA           "2021-05-31"
 [6] NA           "2021-07-31" "2021-08-31" NA           "2021-10-31"
[11] NA           "2021-12-31"

floor_date(jan31, "month") + months(0:11) + days(31)

 [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
 [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
[11] "2021-12-02" "2022-01-01"

jan31 + months(0:11) + days(31)

 [1] "2021-03-03" NA           "2021-05-01" NA           "2021-07-01"
 [6] NA           "2021-08-31" "2021-10-01" NA           "2021-12-01"
[11] NA           "2022-01-31"

jan31 %m+% months(0:11)

 [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
 [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
[11] "2021-11-30" "2021-12-31"

NYC flights

library(nycflights13)
names(flights)

 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"

flightsWK <- flights |> 
   mutate(ymdday = ymd(paste(year, month,day, sep="-"))) |>
   mutate(weekdy = wday(ymdday, label=TRUE), 
          whichweek = week(ymdday))

head(flightsWK)

# A tibble: 6 × 22
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, ymdday <date>, weekdy <ord>,
#   whichweek <dbl>

flightsWK <- flights |> 
   mutate(ymdday = ymd(paste(year,"-", month,"-",day))) |>
   mutate(weekdy = wday(ymdday, label=TRUE), whichweek = week(ymdday))

flightsWK |> 
  select(year, month, day, ymdday, weekdy, whichweek, dep_time, 
         arr_time, air_time) |>  
   head()

# A tibble: 6 × 9
   year month   day ymdday     weekdy whichweek dep_time arr_time air_time
  <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
1  2013     1     1 2013-01-01 Tue            1      517      830      227
2  2013     1     1 2013-01-01 Tue            1      533      850      227
3  2013     1     1 2013-01-01 Tue            1      542      923      160
4  2013     1     1 2013-01-01 Tue            1      544     1004      183
5  2013     1     1 2013-01-01 Tue            1      554      812      116
6  2013     1     1 2013-01-01 Tue            1      554      740      150

3.6 Iterating functions

There will be times when you will need to iterate a function multiple times.

3.6.1 purrr for functional programming

We will see the R package purrr in greater detail as we go, but for now, let’s get a hint for how it works.

We are going to focus on the map family of functions which will just get us started. Lots of other good purrr functions like pluck() and accumulate() and across() from dplyr.

Much of below is taken from a tutorial by Rebecca Barter.

The map functions are named by the output the produce. For example:

map(.x, .f) is the main mapping function and returns a list
map_df(.x, .f) returns a data frame
map_dbl(.x, .f) returns a numeric (double) vector
map_chr(.x, .f) returns a character vector
map_lgl(.x, .f) returns a logical vector

Note that the first argument is always the data object and the second object is always the function you want to iteratively apply to each element in the input object.

The input to a map function is always either a vector (like a column), a list (which can be non-rectangular), or a dataframe (like a rectangle).

A list is a way to hold things which might be very different in shape:

a_list <- list(a_number = 5,
                      a_vector = c("a", "b", "c"),
                      a_dataframe = data.frame(a = 1:3, 
                                               b = c("q", "b", "z"), 
                                               c = c("bananas", "are", "so very great")))

print(a_list)

$a_number
[1] 5

$a_vector
[1] "a" "b" "c"

$a_dataframe
  a b             c
1 1 q       bananas
2 2 b           are
3 3 z so very great

Consider the following function:

add_ten <- function(x) {
  return(x + 10)
  }

We can map() the add_ten() function across a vector. Note that the output is a list (the default).

library(tidyverse)
map(.x = c(2, 5, 10),
    .f = add_ten)

[[1]]
[1] 12

[[2]]
[1] 15

[[3]]
[1] 20

What if we use a different type of input? The default behavior is to still return a list!

data.frame(a = 2, b = 5, c = 10) |>
  map(add_ten)

$a
[1] 12

$b
[1] 15

$c
[1] 20

What if we want a different type of output? We use a different map() function, map_df(), for example.

data.frame(a = 2, b = 5, c = 10) |>
  map_df(add_ten)

# A tibble: 1 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1    12    15    20

Shorthand lets us get away from pre-defining the function (which will be useful). Use the tilde ~ to indicate that you have a function:

data.frame(a = 2, b = 5, c = 10) |>
  map_df(~{.x + 10})

# A tibble: 1 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1    12    15    20

Mostly, the tilde will be used for functions we already know but want to modify (if we don’t modify, and it has a simple name, we don’t use the tilde):

library(palmerpenguins)
library(broom)

penguins_split <- split(penguins, penguins$species)
penguins_split |>
  map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) |>
  map_df(tidy)  # map(tidy)

# A tibble: 6 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -2536.     965.       -2.63 9.48e- 3
2 flipper_length_mm     32.8      5.08      6.47 1.34e- 9
3 (Intercept)        -3037.     997.       -3.05 3.33e- 3
4 flipper_length_mm     34.6      5.09      6.79 3.75e- 9
5 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
6 flipper_length_mm     54.6      5.03     10.9  1.33e-19

penguins |>
  group_by(species) |>
  group_map(~lm(body_mass_g ~ flipper_length_mm, data = .x)) |>
  map(tidy)  # map_df(tidy)

[[1]]
# A tibble: 2 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        -2536.     965.       -2.63 0.00948      
2 flipper_length_mm     32.8      5.08      6.47 0.00000000134

[[2]]
# A tibble: 2 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        -3037.     997.       -3.05 0.00333      
2 flipper_length_mm     34.6      5.09      6.79 0.00000000375

[[3]]
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -6787.    1093.       -6.21 7.65e- 9
2 flipper_length_mm     54.6      5.03     10.9  1.33e-19

3.7 `reprex()`

Help me help you

In order to create a reproducible example …

Step 1. Copy code onto the clipboard

Step 2. Type reprex() into the Console

Step 3. Look at the Viewer to the right. Copy the Viewer output into GitHub, Piazza, an email, stackexchange, etc.

Some places to learn more about reprex include

A blog about it: https://teachdatascience.com/reprex/
The reprex vignette: https://reprex.tidyverse.org/index.html
reprex dos and donts: https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html
Jenny Bryan webinar on reprex: “Help me help you. Creating reproducible examples” https://resources.rstudio.com/webinars/help-me-help-you-creating-reproducible-examples-jenny-bryan
Some advice: https://stackoverflow.com/help/minimal-reproducible-example

3.7.0.1 `reprex` demo

reprex(
  jan31 + months(0:11) + days(31)
)

multiple lines of code:

reprex({
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})

reprex({
  library(lubridate)
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})

3.8 References

Gobet, Fernand, and Herbert A. Simon. 1996. “Templates in Chess Memory: A Mechanism for Recalling Several Boards.” Cognitive Psychology 31 (1): 1–40. https://doi.org/https://doi.org/10.1006/cogp.1996.0011.

Grolemund, G., and H. Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3). http://www.jstatsoft.org/v40/i03/paper.

Kaplan, Daniel. 2015. Data Computing: An Introduction to Wrangling and Visualization with r. Project Mosaic Books.

Soloway, Elliot, and Kate Ehrlich. 1984. “Empirical Studies of Programming Knowledge.” IEEE Transactions on Software Engineering SE-10 (5).

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/paper.

3.1 Coding Style

3.1.1 Why do style and format matter?

3.1.2 Best practices for writing good code.

3.2 Structure of Data

3.2.1 Building Tidy Data

3.2.2 Examples of Chaining

3.2.2.1 Little Bunny Foo Foo

3.2.3 Data Verbs (on single data frames)

3.3 R examples, basic verbs

3.3.1 Datasets

3.3.2 Examples of Chaining

3.3.3 Data Verbs

3.3.3.1 Starwars

3.3.3.2 NHANES

3.3.4 summarize and group_by

3.3.5 babynames

3.4 Higher Level Data Verbs

3.5 R examples, higher level verbs

3.5.1 pivot_longer()

3.5.2 pivot_wider

3.5.3 join (use join to merge two datasets)

3.5.3.1 First get the data (GapMinder)

3.5.4 lubridate

3.5.4.1 If anyone drove a time machine, they would crash

3.5.4.2 R examples, lubridate

Some basics in lubridate

But how do I create a date object?

NYC flights

3.6 Iterating functions

3.6.1 purrr for functional programming

3.7 reprex()

3.7.0.1 reprex demo

3.8 References

3.3.4 `summarize` and `group_by`

3.5.1 `pivot_longer()`

3.5.2 `pivot_wider`

3.5.3 `join` (use `join` to merge two datasets)

3.7 `reprex()`

3.7.0.1 `reprex` demo