2 Data Wrangling
As with data visualization, data wrangling is a fundamental part of being able to accurately, reproducibly, and efficiently work with data. The approach taken in the following chapter is based on the philosophy of tidy data and takes many of its precepts from database theory. If you have done much work in SQL, the functionality and approach of tidy data will feel very familiar. The more adept you are at data wrangling, the more effective you will be at data analysis.
Information is what we want, but data are what we’ve got. (Kaplan 2015)
Embrace all the ways to get help!
- cheat sheets: https://www.rstudio.com/resources/cheatsheets/
- tidyverse vignettes: https://www.tidyverse.org/articles/2019/09/tidyr-1-0-0/
- pivoting: https://tidyr.tidyverse.org/articles/pivot.html
- google what you need and include "R tidy" or "tidyverse" in your search
2.1 Structure of Data
For plotting, analyses, model building, etc., it’s important that the data be structured in a very particular way. Hadley Wickham provides a thorough discussion and advice for cleaning up the data in Wickham (2014).
- Tidy Data: rows (cases/observational units) and columns (variables). The key is that every row is a case and *every* column is a variable. No exceptions.
- Creating tidy data is not trivial. We work with objects (often data tables), functions, and arguments (often variables).
The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?
Problem: totals and different sheets
Better for R: longer format with columns - grade, gender, status, service, count (case is still the total pay grade)
Case is individual (?): grade, gender, status, service (no count because each row does the counting)
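One way to see the difference is to sketch a few rows of the longer format by hand. This is a hypothetical sketch; the counts are made up for illustration:

```r
library(tibble)

# a hypothetical sketch of the longer (tidy) format described above;
# the counts here are invented for illustration only
pay_long <- tribble(
  ~grade, ~gender,  ~status,   ~service, ~count,
  "E-1",  "male",   "single",  "Navy",   100,
  "E-1",  "female", "single",  "Navy",   40,
  "E-2",  "male",   "married", "Navy",   80
)

# every row is a case (one grade/gender/status/service combination),
# and every column is a variable
pay_long
```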
2.1.1 Building Tidy Data
Within R (really within any type of computing language, Python, SQL, Java, etc.), we need to understand how to build data using the patterns of the language. Some things to consider:
- `object_name = function_name(arguments)` is a way of using a function to create a new object.
- `object_name = data_table %>% function_name(arguments)` uses chaining syntax as an extension of the ideas of functions. In chaining, the value on the left side of `%>%` becomes the first argument to the function on the right side.
- Extended chaining strings multiple steps together:
  `object_name = data_table %>% function_name(arguments) %>% function_name(arguments)`
  Note that `%>%` is never at the front of the line; it always connects one idea with the continuation of that idea on the next line.
- In R, all functions take arguments in round parentheses (as opposed to subsetting observations or variables from data objects, which happens with square brackets). Additionally, the spot to the left of `%>%` is always a data table.
- The pipe syntax `%>%` should be read as "then".
2.1.2 Examples of Chaining
The pipe syntax (`%>%`) takes a data frame (or data table) and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:
- `x %>% f(y)` is the same as `f(x, y)`
- `y %>% f(x, ., z)` is the same as `f(x, y, z)`
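The mapping rule above can be checked with any function. A minimal sketch (assuming the pipe is available, e.g. via `library(magrittr)` or any tidyverse package):

```r
library(magrittr)

# the piped version ...
piped <- c(1.44, 4.41, 9.61) %>% sqrt() %>% round(1)

# ... is the same computation as the fully nested call round(sqrt(x), 1)
nested <- round(sqrt(c(1.44, 4.41, 9.61)), 1)

identical(piped, nested)  # TRUE
```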
2.1.2.1 Little Bunny Foo Foo
From Hadley Wickham, how to think about chaining.
Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head
The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
Another approach is to concatenate the functions so that there is only one output.
bop(
scoop(
hop(foo_foo, through = forest),
up = field_mice),
on = head)
Or even worse, as one line:
bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)
Instead, the code can be written using the pipe in the order in which the function is evaluated:
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mice) %>%
bop(on = head)
babynames Each year, the US Social Security Administration publishes a list of the most popular names given to babies. In 2014, http://www.ssa.gov/oact/babynames/#ht=2 shows Emma and Olivia leading for girls, Noah and Liam for boys.
The `babynames` data table in the `babynames` package comes from the Social Security Administration’s listing of the names given to babies in each year, and the number of babies of each sex given that name. (Only names with 5 or more babies are published by the SSA.)
2.1.3 Data Verbs (on single data frames)
Super important resource: The RStudio dplyr cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
Data verbs take data tables as input and give data tables as output (that’s how we can use the chaining syntax!). We will use the R package dplyr to do much of our data wrangling. Below is a list of verbs which will be helpful in wrangling many different types of data. See the Data Wrangling cheat sheet from RStudio for additional help. https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
- `sample_n()` takes a random row (or rows)
- `head()` grabs the first few rows
- `tail()` grabs the last few rows
- `filter()` removes unwanted cases
- `arrange()` reorders the cases
- `distinct()` returns the unique values in a table
- `mutate()` transforms the variables (and `transmute()`, like mutate, returns only new variables)
- `group_by()` tells R that SUCCESSIVE functions keep in mind that there are groups of items, so `group_by()` only makes sense with verbs later on (like `summarize()`)
- `summarize()` collapses a data frame to a single row. Some functions that are used within `summarize()` include `min()`, `max()`, `mean()`, `median()`, `sd()`, `sum()`, `n()`, and `n_distinct()`.
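As a quick check on how `group_by()` changes what `summarize()` does, here is a minimal sketch with a toy data table (the `pets` table and its values are invented for illustration, not one of the chapter's datasets):

```r
library(dplyr)

# a toy data table, small enough to verify the summaries by hand
pets <- tibble::tibble(
  species = c("dog", "dog", "cat", "cat", "cat"),
  weight  = c(20, 30, 4, 5, 6)
)

# summarize() alone collapses the whole table to a single row
overall <- pets %>%
  summarize(avg = mean(weight), total = n())

# with group_by(), the same summarize() returns one row per group
by_species <- pets %>%
  group_by(species) %>%
  summarize(avg = mean(weight), total = n())

overall     # avg = 13, total = 5
by_species  # cat: avg 5, n 3; dog: avg 25, n 2
```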
2.2 R examples, basic verbs
2.2.1 Datasets
`starwars` is from dplyr, although originally from SWAPI, the Star Wars API, http://swapi.co/.
`NHANES` is described in `?NHANES`: NHANES is survey data collected by the US National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999, approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination center (MEC).
babynames
Each year, the US Social Security Administration publishes a list of the most popular names given to babies. In 2018, http://www.ssa.gov/oact/babynames/#ht=2 shows Emma and Olivia leading for girls, Noah and Liam for boys. (Only names with 5 or more babies are published by the SSA.)
2.2.2 Examples of Chaining
babynames %>% nrow()
## [1] 1924665
babynames %>% names()
## [1] "year" "sex" "name" "n" "prop"
babynames %>% glimpse()
## Rows: 1,924,665
## Columns: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
## $ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
## $ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.016…
babynames %>% head()
## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
babynames %>% tail()
## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 M Zyhier 5 0.00000255
## 2 2017 M Zykai 5 0.00000255
## 3 2017 M Zykeem 5 0.00000255
## 4 2017 M Zylin 5 0.00000255
## 5 2017 M Zylis 5 0.00000255
## 6 2017 M Zyrie 5 0.00000255
babynames %>% sample_n(size=5)
## # A tibble: 5 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1946 M Owen 392 0.000238
## 2 1953 M Abelardo 34 0.0000170
## 3 1996 F Nicki 50 0.0000261
## 4 1914 F Sofia 38 0.0000477
## 5 1970 M Lorenza 45 0.0000236
babynames %>% mosaic::favstats(n ~ sex, data = .)
## sex min Q1 median Q3 max mean sd n missing
## 1 F 5 7 11 31 99686 151.4294 1180.557 1138293 0
## 2 M 5 7 12 33 94756 223.4940 1932.338 786372 0
2.2.3 Data Verbs
Taken from the dplyr tutorial: http://dplyr.tidyverse.org/
2.2.3.1 Starwars
## [1] 87 14
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships"
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, grey light blue 52 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## gender min Q1 median Q3 max mean sd n missing
## 1 feminine 45 50 55 56.2 75 54.68889 8.591921 9 8
## 2 masculine 15 75 80 88.0 1358 106.14694 184.972677 49 17
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## gender min Q1 median Q3 max mean sd n missing
## 1 feminine 45 50 55 56.2 75 54.68889 8.591921 9 7
## 2 masculine 15 77 80 88.0 1358 109.38222 192.397084 45 16
## # A tibble: 87 × 4
## name hair_color skin_color eye_color
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker blond fair blue
## 2 C-3PO <NA> gold yellow
## 3 R2-D2 <NA> white, blue red
## 4 Darth Vader none white yellow
## 5 Leia Organa brown light brown
## 6 Owen Lars brown, grey light blue
## 7 Beru Whitesun lars brown light blue
## 8 R5-D4 <NA> white, red red
## 9 Biggs Darklighter black light brown
## 10 Obi-Wan Kenobi auburn, white fair blue-gray
## # … with 77 more rows
starwars %>%
dplyr::mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
dplyr::select(name:mass, bmi)
## # A tibble: 87 × 4
## name height mass bmi
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 26.0
## 2 C-3PO 167 75 26.9
## 3 R2-D2 96 32 34.7
## 4 Darth Vader 202 136 33.3
## 5 Leia Organa 150 49 21.8
## 6 Owen Lars 178 120 37.9
## 7 Beru Whitesun lars 165 75 27.5
## 8 R5-D4 97 32 34.0
## 9 Biggs Darklighter 183 84 25.1
## 10 Obi-Wan Kenobi 182 77 23.2
## # … with 77 more rows
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Jabba … 175 1358 <NA> green-tan… orange 600 herm… mascu…
## 2 Grievo… 216 159 none brown, wh… green, y… NA male mascu…
## 3 IG-88 200 140 none metal red 15 none mascu…
## 4 Darth … 202 136 none white yellow 41.9 male mascu…
## 5 Tarfful 234 136 brown brown blue NA male mascu…
## 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
## 7 Bossk 190 113 none green red 53 male mascu…
## 8 Chewba… 228 112 brown unknown blue 200 male mascu…
## 9 Jek To… 180 110 brown fair blue NA male mascu…
## 10 Dexter… 198 102 none brown yellow NA male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
starwars %>%
dplyr::group_by(species) %>%
dplyr::summarize(
num = n(),
mass = mean(mass, na.rm = TRUE)
) %>%
dplyr::filter(num > 1)
## # A tibble: 9 × 3
## species num mass
## <chr> <int> <dbl>
## 1 Droid 6 69.8
## 2 Gungan 3 74
## 3 Human 35 82.8
## 4 Kaminoan 2 88
## 5 Mirialan 2 53.1
## 6 Twi'lek 2 55
## 7 Wookiee 2 124
## 8 Zabrak 2 80
## 9 <NA> 4 48
2.2.3.2 NHANES
## [1] "ID" "SurveyYr" "Gender" "Age"
## [5] "AgeDecade" "AgeMonths" "Race1" "Race3"
## [9] "Education" "MaritalStatus" "HHIncome" "HHIncomeMid"
## [13] "Poverty" "HomeRooms" "HomeOwn" "Work"
## [17] "Weight" "Length" "HeadCirc" "Height"
## [21] "BMI" "BMICatUnder20yrs" "BMI_WHO" "Pulse"
## [25] "BPSysAve" "BPDiaAve" "BPSys1" "BPDia1"
## [29] "BPSys2" "BPDia2" "BPSys3" "BPDia3"
## [33] "Testosterone" "DirectChol" "TotChol" "UrineVol1"
## [37] "UrineFlow1" "UrineVol2" "UrineFlow2" "Diabetes"
## [41] "DiabetesAge" "HealthGen" "DaysPhysHlthBad" "DaysMentHlthBad"
## [45] "LittleInterest" "Depressed" "nPregnancies" "nBabies"
## [49] "Age1stBaby" "SleepHrsNight" "SleepTrouble" "PhysActive"
## [53] "PhysActiveDays" "TVHrsDay" "CompHrsDay" "TVHrsDayChild"
## [57] "CompHrsDayChild" "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n" "SmokeAge"
## [65] "Marijuana" "AgeFirstMarij" "RegularMarij" "AgeRegMarij"
## [69] "HardDrugs" "SexEver" "SexAge" "SexNumPartnLife"
## [73] "SexNumPartYear" "SameSex" "SexOrientation" "PregnantNow"
# find the sleep variables
NHANESsleep <- NHANES %>% select(Gender, Age, Weight, Race1, Race3,
Education, SleepTrouble, SleepHrsNight,
TVHrsDay, TVHrsDayChild, PhysActive)
names(NHANESsleep)
## [1] "Gender" "Age" "Weight" "Race1"
## [5] "Race3" "Education" "SleepTrouble" "SleepHrsNight"
## [9] "TVHrsDay" "TVHrsDayChild" "PhysActive"
dim(NHANESsleep)
## [1] 10000 11
# subset for college students
NHANESsleep <- NHANESsleep %>% filter(Age %in% c(18:22)) %>%
mutate(Weightlb = Weight*2.2)
names(NHANESsleep)
## [1] "Gender" "Age" "Weight" "Race1"
## [5] "Race3" "Education" "SleepTrouble" "SleepHrsNight"
## [9] "TVHrsDay" "TVHrsDayChild" "PhysActive" "Weightlb"
dim(NHANESsleep)
## [1] 655 12
NHANESsleep %>% ggplot(aes(x=Age, y=SleepHrsNight, color=Gender)) +
geom_point(position=position_jitter(width=.25, height=0) ) +
facet_grid(SleepTrouble ~ TVHrsDay)
2.2.4 summarize() and group_by()
NHANES %>% summarize(n())
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 10000
# total weight of all the people in NHANES (silly)
NHANES %>% mutate(Weightlb = Weight*2.2) %>% summarize(sum(Weightlb, na.rm=TRUE))
## # A tibble: 1 × 1
## `sum(Weightlb, na.rm = TRUE)`
## <dbl>
## 1 1549419.
# mean weight of all the people in NHANES
NHANES %>% mutate(Weightlb = Weight*2.2) %>% summarize(mean(Weightlb, na.rm=TRUE))
## # A tibble: 1 × 1
## `mean(Weightlb, na.rm = TRUE)`
## <dbl>
## 1 156.
# repeat the above but for groups
# males versus females
NHANES %>% group_by(Gender) %>% summarize(n())
## # A tibble: 2 × 2
## Gender `n()`
## <fct> <int>
## 1 female 5020
## 2 male 4980
NHANES %>% group_by(Gender) %>% mutate(Weightlb = Weight*2.2) %>%
summarize(mean(Weightlb, na.rm=TRUE))
## # A tibble: 2 × 2
## Gender `mean(Weightlb, na.rm = TRUE)`
## <fct> <dbl>
## 1 female 146.
## 2 male 167.
NHANES %>% group_by(SmokeNow) %>% summarize(n())
## # A tibble: 3 × 2
## SmokeNow `n()`
## <fct> <int>
## 1 No 1745
## 2 Yes 1466
## 3 <NA> 6789
NHANES %>% group_by(SmokeNow) %>% mutate(Weightlb = Weight*2.2) %>%
summarize(mean(Weightlb, na.rm=TRUE))
## # A tibble: 3 × 2
## SmokeNow `mean(Weightlb, na.rm = TRUE)`
## <fct> <dbl>
## 1 No 186.
## 2 Yes 177.
## 3 <NA> 144.
NHANES %>% group_by(Diabetes) %>% summarize(n())
## # A tibble: 3 × 2
## Diabetes `n()`
## <fct> <int>
## 1 No 9098
## 2 Yes 760
## 3 <NA> 142
NHANES %>% group_by(Diabetes) %>% mutate(Weightlb = Weight*2.2) %>%
summarize(mean(Weightlb, na.rm=TRUE))
## # A tibble: 3 × 2
## Diabetes `mean(Weightlb, na.rm = TRUE)`
## <fct> <dbl>
## 1 No 155.
## 2 Yes 202.
## 3 <NA> 21.6
# break down the smokers versus non-smokers further, by sex
NHANES %>% group_by(SmokeNow, Gender) %>% summarize(n())
## # A tibble: 6 × 3
## # Groups: SmokeNow [3]
## SmokeNow Gender `n()`
## <fct> <fct> <int>
## 1 No female 764
## 2 No male 981
## 3 Yes female 638
## 4 Yes male 828
## 5 <NA> female 3618
## 6 <NA> male 3171
NHANES %>% group_by(SmokeNow, Gender) %>% mutate(Weightlb = Weight*2.2) %>%
summarize(mean(Weightlb, na.rm=TRUE))
## # A tibble: 6 × 3
## # Groups: SmokeNow [3]
## SmokeNow Gender `mean(Weightlb, na.rm = TRUE)`
## <fct> <fct> <dbl>
## 1 No female 167.
## 2 No male 201.
## 3 Yes female 167.
## 4 Yes male 185.
## 5 <NA> female 138.
## 6 <NA> male 151.
# break down the people with diabetes further, by smoking
NHANES %>% group_by(Diabetes, SmokeNow) %>% summarize(n())
## # A tibble: 8 × 3
## # Groups: Diabetes [3]
## Diabetes SmokeNow `n()`
## <fct> <fct> <int>
## 1 No No 1476
## 2 No Yes 1360
## 3 No <NA> 6262
## 4 Yes No 267
## 5 Yes Yes 106
## 6 Yes <NA> 387
## 7 <NA> No 2
## 8 <NA> <NA> 140
NHANES %>% group_by(Diabetes, SmokeNow) %>% mutate(Weightlb = Weight*2.2) %>%
summarize(mean(Weightlb, na.rm=TRUE))
## # A tibble: 8 × 3
## # Groups: Diabetes [3]
## Diabetes SmokeNow `mean(Weightlb, na.rm = TRUE)`
## <fct> <fct> <dbl>
## 1 No No 183.
## 2 No Yes 175.
## 3 No <NA> 143.
## 4 Yes No 204.
## 5 Yes Yes 204.
## 6 Yes <NA> 199.
## 7 <NA> No 193.
## 8 <NA> <NA> 19.1
2.2.5 babynames
## # A tibble: 2 × 2
## sex total
## <chr> <int>
## 1 F 172371079
## 2 M 175749438
## # A tibble: 6 × 3
## # Groups: year [3]
## year sex name_count
## <dbl> <chr> <int>
## 1 1880 F 942
## 2 1880 M 1058
## 3 1881 F 938
## 4 1881 M 997
## 5 1882 F 1028
## 6 1882 M 1099
## # A tibble: 6 × 3
## # Groups: year [3]
## year sex name_count
## <dbl> <chr> <int>
## 1 2015 F 19074
## 2 2015 M 14024
## 3 2016 F 18817
## 4 2016 M 14162
## 5 2017 F 18309
## 6 2017 M 14160
## .
## 1896 1915 1922 1924 1926 1927 1928 1933 1940 1942 1946 1953 1955 1963 1966 1975
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1980 1981 1982 1984 1985 1989 1990 1991 1992 1994 1996 1997 1999 2000 2002 2004
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2006 2007 2009 2010 2014 2015 2017
## 1 1 1 1 1 1 1
## .
## 1896 1915 1922 1924 1926 1927 1928 1933 1940 1942 1946 1953 1955 1963 1966 1975
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1980 1981 1982 1984 1985 1989 1990 1991 1992 1994 1996 1997 1999 2000 2002 2004
## 1 2 2 1 1 3 1 1 1 1 1 2 2 1 2 2
## 2006 2007 2009 2010 2014 2015 2017
## 1 1 3 1 2 1 1
2.3 Higher Level Data Verbs
There are more complicated verbs which may be important for more sophisticated analyses. See the RStudio dplyr cheat sheet, https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf.
- `pivot_longer()` makes many columns into 2 columns: `pivot_longer(data, cols, names_to = , values_to = )`
- `pivot_wider()` makes one column into multiple columns: `pivot_wider(data, names_from = , values_from = )`
- `left_join()` returns all rows from the left table, and any rows with matching keys from the right table.
- `inner_join()` returns only the rows in which the left table has matching keys in the right table (i.e., matching rows in both sets).
- `full_join()` returns all rows from both tables, joining records from the left which have matching keys in the right table.
Good practice: always specify the `by` argument when joining data frames.
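The different join behaviors can be seen on two tiny hypothetical tables (the country codes and values here are invented), always specifying `by`:

```r
library(dplyr)

# two toy tables keyed by country; only "B" appears in both
lit <- tibble::tibble(country = c("A", "B"), litRate = c(90, 80))
gdp <- tibble::tibble(country = c("B", "C"), gdp = c(100, 200))

lj <- left_join(lit, gdp, by = "country")   # A and B; gdp is NA for A
ij <- inner_join(lit, gdp, by = "country")  # only B, the key in both tables
fj <- full_join(lit, gdp, by = "country")   # A, B, and C
```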
2.4 R examples, higher level verbs
tidyr 1.0.0 has just been released! The new release means that you need to update tidyr. You will know if you have the latest version if the following command works in the console (window below):
?tidyr::pivot_longer
If you are familiar with `spread` and `gather`, you should acquaint yourself with `pivot_longer()` and `pivot_wider()`. The idea is to go from very wide dataframes to very long dataframes and vice versa.
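A minimal sketch of the round trip on a toy wide table (the years and values are invented for illustration):

```r
library(tidyr)
library(dplyr)

# a small wide table: one column per year
wide <- tibble::tibble(
  country = c("A", "B"),
  `1990`  = c(10, 20),
  `2000`  = c(11, 22)
)

# wide -> long: every year column becomes a (year, value) pair
long <- wide %>%
  pivot_longer(-country, names_to = "year", values_to = "value")

# long -> wide: pivot_wider() undoes pivot_longer()
wide_again <- long %>%
  pivot_wider(names_from = "year", values_from = "value")
```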
2.4.1 pivot_longer()
Can we `pivot` the military pay grade data to become longer?
library(googlesheets4)
gs4_deauth()
navy_gs = read_sheet("https://docs.google.com/spreadsheets/d/1Ow6Cm4z-Z1Yybk3i352msulYCEDOUaOghmo9ALajyHo/edit#gid=1877566408",
col_types = "ccnnnnnnnnnnnnnnn")
glimpse(navy_gs)
## Rows: 38
## Columns: 17
## $ ...1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `Active Duty Family` <chr> NA, "Marital Status Report", NA, "Data Reflect Se…
## $ ...3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 31229, 53094, 131…
## $ ...4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5717, 8388, 21019…
## $ ...5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 36946, 61482, 152…
## $ ...6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 563, 1457, 4264, …
## $ ...7 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 122, 275, 1920, 4…
## $ ...8 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 685, 1732, 6184, …
## $ ...9 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 139, 438, 3579, 8…
## $ ...10 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 141, 579, 4902, 9…
## $ ...11 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 280, 1017, 8481, …
## $ ...12 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5060, 12483, 5479…
## $ ...13 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 719, 1682, 6641, …
## $ ...14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 5779, 14165, 6143…
## $ ...15 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 36991, 67472, 193…
## $ ...16 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 6699, 10924, 3448…
## $ ...17 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 43690, 78396, 228…
names(navy_gs) = c("X","pay.grade", "male.sing.wo", "female.sing.wo",
"tot.sing.wo", "male.sing.w", "female.sing.w",
"tot.sing.w", "male.joint.NA", "female.joint.NA",
"tot.joint.NA", "male.civ.NA", "female.civ.NA",
"tot.civ.NA", "male.tot.NA", "female.tot.NA",
"tot.tot.NA")
navy = navy_gs[-c(1:8), -1]
dplyr::glimpse(navy)
## Rows: 30
## Columns: 16
## $ pay.grade <chr> "E-1", "E-2", "E-3", "E-4", "E-5", "E-6", "E-7", "E-8"…
## $ male.sing.wo <dbl> 31229, 53094, 131091, 112710, 57989, 19125, 5446, 1009…
## $ female.sing.wo <dbl> 5717, 8388, 21019, 16381, 11021, 4654, 1913, 438, 202,…
## $ tot.sing.wo <dbl> 36946, 61482, 152110, 129091, 69010, 23779, 7359, 1447…
## $ male.sing.w <dbl> 563, 1457, 4264, 9491, 10937, 10369, 6530, 1786, 579, …
## $ female.sing.w <dbl> 122, 275, 1920, 4662, 6576, 4962, 2585, 513, 144, 2175…
## $ tot.sing.w <dbl> 685, 1732, 6184, 14153, 17513, 15331, 9115, 2299, 723,…
## $ male.joint.NA <dbl> 139, 438, 3579, 8661, 12459, 8474, 5065, 1423, 458, 40…
## $ female.joint.NA <dbl> 141, 579, 4902, 9778, 11117, 6961, 3291, 651, 150, 375…
## $ tot.joint.NA <dbl> 280, 1017, 8481, 18439, 23576, 15435, 8356, 2074, 608,…
## $ male.civ.NA <dbl> 5060, 12483, 54795, 105556, 130944, 110322, 70001, 210…
## $ female.civ.NA <dbl> 719, 1682, 6641, 9961, 8592, 5827, 3206, 820, 291, 377…
## $ tot.civ.NA <dbl> 5779, 14165, 61436, 115517, 139536, 116149, 73207, 218…
## $ male.tot.NA <dbl> 36991, 67472, 193729, 236418, 212329, 148290, 87042, 2…
## $ female.tot.NA <dbl> 6699, 10924, 34482, 40782, 37306, 22404, 10995, 2422, …
## $ tot.tot.NA <dbl> 43690, 78396, 228211, 277200, 249635, 170694, 98037, 2…
# get rid of total columns & rows:
navyWR = navy %>% select(-contains("tot")) %>%
filter(substr(pay.grade, 1, 5) != "TOTAL" &
substr(pay.grade, 1, 5) != "GRAND" ) %>%
pivot_longer(-pay.grade,
values_to = "numPeople",
names_to = "status") %>%
separate(status, into = c("sex", "marital", "kids"))
navyWR %>% head()
## # A tibble: 6 × 5
## pay.grade sex marital kids numPeople
## <chr> <chr> <chr> <chr> <dbl>
## 1 E-1 male sing wo 31229
## 2 E-1 female sing wo 5717
## 3 E-1 male sing w 563
## 4 E-1 female sing w 122
## 5 E-1 male joint NA 139
## 6 E-1 female joint NA 141
Does a graph tell us if we did it right? What if we had done it wrong…?
navyWR %>% ggplot(aes(x=pay.grade, y=numPeople, color=sex)) +
geom_point() +
facet_grid(kids ~ marital) +
theme_minimal() +
scale_color_viridis_d() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
hjust = 1, size = rel(.5)))
2.4.2 pivot_wider
library(babynames)
babynames %>% dplyr::select(-prop) %>%
tidyr::pivot_wider(names_from = sex, values_from = n)
## # A tibble: 1,756,284 × 4
## year name F M
## <dbl> <chr> <int> <int>
## 1 1880 Mary 7065 27
## 2 1880 Anna 2604 12
## 3 1880 Emma 2003 10
## 4 1880 Elizabeth 1939 9
## 5 1880 Minnie 1746 9
## 6 1880 Margaret 1578 NA
## 7 1880 Ida 1472 8
## 8 1880 Alice 1414 NA
## 9 1880 Bertha 1320 NA
## 10 1880 Sarah 1288 NA
## # … with 1,756,274 more rows
babynames %>%
select(-prop) %>%
pivot_wider(names_from = sex, values_from = n) %>%
filter(!is.na(F) & !is.na(M)) %>%
arrange(desc(year), desc(M))
## # A tibble: 168,381 × 4
## year name F M
## <dbl> <chr> <int> <int>
## 1 2017 Liam 36 18728
## 2 2017 Noah 170 18326
## 3 2017 William 18 14904
## 4 2017 James 77 14232
## 5 2017 Logan 1103 13974
## 6 2017 Benjamin 8 13733
## 7 2017 Mason 58 13502
## 8 2017 Elijah 26 13268
## 9 2017 Oliver 15 13141
## 10 2017 Jacob 16 13106
## # … with 168,371 more rows
babynames %>%
pivot_wider(names_from = sex, values_from = n) %>%
filter(!is.na(F) & !is.na(M)) %>%
arrange(desc(prop))
## # A tibble: 12 × 5
## year name prop F M
## <dbl> <chr> <dbl> <int> <int>
## 1 1986 Marquette 0.0000130 24 25
## 2 1996 Dariel 0.0000115 22 23
## 3 2014 Laramie 0.0000108 21 22
## 4 1939 Earnie 0.00000882 10 10
## 5 1939 Vertis 0.00000882 10 10
## 6 1921 Vernis 0.00000703 9 8
## 7 1939 Alvia 0.00000529 6 6
## 8 1939 Eudell 0.00000529 6 6
## 9 1939 Ladell 0.00000529 6 6
## 10 1939 Lory 0.00000529 6 6
## 11 1939 Maitland 0.00000529 6 6
## 12 1939 Delaney 0.00000441 5 5
2.4.3 join
(use `join` to merge two datasets)
2.4.3.1 First get the data (GapMinder)
Both of the following datasets come from GapMinder. The first represents country, year, and female literacy rate. The second represents country, year, and GDP (in fixed 2000 US$).
gs4_deauth()
litF = read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")
litF = litF %>% select(country=starts_with("Adult"),
starts_with("1"), starts_with("2")) %>%
pivot_longer(-country,
names_to = "year",
values_to = "litRateF") %>%
filter(!is.na(litRateF))
gs4_deauth()
GDP = read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
GDP = GDP %>% select(country = starts_with("Income"),
starts_with("1"), starts_with("2")) %>%
pivot_longer(-country,
names_to = "year",
values_to = "gdp") %>%
filter(!is.na(gdp))
head(litF)
## # A tibble: 6 × 3
## country year litRateF
## <chr> <chr> <dbl>
## 1 Afghanistan 1979 4.99
## 2 Afghanistan 2011 13
## 3 Albania 2001 98.3
## 4 Albania 2008 94.7
## 5 Albania 2011 95.7
## 6 Algeria 1987 35.8
head(GDP)
## # A tibble: 6 × 3
## country year gdp
## <chr> <chr> <dbl>
## 1 Albania 1980 1061.
## 2 Albania 1981 1100.
## 3 Albania 1982 1111.
## 4 Albania 1983 1101.
## 5 Albania 1984 1065.
## 6 Albania 1985 1060.
# left
litGDPleft = left_join(litF, GDP, by=c("country", "year"))
dim(litGDPleft)
## [1] 571 4
sum(is.na(litGDPleft$gdp))
## [1] 66
head(litGDPleft)
## # A tibble: 6 × 4
## country year litRateF gdp
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1979 4.99 NA
## 2 Afghanistan 2011 13 NA
## 3 Albania 2001 98.3 1282.
## 4 Albania 2008 94.7 1804.
## 5 Albania 2011 95.7 1966.
## 6 Algeria 1987 35.8 1902.
# right
litGDPright = right_join(litF, GDP, by=c("country", "year"))
dim(litGDPright)
## [1] 7988 4
sum(is.na(litGDPright$gdp))
## [1] 0
head(litGDPright)
## # A tibble: 6 × 4
## country year litRateF gdp
## <chr> <chr> <dbl> <dbl>
## 1 Albania 2001 98.3 1282.
## 2 Albania 2008 94.7 1804.
## 3 Albania 2011 95.7 1966.
## 4 Algeria 1987 35.8 1902.
## 5 Algeria 2002 60.1 1872.
## 6 Algeria 2006 63.9 2125.
# inner
litGDPinner = inner_join(litF, GDP, by=c("country", "year"))
dim(litGDPinner)
## [1] 505 4
sum(is.na(litGDPinner$gdp))
## [1] 0
head(litGDPinner)
## # A tibble: 6 × 4
## country year litRateF gdp
## <chr> <chr> <dbl> <dbl>
## 1 Albania 2001 98.3 1282.
## 2 Albania 2008 94.7 1804.
## 3 Albania 2011 95.7 1966.
## 4 Algeria 1987 35.8 1902.
## 5 Algeria 2002 60.1 1872.
## 6 Algeria 2006 63.9 2125.
# full
litGDPfull = full_join(litF, GDP, by=c("country", "year"))
dim(litGDPfull)
## [1] 8054 4
sum(is.na(litGDPfull$gdp))
## [1] 66
head(litGDPfull)
## # A tibble: 6 × 4
## country year litRateF gdp
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1979 4.99 NA
## 2 Afghanistan 2011 13 NA
## 3 Albania 2001 98.3 1282.
## 4 Albania 2008 94.7 1804.
## 5 Albania 2011 95.7 1966.
## 6 Algeria 1987 35.8 1902.
2.4.4 lubridate
lubridate is another R package meant for data wrangling (Grolemund and Wickham 2011). In particular, lubridate makes it very easy to work with days, times, and dates. The base idea is to start with dates in a `ymd` (year month day) format and transform the information into whatever you want. The linked table is from the original paper and provides many of the basic lubridate commands: http://blog.yhathq.com/static/pdf/R_date_cheat_sheet.pdf.
Example from https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
2.4.4.1 If anyone drove a time machine, they would crash
The length of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation, January 31st + one month. Should the answer be:
1. February 31st (which doesn’t exist)
2. March 4th (31 days after January 31), or
3. February 28th (assuming it's not a leap year)
A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.
If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special `%m+%` and `%m-%` operators. `%m+%` and `%m-%` automatically roll dates back to the last day of the month, should that be necessary.
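A quick sketch of the NA rule and the rollback operator side by side:

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

jan31 + months(1)     # NA: February 31st is not a valid date
jan31 %m+% months(1)  # "2021-02-28": rolled back to the last day of February
```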
2.4.4.2 R examples, lubridate()
Some basics in lubridate (using rightnow <- now() as the reference time):
day(rightnow)
## [1] 11
week(rightnow)
## [1] 2
month(rightnow, label=FALSE)
## [1] 1
month(rightnow, label=TRUE)
## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
year(rightnow)
## [1] 2022
minute(rightnow)
## [1] 55
hour(rightnow)
## [1] 15
yday(rightnow)
## [1] 11
mday(rightnow)
## [1] 11
wday(rightnow, label=FALSE)
## [1] 3
wday(rightnow, label=TRUE)
## [1] Tue
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
But how do I create a date object? Use a parsing function like `ymd()`:
jan31 <- ymd("2021-01-31")
jan31 + months(0:11)
## [1] "2021-01-31" NA "2021-03-31" NA "2021-05-31"
## [6] NA "2021-07-31" "2021-08-31" NA "2021-10-31"
## [11] NA "2021-12-31"
floor_date(jan31, "month") + months(0:11) + days(31)
## [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
## [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
## [11] "2021-12-02" "2022-01-01"
jan31 + months(0:11) + days(31)
## [1] "2021-03-03" NA "2021-05-01" NA "2021-07-01"
## [6] NA "2021-08-31" "2021-10-01" NA "2021-12-01"
## [11] NA "2022-01-31"
jan31 %m+% months(0:11)
## [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
## [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
## [11] "2021-11-30" "2021-12-31"
NYC flights
library(nycflights13)
names(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
flightsWK <- flights %>%
mutate(ymdday = ymd(paste(year, month,day, sep="-"))) %>%
mutate(weekdy = wday(ymdday, label=TRUE),
whichweek = week(ymdday))
head(flightsWK)
## # A tibble: 6 × 22
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, ymdday <date>, weekdy <ord>,
## # whichweek <dbl>
flightsWK <- flights %>%
mutate(ymdday = ymd(paste(year,"-", month,"-",day))) %>%
mutate(weekdy = wday(ymdday, label=TRUE), whichweek = week(ymdday))
flightsWK %>% select(year, month, day, ymdday, weekdy, whichweek, dep_time,
arr_time, air_time) %>%
head()
## # A tibble: 6 × 9
## year month day ymdday weekdy whichweek dep_time arr_time air_time
## <int> <int> <int> <date> <ord> <dbl> <int> <int> <dbl>
## 1 2013 1 1 2013-01-01 Tue 1 517 830 227
## 2 2013 1 1 2013-01-01 Tue 1 533 850 227
## 3 2013 1 1 2013-01-01 Tue 1 542 923 160
## 4 2013 1 1 2013-01-01 Tue 1 544 1004 183
## 5 2013 1 1 2013-01-01 Tue 1 554 812 116
## 6 2013 1 1 2013-01-01 Tue 1 554 740 150
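The new date variables make grouped summaries easy. As a quick sketch (self-contained, assuming the nycflights13, dplyr, and lubridate packages), we can rebuild the weekday variable and count flights by day of week:

```r
library(nycflights13)
library(dplyr)
library(lubridate)

# Rebuild the date variables from above, then summarize by weekday
flights %>%
  mutate(ymdday = ymd(paste(year, month, day, sep = "-")),
         weekdy = wday(ymdday, label = TRUE)) %>%
  group_by(weekdy) %>%
  summarize(n = n(),
            mean_dep_delay = mean(dep_delay, na.rm = TRUE))
```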
2.5 purrr for functional programming
We will see the R package purrr in greater detail as we go, but for now, let’s get a hint for how it works.
We are going to focus on the map family of functions, which will get us started. purrr has lots of other good functions, like pluck() and accumulate().
Much of what follows is taken from a tutorial by Rebecca Barter.
The map functions are named by the output they produce. For example:
- map(.x, .f) is the main mapping function and returns a list
- map_df(.x, .f) returns a data frame
- map_dbl(.x, .f) returns a numeric (double) vector
- map_chr(.x, .f) returns a character vector
- map_lgl(.x, .f) returns a logical vector
Note that the first argument is always the data object, and the second argument is always the function you want to iteratively apply to each element of the input object.
The input to a map function is always either a vector (like a column), a list (which can be non-rectangular), or a dataframe (like a rectangle).
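To see how the suffix controls the output type, here is a small sketch applying the same anonymous function with three of the typed map variants:

```r
library(purrr)

# The suffix controls the output type; the function is applied
# the same way in each case
x <- c(2, 5, 10)
map_dbl(x, ~ .x + 10)            # numeric vector: 12 15 20
map_chr(x, ~ paste0("id_", .x))  # character vector: "id_2" "id_5" "id_10"
map_lgl(x, ~ .x > 4)             # logical vector: FALSE TRUE TRUE
```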
A list is a way to hold things which might be very different in shape:
a_list <- list(a_number = 5,
a_vector = c("a", "b", "c"),
a_dataframe = data.frame(a = 1:3,
b = c("q", "b", "z"),
c = c("bananas", "are", "so very great")))
print(a_list)
## $a_number
## [1] 5
##
## $a_vector
## [1] "a" "b" "c"
##
## $a_dataframe
## a b c
## 1 1 q bananas
## 2 2 b are
## 3 3 z so very great
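Since pluck() was mentioned above, here is a quick sketch of using it to dig pieces out of a nested list like this one (by name or by position):

```r
library(purrr)

a_list <- list(a_number = 5,
               a_vector = c("a", "b", "c"),
               a_dataframe = data.frame(a = 1:3,
                                        b = c("q", "b", "z"),
                                        c = c("bananas", "are", "so very great")))

# pluck() indexes into a list one level at a time
pluck(a_list, "a_number")             # 5
pluck(a_list, "a_vector", 2)          # "b"
pluck(a_list, "a_dataframe", "c", 3)  # "so very great"
```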
Consider the following function:
add_ten <- function(x) {
return(x + 10)
}
We can map() the add_ten() function across a vector. Note that the output is a list (the default):
map(c(2, 5, 10), add_ten)
## [[1]]
## [1] 12
##
## [[2]]
## [1] 15
##
## [[3]]
## [1] 20
What if we use a different type of input? The default behavior is to still return a list!
data.frame(a = 2, b = 5, c = 10) %>%
map(add_ten)
## $a
## [1] 12
##
## $b
## [1] 15
##
## $c
## [1] 20
What if we want a different type of output? We use a different map function, for example map_df().
data.frame(a = 2, b = 5, c = 10) %>%
map_df(add_ten)
## # A tibble: 1 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 12 15 20
Shorthand lets us get away from pre-defining the function (which will be useful). Use the tilde ~ to indicate that you have a function:
data.frame(a = 2, b = 5, c = 10) %>%
map_df(~{.x + 10})
## # A tibble: 1 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 12 15 20
Mostly, the tilde will be used for functions we already know:
library(palmerpenguins)
library(broom)
penguins %>%
split(.$species) %>%
map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
map_df(tidy) # map(tidy)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2536. 965. -2.63 9.48e- 3
## 2 flipper_length_mm 32.8 5.08 6.47 1.34e- 9
## 3 (Intercept) -3037. 997. -3.05 3.33e- 3
## 4 flipper_length_mm 34.6 5.09 6.79 3.75e- 9
## 5 (Intercept) -6787. 1093. -6.21 7.65e- 9
## 6 flipper_length_mm 54.6 5.03 10.9 1.33e-19
penguins %>%
group_by(species) %>%
group_map(~lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
map(tidy) # map_df(tidy)
## [[1]]
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2536. 965. -2.63 0.00948
## 2 flipper_length_mm 32.8 5.08 6.47 0.00000000134
##
## [[2]]
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3037. 997. -3.05 0.00333
## 2 flipper_length_mm 34.6 5.09 6.79 0.00000000375
##
## [[3]]
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6787. 1093. -6.21 7.65e- 9
## 2 flipper_length_mm 54.6 5.03 10.9 1.33e-19
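The same split-then-map pattern pairs naturally with the typed map variants. A sketch (assuming palmerpenguins, as above) that pulls out just the slope for each species as a named numeric vector:

```r
library(palmerpenguins)
library(purrr)

# One model per species, then one number per model:
# map_dbl() returns a named numeric vector of slopes
penguins %>%
  split(.$species) %>%
  map(~ lm(body_mass_g ~ flipper_length_mm, data = .x)) %>%
  map_dbl(~ coef(.x)[["flipper_length_mm"]])
```

The slopes match the flipper_length_mm estimates in the tidy output above, but now they come back in a form that is easy to compare or plot.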
2.6 reprex()
Help me help you
In order to create a reproducible example …
Step 1. Copy code onto the clipboard
Step 2. Type reprex()
into the Console
Step 3. Look at the Viewer to the right. Copy the Viewer output into GitHub, Piazza, an email, stackexchange, etc.
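If you prefer not to use the clipboard, reprex() can also take code directly as its first argument. A minimal sketch (assuming the reprex package is installed):

```r
library(reprex)

# Render a small reproducible example without touching the clipboard;
# the formatted output appears in the Viewer
reprex({
  x <- c(1, 2, 3, NA)
  mean(x, na.rm = TRUE)
})
```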
Some places to learn more about reprex include:
- A blog post about it: https://teachdatascience.com/reprex/
- The reprex vignette: https://reprex.tidyverse.org/index.html
- reprex dos and don'ts: https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html
- Jenny Bryan's webinar on reprex, “Help me help you. Creating reproducible examples”: https://resources.rstudio.com/webinars/help-me-help-you-creating-reproducible-examples-jenny-bryan
- Some advice on minimal reproducible examples: https://stackoverflow.com/help/minimal-reproducible-example