Some new variable types:
A variable’s type determines the values that the variable can take on and the operations that can be performed on it. Specifying variable types ensures the dataset’s integrity and increases performance.
When working with character strings, we might want to detect, replace, or extract certain patterns.
Strings are objects of the character class (abbreviated as <chr> in tibbles). When you print out strings, they display with double quotes:
some_string <- "banana"
some_string
[1] "banana"
Creating strings by hand is useful for testing out regular expressions.
To create a string, type any text in either double quotes " or single quotes '. Using double or single quotes doesn’t matter unless your string itself has single or double quotes.
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
string1
[1] "This is a string"
string2
[1] "If I want to include a \"quote\" inside a string, I use single quotes"
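The backslashes in the printed output above are escapes added by print(), not characters in the string. A minimal sketch of the same idea, where string3 is a hypothetical name introduced only for this illustration:

```r
# The same text written two ways: single-quoted, and double-quoted with
# the inner quotes escaped by a backslash
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
string3 <- "If I want to include a \"quote\" inside a string, I use single quotes"

# Both spellings create exactly the same string
identical(string2, string3)

# writeLines() shows the raw contents, without the escapes that print() adds
writeLines(string3)
```

Either quoting style works; escaping is only needed when the string contains the same kind of quote that delimits it.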
str_*() functions
str_view()
We can view these strings more “naturally” (without the opening and closing quotes) with str_view():
str_view(string1)
[1] │ This is a string
str_view(string2)
[1] │ If I want to include a "quote" inside a string, I use single quotes
str_c()
Similar to paste() (gluing strings together), but works well in a tidy pipeline.
df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
# A tibble: 4 × 2
name greeting
<chr> <chr>
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA> <NA>
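The last row of the output above shows one practical difference from the paste() family: str_c() keeps a missing value missing. A small sketch comparing the two on a made-up vector (the values are chosen only for illustration):

```r
library(stringr)

name <- c("Flora", NA)

# paste0() converts NA to the literal text "NA" before gluing
from_paste <- paste0("Hi ", name, "!")
from_paste

# str_c() propagates the missing value instead
from_str_c <- str_c("Hi ", name, "!")
from_str_c
```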
str_sub()
str_sub(string, start, end) will extract parts of a string, where start and end are the positions where the substring starts and ends. Negative positions count backwards from the end of the string.
fruits <- c("Apple", "Banana", "Pear")
str_sub(fruits, 1, 3)
[1] "App" "Ban" "Pea"
str_sub(fruits, -3, -1)
[1] "ple" "ana" "ear"
str_sub() won’t fail if the string is too short; it returns as much of the string as it can.
str_sub(fruits, 1, 5)
[1] "Apple" "Banan" "Pear"
str_sub() in a pipeline
We can use the str_*() functions inside the mutate() function.
titanic |>
mutate(class1 = str_sub(Class, 1, 1))
Class Sex Age Survived Freq class1
1 1st Male Child No 0 1
2 2nd Male Child No 0 2
3 3rd Male Child No 35 3
4 Crew Male Child No 0 C
5 1st Female Child No 0 1
6 2nd Female Child No 0 2
7 3rd Female Child No 17 3
8 Crew Female Child No 0 C
9 1st Male Adult No 118 1
10 2nd Male Adult No 154 2
11 3rd Male Adult No 387 3
12 Crew Male Adult No 670 C
13 1st Female Adult No 4 1
14 2nd Female Adult No 13 2
15 3rd Female Adult No 89 3
16 Crew Female Adult No 3 C
17 1st Male Child Yes 5 1
18 2nd Male Child Yes 11 2
19 3rd Male Child Yes 13 3
20 Crew Male Child Yes 0 C
21 1st Female Child Yes 1 1
22 2nd Female Child Yes 13 2
23 3rd Female Child Yes 14 3
24 Crew Female Child Yes 0 C
25 1st Male Adult Yes 57 1
26 2nd Male Adult Yes 14 2
27 3rd Male Adult Yes 75 3
28 Crew Male Adult Yes 192 C
29 1st Female Adult Yes 140 1
30 2nd Female Adult Yes 80 2
31 3rd Female Adult Yes 76 3
32 Crew Female Adult Yes 20 C
str_replace*()
str_replace() replaces the first match of a pattern. str_replace_all() replaces all the matches of a pattern.
fruits
[1] "Apple" "Banana" "Pear"
str_replace(fruits, "a", "x")
[1] "Apple" "Bxnana" "Pexr"
str_replace_all(fruits, "a", "x")
[1] "Apple" "Bxnxnx" "Pexr"
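Note that "Apple" is unchanged above because matching is case sensitive. The pattern is also a regular expression, not just fixed text. A short sketch under those two observations (the vowel pattern and the replacement character are chosen only for illustration):

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# A character class matches any one of the listed (lowercase) letters
lower_vowels <- str_replace_all(fruits, "[aeiou]", "-")
lower_vowels

# regex(..., ignore_case = TRUE) also matches the capital "A" in "Apple"
any_vowels <- str_replace_all(fruits, regex("[aeiou]", ignore_case = TRUE), "-")
any_vowels
```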
str_detect()
str_detect(fruits, "a")
[1] FALSE TRUE TRUE
str_detect() can be seamlessly used in a filter() pipeline.
starwars |>
select(name, films) |>
str()
tibble [87 × 2] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
$ films:List of 87
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
..$ : chr "A New Hope"
..$ : chr "A New Hope"
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
..$ : chr "A New Hope"
..$ : chr [1:3] "A New Hope" "Return of the Jedi" "The Phantom Menace"
..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
..$ : chr "A New Hope"
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
..$ : chr "The Empire Strikes Back"
..$ : chr "The Empire Strikes Back"
..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
..$ : chr "The Empire Strikes Back"
..$ : chr [1:2] "Return of the Jedi" "The Force Awakens"
..$ : chr "Return of the Jedi"
..$ : chr "Return of the Jedi"
..$ : chr "Return of the Jedi"
..$ : chr "Return of the Jedi"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "The Phantom Menace"
..$ : chr "Return of the Jedi"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Revenge of the Sith"
..$ : chr "Revenge of the Sith"
..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Revenge of the Sith"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:16] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
$ films:List of 16
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
..$ : chr "The Empire Strikes Back"
..$ : chr "The Empire Strikes Back"
..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
..$ : chr "The Empire Strikes Back"
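As a smaller illustration of str_detect() inside filter(), here is a minimal sketch on a made-up tibble (fruit_df and with_a are hypothetical names, not part of the starwars example):

```r
library(dplyr)
library(stringr)

fruit_df <- tibble(fruit = c("Apple", "Banana", "Pear"))

# str_detect() returns one TRUE/FALSE per row, which is what filter() expects
with_a <- fruit_df |>
  filter(str_detect(fruit, "a"))

with_a$fruit
```

Only the rows where the pattern is found are kept, so the filtered tibble has fewer rows than the original.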
The stringr package within tidyverse contains lots of functions to help process strings. Letting x be a string variable…
| str function | arguments | returns |
|---|---|---|
| str_replace() | x, pattern, replacement | a modified string |
| str_replace_all() | x, pattern, replacement | a modified string |
| str_to_lower() | x | a modified string |
| str_to_upper() | x | a modified string |
| str_sub() | x, start, end | a modified string |
| str_length() | x | a number |
| str_detect() | x, pattern | TRUE/FALSE |
Use the stringr cheatsheet.
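Three functions in the table have not been shown yet. A quick sketch of them on the fruits vector used earlier:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# Number of characters in each string
n_chars <- str_length(fruits)
n_chars

# Change case; the strings are otherwise unchanged
upper <- str_to_upper(fruits)
lower <- str_to_lower(fruits)
upper
lower
```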
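Factor variables are a special type of character variable that can take only a fixed set of values, called levels (abbreviated as <fct> in tibbles). The computer actually stores them as integers (?!?!!?), with a text label attached to each integer.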
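The integer storage can be seen directly in base R. A minimal sketch on a made-up vector (the values are chosen only for illustration):

```r
ideology <- factor(c("Moderate", "Liberal", "Conservative", "Liberal"))

# Levels are sorted alphabetically by default
lev <- levels(ideology)
lev

# Each value is stored as the position of its level
codes <- as.integer(ideology)
codes
```

The default alphabetical ordering is why plots of factor variables often come out in an unhelpful order.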
SurveyUSA poll from 2012 on views of the DREAM Act.
What is off about the data viz part of the report?
openintro::dream
# A tibble: 910 × 2
ideology stance
<fct> <fct>
1 Conservative Yes
2 Conservative Yes
3 Conservative Yes
4 Conservative Yes
5 Conservative Yes
6 Conservative Yes
# ℹ 904 more rows
dream |>
ggplot(aes(x = ideology, fill = stance)) +
geom_bar()
dream |>
select(ideology) |>
pull() |> # levels() works only on vectors, not data frames
levels()
[1] "Conservative" "Liberal" "Moderate"
We can fix the order of the ideology variable. The function fct_relevel() is in the forcats package.
dream |>
mutate(ideology = fct_relevel(ideology,
c("Liberal", "Moderate", "Conservative"))) |>
ggplot(aes(x = ideology, fill = stance)) +
geom_bar()
starbucks |>
select(item, type, calories)
# A tibble: 77 × 3
item type calories
<chr> <fct> <int>
1 8-Grain Roll bakery 350
2 Apple Bran Muffin bakery 350
3 Apple Fritter bakery 420
4 Banana Nut Loaf bakery 490
5 Birthday Cake Mini Doughnut bakery 130
6 Blueberry Oat Bar bakery 370
# ℹ 71 more rows
Let’s say that we wanted to order the types of food item by the average number of calories for that type.
starbucks |>
mutate(type = fct_reorder(type, calories, .fun = "mean", .desc = TRUE)) |>
ggplot(aes(x = type, y = calories)) +
geom_point() +
labs(x = "type of food",
y = "",
title = "Calories for food items at Starbucks")
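Because the effect of fct_reorder() is only visible in the plot, it can help to check the level order on a tiny made-up example. The sketch below also tries two related forcats helpers, fct_infreq() and fct_rev(); the data values are invented for illustration and are not from the starbucks dataset.

```r
library(forcats)

type     <- factor(c("bakery", "bakery", "petite",
                     "sandwich", "sandwich", "sandwich"))
calories <- c(300, 400, 150, 500, 450, 550)

# Mean calories: bakery 350, petite 150, sandwich 500
by_mean <- fct_reorder(type, calories, .fun = mean, .desc = TRUE)
levels(by_mean)

# Order levels from most to least frequent
by_freq <- fct_infreq(type)
levels(by_freq)

# Reverse whatever the current order is
levels(fct_rev(by_freq))
```

None of these functions change the data values; they only change the order of the levels, which is what ggplot uses to arrange the axis.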
The forcats package within tidyverse contains lots of functions to help process factor variables. Use the forcats cheatsheet. We’ll focus on the most common functions.
fct_relevel() = manually reorder levels
fct_reorder() = reorder levels according to values of another variable
fct_infreq() = order levels from highest to lowest frequency
fct_rev() = reverse the current order
fct_recode() = manually change levels
fct_lump() = group together least common levels
The (very well named) R package lubridate is used for wrangling time and date objects (Grolemund and Wickham 2011). In particular, lubridate makes it very easy to work with days, times, and dates. The basic idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want. The lubridate cheatsheet covers much of the basic functionality.
Example from https://lubridate.tidyverse.org/reference/lubridate-package.html
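The lengths of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation, January 31st + one month. Should the answer be:
1. February 31st (an invalid date),
2. the date 31 days after January 31st (early March), or
3. the last day of February?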
A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.
If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
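A minimal sketch of the rule for a single date, assuming jan31 holds January 31, 2021 (the exact definition is an assumption made for this illustration). It also shows why rolling back gives up the a + b - b = a property:

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

# Adding a month would create February 31st, so lubridate returns NA
plus_month <- jan31 + months(1)
plus_month

# %m+% rolls back to the last valid day of the month
rolled <- jan31 %m+% months(1)
rolled

# Subtracting the month again does not return to January 31st
round_trip <- rolled %m-% months(1)
round_trip
```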
Some basics in lubridate.
[1] 3
week(rightnow)
[1] 49
month(rightnow, label=FALSE)
[1] 12
month(rightnow, label=TRUE)
[1] Dec
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
year(rightnow)
[1] 2024
minute(rightnow)
[1] 50
hour(rightnow)
[1] 20
yday(rightnow)
[1] 338
mday(rightnow)
[1] 3
wday(rightnow, label=FALSE)
[1] 3
wday(rightnow, label=TRUE)
[1] Tue
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
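The object rightnow is not defined in the text above; it holds a date-time, and now() is one way to get the current one. To make the accessor functions reproducible, the sketch below uses a fixed date-time instead. The name fixed_time and its value are assumptions chosen to be consistent with the printed results.

```r
library(lubridate)

# A fixed date-time (UTC by default) instead of the current time from now()
fixed_time <- ymd_hms("2024-12-03 20:50:00")

day(fixed_time)     # day of the month
week(fixed_time)    # week of the year
month(fixed_time)   # month as a number
year(fixed_time)
hour(fixed_time)
minute(fixed_time)
yday(fixed_time)    # day of the year
wday(fixed_time)    # day of the week, counting from Sunday = 1
```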
But how do I create a date object?
[1] "2021-01-31" NA "2021-03-31" NA "2021-05-31"
[6] NA "2021-07-31" "2021-08-31" NA "2021-10-31"
[11] NA "2021-12-31"
floor_date(jan31, "month") + months(0:11) + days(31)
[1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
[6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
[11] "2021-12-02" "2022-01-01"
[1] "2021-03-03" NA "2021-05-01" NA "2021-07-01"
[6] NA "2021-08-31" "2021-10-01" NA "2021-12-01"
[11] NA "2022-01-31"
[1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
[6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
[11] "2021-11-30" "2021-12-31"
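Some of the code that produced the outputs above did not survive in these notes. They are consistent with the following sketch, assuming jan31 was created with ymd() as shown; treat this as a reconstruction under that assumption, not the original code.

```r
library(lubridate)

# ymd() parses a year-month-day string into a Date object
jan31 <- ymd("2021-01-31")
class(jan31)

# Plain addition gives NA for months that have no 31st
plain <- jan31 + months(0:11)
plain

# Adding 31 days afterwards keeps the NAs and shifts the valid dates
shifted <- jan31 + months(0:11) + days(31)
shifted

# %m+% rolls back to the last day of each month instead of returning NA
rolled <- jan31 %m+% months(0:11)
rolled
```

Other parsers such as mdy() and dmy() work the same way for strings written in a different order.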
NYC flights
library(nycflights13)
names(flights)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
flightsWK <- flights |>
mutate(ymdday = ymd(paste(year, month,day, sep="-"))) |>
mutate(weekdy = wday(ymdday, label=TRUE),
whichweek = week(ymdday))
head(flightsWK)
# A tibble: 6 × 22
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, ymdday <date>, weekdy <ord>,
# whichweek <dbl>
flightsWK <- flights |>
mutate(ymdday = ymd(paste(year,"-", month,"-",day))) |>
mutate(weekdy = wday(ymdday, label=TRUE), whichweek = week(ymdday))
flightsWK |> select(year, month, day, ymdday, weekdy, whichweek, dep_time,
arr_time, air_time) |>
head()
# A tibble: 6 × 9
year month day ymdday weekdy whichweek dep_time arr_time air_time
<int> <int> <int> <date> <ord> <dbl> <int> <int> <dbl>
1 2013 1 1 2013-01-01 Tue 1 517 830 227
2 2013 1 1 2013-01-01 Tue 1 533 850 227
3 2013 1 1 2013-01-01 Tue 1 542 923 160
4 2013 1 1 2013-01-01 Tue 1 544 1004 183
5 2013 1 1 2013-01-01 Tue 1 554 812 116
6 2013 1 1 2013-01-01 Tue 1 554 740 150
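As an alternative to pasting the pieces into a string and parsing it, lubridate's make_date() builds a date directly from numeric year, month, and day values. A small sketch on made-up vectors rather than the full flights data (yr, mo, and dy are hypothetical names):

```r
library(lubridate)

yr <- c(2013L, 2013L)
mo <- c(1L, 12L)
dy <- c(1L, 31L)

# Approach used above: build a string, then parse it
from_paste <- ymd(paste(yr, mo, dy, sep = "-"))

# Alternative: construct the date from its numeric parts
from_parts <- make_date(yr, mo, dy)

all(from_paste == from_parts)
wday(from_parts, label = TRUE)
week(from_parts)
```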