8  Text analysis

8.1 Variable Types

Some new variable types:

  • character strings
  • factor variables
  • dates
  • numeric
  • logical

A variable’s type determines the values that the variable can take on and the operations that can be performed on it. Specifying variable types helps protect the dataset’s integrity (a date column, for example, cannot silently hold arbitrary text) and can improve performance.
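As a rough illustration of how type governs behavior, the base R sketch below asks class() for the type of one value of each kind and then tries an operation that only makes sense for numbers. The object names (types, sum_ok) are made up for this example.

```r
# Each value below has a different type, reported by class()
types <- c(
  string  = class("banana"),
  factor  = class(factor("banana")),
  date    = class(as.Date("2021-01-31")),
  numeric = class(3.14),
  logical = class(TRUE)
)
types

# The type decides which operations are allowed
1 + 1                                  # arithmetic works on numbers
sum_ok <- try("1" + 1, silent = TRUE)  # fails: "1" is a character string
class(sum_ok)                          # "try-error"
```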

8.2 Character strings

When working with character strings, we might want to detect, replace, or extract certain patterns.

Strings are objects of the character class (abbreviated as <chr> in tibbles). When you print out strings, they display with double quotes:

some_string <- "banana"
some_string
[1] "banana"

8.2.1 Creating strings

Creating strings by hand is useful for testing out regular expressions.

To create a string, type any text in either double quotes " or single quotes '. The choice doesn’t matter unless the string itself contains quotes; in that case, wrap the string in the other kind of quote.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

string1
[1] "This is a string"
string2
[1] "If I want to include a \"quote\" inside a string, I use single quotes"
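The backslashes in the printed output above are escapes, not part of the string. A small sketch, using a made-up string3, of writing the escape by hand and checking what the string really contains:

```r
# Escape a double quote inside a double-quoted string with a backslash
string3 <- "She said \"hello\" to me"

string3              # printing shows the escapes
writeLines(string3)  # writeLines() shows the raw contents

# The backslashes are not counted as characters
n_chars <- nchar(string3)
n_chars
```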

8.2.2 Working with str_*() functions

8.2.2.1 str_view()

We can view these strings more “naturally” (without the opening and closing quotes) with str_view():

str_view(string1)
[1] │ This is a string
str_view(string2)
[1] │ If I want to include a "quote" inside a string, I use single quotes

8.2.2.2 str_c()

str_c() glues strings together, much like paste0(), but it works well in a tidy pipeline. Note that a missing value stays missing instead of being turned into text.

df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
# A tibble: 4 × 2
  name  greeting 
  <chr> <chr>    
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  <NA>     
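The NA in the last row is the main practical difference from base R. The sketch below, with a shortened made-up name vector, compares str_c() with paste0() side by side:

```r
library(stringr)

name <- c("Flora", NA)

# str_c() propagates missing values
with_str_c <- str_c("Hi ", name, "!")
with_str_c    # "Hi Flora!" NA

# paste0() converts NA into the text "NA"
with_paste0 <- paste0("Hi ", name, "!")
with_paste0   # "Hi Flora!" "Hi NA!"
```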

8.2.2.3 str_sub()

str_sub(string, start, end) extracts part of a string, where start and end are the positions at which the substring starts and ends. Negative positions count backwards from the end of the string.

fruits <- c("Apple", "Banana", "Pear")
str_sub(fruits, 1, 3)
[1] "App" "Ban" "Pea"
str_sub(fruits, -3, -1)
[1] "ple" "ana" "ear"

str_sub() won’t fail if the string is too short; it returns as much of the string as exists.

str_sub(fruits, 1, 5)
[1] "Apple" "Banan" "Pear" 

8.2.2.4 str_sub() in a pipeline

We can use the str_*() functions inside the mutate() function.

titanic |> 
  mutate(class1 = str_sub(Class, 1, 1))
   Class    Sex   Age Survived Freq class1
1    1st   Male Child       No    0      1
2    2nd   Male Child       No    0      2
3    3rd   Male Child       No   35      3
4   Crew   Male Child       No    0      C
5    1st Female Child       No    0      1
6    2nd Female Child       No    0      2
7    3rd Female Child       No   17      3
8   Crew Female Child       No    0      C
9    1st   Male Adult       No  118      1
10   2nd   Male Adult       No  154      2
11   3rd   Male Adult       No  387      3
12  Crew   Male Adult       No  670      C
13   1st Female Adult       No    4      1
14   2nd Female Adult       No   13      2
15   3rd Female Adult       No   89      3
16  Crew Female Adult       No    3      C
17   1st   Male Child      Yes    5      1
18   2nd   Male Child      Yes   11      2
19   3rd   Male Child      Yes   13      3
20  Crew   Male Child      Yes    0      C
21   1st Female Child      Yes    1      1
22   2nd Female Child      Yes   13      2
23   3rd Female Child      Yes   14      3
24  Crew Female Child      Yes    0      C
25   1st   Male Adult      Yes   57      1
26   2nd   Male Adult      Yes   14      2
27   3rd   Male Adult      Yes   75      3
28  Crew   Male Adult      Yes  192      C
29   1st Female Adult      Yes  140      1
30   2nd Female Adult      Yes   80      2
31   3rd Female Adult      Yes   76      3
32  Crew Female Adult      Yes   20      C

8.2.2.5 str_replace*()

str_replace() replaces the first match of a pattern. str_replace_all() replaces all the matches of a pattern.

fruits
[1] "Apple"  "Banana" "Pear"  
str_replace(fruits, "a", "x")
[1] "Apple"  "Bxnana" "Pexr"  
str_replace_all(fruits, "a", "x")
[1] "Apple"  "Bxnxnx" "Pexr"  
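"Apple" is unchanged above because matching is case sensitive by default. As a sketch of two common variations, the code below ignores case with regex() and uses a character class to match any lowercase vowel. The result names are made up for the example.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# Ignore case, so the capital "A" in "Apple" also matches
ignore_case <- str_replace(fruits, regex("a", ignore_case = TRUE), "x")
ignore_case   # "xpple" "Bxnana" "Pexr"

# A character class matches any one of the listed characters
vowels <- str_replace_all(fruits, "[aeiou]", "-")
vowels        # "Appl-" "B-n-n-" "P--r"
```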

8.2.2.6 str_detect()

str_detect(fruits, "a")
[1] FALSE  TRUE  TRUE

str_detect() can be seamlessly used in a filter() pipeline.

starwars |> 
  select(name, films) |> 
  str() 
tibble [87 × 2] (S3: tbl_df/tbl/data.frame)
 $ name : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
 $ films:List of 87
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "A New Hope"
  ..$ : chr "A New Hope"
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
  ..$ : chr "A New Hope"
  ..$ : chr [1:3] "A New Hope" "Return of the Jedi" "The Phantom Menace"
  ..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr "A New Hope"
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr [1:2] "Return of the Jedi" "The Force Awakens"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "Return of the Jedi"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Revenge of the Sith"
  ..$ : chr "Revenge of the Sith"
  ..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Revenge of the Sith"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
starwars |> 
  filter(str_detect(films, "Empire")) |> 
  select(name, films) |> 
  str()
tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
 $ name : chr [1:16] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
 $ films:List of 16
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
  ..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr "The Empire Strikes Back"
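Because films is a list column, a more explicit alternative is to test each row's vector of titles separately. The sketch below does this with purrr's map_lgl() on a small made-up tibble (cast) that mimics the shape of starwars; it is one possible approach, not the one used above.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical toy data with a list column, shaped like starwars$films
cast <- tibble(
  name  = c("Luke", "Jar Jar", "Lando"),
  films = list(
    c("A New Hope", "The Empire Strikes Back"),
    c("The Phantom Menace"),
    c("The Empire Strikes Back", "Return of the Jedi")
  )
)

# For each row, ask whether any of that row's titles matches the pattern
in_empire <- cast |>
  filter(map_lgl(films, \(f) any(str_detect(f, "Empire"))))

in_empire$name   # "Luke" "Lando"
```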

8.2.2.7 stringr functions

The stringr package within tidyverse contains lots of functions to help process strings. Letting x be a string variable…

function            arguments                 returns
str_replace()       x, pattern, replacement   a modified string
str_replace_all()   x, pattern, replacement   a modified string
str_to_lower()      x                         a modified string
str_to_upper()      x                         a modified string
str_sub()           x, start, end             a modified string
str_length()        x                         a number
str_detect()        x, pattern                TRUE/FALSE

Use the stringr cheatsheet.
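The functions in the table that have not appeared yet can be tried on the same fruit vector. A brief sketch (the result names are made up for the example):

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

lens  <- str_length(x)     # number of characters in each string
upper <- str_to_upper(x)   # all upper case
lower <- str_to_lower(x)   # all lower case

lens    # 5 6 4
upper   # "APPLE" "BANANA" "PEAR"
lower   # "apple" "banana" "pear"
```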

8.3 Factor variables

Factor variables (abbreviated as <fct> in tibbles) look like character strings, but the computer actually stores them as integers (?!?!!?), with each integer pointing to a label called a level.

  • categorical variable
  • represented in discrete levels
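To see the integer storage directly, the base R sketch below builds a small made-up factor and inspects its levels and the codes behind them:

```r
# A small factor built by hand
f <- factor(c("Moderate", "Liberal", "Conservative", "Liberal"))

lvls  <- levels(f)       # the labels, alphabetical by default
codes <- as.integer(f)   # the integers actually stored

lvls        # "Conservative" "Liberal" "Moderate"
codes       # 3 2 1 2
typeof(f)   # "integer"
```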

8.3.1 Order matters

SurveyUSA poll from 2012 on views of the DREAM Act.

What is off about the data viz part of the report?

openintro::dream
# A tibble: 910 × 2
  ideology     stance
  <fct>        <fct> 
1 Conservative Yes   
2 Conservative Yes   
3 Conservative Yes   
4 Conservative Yes   
5 Conservative Yes   
6 Conservative Yes   
# ℹ 904 more rows
dream |> 
  ggplot(aes(x = ideology, fill = stance)) + 
  geom_bar()

dream |> 
  select(ideology) |> 
  pull() |>  # levels() works only on vectors, not data frames
  levels()
[1] "Conservative" "Liberal"      "Moderate"    

8.3.1.1 Change the order

We can fix the order of the ideology variable. The function fct_relevel() is in the forcats package.

dream |> 
  mutate(ideology = fct_relevel(ideology, 
                                c("Liberal", "Moderate", "Conservative"))) |> 
  ggplot(aes(x = ideology, fill = stance)) + 
  geom_bar()
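To check what fct_relevel() did without drawing the plot, compare the levels before and after. A minimal sketch on a made-up three-value factor:

```r
library(forcats)

ideology <- factor(c("Conservative", "Liberal", "Moderate"))
levels(ideology)   # alphabetical: "Conservative" "Liberal" "Moderate"

# Put the levels in the order we want on the x-axis
releveled <- fct_relevel(ideology, "Liberal", "Moderate", "Conservative")
new_levels <- levels(releveled)
new_levels         # "Liberal" "Moderate" "Conservative"
```

Only the order of the levels changes; the data values themselves stay the same.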

8.3.1.2 Factor and character variables

starbucks |> 
  select(item, type, calories)
# A tibble: 77 × 3
  item                        type   calories
  <chr>                       <fct>     <int>
1 8-Grain Roll                bakery      350
2 Apple Bran Muffin           bakery      350
3 Apple Fritter               bakery      420
4 Banana Nut Loaf             bakery      490
5 Birthday Cake Mini Doughnut bakery      130
6 Blueberry Oat Bar           bakery      370
# ℹ 71 more rows

8.3.2 Reorder according to another variable

Let’s say that we want to order the type of food item by the average number of calories in that type.

starbucks |> 
  mutate(type = fct_reorder(type, calories, .fun = "mean", .desc = TRUE)) |> 
  ggplot(aes(x = type, y = calories)) + 
  geom_point() + 
  labs(x = "type of food",
       y = "",
       title = "Calories for food items at Starbucks")
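The effect of fct_reorder() is easier to see on a few rows. The sketch below uses made-up calorie values (not the real starbucks data) and inspects the resulting level order:

```r
library(forcats)

# Hypothetical toy values, chosen so the group means differ
type     <- factor(c("bakery", "bakery", "salad", "salad", "petite"))
calories <- c(350, 420, 300, 400, 180)

# Group means: bakery 385, salad 350, petite 180
reordered <- fct_reorder(type, calories, .fun = mean, .desc = TRUE)
reordered_levels <- levels(reordered)
reordered_levels   # "bakery" "salad" "petite"
```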

8.3.2.1 forcats functions

The forcats package within tidyverse contains lots of functions to help process factor variables. Use the forcats cheatsheet. We’ll focus on the most common functions.

  • functions for changing the order of factor levels
    • fct_relevel() = manually reorder levels
    • fct_reorder() = reorder levels according to values of another variable
    • fct_infreq() = order levels from highest to lowest frequency
    • fct_rev() = reverse the current order
  • functions for changing the labels or values of factor levels
    • fct_recode() = manually change levels
    • fct_lump() = group together least common levels
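The remaining functions in the list can be sketched on a small made-up factor whose levels have different frequencies ("a" once, "b" three times, "c" twice):

```r
library(forcats)

f <- factor(c("a", "b", "b", "b", "c", "c"))

by_freq  <- levels(fct_infreq(f))           # most frequent level first
reversed <- levels(fct_rev(f))              # reverse the current order
recoded  <- levels(fct_recode(f, A = "a"))  # new label = "old label"
lumped   <- levels(fct_lump(f, n = 1))      # keep the top level, lump the rest

by_freq    # "b" "c" "a"
reversed   # "c" "b" "a"
recoded    # "A" "b" "c"
lumped     # "b" "Other"
```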

8.4 Time and Date

8.4.1 Working with time and date

The (very well named) R package lubridate is used for wrangling time and date objects (Grolemund and Wickham 2011). In particular, lubridate makes it very easy to work with days, times, and dates. The basic idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want. The lubridate cheatsheet covers much of the basic functionality.

Example from https://lubridate.tidyverse.org/reference/lubridate-package.html

8.4.2 If anyone drove a time machine, they would crash

The lengths of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation, January 31st + one month. Should the answer be:

  1. February 31st (which doesn’t exist)
  2. March 3rd (31 days after January 31 in a non-leap year), or
  3. February 28th (assuming it’s not a leap year)

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
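A short sketch of the rule and of the rollback operators, using a made-up jan31 date. The last line shows why rolling back cannot satisfy a + b - b = a:

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

# Adding a month would land on February 31st, so lubridate returns NA
plain <- jan31 + months(1)
plain        # NA

# %m+% rolls back to the last valid day of the month
rolled <- jan31 %m+% months(1)
rolled       # "2021-02-28"

# Going forward and then back does not return to January 31st
round_trip <- jan31 %m+% months(1) %m-% months(1)
round_trip   # "2021-01-28"
```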

8.4.3 R examples

Some basics in lubridate.

library(lubridate)
rightnow <- now()

day(rightnow)
[1] 3
week(rightnow)
[1] 49
month(rightnow, label=FALSE)
[1] 12
month(rightnow, label=TRUE)
[1] Dec
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
year(rightnow)
[1] 2024
minute(rightnow)
[1] 50
hour(rightnow)
[1] 20
yday(rightnow)
[1] 338
mday(rightnow)
[1] 3
wday(rightnow, label=FALSE)
[1] 3
wday(rightnow, label=TRUE)
[1] Tue
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

But how do I create a date object?

jan31 <- ymd("2021-01-31")
jan31 + months(0:11)
 [1] "2021-01-31" NA           "2021-03-31" NA           "2021-05-31"
 [6] NA           "2021-07-31" "2021-08-31" NA           "2021-10-31"
[11] NA           "2021-12-31"
floor_date(jan31, "month") + months(0:11) + days(31)
 [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
 [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
[11] "2021-12-02" "2022-01-01"
jan31 + months(0:11) + days(31)
 [1] "2021-03-03" NA           "2021-05-01" NA           "2021-07-01"
 [6] NA           "2021-08-31" "2021-10-01" NA           "2021-12-01"
[11] NA           "2022-01-31"
jan31 %m+% months(0:11)
 [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
 [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
[11] "2021-11-30" "2021-12-31"
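ymd() is one of a family of parsers named after the order of the date components. As a sketch, the calls below all build the same date from differently formatted input; make_date() builds it from numeric pieces instead of text. The object names are made up for the example.

```r
library(lubridate)

d1 <- ymd("2021-01-31")          # year, month, day
d2 <- mdy("January 31, 2021")    # month, day, year
d3 <- dmy("31/01/2021")          # day, month, year
d4 <- make_date(year = 2021, month = 1, day = 31)  # from numbers

class(d1)   # "Date"
c(d1, d2, d3, d4)
```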

NYC flights: the variable names in the flights data (from the nycflights13 package).

 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
flightsWK <- flights |> 
   mutate(ymdday = ymd(paste(year, month,day, sep="-"))) |>
   mutate(weekdy = wday(ymdday, label=TRUE), 
          whichweek = week(ymdday))

head(flightsWK)
# A tibble: 6 × 22
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, ymdday <date>, weekdy <ord>,
#   whichweek <dbl>
flightsWK <- flights |> 
   mutate(ymdday = ymd(paste(year, month, day, sep = "-"))) |>
   mutate(weekdy = wday(ymdday, label = TRUE), whichweek = week(ymdday))

flightsWK |> select(year, month, day, ymdday, weekdy, whichweek, dep_time, 
                     arr_time, air_time) |>  
   head()
# A tibble: 6 × 9
   year month   day ymdday     weekdy whichweek dep_time arr_time air_time
  <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
1  2013     1     1 2013-01-01 Tue            1      517      830      227
2  2013     1     1 2013-01-01 Tue            1      533      850      227
3  2013     1     1 2013-01-01 Tue            1      542      923      160
4  2013     1     1 2013-01-01 Tue            1      544     1004      183
5  2013     1     1 2013-01-01 Tue            1      554      812      116
6  2013     1     1 2013-01-01 Tue            1      554      740      150
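When the year, month, and day are already stored as numbers, an alternative to pasting them into text and parsing with ymd() is lubridate's make_date(). The sketch below uses a made-up two-row tibble (toy_flights) rather than the full flights data:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the year, month, and day columns of flights
toy_flights <- tibble(
  year  = c(2013L, 2013L),
  month = c(1L, 7L),
  day   = c(1L, 4L)
)

toy_dates <- toy_flights |>
  mutate(ymdday    = make_date(year, month, day),  # no paste() needed
         weekdy    = wday(ymdday, label = TRUE),
         whichweek = week(ymdday))

toy_dates
```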

8.5 Reflection questions

8.6 Ethics considerations

Grolemund, G., and H. Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3). http://www.jstatsoft.org/v40/i03/paper.