8 Text analysis

8.1 Variable Types

Some new variable types:

character strings
factor variables
dates
numeric (decimal)
integer
logical (Boolean)

A variable’s type determines the values that the variable can take on and the operations that can be performed on it. Specifying variable types ensures the dataset’s integrity and increases performance.
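As a quick check, base R's class() reports the type of a value. The sketch below uses made-up values (they are not from any dataset in these notes), one for each type in the list above.

```r
# Made-up example values, one per variable type listed above
class("banana")                 # "character"
class(factor("banana"))         # "factor"
class(as.Date("2021-01-31"))    # "Date"
class(3.14)                     # "numeric"
class(3L)                       # "integer" (the L suffix marks an integer)
class(TRUE)                     # "logical"

# The type controls which operations are allowed
int_sum <- 3L + 1L              # arithmetic works on integers
# "3" + 1                       # would fail: no arithmetic on character strings
```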

8.2 Character strings

When working with character strings, we might want to detect, replace, or extract certain patterns.

Strings are objects of the character class (abbreviated as <chr> in tibbles). When you print out strings, they display with double quotes:

some_string <- "banana"
some_string

[1] "banana"

8.2.1 Creating strings

Creating strings by hand is useful for testing out regular expressions.

To create a string, type any text in either double quotes " or single quotes '. Using double or single quotes doesn’t matter unless your string itself has single or double quotes.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

string1

[1] "This is a string"

string2

[1] "If I want to include a \"quote\" inside a string, I use single quotes"
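The backslashes in the printed output above are escapes, not part of the string. A small sketch of that idea, using a made-up string3 that is not in the original notes:

```r
# A double quote inside a double-quoted string must be escaped with a backslash
string3 <- "If I escape the \"quote\", I can keep double quotes"

# print() shows the escapes; writeLines() shows the raw contents
print(string3)
writeLines(string3)

# The backslash is not stored in the string: both spellings give the same value
same_value <- identical("a \"quote\"", 'a "quote"')
same_value
```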

8.2.2 Working with `str_*()` functions

8.2.2.1 `str_view()`

We can view these strings more “naturally” (without the opening and closing quotes) with str_view():

str_view(string1)

[1] │ This is a string

str_view(string2)

[1] │ If I want to include a "quote" inside a string, I use single quotes

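8.2.2.2 `str_c()`

str_c() is similar to paste() (gluing strings together), but it works well in a tidy pipeline. Note in the output below that a missing name gives a missing greeting rather than the text "Hi NA!".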

df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))

# A tibble: 4 × 2
  name  greeting 
  <chr> <chr>    
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  <NA>
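The last row of the output shows how str_c() treats missing values. The sketch below compares it with base R's paste0() on a made-up two-element vector; the fallback value "you" is only an illustration.

```r
library(stringr)
library(dplyr)

name <- c("Flora", NA)

# paste0() turns a missing value into the text "NA"
greet_paste <- paste0("Hi ", name, "!")      # "Hi Flora!" "Hi NA!"

# str_c() keeps the result missing
greet_str_c <- str_c("Hi ", name, "!")       # "Hi Flora!" NA

# One option: substitute a fallback value before gluing
greet_fallback <- str_c("Hi ", coalesce(name, "you"), "!")   # "Hi Flora!" "Hi you!"
```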

8.2.2.3 `str_sub()`

str_sub(string, start, end) extracts part of a string, where start and end are the positions at which the substring starts and ends. Negative positions count backwards from the end of the string.

fruits <- c("Apple", "Banana", "Pear")
str_sub(fruits, 1, 3)

[1] "App" "Ban" "Pea"

str_sub(fruits, -3, -1)

[1] "ple" "ana" "ear"

str_sub() won’t fail if the string is too short; it returns as much of the string as it can.

str_sub(fruits, 1, 5)

[1] "Apple" "Banan" "Pear"

8.2.2.4 `str_sub()` in a pipeline

We can use the str_*() functions inside the mutate() function.

titanic |> 
  mutate(class1 = str_sub(Class, 1, 1))

   Class    Sex   Age Survived Freq class1
1    1st   Male Child       No    0      1
2    2nd   Male Child       No    0      2
3    3rd   Male Child       No   35      3
4   Crew   Male Child       No    0      C
5    1st Female Child       No    0      1
6    2nd Female Child       No    0      2
7    3rd Female Child       No   17      3
8   Crew Female Child       No    0      C
9    1st   Male Adult       No  118      1
10   2nd   Male Adult       No  154      2
11   3rd   Male Adult       No  387      3
12  Crew   Male Adult       No  670      C
13   1st Female Adult       No    4      1
14   2nd Female Adult       No   13      2
15   3rd Female Adult       No   89      3
16  Crew Female Adult       No    3      C
17   1st   Male Child      Yes    5      1
18   2nd   Male Child      Yes   11      2
19   3rd   Male Child      Yes   13      3
20  Crew   Male Child      Yes    0      C
21   1st Female Child      Yes    1      1
22   2nd Female Child      Yes   13      2
23   3rd Female Child      Yes   14      3
24  Crew Female Child      Yes    0      C
25   1st   Male Adult      Yes   57      1
26   2nd   Male Adult      Yes   14      2
27   3rd   Male Adult      Yes   75      3
28  Crew   Male Adult      Yes  192      C
29   1st Female Adult      Yes  140      1
30   2nd Female Adult      Yes   80      2
31   3rd Female Adult      Yes   76      3
32  Crew Female Adult      Yes   20      C

8.2.2.5 `str_replace*()`

str_replace() replaces the first match of a pattern. str_replace_all() replaces all the matches of a pattern.

fruits

[1] "Apple"  "Banana" "Pear"

str_replace(fruits, "a", "x")

[1] "Apple"  "Bxnana" "Pexr"

str_replace_all(fruits, "a", "x")

[1] "Apple"  "Bxnxnx" "Pexr"

8.2.2.6 `str_detect()`

str_detect(fruits, "a")

[1] FALSE  TRUE  TRUE
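"Apple" comes back FALSE because the match is case sensitive by default. A minimal sketch of relaxing that with stringr's regex() helper:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# Default matching is case sensitive, so the capital "A" in "Apple" does not match
detect_default <- str_detect(fruits, "a")                              # FALSE TRUE TRUE

# regex(ignore_case = TRUE) treats "a" and "A" as the same letter
detect_any_case <- str_detect(fruits, regex("a", ignore_case = TRUE))  # TRUE TRUE TRUE
```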

str_detect() can be seamlessly used in a filter() pipeline.

starwars |> 
  select(name, films)

# A tibble: 87 × 2
  name           films    
  <chr>          <list>   
1 Luke Skywalker <chr [5]>
2 C-3PO          <chr [6]>
3 R2-D2          <chr [7]>
4 Darth Vader    <chr [4]>
5 Leia Organa    <chr [5]>
6 Owen Lars      <chr [3]>
# ℹ 81 more rows

starwars |> 
  select(name, films) |> 
  unnest_wider(films, names_sep = "")

# A tibble: 87 × 8
  name           films1     films2            films3 films4 films5 films6 films7
  <chr>          <chr>      <chr>             <chr>  <chr>  <chr>  <chr>  <chr> 
1 Luke Skywalker A New Hope The Empire Strik… Retur… Reven… The F… <NA>   <NA>  
2 C-3PO          A New Hope The Empire Strik… Retur… The P… Attac… Reven… <NA>  
3 R2-D2          A New Hope The Empire Strik… Retur… The P… Attac… Reven… The F…
4 Darth Vader    A New Hope The Empire Strik… Retur… Reven… <NA>   <NA>   <NA>  
5 Leia Organa    A New Hope The Empire Strik… Retur… Reven… The F… <NA>   <NA>  
6 Owen Lars      A New Hope Attack of the Cl… Reven… <NA>   <NA>   <NA>   <NA>  
# ℹ 81 more rows

starwars |> 
  filter(str_detect(films, "Empire")) |> 
  select(name, films) |> 
  unnest_wider(films, names_sep = "")

# A tibble: 16 × 8
  name           films1     films2            films3 films4 films5 films6 films7
  <chr>          <chr>      <chr>             <chr>  <chr>  <chr>  <chr>  <chr> 
1 Luke Skywalker A New Hope The Empire Strik… Retur… Reven… The F… <NA>   <NA>  
2 C-3PO          A New Hope The Empire Strik… Retur… The P… Attac… Reven… <NA>  
3 R2-D2          A New Hope The Empire Strik… Retur… The P… Attac… Reven… The F…
4 Darth Vader    A New Hope The Empire Strik… Retur… Reven… <NA>   <NA>   <NA>  
5 Leia Organa    A New Hope The Empire Strik… Retur… Reven… The F… <NA>   <NA>  
6 Obi-Wan Kenobi A New Hope The Empire Strik… Retur… The P… Attac… Reven… <NA>  
# ℹ 10 more rows
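Because films is a list column, each row holds a whole vector of titles. One alternative, sketched here as an assumption rather than as the approach used above, is to check each row's vector explicitly with purrr::map_lgl() and any():

```r
library(dplyr)
library(purrr)
library(stringr)

# For each character, ask: does ANY of their film titles contain "Empire"?
empire <- starwars |>
  filter(map_lgl(films, \(f) any(str_detect(f, "Empire")))) |>
  select(name, films)

empire
```

This makes the "any film matches" logic visible instead of relying on how str_detect() handles a list.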

8.2.2.7 stringr functions

The stringr package within tidyverse contains lots of functions to help process strings. Letting x be a string variable…

str function	arguments	returns
`str_view()`	`x`	the string
`str_c()`	…, `sep`, `collapse`	a new concatenated string
`str_sub()`	`x`, `start`, `end`	a substring
`str_replace()`	`x`, `pattern`, `replacement`	a modified string
`str_replace_all()`	`x`, `pattern`, `replacement`	a modified string
`str_detect()`	`x`, `pattern`	TRUE/FALSE
`str_to_lower()`	`x`	a modified string
`str_to_upper()`	`x`	a modified string
`str_length()`	`x`	a number

Use the stringr cheatsheet.
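The functions in the table that were not demonstrated above can be tried on the same fruits vector. A short sketch:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

lower <- str_to_lower(fruits)    # "apple" "banana" "pear"
upper <- str_to_upper(fruits)    # "APPLE" "BANANA" "PEAR"
n_chars <- str_length(fruits)    # 5 6 4

# collapse joins the elements of one vector into a single string
one_string <- str_c(fruits, collapse = ", ")   # "Apple, Banana, Pear"
```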

8.3 Factor variables

Factor variables (abbreviated as <fct> in tibbles) are a special type of character string. The computer actually stores them as integers (?!?!!?), with each integer pointing to one of the factor’s levels.

categorical variable
represented in discrete levels
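To see the integer storage directly, here is a small base R sketch with a made-up size variable:

```r
# A made-up categorical variable
size <- factor(c("low", "high", "medium", "low"))

levels(size)        # "high" "low" "medium" (alphabetical by default)
typeof(size)        # "integer"
as.integer(size)    # 2 1 3 2: each value is the position of its level
as.character(size)  # "low" "high" "medium" "low"
```

The alphabetical default is why level order often needs fixing before plotting.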

8.3.1 Order matters

SurveyUSA poll from 2012 on views of the DREAM Act.

What is off about the data viz part of the report?

openintro::dream

# A tibble: 910 × 2
  ideology     stance
  <fct>        <fct> 
1 Conservative Yes   
2 Conservative Yes   
3 Conservative Yes   
4 Conservative Yes   
5 Conservative Yes   
6 Conservative Yes   
# ℹ 904 more rows

dream |> 
  ggplot(aes(x = ideology, fill = stance)) + 
  geom_bar()

dream |> 
  select(ideology) |> 
  pull() |>  # because levels() works only on vectors, not data frames
  levels()

[1] "Conservative" "Liberal"      "Moderate"

8.3.1.1 Change the order

We can fix the order of the ideology variable. The function fct_relevel() is in the forcats package.


dream |> 
  mutate(ideology = fct_relevel(ideology, 
                                c("Liberal", "Moderate", "Conservative"))) |> 
  ggplot(aes(x = ideology, fill = stance)) + 
  geom_bar()
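The effect of fct_relevel() can also be checked without a plot. A minimal sketch on a made-up three-value factor (not the dream data itself):

```r
library(forcats)

ideology <- factor(c("Conservative", "Liberal", "Moderate"))
levels(ideology)    # "Conservative" "Liberal" "Moderate" (alphabetical)

reordered <- fct_relevel(ideology, "Liberal", "Moderate", "Conservative")
levels(reordered)   # "Liberal" "Moderate" "Conservative"

# Only the level order changes; the values themselves stay the same
as.character(reordered)
```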

starbucks |> 
  select(item, type, calories)

# A tibble: 77 × 3
  item                        type   calories
  <chr>                       <fct>     <int>
1 8-Grain Roll                bakery      350
2 Apple Bran Muffin           bakery      350
3 Apple Fritter               bakery      420
4 Banana Nut Loaf             bakery      490
5 Birthday Cake Mini Doughnut bakery      130
6 Blueberry Oat Bar           bakery      370
# ℹ 71 more rows

8.3.2 Reorder according to another variable

Let’s say that we wanted to order the type of food item based on the average number of calories in that food.


starbucks |> 
  mutate(type = fct_reorder(type, calories, .fun = "mean", .desc = TRUE)) |> 
  ggplot(aes(x = type, y = calories)) + 
  geom_point() + 
  labs(x = "type of food",
       y = "",
       title = "Calories for food items at Starbucks")
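To see what fct_reorder() does to the levels, here is a sketch on made-up numbers (the type and calories vectors below are illustrative, not the starbucks data):

```r
library(forcats)

type <- factor(c("a", "a", "b", "b", "c", "c"))
calories <- c(1, 3, 10, 20, 5, 7)    # group means: a = 2, b = 15, c = 6

# Ascending by group mean
asc_levels <- levels(fct_reorder(type, calories, .fun = mean))                  # "a" "c" "b"

# Descending by group mean, as in the plot code above
desc_levels <- levels(fct_reorder(type, calories, .fun = mean, .desc = TRUE))   # "b" "c" "a"
```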

8.3.2.1 forcats functions

The forcats package within tidyverse contains lots of functions to help process factor variables. Use the forcats cheatsheet. We’ll focus on the most common functions.

functions for changing the order of factor levels
- fct_relevel() = manually reorder levels
- fct_reorder() = reorder levels according to values of another variable
- fct_infreq() = order levels from highest to lowest frequency
- fct_rev() = reverse the current order
functions for changing the labels or values of factor levels
- fct_recode() = manually change levels
- fct_lump() = group together least common levels
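The remaining functions in the list can be tried on a tiny made-up factor. This is only a sketch; the factor f and the new label "A" are invented for illustration.

```r
library(forcats)

# "a" appears once, "b" twice, "c" three times
f <- factor(c("a", "b", "b", "c", "c", "c"))

infreq_levels <- levels(fct_infreq(f))           # "c" "b" "a" (most frequent first)
rev_levels    <- levels(fct_rev(f))              # "c" "b" "a" (reverse of a, b, c)
recode_levels <- levels(fct_recode(f, A = "a"))  # "A" "b" "c" (new name = old name)
lump_levels   <- levels(fct_lump(f, n = 1))      # "c" "Other" (keep the top level only)
```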

8.4 Time and Date

8.4.1 Working with time and date

The (very well named) R package lubridate is used for wrangling time and date objects (Grolemund and Wickham 2011). In particular, lubridate makes it very easy to work with days, times, and dates. The basic idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want. The lubridate cheatsheet covers much of the basic functionality.

Example from https://lubridate.tidyverse.org/reference/lubridate-package.html

8.4.2 If anyone drove a time machine, they would crash

The lengths of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation, January 31st + one month. Should the answer be:

1. February 31st (which doesn’t exist)
2. March 3rd (31 days after January 31 in a non-leap year), or
3. February 28th (assuming it’s not a leap year)

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
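A short sketch of the three options for one date. It also shows that rolling back gives a valid date at the cost of a + b - b = a. The variable names are made up for illustration.

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

# The consistent rule: an invalid date becomes NA
plus_month <- jan31 + months(1)                  # NA

# Solution 3: roll back to the last day of the month
feb28 <- jan31 %m+% months(1)                    # "2021-02-28"

# ...but subtracting the month again does not return January 31
back_again <- feb28 %m-% months(1)               # "2021-01-28"

# Solution 2: add a fixed number of days instead of a month
plus_31_days <- jan31 + days(31)                 # "2021-03-03"

# add_with_rollback() can instead roll forward to the first of the next month
first_of_next <- add_with_rollback(jan31, months(1), roll_to_first = TRUE)  # "2021-03-01"
```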

8.4.3 R examples

Some basics in lubridate.

library(lubridate)
rightnow <- now()

day(rightnow)

[1] 17

week(rightnow)

[1] 7

month(rightnow, label=FALSE)

[1] 2

month(rightnow, label=TRUE)

[1] Feb
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

year(rightnow)

[1] 2025

minute(rightnow)

[1] 37

hour(rightnow)

[1] 9

yday(rightnow)

[1] 48

mday(rightnow)

[1] 17

wday(rightnow, label=FALSE)

[1] 2

wday(rightnow, label=TRUE)

[1] Mon
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

But how do I create a date object?

jan31 <- ymd("2021-01-31")
jan31 + months(0:11)

 [1] "2021-01-31" NA           "2021-03-31" NA           "2021-05-31"
 [6] NA           "2021-07-31" "2021-08-31" NA           "2021-10-31"
[11] NA           "2021-12-31"

floor_date(jan31, "month") + months(0:11) + days(31)

 [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
 [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
[11] "2021-12-02" "2022-01-01"

jan31 + months(0:11) + days(31)

 [1] "2021-03-03" NA           "2021-05-01" NA           "2021-07-01"
 [6] NA           "2021-08-31" "2021-10-01" NA           "2021-12-01"
[11] NA           "2022-01-31"

jan31 %m+% months(0:11)

 [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
 [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
[11] "2021-11-30" "2021-12-31"
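The ymd() call above is one of a family of lubridate parsers named after the order of the date components. A small sketch with made-up input strings:

```r
library(lubridate)

# The function name gives the order of year, month, and day in the input
d1 <- ymd("2021-01-31")
d2 <- mdy("01/31/2021")
d3 <- dmy("31-01-2021")

class(d1)           # "Date"
all_equal <- d1 == d2 && d2 == d3
all_equal           # TRUE: three spellings of the same date

# Adding _hms parses a time of day as well (UTC unless a time zone is given)
dt <- ymd_hms("2021-01-31 09:37:00")
class(dt)           # "POSIXct" "POSIXt"
```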

NYC flights

library(nycflights13)
names(flights)

 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"

flightsWK <- flights |> 
   mutate(ymdday = ymd(paste(year, month, day, sep="-"))) |>
   mutate(weekdy = wday(ymdday, label=TRUE), 
          whichweek = week(ymdday))

head(flightsWK)

# A tibble: 6 × 22
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, ymdday <date>, weekdy <ord>,
#   whichweek <dbl>

flightsWK <- flights |> 
   mutate(ymdday = ymd(paste(year,"-", month,"-",day))) |>
   mutate(weekdy = wday(ymdday, label=TRUE), whichweek = week(ymdday))

flightsWK |> select(year, month, day, ymdday, weekdy, whichweek, dep_time, 
                     arr_time, air_time) |>  
   head()

# A tibble: 6 × 9
   year month   day ymdday     weekdy whichweek dep_time arr_time air_time
  <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
1  2013     1     1 2013-01-01 Tue            1      517      830      227
2  2013     1     1 2013-01-01 Tue            1      533      850      227
3  2013     1     1 2013-01-01 Tue            1      542      923      160
4  2013     1     1 2013-01-01 Tue            1      544     1004      183
5  2013     1     1 2013-01-01 Tue            1      554      812      116
6  2013     1     1 2013-01-01 Tue            1      554      740      150
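When year, month, and day are already separate numeric columns, lubridate's make_date() builds the date without pasting strings together. The sketch below is an alternative to the code above, not the approach the notes use, and it runs on a made-up three-row stand-in for flights.

```r
library(dplyr)
library(lubridate)

# A made-up three-row stand-in for the flights data
toy <- tibble(year = 2013L, month = 1L, day = 1:3)

toy_dates <- toy |>
  mutate(ymdday = make_date(year, month, day),
         weekdy = wday(ymdday),        # numeric day of week; 1 = Sunday by default
         whichweek = week(ymdday))

toy_dates
```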

8.5 Reflection questions

What is the difference between character strings, factor variables, dates, numeric variables, integer variables, and logical variables?
What are the different str_*() functions? What do they do?
What are the different fct_*() functions? What do they do?
What does the _all suffix do in functions like str_replace_all() (as compared with str_replace())?
Are the str_*() functions case sensitive?
Can str_*() functions take variables that aren’t strings?
Can fct_*() functions take variables that aren’t factors?
Can lubridate functions take variables that aren’t time or date objects?
How do the str_*() functions work inside the tidy pipeline?

8.6 Ethics considerations

Why is the calculation of January 31 + one month interesting and/or an ethical consideration?
Why is it often important to relevel your factor variables when making a data visualization?
Why are timezones important to pay attention to?
What does this mean: “If x is a string, it can take any value. If x is a factor, it can only take values from a list of all levels.”?
Name one thing that you noticed about types of variables in the course materials (either in class or reading the notes, etc.) where you thought to yourself “Oh, I’ll have to be really careful about that.” Why would you need to be careful?

Grolemund, G., and H. Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3). http://www.jstatsoft.org/v40/i03/paper.