6  Changing databases

As in Chapter 5, we will use DuckDB so that we have access to a server into which we can create and change data tables. Note that the database has been stored to a database directory called duck_datab which lives in the current R project. You won’t be able to open it like a standard folder, but it is where DuckDB stores the database files.

con_duckdb <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "duck_datab")

6.1 Changing data

The UPDATE function allows you to change a value in a table across all rows that match a certain criteria. The impressions table has a name column indicating the person being impersonated. Let’s say, for whatever reason, that Ivanka Trump decides she doesn’t want to be affiliated with the Trump name and she changes her name to her husband’s name, becoming Ivanka Kushner. You might want to UPDATE the file to indicate the impressions were of Ivanka Kushner instead of Ivanka Trump. (See Section 5.1.6 for loading csv files into DuckDB directly.)

duckdb_read_csv(con = con_duckdb, name = "impressions", files = "data/impressions.csv")
SELECT * FROM impressions 
   WHERE name LIKE 'Ivanka%';
Table 6.1: Finding the impersonations of Ivanka Trump.
impid aid name
2598 Scarlett Johansson Ivanka Trump
3716 Emily Blunt Ivanka Trump
3694 Margot Robbie Ivanka Trump
3679 Vanessa Bayer Ivanka Trump
2340 Maya Rudolph Ivanka Trump

We can use the UPDATE function to change the value of Ivanka’s name to Ivanka Kushner throughout the database. Note that all rows which match the WHERE clause get updated.

UPDATE impressions
   SET name = 'Ivanka Kushner'
   WHERE name LIKE 'Ivanka%';
SELECT * FROM impressions 
   WHERE name LIKE 'Ivanka%';
Table 6.2: Ivanka’s last name has been UPDATEd to Kushner.
impid aid name
2598 Scarlett Johansson Ivanka Kushner
3716 Emily Blunt Ivanka Kushner
3694 Margot Robbie Ivanka Kushner
3679 Vanessa Bayer Ivanka Kushner
2340 Maya Rudolph Ivanka Kushner
Watch out!

Be careful with UPDATE. A careless UPDATE could write over all of the data in your table. There is no undo function.

6.2 Inserting data

Let’s say we want to include the more recent hosts in the hosts table. First, we scrape the SNL archives which lists the episode id (the date) and the host. The R package rvest allows us to pull out the appropriate html elements. The epid and aid are joined together in a tibble, and filtered to only include episodes which are not already in the episodes table. (See Section 5.1.6 for loading csv files into DuckDB directly.)

duckdb_read_csv(con = con_duckdb, name = "hosts", files = "data/hosts.csv")

By searching the SNL archives, we can see that the next host, chronologically was Elon Musk on May 8, 2021.

SELECT * FROM hosts
    ORDER BY epid DESC
    LIMIT 10;
Table 6.3: Most recent hosts in the original hosts table.
epid aid
20210410 Carey Mulligan
20210403 Daniel Kaluuya
20210327 Maya Rudolph
20210227 Nick Jonas
20210220 Rege-Jean Page
20210213 Regina King
20210206 Dan Levy
20210130 John Krasinski
20201219 Kristen Wiig
20201212 Timothee Chalamet

INSERT allows us to add the relevant information associated with the episode of SNL that Elon Musk hosted.

INSERT INTO hosts (epid, aid)
   VALUES ('20210508', 'Elon Musk');
SELECT * FROM hosts
    ORDER BY epid DESC
    LIMIT 10;
Table 6.4: hosts table including the added observation from May 8, 2021.
epid aid
20210508 Elon Musk
20210410 Carey Mulligan
20210403 Daniel Kaluuya
20210327 Maya Rudolph
20210227 Nick Jonas
20210220 Rege-Jean Page
20210213 Regina King
20210206 Dan Levy
20210130 John Krasinski
20201219 Kristen Wiig

It would be tedious to INSERT all of the most recent host information by hand. Instead, we’ll scrape the SNL archives using the R package rvest, which allows us to pull out the appropriate html elements. The epid and aid are joined together in a tibble, and filtered to only include episodes which are not already in the episodes table.

library(rvest)

recent_hosts <- read_html("http://www.snlarchives.net/Episodes/") |>
  html_nodes("tr") |>
  purrr::map_df( ~ tibble(
    epid = .x |> html_node("a.ms-2.me-2") |>
      html_attr("href") |>
      str_extract("\\d+"),
    aid = .x |> html_node("td:nth-child(2)") |>
      html_text2() |>
      str_extract("[\\w\\. \\w\\.]+(?=/|$)")
  )) |>
  filter(epid > 20210508)


write_csv(recent_hosts, "data/recent_hosts.csv")
INSERT INTO hosts
   SELECT *
   FROM READ_CSV('data/recent_hosts.csv', AUTO_DETECT = TRUE);
SELECT * FROM hosts
  ORDER BY epid DESC
  LIMIT 10;
Table 6.5: The full hosts table, updated through December 16, 2023.
epid aid
20231216 Kate McKinnon
20231209 Adam Driver
20231202 Emma Stone
20231118 Jason Momoa
20231111 Timothée Chalamet
20231028 Nate Bargatze
20231021 Bad Bunny
20231014 Pete Davidson
20230415 Ana de Armas
20230408 Molly Shannon

6.3 Deleting data

You might change your mind and decide that you really only want hosts from years up to 2022. The DELETE function deletes any rows specified by the WHERE clause.

DELETE FROM hosts
   WHERE epid > 20221231
SELECT * FROM hosts
  ORDER BY epid DESC
  LIMIT 10;
Table 6.6: The hosts table, after 2023 has been DELETEd.
epid aid
20221217 Austin Butler
20221210 Martin Short
20221203 Keke Palmer
20221112 Dave Chappelle
20221105 Amy Schumer
20221029 Jack Harlow
20221015 Megan Thee Stallion
20221008 Brendan Gleeson
20221001 Miles Teller
20220521 Natasha Lyonne

6.4 Altering the table

ALTER TABLE changes the structure of a table. For example, you can add or delete columns, create or destroy indexes, change the type of existing columns, or rename columns or the table itself. (Syntax below is for MySQL. Unfortunately, DuckDB is finicky when ALTERing tables, so the commands below may not work on the tables created using DuckDB.)1

Multiple ADD, ALTER, DROP, and CHANGE clauses are permitted in a single ALTER TABLE statement, separated by commas.

ALTER TABLE t1
DROP COLUMN col1,
DROP COLUMN col2;

To alter a column to change both its name and definition, use CHANGE, specifying the old and new names and the new definition. For example, to rename an INT NOT NULL column from a to b and change its definition to use the BIGINT data type while retaining the NOT NULL attribute, do this:

ALTER TABLE t1 CHANGE a b BIGINT NOT NULL;

To change a column definition but not its name, use CHANGE or MODIFY. With CHANGE, the syntax requires two column names, so you must specify the same name twice to leave the name unchanged. For example, to change the definition of column b:

ALTER TABLE t1 CHANGE b b INT NOT NULL;

MODIFY is more convenient to change the definition without changing the name because it requires the column name only once:

ALTER TABLE t1 MODIFY b INT NOT NULL;

To change a column name but not its definition, use CHANGE or RENAME COLUMN. With CHANGE, the syntax requires a column definition, so to leave the definition unchanged, you must re-specify the definition the column currently has. For example, to rename an INT NOT NULL column from b to a:

ALTER TABLE t1 CHANGE b a INT NOT NULL;

RENAME COLUMN is more convenient to change the name without changing the definition because it requires only the old and new names:

ALTER TABLE t1 RENAME COLUMN b TO a;

In general, you cannot rename a column to a name that already exists in the table. However, this is sometimes not the case, such as when you swap names or move them through a cycle. If a table has columns named a, b, and c, the following are valid operations:

/* swap a and b */
ALTER TABLE t1 RENAME COLUMN a TO b,
               RENAME COLUMN b TO a;

/* "rotate" a, b, c through a cycle */
ALTER TABLE t1 RENAME COLUMN a TO b,
               RENAME COLUMN b TO c,
               RENAME COLUMN c TO a;

6.5 Temporary tables

Temporary tables are used to break down complex queries into smaller, more manageable steps. For example, let’s say we want to JOIN two tables after each has been filtered using different WHERE clauses. The filtered tables can each be saved into their own temporary tables and then the temporary tables can be merged.

Note that tables in DuckDB are saved (to disk), even when the connection is closed. However, temporary tables are saved in memory (instead of on disk) and are deleted when the connection is closed. Specific configuration of the temporary directory allows for temporary tables to be saved, even when the connection is closed.

Notice that most of the tables have some kind of ID which allows JOINing across tables.

SELECT * FROM hosts LIMIT 10;
Table 6.7: Note the epid and aid identifiers in the hosts table.
epid aid
20210410 Carey Mulligan
20210403 Daniel Kaluuya
20210327 Maya Rudolph
20210227 Nick Jonas
20210220 Rege-Jean Page
20210213 Regina King
20210206 Dan Levy
20210130 John Krasinski
20201219 Kristen Wiig
20201212 Timothee Chalamet
SELECT * FROM episodes LIMIT 10;
Table 6.8: Note the sid and epid identifiers in the episodes table.
sid epid aired epno
46 20210410 April 10, 2021 17
46 20210403 April 3, 2021 16
46 20210327 March 27, 2021 15
46 20210227 February 27, 2021 14
46 20210220 February 20, 2021 13
46 20210213 February 13, 2021 12
46 20210206 February 6, 2021 11
46 20210130 January 30, 2021 10
46 20201219 December 19, 2020 9
46 20201212 December 12, 2020 8

6.5.1 Creating a temporary table

The episodes table has an aired column which includes the data. Recall that if we create a new variable (e.g., year) using aired, we cannot use year in the WHERE clause (WHERE only works on the original table, not the results set).

In MySQL the function STR_TO_DATE allowed us to create a datetime variable from which year could be extracted. However, in DuckDB, it is more complicated to convert the character string of “April 10, 2020” to “2020-04-10”. Don’t worry about the code too much, but note that we wouldn’t want to wrangle the character date string every time we wanted to filter for year.

What does POSITION do?

In case you are curious about the date wrangling code… consider SUBSTRING(aired, POSITION(',' IN aired) + 2)

  1. POSITION(',' IN aired): This part of the expression uses the POSITION function to find the position of the first occurrence of the comma (,) in the string aired. The result is the index (position) of the comma within the string.

  2. POSITION(',' IN aired) + 2: This adds 2 to the index of the comma. The + 2 is used to move the starting point of the substring two positions to the right of the comma. This is done to exclude the comma itself and any following spaces.

  3. SUBSTRING(aired, POSITION(',' IN aired) + 2): This part uses the SUBSTRING function to extract a substring from the string aired. The starting position of the substring is determined by POSITION(',' IN aired) + 2, and it goes until the end of the string. This effectively removes the part of the string that comes before and including the first comma.

In summary, the entire expression is extracting a substring from the original string aired, starting from two positions to the right of the first comma and continuing until the end of the string. This can be useful in scenarios where you want to remove or isolate part of a string based on the position of a specific character (in this case, the comma).

CREATE TEMP TABLE episodes_date AS
    SELECT *, CASE
             WHEN POSITION(',' IN aired) > 0 THEN
    EXTRACT(YEAR FROM CAST(
                SUBSTRING(aired, POSITION(',' IN aired) + 2) || '-' ||
                CASE
                    WHEN POSITION('January' IN aired) > 0 THEN '01'
                    WHEN POSITION('February' IN aired) > 0 THEN '02'
                    WHEN POSITION('March' IN aired) > 0 THEN '03'
                    WHEN POSITION('April' IN aired) > 0 THEN '04'
                    WHEN POSITION('May' IN aired) > 0 THEN '05'
                    WHEN POSITION('June' IN aired) > 0 THEN '06'
                    WHEN POSITION('July' IN aired) > 0 THEN '07'
                    WHEN POSITION('August' IN aired) > 0 THEN '08'
                    WHEN POSITION('September' IN aired) > 0 THEN '09'
                    WHEN POSITION('October' IN aired) > 0 THEN '10'
                    WHEN POSITION('November' IN aired) > 0 THEN '11'
                    WHEN POSITION('December' IN aired) > 0 THEN '12'
                    ELSE '01' -- Default to January if no month is found
                END || '-' ||
                SUBSTRING(aired, POSITION(' ' IN aired) + 1, 2) AS DATE
            ))
            END AS year FROM episodes;
SELECT * FROM episodes_date LIMIT 10;
Table 6.9: The temporary table called episodes_date that has identifiers of sid, epid, and epno.
sid epid aired epno year
46 20210410 April 10, 2021 17 2021
46 20210403 April 3, 2021 16 2021
46 20210327 March 27, 2021 15 2021
46 20210227 February 27, 2021 14 2021
46 20210220 February 20, 2021 13 2021
46 20210213 February 13, 2021 12 2021
46 20210206 February 6, 2021 11 2021
46 20210130 January 30, 2021 10 2021
46 20201219 December 19, 2020 9 2020
46 20201212 December 12, 2020 8 2020

6.5.2 Using a temporary table

Now that the year variable has been created in the new temporary table called episodes_date, we can use episode_date to query and find, for example, all of the hosts in 2019.

SELECT hosts.aid, ep.aired, ep.year FROM hosts 
JOIN episodes_date AS ep ON hosts.epid = ep.epid
WHERE year = 2019
LIMIT 25;
Table 6.10: SNL hosts in 2019.
aid aired year
Eddie Murphy December 21, 2019 2019
Scarlett Johansson December 14, 2019 2019
Jennifer Lopez December 7, 2019 2019
Will Ferrell November 23, 2019 2019
Harry Styles November 16, 2019 2019
Kristen Stewart November 2, 2019 2019
Chance the Rapper October 26, 2019 2019
David Harbour October 12, 2019 2019
Phoebe Waller-Bridge October 5, 2019 2019
Woody Harrelson September 28, 2019 2019
Paul Rudd May 18, 2019 2019
Emma Thompson May 11, 2019 2019
Adam Sandler May 4, 2019 2019
Emma Stone April 13, 2019 2019
Kit Harington April 6, 2019 2019
Sandra Oh March 30, 2019 2019
Idris Elba March 9, 2019 2019
John Mulaney March 2, 2019 2019
Don Cheadle February 16, 2019 2019
Halsey February 9, 2019 2019
James McAvoy January 26, 2019 2019
Rachel Brosnahan January 19, 2019 2019

6.6 Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_duckdb, shutdown = TRUE)

6.7 Reflection questions

  1. How can you update the value of a particular variable? What if you want to update a variable for many rows?

  2. Why/when would you use CHANGE instead of RENAME COLUMN? Why/when would you use CHANGE instead of MODIFY?

  3. When are temporary tables useful? Can you always create temporary tables if you are working in SQL? Explain the hierarchy of tables, temporary tables, and subqueries.

6.8 Ethics considerations

  1. Who should have the ability / access to insert, delete, or update tables? Should everyone who accesses a table also have the ability to edit the table? Why or why not?

  2. What can you do if you accidentally DELETE the wrong rows or DROP the wrong the columns?


  1. Information and examples in this section taken from https://dev.mysql.com/doc/refman/8.0/en/alter-table.html#alter-table-add-drop-column↩︎