6 Changing databases
As in Chapter 5, we will use DuckDB so that we have access to a server into which we can create and change data tables. Note that the database has been stored to a database directory called duck_datab
which lives in the current R project. You won’t be able to open it like a standard folder, but it is where DuckDB stores the database files.
6.1 Changing data
The UPDATE
function allows you to change a value in a table across all rows that match a certain criteria. The impressions
table has a name
column indicating the person being impersonated. Let’s say, for whatever reason, that Ivanka Trump decides she doesn’t want to be affiliated with the Trump name and she changes her name to her husband’s name, becoming Ivanka Kushner. You might want to UPDATE
the file to indicate the impressions were of Ivanka Kushner
instead of Ivanka Trump
. (See Section 5.1.6 for loading csv files into DuckDB directly.)
duckdb_read_csv(con = con_duckdb, name = "impressions", files = "data/impressions.csv")
SELECT * FROM impressions
WHERE name LIKE 'Ivanka%';
impid | aid | name |
---|---|---|
2598 | Scarlett Johansson | Ivanka Trump |
3716 | Emily Blunt | Ivanka Trump |
3694 | Margot Robbie | Ivanka Trump |
3679 | Vanessa Bayer | Ivanka Trump |
2340 | Maya Rudolph | Ivanka Trump |
We can use the UPDATE
function to change the value of Ivanka’s name to Ivanka Kushner
throughout the database. Note that all rows which match the WHERE
clause get updated.
UPDATE impressions
SET name = 'Ivanka Kushner'
WHERE name LIKE 'Ivanka%';
SELECT * FROM impressions
WHERE name LIKE 'Ivanka%';
impid | aid | name |
---|---|---|
2598 | Scarlett Johansson | Ivanka Kushner |
3716 | Emily Blunt | Ivanka Kushner |
3694 | Margot Robbie | Ivanka Kushner |
3679 | Vanessa Bayer | Ivanka Kushner |
2340 | Maya Rudolph | Ivanka Kushner |
6.2 Inserting data
Let’s say we want to include the more recent hosts in the hosts
table. First, we scrape the SNL archives which lists the episode id (the date) and the host. The R package rvest allows us to pull out the appropriate html elements. The epid
and aid
are joined together in a tibble, and filter
ed to only include episodes which are not already in the episodes
table. (See Section 5.1.6 for loading csv files into DuckDB directly.)
duckdb_read_csv(con = con_duckdb, name = "hosts", files = "data/hosts.csv")
By searching the SNL archives, we can see that the next host, chronologically was Elon Musk on May 8, 2021.
SELECT * FROM hosts
ORDER BY epid DESC
LIMIT 10;
epid | aid |
---|---|
20210410 | Carey Mulligan |
20210403 | Daniel Kaluuya |
20210327 | Maya Rudolph |
20210227 | Nick Jonas |
20210220 | Rege-Jean Page |
20210213 | Regina King |
20210206 | Dan Levy |
20210130 | John Krasinski |
20201219 | Kristen Wiig |
20201212 | Timothee Chalamet |
INSERT
allows us to add the relevant information associated with the episode of SNL that Elon Musk hosted.
INSERT INTO hosts (epid, aid)
VALUES ('20210508', 'Elon Musk');
SELECT * FROM hosts
ORDER BY epid DESC
LIMIT 10;
epid | aid |
---|---|
20210508 | Elon Musk |
20210410 | Carey Mulligan |
20210403 | Daniel Kaluuya |
20210327 | Maya Rudolph |
20210227 | Nick Jonas |
20210220 | Rege-Jean Page |
20210213 | Regina King |
20210206 | Dan Levy |
20210130 | John Krasinski |
20201219 | Kristen Wiig |
It would be tedious to INSERT
all of the most recent host information by hand. Instead, we’ll scrape the SNL archives using the R package rvest, which allows us to pull out the appropriate html elements. The epid
and aid
are joined together in a tibble, and filter
ed to only include episodes which are not already in the episodes
table.
library(rvest)
recent_hosts <- read_html("http://www.snlarchives.net/Episodes/") |>
html_nodes("tr") |>
purrr::map_df( ~ tibble(
epid = .x |> html_node("a.ms-2.me-2") |>
html_attr("href") |>
str_extract("\\d+"),
aid = .x |> html_node("td:nth-child(2)") |>
html_text2() |>
str_extract("[\\w\\. \\w\\.]+(?=/|$)")
)) |>
filter(epid > 20210508)
write_csv(recent_hosts, "data/recent_hosts.csv")
INSERT INTO hosts
SELECT *
FROM READ_CSV('data/recent_hosts.csv', AUTO_DETECT = TRUE);
SELECT * FROM hosts
ORDER BY epid DESC
LIMIT 10;
epid | aid |
---|---|
20231216 | Kate McKinnon |
20231209 | Adam Driver |
20231202 | Emma Stone |
20231118 | Jason Momoa |
20231111 | Timothée Chalamet |
20231028 | Nate Bargatze |
20231021 | Bad Bunny |
20231014 | Pete Davidson |
20230415 | Ana de Armas |
20230408 | Molly Shannon |
6.3 Deleting data
You might change your mind and decide that you really only want hosts from years up to 2022. The DELETE
function deletes any rows specified by the WHERE
clause.
DELETE FROM hosts
WHERE epid > 20221231
SELECT * FROM hosts
ORDER BY epid DESC
LIMIT 10;
epid | aid |
---|---|
20221217 | Austin Butler |
20221210 | Martin Short |
20221203 | Keke Palmer |
20221112 | Dave Chappelle |
20221105 | Amy Schumer |
20221029 | Jack Harlow |
20221015 | Megan Thee Stallion |
20221008 | Brendan Gleeson |
20221001 | Miles Teller |
20220521 | Natasha Lyonne |
6.4 Altering the table
ALTER TABLE
changes the structure of a table. For example, you can add or delete columns, create or destroy indexes, change the type of existing columns, or rename columns or the table itself. (Syntax below is for MySQL. Unfortunately, DuckDB is finicky when ALTER
ing tables, so the commands below may not work on the tables created using DuckDB.)1
Multiple ADD
, ALTER
, DROP
, and CHANGE
clauses are permitted in a single ALTER TABLE
statement, separated by commas.
ALTER TABLE t1
DROP COLUMN col1,
DROP COLUMN col2;
To alter a column to change both its name and definition, use CHANGE
, specifying the old and new names and the new definition. For example, to rename an INT NOT NULL
column from a
to b
and change its definition to use the BIGINT
data type while retaining the NOT NULL
attribute, do this:
ALTER TABLE t1 CHANGE a b BIGINT NOT NULL;
To change a column definition but not its name, use CHANGE
or MODIFY
. With CHANGE
, the syntax requires two column names, so you must specify the same name twice to leave the name unchanged. For example, to change the definition of column b
:
ALTER TABLE t1 CHANGE b b INT NOT NULL;
MODIFY
is more convenient to change the definition without changing the name because it requires the column name only once:
ALTER TABLE t1 MODIFY b INT NOT NULL;
To change a column name but not its definition, use CHANGE
or RENAME COLUMN
. With CHANGE
, the syntax requires a column definition, so to leave the definition unchanged, you must re-specify the definition the column currently has. For example, to rename an INT NOT NULL
column from b
to a
:
ALTER TABLE t1 CHANGE b a INT NOT NULL;
RENAME COLUMN
is more convenient to change the name without changing the definition because it requires only the old and new names:
ALTER TABLE t1 RENAME COLUMN b TO a;
In general, you cannot rename a column to a name that already exists in the table. However, this is sometimes not the case, such as when you swap names or move them through a cycle. If a table has columns named a
, b
, and c
, the following are valid operations:
/* swap a and b */
ALTER TABLE t1 RENAME COLUMN a TO b,
RENAME COLUMN b TO a;
/* "rotate" a, b, c through a cycle */
ALTER TABLE t1 RENAME COLUMN a TO b,
RENAME COLUMN b TO c,
RENAME COLUMN c TO a;
6.5 Temporary tables
Temporary tables are used to break down complex queries into smaller, more manageable steps. For example, let’s say we want to JOIN
two tables after each has been filtered using different WHERE
clauses. The filtered tables can each be saved into their own temporary tables and then the temporary tables can be merged.
Note that tables in DuckDB are saved (to disk), even when the connection is closed. However, temporary tables are saved in memory (instead of on disk) and are deleted when the connection is closed. Specific configuration of the temporary directory allows for temporary tables to be saved, even when the connection is closed.
Notice that most of the tables have some kind of ID which allows JOIN
ing across tables.
SELECT * FROM hosts LIMIT 10;
epid | aid |
---|---|
20210410 | Carey Mulligan |
20210403 | Daniel Kaluuya |
20210327 | Maya Rudolph |
20210227 | Nick Jonas |
20210220 | Rege-Jean Page |
20210213 | Regina King |
20210206 | Dan Levy |
20210130 | John Krasinski |
20201219 | Kristen Wiig |
20201212 | Timothee Chalamet |
SELECT * FROM episodes LIMIT 10;
sid | epid | aired | epno |
---|---|---|---|
46 | 20210410 | April 10, 2021 | 17 |
46 | 20210403 | April 3, 2021 | 16 |
46 | 20210327 | March 27, 2021 | 15 |
46 | 20210227 | February 27, 2021 | 14 |
46 | 20210220 | February 20, 2021 | 13 |
46 | 20210213 | February 13, 2021 | 12 |
46 | 20210206 | February 6, 2021 | 11 |
46 | 20210130 | January 30, 2021 | 10 |
46 | 20201219 | December 19, 2020 | 9 |
46 | 20201212 | December 12, 2020 | 8 |
6.5.1 Creating a temporary table
The episodes
table has an aired
column which includes the data. Recall that if we create a new variable (e.g., year
) using aired
, we cannot use year
in the WHERE
clause (WHERE
only works on the original table, not the results set).
In MySQL the function STR_TO_DATE
allowed us to create a datetime variable from which year could be extracted. However, in DuckDB, it is more complicated to convert the character string of “April 10, 2020” to “2020-04-10”. Don’t worry about the code too much, but note that we wouldn’t want to wrangle the character date string every time we wanted to filter for year.
What does POSITION
do?
In case you are curious about the date wrangling code… consider SUBSTRING(aired, POSITION(',' IN aired) + 2)
POSITION(',' IN aired)
: This part of the expression uses thePOSITION
function to find the position of the first occurrence of the comma (,
) in the stringaired
. The result is the index (position) of the comma within the string.POSITION(',' IN aired) + 2
: This adds 2 to the index of the comma. The+ 2
is used to move the starting point of the substring two positions to the right of the comma. This is done to exclude the comma itself and any following spaces.SUBSTRING(aired, POSITION(',' IN aired) + 2)
: This part uses the SUBSTRING function to extract a substring from the stringaired
. The starting position of the substring is determined byPOSITION(',' IN aired) + 2
, and it goes until the end of the string. This effectively removes the part of the string that comes before and including the first comma.
In summary, the entire expression is extracting a substring from the original string aired
, starting from two positions to the right of the first comma and continuing until the end of the string. This can be useful in scenarios where you want to remove or isolate part of a string based on the position of a specific character (in this case, the comma).
CREATE TEMP TABLE episodes_date AS
SELECT *, CASE
WHEN POSITION(',' IN aired) > 0 THEN
EXTRACT(YEAR FROM CAST(
',' IN aired) + 2) || '-' ||
SUBSTRING(aired, POSITION(CASE
WHEN POSITION('January' IN aired) > 0 THEN '01'
WHEN POSITION('February' IN aired) > 0 THEN '02'
WHEN POSITION('March' IN aired) > 0 THEN '03'
WHEN POSITION('April' IN aired) > 0 THEN '04'
WHEN POSITION('May' IN aired) > 0 THEN '05'
WHEN POSITION('June' IN aired) > 0 THEN '06'
WHEN POSITION('July' IN aired) > 0 THEN '07'
WHEN POSITION('August' IN aired) > 0 THEN '08'
WHEN POSITION('September' IN aired) > 0 THEN '09'
WHEN POSITION('October' IN aired) > 0 THEN '10'
WHEN POSITION('November' IN aired) > 0 THEN '11'
WHEN POSITION('December' IN aired) > 0 THEN '12'
ELSE '01' -- Default to January if no month is found
END || '-' ||
' ' IN aired) + 1, 2) AS DATE
SUBSTRING(aired, POSITION(
))END AS year FROM episodes;
SELECT * FROM episodes_date LIMIT 10;
sid | epid | aired | epno | year |
---|---|---|---|---|
46 | 20210410 | April 10, 2021 | 17 | 2021 |
46 | 20210403 | April 3, 2021 | 16 | 2021 |
46 | 20210327 | March 27, 2021 | 15 | 2021 |
46 | 20210227 | February 27, 2021 | 14 | 2021 |
46 | 20210220 | February 20, 2021 | 13 | 2021 |
46 | 20210213 | February 13, 2021 | 12 | 2021 |
46 | 20210206 | February 6, 2021 | 11 | 2021 |
46 | 20210130 | January 30, 2021 | 10 | 2021 |
46 | 20201219 | December 19, 2020 | 9 | 2020 |
46 | 20201212 | December 12, 2020 | 8 | 2020 |
6.5.2 Using a temporary table
Now that the year
variable has been created in the new temporary table called episodes_date
, we can use episode_date
to query and find, for example, all of the hosts in 2019.
SELECT hosts.aid, ep.aired, ep.year FROM hosts
JOIN episodes_date AS ep ON hosts.epid = ep.epid
WHERE year = 2019
LIMIT 25;
aid | aired | year |
---|---|---|
Eddie Murphy | December 21, 2019 | 2019 |
Scarlett Johansson | December 14, 2019 | 2019 |
Jennifer Lopez | December 7, 2019 | 2019 |
Will Ferrell | November 23, 2019 | 2019 |
Harry Styles | November 16, 2019 | 2019 |
Kristen Stewart | November 2, 2019 | 2019 |
Chance the Rapper | October 26, 2019 | 2019 |
David Harbour | October 12, 2019 | 2019 |
Phoebe Waller-Bridge | October 5, 2019 | 2019 |
Woody Harrelson | September 28, 2019 | 2019 |
Paul Rudd | May 18, 2019 | 2019 |
Emma Thompson | May 11, 2019 | 2019 |
Adam Sandler | May 4, 2019 | 2019 |
Emma Stone | April 13, 2019 | 2019 |
Kit Harington | April 6, 2019 | 2019 |
Sandra Oh | March 30, 2019 | 2019 |
Idris Elba | March 9, 2019 | 2019 |
John Mulaney | March 2, 2019 | 2019 |
Don Cheadle | February 16, 2019 | 2019 |
Halsey | February 9, 2019 | 2019 |
James McAvoy | January 26, 2019 | 2019 |
Rachel Brosnahan | January 19, 2019 | 2019 |
6.6 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_duckdb, shutdown = TRUE)
6.7 Reflection questions
How can you update the value of a particular variable? What if you want to update a variable for many rows?
Why/when would you use
CHANGE
instead ofRENAME COLUMN
? Why/when would you useCHANGE
instead ofMODIFY
?When are temporary tables useful? Can you always create temporary tables if you are working in SQL? Explain the hierarchy of tables, temporary tables, and subqueries.
6.8 Ethics considerations
Who should have the ability / access to insert, delete, or update tables? Should everyone who accesses a table also have the ability to edit the table? Why or why not?
What can you do if you accidentally
DELETE
the wrong rows orDROP
the wrong the columns?
Information and examples in this section taken from https://dev.mysql.com/doc/refman/8.0/en/alter-table.html#alter-table-add-drop-column↩︎