con_imdb <- DBI::dbConnect(
RMariaDB::MariaDB(),
dbname = "imdb",
host = Sys.getenv("SMITH_HOST"),
user = Sys.getenv("SMITH_USER"),
password = Sys.getenv("SMITH_PWD")
)
6 Combining tables in SQL
Beyond SELECT
queries, the next most fundamental actions in SQL have to do with working with two or more tables simultaneously. If you have worked in R, you may be familiar with the join_()
family of functions. Moving beyond JOIN
, we will also cover subqueries which create temporary tables inside a SQL statement.
The database we’ll work with for learning how to combine tables is on IMDb data. There are many tables, see Table 6.1. Table 6.2 shows information on the kind_type
table which describes each kind
. Other tables contain ID variables which will be important when performing JOIN
s.
As seen in Table 6.1, there are 21 tables in the imdb
database! We won’t go into details about what each one of them means, but it is a huge database with details on movies, TV series, video games, and more. For our work, we are interested only in movies (and details within, like actors and ratings).
TABLES; SHOW
Tables_in_imdb |
---|
aka_name |
aka_title |
cast_info |
char_name |
comp_cast_type |
company_name |
company_type |
complete_cast |
info_type |
keyword |
kind_type |
link_type |
movie_companies |
movie_info |
movie_info_idx |
movie_keyword |
movie_link |
name |
person_info |
role_type |
title |
If you are curious about the database, you can glance through each of the tables separately using a SELECT
command (don’t forget to use LIMIT
, especially on a table you haven’t previously explored).
SELECT * FROM kind_type
LIMIT 0, 10;
id | kind |
---|---|
1 | movie |
2 | tv series |
3 | tv movie |
4 | video movie |
5 | tv mini series |
6 | video game |
7 | episode |
6.1 Subqueries
A SQL subquery is a query used as a data source in the FROM
clause, instead of the usual table. There was a subquery in Table 5.6 when the task required a function of the results set within the SELECT
clause.
We could do something similar if we wanted to transform the variables in the select column. The example is a little bit forced, and there are other ways to obtain the same results. But hopefully the idea of a subquery is becoming more clear. Again, a subquery is just a query that becomes the data source for FROM
.
Chapter 9 will cover regular expressions in some detail. Here we use the function REGEXP_REPLACE
to remove any characters which are not letters, comma, or space. The function LOWER
converts any upper case letters to lower case.
SELECT name,
name_clean,',', 1) AS last_name,
SUBSTRING_INDEX(name_clean, ',', -1) AS first_name
SUBSTRING_INDEX(name_clean, FROM (
SELECT LOWER(REGEXP_REPLACE(name, '[^a-z,. ]', '')) AS name_clean,
name,id, person_id
FROM aka_name) AS temp_subquery
LIMIT 0, 30;
name | name_clean | last_name | first_name |
---|---|---|---|
Smith, Jessica Noel | smith, jessica noel | smith | jessica noel |
Pain, L. $ham | pain, l. ham | pain | l. ham |
Boy, $hutter | boy, hutter | boy | hutter |
Dollasign, Ty | dollasign, ty | dollasign | ty |
Sign, Ty Dolla | sign, ty dolla | sign | ty dolla |
Moore, Brandon | moore, brandon | moore | brandon |
$torm, Country | torm, country | torm | country |
'Hooper', Simon P.J. Kelly | hooper, simon p.j. kelly | hooper | simon p.j. kelly |
Hooper | hooper | hooper | hooper |
Kelly, Simon P.J. | kelly, simon p.j. | kelly | simon p.j. |
Abdul-Hamid, Jaffar | abdulhamid, jaffar | abdulhamid | jaffar |
Al-Hamid, Jaffar Abd | alhamid, jaffar abd | alhamid | jaffar abd |
Svensson, Acke | svensson, acke | svensson | acke |
Viera, Michael 'Power' | viera, michael power | viera | michael power |
Buguelo | buguelo | buguelo | buguelo |
'El Burro' Rankin', Jorge Van | el burro rankin, jorge van | el burro rankin | jorge van |
Burro, El | burro, el | burro | el |
Van Rankin, Jorge 'Burro' | van rankin, jorge burro | van rankin | jorge burro |
Van Rankin, Jorge | van rankin, jorge | van rankin | jorge |
van Rankin, Jorge 'El Burro' | van rankin, jorge el burro | van rankin | jorge el burro |
Seigal, Jason | seigal, jason | seigal | jason |
Kaufman, Murray | kaufman, murray | kaufman | murray |
'Knoccout'Madison, Kareim | knoccoutmadison, kareim | knoccoutmadison | kareim |
Starks, Johnny | starks, johnny | starks | johnny |
Kraemer, 'Logan' Howard | kraemer, logan howard | kraemer | logan howard |
Gee, Emm | gee, emm | gee | emm |
Cusick, Maura | cusick, maura | cusick | maura |
Maura, Maude Cusick | maura, maude cusick | maura | maude cusick |
Wheeler, Mackenzie | wheeler, mackenzie | wheeler | mackenzie |
Monkey | monkey | monkey | monkey |
6.2 All the JOIN
s
Recall that SQL is a query language that works on relational databases. One of its major strengths is being able to efficiently store information in separate tables that can be easily connected as needed. The syntax for tying together information from multiple tables is done with a JOIN
clause.
Each JOIN
clause needs four specific pieces of information:
- The name of the first table you want to
JOIN
. - The type of
JOIN
being used. - The name of the second table you want to
JOIN
. - The condition(s) under which you want the records in the first table to match records in the second table.
Some types of JOIN
s available in MySQL include the following, which are represented as Venn diagrams in Figure 6.1.
-
JOIN
: include all of the rows that exist in both tables (similar toinner_join()
in R, the intersection of the two tables).INNER JOIN
is alternative, and identical, function toJOIN
. -
LEFT JOIN
: include all of the rows in the first table. Connect them, as much as possible, to the rows in the second table. Rows that have no match in the second table will have a value ofNULL
for the new “second table” variables. -
RIGHT JOIN
: include all of the rows in the second table. Connect them, as much as possible, to the rows in the first table. Rows that have no match in the first table will have a value ofNULL
for the new “first table” variables. ARIGHT JOIN
with the tables in the opposite order is the same as aLEFT JOIN
with the tables in the original order. -
FULL OUTER JOIN
: include all rows in either table. Rows that have no match in the other table will have a value ofNULL
for the other table variables. (similar tofull_join()
in R, the union of the two tables). The functionality doesn’t exist in MySQL but can be created using joins andUNION
. -
CROSS JOIN
: match each row of the first table with each row in the second table.
Figure 6.1 shows Venn diagrams of the different types of joins. Figure 6.2 shows four of the JOIN
functions with mini data tables. Note that in SQL the missing values will be labeled as NULL
(not NA
).
6.2.1 A toy example
We will head to R for just a minute so as to understand joins using a small toy dataset on rock bands from the 60s, The Beatles and The Rolling Stones. The function sqldf()
in the sqldf R package allows for SQL commands on R objects.
Consider the following datasets which are available in the dplyr package.
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
Inner join
An inner join combines two datasets returning only the observations that exist in both of the original datasets.
sqldf::sqldf("SELECT star.name,
star.band,
inst.plays
FROM band_members AS star
JOIN band_instruments AS inst ON star.name = inst.name")
name band plays
1 John Beatles guitar
2 Paul Beatles bass
Full join
A full join combines two datasets returning every observation that exists in either one of the original datasets. Note that in the results, Mick’s instrument is missing, and Keith’s band is missing.
The full_join()
function does not have an equivalent in MySQL. See Section 6.3.1.1 for using JOIN
s and UNION
s to produce a full join.
band_members |>
full_join(band_instruments)
# A tibble: 4 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
Left join
A left join combines two datasets returning every observation that exists in the left (or first) original dataset. Note that in the results, Mick’s instrument is missing.
sqldf::sqldf("SELECT star.name,
star.band,
inst.plays
FROM band_members AS star
LEFT JOIN band_instruments AS inst
ON star.name = inst.name")
name band plays
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
Right join
A right join combines two datasets returning every observation that exists in the right (or second) original dataset. Note that in the results, Keith’s band is missing.
sqldf::sqldf("SELECT inst.name,
star.band,
inst.plays
FROM band_members AS star
RIGHT JOIN band_instruments AS inst
ON star.name = inst.name")
name band plays
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
6.2.2 JOIN
In the imdb
database, the title
table includes information about the 4,626,322 titles in the database, including the id
, title
, kind_id
(indicator for the kind of ID it is), and production_year
. It does not, however, include the review of the title. See Table 6.4.
SELECT * FROM title LIMIT 0, 10;
id | title | imdb_index | kind_id | production_year | imdb_id | phonetic_code | episode_of_id | season_nr | episode_nr | series_years | md5sum |
---|---|---|---|---|---|---|---|---|---|---|---|
78460 | Adults Recat to the Simpsons (30th Anniversary) | 7 | 2017 | A3432 | 78406 | 2ae09eed7d576cc2c24774fed5b18168 | |||||
70273 | (2016-05-18) | 7 | 2016 | 68058 | 511dfc14cfff7589d29a95abb30cd66a | ||||||
60105 | (2014-04-11) | 7 | 2014 | 59138 | c6cdce7e667e07713e431805c407feed | ||||||
32120 | (2008-05-01) | 7 | 2008 | 32060 | 100df65742caf5afd092b2e0ead67d8e | ||||||
97554 | Schmölders Traum | 7 | 2001 | S2543 | 97302 | 10 | 1 | 46862a2f96f9fb2d59e8c9a11ecfdd28 | |||
57966 | (#1.1) | 7 | 2013 | 57965 | 1 | 1 | 409c37703766c4b24f8a86162fd9cf85 | ||||
76391 | Anniversary | 7 | 1971 | A5162 | 76385 | 4 | 9 | 5e12ce73fac1d1dcf94136b6e9acd8f8 | |||
11952 | Angus Black/Lester Barrie/DC Curry | 7 | 2009 | A5214 | 11937 | 4 | 7 | 9c38b9e5601dc154444b73b518034aa1 | |||
1554 | New Orleans | 7 | 2003 | N6452 | 1508 | 2 | 11 | 621bea735740a547e862e4a3226f35d2 | |||
58442 | Kiss Me Kate | 7 | 2011 | K2523 | 58436 | 1 | 10 | 293e8c75c7f35a4035abf617962be5a9 |
The movie_info_idx
table does not contain much information about each particular film. It does, however, have an indicator for the movie ID (given by movie_id
) as well as the number of votes (given by info
where type_id = 100
). See Table 6.5.
SELECT * FROM movie_info_idx LIMIT 0, 6;
id | movie_id | info_type_id | info | note |
---|---|---|---|---|
1 | 1 | 99 | 31.2.1..2. | |
2 | 1 | 100 | 9 | |
3 | 1 | 101 | 4.1 | |
4 | 2 | 99 | 1000000102 | |
5 | 2 | 100 | 61 | |
6 | 2 | 101 | 6.4 |
Let’s say we want to combine the titles with the number of votes so that each title with user votes is included. That is, only keep the titles that have a corresponding votes. And also, only keep the votes if there is an associated title (which means we use INNER JOIN
or just plain JOIN
).
Remember that WHERE
will work on the raw variables, and HAVING
works on the results set.
Some aspects of the query are worth pointing out:
* The variables in the output are given in the SELECT
clause. The id
and title
(both from the title
table) and the info
from the movie_info_idx
which represents the number of IMDb votes. * The variables are preceded by the table from which they came. While not always necessary, it is good practice so as to avoid confusion. * The JOIN
happens by linking the id
variable in the title
table with the movie_id
variable in the movie_info_idx
table. * The LIMIT
wasn’t necessary (there are only 12 observations), but it’s good practice so that we don’t end up with unwieldy query results. * The WHERE
clause happens before the JOIN
action, despite being written after. * In the WHERE
clause, we keep only movies, only 2015 production year, and only at least 150,000 votes.
SELECT title.id,
title.title,
movie_info_idx.infoFROM title
JOIN movie_info_idx ON title.id = movie_info_idx.movie_id
WHERE title.production_year = 2015
AND title.kind_id = 1 # movies only
AND movie_info_idx.info_type_id = 100 # info_type is votes
AND movie_info_idx.info > 150000 # at least 150,000 votes
ORDER BY movie_info_idx.info DESC
LIMIT 0, 20;
id | title | info |
---|---|---|
4260166 | Star Wars: Episode VII - The Force Awakens | 691691 |
3915213 | Mad Max: Fury Road | 666484 |
4389619 | The Martian | 583987 |
3313672 | Avengers: Age of Ultron | 540606 |
4414139 | The Revenant | 526189 |
3787790 | Jurassic World | 471237 |
3752999 | Inside Out | 443051 |
3292159 | Ant-Man | 390965 |
4364483 | The Hateful Eight | 363199 |
4251736 | Spectre | 319875 |
3630368 | Furious Seven | 310970 |
4255450 | Spotlight | 290362 |
3961438 | Mission: Impossible - Rogue Nation | 266759 |
4321769 | The Big Short | 262598 |
4221220 | Sicario | 260996 |
3600120 | Fifty Shades of Grey | 250962 |
4164324 | Room | 244210 |
3379559 | Bridge of Spies | 229440 |
4368666 | The Hunger Games: Mockingjay - Part 2 | 214569 |
4387967 | The Man from U.N.C.L.E. | 213754 |
Let’s say we also want to obtain information about the actors and actresses in each of the movies. In the cast_info
table, there is a person_id
, a movie_id
, and person_role_id
is 1
if actor and 2
if actress.
SELECT * FROM cast_info LIMIT 0, 10;
id | person_id | movie_id | person_role_id | note | nr_order | role_id |
---|---|---|---|---|---|---|
1 | 1 | 3432997 | 1 | 31 | 1 | |
2 | 2 | 1901690 | 2 | 1 | ||
3 | 3 | 4027567 | 2 | 25 | 1 | |
4 | 3 | 4282876 | 3 | 22 | 1 | |
5 | 4 | 3542672 | 12 | 1 | ||
6 | 5 | 3331520 | 4 | (as $hutter Boy) | 10 | 1 |
7 | 5 | 4027191 | 2 | (as $hutter Boy) | 1 | 1 |
8 | 5 | 4195731 | 5 | (uncredited) | 1 | |
9 | 5 | 4263956 | 6 | (uncredited) | 1 | |
10 | 5 | 4267787 | 7 | (uncredited) | 1 |
We also want the name of the actress which is in the table aka_name
. Note that there is no movie information in the aka_name
table!
SELECT * FROM aka_name LIMIT 0, 10;
id | person_id | name | imdb_index | name_pcode_cf | name_pcode_nf | surname_pcode | md5sum |
---|---|---|---|---|---|---|---|
1 | 6188450 | Smith, Jessica Noel | S5325 | J2542 | S53 | 25c9d464e3ff2957533546aa92b397ed | |
2 | 5125059 | Pain, L. $ham | P545 | L515 | P5 | 569b1e885ccb51211c01753f0dad9b2c | |
3 | 5 | Boy, $hutter | B36 | H361 | B | 35092b5604ce378fc48c8a6fc0038a49 | |
4 | 4152053 | Dollasign, Ty | D4253 | T3425 | D425 | 0f565a2d8027cfb8ed6c5f4bba719fcd | |
5 | 4152053 | Sign, Ty Dolla | S2534 | T3425 | S25 | 2eded1b021b96333b4b74e0fec959650 | |
6 | 6 | Moore, Brandon | M6165 | B6535 | M6 | 193a6f5adf4756320f622162d2475608 | |
7 | 8 | $torm, Country | T6525 | C5363 | T65 | 1654400b707d34323ea392b87060e6cc | |
8 | 19 | 'Hooper', Simon P.J. Kelly | H1625 | S5124 | H16 | 3fd8885372c23f8c74e583da91d1fd05 | |
9 | 19 | Hooper | H16 | 24ddc68ab605ee95857ad45b65ffa2d8 | |||
10 | 19 | Kelly, Simon P.J. | K4251 | S5124 | K4 | 33d976f22e276b73c61513bc5f6e72a6 |
Connecting the most popular movies of 2015 with the actresses in those movies requires a series of JOIN
s. Note that to make the code less onerous, the title
table has been aliased by t
, the movie_info_idx
table has been aliased by idx
, the cast_info
table has been aliased by a
, and the aka_name
table has been aliased by n
.
There is a lot of data cleaning to do as some of the person_id
values are one to many!! That is, the person_id
matches multiple names in the aka_name
database.
SELECT t.title,
idx.info,
a.person_id,
n.nameFROM title AS t
JOIN movie_info_idx AS idx ON t.id = idx.movie_id
JOIN cast_info AS a ON idx.movie_id = a.movie_id
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 150000 # at least 150,000 votes
AND a.role_id = 2 # actresses only
ORDER BY idx.info DESC
LIMIT 0, 50;
title | info | person_id | name |
---|---|---|---|
Star Wars: Episode VII - The Force Awakens | 691691 | 2698188 | Sam |
Star Wars: Episode VII - The Force Awakens | 691691 | 2806101 | Gillespie, Hilary Catherine |
Star Wars: Episode VII - The Force Awakens | 691691 | 2959609 | Cuzner, Natalie |
Star Wars: Episode VII - The Force Awakens | 691691 | 3089483 | Fisher, Carrie Frances |
Star Wars: Episode VII - The Force Awakens | 691691 | 3150880 | Class, Clare |
Star Wars: Episode VII - The Force Awakens | 691691 | 3150880 | Glass, Claire |
Star Wars: Episode VII - The Force Awakens | 691691 | 3231758 | Henwick, Jessica Yu Li |
Star Wars: Episode VII - The Force Awakens | 691691 | 3265686 | Hui, Karen |
Star Wars: Episode VII - The Force Awakens | 691691 | 3305561 | Kamen, Hannah John |
Star Wars: Episode VII - The Force Awakens | 691691 | 3462940 | Billy |
Star Wars: Episode VII - The Force Awakens | 691691 | 3462940 | Lourd, Billie Catherine |
Star Wars: Episode VII - The Force Awakens | 691691 | 3569409 | Fran |
Star Wars: Episode VII - The Force Awakens | 691691 | 3649948 | Nyongo, Lupita |
Star Wars: Episode VII - The Force Awakens | 691691 | 3649948 | Nyong'o, Lupita Amondi |
Star Wars: Episode VII - The Force Awakens | 691691 | 3785240 | Ridley, Daisy Jazz Isobel |
Star Wars: Episode VII - The Force Awakens | 691691 | 3835377 | Giagrande, Meredith J. |
Star Wars: Episode VII - The Force Awakens | 691691 | 3835377 | Salinger, Meredith |
Star Wars: Episode VII - The Force Awakens | 691691 | 3835377 | Salenger, Meredith Dawn |
Star Wars: Episode VII - The Force Awakens | 691691 | 3850834 | Phi |
Star Wars: Episode VII - The Force Awakens | 691691 | 3875581 | Fox, Claudia |
Star Wars: Episode VII - The Force Awakens | 691691 | 3879039 | Arti |
Star Wars: Episode VII - The Force Awakens | 691691 | 3907812 | Slade, Sandy |
Star Wars: Episode VII - The Force Awakens | 691691 | 3907812 | Sandy |
Star Wars: Episode VII - The Force Awakens | 691691 | 3907812 | Sandy Slade |
Star Wars: Episode VII - The Force Awakens | 691691 | 3938795 | Ryan, Karol Lesley |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Stevens, Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Taber, Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Tabor, Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Stevens, Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Taber, Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Tabor, Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 3970637 | Cat |
Star Wars: Episode VII - The Force Awakens | 691691 | 4073883 | Walter, Dame Harriet |
Star Wars: Episode VII - The Force Awakens | 691691 | 4073883 | Walter, Harriet Mary |
Star Wars: Episode VII - The Force Awakens | 691691 | 4094732 | White, Kelsey Marie |
Mad Max: Fury Road | 666484 | 2681098 | Michelle, Debra |
Mad Max: Fury Road | 666484 | 2782138 | Ali |
Mad Max: Fury Road | 666484 | 2873752 | Cardona, Helena |
Mad Max: Fury Road | 666484 | 2873752 | Cardona, Helene |
Mad Max: Fury Road | 666484 | 2873752 | Cardona, Hélène Vania |
Mad Max: Fury Road | 666484 | 2957052 | Cunico, Lillie |
Mad Max: Fury Road | 666484 | 3087531 | Finlay, Sandi 'Hotrod' |
Mad Max: Fury Road | 666484 | 3087531 | Finlay, Sandi 'Hotrod' |
Mad Max: Fury Road | 666484 | 3146859 | Gilles, Coco Jack |
Mad Max: Fury Road | 666484 | 3146859 | Gillies, Coco |
Mad Max: Fury Road | 666484 | 3268456 | Rose |
Mad Max: Fury Road | 666484 | 3268456 | Huntington-Whiteley, Rosie Alice |
Mad Max: Fury Road | 666484 | 3343489 | Kellerman, Antoinette |
Mad Max: Fury Road | 666484 | 3348513 | Riley |
6.2.3 Other JOIN
s
Consider the following two tables. The first has seven movies in it (from 2015 with at least 400,000 IMDb votes). The second consists of almost 3 million actresses (person_role_id = 2
). In order to find a subset of actresses, the person_id > 3900000
was set arbitrarily (in order to have a smaller group with which to work).
movies:
SELECT t.id,
t.title,
idx.info,SELECT COUNT(*)
(FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 400000) AS row_count
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 400000 # at least 400,000 votes
ORDER BY idx.info DESC
id | title | info | row_count |
---|---|---|---|
4260166 | Star Wars: Episode VII - The Force Awakens | 691691 | 7 |
3915213 | Mad Max: Fury Road | 666484 | 7 |
4389619 | The Martian | 583987 | 7 |
3313672 | Avengers: Age of Ultron | 540606 | 7 |
4414139 | The Revenant | 526189 | 7 |
3787790 | Jurassic World | 471237 | 7 |
3752999 | Inside Out | 443051 | 7 |
actresses:
SELECT DISTINCT a.person_id,
a.movie_id,
n.name,SELECT COUNT(*)
(FROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.person_role_id = 2
AND a.person_id > 390000) AS row_count
FROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.person_role_id = 2
AND a.person_id > 3900000
LIMIT 0, 20;
person_id
is greater than 400000. Note that some actresses have different spelling or phrasing of their names.
person_id | movie_id | name | row_count |
---|---|---|---|
3900141 | 759802 | Simons, Rita Joanne | 2904759 |
3902258 | 4365829 | Singer, Rabbi Tovia | 2904759 |
3902699 | 3109788 | Singh, Sabine Erika | 2904759 |
3903035 | 3215866 | Val | 2904759 |
3904831 | 2468067 | Masha | 2904759 |
3904928 | 3654347 | Fei, Siu Yin | 2904759 |
3904928 | 3654347 | Hsiao, Yen-fei | 2904759 |
3904928 | 3654347 | Siu, Yinfei | 2904759 |
3904928 | 3654347 | Xiao, Yanfei | 2904759 |
3904928 | 3654347 | Yin-Fai, Siu | 2904759 |
3905289 | 115191 | Coso, Cosondra | 2904759 |
3905289 | 115191 | Sjostrom, Cossondra | 2904759 |
3905289 | 115191 | Coso | 2904759 |
3909355 | 2939100 | Slovácková, Anna Julie | 2904759 |
3911826 | 4379610 | Meador, Constance June | 2904759 |
3912134 | 2675144 | DJ | 2904759 |
3912134 | 2675144 | Smith, DJ | 2904759 |
3912134 | 2675144 | Smith, Dujonette | 2904759 |
3912134 | 2675144 | DJ Smith | 2904759 |
3913519 | 1678444 | Keely, Dorothy Jacqueline | 2904759 |
Using subqueries, we can JOIN
the two datasets using different JOIN
techniques.
Inner JOIN
With an inner JOIN
, there are 32 rows corresponding to all the actresses in the seven 2015 films with the most votes. Because the JOIN
is an intersection of the two tables, only the actresses with person_id
above 3900000 are included.
SELECT * FROM
SELECT t.id,
(
t.titleFROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 400000) AS movs # at least 400,000 votes
INNER JOIN (SELECT a.person_id,
a.movie_id,
n.nameFROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.role_id = 2 # acresses only
AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;
RIGHT JOIN
With a RIGHT JOIN
, there are more than 300 rows (the LIMIT
clause keeps us from knowing how many rows, but there are a LOT!) corresponding to all the actresses whose person_id
above 3900000 are included. Those actresses who acted in one of the seven top 2015 films are also included in the full results table, but they don’t happen to be in the truncated output here.
SELECT * FROM
SELECT t.id,
(
t.titleFROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 400000) AS movs # at least 400,000 votes
RIGHT JOIN (SELECT a.person_id,
a.movie_id,
n.nameFROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.role_id = 2 # acresses only
AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;
LEFT JOIN
With a LEFT JOIN
, there are 33 rows corresponding to the actresses in the seven top 2015 movies. Only The Revenant did not have any actresses whose person_id
is greater than 3900000.
SELECT * FROM
SELECT t.id,
(
t.titleFROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 400000) AS movs # at least 400,000 votes
LEFT JOIN (SELECT a.person_id,
a.movie_id,
n.nameFROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.role_id = 2 # acresses only
AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;
Counting repeat actresses
We might, for example, want to know how many names / spellings of a name with a specific person_id
(above 3900000) exist for each person_id
in each of the top voted seven films of 2015.
In Table 6.15 why isn’t there a column indicating the name of the actress? (There can’t be such a column. Why not?)
SELECT acts.person_id,
COUNT(*) AS num_repeat_names
FROM (SELECT t.id,
t.titleFROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 400000) AS movs # at least 400,000 votes
JOIN (SELECT a.person_id,
a.movie_id,
n.nameFROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.role_id = 2 # acresses only
AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
GROUP BY acts.person_id;
person_id
(> 3900000) in the seven top voted 2015 films, how many names / spellings are associated with the person_id
?
person_id | num_repeat_names |
---|---|
3916648 | 1 |
4122876 | 1 |
3938423 | 2 |
3950111 | 1 |
4079047 | 2 |
4084626 | 3 |
4099458 | 1 |
3958614 | 1 |
3990819 | 2 |
4081131 | 2 |
3907812 | 3 |
3938795 | 1 |
3970637 | 8 |
4073883 | 2 |
4094732 | 1 |
4098918 | 1 |
Counting number of actresses per film
We might, for example, want to know how many actresses with a specific person_id
(above 3900000) are in each of the top voted seven films of 2015.
SELECT movs.id,
movs.title,COUNT(*) AS num_actress
FROM (SELECT t.id,
t.titleFROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1 # movies only
AND idx.info_type_id = 100 # info_type is votes
AND idx.info > 400000) AS movs # at least 400,000 votes
JOIN (SELECT a.person_id,
a.movie_id,
n.nameFROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE a.role_id = 2 # acresses only
AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
GROUP BY movs.id;
id | title | num_actress |
---|---|---|
3313672 | Avengers: Age of Ultron | 1 |
3752999 | Inside Out | 1 |
3787790 | Jurassic World | 9 |
3915213 | Mad Max: Fury Road | 5 |
4260166 | Star Wars: Episode VII - The Force Awakens | 15 |
4389619 | The Martian | 1 |
6.3 UNION
ing
In SQL a UNION
clause combines two different tables by their rows (whereas JOIN
combines two tables by columns). Think about UNION
similarly to the bind_rows()
command in R.
6.3.1 UNION
s
UNION
does not check the names of the columns to make sure they match. UNION
requires that the number of columns be the same and that the variable type be the same for all columns in the two tables being combined.
Table 6.17 contains a silly example. The first table has 1 as bar
and the second table has 20 as bar
. But when the tables are UNION
ed, the bar
column contains c(1, 10)
. SQL took the column names from the first table and appended the second table without considering the variable names.
SELECT
1 AS bar,
2 AS foo
UNION
SELECT
10 AS foo,
20 AS bar;
UNION
.
bar | foo |
---|---|
1 | 2 |
10 | 20 |
UNION
is specifically designed to bind rows from two different SELECT
queries where the variables have been selected in the same order. If the two SELECT
clauses are done from the same table with the same order of variables, you do not need to worry about the order of the variables matching up in the UNION
. If you are UNION
ing two very different subqueries, you do need to worry about the variables and their order.
UNION
Let’s say we want to combine the top voted movies from 2015 with the top voted movies from 2019. However, to account for time, we require the movies from 2015 to have more votes (400,000) than the movies from 2017 (200,000). That is, the WHERE
clause is different for the two subqueries.
SELECT t.title,
(
t.production_year, AS num_votes
idx.info FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 400000)
UNION
SELECT t.title,
(
t.production_year, AS num_votes
idx.info FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2017
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 200000)
LIMIT 0, 100;
UNION
.
title | production_year | num_votes |
---|---|---|
Avengers: Age of Ultron | 2015 | 540606 |
Inside Out | 2015 | 443051 |
Jurassic World | 2015 | 471237 |
Mad Max: Fury Road | 2015 | 666484 |
Star Wars: Episode VII - The Force Awakens | 2015 | 691691 |
The Martian | 2015 | 583987 |
The Revenant | 2015 | 526189 |
Dunkirk | 2017 | 229089 |
Guardians of the Galaxy Vol. 2 | 2017 | 281845 |
Logan | 2017 | 397056 |
Spider-Man: Homecoming | 2017 | 209930 |
Wonder Woman | 2017 | 306611 |
UNION ALL
UNION
does check, however, to see if any of the rows in the two tables are identical. If the goal is to include duplicates across two tables, use UNION ALL
instead of UNION
.
Let’s say that the first table is all movies with production year after 2012 and number of votes greater than 500,000. The second table is movies with production year equal to 2015 and number of votes greater than 400,000. Even though the Martian would have been in both tables, the results table lists The Marian only once in Table 6.19.
SELECT t.title,
(
t.production_year, AS num_votes
idx.info FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year > 2012
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 500000)
UNION
SELECT t.title,
(
t.production_year, AS num_votes
idx.info FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 400000)
ORDER BY production_year DESC, num_votes;
UNION
to combine movies from table 1: later than 2012 and at least 500,000 votes with movies from table 2: 2015 and at least 400,000 votes.
title | production_year | num_votes |
---|---|---|
Batman v Superman: Dawn of Justice | 2016 | 500037 |
Deadpool | 2016 | 673887 |
Inside Out | 2015 | 443051 |
Jurassic World | 2015 | 471237 |
The Revenant | 2015 | 526189 |
Avengers: Age of Ultron | 2015 | 540606 |
The Martian | 2015 | 583987 |
Mad Max: Fury Road | 2015 | 666484 |
Star Wars: Episode VII - The Force Awakens | 2015 | 691691 |
Interstellar | 2014 | 1102826 |
Whiplash | 2014 | 507827 |
The Imitation Game | 2014 | 550521 |
The Grand Budapest Hotel | 2014 | 553558 |
Captain America: The Winter Soldier | 2014 | 562419 |
X-Men: Days of Future Past | 2014 | 567780 |
Gone Girl | 2014 | 664035 |
Guardians of the Galaxy | 2014 | 795151 |
12 Years a Slave | 2013 | 506640 |
Now You See Me | 2013 | 507519 |
World War Z | 2013 | 509285 |
The Hobbit: The Desolation of Smaug | 2013 | 526001 |
The Hunger Games: Catching Fire | 2013 | 537678 |
Man of Steel | 2013 | 592427 |
Iron Man Three | 2013 | 607323 |
Gravity | 2013 | 640900 |
The Wolf of Wall Street | 2013 | 900450 |
When UNION ALL
is applied in the same context, The Martian is listed twice in the results table given in Table 6.20.
SELECT t.title,
(
t.production_year, AS num_votes
idx.info FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year > 2012
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 500000)
UNION ALL
SELECT t.title,
(
t.production_year, AS num_votes
idx.info FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015
AND t.kind_id = 1
AND idx.info_type_id = 100
AND idx.info > 400000)
ORDER BY production_year DESC, num_votes;
UNION ALL
to combine movies from table 1: later than 2012 and at least 500,000 votes with movies from table 2: 2015 and at least 400,000 votes.
title | production_year | num_votes |
---|---|---|
Batman v Superman: Dawn of Justice | 2016 | 500037 |
Deadpool | 2016 | 673887 |
Inside Out | 2015 | 443051 |
Jurassic World | 2015 | 471237 |
The Revenant | 2015 | 526189 |
The Revenant | 2015 | 526189 |
Avengers: Age of Ultron | 2015 | 540606 |
Avengers: Age of Ultron | 2015 | 540606 |
The Martian | 2015 | 583987 |
The Martian | 2015 | 583987 |
Mad Max: Fury Road | 2015 | 666484 |
Mad Max: Fury Road | 2015 | 666484 |
Star Wars: Episode VII - The Force Awakens | 2015 | 691691 |
Star Wars: Episode VII - The Force Awakens | 2015 | 691691 |
Interstellar | 2014 | 1102826 |
Whiplash | 2014 | 507827 |
The Imitation Game | 2014 | 550521 |
The Grand Budapest Hotel | 2014 | 553558 |
Captain America: The Winter Soldier | 2014 | 562419 |
X-Men: Days of Future Past | 2014 | 567780 |
Gone Girl | 2014 | 664035 |
Guardians of the Galaxy | 2014 | 795151 |
12 Years a Slave | 2013 | 506640 |
Now You See Me | 2013 | 507519 |
World War Z | 2013 | 509285 |
The Hobbit: The Desolation of Smaug | 2013 | 526001 |
The Hunger Games: Catching Fire | 2013 | 537678 |
Man of Steel | 2013 | 592427 |
Iron Man Three | 2013 | 607323 |
Gravity | 2013 | 640900 |
The Wolf of Wall Street | 2013 | 900450 |
6.3.1.1 FULL OUTER JOIN
via UNION
MySQL doesn’t have a FULL OUTER JOIN
(although other implementations of SQL do have full join functionality). However, we can mimic a full join using right and left joins with UNION
.
Recall the ideas of RIGHT JOIN
(which keeps all observations in the right table) and LEFT JOIN
(which keeps all observations in the left table). By UNION
ing the right and left joins, all of the observations are obtained (i.e., a full join). Using the function sqldf()
in the sqldf R package, the full join will be demonstrated using the 1960s rock bands.
Notice that in the RIGHT JOIN
the name
column must come from the right table (not the left table).
Also notice that UNION ALL
keeps the duplicate rows which is probably not what we want.
sqldf::sqldf("SELECT star.name, star.band, inst.plays
FROM band_members AS star
LEFT JOIN band_instruments AS inst ON star.name = inst.name")
name band plays
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
sqldf::sqldf("SELECT inst.name, star.band, inst.plays
FROM band_members AS star
RIGHT JOIN band_instruments AS inst ON star.name = inst.name")
name band plays
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
sqldf::sqldf("SELECT star.name, star.band, inst.plays
FROM band_members AS star
LEFT JOIN band_instruments AS inst ON star.name = inst.name
UNION
SELECT inst.name, star.band, inst.plays
FROM band_members AS star
RIGHT JOIN band_instruments AS inst ON star.name = inst.name ")
name band plays
1 John Beatles guitar
2 Keith <NA> guitar
3 Mick Stones <NA>
4 Paul Beatles bass
sqldf::sqldf("SELECT star.name, star.band, inst.plays
FROM band_members AS star
LEFT JOIN band_instruments AS inst ON star.name = inst.name
UNION ALL
SELECT inst.name, star.band, inst.plays
FROM band_members AS star
RIGHT JOIN band_instruments AS inst ON star.name = inst.name ")
name band plays
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 John Beatles guitar
5 Paul Beatles bass
6 Keith <NA> guitar
6.4 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_imdb, shutdown = TRUE)
6.5 Reflection questions
What are the different types of joins? Which data from which table gets kept and which gets removed for each type of join?
What is the difference between a join and a union?
When working with multiple tables, how (and why) is a variable linked to its table?
Consider a
RIGHT JOIN
. If there are records in the right table that are not in the left table, what will the value of the left table variable be for those records?
6.6 Ethics considerations
What can happen if a
UNION
is done without carefully matching up the columns of the two tables beingUNION
ed?How will you know if
JOIN
ing removed some records? What if theJOIN
produced missing values for some of the variables? How should we deal with missing data or arbitrarily removed records?