4  Combining tables in SQL

Beyond SELECT queries, the next most fundamental actions in SQL have to do with working with two or more tables simultaneously. If you have worked in R, you may be familiar with the join_() family of functions. Moving beyond JOIN, we will also cover subqueries which create temporary tables inside a SQL statement.

The database we’ll work with for learning how to combine tables is on IMDb data. There are many tables, see Table 4.1. Table 4.2 shows information on the kind_type table which describes each kind. Other tables contain ID variables which will be important when performing JOINs.

con_imdb <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "imdb",
  host = Sys.getenv("SMITH_HOST"),
  user = Sys.getenv("SMITH_USER"),
  password = Sys.getenv("SMITH_PWD")
)

As seen in Table 4.1, there are 21 tables in the imdb database! We won’t go into details about what each one of them means, but it is a huge database with details on movies, TV series, video games, and more. For our work, we are interested only in movies (and details within, like actors and ratings).

SHOW TABLES;
Table 4.1: SHOW all the TABLES in the imdb database.
Tables_in_imdb
aka_name
aka_title
cast_info
char_name
comp_cast_type
company_name
company_type
complete_cast
info_type
keyword
kind_type
link_type
movie_companies
movie_info
movie_info_idx
movie_keyword
movie_link
name
person_info
role_type
title

If you are curious about the database, you can glance through each of the tables separately using a SELECT command (don’t forget to use LIMIT, especially on a table you haven’t previously explored).

SELECT * FROM kind_type
LIMIT 0, 10;
Table 4.2: SELECT to glance at the kind_type table in the imdb database.
id kind
1 movie
2 tv series
3 tv movie
4 video movie
5 tv mini series
6 video game
7 episode

4.1 Subqueries

A SQL subquery is a query used as a data source in the FROM clause, instead of the usual table. There was a subquery in Table 3.6 when the task required a function of the results set within the SELECT clause.

We could do something similar if we wanted to transform the variables in the select column. The example is a little bit forced, and there are other ways to obtain the same results. But hopefully the idea of a subquery is becoming more clear. Again, a subquery is just a query that becomes the data source for FROM.

Chapter 8 will cover regular expressions in some detail. Here we use the function REGEXP_REPLACE to remove any characters which are not letters, comma, or space. The function LOWER converts any upper case letters to lower case.

SELECT name,
       name_clean,
       SUBSTRING_INDEX(name_clean, ',', 1) AS last_name,
       SUBSTRING_INDEX(name_clean, ',', -1) AS first_name
FROM (
SELECT LOWER(REGEXP_REPLACE(name, '[^a-z,. ]', '')) AS name_clean,
       name,
       id, person_id
FROM aka_name) AS temp_subquery
LIMIT 0, 30;
Table 4.3: A subquery is used so that the variable in the subquery can be used and transformed in the SELECT clause.
name name_clean last_name first_name
Smith, Jessica Noel smith, jessica noel smith jessica noel
Pain, L. $ham pain, l. ham pain l. ham
Boy, $hutter boy, hutter boy hutter
Dollasign, Ty dollasign, ty dollasign ty
Sign, Ty Dolla sign, ty dolla sign ty dolla
Moore, Brandon moore, brandon moore brandon
$torm, Country torm, country torm country
'Hooper', Simon P.J. Kelly hooper, simon p.j. kelly hooper simon p.j. kelly
Hooper hooper hooper hooper
Kelly, Simon P.J. kelly, simon p.j. kelly simon p.j.
Abdul-Hamid, Jaffar abdulhamid, jaffar abdulhamid jaffar
Al-Hamid, Jaffar Abd alhamid, jaffar abd alhamid jaffar abd
Svensson, Acke svensson, acke svensson acke
Viera, Michael 'Power' viera, michael power viera michael power
Buguelo buguelo buguelo buguelo
'El Burro' Rankin', Jorge Van el burro rankin, jorge van el burro rankin jorge van
Burro, El burro, el burro el
Van Rankin, Jorge 'Burro' van rankin, jorge burro van rankin jorge burro
Van Rankin, Jorge van rankin, jorge van rankin jorge
van Rankin, Jorge 'El Burro' van rankin, jorge el burro van rankin jorge el burro
Seigal, Jason seigal, jason seigal jason
Kaufman, Murray kaufman, murray kaufman murray
'Knoccout'Madison, Kareim knoccoutmadison, kareim knoccoutmadison kareim
Starks, Johnny starks, johnny starks johnny
Kraemer, 'Logan' Howard kraemer, logan howard kraemer logan howard
Gee, Emm gee, emm gee emm
Cusick, Maura cusick, maura cusick maura
Maura, Maude Cusick maura, maude cusick maura maude cusick
Wheeler, Mackenzie wheeler, mackenzie wheeler mackenzie
Monkey monkey monkey monkey

4.2 All the JOINs

Recall that SQL is a query language that works on relational databases. One of its major strengths is being able to efficiently store information in separate tables that can be easily connected as needed. The syntax for tying together information from multiple tables is done with a JOIN clause.

Each JOIN clause needs four specific pieces of information:

  1. The name of the first table you want to JOIN.
  2. The type of JOIN being used.
  3. The name of the second table you want to JOIN.
  4. The condition(s) under which you want the records in the first table to match records in the second table.

Some types of JOINs available in MySQL include the following, which are represented as Venn diagrams in Figure 4.1.

  • JOIN: include all of the rows that exist in both tables (similar to inner_join() in R, the intersection of the two tables). INNER JOIN is alternative, and identical, function to JOIN.
  • LEFT JOIN: include all of the rows in the first table. Connect them, as much as possible, to the rows in the second table. Rows that have no match in the second table will have a value of NULL for the new “second table” variables.
  • RIGHT JOIN: include all of the rows in the second table. Connect them, as much as possible, to the rows in the first table. Rows that have no match in the first table will have a value of NULL for the new “first table” variables. A RIGHT JOIN with the tables in the opposite order is the same as a LEFT JOIN with the tables in the original order.
  • FULL OUTER JOIN: include all rows in either table. Rows that have no match in the other table will have a value of NULL for the other table variables. (similar to full_join() in R, the union of the two tables). The functionality doesn’t exist in MySQL but can be created using joins and UNION.
  • CROSS JOIN: match each row of the first table with each row in the second table.

Figure 4.1 shows Venn diagrams of the different types of joins. Figure 4.2 shows four of the JOIN functions with mini data tables. Note that in SQL the missing values will be labeled as NULL (not NA).

Venn diagrams describing different types of JOINs.

Figure 4.1: Venn diagrams describing different JOINs, image credit: phoenixNAP https://phoenixnap.com/kb/mysql-join

Mini data tables describing different types of JOINs.

Figure 4.2: Mini data tables describing different JOINs, image credit: Statistics Globe blog, https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

4.2.1 A toy example

We will head to R for just a minute so as to understand joins using a small toy dataset on rock bands from the 60s, The Beatles and The Rolling Stones. The function sqldf() in the sqldf R package allows for SQL commands on R objects.

Consider the following datasets which are available in the dplyr package.

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Inner join

An inner join combines two datasets returning only the observations that exist in both of the original datasets.

sqldf::sqldf("SELECT star.name,
                     star.band,
                     inst.plays
              FROM band_members AS star
              JOIN band_instruments AS inst ON star.name = inst.name")
  name    band  plays
1 John Beatles guitar
2 Paul Beatles   bass

Full join

A full join combines two datasets returning every observation that exists in either one of the original datasets. Note that in the results, Mick’s instrument is missing, and Keith’s band is missing.

The full_join() function does not have an equivalent in MySQL. See Section 4.3.1.1 for using JOINs and UNIONs to produce a full join.

band_members |>
  full_join(band_instruments)
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

Left join

A left join combines two datasets returning every observation that exists in the left (or first) original dataset. Note that in the results, Mick’s instrument is missing.

sqldf::sqldf("SELECT star.name,
                     star.band,
                      inst.plays
              FROM band_members AS star
              LEFT JOIN band_instruments AS inst 
              ON star.name = inst.name")
  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles guitar
3 Paul Beatles   bass

Right join

A right join combines two datasets returning every observation that exists in the right (or second) original dataset. Note that in the results, Keith’s band is missing.

sqldf::sqldf("SELECT inst.name,
                     star.band,
                      inst.plays
              FROM band_members AS star
              RIGHT JOIN band_instruments AS inst 
              ON star.name = inst.name")
   name    band  plays
1  John Beatles guitar
2  Paul Beatles   bass
3 Keith    <NA> guitar

4.2.2 JOIN

In the imdb database, the title table includes information about the 4,626,322 titles in the database, including the id, title, kind_id (indicator for the kind of ID it is), and production_year. It does not, however, include the review of the title. See Table 4.4.

SELECT * FROM title LIMIT 0, 10;
Table 4.4: SELECT to glance at the title table in the imdb database.
id title imdb_index kind_id production_year imdb_id phonetic_code episode_of_id season_nr episode_nr series_years md5sum
78460 Adults Recat to the Simpsons (30th Anniversary) 7 2017 A3432 78406 2ae09eed7d576cc2c24774fed5b18168
70273 (2016-05-18) 7 2016 68058 511dfc14cfff7589d29a95abb30cd66a
60105 (2014-04-11) 7 2014 59138 c6cdce7e667e07713e431805c407feed
32120 (2008-05-01) 7 2008 32060 100df65742caf5afd092b2e0ead67d8e
97554 Schmölders Traum 7 2001 S2543 97302 10 1 46862a2f96f9fb2d59e8c9a11ecfdd28
57966 (#1.1) 7 2013 57965 1 1 409c37703766c4b24f8a86162fd9cf85
76391 Anniversary 7 1971 A5162 76385 4 9 5e12ce73fac1d1dcf94136b6e9acd8f8
11952 Angus Black/Lester Barrie/DC Curry 7 2009 A5214 11937 4 7 9c38b9e5601dc154444b73b518034aa1
1554 New Orleans 7 2003 N6452 1508 2 11 621bea735740a547e862e4a3226f35d2
58442 Kiss Me Kate 7 2011 K2523 58436 1 10 293e8c75c7f35a4035abf617962be5a9

The movie_info_idx table does not contain much information about each particular film. It does, however, have an indicator for the movie ID (given by movie_id) as well as the number of votes (given by info where type_id = 100). See Table 4.5.

SELECT * FROM movie_info_idx LIMIT 0, 6;
Table 4.5: SELECT to glance at the movie_info_idx table in the imdb database.
id movie_id info_type_id info note
1 1 99 31.2.1..2.
2 1 100 9
3 1 101 4.1
4 2 99 1000000102
5 2 100 61
6 2 101 6.4

Let’s say we want to combine the titles with the number of votes so that each title with user votes is included. That is, only keep the titles that have a corresponding votes. And also, only keep the votes if there is an associated title (which means we use INNER JOIN or just plain JOIN).

Remember that WHERE will work on the raw variables, and HAVING works on the results set.

Some aspects of the query are worth pointing out:
* The variables in the output are given in the SELECT clause. The id and title (both from the title table) and the info from the movie_info_idx which represents the number of IMDb votes. * The variables are preceded by the table from which they came. While not always necessary, it is good practice so as to avoid confusion. * The JOIN happens by linking the id variable in the title table with the movie_id variable in the movie_info_idx table. * The LIMIT wasn’t necessary (there are only 12 observations), but it’s good practice so that we don’t end up with unwieldy query results. * The WHERE clause happens before the JOIN action, despite being written after. * In the WHERE clause, we keep only movies, only 2015 production year, and only at least 150,000 votes.

SELECT title.id,
       title.title,
       movie_info_idx.info
FROM title
JOIN movie_info_idx ON title.id = movie_info_idx.movie_id 
WHERE title.production_year = 2015 
    AND title.kind_id = 1                  # movies only
    AND movie_info_idx.info_type_id = 100  # info_type is votes
    AND movie_info_idx.info > 150000       # at least 150,000 votes
ORDER BY movie_info_idx.info DESC
LIMIT 0, 20;
Table 4.6: Movies from 2015 that have at least 150,000 votes in the imdb database.
id title info
4260166 Star Wars: Episode VII - The Force Awakens 691691
3915213 Mad Max: Fury Road 666484
4389619 The Martian 583987
3313672 Avengers: Age of Ultron 540606
4414139 The Revenant 526189
3787790 Jurassic World 471237
3752999 Inside Out 443051
3292159 Ant-Man 390965
4364483 The Hateful Eight 363199
4251736 Spectre 319875
3630368 Furious Seven 310970
4255450 Spotlight 290362
3961438 Mission: Impossible - Rogue Nation 266759
4321769 The Big Short 262598
4221220 Sicario 260996
3600120 Fifty Shades of Grey 250962
4164324 Room 244210
3379559 Bridge of Spies 229440
4368666 The Hunger Games: Mockingjay - Part 2 214569
4387967 The Man from U.N.C.L.E. 213754

Let’s say we also want to obtain information about the actors and actresses in each of the movies. In the cast_info table, there is a person_id, a movie_id, and person_role_id is 1 if actor and 2 if actress.

SELECT * FROM cast_info LIMIT 0, 10;
Table 4.7: SELECT to glance at the cast_info table in the imdb database.
id person_id movie_id person_role_id note nr_order role_id
1 1 3432997 1 31 1
2 2 1901690 2 1
3 3 4027567 2 25 1
4 3 4282876 3 22 1
5 4 3542672 12 1
6 5 3331520 4 (as $hutter Boy) 10 1
7 5 4027191 2 (as $hutter Boy) 1 1
8 5 4195731 5 (uncredited) 1
9 5 4263956 6 (uncredited) 1
10 5 4267787 7 (uncredited) 1

We also want the name of the actress which is in the table aka_name. Note that there is no movie information in the aka_name table!

SELECT * FROM aka_name LIMIT 0, 10;
Table 4.8: SELECT to glance at the aka_name table in the imdb database.
id person_id name imdb_index name_pcode_cf name_pcode_nf surname_pcode md5sum
1 6188450 Smith, Jessica Noel S5325 J2542 S53 25c9d464e3ff2957533546aa92b397ed
2 5125059 Pain, L. $ham P545 L515 P5 569b1e885ccb51211c01753f0dad9b2c
3 5 Boy, $hutter B36 H361 B 35092b5604ce378fc48c8a6fc0038a49
4 4152053 Dollasign, Ty D4253 T3425 D425 0f565a2d8027cfb8ed6c5f4bba719fcd
5 4152053 Sign, Ty Dolla S2534 T3425 S25 2eded1b021b96333b4b74e0fec959650
6 6 Moore, Brandon M6165 B6535 M6 193a6f5adf4756320f622162d2475608
7 8 $torm, Country T6525 C5363 T65 1654400b707d34323ea392b87060e6cc
8 19 'Hooper', Simon P.J. Kelly H1625 S5124 H16 3fd8885372c23f8c74e583da91d1fd05
9 19 Hooper H16 24ddc68ab605ee95857ad45b65ffa2d8
10 19 Kelly, Simon P.J. K4251 S5124 K4 33d976f22e276b73c61513bc5f6e72a6

Connecting the most popular movies of 2015 with the actresses in those movies requires a series of JOINs. Note that to make the code less onerous, the title table has been aliased by t, the movie_info_idx table has been aliased by idx, the cast_info table has been aliased by a, and the aka_name table has been aliased by n.

There is a lot of data cleaning to do as some of the person_id values are one to many!! That is, the person_id matches multiple names in the aka_name database.

SELECT t.title,
       idx.info,
       a.person_id,
       n.name
FROM title AS t
JOIN movie_info_idx AS idx ON t.id = idx.movie_id 
JOIN cast_info AS a ON idx.movie_id = a.movie_id
JOIN aka_name AS n ON a.person_id = n.person_id
WHERE t.production_year = 2015 
    AND t.kind_id = 1           # movies only
    AND idx.info_type_id = 100  # info_type is votes
    AND idx.info > 150000       # at least 150,000 votes
    AND a.role_id = 2           # actresses only
ORDER BY idx.info DESC
LIMIT 0, 50;
Table 4.9: Movies from 2015 that have at least 150,000 votes in the imdb database with the actress name joined.
title info person_id name
Star Wars: Episode VII - The Force Awakens 691691 2698188 Sam
Star Wars: Episode VII - The Force Awakens 691691 2806101 Gillespie, Hilary Catherine
Star Wars: Episode VII - The Force Awakens 691691 2959609 Cuzner, Natalie
Star Wars: Episode VII - The Force Awakens 691691 3089483 Fisher, Carrie Frances
Star Wars: Episode VII - The Force Awakens 691691 3150880 Class, Clare
Star Wars: Episode VII - The Force Awakens 691691 3150880 Glass, Claire
Star Wars: Episode VII - The Force Awakens 691691 3231758 Henwick, Jessica Yu Li
Star Wars: Episode VII - The Force Awakens 691691 3265686 Hui, Karen
Star Wars: Episode VII - The Force Awakens 691691 3305561 Kamen, Hannah John
Star Wars: Episode VII - The Force Awakens 691691 3462940 Billy
Star Wars: Episode VII - The Force Awakens 691691 3462940 Lourd, Billie Catherine
Star Wars: Episode VII - The Force Awakens 691691 3569409 Fran
Star Wars: Episode VII - The Force Awakens 691691 3649948 Nyongo, Lupita
Star Wars: Episode VII - The Force Awakens 691691 3649948 Nyong'o, Lupita Amondi
Star Wars: Episode VII - The Force Awakens 691691 3785240 Ridley, Daisy Jazz Isobel
Star Wars: Episode VII - The Force Awakens 691691 3835377 Giagrande, Meredith J.
Star Wars: Episode VII - The Force Awakens 691691 3835377 Salinger, Meredith
Star Wars: Episode VII - The Force Awakens 691691 3835377 Salenger, Meredith Dawn
Star Wars: Episode VII - The Force Awakens 691691 3850834 Phi
Star Wars: Episode VII - The Force Awakens 691691 3875581 Fox, Claudia
Star Wars: Episode VII - The Force Awakens 691691 3879039 Arti
Star Wars: Episode VII - The Force Awakens 691691 3907812 Slade, Sandy
Star Wars: Episode VII - The Force Awakens 691691 3907812 Sandy
Star Wars: Episode VII - The Force Awakens 691691 3907812 Sandy Slade
Star Wars: Episode VII - The Force Awakens 691691 3938795 Ryan, Karol Lesley
Star Wars: Episode VII - The Force Awakens 691691 3970637 Stevens, Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Taber, Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Tabor, Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Stevens, Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Taber, Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Tabor, Cat
Star Wars: Episode VII - The Force Awakens 691691 3970637 Cat
Star Wars: Episode VII - The Force Awakens 691691 4073883 Walter, Dame Harriet
Star Wars: Episode VII - The Force Awakens 691691 4073883 Walter, Harriet Mary
Star Wars: Episode VII - The Force Awakens 691691 4094732 White, Kelsey Marie
Mad Max: Fury Road 666484 2681098 Michelle, Debra
Mad Max: Fury Road 666484 2782138 Ali
Mad Max: Fury Road 666484 2873752 Cardona, Helena
Mad Max: Fury Road 666484 2873752 Cardona, Helene
Mad Max: Fury Road 666484 2873752 Cardona, Hélène Vania
Mad Max: Fury Road 666484 2957052 Cunico, Lillie
Mad Max: Fury Road 666484 3087531 Finlay, Sandi 'Hotrod'
Mad Max: Fury Road 666484 3087531 Finlay, Sandi 'Hotrod'
Mad Max: Fury Road 666484 3146859 Gilles, Coco Jack
Mad Max: Fury Road 666484 3146859 Gillies, Coco
Mad Max: Fury Road 666484 3268456 Rose
Mad Max: Fury Road 666484 3268456 Huntington-Whiteley, Rosie Alice
Mad Max: Fury Road 666484 3343489 Kellerman, Antoinette
Mad Max: Fury Road 666484 3348513 Riley

4.2.3 Other JOINs

Consider the following two tables. The first has seven movies in it (from 2015 with at least 400,000 IMDb votes). The second consists of almost 3 million actresses (person_role_id = 2). In order to find a subset of actresses, the person_id > 3900000 was set arbitrarily (in order to have a smaller group with which to work).

movies:

SELECT t.id,
       t.title,
       idx.info,
       (SELECT COUNT(*)
       FROM title AS t
       JOIN movie_info_idx AS idx ON idx.movie_id = t.id
       WHERE t.production_year = 2015  
             AND t.kind_id = 1
             AND idx.info_type_id = 100
             AND idx.info > 400000) AS row_count
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1             # movies only
    AND idx.info_type_id = 100    # info_type is votes
    AND idx.info > 400000         # at least 400,000 votes
ORDER BY idx.info DESC
Table 4.10: Movies from 2015 that have at least 400,000 votes in the imdb database.
id title info row_count
4260166 Star Wars: Episode VII - The Force Awakens 691691 7
3915213 Mad Max: Fury Road 666484 7
4389619 The Martian 583987 7
3313672 Avengers: Age of Ultron 540606 7
4414139 The Revenant 526189 7
3787790 Jurassic World 471237 7
3752999 Inside Out 443051 7

actresses:

SELECT a.person_id,
       a.movie_id,
       n.name,
       (SELECT COUNT(*)
       FROM cast_info AS a
       JOIN aka_name AS n ON a.person_id = n.person_id
       WHERE a.person_role_id = 2  
             AND a.person_id > 390000) AS row_count
FROM cast_info AS a
JOIN aka_name AS n ON a.person_id = n.person_id
       WHERE a.person_role_id = 2  
             AND a.person_id > 3900000
LIMIT 0, 20;
Table 4.11: Actresses whose person_id is greater than 400000. Note that some actresses have different spelling or phrasing of their names.
person_id movie_id name row_count
3900141 759802 Simons, Rita Joanne 2904759
3902258 4365829 Singer, Rabbi Tovia 2904759
3902699 3109788 Singh, Sabine Erika 2904759
3903035 3215866 Val 2904759
3904831 2468067 Masha 2904759
3904928 3654347 Fei, Siu Yin 2904759
3904928 3654347 Hsiao, Yen-fei 2904759
3904928 3654347 Siu, Yinfei 2904759
3904928 3654347 Xiao, Yanfei 2904759
3904928 3654347 Yin-Fai, Siu 2904759
3905289 115191 Coso, Cosondra 2904759
3905289 115191 Sjostrom, Cossondra 2904759
3905289 115191 Coso 2904759
3909355 2939100 Slovácková, Anna Julie 2904759
3911826 4379610 Meador, Constance June 2904759
3912134 2675144 DJ 2904759
3912134 2675144 Smith, DJ 2904759
3912134 2675144 Smith, Dujonette 2904759
3912134 2675144 DJ Smith 2904759
3913519 1678444 Keely, Dorothy Jacqueline 2904759

Using subqueries, we can JOIN the two datasets using different JOIN techniques.

Inner JOIN

With an inner JOIN, there are 32 rows corresponding to all the actresses in the seven 2015 films with the most votes. Because the JOIN is an intersection of the two tables, only the actresses with person_id above 3900000 are included.


SELECT * FROM
(SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes     
INNER JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;

?(caption)

RIGHT JOIN

With a RIGHT JOIN, there are more than 300 rows (the LIMIT clause keeps us from knowing how many rows, but there are a LOT!) corresponding to all the actresses whose person_id above 3900000 are included. Those actresses who acted in one of the seven top 2015 films are also included in the full results table, but they don’t happen to be in the truncated output here.


SELECT * FROM
(SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes     
RIGHT JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;

?(caption)

LEFT JOIN

With a LEFT JOIN, there are 33 rows corresponding to the actresses in the seven top 2015 movies. Only The Revenant did not have any actresses whose person_id is greater than 3900000.


SELECT * FROM
(SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes     
LEFT JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
LIMIT 0, 300;

?(caption)

Counting repeat actresses

We might, for example, want to know how many names / spellings of a name with a specific person_id (above 3900000) exist for each person_id in each of the top voted seven films of 2015.

In Table 4.12 why isn’t there a column indicating the name of the actress? (There can’t be such a column. Why not?)

SELECT acts.person_id, 
       COUNT(*) AS num_repeat_names
FROM (SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes
JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
GROUP BY acts.person_id;
Table 4.12: For each person_id (> 3900000) in the seven top voted 2015 films, how many names / spellings are associated with the person_id?
person_id num_repeat_names
3916648 1
4122876 1
3938423 2
3950111 1
4079047 2
4084626 3
4099458 1
3958614 1
3990819 2
4081131 2
3907812 3
3938795 1
3970637 8
4073883 2
4094732 1
4098918 1

Counting number of actresses per film

We might, for example, want to know how many actresses with a specific person_id (above 3900000) are in each of the top voted seven films of 2015.

SELECT movs.id, 
       movs.title,
       COUNT(*) AS num_actress
FROM (SELECT t.id,
       t.title
FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               # movies only
    AND idx.info_type_id = 100      # info_type is votes
    AND idx.info > 400000) AS movs  # at least 400,000 votes
JOIN (SELECT a.person_id,
       a.movie_id,
       n.name
    FROM cast_info AS a
    JOIN aka_name AS n ON a.person_id = n.person_id
    WHERE a.role_id = 2             # acresses only
        AND a.person_id > 3900000) AS acts ON acts.movie_id = movs.id
GROUP BY movs.id;
Table 4.13: Number of actresses (with person_id > 3900000) in each of the seven top voted films of 2015. Recall that The Revenant had no actresses with person_id > 3900000, so there are only six movies listed.
id title num_actress
3313672 Avengers: Age of Ultron 1
3752999 Inside Out 1
3787790 Jurassic World 9
3915213 Mad Max: Fury Road 5
4260166 Star Wars: Episode VII - The Force Awakens 15
4389619 The Martian 1

4.3 UNIONing

In SQL a UNION clause combines two different tables by their rows (whereas JOIN combines two tables by columns). Think about UNION similarly to the bind_rows() command in R.

Mini data tables describing how UNION binds rows and JOIN appends columns.

Figure 4.3: UNION binds rows while JOIN appends columns, image credit: Jane Williams https://blog.devart.com/mysql-union-tutorial-html.html

4.3.1 UNIONs

UNION does not check the names of the columns to make sure they match. UNION requires that the number of columns be the same and that the variable type be the same for all columns in the two tables being combined.

Table 4.14 contains a silly example. The first table has 1 as bar and the second table has 20 as bar. But when the tables are UNIONed, the bar column contains c(1, 10). SQL took the column names from the first table and appended the second table without considering the variable names.

SELECT 
    1 AS bar,
    2 AS foo

UNION

SELECT 
    10 AS foo,
    20 AS bar;
Table 4.14: The variable names are chosen from the first table. The names and order of the variables in the second table are ignored when using UNION.
bar foo
1 2
10 20

UNION is specifically designed to bind rows from two different SELECT queries where the variables have been selected in the same order. If the two SELECT clauses are done from the same table with the same order of variables, you do not need to worry about the order of the variables matching up in the UNION. If you are UNIONing two very different subqueries, you do need to worry about the variables and their order.

UNION

Let’s say we want to combine the top voted movies from 2015 with the top voted movies from 2019. However, to account for time, we require the movies from 2015 to have more votes (400,000) than the movies from 2017 (200,000). That is, the WHERE clause is different for the two subqueries.

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 400000)

UNION

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2017  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 200000)
LIMIT 0, 100;
   
Table 4.15: The variable names are chosen from the first table. The names and order of the variables in the second table are ignored when using UNION.
title production_year num_votes
Avengers: Age of Ultron 2015 540606
Inside Out 2015 443051
Jurassic World 2015 471237
Mad Max: Fury Road 2015 666484
Star Wars: Episode VII - The Force Awakens 2015 691691
The Martian 2015 583987
The Revenant 2015 526189
Dunkirk 2017 229089
Guardians of the Galaxy Vol. 2 2017 281845
Logan 2017 397056
Spider-Man: Homecoming 2017 209930
Wonder Woman 2017 306611

UNION ALL

UNION does check, however, to see if any of the rows in the two tables are identical. If the goal is to include duplicates across two tables, use UNION ALL instead of UNION.

Let’s say that the first table is all movies with production year after 2012 and number of votes greater than 500,000. The second table is movies with production year equal to 2015 and number of votes greater than 400,000. Even though the Martian would have been in both tables, the results table lists The Marian only once in Table 4.16.

(SELECT t.title,
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year > 2012  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 500000)

UNION

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 400000)
ORDER BY production_year DESC, num_votes;
Table 4.16: Using UNION to combine movies from table 1: later than 2012 and at least 500,000 votes with movies from table 2: 2015 and at least 400,000 votes.
title production_year num_votes
Batman v Superman: Dawn of Justice 2016 500037
Deadpool 2016 673887
Inside Out 2015 443051
Jurassic World 2015 471237
The Revenant 2015 526189
Avengers: Age of Ultron 2015 540606
The Martian 2015 583987
Mad Max: Fury Road 2015 666484
Star Wars: Episode VII - The Force Awakens 2015 691691
Interstellar 2014 1102826
Whiplash 2014 507827
The Imitation Game 2014 550521
The Grand Budapest Hotel 2014 553558
Captain America: The Winter Soldier 2014 562419
X-Men: Days of Future Past 2014 567780
Gone Girl 2014 664035
Guardians of the Galaxy 2014 795151
12 Years a Slave 2013 506640
Now You See Me 2013 507519
World War Z 2013 509285
The Hobbit: The Desolation of Smaug 2013 526001
The Hunger Games: Catching Fire 2013 537678
Man of Steel 2013 592427
Iron Man Three 2013 607323
Gravity 2013 640900
The Wolf of Wall Street 2013 900450

When UNION ALL is applied in the same context, The Martian is listed twice in the results table given in Table 4.17.

(SELECT t.title,
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year > 2012  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 500000)

UNION ALL

(SELECT t.title, 
        t.production_year, 
        idx.info AS num_votes
    FROM title AS t
JOIN movie_info_idx AS idx ON idx.movie_id = t.id
WHERE t.production_year = 2015  
    AND t.kind_id = 1               
    AND idx.info_type_id = 100      
    AND idx.info > 400000)
ORDER BY production_year DESC, num_votes;
Table 4.17: Using UNION ALL to combine movies from table 1: later than 2012 and at least 500,000 votes with movies from table 2: 2015 and at least 400,000 votes.
title production_year num_votes
Batman v Superman: Dawn of Justice 2016 500037
Deadpool 2016 673887
Inside Out 2015 443051
Jurassic World 2015 471237
The Revenant 2015 526189
The Revenant 2015 526189
Avengers: Age of Ultron 2015 540606
Avengers: Age of Ultron 2015 540606
The Martian 2015 583987
The Martian 2015 583987
Mad Max: Fury Road 2015 666484
Mad Max: Fury Road 2015 666484
Star Wars: Episode VII - The Force Awakens 2015 691691
Star Wars: Episode VII - The Force Awakens 2015 691691
Interstellar 2014 1102826
Whiplash 2014 507827
The Imitation Game 2014 550521
The Grand Budapest Hotel 2014 553558
Captain America: The Winter Soldier 2014 562419
X-Men: Days of Future Past 2014 567780
Gone Girl 2014 664035
Guardians of the Galaxy 2014 795151
12 Years a Slave 2013 506640
Now You See Me 2013 507519
World War Z 2013 509285
The Hobbit: The Desolation of Smaug 2013 526001
The Hunger Games: Catching Fire 2013 537678
Man of Steel 2013 592427
Iron Man Three 2013 607323
Gravity 2013 640900
The Wolf of Wall Street 2013 900450

4.3.1.1 FULL OUTER JOIN via UNION

MySQL doesn’t have a FULL OUTER JOIN (although other implementations of SQL do have full join functionality). However, we can mimic a full join using right and left joins with UNION.

Recall the ideas of RIGHT JOIN (which keeps all observations in the right table) and LEFT JOIN (which keeps all observations in the left table). By UNIONing the right and left joins, all of the observations are obtained (i.e., a full join). Using the function sqldf() in the sqldf R package, the full join will be demonstrated using the 1960s rock bands.

Notice that in the RIGHT JOIN the name column must come from the right table (not the left table).

Also notice that UNION ALL keeps the duplicate rows which is probably not what we want.

sqldf::sqldf("SELECT star.name, star.band, inst.plays 
      FROM band_members AS star
      LEFT JOIN band_instruments AS inst ON star.name = inst.name")
  name    band  plays
1 Mick  Stones   <NA>
2 John Beatles guitar
3 Paul Beatles   bass
sqldf::sqldf("SELECT inst.name, star.band, inst.plays 
      FROM band_members AS star
      RIGHT JOIN band_instruments AS inst ON star.name = inst.name")
   name    band  plays
1  John Beatles guitar
2  Paul Beatles   bass
3 Keith    <NA> guitar
sqldf::sqldf("SELECT star.name, star.band, inst.plays 
      FROM band_members AS star
      LEFT JOIN band_instruments AS inst ON star.name = inst.name
UNION
      SELECT inst.name, star.band, inst.plays 
      FROM band_members AS star
      RIGHT JOIN band_instruments AS inst ON star.name = inst.name ")
   name    band  plays
1  John Beatles guitar
2 Keith    <NA> guitar
3  Mick  Stones   <NA>
4  Paul Beatles   bass
sqldf::sqldf("SELECT star.name, star.band, inst.plays 
      FROM band_members AS star
      LEFT JOIN band_instruments AS inst ON star.name = inst.name
UNION ALL
      SELECT inst.name, star.band, inst.plays 
      FROM band_members AS star
      RIGHT JOIN band_instruments AS inst ON star.name = inst.name ")
   name    band  plays
1  Mick  Stones   <NA>
2  John Beatles guitar
3  Paul Beatles   bass
4  John Beatles guitar
5  Paul Beatles   bass
6 Keith    <NA> guitar

4.4 Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_imdb, shutdown = TRUE)

4.5 Reflection questions

  1. What are the different types of joins? Which data from which table gets kept and which gets removed for each type of join?

  2. What is the difference between a join and a union?

  3. When working with multiple tables, how (and why) is a variable linked to its table?

  4. Consider a RIGHT JOIN. If there are records in the right table that are not in the left table, what will the value of the left table variable be for those records?

4.6 Ethics considerations

  1. What can happen if a UNION is done without carefully matching up the columns of the two tables being UNIONed?

  2. How will you know if JOINing removed some records? What if the JOIN produced missing values for some of the variables? How should we deal with missing data or arbitrarily removed records?