5 SQL clauses
Most of this chapter will focus on the SQL commands needed to run queries of the database. Much of the syntax is strikingly familiar to tidy verbs in R. However, this chapter starts with a few SQL specific tools used to learn about the tables in any particular database.
5.1 Looking at the tables in the database
Consider a database of information from Stanford Open Policing Project on over 200 million traffic stops in the US.
con_traffic <- DBI::dbConnect(
RMariaDB::MariaDB(),
dbname = "traffic",
host = Sys.getenv("TRAFFIC_HOST"),
user = Sys.getenv("TRAFFIC_USER"),
password = Sys.getenv("TRAFFIC_PWD")
)
TABLES; SHOW
There are 88 tables in the traffic
database.
DESCRIBE ca_los_angeles_2020_04_01;
The DESCRIBE
command shows the 10 field names (variables) in the ca_los_angeles_2020_04_01
table. Some of the variables are characters (text
) and some are date
or time
.
-
Field
is the name of the variable -
Type
is the type of the variable (numeric, character, etc.) -
Null
indicates whether or notNULL
values are acceptable.NULL
values are never allowed for a Primary Key. -
Key
designates whether the variable is a Primary Key.PRI
is a primary key, all unique, noNULL
values.UNI
is a unique key, butNULL
values are allowed.MUL
is a non-unique (“Multiple”) key, allowing for duplicates, can speed up queries. -
Default
provides the default value of the variable if no value is provided when a row is added to the database.
Most engagements with SQL are done through queries. Queries in SQL start with the SELECT
keyword and consist of several clauses, which must be written in the following order:1
-
SELECT
allows you to list the columns, or functions operating on columns, that you want to retrieve. This is an analogous operation to theselect()
verb in dplyr, potentially combined withmutate()
orsummarize()
. -
FROM
specifies the table where the data are. -
JOIN
allows you to stitch together two or more tables using a key. This is analogous to theinner_join()
andleft_join()
commands in dplyr. More details ofJOIN
are given in Chapter 6. -
WHERE
allows you to filter the records according to some criteria and is an analogous operation to thefilter()
verb in dplyr. Note, even though theWHERE
clause is written afterSELECT
andJOIN
, it is actually evaluated before theSELECT
orJOIN
clauses (which is whyWHERE
only works on the original data, not the results set). -
GROUP BY
allows you to aggregate the records according to some shared value and is an analogous operation to thegroup_by()
verb in dplyr. -
HAVING
is like aWHERE
clause that operates on the result set—not the records themselves and is analogous to applying a secondfilter()
command in dplyr, after the rows have already been aggregated. -
ORDER BY
is exactly what it sounds like—it specifies a condition for ordering the rows of the result set and is analogous to thearrange()
verb in dplyr. -
LIMIT
restricts the number of rows in the output and is similar to the R commandshead()
andslice()
.
5.2 SELECT … FROM
A SQL query starts with a SELECT
command and has a corresponding FROM
to indicate the table being queried. Columns may be specified, or the *
will indicate that every column in the table should be returned.
The shortest SQL query is the following SELECT
command. Do not run this command!!! The yellow_old
table has 15 million rows, and we do not want to look at them simultaneously.
: SELECT * FROM ca_los_angeles_2020_04_01; DO NOT RUN
Instead, to look at the top of the table, SELECT
the first few rows. The LIMIT
command specifies which rows to select: the first number is the number of rows to skip (0 rows skipped), the second number is the number of rows to print up to (up to row 14).
SELECT * FROM ca_los_angeles_2020_04_01 LIMIT 0, 14;
Speaking of which, how many rows are there in the ca_los_angeles_2020_04_01
table? That is, how many taxi rides are recorded? Now SELECT
is used with a summary function, COUNT()
. Instead of using a separate summary function (like mutate()
or summarize()
), all the work is done inside the SELECT
call.
SELECT COUNT(*) FROM ca_los_angeles_2020_04_01;
Yikes, more than 5.4 million traffic stops!
How old was the car being driven?
Unfortunately, date
is saved as a date variable, and vehicle_year
is saved as a double. Fortunately for us, SQL has functionality to convert a text
Type into DateTime
type (POSIXct is a special type of DateTime formatting).
SELECT
date, vehicle_year,
CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
STR_TO_DATE(FROM la_new_orleans_2020_04_01
LIMIT 0, 10;
Now that the variables are no longer strings, we can subtract them to figure out the number of minutes for each taxi ride. Unfortunately, the following code won’t run because the variables vehicle_date
is not in the table la_new_orleans_2020_04_01
.
SELECT
date, vehicle_year,
CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
STR_TO_DATE(YEAR, vehicle_date, date) AS car_age
TIMESTAMPDIFF(FROM la_new_orleans_2020_04_01
LIMIT 0, 10;
Instead, we need two layers of SELECT
commands so that the first SELECT
(i.e., inside) layer creates the new variables, and the second SELECT
(i.e., outside) layer subtracts the two times.
SELECT
date,
vehicle_date, YEAR, vehicle_date, date) AS car_age
TIMESTAMPDIFF(FROM (
SELECT
date,
CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
STR_TO_DATE(FROM la_new_orleans_2020_04_01)
AS subquery_table
LIMIT 0, 10;
Alternatively, the STR_TO_DATE()
function can be applied inside the TIMESTAMPDIFF()
function so that the full (now only) SELECT
command is being used only on variables that are in the original table.
SELECT
date,
vehicle_year,YEAR,
TIMESTAMPDIFF(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d'),
STR_TO_DATE(date) AS car_age
FROM la_new_orleans_2020_04_01
LIMIT 0, 10;
Keep in mind that there is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set. The variables date
and vehicle_year
are columns in the original table - they are written to disk on the SQL server. The variable car_age
exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) and is not written to disk.
5.3 SELECT DISTINCT
SELECT DISTINCT
returns only unique rows. That is, it filters out all the duplicates of a variable or a combination of variables.
SELECT DISTINCT raw_actions_taken
FROM la_new_orleans_2020_04_01
LIMIT 0, 50;
SELECT DISTINCT vehicle_make, vehicle_color
FROM la_new_orleans_2020_04_01
LIMIT 0, 50;
5.4 WHERE
The WHERE
clause is analogous to the filter()
function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter()
function. WHERE
operates on the original data in the table and HAVING
operates on the result set. See below for examples using HAVING
.
What was the time of day for those traffic stops search_conducted
was true and vehicle_make
was Subaru?
Note that in SQL the equality logical is =
and in R the equality logical is ==
. Note also that the WHERE command is case-insensitive!
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE search_conducted = 1
AND vehicle_make = "Subaru"
LIMIT 0, 10;
BETWEEN
can be used to specify a range of values for a numeric value. BETWEEN
is inclusive.
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
AND vehicle_make = "Subaru"
LIMIT 0, 10;
IN
is similar to the dplyr %in%
function which specifies distinct values for the variable.
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make IN ("Subaru", "Dodge")
AND search_conducted = 1
LIMIT 0, 10;
The WHERE
clause can be established by a number of logical commands combined using either AND
or OR
. Usually it is important to use parentheses with OR
logicals to make sure the desired query is return. Consider the difference between the following queries. In SQL (as in many programming languages), AND
takes precedent over OR
in the order of operations, when there are no parentheses. (I was taught to remember order of operations using “please excuse my dear aunt Sally.”) The order of operations on the first query groups the second two conditions into one because AND
take precedence over OR
(as if the query was vehicle_year BETWEEN 1970 and 1999 OR (vehicle_make IN ("Subaru", "Dodge") AND search_conducted = 1)
).
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999 OR
IN ("Subaru", "Dodge") AND
vehicle_make = 1
search_conducted LIMIT 0, 10;
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE (vehicle_year BETWEEN 1970 and 1999 OR
IN ("Subaru", "Dodge") ) AND
vehicle_make = 1
search_conducted LIMIT 0, 10;
5.4.1 NULL in WHERE
SQL considers NULL
values to be unknown. Therefore, when searching for a NULL
value, you need to ask SQL if the value IS NULL
. Asking if the value is equal to NULL
doesn’t work because NULL
values don’t equal anything (they are unknown). To keep all values that are not NULL
values, use IS NOT NULL
in the WHERE
clause.
5.4.1.1 A NULL
example2
The logic of NULL
:
- If you do anything with
NULL
, you’ll just getNULL
. For instance if \(x\) isNULL
, then \(x > 3\), \(1 = x\), and \(x + 4\) all evaluate toNULL
. Even \(x =\)NULL
evaluates toNULL
! if you want to check whether \(x\) isNULL
, usex IS NULL
orx IS NOT NULL
. -
NULL
short-circuits with boolean operators. That means a boolean expression involvingNULL
will evaluate to:- TRUE, if it’d evaluate to TRUE regardless of whether the
NULL
value is really TRUE or FALSE. - FALSE, if it’d evaluate to FALSE regardless of whether the
NULL
value is really TRUE or FALSE. - Or
NULL
, if it depends on theNULL
value.
- TRUE, if it’d evaluate to TRUE regardless of whether the
Consider the following table and SQL query:
SELECT * FROM (
SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
UNION
SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs
UNION
SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
UNION
SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
AS temptable; )
SELECT * FROM (
SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
UNION
SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs
UNION
SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
UNION
SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
AS temptable
) WHERE age <= 20 OR num_dogs = 3;
Where does the WHERE
clause do? It tells us that we only want to keep the rows satisfying the age <= 20
OR num_dogs = 3
. Let’s consider each row one at a time:
- For Ace,
age <= 20
evaluates to TRUE so the claim is satisfied. - For Ada,
age <= 20
evaluates toNULL
butnum_dogs = 3
evaluates to TRUE so the claim is satisfied. - For Ben,
age <= 20
evaluates toNULL
andnum_dogs = 3
evaluates toNULL
so the overall expression isNULL
which has a FALSE value. - For Cho,
age <= 20
evaluates to FALSE andnum_dogs = 3
evaluates toNULL
so the overall expression evaluates toNULL
(because it depends on the value of theNULL
).
Thus we keep only Ace and Ada.
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make IS NULL
LIMIT 0, 10;
SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make = NULL
LIMIT 0, 10;
5.5 GROUP BY
The GROUP BY
clause will direct SQL to carry out the query separately for each category in the grouped variable. Using GROUP BY
is particularly important when aggregating multiple rows into a single number. Some aggregate functions include COUNT()
, SUM()
, MAX()
, MIN()
, and AVG()
.
Note that SUM(1)
adds (sums) the number 1
for each row. Which is the same as counting the number of rows. SUM(2)
adds (sums) the number 2
for each row which returns twice as many transactions.
SELECT subject_race,
COUNT(*) AS num_stops,
SUM(1) AS num_stops_also,
SUM(2) AS double_stops
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
GROUP BY subject_race;
For those people whose cars are between 1970 and 1999, how old is the youngest and oldest person for each subject_race
?
SELECT subject_race,
COUNT(*) AS num_stops,
MIN(subject_age) AS min_age,
MAX(subject_age) AS max_age
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
GROUP BY subject_race;
GROUP BY
will work applied to multiple columns. Let’s tabulate the same results, now broken down by subject_race
and day of week. Except that we don’t have a day of week variable! We need to pull out the day of the week from the date
variable, using DAYNAME
. (Note: DAYOFWEEK
will give you the day of the week as an integer. Use your internet sleuthing skills if you are looking for functions that might help your desired query.)
SELECT subject_race,
COUNT(*) AS num_stops,
MIN(subject_age) AS min_age,
MAX(subject_age) AS max_age,
date) AS wday
DAYNAME(FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday;
5.6 ORDER BY
The ORDER BY
command can be used with or without the GROUP BY
and aggregation commands. It allows us to look at interesting aspects of the data by sorting the data.
SELECT subject_race,
COUNT(*) AS num_stops,
MIN(subject_age) AS min_age,
MAX(subject_age) AS max_age,
date) AS wday
DAYNAME(FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY min_age ASC;
WHAT?!?!! How are 10 year olds getting pulled over for traffic stops? If the observation is found to be a typo, it would need to be removed from the data set. If the observation is somehow legitimate, it would need to be included in the analysis, with the information provided about its legitimacy.
SELECT subject_race,
COUNT(*) AS num_stops,
MIN(subject_age) AS min_age,
MAX(subject_age) AS max_age,
date) AS wday
DAYNAME(FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY max_age DESC;
So many people over 100 years old… the data seem fishy!
SELECT subject_race,
COUNT(*) AS num_stops,
MIN(subject_age) AS min_age,
MAX(subject_age) AS max_age,
date) AS wday
DAYNAME(FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY wday, subject_race;
As above, we were able to GROUP BY
and ORDER BY
on the new variables we had created, wday
.
5.7 HAVING
Recall that WHERE
acts only on the original data. If we are interested in traffic stops that took place on Friday
, we need to use the derived variable wday
instead of the raw variable pickup_datetime
. Fortunately, HAVING
works on the results set. Note that SQL uses ' '
for strings, not " "
. In SQL, " "
is used to identify variables (not values of variables), like R’s ` `.
SELECT subject_race,
COUNT(*) AS num_stops,
MIN(subject_age) AS min_age,
MAX(subject_age) AS max_age,
date) AS wday
DAYNAME(FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
HAVING wday = 'Friday';
While it worked out quite well for us that HAVING
was able to filter the data based on the results set, the use of HAVING
was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE
, the query will be much faster and more efficient.
Note: HAVING
requires a GROUP BY
clause. And the variable(s) used in HAVING
must also be part of the GROUP BY
clause.
5.8 LIMIT
As we’ve seen, LIMIT
truncates the query to specified rows. The first number is the offset (i.e., the number of rows to skip), the second number is the (maximum) number of rows to return. Here, we return rows 154219 through 154228.
SELECT * FROM la_new_orleans_2020_04_01 LIMIT 154218, 10;
5.9 Saving SQL queries as R objects
If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations. In that case, use #| output.var: "name_of_variable"
inside the {sql}
chunk. The variable called name_of_variable
will then be available to be used in the R environment.
```{sql}
#| connection: con_traffic
#| label: new-table
#| output.var: "new_table"
SELECT *, DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
LIMIT 0, 1000;
```
raw_row_number | date | time | location | lat | lng | district | zone | subject_age | subject_race | subject_sex | officer_assignment | type | arrest_made | citation_issued | warning_issued | outcome | contraband_found | contraband_drugs | contraband_weapons | frisk_performed | search_conducted | search_person | search_vehicle | search_basis | reason_for_stop | vehicle_color | vehicle_make | vehicle_model | vehicle_year | raw_actions_taken | raw_subject_race | wday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2010-01-01 | 01:11:00 | 6 | E | 26 | black | female | 6th District | vehicular | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TRAFFIC VIOLATION | BLACK | DODGE | CARAVAN | 2005 | BLACK | Friday | |||||||||
9087 | 2010-01-01 | 01:29:00 | 7 | C | 37 | black | male | 7th District | vehicular | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TRAFFIC VIOLATION | BLUE | NISSAN | MURANO | 2005 | BLACK | Friday | |||||||||
9086 | 2010-01-01 | 01:29:00 | 7 | C | 37 | black | male | 7th District | vehicular | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TRAFFIC VIOLATION | BLUE | NISSAN | MURANO | 2005 | BLACK | Friday | |||||||||
267 | 2010-01-01 | 14:00:00 | 7 | I | 96 | black | male | 7th District | vehicular | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TRAFFIC VIOLATION | GRAY | JEEP | GRAND CHEROKEE | 2003 | BLACK | Friday | |||||||||
2 | 2010-01-01 | 02:06:00 | 5 | D | 17 | black | male | 5th District | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CALL FOR SERVICE | BLACK | Friday | ||||||||||||||
7 | 2010-01-01 | 02:06:00 | 5 | D | 18 | black | male | 5th District | 0 | 0 | 0 | 0 | 0 | 0 | 0 | CALL FOR SERVICE | BLACK | Friday |
```{r}
new_table |>
drop_na(wday) |>
ggplot(aes(x = vehicle_year, y = subject_age, color = wday)) +
geom_point() +
xlim(1985, 2025)
```
5.10 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_traffic, shutdown = TRUE)
5.11 Reflection questions
Why don’t we usually want to run the query:
SELECT * FROM table;
?What is the difference between the original table and the results set?
In SQL does the
WHERE
clause use=
or==
to indicate equality?Does
BETWEEN
work only on numeric variables or also on character strings?What syntax is used to direct
ORDER BY
to sort by biggest to smallest or smallest to biggest?What is the difference between
WHERE
andHAVING
?
5.12 Ethics considerations
What are different ways to look at the dataset to identify possible typos or rogue values?
Why are such tasks so much harder with large datasets (versus small datasets)?
Why are such tasks to much more important with large datasets (versus small datasets)?
Taken directly from Modern Data Science with R↩︎
taken from: https://cs186berkeley.net/notes/note1/#filtering-null-values↩︎