5 SQL statements

Most of this chapter will focus on the SQL commands needed to run queries of the database. Much of the syntax is strikingly familiar to tidy verbs in R. However, this chapter starts with a few SQL specific tools used to learn about the tables in any particular database.

5.1 Looking at the tables in the database

Consider a database of information from Stanford Open Policing Project on over 200 million traffic stops in the US.

con_traffic <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "traffic",
  host = Sys.getenv("TRAFFIC_HOST"),
  user = Sys.getenv("TRAFFIC_USER"),
  password = Sys.getenv("TRAFFIC_PWD")
)

SHOW TABLES;

Table 5.1: SHOW all the TABLES in the nyctaxi database.

There are 88 tables in the traffic database.

DESCRIBE ca_los_angeles_2020_04_01;

Table 5.2: DESCRIBE variables in the ca_los_angeles_2020_04_01 table.

The DESCRIBE command shows the 10 field names (variables) in the ca_los_angeles_2020_04_01 table. Some of the variables are characters (text) and some are date or time.

Field is the name of the variable
Type is the type of the variable (numeric, character, etc.)
Null indicates whether or not NULL values are acceptable. NULL values are never allowed for a Primary Key.
Key designates whether the variable is a Primary Key. PRI is a primary key, all unique, no NULL values. UNI is a unique key, but NULL values are allowed. MUL is a non-unique (“Multiple”) key, allowing for duplicates, can speed up queries.
Default provides the default value of the variable if no value is provided when a row is added to the database.

Watch out!

SQL statements must be written in the following order.

Most engagements with SQL are done through queries. Queries in SQL start with the SELECT keyword and consist of several clauses, which must be written in the following order:¹

SELECT allows you to list the columns, or functions operating on columns, that you want to retrieve. This is an analogous operation to the select() verb in dplyr, potentially combined with mutate() or summarize().
FROM specifies the table where the data are.
JOIN allows you to stitch together two or more tables using a key. This is analogous to the inner_join() and left_join() commands in dplyr. More details of JOIN are given in Chapter 6.
WHERE allows you to filter the records according to some criteria and is an analogous operation to the filter() verb in dplyr. Note, even though the WHERE clause is written after SELECT and JOIN, it is actually evaluated before the SELECT or JOIN clauses (which is why WHERE only works on the original data, not the results set).
GROUP BY allows you to aggregate the records according to some shared value and is an analogous operation to the group_by() verb in dplyr.
HAVING is like a WHERE clause that operates on the result set—not the records themselves and is analogous to applying a second filter() command in dplyr, after the rows have already been aggregated.
ORDER BY is exactly what it sounds like—it specifies a condition for ordering the rows of the result set and is analogous to the arrange() verb in dplyr.
LIMIT restricts the number of rows in the output and is similar to the R commands head() and slice().

5.2 SELECT … FROM

R function: select()

A SQL query starts with a SELECT command and has a corresponding FROM to indicate the table being queried. Columns may be specified, or the * will indicate that every column in the table should be returned.

The shortest SQL query is the following SELECT command. Do not run this command!!! The yellow_old table has 15 million rows, and we do not want to look at them simultaneously.

DO NOT RUN:  SELECT * FROM ca_los_angeles_2020_04_01;

Watch out!

Do not run the following command unless you are certain that the table from which you are querying is small enough so that the query results fit easily into your memory.

SELECT * FROM table;

Instead, to look at the top of the table, SELECT the first few rows. The LIMIT command specifies which rows to select: the first number is the number of rows to skip (0 rows skipped), the second number is the number of rows to print up to (up to row 14).

SELECT * FROM ca_los_angeles_2020_04_01 LIMIT 0, 14;

Table 5.3: SELECT the first 14 rows of the table.

Speaking of which, how many rows are there in the ca_los_angeles_2020_04_01 table? That is, how many taxi rides are recorded? Now SELECT is used with a summary function, COUNT(). Instead of using a separate summary function (like mutate() or summarize()), all the work is done inside the SELECT call.

SELECT COUNT(*) FROM ca_los_angeles_2020_04_01;

Table 5.4: COUNT(*) the number of rows in the entire ca_los_angeles_2020_04_01 table.

Yikes, more than 5.4 million traffic stops!

How old was the car being driven?

Unfortunately, date is saved as a date variable, and vehicle_year is saved as a double. Fortunately for us, SQL has functionality to convert a text Type into DateTime type (POSIXct is a special type of DateTime formatting).

SELECT
      date, vehicle_year,
      STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
   FROM la_new_orleans_2020_04_01
   LIMIT 0, 10;

Table 5.5: Convert the vehicle_year to date object using STR_TO_DATE.

Now that the variables are no longer strings, we can subtract them to figure out the number of minutes for each taxi ride. Unfortunately, the following code won’t run because the variables vehicle_date is not in the table la_new_orleans_2020_04_01.

SELECT
      date, vehicle_year,
      STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
      TIMESTAMPDIFF(YEAR, vehicle_date, date) AS car_age
   FROM la_new_orleans_2020_04_01
   LIMIT 0, 10;

Instead, we need two layers of SELECT commands so that the first SELECT (i.e., inside) layer creates the new variables, and the second SELECT (i.e., outside) layer subtracts the two times.

SELECT 
   date,
   vehicle_date, 
   TIMESTAMPDIFF(YEAR, vehicle_date, date) AS car_age 
FROM (
   SELECT
      date,
      STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d') AS vehicle_date
   FROM la_new_orleans_2020_04_01)
   AS subquery_table
LIMIT 0, 10;

Table 5.6: Use TIMESTAMPDIFF to find the age of the car.

Alternatively, the STR_TO_DATE() function can be applied inside the TIMESTAMPDIFF() function so that the full (now only) SELECT command is being used only on variables that are in the original table.

SELECT 
   date,
   vehicle_year,
   TIMESTAMPDIFF(YEAR, 
                 STR_TO_DATE(CONCAT(vehicle_year, '-01-01'), '%Y-%m-%d'),
                 date) AS car_age 
FROM la_new_orleans_2020_04_01
LIMIT 0, 10;

Table 5.7: Alternative method to find the age of the car.

Keep in mind that there is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set. The variables date and vehicle_year are columns in the original table - they are written to disk on the SQL server. The variable car_age exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) and is not written to disk.

5.3 SELECT DISTINCT

SELECT DISTINCT returns only unique rows. That is, it filters out all the duplicates of a variable or a combination of variables.

SELECT DISTINCT raw_actions_taken
FROM la_new_orleans_2020_04_01
LIMIT 0, 50;

Table 5.8: There are more distinct types that I was willing to input!

SELECT DISTINCT vehicle_make, vehicle_color
FROM la_new_orleans_2020_04_01
LIMIT 0, 50;

Table 5.9: The distinct values of vehicle_make and vehicle_color.

5.4 WHERE

R function: filter()

The WHERE clause is analogous to the filter() function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter() function. WHERE operates on the original data in the table and HAVING operates on the result set. See below for examples using HAVING.

What was the time of day for those traffic stops search_conducted was true and vehicle_make was Subaru?

Note that in SQL the equality logical is = and in R the equality logical is ==. Note also that the WHERE command is case-insensitive!

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE search_conducted = 1
   AND vehicle_make = "Subaru"
LIMIT 0, 10;

Table 5.10: WHERE to subset the queried rows.

BETWEEN can be used to specify a range of values for a numeric value. BETWEEN is inclusive.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
   AND vehicle_make = "Subaru"
LIMIT 0, 10;

Table 5.11: BETWEEN in the WHERE clause.

IN is similar to the dplyr %in% function which specifies distinct values for the variable.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make IN ("Subaru", "Dodge")
   AND search_conducted = 1
LIMIT 0, 10;

Table 5.12: IN in the WHERE clause.

The WHERE clause can be established by a number of logical commands combined using either AND or OR. Usually it is important to use parentheses with OR logicals to make sure the desired query is return. Consider the difference between the following queries. In SQL (as in many programming languages), AND takes precedent over OR in the order of operations, when there are no parentheses. (I was taught to remember order of operations using “please excuse my dear aunt Sally.”) The order of operations on the first query groups the second two conditions into one because AND take precedence over OR (as if the query was vehicle_year BETWEEN 1970 and 1999 OR (vehicle_make IN ("Subaru", "Dodge") AND search_conducted = 1)).

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999 OR
      vehicle_make IN ("Subaru", "Dodge") AND 
      search_conducted = 1
LIMIT 0, 10;

Table 5.13: OR and AND without parentheses.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE (vehicle_year BETWEEN 1970 and 1999 OR
      vehicle_make IN ("Subaru", "Dodge") ) AND 
      search_conducted = 1
LIMIT 0, 10;

Table 5.14: OR and AND with parentheses.

5.4.1 NULL in WHERE

SQL considers NULL values to be unknown. Therefore, when searching for a NULL value, you need to ask SQL if the value IS NULL. Asking if the value is equal to NULL doesn’t work because NULL values don’t equal anything (they are unknown). To keep all values that are not NULL values, use IS NOT NULL in the WHERE clause.

Watch out!

In order to find the records that are NULL use WHERE variable IS NULL.

5.4.1.1 A `NULL` example²

The logic of NULL:

If you do anything with NULL, you’ll just get NULL. For instance if \(x\) is NULL, then \(x > 3\), \(1 = x\), and \(x + 4\) all evaluate to NULL. Even \(x =\) NULL evaluates to NULL! if you want to check whether \(x\) is NULL, use x IS NULL or x IS NOT NULL.
NULL short-circuits with boolean operators. That means a boolean expression involving NULL will evaluate to:
- TRUE, if it’d evaluate to TRUE regardless of whether the NULL value is really TRUE or FALSE.
- FALSE, if it’d evaluate to FALSE regardless of whether the NULL value is really TRUE or FALSE.
- Or NULL, if it depends on the NULL value.

Consider the following table and SQL query:

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable;

SELECT * FROM (
   SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
   UNION
   SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs   
   UNION
   SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
   UNION
   SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
   ) AS temptable
WHERE age <= 20 OR num_dogs = 3;

Where does the WHERE clause do? It tells us that we only want to keep the rows satisfying the age <= 20 OR num_dogs = 3. Let’s consider each row one at a time:

For Ace, age <= 20 evaluates to TRUE so the claim is satisfied.
For Ada, age <= 20 evaluates to NULL but num_dogs = 3 evaluates to TRUE so the claim is satisfied.
For Ben, age <= 20 evaluates to NULL and num_dogs = 3 evaluates to NULL so the overall expression is NULL which has a FALSE value.
For Cho, age <= 20 evaluates to FALSE and num_dogs = 3 evaluates to NULL so the overall expression evaluates to NULL (because it depends on the value of the NULL).

Thus we keep only Ace and Ada.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make IS NULL
LIMIT 0, 10;

Table 5.15: The way to find NULL values is via IS NULL.

SELECT time, search_conducted, vehicle_year, vehicle_make
FROM la_new_orleans_2020_04_01
WHERE vehicle_make = NULL
LIMIT 0, 10;

Table 5.16: NO rows are selected when the WHERE command is specified to indicate if the variable equals NULL.

5.5 GROUP BY

R function: group_by()

The GROUP BY clause will direct SQL to carry out the query separately for each category in the grouped variable. Using GROUP BY is particularly important when aggregating multiple rows into a single number. Some aggregate functions include COUNT(), SUM(), MAX(), MIN(), and AVG().

Note that SUM(1) adds (sums) the number 1 for each row. Which is the same as counting the number of rows. SUM(2) adds (sums) the number 2 for each row which returns twice as many transactions.

SELECT subject_race,
       COUNT(*) AS num_stops, 
       SUM(1) AS num_stops_also,
       SUM(2) AS double_stops
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
GROUP BY subject_race;

Table 5.17: GROUP BY on subject_race.

For those people whose cars are between 1970 and 1999, how old is the youngest and oldest person for each subject_race?

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age
FROM la_new_orleans_2020_04_01
WHERE vehicle_year BETWEEN 1970 and 1999
GROUP BY subject_race;

Table 5.18: GROUP BY with aggregate functions.

GROUP BY will work applied to multiple columns. Let’s tabulate the same results, now broken down by subject_race and day of week. Except that we don’t have a day of week variable! We need to pull out the day of the week from the date variable, using DAYNAME. (Note: DAYOFWEEK will give you the day of the week as an integer. Use your internet sleuthing skills if you are looking for functions that might help your desired query.)


SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday;

Table 5.19: GROUP BY with subject_race and wday.

5.6 ORDER BY

R function: arrange()

The ORDER BY command can be used with or without the GROUP BY and aggregation commands. It allows us to look at interesting aspects of the data by sorting the data.

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY min_age ASC;

Table 5.20: ORDER BY min_age, ascending.

WHAT?!?!! How are 10 year olds getting pulled over for traffic stops? If the observation is found to be a typo, it would need to be removed from the data set. If the observation is somehow legitimate, it would need to be included in the analysis, with the information provided about its legitimacy.

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY max_age DESC;

Table 5.21: ORDER BY max_age, descending

So many people over 100 years old… the data seem fishy!

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
ORDER BY wday, subject_race;

Table 5.22: ORDER BY wday and subject_race

Note that both GROUP BY and ORDER BY evaluate the data after it has been retrieved. Therefore, the functions operate on the results set, not the original rows of the data.

As above, we were able to GROUP BY and ORDER BY on the new variables we had created, wday.

5.7 HAVING

R function: filter()

Recall that WHERE acts only on the original data. If we are interested in traffic stops that took place on Friday, we need to use the derived variable wday instead of the raw variable pickup_datetime. Fortunately, HAVING works on the results set. Note that SQL uses ' ' for strings, not " ". In SQL, " " is used to identify variables (not values of variables), like R’s ` `.

SELECT subject_race,
       COUNT(*) AS num_stops, 
       MIN(subject_age) AS min_age,
       MAX(subject_age) AS max_age,
       DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
GROUP BY subject_race, wday
HAVING wday = 'Friday';

Table 5.23: HAVING to filter only Friday stops

While it worked out quite well for us that HAVING was able to filter the data based on the results set, the use of HAVING was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE, the query will be much faster and more efficient.

Note: HAVING requires a GROUP BY clause. And the variable(s) used in HAVING must also be part of the GROUP BY clause.

Whenever possible, use WHERE instead of HAVING to make your queries as efficient as possible.

5.8 LIMIT

R function: head() or slice()

As we’ve seen, LIMIT truncates the query to specified rows. The first number is the offset (i.e., the number of rows to skip), the second number is the (maximum) number of rows to return. Here, we return rows 154219 through 154228.

SELECT * FROM la_new_orleans_2020_04_01 LIMIT 154218, 10;

Table 5.24: LIMIT on intermediate rows of the table.

5.9 Saving SQL queries as R objects

If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations. In that case, use #| output.var: "name_of_variable" inside the {sql} chunk. The variable called name_of_variable will then be available to be used in the R environment.

```{sql}
#| connection: con_traffic
#| label: new-table
#| output.var: "new_table"

SELECT *, DAYNAME(date) AS wday
FROM la_new_orleans_2020_04_01
LIMIT 0, 1000;

```

Table 5.25: New data.frame saved to R called new_table.

raw_row_number	date	time	district	zone	subject_age	subject_race	subject_sex	officer_assignment	type	reason_for_stop	vehicle_color	vehicle_make	vehicle_model	vehicle_year	raw_subject_race	wday
1	2010-01-01	01:11:00	6	E	26	black	female	6th District	vehicular	TRAFFIC VIOLATION	BLACK	DODGE	CARAVAN	2005	BLACK	Friday
9087	2010-01-01	01:29:00	7	C	37	black	male	7th District	vehicular	TRAFFIC VIOLATION	BLUE	NISSAN	MURANO	2005	BLACK	Friday
9086	2010-01-01	01:29:00	7	C	37	black	male	7th District	vehicular	TRAFFIC VIOLATION	BLUE	NISSAN	MURANO	2005	BLACK	Friday
267	2010-01-01	14:00:00	7	I	96	black	male	7th District	vehicular	TRAFFIC VIOLATION	GRAY	JEEP	GRAND CHEROKEE	2003	BLACK	Friday
2	2010-01-01	02:06:00	5	D	17	black	male	5th District		CALL FOR SERVICE					BLACK	Friday
7	2010-01-01	02:06:00	5	D	18	black	male	5th District		CALL FOR SERVICE					BLACK	Friday

```{r}
new_table |>
  drop_na(wday) |>
  ggplot(aes(x = vehicle_year, y = subject_age, color = wday)) + 
  geom_point() +
  xlim(1985, 2025)
```

5.10 Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_traffic, shutdown = TRUE)

5.11 Reflection questions

Why don’t we usually want to run the query: SELECT * FROM table;?
What is the difference between the original table and the results set?
In SQL does the WHERE clause use = or == to indicate equality?
Does BETWEEN work only on numeric variables or also on character strings?
What syntax is used to direct ORDER BY to sort by biggest to smallest or smallest to biggest?
What is the difference between WHERE and HAVING?

5.12 Ethics considerations

What are different ways to look at the dataset to identify possible typos or rogue values?
Why are such tasks so much harder with large datasets (versus small datasets)?
Why are such tasks to much more important with large datasets (versus small datasets)?

Taken directly from Modern Data Science with R ↩︎
taken from: https://cs186berkeley.net/notes/note1/#filtering-null-values↩︎

5.1 Looking at the tables in the database

5.2 SELECT … FROM

5.3 SELECT DISTINCT

5.4 WHERE

5.4.1 NULL in WHERE

5.4.1.1 A NULL example2

5.5 GROUP BY

5.6 ORDER BY

5.7 HAVING

5.8 LIMIT

5.9 Saving SQL queries as R objects

5.10 Best practice

5.11 Reflection questions

5.12 Ethics considerations

5.4.1.1 A `NULL` example²