library(mdsr)
con_taxi <- DBI::dbConnect(
RMariaDB::MariaDB(),
dbname = "nyctaxi",
host = Sys.getenv("MDSR_HOST"),
user = Sys.getenv("MDSR_USER"),
password = Sys.getenv("MDSR_PWD")
)
3 SQL clauses
Most of this chapter will focus on the SQL commands needed to run queries of the database. Much of the syntax is strikingly familiar to tidy verbs in R. However, this chapter starts with a few SQL specific tools used to learn about the tables in any particular database.
3.1 Looking at the tables in the database
Consider a database of taxi rides from the Yellow Cab company in NYC in March of 2014.
TABLES; SHOW
Tables_in_nyctaxi |
---|
yellow_old |
There is only one table in the nyctaxi
database, called yellow_old
.
DESCRIBE yellow_old;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
vendor_id | text | YES | |||
pickup_datetime | text | YES | |||
dropoff_datetime | text | YES | |||
passenger_count | bigint(20) | YES | |||
trip_distance | double | YES | |||
pickup_longitude | double | YES | |||
pickup_latitude | double | YES | |||
rate_code | bigint(20) | YES | |||
store_and_fwd_flag | text | YES | |||
dropoff_longitude | double | YES | |||
dropoff_latitude | double | YES | |||
payment_type | text | YES | |||
fare_amount | double | YES | |||
surcharge | double | YES | |||
mta_tax | double | YES | |||
tip_amount | double | YES | |||
tolls_amount | double | YES | |||
total_amount | double | YES |
Similarly, the DESCRIBE
command shows the 18 field names (variables) in the yellow_old
table. Some of the variables are characters (text
) and some are numeric (either double
or bigint
)
Most engagements with SQL are done through queries. Queries in SQL start with the SELECT
keyword and consist of several clauses, which must be written in the following order:1
-
SELECT
allows you to list the columns, or functions operating on columns, that you want to retrieve. This is an analogous operation to theselect()
verb in dplyr, potentially combined withmutate()
orsummarize()
. -
FROM
specifies the table where the data are. -
JOIN
allows you to stitch together two or more tables using a key. This is analogous to theinner_join()
andleft_join()
commands in dplyr. More details ofJOIN
are given in Chapter 4. -
WHERE
allows you to filter the records according to some criteria and is an analogous operation to thefilter()
verb in dplyr. Note, even though theWHERE
clause is written afterSELECT
andJOIN
, it is actually evaluated before theSELECT
orJOIN
clauses (which is whyWHERE
only works on the original data, not the results set). -
GROUP BY
allows you to aggregate the records according to some shared value and is an analogous operation to thegroup_by()
verb in dplyr. -
HAVING
is like aWHERE
clause that operates on the result set—not the records themselves and is analogous to applying a secondfilter()
command in dplyr, after the rows have already been aggregated. -
ORDER BY
is exactly what it sounds like—it specifies a condition for ordering the rows of the result set and is analogous to thearrange()
verb in dplyr. -
LIMIT
restricts the number of rows in the output and is similar to the R commandshead()
andslice()
.
3.2 SELECT … FROM
A SQL query starts with a SELECT
command and has a corresponding FROM
to indicate the table being queried. Columns may be specified, or the *
will indicate that every column in the table should be returned.
The shortest SQL query is the following SELECT
command. Do not run this command!!! The yellow_old
table has 15 million rows, and we do not want to look at them simultaneously.
: SELECT * FROM yellow_old; DO NOT RUN
Instead, to look at the top of the table, SELECT
the first few rows. The LIMIT
command specifies which rows to select: the first number is the number of rows to skip (0 rows skipped), the second number is the number of rows to print up to (up to row 14).
SELECT * FROM yellow_old LIMIT 0, 14;
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMT | 2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 1 | 2.0 | -74.0 | 40.7 | 1 | N | -73.9 | 40.7 | CRD | 9.0 | 0.5 | 0.5 | 2.0 | 0 | 12.0 |
CMT | 2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 2 | 1.2 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 6.0 | 0.5 | 0.5 | 1.0 | 0 | 8.0 |
CMT | 2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 3 | 0.5 | -73.9 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 5.0 | 0.5 | 0.5 | 1.2 | 0 | 7.2 |
CMT | 2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 3 | 3.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.8 | CRD | 14.0 | 0.5 | 0.5 | 3.0 | 0 | 18.0 |
CMT | 2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 1 | 1.8 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 10.5 | 0.5 | 0.5 | 1.0 | 0 | 12.5 |
CMT | 2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 1 | 0.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 4.0 | 0.5 | 0.5 | 0.5 | 0 | 5.5 |
CMT | 2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 2 | 3.7 | -74.0 | 40.8 | 1 | N | -74.0 | 40.7 | CRD | 14.5 | 0.5 | 0.5 | 3.1 | 0 | 18.6 |
CMT | 2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 1 | 5.4 | -74.0 | 40.8 | 1 | N | -74.0 | 40.7 | CRD | 20.0 | 0.5 | 0.5 | 3.0 | 0 | 24.0 |
CMT | 2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 1 | 3.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.8 | CRD | 13.5 | 0.5 | 0.5 | 2.9 | 0 | 17.4 |
CMT | 2014-03-01 01:13:10 | 2014-03-01 01:38:54 | 3 | 5.9 | -74.0 | 40.8 | 1 | N | -74.0 | 40.7 | CRD | 21.5 | 0.5 | 0.5 | 2.0 | 0 | 24.5 |
CMT | 2014-03-01 01:14:13 | 2014-03-01 01:25:49 | 1 | 1.9 | -74.0 | 40.8 | 1 | N | -74.0 | 40.8 | CRD | 10.0 | 0.5 | 0.5 | 2.2 | 0 | 13.2 |
CMT | 2014-03-01 01:15:22 | 2014-03-01 01:30:04 | 3 | 3.2 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 13.0 | 0.5 | 0.5 | 2.8 | 0 | 16.8 |
CMT | 2014-03-01 01:16:28 | 2014-03-01 01:28:05 | 1 | 2.8 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 11.0 | 0.5 | 0.5 | 1.5 | 0 | 13.5 |
Speaking of which, how many rows are there in the yellow_old
table? That is, how many taxi rides are recorded? Now SELECT
is used with a summary function, COUNT()
. Instead of using a separate summary function (like mutate()
or summarize()
), all the work is done inside the SELECT
call.
SELECT COUNT(*) FROM yellow_old;
COUNT(*) |
---|
15428128 |
Yikes, more than 15 million taxi rides!!!!
You might have noticed that the yellow_old
table has two different datetime variables (one for pickup, the other for drop-off). We can use the information to assess the length of each ride (in time, not distance). However, the variables are stored in SQL as character strings instead of in a DateTime format (even though they look like they are stored in a DateTime format!), see Table 3.2. Fortunately for us, SQL has functionality to convert a text
Type into DateTime
type (POSIXct is a special type of DateTime formatting).
SELECT
pickup_datetime, dropoff_datetime,"%Y-%m-%d %T") AS pickup,
STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS dropoff
STR_TO_DATE(dropoff_datetime, FROM yellow_old
LIMIT 0, 10;
pickup_datetime | dropoff_datetime | pickup | dropoff |
---|---|---|---|
2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 2014-03-01 01:07:38 | 2014-03-01 01:16:26 |
2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 2014-03-01 01:08:03 | 2014-03-01 01:12:51 |
2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 2014-03-01 01:08:51 | 2014-03-01 01:13:18 |
2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 2014-03-01 01:09:20 | 2014-03-01 01:24:18 |
2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 2014-03-01 01:09:46 | 2014-03-01 01:22:34 |
2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 2014-03-01 01:12:41 | 2014-03-01 01:15:38 |
2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 2014-03-01 01:12:11 | 2014-03-01 01:27:38 |
2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 2014-03-01 01:13:55 | 2014-03-01 01:34:54 |
2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 2014-03-01 01:14:06 | 2014-03-01 01:28:25 |
Now that the variables are no longer strings, we can subtract them to figure out the number of minutes for each taxi ride. Unfortunately, the following code won’t run because neither of the variables pickup
or dropoff
are in the table yellow_old
.
SELECT
pickup_datetime, dropoff_datetime,"%Y-%m-%d %T") AS pickup,
STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS dropoff.
STR_TO_DATE(dropoff_datetime, AS length_time
TIMEDIFF(pickup, dropoff) FROM yellow_old
LIMIT 0, 10;
Instead, we need two layers of SELECT
commands so that the first SELECT
(i.e., inside) layer creates the new variables, and the second SELECT
(i.e., outside) layer subtracts the two times.
SELECT
pickup,
dropoff, AS length_time
TIMEDIFF(pickup, dropoff) FROM (
SELECT
"%Y-%m-%d %T") AS pickup,
STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T") AS dropoff
STR_TO_DATE(dropoff_datetime, FROM yellow_old)
AS subquery_table
LIMIT 0, 20;
pickup | dropoff | length_time |
---|---|---|
2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 00:08:48 |
2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 00:04:48 |
2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 00:04:27 |
2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 00:14:58 |
2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 00:12:48 |
2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 00:02:57 |
2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 00:15:27 |
2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 00:20:59 |
2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 00:14:19 |
2014-03-01 01:13:10 | 2014-03-01 01:38:54 | 00:25:44 |
2014-03-01 01:14:13 | 2014-03-01 01:25:49 | 00:11:36 |
2014-03-01 01:15:22 | 2014-03-01 01:30:04 | 00:14:42 |
2014-03-01 01:16:28 | 2014-03-01 01:28:05 | 00:11:37 |
2014-03-01 01:25:34 | 2014-03-01 02:01:03 | 00:35:29 |
2014-03-01 01:26:39 | 2014-03-01 01:30:03 | 00:03:24 |
2014-03-01 01:27:16 | 2014-03-01 01:46:59 | 00:19:43 |
2014-03-01 01:28:39 | 2014-03-01 01:30:53 | 00:02:14 |
2014-03-01 01:29:40 | 2014-03-01 01:35:01 | 00:05:21 |
2014-03-01 01:28:51 | 2014-03-01 01:43:06 | 00:14:15 |
Alternatively, the STR_TO_DATE()
function can be applied inside the TIMEDIFF()
function so that the full (now only) SELECT
command is being used only on variables that are in the original table.
SELECT
pickup_datetime,
dropoff_datetime, "%Y-%m-%d %T"),
TIMEDIFF(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS length_time
STR_TO_DATE(dropoff_datetime, FROM yellow_old
LIMIT 0, 20;
pickup_datetime | dropoff_datetime | length_time |
---|---|---|
2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 00:08:48 |
2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 00:04:48 |
2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 00:04:27 |
2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 00:14:58 |
2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 00:12:48 |
2014-03-01 01:12:41 | 2014-03-01 01:15:38 | 00:02:57 |
2014-03-01 01:12:11 | 2014-03-01 01:27:38 | 00:15:27 |
2014-03-01 01:13:55 | 2014-03-01 01:34:54 | 00:20:59 |
2014-03-01 01:14:06 | 2014-03-01 01:28:25 | 00:14:19 |
2014-03-01 01:13:10 | 2014-03-01 01:38:54 | 00:25:44 |
2014-03-01 01:14:13 | 2014-03-01 01:25:49 | 00:11:36 |
2014-03-01 01:15:22 | 2014-03-01 01:30:04 | 00:14:42 |
2014-03-01 01:16:28 | 2014-03-01 01:28:05 | 00:11:37 |
2014-03-01 01:25:34 | 2014-03-01 02:01:03 | 00:35:29 |
2014-03-01 01:26:39 | 2014-03-01 01:30:03 | 00:03:24 |
2014-03-01 01:27:16 | 2014-03-01 01:46:59 | 00:19:43 |
2014-03-01 01:28:39 | 2014-03-01 01:30:53 | 00:02:14 |
2014-03-01 01:29:40 | 2014-03-01 01:35:01 | 00:05:21 |
2014-03-01 01:28:51 | 2014-03-01 01:43:06 | 00:14:15 |
Keep in mind that there is a distinction between clauses that operate on the variables of the original table versus those that operate on the variables of the results set. The variables pickup_datetime
and dropoff_datetime
are columns in the original table - they are written to disk on the SQL server. The variables pickup
, dropoff
, and length_time
exist only in the results set, which is passed from the server (SQL server) to the client (e.g., RStudio or DBeaver) and is not written to disk.
3.3 SELECT DISTINCT
SELECT DISTINCT
returns only unique rows. That is, it filters out all the duplicates of a variable or a combination of variables. Note that I have a larger limit on the query that I needed, just to make sure I got all the levels.
SELECT DISTINCT payment_type
FROM yellow_old
LIMIT 0, 20;
payment_type |
---|
CRD |
CSH |
NOC |
DIS |
UNK |
SELECT DISTINCT vendor_id, payment_type
FROM yellow_old
LIMIT 0, 20;
vendor_id | payment_type |
---|---|
CMT | CRD |
CMT | CSH |
CMT | NOC |
CMT | DIS |
VTS | CRD |
VTS | CSH |
VTS | UNK |
3.4 WHERE
The WHERE
clause is analogous to the filter()
function in dplyr. However, keep in mind that there are two SQL commands that resemble the dplyr filter()
function. WHERE
operates on the original data in the table and HAVING
operates on the result set. See below for examples using HAVING
.
What was the fare for those taxi rides where the tip_amount
was more than $10 and the person used cash? (Note that in SQL the equality logical is =
and in R the equality logical is ==
.)
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount > 10
AND payment_type = "CSH"
LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
CSH | 65.5 | 15.3 | 91.8 |
CSH | 52.0 | 11.6 | 69.4 |
CSH | 52.0 | 11.6 | 69.4 |
CSH | 55.0 | 16.2 | 81.2 |
CSH | 71.5 | 20.0 | 103.5 |
CSH | 70.0 | 16.2 | 97.1 |
CSH | 95.0 | 21.9 | 131.2 |
CSH | 62.5 | 15.5 | 93.0 |
CSH | 66.0 | 15.0 | 90.0 |
CSH | 65.0 | 13.2 | 79.2 |
BETWEEN
can be used to specify a range of values for a numeric value. BETWEEN
is inclusive.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12
AND payment_type = "CSH"
LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
CSH | 52.0 | 11.6 | 69.4 |
CSH | 52.0 | 11.6 | 69.4 |
CSH | 88.0 | 10.0 | 107.0 |
CSH | 72.0 | 10.0 | 94.0 |
CSH | 64.5 | 10.0 | 85.5 |
CSH | 66.0 | 12.0 | 93.0 |
CSH | 52.0 | 11.6 | 69.4 |
CSH | 69.0 | 10.0 | 88.0 |
CSH | 90.0 | 10.0 | 100.0 |
CSH | 52.0 | 11.6 | 69.4 |
IN
is similar to the dplyr %in%
function which specifies distinct values for the variable.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount IN (10, 12)
AND payment_type = "CSH"
LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
CSH | 88.0 | 10 | 107.0 |
CSH | 72.0 | 10 | 94.0 |
CSH | 64.5 | 10 | 85.5 |
CSH | 66.0 | 12 | 93.0 |
CSH | 69.0 | 10 | 88.0 |
CSH | 90.0 | 10 | 100.0 |
CSH | 74.5 | 10 | 90.3 |
CSH | 89.0 | 10 | 118.1 |
CSH | 52.0 | 10 | 67.8 |
CSH | 66.0 | 12 | 90.0 |
The WHERE
clause can be established by a number of logical commands combined using either AND
or OR
. Usually it is important to use parentheses with OR
logicals to make sure the desired query is return. Consider the difference between the following queries. In SQL (as in many programming languages), AND
takes precedent over OR
in the order of operations, when there are no parentheses. (I was taught to remember order of operations using “please excuse my dear aunt Sally.”) The order of operations on the first query groups the second two conditions into one because AND
take precedence over OR
(as if the query was tip_amount BETWEEN 10 and 12 OR (total_amount BETWEEN 100 and 112 AND payment_type = "CSH")
).
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE tip_amount BETWEEN 10 and 12 OR
BETWEEN 100 and 112 AND
total_amount = "CSH"
payment_type LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
CRD | 52.0 | 10.5 | 63.0 |
CRD | 35.0 | 10.2 | 51.0 |
CRD | 52.0 | 11.6 | 69.4 |
CRD | 30.5 | 10.8 | 47.2 |
CRD | 52.0 | 10.5 | 63.0 |
CRD | 52.0 | 10.5 | 63.0 |
CRD | 52.0 | 11.6 | 69.4 |
CRD | 52.0 | 11.6 | 69.4 |
CRD | 52.0 | 11.6 | 69.4 |
CRD | 52.0 | 11.6 | 69.4 |
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE (tip_amount BETWEEN 10 and 12 OR
BETWEEN 100 and 112 ) AND
total_amount = "CSH"
payment_type LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
CSH | 107.0 | 0.0 | 108.0 |
CSH | 92.5 | 0.0 | 103.5 |
CSH | 99.5 | 0.0 | 105.3 |
CSH | 92.0 | 0.0 | 106.3 |
CSH | 103.0 | 0.0 | 109.3 |
CSH | 107.0 | 0.0 | 107.0 |
CSH | 104.5 | 0.0 | 105.5 |
CSH | 112.0 | 0.0 | 112.0 |
CSH | 52.0 | 11.6 | 69.4 |
CSH | 52.0 | 11.6 | 69.4 |
3.4.1 NULL in WHERE
SQL considers NULL
values to be unknown. Therefore, when searching for a NULL
value, you need to ask SQL if the value IS NULL
. Asking if the value is equal to NULL
doesn’t work because NULL
values don’t equal anything (they are unknown). To keep all values that are not NULL
values, use IS NOT NULL
in the WHERE
clause.
3.4.1.1 A NULL
example2
The logic of NULL
:
- If you do anything with
NULL
, you’ll just getNULL
. For instance if \(x\) isNULL
, then \(x > 3\), \(1 = x\), and \(x + 4\) all evaluate toNULL
. Even \(x =\)NULL
evaluates toNULL
! if you want to check whether \(x\) isNULL
, usex IS NULL
orx IS NOT NULL
. -
NULL
short-circuits with boolean operators. That means a boolean expression involvingNULL
will evaluate to:- TRUE, if it’d evaluate to TRUE regardless of whether the
NULL
value is really TRUE or FALSE. - FALSE, if it’d evaluate to FALSE regardless of whether the
NULL
value is really TRUE or FALSE. - Or
NULL
, if it depends on theNULL
value.
- TRUE, if it’d evaluate to TRUE regardless of whether the
Consider the following table and SQL query:
SELECT * FROM (
SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
UNION
SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs
UNION
SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
UNION
SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
AS temptable; )
name | age | num_dogs |
---|---|---|
Ace | 20 | 4 |
Ada | 3 | |
Ben | ||
Cho | 27 |
SELECT * FROM (
SELECT 'Ace' AS name, 20 AS age, 4 as num_dogs
UNION
SELECT 'Ada' AS name, NULL AS age, 3 as num_dogs
UNION
SELECT 'Ben' AS name, NULL AS age, NULL as num_dogs
UNION
SELECT 'Cho' AS name, 27 AS age, NULL as num_dogs
AS temptable
) WHERE age <= 20 OR num_dogs = 3;
name | age | num_dogs |
---|---|---|
Ace | 20 | 4 |
Ada | 3 |
Where does the WHERE
clause do? It tells us that we only want to keep the rows satisfying the age <= 20
OR num_dogs = 3
. Let’s consider each row one at a time:
- For Ace,
age <= 20
evaluates to TRUE so the claim is satisfied. - For Ada,
age <= 20
evaluates toNULL
butnum_dogs = 3
evaluates to TRUE so the claim is satisfied. - For Ben,
age <= 20
evaluates toNULL
andnum_dogs = 3
evaluates toNULL
so the overall expression isNULL
which has a FALSE value. - For Cho,
age <= 20
evaluates to FALSE andnum_dogs = 3
evaluates toNULL
so the overall expression evaluates toNULL
(because it depends on the value of theNULL
).
Thus we keep only Ace and Ada.
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE payment_type IS NULL
LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
SELECT payment_type, fare_amount, tip_amount, total_amount
FROM yellow_old
WHERE payment_type = NULL
LIMIT 0, 10;
payment_type | fare_amount | tip_amount | total_amount |
---|---|---|---|
3.5 GROUP BY
The GROUP BY
clause will direct SQL to carry out the query separately for each category in the grouped variable. Using GROUP BY
is particularly important when aggregating multiple rows into a single number. Some aggregate functions include COUNT()
, SUM()
, MAX()
, MIN()
, and AVG()
.
Note that SUM(1)
adds (sums) the number 1
for each row. Which is the same as counting the number of rows. SUM(2)
adds (sums) the number 2
for each row which returns twice as many transactions.
SELECT COUNT(*) AS num_transactions,
SUM(1) AS num_transactions_also,
SUM(2) AS double_transactions,
payment_type FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;
num_transactions | num_transactions_also | double_transactions | payment_type |
---|---|---|---|
213872 | 213872 | 427744 | CRD |
78 | 78 | 156 | CSH |
3 | 3 | 6 | DIS |
7 | 7 | 14 | NOC |
609 | 609 | 1218 | UNK |
For those people who tipped between $10 and $20, what was the lowest and highest fare for each of the types of payments?
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type FROM yellow_old
WHERE tip_amount BETWEEN 10 and 20
GROUP BY payment_type;
num_transactions | lowest_fare | highest_fare | payment_type |
---|---|---|---|
213872 | 0.0 | 370.0 | CRD |
78 | 52.0 | 102.0 | CSH |
3 | 52.0 | 79.5 | DIS |
7 | 58.0 | 94.0 | NOC |
609 | 4.5 | 147.0 | UNK |
GROUP BY
will work applied to multiple columns. Let’s tabulate the same results, now broken down by payment_type
and day of week. Except that we don’t have a day of week variable! We need to convert the pickup_datetime
variable to a DateTime object and then pull out the day of the week, using DAYNAME
. (Note: DAYOFWEEK
will give you the day of the week as an integer. Use your internet sleuthing skills if you are looking for functions that might help your desired query.)
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,"%Y-%m-%d %T")) AS wday
DAYNAME(STR_TO_DATE(pickup_datetime, FROM yellow_old
GROUP BY payment_type, wday;
num_transactions | lowest_fare | highest_fare | payment_type | wday |
---|---|---|---|---|
1 | ||||
1247397 | 2.5 | 477 | CRD | Friday |
1278362 | -612.4 | 500 | CRD | Monday |
1533796 | 2.5 | 420 | CRD | Saturday |
1324394 | 2.5 | 480 | CRD | Sunday |
1258098 | 2.5 | 500 | CRD | Thursday |
1121081 | 2.5 | 500 | CRD | Tuesday |
1192892 | 2.5 | 400 | CRD | Wednesday |
860920 | 2.5 | 444 | CSH | Friday |
918653 | 0.0 | 873 | CSH | Monday |
1207305 | 2.5 | 350 | CSH | Saturday |
1020438 | 2.5 | 425 | CSH | Sunday |
813813 | 2.5 | 475 | CSH | Thursday |
751769 | 2.5 | 300 | CSH | Tuesday |
775823 | 2.5 | 400 | CSH | Wednesday |
1592 | 2.5 | 255 | DIS | Friday |
1537 | 0.0 | 102 | DIS | Monday |
2236 | 2.5 | 200 | DIS | Saturday |
1821 | 2.5 | 200 | DIS | Sunday |
1357 | 2.5 | 165 | DIS | Thursday |
1222 | 2.5 | 475 | DIS | Tuesday |
1295 | 2.5 | 373 | DIS | Wednesday |
5252 | 2.5 | 229 | NOC | Friday |
5440 | 0.0 | 950 | NOC | Monday |
7217 | 2.5 | 295 | NOC | Saturday |
6383 | 2.5 | 300 | NOC | Sunday |
4840 | 2.5 | 223 | NOC | Thursday |
4123 | 2.5 | 384 | NOC | Tuesday |
4482 | 2.5 | 200 | NOC | Wednesday |
10131 | 2.5 | 130 | UNK | Friday |
11263 | 2.5 | 95 | UNK | Monday |
12813 | 2.5 | 147 | UNK | Saturday |
11003 | 2.5 | 114 | UNK | Sunday |
10197 | 2.5 | 200 | UNK | Thursday |
9643 | 2.5 | 138 | UNK | Tuesday |
9539 | 2.5 | 133 | UNK | Wednesday |
3.6 ORDER BY
The ORDER BY
command can be used with or without the GROUP BY
and aggregation commands. It allows us to look at interesting aspects of the data by sorting the data.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,"%Y-%m-%d %T")) AS wday
DAYNAME(STR_TO_DATE(pickup_datetime, FROM yellow_old
GROUP BY payment_type, wday
ORDER BY lowest_fare ASC;
num_transactions | lowest_fare | highest_fare | payment_type | wday |
---|---|---|---|---|
1 | ||||
1278362 | -612.4 | 500 | CRD | Monday |
918653 | 0.0 | 873 | CSH | Monday |
5440 | 0.0 | 950 | NOC | Monday |
1537 | 0.0 | 102 | DIS | Monday |
1533796 | 2.5 | 420 | CRD | Saturday |
1121081 | 2.5 | 500 | CRD | Tuesday |
1192892 | 2.5 | 400 | CRD | Wednesday |
1258098 | 2.5 | 500 | CRD | Thursday |
1247397 | 2.5 | 477 | CRD | Friday |
1324394 | 2.5 | 480 | CRD | Sunday |
813813 | 2.5 | 475 | CSH | Thursday |
1207305 | 2.5 | 350 | CSH | Saturday |
751769 | 2.5 | 300 | CSH | Tuesday |
775823 | 2.5 | 400 | CSH | Wednesday |
1020438 | 2.5 | 425 | CSH | Sunday |
7217 | 2.5 | 295 | NOC | Saturday |
860920 | 2.5 | 444 | CSH | Friday |
1295 | 2.5 | 373 | DIS | Wednesday |
1821 | 2.5 | 200 | DIS | Sunday |
4123 | 2.5 | 384 | NOC | Tuesday |
1222 | 2.5 | 475 | DIS | Tuesday |
5252 | 2.5 | 229 | NOC | Friday |
6383 | 2.5 | 300 | NOC | Sunday |
4840 | 2.5 | 223 | NOC | Thursday |
4482 | 2.5 | 200 | NOC | Wednesday |
1357 | 2.5 | 165 | DIS | Thursday |
1592 | 2.5 | 255 | DIS | Friday |
2236 | 2.5 | 200 | DIS | Saturday |
11003 | 2.5 | 114 | UNK | Sunday |
12813 | 2.5 | 147 | UNK | Saturday |
11263 | 2.5 | 95 | UNK | Monday |
9643 | 2.5 | 138 | UNK | Tuesday |
9539 | 2.5 | 133 | UNK | Wednesday |
10197 | 2.5 | 200 | UNK | Thursday |
10131 | 2.5 | 130 | UNK | Friday |
WHAT?!?!! How in the world was one of the fares -$612.40? It doesn’t make any sense that a fare would be negative. Some additional inquiry into the observation corresponding to a fare of -$612.40 is absolutely warranted. If the observation is found to be a typo, it would need to be removed from the data set. If the observation is somehow legitimate, it would need to be included in the analysis, with the information provided about its legitimacy.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,"%Y-%m-%d %T")) AS wday
DAYNAME(STR_TO_DATE(pickup_datetime, FROM yellow_old
GROUP BY payment_type, wday
ORDER BY highest_fare DESC;
num_transactions | lowest_fare | highest_fare | payment_type | wday |
---|---|---|---|---|
5440 | 0.0 | 950 | NOC | Monday |
918653 | 0.0 | 873 | CSH | Monday |
1278362 | -612.4 | 500 | CRD | Monday |
1121081 | 2.5 | 500 | CRD | Tuesday |
1258098 | 2.5 | 500 | CRD | Thursday |
1324394 | 2.5 | 480 | CRD | Sunday |
1247397 | 2.5 | 477 | CRD | Friday |
1222 | 2.5 | 475 | DIS | Tuesday |
813813 | 2.5 | 475 | CSH | Thursday |
860920 | 2.5 | 444 | CSH | Friday |
1020438 | 2.5 | 425 | CSH | Sunday |
1533796 | 2.5 | 420 | CRD | Saturday |
1192892 | 2.5 | 400 | CRD | Wednesday |
775823 | 2.5 | 400 | CSH | Wednesday |
4123 | 2.5 | 384 | NOC | Tuesday |
1295 | 2.5 | 373 | DIS | Wednesday |
1207305 | 2.5 | 350 | CSH | Saturday |
751769 | 2.5 | 300 | CSH | Tuesday |
6383 | 2.5 | 300 | NOC | Sunday |
7217 | 2.5 | 295 | NOC | Saturday |
1592 | 2.5 | 255 | DIS | Friday |
5252 | 2.5 | 229 | NOC | Friday |
4840 | 2.5 | 223 | NOC | Thursday |
4482 | 2.5 | 200 | NOC | Wednesday |
1821 | 2.5 | 200 | DIS | Sunday |
2236 | 2.5 | 200 | DIS | Saturday |
10197 | 2.5 | 200 | UNK | Thursday |
1357 | 2.5 | 165 | DIS | Thursday |
12813 | 2.5 | 147 | UNK | Saturday |
9643 | 2.5 | 138 | UNK | Tuesday |
9539 | 2.5 | 133 | UNK | Wednesday |
10131 | 2.5 | 130 | UNK | Friday |
11003 | 2.5 | 114 | UNK | Sunday |
1537 | 0.0 | 102 | DIS | Monday |
11263 | 2.5 | 95 | UNK | Monday |
1 |
$950 is a lot to pay for a cab ride! But in NYC, I’d believe it.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,"%Y-%m-%d %T")) AS wday
DAYNAME(STR_TO_DATE(pickup_datetime, FROM yellow_old
GROUP BY payment_type, wday
ORDER BY wday, payment_type;
num_transactions | lowest_fare | highest_fare | payment_type | wday |
---|---|---|---|---|
1 | ||||
1247397 | 2.5 | 477 | CRD | Friday |
860920 | 2.5 | 444 | CSH | Friday |
1592 | 2.5 | 255 | DIS | Friday |
5252 | 2.5 | 229 | NOC | Friday |
10131 | 2.5 | 130 | UNK | Friday |
1278362 | -612.4 | 500 | CRD | Monday |
918653 | 0.0 | 873 | CSH | Monday |
1537 | 0.0 | 102 | DIS | Monday |
5440 | 0.0 | 950 | NOC | Monday |
11263 | 2.5 | 95 | UNK | Monday |
1533796 | 2.5 | 420 | CRD | Saturday |
1207305 | 2.5 | 350 | CSH | Saturday |
2236 | 2.5 | 200 | DIS | Saturday |
7217 | 2.5 | 295 | NOC | Saturday |
12813 | 2.5 | 147 | UNK | Saturday |
1324394 | 2.5 | 480 | CRD | Sunday |
1020438 | 2.5 | 425 | CSH | Sunday |
1821 | 2.5 | 200 | DIS | Sunday |
6383 | 2.5 | 300 | NOC | Sunday |
11003 | 2.5 | 114 | UNK | Sunday |
1258098 | 2.5 | 500 | CRD | Thursday |
813813 | 2.5 | 475 | CSH | Thursday |
1357 | 2.5 | 165 | DIS | Thursday |
4840 | 2.5 | 223 | NOC | Thursday |
10197 | 2.5 | 200 | UNK | Thursday |
1121081 | 2.5 | 500 | CRD | Tuesday |
751769 | 2.5 | 300 | CSH | Tuesday |
1222 | 2.5 | 475 | DIS | Tuesday |
4123 | 2.5 | 384 | NOC | Tuesday |
9643 | 2.5 | 138 | UNK | Tuesday |
1192892 | 2.5 | 400 | CRD | Wednesday |
775823 | 2.5 | 400 | CSH | Wednesday |
1295 | 2.5 | 373 | DIS | Wednesday |
4482 | 2.5 | 200 | NOC | Wednesday |
9539 | 2.5 | 133 | UNK | Wednesday |
As above, we were able to GROUP BY
and ORDER BY
on the new variables we had created, wday
.
3.7 HAVING
Recall that WHERE
acts only on the original data. If we are interested in rides that took place on Friday
, we need to use the derived variable wday
instead of the raw variable pickup_datetime
. Fortunately, HAVING
works on the results set. Note that SQL uses ''
for strings, not ""
. In SQL, ""
is used to identify variables (not values of variables), like R’s ``
.
SELECT COUNT(*) AS num_transactions,
MIN(fare_amount) AS lowest_fare,
MAX(fare_amount) AS highest_fare,
payment_type,"%Y-%m-%d %T")) AS wday
DAYNAME(STR_TO_DATE(pickup_datetime, FROM yellow_old
GROUP BY payment_type, wday
HAVING wday = 'Friday';
num_transactions | lowest_fare | highest_fare | payment_type | wday |
---|---|---|---|---|
1247397 | 2.5 | 477 | CRD | Friday |
860920 | 2.5 | 444 | CSH | Friday |
1592 | 2.5 | 255 | DIS | Friday |
5252 | 2.5 | 229 | NOC | Friday |
10131 | 2.5 | 130 | UNK | Friday |
While it worked out quite well for us that HAVING
was able to filter the data based on the results set, the use of HAVING
was quite onerous because the entire data set was considered before the filter was applied. That is, if the filter can be done on the original data using WHERE
, the query will be much faster and more efficient.
Note: HAVING
requires a GROUP BY
clause. And the variable(s) used in HAVING
must also be part of the GROUP BY
clause.
3.8 LIMIT
As we’ve seen, LIMIT
truncates the query to specified rows. The LIMIT
command specifies which rows to select: the first number is the number of rows to skip (0 rows skipped), the second number is the number of rows to print up to (up to row 14). The query below shows the last 10 rows of the entire data set.
SELECT * FROM yellow_old LIMIT 15428118, 10;
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMT | 2014-03-18 14:35:21 | 2014-03-18 14:52:01 | 1 | 2.7 | 0 | 0.0 | 1 | N | 0 | 0.0 | CRD | 13.0 | 0 | 0.5 | 2.50 | 0.00 | 16.0 |
CMT | 2014-03-18 14:08:23 | 2014-03-18 14:19:29 | 2 | 1.3 | 0 | 0.0 | 1 | N | 0 | 0.0 | CRD | 9.0 | 0 | 0.5 | 1.90 | 0.00 | 11.4 |
CMT | 2014-03-18 09:18:38 | 2014-03-18 09:19:41 | 1 | 0.2 | -74 | 40.8 | 1 | N | -74 | 40.8 | CRD | 3.0 | 0 | 0.5 | 1.00 | 0.00 | 4.5 |
CMT | 2014-03-18 06:28:12 | 2014-03-18 06:49:49 | 1 | 9.9 | 0 | 0.0 | 1 | N | 0 | 0.0 | CRD | 30.0 | 0 | 0.5 | 7.16 | 5.33 | 43.0 |
CMT | 2014-03-18 17:39:28 | 2014-03-18 17:53:01 | 1 | 4.9 | -74 | 40.8 | 1 | N | -74 | 40.7 | CRD | 16.5 | 1 | 0.5 | 3.00 | 0.00 | 21.0 |
CMT | 2014-03-18 18:14:19 | 2014-03-18 18:27:22 | 1 | 0.3 | -74 | 40.7 | 1 | N | -74 | 40.7 | CRD | 3.5 | 1 | 0.5 | 6.00 | 0.00 | 11.0 |
CMT | 2014-03-18 10:12:33 | 2014-03-18 10:28:09 | 1 | 3.3 | -74 | 40.8 | 1 | N | -74 | 40.8 | CRD | 13.5 | 0 | 0.5 | 4.20 | 0.00 | 18.2 |
CMT | 2014-03-18 09:02:37 | 2014-03-18 09:16:29 | 1 | 6.2 | -74 | 40.8 | 1 | N | -74 | 40.7 | CRD | 19.5 | 0 | 0.5 | 1.00 | 0.00 | 21.0 |
CMT | 2014-03-18 10:10:19 | 2014-03-18 10:19:25 | 1 | 1.7 | -74 | 40.8 | 1 | N | -74 | 40.8 | CRD | 8.5 | 0 | 0.5 | 1.80 | 0.00 | 10.8 |
CMT | 2014-03-18 15:24:53 | 2014-03-18 15:42:42 | 1 | 1.7 | -74 | 40.8 | 1 | N | -74 | 40.7 | CRD | 12.5 | 0 | 0.5 | 2.60 | 0.00 | 15.6 |
3.9 Saving SQL queries as R objects
If you are working in R to run SQL commands, you may want to use the query output for further analysis or visualizations. In that case, use #|output.var: "name_of_variable"
inside the {sql}
chunk. The variable called name_of_variable
will then be available to be used in the R environment.
```{sql}
#| connection: con_taxi
#| label: new-table
#| output.var: "new_table"
SELECT *, DAYNAME(STR_TO_DATE(pickup_datetime, "%Y-%m-%d %T")) AS wday
FROM yellow_old
LIMIT 0, 1000;
```
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount | wday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CMT | 2014-03-01 01:07:38 | 2014-03-01 01:16:26 | 1 | 2.0 | -74.0 | 40.7 | 1 | N | -73.9 | 40.7 | CRD | 9.0 | 0.5 | 0.5 | 2.0 | 0 | 12.0 | Saturday |
CMT | 2014-03-01 01:08:03 | 2014-03-01 01:12:51 | 2 | 1.2 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 6.0 | 0.5 | 0.5 | 1.0 | 0 | 8.0 | Saturday |
CMT | 2014-03-01 01:08:51 | 2014-03-01 01:13:18 | 3 | 0.5 | -73.9 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 5.0 | 0.5 | 0.5 | 1.2 | 0 | 7.2 | Saturday |
CMT | 2014-03-01 01:09:20 | 2014-03-01 01:24:18 | 3 | 3.5 | -74.0 | 40.7 | 1 | N | -74.0 | 40.8 | CRD | 14.0 | 0.5 | 0.5 | 3.0 | 0 | 18.0 | Saturday |
CMT | 2014-03-01 01:09:46 | 2014-03-01 01:22:34 | 1 | 1.8 | -74.0 | 40.7 | 1 | N | -74.0 | 40.7 | CRD | 10.5 | 0.5 | 0.5 | 1.0 | 0 | 12.5 | Saturday |
```{r}
|>
new_table drop_na(wday) |>
ggplot(aes(x = fare_amount, y = tip_amount, color = wday)) +
geom_point()
```
3.10 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_taxi, shutdown = TRUE)
3.11 Reflection questions
Why don’t we usually want to run the query:
SELECT * FROM table;
?What is the difference between the original table and the results set?
In SQL does the
WHERE
clause use=
or==
to indicate equality?Does
BETWEEN
work only on numeric variables or also on character strings?What syntax is used to direct
ORDER BY
to sort by biggest to smallest or smallest to biggest?What is the difference between
WHERE
andHAVING
?
3.12 Ethics considerations
What are different ways to look at the dataset to identify possible typos or rogue values?
Why are such tasks so much harder with large datasets (versus small datasets)?
Why are such tasks to much more important with large datasets (versus small datasets)?
Taken directly from Modern Data Science with R↩︎
taken from: https://cs186berkeley.net/notes/note1/#filtering-null-values↩︎