1  Databases

The main focus of the course will be on SQL (pronounced “sequel” by many people and “s-q-l” by others), short for Structured Query Language. However, before we jump into learning the power of SQL, we will start with two preliminary lessons:

  1. understanding what a database is and why we would ever want to use one. See Section 1.1.
  2. using the R package dbplyr, which translates dplyr code into SQL. Recall the data wrangling functions with which you are already familiar, including filter(), select(), group_by(), summarize(), and arrange(), all of which live in dplyr. See Section 2.1.
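As a preview of what dbplyr does, the sketch below pipes familiar dplyr verbs into show_query() to reveal the SQL that would be run. It assumes an open DBI connection con and a remote flights table like the ones created later in this chapter; nothing is computed until the result is requested.

```r
library(dplyr)
library(dbplyr)

# `con` is assumed to be an open DBI connection (set up later in this chapter)
flights <- tbl(con, "flights")

# Familiar dplyr verbs build up a query lazily...
delayed <- flights |>
  filter(dep_delay > 60) |>
  group_by(carrier) |>
  summarize(n = n())

# ...and show_query() displays the SQL translation without touching the data
show_query(delayed)
```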

1.1 What is a database?

A database is a structured collection of data that is organized in such a way as to facilitate efficient storage, retrieval, and management of information. Databases are particularly important for industries with exceptionally large amounts of data which can be partitioned into different tables. Additionally, databases allow multiple users to access the data simultaneously.

You are likely already familiar with the concept of tidy data. (If you have never encountered tidy data, see Chapter 5, Data tidying, in R for Data Science.) Tidy data typically live in data frames or tables where, importantly, the columns are variables (every value in a column is the same type!) and the rows are observational units.

Conceptually, a table in a database is no different from a data frame in R: both are rectangles with the same row-and-column structure.

There are two big differences between a stand-alone data frame (e.g., in R) and a table which lives in a database (e.g., in SQL).

  1. Most importantly, tables in databases can be arbitrarily large, primarily because they are stored on disk. Indeed, they are typically not even stored on your computer’s disk; they are stored remotely, on a hard drive outside of your own computer and workspace. Data frames are stored in memory (on your computer) and can be quite limited in size.

    • Memory (RAM) is the amount of data that a computer can work on simultaneously. My computer has 32 GB of RAM. The important thing about memory is that the computer has easy access to its own memory and will access it quickly (tens of GBs per second).

    • Hard Disk is the amount of data that a computer can store permanently; it is your storage space. My computer has 2 TB of storage. Accessing the disk is much slower (hundreds of MBs per second) than accessing the memory. Accessing disk space is even slower if the storage lives on a different computer and is accessed virtually (via WiFi).

  2. The tables in a database are usually linked with a key. We will cover join() functions in Chapter 4, but keep in mind that in order to get information from one table to connect to information in another table, we need to somehow relate the rows of the first table to the rows of the second table.
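For a flavor of what such a key looks like, the flights table stores a short carrier code for each flight, and the carriers table maps each code to an airline name; a join matches rows on that shared key. The sketch below assumes the flights and carriers tbl objects created in Section 1.3.

```r
library(dplyr)

# `flights` and `carriers` are assumed to be the tbl objects from Section 1.3;
# the `carrier` column is the key shared by both tables
flights |>
  left_join(carriers, by = "carrier") |>
  select(year, month, day, carrier, name)
```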

1.2 The airlines database

To demonstrate the difference between data frames in memory and tables in storage, we will consider the airlines data consisting of millions of individual flights between 2010 and 2017. The flights are downloaded from the Bureau of Transportation Statistics, US Department of Transportation. The database is a superset of the nycflights13 R package that tracks only flights in and out of airports serving New York City in 2013.

The full data set occupies almost 20 GB when saved as CSV (comma-separated values) files, a common way to hold data that can be represented as columns of text separated by commas.

1.3 Tables in SQL

To look at the airlines data tables, we first need to connect to the database remotely. The function dbConnect_scidb() in the mdsr package allows us to connect to the databases which are stored for use with the text Modern Data Science with R.

The tbl() function (in the dplyr package) maps the tables called flights and carriers from the database to an object in R.

library(tidyverse)
library(mdsr)
con_air <- mdsr::dbConnect_scidb("airlines")
flights <- tbl(con_air, "flights")
carriers <- tbl(con_air, "carriers")

Alternatively, the SQL connection can be set up directly using information specific to the mdsr R package database.

con_air <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "airlines",
  host = Sys.getenv("MDSR_HOST"),
  user = Sys.getenv("MDSR_USER"),
  password = Sys.getenv("MDSR_PWD")
)

We can now use the objects flights and carriers as if they were data frames, but they are not actually R data.frame objects. Instead, each exists as a tbl, a special object that behaves similarly to a data.frame.

The carriers data gives the name of each airline and the associated carrier code. Note that when the data are printed to the screen, the number of rows is given as ??, indicating that it is unknown. Indeed, R needed only the first few rows to print to the screen; there was no need to spend precious computing resources scanning the entire table.

carriers 
# Source:   table<`carriers`> [?? x 2]
# Database: mysql  [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:3306/airlines]
  carrier name                         
  <chr>   <chr>                        
1 02Q     Titan Airways                
2 04Q     Tradewind Aviation           
3 05Q     Comlux Aviation, AG          
4 06Q     Master Top Linhas Aereas Ltd.
5 07Q     Flair Airlines Ltd.          
6 09Q     Swift Air, LLC               
# ℹ more rows

With the carriers data, however, it is possible to load the entire object into R using collect(). Now, when the data are printed to the screen, the object is a tibble and R knows that it has 1,610 rows.

carriers |>
  collect() 
# A tibble: 1,610 × 2
  carrier name                         
  <chr>   <chr>                        
1 02Q     Titan Airways                
2 04Q     Tradewind Aviation           
3 05Q     Comlux Aviation, AG          
4 06Q     Master Top Linhas Aereas Ltd.
5 07Q     Flair Airlines Ltd.          
6 09Q     Swift Air, LLC               
# ℹ 1,604 more rows

As already mentioned, working in R versus working remotely has trade-offs. Remember that the object takes up much more space in your memory if you load it into R. Consider the following, which demonstrates how much more memory intensive it is to hold an object in R.

# carriers lives in the SQL database and is linked remotely
carriers |>
  object.size() |>
  print(units = "Kb")
5.3 Kb
# carriers lives in R
carriers |>
  collect() |>
  object.size() |>
  print(units = "Kb")
234.8 Kb

Indeed, the flights data set contains all of the flights and is much larger than the carriers data set. When pulled into R it takes up almost 5 GB of memory (!), but when it exists only as a connection to the SQL database, it uses just a few Kb of memory.

# flights lives in the database and is linked remotely
flights |>
  object.size() |>
  print(units = "Kb")
6.5 Kb

It is always a good idea to terminate the SQL connection when you are done with it.

library(DBI)
dbDisconnect(con_air)

1.4 Engaging with a database

1.4.1 Many SQL implementations

SQL (Structured Query Language) is a query language for working with relational databases. (The relational part means that the data sets are connected in a meaningful way; the database part means that we have lots of tables living somewhere remotely on a hard drive.) SQL has been around since the 1970s and is extremely powerful for data wrangling tasks.

Although SQL has been an American National Standards Institute (ANSI) standard since 1986, there exist many dialects of SQL. Translating between the dialects is not always easy, but once you learn how to program in one dialect, you will be able to pick up the others. We will use MySQL in this class. MySQL is among the most popular implementations of SQL, and it is open source.
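As one concrete example of dialect differences, even something as simple as returning the first five rows of a table is written differently: MySQL (and DuckDB) use LIMIT, while Microsoft SQL Server uses TOP.

```sql
-- MySQL / DuckDB dialect
SELECT * FROM carriers LIMIT 5;

-- Microsoft SQL Server dialect (same question, different syntax)
SELECT TOP 5 * FROM carriers;
```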

1.4.2 SQL interfaces

MySQL is based on a client-server model. The data live on a powerful computer (the server) and you connect to the data from your own computer (the client). In this class, we will:

  • write SQL code in the MySQL dialect,
  • use both RStudio and DBeaver as interfaces, and
  • connect to many different remote servers.

1.4.3 SQL in-process

Another approach to engaging with SQL is where the client and the server are both on a single computer (called in-process). You may want to set up a database on your own computer to try things out and avoid the monthly charges associated with buying server space in the cloud. Indeed, for the end-of-semester project, if you choose to set up a database on your own computer, a good free database management system is DuckDB.

1.5 Connecting to a database

When using R to connect to a database, we need two R packages. DBI is a low-level interface that connects to databases and executes SQL; RMariaDB is a package specifically tailored for MySQL which translates generic DBI commands into the specific syntax needed for MySQL.

A third R package, dbplyr, is a high-level interface that translates dplyr code (i.e., R code) to SQL queries then executes them with DBI.
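To see this layering, the same question can be asked at either level: directly with DBI by writing SQL yourself, or with dplyr verbs that dbplyr translates behind the scenes. The sketch below assumes an open connection con to a database containing a carriers table.

```r
library(DBI)
library(dplyr)

# Low level: send SQL through DBI and get an ordinary data frame back
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM carriers")

# High level: the same query via dplyr verbs, translated to SQL by dbplyr
tbl(con, "carriers") |>
  summarize(n = n())
```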

1.5.1 Creating a database connection

Before we can do anything, we need to set up a connection to a MySQL database. We will call the connection con, and the syntax will look something like what is written below. However, the code below won’t run because values must be set for dbname, host, user, and password.

con <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  dbname = "  ",
  host = "  ",
  user = "  ",
  password = "  "
)

Note that the function dbConnect_scidb() in the mdsr package is just a wrapper of the dbConnect() function where all the arguments are filled in to connect to the mdsr database.

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con)

1.5.2 Connecting to DuckDB

Connecting to DuckDB is reasonably straightforward because the default values of the duckdb() function in the duckdb package create a temporary database that is deleted when you quit R.

con_duckdb <- DBI::dbConnect(duckdb::duckdb())
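Because this connection is an in-process, temporary database, you can experiment immediately, for example by registering a built-in data frame as a virtual table and querying it with SQL. The sketch below uses duckdb_register() from the duckdb package.

```r
library(DBI)

# Expose the built-in mtcars data frame to DuckDB as a (virtual) table
duckdb::duckdb_register(con_duckdb, "mtcars", mtcars)

# Query it with ordinary SQL via DBI
DBI::dbGetQuery(con_duckdb, "SELECT COUNT(*) AS n FROM mtcars")
```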

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_duckdb, shutdown = TRUE)

1.6 Reflection questions

  1. Why is SQL such an important tool for data scientists? That is, what are the characteristics that make it useful?

  2. What is a relational database?

  3. (Maybe best to answer after learning a few more concepts.) What are the main differences between working with a data frame in R and a table in SQL?

  4. How does one connect to a SQL database? Using R? Using DuckDB?

1.7 Ethics considerations

  1. Why should you need a password to access a SQL server?

  2. What other skills for working with databases are important, beyond accessing, wrangling, and creating databases? (E.g., provenance of the data, purpose of the data’s use, protection of data, privacy of data… etc. Why / how are those important?)