5 Creating databases

In order to practice creating and editing databases, you will need to install DuckDB onto your own computer. Fortunately, you can do the installation using R! DuckDB is an in-process database management system that runs entirely on your own computer. Setting up a database on your own computer is a great way to work with large datasets where you are the only analyst. The data then lives in your storage (instead of your memory), and you don’t have to transfer queries or results over the internet.

install.packages("duckdb") # only once, in the Console, not in the .qmd or .Rmd file
library(duckdb) # at the top of the .qmd or .Rmd file
library(DBI) # we also still need the DBI package
Using DuckDB suits our purposes because it allows us to create a local database that we can edit. However, the SQL dialect used in DuckDB is slightly different from MySQL, which is important to recognize. For example, we write SELECT * FROM table LIMIT 10; instead of SELECT * FROM table LIMIT 0, 10;. In your future as a data scientist, you will encounter different dialects, depending on the SQL server. Always be aware of the dialect you are using.
In order to create a database (on our own computer), we’ll start by creating a connection to DuckDB. Note that the database has been stored to a database directory called duck_datab which lives in the current R project. You won’t be able to open it like a standard folder, but it is where DuckDB stores the database files.
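A minimal sketch of that connection code, assuming the packages loaded above and naming the connection object con_duckdb (the name used in the rest of this chapter):

# assumes library(duckdb) and library(DBI) have already been run
# creates (or opens) the local database stored in the duck_datab directory
con_duckdb <- dbConnect(duckdb(), dbdir = "duck_datab")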
5.1 Preparing to load data
The DuckDB database is currently empty, so we need to load in some data. The duckdb_read_csv() function in the duckdb R package allows us to load a .csv file (available on GitHub) directly into the database without loading it as an R object first. duckdb_read_csv() does some of the work of determining data types for us. However, we will first learn what data types are and how to use them, because most dialects of SQL require you to specify the data types before loading in data (usually done with LOAD DATA in MySQL, but with COPY in DuckDB).
Recall that in Table 3.2 we used DESCRIBE to display the variable types of the database table(s). The list includes the variable name (Field), its Type, whether there are NULL values allowed, and whether there are keys or indexes defined on the variable. See Table 5.2 for the DESCRIBE output on the table we are about to import.
Unlike R, when creating a new data table, SQL requires that you communicate each future variable (column) and that variable’s type. Variable types are not automatically generated!
As an example, consider the Saturday Night Live datasets available on the snldb GitHub repo. Data is scraped from http://www.snlarchives.net and http://www.imdb.com/title/tt0072562 by Hendrik Hilleckes and Colin Morris. Notice that there are eleven .csv files available in the output folder.
Specifically, let’s consider the casts.csv file.
Before we get into loading data into a SQL database, let’s look at the casts file in R, so that we understand the data we want to load. glimpse() provides the variable names and the variable types. The variable types are a mix of character strings, numeric, and logical. Variable types are very important for inputting data into a SQL server.
casts <- readr::read_csv("https://raw.githubusercontent.com/hhllcks/snldb/master/output/casts.csv")
glimpse(casts)
Rows: 614
Columns: 8
$ aid <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…
5.1.1 Variable types
The glimpse() function indicates that there are different variable types, here chr (character string), dbl (numerical value with double precision), and lgl (logical value). What are the data types in SQL?1
Numbers

- exact numeric data types (INTEGER, SMALLINT, DECIMAL, and NUMERIC)2
- approximate numeric data types (FLOAT, REAL, and DOUBLE PRECISION)3
- integer types INTEGER (or INT) and SMALLINT, TINYINT, MEDIUMINT, and BIGINT
- DECIMAL(precision, scale). For example, in DECIMAL(5,2), 5 is the precision and 2 is the scale. The precision represents the number of significant digits that are stored for values, and the scale represents the number of digits that can be stored following the decimal point. DECIMAL(5,2) must be able to store any value with five digits and two decimals, so values that can be stored in such a column range from -999.99 to 999.99. DECIMAL(M) is equivalent to DECIMAL(M,0). Similarly, the syntax DECIMAL is equivalent to DECIMAL(M,0), where the default value of M is 10. If the scale is 0, DECIMAL values contain no decimal point or fractional part. (See the short example after this list.)
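To see precision and scale in action, here is a small sketch (the rounding and out-of-range behavior described in the comments reflects DuckDB; other dialects may warn or truncate instead):

-- casting to DECIMAL(5,2) keeps five significant digits, two after the decimal point
SELECT CAST(123.456 AS DECIMAL(5,2)) AS price;    -- 123.46

-- a value that needs more than five total digits does not fit in DECIMAL(5,2);
-- DuckDB raises an out-of-range error here
SELECT CAST(12345.6 AS DECIMAL(5,2)) AS too_big;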
Strings

- string data types are CHAR, VARCHAR, BINARY, VARBINARY, BLOB, TEXT, ENUM, and SET
- CHAR and VARCHAR types are similar, but differ in the way they are stored and retrieved. They also differ in maximum length and in whether trailing spaces are retained.
- CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. For example, CHAR(30) can hold up to 30 characters. (See the sketch after this list.)
- The length of a CHAR column is fixed to the length that you declare when you create the table.
- Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 65,535.
- BINARY and VARBINARY types are similar to CHAR and VARCHAR, except that they store binary strings rather than nonbinary strings. That is, they store byte strings rather than character strings.
- A BLOB is a binary large object that can hold a variable amount of data. The four BLOB types are TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. These differ only in the maximum length of the values they can hold. The four TEXT types are TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT. These correspond to the four BLOB types and have the same maximum lengths and storage requirements.
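As a small sketch of declaring string columns (the table name string_demo is hypothetical; note that DuckDB accepts the declared lengths for compatibility but does not enforce them, whereas MySQL does):

CREATE TABLE string_demo (
  code CHAR(2),               -- fixed-length string, e.g., a two-letter country code
  description VARCHAR(100)    -- variable-length string of up to 100 characters
);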
Date

- date and time data types for representing temporal values are DATE, TIME, DATETIME, TIMESTAMP, and YEAR
- The DATE type is used for values with a date part but no time part. MySQL retrieves and displays DATE values in ‘YYYY-MM-DD’ format. The supported range is ‘1000-01-01’ to ‘9999-12-31’.
- The DATETIME type is used for values that contain both date and time parts. MySQL retrieves and displays DATETIME values in ‘YYYY-MM-DD hh:mm:ss’ format. The supported range is ‘1000-01-01 00:00:00’ to ‘9999-12-31 23:59:59’.
- The TIMESTAMP data type is used for values that contain both date and time parts. TIMESTAMP has a range of ‘1970-01-01 00:00:01’ UTC to ‘2038-01-19 03:14:07’ UTC.
- TIME values are displayed in ‘hh:mm:ss’ format (or ‘hhh:mm:ss’ format for large hours values). TIME values may range from ‘-838:59:59’ to ‘838:59:59’. The hours part may be so large because the TIME type can be used not only to represent a time of day (which must be less than 24 hours), but also elapsed time or a time interval between two events (which may be much greater than 24 hours, or even negative).
- The YEAR type is a 1-byte type used to represent year values with a display width of four characters.
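For example, date and time literals can be written with an explicit type keyword in DuckDB (and in most dialects); a quick sketch:

SELECT DATE '2024-02-29'               AS d,   -- date part only
       TIMESTAMP '2024-02-29 13:45:00' AS ts,  -- date and time parts
       TIME '13:45:00'                 AS t;   -- time of day only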
5.1.2 CHECK constraints
While implementing CREATE TABLE, constraints can be added either to individual variables (CountryPopulation > 0) or to the table as a whole (LastCensus < NextCensus).4
If an attempt is made to load data that violate the CHECK constraints, an error will be given.
CREATE TABLE CountryListCensus (
Id INT,
CountryName VARCHAR(255) NOT NULL,
CountryPopulation INT CHECK(CountryPopulation > 0),
LastCensus DATE,
NextCensus DATE,
CHECK(LastCensus<NextCensus),
PRIMARY KEY (Id)
);
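For example, attempting to insert a row with a negative population violates the first CHECK constraint (the values below are made up for illustration):

INSERT INTO CountryListCensus
VALUES (1, 'Freedonia', -100, DATE '2010-04-01', DATE '2020-04-01');
-- fails with a constraint violation: CHECK(CountryPopulation > 0) is not satisfied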
5.1.3 Creating KEYs
CREATE TABLE table1 (
col1 ...,
col2 ...,
col3 ...,
PRIMARY KEY (col1),
FOREIGN KEY (col2) REFERENCES table2(table2col1)
);

Either or both of the KEYs could be multiple columns.
CREATE TABLE table1 (
col1 ...,
col2 ...,
col3 ...,
PRIMARY KEY (col1, col3),
FOREIGN KEY (col1, col2) REFERENCES table2(table2col1, table2col4)
);
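As a concrete sketch (with hypothetical customers and orders tables), a single-column primary key in one table and a foreign key referencing it from another:

CREATE TABLE customers (
  customer_id INTEGER,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY (customer_id)
);

CREATE TABLE orders (
  order_id INTEGER,
  customer_id INTEGER,
  order_date DATE,
  PRIMARY KEY (order_id),
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);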
5.1.4 Creating INDEXes
Indexes can be created on one or more variables. A table does not need to have an INDEX (or a KEY).

CREATE INDEX name_of_index ON table (col1);
CREATE INDEX name_of_index ON table (col1, col2);

5.1.5 Loading data
Once the database is set up, you will be ready to import .csv files into the database as tables. Importing .csv files as tables requires a series of steps:5
- a USE statement that ensures we are in the right schema/database.
- a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.
- a series of CREATE TABLE statements that specify the table structures.
- a series of COPY statements that read the data from the .csv files into the appropriate tables.
Loading step 1
Use the local database that we’ve called duck_datab.
USE duck_datab;

Loading step 2
Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it will remove the casts table.
DROP TABLE IF EXISTS casts;

Loading step 3
Carefully define the variable types, whether or not they allow missing values, and what a default value is for that variable. Additionally, identify the key for accessing information.
MySQL doesn’t actually have a BOOLEAN datatype (you would use TINYINT(1) instead). But DuckDB does have a BOOLEAN datatype!
CREATE TABLE casts (
aid VARCHAR(255) NOT NULL DEFAULT ' ',
sid INTEGER NOT NULL DEFAULT 0,
featured BOOLEAN NOT NULL DEFAULT 'false',
first_epid INTEGER DEFAULT 0,
last_epid INTEGER DEFAULT 0,
update_anchor BOOLEAN NOT NULL DEFAULT 0,
n_episodes INTEGER NOT NULL DEFAULT 0,
season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
PRIMARY KEY (sid, aid)
);

Loading step 4
The .csv file lives on my computer, so I load it in directly. Note that the statement to load in data is slightly different in MySQL.
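For comparison, a rough sketch of how the same load might look in MySQL using LOAD DATA (assuming the same file path and table; options such as LOCAL and the field terminator depend on the server configuration):

LOAD DATA LOCAL INFILE 'data/casts.csv'
INTO TABLE casts
FIELDS TERMINATED BY ','
IGNORE 1 LINES;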
COPY casts FROM 'data/casts.csv' HEADER;

Checking the loading
SELECT * FROM casts LIMIT 8;

| aid | sid | featured | first_epid | last_epid | update_anchor | n_episodes | season_fraction |
|---|---|---|---|---|---|---|---|
| A. Whitney Brown | 11 | TRUE | 19860222 | | FALSE | 8 | 0.444 |
| A. Whitney Brown | 12 | TRUE | | | FALSE | 20 | 1.000 |
| A. Whitney Brown | 13 | TRUE | | | FALSE | 13 | 1.000 |
| A. Whitney Brown | 14 | TRUE | | | FALSE | 20 | 1.000 |
| A. Whitney Brown | 15 | TRUE | | | FALSE | 20 | 1.000 |
| A. Whitney Brown | 16 | TRUE | | | FALSE | 20 | 1.000 |
| Alan Zweibel | 5 | TRUE | 19800409 | | FALSE | 5 | 0.250 |
| Sasheer Zamata | 39 | TRUE | 20140118 | | FALSE | 11 | 0.524 |
Check
Let’s make sure that the database exists and that the table in the database exists.
SHOW DATABASES;

| database_name |
|---|
| duck_datab |
SHOW TABLES;

| name |
|---|
| casts |
DESCRIBE casts;

| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| aid | VARCHAR | NO | PRI | ' ' | |
| sid | INTEGER | NO | PRI | 0 | |
| featured | BOOLEAN | NO | | 'false' | |
| first_epid | INTEGER | YES | | 0 | |
| last_epid | INTEGER | YES | | 0 | |
| update_anchor | BOOLEAN | NO | | 0 | |
| n_episodes | INTEGER | NO | | 0 | |
| season_fraction | DECIMAL(21,20) | NO | | 0 | |
5.1.6 Using DuckDB for loading data
The steps given in Section 5.1.5 are general to many SQL dialects and are important when working with most SQL clients. It is important to have control over the variable configurations that make up the SQL database. However, the duckdb package in R allows for shorthand entry of data from .csv files into the DuckDB database. Here, we take advantage of working with the DuckDB functionality in R.
duckdb_read_csv(con = con_duckdb, name = "hosts", files = "data/hosts.csv")
duckdb_read_csv(con = con_duckdb, name = "episodes", files = "data/episodes.csv")

Checking the loading
SHOW TABLES;

| name |
|---|
| casts |
| episodes |
| hosts |
5.2 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_duckdb, shutdown = TRUE)

5.3 Reflection questions
- What is the difference between R and SQL in terms of communicating the different data types?
- Why does it matter if the variable type is specified correctly? For example, why would it be better for a date column to be specified as DATETIME instead of VARCHAR?
- If you are Googling some SQL syntax, why do you need to specify the dialect, for example, DuckDB or MySQL?
- Why do we often include the DROP TABLE operation before the CREATE TABLE operation?
5.4 Ethics considerations
- When creating a database, why should we worry about the provenance (origin) of the data?
- How does a SQL database automatically hold extra information about the database (e.g., provenance)? Spoiler: it doesn’t. So what can we do?
1. Taken from https://dev.mysql.com/doc/refman/8.0/en/data-types.html↩︎
2. The keyword INT is a synonym for INTEGER, and the keywords DEC and FIXED are synonyms for DECIMAL.↩︎
3. MySQL treats DOUBLE as a synonym for DOUBLE PRECISION. MySQL also treats REAL as a synonym for DOUBLE PRECISION.↩︎
4. Example from https://www.sqlshack.com/how-to-use-sql-check-constraints/↩︎