install.packages("duckdb") # only once, in the Console, not in the .qmd or .Rmd file
library(duckdb) # at the top of the .qmd or .Rmd file
library(DBI) # we also still need the DBI package
5 Creating databases
In order to practice creating and editing databases, you will need to install DuckDB on your own computer. Fortunately, you can do the installation using R! DuckDB is an in-process database management system that runs entirely on your own computer. Setting up a database on your own computer is a great way to work with large datasets when you are the only analyst. The data then lives on disk (instead of in memory), and you don’t have to transfer queries or results over the internet.
Using DuckDB suits our purposes because it allows us to create a local database that we can edit. However, the SQL dialect used in DuckDB is slightly different from MySQL, which is important to recognize. For example, in DuckDB we write SELECT * FROM table LIMIT 10; whereas MySQL also accepts SELECT * FROM table LIMIT 0, 10;. In your future as a data scientist, you will encounter different dialects, depending on the SQL server. Always be aware of the dialect you are using.
In order to create a database (on our own computer), we’ll start by creating a connection to DuckDB. Note that the database is stored in a database directory called duck_datab, which lives in the current R project. You won’t be able to open it like a standard folder, but it is where DuckDB stores the database files.
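The connection call itself is brief; a minimal sketch using the DBI and duckdb packages loaded above might look like the following (con_duckdb is the connection object used throughout the rest of the chapter):

# connect to (or create, if it does not yet exist) a local DuckDB database
# stored in the duck_datab directory of the current R project
con_duckdb <- dbConnect(duckdb(), dbdir = "duck_datab")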
5.1 Preparing to load data
The duckdb database is currently empty, so we need to load in some data. The duckdb_read_csv() function in the duckdb R package allows us to load a .csv file (available on GitHub) directly into the database without first loading it as an R object. The duckdb_read_csv() function does some of the work of determining data types for us. However, we will first learn what data types are and how to use them, since most dialects of SQL require you to specify the data types before loading in data (usually done using LOAD DATA in MySQL, but using COPY in DuckDB).
Recall that in Table 3.2 we used DESCRIBE to display the variable types of the database table(s). The list includes the variable name (Field), its Type, whether NULL values are allowed, and whether there are keys or indexes defined on the variable. See Table 5.2 for the DESCRIBE output on the table we are about to import.
Unlike R, when creating a new data table, SQL requires that you specify each future variable (column) and that variable’s type. Variable types are not automatically inferred!
As an example, consider the Saturday Night Live datasets available on the snldb GitHub repo. Data is scraped from http://www.snlarchives.net and http://www.imdb.com/title/tt0072562 by Hendrik Hilleckes and Colin Morris. Notice that there are eleven .csv files available in the output folder.
Specifically, let’s consider the casts.csv file.
Before we get into loading data into a SQL database, let’s look at the casts file in R, so that we understand the data we want to load. glimpse() provides the variable names and the variable types. The variable types are a mix of character strings, numeric, and logical. Variable types are very important for inputting data into a SQL server.
casts <- readr::read_csv("https://raw.githubusercontent.com/hhllcks/snldb/master/output/casts.csv")
glimpse(casts)
Rows: 614
Columns: 8
$ aid <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…
5.1.1 Variable types
The glimpse() function indicates that there are different variable types, here chr (character string), dbl (numerical value with double precision), and lgl (logical value). What are the data types in SQL?1
Numbers

- exact numeric data types (INTEGER, SMALLINT, DECIMAL, and NUMERIC)2
- approximate numeric data types (FLOAT, REAL, and DOUBLE PRECISION)3
- integer types INTEGER (or INT) and SMALLINT, TINYINT, MEDIUMINT, and BIGINT
- DECIMAL(precision, scale). For example, in DECIMAL(5,2), 5 is the precision and 2 is the scale. The precision represents the number of significant digits that are stored for values, and the scale represents the number of digits that can be stored following the decimal point. DECIMAL(5,2) can therefore store any value with up to five digits, two of them after the decimal point, so the stored values range from -999.99 to 999.99. DECIMAL(M) is equivalent to DECIMAL(M,0). Similarly, the syntax DECIMAL is equivalent to DECIMAL(M,0), where the default value of M is 10. If the scale is 0, DECIMAL values contain no decimal point or fractional part. (A short illustration of precision and scale follows this list.)
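To make precision and scale concrete, here is a hedged sketch in DuckDB SQL; the decimal_demo table, its salary column, and the values are invented for this illustration:

CREATE TABLE decimal_demo (salary DECIMAL(5,2));
-- both of these values fit: at most five digits, two of them after the decimal point
INSERT INTO decimal_demo VALUES (999.99), (-999.99);
-- a value such as 1000.00 needs six digits and would be rejected as out of range
SELECT * FROM decimal_demo;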
Strings

- string data types are CHAR, VARCHAR, BINARY, VARBINARY, BLOB, TEXT, ENUM, and SET
- CHAR and VARCHAR types are similar, but differ in the way they are stored and retrieved. They also differ in maximum length and in whether trailing spaces are retained.
- CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. For example, CHAR(30) can hold up to 30 characters.
- The length of a CHAR column is fixed to the length that you declare when you create the table.
- Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 65,535.
- BINARY and VARBINARY types are similar to CHAR and VARCHAR, except that they store binary strings rather than nonbinary strings. That is, they store byte strings rather than character strings.
- A BLOB is a binary large object that can hold a variable amount of data. The four BLOB types are TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. These differ only in the maximum length of the values they can hold. The four TEXT types are TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT. These correspond to the four BLOB types and have the same maximum lengths and storage requirements.
Date

- date and time data types for representing temporal values are DATE, TIME, DATETIME, TIMESTAMP, and YEAR (a short illustration of these literal formats follows this list)
- The DATE type is used for values with a date part but no time part. MySQL retrieves and displays DATE values in ‘YYYY-MM-DD’ format. The supported range is ‘1000-01-01’ to ‘9999-12-31’.
- The DATETIME type is used for values that contain both date and time parts. MySQL retrieves and displays DATETIME values in ‘YYYY-MM-DD hh:mm:ss’ format. The supported range is ‘1000-01-01 00:00:00’ to ‘9999-12-31 23:59:59’.
- The TIMESTAMP data type is used for values that contain both date and time parts. TIMESTAMP has a range of ‘1970-01-01 00:00:01’ UTC to ‘2038-01-19 03:14:07’ UTC.
- TIME values are retrieved and displayed in ‘hh:mm:ss’ format (or ‘hhh:mm:ss’ format for large hours values). TIME values may range from ‘-838:59:59’ to ‘838:59:59’. The hours part may be so large because the TIME type can be used not only to represent a time of day (which must be less than 24 hours), but also elapsed time or a time interval between two events (which may be much greater than 24 hours, or even negative).
- The YEAR type is a 1-byte type used to represent year values with a display width of four characters.
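As a small, hedged illustration of date and time literals in DuckDB (the particular date and times are arbitrary):

SELECT DATE '1975-10-11' AS a_date,
       TIMESTAMP '1975-10-11 23:30:00' AS a_datetime,
       TIME '23:30:00' AS a_time;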
5.1.2 CHECK constraints
While implementing CREATE TABLE, constraints can be added either to individual variables (CountryPopulation > 0) or to the table as a whole (LastCensus < NextCensus).4
If an attempt is made to load data that violate the CHECK constraints, an error will be given (a sketch of a violating insert follows the table definition below).
CREATE TABLE CountryListCensus (
  Id INT,
  CountryName VARCHAR(255) NOT NULL,
  CountryPopulation INT CHECK(CountryPopulation > 0),
  LastCensus DATE,
  NextCensus DATE,
  CHECK(LastCensus < NextCensus),
  PRIMARY KEY (Id)
);
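As a hedged sketch of the error behavior (the country names, populations, and census dates are invented), the first insert below satisfies both CHECK constraints, while the second violates CHECK(CountryPopulation > 0) and would be rejected with a constraint error:

INSERT INTO CountryListCensus VALUES (1, 'Freedonia', 532000, '2010-04-01', '2020-04-01');
-- INSERT INTO CountryListCensus VALUES (2, 'Sylvania', -5, '2011-06-15', '2021-06-15');  -- fails the CHECK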
5.1.3 Creating KEYs
CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1),
  FOREIGN KEY (col2) REFERENCES table2(table2col1)
);
Either or both of the KEYs could be multiple columns.
CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1, col3),
  FOREIGN KEY (col1, col2) REFERENCES table2(table2col1, table2col4)
);
5.1.4 Creating INDEXes
Indexes can be created on one or more variables. A table does not need to have an INDEX (or a KEY). A concrete example follows the generic statements below.
CREATE INDEX name_of_index ON table (col1);
CREATE INDEX name_of_index ON table (col1, col2);
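As a concrete (hypothetical) example, once the casts table defined in the next section exists, an index on the cast member name could be created as follows; the index name casts_aid_idx is made up for illustration:

CREATE INDEX casts_aid_idx ON casts (aid);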
5.1.5 Loading data
Once the database is set up, you will be ready to import .csv files into the database as tables. Importing .csv files as tables requires a series of steps:5
- a USE statement that ensures we are in the right schema/database.
- a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.
- a series of CREATE TABLE statements that specify the table structures.
- a series of COPY statements that read the data from the .csv files into the appropriate tables.
Loading step 1
Use the local database that we’ve called duck_datab.
USE duck_datab;
Loading step 2
Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it will remove the casts table.
DROP TABLE IF EXISTS casts;
Loading step 3
Carefully define the variable types, whether or not they allow missing values, and the default value for each variable. Additionally, identify the key for accessing information.
MySQL doesn’t actually have a BOOLEAN datatype (you would use TINYINT(1) instead). But DuckDB does have a BOOLEAN datatype!
CREATE TABLE casts (
  aid VARCHAR(255) NOT NULL DEFAULT ' ',
  sid INTEGER NOT NULL DEFAULT 0,
  featured BOOLEAN NOT NULL DEFAULT 'false',
  first_epid INTEGER DEFAULT 0,
  last_epid INTEGER DEFAULT 0,
  update_anchor BOOLEAN NOT NULL DEFAULT 0,
  n_episodes INTEGER NOT NULL DEFAULT 0,
  season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (sid, aid)
);
Loading step 4
The .csv file lives on my computer, so I load it in directly. Note that the statement to load in data is slightly different in MySQL.
COPY casts FROM 'data/casts.csv' HEADER;
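For comparison, a rough sketch of the corresponding load in MySQL (run against a MySQL server, and assuming the local_infile option is enabled) might use LOAD DATA rather than COPY:

LOAD DATA LOCAL INFILE 'data/casts.csv'
INTO TABLE casts
FIELDS TERMINATED BY ','
IGNORE 1 LINES;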
Checking the loading
SELECT * FROM casts LIMIT 8;
aid | sid | featured | first_epid | last_epid | update_anchor | n_episodes | season_fraction |
---|---|---|---|---|---|---|---|
A. Whitney Brown | 11 | TRUE | 19860222 | | FALSE | 8 | 0.444 |
A. Whitney Brown | 12 | TRUE | | | FALSE | 20 | 1.000 |
A. Whitney Brown | 13 | TRUE | | | FALSE | 13 | 1.000 |
A. Whitney Brown | 14 | TRUE | | | FALSE | 20 | 1.000 |
A. Whitney Brown | 15 | TRUE | | | FALSE | 20 | 1.000 |
A. Whitney Brown | 16 | TRUE | | | FALSE | 20 | 1.000 |
Alan Zweibel | 5 | TRUE | 19800409 | | FALSE | 5 | 0.250 |
Sasheer Zamata | 39 | TRUE | 20140118 | | FALSE | 11 | 0.524 |
Check
Let’s make sure that the database exists and that the table in the database exists.
SHOW DATABASES;
database_name |
---|
duck_datab |
SHOW TABLES;
name |
---|
casts |
DESCRIBE casts;
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
aid | VARCHAR | NO | PRI | ' ' | |
sid | INTEGER | NO | PRI | 0 | |
featured | BOOLEAN | NO | | 'false' | |
first_epid | INTEGER | YES | | 0 | |
last_epid | INTEGER | YES | | 0 | |
update_anchor | BOOLEAN | NO | | 0 | |
n_episodes | INTEGER | NO | | 0 | |
season_fraction | DECIMAL(21,20) | NO | | 0 | |
5.1.6 Using DuckDB for loading data
The steps given in Section 5.1.5 are general to many SQL dialects and are important when working with most SQL clients. It is important to have control over the variable configurations that make up the SQL database. However, the duckdb package in R allows for shorthand loading of .csv files into the DuckDB database. Here, we take advantage of working with the DuckDB functionality in R.
duckdb_read_csv(con = con_duckdb, name = "hosts", files = "data/hosts.csv")
duckdb_read_csv(con = con_duckdb, name = "episodes", files = "data/episodes.csv")
Checking the loading
SHOW TABLES;
name |
---|
casts |
episodes |
hosts |
5.2 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_duckdb, shutdown = TRUE)
5.3 Reflection questions
What is the difference between R and SQL in terms of communicating the different data types?

Why does it matter if the variable type is specified correctly? For example, why would it be better for a date column to be specified as DATETIME instead of VARCHAR?

If you are Googling some SQL syntax, why do you need to specify the dialect, for example, DuckDB or MySQL?

Why do we often include the DROP TABLE operation before the CREATE TABLE operation?
5.4 Ethics considerations
When creating a database why should we worry about the provenance (origin) of the data?
How does a SQL database automatically hold extra information about the database (e.g., provenance)? Spoiler: it doesn’t. So what can we do?
Taken from https://dev.mysql.com/doc/refman/8.0/en/data-types.html↩︎

The keyword INT is a synonym for INTEGER, and the keywords DEC and FIXED are synonyms for DECIMAL.↩︎

MySQL treats DOUBLE as a synonym for DOUBLE PRECISION. MySQL also treats REAL as a synonym for DOUBLE PRECISION.↩︎

Example from https://www.sqlshack.com/how-to-use-sql-check-constraints/↩︎