install.packages("duckdb") # only once, in the Console, not in the .qmd or .Rmd file
library(duckdb) # at the top of the .qmd or .Rmd file
library(DBI) # we also still need the DBI package
5 Creating databases
In order to practice creating and editing databases, you will need to install DuckDB on your own computer. Fortunately, you can do the installation using R! DuckDB is an in-process database management system that runs entirely on your own computer. Setting up a database on your own computer is a great way to work with large datasets when you are the only analyst. The data then lives on disk (instead of in memory), and you don’t have to transfer queries or results over the internet.
Using DuckDB suits our purposes because it allows us to create a local database that we can edit. However, the SQL dialect used in DuckDB is slightly different from MySQL, which is important to recognize. For example, in DuckDB we write SELECT * FROM table LIMIT 10; whereas MySQL also accepts SELECT * FROM table LIMIT 0, 10;. In your future as a data scientist, you will encounter different dialects, depending on the SQL server. Always be aware of the dialect you are using.
In order to create a database (on our own computer), we’ll start by creating a connection to DuckDB. Note that the database is stored in a database directory called duck_datab, which lives in the current R project. You won’t be able to open it like a standard folder, but it is where DuckDB stores the database files.
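The connection call itself is brief; a minimal sketch using the DBI and duckdb packages loaded above might look like the following (con_duckdb is the connection object used throughout the rest of the chapter):

# connect to (or create, if it does not yet exist) a local DuckDB database
# stored in the duck_datab directory of the current R project
con_duckdb <- dbConnect(duckdb(), dbdir = "duck_datab")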
5.1 Preparing to load data
The duckdb database is currently empty, so we need to load in some data. The duckdb_read_csv() function in the duckdb R package allows us to load a .csv file (available on GitHub) directly into the database without first loading it as an R object. The duckdb_read_csv() function does some of the work of determining data types for us. However, we will first learn what data types are and how to use them, since most dialects of SQL require you to specify the data types before loading in data (usually done using LOAD DATA in MySQL, but using COPY in DuckDB).
Recall that in Table 3.2 we used DESCRIBE to display the variable types of the database table(s). The list includes the variable name (Field), its Type, whether NULL values are allowed, and whether there are keys or indexes defined on the variable. See Table 5.2 for the DESCRIBE output on the table we are about to import.
Unlike R, when creating a new data table, SQL requires that you specify each future variable (column) and that variable’s type. Variable types are not automatically inferred!
As an example, consider the Saturday Night Live datasets available on the snldb GitHub repo. Data is scraped from http://www.snlarchives.net and http://www.imdb.com/title/tt0072562 by Hendrik Hilleckes and Colin Morris. Notice that there are eleven .csv files available in the output folder.
Specifically, let’s consider the casts.csv file.
Before we get into loading data into a SQL database, let’s look at the casts file in R, so that we understand the data we want to load. glimpse() provides the variable names and the variable types. The variable types are a mix of character strings, numeric, and logical. Variable types are very important for inputting data into a SQL server.
casts <- readr::read_csv("https://raw.githubusercontent.com/hhllcks/snldb/master/output/casts.csv")
glimpse(casts)
Rows: 614
Columns: 8
$ aid <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…
5.1.1 Variable types
The glimpse() function indicates that there are different variable types, here chr (character string), dbl (numerical value with double precision), and lgl (logical value). What are the data types in SQL?1
Numbers

- exact numeric data types (INTEGER, SMALLINT, DECIMAL, and NUMERIC)2
- approximate numeric data types (FLOAT, REAL, and DOUBLE PRECISION)3
- integer types INTEGER (or INT) and SMALLINT, TINYINT, MEDIUMINT, and BIGINT
- DECIMAL(precision, scale). For example, in DECIMAL(5,2), 5 is the precision and 2 is the scale. The precision represents the number of significant digits that are stored for values, and the scale represents the number of digits that can be stored following the decimal point. DECIMAL(5,2) can therefore store any value with up to five digits, two of them after the decimal point, so the stored values range from -999.99 to 999.99. DECIMAL(M) is equivalent to DECIMAL(M,0). Similarly, the syntax DECIMAL is equivalent to DECIMAL(M,0), where the default value of M is 10. If the scale is 0, DECIMAL values contain no decimal point or fractional part. (A short illustration of precision and scale follows this list.)
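To make precision and scale concrete, here is a hedged sketch in DuckDB SQL; the decimal_demo table, its salary column, and the values are invented for this illustration:

CREATE TABLE decimal_demo (salary DECIMAL(5,2));
-- both of these values fit: at most five digits, two of them after the decimal point
INSERT INTO decimal_demo VALUES (999.99), (-999.99);
-- a value such as 1000.00 needs six digits and would be rejected as out of range
SELECT * FROM decimal_demo;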
Strings

- string data types are CHAR, VARCHAR, BINARY, VARBINARY, BLOB, TEXT, ENUM, and SET
- CHAR and VARCHAR types are similar, but differ in the way they are stored and retrieved. They also differ in maximum length and in whether trailing spaces are retained.
- CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. For example, CHAR(30) can hold up to 30 characters.
- The length of a CHAR column is fixed to the length that you declare when you create the table.
- Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 65,535.
- BINARY and VARBINARY types are similar to CHAR and VARCHAR, except that they store binary strings rather than nonbinary strings. That is, they store byte strings rather than character strings.
- A BLOB is a binary large object that can hold a variable amount of data. The four BLOB types are TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. These differ only in the maximum length of the values they can hold. The four TEXT types are TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT. These correspond to the four BLOB types and have the same maximum lengths and storage requirements.
Date

- date and time data types for representing temporal values are DATE, TIME, DATETIME, TIMESTAMP, and YEAR (a short illustration of these literal formats follows this list)
- The DATE type is used for values with a date part but no time part. MySQL retrieves and displays DATE values in ‘YYYY-MM-DD’ format. The supported range is ‘1000-01-01’ to ‘9999-12-31’.
- The DATETIME type is used for values that contain both date and time parts. MySQL retrieves and displays DATETIME values in ‘YYYY-MM-DD hh:mm:ss’ format. The supported range is ‘1000-01-01 00:00:00’ to ‘9999-12-31 23:59:59’.
- The TIMESTAMP data type is used for values that contain both date and time parts. TIMESTAMP has a range of ‘1970-01-01 00:00:01’ UTC to ‘2038-01-19 03:14:07’ UTC.
- TIME values are retrieved and displayed in ‘hh:mm:ss’ format (or ‘hhh:mm:ss’ format for large hours values). TIME values may range from ‘-838:59:59’ to ‘838:59:59’. The hours part may be so large because the TIME type can be used not only to represent a time of day (which must be less than 24 hours), but also elapsed time or a time interval between two events (which may be much greater than 24 hours, or even negative).
- The YEAR type is a 1-byte type used to represent year values with a display width of four characters.
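As a small, hedged illustration of date and time literals in DuckDB (the particular date and times are arbitrary):

SELECT DATE '1975-10-11' AS a_date,
       TIMESTAMP '1975-10-11 23:30:00' AS a_datetime,
       TIME '23:30:00' AS a_time;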
5.1.2 CHECK constraints
While implementing CREATE TABLE, constraints can be added either to individual variables (CountryPopulation > 0) or to the table as a whole (LastCensus < NextCensus).4
If an attempt is made to load data that violate the CHECK constraints, an error will be given (a sketch of a violating insert follows the table definition below).
CREATE TABLE CountryListCensus (
  Id INT,
  CountryName VARCHAR(255) NOT NULL,
  CountryPopulation INT CHECK(CountryPopulation > 0),
  LastCensus DATE,
  NextCensus DATE,
  CHECK(LastCensus < NextCensus),
  PRIMARY KEY (Id)
);
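As a hedged sketch of the error behavior (the country names, populations, and census dates are invented), the first insert below satisfies both CHECK constraints, while the second violates CHECK(CountryPopulation > 0) and would be rejected with a constraint error:

INSERT INTO CountryListCensus VALUES (1, 'Freedonia', 532000, '2010-04-01', '2020-04-01');
-- INSERT INTO CountryListCensus VALUES (2, 'Sylvania', -5, '2011-06-15', '2021-06-15');  -- fails the CHECK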
5.1.3 Creating KEYs
CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1),
  FOREIGN KEY (col2) REFERENCES table2(table2col1)
);
Either or both of the KEYs could be multiple columns.
CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1, col3),
  FOREIGN KEY (col1, col2) REFERENCES table2(table2col1, table2col4)
);
5.1.4 Creating INDEXes
Indexes can be created on one or more variables. A table does not need to have an INDEX (or a KEY). A concrete example follows the generic statements below.
CREATE INDEX name_of_index ON table (col1);
CREATE INDEX name_of_index ON table (col1, col2);
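As a concrete (hypothetical) example, once the casts table defined in the next section exists, an index on the cast member name could be created as follows; the index name casts_aid_idx is made up for illustration:

CREATE INDEX casts_aid_idx ON casts (aid);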
5.1.5 Loading data
Once the database is set up, you will be ready to import .csv files into the database as tables. Importing .csv files as tables requires a series of steps:5
- a USE statement that ensures we are in the right schema/database.
- a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.
- a series of CREATE TABLE statements that specify the table structures.
- a series of COPY statements that read the data from the .csv files into the appropriate tables.
Loading step 1
Use the local database that we’ve called duck_datab.
USE duck_datab;
Loading step 2
Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it will remove the casts table.
DROP TABLE IF EXISTS casts;
Loading step 3
Carefully define the variable types, whether or not they allow missing values, and the default value for each variable. Additionally, identify the key for accessing information.
MySQL doesn’t actually have a BOOLEAN datatype (you would use TINYINT(1) instead). But DuckDB does have a BOOLEAN datatype!
CREATE TABLE casts (
  aid VARCHAR(255) NOT NULL DEFAULT ' ',
  sid INTEGER NOT NULL DEFAULT 0,
  featured BOOLEAN NOT NULL DEFAULT 'false',
  first_epid INTEGER DEFAULT 0,
  last_epid INTEGER DEFAULT 0,
  update_anchor BOOLEAN NOT NULL DEFAULT 0,
  n_episodes INTEGER NOT NULL DEFAULT 0,
  season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (sid, aid)
);
Loading step 4
The .csv file lives on my computer, so I load it in directly. Note that the statement to load in data is slightly different in MySQL.
COPY casts FROM 'data/casts.csv' HEADER;
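For comparison, a rough sketch of the corresponding load in MySQL (run against a MySQL server, and assuming the local_infile option is enabled) might use LOAD DATA rather than COPY:

LOAD DATA LOCAL INFILE 'data/casts.csv'
INTO TABLE casts
FIELDS TERMINATED BY ','
IGNORE 1 LINES;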
Checking the loading
SELECT * FROM casts LIMIT 8;
aid | sid | featured | first_epid | last_epid | update_anchor | n_episodes | season_fraction |
---|---|---|---|---|---|---|---|
A. Whitney Brown | 11 | TRUE | 19860222 | | FALSE | 8 | 0.444 |
A. Whitney Brown | 12 | TRUE | | | FALSE | 20 | 1.000 |
A. Whitney Brown | 13 | TRUE | | | FALSE | 13 | 1.000 |
A. Whitney Brown | 14 | TRUE | | | FALSE | 20 | 1.000 |
A. Whitney Brown | 15 | TRUE | | | FALSE | 20 | 1.000 |
A. Whitney Brown | 16 | TRUE | | | FALSE | 20 | 1.000 |
Alan Zweibel | 5 | TRUE | 19800409 | | FALSE | 5 | 0.250 |
Sasheer Zamata | 39 | TRUE | 20140118 | | FALSE | 11 | 0.524 |
Check
Let’s make sure that the database exists and that the table in the database exists.
SHOW DATABASES;
database_name |
---|
duck_datab |
SHOW TABLES;
name |
---|
casts |
DESCRIBE casts;
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
aid | VARCHAR | NO | PRI | ' ' | |
sid | INTEGER | NO | PRI | 0 | |
featured | BOOLEAN | NO | | 'false' | |
first_epid | INTEGER | YES | | 0 | |
last_epid | INTEGER | YES | | 0 | |
update_anchor | BOOLEAN | NO | | 0 | |
n_episodes | INTEGER | NO | | 0 | |
season_fraction | DECIMAL(21,20) | NO | | 0 | |
5.1.6 Using DuckDB for loading data
The steps given in Section 5.1.5 are general to many SQL dialects and are important when working with most SQL clients. It is important to have control over the variable configurations that make up the SQL database. However, the duckdb package in R allows for shorthand loading of .csv files into the DuckDB database. Here, we take advantage of working with the DuckDB functionality in R.
duckdb_read_csv(con = con_duckdb, name = "hosts", files = "data/hosts.csv")
duckdb_read_csv(con = con_duckdb, name = "episodes", files = "data/episodes.csv")
Checking the loading
SHOW TABLES;
name |
---|
casts |
episodes |
hosts |
5.2 Best practice
It is always a good idea to terminate the SQL connection when you are done with it.
dbDisconnect(con_duckdb, shutdown = TRUE)
5.3 Reflection questions
What is the difference between R and SQL in terms of communicating the different data types?

Why does it matter if the variable type is specified correctly? For example, why would it be better for a date column to be specified as DATETIME instead of VARCHAR?

If you are Googling some SQL syntax, why do you need to specify the dialect, for example, DuckDB or MySQL?

Why do we often include the DROP TABLE operation before the CREATE TABLE operation?
5.4 Ethics considerations
When creating a database why should we worry about the provenance (origin) of the data?
How does a SQL database automatically hold extra information about the database (e.g., provenance)? Spoiler: it doesn’t. So what can we do?
Taken from https://dev.mysql.com/doc/refman/8.0/en/data-types.html↩︎

The keyword INT is a synonym for INTEGER, and the keywords DEC and FIXED are synonyms for DECIMAL.↩︎

MySQL treats DOUBLE as a synonym for DOUBLE PRECISION. MySQL also treats REAL as a synonym for DOUBLE PRECISION.↩︎

Example from https://www.sqlshack.com/how-to-use-sql-check-constraints/↩︎