5  Creating databases

In order to practice creating and editing databases, you will need to install DuckDB onto your own computer. Fortunately, you can do the installation using R! DuckDB is an in-process database management system that runs entirely on your own computer. Setting up a database on your own computer is a great way to work with large datasets where you are the only analyst. The data then lives on disk (instead of in memory), and you don’t have to transfer queries or results over the internet.

Using DuckDB suits our purposes because it allows us to create a local database that we can edit. However, the SQL dialect used in DuckDB differs slightly from MySQL, which is important to recognize. For example, in DuckDB we write SELECT * FROM table LIMIT 10 OFFSET 0; instead of MySQL's SELECT * FROM table LIMIT 0, 10;. In your future as a data scientist, you will encounter different dialects, depending on the SQL server. Always be aware of the dialect you are using.

install.packages("duckdb")  # only once, in the Console, not in the .qmd or .Rmd file
library(duckdb)             # at the top of the .qmd or .Rmd file

library(DBI)                # we also still need the DBI package

In order to create a database (on our own computer), we’ll start by creating a connection to DuckDB. Note that the database will be stored in a database file called duck_datab which lives in the current R project. You won’t be able to open it like a standard data file, but it is where DuckDB stores the database.

con_duckdb <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = "duck_datab")

5.1 Preparing to load data

The duckdb database is currently empty, so we need to load in some data. The duckdb_read_csv() function in the duckdb R package allows us to load a .csv file (available on GitHub) directly into the database without first loading it as an R object. duckdb_read_csv() does some of the work of inferring data types for us. However, we will first learn what data types are and how to use them, because most dialects of SQL require you to specify the data types before loading in data (usually done with LOAD DATA in MySQL, but with COPY in DuckDB).

Recall that in Table 3.2 we used DESCRIBE to display the variable types of the database table(s). The list includes the variable name (Field), its Type, whether there are NULL values allowed, and whether there are keys or indexes defined on the variable. See Table 5.2 for the DESCRIBE output on the table we are about to import.

Unlike R, when creating a new data table, SQL requires that you communicate each future variable (column) and that variable’s type. Variable types are not automatically generated!

As an example, consider the Saturday Night Live datasets available on the snldb GitHub repo. Data is scraped from http://www.snlarchives.net and http://www.imdb.com/title/tt0072562 by Hendrik Hilleckes and Colin Morris. Notice that there are eleven .csv files available in the output folder.

Specifically, let’s consider the casts.csv file.

Before we get into loading data into a SQL database, let’s look at the casts file in R, so that we understand the data we want to load. glimpse() provides the variable names and the variable types. The variable types here are a mix of character strings, numeric values, and logical values. Variable types are very important for loading data into a SQL server.

casts <- readr::read_csv("https://raw.githubusercontent.com/hhllcks/snldb/master/output/casts.csv")
glimpse(casts)
Rows: 614
Columns: 8
$ aid             <chr> "A. Whitney Brown", "A. Whitney Brown", "A. Whitney Br…
$ sid             <dbl> 11, 12, 13, 14, 15, 16, 5, 39, 40, 41, 42, 45, 46, 21,…
$ featured        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ first_epid      <dbl> 19860222, NA, NA, NA, NA, NA, 19800409, 20140118, NA, …
$ last_epid       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ update_anchor   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ n_episodes      <dbl> 8, 20, 13, 20, 20, 20, 5, 11, 21, 21, 21, 18, 17, 20, …
$ season_fraction <dbl> 0.444, 1.000, 1.000, 1.000, 1.000, 1.000, 0.250, 0.524…

5.1.1 Variable types

The glimpse() function indicates that there are different variable types, here chr (character string), dbl (numerical value with double precision), and lgl (logical value). What are the data types in SQL?1

Numbers

  • exact numeric data types (INTEGER, SMALLINT, DECIMAL, and NUMERIC)2
  • approximate numeric data types (FLOAT, REAL, and DOUBLE PRECISION)3
  • integer types are INTEGER (or INT), SMALLINT, TINYINT, MEDIUMINT, and BIGINT
  • DECIMAL(precision, scale). For example, in DECIMAL(5,2), 5 is the precision and 2 is the scale. The precision represents the number of significant digits that are stored for values, and the scale represents the number of digits that can be stored following the decimal point. A DECIMAL(5,2) column can store any value with five digits and two decimals, so its values range from -999.99 to 999.99. DECIMAL(M) is equivalent to DECIMAL(M,0), and DECIMAL alone is equivalent to DECIMAL(10,0). If the scale is 0, DECIMAL values contain no decimal point or fractional part.
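As a quick illustration of DECIMAL behavior, consider a hypothetical table with a single DECIMAL(5,2) column (the table and column names here are made up):

```sql
CREATE TABLE decimal_demo (amount DECIMAL(5,2));

INSERT INTO decimal_demo VALUES (123.45);   -- fits: 5 total digits, 2 after the point
INSERT INTO decimal_demo VALUES (999.99);   -- the largest value that fits
INSERT INTO decimal_demo VALUES (1234.56);  -- too many digits before the decimal point
```

How the out-of-range value is handled (an error versus truncation) depends on the dialect and its settings; in strict modes the insert is rejected.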

Strings

  • string data types are CHAR, VARCHAR, BINARY, VARBINARY, BLOB, TEXT, ENUM, and SET
  • CHAR and VARCHAR types are similar, but differ in the way they are stored and retrieved. They also differ in maximum length and in whether trailing spaces are retained.
  • CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. For example, CHAR(30) can hold up to 30 characters.
  • The length of a CHAR column is fixed to the length that you declare when you create the table.
  • Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 65,535.
  • BINARY and VARBINARY types are similar to CHAR and VARCHAR, except that they store binary strings rather than nonbinary strings. That is, they store byte strings rather than character strings.
  • A BLOB is a binary large object that can hold a variable amount of data. The four BLOB types are TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. These differ only in the maximum length of the values they can hold. The four TEXT types are TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT. These correspond to the four BLOB types and have the same maximum lengths and storage requirements.
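To see the CHAR versus VARCHAR distinction in a table definition, here is a small sketch (the table and column names are hypothetical):

```sql
CREATE TABLE mailing_list (
  state_code CHAR(2),      -- fixed length: always stored as exactly 2 characters
  email      VARCHAR(100)  -- variable length: up to 100 characters
);
```

A fixed-width CHAR suits codes that always have the same length, while VARCHAR suits free-form strings of varying length.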

Date

  • date and time data types for representing temporal values are DATE, TIME, DATETIME, TIMESTAMP, and YEAR
  • DATE type is used for values with a date part but no time part. MySQL retrieves and displays DATE values in ‘YYYY-MM-DD’ format. The supported range is ‘1000-01-01’ to ‘9999-12-31’.
  • DATETIME type is used for values that contain both date and time parts. MySQL retrieves and displays DATETIME values in ‘YYYY-MM-DD hh:mm:ss’ format. The supported range is ‘1000-01-01 00:00:00’ to ‘9999-12-31 23:59:59’.
  • TIMESTAMP data type is used for values that contain both date and time parts. TIMESTAMP has a range of ‘1970-01-01 00:00:01’ UTC to ‘2038-01-19 03:14:07’ UTC.
  • TIME values in ‘hh:mm:ss’ format (or ‘hhh:mm:ss’ format for large hours values). TIME values may range from ‘-838:59:59’ to ‘838:59:59’. The hours part may be so large because the TIME type can be used not only to represent a time of day (which must be less than 24 hours), but also elapsed time or a time interval between two events (which may be much greater than 24 hours, or even negative).
  • YEAR type is a 1-byte type used to represent year values with a display width of four characters.
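The temporal types can be combined in a single table. The sketch below (hypothetical names) stores a date, a full date-time, and an elapsed time; note that in DuckDB, DATETIME is an alias for TIMESTAMP:

```sql
CREATE TABLE screenings (
  air_date   DATE,      -- e.g., '1975-10-11'
  started_at DATETIME,  -- e.g., '1975-10-11 23:30:00'
  run_length TIME       -- e.g., '01:30:00', i.e., 90 minutes
);
```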

5.1.2 CHECK constraints

When running CREATE TABLE, constraints can be added either to individual variables (e.g., CountryPopulation > 0) or to the table as a whole (e.g., LastCensus < NextCensus).4

If an attempt is made to load data that violates a CHECK constraint, an error is raised.

CREATE TABLE CountryListCensus (
    Id INT,
    CountryName VARCHAR(255) NOT NULL,
    CountryPopulation INT CHECK(CountryPopulation > 0),
    LastCensus DATE,
    NextCensus DATE,
    CHECK(LastCensus<NextCensus),
    PRIMARY KEY (Id)
);
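With the table above in place, a row that violates either constraint is rejected. For example (hypothetical values):

```sql
-- Violates CHECK(CountryPopulation > 0):
INSERT INTO CountryListCensus
VALUES (1, 'Freedonia', -5, '2010-01-01', '2020-01-01');

-- Violates the table-level CHECK(LastCensus < NextCensus):
INSERT INTO CountryListCensus
VALUES (2, 'Freedonia', 1000, '2020-01-01', '2010-01-01');
```

The exact wording of the resulting error message differs across SQL dialects.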

5.1.3 Creating KEYs

CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1),
  FOREIGN KEY (col2) REFERENCES table2(table2col1)
);

Either or both of the KEYs could be multiple columns.

CREATE TABLE table1 (
  col1 ...,
  col2 ...,
  col3 ...,
  PRIMARY KEY (col1, col3),
  FOREIGN KEY (col1, col2) REFERENCES table2(table2col1, table2col4)
);

5.1.4 Creating INDEXes

Indexes can be created on one or more variables. A table does not need to have an INDEX (or a KEY).

CREATE INDEX name_of_index ON table (col1);
CREATE INDEX name_of_index ON table (col1, col2);

5.1.5 Loading data

Once the database is set up, you will be ready to import .csv files into the database as tables. Importing .csv files as tables requires a series of steps:5

  1. a USE statement that ensures we are in the right schema/database.
  2. a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create.
  3. a series of CREATE TABLE statements that specify the table structures.
  4. a series of COPY statements that read the data from the .csv files into the appropriate tables.
Watch out!

DuckDB has its own dialect of SQL. To load data into a MySQL server, the final statement would be LOAD DATA instead of COPY. See MDSR for more information on loading data into a remote MySQL server.
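For reference, a MySQL version of the loading statement might look like the following sketch (the field and line options depend on how the .csv file is formatted):

```sql
LOAD DATA LOCAL INFILE 'data/casts.csv'
INTO TABLE casts
FIELDS TERMINATED BY ','
IGNORE 1 LINES;  -- skip the header row
```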

Loading step 1

Use the local database that we’ve called duck_datab.

USE duck_datab;

Loading step 2

Make sure to “refresh” the table, in case it already exists. However, be very careful with the DROP TABLE statement, as it permanently removes the existing casts table and all of its contents.

DROP TABLE IF EXISTS casts;

Loading step 3

Carefully define the variable types, whether or not they allow missing (NULL) values, and what the default value is for each variable. Additionally, identify the primary key for accessing information.

MySQL doesn’t actually have a BOOLEAN datatype (you would use TINYINT(1) instead). But DuckDB does have a BOOLEAN datatype!

CREATE TABLE casts (
  aid VARCHAR(255) NOT NULL DEFAULT ' ',
  sid INTEGER NOT NULL DEFAULT 0,
  featured BOOLEAN NOT NULL DEFAULT 'false',
  first_epid INTEGER DEFAULT 0,
  last_epid INTEGER DEFAULT 0,
  update_anchor BOOLEAN NOT NULL DEFAULT 0,
  n_episodes INTEGER NOT NULL DEFAULT 0,
  season_fraction DECIMAL(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (sid, aid)
);

Loading step 4

The .csv file lives on my computer, so I load it in directly. Note that the statement to load in data is slightly different in MySQL.

COPY casts FROM 'data/casts.csv' HEADER;

Checking the loading

SELECT * FROM casts LIMIT 8;
Table 5.1: After CREATE TABLE where variable types are set, the COPY command pulls the data into the table. SELECT shows us that the table is as expected.

aid               sid  featured  first_epid  last_epid  update_anchor  n_episodes  season_fraction
A. Whitney Brown   11  TRUE      19860222               FALSE                   8            0.444
A. Whitney Brown   12  TRUE                             FALSE                  20            1.000
A. Whitney Brown   13  TRUE                             FALSE                  13            1.000
A. Whitney Brown   14  TRUE                             FALSE                  20            1.000
A. Whitney Brown   15  TRUE                             FALSE                  20            1.000
A. Whitney Brown   16  TRUE                             FALSE                  20            1.000
Alan Zweibel        5  TRUE      19800409               FALSE                   5            0.250
Sasheer Zamata     39  TRUE      20140118               FALSE                  11            0.524

Check

Let’s make sure that the database exists and that the table in the database exists.

SHOW DATABASES;
1 records
database_name
duck_datab
SHOW TABLES;
1 records
name
casts
DESCRIBE casts;
Table 5.2: DESCRIBE variables in the casts table.

column_name      column_type     null  key  default  extra
aid              VARCHAR         NO    PRI  ' '
sid              INTEGER         NO    PRI  0
featured         BOOLEAN         NO         'false'
first_epid       INTEGER         YES        0
last_epid        INTEGER         YES        0
update_anchor    BOOLEAN         NO         0
n_episodes      INTEGER         NO         0
season_fraction  DECIMAL(21,20)  NO         0

5.1.6 Using DuckDB for loading data

The steps given in Section 5.1.5 are general to many SQL dialects and are important when working with most SQL clients. It is important to have control over the variable configurations that make up the SQL database. However, using the duckdb package in R allows for shorthand entry of data from .csv files into the DuckDB database. Here, we take advantage of the DuckDB functionality in R.

duckdb_read_csv(con = con_duckdb, name = "hosts", files = "data/hosts.csv")
duckdb_read_csv(con = con_duckdb, name = "episodes", files = "data/episodes.csv")

Checking the loading

SHOW TABLES;
3 records
name
casts
episodes
hosts

5.2 Best practice

It is always a good idea to terminate the SQL connection when you are done with it.

dbDisconnect(con_duckdb, shutdown = TRUE)

5.3 Reflection questions

  1. What is the difference between R and SQL in terms of communicating the different data types?

  2. Why does it matter if the variable type is specified correctly? For example, why would it be better for a date column to be specified as DATETIME instead of VARCHAR?

  3. If you are Googling some SQL syntax, why do you need to specify the dialect, for example, DuckDB or MySQL?

  4. Why do we often include the DROP TABLE operation before the CREATE TABLE operation?

5.4 Ethics considerations

  1. When creating a database why should we worry about the provenance (origin) of the data?

  2. How does a SQL database automatically hold extra information about the database (e.g., provenance)? Spoiler: it doesn’t. So what can we do?


  1. Taken from https://dev.mysql.com/doc/refman/8.0/en/data-types.html↩︎

  2. The keyword INT is a synonym for INTEGER, and the keywords DEC and FIXED are synonyms for DECIMAL.↩︎

  3. MySQL treats DOUBLE as a synonym for DOUBLE PRECISION. MySQL also treats REAL as a synonym for DOUBLE PRECISION.↩︎

  4. Example from https://www.sqlshack.com/how-to-use-sql-check-constraints/↩︎

  5. Taken from MDSR.↩︎