Data acquisition
We can’t do anything with the data if we don’t have it! We can obtain data in many ways, but here we describe web scraping and working with SQL databases.
First, we obtain data from html webpages using the ideas in 3 Web scraping. The aspects of an html document are broken down and used to pull out the information of interest. While we use R tools to parse the html pieces, understanding html data structures transcends the programming language and will be the same regardless of what language is being used.
Databases are a powerful structure for holding huge amounts of data that can easily be accessed. Databases are made up of tables which are efficiently stored using indexing and avoiding unnecessary replication of information.
?sec-db covers details of databases and reviews R functionality (mostly through the dplyr package) for working with data frames that mimics SQL queries for working with tables.
4.2 SQL in R covers three different ways to run SQL queries in R. R code can be translated into SQL; the DBI package can send SQL queries through an r
chunk; and SQL queries can be sent directly to a SQL server through a sql
chunk.
A query is made up of clauses. Every query must have a SELECT
and FROM
clause. Other clauses include WHERE
, GROUP BY
, HAVING
, and ORDER BY
, all of which will be covered in 5 SQL clauses.
Because of the efficiency of data storage in SQL databases, it is often necessary to combine information held across two or more tables. 6 Combining tables in SQL covers combining tables via JOIN
(combining columns) and via UNION
(combining rows).