Data exploration

Fluent data wrangling is one of the most important skills for a data scientist; it is described in detail in 7 Data wrangling. The data story comes from a good understanding of the data: What are the variables? Are they numeric or categorical? Where are the values centered, and how spread out are they? Are there levels of a variable that show up only once or twice? Is any data missing?
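The questions above can each be answered with a few lines of code. The following is a minimal sketch in Python using only the standard library; the `records` data and the `summarize` helper are hypothetical and made up for illustration, and a real analysis would more likely rely on a data-frame library.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical toy records (not real measurements); None marks a missing value.
records = [
    {"species": "Adelie", "mass_g": 3750},
    {"species": "Adelie", "mass_g": 3800},
    {"species": "Adelie", "mass_g": 3250},
    {"species": "Gentoo", "mass_g": 5000},
    {"species": "Gentoo", "mass_g": None},
    {"species": "Chinstrap", "mass_g": 3700},
]


def summarize(rows, column):
    """Summarize one column: kind, missing count, and center/spread or level counts."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    summary = {"n_missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric variable: where are the values centered, and how spread out are they?
        summary["kind"] = "numeric"
        summary["mean"] = mean(present)
        summary["median"] = median(present)
        summary["stdev"] = stdev(present) if len(present) > 1 else None
    else:
        # Categorical variable: which levels show up only once or twice?
        counts = Counter(present)
        summary["kind"] = "categorical"
        summary["counts"] = dict(counts)
        summary["rare_levels"] = sorted(k for k, n in counts.items() if n <= 2)
    return summary


mass_summary = summarize(records, "mass_g")
species_summary = summarize(records, "species")
print(mass_summary)
print(species_summary)
```

The cutoff of two occurrences for a "rare" level is an arbitrary choice that mirrors the wording of the question; the right threshold depends on the size of the data set.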

Textual data can often provide a tremendous amount of information, but it usually must be turned into variables before it can be analyzed. Feature engineering is the practice of creating new variables from other variables. For example, instead of using the entire text of a poem, you might want to know how often the poem uses the word “love.” Basic skills for working with text data are given in 8 Text analysis. Working with text data also requires facility with regular expressions, sequences of characters that define search patterns, which are described in 9 Regular expressions. Regular expressions use symbolic notation to find particular sequences of interest and are largely agnostic to programming language: they show up in R, Python, SQL, and many others.
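As a small illustration of the poem example, the sketch below uses Python's built-in `re` module to engineer a single feature: the number of times a word appears in a text. The `poem` string and the `count_word` helper are hypothetical and invented for illustration.

```python
import re

# Hypothetical sample text standing in for a poem.
poem = (
    "Love came early, and love stayed late.\n"
    "The glove by the door was beloved, too.\n"
    "I love the quiet that follows."
)


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \b marks a word boundary, so "glove" and "beloved" are not counted as "love".
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return len(pattern.findall(text))


love_count = count_word(poem, "love")            # whole words only
substring_count = poem.lower().count("love")     # naive count, includes "glove" and "beloved"
print(love_count, substring_count)
```

The two counts differ because the naive substring search also matches "love" inside longer words. Whether those matches should count is a modeling decision, and the regular expression makes that decision explicit.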