17  Fin

We learned a lot of technical tools throughout the semester. But hopefully, the bigger take-away is the larger thought process about how we do data science and how the technical pieces come together.

Remember, the goal is to communicate insight that you obtain from data. The more you understand your data, the more insight you will have.

17.1 Data Science Overview

17.2 Data Scientists

Who does data science?

CURV (connecting, uplifting, and recognizing voices) is a database of statisticians and data scientists.

17.2.1 David Blackwell

Arguably, the most famous/influential/brilliant African American statistician is David Blackwell. Statisticians may know Blackwell from the Rao-Blackwell theorem, which says that conditioning an estimator on a sufficient statistic yields a new estimator whose mean squared error is no larger than that of the original estimator.
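The statement can be written compactly as below. The notation is an assumption introduced here for illustration: $\hat{\theta}$ is the original estimator of a parameter $\theta$, $T$ is a sufficient statistic, and $\tilde{\theta}$ is the estimator obtained by conditioning on $T$.

```latex
\tilde{\theta} = \mathbb{E}\big[\hat{\theta} \mid T\big],
\qquad
\mathbb{E}\big[(\tilde{\theta} - \theta)^2\big]
\;\le\;
\mathbb{E}\big[(\hat{\theta} - \theta)^2\big].
```

Sufficiency of $T$ is what makes $\tilde{\theta}$ a legitimate estimator: the conditional expectation does not depend on the unknown $\theta$.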

Blackwell was the first African American elected to the National Academy of Sciences and the first African American tenured at UC Berkeley. He was the seventh African American to receive a PhD in mathematics. In 2012, President Obama posthumously awarded Blackwell the National Medal of Science.

The majority of Blackwell’s career in statistics was spent at UC Berkeley (1954-1988). However, his start at UC Berkeley was postponed due to racism in the Department of Mathematics. Hear Blackwell describe the situation in his own words in the following video:

The full interview with David Blackwell can be found at https://www.youtube.com/watch?v=Mqpf9tw44Xw/.

Why does representation matter?

When individuals don’t feel a part of the community, their identity gets mixed up with their ability. The following xkcd comic encapsulates what can happen when individuals of the non-dominant demographic group engage with the content of the course / curriculum / minor / major.

In the first panel, a stick figure makes a mathematical mistake, and the other stick figure says "you are bad at math".  In the second panel, a stick figure with long hair (presumably a female stick figure) makes the same mistake, and the other stick figure says "girls are bad at math."

image credit – https://xkcd.com/385/

Research indicates that many young people may be deterred from pursuing STEM fields due to prominent stereotypes regarding who best fits and belongs in such fields [1].

What are some reasons that representation impacts participation in STEM?

  • stereotypes about innate abilities
  • stereotypes about images in the field

17.2.2 Liz Hare

Liz Hare is not a statistician; she is a geneticist who works primarily in dog and animal genetics. However, as someone who is very active in the Minorities in R (MiR) Community, she works regularly with statisticians.

Liz Hare is visually impaired, and much of her work focuses on communicating how valuable and how easy it is for statisticians and data scientists to add alt text to their reports. She asks us to consider and report the following in the alt text:

  1. What kind of graph or chart is it?
  2. What variables are on the axes?
  3. What are the ranges of the variables?
  4. What does the appearance tell you about the relationships between the variables?

Importantly, including alt text in your own work is straightforward if you are using Quarto or R Markdown documents: the alt text is given as an option (argument) of the R chunk that produces the figure.

R code for Ibo tweets in TidyTuesday analysis.

The figure contains information on how to communicate information about a graphic through alt text, report captions, and individual figure captions.

Different ways to annotate a figure include alt text, figure captions for the full file, and figure captions for the ggplot.
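As a minimal sketch of what such a chunk might look like, the body below could sit inside an `{r}` chunk of a Quarto document. The dataset (R's built-in `mtcars`), the label, and the caption and alt text wording are all assumptions chosen for illustration; they are not the TidyTuesday example pictured above. The alt text tries to answer the four questions: kind of chart, variables on the axes, ranges, and the relationship shown.

```r
#| label: fig-mpg-weight
#| fig-cap: "Heavier cars in the mtcars data tend to have lower fuel efficiency."
#| fig-alt: "Scatterplot with car weight in thousands of pounds on the x-axis, ranging from about 1.5 to 5.4, and miles per gallon on the y-axis, ranging from about 10 to 34. The points trend downward: heavier cars get fewer miles per gallon."

# Illustrative data only: mtcars ships with base R.
# The ranges quoted in the alt text are computed rather than read off the plot.
wt_range  <- range(mtcars$wt)
mpg_range <- range(mtcars$mpg)

# Base R scatterplot; a ggplot2 figure would take the same chunk options.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
```

In Quarto the options are written `fig-alt` and `fig-cap`; in a classic R Markdown chunk header the corresponding knitr options are `fig.alt` and `fig.cap`, for example `{r, fig.alt = "..."}`.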

17.2.3 Rafael Irizarry

Rafael Irizarry is a well-known biostatistician. He did his PhD at Berkeley, worked for many years at Johns Hopkins University, and currently runs a lab at Harvard, where he is Professor of Biostatistics, and at the Dana-Farber Cancer Institute, where he is Professor of Biostatistics and Computational Biology. He has dozens of online courses on the edX platform and over a hundred publications listed on Google Scholar.

Relevant to the CURV database, however, is the work that Rafael Irizarry has done in Puerto Rico. As a graduate of the University of Puerto Rico, he had a vested interest in the community that was devastated in 2017 by Hurricane Maria, a category 5 hurricane. With collaborators, Professor Irizarry drew a representative stratified sample, with neighborhoods grouped by how easily accessible they were in the aftermath of the hurricane.
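To make the idea of a stratified sample concrete, here is a small sketch in base R. Everything in it is made up for illustration: the sampling frame, the three accessibility strata, and the sample sizes are assumptions and do not come from the study.

```r
set.seed(2017)  # arbitrary seed so the sketch gives the same draw each time

# Made-up sampling frame: 1,000 households, most in easy-to-reach areas.
frame <- data.frame(
  household  = 1:1000,
  remoteness = rep(c("least remote", "moderately remote", "most remote"),
                   times = c(800, 150, 50))
)

# Simple random sample: every household has the same chance of selection,
# so the small "most remote" group may end up with only a handful of households.
srs <- frame[sample(nrow(frame), size = 60), ]

# Stratified sample: draw a fixed number of households within each stratum.
per_stratum <- 20
strata <- split(frame, frame$remoteness)
stratified <- do.call(
  rbind,
  lapply(strata, function(s) s[sample(nrow(s), per_stratum), ])
)

table(srs$remoteness)         # counts vary from draw to draw
table(stratified$remoteness)  # exactly 20 per stratum by construction
```

Stratifying guarantees that hard-to-reach areas are represented, but because the strata are sampled at different rates, estimates for the whole population need to weight each stratum by its size.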

News reports months after the hurricane put the official death toll from Hurricane Maria at 64 people. Professor Irizarry and colleagues estimated that the number of excess deaths was 4645, with a 95% confidence interval of 793 to 8498.
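A quick back-of-the-envelope calculation shows just how wide that interval is. The sketch below uses only the three numbers quoted above; the standard error it reports is an assumption, obtained by treating the interval as a symmetric normal-based interval (estimate plus or minus 1.96 standard errors), which may not be exactly how the authors constructed it.

```r
# Numbers quoted above from the paper's abstract.
estimate <- 4645
ci_lower <- 793
ci_upper <- 8498

midpoint   <- (ci_lower + ci_upper) / 2   # 4645.5, essentially the point estimate
half_width <- (ci_upper - ci_lower) / 2   # 3852.5

# ASSUMPTION: interval = estimate +/- qnorm(0.975) * SE.
implied_se <- half_width / qnorm(0.975)   # about 1966 deaths

# Margin of error as a fraction of the estimate.
relative_half_width <- half_width / estimate   # about 0.83

round(c(midpoint = midpoint,
        half_width = half_width,
        implied_se = implied_se,
        relative_half_width = relative_half_width), 2)
```

Under that assumption, the margin of error is roughly 83% of the estimate itself, which is one way to quantify how much uncertainty comes with estimating deaths from a sample rather than a full count.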

Some of the ideas behind Professor Irizarry’s work include: who is doing the work to understand climate change at a global level? How is stratified sampling different from simple random sampling, and why can’t we always take simple random samples? And why is the CI so wide?

New England Journal of Medicine paper from July 2018 titled "Mortality in Puerto Rico after Hurricane Maria" with Irizarry and co-authors.

Abstract of the New England Journal of Medicine paper. In particular, the authors give a point estimate of 4645 excess deaths, with a 95% CI of 793 to 8498.

17.3 Take-aways

  • 80-90% of data science work is data wrangling and visualization
  • wrangling the data well is usually more important than modeling the data well
  • there are many choices along the way; there is no such thing as truth
  • if you can’t reproduce the work, you should question whether to trust it
  • communicating to your audience is likely the most important aspect of doing data science
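On the reproducibility point, one small habit that helps is fixing the random seed and recording the software environment. The sketch below is illustrative only: `draw_once()` is a hypothetical helper and the seed value is arbitrary.

```r
# Hypothetical helper: a "random" analysis step made repeatable by fixing the seed.
draw_once <- function(seed) {
  set.seed(seed)
  mean(rnorm(100))
}

first  <- draw_once(47)
second <- draw_once(47)

# Same seed, same code, same result.
identical(first, second)

# Record the R version and loaded packages alongside the results.
sessionInfo()
```

A fixed seed only covers the random parts of an analysis; sharing the data and the full code matters just as much.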

17.4 Reflection questions

17.5 Ethics considerations


  1. Nguyen and Riegle-Crumb. Who is a scientist? The relationship between counter-stereotypical beliefs about scientists and the STEM major intentions of Black and Latinx male and female students. International Journal of STEM Education, volume 8, Article number: 28 (2021), https://doi.org/10.1186/s40594-021-00288-x