17 Fin
We learned a lot of technical tools throughout the semester. But hopefully, the bigger take-away is the larger thought process about how we do data science and how the technical pieces come together.
Remember, the goal is to communicate insight that you obtain from data. The more you understand your data, the more insight you will have.
17.1 Data Science Overview
17.2 Data Scientists
Who does data science?
CURV (connecting, uplifting, and recognizing voices) is a database of statisticians and data scientists.
17.2.1 David Blackwell
Arguably the most famous, influential, and brilliant African American statistician is David Blackwell. Statisticians may know Blackwell from the Rao-Blackwell theorem, which says that conditioning an estimator on a sufficient statistic yields a new estimator whose mean squared error is no larger than that of the original estimator.
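In symbols, the theorem can be sketched as follows. The notation is introduced here for illustration and is not from the original notes: $\hat{\theta}$ is the original estimator of $\theta$ (assumed to have finite variance), $T$ is a sufficient statistic, and $\tilde{\theta}$ is the estimator obtained by conditioning on $T$.

```latex
\[
\begin{aligned}
\tilde{\theta} &= \mathbb{E}\left[\hat{\theta} \mid T\right], \\
\mathbb{E}\left[(\hat{\theta} - \theta)^2\right]
  &= \mathbb{E}\left[(\tilde{\theta} - \theta)^2\right]
   + \mathbb{E}\left[\operatorname{Var}\left(\hat{\theta} \mid T\right)\right]
  \;\ge\; \mathbb{E}\left[(\tilde{\theta} - \theta)^2\right].
\end{aligned}
\]
```

The second line follows from the law of total expectation applied to the squared error, and the inequality holds because a conditional variance is never negative. Sufficiency of $T$ is what guarantees that $\tilde{\theta}$ does not depend on the unknown $\theta$, so it is a usable estimator.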
Blackwell was the first African American elected to the National Academy of Sciences and the first African American tenured at UC Berkeley. He was the seventh African American to receive a PhD in mathematics. In 2012, President Obama posthumously awarded Blackwell the National Medal of Science.
The majority of Blackwell’s career in statistics was spent at UC Berkeley (1954-1988). However, his start at UC Berkeley was postponed due to racism in the Department of Mathematics. Hear Blackwell describe the situation in his own words in the interview linked below.
The full interview with David Blackwell can be found at https://www.youtube.com/watch?v=Mqpf9tw44Xw/.
Why does representation matter?
When individuals don’t feel like a part of the community, their identity gets mixed up with their ability. The following xkcd comic encapsulates what can happen when individuals of the non-dominant demographic group engage with the content of the course, curriculum, minor, or major.
Research indicates that many young people may be deterred from pursuing STEM fields due to prominent stereotypes regarding who best fits and belongs in such fields (Nguyen and Riegle-Crumb, 2021).
What are some reasons that representation impacts participation in STEM?
- stereotypes about innate abilities
- stereotypes about images in the field
17.2.2 Liz Hare
Liz Hare is not a statistician. Indeed, she is a geneticist, working primarily in dog / animal genetics. However, as someone who is very active in the Minorities in R (MiR) Community, she works regularly with statisticians.
Liz Hare is visually impaired and has focused her work on communicating the value and ease with which statisticians and data scientists can add alt text to their reports. In the alt text, she asks us to consider and report:
- What kind of graph or chart is it?
- What variables are on the axes?
- What are the ranges of the variables?
- What does the appearance tell you about the relationships between the variables?
Importantly, including alt text in your own work is straightforward if you are using Quarto or RMarkdown documents. In R, alt text is supplied as a chunk option on the R chunk that produces the figure: the fig-alt option in Quarto, or the fig.alt option in RMarkdown (knitr).
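As a minimal sketch, the contents of a Quarto R chunk with alt text might look like the following. The dataset (mtcars, which ships with R), the plot, and the wording of the alt text are illustrative choices made here, not part of the original notes. The alt text is written to answer the four questions above in order.

```r
#| fig-alt: "Scatterplot of fuel efficiency against car weight for the 32 cars in the mtcars data. Weight, on the x-axis, ranges from about 1.5 to 5.4 thousand pounds. Fuel efficiency, on the y-axis, ranges from about 10 to 34 miles per gallon. The points trend downward from left to right, so heavier cars tend to have lower fuel efficiency."

# mtcars is a built-in dataset; any plotting code is handled the same way
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)",
     ylab = "Fuel efficiency (miles per gallon)")
```

The line beginning with `#|` must sit at the top of the chunk. In an RMarkdown document, the same text can instead be passed as `fig.alt = "..."` in the chunk header.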
17.2.3 Rafael Irizarry
Rafael Irizarry is a well-known biostatistician. He did his PhD at Berkeley, worked for many years at Johns Hopkins University, and currently runs a lab at Harvard, where he is Professor of Biostatistics, and at the Dana-Farber Cancer Institute, where he is Professor of Biostatistics and Computational Biology. He has dozens of online courses through the edX platform and over a hundred publications listed on Google Scholar.
Relevant to the CURV database, however, is the work that Rafael Irizarry has done in Puerto Rico. Having graduated from the University of Puerto Rico, Rafael Irizarry had a vested interest in the community that was devastated in 2017 when Hurricane Maria, a category 5 hurricane, struck the island. With collaborators, Professor Irizarry drew a representative stratified sample, with neighborhoods grouped by how easily accessible they were in the aftermath of the hurricane.
In the months after the hurricane, news reports gave the official death toll from Hurricane Maria as 64 people. Professor Irizarry and colleagues estimated that the number of excess deaths was 4645, with a 95% confidence interval of 793 to 8498.
Some of the ideas behind Professor Irizarry’s work include: Who is doing the work to understand climate change at a global level? How is stratified sampling different from simple random sampling, and why can’t we always take simple random samples? Why is the CI so wide?
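One rough way to get a feel for the width of the interval is to back out the standard error it implies. The arithmetic below assumes the reported interval is a symmetric normal-approximation interval (estimate plus or minus 1.96 standard errors). That assumption is made here for illustration and may not match how the interval was actually computed.

```latex
\[
\text{half-width} = \frac{8498 - 793}{2} = 3852.5,
\qquad
\text{SE} \approx \frac{3852.5}{1.96} \approx 1966,
\qquad
\frac{\text{SE}}{\text{estimate}} \approx \frac{1966}{4645} \approx 0.42
\]
```

Under that assumption, the standard error is roughly 40% of the estimate itself, which is why the interval spans thousands of deaths. The calculation describes the size of the uncertainty, not its source.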
17.3 Take-aways
- 80-90% of data science work is data wrangling and visualization
- wrangling the data well is usually more important than modeling the data well
- there are many choices along the way; there is no such thing as truth
- if you can’t reproduce the work, you should question whether to trust it
- communicating to your audience is likely the most important aspect of doing data science
17.4 Reflection questions
17.5 Ethics considerations
Nguyen and Riegle-Crumb (2021). Who is a scientist? The relationship between counter-stereotypical beliefs about scientists and the STEM major intentions of Black and Latinx male and female students. International Journal of STEM Education, volume 8, article number 28. https://doi.org/10.1186/s40594-021-00288-x