Module 7 Data Science Basics

by Patrick Boily and Jen Schellinck

In October 2012, the Harvard Business Review published an article calling data science the “sexiest job of the 21st century”, and comparing data scientists with the ubiquitous “quants” of the ’90s: a data scientist is a “hybrid of data hacker, analyst, communicator, and trusted adviser” [75].

Would-be data scientists are usually introduced to the field via machine learning algorithms and applications. While we will discuss these topics in later modules, we would like to start with some important non-technical (and semi-technical) notions that are, unfortunately, often swept aside in favour of diving head first into murky analysis waters.

In this module, we focus on some of the fundamental ideas and concepts that underlie and drive forward the discipline of data science, as well as the contexts in which these concepts are typically applied. We also highlight issues related to the ethics of practical data science. We conclude by getting a bit more concrete and considering the analytical workflow of a typical data science project, the types of roles and responsibilities that generally arise during such projects, and some basics of how to think about data, as a prelude to more technical topics.

Note: we encourage readers to take a look at the Programming Primer before diving into data science proper.


7.1 Introduction
     7.1.1 What Is Data?
     7.1.2 From Objects and Attributes to Datasets
     7.1.3 Data in the News
     7.1.4 The Analog/Digital Data Dichotomy

7.2 Conceptual Frameworks for Data Work
     7.2.1 Three Modeling Strategies
     7.2.2 Information Gathering
     7.2.3 Cognitive Biases

7.3 Ethics in the Data Science Context
     7.3.1 The Need for Ethics
     7.3.2 What Is/Are Ethics?
     7.3.3 Ethics and Data Science
     7.3.4 Guiding Principles
     7.3.5 The Good, the Bad, and the Ugly

7.4 Analytics Workflows
     7.4.1 The “Analytical” Method
     7.4.2 Collection, Storage, Processing, Modeling
     7.4.3 Model Assessment and Life After Analysis
     7.4.4 Automated Data Pipelines

7.5 Getting Insight From Data
     7.5.1 Asking the Right Questions
     7.5.2 Structuring and Organizing Data
     7.5.3 Basic Data Analysis Techniques
     7.5.4 Common Statistical Procedures in R
     7.5.5 Quantitative Methods

7.6 Exercises



T. H. Davenport and D. J. Patil, “Data Scientist: The Sexiest Job of the 21st Century,” Harvard Business Review, Oct. 2012.