Preface
About These Course Notes
Funding Acknowledgement
Datasets
Dedication
Contents
Contributors and Influences
I Prelude to Data Understanding
Introduction
1
Programming Primer
1.1
Programming Fundamentals
1.1.1
Compiled vs. Interpreted Languages
1.1.2
Some Fundamental Concepts
1.1.3
Code Components
1.1.4
Designing with Pseudo-Code
1.1.5
From Pseudo-Code to Code that Runs
1.1.6
Debugging
1.1.7
R/Python
1.2
Introduction to R
1.2.1
Why Use R?
1.2.2
Installing R / RStudio
1.2.3
Test, Test, Test!
1.2.4
Customizing RStudio
1.2.5
Upgrading R and/or RStudio
1.2.6
Basics of R
1.3
More About Programming in R
1.3.1
Help and Documentation
1.3.2
Simple Data Manipulation
1.3.3
Exploring Data
1.3.4
A Word About NAs
1.3.5
Loops and Conditional Statements
1.4
The tidyverse
1.4.1
Pipeline Operator
1.4.2
Tidy Data
1.4.3
The dplyr Package
1.5
Basics of Python
1.5.1
Integrated Development Environments for Python
1.5.2
Introduction to Python
1.5.3
NumPy and Arrays
1.6
Python for Data Science
1.6.1
Pandas and Data Frames
1.6.2
Data Wrangling
1.6.3
Data Aggregation
1.6.4
Combining Python and R
1.7
Exercises
2
A Survey of Optimization
2.1
Beginnings
2.2
Single-Objective Optimization Problem
2.2.1
Feasible and Optimal Solutions
2.2.2
Infeasible and Unbounded Problems
2.2.3
Possible Tasks Involving Optimization Problems
2.3
Classification of Optimization Problems and Types of Algorithms
2.3.1
Classification
2.3.2
Algorithms
2.4
Linear Programming
2.4.1
Linear Programming Duality
2.4.2
Methods for Solving LP Problems
2.5
Mixed-Integer Linear Programming
2.5.1
Cutting Planes
2.6
Useful Modeling Techniques
2.6.1
Activation
2.6.2
Disjunction
2.6.3
Soft Constraints
2.7
Data Envelopment Analysis
2.7.1
Challenges and Pitfalls
2.7.2
Advantages and Disadvantages
2.7.3
SAS, Excel, and R DEA Solvers
2.7.4
Case Study: Barcelona Schools
2.8
Software Solvers
3
Probability and Applications
3.1
Basic Notions
3.1.1
Sample Spaces and Events
3.1.2
Counting Techniques
3.1.3
Ordered Samples
3.1.4
Unordered Samples
3.1.5
Probability of an Event
3.1.6
Conditional Probability and Independent Events
3.1.7
Bayes’ Theorem
3.2
Discrete Distributions
3.2.1
Random Variables and Distributions
3.2.2
Expectation of a Discrete Random Variable
3.2.3
Binomial Distributions
3.2.4
Geometric Distributions
3.2.5
Negative Binomial Distributions
3.2.6
Poisson Distributions
3.2.7
Other Discrete Distributions
3.3
Continuous Distributions
3.3.1
Continuous Random Variables
3.3.2
Expectation of a Continuous Random Variable
3.3.3
Normal Distributions
3.3.4
Exponential Distributions
3.3.5
Gamma Distributions
3.3.6
Normal Approximation of the Binomial Distribution
3.3.7
Other Continuous Distributions
3.4
Joint Distributions
3.5
Central Limit Theorem and Sampling Distributions
3.5.1
Sampling Distributions
3.5.2
Central Limit Theorem
3.5.3
Sampling Distributions (Reprise)
3.6
Exercises
4
Introductory Statistical Analysis
4.1
Introduction
4.2
Descriptive Statistics
4.2.1
Data Descriptions
4.2.2
Outliers
4.2.3
Visual Summaries
4.2.4
Coefficient of Correlation
4.3
Point and Interval Estimation
4.3.1
Estimator (Sampling) Variance and Standard Error
4.3.2
Confidence Interval for \(\mu\) When \(\sigma\) is Known
4.3.3
Confidence Level
4.3.4
Sample Size
4.3.5
Confidence Interval for \(\mu\) When \(\sigma\) is Unknown
4.3.6
Confidence Interval for a Proportion
4.4
Hypothesis Testing
4.4.1
Hypothesis Testing in General
4.4.2
Test Statistics and Critical Regions
4.4.3
Test for a Mean
4.4.4
Test for a Proportion
4.4.5
Two-Sample Tests
4.4.6
Difference of Two Proportions
4.4.7
Hypothesis Testing with R
4.5
Additional Topics
4.5.1
Analysis of Variance
4.5.2
Analysis of Covariance
4.5.3
Basics of Multivariate Statistics
4.5.4
Goodness-of-Fit Tests
4.6
Exercises
5
Survey Sampling Methods
5.1
Background
5.1.1
Survey Sampling Generalities
5.1.2
Survey Frames
5.1.3
Fundamental Sampling Concepts
5.1.4
Data Collection Basics
5.1.5
Types of Sampling Methods
5.2
Questionnaire Design
5.2.1
Basic Concepts
5.2.2
Question Types
5.2.3
Design Considerations
5.2.4
Question Order
5.3
Simple Random Sampling
5.3.1
Basic Notions
5.3.2
Estimators and Confidence Intervals
5.3.3
Sample Size
5.4
Stratified Random Sampling
5.4.1
Estimators and Confidence Intervals
5.4.2
Sample Size and Allocation
5.4.3
Comparison Between SRS and StS
5.5
Ratio, Regression, and Difference Estimation
5.5.1
Ratio Estimation
5.5.2
Regression Estimation
5.5.3
Difference Estimation
5.5.4
Comparisons
5.6
Cluster Sampling
5.6.1
Estimators and Confidence Intervals
5.6.2
Sample Size
5.6.3
Comparison Between SRS and CLS
5.7
Special Topics
5.7.1
Systematic Sampling
5.7.2
Sampling with Probability Proportional to Size
5.7.3
Multi-Stage Sampling
5.7.4
Multi-Phase Sampling
5.7.5
Miscellaneous
5.8
Exercises
II Fundamentals of Data Insight
Introduction
6
Non-Technical Aspects of Data Work
6.1
First Principles
6.1.1
The Consulting/Analysis Framework
6.1.2
The “Multiple I’s” Approach to Quantitative Work
6.1.3
Roles and Responsibilities
6.1.4
Consulting/Analysis Cheatsheet
6.2
The Consulting/Analysis Life Cycle
6.2.1
Marketing
6.2.2
Initial Contact
6.2.3
Client Meetings
6.2.4
Assembling the Team
6.2.5
Team Meetings
6.2.6
Proposal
6.2.7
Contracting and IP
6.2.8
Project Planning
6.2.9
Information Gathering
6.2.10
Quantitative Analysis
6.2.11
Interpreting the Results
6.2.12
Reporting and Deliverables
6.2.13
Invoicing
6.2.14
Closing the File
6.3
Lessons Learned
6.3.1
About Clients
6.3.2
About Consultants
6.4
Business Development
6.4.1
Basics
6.4.2
Clients and Choices
6.4.3
Building Trust
6.4.4
Improving Trust
6.5
Technical Writing
6.5.1
Basics
6.5.2
Components
6.5.3
Traits
6.5.4
Example
6.6
Exercises
7
Data Science Basics
7.1
Introduction
7.1.1
What Is Data?
7.1.2
From Objects and Attributes to Datasets
7.1.3
Data in the News
7.1.4
The Analog/Digital Data Dichotomy
7.2
Conceptual Frameworks for Data Work
7.2.1
Three Modeling Strategies
7.2.2
Information Gathering
7.2.3
Cognitive Biases
7.3
Ethics in the Data Science Context
7.3.1
The Need for Ethics
7.3.2
What Is/Are Ethics?
7.3.3
Ethics and Data Science
7.3.4
Guiding Principles
7.3.5
The Good, the Bad, and the Ugly
7.4
Analytics Workflows
7.4.1
The “Analytical” Method
7.4.2
Data Collection, Storage, Processing, and Modeling
7.4.3
Model Assessment and Life After Analysis
7.4.4
Automated Data Pipelines
7.5
Getting Insight From Data
7.5.1
Asking the Right Questions
7.5.2
Structuring and Organizing Data
7.5.3
Basic Data Analysis Techniques
7.5.4
Common Statistical Procedures in R
7.5.5
Quantitative Methods
7.6
Exercises
8
Data Preparation
8.1
Introduction
8.2
General Principles
8.2.1
Approaches to Data Cleaning
8.2.2
Pros and Cons
8.2.3
Tools and Methods
8.3
Data Quality
8.3.1
Common Error Sources
8.3.2
Detecting Invalid Entries
8.4
Missing Values
8.4.1
Missing Value Mechanisms
8.4.2
Imputation Methods
8.4.3
Multiple Imputation
8.5
Anomalous Observations
8.5.1
Anomaly Detection
8.5.2
Outlier Tests
8.5.3
Visual Outlier Detection
8.6
Data Transformations
8.6.1
Common Transformations
8.6.2
Box-Cox Transformations
8.6.3
Scaling
8.6.4
Discretizing
8.6.5
Creating Variables
8.7
Example: Algae Blooms
8.7.1
Problem Description
8.7.2
Loading the Data
8.7.3
Summary and Visualization
8.7.4
Data Cleaning
8.7.5
Principal Components
8.8
Exercises
9
Data Visualization and Data Exploration
9.1
Data and Charts
9.1.1
Pre-Analysis Uses
9.1.2
Presenting Results
9.1.3
Multivariate Elements in Charts
9.1.4
Visualization Catalogue
9.1.5
A Word About Accessibility
9.2
Fundamental Principles of Analytical Design
9.2.1
Comparisons
9.2.2
Causality, Mechanism, Structure, Explanation
9.2.3
Multivariate Analysis
9.2.4
Integration of Evidence
9.2.5
Documentation
9.2.6
Content First and Foremost
9.3
Introduction to Dashboards
9.3.1
Dashboard Fundamentals
9.3.2
Dashboard Structure
9.3.3
Dashboard Design
9.3.4
Examples
9.4
Basic Visualizations in R
9.4.1
Scatterplots
9.4.2
Barplots
9.4.3
Histograms
9.4.4
Curves
9.4.5
Boxplots
9.4.6
Examples
9.5
ggplot2 Visualizations in R
9.5.1
Basics of ggplot2’s Grammar
9.5.2
ggplot2 Miscellanea
9.5.3
Examples
9.6
Exercises
10
Data Engineering and Management
10.1
Background and Context
10.2
Data Engineering
10.2.1
Data Pipelines
10.2.2
Automatic Deployment and Operations
10.2.3
Scheduled Pipelines and Workflows
10.2.4
Data Engineering Tools
10.3
Data Management
10.3.1
Databases
10.3.2
Data Modeling
10.3.3
Data Storage
10.4
Reporting and Deployment
10.4.1
Reports and Products
10.4.2
Cloud and On-Premise Architecture
III Spotlight on Machine Learning
Introduction
11
Machine Learning 101
11.1
Introduction
11.2
Statistical Learning
11.2.1
Types of Learning
11.2.2
Data Science and Machine Learning Tasks
11.3
Association Rules Mining
11.3.1
Overview
11.3.2
Generating Rules
11.3.3
The A Priori Algorithm
11.3.4
Validation
11.3.5
Case Study: Danish Medical Data
11.3.6
Toy Example: Titanic Dataset
11.4
Classification and Value Estimation
11.4.1
Overview
11.4.2
Classification Algorithms
11.4.3
Decision Trees
11.4.4
Performance Evaluation
11.4.5
Case Study: Minnesota Tax Audit
11.4.6
Toy Example: Kyphosis Dataset
11.5
Clustering
11.5.1
Overview
11.5.2
Clustering Algorithms
11.5.3
k-Means
11.5.4
Clustering Validation
11.5.5
Case Study: Livehoods
11.5.6
Toy Example: Iris Dataset
11.6
Issues and Challenges
11.6.1
Bad Data
11.6.2
Overfitting/Underfitting
11.6.3
Appropriateness and Transferability
11.6.4
Myths and Mistakes
11.7
R Examples
11.7.1
Association Rules Mining: Titanic Dataset
11.7.2
Classification: Kyphosis Dataset
11.7.3
Clustering: Iris Dataset
11.8
Exercises
12
Regression and Value Estimation
12.1
Statistical Learning
12.1.1
Supervised Learning Framework
12.1.2
Systematic Component and Regression Function
12.1.3
Model Evaluation
12.1.4
Bias-Variance Trade-Off
12.2
Regression Modeling
12.2.1
Formalism
12.2.2
Least Squares Properties
12.2.3
Generalizations of OLS
12.2.4
Shrinkage Methods
12.3
Resampling Methods
12.3.1
Cross-Validation
12.3.2
Bootstrap
12.3.3
Jackknife
12.4
Model Selection
12.4.1
Best Subset Selection
12.4.2
Stepwise Selection
12.4.3
Selecting the Optimal Model
12.5
Nonlinear Modeling
12.5.1
Basis Function Models
12.5.2
Splines
12.5.3
Generalized Additive Models
12.6
Example: Algae Blooms
12.6.1
Value Estimation Modeling
12.6.2
Model Evaluation
12.6.3
Model Predictions
12.7
Exercises
13
Spotlight on Classification
13.1
Overview
13.1.1
Formalism
13.1.2
Model Evaluation
13.1.3
Bias-Variance Trade-Off
13.2
Simple Classification Methods
13.2.1
Logistic Regression
13.2.2
Discriminant Analysis
13.2.3
Receiver Operating Characteristic Curve
13.3
Rare Occurrences
13.4
Other Supervised Approaches
13.4.1
Tree-Based Methods
13.4.2
Support Vector Machines
13.4.3
Artificial Neural Networks
13.4.4
Naïve Bayes Classifiers
13.5
Ensemble Learning
13.5.1
Bagging
13.5.2
Random Forests
13.5.3
Boosting
13.6
Exercises
14
Spotlight on Clustering
14.1
Overview
14.1.1
Unsupervised Learning
14.1.2
Clustering Framework
14.1.3
A Philosophical Approach to Clustering
14.2
Simple Clustering Algorithms
14.2.1
\(k\)-Means and Variants
14.2.2
Hierarchical Clustering
14.3
Clustering Evaluation
14.3.1
Clustering Assessment
14.3.2
Model Selection
14.4
Advanced Clustering Methods
14.4.1
Density-Based Clustering
14.4.2
Spectral Clustering
14.4.3
Probability-Based Clustering
14.4.4
Affinity Propagation
14.4.5
Fuzzy Clustering
14.4.6
Cluster Ensembles
14.5
Exercises
15
Feature Selection and Dimension Reduction
15.1
Data Reduction for Insight
15.1.1
Reduction of an NHL Game
15.1.2
Meaning in Macbeth
15.2
Dimension Reduction
15.2.1
Sampling Observations
15.2.2
The Curse of Dimensionality
15.2.3
Principal Component Analysis
15.2.4
The Manifold Hypothesis
15.3
Feature Selection
15.3.1
Filter Methods
15.3.2
Wrapper Methods
15.3.3
Subset Selection Methods
15.3.4
Regularization (Embedded) Methods
15.3.5
Supervised and Unsupervised Feature Selection
15.4
Advanced Topics
15.4.1
Singular Value Decomposition
15.4.2
PCA Regression and Partial Least Squares
15.4.3
Spectral Feature Selection
15.4.4
Uniform Manifold Approximation and Projection
15.5
Exercises
IV Special Topics in Data Analysis
Introduction
16
Anomaly Detection and Outlier Analysis
16.1
Overview
16.1.1
Basic Notions and Concepts
16.1.2
Statistical Learning Framework
16.1.3
Motivating Example
16.2
Quantitative Approaches
16.2.1
Distance Methods
16.2.2
Density Methods
16.3
Qualitative Approaches
16.3.1
Attribute Value Frequency Algorithm
16.3.2
Greedy Algorithm
16.4
Anomalies in High-Dimensional Data
16.4.1
Definitions and Challenges
16.4.2
Projection Methods
16.4.3
Subspace Methods
16.4.4
Ensemble Methods
16.5
Exercises
17
Web Scraping and Automated Data Collection
17.1
Data Analysis and Web Scraping
17.1.1
What and Why of Web Scraping
17.1.2
Web Data Quality
17.1.3
Ethical Considerations
17.1.4
Automated Data Collection Decision Process
17.2
Web Technologies Basics
17.2.1
Content Dissemination
17.2.2
Hyper Text Transfer Protocol
17.2.3
Web Content
17.2.4
HTML/XML
17.2.5
Cookies and Other Headers
17.3
Scraping Toolbox
17.3.1
Developer Tools
17.3.2
XPath
17.3.3
Regular Expressions
17.3.4
Beautiful Soup
17.3.5
Selenium
17.3.6
APIs
17.3.7
Specialized Uses and Applications
17.4
Examples
17.4.1
Wikipedia
17.4.2
Weather Data
17.4.3
CFL Play-by-Play
17.4.4
Bad HTML
17.4.5
Extracting Text from a PDF File
17.4.6
YouTube Titles
17.5
Exercises
18
Bayesian Data Analysis
18.1
Plausible Reasoning
18.1.1
Rules of Probability
18.1.2
Bayes’ Theorem
18.1.3
Bayesian Inference Basics
18.1.4
Bayesian Data Analysis
18.2
Examples
18.2.1
The Mysterious Coin
18.2.2
The Salary Question
18.2.3
Money (Dollar Bill Y’All)
18.3
Prior Distributions
18.3.1
Conjugate Priors
18.3.2
Uninformative Priors
18.3.3
Informative Priors
18.3.4
Maximum Entropy Priors
18.4
Posterior Distributions
18.4.1
High-Density Intervals
18.4.2
Markov Chain Monte Carlo Methods
18.4.3
Metropolis-Hastings Algorithm
18.5
Additional Topics
18.5.1
Uncertainty
18.5.2
Bayesian A/B Testing
18.6
Exercises
19
Queueing Systems
19.1
Background
19.2
Terminology
19.2.1
Input/Arrival Process
19.2.2
Output/Service Process
19.2.3
Queue Discipline
19.2.4
Method Used by Arrivals to Join Queue
19.3
Queueing Theory Framework
19.3.1
Kendall-Lee Notation
19.3.2
Birth-Death Processes
19.3.3
Little’s Queueing Formula
19.4
M/M/1 Queueing Systems
19.4.1
Basics
19.4.2
Limited Capacity
19.5
M/M/c Queueing Systems
19.6
Exercises
References
Data Understanding, Data Analysis, and Data Science - Course Notes (DRAFT)
Datasets
The various datasets are available here.