Module 8 Data Preparation

by Patrick Boily

Once raw data has been collected and stored in a database or a dataset, the focus should shift to data cleaning and processing. This requires testing the data for soundness and fixing errors, designing and implementing strategies to deal with missing values and outlying or influential observations, and conducting preliminary exploratory data analysis and visualization to determine which data transformations and dimension reduction approaches will be needed before any more sophisticated analysis can begin.

In this module, we establish the essential elements of data cleaning and data processing.


8.1 Introduction

8.2 General Principles
     8.2.1 Approaches to Data Cleaning
     8.2.2 Pros and Cons
     8.2.3 Tools and Methods

8.3 Data Quality
     8.3.1 Common Error Sources
     8.3.2 Detecting Invalid Entries

8.4 Missing Values
     8.4.1 Missing Value Mechanisms
     8.4.2 Imputation Methods
     8.4.3 Multiple Imputation

8.5 Anomalous Observations
     8.5.1 Anomaly Detection
     8.5.2 Outlier Tests
     8.5.3 Visual Outlier Detection

8.6 Data Transformations
     8.6.1 Common Transformations
     8.6.2 Box-Cox Transformations
     8.6.3 Scaling
     8.6.4 Discretizing
     8.6.5 Creating Variables

8.7 Example: Algae Blooms
     8.7.1 Problem Description
     8.7.2 Loading the Data
     8.7.3 Summary and Visualization
     8.7.4 Data Cleaning
     8.7.5 Principal Components

8.8 Exercises