Module 15 Feature Selection and Dimension Reduction

by Patrick Boily, with contributions from Olivier Leduc, Andrew Macfie, Aditya Maheshwari, and Maia Pelletier


Data mining is the collection of processes by which we extract useful insights from data. Inherent in this definition is the idea of data reduction: useful insights (whether summaries, sentiment analyses, or other outputs) ought to be “smaller” and “more organized” than the original raw data.

The challenges presented by high-dimensional data (the so-called curse of dimensionality) must be addressed to obtain insightful and interpretable analytical results.

In this module, we introduce the basic principles of dimension reduction and a number of feature selection methods (filter, wrapper, and regularization approaches), and we discuss some advanced topics (singular value decomposition, spectral feature selection, UMAP).

Contents

15.1 Data Reduction for Insight
     15.1.1 Reduction of an NHL Game
     15.1.2 Meaning in Macbeth

15.2 Dimension Reduction
     15.2.1 Sampling Observations
     15.2.2 The Curse of Dimensionality
     15.2.3 Principal Component Analysis
     15.2.4 The Manifold Hypothesis

15.3 Feature Selection
     15.3.1 Filter Methods
     15.3.2 Wrapper Methods
     15.3.3 Subset Selection Methods
     15.3.4 Regularization (Embedded) Methods
15.3.5 Supervised and Unsupervised Feature Selection

15.4 Advanced Topics
     15.4.1 Singular Value Decomposition
     15.4.2 PCA Regression and Partial LS
     15.4.3 Spectral Feature Selection
     15.4.4 UMAP

15.5 Exercises