Module 16 Anomaly Detection and Outlier Analysis

by Patrick Boily, with contributions from Youssouph Cissokho, Soufiane Fadel, and Richard Millson

With the advent of automatic data collection, it is now possible to store and process large troves of data. There are technical issues associated with massive data sets, such as the speed and efficiency of analytical methods, but there are also problems related to the detection of anomalous observations and the analysis of outliers.

Extreme and irregular values behave very differently from the majority of observations. For instance, they can represent criminal attacks, fraud attempts, targeted attacks, or data collection errors. As a result, anomaly detection and outlier analysis play a crucial role in cybersecurity, quality control, etc. The (potentially) heavy human cost and technical consequences related to the presence of such observations go a long way towards explaining why the topic has attracted attention in recent years.

In this module, we review a variety of detection methods, paying particular attention to both supervised and unsupervised approaches.


16.1 Overview
     16.1.1 Basic Notions and Concepts
     16.1.2 Statistical Learning Framework
     16.1.3 Motivating Example

16.2 Quantitative Approaches
     16.2.1 Distance Methods
     16.2.2 Density Methods

16.3 Qualitative Approaches
     16.3.1 Attribute Value Frequency Algorithm
     16.3.2 Greedy Algorithm

16.4 Anomalies in High-Dimensional Data
     16.4.1 Definitions and Challenges
     16.4.2 Projection Methods
     16.4.3 Subspace Methods
     16.4.4 Ensemble Methods

16.5 Exercises