16.1 Overview
[This section is an extension of Section 8.5.]
Isaac Asimov, the prolific American author, once wrote that
The most exciting phrase to hear […], the one that heralds the most discoveries, is not “Eureka!” but “That’s funny…”.
However, anomalous observations are not only harbingers of great scientific discoveries – unexpected observations can spoil analyses or be indicative of the presence of issues related to data collection or data processing.
Either way, it becomes imperative for decision-makers and analysts to establish anomaly detection protocols, and to identify strategies for dealing with such observations.
16.1.1 Basic Notions and Concepts
Outlying observations are data points which are atypical in comparison to the unit’s remaining features (within-unit), or in comparison to the measurements for other units (between-units), or as part of a collective subset of observations. Outliers are thus observations which are dissimilar to other cases or which contradict known dependencies or rules.^{320}
Observations could be anomalous in one context, but not in another. Consider, for instance, an adult male who is 6 feet tall. Such a man would fall in the 86th percentile among Canadian males [156], which, while on the tall side, is not unusual; in Bolivia, however, the same man would land in the 99.9th percentile [156], which would mark him as extremely tall and quite dissimilar to the rest of the population. Anomaly detection points towards interesting questions for analysts and subject matter experts: in this case, why is there such a large discrepancy in the two populations?
In practice, an outlier/anomalous observation may arise as
a “bad” object/measurement: data artifacts, spelling mistakes, poorly imputed values, etc.
a misclassified observation: according to the existing data patterns, the observation should have been labeled differently;
an observation whose measurements lie in the distribution tails of a sufficiently large number of features;
an unknown unknown: a completely new type of observation whose existence was heretofore unsuspected.
A common mistake that analysts make when dealing with outlying observations is to remove them from the dataset without carefully studying whether they are influential data points, that is, observations whose absence leads to markedly different analysis results.
When influential observations are identified, remedial measures (such as data transformation strategies) may need to be applied to minimize any undue effect. Note that outliers may be influential, and influential data points may be outliers, but the conditions are neither necessary nor sufficient.
Anomaly Detection
By definition, anomalies are infrequent and typically shrouded in uncertainty due to their relatively low numbers, which makes it difficult to differentiate them from banal noise or data collection errors.
Furthermore, the boundary between normal and deviant observations is usually fuzzy; with the advent of e-shops, for instance, a purchase which is recorded at 3AM local time does not necessarily raise a red flag anymore.
When anomalies are actually associated with malicious activities, they are more often than not disguised in order to blend in with normal observations, which obviously complicates the detection process. Numerous methods exist to identify anomalous observations; none of them are foolproof and judgement must be used.
Methods that employ graphical aids (such as boxplots, scatterplots, scatterplot matrices, and 2D tours) to identify outliers are particularly easy to implement, but a low-dimensional setting is usually required for ease of interpretability. These methods usually find the anomalies that shout the loudest [318].
Analytical methods also exist (using Cook’s or Mahalanobis’ distances, say), but in general some additional level of analysis must be performed, especially when trying to identify influential observations (cf. leverage). With small datasets, anomaly detection can be conducted on a case-by-case basis, but with large datasets, the temptation to use automated detection/removal is strong – care must be exercised before the analyst decides to go down that route. This stems partly from the fact that once the “anomalous” observations have been removed from the dataset, previously “regular” observations can become anomalous in turn in the smaller dataset; it is not clear when that runaway train will stop.
In the early stages of anomaly detection, simple data analyses (such as descriptive statistics, 1- and 2-way tables, and traditional visualizations) may be performed to help identify anomalous observations, or to obtain insights about the data, which could eventually lead to modifications of the analysis plan.^{321}
How are outliers actually detected? Most methods come in one of two flavours: supervised and unsupervised (we will discuss those in detail in later sections).
Supervised learning (SL) methods use a historical record of labeled (that is to say, previously identified) anomalous observations to build a predictive classification or regression model which estimates the probability that a unit is anomalous; domain expertise is required to tag the data.
Since anomalies are typically infrequent, these models often also have to accommodate the rare occurrence (or class imbalance) problem.
Supervised models are built to minimize a cost function; in default settings, it is often the case that the misclassification cost is assumed to be symmetrical, which can lead to technically correct but useless solutions. For instance, the vast majority (99.999+%) of air passengers emphatically do not bring weapons with them on flights; a model that predicts that no passenger is attempting to smuggle a weapon on board a flight would be 99.999+% accurate, but it would miss the point completely.
For the security agency, the cost of wrongly thinking that a passenger is:
smuggling a weapon \(\Longrightarrow\) cost of a single search;
NOT smuggling a weapon \(\Longrightarrow\) catastrophe (potentially).
The wrongly targeted individuals may have a … somewhat different take on this, however, from both a societal and a personal perspective.
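The asymmetry can be made concrete with a toy expected-cost computation (a Python sketch for illustration; the prevalence and the cost values `c_fn` and `c_fp` are invented, hypothetical quantities):

```python
def expected_cost(fn_rate, fp_rate, prevalence, c_fn, c_fp):
    """Expected per-passenger cost of a screening rule: missing a
    smuggler costs c_fn, a needless search costs c_fp."""
    return prevalence * fn_rate * c_fn + (1 - prevalence) * fp_rate * c_fp

# "no passenger smuggles" rule: 99.999% accurate, but misses every smuggler
cost_ignore = expected_cost(fn_rate=1.0, fp_rate=0.0,
                            prevalence=1e-5, c_fn=1_000_000, c_fp=1.0)
# "search everyone" rule: almost always "wrong", yet far cheaper here
cost_search_all = expected_cost(fn_rate=0.0, fp_rate=1.0,
                                prevalence=1e-5, c_fn=1_000_000, c_fp=1.0)
```

With these made-up costs, the do-nothing rule is roughly ten times more expensive in expectation than searching everyone, even though its accuracy is vastly higher – which is why symmetric misclassification costs can lead to technically correct but useless solutions.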
Unsupervised methods, on the other hand, use no previously labeled information (anomalous/non-anomalous) or data, and try to determine if an observation is an outlying one solely by comparing its behaviour to that of the other observations.
As an example, if all participants in a workshop except for one can view the video conference lectures, then the one individual/internet connection/computer is anomalous – it behaves in a manner which is different from the others.
It is very important to note that this DOES NOT mean that the different behaviour is the one we are actually interested in/searching for! In Figure 16.1, perhaps we were interested in the slightly larger red fish that swims in a different direction than the rest of the school, but perhaps we were really interested in the regularsized teal fish that swims in the same direction as the others but that has orange eyes (can you spot it?).
Outlier Tests
The following traditional methods and tests of outlier detection fall into this category:^{322}
Perhaps the most commonly-used test is Tukey’s boxplot test; for normally distributed data, regular observations typically lie between the inner fences \[Q_1-1.5(Q_3-Q_1) \quad\mbox{and}\quad Q_3+1.5(Q_3-Q_1).\] Suspected outliers lie between the inner fences and their respective outer fences \[Q_1-3(Q_3-Q_1) \quad\mbox{and}\quad Q_3+3(Q_3-Q_1).\] Points beyond the outer fences are identified as outliers (\(Q_1\) and \(Q_3\) represent the data’s \(1^{\textrm{st}}\) and \(3^{\textrm{rd}}\) quartiles, respectively; see Figure 8.5). A concrete example is provided in Section 8.5.2.
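The fences are easy to compute directly; here is a small sketch (in Python for illustration, although the module’s own examples use R; note that quartile conventions vary slightly across implementations, which shifts the fences a little):

```python
import statistics

def tukey_fences(data):
    """Inner and outer Tukey fences, plus the suspected outliers
    (between the fences) and the outliers (beyond the outer fences).
    Quartiles use the 'inclusive' convention."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    suspected = [x for x in data
                 if (outer[0] <= x < inner[0]) or (inner[1] < x <= outer[1])]
    outliers = [x for x in data if x < outer[0] or x > outer[1]]
    return inner, outer, suspected, outliers
```

For `list(range(1, 12)) + [30]`, say, only the value 30 lands beyond the outer fences.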
The Grubbs test is another univariate test, which takes into consideration the number of observations in the dataset: \[H_0:\ \text{no outlier in the data}\quad \text{against}\quad H_1:\ \text{exactly one outlier in the data}.\] Let \(x_i\) be the value of feature \(X\) for the \(i^{\textrm{th}}\) unit, \(1\leq i\leq n\), let \((\overline{x},s_x)\) be the mean and standard deviation of feature \(X\), let \(\alpha\) be the desired significance level, and let \(T(\frac{\alpha}{2n},n-2)\) be the critical value of the Student \(t\)-distribution with \(n-2\) degrees of freedom at significance \(\frac{\alpha}{2n}\). The test statistic is \[G=\frac{\max\{|x_i-\overline{x}|: i=1,\ldots, n\}}{s_x}=\frac{|x_{i^*}-\overline{x}|}{s_x}.\] Under \(H_0\), \(G\) follows a special distribution with critical value \[\ell(\alpha;n)=\frac{n-1}{\sqrt{n}}\sqrt{\frac{T^2(\frac{\alpha}{2n},n-2)}{n-2+T^2(\frac{\alpha}{2n},n-2)}}.\] At significance level \(\alpha\in (0,1)\), we reject the null hypothesis \(H_0\) in favour of the alternative hypothesis that \(x_{i^*}\) is the unique outlier along feature \(X\) if \(G\geq\ell(\alpha;n)\). If we are looking for more than one outlier, it can be tempting to classify every observation \(\mathbf{x}_i\) for which \[\frac{|x_i-\overline{x}|}{s_x}\geq\ell(\alpha;n) \] as an outlier, but this approach is contraindicated.
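The statistic and the critical value are simple to compute once a Student \(t\) quantile is available from a table or a statistics library (an illustrative Python sketch; the module’s examples use R):

```python
import math
import statistics

def grubbs_statistic(xs):
    """G = max_i |x_i - xbar| / s_x (sample standard deviation);
    also returns the index i* of the most extreme observation."""
    xbar = statistics.mean(xs)
    s = statistics.stdev(xs)
    devs = [abs(x - xbar) for x in xs]
    i_star = max(range(len(xs)), key=lambda i: devs[i])
    return devs[i_star] / s, i_star

def grubbs_critical(n, t_crit):
    """Critical value ell(alpha; n); t_crit is the upper critical value
    of the Student t distribution with n-2 degrees of freedom at
    significance alpha/(2n), obtained from a table or a library
    (e.g. scipy.stats.t.ppf(1 - alpha/(2*n), n - 2))."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))
```

For `[1, 2, 3, 4, 100]`, \(G\approx 1.79\), attained at the last observation.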
Other common tests include:
the Mahalanobis distance, which is linked to the leverage of an observation (a measure of influence), can also be used to find multidimensional outliers, when all relationships are linear (or nearly linear);
the Tietjen-Moore test, which is used to find a specific number of outliers;
the generalized extreme studentized deviate test, if the number of outliers is unknown;
the chi-square test, when outliers affect the goodness-of-fit, as well as
DBSCAN and other clustering-based outlier detection methods;
visual outlier detection (see Section 8.5.3 for some simple examples).
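As a quick illustration of the Mahalanobis distance approach mentioned above, in the bivariate case the inverse covariance can be worked out by hand (a Python sketch; in practice a robust location/scatter estimate, such as MCD, would be preferable since outliers contaminate the sample mean and covariance):

```python
import statistics

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-d point to the sample
    mean, using the sample covariance and its explicit 2x2 inverse."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # assumed nonzero (non-degenerate data)
    return [((x - mx) ** 2 * syy
             - 2 * (x - mx) * (y - my) * sxy
             + (y - my) ** 2 * sxx) / det
            for x, y in points]
```

For five points of which four lie on the line \(y=x\), the off-line point earns the largest squared distance, even though it is not extreme in either coordinate taken separately.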
16.1.2 Statistical Learning Framework
Fraudulent behaviour is not always easily identifiable, even after the fact. Credit card fraudsters, for instance, will try to disguise their transactions as regular and banal rather than outlandish, in order to fool human observers into confusing what is merely plausible with what is probable (or, at least, not improbable).
At its most basic level, anomaly detection is a problem in applied probability: if \(I\) denotes what is known about the dataset (behaviour of individual observations, behaviour of observations as a group, anomalous/normal verdict for a number of similar observations, etc.), is \[P(\text{observation is anomalous}\mid I) > P(\text{observation is not anomalous}\mid I)?\]
Anomaly detection models usually assume stationarity for normal observations, which is to say, that the underlying mechanism that generates data does not change in a substantial manner over time, or, if it does, that its rate of change (or cyclicity) is known.
A Time Series Detour
For time series data, this means that it may be necessary to first perform trend and seasonality extraction. Information on these topics can be obtained in Module ??.
Example: Supply chains play a crucial role in the transportation of goods from one part of the world to another. As the saying goes, “a given chain is only as strong as its weakest link” – in a multimodal context, comparing the various transportation segments is far from an obvious endeavour: if shipments departing Shanghai in February 2013 took two more days, on average, to arrive in Vancouver than those departing in July 2017, can it be said with any certainty that the shipping process has improved in the intervening years? Are February departures always slower to cross the Pacific Ocean? Are either of the Feb 2013 or the July 2017 performances anomalous?
The seasonal variability of performance is relevant to supply chain monitoring; the ability to quantify and account for the severity of its impact on the data is thus of great interest.
One way to tackle this problem is to produce an index to track container transit times. This index should depict the reliability and the variability of transit times but in such a way as to be able to allow for performance comparison between differing time periods. To simplify the discussion, assume that the ultimate goal is to compare quarterly and/or monthly performance data, irrespective of the transit season, in order to determine how well the network is performing on the Shanghai \(\to\) Port Metro Vancouver/Prince Rupert \(\to\) Toronto corridor, say.
The supply chain under investigation has Shanghai as the point of origin of shipments, with Toronto as the final destination; the containers enter the country either through Vancouver or Prince Rupert. Containers leave their point of origin by boat, arrive and dwell in either of the two ports before reaching their final destination by rail.
For each of the three segments (Marine Transit, Port Dwell, Rail Transit), the data consists of the monthly empirical distribution of transit times, built from subsamples (assumed to be randomly selected and fully representative) of all containers entering the appropriate segment. Each segment’s performance is measured using fluidity indicators (in this case, compiled at a monthly scale), which are computed using various statistics of the transit/dwelling time distributions for each of the supply chain segments, such as:
 Reliability Indicator (RI)

the ratio of the 95\(^{\text{th}}\) percentile to the 5\(^{\text{th}}\) percentile of transit/dwelling times (a high RI indicates high volatility, whereas a low RI \((\approx 1)\) indicates a reliable corridor);
 Buffer Index (BI)

the ratio of the positive difference between the 95\(^{\text{th}}\) percentile and the mean, to the mean. A small BI \((\approx 0)\) indicates only slight variability in the upper (longer) transit/dwelling times; a large BI indicates that the variability of the longer transit/dwelling times is high, and that outliers might be found in that domain;
 Coefficient of Variation (CV)

the ratio of the standard deviation of transit/dwelling times to the mean transit/dwelling time.
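For a given month and segment, the three indicators can be computed from a sample of transit/dwelling times as follows (a Python sketch for illustration; percentile conventions vary across implementations, and the official indices may be compiled differently):

```python
import statistics

def fluidity_indicators(times):
    """Reliability Indicator (RI), Buffer Index (BI) and Coefficient of
    Variation (CV) for a sample of monthly transit/dwelling times."""
    pct = statistics.quantiles(times, n=20, method="inclusive")
    p5, p95 = pct[0], pct[-1]        # 5th and 95th percentiles
    m = statistics.mean(times)
    ri = p95 / p5                    # ~1 for a reliable corridor
    bi = max(p95 - m, 0.0) / m       # variability of the longer times
    cv = statistics.stdev(times) / m
    return ri, bi, cv
```

A perfectly regular corridor (identical times every month) yields \(RI=1\), \(BI=0\) and \(CV=0\); adding even a single long transit pushes all three indicators up.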
The time series of monthly indicators (which are derived from the monthly transit/dwelling time distributions in each segment) are then decomposed into their
trend;
seasonal component (seasonality, tradingday, movingholiday), and
irregular component.
The trend and the seasonal components provide the expected behaviour of the indicator time series;^{323} the irregular component arises as a consequence of supply chain volatility. A high irregular component at a given time point indicates a poor performance against expectations for that month, which is to say, an anomalous observation.
In general, the decomposition follows a model which is
multiplicative;
additive, or
pseudo-additive.
The choice of a model is driven by data behaviour and choice of assumptions; the X12 model automates some of the aspects of the decomposition, but manual intervention and diagnostics are still required.^{324} The additive model, for instance, assumes that:
the seasonal component \(S_t\) and the irregular component \(I_t\) are independent of the trend \(T_t\);
the seasonal component \(S_t\) remains stable from year to year; and
the seasonal fluctuations sum to zero over a year: \(\sum_{j=1}^{12} S_{t+j}=0\) for all \(t\).
Mathematically, the model is expressed as: \[O_t = T_t + S_t + I_t.\] All components share the same dimensions and units. After seasonal adjustment, the seasonally adjusted series is: \[SA_t = O_t - S_t = T_t + I_t.\] The multiplicative and pseudo-additive models are defined in similar ways (again, consult Module ?? for details).^{325}
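The mechanics of the additive model can be sketched with a classical decomposition, a much simpler relative of X12 (illustrative Python, not the SAS/X12 machinery used below: trend via a \(2\times 12\) centered moving average, seasonal factors as recentred per-month means of the detrended series):

```python
import statistics

def additive_decomposition(series, period=12):
    """Classical additive decomposition O = T + S + I. Returns the
    trend (None where the moving average is undefined), the seasonal
    factors (summing to ~0 over a year), and the adjusted series SA."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        w = series[t - half:t + half + 1]
        # 2x12 centered moving average: halve the two endpoint weights
        trend[t] = (w[0] / 2 + sum(w[1:-1]) + w[-1] / 2) / period
    detrended = [(series[t] - trend[t], t % period)
                 for t in range(n) if trend[t] is not None]
    raw = [statistics.mean([d for d, s in detrended if s == j])
           for j in range(period)]
    bar = statistics.mean(raw)
    seasonal = [s - bar for s in raw]          # recentred to sum to zero
    sa = [series[t] - seasonal[t % period] for t in range(n)]
    return trend, seasonal, sa
```

On a series built from a linear trend plus a fixed monthly pattern, this recovers the pattern exactly and the adjusted series reduces to the trend; real series, of course, also carry an irregular component.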
The data decomposition/preparation process is illustrated with the 40-month time series of marine transit CVs from 2010–2013, whose values are shown in Figure 16.5. The size of the peaks and troughs seems fairly constant with respect to the changing trend; the SAS implementation of X12 agrees with that assessment and suggests the additive decomposition model, with no need for further data transformations.
The diagnostic plots are shown in Figure 16.6: the CV series is prior-adjusted from the beginning until OCT2010, after the detection of a level shift. The SI (Seasonal-Irregular) chart shows that more than one of the irregular components exhibits volatility.
The adjusted series is shown in Figure 16.7; the trend and irregular components are also shown separately for readability. It is on the irregular component that anomaly detection would be conducted.
This example showcases the importance of domain understanding and data preparation to the anomaly detection process. Given that the vast majority of observations in a general problem are typically “normal”, another conceptually important approach is to view anomaly detection as a rare occurrence learning classification problem or as a novelty detection data stream problem (we have discussed the former in Module 13; the latter will be tackled in Module ??).
Either way, while there a number of strategies that use regular classification/clustering algorithms for anomaly detection, they are rarely successful unless they are adapted or modified for the anomaly detection context.
Basic Concepts
A generic system (such as the monthly transit times from the previous subsection, say) may be realized in normal states or in abnormal states. Normality, perhaps counterintuitively, is not confined to finding the most likely state, however, as infrequently occurring states could still be normal or plausible under some interpretation of the system.
As the authors of [319] see it, a system’s states are the results of processes or behaviours that follow certain natural rules and broad principles; the observations are a manifestation of these states. Data, in general, allows for inferences to be made about the underlying processes, which can then be tested or invalidated by the collection of additional data. When the inputs are perturbed, the corresponding outputs are likely to be perturbed as well; if anomalies arise from perturbed processes, being able to identify when the process is abnormal, that is to say, being able to capture the various normal and abnormal processes, may lead to useful anomaly detection.
Any supervised anomaly detection algorithm requires a training set of historical labeled data (which may be costly to obtain) on which to build the prediction model, and a testing set on which to evaluate the model’s performance in terms of True Positives (\(\text{TP}\), detected anomalies that actually arise from process abnormalities); True Negatives (\(\text{TN}\), predicted normal observations that indeed arise from normal processes); False Positives (\(\text{FP}\), detected anomalies corresponding to regular processes), and False Negatives (\(\text{FN}\), predicted normal observations that are in fact the product of an abnormal process).
As discussed previously, the rare occurrence problem makes optimizing for maximum accuracy \[a=\frac{\text{TN}+\text{TP}}{\text{TN}+\text{TP}+\text{FN}+\text{FP}}\] a losing strategy; instead, algorithms attempt to minimize the FP rate and the FN rate under the assumption that the cost of making a false negative error could be substantially higher than the cost of making a false positive error.
Assume that for a testing set with \(\delta=\text{FN}+\text{TP}\) true outliers, an anomaly detection algorithm identifies \(\mu=\text{FP}+\text{TP}\) suspicious observations, of which \(\nu=\text{TP}\) are known to be true outliers. Performance evaluation in this context is often measured using:
Precision is the proportion of true outliers among the suspicious observations \[p=\frac{\nu}{\mu}=\frac{\text{TP}}{\text{FP}+\text{TP}};\] when most of the observations identified by the algorithm are true outliers, \(p\approx 1\);
Recall is the proportion of true outliers detected by the algorithm \[r=\frac{\nu}{\delta}=\frac{\text{TP}}{\text{FN}+\text{TP}};\] when most of the true outliers are identified by the algorithm, \(r\approx 1\);
The \(F_1\)score is the harmonic mean of the algorithm’s precision and its recall \[F_1=\frac{2pr}{p+r}=\frac{2\text{TP}}{2\text{TP}+\text{FP}+\text{FN}}.\]
One drawback of using precision, recall, and the \(F_1\)score is that they do not incorporate \(\text{TN}\) in the evaluation process, but this is unlikely to be problematic as regular observations that are correctly seen as unsuspicious are not usually the observations of interest.^{326}
Example: consider a test dataset \(\text{Te}\) with 5000 observations, 100 of which are anomalous. An algorithm which predicts all observations to be anomalous would score \(a=p=0.02\), \(r=1\), and \(F_1\approx 0.04\), whereas an algorithm that detects 10 of the true outliers would score \(r=0.1\) (the other metric values would change according to the \(\text{TN}\) and \(\text{FN}\) counts).
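These computations are straightforward (a Python sketch reproducing the example’s numbers):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    a = (tp + tn) / (tp + fp + fn + tn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return a, p, r, f1

# the "everything is anomalous" algorithm on Te (100 anomalies in 5000)
a, p, r, f1 = classification_metrics(tp=100, fp=4900, fn=0, tn=0)
```

This yields \(a=p=0.02\), \(r=1\), and \(F_1\approx 0.039\), matching the example.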
Supervised models are discussed in Modules 12 and 13.
Another supervised approach is to estimate the relative abnormality of various observations: it is usually quite difficult to estimate the probability that an observation \(\mathbf{x}_1\) is anomalous with any certainty, but it might be possible to determine that it is more likely to be anomalous than another observation \(\mathbf{x}_2\), say (denoted by \(\mathbf{x}_1\succeq \mathbf{x}_2\)).
This paradigm allows the suspicious observations to be ranked; let \(k_i\in\{1,\ldots,\mu\}\) be the rank of the \(i^{\text{th}}\) true outlier, \(i\in \{1,\ldots,\nu\}\), in the sorted list of suspicious observations \[\mathbf{x}_1\succeq \mathbf{x}_{k_1}\succeq \cdots\succeq\mathbf{x}_{k_i}\succeq \cdots \succeq \mathbf{x}_{k_\nu}\succeq \mathbf{x}_\mu;\] the rank power of the algorithm is \[RP=\frac{\nu(\nu+1)}{2\sum_{i=1}^\nu k_i}.\]
When the \(\delta\) actual anomalies are ranked in (or near) the top \(\delta\) suspicious observations, \(\text{RP}\approx 1\). RP is well-defined only when \(\mu\geq \delta\); as with most performance evaluation metrics, a single raw number is meaningless – it is in comparison with the performance of other algorithms that it is most useful.
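A sketch of the rank power computation (Python, for illustration):

```python
def rank_power(ranks):
    """Rank power of a detector. ranks: the 1-based positions of the
    nu true outliers in the list of suspicious observations, sorted
    from most to least suspicious; RP = nu(nu+1) / (2 * sum(ranks))."""
    nu = len(ranks)
    return nu * (nu + 1) / (2 * sum(ranks))
```

A detector that places three true outliers at ranks 1, 2, 3 achieves \(RP=1\); pushing the third one down to rank 10 drops this to \(RP=6/13\approx 0.46\).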
Other SL performance evaluation metrics include:
AUC – the probability of ranking a randomly chosen anomaly higher than a randomly chosen normal observation (higher is better);
probabilistic AUC – a calibrated version of AUC.
The rare occurrence problem can be tackled by using:
a manipulated training set (oversampling, undersampling, generating artificial instances);
specific SL AD algorithms (CREDOS, PN, SHRINK);
boosting algorithms (SMOTEBoost, RareBoost);
cost-sensitive classifiers (MetaCost, AdaCost, CSB, SSTBoost),
etc. [321]
The rare (anomalous) class can be oversampled by duplicating the rare events until the dataset is balanced (roughly the same number of anomalies and normal observations). This does not increase the overall level of information, but it does increase the total misclassification cost incurred on the rare class, nudging the classifier towards it.
The majority class (normal observations) can also be undersampled by randomly removing:
“near miss” observations or
observations far from anomalous observations.
Some loss of information is to be expected, as are “overly general” rules. Common strategies are illustrated in Figures 16.10 and 16.11.
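A minimal random oversampler can be sketched as follows (illustrative Python; SMOTE-style synthetic instances and the “near miss” undersampling variants are more refined relatives of this idea):

```python
import random

def oversample_minority(X, y, seed=0):
    """Balance a binary dataset (labels 0 = normal, 1 = anomalous) by
    duplicating minority-class rows, sampled with replacement, until
    both classes have the same number of observations."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(y) if lab == 1]
    majority = [i for i, lab in enumerate(y) if lab == 0]
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

Starting from 8 normal and 2 anomalous observations, the balanced output contains 8 of each; no new information is created, only duplicated weight on the rare class.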
Another modern approach rests on the concept of dimension reduction (see Module 15); autoencoders learn a compressed representation of the data. In a sense, the reconstruction error measures how much information is lost in the compression.
Anomaly detection algorithms can be applied to the compressed data:
look for anomalous patterns, and/or
anomalous reconstruction errors.
In the example of Figure 16.12, one observation is anomalous because its compressed representation does not follow the pattern of the other 8 observations, whereas another observation is anomalous because its reconstruction error is substantially higher than that of the other 8 observations (can you hazard a guess as to which one is which?). Autoencoders will be presented in more detail in the Module on Deep Learning.
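While autoencoders themselves are deferred to that module, the reconstruction-error idea can already be illustrated with their simplest linear analogue: a rank-1 projection onto the leading principal direction (an illustrative Python sketch on 2-d data, not the figure’s actual model; the direction is found by power iteration):

```python
def reconstruction_errors(points, iters=100):
    """Rank-1 linear 'autoencoder' on 2-d data: compress each centered
    point to its coordinate along the leading principal direction and
    return the squared reconstruction errors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    c = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in c) / n
    syy = sum(y * y for _, y in c) / n
    sxy = sum(x * y for x, y in c) / n
    v = (1.0, 0.0)  # power iteration on the 2x2 covariance matrix
    for _ in range(iters):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    errs = []
    for x, y in c:
        code = x * v[0] + y * v[1]          # compressed representation
        rx, ry = code * v[0], code * v[1]   # decoded point
        errs.append((x - rx) ** 2 + (y - ry) ** 2)
    return errs
```

A point lying far off the dominant direction of the cloud earns a visibly larger reconstruction error than the rest, which is precisely the signal exploited in the second detection mode above.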
On the unsupervised front, where anomalous/normal labels are not known or used, if anomalies are those observations that are dissimilar to other observations, and if clusters represent groupings of similar observations, then observations that do not naturally fit into a cluster could be potential anomalies (see Figure 11.23 for an illustration).
There are a number of challenges associated to unsupervised anomaly detection, not the least of which being that most clustering algorithms do not recognize potential outliers (DBSCAN is a happy exception) and that some appropriate measure of similarity/dissimilarity of observations has to be agreed upon (different measures could lead to different cluster assignments, as we have discussed in Module 14).
Finally, it is worth mentioning that the definitions of terms like normal and anomalous are kept purposely vague, to allow for flexibility.
16.1.3 Motivating Example
In this module, we will illustrate the concepts and the algorithms of anomaly detection on an artificial dataset.
Consider a dataset of \(102\) observations in \(\mathbb{R}^4\); the first \(100\) observations \(\mathbf{p}_1,\ldots,\mathbf{p}_{100}\) are drawn from a multivariate \(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\), with \[\boldsymbol{\mu}=(1,-2,0,1),\quad \boldsymbol{\Sigma}=\begin{pmatrix}1 & 0.5 & 0.7 & 0.5 \\ 0.5 & 1 & 0.95 & 0.3 \\ 0.7 & 0.95 & 1 & 0.3 \\ 0.5 & 0.3 & 0.3 & 1\end{pmatrix}.\]
nobs = 100
mu = matrix(rep(c(1,-2,0,1), nobs), nrow=4)
Sigma = matrix(c(1, 0.5, 0.7, 0.5,
0.5, 1, 0.95, 0.3,
0.7, 0.95, 1, 0.3,
0.5, 0.3, 0.3, 1), nrow=4, ncol=4)
We use the Cholesky decomposition of \(\boldsymbol{\Sigma}\) to generate random observations.
L = chol(Sigma)
nvars = dim(L)[1]
set.seed(0) # for replicability
r = t(mu + t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs))
The summary statistics for the 100 “regular” observations are given below:
       x1                x2                x3                  x4
 Min.   :-1.9049   Min.   :-4.4113   Min.   :-2.532449   Min.   :-1.9949
 1st Qu.: 0.3812   1st Qu.:-2.6464   1st Qu.:-0.619032   1st Qu.: 0.3361
 Median : 0.9273   Median :-2.0220   Median :-0.050604   Median : 0.9381
 Mean   : 0.9374   Mean   :-1.9788   Mean   : 0.007144   Mean   : 0.9438
 3rd Qu.: 1.4615   3rd Qu.:-1.4002   3rd Qu.: 0.629557   3rd Qu.: 1.5906
 Max.   : 3.4414   Max.   : 0.5223   Max.   : 2.026547   Max.   : 2.8073
We add two observations \(\mathbf{z}_1=(1,1,1,1)\) and \(\mathbf{z}_4=(4,4,4,4)\) that do not arise from \(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\):
pt.1 = c(1,1,1,1)
pt.2 = c(4,4,4,4)
rdata = rbind(r,pt.1,pt.2)
group = c(rep(1,nobs),2,3)
rdata = cbind(rdata,group)
The complete dataset is displayed below, with \(\mathbf{z}_1\) in pink and \(\mathbf{z}_4\) in green:
But since we will not usually know which observations are “regular” and which are “anomalous”, let us remove the colouring.
Evidently, a visual inspection suggests that there are in fact 3 outliers in the dataset!
Multiple references were consulted in the preparation of this module, in particular [319], [324]. Other good survey documents include [325], [326]. Specific methods and approaches are the focus of other papers: [327]–[329] (highdimensional data), [330] (DOBIN), [331] (outlier ensembles), [332], [333] (isolation forest), [334], [335] (DBSCAN), [336] (LOF), [337]–[341] (subspace method), [342] (time series data). On the practical side of things, we would be remiss if we did not mention [343], but note that there is a plethora of quality tutorials online for anomaly detection in the programming language of your choice.