16.1 Overview

[This section is an extension of Section 8.5.]

Isaac Asimov, the prolific American author, once wrote that

The most exciting phrase to hear […], the one that heralds the most discoveries, is not “Eureka!” but “That’s funny…”.

However, anomalous observations are not only harbingers of great scientific discoveries – unexpected observations can spoil analyses or be indicative of the presence of issues related to data collection or data processing.

Either way, it becomes imperative for decision-makers and analysts to establish anomaly detection protocols, and to identify strategies to deal with such observations.

16.1.1 Basic Notions and Concepts

Outlying observations are data points which are atypical in comparison to the unit’s remaining features (within-unit), or in comparison to the measurements for other units (between-units), or as part of a collective subset of observations. Outliers are thus observations which are dissimilar to other cases or which contradict known dependencies or rules.

Observations could be anomalous in one context, but not in another. Consider, for instance, an adult male who is 6-foot tall. Such a man would fall in the 86th percentile among Canadian males [156], which, while on the tall side, is not unusual; in Bolivia, however, the same man would land in the 99.9th percentile [156], which would mark him as extremely tall and quite dissimilar to the rest of the population. Anomaly detection points towards interesting questions for analysts and subject matter experts: in this case, why is there such a large discrepancy in the two populations?

In practice, an outlier/anomalous observation may arise as

  • a “bad” object/measurement: data artifacts, spelling mistakes, poorly imputed values, etc.

  • a misclassified observation: according to the existing data patterns, the observation should have been labeled differently;

  • an observation whose measurements are found in the distribution tails, of a large enough number of features;

  • an unknown unknown: a completely new type of observation whose existence was heretofore unsuspected.

A common mistake that analysts make when dealing with outlying observations is to remove them from the dataset without carefully studying whether they are influential data points, that is, observations whose absence leads to markedly different analysis results.

When influential observations are identified, remedial measures (such as data transformation strategies) may need to be applied to minimize any undue effect. Note that outliers may be influential, and influential data points may be outliers, but the conditions are neither necessary nor sufficient.

Anomaly Detection

By definition, anomalies are infrequent and typically shrouded in uncertainty due to their relatively low numbers, which makes it difficult to differentiate them from banal noise or data collection errors.

Furthermore, the boundary between normal and deviant observations is usually fuzzy; with the advent of e-shops, for instance, a purchase which is recorded at 3AM local time does not necessarily raise a red flag anymore.

When anomalies are actually associated with malicious activities, they are more often than not disguised in order to blend in with normal observations, which obviously complicates the detection process. Numerous methods exist to identify anomalous observations; none of them are foolproof and judgement must be used.

Methods that employ graphical aids (such as box-plots, scatterplots, scatterplot matrices, and 2D tours) to identify outliers are particularly easy to implement, but a low-dimensional setting is usually required for ease of interpretability. These methods usually find the anomalies that shout the loudest [318].

Figure 16.1: What jumps at you here?

Analytical methods also exist (using Cook’s or Mahalanobis’ distances, say), but in general some additional level of analysis must be performed, especially when trying to identify influential observations (cf. leverage). With small datasets, anomaly detection can be conducted on a case-by-case basis, but with large datasets, the temptation to use automated detection/removal is strong – care must be exercised before the analyst decides to go down that route. This stems partly from the fact that once the “anomalous” observations have been removed from the dataset, previously “regular” observations can become anomalous in turn in the smaller dataset; it is not clear when that runaway train will stop.
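As a minimal sketch of how influence can be checked analytically, the following base R snippet fits a simple linear regression to synthetic data and computes leverage and Cook's distance. The data (20 points near a line, plus one planted observation) and all variable names are assumptions made purely for illustration.

```r
# Synthetic illustration (hypothetical data): 20 points near the line y = 2x,
# plus one planted observation far from the others in x and off the line in y.
set.seed(1)
d <- data.frame(x = c(1:20, 40),
                y = c(2 * (1:20) + rnorm(20), 0))

fit <- lm(y ~ x, data = d)

# Leverage (hat values) measures how unusual each x value is; Cook's distance
# combines leverage and residual size into a single measure of influence.
lev <- hatvalues(fit)
cd  <- cooks.distance(fit)

# Refit without the most influential observation and compare the slopes.
most.influential <- which.max(cd)
fit.without <- lm(y ~ x, data = d[-most.influential, ])

rbind(with = coef(fit), without = coef(fit.without))
```

Comparing the two sets of coefficients shows directly whether the flagged observation's absence "leads to markedly different analysis results"; whether it should then be removed remains a judgement call.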

In the early stages of anomaly detection, simple data analyses (such as descriptive statistics, 1- and 2-way tables, and traditional visualizations) may be performed to help identify anomalous observations, or to obtain insights about the data, which could eventually lead to modifications of the analysis plan.

How are outliers actually detected? Most methods come in one of two flavours: supervised and unsupervised (we will discuss those in detail in later sections).

Supervised learning (SL) methods use a historical record of labeled (that is to say, previously identified) anomalous observations to build a predictive classification or regression model which estimates the probability that a unit is anomalous; domain expertise is required to tag the data.

Since anomalies are typically infrequent, these models often also have to accommodate the rare occurrence (or class imbalance) problem.

Supervised models are built to minimize a cost function; in default settings, it is often the case that the mis-classification cost is assumed to be symmetrical, which can lead to technically correct but useless solutions. For instance, the vast majority (99.999+%) of air passengers emphatically do not bring weapons with them on flights; a model that predicts that no passenger is attempting to smuggle a weapon on board a flight would be 99.999+% accurate, but it would miss the point completely.

For the security agency, the cost of wrongly thinking that a passenger is:

  • smuggling a weapon \(\Longrightarrow\) cost of a single search;

  • NOT smuggling a weapon \(\Longrightarrow\) catastrophe (potentially).

The wrongly targeted individuals may have a … somewhat different take on this, however, from a societal and personal perspective.

Unsupervised methods, on the other hand, use no previously labeled information (anomalous/non-anomalous) or data, and try to determine if an observation is an outlying one solely by comparing its behaviour to that of the other observations.

As an example, if all participants in a workshop except for one can view the video conference lectures, then the one individual/internet connection/computer is anomalous – it behaves in a manner which is different from the others.

It is very important to note that this DOES NOT mean that the different behaviour is the one we are actually interested in/searching for! In Figure 16.1, perhaps we were interested in the slightly larger red fish that swims in a different direction than the rest of the school, but perhaps we were really interested in the regular-sized teal fish that swims in the same direction as the others but that has orange eyes (can you spot it?).

Outlier Tests

The following traditional methods and tests of outlier detection fall into this category:

  • Perhaps the most commonly-used test is Tukey’s boxplot test; for normally distributed data, regular observations typically lie between the inner fences \[Q_1-1.5(Q_3-Q_1) \quad\mbox{and}\quad Q_3+1.5(Q_3-Q_1).\] Suspected outliers lie between the inner fences and their respective outer fences \[Q_1-3(Q_3-Q_1) \quad\mbox{and}\quad Q_3+3(Q_3-Q_1).\] Points beyond the outer fences are identified as outliers (\(Q_1\) and \(Q_3\) represent the data’s \(1^{\textrm{st}}\) and \(3^{\textrm{rd}}\) quartiles, respectively); see Figure 8.5 (a concrete example is provided in Section 8.5.2).

Figure 8.5: Tukey’s boxplot test; suspected outliers are marked by white disks, outliers by black disks.

  • The Grubbs test is another univariate test, which takes into consideration the number of observations in the dataset: \[H_0:\ \text{no outlier in the data}\quad \text{against}\quad H_1:\ \text{exactly one outlier in the data}.\] Let \(x_i\) be the value of feature \(X\) for the \(i^{\textrm{th}}\) unit, \(1\leq i\leq n\), let \((\overline{x},s_x)\) be the mean and standard deviation of feature \(X\), let \(\alpha\) be the desired significance level, and let \(T(\frac{\alpha}{2n},n)\) be the critical value of the Student \(t\)-distribution with \(n-2\) degrees of freedom at significance \(\frac{\alpha}{2n}\). The test statistic is \[G=\frac{\max\{|x_i-\overline{x}|: i=1,\ldots, n\}}{s_x}=\frac{|x_{i^*}-\overline{x}|}{s_x}.\] Under \(H_0\), \(G\) follows a special distribution with critical value \[\ell(\alpha;n)=\frac{n-1}{\sqrt{n}}\sqrt{\frac{T^2(\frac{\alpha}{2n},n)}{n-2+T^2(\frac{\alpha}{2n},n)}}.\] At significance level \(\alpha\in (0,1)\), we reject the null hypothesis \(H_0\) in favour of the alternative hypothesis that \(x_{i^*}\) is the unique outlier along feature \(X\) if \(G\geq\ell(\alpha;n)\). If we are looking for more than one outlier, it can be tempting to classify every observation \(\mathbf{x}_i\) for which \[\dfrac{|x_i-\overline{x}|}{s_x}\geq\ell(\alpha;n) \] as an outlier, but this approach is contra-indicated.

  • Other common tests include:

    • the Mahalanobis distance, which is linked to the leverage of an observation (a measure of influence), can also be used to find multi-dimensional outliers, when all relationships are linear (or nearly linear);

    • the Tietjen-Moore test, which is used to find a specific number of outliers;

    • the generalized extreme studentized deviate test, if the number of outliers is unknown;

    • the chi-square test, when outliers affect the goodness-of-fit;

    • DBSCAN and other clustering-based outlier detection methods, as well as

    • visual outlier detection (see Section 8.5.3 for some simple examples).
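The two univariate tests described above can be sketched in a few lines of base R. The function names `tukey_flags()` and `grubbs_test()` and the sample vector are hypothetical; the quartiles use R's default quantile convention (other conventions give slightly different fences), and the Student quantile is taken with \(n-2\) degrees of freedom, as in the usual statement of the Grubbs test.

```r
# Tukey's boxplot test: classify each value using the inner and outer fences.
tukey_flags <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
  outer <- c(q[1] - 3 * iqr,   q[2] + 3 * iqr)
  flag <- rep("regular", length(x))
  flag[x < inner[1] | x > inner[2]] <- "suspected"
  flag[x < outer[1] | x > outer[2]] <- "outlier"
  flag
}

# Grubbs test: statistic G and critical value l(alpha; n).
grubbs_test <- function(x, alpha = 0.05) {
  n    <- length(x)
  dev  <- abs(x - mean(x))
  G    <- max(dev) / sd(x)
  t2   <- qt(1 - alpha / (2 * n), df = n - 2)^2
  crit <- (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
  list(G = G, critical = crit, reject = G >= crit, candidate = which.max(dev))
}

x <- c(10:20, 30, 60)   # hypothetical measurements
tukey_flags(x)
grubbs_test(x)
```

Note that `grubbs_test()` only ever nominates a single candidate, in line with the alternative hypothesis of exactly one outlier; applying it repeatedly, or thresholding every observation against the same critical value, is not what the test was designed for.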

16.1.2 Statistical Learning Framework

Fraudulent behaviour is not always easily identifiable, even after the fact. Credit card fraudsters, for instance, will try to disguise their transactions as regular and banal, rather than as outlandish, in order to fool human observers into confusing what is merely plausible with what is probable (or at least, not improbable).

At its most basic level, anomaly detection is a problem in applied probability: if \(I\) denotes what is known about the dataset (behaviour of individual observations, behaviour of observations as a group, anomalous/normal verdict for a number of similar observations, etc.), is \[P(\text{observation is anomalous}\mid I) > P(\text{observation is not anomalous}\mid I)?\]

Anomaly detection models usually assume stationarity for normal observations, which is to say, that the underlying mechanism that generates data does not change in a substantial manner over time, or, if it does, that its rate of change (or cyclicity) is known.

A Time Series Detour

For time series data, this means that it may be necessary to first perform trend and seasonality extraction. Information on these topics can be obtained in Module ??.

Example: Supply chains play a crucial role in the transportation of goods from one part of the world to another. As the saying goes, “a given chain is only as strong as its weakest link” – in a multi-modal context, comparing the various transportation segments is far from an obvious endeavour: if shipments departing Shanghai in February 2013 took two more days, on average, to arrive in Vancouver than those departing in July 2017, can it be said with any certainty that the shipping process has improved in the intervening years? Are February departures always slower to cross the Pacific Ocean? Is either of the February 2013 or the July 2017 performances anomalous?

The seasonal variability of performance is relevant to supply chain monitoring; the ability to quantify and account for the severity of its impact on the data is thus of great interest.

One way to tackle this problem is to produce an index to track container transit times. This index should depict the reliability and the variability of transit times but in such a way as to be able to allow for performance comparison between differing time periods. To simplify the discussion, assume that the ultimate goal is to compare quarterly and/or monthly performance data, irrespective of the transit season, in order to determine how well the network is performing on the Shanghai \(\to\) Port Metro Vancouver/Prince Rupert \(\to\) Toronto corridor, say.

Figure 16.2: Multi-modal supply chain corridor.

The supply chain under investigation has Shanghai as the point of origin of shipments, with Toronto as the final destination; the containers enter the country either through Vancouver or Prince Rupert. Containers leave their point of origin by boat, arrive and dwell in either of the two ports before reaching their final destination by rail.

For each of the three segments (Marine Transit, Port Dwell, Rail Transit), the data consists of the monthly empirical distribution of transit times, built from sub-samples (assumed to be randomly selected and fully representative) of all containers entering the appropriate segment. Each segment’s performance is measured using fluidity indicators (in this case, compiled at a monthly scale), which are computed using various statistics of the transit/dwelling time distributions for each of the supply chain segments, such as:

Reliability Indicator (RI)

the ratio of the 95\(^{\text{th}}\) percentile to the 5\(^{\text{th}}\) percentile of transit/dwelling times (a high RI indicates high volatility, whereas a low RI \((\approx 1)\) indicates a reliable corridor);

Buffer Index (BI)

the ratio of the positive difference between the 95\(^{\text{th}}\) percentile and the mean, to the mean. A small BI \((\approx 0)\) indicates only slight variability in the upper (longer) transit/dwelling times; a large BI indicates that the variability of the longer transit/dwelling times is high, and that outliers might be found in that domain;

Coefficient of Variation (CV)

the ratio of the standard deviation of transit/dwelling times to the mean transit/dwelling time.
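The three indicators can be computed directly from a sample of transit (or dwelling) times. The sketch below assumes a hypothetical function name `fluidity()` and simulated, right-skewed transit times; it is meant as an illustration of the definitions, not as the procedure used to produce the figures in this section.

```r
# Fluidity indicators for one month of transit (or dwelling) times.
fluidity <- function(times) {
  p <- quantile(times, c(0.05, 0.95), names = FALSE)
  m <- mean(times)
  c(RI = p[2] / p[1],            # 95th percentile over 5th percentile
    BI = max(p[2] - m, 0) / m,   # positive part of (95th percentile - mean), over the mean
    CV = sd(times) / m)          # standard deviation over the mean
}

set.seed(2)
times <- rgamma(200, shape = 20, rate = 1)   # hypothetical transit times (in days)
round(fluidity(times), 3)
```

A perfectly regular segment (all transit times equal) would give RI = 1, BI = 0, and CV = 0; the further the indicators move from these values, the more volatile the segment.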

Figure 16.3: Illustration of how to derive the various monthly fluidity indicators.

The time series of monthly indicators (which are derived from the monthly transit/dwelling time distributions in each segment) are then decomposed into their

  • trend;

  • seasonal component (seasonality, trading-day, moving-holiday), and

  • irregular component.

The trend and the seasonal components provide the expected behaviour of the indicator time series; the irregular component arises as a consequence of supply chain volatility. A high irregular component at a given time point indicates a poor performance against expectations for that month, which is to say, an anomalous observation.

Figure 16.4: Conceptual time series decomposition; potential anomalous behaviour should be searched for in the irregular component.

In general, the decomposition follows a model which is

  • multiplicative;

  • additive, or

  • pseudo-additive.

The choice of a model is driven by data behaviour and choice of assumptions; the X12 model automates some of the aspects of the decomposition, but manual intervention and diagnostics are still required. The additive model, for instance, assumes that:

  1. the seasonal component \(S_t\) and the irregular component \(I_t\) are independent of the trend \(T_t\);

  2. the seasonal component \(S_t\) remains stable from year to year; and

  3. the seasonal fluctuations cancel out over any 12-month period: \(\sum_{j=1}^{12} S_{t+j}=0\).

Mathematically, the model is expressed as: \[O_t = T_t + S_t + I_t.\] All components share the same dimensions and units. After seasonality adjustment, the seasonality adjusted series is: \[SA_t = O_t - S_t = T_t + I_t.\] The multiplicative and pseudo-additive models are defined in similar ways (again, consult Module ?? for details).
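The additive model can be illustrated with base R's `decompose()`, which performs a classical moving-average decomposition; this is a much simpler procedure than X12 and is swapped in here only to make the three components concrete. The series below is synthetic (a linear trend, a 12-month seasonal pattern, small noise, and one planted shock), and every numerical value in it is an assumption.

```r
# Synthetic monthly indicator (hypothetical values), January 2010 to December 2013.
set.seed(3)
n <- 48
month <- 1:n
observed <- 0.5 + 0.005 * month +            # trend
  0.1 * sin(2 * pi * month / 12) +           # seasonal pattern
  rnorm(n, sd = 0.01)                        # noise
observed[30] <- observed[30] + 0.3           # planted shock (June 2012)
O <- ts(observed, start = c(2010, 1), frequency = 12)

# Classical additive decomposition: O_t = T_t + S_t + I_t.
dec <- decompose(O, type = "additive")

# Seasonally adjusted series: SA_t = O_t - S_t.
SA <- O - dec$seasonal

# The estimated seasonal effects sum to (numerically) zero over a year.
sum(dec$figure)

# The irregular component is NA for the first and last six months, because the
# trend is estimated with a centred moving average; which.max() skips the NAs.
suspect <- which.max(abs(dec$random))
time(O)[suspect]
```

The month with the largest irregular component (in absolute value) is the natural first candidate for closer inspection.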

The data decomposition/preparation process is illustrated with the 40-month time series of marine transit CVs from 2010-2013, whose values are shown in Figure 16.5. The size of the peaks and troughs seems fairly constant with respect to the changing trend; the SAS implementation of X12 agrees with that assessment and suggests the additive decomposition model, with no need for further data transformations.

Figure 16.5: Marine transit CV data, from 2010 to 2013.

The diagnostic plots are shown in Figure 16.6: the CV series is prior-adjusted from the beginning until OCT2010 after the detection of a level shift. The SI (Seasonal Irregular) chart shows that more than one irregular component exhibits volatility.

Figure 16.6: Diagnostic plot for marine transit CV data, from 2010 to 2013 (left); SI chart (right).

The adjusted series is shown in Figure 16.7; the trend and irregular components are also shown separately for readability. It is on the irregular component that anomaly detection would be conducted.

Figure 16.7: Adjusted plot for marine transit CV data, from 2010 to 2013.

This example showcases the importance of domain understanding and data preparation to the anomaly detection process. Given that the vast majority of observations in a general problem are typically “normal”, another conceptually important approach is to view anomaly detection as a rare occurrence learning classification problem or as a novelty detection data stream problem (we have discussed the former in Module 13; the latter will be tackled in Module ??).

Either way, while there are a number of strategies that use regular classification/clustering algorithms for anomaly detection, they are rarely successful unless they are adapted or modified for the anomaly detection context.

Basic Concepts

A generic system (such as the monthly transit times from the previous subsection, say) may be realized in normal states or in abnormal states. Normality, perhaps counter-intuitively, is not confined to finding the most likely state, however, as infrequently occurring states could still be normal or plausible under some interpretation of the system.

As the authors of [319] see it, a system’s states are the results of processes or behaviours that follow certain natural rules and broad principles; the observations are a manifestation of these states. Data, in general, allows for inferences to be made about the underlying processes, which can then be tested or invalidated by the collection of additional data. When the inputs are perturbed, the corresponding outputs are likely to be perturbed as well; if anomalies arise from perturbed processes, being able to identify when the process is abnormal, that is to say, being able to capture the various normal and abnormal processes, may lead to useful anomaly detection.

Any supervised anomaly detection algorithm requires a training set of historical labeled data (which may be costly to obtain) on which to build the prediction model, and a testing set on which to evaluate the model’s performance in terms of True Positives (\(\text{TP}\), detected anomalies that actually arise from process abnormalities); True Negatives (\(\text{TN}\), predicted normal observations that indeed arise from normal processes); False Positives (\(\text{FP}\), detected anomalies corresponding to regular processes), and False Negatives (\(\text{FN}\), predicted normal observations that are in fact the product of an abnormal process).

Figure 16.8: Confusion matrix for an anomaly detection problem.

As discussed previously, the rare occurrence problem makes optimizing for maximum accuracy \[a=\frac{\text{TN}+\text{TP}}{\text{TN}+\text{TP}+\text{FN}+\text{FP}}\] a losing strategy; instead, algorithms attempt to minimize the FP rate and the FN rate under the assumption that the cost of making a false negative error could be substantially higher than the cost of making a false positive error.

Assume that for a testing set with \(\delta=\text{FN}+\text{TP}\) true outliers, an anomaly detection algorithm identifies \(\mu=\text{FP}+\text{TP}\) suspicious observations, of which \(\nu=\text{TP}\) are known to be true outliers. Performance evaluation in this context is often measured using:

Precision is the proportion of true outliers among the suspicious observations \[p=\frac{\nu}{\mu}=\frac{\text{TP}}{\text{FP}+\text{TP}};\] when most of the observations identified by the algorithm are true outliers, \(p\approx 1\);

Recall is the proportion of true outliers detected by the algorithm \[r=\frac{\nu}{\delta}=\frac{\text{TP}}{\text{FN}+\text{TP}};\] when most of the true outliers are identified by the algorithm, \(r\approx 1\);

The \(F_1-\)score is the harmonic mean of the algorithm’s precision and its recall \[F_1=\frac{2pr}{p+r}=\frac{2\text{TP}}{2\text{TP}+\text{FP}+\text{FN}}.\]

One drawback of using precision, recall, and the \(F_1-\)score is that they do not incorporate \(\text{TN}\) in the evaluation process, but this is unlikely to be problematic as regular observations that are correctly seen as unsuspicious are not usually the observations of interest.

Example: consider a test dataset \(\text{Te}\) with 5000 observations, 100 of which are anomalous. An algorithm which predicts all observations to be anomalous would score \(a=p=0.02\), \(r=1\), and \(F_1\approx 0.04\), whereas an algorithm that detects 10 of the true outliers would score \(r=0.1\) (the other metric values would change according to the \(\text{FP}\) and \(\text{TN}\) counts, since \(\text{FN}=90\) is then fixed).
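The arithmetic in this example can be checked with a small helper. The function name `ad_metrics()` is hypothetical, and the 40 false alarms in the third scenario are an assumed value chosen only to complete the confusion matrix.

```r
# Accuracy, precision, recall and F1 from the four confusion matrix counts.
ad_metrics <- function(TP, FP, TN, FN) {
  a <- (TN + TP) / (TN + TP + FN + FP)
  p <- TP / (FP + TP)
  r <- TP / (FN + TP)
  c(accuracy = a, precision = p, recall = r, F1 = 2 * p * r / (p + r))
}

# 5000 observations, 100 of which are anomalous.
# (i) Flag every observation as anomalous:
all.flagged <- ad_metrics(TP = 100, FP = 4900, TN = 0, FN = 0)

# (ii) Flag nothing: high accuracy, zero recall, undefined precision (0/0 gives NaN):
none.flagged <- ad_metrics(TP = 0, FP = 0, TN = 4900, FN = 100)

# (iii) Detect 10 true outliers, with a hypothetical 40 false alarms:
some.flagged <- ad_metrics(TP = 10, FP = 40, TN = 4860, FN = 90)

round(rbind(all.flagged, none.flagged, some.flagged), 4)
```

Scenario (ii) is the air passenger situation described earlier: the accuracy is 0.98 even though not a single anomaly is detected.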

Figure 16.9: Metric values for various supervised anomaly detection models.

Supervised models are discussed in Modules 12 and 13.

Another supervised approach is to estimate the relative abnormality of various observations: it is usually quite difficult to estimate the probability that an observation \(\mathbf{x}_1\) is anomalous with any certainty, but it might be possible to determine that it is more likely to be anomalous than another observation \(\mathbf{x}_2\), say (denoted by \(\mathbf{x}_1\succeq \mathbf{x}_2\)).

This paradigm allows the suspicious observations to be ranked; let \(k_i\in\{1,\ldots,\mu\}\) be the rank of the \(i^{\text{th}}\) true outlier, \(i\in \{1,\ldots,\nu\}\), in the sorted list of suspicious observations \[\mathbf{x}_1\succeq \mathbf{x}_{k_1}\succeq \cdots\succeq\mathbf{x}_{k_i}\succeq \cdots\succeq \mathbf{x}_{k_\nu}\succeq \mathbf{x}_\mu;\] the rank power of the algorithm is \[RP=\frac{\nu(\nu+1)}{2\sum_{i=1}^\nu k_i}.\]

When the \(\delta\) actual anomalies are ranked in (or near) the top \(\delta\) suspicious observations, \(\text{RP}\approx 1\). RP is well-defined only when \(\mu\geq \delta\); as with most performance evaluation metrics, a single raw number is meaningless – it is in comparison with the performance of other algorithms that it is most useful.
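A short sketch of the rank power computation, with a hypothetical function name `rank_power()` that takes the ranks \(k_1,\ldots,k_\nu\) of the true outliers in the sorted list of suspicious observations:

```r
# Rank power: nu * (nu + 1) / (2 * sum of the ranks of the true outliers).
rank_power <- function(k) {
  nu <- length(k)
  nu * (nu + 1) / (2 * sum(k))
}

rank_power(c(1, 2, 3))   # true outliers occupy the top 3 ranks
rank_power(c(2, 4, 6))   # true outliers alternate with false alarms
```

The first call returns 1, since the ranks sum to \(\nu(\nu+1)/2\); the second returns 0.5, as every true outlier sits twice as far down the list.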

Other SL performance evaluation metrics include:

  • AUC – the probability of ranking a randomly chosen anomaly higher than a randomly chosen normal observation (higher is better);

  • probabilistic AUC – a calibrated version of AUC.
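The probabilistic interpretation of AUC leads to a simple rank-based computation (the Mann-Whitney form). The function name `auc_rank()`, the scores, and the labels below are assumptions made for illustration; higher scores are taken to mean "more anomalous".

```r
# AUC as the proportion of (anomaly, normal) pairs in which the anomaly
# receives the higher score; ties count for one half through average ranks.
auc_rank <- function(score, is.anomaly) {
  r  <- rank(score)
  n1 <- sum(is.anomaly)
  n0 <- sum(!is.anomaly)
  (sum(r[is.anomaly]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

score <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.1)            # hypothetical anomaly scores
label <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)  # TRUE marks a true anomaly
auc_rank(score, label)
```

Here 7 of the 8 (anomaly, normal) pairs are correctly ordered, so that the AUC is 0.875.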

The rare occurrence problem can be tackled by using:

  • a manipulated training set (oversampling, undersampling, generating artificial instances);

  • specific SL AD algorithms (CREDOS, PN, SHRINK);

  • boosting algorithms (SMOTEBoost, RareBoost);

  • cost-sensitive classifiers (MetaCost, AdaCost, CSB, SSTBoost),

  • etc. [321]

The rare (anomalous) class can be oversampled by duplicating the rare events until the data set is balanced (roughly the same number of anomalies and normal observations). This does not increase the overall level of information, but it will increase the mis-classification cost.

The majority class (normal observations) can also be undersampled by randomly removing:

  • “near miss” observations or

  • observations far from anomalous observations.

Some loss of information is to be expected, as are “overly general” rules. Common strategies are illustrated in Figures 16.10 and 16.11.
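A minimal base R sketch of the two resampling ideas, on a hypothetical training set with 95 normal observations and 5 anomalies. For simplicity, the undersampling step removes majority observations purely at random; the targeted removal of "near miss" or distant observations described above would require a distance computation that is not shown here.

```r
set.seed(4)
# Hypothetical training set: 95 normal observations, 5 anomalies.
train <- data.frame(x = rnorm(100),
                    anomaly = rep(c(FALSE, TRUE), c(95, 5)))
rare  <- which(train$anomaly)
major <- which(!train$anomaly)

# Oversampling: duplicate rare rows (with replacement) until the classes balance.
extra <- rare[sample.int(length(rare), length(major) - length(rare), replace = TRUE)]
over  <- train[c(major, rare, extra), ]

# Undersampling: keep a random subset of the majority class, of the same size
# as the rare class.
keep  <- major[sample.int(length(major), length(rare))]
under <- train[c(keep, rare), ]

table(over$anomaly)
table(under$anomaly)
```

The oversampled set contains no information that was not already in `train`, while the undersampled set has discarded 90 of the 95 normal observations.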

Figure 16.10: Oversampling, undersampling, and hybrid strategy for anomaly detection. [322]

Figure 16.11: Generating artificial cases with SMOTE and DRAMOTE. [323]

Another modern approach rests on the concept of dimension reduction (see Module 15); autoencoders learn a compressed representation of the data. In a sense, the reconstruction error measures how much information is lost in the compression.

Anomaly detection algorithms can be applied to the compressed data:

  • look for anomalous patterns, and/or

  • anomalous reconstruction errors.

Figure 16.12: Illustration of autoencoder compression/reconstruction for anomaly detection, modified from [318]

In the example of Figure 16.12, one observation is anomalous because its compressed representation does not follow the pattern of the other 8 observations, whereas another observation is anomalous because its reconstruction error is substantially higher than that of the other 8 observations (can you hazard a guess as to which one is which?). Autoencoders will be presented in more detail in the Module on Deep Learning.
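Training an autoencoder is beyond the scope of a short snippet, but the two kinds of anomalies can be illustrated with principal component analysis, which plays the role of a linear "encoder/decoder" pair here. This is a deliberately simplified stand-in and not an autoencoder; the data (50 points near a line in \(\mathbb{R}^3\), plus two planted observations) is synthetic.

```r
set.seed(5)
# Hypothetical data: 50 observations lying close to a line in R^3 ...
s <- seq(-3, 3, length.out = 50)
X <- cbind(s, 2 * s, -s) + matrix(rnorm(150, sd = 0.05), ncol = 3)
# ... plus one point off the line (row 51) and one far along the line (row 52).
X <- rbind(X, c(0, 0, 3), c(10, 20, -10))

pca  <- prcomp(X, center = TRUE, scale. = FALSE)
code <- pca$x[, 1]                        # 1-dimensional compressed representation
Xhat <- outer(code, pca$rotation[, 1])    # decode back to R^3 ...
Xhat <- sweep(Xhat, 2, pca$center, "+")   # ... and undo the centring
err  <- rowSums((X - Xhat)^2)             # squared reconstruction error

which.max(err)         # anomalous reconstruction error
which.max(abs(code))   # anomalous compressed representation
```

Row 51 is reconstructed poorly because it does not lie near the line that the compression has learned, whereas row 52 is reconstructed well but has a compressed value far from those of the other observations.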

On the unsupervised front, where anomalous/normal labels are not known or used, if anomalies are those observations that are dissimilar to other observations, and if clusters represent groupings of similar observations, then observations that do not naturally fit into a cluster could be potential anomalies (see Figure 11.23 for an illustration).

There are a number of challenges associated with unsupervised anomaly detection, not the least of which is that most clustering algorithms do not recognize potential outliers (DBSCAN is a happy exception) and that some appropriate measure of similarity/dissimilarity of observations has to be agreed upon (different measures could lead to different cluster assignments, as we have discussed in Module 14).

Finally, it is worth mentioning that the definitions of terms like normal and anomalous are kept purposely vague, to allow for flexibility.

16.1.3 Motivating Example

In this module, we will illustrate the concepts and the algorithms of anomaly detection on an artificial dataset.

Consider a dataset of \(102\) observations in \(\mathbb{R}^4\); the first \(100\) observations \(\mathbf{p}_1,\ldots,\mathbf{p}_{100}\) are drawn from a multivariate \(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\), with \[\boldsymbol{\mu}=(1,-2,0,1),\quad \boldsymbol{\Sigma}=\begin{pmatrix}1 & 0.5 & 0.7 & 0.5 \\ 0.5 & 1 & 0.95 & 0.3 \\ 0.7 & 0.95 & 1 & 0.3 \\ 0.5 & 0.3 & 0.3 & 1\end{pmatrix}.\]

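nobs = 100
# 4 x nobs matrix whose columns are all equal to the mean vector
mu = matrix(rep(c(1,-2,0,1),nobs),nrow=4)
# covariance matrix (symmetric, so the column-wise fill order does not matter)
Sigma = matrix(c(1, 0.5, 0.7, 0.5,
                 0.5, 1, 0.95, 0.3,
                 0.7, 0.95, 1, 0.3,
                 0.5, 0.3, 0.3, 1), nrow=4, ncol=4)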

We use the Cholesky decomposition of \(\boldsymbol{\Sigma}\) to generate random observations.

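L = chol(Sigma)   # upper-triangular factor: t(L) %*% L equals Sigma
nvars = dim(L)[1]

set.seed(0) # for replicability
# the columns of t(L) %*% Z have covariance Sigma when Z has i.i.d. N(0,1) entries
r = t(mu + t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs))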

The summary statistics for the 100 “regular” observations are given below:

rdata = as.data.frame(r)
names(rdata) = c('x1', 'x2', 'x3', 'x4')
summary(rdata)
       x1                x2                x3                  x4         
 Min.   :-1.9049   Min.   :-4.4113   Min.   :-2.532449   Min.   :-1.9949  
 1st Qu.: 0.3812   1st Qu.:-2.6464   1st Qu.:-0.619032   1st Qu.: 0.3361  
 Median : 0.9273   Median :-2.0220   Median :-0.050604   Median : 0.9381  
 Mean   : 0.9374   Mean   :-1.9788   Mean   : 0.007144   Mean   : 0.9438  
 3rd Qu.: 1.4615   3rd Qu.:-1.4002   3rd Qu.: 0.629557   3rd Qu.: 1.5906  
 Max.   : 3.4414   Max.   : 0.5223   Max.   : 2.026547   Max.   : 2.8073  

We add two observations \(\mathbf{z}_1=(1,1,1,1)\) and \(\mathbf{z}_4=(4,4,4,4)\) that do not arise from \(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\):

pt.1 = c(1,1,1,1)
pt.2 = c(4,4,4,4)

rdata = rbind(rdata,pt.1,pt.2)
group = c(rep(1,nobs),2,3)
rdata = cbind(rdata,group)

The complete dataset is displayed below, with \(\mathbf{z}_1\) in pink and \(\mathbf{z}_4\) in green:

lattice::splom(rdata[,1:4], groups=group, pch=22)

But since we will not usually know which observations are “regular” and which are “anomalous”, let us remove the colouring.

lattice::splom(rdata[,1:4], pch=22)

Evidently, a visual inspection suggests that there are in fact 3 outliers in the dataset!
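The visual impression can be complemented by an analytical check. The sketch below re-creates the artificial dataset (same seed and parameters as above, with the mean stored as a vector rather than a matrix) and computes squared Mahalanobis distances using the sample mean and covariance of the full, contaminated dataset. The chi-square cutoff assumes multivariate normality, and the 0.999 level is an arbitrary choice made for illustration.

```r
# Re-create the artificial dataset from this section.
nobs  <- 100
mu    <- c(1, -2, 0, 1)
Sigma <- matrix(c(1, 0.5, 0.7, 0.5,
                  0.5, 1, 0.95, 0.3,
                  0.7, 0.95, 1, 0.3,
                  0.5, 0.3, 0.3, 1), nrow = 4, ncol = 4)
set.seed(0)
Z <- matrix(rnorm(4 * nobs), nrow = 4, ncol = nobs)
X <- rbind(t(mu + t(chol(Sigma)) %*% Z),   # 100 "regular" observations
           c(1, 1, 1, 1),                  # row 101
           c(4, 4, 4, 4))                  # row 102

# Squared Mahalanobis distances, with estimates from the contaminated data.
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# For 4-dimensional Gaussian data, d2 is approximately chi-square (4 d.f.).
cutoff <- qchisq(0.999, df = 4)
which(d2 > cutoff)
head(order(d2, decreasing = TRUE), 3)
```

Because the two added observations are included in the estimates of the mean and of the covariance, their distances are smaller than they would be under the true parameters (a masking effect); robust estimates of location and scatter are a common remedy.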

Multiple references were consulted in the preparation of this module, in particular [319], [324]. Other good survey documents include [325], [326]. Specific methods and approaches are the focus of other papers: [327]–[329] (high-dimensional data), [330] (DOBIN), [331] (outlier ensembles), [332], [333] (isolation forest), [334], [335] (DBSCAN), [336] (LOF), [337]–[341] (subspace method), [342] (time series data). On the practical side of things, we would be remiss if we did not mention [343], but note that there is a plethora of quality tutorials online for anomaly detection in the programming language of your choice.


D. Baron, “Outlier Detection.” XXX Winter School of Astrophysics on Big Data in Astronomy, GitHub repository, 2018.
K. G. Mehrotra, C. K. Mohan, and H. Huang, Anomaly Detection Principles and Algorithms. Springer, 2017.
T. Le, M. T. Vo, B. Vo, M. Y. Lee, and S. W. Baik, “A hybrid approach using oversampling technique and cost-sensitive learning for bankruptcy prediction,” Complexity, 2019, doi: 10.1155/2019/8460934.
O. Soufan et al., “Mining chemical activity status from high-throughput screening assays,” PloS one, vol. 10, no. 12, 2015, doi: 10.1371/journal.pone.0144426.
C. C. Aggarwal, Outlier Analysis. Springer International Publishing, 2016.
“Outlier Detection: A Survey.” Technical Report TR 07-017, Department of Computer Science and Engineering, University of Minnesota, 2007.
V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artif. Intell. Rev., vol. 22, no. 2, pp. 85–126, 2004.
C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” SIGMOD Rec., vol. 30, no. 2, pp. 37–46, 2001, doi: http://doi.acm.org/10.1145/376284.375668.
E. Muller, I. Assent, U. Steinhausen, and T. Seidl, “OutRank: Ranking Outliers in High-Dimensional Data,” in 2008 IEEE 24th International Conference on Data Engineering Workshop, 2008, pp. 600–603. doi: 10.1109/ICDEW.2008.4498387.
S. Kandanaarachchi and R. Hyndman, “Dimension reduction for outlier detection using DOBIN.” Sep. 2019. doi: 10.13140/RG.2.2.15437.18403.
C. C. Aggarwal and S. Sathe, Outlier Ensembles: An Introduction. Springer International Publishing, 2017.
F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
S. Hariri, M. Carrasco Kind, and R. J. Brunner, “Extended isolation forest,” IEEE Transactions on Knowledge and Data Engineering, 2019.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-Based Clustering Based on Hierarchical Density Estimates,” in Advances in Knowledge Discovery and Data Mining, 2013, pp. 160–172.
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” SIGMOD Rec., vol. 29, no. 2, pp. 93–104, 2000.
J. Zhang, M. Lou, T. W. Ling, and H. Wang, “Hos-miner: A System for Detecting Outlying Subspaces of High-Dimensional Data,” in Proceedings of the Thirtieth International Conference on Very Large Data Bases, 2004, pp. 1265–1268.
E. Muller, M. Schiffer, and T. Seidl, “Statistical selection of relevant subspace projections for outlier ranking,” in Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, 2011, pp. 434–445. doi: 10.1109/ICDE.2011.5767916.
C. Chen and L.-M. Liu, “Joint estimation of model parameters and outlier effects in time series,” Journal of the American Statistical Association, vol. 88, pp. 284–297, 1993.