16.4 Anomalies in High-Dimensional Data

Nowadays, real datasets are often quite large; in some scenarios, the observations may contain hundreds or thousands of features (or dimensions).

Many classical methods use proximity (distance) concepts for anomaly detection (see Section 16.2 for a sample of such methods) and can only be expected to work reasonably well in cases where the sample size \(n\) is larger than the dimension \(p\) (\(n>p\)).

The management of high-dimensional data (\(n<p\)) presents specific difficulties: in such spaces, observations tend to be isolated and scattered (or sparse), and the notion of proximity loses its relevance.

In that case, defining significant outliers becomes a much more complex task (and not as obvious): many conventional methods of detecting outliers are simply not effective in the high-dimensional context, due to the curse of dimensionality. Consequently, high-dimensional anomaly detection methods are closely linked with dimension reduction and feature selection.

The remainder of this section is organized as follows: first, an attempt is made to define the concept and the challenges; then, anomaly detection techniques are discussed; finally, we end with a detailed description of ensembles and subspace methods. Our approach mainly follows those found in [319], [324], [331], [348], [349].

16.4.1 Definitions and Challenges

As we have seen previously, an anomalous observation is one that deviates or behaves differently from the other observations in the dataset, which makes us suspect that it was generated by some other mechanism [324]; such an observation would, of course, be considered to be irregular.

The challenges of anomaly and outlier detection in high-dimensional data (HDD) are due to:

  • the notion of distance becoming irrelevant due to the curse of dimensionality (whence “the problem of detecting outliers is like finding a needle in a haystack” [348]) – see the small simulation following this list;

  • every point in such datasets tending to be an outlier, and

  • datasets becoming sparser as the dimension of the feature space increases.
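
The first of these challenges can be illustrated with a small simulation (a minimal sketch, with arbitrary choices of sample size and dimensions): as \(p\) increases, the relative gap between the nearest and the farthest neighbour of an observation shrinks, so that “near” and “far” become nearly indistinguishable.

# relative contrast between the farthest and nearest neighbours of an
# observation, for n = 100 uniform points in increasing dimensions p
set.seed(0)
n <- 100
for (p in c(2, 10, 100, 1000)) {
  X <- matrix(runif(n * p), nrow = n)
  d <- as.matrix(dist(X))[1, -1]    # distances from the first observation
  cat(sprintf("p = %4d: (max - min)/min = %.3f\n",
              p, (max(d) - min(d)) / min(d)))
}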

The authors of [327] consider that in order to deal properly with large datasets, detection methods should:

  1. allow for effective management of sparse data issues;

  2. provide interpretability of the discrepancies (i.e., how the behaviour of such observations is different);

  3. allow anomaly measurements to be compared (“apples-to-apples”), and

  4. consider the local data behaviour to determine whether an observation is abnormal or not.

16.4.2 Projection Methods

HDLSS (high dimension, low sample size) datasets can contain hundreds of variables (or even more), and the curse of dimensionality affects the efficiency of conventional anomaly/outlier detection methods.

One solution to this problem is to reduce the dimensionality of the dataset while preserving its essential characteristics. We have discussed such projection methods in Module 15. Let us see how principal components analysis (see Section 15.2.3) can be applied to anomaly detection (we take advantage of this reprise to present PCA using a different formalism).

PCA (Reprise)

As we know, principal components analysis (PCA) aims to find a representation of the original dataset in a lower-dimensional subspace (such as a line or a plane) containing the greatest possible variation.

PCA corresponds to an orthogonal linear transformation of the data into a new coordinate system, such that the largest variance resulting from a scalar projection of the data is on the first coordinate (the first principal component), the second largest variance on the second coordinate, and so forth.

PCA is used in various contexts:

  • as a dimension reduction method used during the data pre-processing step;

  • as a data visualization aid, and, in the scenario of interest for this section,

  • as an anomaly and outlier detection approach.

Let the dataset be represented by a numerical, centered, and scaled \(n\times p\) matrix \(\textbf{X}=[\textbf{X}_1,\cdots,\textbf{X}_p]\) with \(n\) observations (number of rows) and \(p\) features (number of columns).

The principal components are linear combinations of the variables: \[\begin{aligned} \textbf{Y}_i=\ell^{\!\top}_i\textbf{X}=\ell_{1,i}\textbf{X}_1+\cdots+\ell_{p,i}\textbf{X}_p;\quad i=1,\cdots,k,\end{aligned}\] with \(k\leq p\), yielding the largest variance subject to the constraint \(\|\ell_i\|=1\) (where \(\|\cdot\|\) represents the Euclidean norm).

We can thus deduce that \[\begin{aligned} \operatorname{Var}\left(\textbf{Y}_{i}\right) &=\operatorname{Var}\left(\ell_{i}^{\!\top} \textbf{X} \right)=\ell_{i}^{\!\top} \textbf{X}^{\!\top} \textbf{X} \ell_{i} \\ \operatorname{Cov}\left(\textbf{Y}_{i}, \textbf{Y}_{k}\right) &=\operatorname{Cov}\left(\ell_{i}^{\!\top} \textbf{X}, \ell_{k}^{\!\top} \textbf{X}\right)=\ell_{i}^{\!\top}\textbf{X}^{\!\top} \textbf{X}\ell_{k}\, .\end{aligned}\]

In other words, PCA finds the loadings vector \(\ell_{1}\) which maximizes the variance of \(\textbf{Y}_1\), i.e. \[\begin{aligned} \mathbf{\ell}_{1}&=\underset{\|\mathbf{\ell_1}\|=1}{\arg \max }\left\{\ell^{\!\top}_1\textbf{X}^{\!\top} \textbf{X} \ell_1\right\},\end{aligned}\] then the loadings vector \(\ell_2\), orthogonal to \(\ell_1\) (so that \(\textbf{Y}_2\) is uncorrelated with \(\textbf{Y}_1\)), which maximizes the variance of \(\textbf{Y}_2\), i.e. \[\begin{aligned} \mathbf{\ell}_{2}& =\underset{\|\mathbf{\ell_2}\|=1,\; \ell_{1}^{\top}\ell_{2} =0}{\arg \max }\left\{\ell^{\!\top}_2\textbf{X}^{\!\top} \textbf{X} \ell_2\right\}.\end{aligned}\]

Similarly, the loadings vector \(\ell_k\) is orthogonal to each of the \(\ell_i\), \(i<k\) (so that \(\textbf{Y}_k\) is uncorrelated with the preceding components), and maximizes the variance of \(\textbf{Y}_k\), i.e. \[\begin{aligned} \mathbf{\ell}_k=\underset{\substack{\|\ell_k\|=1,\\ \;\ell_{i}^{\top}\ell_k=0,\;\forall\;i<k}}{\arg \max} \left\{\ell^{\!\top}_k\textbf{X}^{\!\top} \textbf{X} \ell_k\right\}.\end{aligned}\]

We solve this optimization problem, with its orthogonality constraints for all \(i<k\), through the Lagrangian \[\begin{aligned} L=\ell^{\!\top}_k\textbf{X}^{\!\top} \textbf{X} \ell_k-\lambda_k(\ell^{\!\top}_k\ell_k-1)-\sum_{i<k}w_i\,\ell_{i}^{\!\top}\ell_k.\end{aligned}\]

The critical points are found by differentiating with respect to each of the entries of \(\ell_k\), \(\lambda_k\), and the \(w_i\), and setting the results to 0. Simplifying, we obtain \[\begin{aligned} \textbf{X}^{\!\top} \textbf{X} \ell_k&=\lambda_k\ell_k\\ \ell^{\!\top}_k\ell_k&=1 \quad\text{and}\quad\ell^{\!\top}_k\ell_i=0,\quad\text{for all}\; i<k.\end{aligned}\]

The loadings vector \(\ell_k\) is thus the eigenvector of the covariance matrix \(\textbf{X}^{\!\top} \textbf{X}\) associated with the \(k\)th largest eigenvalue \(\lambda_k\). The proportion of the variance explained by the PCA can be calculated by first noting that \[\begin{aligned} \sum_{i=1}^{p}\operatorname{Var}\left(\textbf{Y}_{i}\right) &=\sum_{i=1}^{p}\ell_{i}^{\!\top} \textbf{X}^{\!\top} \textbf{X} \ell_{i}=\sum_{i=1}^{p}\lambda_i \,.\end{aligned}\]

Consequently, the proportion of the total variance explained by the \(i\)th principal component is \[\begin{aligned} 0\leq \frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j }\leq 1.\end{aligned}\]
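
These relations are easy to verify numerically (a minimal sketch on a small simulated matrix): the loadings returned by prcomp coincide, up to sign, with the eigenvectors of \(\textbf{X}^{\!\top}\textbf{X}\), and the proportions of explained variance with the normalized eigenvalues.

# link between the PCA loadings and the eigenvectors of X^T X,
# on a small simulated dataset (n = 50 observations, p = 4 features)
set.seed(0)
X <- scale(matrix(rnorm(200), ncol = 4))    # centered and scaled
eig <- eigen(t(X) %*% X)                    # eigen-decomposition of X^T X
pca <- prcomp(X)

# loadings vs. eigenvectors (columns may differ by a sign)
round(abs(pca$rotation) - abs(eig$vectors), 10)

# proportions of explained variance vs. normalized eigenvalues
rbind(summary(pca)$importance[2, ], eig$values / sum(eig$values))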

The quality of the PCA results is strongly dependent on the number of retained principal components, that is, on the dimension \(k\) of the subspace on which the observations are projected. There are multiple ways to select the “right” \(k\) – we will briefly present two of them.

The proportion of the total variance explained by the first \(k\) principal components is given by \[\begin{aligned} p_k=\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p}\lambda_i}.\end{aligned}\]

One approach is to retain \(k\) principal components, where \(k\) is the smallest value for which \(p_k\) surpasses some pre-established threshold (often taken between \(80\%\) and \(90\%\)).

The scree plot method, on the other hand, consists in drawing the curve of the eigenvalues in decreasing order (the scree plot) and in identifying the curve’s “elbows”: points beyond which the explained variance decreases at a markedly slower rate as components are added. If such an elbow exists, we retain the eigenvalues up to it (and thus, the corresponding principal components).
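
Both selection rules are easy to implement from the output of prcomp; in the sketch below (the dataset mtcars and the \(85\%\) threshold are arbitrary stand-ins), we return the smallest \(k\) whose cumulative proportion of explained variance exceeds the threshold, and draw the scree plot on which an elbow can be sought.

# smallest k for which the cumulative proportion of explained
# variance exceeds a pre-established threshold (here, 85%)
pca.fit <- prcomp(scale(mtcars))    # any prcomp object would do
p_k <- cumsum(pca.fit$sdev^2) / sum(pca.fit$sdev^2)
(k <- which(p_k >= 0.85)[1])

# scree plot: look for an "elbow" in the decreasing eigenvalues
plot(pca.fit$sdev^2, type = "b", xlab = "component", ylab = "eigenvalue")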

Example

PCA is applied on a dataset of genetic expression measurements, for \(n=72\) leukemia patients and \(p=7128\) genes [350].

# get data
leukemia.big <- read.csv("Data/leukemia_big.csv")
# scale and format data
leukemia.big <- t(leukemia.big)
leukemia.big.scaled <- scale(leukemia.big)

We compute the PCA decomposition on the scaled data and display the plots.

# pca
pca.leukemia <- prcomp(leukemia.big.scaled)
plot(pca.leukemia)   # scree plot of the principal component variances

pca.leukemia.s <- summary(pca.leukemia) 
plot(pca.leukemia.s$importance[3,])   # cumulative proportion of variance explained

The scree plot suggests that only one principal component should be retained; the projection on the first 3 principal components is also shown below:


It is not obvious, however, that the methods presented here to find an optimal \(k\) are appropriate for anomaly detection purposes: simply put, a good \(k\) is one which allows for good anomaly detection.

There are other PCA-associated dimension reduction methods: independent components analysis, singular value decomposition, kernel PCA, etc. (see Module 15).

But what is the link with anomaly and/or outlier detection?

Once the dataset has been projected on a lower-dimensional subspace, the curse of dimensionality is (hopefully) mitigated – traditional methods are then applied to the projected data.
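
Schematically, the workflow could look as follows (a minimal sketch on simulated data with one injected anomaly; the detector applied to the scores is the Mahalanobis distance, computed here with stats::mahalanobis):

# project on the first k principal components, then apply a traditional
# detector (here, the Mahalanobis distance) to the projected scores
set.seed(0)
X <- matrix(rnorm(100 * 10), ncol = 10)    # 100 observations, 10 features
X[1, ] <- X[1, ] + 4                       # inject one anomalous observation
k <- 3
scores <- prcomp(scale(X))$x[, 1:k]
md <- sqrt(stats::mahalanobis(scores, colMeans(scores), cov(scores)))
head(sort(md, decreasing = TRUE))          # observation 1 should stand out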

Dimension reduction usually leads to a loss of information, however, which can affect the accuracy of the detection procedure – especially if the presence/absence of anomalies is not aligned with the dataset’s principal components.


Figure 16.20: Examples of data analytical tasks that are out of alignment with PCA: pancake problem (left), in which the interesting problem might be to find the number of pancakes, but the PCA might focus on the pile’s base; non-orthogonal components (right), where the data does not follow an orthogonality relation. Modified from [218]

Example

We start by computing the PCA decomposition of the scaled dataset:

pca.rdata <- prcomp(scaled)
plot(pca.rdata)

This suggests that the data is almost 1-dimensional. Let’s display the data on the first PC:

plot(pca.rdata$x[,1])

The anomalies are highlighted below:

pca.rdata.2=data.frame(pca.rdata$x[,1:4],rdata[,5])
plot(pca.rdata.2[,1], col=group, pch=22)

We now recreate the Mahalanobis framework on the reduced data. First, we compute the empirical parameters:

mu.2 <- mean(pca.rdata.2[,1]) 
Sigma.inv.2 = 1/var(pca.rdata.2[,1])

Next, we compute the Mahalanobis distances from each observation to the empirical distribution:

M_d.2<-vector()

for(j in 1:nrow(rdata)){
  M_d.2[j]<-sqrt(as.matrix(pca.rdata.2[j,1]-mu.2) %*% 
                   Sigma.inv.2 %*% t(as.matrix(pca.rdata.2[j,1]-mu.2)))
}

pca.rdata.3 <- data.frame(pca.rdata.2,M_d.2)
summary(M_d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4622  1.3479  1.6764  1.7980  2.1010  6.6393 

This histogram of distances is shown below:

pca.rdata.3 %>% ggplot2::ggplot(ggplot2::aes(x=M_d.2)) + 
  ggplot2::geom_histogram(colour="black",binwidth = 0.5) + 
  ggplot2::geom_rug() + ggplot2::theme_bw()

One outlier (observation 102, as it turns out) is still clearly visible, but observation 101 (which the Mahalanobis approach on the full unscaled dataset cleanly picks out, see Section 16.2.1) gets lost in the cloud of points when we only focus on the first PC. Whatever makes the latter an outlier is out of alignment with PC1.

Let us re-create the analysis using PC1 and PC2 to see if we can find out what is going on. The anomalies are highlighted below:

pca.rdata.2=data.frame(pca.rdata$x[,1:4],rdata[,5])
lattice::splom(pca.rdata.2[,1:2], groups=group, pch=22)

We now recreate the Mahalanobis framework on the reduced data. First, we compute the empirical parameters:

mu.2 <- colMeans(pca.rdata.2[,1:2]) 
Sigma.inv.2 = matlib::inv(cov(pca.rdata.2[,1:2]))

Next, we compute the Mahalanobis distances from each observation to the empirical distribution (using the first two PCs), and display their histogram:

M_d.2<-vector()

for(j in 1:nrow(rdata)){
  M_d.2[j]<-sqrt(as.matrix(pca.rdata.2[j,1:2]-mu.2) %*% 
                   Sigma.inv.2 %*% t(as.matrix(pca.rdata.2[j,1:2]-mu.2)))
}

pca.rdata.3 <- data.frame(pca.rdata.2,M_d.2)
pca.rdata.3 %>% ggplot2::ggplot(ggplot2::aes(x=M_d.2)) + 
  ggplot2::geom_histogram(colour="black",binwidth = 0.5) + 
  ggplot2::geom_rug() + ggplot2::theme_bw()

This suggests (again) that there are 2 anomalies in the data.

pca.rdata.4 = data.frame(1:(nobs+2),pca.rdata.3$M_d.2)
names(pca.rdata.4) = c("obs","PCA_Mahalanobis")
pca.rdata.4 <- pca.rdata.4[order(-pca.rdata.4$PCA_Mahalanobis),]
rownames(pca.rdata.4) <- NULL
knitr::kable(head(pca.rdata.4))
obs PCA_Mahalanobis
102 4.681526
62 3.521898
3 2.556683
12 2.418989
55 2.405540
39 2.295610

This again fails to capture observation 101 as an anomaly.

pca.rdata.4 %>%  
  ggplot2::ggplot(ggplot2::aes(x=obs,y=PCA_Mahalanobis)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="PCA/Mahalanobis") +
  ggplot2::geom_point(ggplot2::aes(fill=PCA_Mahalanobis,
                                   colour=PCA_Mahalanobis,
                                   size=PCA_Mahalanobis),
                      pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

Perhaps we will have more luck if we use PC1, PC2, and PC3. The anomalies are highlighted below:

pca.rdata.2=data.frame(pca.rdata$x[,1:4],rdata[,5])
lattice::splom(pca.rdata.2[,1:3], groups=group, pch=22)

mu.2 <- colMeans(pca.rdata.2[,1:3]) 
Sigma.inv.2 = matlib::inv(cov(pca.rdata.2[,1:3]))
M_d.2<-vector()

for(j in 1:nrow(rdata)){
  M_d.2[j]<-sqrt(as.matrix(pca.rdata.2[j,1:3]-mu.2) %*% 
                   Sigma.inv.2 %*% t(as.matrix(pca.rdata.2[j,1:3]-mu.2)))
}

pca.rdata.3 <- data.frame(pca.rdata.2,M_d.2)
pca.rdata.3 %>% ggplot2::ggplot(ggplot2::aes(x=M_d.2)) + 
  ggplot2::geom_histogram(colour="black",binwidth = 0.5) + 
  ggplot2::geom_rug() + ggplot2::theme_bw()

This suggests (again) that there are 2 anomalies in the data.

pca.rdata.4 = data.frame(1:(nobs+2),pca.rdata.3$M_d.2)
names(pca.rdata.4) = c("obs","PCA_Mahalanobis")
pca.rdata.4 <- pca.rdata.4[order(-pca.rdata.4$PCA_Mahalanobis),]
rownames(pca.rdata.4) <- NULL
knitr::kable(head(pca.rdata.4))
obs PCA_Mahalanobis
102 4.896637
62 3.523881
14 2.867856
12 2.822241
27 2.656084
3 2.575923

This again fails to capture observation 101 as an anomaly.

pca.rdata.4 %>%  
  ggplot2::ggplot(ggplot2::aes(x=obs,y=PCA_Mahalanobis)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="PCA/Mahalanobis") +
  ggplot2::geom_point(ggplot2::aes(fill=PCA_Mahalanobis,
                                   colour=PCA_Mahalanobis,
                                   size=PCA_Mahalanobis),
                      pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

So now the question arises: is the fact that observation 101 is not captured here related to dimension reduction, or is it an issue related to scaling? Is there something that can be done to separate the two procedures? Can we get the benefits of PCA dimension reduction without having to scale the data?

Distance-Based Outlier Basis Using Neighbours

Using PCA for anomaly detection is potentially problematic, as we have seen: whether an observation is anomalous or not does not figure in the construction of the principal component basis \(\{\text{PC}_1,\ldots,\text{PC}_k\}\) – there is not necessarily a correlation between the axes of heightened variance and the presence or absence of anomalies.

The distance-based outlier basis using neighbours algorithm (DOBIN) builds a basis which is better suited for the eventual detection of outlying observations. DOBIN’s main concept is to search for nearest neighbours that are in fact relatively distant from one another:

  1. We start by building a space \(\mathbf{Y}=\{\mathbf{y}_{\ell}\}\) which contains \(M\ll n(n-1)/2\) vectors of the form \[\mathbf{y}_{\ell}=(\mathbf{x}_i -\mathbf{x}_j)\odot (\mathbf{x}_i -\mathbf{x}_j),\] where \(\odot\) is the element-by-element Hadamard multiplication, and for which the \(1-\)norm \[\|\mathbf{y}_{\ell}\|_1=(x_{i,1}-x_{j,1})^2+\cdots +(x_{i,p}-x_{j,p})^2\] is the square of the distance between \(\mathbf{x}_i,\mathbf{x}_j \in\mathbf{X}\) (the selection of each of the \(M\) observation pairs is made according to a rather complex procedure which only considers \(\mathbf{x}_i\) and \(\mathbf{x}_j\) if they are part of one another’s \(k-\)neighbourhood, for \(k\in\{k_1,\ldots, k_2\}\)); the set \(\mathbf{Y}\) thus contains points for which \(\|\mathbf{y}_{\ell}\|_1\) is relatively large, which is to say that the observations \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are fairly distant from one another even though they are \(k-\)neighbours of each other;

  2. we next build a basis \(\{\boldsymbol{\eta}_1,\ldots,\boldsymbol{\eta}_p\}\subset \mathbb{R}^p\) where each \(\boldsymbol{\eta}_i\) is a unit vector given by a particular linear combination of points in \(\mathbf{Y}\); they can be found using a Gram-Schmidt-like procedure: \[\begin{aligned} \mathbf{y}_{\ell_0}&=\mathbf{y}_{\ell},\quad \ell=1,\ldots, M \\ \boldsymbol{\eta}_1&=\frac{\sum_{\ell=1}^M\mathbf{y}_{\ell_0}}{\left\|\sum_{\ell=1}^M\mathbf{y}_{\ell_0}\right\|_2} \\ \mathbf{y}_{\ell_1}&=\mathbf{y}_{\ell_0}-\langle\boldsymbol{\eta}_1 \mid \mathbf{y}_{\ell_0}\rangle\,\boldsymbol{\eta}_1,\quad \ell=1,\ldots, M \\ \boldsymbol{\eta}_2&=\frac{\sum_{\ell=1}^M\mathbf{y}_{\ell_1}}{\left\|\sum_{\ell=1}^M\mathbf{y}_{\ell_1}\right\|_2} \\ & \vdots \\ \mathbf{y}_{\ell_{b-1}}&=\mathbf{y}_{\ell_{b-2}}-\langle\boldsymbol{\eta}_{b-1} \mid \mathbf{y}_{\ell_{b-2}}\rangle\,\boldsymbol{\eta}_{b-1},\quad \ell=1,\ldots, M \\ \boldsymbol{\eta}_b&=\frac{\sum_{\ell=1}^M\mathbf{y}_{\ell_{b-1}}}{\left\|\sum_{\ell=1}^M\mathbf{y}_{\ell_{b-1}}\right\|_2},\end{aligned}\] for \(b=1,\ldots,p\),

  3. and we transform the original dataset \(\mathbf{X}\) via \(\hat{\mathbf{X}}=\mathcal{T}(\mathbf{X})\boldsymbol{\Theta},\) where \(\mathcal{T}(\mathbf{X})\) normalizes each feature of \(\mathbf{X}\) according to a problem-specific scheme (Min-Max or Median-IQR, say), and \[\boldsymbol{\Theta}=[\boldsymbol{\eta}_1\mid\cdots\mid \boldsymbol{\eta}_p]\] is an orthogonal \(p\times p\) matrix.

It is on the transformed space (which plays an analogous role to the subspace projection of \(\mathbf{X}\) in PCA) that we apply the various outlier and anomaly detection algorithms. The full details contain a fair number of technical complications; the interested reader is invited to consult the original documentation [330].
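
To make the first step more concrete, the sketch below (with arbitrary simulated data, and with the \(k-\)neighbourhoods handled naively via FNN rather than with the pair-selection procedure of [330]) builds one such vector \(\mathbf{y}_{\ell}\) and confirms that \(\|\mathbf{y}_{\ell}\|_1\) is indeed the squared distance between the corresponding pair of observations.

# step 1 of DOBIN, naively: the Hadamard square of the difference between
# two k-neighbours has a 1-norm equal to their squared Euclidean distance
set.seed(0)
X <- matrix(rnorm(50 * 3), ncol = 3)        # 50 observations, 3 features
nn <- FNN::get.knn(X, k = 5)$nn.index       # indices of the 5 nearest neighbours

i <- 1                                      # first observation, say
j <- nn[i, 5]                               # its 5th nearest neighbour
y <- (X[i, ] - X[j, ]) * (X[i, ] - X[j, ])  # Hadamard square of the difference
c(sum(abs(y)), sum((X[i, ] - X[j, ])^2))    # the two values coincide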

Example

DOBIN is implemented in R via the dobin package. In the example below, note that the data is not scaled.

out <- dobin::dobin(rdata[,1:4], frac=0.9, norm=3)

kk <- min(ceiling(dim(rdata)[1]/10), 25)  # kk grows with the sample size, capped at 25 (arbitrary choice)
knn_dist <- FNN::knn.dist(out$coords[, 1:3], k = kk)  # k-NN distances on the first 3 DOBIN components
knn_dist <- knn_dist[ ,kk]                # distance to the kk-th nearest neighbour
ord <- order(knn_dist, decreasing=TRUE)

knn_dist.dobin <- data.frame(1:(nobs+2),knn_dist)
names(knn_dist.dobin) = c("obs","knn_dobin")
knn_dist.dobin <- knn_dist.dobin[order(-knn_dist.dobin$knn_dobin),]
rownames(knn_dist.dobin) <- NULL
knitr::kable(head(knn_dist.dobin))
obs knn_dobin
102 5.029217
62 2.982912
14 1.700037
49 1.540807
101 1.453970
67 1.375444
knn_dist.dobin %>% 
  ggplot2::ggplot(ggplot2::aes(x=obs,y=knn_dobin)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="11NN Distance - DOBIN") +
  ggplot2::geom_point(ggplot2::aes(fill=knn_dobin,
                                   colour=knn_dobin,
                                   size=knn_dobin),
                      pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

We see that it is the second variable that is the main contributor to anomalous behaviour:

out$rotation[ ,1]   # loadings of the first DOBIN basis vector

16.4.3 Subspace Methods

Subspace methods have been used particularly effectively by analysts for anomaly and outlier detection in high-dimensional datasets [331], [348], [351]; it is often easier to find the sought-after observations by exploring lower-dimensional subspaces (rather than the original, full-dimensional space) – in that sense, subspace methods are akin to feature selection methods, whereas projection methods are closer to dimension reduction methods.

There is thus an intrinsic interest in exploring subspaces in their own right [324], [349]. This approach eliminates additive noise effects often found in high dimensional spaces and leads to more robust outliers (that is, outliers which are identified as such by multiple methods).

The problem is rather difficult to solve effectively and efficiently, since the number of potential subspace projections of high-dimensional data grows exponentially with the number of features in the dataset. The Feature Bagging algorithm [348], for instance, applies a base detector to randomly selected subsets of features – formally, the base detector is the LOF algorithm of Section 16.2.2, but any fast anomaly detection algorithm could be used instead; the anomaly scores and rankings from each run are then aggregated as they are in the independent ensemble approach (see Section 16.4.4).
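
The idea can be sketched in a few lines of R (a simplified illustration, not the implementation of [348]: the base detector here is the \(k-\)nearest-neighbour distance rather than LOF, the subspace dimension is drawn uniformly between \(\lceil p/2\rceil\) and \(p-1\), the scores are combined by simple averaging, and the function name feature.bagging is ours):

# simplified feature bagging: run a base detector (the distance to the
# k-th nearest neighbour) on random feature subsets, then average the scores
feature.bagging <- function(X, iter = 10, k = 5) {
  p <- ncol(X)
  scores <- matrix(0, nrow = nrow(X), ncol = iter)
  for (m in 1:iter) {
    d <- floor(runif(1, min = ceiling(p / 2), max = p))  # subspace dimension
    feats <- sample(1:p, d)                              # random feature subset
    knn <- FNN::knn.dist(X[, feats, drop = FALSE], k = k)
    scores[, m] <- knn[, k]                # k-NN distance as anomaly score
  }
  rowMeans(scores)                         # aggregated anomaly score
}

Calling feature.bagging(rdata[,1:4]), say, would return one aggregated score per observation, the largest of which point to the candidate anomalies.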

There are other, more sophisticated, subspace anomaly detection methods, including:

  • High-dimensional Outlying Subspaces (HOS) [337];

  • Subspace Outlier Degree (SOD) [338];

  • Projected Clustering Ensembles (OutRank) [339];

  • Local Selection of Subspace Projections (OUTRES) [340].

Example

The feature bagging algorithm is implemented in R via the HighDimOut package (removed from CRAN as of May 2022, but archived versions can still be installed). We apply it to our trusty artificial dataset.

res.FBOD <- HighDimOut::Func.FBOD(data = rdata[,1:4], iter=10, k.nn=5)
rdata.FBOD <- data.frame(1:(nobs+2),res.FBOD)
names(rdata.FBOD) = c("obs","FBOD")
rdata.FBOD <- rdata.FBOD[order(-rdata.FBOD$FBOD),]

rdata.FBOD %>% 
  ggplot2::ggplot(ggplot2::aes(x=obs,y=FBOD)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="Feature Bagging Anomaly Score, k=5, M=10") +
  ggplot2::geom_point(ggplot2::aes(fill=FBOD,
                                   colour=FBOD,
                                   size=FBOD),
                      pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

16.4.4 Ensemble Methods

In the preceding sections, we have described various anomaly detection algorithms whose relative performance varies with the type of data being considered. As is the case with pretty much all of data science, it is impossible to come up with an algorithm that outperforms all the others [239].

This is because a particular anomaly detection algorithm may be well-adapted to a dataset and may be successful in detecting abnormal or outlier observations, but it may not work with other datasets whose characteristics differ from those of the first. The impact of such a mismatch between algorithm and dataset can be mitigated by using ensemble methods, where the results of several algorithms are considered before making a final decision. Such an approach often provides the best results and thus improves on the performance of the base anomaly detection algorithms [319]. We will consider two types of ensemble methods: sequential ensembles (boosting) and independent ensembles.

Sequential Ensembles

Sequential ensembles require a given algorithm (or a set of algorithms) to be applied to a dataset in a sequential manner, each time on a slightly different dataset derived from the previous step’s results, and so forth. At each step, the weight associated with each observation is modified according to the preceding results using some “boosting” method (such as AdaBoost or XGBoost, for instance). The final result is either a weighted combination of all preceding results, or simply the output of the last step in the sequence.

The formal procedure relies on the boosting machinery; the details are out-of-scope for this module, but they are available in Section 13.5.3.

Independent Ensembles

In an independent ensemble, we instead apply different algorithms (or different instances of the same algorithm) to the dataset (or some resampled dataset). Choices made at the data and algorithm level are independent of the results obtained in previous runs (unlike in a sequential ensemble). The results are then combined to obtain more robust outliers.

Every base anomaly detection algorithm provides an anomaly score (or an abnormal/regular classification) for each observation in \(D\); observations with higher scores are considered to be more anomalous, observations with lower scores more normal.

The results are then combined using a task-specific method in order to provide a more robust classification of anomalous or outlying observations. Many such combination techniques are used in practice:

  • majority vote,

  • average,

  • minimal rank, etc.

Let \(\alpha_i(\mathbf{p})\) represent the (normalized) anomaly score of \(\mathbf{p}\in D\), according to algorithm \(A_i\). If \(\alpha_i(\mathbf{p})\approx 0\), it is unlikely that \(\mathbf{p}\) is an anomaly according to \(A_i\), whereas if \(\alpha_i(\mathbf{p})\approx 1\), it is quite likely that \(\mathbf{p}\) is an anomaly according to \(A_i\).

The rank of \(\mathbf{p}\in D\) according to \(A_i\), on the other hand, is denoted by \(r_i(\mathbf{p})\): the higher the rank (smaller number), the higher the anomaly score and vice versa. In a dataset with \(n\) observations, the rank varies from \(1\) to \(n\) (ties are allowed).

If the base detection algorithms are \(A_1, \ldots, A_m\), the anomaly score and the rank of an observation \(\mathbf{p}\in D\) according to the independent ensemble method are, respectively, \[\begin{aligned} \alpha(\mathbf{p})=\frac{1}{m}\sum_{i=1}^{m}\alpha_i(\mathbf{p}) \qquad \text{and}\; \qquad r(\mathbf{p}) =\min_{1\leq i \leq m} \{r_i(\mathbf{p})\}.\end{aligned}\] If \(n=m=3\), for instance, we could end up with \[\begin{aligned} \alpha_{1}\left(\mathbf{p}_{1}\right)&=1.0,\; \alpha_{1}\left(\mathbf{p}_{2}\right)=0.9,\; \alpha_{1}\left(\mathbf{p}_{3}\right)=0.0; \\ \alpha_{2}\left(\mathbf{p}_{1}\right)&=1.0,\; \alpha_{2}\left(\mathbf{p}_{2}\right)=0.8,\; \alpha_{2}\left(\mathbf{p}_{3}\right)=0.0;\\ \alpha_{3}\left(\mathbf{p}_{1}\right)&=0.1,\; \alpha_{3}\left(\mathbf{p}_{2}\right)=1.0,\; \alpha_{3}\left(\mathbf{p}_{3}\right)=0.0.\end{aligned}\]

Using the mean as the combination technique, we obtain \[\begin{aligned} \alpha\left(\mathbf{p}_{1}\right)&=0.7,\; \alpha\left(\mathbf{p}_{2}\right)=0.9,\; \alpha\left(\mathbf{p}_{3}\right)=0.0,\end{aligned}\] whence \[\mathbf{p}_2\succeq \mathbf{p}_1\succeq \mathbf{p}_3,\] that is, \(\mathbf{p}_2\) is more anomalous than \(\mathbf{p}_1\), which is itself more anomalous than \(\mathbf{p}_3\) (see the notation introduced in Section 16.1.2).

Using the minimal rank method, we obtain \[\begin{aligned} r_{1}\left(\mathbf{p}_{1}\right)&=1,\; r_{1}\left(\mathbf{p}_{2}\right)=2,\; r_{1}\left(\mathbf{p}_{3}\right)=3;\\ r_{2}\left(\mathbf{p}_{1}\right)&=1,\; r_{2}\left(\mathbf{p}_{2}\right)=2,\; r_{2}\left(\mathbf{p}_{3}\right)=3;\\ r_{3}\left(\mathbf{p}_{1}\right)&=2,\; r_{3}\left(\mathbf{p}_{2}\right)=1,\; r_{3}\left(\mathbf{p}_{3}\right)=3,\end{aligned}\] from which \[\begin{aligned} r\left(\mathbf{p}_{1}\right)=r\left(\mathbf{p}_{2}\right)=1,\; r \left(\mathbf{p}_{3}\right)=3, \end{aligned}\] and so \(\mathbf{p}_1\succeq \mathbf{p}_3\) and \(\mathbf{p}_2\succeq \mathbf{p}_3\), but \(\mathbf{p}_1\) and \(\mathbf{p}_2\) have the same anomaly levels.
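
These computations are straightforward to reproduce in R (a short sketch using the scores of the example above; rows correspond to the algorithms \(A_1,A_2,A_3\) and columns to the observations):

# combining the anomaly scores of m = 3 detectors for n = 3 observations
alpha <- rbind(A1 = c(1.0, 0.9, 0.0),
               A2 = c(1.0, 0.8, 0.0),
               A3 = c(0.1, 1.0, 0.0))

colMeans(alpha)        # mean combination: 0.7, 0.9, 0.0

# within-algorithm ranks (rank 1 = most anomalous), then the minimal rank
ranks <- t(apply(-alpha, 1, rank, ties.method = "min"))
apply(ranks, 2, min)   # minimal ranks: 1, 1, 3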

Evidently, the results depend not only on the dataset under consideration and on the base algorithms that are used in the ensemble, but also on how they are combined. In the context of HDLSS data, ensemble methods can sometimes allow the analyst to mitigate some of the effects of the curse of dimensionality by selecting fast base algorithms (which can be run multiple times) and focusing on building robust relative anomaly scores.

Another suggested approach is to use a different sub-collection of the original dataset’s features at each step, in order to de-correlate the base detection models (see feature bagging, in Section 16.4.3).

Even without combining the results, it may be useful to run multiple algorithms on different subspaces to produce an Overview of Outliers (O3), implemented in the R package OutliersO3, by A. Unwin.

O3d <- OutliersO3::O3prep(rdata[,1:4], method=c("HDo", "PCS", "BAC", "adjOut", "DDC", "MCD"))
O3d1 <- OutliersO3::O3plotM(O3d)
cx <- data.frame(outlier_method=names(O3d1$nOut), number_of_outliers=O3d1$nOut)
knitr::kable(cx, row.names=FALSE)
outlier_method number_of_outliers
HDo 8
PCS 1
BAC 2
adjOut 2
DDC 1
MCD 2
O3d1$gO3

The columns on the left indicate the subspace variables (see row colouring). The columns on the right indicate which observations were identified as outliers by at least \(1\) method in at least \(1\) subspace.

The colours depict the number of methods that identify each observation in each subspace as an outlier. For instance, Observation \(102\) is identified as an outlier by \(6\) methods in \(2\) subspaces, \(5\) methods in \(3\) subspaces, \(4\) methods in \(2\) subspaces, \(3\) methods in \(1\) subspace, \(2\) methods in \(4\) subspaces, and \(1\) method in \(3\) subspaces – it is clearly the most anomalous observation in the dataset. Observations \(62\) and \(101\) are also commonly identified as outliers.

Are the results aligned with those we have obtained throughout the module?


Ensemble approaches allow analysts to take a big-picture view of the anomaly landscape, but it should be recalled that anomaly detection and outlier analysis remains a very active area of research, with numerous challenges. Importantly, the No Free Lunch Theorem [239] suggests that there is no magic method: all methods have strengths and limitations, and the results depend heavily on the data.

References

[218]
A. Ng and K. Soo, Eds., Surviving a Disaster, in Numsense! algobeans, 2016.
[239]
D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, 1997.
[319]
K. G. Mehrotra, C. K. Mohan, and H. Huang, Anomaly Detection Principles and Algorithms. Springer, 2017.
[324]
C. C. Aggarwal, Outlier Analysis. Springer International Publishing, 2016.
[327]
C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” SIGMOD Rec., vol. 30, no. 2, pp. 37–46, 2001, doi: http://doi.acm.org/10.1145/376284.375668.
[330]
S. Kandanaarachchi and R. Hyndman, “Dimension reduction for outlier detection using DOBIN.” Sep. 2019. doi: 10.13140/RG.2.2.15437.18403.
[331]
C. C. Aggarwal and S. Sathe, Outlier Ensembles: An Introduction. Springer International Publishing, 2017.
[337]
J. Zhang, M. Lou, T. W. Ling, and H. Wang, “HOS-Miner: A System for Detecting Outlying Subspaces of High-Dimensional Data,” in Proceedings of the Thirtieth International Conference on Very Large Data Bases, 2004, pp. 1265–1268.
[338]
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, “Outlier detection in axis-parallel subspaces of high dimensional data,” in Advances in Knowledge Discovery and Data Mining, 2009, pp. 831–838.
[339]
E. Müller, I. Assent, P. Iglesias, Y. Mülle, and K. Böhm, “Outlier ranking via subspace analysis in multiple views of the data,” in 2012 IEEE 12th International Conference on Data Mining, 2012, pp. 529–538. doi: 10.1109/ICDM.2012.112.
[340]
E. Müller, M. Schiffer, and T. Seidl, “Adaptive outlierness for subspace outlier ranking,” in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010, pp. 1629–1632.
[348]
A. Lazarević and V. Kumar, “Feature bagging for outlier detection,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 157–166. doi: 10.1145/1081870.1081891.
[349]
A. Archimbaud, “Détection non-supervisée d’observations atypiques en contrôle de qualité : Un survol,” Journal de la Société Française de Statistique, vol. 159, no. 3, pp. 1–39, 2018.
[350]
T. Hastie, “Leukemia dataset.”
[351]
Z. He, S. Deng, and X. Xu, “A unified subspace outlier ensemble framework for outlier detection,” in Advances in Web-Age Information Management, 2005, pp. 632–637.