16.2 Quantitative Approaches

Cluster-based methods are not the only type of UL anomaly detection method; the quantitative approaches of this section come in two flavours: distance-based and density-based.

  • Distance-based methods include distance to all observations, distance to \(k\) nearest neighbours (\(k\)NN), average distance to \(k\)NN, median distance to \(k\)NN, etc.

  • Density-based methods include local outlier factor (LOF), isolation forest, HDBSCAN, etc.

16.2.1 Distance Methods

In order to determine whether an observation is anomalous or not, it must be compared to a set of other observations (anomalies are relative, not absolute). In the distance-based context, one natural way to compare observations is to consider their distance from one another, with increasing distance from the others being increasingly suggestive of anomalous status.

This approach works both in continuous and discrete cases, as long as a distance function or a pre-computed table of pair-wise distances between observations is given.

The choice of which sets of observations to use in this comparison distinguishes the different distance-based algorithms.

Notation

Let \(D \subset \mathbb{R}^n\) be an \(n\)-dimensional dataset, \(\mathbf{p},\mathbf{q}\in D\), \(P \subset D\) be a subset of \(D\). Assume that \(d: D \times D \to \mathbb{R}\) gives the distance between \(\mathbf{p}\) and \(\mathbf{q}\), written \(d(\mathbf{p},\mathbf{q})\).

An anomaly detection algorithm provides a function \(a : D \to \mathbb{R}\) that describes how anomalous a given observation is. This induces an ordering on the observations of \(D\): if \(a(\mathbf{p}) < a(\mathbf{q})\) for \(\mathbf{p},\mathbf{q} \in D\), then \(\mathbf{p}\) is less anomalous than \(\mathbf{q}\).

It could be necessary to define a threshold beyond which an observation is considered anomalous; if \(\alpha \in \mathbb{R}\) is such a threshold, then any \(\mathbf{p} \in D\) is absolutely anomalous if \(a(\mathbf{p}) > \alpha\).

Similarity Measures

A similarity measure is a real-valued function that describes the similarity between two objects. A common construction is to define the similarity \(w\) between two observations \(\mathbf{p},\mathbf{q}\) as \[w(\mathbf{p},\mathbf{q})=\frac{1}{1+d(\mathbf{p},\mathbf{q})}, \quad \text{for some distance }d,\] so that \(w\to 1\) as \(d\to 0\), and \(w\to 0\) as \(d\to \infty\).
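For instance, with the Euclidean distance between two (arbitrarily chosen) points of \(\mathbb{R}^3\), this construction takes the following form.

p <- c(1, 2, 3); q <- c(2, 0, 3)
d <- sqrt(sum((p - q)^2))    # Euclidean distance between p and q
(w <- 1 / (1 + d))           # w = 1 when d = 0; w -> 0 as d grows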

A similarity measure can also be constructed between probability distributions. Let \(X\) and \(Y\) be two \(n\)-dimensional random vectors of (possibly) different distribution with probability mass/density functions (p.m.f./p.d.f.) \(f_X\) and \(f_Y\), respectively. Let \(\Omega\) be their shared domain. For discrete random variables, the Hellinger distance is defined by \[\begin{aligned} H(X,Y)&=\left(1- \sum_{\mathbf{z} \in \Omega} \sqrt{f_X(\mathbf{z}) f_Y(\mathbf{z})}\right)^{1/2}\,;\end{aligned}\] for continuous random variables, it is defined by \[\begin{aligned} H(X,Y)= \left(1-\int_{\Omega} \sqrt{f_X(\mathbf{z}) f_Y(\mathbf{z})}\, d\mathbf{z}\right)^{1/2}\,. \end{aligned}\] If \(f_X=f_Y\) (or \(f_X=f_Y\) almost everywhere in the continuous case, that is, except over a set of measure zero), then \[\sum_{\Omega}\sqrt{f_Xf_Y}=1 \quad\mbox{or} \quad \int_{\Omega}\sqrt{f_Xf_Y}\, d\mathbf{z}=1\] and \(H(X,Y)=0\). The fact that \(H(X,Y)\in [0,1]\) is a consequence of the Cauchy–Schwarz inequality, with \(f_X^*=\sqrt{f_X}\) and \(f_Y^*=\sqrt{f_Y}\): \[\begin{aligned} 0\leq \int_{\Omega}\sqrt{f_Xf_Y}\, d\mathbf{z}&=\int_{\Omega}f_X^*f_Y^*\, d\mathbf{z}\\ &\leq \left(\int_{\Omega}|f_X^*|^2\, d\mathbf{z}\right)^{1/2}\left(\int_{\Omega}|f_Y^*|^2\, d\mathbf{z}\right)^{1/2} \\ &=\left(\int_{\Omega}f_X\, d\mathbf{z}\right)^{1/2}\left(\int_{\Omega}f_Y\, d\mathbf{z}\right)^{1/2}\!\!\!\!=1;\end{aligned}\] (a similar argument holds for discrete random variables).
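In R, the discrete Hellinger distance can be computed directly from two p.m.f. vectors over a shared support; the snippet below uses two binomial distributions for illustration (max() guards against tiny negative values caused by floating-point rounding).

hellinger <- function(f_X, f_Y) sqrt(max(0, 1 - sum(sqrt(f_X * f_Y))))

f_X <- dbinom(0:10, size = 10, prob = 0.5)   # Binomial(10, 0.5) p.m.f.
f_Y <- dbinom(0:10, size = 10, prob = 0.7)   # Binomial(10, 0.7) p.m.f.
hellinger(f_X, f_X)   # 0: identical distributions
hellinger(f_X, f_Y)   # strictly between 0 and 1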

Recall that the covariance matrices \(\Sigma_X\) and \(\Sigma_Y\) are \(n \times n\)-matrices whose \((i,j)\)-th entries are the covariance between the \(i\)-th and \(j\)-th positions of \(X\) and \(Y\), respectively. Given a collection of identically distributed samples, these covariance matrices can be estimated.

We can also compare a single observation \(\mathbf{p}\) to an entire probability distribution. In that case, the separation between \(\mathbf{p}\) and a distribution with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\) can be studied within the framework above using the Mahalanobis distance: \[M(\mathbf{p})=\sqrt{(\mathbf{p} - \boldsymbol{\mu})^{\!\top} \boldsymbol{\Sigma}^{-1} (\mathbf{p} - \boldsymbol{\mu})}.\]

Alternatively, if \(\mathbf{p}\) and \(\mathbf{q}\) are drawn from the same distribution with covariance \(\boldsymbol{\Sigma}\), then the Mahalanobis distance is a dissimilarity measure between \(\mathbf{p}\) and \(\mathbf{q}\): \[d_M(\mathbf{p},\mathbf{q})=\sqrt{(\mathbf{p} - \mathbf{q})^{\!\top} \boldsymbol{\Sigma}^{-1} (\mathbf{p} - \mathbf{q})}.\]

Example

In general, we do not know the true mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\) from which the data could arise, and the mean vector and the covariance structure must be estimated from the data. In the example of Section 16.1.3, we have:

mu.1 <- colMeans(rdata[,1:4]) 
cov(rdata[,1:4])
          x1        x2        x3        x4
x1 0.8999038 0.5690744 0.6646093 0.5033570
x2 0.5690744 1.3124241 1.0685066 0.4694309
x3 0.6646093 1.0685066 0.9921537 0.3969461
x4 0.5033570 0.4694309 0.3969461 0.9043493

These estimates are distinct from the true underlying parameters \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\), but the discrepancies are small enough to be explained by sampling variation and by the fact that \(\mathbf{z}_1,\mathbf{z}_4\not\sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\).

We first attempt to identify the anomalous observations by computing the Mahalanobis distance from the empirical distribution to all observations in the dataset.

Sigma.inv = matlib::inv(cov(rdata[,1:4]))

M_d<-vector()

for(j in 1:nrow(rdata)){
  M_d[j] <- sqrt(as.matrix(rdata[j,1:4]-mu.1) %*% 
                   Sigma.inv %*% t(as.matrix(rdata[j,1:4]-mu.1)))
}

rdata <- data.frame(rdata,M_d)
summary(M_d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4622  1.3479  1.6764  1.7980  2.1010  6.6393 

The summary certainly suggests that there is (at least) one observation for which the Mahalanobis distance to the empirical distribution is quite high.
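As a cross-check, base R's mahalanobis() returns squared Mahalanobis distances directly, so the following should reproduce M_d up to floating-point error.

M_d.check <- sqrt(stats::mahalanobis(rdata[,1:4], center = mu.1, cov = cov(rdata[,1:4])))
all.equal(M_d, unname(M_d.check))   # should be TRUE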

library(dplyr)

rdata |> ggplot2::ggplot(ggplot2::aes(x=M_d)) + 
  ggplot2::geom_histogram(colour="black",binwidth = 0.5) + 
  ggplot2::geom_rug() + 
  ggplot2::theme_bw()

rdata |> ggplot2::ggplot(ggplot2::aes(x=M_d)) + 
  ggplot2::geom_boxplot() + ggplot2::geom_rug(color="black")

The histogram of Mahalanobis distances shows that most observations are fairly “regular”, but that two of the observations have substantially larger distances. The boxplot confirms it, but identifies a potential third outlying observation.

Below, we display the scatter plot matrix of the 102 observations, with colour given by the Mahalanobis distance of each observation from the empirical distribution (the code is omitted).

It certainly seems as though the two anomalies could be \(\mathbf{z}_1\) and \(\mathbf{z}_4\).

Next, we plot the Mahalanobis distance from each observation to every other observation.

M_pq<-matrix(nrow=nrow(rdata), ncol=nrow(rdata))

for(j in 1:nrow(rdata)){
  for(i in 1:nrow(rdata)){
    M_pq[j,i]<-sqrt(as.matrix(rdata[j,1:4]-rdata[i,1:4]) %*% 
                    Sigma.inv %*% t(as.matrix(rdata[j,1:4]-rdata[i,1:4])))
  }
}

M_pq<-as.data.frame.table(M_pq)
M_pq[,1:2]<-lapply(M_pq[,1:2],as.numeric)

M_pq |> ggplot2::ggplot(ggplot2::aes(x=Var1,y=Freq)) + 
  ggplot2::geom_point(ggplot2::aes(fill=Freq,colour=Freq),pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#FFFFFF") + 
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#FFFFFF") + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="Distance") + 
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none")

Note the differing patterns for observations 101 and 102, as well as the diffuse cloud of points above the distance \(5.0\) for the other observations. There are a few other observations for which the distances to other observations seem to be larger than in a majority of the cases.

Next, we display the same distributions with the help of boxplots.

median.value <- M_pq |> 
  group_by(Var1) |> 
  summarise(meanDist=mean(Freq)) |> 
  summarise(median_value=median(meanDist))

test <- M_pq |> 
  group_by(Var1) |> 
  summarise(meanDist=mean(Freq)) |> 
  summarise(std=sd(meanDist))

med.sd = test+median.value

M_pq |> ggplot2::ggplot(ggplot2::aes(x=as.factor(Var1),y=Freq)) + 
  ggplot2::geom_boxplot() +
  ggplot2::scale_x_discrete(name="Observations") + 
  ggplot2::scale_y_continuous(name="Distance") + 
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") + 
  ggplot2::geom_hline(yintercept=as.numeric(median.value), linetype = "dashed", color = "red")  + 
  ggplot2::geom_hline(yintercept=as.numeric(med.sd), linetype = "dotted", color = "red")  + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle=90))

The dashed red line represents the median of the mean distances per observation; the dotted red line lies one standard deviation above the median.

To simplify the reading of the situation, we plot only the mean distance per observation, linking the colour intensity and the marker size to the mean distance (larger mean distances correspond to bluer and larger markers).

M_pq |> group_by(Var1) |> summarise(meanDist=mean(Freq)) |> 
  ggplot2::ggplot(ggplot2::aes(x=Var1,y=meanDist)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="Mean Distance") +
  ggplot2::geom_point(ggplot2::aes(fill=meanDist,colour=meanDist,size=meanDist),pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") + 
  ggplot2::geom_hline(yintercept=as.numeric(median.value), linetype = "dashed", color = "red")  + 
  ggplot2::geom_hline(yintercept=as.numeric(med.sd), linetype = "dotted", color = "red")  

Do any other observations strike you as possible outliers?

Similarity Measures (Reprise)

If \(\boldsymbol{\Sigma}\) is diagonal, then \[d_M(\mathbf{p},\mathbf{q})=\sqrt{\sum_{i=1}^n \frac{(p_i - q_i)^2}{\sigma_i^2}},\] where \(\sigma_i^2\) is the variance along the \(i\)-th dimension. If \(\boldsymbol{\Sigma}\) is the identity matrix, then we recover the Euclidean distance \[d_2(\mathbf{p},\mathbf{q})=\sqrt{\sum_{i=1}^n (p_i - q_i)^2}.\]

When using the Euclidean distance in an anomaly detection context, a linear normalization is usually applied to each dimension so that each entry lies in the hypercube \([-1,1]^n\). The Minkowski distance of order \(p\) is a generalization of the Euclidean distance: \[d_p(\mathbf{p},\mathbf{q})=\left( \sum_{i=1}^n |p_i - q_i|^p \right)^{1/p}.\]

For \(p=2\) we recover the Euclidean distance \(d_2\), for \(p=1\) the Manhattan distance \[d_1(\mathbf{p},\mathbf{q})=\sum_{i=1}^n|p_i-q_i|,\] and for \(p=\infty\) the supremum distance (also called the Chebychev distance) \[d_{\infty}(\mathbf{p},\mathbf{q})=\max_{i=1}^n |p_i - q_i|.\] The Minkowski distance \(d_p\) is only actually a distance function (i.e., a metric) when \(p \geq 1\), but an exception is made for \[d_{-\infty}(\mathbf{p},\mathbf{q})=\min_{i=1}^n |p_i - q_i|\] to fall within the same framework.
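Most of these distances are available directly through stats::dist(); a quick illustration on two (arbitrarily chosen) points:

pq <- rbind(p = c(1, 2, 3, 4), q = c(2, 0, 3, 1))
dist(pq, method = "manhattan")          # d_1
dist(pq, method = "euclidean")          # d_2
dist(pq, method = "minkowski", p = 4)   # d_4
dist(pq, method = "maximum")            # d_infinity (Chebychev)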

The Jaccard similarity of two datasets \(P\) and \(Q\), is defined as the size of their intersection divided by the size of their union \[J(P,Q) = \frac{|P \cap Q|}{|P \cup Q|} = \frac{|P \cap Q|}{|P| + |Q| - |P \cap Q|}.\] Their Jaccard distance is then taken to be \(1 - J(P,Q)\).

This definition can be extended to compare binary vectors (i.e. vectors with entries in \(\{0,1\}\)) of the same length. Given two binary vectors \(\mathbf{p}\) and \(\mathbf{q}\) of length \(n\), consider an arbitrary set \(D\) of size \(n\). Then \(\mathbf{p}\) and \(\mathbf{q}\) can be viewed as subsets of \(D\): if \(p_i=1\) then \(\mathbf{p}\) is said to contain the \(i\)-th element of \(D\), while if \(p_i=0\) then it does not. Viewing \(\mathbf{p}\) and \(\mathbf{q}\) in this way allows us to compute their Jaccard similarity, and thus their Jaccard distance.
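A small illustration with two (arbitrary) binary vectors; note that dist() with method="binary" returns the Jaccard distance \(1-J(\mathbf{p},\mathbf{q})\) rather than the similarity.

p <- c(1, 0, 1, 1, 0, 1)
q <- c(1, 1, 1, 0, 0, 1)
(J <- sum(p & q) / sum(p | q))                         # |intersection| / |union| = 3/5
1 - as.numeric(dist(rbind(p, q), method = "binary"))   # agrees with J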

Finally, let \(\mathbf{p},\mathbf{q}\neq \mathbf{0}\). Recall that \(\mathbf{p} \cdot \mathbf{q} = \lVert \mathbf{p} \rVert \lVert \mathbf{q} \rVert \cos\theta,\) where \(\theta\) is the angle between \(\mathbf{p}\) and \(\mathbf{q}\). The cosine similarity between \(\mathbf{p}\) and \(\mathbf{q}\) is the cosine of \(\theta\), which can be computed as \[\cos\theta = \frac{\mathbf{p} \cdot \mathbf{q}}{\lVert \mathbf{p} \rVert \lVert \mathbf{q} \rVert} = \frac{\sum_{i=1}^n p_i q_i}{\sqrt{\sum_{i=1}^n p_i^2} \sqrt{\sum_{i=1}^n q_i^2}}.\]
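In R, the cosine similarity of two (arbitrary) vectors can be computed in one line:

p <- c(1, 2, 3, 4); q <- c(2, 0, 3, 1)
sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))   # cosine similarity of p and q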

This value ranges between \(1\) and \(-1\), with 1 attained when \(\mathbf{p}=\mathbf{q}\), \(-1\) when \(\mathbf{p}=-\mathbf{q}\), and \(0\) when \(\mathbf{p}\) and \(\mathbf{q}\) are perpendicular. Armed with these concepts, we can now explore distance-based methods for anomaly detection; they will also eventually be useful for density-based anomaly detection.


Figure 16.13: 2D visualization of various similarity metrics [344].

16.2.1.1 Distance-Based Anomaly Detection

All these distance functions (similarity measures) can be used to create basic anomaly detection algorithms (the ideas can also be extended to more complex algorithms).

Given some distance function \(d\), dataset \(D\), and integers \(k,\nu\leq |D|\), the distance to all points (DTAP) anomaly detection algorithm considers each observation \(\mathbf{p}\) in \(D\) and adds the distance from \(\mathbf{p}\) to every other observation in \(D\), i.e. \[a(\mathbf{p}) = \sum_{\mathbf{q}\neq \mathbf{p} \in D} d(\mathbf{q}, \mathbf{p}).\]

The \(\nu\) observations with largest values for \(a\) are then said to be anomalous according to \(a\). This approach often selects the most extreme observations as anomalous, which may be of limited use in practice.

The distance to nearest neighbour (DTNN) algorithm defines \[a(\mathbf{p}) = \min_{\mathbf{q}\neq \mathbf{p} \in D} d(\mathbf{q}, \mathbf{p}),\] with a similar definition for the \(\nu\) anomalous observations. The average distance to \(k\) nearest neighbours and median distance to \(k\) nearest neighbours are defined similarly.
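As a minimal sketch (assuming the full pairwise distance matrix fits in memory), the average distance to the \(k\) nearest neighbours can be computed as below; the commented lines show how it would be applied to the scaled data constructed in the examples that follow, and replacing mean() with median() yields the median-distance variant.

avg_knn_dist <- function(dist.matrix, k = 4) {
  # for each observation, average the k smallest distances to the other observations
  apply(dist.matrix, 1, function(d) mean(sort(d)[2:(k + 1)]))   # sort(d)[1] is the self-distance 0
}
# a.knn <- avg_knn_dist(as.matrix(dist(rdata.scaled[,1:4], method="euclidean")), k = 4)
# head(sort(a.knn, decreasing = TRUE))   # the most anomalous observations according to this detector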

Example - Distance to All Points

We start by building the DTAP anomaly detector for the Euclidean distance (method="euclidean") on the scaled artificial data, which is shown below.

rdata.scaled=data.frame(matrix(ncol = 4, nrow = nobs+2))

for(i in 1:4){
  rdata.scaled[,i] <- 2/(max(rdata[,i]) - min(rdata[,i])) * rdata[,i] - 1
}

lattice::splom(rdata.scaled[,1:4], pch=22)

The top \(\nu=6\) anomalous observations are obtained by the following chunk of code.

m.L2 <- as.matrix(dist(rdata.scaled[,1:4], method="euclidean"))
adoa.L2 <- data.frame(1:(nobs+2),rowSums(m.L2))
colnames(adoa.L2) <- c("obs","dist")
adoa.L2 <- adoa.L2[order(-adoa.L2$dist),]
rownames(adoa.L2) <- NULL
knitr::kable(head(adoa.L2))
obs dist
102 241.7556
62 177.6135
14 113.9903
49 108.6464
55 106.5156
67 104.7870

The accompanying plot is:

adoa.L2 |>  
  ggplot2::ggplot(ggplot2::aes(x=obs,y=dist)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="Sum of Euclidean Distances") +
  ggplot2::geom_point(ggplot2::aes(fill=dist,
                                   colour=dist,
                                   size=dist),
                      pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

The top \(\nu=6\) anomalies are obtained for a variety of metrics (we do not display the code used for these other metrics).

  • Chebychev (replace method="euclidean" by method="maximum"):

    obs dist
    102 144.85387
    62 114.83273
    14 88.25016
    55 81.47169
    49 80.33274
    101 73.11895

  • Manhattan (replace method="euclidean" by method="manhattan"):

    obs dist
    102 473.7469
    62 342.2531
    14 197.7493
    67 193.9257
    49 191.7707
    55 183.6509

  • Minkowski, \(p=1/2\) (replace method="euclidean" by method="minkowski", p=0.5):

    obs dist
    102 1873.7488
    62 1341.9020
    67 731.9811
    14 717.3451
    49 700.5557
    11 669.4313

  • Minkowski, \(p=4\) (replace method="euclidean" by method="minkowski", p=4):

    obs dist
    102 3738.781
    62 2672.940
    67 1444.177
    14 1403.799
    49 1372.067
    11 1318.830

We see that while observation 102 is always the most anomalous according to DTAP, the ranking is affected by the choice of distance metric. Is this surprising?

Example - Distance to Nearest Neighbour

We next build the DTNN anomaly detector for the Euclidean distance, again on the scaled artificial data.

The top \(\nu=6\) anomalous observations are obtained by the following chunk of code (the diagonal of the distance matrix is inflated so that an observation's zero distance to itself is never selected as the minimum).

m.L2 <- m.L2 + 1000000*diag(nobs+2)
adoa.L2 <- data.frame(1:(nobs+2),apply(m.L2,1,min))
colnames(adoa.L2) <- c("obs","dist")
adoa.L2 <- adoa.L2[order(-adoa.L2$dist),]
rownames(adoa.L2) <- NULL
knitr::kable(head(adoa.L2))
obs dist
102 1.4549129
62 0.8667313
101 0.3746565
12 0.3611265
11 0.3186967
43 0.3145305

The accompanying plot is:

The top \(\nu=6\) anomalies are obtained for a variety of metrics (we do not display the code used for these other metrics).

  • Chebychev:

    obs dist
    102 0.8269178
    62 0.5203870
    12 0.2449443
    101 0.2368307
    68 0.2356369
    43 0.2214142

  • Manhattan:

    obs dist
    102 2.8914743
    62 1.6745725
    12 0.6499116
    101 0.6082916
    11 0.5702036
    23 0.4903152

  • Minkowski, \(p=1/2\):

    obs dist
    102 10.534441
    62 6.559370
    101 1.857222
    11 1.704864
    23 1.618388
    12 1.591956

  • Minkowski, \(p=4\):

    obs dist
    102 20.605432
    62 13.059856
    101 3.521821
    11 3.081000
    23 2.979090
    12 2.809583


In both examples, there are commonalities (certain observations come back repeatedly as likely anomalous observations), but the anomaly rankings change according to the selected distance function and the choice of algorithm (the choice of data scaling approach would also play a role). This is par for the course in the anomaly detection context.

16.2.2 Density Methods

Density-based approaches view observations as anomalous if they occur in low-density regions.


Figure 16.14: Low-density areas as outlier nurseries [318].

Density-based methods include:

  • local outlier factors;

  • DBSCAN, and

  • isolation forests.

Local Outlier Factor

The Local Outlier Factor (LOF) algorithm was proposed in 2000 by [336] (a summary can be found in Section 6.4.2 of [319]). LOF works by measuring the local deviation of each observation in a dataset from its \(k\) nearest neighbours, with a point said to be anomalous if this deviation is large.

A local \(k-\)neighbourhood \(N_k(\mathbf{p})\) around an observation \(\mathbf{p}\) is defined as the set of its \(k\) nearest neighbours. The local density of each observation is estimated from its \(k-\)neighbourhood, and compared to the local densities of the observations lying within that \(k-\)neighbourhood.

This can then be used to identify outliers that inhabit regions of lower density than their neighbours, as \(\mathbf{p}\) would be in Figure 16.15.


Figure 16.15: In this example, \(\mathbf{p}\) has lower \(k-\)local density than its \(2-\)neighbours \(\mathbf{q}_1,\mathbf{q}_2\).

The formal procedure is as follows:

Any observation with a LOF \(a_k(\mathbf{p})\) above some threshold \(\tau\) is a local outlier, but selecting \(\tau\) is not obvious. LOF introduces the idea of a reachability distance, which improves the stability of results within clusters/regions: within \(N_k(\mathbf{p})\), it is \[d_{\text{reach}}(\mathbf{p},\mathbf{q})=\max_{\ell}\{d(\mathbf{p},\mathbf{q}_{\ell}); \mathbf{q}_{\ell}\in N_k(\mathbf{p})\},\] the maximal distance to its \(k-\)neighbours; outside of \(N_k(\mathbf{p})\), it is \[d_{\text{reach}}(\mathbf{p},\mathbf{q})=d(\mathbf{p},\mathbf{q}),\] the actual distance.

In Figure 16.16, assuming \(k=3\), the observations \(\mathbf{q}_1, \mathbf{q}_2, \mathbf{q}_3\) all have the same reachability distance from \(\mathbf{p}\) as they are all \(3\)-neighbours of \(\mathbf{p}\), that is, \[d_{\text{reach}}(\mathbf{p}, \mathbf{q}_1) = d_{\text{reach}}(\mathbf{p}, \mathbf{q}_2) = d_{\text{reach}}(\mathbf{p}, \mathbf{q}_3) = d(\mathbf{p}, \mathbf{q}_3).\] The observation \(\mathbf{q}_4\), on the other hand, has \(d_{\text{reach}}(\mathbf{p}, \mathbf{q}_4) = d(\mathbf{p}, \mathbf{q}_4)\) as it is not a \(k\)-neighbour of \(\mathbf{p}\).


Figure 16.16: Illustration of reachability with \(k=3\).
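Before turning to an existing implementation, here is a minimal sketch of the LOF computation along the lines of [336], assuming Euclidean distances and no tied observations (the helper lof_sketch() and its internals are illustrative, not part of any package); in practice we rely on the Rlof package, as in the example below.

lof_sketch <- function(X, k = 4) {
  D <- as.matrix(dist(X))                                  # pairwise Euclidean distances
  n <- nrow(D)
  nn <- t(apply(D, 1, function(d) order(d)[2:(k + 1)]))    # k nearest neighbours of each observation (excluding itself)
  k_dist <- sapply(1:n, function(i) D[i, nn[i, k]])        # k-distance: distance to the k-th nearest neighbour
  lrd <- sapply(1:n, function(i) {                         # local reachability density
    reach <- pmax(k_dist[nn[i, ]], D[i, nn[i, ]])          # reachability distances to the k-neighbours
    1 / mean(reach)
  })
  sapply(1:n, function(i) mean(lrd[nn[i, ]]) / lrd[i])     # LOF: average ratio of neighbours' lrd to own lrd
}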

Example

LOF is implemented in R via the Rlof package; we apply it to the scaled artificial data using the Euclidean and the Chebychev distances.

dist.L2 = dist(rdata.scaled[,1:4], method="euclidean")
lof <- Rlof::lof(dist.L2, k=4)
rdata.lof.L2 = data.frame(rdata.scaled[,1:4],lof)
rdata.lof.obs.L2 = data.frame(1:(nobs+2),lof)
names(rdata.lof.obs.L2) = c("obs",'lof')
rdata.lof.obs.L2 <- rdata.lof.obs.L2[order(-rdata.lof.obs.L2$lof),]
rownames(rdata.lof.obs.L2) <- NULL
knitr::kable(head(rdata.lof.obs.L2))
obs lof
102 5.377542
62 5.100218
12 1.609463
55 1.481152
27 1.475414
14 1.402112

dist.sup = dist(rdata.scaled[,1:4], method="maximum")
lof <- Rlof::lof(dist.sup, k=4)
rdata.lof.sup = data.frame(rdata.scaled[,1:4],lof)
rdata.lof.obs.sup = data.frame(1:(nobs+2),lof)
names(rdata.lof.obs.sup) = c("obs",'lof')
rdata.lof.obs.sup <- rdata.lof.obs.sup[order(-rdata.lof.obs.sup$lof),]
rownames(rdata.lof.obs.sup) <- NULL
knitr::kable(head(rdata.lof.obs.sup))
obs lof
102 2.984990
62 2.917587
14 1.636228
27 1.542470
12 1.466011
79 1.342456

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) was proposed in 1996 by [334] (a summary can be found in Section 4.1.5 of [319], as well as in Section 14.4.1). As its name suggests, it is a density-based clustering algorithm that groups nearby observations together and labels observations that do not fall in the clusters as anomalies.

Hierarchical DBSCAN (HDBSCAN) [335] was introduced in 2013. It notably removes the problem of choosing the parameter for the radius of a neighbourhood by considering all “possible” radii. Further documentation can be found at [345].

In DBSCAN,

  • an observation \(\mathbf{p}\) is a core point if there is a minimum of \(m\) observations within distance \(r\) of \(\mathbf{p}\);

  • an observation \(\mathbf{q}\) is a border point (or non-core point) if it is not itself a core point but is within distance \(r\) of one, and

  • an observation \(\mathbf{o}\) is an outlier if it is neither a core point nor a border point.


Figure 16.17: For minimum neighbourhood size \(m=2\) and the fixed radius \(r\) displayed above, \(\mathbf{o}\) is an outlier, \(\mathbf{p}\) is a core point, and \(\mathbf{q}_1,\mathbf{q}_2\) are border points.

DBSCAN considers each observation in the dataset individually. If that observation is an outlier, then it is added to a list of outliers. Otherwise if it is a core point, then its \(r\)-neighbourhood forms the beginning of a new cluster. Each observation in this \(r\)-neighbourhood is then considered in turn, with the \(r\)-neighbourhoods of other core observations contained in the neighbourhood being added to the cluster.

This expansion repeats until all observations have been examined. During this step, observations that were previously labelled as outliers may be updated as they become border points in the new cluster. This process continues until every observation has either been assigned to a cluster or labelled as an outlier.

The formal procedure is as follows:

While DBSCAN’s dual use as a clustering algorithm may seem irrelevant in the outlier detection setting, it is its ability to successfully identify clusters that is crucial for labelling the remaining observations as outliers.

DBSCAN/HDSBCAN Strengths:

  • the number of clusters does not need to be known beforehand (unlike in \(k-\)means and other clustering algorithms);

  • clusters of arbitrary shape can be detected;

  • when using HDBSCAN, only the parameter for the minimum cluster size \(m\) is required, which can be set fairly intuitively.

DBSCAN/HDBSCAN Limitations:

  • it is not deterministic, as border points can be assigned to different clusters depending on the order in which core observations are considered (this does not affect its use as an anomaly detection algorithm, however);

  • in high-dimensional spaces, the ability of any distance function based on the Euclidean distance to distinguish near from distant observations diminishes due to the Curse of Dimensionality; DBSCAN thus becomes ineffective in such spaces (as do other clustering algorithms);

  • it cannot handle differences in local densities as the radius of a neighbourhood \(r\) is fixed; this could lead to sparser clusters being labelled as outliers, or to outliers surrounding a denser cluster being included in the cluster (this issue is overcome in HDBSCAN, however).
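With the dbscan package used in the example below, the flagged outliers are simply the observations whose cluster label is 0 (the noise points); a quick illustration on a small synthetic dataset, with arbitrary parameter values:

set.seed(0)
toy <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), c(3, 3))   # 50 tightly packed points and 1 distant point
db  <- dbscan::dbscan(toy, eps = 0.5, minPts = 4)
which(db$cluster == 0)   # noise points, i.e. the outliers (the distant 51st point, possibly a few edge points)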

Example

We use the R implementation of DBSCAN, HDBSCAN, and OPTICS (another density-based clustering algorithm) found in the dbscan package; we apply various parameters to the scaled artificial data, using the Euclidean distance in all instances.

scaled = scale(rdata[,1:4])
set.seed(1) # for replicability
  • DBSCAN, eps = 0.4, minPts = 4

    db.1 <- dbscan::dbscan(scaled, eps = .4, minPts = 4)
    db.1
    lattice::splom(scaled, groups=db.1$cluster + 1L, pch=22)

    DBSCAN clustering for 102 objects.
    Parameters: eps = 0.4, minPts = 4
    The clustering contains 1 cluster(s) and 96 noise points.
    
     0  1 
    96  6 
    
    Available fields: cluster, eps, minPts

    Evidently, 0.4 is too small a value for eps or 4 is too large a value for minPts (or both).

  • DBSCAN, eps = 1, minPts = 4

    DBSCAN clustering for 102 objects.
    Parameters: eps = 1, minPts = 4
    The clustering contains 1 cluster(s) and 6 noise points.
    
     0  1 
     6 96 
    
    Available fields: cluster, eps, minPts

    The results are reversed with a larger value of eps.

  • DBSCAN, eps = 1, minPts = 10

    DBSCAN clustering for 102 objects.
    Parameters: eps = 1, minPts = 10
    The clustering contains 1 cluster(s) and 24 noise points.
    
     0  1 
    24 78 
    
    Available fields: cluster, eps, minPts

    Are the clustering results (and thus the anomaly discovery rate) as expected? The interaction between the parameters can have unpredictable effects.

  • DBSCAN, eps = 2, minPts = 10

    DBSCAN clustering for 102 objects.
    Parameters: eps = 2, minPts = 10
    The clustering contains 1 cluster(s) and 2 noise points.
    
      0   1 
      2 100 
    
    Available fields: cluster, eps, minPts
  • HDBSCAN, minPts = 4

    hdb <- dbscan::hdbscan(scaled, minPts = 4)
    hdb
    lattice::splom(scaled, groups=hdb$cluster + 1L, pch=22)

    HDBSCAN clustering for 102 objects.
    Parameters: minPts = 4
    The clustering contains 4 cluster(s) and 71 noise points.
    
     0  1  2  3  4 
    71 10  6  4 11 
    
    Available fields: cluster, minPts, coredist, cluster_scores,
                      membership_prob, outlier_scores, hc

    Notice the absence of the eps parameter.

  • OPTICS, eps = 1, minPts = 4, eps_cl = 1, xi=.05 (read the documentation for details)

    opt <- dbscan::optics(scaled, eps = 1, minPts = 4)
    opt.1 <- dbscan::extractDBSCAN(opt, eps_cl = 1)
    opt.1
    lattice::splom(rdata[,1:4], groups=opt.1$cluster + 1L, pch=22)

    OPTICS ordering/clustering for 102 objects.
    Parameters: minPts = 4, eps = 1, eps_cl = 1, xi = NA
    The clustering contains 1 cluster(s) and 6 noise points.
    
     0  1 
     6 96 
    
    Available fields: order, reachdist, coredist, predecessor, minPts, eps,
                      eps_cl, xi, cluster
    opt.2 <- dbscan::extractXi(opt, xi = .05)
    opt.2
    lattice::splom(scaled, groups=opt.2$cluster + 1L, pch=22)

    OPTICS ordering/clustering for 102 objects.
    Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = 0.05
    The clustering contains 4 cluster(s) and 7 noise points.
    
    Available fields: order, reachdist, coredist, predecessor, minPts, eps,
                      eps_cl, xi, clusters_xi, cluster

    Are there any surprises?

Isolation Forest

The approaches that were previously discussed first construct models of what normal observations look like, and then identify observations that do not fit this model. The Isolation Forest algorithm [332] introduced in 2008 instead tries to explicitly identify outliers under the assumptions that there are few outliers in the data and that these outliers have very different attributes compared to normal (or regular) observations.

This allows the use of sampling techniques that increase algorithmic speed while decreasing memory requirements.

The Isolation Forest algorithm tries to isolate anomalous observations. It does so by randomly selecting an attribute and then randomly selecting a split value between that attribute’s min and max values. This recursively partitions the observations until every observation is isolated in its own partition component.

Recursive partitioning yields a binary tree called an Isolation Tree (IsoTree):

  • the root of this tree is the entire dataset;

  • each node is a subset of the observations;

  • each branch corresponds to one of the generated partitions, and

  • the leaves are singleton sets containing a single isolated observation.

Each observation is then assigned a score derived from how deep in the tree its singleton partition appears. As observations that are shallower in the tree were easier to separate from the rest, these are the likely outliers.

Since only shallow observations are of interest, once the height of the tree has reached a given threshold (the expected height of a random binary tree, say), further construction of the tree can be stopped to decrease computational cost.

Additionally, instead of building a tree from the entire dataset, a tree can be constructed from a subset. The location of any observation within this smaller tree can then be estimated, again saving computational and memory resources. These two improvements are detailed in the original paper [332].

Once a number of Isolation Trees have been randomly generated (Isolation Forest), a score can be computed for each point. This is done by searching each tree for the location of a given point and noting the path length required to reach it. Once an observation’s path length in each tree has been computed, the average path length is taken to be its score. In Isolation Forests, low scores are indicative of outlying behaviour.


Figure 16.18: Isolation Forest schematics [318]

The formal procedure is as follows:

It can be desirable to construct a normalized anomaly score that is independent of the size of the dataset. In order to do this, the expected path length of a random observation in an Isolation Tree (i.e. binary tree) must be estimated. With \(|D|=n\), it can be shown that the expected length is \[c(n) = 2 H(n-1) - \frac{2(n-1)}{n},\] where \(H(n-1)\) is the \((n-1)^{\text{th}}\) harmonic number, which can be approximated by \(\ln(n-1) + 0.577\); \(c(n)\) is then used to normalize the final anomaly score \(a(\mathbf{p})\) for \(\mathbf{p} \in D\), which is given by \[\log_2 a(\mathbf{p}) = -\frac{\text{average path length to $\mathbf{p}$ in the Isolation Trees}}{c(n)}.\]

Thus defined, \(a(\mathbf{p})\in [0,1]\), with \(a(\mathbf{p}) \approx 1\) suggesting \(\mathbf{p}\) is an anomaly, \(a(\mathbf{p}) \leq 0.5\) suggesting \(\mathbf{p}\) is a normal observation; if all observations receive a score around \(0.5\), this suggests that there are no anomalies present.
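A small sketch of this normalization, assuming avg_path holds the average path length to an observation across the Isolation Trees (both helper functions are illustrative; the constant 0.5772… is the Euler–Mascheroni approximation of the harmonic number mentioned above):

c_n <- function(n) 2 * (log(n - 1) + 0.5772156649) - 2 * (n - 1) / n   # expected path length c(n)
iso_score <- function(avg_path, n) 2^(-avg_path / c_n(n))              # normalized score a(p); values near 1 suggest anomalies
iso_score(avg_path = 3, n = 102)   # a shallow observation in a dataset of 102 points scores roughly 0.78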

IsoForest Strengths:

  • small time and memory requirements;

  • the ability to handle high-dimensional data, and

  • no need for observations to be labelled as anomalies in the training set.

IsoForest Main Limitation:

  • the anomaly score assigned to a given point can have high variance over multiple runs of the algorithm. The authors of [333] propose some solutions.

Example

We use the R implementation of IsoForest found in the solitude package; we apply it to the scaled artificial data (no distance function needs to be specified for this method). Note that this implementation uses a different scoring system from the raw path lengths discussed above: high scores are indicative of anomalous observations.

#library(solitude)
set.seed(1) # for replicability
index = 1:102

We initiate an isolation forest:

iso = solitude::isolationForest$new(sample_size = length(index))
iso$fit(dataset = rbind(scaled[index,1:4],c(0,0,0,0)))
test<-iso$predict(scaled[index,1:4]) # scores for training data
INFO  [10:35:49.974] Building Isolation Forest ...  
INFO  [10:35:50.765] done 
INFO  [10:35:50.772] Computing depth of terminal nodes ...  
INFO  [10:35:51.244] done 
INFO  [10:35:51.250] Completed growing isolation forest 

The top \(\nu=6\) IsoForest anomaly scores are given below:

rdata.iso = data.frame(1:(nobs+2),test$anomaly_score)
names(rdata.iso) = c("obs","anomaly_score")
rdata.iso <- rdata.iso[order(-rdata.iso$anomaly_score),]
rownames(rdata.iso) = NULL
knitr::kable(head(rdata.iso))
obs anomaly_score
102 0.7795207
62 0.7480289
55 0.6379525
29 0.6228751
49 0.6218485
67 0.6182687
rdata.iso |>  
  ggplot2::ggplot(ggplot2::aes(x=obs,y=anomaly_score)) + 
  ggplot2::scale_x_continuous(name="Observations") + 
  ggplot2::scale_y_continuous(name="IsoForest Anomaly Score") +
  ggplot2::geom_point(ggplot2::aes(fill=anomaly_score,
                                   colour=anomaly_score,
                                   size=anomaly_score),
                      pch=22) +
  ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +  
  ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

The profile of anomaly scores has a fairly distinct look (although we recognize quite a few of the usual suspects).


In general, density-based schemes are more powerful than distance-based schemes when a dataset contains patterns with diverse characteristics, but less effective when the patterns are of comparable densities with the outliers [346].

References

[318]
D. Baron, “Outlier Detection,” XXX Winter School of Astrophysics on Big Data in Astronomy, GitHub repository, 2018.
[319]
K. G. Mehrotra, C. K. Mohan, and H. Huang, Anomaly Detection Principles and Algorithms. Springer, 2017.
[332]
F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
[333]
S. Hariri, M. Carrasco Kind, and R. J. Brunner, “Extended isolation forest,” IEEE Transactions on Knowledge and Data Engineering, 2019.
[334]
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[335]
R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-Based Clustering Based on Hierarchical Density Estimates,” in Advances in Knowledge Discovery and Data Mining, 2013, pp. 160–172.
[336]
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” SIGMOD Rec., vol. 29, no. 2, pp. 93–104, 2000.
[344]
D. Whitenack, Machine Learning with Go. Packt Publishing, 2017.
[345]
L. McInnes, J. Healy, and S. Astels, “How HDBSCAN works,” 2016.
[346]
J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, “Capabilities of outlier detection schemes in large datasets, framework and methodologies,” Knowl. Inf. Syst., vol. 11, no. 1, pp. 45–84, Jan. 2007.