## 16.2 Quantitative Approaches

Cluster-based methods are not the only unsupervised learning (UL) anomaly detection methods. Quantitative approaches come in two flavours: **distance-based** and **density-based**.

**Distance-based methods** include the distance to all observations, the distance to the \(k\) nearest neighbours (\(k\)NN), the average distance to the \(k\)NN, the median distance to the \(k\)NN, etc. **Density-based methods** include the local outlier factor (LOF), isolation forests, HDBSCAN, etc.

### 16.2.1 Distance Methods

In order to determine whether an observation is anomalous or not, it must be compared to a set of other observations (anomalies are relative, not absolute). In the **distance-based context**, one natural way to compare observations is to consider their distance from one another, with increasing distance from the others being increasingly suggestive of anomalous status.

This approach works both in continuous and discrete cases, as long as a **distance function** or a **pre-computed table of pair-wise distances** between observations is given.

The choice of which sets of observations to use in this comparison distinguishes the different distance-based algorithms.

#### Notation

Let \(D \subset \mathbb{R}^n\) be an \(n\)-dimensional dataset, \(\mathbf{p},\mathbf{q}\in D\), \(P \subset D\) be a subset of \(D\). Assume that \(d: D \times D \to \mathbb{R}\) gives the distance between \(\mathbf{p}\) and \(\mathbf{q}\), written \(d(\mathbf{p},\mathbf{q})\).

An anomaly detection algorithm provides a function \(a : D \to \mathbb{R}\) that describes how anomalous a given observation is. This induces an ordering on the observations of \(D\): if \(a(\mathbf{p}) < a(\mathbf{q})\) for \(\mathbf{p},\mathbf{q} \in D\), then \(\mathbf{p}\) is **less anomalous** than \(\mathbf{q}\).

It could be necessary to define a threshold beyond which an observation is considered anomalous; if \(\alpha \in \mathbb{R}\) is such a threshold, then any \(\mathbf{p} \in D\) is **absolutely anomalous** if \(a(\mathbf{p}) > \alpha\).

#### Similarity Measures

A **similarity measure** is a real-valued function that describes the similarity between two objects. A common construction is to define the similarity \(w\) between two observations \(\mathbf{p},\mathbf{q}\) as \[w(\mathbf{p},\mathbf{q})=\frac{1}{1+d(\mathbf{p},\mathbf{q})}, \quad \text{for some distance }d,\] so that \(w\to 1\) as \(d\to 0\), and \(w\to 0\) as \(d\to \infty\).
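A minimal sketch of this construction in R, assuming the Euclidean distance (the function names are illustrative):

```r
# similarity induced by a distance: w = 1/(1 + d)
euclid <- function(p, q) sqrt(sum((p - q)^2))
w      <- function(p, q) 1/(1 + euclid(p, q))

w(c(0, 0), c(0, 0))  # identical observations: similarity 1
w(c(0, 0), c(3, 4))  # d = 5, so w = 1/6
```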

A similarity measure can also be constructed between probability distributions. Let \(X\) and \(Y\) be two \(n\)-dimensional random vectors of (possibly) different distribution with probability mass/density functions (p.m.f./p.d.f.) \(f_X\) and \(f_Y\), respectively. Let \(\Omega\) be their shared domain. For discrete random variables, the **Hellinger distance** is defined by \[\begin{aligned} H(X,Y)&=\left(1- \sum_{\mathbf{z} \in \Omega} \sqrt{f_X(\mathbf{z}) f_Y(\mathbf{z})}\right)^{1/2}\,;\end{aligned}\] for continuous random variables, it is defined by \[\begin{aligned} H(X,Y)= \left(1-\int_{\Omega} \sqrt{f_X(\mathbf{z}) f_Y(\mathbf{z})}\, d\mathbf{z}\right)^{1/2}\,. \end{aligned}\]
If \(f_X=f_Y\) (or \(f_X=f_Y\) almost everywhere in the continuous case, that is, except possibly on a set of measure zero), then \[\sum_{\Omega}\sqrt{f_Xf_Y}=1 \quad\mbox{or} \quad \int_{\Omega}\sqrt{f_Xf_Y}\, d\mathbf{z}=1\] and \(H(X,Y)=0\). The fact that \(H(X,Y)\in [0,1]\) is a consequence of the Cauchy–Schwarz inequality, with \(f_X^*=\sqrt{f_X}\) and \(f_Y^*=\sqrt{f_Y}\): \[\begin{aligned} 0\leq \int_{\Omega}\sqrt{f_Xf_Y}\, d\mathbf{z}&=\int_{\Omega}f_X^*f_Y^*\, d\mathbf{z}\\ &\leq \left(\int_{\Omega}|f_X^*|^2\, d\mathbf{z}\right)^{1/2}\left(\int_{\Omega}|f_Y^*|^2\, d\mathbf{z}\right)^{1/2} \\ &=\left(\int_{\Omega}f_X\, d\mathbf{z}\right)^{1/2}\left(\int_{\Omega}f_Y\, d\mathbf{z}\right)^{1/2}=1;\end{aligned}\] (a similar argument holds for discrete random variables).
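For discrete distributions over a common finite support, the Hellinger distance can be computed directly from the p.m.f. vectors; a minimal sketch in R (the function name is illustrative):

```r
# Hellinger distance between two p.m.f.s given on the same finite support
hellinger <- function(f_X, f_Y) sqrt(1 - sum(sqrt(f_X * f_Y)))

f_X <- c(0.5, 0.25, 0.25)    # p.m.f. over a 3-point support
hellinger(f_X, f_X)          # identical distributions: 0
hellinger(c(1, 0), c(0, 1))  # disjoint supports: maximal distance 1
```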

Recall that the covariance matrices \(\Sigma_X\) and \(\Sigma_Y\) are \(n \times n\)-matrices whose \((i,j)\)-th entries are the covariance between the \(i\)-th and \(j\)-th positions of \(X\) and \(Y\), respectively. Given a collection of identically distributed samples, these covariance matrices can be estimated.

We can also consider a single observation \(\mathbf{p}\) to represent a probability distribution. In that case, the Hellinger distance between that observation and any other distribution with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\) can be studied using the framework above, via the **Mahalanobis distance**: \[M(\mathbf{p})=\sqrt{(\mathbf{p} - \boldsymbol{\mu})^{\!\top} \boldsymbol{\Sigma}^{-1} (\mathbf{p} - \boldsymbol{\mu})}.\]

Alternatively, if \(\mathbf{p}\) and \(\mathbf{q}\) are drawn from the same distribution with covariance \(\boldsymbol{\Sigma}\), then the Mahalanobis distance is a dissimilarity measure between \(\mathbf{p}\) and \(\mathbf{q}\): \[d_M(\mathbf{p},\mathbf{q})=\sqrt{(\mathbf{p} - \mathbf{q})^{\!\top} \boldsymbol{\Sigma}^{-1} (\mathbf{p} - \mathbf{q})}.\]
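Base R provides `stats::mahalanobis()`, which returns the *squared* Mahalanobis distance; a small sketch with an assumed mean and covariance:

```r
# stats::mahalanobis() returns the squared Mahalanobis distance
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0, 0, 4), nrow = 2)  # assumed diagonal covariance

p <- c(1, 2)
M <- sqrt(mahalanobis(p, center = mu, cov = Sigma))
M  # sqrt(1^2/1 + 2^2/4) = sqrt(2)

# with the identity covariance, we recover the Euclidean distance
sqrt(mahalanobis(c(3, 4), center = mu, cov = diag(2)))  # 5
```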

#### Example

In general, we do not know the true mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\) from which the data could arise, and the mean vector and the covariance structure must be estimated from the data. In the example of Section 16.1.3, the estimated covariance matrix is:

```
x1 x2 x3 x4
x1 0.8999038 0.5690744 0.6646093 0.5033570
x2 0.5690744 1.3124241 1.0685066 0.4694309
x3 0.6646093 1.0685066 0.9921537 0.3969461
x4 0.5033570 0.4694309 0.3969461 0.9043493
```

These are distinct from the true underlying parameters \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\), but the discrepancies are small enough to be explained by **sampling variation** and by the fact that \(\mathbf{z}_1,\mathbf{z}_4\not\sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\).

We first attempt to identify the anomalous observations by computing the Mahalanobis distance from the empirical distribution to all observations in the dataset.

```
Sigma.inv = matlib::inv(cov(rdata[,1:4]))
M_d<-vector()
for(j in 1:nrow(rdata)){
M_d[j] <- sqrt(as.matrix(rdata[j,1:4]-mu.1) %*%
Sigma.inv %*% t(as.matrix(rdata[j,1:4]-mu.1)))
}
rdata <- data.frame(rdata,M_d)
summary(M_d)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4622 1.3479 1.6764 1.7980 2.1010 6.6393
```

The summary certainly suggests that there is (at least) one observation for which the Mahalanobis distance to the empirical distribution is quite high.

```
library(dplyr)
rdata |> ggplot2::ggplot(ggplot2::aes(x=M_d)) +
ggplot2::geom_histogram(colour="black",binwidth = 0.5) +
ggplot2::geom_rug() +
ggplot2::theme_bw()
```

```
rdata |> ggplot2::ggplot(ggplot2::aes(x=M_d)) +
ggplot2::geom_boxplot() + ggplot2::geom_rug(color="black")
```

The histogram of Mahalanobis distances shows that most observations are fairly “regular”, but that two of the observations have substantially larger distances. The boxplot confirms it, but identifies a potential third outlying observation.

Below, we display the scatter plot matrix of the 102 observations, with colour given by the Mahalanobis distance of each observation from the empirical distribution (the code is omitted).

It certainly seems as though the two anomalies could be \(\mathbf{z}_1\) and \(\mathbf{z}_4\).

Next, we plot the Mahalanobis distance from each observation to every other observation.

```
M_pq<-matrix(nrow=nrow(rdata), ncol=nrow(rdata))
for(j in 1:nrow(rdata)){
for(i in 1:nrow(rdata)){
M_pq[j,i]<-sqrt(as.matrix(rdata[j,1:4]-rdata[i,1:4]) %*%
Sigma.inv %*% t(as.matrix(rdata[j,1:4]-rdata[i,1:4])))
}
}
M_pq<-as.data.frame.table(M_pq)
M_pq[,1:2]<-lapply(M_pq[,1:2],as.numeric)
M_pq |> ggplot2::ggplot(ggplot2::aes(x=Var1,y=Freq)) +
ggplot2::geom_point(ggplot2::aes(fill=Freq,colour=Freq),pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#FFFFFF") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#FFFFFF") +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="Distance") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none")
```

Note the differing patterns for observations 101 and 102, as well as the diffuse cloud of points above the distance \(5.0\) for the other observations. There are a few other observations for which the distances to other observations seem to be larger than in a majority of the cases.

Next, we display the same distributions with the help of boxplots.

```
median.value <- M_pq |>
group_by(Var1) |>
summarise(meanDist=mean(Freq)) |>
summarise(median_value=median(meanDist))
test <- M_pq |>
group_by(Var1) |>
summarise(meanDist=mean(Freq)) |>
summarise(std=sd(meanDist))
med.sd = test+median.value
M_pq |> ggplot2::ggplot(ggplot2::aes(x=as.factor(Var1),y=Freq)) +
ggplot2::geom_boxplot() +
ggplot2::scale_x_discrete(name="Observations") +
ggplot2::scale_y_continuous(name="Distance") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") +
ggplot2::geom_hline(yintercept=as.numeric(median.value), linetype = "dashed", color = "red") +
ggplot2::geom_hline(yintercept=as.numeric(med.sd), linetype = "dotted", color = "red") +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle=90))
```

The dashed red line represents the median of the mean distances per observation; the dotted red line lies one standard deviation above the median.

To simplify the reading of the situation, we plot only the mean distance per observation, linking the **colour intensity** and the **marker size** to the mean distance (blue corresponding to larger distances, as do larger markers).

```
M_pq |> group_by(Var1) |> summarise(meanDist=mean(Freq)) |>
ggplot2::ggplot(ggplot2::aes(x=Var1,y=meanDist)) +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="Mean Distance") +
ggplot2::geom_point(ggplot2::aes(fill=meanDist,colour=meanDist,size=meanDist),pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") +
ggplot2::geom_hline(yintercept=as.numeric(median.value), linetype = "dashed", color = "red") +
ggplot2::geom_hline(yintercept=as.numeric(med.sd), linetype = "dotted", color = "red")
```

Do any other observations strike you as possible outliers?

#### Similarity Measures (Reprise)

If \(\boldsymbol{\Sigma}\) is diagonal, then \[d_M(\mathbf{p},\mathbf{q})=\sqrt{\sum_{i=1}^n \frac{(p_i - q_i)^2}{\sigma_i^2}},\] where \(\sigma_i^2\) is the variance along the \(i\)-th dimension. If \(\boldsymbol{\Sigma}\) is the identity matrix, then we recover the **Euclidean distance** \[d_2(\mathbf{p},\mathbf{q})=\sqrt{\sum_{i=1}^n (p_i - q_i)^2}.\]

When using the Euclidean distance in an anomaly detection context, a **linear normalization** is usually applied to each dimension so that each entry lies in the hypercube \([-1,1]^n\). The **Minkowski distance** of order \(p\) is a generalization of the Euclidean distance: \[d_p(\mathbf{p},\mathbf{q})=\left( \sum_{i=1}^n |p_i - q_i|^p \right)^{1/p}.\]

For \(p=2\) we recover the Euclidean distance \(d_2\), for \(p=1\) the **Manhattan distance** \[d_1(\mathbf{p},\mathbf{q})=\sum_{i=1}^n|p_i-q_i|,\] and for \(p=\infty\) the **supremum distance** (also called the **Chebychev distance**) \[d_{\infty}(\mathbf{p},\mathbf{q})=\max_{i=1}^n |p_i - q_i|.\] The Minkowski distance \(d_p\) is only actually a distance function (i.e., a **metric**) when \(p \geq 1\), but an exception is made for \[d_{-\infty}(\mathbf{p},\mathbf{q})=\min_{i=1}^n |p_i - q_i|\] to fall within the same framework.
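In R, all of these distances are available through `stats::dist()`; a quick check on a single pair of observations:

```r
# the Minkowski family via stats::dist(), for one pair of observations
x <- rbind(p = c(0, 0, 0), q = c(1, 2, 2))

dist(x, method = "manhattan")         # d_1 = 1 + 2 + 2 = 5
dist(x, method = "euclidean")         # d_2 = sqrt(1 + 4 + 4) = 3
dist(x, method = "minkowski", p = 4)  # d_4 = (1 + 16 + 16)^(1/4)
dist(x, method = "maximum")           # d_inf = max(1, 2, 2) = 2
```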

The **Jaccard similarity** of two datasets \(P\) and \(Q\) is defined as the size of their intersection divided by the size of their union: \[J(P,Q)
= \frac{|P \cap Q|}{|P \cup Q|} = \frac{|P \cap Q|}{|P| + |Q| - |P \cap Q|}.\] Their **Jaccard distance** is then taken to be \(1 - J(P,Q)\).

This definition can be extended to compare binary vectors (i.e., vectors with entries in \(\{0,1\}\)) of the same length. Given two binary vectors \(\mathbf{p}\) and \(\mathbf{q}\) of length \(n\), consider an arbitrary set \(S\) of size \(n\). Then \(\mathbf{p}\) and \(\mathbf{q}\) can be viewed as subsets of \(S\): if \(p_i=1\), then \(\mathbf{p}\) is said to contain the \(i\)-th element of \(S\), while if \(p_i=0\), then it does not. Viewing \(\mathbf{p}\) and \(\mathbf{q}\) in this way allows us to compute their Jaccard similarity, and thus their Jaccard distance.
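A minimal sketch of the binary-vector version in R (the function name `jaccard` is illustrative):

```r
# Jaccard similarity and distance for binary vectors of the same length
jaccard <- function(p, q) sum(p & q) / sum(p | q)

p <- c(1, 0, 1, 1)
q <- c(1, 1, 0, 1)
jaccard(p, q)      # |intersection| = 2, |union| = 4, so J = 0.5
1 - jaccard(p, q)  # Jaccard distance: 0.5
```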

Finally, let \(\mathbf{p},\mathbf{q}\neq \mathbf{0}\). Recall that \(\mathbf{p} \cdot \mathbf{q} = \lVert \mathbf{p} \rVert \lVert \mathbf{q} \rVert \cos\theta\), where \(\theta\) is the angle between \(\mathbf{p}\) and \(\mathbf{q}\). The **cosine similarity** between \(\mathbf{p}\) and \(\mathbf{q}\) is the cosine of \(\theta\), which can be computed as \[\cos\theta = \frac{\mathbf{p} \cdot \mathbf{q}}{\lVert \mathbf{p} \rVert \lVert \mathbf{q} \rVert} = \frac{\sum_{i=1}^n p_i q_i}{\sqrt{\sum_{i=1}^n p_i^2} \sqrt{\sum_{i=1}^n q_i^2}}.\]

This value ranges between \(1\) and \(-1\), with 1 attained when \(\mathbf{p}=\mathbf{q}\), \(-1\) when \(\mathbf{p}=-\mathbf{q}\), and \(0\) when \(\mathbf{p}\) and \(\mathbf{q}\) are perpendicular. Armed with these concepts, we can now explore distance-based methods for anomaly detection; they will also eventually be useful for density-based anomaly detection.
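The cosine similarity can be sketched directly from the formula above (the function name is illustrative):

```r
# cosine similarity between two non-zero vectors
cos_sim <- function(p, q) sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))

cos_sim(c(1, 0), c(0, 1))    # perpendicular: 0
cos_sim(c(1, 2), c(2, 4))    # same direction: 1
cos_sim(c(1, 2), c(-1, -2))  # opposite direction: -1
```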

#### 16.2.1.1 Distance-Based Anomaly Detection

All these distance functions (similarity measures) can be used to create basic anomaly detection algorithms (the ideas can also be extended to more complex algorithms).

Given some distance function \(d\), dataset \(D\), and integers \(k,\nu\leq |D|\), the **distance to all points** (DTAP) anomaly detection algorithm considers each observation \(\mathbf{p}\) in \(D\) and adds the distance from \(\mathbf{p}\) to every other observation in \(D\), i.e. \[a(\mathbf{p}) = \sum_{\mathbf{q}\neq \mathbf{p} \in D} d(\mathbf{q}, \mathbf{p}).\]

The \(\nu\) observations with largest values for \(a\) are then said to be **anomalous according to** \(a\). This approach often selects the most extreme observations as anomalous, which may be of limited use in practice.

The **distance to nearest neighbour** (DTNN) algorithm defines \[a(\mathbf{p}) = \min_{\mathbf{q}\neq \mathbf{p} \in D} d(\mathbf{q}, \mathbf{p}),\] with a similar definition for the \(\nu\) anomalous observations. The **average distance to \(k\) nearest neighbours** and **median distance to \(k\) nearest neighbours** are defined similarly.
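Given a pre-computed matrix of pairwise distances, the DTNN score and the average distance to the \(k\) nearest neighbours can be sketched as follows (the function names and the planted outlier are illustrative):

```r
# anomaly scores from a pairwise distance matrix d
a_dtnn <- function(d) apply(d + diag(Inf, nrow(d)), 1, min)  # Inf diagonal excludes self

a_avg_knn <- function(d, k) {
  diag(d) <- Inf  # exclude self-distances
  apply(d, 1, function(row) mean(sort(row)[1:k]))
}

set.seed(0)
X <- rbind(matrix(rnorm(40), ncol = 2), c(10, 10))  # 20 regular points + 1 planted outlier
d <- as.matrix(dist(X))
which.max(a_dtnn(d))        # the planted outlier (observation 21)
which.max(a_avg_knn(d, 3))  # also observation 21
```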

#### Example - Distance to All Points

We start by building the DTAP anomaly detector for the Euclidean distance (`method="euclidean"`) on the **scaled** artificial data, which is shown below.

```
rdata.scaled=data.frame(matrix(ncol = 4, nrow = nobs+2))
for(i in 1:4){
rdata.scaled[,i] <- 2*(rdata[,i] - min(rdata[,i]))/(max(rdata[,i]) - min(rdata[,i])) - 1
}
lattice::splom(rdata.scaled[,1:4], pch=22)
```

The top \(\nu=6\) anomalous observations are obtained by the following chunk of code.

```
m.L2 <- as.matrix(dist(rdata.scaled[,1:4], method="euclidean"))
adoa.L2 <- data.frame(1:(nobs+2),rowSums(m.L2))
colnames(adoa.L2) <- c("obs","dist")
adoa.L2 <- adoa.L2[order(-adoa.L2$dist),]
rownames(adoa.L2) <- NULL
knitr::kable(head(adoa.L2))
```

obs | dist |
---|---|
102 | 241.7556 |
62 | 177.6135 |
14 | 113.9903 |
49 | 108.6464 |
55 | 106.5156 |
67 | 104.7870 |

The accompanying plot is:

```
adoa.L2 |>
ggplot2::ggplot(ggplot2::aes(x=obs,y=dist)) +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="Sum of Euclidean Distances") +
ggplot2::geom_point(ggplot2::aes(fill=dist,
colour=dist,
size=dist),
pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none")
```

The top \(\nu=6\) anomalies are obtained for a variety of metrics (we do not display the code that is used for these other metrics).

**Chebychev** (replace `method="euclidean"` by `method="maximum"`):

obs | dist |
---|---|
102 | 144.85387 |
62 | 114.83273 |
14 | 88.25016 |
55 | 81.47169 |
49 | 80.33274 |
101 | 73.11895 |

**Manhattan** (replace `method="euclidean"` by `method="manhattan"`):

obs | dist |
---|---|
102 | 473.7469 |
62 | 342.2531 |
14 | 197.7493 |
67 | 193.9257 |
49 | 191.7707 |
55 | 183.6509 |

**Minkowski**, \(p=1/2\) (replace `method="euclidean"` by `method="minkowski", p=0.5`):

obs | dist |
---|---|
102 | 1873.7488 |
62 | 1341.9020 |
67 | 731.9811 |
14 | 717.3451 |
49 | 700.5557 |
11 | 669.4313 |

**Minkowski**, \(p=4\) (replace `method="euclidean"` by `method="minkowski", p=4`):

obs | dist |
---|---|
102 | 3738.781 |
62 | 2672.940 |
67 | 1444.177 |
14 | 1403.799 |
49 | 1372.067 |
11 | 1318.830 |

We see that while observation 102 is always the most anomalous according to DTAP, the ranking is affected by the choice of distance metric. Is this surprising?

#### Example - Distance to Nearest Neighbour

We next build the DTNN anomaly detector for the Euclidean distance, again on the **scaled** artificial data.

The top \(\nu=6\) anomalous observations are obtained by the following chunk of code.

```
m.L2 <- m.L2 + 1000000*diag(nobs+2) # inflate the diagonal: no observation is its own NN
adoa.L2 <- data.frame(1:(nobs+2),apply(m.L2,1,min))
colnames(adoa.L2) <- c("obs","dist")
adoa.L2 <- adoa.L2[order(-adoa.L2$dist),]
rownames(adoa.L2) <- NULL
knitr::kable(head(adoa.L2))
```

obs | dist |
---|---|
102 | 1.4549129 |
62 | 0.8667313 |
101 | 0.3746565 |
12 | 0.3611265 |
11 | 0.3186967 |
43 | 0.3145305 |

The accompanying plot is:

The top \(\nu=6\) anomalies are obtained for a variety of metrics (we do not display the code that is used for these other metrics).

**Chebychev**:

obs | dist |
---|---|
102 | 0.8269178 |
62 | 0.5203870 |
12 | 0.2449443 |
101 | 0.2368307 |
68 | 0.2356369 |
43 | 0.2214142 |

**Manhattan**:

obs | dist |
---|---|
102 | 2.8914743 |
62 | 1.6745725 |
12 | 0.6499116 |
101 | 0.6082916 |
11 | 0.5702036 |
23 | 0.4903152 |

**Minkowski**, \(p=1/2\):

obs | dist |
---|---|
102 | 10.534441 |
62 | 6.559370 |
101 | 1.857222 |
11 | 1.704864 |
23 | 1.618388 |
12 | 1.591956 |

**Minkowski**, \(p=4\):

obs | dist |
---|---|
102 | 20.605432 |
62 | 13.059856 |
101 | 3.521821 |
11 | 3.081000 |
23 | 2.979090 |
12 | 2.809583 |

In both examples, there are commonalities (certain observations come back repeatedly as likely anomalous observations), but the anomaly rankings change according to the **selected distance function** and the **choice of algorithm** (the choice of data scaling approach would also play a role). This is par for the course in the anomaly detection context.

### 16.2.2 Density Methods

Density-based approaches view observations as anomalous if they occur in **low-density regions**.

Density-based methods include:

- **local outlier factors**;
- **DBSCAN**, and
- **isolation forests**.

#### Local Outlier Factor

The **Local Outlier Factor** (LOF) algorithm was proposed in 2000 by [336] (a summary can be found in Section 6.4.2 of [319]). LOF works by measuring the local deviation of each observation in a dataset from its \(k\) nearest neighbours, with a point said to be anomalous if this **deviation is large**.

A **local \(k\)-neighbourhood** \(N_k(\mathbf{p})\) around an observation \(\mathbf{p}\) is defined as the set of its \(k\) nearest neighbours. The density of observations in each local \(k\)-neighbourhood is estimated (the **local density**), and compared to the local densities of the \(k\)-neighbourhoods of the observations within \(N_k(\mathbf{p})\).

This can then be used to identify outliers that inhabit regions of lower density than their neighbours, as \(\mathbf{p}\) would be in Figure 16.15.

The formal procedure is as follows:

Any observation with an LOF \(a_k(\mathbf{p})\) above some threshold \(\tau\) is a **local outlier**, but selecting \(\tau\) is not obvious. LOF introduces the idea of a **reachability distance**, which improves the stability of results within clusters/regions: within \(N_k(\mathbf{p})\), it is \[d_{\text{reach}}(\mathbf{p},\mathbf{q})=\max_{\ell}\{d(\mathbf{p},\mathbf{q}_{\ell}); \mathbf{q}_{\ell}\in N_k(\mathbf{p})\},\] the maximal distance to its \(k\)-neighbours; outside of \(N_k(\mathbf{p})\), it is \[d_{\text{reach}}(\mathbf{p},\mathbf{q})=d(\mathbf{p},\mathbf{q}),\] the actual distance.

In Figure 16.16, assuming \(k=3\), the observations \(\mathbf{q}_1, \mathbf{q}_2, \mathbf{q}_3\) all have the same reachability distance from \(\mathbf{p}\) as they are all \(3\)-neighbours of \(\mathbf{p}\), that is, \[d_{\text{reach}}(\mathbf{p}, \mathbf{q}_1) = d_{\text{reach}}(\mathbf{p}, \mathbf{q}_2) = d_{\text{reach}}(\mathbf{p}, \mathbf{q}_3) = d(\mathbf{p}, \mathbf{q}_3).\] The observation \(\mathbf{q}_4\), on the other hand, has \(d_{\text{reach}}(\mathbf{p}, \mathbf{q}_4) = d(\mathbf{p}, \mathbf{q}_4)\) as it is not a \(k\)-neighbour of \(\mathbf{p}\).
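Following this definition, the reachability distance can be sketched from a pairwise distance matrix (a toy illustration; the function name is ours, not the one used by LOF implementations):

```r
# reachability distance of q from p, following the definition above
d_reach <- function(d, p, q, k) {
  others <- setdiff(seq_len(nrow(d)), p)
  N_k    <- others[order(d[p, others])][1:k]   # k nearest neighbours of p
  if (q %in% N_k) max(d[p, N_k]) else d[p, q]  # capped inside N_k(p), actual outside
}

X <- cbind(c(0, 1, 2, 3, 10), 0)  # five observations on a line
d <- as.matrix(dist(X))
d_reach(d, p = 1, q = 2, k = 3)   # q in N_3(p): max distance to the 3-NN, i.e. 3
d_reach(d, p = 1, q = 5, k = 3)   # q outside N_3(p): the actual distance, i.e. 10
```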

#### Example

LOF is implemented in `R` via the `Rlof` package; we apply it to the scaled artificial data using the Euclidean and the Chebychev distances.

```
dist.L2 = dist(rdata.scaled[,1:4], method="euclidean")
lof <- Rlof::lof(dist.L2, k=4)
rdata.lof.L2 = data.frame(rdata.scaled[,1:4],lof)
rdata.lof.obs.L2 = data.frame(1:(nobs+2),lof)
names(rdata.lof.obs.L2) = c("obs",'lof')
rdata.lof.obs.L2 <- rdata.lof.obs.L2[order(-rdata.lof.obs.L2$lof),]
rownames(rdata.lof.obs.L2) <- NULL
knitr::kable(head(rdata.lof.obs.L2))
```

obs | lof |
---|---|
102 | 5.377542 |
62 | 5.100218 |
12 | 1.609463 |
55 | 1.481152 |
27 | 1.475414 |
14 | 1.402112 |

```
dist.sup = dist(rdata.scaled[,1:4], method="maximum")
lof <- Rlof::lof(dist.sup, k=4)
rdata.lof.sup = data.frame(rdata.scaled[,1:4],lof)
rdata.lof.obs.sup = data.frame(1:(nobs+2),lof)
names(rdata.lof.obs.sup) = c("obs",'lof')
rdata.lof.obs.sup <- rdata.lof.obs.sup[order(-rdata.lof.obs.sup$lof),]
rownames(rdata.lof.obs.sup) <- NULL
knitr::kable(head(rdata.lof.obs.sup))
```

obs | lof |
---|---|
102 | 2.984990 |
62 | 2.917587 |
14 | 1.636228 |
27 | 1.542470 |
12 | 1.466011 |
79 | 1.342456 |

#### DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) was proposed in 1996 by [334] (a summary can be found in Section 4.1.5 of [319], as well as in Section 14.4.1). As its name suggests, it is a density-based clustering algorithm that groups nearby observations together and labels observations that do not fall in the clusters as **anomalies**.

Hierarchical DBSCAN (HDBSCAN) [335] was introduced in 2013. It notably removes the problem of choosing the parameter for the radius of a neighbourhood by considering all “possible” radii. Further documentation can be found at [345].

In DBSCAN,

- an observation \(\mathbf{p}\) is a **core point** if there is a minimum of \(m\) observations within distance \(r\) of \(\mathbf{p}\);
- an observation \(\mathbf{q}\) is a **border point** (or non-core point) if it is not itself a core point but is within distance \(r\) of one, and
- an observation \(\mathbf{o}\) is an **outlier** if it is neither a core point nor a border point.

DBSCAN considers each observation in the dataset individually. If that observation is an outlier, then it is added to a list of outliers. Otherwise if it is a core point, then its \(r\)-neighbourhood forms the beginning of a new cluster. Each observation in this \(r\)-neighbourhood is then considered in turn, with the \(r\)-neighbourhoods of other core observations contained in the neighbourhood being **added to the cluster**.

This expansion repeats until all observations have been examined. During this step, observations that were previously labelled as outliers may be updated as they become border points in the new cluster. This process continues until every observation has either been assigned to a cluster or labelled as an outlier.
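The core/border/outlier classification described above can be sketched from a pairwise distance matrix (a toy illustration; in this sketch a point's neighbour count excludes the point itself, an assumption):

```r
# classify observations as core, border, or outlier (DBSCAN labels)
dbscan_labels <- function(d, r, m) {
  n_nbrs <- rowSums(d <= r) - 1  # neighbours within r, excluding the point itself
  core   <- n_nbrs >= m
  # border: not core, but within r of at least one core point
  border <- !core & apply(d[, core, drop = FALSE] <= r, 1, any)
  ifelse(core, "core", ifelse(border, "border", "outlier"))
}

X <- cbind(c(0, 0.5, 1, 1.5, 5), 0)  # four close points on a line + one far away
d <- as.matrix(dist(X))
dbscan_labels(d, r = 0.6, m = 2)
# the two middle points are core, their flanks border, and (5, 0) an outlier
```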

The formal procedure is as follows:

While DBSCAN’s dual use as a **clustering algorithm** may seem irrelevant in the outlier detection setting, it is its ability to successfully identify clusters that is crucial for labeling the remaining observations as outliers.

**DBSCAN/HDBSCAN Strengths:**

- the number of clusters does not need to be known beforehand (unlike in \(k\)-means and other clustering algorithms);
- clusters of arbitrary shape can be detected, and
- when using HDBSCAN, only the parameter for the **minimum cluster size** \(m\) is required, which can be set fairly intuitively.^{327}

**DBSCAN/HDBSCAN Limitations:**

- it is not deterministic, as border points can be assigned to different clusters depending on the order in which core observations are considered (this does not affect its use as an anomaly detection algorithm, however);
- in high-dimensional spaces, the ability of any distance function based on the Euclidean distance to distinguish near and distant observations diminishes due to the **curse of dimensionality**; DBSCAN thus becomes ineffective in such spaces (as do other clustering algorithms), and
- it cannot handle differences in local densities, as the radius \(r\) of a neighbourhood is fixed; this could lead to sparser clusters being labelled as outliers, or to outliers surrounding a denser cluster being included in that cluster (this issue is overcome in HDBSCAN, however).

#### Example

We use the `R` implementation of DBSCAN, HDBSCAN, and OPTICS (another density-based clustering algorithm) found in the `dbscan` package; we apply various parameters to the scaled artificial data, using the Euclidean distance in all instances.

DBSCAN, `eps = 0.4`, `minPts = 4`:

```
db.1 <- dbscan::dbscan(rdata.scaled[,1:4], eps = .4, minPts = 4)
db.1
lattice::splom(rdata.scaled[,1:4], groups=db.1$cluster + 1L, pch=22)
```

```
DBSCAN clustering for 102 objects.
Parameters: eps = 0.4, minPts = 4
The clustering contains 1 cluster(s) and 96 noise points.

 0  1 
96  6 

Available fields: cluster, eps, minPts
```

Evidently, 0.4 is too small a value for `eps` or 4 is too large a value for `minPts` (or both).

DBSCAN, `eps = 1`, `minPts = 4`:

```
DBSCAN clustering for 102 objects.
Parameters: eps = 1, minPts = 4
The clustering contains 1 cluster(s) and 6 noise points.

 0  1 
 6 96 

Available fields: cluster, eps, minPts
```

The results are reversed with a larger value of `eps`.

DBSCAN, `eps = 1`, `minPts = 10`:

```
DBSCAN clustering for 102 objects.
Parameters: eps = 1, minPts = 10
The clustering contains 1 cluster(s) and 24 noise points.

 0  1 
24 78 

Available fields: cluster, eps, minPts
```

Are the clustering results (and thus the anomaly discovery rate) as expected? The interaction between the parameters can have unpredictable effects.

DBSCAN, `eps = 2`, `minPts = 10`:

```
DBSCAN clustering for 102 objects.
Parameters: eps = 2, minPts = 10
The clustering contains 1 cluster(s) and 2 noise points.

  0   1 
  2 100 

Available fields: cluster, eps, minPts
```

HDBSCAN, `minPts = 4`:

```
hdb <- dbscan::hdbscan(rdata.scaled[,1:4], minPts = 4)
hdb
lattice::splom(rdata.scaled[,1:4], groups=hdb$cluster + 1L, pch=22)
```

```
HDBSCAN clustering for 102 objects.
Parameters: minPts = 4
The clustering contains 4 cluster(s) and 71 noise points.

 0  1  2  3  4 
71 10  6  4 11 

Available fields: cluster, minPts, coredist, cluster_scores,
                  membership_prob, outlier_scores, hc
```

Notice the absence of the `eps` parameter.

OPTICS, `eps = 1`, `minPts = 4`, `eps_cl = 1`, `xi = .05` (read the documentation for details):

```
opt <- dbscan::optics(rdata.scaled[,1:4], eps = 1, minPts = 4)
opt.1 <- dbscan::extractDBSCAN(opt, eps_cl = 1)
opt.1
lattice::splom(rdata[,1:4], groups=opt.1$cluster + 1L, pch=22)
```

```
OPTICS ordering/clustering for 102 objects.
Parameters: minPts = 4, eps = 1, eps_cl = 1, xi = NA
The clustering contains 1 cluster(s) and 6 noise points.

 0  1 
 6 96 

Available fields: order, reachdist, coredist, predecessor,
                  minPts, eps, eps_cl, xi, cluster
```

```
opt.2 <- dbscan::extractXi(opt, xi = .05)
opt.2
lattice::splom(rdata.scaled[,1:4], groups=opt.2$cluster + 1L, pch=22)
```

```
OPTICS ordering/clustering for 102 objects.
Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = 0.05
The clustering contains 4 cluster(s) and 7 noise points.

Available fields: order, reachdist, coredist, predecessor,
                  minPts, eps, eps_cl, xi, clusters_xi, cluster
```

Are there any surprises?

#### Isolation Forest

The approaches that were previously discussed first construct models of what normal observations look like, and then identify observations that do not fit this model. The **Isolation Forest** algorithm [332] introduced in 2008 instead tries to explicitly identify outliers under the assumptions that **there are few outliers** in the data and that these outliers have **very different attributes** compared to normal (or regular) observations.

This allows the use of sampling techniques that increase algorithmic speed while decreasing memory requirements.

The Isolation Forest algorithm tries to **isolate** anomalous observations. It does so by **randomly** selecting an attribute and then **randomly** selecting a split value between that attribute’s min and max values. This recursively partitions the observations until every observation is isolated in its own partition component.

Recursive partitioning yields a binary tree called an **Isolation Tree** (IsoTree):

- the **root** of this tree is the entire dataset;
- each **node** is a subset of the observations;
- each **branch** corresponds to one of the generated partitions, and
- the **leaves** are singleton sets containing a single isolated observation.

Each observation is then assigned a score derived from **how deep in the tree** its singleton partition appears. As observations that are shallower in the tree were easier to separate from the rest, these are the likely **outliers**.

Since only shallow observations are of interest, once the height of the tree has reached a **given threshold** (the expected height of a random binary tree, say), further construction of the tree can be stopped to decrease computational cost.

Additionally, instead of building a tree from the entire dataset, a tree can be constructed from a subset. The location of any observation within this smaller tree can then be estimated, again saving computational and memory resources. These two improvements are detailed in the original paper [332].

Once a number of Isolation Trees have been randomly generated (an **Isolation Forest**), a score can be computed for each point. This is done by searching each tree for the location of a given point and noting the path length required to reach it. Once an observation’s path length in each tree has been computed, the average path length is taken to be its score. In Isolation Forests, low scores (short average path lengths) are indicative of outlying behaviour.

The formal procedure is as follows:

It can be desirable to construct a **normalized** anomaly score that is independent of the size of the dataset. In order to do this, the expected path length of a random observation in an Isolation Tree (i.e. binary tree) must be estimated. With \(|D|=n\), it can be shown that the expected length is \[c(n) = 2 H(n-1) - \frac{2(n-1)}{n},\] where \(H(n-1)\) is the \((n-1)^{\text{th}}\) **harmonic number**, which can be approximated by \(\ln(n-1) + 0.577\); \(c(n)\) is then used to normalize the final anomaly score \(a(\mathbf{p})\) for \(\mathbf{p} \in D\), which is given by \[\log_2 a(\mathbf{p}) = -\frac{\text{average path length to $\mathbf{p}$ in the Isolation Trees}}{c(n)}.\]
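A sketch of this normalization in R, with the harmonic number computed exactly rather than approximated (the function names are illustrative):

```r
# expected path length c(n) in a random binary tree, and normalized score
c_n <- function(n) 2*sum(1/(1:(n - 1))) - 2*(n - 1)/n  # uses the exact H(n-1)

a_score <- function(avg_path, n) 2^(-avg_path / c_n(n))

c_n(2)                                 # 2*H(1) - 2*(1/2) = 1
a_score(avg_path = c_n(100), n = 100)  # path length at the expectation: score 0.5
a_score(avg_path = 1, n = 100)         # very short path: score close to 1
```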

Thus defined, \(a(\mathbf{p})\in [0,1]\), with \(a(\mathbf{p}) \approx 1\) suggesting \(\mathbf{p}\) is an **anomaly** and \(a(\mathbf{p})\) substantially smaller than \(0.5\) suggesting \(\mathbf{p}\) is a normal observation; if all observations receive a score around \(0.5\), this suggests that there are no anomalies present.

**IsoForest Strengths:**

- small time and memory requirements;
- can handle high-dimensional data, and
- does not require observations to have been labeled as anomalies in the training set.

**IsoForest Main Limitation:**

- the anomaly score assigned to a given point can have high variance over multiple runs of the algorithm. The authors of [333] propose some solutions.

#### Example

We use the `R`

implementation of IsoForest found in the `solitude`

package; we apply various parameters to the scaled artificial data, using the Euclidean distance in all instances. Note that this implementation uses a different scoring system, in which high scores are indicative of anomalous observations.

We initiate an isolation forest:

```
iso = solitude::isolationForest$new(sample_size = length(index))
iso$fit(dataset = rbind(rdata.scaled[index,1:4],c(0,0,0,0)))
test <- iso$predict(rdata.scaled[index,1:4]) # scores for training data
```

```
INFO [10:35:49.974] Building Isolation Forest ...
INFO [10:35:50.765] done
INFO [10:35:50.772] Computing depth of terminal nodes ...
INFO [10:35:51.244] done
INFO [10:35:51.250] Completed growing isolation forest
```

The top \(\nu=6\) IsoForest anomaly scores are given below:

```
rdata.iso = data.frame(1:(nobs+2),test$anomaly_score)
names(rdata.iso) = c("obs","anomaly_score")
rdata.iso <- rdata.iso[order(-rdata.iso$anomaly_score),]
rownames(rdata.iso) = NULL
knitr::kable(head(rdata.iso))
```

obs | anomaly_score |
---|---|
102 | 0.7795207 |
62 | 0.7480289 |
55 | 0.6379525 |
29 | 0.6228751 |
49 | 0.6218485 |
67 | 0.6182687 |

```
rdata.iso |>
ggplot2::ggplot(ggplot2::aes(x=obs,y=anomaly_score)) +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="IsoForest Anomaly Score") +
ggplot2::geom_point(ggplot2::aes(fill=anomaly_score,
colour=anomaly_score,
size=anomaly_score),
pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none")
```

The profile of anomaly scores has a fairly distinct look (although we recognize quite a few of the usual suspects).

In general, density-based schemes are more powerful than distance-based schemes when a dataset contains **patterns with diverse characteristics**, but less effective when the patterns are of **comparable densities with the outliers** [346].

### References

*Anomaly Detection Principles and Algorithms*. Springer, 2017.

*Proceedings of the Eighth IEEE International Conference on Data Mining*, 2008, pp. 413–422.

*IEEE Transactions on Knowledge and Data Engineering*, 2019.

*Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, 1996, pp. 226–231.

*Advances in Knowledge Discovery and Data Mining*, 2013, pp. 160–172.

*SIGMOD Rec.*, vol. 29, no. 2, pp. 93–104, 2000.

*Machine Learning with Go*. Packt Publishing, 2017.

*Knowl. Inf. Syst.*, vol. 11, no. 1, pp. 45–84, Jan. 2007.