## 16.2 Quantitative Approaches

Cluster-based methods are not the only type of unsupervised learning (UL) anomaly detection method; quantitative approaches come in two other flavours: distance-based and density-based.

• Distance-based methods include distance to all observations, distance to $$k$$ nearest neighbours ($$k$$NN), average distance to $$k$$NN, median distance to $$k$$NN, etc.

• Density-based methods include local outlier factor (LOF), isolation forest, HDBSCAN, etc.

### 16.2.1 Distance Methods

In order to determine whether an observation is anomalous or not, it must be compared to a set of other observations (anomalies are relative, not absolute). In the distance-based context, one natural way to compare observations is to consider their distance from one another, with increasing distance from the others being increasingly suggestive of anomalous status.

This approach works both in continuous and discrete cases, as long as a distance function or a pre-computed table of pair-wise distances between observations is given.

The choice of which sets of observations to use in this comparison distinguishes the different distance-based algorithms.

#### Notation

Let $$D \subset \mathbb{R}^n$$ be an $$n$$-dimensional dataset, $$\mathbf{p},\mathbf{q}\in D$$, $$P \subset D$$ be a subset of $$D$$. Assume that $$d: D \times D \to \mathbb{R}$$ gives the distance between $$\mathbf{p}$$ and $$\mathbf{q}$$, written $$d(\mathbf{p},\mathbf{q})$$.

An anomaly detection algorithm provides a function $$a : D \to \mathbb{R}$$ that describes how anomalous a given observation is. This induces an ordering on the observations of $$D$$: if $$a(\mathbf{p}) < a(\mathbf{q})$$ for $$\mathbf{p},\mathbf{q} \in D$$, then $$\mathbf{p}$$ is less anomalous than $$\mathbf{q}$$.

It may be necessary to define a threshold beyond which an observation is considered anomalous; if $$\alpha \in \mathbb{R}$$ is such a threshold, then any $$\mathbf{p} \in D$$ is absolutely anomalous if $$a(\mathbf{p}) > \alpha$$.

#### Similarity Measures

A similarity measure is a real-valued function that describes the similarity between two objects. A common construction is to define the similarity $$w$$ between two observations $$\mathbf{p},\mathbf{q}$$ as $w(\mathbf{p},\mathbf{q})=\frac{1}{1+d(\mathbf{p},\mathbf{q})}, \quad \text{for some distance }d,$ so that $$w\to 1$$ as $$d\to 0$$, and $$w\to 0$$ as $$d\to \infty$$.
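This construction is easy to verify numerically; a minimal sketch in base R, with $$d$$ taken to be the Euclidean distance (the helper name `w_sim` is ours):

```r
# Similarity w = 1/(1 + d), with d the Euclidean distance.
w_sim <- function(p, q) {
  d <- sqrt(sum((p - q)^2))   # Euclidean distance between p and q
  1 / (1 + d)
}

w_sim(c(0, 0), c(0, 0))   # d = 0, so w = 1
w_sim(c(0, 0), c(3, 4))   # d = 5, so w = 1/6
```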

A similarity measure can also be constructed between probability distributions. Let $$X$$ and $$Y$$ be two $$n$$-dimensional random vectors of (possibly) different distribution with probability mass/density functions (p.m.f./p.d.f.) $$f_X$$ and $$f_Y$$, respectively. Let $$\Omega$$ be their shared domain. For discrete random variables, the Hellinger distance is defined by \begin{aligned} H(X,Y)&=\left(1- \sum_{\mathbf{z} \in \Omega} \sqrt{f_X(\mathbf{z}) f_Y(\mathbf{z})}\right)^{1/2}\,;\end{aligned} for continuous random variables, it is defined by \begin{aligned} H(X,Y)= \left(1-\int_{\Omega} \sqrt{f_X(\mathbf{z}) f_Y(\mathbf{z})}\, d\mathbf{z}\right)^{1/2}\,. \end{aligned} If $$f_X=f_Y$$ (or $$f_X=f_Y$$ almost everywhere in the continuous case, that is, except on a set of measure zero), then $\sum_{\Omega}\sqrt{f_Xf_Y}=1 \quad\mbox{or} \quad \int_{\Omega}\sqrt{f_Xf_Y}\, d\mathbf{z}=1$ and $$H(X,Y)=0$$. The fact that $$H(X,Y)\in [0,1]$$ is a consequence of the Cauchy–Schwarz inequality, with $$f_X^*=\sqrt{f_X}$$ and $$f_Y^*=\sqrt{f_Y}$$: \begin{aligned} 0\leq \int_{\Omega}\sqrt{f_Xf_Y}\, d\mathbf{z}&=\int_{\Omega}f_X^*f_Y^*\, d\mathbf{z}\\ &\leq \left(\int_{\Omega}|f_X^*|^2\, d\mathbf{z}\right)^{1/2}\left(\int_{\Omega}|f_Y^*|^2\, d\mathbf{z}\right)^{1/2} \\ &=\left(\int_{\Omega}f_X\, d\mathbf{z}\right)^{1/2}\left(\int_{\Omega}f_Y\, d\mathbf{z}\right)^{1/2}\!\!\!\!=1;\end{aligned} (a similar argument holds for discrete random variables).
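In the discrete case, the Hellinger distance is a one-liner; a base-R sketch, assuming the p.m.f.s are supplied as probability vectors over a shared finite domain (`hellinger` is our name, and the `pmax()` guard only absorbs tiny negative values caused by floating-point rounding):

```r
# Hellinger distance between two discrete p.m.f.s over a shared domain.
hellinger <- function(fX, fY) {
  sqrt(pmax(0, 1 - sum(sqrt(fX * fY))))
}

fX <- c(0.25, 0.25, 0.5)
hellinger(fX, fX)             # identical distributions: 0
hellinger(c(1, 0), c(0, 1))   # disjoint supports: 1
```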

Recall that the covariance matrices $$\Sigma_X$$ and $$\Sigma_Y$$ are $$n \times n$$-matrices whose $$(i,j)$$-th entries are the covariance between the $$i$$-th and $$j$$-th positions of $$X$$ and $$Y$$, respectively. Given a collection of identically distributed samples, these covariance matrices can be estimated.

We can also consider a single observation $$\mathbf{p}$$ to represent a probability distribution. In that case, the Hellinger distance between that observation and any other distribution with mean $$\boldsymbol{\mu}$$ and covariance matrix $$\boldsymbol{\Sigma}$$ can be studied using the framework above, using the Mahalanobis distance: $M(\mathbf{p})=\sqrt{(\mathbf{p} - \boldsymbol{\mu})^{\!\top} \boldsymbol{\Sigma}^{-1} (\mathbf{p} - \boldsymbol{\mu})}.$

Alternatively, if $$\mathbf{p}$$ and $$\mathbf{q}$$ are drawn from the same distribution with covariance $$\boldsymbol{\Sigma}$$, then the Mahalanobis distance is a dissimilarity measure between $$\mathbf{p}$$ and $$\mathbf{q}$$: $d_M(\mathbf{p},\mathbf{q})=\sqrt{(\mathbf{p} - \mathbf{q})^{\!\top} \boldsymbol{\Sigma}^{-1} (\mathbf{p} - \mathbf{q})}.$
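Base R already provides `mahalanobis()`, which returns the *squared* distance; a sketch with a hand-picked covariance matrix (the values here are ours, not the chapter's data):

```r
# Mahalanobis distance via base R's mahalanobis(), which returns M^2.
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)   # strong positive correlation
mu <- c(0, 0)
p <- c(2, 2)

M <- sqrt(mahalanobis(p, center = mu, cov = Sigma))
M   # about 2.11, smaller than the Euclidean sqrt(8) since (2,2) lies
    # along the direction of correlation
```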

#### Example

In general, we do not know the true mean vector $$\boldsymbol{\mu}$$ and covariance matrix $$\boldsymbol{\Sigma}$$ from which the data could arise, and the mean vector and the covariance structure must be estimated from the data. In the example of Section 16.1.3, we have:

mu.1 <- colMeans(rdata[,1:4])
cov(rdata[,1:4])
          x1        x2        x3        x4
x1 0.8999038 0.5690744 0.6646093 0.5033570
x2 0.5690744 1.3124241 1.0685066 0.4694309
x3 0.6646093 1.0685066 0.9921537 0.3969461
x4 0.5033570 0.4694309 0.3969461 0.9043493

These are distinct from the true underlying parameters $$\boldsymbol{\mu}$$ and $$\boldsymbol{\Sigma}$$, but the discrepancy is small enough to be explained by sampling variation and by the fact that $$\mathbf{z}_1,\mathbf{z}_4\not\sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$$.

We first attempt to identify the anomalous observations by computing the Mahalanobis distance from the empirical distribution to all observations in the dataset.

Sigma.inv = matlib::inv(cov(rdata[,1:4]))

M_d<-vector()

for(j in 1:nrow(rdata)){
M_d[j] <- sqrt(as.matrix(rdata[j,1:4]-mu.1) %*%
Sigma.inv %*% t(as.matrix(rdata[j,1:4]-mu.1)))
}

rdata <- data.frame(rdata,M_d)
summary(M_d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.4622  1.3479  1.6764  1.7980  2.1010  6.6393 

The summary certainly suggests that there is (at least) one observation for which the Mahalanobis distance to the empirical distribution is quite high.

library(dplyr)

rdata |> ggplot2::ggplot(ggplot2::aes(x=M_d)) +
ggplot2::geom_histogram(colour="black",binwidth = 0.5) +
ggplot2::geom_rug() +
ggplot2::theme_bw()

rdata |> ggplot2::ggplot(ggplot2::aes(x=M_d)) +
ggplot2::geom_boxplot() + ggplot2::geom_rug(color="black")

The histogram of Mahalanobis distances shows that most observations are fairly “regular”, but that two of the observations have substantially larger distances. The boxplot confirms it, but identifies a potential third outlying observation.

Below, we display the scatter plot matrix of the 102 observations, with colour given by the Mahalanobis distance of each observation from the empirical distribution (the code is omitted).

It certainly seems as though the two anomalies could be $$\mathbf{z}_1$$ and $$\mathbf{z}_4$$.

Next, we plot the Mahalanobis distance from each observation to every other observation.

M_pq<-matrix(nrow=nrow(rdata), ncol=nrow(rdata))

for(j in 1:nrow(rdata)){
for(i in 1:nrow(rdata)){
M_pq[j,i]<-sqrt(as.matrix(rdata[j,1:4]-rdata[i,1:4]) %*%
Sigma.inv %*% t(as.matrix(rdata[j,1:4]-rdata[i,1:4])))
}
}

M_pq<-as.data.frame.table(M_pq)
M_pq[,1:2]<-lapply(M_pq[,1:2],as.numeric)

M_pq |> ggplot2::ggplot(ggplot2::aes(x=Var1,y=Freq)) +
ggplot2::geom_point(ggplot2::aes(fill=Freq,colour=Freq),pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#FFFFFF") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#FFFFFF") +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="Distance") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none")

Note the differing patterns for observations 101 and 102, as well as the diffuse cloud of points above the distance $$5.0$$ for the other observations. There are a few other observations for which the distances to other observations seem to be larger than in a majority of the cases.

Next, we display the same distributions with the help of boxplots.

median.value <- M_pq |>
group_by(Var1) |>
summarise(meanDist=mean(Freq)) |>
summarise(median_value=median(meanDist))

test <- M_pq |>
group_by(Var1) |>
summarise(meanDist=mean(Freq)) |>
summarise(std=sd(meanDist))

med.sd = test+median.value

M_pq |> ggplot2::ggplot(ggplot2::aes(x=as.factor(Var1),y=Freq)) +
ggplot2::geom_boxplot() +
ggplot2::scale_x_discrete(name="Observations") +
ggplot2::scale_y_continuous(name="Distance") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") +
ggplot2::geom_hline(yintercept=as.numeric(median.value), linetype = "dashed", color = "red")  +
ggplot2::geom_hline(yintercept=as.numeric(med.sd), linetype = "dotted", color = "red")  +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle=90))

The dashed red line represents the median of all the mean distances per observation; the dotted red line lies 1 standard deviation above the median.

To simplify the reading of the situation, we plot only the mean distance per observation, linking the colour intensity and the marker size to the mean distance (blue corresponding to larger distances, as do larger markers).

M_pq |> group_by(Var1) |> summarise(meanDist=mean(Freq)) |>
ggplot2::ggplot(ggplot2::aes(x=Var1,y=meanDist)) +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="Mean Distance") +
ggplot2::geom_point(ggplot2::aes(fill=meanDist,colour=meanDist,size=meanDist),pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") +
ggplot2::geom_hline(yintercept=as.numeric(median.value), linetype = "dashed", color = "red")  +
ggplot2::geom_hline(yintercept=as.numeric(med.sd), linetype = "dotted", color = "red")  

Do any other observations strike you as possible outliers?

#### Similarity Measures (Reprise)

If $$\boldsymbol{\Sigma}$$ is diagonal, then $d_M(\mathbf{p},\mathbf{q})=\sqrt{\sum_{i=1}^n \frac{(p_i - q_i)^2}{\sigma_i^2}},$ where $$\sigma_i^2$$ is the variance along the $$i$$-th dimension. If $$\boldsymbol{\Sigma}$$ is the identity matrix, then we recover the Euclidean distance $d_2(\mathbf{p},\mathbf{q})=\sqrt{\sum_{i=1}^n (p_i - q_i)^2}.$

When using the Euclidean distance in an anomaly detection context, a linear normalization is usually applied to each dimension so that each entry lies in the hypercube $$[-1,1]^n$$. The Minkowski distance of order $$p$$ is a generalization of the Euclidean distance: $d_p(\mathbf{p},\mathbf{q})=\left( \sum_{i=1}^n |p_i - q_i|^p \right)^{1/p}.$

For $$p=2$$ we recover the Euclidean distance $$d_2$$, for $$p=1$$ the Manhattan distance $d_1(\mathbf{p},\mathbf{q})=\sum_{i=1}^n|p_i-q_i|,$ and for $$p=\infty$$ the supremum distance (also called the Chebychev distance) $d_{\infty}(\mathbf{p},\mathbf{q})=\max_{i=1}^n |p_i - q_i|.$ The Minkowski distance $$d_p$$ is only actually a distance function (i.e., a metric) when $$p \geq 1$$, but an exception is made for $d_{-\infty}(\mathbf{p},\mathbf{q})=\min_{i=1}^n |p_i - q_i|$ to fall within the same framework.
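The Minkowski family is available directly through base R's `dist()`; a sketch on two 3-dimensional points:

```r
# The Minkowski distances d_1, d_2, d_oo, and d_4 via base R's dist().
x <- rbind(p = c(1, 2, 3), q = c(4, 6, 3))

dist(x, method = "manhattan")          # d_1  = 3 + 4 + 0 = 7
dist(x, method = "euclidean")          # d_2  = sqrt(9 + 16) = 5
dist(x, method = "maximum")            # d_oo = max(3, 4, 0) = 4 (Chebychev)
dist(x, method = "minkowski", p = 4)   # d_4  = (3^4 + 4^4)^(1/4)
```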

The Jaccard similarity of two datasets $$P$$ and $$Q$$ is defined as the size of their intersection divided by the size of their union $J(P,Q) = \frac{|P \cap Q|}{|P \cup Q|} = \frac{|P \cap Q|}{|P| + |Q| - |P \cap Q|}.$ Their Jaccard distance is then taken to be $$1 - J(P,Q)$$.

This definition can be extended to compare binary vectors (i.e. vectors with entries in $$\{0,1\}$$) of the same length. Given two binary vectors $$\mathbf{p}$$ and $$\mathbf{q}$$ of length $$n$$, consider an arbitrary set $$D$$ of size $$n$$. Then $$\mathbf{p}$$ and $$\mathbf{q}$$ can be viewed as subsets of $$D$$: if $$p_i=1$$ then $$\mathbf{p}$$ is said to contain the $$i$$-th element of $$D$$, while if $$p_i=0$$ then it does not. Viewing $$\mathbf{p}$$ and $$\mathbf{q}$$ in this way allows us to compute their Jaccard similarity, and thus their Jaccard distance.
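Reading the 1s as set membership makes the computation immediate; a base-R sketch (`jaccard` is our helper name, not a library function):

```r
# Jaccard similarity of two binary vectors of the same length.
jaccard <- function(p, q) sum(p & q) / sum(p | q)

p <- c(1, 1, 0, 1, 0)
q <- c(1, 0, 0, 1, 1)
jaccard(p, q)       # |P intersect Q| = 2, |P union Q| = 4, so J = 0.5
1 - jaccard(p, q)   # Jaccard distance
```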

Finally, let $$\mathbf{p},\mathbf{q}\neq \mathbf{0}$$. Recall that $$\mathbf{p} \cdot \mathbf{q} = \lVert \mathbf{p} \rVert \lVert \mathbf{q} \rVert \cos\theta,$$ where $$\theta$$ is the angle between $$\mathbf{p}$$ and $$\mathbf{q}$$. The cosine similarity between $$\mathbf{p}$$ and $$\mathbf{q}$$ is the cosine of $$\theta$$, which can be computed as $\cos\theta = \frac{\mathbf{p} \cdot \mathbf{q}}{\lVert \mathbf{p} \rVert \lVert \mathbf{q} \rVert} = \frac{\sum_{i=1}^n p_i q_i}{\sqrt{\sum_{i=1}^n p_i^2} \sqrt{\sum_{i=1}^n q_i^2}}.$

This value ranges between $$1$$ and $$-1$$, with 1 attained when $$\mathbf{p}=\mathbf{q}$$, $$-1$$ when $$\mathbf{p}=-\mathbf{q}$$, and $$0$$ when $$\mathbf{p}$$ and $$\mathbf{q}$$ are perpendicular. Armed with these concepts, we can now explore distance-based methods for anomaly detection; they will also eventually be useful for density-based anomaly detection.
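The three boundary cases are easy to check; a base-R sketch (`cosine_sim` is our helper name):

```r
# Cosine similarity between two non-zero vectors.
cosine_sim <- function(p, q) sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))

p <- c(1, 2, 3)
cosine_sim(p, p)               # 1: same direction
cosine_sim(p, -p)              # -1: opposite directions
cosine_sim(c(1, 0), c(0, 1))   # 0: perpendicular vectors
```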

#### 16.2.1.1 Distance-Based Anomaly Detection

All these distance functions (similarity measures) can be used to create basic anomaly detection algorithms (the ideas can also be extended to more complex algorithms).

Given some distance function $$d$$, dataset $$D$$, and integers $$k,\nu\leq |D|$$, the distance to all points (DTAP) anomaly detection algorithm considers each observation $$\mathbf{p}$$ in $$D$$ and adds the distance from $$\mathbf{p}$$ to every other observation in $$D$$, i.e. $a(\mathbf{p}) = \sum_{\mathbf{q}\neq \mathbf{p} \in D} d(\mathbf{q}, \mathbf{p}).$

The $$\nu$$ observations with largest values for $$a$$ are then said to be anomalous according to $$a$$. This approach often selects the most extreme observations as anomalous, which may be of limited use in practice.

The distance to nearest neighbour (DTNN) algorithm defines $a(\mathbf{p}) = \min_{\mathbf{q}\neq \mathbf{p} \in D} d(\mathbf{q}, \mathbf{p}),$ with a similar definition for the $$\nu$$ anomalous observations. The average distance to $$k$$ nearest neighbours and median distance to $$k$$ nearest neighbours are defined similarly.
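As a minimal sketch, the average-distance-to-$$k$$NN detector can be written in a few lines of base R; the helper `a_knn` and the toy dataset (with one planted outlier) are ours, not the chapter's artificial data:

```r
# Average distance to the k nearest neighbours, as an anomaly score a(p).
a_knn <- function(D, k) {
  dmat <- as.matrix(dist(D))   # Euclidean distances by default
  diag(dmat) <- Inf            # exclude each observation's self-distance
  apply(dmat, 1, function(r) mean(sort(r)[1:k]))
}

set.seed(1)
D <- rbind(matrix(rnorm(40), ncol = 2),   # 20 regular observations
           c(10, 10))                     # 1 planted outlier (observation 21)
scores <- a_knn(D, k = 3)
order(-scores)[1]   # observation 21 ranks as most anomalous
```

Replacing `mean` by `median` or `min` yields the median-distance-to-$$k$$NN and DTNN variants, respectively.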

#### Example - Distance to All Points

We start by building the DTAP anomaly detector for the Euclidean distance (method="euclidean") on the scaled artificial data, which is shown below.

rdata.scaled=data.frame(matrix(ncol = 4, nrow = nobs+2))

for(i in 1:4){
rdata.scaled[,i] <- 2/(max(rdata[,i]) - min(rdata[,i])) * rdata[,i] - 1
}

lattice::splom(rdata.scaled[,1:4], pch=22)

The top $$\nu=6$$ anomalous observations are obtained by the following chunk of code.

m.L2 <- as.matrix(dist(rdata.scaled[,1:4], method="euclidean"))
adoa.L2 <- data.frame(1:(nobs+2), apply(m.L2,1,sum))
colnames(adoa.L2) <- c("obs","dist")
adoa.L2 <- adoa.L2[order(-adoa.L2$dist),]
rownames(adoa.L2) <- NULL
knitr::kable(head(adoa.L2))
obs dist
102 241.7556
62 177.6135
14 113.9903
49 108.6464
55 106.5156
67 104.7870

The accompanying plot is:

adoa.L2 |> ggplot2::ggplot(ggplot2::aes(x=obs,y=dist)) +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="Sum of Euclidean Distances") +
ggplot2::geom_point(ggplot2::aes(fill=dist, colour=dist, size=dist), pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none")

The top $$\nu=6$$ anomalies are obtained for a variety of metrics (we do not display the code that is used for these other metrics).

• Chebychev (replace method="euclidean" by method="maximum"):

obs dist
102 144.85387
62 114.83273
14 88.25016
55 81.47169
49 80.33274
101 73.11895

• Manhattan (replace method="euclidean" by method="manhattan"):

obs dist
102 473.7469
62 342.2531
14 197.7493
67 193.9257
49 191.7707
55 183.6509

• Minkowski, $$p=1/2$$ (replace method="euclidean" by method="minkowski", p=0.5):

obs dist
102 1873.7488
62 1341.9020
67 731.9811
14 717.3451
49 700.5557
11 669.4313

• Minkowski, $$p=4$$ (replace method="euclidean" by method="minkowski", p=4):

obs dist
102 3738.781
62 2672.940
67 1444.177
14 1403.799
49 1372.067
11 1318.830

We see that while observation 102 is always the most anomalous according to DTAP, the ranking is affected by the choice of distance metric. Is this surprising?

#### Example - Distance to Nearest Neighbour

We next build the DTNN anomaly detector for the Euclidean distance, again on the scaled artificial data. The top $$\nu=6$$ anomalous observations are obtained by the following chunk of code.

m.L2 <- m.L2 + 1000000*diag(nobs+2)
adoa.L2 <- data.frame(1:(nobs+2),apply(m.L2,1,min))
colnames(adoa.L2) <- c("obs","dist")
adoa.L2 <- adoa.L2[order(-adoa.L2$dist),]
knitr::kable(head(adoa.L2))
obs dist
102 1.4549129
62 0.8667313
101 0.3746565
12 0.3611265
11 0.3186967
43 0.3145305

The accompanying plot is:

The top $$\nu=6$$ anomalies are obtained for a variety of metrics (we do not display the code that is used for these other metrics).

• Chebychev:

obs dist
102 0.8269178
62 0.5203870
12 0.2449443
101 0.2368307
68 0.2356369
43 0.2214142

• Manhattan:

obs dist
102 2.8914743
62 1.6745725
12 0.6499116
101 0.6082916
11 0.5702036
23 0.4903152

• Minkowski, $$p=1/2$$:

obs dist
102 10.534441
62 6.559370
101 1.857222
11 1.704864
23 1.618388
12 1.591956

• Minkowski, $$p=4$$:

obs dist
102 20.605432
62 13.059856
101 3.521821
11 3.081000
23 2.979090
12 2.809583

In both examples, there are commonalities (certain observations come back repeatedly as likely anomalous observations), but the anomaly rankings change according to the selected distance function and the choice of algorithm (the choice of data scaling approach would also play a role). This is par for the course in the anomaly detection context.

### 16.2.2 Density Methods

Density-based approaches view observations as anomalous if they occur in low-density regions.

Density-based methods include:

• local outlier factors;

• DBSCAN, and

• isolation forests.

#### Local Outlier Factor

The Local Outlier Factor (LOF) algorithm was proposed in 2000 by [336] (a summary can be found in Section 6.4.2 of [319]). LOF works by measuring the local deviation of each observation in a dataset from its $$k$$ nearest neighbours, with a point said to be anomalous if this deviation is large.

A local $$k-$$neighbourhood $$N_k(\mathbf{p})$$ around an observation $$\mathbf{p}$$ is defined as the set of $$k$$ nearest neighbours of $$\mathbf{p}$$. The density of observations in each of their respective local $$k-$$neighbourhoods is estimated (the local density), and compared to the density of the local $$k-$$neighbourhoods of the observations within their own $$k-$$neighbourhood.

This can then be used to identify outliers that inhabit regions of lower density than their neighbours, as $$\mathbf{p}$$ would be in Figure 16.15.

The formal procedure is as follows:

Any observation with a LOF $$a_k(\mathbf{p})$$ above some threshold $$\tau$$ is a local outlier, but selecting $$\tau$$ is not obvious. LOF introduces the idea of a reachability distance, which improves the stability of results within clusters/regions: within $$N_k(\mathbf{p})$$, it is $d_{\text{reach}}(\mathbf{p},\mathbf{q})=\max_{\ell}\{d(\mathbf{p},\mathbf{q}_{\ell}); \mathbf{q}_{\ell}\in N_k(\mathbf{p})\},$ the maximal distance to its $$k-$$neighbours; outside of $$N_k(\mathbf{p})$$, it is $d_{\text{reach}}(\mathbf{p},\mathbf{q})=d(\mathbf{p},\mathbf{q}),$ the actual distance.

In Figure 16.16, assuming $$k=3$$, the observations $$\mathbf{q}_1, \mathbf{q}_2, \mathbf{q}_3$$ all have the same reachability distance from $$\mathbf{p}$$ as they are all $$3$$-neighbours of $$\mathbf{p}$$, that is, $d_{\text{reach}}(\mathbf{p}, \mathbf{q}_1) = d_{\text{reach}}(\mathbf{p}, \mathbf{q}_2) = d_{\text{reach}}(\mathbf{p}, \mathbf{q}_3) = d(\mathbf{p}, \mathbf{q}_3).$ The observation $$\mathbf{q}_4$$, on the other hand, has $$d_{\text{reach}}(\mathbf{p}, \mathbf{q}_4) = d(\mathbf{p}, \mathbf{q}_4)$$ as it is not a $$k$$-neighbour of $$\mathbf{p}$$.
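The behaviour described above can be sketched in a few lines of base R, with $$d$$ taken to be Euclidean and $$N_k(\mathbf{p})$$ the indices of $$\mathbf{p}$$'s $$k$$ nearest neighbours (`d_reach` and the toy points are ours):

```r
# Reachability distance, following the definition above.
d_reach <- function(p, q, dmat, k) {
  nn <- order(dmat[p, ])[2:(k + 1)]   # k nearest neighbours of p (skip p itself)
  if (q %in% nn) max(dmat[p, nn]) else dmat[p, q]
}

x <- matrix(c(0, 1, 2, 3, 10), ncol = 1)   # observation 1 plays the role of p
dmat <- as.matrix(dist(x))
d_reach(1, 2, dmat, k = 3)   # q in N_3(p): the shared value d(p, q_3) = 3
d_reach(1, 4, dmat, k = 3)   # also 3
d_reach(1, 5, dmat, k = 3)   # outside N_3(p): the actual distance, 10
```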

#### Example

LOF is implemented in R via the Rlof package; we apply it to the scaled artificial data using the Euclidean and the Chebychev distances.

dist.L2 = dist(rdata.scaled[,1:4], method="euclidean")
lof <- Rlof::lof(dist.L2, k=4)
rdata.lof.L2 = data.frame(rdata.scaled[,1:4],lof)
rdata.lof.obs.L2 = data.frame(1:(nobs+2),lof)
names(rdata.lof.obs.L2) = c("obs",'lof')
rdata.lof.obs.L2 <- rdata.lof.obs.L2[order(-rdata.lof.obs.L2$lof),]
rownames(rdata.lof.obs.L2) <- NULL
knitr::kable(head(rdata.lof.obs.L2))
obs lof
102 5.377542
62 5.100218
12 1.609463
55 1.481152
27 1.475414
14 1.402112

dist.sup = dist(rdata.scaled[,1:4], method="maximum")
lof <- Rlof::lof(dist.sup, k=4)
rdata.lof.sup = data.frame(rdata.scaled[,1:4],lof)
rdata.lof.obs.sup = data.frame(1:(nobs+2),lof)
names(rdata.lof.obs.sup) = c("obs",'lof')
rdata.lof.obs.sup <- rdata.lof.obs.sup[order(-rdata.lof.obs.sup$lof),]
rownames(rdata.lof.obs.sup) <- NULL
knitr::kable(head(rdata.lof.obs.sup))
obs lof
102 2.984990
62 2.917587
14 1.636228
27 1.542470
12 1.466011
79 1.342456

#### DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) was proposed in 1996 by [334] (a summary can be found in Section 4.1.5 of [319], as well as in Section 14.4.1). As its name suggests, it is a density-based clustering algorithm that groups nearby observations together and labels observations that do not fall in the clusters as anomalies.

Hierarchical DBSCAN (HDBSCAN) [335] was introduced in 2013. It notably removes the problem of choosing the parameter for the radius of a neighbourhood by considering all “possible” radii. Further documentation can be found at [345].

In DBSCAN,

• an observation $$\mathbf{p}$$ is a core point if there is a minimum of $$m$$ observations within distance $$r$$ of $$\mathbf{p}$$;

• an observation $$\mathbf{q}$$ is a border point (or non-core point) if it is not itself a core point but is within distance $$r$$ of one, and

• an observation $$\mathbf{o}$$ is an outlier if it is neither a core point nor a border point.

DBSCAN considers each observation in the dataset individually. If that observation is an outlier, then it is added to a list of outliers. Otherwise if it is a core point, then its $$r$$-neighbourhood forms the beginning of a new cluster. Each observation in this $$r$$-neighbourhood is then considered in turn, with the $$r$$-neighbourhoods of other core observations contained in the neighbourhood being added to the cluster.

This expansion repeats until all observations have been examined. During this step, observations that were previously labelled as outliers may be updated as they become border points in the new cluster. This process continues until every observation has either been assigned to a cluster or labelled as an outlier.
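The core/border/outlier classification underlying this process can be sketched directly from the definitions; a base-R sketch (`classify` and the toy points are ours; in practice, `dbscan::dbscan()` performs this classification along with the cluster expansion). Here the count of observations within distance $$r$$ includes the point itself:

```r
# Label each observation as core, border, or outlier per the DBSCAN definitions.
classify <- function(D, r, m) {
  dmat <- as.matrix(dist(D))
  near <- dmat <= r                  # neighbourhood membership matrix
  core <- rowSums(near) >= m         # at least m observations within r
  border <- !core & apply(near, 1, function(row) any(row & core))
  ifelse(core, "core", ifelse(border, "border", "outlier"))
}

x <- matrix(c(0, 0.5, 1, 1.5, 10), ncol = 1)
classify(x, r = 1, m = 3)   # the isolated point at 10 is an outlier
```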

The formal procedure is as follows:

While DBSCAN’s dual use as a clustering algorithm may seem irrelevant in the outlier detection setting, it is its ability to successfully identify clusters that is crucial for labelling the remaining observations as outliers.

DBSCAN/HDSBCAN Strengths:

• the number of clusters does not need to be known beforehand (unlike in $$k-$$means and other clustering algorithms);

• clusters of arbitrary shape can be detected;

• when using HDBSCAN, only the parameter for the minimum cluster size $$m$$ is required, which can be set fairly intuitively.

DBSCAN/HDBSCAN Limitations:

• it is not deterministic, as border points can be assigned to different clusters depending on the order in which core observations are considered (this does not affect its use as an anomaly detection algorithm, however);

• in high-dimensional spaces, the ability of any distance function based on the Euclidean distance to distinguish near and distant observations diminishes due to the Curse of Dimensionality; thus, in high-dimensional spaces, it becomes ineffective (as do other clustering algorithms);

• it cannot handle differences in local densities as the radius of a neighbourhood $$r$$ is fixed; this could lead to sparser clusters being labelled as outliers, or to outliers surrounding a denser cluster being included in the cluster (this issue is overcome in HDBSCAN, however).

#### Example

We use the R implementation of DBSCAN, HDBSCAN, and OPTICS (another density-based clustering algorithm) found in the dbscan package; we apply various parameters to the scaled artificial data, using the Euclidean distance in all instances.

scaled = scale(rdata[,1:4])
set.seed(1) # for replicability
• DBSCAN, eps = 0.4, minPts = 4

db.1 <- dbscan::dbscan(scaled, eps = .4, minPts = 4)
db.1
lattice::splom(scaled, groups=db.1$cluster + 1L, pch=22)

DBSCAN clustering for 102 objects.
Parameters: eps = 0.4, minPts = 4
The clustering contains 1 cluster(s) and 96 noise points.

 0  1
96  6

Available fields: cluster, eps, minPts

Evidently, 0.4 is too small a value for eps or 4 is too large a value for minPts (or both).

• DBSCAN, eps = 1, minPts = 4

DBSCAN clustering for 102 objects.
Parameters: eps = 1, minPts = 4
The clustering contains 1 cluster(s) and 6 noise points.

0  1
6 96

Available fields: cluster, eps, minPts

The results are reversed with a larger value of eps.

• DBSCAN, eps = 1, minPts = 10

DBSCAN clustering for 102 objects.
Parameters: eps = 1, minPts = 10
The clustering contains 1 cluster(s) and 24 noise points.

 0  1
24 78

Available fields: cluster, eps, minPts

Are the clustering results (and thus the anomaly discovery rate) as expected? The interaction between the parameters can have unpredictable effects.

• DBSCAN, eps = 2, minPts = 10

DBSCAN clustering for 102 objects.
Parameters: eps = 2, minPts = 10
The clustering contains 1 cluster(s) and 2 noise points.

  0   1
  2 100

Available fields: cluster, eps, minPts

• HDBSCAN, minPts = 4

hdb <- dbscan::hdbscan(scaled, minPts = 4)
hdb
lattice::splom(scaled, groups=hdb$cluster + 1L, pch=22)

HDBSCAN clustering for 102 objects.
Parameters: minPts = 4
The clustering contains 4 cluster(s) and 71 noise points.

0  1  2  3  4
71 10  6  4 11

Available fields: cluster, minPts, coredist, cluster_scores,
membership_prob, outlier_scores, hc

Notice the absence of the eps parameter.

• OPTICS, eps = 1, minPts = 4, eps_cl = 1, xi=.05 (read the documentation for details)

opt <- dbscan::optics(scaled, eps = 1, minPts = 4)
opt.1 <- dbscan::extractDBSCAN(opt, eps_cl = 1)
opt.1
lattice::splom(rdata[,1:4], groups=opt.1$cluster + 1L, pch=22)

OPTICS ordering/clustering for 102 objects.
Parameters: minPts = 4, eps = 1, eps_cl = 1, xi = NA
The clustering contains 1 cluster(s) and 6 noise points.

0  1
6 96

Available fields: order, reachdist, coredist, predecessor, minPts, eps,
eps_cl, xi, cluster

opt.2 <- dbscan::extractXi(opt, xi = .05)
opt.2
lattice::splom(scaled, groups=opt.2$cluster + 1L, pch=22)

OPTICS ordering/clustering for 102 objects.
Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = 0.05
The clustering contains 4 cluster(s) and 7 noise points.

Available fields: order, reachdist, coredist, predecessor, minPts, eps,
eps_cl, xi, clusters_xi, cluster

Are there any surprises?

#### Isolation Forest

The approaches that were previously discussed first construct models of what normal observations look like, and then identify observations that do not fit this model. The Isolation Forest algorithm [332] introduced in 2008 instead tries to explicitly identify outliers under the assumptions that there are few outliers in the data and that these outliers have very different attributes compared to normal (or regular) observations.

This allows the use of sampling techniques that increase algorithmic speed while decreasing memory requirements.

The Isolation Forest algorithm tries to isolate anomalous observations. It does so by randomly selecting an attribute and then randomly selecting a split value between that attribute’s min and max values. This recursively partitions the observations until every observation is isolated in its own partition component.

Recursive partitioning yields a binary tree called an Isolation Tree (IsoTree):

• the root of this tree is the entire dataset;

• each node is a subset of the observations;

• each branch corresponds to one of the generated partitions, and

• the leaves are singleton sets containing a single isolated observation.

Each observation is then assigned a score derived from how deep in the tree its singleton partition appears. As observations that are shallower in the tree were easier to separate from the rest, these are the likely outliers.

Since only shallow observations are of interest, once the height of the tree has reached a given threshold (the expected height of a random binary tree, say), further construction of the tree can be stopped to decrease computational cost.

Additionally, instead of building a tree from the entire dataset, a tree can be constructed from a subset. The location of any observation within this smaller tree can then be estimated, again saving computational and memory resources. These two improvements are detailed in the original paper [332].

Once a number of Isolation Trees have been randomly generated (an Isolation Forest), a score can be computed for each observation. This is done by searching each tree for the location of a given observation and noting the path length required to reach it. Once an observation’s path length in each tree has been computed, the average path length is taken to be its score. In Isolation Forests, low scores are indicative of outlying behaviour.

The formal procedure is as follows:

It can be desirable to construct a normalized anomaly score that is independent of the size of the dataset. In order to do this, the expected path length of a random observation in an Isolation Tree (i.e., a binary tree) must be estimated. With $$|D|=n$$, it can be shown that the expected length is $c(n) = 2 H(n-1) - \frac{2(n-1)}{n},$ where $$H(n-1)$$ is the $$(n-1)^{\text{th}}$$ harmonic number, which can be approximated by $$\ln(n-1) + 0.577$$; $$c(n)$$ is then used to normalize the final anomaly score $$a(\mathbf{p})$$ for $$\mathbf{p} \in D$$, which is given by $\log_2 a(\mathbf{p}) = -\frac{\bar{h}(\mathbf{p})}{c(n)},$ where $$\bar{h}(\mathbf{p})$$ denotes the average path length to $$\mathbf{p}$$ in the Isolation Trees.

Thus defined, $$a(\mathbf{p})\in [0,1]$$, with $$a(\mathbf{p}) \approx 1$$ suggesting $$\mathbf{p}$$ is an anomaly, $$a(\mathbf{p}) \leq 0.5$$ suggesting $$\mathbf{p}$$ is a normal observation; if all observations receive a score around $$0.5$$, this suggests that there are no anomalies present.
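This normalization is easy to implement directly; a base-R sketch with exact harmonic numbers rather than the logarithmic approximation (`H`, `c_n`, and `iso_score` are our helper names):

```r
# Normalizing constant c(n) and the anomaly score a(p) = 2^(-hbar(p)/c(n)).
H <- function(i) sum(1 / seq_len(i))            # i-th harmonic number
c_n <- function(n) 2 * H(n - 1) - 2 * (n - 1) / n
iso_score <- function(avg_path, n) 2^(-avg_path / c_n(n))

n <- 256
c_n(n)                       # expected path length in a random binary tree
iso_score(c_n(n), n)         # average-depth observation: score exactly 0.5
iso_score(0.5 * c_n(n), n)   # shallow (easily isolated) observation: about 0.71
```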

IsoForest Strengths:

• small time and memory requirements;

• can handle high dimensional data, and

• does not require observations to be labelled as anomalies in the training set.

IsoForest Main Limitation:

• the anomaly score assigned to a given point can have high variance over multiple runs of the algorithm. The authors of [333] propose some solutions.

#### Example

We use the R implementation of IsoForest found in the solitude package; we apply it to the scaled artificial data. Note that this implementation uses a different scoring system, in which high scores are indicative of anomalous observations.

#library(solitude)
set.seed(1) # for replicability
index = 1:102

We initiate an isolation forest:

iso = solitude::isolationForest$new(sample_size = length(index))
iso$fit(dataset = rbind(scaled[index,1:4],c(0,0,0,0)))
test <- iso$predict(scaled[index,1:4]) # scores for training data

INFO [10:35:49.974] Building Isolation Forest ...
INFO [10:35:50.765] done
INFO [10:35:50.772] Computing depth of terminal nodes ...
INFO [10:35:51.244] done
INFO [10:35:51.250] Completed growing isolation forest

The top $$\nu=6$$ IsoForest anomaly scores are given below:

rdata.iso = data.frame(1:(nobs+2),test$anomaly_score)
names(rdata.iso) = c("obs","anomaly_score")
rdata.iso <- rdata.iso[order(-rdata.iso\$anomaly_score),]
rownames(rdata.iso) = NULL
knitr::kable(head(rdata.iso))
obs anomaly_score
102 0.7795207
62 0.7480289
55 0.6379525
29 0.6228751
49 0.6218485
67 0.6182687
rdata.iso |>
ggplot2::ggplot(ggplot2::aes(x=obs,y=anomaly_score)) +
ggplot2::scale_x_continuous(name="Observations") +
ggplot2::scale_y_continuous(name="IsoForest Anomaly Score") +
ggplot2::geom_point(ggplot2::aes(fill=anomaly_score,
colour=anomaly_score,
size=anomaly_score),
pch=22) +
ggplot2::scale_fill_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::scale_colour_continuous(high = "#0033FF", low = "#CCCCCC") +
ggplot2::theme_bw() + ggplot2::theme(legend.position = "none") 

The profile of anomaly scores has a fairly distinct look (although we recognize quite a few of the usual suspects).

In general, density-based schemes are more powerful than distance-based schemes when a dataset contains patterns with diverse characteristics, but less effective when the patterns are of comparable densities with the outliers [346].

### References

[318]
D. Baron, XXX Winter School of Astrophysics on Big Data in Astronomy, GitHub repository, 2018.
[319]
K. G. Mehrotra, C. K. Mohan, and H. Huang, Anomaly Detection Principles and Algorithms. Springer, 2017.
[332]
F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
[333]
S. Hariri, M. Carrasco Kind, and R. J. Brunner, “Extended isolation forest,” IEEE Transactions on Knowledge and Data Engineering, 2019.
[334]
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[335]
R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-Based Clustering Based on Hierarchical Density Estimates,” in Advances in Knowledge Discovery and Data Mining, 2013, pp. 160–172.
[336]
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” SIGMOD Rec., vol. 29, no. 2, pp. 93–104, 2000.
[344]
D. Whitenack, Machine Learning with Go. Packt Publishing, 2017.
[345]
L. McInnes, J. Healy, and S. Astels, The hdbscan Clustering Library Documentation, 2016.
[346]
J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, “Capabilities of outlier detection schemes in large datasets, framework and methodologies,” Knowl. Inf. Syst., vol. 11, no. 1, pp. 45–84, Jan. 2007.