## 16.3 Qualitative Approaches

Non-numerical variables present new challenges when it comes to anomaly detection. A categorical variable (or qualitative variable) is one whose levels are measured on a nominal scale; examples include an object’s colour, the mother tongue of an individual, her favourite meal, and so forth.

The central tendency of the values of a categorical variable is usually given by its mode; measures of spread are harder to define consistently. One possibility is to use the proportion of levels with more than a certain percentage of the observations above a given threshold.

Consider, for instance, a dataset consisting of $$n=517$$ observations with $$p=3$$ predictors:

• age ($$24-$$, $$24-44$$, $$45-64$$, $$65+$$);

• mother tongue (French, English, Mandarin, Arabic, Other), and

• hair colour (black, brown, blond, red).

The respective modes are $$24-44$$, English, and brown. We can see the distribution tables in Figure 16.19, as well as the number of levels that contain more than 15% and more than 25% of the observations.

We often associate qualitative features to numerical values, but with the caveat that these should not be interpreted as numerals; if we use the code “red” $$=1$$, “blond” $$=2$$, “brown” $$=3$$, and “black” $$=4$$ to represent hair colour, we cannot conclude that “blond” $$>$$ “red”, even though $$2>1$$, or that “black” $$-$$ “brown” $$=$$ “red”, even though $$4-3=1$$.

A categorical variable that has exactly two is called a dichotomous feature (or a binary variable); those with more than two levels are called polytomous variables.328

Commonly-used distances (apart from the $$0-1$$ distance and the related Hamming distance) typically require numerical inputs. But representing categorical variables with numerical features can lead to traps (as discussed above). Anomaly detection methods based on distance or on density are not recommended in the qualitative context (unless the distance function has been modified appropriately, but that is harder to do than can be expected). Another option is to look at combinations of feature levels, but this can be computationally expensive.

We present two specific categorical anomaly detection methods below: the attribute value frequency algorithm, and the greedy algorithm.

### 16.3.1 Attribute Value Frequency Algorithm

The Attribute Value Frequency (AVF) algorithm offers a fast and simple way to detect outlying observations in categorical data; it can be conducted without having to create ir which minimizes the amount of data analyses, without having to create or search through various combinations of feature levels, which reduces the overall runtime.

Intuitively, outlying observations are points which occur relatively infrequently in the (categorical) dataset; an “ideal” anomalous point is one for which each feature value is extremely anomalous (or relatively infrequent). The rarity of an attribute level can be measured by adding the number of times the corresponding feature takes that value in the dataset.

Consider a dataset with $$n$$ observations: $$\{\mathbf{x}_i\}\subseteq \mathbb{R}^p$$, $$i = 1, \ldots, n$$ (each observation is a vector of $$p$$ features). We write $\mathbf{x}_{i} = (x_{i,1}, \cdots , x_{i,\ell}, \cdots, x_{i,p}),$ where $$x_{i,\ell}$$ is $$\mathbf{x}_i$$’s $$\ell$$th feature’s level.

In the artificial categorial dataset presented previously above, we (perhaps) have \begin{aligned} \mathbf{x}_1&=(x_{1,1},x_{2,1},x_{3,1})=(24-,\text{French},\text{brown}) \\ &\vdots \\ \mathbf{x}_{517}&=(x_{517,1},x_{517,1},x_{517,1})=(24-,\text{Mandarin},\text{black}).\end{aligned}

Using the reasoning presented above, the AVF score is a good tool to determine whether $$\textbf{x}_i$$ should be considered an outlier or not: $\text{AVFscore}(\mathbf{x}_i) = \frac{1}{p} \sum_{\ell=1}^{p}f(x_{i,\ell}),$ where $$f(x_{i,\ell})$$ is the number of observations $$\mathbf{x}$$ for which the $$\ell$$th feature takes on the level $${x}_{i,\ell}$$.

A low AVF score indicates that the observation is more likely to be an outlier. Since $$\text{AVFscore}(\mathbf{x}_i)$$ is essentially a sum of $$p$$ positive values, it is minimized when each of the sum’s term is minimized, individually.329 Thus, the “ideal” anomalous observation (in the sense described above) minimizes the AVF score; the minimal score is reached when each of the observation’s features’ levels occur only once in the dataset.330

For an integer $$\nu$$, the suggested outliers are the $$\nu$$ observations signatures with smallest AVF score; the algorithm’s complexity is $$O(np)$$. The formal procedure is as follows:

The 10 lowest AVF scores in the artificial categorical dataset are highlighted in red below.

For instance, \begin{aligned}\text{AVFscore}(24-,\text{French},\text{blond})&=\textstyle{\frac{1}{3}(f(24-)+f(\text{French})+f(\text{blond}))} \\ &=\textstyle{\frac{1}{3}(175+141+79)=131.7}\end{aligned}

For anomaly detection purposes, individual raw AVF scores are not as meaningful as relative AVF scores.

### 16.3.2 Greedy Algorithm

The greedy algorithm “greedyAlg1” is an algorithm which identifies the set of candidate anomalous observations in an efficient manner. The mathematical formulation of the problem is simple – given a dataset $$D$$ and a number $$\nu$$ of anomalous observations to identify, we solve the optimization problem $\text{OS}=\arg\min_{O\subseteq D} \{H(D\setminus O)\}, \quad \text{subject to }|O|=\nu,$ where the entropy of the subset $$D\setminus O$$ is the sum of the entropy of each feature on $$D\setminus O$$ (see Section 11.4.3): $H(D\setminus O)=H(X_1;D\setminus O)+\cdots + H(X_m;D\setminus O)$ and $H(X_{\ell};D\setminus O)=-\!\!\!\!\!\!\!\!\!\!\!\sum_{z_{\ell}\in S(X_{\ell};D\setminus O)}\!\!\!\!\!\!\!\!\!\!\! p(z_{\ell})\log p(z_{\ell}),$ where $$S(X_{\ell};D\setminus O)$$ is the set of levels that the $$\ell$$th feature takes in $$D\setminus O$$.

The “greedyAlg1” algorithm solves the optimization problem as follows:

1. The set of outlying and/or anomalous observations $$\text{OS}$$ is initially set to be empty, and all observations of $$D\setminus \text{OS}$$ are identified as normal (or regular).

2. Compute $$H(D\setminus \text{OS})$$.

3. Scan the dataset in order to select a candidate anomalous observation: every normal observation $$\mathbf{x}$$ is temporarily taken out of $$D\setminus \text{OS}$$ to create a subset $$D'_{\mathbf{x}}$$, whose entropy $$H(D'_{\mathbf{x}})$$ is also computed.

4. The observation $$\mathbf{z}$$ which provides the maximal entropy impact, i.e. the one that minimizes $H(D\setminus \text{OS})-H(D'_{\mathbf{x}}),\quad \mathbf{x}\in D\setminus \text{OS},$ is added to $$\text{OS}$$.

5. Repeat steps 2-4 another $$\nu-1$$ times to obtain a set of $$\nu$$ candidate anomalous observations.

You can find more details in the in the source article [347]; an interesting detail is that it is scalable (it should also work for big datasets, provided the right framework is used).

### References

[347]
Z. He, S. Deng, X. Xu, and J. Z. Huang, “A fast greedy algorithm for outlier mining,” in Advances in Knowledge Discovery and Data Mining, 2006, pp. 567–576.