8.5 Anomalous Observations

The most exciting phrase to hear […], the one that heralds the most discoveries, is not “Eureka!” but “That’s funny…” [I. Asimov (attributed)].

Outlying observations are data points which are atypical in comparison to the unit’s remaining features (within-unit), or in comparison to the measurements for other units (between-units), or as part of a collective subset of observations. Outliers are thus observations which are dissimilar to other cases or which contradict known dependencies or rules.140

Observations could be anomalous in one context, but not in another.


Example: consider an adult male who is 6 feet tall. Such a man would fall in the 86th percentile among Canadian males [156], which, while on the tall side, is not unusual; in Bolivia, however, the same man would land in the 99.9th percentile [156], which would mark him as extremely tall and quite dissimilar to the rest of the population.141

A common mistake that analysts make when dealing with outlying observations is to remove them from the dataset without carefully studying whether they are influential data points, that is, observations whose absence leads to markedly different analysis results.

When influential observations are identified, remedial measures (such as data transformation strategies) may need to be applied to minimize any undue effect. Outliers may be influential, and influential data points may be outliers, but the conditions are neither necessary nor sufficient.

8.5.1 Anomaly Detection

By definition, anomalies are infrequent and typically surrounded by uncertainty due to their relatively low numbers, which makes it difficult to differentiate them from banal noise or data collection errors.

Furthermore, the boundary between normal and deviant observations is usually fuzzy; with the advent of e-shops, for instance, a purchase which is recorded at 3AM local time does not necessarily raise a red flag anymore.

When anomalies are actually associated with malicious activities, they are more often than not disguised in order to blend in with normal observations, which complicates the detection process.

Numerous methods exist to identify anomalous observations; none of them are foolproof and judgement must be used. Methods that employ graphical aids (such as box-plots, scatterplots, scatterplot matrices, and 2D tours) to identify outliers are particularly easy to implement, but a low-dimensional setting is usually required for ease of interpretation.
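
Such visual scans require only a couple of lines of R; in the following minimal sketch, the numeric data frame dat is a hypothetical stand-in:

boxplot(dat)   # one box per column; points beyond the whiskers stand out
pairs(dat)     # scatterplot matrix: anomalous points may show up in 2D projections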

Analytical detection methods also exist (using Cook's or the Mahalanobis distance, for instance), but in general some additional level of analysis must be performed, especially when trying to identify influential points (cf. leverage).

With small datasets, anomaly detection can be conducted on a case-by-case basis, but with large datasets, the temptation to use automated detection/removal is strong – care must be exercised before the analyst decides to go down that route.142

In the early stages of anomaly detection, simple data analyses (such as descriptive statistics, 1- and 2-way tables, and traditional visualizations) may be performed to help identify anomalous observations, or to obtain insights about the data, which could eventually lead to modifications of the analysis plan.
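
A minimal sketch of such a first pass, with a hypothetical data frame dat and hypothetical columns group and status:

summary(dat)                   # descriptive statistics for every variable
table(dat$group)               # 1-way frequency table
table(dat$group, dat$status)   # 2-way contingency table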

8.5.2 Outlier Tests

How are outliers actually detected? Most methods come in one of two flavours: supervised and unsupervised (we will discuss those in detail in later sections).

Supervised methods use a historical record of labeled (that is to say, previously identified) anomalous observations to build a predictive classification or regression model which estimates the probability that a unit is anomalous; domain expertise is required to tag the data.

Since anomalies are typically infrequent, these models often also have to accommodate the rare occurrence problem.143
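
By way of illustration only, the sketch below scores units with a weighted logistic regression; the data frame labeled, its features x1 and x2, its 0/1 indicator anomaly, and the weighting factor are all hypothetical, and up-weighting the rare class is but one simple way to accommodate the imbalance:

w <- ifelse(labeled$anomaly == 1, 10, 1)   # up-weight the rare class (factor is arbitrary)
fit <- glm(anomaly ~ x1 + x2, data = labeled, family = binomial, weights = w)
labeled$score <- predict(fit, type = "response")   # estimated probability of being anomalous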

Unsupervised methods, on the other hand, use no previously labeled information or data, and try to determine if an observation is an outlying one solely by comparing its behaviour to that of the other observations. The following traditional methods and tests of outlier detection fall into this category:144

  • Perhaps the most commonly used test is Tukey’s boxplot test; for normally distributed data, regular observations typically lie between the inner fences \[Q_1-1.5(Q_3-Q_1) \quad\mbox{and}\quad Q_3+1.5(Q_3-Q_1).\] Suspected outliers lie between the inner fences and their respective outer fences \[Q_1-3(Q_3-Q_1) \quad\mbox{and}\quad Q_3+3(Q_3-Q_1).\] Points beyond the outer fences are identified as outliers (\(Q_1\) and \(Q_3\) represent the data’s \(1^{\textrm{st}}\) and \(3^{\textrm{rd}}\) quartiles, respectively); see Figure 8.5.

Figure 8.5: Tukey’s boxplot test; suspected outliers are marked by white disks, outliers by black disks.

As an example, let’s find the outliers for the midterm and final exam grades in Dr. Vanderwhede’s Advanced Retroencabulation course.

There are no boxplot anomalies for midterm grades:

boxplot(grades$MT)

boxplot.stats(grades$MT)$out
numeric(0)

but there are 4 boxplot anomalies for final exam grades.

boxplot(grades$FE)

boxplot.stats(grades$FE)$out
[1] 3 0 3 0

The corresponding observations can be retrieved as follows:

out <- boxplot.stats(grades$FE)$out   # outlying values flagged by Tukey's test
out_ind <- which(grades$FE %in% out)  # row indices of the flagged values
grades[out_ind,]                      # display the flagged observations
    MT FE
44  97  3
46  55  0
161 25  3
163 27  0
  • The Grubbs test is another univariate test, which takes into consideration the number of observations in the dataset. Let \(x_i\) be the value of feature \(X\) for the \(i^{\textrm{th}}\) unit, \(1\leq i\leq N\), let \((\overline{x},s_x)\) be the mean and standard deviation of feature \(X\), let \(\alpha\) be the desired significance level, and let \(T(\alpha,N)\) be the critical value of the Student \(t\)-distribution with \(N-2\) degrees of freedom at significance \(\alpha/(2N)\). Then, the \(i^{\textrm{th}}\) unit is an outlier along feature \(X\) if \[|x_i-\overline{x}| \geq \frac{s_x(N-1)}{\sqrt{N}}\sqrt{\frac{T^2(\alpha,N)}{N-2+T^2(\alpha,N)}}.\]
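
This criterion is straightforward to implement; in the sketch below, the helper grubbs_flag is ours (not a built-in function), and the critical value is computed with \(N-2\) degrees of freedom, as in the standard formulation of the test:

grubbs_flag <- function(x, alpha = 0.05) {
  N <- length(x)
  t_crit <- qt(alpha/(2*N), df = N - 2, lower.tail = FALSE)
  bound <- sd(x)*(N - 1)/sqrt(N)*sqrt(t_crit^2/(N - 2 + t_crit^2))
  which(abs(x - mean(x)) >= bound)   # indices of flagged observations
}
grubbs_flag(grades$FE)   # e.g., flag final exam grades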

  • Other common tests include:

    • the Mahalanobis distance, which is linked to the leverage of an observation (a measure of influence), can also be used to find multi-dimensional outliers when all relationships are linear (or nearly linear) – see the sketch following this list;

    • the Tietjen-Moore test, which is used to find a specific number of outliers;

    • the generalized extreme studentized deviate test, if the number of outliers is unknown;

    • the chi-square test, when outliers affect the goodness-of-fit, as well as

    • DBSCAN and other clustering-based outlier detection methods.
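
To illustrate the first of these, squared Mahalanobis distances can be obtained with base R's mahalanobis() function and compared against a chi-square quantile, a common heuristic; the numeric data frame dat is again hypothetical:

d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))   # squared distances
which(d2 > qchisq(0.975, df = ncol(dat)))   # heuristic cutoff at the 97.5th percentile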

Many more such methods can be found in [148]; we will have a lot more to say on the topic in Anomaly Detection and Outlier Analysis.

8.5.3 Visual Outlier Detection

The following three (simple) examples illustrate the principles underlying visual outlier and anomaly detection.


Example: On a specific day, the heights of several plants are measured. The records also show each plant’s age (the number of weeks since the seed was planted).

Histograms of the data are shown in Figure 8.6 (age on the left, height in the middle).


Figure 8.6: Summary visualisations for an (artificial) plant dataset: age distribution (left), height distribution (middle), height vs. age, with linear trend (right).

Very little can be said about the data at this stage: the age of the plants (controlled by the nursery staff) seems to be somewhat haphazard, as does the response variable (height). A scatter plot of the data (rightmost chart in Figure 8.6), however, reveals that growth is strongly correlated with age during the early period of a plant’s life for the observations in the dataset; the points cluster around a linear trend. One point (in yellow) is easily identified as an outlier.

There are (at least) two possibilities: either the measurement was botched or mis-entered in the database (an invalid entry), or that specimen experienced unusual growth (a genuine outlier). Either way, the analyst has to investigate further.
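
Such a point can also be flagged numerically, for instance through the standardized residuals of the fitted trend; in this minimal sketch, the vectors age and height are hypothetical stand-ins for the plant data:

fit <- lm(height ~ age)
which(abs(rstandard(fit)) > 3)   # points more than 3 standardized residuals from the trend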


Example: a government department has 11 service points in a jurisdiction. Service statistics are recorded: the monthly average arrival rates per teller and monthly average service rates per teller for each service point are available.

A scatter plot of the service rate per teller (\(y\) axis) against the arrival rate per teller (\(x\) axis), with linear regression trend, is shown in the leftmost chart in Figure 8.7. The trend is seen to inch upwards with increasing \(x\) values.


Figure 8.7: Visualisations for an (artificial) service point dataset: trend for 11 service points (left), trend for 10 service points (middle), influential observations (right).

A similar chart, but with the left-most point removed from consideration, is shown in the middle chart of Figure 8.7. The trend still slopes upward, but the fit is significantly improved, suggesting that the removed observation is unduly influential (or anomalous) – a better understanding of the relationship between arrivals and services is afforded if it is set aside.

Any attempt to fit that data point into the model must take this information into consideration. Note, however, that influential observations depend on the analysis that is ultimately being conducted – a point may be influential for one analysis, but not for another.
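
Influence of this kind can be quantified with Cook's distance; in the sketch below, arrival and service are hypothetical stand-ins for the service point data:

fit <- lm(service ~ arrival)
cd <- cooks.distance(fit)
which(cd > 4/length(cd))   # common rule-of-thumb threshold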


Example: measurements of the length of the appendage of a certain species of insect have been made on 71 individuals. Descriptive statistics have been computed; the results are shown in Figure 8.8.


Figure 8.8: Descriptive statistics for an (artificial) appendage length dataset.

Analysts who are well-versed in statistical methods might recognize the tell-tale signs that the distribution of appendage lengths is likely to be asymmetrical (since the skewness is non-negligible) and to have a “fat” tail (due to the kurtosis being commensurate with the mean and the standard deviation, the range being so much larger than the interquartile range, and the maximum value being so much larger than the third quartile).
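
For reference, the sample skewness and excess kurtosis are easily computed directly (using one common definition among several); the vector app_len is a hypothetical stand-in for the 71 measurements:

z <- (app_len - mean(app_len))/sd(app_len)   # standardized measurements
mean(z^3)       # sample skewness
mean(z^4) - 3   # excess kurtosis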

The mode, minimum, and first quartile values belong to individuals without appendages, so there appear to be at least two sub-groups in the population (perhaps split along the lines of juveniles/adults, or males/females).

The maximum value has already been seen to be quite large compared to the rest of the observations, which suggests at first that it might be an outlier.

The histogram of the measurements, however, shows that there are 3 individuals with very long appendages (see the rightmost chart in Figure 8.9); it now becomes plausible for these anomalous entries to belong to individuals from a different species altogether, erroneously added to the dataset. This does not, of course, constitute proof of such an error, but it raises the possibility, which is often the best that an analyst can do in the absence of subject matter expertise.


Figure 8.9: Frequency chart of the appendage lengths in the (artificial) dataset.

References

[148]
Y. Cissokho, S. Fadel, R. Millson, R. Pourhasan, and P. Boily, “Anomaly Detection and Outlier Analysis,” Data Science Report Series, 2020.
[156]