## 13.3 Rare Occurrences

Before we continue and discuss other supervised approaches, we need to briefly touch on the problem of rare occurrences (also known as imbalanced datasets). Say we are looking to detect fraudulent transactions. We can build classifiers to approach this task using any number of methods, but there is a potential problem: if $$(100-\varepsilon)$$% of observations belong to the normal category and $$\varepsilon$$% to the special category, the model that predicts that EVERY observation is normal achieves $$(100-\varepsilon)$$% accuracy. Since the vast majority of transactions are legitimate, $$0<\varepsilon \ll 1$$, and the model in question has tremendous accuracy, even though it misses the point of the exercise altogether.
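For instance, with $$\varepsilon = 1$$ (1% fraud), the trivial classifier reaches 99% accuracy without ever flagging a single fraudulent transaction. A minimal illustration (the counts are arbitrary):

```python
# Hypothetical illustration of the accuracy paradox: 1% of the
# observations are fraudulent, the rest are normal.
labels = ["fraud"] * 10 + ["normal"] * 990

# The trivial classifier predicts "normal" for every observation.
predictions = ["normal"] * len(labels)

# Fraction of correct predictions.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.99 -- yet not a single fraud is detected
```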

There are two general approaches to overcome this issue: either we modify the algorithms to account for the asymmetric cost of classification errors (through so-called cost-sensitive classifiers or one-class models), or we modify the training data to compensate for the imbalance. In the former case, we invite you to read the documentation of the methods that interest you to see how this can be achieved.
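To see why an asymmetric cost changes the picture, consider a hypothetical cost matrix (the costs 100 and 1 below are illustrative assumptions, not prescribed values) applied to the trivial "always normal" classifier:

```python
# Illustrative (assumed) misclassification costs, indexed by
# (true label, predicted label): missing a fraud is far costlier
# than flagging a legitimate transaction for review.
COST = {
    ("fraud", "normal"): 100.0,   # false negative: missed fraud
    ("normal", "fraud"): 1.0,     # false positive: needless review
    ("fraud", "fraud"): 0.0,
    ("normal", "normal"): 0.0,
}

labels = ["fraud"] * 10 + ["normal"] * 990
always_normal = ["normal"] * len(labels)

# Total cost incurred by the trivial majority-class classifier.
total_cost = sum(COST[(y, p)] for y, p in zip(labels, always_normal))
print(total_cost)  # 1000.0: ten missed frauds at cost 100 each
```

Under this cost structure, the 99%-accurate trivial classifier is no longer attractive; a cost-sensitive learner would trade some false positives for fewer false negatives.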

In the latter case, we could try to obtain more training data. This is the simplest method, but it is not always possible, and the new data could follow the same pattern as the original data, which would leave us no better off than before. Another alternative is to create a new training set by either undersampling the majority class(es) or oversampling the under-represented class(es). We assume for now that there are only two classes in the data (the strategies are easily adapted to multi-class problems).

#### Undersampling

Let $$\text{Tr}=L \sqcup M_c$$, where $$L$$ consists of all observations in the majority case, and $$M_c$$ of all observations in the minority case; by assumption, $$|L|\gg |M_c|$$. Split $$L$$ into $$K$$ subsets $$L_1, \ldots, L_K$$, each roughly of the same size; $$K$$ should be selected so that $$|M_c| \not \ll |L_i|$$, for all $$i$$ (in other words, even though $$|L_i|$$ could be larger than $$|M_c|$$, it is not going to be substantially so).

We then construct $$K$$ training sets $$\text{Tr}_1=L_1\sqcup M_c, \ldots, \text{Tr}_K=L_K\sqcup M_c$$; for all $$1\leq i \leq K$$, we train a classifier $$C_i$$ (using a given algorithm) on $$\text{Tr}_i$$. Once that is done, we combine the predictions using bagging or other ensemble learning methods (see Section 13.5).
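As a deliberately simplistic sketch of this scheme, assume a one-dimensional toy dataset and a toy threshold classifier; every name and number below is an illustrative assumption, not part of the procedure itself:

```python
import random

random.seed(42)

# Toy imbalanced training set (an assumption): one numeric feature,
# label 0 = majority class L, label 1 = minority class M_c.
L   = [(random.gauss(0.0, 1.0), 0) for _ in range(900)]
M_c = [(random.gauss(3.0, 1.0), 1) for _ in range(100)]

K = 9                                  # chosen so that |L_i| = 100 = |M_c|
random.shuffle(L)
folds = [L[i::K] for i in range(K)]    # split L into subsets L_1, ..., L_K

def train_threshold(training_set):
    """Toy classifier C_i: threshold halfway between the class means."""
    m0 = [x for x, y in training_set if y == 0]
    m1 = [x for x, y in training_set if y == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

# Tr_i = L_i with M_c appended; one classifier per balanced training set.
thresholds = [train_threshold(fold + M_c) for fold in folds]

def predict(x):
    """Combine the K classifiers by simple majority vote."""
    votes = sum(x > t for t in thresholds)
    return 1 if votes > K / 2 else 0

print(predict(3.0), predict(0.0))
```

Each $$\text{Tr}_i$$ is balanced (100 majority vs. 100 minority observations), so no individual classifier can win by always predicting the majority class; the vote plays the role of the ensemble combination step.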

#### Oversampling

We can oversample the minority cases to create balanced datasets, but doing so introduces a dependency in the data that can have far-reaching effects when it comes to bias and variability.

The Synthetic Minority Oversampling Technique (SMOTE) is a common approach which creates "synthetic" examples rather than oversampling with replacement; the same idea is used to create samples for handwriting recognition by perturbing the training data (rotating, skewing, etc.) [244]. The procedure runs as follows:

1. we select random integers $$k\ll \ell$$;

2. we then draw a random sample $$\mathcal{V}_{\ell}$$ of size $$\ell$$ from the minority class $$M_c$$;

3. for each $$\mathbf{x}\in \mathcal{V}_{\ell}$$, we find the $$k$$ nearest neighbors of $$\mathbf{x}$$ in $$M_c$$, say $$\mathbf{z}_{\mathbf{x},1},\ldots,\mathbf{z}_{\mathbf{x},k}$$;

4. we then compute the vectors $$\mathbf{v}_{\mathbf{x},1},\ldots, \mathbf{v}_{\mathbf{x},k}$$, starting from $$\mathbf{x}$$ and ending at each of the $$\mathbf{z}_{\mathbf{x},1},\ldots,\mathbf{z}_{\mathbf{x},k}$$;

5. we draw random numbers $$\gamma_1,\ldots, \gamma_k\sim \mathcal{U}(0,1)$$, and multiply $$\mathbf{v}_{\mathbf{x},i}$$ by $$\gamma_i$$, for each $$1\leq i \leq k$$;

6. the points found at $$\mathbf{x}+\gamma_i\mathbf{v}_{\mathbf{x},i}$$ are added to the set $$M_c$$.

This procedure is repeated until $$|M_c| \not \ll |L|$$ (see illustration below, author unknown).
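The steps above can be sketched in a few lines; the point set, the choices $$k=2$$ and $$\ell=8$$, and the function name are all illustrative assumptions, and a full implementation would append the synthetic points to $$M_c$$ and repeat until balance is reached:

```python
import random

random.seed(0)

def smote_round(M_c, k, ell):
    """One pass of steps 1-6 over a list of 2-D points (tuples)."""
    seeds = random.sample(M_c, ell)          # step 2: draw V_ell from M_c
    synthetic = []
    for x in seeds:
        # step 3: the k nearest neighbours of x in M_c (excluding x itself)
        neighbours = sorted(
            (p for p in M_c if p != x),
            key=lambda p: (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2,
        )[:k]
        for z in neighbours:
            gamma = random.random()          # step 5: gamma ~ U(0, 1)
            # steps 4 and 6: x + gamma * (z - x) lies on the segment
            # from x to its neighbour z
            synthetic.append((x[0] + gamma * (z[0] - x[0]),
                              x[1] + gamma * (z[1] - x[1])))
    return synthetic

# Toy minority class: 20 random points in the unit square (an assumption).
M_c = [(random.random(), random.random()) for _ in range(20)]
new_points = smote_round(M_c, k=2, ell=8)
print(len(new_points))  # 8 seeds x 2 neighbours = 16 synthetic points
```

Because each synthetic point is a convex combination of two minority observations, it necessarily falls within the convex hull of $$M_c$$.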

There are variants: we may always use the same $$k,\ell, \mathcal{V}_{\ell}$$, or pick only one of the $$k$$ nearest neighbours, etc. In general, SMOTE increases recall, but at the cost of lower precision.

### References

[244]
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.