## 13.1 Overview

We will discuss classification in the same context as we discussed regression/value estimation in Section 12.1; the latter should have been read before embarking on this chapter (or, at the very least, should be read concurrently). In particular, it is expected that readers are familiar with training sets \(\text{Tr}\) and testing sets \(\text{Te}\) for a dataset with \(n\) observations \(\mathbf{x}_1,\ldots, \mathbf{x}_n\), \(p\) predictors \(X_1, \ldots, X_p\) and response variable \(Y\) (see Figure 12.5 for details).

One important note about this module’s notation: in the design matrix \(\mathbf{X}\), a row corresponds to the signature vector of an observation (the values of the predictors); when we write \(\mathbf{x}\) or \(\boldsymbol{\beta}\), we typically understand those to be **column vectors**. When in doubt, remember that all matrices/vectors involved must have compatible dimensions when multiplied or compared; that will sometimes mean that vectors must be viewed as row-vectors rather than column vectors, and *vice-versa*, depending on the context.

### 13.1.1 Formalism

In a **classification setting**, the response \(Y\) is **categorical**, which is to say that \(Y\in\mathcal{C}\), where \(\mathcal{C}=\{C_1,\ldots, C_K\}\), but the supervised learning objectives remain the same:

build a classifier \(C(\mathbf{x}^*)\) that assigns a label \(C_k\in \mathcal{C}\) to test observations \(\mathbf{x}^*\);

understand the role of the predictors in this assignment, and

assess the uncertainty and the accuracy of the classifier.

The main difference with the regression setting (to be fair, it’s a big one), is that we do not have access to an MSE-type metric to evaluate the classifier’s performance.

The counterpart of the regression function \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}]\] is defined as follows: for \(1\leq k\leq K\), let \(p_k(\mathbf{x})=P(Y=C_k\mid \vec{X}=\mathbf{x})\) (in other words, pick the most numerous categorical label of observations for which the signature vector is \(\mathbf{x}\)) – the **Bayes optimal classifier** at \(\mathbf{x}\) is the function \[C(\mathbf{x})=C_j\quad \text{if }p_j(\mathbf{x})=\max\{p_1(\mathbf{x}),\ldots,p_K(\mathbf{x})\}.\]
As was the case or regression, it could be that there are too few observations at \(\vec{X}=\mathbf{x}\) to estimate the probability exactly, in which case we might want to allow for **nearest neighbour averaging**:
\[\hat{C}(\mathbf{x})=C_j,\quad \text{if }\tilde{p}_j(\mathbf{x})=\max\{\tilde{p}_1(\mathbf{x}),\ldots,\tilde{p}_K(\mathbf{x}) \},\]
where \(\tilde{p}_k(\mathbf{x})=P(Y=C_k\mid \vec{X}\in N(\mathbf{x}))\) and \(N(\mathbf{x})\) is a neighbourhood of \(\mathbf{x}\).^{221}

The quantity that plays an analogous role to the MSE for \(\tilde{C}(\mathbf{x})\) is the **misclassification error rate**: \[\text{ERR}_\text{Te}=\frac{1}{M}\sum_{j=N+1}^M\mathcal{I}[y_j\neq \tilde{C}(\mathbf{x}_j)],\] where \(\mathcal{I}\) is the **indicator function** \[\mathcal{I}[\text{condition}]=\begin{cases}0 & \text{if the condition is false} \\ 1 & \text{otherwise}\end{cases}\]
The Bayes optimal classifier \(C(\mathbf{x})\) is the optimal classifier with respect to \(\text{ERR}_\text{Te}\); the **Bayes error rate** \[\eta_{\mathbf{x}}=1-\text{E}\left[\max_kP(Y=C_k\mid \vec{X}=\mathbf{x})\right]\] corresponds to the irreducible error and provides a lower limit on any classifier’s expected error.

Most classifiers build structured models \(\hat{C}(\mathbf{x})\) which **directly** approximate the Bayes optimal classifier \(C(\mathbf{x})\) (such as support vector machines or naı̈ve Bayes classifiers), but some classifiers build structured models \(\hat{p}_k(\mathbf{x})\) for the **conditional probabilities** \(p_k(\mathbf{x})\), \(1\leq j\leq K\), which are then used to build \(\hat{C}(\mathbf{x})\) (such as logistic regression, generalized additive models, and \(k-\)nearest neighbours).

The latter models are said to be **calibrated** (i.e., the relative values of \(\hat{p}_k(\mathbf{x})\) represent relative probabilities), whereas the former are **non-calibrated** (only the most likely outcome is provided; it is impossible to say to what extent a given outcome is more likely than another).

### 13.1.2 Model Evaluation

The **confusion matrix** of a classifier on \(\text{Te}\) is a tool to evaluate the model’s performance:

\(\alpha\) | prediction |
||

0 | 1 | ||

actual |
0 | TP | FN |

1 | FP | TN |

Here, TP stands for **true positive**, FN for **true negative**, FP for **false positive**, and TN for **true negative**. There are various **classifier evaluation metrics**; if the testing set \(\text{Te}\) has \(M\) observations, then:

**accuracy**measures the correct classification rate \(\frac{\text{TP}+\text{TN}}{M}\);**misclassifcation**is \(\frac{\text{FP}+\text{FN}}{M}=1-\text{accuracy}\);**false positive rate**(FPR) is \(\frac{\text{FP}}{\text{FP}+\text{TN}}\);**false negative rate**(FNR) is \(\frac{\text{FN}}{\text{TP}+\text{FN}}\);**true positive rate**(TPR) is \(\frac{\text{TP}}{\text{TP}+\text{FN}}\);**true negative rate**(TNR) is \(\frac{\text{TN}}{\text{FP}+\text{TN}}\);

There are other measures, including the \(F_1-\)score, the Matthews’ correlation coefficient, etc. [243]. One thing to remember is that we should not put all the performance evaluation eggs in the same metric basket!

### 13.1.3 Bias-Variance Trade-Off

The **bias-variance trade-off** (see Section 12.1.4) is also observed in classifiers, although the decomposition is (of necessity) different (see [2] for details).

In a **\(k-\)nearest neighbours** classifier, for instance, the prediction for a new observation with predictors \(\mathbf{x}^*\in \text{Te}\) is obtained by finding the most frequent class label of the \(k\) nearest neighbours to \(\mathbf{x}^*\) in \(\text{Tr}\) on which the model \(\hat{C}_{k\text{NN}}(\mathbf{x})\) is built.

As the number of nearest neighbours under consideration increases, the complexity of the the model \(\hat{C}_{k\text{NN}}(\mathbf{x})\) decreases, and *vice-versa*. We would thus expect:

a model with a large \(k\) to underfit the data;

a model with a small \(k\) to overfit the data, and

models in the “Goldilock zone” to strike a balance between

**prediction accuracy**and**interpretability of the decision boundary**(see Figures 12.6 and 13.1).

As it happens, the optimal classifier \(Y=C(\vec{X})\) is, in fact, the Bayes optimal classifier.

#### Comparison Between \(k\)NN and OLS

We are going to try to get a better intuitive sense of the bias-variance trade-off by comparing ordinary least squares (OLS), a rigid yet simple model (as measured by the number of **effective parameters**), with \(k-\)nearest neighbours (\(k\)NN), a very flexible yet more complex model (again, according to the number of effective parameters). Given an input vector \(\mathbf{z}\in \mathbb{R}^p\), the **\(k-\)nearest neighbours** (\(k\)NN) model predicts the response \(Y\) as the average \[\hat{Y}=\text{Avg}\{Y(\mathbf{x})\mid \mathbf{x}\in N_k(\mathbf{z})\}=\frac{1}{k}\sum_{\mathbf{x}\in N(\mathbf{z})}Y(\mathbf{x}),\] where \(Y(\mathbf{x})\) is the known response for predictor \(\mathbf{x}\in \text{Tr}\) and \(N_k(\mathbf{z})\) is the set of the \(k\) training observations nearest to \(\mathbf{z}\).^{222}

For classification problems, \(k\)NN models use the mode instead of the average. Of course, the prediction may depend on the value of \(k\): in the **classification** image below, the \(6\)NN prediction would be a red star, whereas the \(19\)NN model prediction would be a blue disk.

The following classification example (based on [2]) illustrates some of the trade-off consequences. Consider a training dataset \(\text{Tr}\) consisting of \(200\) observations with features \((x_1,x_2)\in \mathbb{R}^2\) and responses \(y\in \{\text{BLUE}_{(=0)},\text{ORANGE}_{(=1)}\}\). Throughout, let \([\cdot]:\mathbb{R}\to \{\text{BLUE},\text{ORANGE}\}\) denote the function \[[w]=\begin{cases} \text{BLUE} & w\leq 0.5 \\ \text{ORANGE} & w> 0.5 \end{cases}\]

#### Linear Fit

Fit an OLS model \[\hat{y}(\mathbf{x})=\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2\] on \(\text{Tr}\); the class prediction is \(\hat{g}(\mathbf{x})=[\hat{y}(\mathbf{x})]\). The **decision boundary** \[\partial_{\text{OLS}} = \{(x_1,x_2)\mid \hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2=0.5\}\] is shown in Figure 13.2 (on the left); it is a straight line which can be described using only 2 **effective parameters**.^{223}

There are several **misclassifications** on both sides of \(\partial_{\text{OLS}}\); even though errors seem to be unavoidable, the OLS model is likely to be **too rigid**.

#### \(k\)NN Fit

If \(\hat{y}(\mathbf{x})\) represents the proportion of ORANGE points in \(N_{k}(\mathbf{x})\), then the class prediction is \(\hat{g}(\mathbf{x})=[\hat{y}(\mathbf{x})]\). The decision boundaries \(\partial_{1\text{NN}}\) and \(\partial_{15\text{NN}}\) are displayed in Figure @(fig:second) (in the middle and on the right, respectively).

They are both irregular: \(\partial_{1\text{NN}}\) is overfit, whereas \(\partial_{15\text{NN}}\) is probably not so (although neither is great for **interpretability**). The effective parameters are not as obviously defined for this model; one approach is to view \(k\)NN as a model that fits \(1\) parameter (a mean) to each **ideal** (non-overlapping) **neighbourhood** in the data, so that the number of effective parameters is roughly equal to the number of such neighbourhoods: \[\frac{N}{k}\approx \begin{cases} 13 & \text{when $k=15$} \\ 200 & \text{when $k=1$} \end{cases}\]
The \(k\)NN models are thus fairly complex, in comparison with the OLS model. There are no misclassification for \(k=1\), and several in the case \(k=15\) (but not as many as with the OLS model). The \(15\)NN model seems to strike a balance between various competing properties; it is likely nearer the “sweet spot” of the test error curve.^{224}

#### Conclusions

The OLS model is **stable** (adding a few training observations is unlikely to alter the fit substantially), but **biased** (the assumptions of a valid linear fit is questionable); the \(k\)NN models are **unstable** (adding a few training observations is likely to alter the fit substantially, especially for small values of \(k\)), but **unbiased** (no apparent assumptions are made about the data).

So which approach is best? That depends entirely on what the ultimate task is: description, prediction, etc. In predictive data science, machine learning, and artificial intelligence, the validity of modeling assumptions take a backseat to a model’s ability to **make good predictions on new (and unseen) observations**.

Naturally, we would expect that models whose assumptions are met are more likely to make good predictions than models for whom that is not the case, but it does not need to be the case. The theory of linear models is mature and extensive, and we could have discussed a number of its other features (see Section **??** for details).

Keep in mind, then, that machine learning methods are not meant to replace or supplant classical statistical analysis methods, but rather, to **complement them**. They simply provide different approaches to **gain insights from data**.

### References

*The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.*Springer, 2008.

*An Introduction to Statistical Learning: With Applications in R*. Springer, 2014.