## 13.4 Other Supervised Approaches

In this section, we present a number of **non-parametric** approaches, with a focus on classification methods (although we will also discuss regression problems):

- tree-based methods;
- support vector machines (SVM), and
- naïve Bayes classification (NBC).

### 13.4.1 Tree-Based Methods

This family of methods involves **stratifying** or **segmenting** the predictor space into a small number of “simple” regions.

The set of **splitting rules** used to segment the space can be summarized using a **tree**, whence their name. On the one hand, tree-based methods are **simple** and **easy to interpret**; on the other, they are not competitive with the best supervised learning methods when it comes to predictive accuracy.

Nevertheless, there are instances where the gain in interpretability outweighs the loss in predictive accuracy. Tree-based methods are applicable both to regression and to classification problems.

#### Regression Trees

We introduce the important concepts via the 2011 Gapminder dataset, where the response \(Y\) is once again the life expectancy of nations, and the predictors \(X_1\) and \(X_2\) are the fertility rates and infant mortality rates per nation. In Figure 13.3, we see that when \(X_1\) and \(X_2\) are both high, \(Y\) is low, and when \(X_1\) and \(X_2\) are both low, \(Y\) is high. But what is the pattern “in the middle”?

Below, we see a possible **regression tree** for the (full) dataset (\(N=166\) observations).

```
reg.tree.1 <- rpart::rpart(life_expectancy~fertility+infant_mortality,
data=gapminder.2011)
rpart.plot::rpart.plot(reg.tree.1, box.palette="RdBu", shadow.col="gray", nn=TRUE)
```

This can also be re-written as:

```
1) root (166) 70.82349
2) infant_mortality>=35.65 (54) 60.85370
4) infant_mortality>=52.9 (28) 58.30714 *
5) infant_mortality< 52.9 (26) 63.59615 *
3) infant_mortality< 35.65 (112) 75.63036
6) infant_mortality>=9.35 (62) 72.89516
12) infant_mortality>=22.85 (18) 69.50000 *
13) infant_mortality< 22.85 (44) 74.28409 *
7) infant_mortality< 9.35 (50) 79.02200
14) infant_mortality>=4.25 (23) 76.86087 *
15) infant_mortality< 4.25 (27) 80.86296 *
```

Node 1 is the tree’s **root** (initial node) with 166 (100%) observations; the average life expectancy for these observations is \(70.82\).

The root is also the tree’s first **branching point**, separating the observations into two groups: node 2 with 54 observations (33%), given by infant mortality \(\geq 35.65\), for which the average life expectancy is \(60.85\), and node 3 with 112 observations (67%), given by infant mortality \(<35.65\), for which the average life expectancy is \(75.63\).

Note that \(54+112=166\) and that \[\frac{54(60.85)+112(75.63)}{54+112}=70.82.\] Node 2 is an **internal node** – it is further split into two groups: node 4 with 28 observations (17%), given with the additional rule infant mortality \(\geq 52.9\), for which the average life expectancy is \(58.31\), and node 5 with 26 observations (16%), given with the additional rule infant mortality \(<52.9\), for which the average life expectancy is \(63.60\).

Note that \(28+26=54\) and that \[\frac{28(58.31)+26(63.60)}{28+26}=60.85.\] Both nodes 4 and 5 are **leaves** (final nodes, terminal nodes); the tree does not grow any further on that branch.
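These bookkeeping identities are easy to check directly in `R`:

```
# Sanity check: each parent node's mean response is the weighted
# average of its children's means (values from the tree printout).
root.mean <- (54 * 60.85 + 112 * 75.63) / (54 + 112)
node2.mean <- (28 * 58.31 + 26 * 63.60) / (28 + 26)
round(root.mean, 2)    # 70.82
round(node2.mean, 2)   # 60.86 (the small discrepancy with 60.85 is due to rounding)
```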

The tree continues to grow from node 3, eventually leading to 4 leaves on that branch (there is an intermediate branching point). There are 6 leaves in total, 5 branching points (including the root) and the tree’s **depth** is 3 (excluding the root).

Only one feature is used in the regression tree in this example: to make a prediction for a new observation, only infant mortality is needed. If it was 21, say, the observation’s leaf would be node 13 and we would predict that the life expectancy of that nation would be 74.28.
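Since only infant mortality appears in the splits, the tree's prediction rule can be written out explicitly (a sketch; the function name is ours, and the leaf means are taken from the printout above):

```
# Hard-coded decision cascade for the regression tree's six leaves:
predict.le <- function(infant.mortality) {
  if (infant.mortality >= 52.9)  return(58.30714)  # node 4
  if (infant.mortality >= 35.65) return(63.59615)  # node 5
  if (infant.mortality >= 22.85) return(69.50000)  # node 12
  if (infant.mortality >= 9.35)  return(74.28409)  # node 13
  if (infant.mortality >= 4.25)  return(76.86087)  # node 14
  return(80.86296)                                 # node 15
}
predict.le(21)   # falls in node 13: 74.28409
```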

The tree diagram is a useful heuristic, especially as it allows us to display the results without resorting to a multi-dimensional chart, but it does obscure the **predictor space’s stratification**. In our example, let \[\begin{aligned} R_4&=\{(\text{infant mortality},\text{fertility})\mid \text{infant mortality} \geq 52.9\} \\ R_5&=\{(\text{infant mortality},\text{fertility})\mid 35.65\leq \text{infant mortality} < 52.9\} \\ R_{12}&=\{(\text{infant mortality},\text{fertility})\mid 22.85\leq \text{infant mortality} < 35.65\} \\ R_{13}&=\{(\text{infant mortality},\text{fertility})\mid 9.35\leq \text{infant mortality} < 22.85\} \\ R_{14}&=\{(\text{infant mortality},\text{fertility})\mid 4.25\leq \text{infant mortality} < 9.35\} \\ R_{15}&=\{(\text{infant mortality},\text{fertility})\mid \text{infant mortality} < 4.25\} \end{aligned}\]
Only infant mortality is involved in the definition of the tree’s **terminal nodes**. The regions are shown below:

The regression tree model for life expectancy would thus be \[\hat{y}_{R_i}=\text{Avg}\{y\mid (x_1,x_2) \in R_i\}=\begin{cases} 58.3, & i=4 \\ 63.6, & i=5 \\ 69.5, & i=12 \\ 74.3, & i=13 \\ 76.9, & i=14 \\ 80.9, & i=15\end{cases}\]
The regression tree tells us that infant mortality is the most important factor in determining life expectancy, with a negative correlation. This interpretation is, of course, a coarse oversimplification, but it highlights the advantage of using a regression tree when it comes to **displaying**, **interpreting**, and **explaining** the results.

This tree is not the only way to stratify the data: in what way is it **optimal**,^{230} as opposed to some other tree?

#### Building A Regression Tree

The process is quite simple:

Divide the predictor space \(\mathcal{X}\subseteq \mathbb{R}^p\) into a disjoint union of \(J\) regions \(R_j\), \(j=1,\ldots, J\): \[\mathcal{X}=R_1\sqcup \cdots \sqcup R_J;\]

for any \(\mathbf{x}\in R_j\), \[\hat{y}(\mathbf{x})=\text{Avg}\{y(\mathbf{z})\mid \mathbf{z}\in R_j\cap \text{Tr}\}.\]

The second step tells us that trees are locally constant (although it would be feasible to define trees that are not locally constant).

In theory, the \(R_j\) could have any shape as long as they form a disjoint cover of \(\mathcal{X}\); in practice, we use **hyperboxes** whose boundaries are \((p-1)\)-dimensional affine subspaces, each perpendicular to one of the coordinate axes \(X_k\), \(k=1, \ldots, p\).

We find the optimal \((R_1,\ldots, R_J)\) by minimizing \[\text{SSRes}=\sum_{j=1}^J\sum_{\mathbf{x}_i\in R_j}(y_i-\hat{y}_{R_j})^2,\] where \(\hat{y}_{R_j}\) is the mean response \(y\) in \(R_j\cap \text{Tr}\). In an ideal world, we would compute \(\text{SSRes}\) for all partitions of \(\mathcal{X}\) into hyperboxes, and pick the one that minimizes \(\text{SSRes}\), but that is not computationally feasible, in general.

Instead, we use a tree-growing approach known as **recursive binary splitting**, which is both **top-down** (it starts at the root and successively splits \(\mathcal{X}\) *via* 2 new branches down the tree) and **greedy** (at each step of the splitting process, the best choice is made then and there, rather than by looking at long-term consequences).

#### Regression Tree Algorithm

The algorithm has 10 steps, but it is fairly straightforward.

1. Let \(\hat{y}_0=\text{Avg}\{y\mid (\mathbf{x},y)\in \text{Tr}\}\).

2. Set the baseline \(\text{SSRes}_0=\sum_{i=1}^N(y_i-\hat{y}_0)^2\).

3. For each \(1\leq k\leq p\), order the predictor values of \(X_k\) in \(\text{Tr}\): \(v_{k,1}\leq v_{k,2}\leq \cdots \leq v_{k,N}\).

4. For each \(X_k\), set \(s_{k,\ell}=\frac{1}{2}(v_{k,\ell}+v_{k,\ell+1})\), \(\ell=1,\ldots, N-1\).

5. For each \(k=1,\ldots, p\), \(\ell=1,\ldots, N-1\), define \[R_1(k,\ell)=\{\vec{X}\in \mathbb{R}^p \mid X_{k}<s_{k,\ell}\}\quad\mbox{and}\quad R_2(k,\ell)=\{\vec{X}\in \mathbb{R}^p \mid X_{k}\geq s_{k,\ell}\}.\] Note that \(\mathcal{X}=R_1(k,\ell)\sqcup R_2(k,\ell)\) for all \(k,\ell\).

6. For each \(k=1,\ldots, p\), \(\ell=1,\ldots, N-1\), set \[\text{SSRes}_{1}^{k,\ell}=\sum_{m=1}^2\sum_{\vec{X}_i\in R_{m}(k,\ell)}(y_i-\hat{y}_{R_{m}(k,\ell)})^2,\] where \(\hat{y}_{R_{m}(k,\ell)}=\text{Avg}\{y(\mathbf{x})\mid \mathbf{x}\in \text{Tr}\cap R_{m}(k,\ell)\}\).

7. Find \(k^*,\ell^*\) for which \(\text{SSRes}_{1}^{k,\ell}\) is **minimized**.

8. Define the **children sets** \(R_1^L=R_1(k^*,\ell^*)\) and \(R_1^R=R_2(k^*,\ell^*)\).

9. While some children sets \(R^\nu_{\mu}\) still do not meet a **stopping criterion**, repeat steps 3 to 8, searching and minimizing \(\text{SSRes}\) over \(\mathcal{X}\cap R^{\nu}_{\mu}\), and producing a binary split \(R^{L}_{\mu+1}\), \(R^R_{\mu+1}\).^{231}

10. Once the stopping criterion is met for all children sets, the tree’s growth ceases, and \(\mathcal{X}\) has been partitioned into \(J\) regions (renumbering as necessary) \[\mathcal{X}=R_1\sqcup \cdots \sqcup R_J,\] on which the regression tree predicts the \(J\) responses \(\{\hat{y}_1,\ldots,\hat{y}_J\}\), according to \(\hat{y}_j=\text{Avg}\{y(\mathbf{x})\mid \mathbf{x}\in R_j\}\).
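A bare-bones sketch of the greedy search for a single split on one predictor, run on simulated data (the function and variable names are ours):

```
# One recursive-binary-splitting step: scan all candidate midpoints of
# a single predictor and keep the split point minimizing SSRes.
best.split <- function(x, y) {
  v <- sort(unique(x))
  s <- (head(v, -1) + tail(v, -1)) / 2   # candidate split points
  ssres <- sapply(s, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = s[which.min(ssres)], ssres = min(ssres))
}
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 2, 5) + rnorm(100, sd = 0.1)
best.split(x, y)$split   # recovers a split point near the true break at 0.4
```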

For instance, if the training set was \(\text{Tr}=\{(x_{1,i},x_{2,i},y_i)\}_{i=1}^{N}\), the algorithm might provide the regression tree in Figure 13.11.

In `R`, the recursive binary splitting algorithm is implemented in the `rpart` package’s `rpart()` function.

#### Tree Pruning

Regression trees grown with the algorithm are prone to overfitting; they can provide good predictions on \(\text{Tr}\), but they usually make shoddy predictions on \(\text{Te}\), because the resulting tree might be too complex (it fits the noise as well as the signal).

A smaller tree with fewer splits might lead to lower variance and better interpretability, at the cost of a little bias. Instead of simply growing a tree \(T_0\) until each leaf contains at most \(M\) observations, say (or whatever other stopping criterion), it could be beneficial to **prune** it in order to obtain some **optimal subtree**.

We use **cost complexity pruning** (CCP) to build a sequence of candidate subtrees indexed by the complexity parameter \(\alpha\geq 0\). For each such \(\alpha\), find a subtree \(T_{\alpha}\subseteq T_0\) which minimizes \[\begin{aligned} \text{SSRes}+\text{complexity penalty}=\sum_{m=1}^{|T|}\sum_{\mathbf{x}_i\in R_m}(y_i-\hat{y}_{R_m})^2+\alpha|T|,\end{aligned}\] where \(|T|\) represents the number of final nodes in \(T\); when \(\alpha\) is large, it is costly to have a complex tree.^{232}
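The trade-off can be illustrated with two hypothetical subtrees (the \(\text{SSRes}\) values and leaf counts are invented for the example):

```
# Penalized cost SSRes + alpha * |T| for two candidate subtrees:
ccp.cost <- function(ssres, n.leaves, alpha) ssres + alpha * n.leaves
ccp.cost(100, 6, alpha = 0)    # 100: the deep tree (6 leaves) wins when alpha = 0
ccp.cost(140, 2, alpha = 0)    # 140
ccp.cost(100, 6, alpha = 25)   # 250: for large alpha...
ccp.cost(140, 2, alpha = 25)   # 190: ...the shallow tree wins
```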

#### Pruning Algorithm

Assume that a recursive binary splitting regression tree \(T_0\) has been grown on \(\text{Tr}\), using some pre-determined stopping criterion:

1. apply CCP to \(T_0\) to obtain a “sequence” \(T_\alpha\) of subtrees of \(T_0\);

2. divide \(\text{Tr}\) into \(K\) folds;

3. for all \(k=1,\ldots, K\), build a regression tree on \(\text{Tr}\setminus \text{Fold}_k\) and evaluate \[\widehat{\text{MSE}}(\alpha)=\text{Avg}_{1\leq k\leq K}\{\text{MSE}_k(\alpha)\mid \text{on Fold}_k\};\]

4. return \(T_{\alpha^*}\) from step 1, where \(\alpha^*=\arg\min_{\alpha}\{\widehat{\text{MSE}}(\alpha)\}\).

The Gapminder 2011 tree is pruned in the figures below, using the `rpart` function `plotcp()`.

```
reg.tree.1.pruned.2 <- rpart::rpart(life_expectancy~fertility+infant_mortality,
data=gapminder.2011, cp=0.06)
rpart.plot::rpart.plot(reg.tree.1.pruned.2, box.palette="RdBu",
shadow.col="gray", nn=TRUE)
```

```
reg.tree.1.pruned.3 <- rpart::rpart(life_expectancy~fertility+infant_mortality,
data=gapminder.2011, cp=0.028)
rpart.plot::rpart.plot(reg.tree.1.pruned.3, box.palette="RdBu",
shadow.col="gray", nn=TRUE)
```

```
reg.tree.1.pruned.4 <- rpart::rpart(life_expectancy~fertility+infant_mortality,
data=gapminder.2011, cp=0.02)
rpart.plot::rpart.plot(reg.tree.1.pruned.4, box.palette="RdBu",
shadow.col="gray", nn=TRUE)
```

#### Classification Trees

The approach for classification is much the same, with a few appropriate substitutions:

- prediction in a terminal node is either the **class label mode** or the **relative frequency** of the class labels;
- \(\text{SSRes}\) must be replaced by some other fit measure:
  - the **classification error rate** \[E=\sum_{j=1}^J(1-\max_k \{\hat{p}_{j,k}\}),\] where \(\hat{p}_{j,k}\) is the proportion of \(\text{Tr}\) observations in \(R_j\) of class \(k\) (this measure is not a recommended choice, however);
  - the **Gini index**, which measures the total variance across classes \[G=\sum_{j=1}^J\sum_k \hat{p}_{j,k}(1-\hat{p}_{j,k}),\] which should be small when the nodes are **pure** (\(\hat{p}_{j,k}\approx 0\) or \(1\) throughout the regions), and
  - the **cross-entropy deviance** \[D=-\sum_{j=1}^J\sum_k \hat{p}_{j,k}\ln \hat{p}_{j,k},\] which behaves like the Gini index, numerically.
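A quick base-`R` illustration of the three measures for a single node (the class proportions `p.hat` are made up for the example):

```
# Impurity measures for one node with class proportions p.hat:
p.hat <- c(0.8, 0.15, 0.05)
E.node <- 1 - max(p.hat)            # classification error rate: 0.2
G.node <- sum(p.hat * (1 - p.hat))  # Gini index: 0.335
D.node <- -sum(p.hat * log(p.hat))  # cross-entropy deviance: about 0.613
```

All three are small when one proportion dominates (a pure node) and large when the proportions are balanced.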

One thing to note is that **classification and regression trees** (jointly known as CART) suffer from high variance and their structure is **unstable** – using different training sets typically gives rise to wildly varying trees.

As an extreme example, simply modifying the level of only one of the predictors in only one of the observations can yield a tree with a completely different topology, as in Figure 13.16.

This lack of robustness is a definite strike against the use of CART; despite this, the relative ease of their implementation makes them a popular classification tool.

#### Examples

A classification tree for the Gapminder 2011 dataset is shown below:

```
reg.tree.2 <- rpart::rpart(LE~fertility+infant_mortality+gdp,
data=gapminder.2011)
rpart.plot::rpart.plot(reg.tree.2, box.palette="RdBu",
shadow.col="gray", nn=TRUE)
```

```
1) root (166) high/low (0.5 0.5)
2) infant_mortality< 23 (94) high (0.862 0.138)
3) infant_mortality>=23 (72) low (0.028 0.972)
```

The stratification in the scatter plot is shown below.

```
ggplot2::ggplot(gapminder.2011) +
ggplot2::geom_point(ggplot2::aes(fill=LE, x=infant_mortality, y=fertility), pch=22) +
ggplot2::theme_bw() +
ggplot2::theme(legend.position = "bottom") +
ggplot2::geom_vline(xintercept = c(23), linetype="dashed", color = "blue", size=1) +
ggplot2::xlab("Infant Mortality") + ggplot2::ylab("Fertility")
```

Note that this tree should not be used for predictions as it was not built on a training subset of the data.

We now revisit the Iowa Housing Price example of Section 12.5.2. Recall that we had built a training set `dat.train` with \(n=1160\) observations relating to the selling price `SalePrice` of houses in Ames, Iowa. We build a CART for that response variable, demanding at least 5 observations per leaf.

```
n= 1160
node), split, n, deviance, yval
* denotes terminal node
1) root 1160 7371073.00 180.1428
2) OverallQual< 7.5 985 2301952.00 157.4737
4) Neighborhood=Blueste,BrDale,BrkSide,Edwards,IDOTRR,MeadowV,Mitchel,NAmes,NPkVill,OldTown,Sawyer,SWISU 580 672022.40 132.2839
8) X1stFlrSF< 1050.5 332 240468.90 118.2474 *
9) X1stFlrSF>=1050.5 248 278574.40 151.0747 *
5) Neighborhood=Blmngtn,ClearCr,CollgCr,Crawfor,Gilbert,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,StoneBr,Timber,Veenker 405 734856.30 193.5480
10) GrLivArea< 1732.5 281 296100.70 178.2365
20) GrLivArea< 1204 56 25438.63 143.0286 *
21) GrLivArea>=1204 225 183967.70 186.9993 *
11) GrLivArea>=1732.5 124 223588.10 228.2459 *
3) OverallQual>=7.5 175 1713865.00 307.7376
6) OverallQual< 8.5 126 498478.90 273.6180
12) GrLivArea< 1925.5 71 167331.90 246.3000 *
13) GrLivArea>=1925.5 55 209762.20 308.8831 *
7) OverallQual>=8.5 49 691521.30 395.4736
14) Neighborhood=CollgCr,Edwards,Gilbert,NridgHt,Somerst,StoneBr,Timber,Veenker 44 358089.40 373.9441
28) Neighborhood=CollgCr,Edwards,Somerst,Timber 11 39962.11 293.0025 *
29) Neighborhood=Gilbert,NridgHt,StoneBr,Veenker 33 222038.20 400.9246
58) GrLivArea< 2260 20 42801.90 358.2032 *
59) GrLivArea>=2260 13 86576.40 466.6498 *
15) Neighborhood=NoRidge 5 133561.60 584.9336 *
```

There are many ways to plot the tree:

These are fully grown trees. Now we use `rpart`’s `plotcp()` to determine how to control the tree’s growth (we prune the tree).

From this plot, we see that the tuning parameter should be around 0.024, so we prune the tree as follows:

```
n= 1160
node), split, n, deviance, yval
* denotes terminal node
1) root 1160 7371073.0 180.1428
2) OverallQual< 7.5 985 2301952.0 157.4737
4) Neighborhood=Blueste,BrDale,BrkSide,Edwards,IDOTRR,MeadowV,Mitchel,NAmes,NPkVill,OldTown,Sawyer,SWISU 580 672022.4 132.2839 *
5) Neighborhood=Blmngtn,ClearCr,CollgCr,Crawfor,Gilbert,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,StoneBr,Timber,Veenker 405 734856.3 193.5480
10) GrLivArea< 1732.5 281 296100.7 178.2365 *
11) GrLivArea>=1732.5 124 223588.1 228.2459 *
3) OverallQual>=7.5 175 1713865.0 307.7376
6) OverallQual< 8.5 126 498478.9 273.6180 *
7) OverallQual>=8.5 49 691521.3 395.4736
14) Neighborhood=CollgCr,Edwards,Gilbert,NridgHt,Somerst,StoneBr,Timber,Veenker 44 358089.4 373.9441 *
15) Neighborhood=NoRidge 5 133561.6 584.9336 *
```

We go from 11 to 6 leaves. The structure of the pruned tree is plotted below.

Now that we have a full regression tree and a pruned tree, our last task is to see how well they perform as predictive models on the test data `dat.test`.

For the full tree, we compute the reduction in \(\text{SSE}\) obtained by using the full tree (as opposed to the mean model):

```
yhat.RT = predict(RT, dat.test)
SSE.RT = sum((yhat.RT-dat.test$SalePrice)^2)
SSE.average = sum((mean(dat.test$SalePrice)-dat.test$SalePrice)^2)
round((1-SSE.RT/SSE.average), digits=3)
```

`[1] 0.703`

For the pruned tree, the corresponding reduction is:

```
yhat.RT.p = predict(RT.p, dat.test)
SSE.RT.p = sum((yhat.RT.p-dat.test$SalePrice)^2)
round((1-SSE.RT.p/SSE.average), digits=3)
```

`[1] 0.646`

This suggests that pruning is a reasonable approach, in this case (based on the \(\text{SSE}\) reduction). The predictions of both trees are plotted against the actual `SalePrice` values in the next plots.

```
xlimit = ylimit = c(0,600)
plot(NA, col=2, xlim=xlimit, ylim=ylimit, xlab="Predicted Price ($1,000)",
ylab="Actual Price ($1,000)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(yhat.RT, dat.test$SalePrice, col=2)
legend(0,600, legend=c("Full Tree"), col=c(2), pch=rep(1), bg='light grey')
```

```
plot(NA, col=2, xlim=xlimit, ylim=ylimit, xlab="Predicted Price ($1,000)",
ylab="Actual Price ($1,000)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(yhat.RT.p, dat.test$SalePrice, col=3)
legend(0,600, legend=c("Pruned Tree"), col=c(3), pch=rep(2), bg='light grey')
```

Obviously, there are some departures from the actual response values, but given that the regression trees can only predict a small number of selling prices (corresponding to the tree leaves), these predictions are reasonably accurate.

```
[1] 0.8412035
[1] 0.8047534
```

How do these CART predictions compare with the MARS predictions? The Iowa Housing dataset contains information about sale prices from 2006 to 2010; would you use the model to make predictions about 2022 sale prices?

Classification trees work in the same manner, although the evaluation step can be conducted in two ways: we can build trees that predict class membership (`type="class"`) or probability of class membership (`type="prob"`). Examples of how to work with these `predict()` options are provided in Section 11.7.2.

### 13.4.2 Support Vector Machines

The next classifier is more sophisticated, from a mathematical perspective. It was invented by computer scientists in the 1990s.

**Support vector machines** (SVM) attempt to find hyperplanes that separate the classes in the feature space. In Figure 13.19, we see an artificial dataset with 3 features: \(X_1\) and \(X_2\) (numerical), and \(Y\) (categorical, represented by different symbols).

We grow a classification tree (perhaps the one shown in Figure 13.20): two of the leaves are pure, but the risk of misclassification is fairly large in the other 2 (at least for that tree).^{233} Without access to more features, that tree is as good as it gets.^{234}

But it is easy to draw a decision curve which improves on the effectiveness of the decision tree (see image on the right): a single observation is misclassified by this rule.^{235}

Separating hyperplanes do not always exist; we may need to:

- extend our notion of **separability**, and/or
- extend the feature space so separation becomes possible.

A **hyperplane** \(H_{\boldsymbol{\beta},\beta_0}\subseteq \mathbb{R}^p\) is an affine (“flat”) subset of \(\mathbb{R}^p\), with \[\dim \left(H_{\boldsymbol{\beta},\beta_0}\right)=p-1;\] in other words, \[H_{\boldsymbol{\beta},\beta_0}:\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}=\beta_0+\beta_1x_1+\cdots+\beta_px_p=0.\]
The vector \(\boldsymbol{\beta}\) is normal to \(H_{\boldsymbol{\beta},\beta_0}\); if \(\beta_0=0\), \(H_{\boldsymbol{\beta},\beta_0}\) goes through the origin in \(\mathbb{R}^p\). Set \(F(\mathbf{x})=\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}\); then \(F(\mathbf{x})>0\) for points on one “side” of \(H_{\boldsymbol{\beta},\beta_0}\) and \(F(\mathbf{x})<0\) for points on the other.^{236}
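A tiny sketch of this sign test in `R` (the coefficients and points are arbitrary):

```
# F(x) = beta0 + beta'x determines the side of the hyperplane:
F.side <- function(x, beta, beta0) beta0 + sum(beta * x)
beta <- c(1, -2); beta0 <- 0.5
F.side(c(2, 1), beta, beta0)   #  0.5 > 0: one side
F.side(c(0, 1), beta, beta0)   # -1.5 < 0: the other side
```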

In a binary classification problem \(\mathcal{C}=\{C_1,C_2\}=\{\pm 1\}\). If \[y_iF(\mathbf{x}_i)>0,\quad\text{ for all }(\mathbf{x}_i,y_i)\in \text{Tr}\] (or, \(y_iF(\mathbf{x}_i)<0\) for all \((\mathbf{x}_i,y_i)\in \text{Tr}\)), then \(F(\mathbf{x}_i)=0\) determines a **separating hyperplane** for \(\text{Tr}\) (which does not need to be unique, see Figure 13.21), and we say that \(\text{Tr}\) is **linearly separable**.

Among all separating hyperplanes, the one which provides the widest separation between the two classes is the **maximal margin hyperplane** (MMH); training observations on the boundary of the **separating strip** are called the **support vectors** (see observations in Figure 13.22).

The classification problem simplifies, as always, to a constrained optimization problem: \[(\boldsymbol{\beta}^*,\beta_0^*)=\arg\max_{(\boldsymbol{\beta},\beta_0)}\{M_{(\boldsymbol{\beta},\beta_0)}\}\quad\text{s.t.}\quad y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq M_{(\boldsymbol{\beta},\beta_0)}\] for all \((\mathbf{x}_i,y_i)\in \text{Tr}\), with MMH given by \(F(\mathbf{x})=\beta_0^*+(\boldsymbol{\beta}^*)^{\!\top}\mathbf{x}=0\).

Any hyperplane can be expressed in an uncountable number of ways; the MMH for which \(|F(\mathbf{x}^*)|=1\) for all support vectors \(\mathbf{x}^*\) provides a **canonical representation**. From geometry, we know that the distance from the canonical maximal margin hyperplane \(H_{\boldsymbol{\beta},\beta_0}\) to any point \(\mathbf{z}\) can be computed using vector projections.

Let \(\mathbf{x}_0\) be a point on MMH, i.e., \(F(\mathbf{x}_0)=\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_0=0\), as shown below:

In particular, note that \(\beta_0=-\boldsymbol{\beta}^{\!\top}\mathbf{x}_0\). Then, \[\begin{aligned} \frac{M}{2}&=\text{dist}\left(\mathbf{z},H_{\boldsymbol{\beta},\beta_0}\right)=\left\|\text{proj}_{\boldsymbol{\beta}}(\mathbf{z}-\mathbf{x}_0) \right\|=\left\|\frac{\boldsymbol{\beta}^{\!\top}(\mathbf{z}-\mathbf{x}_0)}{\|\boldsymbol{\beta}\|^2}\boldsymbol{\beta} \right\| \\ &=\frac{|\boldsymbol{\beta}^{\!\top}(\mathbf{z}-\mathbf{x}_0)|}{\|\boldsymbol{\beta}\|^2}\|\boldsymbol{\beta}\|=\frac{|\boldsymbol{\beta}^{\!\top}\mathbf{z}-\boldsymbol{\beta}^{\!\top}\mathbf{x}_0|}{\|\boldsymbol{\beta}\|}=\frac{|F(\mathbf{z})|}{\|\boldsymbol{\beta\|}}.\end{aligned}\]
If \(\mathbf{z}\) is a support vector, then \(|F(\mathbf{z})|=1\), so \[\frac{M}{2}=\text{dist}\left(\mathbf{z},H_{\boldsymbol{\beta},\beta_0}\right)=\frac{1}{\|\boldsymbol{\beta}\|}.\]
Maximizing the margin \(M\) is thus equivalent to minimizing \(\|\boldsymbol{\beta}\|\), and, since the square function is monotonic, \[\arg\max_{(\boldsymbol{\beta},\beta_0)}\{M\mid y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq 1,\ \forall \mathbf{x}_i\in \text{Tr}\}\] is equivalent to \[\arg\min_{(\boldsymbol{\beta},\beta_0)}\left\{\left.\frac{1}{2}\|\boldsymbol{\beta}\|^2 \ \right| \ y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq 1,\ \forall \mathbf{x}_i\in \text{Tr}\right\}.\]
This constrained quadratic problem (QP) can be solved by Lagrange multipliers (in implementations, it is solved numerically), but a key observation is that it is possible to rewrite \[\boldsymbol{\beta}=\sum_{i=1}^N\alpha_iy_i\mathbf{x}_i,\quad \text{with}\quad \sum_{i=1}^N\alpha_iy_i=0\] thanks to the **representer theorem**.^{237}

The original QP becomes the dual problem in \(\boldsymbol{\alpha}\): \[\arg\min_{\boldsymbol{\alpha}}\left\{\left.\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_jy_iy_j\mathbf{x}_i^{\!\top}\mathbf{x}_j-\sum_{i=1}^N\alpha_i \ \right| \ \sum_{i=1}^N\alpha_iy_i=0,\ \alpha_i\geq 0,\ i=1,\ldots,N\right\}.\]
Ultimately, it can be shown that all but \(L\) of the coefficients \(\alpha_i\) are \(0\), typically, \(L\ll N\). The **support vectors** are those training observations \(\mathbf{x}_{i_k}\), \(k=1, \ldots, L\), for which \(\alpha_{i_k}\neq 0\). The **decision function** is defined by \[T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\mathbf{x}_{i_k}^{\!\top}\mathbf{x}+\beta_0,\] scaled so that \(T(\mathbf{x}_{i_k};\alpha)=y_{i_k}=\pm 1\) for each support vector \(\mathbf{x}_{i_k}\).

The **class assignment** for any \(\mathbf{x}\in \text{Te}\) is thus \[\text{class}(\mathbf{x})=\begin{cases} +1 & \text{if $T(\mathbf{x};\alpha)\geq 0 $} \\ -1 & \text{if $T(\mathbf{x};\alpha)< 0 $} \end{cases}\] In practice (especially when \(N<p\)), the data is rarely linearly separable into distinct classes (as below, for instance).

Additionally, even when the classes are linearly separable, the data may be **noisy**, which could lead to overfitting, with technically optimal but practically sub-optimal maximal margin solutions (see [137] for examples).

In applications, **support vector classifiers** optimize instead a **soft margin**, one for which some misclassifications are permitted (as in Figure 13.24).

The soft margin problem can be written as \[\arg\min_{(\boldsymbol{\beta},\beta_0)}\left\{\left.\frac{1}{2}\boldsymbol{\beta}^{\!\top}\boldsymbol{\beta} \ \right| \ y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq 1-\varepsilon_i,\ \varepsilon_i\geq 0, \forall \mathbf{x}_i\in \text{Tr},\ \|\boldsymbol{\varepsilon}\|<C\right\},\] where \(C\) is a (budget) **tuning parameter**, \(\boldsymbol{\varepsilon}\) is a vector of slack variables, canonically scaled so that \(|F(\mathbf{x}^*) |=|\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}^*|=1\) for any eventual support vector \(\mathbf{x}^*\).

Such a model offers greater robustness against unusual observations, while still classifying most training observations correctly:

if \(\varepsilon_i=0\), then \(\mathbf{x}_i\in \text{Tr}\) is correctly classified – it falls on the correct side of the hyperplane, and outside the maximum margin;

if \(0<\varepsilon_i<1\), then \(\mathbf{x}_i\in \text{Tr}\) is still correctly classified (falling on the correct side of the hyperplane), but it falls within the margin;

if \(\varepsilon_i\geq 1\), it is incorrectly classified.

If \(C=0\), then no violations are allowed (\(\|\boldsymbol{\varepsilon}\|=0\)) and the problem reduces to the hard margin SVM classifier; a solution may not even exist if the data is not linearly separable.

If \(C>0\) is an integer, no more than \(C\) training observations can be misclassified; indeed, if \(i_1, \ldots, i_C\) are the misclassified indices, then \(\varepsilon_{i_1},\ldots, \varepsilon_{i_C}\geq 1\) and \[C\geq \sum_{i=1}^N\varepsilon_i\geq \sum_{j=1}^C\varepsilon_{i_j}\geq C.\]
As \(C\) increases, tolerance for violations also increases, as does the width of the soft margin; \(C\) plays the role of a **regularization parameter**, and is usually selected *via* cross-validation.

Low values of \(C\) are associated with harder margins, which leads to low bias but high variance (a small change in the data could create qualitatively different margins); large values of \(C\) are associated with wider (softer) margins, leading to more potential misclassifications and higher bias, but also lower variance as small changes in the data are unlikely to change the margin significantly.

We can build a classifier through the representer theorem formulation as before, the only difference being that the **decision function** \(T_C(\mathbf{x};\boldsymbol{\alpha})\) is scaled so that \(|T(\mathbf{x}_{i_k};\boldsymbol{\alpha})|\geq 1-\varepsilon_{i_k}\) for every support vector \(\mathbf{x}_{i_k}\). It is difficult to determine what the value of the regularization parameter \(C\) should be at first glance; an optimal value can be obtained *via* a **tuning process**, which tries out various values and identifies the one that produces an optimal model.

#### Example

We train an SVM with \(C=0.1\) (obtained *via* a tuning procedure for \(C\)) for the 2011 Gapminder dataset to predict the life expectancy class \(Y\) in terms of the fertility rate \(X_1\) and the logarithm of GDP per capita \(X_2\); \(n=116\) observations are used in the training set `gapminder.2011.tr`, the rest are set aside in the test set `gapminder.2011.te`.

```
x <- gapminder.2011.tr[,c("fertility","gdp","population")]
w <- log(x[,2]/x[,3])
x <- data.frame(x[,1],w)
y <- gapminder.2011.tr[,c("LE")]
dat = data.frame(x,y)
plot(w,x[,1],col=y,bg=y,pch=(as.numeric(y)+23), xlab="Log GDPPC", ylab="fertility")
```

The red triangles represent countries with low life expectancy; the black ones, countries with high life expectancy. Notice the class overlap in the training data.

We run 7 linear SVM models with various cost parameters (through `e1071`’s `tune()` function); the optimal model has \(C=0.1\).

```
tuned.model <- e1071::tune(e1071::svm, y~., data = dat, kernel = "linear",
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
(best.mod <- tuned.model$best.model)
```

```
Call:
best.tune(method = e1071::svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001,
0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.1
Number of Support Vectors: 50
```

The corresponding SVM model is obtained *via* the `svm()` function (note the parameters). The SVM decision boundary is shown below:

```
svmfit <- e1071::svm(y~., data = dat, kernel = "linear", cost=0.1)
plot(svmfit, dat, main="Linear Kernel")
```

We can evaluate the model’s performance on \(\text{Te}\) (`dat.te`); the confusion matrix of the model on the test set is:

```
x <- gapminder.2011.te[,c("fertility","gdp","population")]
w <- log(x[,2]/x[,3])
x <- data.frame(x[,1],w)
y <- gapminder.2011.te[,c("LE")]
dat.te = data.frame(x,y) # Test data
results = predict(svmfit,dat.te) # Class prediction on test data
table(actual=gapminder.2011.te$LE,pred=results) # Confusion matrix
```

| actual \ prediction | high | low |
|:---:|:---:|:---:|
| **high** | 17 | 1 |
| **low** | 10 | 22 |

It is not a perfectly accurate model, but it is certainly acceptable given the class overlap in \(\text{Tr}\).

#### Nonlinear Boundaries

If the boundary between two classes is linear, the SVM classifier of the previous section is a natural way to attempt to separate the classes. In practice, however, the classes are rarely so cleanly separated, as below, say.

In both the hard and the soft margin support vector classifiers, the function to optimize takes the form \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_jy_iy_j\mathbf{x}_i^{\!\top}\mathbf{x}_j-\sum_{i=1}^N\alpha_i,\] and the decision function, the form \[T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\mathbf{x}_{i_k}^{\!\top}\mathbf{x}+\beta_0.\]
In practice, however, we do not actually need to know the support vectors \(\mathbf{x}_{i_k}\) (or even the observations \(\mathbf{x}_i\), for that matter) in order to compute the decision function values – it is sufficient to have access to the **inner products** \(\mathbf{x}_i^{\!\top}\mathbf{x}_j\) or \(\mathbf{x}_{i_k}^{\!\top}\mathbf{x}\), which are usually denoted by \[\langle\mathbf{x}_{i_k},\mathbf{x}\rangle\quad\text{or}\quad \langle\mathbf{x}_{i},\mathbf{x}_j\rangle.\]
The objective function and the decision function can thus be written as \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_jy_iy_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle-\sum_{i=1}^N\alpha_i,\quad T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\langle\mathbf{x}_{i_k},\mathbf{x}\rangle+\beta_0.\]
This seemingly innocuous remark opens the door to the **kernel approach**; we could conceivably replace the inner products \(\langle \mathbf{x},\mathbf{w}\rangle\) by generalized inner products \(K(\mathbf{x},\mathbf{w})\), which provide a measure of similarity between the observations \(\mathbf{x}\) and \(\mathbf{w}\).

Formally, a **kernel** is a symmetric (semi-)positive definite operator \(K:\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}_0^+\).^{238} Common
statistical learning kernels include:

- **linear** – \(K(\mathbf{x},\mathbf{w})=\mathbf{x}^{\!\top}\mathbf{w}\);
- **polynomial of degree \(d\)** – \(K_d(\mathbf{x},\mathbf{w})=(1+\mathbf{x}^{\!\top}\mathbf{w})^d\);
- **Gaussian** (or radial) – \(K_\gamma(\mathbf{x},\mathbf{w})=\exp(-\gamma\|\mathbf{x}-\mathbf{w}\|_2^2)\), \(\gamma>0\);
- **sigmoid** – \(K_{\kappa,\delta}(\mathbf{x},\mathbf{w})=\tanh(\kappa\mathbf{x}^{\!\top}\mathbf{w}-\delta)\), for allowable \(\kappa,\delta\).
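These kernels are easy to compute directly from their definitions; a minimal Python sketch for illustration (the function names are ours, not from any library):

```python
import math

# Common SVM kernels, written directly from the definitions above.
def linear(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def polynomial(x, w, d):
    return (1 + linear(x, w)) ** d

def gaussian(x, w, gamma):
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, w, kappa, delta):
    return math.tanh(kappa * linear(x, w) - delta)

x, w = [1.0, 2.0], [0.5, -1.0]
print(linear(x, w))          # -1.5
print(polynomial(x, w, 2))   # (1 - 1.5)^2 = 0.25
print(gaussian(x, x, 1.0))   # identical points -> 1.0
```

Note that \(K_\gamma(\mathbf{x},\mathbf{x})=1\) for the Gaussian kernel, no matter the observation: the kernel really does behave like a similarity measure.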

For instance, a radial kernel SVM with \(\gamma=1\), \(C=0.5\) yields the following classification on the above dataset (we are using `kernlab`’s `ksvm()` function and display the linear SVM output for comparison, whose performance we expect to be crap-tastic):

```
set.seed(0)
x <- matrix(rnorm(600*2), ncol = 2)
y <- c(rep(-1,200), rep(0,200), rep(1,200))
x[y==1,] <- x[y==1,] + 2
x[y==-1,] <- x[y==-1,] - 2  # shift entire rows (note the comma)
y <- y^2                    # collapse to two classes: outer clusters (1), middle (0)
dat <- data.frame(x=x, y=as.factor(y))
# linear SVM
kernfit.lin <- kernlab::ksvm(x, y, type = "C-svc", kernel = 'vanilladot',
                             C = 10)
kernlab::plot(kernfit.lin, data=x)
```

```
# Gaussian SVM
kernfit.rbf <- kernlab::ksvm(x,y, type = "C-svc", kernel = 'rbfdot',
sigma=1, C = 0.5)
kernlab::plot(kernfit.rbf, data=x)
```

```
 Setting default kernel parameters
```

How is the decision boundary computed? The principle is the same as with linear SVM: the objective function and the decision function are \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_jK(\mathbf{x}_i,\mathbf{x}_j)-\sum_{i=1}^N\alpha_i,\quad T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}K(\mathbf{x}_{i_k},\mathbf{x})+\beta_0.\] For the radial kernel, for instance, if a test observation \(\mathbf{x}\) is near a training observation \(\mathbf{x}_i\), then \(\|\mathbf{x}-\mathbf{x}_i\|_2^2\) is small and \(K_{\gamma}(\mathbf{x},\mathbf{x}_i)\approx 1\); if they are far from one another, then \(\|\mathbf{x}-\mathbf{x}_i\|_2^2\) is large and \(K_{\gamma}(\mathbf{x},\mathbf{x}_i)\approx 0\).

In other words, in the radial kernel framework, only those observations close to a test observation play a role in class prediction.

#### Kernel Trick

But why even use kernels in the first place? While the linear kernel is easier to interpret and implement, not all data sets are linearly separable, as we have just seen. Consider the following toy classification problem (adapted from an unknown online source).

The optimal margin separating “strip” is obviously not linear. One way out of this problem is to introduce a **transformation** \(\Phi\) from the original \(X-\)feature space to a higher-dimensional (or at least, of the same dimension) \(Z-\)feature space in which the data is linearly separable, and to build a linear SVM on the transformed training observations \(\mathbf{z}_i=\Phi(\mathbf{x}_i)\).^{239}
In this example, we use some \(\Phi:\mathbb{R}^2\to \mathbb{R}^3\); the projection of the transformation into the \(Z_1Z_3-\)plane could be as below.

The objective function and the decision function take on the form \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_j\Phi(\mathbf{x}_i)^{\!\top}\Phi(\mathbf{x}_j)-\sum_{i=1}^N\alpha_i,\ T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\Phi(\mathbf{x}_{i_k})^{\!\top}\Phi(\mathbf{x})+\beta_0,\] and the linear SVM is built as before (but in \(Z-\)space, not in \(X-\)space).

It sounds straightforward, but it does take a fair amount of experience to recognize that one way to separate the data is to use \[\mathbf{z}=\Phi(\mathbf{x})=(x_1^2,\sqrt{2}x_1x_2,x_2^2).\] And this is one of the easy transformations: what should be used in the case below (image taken from Wikipedia)?

The **kernel trick** simply states that \(\Phi\) can remain unspecified if we replace \(\Phi(\mathbf{x})^{\!\top}\Phi(\mathbf{w})\) by a “reasonable” (often radial) kernel \(K(\mathbf{x},\mathbf{w})\).
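We can check numerically that, for this particular \(\Phi\), the \(Z\)-space inner product collapses to a kernel evaluated entirely in \(X\)-space, namely \(\Phi(\mathbf{x})^{\!\top}\Phi(\mathbf{w})=(\mathbf{x}^{\!\top}\mathbf{w})^2\); a small Python sketch:

```python
import math

def phi(x):
    # The explicit feature map from the text: R^2 -> R^3.
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, w = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(w))   # inner product in Z-space
rhs = dot(x, w) ** 2        # kernel K(x, w) = (x^T w)^2, computed in X-space
print(lhs, rhs)             # both equal 1.0
```

The right-hand side never materializes \(\Phi\): this is precisely the computational appeal of the trick.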

#### General Classification

What do we do if the response variable has \(K>2\) classes? In the **one-versus-all** (OVA) approach, we fit \(K\) different \(2-\)class SVM decision functions \(T_k(\mathbf{x};\boldsymbol{\alpha})\), \(k=1,\ldots, K\); in each, one class versus the rest. The test observation \(\mathbf{x}^*\) is assigned to the class for which \(T_k(\mathbf{x^*};\boldsymbol{\alpha})\) is largest.

In the **one-versus-one** (OVO) approach, we fit all \(\binom{K}{2}\) pairwise \(2-\)class SVM classifiers \(\text{class}_{k,\ell}(\mathbf{x})\), using the training observations with levels \(k\) and \(\ell\), for \(1\le \ell<k\le K\). The test observation \(\mathbf{x}^*\) is assigned to the class that wins the most pairwise “competitions”.

If \(K\) is large, \(\binom{K}{2}\) might be too large to make OVO computationally efficient; when it is small enough, OVO is the recommended approach.
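The OVO voting scheme itself is straightforward; here is a toy Python sketch, where the `pairwise_winner` rule stands in for the \(\binom{K}{2}\) fitted classifiers (with \(K=11\), as in the upcoming example, that would be 55 of them):

```python
from itertools import combinations
from collections import Counter

def ovo_predict(classes, pairwise_winner):
    # Tally one "duel" per unordered pair of classes; the class with
    # the most wins is the OVO prediction for the test observation.
    votes = Counter()
    for k, l in combinations(classes, 2):
        votes[pairwise_winner(k, l)] += 1
    return votes.most_common(1)[0][0]

classes = ["had", "hAd", "hed"]
# Toy decision rule standing in for the C(3,2) = 3 fitted 2-class SVMs:
winners = {("had","hAd"): "hAd", ("had","hed"): "hed", ("hAd","hed"): "hAd"}
print(ovo_predict(classes, lambda k, l: winners[(k, l)]))  # "hAd" wins 2 of 3 duels
```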

#### Example

The `vowel` dataset was taken from the *openML* website. This modified version, by Turney, is based on Robinson’s *Deterding Vowel Recognition Data*, which is a speaker-independent recognition of the eleven steady state vowels of British English using a specified training set of `lpc`-derived log area ratios.

**Real talk:** I don’t actually know what any of that means. But does it matter? I mean, yes, any conclusions we draw from this dataset will need to be scrutinized by subject matter experts before we can hope to apply them to real-world situations. On the other hand, data is simply marks on paper (or perhaps electromagnetic patterns in the cloud). We can analyze the data without really knowing what its underlying meaning is. It is a sterile approach, but we can use it to illustrate SVM, in general.

Let’s read in the data and summarize it:

Train.or.Test | Speaker.Name | Speaker.Sex | Feature.0 | Feature.1 | Feature.2 | Feature.3 | Feature.4 | Feature.5 | Feature.6 | Feature.7 | Feature.8 | Feature.9 | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Train | Andrew | Male | -3.639 | 0.418 | -0.670 | 1.779 | -0.168 | 1.627 | -0.388 | 0.529 | -0.874 | -0.814 | hid |
Train | Andrew | Male | -3.327 | 0.496 | -0.694 | 1.365 | -0.265 | 1.933 | -0.363 | 0.510 | -0.621 | -0.488 | hId |
Train | Andrew | Male | -2.120 | 0.894 | -1.576 | 0.147 | -0.707 | 1.559 | -0.579 | 0.676 | -0.809 | -0.049 | hEd |
Train | Andrew | Male | -2.287 | 1.809 | -1.498 | 1.012 | -1.053 | 1.060 | -0.567 | 0.235 | -0.091 | -0.795 | hAd |
Train | Andrew | Male | -2.598 | 1.938 | -0.846 | 1.062 | -1.633 | 0.764 | 0.394 | -0.150 | 0.277 | -0.396 | hYd |
Train | Andrew | Male | -2.852 | 1.914 | -0.755 | 0.825 | -1.588 | 0.855 | 0.217 | -0.246 | 0.238 | -0.365 | had |

The dataset has \(n=990\) observations and \(p=14\) variables.

```
'data.frame': 990 obs. of 14 variables:
$ Train.or.Test: Factor w/ 2 levels "Test","Train": 2 2 2 2 2 2 2 2 2 2 ...
$ Speaker.Name : Factor w/ 15 levels "Andrew","Bill",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Speaker.Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Feature.0 : num -3.64 -3.33 -2.12 -2.29 -2.6 ...
$ Feature.1 : num 0.418 0.496 0.894 1.809 1.938 ...
$ Feature.2 : num -0.67 -0.694 -1.576 -1.498 -0.846 ...
$ Feature.3 : num 1.779 1.365 0.147 1.012 1.062 ...
$ Feature.4 : num -0.168 -0.265 -0.707 -1.053 -1.633 ...
$ Feature.5 : num 1.627 1.933 1.559 1.06 0.764 ...
$ Feature.6 : num -0.388 -0.363 -0.579 -0.567 0.394 0.217 0.322 -0.435 -0.512 -0.466 ...
$ Feature.7 : num 0.529 0.51 0.676 0.235 -0.15 -0.246 0.45 0.992 0.928 0.702 ...
$ Feature.8 : num -0.874 -0.621 -0.809 -0.091 0.277 0.238 0.377 0.575 -0.167 0.06 ...
$ Feature.9 : num -0.814 -0.488 -0.049 -0.795 -0.396 -0.365 -0.366 -0.301 -0.434 -0.836 ...
$ Class : Factor w/ 11 levels "had","hAd","hed",..: 5 6 4 2 11 1 8 7 10 9 ...
```

Train.or.Test | Speaker.Name | Speaker.Sex | Class |
---|---|---|---|
Test :462 | Andrew:66 | Female:462 | had:90 |
Train:528 | Bill :66 | Male :528 | hAd:90 |
 | David :66 | | hed:90 |
 | Jo :66 | | hEd:90 |
 | Kate :66 | | hid:90 |
 | Mark :66 | | hId:90 |
 | Mike :66 | | hod:90 |
 | Nick :66 | | hOd:90 |
 | Penny :66 | | hud:90 |
 | Rich :66 | | hUd:90 |
 | Rose :66 | | hYd:90 |
 | Sarah :66 | | |
 | Sue :66 | | |
 | Tim :66 | | |
 | Wendy :66 | | |

There is some imbalance in the training/testing set-up (especially as it relates to the speaker sex):

```
Female Male
Test 198 264
Train 264 264
```

All in all, the numerical features seem to be generated from a multivariate normal distribution, with mean vector

```
Feature.0 Feature.1 Feature.2 Feature.3 Feature.4 Feature.5
-3.203740404 1.881763636 -0.507769697 0.515482828 -0.305657576 0.630244444
Feature.6 Feature.7 Feature.8 Feature.9
-0.004364646 0.336552525 -0.302975758 -0.071339394
```

and correlation matrix

Can we get any information from the paired plots?

Perhaps if we focus only on certain variables?

The response variable is `Class`, with 11 levels, and the \(p=10\) predictors are `Feature.0`, …, `Feature.9`. We train an SVM model on a subset of the vowel dataset. In this instance, we use the training/testing split provided with the data (`Train.or.Test`), but any randomly selected split would be appropriate.

```
training = vowel[vowel$Train.or.Test=="Train",4:14]
testing = vowel[vowel$Train.or.Test=="Test",4:14]
c(nrow(training),nrow(testing)) # training/testing split
```

`[1] 528 462`

We use the support vector machine implementation found in the `R` library `e1071`.

First, we tune the hyper-parameters on a subsample of the training data by using the `tune()` function, which selects optimal parameters by carrying out a grid search over the specified parameter ranges (otherwise we might spend a lot of time trying to find a good combination of parameters).

For C-classification with a Gaussian kernel, the parameters are:

- \(C\), the cost of constraint violation (which controls the penalty paid by the SVM model for misclassifying a training point), and
- \(\gamma\), the parameter of the Gaussian kernel (used to handle non-linear classification).

If \(C\) is “high”, then misclassification is costly, and *vice-versa*. If \(\gamma\) is “high”, then the Gaussian bumps around the points are narrow, and *vice-versa*. Let us run a grid search with \(C\) varying from \(0.1\) to \(100\) by powers of \(10\), and \(\gamma = 0.5, 1, 2\).
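Conceptually, the grid search carried out by `tune()` simply evaluates every \((C,\gamma)\) pair and keeps the best one; a minimal Python sketch, with a made-up `cv_error` standing in for the 10-fold cross-validation misclassification rate:

```python
from itertools import product

# Toy stand-in for cross-validated error; in practice this would come from
# refitting the SVM with each (C, gamma) pair and averaging the fold errors.
def cv_error(C, gamma):
    return (C - 10) ** 2 / 100 + (gamma - 0.5) ** 2

costs  = [0.1, 1, 10, 100]   # C varying by powers of 10
gammas = [0.5, 1, 2]
best = min(product(costs, gammas), key=lambda p: cv_error(*p))
print(best)  # (10, 0.5): the pair with the smallest (toy) error
```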

```
vowel.svm.tune.1 <- e1071::tune(e1071::svm,
train.x=training[,1:10], train.y=training[,11],
kernel="radial",
ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))
print(vowel.svm.tune.1)
```

```
Parameter tuning of 'e1071::svm':
- sampling method: 10-fold cross validation
- best parameters:
cost gamma
10 0.5
- best performance: 0.007619739
```

The minimal misclassification error (best performance) in this run is reached when the best parameters have the values listed in the output above. Obviously, that search was fairly coarse: searching at a finer level can be very demanding, time-wise.

For comparison’s sake, let’s see if tuning with finer intervals and larger ranges gives substantially different results.

```
vowel.svm.tune.2 <- e1071::tune(e1071::svm,
train.x=training[,1:10], train.y=training[,11],
kernel="radial",
ranges=list(cost=10^(-2:2), gamma=1:20*0.1))
print(vowel.svm.tune.2)
```

```
Parameter tuning of 'e1071::svm':
- sampling method: 10-fold cross validation
- best parameters:
cost gamma
10 0.8
- best performance: 0.003773585
```

The optimal parameters are essentially the same, so we might as well stick with the optimal parameter values from the first tuning. Training the model with these values yields:

```
vowel.svm.model = e1071::svm(training[,11] ~ ., data = training,
type="C-classification", cost=10, kernel="radial", gamma=0.5)
summary(vowel.svm.model)
```

```
Call:
svm(formula = training[, 11] ~ ., data = training, type = "C-classification",
cost = 10, kernel = "radial", gamma = 0.5)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
Number of Support Vectors: 351
( 27 32 32 26 30 36 37 29 40 32 30 )
Number of Classes: 11
Levels:
had hAd hed hEd hid hId hod hOd hud hUd hYd
```

Note the number of support vectors. How accurately can this model predict the class of new observations?

```
predicted = predict(vowel.svm.model, testing)
(confusion.matrix = table(pred = predicted, true = testing[,11]))
e1071::classAgreement(confusion.matrix,match.names=TRUE)
```

```
true
pred had hAd hed hEd hid hId hod hOd hud hUd hYd
had 36 0 0 0 0 0 0 0 0 0 0
hAd 0 40 0 0 0 0 0 0 0 0 0
hed 0 0 42 0 0 0 0 0 0 0 0
hEd 0 0 0 37 0 0 0 0 0 0 0
hid 0 0 0 0 39 0 0 0 0 0 0
hId 0 0 0 0 2 42 0 0 1 0 0
hod 0 0 0 0 0 0 36 0 0 0 0
hOd 0 0 0 0 0 0 0 35 0 0 0
hud 0 0 0 0 0 0 0 0 23 0 0
hUd 6 2 0 5 1 0 6 7 18 42 16
hYd 0 0 0 0 0 0 0 0 0 0 26
$diag
[1] 0.8614719
$kappa
[1] 0.847619
$rand
[1] 0.9425585
$crand
[1] 0.6798992
```

What do you think?
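As a quick sanity check, the `$diag` value reported above is simply the sum of the diagonal entries of the confusion matrix over the grand total; a Python sketch using the matrix above:

```python
# Rows are predicted classes, columns are true classes (had, hAd, ..., hYd).
confusion = [
    [36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 40, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 37, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 39, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 42, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 36, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 35, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0],
    [6, 2, 0, 5, 1, 0, 6, 7, 18, 42, 16],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 26],
]
correct = sum(confusion[i][i] for i in range(11))   # diagonal: correct calls
total = sum(sum(row) for row in confusion)          # all 462 test observations
print(correct, total, round(correct / total, 7))    # 398 462 0.8614719
```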

#### Final Comments

In practice, it is not always obvious whether one should use SVM, logistic regression, linear discriminant analysis (LDA), decision trees, etc.:^{240}

- if classes are (nearly) separable, SVM and LDA are usually preferable to logistic regression;
- otherwise, using logistic regression together with a ridge penalty (see Section 12.2.4) is roughly equivalent to using SVM;
- if the aim is to estimate class membership probabilities, it is preferable to use logistic regression, as SVM is not **calibrated**;^{241}
- it is possible to use kernels in the logistic regression and LDA frameworks, but at the cost of increased computational complexity.

All in all, it remains crucial to understand that the *No Free Lunch Theorem* remains in effect. There is simply no magical recipe… although the next technique we discuss is often viewed (and used) as one.

### 13.4.3 Artificial Neural Networks

When practitioners discuss using Artificial Intelligence techniques [135] to solve a problem, the implicit assumption is often (but not always) that a neural network (or some other variant of **deep learning**) will be used, and for good reason: “neural networks blow all previous techniques out of the water in terms of performance” [249]. But there are some skeletons in the closet: “[…] given the existence of adversarial examples, it shows we really don’t understand what’s going on” [249].

At various times since Alan Turing’s seminal 1950 paper (in which he proposed the celebrated **Imitation Game** [250]), complete artificial intelligence has been announced to be “just around the corner” (see below).

With the advent of **deep learning** and Big Data processing, optimism is as high as it’s ever been, but opinions on the topic are varied – to some commentators, A.I. is a brilliant success, while to others it is a spectacular failure (see the headlines in Section 7.1.3). So what is really going on?

It is far from trivial to identify the **essential qualities and skills of an intelligence**. There have been multiple attempts to solve the problem by building on Turing’s original effort. An early argument by Hofstadter [56] is that any intelligence should:

provide flexible responses in various scenarios;

take advantage of lucky circumstances;

make sense out of contradictory messages;

recognize the relative importance of a situation’s elements;

find similarities between different situations;

draw distinctions between similar situations, and

come up with new ideas from scratch or by re-arranging previous known concepts.

This is not quite the approach taken by modern A.I. researchers, who define the discipline as the study of **intelligent agents** – any device that perceives its environment and takes actions to maximize its chance of success at some task/goal [251].

Examples include:

- **expert systems** – TurboTax, WebMD, technical support, insurance claim processing, air traffic control, etc.;
- **decision-making** – Deep Blue, auto-pilot systems, “smart” meters, etc.;
- **natural language processing** – machine translation, Siri, named-entity recognition, etc.;
- **recommenders** – Google, Expedia, Facebook, LinkedIn, Netflix, Amazon, etc.;
- **content generators** – music composers, novel writers, animation creators, etc.;
- **classifiers** – facial recognition, object identification, fraud detection, etc.

A trained **artificial neural network** (ANN) is a function that maps inputs to outputs in a useful way: it uses a Swiss-army-knife approach to providing outputs – plenty of options are available in the **architecture**, but it’s not always clear which ones should be used.

One of the reasons that ANNs are so popular is that the user does not need to decide much about the function or know much about the problem space in advance – ANNs are **quiet models**.

Algorithms allow ANNs to **learn** (i.e. to generate the function and its internal values) automatically; technically, the only requirement is the user’s ability to minimize a cost function (which is to say, to be able to solve optimization problems).

#### Overview

The simplest definition of an **artificial neural network** is provided by the inventor of one of the first neuro-computers, R. Hecht-Nielsen, as:

“[…] a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. [252]”

An **artificial neural network** is an interconnected group of nodes, inspired by a simplification of neurons in a brain but on much smaller scales. Neural networks are typically organized in **layers**. Layers are made up of a number of interconnected **nodes** which contain an **activation function**.

A **pattern** \(\mathbf{x}\) (input, signal) is presented to the network *via* the **input layer**, which communicates with one or more **hidden layers**, where the actual processing is done *via* a system of weighted **connections** \(\mathbf{W}\) (edges).

The hidden layers then link to an **output layer**, which outputs the **predicted response** \(\hat{\mathbf{y}}\) (see Figure 13.26).

#### Neural Networks Architecture

In order to train a neural network, we need the following objects [242]:

- some **input data**,
- a number of **layers**,
- a **model**, and
- a **learning process** (loss function and optimizer).

The interactions between these objects are visualized below.

A network (model), which is composed of layers that are chained together, maps the input data into predictions.^{242} The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network’s predictions match what was expected. The optimizer uses this loss value to update the network’s weights.

#### Input Data

Neural networks start with the **input training data** (and corresponding **targets**) in the form of a **tensor**. Generally speaking, most modern machine learning systems use tensors as their basic data structure. At its core, a tensor is a **container** for data – and it is almost always numerical.

Tensors are defined by three key attributes:

- their **rank** (number of axes) – for instance, a 3D tensor has three axes, while a matrix (2D tensor) has two axes;
- their **shape**, a tuple of integers that describes how many dimensions the tensor has along each axis – for instance, a matrix’s shape is described using two elements, such as `(3,5)`; a 3D tensor’s shape has three elements, such as `(3,5,5)`; a vector (1D tensor)’s shape is given by a single element, such as `(5)`; whereas a scalar has an empty shape, `( )`;
- their **data type** – for instance, a tensor’s type could be `float32`, `uint8`, `float64`, etc.

Data tensors almost always fall into one of the following categories:

- the most common case is **vector data**: each single data point can be encoded as a vector, and a batch of data will be encoded as a matrix or 2D tensor of shape `(#samples, #features)` – or more simply, as an array of vectors where the first axis is the samples axis and the second axis is the features axis;
- **time series or sequence data**, whenever the passage of time (or the notion of sequence order) is crucial to the observations in the dataset, can be stored in a 3D tensor with an explicit time axis: each sample can be encoded as a sequence of vectors (a 2D tensor), and a batch of data will be encoded as a 3D tensor of shape `(#samples, #timesteps, #features)`, as in Figure 13.28;
- **images** typically have three dimensions: height, width, and colour depth;^{243} a batch of image data could thus be stored in a 4D tensor of shape `(#samples, #height, #width, #channels)`, as in Figure 13.29;
- **video data** is one of the few types of real-world data for which 5D tensors are needed: a video can be understood as a sequence of frames, each frame being a colour image; since a sequence of frames can be stored in a 4D tensor `(#frames, #height, #width, #channels)`, a batch of different videos can be stored in a 5D tensor of shape `(#samples, #frames, #height, #width, #channels)`.
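The rank/shape bookkeeping can be illustrated with plain nested lists (an illustration only; real frameworks store tensors as contiguous numerical arrays):

```python
# A rank/shape probe for nested-list "tensors": descend along the first
# axis, recording the length of each level.
def shape(t):
    s = []
    while isinstance(t, list):
        s.append(len(t))
        t = t[0]
    return tuple(s)

scalar = 3.5                                              # rank 0, shape ()
vector = [1, 2, 3, 4, 5]                                  # rank 1, shape (5,)
matrix = [[0] * 5 for _ in range(3)]                      # rank 2, shape (3, 5)
batch  = [[[0] * 5 for _ in range(5)] for _ in range(3)]  # rank 3, shape (3, 5, 5)
print(shape(scalar), shape(vector), shape(matrix), shape(batch))
```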

#### Layers

The core building block of neural networks is the **layer**, a data-processing module that is, in a sense, a **filter** for data: some data goes into the layer and comes out in a more useful form.

Specifically, layers extract **representations** out of the data fed into them – hopefully, representations that are **more meaningful** for the problem at hand. A layer takes as input \(1+\) tensors and outputs \(1+\) tensors. Different layers are appropriate for different tensor formats and different types of data processing.

For instance, simple vector data, stored in 2D tensors, is often processed by **densely connected** layers, also called **fully connected** or **dense** layers. Sequence data, stored in 3D tensors, is typically processed by **recurrent** layers. Image data, stored in 4D tensors, is usually processed by 2D **convolution** layers.

Most of deep learning consists of chaining together simple layers that will implement a form of **progressive data distillation**. However, to build deep learning models in tensor-based modules like Keras [242], it is important to clip together **compatible** layers to form useful data-transformation pipelines.

The notion of **layer compatibility** refers specifically to the fact that every layer can only accept input tensors of a certain shape and return output tensors of a certain shape.

We will discuss tensors in greater detail in the Deep Learning module.

#### Model: Networks of Layers

An artificial neural network model is essentially a **data processing sieve**, made of a succession of increasingly refined data filters – the **layers**. The most common example of a model is a linear stack of layers, mapping a single input to a single output. Other network topologies include: **two-branch networks**, **multihead networks**, and **inception blocks**. The topology of a network defines a **hypothesis space**.

Since machine learning is basically

“[…] searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal [242],”

by choosing a network topology, we constrain the space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data.

From a ML perspective, what we are searching for is a good set of values for the weight tensors involved in these tensor operations. Picking the right network architecture is more an art than a science; and although there are some best practices and principles we can rely on, practical experience is the main factor in becoming a proper neural network architect.

#### Learning Process: Loss Function and Optimizer

After a network architecture has been selected, two other objects need to be chosen:

- the **(objective) loss function**, the quantity that is minimized during training – it represents a measure of success for the task at hand, and
- the **optimizer**, which determines how the network is updated based on the loss function.

In this context, **learning** means finding a combination of model parameters that minimizes the **loss function** for a given set of training data observations and their corresponding targets.

Learning happens by drawing random batches of data samples and their targets, and computing the **gradient** of the network parameters with respect to the **loss** on the batch. The network parameters are then updated by a small amount (the magnitude of the move is defined by the learning rate) in the opposite direction from the gradient.

The entire learning process is made possible by the fact that under a network disguise, neural networks are simply **chains of differentiable tensor operations**, to which it is possible to apply the chain rule of differentiation to find the gradient function mapping the current parameters and current batch of data to a gradient value.

Choosing the right objective function for a given problem is extremely important: the network is ruthless when it comes to lowering its loss function, and it will take any shortcut it can to achieve that objective. If the latter does not fully correlate with success for the task at hand, the network may face unintended side effects.

Simple guidelines exist to help analysts select an appropriate loss function for common problems such as classification, regression, and sequence prediction. We typically use:

- **binary cross entropy** for a binary classification;
- **categorical cross entropy** for a \(n-\)ary classification;
- **mean squared error** for a regression;
- **connectionist temporal classification** (CTC) for sequence-learning, etc.
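For concreteness, here is a minimal Python sketch of the binary cross entropy and mean squared error losses on toy targets and predictions:

```python
import math

# Average binary cross entropy over a batch of targets y in {0,1}
# and predicted probabilities p in (0,1).
def binary_cross_entropy(y, p):
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

# Average squared difference between targets and predictions.
def mean_squared_error(y, yhat):
    return sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) / len(y)

y_class, p_hat = [1, 0, 1], [0.9, 0.2, 0.8]
y_reg, yhat_reg = [3.0, -1.0], [2.5, -0.5]
print(binary_cross_entropy(y_class, p_hat))  # ~0.1839
print(mean_squared_error(y_reg, yhat_reg))   # 0.25
```

In both cases, lower is better, and the loss is exactly zero only for perfect predictions.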

In most cases, one of these will do the trick – only when analysts are working on truly new research problems do they have to develop their own objective functions. Let us first illustrate the principles underlying ANNs with a simple example.

We have seen that ANNs are formed from an **input layer** from which the **signal vector** \(\mathbf{x}\) is inputted, an **output layer** which produces an **output vector** \(\hat{\mathbf{y}}\), and any number of **hidden layers**; each layer consists of a number of **nodes** which are connected to the nodes of other layers *via* **directed edges** with associated **weights** \(\mathbf{w}\), as seen below.

Nodes from the hidden and output layers are typically **activation nodes** – the output \(a(\mathbf{z})\) is some function of the input \(\mathbf{z}\). Signals propagate through the ANN using simple arithmetic, once a set of weights \(\mathbf{w}\) and activation functions \(a(\cdot)\) have been selected.

In a nutshell, at each node, the neural net computes a weighted sum of inputs, applies an activation function, and sends a signal. This is repeated until the various signals reach the final output nodes.

That part is easy – given a signal, an ANN can produce an output, as long as the weights are specified. Matrix notation can simplify the expression for the output \(\hat{\mathbf{y}}\) in terms of the signal \(\mathbf{x}\), weights \(\mathbf{w}\), and activation function \(a(\cdot)\).

For instance, consider the network of Figure 13.30; if \[a(z)=(1+\exp(-z))^{-1},\] the network topology can be re-written as:

- **input layer with \(p\) nodes:** \[\mathbf{X}_{n\times p}=\mathbf{X}_{n\times 2}=\begin{bmatrix} x_{A,1} & x_{B,1} \\ \vdots &\vdots \\ x_{A,n} & x_{B,n} \end{bmatrix};\]
- **weights from input layer to hidden layer with \(m\) nodes:** \[\mathbf{W}^{(1)}_{p\times m}=\mathbf{W}^{(1)}_{2\times 2}=\begin{bmatrix} w_{AC} & w_{AD} \\ w_{BC} & w_{BD}\end{bmatrix};\]
- **hidden layer with \(m\) nodes:** \[\mathbf{Z}^{(2)}_{n\times m}=\mathbf{Z}^{(2)}_{n\times 2}=\begin{bmatrix} z_{C,1} & z_{D,1} \\ \vdots & \vdots \\ z_{C,n} & z_{D,n}\end{bmatrix}=\mathbf{X}\mathbf{W}^{(1)};\]
- **activation function on hidden layer:** \[\mathbf{a}^{(2)}=\begin{bmatrix} (1+\exp(-z_{C,1}))^{-1} & (1+\exp(-z_{D,1}))^{-1} \\ \vdots & \vdots \\ (1+\exp(-z_{C,n}))^{-1} & (1+\exp(-z_{D,n}))^{-1} \end{bmatrix}=g(\mathbf{Z}^{(2)});\]
- **weights from hidden layer with \(m\) nodes to output layer with \(k\) nodes:** \[\mathbf{W}^{(2)}_{m\times k}=\mathbf{W}^{(2)}_{2\times 1}=\begin{bmatrix} w_{CE} \\ w_{DE} \end{bmatrix};\]
- **output layer with \(k\) nodes:** \[\mathbf{Z}^{(3)}_{n\times k}=\mathbf{Z}^{(3)}_{n\times 1}=\begin{bmatrix} z_{E,1} \\ \vdots \\ z_{E,n} \end{bmatrix}=\mathbf{a}^{(2)}\mathbf{W}^{(2)};\]
- **activation function on output layer:** \[\hat{\mathbf{y}}=\mathbf{a}^{(3)}=\begin{bmatrix} (1+\exp(-z_{E,1}))^{-1} \\ \vdots \\ (1+\exp(-z_{E,n}))^{-1} \end{bmatrix}=g(\mathbf{Z}^{(3)}).\]
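This forward pass can be sketched in a few lines of Python (with illustrative weight values of our choosing; `W1` and `W2` play the roles of \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\)):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def matmul(A, B):
    # Plain-list matrix product, for self-containment.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def forward(X, W1, W2):
    Z2 = matmul(X, W1)                                  # hidden pre-activations
    A2 = [[sigmoid(z) for z in row] for row in Z2]      # a^(2) = g(Z^(2))
    Z3 = matmul(A2, W2)                                 # output pre-activations
    return [[sigmoid(z) for z in row] for row in Z3]    # y-hat = g(Z^(3))

X  = [[1.0, 0.5]]               # one observation (x_A, x_B)
W1 = [[0.1, -0.2], [0.4, 0.3]]  # w_AC, w_AD / w_BC, w_BD (illustrative values)
W2 = [[0.7], [-0.5]]            # w_CE, w_DE (illustrative values)
print(forward(X, W1, W2))       # a single predicted value in (0, 1)
```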

The main problem is that unless the weights are judiciously selected, the output that is produced is unlikely to have anything to do with the desired output. For supervised learning tasks (i.e. when an ANN attempts to emulate the results of training examples), there has to be some method to optimise the choice of the weights against an error function \[R(\mathbf{W})=\sum_{i=1}^n\sum_{\ell=1}^k\left(\hat{y}_{i,\ell}(\mathbf{W})-y_{i,\ell}\right)^2\quad \text{or}\quad R(W)=-\sum_{i=1}^n\sum_{\ell=1}^ky_{i,\ell}\ln \hat{y}_{i,\ell}(\mathbf{W})\] (for value estimation and classification, respectively), where \(n\) is the number of observations in the training set, \(k\) is the number of output nodes in the ANN, \(y_{i,\ell}\) is the known value or class label for the \(\ell^{\textrm{th}}\) output of the \(i^{\textrm{th}}\) observation in the training set.

Enter **backpropagation**, which is simply an application of the chain rule to \(R(\mathbf{W})\). Under reasonable regularity conditions, the desired **minimizer** \(\mathbf{W}^*\) satisfies \(\nabla R (\mathbf{W}^*)=\mathbf{0}\) and is found using **numerical gradient descent**.

#### Gradient-Based Optimization

Initially, the weight matrix \(\mathbf{W}\) is filled with small random values (a step called **random initialization**). The weights are then gradually **trained** (or learned), based on a **feedback signal**. This occurs within a **training loop**, which works as follows:

1. draw a batch of training samples \(\mathbf{x}\) and corresponding targets \(\mathbf{y}\);
2. run the network on \(\mathbf{x}\) (the **forward pass**) to obtain predictions \(\hat{\mathbf{y}}\);
3. compute the loss of the network on the batch, a measure of the mismatch between \(\hat{\mathbf{y}}\) and \(\mathbf{y}\);
4. update all weights of the network in a way that slightly reduces the loss on this batch.

Repeat these steps in a loop, as often as necessary. Hopefully, the process will eventually converge on a network with a very low loss on the training data, which is to say that there will be a low mismatch between the predictions \(\hat{\mathbf{y}}\) and the target \(\mathbf{y}\). In the vernacular, we say that the ANN has **learned** to map its inputs to correct targets.

Step 1 is easy enough. Steps 2 and 3 are simply the application of a handful of tensor operations (or matrix multiplication, as above). Step 4 is more difficult: how do we update the network’s weights? Given an individual weight coefficient in the network, how can we compute whether the coefficient should be increased or decreased, and by how much?

One solution would be to successively minimize the objective function along coordinate directions to find the minimum of a function. This algorithm is called **coordinate descent** and at each iteration determines a coordinate, then minimizes over the corresponding hyperplane while fixing all other coordinates [242].

It is based on the idea that minimization can be achieved by minimizing along one direction at a time. Coordinate descent is useful in situations where the objective function is not **differentiable**, as is the case for most regularized regression models, say. But this approach would be inefficient in deep learning networks, where there is a large collection of individual weights to update. A smarter approach is to use the fact that all operations used to propagate a signal in the network are differentiable, and to compute the **gradient** of the objective function (loss) with regard to the network’s coefficients.

Following a long-standing principle of calculus, we can decrease the objective function by updating the coefficients in the **opposite direction to the gradient**.^{244} For an input vector \(\mathbf{X}\), a weight matrix \(\mathbf{W}\), a target \(\mathbf{Y}\), and a loss function \(L\), we predict a target candidate \(\hat{\mathbf{Y}}(\mathbf{W})\), and compute the loss when approximating \(\mathbf{Y}\) by \(\hat{\mathbf{Y}}(\mathbf{W})\).

If \(\mathbf{X}\) and \(\mathbf{Y}\) are fixed, the loss function maps weights \(\mathbf{W}\) to loss values: \(f(\mathbf{W})=L(\hat{\mathbf{Y}}(\mathbf{W}),\mathbf{Y}).\)

In much the same way that the derivative of a function \(f(x)\) of a single variable at a point \(x_0\) is the slope of the tangent to \(f\) at \(x_0\), the gradient \(\nabla f(\mathbf{W}_0)\) is the tensor describing the direction and rate of steepest increase of \(f(\mathbf{W})\) around \(\mathbf{W}_0\). As is the case with the derivative, we can reduce \(f(\mathbf{W})\) by moving \(\mathbf{W}_0\) to \[\mathbf{W}_1=\mathbf{W}_0-s\nabla f(\mathbf{W}_0),\] where \(s\) is the **learning rate**, a small scalar chosen so that the linear approximation of the hypersurface remains valid close to \(\mathbf{W}_0\).
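As a concrete illustration (a hypothetical toy example, not part of the analyses in this chapter), the update rule above can be applied to a one-parameter loss:

```
# Gradient descent on the toy loss f(w) = (w - 3)^2 (hypothetical example)
f.grad <- function(w) 2*(w - 3)   # analytic gradient of f
w <- 0                            # initial weight
s <- 0.1                          # learning rate
for (step in 1:100){
  w <- w - s*f.grad(w)            # move against the gradient
}
w  # converges to the minimizer w = 3
```

Each step multiplies the distance to the minimizer by \(1-2s=0.8\), so the iterates contract geometrically toward \(w=3\).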

#### Stochastic Gradient Descent

When dealing with ANNs, one could in principle take advantage of the differentiability of the loss by finding its **critical points** \(\nabla f(\mathbf{W})=\mathbf{0}\) analytically.

If the neural network contains \(Q\) edges, this requires solving a system of equations in \(Q\) variables. Since real-world ANNs often have thousands of such connections (if not more), the analytical approach is not practical.

Instead, we modify the parameters slightly based on the current loss value on a random batch of data. Since we are dealing with a differentiable function, we can use a **mini-batch stochastic gradient descent** (minibatch SGD) to update the weights, simply by modifying Step 4 of the gradient descent algorithm as follows:

4a. compute the gradient of the loss with regard to the weights (the
**backward pass**);

4b. update the weights “a little” in the direction opposite the gradient.
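The full loop (draw a batch, forward pass, loss, backward pass, weight update) can be sketched for a simple linear model. The data, batch size, and learning rate below are hypothetical, chosen only to illustrate minibatch SGD:

```
# Mini-batch SGD for a one-predictor linear model (hypothetical data)
set.seed(0)
n <- 500
x <- rnorm(n)
y <- 2*x + 1 + rnorm(n, sd=0.1)              # true weights: w = 2, b = 1
w <- 0; b <- 0                               # initial weights
s <- 0.05                                    # learning rate
for (iter in 1:2000){
  batch  <- sample(n, 32)                    # 1. draw a mini-batch
  y.hat  <- w*x[batch] + b                   # 2. forward pass
  resid  <- y.hat - y[batch]                 # 3. loss is mean(resid^2)
  grad.w <- mean(2*resid*x[batch])           # 4a. backward pass
  grad.b <- mean(2*resid)
  w <- w - s*grad.w                          # 4b. update against the gradient
  b <- b - s*grad.b
}
round(c(w, b), 2)  # close to the true values (2, 1)
```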

Figure 13.31 illustrates how SGD works when the network has only one parameter to learn, with a single training sample.

We immediately see why it is important to choose a reasonable learning rate (the step size): too small a value leads to slow convergence and runs the risk of getting stuck in a local minimum; too large a value may send the descent to essentially random locations on the curve, overshooting the global minimum altogether.

#### SGD Challenges

The main issue with minibatch SGD is that “good” convergence rates are not guaranteed, but there are other challenges as well:

- selecting a reasonable learning rate can be difficult: too small a rate leads to painfully slow convergence, while too large a rate can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge [242];

- the same learning rate applies to all parameter updates, which might not be ideal when the data is sparse;

- a key challenge is minimizing the highly non-convex loss functions that commonly occur in ANNs while avoiding getting trapped in suboptimal local minima or saddle points; it is hard for SGD to escape such local minima, and even harder to escape saddle points [253].
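The learning-rate trade-off can be made concrete with a small (hypothetical) experiment on the loss \(f(w)=w^2\), whose gradient is \(2w\); each update multiplies \(w\) by \(1-2s\):

```
# Effect of the learning rate s on gradient descent for f(w) = w^2 (hypothetical)
descend <- function(s, w = 1, steps = 25){
  for (i in 1:steps) w <- w - s*2*w   # gradient of w^2 is 2w
  w
}
descend(0.01)  # too small: still far from the minimum at 0
descend(0.4)   # reasonable: essentially converged
descend(1.1)   # too large: |w| grows at every step (divergence)
```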

#### SGD Variants

There are several SGD variants that are commonly used by the deep learning community to overcome the aforementioned challenges. They take into account the previous weight updates when computing the next weight update, rather than simply considering the current value of the gradients. Popular **optimizers** include SGD with momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSProp, and many more [254], [255].^{245}

ANNs can be quite accurate when making predictions – more so than other algorithms, given a proper set-up (which can be hard to achieve). They degrade gracefully, and they often work when other approaches fail:

- when the relationship between attributes is **complex**;

- when there are a lot of dependencies/**nonlinear relationships**;

- when the inputs are **messy** and highly-connected (images, text, and speech), and

- when dealing with non-linear classification.

But they are relatively slow and prone to overfitting (unless they have access to large and diverse training sets), they are notoriously hard to interpret due to their **blackbox** nature, and there is no algorithm in place to select the optimal network topology.

Finally, even when they do perform better than other options, ANNs may not perform that much better due to the **No Free-Lunch** theorems; and they always remain susceptible to various forms of **adversarial attacks** [256], so they should be used with caution.

#### Example

In this example, we explore the `wine` dataset with the ANN architecture implemented in the `R` package `neuralnet`. We will revisit deep learning networks, and more complicated topologies, in the Deep Learning module.

```
wine = read.csv("Data/wine.csv", header=TRUE)
wine = as.data.frame(wine)
str(wine)
table(wine$Class)
```

```
'data.frame': 178 obs. of 14 variables:
$ Class : int 1 1 1 1 1 1 1 1 1 1 ...
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ Malic.acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ Alcalinity.of.ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
$ Total.phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ Nonflavanoid.phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Colour.intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ OD280.OD315 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
1 2 3
59 71 48
```

We notice that there are only 3 classes: 1, 2, 3. These classes will be the levels of the categorical response variable for the classification task.

We set up the model parameters/inputs as follows:

```
# Number of instances
n = nrow(wine)
# Dependent variable - Class
Y = wine$Class
# Independent variables (full)
X = wine[,-1]
# Indices for Class=1,2,3
C1.loc = which(Y==1)
C2.loc = which(Y==2)
C3.loc = which(Y==3)
```

We begin data exploration by taking a look at the variables’ boxplots, an excellent way to understand the distribution of each variable.

```
plot.title = "Boxplot of Variables in the Wine dataset (original scale)"
boxplot(X, main=plot.title, xaxt="n")
axis(1, at=1:dim(X)[2], labels=colnames(X), las=2, cex.axis = 0.5)
```

We clearly see that `Proline` has a much higher magnitude in both mean value and variability. This suggests that if we apply a reduction technique like PCA to the wine dataset, the 1st principal component will be based almost entirely on the `Proline` value. In order to reduce such undue effects, we first standardize the data (see Feature Selection and Dimension Reduction for details).

```
# Standardized predictor variables (full)
X.std = scale(X)
plot.title = "Boxplot of Variables in the Wine dataset (standardized)"
boxplot(X.std, main=plot.title, xaxt="n")
axis(1, at=1:dim(X)[2], labels=colnames(X), las=2, cex.axis = 0.5)
```

We now shift our focus to understanding the relationships amongst the explanatory variables. This is important because:

- scatter plots show us whether classification is a relatively simple task for our data;

- they let us visually inspect potential outliers or influential points;

- correlations amongst variables tell us whether it is necessary to retain all of them, and

- the variance inflation factor (VIF) (see Regression Analysis) helps us determine which variables can be removed in order to obtain more stable models, etc.

Let’s first take a look at the scatterplot matrix:

We can also calculate and display the correlation matrix:

We write a small function that computes the VIF of each variable in a design matrix:

```
vif <- function(X){
  # VIF for column i: 1/(1-R^2) from regressing X[,i] on the other columns
  vif = rep(0, dim(X)[2])
  for (i in 1:dim(X)[2]){
    expl = X[,-c(i)]
    y = lm(X[,i]~expl)
    vif[i] = 1/(1-summary(y)$r.squared)
  }
  return(vif)
}
vif.X = matrix(vif(X.std), nrow=1)
colnames(vif.X) = colnames(X)
rownames(vif.X) = "VIF"
round(t(vif.X),2)
```

```
VIF
Alcohol 2.46
Malic.acid 1.66
Ash 2.19
Alcalinity.of.ash 2.24
Magnesium 1.42
Total.phenols 4.33
Flavanoids 7.03
Nonflavanoid.phenols 1.80
Proanthocyanins 1.98
Colour.intensity 3.03
Hue 2.55
OD280.OD315 3.79
Proline 2.82
```

We see that `Flavanoids` has high multicollinearity with respect to the other variables, as its VIF is greater than 5; as such, we have reasonable grounds to remove that variable from further analyses, since the other variables can explain how `Flavanoids` behaves, and doing so might reduce the risk of creating **unstable models**.

Now we apply a PCA reduction (see Feature Selection and Dimension Reduction for details) to further reduce the problem complexity.

We start by performing principal component analysis on the standardized data:
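The PCA call itself is not shown; a minimal sketch consistent with the later use of `pca.std$rotation` might look as follows (assuming, as the 12 retained components and the loadings table suggest, that `Flavanoids` has been dropped and `X.std` re-computed accordingly; the summary shown next would then come from `summary(pca.std)`):

```
# Drop Flavanoids, re-standardize, and run PCA (assumed step)
X.std = scale(X[, !colnames(X) %in% "Flavanoids"])
pca.std = prcomp(X.std)
```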

Let us provide a summary of the outcome:

```
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.9757 1.5802 1.1870 0.94820 0.91472 0.80087 0.74082
Proportion of Variance 0.3253 0.2081 0.1174 0.07492 0.06973 0.05345 0.04573
Cumulative Proportion 0.3253 0.5334 0.6508 0.72569 0.79541 0.84886 0.89460
PC8 PC9 PC10 PC11 PC12
Standard deviation 0.58095 0.53687 0.49487 0.4750 0.41059
Proportion of Variance 0.02813 0.02402 0.02041 0.0188 0.01405
Cumulative Proportion 0.92272 0.94674 0.96715 0.9859 1.00000
```

PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Alcohol | -0.17 | 0.48 | -0.20 | 0.09 | -0.26 | 0.21 | -0.07 | 0.40 | -0.50 | -0.23 | 0.21 | -0.26 |

Malic.acid | 0.28 | 0.23 | 0.04 | -0.52 | -0.12 | 0.53 | 0.42 | 0.10 | 0.10 | 0.29 | -0.07 | 0.12 |

Ash | 0.02 | 0.32 | 0.64 | 0.21 | -0.11 | 0.15 | -0.15 | -0.21 | 0.29 | 0.03 | 0.51 | -0.06 |

Alcalinity.of.ash | 0.28 | -0.01 | 0.59 | -0.09 | 0.03 | -0.10 | -0.30 | 0.43 | -0.18 | -0.06 | -0.49 | -0.05 |

Magnesium | -0.17 | 0.30 | 0.18 | 0.12 | 0.78 | 0.05 | 0.34 | -0.12 | -0.27 | -0.06 | -0.07 | 0.07 |

Total.phenols | -0.42 | 0.06 | 0.17 | -0.18 | -0.24 | -0.09 | -0.04 | -0.45 | -0.31 | 0.46 | -0.27 | -0.33 |

Nonflavanoid.phenols | 0.33 | 0.03 | 0.14 | 0.35 | -0.41 | -0.26 | 0.60 | -0.21 | -0.21 | -0.21 | -0.13 | 0.05 |

Proanthocyanins | -0.34 | 0.04 | 0.17 | -0.46 | -0.04 | -0.55 | 0.34 | 0.31 | 0.21 | -0.15 | 0.23 | -0.10 |

Colour.intensity | 0.10 | 0.53 | -0.16 | -0.02 | -0.06 | -0.42 | -0.23 | -0.02 | -0.05 | 0.31 | -0.02 | 0.60 |

Hue | -0.33 | -0.28 | 0.14 | 0.43 | -0.09 | 0.10 | 0.22 | 0.47 | -0.04 | 0.50 | 0.07 | 0.25 |

OD280.OD315 | -0.40 | -0.17 | 0.20 | -0.19 | -0.20 | 0.26 | -0.07 | -0.17 | -0.18 | -0.45 | -0.06 | 0.59 |

Proline | -0.33 | 0.36 | -0.09 | 0.25 | -0.11 | 0.12 | 0.07 | 0.07 | 0.57 | -0.17 | -0.55 | -0.07 |

The scree-plot for proportions of variance explained by each principal component is shown below:

```
plot.title = "Scree Plot: Proportion of Variance Explained"
scree.y = eigen(t(X.std)%*%X.std)$values/sum(eigen(t(X.std)%*%X.std)$values)
barplot(scree.y, main=plot.title, ylim=c(0, 0.35), ylab="% explained",
        xlab="PC", col=heat.colors(12))
axis(1, at=seq(0.7, 13.9, length.out=12), labels=1:12)
```

Based on the summary table, the first 6 PCs would be required to retain 80% of the explanatory power; the scree plot, on the other hand, shows a knee at the 4th component. We can explore this further: let us take a look at the scatterplot of the first two PCs. We start by transforming `X.std` into its principal coordinates:

```
PC = X.std%*%pca.std$rotation
colnames(PC) <- paste0("PC", 1:12)
plot.title = "PC1 vs. PC2"
plot(PC[,1], PC[,2], cex=1.2, main=plot.title, col=Y+1, xlab="PC1", ylab="PC2")
```

The scatterplot shows that the three classes are separated reasonably well by PC1 and PC2 (although, not linearly).

```
plot.title = "Boxplots (in Principal Coordinates)"
par(mfrow = c(3,4))
for (i in 1:12){
plot.title.ind = paste("PC ", i, sep="")
boxplot(PC[,i]~Y, main=plot.title.ind, col=c("red","green","blue"))
}
```

The boxplots further suggest that PC3 to PC12 do not provide as clear a separation as PC1 and PC2. We will thus use only PC1 and PC2 to **visually** evaluate the effectiveness of the ANN (and its prediction regions).

In order to evaluate the effectiveness of the model (i.e., does it have good predictive power without overfitting the data?), we split the data into training and testing sets. The model is then developed using the training set (i.e., optimized using a subset of data), and then evaluated for its prediction power using the testing set.

There are \(n=178\) observations in total: we sample \(140\) of them for the training set (say, 46 of class 1, 56 of class 2, and 38 of class 3); the remaining 38 observations form the testing set.

```
set.seed(1111) # for replicability
C1.train.loc = sort(sample(C1.loc, size=46))
C2.train.loc = sort(sample(C2.loc, size=56))
C3.train.loc = sort(sample(C3.loc, size=38))
train.loc = c(C1.train.loc, C2.train.loc, C3.train.loc)
test.loc = which(!(1:length(Y) %in% train.loc))
# training data
PC.train = PC[train.loc,]
Y.train = Y[train.loc]
dat.train = as.data.frame(cbind(nnet::class.ind(Y.train), PC.train))
colnames(dat.train)[1:3] = c("C1", "C2", "C3")
# testing data
PC.test = PC[test.loc,]
Y.test = Y[test.loc]
dat.test = as.data.frame(cbind(nnet::class.ind(Y.test), PC.test))
colnames(dat.test)[1:3] = c("C1", "C2", "C3")
```

Let us display the training dataset (circles) and testing dataset (triangles) on the PC1/PC2 scatterplot.

```
plot.title = "Training and Testing data"
xlimit = c(-4,4)
ylimit = c(-3,3)
plot(dat.train$PC1, dat.train$PC2, cex=1.2, col=Y.train+1, main=plot.title,
xlab="PC1", ylab="PC2", xlim=xlimit, ylim=ylimit)
points(dat.test$PC1, dat.test$PC2, pch=17, cex=1.5, col=Y.test+1)
```

If we only use the PC1/PC2-reduced data, we would expect at least two of the test observations to be misclassified (the left-most and right-most green triangles).

We are finally ready to build an ANN *via* the `R` package `neuralnet` (the main function is also called `neuralnet()`). We run three analyses:

- using only the first two principal components as inputs;

- using the first six principal components as inputs, and

- using all 12 principal components as inputs.

We start by forming a grid in the PC1/PC2 space on which we can colour the prediction regions.

```
predict.region.PC1=seq(-5,5, length.out=100)
predict.region.PC2=seq(-4,4, length.out=100)
predict.region=expand.grid(x=predict.region.PC1, y=predict.region.PC2)
```

We will also use an expanded form of the confusion matrix:

```
# A souped-up version of the confusion matrix
confusion.expand <- function (pred.c, class) {
  temp <- mda::confusion(pred.c, class)
  row.sum <- apply(temp, 1, sum)
  col.sum <- apply(temp, 2, sum)
  t.sum <- sum(col.sum)
  tmp <- rbind(temp, rep("----", dim(temp)[2]), col.sum)
  tmp <- noquote(cbind(tmp, rep("|", dim(tmp)[1]), c(row.sum, "----", t.sum)))
  dimnames(tmp) <- list(object = c(dimnames(temp)[[1]], "-------", "Col Sum"),
                        true = c(dimnames(temp)[[2]], "|", "Row Sum"))
  attr(tmp, "error") <- attr(temp, "error")
  attr(tmp, "mismatch") <- attr(temp, "mismatch")
  return(tmp)
}
```

In the following chunks of code, we build an ANN model using `neuralnet`, with PC1 and PC2 as inputs. The parameter `model.structure`, which defines the number of hidden nodes in each hidden layer, is modifiable:

- a model with no hidden layer would have `model.structure = 0`;

- a model with 1 hidden layer of 3 nodes would require `model.structure = 3`;

- a model with 2 hidden layers of 5 and 10 nodes, respectively, would require `model.structure = c(5,10)`, and so on.

We will use 2 hidden layers of 10 nodes each.

```
model.structure = c(10,10)
model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ PC1 + PC2, data = dat.train,
hidden = model.structure, err.fct = "ce",
linear.output = FALSE)
prob.model1.test <- neuralnet::compute(model1, PC.test[,1:2])
predict.model1.test = max.col(prob.model1.test$net.result)
print(paste("Confusion matrix (testing) for model = ", list(model.structure)[1], sep=""))
(conf.test=confusion.expand(predict.model1.test, Y.test))
```

```
[1] "Confusion matrix (testing) for model = c(10, 10)"
true
object 1 2 3 | Row Sum
1 13 1 0 | 14
2 0 13 1 | 14
3 0 1 9 | 10
------- ---- ---- ---- | ----
Col Sum 13 15 10 | 38
attr(,"error")
[1] 0.07894737
```

We can compute the prediction region for the two-input model and display it as follows:

```
prob.model1.region <- neuralnet::compute(model1, predict.region[,1:2])
predict.model1.region = max.col(prob.model1.region$net.result)
plot.title=paste("Prediction region for ANN with structure = ",
list(model.structure)[1], sep="")
plot(predict.region[,1], predict.region[,2], main=plot.title, xlim=xlimit, ylim=ylimit,
xlab="PC1", ylab="PC2", col=predict.model1.region+1, pch="+", cex=0.4)
points(dat.train$PC1, dat.train$PC2, cex=1.2, col=Y.train+1)
points(dat.test$PC1, dat.test$PC2, pch=17, cex=1.5, col=Y.test+1)
```

Notice the complex decision boundary.

Since the error function we seek to minimize (e.g., SSRes) is **non-convex**, it is possible for the ANN to get stuck in a local minimum rather than converge to the global minimum. We can run the model a number of times (say, 50 replicates) and then find the average prediction. We do so here, and investigate the mean, median, and maximum number of misclassifications, as follows:

```
model.structure = c(10,10)
n.j = 50
conf.train.vector = conf.test.vector = NULL
for (j in 1:n.j){
  model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ PC1 + PC2, data = dat.train,
                                 hidden = model.structure, err.fct = "ce",
                                 linear.output = FALSE)
  prob.model1.test <- neuralnet::compute(model1, PC.test[,1:2])
  predict.model1.test = max.col(prob.model1.test$net.result)
  conf.test = confusion.expand(predict.model1.test, Y.test)
  conf.test.vector = c(conf.test.vector, attributes(conf.test)$error)
}
# Convert proportion of misclassifications to a number of instances
conf.test.vector = round(conf.test.vector*length(Y.test))
print(paste("Summary of number of misclassifications in testing data out of", n.j, "trials", sep=" "))
round(summary(conf.test.vector), digits=2)
```

```
[1] "Summary of number of misclassifications in testing data out of 50 trials"
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 3.00 3.00 2.92 3.00 4.00
```

**Note:** this code may produce an error saying that ANN has issues converging (especially for simpler models). If this happens, simply re-run the code again or change the seed.

Now, we build artificial neural networks for the PCA-reduced 6-input dataset. We will try 11 different topologies: no hidden layer; 1 hidden layer with 2, 6, 10, or 30 nodes; 2 hidden layers with (6,6), (10,10), or (30,30) nodes, and 3 hidden layers with (6,6,6), (10,10,10), or (30,30,30) nodes.

For each of these 11 topologies, we replicate the process 25 times. Note that prediction regions are not computed, as the input space has more than 2 dimensions.

```
model.structure = list(0, # no hidden layer
2, 6, 10, 30, # 1 hidden layer
rep(6,2), rep(10,2), rep(30,2), # 2 hidden layers
rep(6,3), rep(10,3), rep(30,3)) # 3 hidden layers
set.seed(1)
results = NULL
n.loop = length(model.structure)
n.j = 25
for (i in 1:n.loop){
  conf.train.vector = conf.test.vector = NULL
  for (j in 1:n.j){
    model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
                                   data = dat.train, hidden = model.structure[[i]],
                                   err.fct = "ce", linear.output = FALSE)
    prob.model1.test <- neuralnet::compute(model1, PC.test[,1:6])
    predict.model1.test = max.col(prob.model1.test$net.result)
    conf.test = confusion.expand(predict.model1.test, Y.test)
    conf.test.vector = c(conf.test.vector, attributes(conf.test)$error)
  }
  results[[i]] = summary(round(conf.test.vector*length(Y.test)))
}
results = as.data.frame(dplyr::bind_rows(results, .id = "column_label"))
colnames(results) <- c("hidden", "min", "Q1", "med", "mean", "Q3", "max")
results$hidden <- model.structure
```

hidden | min | Q1 | med | mean | Q3 | max |
---|---|---|---|---|---|---|

0 | 2 | 2 | 2 | 2.00 | 2 | 2 |

2 | 1 | 2 | 2 | 1.92 | 2 | 3 |

6 | 1 | 2 | 2 | 1.92 | 2 | 2 |

10 | 1 | 2 | 2 | 1.96 | 2 | 2 |

30 | 2 | 2 | 2 | 2.00 | 2 | 2 |

6, 6 | 1 | 2 | 2 | 1.96 | 2 | 2 |

10, 10 | 1 | 2 | 2 | 1.88 | 2 | 2 |

30, 30 | 2 | 2 | 2 | 2.00 | 2 | 2 |

6, 6, 6 | 1 | 2 | 2 | 1.88 | 2 | 2 |

10, 10, 10 | 1 | 2 | 2 | 1.88 | 2 | 2 |

30, 30, 30 | 1 | 2 | 2 | 1.96 | 2 | 2 |

We can repeat this process once more, using all 12 PC.

```
model.structure = list(0, # no hidden layer
2, 6, 10, 30, # 1 hidden layer
rep(6,2), rep(10,2), rep(30,2), # 2 hidden layers
rep(6,3), rep(10,3), rep(30,3)) # 3 hidden layers
set.seed(1)
results = NULL
n.loop = length(model.structure)
n.j = 25
for (i in 1:n.loop){
  conf.train.vector = conf.test.vector = NULL
  for (j in 1:n.j){
    model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ .,
                                   data = dat.train, hidden = model.structure[[i]],
                                   err.fct = "ce", linear.output = FALSE)
    prob.model1.test <- neuralnet::compute(model1, PC.test[,1:12])
    predict.model1.test = max.col(prob.model1.test$net.result)
    conf.test = confusion.expand(predict.model1.test, Y.test)
    conf.test.vector = c(conf.test.vector, attributes(conf.test)$error)
  }
  results[[i]] = summary(round(conf.test.vector*length(Y.test)))
}
results = as.data.frame(dplyr::bind_rows(results, .id = "column_label"))
colnames(results) <- c("hidden", "min", "Q1", "med", "mean", "Q3", "max")
results$hidden <- model.structure
```

hidden | min | Q1 | med | mean | Q3 | max |
---|---|---|---|---|---|---|

0 | 2 | 2 | 2 | 2.00 | 2 | 2 |

2 | 1 | 1 | 1 | 1.80 | 3 | 4 |

6 | 1 | 2 | 2 | 2.04 | 2 | 3 |

10 | 1 | 2 | 2 | 1.96 | 2 | 2 |

30 | 2 | 2 | 2 | 2.00 | 2 | 2 |

6, 6 | 1 | 2 | 2 | 2.00 | 2 | 3 |

10, 10 | 1 | 2 | 2 | 1.92 | 2 | 3 |

30, 30 | 1 | 2 | 2 | 1.84 | 2 | 2 |

6, 6, 6 | 1 | 1 | 2 | 1.76 | 2 | 3 |

10, 10, 10 | 1 | 2 | 2 | 1.84 | 2 | 2 |

30, 30, 30 | 1 | 1 | 2 | 1.68 | 2 | 2 |

Comparing the mean number of misclassifications, what can we conclude?

### 13.4.4 Naïve Bayes Classifiers

In classical statistics, model parameters (such as \(\mu\) and \(\sigma\), say) are treated as constants; **Bayesian statistics**, on the other hand, assumes that **model parameters are random variables**.

**Bayes’ Theorem** (see Section 3.1.7) lies at the foundation of such statistics: \[P(\text{hypothesis}\mid \text{data})=\frac{P(\text{data}\mid \text{hypothesis})\times P(\text{hypothesis})}{P(\text{data})},\] or simply \[P(H\mid D)=\frac{P(D\mid H)\times P(H)}{P(D)}.\] This is sometimes
written in shorthand as \(P(H\mid D) \propto P(D\mid H)\times P(H)\); in other words, our **degree of belief in a hypothesis should be updated by the evidence provided by the data**.^{246} We will discuss Bayesian data analysis in depth in the Bayesian Data Analysis module.
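As a quick numerical illustration of this updating (with entirely hypothetical numbers), consider a diagnostic test with a 1% prevalence rate, 95% sensitivity, and a 5% false positive rate:

```
# Toy illustration of Bayes' Theorem (hypothetical numbers)
p.H    = 0.01   # prior P(H): prevalence
p.D.H  = 0.95   # likelihood P(D|H): sensitivity
p.D.nH = 0.05   # P(D|not H): false positive rate
p.D = p.D.H*p.H + p.D.nH*(1 - p.H)   # evidence, by the law of total probability
p.H.D = p.D.H*p.H / p.D              # posterior P(H|D)
round(p.H.D, 3)  # 0.161
```

Even with a positive test, the posterior probability of the hypothesis remains modest because the prior is small.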

#### Naïve Bayes Classification for Tumor Diagnoses

We illustrate one approach to classification, which, in spite of the name, is not a Bayesian method in the strict sense: **naïve Bayes classifiers** (NBC). Suppose we are interested in diagnosing whether a tumor is benign or malignant, based on several measurements obtained from video imaging.

Bayes’ Theorem can be recast as \[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \propto \text{likelihood} \times \text{prior}, \] where

- **posterior:** \(P(H\mid D)=\) based on collected data, how likely is a given tumor to be benign (or malignant)?

- **prior:** \(P(H)=\) in what proportion are tumors benign (or malignant) in general?

- **likelihood:** \(P(D\mid H)=\) knowing a tumor is benign (or malignant), how likely is it that these particular measurements would have been observed?

- **evidence:** \(P(D)=\) regardless of a tumor being benign or malignant, what is the chance that a tumor has the observed characteristics?

The NBC procedure is straightforward.

**Objective function:** a simple way to determine whether a tumor is benign or malignant is to compare **posterior probabilities** and choose the one with the highest probability. That is, we diagnose a tumor as **malignant** if \[\frac{P(\textrm{malignant}\mid D)}{P(\textrm{benign}\mid D)}=\frac{P(D\mid \textrm{malignant})\times P(\textrm{malignant})}{P(D\mid \textrm{benign})\times P(\textrm{benign})}>1,\] and as **benign** otherwise.

**Dataset:** there are \(n=699\) observations (from the [in]famous *Wisconsin Breast Cancer* dataset) with nine predictor measurements, each scored on a scale of 1 to 10, with a score of 0 being reserved for missing values. The predictors include items such as `Clump_Thickness` and `Bare_Nuclei`; the categorical response variable is the `Class` (`Benign`, `Malignant`).

```
dat.BC = read.csv("Data/BreastCancer-Wisconsin.csv", header=TRUE, stringsAsFactors = TRUE)
str(dat.BC)
```

```
'data.frame': 699 obs. of 10 variables:
$ Clump_Thickness : int 5 5 3 6 4 8 1 2 2 4 ...
$ Uniformity_of_Cell_Size : int 1 4 1 8 1 10 1 1 1 2 ...
$ Uniformity_of_Cell_Shape : int 1 4 1 8 1 10 1 2 1 1 ...
$ Marginal_Adhesion : int 1 5 1 1 3 8 1 1 1 1 ...
$ Single_Epithelial_Cell_Size: int 2 7 2 3 2 7 2 2 2 2 ...
$ Bare_Nuclei : int 1 10 2 4 1 10 10 1 1 1 ...
$ Bland_Chromatin : int 3 3 3 3 3 9 3 3 1 2 ...
$ Normal_Nucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
$ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
$ Class : Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 2 1 1 1 1 ...
```

In table layout, the first 6 observations look like:

Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
---|---|---|---|---|---|---|---|---|---|

5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | Benign |

5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | Benign |

3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | Benign |

6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | Benign |

4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | Benign |

8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | Malignant |

We create a training/testing split for the data, by randomly selecting roughly 80% of the observations for the training set and the remaining 20% for the testing set.

```
set.seed(1234) # for reproducibility
ind = 1:nrow(dat.BC)
prop.train = 0.8
n.train = floor(nrow(dat.BC)*prop.train)
# indices of training observations
loc.train = sort(sample(ind, n.train, replace=FALSE))
# indices of testing observations
loc.test = ind[-which(ind%in%loc.train)]
# training data
dat.BC.train = dat.BC[loc.train,]
# test data
dat.BC.test = dat.BC[loc.test,]
```

We separate the Benign and Malignant subsets of the training data for graphing purposes.

```
location.Benign = which(dat.BC.train$Class=="Benign")
location.Malignant = which(!(1:nrow(dat.BC.train) %in% location.Benign))
cols_remove = c("Class")
dat.Benign=dat.BC.train[location.Benign,!colnames(dat.BC) %in% cols_remove]
dat.Malignant=dat.BC.train[location.Malignant,!colnames(dat.BC) %in% cols_remove]
```

The boxplots of the training measurements are shown below.

```
dat.BC2 = reshape2::melt(dat.BC.train, id.var="Class")
ggplot2::ggplot(data = dat.BC2, ggplot2::aes(x=variable, y=value)) +
ggplot2::geom_boxplot(ggplot2::aes(fill=Class)) +
ggplot2::scale_y_discrete(limits = 1:10) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1))
```

From these plots, we learn that benign tumors have lower scores on average, while malignant tumors have much higher scores, as well as greater variability.

In what follows, we treat the test observations as **undiagnosed cases**.

**Assumptions:** we assume that the scores of each measurement in a class are independent of one another (hence the **naïve** qualifier); this assumption reduces the likelihood to \[\begin{aligned} P(H\mid D)&=P(H\mid x_{1},x_{2},\cdots,x_{9}) \propto P( x_{1},x_{2},\cdots,x_{9}\mid H) \times P(H) \\ & =P(x_{1}\mid H)\times \cdots\times P(x_9\mid H)\times P(H). \end{aligned}\]

Note that this assumption of **conditional independence** is not usually satisfied.

**Prior distribution:** we can ask subject matter experts to provide a rough estimate for the general ratio of benign to malignant tumors, or use the proportion of benign tumors in the sample as our prior. In situations where we have no knowledge about the distribution of priors, we may simply assume a **non-informative prior** (in this case, the prevalence rates would be the same for both responses).

In this example, we will assume that the training data represents the tumor population adequately, and we use the observed proportions as estimated prior probabilities.

```
n.Benign.train = nrow(dat.Benign)
n.Malignant.train = nrow(dat.Malignant)
(prior.Benign = n.Benign.train/(n.Benign.train + n.Malignant.train))
(prior.Malignant = 1 - prior.Benign)
```

```
[1] 0.6529517
[1] 0.3470483
```

**Computation of likelihoods:** under conditional independence, each measurement is assumed to follow a multinomial distribution (since scores are provided on a \(1\) to \(10\) scale): for each predictor and for each class, we must estimate \(p_1, \ldots, p_{10}\), with \(p_1+\cdots+p_{10}=1\).^{247} The best estimates are thus \[\hat{p}_{i,\text{pred}}=\frac{\text{# of training cells in the class with pred score $i$}}{\text{# of training cells in the class}},\quad i=1,\ldots, 10.\]

This is done in the code below; note that `count.xyz` is a count matrix, while `freq.xyz` is a frequency matrix.

```
# Benign cells
count.Benign = freq.Benign = NULL
for (i in 1:(ncol(dat.BC.train)-1)){
  # tabulate scores 0-10 (the "-1" removes the padding added by c(..., 0:10))
  test.count = table(c(dat.Benign[,i],0:10))-1
  test.freq = test.count/sum(test.count)
  count.Benign = cbind(count.Benign, test.count)
  freq.Benign = cbind(freq.Benign, test.freq)
}
colnames(count.Benign) = colnames(freq.Benign) = colnames(dat.Benign)
p.Benign = freq.Benign
p.Benign[1,] = 1   # score 0 (missing) contributes a factor of 1 to the likelihood
# Malignant cells
count.Malignant = freq.Malignant = NULL
for (i in 1:(ncol(dat.BC.train)-1)){
  test.count = table(c(dat.Malignant[,i],0:10))-1
  test.freq = test.count/sum(test.count)
  count.Malignant = cbind(count.Malignant, test.count)
  freq.Malignant = cbind(freq.Malignant, test.freq)
}
colnames(count.Malignant) = colnames(freq.Malignant) = colnames(dat.Malignant)
p.Malignant = freq.Malignant
p.Malignant[1,] = 1
```

These are then the best estimates for the multinomial parameters, for the benign tumors:

Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | |
---|---|---|---|---|---|---|---|---|---|

0 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |

1 | 0.3342466 | 0.8356164 | 0.7616438 | 0.8273973 | 0.1123288 | 0.8383562 | 0.3397260 | 0.8931507 | 0.9780822 |

2 | 0.0986301 | 0.0821918 | 0.1178082 | 0.0767123 | 0.7863014 | 0.0465753 | 0.3589041 | 0.0575342 | 0.0136986 |

3 | 0.2000000 | 0.0575342 | 0.0821918 | 0.0630137 | 0.0630137 | 0.0356164 | 0.2657534 | 0.0219178 | 0.0027397 |

4 | 0.1315068 | 0.0109589 | 0.0219178 | 0.0109589 | 0.0164384 | 0.0136986 | 0.0136986 | 0.0027397 | 0.0000000 |

5 | 0.1945205 | 0.0000000 | 0.0027397 | 0.0054795 | 0.0109589 | 0.0219178 | 0.0082192 | 0.0027397 | 0.0000000 |

6 | 0.0356164 | 0.0054795 | 0.0054795 | 0.0109589 | 0.0027397 | 0.0000000 | 0.0027397 | 0.0054795 | 0.0000000 |

7 | 0.0027397 | 0.0027397 | 0.0054795 | 0.0000000 | 0.0054795 | 0.0000000 | 0.0109589 | 0.0054795 | 0.0027397 |

8 | 0.0027397 | 0.0027397 | 0.0027397 | 0.0000000 | 0.0027397 | 0.0054795 | 0.0000000 | 0.0082192 | 0.0027397 |

9 | 0.0000000 | 0.0027397 | 0.0000000 | 0.0027397 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0027397 | 0.0000000 |

10 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0027397 | 0.0000000 | 0.0054795 | 0.0000000 | 0.0000000 | 0.0000000 |

and the malignant tumors:

Score | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses
---|---|---|---|---|---|---|---|---|---
0 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000
1 | 0.0154639 | 0.0103093 | 0.0103093 | 0.1494845 | 0.0051546 | 0.0463918 | 0.0051546 | 0.1752577 | 0.5257732
2 | 0.0154639 | 0.0360825 | 0.0257732 | 0.0979381 | 0.0927835 | 0.0257732 | 0.0309278 | 0.0309278 | 0.1082474
3 | 0.0567010 | 0.1134021 | 0.0979381 | 0.1030928 | 0.2010309 | 0.0721649 | 0.1546392 | 0.1391753 | 0.1340206
4 | 0.0567010 | 0.1391753 | 0.1546392 | 0.1185567 | 0.1391753 | 0.0567010 | 0.1185567 | 0.0773196 | 0.0515464
5 | 0.1649485 | 0.1185567 | 0.1185567 | 0.0721649 | 0.1494845 | 0.0721649 | 0.1340206 | 0.0773196 | 0.0257732
6 | 0.0618557 | 0.1030928 | 0.1030928 | 0.0721649 | 0.1546392 | 0.0206186 | 0.0360825 | 0.0721649 | 0.0154639
7 | 0.0927835 | 0.0824742 | 0.1082474 | 0.0515464 | 0.0360825 | 0.0309278 | 0.2783505 | 0.0463918 | 0.0412371
8 | 0.1701031 | 0.1134021 | 0.1082474 | 0.0824742 | 0.0824742 | 0.0876289 | 0.1082474 | 0.0773196 | 0.0309278
9 | 0.0618557 | 0.0154639 | 0.0360825 | 0.0206186 | 0.0103093 | 0.0360825 | 0.0463918 | 0.0412371 | 0.0000000
10 | 0.3041237 | 0.2680412 | 0.2371134 | 0.2319588 | 0.1288660 | 0.5463918 | 0.0876289 | 0.2628866 | 0.0670103

```
barplot(p.Benign[2:11,4],
xlab = "Score",
ylab = "Relative Frequency",
main = "Marginal Adhesion Scores - Benign Cells")
```

```
barplot(p.Malignant[2:11,4],
xlab = "Score",
ylab = "Relative Frequency",
main = "Marginal Adhesion Scores - Malignant Cells")
```

Multiplying the conditional probabilities across the nine predictors, under each class's multinomial distribution, yields the overall likelihood of the observed signature for benign and malignant tumors, respectively.

For instance, if the signature of an undiagnosed case was \[(1,1,3,1,0,2,3,2,2),\] then we would multiply the probabilities corresponding to each score across predictors, once assuming that the cell was benign, and once assuming it was malignant:

\[\begin{aligned} P(D\mid H)&=P(\mathbf{x}=(1,1,3,1,0,2,3,2,2) \mid H) \\ &=P(x_1=1\mid H)\times P(x_2=1\mid H)\times P(x_3=3\mid H)\times P(x_4=1\mid H)\times P(x_5=0\mid H) \\ & \qquad \times P(x_6=2\mid H)\times P(x_7=3\mid H)\times P(x_8=2\mid H)\times P(x_9=2\mid H). \end{aligned}\] We can extract the signature vector of probabilities in both cases as follows:

For the benign class, we have:
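
A sketch of the extraction step, assuming (as in the malignant call further down) that `x` holds the row indices (score + 1) and `y` the predictor column indices:

```
x = c(1,1,3,1,0,2,3,2,2) + 1   # row index in the probability matrix is score + 1
y = 1:9                        # one column per predictor
p.Benign[as.matrix(data.frame(x,y))]
(l.Benign = prod(p.Benign[as.matrix(data.frame(x,y))]))
```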

```
[1] 0.33424658 0.83561644 0.08219178 0.82739726 1.00000000 0.04657534 0.26575342
[8] 0.05753425 0.01369863
[1] 1.852912e-07
```

For the malignant class, we have:

```
p.Malignant[as.matrix(data.frame(x,y))]
(l.Malignant = prod(p.Malignant[as.matrix(data.frame(x,y))]))
```

```
[1] 0.01546392 0.01030928 0.09793814 0.14948454 1.00000000 0.02577320 0.15463918
[8] 0.03092784 0.10824742
[1] 3.114231e-11
```

Based on the multinomial probabilities given in `p.Benign` and `p.Malignant`, the (naïve) likelihood of the undiagnosed case being a benign tumor would thus be \(1.85 \times 10^{-7}\), while the likelihood of it being a malignant tumor would be \(3.11\times 10^{-11}\).

**Be careful:** these are the likelihoods, not the posteriors!

**Computation of the posterior:** multiplying the corresponding prior probabilities and likelihoods, we obtain a quantity that is proportional to the respective posterior probabilities: \[ P(H \mid \mathbf{x}) \propto P(H) \times P(\mathbf{x} \mid H)\approx P(H) \times \prod_{i=1}^9 P(x_i\mid H).\]

The “likelihoods” can be computed as follows:

```
test.Benign = test.Malignant = NULL
likelihood.Benign = likelihood.Malignant = NULL
for (i in 1:nrow(dat.BC.test)){
  # row indices in the probability matrices: observed score + 1
  location = rapply(dat.BC.test[i,-10]+1, c)
  for (j in 1:length(location)){
    test.Benign[j] = p.Benign[location[j],j]
    test.Malignant[j] = p.Malignant[location[j],j]
  }
  likelihood.Benign.i = prod(test.Benign)
  likelihood.Malignant.i = prod(test.Malignant)
  likelihood.Benign = c(likelihood.Benign, likelihood.Benign.i)
  likelihood.Malignant = c(likelihood.Malignant, likelihood.Malignant.i)
}
```

The “posteriors” can then be computed as follows:
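
One plausible implementation (a sketch: it assumes the priors are estimated by the class proportions in the training set, which is consistent with the numbers below):

```
# priors estimated by the class proportions in the training set (an assumption)
prior.Benign    = mean(dat.BC.train$Class == "Benign")
prior.Malignant = 1 - prior.Benign

# un-normalized posteriors: prior x likelihood, one column per class
posteriors = cbind(prior.Benign * likelihood.Benign,
                   prior.Malignant * likelihood.Malignant)
```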

For the undiagnosed case \(\mathbf{x}=(1,1,3,1,0,2,3,2,2)\), we obtain:

```
[1] 1.209862e-07
[1] 1.080789e-11
```

Comparing the posteriors \[P(\text{Malignant}\mid \mathbf{x})<P(\text{Benign}\mid \mathbf{x}),\] we conclude that the tumor in the undiagnosed case is **likely benign** (note that we have no measurement on how much more likely it is to be benign than to be malignant – the classifier is **not calibrated**).

We can complete the procedure for all observations in the test set:

```
n.test=nrow(dat.BC.test)
prediction=rep(NA, n.test)
prediction[which(posteriors[,1]>posteriors[,2])]="Benign"
prediction[which(posteriors[,1]<posteriors[,2])]="Malignant"
prediction=as.factor(prediction)
table(prediction)
```

```
prediction
Benign Malignant
85 55
```

Since we actually know the true outcome for the test subjects, we can take a look at the NBC’s performance on the data.
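
The cross-tabulation below was presumably produced along these lines (true classes in the rows, predictions in the columns):

```
table(dat.BC.test$Class, prediction)
```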

```
prediction
Benign Malignant
Benign 85 8
Malignant 0 47
```

Let’s take a look at the cases that the NBC misclassified, along with their corresponding posteriors:

```
dat.misclassified = dat.BC.test[which(dat.BC.test$Class!=prediction),]
missed.class = prediction[which(dat.BC.test$Class!=prediction)]
wrong.classifications = cbind(posteriors[which(dat.BC.test$Class!=prediction),])
colnames(wrong.classifications) = c("Posterior.Benign","Posterior.Malignant")
wrong.classifications = cbind(dat.misclassified, wrong.classifications)
```

ID | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class | Posterior.Benign | Posterior.Malignant
---|---|---|---|---|---|---|---|---|---|---|---|---
9 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | Benign | 0 | 0
130 | 1 | 1 | 1 | 1 | 10 | 1 | 1 | 1 | 1 | Benign | 0 | 0
197 | 8 | 4 | 4 | 5 | 4 | 7 | 7 | 8 | 2 | Benign | 0 | 0
233 | 8 | 4 | 6 | 3 | 3 | 1 | 4 | 3 | 1 | Benign | 0 | 0
253 | 6 | 3 | 3 | 5 | 3 | 10 | 3 | 5 | 3 | Benign | 0 | 0
320 | 4 | 4 | 4 | 4 | 6 | 5 | 7 | 3 | 1 | Benign | 0 | 0
353 | 3 | 4 | 5 | 3 | 7 | 3 | 4 | 6 | 1 | Benign | 0 | 0
658 | 5 | 4 | 5 | 1 | 8 | 1 | 3 | 6 | 1 | Benign | 0 | 0

The confusion matrix tells us that 8 out of 140 test cases (5.7%) are misclassified. A closer look at the misclassified cases reveals that 3 of the 8 false positives are due to a posterior probability of exactly 0 (a score level that was not observed in the training set for that class). Taking a closer look at ID 130, for instance: every predictor but `Single_Epithelial_Cell_Size` has a score of 1, strongly suggesting that the tumor is benign (perhaps the 10 is a typo?).

Can we prevent misclassifications similar to that of ID 130? One way to do so is to replace all zero probabilities in the likelihood matrices by a small \(\varepsilon\) (obtained by multiplying a base probability by the smallest non-zero probability in the matrix), and to re-scale the columns so that they each add up to 1 (excluding the missing values from the process). If \(\varepsilon\) is small enough, the larger probabilities are, in practice, unaffected.

Using a base probability of \(10^{-8}\), for instance, would reduce the misclassification rate on the test data to 6/140 (4.3%).
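
The scheme can be sketched on a generic probability matrix as follows (the function name and the toy matrix are illustrative, not part of the analysis above):

```
# replace zero entries by eps = base * (smallest non-zero entry),
# then re-scale every column so that it sums to 1
smooth.zeros <- function(P, base = 1e-8) {
  eps = base * min(P[P > 0])
  P[P == 0] = eps
  sweep(P, 2, colSums(P), "/")
}

Q = smooth.zeros(matrix(c(0.5, 0.5, 0, 0.25, 0.25, 0.5), nrow = 3))
colSums(Q)   # every column now sums to 1, with no zero entries left
```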

#### Notes and Comments

In practice, various prior distributions and conditional distributions (for the features) can be used; domain expertise can come in handy during these steps:

- the naïve assumption is made out of convenience, as it renders the computation of the likelihood much simpler;
- the variables in a dataset are typically not independent of one another, yet NBC usually still performs well on test data – the method seems to be robust against departures from the independence assumption;
- dependency among the variables may change the true posterior values, but the class with the maximum posterior probability is often unchanged;
- in the classification context, we typically get more insight from independent data than from correlated data;
- NBC works best when the features are independent, but optimality can also be reached when the dependencies among variables inconsistently support one class over another;
- the choice of a prior may have a great effect on the classification predictions, as can the presence of outlying observations, especially when \(|\text{Tr}|\) is small;
- it is not practical to conduct NBC manually, as we have done above – a complete implementation can be called via the `naiveBayes()` method from the `R` package `e1071` (make sure to read the documentation first!);
- finally, a repeated reminder that, like the SVM models, NBC is **not calibrated** and should not be used to estimate probabilities.
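
For reference, a minimal sketch of the packaged workflow on the same data (note that `naiveBayes()` treats numeric predictors as Gaussian, so the scores would need to be converted to factors to reproduce the multinomial treatment used in this section):

```
library(e1071)

# Gaussian conditionals on the raw numeric scores -- an approximation of
# the multinomial approach developed above
nb.fit  = naiveBayes(Class ~ ., data = dat.BC.train)
nb.pred = predict(nb.fit, newdata = dat.BC.test)
table(dat.BC.test$Class, nb.pred)
```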
