13.4 Other Supervised Approaches

In this section, we present a number of non-parametric approaches, with a focus on classification methods (although we will also discuss regression problems):

  • classification and regression trees (CART) [2], [4], [245];

  • support vector machines (SVM) [2], [4], [246], [247];

  • artificial neural networks (ANN) [147], [242], [248], and

  • naïve Bayes classification (NBC).

13.4.1 Tree-Based Methods

This family of methods involves stratifying or segmenting the predictor space into a small number of “simple” regions.

The set of splitting rules used to segment the space can be summarized using a tree, whence their name. On the one hand, tree-based methods are simple and easy to interpret; on the other, they are not competitive with the best supervised learning methods when it comes to predictive accuracy.

Nevertheless, there are instances where the ease of interpretability outweighs the loss in accuracy. Tree-based methods are applicable both to regression and to classification problems.

Regression Trees

We introduce the important concepts via the 2011 Gapminder dataset, where the response \(Y\) is once again the life expectancy of nations, and the predictors \(X_1\) and \(X_2\) are the fertility rates and infant mortality rates per nation. In Figure 13.3, we see that when \(X_1\) and \(X_2\) are both high, \(Y\) is low, and when \(X_1\) and \(X_2\) are both low, \(Y\) is high. But what is the pattern “in the middle”?

Below, we see a possible regression tree for the (full) dataset (\(N=166\) observations).

reg.tree.1 <- rpart::rpart(life_expectancy~fertility+infant_mortality, 
                           data=gapminder.2011)
rpart.plot::rpart.plot(reg.tree.1, box.palette="RdBu", shadow.col="gray", nn=TRUE)

Figure 13.9: Recursive partition tree for the full 2011 Gapminder dataset.

This can also be re-written as:

1) root (166) 70.82349  
   2) infant_mortality>=35.65 (54) 60.85370 
     4) infant_mortality>=52.9 (28) 58.30714 *
     5) infant_mortality< 52.9 (26) 63.59615 *
   3) infant_mortality< 35.65 (112) 75.63036  
     6) infant_mortality>=9.35 (62) 72.89516  
      12) infant_mortality>=22.85 (18) 69.50000 *
      13) infant_mortality< 22.85 (44) 74.28409 *
     7) infant_mortality< 9.35 (50) 79.02200  
      14) infant_mortality>=4.25 (23) 76.86087 *
      15) infant_mortality< 4.25 (27) 80.86296 *

Node 1 is the tree’s root (initial node) with 166 (100%) observations; the average life expectancy for these observations is \(70.82\).

The root is also the tree’s first branching point, separating the observations into two groups: node 2 with 54 observations (33%), given by infant mortality \(\geq 35.65\), for which the average life expectancy is \(60.85\), and node 3 with 112 observations (67%), given by infant mortality \(<35.65\), for which the average life expectancy is \(75.63\).

Note that \(54+112=166\) and that \[\frac{54(60.85)+112(75.63)}{54+112}=70.82.\] Node 2 is an internal node – it is further split into two groups: node 4 with 28 observations (17%), given with the additional rule infant mortality \(\geq 52.9\), for which the average life expectancy is \(58.31\), and node 5 with 26 observations (16%), given with the additional rule infant mortality \(<52.9\), for which the average life expectancy is \(63.60\).

Note that \(28+26=54\) and that \[\frac{28(58.31)+26(63.60)}{28+26}=60.85.\] Both nodes 4 and 5 are leaves (final nodes, terminal nodes); the tree does not grow any further on that branch.
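
These consistency checks are easy to verify in R (the node counts and means below are taken directly from the tree output above):

```r
# Children means recombine, weighted by node counts, into the parent mean
(54 * 60.85370 + 112 * 75.63036) / 166   # root mean, approx. 70.82349
(28 * 58.30714 + 26 * 63.59615) / 54     # node 2 mean, approx. 60.85370
```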

The tree continues to grow from node 3, eventually leading to 4 leaves on that branch (there is an intermediate branching point). There are 6 leaves in total, 5 branching points (including the root) and the tree’s depth is 3 (excluding the root).

Only one feature is used in the regression tree in this example: to make a prediction for a new observation, only infant mortality is needed. If it was 21, say, the observation’s leaf would be node 13 and we would predict that the life expectancy of that nation would be 74.28.

The tree diagram is a useful heuristic, especially as it allows us to display the results without resorting to a multi-dimensional chart, but it does obscure the predictor space’s stratification. In our example, let \[\begin{aligned} R_4&=\{(\text{infant mortality},\text{fertility})\mid \text{infant mortality} \geq 52.9\} \\ R_5&=\{(\text{infant mortality},\text{fertility})\mid 35.65\leq \text{infant mortality} < 52.9\} \\ R_{12}&=\{(\text{infant mortality},\text{fertility})\mid 22.85\leq \text{infant mortality} < 35.65\} \\ R_{13}&=\{(\text{infant mortality},\text{fertility})\mid 9.35\leq \text{infant mortality} < 22.85\} \\ R_{14}&=\{(\text{infant mortality},\text{fertility})\mid 4.25\leq \text{infant mortality} < 9.35\} \\ R_{15}&=\{(\text{infant mortality},\text{fertility})\mid \text{infant mortality} < 4.25\} \end{aligned}\] Only infant mortality is involved in the definition of the tree’s terminal nodes. The regions are shown below:


Figure 13.10: Stratification of the parameter space for the 2011 Gapminder data regression tree.

The regression tree model for life expectancy would thus be \[\hat{y}_{R_i}=\text{Avg}\{y\mid (x_1,x_2) \in R_i\}=\begin{cases} 58.3, & i=4 \\ 63.6, & i=5 \\ 69.5, & i=12 \\ 74.3, & i=13 \\ 76.9, & i=14 \\ 80.9, & i=15\end{cases}\] The regression tree tells us that infant mortality is the most important factor in determining life expectancy, with a negative correlation. This interpretation is, of course, a coarse oversimplification, but it highlights the advantage of using a regression tree when it comes to displaying, interpreting, and explaining the results.
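
The piecewise-constant model can be coded directly; predict_le() below is an illustrative helper of ours (not part of rpart), with the leaf means rounded as in the displayed model:

```r
# Piecewise-constant prediction read off the tree (only infant mortality is used)
predict_le <- function(im) {
  cuts <- c(4.25, 9.35, 22.85, 35.65, 52.9)       # split points, ascending
  vals <- c(80.9, 76.9, 74.3, 69.5, 63.6, 58.3)   # leaf means: nodes 15,14,13,12,5,4
  vals[findInterval(im, cuts) + 1]
}
predict_le(21)   # 74.3 (node 13), matching the prediction discussed above
```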

This tree is not the only way to stratify the data: in what way is it optimal,230 as opposed to some other tree?

Building A Regression Tree

The process is quite simple:

  1. Divide the predictor space \(\mathcal{X}\subseteq \mathbb{R}^p\) into a disjoint union of \(J\) regions \(R_j\), \(j=1,\ldots, J\): \[\mathcal{X}=R_1\sqcup \cdots \sqcup R_J;\]

  2. for any \(\mathbf{x}\in R_j\), \[\hat{y}(\mathbf{x})=\text{Avg}\{y(\mathbf{z})\mid \mathbf{z}\in R_j\cap \text{Tr}\}.\]

The second step tells us that trees are locally constant (although it would be feasible to define trees that are not locally constant).

In theory, the \(R_j\) could have any shape as long as they form a disjoint cover of \(\mathcal{X}\); in practice, we use hyperboxes whose boundaries are \((p-1)\)-dimensional affine subspaces, each parallel to the hyperplane spanned by all predictors but one (\(X_1,\ldots, \widehat{X_k},\ldots, X_p\), \(k=1,\ldots, p\)).

We find the optimal \((R_1,\ldots, R_J)\) by minimizing \[\text{SSRes}=\sum_{j=1}^J\sum_{\mathbf{x}_i\in R_j}(y_i-\hat{y}_{R_j})^2,\] where \(\hat{y}_{R_j}\) is the mean response \(y\) in \(R_j\cap \text{Tr}\). In an ideal world, we would compute \(\text{SSRes}\) for all partitions of \(\mathcal{X}\) into hyperboxes, and pick the one that minimizes \(\text{SSRes}\), but that is not computationally feasible, in general.

Instead, we use an algorithmic growth approach known as recursive binary splitting, which is both top-down (it starts at the root and successively splits \(\mathcal{X}\) via 2 new branches down the tree) and greedy (at each step of the splitting process, the best choice is made there and then, rather than by looking at long-term consequences).

Regression Tree Algorithm

The algorithm has 10 steps, but it is fairly straightforward.

  1. Let \(\hat{y}_0=\text{Avg}\{y\mid (\mathbf{x},y)\in \text{Tr}\}\).

  2. Set the baseline \(\text{SSRes}_0=\sum_{i=1}^N(y_i-\hat{y}_0)^2\).

  3. For each \(1\leq k\leq p\), order the predictor values of \(X_k\) in \(\text{Tr}\): \(v_{k,1}\leq v_{k,2}\leq \cdots \leq v_{k,N}\).

  4. For each \(X_k\), set \(s_{k,\ell}=\frac{1}{2}(v_{k,\ell}+v_{k,\ell+1})\), \(\ell=1,\ldots, N-1\).

  5. For each \(k=1,\ldots, p\), \(\ell=1,\ldots, N-1\), define \[R_1(k,\ell)=\{\vec{X}\in \mathbb{R}^p \mid X_{k}<s_{k,\ell}\}\quad\mbox{and}\quad R_2(k,\ell)=\{\vec{X}\in \mathbb{R}^p \mid X_{k}\geq s_{k,\ell}\}.\] Note that \(\mathcal{X}=R_1(k,\ell)\sqcup R_2(k,\ell)\) for all \(k,\ell\).

  6. For each \(k=1,\ldots, p\), \(\ell=1,\ldots, N-1\), set \[\text{SSRes}_{1}^{k,\ell}=\sum_{m=1}^2\sum_{\vec{X}_i\in R_{m}(k,\ell)}(y_i-\hat{y}_{R_{m}(k,\ell)})^2,\] where \(\hat{y}_{R_{m}(k,\ell)}=\text{Avg}\{y(\mathbf{x})\mid \mathbf{x}\in \text{Tr}\cap R_{m}(k,\ell)\}\).

  7. Find \(k^*,\ell^*\) for which \(\text{SSRes}_{1}^{k,\ell}\) is minimized.

  8. Define the children sets \(R_1^L=R_1(k^*,\ell^*)\) and \(R_1^R=R_2(k^*,\ell^*)\).

  9. While some children sets \(R^\nu_{\mu}\) still do not meet a stopping criterion, repeat steps 3 to 8, searching and minimizing \(\text{SSRes}\) over \(\mathcal{X}\cap R^{\nu}_{\mu}\), and producing a binary split \(R^{L}_{\mu+1}\), \(R^R_{\mu+1}\).231

  10. Once the stopping criterion is met for all children sets, the tree’s growth ceases, and \(\mathcal{X}\) has been partitioned into \(J\) regions (renumbering as necessary) \[\mathcal{X}=R_1\sqcup \cdots \sqcup R_J,\] on which the regression tree predicts the \(J\) responses \(\{\hat{y}_1,\ldots,\hat{y}_J\}\), according to \(\hat{y}_j=\text{Avg}\{y(\mathbf{x})\mid \mathbf{x}\in R_j\}\).
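
The greedy search of steps 3 to 7 can be sketched directly; best_split() below is an illustrative helper of ours (rpart's actual implementation is considerably more optimized):

```r
# Greedy search for the single best binary split (steps 3-7 of the algorithm)
best_split <- function(X, y) {
  # X: data frame of numeric predictors; y: numeric response
  best <- list(ssres = Inf, var = NA, cut = NA)
  for (k in seq_along(X)) {
    v <- sort(unique(X[[k]]))
    if (length(v) < 2) next
    cuts <- (head(v, -1) + tail(v, -1)) / 2        # candidate midpoints s_{k,l}
    for (s in cuts) {
      left  <- y[X[[k]] <  s]                      # R_1(k,l)
      right <- y[X[[k]] >= s]                      # R_2(k,l)
      ssres <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (ssres < best$ssres)
        best <- list(ssres = ssres, var = names(X)[k], cut = s)
    }
  }
  best
}
```

Applied recursively to each child region (step 9), this search grows the full tree.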

For instance, if the training set was \(\text{Tr}=\{(x_{1,i},x_{2,i},y_i)\}_{i=1}^{N}\), the algorithm might provide the regression tree in Figure 13.11.

Generic recursive binary partition regression tree for a two-dimensional predictor space, with 5 leaves.

Figure 13.11: Generic recursive binary partition regression tree for a two-dimensional predictor space, with 5 leaves.

In R, the recursive binary partition algorithm is implemented in package rpart’s function rpart().

Tree Pruning

Regression trees grown with the algorithm are prone to overfitting; they can provide good predictions on \(\text{Tr}\), but they usually make shoddy predictions on \(\text{Te}\), because the resulting tree might be too complex (it fits the noise as well as the signal).

A smaller tree with fewer splits might lead to lower variance and better interpretability, at the cost of a little bias. Instead of simply growing a tree \(T_0\) until each leaf contains at most \(M\) observations, say (or whatever other stopping criterion), it could be beneficial to prune it in order to obtain some optimal subtree.

We use cost complexity pruning (CCP) to build a sequence of candidate subtrees indexed by the complexity parameter \(\alpha\geq 0\). For each such \(\alpha\), find a subtree \(T_{\alpha}\subseteq T_0\) which minimizes \[\begin{aligned} \text{SSRes}+\text{complexity penalty}=\sum_{m=1}^{|T|}\sum_{\mathbf{x}_i\in R_m}(y_i-\hat{y}_{R_m})^2+\alpha|T|,\end{aligned}\] where \(|T|\) represents the number of final nodes in \(T\); when \(\alpha\) is large, it is costly to have a complex tree.232

Pruning Algorithm

Assume that a recursive binary splitting regression tree \(T_0\) has been grown on \(\text{Tr}\), using some pre-determined stopping criterion:

  1. apply CCP to \(T_0\) to obtain a “sequence” \(T_\alpha\) of subtrees of \(T_0\);

  2. divide \(\text{Tr}\) into \(K\) folds;

  3. for all \(k=1,\ldots, K\), build a regression tree on \(\text{Tr}\setminus \text{Fold}_k\) and evaluate \[\widehat{\text{MSE}}(\alpha)=\text{Avg}_{1\leq k\leq K}\{\text{MSE}_k(\alpha)\mid \text{on Fold}_k\};\]

  4. return \(T_{\alpha^*}\) from step 1, where \(\alpha^*=\arg\min_{\alpha}\{\widehat{\text{MSE}}(\alpha)\}\).
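
In rpart, the cross-validation of step 3 is performed automatically at growth time and stored in the fitted object's cptable (the xerror column approximates \(\widehat{\text{MSE}}(\alpha)\)). A minimal sketch of the procedure, using the built-in mtcars data as a stand-in:

```r
library(rpart)
fit <- rpart(mpg ~ ., data = mtcars, minsplit = 5)       # step 0: grow T_0
cp.tab  <- fit$cptable                                   # steps 1-3: CCP + CV table
cp.star <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]   # alpha* (rpart calls it cp)
pruned  <- prune(fit, cp = cp.star)                      # step 4: return T_{alpha*}
```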

The Gapminder 2011 tree is pruned in the Figures below, using the rpart function plotcp().


Figure 13.12: Complexity pruning parameter, with pruned tree for cp\(=0.06\), cp\(=0.028\), and cp\(=0.02\) in the Gapminder 2011 example. Note that the tree’s complexity increases when cp decreases.

reg.tree.1.pruned.2 <- rpart::rpart(life_expectancy~fertility+infant_mortality, 
                                    data=gapminder.2011, cp=0.06)
rpart.plot::rpart.plot(reg.tree.1.pruned.2, box.palette="RdBu", 
                       shadow.col="gray", nn=TRUE)

Figure 13.13: Complexity pruning parameter, with pruned tree for cp\(=0.06\), cp\(=0.028\), and cp\(=0.02\) in the Gapminder 2011 example. Note that the tree’s complexity increases when cp decreases.

reg.tree.1.pruned.3 <- rpart::rpart(life_expectancy~fertility+infant_mortality,
                                    data=gapminder.2011, cp=0.028)
rpart.plot::rpart.plot(reg.tree.1.pruned.3, box.palette="RdBu", 
                       shadow.col="gray", nn=TRUE)

Figure 13.14: Complexity pruning parameter, with pruned tree for cp\(=0.06\), cp\(=0.028\), and cp\(=0.02\) in the Gapminder 2011 example. Note that the tree’s complexity increases when cp decreases.

reg.tree.1.pruned.4 <- rpart::rpart(life_expectancy~fertility+infant_mortality, 
                                    data=gapminder.2011, cp=0.02)
rpart.plot::rpart.plot(reg.tree.1.pruned.4, box.palette="RdBu", 
                       shadow.col="gray", nn=TRUE)

Figure 13.15: Complexity pruning parameter, with pruned tree for cp\(=0.06\), cp\(=0.028\), and cp\(=0.02\) in the Gapminder 2011 example. Note that the tree’s complexity increases when cp decreases.

Classification Trees

The approach for classification is much the same, with a few appropriate substitutions:

  1. prediction in a terminal node is either the class label mode or the relative frequency of the class labels;

  2. \(\text{SSRes}\) must be replaced by some other fit measure:

    • the classification error rate: \[E=\sum_{j=1}^J(1-\max_k \{\hat{p}_{j,k}\}),\] where \(\hat{p}_{j,k}\) is the proportion of \(\text{Tr}\) observations in \(R_j\) of class \(k\) (this measure is not a recommended choice, however);

    • the Gini index, which measures the total variance across classes \[G=\sum_{j=1}^J\sum_k \hat{p}_{j,k}(1-\hat{p}_{j,k}),\] which should be small when the nodes are pure (\(\hat{p}_{j,k}\approx 0\) or \(1\) throughout the regions), and

    • the cross-entropy deviance \[D=-\sum_{j=1}^J\sum_k \hat{p}_{j,k}\ln \hat{p}_{j,k},\] which behaves like the Gini index, numerically.
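
The Gini index and cross-entropy contributions of a single node are easy to compute; the helpers below (ours, for illustration) take a vector of class proportions assumed to sum to 1:

```r
# Node impurity measures for a vector of class proportions p
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # 0*log(0) treated as 0

gini(c(0.5, 0.5))      # 0.5: maximally impure two-class node
gini(c(1, 0))          # 0: pure node
entropy(c(0.5, 0.5))   # log(2), approx. 0.693
```

Both measures vanish exactly on pure nodes and peak on evenly mixed ones, which is why they behave so similarly in practice.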

One thing to note is that classification and regression trees (jointly known as CART) suffer from high variance and their structure is unstable – using different training sets typically gives rise to wildly varying trees.

As an extreme example, simply modifying the level of only one of the predictors in only one of the observations can yield a tree with a completely different topology, as in Figure 13.16.


Figure 13.16: Different tree topologies with small changes in the training set (data modified from [34]).

This lack of robustness is a definite strike against the use of CART; despite this, the relative ease of their implementation makes them a popular classification tool.


A classification tree for the Gapminder 2011 dataset is shown below:

reg.tree.2 <- rpart::rpart(LE~fertility+infant_mortality+gdp, 
                           data=gapminder.2011)
rpart.plot::rpart.plot(reg.tree.2, box.palette="RdBu", 
                       shadow.col="gray", nn=TRUE)

Figure 13.17: Classification tree for the response LE in the Gapminder 2011 example.

1) root (166) high/low (0.5 0.5)  
  2) infant_mortality< 23 (94) high (0.862 0.138)
  3) infant_mortality>=23 (72) low  (0.028 0.972)

The stratification in the scatter plot is shown below.

ggplot2::ggplot(gapminder.2011) +
  ggplot2::geom_point(ggplot2::aes(fill=LE, x=infant_mortality, y=fertility), pch=22) +
  ggplot2::theme_bw() + 
  ggplot2::theme(legend.position = "bottom") +
  ggplot2::geom_vline(xintercept = c(23), linetype="dashed", color = "blue", size=1) +
  ggplot2::xlab("Infant Mortality") + ggplot2::ylab("Fertility")

Figure 13.18: Stratification of the predictor space for the response LE in the Gapminder 2011 example.

Note that this tree should not be used for predictions as it was not built on a training subset of the data.

We now revisit the Iowa Housing Price example of Section 12.5.2. Recall that we had built a training set dat.train with \(n=1160\) observations relating to the selling price SalePrice of houses in Ames, Iowa. We build a CART for that response variable, demanding at least 5 observations per leaf.

(RT = rpart::rpart(SalePrice ~., data=dat.train, minbucket=5))
n= 1160 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1160 7371073.00 180.1428  
   2) OverallQual< 7.5 985 2301952.00 157.4737  
     4) Neighborhood=Blueste,BrDale,BrkSide,Edwards,IDOTRR,MeadowV,Mitchel,NAmes,NPkVill,OldTown,Sawyer,SWISU 580  672022.40 132.2839  
       8) X1stFlrSF< 1050.5 332  240468.90 118.2474 *
       9) X1stFlrSF>=1050.5 248  278574.40 151.0747 *
     5) Neighborhood=Blmngtn,ClearCr,CollgCr,Crawfor,Gilbert,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,StoneBr,Timber,Veenker 405  734856.30 193.5480  
      10) GrLivArea< 1732.5 281  296100.70 178.2365  
        20) GrLivArea< 1204 56   25438.63 143.0286 *
        21) GrLivArea>=1204 225  183967.70 186.9993 *
      11) GrLivArea>=1732.5 124  223588.10 228.2459 *
   3) OverallQual>=7.5 175 1713865.00 307.7376  
     6) OverallQual< 8.5 126  498478.90 273.6180  
      12) GrLivArea< 1925.5 71  167331.90 246.3000 *
      13) GrLivArea>=1925.5 55  209762.20 308.8831 *
     7) OverallQual>=8.5 49  691521.30 395.4736  
      14) Neighborhood=CollgCr,Edwards,Gilbert,NridgHt,Somerst,StoneBr,Timber,Veenker 44  358089.40 373.9441  
        28) Neighborhood=CollgCr,Edwards,Somerst,Timber 11   39962.11 293.0025 *
        29) Neighborhood=Gilbert,NridgHt,StoneBr,Veenker 33  222038.20 400.9246  
          58) GrLivArea< 2260 20   42801.90 358.2032 *
          59) GrLivArea>=2260 13   86576.40 466.6498 *
      15) Neighborhood=NoRidge 5  133561.60 584.9336 *

There are many ways to plot the tree:

plot(RT, margin=0.05, uniform=TRUE)
text(RT, all=TRUE, use.n=TRUE, fancy=FALSE, cex=0.6)


rattle::fancyRpartPlot(RT, main="Sale Price Regression Tree (Iowa Housing)") 

These are fully grown trees. Now we will use rpart’s plotcp() to determine how to control the tree’s growth (we prune the tree).
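
The call producing the diagnostic plot is not shown above; presumably it was plotcp(RT). A self-contained stand-in (using the built-in mtcars data, since dat.train is not reproduced here):

```r
# Stand-in for plotcp(RT): cross-validated relative error against cp
RT.demo <- rpart::rpart(mpg ~ ., data = mtcars, minsplit = 5)
rpart::plotcp(RT.demo)
```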


From this plot, we see that the tuning parameter should be around 0.024, so we prune the tree as follows:

(RT.p = rpart::prune(RT, cp=0.024))
n= 1160 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1160 7371073.0 180.1428  
   2) OverallQual< 7.5 985 2301952.0 157.4737  
     4) Neighborhood=Blueste,BrDale,BrkSide,Edwards,IDOTRR,MeadowV,Mitchel,NAmes,NPkVill,OldTown,Sawyer,SWISU 580  672022.4 132.2839 *
     5) Neighborhood=Blmngtn,ClearCr,CollgCr,Crawfor,Gilbert,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,StoneBr,Timber,Veenker 405  734856.3 193.5480  
      10) GrLivArea< 1732.5 281  296100.7 178.2365 *
      11) GrLivArea>=1732.5 124  223588.1 228.2459 *
   3) OverallQual>=7.5 175 1713865.0 307.7376  
     6) OverallQual< 8.5 126  498478.9 273.6180 *
     7) OverallQual>=8.5 49  691521.3 395.4736  
      14) Neighborhood=CollgCr,Edwards,Gilbert,NridgHt,Somerst,StoneBr,Timber,Veenker 44  358089.4 373.9441 *
      15) Neighborhood=NoRidge 5  133561.6 584.9336 *

We go from 11 to 6 leaves. The structure of the pruned tree is plotted below.

plot(RT.p, margin=0.05, uniform=TRUE)
text(RT.p, all=TRUE, use.n=TRUE, fancy=FALSE, cex=0.6)


rattle::fancyRpartPlot(RT.p, main="Sale Price Pruned Regression Tree (Iowa Housing)") 

Now that we have a full regression tree and a pruned tree, our last task is to see how well they perform as predictive models on the test data dat.test.

For the full tree, we compute the reduction in SSE achieved by using the full tree instead of the mean model:

yhat.RT = predict(RT, dat.test) 
SSE.RT = sum((yhat.RT-dat.test$SalePrice)^2)
SSE.average = sum((mean(dat.test$SalePrice)-dat.test$SalePrice)^2)

round((1-SSE.RT/SSE.average), digits=3)
[1] 0.703

For the pruned tree, the corresponding reduction is:

yhat.RT.p = predict(RT.p, dat.test) 
SSE.RT.p = sum((yhat.RT.p-dat.test$SalePrice)^2)

round((1-SSE.RT.p/SSE.average), digits=3)
[1] 0.646

This suggests that pruning is a reasonable approach in this case: the pruned tree retains most of the full tree’s predictive power while being substantially simpler. The predictions of both trees are plotted against the actual SalePrice values in the next plots.

xlimit = ylimit = c(0,600)
plot(NA, col=2, xlim=xlimit, ylim=ylimit, xlab="Predicted Price ($1,000)", 
     ylab="Actual Price ($1,000)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(yhat.RT, dat.test$SalePrice, col=2)
legend(0,600, legend=c("Full Tree"), col=c(2), pch=rep(1), bg='light grey')

plot(NA, col=2, xlim=xlimit, ylim=ylimit, xlab="Predicted Price ($1,000)", 
     ylab="Actual Price ($1,000)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(yhat.RT.p, dat.test$SalePrice, col=3)
legend(0,600, legend=c("Pruned Tree"), col=c(3), pch=rep(2), bg='light grey')

Obviously, there are some departures from the actual response values, but given that the regression trees can only predict a small number of selling prices (corresponding to the tree leaves), these predictions are reasonably accurate.

cor(yhat.RT, dat.test$SalePrice)
cor(yhat.RT.p, dat.test$SalePrice)
[1] 0.8412035
[1] 0.8047534

How do these CART predictions compare with the MARS predictions? The Iowa Housing dataset contains information about sale prices from 2006 to 2010; would you use the model to make predictions about 2022 sale prices?

Classification trees work in the same manner, although the evaluation step can be conducted in two ways: we can build trees that predict class membership (type="class") or probability of class membership (type="prob"). Examples of how to work with these predict() options are provided in Section 11.7.2.
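
For instance, with a classification tree grown on a built-in dataset (iris here, as a stand-in for a proper train/test split), the two prediction modes look like:

```r
# Classification tree: predict labels vs. class membership probabilities
ct <- rpart::rpart(Species ~ ., data = iris, method = "class")
head(predict(ct, iris, type = "class"))   # predicted class labels
head(predict(ct, iris, type = "prob"))    # per-class membership probabilities
```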

13.4.2 Support Vector Machines

The next classifier is more sophisticated, from a mathematical perspective. It was invented by computer scientists in the 1990s.

Support vector machines (SVM) attempt to find hyperplanes that separate the classes in the feature space. In Figure 13.19, we see an artificial dataset with two numerical features \(X_1\) and \(X_2\), and a categorical response \(Y\) (represented by different symbols).


Figure 13.19: Two-class artificial dataset [137].

We grow a classification tree (perhaps the one shown in Figure 13.20): two of the leaves are pure, but the risk of misclassification is fairly large in the other 2 (at least for that tree).233 Without access to more features, that tree is as good as it gets.234


Figure 13.20: Classification tree on the two-class artificial dataset [137].

But it is easy to draw a decision curve which improves on the effectiveness of the decision tree (see image on the right): a single observation is misclassified by this rule.235

Separating hyperplanes do not always exist; we may need to:

  • extend our notion of separability, and/or

  • extend the feature space so separation becomes possible.

A hyperplane \(H_{\boldsymbol{\beta},\beta_0}\subseteq \mathbb{R}^p\) is an affine (“flat”) subset of \(\mathbb{R}^p\), with \[\dim \left(H_{\boldsymbol{\beta},\beta_0}\right)=p-1;\] in other words, \[H_{\boldsymbol{\beta},\beta_0}:\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}=\beta_0+\beta_1x_1+\cdots+\beta_px_p=0.\] The vector \(\boldsymbol{\beta}\) is normal to \(H_{\boldsymbol{\beta},\beta_0}\); if \(\beta_0=0\), \(H_{\boldsymbol{\beta},\beta_0}\) goes through the origin in \(\mathbb{R}^p\). Set \(F(\mathbf{x})=\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}\); then \(F(\mathbf{x})>0\) for points on one “side” of \(H_{\boldsymbol{\beta},\beta_0}\) and \(F(\mathbf{x})<0\) for points on the other.236
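
The sign computation is straightforward; Fx() below is an illustrative helper with made-up coefficients (\(\boldsymbol{\beta}\), \(\beta_0\) are not taken from any fitted model):

```r
# F(x) = beta_0 + beta^T x; its sign indicates the side of the hyperplane
Fx <- function(x, beta, beta0) beta0 + sum(beta * x)
beta <- c(1, -2); beta0 <- 0.5
Fx(c(1, 0), beta, beta0)   # 1.5 > 0: one side of the hyperplane
Fx(c(0, 1), beta, beta0)   # -1.5 < 0: the other side
```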

In a binary classification problem \(\mathcal{C}=\{C_1,C_2\}=\{\pm 1\}\). If \[y_iF(\mathbf{x}_i)>0,\quad\text{ for all }(\mathbf{x}_i,y_i)\in \text{Tr}\] (or, \(y_iF(\mathbf{x}_i)<0\) for all \((\mathbf{x}_i,y_i)\in \text{Tr}\)), then \(F(\mathbf{x}_i)=0\) determines a separating hyperplane for \(\text{Tr}\) (which does not need to be unique, see Figure 13.21), and we say that \(\text{Tr}\) is linearly separable.


Figure 13.21: Separating hyperplane on the two-class artificial dataset [137].

Among all separating hyperplanes, the one which provides the widest separation between the two classes is the maximal margin hyperplane (MMH); training observations on the boundary of the separating strip are called the support vectors (see observations in Figure 13.22).


Figure 13.22: Artificial linearly separable subset of the two-class dataset (left), with separating hyperplanes (centre), maximal margin hyperplane with support vectors (right) [137].

The classification problem simplifies, as always, to a constrained optimization problem: \[(\boldsymbol{\beta}^*,\beta_0^*)=\arg\max_{(\boldsymbol{\beta},\beta_0)}\{M_{(\boldsymbol{\beta},\beta_0)}\}\quad\text{s.t.}\quad y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq M_{(\boldsymbol{\beta},\beta_0)}\] for all \((\mathbf{x}_i,y_i)\in \text{Tr}\), with MMH given by \(F(\mathbf{x})=\beta_0^*+(\boldsymbol{\beta}^*)^{\!\top}\mathbf{x}=0\).

Any hyperplane can be expressed in an uncountable number of ways; the representation for which \(|F(\mathbf{x}^*)|=1\) for all support vectors \(\mathbf{x}^*\) is the canonical one. From geometry, we know that the distance from the canonical maximal margin hyperplane \(H_{\boldsymbol{\beta},\beta_0}\) to any point \(\mathbf{z}\) can be computed using vector projections.

Let \(\mathbf{x}_0\) be a point on MMH, i.e., \(F(\mathbf{x}_0)=\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_0=0\), as shown below:

In particular, note that \(\beta_0=-\boldsymbol{\beta}^{\!\top}\mathbf{x}_0\). Then, \[\begin{aligned} \frac{M}{2}&=\text{dist}\left(\mathbf{z},H_{\boldsymbol{\beta},\beta_0}\right)=\left\|\text{proj}_{\boldsymbol{\beta}}(\mathbf{z}-\mathbf{x}_0) \right\|=\left\|\frac{\boldsymbol{\beta}^{\!\top}(\mathbf{z}-\mathbf{x}_0)}{\|\boldsymbol{\beta}\|^2}\boldsymbol{\beta} \right\| \\ &=\frac{|\boldsymbol{\beta}^{\!\top}(\mathbf{z}-\mathbf{x}_0)|}{\|\boldsymbol{\beta}\|^2}\|\boldsymbol{\beta}\|=\frac{|\boldsymbol{\beta}^{\!\top}\mathbf{z}-\boldsymbol{\beta}^{\!\top}\mathbf{x}_0|}{\|\boldsymbol{\beta}\|}=\frac{|F(\mathbf{z})|}{\|\boldsymbol{\beta}\|}.\end{aligned}\] If \(\mathbf{z}\) is a support vector, then \(|F(\mathbf{z})|=1\), so that \[\frac{M}{2}=\text{dist}\left(\mathbf{z},H_{\boldsymbol{\beta},\beta_0}\right)=\frac{1}{\|\boldsymbol{\beta}\|}.\] Maximizing the margin \(M\) is thus equivalent to minimizing \(\frac{\|\boldsymbol{\beta}\|}{2}\), and, since the square function is monotonic on the non-negative reals, \[\arg\max_{(\boldsymbol{\beta},\beta_0)}\{M\mid y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq 1,\ \forall \mathbf{x}_i\in \text{Tr}\}\] is equivalent to \[\arg\min_{(\boldsymbol{\beta},\beta_0)}\left\{\left.\frac{1}{2}\|\boldsymbol{\beta}\|^2 \ \right| \ y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq 1,\ \forall \mathbf{x}_i\in \text{Tr}\right\}.\] This constrained quadratic problem (QP) can be solved by Lagrange multipliers (in implementations, it is solved numerically), but a key observation is that it is possible to rewrite \[\boldsymbol{\beta}=\sum_{i=1}^N\alpha_iy_i\mathbf{x}_i,\quad \text{with}\quad \sum_{i=1}^N\alpha_iy_i=0,\] thanks to the representer theorem.237

The original QP becomes \[\arg\min_{(\boldsymbol{\beta},\beta_0)}\left\{\left.\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_j\mathbf{x}_i^{\!\top}\mathbf{x}_j-\sum_{i=1}^N\alpha_i \ \right| \ \sum_{i=1}^N\alpha_iy_i=0,\ \forall \mathbf{x}_i,\mathbf{x}_j\in \text{Tr}\right\}.\] Ultimately, it can be shown that all but \(L\) of the coefficients \(\alpha_i\) are \(0\), typically, \(L\ll N\). The support vectors are those training observations \(\mathbf{x}_{i_k}\), \(k=1, \ldots, L\), for which \(\alpha_{i_k}\neq 0\). The decision function is defined by \[T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\mathbf{x}_{i_k}^{\!\top}\mathbf{x}+\beta_0,\] scaled so that \(T(\mathbf{x}_{i_k};\alpha)=y_{i_k}=\pm 1\) for each support vector \(\mathbf{x}_{i_k}\).

The class assignment for any \(\mathbf{x}\in \text{Te}\) is thus \[\text{class}(\mathbf{x})=\begin{cases} +1 & \text{if $T(\mathbf{x};\alpha)\geq 0 $} \\ -1 & \text{if $T(\mathbf{x};\alpha)< 0 $} \end{cases}\] In practice (especially when \(N<p\)), the data is rarely linearly separable into distinct classes (as below, for instance).


Figure 13.23: Two-class data subsets that are not linearly separable [137].

Additionally, even when the classes are linearly separable, the data may be noisy, which could lead to overfitting, with technically optimal but practically sub-optimal maximal margin solutions (see [137] for examples).

In applications, support vector classifiers optimize instead a soft margin, one for which some misclassifications are permitted (as in Figure 13.24).


Figure 13.24: Hard margin for a linearly separable classifier (left); soft margin for a linearly separable classifier (middle); soft margin for a non-linearly separable classifier (right).

The soft margin problem can be written as \[\arg\min_{(\boldsymbol{\beta},\beta_0)}\left\{\left.\frac{1}{2}\boldsymbol{\beta}^{\!\top}\boldsymbol{\beta} \ \right| \ y_i(\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}_i)\geq 1-\varepsilon_i,\ \varepsilon_i\geq 0, \forall \mathbf{x}_i\in \text{Tr},\ \|\boldsymbol{\varepsilon}\|<C\right\},\] where \(C\) is a (budget) tuning parameter, \(\boldsymbol{\varepsilon}\) is a vector of slack variables, canonically scaled so that \(|F(\mathbf{x}^*)|=|\beta_0+\boldsymbol{\beta}^{\!\top}\mathbf{x}^*|=1\) for any eventual support vector \(\mathbf{x}^*\).

Such a model offers greater robustness against unusual observations, while still classifying most training observations correctly:

  • if \(\varepsilon_i=0\), then \(\mathbf{x}_i\in \text{Tr}\) is correctly classified – it falls on the correct side of the hyperplane, and outside the maximum margin;

  • if \(0<\varepsilon_i<1\), then \(\mathbf{x}_i\in \text{Tr}\) is still correctly classified (falling on the correct side of the hyperplane), but it falls within the margin;

  • if \(\varepsilon_i\geq 1\), it is incorrectly classified.
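The three cases above are easy to check numerically. Below is a short NumPy sketch (Python is used here purely for illustration; the decision values are made up, not obtained from a fitted SVM):

```python
import numpy as np

# Hypothetical decision values F(x_i) = beta_0 + beta^T x_i and labels y_i in {-1, +1}
F = np.array([2.0, 0.5, -0.3, 1.2, -2.0])
y = np.array([1, 1, 1, -1, -1])

# slack variables: eps_i = max(0, 1 - y_i * F(x_i))
eps = np.maximum(0, 1 - y * F)

# eps_i = 0     : correct side of the hyperplane, outside the margin
# 0 < eps_i < 1 : correct side of the hyperplane, but within the margin
# eps_i >= 1    : misclassified
for e in eps:
    if e == 0:
        print("outside margin")
    elif e < 1:
        print("within margin")
    else:
        print("misclassified")
```

With these (made-up) values, the first and last observations fall outside the margin, the second falls within it, and the third and fourth are misclassified.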

If \(C=0\), then no violations are allowed (\(\|\boldsymbol{\varepsilon}\|=0\)) and the problem reduces to the hard margin SVM classifier; a solution may not even exist if the data is not linearly separable.

If \(C>0\) is an integer, no more than \(C\) training observations can be misclassified; indeed, if \(\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_M}\) are the misclassified observations, then \(\varepsilon_{i_1},\ldots, \varepsilon_{i_M}\geq 1\) and \[C\geq \sum_{i=1}^N\varepsilon_i\geq \sum_{k=1}^M\varepsilon_{i_k}\geq M.\] As \(C\) increases, tolerance for violations also increases, as does the width of the soft margin; \(C\) plays the role of a regularization parameter, and is usually selected via cross-validation.

Low values of \(C\) are associated with harder margins, which leads to low bias but high variance (a small change in the data could create qualitatively different margins); large values of \(C\) are associated with wider (softer) margins, leading to more potential misclassifications and higher bias, but also lower variance as small changes in the data are unlikely to change the margin significantly.

We can build a classifier through the representer theorem formulation as before, the only difference being that the decision function \(T_C(\mathbf{x};\boldsymbol{\alpha})\) is scaled so that \(|T_C(\mathbf{x}_{i_k};\boldsymbol{\alpha})|\geq 1-\varepsilon_{i_k}\) for every support vector \(\mathbf{x}_{i_k}\). It is difficult to determine at first glance what the value of the regularization parameter \(C\) should be; an optimal value can be obtained via a tuning process, which tries out various values and identifies the one that produces an optimal model.


We train a SVM with \(C=0.1\) (obtained via a tuning procedure for \(C\)) for the 2011 Gapminder dataset to predict the life expectancy class \(Y\) in terms of the fertility rate \(X_1\) and the logarithm of GDP per capita \(X_2\); \(n=116\) observations are used in the training set gapminder.2011.tr, the rest are set aside in the test set gapminder.2011.te.

x <- gapminder.2011.tr[,c("fertility","gdp","population")]
w <- log(x[,2]/x[,3])
x <- data.frame(x[,1],w)
y <- gapminder.2011.tr[,c("LE")]
dat = data.frame(x,y)
plot(w,x[,1],col=y,bg=y,pch=(as.numeric(y)+23), xlab="Log GDPPC", ylab="fertility")

The red triangles represent countries with low life expectancy; the black ones, countries with high life expectancy. Notice the class overlap in the training data.

We run 7 linear SVM models with various cost parameters (through e1071’s tune() function); the optimal model has \(C=0.1\).

tuned.model <- e1071::tune(e1071::svm, y~., data = dat, kernel = "linear",
                    ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
(best.mod <- tuned.model$best.model)

best.tune(method = e1071::svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 
    0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")

   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.1 

Number of Support Vectors:  50

The corresponding SVM model is obtained via the svm() function (note the parameters). The SVM decision boundary is shown below:

svmfit <- e1071::svm(y~., data = dat, kernel = "linear", cost=0.1)
plot(svmfit, dat, main="Linear Kernel")

We can evaluate the model’s performance on \(\text{Te}\) (dat.te); the confusion matrix of the model on the test set is:

x <- gapminder.2011.te[,c("fertility","gdp","population")]
w <- log(x[,2]/x[,3])
x <- data.frame(x[,1],w)
y <- gapminder.2011.te[,c("LE")]
dat.te = data.frame(x,y)                        # Test data
results = predict(svmfit,dat.te)                # Class prediction on test data
table(actual=gapminder.2011.te$LE,pred=results) # Confusion matrix
      pred
actual high low
  high   17   1
  low    10  22

It is not a perfectly accurate model, but it is certainly acceptable given the class overlap in \(\text{Tr}\).

Nonlinear Boundaries

If the boundary between two classes is linear, the SVM classifier of the previous section is a natural way to attempt to separate the classes. In practice, however, the classes are rarely so cleanly separated, as below, say.

In both the hard and the soft margin support vector classifiers, the function to optimize takes the form \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_j\mathbf{x}_i^{\!\top}\mathbf{x}_j-\sum_{i=1}^N\alpha_i,\] and the decision function, the form \[T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\mathbf{x}_{i_k}^{\!\top}\mathbf{x}+\beta_0.\] In practice, however, we do not actually need to know the support vectors \(\mathbf{x}_{i_k}\) (or even the observations \(\mathbf{x}_i\), for that matter) in order to compute the decision function values – it is sufficient to have access to the inner products \(\mathbf{x}_i^{\!\top}\mathbf{x}_j\) or \(\mathbf{x}_{i_k}^{\!\top}\mathbf{x}\), which are usually denoted by \[\langle\mathbf{x}_{i_k},\mathbf{x}\rangle\quad\text{or}\quad \langle\mathbf{x}_{i},\mathbf{x}_j\rangle.\] The objective function and the decision function can thus be written as \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle-\sum_{i=1}^N\alpha_i,\quad T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\langle\mathbf{x}_{i_k},\mathbf{x}\rangle+\beta_0.\] This seemingly innocuous remark opens the door to the kernel approach; we could conceivably replace the inner products \(\langle \mathbf{x},\mathbf{w}\rangle\) by generalized inner products \(K(\mathbf{x},\mathbf{w})\), which provide a measure of similarity between the observations \(\mathbf{x}\) and \(\mathbf{w}\).

Formally, a kernel is a symmetric positive semi-definite operator \(K:\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}\).238 Common statistical learning kernels include:

  • linear – \(K(\mathbf{x},\mathbf{w})=\mathbf{x}^{\!\top}\mathbf{w}\);

  • polynomial of degree \(d\) – \(K_d(\mathbf{x},\mathbf{w})=(1+\mathbf{x}^{\!\top}\mathbf{w})^d\);

  • Gaussian (or radial) – \(K_\gamma(\mathbf{x},\mathbf{w})=\exp(-\gamma\|\mathbf{x}-\mathbf{w}\|_2^2)\), \(\gamma>0\);

  • sigmoid – \(K_{\kappa,\delta}(\mathbf{x},\mathbf{w})=\tanh(\kappa\mathbf{x}^{\!\top}\mathbf{w}-\delta)\), for allowable \(\kappa,\delta\).
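These kernels are straightforward to implement; the NumPy sketch below (Python used here only for illustration, with arbitrary test points and parameter values) checks the symmetry of each kernel, and the positive semi-definiteness of a Gaussian Gram matrix on a random sample:

```python
import numpy as np

def linear(x, w):                     return x @ w
def poly(x, w, d=2):                  return (1 + x @ w) ** d
def gaussian(x, w, gamma=1.0):        return np.exp(-gamma * np.sum((x - w) ** 2))
def sigmoid(x, w, k=0.5, delta=0.5):  return np.tanh(k * (x @ w) - delta)

# symmetry: K(x, w) = K(w, x) for each kernel
x, w = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for K in (linear, poly, gaussian, sigmoid):
    assert np.isclose(K(x, w), K(w, x))

# positive semi-definiteness: the Gaussian Gram matrix on any finite
# sample has no (significantly) negative eigenvalues
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
G = np.array([[gaussian(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(G).min() >= -1e-10)   # True
```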

For instance, a radial kernel SVM with \(\gamma=1\), \(C=0.5\) yields the following classification on the above dataset (we are using kernlab’s ksvm() function and display the linear SVM output for comparison, whose performance we expect to be crap-tastic):

x <- matrix(rnorm(600*2), ncol = 2)
y <- c(rep(-1,200), rep(0,200),rep(1,200))
x[y==1,] <- x[y==1,] + 2
x[y==-1,] <- x[y==-1,] - 2
y <- y^2
dat <- data.frame(x=x, y=as.factor(y))

# linear SVM
kernfit.lin <- kernlab::ksvm(x,y, type = "C-svc", kernel = 'vanilladot', 
                             C = 10)
kernlab::plot(kernfit.lin, data=x)

# Gaussian SVM
kernfit.rbf <- kernlab::ksvm(x,y, type = "C-svc", kernel = 'rbfdot', 
                           sigma=1, C = 0.5)
kernlab::plot(kernfit.rbf, data=x)

 Setting default kernel parameters  

How is the decision boundary computed? The principle is the same as with linear SVM: the objective function and the decision function are \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_jK(\mathbf{x}_i,\mathbf{x}_j)-\sum_{i=1}^N\alpha_i,\quad T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}K(\mathbf{x}_{i_k},\mathbf{x})+\beta_0.\] For the radial kernel, for instance, if a test observation \(\mathbf{x}\) is near a training observation \(\mathbf{x}_i\), then \(\|\mathbf{x}-\mathbf{x}_i\|_2^2\) is small and \(K_{\gamma}(\mathbf{x},\mathbf{x}_i)\approx 1\); if they are far from one another, then \(\|\mathbf{x}-\mathbf{x}_i\|_2^2\) is large and \(K_{\gamma}(\mathbf{x},\mathbf{x}_i)\approx 0\).

In other words, in the radial kernel framework, only those observations close to a test observation play a role in class prediction.
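This locality is easy to verify numerically; a minimal NumPy sketch (Python used here only for illustration, with made-up points and \(\gamma=1\)):

```python
import numpy as np

gamma = 1.0
K = lambda x, w: np.exp(-gamma * np.sum((x - w) ** 2))   # radial kernel

x_test = np.array([0.0, 0.0])
x_near = np.array([0.1, 0.1])   # training observation near x_test
x_far  = np.array([4.0, 4.0])   # training observation far from x_test

print(K(x_test, x_near))   # close to 1: x_near influences the prediction
print(K(x_test, x_far))    # close to 0: x_far plays essentially no role
```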

Kernel Trick

But why even use kernels in the first place? While the linear kernel is easier to interpret and implement, not all data sets are linearly separable, as we have just seen. Consider the following toy classification problem (adapted from an unknown online source).

The optimal margin separating “strip” is obviously not linear. One way out of this problem is to introduce a transformation \(\Phi\) from the original \(X-\)feature space to a higher-dimensional (or at least, of the same dimension) \(Z-\)feature space in which the data is linearly separable, and to build a linear SVM on the transformed training observations \(\mathbf{z}_i=\Phi(\mathbf{x}_i)\).239 In this example, we use some \(\Phi:\mathbb{R}^2\to \mathbb{R}^3\); the projection of the transformation into the \(Z_1Z_3-\)plane could be as below.

The objective function and the decision function take on the form \[\frac{1}{2}\sum_{i,j=1}^N\alpha_i\alpha_j\Phi(\mathbf{x}_i)^{\!\top}\Phi(\mathbf{x}_j)-\sum_{i=1}^N\alpha_i,\ T(\mathbf{x};\boldsymbol{\alpha})=\sum_{k=1}^L\alpha_{i_k}y_{i_k}\Phi(\mathbf{x}_{i_k})^{\!\top}\Phi(\mathbf{x})+\beta_0,\] and the linear SVM is built as before (but in \(Z-\)space, not in \(X-\)space).

It sounds straightforward, but it does take a fair amount of experience to recognize that one way to separate the data is to use \[\mathbf{z}=\Phi(\mathbf{x})=(x_1^2,\sqrt{2}x_1x_2,x_2^2).\] And this is one of the easy transformations: what should be used in the case below (image taken from Wikipedia)?

The kernel trick simply states that \(\Phi\) can remain unspecified if we replace \(\Phi(\mathbf{x})^{\!\top}\Phi(\mathbf{w})\) by a “reasonable” (often radial) kernel \(K(\mathbf{x},\mathbf{w})\).
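For the toy transformation above, the trick is easy to verify: the explicit map \(\Phi(\mathbf{x})=(x_1^2,\sqrt{2}x_1x_2,x_2^2)\) corresponds exactly to the homogeneous polynomial kernel \(K(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\!\top}\mathbf{w})^2\) (a variant of the degree-2 kernel listed earlier, without the constant term). A short NumPy check (Python used here only for illustration):

```python
import numpy as np

def Phi(x):
    # explicit feature map R^2 -> R^3 from the example
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, w):
    # the matching kernel: no need to ever compute Phi
    return (x @ w) ** 2

x = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])

print(round(Phi(x) @ Phi(w), 10))  # 1.0
print(K(x, w))                     # 1.0
```

The inner product in \(Z-\)space can thus be computed directly in \(X-\)space, without ever forming \(\Phi(\mathbf{x})\).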

General Classification

What do we do if the response variable has \(K>2\) classes? In the one-versus-all (OVA) approach, we fit \(K\) different \(2-\)class SVM decision functions \(T_k(\mathbf{x};\boldsymbol{\alpha})\), \(k=1,\ldots, K\); in each, one class versus the rest. The test observation \(\mathbf{x}^*\) is assigned to the class for which \(T_k(\mathbf{x^*};\boldsymbol{\alpha})\) is largest.

In the one-versus-one (OVO) approach, we fit all \(\binom{K}{2}\) pairwise \(2-\)class SVM classifiers \(\text{class}_{k,\ell}(\mathbf{x})\), for training observations with levels \(k,\ell\), where \(k>\ell=1,\ldots, K-1\). The test observation \(\mathbf{x}^*\) is assigned to the class that wins the most pairwise “competitions”.

If \(K\) is large, \(\binom{K}{2}\) might be too large to make OVO computationally efficient; when it is small enough, OVO is the recommended approach.
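The OVO voting scheme can be sketched as follows (Python used here only for illustration; the pairwise winners are hypothetical, standing in for trained 2-class SVMs):

```python
from itertools import combinations
from collections import Counter

K = 4  # number of classes

# hypothetical outcome of each pairwise 2-class SVM at a test point x*
winners = {(0, 1): 0, (0, 2): 2, (0, 3): 0,
           (1, 2): 2, (1, 3): 1, (2, 3): 2}

votes = Counter()
for pair in combinations(range(K), 2):   # all C(K,2) = 6 pairwise contests
    votes[winners[pair]] += 1

print(votes.most_common(1)[0][0])   # 2: class 2 wins the most contests
```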


The vowel dataset was taken from the openML website. This modified version, by Turney, is based on Robinson’s Deterding Vowel Recognition Data, a speaker-independent recognition task for the eleven steady-state vowels of British English, using a specified training set of LPC-derived log area ratios.

Real talk: I don’t actually know what any of that means. But does it matter? I mean, yes, any conclusion we can draw from this dataset will need to be scrutinized by subject matter experts before we can hope to apply them to real-world situations. On the other hand, data is simply marks on paper (or perhaps electromagnetic patterns on the cloud). We can analyze the data without really knowing what the underlying meaning is. It is a sterile approach, but we can use it to illustrate SVM, in general.

Let’s read in the data and summarize it:

vowel <- read.csv("Data/datasets-uci-vowel.csv", header=TRUE, sep=",", 
                  stringsAsFactors=TRUE)
head(vowel)
Train.or.Test Speaker.Name Speaker.Sex Feature.0 Feature.1 Feature.2 Feature.3 Feature.4 Feature.5 Feature.6 Feature.7 Feature.8 Feature.9 Class
        Train       Andrew        Male    -3.639     0.418    -0.670     1.779    -0.168     1.627    -0.388     0.529    -0.874    -0.814   hid
        Train       Andrew        Male    -3.327     0.496    -0.694     1.365    -0.265     1.933    -0.363     0.510    -0.621    -0.488   hId
        Train       Andrew        Male    -2.120     0.894    -1.576     0.147    -0.707     1.559    -0.579     0.676    -0.809    -0.049   hEd
        Train       Andrew        Male    -2.287     1.809    -1.498     1.012    -1.053     1.060    -0.567     0.235    -0.091    -0.795   hAd
        Train       Andrew        Male    -2.598     1.938    -0.846     1.062    -1.633     0.764     0.394    -0.150     0.277    -0.396   hYd
        Train       Andrew        Male    -2.852     1.914    -0.755     0.825    -1.588     0.855     0.217    -0.246     0.238    -0.365   had

The dataset has \(n=990\) observations and \(p=14\) variables.

'data.frame':   990 obs. of  14 variables:
 $ Train.or.Test: Factor w/ 2 levels "Test","Train": 2 2 2 2 2 2 2 2 2 2 ...
 $ Speaker.Name : Factor w/ 15 levels "Andrew","Bill",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Speaker.Sex  : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Feature.0    : num  -3.64 -3.33 -2.12 -2.29 -2.6 ...
 $ Feature.1    : num  0.418 0.496 0.894 1.809 1.938 ...
 $ Feature.2    : num  -0.67 -0.694 -1.576 -1.498 -0.846 ...
 $ Feature.3    : num  1.779 1.365 0.147 1.012 1.062 ...
 $ Feature.4    : num  -0.168 -0.265 -0.707 -1.053 -1.633 ...
 $ Feature.5    : num  1.627 1.933 1.559 1.06 0.764 ...
 $ Feature.6    : num  -0.388 -0.363 -0.579 -0.567 0.394 0.217 0.322 -0.435 -0.512 -0.466 ...
 $ Feature.7    : num  0.529 0.51 0.676 0.235 -0.15 -0.246 0.45 0.992 0.928 0.702 ...
 $ Feature.8    : num  -0.874 -0.621 -0.809 -0.091 0.277 0.238 0.377 0.575 -0.167 0.06 ...
 $ Feature.9    : num  -0.814 -0.488 -0.049 -0.795 -0.396 -0.365 -0.366 -0.301 -0.434 -0.836 ...
 $ Class        : Factor w/ 11 levels "had","hAd","hed",..: 5 6 4 2 11 1 8 7 10 9 ...
 Train.or.Test  Speaker.Name Speaker.Sex  Class  
 Test :462     Andrew:66     Female:462   had:90 
 Train:528     Bill  :66     Male  :528   hAd:90 
               David :66                  hed:90 
               Jo    :66                  hEd:90 
               Kate  :66                  hid:90 
               Mark  :66                  hId:90 
               Mike  :66                  hod:90 
               Nick  :66                  hOd:90 
               Penny :66                  hud:90 
               Rich  :66                  hUd:90 
               Rose  :66                  hYd:90 
               Sarah :66                          
               Sue   :66                          
               Tim   :66                          
               Wendy :66                          

There is some imbalance in the training/testing set-up (especially as it relates to the speaker sex):

table(vowel$Train.or.Test, vowel$Speaker.Sex)
        Female Male
  Test     198  264
  Train    264  264

All in all, the numerical features seem to be generated from a multivariate normal distribution, with mean vector

colMeans(vowel[,c(4:13)], dims = 1)
   Feature.0    Feature.1    Feature.2    Feature.3    Feature.4    Feature.5 
-3.203740404  1.881763636 -0.507769697  0.515482828 -0.305657576  0.630244444 
   Feature.6    Feature.7    Feature.8    Feature.9 
-0.004364646  0.336552525 -0.302975758 -0.071339394 

and correlation matrix:

cor(vowel[,4:13])


Can we get any information from the paired plots?

psych::pairs.panels(vowel[,4:13], pch = 21, bg = vowel$Class) 

Perhaps if we focus only on certain variables?

psych::pairs.panels(vowel[,c(4,6,7,9,13)], pch = 21, bg = vowel$Class) 

psych::pairs.panels(vowel[,c(5,8,14)], pch = 21, bg = vowel$Class) 

The response variable is the Class, with 11 levels, and the \(p=10\) predictors are Feature.0, …, Feature.9. We train an SVM model on a subset of the vowel dataset. In this instance, we use the training/testing split provided with the data (Train.or.Test), but any randomly selected split would be appropriate.

training = vowel[vowel$Train.or.Test=="Train",4:14]
testing = vowel[vowel$Train.or.Test=="Test",4:14]
c(nrow(training),nrow(testing))   # training/testing split
[1] 528 462

We use the support vector machine implementation found in the R library e1071.

First we tune the hyper-parameters on a subsample of the training data by using the tune() function, which selects optimal parameters by carrying out a grid search over the specified parameters (otherwise we might spend a lot of time trying to find a good combination of parameters).

For C-classification with a Gaussian kernel, the parameters are

  • \(C\), the cost of constraint violation (which controls the penalty paid by the SVM model for misclassifying a training point), and

  • \(\gamma\), the parameter of the Gaussian kernel (used to handle non-linear classification).

If \(C\) is “high”, then misclassification is costly, and vice-versa. If \(\gamma\) is “high”, then the Gaussian bumps around the points are narrow, and vice-versa. Let us run a grid search with \(C\) varying from \(0.1\) to \(100\) by powers of \(10\), and \(\gamma = 0.5, 1, 2\).

vowel.svm.tune.1 <- e1071::tune(e1071::svm, 
                                train.x=training[,1:10], train.y=training[,11], 
                                ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))

Parameter tuning of 'e1071::svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
   10   0.5

- best performance: 0.007619739 

The minimal misclassification error (best performance) in this run is reached when the parameters take the values listed in the output above. Obviously, that search was fairly coarse: searching at a finer level can be very demanding, time-wise.

For comparison’s sake, let’s see if tuning with finer intervals and larger ranges gives substantially different results.

vowel.svm.tune.2 <- e1071::tune(e1071::svm, 
                                train.x=training[,1:10], train.y=training[,11], 
                                ranges=list(cost=10^(-2:2), gamma=1:20*0.1))

Parameter tuning of 'e1071::svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
   10   0.8

- best performance: 0.003773585 

The optimal parameters are essentially the same, so we might as well stick with the optimal parameter values from the first tuning. Training the model with these values yields:

vowel.svm.model = e1071::svm(training[,11] ~ ., data = training,
                      type="C-classification", cost=10, kernel="radial", gamma=0.5)

svm(formula = training[, 11] ~ ., data = training, type = "C-classification", 
    cost = 10, kernel = "radial", gamma = 0.5)

   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  10 

Number of Support Vectors:  351

 ( 27 32 32 26 30 36 37 29 40 32 30 )

Number of Classes:  11 

 had hAd hed hEd hid hId hod hOd hud hUd hYd

Note the number of support vectors. How accurately can this model predict the class of new observations?

predicted = predict(vowel.svm.model, testing)
(confusion.matrix = table(pred = predicted, true = testing[,11]))
      true
pred  had hAd hed hEd hid hId hod hOd hud hUd hYd
  had  36   0   0   0   0   0   0   0   0   0   0
  hAd   0  40   0   0   0   0   0   0   0   0   0
  hed   0   0  42   0   0   0   0   0   0   0   0
  hEd   0   0   0  37   0   0   0   0   0   0   0
  hid   0   0   0   0  39   0   0   0   0   0   0
  hId   0   0   0   0   2  42   0   0   1   0   0
  hod   0   0   0   0   0   0  36   0   0   0   0
  hOd   0   0   0   0   0   0   0  35   0   0   0
  hud   0   0   0   0   0   0   0   0  23   0   0
  hUd   6   2   0   5   1   0   6   7  18  42  16
  hYd   0   0   0   0   0   0   0   0   0   0  26
sum(diag(confusion.matrix))/sum(confusion.matrix)   # accuracy
[1] 0.8614719

[1] 0.847619

[1] 0.9425585

[1] 0.6798992

What do you think?

Final Comments

In practice, it is not always obvious whether one should use SVM, logistic regression, linear discriminant analysis (LDA), decision trees, etc:240

  • if classes are (nearly) separable, SVM and LDA are usually preferable to logistic regression;

  • otherwise, using logistic regression together with a ridge penalty (see Section 12.2.4) is roughly equivalent to using SVM;

  • if the aim is to estimate class membership probabilities, it is preferable to use logistic regression as SVM is not calibrated;241

  • it is possible to use kernels in the logistic regression and LDA frameworks, but at the cost of increased computational complexity.

All in all, it remains crucial to understand that the No Free Lunch Theorem remains in effect. There is simply no magical recipe… although the next technique we discuss is often viewed (and used) as one.

13.4.3 Artificial Neural Networks

When practitioners discuss using Artificial Intelligence techniques [135] to solve a problem, the implicit assumption is often (but not always) that a neural network (or some other variant of deep learning) will be used, and for good reason: “neural networks blow all previous techniques out of the water in terms of performance” [249]. But there are some skeletons in the closet: “[…] given the existence of adversarial examples, it shows we really don’t understand what’s going on” [249].

At various times since Alan Turing’s seminal 1950 paper (in which he proposed the celebrated Imitation Game [250]), complete artificial intelligence has been announced to be “just around the corner” (see below).


Figure 13.25: Conceptual timeline of the interest and optimism regarding artificial intelligence (A.I.); important milestones are indicated below the dates.

With the advent of deep learning and Big Data processing, optimism is as high as it’s ever been, but opinions on the topic are varied – to some commentators, A.I. is a brilliant success, while to others it is a spectacular failure (see the headlines in Section 7.1.3). So what is really going on?

It is far from trivial to identify the essential qualities and skills of an intelligence. There have been multiple attempts to solve the problem by building on Turing’s original effort. An early argument by Hofstadter [56] is that any intelligence should:

  • provide flexible responses in various scenarios;

  • take advantage of lucky circumstances;

  • make sense out of contradictory messages;

  • recognize the relative importance of a situation’s elements;

  • find similarities between different situations;

  • draw distinctions between similar situations, and

  • come up with new ideas from scratch or by re-arranging previous known concepts.  

This is not quite the approach taken by modern A.I. researchers, who define the discipline as the study of intelligent agents – any device that perceives its environment and takes actions to maximize its chance of success at some task/goal [251].

Examples include:

  • expert systems – TurboTax, WebMD, technical support, insurance claim processing, air traffic control, etc.;

  • decision-making – Deep Blue, auto-pilot systems, “smart” meters, etc.;

  • natural language processing – machine translation, Siri, named-entity recognition, etc.;

  • recommenders – Google, Expedia, Facebook, LinkedIn, Netflix, Amazon, etc.;

  • content generators – music composer, novel writer, animation creator, etc.;

  • classifiers – facial recognition, object identification, fraud detection, etc.  

A trained artificial neural network (ANN) is a function that maps inputs to outputs in a useful way: it uses a Swiss-army-knife approach to providing outputs – plenty of options are available in the architecture, but it’s not always clear which ones should be used.

One of the reasons that ANNs are so popular is that the user does not need to decide much about the function or know much about the problem space in advance – ANNs are quiet models.

Algorithms allow ANNs to learn (i.e. to generate the function and its internal values) automatically; technically, the only requirement is the user’s ability to minimize a cost function (which is to say, to be able to solve optimization problems).


The simplest definition of an artificial neural network is provided by the inventor of one of the first neuro-computers, R. Hecht-Nielsen, as:

“[…] a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.” [252]

An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain but on much smaller scales. Neural networks are typically organized in layers. Layers are made up of a number of interconnected nodes which contain an activation function.

A pattern \(\mathbf{x}\) (input, signal) is presented to the network via the input layer, which communicates with one or more hidden layers, where the actual processing is done via a system of weighted connections \(\mathbf{W}\) (edges).

The hidden layers then link to an output layer, which outputs the predicted response \(\hat{\mathbf{y}}\) (see Figure 13.26).


Figure 13.26: Artificial neural network topology – conceptual example. The number of hidden layers is arbitrary, as is the size of the signal and output vectors.

Neural Networks Architecture

In order to train a neural network, we need the following objects [242]:

  • some input data,

  • a number of layers,

  • a model, and

  • a learning process (loss function and optimizer).

The interactions between these objects are visualized below.


Figure 13.27: Relationship between the network, layers, loss function, and optimizer [242].

A network (model), which is composed of layers that are chained together, maps the input data into predictions.242 The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network’s predictions match what was expected. The optimizer uses this loss value to update the network’s weights.

Input Data

Neural networks start with the input training data (and corresponding targets) in the form of a tensor. Generally speaking, most modern machine learning systems use tensors as their basic data structure. At its core, a tensor is a container for data – and it is almost always numerical.

Tensors are defined by three key attributes: their

  • rank (number of axes) – for instance, a 3D tensor has three axes, while a matrix (2D tensor) has two axes;

  • shape, a tuple of integers that describes how many dimensions the tensor has along each axis – for instance, a matrix’s shape is described using two elements, such as (3,5), a 3D tensor’s shape has three elements, such as (3,5,5), a vector (1D tensor)’s shape is given by a single element, such as (5), whereas a scalar has an empty shape, ( );

  • data type – for instance, a tensor’s type could be float32, uint8, float64, etc.
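These three attributes are directly accessible in most tensor libraries; a NumPy sketch (Python used here only for illustration):

```python
import numpy as np

scalar   = np.array(7.0)                   # rank 0, shape ( )
vector   = np.array([1., 2., 3., 4., 5.])  # rank 1, shape (5,)
matrix   = np.zeros((3, 5))                # rank 2, shape (3, 5)
tensor3d = np.zeros((3, 5, 5))             # rank 3, shape (3, 5, 5)

for t in (scalar, vector, matrix, tensor3d):
    print(t.ndim, t.shape, t.dtype)   # rank, shape, and data type
```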

Data tensors almost always fall into one of the following categories:

  • the most common case is vector data; in such datasets, each single data point can be encoded as a vector, and a batch of data will be encoded as a matrix or 2D tensor of shape (#samples,#features), or more simply, as an array of vectors where the first axis is the samples axis and the second axis is the features axis;

  • time series or sequence data, whenever the passage of time is crucial to the observations in the dataset (or the notion of sequence order), can be stored in a 3D tensor with an explicit time axis; each sample can be encoded as a sequence of vectors (a 2D tensor), and a batch of data will be encoded as a 3D tensor of shape (#samples, #timesteps, #features), as in Figure 13.28;


Figure 13.28: A 3D time series data tensor [242].

  • images typically have three dimensions: height, width, and colour depth;243 a batch of image data could thus be stored in a 4D tensor of shape (#samples, #height, #width, #channels), as in Figure 13.29;

Figure 13.29: A 4D image data tensor [242].

  • video data is one of the few types of real-world data for which 5D tensors are needed – a video can be understood as a sequence of frames, each frame being a colour image; a sequence of frames can be stored in a 4D tensor (#frames, #height, #width, #channels), and so a batch of different videos can be stored in a 5D tensor of shape (#samples, #frames, #height, #width, #channels).
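The four cases can be summarized with batch shapes; a NumPy sketch (Python used here only for illustration, and all sizes are made up):

```python
import numpy as np

vector_data = np.zeros((32, 10))               # (#samples, #features)
time_series = np.zeros((32, 100, 10))          # (#samples, #timesteps, #features)
images      = np.zeros((32, 256, 256, 3))      # (#samples, height, width, channels)
videos      = np.zeros((4, 60, 144, 256, 3))   # (#samples, #frames, height, width, channels)

for t in (vector_data, time_series, images, videos):
    print(t.ndim)   # 2, 3, 4, 5
```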


The core building block of neural networks is the layer, a data-processing module that is, in a sense, a filter for data: some data goes into the layer and comes out in a more useful form.

Specifically, layers extract representations out of the data fed into them – hopefully, representations that are more meaningful for the problem at hand. A layer takes as input \(1+\) tensors and outputs \(1+\) tensors. Different layers are appropriate for different tensor formats and different types of data processing.

For instance, simple vector data, stored in 2D tensors, is often processed by densely connected layers, also called fully connected or dense layers. Sequence data, stored in 3D tensors, is typically processed by recurrent layers. Image data, stored in 4D tensors, is usually processed by 2D convolution layers.

Most of deep learning consists of chaining together simple layers that will implement a form of progressive data distillation. However, to build deep learning models in tensor-based modules like Keras [242], it is important to clip together compatible layers to form useful data-transformation pipelines.

The notion of layer compatibility refers specifically to the fact that every layer can only accept input tensors of a certain shape and return output tensors of a certain shape.

We will discuss tensors in greater detail in the Deep Learning module.

Model: Networks of Layers

An artificial neural network model is essentially a data processing sieve, made of a succession of increasingly refined data filters – the layers. The most common example of a model is a linear stack of layers, mapping a single input to a single output. Other network topologies include: two-branch networks, multihead networks, and inception blocks. The topology of a network defines a hypothesis space.

Since machine learning is basically

“[…] searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal” [242],

by choosing a network topology, we constrain the space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data.

From a ML perspective, what we are searching for is a good set of values for the weight tensors involved in these tensor operations. Picking the right network architecture is more an art than a science; and although there are some best practices and principles we can rely on, practical experience is the main factor in becoming a proper neural network architect.

Learning Process: Loss Function and Optimizer

After a network architecture has been selected, two other objects need to be chosen:

  • the (objective) loss function is the quantity that is minimized during training – it represents a measure of success for the task at hand, and

  • the optimizer determines how the network is updated based on the loss function.

In this context, learning means finding a combination of model parameters that minimizes the loss function for a given set of training data observations and their corresponding targets.

Learning happens by drawing random batches of data samples and their targets, and computing the gradient of the network parameters with respect to the loss on the batch. The network parameters are then updated by a small amount (the magnitude of the move is defined by the learning rate) in the opposite direction from the gradient.
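A minimal sketch of this loop (Python used here only for illustration): a one-parameter linear model trained by mini-batch gradient descent on a synthetic dataset, with the gradient of the mean squared error derived by hand. All values (learning rate, batch size, true slope) are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=200)
y = 3.0 * X + rng.normal(scale=0.1, size=200)   # data generated with slope 3

w, lr = 0.0, 0.1   # initial weight and learning rate
for _ in range(200):
    idx = rng.choice(200, size=32, replace=False)  # draw a random batch
    xb, yb = X[idx], y[idx]
    grad = 2 * np.mean((w * xb - yb) * xb)  # gradient of the batch MSE w.r.t. w
    w -= lr * grad                          # small step against the gradient

print(abs(w - 3.0) < 0.1)   # True: the learned weight approaches the true slope
```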

The entire learning process is made possible by the fact that, beneath the network disguise, neural networks are simply chains of differentiable tensor operations, to which the chain rule of differentiation can be applied to find the gradient function mapping the current parameters and current batch of data to a gradient value.

Choosing the right objective function for a given problem is extremely important: the network is ruthless when it comes to lowering its loss function, and it will take any shortcut it can to achieve that objective. If the latter does not fully correlate with success for the task at hand, the network may face unintended side effects.

Simple guidelines exist to help analysts select an appropriate loss function for common problems such as classification, regression, and sequence prediction. We typically use:

  • binary cross entropy for a binary classification;

  • categorical cross entropy for an \(n\)-ary classification;

  • mean squared error for a regression;

  • connectionist temporal classification (CTC) for sequence-learning, etc.
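The first three of these losses can easily be computed by hand; the following base R sketch uses made-up targets and predictions (illustrative values only, not tied to any dataset):

```r
# Hand-computed losses on made-up values (for illustration only)
y   <- c(1, 0, 1, 1)               # binary targets
p   <- c(0.9, 0.2, 0.7, 0.6)       # predicted probabilities
bce <- -mean(y*log(p) + (1-y)*log(1-p))   # binary cross-entropy

y.reg    <- c(2.0, 3.5)            # regression targets
yhat.reg <- c(2.2, 3.1)            # predicted values
mse <- mean((y.reg - yhat.reg)^2)  # mean squared error
```

The cross-entropy rewards confident correct probabilities and penalizes confident wrong ones; the mean squared error simply averages the squared residuals.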

In most cases, one of these will do the trick – only when analysts are working on truly new research problems do they have to develop their own objective functions. Let us first illustrate the principles underlying ANNs with a simple example.

We have seen that ANNs are formed from an input layer from which the signal vector \(\mathbf{x}\) is inputted, an output layer which produces an output vector \(\hat{\mathbf{y}}\), and any number of hidden layers; each layer consists of a number of nodes which are connected to the nodes of other layers via directed edges with associated weights \(\mathbf{w}\), as seen below.

Nodes from the hidden and output layers are typically activation nodes – the output \(a(\mathbf{z})\) is some function of the input \(\mathbf{z}\). Signals propagate through the ANN using simple arithmetic, once a set of weights \(\mathbf{w}\) and activation functions \(a(\cdot)\) have been selected.


Figure 13.30: Signal propagating forward through an ANN; weights (in blue), activation functions (in yellow), inputs (in green), and output (in black).

In a nutshell, at each node, the neural net computes a weighted sum of inputs, applies an activation function, and sends a signal. This is repeated until the various signals reach the final output nodes.

That part is easy – given a signal, an ANN can produce an output, as long as the weights are specified. Matrix notation can simplify the expression for the output \(\hat{\mathbf{y}}\) in terms of the signal \(\mathbf{x}\), weights \(\mathbf{w}\), and activation function \(a(\cdot)\).

For instance, consider the network of Figure 13.30; if \[a(z)=(1+\exp(-z))^{-1},\] the network topology can be re-written as:

  • input layer with \(p\) nodes \[\mathbf{X}_{n\times p}={\mathbf{X}_{n\times 2}}={\begin{bmatrix} x_{A,1} & x_{B,1} \\ \vdots &\vdots \\ x_{A,n} & x_{B,n} \end{bmatrix}};\]

  • weights from input layer to hidden layer with \(m\) nodes \[{\mathbf{W}^{(1)}_{p\times m}}={\mathbf{W}^{(1)}_{2\times 2}}={\begin{bmatrix} w_{AC} & w_{AD} \\ w_{BC} & w_{BD}\end{bmatrix}};\]

  • hidden layer with \(m\) nodes \[{\mathbf{Z}^{(2)}_{n\times m}}={\mathbf{Z}^{(2)}_{n\times 2}}={\begin{bmatrix} z_{C,1} & z_{D,1} \\ \vdots & \vdots \\ z_{C,n} & z_{D,n}\end{bmatrix}}={\mathbf{X}}{\mathbf{W}^{(1)}};\]

  • activation function on hidden layer \[{\mathbf{a}^{(2)}}={\begin{bmatrix} (1+\exp(-z_{C,1}))^{-1} & (1+\exp(-z_{D,1}))^{-1} \\ \vdots & \vdots \\ (1+\exp(-z_{C,n}))^{-1} & (1+\exp(-z_{D,n}))^{-1} \end{bmatrix}}=g({\mathbf{Z}^{(2)}});\]

  • weights from hidden layer with \(m\) nodes to output layer with \(k\) nodes \[{\mathbf{W}^{(2)}_{m\times k}}={\mathbf{W}^{(2)}_{2\times 1}}={\begin{bmatrix} w_{CE} \\ w_{DE} \end{bmatrix}};\]

  • output layer with \(k\) nodes \[{\mathbf{Z}^{(3)}_{n\times k}}={\mathbf{Z}^{(3)}_{n\times 1}}={\begin{bmatrix} z_{E,1} \\ \vdots \\ z_{E,n} \end{bmatrix}}={\mathbf{a}^{(2)}}{\mathbf{W}^{(2)}};\]

  • activation function on output layer \[\hat{\mathbf{y}}={\mathbf{a}^{(3)}}={\begin{bmatrix} (1+\exp(-z_{E,1}))^{-1} \\ \vdots \\ (1+\exp(-z_{E,n}))^{-1} \end{bmatrix}}=g({\mathbf{Z}^{(3)}});\]
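A direct R implementation of this forward pass is straightforward; the weight values below are arbitrary placeholders (for illustration only), not the result of any training:

```r
# Forward pass for the 2-2-1 network of Figure 13.30, with sigmoid activations.
g <- function(z) 1/(1+exp(-z))   # logistic activation

X  <- matrix(c(0.5, -1.2,
               1.0,  0.3), nrow=2, byrow=TRUE)   # n=2 observations, p=2 inputs (A, B)
W1 <- matrix(c(0.1, -0.4,
               0.8,  0.2), nrow=2, byrow=TRUE)   # input -> hidden weights (2x2)
W2 <- matrix(c(0.6, -0.3), nrow=2)               # hidden -> output weights (2x1)

a2    <- g(X %*% W1)     # hidden-layer activations (n x 2)
y.hat <- g(a2 %*% W2)    # network output (n x 1), entries in (0,1)
```

Note that the entire computation reduces to two matrix products and two applications of \(g\).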

The main problem is that unless the weights are judiciously selected, the output that is produced is unlikely to have anything to do with the desired output. For supervised learning tasks (i.e., when an ANN attempts to emulate the results of training examples), there has to be some method to optimize the choice of weights against an error function \[R(\mathbf{W})=\sum_{i=1}^n\sum_{\ell=1}^k\left(\hat{y}_{i,\ell}(\mathbf{W})-y_{i,\ell}\right)^2\quad \text{or}\quad R(\mathbf{W})=-\sum_{i=1}^n\sum_{\ell=1}^ky_{i,\ell}\ln \hat{y}_{i,\ell}(\mathbf{W})\] (for value estimation and classification, respectively), where \(n\) is the number of observations in the training set, \(k\) is the number of output nodes in the ANN, \(y_{i,\ell}\) is the known value or class label for the \(\ell^{\textrm{th}}\) output of the \(i^{\textrm{th}}\) observation in the training set, and \(\hat{y}_{i,\ell}(\mathbf{W})\) is the corresponding network output.

Enter backpropagation, which is simply an application of the chain rule to \(R(\mathbf{W})\). Under reasonable regularity conditions, the desired minimizer \(\mathbf{W}^*\) satisfies \(\nabla R (\mathbf{W}^*)=0\) and is found using numerical gradient descent.
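To see the chain rule at work on the smallest possible example, consider a single weight \(w\), one input \(x\), output \(\hat{y}=a(wx)\) with logistic \(a\), and squared-error loss; the analytical gradient can be checked against a finite-difference approximation (the numerical values below are arbitrary):

```r
# Chain rule vs. finite differences for a one-weight "network" (illustration).
a <- function(z) 1/(1+exp(-z))                 # logistic activation
x <- 1.5; y <- 1; w <- 0.2                     # arbitrary values
R <- function(w) (a(w*x) - y)^2                # squared-error loss

y.hat   <- a(w*x)
g.chain <- 2*(y.hat - y) * y.hat*(1 - y.hat) * x   # dR/dw via the chain rule
g.num   <- (R(w + 1e-6) - R(w - 1e-6)) / 2e-6      # central finite difference
```

The two gradient values agree to high precision; backpropagation performs the chain-rule computation efficiently for every weight in the network at once.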

Gradient-Based Optimization

Initially, the weight matrix \(\mathbf{W}\) is filled with small random values (a step called random initialization). The weights are then gradually trained (or learned), based on a feedback signal. This occurs within a training loop, which works as follows:

  1. draw a batch of training samples \(\mathbf{x}\) and corresponding targets \(\mathbf{y}\);

  2. run the network on \(\mathbf{x}\) (the forward pass) to obtain predictions \(\hat{\mathbf{y}}\);

  3. compute the loss of the network on the batch, a measure of the mismatch between \(\hat{\mathbf{y}}\) and \(\mathbf{y}\);

  4. update all weights of the network in a way that slightly reduces the loss on this batch.

Repeat these steps in a loop, as often as necessary. Hopefully, the process will eventually converge on a network with a very low loss on the training data, which is to say that there will be a low mismatch between the predictions \(\hat{\mathbf{y}}\) and the target \(\mathbf{y}\). In the vernacular, we say that the ANN has learned to map its inputs to correct targets.

Step 1 is easy enough. Steps 2 and 3 are simply the application of a handful of tensor operations (or matrix multiplication, as above). Step 4 is more difficult: how do we update the network’s weights? Given an individual weight coefficient in the network, how can we compute whether the coefficient should be increased or decreased, and by how much?

One solution would be to successively minimize the objective function along coordinate directions: at each iteration, this coordinate descent algorithm selects a coordinate and minimizes the objective with respect to it while holding all other coordinates fixed [242].

It is based on the idea that minimization can be achieved by minimizing along one direction at a time. Coordinate descent is useful in situations where the objective function is not differentiable, as is the case for most regularized regression models, say. But this approach would be inefficient in deep learning networks, where there is a large collection of individual weights to update. A smarter approach is to use the fact that all operations used to propagate a signal in the network are differentiable, and to compute the gradient of the objective function (loss) with respect to the network's coefficients.

Following a long-standing principle of calculus, we can decrease the objective function by updating the coefficients in the opposite direction to the gradient.244 For an input vector \(\mathbf{X}\), a weight matrix \(\mathbf{W}\), a target \(\mathbf{Y}\), and a loss function \(L\), we predict a target candidate \(\hat{\mathbf{Y}}(\mathbf{W})\), and compute the loss when approximating \(\mathbf{Y}\) by \(\hat{\mathbf{Y}}(\mathbf{W})\).

If \(\mathbf{X}\) and \(\mathbf{Y}\) are fixed, the loss function maps weights \(\mathbf{W}\) to loss values: \(f(\mathbf{W})=L(\hat{\mathbf{Y}}(\mathbf{W}),\mathbf{Y}).\)

In much the same way that the derivative of a single-variable function \(f(x)\) at a point \(x_0\) is the slope of the tangent to \(f\) at \(x_0\), the gradient \(\nabla f(\mathbf{W}_0)\) is the tensor of partial derivatives describing the local behaviour of \(f(\mathbf{W})\) around \(\mathbf{W}_0\); it points in the direction of steepest increase of \(f\). As is the case with the derivative, we can reduce \(f(\mathbf{W})\) by moving \(\mathbf{W}_0\) against the gradient, to \[\mathbf{W}_1=\mathbf{W}_0-s\nabla f(\mathbf{W}_0),\] where \(s\) is the learning rate, a small scalar chosen to keep the step within a region where the local approximation of the hypersurface near \(\mathbf{W}_0\) remains reasonable.
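The update rule can be illustrated on a one-parameter toy problem, say minimizing the made-up objective \(f(w)=(w-3)^2\), whose minimizer is \(w^*=3\):

```r
# Gradient descent on f(w) = (w-3)^2, whose minimizer is w* = 3.
f      <- function(w) (w - 3)^2
grad.f <- function(w) 2*(w - 3)

s <- 0.1    # learning rate
w <- 0      # initialization
for (i in 1:100){
  w <- w - s*grad.f(w)   # step against the gradient
}
```

Each iteration shrinks the error \(w-3\) by a factor \(1-2s=0.8\), so the sequence converges geometrically to \(w^*=3\).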

Stochastic Gradient Descent

When dealing with ANNs, we could in principle take advantage of the differentiability of the loss function by finding its critical points, where \(\nabla f(\mathbf{W})=\mathbf{0}\), analytically.

If the neural network contains \(Q\) edges, this requires solving a polynomial equation in \(Q\) variables. Since real-world ANNs often have thousands of such connections (if not more), the analytical approach is not practicable.

Instead, we modify the parameters slightly based on the current loss value on a random batch of data. Since we are dealing with a differentiable function, we can use a mini-batch stochastic gradient descent (minibatch SGD) to update the weights, simply by modifying Step 4 of the gradient descent algorithm as follows:

4a. compute the gradient of the loss with regard to the weights (the backward pass);

4b. update the weights “a little” in the direction opposite the gradient.
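Steps 1–4 can be sketched end-to-end with minibatch SGD on a simulated linear model; the data, batch size, and learning rate below are illustrative choices, not prescriptions:

```r
# Minibatch SGD for a linear model y = Xb + noise (simulated data).
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))                  # design matrix with intercept
b.true <- c(2, -1)
y <- X %*% b.true + rnorm(n, sd=0.1)

b <- c(0, 0); s <- 0.05; batch <- 32
for (i in 1:2000){
  idx <- sample(n, batch)                      # 1. draw a random batch
  r   <- X[idx,] %*% b - y[idx]                # 2-3. forward pass and residuals
  g   <- 2 * t(X[idx,]) %*% r / batch          # 4a. gradient of the batch loss
  b   <- b - s * as.vector(g)                  # 4b. small step against the gradient
}
```

After the loop, `b` is close to the true coefficients \((2,-1)\); each update only sees a random subset of 32 observations, which is what makes the procedure "stochastic".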

Figure 13.31 illustrates how SGD works when the network has only one parameter to learn, with a single training sample.


Figure 13.31: SGD with one parameter [242]

We immediately see why it is important to choose a reasonable learning rate (the step size): too small a value leads to slow convergence and runs the risk of getting stuck in a local minimum; too large a value may send the descent to essentially random locations on the curve and overshoot the global minimum altogether.

SGD Challenges

The main issue with minibatch SGD is that “good” convergence rates are not guaranteed, but there are other challenges as well:

  • selecting a reasonable learning rate can be difficult. Too small a rate leads to painfully slow convergence, too large a rate can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge [242];

  • the same learning rate applies to all parameter updates, which might not be ideal when the data is sparse;

  • a key challenge is minimizing the highly non-convex loss functions that commonly occur in ANNs while avoiding suboptimal local minima and saddle points. It is hard for SGD to escape these suboptimal local minima, and even harder to escape the saddle points [253].

SGD Variants

There are several SGD variants that are commonly used by the deep learning community to overcome the aforementioned challenges. They take into account the previous weight updates when computing the next weight update, rather than simply considering the current value of the gradients. Popular optimizers include SGD with momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSProp, and many more [254], [255].245

ANNs can be quite accurate when making predictions – more so than other algorithms, given a proper set-up (which can be hard to achieve). They degrade gracefully, and they often work when other methods fail:

  • when the relationship between attributes is complex;

  • when there are a lot of dependencies/nonlinear relationships;

  • when the inputs are messy and highly-connected (images, text and speech), and

  • when dealing with non-linear classification.

But they are relatively slow and prone to overfitting (unless they have access to large and diverse training sets), they are notoriously hard to interpret due to their blackbox nature, and there is no algorithm in place to select the optimal network topology.

Finally, even when they do perform better than other options, ANNs may not perform that much better due to the No Free-Lunch theorems; and they always remain susceptible to various forms of adversarial attacks [256], so they should be used with caution.


In this example, we explore the wine dataset with the ANN architecture implemented in the R package neuralnet. We will revisit deep learning networks, and more complicated topologies, in the Deep Learning module.

wine = read.csv("Data/wine.csv", header=TRUE)
wine = as.data.frame(wine)
str(wine)
'data.frame':   178 obs. of  14 variables:
 $ Class               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic.acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity.of.ash   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ Magnesium           : int  127 100 101 113 118 112 96 121 97 98 ...
 $ Total.phenols       : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ Flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ Nonflavanoid.phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ Proanthocyanins     : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ Colour.intensity    : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ Hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ OD280.OD315         : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ Proline             : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

table(wine$Class)
 1  2  3 
59 71 48 

We notice that there are only 3 classes: 1, 2, 3. These classes will be the levels of the categorical response variable for a classification task.

We set up the model parameters/inputs as follows:

# Number of instances
n = nrow(wine) 
# Dependent variable - Class
Y = wine$Class 
# Independent variables (full)
X = wine[,-1] 
# Indices for Class=1,2,3
C1.loc = which(Y==1) 
C2.loc = which(Y==2) 
C3.loc = which(Y==3) 

We begin data exploration by taking a look at the variables’ boxplots, an excellent way to understand the distribution of each variable.

plot.title = "Boxplot of Variables in the Wine dataset (original scale)"  
boxplot(X, main=plot.title, xaxt="n")
axis(1, at=1:dim(X)[2], labels=colnames(X), las=2, cex.axis = 0.5)

We clearly see that Proline has a much larger mean and variability than the other variables. This suggests that if we apply a reduction technique like PCA to the wine dataset, the first principal component will be based almost entirely on the Proline values. To reduce such undue effects, we must standardize the data first (see Feature Selection and Dimension Reduction for details).

# Standardized predictor variables (full)
X.std = scale(X) 

plot.title = "Boxplot of Variables in the Wine dataset (standardized)"  
boxplot(X.std, main=plot.title, xaxt="n")
axis(1, at=1:dim(X)[2], labels=colnames(X), las=2, cex.axis = 0.5)

We now shift our focus to understanding the relationships amongst the explanatory variables. This is important because:

  • scatter plots show us whether classification is a relatively simple task for our data;

  • they let us visually inspect potential outliers or influential points;

  • correlations amongst variables tell us whether it is necessary to retain all of them, and

  • the variance inflation factor (VIF) (see Regression Analysis) helps us determine which variables can be removed in order to obtain more stable models, etc.

Let’s first take a look at the scatterplot matrix:

plot.title = "Scatterplot matrix"
pairs(X.std, main=plot.title, col=Y+1, cex=0.5, pch=".")

We can also calculate and display the correlation matrix:

X.cor = cor(X.std)

We write a little function that computes the VIF of each variable in a design matrix:

vif <- function(X){
  X = as.data.frame(X)
  vif.vec = numeric(dim(X)[2])
  for (i in 1:dim(X)[2]){
    # regress the ith variable on the remaining ones: VIF_i = 1/(1-R_i^2)
    R2.i = summary(lm(X[,i] ~ ., data=X[,-i]))$r.squared
    vif.vec[i] = 1/(1-R2.i)
  }
  vif.vec
}

vif.X = matrix(vif(X.std), nrow=1)
colnames(vif.X) = colnames(X)
rownames(vif.X) = "VIF"
t(vif.X)
Alcohol              2.46
Malic.acid           1.66
Ash                  2.19
Alcalinity.of.ash    2.24
Magnesium            1.42
Total.phenols        4.33
Flavanoids           7.03
Nonflavanoid.phenols 1.80
Proanthocyanins      1.98
Colour.intensity     3.03
Hue                  2.55
OD280.OD315          3.79
Proline              2.82

We see that Flavanoids has high multicollinearity with respect to the other variables, as its VIF is greater than 5. As such, we have reasonable grounds to remove that variable from further analyses: the remaining variables can explain how Flavanoids would behave, and removing it may reduce the risk of creating unstable models.

X = X[,-7]
X.std = X.std[,-7]

Now we apply a PCA reduction (see Feature Selection and Dimension Reduction for details) to further reduce the problem complexity.

We start by performing principal component analysis on the standardized data:

pca.std = prcomp(X.std)

Let us provide a summary of the outcome:

summary(pca.std)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.9757 1.5802 1.1870 0.94820 0.91472 0.80087 0.74082
Proportion of Variance 0.3253 0.2081 0.1174 0.07492 0.06973 0.05345 0.04573
Cumulative Proportion  0.3253 0.5334 0.6508 0.72569 0.79541 0.84886 0.89460
                           PC8     PC9    PC10   PC11    PC12
Standard deviation     0.58095 0.53687 0.49487 0.4750 0.41059
Proportion of Variance 0.02813 0.02402 0.02041 0.0188 0.01405
Cumulative Proportion  0.92272 0.94674 0.96715 0.9859 1.00000
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Alcohol -0.17 0.48 -0.20 0.09 -0.26 0.21 -0.07 0.40 -0.50 -0.23 0.21 -0.26
Malic.acid 0.28 0.23 0.04 -0.52 -0.12 0.53 0.42 0.10 0.10 0.29 -0.07 0.12
Ash 0.02 0.32 0.64 0.21 -0.11 0.15 -0.15 -0.21 0.29 0.03 0.51 -0.06
Alcalinity.of.ash 0.28 -0.01 0.59 -0.09 0.03 -0.10 -0.30 0.43 -0.18 -0.06 -0.49 -0.05
Magnesium -0.17 0.30 0.18 0.12 0.78 0.05 0.34 -0.12 -0.27 -0.06 -0.07 0.07
Total.phenols -0.42 0.06 0.17 -0.18 -0.24 -0.09 -0.04 -0.45 -0.31 0.46 -0.27 -0.33
Nonflavanoid.phenols 0.33 0.03 0.14 0.35 -0.41 -0.26 0.60 -0.21 -0.21 -0.21 -0.13 0.05
Proanthocyanins -0.34 0.04 0.17 -0.46 -0.04 -0.55 0.34 0.31 0.21 -0.15 0.23 -0.10
Colour.intensity 0.10 0.53 -0.16 -0.02 -0.06 -0.42 -0.23 -0.02 -0.05 0.31 -0.02 0.60
Hue -0.33 -0.28 0.14 0.43 -0.09 0.10 0.22 0.47 -0.04 0.50 0.07 0.25
OD280.OD315 -0.40 -0.17 0.20 -0.19 -0.20 0.26 -0.07 -0.17 -0.18 -0.45 -0.06 0.59
Proline -0.33 0.36 -0.09 0.25 -0.11 0.12 0.07 0.07 0.57 -0.17 -0.55 -0.07

The scree-plot for proportions of variance explained by each principal component is shown below:

scree.y = eigen(t(X.std)%*%X.std)$values/sum(eigen(t(X.std)%*%X.std)$values)
plot.title = "Scree plot (proportion of variance explained)"
barplot(scree.y, main=plot.title, ylim=c(0, 0.35), ylab="% explained")
test = seq(0.7, 13.9, length.out=12)
axis(1, at=test, labels=1:12)

Based on the summary table, the first 6 PC would be required to retain 80% of the explanatory power; the scree plot, on the other hand, shows a knee at the 4th component. We can explore this further: let’s take a look at the scatterplot of the first two PC. We start by transforming X.std into its principal coordinates:

PC = X.std%*%pca.std$rotation
PCnames = c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12")
colnames(PC) <- PCnames

plot.title = "PC1 vs. PC2"
plot(PC[,1], PC[,2], cex=1.2, main=plot.title, col=Y+1, xlab="PC1", ylab="PC2")

The scatterplot shows that the three classes are separated reasonably well by PC1 and PC2 (although, not linearly).

plot.title = "Boxplots (in Principal Coordinates)"
par(mfrow = c(3,4))
for (i in 1:12){
  plot.title.ind = paste("PC ", i, sep="")
  boxplot(PC[,i]~Y, main=plot.title.ind, col=c("red","green","blue"))
}

The boxplots further suggest that PC3 to PC12 do not provide as clear a separation as PC1 and PC2 do. We will therefore use only the first two components to visually evaluate the effectiveness of the ANN (and its prediction regions).

In order to evaluate the effectiveness of the model (i.e., does it have good predictive power without overfitting the data?), we split the data into training and testing sets. The model is then developed using the training set (i.e., optimized using a subset of data), and then evaluated for its prediction power using the testing set.

There are \(n=178\) observations in total: we sample \(140\) of them for the training set (say, 46 of class 1, 56 of class 2, and 38 of class 3); the remaining 38 observations form the testing set.

set.seed(1111) # for replicability
C1.train.loc = sort(sample(C1.loc, size=46))
C2.train.loc = sort(sample(C2.loc, size=56))
C3.train.loc = sort(sample(C3.loc, size=38))
train.loc = c(C1.train.loc, C2.train.loc, C3.train.loc)
test.loc = which(!(1:length(Y) %in% train.loc))

# training data
PC.train = PC[train.loc,]
Y.train = Y[train.loc]
dat.train = as.data.frame(cbind(nnet::class.ind(Y.train), PC.train))
colnames(dat.train)[1:3] = c("C1", "C2", "C3")

# testing data
PC.test = PC[test.loc,]
Y.test = Y[test.loc]
dat.test = as.data.frame(cbind(nnet::class.ind(Y.test), PC.test))
colnames(dat.test)[1:3] = c("C1", "C2", "C3")

Let us display the training dataset (circles) and testing dataset (triangles) on the PC1/PC2 scatterplot.

plot.title = "Training and Testing data"
xlimit = c(-4,4)
ylimit = c(-3,3)
plot(dat.train$PC1, dat.train$PC2, cex=1.2, col=Y.train+1, main=plot.title, 
     xlab="PC1", ylab="PC2", xlim=xlimit, ylim=ylimit)
points(dat.test$PC1, dat.test$PC2, pch=17, cex=1.5, col=Y.test+1)

If we only use the PC1/PC2-reduced data, we would expect that at least two of the test observations would be misclassified (the left-most and right-most green triangles, respectively).

We are finally ready to build an ANN via the R package neuralnet (the main function is also called neuralnet()). We run three analyses:

  1. using only the first two principal components as inputs;

  2. using the first six principal components as inputs, and

  3. using all 12 principal components as inputs.

We start by forming a grid in the PC1/PC2 space on which we can colour the prediction regions.

predict.region.PC1=seq(-5,5, length.out=100)
predict.region.PC2=seq(-4,4, length.out=100)
predict.region=expand.grid(x=predict.region.PC1, y=predict.region.PC2)

We will also use an expanded form of the confusion matrix:

# A souped-up version of the confusion matrix 
confusion.expand <- function (pred.c, class) {
   temp <- mda::confusion(pred.c, class)
   row.sum <- apply(temp,1,sum)
   col.sum <- apply(temp,2,sum)
   t.sum <- sum(col.sum)
   tmp <- rbind(temp, rep("----", dim(temp)[2]), col.sum)
   tmp <- noquote(cbind(tmp, rep("|",dim(tmp)[1]), c(row.sum, "----", t.sum)))
   dimnames(tmp) <- list(object = c(dimnames(temp)[[1]],"-------","Col Sum"),
                 true = c(dimnames(temp)[[2]],"|","Row Sum"))
   attr(tmp, "error") <- attr(temp, "error")
   attr(tmp, "mismatch") <- attr(temp, "mismatch")
   return(tmp)
}

In the following chunks of code, we build an ANN model using neuralnet, with PC1 and PC2 as inputs. The parameter model.structure, which defines the number of hidden nodes in each hidden layer, is modifiable:

  • a model with no hidden layer would have model.structure = 0;

  • a model with 1 hidden layer of 3 nodes would require model.structure = 3;

  • a model with 2 hidden layers of 5 and 10 nodes, respectively, would require model.structure = c(5,10), and so on.

We will use 2 hidden layers of 10 nodes each.

model.structure = c(10,10)
model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ PC1 + PC2, data = dat.train, 
                               hidden = model.structure, err.fct = "ce", 
                               linear.output = FALSE)
prob.model1.test <- neuralnet::compute(model1, PC.test[,1:2])
predict.model1.test = max.col(prob.model1.test$net.result)
print(paste("Confusion matrix (testing) for model = ", list(model.structure)[1], sep=""))
(conf.test=confusion.expand(predict.model1.test, Y.test))
[1] "Confusion matrix (testing) for model = c(10, 10)"
object    1    2    3    | Row Sum
  1       13   1    0    | 14     
  2       0    13   1    | 14     
  3       0    1    9    | 10     
  ------- ---- ---- ---- | ----   
  Col Sum 13   15   10   | 38     
[1] 0.07894737

We can compute the prediction region for the two-input model and display it as follows:

prob.model1.region <- neuralnet::compute(model1, predict.region[,1:2])
predict.model1.region = max.col(prob.model1.region$net.result)
plot.title=paste("Prediction region for ANN with structure = ", 
                 list(model.structure)[1], sep="")
plot(predict.region[,1], predict.region[,2], main=plot.title, xlim=xlimit, ylim=ylimit,
     xlab="PC1", ylab="PC2", col=predict.model1.region+1, pch="+", cex=0.4)
points(dat.train$PC1, dat.train$PC2, cex=1.2, col=Y.train+1)
points(dat.test$PC1, dat.test$PC2, pch=17, cex=1.5, col=Y.test+1)

Notice the complex decision boundary.

Since the error function we seek to minimize (e.g., SSRes) is non-convex, it is possible for the ANN to get stuck in a local minimum rather than converge to the global minimum. We can run the model a number of times (say, 50 replicates) and then find the average prediction. We do so here, and investigate the mean, median, and maximum number of misclassifications, as follows:

model.structure = c(10,10)
n.j = 50
conf.train.vector = conf.test.vector = NULL 

for (j in 1:n.j){
    model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ PC1 + PC2, data = dat.train, 
                                   hidden = model.structure, err.fct = "ce", 
                                   linear.output = FALSE)
    prob.model1.test <- neuralnet::compute(model1, PC.test[,1:2])
    predict.model1.test = max.col(prob.model1.test$net.result)
    conf.test = confusion.expand(predict.model1.test, Y.test)
    conf.test.vector = c(conf.test.vector, attributes(conf.test)$error)
}

# Convert proportion of misclassifications to a number of instances
conf.test.vector = round(conf.test.vector*length(Y.test)) 

print(paste("Summary of number of misclassifications in testing data out of", n.j, "trials", sep=" "))
round(summary(conf.test.vector), digits=2) 
[1] "Summary of number of misclassifications in testing data out of 50 trials"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.00    3.00    2.92    3.00    4.00 

Note: this code may produce an error indicating that the ANN has trouble converging (especially for simpler models). If this happens, simply re-run the code or change the seed.

Now, we build artificial neural networks for the PCA-reduced 6-input dataset. We try 11 different topologies: no hidden layer; 1 hidden layer with 2, 6, 10, or 30 nodes; 2 hidden layers with (6,6), (10,10), or (30,30) nodes, and 3 hidden layers with (6,6,6), (10,10,10), or (30,30,30) nodes.

For each of these 11 topologies, we replicate the process 25 times. Note that prediction regions are not computed, as the input space now has more than two dimensions.

model.structure = list(0,                              # no hidden layer
                       2, 6, 10, 30,                   # 1 hidden layer
                       rep(6,2), rep(10,2), rep(30,2), # 2 hidden layers
                       rep(6,3), rep(10,3), rep(30,3)) # 3 hidden layers

results = NULL
n.loop = length(model.structure)
n.j = 25

for (i in 1:n.loop){
  conf.train.vector = conf.test.vector = NULL
  for (j in 1:n.j){
    model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, 
                                   data = dat.train, hidden = model.structure[[i]],
                                   err.fct = "ce", linear.output = FALSE)
    prob.model1.test <- neuralnet::compute(model1, PC.test[,1:6])
    predict.model1.test = max.col(prob.model1.test$net.result)
    conf.test = confusion.expand(predict.model1.test, Y.test)
    conf.test.vector = c(conf.test.vector, attributes(conf.test)$error)
  }
  results[[i]] = summary(round(conf.test.vector*length(Y.test)))
}

results = as.data.frame(dplyr::bind_rows(results, .id = "column_label"))
colnames(results) <- c("hidden", "min", "Q1", "med", "mean", "Q3", "max")
results$hidden <- model.structure
hidden min Q1 med mean Q3 max
0 2 2 2 2.00 2 2
2 1 2 2 1.92 2 3
6 1 2 2 1.92 2 2
10 1 2 2 1.96 2 2
30 2 2 2 2.00 2 2
6, 6 1 2 2 1.96 2 2
10, 10 1 2 2 1.88 2 2
30, 30 2 2 2 2.00 2 2
6, 6, 6 1 2 2 1.88 2 2
10, 10, 10 1 2 2 1.88 2 2
30, 30, 30 1 2 2 1.96 2 2

We can repeat this process once more, using all 12 PC.

model.structure = list(0,                              # no hidden layer
                       2, 6, 10, 30,                   # 1 hidden layer
                       rep(6,2), rep(10,2), rep(30,2), # 2 hidden layers
                       rep(6,3), rep(10,3), rep(30,3)) # 3 hidden layers

results = NULL
n.loop = length(model.structure)
n.j = 25

for (i in 1:n.loop){
  conf.train.vector = conf.test.vector = NULL
  for (j in 1:n.j){
    model1 <- neuralnet::neuralnet(C1 + C2 + C3 ~ ., 
                                   data = dat.train, hidden = model.structure[[i]],
                                   err.fct = "ce", linear.output = FALSE)
    prob.model1.test <- neuralnet::compute(model1, PC.test[,1:12])
    predict.model1.test = max.col(prob.model1.test$net.result)
    conf.test = confusion.expand(predict.model1.test, Y.test)
    conf.test.vector = c(conf.test.vector, attributes(conf.test)$error)
  }
  results[[i]] = summary(round(conf.test.vector*length(Y.test)))
}

results = as.data.frame(dplyr::bind_rows(results, .id = "column_label"))
colnames(results) <- c("hidden", "min", "Q1", "med", "mean", "Q3", "max")
results$hidden <- model.structure
hidden min Q1 med mean Q3 max
0 2 2 2 2.00 2 2
2 1 1 1 1.80 3 4
6 1 2 2 2.04 2 3
10 1 2 2 1.96 2 2
30 2 2 2 2.00 2 2
6, 6 1 2 2 2.00 2 3
10, 10 1 2 2 1.92 2 3
30, 30 1 2 2 1.84 2 2
6, 6, 6 1 1 2 1.76 2 3
10, 10, 10 1 2 2 1.84 2 2
30, 30, 30 1 1 2 1.68 2 2

Comparing the mean number of misclassifications, what can we conclude?

13.4.4 Naïve Bayes Classifiers

In classical statistics, model parameters (such as \(\mu\) and \(\sigma\), say) are treated as constants; Bayesian statistics, on the other hand, treats model parameters as random variables.

Bayes’ Theorem (see Section 3.1.7) lies at the foundation of such statistics: \[P(\text{hypothesis}\mid \text{data})=\frac{P(\text{data}\mid \text{hypothesis})\times P(\text{hypothesis})}{P(\text{data})},\] or simply \[P(H\mid D)=\frac{P(D\mid H)\times P(H)}{P(D)}.\] This is sometimes written in shorthand as \(P(H\mid D) \propto P(D\mid H)\times P(H)\); in other words, our degree of belief in a hypothesis should be updated by the evidence provided by the data.246 We will discuss Bayesian data analysis in depth in the Bayesian Data Analysis module.

Naïve Bayes Classification for Tumor Diagnoses

We illustrate one approach to classification which, in spite of its name, is not a Bayesian method in the strict sense: the naïve Bayes classifier (NBC). Suppose we are interested in diagnosing whether a tumor is benign or malignant, based on several measurements obtained from video imaging.

Bayes’ Theorem can be recast as \[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \propto \text{likelihood} \times \text{prior}, \] where

  • posterior: \(P(H\mid D)=\) based on collected data, how likely is a given tumor to be benign (or malignant)?

  • prior: \(P(H)=\) in what proportion are tumors benign (or malignant) in general?

  • likelihood: \(P(D\mid H)=\) knowing a tumor is benign (or malignant), how likely is it that these particular measurements would have been observed?

  • evidence: \(P(D)=\) regardless of a tumor being benign or malignant, what is the chance that a tumor has the observed characteristics?
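As a quick numerical illustration of how these pieces fit together (all probabilities below are made up for the example, not estimated from the dataset):

```r
# Toy application of Bayes' Theorem with made-up probabilities.
prior.mal <- 0.3; prior.ben <- 0.7     # hypothetical priors P(H)
lik.mal   <- 0.8; lik.ben   <- 0.1     # hypothetical likelihoods P(D | H)

evidence  <- lik.mal*prior.mal + lik.ben*prior.ben   # P(D)
post.mal  <- lik.mal*prior.mal / evidence            # P(malignant | D)
post.ben  <- lik.ben*prior.ben / evidence            # P(benign | D)
```

With these numbers, `post.mal` is about 0.77, so the posterior ratio exceeds 1 and the tumor would be diagnosed as malignant.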

The NBC procedure is straightforward.

  1. Objective function: a simple way to determine whether a tumor is benign or malignant is to compare posterior probabilities and choose the one with highest probability. That is, we diagnose a tumor as malignant if \[\frac{P(\textrm{malignant}\mid D)}{P(\textrm{benign}\mid D)}=\frac{P(D\mid \textrm{malignant})\times P(\textrm{malignant})}{P(D\mid \textrm{benign})\times P(\textrm{benign})}>1,\] and as benign otherwise.

  2. Dataset: there are \(n=699\) observations (from the [in]famous Wisconsin Breast Cancer dataset) with nine predictor measurements, each scored on a scale of 1 to 10, with a score of 0 reserved for missing values. The predictors include items such as Clump_Thickness and Bare_Nuclei; the categorical response variable is the Class (Benign, Malignant).

dat.BC = read.csv("Data/BreastCancer-Wisconsin.csv", header=TRUE, stringsAsFactors = TRUE)
str(dat.BC)
'data.frame':   699 obs. of  10 variables:
 $ Clump_Thickness            : int  5 5 3 6 4 8 1 2 2 4 ...
 $ Uniformity_of_Cell_Size    : int  1 4 1 8 1 10 1 1 1 2 ...
 $ Uniformity_of_Cell_Shape   : int  1 4 1 8 1 10 1 2 1 1 ...
 $ Marginal_Adhesion          : int  1 5 1 1 3 8 1 1 1 1 ...
 $ Single_Epithelial_Cell_Size: int  2 7 2 3 2 7 2 2 2 2 ...
 $ Bare_Nuclei                : int  1 10 2 4 1 10 10 1 1 1 ...
 $ Bland_Chromatin            : int  3 3 3 3 3 9 3 3 1 2 ...
 $ Normal_Nucleoli            : int  1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses                    : int  1 1 1 1 1 1 1 1 5 1 ...
 $ Class                      : Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 2 1 1 1 1 ...

In table layout, the first 6 observations look like:

Clump_Thickness Uniformity_of_Cell_Size Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class
5 1 1 1 2 1 3 1 1 Benign
5 4 4 5 7 10 3 2 1 Benign
3 1 1 1 2 2 3 1 1 Benign
6 8 8 1 3 4 3 7 1 Benign
4 1 1 3 2 1 3 1 1 Benign
8 10 10 8 7 10 9 7 1 Malignant

We create a training/testing split of the data by randomly selecting roughly 80% of the observations for training and reserving the remaining 20% for testing.

set.seed(1234) # for reproducibility
ind = 1:nrow(dat.BC)
prop.train = 0.8 
n.train = floor(nrow(dat.BC)*prop.train)

# indices of training observations
loc.train = sort(sample(ind, n.train, replace=FALSE))
# indices of testing observations
loc.test = ind[-which(ind%in%loc.train)] 
# training data
dat.BC.train = dat.BC[loc.train,] 
# test data
dat.BC.test = dat.BC[loc.test,]

We separate the Benign and Malignant subsets of the training data for graphing purposes.

location.Benign = which(dat.BC.train$Class=="Benign")
location.Malignant = which(!(1:nrow(dat.BC.train) %in% location.Benign))
cols_remove = c("Class")
dat.Benign=dat.BC.train[location.Benign,!colnames(dat.BC) %in% cols_remove] 
dat.Malignant=dat.BC.train[location.Malignant,!colnames(dat.BC) %in% cols_remove] 

The boxplots of the training measurements are shown below.

dat.BC2 = reshape2::melt(dat.BC.train, id.var="Class")

ggplot2::ggplot(data = dat.BC2, ggplot2::aes(x=variable, y=value)) + 
  ggplot2::geom_boxplot(ggplot2::aes(fill=Class)) + 
  ggplot2::scale_y_discrete(limits = 1:10) + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1))

From these plots, we learn that benign tumors have lower scores on average, while malignant tumors have much higher scores and much greater variability.

In what follows, we treat the test observations as undiagnosed cases.

  3. Assumptions: we assume that the scores of each measurement in a class are independent of one another (hence the naive qualifier); this assumption reduces the likelihood computation to \[\begin{aligned} P(H\mid D)&=P(H\mid x_{1},x_{2},\cdots,x_{9}) \propto P( x_{1},x_{2},\cdots,x_{9}\mid H) \times P(H) \\ & =P(x_{1}\mid H)\times \cdots\times P(x_9\mid H)\times P(H). \end{aligned}\]

Note that this assumption of conditional independence is not usually satisfied.
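Schematically, the naïve factorization turns the joint likelihood into a product of per-predictor lookups; in the Python sketch below, the lookup table p and the score signature are invented (three predictors instead of nine, to keep it short):

```python
import math

# p[i][s] plays the role of P(x_i = s | H) for one class;
# all numbers here are invented for illustration.
p = [
    {1: 0.33, 2: 0.10},   # predictor 1: P(score | class)
    {1: 0.84, 2: 0.08},   # predictor 2
    {1: 0.76, 3: 0.08},   # predictor 3
]
signature = [1, 1, 3]     # observed scores, one per predictor

# Naive assumption: the class likelihood is the product of the lookups.
naive_likelihood = math.prod(p[i][s] for i, s in enumerate(signature))
```

Without the independence assumption, we would instead need an estimate of the full joint distribution of the nine scores, a table with \(10^9\) cells per class, which is hopeless with 699 observations.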

  4. Prior distribution: we can ask subject matter experts to provide a rough estimate for the general ratio of benign to malignant tumors, or use the proportion of benign tumors in the sample as our prior. In situations where we have no prior knowledge about the class distribution, we may simply assume a non-informative prior (in this case, both classes would be assigned the same prior probability).

In this example, we will assume that the training data represents the tumor population adequately, and we use the observed proportions as estimated prior probabilities.

n.Benign.train = nrow(dat.Benign)
n.Malignant.train = nrow(dat.Malignant)
(prior.Benign = n.Benign.train/(n.Benign.train + n.Malignant.train))
(prior.Malignant = 1 - prior.Benign)
[1] 0.6529517
[1] 0.3470483
  5. Computation of likelihoods: under conditional independence, each measurement is assumed to follow a multinomial distribution (since scores are provided on a \(1\) to \(10\) scale): for each predictor, for each class, we must estimate \(p_1, \ldots, p_{10}\), with \(p_1+\cdots+p_{10}=1\).247 The best estimates are thus \[\hat{p}_{i,\text{pred}}=\frac{\text{# of training cells in the class with pred score $i$}}{\text{# of training cells in the class}},\quad i=1,\ldots, 10.\]
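Numerically, the estimator is just a table of relative frequencies; here is a small Python sketch for one predictor within one class, with invented scores:

```python
from collections import Counter

# Invented scores for one predictor within one class.
scores = [1, 1, 2, 1, 3, 1, 2, 1, 1, 1]
counts = Counter(scores)
n = len(scores)

# MLE of p_i: the observed relative frequency of score i, i = 1, ..., 10.
p_hat = {i: counts.get(i, 0) / n for i in range(1, 11)}

assert abs(sum(p_hat.values()) - 1) < 1e-12   # the estimates sum to 1
```

Scores never observed in the class (here, 4 through 10) are estimated at probability 0, which is the source of the misclassification problem discussed at the end of the section.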

This is done in the code below; note that count.xyz is a count matrix, while freq.xyz is a frequency matrix.

# Benign cells
count.Benign = freq.Benign = NULL
for (i in 1:(ncol(dat.BC.train)-1)){
  test.count = table(c(dat.Benign[,i],0:10))-1 # pad with 0:10, then subtract 1, so all 11 levels appear
  test.freq = test.count/sum(test.count)
  count.Benign = cbind(count.Benign, test.count)
  freq.Benign = cbind(freq.Benign, test.freq)
}
colnames(count.Benign) = colnames(freq.Benign) = colnames(dat.Benign)
p.Benign = freq.Benign
p.Benign[1,] = 1 # score 0 (missing value) contributes a factor of 1 to the product

# Malignant cells
count.Malignant = freq.Malignant = NULL
for (i in 1:(ncol(dat.BC.train)-1)){
  test.count = table(c(dat.Malignant[,i],0:10))-1
  test.freq = test.count/sum(test.count)
  count.Malignant = cbind(count.Malignant, test.count)
  freq.Malignant = cbind(freq.Malignant, test.freq)
}
colnames(count.Malignant) = colnames(freq.Malignant) = colnames(dat.Malignant)
p.Malignant = freq.Malignant
p.Malignant[1,] = 1 # score 0 (missing value) contributes a factor of 1 to the product

These are then the best estimates for the multinomial parameters, for the benign tumors:

Clump_Thickness Uniformity_of_Cell_Size Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses
0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
1 0.3342466 0.8356164 0.7616438 0.8273973 0.1123288 0.8383562 0.3397260 0.8931507 0.9780822
2 0.0986301 0.0821918 0.1178082 0.0767123 0.7863014 0.0465753 0.3589041 0.0575342 0.0136986
3 0.2000000 0.0575342 0.0821918 0.0630137 0.0630137 0.0356164 0.2657534 0.0219178 0.0027397
4 0.1315068 0.0109589 0.0219178 0.0109589 0.0164384 0.0136986 0.0136986 0.0027397 0.0000000
5 0.1945205 0.0000000 0.0027397 0.0054795 0.0109589 0.0219178 0.0082192 0.0027397 0.0000000
6 0.0356164 0.0054795 0.0054795 0.0109589 0.0027397 0.0000000 0.0027397 0.0054795 0.0000000
7 0.0027397 0.0027397 0.0054795 0.0000000 0.0054795 0.0000000 0.0109589 0.0054795 0.0027397
8 0.0027397 0.0027397 0.0027397 0.0000000 0.0027397 0.0054795 0.0000000 0.0082192 0.0027397
9 0.0000000 0.0027397 0.0000000 0.0027397 0.0000000 0.0000000 0.0000000 0.0027397 0.0000000
10 0.0000000 0.0000000 0.0000000 0.0027397 0.0000000 0.0054795 0.0000000 0.0000000 0.0000000

and the malignant tumors:

Clump_Thickness Uniformity_of_Cell_Size Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses
0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
1 0.0154639 0.0103093 0.0103093 0.1494845 0.0051546 0.0463918 0.0051546 0.1752577 0.5257732
2 0.0154639 0.0360825 0.0257732 0.0979381 0.0927835 0.0257732 0.0309278 0.0309278 0.1082474
3 0.0567010 0.1134021 0.0979381 0.1030928 0.2010309 0.0721649 0.1546392 0.1391753 0.1340206
4 0.0567010 0.1391753 0.1546392 0.1185567 0.1391753 0.0567010 0.1185567 0.0773196 0.0515464
5 0.1649485 0.1185567 0.1185567 0.0721649 0.1494845 0.0721649 0.1340206 0.0773196 0.0257732
6 0.0618557 0.1030928 0.1030928 0.0721649 0.1546392 0.0206186 0.0360825 0.0721649 0.0154639
7 0.0927835 0.0824742 0.1082474 0.0515464 0.0360825 0.0309278 0.2783505 0.0463918 0.0412371
8 0.1701031 0.1134021 0.1082474 0.0824742 0.0824742 0.0876289 0.1082474 0.0773196 0.0309278
9 0.0618557 0.0154639 0.0360825 0.0206186 0.0103093 0.0360825 0.0463918 0.0412371 0.0000000
10 0.3041237 0.2680412 0.2371134 0.2319588 0.1288660 0.5463918 0.0876289 0.2628866 0.0670103
For instance, the relative frequencies of the Marginal_Adhesion scores can be displayed for each class:

barplot(freq.Benign[,"Marginal_Adhesion"], 
        xlab = "Score", 
        ylab = "Relative Frequency", 
        main = "Marginal Adhesion Scores - Benign Cells")

barplot(freq.Malignant[,"Marginal_Adhesion"], 
        xlab = "Score", 
        ylab = "Relative Frequency", 
        main = "Marginal Adhesion Scores - Malignant Cells")

Multiplying the probabilities across predictors, using the multinomial distributions of each class in turn, provides the overall likelihood for benign and for malignant tumors, respectively.

For instance, if the signature of an undiagnosed case was \[(1,1,3,1,0,2,3,2,2),\] then we would multiply the probabilities corresponding to each score across predictors, once assuming that the cell was benign, and once assuming it was malignant:

\[\begin{aligned} P(D\mid H)&=P(\mathbf{x}=(1,1,3,1,0,2,3,2,2) \mid H) \\ &=P(x_1=1\mid H)\times P(x_2=1\mid H)\times P(x_3=3\mid H)\times P(x_4=1\mid H)\times P(x_5=0\mid H) \\ & \qquad \times P(x_6=2\mid H)\times P(x_7=3\mid H)\times P(x_8=2\mid H)\times P(x_9=2\mid H). \end{aligned}\] We can extract the signature vector of probabilities in both cases as follows:

x = c(1,1,3,1,0,2,3,2,2) + 1
y = c(1,2,3,4,5,6,7,8,9)

For the benign class, we have:

p.Benign[as.matrix(data.frame(x,y))]
[1] 0.33424658 0.83561644 0.08219178 0.82739726 1.00000000 0.04657534 0.26575342
[8] 0.05753425 0.01369863
(l.Benign = prod(p.Benign[as.matrix(data.frame(x,y))]))
[1] 1.852912e-07

For the malignant class, we have:

p.Malignant[as.matrix(data.frame(x,y))]
[1] 0.01546392 0.01030928 0.09793814 0.14948454 1.00000000 0.02577320 0.15463918
[8] 0.03092784 0.10824742
(l.Malignant = prod(p.Malignant[as.matrix(data.frame(x,y))]))
[1] 3.114231e-11

Based on the multinomial probabilities given in p.Benign and p.Malignant, the (naïve) likelihood of the undiagnosed case being a benign tumor would thus be \(1.85 \times 10^{-7}\), while the likelihood of it being a malignant tumor would be \(3.11\times 10^{-11}\).

Be careful: these are the likelihoods, not the posteriors!

  6. Computation of posterior: multiplying the corresponding prior probabilities and likelihoods, we get a quantity that is proportional to the respective posterior probabilities: \[ P(H \mid \mathbf{x}) \propto P(H) \times P(\mathbf{x} \mid H)\approx P(H) \times \prod_{i=1}^9 P(x_i\mid H).\]
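Since both unnormalized quantities share the same evidence denominator \(P(\mathbf{x})\), the proportionality constant can be recovered by normalizing them to sum to 1; the Python sketch below uses the magnitudes obtained for the undiagnosed case (note, however, that such nominal posteriors inherit the calibration caveats raised at the end of the section):

```python
# Unnormalized posteriors (prior * likelihood) for the undiagnosed case;
# the two magnitudes are taken from the computations in this section.
unnorm = {"Benign": 1.209862e-07, "Malignant": 1.080789e-11}

evidence = sum(unnorm.values())  # plays the role of P(x)
posterior = {h: v / evidence for h, v in unnorm.items()}
# posterior["Benign"] comes out above 0.9999
```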

The “likelihoods” can be computed as follows:

test.Benign = test.Malignant = NULL
likelihood.Benign = likelihood.Malignant = NULL

for (i in 1:nrow(dat.BC.test)){
  location = rapply(dat.BC.test[i,-10]+1,c) # shift scores by 1 to index the probability matrices
  for(j in 1:length(location)){
    test.Benign[j] = p.Benign[location[j],j]
    test.Malignant[j] = p.Malignant[location[j],j]
  }
  likelihood.Benign.i = prod(test.Benign)
  likelihood.Malignant.i = prod(test.Malignant)
  likelihood.Benign = c(likelihood.Benign, likelihood.Benign.i)
  likelihood.Malignant = c(likelihood.Malignant, likelihood.Malignant.i)
}

The “posteriors” can then be computed as follows:

posteriors=cbind(likelihood.Benign*prior.Benign,  likelihood.Malignant*prior.Malignant)

For the undiagnosed case \(\mathbf{x}=(1,1,3,1,0,2,3,2,2)\), we obtain:

[1] 1.209862e-07
[1] 1.080789e-11

Comparing the posteriors \[P(\text{Malignant}\mid \mathbf{x})<P(\text{Benign}\mid \mathbf{x}),\] we conclude that the tumor in the undiagnosed case is likely benign (note that we have no measurement on how much more likely it is to be benign than to be malignant – the classifier is not calibrated).
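One practical caveat before completing the procedure: a product of nine probabilities, many of them small, can underflow in floating-point arithmetic. A common remedy, not used in the R code in this section but worth knowing, is to compare log-posteriors instead; here is a Python sketch using the magnitudes from the undiagnosed case:

```python
import math

# Comparing log-posteriors (log prior + log likelihood) avoids
# multiplying many small probabilities; the values below echo
# the priors and likelihoods computed earlier in the section.
log_post_benign = math.log(0.6529517) + math.log(1.852912e-07)
log_post_malignant = math.log(0.3470483) + math.log(3.114231e-11)

label = "Benign" if log_post_benign > log_post_malignant else "Malignant"
```

The decision is unchanged because the logarithm is monotone; only the arithmetic becomes safer.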

We can complete the procedure for all observations in the test set:

n.test = nrow(dat.BC.test)
prediction = rep(NA, n.test)
for (i in 1:n.test){
  prediction[i] = ifelse(posteriors[i,1] > posteriors[i,2], "Benign", "Malignant")
}
table(prediction)
   Benign Malignant 
       85        55 

Since we actually know the true outcome for the test subjects, we can take a look at the NBC’s performance on the data.

table(dat.BC.test$Class, prediction)
            Benign Malignant
  Benign        85         8
  Malignant      0        47

Let’s take a look at cases where NBC misclassified, and their corresponding posteriors:

dat.misclassified = dat.BC.test[which(dat.BC.test$Class!=prediction),]
missed.class = prediction[which(dat.BC.test$Class!=prediction)]
wrong.classifications = cbind(posteriors[which(dat.BC.test$Class!=prediction),])
colnames(wrong.classifications) = c("Posterior.Benign","Posterior.Malignant")
wrong.classifications = cbind(dat.misclassified, wrong.classifications)
wrong.classifications
Clump_Thickness Uniformity_of_Cell_Size Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class Posterior.Benign Posterior.Malignant
9 2 1 1 1 2 1 1 1 5 Benign 0 0
130 1 1 1 1 10 1 1 1 1 Benign 0 0
197 8 4 4 5 4 7 7 8 2 Benign 0 0
233 8 4 6 3 3 1 4 3 1 Benign 0 0
253 6 3 3 5 3 10 3 5 3 Benign 0 0
320 4 4 4 4 6 5 7 3 1 Benign 0 0
353 3 4 5 3 7 3 4 6 1 Benign 0 0
658 5 4 5 1 8 1 3 6 1 Benign 0 0

The confusion matrix tells us that 8 of the 140 test cases (5.7%) are misclassified. A closer look at the misclassified cases reveals that 3 of the 8 false positives result from a posterior probability of 0 (a score level that was not observed in the training set for that class). Taking a closer look at ID 130, for instance, all measurements but Single_Epithelial_Cell_Size have scores of 1, strongly indicating that the tumor is likely benign (perhaps the 10 is a typo?).

Can we prevent misclassifications similar to that of ID 130? One way is to replace all zero probabilities in the likelihood matrices with a small \(\varepsilon\) (obtained by multiplying a base probability by the smallest non-zero probability), and to re-scale each column so that it still adds up to 1 (excluding the missing-value row from the process). If \(\varepsilon\) is small enough, the larger probabilities are not affected in practice.

Using a base probability of \(10^{-8}\), for instance, would reduce the misclassification rate on the test data to 6/140 (4.3%).
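The zero-probability fix described above can be sketched in a few lines; the column of per-score probabilities below is invented, but the base probability of \(10^{-8}\) matches the one used above:

```python
# Invented probability column for one predictor within one class
# (truncated to 5 score levels to keep the sketch short).
col = [0.7, 0.2, 0.1, 0.0, 0.0]

# Epsilon: a base probability times the smallest non-zero probability.
base = 1e-8
eps = base * min(p for p in col if p > 0)

# Replace the zeros, then re-scale the column to sum to 1.
smoothed = [p if p > 0 else eps for p in col]
total = sum(smoothed)
smoothed = [p / total for p in smoothed]

assert min(smoothed) > 0   # no score level can zero out a posterior anymore
```

Because \(\varepsilon\) is tiny relative to the observed frequencies, the non-zero probabilities are essentially unchanged, so only the previously impossible score levels are affected.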

Notes and Comments

In practice, various prior distributions or conditional distributions (for the features) can be used; subject matter expertise can come in handy during these steps:

  • the naive assumption is made out of convenience, as it renders the computation of the likelihood much simpler;

  • the variables in a dataset are typically not independent of one another, but NBC still (usually) performs well on test data – the method seems to be robust against departures from the independence assumption;

  • dependency among variables may change the true posterior values, but the class with maximum posterior probabilities is often unchanged;

  • in the classification context, we typically get more insight from independent features than from correlated ones;

  • NBC works best when the features are independent, but optimality can also be reached when the dependencies among variables do not consistently support one class over another;

  • the choice of a prior may have a great effect on the classification predictions, as can the presence of outlying observations, especially when the training set \(\text{Tr}\) is small;

  • it is not practical to conduct NBC manually, as we have done above – a complete implementation can be called via the naiveBayes() method from the R package e1071 (make sure to read the documentation first!), and

  • a repeated reminder that, like the SVM models, NBC is not calibrated and should not be used to estimate probabilities.


T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models. McGraw Hill Irwin, 2004.
D. R. Hofstadter, Gödel, Escher, Bach: an Eternal Golden Braid. New York, NY: Basic Books, 1979.
F. Provost and T. Fawcett, Data Science for Business. O’Reilly, 2015.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
F. Chollet, Deep Learning with Python, 1st ed. USA: Manning Publications Co., 2017.
C. Sheppard, Tree-Based Machine Learning Algorithms: Decision Trees, Random Forests, and Boosting. CreateSpace Independent Publishing Platform, 2017.
T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
N. Deng, Y. Tian, and C. Zhang, Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions. Chapman & Hall/CRC Press, 2013.
3Blue1Brown, “Deep Learning.”
A. Turing, “Computing machinery and intelligence,” Mind, 1950.
Wikipedia, “Artificial intelligence,” 2020.
M. Caudill, “Neural networks primer, part 1,” AI Expert, vol. 2, no. 12, pp. 46–52, Dec. 1987.
Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.” 2014.
R. S. Sutton, “Two problems with backpropagation and other steepest-descent learning procedures for networks,” 1986.
S. Ruder, “An overview of gradient descent optimization algorithms.” 2016.
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” ICLR, 2015.