## 12.1 Statistical Learning

**Statistical learning** is a collection of procedures and approaches that allow analysts to tackle problems such as:

- identifying risk factors associated with breast/prostate cancer;
- predicting whether a patient will have a second, fatal heart attack within 30 days of the first, on the basis of demographics, diet, clinical measurements, etc.;
- establishing the relationship between salary and demographic information in population survey data;
- predicting the yearly inflation rate using various indicators, etc.

Statistical learning tasks are typically divided into two main classes:
**supervised learning** and **unsupervised learning**.^{186}

### 12.1.1 Supervised Learning Framework

In the supervised learning environment, the **outcome** (response, target, dependent variable, etc.) is denoted by \(Y\), and the vector of \(p\) **predictors** (features) by \(\vec{X}=(X_1,\ldots,X_p)\).

If \(Y\) is quantitative (price, height, etc.), then the problem of predicting \(Y\) in terms of \(\vec{X}\) is a **regression** task; if \(Y\) takes on values in a finite unordered set (survived/died, colours, vegetation types, etc.), it is a **classification** task. This is typically achieved with the use of **training data**, that is, historical observations or instances, which we often denote by \([\mathbf{X}\mid \mathbf{Y}]\) (the column denoting the observation IDs is dropped).

| obs. | predictors | | | resp. |
|---|---|---|---|---|
| \(1\) | \(x_{1,1}\) | \(\cdots\) | \(x_{1,p}\) | \(y_1\) |
| \(\vdots\) | \(\vdots\) | | \(\vdots\) | \(\vdots\) |
| \(N\) | \(x_{N,1}\) | \(\cdots\) | \(x_{N,p}\) | \(y_N\) |

The objectives of supervised learning are usually to:

- accurately predict **unseen** test cases;
- understand which inputs affect the outcomes (if any), and how, and/or
- assess the quality of predictions and/or inferences made on the basis of the training data.

In unsupervised learning, by contrast, there is no outcome variable, only features measured on a set of observations \(\mathbf{X}\).^{187}

The objectives are much more **vague** – analysts could seek to:

- find sets of features (variables) that behave similarly across observations;
- find combinations of features with the most variation;
- find natural groups of similar observations, etc.

We will discuss such methods in detail in Module 14.

#### Statistical Learning vs. Machine Learning

The term “statistical learning” is not used frequently in practice (except by mathematicians and statisticians, perhaps); the current tendency is to speak of **machine learning**. If a distinction must be made, we could argue that:

- statistical learning arises from statistical models, with the emphasis usually placed on **interpretability**, **precision**, and **uncertainty**, whereas
- machine learning arises from artificial intelligence studies, with the emphasis on **large-scale applications** and **prediction accuracy**.

The dividing line between the terms is blurry – the vocabulary used by practitioners mostly betrays their educational backgrounds (but see [135] for another take on this).

#### Motivating Example

Throughout, we will illustrate the concepts and notions *via* the `Gapminder` dataset, which records socio-demographic information relating to the planet’s nations, ranging from 1960 to 2011 [237], [238].

We will be interested in determining if there is a link between life expectancy, at various moments in time, and the rest of the predictors.

The dataset contains 7139 observations of 9 columns:

- a country \(\times\) year identifier (2 variables, \(i\) and \(X_1\));
- a region and continent pair of categorical predictors (2 variables, \(X_2\) and \(X_3\));
- four numerical predictors: population \(X_4\), infant mortality \(X_5\), fertility \(X_6\), and gross domestic product in 1999 dollars \(X_7\), and
- life expectancy \(Y\), the numerical response.

```
# read in the Gapminder data, converting character columns to factors
gapminder.ML <- read.csv("Data/gapminder.csv", stringsAsFactors=TRUE)
# keep only the observations without missing values
gapminder.ML <- gapminder.ML[complete.cases(gapminder.ML),]
# retain the identifiers, the categorical predictors, the numerical
# predictors, and the response, in that order
gapminder.ML <- gapminder.ML[,c("country","year","region","continent",
                                "population","infant_mortality","fertility","gdp",
                                "life_expectancy")]
```

The structure and summary are provided below:

```
'data.frame': 7139 obs. of 9 variables:
$ country : Factor w/ 185 levels "Albania","Algeria",..: 2 5 8 9 11 13 14 16 18 20 ...
$ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ region : Factor w/ 22 levels "Australia and New Zealand",..: 11 15 1 22 2 18 2 22 20 15 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 5 4 2 3 2 4 1 2 ...
$ population : int 11124892 20619075 10292328 7065525 109526 48200702 230934 9140563 2431620 3693451 ...
$ infant_mortality: num 148.2 59.9 20.3 37.3 51 ...
$ fertility : num 7.65 3.11 3.45 2.7 4.5 6.73 4.33 2.6 6.28 6.7 ...
$ gdp : num 1.38e+10 1.08e+11 9.67e+10 5.24e+10 1.31e+09 ...
$ life_expectancy : num 47.5 65.4 70.9 68.8 62 ...
country year region continent
Algeria : 52 Min. :1960 Western Africa : 749 Africa :2203
Argentina : 52 1st Qu.:1978 Eastern Africa : 637 Americas:1546
Australia : 52 Median :1991 South America : 601 Asia :1598
Austria : 52 Mean :1989 Western Asia : 560 Europe :1418
Bangladesh: 52 3rd Qu.:2001 Caribbean : 432 Oceania : 374
Belgium : 52 Max. :2011 Central America: 409
(Other) :6827 (Other) :3751
population infant_mortality fertility gdp
Min. :4.154e+04 Min. : 1.80 Min. :1.130 Min. :4.040e+07
1st Qu.:2.115e+06 1st Qu.: 16.00 1st Qu.:2.160 1st Qu.:1.974e+09
Median :6.531e+06 Median : 41.70 Median :3.720 Median :8.249e+09
Mean :3.236e+07 Mean : 54.88 Mean :4.045 Mean :1.556e+11
3rd Qu.:2.003e+07 3rd Qu.: 85.40 3rd Qu.:5.900 3rd Qu.:5.843e+10
Max. :1.348e+09 Max. :237.20 Max. :8.670 Max. :1.174e+13
life_expectancy
Min. :13.20
1st Qu.:56.90
Median :67.58
Mean :64.83
3rd Qu.:73.20
Max. :82.90
```

In other words, we will be looking for **models** of the form \[Y=f(X_1,\ldots,X_7)+\varepsilon\equiv f(\vec{X})+\varepsilon,\] where \(f\) is the **systematic component of \(Y\) explained by \(X\)**, and \(\varepsilon\) is the **random error term**, which accounts for measurement errors and other discrepancies.

### 12.1.2 Systematic Component and Regression Function

It is the systematic component that is used for **predictions** and **inferences**. As long as \(f\) is “good”, we can:

- make predictions for the response \(Y\) at new points \(\vec{X}=\mathbf{x}\);
- understand which features of \(\vec{X}=(X_1,\ldots, X_p)\) are important to explain the variation in \(Y\), and
- depending on the complexity of \(f\), understand the effect of each feature \(X_j\) on \(Y\).

Imagine a model with one predictor \(X\) and a target \(Y\), with systematic component \(f\), so that \[Y=f(X)+\varepsilon.\]

For instance, consider the following subset of the Gapminder dataset.

```
attach(gapminder.ML)
# subsample every 10th observation; the predictor X is the log-ratio of
# fertility to infant mortality, the response Y is life expectancy
ind <- 10*(1:730)
x <- log(fertility[ind]/infant_mortality[ind])
y <- life_expectancy[ind]
# drop pairs with missing values (indices past the last row yield NAs)
ok <- !is.na(x) & !is.na(y)
x <- x[ok]; y <- y[ok]
plot(x,y,ylab="response Y", xlab="predictor X", pch=20)
```

What is the ideal \(f\) in this case, and how can we find it? What would be a good value of \(f(-2)\), say?

Ideally, we would like to have \(f(-2)=\text{E}[Y\mid X=-2]\).^{188} For any \(x\) in the range of \(X\), the function \[f(x)=\text{E}[Y\mid X=x]\] is the **regression function of \(Y\) on \(X\)**. In the general setting with \(p\) predictors, the regression function is \[ f(\mathbf{x})=f(x_1,\ldots,x_p)=\text{E}[Y\mid X_1=x_1,\ldots,X_p=x_p]=\text{E}[Y\mid \vec{X}=\mathbf{x}].\]
It is **optimal** in the sense that this regression function minimizes the average square deviation from the response variable, that is to say, \[f=\arg\min_g\big\{\text{E}[(Y-g(\vec{X}))^2\mid \vec{X}=\mathbf{x}]\big\}.\]
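To see why, write \(Y-g(\vec{X})=(Y-f(\vec{X}))+(f(\vec{X})-g(\vec{X}))\) and expand the square: conditional on \(\vec{X}=\mathbf{x}\), the cross term has expectation \(0\) (since \(\text{E}[Y\mid \vec{X}=\mathbf{x}]=f(\mathbf{x})\)), so that \[\text{E}[(Y-g(\vec{X}))^2\mid \vec{X}=\mathbf{x}]=\text{E}[(Y-f(\vec{X}))^2\mid \vec{X}=\mathbf{x}]+[f(\mathbf{x})-g(\mathbf{x})]^2,\] and the right-hand side is smallest when \(g(\mathbf{x})=f(\mathbf{x})\).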
The term \[\varepsilon=\varepsilon_{\vec{X}}=Y-f(\vec{X})\] is the **irreducible error** of the regression. Typically, \(\varepsilon_{\vec{X}}\neq 0\) for all \(\vec{X}\), since, even when \(f\) is known exactly, there will still be some uncertainty in the predictions due to some noise-generating mechanism in the “real world”.

If \(\widehat{f}\) is any estimate of the regression function \(f\),^{189} then \[\begin{aligned} \text{E}[(Y-\widehat{Y})^2\mid \vec{X}=\mathbf{x}]&=\text{E}[(f(\vec{X})+\varepsilon-\widehat{f}(\vec{X}))^2\mid \vec{X}=\mathbf{x}] \\ &=\underbrace{[f(\vec{X})-\widehat{f}(\vec{X})]^2}_{\text{reducible}}+\underbrace{\text{Var}(\varepsilon)}_{\text{irreducible}}. \end{aligned}\]
Since the **irreducible component** is not a property of the estimate \(\widehat{f}\), the objective of minimizing \(\text{E}[(Y-\widehat{Y})^2]\) can only be achieved by working through the **reducible component**. When we speak of **learning** a model, we mean that we use the training data to find an estimate \(\widehat{f}\) of \(f\) that minimizes this reducible component, in some way.

#### Estimating the Regression Function

In theory, we know that the regression function is \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}];\] in practice, however, there might be too few (or even no) observations at \(\vec{X}=\mathbf{x}\) to trust the estimate provided by the sample mean. One solution is to approximate the expectation by a **nearest neighbour average** \[\widehat{f}(\mathbf{x})=\text{Avg}\{Y\mid \vec{X}\in N(\mathbf{x})\},\] where \(N(\mathbf{x})\) is a neighbourhood of \(\mathbf{x}\).

In general, this approach works reasonably well when \(p\) is “small” (\(p\leq 4\)?) and \(N\) is “large”, but it fails when \(p\) is too large because of the **curse of dimensionality**: nearest neighbours tend to lie far from \(\mathbf{x}\) when \(p\) is large. Indeed, if \(N(\mathbf{x})\) is defined as the nearest 5% of observations to \(\mathbf{x}\), say,^{190} then we need to leave the “local” neighbourhood of \(\mathbf{x}\) to build \(N(\mathbf{x})\), which could compromise the quality of \(\widehat{f}(\mathbf{x})\) as an approximation to \(f(\mathbf{x})\).
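A minimal sketch of such a nearest neighbour estimate, using the \((x,y)\) subset constructed above (the helper `f.hat` and the default `k=20` are our own illustrative choices, with the neighbourhood taken to be the \(k\) nearest points rather than a fixed percentage of observations):

```
# nearest-neighbour average: estimate f(x0) by averaging the responses
# of the k training points whose predictor values lie closest to x0
f.hat <- function(x0, x, y, k=20) {
  nearest <- order(abs(x - x0))[1:k]  # indices of the k nearest neighbours
  mean(y[nearest])                    # average of their responses
}
f.hat(-2, x, y)  # an estimate of f(-2) = E[Y | X = -2]
```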

We provide more details in Module 15, but this is a topic about which it is worth being well-read (see [2] for a formal treatment).

The various statistical learning methods attempt to provide estimates of the regression function by minimizing the reducible component through **parametric** or **non-parametric** approaches.^{191} For instance, the **classical linear regression** approach is parametric: it assumes that the true regression function \(f\) is linear and suggests the estimate \[f_L(\mathbf{x})=\beta_0+\beta_1x_1+\cdots+\beta_px_p.\] The objective, in this case, is to learn the \(p+1\) parameters \(\beta_0,\beta_1,\ldots,\beta_p\) with the help of the training data.

In practice, this assumption almost never holds, but it often provides an **interpretable**^{192} approximation to the true regression function \(f\) (see below for an example).

```
# fit a simple linear regression of Y on X and overlay the fitted line
lin.reg <- lm(y ~ x)
plot(x,y,ylab="response Y", xlab="predictor X", pch=20)
abline(lin.reg, col="red", lwd=3)
```

As an example, if the true fit of the motivating example was \[\text{life expectancy}=f(\text{fertility},\text{infant mortality},\text{gdp})+\varepsilon,\] say, then the linear regression approach would assume that \[\begin{aligned} f_{L}&(\text{fertility},\text{infant mortality},\text{gdp})=\beta_0+\beta_1\cdot\text{fertility}+\beta_2\cdot\text{infant mortality}+ \beta_3\cdot\text{gdp}.\end{aligned}\] The main advantages of the linear model are that it is interpretable and that it is easier to learn \(p+1\) parameters than it is to learn a whole function \(f\).
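For illustration, such a fit could be obtained on the full dataset as below (a sketch only; the object name `fit.L` is ours):

```
# learn beta_0, beta_1, beta_2, beta_3 by least squares over the training data
fit.L <- lm(life_expectancy ~ fertility + infant_mortality + gdp,
            data=gapminder.ML)
coef(fit.L)  # the p+1 = 4 learned parameters
```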

On the flip side, the linear model does not usually match the true regression function \(f\); if \(f_L\not\approx f\), then predictions will suffer.

We could decide to consider more complex functions in order to get better estimates (and thus better prediction accuracy), but this comes at a **cost** – the resulting functions are usually more difficult to learn and they tend to **overfit the data** (i.e., they mistake noise in the data for a signal to model, see Section 12.1.4 for details).

**Splines** provide examples of non-parametric models (see Section 12.5.2): they make no assumption about the form of \(f\) – they simply seek to estimate \(f\) by getting close to the data points without being too **rough** or too **wiggly**, as below.

```
# fit a smoothing spline; spar controls the trade-off between
# fidelity to the data points and smoothness of the curve
smoothingSpline <- smooth.spline(x, y, spar=0.7)
plot(x,y,ylab="response Y", xlab="predictor X", pch=20)
lines(smoothingSpline, col="red", lwd=3)
```

The main advantage of non-parametric approaches is that they have the potential to fit a **wider range** of regression shapes. But since estimating \(f\) is not reduced to learning a small number of parameters, substantially more data is required to obtain accurate estimates (and the whole situation is susceptible to overfitting).

Non-parametric methods are usually more **flexible** (they can produce a large range of shapes when estimating the true regression function \(f\)); parametric models are usually more **interpretable** (the set of parameters to learn is small and we can more easily make sense of them, which leads us to a better understanding of how the predictors interact to produce outputs).

Approaches that provide:

- high flexibility but low interpretability include ensemble learning, support vector machines, neural networks, and splines;
- low flexibility but high interpretability include the LASSO and OLS, and
- medium flexibility and medium interpretability include generalized additive models and regression trees.

Importantly, there are no **high-flexibility/high-interpretability** approaches. The **trade-off** between two competing desirable properties is the calling card of machine learning; we will encounter such trade-offs time and time again; they dictate what the discipline can and cannot hope to achieve.

### 12.1.3 Model Evaluation

In an ideal world (from a model performance point of view), we would be able to identify the modeling approach that performs “best”, and use it for all problems.

In practice, however, the preceding discussion on trade-offs shows that the concept of “best performance” is impossible to define in a way that meets all desired requirements, and a balance must be struck. But another issue lurks just around the corner, even when we settle on an “optimal” performance evaluation measure.

**No-Free-Lunch Theorem:** no single method is optimal over all possible datasets.^{193}

Given a specific task and dataset, then, how do we select the approach that will yield the best results (for a given value of “best”)? In practice, **this is the main machine learning challenge.**

In order to evaluate a model’s performance at a specific task, we must be able to measure how well predictions match the **observed data**. In a **regression/value estimation setting**, various metrics are used:

- **mean squared error** (MSE): \[\frac{1}{N}\sum_{i=1}^N(y_i-\widehat{f}(\mathbf{x}_i))^2;\]
- **mean absolute error** (MAE): \[\frac{1}{N}\sum_{i=1}^N|y_i-\widehat{f}(\mathbf{x}_i)|;\]
- **normalized mean squared error** (NMSE): \[\frac{\displaystyle{\sum_{i=1}^N(y_i-\widehat{f}(\mathbf{x}_i))^2}}{\displaystyle{\sum_{i=1}^N(y_i-\overline{y})^2}};\]
- **normalized mean absolute error** (NMAE): \[\frac{\displaystyle{\sum_{i=1}^N|y_i-\widehat{f}(\mathbf{x}_i)|}}{\displaystyle{\sum_{i=1}^N|y_i-\overline{y}|}};\]
- **mean absolute percentage error** (MAPE): \[\frac{1}{N}\sum_{i=1}^N\frac{|y_i-\widehat{f}(\mathbf{x}_i)|}{y_i};\]
- **correlation** \(\rho_{\widehat{y},y}\), etc.
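Given vectors of observations and predictions, these metrics are straightforward to compute; a sketch, using the simple linear fit from above (evaluated here on its own training data purely for illustration – see the remark on testing data below):

```
y.hat <- predict(lin.reg)  # predictions of the linear model on the (x,y) subset
MSE  <- mean((y - y.hat)^2)
MAE  <- mean(abs(y - y.hat))
NMSE <- sum((y - y.hat)^2)/sum((y - mean(y))^2)
NMAE <- sum(abs(y - y.hat))/sum(abs(y - mean(y)))
MAPE <- mean(abs(y - y.hat)/y)
rho  <- cor(y.hat, y)
```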

The MSE has convenient mathematical properties, and we will follow the lead of just about every reference in making it our go-to metric, but note that the conceptual notions we will discuss would be qualitatively similar for all performance evaluation tools.

Note that in order to evaluate the suitability of a model for **predictive** purposes, these metrics should be evaluated on **testing data** (or unseen data), not on the training data.^{194}

For instance, if we are trying to determine whether any clinical measurements in patients are likely to predict the onset of Alzheimer’s disease, we do not particularly care if the algorithm does a good job of telling us that the patients we have already tested for the disease have it or not – it is new patients that are of interest.^{195}

Let \(\text{Tr}=\{(\mathbf{x}_i,y_i)\}_{i=1}^N\) be the **training set** and suppose that we use some statistical learning method to estimate the true relationship \(Y=f(\vec{X})+\varepsilon\) by \(\widehat{Y}=\widehat{f}(\vec{X})\) (i.e., **we fit \(\widehat{f}\) over \(\text{Tr}\)**). Hopefully, \(\widehat{f}(\mathbf{x}_i)\approx y_i\) for all \(i=1,\ldots,N\), and \[\text{MSE}_{\text{Tr}}=\frac{1}{N}\sum_{i=1}^N(y_i-\widehat{f}(\mathbf{x}_i))^2\] is small.

If it is indeed small, then the model does a good job of **describing** \(\text{Tr}\). But, as discussed above, this is largely irrelevant to (if not uncorrelated with) our ability to make **good predictions**; what we would really like to know is whether \[\widehat{f}(\mathbf{x}^*)\approx y^{*}\] for observations \((\mathbf{x}^*,y^*)\not\in \text{Tr}\).

An **optimal** statistical learning method for a given combination of task and dataset is one that minimizes \[\text{MSE}_{\text{Te}}=\frac{1}{M}\sum_{j=N+1}^{N+M}(y_j-\widehat{f}(\mathbf{x}_j))^2\] over the testing set \(\text{Te}=\{(\mathbf{x}_j,y_j)\}_{j=N+1}^{N+M}\), where, *a priori*, none of the test observations were in \(\text{Tr}\).^{196}
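A sketch of this protocol on the \((x,y)\) subset (the 80/20 split and the seed are arbitrary choices on our part):

```
set.seed(0)                      # for reproducibility
n  <- length(x)
tr <- sample(n, floor(0.8*n))    # 80% of the observations form Tr
fit <- lm(y ~ x, subset=tr)      # the model is fitted over Tr only
# test MSE, evaluated on the held-out 20% (Te)
MSE.Te <- mean((y[-tr] - predict(fit, newdata=data.frame(x=x[-tr])))^2)
```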

The general situation is illustrated in Figures 12.5 and 12.6.

### 12.1.4 Bias-Variance Trade-Off

The "U" shape of the testing MSE in Figure 12.6 is **generic** – something of this nature occurs for nearly all datasets and choice of supervised learning family of methods (for regression and for classification): **underfitting** and **overfitting** is a fact of machine learning life.

The generic shape can be explained by two properties of SL methods: the **bias** and the **variance**. Consider a test observation \((\mathbf{x}^*,y^*)\) and a fitted model \(\widehat{f}\) (trained on training data \(\text{Tr}\)), which approximates the true model \[Y=f(\vec{X})+\varepsilon,\quad \text{where }f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}].\]

The **expected test MSE at \(\mathbf{x}^*\)** can be decomposed into 3 fundamental quantities \[\begin{aligned} \text{E}\left[\text{MSE}_{\text{Te}}(\mathbf{x}^*)\right]&=\text{E}\left[(y^*-\widehat{f}(\mathbf{x}^*))^2\right] \\ &=\underbrace{\text{Var}(\widehat{f}(\mathbf{x}^*))}_{\text{variance}}+\underbrace{\left\{\text{E}\left[\widehat{f}(\mathbf{x}^*)-f(\mathbf{x}^*)\right]\right\}^2}_{\text{squared bias}}+\text{Var}(\varepsilon).\end{aligned}\]
As before, \(\text{Var}(\varepsilon)\) is the **irreducible error** (due to the inherent noise in the data); the **variance component error** \(\text{Var}(\widehat{f}(\mathbf{x}^*))\) arises since different training sets would yield different fitted models \(\widehat{f}\), and the **(squared) bias component error** arises, in part, due to the “difficult” problem being approximated by a “simple” model (see [2], [3] for details).

The **overall expected test MSE** \(\text{E}[\text{MSE}_{\text{Te}}]\) is the average of \(\text{E}[\text{MSE}_{\text{Te}}(\mathbf{x}^*)]\) over all allowable \(\mathbf{x}^*\) in the testing space. Note that \[\text{E}[\text{MSE}_{\text{Te}}]\geq \text{Var}(\varepsilon),\] by construction.
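The decomposition can be illustrated by simulation: when the true \(f\) is known, we can draw repeated training sets, refit the model each time, and measure how the fitted values at a fixed test point spread around the truth (a sketch under assumed choices of \(f\), noise level, and smoothing parameter):

```
set.seed(0)
f <- function(x) sin(2*x)              # assumed true regression function
x.star <- 0.5                          # a fixed test point
f.hats <- replicate(500, {
  x.tr <- runif(50, -2, 2)             # a fresh training set ...
  y.tr <- f(x.tr) + rnorm(50, sd=0.3)  # ... with irreducible noise
  fit <- smooth.spline(x.tr, y.tr, spar=0.7)
  predict(fit, x.star)$y               # fitted value at x.star
})
var(f.hats)                            # variance component at x.star
(mean(f.hats) - f(x.star))^2           # squared bias component at x.star
```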

In general, more flexible methods (i.e., more complex methods) tend to have higher variance and lower bias, and *vice-versa*: simpler methods have higher bias and lower variance. It is this interplay between bias and variance that causes models to underfit (high bias) or overfit (high variance) the data (see **bias-variance trade-off** diagram below, from [3]).

Let us summarize the main take-aways from the first section:

- for **numerical responses** \(Y=f(\vec{X})+\varepsilon\), the optimal regression function is \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}];\]
- models are **learned** over training data \(\text{Tr}\);
- in practice, we learn the best model from a **restricted** group of model families;
- the best model \(\widehat{f}(\mathbf{x})\) (resp. \(\widehat{C}(\mathbf{x})\)) minimizes the reducible part of the prediction error \(\text{MSE}_{\text{Te}}\), **evaluated** on testing data \(\text{Te}\);
- the **bias-variance trade-off** tells us that models that are too simple (or too rigid) **underfit** the data, and that models that are too complex (or too “loose”) **overfit** the data;
- the total prediction error on \(\text{Te}\) is **bounded below** by the irreducible error.

Finally, remember that a predictive model’s performance **can only be evaluated on unseen data** (i.e., on data that does not belong to the training set \(\text{Tr}\)).

### References

[2] Hastie, T., Tibshirani, R., Friedman, J. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd ed. Springer, 2008.

[3] James, G., Witten, D., Hastie, T., Tibshirani, R. *An Introduction to Statistical Learning: With Applications in R*. Springer, 2014.

[135] Robinson, D. *Variance Explained*, Jan. 2018.

[237] Rosling, H., Rosling, O., Rosling Rönnlund, A. *Factfulness: Ten Reasons We’re Wrong About the World – and Why Things Are Better Than You Think*. Hodder & Stoughton, 2018.

[238] *The Health and Wealth of Nations*. Gapminder Foundation, 2012.