12.1 Statistical Learning

Statistical learning is a collection of procedures and approaches that allows analysts to tackle problems such as:

  • identifying risk factors associated with breast and prostate cancer;

  • predicting whether a patient will have a second, fatal heart attack within 30 days of the first on the basis of demographics, diet, clinical measurements, etc.;

  • establishing the relationship between salary and demographic information in population survey data;

  • predicting the yearly inflation rate using various indicators, etc.

Statistical learning tasks are typically divided into 2 main classes: supervised learning and unsupervised learning.186

12.1.1 Supervised Learning Framework

In the supervised learning environment, the outcome (response, target, dependent variable, etc.) is denoted by \(Y\), and the vector of \(p\) predictors (features) by \(\vec{X}=(X_1,\ldots,X_p)\).

If \(Y\) is quantitative (price, height, etc.), then the problem of predicting \(Y\) in terms of \(\vec{X}\) is a regression task; if \(Y\) takes on values in a finite unordered set (survived/died, colours, vegetation types, etc.), it is a classification task. This is typically achieved with the use of training data, that is, historical observations or instances, which we often denote by \([\mathbf{X}\mid \mathbf{Y}]\) (the column denoting the observation IDs is dropped).

\[\begin{array}{c|ccc|c}
\text{obs.} & & \text{predictors} & & \text{resp.} \\ \hline
1 & x_{1,1} & \cdots & x_{1,p} & y_1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
N & x_{N,1} & \cdots & x_{N,p} & y_N
\end{array}\]

The objectives of supervised learning are usually to:

  • accurately predict unseen test cases;

  • understand which inputs affect the outcomes (if any), and how, and/or

  • assess the quality of predictions and/or inferences made on the basis of the training data.

 
In unsupervised learning, by contrast, there is no outcome variable, only features measured on a set of observations \(\mathbf{X}\).187

The objectives are much more vague – analysts could seek to:

  • find sets of features (variables) that behave similarly across observations;

  • find combinations of features with the most variation;

  • find natural groups of similar observations, etc.

We will discuss such methods in detail in Module 14.

Statistical Learning vs. Machine Learning

The term “statistical learning” is not used frequently in practice (except by mathematicians and statisticians, perhaps); the current tendency is to speak of machine learning. If a distinction must be made, we could argue that:

  • statistical learning arises from statistical models, and the emphasis is usually placed on interpretability, precision, and uncertainty, whereas

  • machine learning arises from artificial intelligence studies, with the emphasis on large-scale applications and prediction accuracy.

The dividing line between the terms is blurry – the vocabulary used by practitioners mostly betrays their educational backgrounds (but see [135] for another take on this).

Motivating Example

Throughout, we will illustrate the concepts and notions via the Gapminder dataset, which records socio-demographic information relating to the planet’s nations, ranging from 1960 to 2011 [237], [238].

We will be interested in determining if there is a link between life expectancy, at various moments in time, and the rest of the predictors.

The dataset contains 7139 observations of 9 variables:

  • a country \(\times\) year identifier (2 variables, \(i\) and \(X_1\));

  • a region and continent pair of categorical predictors (2 variables, \(X_2\) and \(X_3\));

  • four numerical predictors: population \(X_4\), infant mortality \(X_5\), fertility \(X_6\), gross domestic product in 1999 dollars \(X_7\), and

  • life expectancy \(Y\), the numerical response.

library(dplyr)

# read the data and convert character columns to factors
gapminder.ML = as.data.frame(unclass(data.frame(read.csv("Data/gapminder.csv"))),
                          stringsAsFactors=TRUE)
# keep only the complete cases and the variables of interest
gapminder.ML <- gapminder.ML[complete.cases(gapminder.ML),]
gapminder.ML <- gapminder.ML[,c("country","year","region","continent",
                          "population","infant_mortality","fertility","gdp",
                          "life_expectancy")]

The structure and summary are provided below:

str(gapminder.ML)
summary(gapminder.ML)
'data.frame':   7139 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 5 8 9 11 13 14 16 18 20 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 15 1 22 2 18 2 22 20 15 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 2 5 4 2 3 2 4 1 2 ...
 $ population      : int  11124892 20619075 10292328 7065525 109526 48200702 230934 9140563 2431620 3693451 ...
 $ infant_mortality: num  148.2 59.9 20.3 37.3 51 ...
 $ fertility       : num  7.65 3.11 3.45 2.7 4.5 6.73 4.33 2.6 6.28 6.7 ...
 $ gdp             : num  1.38e+10 1.08e+11 9.67e+10 5.24e+10 1.31e+09 ...
 $ life_expectancy : num  47.5 65.4 70.9 68.8 62 ...
       country          year                  region        continent   
 Algeria   :  52   Min.   :1960   Western Africa : 749   Africa  :2203  
 Argentina :  52   1st Qu.:1978   Eastern Africa : 637   Americas:1546  
 Australia :  52   Median :1991   South America  : 601   Asia    :1598  
 Austria   :  52   Mean   :1989   Western Asia   : 560   Europe  :1418  
 Bangladesh:  52   3rd Qu.:2001   Caribbean      : 432   Oceania : 374  
 Belgium   :  52   Max.   :2011   Central America: 409                  
 (Other)   :6827                  (Other)        :3751                  
   population        infant_mortality   fertility          gdp           
 Min.   :4.154e+04   Min.   :  1.80   Min.   :1.130   Min.   :4.040e+07  
 1st Qu.:2.115e+06   1st Qu.: 16.00   1st Qu.:2.160   1st Qu.:1.974e+09  
 Median :6.531e+06   Median : 41.70   Median :3.720   Median :8.249e+09  
 Mean   :3.236e+07   Mean   : 54.88   Mean   :4.045   Mean   :1.556e+11  
 3rd Qu.:2.003e+07   3rd Qu.: 85.40   3rd Qu.:5.900   3rd Qu.:5.843e+10  
 Max.   :1.348e+09   Max.   :237.20   Max.   :8.670   Max.   :1.174e+13  
                                                                         
 life_expectancy
 Min.   :13.20  
 1st Qu.:56.90  
 Median :67.58  
 Mean   :64.83  
 3rd Qu.:73.20  
 Max.   :82.90  
                

In other words, we will be looking for models of the form \[Y=f(X_1,\ldots,X_7)+\varepsilon\equiv f(\vec{X})+\varepsilon,\] where \(f\) is the systematic component of \(Y\) explained by \(X\), and \(\varepsilon\) is the random error term, which accounts for measurement errors and other discrepancies.

12.1.2 Systematic Component and Regression Function

It is the systematic component that is used for predictions and inferences. As long as \(f\) is “good”, we can:

  • make predictions for the response \(Y\) at new points \(\vec{X}=\mathbf{x}\);

  • understand which features of \(\vec{X}=(X_1,\ldots, X_p)\) are important to explain the variation in \(Y\), and

  • depending on the complexity of \(f\), understand the effect of each feature \(X_j\) on \(Y\).

Imagine a model with one predictor \(X\) and a target \(Y\), with systematic component \(f\), so that \[Y=f(X)+\varepsilon.\]

For instance, consider the following subset of the Gapminder dataset.

attach(gapminder.ML)
# subset: every 10th observation; predictor X = log(fertility/infant mortality)
x=log(fertility[10*(1:730)]/infant_mortality[10*(1:730)])
y=life_expectancy[10*(1:730)]
# drop the pairs with missing values (indices past the end of the data)
keep=!is.na(x) & !is.na(y)
x=x[keep]
y=y[keep]
plot(x,y,ylab="response Y", xlab="predictor X", pch=20)

Figure 12.1: Scatter plot of the Gapminder subset: life expectancy (response \(Y\)) against the predictor \(X=\log(\text{fertility}/\text{infant mortality})\).

What is the ideal \(f\) in this case? How can we find it?


Figure 12.2: Regression model for Gapminder subset.

In that case, what would be a good value of \(f(-2)\), say?


Figure 12.3: Regression model for Gapminder subset, and vertical line at \(X=-2\).

Ideally, we would like to have \(f(-2)=\text{E}[Y\mid X=-2]\).188 For any \(x\) in the range of \(X\), the function \[f(x)=\text{E}[Y\mid X=x]\] is the regression function of \(Y\) on \(X\). In the general setting with \(p\) predictors, the regression function is \[ f(\mathbf{x})=f(x_1,\ldots,x_p)=\text{E}[Y\mid X_1=x_1,\ldots,X_p=x_p]=\text{E}[Y\mid \vec{X}=\mathbf{x}].\] It is optimal in the sense that it minimizes the average squared deviation from the response variable, that is to say, \[f=\arg\min_g\big\{\text{E}[(Y-g(\vec{X}))^2\mid \vec{X}=\mathbf{x}]\big\}.\] The term \[\varepsilon=\varepsilon_{\vec{X}}=Y-f(\vec{X})\] is the irreducible error of the regression. Typically, \(\varepsilon_{\vec{X}}\neq 0\) for all \(\vec{X}\), since, even when \(f\) is known exactly, there will still be some uncertainty in the predictions due to some noise-generating mechanism in the “real world”.

If \(\widehat{f}\) is any estimate of the regression function \(f\),189 then \[\begin{aligned} \text{E}[(Y-\widehat{Y})^2\mid \vec{X}=\mathbf{x}]&=\text{E}[(f(\vec{X})+\varepsilon-\widehat{f}(\vec{X}))^2\mid \vec{X}=\mathbf{x}] \\ &=\underbrace{[f(\vec{X})-\widehat{f}(\vec{X})]^2}_{\text{reducible}}+\underbrace{\text{Var}(\varepsilon)}_{\text{irreducible}}. \end{aligned}\] Since the irreducible component is not a property of the estimate \(\widehat{f}\), the objective of minimizing \(\text{E}[(Y-\widehat{Y})^2]\) can only be achieved by working through the reducible component. When we speak of learning a model, we mean that we use the training data to find an estimate \(\widehat{f}\) of \(f\) that minimizes this reducible component, in some way.
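To make the decomposition concrete, here is a minimal R sketch on simulated data with a known \(f\) (all names and settings below are illustrative choices, not part of the Gapminder example): the error of the true regression function is purely irreducible, while a crude estimate also pays a reducible penalty.

set.seed(0)                          # for reproducibility
n   <- 10000
x   <- runif(n, -3, 3)               # a single predictor
f   <- function(x) 2 + sin(x)        # true (known) regression function
eps <- rnorm(n, sd = 0.5)            # irreducible noise, Var(eps) = 0.25
y   <- f(x) + eps

f.hat <- function(x) 2 + 0.5*x       # a crude (linear) estimate of f

mean((y - f(x))^2)                   # close to 0.25: irreducible error only
mean((y - f.hat(x))^2)               # larger: reducible + irreducible error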

Estimating the Regression Function

In theory, we know that the regression function is \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}];\] in practice, however, there might be too few (or even no) observations at \(\vec{X}=\mathbf{x}\) to trust the estimate provided by the sample mean. One solution is to approximate the expectation by a nearest neighbour average \[\widehat{f}(\mathbf{x})=\text{Avg}\{Y\mid \vec{X}\in N(\mathbf{x})\},\] where \(N(\mathbf{x})\) is a neighbourhood of \(\mathbf{x}\).


Figure 12.4: Regression model for Gapminder subset, and vertical line at \(X=-2\), with neighbourhood \(N(-2)\).
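A minimal sketch of such a local average, using the x and y vectors extracted above (the neighbourhood is taken to consist of the k nearest observations; the value of k and the helper’s name are our own choices):

# local average estimate of f at x0: mean response of the k observations
# whose predictor values are closest to x0
f.hat.local <- function(x0, x, y, k=30) {
  idx <- order(abs(x - x0))[1:k]     # indices of the k nearest neighbours
  mean(y[idx])
}
f.hat.local(-2, x, y)                # estimate of f(-2) = E[Y | X = -2]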

In general, this approach works reasonably well when \(p\) is “small” (\(p\leq 4\)?) and \(N\) is “large”, but it fails when \(p\) is too large because of the curse of dimensionality: nearest neighbours are usually far from \(\mathbf{x}\) when \(p\) is large. Indeed, if \(N(\mathbf{x})\) is defined as the nearest 5% of observations to \(\mathbf{x}\), say,190 then we need to leave the “local” neighbourhood of \(\mathbf{x}\) to build \(N(\mathbf{x})\), which could compromise the quality of \(\widehat{f}(\mathbf{x})\) as an approximation to \(f(\mathbf{x})\).
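The phenomenon can be observed directly; the sketch below (simulated uniform data, with arbitrary choices of sample size and dimensions) reports the radius needed to capture the nearest 5% of observations around a central query point as \(p\) grows:

set.seed(0)
n <- 1000
for (p in c(1, 2, 5, 10, 50)) {
  X  <- matrix(runif(n*p), ncol=p)        # n points in the unit hypercube
  x0 <- rep(0.5, p)                       # query point at the centre
  d  <- sqrt(colSums((t(X) - x0)^2))      # Euclidean distances to x0
  cat("p =", p, "  5% radius =", round(quantile(d, 0.05), 3), "\n")
}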

We provide more details in Module 15, but this is a topic about which it is worth being well-read (see [2] for a formal treatment).

The various statistical learning methods attempt to provide estimates of the regression function by minimizing the reducible component through parametric or non-parametric approaches.191 For instance, the classical linear regression approach is parametric: it assumes that the true regression function \(f\) is linear and suggests the estimate \[f_L(\mathbf{x})=\beta_0+\beta_1x_1+\cdots+\beta_px_p.\] The objective, in this case, is to learn the \(p+1\) parameters \(\beta_0,\beta_1,\ldots,\beta_p\) with the help of the training data.

In practice, this assumption almost never holds, but it often provides an interpretable192 approximation to the true regression function \(f\) (see below for an example).

# fit a simple linear model of Y on X and overlay the fitted regression line
lin.reg = lm(y~x)
plot(x,y,ylab="response Y", xlab="predictor X", pch=20)
abline(lin.reg, col="red", lwd=3)

As an example, if the true fit of the motivating example was \[\text{life expectancy}=f(\text{fertility},\text{infant mortality},\text{gdp})+\varepsilon,\] say, then the linear regression approach would assume that \[\begin{aligned} f_{L}&(\text{fertility},\text{infant mortality},\text{gdp})=\beta_0+\beta_1\cdot\text{fertility}+\beta_2\cdot\text{infant mortality}+ \beta_3\cdot\text{gdp}.\end{aligned}\] The main advantages of the linear model are that it is interpretable and that it is easier to learn \(p+1\) parameters than it is to learn a whole function \(f\).
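Such a linear fit could be obtained as follows (a sketch only, fitted on the full dataset; we are not yet concerned with the quality of the fit or with training/testing considerations):

# linear fit of life expectancy on three numerical predictors
fit.L <- lm(life_expectancy ~ fertility + infant_mortality + gdp,
            data=gapminder.ML)
coef(fit.L)                          # the learned parameters beta_0, ..., beta_3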

On the flip side, the linear model does not usually match the true regression function \(f\); if \(f_L\not\approx f\), then predictions will suffer.

We could decide to consider more complex functions in order to get better estimates (and thus better prediction accuracy), but this comes at a cost – the resulting functions are usually more difficult to learn and they tend to overfit the data (i.e., they mistake noise in the data for a signal to model, see Section 12.1.4 for details).

Splines provide examples of non-parametric models (see Section 12.5.2): they make no assumption about the form of \(f\) – they simply seek to estimate \(f\) by getting close to the data points without being too rough or too wiggly, as below.

# fit a smoothing spline (spar controls the smoothness) and overlay the curve
smoothingSpline = smooth.spline(x, y, spar=0.7)
plot(x,y,ylab="response Y", xlab="predictor X", pch=20)
lines(smoothingSpline, col="red", lwd=3)

detach(gapminder.ML)

The main advantage of non-parametric approaches is that they have the potential to fit a wider range of regression shapes. But since estimating \(f\) is not reduced to learning a small number of parameters, substantially more data is required to obtain accurate estimates (and the whole situation is susceptible to overfitting).

Non-parametric methods are usually more flexible (they can produce a large range of shapes when estimating the true regression function \(f\)); parametric models are usually more interpretable (the set of parameters to learn is small and we can more easily make sense of them, which leads us to a better understanding of how the predictors interact to produce outputs).

Approaches that provide

  • high flexibility, but low interpretability include ensemble learning, support vector machines, neural networks, and splines;

  • low flexibility, but high interpretability include the LASSO and OLS, and

  • medium flexibility and medium interpretability include generalized additive models and regression trees.

  Importantly, there are no high-flexibility/high-interpretability approaches. The trade-off between two competing desirable properties is the calling card of machine learning; we will encounter such trade-offs time and time again; they dictate what the discipline can and cannot hope to achieve.

12.1.3 Model Evaluation

In an ideal world (from a model performance point of view), we would be able to identify the modeling approach that performs “best”, and use it for all problems.

In practice, however, the preceding discussion on trade-offs shows that the concept of “best performance” is impossible to define in a way that meets all desired requirements, and a balance must be struck. But another issue lurks just around the corner, even when we settle on an “optimal” performance evaluation measure.

No-Free-Lunch Theorem: no single method is optimal over all possible datasets.193

Given a specific task and dataset, then, how do we select the approach that will yield the best results (for a given value of “best”)? In practice, this is the main machine learning challenge.

In order to evaluate a model’s performance at a specific task, we must be able to measure how well predictions match the observed data. In a regression/value estimation setting, various metrics are used:

  • mean squared error (MSE): \[\displaystyle{\frac{1}{N}\sum_{i=1}^N(y_i-\widehat{f}(\mathbf{x}_i))^2};\]

  • mean absolute error (MAE): \[\displaystyle{\frac{1}{N}\sum_{i=1}^N|y_i-\widehat{f}(\mathbf{x}_i)|};\]

  • normalized mean squared error (NMSE): \[\frac{\displaystyle{\sum_{i=1}^N(y_i-\widehat{f}(\mathbf{x}_i))^2}}{\displaystyle{\sum_{i=1}^N(y_i-\overline{y})^2}};\]

  • normalized mean absolute error (NMAE): \[\frac{\displaystyle{\sum_{i=1}^N|y_i-\widehat{f}(\mathbf{x}_i)|}}{\displaystyle{\sum_{i=1}^N|y_i-\overline{y}|}};\]

  • mean absolute percentage error (MAPE): \[\frac{1}{N}\sum_{i=1}^N\frac{|y_i-\widehat{f}(\mathbf{x}_i)|}{y_i};\]

  • correlation \(\rho_{\widehat{y},y}\), etc.

 
The MSE has convenient mathematical properties, and we will follow the lead of just about every reference in making it our go-to metric, but note that the conceptual notions we will discuss would be qualitatively similar for all performance evaluation tools.
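For reference, these metrics could be computed as follows, given a vector of observed responses y and a vector of predictions y.hat (the function name is ours):

# performance metrics for a numerical response
reg.metrics <- function(y, y.hat) {
  c(MSE  = mean((y - y.hat)^2),
    MAE  = mean(abs(y - y.hat)),
    NMSE = sum((y - y.hat)^2)/sum((y - mean(y))^2),
    NMAE = sum(abs(y - y.hat))/sum(abs(y - mean(y))),
    MAPE = mean(abs(y - y.hat)/y),
    corr = cor(y.hat, y))
}

For instance, reg.metrics(y, predict(lin.reg)) would evaluate the linear fit of Figure 12.2 on the very data used to train it – which, as we discuss next, is not how predictive performance should be assessed.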

Note that in order to evaluate the suitability of a model for predictive purposes, these metrics should be evaluated on testing data (or unseen data), not on the training data.194

For instance, if we are trying to determine whether any clinical measurements in patients are likely to predict the onset of Alzheimer’s disease, we do not particularly care if the algorithm does a good job of telling us that the patients we have already tested for the disease have it or not – it is new patients that are of interest.195

Let \(\text{Tr}=\{(\mathbf{x}_i,y_i)\}_{i=1}^N\) be the training set and suppose that we use some statistical learning method to estimate the true relationship \(Y=f(\vec{X})+\varepsilon\) by \(\widehat{Y}=\widehat{f}(\vec{X})\) (i.e., we fit \(\widehat{f}\) over \(\text{Tr}\)). Hopefully, \(\widehat{f}(\mathbf{x}_i)\approx y_i\) for all \(i=1,\ldots,N\), and \[\text{MSE}_{\text{Tr}}=\frac{1}{N}\sum_{i=1}^N(y_i-\widehat{f}(\mathbf{x}_i))^2\] is small.

If it is indeed small, then the model does a good job of describing \(\text{Tr}\). But, as discussed above, this is largely irrelevant to (if not uncorrelated with) our ability to make good predictions; what we would really like to know is whether \[\widehat{f}(\mathbf{x}^*)\approx y^{*}\] for observations \((\mathbf{x}^*,y^*)\not \in \text{Tr}\).

An optimal statistical learning method for a given combination of task and dataset is one that minimizes \[\text{MSE}_{\text{Te}}=\frac{1}{M}\sum_{j=N+1}^{N+M}(y_j-\widehat{f}(\mathbf{x}_j))^2\] over the testing set \(\text{Te}=\{(\mathbf{x}_j,y_j)\}_{j=N+1}^{N+M}\), where, a priori, none of the test observations were in \(\text{Tr}\).196
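A minimal sketch of this setup on the Gapminder data (an 80/20 random split and a simple linear model; these particular choices are ours):

set.seed(0)
N.total   <- nrow(gapminder.ML)
train.idx <- sample(N.total, size=floor(0.8*N.total))   # 80% of rows for training
Tr <- gapminder.ML[train.idx,]
Te <- gapminder.ML[-train.idx,]

fit <- lm(life_expectancy ~ fertility + infant_mortality + gdp, data=Tr)

mean((Tr$life_expectancy - predict(fit, Tr))^2)          # MSE on the training set
mean((Te$life_expectancy - predict(fit, Te))^2)          # MSE on the testing set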

The general situation is illustrated in Figures 12.5 and 12.6.


Figure 12.5: The training/testing paradigm. Training data is fed into a variety of statistical learning methods, possibly arranged in increasing order of complexity, yielding a sequence of models. These models are then used to make predictions on the testing set (using only the predictor variables); the predictions are then compared with the actual values to evaluate the performance of the models on the testing set. The performance of the models on the training set can also be evaluated.


Figure 12.6: Generic illustration of the bias-variance trade-off; when the complexity of the model increases, the training error decreases, but the testing error eventually starts increasing. Generally, models that are too simple will have ‘large’ prediction errors on both the training and the testing sets (underfitting), whereas for models that are too complex, the training error tends to be ‘small’ while the testing error tends to be ‘large’ (overfitting). Based on [2], [3]

12.1.4 Bias-Variance Trade-Off

The "U" shape of the testing MSE in Figure 12.6 is generic – something of this nature occurs for nearly all datasets and choice of supervised learning family of methods (for regression and for classification): underfitting and overfitting is a fact of machine learning life.

The generic shape can be explained by two properties of statistical learning methods: the bias and the variance. Consider a test observation \((\mathbf{x}^*,y^*)\) and a fitted model \(\widehat{f}\) (trained on training data \(\text{Tr}\)), which approximates the true model \[Y=f(\vec{X})+\varepsilon,\quad \text{where }f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}].\]

The expected test MSE at \(\mathbf{x}^*\) can be decomposed into 3 fundamental quantities \[\begin{aligned} \text{E}\left[\text{MSE}_{\text{Te}}(\mathbf{x}^*)\right]&=\text{E}\left[(y^*-\widehat{f}(\mathbf{x}^*))^2\right] \\ &=\underbrace{\text{Var}(\widehat{f}(\mathbf{x}^*))}_{\text{variance}}+\underbrace{\left\{\text{E}\left[\widehat{f}(\mathbf{x}^*)-f(\mathbf{x}^*)\right]\right\}^2}_{\text{squared bias}}+\text{Var}(\varepsilon).\end{aligned}\] As before, \(\text{Var}(\varepsilon)\) is the irreducible error (due to the inherent noise in the data); the variance component error \(\text{Var}(\widehat{f}(\mathbf{x}^*))\) arises since different training sets would yield different fitted models \(\widehat{f}\), and the (squared) bias component error arises, in part, due to the “difficult” problem being approximated by a “simple” model (see [2], [3] for details).

The overall expected test MSE \(\text{E}[\text{MSE}_{\text{Te}}]\) is the average of \(\text{E}[\text{MSE}_{\text{Te}}(\mathbf{x}^*)]\) over all allowable \(\mathbf{x}^*\) in the testing space. Note that \[\text{E}[\text{MSE}_{\text{Te}}]\geq \text{Var}(\varepsilon),\] by construction.

In general, more flexible methods (i.e., more complex methods) tend to have higher variance and lower bias, and vice-versa: simpler methods have higher bias and lower variance. It is this interplay between bias and variance that causes models to underfit (high bias) or overfit (high variance) the data (see bias-variance trade-off diagram below, from [3]).


Figure 12.7: Expected test error decomposition, artificial dataset. [3]
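The trade-off can also be observed empirically. The sketch below repeatedly simulates training sets from a known model, fits smoothing splines of increasing flexibility, and estimates the variance and squared bias of \(\widehat{f}(\mathbf{x}^*)\) at a fixed test point (all settings are arbitrary illustrative choices):

set.seed(0)
f <- function(x) sin(2*x)                  # known regression function
x.star <- 1                                # fixed test point
n <- 100; sigma <- 0.3; B <- 200           # training size, noise level, replications

for (df in c(2, 5, 10, 20)) {              # increasing flexibility
  preds <- replicate(B, {
    x <- runif(n, -3, 3)
    y <- f(x) + rnorm(n, sd=sigma)         # a fresh training set
    fit <- smooth.spline(x, y, df=df)      # fitted model of the given flexibility
    predict(fit, x.star)$y                 # prediction at the test point
  })
  cat("df =", df, "  variance =", round(var(preds), 4),
      "  squared bias =", round((mean(preds) - f(x.star))^2, 4), "\n")
}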


Let us summarize the main take-aways from the first section:

  • for numerical responses, the optimal systematic component \(f\) in \(Y=f(\vec{X})+\varepsilon\) is the regression function \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}];\]

  • models are learned over training data \(\text{Tr}\);

  • in practice, we learn the best model from a restricted group of model families;

  • the best model \(\widehat{f}(\mathbf{x})\) (resp. \(\widehat{C}(\mathbf{x})\), for classification) minimizes the reducible part of the prediction error \(\text{MSE}_{\text{Te}}\), evaluated on testing data \(\text{Te}\);

  • the bias-variance trade-off tells us that models that are too simple (or too rigid) underfit the data, and that models that are too complex (or too “loose”) overfit the data;

  • the total prediction error on \(\text{Te}\) is bounded below by the irreducible error.

Finally, remember that a predictive model’s performance can only be evaluated on unseen data (i.e., on data not drawn from the training set \(\text{Tr}\)).

References

[2]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[135]
[237]
H. Rosling, O. Rosling, and A. R. Rönnlund, Factfulness: Ten reasons we’re wrong about the world - and why things are better than you think. Hodder & Stoughton, 2018.
[238]
H. Rosling, The health and wealth of nations. Gapminder Foundation, 2012.