12.2 Regression Modeling

Recall that, in the regression setting, the goal is to estimate the regression function \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}],\] the solution to the regression problem \[Y=f(\vec{X})+\varepsilon.\] The best estimate \(\widehat{f}\) is the model that minimizes \[\text{MSE}_{\text{Te}}(\widehat{f})=\text{Avg}_{\mathbf{x}^*\in Te}\text{E}\left[(y^*-\widehat{f}(\mathbf{x}^*))^2\right].\]

In practice, this can be hard to achieve without restrictions on the functional form of \(\widehat{f}\), so we try to learn the best \(\widehat{f}\) from various families of models. Remember, however, that no matter the \(\widehat{f}\), \[\text{MSE}_{\text{Te}}(\widehat{f})\geq \text{Var}(\varepsilon).\] What else can we say about \(\widehat{f}\)?

In the ordinary least squares framework (OLS), we assume that \[\widehat{f}_{\text{OLS}}(\mathbf{x})\approx \mathbf{x}^{\!\top}\boldsymbol{\beta},\] which is to say that we assume that \(\widehat{f}_{\text{OLS}}\) is nearly globally linear (in one interpretation, we neglect the intercept term).

In practice, the true regression function is almost never linear, but the linear assumption yields models \(\widehat{f}\) that are both conceptually and practically useful – the model \(\widehat{f}\) is easily interpretable, and the associated prediction error \(\text{MSE}_{\text{Te}}(\widehat{f})\) is often “small-ish”.

The most common data modeling methods are linear and logistic regression. By some estimates, 90% of real-world data applications end up using these as their final model, typically after very carefully preparing the data (cleaning, encoding, creation of new variables, transformation of variables, etc.).

That is mostly due to:

  • these regression models being straightforward to interpret and to train;

  • the prediction error \(\text{MSE}_\text{Te}\) being easy to estimate, and

  • the OLS solution being computable using simple matrix manipulations.

Gapminder Example

Let us revisit the Gapminder dataset, focusing on observations from 2011.

  • Is there a relationship between gross domestic product and life expectancy?

  • How strong is the relationship?

  • Which factors contribute to the life expectancy?

  • How accurately could we predict life expectancy given a set of new predictors?

  • Is the relationship linear?

  • Are there combinations of factors that are linked with life expectancy?

Can the scatterplots of various predictors against life expectancy for the 2011 Gapminder data, shown below with lines of best fit, be used to answer these questions?

gapminder.2011 <- gapminder.ML |> filter(year==2011)
y <- gapminder.2011$life_expectancy

x <- gapminder.2011$population
plot(x,y, xlab="Population", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)

x <- gapminder.2011$infant_mortality
plot(x,y, xlab="Infant Mortality", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)

x <- gapminder.2011$fertility
plot(x,y, xlab="Fertility", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)

x <- gapminder.2011$gdp
plot(x,y, xlab="Gross Domestic Product", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)

x <- gapminder.2011$gdp/gapminder.2011$population
plot(x,y, xlab="GDP per capita", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)

x <- log(gapminder.2011$gdp/gapminder.2011$population)
plot(x,y, xlab="GDP per capita (log scale)", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)

12.2.1 Formalism

Consider a dataset \(\text{Tr}=\{(\vec{x}_1,y_1),\ldots, (\vec{x}_n,y_n)\}\) with \(n\) observations and \(p\) features. The corresponding design matrix, response vector, and coefficient vector are, respectively, \[\mathbf{X} =\begin{pmatrix}1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p}\end{pmatrix},\quad \mathbf{Y}=\begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{pmatrix}.\] The objective is to find \(f\) such that \(\mathbf{Y}=f(\mathbf{X})+\boldsymbol{\varepsilon}\). The OLS solution assumes that \(f(\mathbf{X})=\mathbf{X}\boldsymbol{\beta}\); we must thus learn \(\boldsymbol{\beta}\) using the training data \(\text{Tr}\).

If \(\hat{\boldsymbol{\beta}}\) is the estimate of the true coefficient vector \(\boldsymbol{\beta}\), the linear regression model associated with \(\text{Tr}\) is \[\widehat{f}(\mathbf{x})=\widehat{\beta}_0+\widehat{\beta}_1x_1+\cdots+\widehat{\beta}_px_p.\] How do we find \(\hat{\boldsymbol{\beta}}\)? The OLS estimate minimizes the loss function \[\begin{aligned} \mathcal{L}(\boldsymbol{\beta})&=\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2=(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^{\!\top}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{Y}^{\!\top}\mathbf{Y}-((\mathbf{X}\boldsymbol{\beta})^{\!\top}\mathbf{Y}+\mathbf{Y}^{\!\top}\mathbf{X}\boldsymbol{\beta})+(\mathbf{X}\boldsymbol{\beta})^{\!\top}\mathbf{X}\boldsymbol{\beta}\\&=\mathbf{Y}^{\!\top}\mathbf{Y}-(\boldsymbol{\beta}^{\!\top}\mathbf{X}^{\!\top}\mathbf{Y}+\mathbf{Y}^{\!\top}\mathbf{X}\boldsymbol{\beta})+\boldsymbol{\beta}^{\!\top}\mathbf{X}^{\!\top}\mathbf{X}\boldsymbol{\beta}.\end{aligned}\] The loss function is a non-negative symmetric quadratic form in \(\boldsymbol{\beta}\), with no restriction on the coefficients, so any minimizer of \(\mathcal{L}\) must also be one of its critical points (assuming certain regularity conditions on the data). We are thus looking for coefficients for which \(\nabla\mathcal{L}(\boldsymbol{\beta})=\mathbf{0}\).
Since \[\nabla \mathcal{L}(\boldsymbol{\beta})=-2(\mathbf{X}^{\top}\mathbf{Y}-\mathbf{X}^{\!\top}\mathbf{X}\boldsymbol{\beta}),\] any minimizer \(\hat{\boldsymbol{\beta}}\) must satisfy the canonical equations: \[\mathbf{X}^{\!\top}\mathbf{Y}=\mathbf{X}^{\!\top}\mathbf{X}\hat{\boldsymbol{\beta}}.\] If \(\mathbf{X}^{\!\top}\mathbf{X}\) is invertible, the minimizer \(\hat{\boldsymbol{\beta}}\) is unique and is given by \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}\mathbf{Y},\quad \text{with}\quad \text{Var}(\hat{\boldsymbol{\beta}})=\widehat{\sigma}^2(\mathbf{X}^{\!\top}\mathbf{X})^{-1},\] where \(\widehat{\sigma}^2\) is the variance of the residuals.197 We say that “we have learned the coefficient vector \(\hat{\boldsymbol{\beta}}\) on the training data \(\text{Tr}\) using linear regression”.
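The canonical equations can be verified numerically on simulated data; a minimal sketch (all variable names and coefficient values below are illustrative):

```r
set.seed(0)  # for reproducibility

# simulate n observations from a known linear model with noise
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + 3*x1 - 1.5*x2 + rnorm(n, sd = 0.5)

# design matrix with a leading column of 1s for the intercept
X <- cbind(1, x1, x2)

# solve the canonical equations X'X beta = X'Y directly
beta.hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() computes the same estimates internally
all.equal(as.vector(beta.hat), as.vector(coef(lm(y ~ x1 + x2))))
```

In practice one would not invert \(\mathbf{X}^{\!\top}\mathbf{X}\) explicitly; `solve(A, b)` (or the QR decomposition used by `lm()`) is more numerically stable.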

In what follows, we write \(\mathbf{x}\) to represent the vector \[(1,x_{1},\ldots,x_{p})^{\!\top}.\] At times, we will also use \(\mathbf{x}\) to represent the predictor vector \((x_{1},\ldots,x_{p})^{\!\top}\) without the leading 1; the context should make clear which version is meant.

The fitted value of the model \(\widehat{f}\) at input \(\mathbf{x}_i\in\text{Tr}\) is \[\widehat{y}_i=\widehat{f}(\mathbf{x}_i)=\mathbf{x}_i^{\!\top}\hat{\boldsymbol{\beta}},\] and the predicted value at an arbitrary \(\mathbf{x}^*\) is \[\widehat{y}^*=\widehat{f}(\mathbf{x}^*)={\mathbf{x}^*}^{\!\top}\hat{\boldsymbol{\beta}}.\]

The fitted surface is thus entirely described by the \(p+1\) parameters \(\hat{\boldsymbol{\beta}}\); the number of (effective) parameters is a measure of the complexity of the learner.

Motivating Example

We study a subset of the Gapminder dataset: the observations for 2011, the predictor variables infant mortality \(X_1\) and fertility \(X_2\), and the response variable life expectancy \(Y\). The training data \(\text{Tr}\) contains \(n=166\) observations and \(p=2\) predictor features.

The design matrix \(\mathbf{X}\) is thus of dimension \(166\times 3\).

gapminder.2011 <- gapminder.2011 |> dplyr::mutate(const=1)
design.X = gapminder.2011[,c("const","infant_mortality","fertility")]
str(design.X)
'data.frame':   166 obs. of  3 variables:
 $ const           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ infant_mortality: num  14.3 22.8 106.8 7.2 12.7 ...
 $ fertility       : num  1.75 2.83 6.1 2.12 2.2 1.5 1.88 1.44 1.96 1.9 ...

The response is a \(166 \times 1\) vector.

resp.Y = gapminder.2011[,c("life_expectancy")]

The constituents of the canonical equations are:

(X.t.X = t(as.matrix(design.X)) %*% as.matrix(design.X))
(X.t.Y = t(as.matrix(design.X)) %*% as.matrix(resp.Y))
                   const infant_mortality fertility
const             166.00          4537.30   486.540
infant_mortality 4537.30        225043.25 18445.280
fertility         486.54         18445.28  1790.238
const             11756.70
infant_mortality 291153.33
fertility         32874.95

We thus see that
\[\mathbf{X}^{\!\top}\mathbf{X}=\begin{pmatrix} 166.0 & 4537.3 & 486.54 \\ 4537.3 & 225043.25 & 18445.28 \\ 486.54 & 18445.28 & 1790.238 \end{pmatrix}\] and \[\mathbf{X}^{\!\top}\mathbf{Y}=\begin{pmatrix}11756.7 \\ 291153.33 \\ 32874.95\end{pmatrix}.\]

We can now compute \(\hat{\boldsymbol{\beta}}:\)

(beta.hat = solve(X.t.X) %*% X.t.Y)
[1,] 79.6770747
[2,] -0.2757438
[3,] -0.4431873

Thus, \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}\mathbf{Y}=\begin{pmatrix}79.677 \\ -0.276 \\ -0.443\end{pmatrix}.\]

We have seen that the fitted surface is \[\widehat{y}^*=\widehat{f}(\mathbf{x}^*)=79.677-0.276x_1^*-0.443x_2^*\] for a test observation \(\mathbf{x}^*=(x_1^*,x_2^*)\).
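A quick sketch of such a prediction in R, using the fitted coefficients above and a hypothetical test observation (the chosen values are illustrative and lie within the training range):

```r
# fitted coefficients from the Gapminder model above
beta.hat <- c(79.677, -0.276, -0.443)

# hypothetical test observation: infant mortality of 20, fertility of 2.5
x.star <- c(1, 20, 2.5)            # leading 1 for the intercept

# the prediction is a single inner product x*' beta.hat
(y.star <- sum(x.star * beta.hat)) # roughly 73.05 years
```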

Warning: predictions should not be made for observations outside the range (or the envelope) of the training predictors. In this example, the predictor envelope is shown in red in the figure below – one should resist the temptation to predict \(y^*\) for \(\mathbf{x^*}=(100,2)\), say.


Figure 12.8: Predictor envelope for the Gapminder subset.

Least Squares Assumptions

Since the family of OLS learners is a subset of all possible learners, the best we can say about \(\widehat{f}_{\text{OLS}}\) is that \[\text{MSE}_{\text{Te}}(\widehat{f}_{\text{OLS}})\geq \min_{\widehat{f}}\left\{\text{MSE}_{\text{Te}}(\widehat{f})\right\}\geq \text{Var}(\varepsilon).\]

In practice, we are free to approximate \(f\) with any learner \(\widehat{f}\). If we want \(\widehat{f}\) to be useful, however, we need to verify that it is a “decent” approximation.

There is another trade-off at play: when we restrict learners to specific families of functions,198 we typically also introduce a series of assumptions on the data.

The OLS assumptions are

  • linearity: the response variable is a linear combination of the predictors;

  • homoscedasticity: the error variance is constant for all predictor levels;

  • uncorrelated errors: the error is uncorrelated from one observation to the next;

  • full column rank for design matrix \(\mathbf{X}\): the predictors are not perfectly multi-collinear;

  • weak exogeneity: predictor values are free of measurement error.

Mathematically, the assumptions translate to \[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\] where \(\boldsymbol{\beta}\in \mathbb{R}^{p+1}\) is determined on a training set \(\text{Tr}\) without measurement error, and for which \[\text{E}[\boldsymbol{\varepsilon}\mid \mathbf{X}]=\mathbf{0} \quad\text{and}\quad\text{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\!\top}\mid \mathbf{X}]=\sigma^2I_{n}.\] Typically, although that is not a requirement, it is often further assumed that \[\boldsymbol{\varepsilon}\mid \mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^2I_n).\]

We will discuss how these assumptions can be generalized in Sections 12.2.3 and 12.5. In the meantime, however, how can we determine if the choice of model is valid? In the traditional statistical analysis context, there are a number of tests available to the analyst (we will discuss them shortly). In the machine learning context, there is only one real test:

does the model make good predictions?
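That criterion can be made concrete by holding out a test set and estimating \(\text{MSE}_\text{Te}\) directly; a minimal sketch on simulated data (all names and values are illustrative):

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 5 + 2*x + rnorm(n, sd = 1)   # the true relationship is linear; Var(eps) = 1

# random 75/25 train/test split
idx   <- sample(n, size = 150)
train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

fit <- lm(y ~ x, data = train)

# the held-out MSE estimates MSE_Te; it should be close to Var(eps) = 1
(mse.te <- mean((test$y - predict(fit, newdata = test))^2))
```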

12.2.2 Least Squares Properties

Let us assume that the OLS assumptions are satisfied. What can we say about the linear regression results? (see Module ?? and [34], say, for a refresher).

For the Gapminder example above, for instance, we could use R’s lm().

full.model = lm(life_expectancy~infant_mortality+fertility, data=gapminder.2011)
summary(full.model)

Call:
lm(formula = life_expectancy ~ infant_mortality + fertility, data = gapminder.2011)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.3233  -2.0057   0.2003   2.9570  10.6370 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       79.6759     0.7985  99.786   <2e-16 ***
infant_mortality  -0.2763     0.0248 -11.138   <2e-16 ***
fertility         -0.4440     0.4131  -1.075    0.284    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.172 on 163 degrees of freedom
Multiple R-squared:  0.7612,    Adjusted R-squared:  0.7583 
F-statistic: 259.8 on 2 and 163 DF,  p-value: < 2.2e-16

Coefficient of Determination

Let \[\text{SSRes}=\mathbf{Y}^{\!\top}[I_n-\mathbf{X}(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}]\mathbf{Y}=\mathbf{Y}^{\!\top}[I_n-\mathbf{H}]\mathbf{Y}\] and \[\text{SSTot}=\mathbf{Y}^{\!\top}\mathbf{Y}-n\overline{y}^2.\]

In the Gapminder example, we have:

(SSRes = as.vector(t(as.matrix(resp.Y)) %*% (diag(nrow(as.matrix(resp.Y))) - as.matrix(design.X) %*% solve(X.t.X) %*% t(as.matrix(design.X))) %*% as.matrix(resp.Y)))
[1] 2837.69
(SSTot = as.vector(t(as.matrix(resp.Y)) %*% as.matrix(resp.Y) - nrow(as.matrix(resp.Y))*(mean(resp.Y))^2))
[1] 11882.18

The coefficient of determination of the OLS regression is the quotient \[R^2=\frac{\text{SSTot}-\text{SSRes}}{\text{SSTot}}=\frac{\text{Cov}^2(\mathbf{Y},\mathbf{X}\hat{\boldsymbol{\beta}})}{\sigma^2_{y}\sigma^2_{\widehat{y}}}.\]

In the Gapminder example, we have:

(R.2 = 1-SSRes/SSTot)
[1] 0.761181

The coefficient of determination identifies the proportion of the variation of the data which is explained by the linear regression. As such, \(0\leq R^2\leq 1\).

If \(R^2\approx 0\), then the predictor variables have little explanatory power on the response; if \(R^2\approx 1\), then the linear fit is deemed to be “good”, as a lot of the variability in the response is explained by the predictors. In practice, the number of predictors also affects the goodness-of-fit (this is related to the curse of dimensionality discussed previously).

The quantity \[R_a^2=1-\frac{n-1}{n-p-1}(1-R^2)=1-\frac{\text{SSRes}/(n-p-1)}{\text{SSTot}/(n-1)}\] is the adjusted coefficient of determination of the linear regression. While \(R_a^2\) can be negative, it is always smaller than \(R^2\). It also plays a role in the feature selection process.

In the Gapminder example, we have:

(R.a.2 = 1-(nrow(as.matrix(resp.Y))-1)/(nrow(as.matrix(resp.Y))-nrow(X.t.X))*(1-R.2))
[1] 0.7582507

which suggests that a fair proportion of the variability in the life expectancy (about 75.8%) is explained by infant mortality and fertility.

Significance of Regression

We can determine if at least one of the predictors \(X_1,\ldots,X_p\) is useful in predicting the response \(Y\) by pitting \[H_0:(\beta_1,\ldots,\beta_p)=\mathbf{0}\quad\text{against}\quad H_1:(\beta_1,\ldots,\beta_p)\neq \mathbf{0}.\] Under the null hypothesis \(H_0\), the \(F-\)statistic \[F^*=\frac{(\text{SSTot}-\text{SSRes})/p}{\text{SSRes}/(n-p-1)}\sim F_{p,n-p-1}.\] At significance level \(\alpha\), if \(F^*\geq F_{p,n-p-1;\alpha}\) (the \(1-\alpha\) quantile of the \(F\) distribution with \(p\) and \(n-p-1\) degrees of freedom), then we reject the null hypothesis in favour of the alternative.

In the Gapminder model \[Y=79.677-0.276X_1-0.443X_2+ \varepsilon,\quad n=166,p=2,\] we have:

(F.star = ((SSTot-SSRes)/(nrow(X.t.X)-1))/(SSRes/(nrow(as.matrix(resp.Y))-nrow(X.t.X))))
[1] 259.7627

At a significance level \(\alpha=0.05\), the critical value of the \(F_{2,163}\) distribution is:

qf(0.95, df1=2, df2=163)
[1] 3.051819

Since \(F^*\geq F_{2,163;0.05}\), we reject \(H_0\) at significance level \(\alpha=0.05\): there is strong evidence that at least one of \(\beta_1,\beta_2\) is non-zero.

Interpretation of the Coefficients

For \(j=1,\ldots,p\), the coefficient \(\beta_j\) is the average effect on \(Y\) of a 1-unit increase in \(X_j\), holding all other predictors fixed. Ideally, the predictors are uncorrelated (such as would be the case in a balanced design [33]). Each coefficient can then be tested (and estimated) separately, and the above interpretation is at least reasonable in theory.

In practice, however, we cannot always control the predictor variables, and it might be impossible to “hold all other predictors fixed.” When the predictors are correlated, there are potential variance inflation issues for the estimated regression coefficients, and the interpretation is risky, since when \(X_j\) changes, so do the other predictors.199 More importantly, the interpretation can also be read as a claim of causality, which should be avoided when dealing with observational data.

“The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.” (paraphrased from [240])

In the Gapminder example, the correlation between the variables \(X_1\) and \(X_2\) is:

cor(gapminder.2011$infant_mortality, gapminder.2011$fertility)
[1] -0.8714863

The predictors are thus strongly correlated, and the standard interpretation is not available to us.

Hypothesis Testing

We can also determine if a specific predictor \(X_j\) is useful in predicting the response \(Y\), by testing for \[H_0:\beta_j=0\quad\text{against}\quad H_1:\beta_j\neq 0.\] Under the null hypothesis \(H_0\), the test statistic \[t^*=\frac{\widehat{\beta}_j}{\text{se}(\widehat{\beta}_j)}\sim T_{n-p-1},\] where \(\text{se}(\widehat{\beta}_j)=\sqrt{\widehat{\sigma}^2(\mathbf{X}^{\!\top}\mathbf{X})^{-1}_{j,j}}\), \(\widehat{\sigma}^2=\frac{\text{SSRes}}{n-p-1}\), and \(T_{n-p-1}\) is the Student \(T\) distribution with \(n-p-1\) degrees of freedom.

At a significance level \(\alpha\), if \(|t^*|\geq |t_{n-p-1;\alpha/2}|\) (the \(1-\alpha/2\) quantile of the \(T\) distribution with \(n-p-1\) degrees of freedom), then we reject the null hypothesis in favour of the alternative.

In the Gapminder model, we have \(n=166\) and \(p=2\), so that

(sigma2.hat = SSRes/(nrow(as.matrix(resp.Y))-nrow(X.t.X)))
[1] 17.40914

and, from the model summary, \(\text{se}(\widehat{\beta}_1)=0.0248\), so that \[t^*=\frac{\widehat{\beta}_1}{\text{se}(\widehat{\beta}_1)}=\frac{-0.2763}{0.0248}\approx -11.138.\]

At a significance level \(\alpha=0.05\), the critical value of the \(T_{163}\) distribution is

qt(0.025, df=163)
[1] -1.974625

Since \(|t^*|\geq |t_{163;0.025}|\), we reject \(H_0\) at \(\alpha=0.05\): there is strong evidence that \(\beta_1\neq 0\).

Confidence Intervals

The standard error of \(\widehat{\beta}_j\) reflects how the estimate would vary under various \(\text{Tr}\); it can be used to compute a \(100(1-\alpha)\%\) confidence interval for the true \(\beta_j\): \[\text{C.I.}(\beta_j;\alpha)\equiv \widehat{\beta}_j\pm z_{\alpha/2}\cdot \text{se}(\widehat{\beta}_j);\] at \(\alpha=0.05\), \(z_{\alpha/2}=1.96\approx 2\), so that \[\text{C.I.}(\beta_j;0.05)\equiv \widehat{\beta}_j\pm 2\text{se}(\widehat{\beta}_j).\]

In the Gapminder example, we have

  coeff.        est.         s.e.         \(\mathbf{t^*}\)   95% C.I.
  \(\beta_0\)   \(79.677\)   \(0.7985\)   \(99.786\)         \([78.1,81.3]\)
  \(\beta_1\)   \(-0.276\)   \(0.0248\)   \(-11.138\)        \([-0.33,-0.23]\)
  \(\beta_2\)   \(-0.443\)   \(0.4131\)   \(-1.075\)         \([-1.27,0.38]\)

In frequentist statistics, the confidence interval has a particular interpretation – it does not mean, as one might wish, that there is a 95% chance, say, that the true \(\beta_j\) is found in the \(\text{C.I.}\); rather, it suggests that the approach used to build the 95% \(\text{C.I.}\) will yield an interval in which the true \(\beta_j\) will reside approximately 95% of the time.200

The resulting confidence intervals also depend on the underlying model. For instance, the 95% \(\text{C.I.}\) for \(\beta_1\) in the full model is \([-0.33,-0.23]\) (see above), whereas the corresponding \(\text{C.I.}\) in the reduced model \[\widehat{Y}=\gamma_0+\gamma_1X_1\] is \([-0.33,-0.27]\).

The estimates are necessarily distinct as well:

reduced.model = lm(life_expectancy~infant_mortality, data=gapminder.2011)
summary(reduced.model)

Call:
lm(formula = life_expectancy ~ infant_mortality, data = gapminder.2011)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.9729  -1.9716   0.1726   2.9727  11.0275 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      78.99279    0.48357  163.35   <2e-16 ***
infant_mortality -0.29888    0.01313  -22.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.174 on 164 degrees of freedom
Multiple R-squared:  0.7595,    Adjusted R-squared:  0.758 
F-statistic: 517.9 on 1 and 164 DF,  p-value: < 2.2e-16

Note that \(\widehat{\beta}_1=-0.2763\neq -0.2989=\widehat{\gamma}_1\).
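In R, such intervals are produced by confint(); a minimal sketch on simulated data (all names and coefficient values are illustrative), showing that the interval for a shared coefficient depends on which model is fit:

```r
set.seed(2)
n  <- 166
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.2)        # x2 is strongly correlated with x1
y  <- 80 - 0.3*x1 - 0.4*x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)

confint(full, level = 0.95)          # interval for the x1 coefficient...
confint(reduced, level = 0.95)       # ...shifts once x2 is dropped
```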

Feature Selection

How would we determine if all the predictors help explain the response \(Y\), or if only a (proper) subset of the predictors is needed? The most direct approach to solve this problem (in the linear regression context) is to run best subsets regression.

The procedure is as follows: fit an OLS model for all possible subsets of predictors and select the optimal model based on a criterion that balances training error with model size.

There are \(2^{p+1}\) such models (a quantity that quickly becomes unmanageable). In practice, we need to automate and speed up the search through a subset of predictor subsets. OLS approaches include forward selection and backward selection (we discuss these in detail in Feature Selection and Dimension Reduction).

Forward selection is a bottom-up approach:

  1. start with the null model \(\mathcal{M}_0: Y=\beta_0+\varepsilon\);

  2. fit \(p\) simple linear regressions \(Y=\beta_0+\beta_jX_j+\varepsilon\) and add to the null model the predictor \(X_{j_1}\) resulting in the lowest \(\text{SSRes}\): \[\mathcal{M}_1:Y=\beta_0+\beta_{j_1}X_{j_1}+\varepsilon;\]

  3. add to that model the variable that results in the lowest \(\text{SSRes}\) among all the two-variable models: \[\mathcal{M}_2:Y=\beta_0+\beta_{j_1}X_{j_1}+\beta_{j_2}X_{j_2}+\varepsilon;\]

  4. the process continues until a stopping criterion is met.

Backward selection is a top-down approach, and it works in reverse, removing predictors from the full model.

In both approaches, there are at most \[p+(p-1)+\cdots+2+1=\frac{p(p+1)}{2}\ll 2^{p+1}~~\text{(when $p$ is large)}\] regressions to run. These methods are, frankly, not ideal in the machine learning framework (see Section 12.2.4 for alternatives).
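Both searches can be sketched with base R’s step(), which uses AIC rather than raw \(\text{SSRes}\) as its stopping criterion (the data and variable names below are illustrative):

```r
set.seed(3)
n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
X$y <- 1 + 2*X$x1 - 3*X$x2 + rnorm(n)   # x3 is pure noise

null.model <- lm(y ~ 1, data = X)       # intercept-only model M_0
full.scope <- y ~ x1 + x2 + x3

# forward selection: grow from the null model, one predictor at a time
fwd <- step(null.model, scope = full.scope, direction = "forward", trace = 0)

# backward selection: prune the full model instead
bwd <- step(lm(full.scope, data = X), direction = "backward", trace = 0)

names(coef(fwd))   # typically retains x1 and x2
```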

Other Questions

  • How do we handle qualitative variables? (dummy binary variables);

  • How do we handle interaction terms? (add features);

  • How do we handle outliers? (median regression, Theil-Sen estimate);

  • How do we handle non-constant variance of error terms? (data transformations, weighted least square regression, Bayesian regression);

  • How do we handle high-leverage observations? (robust regression);

  • How do we handle collinearity? (principal component analysis, generalized linear models, partial least squares regression);

  • How do we handle multiple tests? (Bonferroni correction: for \(q\) independent tests with the same data, set significance level to \(\alpha/q\) to get joint significance equivalent to \(\alpha\) for a single test).
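For instance, the Bonferroni adjustment is available through base R’s p.adjust() (the p-values below are hypothetical):

```r
# hypothetical p-values from q = 4 separate tests on the same data
p.values <- c(0.012, 0.045, 0.003, 0.200)

# Bonferroni: multiply each p-value by q (capped at 1), compare to alpha
p.adjust(p.values, method = "bonferroni")
# 0.048 0.180 0.012 0.800

# equivalently, compare the raw p-values to alpha/q = 0.05/4
p.values < 0.05/4
```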

12.2.3 Generalizations of OLS

The OLS assumptions are convenient from a mathematical perspective, but they are not always met in practice.

One way out of this problem is to use remedial measures to transform the data into a compliant set; another one is to extend the assumptions and to work out the corresponding mathematical formalism:

  • generalized linear models (GLM) implement responses with non-normal conditional distributions;

  • classifiers (logistic regression, decision trees, support vector machines, naı̈ve Bayes, neural networks) extend regression to categorical responses (see Module 13);

  • non-linear methods such as splines, generalized additive models (GAM), nearest neighbour methods, kernel smoothing methods are used for responses that are not linear combinations of the predictors (see Section @(RVE-NM));

  • tree-based methods and ensemble learning methods (bagging, random forests, boosting) are used for predictor interactions (see Module 13);

  • regularization methods (ridge regression, LASSO, elastic net) facilitate the process of model selection and feature selection (see Section 12.2.4).

Generalized Linear Models

GLM extend the least square paradigm by accommodating response variables with non-normal conditional distributions. Apart from the error structure, a GLM is essentially a linear model: \[Y_i\sim \mathcal{D}(\mu_i),\quad\text{where }g(\mu_i)=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}.\]

A GLM consists of:

  • a systematic component \(\mathbf{x}_i^{\!\top}\boldsymbol{\beta}\);

  • a random component specified by the distribution \(\mathcal{D}\) for \(Y_i\), and

  • a link function \(g\).

The systematic component is specified in terms of the linear predictor for the \(i^{\text{th}}\) observation \(\eta_i=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}\); the general ideas and concepts of OLS carry over to GLM, with the added presence of the link function and the distribution of the response \(y_i\).

In principle, the link function \(g\) could be any function linking the linear predictor \(\eta_i\) to the distribution of the response \(Y_i\); in practice, however, \(g\) should be smooth (or at least differentiable) and monotonic (and so invertible).

We could specify any distribution \(\mathcal{D}\) for the response \(Y_i\), but they are usually selected from the exponential family of distributions.201 OLS is an example of GLM, with:

  • systematic component \(\eta_i=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}\);

  • random component \(Y_i\sim \mathcal{N}(\mu_i,\sigma^2)\);

  • link function \(g(\mu)=\mu\).

For a more substantial example, consider the following situation. In the early stages of a rumour spreading, the rate at which new individuals learn the information increases exponentially over time. If \(\mu_i\) is the expected number of people who have heard the rumour on day \(t_i\), a model of the form \(\mu_i=\gamma\exp(\delta t_i)\) might be appropriate: \[\underbrace{\ln(\mu_i)}_{\text{link}}=\ln \gamma+\delta t_i=\beta_0+\beta_1t_i=\underbrace{(1,t_i)^{\!\top}(\beta_0,\beta_1)}_{\text{systematic component}}.\] Furthermore, since we measure a count of individuals, the Poisson distribution could be a reasonable choice: \[Y_i\sim \underbrace{\strut\text{Poisson}(\mu_i),}_{\text{random component}}\quad \ln(\mu_i)=(1,t_i)^{\!\top}(\beta_0,\beta_1).\]
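A sketch of this rumour model in R, using simulated daily counts (the parameter values \(\gamma=5\), \(\delta=0.15\) are illustrative):

```r
set.seed(4)
days <- 1:30

# simulate counts that grow exponentially: mu = gamma * exp(delta * t)
mu    <- 5 * exp(0.15 * days)
heard <- rpois(length(days), lambda = mu)

# Poisson GLM with the log link: ln(mu_i) = beta0 + beta1 * t_i
rumour.glm <- glm(heard ~ days, family = poisson(link = "log"))

coef(rumour.glm)   # beta0 ~ ln(5) ~ 1.61, beta1 ~ 0.15
```

Note that \(\widehat{\gamma}=\exp(\widehat{\beta}_0)\) and \(\widehat{\delta}=\widehat{\beta}_1\) recover the original parametrization.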

The main advantages of GLM are that:

  • there is no need to transform the response \(Y\) if it does not follow a normal distribution;

  • if the link produces additive effects, the assumption of homoscedasticity does not need to be met;

  • the choice of the link is separate from the choice of random component, providing modeling flexibility;

  • models are still fitted via a maximum likelihood procedure;

  • inference tools and model checks (Wald ratio test, likelihood ratio test, deviance, residuals, \(\text{C.I.}\), etc.) still apply;

  • they are easily implemented in SAS (proc genmod), R (glm()), etc., and

  • the framework unites various regression modeling approaches (OLS, logistic, Poisson, etc.) under a single umbrella.

12.2.4 Shrinkage Methods

We will discuss the curse of dimensionality (CoD), subset selection, and dimension reduction in Module 15. Another approach to dealing with high-dimensionality is provided by the least absolute shrinkage and selection operator (LASSO) and its variants.

In what follows, assume that the training set consists of \(N\) centered and scaled observations \(\mathbf{x}_i=(x_{i,1},\cdots,x_{i,p})\), together with target observations \(y_i\).

Let \(\widehat{\beta}_{\textrm{OLS},j}\) be the \(j\)th OLS coefficient, and set a threshold \(\lambda>0\), whose value depends on the training dataset \(\text{Tr}\). Recall that \(\widehat{\boldsymbol{\beta}}_{\textrm{OLS}}\) is the exact solution to the OLS problem \[\widehat{\boldsymbol{\beta}}_{\textrm{OLS}}=\arg\min_{\boldsymbol{\beta}}\{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2\}=\arg\min_{\boldsymbol{\beta}}\{\text{SSRes}\}.\] In general, no restrictions are assumed on the values of the coefficients \(\widehat{\beta}_{\textrm{OLS},j}\) – large magnitudes imply that corresponding features play an important role in predicting the target. This observation forms the basis of a series of useful OLS variants.

Ridge Regression

Ridge regression (RR) is a method to regularize the OLS regression coefficients. Effectively, it shrinks the OLS coefficients by penalizing solutions with large magnitudes – a coefficient retains a large magnitude only if it contributes substantially to predicting the target variable.

This leads to a modified OLS problem: \[\widehat{\boldsymbol{\beta}}_{\textrm{RR}}=\arg\min_{\boldsymbol{\beta}}\{\underbrace{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2}_{\text{SSRes}}+\underbrace{N\lambda \|\boldsymbol{\beta}\|_2^2}_{\text{shrinkage penalty}}\}.\] This quantity is small when \(\text{SSRes}\) is small (i.e., the model is a good fit to the data) and when the shrinkage penalty is small (i.e., when each \(\beta_j\) is small). RR solutions are typically obtained via numerical methods.202

The hyperparameter \(\lambda\) controls the relative impact of both components. If \(\lambda\) is small, then the shrinkage penalty is small even if the individual coefficients \(\beta_j\) are large; if \(\lambda\) is large, then the shrinkage penalty is only small when all coefficients \(\beta_j\) are small (see Figure 12.9 for an illustration).


Figure 12.9: Ridge regression coefficients in a generic problem (top); note how the coefficients converge to 0 when the threshold lambda increases (left); the ratio between the magnitude of the ridge regression parameter and the corresponding OLS parameter is shown on the right. [3]

Setting the “right” value for \(\lambda\) is crucial; it can be done via cross-validation (see [3, pp. 227–228] and Section 12.3.1 for details). The OLS estimates are scale-equivariant: if \(\widehat{\beta}_j\) is the estimate for the coefficient \(\beta_j\) of \(X_j\), then \(\frac{\widehat{\beta}_j}{c}\) is the estimate for the coefficient of the scaled variable \(cX_j\). RR coefficients do not have this property, however, which is why the dataset must be centered and scaled to start with.

Finally, note that RR estimates trade a small increase in bias for a potentially large reduction in variance, which helps to reduce issues related to overfitting (even if they do not reduce the dimensions of the dataset).
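The shrinkage effect can be shown directly: for the quadratic penalty, the modified problem has the closed-form solution \(\widehat{\boldsymbol{\beta}}_{\textrm{RR}}=(\mathbf{X}^{\!\top}\mathbf{X}+N\lambda I_p)^{-1}\mathbf{X}^{\!\top}\mathbf{Y}\). A minimal sketch on simulated, centered and scaled data follows (in practice, one would typically use a package such as glmnet, where alpha = 0 selects the ridge penalty):

```r
set.seed(5)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # centered and scaled predictors
y <- X %*% c(3, -2, 1, 0, 0) + rnorm(n)
y <- y - mean(y)                         # center the response as well

# closed-form ridge solution: (X'X + n*lambda*I)^{-1} X'y
ridge <- function(lambda) solve(t(X) %*% X + n * lambda * diag(p), t(X) %*% y)

# coefficients shrink toward 0 as lambda grows
sum(abs(ridge(0)))    # OLS solution (lambda = 0)
sum(abs(ridge(10)))   # heavily shrunk
```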

Regression With Best Subset Selection

Regression with best subset selection runs on the same principle but uses a different penalty term, which effectively sets some of the coefficients to 0 (this could be used to select the features with non-zero coefficients, potentially).

The problem consists in solving another modified version of the OLS scenario, namely \[\widehat{\boldsymbol{\beta}}_{\textrm{BS}}=\arg\min_{\boldsymbol{\beta}}\{\underbrace{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2}_{\text{SSRes}}+\underbrace{N\lambda \|\boldsymbol{\beta}\|_0}_{\text{shrinkage}}\},\qquad \|\boldsymbol{\beta}\|_0=\sum_j\textrm{sgn}(|\beta_j|).\] Solving the BS problem typically (also) requires numerical methods and cross-validation.203

A slight modification to the RR shrinkage penalty can overcome this lack of sparsity.

LASSO
The LASSO is an alternative to RR obtained by solving \[\widehat{\boldsymbol{\beta}}_{\textrm{L}}=\arg\min_{\boldsymbol{\beta}}\{\underbrace{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2}_{\text{SSRes}}+\underbrace{N\lambda \|\boldsymbol{\beta}\|_1}_{\text{shrinkage}}\};\] the penalty effectively yields coefficients which combine the properties of RR and BS, selecting at most \(\min\{p,N\}\) features, and usually no more than one per group of highly correlated variables (the other coefficients are forced down to \(0\) when \(\lambda\) is large enough, see Figure 12.10 for an illustration).204


Figure 12.10: LASSO coefficients in a generic problem; note how the coefficients goes directly to 0 after a certain threshold lambda (left); the ratio between the magnitude of the LASSO parameter and the corresponding OLS parameter is shown on the right. [3]

Why do we get \(\widehat{\beta}_{\text{L},j}=0\) for some \(j\), but not for the RR coefficients? The RR and LASSO formulations are equivalent to \[\begin{aligned} \widehat{\boldsymbol{\beta}}_{\textrm{RR}}&=\arg\min_{\boldsymbol{\beta}}\{\text{SSRes}\mid \|\boldsymbol{\beta}\|_2^2\leq s\}\ \text{for some $s$} \\ \widehat{\boldsymbol{\beta}}_{\textrm{L}}&=\arg\min_{\boldsymbol{\beta}}\{\text{SSRes}\mid \|\boldsymbol{\beta}\|_1\leq s\}\ \text{for some $s$}\end{aligned}\]
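The mechanism can also be seen analytically in the special case of an orthonormal design, where both problems decouple coordinate-wise. Writing \(\widehat{\beta}_j\) for the \(j\)th OLS coefficient, one common convention (absorbing the constants into \(\lambda\)) gives \[\widehat{\beta}_{\text{RR},j}=\frac{\widehat{\beta}_j}{1+\lambda},\qquad \widehat{\beta}_{\text{L},j}=\textrm{sgn}(\widehat{\beta}_j)\left(|\widehat{\beta}_j|-\lambda/2\right)_+;\] ridge regression rescales every coefficient by the same factor (never reaching \(0\) for finite \(\lambda\)), whereas the LASSO's soft-thresholding sets \(\widehat{\beta}_{\text{L},j}=0\) outright whenever \(|\widehat{\beta}_j|\leq \lambda/2\).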

Graphically, this looks like the images shown in Figure 12.11.


Figure 12.11: Level curves and neighbourhoods for LASSO (left) and ridge regression (right). [3]

The RR coefficients \(\widehat{\boldsymbol{\beta}}_{\textrm{RR}}\) are found at the first intersection of the ellipses of constant \(\text{SSRes}\) around the OLS coefficients \(\widehat{\boldsymbol{\beta}}\) with the \(2-\)ball \(\|\boldsymbol{\beta}\|_{2}^2\leq s\); that intersection usually lies away from the axes (the ball has no "sharp" corners), which is not usually the case for intersections with the \(1-\)ball \(\|\boldsymbol{\beta}\|_1\leq s\).

The LASSO thus typically produces simpler models, but predictive accuracy matters too (in the form of \(\text{MSE}_\text{Te}\), say). Depending on the data, either of the two approaches can be optimal, in keeping with the No Free Lunch Theorem.


If the response is related to a relatively small number of predictors (whether this is the case or not is not something we usually know a priori), LASSO is recommended. The use of other penalty functions (or combinations thereof) provides various extensions, such as: elastic nets; group, fused and adaptive lassos; bridge regression, etc.

The modifications described above were defined assuming an underlying linear regression model, but they generalize to arbitrary regression/classification models as well. For a loss (cost) function \(\mathcal{L}\left(\mathbf{Y},\mathbf{\widehat{y}}(\mathbf{W})\right)\) between the actual target and the values predicted by the model parameterized by \(\mathbf{W}\), and a penalty vector \(\mathbf{R}(\mathbf{W})=\left(R_{1}(\mathbf{W}),\cdots,R_{k}(\mathbf{W})\right)^{\!\top}\), the regularized parametrization \(\mathbf{W}^*\) solves the general regularization problem \[\mathbf{W}^*= \textrm{arg}\min_{\mathbf{W}} \{\mathcal{L}\left(\mathbf{Y},\mathbf{\widehat{y}}(\mathbf{W})\right)+N\mathbf{\lambda}^{\!\top}\mathbf{R}(\mathbf{W})\},\] which can be solved numerically, assuming some nice properties on \(\mathcal{L}\) and \(\mathbf{R}\) [241]; as before, cross-validation can be used to determine the optimal vector \(\mathbf{\lambda}\) [2].
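As a concrete illustration (not from the text), the general problem can be attacked numerically with base R's optim(); here the loss \(\mathcal{L}\) is the squared-error loss and the penalty is \(R(\mathbf{W})=\|\mathbf{W}\|_2^2\) (a single ridge-type penalty, so \(\boldsymbol{\lambda}\) reduces to a scalar), on simulated data:

```r
set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)
W.true <- c(3, -2, 0, 0, 1)
Y <- X %*% W.true + rnorm(N)

lambda <- 0.5
# penalized objective: L(Y, yhat(W)) + N * lambda * R(W), with R(W) = ||W||_2^2
objective <- function(W) sum((Y - X %*% W)^2) + N * lambda * sum(W^2)

# general-purpose numerical minimization of the regularized objective
W.star <- optim(par = rep(0, p), fn = objective, method = "BFGS")$par
W.ols  <- coef(lm(Y ~ X - 1))

# the regularized coefficients are shrunk towards 0 relative to OLS
cbind(OLS = W.ols, regularized = W.star)
```

For this particular loss and penalty a closed-form solution exists (it is just RR), but the optim() approach carries over unchanged to losses and penalties without one, provided they are smooth enough for the chosen method.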

Gapminder Example

In R, regularization is implemented in the package glmnet (among others). In glmnet() the parameter alpha controls the elastic net mixture: LASSO (alpha = 1), RR (alpha = 0).

Say we are interested in modeling life expectancy \(Y\) in the 2011 Gapminder dataset as a function of population, infant mortality, fertility, gdp, and continental membership (we use the entire set as a training set \(\text{Tr}\)).

A priori, an OLS model on this data would take the form \[\begin{aligned}Y=\alpha_0&+\alpha_1\text{population}+\alpha_2\text{infant mortality}+\alpha_3\text{fertility}+ \alpha_4\text{gdp} \\ &+ \alpha_5\text{Africa}+\alpha_6\text{Americas}+\alpha_7\text{Asia}+\alpha_8\text{Europe}+\alpha_9\text{Oceania}.\end{aligned}\]

We run a LASSO regression on the data, using \(\text{MSE}_{\text{Tr}}\) as the evaluation metric. We start by creating dummy variables for the continents:

gapminder.2011.f <- fastDummies::dummy_cols(gapminder.2011, select_columns = 'continent')

Next, we select the appropriate variables for the response and the training set, and scale and center the data (it must be in a matrix format to be compatible with glmnet()):

y <- gapminder.2011.f |> select(life_expectancy) |> as.matrix()
x <- gapminder.2011.f |> select(c("population","infant_mortality","fertility","gdp",
      "continent_Africa","continent_Americas","continent_Asia",
      "continent_Europe","continent_Oceania")) |> 
      scale(center = TRUE, scale = TRUE) |> as.matrix()

Finally, we run the regression and extract the LASSO coefficients for hyperparameter \(\lambda=1\):

glmnet1 <- glmnet::glmnet(x = x, y = y, alpha = 1)
coef(glmnet1, s = 1)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)        70.82349398
population          .         
infant_mortality   -5.57897055
fertility           .         
gdp                 .         
continent_Africa   -1.13074639
continent_Americas  .         
continent_Asia      .         
continent_Europe    .         
continent_Oceania  -0.03096299

Thus \[\begin{aligned} Y&=70.82-5.58(\text{infant mortality}) -1.13(\text{Africa}) -0.03(\text{Oceania}).\end{aligned}\]

For RR (\(\alpha=0\)), we obtain, with the same hyperparameter \(\lambda=1\):

glmnet0 <- glmnet::glmnet(x = x, y = y, alpha = 0)
coef(glmnet0, s = 1)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)        70.8234940
population         -0.3471671
infant_mortality   -4.4002779
fertility          -0.6348077
gdp                 0.5803223
continent_Africa   -1.6275714
continent_Americas  0.5475769
continent_Asia      0.6117358
continent_Europe    1.0141934
continent_Oceania  -0.6855980

which is to say: \[\begin{aligned} Y&=70.82-0.34(\text{population})-4.4(\text{infant mortality}) -0.63(\text{fertility}) +0.58(\text{gdp})\\ & -1.62(\text{Africa}) +0.55(\text{Americas})+0.61(\text{Asia})+1.01(\text{Europe}) -0.68(\text{Oceania}),\end{aligned}\] which is compatible with the above discussion.

The values of the coefficients themselves are not as important as their signs and the fact that they are roughly similar in both models.

It is important to note, however, that the choice of \(\lambda=1\) was arbitrary, and that we have not been evaluating the result on test data \(\text{Te}\). We will revisit these issues in Cross-Validation.
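As a small preview of that discussion, here is a minimal base R sketch (not from the text; the simulated data and the \(\lambda\) grid are arbitrary) that computes the RR solution in closed form, \(\widehat{\boldsymbol{\beta}}_{\lambda}=(\mathbf{X}^{\!\top}\mathbf{X}+N\lambda \mathbf{I})^{-1}\mathbf{X}^{\!\top}\mathbf{Y}\), and selects the \(\lambda\) with the smallest 5-fold validation MSE:

```r
set.seed(42)
N <- 120; p <- 4
X <- scale(matrix(rnorm(N * p), N, p))
Y <- X %*% c(2, -1, 0, 0.5) + rnorm(N)

# closed-form RR solution for the penalty N * lambda * ||beta||_2^2
ridge <- function(X, Y, lambda)
  solve(t(X) %*% X + nrow(X) * lambda * diag(ncol(X)), t(X) %*% Y)

k <- 5
folds <- sample(rep(1:k, length.out = N))   # random fold assignment
lambdas <- 10^seq(-3, 1, length.out = 30)

# average held-out MSE across the k folds, for each candidate lambda
cv.mse <- sapply(lambdas, function(lam) {
  mean(sapply(1:k, function(f) {
    beta <- ridge(X[folds != f, ], Y[folds != f], lam)
    mean((Y[folds == f] - X[folds == f, ] %*% beta)^2)
  }))
})

lambda.best <- lambdas[which.min(cv.mse)]
lambda.best   # hyperparameter chosen by 5-fold cross-validation
```

In practice, glmnet provides cv.glmnet() to automate exactly this search over \(\lambda\).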


T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
H. Sahai and M. I. Ageel, The Analysis of Variance: Fixed, Random and Mixed Models. Birkhäuser, 2000.
M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models. McGraw Hill Irwin, 2004.
G. E. P. Box, “Use and abuse of regression,” Journal of Technometrics, vol. 8, no. 4, pp. 625–629, Nov. 1966.
T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity : the LASSO and Generalizations. CRC Press, 2015.