## 12.2 Regression Modeling

Recall that, in the regression setting, the goal is to estimate the regression function \[f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}],\] the solution to the regression problem \[Y=f(\vec{X})+\varepsilon.\] The best estimate \(\widehat{f}\) is the model that minimizes \[\text{MSE}_{\text{Te}}(\widehat{f})=\text{Avg}_{\mathbf{x}^*\in Te}\text{E}\left[(y^*-\widehat{f}(\mathbf{x}^*))^2\right].\]

In practice, this can be hard to achieve without restrictions on the functional form of \(\widehat{f}\), so we try to **learn** the best \(\widehat{f}\) from **various families of models**. Remember, however, that no matter the \(\widehat{f}\), \[\text{MSE}_{\text{Te}}(\widehat{f})\geq \text{Var}(\varepsilon).\] What else can we say about \(\widehat{f}\)?

In the **ordinary least squares framework** (OLS), we assume that \[\widehat{f}_{\text{OLS}}(\mathbf{x})\approx \mathbf{x}^{\!\top}\boldsymbol{\beta},\] which is to say that we assume that \(\widehat{f}_{\text{OLS}}\) is nearly globally linear (in one interpretation, we neglect the intercept term).

In practice, the true regression function is almost never linear, but the linear assumption yields models \(\widehat{f}\) that are both **conceptually** and **practically** useful – the model \(\widehat{f}\) is easily interpretable, and the associated prediction error \(\text{MSE}_{\text{Te}}(\widehat{f})\) is often “small-ish”.

The most common data modeling methods are **linear and logistic regression methods**. By some estimates, 90% of real-world data applications end up using these as their final model, typically after very careful data preparation (cleaning, encoding, creation of new variables, transformation of variables, etc.).

That is mostly due to:

- these regression models being straightforward to **interpret** and to **train**;
- the prediction error \(\text{MSE}_\text{Te}\) having a closed-form expression, and
- the OLS solution being computable using simple matrix manipulations.

#### Gapminder Example

Let us revisit the `Gapminder` dataset, focusing on observations from 2011.

- Is there a relationship between gross domestic product and life expectancy?
- How strong is the relationship?
- Which factors contribute to life expectancy?
- How accurately could we predict life expectancy for a new set of predictor values?
- Is the relationship linear?
- Are there combinations of factors that are linked with life expectancy?

Can the scatterplots of various predictors against life expectancy for the 2011 Gapminder dataset, shown below with lines of best fit, be used to answer these questions?

```
gapminder.2011 <- gapminder.ML |> filter(year==2011)
attach(gapminder.2011)
x=population
y=life_expectancy
plot(x,y, xlab="Population", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)
```

```
x=infant_mortality
y=life_expectancy
plot(x,y, xlab="Infant Mortality", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)
```

```
x=fertility
y=life_expectancy
plot(x,y, xlab="Fertility", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)
```

```
x=gdp
y=life_expectancy
plot(x,y, xlab="Gross Domestic Product", ylab="Life Expectancy", pch=20)
abline(lm(y~x), col="red",lwd=3)
```

### 12.2.1 Formalism

Consider a dataset \(\text{Tr}=\{(\vec{x_1},y_1),\ldots, (\vec{x_n},y_n)\}\) with \(n\) observations and \(p\) features. The corresponding **design matrix**, **response vector**, and **coefficient vector** are, respectively,
\[\mathbf{X} =\begin{pmatrix}1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p}\end{pmatrix},\quad \mathbf{Y}=\begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix},\quad \boldsymbol{\beta}=\begin{pmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{pmatrix}.\]
The objective is to find \(f\) such that \(\mathbf{Y}=f(\mathbf{X})+\boldsymbol{\varepsilon}\). The OLS solution assumes that \(f(\mathbf{X})=\mathbf{X}\boldsymbol{\beta}\); we must thus learn \(\boldsymbol{\beta}\) using the training data \(\text{Tr}\).

If \(\hat{\boldsymbol{\beta}}\) is the estimate of the true coefficient vector \(\boldsymbol{\beta}\), the **linear regression model** associated with \(\text{Tr}\) is \[\widehat{f}(\mathbf{x})=\widehat{\beta}_0+\widehat{\beta}_1x_1+\cdots+\widehat{\beta}_px_p.\]
How do we find \(\hat{\boldsymbol{\beta}}\)? The OLS estimate minimizes the **loss function** \[\begin{aligned} \mathcal{L}(\boldsymbol{\beta})&=\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2=(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^{\!\top}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{Y}^{\!\top}\mathbf{Y}-((\mathbf{X}\boldsymbol{\beta})^{\!\top}\mathbf{Y}+\mathbf{Y}^{\!\top}\mathbf{X}\boldsymbol{\beta})+(\mathbf{X}\boldsymbol{\beta})^{\!\top}\mathbf{X}\boldsymbol{\beta}\\&=\mathbf{Y}^{\!\top}\mathbf{Y}-(\boldsymbol{\beta}^{\!\top}\mathbf{X}^{\!\top}\mathbf{Y}+\mathbf{Y}^{\!\top}\mathbf{X}\boldsymbol{\beta})+\boldsymbol{\beta}^{\!\top}\mathbf{X}^{\!\top}\mathbf{X}\boldsymbol{\beta}.\end{aligned}\]
The loss function is a non-negative symmetric quadratic form in \(\boldsymbol{\beta}\), with no restriction on the coefficients, so any minimizer of \(\mathcal{L}\) must also be one of its critical points (assuming certain regularity conditions on the data). We are thus looking for coefficients for which \(\nabla\mathcal{L}(\boldsymbol{\beta})=\mathbf{0}\). Since \[\nabla \mathcal{L}(\boldsymbol{\beta})=-2(\mathbf{X}^{\!\top}\mathbf{Y}-\mathbf{X}^{\!\top}\mathbf{X}\boldsymbol{\beta}),\] any minimizer \(\hat{\boldsymbol{\beta}}\) must satisfy the **canonical equations**:
\[\mathbf{X}^{\!\top}\mathbf{Y}=\mathbf{X}^{\!\top}\mathbf{X}\hat{\boldsymbol{\beta}}.\]
If \(\mathbf{X}^{\!\top}\mathbf{X}\) is invertible, the minimizer \(\hat{\boldsymbol{\beta}}\) is unique and is given by \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}\mathbf{Y},\quad \text{with}\quad \text{Var}(\hat{\boldsymbol{\beta}})=\widehat{\sigma}^2(\mathbf{X}^{\!\top}\mathbf{X})^{-1},\] where \(\widehat{\sigma}^2\) is the variance of the residuals.^{197} We say that “**we have learned the coefficient vector \(\hat{\boldsymbol{\beta}}\) on the training data \(\text{Tr}\) using linear regression**”.
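This closed form can be verified numerically; here is a minimal sketch on simulated data (the variables and coefficients are illustrative), comparing the solution of the canonical equations with the output of `R`'s `lm()`:

```
# simulate a small regression problem
set.seed(0)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 3 + 2*x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix (intercept column first)
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # solves X'X beta = X'Y
fit <- lm(y ~ x1 + x2)

max(abs(beta.hat - coef(fit)))             # essentially zero
```

The two solutions agree to machine precision.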

In what follows, we write \(\mathbf{x}\) to represent the vector \[(1,x_{1},\ldots,x_{p})^{\!\top}.\] At times, we will also use the notation \(\mathbf{x}\) to represent \((x_{1},\ldots,x_{p})^{\!\top}\) (without the leading \(1\)); the context should make clear which version is meant.

The **fitted value of the model \(\widehat{f}\) at input** \(\mathbf{x}_i\in\text{Tr}\) is \[\widehat{y}_i=\widehat{f}(\mathbf{x}_i)=\mathbf{x}_i^{\!\top}\hat{\boldsymbol{\beta}},\] and the **predicted value at an arbitrary** \(\mathbf{x}^*\) is \[\widehat{y}^*=\widehat{f}(\mathbf{x}^*)={\mathbf{x}^*}^{\!\top}\hat{\boldsymbol{\beta}}.\]

The **fitted surface** is thus entirely described by the \(p+1\) parameters \(\hat{\boldsymbol{\beta}}\); the number of (effective) parameters is a measure of the **complexity** of the learner.

#### Motivating Example

We study a subset of the Gapminder dataset: the observations for 2011, the predictor variables infant mortality \(X_1\) and fertility \(X_2\), and the response variable life expectancy \(Y\). The training data \(\text{Tr}\) contains \(n=166\) observations and \(p=2\) predictor features.

The design matrix \(\mathbf{X}\) is thus of dimension \(166\times 3\).

```
library(matlib)
gapminder.2011 <- gapminder.2011 |> dplyr::mutate(const=1)
design.X = gapminder.2011[,c("const","infant_mortality","fertility")]
resp.Y = gapminder.2011[,"life_expectancy"]   # response vector, used below
str(design.X)
```

```
'data.frame': 166 obs. of 3 variables:
$ const : num 1 1 1 1 1 1 1 1 1 1 ...
$ infant_mortality: num 14.3 22.8 106.8 7.2 12.7 ...
$ fertility : num 1.75 2.83 6.1 2.12 2.2 1.5 1.88 1.44 1.96 1.9 ...
```

The response is a \(166 \times 1\) vector.

The constituents of the canonical equations are:

```
(X.t.X = t(as.matrix(design.X)) %*% as.matrix(design.X))
(X.t.Y = t(as.matrix(design.X)) %*% as.matrix(resp.Y))
```

```
const infant_mortality fertility
const 166.00 4537.30 486.540
infant_mortality 4537.30 225043.25 18445.280
fertility 486.54 18445.28 1790.238
[,1]
const 11756.70
infant_mortality 291153.33
fertility 32874.95
```

We thus see that

\[\mathbf{X}^{\!\top}\mathbf{X}=\begin{pmatrix} 166.0 & 4537.3 & 486.54 \\ 4537.3 & 225043.25 & 18445.28 \\ 486.54 & 18445.28 & 1790.238 \end{pmatrix}\] and \[\mathbf{X}^{\!\top}\mathbf{Y}=\begin{pmatrix}11756.7 \\ 291153.33 \\ 32874.95\end{pmatrix}.\]

We can now compute \(\hat{\boldsymbol{\beta}}:\)

```
[,1]
[1,] 79.6770747
[2,] -0.2757438
[3,] -0.4431873
```

Thus, \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}\mathbf{Y}=\begin{pmatrix}79.677 \\ -0.276 \\ -0.443\end{pmatrix}.\]
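This computation can be reproduced directly from the (rounded) matrices displayed above, say with base `R`'s `solve()`:

```
# X'X and X'Y as displayed above (rounded values)
X.t.X <- matrix(c( 166.00,   4537.30,   486.540,
                  4537.30, 225043.25, 18445.280,
                   486.54,  18445.28,  1790.238), nrow=3, byrow=TRUE)
X.t.Y <- c(11756.70, 291153.33, 32874.95)

(beta.hat <- solve(X.t.X, X.t.Y))   # solves X'X beta = X'Y
```

(Because the displayed matrices are rounded, the result matches the reported \(\hat{\boldsymbol{\beta}}\) up to small rounding effects.)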

We have seen that the fitted surface is \[\widehat{y}^*=\widehat{f}(\mathbf{x}^*)=79.677-0.276x_1^*-0.443x_2^*\] for a test observation \(\mathbf{x}^*=(x_1^*,x_2^*)\).

**Warning:** predictions should not be made for observations outside the range (or the envelope)
of the training predictors. In this example, the predictor envelope is
shown in red in the figure below – one should resist the
temptation to predict \(y^*\) for \(\mathbf{x^*}=(100,2)\), say.

#### Least Squares Assumptions

Since the family of OLS learners is a subset of all possible learners, the best we can say about \(\widehat{f}_{\text{OLS}}\) is that \[\text{MSE}_{\text{Te}}(\widehat{f}_{\text{OLS}})\geq \min_{\widehat{f}}\left\{\text{MSE}_{\text{Te}}(\widehat{f})\right\}\geq \text{Var}(\varepsilon).\]

In practice, we are free to approximate \(f\) with **any** learner \(\widehat{f}\). If we want \(\widehat{f}\) to be useful, however, we need to verify that it is a “decent” approximation.

There is another trade-off at play: when we restrict learners to specific families of functions,^{198} we typically also introduce a series of assumptions on the data.

The OLS assumptions are:

- **linearity:** the response variable is a linear combination of the predictors;
- **homoscedasticity:** the error variance is constant for all predictor levels;
- **uncorrelated errors:** the error is uncorrelated from one observation to the next;
- **full column rank for the design matrix \(\mathbf{X}\):** the predictors are not perfectly multi-collinear;
- **weak exogeneity:** the predictor values are free of measurement error.

Mathematically, the assumptions translate to \[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\] where \(\boldsymbol{\beta}\in \mathbb{R}^{p+1}\) is determined on a training set \(\text{Tr}\) without measurement error, and for which \[\text{E}[\boldsymbol{\varepsilon}\mid \mathbf{X}]=\mathbf{0} \quad\text{and}\quad\text{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\!\top}\mid \mathbf{X}]=\sigma^2I_{n}.\] Typically, although that is not a requirement, it is often further assumed that \[\boldsymbol{\varepsilon}\mid \mathbf{X}\sim\mathcal{N}(\mathbf{0},\sigma^2I_n).\]

We will discuss how these assumptions can be generalized in Sections 12.2.3 and 12.5. In the meantime, however, how can we determine if the choice of model is valid? In the traditional statistical analysis context, there are a number of tests available to the analyst (we will discuss them shortly). In the machine learning context, there is only one real test:

**does the model make good predictions?**

### 12.2.2 Least Squares Properties

Let us assume that the OLS assumptions are satisfied. What can we say
about the linear regression results? (see Module **??** and [34], say, for a refresher).

For the Gapminder example above, for instance, we could use `R`'s `lm()`.

```
Call:
lm(formula = life_expectancy ~ infant_mortality + fertility)
Residuals:
Min 1Q Median 3Q Max
-15.3233 -2.0057 0.2003 2.9570 10.6370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.6759 0.7985 99.786 <2e-16 ***
infant_mortality -0.2763 0.0248 -11.138 <2e-16 ***
fertility -0.4440 0.4131 -1.075 0.284
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.172 on 163 degrees of freedom
Multiple R-squared: 0.7612, Adjusted R-squared: 0.7583
F-statistic: 259.8 on 2 and 163 DF, p-value: < 2.2e-16
```

#### Coefficient of Determination

Let \[\text{SSRes}=\mathbf{Y}^{\!\top}[I_n-\mathbf{X}(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}]\mathbf{Y}=\mathbf{Y}^{\!\top}[I_n-\mathbf{H}]\mathbf{Y}\] and \[\text{SSTot}=\mathbf{Y}^{\!\top}\mathbf{Y}-n\overline{y}^2.\]
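As a sanity check, the hat-matrix identity for \(\text{SSRes}\) (and the \(\text{SSTot}\) formula) can be verified on simulated data (illustrative values):

```
# SSRes via the hat matrix H agrees with the sum of squared lm() residuals
set.seed(0)
n <- 50
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n)
X <- cbind(1, x)

H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
SSRes <- as.numeric(t(y) %*% (diag(n) - H) %*% y)
SSTot <- as.numeric(t(y) %*% y - n*mean(y)^2)

c(SSRes, sum(resid(lm(y ~ x))^2))              # identical
c(SSTot, sum((y - mean(y))^2))                 # identical
```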

In the Gapminder example, we have:

```
(SSRes=anova(full.model)[[2]][3])
(SSTot=as.vector(t(as.matrix(resp.Y))%*%as.matrix(resp.Y)-nrow(as.matrix(resp.Y))*(mean(resp.Y))^2 ))
```

```
[1] 2837.69
[1] 11882.18
```

The coefficient of determination of the OLS regression is the quotient \[R^2=\frac{\text{SSTot}-\text{SSRes}}{\text{SSTot}}=\frac{\text{Cov}^2(\mathbf{Y},\mathbf{X}\hat{\boldsymbol{\beta}})}{\sigma^2_y\sigma^2_{\widehat{y}}}.\]

In the Gapminder example, we have:

`[1] 0.761181`

The coefficient of determination identifies the proportion of the variation of the data which is explained by the linear regression. As such, \(0\leq R^2\leq 1\).

If \(R^2\approx 0\), then the predictor variables have little explanatory power on the response; if \(R^2\approx 1\), then the linear fit is deemed to be “good”, as a lot of the variability in the response is explained by the predictors. In practice, the number of predictors also affects the goodness-of-fit (this is related to the curse of dimensionality discussed previously).
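From the values of \(\text{SSRes}\) and \(\text{SSTot}\) computed above, \(R^2\) can be recovered directly:

```
# R^2 from the SSRes and SSTot values obtained earlier
SSRes <- 2837.69
SSTot <- 11882.18
(R2 <- (SSTot - SSRes)/SSTot)   # proportion of variation explained
```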

The quantity \[R_a^2=1-\frac{n-1}{n-p-1}(1-R^2)=1-\frac{\text{SSRes}/(n-p-1)}{\text{SSTot}/(n-1)}\] is the **adjusted coefficient of determination** of the linear regression. While \(R_a^2\) can be negative, it is always smaller than \(R^2\). It also plays a role in the **feature selection** process.

In the Gapminder example, we have:

`[1] 0.7567584`

which suggests that a fair proportion of the variability in the life expectancy (about 75.7%) is explained by infant mortality and fertility.

#### Significance of Regression

We can determine if at least one of the predictors \(X_1,\ldots,X_p\) is useful in predicting the response \(Y\) by pitting \[H_0:(\beta_1,\ldots,\beta_p)=\mathbf{0}\quad\text{against}\quad H_1:(\beta_1,\ldots,\beta_p)\neq \mathbf{0}.\] Under the null hypothesis \(H_0\), the \(F-\)statistic \[F^*=\frac{(\text{SSTot}-\text{SSRes})/p}{\text{SSRes}/(n-p-1)}\sim F_{p,n-p-1}.\] At significance level \(\alpha\), if \(F^*\geq F_{p,n-p-1;\alpha}\) (the \(1-\alpha\) quantile of the \(F\) distributions with \(p\) and \(n-p-1\) degrees of freedom), then we reject the null hypothesis in favour of the alternative.

In the Gapminder model \[Y=79.677-0.276X_1-0.443X_2+ \varepsilon,\quad n=166,p=2,\] we have:

`[1] 258.169`

At a significance level \(\alpha=0.05\), the critical value of the \(F_{2,163}\) distribution is:

`[1] 3.051819`

Since \(F^*\geq F_{2,163;0.05}\), at least one of \(\beta_1,\beta_2\neq 0\), with probability 95% (in the frequentist interpretation).
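These quantities can be recomputed directly from the \(\text{SSRes}\) and \(\text{SSTot}\) values obtained earlier (small discrepancies with the reported values are due to rounding):

```
# F-statistic and critical value for the significance of regression test
SSRes <- 2837.69; SSTot <- 11882.18
n <- 166; p <- 2

(F.star <- ((SSTot - SSRes)/p) / (SSRes/(n - p - 1)))
(F.crit <- qf(0.95, p, n - p - 1))   # critical value at alpha = 0.05
F.star >= F.crit                     # TRUE: reject H0
```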

#### Interpretation of the Coefficients

For \(j=1,\ldots,p\), the coefficient \(\beta_j\) is the **average** effect on \(Y\) of a 1-unit increase in \(X_j\), **holding all other predictors fixed**. Ideally, the predictors are uncorrelated (such as would be the case in a **balanced design** [33]). Each coefficient can then be tested (and estimated) separately, and the above interpretation is at least reasonable in theory.

In practice, however, we cannot always control the predictor variables, and it might be impossible to “hold all other predictors fixed.” When the predictors are correlated, there are potential **variance inflation** issues for the estimated regression coefficients, and the interpretation is risky, since when \(X_j\) changes, so do the other predictors.^{199} More importantly, the interpretation can also be read as a claim of causality, which **should be avoided** when dealing with observational data.

“The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.” (paraphrased from [240])

In the Gapminder example, the correlation between the variables \(X_1\) and \(X_2\) is:

`[1] -0.8714863`

The predictors are thus strongly correlated, and the standard interpretation is not available to us.

#### Hypothesis Testing

We can also determine if a specific predictor \(X_j\) is useful in predicting the response \(Y\), by testing for \[H_0:\beta_j=0\quad\text{against}\quad H_1:\beta_j\neq 0.\] Under the null hypothesis \(H_0\), the test statistic \[t^*=\frac{\widehat{\beta}_j}{\text{se}(\widehat{\beta}_j)}\sim T_{n-p-1},\] where \(\text{se}(\widehat{\beta}_j)=\sqrt{\widehat{\sigma}^2(\mathbf{X}^{\!\top}\mathbf{X})^{-1}_{j,j}}\), \(\widehat{\sigma}^2=\frac{\text{SSRes}}{n-p-1}\), and \(T_{n-p-1}\) is the Student \(T\) distribution with \(n-p-1\) degrees of freedom.

At a significance level \(\alpha\), if \(|t^*|\geq |t_{n-p-1;\alpha/2}|\) (the \(1-\alpha/2\) quantile of the \(T\) distribution with \(n-p-1\) degrees of freedom), then we reject the null hypothesis in favour of the alternative.

In the Gapminder model, we have: \(n=166\), \(p=2\), and \(\widehat{\beta}_1=-0.276\) so that

`[1] 17.51661`

and

```
0.02488045
```

so that

```
-11.08275
```

At a significance level \(\alpha=0.05\), the critical value of the \(T_{163}\) distribution is

`[1] -1.974535`

Since \(|t^*|\geq |t_{163;0.025}|\), \(\beta_1\neq 0\) with probability 95% (in the frequentist interpretation).

#### Confidence Intervals

The **standard error** of \(\widehat{\beta}_j\) reflects how the estimate would vary under various \(\text{Tr}\); it can be used to compute a \(100(1-\alpha)\%\) **confidence interval** for the true \(\beta_j\): \[\text{C.I.}(\beta_j;\alpha)\equiv \widehat{\beta}_j\pm z_{\alpha/2}\cdot \text{se}(\widehat{\beta}_j);\] at \(\alpha=0.05\), \(z_{\alpha/2}=1.96\approx 2\), so that \[\text{C.I.}(\beta_j;0.05)\equiv \widehat{\beta}_j\pm 2\,\text{se}(\widehat{\beta}_j).\]

In the Gapminder example, we have

| coeff. | est. | s.e. | \(t^*\) | 95% C.I. |
|---|---|---|---|---|
| \(\beta_0\) | \(79.677\) | \(0.7985\) | \(99.786\) | \([78.1,81.3]\) |
| \(\beta_1\) | \(-0.276\) | \(0.0248\) | \(-11.138\) | \([-0.33,-0.23]\) |
| \(\beta_2\) | \(-0.443\) | \(0.4131\) | \(-1.075\) | \([-1.27,0.38]\) |

In **frequentist** statistics, the confidence interval has a particular interpretation – it does not mean, as one might wish, that there is a 95% chance, say, that the true \(\beta_j\) is found in the \(\text{C.I.}\); rather, it suggests that the approach used to build the 95% \(\text{C.I.}\) will yield an interval in which the true \(\beta_j\) will reside approximately 95% of the time.^{200}
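For instance, the 95% C.I. for \(\beta_1\) can be reproduced from the estimate and standard error reported above:

```
# 95% C.I. for beta_1, using the approximation z_{0.025} ~ 2
est <- -0.2763
se  <-  0.0248
(ci <- est + c(-2, 2)*se)   # roughly [-0.33, -0.23]
```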

The resulting confidence intervals also depend on the underlying model. For instance, the 95% \(\text{C.I.}\) for \(\beta_1\) in the **full model** is \([-0.33,-0.23]\) (see above), whereas the corresponding \(\text{C.I.}\) in the **reduced model** \[\widehat{Y}=\gamma_0+\gamma_1X_1\] is \([-0.33,-0.27]\).

The estimates are necessarily distinct as well:

```
Call:
lm(formula = life_expectancy ~ infant_mortality)
Residuals:
Min 1Q Median 3Q Max
-14.9729 -1.9716 0.1726 2.9727 11.0275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.99279 0.48357 163.35 <2e-16 ***
infant_mortality -0.29888 0.01313 -22.76 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.174 on 164 degrees of freedom
Multiple R-squared: 0.7595, Adjusted R-squared: 0.758
F-statistic: 517.9 on 1 and 164 DF, p-value: < 2.2e-16
```

Note that \(\widehat{\beta}_1=-0.2763\neq -0.2989=\widehat{\gamma}_1\).

#### Feature Selection

How would we determine if all the predictors help explain the response \(Y\), or if only a (proper) subset of the predictors is needed? The most direct approach to solve this problem (in the linear regression context) is to run **best subsets** regression.

The procedure is as follows: fit an OLS model for all possible subsets of predictors and select the **optimal model** based on a criterion that balances **training error** with **model size**.

There are \(2^{p+1}\) such models (a quantity that quickly becomes unmanageable). In practice, we need to automate and speed-up the search through a subset of predictor subsets. OLS approaches include **forward selection** and **backward selection** (we discuss these in detail in Feature Selection and Dimension Reduction).

Forward selection is a bottom-up approach:

1. start with the **null model** \(\mathcal{M}_0: Y=\beta_0+\varepsilon\);
2. fit \(p\) simple linear regressions \(Y=\beta_0+\beta_jX_j+\varepsilon\) and add to the null model the predictor \(X_{j_1}\) resulting in the lowest \(\text{SSRes}\): \[\mathcal{M}_1:Y=\beta_0+\beta_{j_1}X_{j_1}+\varepsilon;\]
3. add to that model the variable that results in the lowest \(\text{SSRes}\) among all the two-variable models: \[\mathcal{M}_2:Y=\beta_0+\beta_{j_1}X_{j_1}+\beta_{j_2}X_{j_2}+\varepsilon;\]
4. the process continues until a stopping criterion is met.

Backward selection is a top-down approach, and it works in reverse, removing predictors from the full model.

In both approaches, there are at most \[p+(p-1)+\cdots+2+1=\frac{p(p+1)}{2}\ll 2^{p+1}~~\text{(when $p$ is large)}\] regressions to run. These methods are, frankly, not ideal in the machine learning framework (see Section 12.2.4 for alternatives).
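A quick sketch of forward selection using `R`'s `step()` function on simulated data (the data and variable names are illustrative; note that `step()` uses the AIC as its selection/stopping criterion, rather than raw \(\text{SSRes}\)):

```
# forward selection: x1 and x2 carry signal, x3 is pure noise
set.seed(0)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 5*x1 + 3*x2 + rnorm(n)
df <- data.frame(y, x1, x2, x3)

null.model <- lm(y ~ 1, data = df)                # start from the null model
fwd <- step(null.model, scope = ~ x1 + x2 + x3,
            direction = "forward", trace = 0)
names(coef(fwd))                                  # x1 and x2 are retained
```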

#### Other Questions

- How do we handle qualitative variables? (dummy binary variables);
- How do we handle interaction terms? (add features);
- How do we handle outliers? (median regression, Theil-Sen estimate);
- How do we handle non-constant variance of the error terms? (data transformations, weighted least squares regression, Bayesian regression);
- How do we handle high-leverage observations? (robust regression);
- How do we handle collinearity? (principal component analysis, generalized linear models, partial least squares regression);
- How do we handle multiple tests? (Bonferroni correction: for \(q\) independent tests with the same data, set the significance level to \(\alpha/q\) to get joint significance equivalent to \(\alpha\) for a single test).

### 12.2.3 Generalizations of OLS

The OLS assumptions are convenient from a mathematical perspective, but they are not always met in practice.

One way out of this problem is to use **remedial measures** to transform the data into a compliant set; another one is to extend the assumptions and to work out the corresponding mathematical formalism:

- **generalized linear models** (GLM) implement responses with non-normal conditional distributions;
- **classifiers** (logistic regression, decision trees, support vector machines, naïve Bayes, neural networks) extend regression to categorical responses (see Module 13);
- **non-linear methods**, such as splines, generalized additive models (GAM), nearest neighbour methods, and kernel smoothing methods, are used for responses that are not linear combinations of the predictors (see Section @(RVE-NM));
- **tree-based methods** and **ensemble learning methods** (bagging, random forests, boosting) are used for predictor interactions (see Module 13);
- **regularization methods** (ridge regression, LASSO, elastic net) facilitate the process of model selection and feature selection (see Section 12.2.4).

#### Generalized Linear Models

GLM extend the least square paradigm by accommodating response variables with **non-normal** conditional distributions. Apart from the error structure, a GLM is essentially a linear model: \[Y_i\sim \mathcal{D}(\mu_i),\quad\text{where }g(\mu_i)=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}.\]

A GLM consists of:

- a systematic component \(\mathbf{x}_i^{\!\top}\boldsymbol{\beta}\);
- a random component specified by the distribution \(\mathcal{D}\) for \(Y_i\), and
- a link function \(g\).

The **systematic component** is specified in terms of the linear predictor for the \(i^{\text{th}}\) observation \(\eta_i=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}\); the general ideas and concepts of OLS carry over to GLM, with the added presence of the link function and the distribution of the response \(y_i\).

In principle, the link function \(g\) could be any function linking the linear predictor \(\eta_i\) to the distribution of the response \(Y_i\); in practice, however, \(g\) should be **smooth** (or at least differentiable) and **monotonic** (and so invertible).

We could specify any distribution \(\mathcal{D}\) for the response \(Y_i\), but they are usually selected from the **exponential family** of distributions.^{201} OLS is an example of GLM, with:

- systematic component \(\eta_i=\mathbf{x}_i^{\!\top}\boldsymbol{\beta}\);
- random component \(Y_i\sim \mathcal{N}(\mu_i,\sigma^2)\);
- link function \(g(\mu)=\mu\).

For a more substantial example, consider the following situation. In the early stages of a rumour spreading, the rate at which new individuals learn the information increases exponentially over time. If \(\mu_i\) is the expected number of people who have heard the rumour on day \(t_i\), a model of the form \(\mu_i=\gamma\exp(\delta t_i)\) might be appropriate: \[\underbrace{\ln(\mu_i)}_{\text{link}}=\ln \gamma+\delta t_i=\beta_0+\beta_1t_i=\underbrace{(1,t_i)^{\!\top}(\beta_0,\beta_1)}_{\text{systematic component}}.\] Furthermore, since we measure a count of individuals, the Poisson distribution could be a reasonable choice: \[Y_i\sim \underbrace{\strut\text{Poisson}(\mu_i),}_{\text{random component}}\quad \ln(\mu_i)=(1,t_i)^{\!\top}(\beta_0,\beta_1).\]
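A sketch of this rumour model fitted with `R`'s `glm()`, on simulated counts with hypothetical values \(\gamma=5\) and \(\delta=0.2\):

```
# simulate daily rumour counts from mu = gamma * exp(delta * day)
set.seed(0)
day <- 1:20
mu <- 5*exp(0.2*day)               # gamma = 5, delta = 0.2 (hypothetical)
y <- rpois(length(day), mu)

fit <- glm(y ~ day, family = poisson)   # log link is the default for poisson
coef(fit)                               # close to (log(5), 0.2)
```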

The main advantages of GLM are that:

- there is no need to transform the response \(Y\) if it does not follow a normal distribution;
- if the link produces **additive effects**, the assumption of homoscedasticity does not need to be met;
- the choice of the link is separate from the choice of random component, providing modeling flexibility;
- models are still fitted *via* a **maximum likelihood** procedure;
- **inference tools** and **model checks** (Wald ratio test, likelihood ratio test, deviance, residuals, \(\text{C.I.}\), etc.) still apply;
- they are easily implemented in SAS (`proc genmod`), `R` (`glm()`), etc., and
- the framework unites various regression modeling approaches (OLS, logistic, Poisson, etc.) under a single umbrella.

### 12.2.4 Shrinkage Methods

We will discuss the curse of dimensionality (CoD), subset selection, and dimension reduction in Module 15. Another approach to dealing with high-dimensionality is provided by the **least absolute shrinkage and selection operator** (LASSO) and its variants.

In what follows, assume that the training set consists of \(N\) **centered** and **scaled** observations \(\mathbf{x}_i=(x_{i,1},\cdots,x_{i,p})\), together with target observations \(y_i\).

Let \(\widehat{\beta}_{\textrm{OLS},j}\) be the \(j\)th OLS coefficient, and set a threshold \(\lambda>0\), whose value depends on the training dataset \(\text{Tr}\). Recall that \(\widehat{\boldsymbol{\beta}}_{\textrm{OLS}}\) is the exact solution to the OLS problem \[\widehat{\boldsymbol{\beta}}_{\textrm{OLS}}=\arg\min_{\boldsymbol{\beta}}\{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2\}=\arg\min_{\boldsymbol{\beta}}\{\text{SSRes}\}.\]
In general, **no restrictions** are assumed on the values of the coefficients \(\widehat{\beta}_{\textrm{OLS},j}\) – large magnitudes imply that corresponding features **play an important role** in predicting the target. This observation forms the basis of a series of useful OLS variants.

#### 12.2.4.1 Ridge Regression

**Ridge regression** (RR) is a method to **regularize** the OLS regression coefficients. Effectively, it shrinks the OLS coefficients by penalizing solutions with large magnitudes: after shrinkage, a coefficient retains a large magnitude only if it has great relevance in predicting the target variable.

This leads to a modified OLS problem: \[\widehat{\boldsymbol{\beta}}_{\textrm{RR}}=\arg\min_{\boldsymbol{\beta}}\{\underbrace{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2}_{\text{SSRes}}+\underbrace{N\lambda \|\boldsymbol{\beta}\|_2^2}_{\text{shrinkage penalty}}\}.\]
This quantity is small when \(\text{SSRes}\) is small (i.e., the model is a good fit to the data) and when the **shrinkage penalty** is small (i.e., when each \(\beta_j\) is small). RR solutions are typically obtained *via* numerical methods.^{202}

The hyperparameter \(\lambda\) controls the relative impact of both components. If \(\lambda\) is small, then the shrinkage penalty is small even if the individual coefficients \(\beta_j\) are large; if \(\lambda\) is large, then the shrinkage penalty is only small when all coefficients \(\beta_j\) are small (see Figure 12.9 for an illustration).

Setting the “right” value for \(\lambda\) is crucial; it can be done via cross-validation (see [3, pp. 227–228] and Section 12.3.1 for details). The OLS estimates are **equivariant**: if \(\widehat{\beta}_j\) is the estimate for the coefficient \(\beta_j\) of \(X_j\), then \(\frac{\widehat{\beta}_j}{c}\) is the estimate for the coefficient of the scaled variable \(cX_j\). RR coefficients do not have this property, however, which is why the dataset must be centered and scaled to start with.

Finally, note that RR estimates trade a small increase in bias for a reduction in variance, which helps mitigate issues related to overfitting (even if they do not reduce the dimensions of the dataset).
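Although RR solutions are often computed numerically, for the penalty \(N\lambda \|\boldsymbol{\beta}\|_2^2\) the minimizer admits the closed form \(\widehat{\boldsymbol{\beta}}_{\textrm{RR}}=(\mathbf{X}^{\!\top}\mathbf{X}+N\lambda I)^{-1}\mathbf{X}^{\!\top}\mathbf{Y}\); a sketch on simulated centered and scaled data (the value of \(\lambda\) is arbitrary), showing the shrinkage effect:

```
# ridge coefficients in closed form, compared with OLS
set.seed(0)
N <- 100; p <- 3
X <- scale(matrix(rnorm(N*p), N, p))     # centered and scaled predictors
y <- X %*% c(2, -1, 0.5) + rnorm(N)
y <- y - mean(y)                         # centered response (no intercept needed)

lambda <- 0.5
beta.ols <- solve(t(X) %*% X,                      t(X) %*% y)
beta.rr  <- solve(t(X) %*% X + N*lambda*diag(p),   t(X) %*% y)

c(sum(beta.ols^2), sum(beta.rr^2))       # the RR coefficients shrink
```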

#### Regression With Best Subset Selection

**Regression with best subset selection** runs on the same principle but uses a different penalty term, which effectively sets some of the coefficients to 0 (this could be used to select the features with non-zero coefficients, potentially).

The problem consists in solving another modified version of the OLS scenario, namely \[\widehat{\boldsymbol{\beta}}_{\textrm{BS}}=\arg\min_{\boldsymbol{\beta}}\{\underbrace{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2}_{\text{SSRes}}+\underbrace{N\lambda \|\boldsymbol{\beta}\|_0}_{\text{shrinkage}}\},\qquad \|\boldsymbol{\beta}\|_0=\sum_j\textrm{sgn}(|\beta_j|).\]
Solving the BS problem typically (also) requires numerical methods and cross-validation.^{203}

A slight modification to the RR shrinkage penalty produces a method that combines the strengths of both approaches.

##### LASSO

The **LASSO** is an alternative to RR obtained by solving \[\widehat{\boldsymbol{\beta}}_{\textrm{L}}=\arg\min_{\boldsymbol{\beta}}\{\underbrace{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2}_{\text{SSRes}}+\underbrace{N\lambda \|\boldsymbol{\beta}\|_1}_{\text{shrinkage}}\};\] the penalty effectively produces coefficients that combine the properties of RR and BS, selecting at most \(\min\{p,N\}\) features, and usually no more than one per group of highly correlated variables (the other coefficients are forced down to \(0\) when \(\lambda\) is large enough; see Figure 12.10 for an illustration).^{204}

Why do we get \(\widehat{\beta}_{\text{L},j}=0\) for some \(j\), but not for the RR coefficients? The RR and LASSO formulations are equivalent to \[\begin{aligned} \widehat{\boldsymbol{\beta}}_{\textrm{RR}}&=\arg\min_{\boldsymbol{\beta}}\{\text{SSRes}\mid \|\boldsymbol{\beta}\|_2^2\leq s\}\ \text{for some $s$} \\ \widehat{\boldsymbol{\beta}}_{\textrm{L}}&=\arg\min_{\boldsymbol{\beta}}\{\text{SSRes}\mid \|\boldsymbol{\beta}\|_1\leq s\}\ \text{for some $s$}\end{aligned}\]

Graphically, this looks like the images shown in Figure 12.11.

The RR coefficients \(\widehat{\boldsymbol{\beta}}_{\textrm{RR}}\) are found at the first intersection of the ellipses of constant \(\text{SSRes}\) around the OLS coefficient \(\widehat{\boldsymbol{\beta}}\) with the \(2-\)sphere \(\|\boldsymbol{\beta}\|_{2}^2\leq s\); that intersection is usually away from the axes (due to the lack of “sharp” points); this is not usually the case for the intersection with the \(1-\)sphere \(\|\boldsymbol{\beta}\|_1\leq s\).

The LASSO thus typically produces simpler models, but predictive accuracy matters too (in the form of \(\text{MSE}_\text{Te}\), say). Depending on the data, either of the two approaches can be optimal, thanks to the *No Free Lunch Theorem*.
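A well-known special case helps build intuition for the exact zeros: when the columns of \(\mathbf{X}\) are orthonormal, the LASSO solution is obtained by **soft-thresholding** the OLS coefficients, while RR merely rescales them uniformly (so RR coefficients never land exactly on \(0\)); a sketch:

```
# soft-thresholding operator: shrinks towards 0, and zeroes out small values
soft <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

beta.ols <- c(2.5, -0.3, 0.8)   # illustrative OLS coefficients
soft(beta.ols, 1)               # LASSO analogue: c(1.5, 0, 0)
beta.ols / (1 + 1)              # RR analogue: shrunk, but never exactly 0
```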

#### Generalizations

If the response is related to a relatively small number of predictors (whether this is the case or not is not something we usually know *a priori*), LASSO is recommended. The use of other penalty functions (or combinations thereof) provides various extensions, such as: **elastic nets**; **group, fused and adaptive lassos**; **bridge regression**, etc.
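The elastic net, for instance, is a convex blend of the RR and LASSO penalties. A minimal Python sketch of the penalty term (using `glmnet`'s convention, in which `alpha = 1` recovers the LASSO and `alpha = 0` recovers RR; the function name is ours):

```python
# Elastic-net penalty: a convex combination of the L1 (LASSO) and
# L2 (RR) penalties, following glmnet's parametrization.
def elastic_net_penalty(beta, lam, alpha):
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) * l2 / 2)

beta = [1.0, -2.0, 0.0]
print(elastic_net_penalty(beta, lam=1.0, alpha=1.0))  # 3.0 (pure L1)
print(elastic_net_penalty(beta, lam=1.0, alpha=0.0))  # 2.5 (pure L2/2)
```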

The modifications described above were defined assuming an underlying linear regression model, but they generalize to arbitrary regression/classification models as well. For a **loss** (cost) function \(\mathcal{L}\left(\mathbf{Y},\mathbf{\widehat{y}}(\mathbf{W})\right)\) between the actual target and the values predicted by the model parameterized by \(\mathbf{W}\), and a **penalty** vector \(\mathbf{R}(\mathbf{W})=\left(R_{1}(\mathbf{W}),\cdots,R_{k}(\mathbf{W})\right)^{\!\top}\), the **regularized parametrization** \(\mathbf{W}^*\) solves the **general regularization** problem \[\mathbf{W}^*= \textrm{arg}\min_{\mathbf{W}} \{\mathcal{L}\left(\mathbf{Y},\mathbf{\widehat{y}}(\mathbf{W})\right)+N\mathbf{\lambda}^{\!\top}\mathbf{R}(\mathbf{W})\},\] which can be solved numerically, assuming some nice properties on \(\mathcal{L}\) and \(\mathbf{R}\) [241]; as before, cross-validation can be used to determine the optimal vector \(\mathbf{\lambda}\) [2].
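As a sketch of this general problem, the following Python snippet (numpy assumed; the data, penalty weights, and step size are made up) minimizes a squared-error loss plus the penalty vector \(\mathbf{R}(\mathbf{W})=(\|\mathbf{W}\|_1,\|\mathbf{W}\|_2^2)^{\!\top}\) by plain (sub)gradient descent:

```python
import numpy as np

# General regularization problem:
#   W* = argmin_W { L(Y, yhat(W)) + N * lam^T R(W) }
# with squared-error loss and R(W) = (||W||_1, ||W||_2^2),
# solved numerically by (sub)gradient descent.
rng = np.random.default_rng(1)
N, p = 100, 4
X = rng.standard_normal((N, p))
Y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(N)

lam = np.array([0.01, 0.01])  # one weight per penalty component
W = np.zeros(p)
step = 1e-3
for _ in range(2000):
    grad_loss = 2 * X.T @ (X @ W - Y)                 # gradient of SSRes
    subgrad_R = lam[0] * np.sign(W) + lam[1] * 2 * W  # subgradient of lam^T R(W)
    W -= step * (grad_loss + N * subgrad_R)

print(np.round(W, 2))  # close to the generating coefficients, slightly shrunk
```

In practice \(\boldsymbol{\lambda}\) would not be fixed in advance but tuned by cross-validation, as noted above.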

#### Gapminder Example

In `R`, regularization is implemented in the package `glmnet` (among others). In `glmnet()`, the parameter `alpha` controls the elastic net mixture: LASSO (`alpha = 1`), RR (`alpha = 0`).

Say we are interested in modeling life expectancy \(Y\) in the 2011 Gapminder dataset as a function of population, infant mortality, fertility, gdp, and continental membership (we use the entire set as a training set \(\text{Tr}\)).

A priori, an OLS model on this data would take the form \[\begin{aligned}Y=\alpha_0&+\alpha_1\text{population}+\alpha_2\text{infant mortality}+\alpha_3\text{fertility}+ \alpha_4\text{gdp} \\ &+ \alpha_5\text{Africa}+\alpha_6\text{Americas}+\alpha_7\text{Asia}+\alpha_8\text{Europe}+\alpha_9\text{Oceania}.\end{aligned}\]

We run a LASSO regression on the data, using \(\text{MSE}_{\text{Tr}}\) as the evaluation metric. We start by creating dummy variables for the continents, then select the appropriate variables for the response and the training set, and scale and center the predictors (the data must be in matrix format to be compatible with `glmnet()`):

```
gapminder.2011.f <- fastDummies::dummy_cols(gapminder.2011, select_columns = 'continent')
y <- gapminder.2011.f |> select(life_expectancy) |> as.matrix()
x <- gapminder.2011.f |> select(c("population","infant_mortality","fertility","gdp",
"continent_Africa","continent_Americas","continent_Asia",
"continent_Europe","continent_Oceania")) |>
scale(center = TRUE, scale = TRUE) |> as.matrix()
```
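For readers working outside `R`, the same preprocessing can be sketched in plain Python (the data values are made up; `fastDummies::dummy_cols()` becomes a dict of indicator lists, and `scale()` becomes manual centering/scaling with the sample standard deviation, matching `R`'s convention):

```python
# Made-up miniature of the Gapminder preprocessing step
continents = ["Africa", "Americas", "Africa", "Asia"]
gdp = [1500.0, 12000.0, 900.0, 8000.0]

# One-hot encoding: one indicator column per level (like fastDummies::dummy_cols)
levels = sorted(set(continents))
dummies = {f"continent_{lev}": [1 if c == lev else 0 for c in continents]
           for lev in levels}

# Center and scale (like scale(center = TRUE, scale = TRUE));
# R's scale() divides by the sample (n - 1) standard deviation
n = len(gdp)
mean = sum(gdp) / n
sd = (sum((g - mean) ** 2 for g in gdp) / (n - 1)) ** 0.5
gdp_scaled = [(g - mean) / sd for g in gdp]

print(dummies["continent_Africa"])  # [1, 0, 1, 0]
print(round(sum(gdp_scaled), 10))   # 0.0 -- centered
```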

Finally, we run the regression and extract the LASSO coefficient for hyperparameter \(\lambda=1\):

```
glmnet1 <- glmnet::glmnet(x = x, y = y, type.measure = 'mse', alpha = 1)
(c1 <- coef(glmnet1, x = x, y = y, s = 1, exact = TRUE))
```

```
10 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 70.82349398
population .
infant_mortality -5.57897055
fertility .
gdp .
continent_Africa -1.13074639
continent_Americas .
continent_Asia .
continent_Europe .
continent_Oceania -0.03096299
```

Thus \[\begin{aligned} Y&=70.82-5.58(\text{infant mortality}) -1.13(\text{Africa}) -0.03(\text{Oceania}).\end{aligned}\]

For RR (\(\alpha=0\)), we obtain, with the same hyperparameter \(\lambda=1\):

```
glmnet0 <- glmnet::glmnet(x = x, y = y, type.measure = 'mse', alpha = 0)
(c0 <- coef(glmnet0, x = x, y = y, s = 1, exact = TRUE))
```

```
10 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 70.8234940
population -0.3471671
infant_mortality -4.4002779
fertility -0.6348077
gdp 0.5803223
continent_Africa -1.6275714
continent_Americas 0.5475769
continent_Asia 0.6117358
continent_Europe 1.0141934
continent_Oceania -0.6855980
```

which is to say: \[\begin{aligned} Y&=70.82-0.35(\text{population})-4.40(\text{infant mortality}) -0.63(\text{fertility}) +0.58(\text{gdp})\\ & -1.63(\text{Africa}) +0.55(\text{Americas})+0.61(\text{Asia})+1.01(\text{Europe}) -0.69(\text{Oceania}),\end{aligned}\] which is compatible with the above discussion.

The values of the coefficients themselves are not as important as their signs and the fact that they are roughly similar in both models.

It is important to note, however, that the choice of \(\lambda=1\) was arbitrary, and that we have not been evaluating the result on test data \(\text{Te}\). We will revisit these issues in Cross-Validation.

### References

Hastie, T., Tibshirani, R., Friedman, J. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd ed. Springer, 2008.

James, G., Witten, D., Hastie, T., Tibshirani, R. *An Introduction to Statistical Learning: With Applications in R*. Springer, 2014.

Sahai, H., Ageel, M.I. *The Analysis of Variance: Fixed, Random and Mixed Models*. Birkhäuser, 2000.

Kutner, M., Nachtsheim, C., Neter, J., Li, W. *Applied Linear Statistical Models*. McGraw Hill Irwin, 2004.

*Technometrics*, vol. 8, no. 4, pp. 625–629, Nov. 1966.

Hastie, T., Tibshirani, R., Wainwright, M. *Statistical Learning with Sparsity: The LASSO and Generalizations*. CRC Press, 2015.