## 12.3 Resampling Methods

How do we determine the variability of a regression fit? One way is to draw different samples from the available data, fit a regression model to each sample, and then examine the extent to which the various fits differ from one another.

Resampling methods provide additional information about a fitted model, by applying the same fitting approach to various sub-samples of the training set $$\text{Tr}$$. We will consider three such methods:

• cross-validation, which is used to estimate the test error associated with a modeling approach in order to evaluate model performance;

• the bootstrap, which is used to provide a measure of accuracy, standard deviation, bias, etc. of various model parameter estimates; and

• the jackknife, which is a simpler approach with the same aims as the bootstrap.

For quantitative responses, the test error associated with a statistical learning model is the average error arising when predicting the response for observations that were not used to train the model.

The training error, on the other hand, is computed directly by comparing the model’s predictions to the actual responses in $$\text{Tr}$$. In general, the training error underestimates the test error, dramatically so as the model complexity increases (i.e., the variance-bias trade-off; see Figure 12.6).

A possible solution to this conundrum is to set aside a large-enough testing set $$\text{Te}$$, but that’s not always possible if the original dataset is not that large in the first place.205

In the statistical learning framework, we estimate the test error by holding a subset $$\text{Va}\subseteq \text{Tr}$$ out from the fitting process (which takes place on $$\text{Tr}\setminus\text{Va}$$). The validation approach is a simple strategy that is used to estimate the test error associated with a particular statistical model on a set of observations.

Formally, the latter is split into a training set $$\text{Tr}$$ and a validation set $$\text{Va}$$ (the hold-out set). The model is fit on the training set; the fitted model is used to make predictions on the validation set. The resulting validation set error provides an estimate for the test error.

This approach is easy to implement and interpret, but it has a number of drawbacks, most importantly:

• the validation error is highly dependent on the choice of the validation set, and is thus quite volatile;

• the model is fitted on a proper subset of the available observations, and we might expect that this would lead to the validation error being larger than the test error in general;

• a number of classical statistical models can provide test error estimates without having to resort to the validation set approach.
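As a quick illustration, the validation set approach only takes a few lines of base R; the simulated data frame `dat` and the 70/30 split below are arbitrary, illustrative choices rather than part of the chapter's code:

```r
# Validation set approach on simulated data (dat, x, y are illustrative names)
set.seed(0)
N <- 200
dat <- data.frame(x = runif(N, 0, 10))
dat$y <- 2 + 3*dat$x + rnorm(N)

# hold out roughly 30% of the observations as the validation set Va
va.index <- sample.int(N, size = floor(0.3*N), replace = FALSE)

# fit on Tr \ Va, predict on Va
fit <- lm(y ~ x, data = dat[-va.index, ])
pred <- predict(fit, newdata = dat[va.index, ])

# the validation set error provides an estimate of the test error
(MSE.va <- mean((dat$y[va.index] - pred)^2))
```

Re-running the chunk with a different seed typically changes `MSE.va` noticeably, which is exactly the volatility raised in the first drawback above.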

### 12.3.1 Cross-Validation

$$K$$-Fold Cross-Validation is a widely used approach to estimate the test error without sacrificing observations to a hold-out set.206

The procedure is simple:

1. Divide the dataset randomly into $$K$$ (roughly) equal-sized folds (typically, $$K=4,5, 10$$).

2. Each fold plays, in succession, the role of the validation set. If there are $$N$$ observations in the dataset, partition $\{1,\ldots, N\}=\underbrace{\mathcal{C}_1}_{\text{fold }1} \sqcup \cdots \sqcup \underbrace{\mathcal{C}_K}_{\text{fold }K}.$ If $$|\mathcal{C}_k|=n_k$$, we expect $$n_k\approx \frac{N}{K}$$ for all $$k=1,\ldots, K$$.

3. For all $$k=1,\ldots, K$$, fit a model on $$\{1,\ldots,N\}\setminus \mathcal{C}_k$$ and denote the error on $$\mathcal{C}_k$$ by $$E_k$$.207

4. Write $$\overline{E}$$ for the average of the $$E_k$$.

5. The cross-validation estimate of the test error is $\text{CV}_{(K)}=\sum_{k=1}^K\frac{n_k}{N}E_k,$ with standard error $\widehat{\text{se}}\left(\text{CV}_{(K)}\right)=\sqrt{\frac{1}{K-1}\sum_{k=1}^K(E_k-\overline{E})^2}.$
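The five steps above can be collected into a short helper; the function `cv.lm` below and the simulated data it runs on are illustrative sketches, not part of the chapter's Gapminder code:

```r
# K-fold cross-validation for a simple linear regression (steps 1-5 above)
cv.lm <- function(dat, K = 10){
    N <- nrow(dat)
    folds <- split(sample.int(N), rep(1:K, length.out = N))   # C_1, ..., C_K
    E <- n_k <- numeric(K)
    for(k in 1:K){
        fit <- lm(y ~ x, data = dat[-folds[[k]], ])           # fit off fold k
        pred <- predict(fit, newdata = dat[folds[[k]], ])
        E[k] <- mean((dat$y[folds[[k]]] - pred)^2)            # E_k
        n_k[k] <- length(folds[[k]])
    }
    list(cv = sum(n_k/N*E),                                   # CV_(K)
         se = sqrt(sum((E - mean(E))^2)/(K - 1)))             # se-hat(CV_(K))
}

set.seed(0)
dat <- data.frame(x = runif(300, 0, 10))
dat$y <- 2 + 3*dat$x + rnorm(300)
cv.lm(dat, K = 10)
```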

These steps could also be replicated $$n$$ times to generate a distribution of an evaluation metric (such as the standard error); see Figure 11.39 for an illustration.

The resulting mean can prove useful in order to determine how well a statistical learning procedure will perform on unseen data. If, however, we are interested in selecting a method from a list of methods, or a flexibility level among a family of approaches, we do not care about the specific value of $$\text{CV}_{(K)}$$ so much as where it is minimized.208
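For instance, a flexibility level can be selected by computing the CV error over a grid of polynomial degrees and keeping the minimizer; the simulated example below is a sketch of that recipe (all names and settings are illustrative):

```r
# Selecting a polynomial degree by K-fold cross-validation on simulated data
set.seed(0)
N <- 200; K <- 10
dat <- data.frame(x = runif(N, -2, 2))
dat$y <- sin(2*dat$x) + rnorm(N, sd = 0.3)          # true signal is nonlinear
folds <- rep(1:K, length.out = N)[sample.int(N)]    # random fold labels

cv.error <- sapply(1:8, function(d){                # degrees 1 through 8
    E <- sapply(1:K, function(k){
        fit <- lm(y ~ poly(x, d), data = dat[folds != k, ])
        mean((dat$y[folds == k] - predict(fit, newdata = dat[folds == k, ]))^2)
    })
    mean(E)                                         # CV error for degree d
})
which.min(cv.error)                                 # degree minimizing the CV error
```

The specific CV error values matter less than the location of the minimum, which identifies the preferred degree.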

From the perspective of bias reduction (in the estimate for the test error), the best choice is $$K=N$$, but this is mitigated by the variance-bias trade-off. With $$K=N$$, we have $$N$$ models and $$N$$ estimates for the test error, but these estimates are highly correlated and the mean of highly correlated estimates has high variance (see Section 12.3.3 for details).
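The effect of this correlation can be verified empirically; the sketch below (the common correlation `rho = 0.9` is an assumed value for illustration) compares the variance of the mean of $$N$$ equicorrelated estimates with that of the mean of $$N$$ independent ones:

```r
# Variance of the mean of correlated vs. independent estimates
set.seed(0)
N <- 50; rho <- 0.9
Sigma <- matrix(rho, N, N); diag(Sigma) <- 1       # equicorrelated covariance
U <- chol(Sigma)                                   # Sigma = t(U) %*% U

means.corr <- replicate(5000, mean(drop(rnorm(N) %*% U)))
means.ind  <- replicate(5000, mean(rnorm(N)))

c(var(means.corr), var(means.ind))                 # roughly 0.9 vs. 0.02
```

For equicorrelated estimates with variance $$\sigma^2$$ and correlation $$\rho$$, $$\text{Var}(\overline{E})=\frac{\sigma^2}{N}\left(1+(N-1)\rho\right)$$, which barely decreases with $$N$$ when $$\rho$$ is close to 1.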

#### Gapminder Example

We use cross-validation in the Gapminder dataset to estimate the test error $$\text{MSE}_{\text{Te}}$$ when predicting life expectancy as a regression against the logarithm of the GDP per capita for the 2011 data.

```r
gapminder.2011.cv <- gapminder.2011 |>
    dplyr::mutate(lgdppc = log(gdp/population)) |>
    select(life_expectancy, lgdppc)
```

```r
ggpubr::ggscatter(gapminder.2011.cv, x="lgdppc", y="life_expectancy", palette="jco",
    size = 2, xlab="GDP per capita (log-scale)", ylab = "Life Expectancy",
    title = "Gapminder 2011 Data", xlim=c(0,12), ylim=c(0,85))
```

We split the dataset into $$K=10$$ random folds, each containing 16 or 17 observations, and fit 10 linear regression models using the 149 or 150 remaining observations.209

The indices for each of the folds are computed below:

```r
set.seed(0) # for replicability
true.order = sample.int(nrow(gapminder.2011.cv),
    nrow(gapminder.2011.cv), replace=FALSE)

index = list()
for(k in 1:6){
    index[[k]] = true.order[((k-1)*17+1):(k*17)]
}
for(k in 7:10){
    index[[k]] = true.order[(102+(k-6-1)*16+1):(k*16+6)]
}
```

Each fold is used, in turn, as a testing set while the remaining folds form the training set. We fit an OLS model on each training set, and evaluate the MSE performance of the model on the appropriate fold testing set.

```r
training.gap = list()
testing.gap = list()
model.lm.gap = list()
pred.lm.gap = list()
beta_0 = c()
beta_1 = c()
MSE.cv = c()
n.row = c()

for(k in 1:10){
    n.row[k] = length(index[[k]])
    training.gap[[k]] = gapminder.2011.cv[-index[[k]],]
    testing.gap[[k]] = gapminder.2011.cv[index[[k]],]
    model.lm.gap[[k]] = lm(life_expectancy~lgdppc, data=training.gap[[k]])
    beta_0[k] = model.lm.gap[[k]][[1]][1]
    beta_1[k] = model.lm.gap[[k]][[1]][2]
    pred.lm.gap[[k]] = predict(model.lm.gap[[k]], newdata=testing.gap[[k]])
    tmp = data.frame(pred.lm.gap[[k]], testing.gap[[k]][1])
    MSE.cv[k] = 1/nrow(tmp)*sum((tmp[,1]-tmp[,2])^2)
}
```

The number of observations in each fold, the regression parameters, and the MSE on each fold's testing set are shown below:

```r
results = data.frame(n.row, beta_0, beta_1, MSE.cv)
```

```
n.row   beta_0   beta_1    MSE.cv
   17 37.69812 4.254105 33.859051
   17 36.22257 4.442061 21.415376
   17 37.59386 4.247255 45.620933
   17 36.66761 4.345484 29.584469
   17 37.49685 4.268917 24.127300
   17 36.49849 4.386124 19.398769
   16 36.78991 4.380887 48.157391
   16 36.91113 4.331365 23.142625
   16 37.41767 4.274771  7.837172
   16 37.68955 4.254600 19.743046
```

The $$10$$-fold cross-validation estimate of $$\text{MSE}_{\text{Te}}$$ is thus \begin{aligned} \overline{\text{MSE}_{\text{Te}}}&=\frac{1}{10}\sum_{k=1}^{10}\text{MSE}_{\text{Te}_k}= 27.29\\ \text{CV}_{(K)}&=\sum_{k=1}^{10}\frac{n_k}{166}\text{MSE}_{\text{Te}_k}=27.35\\ \widehat{\text{se}}\left(\text{CV}_{(K)}\right)&=\sqrt{\frac{1}{10-1}\sum_{k=1}^{10}(\text{MSE}_{\text{Te}_k}-\overline{\text{MSE}_{\text{Te}}})^2}=12.38, \end{aligned} as can be seen below.

```r
(mean.MSE=mean(results$MSE.cv))
(cv.k=sum(results$n.row*results$MSE.cv/sum(results$n.row)))
(se.cv.k=sqrt(1/(nrow(results)-1)*sum((results$MSE.cv-mean.MSE)^2)))
```

```
[1] 27.28861
[1] 27.35051
[1] 12.37939
```

Thus, $$27.35 \pm 2(12.38)\equiv (2.59,52.11)$$ is a 95% $$\text{C.I.}$$ for the test error.

We can also get $$10$$-fold cross-validation estimates of the true $$\beta_0$$ and $$\beta_1$$: we have \begin{aligned} \beta_{0_{(K)}}&=\sum_{k=1}^{10}\frac{n_k}{166}\beta_{0;k}= 37.10\\ \widehat{\text{se}}\left(\beta_{0_{(K)}}\right)&=\sqrt{\frac{1}{10-1}\sum_{k=1}^{10}(\beta_{0;k}-\overline{\beta_0})^2}=0.54, \end{aligned} with $$37.10 \pm 2(0.54) \equiv (36.00,38.18)$$ as a 95% $$\text{C.I.}$$ for $$\beta_0$$, and \begin{aligned}\beta_{1_{(K)}}&=\sum_{k=1}^{10}\frac{n_k}{166}\beta_{1;k}= 4.32\\ \widehat{\text{se}}\left(\beta_{1_{(K)}}\right)&=\sqrt{\frac{1}{10-1}\sum_{k=1}^{10}(\beta_{1;k}-\overline{\beta_1})^2}= 0.07, \end{aligned} with $$4.32 \pm 2(0.07)\equiv (4.18,4.46)$$ as a 95% $$\text{C.I.}$$ for $$\beta_1$$, as is seen below:

```r
mean.beta_0=mean(results$beta_0)
cv.beta_0.k=sum(results$n.row*results$beta_0/sum(results$n.row))
se.cv.beta_0.k=sqrt(1/(nrow(results)-1)*sum((results$beta_0-mean.beta_0)^2))

mean.beta_1=mean(results$beta_1)
cv.beta_1.k=sum(results$n.row*results$beta_1/sum(results$n.row))
se.cv.beta_1.k=sqrt(1/(nrow(results)-1)*sum((results$beta_1-mean.beta_1)^2))
```

#### LASSO and Ridge Regression Revisited

How would we pick the optimal hyperparameter $$\lambda$$ in shrinkage regressions? Let us revisit the example from Section 12.2.4. As before, we are interested in modeling life expectancy $$Y$$ in the 2011 Gapminder dataset as a function of population, infant mortality, fertility, gdp, and continental membership.

We again create dummy variables for the continents, and select the appropriate variables for the response and the training set (which will be centered and scaled):

```r
gapminder.2011.f <- fastDummies::dummy_cols(gapminder.2011,
    select_columns = 'continent')

y <- gapminder.2011.f |> select(life_expectancy) |> as.matrix()
x <- gapminder.2011.f |>
    select(c("population","infant_mortality","fertility","gdp",
        "continent_Africa","continent_Americas","continent_Asia",
        "continent_Europe","continent_Oceania")) |>
    scale(center = TRUE, scale = TRUE) |> as.matrix()
```

We run a 5-fold cross-validation LASSO regression for a variety of hyperparameter values $$\lambda$$, and evaluate the CV test error for each $$\lambda$$ using $$\text{MSE}$$. The optimal $$\lambda$$ is the one that minimizes the CV test error. Let us start with the LASSO (`alpha=1`):

```r
glmnet1<-glmnet::cv.glmnet(x=x, y=y, type.measure='mse', nfolds=5, alpha=1)
(c1<-coef(glmnet1, s='lambda.min', exact=TRUE))
```

```
10 x 1 sparse Matrix of class "dgCMatrix"
                           s1
(Intercept)        70.8234940
population          .
infant_mortality   -5.7375945
fertility           .
gdp                 0.1616446
continent_Africa   -1.7592037
continent_Americas  .
continent_Asia      .
continent_Europe    0.1219114
continent_Oceania  -0.7977736
```

The optimal $$\lambda$$ in this case is

```r
(lambda1 = glmnet1$lambda.min)
```

```
[1] 0.3118295
```

We repeat the process for RR (alpha = 0):

```r
glmnet0<-glmnet::cv.glmnet(x=x, y=y, type.measure='mse', nfolds=5, alpha=0)
(c0<-coef(glmnet0, s='lambda.min', exact=TRUE))
```

```
10 x 1 sparse Matrix of class "dgCMatrix"
                           s1
(Intercept)        70.8234940
population         -0.3466483
infant_mortality   -4.6968992
fertility          -0.4240814
gdp                 0.5813385
continent_Africa   -1.6192452
continent_Americas  0.5467797
continent_Asia      0.6295896
continent_Europe    1.0091460
continent_Oceania  -0.7207190
```

The optimal $$\lambda$$ in this case is

```r
(lambda0 = glmnet0$lambda.min)
```

### 12.3.2 Bootstrap

```r
(boot.beta_0 = mean(results.boot$beta_0))
(boot.beta_1 = mean(results.boot$beta_1))
(cov(results.boot))
```

```
[1] 37.22167
[1] 4.308113
           beta_0     beta_1
beta_0  6.3281004 -0.7168375
beta_1 -0.7168375  0.0845396
```

The vector $$\mathbf{\hat{\mu}}^*$$ provides the bootstrap estimates; the corresponding estimates for the standard errors are $$\hat{\text{se}}(\mathbf{\hat{\mu}}^*)=(2.51,0.29)^{\!\top}$$, and the 95% $$\text{CI}$$ are \begin{aligned} \text{CI}(\beta_0;0.95)&=37.22\pm 2(2.51)\equiv(32.19,42.25) \\ \text{CI}(\beta_1;0.95)&=4.31\pm 2(0.29)\equiv(3.73,4.89),\end{aligned} as seen below:

```r
(se.boot.beta_0 = sqrt(1/(nrow(results.boot)-1)*
    sum((results.boot$beta_0-mean(results.boot$beta_0))^2)))
(se.boot.beta_1 = sqrt(1/(nrow(results.boot)-1)*
    sum((results.boot$beta_1-mean(results.boot$beta_1))^2)))
```

```
[1] 2.515572
[1] 0.2907569
```

Note that the bootstrap estimates are wider than the corresponding cross-validation estimates.

### 12.3.3 Jackknife

The jackknife estimator arises from cross-validation when $$K=N$$; the sole difference lies in the variance estimate $\text{Var}(\hat{\alpha}^*)=\frac{1}{N(N-1)}\sum_{i=1}^N(\hat{\alpha}_i^*-\overline{\hat{\alpha}^*})^2.$ The jackknife procedure is also known as leave-one-out validation.

#### Gapminder Example

We use the jackknife procedure for the regression problem with life expectancy and the log of the GDP per capita in the 2011 Gapminder data. For each fold $$1\leq k\leq N$$, we find the OLS fit on $$\text{Tr}_k$$ and retain the intercept $$\beta_{0,k}$$ and slope $$\beta_{1,k}$$.

```r
beta_0 = c()
beta_1 = c()
set.seed(0)

for(k in 1:nrow(gapminder.2011.cv)){
    training.gap = gapminder.2011.cv[-k,]
    model.lm.gap = lm(life_expectancy~lgdppc, data=training.gap)
    beta_0[k] = model.lm.gap[[1]][1]
    beta_1[k] = model.lm.gap[[1]][2]
}

results.jack = data.frame(beta_0, beta_1)
```

We display the joint distribution of $$\boldsymbol{\beta}=(\beta_0,\beta_1)^{\!\top}$$, together with the marginal distributions for each parameter.

```r
p <- ggplot2::ggplot(results.jack, ggplot2::aes(x=beta_0, y=beta_1)) +
    ggplot2::geom_point() +
    ggplot2::theme(legend.position="none")
ggExtra::ggMarginal(p, type="density", fill = "slateblue")
```

It would appear that $$\boldsymbol{\beta}$$ approximately follows a multivariate normal $$\mathcal{N}(\mathbf{\mu},\mathbf{\Sigma})$$, with $\mathbf{\mu}\approx \mathbf{\hat{\mu}}^*=\begin{pmatrix}37.11 \\ 4.32\end{pmatrix},\ \mathbf{\Sigma}\approx \mathbf{\hat{\Sigma}}^*=\begin{pmatrix}0.021 & -0.002 \\ -0.002 & 0.0003\end{pmatrix},$ as seen below:

```r
(jack.beta_0 = mean(results.jack$beta_0))
(jack.beta_1 = mean(results.jack$beta_1))
(cov(results.jack))
```

```
[1] 37.11005
[1] 4.316852
             beta_0        beta_1
beta_0  0.021108102 -0.0023729712
beta_1 -0.002372971  0.0002780615
```

The vector $$\mathbf{\hat{\mu}}^*$$ provides the jackknife estimates; the corresponding estimates for the standard errors are $$\hat{\text{se}}(\mathbf{\hat{\mu}}^*)=(0.011,0.001)^{\!\top}$$, and the 95% $$\text{CI}$$ are \begin{aligned} \text{CI}(\beta_0;0.95)&=37.11\pm 2(0.011)\equiv(37.09,37.13) \\ \text{CI}(\beta_1;0.95)&=4.32\pm 2(0.001)\equiv(4.318,4.322),\end{aligned} as seen below:

```r
(se.jack.beta_0 = sqrt(1/(nrow(results.jack)*(nrow(results.jack)-1))*
    sum((results.jack$beta_0-mean(results.jack$beta_0))^2)))
(se.jack.beta_1 = sqrt(1/(nrow(results.jack)*(nrow(results.jack)-1))*
    sum((results.jack$beta_1-mean(results.jack$beta_1))^2)))
```

```
[1] 0.0112764
[1] 0.001294245
```

Note that the jackknife estimates are much tighter than the corresponding bootstrap estimates. Will this always be the case?
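A pairs-bootstrap loop of the kind that produces a `results.boot` data frame can be sketched on R's built-in `cars` dataset (the choice of dataset and of `B = 200` replicates are assumptions made for this illustration, not the chapter's settings):

```r
# Pairs bootstrap for a simple linear regression (illustrative stand-in)
B <- 200
beta_0 <- beta_1 <- numeric(B)
set.seed(0)

for(b in 1:B){
    # resample N observations with replacement, refit, store the coefficients
    boot.index <- sample.int(nrow(cars), replace = TRUE)
    model <- lm(dist ~ speed, data = cars[boot.index, ])
    beta_0[b] <- model[[1]][1]
    beta_1[b] <- model[[1]][2]
}

results.boot <- data.frame(beta_0, beta_1)
colMeans(results.boot)   # bootstrap estimates of the regression parameters
```

Unlike the jackknife, which produces exactly $$N$$ deterministic leave-one-out fits, the bootstrap resamples with replacement, so its estimates vary with the seed and with $$B$$.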