12.4 Model Selection

A linear model \[Y=\vec{X}^{\!\top}\boldsymbol{\beta}+\varepsilon\] should be seen as an attempt to approximate the regression function \[y=f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}].\] What we gain in convenience of fit (and structure), we lose in modeling accuracy.

In this framework, we assume a linear relationship between the response \(Y\) and the predictors \(X_1,\ldots, X_p\), which we typically fit via ordinary least squares (OLS), which is to say \[\hat{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}}\{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|^2\},\] where the response vector \(\mathbf{Y}\) and the design matrix \(\mathbf{X}\) are provided by a training set \(\text{Tr}\).
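As a quick illustration (on simulated data, not on anything from this section), the least-squares solution obtained from the normal equations coincides with the output of R's lm():

# OLS on simulated data: the normal equations solution matches lm().
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n*p), n, p))            # design matrix with an intercept column
beta <- c(2, 1, -1, 0.5)
Y <- X %*% beta + rnorm(n)                         # linear model with N(0,1) errors
beta.hat <- drop(solve(t(X) %*% X, t(X) %*% Y))    # minimizes ||Y - X beta||^2
rbind(normal.equations = beta.hat, lm.fit = coef(lm(Y ~ X - 1)))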

Fundamentally, there are 3 ways in which the OLS framework can be extended:

  1. additive but non-linear models (see Section 12.5.3);

  2. non-linear models (see Section 12.5 and Spotlight on Classification), and

  3. replacing LS with alternative fitting procedures (see Section 12.2.4).

The latter approach can produce better accuracy than OLS without sacrificing too much in the way of model interpretability.212

In the OLS framework, however, prediction accuracy suffers when \(p>n\), due to the curse of dimensionality (see Section 15.2.2); model interpretability can be improved by removing irrelevant features, that is, by reducing \(p\).

The 3 classes of methods to do so are:

  • shrinkage and regularization methods;

  • dimension reduction, and

  • subset selection/feature selection.

For shrinkage/regularization methods, we fit a model involving all \(p\) predictors, but the estimated coefficients are shrunk towards \(0\) relative to the OLS parameter estimates, which has the effect of reducing the variance and can simultaneously perform variable selection (see Section 12.2.4).

In dimension reduction, we project the \(p\) predictors onto an \(M-\)dimensional manifold \(\mathcal{H}\), with \(M\ll p\); in numerous circumstances, \(\mathcal{H}\) is a subspace of \(\mathbf{R}^p\) and we can fit an OLS model on the projected coordinates (see Section 15.2).

In subset selection, we identify a subset of the \(p\) predictors for which there is evidence of a (strong-ish) link with the response, and we fit a model to this reduced set using the OLS framework. Given \(p\) predictors (some of which may be interaction terms), there are \(2^p\) OLS models that can be fit on a training set \(\text{Tr}\). Which of those models should be selected as the best model?

12.4.1 Best Subset Selection

In the best subset selection (BSS) approach, the search for the best model is usually broken down into 3 stages:

  1. let \(\mathcal{M}_0\) denote the null model (with no predictors), which simply predicts the sample mean for all observations;

  2. For \(k=1,\ldots, p\) (as long as the model can be fit):

    1. fit every model that contains exactly \(k\) predictors (there are \(\binom{p}{k}\) of them);

    2. pick the model with smallest \(\text{SSRes}\) (largest \(R^2\)) and denote it by \(\mathcal{M}_k\);

  3. select a unique model from \(\{\mathcal{M}_0,\ldots,\mathcal{M}_p\}\) using \(\text{CV}_{(K)}\), \(C_p\) (AIC), BIC, \(R_a^2\), or any other appropriate metric.213

BSS is conceptually simple, but with \(2^p\) models to try out, it quickly becomes computationally infeasible for large \(p\) (\(p>40\), say). When \(p\) is large, the chances of finding a model that performs well according to step 3 but poorly for new data increase, which can lead to overfitting and high-variance estimates, which were exactly the problems we were trying to avoid in the first place.214
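The bookkeeping need not be done by hand: the R package leaps implements BSS via regsubsets(). The sketch below runs it on the built-in mtcars data (the package, the dataset, and the nvmax value are our own illustrative choices, not part of this section's Credit example).

# Best subset selection with leaps::regsubsets() (exhaustive search);
# the exhaustive search considers all 2^p subsets, choose(p,k) of each size k.
library(leaps)
bss.fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "exhaustive")
bss.sum <- summary(bss.fit)                  # contains rss, adjr2, cp, bic, which
bss.sum$which[which.min(bss.sum$bic), ]      # predictors in the BIC-optimal subset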

12.4.2 Stepwise Selection

Stepwise selection (SS) methods attempt to overcome this challenge by looking only at a restricted set of models. Forward stepwise selection (FSS) starts with the null model \(\mathcal{M}_0\) and adds predictors one-by-one until it reaches the full model \(\mathcal{M}_p\):

  1. Let \(\mathcal{M}_0\) denote the null model;

  2. For \(k=0,\ldots,p-1\) (as long as the model can be fit):

    1. consider the \(p-k\) models that add a single predictor to \(\mathcal{M}_k\);

    2. pick the model with smallest \(\text{SSRes}\) (largest \(R^2\)) and denote it by \(\mathcal{M}_{k+1}\);

  3. select a unique model from \(\{\mathcal{M}_0,\ldots,\mathcal{M}_p\}\) using \(\text{CV}_{(K)}\), \(C_p\) (AIC), BIC, \(R_a^2\), or any other appropriate metric.

Backward stepwise selection (also BSS, unfortunately) works the other way, starting with the full model \(\mathcal{M}_p\) and removing predictors one-by-one until it reaches the null model \(\mathcal{M}_0\):

  1. Let \(\mathcal{M}_p\) denote the full model;

  2. For \(k=p,\ldots,1\) (as long as the model can be fit):

    1. consider the \(k\) models that remove a single predictor from \(\mathcal{M}_k\);

    2. pick the model with smallest \(\text{SSRes}\) (largest \(R^2\)) and denote it by \(\mathcal{M}_{k-1}\);

  3. select a unique model from \(\{\mathcal{M}_0,\ldots,\mathcal{M}_p\}\) using \(\text{CV}_{(K)}\), \(C_p\) (AIC), BIC, \(R_a^2\), or any other appropriate metric.

The computational advantage of SS over B(est)SS is evident: instead of having to fit \(2^p\) models, SS only requires \[1+p+(p-1)+\cdots+2+1 = \frac{p^2+p+2}{2}\] models to be fit to \(\text{Tr}\).

However, there is no guarantee that the “best” model (among the \(2^p\) BSS models) will be found in the \(\frac{p^2+p+2}{2}\) SS models. SS can be used in settings where \(p\) is too large for BSS to be computationally feasible. Note that for OLS models, backward stepwise selection only works if \(p\leq n\) (otherwise OLS might not have a unique parameter solution); if \(p>n\), only forward stepwise selection is viable.
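Both stepwise searches are also available through leaps::regsubsets(); only the method argument changes. (Again, the mtcars data and parameter choices below are illustrative only.)

# Forward and backward stepwise selection with leaps::regsubsets().
library(leaps)
fwd <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward"))
bwd <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "backward"))
c(forward = which.min(fwd$cp), backward = which.min(bwd$cp))  # subset sizes chosen by Mallows' Cp
fwd$which[which.min(fwd$cp), ]                                # predictors kept by the forward search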


Hybrid selection (HS) methods attempt to mimic BSS while keeping model computation in a manageable range, not unlike in SS. More information on this topic is available in [3].

12.4.3 Selecting the Optimal Model

The full model always has the largest \(R^2\)/smallest \(\text{SSRes}\), as \(R^2\) is a measure of the training error and, as such, is subject to the overfitting behaviour illustrated in the bias-variance trade-off diagram of Figure 12.6.

In order to estimate the test error (and thus pick the optimal model in the list \(\{\mathcal{M}_0,\ldots,\mathcal{M}_p\}\)), we can either:

  • adjust the training error to account for the bias induced by overfitting, or

  • directly estimate the test error using a validation set or cross-validation.

Adjustment Statistics

Commonly, we use one of the following adjustment statistics: Mallows’ \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the adjusted coefficient of determination \(R^2_a\); \(C_p\), AIC, and BIC must be minimized, while \(R^2_a\) must be maximized.

The adjustment statistics require the following quantities:

  • \(n\), the number of observations in \(\text{Tr}\);

  • \(p\), the number of predictors under consideration;

  • \(d=p+2\), the number of fitted parameters (the \(p\) coefficients, the intercept, and \(\hat{\sigma}^2\));

  • \(\hat{\sigma}^2\), the estimate of \(\text{Var}(\varepsilon)\) (irreducible error);

  • \(\text{SSRes}\) and \(\text{SSTot}\), the residual and the total sum of squares.

Mallows’ \(C_p\) statistic is given by \[C_p=\frac{1}{n}(\text{SSRes}+2d\hat{\sigma}^2)=\text{MSE}_\text{Tr}+\underbrace{\frac{2d\hat{\sigma}^2}{n}}_{\text{adjustment}}.\] As \(d\) increases, so does the adjustment term. Note that if \(\hat{\sigma}^2\) is an unbiased estimate of \(\text{Var}(\varepsilon)\), then \(C_p\) is an unbiased estimate of \(\text{MSE}_\text{Te}\).
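A minimal sketch of the computation, with our own helper function; \(\hat{\sigma}^2\) is taken from the largest model under consideration (an assumption on our part), and the mtcars models are illustrative only:

# Mallows' Cp for an OLS fit; sigma2.hat is an external estimate of Var(eps).
mallows.cp <- function(fit, sigma2.hat, d) {
  n <- length(residuals(fit))
  (sum(residuals(fit)^2) + 2*d*sigma2.hat) / n
}
full <- lm(mpg ~ ., data = mtcars)           # largest model: supplies sigma2.hat
sub  <- lm(mpg ~ wt + hp, data = mtcars)     # candidate model with p = 2 predictors
mallows.cp(sub, sigma2.hat = summary(full)$sigma^2, d = 2 + 2)   # d = p + 2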

The Akaike information criterion (AIC) is given by \[\text{AIC}=-2\ln L+\underbrace{2d}_{\text{adjustment}},\] where \(L\) is the maximized value of the likelihood function for the estimated model. If the errors are normally distributed, this requires finding the maximum of \[\begin{aligned} L&=\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\hat{\sigma}}\exp\left(-\frac{(Y_i-\mathbf{X}_i^{\!\top}\boldsymbol{\beta})^2}{2\hat{\sigma}^2}\right) \\ &=\frac{1}{(2\pi)^{n/2}\hat{\sigma}^n}\exp\left(-\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n(Y_i-\mathbf{X}_i^{\!\top}\boldsymbol{\beta})^2\right),\end{aligned}\] or, upon taking the logarithm, \[\ln L=\text{constant}-\frac{1}{2\hat{\sigma}^2}\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|^2,\] and so \[\arg\max_{\boldsymbol{\beta}}\{\ln L(\boldsymbol{\beta})\}=\arg\min_{\boldsymbol{\beta}}\{\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|^2\}.\] However, \[\begin{aligned} \text{AIC}&=-2\ln L +2d = \text{constant}+\frac{1}{\hat{\sigma}^2}\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|^2+2d \\ &=\text{constant}+\frac{\text{SSRes}}{\hat{\sigma}^2}+2d\\&=\text{constant}+\frac{n}{\hat{\sigma}^2}\cdot \frac{1}{n}\left(\text{SSRes}+2d\hat{\sigma}^2\right)=\text{constant}+\frac{n}{\hat{\sigma}^2}C_p.\end{aligned}\] Evidently, when the error structure is normal, minimizing AIC is equivalent to minimizing \(C_p\).
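R's AIC() applies this definition directly; a quick check (on an illustrative mtcars fit) reproduces it by hand, with logLik() reporting \(d=p+2\) fitted parameters:

# AIC = -2 ln L + 2d, by hand and via the built-in AIC() function.
fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- logLik(fit)                      # maximized Gaussian log-likelihood
d   <- attr(ll, "df")                   # 2 slopes + intercept + sigma = 4 = p + 2
c(by.hand = -2*as.numeric(ll) + 2*d, built.in = AIC(fit))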

The Bayesian information criterion uses a different adjustment term: \[\text{BIC}=\frac{1}{n}(\text{SSRes}+d\hat{\sigma}^2\ln n)=\text{MSE}_\text{Tr}+\underbrace{\frac{d\hat{\sigma}^2\ln n}{n}}_{\text{adjustment}}.\]

This adjustment penalizes models with a large number of predictors; in general, minimizing BIC results in selecting models with fewer variables than minimizing \(C_p\) does.

The adjusted coefficient of determination \(R_a^2\) is the Ur-example of an adjusted statistic: \[R^2_a=1-\frac{\text{SSRes}/(n-p-1)}{\text{SSTot}/(n-1)}=1-(1-R^2)\frac{n-1}{n-p-1}.\] Maximizing \(R^2_a\) is equivalent to minimizing \(\frac{\text{SSRes}}{n-p-1}\). Note that \(R_a^2\) penalizes models with unnecessary variables.
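The sketch below evaluates the BIC and \(R^2_a\) formulas above for a few nested OLS models (again on mtcars, our own illustrative choice); AIC is omitted since, under normal errors, it ranks models as \(C_p\) does.

# BIC and adjusted R^2, computed from the formulas above, for nested models.
adj.stats <- function(fit, sigma2.hat, n) {
  p     <- length(coef(fit)) - 1              # number of predictors
  d     <- p + 2                              # as defined above
  y     <- model.response(model.frame(fit))
  ssres <- sum(residuals(fit)^2)
  sstot <- sum((y - mean(y))^2)
  c(BIC = (ssres + d*sigma2.hat*log(n)) / n,
    R2a = 1 - (ssres/(n - p - 1)) / (sstot/(n - 1)))
}
full <- lm(mpg ~ ., data = mtcars)
fits <- list(lm(mpg ~ wt, data = mtcars), lm(mpg ~ wt + hp, data = mtcars), full)
sapply(fits, adj.stats, sigma2.hat = summary(full)$sigma^2, n = nrow(mtcars))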

Validation and Cross-Validation (Reprise)

As before, we want to select \(\mathcal{M}_{k^*}\) from a sequence of models \(\{\mathcal{M}_1,\mathcal{M}_2, \ldots\}\). The procedure is simple: we compute \(\text{MSE}_\text{Va}\) on some validation set or \(\text{CV}_{(K)}\) for each \(\mathcal{M}_k\), and select \(k^*\) for which the value is smallest (see Section 12.3.1).

The main advantages of this approach are that:

  • there is no need to estimate the irreducible error \(\text{Var}(\varepsilon)=\sigma^2\);

  • the method produces an estimate for \(\text{MSE}_\text{Te}\) “for free,” and

  • it can be used when the number of parameters is hard to pinpoint (in deep learning networks, for instance).

Historically, adjustment approaches were preferred because cross-validation was computationally demanding, especially when \(p,n\) were large, but that is not as much of a problem in modern times.

Consequently, cross-validation is championed as the optimal model selection approach, using the one-standard-error rule: compute the standard error of \(\widehat{\text{MSE}}_\text{Te}\) for each model size, and select the smallest model for which \(\widehat{\text{MSE}}_\text{Te}\) is within one standard error of the lowest point on the cross-validation error curve.

Roughly speaking, this amounts to applying Occam’s Razor215 to models with equivalent predictive power.
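A minimal sketch of the rule; the vectors cv.err and cv.se are assumed to hold the estimated test error and its standard error for each model size (they are not computed here):

# One-standard-error rule: smallest model whose CV error lies within one
# standard error of the minimum of the cross-validation error curve.
one.se.rule <- function(cv.err, cv.se) {
  best   <- which.min(cv.err)             # lowest point on the CV curve
  cutoff <- cv.err[best] + cv.se[best]    # one standard error above it
  min(which(cv.err <= cutoff))            # smallest admissible model size
}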

In the image below (modified from [3]), the lowest point is reached when \(p=6\) (blue “X”) and the dashed red lines represent the 1-standard error limits; according to the rule described above, we would select the model with \(p=4\) parameters (red dot).


SS methods are used extensively in practice, but there are serious limitations to this approach:

  • all intermediate tests are biased, as they are all based on the same data;

  • \(R^2_a\) only takes into account the number of features in the final model, not the degrees of freedom that have been used up during the entire process;

  • if the cross-validation error is used, stepwise selection should be repeated for each sub-model.

All in all, SS is a classic example of \(p\)-hacking: we obtain results without having set up hypotheses first.

Example

In spite of the warning mentioned above, it could still be useful to know how to perform stepwise selection. In what follows, we search for the best FSS and BSS linear models to predict the credit card balance for observations contained in the training set Credit.csv.

Credit<-read.csv("data/Credit.csv", stringsAsFactors = TRUE)
str(Credit)
'data.frame':   400 obs. of  12 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Income   : num  14.9 106 104.6 148.9 55.9 ...
 $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
 $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
 $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
 $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
 $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
 $ Gender   : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 1 1 ...
 $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
 $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
 $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
 $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

We remove the id variable X, and create dummy variables for the categorical levels.

Credit <- Credit[,-c(1)]
Credit$Gender.dummy <- ifelse(Credit$Gender == "Female",1,0)
Credit$Student.dummy <- ifelse(Credit$Student == "Yes",1,0)
Credit$Married.dummy <- ifelse(Credit$Married == "Yes",1,0)
Credit$Ethnicity.AA.dummy <- ifelse(Credit$Ethnicity == "African American",1,0)
Credit$Ethnicity.A.dummy <- ifelse(Credit$Ethnicity == "Asian",1,0)
Credit <- Credit[,c(1:6,12:16,11)]
summary(Credit)
     Income           Limit           Rating          Cards      
 Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000  
 1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000  
 Median : 33.12   Median : 4622   Median :344.0   Median :3.000  
 Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958  
 3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000  
 Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000  
      Age          Education      Gender.dummy    Student.dummy
 Min.   :23.00   Min.   : 5.00   Min.   :0.0000   Min.   :0.0  
 1st Qu.:41.75   1st Qu.:11.00   1st Qu.:0.0000   1st Qu.:0.0  
 Median :56.00   Median :14.00   Median :1.0000   Median :0.0  
 Mean   :55.67   Mean   :13.45   Mean   :0.5175   Mean   :0.1  
 3rd Qu.:70.00   3rd Qu.:16.00   3rd Qu.:1.0000   3rd Qu.:0.0  
 Max.   :98.00   Max.   :20.00   Max.   :1.0000   Max.   :1.0  
 Married.dummy    Ethnicity.AA.dummy Ethnicity.A.dummy    Balance       
 Min.   :0.0000   Min.   :0.0000     Min.   :0.000     Min.   :   0.00  
 1st Qu.:0.0000   1st Qu.:0.0000     1st Qu.:0.000     1st Qu.:  68.75  
 Median :1.0000   Median :0.0000     Median :0.000     Median : 459.50  
 Mean   :0.6125   Mean   :0.2475     Mean   :0.255     Mean   : 520.01  
 3rd Qu.:1.0000   3rd Qu.:0.0000     3rd Qu.:1.000     3rd Qu.: 863.00  
 Max.   :1.0000   Max.   :1.0000     Max.   :1.000     Max.   :1999.00  

We will work with a scaled version of the dataset.

Credit.scaled <- scale(Credit)
parameters <- attributes(Credit.scaled)
Credit.scaled <- data.frame(Credit.scaled)
var.names<-colnames(Credit.scaled)

We start by implementing step 2 of the FSS algorithm.

model <- c()
ind   <- c()   # order in which predictors enter the FSS models

for(i in 1:(ncol(Credit.scaled)-1)){
  r2 <- c()
  # try each predictor not yet selected; keep the one giving the largest R^2
  for(j in setdiff(1:(ncol(Credit.scaled)-1), ind)){
    model <- lm(Balance ~ ., data=Credit.scaled[,c(ind,j,12)])   # column 12 = Balance
    r2[j] <- summary(model)$r.squared
  }
  ind[i] <- which.max(r2)
}

var.names[ind]
 [1] "Rating"             "Income"             "Student.dummy"     
 [4] "Limit"              "Cards"              "Age"               
 [7] "Gender.dummy"       "Ethnicity.AA.dummy" "Married.dummy"     
[10] "Education"          "Ethnicity.A.dummy" 

The best 1-parameter model \(\mathcal{M}_1\) uses Rating, the best 2-parameter model \(\mathcal{M}_2\) built from \(\mathcal{M}_1\) uses Rating and Income, and so on.

Next, we implement step 3 by computing the adjustment statistics (AIC, BIC, \(R^2_a\)) and the cross-validation error (with \(k=5\) folds) for each of \(\mathcal{M}_0,\mathcal{M}_1,\ldots\) (the latter uses the function cv.lm() available in the lmvar package in R).

We deal with \(\mathcal{M}_0\) first.

model <- c()   # will hold the fitted models M_0, M_1, ..., M_p
aic   <- c()
bic   <- c()
r2a   <- c()
cv.m  <- c()

# null model M_0; x=TRUE, y=TRUE store the design matrix and response,
# which lmvar::cv.lm() requires
model[[1]] <- lm(Balance ~ 1, data=Credit.scaled, y=TRUE, x=TRUE)
cv.m[1]    <- lmvar::cv.lm(model[[1]], k=5)$MSE[[1]]
r2a[1]     <- summary(model[[1]])$adj.r.squared
aic[1]     <- AIC(model[[1]])
bic[1]     <- BIC(model[[1]])

The remaining models are similarly handled:

for(i in 1:(ncol(Credit.scaled)-1)){
  # M_i: the FSS model built from the first i selected predictors
  model[[i+1]] <- lm(Balance ~., data=Credit.scaled[,c(ind[c(1:i)],12)], y=TRUE, x=TRUE)
  cv.m[i+1]    <- lmvar::cv.lm(model[[i+1]], k=5)$MSE[[1]]
  r2a[i+1]     <- summary(model[[i+1]])$adj.r.squared
  aic[i+1]     <- AIC(model[[i+1]])
  bic[i+1]     <- BIC(model[[i+1]])
}

Let us plot the outcome for each adjustment statistic (and the CV estimation of the test error):

plot(cv.m)

plot(r2a)

plot(aic)

plot(bic)

The best FSS model using the CV estimate of the test error is:

ind.cv  <- which.min(cv.m)              # model index: model k+1 contains k predictors
var.names[ind[seq_len(ind.cv - 1)]]     # predictors of the selected model, in order of entry
[1] "Income"       "Limit"        "Rating"       "Cards"        "Age"         
[6] "Education"    "Gender.dummy"

The best FSS model using \(R^2_a\) is:

ind.r2a <- which.max(r2a)
var.names[ind[seq_len(ind.r2a - 1)]]
[1] "Income"        "Limit"         "Rating"        "Cards"        
[5] "Age"           "Education"     "Gender.dummy"  "Student.dummy"
[9] "Married.dummy"

The best FSS model using AIC is:

ind.aic <- which.min(aic)
var.names[ind[seq_len(ind.aic - 1)]]
[1] "Income"       "Limit"        "Rating"       "Cards"        "Age"         
[6] "Education"    "Gender.dummy"

The best FSS model using BIC is:

ind.bic <- which.min(bic)
var.names[ind[seq_len(ind.bic - 1)]]
[1] "Income" "Limit" 

Are there overlaps? The same can be done with best subset selection (BSS):

model.BSS <- c()
ind.BSS   <- list()   # ind.BSS[[k]]: column indices of the best k-predictor subset

for(i in 1:(ncol(Credit.scaled)-1)){
  r2 <- c()
  list.of.indices <- combn(1:(ncol(Credit.scaled)-1), i)   # all subsets of size i
  for(j in 1:ncol(list.of.indices)){
    model.BSS <- lm(Balance ~ ., data=Credit.scaled[,c(list.of.indices[,j],12)])
    r2[j]     <- summary(model.BSS)$r.squared
  }
  ind.BSS[[i]] <- list.of.indices[,which.max(r2)]
}

model.BSS <- c()
aic.BSS   <- c()
bic.BSS   <- c()
r2a.BSS   <- c()
cv.m.BSS  <- c()

model.BSS[[1]] <- lm(Balance ~ 1, data=Credit.scaled, y=TRUE, x=TRUE)
cv.m.BSS[1]    <- lmvar::cv.lm(model.BSS[[1]],k=5)$MSE[[1]]
r2a.BSS[1]     <- summary(model.BSS[[1]])$adj.r.squared
aic.BSS[1]     <- AIC(model.BSS[[1]])
bic.BSS[1]     <- BIC(model.BSS[[1]])

for(i in 1:(ncol(Credit.scaled)-1)){
  model.BSS[[i+1]] <- lm(Balance ~., data=Credit.scaled[,c(ind.BSS[[i]],12)], y=TRUE, x=TRUE)
  cv.m.BSS[i+1]    <- lmvar::cv.lm(model.BSS[[i+1]],k=5)$MSE[[1]]
  r2a.BSS[i+1]     <- summary(model.BSS[[i+1]])$adj.r.squared
  aic.BSS[i+1]     <- AIC(model.BSS[[i+1]])
  bic.BSS[i+1]     <- BIC(model.BSS[[i+1]])
}

ind.cv.BSS  <- which.min(cv.m.BSS)        # model index: model k+1 contains k predictors
var.names[ind.BSS[[ind.cv.BSS - 1]]]      # predictors of the selected BSS model
ind.r2a.BSS <- which.max(r2a.BSS)
var.names[ind.BSS[[ind.r2a.BSS - 1]]]
ind.aic.BSS <- which.min(aic.BSS)
var.names[ind.BSS[[ind.aic.BSS - 1]]]
ind.bic.BSS <- which.min(bic.BSS)
var.names[ind.BSS[[ind.bic.BSS - 1]]]
[1] "Income"    "Limit"     "Rating"    "Cards"     "Age"       "Education"
[1] "Income"        "Limit"         "Rating"        "Cards"        
[5] "Age"           "Education"     "Gender.dummy"  "Student.dummy"
[9] "Married.dummy"
[1] "Income"       "Limit"        "Rating"       "Cards"        "Age"         
[6] "Education"    "Gender.dummy"
[1] "Income" "Limit" 

Any surprises?
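One quick way to check (a sketch that assumes the objects created above are still in the workspace and that neither AIC-selected model is the null model): the symmetric difference of the two selected predictor sets is empty exactly when FSS and BSS agree under AIC.

# Compare the AIC-selected FSS and BSS predictor sets.
fss.aic <- var.names[ind[seq_len(ind.aic - 1)]]
bss.aic <- var.names[ind.BSS[[ind.aic.BSS - 1]]]
setdiff(union(fss.aic, bss.aic), intersect(fss.aic, bss.aic))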

References

[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.