13.5 Ensemble Learning

In practice, individual learners are often weak – they perform better than random guessing would, but not necessarily much better, or sufficiently so for specific analytical purposes. In the late 1980s, Kearns and Valiant asked the following question: can a set of weak learners be used to create a strong learner? The answer, as it turns out, is yes – via ensemble learning methods.

As an example, scientists trained 16 pigeons (weak learners, one would assume) to identify pictures of magnified biopsies of possible breast cancers. On average, each pigeon had an accuracy of about 85%, but when the most popular answer among the group was selected, the accuracy jumped to 99% [257].

(The material of this section follows [3] closely.)

13.5.1 Bagging

Bootstrap aggregation (also known as bagging) is an extension of bootstrapping. Originally, bootstrapping was used in situations where it is nearly impossible to compute the variance of a quantity of interest by exact means (see Section 12.3.2).

But it can also be used to improve the performance of various statistical learners, especially those that exhibit high variance (such as CART).248 Given a learning method, bagging can be used to reduce the variance of that method.

If \(Z_1,\ldots, Z_B\) are independent observations with \[\text{Cov}(Z_i,Z_j)=\begin{cases} \sigma^2 & \text{if $i=j$} \\ 0 & \text{else} \end{cases}\] then elementary properties of the variance yield \[\begin{aligned} \text{Var}(\overline{Z})&=\text{Var}\left(\frac{Z_1+\cdots+Z_B}{B}\right)=\frac{1}{B^2}\text{Var}(Z_1+\cdots+Z_B) \\ &=\frac{1}{B^2}\sum_{i,j=1}^B\text{Cov}(Z_i,Z_j)=\frac{1}{B^2}\sum_{i=1}^B\text{Var}(Z_i)=\frac{\sigma^2}{B}.\end{aligned}\] In other words, averaging a set of independent observations reduces the variance, since \(\frac{\sigma^2}{B}\leq \sigma^2\) for all \(B\in \mathbb{N}\). At first glance, this result does not seem particularly useful in practice, since we do not usually have access to multiple training sets. However, resampling methods can be used to generate multiple bagging training sets from the original training set \(\text{Tr}\).
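A quick simulation illustrates this variance reduction; the values of \(\sigma\), \(B\), and the number of replications below are arbitrary choices of ours.

# averaging B independent observations divides the variance by B (simulation)
set.seed(0)
sigma <- 2; B <- 50; M <- 10000       # M = number of simulated averages

Z.single <- rnorm(M, mean = 0, sd = sigma)
Z.bar    <- replicate(M, mean(rnorm(B, mean = 0, sd = sigma)))

var(Z.single)   # close to sigma^2 = 4
var(Z.bar)      # close to sigma^2/B = 0.08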

Let \(B>1\) be an integer. We generate \(B\) bootstrapped training sets from \(\text{Tr}\) by sampling \(N=|\text{Tr}|\) observations from \(\text{Tr}\), with replacement, to yield \[\text{Tr}_1,\ldots, \text{Tr}_B,\] and train a model \(\hat{f}_i\) (for regression) or \(\hat{C}_i\) (for classification) on each \(\text{Tr}_i\), \(i=1,\ldots, B\); for each \(\mathbf{x}^*\in \text{Te}\), we then have \(B\) predictions \[\hat{f}_1(\mathbf{x}^*),\ldots, \hat{f}_B(\mathbf{x}^*) \quad\text{(for regression)}\] or \[\hat{C}_1(\mathbf{x}^*),\ldots, \hat{C}_B(\mathbf{x}^*) \quad\text{(for classification)}.\]

The bagging prediction at \(\mathbf{x}^*\in \text{Te}\) is the average of all predictions \[\hat{f}_{\text{Bag}}(\mathbf{x}^*)=\frac{1}{B}\sum_{i=1}^B\hat{f}_i(\mathbf{x}^*)\quad\text{(for regression)}\] or the most frequent prediction \[\hat{C}_{\text{Bag}}(\mathbf{x}^*)=\text{Mode}\{\hat{C}_1(\mathbf{x}^*),\ldots,\hat{C}_B(\mathbf{x}^*)\}\quad\text{(for classification)}.\] Bagging is particularly helpful in the CART framework; to take full advantage of bagging, however, the trees should be grown deep (i.e., no pruning), as their complexity will lead to high variance but low bias (thanks to the bias-variance trade-off).

The bagged tree predictions retain the low bias of the individual deep trees, while the averaging step reduces the variance; bagging with hundreds or thousands of trees typically produces greatly improved predictions (at the cost of interpretability, however).

Out-of-Bag Error Estimation

As is usually the case with supervised models, we will need to estimate the test error for a bagged model. There is an easy way to provide the estimate without relying on cross-validation, which is computationally expensive when \(N\) is large.

The \(j\)th model is fit to the bootstrapped training set \(\text{Tr}_j\), \(j=1,\ldots,B\). We can show that, on average, each \(\text{Tr}_j\) contains \(\approx 2/3\) of the distinct observations of \(\text{Tr}\), which means that \(\approx 1/3\) of the training observations are not used to build the \(j\)th model (we refer to those observations as out-of-bag (OOB) observations).

We can then predict the response \(y_i\) for the \(i\)th observation in \(\text{Tr}\) using only those models for which \(\mathbf{x}_i\) was OOB; there should be about \(B/3\) such predictions, and \[\hat{y}_i=\text{Avg}\{\hat{f}_j(\mathbf{x}_i) \mid \mathbf{x}_i \in \text{OOB}(\text{Tr}_j)=\text{Tr}\setminus \text{Tr}_j\} \quad \text{(for regression)}\] or \[\hat{y}_i=\text{Mode}\{\hat{C}_j(\mathbf{x}_i) \mid \mathbf{x}_i \in \text{OOB}(\text{Tr}_j)\} \quad \text{(for classification)}.\] The OOB MSE (or the OOB misclassification rate) are thus good \(\text{Te}\) error estimates since none of the predictions are given by models that used the test observations in their training.
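As an illustration, the sketch below computes bagged predictions and the OOB error estimate by hand, using deep rpart trees; the data frame train and its numeric response y are hypothetical (packages such as ipred, used in the example below, automate all of this).

# bagging by hand, with OOB error estimation (sketch; the data frame train
# and its numeric response y are assumed to exist)
library(rpart)

B <- 150
N <- nrow(train)
boot.idx <- replicate(B, sample(N, N, replace = TRUE), simplify = FALSE)

# grow one deep (unpruned) tree per bootstrapped training set
trees <- lapply(boot.idx, function(idx)
  rpart(y ~ ., data = train[idx, ],
        control = rpart.control(minsplit = 5, cp = 0)))

# bagged prediction at a test point: average of the B tree predictions, e.g.
# mean(sapply(trees, predict, newdata = x.star))   # x.star: 1-row data frame

# OOB prediction for the i-th training observation: average only over the
# trees whose bootstrap sample does not contain observation i
oob.pred <- sapply(1:N, function(i) {
  out <- sapply(boot.idx, function(idx) !(i %in% idx))
  mean(sapply(trees[out], predict, newdata = train[i, ]))
})

sqrt(mean((train$y - oob.pred)^2))   # OOB estimate of the RMSE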

Variable Importance Measure

Bagging improves the prediction accuracy of stand-alone models, but such an improvement comes at the cost of reduced interpretability, especially in the case of CART: the bagged tree predictions cannot, in general, be expressed with the help of a single tree. In a single tree, the relative importance of the features is linked to the hierarchy of splits (namely, the most “important” variables appear in the earlier splits).

For bagged regression trees, the total decrease in \(\text{SSRes}\) due to splits over a given predictor (that is, the difference between the \(\text{SSRes}\) of trees with these splits and the \(\text{SSRes}\) of trees without them), averaged over the \(B\) bagged trees, provides a summary of the importance of each variable (large scores indicate important variables).

For bagged classification trees, we would replace \(\text{SSRes}\) with the Gini index, instead. Another approach might be to weigh the importance of a factor inversely proportionally to the level in which it appears (if at all) in each bagging tree and to average over all bagging trees.

For instance, if predictor \(X_1\) appears in the 1st split level of bagged tree 1, the 4th split level of bagged tree 2, and the 3rd split level of bagged tree 5, whereas predictor \(X_2\) appears in the 2nd, 2nd, 3rd, and 5th split levels of bagged trees 2, 3, 4, 5 (resp.), then the relative importance of each predictor over the 5 bagged trees is \[\begin{aligned} X_1&: (1+1/4+0+0 +1/3)\cdot \frac{1}{5}=19/60=0.32 \\ X_2&: (0+1/2+1/2+1/3 +1/5)\cdot \frac{1}{5}=23/75=0.31, \end{aligned}\] and the first variable would be nominally more important than the second.
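This toy computation is easily reproduced; the vectors below simply encode the split levels listed above, with 0 marking the bagged trees in which a predictor does not appear.

# relative importance as the average of the inverse split levels over 5 trees
# (a level of 0 means the predictor does not appear in that bagged tree)
levels.X1 <- c(1, 4, 0, 0, 3)
levels.X2 <- c(0, 2, 2, 3, 5)

rel.imp <- function(lev) mean(ifelse(lev == 0, 0, 1/lev))
rel.imp(levels.X1)   # 19/60 ~ 0.32
rel.imp(levels.X2)   # 23/75 ~ 0.31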

Example

We once again revisit the Iowa Housing Price example of Sections 12.5.2 and 13.4.1. Recall that we had built a training set dat.train with \(n=1160\) observations relating to the selling price SalePrice of houses in Ames, Iowa.

We build a regression tree bagging model using the R package ipred, with 150 bags, and the OOB error estimate.

set.seed(1234)
bag <- ipred::bagging(formula = SalePrice ~ ., data = dat.train, nbagg = 150, 
                      coob = TRUE, control = rpart::rpart.control(minsplit = 5, cp = 0),
                      importance = TRUE)
bag

Bagging regression trees with 150 bootstrap replications 

Call: bagging.data.frame(formula = SalePrice ~ ., data = dat.train, 
    nbagg = 150, coob = TRUE, control = rpart::rpart.control(minsplit = 5, 
        cp = 0), importance = TRUE)

Out-of-bag estimate of root mean squared error:  29.2526 

We can display the 5 most important variables:

p=ncol(dat.train)-1
vim <- data.frame(var=names(dat.train[,1:p]), imp=caret::varImp(bag))
vim.plot <- vim[order(vim$Overall, decreasing=TRUE),]
vim.plot <- vim.plot[1:5,]
barplot(vim.plot$Overall, names.arg=rownames(vim.plot), 
        col=heat.colors(5), xlab='Variable Importance')

The predictions on the testing data are obtained (and plotted) as follows:

yhat.bag = predict(bag, newdata=dat.test)

xlimit = ylimit = c(0,600)
plot(NA, col=2, xlim=xlimit, ylim=ylimit, xlab="Predicted Price ($1,000)", 
     ylab="Actual Price ($1,000)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(yhat.bag, dat.test$SalePrice, col=2)
legend(0,600, legend=c("Bagged Tree - (150 bags)"), col=c(2), pch=rep(1), bg='light grey')

The correlation between predicted and actual sale prices on dat.test is

cor(yhat.bag, dat.test$SalePrice)
[1] 0.9413044

How does that compare to the MARS and CART models?

13.5.2 Random Forests

In a bagging procedure, we fit models on various bootstrapped training sets and average their predictions; the variance-reduction argument of the previous section assumes that the models are independent.

In practice, however, the independence assumption is rarely met: if there are a few strong predictors in \(\text{Tr}\), each of the bagged models (built on the bootstrapped training sets \(\text{Tr}_i\)) is likely to be similar to the others, and the various predictions are unlikely to be uncorrelated, so that \[\text{Var}(\hat{y}_i)\neq \frac{\sigma^2}{B};\] averaging highly correlated quantities does not reduce the variance significantly (the independence assumption is essential to the variance-reduction argument).

With a small tweak, however, we can decorrelate the bagged models, leading to variance reduction when we average the bagged predictions. Random forests also build models on \(B\) bootstrapped training samples, but each model is built out of a random subset of \(m\leq p\) predictors.

For decision trees, every time a split is considered, the set of allowable predictors is selected from a random subset of \(m\) predictors out of the full \(p\) predictors. By selecting predictors randomly for each model, we lose out on building the best possible model on each training sample, but we also reduce the chance of the models being correlated. For a test observation \(\mathbf{x}^*\), the \(B\) predictions are combined as in bagging to yield the random forest prediction.

If \(m=p\), random forests reduce to bagged models; in practice we use \(m\approx \sqrt{p}\) for classification and \(m\approx p/3\) for regression. When the predictors are highly correlated, however, smaller values of \(m\) are recommended.
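As a quick sketch, these rules of thumb translate into the following values of \(m\); to the best of our knowledge, they also correspond to the defaults used by the R package randomForest.

# rules of thumb for the number of predictors considered at each split
p <- 13
m.classification <- floor(sqrt(p))       # m ~ sqrt(p) -> 3
m.regression     <- max(floor(p/3), 1)   # m ~ p/3     -> 4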

Example

We revisit the Wines example, using the R package randomForest.

wine = read.csv("Data/wine.csv", header=TRUE, stringsAsFactors = TRUE)
wine$Class = as.factor(wine$Class)

Let’s implement an 80%/20% training/testing split:

set.seed(1111)
ind.train <- sample(nrow(wine), 0.8*nrow(wine), replace = FALSE)
dat.train <- wine[ind.train,]
dat.test <- wine[-ind.train,]

There are \(p=13\) predictors in the dataset, so we should use \(m\approx \sqrt{13}\approx 4\). Keep in mind, however, that some of the variables are correlated (as we have seen), so we will try models with \(m=1\), \(m=2\), \(m=3\), and \(m=4\).

wine.rf.1 <- randomForest::randomForest(Class ~ ., data = dat.train, 
                                        ntree = 500, mtry = 1, importance = TRUE)
wine.rf.1

Call:
 randomForest(formula = Class ~ ., data = dat.train, ntree = 500,      mtry = 1, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 0.7%
Confusion matrix:
   1  2  3 class.error
1 47  0  0  0.00000000
2  0 54  1  0.01818182
3  0  0 40  0.00000000
wine.rf.2 <- randomForest::randomForest(Class ~ ., data = dat.train, 
                                        ntree = 500, mtry = 2, importance = TRUE)
wine.rf.2

Call:
 randomForest(formula = Class ~ ., data = dat.train, ntree = 500,      mtry = 2, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 0.7%
Confusion matrix:
   1  2  3 class.error
1 47  0  0  0.00000000
2  0 54  1  0.01818182
3  0  0 40  0.00000000
wine.rf.3 <- randomForest::randomForest(Class ~ ., data = dat.train, 
                                        ntree = 500, mtry = 3, importance = TRUE)
wine.rf.3

Call:
 randomForest(formula = Class ~ ., data = dat.train, ntree = 500,      mtry = 3, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 1.41%
Confusion matrix:
   1  2  3 class.error
1 46  1  0  0.02127660
2  0 54  1  0.01818182
3  0  0 40  0.00000000
wine.rf.4 <- randomForest::randomForest(Class ~ ., data = dat.train, 
                                        ntree = 500, mtry = 4, importance = TRUE)
wine.rf.4

Call:
 randomForest(formula = Class ~ ., data = dat.train, ntree = 500,      mtry = 4, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 1.41%
Confusion matrix:
   1  2  3 class.error
1 46  1  0  0.02127660
2  0 54  1  0.01818182
3  0  0 40  0.00000000

In this example, the choice of \(m\) makes very little difference: the OOB error rate is 0.7% for \(m=1\) and \(m=2\), and 1.41% for \(m=3\) and \(m=4\) (a single extra misclassified training observation).

13.5.3 Boosting

Another general approach to improving prediction results for statistical learners involves creating a sequence of models, each improving on the previous one in the series. Boosting does not involve bootstrap sampling; instead, each model is fit to the residuals left by the models that precede it, and the ensemble learns slowly.

For regression problems, we proceed as follows:

  1. set \(\hat{f}(\mathbf{x})=0\) and \(r_i=y_i\) for all \(\mathbf{x}_i\in \text{Tr}\);

  2. for \(b=1,2,\ldots, B\):

    1. fit a model \(\hat{f}^b\) to the training data \((\mathbf{X},\mathbf{r})\);

    2. update the regression function \(\hat{f}:=\hat{f}+\lambda\hat{f}^b\);

    3. update the residuals \(r_i:=r_i-\lambda\hat{f}^b(\mathbf{x}_i)\) for all \(\mathbf{x}_i\in \text{Tr}\);

  3. output the boosted model \(\hat{f}(\mathbf{x})=\lambda(\hat{f}^1(\mathbf{x})+\cdots+\hat{f}^B(\mathbf{x}))\).
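A minimal sketch of this procedure, using shallow rpart trees as the base learners; the function names are ours, the choices of \(B\), \(\lambda\), and the tree depth are arbitrary, and the Credit example later in this section does the same thing with the tree package.

# boosting for regression (sketch); X is a data frame of predictors and
# y a numeric response, both assumed to exist
library(rpart)

boost.fit <- function(X, y, B = 1000, lambda = 0.01, depth = 2) {
  r <- y                                       # step 1: residuals start as the response
  models <- vector("list", B)
  for (b in 1:B) {                             # step 2
    models[[b]] <- rpart(r ~ ., data = data.frame(X, r = r),
                         control = rpart.control(maxdepth = depth))
    r <- r - lambda * predict(models[[b]], X)  # update the residuals
  }
  models
}

boost.predict <- function(models, newX, lambda = 0.01) {
  # step 3: the boosted prediction is the shrunken sum of the B model predictions
  # (use the same lambda as in boost.fit)
  lambda * Reduce(`+`, lapply(models, predict, newdata = newX))
}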

In this version of the algorithm, boosting requires three tuning parameters:

  • the number of models \(B\), which can be selected through cross-validation (boosting can overfit if \(B\) is too large);

  • the shrinkage parameter \(\lambda\) (typically, \(0<\lambda\ll 1\)), which controls the boosting learning rate (a small \(\lambda\) needs a large \(B\), in general); the optimal \(\lambda\) and \(B\) can be found via cross-validation, and

  • the complexity of the base learners, which is not always stated explicitly; for boosted trees, it is typically controlled by the number of splits allowed in each tree (the interaction depth).

Variants of the boosting algorithm allowing for classification and for varying weights depending on performance regions in predictor space also exist and are quite popular. While the various No Free Lunch theorems guarantee that no supervised learning algorithm is always best regardless of context/data, the combination of AdaBoost with weak CART learners is seen by many as the best “out-of-the-box” classifier.

Example

Consider the Credit.csv dataset [3]; the task is to determine the credit card balance based on a number of other factors.

Credit = read.csv("Data/Credit.csv", header=TRUE, stringsAsFactors = TRUE)
str(Credit)
'data.frame':   400 obs. of  12 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Income   : num  14.9 106 104.6 148.9 55.9 ...
 $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
 $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
 $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
 $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
 $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
 $ Gender   : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 1 1 ...
 $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
 $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
 $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
 $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

We remove the index variable, and create binary variables for all categorical levels in the data.

Credit <- Credit[,-c(1)]
Credit$Gender.dummy <- ifelse(Credit$Gender == "Female",1,0)
Credit$Student.dummy <- ifelse(Credit$Student == "Yes",1,0)
Credit$Married.dummy <- ifelse(Credit$Married == "Yes",1,0)
Credit$Ethnicity.AA.dummy <- ifelse(Credit$Ethnicity == "African American",1,0)
Credit$Ethnicity.A.dummy <- ifelse(Credit$Ethnicity == "Asian",1,0)

The modified dataset then has the following structure:

Credit <- Credit[,c(1:6,12:16,11)]
str(Credit)
'data.frame':   400 obs. of  12 variables:
 $ Income            : num  14.9 106 104.6 148.9 55.9 ...
 $ Limit             : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
 $ Rating            : int  283 483 514 681 357 569 259 512 266 491 ...
 $ Cards             : int  2 3 4 3 2 4 2 2 5 3 ...
 $ Age               : int  34 82 71 36 68 77 37 87 66 41 ...
 $ Education         : int  11 15 11 11 16 10 12 9 13 19 ...
 $ Gender.dummy      : num  0 1 0 1 0 0 1 0 1 1 ...
 $ Student.dummy     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ Married.dummy     : num  1 1 0 0 1 0 0 0 0 1 ...
 $ Ethnicity.AA.dummy: num  0 0 0 0 0 0 1 0 0 1 ...
 $ Ethnicity.A.dummy : num  0 1 1 1 0 0 0 1 0 0 ...
 $ Balance           : int  333 903 580 964 331 1151 203 872 279 1350 ...
Summary statistics for the modified dataset:

Variable               Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Income                10.35     21.01     33.12     45.22     57.47    186.63
Limit                   855      3088      4622      4736      5873     13913
Rating                 93.0     247.2     344.0     354.9     437.2     982.0
Cards                 1.000     2.000     3.000     2.958     4.000     9.000
Age                   23.00     41.75     56.00     55.67     70.00     98.00
Education              5.00     11.00     14.00     13.45     16.00     20.00
Gender.dummy         0.0000    0.0000    1.0000    0.5175    1.0000    1.0000
Student.dummy           0.0       0.0       0.0       0.1       0.0       1.0
Married.dummy        0.0000    0.0000    1.0000    0.6125    1.0000    1.0000
Ethnicity.AA.dummy   0.0000    0.0000    0.0000    0.2475    0.0000    1.0000
Ethnicity.A.dummy     0.000     0.000     0.000     0.255     1.000     1.000
Balance                0.00     68.75    459.50    520.01    863.00   1999.00

We pick 300 of the 400 observations to be part of the training set:

set.seed(1234)
ind = sample(1:nrow(Credit), size = 300)
Credit.train = Credit[ind,]
Credit.test = Credit[-ind,]

We are going to use \(\lambda=0.005\) as a shrinkage parameter and \(B=2000\) models.

lambda = 0.005
X <- Credit.train[,1:11] # predictors
r <- Credit.train[,12]   # response
X.test <- Credit.test[,1:11]
Y.test = Credit.test[,12]

We start by building the first iteration of the boosting model, using R’s tree package, and look at its predictions on the test data:

tree.f <- tree::tree(r ~ ., data = data.frame(cbind(X,r)), na.action=na.pass)
r <- r - lambda * predict(tree.f, X)
plot(Y.test,predict(tree.f, X.test))
abline(0,1,col="red")

Visually, the performance seems middling, which is borne out by the correlation between the actual and predicted test observations.

cor(Y.test,predict(tree.f, X.test))
[1] 0.8640351

Let’s compare with the results of boosting with 2000 models.

results.boost <- 0*Y.test   # accumulator for the boosted test-set predictions
B = 2000
trees <- list()             # the B fitted trees

for(b in 1:B){
  # fit a tree to the current residuals, shrink its contribution by lambda,
  # then update both the residuals and the running test-set predictions
  trees[[b]] <- tree::tree(r ~ ., data = data.frame(cbind(X,r)), na.action=na.pass)
  r <- r - lambda * predict(trees[[b]], X)
  results.boost = results.boost + lambda * predict(trees[[b]], X.test)
}

plot(Y.test,results.boost)
abline(0,1,col="red")

Visually, the performance is much improved; the correlation metric bears that out:

cor(Y.test,results.boost)
[1] 0.9734103

We can also use the pre-built gbm package to achieve essentially the same results, and to extract the influential predictors:

boost.credit = gbm::gbm(Balance~., data = Credit.train, 
                        distribution = "gaussian", n.trees = 10000, 
                        shrinkage = 0.01, interaction.depth = 4)
summary(boost.credit)

                                  var     rel.inf
Limit                           Limit 42.87367610
Rating                         Rating 34.15766697
Income                         Income 12.72692829
Student.dummy           Student.dummy  5.04913001
Age                               Age  2.48572012
Education                   Education  1.13432102
Cards                           Cards  0.78545388
Ethnicity.AA.dummy Ethnicity.AA.dummy  0.30692238
Married.dummy           Married.dummy  0.25299822
Gender.dummy             Gender.dummy  0.15135520
Ethnicity.A.dummy   Ethnicity.A.dummy  0.07582781

In order to determine the optimal number of models \(B\) to use (is it really necessary to run 10,000 models?), we seek to minimize the MSE of the predictions on the test set:

n.trees = seq(from = 100, to = 10000, by = 100)
predmat = predict(boost.credit, newdata = Credit.test, n.trees = n.trees)
boost.err = with(Credit.test, apply( (predmat - Balance)^2, 2, mean) )
plot(n.trees, boost.err, pch = 23, ylab = "Mean Squared Error", 
     xlab = "# Trees", main = "Boosting Test Error")
abline(h = min(boost.err), col = "red")

which(boost.err==min(boost.err))
1200 
  12 

The optimal gbm boosted model (with parameters as in the gbm() call above) is thus:

results.boost.gbm = predict(boost.credit, newdata = Credit.test, n.trees = 1200)
plot(Y.test,results.boost.gbm)
abline(0,1,col="red")

cor(Y.test,results.boost.gbm)

AdaBoost

Adaptive Boosting (AdaBoost) adapts dynamic boosting to a set of models in order to minimize the error made by the individual weak models (such as “stubby” trees or “coarse” linear models). The “adaptive” part means that any new weak learner that is added to the boosted model is modified to improve the predictive power on instances which were “mis-predicted” by the (previously) boosted model.

The main idea behind dynamic boosting is to consider a weighted sum of weak learners whose weights are adjusted so that the prediction error is minimized.

Consider a classification context, where \[\text{Tr}=\{(\mathbf{x}_i,y_i)\mid i=1,\ldots, n\}, \quad \text{where }y_i\in\{-1,+1\}\quad \forall i=1,\ldots, n.\] The boosted classifier is a function \[F(\mathbf{x})=\sum_{b=1}^Bc_bf_b(\mathbf{x}),\] where \(f_b(\mathbf{x})\in \{-1,+1\}\) and \(c_b\in \mathbb{R}\) for all \(b\) and all \(\mathbf{x}\). The class prediction at \(\mathbf{x}\) is simply \(\text{sign}(F(\mathbf{x}))\).

The AdaBoost contribution comes in at the modeling stage, where, for each \(\mu\in \{1,\ldots, B\}\) the weak learner \(f_{\mu}\) is trained on a weighted version of the original training data, with observations that are misclassified by the partial boosted model \[F_{\mu}(\mathbf{x})=\sum_{b=1}^{\mu-1}c_bf_b(\mathbf{x})\] given larger weights; AdaBoost estimates the weights \(w_i\), \(i=1,\ldots,n\) at each of the boosting steps \(b=1, \ldots, B\).

Real AdaBoost [258] is a generalization of this approach which does away with the constants \(c_b\):

  1. Initialize the weights \(\mathbf{w}\), with \(w_i=1/n\), for \(1\leq i\leq n\);

  2. For \(b=1, \ldots, B\), repeat the following steps :

    1. fit the class probability estimate \(p_b(\mathbf{x})=P(y=1 \mid \mathbf{x},\mathbf{w})\), using the weak learner algorithm of interest;

    2. define \(f_b(\mathbf{x})=\frac{1}{2}\log \frac{p_b(\mathbf{x})}{1-p_b(\mathbf{x})}\);

    3. set \(w_i \leftarrow w_i \exp \{-y_if_b(\mathbf{x}_i)\}\), for \(1 \leq i \leq n\);

    4. renormalize so that \(\|\mathbf{w} \|_1 =1\);

  3. Output the classifier \[F(\mathbf{x}) = \text{sign}\left\{\displaystyle{\sum_{b=1}^Bf_b(\mathbf{x})}\right\}.\]

For regression tasks, the procedure must be modified to some extent: the analogue of assigning larger weights to currently misclassified observations is, roughly speaking, to train each new model on the (shrunken) residuals of the current ensemble. Since boosting is susceptible to overfitting (unlike bagging and random forests), the optimal number of boosting steps \(B\) should be derived from cross-validation.
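Before turning to library implementations, here is a minimal from-scratch sketch of Real AdaBoost in R, with rpart stumps as the weak learners; the data frame X and the \(\pm 1\)-coded response y are assumed to exist, and the class probabilities are clipped away from 0 and 1 to keep the log-odds finite.

# Real AdaBoost from scratch (sketch); X is a data frame of predictors and
# y a vector coded in {-1, +1}, both assumed to exist
library(rpart)

real.adaboost <- function(X, y, B = 100, eps = 1e-6) {
  n <- nrow(X)
  w <- rep(1/n, n)                        # step 1: uniform weights
  dat <- data.frame(X, y = factor(y))     # factor levels "-1" and "1"
  models <- vector("list", B)
  for (b in 1:B) {                        # step 2
    models[[b]] <- rpart(y ~ ., data = dat, weights = w, method = "class",
                         control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    p <- predict(models[[b]], dat, type = "prob")[, "1"]      # 2(a): P(y = 1 | x, w)
    p <- pmin(pmax(p, eps), 1 - eps)
    f <- 0.5 * log(p / (1 - p))                               # 2(b): half log-odds
    w <- w * exp(-y * f)                                      # 2(c): reweight
    w <- w / sum(w)                                           # 2(d): renormalize
  }
  models
}

adaboost.predict <- function(models, newX, eps = 1e-6) {
  F <- rep(0, nrow(newX))
  for (m in models) {                     # step 3: sum the weak learners' log-odds
    p <- pmin(pmax(predict(m, newX, type = "prob")[, "1"], eps), 1 - eps)
    F <- F + 0.5 * log(p / (1 - p))
  }
  sign(F)                                 # predicted class in {-1, +1}
}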

Example

The Python library scikit-learn provides a useful implementation of AdaBoost. In order to use it, a base estimator (that is to say, a weak learner) must first be selected. In what follows, we will use a decision tree classifier. Once this is done, an AdaBoostClassifier object is created, to which we feed the weak learner, the number of estimators, and the learning rate (a.k.a. the shrinkage parameter, a small positive number). In general, small learning rates require a large number of estimators to provide adequate performance. By default, scikit-learn’s implementation uses 50 estimators with a learning rate of 1.

We use the classic Two-Moons dataset consisting of two interleaving half circles with added noise, in order to test and compare classification results for AdaBoost (and eventually Gradient Boosting, see below). This dataset is conveniently built-in to scikit-learn and accessible via make_moons(), which returns a data matrix X and a label vector y. We can treat the dataset as a complete training set as we eventually use cross-validation to estimate the test error.249

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
N = 1000
X,Y = make_moons(N,noise=0.2)
plt.scatter(X[:,0],X[:,1], c=Y)
plt.show()

We first attempt to classify the data using DecisionTreeClassifier with maximum depth of 3 (this same structure will later be used for our weak learners).

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X,Y)
xx,yy = np.meshgrid(np.linspace(-1.5,2.5,50), np.linspace(-1,1.5,50))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.scatter(X[:,0],X[:,1], c = Y)
plt.contourf(xx,yy,Z,alpha=0.3)
plt.show()

As can be seen from the graph above, this single decision tree does not provide the best of fits.

Next, we build an AdaBoost classifier. We first consider a model with \(B=5\) decision trees, and with a learning rate \(\lambda= 1/10\).

ada = AdaBoostClassifier(clf, n_estimators=5, learning_rate=0.1)
ada.fit(X,Y)
xx,yy = np.meshgrid(np.linspace(-1.5,2.5,50),np.linspace(-1,1.5,50))
Z = ada.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.scatter(X[:,0],X[:,1], c = Y)
plt.contourf(xx,yy,Z,alpha=0.3)
plt.show()

Finally, we build an AdaBoost classifier with \(B=10\) decision trees and with a learning rate \(\lambda= 1/10\).

ada = AdaBoostClassifier(clf, n_estimators=10, learning_rate=0.1)
ada.fit(X,Y)
xx,yy = np.meshgrid(np.linspace(-1.5,2.5,50),np.linspace(-1,1.5,50))
Z = ada.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.scatter(X[:,0],X[:,1], c = Y)
plt.contourf(xx,yy,Z,alpha=0.3)
plt.show()

The AdaBoosted tree is better at capturing the dataset’s structure. Of course, until we evaluate the performance on an independent test set, this could simply be a consequence of overfitting (one of AdaBoost’s main weaknesses, as the procedure is sensitive to outliers and noise). We can guard against this eventuality by adjusting the learning rate (which provides a step size for the algorithm’s iterations).

To find the optimal value for the learning rate and the number of estimators, one can use the GridSearchCV method from sklearn.model_selection, which implements cross-validation on a grid of parameters. It can also be parallelized, in case the efficiency of the algorithm should ever need improving (that is not necessary on such a small dataset, but could prove useful with larger datasets).

from sklearn.model_selection import GridSearchCV
params = {
     'n_estimators': np.arange(10,300,10),
     'learning_rate': [0.01, 0.05, 0.1, 1],
 }

grid_cv = GridSearchCV(AdaBoostClassifier(), param_grid= params, cv=5, n_jobs=1)
grid_cv.fit(X,Y)
grid_cv.best_params_
{'learning_rate': 0.05, 'n_estimators': 200}

The results show that, given the selected grid search ranges, the optimal parameters (those that provide the best cross-validation fit for the data) are 200 estimators with a learning rate of 0.05 (these values may change if cv or the random state is modified). The plot of the model with these parameters indeed shows that the fit looks quite acceptable.

ada = grid_cv.best_estimator_
xx,yy = np.meshgrid(np.linspace(-1.5,2.5,50),np.linspace(-1,1.5,50))
Z = ada.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.scatter(X[:,0],X[:,1], c = Y)
plt.contourf(xx,yy,Z,alpha=0.3)
plt.show()

Gradient Boosting

The implementation of Gradient Boosting is simpler than that of AdaBoost. The idea is to first fit a model, then to compute the residual generated by this model. Next, a new model is trained, but on the residuals instead of on the original response. The resulting model is added to the first one. Those steps are repeated a sufficient number of times. The final model will be a sum of the initial model and of the subsequent models trained on the chain of residuals.

We will not go into the details of gradient boosting here (see [215] for details), but showcase how it would be applied on the Two-Moons dataset (remember that without an estimate of the test error, we cannot use these classifiers for predictive purposes due to overfitting).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=9, learning_rate=0.5)
gbc.fit(X,Y)
xx,yy = np.meshgrid(np.linspace(-1.5,2.5,50),np.linspace(-1,1.5,50))
Z = gbc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.scatter(X[:,0],X[:,1], c = Y)
plt.contourf(xx,yy,Z,alpha=0.3)
plt.show()

References

[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[215]
O. Leduc and P. Boily, “Boosting with AdaBoost and gradient boosting,” Data Action Lab Blog, 2019.
[257]
R. M. Levenson, E. A. Krupinski, V. M. Navarro, and E. A. Wasserman, “Pigeons (Columba livia) as trainable observers of pathology and radiology breast cancer images,” PLOS ONE, vol. 10, no. 11, pp. 1–21, Nov. 2015, doi: 10.1371/journal.pone.0141357.
[258]
J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.