12.6 Example: Algae Blooms

We continue the algae blooms analysis (based on a case study by L. Torgo [74]) started in the Data Preparation module. The objective is to predict various algae levels in water samples; we continue the analysis with the data frame algae_blooms.sna2.

12.6.1 Value Estimation Modeling

For supervised learning tasks, the bias-variance trade-off means that we need to set aside a testing set on which the models (which were learned on the training set) are evaluated.

There are no hard-and-fast rules to determine how to split the data; if the signal is very strong, a small training set should capture its important features, but we do not typically know how strong the signal is before we start the modeling process. On the other hand, if the training set is too large, we run the risk of overfitting the data. Weighing both of these considerations, we elect to use a 65%/35% training/testing split.

The training data should also be representative, to avoid injecting biases in the model (in case the data was provided according to some systematic but unknown rule). There are numerous ways to do this (see Survey Sampling Methods, for instance), but we can do so using a simple random sample of 218 observations (we could also have stratified according to season, size, etc.).

To avoid replicability issues, we will always use the same training set.

set.seed(0)  # for replicability
# ind <- sample(1:dim(algae_blooms.sna2)[1], 218, replace=FALSE)
ind <- 1:218
218/338      # proportion of training observations
[1] 0.6449704

algae.train <- algae_blooms.sna2[ind,]  # training set
algae.test  <- algae_blooms.sna2[-ind,] # testing set
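The stratified alternative mentioned above could be sketched as follows; this is a hypothetical variant (not used in what follows) which assumes that the season column of algae_blooms.sna2 is a factor, and which draws roughly 65% of the observations within each season.

```r
# Hypothetical sketch: sample ~65% of the observations within each season,
# so that the seasonal make-up of the training set mirrors the full data
set.seed(0)
ind.strat <- unlist(lapply(
  split(seq_len(nrow(algae_blooms.sna2)), algae_blooms.sna2$season),
  function(idx) sample(idx, size = round(0.65 * length(idx)))
))
algae.train.strat <- algae_blooms.sna2[ind.strat, ]
```

The same idea extends to stratification by size or speed, or by crossings of these factors.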

Generalized Linear Model

Before we get too excited about using various machine learning methods, it is always worth seeing what the traditional approach yields. We run a linear model to predict a2 (to pick but one of the response variables) against all the predictor variables, using only the training set.

linear.model.a2 <- lm(a2 ~ season + size + speed + mxPH + mnO2 + 
                        Cl + NO3 + NH4 + oPO4 + PO4 + Chla,data=algae.train)

A summary of the results can be given by calling the summary method on the resulting object.

summary(linear.model.a2)

Call:
lm(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
    NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.436  -5.281  -2.613   2.026  62.712 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.083e+01  1.257e+01  -2.452 0.015056 *  
seasonsummer -1.166e-01  2.112e+00  -0.055 0.956035    
seasonautumn  1.071e+00  2.370e+00   0.452 0.651934    
seasonwinter -1.451e+00  2.000e+00  -0.726 0.468935    
sizemedium   -2.628e+00  1.895e+00  -1.387 0.166896    
sizelarge    -3.210e+00  2.412e+00  -1.331 0.184767    
speedmedium   3.887e+00  2.485e+00   1.564 0.119325    
speedhigh    -1.104e+00  2.772e+00  -0.398 0.690751    
mxPH          4.859e+00  1.559e+00   3.117 0.002092 ** 
mnO2         -1.841e-01  3.924e-01  -0.469 0.639474    
Cl           -7.432e-03  2.006e-02  -0.371 0.711351    
NO3           2.132e-01  3.028e-01   0.704 0.482249    
NH4          -5.979e-04  5.355e-04  -1.117 0.265510    
oPO4          2.290e-03  9.876e-03   0.232 0.816875    
PO4          -1.559e-03  5.936e-03  -0.263 0.793090    
Chla          1.652e-01  4.614e-02   3.579 0.000432 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.74 on 202 degrees of freedom
Multiple R-squared:  0.206, Adjusted R-squared:  0.147 
F-statistic: 3.493 on 15 and 202 DF,  p-value: 2.498e-05

We see that the adjusted \(R^2\) coefficient is fairly small, which is not ideal. Furthermore, the residuals should have a mean of 0 and be “small”, which is not quite the case here. That being said, the \(F\)-statistic seems to indicate that there is some (linear) dependence on the predictor variables.

We can get a better handle on the regression diagnostics by calling the plot() method on the object.

plot(linear.model.a2)

All in all, the linear model is … not great.

The significance of some of the coefficients is questionable, however, and we might wonder what effect their inclusion might have.

anova(linear.model.a2)
Analysis of Variance Table

Response: a2
           Df  Sum Sq Mean Sq F value    Pr(>F)    
season      3   112.3   37.42  0.3243 0.8078029    
size        2   436.0  217.99  1.8892 0.1538604    
speed       2  1552.8  776.42  6.7287 0.0014825 ** 
mxPH        1  2223.5 2223.54 19.2698 1.829e-05 ***
mnO2        1     0.5    0.54  0.0047 0.9455025    
Cl          1     0.3    0.33  0.0029 0.9572795    
NO3         1    43.9   43.91  0.3806 0.5380001    
NH4         1   193.8  193.82  1.6797 0.1964428    
oPO4        1     0.1    0.09  0.0008 0.9775762    
PO4         1     4.8    4.82  0.0417 0.8383141    
Chla        1  1478.2 1478.18 12.8103 0.0004316 ***
Residuals 202 23308.8  115.39                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An analysis of variance (ANOVA) suggests that NH4 contributes little to the prediction of a2, and we might be interested in the results of a linear regression with that predictor removed.

linear.model.a2.mod <- update(linear.model.a2, . ~ . - NH4)
summary(linear.model.a2.mod)

Call:
lm(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
    NO3 + oPO4 + PO4 + Chla, data = algae.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.801  -5.500  -2.647   2.504  63.259 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -30.996221  12.580354  -2.464 0.014577 *  
seasonsummer  -0.107996   2.113697  -0.051 0.959301    
seasonautumn   0.806683   2.359593   0.342 0.732799    
seasonwinter  -1.397244   2.000275  -0.699 0.485648    
sizemedium    -2.378831   1.882645  -1.264 0.207838    
sizelarge     -3.086404   2.411377  -1.280 0.202029    
speedmedium    3.637403   2.476278   1.469 0.143408    
speedhigh     -1.382060   2.762133  -0.500 0.617364    
mxPH           4.821140   1.559264   3.092 0.002268 ** 
mnO2          -0.074216   0.380118  -0.195 0.845398    
Cl            -0.001602   0.019376  -0.083 0.934181    
NO3           -0.013968   0.224437  -0.062 0.950437    
oPO4          -0.001285   0.009348  -0.137 0.890775    
PO4           -0.001518   0.005940  -0.256 0.798576    
Chla           0.165865   0.046166   3.593 0.000411 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.75 on 203 degrees of freedom
Multiple R-squared:  0.2011,    Adjusted R-squared:  0.146 
F-statistic: 3.649 on 14 and 203 DF,  p-value: 2.009e-05

The fit is not that much better, but an ANOVA on the two suggested models shows that we are only ~73% certain that the models are different (see below).

anova(linear.model.a2,linear.model.a2.mod)
Analysis of Variance Table

Model 1: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
    PO4 + Chla
Model 2: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + oPO4 + 
    PO4 + Chla
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    202 23309                           
2    203 23453 -1   -143.86 1.2467 0.2655

The step() function uses AIC to perform a model search (using backward elimination). The “best” linear model for a2 is thus:

final.linear.model.a2 <- step(linear.model.a2)
summary(final.linear.model.a2)
Start:  AIC=1050.52
a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
    PO4 + Chla

         Df Sum of Sq   RSS    AIC
- season  3    157.44 23466 1046.0
- oPO4    1      6.20 23315 1048.6
- PO4     1      7.96 23317 1048.6
- Cl      1     15.85 23325 1048.7
- mnO2    1     25.40 23334 1048.8
- NO3     1     57.19 23366 1049.0
- size    2    282.28 23591 1049.1
- NH4     1    143.86 23453 1049.9
<none>                23309 1050.5
- speed   2    967.47 24276 1055.4
- mxPH    1   1121.22 24430 1058.8
- Chla    1   1478.18 24787 1061.9

Step:  AIC=1045.98
a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + 
    Chla

        Df Sum of Sq   RSS    AIC
- oPO4   1      2.54 23469 1044.0
- PO4    1      4.10 23470 1044.0
- mnO2   1      6.61 23473 1044.0
- Cl     1     15.59 23482 1044.1
- size   2    257.60 23724 1044.4
- NO3    1     47.04 23513 1044.4
- NH4    1    114.06 23580 1045.0
<none>               23466 1046.0
- speed  2   1035.56 24502 1051.4
- mxPH   1   1052.01 24518 1053.5
- Chla   1   1477.06 24943 1057.3

Step:  AIC=1044.01
a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + PO4 + Chla

        Df Sum of Sq   RSS    AIC
- PO4    1      1.62 23470 1042.0
- mnO2   1      7.17 23476 1042.1
- Cl     1     14.19 23483 1042.1
- NO3    1     44.93 23514 1042.4
- size   2    266.73 23736 1042.5
- NH4    1    114.91 23584 1043.1
<none>               23469 1044.0
- speed  2   1050.55 24519 1049.5
- mxPH   1   1099.78 24569 1052.0
- Chla   1   1480.47 24949 1055.3

Step:  AIC=1042.02
a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- mnO2   1      6.59 23477 1040.1
- Cl     1     17.42 23488 1040.2
- size   2    265.19 23736 1040.5
- NO3    1     51.04 23521 1040.5
- NH4    1    140.72 23611 1041.3
<none>               23470 1042.0
- speed  2   1050.42 24521 1047.6
- mxPH   1   1105.21 24576 1050.0
- Chla   1   1482.34 24953 1053.4

Step:  AIC=1040.08
a2 ~ size + speed + mxPH + Cl + NO3 + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- Cl     1     13.41 23490 1038.2
- size   2    260.65 23738 1038.5
- NO3    1     44.48 23522 1038.5
- NH4    1    135.66 23613 1039.3
<none>               23477 1040.1
- speed  2   1121.64 24599 1046.3
- mxPH   1   1103.17 24580 1048.1
- Chla   1   1492.55 24970 1051.5

Step:  AIC=1038.21
a2 ~ size + speed + mxPH + NO3 + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- NO3    1     36.13 23526 1036.5
- size   2    275.91 23766 1036.8
- NH4    1    128.31 23619 1037.4
<none>               23490 1038.2
- speed  2   1172.78 24663 1044.8
- mxPH   1   1089.85 24580 1046.1
- Chla   1   1490.94 24981 1049.6

Step:  AIC=1036.54
a2 ~ size + speed + mxPH + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- size   2    244.91 23771 1034.8
- NH4    1     93.48 23620 1035.4
<none>               23526 1036.5
- speed  2   1164.36 24691 1043.1
- mxPH   1   1053.88 24580 1044.1
- Chla   1   1611.04 25138 1049.0

Step:  AIC=1034.8
a2 ~ speed + mxPH + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- NH4    1     82.62 23854 1033.6
<none>               23771 1034.8
- mxPH   1    850.56 24622 1040.5
- speed  2   1085.45 24857 1040.5
- Chla   1   1540.50 25312 1046.5

Step:  AIC=1033.56
a2 ~ speed + mxPH + Chla

        Df Sum of Sq   RSS    AIC
<none>               23854 1033.6
- speed  2   1021.27 24875 1038.7
- mxPH   1    928.72 24783 1039.9
- Chla   1   1479.59 25334 1044.7

Call:
lm(formula = a2 ~ speed + mxPH + Chla, data = algae.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.195  -6.008  -2.530   2.024  63.589 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -27.13270   11.07921  -2.449 0.015134 *  
speedmedium   4.17176    2.34330   1.780 0.076453 .  
speedhigh    -0.32929    2.41899  -0.136 0.891850    
mxPH          3.89794    1.35358   2.880 0.004387 ** 
Chla          0.15945    0.04387   3.635 0.000349 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.58 on 213 degrees of freedom
Multiple R-squared:  0.1874,    Adjusted R-squared:  0.1721 
F-statistic: 12.28 on 4 and 213 DF,  p-value: 5.289e-09

It is still not a great fit (the adjusted \(R^2\) is quite small); we conclude that the linear model is not ideal to predict a2.

anova(final.linear.model.a2)
Analysis of Variance Table

Response: a2
           Df  Sum Sq Mean Sq F value    Pr(>F)    
speed       2  1994.8  997.42  8.9063 0.0001929 ***
mxPH        1  2026.6 2026.63 18.0964 3.145e-05 ***
Chla        1  1479.6 1479.59 13.2117 0.0003488 ***
Residuals 213 23854.1  111.99                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(linear.model.a2,final.linear.model.a2)
Analysis of Variance Table

Model 1: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
    PO4 + Chla
Model 2: a2 ~ speed + mxPH + Chla
  Res.Df   RSS  Df Sum of Sq      F Pr(>F)
1    202 23309                            
2    213 23854 -11   -545.26 0.4296 0.9416
plot(final.linear.model.a2)
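Although a formal evaluation is deferred to the Model Evaluation section, we can already get a tentative sense of how the stepwise model fares on held-out observations; the sketch below assumes the algae.test set built earlier in this section.

```r
# Sketch: test-set predictions of the final linear model for a2,
# and the corresponding raw test-set mean squared error
pred.lm.a2 <- predict(final.linear.model.a2, newdata = algae.test)
mean((pred.lm.a2 - algae.test$a2)^2)
```

A raw MSE is hard to interpret on its own; we will normalize it against a baseline predictor when we evaluate the models properly.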

Regression Tree Model

An alternative to linear regression is the use of regression trees, implemented in the R library rpart (whose syntax is similar to that of lm()).

regression.tree.a2 <-rpart::rpart(a2 ~ season + size + speed + mxPH + 
                                    mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data=algae.train)

The outcome can be displayed by calling the object directly.

regression.tree.a2
n= 218 

node), split, n, deviance, yval
      * denotes terminal node

  1) root 218 29355.1300  7.6366970  
    2) Cl< 16.6875 83  1193.6400  1.8891570  
      4) size=small,medium 67   398.6457  0.9447761 *
      5) size=large 16   485.0194  5.8437500 *
    3) Cl>=16.6875 135 23733.9200 11.1703700  
      6) mxPH< 8.065 59  3831.8290  5.3864410  
       12) season=autumn,winter 29   561.8414  2.5172410 *
       13) season=spring,summer 30  2800.4720  8.1600000  
         26) mxPH< 7.9375 23   889.9730  5.3173910 *
         27) mxPH>=7.9375 7  1114.0000 17.5000000 *
      7) mxPH>=8.065 76 16396.0400 15.6605300  
       14) Chla>=2.65 68  9694.0890 13.8544100  
         28) Chla< 14.8875 29  2747.5810  8.7172410  
           56) NH4< 226.875 21   558.4257  5.7857140 *
           57) NH4>=226.875 8  1534.9490 16.4125000 *
         29) Chla>=14.8875 39  5612.0940 17.6743600  
           58) mnO2< 11.05 30  3139.0940 15.4233300  
            116) NH4>=158.409 8   577.1000  8.9000000 *
            117) NH4< 158.409 22  2097.7700 17.7954500  
              234) season=spring,autumn 14   674.7521 14.6642900 *
              235) season=summer,winter 8  1045.5550 23.2750000 *
           59) mnO2>=11.05 9  1814.2760 25.1777800 *
       15) Chla< 2.65 8  4594.6690 31.0125000 *

The tree structure can be hard to determine when there is a large number of nodes; we can improve on the visuals by using the R library rpart.plot.

rpart.plot::prp(regression.tree.a2,extra=101,box.col="orange",split.box.col="gray")

Details on the regression tree can be obtained by calling the summary() method on the object.

summary(regression.tree.a2)
Call:
rpart::rpart(formula = a2 ~ season + size + speed + mxPH + mnO2 + 
    Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)
  n= 218 

           CP nsplit rel error    xerror      xstd
1  0.15082765      0 1.0000000 1.0059069 0.1990911
2  0.11943572      1 0.8491724 0.9492709 0.1815913
3  0.07178590      2 0.7297366 0.8655117 0.1688012
4  0.04545758      3 0.6579507 0.9007445 0.1699016
5  0.02243987      4 0.6124932 0.9597254 0.1737117
6  0.02228595      5 0.5900533 0.9472199 0.1658890
7  0.02156378      6 0.5677673 0.9472199 0.1658890
8  0.01581407      8 0.5246398 0.9287217 0.1629262
9  0.01285848      9 0.5088257 0.9255472 0.1613858
10 0.01055949     10 0.4959672 0.9320459 0.1622581
11 0.01000000     11 0.4854077 0.9389544 0.1625727

Variable importance
  Chla    NH4     Cl   mxPH   oPO4    PO4    NO3  speed   mnO2 season   size 
    19     14     14     13     11      9      6      5      4      3      2 

Node number 1: 218 observations,    complexity param=0.1508276
  mean=7.636697, MSE=134.6565 
  left son=2 (83 obs) right son=3 (135 obs)
  Primary splits:
      Cl   < 16.6875  to the left,  improve=0.15082760, (0 missing)
      mxPH < 7.94     to the left,  improve=0.14900670, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.11106510, (0 missing)
      Chla < 12.21    to the left,  improve=0.09817759, (0 missing)
      NH4  < 46.35    to the left,  improve=0.09032131, (0 missing)
  Surrogate splits:
      PO4   < 70.465   to the left,  agree=0.844, adj=0.590, (0 split)
      oPO4  < 19.8635  to the left,  agree=0.835, adj=0.566, (0 split)
      NH4   < 46.35    to the left,  agree=0.807, adj=0.494, (0 split)
      Chla  < 2.225    to the left,  agree=0.807, adj=0.494, (0 split)
      speed splits as  RRL,          agree=0.775, adj=0.410, (0 split)

Node number 2: 83 observations,    complexity param=0.01055949
  mean=1.889157, MSE=14.38121 
  left son=4 (67 obs) right son=5 (16 obs)
  Primary splits:
      size  splits as  LLR,          improve=0.25968900, (0 missing)
      Chla  < 4.4835   to the left,  improve=0.16315540, (0 missing)
      mxPH  < 7.835    to the left,  improve=0.07327114, (0 missing)
      oPO4  < 18.889   to the left,  improve=0.05858864, (0 missing)
      speed splits as  LRL,          improve=0.05534763, (0 missing)
  Surrogate splits:
      speed splits as  LRL,          agree=0.843, adj=0.188, (0 split)
      mxPH  < 8.385    to the left,  agree=0.831, adj=0.125, (0 split)
      mnO2  < 8.3      to the right, agree=0.831, adj=0.125, (0 split)
      Cl    < 15.2705  to the left,  agree=0.819, adj=0.062, (0 split)

Node number 3: 135 observations,    complexity param=0.1194357
  mean=11.17037, MSE=175.8068 
  left son=6 (59 obs) right son=7 (76 obs)
  Primary splits:
      mxPH < 8.065    to the left,  improve=0.14772320, (0 missing)
      NO3  < 0.3785   to the right, improve=0.09864039, (0 missing)
      Chla < 2.85     to the right, improve=0.06998520, (0 missing)
      mnO2 < 11.05    to the left,  improve=0.06691501, (0 missing)
      NH4  < 474.1875 to the right, improve=0.05884159, (0 missing)
  Surrogate splits:
      NH4  < 272.2915 to the right, agree=0.689, adj=0.288, (0 split)
      Chla < 11.65    to the left,  agree=0.667, adj=0.237, (0 split)
      NO3  < 5.37     to the right, agree=0.659, adj=0.220, (0 split)
      mnO2 < 6.55     to the left,  agree=0.622, adj=0.136, (0 split)
      oPO4 < 279.5085 to the right, agree=0.607, adj=0.102, (0 split)

Node number 4: 67 observations
  mean=0.9447761, MSE=5.949935 

Node number 5: 16 observations
  mean=5.84375, MSE=30.31371 

Node number 6: 59 observations,    complexity param=0.02156378
  mean=5.386441, MSE=64.94626 
  left son=12 (29 obs) right son=13 (30 obs)
  Primary splits:
      season splits as  RRLL,         improve=0.12253050, (0 missing)
      NH4    < 339.9305 to the right, improve=0.08821754, (0 missing)
      mxPH   < 7.905    to the left,  improve=0.07581077, (0 missing)
      Chla   < 2.5      to the right, improve=0.05654547, (0 missing)
      oPO4   < 25.75    to the left,  improve=0.05324680, (0 missing)
  Surrogate splits:
      NO3  < 2.8605   to the right, agree=0.712, adj=0.414, (0 split)
      mnO2 < 9.75     to the right, agree=0.661, adj=0.310, (0 split)
      oPO4 < 144.1885 to the left,  agree=0.627, adj=0.241, (0 split)
      Chla < 6.85     to the right, agree=0.627, adj=0.241, (0 split)
      mxPH < 7.975    to the right, agree=0.610, adj=0.207, (0 split)

Node number 7: 76 observations,    complexity param=0.0717859
  mean=15.66053, MSE=215.7374 
  left son=14 (68 obs) right son=15 (8 obs)
  Primary splits:
      Chla < 2.65     to the right, improve=0.12852400, (0 missing)
      mnO2 < 11.05    to the left,  improve=0.10098910, (0 missing)
      NO3  < 0.6045   to the right, improve=0.08727846, (0 missing)
      Cl   < 35.705   to the right, improve=0.07266453, (0 missing)
      mxPH < 8.375    to the right, improve=0.04660101, (0 missing)
  Surrogate splits:
      Cl   < 19.11    to the right, agree=0.921, adj=0.25, (0 split)
      NO3  < 0.076    to the right, agree=0.921, adj=0.25, (0 split)
      NH4  < 2713     to the left,  agree=0.921, adj=0.25, (0 split)
      oPO4 < 8        to the right, agree=0.921, adj=0.25, (0 split)
      PO4  < 40.3335  to the right, agree=0.921, adj=0.25, (0 split)

Node number 12: 29 observations
  mean=2.517241, MSE=19.37384 

Node number 13: 30 observations,    complexity param=0.02156378
  mean=8.16, MSE=93.34907 
  left son=26 (23 obs) right son=27 (7 obs)
  Primary splits:
      mxPH < 7.9375   to the left,  improve=0.28441600, (0 missing)
      NH4  < 119.2855 to the left,  improve=0.10978520, (0 missing)
      oPO4 < 42.857   to the left,  improve=0.06038492, (0 missing)
      Cl   < 37.624   to the left,  improve=0.05321153, (0 missing)
      Chla < 3.1      to the right, improve=0.04623221, (0 missing)
  Surrogate splits:
      NO3 < 1.0425   to the right, agree=0.833, adj=0.286, (0 split)

Node number 14: 68 observations,    complexity param=0.04545758
  mean=13.85441, MSE=142.5601 
  left son=28 (29 obs) right son=29 (39 obs)
  Primary splits:
      Chla < 14.8875  to the left,  improve=0.13765220, (0 missing)
      Cl   < 68.875   to the right, improve=0.10411670, (0 missing)
      NH4  < 228.175  to the left,  improve=0.09585894, (0 missing)
      NO3  < 5.247    to the left,  improve=0.08016472, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.05486673, (0 missing)
  Surrogate splits:
      size   splits as  LRR,          agree=0.706, adj=0.310, (0 split)
      mxPH   < 8.215    to the left,  agree=0.676, adj=0.241, (0 split)
      oPO4   < 180.8335 to the right, agree=0.662, adj=0.207, (0 split)
      NO3    < 3.779    to the right, agree=0.647, adj=0.172, (0 split)
      season splits as  RLRR,         agree=0.632, adj=0.138, (0 split)

Node number 15: 8 observations
  mean=31.0125, MSE=574.3336 

Node number 26: 23 observations
  mean=5.317391, MSE=38.69448 

Node number 27: 7 observations
  mean=17.5, MSE=159.1429 

Node number 28: 29 observations,    complexity param=0.02228595
  mean=8.717241, MSE=94.74419 
  left son=56 (21 obs) right son=57 (8 obs)
  Primary splits:
      NH4  < 226.875  to the left,  improve=0.23810280, (0 missing)
      Cl   < 68.875   to the right, improve=0.12545720, (0 missing)
      NO3  < 3.8045   to the right, improve=0.10340650, (0 missing)
      mnO2 < 9.6      to the right, improve=0.08307998, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.07494295, (0 missing)
  Surrogate splits:
      Chla < 12.08    to the left,  agree=0.862, adj=0.5, (0 split)

Node number 29: 39 observations,    complexity param=0.02243987
  mean=17.67436, MSE=143.8999 
  left son=58 (30 obs) right son=59 (9 obs)
  Primary splits:
      mnO2 < 11.05    to the left,  improve=0.11737600, (0 missing)
      mxPH < 8.35     to the right, improve=0.11494980, (0 missing)
      Cl   < 51.6665  to the right, improve=0.11368430, (0 missing)
      NO3  < 5.0885   to the left,  improve=0.10842770, (0 missing)
      NH4  < 231.3    to the left,  improve=0.08663984, (0 missing)
  Surrogate splits:
      NH4  < 245.682  to the left,  agree=0.821, adj=0.222, (0 split)
      oPO4 < 19.4585  to the right, agree=0.821, adj=0.222, (0 split)
      PO4  < 96.283   to the right, agree=0.821, adj=0.222, (0 split)
      Chla < 16.4     to the right, agree=0.821, adj=0.222, (0 split)
      size splits as  RLL,          agree=0.795, adj=0.111, (0 split)

Node number 56: 21 observations
  mean=5.785714, MSE=26.5917 

Node number 57: 8 observations
  mean=16.4125, MSE=191.8686 

Node number 58: 30 observations,    complexity param=0.01581407
  mean=15.42333, MSE=104.6365 
  left son=116 (8 obs) right son=117 (22 obs)
  Primary splits:
      NH4  < 158.409  to the right, improve=0.14788480, (0 missing)
      Chla < 48.15    to the left,  improve=0.13800510, (0 missing)
      mnO2 < 9.95     to the right, improve=0.12649850, (0 missing)
      NO3  < 0.698    to the right, improve=0.11199640, (0 missing)
      oPO4 < 144.0415 to the left,  improve=0.08054276, (0 missing)
  Surrogate splits:
      mxPH < 8.27     to the left,  agree=0.767, adj=0.125, (0 split)
      oPO4 < 36.4     to the left,  agree=0.767, adj=0.125, (0 split)

Node number 59: 9 observations
  mean=25.17778, MSE=201.5862 

Node number 116: 8 observations
  mean=8.9, MSE=72.1375 

Node number 117: 22 observations,    complexity param=0.01285848
  mean=17.79545, MSE=95.35316 
  left son=234 (14 obs) right son=235 (8 obs)
  Primary splits:
      season splits as  LRLR,         improve=0.1799351, (0 missing)
      NO3    < 1.2775   to the right, improve=0.1189142, (0 missing)
      Chla   < 31.869   to the left,  improve=0.1160835, (0 missing)
      NH4    < 99.6665  to the left,  improve=0.1091819, (0 missing)
      Cl     < 42.818   to the left,  improve=0.1085589, (0 missing)
  Surrogate splits:
      NO3   < 3.2455   to the left,  agree=0.727, adj=0.250, (0 split)
      oPO4  < 103.293  to the left,  agree=0.727, adj=0.250, (0 split)
      size  splits as  RLL,          agree=0.682, adj=0.125, (0 split)
      speed splits as  LRL,          agree=0.682, adj=0.125, (0 split)
      mnO2  < 7.4      to the left,  agree=0.682, adj=0.125, (0 split)

Node number 234: 14 observations
  mean=14.66429, MSE=48.19658 

Node number 235: 8 observations
  mean=23.275, MSE=130.6944 

Note that rpart() grows a tree on the training data until one of the following criteria is met:

- the decrease in deviance falls below a certain threshold (cp);
- the number of samples in a node falls below some other threshold (minsplit);
- the depth of the tree crosses yet another threshold (maxdepth).
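These stopping thresholds can be overridden through rpart.control(); the call below is purely illustrative (the chosen values are arbitrary) and grows a shallower tree than the defaults would.

```r
# Illustrative only: tighten the stopping criteria to obtain a smaller tree
shallow.tree.a2 <- rpart::rpart(
  a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,
  data = algae.train,
  control = rpart::rpart.control(cp = 0.05, minsplit = 40, maxdepth = 3))
```
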

The library implements a pruning method based on cost complexity: finding the value of cp which best balances predictive accuracy and tree size.

rpart::printcp(regression.tree.a2)

Regression tree:
rpart::rpart(formula = a2 ~ season + size + speed + mxPH + mnO2 + 
    Cl + NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)

Variables actually used in tree construction:
[1] Chla   Cl     mnO2   mxPH   NH4    season size  

Root node error: 29355/218 = 134.66

n= 218 

         CP nsplit rel error  xerror    xstd
1  0.150828      0   1.00000 1.00591 0.19909
2  0.119436      1   0.84917 0.94927 0.18159
3  0.071786      2   0.72974 0.86551 0.16880
4  0.045458      3   0.65795 0.90074 0.16990
5  0.022440      4   0.61249 0.95973 0.17371
6  0.022286      5   0.59005 0.94722 0.16589
7  0.021564      6   0.56777 0.94722 0.16589
8  0.015814      8   0.52464 0.92872 0.16293
9  0.012858      9   0.50883 0.92555 0.16139
10 0.010559     10   0.49597 0.93205 0.16226
11 0.010000     11   0.48541 0.93895 0.16257

The tree returned by rpart() is the final one (cp\(=0.01\) is the default value); it requires 11 decision tests, and has a relative error of 0.485. Internally, rpart() uses 10-fold cross-validation to estimate that the tree has an average relative error of \(0.98\pm 0.18\) (these values might change when you run the model due to the stochastic nature of the internal cross-validation routines).

In this framework, the optimal tree minimizes the value of xerror. Alternatively, one could use the 1-SE rule: select the smallest tree whose xerror falls within one standard deviation (xstd) of the minimal xerror.
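The 1-SE rule can be applied by hand from the CP table stored in the fitted object; the sketch below follows that logic (it assumes the regression.tree.a2 object grown above).

```r
# Sketch of the 1-SE rule: keep the smallest tree whose cross-validated
# error (xerror) is within one standard deviation of the minimal xerror
cp.tab   <- regression.tree.a2$cptable
best     <- which.min(cp.tab[, "xerror"])
thresh   <- cp.tab[best, "xerror"] + cp.tab[best, "xstd"]
cp.1se   <- cp.tab[min(which(cp.tab[, "xerror"] <= thresh)), "CP"]
tree.1se <- rpart::prune(regression.tree.a2, cp = cp.1se)
```
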

rpart::plotcp(regression.tree.a2)

The scree plot above suggests that cp\(=0.05\) (that value may change when you run yours due to the stochastic nature of the internal cross-validation algorithm) is a special value for tree growth, so we can prune the tree using that specific value.

regression.tree.a2.mod <- rpart::prune(regression.tree.a2,cp=0.05)
regression.tree.a2.mod
n= 218 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 218 29355.130  7.636697  
   2) Cl< 16.6875 83  1193.640  1.889157 *
   3) Cl>=16.6875 135 23733.920 11.170370  
     6) mxPH< 8.065 59  3831.829  5.386441 *
     7) mxPH>=8.065 76 16396.040 15.660530  
      14) Chla>=2.65 68  9694.089 13.854410 *
      15) Chla< 2.65 8  4594.669 31.012500 *
rpart.plot::prp(regression.tree.a2.mod,extra=101,box.col="orange",split.box.col="gray")

The entire process is automated in the wrapper method rpartXse() provided with the DMwR library. In our implementation, we (arbitrarily) use se\(=0.2\).

library(DMwR)
(regression.tree.a2.final <- DMwR::rpartXse(a2 ~ season + size + speed + mxPH + mnO2 + 
                                              Cl + NO3 + NH4 + oPO4 + PO4 + Chla,
                                            data=algae.train, se=0.2))
summary(regression.tree.a2.final)
n= 218 

node), split, n, deviance, yval
      * denotes terminal node

1) root 218 29355.13  7.636697  
  2) Cl< 16.6875 83  1193.64  1.889157 *
  3) Cl>=16.6875 135 23733.92 11.170370 *
Call:
rpart(formula = form, data = data, cp = cp, minsplit = minsplit)
  n= 218 

         CP nsplit rel error    xerror      xstd
1 0.1508276      0 1.0000000 1.0130822 0.2001495
2 0.1194357      1 0.8491724 0.9320224 0.1752140

Variable importance
   Cl   PO4  oPO4  Chla   NH4 speed 
   28    17    16    14    14    12 

Node number 1: 218 observations,    complexity param=0.1508276
  mean=7.636697, MSE=134.6565 
  left son=2 (83 obs) right son=3 (135 obs)
  Primary splits:
      Cl   < 16.6875 to the left,  improve=0.15082760, (0 missing)
      mxPH < 7.94    to the left,  improve=0.14900670, (0 missing)
      NO3  < 0.18    to the right, improve=0.11564070, (0 missing)
      oPO4 < 45.1    to the left,  improve=0.11106510, (0 missing)
      Chla < 12.21   to the left,  improve=0.09817759, (0 missing)
  Surrogate splits:
      PO4   < 70.465  to the left,  agree=0.844, adj=0.590, (0 split)
      oPO4  < 19.8635 to the left,  agree=0.835, adj=0.566, (0 split)
      NH4   < 46.35   to the left,  agree=0.807, adj=0.494, (0 split)
      Chla  < 2.225   to the left,  agree=0.807, adj=0.494, (0 split)
      speed splits as  RRL,         agree=0.775, adj=0.410, (0 split)

Node number 2: 83 observations
  mean=1.889157, MSE=14.38121 

Node number 3: 135 observations
  mean=11.17037, MSE=175.8068 
rpart.plot::prp(regression.tree.a2.final,extra=101,box.col="orange",split.box.col="gray")

The resulting tree is not nearly as complex as the original tree (which discourages overfitting); indeed, it is even simpler than the manually pruned tree.

12.6.2 Model Evaluation

At this stage, we know that the linear model is not great for a2, and we have seen how to grow a regression tree for a2, but we have not yet discussed whether that model is a good fit, to say nothing of the remaining 6 algae concentrations. Can we get a better handle on these models’ performance (i.e., by comparing the model predictions to the actual values of the target variable in the test data)?

We have discussed various metrics that can be used to determine how the values compare; we will use the normalized mean squared error (NMSE): \[\frac{\text{MSE}}{\text{mean}\left\{\left(\overline{\text{real}}-\text{real}_i\right)^2; i=1,..., N\right\}}.\]

As the ratio of the model’s MSE to that of a baseline predictor (which always outputs the mean value of the target), NMSE is unitless. NMSE values between 0 and 1 (smaller is better) indicate that the model performs better than the baseline; values greater than 1 indicate that the model’s performance is sub-par.
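The definition is straightforward to code; the toy illustration below (synthetic numbers, not the algae data) confirms that the mean-of-the-target baseline scores exactly 1.

```r
# NMSE as defined above: MSE of the predictions over the MSE of the
# baseline predictor that always outputs the mean of the target
nmse <- function(pred, real) {
  mean((pred - real)^2) / mean((mean(real) - real)^2)
}

real <- c(2, 4, 6, 8, 10)
nmse(real + 0.5, real)          # 0.03125: much better than the baseline
nmse(rep(mean(real), 5), real)  # the baseline itself scores exactly 1
```
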

We use the performanceEstimation library to run \(5\times 10-\)fold cross-validations to determine which of the models (the linear model and 5 regression trees parametrized by se) yields the smallest NMSE value when trying to predict a2.

library(performanceEstimation)
kCV.results.algae.a2 <- performanceEstimation(
  PredTask(a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
             NO3 + NH4 + oPO4 + PO4 + Chla, data=algae.train, "a2"), 
  c(Workflow(learner="lm",post="onlyPos"),
    workflowVariants(learner="rpartXse",
                     learner.pars=list( se=c(0,0.25,0.5,0.75,1) ))),
  EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)


##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####

** PREDICTIVE TASK :: a2

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************

(analogous progress output follows for workflows rpartXse.v1 through rpartXse.v5)

A summary and plot of the cross-validation results for NMSE can be displayed using calls to summary() and plot().

summary(kCV.results.algae.a2)

== Summary of a  Cross Validation Performance Estimation Experiment ==

Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 

* Predictive Tasks ::  a2
* Workflows  ::  lm, rpartXse.v1, rpartXse.v2, rpartXse.v3, rpartXse.v4, rpartXse.v5 

-> Task:  a2
  *Workflow: lm 
             nmse
avg     0.9723125
std     0.2221976
med     0.9634147
iqr     0.1771688
min     0.5878283
max     2.0801221
invalid 0.0000000

  *Workflow: rpartXse.v1 
             nmse
avg     1.1148436
std     0.3871551
med     1.0000000
iqr     0.2226673
min     0.5701226
max     2.8186400
invalid 0.0000000

  *Workflow: rpartXse.v2 
              nmse
avg     1.08587675
std     0.35111303
med     1.00000000
iqr     0.07178237
min     0.76004730
max     2.81864005
invalid 0.00000000

  *Workflow: rpartXse.v3 
                nmse
avg     1.035773e+00
std     1.470430e-01
med     1.000000e+00
iqr     2.220446e-16
min     8.044770e-01
max     1.701835e+00
invalid 0.000000e+00

  *Workflow: rpartXse.v4 
                nmse
avg     1.011250e+00
std     1.214329e-01
med     1.000000e+00
iqr     2.220446e-16
min     6.800497e-01
max     1.701835e+00
invalid 0.000000e+00

  *Workflow: rpartXse.v5 
                nmse
avg     1.004167e+00
std     5.174279e-02
med     1.000000e+00
iqr     2.220446e-16
min     8.692699e-01
max     1.339067e+00
invalid 0.000000e+00

plot(kCV.results.algae.a2)

It is not obvious which of the models has smaller values of NMSE, although it does seem that the latter versions of the regression tree models are not substantially better than the baseline model.

The first regression tree model sometimes produces very small NMSE values, but that is offset by some of the larger values it also produces (similarly for the linear model).

At any rate, the visual evidence suggests that the linear model is the best predictive model for a2 given the training data (in this version of \(k\)CV), which is corroborated by a call to topPerformers().

topPerformers(kCV.results.algae.a2)
$a2
     Workflow Estimate
nmse       lm    0.972

This might seem disheartening at first given how poorly the linear model performed, but it might be helpful to remember that there is no guarantee that a decent predictive model even exists.

Furthermore, regression trees and linear models are only two of a whole collection of possible models. How do support vector machines perform the task, for instance?220

This time, however, we will learn models and perform evaluation for all target variables (a1-a7) simultaneously. This does not mean that we are looking for a single model which will optimize all learning tasks at once, but rather that we can prepare and evaluate the models for each target variable with the same bit of code.

This first requires some code to create the appropriate model formulas (a1 ~ ., … ,a7 ~ .) and the corresponding training data.

# gg builds a PredTask for target x, using the specified predictor variables
gg <- function(x, list.of.variables){    
    PredTask(as.formula(paste(x, "~ .")),
             algae.train[, c(list.of.variables, x)], x, copy=TRUE)
}
# one prediction task per target a1-a7, all sharing predictors 1-11
data.sources <- sapply(names(algae.train[12:18]), gg, names(algae.train[1:11]))
data.sources
$a1
Prediction Task Object:
    Task Name         :: a1 
    Task Type         :: regression 
    Target Feature    :: a1 
    Formula           :: a1 ~ .
<environment: 0x7f9e515b6b58>
    Task Data Source  :: internal  218x12 data frame.

(analogous output follows for tasks a2 through a7)

We shall use e1071’s implementation of svm(), specifying the svm()-specific parameters cost and gamma.

library(e1071)
kCV.results.algae.all <- performanceEstimation(
    data.sources,
    c(Workflow(learner="lm",post="onlyPos"), 
      Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
      workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)


##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####

** PREDICTIVE TASK :: a1

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


(analogous progress output follows for tasks a2 through a7)

The rest of the evaluation proceeds much as before, except that we can display results for the 7 target variables simultaneously.

plot(kCV.results.algae.all)

rankWorkflows(kCV.results.algae.all,top=3)
$a1
$a1$nmse
     Workflow  Estimate
1 rpartXse.v1 0.6163406
2 rpartXse.v2 0.6278027
3 rpartXse.v3 0.6430736


$a2
$a2$nmse
     Workflow  Estimate
1          lm 0.9723125
2         svm 0.9954432
3 rpartXse.v3 1.0041667


$a3
$a3$nmse
     Workflow  Estimate
1         svm 0.9497730
2          lm 0.9801662
3 rpartXse.v2 1.0000000


$a4
$a4$nmse
     Workflow Estimate
1 rpartXse.v3 1.001453
2 rpartXse.v2 1.351494
3          lm 1.357243


$a5
$a5$nmse
     Workflow  Estimate
1         svm 0.9968475
2 rpartXse.v3 0.9990465
3 rpartXse.v2 1.0194733


$a6
$a6$nmse
     Workflow Estimate
1 rpartXse.v2 1.010069
2 rpartXse.v3 1.010069
3         svm 1.054975


$a7
$a7$nmse
     Workflow Estimate
1 rpartXse.v2  1.00000
2 rpartXse.v3  1.00000
3 rpartXse.v1  1.00797

topPerformers(kCV.results.algae.all)
$a1
        Workflow Estimate
nmse rpartXse.v1    0.616

$a2
     Workflow Estimate
nmse       lm    0.972

$a3
     Workflow Estimate
nmse      svm     0.95

$a4
        Workflow Estimate
nmse rpartXse.v3    1.001

$a5
     Workflow Estimate
nmse      svm    0.997

$a6
        Workflow Estimate
nmse rpartXse.v2     1.01

$a7
        Workflow Estimate
nmse rpartXse.v2        1

For a1, the models seem to perform reasonably well, but the situation is not as rosy for the other target variables, where the baseline model is sometimes better. Again, this could be intrinsic to the data, but we might benefit from incorporating more models.

library(randomForest)
kCV.results.algae.all.rf <- performanceEstimation(
    data.sources,
    c(Workflow(learner="lm",post="onlyPos"), 
      Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
      workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1))),
      workflowVariants(learner="randomForest",learner.pars=list(ntree=c(200,500,700)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)


##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####

** PREDICTIVE TASK :: a1

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
     Run with seed =  1234 
Iteration :**************************************************


(analogous progress output follows for tasks a2 through a7)

rankWorkflows(kCV.results.algae.all.rf,top=3) 
$a1
$a1$nmse
         Workflow  Estimate
1 randomForest.v2 0.5217204
2 randomForest.v3 0.5228744
3 randomForest.v1 0.5264328


$a2
$a2$nmse
         Workflow  Estimate
1 randomForest.v3 0.7798749
2 randomForest.v2 0.7806831
3 randomForest.v1 0.7849360


$a3
$a3$nmse
         Workflow  Estimate
1 randomForest.v3 0.9377108
2 randomForest.v2 0.9400108
3 randomForest.v1 0.9431801


$a4
$a4$nmse
         Workflow Estimate
1     rpartXse.v3 1.001453
2 randomForest.v3 1.006496
3 randomForest.v1 1.006806


$a5
$a5$nmse
         Workflow  Estimate
1 randomForest.v1 0.7626241
2 randomForest.v2 0.7675794
3 randomForest.v3 0.7681834


$a6
$a6$nmse
         Workflow  Estimate
1 randomForest.v2 0.8590227
2 randomForest.v3 0.8621478
3 randomForest.v1 0.8663869


$a7
$a7$nmse
     Workflow Estimate
1 rpartXse.v2  1.00000
2 rpartXse.v3  1.00000
3 rpartXse.v1  1.00797

rankWorkflows() does not report the standard error, so we cannot tell whether the differences between the score of the best model and those of the other models are statistically significant.

randomForest.v3 seems to have the best ranking across all learning tasks, so we use it as the baseline for the paired comparisons.

p<-pairedComparisons(kCV.results.algae.all.rf,baseline="randomForest.v3")
p$nmse$F.test
p$nmse$BonferroniDunn.test
$chi
[1] 22.86905

$FF
[1] 5.251025

$critVal
[1] 0.7071231

$rejNull
[1] TRUE

$critDif
[1] 3.52218

$baseline
[1] "randomForest.v3"

$rkDifs
             lm             svm     rpartXse.v1     rpartXse.v2     rpartXse.v3 
      4.1428571       2.8571429       4.1428571       2.6428571       1.9285714 
randomForest.v1 randomForest.v2 
      0.8571429       0.0000000 

$signifDifs
             lm             svm     rpartXse.v1     rpartXse.v2     rpartXse.v3 
           TRUE           FALSE            TRUE           FALSE           FALSE 
randomForest.v1 randomForest.v2 
          FALSE           FALSE 

At the 95% confidence level, we can reject the hypothesis that the performance of the baseline method (randomForest.v3) is the same as that of the linear model and of the first regression tree (rpartXse.v1), but not that it is the same as that of svm, the other two regression trees, or the other two random forests.
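The critical difference used by the Bonferroni-Dunn test can be reproduced by hand; a minimal sketch, assuming \(k=8\) workflows compared over \(N=7\) tasks at \(\alpha=0.05\) (values matching the experiment above):

```r
# Bonferroni-Dunn critical difference for comparing k workflows over N tasks:
# CD = q_alpha * sqrt(k*(k+1)/(6*N)), where q_alpha is a normal quantile
# Bonferroni-corrected for the k-1 comparisons against the baseline.
k <- 8; N <- 7; alpha <- 0.05
q.alpha <- qnorm(1 - alpha/(2*(k-1)))    # two-sided, corrected quantile
CD <- q.alpha * sqrt(k*(k+1)/(6*N))
round(CD, 3)                             # close to the $critDif value reported above
```

Workflows whose average rank differs from the baseline's by more than CD (compare with $rkDifs) are the ones flagged TRUE in $signifDifs.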

The information is also displayed in the Bonferroni-Dunn CD diagram below.

CDdiagram.BD(p)

12.6.3 Model Predictions

Finally, we might actually be interested in generating predictions for each of the target variables in the testing set. This simply requires that the best performers for each target response be brought together in an R object.

best.performers <- sapply(taskNames(kCV.results.algae.all.rf), function(x)
  topPerformer(kCV.results.algae.all.rf,metric="nmse",task=x)
                          )
best.performers
$a1
Workflow Object:
    Workflow ID       ::  randomForest.v2 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> ntree=500 
         learner  -> randomForest 


$a2
Workflow Object:
    Workflow ID       ::  randomForest.v3 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> ntree=700 
         learner  -> randomForest 


$a3
Workflow Object:
    Workflow ID       ::  randomForest.v3 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> ntree=700 
         learner  -> randomForest 


$a4
Workflow Object:
    Workflow ID       ::  rpartXse.v3 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> se=1 
         learner  -> rpartXse 


$a5
Workflow Object:
    Workflow ID       ::  randomForest.v1 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> ntree=200 
         learner  -> randomForest 


$a6
Workflow Object:
    Workflow ID       ::  randomForest.v2 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> ntree=500 
         learner  -> randomForest 


$a7
Workflow Object:
    Workflow ID       ::  rpartXse.v2 
    Workflow Function ::  standardWF
         Parameter values:
         learner.pars  -> se=0.7 
         learner  -> rpartXse 

Next, we create an array to store the actual and predicted values of each target for the observations in the testing set:

test.observations <- array(dim=c(nrow(algae.test),7,2),
                           dimnames=list(rownames(algae.test), paste("a",1:7),
                                         c("actual","predicted")))

The function runWorkflow() will compute the predicted values for each of the targets’ best performers. We can then plot the predicted and actual values for each of the testing set targets.

for(j in 1:7){ 
    # re-train the best performer for target j on the training set
    # and predict the corresponding values on the testing set
    results <- runWorkflow(best.performers[[j]],
                          as.formula(paste(names(best.performers)[j],"~ .")),
                          algae.train[,c(1:11,11+j)],
                          algae.test[,c(1:11,11+j)])
    test.observations[,j,"actual"] <- results$trues
    test.observations[,j,"predicted"] <- results$preds
}
df.a1 <- as.data.frame(test.observations[,1,])
df.a2 <- as.data.frame(test.observations[,2,])
df.a3 <- as.data.frame(test.observations[,3,])
df.a4 <- as.data.frame(test.observations[,4,])
df.a5 <- as.data.frame(test.observations[,5,])
df.a6 <- as.data.frame(test.observations[,6,])
df.a7 <- as.data.frame(test.observations[,7,])
plot(df.a1,main="a1 - predicted vs. actual")
abline(0,1,col="red")

plot(df.a2,main="a2 - predicted vs. actual")
abline(0,1,col="red")

plot(df.a3,main="a3 - predicted vs. actual")
abline(0,1,col="red")

plot(df.a4,main="a4 - predicted vs. actual")
abline(0,1,col="red")

plot(df.a5,main="a5 - predicted vs. actual")
abline(0,1,col="red")

plot(df.a6,main="a6 - predicted vs. actual")
abline(0,1,col="red")

plot(df.a7,main="a7 - predicted vs. actual")
abline(0,1,col="red")

The models simply are not that great, but we already expected that. The baseline prediction for each target (its mean value over the training set) is shown below.

average.prediction <- apply(algae.train[,12:18],2, mean)
average.prediction
       a1        a2        a3        a4        a5        a6        a7 
17.472018  7.636697  4.132110  1.983028  4.963761  5.811927  2.504128 

Finally, you might be interested in the NMSE metrics for the predicted values and how they compare to the NMSE metrics on the training set. Is any of this surprising?

apply((test.observations[,,"actual"]-test.observations[,,"predicted"])^2, 2, sum) /
    apply((scale(test.observations[,,"actual"], average.prediction, FALSE))^2, 2, sum)
      a 1       a 2       a 3       a 4       a 5       a 6       a 7 
0.4047690 0.8841894 0.7760535 1.0000000 0.7133264 0.8391650 1.0000000 
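The one-liner above can be unpacked; a small sketch with a hypothetical helper nmse_vs_baseline, which computes the NMSE of predictions against a fixed baseline value (the role played above by the training-set means):

```r
# NMSE of predictions relative to a fixed baseline prediction
# (hypothetical helper; not part of performanceEstimation).
nmse_vs_baseline <- function(actual, predicted, baseline) {
  sum((actual - predicted)^2) / sum((actual - baseline)^2)
}

actual <- c(1, 2, 3, 6)
nmse_vs_baseline(actual, rep(3, 4), 3)   # predicting the baseline itself: 1
```

This makes it clear why a4 and a7 score exactly 1: their best performers effectively predict a constant.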

References

[74]
L. Torgo, Data Mining with R, 2nd ed. CRC Press, 2016.