11.7 R Examples

We provide the R code that was used to produce the outputs of the toy examples in Sections 11.3.6, 11.4.6, and 11.5.6.

11.7.1 Association Rules Mining: Titanic Dataset

This example refers to Toy Example: Titanic Dataset.

The very first step is to import the data into R. Then we perform a few routines to explore the structure of the data.

class=as.factor(c(rep("3rd",52),rep("1st",118),rep("2nd",154),rep("3rd",387),
                  rep("Crew",670),rep("1st",4),rep("2nd",13.01),rep("3rd",89),
                  rep("Crew",3),rep("1st",5),rep("2nd",11),rep("3rd",13),
                  rep("1st",1),rep("2nd",13),rep("3rd",14),rep("1st",57),
                  rep("2nd",14),rep("3rd",75),rep("Crew",192),rep("1st",140),
                  rep("2nd",80),rep("3rd",76),rep("Crew",20)))
sex=as.factor(c(rep("Male",35),rep("Female",17),rep("Male",1329),rep("Female",109),
                rep("Male",29),rep("Female",28),rep("Male",338),rep("Female",316)))
age=as.factor(c(rep("Child",52),rep("Adult",1438),rep("Child",57),rep("Adult",654)))
survived=as.factor(c(rep("No",1490),rep("Yes",711)))
titanic=data.frame(class,sex,age,survived)
summary(titanic)
  class         sex           age          survived  
 1st :325   Female: 470   Adult:2092   No :1490  
 2nd :285   Male  :1731   Child: 109   Yes: 711  
 3rd :706                                        
 Crew:885                                        

table(age, survived)
       survived
age       No  Yes
  Adult 1438  654
  Child   52   57

table(class, survived)
      survived
class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212

Then we use the arules package function apriori(), which returns all the rules meeting the algorithm's thresholds. By default, apriori() creates rules with a minimum support of 0.1, a minimum confidence of 0.8, and a maximum rule length of 10 items.

rules.titanic <- arules::apriori(titanic)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 220 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [27 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

For a rule \(X\to Y\), arules evaluates a number of metrics by default. We can extract the rules using the function inspect(); there are 27 in total.
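
For reference, the metrics reported below have the standard definitions; writing \(N\) for the total number of transactions,
\[
\text{support}(X\to Y)=\frac{\text{count}(X\cup Y)}{N},\qquad
\text{confidence}(X\to Y)=\frac{\text{support}(X\cup Y)}{\text{support}(X)},
\]
\[
\text{coverage}(X\to Y)=\text{support}(X),\qquad
\text{lift}(X\to Y)=\frac{\text{confidence}(X\to Y)}{\text{support}(Y)},
\]
and count is the absolute frequency of \(X\cup Y\) in the data.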

arules::inspect(rules.titanic)
     lhs                                     rhs           support   confidence
[1]  {}                                   => {age=Adult}   0.9504771 0.9504771 
[2]  {class=2nd}                          => {age=Adult}   0.1185825 0.9157895 
[3]  {class=1st}                          => {age=Adult}   0.1449341 0.9815385 
[4]  {sex=Female}                         => {age=Adult}   0.1930940 0.9042553 
[5]  {class=3rd}                          => {age=Adult}   0.2848705 0.8881020 
[6]  {survived=Yes}                       => {age=Adult}   0.2971377 0.9198312 
[7]  {class=Crew}                         => {sex=Male}    0.3916402 0.9740113 
[8]  {class=Crew}                         => {age=Adult}   0.4020900 1.0000000 
[9]  {survived=No}                        => {sex=Male}    0.6197183 0.9154362 
[10] {survived=No}                        => {age=Adult}   0.6533394 0.9651007 
[11] {sex=Male}                           => {age=Adult}   0.7573830 0.9630272 
[12] {sex=Female, survived=Yes}           => {age=Adult}   0.1435711 0.9186047 
[13] {class=3rd, sex=Male}                => {survived=No} 0.1917310 0.8274510 
[14] {class=3rd, survived=No}             => {age=Adult}   0.2162653 0.9015152 
[15] {class=3rd, sex=Male}                => {age=Adult}   0.2099046 0.9058824 
[16] {sex=Male, survived=Yes}             => {age=Adult}   0.1535666 0.9209809 
[17] {class=Crew, survived=No}            => {sex=Male}    0.3044071 0.9955423 
[18] {class=Crew, survived=No}            => {age=Adult}   0.3057701 1.0000000 
[19] {class=Crew, sex=Male}               => {age=Adult}   0.3916402 1.0000000 
[20] {class=Crew, age=Adult}              => {sex=Male}    0.3916402 0.9740113 
[21] {sex=Male, survived=No}              => {age=Adult}   0.6038164 0.9743402 
[22] {age=Adult, survived=No}             => {sex=Male}    0.6038164 0.9242003 
[23] {class=3rd, sex=Male, survived=No}   => {age=Adult}   0.1758292 0.9170616 
[24] {class=3rd, age=Adult, survived=No}  => {sex=Male}    0.1758292 0.8130252 
[25] {class=3rd, sex=Male, age=Adult}     => {survived=No} 0.1758292 0.8376623 
[26] {class=Crew, sex=Male, survived=No}  => {age=Adult}   0.3044071 1.0000000 
[27] {class=Crew, age=Adult, survived=No} => {sex=Male}    0.3044071 0.9955423 
     coverage  lift      count
[1]  1.0000000 1.0000000 2092 
[2]  0.1294866 0.9635051  261 
[3]  0.1476602 1.0326798  319 
[4]  0.2135393 0.9513700  425 
[5]  0.3207633 0.9343750  627 
[6]  0.3230350 0.9677574  654 
[7]  0.4020900 1.2384742  862 
[8]  0.4020900 1.0521033  885 
[9]  0.6769650 1.1639949 1364 
[10] 0.6769650 1.0153856 1438 
[11] 0.7864607 1.0132040 1667 
[12] 0.1562926 0.9664669  316 
[13] 0.2317129 1.2222950  422 
[14] 0.2398910 0.9484870  476 
[15] 0.2317129 0.9530818  462 
[16] 0.1667424 0.9689670  338 
[17] 0.3057701 1.2658514  670 
[18] 0.3057701 1.0521033  673 
[19] 0.3916402 1.0521033  862 
[20] 0.4020900 1.2384742  862 
[21] 0.6197183 1.0251065 1329 
[22] 0.6533394 1.1751385 1329 
[23] 0.1917310 0.9648435  387 
[24] 0.2162653 1.0337773  387 
[25] 0.2099046 1.2373791  387 
[26] 0.3044071 1.0521033  670 
[27] 0.3057701 1.2658514  670 

We set our own parameters to create a new list of rules, with a minimum support of 0.005, a minimum confidence of 0.8, and a minimum rule length of 2. As we are naturally interested in finding combinations of attributes associated with survival (either Yes or No), we only retain rules for which the conclusion is {survived=No} or {survived=Yes}.

There are 12 such rules.

rules.titanic.2 <- arules::apriori(titanic,
                                   parameter = list(minlen=2, supp=0.005, conf=0.8),
                                   appearance = list(rhs=c("survived=No",
                                                           "survived=Yes"), default="lhs"),
                                   control = list(verbose=F))

We then sort the list by the lift value (in descending order) and print the results:

rules.titanic.2 <- arules::sort(rules.titanic.2, by="lift")
arules::inspect(rules.titanic.2)
     lhs                                    rhs            support    
[1]  {class=2nd, age=Child}              => {survived=Yes} 0.010904134
[2]  {class=2nd, sex=Female, age=Child}  => {survived=Yes} 0.005906406
[3]  {class=1st, sex=Female}             => {survived=Yes} 0.064061790
[4]  {class=1st, sex=Female, age=Adult}  => {survived=Yes} 0.063607451
[5]  {class=2nd, sex=Female}             => {survived=Yes} 0.042253521
[6]  {class=Crew, sex=Female}            => {survived=Yes} 0.009086779
[7]  {class=Crew, sex=Female, age=Adult} => {survived=Yes} 0.009086779
[8]  {class=2nd, sex=Female, age=Adult}  => {survived=Yes} 0.036347115
[9]  {class=2nd, sex=Male, age=Adult}    => {survived=No}  0.069968196
[10] {class=2nd, sex=Male}               => {survived=No}  0.069968196
[11] {class=3rd, sex=Male, age=Adult}    => {survived=No}  0.175829169
[12] {class=3rd, sex=Male}               => {survived=No}  0.191731031
     confidence coverage    lift     count
[1]  1.0000000  0.010904134 3.095640  24  
[2]  1.0000000  0.005906406 3.095640  13  
[3]  0.9724138  0.065879146 3.010243 141  
[4]  0.9722222  0.065424807 3.009650 140  
[5]  0.8773585  0.048159927 2.715986  93  
[6]  0.8695652  0.010449796 2.691861  20  
[7]  0.8695652  0.010449796 2.691861  20  
[8]  0.8602151  0.042253521 2.662916  80  
[9]  0.9166667  0.076328941 1.354083 154  
[10] 0.8603352  0.081326670 1.270871 154  
[11] 0.8376623  0.209904589 1.237379 387  
[12] 0.8274510  0.231712858 1.222295 422  

Are all these rules independent? If we know that {class=3rd,sex=Male}=>{survived=No} is a rule, say, then we would not be surprised to find out that {class=3rd,sex=Male,age=Adult}=>{survived=No} is also a rule.

The following chunk of code identifies the rules which are supersets of other rules appearing earlier (i.e., with higher lift) in the sorted list; such rules are marked as redundant and removed from the set, which brings us down to 8 rules:

subset.matrix <- as.matrix(arules::is.subset(rules.titanic.2, rules.titanic.2))
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant.titanic <- colSums(subset.matrix, na.rm=T) >= 1
rules.titanic.2.pruned <- rules.titanic.2[!redundant.titanic]
arules::inspect(rules.titanic.2.pruned)
    lhs                                 rhs            support     confidence
[1] {class=2nd, age=Child}           => {survived=Yes} 0.010904134 1.0000000 
[2] {class=1st, sex=Female}          => {survived=Yes} 0.064061790 0.9724138 
[3] {class=2nd, sex=Female}          => {survived=Yes} 0.042253521 0.8773585 
[4] {class=Crew, sex=Female}         => {survived=Yes} 0.009086779 0.8695652 
[5] {class=2nd, sex=Male, age=Adult} => {survived=No}  0.069968196 0.9166667 
[6] {class=2nd, sex=Male}            => {survived=No}  0.069968196 0.8603352 
[7] {class=3rd, sex=Male, age=Adult} => {survived=No}  0.175829169 0.8376623 
[8] {class=3rd, sex=Male}            => {survived=No}  0.191731031 0.8274510 
    coverage   lift     count
[1] 0.01090413 3.095640  24  
[2] 0.06587915 3.010243 141  
[3] 0.04815993 2.715986  93  
[4] 0.01044980 2.691861  20  
[5] 0.07632894 1.354083 154  
[6] 0.08132667 1.270871 154  
[7] 0.20990459 1.237379 387  
[8] 0.23171286 1.222295 422  
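
Recent versions of arules also provide a built-in helper for this type of filtering; a minimal sketch, assuming arules::is.redundant() is available in the installed version (its notion of redundancy is based on confidence improvement, so the resulting set may differ slightly from the one above):

# alternative pruning, assuming arules::is.redundant() is available
rules.titanic.2.alt <- rules.titanic.2[!arules::is.redundant(rules.titanic.2)]
arules::inspect(rules.titanic.2.alt)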

We have presented the “interesting” associations in tabular format, but there are a variety of graphical representations as well (available in package arulesViz), such as:

  • a bubble chart;

  • a two-key plot (taking into account the rules’ lengths);

  • a graph structure, or

  • parallel coordinates (where the width of the arrows represents support and the intensity of the colour represents confidence).

The original rules are shown below:

library(arulesViz)
plot(rules.titanic) 

plot(rules.titanic, method="two-key plot") 

plot(rules.titanic, method="graph") 

plot(rules.titanic, method="paracoord", control = list(reorder = TRUE)) 

For the constrained rules, we obtain:

plot(rules.titanic.2) 

plot(rules.titanic.2, method="two-key plot") 

plot(rules.titanic.2, method="graph") 

plot(rules.titanic.2, method="paracoord", control = list(reorder = TRUE)) 

For the pruned rules:

plot(rules.titanic.2.pruned) 

plot(rules.titanic.2.pruned, method="graph") 

plot(rules.titanic.2.pruned, method="two-key plot")

plot(rules.titanic.2.pruned, method="paracoord", control = list(reorder = TRUE)) 

Is there anything surprising about these outcomes?

11.7.2 Classification: Kyphosis Dataset

This example refers to Toy Example: Kyphosis Dataset; we explore the built-in kyphosis dataset with two decision tree methods (rpart, ctree).

Let’s get some information on the kyphosis dataset. We could call up the help file for the dataset:

?rpart::kyphosis     

And/or determine its structure:

str(rpart::kyphosis) 
'data.frame':   81 obs. of  4 variables:
 $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
 $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
 $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
 $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...

We can also obtain summary statistics; the first block below is the output of summary(), while the second is consistent with psych::describe():

summary(rpart::kyphosis)
    Kyphosis       Age             Number           Start      
 absent :64   Min.   :  1.00   Min.   : 2.000   Min.   : 1.00  
 present:17   1st Qu.: 26.00   1st Qu.: 3.000   1st Qu.: 9.00  
              Median : 87.00   Median : 4.000   Median :13.00  
              Mean   : 83.65   Mean   : 4.049   Mean   :11.49  
              3rd Qu.:130.00   3rd Qu.: 5.000   3rd Qu.:16.00  
              Max.   :206.00   Max.   :10.000   Max.   :18.00  
          vars  n  mean    sd median trimmed   mad min max range  skew kurtosis
Kyphosis*    1 81  1.21  0.41      1    1.14  0.00   1   2     1  1.40    -0.04
Age          2 81 83.65 58.10     87   82.29 77.10   1 206   205  0.05    -1.24
Number       3 81  4.05  1.62      4    3.88  1.48   2  10     8  1.09     1.57
Start        4 81 11.49  4.88     13   12.08  4.45   1  18    17 -0.85    -0.49
            se
Kyphosis* 0.05
Age       6.46
Number    0.18
Start     0.54

As always, we should take the time to visualize the dataset. In this case, since there are 4 variables (one of which is categorical), a scatterplot matrix is probably a good approach.

pairs(rpart::kyphosis[,2:4], main = "Kyphosis Data", pch = 21, 
      bg = c("red", "blue")[unclass(rpart::kyphosis[,1])], lower.panel=NULL, 
      labels=c("Age","Number","Start"), font.labels=2, cex.labels=4.5)

What should the legend of this scatterplot matrix be (red=??, blue=??)?

We build a tree using the recursive partitioning algorithm implemented in rpart. For the time being, we're assuming that the training set is the dataset as a whole (so there is no reason to expect that the decision trees should have predictive power, only descriptive power).

set.seed(2) # for replicability
tree <- rpart::rpart(Kyphosis ~ Age + Number + Start, 
                     method="class", data=rpart::kyphosis)
tree
n= 81 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 81 17 absent (0.79012346 0.20987654)  
   2) Start>=8.5 62  6 absent (0.90322581 0.09677419)  
     4) Start>=14.5 29  0 absent (1.00000000 0.00000000) *
     5) Start< 14.5 33  6 absent (0.81818182 0.18181818)  
      10) Age< 55 12  0 absent (1.00000000 0.00000000) *
      11) Age>=55 21  6 absent (0.71428571 0.28571429)  
        22) Age>=111 14  2 absent (0.85714286 0.14285714) *
        23) Age< 111 7  3 present (0.42857143 0.57142857) *
   3) Start< 8.5 19  8 present (0.42105263 0.57894737) *

We can access the method's results by using printcp() (although how informative this output proves to be depends on one's expectations).

rpart::printcp(tree) 

Classification tree:
rpart::rpart(formula = Kyphosis ~ Age + Number + Start, data = rpart::kyphosis, 
    method = "class")

Variables actually used in tree construction:
[1] Age   Start

Root node error: 17/81 = 0.20988

n= 81 

        CP nsplit rel error xerror    xstd
1 0.176471      0   1.00000 1.0000 0.21559
2 0.019608      1   0.82353 1.2353 0.23200
3 0.010000      4   0.76471 1.2941 0.23548

Details on the nodes and the splits can be obtained using the summary() function.

summary(tree)
Call:
rpart::rpart(formula = Kyphosis ~ Age + Number + Start, data = rpart::kyphosis, 
    method = "class")
  n= 81 

          CP nsplit rel error   xerror      xstd
1 0.17647059      0 1.0000000 1.000000 0.2155872
2 0.01960784      1 0.8235294 1.235294 0.2320031
3 0.01000000      4 0.7647059 1.294118 0.2354756

Variable importance
 Start    Age Number 
    64     24     12 

Node number 1: 81 observations,    complexity param=0.1764706
  predicted class=absent   expected loss=0.2098765  P(node) =1
    class counts:    64    17
   probabilities: 0.790 0.210 
  left son=2 (62 obs) right son=3 (19 obs)
  Primary splits:
      Start  < 8.5  to the right, improve=6.762330, (0 missing)
      Number < 5.5  to the left,  improve=2.866795, (0 missing)
      Age    < 39.5 to the left,  improve=2.250212, (0 missing)
  Surrogate splits:
      Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)

Node number 2: 62 observations,    complexity param=0.01960784
  predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
    class counts:    56     6
   probabilities: 0.903 0.097 
  left son=4 (29 obs) right son=5 (33 obs)
  Primary splits:
      Start  < 14.5 to the right, improve=1.0205280, (0 missing)
      Age    < 55   to the left,  improve=0.6848635, (0 missing)
      Number < 4.5  to the left,  improve=0.2975332, (0 missing)
  Surrogate splits:
      Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
      Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)

Node number 3: 19 observations
  predicted class=present  expected loss=0.4210526  P(node) =0.2345679
    class counts:     8    11
   probabilities: 0.421 0.579 

Node number 4: 29 observations
  predicted class=absent   expected loss=0  P(node) =0.3580247
    class counts:    29     0
   probabilities: 1.000 0.000 

Node number 5: 33 observations,    complexity param=0.01960784
  predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
    class counts:    27     6
   probabilities: 0.818 0.182 
  left son=10 (12 obs) right son=11 (21 obs)
  Primary splits:
      Age    < 55   to the left,  improve=1.2467530, (0 missing)
      Start  < 12.5 to the right, improve=0.2887701, (0 missing)
      Number < 3.5  to the right, improve=0.1753247, (0 missing)
  Surrogate splits:
      Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
      Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)

Node number 10: 12 observations
  predicted class=absent   expected loss=0  P(node) =0.1481481
    class counts:    12     0
   probabilities: 1.000 0.000 

Node number 11: 21 observations,    complexity param=0.01960784
  predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
    class counts:    15     6
   probabilities: 0.714 0.286 
  left son=22 (14 obs) right son=23 (7 obs)
  Primary splits:
      Age    < 111  to the right, improve=1.71428600, (0 missing)
      Start  < 12.5 to the right, improve=0.79365080, (0 missing)
      Number < 3.5  to the right, improve=0.07142857, (0 missing)

Node number 22: 14 observations
  predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
    class counts:    12     2
   probabilities: 0.857 0.143 

Node number 23: 7 observations
  predicted class=present  expected loss=0.4285714  P(node) =0.08641975
    class counts:     3     4
   probabilities: 0.429 0.571 

What are these nodes that are being referred to? Plotting the tree provides more information. Here is a basic plot:

rpart.plot::prp(tree)

and a fancier one:

rattle::fancyRpartPlot(tree, main="Classification Tree for Kyphosis") 

In the fancy plot, does the intensity of the colour play a role? What about the percentages? What about the decimals?

Unchecked tree growth usually leads to overfitting. This is typically a problem when a dataset contains too many variables (the corresponding decision tree would then contain too many splits). Overfitting is not going to be an issue with the kyphosis dataset, which contains only three predictor variables; nevertheless, the code below shows how one would prune the tree in general, by selecting a value of cp which minimizes the cross-validated error xerror (this will be revisited in a future module).
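
The candidate cp values and their cross-validated errors are stored in the cptable component of the fitted tree; a minimal sketch of the selection step is shown below (cp.table and cp.opt are our names; the prune() call that follows instead uses a fixed cp = 0.02 for illustration):

# pick the cp value with the smallest cross-validated error from the cptable
cp.table <- tree$cptable
(cp.opt <- cp.table[which.min(cp.table[, "xerror"]), "CP"])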

tree2 = rpart::prune(tree, cp = 0.02)
rattle::fancyRpartPlot(tree2)

How good is the classification model provided by the tree? We don't have access to \(p\)-values or confidence intervals – we need to rely on the model's confusion matrix.

We can obtain the predictions made by the model stored in the object tree by using the predict() function. This procedure takes each observation and feeds it to the model, outputting the likelihood of kyphosis being absent or present. Note that the probabilities are calibrated (compare with the Naive Bayes method, which we will see later).

The following model predicts the class probabilities for all observations:

predictions1 = predict(tree, type = "prob")

The first 10 predictions are:

absent present
0.4210526 0.5789474
0.8571429 0.1428571
0.4210526 0.5789474
0.4210526 0.5789474
1.0000000 0.0000000
1.0000000 0.0000000
1.0000000 0.0000000
1.0000000 0.0000000
1.0000000 0.0000000
0.4285714 0.5714286

In general, the confusion matrix requires a specific prediction (absent or present), against which we compare the actual classification. Here, we have probabilities. How can we take the probabilities and transform them into specific predictions?

Here is one way to do this.

random1 <- runif(81) # uniformly generate a random number (between 0 and 1) 
                     # for each of the observations 
real <- rpart::kyphosis$Kyphosis # extract the actual classification of the observations
test1 <- cbind(predictions1,random1,real) # join together the prediction probabilities, 
                                          # the random numbers, and the actual classification
                                          # since we're joining together text and numbers, 
                                          # cbind coerces the factors to numerical values
                                          # absent = 1, present = 2
pred1 <- 2-(test1[,3]<test1[,1])  # this code takes advantage of the numerical representation 
                                  # of the factors to output a specific prediction:
                                  # if random1 < prob of absent, then 
                                  # pred1 = absent (1), otherwise pred1 = present (2)
test1 <- cbind(test1,pred1) # add the specific predictions to the test1 dataset

The first 10 predictions would be:

absent present real pred1
0.4210526 0.5789474 1 1
0.8571429 0.1428571 1 1
0.4210526 0.5789474 2 2
0.4210526 0.5789474 1 2
1.0000000 0.0000000 1 1
1.0000000 0.0000000 1 1
1.0000000 0.0000000 1 1
1.0000000 0.0000000 1 1
1.0000000 0.0000000 1 1
0.4285714 0.5714286 2 1

We can now build the confusion matrix on the model predictions using real (actual classification) and pred1 (specific model prediction).

The function table() produces a joint distribution, where the rows correspond to the first variable in the call (actual), and the columns to the second variable (predicted).

table(test1[,4],test1[,5]) 
   
     1  2
  1 59  5
  2  9  8

Note that the confusion matrix could be different every time a new set of predictions is made. Why would that be the case?
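
Alternatively, the randomness can be avoided altogether by assigning each observation to its most likely class (pred.class is our name; the same type = "class" approach is used on a proper testing set further below):

pred.class <- predict(tree, type = "class")  # most likely class for each observation
table(rpart::kyphosis$Kyphosis, pred.class)  # actual (rows) vs. predicted (columns)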

Is this a good classification model or not?

In this example, we computed the confusion matrix using the entire dataset. In a sense, we should not have been surprised that the results were decent, because we were using the same data to build the model and to evaluate it.

From a predictive perspective, classification models are built on a subset of the data (the training set) and evaluated on the remaining data (the testing set).

The idea is that if there IS a strong classification signal, it should be found in any representative subset. As a rule, we look for training sets making up between 70% and 80% of the data. They should be selected randomly.
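
For instance, a random split with roughly 70% of the observations in the training set could be obtained as follows (index.train, train, and test are our names; the example below instead uses a 50-observation training sample):

n <- nrow(rpart::kyphosis)
index.train <- sample(1:n, size = floor(0.7 * n))  # random 70% of the row indices
train <- rpart::kyphosis[index.train, ]            # training set
test  <- rpart::kyphosis[-index.train, ]           # testing set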

We start by creating a training set with 50 instances, and we fit the rpart() algorithm to this data:

sub <- c(sample(1:81, 50)) 
fit <- rpart::rpart(Kyphosis ~ ., data = rpart::kyphosis, subset = sub) 

(If we use all variables, we do not need to specify them in the model call; we only need to use the “.”)

The model is given by:

fit 
n= 50 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 50 9 absent (0.8200000 0.1800000)  
  2) Start>=8.5 40 3 absent (0.9250000 0.0750000) *
  3) Start< 8.5 10 4 present (0.4000000 0.6000000) *

The confusion matrix has to be built on the testing set (indicated here by -sub, that is, the opposite of the sub indices).

There are 2 ways to make a prediction: we either use the most likely outcome, or we generate a random vector of predictions using the probabilities for each class, as above.

We can predict on the basis of class:

table(predict(fit, rpart::kyphosis[-sub,], type = "class"), rpart::kyphosis[-sub, "Kyphosis"])
         
          absent present
  absent      19       3
  present      4       5

Or we can predict on the basis of probability:

prob.fit<-predict(fit, rpart::kyphosis[-sub,], type = "prob")
random1 <- runif(31)
real <- rpart::kyphosis[-sub,"Kyphosis"]
test1 <- cbind(prob.fit,random1,real) # absent = 1, present = 2
pred1 <- 2-(test1[,3]<test1[,1])
test1 <- cbind(test1,pred1)
table(test1[,4], test1[,5])
   
     1  2
  1 20  3
  2  5  3

These are the confusion matrices that should be used to evaluate the decision tree's performance.

Another way to build decision trees is via party's ctree() function. Conditional inference trees have the property that they automatically prune themselves once a statistical criterion is met by the tree as a whole. The downside is that the splits they select are not always the most reasonable ones (they may not conform to contextual understanding).

kyphosis.ctree <- party::ctree(Kyphosis ~ ., data = rpart::kyphosis, subset = sub)
kyphosis.ctree

plot(kyphosis.ctree)


     Conditional inference tree with 2 terminal nodes

Response:  Kyphosis 
Inputs:  Age, Number, Start 
Number of observations:  50 

1) Number <= 5; criterion = 0.998, statistic = 11.641
  2)*  weights = 42 
1) Number > 5
  3)*  weights = 8 

A very simple tree, as can be seen. Its confusion matrix (on the testing set) is computed below.

table(predict(kyphosis.ctree, rpart::kyphosis[-sub,]),rpart::kyphosis[-sub,1]) 
         
          absent present
  absent      21       7
  present      2       1

How good of a model is this one? Which of the rpart() or ctree() model is preferable? Does this depend on the training set?
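
One simple way to compare the two models numerically is to compute the accuracy (proportion of correctly classified test observations) from each confusion matrix; a minimal sketch (the helper accuracy() is ours):

# proportion of correctly classified observations in a confusion matrix
accuracy <- function(confusion) sum(diag(confusion)) / sum(confusion)
conf.rpart <- table(predict(fit, rpart::kyphosis[-sub,], type = "class"),
                    rpart::kyphosis[-sub, "Kyphosis"])
conf.ctree <- table(predict(kyphosis.ctree, rpart::kyphosis[-sub,]),
                    rpart::kyphosis[-sub, 1])
accuracy(conf.rpart)
accuracy(conf.ctree)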

11.7.3 Clustering: Iris Dataset

This example refers to Toy Example: Iris Dataset; we cluster the ubiquitous (built-in) iris dataset, via \(k\)-means.

The procedure is straightforward:

  1. cluster the observations on the first 2 principal components, with \(k = 2, 3, 4, \ldots, 15\) clusters;

  2. display the Within Sum of Squares curve, as a function of the number of clusters;

  3. display the Davies-Bouldin curve, as a function of the number of clusters;

  4. select the optimal number of clusters on the basis of these curves.

Let us load the data and take a look at the iris dataset (without the species labels, as this is an unsupervised problem).

my.data<-iris[,1:4]
head(my.data)
pairs(my.data)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

The eye test would most likely identify 2 (or potentially 3) clusters.

In preparation for the cluster analysis, we will scale the data so that all the variables are represented on the same scale. This can be done using the scale() function.

my.data.scaled<-scale(my.data)
head(my.data.scaled)
pairs(my.data.scaled)

     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]   -0.8976739  1.01560199    -1.335752   -1.311052
[2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
[3,]   -1.3807271  0.32731751    -1.392399   -1.311052
[4,]   -1.5014904  0.09788935    -1.279104   -1.311052
[5,]   -1.0184372  1.24503015    -1.335752   -1.311052
[6,]   -0.5353840  1.93331463    -1.165809   -1.048667

The shape of the dataset is the same, but the axis ranges have changed.

One way to reduce the dimension of the problem is to work with the principal components (see Feature Selection and Dimension Reduction). The hope is that most of the variability in the data can be explained by a smaller number of derived variables, expressed as linear combinations of the original variables.

This can be done in R with the princomp() function.

pc.agg.data = princomp(my.data.scaled)
summary(pc.agg.data) 
Importance of components:
                          Comp.1    Comp.2     Comp.3      Comp.4
Standard deviation     1.7026571 0.9528572 0.38180950 0.143445939
Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000

This provides a summary of the “strength” of the signal in each component. The cumulative proportion of the variance is the value of interest. If 2 principal components are needed to explain 95% of the variance, we would expect that roughly 95% of the set is “2-dimensional”.
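
These cumulative proportions can also be recomputed directly from the component standard deviations, as a quick check:

cumsum(pc.agg.data$sdev^2) / sum(pc.agg.data$sdev^2)  # cumulative proportion of variance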

Since 2 components are enough to explain 95% of the variance in the data, we will cluster the data projected on only the first 2 components, which can be accessed via the scores attribute of pc.agg.data.

The plot below shows a 2D representation of the iris dataset obtained via PCA.

pc.df.agg.data = cbind(pc.agg.data$scores[,1], pc.agg.data$scores[,2])
plot(pc.df.agg.data,col=iris[,5]) 
title('PCA plot of Iris Data - 2 Main PCs') 

The Davies-Bouldin (DB) index is a measure that is used to determine the optimal number of clusters in the data. For a given dataset, the optimal number of clusters is obtained by maximizing the DB index (there are different versions of the DB index, all giving equivalent results).
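
In the implementation below, with cluster centres \(A_i\), within-cluster sums of squares \(SS_i\), and cluster sizes \(m_i\), each cluster's dispersion is \(S_i=\sqrt{SS_i/m_i}\), and the index for \(N\) clusters is
\[
\text{DB}=\frac{1}{N}\sum_{i=1}^{N}\max_{j\neq i}\frac{S_i+S_j}{d(A_i,A_j)},
\]
where \(d(A_i,A_j)\) denotes the distance between the centres of clusters \(i\) and \(j\).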

Davies.Bouldin <- function(A, SS, m) {
  # A  - the centres of the clusters
  # SS - the within sum of squares
  # m  - the sizes of the clusters
  N <- nrow(A)   # number of clusters
  # within-cluster dispersion: root mean squared distance to the cluster centre
  S <- sqrt(SS/m)
  # distances between the cluster centres
  M <- as.matrix(dist(A))
  # ratio of summed within-cluster dispersions to the distance between centres
  R <- matrix(0, N, N)
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      R[i,j] <- (S[i] + S[j])/M[i,j]
      R[j,i] <- R[i,j]
    }
  }
  return(mean(apply(R, 1, max)))
}

But we do not yet know how many clusters we will ultimately be using (although the visual inspection above suggested \(k=2\) or \(k=3\)), so we will construct \(k-\)means clusters for a variety of values \(k\).

We will in fact produce 40 replicates for each \(k=2,\ldots, 15\) and track both the DB index and the Within Sum of Squares (SS), which measures how similar the observations are within each cluster (the per-cluster values are stored in the withinss component of a kmeans object, and we sum them).

If the DB index does not provide a clear-cut winner, the SS curve can be used: the optimal number of clusters is obtained when the slope of the SS curve flattens “drastically” (so that adding more clusters does not provide as big a decrease in SS).

We start by setting up the number of repetitions and the loop (2, 3, …, 15 clusters).

## setting up the repetitions and display options
oldpar <- par(mfrow = c(4, 4))

N = 40   # Number of repetitions
max.cluster = 15   # Maximum number of clusters considered

## initializing values
m.errs <- rep(0, max.cluster)
m.DBI <- rep(0, max.cluster)

s.errs <- rep(0, max.cluster)
s.DBI <- rep(0, max.cluster)

Now we run 40 replicates for each number of clusters (so 560 calls to the kmeans() clustering algorithm in total). For each clustering scheme, we compute the DB index and the SS, and store them in memory.

We also print one of the clustering schemes for each of the number of clusters in the iteration.

set.seed(0) # for replicability

## clustering and plotting 
for (i in 2:max.cluster) {
  errs <- rep(0, N)  # one SS value per replicate
  DBI <- rep(0, N)   # one DB index value per replicate
  
  for (j in 1:N) {
    KM <- kmeans(pc.df.agg.data, iter.max = 10,  i) 
    # arguments: data, maximum number of iterations, number of clusters
    
    errs[j] <- sum(KM$withinss)
    DBI[j] <- Davies.Bouldin(KM$centers, KM$withinss, KM$size)
  }
  
  m.errs[i - 1] = mean(errs)
  s.errs[i - 1] = sd(errs)
  m.DBI[i - 1] = mean(DBI)
  s.DBI[i - 1] = sd(DBI)
  
  plot(pc.df.agg.data,col=KM$cluster, pch=KM$cluster, 
       main=paste(i,"clusters - kmeans (euclidean)"))   
}

Since we have replicates, we can compute confidence bands for both the average DB index and the average SS.

MSE.errs_up = m.errs + 1.96 * s.errs / sqrt(N)
MSE.errs_low = m.errs - 1.96 * s.errs / sqrt(N)

MSE.DBI_up = m.DBI + 1.96 * s.DBI / sqrt(N)
MSE.DBI_low = m.DBI - 1.96 * s.DBI / sqrt(N)


# m.errs[1], ..., m.errs[14] correspond to k = 2, ..., 15 clusters
plot(2:max.cluster, m.errs[1:(max.cluster-1)], main = "SS", xlab="k", ylab="SS")
lines(2:max.cluster, m.errs[1:(max.cluster-1)])
par(col = "red")
lines(2:max.cluster, MSE.errs_up[1:(max.cluster-1)])
lines(2:max.cluster, MSE.errs_low[1:(max.cluster-1)])

par(col = "black")

plot(2:max.cluster, m.DBI[1:(max.cluster-1)], main = "Davies-Bouldin", xlab="k", ylab="DBI")
lines(2:max.cluster, m.DBI[1:(max.cluster-1)])
par(col="red")
lines(2:max.cluster, MSE.DBI_up[1:(max.cluster-1)])
lines(2:max.cluster, MSE.DBI_low[1:(max.cluster-1)])

par(col = "black")

Where is the DB curve maximized? Does that match what the SS curve shows? We pick the optimal number of clusters using the following code:

(i_choice <- which(m.DBI==max(m.DBI[1:(length(m.DBI)-1)]))+1)
[1] 5

Finally, let us plot a “final” clustering scheme based on the optimal choice. We cluster on the PCA reduced scaled data, but we plot the results in the original iris data.

KM <- kmeans(pc.df.agg.data, iter.max = 10, i_choice)
plot(iris[,1:4],col=KM$cluster, pch=KM$cluster, main=paste(i_choice,"clusters - kmeans (euclidean)"))

Do we get similar clustering schemes if we use a different distance measure (the default in kmeans() is the Euclidean metric)? Let us try the Manhattan distance (\(k\)-medians) with the cclust() function from the flexclust package, which provides a more flexible clustering framework, including different algorithms and distance measures.

library(flexclust)  
Loading required package: grid
Loading required package: lattice
Loading required package: modeltools
Loading required package: stats4

Attaching package: 'modeltools'
The following object is masked from 'package:arules':

    info
KMed <- cclust(pc.df.agg.data, i_choice, dist="manhattan")
plot(iris[,1:4], col=predict(KMed), pch=predict(KMed), 
     main=paste(i_choice,"clusters - kmedians (manhattan)"))

How do these compare?