14.2 Simple Clustering Algorithms

We start by briefly discussing two of the simplest clustering algorithms: \(k-\)means and hierarchical clustering.253

14.2.1 \(k-\)Means and Variants

One potential objective is to achieve minimal within-cluster variation – observations within a cluster should be very similar to one another, and the total variation over all clusters should also be small.

Assume that there are \(k\) clusters in the (scaled) dataset \[\mathbf{X}_{n\times p}=\begin{bmatrix}\mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}.\]

Let \(C_1, \ldots, C_k\) denote the set of indices in each cluster, so that \[\{1,\ldots,n\}=C_1\sqcup \cdots\sqcup C_k\quad \textbf{(hard clustering)};\] we use the notation \(\mathbf{x}_i\in C_\ell\) to indicate that observation \(i\) lies in cluster \(\ell\). The within cluster variation \(\text{WCV}(C_\ell)\) measures the amount by which the observations in \(C_\ell\) differ from one another.

The approach is partition-based; we look for a partition \(\{C_{\ell}^*\}_{\ell=1}^k\) such that the total within cluster variation is minimized: \[\{C_\ell^*\}=\arg\min_{\{C_\ell\}}\left\{\sum_{\ell=1}^k\text{WCV}(C_\ell)\right\}.\]

The first challenge is that there are numerous ways to define \(\text{WCV}(C_\ell)\), and that they do not necessarily lead to the same results (as one would expect from clustering); most definitions, however, fall in line with expressions looking like \[\text{WCV}(C_\ell)=\frac{1}{(|C_\ell|-g)^\mu}\sum_{\mathbf{x}_i,\mathbf{x}_j\in C_\ell}\text{variation}(\mathbf{x}_i,\mathbf{x}_j),\] where it is understood that \(\text{variation}(\mathbf{x},\mathbf{x})=0\).

Common choices for the variation include \[\text{variation}(\mathbf{x}_i,\mathbf{x}_j)=\|\mathbf{x}_i-\mathbf{x}_j\|_2^2=\sum_{m=1}^p(x_{i,m}-x_{j,m})^2\] or \[\text{variation}(\mathbf{x}_i,\mathbf{x}_j)=\|\mathbf{x}_i-\mathbf{x}_j\|_1=\sum_{m=1}^p|x_{i,m}-x_{j,m}|;\] these are typically used because of the ease of vectorizing the distance measurements, and not necessarily because they make the most sense in context.

With these choices, if all observations \(\mathbf{x}\) within a cluster \(C\) are near one another, we would expect \(\text{WCV}(C)\) to be small. The values of the parameter \(\mu\) can be adjusted to influence the cluster sizes. Traditionally, we use \(\mu=0\) (or \(\mu=1\)) and \(g=0\), and the partition problem reduces to \[\{C_\ell^*\}=\arg\min_{\{C_\ell\}}\left\{\sum_{\ell=1}^k\frac{1}{|C_\ell|^{\mu}}\sum_{\mathbf{x}_i,\mathbf{x}_j\in C_\ell} \text{variation}(\mathbf{x}_i,\mathbf{x}_j)\right\}.\]
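For concreteness, here is a bare-bones computation of the total within-cluster variation for a given hard clustering, using the squared Euclidean variation with \(g=0\); the function name and the toy data are arbitrary and only meant to illustrate the formula.

# total within-cluster variation for a hard clustering, with
# variation(x_i, x_j) = ||x_i - x_j||_2^2, g = 0, and exponent mu
# X: (scaled) numeric matrix (n x p); clusters: vector of n cluster labels
wcv.total <- function(X, clusters, mu = 1) {
  total <- 0
  for (l in unique(clusters)) {
    C.l <- as.matrix(X)[clusters == l, , drop = FALSE]  # observations in cluster l
    D.l <- as.matrix(dist(C.l))^2                       # pairwise squared Euclidean variations
    total <- total + sum(D.l) / nrow(C.l)^mu            # sum over all (ordered) pairs in the cluster
  }
  total
}

# example: a random 3-cluster assignment on a small synthetic dataset
set.seed(0)
X <- matrix(rnorm(60), ncol = 3)
wcv.total(X, sample(1:3, nrow(X), replace = TRUE))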

As an optimization problem, obtaining \(\{C_{\ell}^*\}_{\ell=1}^k\) is NP-hard due to the combinatorial explosion of possible partitions \(\{C_\ell\}_{\ell=1}^k\) when \(n\) is large.254

Algorithm

There is a way to obtain a partition that is (hopefully) reasonably close to the optimal one, without having to go through all possible partitions:

  1. randomly assign a cluster number \(\{1,\ldots, k\}\) to each observation in the dataset;

  2. for each \(C_\ell\), compute the cluster centroid;

  3. assign each observation to the cluster whose centroid is nearest to the observation;

  4. repeat steps 2-3 until the clusters are stable.

Three choices need to be made in order for the algorithm to run:

  • the number of clusters \(k\) in step 1;

  • the centroid computation measure in step 2;

  • the distance metric used in step 3.

The most common choice of centroid measure for numerical data is the vector of means along each feature of the observations in each cluster (hence, \(k-\)means); using other centrality measures yields different methods (such as \(k-\)medians, for instance). For categorical data, the algorithm becomes \(k-\)modes.

The distance used in step 3 is usually aligned with the centroid measure of step 2 (and with the choice of a variation function in the problem statement): Euclidean for \(k-\)means, Manhattan for \(k-\)medians, Hamming for \(k-\)modes. Variants of these approaches may use a different random initialization step: the first iteration centroids may be selected randomly from the list of observations, say.255

Other variants describe how to perform the computations in parallel (for Big Data, see Module ??) or on data streams (with an updating rule, see Module ??). The algorithm can be shown to converge to a stable cluster assignment, but there is no guarantee that this assignment is the global minimizer of the objective function; indeed, different initial conditions can lead to different local minima, which is to say, to different clustering schemes.
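For concreteness, here is a bare-bones sketch of the algorithm described above, with random label assignment, mean centroids, and Euclidean distances; it is only meant to illustrate the steps (base R's kmeans(), used in the example below, should be preferred in practice).

# a minimal k-means sketch, for illustration only;
# empty clusters are not handled
kmeans.sketch <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  clusters <- sample(1:k, nrow(X), replace = TRUE)           # step 1: random assignment
  for (iter in 1:max.iter) {
    # step 2: centroid (vector of feature means) of each cluster (p x k matrix)
    centroids <- sapply(1:k, function(l) colMeans(X[clusters == l, , drop = FALSE]))
    # step 3: re-assign each observation to the nearest centroid (Euclidean distance)
    d2 <- apply(X, 1, function(x) colSums((x - centroids)^2)) # k x n matrix of squared distances
    new.clusters <- apply(d2, 2, which.min)
    if (identical(new.clusters, clusters)) break               # step 4: stop when assignments are stable
    clusters <- new.clusters
  }
  list(cluster = clusters, centers = t(centroids))
}

# usage on a built-in dataset (the 4 numerical features of iris, scaled)
set.seed(0)
table(kmeans.sketch(scale(iris[, 1:4]), k = 3)$cluster)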

Example

We have worked with the Gapminder dataset in Modules 12 and 13; we will use it once again to illustrate some of the notions in this module. The 2011 data contains observations on \(n=184\) countries, for the following variables:

  • life expectancy (in years);

  • infant mortality rate (per 1000 births);

  • fertility rate (in children per woman);

  • population (we use the logarithm for clustering), and

  • GDP per capita (same).

library(dplyr)
library(tidyverse) # for remove_rownames() and column_to_rownames()

gapminder.SoCL = read.csv("Data/gapminder_all.csv",stringsAsFactors=TRUE)

gapminder.SoCL.2011  = gapminder.SoCL |> 
  filter(time==2011) |>
  select(geo,
         population_total,
         income_per_person_gdppercapita_ppp_inflation_adjusted,
         life_expectancy_years,
         infant_mortality_rate_per_1000_births,
         children_per_woman_total_fertility) |> 
  mutate(log10_pop=log10(population_total),
         log10_gdppc=log10(income_per_person_gdppercapita_ppp_inflation_adjusted)) |>
  na.omit() |>
  remove_rownames() |> 
  column_to_rownames(var="geo")

colnames(gapminder.SoCL.2011)<- c("Pop","GDPpc","Life Exp","Inf Mort",
                                  "Fert","log10 Pop","log10 GDPpc")
knitr::kable(summary(gapminder.SoCL.2011))
            Pop          GDPpc     Life Exp   Inf Mort   Fert    log10 Pop   log10 GDPpc
Min.        9.183e+04    614       47.55      1.8        1.210   4.963       2.788
1st Qu.     2.625e+06    3251      64.34      7.5        1.798   6.417       3.511
Median      8.833e+06    9910      72.60      17.4       2.455   6.946       3.996
Mean        3.806e+07    16424     70.62      27.7       2.962   6.862       3.934
3rd Qu.     2.567e+07    22586     76.90      41.8       4.027   7.409       4.354
Max.        1.367e+09    129350    82.91      106.8      7.460   9.136       5.112

Scatter plots of the original dataset and of the transformed dataset are shown below.

GGally::ggpairs(gapminder.SoCL.2011[,c(3:5,1:2)])

Figure 14.1: Scatterplots of the 2011 Gapminder dataset (top). Due to outlying observations in the population variable (China and India), we will be working instead with the logarithm of the population and the logarithm of GDP per capita (bottom).

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)])

Figure 14.2: Scatterplots of the 2011 Gapminder dataset (top). Due to outlying observations in the population variable (China and India), we will be working instead with the logarithm of the population and the logarithm of GDP per capita (bottom).

Throughout, we will be working with the scaled dataset.

gapminder.SoCL.2011.s <- data.frame(scale(gapminder.SoCL.2011[,c(3:7)]))
str(gapminder.SoCL.2011.s)
'data.frame':   184 obs. of  5 variables:
 $ Life.Exp   : num  -1.647 -1.151 0.635 0.377 1.366 ...
 $ Inf.Mort   : num  1.834 3.175 -0.602 -0.498 -0.959 ...
 $ Fert       : num  1.77 2.093 -0.404 -0.981 -0.7 ...
 $ log10.Pop  : num  0.742 0.635 0.921 -0.49 0.595 ...
 $ log10.GDPpc: num  -1.336 -0.304 0.672 -0.164 1.286 ...

The following function will allow us to plot the distributions of each of the variables in each of the clusters (plots appearing on the diagonal):

# density plots for the diagonal panels of the GGally::ggpairs() plots below
my_dens <- function(data, mapping, ..., low = "#132B43", high = "#56B1F7") {
  ggplot2::ggplot(data = data, mapping=mapping) +
    ggplot2::geom_density(..., alpha=0.7) 
}

We run \(k-\)means for \(k=2,3,4\) and obtain the results shown below:

set.seed(1) # for replicability
# k=2
gapminder.SoCL.2011.s.k2 = kmeans(gapminder.SoCL.2011.s,2,iter.max=250,nstart=1)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(gapminder.SoCL.2011.s.k2$cluster)),
        diag=list(continuous=my_dens))

Figure 14.3: A realization of \(2-\)means on the 2011 Gapminder dataset.

table(gapminder.SoCL.2011.s.k2$cluster)

  1   2 
 64 120 
# k=3
gapminder.SoCL.2011.s.k3 = kmeans(gapminder.SoCL.2011.s,3,iter.max=250,nstart=1)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(gapminder.SoCL.2011.s.k3$cluster)),
        diag=list(continuous=my_dens))

Figure 14.4: A realization of \(3-\)means on the 2011 Gapminder dataset.

table(gapminder.SoCL.2011.s.k3$cluster)

 1  2  3 
46 84 54 
# k=4
gapminder.SoCL.2011.s.k4 = kmeans(gapminder.SoCL.2011.s,4,iter.max=250,nstart=1)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(gapminder.SoCL.2011.s.k4$cluster)),
        diag=list(continuous=my_dens))

Figure 14.5: A realization of \(4-\)means on the 2011 Gapminder dataset.

table(gapminder.SoCL.2011.s.k4$cluster)

 1  2  3  4 
26 50 61 47 

The colours (cluster labels) are not used by the clustering algorithm – they are its outputs (the cluster label value is irrelevant). This next chart shows the result of a different initialization for \(k=3\), leading to a different cluster assignment.

set.seed(1234) # different initialization
# k=3
gapminder.SoCL.2011.s.k3 = kmeans(gapminder.SoCL.2011.s,3,iter.max=250,nstart=1)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(gapminder.SoCL.2011.s.k3$cluster)),
        diag=list(continuous=my_dens))

Figure 14.6: A different realization of \(3-\)means on the 2011 Gapminder dataset.

table(gapminder.SoCL.2011.s.k3$cluster)

 1  2  3 
56 74 54 
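One common way to mitigate this sensitivity to the initialization is to run the procedure from several random starts and to keep the assignment with the smallest total within-cluster sum of squares; kmeans() does this automatically through its nstart argument (the object name below is arbitrary).

set.seed(1)
# 25 random initializations; the run with the smallest tot.withinss is retained
gapminder.SoCL.2011.s.k3.best = kmeans(gapminder.SoCL.2011.s, 3, iter.max=250, nstart=25)
table(gapminder.SoCL.2011.s.k3.best$cluster)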

Another example is provided in Section 11.7.3.

14.2.2 Hierarchical Clustering

One of the issues surrounding the use of \(k-\)means (and its variants) is that nothing in the result of a single run of the algorithm indicates if the choice of \(k\) was a good one.256

Determining a “good” value of \(k\) can only be achieved by repeatedly running the algorithm over a range of “reasonable” values of \(k\) (to account for initialization variability), and by comparing the outputs using some of the methods discussed in Section 14.3. This process can be memory- (and time-) intensive.
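One simple heuristic, for instance, is to plot the total within-cluster sum of squares returned by kmeans() against \(k\) and to look for an “elbow”; a quick sketch on the scaled Gapminder data (the range of \(k\) and the number of random starts are arbitrary choices):

# total within-cluster sum of squares for k = 2, ..., 10 (25 random starts each)
set.seed(1)
wss <- sapply(2:10, function(k) {
  kmeans(gapminder.SoCL.2011.s, k, iter.max = 250, nstart = 25)$tot.withinss
})
plot(2:10, wss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")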

Hierarchical clustering (HC) can sidestep this difficulty altogether by building a deterministic global clustering structure (for a given choice of parameters), from which we can select a specific number of clusters after the algorithm has converged; the advantage of this approach is that if we want to use a different number of clusters, we do not need to re-run the clustering algorithm – we simply read off the new clusters from the global clustering structure.

There are two main conceptual approaches:

  • bottom-up/agglomerative (AGNES) – initially, each observation sits in its own separate cluster; clusters are then merged (in pairs) as the hierarchy is climbed, leading (after the last merge) to a single large cluster containing all observations;

  • top-down/divisive (DIANA) – initially, all observations lie in the same cluster, which is split (and re-split) in pairs as the hierarchy is traversed downward, leading (after the last split) to each observation sitting in its own separate cluster.

Both approaches are illustrated in the figures below: the first one is an illustration of AGNES and DIANA. The corresponding hierarchical structure is shown in the second image; the dendrogram in the third.

In theory, the two approaches are equivalent (they produce the same hierarchical structure for a given dissimilarity measure and linkage strategy – more on this later); in practice, we use AGNES over DIANA for anything but small datasets, as the former runs in polynomial time (with respect to the number of observations), whereas the latter runs in exponential time.

With AGNES, the clustering dendrogram is built starting from the leaves, and combining clusters by pairs, up to the root, as in Figure 14.7.


Figure 14.7: Cluster dendrogram for the hierarchical cluster structure of a dataset with 50 observations and 3 variables, with average linkage (UPGMA) and using the Euclidean distance as the dissimilarity measure [author unknown]. The dendrogram is cut at a dissimilarity level \(\approx 0.6\) so that 5 clusters emerge (ordered and coloured in red, magenta, blue, green, and yellow); the observation profiles are shown in the stacked bar chart and provide potential descriptions of the clusters (magenta \(=\) small total height, with mostly dominant 3rd components, say). Based on the profiles, we might also have elected to cut at a slightly lower dissimilarity level (\(\approx 0.55\)), so that the yellow and green clusters would have been further split into two clusters apiece (between observations 35 and 13, and 30 and 10, perhaps?).

Algorithm

We build the global structure as follows:

  1. each observation is assigned to its own cluster (there are \(n\) clusters, initially);

  2. the two clusters that are the least dissimilar are merged into a supra-cluster;

  3. repeat step 2 until all of the observations belong to a single large merged cluster (with \(n\) observations).

Three choices need to be made in order for the algorithm to run:

  • the choice of a linkage strategy in steps 2 and 3;

  • the dissimilarity measure used in step 2;

  • the dissimilarity threshold required to “cut” the dendrogram into clusters.

In Figure 14.7, the dataset is first split into \(n=50\) clusters; observations 13 and 34 are then found to be the least dissimilar, and are merged into a single cluster, so that the 50 observations are now grouped into 49 clusters. The next two most similar observations are 14 and 37, which are themselves merged, so that there are 48 clusters at that level.

The process is continued until all observations are merged into a single cluster, leading to the global clustering structure (clustering dendrogram) for the dataset. In order to obtain actual clusters (as opposed to the global structure), we cut the dendrogram at the selected dissimilarity level, with the resulting groups of observations yielding the dataset clusters (5, in the example). Increasing the dissimilarity threshold decreases the number of clusters, and vice-versa.
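In R, both operations are performed with cutree(), which accepts either the desired number of clusters (k) or a dissimilarity threshold (h); a minimal sketch on a small built-in dataset (the threshold value is chosen purely for illustration):

hc <- hclust(dist(scale(mtcars)))   # global structure (Euclidean dissimilarity, complete linkage)
table(cutree(hc, k = 5))            # cut so that 5 clusters emerge
table(cutree(hc, h = 4))            # cut at a fixed dissimilarity level instead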

Linkage Strategy

In the first AGNES stage, we compare all pairs of observations to determine which two are the least dissimilar; these two observations are then merged into a cluster.257

In the second stage, we must also compare each of the non-merged observations with a cluster of two observations to determine their dissimilarity (the other dissimilarities have been computed in the first stage and do not need to be computed anew).

In subsequent steps, we might also need to compare two clusters with any number of observations for overall similarity. How can this be achieved? Let \(A\) and \(B\) be clusters in the data, with \(|A|=n_A\), \(|B|=n_B\). The dissimilarity between \(A\) and \(B\) can be computed in multiple ways:

  • complete linkage – the largest dissimilarity among all pairwise dissimilarities between the observations in \(A\) and those in \(B\) (\(n_An_B\) computations);

  • single linkage – the smallest dissimilarity among all pairwise dissimilarities as in complete linkage;

  • average linkage – the average dissimilarity among all pairwise dissimilarities as in complete (or single) linkage;

  • centroid linkage – the dissimilarity between the centroids of \(A\) and \(B\) (found using whatever method is appropriate for the context);

  • Ward’s method (and its variants) uses any reasonable objective function which reflects domain knowledge [269], [270];

  • etc.

The choice of a linkage strategy (and of a dissimilarity measure) affects not only how clusters are compared and merged, but also the topology (shape/structure) of the resulting dendrogram (are the clusters tight, loose, blob-like, etc.). The various linkage strategies are illustrated below, assuming Euclidean dissimilarity.
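For instance, the sketch below compares the dendrograms produced by a few of these strategies on the scaled Gapminder data, with Euclidean dissimilarity; single linkage tends to produce long “chains”, whereas complete and Ward-type linkages tend to produce more compact clusters.

d <- dist(gapminder.SoCL.2011.s)    # Euclidean dissimilarities on the scaled data
par(mfrow = c(2, 2), cex = 0.45)
for (m in c("single", "complete", "average", "ward.D2")) {
  plot(hclust(d, method = m), hang = -1, labels = FALSE,
       main = paste("Gapminder 2011 - linkage:", m), xlab = "", sub = "")
}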

Example

We show the results of hierarchical clustering on the Gapminder 2011 data, using complete linkage with Euclidean dissimilarity, and Ward \(D\) linkage with the maximum dissimilarity. In each case, we consider \(k=2,3,4\) clusters. Reminder: we are working with the scaled data.

# global, complete, Euclidean
par(cex=0.45)
hclust.gapminder.SoCL.2011 <- hclust(dist(gapminder.SoCL.2011.s))
plot(hclust.gapminder.SoCL.2011, hang = -1, cex=0.7, 
                 main = "Gapminder 2011 Data \n HC - Global Structure - Euclidean - Complete", 
                 ylab="") 

# k=2, complete, Euclidean
par(cex=0.45)
hclust.gapminder.SoCL.2011 |> 
  as.dendrogram() |> 
  dendextend::set("branches_k_color", value = c("red", "blue"), k = 2)  |>  
  plot(main = "Gapminder 2011 Data \n HC - 2 clusters - Euclidean - Complete") 
hclust.gapminder.SoCL.2011 |> 
  as.dendrogram() |> dendextend::rect.dendrogram(k=2, border = 8, lty = 5, 
                                                 lwd = 2, lower_rect=0) 
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(cutree(hclust.gapminder.SoCL.2011, k = 2))),
        diag=list(continuous=my_dens))
table(cutree(hclust.gapminder.SoCL.2011, k = 2))

  1   2 
 66 118 

# k=3, complete, Euclidean
par(cex=0.45)
hclust.gapminder.SoCL.2011 |> 
  as.dendrogram() |> 
  dendextend::set("branches_k_color", value = c("red", "blue", "darkgreen"), k = 3)  |> 
  plot(main = "Gapminder 2011 Data \n HC - 3 clusters - Euclidean - Complete") 
hclust.gapminder.SoCL.2011 |> 
  as.dendrogram() |> dendextend::rect.dendrogram(k=3, border = 8, lty = 5, 
                                                 lwd = 2, lower_rect=0) 
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(cutree(hclust.gapminder.SoCL.2011, k = 3))),
        diag=list(continuous=my_dens))
table(cutree(hclust.gapminder.SoCL.2011, k = 3))

 1  2  3 
66 24 94 

# k=4, complete, Euclidean
par(cex=0.45)
hclust.gapminder.SoCL.2011 |> 
  as.dendrogram() |> 
  dendextend::set("branches_k_color", value = c("red", "blue", "darkgreen", "gray"), k = 4)  |> 
  plot(main = "Gapminder 2011 Data \n HC - 4 clusters - Euclidean - Complete") 
hclust.gapminder.SoCL.2011 |> 
  as.dendrogram() |> dendextend::rect.dendrogram(k=4, border = 8, lty = 5, 
                                                 lwd = 2, lower_rect=0) 
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(cutree(hclust.gapminder.SoCL.2011, k = 4))),
        diag=list(continuous=my_dens))
table(cutree(hclust.gapminder.SoCL.2011, k = 4))

 1  2  3  4 
35 24 94 31 

Notice how the number of observations in each cluster follows a hierarchical structure: when we go from \(k=2\) to \(k=3\) clusters, the new clusters are obtained by splitting one of the old clusters (and similarly when we go from \(k=3\) to \(k=4\)); this nesting can be verified directly by cross-tabulating the assignments, as shown below.
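The following cross-tabulation confirms that each of the 3 clusters falls entirely within a single one of the 2 clusters:

# rows: 2-cluster assignment; columns: 3-cluster assignment
table(cutree(hclust.gapminder.SoCL.2011, k = 2),
      cutree(hclust.gapminder.SoCL.2011, k = 3))

We can see how the results change when we use a different distance metric and a different linkage strategy.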

# global, Ward D, Maximum
par(cex=0.45)
hclust.gapminder.SoCL.2011.2 <- hclust(dist(gapminder.SoCL.2011.s,method="maximum"), method="ward.D")
plot(hclust.gapminder.SoCL.2011.2, hang = -1, cex=0.7, main = "Gapminder 2011 Data \n HC - Global Structure - Maximum - Ward D", ylab="") 

# k=2, Ward D, Maximum
par(cex=0.45)
hclust.gapminder.SoCL.2011.2 |> 
  as.dendrogram() |> 
  dendextend::set("branches_k_color", value = c("red", "blue"), k = 2)  |>  
  plot(main = "Gapminder 2011 Data \n HC - 2 clusters - Maximum - Ward D") 
hclust.gapminder.SoCL.2011.2 |> 
  as.dendrogram() |> dendextend::rect.dendrogram(k=2, border = 8, lty = 5, 
                                                 lwd = 2, lower_rect=0) 
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(cutree(hclust.gapminder.SoCL.2011.2, k = 2))),
        diag=list(continuous=my_dens))
table(cutree(hclust.gapminder.SoCL.2011.2, k = 2))

  1   2 
 72 112 

par(cex=0.45)
# k=3, Ward D, Maximum
hclust.gapminder.SoCL.2011.2 |> 
  as.dendrogram() |> 
  dendextend::set("branches_k_color", value = c("red", "blue", "darkgreen"), k = 3)  |> 
  plot(main = "Gapminder 2011 Data \n HC - 3 clusters - Maximum - Ward D") 
hclust.gapminder.SoCL.2011.2 |> 
  as.dendrogram() |> dendextend::rect.dendrogram(k=3, border = 8, lty = 5, 
                                                 lwd = 2, lower_rect=0) 
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(cutree(hclust.gapminder.SoCL.2011.2, k = 3))),
        diag=list(continuous=my_dens))
table(cutree(hclust.gapminder.SoCL.2011.2, k = 3))

 1  2  3 
72 80 32 

par(cex=0.45)
# k=4, Ward D, Maximum
hclust.gapminder.SoCL.2011.2 |> 
  as.dendrogram() |> 
  dendextend::set("branches_k_color", value = c("red", "blue", "darkgreen", "gray"), k = 4)  |> 
  plot(main = "Gapminder 2011 Data \n HC - 4 clusters - Maximum - Ward D") 
hclust.gapminder.SoCL.2011.2 |> 
  as.dendrogram() |> dendextend::rect.dendrogram(k=4, border = 8, lty = 5, 
                                                 lwd = 2, lower_rect=0) 
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
        ggplot2::aes(color=as.factor(cutree(hclust.gapminder.SoCL.2011.2, k = 4))),
        diag=list(continuous=my_dens))
table(cutree(hclust.gapminder.SoCL.2011.2, k = 4))

 1  2  3  4 
47 80 32 25 

References

[269]
R. C. Amorim, “Feature relevance in Ward’s hierarchical clustering using the Lp norm,” J. Classif., vol. 32, no. 1, pp. 46–62, Apr. 2015, doi: 10.1007/s00357-015-9167-1.
[270]
“Ward’s method,” Wikipedia.