14.3 Clustering Evaluation

Hierarchical clustering (HC) and \(k-\)means (and its variants) both attempt to separate the data into natural groups, using different conceptual approaches; \(k-\)means tries to minimize within-cluster variation while HC builds a global clustering structure.

We have discussed some of their shortcomings in the previous section; the fact that they may yield different clustering outcomes depending on the choices made along the way (initialization, similarity/dissimilarity measures, linkage strategy, number of clusters, etc.) reinforces the notion that unsupervised learning is difficult.

We will discuss advanced algorithms that sidestep some of these issues in Section 14.4, but we make an important observation in the meantime: a hallmark of clustering is that whenever a new approach manages to overcome a previously-identified difficulty, it usually does so at the cost of introducing holes in hitherto sound aspects. We may thus not be able to ever find a “best” clustering approach/outcome (an updated take on the No Free Lunch theorem, perhaps? [239]), but is it possible to identify which of several clustering scheme is “optimal” (or best-suited for an eventual task)?

14.3.1 Clustering Assessment

In machine learning, clustering is defined as grouping objects based on their over-all similarity (or dissimilarity) to each other.258 It can be tempting to focus on just one or two attributes (especially for visual or “eyeball” clustering), but keep in mind that even if we were to focus on one or two particular attribute, all of the other attributes must still come along for the ride.

For instance, consider the objects shown below.

What is the same about these objects? What is different? Do they belong in the same group? If so, how many?

As a start, they are all pictorial representations of apples;259 they all possess stems, and appear to be of similar size. On the other hand, two of them are (mostly) red while the other is green(ish); one of the stem has a leaf while the other two do not, and one of them is spherical, while the other two are “tapered” at the bottom.

While we do recognize them all as apples, we could make the argument that each of them is unlike the other two (and thus also that each of them is similar to exactly one of the other two).

Fruit Image Dataset

In order to appreciate the challenges presented by clustering validation, it will be helpful to relate the concepts to something tangible. We will explore some of these notions through an artificial dataset of 20 fruit images (see Figure 14.8):

  • are there right or wrong groupings of this dataset?

  • are there multiple possible ‘natural’ clusterings?

  • could different clusterings be used differently?

  • will some clusterings be of (objectively) higher quality than others?

Toy dataset with which the key concepts of clustering validation will be  illustrated.

Figure 14.8: Toy dataset with which the key concepts of clustering validation will be illustrated.

Key Notions

At a fundamental level, clustering relies on the notion of representativeness; ideally, the essence of instances (observations) in a cluster would be faithfully captured by the cluster concept (examplar, representative), and differentiated from those of other clusters by the same token.

For instance, the image below is a concept for “apples”:

as is the Mirriam-Webster definition:

“The fleshy, usually rounded red, yellow, or green edible pome fruit of a usually cultivated tree (genus Malus) of the rose family.”

Of course, this is not all that an apple is, but most people who have seen or eaten an apple at some point in the past will have no trouble recognizing what is being alluded to by the concept, even if the corresponding mental image differs from one person to the next.

The cluster concepts offer a generalized representation – they capture “something” of their corresponding cluster’s instances. For a given cluster, then, can we clearly identify a concept capturing its (and solely its) essence? If so, does that make the entire clustering scheme a good one?

For machine learning purposes, we use signature vectors to represent instance properties. Each vector element represents an instance attribute; the element’s value is the measured value of the corresponding object’s property (for instance, the colour of the apple).

The apple below, as an example, could perhaps be described by the signature vector \[(12,9.12,\text{tapered},\text{golden delicious}),\] where the components are the instance’s colour (ordinal), height (continuous), shape and variety (both categorical).260

Signature vectors are used to compare objects (instance-to-instance relationships); such comparisons could yield, among others, a measure of similarity between instances.

While similarity can be measured against a single dimension (component), the comparisons of interest for clustering task require an overall similarity measure, across all dimensions. We would compare the two apples below, say, by comparing their signature vectors \[\begin{aligned} \mathbf{v}_1&=(12,9.12,\text{tapered},\text{golden delicious}) \\ \mathbf{v}_2&=(2,10.43,\text{spherical},\text{macintosh}) \end{aligned}\] with the help of some similarity measure \(w(\mathbf{v}_1,\mathbf{v}_2)\).261

As we have discussed in Section 14.1, the use of a distance measure (or metric) is a popular strategy for defining how similar (or dissimilar) two objects are to each other. Importantly, a distance takes into account all of the properties of the objects in question, not just a few of them. Traditionally, only numeric attributes are allowed as input (see Module 16 for an in-depth discussion of distance metrics), but it is technically possible to convert categorical attributes to numeric ones, or to define mixed distances.262

In the clustering framework, we are often interested in all pairwise similarities between objects, not just in the similarity between two objects, which is to say that pairs of objects are not solely interesting in and of themselves, but also in relation to other pairs of objects.

In a dataset with 4 objects, for instance, we might require the computation of (up to) 6 pairwise similarities (as shown below).

As is the case with objects, clusters also have properties. These could include:

  • the number of instances in a cluster;

  • the similarity measures across instances within a cluster (minimum, maximum, average, median, standard deviation, mean absolute deviation, variance, etc.);

  • the cluster representative, which may be an actual instance, or an amalgamation of multiple instances (exemplar).

How many instances are there in the cluster below, for instance? What pair of observations is most similar? The least similar? What are the similarity values? Which instance is most representative?

We can also define cluster-to-instance relationships. A specific instance can be:

  • compared to a cluster representative, and/or

  • compared to specific instances in a cluster (as in instance-to-instance relationships), such as the most similar instance or the most distant instance.

This allows for membership questions: is the green apple similar to the cluster below? Does it belong in the cluster, or is it most likely to belong to another cluster? Or perhaps to no cluster in particular?

Finally, we might be interested in cluster-to-cluster relationships, where we compare cluster-level properties, such as:

  • number of instances;

  • within-cluster similarities;

  • cluster representatives.

To these, we can also add between-cluster (or across-cluster) similarities, as a way to determine if the instances are notably different from one cluster to the next. This allows for validity questions: are the two clusters below significantly different? Should they be joined into a mega-cluster? Does it make sense to have them as separate clusters in the dataset?

How would we qualify the clustering outcome of Figure 14.9, for instance, as it relates to colour, height, and shape? Could there be clusterings of higher quality? Of lower quality? How could this be quantified?

Two clusters in a subset of the fruit image toy dataset.

Figure 14.9: Two clusters in a subset of the fruit image toy dataset.

Cluster and instance comparisons can be combined in many different ways, which can then be used to generate a vast number of clustering validation functions. The central cluster validation question becomes: what can these tell us about the quality of a particular clustering outcome relative to some objective criteria about “good” clustering schemes (internal validation), to another clustering option (relative validation), or to external information (external validation)?

Clustering Quality Measures

In general, clustering involves two main activities:

  • creating/building the clusters, and

  • assessing their quality, individually and as a whole.

From a practical perspective, clustering requires two functions: one which assigns each instance to a cluster,263, and one which assigns each clustering scheme to a cluster quality measurement.264

An illustration is provided below, on an artificial dataset containing two variables, with dissimilarity between instances given by the corresponding Euclidean distance:

We obtain three different clustering schemes, and their quality is assessed with the help of some clustering validation function:265

  • top – two clusters are found in the data (with outliers), and the quality of the clustering is assessed as 0.61;

  • middle – three clusters are found (no outliers), with quality assessment at 0.41;

  • bottom – two clusters are found (no outliers), with quality assessment at 0.53.

With this choice of clustering validation function, the top scheme would be prefered, followed by the bottom scheme; the middle one brings up the rear. We have already mentioned the abundance of clustering algorithms [145]; it will come as no surprise that a tremendous number of clustering validation function in practice as well (for much the same reasons as those discussed in Section 14.1.2).

They are, however, all built out of the basic measures relating to instance or cluster properties we have already reviewed, as illustrated schematically in Figure 14.10:

  • instance properties;

  • cluster properties;

  • instance-to-instance relationship properties;

  • cluster-to-instance relationship properties, and

  • cluster-to-cluster relationship properties,

Schematic illustrations of various instance/cluster properties and relationships.

Figure 14.10: Schematic illustrations of various instance/cluster properties and relationships.

Internal Validation

Context is quite relevant to the quality of a given clustering result. But what if no context is readily available? Internal validation methods attempt to measure cluster quality objectively, without context.266

We could elect to validate the clustering outcome using only the properties of the obtained clusters, such as, say, the distance between all clusters: if the average between-cluster distance is large, we might feel that there is some evidence to suggest that the resulting clusters provide a good segmentation of the data into natural groups (as at in the image below).

Alternatively, we might validate cluster quality by tempering the average between-cluster distance against the average within-cluster distance between the instances, which would reward “tight” and “isolated clusters”, as opposed to simply “isolated” cones, as below.

There are multiple ways of including both between-cluster and within-cluster similarities in a cluster quality metric (CQM): both the Davies-Bouldin index and Dunn’s index do so, to name but two examples. The broad objectives of clustering remain universal: instances within a cluster should be similar; instances in different clusters should be dissimilar.

The problem is that there are many ways for clusters to deviate from this ideal: in specific clustering cases, how do we weigh the “good” aspects (such as high within-cluster similarity) relative to the “bad” ones (such as low between-cluster separation)?

Other internal properties and relationships can also be used and combined in various ways, leading to an embarrassment of riches when it comes to CQMs. The following indices are all available in the R package clusterCrit, for instance [230]:

  • Ball-Hall

  • Banfeld-Raftery

  • C

  • Calinski-Harabasz

  • Davies-Bouldin

  • Det Ratio and LogDetRatio

  • Dunn

  • Baker-Hubert Gamma

  • GDI

  • Gplus

  • KsqDetW

  • McClain-Rao

  • PBM

  • Point-Biserial

  • Ratkowsky-Lance

  • Ray-Turi

  • Scott-Symons

  • SD

  • SDbw

  • Silhouette

  • Tau

  • TraceW and TraceWiB

  • Wemmert-Gancarski

  • WSS and LogSSRatio

  • Xie-Beni


While it is useful to have access to so many CQMs, we should remain aware that the No-Free Lunch theorem is still in play: none of them is universally superior to any of the others.267

In practice, certain approaches (weightings) may prove more relevant given the eventual specific analysis goals, but what these could prove to be is not always evident from the context; consequently, we recommend assessing the quality of the clustering outcomes using a variety of CQMs. Studies have been performed to compare a large number of internal validation measures, relative to one another and against human evaluation; generally speaking, variants of the silhouette index performed reasonably well (but see previous NFLT comment) [271], [272].

One possible interpretation of these results is that internal validation via CQMs may be providing information about clustering across all contexts, and that it is usually easier to identify clustering outcomes that clearly miss the mark than it is to objectively differentiate amongst “good” outcomes, for reasons that are not fully understood yet.

Details on the CQMs are readily available from a multitude of sources (see [4], [271], [272], among others); we provide more information for 4 commonly-used CQMs:

  • (within) sum of squared error;

  • Davies-Bouldin;

  • Dunn, and

  • silhouette;

Let \(\mathcal{C}=\{C_1,\ldots,C_K\}\) be the \(K\) clusters obtained from a dataset \(\mathbf{X}\) via some algorithm \(\mathcal{A}\). Denote the centroid (or some other central representative) of cluster \(C_k\) by \(\mathbf{c}_k\). The (within) sum of error for \(\mathcal{C}\) is \[\text{WSE}=\sum_{k=1}^K\sum_{\mathbf{x}\in C_k}\text{dissimilarity}(\mathbf{x},\mathbf{c}_k).\]

The dissimilarity is often selected to be the square of the Euclidean distance (thence “sum of squared error”), but the choice of the Euclidean distance (and of the square exponent) is arbitrary – it would make more sense, in practice, to use a distance metric related to the similarity measure used by \(\mathcal{A}\).

Note that \(\text{WSS}\) decreases with the number of clusters \(K\), and optimality is obtained at points of diminishing returns (see the next section for details); from a validation perspective, it might be easier to interpret the (within) average error: \[\text{WAE}=\text{Avg}_{k=1}^K\left\{\text{Avg}_{\mathbf{x}\in C_k}\left\{\text{dissimilarity}(\mathbf{x},\mathbf{c}_k)\right\}\right\}:=\text{Avg}_{k=1}^K\left\{s_k\right\}.\]

With \(s_k\), \(k=1,\ldots, K\) as above, the Davies-Bouldin index (DBI) is defined as \[\text{DBI}=\frac{1}{K}\sum_{k=1}^K\max_{\ell\neq k}\left\{\frac{s_k+s_\ell}{\text{dissimilarity}(\mathbf{c}_k,\mathbf{c}_\ell)}\right\}.\] If the clusters \(\{C_k\}\) are tight and dissimilar to one another, we expect the numerators \(s_k+s_{\ell}\) to be small and the denominators \(\text{dissimilarity}(\mathbf{c}_k,\mathbf{c}_\ell)\) to be large, so that the DBI would be small.

Conversely, with clusters that are loose or somewhat similar to one another, we expect the DBI to be large. There are other ways to assess separation and tightness; Dunn’s index is the ratio of the smallest between-cluster dissimilarity (for pairs of observations not in the same cluster) to the largest cluster diameter (within-cluster dissimilarity).

If the clusters \(\{C_k\}\) are tight and dissimilar to one another, we expect the smallest between-cluster dissimilarity to be large and the largest cluster diameter to be small, leasing to a large Dunn ratio.

Conversely, with clusters that are loose or somewhat similar to one another, the Dunn ratio will be small. As is the case with the sum of errors and the DBI, the choice of the dissimilarity (or distance) measure leads to different variants of the Dunn index, but all of them take non-negative values.

The three previous CQMs provide examples of validation metrics that are computed for the entire clustering outcome; the silhouette metric instead assigns a score to each observation, and aggregates the scores at a cluster level and at the dataset level: for a dissimilarity \(d\) and a point \(\mathbf{x}\) in a cluster \(C\), set \[b(\mathbf{x})=\min_{C'\neq C}\left\{\text{Avg}_{\mathbf{y}\in C'}\left\{d(\mathbf{x},\mathbf{y})\right\}\right\},\quad a(\mathbf{x})=\text{Avg}_{\substack{\mathbf{w}\in C \\ \mathbf{w}\neq\mathbf{x}}}\left\{d(\mathbf{x},\mathbf{w})\right\}.\]

Small values of \(a(\mathbf{x})\) imply that \(\mathbf{x}\) is similar to the instances in its cluster, large values of \(b(\mathbf{x})\) imply that it is dissimilar to instances in other clusters.

The silhouette metric at \(\mathbf{x}\) is \[\begin{aligned} \text{silhouette}(\mathbf{x})&=\frac{b(\mathbf{x})-a(\mathbf{x})}{\max\{a(\mathbf{x}),b(\mathbf{x})\}} =\begin{cases} 1-a(\mathbf{x})/b(\mathbf{x}) & \text{if $a(\mathbf{x})<b(\mathbf{x})$} \\ 0 & \text{if $a(\mathbf{x})=b(\mathbf{x})$} \\ b(\mathbf{x})/a(\mathbf{x})-1 & \text{if $a(\mathbf{x})>b(\mathbf{x})$}\end{cases} \end{aligned}\]

Consequently, \(-1\leq \text{silhouette}(\mathbf{x}) \leq 1\) for all \(\mathbf{x}\). Thus, when \(\text{silhouette}(\mathbf{x})\) is large (\(\approx 1\)), the ratio \(a(\mathbf{x})/b(\mathbf{x})\) is small and we interpret \(\mathbf{x}\) to be correctly assigned to cluster \(C\) (and conversely, when \(\text{silhouette}(\mathbf{x})\) is small (\(\approx -1\)), we interpret \(\mathbf{x}\)’s assignment to \(C\) to be “incorrect”).

The silhouette diagram corresponding to the clustering results displays the silhouette scores for each observations, and averages out the scores in each cluster. The mean over all observations (and the number of instances that have been poorly assigned to a cluster) gives an indication of the appropriateness of the clustering outcome.

In the example below, 65 observations are separated into 5 clusters: 6 observations are poorly assigned (those with negative silhouette scores) and the average silhouette score over the entire dataset is 0.2, which suggests that the clustering is more successful than unsuccessful, overall.

Note, however that it could prove difficult to associate intrinsic meaning to a lone numerical score, in general – there could be clustering contexts where 0.2 is considered to be a fantastic silhouette score, and others where it is viewed as an abject failure. It is in comparison to the scores obtained by different clustering schemes that this score (and those of the other CQMs) can best serve as an indicator of a strong (or a poor) clustering outcome.

Relative Validation

Obtaining a single validation measure for a single clustering outcome is not always that useful – how would we really characterize the silhouette score of 0.2 in the previous example? Could the results be better? Is this the best that can be achieved?

One approach to this problem is to compare clustering outcomes across multiple runs to take advantage of non-deterministic algorithms or various parameters’ values (see image below, and schematics in Figure 14.11).

If the outcome of different clustering algorithms on the same dataset are “similar”, this provides evidence in favour of the resulting clusters being efficient, natural, or valid, in some sense.

Schematics of relative clustering validation.

Figure 14.11: Schematics of relative clustering validation.

Consider, for instance, a dataset with 6 instances, which is clustered in two different manners (\(\mathcal{A}\) and \(\mathcal{B}\), with \(|\mathcal{A}|=3\) and \(|\mathcal{B}|=2\)), as shown below. The clusterings are clearly not identical; how similar are they?

One way to approach relative validation for two outcomes is by computing “correlations” between cluster outcomes. Intuitively, we would expect the similarity between both clustering schemes to be high, seeing as the two outcomes are not that different from one another: In \(\mathcal{B}\), the two smallest clusters of \(\mathcal{A}\) have been merged into a single, larger cluster.]

This can be represented in the form of matrices:

How can this be quantified? Correlation measures include the Rand, Jaccard (see Anomaly Detection and Outlier Analysis](#ADOA)), and Gamma (see [273]) indices. Let \(\mathcal{A}=\{A_1,\ldots,A_k\}\) and \(\mathcal{B}=\{B_1,\ldots,B_{\ell}\}\) be two clusterings of a dataset \(\mathbf{X}\). If \(\mathbf{X}\) consists of \(n\) instances, there are thus \[\binom{n}{2}=\frac{n(n-1)}{2}\] pairs of observations in the data.

Denote the number of pairs of observations that are in:

  • the same cluster in \(\mathcal{A}\) and in the same cluster in \(\mathcal{B}\) by \(\text{ss}\),

  • different clusters in \(\mathcal{A}\) and different clusters in \(\mathcal{B}\) by \(\text{dd}\);

  • the same cluster in \(\mathcal{A}\) but in different clusters in \(\mathcal{B}\) by \(\text{sd}\), and

  • different clusters in \(\mathcal{A}\) but in the same cluster in \(\mathcal{B}\) by \(\text{ds}\).

Thus, \[\binom{n}{2}=\text{ss}+\text{dd}+\text{sd}+\text{ds},\] and we define the Rand index of \(\mathcal{A}\) and \(\mathcal{B}\) as \[\text{RI}(\mathcal{A},\mathcal{B})=\frac{\text{ss}+\text{dd}}{\text{ss}+\text{dd}+\text{sd}+\text{ds}}=\frac{\text{ss}+\text{dd}}{\binom{n}{2}}.\]

Large values of the index are indicative of similarty of clustering outcomes.268 Unfortunately, the Rand index does not satisfy the constant baseline property.269

The adjusted Rand index (as well as several other pair-counting, set-matching, and information theoretic measures) relies on the contingency table between \(\mathcal{A}\) and \(\mathcal{B}\), a method to summarize the outcomes of two clustering results on the same dataset:

\[\begin{array}{c|ccc|c} & B_1 & \cdots & B_\ell & \text{Total} \\ \hline A_1 & n_{1,1} & \cdots & n_{1,\ell} & \mu_1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ A_k & n_{k,1} & \cdots & n_{k,\ell} & \mu_k \\ \hline \text{Total} & \nu_1 & \cdots & \nu_\ell & n \end{array}\]

In this array, \(n_{i,j}=|A_i\cap B_j|\) represents the number of instances that are found in both the cluster \(A_i\in \mathcal{A}\) and \(B_j\in \mathcal{B}\). The adjusted Rand index \((\in [-1,1])\) is given by \[\text{ARI}(\mathcal{A},\mathcal{B})=\frac{ \left. \sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{\mu_i}{2} \sum_j \binom{\nu_j}{2}\right] \right/ \binom{n}{2} }{ \left. \frac{1}{2} \left[\sum_i \binom{\mu_i}{2} + \sum_j \binom{\nu_j}{2}\right] - \left[\sum_i \binom{\mu_i}{2} \sum_j \binom{\nu_j}{2}\right] \right/ \binom{n}{2} },\] which can also be re-written as \[\text{ARI}(\mathcal{A},\mathcal{B})=\frac{2(\text{ss}\cdot\text{dd}-\text{sd}\cdot\text{ds})}{(\text{ss}+\text{sd})(\text{ds}+\text{dd})+(\text{ss}+\text{ds})(\text{sd}+\text{dd})}.\]

It is straightforward to compute RI and ARI for the two clusterings of the artificial example with 6 instances. Since \(n=6\), there are \(6\cdot 5/2=15\) pairs of observations in the data, and we have:

  • \(\text{ss}=4\) (\(x_1\) and \(x_3\); \(x_1\) and \(x_4\); \(x_3\) and \(x_4\); \(x_2\) and \(x_5\));

  • \(\text{ds}=2\) (\(x_2\) and \(x_6\); \(x_5\) and \(x_6\));

  • \(\text{sd}=0\) (none);

  • \(\text{dd}=9\) (all remaining pairs).

Thus, \[\begin{aligned} \text{RI}(\mathcal{A},\mathcal{B})&=\frac{4+9}{4+9+0+2}=\frac{13}{15}=0.87 \\ \text{ARI}(\mathcal{A},\mathcal{B})&=\frac{2(4\cdot9-0\cdot2)}{(4+0)(2+9)+(4+2)(0+9)}=0.73.\end{aligned}\]

Both of these values are indicative of high similarity between the clustering outcomes \(\mathcal{A}\) and \(\mathcal{B}\).

Cluster stability can also be used to assess the appropriateness of the choice of algorithm for the data. Assume that we apply a clustering algorithm \(\mathcal{G}\) to a dataset \(\mathbf{X}\), resulting in a clustering scheme \(\mathcal{C}=\{C_1,\ldots, C_K\}\).

In general, a dataset \(\mathbf{X}\) is a realization of a (potentially unknown) underlying data generating mechanism related to the situation of interest; a different realization \(\mathbf{X}'\), which could arise from the collection of new data, may yield a distinct clustering scheme \(\mathcal{C}'\). How stable is the clustering outcome of \(\mathcal{G}\), relative to the realization of \(\mathbf{X}\)?

We can get a sense for the underlying stability by sampling multiple row subsets from \(\mathbf{X}\). Alternatively, we could also sample from \(\mathbf{X}\)’s columns, or sample columns from a variety of sampled rows of \(\mathbf{X}\); however this is achieved, we have obtained \(\ell\) subsets \(\mathbf{X}_1,\ldots, X_{\ell}\subseteq \mathbf{X}\), on which we apply the algorithm \(\mathcal{G}\), with parameters selection \(\mathcal{P}\), yielding the corresponding clustering schemes \(\mathcal{C}_1, \ldots, \mathcal{C}_\ell\), as below.

Let \(\mathcal{W}\) be the similarity matrix for the various clustering schemes: \[\mathcal{W}=\begin{pmatrix}w(\mathcal{C}_1,\mathcal{C}_1) & \cdots & (\mathcal{C}_1,\mathcal{C}_\ell) \\ \vdots & \ddots & \vdots \\ w(\mathcal{C}_\ell,\mathcal{C}_1) & \cdots & (\mathcal{C}_\ell,\mathcal{C}_\ell) \end{pmatrix};\] this \(\mathcal{W}\) can be used as the basis of a meta-clustering scheme in which \(\mathcal{C}_1,\ldots,\mathcal{C}_\ell\) play the role of instances, using a graph-based meta-clustering method such as DBSCAN (which we will discuss in Section @(CLUST-OCA-dbscan)). If the clustering results obtained from \(\mathbf{X}\) by applying \(\mathcal{G}\) are stable, we might expect the meta-clustering results to yield a single meta-cluster.

If multiple meta-clusters are found from \(\mathcal{W}\), this supports the position that \(\mathcal{G}\) does not produce stable clusters in \(\mathbf{X}\), although this does not necessarily imply instability as the number of meta-clusters could be an artefact related to the choice of the meta-clustering method. This approach seems simple in theory, but in practice it often simply pushes the issues and challenges ofclustering to the meta-clustering activity; a more sophisticated treatment of the problem of cluster stability assessment is presented in [275].

External Validation

In everyday applications, we tend to gauge clustering results against some (often implicit) external expectation (or standard): we cannot help but bring in outside information to evaluate the clusters.

The outside information is typically what is deemed to be the ‘correct’ groups to which the instances belong, which is starting to look an awful lot like classification, a supervised learning approach. In the example below, for instance, we might be interested in finding natural groups in the dataset of shapes, but we might hold the pre-conceived notion that the natural groups are linked to the underlying shape (square, triangle, circle).

If the outcome agrees with this (external) notion, we would naturally feel that the clustering was successful; if, instead the outcome seems to pick up something about the sharpness of the shapes’ vertices, say (as in the image below), we might conclude that the algorithm does not a good job of capturing the essential nature of the objects, or, more rarely, that we need to revisit our pre-conceived notions about the dataset and its natural groups.

This validation strategy is often used to build confidence in the overall approach, based on preliminary or sample results, but it rests on a huge assumption (which often conflicts with the unsupervised learning framework), namely, that the natural groups in the data are identifiable, explicitly or implicitly. Due to the conceptual similarity to classification,270, external validation measures often resemble classification performance measures.

Case in point, the purity metric. In the presence of an external classification variable, this metric assign a label to a cluster \(C\) according to the most frequent classes of the instances in \(C\), and \[\text{ purity}(C)=\frac{\text{# correctly assigned points in } C}{|C|},\] as in the example below:

The clustering purity score for \(\mathcal{C}=\{C_1,\ldots,C_K\}\) is obtained as the average of the cluster purity scores, or as a weighted average of purity scores: \[\text{weighted purity}(\mathcal{C})=\frac{1}{n}\sum_{\ell=1}^K|C_\ell|\cdot \text{purity}(C_\ell),\] where \(n\) represents the number of instances in the data.

In the image above, the green cluster is labeled as the square cluster (since 4 of its 6 instances are classified as squares), and the blue cluster is labeled as the circle cluster (since 5 of its 7 instances are classified as circles). At the cluster level, the purity scores are thus: \[\text{purity}(C_\square)=\frac{2}{3},\quad \text{purity}(C_\bigcirc)=\frac{5}{7};\] the average and weighted purity scores are \[\begin{aligned} \text{average purity}(\mathcal{C})&=\frac{1}{2}\left(\frac{2}{3}+\frac{5}{7}\right)=69.0\% \\ \text{weighted purity}(\mathcal{C})&=\frac{1}{6+7}\left(6\cdot \frac{2}{3}+7\cdot\frac{5}{7}\right)=69.2\%.\end{aligned}\]

These two numbers are very nearly identical because the clusters have roughly the same size; if the size variance is large, the two measurements would be quite different. The purity is an obvious analogue to accuracy; other measures based on precision and recall are also popular [276].

Useful external quality metrics take advantage of the natural classes (if they are aligned with the clustering results), and take into account cluster homogeneity (first row, below), completeness, (second row), noisy and outlying data (third row), and size vs. quantity considerations (last row): the preferred behaviour is shown on the right [276].


Let us illustrate some of these notions using various \(k-\)means and hierarchical clusters of the Gapminder data used in the previous sections. In all instances, we use Euclidean distance on the scaled dataset.

Internal Validation

We use the R packages cluster, fpc, and clusterCrit to compute 3 CQMs: the Dunn index, the average silhouette width, and the Calinski-Harabasz index (the ratio of the sum of between-clusters dispersion to the inter-cluster dispersion for all clusters; higher is better). We start by clustering the data using \(4-\)means; we then use hierarchical clustering with complete linkage and 3 clusters (the global structure has already been computed).

set.seed(123) # for replicability
kmeans.4 = kmeans(gapminder.SoCL.2011.s,4,iter.max=2509,nstart=1)
stats.kmeans.4 <- as.numeric(clusterCrit::intCriteria(as.matrix(gapminder.SoCL.2011.s),

dist.all <- cluster::daisy(gapminder.SoCL.2011.s,metric="euclidean",stand=FALSE)
hc.1.3 <- cutree(hclust.gapminder.SoCL.2011, k = 3)
stats.hc.1.3 <- c(fpc::cluster.stats(dist.all, clustering=hc.1.3, silhouette = TRUE)$dunn, 
                  fpc::cluster.stats(dist.all, clustering=hc.1.3, silhouette = TRUE)$avg.silwidth,
                  fpc::cluster.stats(dist.all, clustering=hc.1.3, silhouette = TRUE)$ch)

The results are summarized below:

stats <-rbind(stats.kmeans.4,stats.hc.1.3)
colnames(stats) <- c("Dunn","Silhouette","Calinski-Harabasz")
method Dunn Avg. Sil. C.-H.
\(4-\)means 0.097 0.315 139.0
HC(comp; 3) 0.091 0.274 125.4

Both of the Dunn values are low, although the \(4-\)means result is marginally superior; while the average silhouette widths are also low, they are least positive in both cases, with a slight advantage in favour of \(4-\)means; the Calinski-Harabasz values are not very indicative on their own, but once again, \(4-\)means comes out ahead of HC.

The average silhouette width is an interesting metric. On the one hand, we attempt to gauge the quality of the entire clustering with a single number, but the average is a fickle summary measurement and potentially affected by outlying values; on the other, we do have access to a silhouette score for each observation, and can thus get a better idea of the performance by studying the silhouette profile. We show the silhouette results for hierarchical clustering with complete linkage for 4 (average width\(=0.23\)) and 6 clusters (average width\(=0.22\)).

plot(cluster::silhouette(cutree(hclust.gapminder.SoCL.2011, k = 4), dist.all))

plot(cluster::silhouette(cutree(hclust.gapminder.SoCL.2011, k = 6), dist.all))

While the average silhouette width would seem to favour the 4-cluster result, the profile for the 6-cluster result seems more in-line with desirable properties: in both instances, some observations are “mis-clustered” (negative silhouette scores), but these seem to be more broadly distributed in the latter case.271

Relative Validation

We compute the Rand index (RI) and the adjusted Rand index (ARI) to compare the outcomes of a single run of \(2-\)means (\(\mathcal{A}_2\)), \(3-\)means (\(\mathcal{A}_3\)), and \(4-\)means (\(\mathcal{A}_4\)), respectively.

set.seed(1) # for replicability
kmeans.2 = kmeans(gapminder.SoCL.2011.s,2,iter.max=250,nstart=1)
kmeans.3 = kmeans(gapminder.SoCL.2011.s,3,iter.max=250,nstart=1)
kmeans.4 = kmeans(gapminder.SoCL.2011.s,4,iter.max=250,nstart=1)

We can compute the Rand index and the adjusted Rand index using the following function:

# create a matrix of 1s and 0s depending as to whether observations 
# i and j are in the same cluster or not
mat2=floor(1 - abs(sqrt(w2%*%t(w2))) %% 1)

mat3=floor(1 - abs(sqrt(w3%*%t(w3))) %% 1)

mat4=floor(1 - abs(sqrt(w4%*%t(w4))) %% 1)

# build the rand index from these matrices
randindices  <- function(W1,W2) {
  diag(W1) <- -1
  diag(W2) <- -1

# compute RI and ARI for the 3 comparisons

         X1        X2
1 0.7327156 0.5151994
         X1        X2
1 0.6407104 0.3285053
         X1      X2
1 0.7548705 0.45843

There are \(\binom{184}{2}=16836\) pairs of distinct observations in the 2011 Gapminder dataset; the full pair types and indices break down as below:

Schemes ss dd sd ds RI ARI
\(\mathcal{A}_2,\mathcal{A}_3\) 5304 7032 3852 648 0.73 0.52
\(\mathcal{A}_2,\mathcal{A}_4\) 4395 6392 4761 1288 0.64 0.33
\(\mathcal{A}_3,\mathcal{A}_4\) 3754 8955 2198 1929 0.75 0.46

\(\mathcal{A}_2, \mathcal{A}_3\) are relatively similar according to RI, as are \(\mathcal{A}_3, \mathcal{A}_4\), but the ARI suggests that \(\mathcal{A}_2, \mathcal{A}_3\) are more similar to one another than \(\mathcal{A}_3, \mathcal{A}_4\) are; \(\mathcal{A}_2, \mathcal{A}_4\) are not as similar, according to both indices, which is not that surprising as the number of clusters in this case jumps from 2 to 4.

External Validation

Finally, we compare the clustering results of hierarchical clustering, for 4 and 8 clusters, with a variety of external grouping: 6 world regions, as determined by the Gapminder Foundation, and OECD/G77 membership (see Figure 14.12).

6 world regions (left): America (yellow, 33 countries), East Asia Pacific (red, 26), Europe Central Asia (orange, 49), Middle East North Africa (green, 20), South Asia (turquoise, 8), Sub Saharan Africa (blue, 48); memberships (right): OECD (green, 30), G77 (purple, 128), other (red, 26); bubble size represents population [@DV_Gapminder].6 world regions (left): America (yellow, 33 countries), East Asia Pacific (red, 26), Europe Central Asia (orange, 49), Middle East North Africa (green, 20), South Asia (turquoise, 8), Sub Saharan Africa (blue, 48); memberships (right): OECD (green, 30), G77 (purple, 128), other (red, 26); bubble size represents population [@DV_Gapminder].

Figure 14.12: 6 world regions (left): America (yellow, 33 countries), East Asia Pacific (red, 26), Europe Central Asia (orange, 49), Middle East North Africa (green, 20), South Asia (turquoise, 8), Sub Saharan Africa (blue, 48); memberships (right): OECD (green, 30), G77 (purple, 128), other (red, 26); bubble size represents population [238].

We start by importing the external groupings:

gapminder.regions = read.csv("Data/gapminder_regions.csv",stringsAsFactors=TRUE)
colnames(gapminder.regions)[1] <- c("geo")

Then, we cluster the data using HC with complete linkage, for \(k=4\) and \(k=8\) clusters (again using Euclidean dissimilarity). Recall that the global dendrogram structure was originally stored in hclust.gapminder.SoCL.2011.

hc.4.ce <- as.factor(cutree(hclust.gapminder.SoCL.2011, k = 4)) # complete, Euclidean
hc.8.ce <- as.factor(cutree(hclust.gapminder.SoCL.2011, k = 8)) # complete, Euclidean
geo = names(hc.4.ce)
external.results <- merge(gapminder.regions,clusters, by="geo")

This is what the external.results table looks like (the first 6 entries):

geo name four_regions eight_regions six_regions members_oecd_g77 hc.4.ce hc.8.ce
afg Afghanistan asia asia_west south_asia g77 1 1
ago Angola africa africa_sub_saharan sub_saharan_africa g77 1 1
alb Albania europe europe_east europe_central_asia others 3 3
are United Arab Emirates asia asia_west middle_east_north_africa g77 3 3
arg Argentina americas america_south america g77 2 2
arm Armenia europe europe_east europe_central_asia others 3 3

The clusters are then labeled with their most frequent cluster assignment, which can be extracted with the following function:

mode <- function(x) { names(which.max(table(x))) }
tab <- external.results %>% group_by(hc.4.ce) %>% 
  summarise(mode = mode(six_regions), n=n()) 
n.mode <- external.results %>% group_by(hc.4.ce) %>% count(six_regions) %>% 
  summarise(n.mode = max(n))
info <- merge(tab,n.mode, by="hc.4.ce")[,2:4]

Are there any reasons to suspect that the clusters would be aligned with these external classes? For the 6 world regions classes, the clusters labels (the most frequent class per cluster) for \(\text{HC}(4)\) are shown below:

Cluster Label Size Frequency
1 Sub Saharan Africa 35 31
2 East Asia Pacific 24 9
3 Europe Central Asia 94 42
4 Sub Saharan Africa 31 14
info[,4] <- info$n.mode/info$n
(purity <- mean(info$V4))
(weighted.purity <- weighted.mean(info$V4,info$n))
[1] 0.5397839
[1] 0.5217391

This clustering scheme yields a purity of 0.54 and a weighted purity of 0.52 – the overall score is not great, but the Sub Saharan countries are fairly well captured by clusters 1 and 4.

We repeat the same process for \(\text{HC}(8)\) (the code is omitted). The clusters labels in that case are found below:

Cluster Label Size Frequency
1 Sub Saharan Africa 35 31
2 America 17 6
3 Europe Central Asia 65 34
4 East Asia Pacific 7 3
5 america 29 10
6 Sub Saharan Africa 18 7
7 East Asia Pacific 10 5
8 Sub Saharan Africa 3 3

This now yields a purity of 0.55 and a weighted purity of 0.54; which is still …. not that great. Perhaps the clusters have little to do with geography, in the end.

Are they aligned with OECD/G77/other membership? The labels for \(\text{HC}(8)\) with this external grouping are found below:

Cluster Label Size Frequency
1 G77 27 27
2 G77 29 22
3 OECD 28 17
4 G77 20 18
5 G77 12 11
6 OECD 23 11
7 G77 25 24
8 G77 20 10

The purity values are 0.77 and 0.76, respectively: these are better values than the previous external validation attempts, but they might not really be meaningful given the preponderance of G77 countries in the data.

It seems, then, that neither of the external classifications seems to be a good gauge of cluster validity.

For the most part, the cluster validation yields middling results. The few algorithms we have tried with the data suggest that there is some low-level grouping at play, but nothing we have seen so far would suggest that the data segments are all that “natural.”

While this result is somewhat disappointing, we should note that this is often the case with real-world data: there is no guarantee that natural groups even exist in the data. However, we have not been directing our choices of algorithms and parameters – up to now, they have been made fairly arbitrary. Can anything be done to help with model selection?

14.3.2 Model Selection

How do we pick the number of clusters and the various other parameters (including choice of to use for “optimal” results? A common approach is to look at all the outcomes obtained from various parameter runs and replicates (for a given algorithm), and to select the parameter values that optimize a set of QCMs, such as Davies-Bouldin, Dunn, CH, etc.

Optimization is, of course, dependent on each QCM’s properties: in some cases, we are searching for parameters that maximizes the index, in other cases, those that minimize it, and yet in other cases, for “knees” or “change points” in various associated plots. Note, however, that the parameter values that optimize a QCM may not optimize others; when they coincide, this reinforces the support for the existence of natural groups; when they do not, they provide a smaller collection of models from which to select, removing some of the arbitrariness discussed above.]

This can also be done for clustering outcomes arising from different algorithms, although in this case we are not selecting parameter values so much as identifying the model that best describes the natural groups in the data among all results, according to some metric(s).

The metrics presented in Section 14.3.1 all provide frameworks to achieve this. There are additional approaches, such as: seeking the clustering \(\mathcal{C}=\{C_1,\ldots,C_k\}\), among a list of such outcomes, which minimizes the quadratic cost \[\Lambda_\mathbf{W}(\mathcal{C})=-\text{trace}\left(Z^{\!\top}(\mathcal{C})\mathbf{W} Z(\mathcal{C})\right),\] where \(z_{i,\ell}=1\) is \(\mathbf{x}_i\in C_\ell\) and \(0\) otherwise, associated with a similarity matrix \(\mathbf{W}\); or methods relying on stability assessment [275], [277]. Model assessment and selection remains a popular research topic.

But it remains important to remember that there is a lot of diversity in clustering validation techniques. The various types of validation methods do not always give concordant results; this variation within the types can be demoralizing at times, but it can also be leveraged to extract useful information about the data’s underlying structure.

In general, we should avoid using a single assessment method; it is preferable to seek “signals of agreement” across a variety of strategies (both in the choices of clustering algorithms and evaluation methods). Finally, remember that clustering results may just be ‘ok’ … but that is ok too! We can study the situation and decide what is important and what can safely be ignored – as always, a lot depends on the context.


How many clusters \(k\) should we seek when clustering the (scaled) 2011 Gapminder dataset using Euclidean distance? For each \(k=2, \ldots, 15\), we compute the outcome of \(m=40\) runs of \(k-\)means, and average the within sum of squares (WSS) and a (modified) Davies-Bouldin index (DBI) over the runs. The optimal number of parameters is obtained at a DBI maximum or a WSS “knee”.

We start by computing the principal components (for displaying purposes, although we could also use them to cluster the data, at the cost of some information about the dataset).

# this computes the principal component decomposition
pc.agg.data = princomp(gapminder.SoCL.2011.s)  
Importance of components:
                         Comp.1    Comp.2     Comp.3     Comp.4    Comp.5
Standard deviation     1.850624 0.9973737 0.49950729 0.45130387 0.3163521
Proportion of Variance 0.688705 0.2000380 0.05017419 0.04095763 0.0201251
Cumulative Proportion  0.688705 0.8887431 0.93891726 0.97987490 1.0000000
pc.df.agg.data = cbind(pc.agg.data$scores[,1], pc.agg.data$scores[,2])
plot(pc.df.agg.data, xlab="PC1", ylab="PC2") 
title('PCA plot of Gapminder Data - 2 Main PCs')
PCA representation of the 2011 Gapminder dataset

Figure 14.13: PCA representation of the 2011 Gapminder dataset

The Davies-Bouldin index is computed using the following formula.

Davies.Bouldin <- function(A, SS, m) {
  # A  - the centres of the clusters
  # SS - the within sum of squares
  # m  - the sizes of the clusters
  N <- nrow(A)   # number of clusters
  # intercluster distance
  S <- sqrt(SS/m)
  # Get the distances between centres
  M <- as.matrix(dist(A))
  # Get the ratio of intercluster/centre.dist
  R <- matrix(0, N, N)
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      R[i,j] <- (S[i] + S[j])/M[i,j]
      R[j,i] <- R[i,j]
  return(mean(apply(R, 1, max)))

For each \(k=2,\ldots, 15\), we run \(k-\)means \(N=40\) times (using Euclidean dissimilarity). One realization is displayed for each \(k\), as are the DBI curves and the WSS curves (with confidence bands). The clusters are displayed on the first 2 principal components of the dataset, which explain 88% of the variation in the data.

N = 40            # Number of repetitions
max.cluster = 15  # Number of maximum number of desired clusters

# initializing values
m.errs <- rep(0, max.cluster)
m.DBI <- rep(0, max.cluster)
s.errs <- rep(0, max.cluster)
s.DBI <- rep(0, max.cluster)

# clustering and plotting 
for (i in 2:max.cluster) {
  errs <- rep(0, max.cluster)
  DBI <- rep(0, max.cluster)
  for (j in 1:N) {
    # data, number of internal shifts of the cluster centres, number of clusters
    KM <- kmeans(gapminder.SoCL.2011.s,i,iter.max=2509,nstart=1)

    errs[j] <- sum(KM$withinss)
    DBI[j] <- Davies.Bouldin(KM$centers, KM$withinss, KM$size)
  m.errs[i - 1] = mean(errs)
  s.errs[i - 1] = sd(errs)
  m.DBI[i - 1] = mean(DBI)
  s.DBI[i - 1] = sd(DBI)
  plot(pc.df.agg.data,col=KM$cluster, pch=KM$cluster, main=paste(i,"clusters"), 
       xlab="PC1", ylab="PC2")   

The average within sum of squares curve and the average Davies-Bouldin curves are also provided, with 95% confidence intervals.

MSE.errs_up = m.errs + 1.96 * s.errs / sqrt(N)
MSE.errs_low = m.errs - 1.96 * s.errs / sqrt(N)

plot(2:(max.cluster), m.errs[1:(length(m.errs)-1)], main = "Within SS", xlab="", ylab="")
lines(2:(max.cluster), m.errs[1:(length(m.errs)-1)])
par(col = "red")

lines(2:(max.cluster), MSE.errs_up[1:(length(MSE.errs_up)-1)])
lines(2:(max.cluster), MSE.errs_low[1:(length(MSE.errs_low)-1)])

MSE.DBI_up = m.DBI + 1.96 * s.DBI / sqrt(N)
MSE.DBI_low = m.DBI - 1.96 * s.DBI / sqrt(N)

par(col = "black")
plot(2:(max.cluster), m.DBI[1:(length(m.DBI)-1)], main = "Davies-Bouldin", xlab="", ylab="")
lines(2:(max.cluster), m.DBI[1:(length(m.DBI)-1)])

lines(2:(max.cluster), MSE.DBI_up[1:(length(MSE.DBI_up)-1)])
lines(2:(max.cluster), MSE.DBI_low[1:(length(MSE.DBI_low)-1)])

Where is it maximized?

(i_choice <- which(m.DBI==max(m.DBI[1:(length(m.DBI)-1)]))+1)
[1] 9

The WSS curve does not yield much information, but the DBI curve suggests that both \(k=3\) and \(k=9\) could be good parameter choices. With parsimony considerations in mind, we might elect to use \(k=3\), but if the results are too simple or if there are signs of instability appear (recall that \(k-\)means is a stochastic algorithm), \(k=9\) might prove to be a better choice in the end.


C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
Wikipedia, Cluster analysis algorithms.”
B. Desgraupes, ClusterCrit: Clustering Indices. 2018.
H. Rosling, The health and wealth of nations. Gapminder Foundation, 2012.
D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, 1997.
L. Vendramin, R. J. G. B. Campello, and E. R. Hruschka, “Relative clustering validity criteria: A comparative overview,” Stat. Anal. Data Min., vol. 3, pp. 209–235, 2010.
J. M. Lewis, M. Ackerman, and V. R. de Sa, “Human cluster evaluation and formal quality measures: A comparative study,” Cognitive Science, vol. 34, 2012.
R. Yedida, Evaluating clusters.” Beginning with ML, 2019.
U. von Luxburg, “Clustering stability: An overview,” Foundations and Trends in Machine Learning, vol. 2, no. 3, pp. 235–274, 2010, doi: 10.1561/2200000008.
E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo, “A comparison of extrinsic clustering evaluation metrics based on formal constraints,” Inf. Retr., vol. 12, no. 5, p. 613, 2009.
T. Lange, M. Braun, V. Roth, and J. Buhmann, “Stability-based model selection,” Advances in Neural Information Processing Systems (NIPS 2002): 2002, Jun. 2003.