## 11.5 Clustering

Clustering is in the eye of the beholder, and as such, researchers have proposed many induction principles and models whose corresponding optimisation problem can only be approximately solved by an even larger number of algorithms. [V. Estivill-Castro, *Why So Many Clustering Algorithms?*]

### 11.5.1 Overview

We can make a variety of **quantitative statements** about a dataset at the **univariate** level. For instance, we can compute **frequency counts** for the variables, and identify **measures of centrality** (mean, mode, median) and of **dispersal** (range, standard deviation), among others.

At the **multivariate** level, the various options include **\(n-\)way tabulations**, **correlation analysis**, and **data visualization**, among others.

While these might provide insights in simple situations, datasets with a **large number of variables** or with **mixed types** (categorical and numerical) might not yield to such an approach. Instead, insights might come in the form of **aggregation** or **clustering** of similar observations.

A successful **clustering scheme** is one that tightly joins together any number of similarity profiles (“tight” in this context refers to small variability within the cluster; see Figure 11.23 for an illustration).

A typical application is one found in **search engines**, where the listed search results are the nearest similar objects (**relevant webpages**) clustered around the search item.

Dissimilar objects (**irrelevant webpages**) should not appear in the list, being “far” from the search item. Left undefined in this example is the crucial notion of **closeness**: what does it mean for one observation to be near another? Various **metrics** can be used (see Figure 11.24 for some simple examples), and not all of them lead to the same results.

Clustering is a form of **unsupervised learning** since the cluster labels (and possibly their number) are not determined ahead of the analysis.

The algorithms can be **complex** and **non-intuitive**, based on varying notions of similarities between observations, and yet, the temptation to provide a simple *a posteriori* explanation for the groupings remains **strong** – we really, really want to reason with the data.^{177}

The algorithms are also (typically) **non-deterministic** – the same routine, applied twice (or more) to the same dataset, can discover completely different clusters.^{178}

This (potential) non-replicability is not just problematic for validation – it can also lead to client dissatisfaction. If the analyst is tasked with finding customer clusters for marketing purposes and the clusters change every time the client or the stakeholder asks for a report, they will be very confused (and doubtful) unless the **stochastic nature** of the process has already been explained.

Another interesting aspect of clustering algorithms is that they often find clusters even when there are no natural ways to break down a dataset into constituent parts.

When there is no natural way to break up the data into clusters, the results may be **arbitrary** and fail to represent any underlying reality of the dataset. On the other hand, it could be that while there was no **recognized** way of naturally breaking up the data into clusters, the algorithm **discovered** such a grouping – clustering is sometimes called **automated classification** as a result.

The aim of clustering, then, is to divide the data into **naturally occurring groups**. Within each group, observations are similar; between groups, they are dissimilar (see Figure 11.25 for an illustration).

As a learning process, clustering is fairly **intuitive** for human beings – our brains unconsciously search for patterns and they can generally handle **messy** data with the same relative ease as clean data.

Computers have a harder time of it, however; part of the challenge is that there is no **agreed-upon definition of what constitutes a cluster**, and so we cannot easily code their recognition into algorithms – to paraphrase Justice Potter Stewart,

“I may not be able to define what a cluster is, but I know one when I see one.”

All clustering algorithms rely on the notion of **similarity** \(w\) between observations; in many instances, similarity is obtained *via* a **distance** (or metric) \(d\), with \(w\to 1\) as \(d\to 0\), and \(w\to 0\) as \(d\to\infty\). However, there are similarity measures which are not derived from a distance metric.

One additional clustering challenge is that there is no such thing as **the** distance or **the** similarity measure between observations – observations which are similar using a specific measure **may not be similar at all using another**. Commonly-used metrics include:

Euclidean, Manhattan, cosine, Canberra, Hamming, Jaccard, Pearson, and so on.
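A couple of these metrics can be sketched in plain Python (a minimal illustration; `gaussian_similarity` is one illustrative way, among many, to turn a distance \(d\) into a similarity \(w\), not a canonical choice):

```python
import math

def euclidean(x, y):
    # Straight-line distance between two observations.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences ("city block" distance).
    return sum(abs(a - b) for a, b in zip(x, y))

def gaussian_similarity(d):
    # One way to convert a distance d into a similarity w,
    # with w -> 1 as d -> 0 and w -> 0 as d -> infinity.
    return math.exp(-d ** 2)

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Note that the two metrics already disagree on how far apart \(p\) and \(q\) are, which is precisely why the choice of metric can change the clustering results.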

Note, however, that no matter which similarity measure is selected, the data must first be **transformed**: scaled, centered, etc. (see Figure 11.24). This introduces another layer of arbitrary choices, as there are multiple available options and no **canonical** way to perform this.
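As an example of such a transformation, standardization centers each variable on its mean and scales it by its standard deviation, so that variables measured on different scales contribute comparably to the distances (a minimal sketch; `heights_cm` is made-up data):

```python
import statistics

def standardize(column):
    # Center on the mean and scale by the (population) standard deviation,
    # producing z-scores with mean 0 and standard deviation 1.
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

heights_cm = [150, 160, 170, 180, 190]
print(standardize(heights_cm))  # symmetric z-scores centered at 0
```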

#### Applications

Frequently, we use clustering and other unsupervised learning tasks as **preliminary steps** in supervised learning problems, but there exist stand-alone applications as well:

- **text analysis** – grouping similar documents according to their topics, based on the patterns of common and unusual terms;
- **product recommendations** – grouping online purchasers based on the products they have viewed, purchased, liked, or disliked, or grouping products based on customer reviews;
- **marketing** – grouping client profiles based on their demographics and preferences;
- **social network analysis** – recognising communities within large groups of people;
- **medical imaging** – differentiating between different tissue types in a 3D voxel;
- **genetics** – inferring structures in populations;
- dividing a larger group (or area, or category) into smaller groups, with members of the smaller groups having similarities of some kind, as analytical tasks may then be solved separately for each of the smaller groups, which may lead to increased accuracy once the separate results are aggregated, or
- creating (new) taxonomies on the fly, as new items are added to a group of items, which could allow for easier product navigation on a website like Netflix, for instance.

Plenty of other applications may be found in [4], [125], [128]–[131], [222]–[228].

When all is said and done, the clustering process is quite standard, notwithstanding the choice of scaling strategy, similarity measure, and algorithm and parameters (see the pipeline shown in Figure 11.26).

### 11.5.2 Clustering Algorithms

As is the case with classification, the number of clustering algorithms is quite high; the Wikipedia page lists 40+ such algorithms as of August 2018 [145]. The choice of algorithm (and associated parameters) is as much an art as it is a science, although domain expertise can come in handy [4]. There is a smaller list of common algorithms that data scientists and consultants should have in their toolbox (full descriptions available in [4], [74], [137]):

- **\(k-\)means**, close on the heels of decision trees for the title of “most-used data science algorithm”, is a **partition clustering method** which tends to produce equal-sized clusters; when clients ask for their data to be clustered, they are typically envisioning \(k-\)means with the Euclidean metric; variants include \(k-\)mode (for categorical data), \(k-\)medians (for data with outliers), and \(k-\)means\(||\) and \(k-\)means\(++\) for large data sets; the number of clusters \(k\) (and the similarity measure/distance metric) must be provided by the user; it works fairly well for “blob”-like data;
- **hierarchical clustering** is one of the few deterministic algorithms on the market, with **divisive** (DIANA) and **agglomerative** (AGNES) versions; no parameters need to be inputted, but the user must select a **linkage** strategy (roughly speaking, a metric that computes the distance between clusters) and a level at which to read off the clusters (see Figure 11.27 for an illustration);
- **density-based spatial clustering** (DBSCAN) is a graph-based approach which attempts to identify densely-packed regions in the dataset; its most obvious advantages (and those of its variants OPTICS and DENCLUE) are **robustness to outliers** and not needing to input a **number of clusters** to search for in the data; the main disadvantage is that the optimal input parameters (**neighbourhood radius** and **minimum number of points** to be considered dense) are not easy to derive (see Figure 11.27 for an illustration);
- **affinity propagation** is another algorithm which selects the optimal number of clusters directly from the data, but it does so by trying and evaluating various scenarios, which may end up being **time-consuming**;
- **spectral clustering** can be used to recognize **non-globular** clusters (see Figure 11.27 for an illustration); these are found by computing the eigenvalues of an associated Laplacian matrix – as a consequence, spectral clustering can be **slow** on large datasets.

Other methods include **latent Dirichlet allocation** (used in topic modeling), **expectation-maximisation** (particularly useful to find Gaussian clusters), **BIRCH** (a local method which does not require the entire dataset to be scanned), and **fuzzy clustering** (a soft clustering scheme in which the observations have a degree of belonging to each cluster).

### 11.5.3 k-Means

As mentioned previously, \(k-\)means is a very natural way to group observations together (formally, \(k-\)means is linked to **Voronoi tilings**). \(k-\)means clustering is achieved by:

1. selecting a **distance metric** \(d\) (based on the data type and domain expertise);
2. selecting a **number of clusters** \(k\);
3. randomly choosing \(k\) data instances as initial **cluster centres**;
4. calculating the **distance** from each observation to each centre;
5. placing each instance in the cluster whose centre it is **nearest** to;
6. computing/updating the **centroid** for each cluster (see Figure 11.28 for an illustration);
7. repeating steps 4–6 until the clusters are **“stable”**.

For \(k-\)means, cluster centroids are obtained by averaging all points in the cluster. For \(k-\)medians and \(k-\)mode, the centrality measure is replaced by the obvious candidate.
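The procedure above can be sketched in plain Python (a minimal illustration on made-up 2D points, not a production implementation; empty clusters and ties are handled naively):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    # Step 3: randomly choose k data instances as initial centres.
    centres = rng.sample(points, k)
    for _ in range(iters):
        # Steps 4-5: assign each observation to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centres[j]))
            clusters[i].append(p)
        # Step 6: recompute each centroid as the coordinate-wise mean
        # (an empty cluster simply keeps its previous centre).
        new_centres = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 7: stop once the centres are stable.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

# Two well-separated "blobs"; k-means recovers one centre per blob.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, clusters = kmeans(pts, 2)
print(sorted(centres))
```

Replacing the mean in step 6 by the median (or the mode) yields \(k-\)medians (or \(k-\)mode).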

This simple algorithm has numerous strengths:

- it is elegant and **easy to implement** (without actually having to compute pairwise distances), and so is extremely common as a result;
- in many contexts, it is a **natural way** to look at grouping observations, and
- it helps provide a first-pass **basic understanding of the data structure**.

On the other hand,

- it can only assign an instance to **one** cluster, which can lead to overfitting – a more robust solution would be to compute the probability of belonging to each cluster, perhaps based on the distance to the centroid;
- it requires the “true” underlying clusters to be **Gaussian-** or blob-shaped, and it will fail to produce useful clusters if that assumption is not met in practice, and
- it does not allow for **overlapping** or **hierarchical** groupings.

#### Notes

Let us now return to some issues relating to clustering **in general** (and not just to \(k-\)means):

No matter the choice of algorithm, clustering rests on the assumption that **nearness of observations** (in whatever metric) is linked with **object similarity**, and that **large distances** are linked with **dissimilarity**. While there are plenty of situations where this is an appropriate assumption to make (temperature readings on a map, for instance), there are others where it is unlikely to be the case (chocolate bars and sensationalist tabloids at a grocery’s checkout, say).

The **lack of a clear-cut definition** of what a cluster actually is (see Figure 11.29 for an example) makes it difficult to validate clustering results. Much more can be said on the topic [4].

The fact that various algorithms are **non-deterministic** is also problematic – clustering schemes should never be obtained using **only one** algorithmic pass, as the outcome could be different depending on the **location** of random starting positions and the distance/similarity metric in use.

But this apparent fickleness is not necessarily a problem: **essential patterns** may emerge if the algorithms are implemented multiple times, with different starting positions and re-ordered data (see **cluster ensembles** [4]). For those algorithms that require the **number of clusters** as an input, it may be difficult to determine what the optimal number should be (see Figure 11.30 for an illustration).

This number obviously depends on the choice of algorithm/metric, the underlying data, and the use that will be made of the resulting clusters – a dataset could have 3 natural groups when seen through the lens of \(k-\)means, but only 2 clusters for a specific choice of parameter values in DBSCAN, and so on.

This problem could be overcome by producing clustering schemes (from the same family of algorithms) with an increasing number of clusters and to plot the average distance of a cluster member to its cluster representative (centroid) against the number of clusters. Any kink in the plot represents a number of clusters at which an increase does not provide an in-step increase in clustering “resolution”, so to speak (see Figure 11.35 in Toy Example: Iris Dataset for an illustration).
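This heuristic can be sketched as follows (a minimal plain-Python illustration on made-up points, using hard-coded cluster assignments for \(k=1,2,3\) rather than an actual clustering run):

```python
import math

def mean_distance_to_centroid(points, labels):
    # Average distance from each observation to its cluster's centroid;
    # plotted against the number of clusters, a "kink" suggests a natural k.
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    centroids = {g: tuple(sum(c) / len(c) for c in zip(*ps))
                 for g, ps in groups.items()}
    return sum(math.dist(p, centroids[g])
               for p, g in zip(points, labels)) / len(points)

# Two tight, well-separated blobs: moving from k=1 to k=2 collapses the
# average distance, while k=3 barely improves on k=2 -- the kink is at k=2.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for labels in ([0] * 6, [0, 0, 0, 1, 1, 1], [0, 0, 1, 2, 2, 2]):
    print(round(mean_distance_to_centroid(pts, labels), 3))
```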

And even when a cluster scheme has been accepted as valid, a **cluster description** might be difficult to come by – should clusters be described using representative instances, average values, or some combination of their members’ most salient features? Although there are exceptions, the ease with which clusters can be described often provides an indication of how **natural** the groups really are.

One of the most frustrating aspects of the process is that most methods will find clusters in the data even if there are none – although DBSCAN is exempt from this **ghost clustering** phenomenon (see Figure 11.31 for a \(k-\)means example).

Finally, analysts should beware (and resist) the temptation of **a posteriori rationalisation** – once clusters have been found, it is tempting to try to “explain” them; why are the groups as they have been found? But that is a job for domain experts, at best, and a waste of time and resources, at worst. Tread carefully.

### 11.5.4 Clustering Validation

What does it mean for a clustering scheme to be **better** than another? What does it mean for a clustering scheme to be **valid**? What does it mean for a single cluster to be **good**? **How many** clusters are there in the data, really?

These are not easy questions to answer. In general, asking if a clustering scheme is the right one or a good one is meaningless – much better to ask if it is **optimal** or **sub-optimal**, potentially in comparison to other schemes.

An **optimal** clustering scheme is one which:

- **maximizes separation** between clusters;
- **maximizes similarity** within groups;
- agrees with the **human eye test**, and
- is **useful** at achieving its goals.

There are 3 families of **clustering validation** approaches:

- **external**, which use additional information (but the labels in question might have very little to do with the similarity of the observations);
- **internal**, which use only the clustering results (shape and size of clusters, etc.), and
- **relative**, which compare across a number of clustering attempts.

In order to illustrate some of the possibilities, consider a dataset with **clustering scheme** \(\mathcal{C}=\{\mathcal{C}_1,\ldots,\mathcal{C}_N\}\), where \(\mathcal{C}_m\)’s centroid is denoted by \(c_m\), and the average distance of \(\mathcal{C}_m\)’s members to \(c_m\) is denoted by \(s_m\). The **Davies-Bouldin Index** is defined as \[\textrm{DB}_{\mathcal{C}}=\frac{1}{N}\sum_{i=1}^N \max_{j\neq i}\left\{\frac{s_i+s_j}{d(c_i,c_j)}\right\},\] where \(d\) is the selected distance metric. Since \(\textrm{DB}_{\mathcal{C}}\) is only defined using the clustering results, it is an **internal validation method**.

Heuristically, if the **cluster separation is small**, we might expect \(d(c_i,c_j)\) to be (relatively) small, and so \(\textrm{DB}_{\mathcal{C}}\) should be (relatively) large.

In the same vein, if the clusters are **heterogeneous**, we might expect \(s_i+s_j\) to be (relatively) large, and so \(\textrm{DB}_{\mathcal{C}}\) should be (relatively) large.

In short, when the clustering scheme is **sub-optimal**, \(\textrm{DB}_{\mathcal{C}}\) is “large”. This suggests another way to determine the optimal number of clusters – pick the scheme with minimal \(\textrm{DB}_{\mathcal{C}}\) (see Figure 11.35 in Toy Example: Iris Dataset, which uses a modified version of the index, for an illustration).
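The index is straightforward to compute from the definition (a minimal plain-Python sketch; the centroids and spreads below are made-up values, not derived from a real clustering run):

```python
import math

def davies_bouldin(centroids, spreads):
    # centroids: the cluster centres c_i; spreads: the s_i, i.e. the average
    # distance of cluster i's members to c_i. Smaller values are better.
    N = len(centroids)
    return sum(
        max((spreads[i] + spreads[j]) / math.dist(centroids[i], centroids[j])
            for j in range(N) if j != i)
        for i in range(N)
    ) / N

# Tight, well-separated clusters yield a small index...
good = davies_bouldin([(0.0, 0.0), (10.0, 10.0)], [0.5, 0.5])
# ...while diffuse, nearly-overlapping clusters yield a large one.
bad = davies_bouldin([(0.0, 0.0), (1.0, 1.0)], [2.0, 2.0])
print(good, bad)
```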

Other **cluster quality metrics** exist, including **SSE**, **Dunn’s Index**, the **Silhouette Metric**, etc. [4], [230].

### 11.5.5 Case Study: Livehoods

When we think of similarity at the urban level, we typically think in terms of neighbourhoods. Is there some other way to identify similar parts of a city?

In *The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City* [125], Cranshaw et al. study the social dynamics of urban living spaces with the help of clustering algorithms.

#### Objective

The researchers aim to draw the boundaries of **livehoods**, areas of similar character within a city, by using clustering models. Unlike **static** administrative neighbourhoods, livehoods are defined based on the habits of the people who live there.

#### Methodology

The case study introduces **spectral clustering** to discover the **distinct geographic areas** of the city based on its inhabitants’ collective **movement patterns**. Semi-structured interviews are also used to **explore**, **label**, and **validate** the resulting clusters, as well as the urban dynamics that shape them.

Livehood clusters are built and defined using the following methodology:

1. a **geographic distance** is computed based on pairs of check-in venues’ coordinates;
2. **social similarity** between each pair of **venues** is computed using cosine measurements;
3. spectral clustering produces **candidate livehood** clusters;
4. interviews are conducted with residents in order to **validate** the clusters discovered by the algorithm.
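Step 2 can be illustrated as follows (a minimal sketch; the venue names, the made-up check-in counts, and the per-user count-vector representation are illustrative assumptions, not the study's actual data):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two per-user check-in count vectors:
    # 1 when the venues share the same clientele, 0 when disjoint.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical check-in counts at three venues for four users.
cafe    = [3, 0, 2, 1]
bakery  = [2, 0, 3, 1]
stadium = [0, 5, 0, 0]

print(cosine_similarity(cafe, bakery))   # close to 1: shared clientele
print(cosine_similarity(cafe, stadium))  # 0.0: disjoint clientele
```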

#### Data

The data comes from two sources: approximately 11 million Foursquare check-ins (Foursquare is a recommendation site for venues based on users’ experiences) from the dataset of Chen et al. [231], and a new dataset of 7 million Twitter check-ins downloaded between June and December of 2011.

For each check-in, the data consists of the **user ID**, the **time**, the **latitude and longitude**, the **name of the venue**, and its **category**.

In this case study, it is livehood clusters from the city of Pittsburgh, Pennsylvania, that are examined *via* 42,787 check-ins of 3840 users at 5349 venues.

#### 11.5.5.1 Strengths and Limitations of the Approach

- The technique used in this study is **agnostic** towards the particular source of the data: it is not dependent on meta-knowledge about the data.
- The algorithm may be prone to “majority” bias, consequently misrepresenting/hiding minority behaviours.
- The dataset is built from a **limited** sample of check-ins shared on Twitter and is therefore biased towards the types of visits/locations that people typically want to share **publicly**.
- Tuning the clusters is non-trivial: experimenter bias may combine with “confirmation bias” of the interviewees in the validation stage – if the researchers are themselves residents of Pittsburgh, will they see clusters when there are none?

#### Procedures

The Livehoods project uses a **spectral clustering model** to provide structure for local **urban areas** (UAs), grouping close Foursquare venues into clusters based on both the **spatial proximity** between venues and the **social proximity** which is derived from the distribution of people that check-in to them.

The guiding principle of the model is that the “character” of a UA is defined both by the types of venues it contains and by the people who frequent them as part of their daily activities. These clusters are referred to as **Livehoods**, by analogy with more traditional neighbourhoods.

Let \(V\) be a list of Foursquare venues, \(A\) the associated **affinity matrix** representing a measure of similarity between each pair of venues, and \(G_m(A)\) the graph obtained from \(A\) by linking each venue to its \(m\) nearest neighbours.

Spectral clustering is implemented as follows (we will discuss spectral clustering and other clustering algorithms in detail in Spotlight on Clustering):

1. Compute the diagonal degree matrix \(D_{ii} = \sum_{j} A_{ij}\);
2. Set the Laplacian matrix \(L=D-A\) and \[L_{\text{norm}}=D^{-1/2}LD^{-1/2};\]
3. Find the \(k\) smallest eigenvalues of \(L_{\text{norm}}\), where \(k\) is the index which provides the biggest jump in the successive eigenvalues of \(L_{\text{norm}}\), in increasing order;
4. Find the eigenvectors \(e_1, \ldots, e_k\) of \(L_{\text{norm}}\) corresponding to the \(k\) smallest eigenvalues;
5. Construct the matrix \(E\) with the eigenvectors \(e_1, \ldots, e_k\) as columns;
6. Denote the rows of \(E\) by \(y_1,\ldots,y_{n}\), and cluster them into \(k\) clusters \(C_1,\ldots,C_k\) using \(k\)-means; this induces a clustering \(\{A_1,\ldots,A_k\}\) defined by \[A_i =\{j\mid y_j \in C_i\};\]
7. For each \(A_i\), let \(G(A_i)\) be the subgraph of \(G_m(A)\) induced by the vertices in \(A_i\); split \(G(A_i)\) into connected components, add each component as a new cluster to the list of clusters, and remove the original cluster \(A_i\) from the list;
8. Let \(b\) be the area of the bounding box containing the coordinates of the venues in \(V\), and \(b_i\) the area of the box containing \(A_i\); if \(\frac{b_i}{b} > \tau\), delete cluster \(A_i\), and redistribute each of its venues \(v \in A_i\) to the closest remaining \(A_j\) under the distance measure.
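Steps 1 and 2 above can be sketched in plain Python on a tiny, made-up affinity matrix (a minimal illustration which assumes every venue has nonzero degree; a real implementation would use a linear-algebra library, both for this and for the eigendecomposition in step 3):

```python
def normalized_laplacian(A):
    # Step 1: the degree of node i is the i-th row sum of A.
    # Step 2: L = D - A, then L_norm = D^{-1/2} L D^{-1/2},
    # all computed on plain nested lists.
    n = len(A)
    deg = [sum(row) for row in A]
    L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)]
         for i in range(n)]
    return [[L[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]

# A hypothetical 3-venue affinity "path": venues 0-1 and 1-2 are linked.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
for row in normalized_laplacian(A):
    print([round(x, 4) for x in row])
```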

#### Results, Evaluation and Validation

The parameters used for the clustering were \(m = 10\), \(k_{\text{min}} = 30\), \(k_{\text{max}} = 45\), and \(\tau = 0.4\). The results for three areas of the city are shown in Figure 11.32. In total, 9 livehoods have been identified and validated by 27 Pittsburgh residents (the article has more information on this process).

- **Municipal Neighborhood Borders:** livehoods are dynamic, and evolve as people’s behaviours change, unlike the fixed neighbourhood borders set by the city government.
- **Demographics:** the interviews displayed strong evidence that the demographics of the residents and visitors of an area often play a strong role in explaining the divisions between livehoods.
- **Development and Resources:** economic development can affect the character of an area; similarly, the resources (or lack thereof) provided by a region have a strong influence on the people that visit it, and hence on its resulting character. This is assumed to be reflected in the livehoods.
- **Geography and Architecture:** the movements of people through a certain area are presumably shaped by its geography and architecture; livehoods can reveal this influence and the effects it has on visiting patterns.

#### Take-Away

While this is a neat example of practical clustering, its main take-away, from our perspective, is to remind everyone that \(k-\)means is not the sole clustering algorithm in applications!

### 11.5.6 Toy Example: Iris Dataset

Iris is a genus of plants with showy flowers. The `iris` dataset contains 150 observations of 5 attributes for specimens collected by Anderson, mostly from a pasture in the Gaspé Peninsula in the 1930s [232].

The attributes are:

- **petal width**
- **petal length**
- **sepal width**
- **sepal length**
- **species** (virginica, versicolor, setosa)

A “description” of these features is provided in the picture below:

This dataset has become synonymous with data analysis,^{179} being used to showcase just about every algorithm under the sun. That is, sadly, also what we are going to do in this section.^{180}

A **principal component projection** of the dataset, with species indicated by colours, is shown in Figure 11.33.

From an **unsupervised learning** point of view, one question of interest is whether the observations form natural groupings, and, if so, whether these groupings correspond to the (known) species.

We use the \(k-\)means algorithm with Euclidean distance to resolve the problem. Since we do not know how many clusters there should be in the data (the fact that there are 3 species does not mean that there should be 3 clusters), we run 40 replicates for \(k=2, \ldots, 15\). The resulting clustering schemes for \(k=2,3,4,15\) (for one of the replicates in each case) are shown in Figure 11.34.

For each replicate, we compute a (modified) **Davies-Bouldin Index** and the **Sum of Squared Errors** of the associated clustering schemes (see Figure 11.35 for the output) – the validation curves seem to indicate that there could be either \(3\) or \(5\) natural \(k-\)means clusters in the data. Is this a surprising outcome?

A single replicate with \(k=5\) is shown in Figure 11.36.

Would you consider this representative final clustering scheme to be meaningful?

### References

*Data Clustering: Algorithms and Applications*. CRC Press, 2014.

*Data Mining with R, 2nd ed.* CRC Press, 2016.

*ICWSM*.

*J. Mach. Learn. Res.*, vol. 7, pp. 1963–2001, Dec. 2006.

*et al.*, “Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease,”

*NeuroImage*, vol. 50, no. 1, pp. 162–174, 2010.

*Data Science for Business*. O’Reilly, 2015.

*ACM Trans. Database Syst.*, vol. 42, no. 3, Jul. 2017, doi: 10.1145/3068335.

*Computational science and its applications - ICCSA 2011*, 2011, pp. 454–465.

*ClusterCrit: Clustering Indices*. 2018.

*ICWSM*, 2011.

*Annals of Eugenics*, vol. 7, no. 7, pp. 179–188, 1936.