## 14.1 Overview

We introduced some of the basic notions of unsupervised learning in Machine Learning 101; in this module, we review some of these concepts in the context of clustering, discuss the problems of validation and model selection, and present some simple and sophisticated clustering algorithms.

### 14.1.1 Unsupervised Learning

In supervised learning, we differentiate between a dataset’s response variables $$Y_1,\ldots, Y_m$$ and its predictor variables $$X_1, \ldots, X_p$$. Which variables are predictors and which are responses depends on the context – for some questions, a given variable could be a predictor; for others, a response.

Unsupervised learning tasks do away with the responses altogether, which means that prediction is off the table. Note that variables that would have been deemed response variables in a supervised learning framework are not necessarily removed from the dataset during the analysis – they are simply not viewed as an outcome to predict, and the predictor variables are just observation features. In unsupervised learning, the objective is to identify and uncover interesting insights about the dataset and the system that it represents (see Section 7.2.2), such as:

• informative ways of visualizing the dataset (often associated with Feature Selection and Dimension Reduction);

• highlighting subgroups among the dataset’s variables or observations (clustering), or

• finding links between variables (association rules mining, link profiling, etc.), say.

### 14.1.2 Clustering Framework

Clustering consists of a large family of algorithms and methods used to discover so-called latent groups in the datasets – natural groups that exist but have not been identified or labeled as such.

Clustering is a subjective analytical task; unlike classification and regression, clustering analysis does not have as “simple” a goal as predicting a response for a new observation based on historical data patterns, and there is no “solution key” against which to compare analysis results.

Applications:

• finding subgroups of breast and/or prostate cancer patients based on their gene expression measurements or their socio-demographic characteristics in order to better understand the disease and potential treatment side-effects;

• grouping products in an online shop based on ratings and reviews assigned by customers, or grouping customers based on their purchasing history, in order to make product recommendations;

• finding documents that apply to search queries, and finding similar queries to those entered by a user to increase the odds of finding the documents they are really looking for;

• identifying population segments to test various incentives for vaccination;

• etc.

In each of these cases, the number of these latent groups is unknown (and can in fact be taken as a true unknown of the problem). The subjectivity of unsupervised learning tasks may seem to be an insurmountable flaw: analysts attempting to find latent groups in a dataset, say, may obtain a different number of such groups, or assign different observations to their groups if their numbers are identical, without one of them being necessarily “wrong” (although it is conceivable that some of them could produce sub-optimal groups, see Section 14.3 for a detailed discussion on this topic).

In spite of this, clustering is a popular analytical task, in part because it is much easier (and cheaper), typically, to obtain unlabeled data than it is to obtain labeled data (against which supervised methods could be evaluated). A cluster is a subset of observations that all have something in common – they are similar, according to some measure of similarity. Furthermore, a cluster’s observations should be dissimilar to other clusters’ observations.

Clusters do not necessarily need to be distinct (as in so-called hard clustering) – in some cases, it might be sufficient to quantify the likelihood or the degree to which an observation belongs to a cluster (soft clustering).
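
The soft-clustering idea can be sketched in a few lines: under the illustrative assumption that each cluster is summarized by a centre, a degree of membership can be obtained by normalizing Gaussian similarities to those centres. The function name and the centres below are hypothetical, purely for illustration.

```python
import math

def soft_memberships(point, centres, gamma=1.0):
    """Degree of membership of `point` in each cluster, obtained by
    normalizing Gaussian similarities to (hypothetical) cluster centres."""
    sims = [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(point, c)))
            for c in centres]
    total = sum(sims)
    return [s / total for s in sims]

# A point halfway between two centres belongs equally to both clusters.
centres = [(0.0, 0.0), (2.0, 0.0)]
print(soft_memberships((1.0, 0.0), centres))  # [0.5, 0.5]
```

A hard clustering would instead assign the point entirely to one of the two clusters (breaking the tie arbitrarily); the soft version keeps the ambiguity visible.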

The choice of a similarity/dissimilarity measure is also entirely subjective; there are contexts for which proximity could be used as a decent proxy for similarity, and others where it could not. Even in the former case, a distance measure (metric) has to be selected, and infinitely many choices are available to analysts.

Without domain-specific considerations (this requires thorough data and context understanding), the choice of measure is arbitrary; but understanding the data and the context does not guarantee that all reasonable analysts would agree on such a measure.

For instance, in any group of human beings, which of

age, ethnic background, gender, postal code, sexual orientation, linguistic abilities, mathematical skills, career, social class, political affiliation, operating system preferences, educational achievements, hockey club fandom, etc.

is responsible for separating its members into “US” vs. “THEM” groups? Is it some combination of these characteristics? Are the groups fixed? Is everybody in the “US” group based on age also in the “US” group based on gender?

We could bypass the problem by creating more groups; given an age group and gender, we could create the clusters: “same age group and gender” (US), “same age group, different gender” (THEM$$_1$$), “different age group, same gender” (THEM$$_2$$), and “different age group and gender” (THEM$$_3$$).

It is clear how the process can be expanded to include more combinations of feature levels, but at the price of introducing an ever-increasing number of clusters – how many “THEM” groups are too many for analysts or human brains to process?
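
The combinatorial blow-up is easy to see: if each of $$k$$ features is reduced to a binary “same/different” comparison against a reference person, there are already $$2^k$$ distinct US/THEM combinations. A small illustration (the chosen feature list is arbitrary):

```python
from itertools import product

# Each feature is reduced to a binary "same"/"different" comparison
# against a reference person; the feature list is purely illustrative.
features = ["age group", "gender", "postal code", "political affiliation"]
groups = list(product(["same", "different"], repeat=len(features)))

print(len(groups))   # 2 ** 4 = 16 groups, only one of which is "US"
print(groups[0])     # ('same', 'same', 'same', 'same') -> the "US" group
```

With 10 such features there would already be 1024 groups – far more than any analyst would want to interpret.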

Clustering algorithms are designed to model various aspects of this problem, but its complexity gives rise to an enormous number of algorithms: at least 100 had been published as of January 2022 [145]. Most of these belong to one of six main families [4]:

• partitional ($$k-$$means and variants, CLARA, etc.);

• hierarchical (AGNES, DIANA, BIRCH, etc.);

• density-based (DBSCAN, DENCLUE, OPTICS, etc.);

• connectivity-based (spectral and variants, etc.);

• grid-based (GRIDCLUS, STING and variants, etc.);

• model-based (mixture models, latent Dirichlet allocation, expectation-maximization, etc.).
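
As a minimal illustration of the partitional family, here is a bare-bones, pure-Python sketch of Lloyd’s algorithm for $$k$$-means; in practice, a library implementation would be preferred, and the toy points below are purely illustrative.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Bare-bones Lloyd's algorithm: alternate between assigning each point
    to its nearest centre and recomputing centres as cluster means."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres drawn from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
            clusters[j].append(p)
        centres = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Two well-separated blobs on the real line (as 1-tuples).
pts = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
centres, clusters = kmeans(pts, k=2)
print(sorted(ctr[0] for ctr in centres))  # centres settle near 0.1 and 10.1
```

Note that the result depends on the (random) initialization and on the pre-specified $$k$$ – two of the recurring difficulties discussed below.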

As is the case for all analytical methods, some modifications are required when dealing with “Big Datasets”, for high-dimensional data, or for specific types of datasets, such as stream data, network data, categorical data, text and multimedia data, time series data, and so on. Ensemble methods, which combine various clustering results, can also prove useful.

#### Distance, Similarity, and Dissimilarity Measures

Although the choice of how to interpret and compute similarity between observations is, to all intents and purposes, completely up to the analysts, all such measures must satisfy certain properties: they must take on

• large values for similar objects, and

• small (or even negative) values for dissimilar objects.

Dissimilarity measures function in the opposite manner. The kernel functions of machine learning (see Section 13.4.2) are examples of similarity measures, most notably the Gaussian (or radial) kernel $$K_\gamma(\mathbf{x},\mathbf{y})=\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|_2^2)$$, for a given $$\gamma>0$$, for which points near one another (in the $$\|\cdot\|_2$$ sense) have a similarity measure $$w=K_\gamma(\mathbf{x},\mathbf{y})\approx 1$$ (and thus are similar), and points far from one another have a similarity measure near $$0$$ (and thus are dissimilar).
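
The Gaussian kernel’s behaviour is easy to verify numerically; a minimal sketch, using arbitrary illustrative points:

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    """Radial (Gaussian) kernel: similarity near 1 for nearby points,
    near 0 for distant points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(gaussian_kernel((0, 0), (0.1, 0)))  # close points  -> ~0.99
print(gaussian_kernel((0, 0), (10, 0)))   # distant points -> ~0
```

The parameter $$\gamma$$ controls how quickly similarity decays with squared distance: larger values make the notion of “nearby” stricter.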

Some similarity measures are derived from distances (metrics) $$d:\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}_0^+$$, which are functions with special properties:

1. $$d(\mathbf{x},\mathbf{y})=0 \iff \mathbf{x}=\mathbf{y}$$;

2. $$d(\mathbf{x},\mathbf{y})\geq 0$$ for all $$\mathbf{x},\mathbf{y}\in \mathbb{R}^p$$;

3. $$d(\mathbf{x},\mathbf{y})=d(\mathbf{y},\mathbf{x})$$ for all $$\mathbf{x},\mathbf{y}\in \mathbb{R}^p$$;

4. $$d(\mathbf{x},\mathbf{y})\leq d(\mathbf{x},\mathbf{z})+d(\mathbf{z},\mathbf{y})$$ for all $$\mathbf{x},\mathbf{y},\mathbf{z}\in \mathbb{R}^p$$ (Triangle Inequality).

In effect, distances are positive-definite symmetric functions $$\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}_0^+$$ satisfying the Triangle Inequality. Commonly used distances include the:

• Euclidean distance $$d_2(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_2$$;

• Manhattan distance $$d_1(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_1$$;

• supremum distance $$d_\infty(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_\infty$$;

• more general Minkowski distance $$d_p(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_p$$, for $$p\geq 1$$, of which the first three examples are special cases;

• and more esoteric distances such as the Jaccard distance for binary vectors, the Hamming distance for categorical vectors, the Canberra distance for ranked lists, the cosine distance for text data, mixed distances for mixed variables, and so on.
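
The first four of these are special cases of a single formula; the sketch below assumes numeric tuples and is for illustration only:

```python
def minkowski(x, y, p=2.0):
    """Minkowski distance d_p: p=1 gives the Manhattan distance, p=2 the
    Euclidean distance, and p -> infinity the supremum distance."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))             # Manhattan: 7.0
print(minkowski(x, y, 2))             # Euclidean: 5.0
print(minkowski(x, y, float("inf")))  # supremum: 4
```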

Given a distance $$d$$, a common construction is to define an associated similarity measure via $$w=\ell-d,\quad w=\exp(-kd^2),\quad \text{or} \quad w=\frac{1}{1+d},$$ for appropriate constants $$\ell$$ and $$k>0$$. Note, however, that some similarity measures cannot be derived from distance measures.
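
These three constructions can be sketched directly; the constants $$\ell=10$$ and $$k=1$$ below are arbitrary illustrative choices:

```python
import math

def sim_linear(d, ell=10.0):   # w = ell - d  (assumes d <= ell)
    return ell - d

def sim_gaussian(d, k=1.0):    # w = exp(-k d^2)
    return math.exp(-k * d ** 2)

def sim_inverse(d):            # w = 1 / (1 + d)
    return 1.0 / (1.0 + d)

# All three decrease as the distance d grows.
for d in (0.0, 1.0, 5.0):
    print(sim_linear(d), sim_gaussian(d), sim_inverse(d))
```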

#### Data Transformations Prior to Clustering

Prior to clustering, it is crucial that the data be scaled (and potentially centered) so that none of the variables unduly influence the outcomes, or, as the expression prosaically puts it, so that we do not have to compare apples with oranges – if age in years and height in cm are dataset variables, a 10-unit difference in age is likely to be more significant (in real terms) than a 10-unit difference in height.

Putting everything on a $$(\min,\max)$$ scale, for instance, guarantees that relative differences (relative to the distribution of each variable), and not absolute distances, play the important role. However, there are many ways to scale the data, and the scaling approach may have an effect on the clustering results (as we are sure you will not be surprised to find out by this point – that is the way with clustering: out of the frying pan and into the fire).
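
Min-max scaling is a one-liner; the age and height columns below are illustrative, echoing the apples-and-oranges example above:

```python
def min_max_scale(column):
    """Rescale a numeric column to [0, 1] so that no variable dominates
    distance computations simply because of its units."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

ages = [20, 35, 50, 65]          # years
heights = [150, 160, 170, 180]   # cm
print(min_max_scale(ages))       # [0.0, 0.333..., 0.666..., 1.0]
print(min_max_scale(heights))    # [0.0, 0.333..., 0.666..., 1.0]
```

After scaling, a given difference in age and the corresponding relative difference in height carry the same weight in any distance computation.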

#### Common Difficulties

There are issues related to clustering other than the vagueness we have already discussed:

• in many instances, the underlying assumption is that nearness of observations (in whatever metric) is linked with object similarity, and that large distances are linked with dissimilarity – an assumption that does not always hold in practice;

• the lack of a clear-cut definition of what a cluster actually is makes it difficult to validate clustering results;

• various clustering algorithms are non-deterministic;

• the number of clusters cannot usually be known before the analysis;

• even when a cluster scheme has been accepted as valid, a cluster description might be difficult to come by;

• most methods will find clusters in the data even if there are none;

• once clusters have been found, it is tempting to try to “explain” them, but that is a job for domain experts.

### 14.1.3 A Philosophical Approach to Clustering

In the context of artificial general intelligence, clustering provides a basic way for intelligences to structure their experience of the world.

Clustering techniques can allow such machines to identify object instances in the world around them and then, on the basis of this identification, to identify or define types of objects by grouping together the object instances they have discovered. With this in mind, we can view creating concepts as the fundamental purpose of identifying groupings of similar datapoints; these concepts allow an intelligent agent (whether machine or person) to:

• work in shorthand when dealing with objects (i.e., it is easier to deal with 10 ‘cats’ than with 10 unique objects), and

• make assumptions about the object instances in a cluster associated with the concept (if an object is a cat, then that object probably likes fish).

If the existence of some “ground truth” about what should be clustered together (and, by extension, what should be counted as a concept) can be presupposed, then, regardless of what is currently known about that truth, neither the choice of algorithm nor the choice of its parameters is wholly subjective in the psychological sense of the term (where it has the connotation of “coming from a person’s experience”, which tends to indicate that whatever is being talked about does not exist separately from such an experience); choosing one algorithm over another (or one set of parameters over another) may lead to a “better” or “worse” reflection of the underlying ground truth.

But what counts as a ground truth? There are, of course, debates about this in philosophical circles. Suppose that natural kinds exist, that is to say, suppose that there is a privileged and objectively essential way in which objects are properly grouped in nature.

This assumption is very commonly made in the sciences, where uncovering or discovering universal truths about natural kinds of objects is a major objective. In such a case, natural kinds can count as a ground truth, with some clusterings more closely reflecting this reality than others.

The fact that the ground truth is not known by the clustering agent does not mean that it does not exist, or that it cannot be sought out using various techniques. Indeed, this is arguably what scientists do when they are using the scientific method; they do not know, a priori, which of their hypotheses are true or which are false, but they nonetheless engage in various techniques to try to get a better sense of what is true and false.

Even if the existence of natural kinds is rejected, it can still be the case that, relative to a particular circumstance, some clustering results are of higher quality than others. Or, stated in terms of goals, some clustering results could achieve a stated goal to a greater or lesser extent than others.

This does appear to be more subjective, in the sense that the goal, and the success of the outcome relative to the goal, are both defined by an individual or individuals, rather than being independent of them.

Outside of clustering, it is not unusual for people to create contextual definitions of what counts as ‘good’. Consider as an example the concept of a ‘good meal’. What qualifies as a good meal when camping in the backcountry is not the same as what counts as a good meal when staying at a four-star resort. Context matters.

But this does not mean that it is impossible to have a bad meal under either of these circumstances, or that anything counts as a good meal – recall that we do not use subjective in the stultifying post-modernist sense that there are no constraints whatsoever and everything is a social construct.

Nonetheless, such situations are difficult to pin down or define in a rigorous fashion. Even if there were some more abstract or subjective sense in which something could be said to be common to all types of good meals – or to all types of good clustering results, in this analogy – it is difficult to imagine how this could be stated with any precision. This is, frustratingly, a typically human limitation to dealing with the world.

However, since the underlying objective of machine learning and artificial intelligence is to create machines with abilities similar to those of humans, perhaps it is worth trying to capture this less than rigorous approach within the context of machine learning.

Given this inherent lack of rigour, are there applied situations where clustering is useful? More concretely, suppose we desire to cluster furniture, based on data about the furniture. We could make measurements of various kinds on physical objects, either selected randomly or haphazardly; perhaps, rather than working with the furniture itself, we could use a website catalogue in which each page showcases a particular type of furniture available for sale.

In this scenario, there may be one grouping (created by tagging and linking pages, for instance) of the furniture pages that allows users to visit the website with maximum efficiency, and another that helps the store maximize its sales.

Moreover, if we believe that natural kinds exist (which, as noted, is debated by philosophers but is a common assumption in science), there might also be one grouping of the furniture that best matches the underlying furniture natural kinds.

When considering outcomes relative to a particular situation, the most appropriate strategy for a particular clustering will depend, broadly speaking, on two considerations:

• the chosen goal it is intended to support, and

• the underlying structure of the data itself.

The first can be explicitly known and stated, but the second will likely not be known in advance, which leads to numerous technical issues when applying clustering algorithms.

A multitude of clustering algorithms can be applied to problems like the website furniture problem described above, and for each of those, many different parameter settings exist. Suppose six different clustering processes are carried out in the case of the furniture website example, and that they generate six (potentially) different clustering outcomes.

Presumably, some will be more effective than others if the objective is to get people to spend a maximum amount of money on the online store. If the objective is to allow customers to make their purchases most quickly, the “optimal” clustering might not be the one that leads customers to spend the most money.

It is difficult to say ahead of time which of the six groupings will be the most effective one, in each of these cases. However, it might be possible to carry out A/B testing to determine which one is the most effective, once they have been generated. But can the A/B testing step be avoided?

If the applied goal (e.g. the goal to group furniture pages in order to maximize profits) can be operationalized more directly in terms of similarity and difference, then it can be more directly tied to clustering approaches.

If we think that the best way to increase sales is to have loose clusters where people are forced to browse a certain amount, while being exposed to somewhat similar (but still interesting) pieces of furniture to find what they want, it might be possible to select a clustering approach to generate clusters with this desirable property.

Returning for a moment to the less applied general artificial intelligence scenario introduced earlier, if the fundamental functionality of clustering is viewed as creating concepts, then it would seem to make sense to operationalize this goal in terms of creating groupings where the observations in a group are similar to each other and different from those in other groups. In this context, a poor grouping would de facto be one where multiple observations are similar to those in other clusters, or very different from those in their own cluster. This could happen if a clustering process runs into technical difficulties, but it can also happen if there is no such strongly grouped structure in the data itself.
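
One common way to operationalize this notion of grouping quality is a silhouette-style score (one choice among many); below is a pure-Python sketch with arbitrary illustrative points:

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette-style score in [-1, 1]: near +1 when `point` is close to
    its own cluster and far from the nearest other cluster; negative when
    the grouping is poor."""
    a = (sum(dist(point, q) for q in own_cluster if q != point)
         / max(len(own_cluster) - 1, 1))                 # within-cluster mean
    b = min(sum(dist(point, q) for q in c) / len(c)       # nearest other cluster
            for c in other_clusters)
    return (b - a) / max(a, b)

euclid = lambda x, y: sum((u - v) ** 2 for u, v in zip(x, y)) ** 0.5

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
far = [(5.0, 5.0), (5.1, 5.0)]
print(round(silhouette(tight[0], tight, [far], euclid), 2))  # near 1
```

A low or negative average score across all observations would suggest either a poorly chosen procedure or the absence of strong group structure in the data itself.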

To rule out the possibility that the problem is linked to the chosen clustering procedure, one strategy is to use multiple clustering techniques, as well as multiple parameter settings for each technique.

If the issue remains, then we could conclude that it is likely that there is no good clustering structure in the data and by extension, in the objects being represented by the data.

### References

[3] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[4] C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
[5] C. C. Aggarwal, Data Mining: The Textbook. Cham: Springer, 2015.
[125] J. Cranshaw, R. Schwartz, J. I. Hong, and N. M. Sadeh, in ICWSM.
[128] F. R. Bach and M. I. Jordan, “Learning spectral clustering, with application to speech separation,” J. Mach. Learn. Res., vol. 7, pp. 1963–2001, Dec. 2006.
[129] H. T. Kung and D. Vlah, “A spectral clustering approach to validating sensors via their peers in distributed sensor networks,” Int. J. Sen. Netw., vol. 8, no. 3/4, pp. 202–208, Oct. 2010, doi: 10.1504/IJSNET.2010.036195.
[131] C. Plant et al., NeuroImage, vol. 50, no. 1, pp. 162–174, 2010.
[145] Wikipedia.
[222] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN,” ACM Trans. Database Syst., vol. 42, no. 3, Jul. 2017, doi: 10.1145/3068335.
[229] N. Harris.
[259] M. Grootendorst, Towards Data Science, Feb. 2021.
[260] M. Harmouch, Towards Data Science, Mar. 2021.
[261] P. Boily and J. Schellinck, “Machine learning 101,” Data Science Report Series, 2021.
[268] Y. Murat and Z. Cakici, “An integration of different computing approaches in traffic safety analysis,” Transportation Research Procedia, vol. 22, pp. 265–274, Dec. 2017, doi: 10.1016/j.trpro.2017.03.033.