14.4 Advanced Clustering Methods

We present representative clustering algorithms from the remaining families. More information is available in [2], [4], [145].

14.4.1 Density-Based Clustering

The assumptions of the \(k-\)means algorithm imply that the clusters that it finds are usually Gaussian (blob-like). But this is not always a desired outcome.

In density-based clustering, it is the density of observations and the connectivity of the accompanying clustering network (which we will discuss further in the next section) that determine the number and location of clusters. Popular density-based clustering algorithms include DBSCAN, DENCLUE, OPTICS, CHAMELEON, etc.

Once density has been defined in a meaningful way (which depends on a number of contextual factors), density-based algorithms are fairly straightforward to apply and understand (see [4], [130], [222], [228], [229] for details).

Density

How do we measure density? Intuitively, we can recognize areas of low density and high density in the (artificial) dataset below.

As the saying goes, “birds of a feather flock together”; it should not come as a surprise that areas of higher density could be viewed as clusters in the data. In that context, if \(\mathbf{\Psi}\subseteq \mathbb{R}^n\) is an \(n-\)dimensional sub-manifold of \(\mathbb{R}^n\), we could define the density of \(\mathbf{\Psi}\) around \(\mathbf{x}\) by, say, \[\text{density}_{\mathbf{\Psi}}(\mathbf{x};d)=\lim_{\varepsilon\to 0^+}\dfrac{\text{Vol}_n(B_{d}(\mathbf{x},\varepsilon)\cap \mathbf{\Psi} )}{\text{Vol}_n(B_{d}(\mathbf{x},\varepsilon))},\] where \[B_{d}(\mathbf{x},\varepsilon)=\{\mathbf{y}\in \mathbb{R}^n\mid d(\mathbf{x},\mathbf{y})<\varepsilon\}\] and \[\text{Vol}_n(A)=\text{volume of }A\text{ in }\mathbb{R}^n.\]

DBSCAN

In practice, the dataset \(\mathbf{X}\) is usually a discrete subset of \(\mathbb{R}^n\), and the limit definition above cannot apply. Density-based spatial clustering of applications with noise (DBSCAN) estimates the density at an observation \(\mathbf{x}\in\mathbf{X}\) as follows: we pick a “reasonable” value of \(\varepsilon^*>0\) and set \[\text{density}_{\mathbf{X}}(\mathbf{x};d)=\left|B_{d}(\mathbf{x},\varepsilon^*)\cap \mathbf{X} \right|.\]

The outcome depends, of course, on the choice of both \(\varepsilon^*\) and the distance \(d\).

DBSCAN also requires a connectivity parameter: the minimum number of points minPts in \[V_{\mathbf{x}}=B_d(\mathbf{x},\varepsilon^*)\cap \left[\mathbf{X}\setminus \{\mathbf{x}\}\right]\] (excluding \(\mathbf{x}\)). If \(|V_{\mathbf{x}}|\geq \textit{minPts}\), the observations in \(V_{\mathbf{x}}\) are said to be within reach of (or reachable from) \(\mathbf{x}\).
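A minimal sketch of these quantities in R may help fix ideas; the artificial dataset, \(\varepsilon^*\), and \(\textit{minPts}\) values below are ours and purely illustrative.

set.seed(1)  # small artificial 2D dataset: two clumps of 20 points each
X.art <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(40, mean = 2, sd = 0.3), ncol = 2))

eps.star <- 0.5                              # a "reasonable" value of epsilon*
minPts   <- 3                                # connectivity parameter

D.mat <- as.matrix(dist(X.art))              # pairwise Euclidean distances d(x_i, x_j)
density.X <- rowSums(D.mat < eps.star)       # |B_d(x_i, eps*) intersect X| (x_i included)
V.size <- density.X - 1                      # |V_{x_i}|: neighbours of x_i, excluding x_i itself

sum(V.size >= minPts)                        # observations x whose V_x points are within reach of x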

In other words, for a given choice of \(d\), \(\varepsilon^*\), and minPts, there are three types of observations in \(\mathbf{X}\):

  • outliers are observations that are not within reach of any of the other observations, such as \(\mathbf{x}_1\) below:

  • reachable (non-core) observations are observations that are within reach of at least one other observation but whose own neighbourhood \(V_{\mathbf{x}}\) contains fewer than minPts observations, such as \(\mathbf{x}_2\) and \(\mathbf{x}_3\) below (with \(\textit{minPts}= 3\)):

  • core observations are observations whose neighbourhood \(V_{\mathbf{x}}\) contains at least minPts observations, such as \(\mathbf{x}_4\) below (with \(\textit{minPts}= 3\)):

There are other such points: \(\mathbf{x}_5\), \(\mathbf{x}_6\), \(\mathbf{x}_7\), \(\mathbf{x}_8\), and \(\mathbf{x}_9\).

Reachability is not a symmetric relation: no observation is reachable from a non-core point (a non-core point may be reachable, but nothing can be reached from it).

We can build a new symmetric relation on non-outlying observations on the basis of reachability, however: \[\mathbf{p},\mathbf{q}\in\mathbf{X}\setminus\{\text{outliers}(\mathbf{X})\}\] are said to be density-connected for \(\varepsilon^*>0\) and \(d\) if there is an observation \(\mathbf{o}\in \mathbf{X}\) such that \(\mathbf{p},\mathbf{q}\in V_{\mathbf{o}}\), with \(|V_{\mathbf{o}}|\geq \textit{minPts}\).

The same \(\mathbf{p},\mathbf{q}\) are said to be density-connected in a path if either they are density-connected or if there is a sequence of observations \[\mathbf{p}=\mathbf{r}_0,\mathbf{r}_1,\ldots,\mathbf{r}_{k-1},\mathbf{r}_k=\mathbf{q}\] such that \(\mathbf{r}_{i-1},\mathbf{r}_i\) is density-connected for all \(i=1,\ldots, k\).

That the latter is an equivalence relation on \(\mathbf{X}\setminus \{\text{outliers}(\mathbf{X})\}\) is clear:

  • it is reflexive as every \(\mathbf{x}\in \mathbf{X}\setminus \{\text{outliers}(\mathbf{X})\}\) is either reachable or a core observation, so that \(\exists \mathbf{o}_{\mathbf{x}}\in\mathbf{X}\) with \(\mathbf{x}\in V_{\mathbf{o}_{\mathbf{x}}}\) and \(|V_{\mathbf{o}_{\mathbf{x}}}|\geq \text{minPts}\), and so \(\mathbf{x}\) is density-connected to itself;

  • it is symmetric and transitive by construction.

DBSCAN clusters are, essentially, composed of observations that are density-connected in a path.

In the image above, arrows represent density-connection: each orange observation is within reach of a red one, but no observation can be reached from the orange points.

Algorithm

DBSCAN clusters are grown according to the following algorithm:

  1. Select, at random, an observation that has not previously been selected and has not yet been assigned to a cluster.

  2. Determine the selected observation’s type (outlier, non-core, core).

  3. If the observation is an outlier or a non-core point, assign it to the noise cluster.

  4. Else, build its network of density-connected paths.

  5. Assign all observations in the network to a stand-alone cluster.

  6. Repeat steps 1 to 5 until all points have been assigned to a cluster.

All points within a cluster are mutually density-connected in a path. If a point is reachable from any point of the cluster, it is part of the cluster as well. An illustration of the DBSCAN algorithm is provided in Figure 14.14.


Figure 14.14: Illustration of DBSCAN on an artificial dataset (top, left). The parameters \(\varepsilon\) and \(\textit{minPts}\) are shown in each display. We select a point at random (second image, top row); it is not a core point as its \(\varepsilon-\)neighbourhood does not contain more than \(\textit{minPts}\) observations (excluding the selected point itself); it is assigned to the noise cluster. We select another point at random (top, right); that one is a core point, as its \(\varepsilon-\)neighbourhood contains 4 observations. All its density-connected observations are shown in green (bottom, left). Its network of density-connected paths is shown in green, for the core observations, and in light green, for the reachable observations (bottom row, second image); they make up cluster 1 (bottom row, third image). Continuing in this way, we obtain 2 clusters and noisy observations (bottom, right).

The observations in the noise cluster are typically identified as outliers, making DBSCAN a reasonable unsupervised learning approach for anomaly detection (see Module 16).

Note that clusters, by definition, must contain at least one core point. Small groups of observations that are not density-connected to any core points will then also be assigned to the noise cluster. A non-core point that has been assigned to the noise cluster may end up being assigned to a stand-alone cluster at a later stage (but the opposite cannot occur).

It is possible for two clusters to share non-core points, in which case the points in question are randomly assigned (the random order of selection in step 1 may affect the results); consequently, some clusters may end up containing fewer than minPts observations.

Advantages and Limitations

The main advantages of DBSCAN are that:

  • there is no need to specify the number of clusters to find in the data;

  • clusters of arbitrary shapes can be discovered;

  • observations that are "noisy"/outlying are not forced into a cluster;

  • the clusters are robust with respect to outliers, and

  • it only requires two parameters (\(\varepsilon^*>0\) and \(\textit{minPts}\)) to run properly, which can be set by domain experts if the data is well understood.

In general, it is suggested to use \(\textit{minPts} \geq p+1\), with larger values being preferable for noisy data sets, or \(\textit{minPts} \geq 2p\) for large datasets or sets with duplicates. Meanwhile, the choice of \(\varepsilon^*>0\) should take into account that if it is too small, a large portion of the observations will be assigned to the noise cluster; but if it is too large, a majority of observations will be found in a single cluster. Small values are preferable, but how small is too small?
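One common heuristic for choosing \(\varepsilon^*\) (offered here only as a suggestion) is to plot the sorted distances from each observation to its \(k\)-th nearest neighbour, with \(k\approx\textit{minPts}\), and to read a candidate \(\varepsilon^*\) off the resulting “elbow”; the (separate) dbscan package provides a convenience function for this, sketched below on the artificial data X.art from the earlier sketch.

library(dbscan)                         # assumes the dbscan package is installed

dbscan::kNNdistplot(X.art, k = 3)       # sorted distances to each point's 3rd nearest neighbour
abline(h = 0.5, lty = 2)                # candidate eps*, read off the "elbow" (illustrative value)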

The parameter choices have a large impact on the DBSCAN results, as does the choice of the distance function, which should take place before \(\varepsilon^*\) is selected to avoid data dredging and “begging the question”. Given that DBSCAN can handle globular clusters as well as non-globular clusters, why would we not always use it?

One important reason relates to computational efficiency. For a dataset \(\mathbf{X}\) with \(n\) observations, the basic \(k-\)means algorithm has order \(O(nk)\), whereas the most efficient versions of the DBSCAN algorithm have order \(O(n \log n)\). Thus, as \(n\) increases, the DBSCAN run time increases faster than the \(k-\)means run time.

Another reason is that DBSCAN works best when the density is (roughly) constant across clusters; it can struggle when the density varies from one cluster to the next.

Most of us would agree that there are two clusters in the image above (a loose one in the bottom/left corner, and a tight one in the top/right corner), as well as some outliers (around the tight cluster), but no combination of \(\varepsilon^*>0\) and \(\textit{minPts}\) can allow DBSCAN to discover this structure: either it finds no outliers, or it only finds the one tight cluster.

Example

Let us re-visit the (scaled) 2011 Gapminder dataset gapminder.SoCL.2011.s. We use Euclidean dissimilarity in this example, but the dbscan() function from the fpc package can accommodate other metrics: first compute the corresponding distance matrix and specify method="dist" instead of method="raw" in the function call.

We use 9 combinations of parameters: \[\varepsilon^*\in\{0.75,1,1.25\},\quad \textit{minPts}\in\{6,10,15\}.\]

set.seed(0) # for replicability
dbscan1 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 0.75, MinPts = 6)
dbscan2 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 1.0, MinPts = 6)
dbscan3 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 1.25, MinPts = 6)
dbscan4 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 0.75, MinPts = 10)
dbscan5 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 1.0, MinPts = 10)
dbscan6 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 1.25, MinPts = 10)
dbscan7 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 0.75, MinPts = 15)
dbscan8 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 1.0, MinPts = 15)
dbscan9 <- fpc::dbscan(gapminder.SoCL.2011.s, eps = 1.25, MinPts = 15)

There are, without a doubt, more efficient ways to go through the 9 combinations, but this will do for the time being:

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan1$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan2$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan3$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan4$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan5$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan6$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan7$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan8$cluster)),
                diag=list(continuous=my_dens))

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(dbscan9$cluster)),
                diag=list(continuous=my_dens))
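As alluded to above, the nine runs and the corresponding plots can also be produced more compactly by looping over the parameter grid; the sketch below assumes the same objects (gapminder.SoCL.2011.s, gapminder.SoCL.2011, my_dens) as the calls above.

params <- expand.grid(eps = c(0.75, 1, 1.25), MinPts = c(6, 10, 15))  # the 9 combinations

dbscan.runs <- lapply(seq_len(nrow(params)), function(i) {
  fpc::dbscan(gapminder.SoCL.2011.s, eps = params$eps[i], MinPts = params$MinPts[i])
})

for (i in seq_len(nrow(params))) {
  print(GGally::ggpairs(gapminder.SoCL.2011[, c(3:7)],
                        ggplot2::aes(color = as.factor(dbscan.runs[[i]]$cluster)),
                        diag = list(continuous = my_dens)))
}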

The noisy observations are shown in red: one immediate insight is that the number of outlying observations decreases as \(\varepsilon^*\) increases, which is as expected. Another insight is that the number of noisy observations increases as \(\textit{minPts}\) increases, which is again not surprising.

If we compare the shape of the DBSCAN clusters with those of the \(k-\)means and HC clusters, we notice that the option of identifying observations as noisy – coupled with the "right" combination of parameters – creates "reasonable" clusters, that is to say, clusters for which we do not have to stretch our ideas about what clusters ought to look like: the problematic observations (like China and India) are simply explained away as outliers.

The various runs find either 1 or 2 stand-alone clusters (as well as noisy observations), but that can change if we use different parameter values.

We can also determine if the cluster observations are core or non-core observations. For instance, in the realization with \(\varepsilon^*=1\) and \(\textit{minPts}=6\), we have

|           | noise | cluster 1 | cluster 2 |
|-----------|------:|----------:|----------:|
| outlier   | \(34\) |           |           |
| reachable |       | \(10\)    | \(17\)    |
| core      |       | \(20\)    | \(103\)   |
| total     | \(\mathbf{34}\) | \(\mathbf{30}\) | \(\mathbf{120}\) |
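A sketch of how such a table can be obtained from the fpc output is shown below; it assumes (as the fpc documentation suggests) that the isseed component of the returned object flags the core observations, and that observations with cluster label 0 belong to the noise cluster.

# cross-tabulating point types for the eps* = 1, minPts = 6 run (dbscan2)
point.type <- ifelse(dbscan2$cluster == 0, "outlier",
                     ifelse(dbscan2$isseed, "core", "reachable"))
table(point.type, cluster = dbscan2$cluster)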

14.4.2 Spectral Clustering

At a fairly coarse level, clustering algorithms are divided among those focusing on compactness and those focusing on connectivity.

Compactness methods are usually variants of \(k\) Nearest Neighbours (\(k\)NN) methods (see Section 13.1.3), and are effective when there are distinct clumps in the data. We can make specific assumptions about the distribution of the different clusters ahead of time (as in the next section), but compactness methods struggle to achieve meaningful results in scenarios where groups are not linearly separable.

In cases where we have little to no knowledge of the dataset, making assumptions about the distributions of clusters can lead to invalid clustering schemes; in such cases, connectivity-based methods have been shown to work reasonably well [2], [278].

Connectivity methods, such as DBSCAN, focus on dividing observations into groups based on their similarity graphs; observations that are quite different in their features (and as such would be differentiated using compactness methods) may end up in the same cluster if there is a chain of sufficiently similar observations linking them.

Connectivity methods require fewer initial assumptions, but their use can be harder to justify mathematically. The validity of such methods can only be determined post hoc. Spectral clustering is a connectivity method that has become quite popular in practice; in a nutshell, we transform the dataset into its similarity graph and convert the latter into an eigenvalue problem. We then solve the eigenvalue problem, convert the solution into a graph cut, and then translate the cut back into dataset clusters (as illustrated in Figure 14.15).


Figure 14.15: Schematics of spectral clustering. We extract the similarity graph of a dataset, which gives rise to an eigenvalue problem (top). The eigenvalue problem is then solved, which suggests an “optimal” graph cut, which in turn leads to data clusters (bottom).

Before we start delving into the spectral clustering algorithm, we must discuss a few concepts relating to graphs and linear algebra.272

Graphs and Cuts

A graph is an object which connects nodes (or vertices) together through edges. The edges have weights and can also be directed. In certain cases, we may assume that all edge weights are identical and bidirectional, which is equivalent to saying that an edge simply represents that a relationship exists.

Airports (vertices) and flight paths (edges) form a graph in transportation networks, as do people (vertices) and relationships (edges) in social networks; the edges can be weighted according to flight frequency and/or directed according to their origin and destination, say, in the transportation example; in the social network example, they could be weighted according to frequency of communication and/or directed according to who follows who on some app.

The link with clustering is that once a similarity measure \(w\) has been selected, a dataset can be represented by a similarity graph \(G=(V,E,W)\):

  1. observations \(\mathbf{x}\) correspond to vertices \(v\in V\);

  2. if \(i\neq j\), vertices \(v_i,v_j\in V\) are connected by an edge \(e_{i,j}=1\) if the similarity weight \(w_{i,j}=w(\mathbf{x}_i,\mathbf{x}_j)>\tau\) for a predetermined threshold \(\tau\in [0,1)\), and by no edge (\(e_{i,j}=0\)) otherwise;

  3. the edges \((e_{i,j})\) form the adjacency matrix \(E\);

  4. the weights \((w_{i,j})\) form the similarity matrix \(W\);

  5. the (diagonal) degree matrix \(D\) provides information about the number of edges attached to a vertex: \(d_{i,i}=\sum_{j=1}^ne_{i,j}\).

Note that, by convention, \(w_{i,i}=0\) for all \(i\).

As an example, we could use the Gower similarity measure \[w(\mathbf{x}_i,\mathbf{x}_j)=1-\frac{1}{p}\sum_{k=1}^p\dfrac{|x_{i,k}-x_{j,k}|}{\text{range of $k$th feature in $\mathbf{X}$}}\] on the dataset found in Figure 14.15; the ranges of \(X_1\) and \(X_2\) are both \(r_1=r_2=3\), so that \[\begin{aligned} w_{3,4}&=w_{4,3}=w(\mathbf{x}_3,\mathbf{x}_4)=1-\frac{1}{2}\left(\dfrac{|x_{3,1}-x_{4,1}|}{r_1}+\dfrac{|x_{3,2}-x_{4,2}|}{r_2}\right)\\ &=1-\frac{1}{2}\left(\dfrac{|2-2|}{3}+\dfrac{|0-2|}{3}\right)=1-\frac{1}{2}\cdot\frac{2}{3}=\frac{2}{3};\end{aligned}\] the similarity matrix as a whole is \[W=\begin{pmatrix} 0 & 5/6 & 1/2 & 1/2 & 5/6 & 1/6 \\ 5/6 & 0 & 2/3 & 1/3 & 2/3 & 0 \\ 1/2 & 2/3 & 0 & 2/3 & 1/3 & 1/3 \\ 1/2 & 1/3 & 2/3 & 0 & 2/3 & 2/3 \\ 5/6 & 2/3 & 1/3 & 2/3 & 0 & 1/3 \\ 1/6 & 0 & 1/3 & 2/3 & 1/3 & 0 \end{pmatrix}.\] If we use a threshold value of \(\tau=0.6\), say, then the adjacency matrix is \[E=\begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix},\] and the degree matrix is \[D=\begin{pmatrix} 2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.\] The degree matrix can also be read directly from the similarity graph (which depends on the threshold \(\tau\)), by counting the number of edges at each node (see Figure 14.15).
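The computations of \(W\), \(E\), and \(D\) are easily reproduced in R; the sketch below does so for a generic small two-feature dataset (the coordinates are illustrative and are not claimed to be those of the dataset in Figure 14.15).

X.graph <- rbind(c(0, 1), c(1, 1), c(2, 0), c(2, 2), c(0, 2), c(3, 3))  # illustrative coordinates
n <- nrow(X.graph)
ranges <- apply(X.graph, 2, function(col) diff(range(col)))             # feature ranges

W <- matrix(0, n, n)                              # Gower similarity matrix, w_ii = 0 by convention
for (i in 1:n) for (j in 1:n) if (i != j)
  W[i, j] <- 1 - mean(abs(X.graph[i, ] - X.graph[j, ]) / ranges)

tau <- 0.6
E <- (W > tau) * 1                                # adjacency matrix at threshold tau
D <- diag(rowSums(E))                             # degree matrix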

A graph cut is the process by which we remove edges from the graph and separate the vertices into groups (or sub-graphs).

The clustering task is to separate the nodes into multiple groups by minimizing the total weight of the edges we have to break in the process (i.e., making sure that the groups are as dissimilar as possible). This is also known as the minimum cut problem (MinCut).273

This task is NP-hard, which means that there is no known efficient procedure that is guaranteed to find the optimal cut; in the worst case, we are no better off than testing every possible cut and keeping the one with minimum weight. This is problematic: for a dataset with \(n\) observations, the number of cuts grows like \(2^n\) (even when we only consider \(2-\)cuts); when \(n\) is relatively small, the overall number of cuts to consider remains manageable, but for nearly all reasonable datasets, the size of \(n\) turns this task into an exercise in futility.

The spectral clustering approach generalizes the MinCut problem (or any of its variants) by imposing some properties on the similarity graph to ensure that we can approximate the true MinCut solution in a computationally efficient manner.274

Formally, the MinCut problem involves finding a partition \(\{A_1,...,A_k\}\) of \(G\) which minimizes the objective function \[\text{Cut}(A_1,...,A_k) = \frac{1}{2}\sum_{i=1}^{k}\mathcal{W}(A_i, \overline{A_i}),\] where \[\mathcal{W}(A,B):=\sum_{i \in A, j \in B} w_{i,j}\] and \(\overline{A}\) is the (set-theoretic) complement of \(A\). The factor \(\frac{1}{2}\) is there to remove double-counted edges.

Normalized Cut

The spectral clustering approach solves the Normalized Cut (NCut) problem, which is similar to the MinCut problem except that we are minimizing the weight of edges escaping a cluster relative to the total weight of the cluster.275

In the NCut problem, the objective function is \[J_{\text{NCut}}(A,B)=\text{Cut}(A,B)\left(\frac{1}{\text{Vol}(A)}+\frac{1}{\text{Vol}(B)}\right),\] where \[\text{Vol}(C)=\sum_{i \in C}w_{i,*};\] in a first pass, we seek to minimize \(J_{\text{NCut}}\) against the set of all possible partitions \((A,B)\) of \(G\). The procedure can be repeated as often as necessary on the cluster sub-graphs.

Intuitively, \(J_{\text{NCut}}\) is small when the observations within each sub-graph are similar to one another (\(\text{Vol}(A),\text{Vol}(B)\) are large) and the observations across are dissimilar to one another (\(\text{Cut}(A,B)\) is small).

On the plus side, NCut takes into consideration the size of the partitioned groups and the intra-group variance, and it tends to avoid isolating vertices; but it is not any easier to solve than the MinCut problem. So why do we even bring it up in the first place? As it happens, we can provide an approximation to the NCut solution using purely algebraic means.
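To fix ideas, a few lines of R suffice to evaluate \(\text{Cut}(A,\overline{A})\) and \(J_{\text{NCut}}(A,\overline{A})\) for a candidate 2-partition, using the similarity matrix W built in the earlier sketch; the partition below is arbitrary and only serves as an illustration.

cut.weight <- function(W, A, B) sum(W[A, B])       # sum of w_{i,j} over i in A, j in B

J.ncut <- function(W, A) {
  B <- setdiff(seq_len(nrow(W)), A)                # the complement of A
  vol <- function(C) sum(W[C, ])                   # Vol(C) = sum of w_{i,*} over i in C
  cut.weight(W, A, B) * (1 / vol(A) + 1 / vol(B))
}

A <- c(1, 2, 5)                                    # arbitrary candidate cluster
J.ncut(W, A)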

Similarity and Degree Matrices, Revisited

There are different ways to construct a graph representing the relationships between the dataset’s observations. We can use:

  • fully connected graphs, where all vertices having non-zero similarities are connected to each other;

  • \(r-\)neighbourhood graphs, where each vertex is only connected to the vertices falling inside a ball of radius \(r\) (according to some distance metric \(d\)), where \(r\) has to be tuned in order to catch the local structure of data;

  • \(k\) nearest neighbours graphs (and variants), where each vertex is connected to its \(k\) nearest neighbours (again, according to some distance metric \(d\)), with \(k\) pre-selected, and

  • mixtures of \(r-\)neighbourhood and \(k\)NN graphs, to better capture sparsity in the data.

The similarity measure \(w\) is usually picked from a list that includes: Gaussian (most common), cosine, fuzzy, Euclidean, Gower, etc. The similarity matrix \(W\) is symmetric and has zeros along the diagonal; its non-diagonal entries represent the similarity strength between the corresponding graph vertices (and so between the corresponding observations in the dataset).

We have discussed previously how to build the adjacency matrix \(E\) from \(W\) and a threshold \(\tau \in [0,1)\). The only component of a graph that the similarity matrix does not directly capture is the degree of each vertex, i.e., the number of edges incident to it (we view the similarity graph as undirected). The diagonal of the degree matrix \(D\) holds that information for each vertex. We can combine \(W\) and \(D\) (or \(E\) and \(D\)) to create a matrix \(L\) known as the Laplacian, which has properties linked to the topology of the similarity graph.

Laplacian

The Laplacian of a graph is defined by \[L_0=D-\Theta,\quad \Theta \in \{E,W\};\] the symmetric Laplacian by \[L_S=D^{-1/2}L_0D^{-1/2}=\mathbf{I}_n-D^{-1/2}\Theta D^{-1/2},\] and the asymmetric Laplacian by \[L_A=D^{-1}L_0=\mathbf{I}_n-D^{-1}\Theta.\] In all cases, the off-diagonal entries are non-positive; the diagonal entries of \(L_0\) contain the degree of each node (those of \(L_S\) and \(L_A\) are all equal to \(1\)).
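Using the \(E\) and \(D\) computed in the earlier sketch (so with \(\Theta=E\)), the three Laplacians can be built as follows; this assumes that no vertex is isolated, so that \(D\) is invertible.

L0 <- D - E                                        # the Laplacian
D.is <- diag(1 / sqrt(diag(D)))                    # D^{-1/2} (assumes no isolated vertices)
LS <- D.is %*% L0 %*% D.is                         # symmetric Laplacian: I - D^{-1/2} E D^{-1/2}
LA <- diag(1 / diag(D)) %*% L0                     # asymmetric Laplacian: I - D^{-1} E

round(eigen(LS, symmetric = TRUE)$values, 6)       # real, non-negative, smallest is 0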

The Laplacians have the following useful properties:

  • \(L_0\), \(L_S\) are symmetric; \(L_A\) is not necessarily so;276

  • all their eigenvalues are real and non-negative;

  • every row of \(L_0\) and \(L_A\) adds up to \(0\) (provided that \(D\) is built from the same \(\Theta\)), which means that \(\lambda_0=0\) is the smallest eigenvalue of each Laplacian (hence they are singular and cannot be inverted);

  • the number of connected components in the graph is the dimension of the nullspace of the Laplacian associated to \(\lambda_0=0\) (which may provide a first approximation to the number of clusters in \(\mathbf{X}\)), and

  • the second smallest eigenvalue gives the graph’s sparsest cut.277

Spectral Clustering Algorithm

In the case of two clusters, the objective function \(J_{\text{NCut}}\) is minimized when finding the eigenvector \(\mathbf{f}\) corresponding to the smallest positive eigenvalue of \(L_S\), also known as the spectral gap (this notion will also play a role in Section 15.4.3).

The clustering in the original data is recovered by sending \(\mathbf{x}_i\) to \(A\) when \(f_i>0\) and to \(B\) otherwise. This deterministic algorithm is a special case of the spectral clustering algorithm [280].

To divide \(\mathbf{X}\) into \(k\) clusters, we follow the steps below (an implementation sketch follows the list):

  1. form a similarity matrix \(W\) and a degree matrix \(D\), using a threshold \(\tau\in[0,1)\);

  2. construct a Laplacian matrix \(L_{\xi}\), using \(\Theta=W\);

  3. compute the first \(k\) eigenvectors \(\{\mathbf{u}_1,...,\mathbf{u}_k\}\) of \(L_{\xi}\) corresponding to the \(k\) smallest positive eigenvalues of \(L_{\xi}\);

  4. construct the \(n\times k\) matrix \(\mathbf{U}\) containing the vectors \(\{\mathbf{u}_1,...,\mathbf{u}_k\}\) as columns;

  5. normalize the rows of \(\mathbf{U}\) into a matrix \(\mathbf{Y}\) with rows \(\{\mathbf{y}_1,\ldots, \mathbf{y}_n\}\) having unit length;

  6. cluster the rows of \(\mathbf{Y}\) into \(k\) clusters;

  7. assign \(\mathbf{x}_i\) to cluster \(j\) of \(\mathbf{X}\) if \(\mathbf{y}_i\) was assigned to cluster \(j\) in the preceding step.
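A bare-bones sketch of these steps, applied to the similarity matrix \(W\) of the earlier sketch, might look as follows; we use the symmetric Laplacian in step 2, keep the eigenvectors associated with the \(k\) smallest eigenvalues (including the zero eigenvalue, a common variant of step 3), and use \(k-\)means in step 6. This is only an illustration, not a production implementation.

spectral.cluster <- function(W, k, tau = 0.6) {
  n <- nrow(W)
  E <- (W > tau) * 1                               # step 1: adjacency ...
  D <- diag(rowSums(E))                            # ... and degree matrices
  D.is <- diag(1 / sqrt(diag(D)))                  # assumes every vertex has at least one edge
  LS <- diag(n) - D.is %*% W %*% D.is              # step 2: symmetric Laplacian, Theta = W
  eig <- eigen(LS, symmetric = TRUE)               # eigenvalues returned in decreasing order
  U <- eig$vectors[, (n - k + 1):n]                # steps 3-4: eigenvectors of the k smallest eigenvalues
  Y <- U / sqrt(rowSums(U^2))                      # step 5: rows normalized to unit length
  kmeans(Y, centers = k, nstart = 25)$cluster      # steps 6-7: cluster the rows of Y
}

spectral.cluster(W, k = 2)                         # illustration on the running example's W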

Spectral clusters for the dataset of Figure 14.15, computed using the Laplacian and symmetric Laplacian, are shown in Figure 14.16.


Figure 14.16: Two clusters for the artificial dataset: simple Laplacian (left); symmetric Laplacian (right).

From an experimental perspective, spectral clustering provides an attractive approach because it is easy to implement and reasonably fast, especially for sparse data sets: it is a graph partitioning problem that makes no initial assumptions on the form of the data clusters.

Spectral clustering has variants, which depend on the many choices that can be made at various points in the process:

  1. pre-processing (choice of: number of clusters \(k\), similarity measure \(w\), threshold \(\tau\));

  2. spectral representation (choice of Laplacian);

  3. clustering algorithm (choice of compactness-based, potentially non-deterministic algorithm to unleash on the rows of the representation \(\mathbf{Y}\)).

The NJW algorithm uses \(L_S\) for the spectral representation and \(k-\)means as a clustering approach. It can be interpreted as kernelized \(k-\)means: if we select a kernel which transforms the points to their mapped value in the Laplacian of the graph, then we (almost directly) obtain spectral clustering [279].278

In Figure 14.17, the different outcomes of \(k-\)means and NJW are illustrated on the spirals dataset.


Figure 14.17: Comparing \(2-\)means (middle) and spectral clustering with \(k=2\) (right) on the spirals dataset (left).
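A comparison along these lines can be reproduced with a few lines of R; the sketch below assumes that the kernlab package (which ships with the spirals dataset) is installed.

library(kernlab)
data(spirals)                                      # two intertwined spirals in the plane

km <- kmeans(spirals, centers = 2, nstart = 25)    # compactness-based
sc <- kernlab::specc(spirals, centers = 2)         # connectivity-based (NJW-style)

par(mfrow = c(1, 2))
plot(spirals, col = km$cluster, pch = 19, main = "2-means")
plot(spirals, col = sc, pch = 19, main = "spectral clustering, k = 2")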

Practical Details and Limitations

The most obvious practical detail in the implementation of spectral clustering is related to the construction of the similarity graph. In general, there is virtually no theoretical justification for determining what type of similarity graph to use; even when a type has been selected, it can be quite difficult to choose appropriate parameter values.

In spectral clustering, there are considerations in favour of using a sparse similarity/adjacency matrix: we seek to strike a balance between a Laplacian which is too densely connected, and one for which almost all of the observations are seen as dissimilar to one another. Another issue relates to the computational challenge of finding the eigenvalues of the Laplacian. This can be done relatively efficiently if the matrix is sparse enough, which suggests using a relatively high threshold \(\tau\); there are methods which help spectral clustering automatically tune for the best parameter values (including \(\tau\)), but they take up a significant amount of resources [280].

Spectral clustering methods are extremely effective because they do not require assumptions about distributions and centers, are fairly easy to implement, and are transparent and interpretable.

However, they suffer from some of the same drawbacks as other clustering methods, namely when it comes to:

  • selecting initial parameter values,

  • run-times that do not scale well to larger datasets, and

  • determining optimal ways to visualize the results.

As in all clustering scenarios, analysts are faced with decisions at various levels of the process; they must be prepared to run multiple algorithms, in multiple configurations, in order to get a sense for the data structure (some strategies specific to spectral clustering are presented in [280]).

Examples

In a first example, spectral clustering is used to segment greyscale images into different segments based on contrasting colours [281]. Figure 14.18 shows instances with high contrast, with fairly decent segmentation performance using NCut, Self-Tuning SC [282], and a proposed SC algorithm [281]; Figure 14.19 shows other instances with less contrast (resulting in a poorer segmentation with the same methods); Figure 14.20 shows the comparison in segmentations using the proposed SC algorithm when the same image is presented at different resolutions.


Figure 14.18: High contrast image segmentation with spectral clustering [281].


Figure 14.19: Low contrast image segmentation with spectral clustering [281].


Figure 14.20: Spectral clustering image segmentation of images at different resolutions [281].


In the second example, consider a dataset of \(n=250\) time series, with \(N=60\) entries each (see below).

The distance \(d\) between two time series is the average absolute gap between series: \[d(\mathbf{x}_i,\mathbf{x}_j)=\frac{1}{60}\sum_{\ell=1}^{60}|x_{i,\ell}-x_{j,\ell}|.\] We build the Gaussian similarity measure \[w(\mathbf{x}_i,\mathbf{x}_j)=\exp\left(-\frac{d^2(\mathbf{x}_i,\mathbf{x}_j)}{2\sigma^2}\right),\] and we use the following parameter values \[\sigma^2=300,\quad \tau=0.9,\quad k=5.\] The spectral clustering results are quite appealing, as can be seen in the first realization of the NJW algorithm with \(k=5\) clusters (see Figure 14.21). Note however that not every run of the algorithm yields an outcome that we would consider meaningful (see Figure 14.22).
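A sketch of how the distance and similarity matrices for such a collection can be built in R follows; the series below are simulated stand-ins (the original dataset is not reproduced here).

set.seed(0)
n <- 250; N <- 60
ts.sim <- matrix(rnorm(n * N), nrow = n)                    # simulated stand-in for the real series

d.mat <- as.matrix(dist(ts.sim, method = "manhattan")) / N  # average absolute gap between series

sigma2 <- 300; tau <- 0.9
W.ts <- exp(-d.mat^2 / (2 * sigma2))                        # Gaussian similarity measure
diag(W.ts) <- 0                                             # w_{i,i} = 0 by convention
E.ts <- (W.ts > tau) * 1                                    # adjacency matrix at threshold tau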


Figure 14.21: A realization of spectral clustering, using the NJW algorithm with \(k=5\); the original dataset is shown in blue. We see that the NJW algorithm has captured 5 clusters with different time series characteristics, which is an encouraging result.


Figure 14.22: A second realization of spectral clustering, using the NJW algorithm with \(k=5\); again, the original dataset is shown in blue. In this case, the \(k-\)means portion of the algorithm leads to different clusters, which appear to be of lower quality.


In the final example, we once again revisit the (scaled) 2011 Gapminder dataset using Euclidean dissimilarity. We use the kernlab implementation of the NJW algorithm found in specc(), with the default settings. We run one instance of the algorithm for \(k=2\) to \(k=7\) clusters.

set.seed(0) # for replicability
sc.gapminder.2 <- kernlab::specc(as.matrix(gapminder.SoCL.2011.s), 2)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(sc.gapminder.2)),
                diag=list(continuous=my_dens))

sc.gapminder.3 <- kernlab::specc(as.matrix(gapminder.SoCL.2011.s), 3)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(sc.gapminder.3)),
                diag=list(continuous=my_dens))

sc.gapminder.4 <- kernlab::specc(as.matrix(gapminder.SoCL.2011.s), 4)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(sc.gapminder.4)),
                diag=list(continuous=my_dens))

sc.gapminder.5 <- kernlab::specc(as.matrix(gapminder.SoCL.2011.s), 5)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(sc.gapminder.5)),
                diag=list(continuous=my_dens))

sc.gapminder.6 <- kernlab::specc(as.matrix(gapminder.SoCL.2011.s), 6)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(sc.gapminder.6)),
                diag=list(continuous=my_dens))

sc.gapminder.7 <- kernlab::specc(as.matrix(gapminder.SoCL.2011.s), 7)
GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(sc.gapminder.7)),
                diag=list(continuous=my_dens))

We can take stock of our attempts to cluster the Gapminder data up to now: in none of the \(k-\)means, hierarchical, DBSCAN, and spectral runs have we found what one might call natural groups. Perhaps we have not hit on the right method yet, but it could also mean that the task is futile and no such groups exist in the first place.

14.4.3 Probability-Based Clustering

In contrast with the model-free approaches of density-based clustering and spectral clustering, probability-based clustering attempts to optimize the fit between the observed data and some mathematical model of clustering, under the assumption that the data is generated by a number of underlying probability distributions.

In practice, we assume that clusters are represented by parametric probability distributions, and the objective is to learn the parameters for each of these distributions. This assumption allows us to use probability theory to derive learning formulas for the parameters.279

Mixture Models

The main underlying assumption of mixture models is that each observation is drawn (or generated) from one of several mechanisms (or components). In model-based clustering, we learn the parameters that provide the optimal fit to the data; in other words, we make a series of predictions about which components generated each of the observations.

This naturally leads to clusters, all observations generated by a given component belonging to the same cluster. Formally, we let \[\mathbf{X}=\begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}\in M_{n,p}(\mathbb{R}).\] Assume that there are \(k\) mechanisms that generate data, and that each of them is determined by a vector of parameters \(\boldsymbol{\theta}_{\ell}\), \(1\leq \ell\leq k\).

For \(1\leq j\leq n\), denote the probability of \(\mathbf{x}_j\) being generated by the \(\ell-\)th mechanism, \(1\leq \ell\leq k\), by \[P(\mathbf{x}_j\mid \boldsymbol{\theta}_\ell).\] The mixture vector \(\boldsymbol{\pi}=(\pi_1,\ldots,\pi_k)\) is a vector such that \(\pi_{\ell}\in [0,1]\) for all \(1\leq \ell\leq k\) and \(\pi_1+\cdots+\pi_k=1\).

If \(z_j\sim \text{C}(k;\boldsymbol{\pi})\), i.e., if \(P(z_j=\ell)=\pi_\ell\), for \(1\leq \ell \leq k\), and if \[P(\mathbf{x}_j\mid z_j=\ell)=P(\mathbf{x}_j\mid \boldsymbol{\theta}_{\ell})\quad \forall j,\ell,\] then the probability of observing \(\mathbf{x}_j\) is \[P(\mathbf{x}_j)=\sum_{\ell=1}^k\pi_{\ell}P(\mathbf{x}_j\mid\boldsymbol{\theta}_{\ell})=\sum_{\ell=1}^kP(z_j=\ell)P(\mathbf{x}_j\mid z_j=\ell),\] according to the Law of Total Probability.

In this set-up, we interpret \(z_j\) as the cluster label for \(\mathbf{x}_j\). Alternatively, we could use \[\mathbf{z}_j\in\{0,1\}^k,\quad \|\mathbf{z}_j\|_2=1\] to denote the cluster signature of \(\mathbf{x}_j\). The norm condition implies that exactly one of the components of \(\mathbf{z}_j\) is \(1\); all others are \(0\). For instance, if there are \(k=5\) mechanisms (clusters) in the data and \(\mathbf{x}_j\in C_4\), then \(\mathbf{z}_j=(0,0,0,1,0)\).280

If we write \[P(\mathbf{z}_j)=\pi_1^{z_{j,1}}\times \cdots \times \pi_k^{z_{j,k}}=\prod_{\ell=1}^k \pi_{\ell}^{z_{j,\ell}}\] and \[P(\mathbf{x}_j\mid \mathbf{z}_j )=P(\mathbf{x}_j\mid \boldsymbol{\theta}_{1})^{\ z_{j,1}}\times \cdots \times P(\mathbf{x}_j\mid \boldsymbol{\theta}_{k})^{\ z_{j,k}}=\prod_{\ell=1}^k P(\mathbf{x}_j\mid \boldsymbol{\theta}_\ell)^{\ z_{j,\ell}},\] we recover the mixture model \[P(\mathbf{x}_j)=\sum_{\ell=1}^k\pi_\ell P(\mathbf{x}_j\mid \boldsymbol{\theta}_\ell)=\sum_{\ell=1}^kP(\mathbf{z}_j\in C_\ell)P(\mathbf{x}_j\mid \mathbf{z}_j \in C_{\ell}).\]

Generative Process

In practice, then, we can imagine that the dataset \(\mathbf{X}\) is generated as follows (a small sketch mimicking this process appears after the list). For \(1\leq j\leq n\):

  1. draw a cluster signature \(\mathbf{z}_j\sim \mathcal{G}_k(\boldsymbol{\pi})=\text{Mult}_k(\boldsymbol{\pi})\), and

  2. draw an observation \(\mathbf{x}_j\) from the corresponding mechanism according to \(P(\mathbf{x}_j \mid \mathbf{z}_j)\).
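This generative process is easy to mimic in R; the sketch below draws \(n\) observations from a \(k=3\) component bivariate Gaussian mixture with arbitrary (illustrative) parameters, using the mvtnorm package.

library(mvtnorm)                                         # assumes the mvtnorm package is installed
set.seed(0)

n <- 300
pi.mix <- c(0.5, 0.3, 0.2)                               # mixture vector
mu.mix <- list(c(0, 0), c(4, 0), c(2, 3))                # component means (illustrative)
Sigma.mix <- list(diag(2), 0.5 * diag(2), diag(2))       # component covariances (illustrative)

z.mix <- sample(1:3, size = n, replace = TRUE, prob = pi.mix)   # step 1: cluster signatures
X.mix <- t(sapply(z.mix, function(l)
  rmvnorm(1, mean = mu.mix[[l]], sigma = Sigma.mix[[l]])))      # step 2: observations

plot(X.mix, col = z.mix, pch = 19)    # the labels z.mix are hidden from the clustering algorithm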

But we usually do not have access to this generative process; instead, we are given \(\mathbf{X}\) and the clustering task is to determine how likely it is that component \(C_\ell\), \(1\leq \ell\leq k\), is responsible for observation \(\mathbf{x}_j\), \(1\leq j\leq n\).

To do so, we need to compute the responsibilities \[\gamma(z_{j,\ell})=P(\mathbf{z}_j\in C_\ell\mid\mathbf{x}_j),\quad \forall j,\ell.\] This is difficult to do directly; we use Bayes’ Theorem to provide an easier handle on the computations: \[\begin{aligned} \gamma(z_{j,\ell})&=P(\mathbf{z}_j\in C_\ell\mid\mathbf{x}_j) = \dfrac{P(\mathbf{z}_j\in C_\ell)P(\mathbf{x}_j\mid \mathbf{z}_j\in C_{\ell})}{P(\mathbf{x}_j)} \\ &= \dfrac{P(\mathbf{z}_j\in C_\ell)P(\mathbf{x}_j\mid \mathbf{z}_j\in C_{\ell})}{{\sum_{\nu=1}^kP(\mathbf{z}_j\in C_\nu)P(\mathbf{x}_j\mid \mathbf{z}_j \in C_{\nu})}} = \dfrac{\pi_\ell P(\mathbf{x}_j\mid \boldsymbol{\theta}_{\ell})}{{\sum_{\nu=1}^k \pi_\nu P(\mathbf{x}_j\mid \boldsymbol{\theta}_\nu)}}. \end{aligned}\]

The clustering objective is to infer \(\{\pi_\ell\}_{\ell=1}^k\), \(\{\boldsymbol{\theta}_\ell\}_{\ell=1}^k\) from \(\mathbf{X}\) for a fixed \(k\), to obtain the desired probabilities \(\gamma(z_{j,\ell})\). Denote \[\boldsymbol{\Theta}=\{\pi_1,\ldots,\pi_k,\boldsymbol{\theta}_1,\ldots, \boldsymbol{\theta}_k\}.\]

If we further assume that the \(\mathbf{x}_j\) are independently drawn by the generative process, then \[P(\mathbf{X}\mid \boldsymbol{\Theta})=\prod_{j=1}^n\sum_{\ell=1}^k\pi_\ell P(\mathbf{x}_j\mid \boldsymbol{\theta}_\ell),\] or \[\text{LL}(\boldsymbol{\Theta})=\ln P(\mathbf{X}\mid \boldsymbol{\Theta})=\sum_{j=1}^n\ln \left( \sum_{\ell=1}^k\pi_\ell P(\mathbf{x}_j\mid \boldsymbol{\theta}_\ell)\right),\] by construction.

The maximum likelihood estimator (MLE) of \(\boldsymbol{\Theta}\) is \[\boldsymbol{\Theta}_{\text{MLE}}=\arg\max_{\boldsymbol{\Theta}}\big\{\ln P(\mathbf{X}\mid \boldsymbol{\Theta})\big\};\] if we have information about the prior \(P(\boldsymbol{\Theta})\), then we may use the maximum a posteriori estimator (MAP) instead: \[\boldsymbol{\Theta}_{\text{MAP}}=\arg\max_{\boldsymbol{\Theta}}\big\{\ln P(\mathbf{X}\mid \boldsymbol{\Theta})+\ln P(\boldsymbol{\Theta})\big\}.\] Whether we use MLE or MAP depends, in large part, on the form taken by the component distributions.

Gaussian Mixture Models

A standard assumption is that all clusters are generated by Gaussian mechanisms, which is to say that \(P(\mathbf{x}_j\mid \boldsymbol{\theta}_\ell)\) arises from a multivariate Gaussian distribution: \[\begin{aligned} \mathcal{N}&(\mathbf{x}_j\mid \boldsymbol{\mu}_\ell,\boldsymbol{\Sigma}_{\ell})=\dfrac{1}{\sqrt{(2\pi)^p|\boldsymbol{\Sigma}_\ell|}}\exp\left(-\textstyle{\frac{1}{2}}(\mathbf{x}_j-\boldsymbol{\mu}_\ell)^{\!\top}\boldsymbol{\Sigma^{-1}_{\ell}}(\mathbf{x}_j-\boldsymbol{\mu}_\ell)\right),\end{aligned}\] where \(\boldsymbol{\mu}_\ell\in \mathbb{R}^p\) and \(\boldsymbol{\Sigma}_\ell\) is a symmetric positive semi-definite matrix.

Thus, if there are \(k\) components, then \[P(\mathbf{x}_j\mid \boldsymbol{\Theta})=\sum_{\ell=1}^k\pi_\ell\mathcal{N}(\mathbf{x}_j\mid \boldsymbol{\mu}_\ell,\boldsymbol{\Sigma}_{\ell})\] and \[\text{LL}(\boldsymbol{\Theta})=\ln P(\mathbf{X}\mid \boldsymbol{\Theta})=\sum_{j=1}^n \ln \left(\sum_{\ell=1}^k\pi_\ell\mathcal{N}(\mathbf{x}_j\mid \boldsymbol{\mu}_\ell,\boldsymbol{\Sigma}_{\ell})\right).\]

It is straightforward to show that \[\nabla \text{LL}(\boldsymbol{\mu}_\ell)=\boldsymbol{\Sigma}^{-1}_\ell\sum_{j=1}^n\gamma(z_{j,\ell})(\mathbf{x}_j-\boldsymbol{\mu}_\ell),\] so that the MLE estimators for the mean vectors are \[\hat{\boldsymbol{\mu}}_\ell=\dfrac{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell})\ \mathbf{x}_j}}{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell})}};\] that this is a maximizer for \(\text{LL}(\boldsymbol{\Theta})\) is due to the positive semi-definiteness of \(\boldsymbol{\Sigma}_{\ell}\).

Thus \(\hat{\boldsymbol{\mu}}_{\ell}\) is a weighted mean of the observations of \(\mathbf{X}\), with weights corresponding to the posterior probability \(\gamma(z_{j,\ell})\) that the \(\ell-\)th component was responsible for generating \(\mathbf{x}_j\).

Simultaneously, we can show that \[\nabla\text{LL}(\boldsymbol{\Sigma}_{\ell})= \sum_{j=1}^n\dfrac{\pi_\ell}{P(\mathbf{x}_j\mid \boldsymbol{\Theta})}\cdot \dfrac{\partial \mathcal{N}(\mathbf{x}_j\mid \boldsymbol{\mu}_{\ell},\boldsymbol{\Sigma}_\ell)}{\partial \boldsymbol{\Sigma}_{\ell}};\] slightly more complicated manipulations show that the MLE estimators for the covariance matrices are also weighted averages: \[\hat{\boldsymbol{\Sigma}}_{\ell}=\dfrac{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell})(\mathbf{x}_j-\hat{\boldsymbol{\mu}}_{\ell})(\mathbf{x}_j-\hat{\boldsymbol{\mu}}_{\ell})^{\!\top}}}{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell})}}.\]

Finally, to obtain the mixture probabilities \(\pi_\ell\), we must maximize \(\text{LL}(\Theta)\) with respect to \(\boldsymbol{\pi}\), subject to \(\pi_{\ell} \in [0,1],\) \[\pi_1+\cdots+\pi_k=1;\] we can use Lagrange multipliers to show that the MLE estimates of the mixture probabilities are also an average: \[\hat{\pi}_{\ell}=\frac{1}{n}\sum_{j=1}^n\gamma(z_{j,\ell}).\]

So we have nice expressions for the MLE estimates \(\hat{\boldsymbol{\Theta}}\), but there is a problem: we need the clustering probabilities \(\gamma(z_{j,\ell})\) in order to provide the MLE estimates, but these depend on the MLE estimates themselves.

Expectation-Maximization Algorithm

While there is no closed-form solution allowing us to express the cluster signatures directly in terms of the observed data \(\mathbf{X}\), there is a simple iterative solution based on the Expectation-Maximization algorithm for Gaussian Mixture Models (a from-scratch sketch of the loop follows the algorithm):

Input: \(\mathbf{X}\)
Output: \(\boldsymbol{\Theta}^*\) which maximizes \(\text{LL}(\boldsymbol{\Theta})\)

  1. Initialize \(\boldsymbol{\Theta}^{[0]}=\left\{\boldsymbol{\mu}^{[0]}_{\ell},\boldsymbol{\Sigma}^{[0]}_{\ell},\pi^{[0]}_{\ell}\right\}_{\ell=1}^k\) and set \[\text{LL}^{[0]}=\text{LL}(\boldsymbol{\Theta}^{[0]});\]

For \(i=0\) to max_step, do:

  2. E(xpectation)-step: compute the responsibilities \[\gamma(z_{j,\ell}^{[i]})= \dfrac{\pi_\ell^{[i]} \mathcal{N}(\mathbf{x}_j\mid \boldsymbol{\mu}^{[i]}_{\ell},\boldsymbol{\Sigma}^{[i]}_{\ell})}{{\sum_{\nu=1}^k \pi^{[i]}_\nu \mathcal{N}(\mathbf{x}_j\mid \boldsymbol{\mu}^{[i]}_{\nu},\boldsymbol{\Sigma}^{[i]}_{\nu})}},\quad \forall j,\ell;\]

  3. M(aximization)-step: update the parameters \[\begin{aligned}\boldsymbol{\mu}^{[i+1]}_\ell&=\dfrac{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})\ \mathbf{x}_j}}{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})}},\quad \forall \ell;\\ \boldsymbol{\Sigma}^{[i+1]}_{\ell}&=\dfrac{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})(\mathbf{x}_j-\boldsymbol{\mu}^{[i]}_{\ell})(\mathbf{x}_j-\boldsymbol{\mu}^{[i]}_{\ell})^{\!\top}}}{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})}},\quad \forall \ell,\\ \pi_{\ell}^{[i+1]}&=\frac{1}{n}\sum_{j=1}^n \gamma(z_{j,\ell}^{[i]}),\quad \forall \ell;\end{aligned}\]

  4. Set \(\text{LL}^{[i+1]}=\text{LL}(\boldsymbol{\Theta}^{[i+1]})\) and check for convergence according to some convergence criterion \[(\| \boldsymbol{\Theta}^{[i]}-\boldsymbol{\Theta}^{[i+1]}\|<\varepsilon,\quad \text{say}):\] if satisfied, set \(\boldsymbol{\Theta}^{*}=\boldsymbol{\Theta}^{[i+1]}\); otherwise, repeat steps 2 to 4.
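For completeness, here is a compact (and deliberately naive) sketch of this EM loop for a Gaussian mixture, written from scratch; it uses the mvtnorm package for the multivariate Gaussian density, starts from random responsibilities (so the loop begins with an M-step), and omits the safeguards discussed below.

library(mvtnorm)

em.gmm <- function(X, k, max.step = 100, eps = 1e-6) {
  n <- nrow(X)
  set.seed(0)
  gamma <- matrix(runif(n * k), n, k)                    # random initial responsibilities
  gamma <- gamma / rowSums(gamma)
  LL.old <- -Inf
  for (i in 1:max.step) {
    # M-step: update the parameters from the responsibilities
    Nk  <- colSums(gamma)
    pis <- Nk / n
    mus <- lapply(1:k, function(l) colSums(gamma[, l] * X) / Nk[l])
    Sigmas <- lapply(1:k, function(l) {
      Xc <- sweep(X, 2, mus[[l]])                        # centre at the current mean
      crossprod(Xc * gamma[, l], Xc) / Nk[l]             # weighted covariance
    })
    # E-step: recompute the responsibilities
    dens <- sapply(1:k, function(l)
      pis[l] * dmvnorm(X, mean = mus[[l]], sigma = Sigmas[[l]]))
    gamma <- dens / rowSums(dens)
    # log-likelihood and convergence check
    LL <- sum(log(rowSums(dens)))
    if (abs(LL - LL.old) < eps) break
    LL.old <- LL
  }
  list(pi = pis, mu = mus, Sigma = Sigmas, gamma = gamma, LL = LL)
}

# illustration on the simulated mixture data from the earlier sketch
fit <- em.gmm(X.mix, k = 3)
table(true = z.mix, estimated = apply(fit$gamma, 1, which.max))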

There are two main limitations to using EM for GMM:

  • EM is costlier (has a longer run-time) than \(k-\)means, and depending on the initialization, the algorithm may converge to a local critical point which is not necessarily the global maximizer;

  • as the algorithm iterates, two (or more) GMM clusters can collapse into a single GMM cluster.

Note that the EM algorithm can be sped up by first running \(k-\)means and using the mean vector, covariance matrix, and proportion of observations in the \(k-\)means cluster \(C_\ell\) for the initialization of \(\boldsymbol{\mu}^{[0]}_\ell\), \(\boldsymbol{\Sigma}^{[0]}_\ell\), and \(\pi^{[0]}_\ell\) for \(1\leq \ell \leq k\).

The collapsing of clusters can be mitigated by monitoring \(\|\boldsymbol{\Sigma}_{\ell}^{[i]}\|_2\) and randomly resetting \(\boldsymbol{\mu}^{[i]}_\ell\), \(\boldsymbol{\Sigma}^{[i]}_\ell\) when some threshold is reached.

Special Cases and Variants

In a GMM with \(k\) components, if \(\boldsymbol{\Sigma}_{\ell}=\boldsymbol{\Sigma}=\sigma^2 \mathbf{I}_p\) for \(1\leq \ell\leq k\), then \[P(\mathbf{x}_j\mid \boldsymbol{\mu}_{\ell},\boldsymbol{\Sigma})=\frac{1}{\sqrt{(2\pi)^p}\,\sigma^p}\cdot \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x}_j-\boldsymbol{\mu}_{\ell}\|_2^2\right);\] the EM algorithm applied to this special case leads to \[\begin{aligned} \textbf{E-step:}\quad \gamma(z_{j,\ell}^{[i]})&=\frac{\pi_{\ell}^{[i]} \exp\left(-\|\mathbf{x}_j-\boldsymbol{\mu}_{\ell}^{[i]}\|_2^2/2\sigma^2\right)}{\sum_{\nu=1}^k\pi_{\nu}^{[i]}\exp \left(-\|\mathbf{x}_j-\boldsymbol{\mu}_{\nu}^{[i]}\|^2_2/2\sigma^2\right) } \\ \textbf{M-step:}\quad \boldsymbol{\mu}_{\ell}^{[i+1]}&=\dfrac{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})\ \mathbf{x}_j}}{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})}}\\ \pi_{\ell}^{[i+1]}&=\frac{1}{n}\sum_{j=1}^n \gamma(z_{j,\ell}^{[i]}).\end{aligned}\] When \(\sigma\to 0\), we can show that \[\gamma(z_{j,\ell})\to \begin{cases}1 & \text{if $\ell=\arg\min_{\nu}\left\{\|\mathbf{x}_j-\boldsymbol{\mu}_{\nu}\|_2^2\right\}$} \\ 0 & \text{otherwise}\end{cases}\] which is simply the formulation for \(k-\)means. Note that the components do not need to be multivariate Gaussians; there is a general EM algorithm that takes advantage of the concavity of the \(\ln\) function [4].

If the dataset of observations is binary, as may occur in image datasets (each pixel taking on the values 0 or 1, depending on whether the pixel is white or black, say), we can modify GMM so that \(P(\mathbf{x}_j\mid \boldsymbol{\mu}_{\ell})\) arises from a multivariate Bernoulli distribution: \[\mathcal{B}(\mathbf{x}_j\mid \boldsymbol{\mu}_{\ell})=\prod_{\nu=1}^p\mu_{\ell,\nu}^{\ x_{j,\nu}}(1-\mu_{\ell,\nu})^{\ 1-x_{j,\nu}},\] where \(\boldsymbol{\mu}_{\ell}\in [0,1]^p\). Thus, if there are \(k\) components, then \[P(\mathbf{x}_j\mid \boldsymbol{\Theta})=\sum_{\ell=1}^k\pi_\ell\mathcal{B}(\mathbf{x}_j\mid \boldsymbol{\mu}_\ell)\] and \[\text{LL}(\boldsymbol{\Theta})=\ln P(\mathbf{X}\mid \boldsymbol{\Theta})=\sum_{j=1}^n \ln \left(\sum_{\ell=1}^k\pi_\ell\prod_{\nu=1}^p\mu_{\ell,\nu}^{\ x_{j,\nu}}(1-\mu_{\ell,\nu})^{\ 1-x_{j,\nu}} \right).\]

We can find \(\boldsymbol{\Theta}^*\) that maximizes \(\text{LL}(\boldsymbol{\Theta})\) by using the EM algorithm for Bernoulli Mixture Models; applied to this special case, it leads to \[\begin{aligned} \textbf{E-step:}\quad \gamma(z_{j,\ell}^{[i]})&=\dfrac{\pi_\ell^{[i]}\displaystyle{\prod_{\nu=1}^p\left(\mu_{\ell,\nu}^{[i]}\right)^{\ x_{j,\nu}}(1-\mu_{\ell,\nu}^{[i]})^{\ 1-x_{j,\nu}}}}{\displaystyle{\sum_{m=1}^k\pi_m^{[i]}\prod_{\nu=1}^p\left(\mu_{m,\nu}^{[i]}\right)^{\ x_{j,\nu}}(1-\mu_{m,\nu}^{[i]})^{\ 1-x_{j,\nu}}}} \\ \textbf{M-step:}\quad \boldsymbol{\mu}_{\ell}^{[i+1]}&=\dfrac{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})\ \mathbf{x}_j}}{\displaystyle{\sum_{j=1}^n\gamma(z_{j,\ell}^{[i]})}}\\ \pi_{\ell}^{[i+1]}&=\frac{1}{n}\sum_{j=1}^n \gamma(z_{j,\ell}^{[i]}),\end{aligned}\] with initialization \(\pi_{\ell}^{[0]}=\frac{1}{k}\) and \[\boldsymbol{\mu}_{\ell}^{[0]} \sim \prod_{\nu=1}^p\mathcal{U}(0.25,0.75)\] for \(1\leq \ell \leq k\).

Other variants include Generalized EM, Variational EM, and Stochastic EM [4]. Note that the essence of EM methods remains the same for all algorithms: we attempt to “guess” the value of the "hidden" cluster variable \(z_{j,\ell}\) in the E-step, and we update the model parameters in the M-step, based on the approximated responsibilities found in the \(E-\)step.

Interestingly, EM can detect overlapping clusters (unlike \(k-\)means). But most variants share the same limitations: convergence to a global maximizer is not guaranteed; it may be quite slow even when it does converge, and the correct number of components is assumed to be known prior to analysis.

Example

We cluster the 2011 Gapminder dataset using the mclust implementation of EM in R; no parameters need be specified (unless we want to use a different dissimilarity measure).281 This implementation determines the optimal number of clusters using BIC (see Section 12.4.3). We cluster both the raw data and the scaled data, to showcase the impact scaling can have.

We start by determining the number of sources in the raw data:

library(mclust)
set.seed(0)
BIC <- mclustBIC(gapminder.SoCL.2011[,c(3:7)])
plot(BIC)

summary(BIC)
Best BIC values:
             VVE,5       VVV,2       VVE,4
BIC      -3545.915 -3573.00451 -3577.50705
BIC diff     0.000   -27.08943   -31.59198

This suggests that there are 5 clusters, as seen below.

mod1 <- Mclust(gapminder.SoCL.2011[,c(3:7)], x = BIC)
plot(mod1, what = "classification")

The mean vectors and covariance matrices for each cluster are found to be:

summary(mod1, parameters = TRUE)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVE (ellipsoidal, equal orientation) model with 5 components: 

 log-likelihood   n df       BIC       ICL
       -1606.08 184 64 -3545.915 -3573.311

Clustering table:
 1  2  3  4  5 
51 27 52 41 13 

Mixing probabilities:
         1          2          3          4          5 
0.28320179 0.14590868 0.27317375 0.22804391 0.06967186 

Means:
                 [,1]      [,2]      [,3]      [,4]      [,5]
Life Exp    59.834937 81.134281 75.998817 71.888971 67.179530
Inf Mort    58.426776  3.254371  9.913182 24.238438 35.027409
Fert         4.798189  1.691375  1.800049  2.566219  4.014588
log10 Pop    7.061637  7.016423  6.695346  7.133957  5.492939
log10 GDPpc  3.419639  4.606936  4.257569  3.889928  3.491324

Variances:
[,,1]
               Life Exp   Inf Mort       Fert   log10 Pop log10 GDPpc
Life Exp     28.3344610 -55.700455 -0.7444078  0.38472919  1.15550574
Inf Mort    -55.7004554 423.909389 11.8241909  1.11117987 -4.78143371
Fert         -0.7444078  11.824191  1.1663137  0.13060543 -0.21270286
log10 Pop     0.3847292   1.111180  0.1306054  0.26127270 -0.01805256
log10 GDPpc   1.1555057  -4.781434 -0.2127029 -0.01805256  0.19296195
[,,2]
              Life Exp     Inf Mort        Fert    log10 Pop  log10 GDPpc
Life Exp    0.77045283  0.012104972  0.02774975  0.011407709  0.019766044
Inf Mort    0.01210497  0.679738470  0.01905169  0.003200837 -0.004476642
Fert        0.02774975  0.019051687  0.12771916 -0.042927224 -0.012328032
log10 Pop   0.01140771  0.003200837 -0.04292722  0.416389572 -0.018157675
log10 GDPpc 0.01976604 -0.004476642 -0.01232803 -0.018157675  0.015998859
[,,3]
              Life Exp    Inf Mort         Fert   log10 Pop  log10 GDPpc
Life Exp     4.9940940 -1.72216863  0.142340294  0.10257591  0.137637940
Inf Mort    -1.7221686 17.19416942  0.499861409  0.05750027 -0.180062410
Fert         0.1423403  0.49986141  0.115694181 -0.09747328  0.001937248
log10 Pop    0.1025759  0.05750027 -0.097473276  0.79668779 -0.027754402
log10 GDPpc  0.1376379 -0.18006241  0.001937248 -0.02775440  0.071168246
[,,4]
                Life Exp    Inf Mort        Fert   log10 Pop log10 GDPpc
Life Exp      9.91288695 -14.4994297 -0.09127317  0.15574758  0.36288775
Inf Mort    -14.49942972 112.8663448  3.16128196  0.30356930 -1.26453815
Fert         -0.09127317   3.1612820  0.33684903 -0.03319199 -0.04501104
log10 Pop     0.15574758   0.3035693 -0.03319199  0.57090101 -0.02068610
log10 GDPpc   0.36288775  -1.2645382 -0.04501104 -0.02068610  0.10900700
[,,5]
               Life Exp    Inf Mort       Fert   log10 Pop log10 GDPpc
Life Exp     18.0275970 -35.9634327 -0.5031561  0.24042845  0.74329472
Inf Mort    -35.9634327 273.4386404  7.6150659  0.71441466 -3.08374929
Fert         -0.5031561   7.6150659  1.0499241  0.12825306 -0.19073273
log10 Pop     0.2404284   0.7144147  0.1282531  0.15122047 -0.02103882
log10 GDPpc   0.7432947  -3.0837493 -0.1907327 -0.02103882  0.06307893

Let us repeat the procedure on the scaled dataset.

BIC.s <- mclustBIC(scale(gapminder.SoCL.2011[,c(3:7)]))
plot(BIC.s)

summary(BIC.s)
Best BIC values:
            VVV,2       VVE,4       VVI,5
BIC      -1760.55 -1779.31339 -1785.64061
BIC diff     0.00   -18.76374   -25.09096

This suggests that there are 2 clusters, as seen below.

mod2 <- Mclust(gapminder.SoCL.2011[,c(3:7)], x = BIC.s)
plot(mod2, what = "classification")

The mean vectors and covariance matrices for each cluster are found to be:

summary(mod2, parameters = TRUE)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2
components: 

 log-likelihood   n df       BIC       ICL
      -1679.596 184 41 -3573.005 -3584.034

Clustering table:
 1  2 
91 93 

Mixing probabilities:
        1         2 
0.5059039 0.4940961 

Means:
                 [,1]      [,2]
Life Exp    64.226346 77.164028
Inf Mort    45.696249  9.268183
Fert         4.038171  1.860683
log10 Pop    6.906989  6.816294
log10 GDPpc  3.566453  4.310365

Variances:
[,,1]
                 Life Exp    Inf Mort        Fert   log10 Pop log10 GDPpc
Life Exp      46.94600319 -112.051411 -5.17027793 -0.06217152  1.55664144
Inf Mort    -112.05141081  531.145010 22.59825954  2.37700481 -5.67438362
Fert          -5.17027793   22.598260  1.84974433  0.03491542 -0.37634844
log10 Pop     -0.06217152    2.377005  0.03491542  0.62342443 -0.01356435
log10 GDPpc    1.55664144   -5.674384 -0.37634844 -0.01356435  0.17508305
[,,2]
               Life Exp     Inf Mort        Fert    log10 Pop  log10 GDPpc
Life Exp     10.7636945 -13.79843667 -0.40969404  0.508375039  0.789463554
Inf Mort    -13.7984367  34.33063934  1.47575086 -0.094412049 -1.497455669
Fert         -0.4096940   1.47575086  0.17525844 -0.037372594 -0.032046730
log10 Pop     0.5083750  -0.09441205 -0.03737259  0.719418611  0.002308201
log10 GDPpc   0.7894636  -1.49745567 -0.03204673  0.002308201  0.115062583

14.4.4 Affinity Propagation

Affinity propagation (AP) is a fairly recent arrival on the clustering stage [283], [284]; it takes a somewhat novel perspective on clustering although, as might be expected, there are still similarities to other clustering methods, in particular, DBSCAN and \(k-\)means.

AP takes the \(k-\)medoids algorithm as its jumping-off point. Unlike \(k-\)means or EM, \(k-\)medoids does not operate on statistical principles; rather, it selects existing observations to act as exemplars, one per cluster (instead of a mean vector, as in \(k-\)means; see Figure 14.23 for an illustration).


Figure 14.23: Illustration of \(3-\)medoids on an artificial dataset; modified from [283].

The \(k-\)medoids algorithm refines the selection of these exemplars so that, in the final (stable) configuration, the observations assigned to an exemplar are quite similar to it, relative to the other exemplars. As the name suggests, the number of clusters \(k\) must be selected prior to running the algorithm; as is the case with \(k-\)means, \(k-\)medoids is non-deterministic and is sensitive to the initial choice of exemplars and to the similarity measure.

The AP algorithm attempts to overcome the issues arising with \(k-\)medoids, using Bayesian network theory (in particular, belief propagation networks and factor graphs), and treats observations as a connected graph. In this approach, each graph vertex can:

  • communicate with any other vertex, and

  • act as a possible exemplar for other observations.

The selection of exemplars is determined by exchanging real-valued messages between points. Eventually, sets of exemplars and of data points associated with each exemplar emerge from this iterative process, forming clusters. Messages are updated on the basis of fairly simple formulae. As in all clustering contexts, a similarity measure \(s\) must first be selected: for distinct pairs \((i,k)\), \(s(i,k)\) initially represents the suitability of \(k\) as an exemplar of \(i\) (this suitability is updated as “messages” are passed between observations).

Each observation \(k\) is further assigned a preference \(s(k,k)\) that it be chosen as an exemplar. The preference can be constant, to indicate no particular initial preference.

Two types of messages get sent:

  • the availability \(a(i,k)\) sent from \(k\) to \(i\), which reports on the suitability of \(k\) to be an exemplar of \(i\);

  • the responsibility \(r(i,k)\) sent from \(i\) to \(k\), which reports on the suitability of \(i\) to be represented by \(k\).

The availabilities are initialized to \(a(i,k) \leftarrow 0\), the responsibilities to \[r(i,k) \leftarrow s(i,k) − \max_{k'\neq k} \{a(i,k') + s(i,k')\}.\]

This calculation allows eligible exemplars to “compete” for each observation, in a sense, so that they may become that observation’s exemplar. After this initial assignment, an availability \(a(i,k)=0\) means that observation \(i\) has no current affinity for \(k\) as its exemplar.

Subsequently, the focus switches back and forth between the exemplar and the observation perspectives, with observations looking for available exemplars: \[a(i,k) \leftarrow \begin{cases}\displaystyle{\min \left\{0, r(k,k) + \sum_{i'\not\in \{i,k\}} \max\{0,r(i',k)\}\right\}} & i\neq k\\ \displaystyle{\sum_{i' \neq k}\max\{0,r(i',k)\}} & i=k\end{cases}\]

The case \(i=k\) is intended to reflect current evidence that observation \(k\) is an exemplar. The responsibilities and availabilities are updated, reflecting the current affinity that one observation has for choosing another observation as its exemplar (hence the name), until the quantities converge to \(r(i,k)\) and \(a(i,k)\), respectively, for all pairs of observations \((i,k)\).

This leads to the cluster assignment \(\{c_1,\ldots,c_n\}\), where \[c_i=\arg\max_k \{a(i, k)+r(i, k)\},\quad 1\leq i\leq n;\] if \(i\) is an observation with associated exemplar \(k\), then \(c_i=c_k=k\).
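
To make these update rules concrete, here is a minimal R sketch of the message-passing loop on a small artificial dataset; the dataset, the number of iterations, and the dampening value (discussed below) are illustrative choices, and in practice we would use the apcluster package, as in the example at the end of this section.

set.seed(123)
X <- matrix(rnorm(40), ncol = 2)           # 20 observations in R^2
n <- nrow(X)
S <- -as.matrix(dist(X))^2                 # similarity s(i,k) = -||x_i - x_k||^2
diag(S) <- median(S[row(S) != col(S)])     # preference s(k,k): median similarity

A <- matrix(0, n, n)                       # availabilities a(i,k)
R <- matrix(0, n, n)                       # responsibilities r(i,k)
lambda <- 0.9                              # dampening factor (illustrative)

for (iter in 1:200) {
  # responsibilities: r(i,k) <- s(i,k) - max_{k' != k} { a(i,k') + s(i,k') }
  AS <- A + S
  R.new <- matrix(0, n, n)
  for (k in 1:n) {
    R.new[, k] <- S[, k] - apply(AS[, -k, drop = FALSE], 1, max)
  }
  R <- lambda * R + (1 - lambda) * R.new   # dampened update

  # availabilities: a(i,k) as in the displayed formula above
  Rp <- pmax(R, 0)
  A.new <- matrix(0, n, n)
  for (k in 1:n) {
    s.k <- sum(Rp[, k]) - Rp[k, k]                  # sum_{i' != k} max{0, r(i',k)}
    A.new[, k] <- pmin(0, R[k, k] + s.k - Rp[, k])  # case i != k
    A.new[k, k] <- s.k                              # case i = k
  }
  A <- lambda * A + (1 - lambda) * A.new   # dampened update
}

exemplars <- apply(A + R, 1, which.max)    # c_i = argmax_k { a(i,k) + r(i,k) }
table(exemplars)                           # observations grouped by exemplar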

The fact that any observation can become an exemplar as the quantities are updated, and thus that the number of clusters is not an algorithm parameter, is an important distinction between AP and \(k-\)medoids (and other segmentation clustering approaches). The process is illustrated in Figure 14.24.


Figure 14.24: Illustration of affinity propagation on an artificial dataset (top); illustration of availability and responsibility (bottom); modified from [283].

Setting Algorithm Parameters

Two parameters impact AP’s clustering behaviour: the input preference (which influences the eventual number of clusters) and the dampening parameter.

The input preference determines the suitability of each observation to act as an exemplar; this is often set as the median similarity in the data, but it can be tweaked. In principle, certain observations could be assigned preference values in a different manner, perhaps relating to domain knowledge (or previous results).

The dampening parameter is slightly more technical. Because affinity propagation creates a directed graph to generate clusters, it can become vulnerable to graph loops, which could result in algorithmic oscillations (the algorithm may not converge to a particular solution). The dampening factor acts to control this oscillation problem.
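
In the apcluster package used in the example below, these two parameters correspond (assuming its standard arguments) to p or q (an explicit preference value, or a quantile of the input similarities to use as the preference) and lam (the dampening factor); a rough sketch:

library(apcluster)
X  <- scale(gapminder.SoCL.2011[,c(3:7)])
S  <- -as.matrix(dist(X))^2                      # s(i,k) = -||x_i - x_k||^2
ap <- apcluster(S, p = median(S[lower.tri(S)]),  # input preference: median similarity
                lam = 0.9)                       # heavier dampening to tame oscillations
length(ap@clusters)                              # number of clusters found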

Comparison with Other Algorithms

The performance of clustering algorithms can be assessed either in general (e.g., based on the best/worst cases of an implemented algorithm) or in the context of specific application domains. One major AP drawback is the cost of computing the similarity matrix, which is \(O(n^2)\). Once the similarity matrix has been calculated, the number of scalar computations scales linearly in the number of similarities, or quadratically in the number of observations if all possible pairwise similarities are used [283]. In other words, AP is slow on larger datasets.

Arguably, one of the major advantages of AP (other than not having to specify the number of clusters up front) is its ability to use any similarity measure. As a result, we do not need to alter the dataset to ‘fit’ with a distance/similarity framework (e.g., by changing categorical variables into numeric variables in some way, or ignoring categorical variables altogether).
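
For instance, a Gower-based similarity can be fed directly to AP for mixed-type data; in the hypothetical sketch below, df is a data frame with both numeric and categorical variables (not part of the Gapminder example).

library(apcluster)
library(cluster)
sim <- -as.matrix(daisy(df, metric = "gower"))   # Gower dissimilarities in [0,1], negated
ap  <- apcluster(sim, q = 0.5)                   # preference set at the median similarity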

Example

We once again revisit the 2011 (scaled) Gapminder dataset. We use the AP implementation found in the R package apcluster, with similarity \(s(i,k)=-\|\mathbf{x}_i-\mathbf{x}_k\|^2.\) We start by setting the input preference to the median similarity and obtain 14 clusters.

library(apcluster)
ap.gap.1 <- apcluster(negDistMat(r=2), scale(gapminder.SoCL.2011[,c(3:7)]))
ap.gap.1

plot(ap.gap.1, gapminder.SoCL.2011[,c(3:7)])


APResult object

Number of samples     =  184 
Number of iterations  =  160 
Input preference      =  -7.061518 
Sum of similarities   =  -144.0751 
Sum of preferences    =  -98.86126 
Net similarity        =  -242.9363 
Number of clusters    =  14 

Exemplars:
   bfa brb col com dnk gha hrv idn ita nam npl pry tcd vut
Clusters:
   Cluster 1, exemplar bfa:
      afg bdi ben bfa civ cmr gin lbr moz mwi ner nga ssd tgo uga zmb cod
   Cluster 2, exemplar brb:
      bhs blz brb cpv isl mdv mlt sur brn lca mne vct atg grd syc
   Cluster 3, exemplar col:
      arg bra chl col dza irn lka mar mex mys per rou tha tur ukr ven vnm
   Cluster 4, exemplar com:
      com dji gmb gnb mrt tls
   Cluster 5, exemplar dnk:
      aut bel che dnk fin grc irl isr lux nld nor nzl omn prt qat sgp swe are 
      bhr cyp kwt sau
   Cluster 6, exemplar gha:
      eri eth gha hti ken lao mdg pak png rwa sdn sen tza yem zaf zwe irq
   Cluster 7, exemplar hrv:
      bgr blr cri cub cze est hrv hun lbn ltu lva mus srb svk svn tto ury alb 
      mkd bih
   Cluster 8, exemplar idn:
      chn egy idn ind phl rus
   Cluster 9, exemplar ita:
      aus can deu esp fra gbr ita jpn kor pol usa
   Cluster 10, exemplar nam:
      bwa cog gab gnq lso nam swz lby tkm
   Cluster 11, exemplar npl:
      bgd khm mmr npl tjk uzb prk
   Cluster 12, exemplar pry:
      arm aze bol btn dom ecu geo gtm hnd jam kaz kgz mda mng nic pan pry slv 
      tun jor pse syr
   Cluster 13, exemplar tcd:
      ago caf mli sle som tcd
   Cluster 14, exemplar vut:
      fji guy stp slb ton vut wsm kir fsm

If instead we use the minimum similarity, we obtain 4 clusters (exemplars: Guinea, Guyana, Croatia, Morocco).

ap.gap.2 <- apcluster(negDistMat(r=2), scale(gapminder.SoCL.2011[,c(3:7)]), q=0)
ap.gap.2

plot(ap.gap.2, gapminder.SoCL.2011[,c(3:7)])


APResult object

Number of samples     =  184 
Number of iterations  =  170 
Input preference      =  -56.66018 
Sum of similarities   =  -305.4992 
Sum of preferences    =  -226.6407 
Net similarity        =  -532.1399 
Number of clusters    =  4 

Exemplars:
   gin guy hrv mar
Clusters:
   Cluster 1, exemplar gin:
      afg ago bdi ben bfa caf civ cmr cog com eri eth gha gin gmb gnb hti ken 
      lbr lso mdg mli moz mrt mwi ner nga pak png rwa sdn sen sle som ssd tcd 
      tgo tza uga zmb zwe cod tls
   Cluster 2, exemplar guy:
      blz btn bwa cpv dji fji gab gnq guy lao mng nam stp sur swz lby slb tkm 
      ton vct vut wsm grd kir syc fsm
   Cluster 3, exemplar hrv:
      arm aus aut bel bgr bhs blr brb can che chl cri cub cze deu dnk esp est 
      fin fra gbr geo grc hrv hun irl isl isr ita jam jpn kor lbn ltu lux lva 
      mda mdv mlt mus mys nld nor nzl omn pan pol prt qat rou sgp srb svk svn 
      swe tto tun ury alb are bhr brn cyp kwt lca mkd mne sau bih atg
   Cluster 4, exemplar mar:
      arg aze bgd bol bra chn col dom dza ecu egy gtm hnd idn ind irn kaz kgz 
      khm lka mar mex mmr nic npl per phl pry rus slv tha tjk tur ukr usa uzb 
      ven vnm yem zaf irq jor pse syr prk

Which of these two schemes seems to provide a better segmentation?

14.4.5 Fuzzy Clustering

Fuzzy clustering (FC) is also called “soft” clustering (as opposed to “hard” clustering). Rather than assigning each observation to a cluster, FC assigns each observation a cluster signature: a set of values that indicates its relative membership in each of the clusters.

The signature vector is often interpreted as a probability vector: observation \(\mathbf{x}_i\) belongs to cluster \(\ell\) with probability \(p_{i,\ell}\geq 0\), with \[p_{i,1}+\cdots+p_{i,c}=1,\quad \text{for all }1\leq i \leq n.\]

Fuzzy \(c-\)Means: The Typical Approach

The most prevalent algorithm for carrying out FC is called fuzzy \(c-\)means (FCM). It is a variant of \(k-\)means with two modifications:

  • the presence of a new parameter \(m> 1\), called the fuzzifier, which determines the degree of “fuzziness” of the clusters, and

  • cluster membership is output as a weight vector, with weights in \([0,1]\) adding to 1.

As in \(k-\)means, \(c\) observations are selected randomly as the initial cluster centroids, as are the membership weights of each observation. The membership weights of each observation, relative to a given centroid, are then re-calculated based on how “close” the observation is to that centroid, in comparison to its distances to all of the other centroids.

Effectively, we are looking for clusters that minimize the objective function \[\sum_{\ell=1}^c\sum_{\mathbf{x}_i\in C_\ell}u_{i,\ell}^m \text{variation}(\mathbf{x}_i,\boldsymbol{\mu}_\ell),\] where the degree \(u_{i,\ell}\) to which observation \(\mathbf{x}_i\) belongs to cluster \(C_{\ell}\) is \[u_{i,\ell}=\frac{1}{\displaystyle{\sum_{j=1}^c \left(\dfrac{\text{variation}(\mathbf{x}_i,\boldsymbol{\mu}_\ell)}{\text{variation}(\mathbf{x}_i,\boldsymbol{\mu}_j)}\right)^{2/(m-1)}}}.\]

The value of \(m\) effectively determines the width of the fuzziness bands around clusters, where clusters may overlap with one another. Within these bands, where overlaps occur, observations have membership weights strictly between 0 and 1. Outside of these bands, observations have a membership of 1 for the cluster they are closest to, and a membership of 0 for all the others.

As with \(k-\)means, the algorithm is generally run until the change in membership values, or in this case the weights, falls below a particular threshold. In practice, we typically use \(m=2\) and \[\text{variation}(\mathbf{x}_i,\boldsymbol{\mu}_{\ell})=\|\mathbf{x}_i-\boldsymbol{\mu}_{\ell}\|^2.\] As \(m\to 1\), FCM converges to \(k-\)means.
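
A minimal R sketch of the two FCM update steps (centroids, then memberships) with \(m=2\), run on the scaled 2011 Gapminder data; the number of clusters, the maximum number of iterations, and the tolerance are illustrative choices, and the e1071 implementation is used for the actual analysis below.

X  <- scale(gapminder.SoCL.2011[,c(3:7)])
n  <- nrow(X); nc <- 4; m <- 2                    # nc clusters, fuzzifier m

set.seed(1234)
U <- matrix(runif(n * nc), n, nc)
U <- U / rowSums(U)                               # random initial membership weights

for (iter in 1:100) {
  # centroid update: mu_l = sum_i u_{i,l}^m x_i / sum_i u_{i,l}^m
  W  <- U^m
  mu <- t(W) %*% X / colSums(W)

  # squared distances (the "variation") from each observation to each centroid
  D2 <- sapply(1:nc, function(l)
    rowSums((X - matrix(mu[l, ], n, ncol(X), byrow = TRUE))^2))
  D2 <- pmax(D2, 1e-12)                           # guard against division by zero

  # membership update (with m = 2, the weights reduce to ratios of variations)
  U.new <- 1 / (D2 * rowSums(1 / D2))
  if (max(abs(U.new - U)) < 1e-6) { U <- U.new; break }
  U <- U.new
}
round(head(U), 3)                                 # fuzzy signatures of the first few countries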

Comparison Between Fuzzy \(c-\)Means and \(k-\)Means

To gain an appreciation for how FCM works, it can be useful to compare its results to those provided by \(k-\)means. The image below shows the same dataset clustered by \(k-\)means (left) and fuzzy \(c-\)means (right) [285].

On the right, we can see observations that “belong” to both clusters. FCM is useful in this context because it would seem almost arbitrary to assign some of these points fully to one cluster or the other (which is what \(k-\)means does).

Other Fuzzy Clustering Options

Although FCM is the most popular fuzzy clustering algorithm, it is not a particularly nuanced one. As with \(k-\)means, the resulting clusters are essentially blob-shaped. More sophisticated results can be obtained with more complex algorithms. The Gustafson–Kessel (GK) clustering algorithm [286] is an early extension of FCM which replaces the simple distance measure used in FCM with a cluster-specific (Mahalanobis-like) distance induced by a covariance matrix. This brings FCM more in line with EM clustering, which also produces fuzzy assignments and can be carried out with a variety of statistical models, yielding more mature clustering results, albeit at the cost of heavier processing. FANNY [287] is another fuzzy approach; it is less sensitive to outliers than FCM.

Fuzzy Clustering Validation

As with hard clustering, it is important to validate fuzzy clusters. A number of validation strategies have been developed; the Xie–Beni index is a popular choice. It can be calculated for non-fuzzy clusters as well as for fuzzy clusters; in the fuzzy case, it takes the membership weights into account by weighting the cluster separation and compactness measures using the membership matrix (i.e., the matrix that contains the weights of each observation with respect to each cluster). Other metrics include the Tang index and the Kwon index [288]–[290].
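
In its usual form (using the notation of the previous subsection, with squared Euclidean variation), the Xie–Beni index of a fuzzy \(c-\)partition is \[\text{XB}=\frac{\displaystyle{\sum_{\ell=1}^c\sum_{i=1}^n u_{i,\ell}^m\,\|\mathbf{x}_i-\boldsymbol{\mu}_\ell\|^2}}{n\cdot\displaystyle{\min_{j\neq k}\|\boldsymbol{\mu}_j-\boldsymbol{\mu}_k\|^2}};\] smaller values indicate compact clusters whose centroids are well separated.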

Example

We show some results of FANNY (with \(c=2,3,4\), and \(6\) clusters, implemented in cluster’s fanny()) and FCM (with \(c=4\) clusters, implemented in e1071’s cmeans()) on the (scaled) 2011 Gapminder dataset (again, using Euclidean dissimilarity).

We start with FANNY(2).

set.seed(987) # for replicability
fuzzy.gap <- cluster::fanny(scale(gapminder.SoCL.2011[,c(3:7)]), 
                            k=2, metric="euclidean", maxit=20000)
attributes(fuzzy.gap)

factoextra::fviz_cluster(fuzzy.gap, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")

factoextra::fviz_silhouette(fuzzy.gap, palette = "jco",
                ggtheme = theme_minimal())

$names
 [1] "membership"  "coeff"       "memb.exp"    "clustering"  "k.crisp"    
 [6] "objective"   "convergence" "diss"        "call"        "silinfo"    
[11] "data"       

$class
[1] "fanny"     "partition"

  cluster size ave.sil.width
1       1   75          0.32
2       2  109          0.52

We display only the plots for FANNY(3), FANNY(4), and FANNY(6).

fuzzy.gap <- cluster::fanny(scale(gapminder.SoCL.2011[,c(3:7)]), 
                            k=3, metric="euclidean", maxit=20000)

factoextra::fviz_cluster(fuzzy.gap, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")

factoextra::fviz_silhouette(fuzzy.gap, palette = "jco",
                ggtheme = theme_minimal())

#########

fuzzy.gap <- cluster::fanny(scale(gapminder.SoCL.2011[,c(3:7)]), 
                            k=4, metric="euclidean", maxit=20000)

factoextra::fviz_cluster(fuzzy.gap, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")

factoextra::fviz_silhouette(fuzzy.gap, palette = "jco",
                ggtheme = theme_minimal())

#########

fuzzy.gap <- cluster::fanny(scale(gapminder.SoCL.2011[,c(3:7)]), 
                            k=6, metric="euclidean", maxit=20000)

factoextra::fviz_cluster(fuzzy.gap, ellipse.type = "norm", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right")

factoextra::fviz_silhouette(fuzzy.gap, palette = "jco",
                ggtheme = theme_minimal())

FANNY(3):
  cluster size ave.sil.width
1       1   64          0.26
2       2   74          0.34
3       3   46          0.15

FANNY(4):
  cluster size ave.sil.width
1       1   53          0.33
2       2   80         -0.01
3       3   50         -0.17
4       4    1          0.00

FANNY(6):
  cluster size ave.sil.width
1       1   30          0.35
2       2   83          0.06
3       3   34         -0.21
4       4   34          0.03
5       5    3          0.60

Finally, here are the results for FCM(4):

cm.gap <- e1071::cmeans(scale(gapminder.SoCL.2011[,c(3:7)]), 4)
attributes(cm.gap)
$names
[1] "centers"     "size"        "cluster"     "membership"  "iter"       
[6] "withinerror" "call"       

$class
[1] "fclust"
factoextra::fviz_cluster(list(data = scale(gapminder.SoCL.2011[,c(3:7)]), 
                              cluster=cm.gap$cluster), 
             ellipse.type = "norm",
             ellipse.level = 0.68,
             palette = "jco",
             ggtheme = theme_minimal())

GGally::ggpairs(gapminder.SoCL.2011[,c(3:7)],
                ggplot2::aes(color=as.factor(cm.gap$cluster)),
                diag=list(continuous=my_dens))
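
For reference, the Xie–Beni index of the FCM(4) solution can be computed directly from the cmeans() output (a manual sketch, not part of the original analysis; smaller values are better):

X  <- scale(gapminder.SoCL.2011[,c(3:7)])
U  <- cm.gap$membership                           # membership matrix
mu <- cm.gap$centers                              # cluster centroids (in scaled space)
D2 <- sapply(1:nrow(mu), function(l)
  rowSums((X - matrix(mu[l, ], nrow(X), ncol(X), byrow = TRUE))^2))
compactness <- sum(U^2 * D2)                      # weighted within-cluster variation (m = 2)
separation  <- min(dist(mu))^2                    # smallest squared distance between centroids
compactness / (nrow(X) * separation)              # Xie-Beni index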

How does that compare to all the other approaches we have used so far? Based on these, how many clusters do you think there are in the 2011 Gapminder dataset?

14.4.6 Cluster Ensembles

We have seen that the choice of clustering method and of algorithm parameters may have an impact on the nature and number of clusters in the data; quite often, the resulting clusters are volatile. This is in keeping with the fact that accurately assessing the quality of a clustering outcome remains, for the most part, an elusive problem.

The goal of ensemble clustering is to combine the results of multiple clustering runs to create a more robust outcome.

Most ensemble models use the following two steps to generate an outcome:

  1. generate different clustering schemes, using different models, parameters, or data selection mechanisms (the ensemble components), and

  2. combine the different results into a single outcome.

Selecting Different Ensemble Components

The ensemble components are either model-based or data selection-based.

In model-based ensembles, the different components of the ensemble reflect different models, such as the use of

  • different clustering approaches;

  • different parameter settings for a given approach;

  • different randomizations (for stochastic algorithms), or

  • some combination of these.

For instance, an ensemble’s components could be built from:

  1. 5 runs of \(k-\)means for each of \(k=2,\ldots,10\), for each of the Euclidean and Manhattan distances (90 components);

  2. the hierarchical clustering outcome for each of the complete, single, average, centroid, and Ward linkages, for each of the Euclidean and Manhattan distances, for each of \(k=2,\ldots, 10\) clusters (90 components);

  3. the DBSCAN outcome for each of 5 values of \(\varepsilon^*\), for each of \(\textit{minPts}=2,\ldots, 10\), for each of the Euclidean and Manhattan distances (90 components), and

  4. the spectral clustering outcome for each of 3 threshold values \(\tau\), for each of the 3 types of Laplacians, for \(k=2,4,6,8,10\), for each of the Euclidean and Manhattan distances (90 components).

This provides a total of \(4\times 90=360\) components. Note that we could also pick algorithms, settings, and similarity measures randomly, from a list of reasonable options.

In data selection-based ensembles, we might select a specific clustering approach, combined with a set of parameters, and a given randomization (if the approach is stochastic) and instead build the different components of the model by running the algorithm on different subsets of the data, either via:

  • selecting subsets of observations, using random or other probabilistic sampling schemes;

  • selecting subsets of variables, again using probabilistic sampling, or

  • some combination of both.

For instance, an ensemble’s components could be built using affinity propagation with Euclidean distance and a specific combination of input preference and dampening parameter, and 360 subsets of the data, obtained as follows (a code sketch follows the list):

  1. for each component, draw a percentage of observations to sample and a number of variables to select from the data;

  2. randomly select a subset with these properties;

  3. run affinity propagation on the subset to obtain a clustering outcome.
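
A rough R sketch of steps 1–3, with \(k-\)means swapped in for affinity propagation to keep the code short (the data selection mechanism is the point here); the subset sizes and the number of components are illustrative choices.

set.seed(2718)
X <- scale(gapminder.SoCL.2011[,c(3:7)])
components <- lapply(1:360, function(b) {
  rows <- sample(nrow(X), size = ceiling(runif(1, 0.6, 0.9) * nrow(X)))  # sampled observations
  cols <- sample(ncol(X), size = sample(2:ncol(X), 1))                   # selected variables
  cl   <- kmeans(X[rows, cols, drop = FALSE],
                 centers = sample(2:6, 1), nstart = 5)$cluster
  list(rows = rows, labels = cl)                  # component: subset + clustering outcome
})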

We could also combine model-based and data selection-based approaches to create the components.

Combining Different Ensemble Components

However the components are obtained, we need to find a way to combine them to obtain a robust clustering consensus. There are three basic methods to do this:

  • general affiliation;

  • hypergraph partitioning, and

  • meta-clustering.

In the general affiliation approach, we consider each pair of observations and determine how frequently they are found in the same clusters in each of the ensemble components. The corresponding proportions create a similarity matrix, which can then be used to cluster the data using some graph-based method, such as DBSCAN.
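
A minimal sketch of the co-association idea, using repeated \(k-\)means runs as the ensemble components; here the consensus matrix is clustered with average-linkage hierarchical clustering on the induced dissimilarities rather than DBSCAN, and the number of runs and the final cut are illustrative choices.

set.seed(314)
X <- scale(gapminder.SoCL.2011[,c(3:7)])
n <- nrow(X)

runs <- 50
coassoc <- matrix(0, n, n)
for (b in 1:runs) {
  cl <- kmeans(X, centers = sample(2:6, 1), nstart = 5)$cluster
  coassoc <- coassoc + outer(cl, cl, "==")        # 1 if the pair lands in the same cluster
}
coassoc <- coassoc / runs                         # proportion of co-occurrence

consensus <- hclust(as.dist(1 - coassoc), method = "average")
table(cutree(consensus, k = 3))                   # consensus clusters (k = 3 chosen for display)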

In the hypergraph partitioning approach, each observation in the data is represented by a hypergraph vertex. A cluster in any of the ensemble components is represented as a hypergraph hyperedge, a generalization of the notion of edge which connects (potentially) more than two vertices in the form of a complete clique. This hypergraph is then partitioned using graph clustering methods.

The meta-clustering approach is also a graph-based approach, except that vertices are associated with each cluster in the ensemble components; each vertex therefore represents a set of data objects. A graph partitioning algorithm is then applied to this graph. Balancing constraints may be added to the meta-clustering phase to ensure that the resulting clusters are balanced.

Cluster ensembles are implemented in R via the packages diceR and clue. More information is available in [4], [5], [274].

References

[2]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
[4]
C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
[5]
C. C. Aggarwal, Data Mining: The Textbook. Cham: Springer, 2015.
[130]
[145]
Wikipedia, “Cluster analysis algorithms.”
[222]
E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN,” ACM Trans. Database Syst., vol. 42, no. 3, Jul. 2017, doi: 10.1145/3068335.
[228]
G. Schoier and G. Borruso, “Individual movements and geographical data mining. Clustering algorithms for highlighting hotspots in personal navigation routes,” in Computational science and its applications - ICCSA 2011, 2011, pp. 454–465.
[229]
[274]
N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Dec. 2010.
[278]
A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Proceedings of the 14th international conference on neural information processing systems: Natural and synthetic, 2001, pp. 849–856.
[279]
Y. Bengio, P. Vincent, and J.-F. Paiement, “Learning Eigenfunctions of Similarity: Linking Spectral Clustering and Kernel PCA,” Département d’informatique et recherche opérationnelle, Université de Montréal, 1232, 2003.
[280]
U. von Luxburg, “A Tutorial on Spectral Clustering,” Stat. Comput., vol. 17, no. 4, pp. 395–416, 2007.
[281]
F. Tung, A. Wong, and D. A. Clausi, “Enabling scalable spectral clustering for image segmentation,” Pattern Recognition, vol. 43, no. 12, pp. 4069–4076, 2010, doi: 10.1016/j.patcog.2010.06.015.
[282]
L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in Neural Information Processing Systems, 2005, vol. 17.
[283]
D. Dueck, “Affinity propagation: Clustering data by passing messages,” Ph.D. Thesis, Jan. 2009.
[284]
B. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, pp. 972–976, Mar. 2007, doi: 10.1126/science.1136800.
[285]
A. Bagnaro, F. Baltar, and G. Brownstein, “Reducing the arbitrary: Fuzzy detection of microbial ecotones and ecosystems – focus on the pelagic environment,” Environmental Microbiome, vol. 15, no. 16, 2020.
[286]
D. Gustafson and W. C. Kessel, “Fuzzy clustering with a fuzzy covariance matrix,” 1978 IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes, pp. 761–766, 1978.
[287]
L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[288]
S. H. Kwon, J. Kim, and S. H. Son, “Improved cluster validity index for fuzzy clustering,” Electronics Letters, vol. 57, no. 21, pp. 792–794, 2021.
[290]
X. L. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.