15.4 Advanced Topics

When used appropriately, the approaches to feature selection and dimension reduction methods presented in the last two sections provide a solid toolkit to help mitigate the effects of the curse of dimensionality.

They remain, however, for the most part rather straightforward. The methods discussed in this section are decidedly more sophisticated, from a mathematical perspective; an increase in conceptual complexity can lead to insights that are out of reach by more direct approaches.

15.4.1 Singular Value Decomposition

From a database management perspective, it pays not to view datasets simply as flat file arrays; from an analytical perspective, however, viewing datasets as matrices allows analysts to use the full machinery of linear algebra and matrix factorization techniques, of which singular value decomposition (SVD) is a well-known component.305

As before, let \(\{\mathbf{x}_i\}_{i=1}^n\subseteq \mathbb{R}^p\) and denote the matrix of observations by \[\mathbf{X}=\begin{bmatrix}\mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n\end{bmatrix}\in \mathbb{M}_{n,p}(\mathbb{R})=\mathbb{R}^{n\times p}.\]

Let \(d\geq\min{p,n}\). From a dimension reduction perspective, the goal of matrix factorization is to find two narrow matrices \(\mathbf{W}_d\in \mathbb{R}^{n\times d}\) (the cases) and \(\mathbf{C}_d\in \mathbb{R}^{p\times d}\) (the concepts) such that the product \(\mathbf{W}_d\mathbf{C}^{\!\top}_d=\widetilde{\mathbf{X}}_d\) is the best rank \(d\) approximation of \(\mathbf{X}\), i.e. \[\widetilde{\mathbf{X}}_d =\arg\min_{\mathbf{X}'} \{\|\mathbf{X}-\mathbf{X}'\|_F\},\quad \text{ with } \textrm{rank}(\mathbf{X}')=d,\] where the Frobenius norm \(F\) of a matrix is \[\|\mathbf{A}\|_F^2=\sum_{i,j}|a_{i,j}|^2.\] In a sense, \(\widetilde{\mathbf{X}}_d\) is a “smooth” representation of \(\mathbf{X}\); the dimension reduction takes place when \(\mathbf{W}_d\) is used as a dense \(d-\)representation of \(\mathbf{X}\).The link with the singular value decomposition of \(\mathbf{X}\) can be made explicit: there exist orthonormal matrices \(\mathbf{U}\in\mathbb{R}^{n\times n}\), \(\mathbf{V}\in \mathbb{R}^{p\times p}\), and a diagonal matrix \(\mathbf{\Sigma}\in\mathbb{R}^{n\times p}\) with \(\sigma_{i,j}=0\) if \(i\neq j\) and \(\sigma_{i,i}\geq \sigma_{i+1,i+1}\geq 0\) for all \(i\),306 such that \[\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\!\top};\] the decomposition is unique if and only if \(n=p\).

Let \(\mathbf{\Sigma}_d\in\mathbb{R}^{d\times d}\) be the matrix obtained by retaining the first \(d\) rows and the the first \(d\) columns of \(\mathbf{\Sigma}\); \(\mathbf{U}_d\in \mathbb{R}^{n\times d}\) be the matrix obtained by retaining the first \(d\) columns of \(\mathbf{U}\), and \(\mathbf{V}_d^{\!\top}\in \mathbb{R}^{d\times p}\) be the matrix obtained by retaining the first \(d\) rows of \(\mathbf{V}^{\top}\).

Then \[\widetilde{\mathbf{X}}_d=\underbrace{\mathbf{U}_d\mathbf{\Sigma}_d}_{\mathbf{W}_d}\mathbf{V}^{\!\top}_d,\] and the \(d\)-dimensional rows of \(\mathbf{W}_d\) are approximations of the \(p-\)dimensional rows of \(\mathbf{X}\) in the sense that \[\left\langle\mathbf{W}_{d}[i],\mathbf{W}_{d}[j] \right\rangle= \left\langle\widetilde{\mathbf{X}}_{d}[i],\widetilde{\mathbf{X}}_{d}[j] \right\rangle \approx \left\langle{\mathbf{X}}_{d}[i],{\mathbf{X}}_{d}[j] \right\rangle \mbox{ for all }i,j.\]


  1. One of the advantages of SVD is that it allows for substantial savings in data storage (modified from [208]):

Storing \(\mathbf{X}\) requires \(n p\) saved entries, but an approximate version of the original requires only \(d(n+p+d)\) saved entries; if \(\mathbf{X}\) represents a \(2000\times 2000\) image (with 4 million entries) to be transmitted, say, a decent approximation can be sent via \(d=10\) using only 40100 entries, roughly 1% of the original number of entries (see Figure 15.17 for an illustration).

Image reconstruction: $d=1400$ (left), $d=10$ (middle), $d=50$ (right); Llewellyn and Gwynneth Rayfield.Image reconstruction: $d=1400$ (left), $d=10$ (middle), $d=50$ (right); Llewellyn and Gwynneth Rayfield.Image reconstruction: $d=1400$ (left), $d=10$ (middle), $d=50$ (right); Llewellyn and Gwynneth Rayfield.

Figure 15.17: Image reconstruction: \(d=1400\) (left), \(d=10\) (middle), \(d=50\) (right); Llewellyn and Gwynneth Rayfield.

  1. SVD can also be used to learn word embedding vectors. In the traditional approach to text mining and natural language processing (NLP) (see Sections ?? and ??), words and associated concepts are represented using one-hot encoding.307

For instance, if the task is to predict the part-of-speech (POS) tag of a word given its context in a sentence (current and previous word identities \(w\) and \(pw\), as well as the latter’s part-of-speech (POS) tag \(pt\)), the input vector could be obtained by concatenation of the one-hot encoding of \(w\), the one-hot encoding of \(pw\), and the one-hot encoding of \(pt\).

The input vector that would be fed into a classifier to predict the POS of the word “house” in the sentence fragment “my house”, say, given that “my” has been tagged as a ‘determiner’ could be:

The sparsity of this vector is a major CoD liability: a reasonably restrictive vocabulary subset of English might contain \(|V_W|\approx 20,000\) words, while the Penn Treebank project recognizes \(\approx 40\) POS tags, which means that \(\mathbf{x}\in \mathbb{R}^{40,040}\) (at least).

Another issue is that the one-hot encoding of words does not allow for meaningful similarity comparisons between words: in NLP, words are considered to be similar if they appear in similar sentence contexts.308

The terms “black” and “white” are similar in this framework as they both often occur immediately preceding the same noun (such as “car”, “dress”, etc.); human beings recognize that the similarity goes further than both of the terms being adjectives – they are both colours, and are often used as opposite poles on a variety of spectra. This similarity is impossible to detect from the one-hot encoding vectors, however, as all its word representations are exactly as far from one another.

SVD proposes a single resolution to both of these issues. Let \(\mathbf{M}^f\in\mathbb{R}^{|V_W|\times |V_C|}\) be the word-context matrix of the association measure \(f\), derived from some appropriate corpus, that is, if \(V_W=\{w_1,\ldots,w_{|V_W|}\}\) and \(V_C=\{c_1,\ldots,c_{|V_C|}\}\) are the vocabulary and contexts of the corpus, respectively, then \(M^f_{i,j}=f(w_i,c_j)\) for all \(i,j\). For instance, one could have \[\begin{aligned}V_W&=\left\{\text{aardvark},\ldots,\text{zygote}\right\}, \\ V_C&=\{\ldots,\text{word appearing before "cat"},\ldots\},\end{aligned}\] and \(f\) given by positive pointwise mutual information for words and contexts in the corpus (the specifics of which are not germane to the discussion at hand; see [311] for details).

The SVD \[\mathbf{M}^f \approx \mathbf{M}^f_d =\mathbf{U}_d\mathbf{\Sigma}_d\mathbf{V}_d^{\!\top}\] yields \(d-\)dimensional word embedding vectors \(\mathbf{U}_d\mathbf{\Sigma}_d\) which preserve the context-similarity property discussed above. The decomposition of the POS-context matrix, where words are replaced by POS tags, produces POS embedding vectors.

For instance, a pre-calculated 4-dimensional word embedding of \(V_W\) could be

while a 3-dimensional POS embeddings could be

leading to a 11-dimensional representation \(\mathbf{x}'\) of \(\mathbf{x}\)

which provides a substantial reduction in the dimension of the input space.

15.4.2 PCA Regression and Partial Least Squares

For \(m=1,\ldots, M\leq p\), we let \(z_m=\vec{X}^{\!\top}\boldsymbol{\phi}_{m}\) be linear combinations of the original predictors \(\{X_1,\ldots, X_p\}\).

If we are fitting \(y=f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}]\) using OLS, we can also fit \(y_i=\theta_0+\mathbf{z}_i^{\!\top}\boldsymbol{\theta}+\varepsilon_i\), \(i=1,\ldots, n\) using OLS. If the constants \(\phi_{m,j}\) are selected wisely, then transforming the variables can yield a model that outperforms OLS regression – the predictions might be better than those obtained by fitting \(y_i=\beta_0+\mathbf{x}_i^{\!\top}\boldsymbol{\beta}+\varepsilon_i\), \(i=1,\ldots,N\).

By definition, \(\theta_0=\beta_0\) and \[\begin{aligned} \mathbf{z}_i^{\!\top}\boldsymbol{\theta}&=\sum_{m=1}^M\theta_mz_{i,m}=\sum_{m=1}^M\theta_m\mathbf{x}_i^{\!\top}\boldsymbol{\phi}=\sum_{m=1}^M\theta_m\sum_{j=1}^p\phi_{m,j}x_{i,j}\\ &=\sum_{j=1}^p\sum_{m=1}^M\theta_m\phi_{m,j}x_{i,j}=\sum_{j=1}^p\beta_jx_{i,j}=\mathbf{x}_i^{\!\top}\boldsymbol{\beta},\end{aligned}\] where \(\beta_j=\sum_{m=1}^M\theta_m\phi_{m,j}\), which is to say that the dimension reduction regression is a special case of the original linear regression model, with constrained coefficients \(\beta_j\).

Such constraints can help with the bias-variance trade-off (when \(p\gg n\), picking \(M\ll p\) can reduce the variance of the fitted coefficients).

The challenge then is to find an appropriate way to pick the \(\phi_{m,j}\). We will consider two approaches: principal components and partial least squares.

Principal Components Regression

Let us assume that \(M\) principal components \(\{Z_1,\ldots,Z_M\}\) have been retained (see Section 15.2.3), where \[Z_i=\mathbf{w}_i^{\!\top}(X_1,\ldots,X_p),\] assuming that the eigenvectors \(\mathbf{w}_i\) are ordered according to the corresponding eigenvalues (\(\lambda_1\geq \lambda_2 \geq \cdots \geq \lambda_p\geq 0\)):

  • the first principal component is the normalized (centered and scaled) linear combination of variables with the largest variance;

  • the second principal component is the normalized linear combination of variables with the largest variance, subject to having no correlation with all previous components (the first);

  • the \(M\)th principal component is the normalized linear combination of variables with the largest variance, subject to having no correlation with all previous components.

The regression function \(f(\mathbf{x})=\text{E}[Y\mid \vec{X}=\mathbf{x}]\) is hopefully well approximated by the function \(g(\vec{z})=\text{E}[Y\mid \vec{Z}=\vec{z}]\), i.e., \[\hat{y}_z=g(\mathbf{z})=\gamma_0+\gamma_1z_1+\cdots +\gamma_Mz_M\] should compare reasonably well to \[\hat{y}_z=f(\mathbf{z})=\beta_0+\beta_1x_1+\cdots +\beta_px_p.\]

The main challenge is to determine the optimal \(M\). If \(M\) is too small, we run the risk of having a model with high squared bias and low variance (underfitting); if \(M\) is too large, not only we we not achieve much in the way of dimension reduction, but we might produce a model with low squared bias and high variance (overfitting).

Any method that allows for the estimation of \(\text{MSE}_\text{Te}\) (such as cross-validation) could be used to select \(M\), but there are other approaches as well (again, see Section 15.2.3).

Partial Least Squares

In principal component regression (PCR), the identified directions (linear combinations) that best represent the predictors \(\{X_1,\ldots,X_p\}\) are determined in an unsupervised manner: the response \(Y\) plays no role in determining the principal components.

As such, there is no guarantee that the directions that best explain the predictors are also the best directions to use to predict the response. The framework for partial least squares is the same as that for PCR, except that the directions \(Z_i\) are selected both to explain the predictors and to be related to the response \(Y\). As in PCR, we start by normalizing (centering and scaling) the training set \[\mathbf{X}=\begin{bmatrix}\mathbf{1} & \mathbf{X}_1 & \cdots & \mathbf{X}_p\end{bmatrix}.\]

The first direction \(Z_1\) is computed using the OLS coefficient estimates of \[Y_i=\phi_{0,j}^1+\phi_{1,j}X_{i,j}+\gamma_i,\quad j=1,\ldots, p, i=1,\ldots,N.\]

Note that each \(\phi_{1,j}\) is proportional to \(\rho_{X_j,Y}\) and that the direction \[Z_1=\sum_{j=1}^p\phi_{1,j}X_j=\boldsymbol{\phi}_1^{\!\top}\vec{X}\] places the highest weights on the predictors that are most strongly linked to the response. Now, we run an OLS regression of \(Y\) using \(Z_1\) as a predictor: \[Y_i=\psi_0+\psi_1z_{1,i}+\varepsilon_i,\quad i=1,\ldots, N\] and let \(\varepsilon_i=Y_i-\psi_0-\psi_{1}z_{1,i}\) be the component of the data not “explained” by \(Z_1\). THe second direction \(Z_2\) is computed using the OLS coefficient estimates of \[\varepsilon_i=\phi_{0,j}^2+\phi_{2,j}X_{i,j}+\gamma_i, \quad j=1,\ldots,p, i=1,\ldots,N.\]

Note that each \(\phi_{2,j}\) is proportional to \(\rho_{X_j,\varepsilon}\) and that the direction \[Z_2=\sum_{j=1}^p\phi_{2,j}X_j=\boldsymbol{\phi}_2^{\!\top}\vec{X}\] places higher weights on the predictors that are most strongly linked to the first residual (which is to say, the component that does not explain \(Z_1\)). The process continues in the same way, building directions \(Z_3,\ldots, Z_p\) that are strongly linked, in sequence, to the preceding residuals; as the chain starts with the response \(Y\), the directions do take into account both the related response and the predictor structure.309


In summary, due to the bias-variance trade-off (see Modules 11 and 13), we must often strike the right balance in terms of model complexity, which is usually measured in terms of the number of parameters that must be estimated from \(\text{Tr}\). While this allows us to compare completely different models with one another, it also suggests that models that use fewer predictors as inputs are not as complex as those that use the full set of predictors. The full models are not necessarily the ones that perform best (in terms of \(\text{MSE}_\text{Te}\)), due, in parts, to the curse of dimensionality. Thankfully, predictor subset selection methods can be used to select the best model: while the cross-validation approach is strongly encouraged, other approaches (including shrinkage, feature selection, dimension reduction) could also prove competitive.

15.4.3 Spectral Feature Selection

Text mining tasks often give rise to datasets which are likely to be affected by the CoD; the problem also occurs when dealing with-high resolution images, with each of the millions of pixels it contains as a feature viewed as a feature; as images have each image containing millions of pixels.

Spectral feature selection attempts to identify “good” or “useful” training features in such datasets by measuring their relevance to a given task via spectral analysis of the data.

General Formulation for Feature Selection

Let \(\mathbf{X} \in \mathbb{R}^{n\times p}\) be a data set with \(p\) features and \(n\) observations. The problem of \(\ell-\)feature selection, with \(1\leq \ell\leq p\), can be formulated as an optimization problem [312]: \[\begin{aligned} &\max_{\mathbf{W}} r(\hat{\mathbf{X}}) \\ \text{s.t.}\ &\ \hat{\mathbf{X}} = \mathbf{XW}, \quad \mathbf{W}\in \{0,1\}^{p\times \ell} \\ &\ \mathbf{W}^{\!\top}\mathbf{1}_{p\times 1} = \mathbf{1}_{\ell\times1}, \quad \lVert \mathbf{W1}_{\ell\times 1} \rVert_0 = \ell\end{aligned}\]

The score function \(r(\cdot)\) is the objective which evaluates the relevance of the features in \(\hat{\mathbf{X}}\), the data set with only certain feature selected by the selection matrix \(\textbf{W}\) with entries either 0 or 1. To ensure that only the original feature are selected (and not a linear combination of features), the problem stipulates that each column of \(\mathbf{W}\) contains only one 1 (\(\mathbf{W}^{\!\top}\mathbf{1}_{p\times 1}= \mathbf{1}_{\ell\times 1}\)); to ensure that exactly \(\ell\) rows contain one 1, the constraint \(\lVert \mathbf{W1}_{\ell\times1}\rVert_0 = l\) is added.

The selected features are often represented by \[\hat{\mathbf{X}} = \mathbf{XW} = (f_{i_1},\dots, f_{i_{\ell}})\quad\mbox{with } \{i_1,\dots,i_{\ell}\} \subset \{1,\dots,p\}.\] If \(r(\cdot)\) does not evaluate features independently, this optimization problem is NP-hard. To make to problem easier to solve, the features are assumed to be independent of one another.310 In that case, the objective function reduces to \[\max_{\mathbf{W}} r(\hat{\mathbf{X}}) = \max_{\mathbf{W}} \left(r(f_{i_1}) + \dots + r(f_{i_l})\right);\] the optimization problem can then be solved by selecting the \(\ell\) features with the largest individual scores. The link with spectral analysis will now be explored.311

Similarity Matrix

Let \(s_{i,j}\) denote the pairwise similarity between observations \(\mathbf{x}_i\) and \(\mathbf{x}_j\). If class labels \(y(\mathbf{x})\in \{1,\ldots,K\}\) are known for all instances \(\mathbf{x}\), the following function can be used \[s_{i,j} = \begin{cases} \frac{1}{n_{k}}, & y(\mathbf{x}_i)=y(\mathbf{x}_j)=k \\ 0, & \text{otherwise} \end{cases}\] where \(n_{k}\) is the number of observations with class label \(k\).

If class labels are not available, a popular similarity function is the Gaussian radial basis function (RBF) kernel, given by \[s_{i,j} = \exp\left( - \frac{\lVert x_i-x_j \rVert^2}{2\sigma^2} \right),\] where \(\sigma\) is a parameter that is used to control the Gaussian’s width.312 For a given \(s_{i,j}\) and \(n\) observations, the similarity matrix \(S\) is an \(n\times n\) matrix containing the observations’ pairwise similarities, \(S(i,j) = s_{i,j}\), \(i,j=1,\dots,n\).

By convention, \(\text{diag}(S)=\mathbf{0}\). Other similarity functions include the following kernels [313]:

  1. linear\(s_{i,j} = \mathbf{x}_i^{\!\top}\mathbf{x}_j + c\), \(c\in \mathbb{R}\);

  2. polynomial\(s_{i,j} = (\alpha \mathbf{x}_i^{\!\top}\mathbf{x}_j + c)^d\), \(\alpha, c\in \mathbb{R}\),313 \(d\in \mathbb{N}\), and

  3. cosine\(s_{i,j} = \frac{\mathbf{x}_j^{\!\top}\mathbf{x}_i}{\lVert \mathbf{x}_i \rVert \lVert \mathbf{x}_j \rVert}\), which measures the similarity of 2 vectors by determining the angle between them.314

Similarity Graph

For each similarity matrix \(S\), one can construct a weighted graph \(G(V,E)\) in which each observation corresponds to a node and the associated pairwise similarities correspond to the respective edge weights; \(V\) is the set of all vertices (nodes) and \(E\) the set of all graph edges. As an example, consider the simple dataset: \[\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 0 & 2 \\ 3 & 2 \\ 6 & 4 \\ 7 & 4 \\ 6 & 8 \end{bmatrix};\] for validation purposes, we will assume that the first three belong to one group, the last three, to another. The scatter plot is obtained below:

from matplotlib import ticker, cm
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform, cdist
import networkx as nx
from numpy.linalg import eigh,norm
import random # for replicability

# Data
X = np.array([[1,1,1],[0,2,1],[3,2,1],[6,4,1],[7,5,1],[6,8,1]])
Y = np.array([0,0,0,1,1,1])
# Plot
plt.scatter(X[:,0],X[:,1], c=Y)

Next, we compute the similarity (adjacency matrix). We use the Gaussian RBF kernel with \(\sigma = 0.5\), and pdist() from scipy.spatial.distance which computes the pairwise distance of each row from an input data matrix. pdist() return a vector, which allows a speed boost for the following computation, but it needs to be converted back to a matrix for the next steps. This is done via squareform(), also from scipy.spatial.distance.

S = squareform(np.exp(-pdist(X))/((0.5)**2))

or \[S = \begin{bmatrix} 0 & 0.972 & 0.428 & 0.012 & 0.003 & 0.001 \\ 0.972 & 0 & 0.199 & 0.007 & 0.002 & 0.001 \\ 0.428 & 0.199 & 0 & 0.109 & 0.027 & 0.005 \\ 0.012 & 0.007 & 0.109 & 0 & 0.972 & 0.073 \\ 0.003 & 0.002 & 0.027 & 0.972 & 0 & 0.169 \\ 0.001 & 0.001 & 0.005 & 0.073 & 0.169 & 0 \end{bmatrix},\] and the resulting graph \(G\) is shown below:

G = nx.from_numpy_matrix(S)
# Add the weight (similarity) attribute to the graph
pos = {i : [X[i,0],5*X[i,1]] for i in range(6)}
labels = nx.get_edge_attributes(G,'weight')

# round the similarity (for display)
labels = {k: round(v,2) for k, v in labels.items()}
# Create the edge list and labels. Remove null edges
edge_list = [k for k, v in labels.items() if v != 0]
labels = {k: v for k, v in labels.items() if v != 0} 
# Draw it using the edge and labels list, at the right position 
nx.draw_networkx_edges(G,pos, edgelist=edge_list)
nx.draw_networkx_edge_labels(G,pos, edge_labels=labels)
{(0, 1): Text(0.5, 7.5, '0.97'), (0, 2): Text(2.0, 7.5, '0.43'), (0, 3): Text(3.5, 12.5, '0.01'), (1, 2): Text(1.5, 10.0, '0.2'), (1, 3): Text(3.0, 15.0, '0.01'), (2, 3): Text(4.5, 15.0, '0.11'), (2, 4): Text(5.0, 17.5, '0.03'), (3, 4): Text(6.5, 22.5, '0.97'), (3, 5): Text(6.0, 30.0, '0.07'), (4, 5): Text(6.5, 32.5, '0.17')}
(-0.7350000000000001, 7.734999999999999, 1.325, 43.675)

Laplacian Matrix of a Graph

The similarity matrix \(S\) is also known as the adjacency matrix \(A\) of the graph \(G\),315 from which the degree matrix \(D\) can be constructed: \[D(i,j) = \begin{cases} d_{i,i} = \displaystyle{\sum_{k=1}^n a_{i,k}}, & i=j \\ 0, & \text{otherwise} \end{cases}\]

By definition, \(D\) is diagonal; the element \(d_{i,i}\) can be viewed as an estimate of the density around \(\mathbf{x}_i\); as \(a_{i,k}(=s_{i,k)}\) is a measure of similarity between \(\mathbf{x}_i\) and \(\mathbf{x}_k\), the larger \(a_{i,k}\) is, the more similar the two observations are.

A large value of \(d_{i,i}\) indicates the presence of one or more observations “near” \(\mathbf{x}_i\); conversely, a small value of \(d_{i,i}\) suggests that \(\mathbf{x}_i\) is isolated.

The Laplacian and normalized Laplacian matrices can now be defined as \[L = D - A\quad\mbox{and}\quad \mathcal{L} = D^{-1/2} L D^{-1/2},\] respectively. Since \(D\) is diagonal, \(D^{-1/2} = \text{diag}\left(d_{i,i}^{-1/2}\right)\).

It can be shown that \(L\) and \(\mathcal{L}\) are both positive semi-definite matrices. By construction, the smallest eigenvalue of \(L\) is 0, with associated eigenvector \(\mathbb{1}\), since \[\begin{aligned}L\mathbf{1}&=(D-A)\mathbf{1}=D\mathbf{1}-A\mathbf{1} \\ &= \begin{pmatrix}d_{1,1} \\ \vdots \\ d_{n,n}\end{pmatrix}-\begin{pmatrix}\displaystyle{\sum_{k=1}^na_{1,k}} \\ \vdots \\ \displaystyle{\sum_{k=1}^na_{n,k}}\end{pmatrix}=\begin{pmatrix}d_{1,1}-\displaystyle{\sum_{k=1}^na_{1,k}} \\ \vdots \\ d_{n,n}-\displaystyle{\sum_{k=1}^na_{n,k}}\end{pmatrix}=\begin{pmatrix}0 \\ \vdots \\ 0\end{pmatrix}=0\cdot \mathbf{1}.\end{aligned}\]

For \(\mathcal{L}\), the corresponding eigenpair is \(0\) and \(\text{diag}(D^{1/2})\) (the proof is similar).

We can compute the degree matrix \(D\) and the Laplacian \(L\) for the toy example above. For the degree matrix, we use the method sum() from numpy with the argument axis=1 to sum over the columns and diag() to convert the result into a diagonal matrix. The Laplacian is simply \(L=D-S\).

rowsum = S.sum(axis=1)
D = np.diag(rowsum)
L = D - S

Eigenvectors as Cluster Separators

The eigenvectors of the Laplacian matrices have some very useful properties relating to features selection. If \(\xi\in\mathbb{R}^n\) is an eigenvector of \(L\) or \(\mathcal{L}\), then \(\xi\) can be viewed as a function that assigns a value to each observation in \(\mathbf{X}\).

This point-of-view can prove quite useful, as the following simple example from [312] shows. Let \(\mathbf{X}\) be constructed of three two-dimensional Gaussians, each with unit variance (and no covariance) but with different means. We start by generating 30 observations for each of the 3 mechanisms, and re-shuffle them into a “random” dataset.

random.seed(1234) # for replicability

mean1 = [0,5]
cov1 = [[1,0],[0,1]]
X1 = np.random.multivariate_normal(mean1,cov1,30)

mean2 = [5,0]
cov2 = [[1,0],[0,1]]
X2 = np.random.multivariate_normal(mean2,cov2,30)

mean3 = [-5,-5]
cov3 = [[1,0],[0,1]]
X3 = np.random.multivariate_normal(mean3,cov3,30)


X = np.concatenate((X1,X2,X3), axis=0)
Y = np.array([0] * 30 + [1] * 30 + [2] * 30).reshape((90,1))

df = np.concatenate((X,Y), axis = 1)
X,Y = df[:,:2],df[:,2]

We can then compute the similarity (using Guassian RBF with \(\sigma = 1\)), adjacency, degree and Laplacian matrix. Using eigh() from numpy we can find the eigenvalue and eigenvectors of \(L\). This function returns a vector of all the eigenvalues of \(L\) and a matrix of all its eigenvectors as columns. According to the convention, the eigenspace is sorted so that the eigenvalues satisfy \(0=\lambda_1 \le \lambda_2 \le \lambda_3 \le \dots\).

A = squareform(np.exp(-1 * pdist(X, 'sqeuclidean')))
rowsum = A.sum(axis=1)
D = np.diag(rowsum)
L = D - A

# Find eigenspace of L (vs: eigenvalue vector, es: eigenvector matrix)
vs,es = eigh(L)

# Sort the eigenvalue
arg_sort = vs.argsort()
vs = vs[arg_sort]
es = es[:,arg_sort]

We show properties of \(L\)’s spectrum by providing the contour plot of the second eigenvector/eigenvalue pair.316

def gen_z(lam = 1):
  """ Compute the meshgrid for z
  for i in range(len(x)):
    for j in range(len(y)):
      dist = cdist([[x_[i,j],y_[i,j]]],X)[0]
      z[i,j] = np.average(es[:,lam], weights= 1/dist)

# Setup the meshgrid 
x = np.arange(-8.5,8.5,0.05)
y =  np.arange(-8.5,8.5,0.05)
x_, y_ = np.meshgrid(x,y, indexing='ij')
z = 0*x_

# Index of the eigenvalue to be plotted (0 is the first, etc)
l = 1 

# Compute the meshgrid with the gen_z function

# Setup and plot
fig,ax = plt.subplots()
cs = ax.contourf(x_, y_, z)
ax.grid(c='k', ls='-', alpha=0.3)
ax.scatter(X[:,0],X[:,1], c=Y)
<matplotlib.contour.QuadContourSet object at 0x7f894b865460>
<matplotlib.colorbar.Colorbar object at 0x7f8948e7ac40>

The next block accomplishes the same tasks, but it does so for potentially more than one eigenvector/eigenvalue pair, and , and aranges the plots in a grid. The contour plot of the ranked eigenvectors \(\xi_1, \xi_2, \xi_3, \xi_4, \xi_5\) and \(\xi_{20}\), corresponding to the eigenvalues \[\lambda_1\leq\lambda_2\leq \lambda_3\leq\lambda_4 \leq \lambda_5\leq \lambda_{20},\] are computed and displayed.

x = np.arange(-8.5,8.5,0.5)
y =  np.arange(-8.5,8.5,0.5)
x_, y_ = np.meshgrid(x,y, indexing='ij')
z = 0*x_

# List of index of the eig to be plotted
ls = [0,1,2,3,4,19]
grid_length = 3
grid_width = 2

for i in range(len(ls)):
  ax = plt.subplot(grid_length,grid_width,i+1)
  z = 0*x_
  cs = ax.contourf(x_, y_, z)
  ax.grid(c='k', ls='-', alpha=0.3)
  ax.scatter(X[:,0],X[:,1], c=Y)

From those plots, it seems as though the first eigenvector does a better job at capturing the cluster structure in the data, while larger values tends to capture more of the sub-cluster structure. One thing to note is that it might appear that \(\lambda_1\) is as good (or better) as \(\lambda_2\) and \(\lambda_3\) to separate the groups, but a closer look at the scale of the contour plot of \(\lambda_1\) shows that its values have a miniscule range.

The fact that there is any variation at all is due to floating point errors in the practical computation of the eigenvalue \(\lambda_1\) and the eigenvector \(\xi_1\); as seen previously, these should be exactly \(0\) and \(\mathbf{1}\), respectively.

At any rate, this process shows how the eigenpairs of the Laplacian matrix contains information about the structure of \(\mathbf{X}\).

In spectral graph theory, the eigenvalues of the Laplacian measure the smoothness of the eigenvectors. An eigenvector is said to be smooth if it assigns similar values to points that are near one another.

In the previous example, assume that the data is unshuffled, that is, the first \(k_1\) points are in the same cluster, the next \(k_2\) are also in the same, albeit different, cluster, and so on. The next plot shows the smoothness of the eigenvector over each cluster; colour is added to emphasize the cluster limits.

# Recompute everything without the shuffling and sorting part
X = np.concatenate((X1,X2,X3), axis=0)
Y = np.array([0] * 30 + [1] * 30 + [2] * 30).reshape((90,1))

A = squareform(np.exp(-1 * pdist(X, 'sqeuclidean')))
rowsum = A.sum(axis=1)
D = np.diag(rowsum)
L = D - A
vs,es = eigh(L)

# List of index of selected eigenvalue. 
ls = [0,1,2,19]

# Plot a line for each index in ls
for l in ls:
  plt.plot(es[:,l], label = 'lambda{0}'.format(l+1))

# Color the cluster limits
plt.axvspan(0,30, alpha = 0.5)
plt.text(15,-0.4, 'Cluster 1', horizontalalignment='center', fontsize=12)
plt.axvspan(30,60, alpha = 0.5, facecolor='g')
plt.text(45,-0.4, 'Cluster 2', horizontalalignment='center', fontsize=12)
plt.axvspan(60,90, alpha = 0.5, facecolor='orange')
plt.text(75,-0.4, 'Cluster 3', horizontalalignment='center', fontsize=12)

plt.xlabel('Component k of the eigenvector')

Both \(\lambda_2\) and \(\lambda_3\) are fairly smooth, as they seem to be piece-wise constant on each cluster, whereas \(\lambda_{20}\) is all over the place on cluster 1 and constant on the rest of the data. As discussed previously \(\lambda_1\) is constant over the entirety of the dataset, marking it as maximally smooth but not very useful from the perspective of differentiating data structure.317

Indeed, let \(\mathbf{x}\in \mathbb{R}^n\). then \[\begin{aligned} \mathbf{x}^{\!\top}L\mathbf{x} &= \mathbf{x}^{\!\top}D\mathbf{x} - \mathbf{x}^{\!\top}A\mathbf{x} = \sum_{i=1}^n d_ix^2_i - \sum_{i,j=1}^n a_{i,j} x_ix_j \\&= \frac{1}{2} \left( \sum_{i=1}^n d_ix^2_i - 2\sum_{i,j=1}^n a_{ij} x_ix_j + \sum_{j=1}^n d_jx^2_j \right) \\&= \frac{1}{2} \sum_{i,j=1}^n a_{ij} (x_i - x_j)^2\end{aligned}\]

If \(\mathbf{x} = \xi\) is a normalized eigenvector of \(L\), then \(\xi^{\!\top}L\xi =\lambda \xi^{\!\top}\xi = \lambda\), thus \[\lambda = \xi^{\!\top}L\xi = \frac{1}{2} \sum_{i,j=1}^n a_{ij} (\xi_i - \xi_j)^2.\]

Instinctively, if the eigenvector component does not vary a lot for observations that are near one another, one would expect the corresponding eigenvalue to be small; this result illustrates why the small magnitude of the eigenvalue is a good measure of the smoothness of its associated eigenvector.

Feature Ranking

We can use the above discussion as a basis for feature selection. If \(\mathbf{x}\) is not en eigenvector of \(L\), the value \(\mathbf{x}^{\!\top}L\mathbf{x}\) can also be seen as a measure of how much \(\mathbf{x}\) varies locally. This can be used to measure how meaningful a feature \(\mathbf{f} \in \mathbb{R}^n\) is.

In the current example, the only two features are the Euclidean coordinates of the observations: \(\mathbf{f}_1\) and \(\mathbf{f}_2\). Let us add a useless feature to the dataset \(\mathbf{f}_3\), say a uniformly distributed random variable across the clusters.

# Add a random feature
U1 = np.random.uniform(size=(90,1))
X = np.concatenate((X,U1), axis=1)

f = [0,1,2]

for i in f:
  plt.plot(X[:,i], label = 'F{0}'.format(i+1))

plt.axvspan(0,30, alpha = 0.5)
plt.text(15,2, 'Cluster 1', horizontalalignment='center', fontsize=12)
plt.axvspan(30,60, alpha = 0.5, facecolor='g')
plt.text(45,2, 'Cluster 2', horizontalalignment='center', fontsize=12)
plt.axvspan(60,90, alpha = 0.5, facecolor='orange')
plt.text(75,2, 'Cluster 3', horizontalalignment='center', fontsize=12)

plt.xlabel('Component k of feature')

The plot shows that the third feature is not able to distinguish between the clusters. However, we also have:

for i in [0,1,2]:
  f = X[:,i]
  print("F{0}".format(i+1), "->",f.T.dot(f.dot(L)))
F1 -> 105.35092482283625
F2 -> 110.6011478717457
F3 -> 46.90787692252817

so that \[\mathbf{f}_1^{\top} L \mathbf{f}_1 = 94.3, \quad \mathbf{f}_2^{\top} L \mathbf{f}_2 = 102.5, \quad \mathbf{f}_3^{\top} L \mathbf{f}_3 = 41.7;\] by the previous assumption relating the magnitude of \(\xi^{\!\top}L\xi\) to the smoothness of \(\xi\), this would seem to indicate that \(\mathbf{f}_3\) is a “better” feature than the other two.

The problem is that the value of \(\mathbf{f}_i^{\!\top}L \mathbf{f}_i\) is affected by the respective norms of \(\mathbf{f}_i\) and \(L\). This need to be addressed.

The relation between \(L\) and \(\mathcal{L}\) yields \[\mathbf{f}_i^{\!\top} L \mathbf{f}_i = \mathbf{f}_i^{\!\top} D^{1/2} \mathcal{L} D^{1/2} \mathbf{f}_i = (D^{1/2} \mathbf{f}_i)^T \mathcal{L} (D^{1/2} \mathbf{f}_i).\] Set \(\tilde{\mathbf{f}_i} = (D^{1/2} \mathbf{f}_i)\) and \(\hat{\mathbf{f}_i} = \tilde{\mathbf{f}_i} / \lVert \tilde{\mathbf{f}_i} \rVert\). The feature score metric \(\varphi_1\) is a normalized version of the smoothness measure: \[\varphi_1 (\mathbf{f}_i) = \hat{\mathbf{f}_i}^{\!\top} \mathcal{L} \hat{\mathbf{f}_i}, \quad i=1,\ldots,p.\]

For \(\varphi_1\), smaller values are better. The scoring function can also be defined using the spectral decomposition of \(\mathcal{L}\).

Suppose that \((\lambda_k,\xi_k), \ 1\le k \le n\) are eigenpairs of \(\mathcal{L}\) and let \(\alpha_{k} = \hat{\mathbf{f}_i}^{\!\top} \xi_k\), for a given \(i\). Then \[\varphi_1(\mathbf{f}_i) = \sum_{k=1}^n \alpha_k^2\lambda_k,\quad \text{with } \sum_{k=1}^n \alpha_k^2 =1.\]

Indeed, let \(\mathcal{L} = \mathbf{U\Sigma U}^{\!\top}\) be the eigen-decomposition of \(\mathcal{L}\). By construction, \(\mathbf{U} = [\xi_1| \xi_2| \cdots| \xi_n]\) and \(\mathbf{\Sigma}=\text{diag}(\lambda_k)\), so that \[\begin{aligned} \varphi_1 (\mathbf{f}_i) &= \hat{\mathbf{f}_i}^{\!\top} \mathcal{L} \hat{\mathbf{f}_i} = \hat{\mathbf{f}_i}^{\!\top} \mathbf{U\Sigma U}^{\!\top}\hat{\mathbf{f}_i} \\ &= (\alpha_1,\dots,\alpha_n)\boldsymbol{\Sigma}(\alpha_1,\dots,\alpha_n)^{\!\top} = \sum_{k=1}^n \alpha_k^2 \lambda_k.\end{aligned}\]

This representation allows for a better comprehension of the \(\varphi_1\) score; \(\alpha_k\) is the cosine of the angle between the normalized feature \(\hat{\mathbf{f}}_i\) and eigenvector \(\xi_k\). If a feature aligns with “good” eigenvectors (i.e., those with small eigenvalues), its \(\varphi_1\) score will also be small.

The larger \(\alpha_1^2\) is, the smaller \(\sum_{k=2}^n \alpha_k^2\) is; this is problematic because, in such cases, a small value of \(\varphi_1\) indicates smoothness but not separability. To overcome this issue, \(\varphi_1\) can be normalized by \(\sum_{k=2}^n \alpha_k^2\), which yields a new scoring function: \[\varphi_2(\mathbf{f}_i) = \frac{\sum_{k=1}^n \alpha_k^2 \lambda_k}{\sum_{k=2}^n \alpha_k^2} = \frac{\hat{\mathbf{f}_i}^{\!\top} \mathcal{L} \hat{\mathbf{f}_i}}{1 - \left( \hat{\mathbf{f}_i}^{\!\top} \xi_1 \right)}\]

A small value for \(\varphi_2\) once again indicates that a feature closely aligns with “good” eigenvectors.

Another ranking feature is closely related to the other two. According to spectral clustering, the first \(k\) non-trivial eigenvectors form an optimal set of soft cluster indicators that separate the graph \(G\) into \(k\) connected parts. Therefore, we define \(\varphi_3\) as \[\varphi_3(\mathbf{f}_i,k) = \sum_{j=2}^k(2-\lambda_j)\alpha_j^2.\] Contrary to the other scoring functions, \(\varphi_3\) assigns larger value to feature that are more relevant. It also prioritizes the leading eigenvectors, which helps to reduce noise. Using this ranking function requires a number of categories or clusters \(k\) to be selected (depending on the nature of the ultimate task at hand); if this value is unknown, \(k\) becomes a hyper-parameter to be tuned.

The feature score metrics are implemented as below:

D_sq = np.sqrt(D)
L_norm = D_sq.dot(L).dot(D_sq)

def score_1(i):
  f_tilde = D_sq.dot(X[:,i])
  f_hat = f_tilde / norm(f_tilde)
  return f_hat.dot(L_norm).dot(f_hat)
def score_2(i):
  f_tilde = D_sq.dot(X[:,i])
  f_hat = f_tilde / norm(f_tilde)
  phi_1 = f_hat.dot(L_norm).dot(f_hat)
  return phi_1 / (1 - (f_hat.dot(es[:,0])**2))

def score_3(i,k):
  f_tilde = D_sq.dot(X[:,i])
  f_hat = f_tilde / norm(f_tilde)
  alpha = f_hat.dot(es)
  temp = (2 - vs[1:k]) * (alpha[1:k])**2
  return np.sum(temp)  

from tabulate import tabulate

n_feature = X.shape[1]
results = {'phi_1':[], 'phi_2':[], 'phi_3': []}
k = 3

for i in range(n_feature):

    phi_1    phi_2    phi_3
--  -------  -------  --------
0   2.5516   3.28365  1.37706
1   2.71268  3.47861  1.40086
2   17.0148  31.8369  0.433994

This make more sense, as the pattern is similar to the pattern obtained for the eigenvalues: \(\mathbf{f}_1,\mathbf{f}_2\), being able to differentiate the clusters, have smaller \(\varphi_1\) scores than \(\mathbf{f}_3\). Returning to the current example, while the score of the useless feature 3, \(\varphi_1(\mathbf{f}_3)\) is larger than the other scores, it is still small when compared to the eigenvalues of \(L\). This is due to the fact the \(\mathbf{f}_3\) and \(\xi_1\) are nearly co-linear.

Computing \(\varphi_2\) for our three features yields a larger distinction between the real features and the random, useless one than with \(\varphi_1\).


There is one glaring problem with the ranking functions that have been defined previously: they all assume the existence of a gap between subsets of “large” and “small” eigenvalues. For clearly separated data, that is to be expected; but in noisy data, this gap may be negligible, which leads to an increase in the score value of poor features [314].

This issue can be tackled by applying a spectral matrix function \(\gamma(\cdot)\) to the Laplacian \(\mathcal{L}\), replacing the original eigenvalues by regularized eigenvalues as follows: \[\gamma(\mathcal{L}) = \sum_{j=1}^n \gamma(\lambda_j)\xi_j\xi_j^{\!\top}.\]

In order for this to work properly, \(\gamma\) needs to be (strictly) increasing. Examples of such regularization functions include:

\(\gamma(\lambda)\) (name)
\(1 + \sigma^2 \lambda\) (regularized Laplacian)
\(\exp(\sigma^2/2\lambda)\) (diffusion Process)
\(\lambda^{\nu}, \ \nu\ge2\) (high-order polynomial)
\((a - \lambda)^{-p}, \ a\ge 2\) (\(p\)-step random walk)
\((\cos\lambda\pi/4)^{-1}\) (inverse cosine)

The ranking function \(\varphi_1, \varphi_2,\varphi_3\) can be regularized via \[\begin{aligned} \hat{\varphi}_1(\mathbf{f}_i) &= \sum_{k=1}^n \alpha_k^2 \gamma(\lambda_k) \\ \hat{\varphi}_2(\mathbf{f}_i) &= \frac{\hat{\mathbf{f}_i}^{\!\top} \gamma(\mathcal{L}) \hat{\mathbf{f}_i}}{1 - \left( \hat{\mathbf{f}_i}^{\!\top} \xi_1 \right)} \\\hat{\varphi}_3(\mathbf{f}_i) &= \sum_{j=2}^k(\gamma(2)-\gamma(\lambda_j))\alpha_j^2 \end{aligned}\]

To illustrate how this regularization process can help reduce noise (still using the framework from the previous example), \(\mathbf{X}\) was contaminated with random values from a normal distribution with a variance of \(1.5\).

random.seed(5678) # for replicability
noise = np.random.normal(0, 1.3, 90*3).reshape(90,3)
X_noise = X + noise
plt.scatter(X_noise[:,0], X_noise[:,1])

We plot the components of the eigenvalues of three normalized Laplacians: one from the original data, one from the noisy data, and one from the noisy data with a 3rd order polynomial regularization.

def isqrt(x):
  if x == 0:
    return 0
    return x**(-0.5)

v_isqrt = np.vectorize(isqrt)

def find_eig(X, k = lambda x:x):
  A = squareform(np.exp(-1 * pdist(X, 'sqeuclidean')))
  rowsum = A.sum(axis=1)
  D = np.diag(rowsum)
  L = D - A
  D_is = v_isqrt(D)
  L_norm = k(D_is.dot(L).dot(D_is))
  vs,es = eigh(L_norm)
  arg_sort = vs.argsort()
  vs = vs[arg_sort]
  es = es[:,arg_sort]
  return vs,es, L_norm

vs,es,L = find_eig(X)
vs_n,es_n,L_noise = find_eig(X_noise)
vs_k,es_k, L_k = find_eig(X_noise, k = lambda x:x**3)

plt.plot(vs, '.', label = 'Real')
plt.plot(vs_n, 'x', label = 'Noise')
plt.plot(vs_k, "+", label = '3 order poly')
plt.title('eigenvalue lambda')

The above plot shows the effect of noise on \(\mathcal{L}\)’s: it tends to linearize the eigenvalues, and this provides much support to the poorer eigenvectors. In the same Figure, the eigenvalues of the noisy Laplacian have been regularized using the standard cubic \(\gamma(\lambda) = \lambda^3\); the distinction between the first eigenvalues and the rest is clear.

We can compare different kernels: - regularized Laplacian\(\gamma(\lambda) = 1 + (0.9)^2 \lambda\) - high-order polynomial\(\gamma(\lambda) = \lambda^3\) - diffusion process\(\gamma(\lambda) = \exp( (0.3)^2 / 2\lambda)\) - \(p-\)step random walk\(\gamma(\lambda) = (2-\lambda)^{-1}\) - inverse cosine\(\gamma(\lambda) = \cos(\lambda * \pi/4)^{-1}\)

vs_n,es_n,_ = find_eig(X_noise)
vs_reg, es_reg,_ = find_eig(X_noise, k = lambda x:1 + (0.9)**2 * x)
vs_k3,es_k3,_ = find_eig(X_noise, k = lambda x:x**3)
vs_diff, es_diff,_ = find_eig(X_noise, k = lambda x: np.exp((0.3)**2/(2*x)))
vs_1step, es_1step,_ = find_eig(X_noise, k = lambda x: (2-x)**(-1))
vs_cos, es_cos, _ = find_eig(X_noise, k = lambda x: np.cos(x * (3.1416/4))**(-1))

plt.plot(vs_n, '.', label = 'Noise')
plt.plot(vs_k, '.', label = '3 order poly')
plt.plot(vs_reg, '.', label = 'Regularized (0.9)')
plt.plot(vs_diff, '.', label = 'Diffusion (0.5)')
plt.plot(vs_1step, '.', label = '1-Step (2)')
plt.plot(vs_cos, '.', label = 'Inverse Cos')

plt.title('eigenvalues lambda')
[<matplotlib.lines.Line2D object at 0x7f893209e370>]
[<matplotlib.lines.Line2D object at 0x7f893209e7f0>]
[<matplotlib.lines.Line2D object at 0x7f893209ea60>]
[<matplotlib.lines.Line2D object at 0x7f893209ed60>]
[<matplotlib.lines.Line2D object at 0x7f893209ec40>]
[<matplotlib.lines.Line2D object at 0x7f89320a8460>]
(0.0, 3.0)
Text(0.5, 1.0, 'eigenvalues lambda')
<matplotlib.legend.Legend object at 0x7f8932086220>

The choice of a specific regularization function depends on the context and the goals of the data analysis task; for large datasets, considerations of ease of computation may also form part of the selection strategy.

Spectral Feature Selection with SPEC

The remarks from the previous subsections can be combined to create a feature selection framework called SPEC [315]:

  1. using a specified similarity function \(s\), construct a similarity matrix \(S\) of the data \(\mathbf{X}\) (optionally with labels \(\mathbf{Y}\));

  2. construct the graph \(G\) of the data;

  3. extract the normalized Laplacian \(\mathcal{L}\) from this graph;

  4. compute the eigenpairs (eigenvalues and eigenvectors) of \(\mathcal{L}\);

  5. select a regularization function \(\gamma(\cdot)\);

  6. for each feature \(\mathbf{f}_i\), \(i=1,\ldots, p\), compute the relevance \(\hat\varphi(\mathbf{f}_i)\), where \(\hat\varphi \in \{\hat\varphi_1, \hat\varphi_2,\hat\varphi_3 \}\), and

  7. return the features in descending order of relevance.

In order for SPEC to provide “good” results, proper choices for the similarity, ranking, and regularization functions are needed. Among other considerations, the similarity matrix should reflect the true relationships between the observations.

Furthermore, if the data is noisy, it might be helpful to opt for \(\hat\varphi= \hat\varphi_3\) and/or \(\gamma(\lambda)=\lambda^{\nu}\), \(\nu\geq 2\). Finally, when the gap between the small and the large eigenvalues is wide, \(\hat\varphi=\hat\varphi_2\) or \(\hat\varphi=\hat\varphi_3\) provide good choices, although \(\hat\varphi_2\) has been shown to be more robust.[^11]

15.4.4 Uniform Manifold Approximation and Projection

The feature selection and dimension reduction landscape is in flux, and there are more recent (and sophisticated) developments that are generating a lot of interest. Case in point, consider Uniform Manifold Approximation and Projection (UMAP) methods.

Dimensionality Reduction and UMAP

A mountain is certainly a \(3\)-dimensional object. And the surface of a mountain range is \(2\)-dimensional – it can be represented with a flat map – even though the surface, and the map for that matter, still actually exist in \(3\)-dimensional space.

What does it mean to say that a shape is \(q\)-dimensional for some \(q\)? An answer to that question first requires an understanding of what a shape is.

Shapes could be lines, cubes, spheres, polyhedrons, or more complicated things. In geometry, the customary way to represent a shape is via a set of points \(S \subseteq \mathbb{R}^p\). For instance, a circle is the set of points whose distance to a fixed point (the centre) is exactly equal to the radius \(r\), whereas a disk is the set of points whose distance to the centre is at most the radius.

In the mountain example, \(p=3\) for both the mountains \(S_m\) and the mountain surface \(S_s\). So the question is, when is the (effective) dimension of a set \(S\) less than \(p\), and how is that dimension calculated?

It turns out that there are multiple definitions of the dimension \(q\) of a set \(S \subseteq \mathbb{R}^p\) [316]:

  • the smallest \(q\) for which \(S\) is \(q\)-dimensional manifold;

  • how many nontrivial \(n\)th homology groups of \(S\) there are;

  • how the “size” of the set scales as the coordinates scale.

A \(q\)-dimensional manifold is a set where each small region is approximately the same as a small region of \(\mathbb{R}^q\). For instance, if a small piece of a (stretchable) sphere is cut out with a cookie cutter, it could theoretically be bent so that it looks like it came from a flat plane, without changing its “essential” shape.

Dimensionality reduction is more than just a matter of selecting a definition and computing \(q\), however. Indeed, any dataset \(\mathbf{X}\) is necessarily finite and is thus, by definition, actually \(0-\)dimensional; the object of interest is the shape that the data would form if there were infinitely many available data points, or, in other words, the support of the distribution generating the data.

Furthermore, any dataset is probably noisy and may only approximately lie in a lower-dimensional shape.

Lastly, it is not clear how to build an algorithm that would, for example, determine what all the homology groups of some set \(S\) are. The problem is quite thorny. Let \(X \subseteq \mathbb{R}^p\) be a finite set of points. A dimensionality reducer for \(\mathbf{X}\) is a function \(f_{\mathbf{X}}:\mathbf{X}\to\mathbb{R}^q\), where \(q<p\), which satisfies certain properties that imply that \(f_{\mathbf{X}}(\mathbf{X})\) has similar structure to \(\mathbf{X}\).318

Various dimensionality reducers were discussed in Section 15.2; they each differ based on the relationship between \(\mathbf{X}\) and \(f_{\mathbf{X}}(\mathbf{X})\).

For instance, in PCA, the dataset \(\mathbf{X}\) is first translated, so that its points (or at least its “principal components”) lie in a linear subspace. Then \(q\) unit-length linear basis elements are chosen to span a subspace, projection onto which yields an affine map \(f\) from \(\mathbf{X}\) to \(\mathbb{R}^q\) that preserves Euclidean distances between points (a rigid transformation), assuming that the non-principal dimensions are ignored.

PCA seems reasonable but what if a rigid transformation down to \(\mathbb{R}^q\) is not possible? As an example, consider the swiss roll of Figure 15.10, which is a loosely rolled up rectangle in \(3\)-dimensional space. What can be preserved when we “reduce” this space? Only the local structure? The global structure?

UMAP is a dimension reduction method that attempts to approximately preserve both the local and global structure. It can be especially useful for visualization purposes, i.e., reducing to \(q=3\) or fewer dimensions. While the semantics of UMAP can be stated in terms of graph layouts, the method was derived from abstract topological assumptions. For a full treatment and mathematical properties, see [317].

Note that UMAP works best when the data \(\mathbf{X}\) is evenly distributed on its support \(\mathcal{S}\). In this way, the points of \(\mathbf{X}\) “cover” \(\mathcal{S}\) and UMAP can determine where the true gaps or holes in \(S\) are.

UMAP Semantics

Let the (scaled) data be denoted by \(\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\), where \(\mathbf{x}_i \in \mathbb{R}^p\) for all \(i\); let \(d: \mathbf{X} \times \mathbf{X} \to \mathbb{R}_{\geq 0}\) be a distance function, and let \(k\geq 1\) be an integer.

Consider a directed graph graph \(D=(V,E,A)\), with

  • vertices \(V(D) = \mathbf{X}\);

  • edges \(E(D)\) consisting of the ordered pairs \((\mathbf{x}_i, \mathbf{x}_j)\) such that \(\mathbf{x}_j\) is one of the \(k\) nearest neighbours of \(\mathbf{x}_i\) according to \(d\);

  • weight function \(W:E(D)\to \mathbb{R}\) such that \[w(\mathbf{x}_i, \mathbf{x}_{i,j}) = \exp\left(\frac{-\max(0, d(\mathbf{x}_i, \mathbf{x}_{i,j}) - \rho_i)}{\sigma_i}\right),\] where \(\mathbf{x}_{i,1}, \ldots, \mathbf{x}_{i,k}\) are the \(k\) nearest neighbours of \(\mathbf{x}_i\) according to \(d\), \(\rho_i\) is the minimum nonzero distance from \(\mathbf{x}_i\) to any of its neighbours, and \(\sigma_i\) is the unique real solution of \[\sum_{j=1}^k \exp \left( \frac{-\max(0, d(x_i,x_{i,j}) - \rho_i)}{\sigma_i}\right)= \log_2(k),\] and

  • \(A\) is the weighted adjacency matrix of \(D\) with vertex ordering \(\mathbf{x}_1, \ldots,\mathbf{x}_n\).

Define a symmetric matrix \[B = A + A^{\!\top} - A \circ A^{\!\top},\] where \(\circ\) is Hadamard’s component-wise product. The graph \(G=(V,W,B)\) has the same vertex set, the same vertex ordering, and the same edge set as \(D\), but its edge weights are given by \(B\). Since \(B\) is symmetric, \(G\) can be considered to be undirected.

UMAP returns the (reduced) points \(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n) \in \mathbb{R}^q\) by finding the position of each vertex in a force directed graph layout, which is defined via a graph, an attractive force function defined on edges, and a repulsive force function defined on all pairs of vertices.

Both force functions produce force values with a direction and magnitude based on the pair of vertices and their respective positions in \(\mathbb{R}^q\).

To compute the layout, initial positions in \(\mathbb{R}^q\) are chosen for each vertex, and an iterative process of translating points based on their attractive and repulsive forces is carried out until a convergence criterion is met. In UMAP, the attractive force between vertices \(\mathbf{x}_i, \mathbf{x}_j\) at positions \(y_i, y_j \in \mathbb{R}^q\), respectively, is \[\frac{-2ab \Vert y_i-y_j\Vert_2^{2(b-1)}}{1+\Vert y_i - y_j \Vert_2^2} w(x_i, x_j)(y_i-y_j),\] where \(a\) and \(b\) are parameters, and the repulsive force is \[\frac{b(1-w(x_i, x_j))(y_i - y_j)}{(0.001 + \Vert y_i - y_j \Vert_2^2)(1 + \Vert y_i - y_j \Vert_2^2)}.\]

There are a number of important free parameters to select, namely

  • \(k\): nearest neighbor neighborhood count;

  • \(q\): target dimension;

  • \(d\): distance function, e.g. Euclidean metric.

The UMAP documentation states,

low values of \(k\) will force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values will push UMAP to look at larger neighborhoods of each point …losing fine detail structure for the sake of getting the broader structure of the data. [317]

The user may set these parameters to appropriate values for the dataset. The choice of a distance metric plays the same role as in clustering, where closer pairs of points are considered to be more similar than farther pairs. There is also a minimum distance value used within the force directed layout algorithm which says how close together the positions may be.


We compare various dimensionality reducers for a number of datasets (adapted from the UMAP documentation).319 Let us start by installing the required Python modules.

import warnings

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition, manifold, preprocessing
from colorsys import hsv_to_rgb
import umap.umap_ as umap

We will plot the two-dimensional reduction of five different datasets; the points will be coloured to get a sense of which points got sent where (otherwise we would only know the shape of the reduced dataset, but have no way to tell how the local structure is affected by the reducers).

We use \(t-\)SNE, Isomap, MDS (multidimensional scaling), PCA, and UMAP. The 5 datasets are:

  • a set consisting of 4 distinct 10-dimensional Gaussian distributions;

  • the digit classification dataset;

  • the wine characteristics dataset (essentially 1D);

  • a 2D rectangle rolled up in 3D space (colours indicate the position along the unrolled rectangle), and

  • points on the 2D surface of a 3D sphere (which is not homeomorphic to \(\mathbb{R}^2\))); even with the north pole removed, the stereographic projection would map points close to the north pole arbitrarily far from the origin (colour hue indicates the angle around the equator and darkness indicates distance from south pole).

blobs, blob_labels = datasets.make_blobs(
    n_samples=500, n_features=10, centers=4
digits = datasets.load_digits(n_class=10)

wine = datasets.load_wine()

swissroll, swissroll_labels = datasets.make_swiss_roll(
    n_samples=1000, noise=0.1
sphere = np.random.normal(size=(600, 3))
# scale points to have same distance from origin
sphere = preprocessing.normalize(sphere)
# compute colours in HSV format
sphere_hsv = np.array([
          (np.arctan2(c[1], c[0]) + np.pi) / (2 * np.pi),
          np.abs(c[2]), min((c[2] + 1.1), 1.0),
      for c in sphere
# convert colours to RGB format
sphere_colors = np.array([hsv_to_rgb(*c) for c in sphere_hsv])  

Next we set parameters for the reducer algorithms. For UMAP, we set min_dist to 0.3 which will spread out the points to a noticable degree. We also set the number \(k\) of nearest neighbours to 30. The choices for the other reduces are shown in the code block.

In practice, we would typically set these parameters on a dataset-by-dataset basis, but for illustration purposes, we do not need to fine tune the choices.

reducers = [
    (manifold.TSNE, {"perplexity": 50}),
    (manifold.Isomap, {"n_neighbors": 30}),
    (manifold.MDS, {}),
    (decomposition.PCA, {}),
    (umap.UMAP, {"n_neighbors": 30, "min_dist": 0.3}),

test_data = [
    (blobs, blob_labels),
    (digits.data, digits.target),
    (wine.data, wine.target),
    (swissroll, swissroll_labels),
    (sphere, sphere_colors),

dataset_names = ["Blobs", "Digits", "Wine", "Swiss Roll", "Sphere"]

Now, we compute the 2D reductions for every reducer-dataset pair (this step is time-consuming):

reductions_and_labels = [(reducer(n_components=2, **args).fit_transform(data), 
    for data, labels in test_data
    for reducer, args in reducers

And we display the results:

n_rows = len(test_data)
n_cols = len(reducers)
fig = plt.figure(figsize=(16, 16))

fig.subplots_adjust(left=.02, right=.98, bottom=.001, 
  top=.96, wspace=.05, hspace=.02)
ax_index = 1
ax_list = []

for reduction, labels in reductions_and_labels:
    ax = fig.add_subplot(n_rows, n_cols, ax_index)
    if isinstance(labels[0], tuple):
        # if labels are colours, use them
        ax.scatter(*reduction.T, s=10, c=labels, alpha=0.5)
        # otherwise, use "spectral" map from labels to colours
        ax.scatter(*reduction.T, s=10, c=labels, cmap="Spectral", alpha=0.5)
    ax_index += 1

plt.setp(ax_list, xticks=[], yticks=[])

for i in np.arange(n_rows) * n_cols:
    ax_list[i].set_ylabel(dataset_names[i // n_cols], size=16)

for i in range(n_cols):
    ax_list[i].set_xlabel(repr(reducers[i][0]()).split("(")[0], size=16)


Notice that Isomap removes exactly the wrong dimension in the swiss roll; \(t-\)SNE (and MDS to some extent) reduces the wine data down to one dimension (its true dimensionality) even though it was only asked for a reduction to two dimensions. UMAP manages to give the most sensible output for the swiss roll.


J. Leskovec, A. Rajamaran, and J. D. Ullman, Mining of Massive Datasets. Cambridge Press, 2014.
Y. Goldberg, Neural Network Methods for Natural Language Processing. Morgan; Claypool, 2017.
Z. A. Zhao and H. Liu, Spectral Feature Selection for Data Mining. CRC Press, 2011.
D. Duvenaud, Automatic Model Construction with Gaussian Processes,” PhD thesis, Computational and Biological Learning Laboratory, University of Cambridge, 2014.
T. Zhang and R. K. Ando, Analysis of spectral kernel design based semi-supervised learning,” in Advances in neural information processing systems, 2005, vol. 18.
A. J. Smola and R. Kondor, “Kernels and regularization on graphs,” in Learning Theory and Kernel Machines, 2003, pp. 144–158.
T. Gowers, J. Barrow-Green, and I. Leader, Eds., The Princeton Companion to Mathematics. Princeton University Press, 2008.
L. McInnes, J. Healy, and J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint, 2018.