## 15.2 Dimension Reduction

There are many advantages to working with reduced, low-dimensional data:

- **visualisation methods** of all kinds are available and (more) readily applicable to such data in order to extract and present insights;
- high-dimensional datasets are subject to the so-called **curse of dimensionality** (CoD), which asserts (among other things) that multi-dimensional spaces are, well, vast – when the number of features in a model increases, the number of observations required to maintain predictive power also increases, but at a **substantially larger rate**;
- another consequence of the curse is that in high-dimensional sets, all observations are roughly **dissimilar** to one another – observations tend to be nearer the dataset’s boundaries than they are to one another.

Dimension reduction techniques such as the ubiquitous **principal component analysis**, **independent component analysis**, **factor analysis** (for numerical data) or **multiple correspondence analysis** (for categorical data) project multi-dimensional datasets onto low-dimensional but high-information spaces (the so-called **Manifold Hypothesis**).

Some information is necessarily lost in the process, but in many instances the drain can be kept under control and the gains made by working with smaller datasets can offset the loss of completeness.

### 15.2.1 Sampling Observations

Datasets can be “big” in a variety of ways:

- they can be too large for the **hardware** to handle (that is to say, they cannot be stored or accessed properly due to the number of observations, the number of features, or the overall size of the dataset), or
- the dimensions can go against specific *modeling assumptions* (such as the number of features being much larger than the number of observations, say).

For instance, multiple sensors which record 100+ observations per second in a large geographical area over a long time period can lead to excessively big datasets. A natural question then, regarding such a dataset, is whether every one of its rows needs to be used: if rows are selected randomly (with or without replacement), the resulting sample might be **representative**^{289} of the entire dataset, and the smaller set might be easier to handle.
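The chapter contains no code, but the row-sampling idea is easy to demonstrate; here is a minimal Python sketch (the dataset is synthetic and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical "large" dataset: 1,000,000 observations, 5 features
data = rng.normal(size=(1_000_000, 5))

# draw a 1% sample of the rows, without replacement
idx = rng.choice(data.shape[0], size=10_000, replace=False)
sample = data[idx]

# if the sample is representative, its feature means should be
# close to the full-data means
max_dev = np.abs(sample.mean(axis=0) - data.mean(axis=0)).max()
```

The deviation between sample and population means shrinks at the familiar \(1/\sqrt{n}\) rate, which is what makes the smaller set a workable surrogate for many (but, as noted below, not all) purposes.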

There are some drawbacks to the sampling approach, however:

- if the signal of interest is **rare**, sampling might lose its presence altogether;
- if aggregation will be happening at some point in the reporting process, sampling will necessarily affect the totals,^{290} and
- even simple operations on large files (finding the number of lines, say) can be taxing on the memory or in terms of computation time – some knowledge or **prior information** about the dataset structure can help.

Sampled datasets can also be used to work the kinks out of the data analysis workflows, but the key take-away is that if data is too big to store, access, and manipulate in a reasonable amount of time, the issue is mostly a **Big Data problem** – this is the time to start considering the use of distributed computing (see Big Data and Parallel Computing).

### 15.2.2 The Curse of Dimensionality

A model is said to be **local** if it depends solely on the observations near the input vector (\(k\) nearest neighbours classification is local, whereas linear regression is global). With a large training set, increasing \(k\) in a \(k\)NN model, say, will yield enough data points to provide a solid approximation to the theoretical classification boundary.

The **curse of dimensionality** (CoD) is the breakdown of this approach in high-dimensional spaces: when the number of features increases, the number of observations required to maintain predictive power also increases, **but at a substantially higher rate** (see Figure 8.10 for an illustration of the CoD in action).

#### Manifestations of CoD

Let \(x_i\sim U^1(0,1)\) be i.i.d. for \(i=1,\ldots, n\). For any \(z \in [0,1]\) and \(\varepsilon\in (0,1]\) such that \[I_1(z;\varepsilon)=\{y\in \mathbb{R}: |z-y|_{\infty}<\varepsilon\}\subseteq [0,1],\] the expected number of observations \(x_i\) in \(I_1(z;\varepsilon)\) is \[\left|I_1(z;\varepsilon)\cap \{x_i\}_{i=1}^n\right|\approx\varepsilon\cdot n,\] so an \(\varepsilon_{\infty}-\)ball subset of \([0,1]^1\) contains approximately \(\varepsilon\) of the observations in \(\{x_i\}_{i=1}^n\subseteq \mathbb{R}\), on average.

Let \(\mathbf{x}_i\sim U^2(0,1)\) be i.i.d. for all \(i=1,\ldots, n\). For any \(\mathbf{z} \in [0,1]^2\) and \(\varepsilon\in (0,1]\) such that \[I_2(\mathbf{z};\varepsilon)=\{\mathbf{y}\in \mathbb{R}^2: \|\mathbf{z}-\mathbf{y}\|_{\infty}<\varepsilon\}\subseteq [0,1]^2,\] the expected number of observations \(\mathbf{x}_i\) in \(I_2(\mathbf{z};\varepsilon)\) is \[\left|I_2(\mathbf{z};\varepsilon)\cap \{\mathbf{x}_i\}_{i=1}^n\right|\approx\varepsilon^2\cdot n,\] so an \(\varepsilon_{\infty}-\)ball subset of \([0,1]^2\) contains approximately \(\varepsilon^2\) of the observations in \(\{\mathbf{x}_i\}_{i=1}^n\subseteq \mathbb{R}^2\), on average.

In general, the same reasoning shows that an \(\varepsilon_{\infty}-\)ball subset of \([0,1]^p\subseteq \mathbb{R}^p\) contains approximately \(\varepsilon^p\) of the observations in \(\{\mathbf{x}_i\}_{i=1}^n\subseteq \mathbb{R}^p\), on average.

Thus, to capture a fraction \(r\) of uniformly distributed observations in a unit \(p-\)hypercube, a \(p-\)hypercube with edge \[\varepsilon_p(r)=r^{1/p}\] is needed, on average. For instance, to capture \(r=1/3\) of the observations in a unit \(p-\)hypercube in \(\mathbb{R}\), \(\mathbb{R}^2\), and \(\mathbb{R}^{10}\), a hyper-subset with edge \(\varepsilon_1(1/3)\approx 0.33\), \(\varepsilon_2(1/3)\approx 0.58\), and \(\varepsilon_{10}(1/3)\approx 0.90\) is needed, respectively.
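The edge-length formula is easy to verify numerically; a short Python sketch (the function name `edge` is ours):

```python
# edge length of a p-hypercube needed to capture a fraction r of
# uniformly distributed observations in the unit p-hypercube:
# eps_p(r) = r^(1/p)
def edge(r, p):
    return r ** (1 / p)

# the three values quoted in the text, rounded to 2 decimals
edges = {p: round(edge(1/3, p), 2) for p in (1, 2, 10)}
```

As \(p\) grows, the required edge approaches 1: in high dimension, a neighbourhood must span nearly the entire range of each feature to capture even a third of the data.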

The inference is simple, but far-reaching: in general, as \(p\) increases, the nearest observations to a given point \(\mathbf{x}_j\in\mathbb{R}^p\) are in fact quite distant from \(\mathbf{x}_j\), in the Euclidean sense, on average – **locality is lost**!^{291}

This can wreak havoc on models and algorithms that rely on the (Euclidean) nearness of observations (\(k\) nearest neighbours, \(k-\)means clustering, etc.). The CoD manifests itself in various ways.
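The loss of locality can be observed directly with a quick simulation (a Python sketch using numpy; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# average distance to the nearest neighbour among n uniform
# points in [0,1]^p, for increasing dimension p
nn = {}
for p in (2, 10, 100):
    X = rng.uniform(size=(n, p))
    # full pairwise Euclidean distance matrix (n x n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    nn[p] = d.min(axis=1).mean()
```

The mean nearest-neighbour distance grows steadily with \(p\): the "nearest" points stop being near, which is exactly what undermines \(k\)NN and \(k-\)means in high dimension.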

In datasets with a large number of features:

- most observations are **nearer the edge of the sample than they are to other observations**, and
- realistic training sets are necessarily **sparse**.

Imposing restrictions on models can help mitigate the effects of the CoD, but if the assumptions are not warranted the end result may be catastrophic.

### 15.2.3 Principal Component Analysis

**Principal component analysis** (PCA) can be used to find the combinations of variables along which the data points are **most spread out**; it attempts to fit a \(p-\)**ellipsoid** to a centered representation of the data. The ellipsoid axes are the **principal components** of the data.

Small axes are components along which the variance is “small”; removing these components leads, in an ideal setting, to a “small” loss of information^{292} (see
Figure 15.7).

The procedure is simple:

1. centre and “scale” the data to obtain a matrix \(\mathbf{X}\) (warning: this is NOT the design matrix as defined in Regression and Value Estimation);
2. compute the data’s covariance matrix \(\mathbf{K}=\mathbf{X}^{\!\top}\mathbf{X}\);
3. compute \(\mathbf{K}\)’s eigenvalues \(\mathbf{\Lambda}\) and its orthonormal eigenvectors matrix \(\mathbf{W}\);
4. each eigenvector \(\mathbf{w}\) (also known as a **loading**) represents an axis, whose variance is given by the associated eigenvalue \(\lambda\).

The loading that explains the most variance along a single axis (the **first principal component**) is the eigenvector of the empirical covariance matrix corresponding to the largest eigenvalue, and that variance is proportional to the eigenvalue; the second largest eigenvalue and its corresponding eigenvector form the **second principal component** and variance pair, and so on, yielding **orthonormal** principal components \(\textrm{PC}_1,\ldots,\textrm{PC}_r\), where
\(r=\textrm{rank}(\mathbf{X})\).^{293}
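The procedure above can be sketched in a few lines of Python (the data is synthetic and the variable names are ours, chosen to match the notation of the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: 200 observations on 3 correlated features
A = rng.normal(size=(200, 3)) @ np.array([[3.0, 1.0, 0.0],
                                          [0.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

# 1. centre and scale the data
X = (A - A.mean(axis=0)) / A.std(axis=0)

# 2. the matrix K = X^T X
K = X.T @ X

# 3. eigenvalues and orthonormal eigenvectors of K,
#    sorted in decreasing order of eigenvalue
lam, W = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

# principal component scores T = XW; the total spread along
# column j of T is exactly lam[j]
T = X @ W
```

Since \(\mathbf{T}^{\!\top}\mathbf{T}=\mathbf{W}^{\!\top}\mathbf{K}\mathbf{W}=\boldsymbol{\Lambda}\), the sum of squares of the \(j\)th score column recovers \(\lambda_j\), as the text asserts.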

Principal component analysis can provide an avenue for dimension reduction, by “removing” components with small eigenvalues (as in Figure 15.7). The **proportion of the spread in the data** which is explained by each principal component (PC) can be displayed in a **scree plot** (a plot of eigenvalues against ordered component indices), and the ordered PCs retained:

- for which the proportion of explained spread is above some threshold (say, \(25\%\));
- for which the cumulative proportion of the spread falls below some threshold (say \(95\%\)), or
- prior to a **kink** in the scree plot.

For instance, consider an \(8-\)dimensional dataset for which the ordered PCA eigenvalues are provided below:

| PC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Var | 17 | 8 | 3 | 2 | 1 | 0.5 | 0.25 | 0 |
| Prop (%) | 54 | 25 | 9 | 6 | 3 | 2 | 1 | 0 |
| Cumul (%) | 54 | 79 | 88 | 94 | 98 | 99 | 100 | 100 |

If the only PCs that are retained are those that explain up to 95% of the cumulative variation, the original data reduces to a 4-dimensional subset; if only the PCs that individually explain more than 25% of the variation are retained, to a 2-dimensional subset; if only the PCs that lead into the first kink in the scree plot are retained, to a 3-dimensional subset (see Figure 15.8).
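The retention counts for the first two criteria can be checked directly (a Python sketch; the variable names are ours):

```python
import numpy as np

# ordered eigenvalues from the 8-dimensional example above
var = np.array([17, 8, 3, 2, 1, 0.5, 0.25, 0])
prop = var / var.sum()        # proportion of spread per PC
cumul = np.cumsum(prop)       # cumulative proportion

# retain PCs while the cumulative proportion stays below 95% ...
k_cumul = int((cumul <= 0.95).sum())

# ... or only the PCs individually explaining more than 25% of the spread
k_indiv = int((prop > 0.25).sum())
```

The computation reproduces the subsets stated in the text: 4 components under the cumulative-95% rule, 2 under the individual-25% rule.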

PCA is commonly-used, but often without regard to its inherent **limitations**, unfortunately:

it is dependent on scaling, and so is not uniquely determined;

with little domain expertise, it may be difficult to interpret the PCs;

it is quite sensitive to outliers;

the analysis goals are not always aligned with the principal components, and

the data assumptions are not always met – in particular, does it always make sense that important data structures and data spread be correlated (the so-called **counting pancakes** problem), or that the components be **orthogonal**?

There are other methods to find the **principal manifolds** of a dataset, including UMAP, self-organizing maps, auto-encoders, curvilinear component analysis, manifold sculpting, kernel PCA, etc.

#### Formalism

Because \(\mathbf{K}\) is **positive semi-definite** (\(\mathbf{K}\geq 0\)), the eigenvalues \(\lambda_i=s_i^2\geq 0\) and they can be ordered in a decreasing sequence \[\boldsymbol{\Lambda}=\text{diag}(\lambda_1,\ldots,\lambda_p),\quad \text{where }\lambda_1\geq \cdots \geq \lambda_p\geq 0\] and \(\mathbf{W}=[\mathbf{w}_1 | \cdots | \mathbf{w}_p]\).

If \(k=\text{rank}(\mathbf{X})\), then there are \(p-k\) “empty” principal components (corresponding to zero eigenvalues) and \(k\) “regular” principal components (corresponding to positive eigenvalues). We write \(\mathbf{W}^*=[\mathbf{w}_1 | \cdots | \mathbf{w}_k]\) and \(\boldsymbol{\Lambda}^*=\text{diag}(\lambda_1,\ldots,\lambda_k)\). If \(p-k\neq 0\), then the eigenvalue decomposition of \(\mathbf{K}\) is \[\mathbf{K}=\begin{bmatrix}\mathbf{W}^* & \mathbf{0} \end{bmatrix}\begin{bmatrix}\boldsymbol{\Lambda}^* & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\begin{bmatrix}(\mathbf{W}^*)^{\!\top} \\ \mathbf{0}\end{bmatrix}=\mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^{\!\top};\] if \(\mathbf{X}\) is of full rank, then \(\mathbf{W}^*=\mathbf{W}\) and \(\boldsymbol{\Lambda}^*=\boldsymbol{\Lambda}\).

The eigenvectors of \(\mathbf{K}\) (the \(\mathbf{w}_{j}\)) are the right **singular vectors** of \(\mathbf{X}\): there exist \(\mathbf{U}_{n\times n}\) and \(\boldsymbol{\Sigma}_{n\times p}\) such that \[\mathbf{X}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^{\!\top},\] where \[\mathbf{U}=\begin{pmatrix}\mathbf{U}^* & \mathbf{0}\end{pmatrix}\quad\mbox{and}\quad \boldsymbol{\Sigma}=\begin{pmatrix}\text{diag}(s_i) \\ \mathbf{0} \end{pmatrix}.\]

If \(\mathbf{X}\) is of full rank, then \(\mathbf{W}\) is **orthonormal** and so represents a **rotation matrix**. As \(\mathbf{W}^{-1}=\mathbf{W}^{\!\top}\), we must then have \(\mathbf{XW}=\mathbf{U}\boldsymbol{\Sigma}\), the **principal component decomposition of** \(\mathbf{X}\): \[\mathbf{T}_{n\times p}=\mathbf{XW},\quad \begin{bmatrix}\mathbf{t}_1 & \cdots & \mathbf{t}_p\end{bmatrix}=\begin{bmatrix}\mathbf{x}_1 & \cdots & \mathbf{x}_n\end{bmatrix}^{\!\top}\begin{bmatrix}\mathbf{w}_1 & \cdots & \mathbf{w}_p\end{bmatrix}.\]
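The identity \(\mathbf{XW}=\mathbf{U}\boldsymbol{\Sigma}\), and the link between eigenvalues of \(\mathbf{K}\) and singular values of \(\mathbf{X}\), can be verified numerically (a Python sketch on synthetic centred data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                    # centred data matrix

# thin SVD: X = U Sigma W^T
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

# principal component decomposition: T = XW coincides with U Sigma
T = X @ W

# the eigenvalues of K = X^T X are the squared singular values
lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # in decreasing order
```

In practice, PCA implementations work from the SVD of \(\mathbf{X}\) rather than the eigendecomposition of \(\mathbf{K}\), as it is numerically more stable.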

The link between the principal components and the eigenvectors can be made explicit: the first principal component \(\text{PC}_1\) is the loading \(\mathbf{w}_1\) (with \(\|\mathbf{w}_1\|_2=1\)) which **maximizes the variance of the first column** of \(\mathbf{T}\): \[\mathbf{w}_1=\arg\max_{\|\mathbf{w}\|_2=1}\{\text{Var}(\mathbf{t}_1)\}=\arg\max_{\|\mathbf{w}\|_2=1}\left\{\frac{1}{n-1}\sum_{i=1}^n(t_{1,i}-\overline{t_1})^2\right\}.\]

Since \[\begin{aligned}\overline{t_1}&=\frac{1}{n}\sum_{j=1}^n\text{E}[\mathbf{x}_j^{\!\top}\mathbf{w}_1]=\frac{1}{n}\sum_{j=1}^n\text{E}\left[\sum_{i=1}^px_{j,i}w_{i,1}\right]\\&=\frac{1}{n}\sum_{j=1}^n\sum_{i=1}^pw_{i,1}\underbrace{\text{E}[x_{j,i}]}_{\text{$=0$}}=0,\end{aligned}\] then \(\text{Var}(\mathbf{t}_1)=\frac{1}{n-1}(t_{1,1}^2+\cdots+t_{n,1}^2)\) and the problem is equivalent to \[\mathbf{w}_1=\arg\max_{\|\mathbf{w}\|_2=1}\{t_{1,1}^2+\cdots+t_{n,1}^2\}.\] By construction, \(t_{i,1}^2=(\mathbf{x}_i^{\!\top}\mathbf{w}_1)^2\) for all \(i\), so \[t_{1,1}^2+\cdots+t_{n,1}^2=(\mathbf{x}_1^{\!\top}\mathbf{w}_1)^2+\cdots+(\mathbf{x}_n^{\!\top}\mathbf{w}_1)^2=\|\mathbf{X}\mathbf{w}_1\|_2^2=\mathbf{w}_1^{\!\top}\mathbf{X}^{\!\top}\mathbf{X}\mathbf{w}_1.\] Hence, \[\mathbf{w}_1=\arg\max_{\|\mathbf{w}\|_2=1}\{\mathbf{w}^{\!\top}\mathbf{X}^{\!\top}\mathbf{X}\mathbf{w}\}=\arg\max_{\|\mathbf{w}\|_2=1}\{\mathbf{w}^{\!\top}\mathbf{K}\mathbf{w}\};\] this is equivalent to finding the maximizer of \(F(\mathbf{w})=\mathbf{w}^{\!\top}\mathbf{K}\mathbf{w}\) subject to the constraint \[G(\mathbf{w})=1-\mathbf{w}^{\!\top}\mathbf{w}=0.\]

We solve this problem by using the method of Lagrange multipliers; any optimizer \(\mathbf{w}^*\) must be either:

a critical point of \(F\), or

a solution of \(\nabla F(\mathbf{w})+\lambda \nabla G(\mathbf{w})=\mathbf{0}\), \(\lambda\neq 0\).

But \(\nabla F(\mathbf{w})=2\mathbf{Kw}\) and \(\nabla G(\mathbf{w})=-2\mathbf{w}\); either \(\mathbf{w}^*\in \ker (\mathbf{K})\) (case 1) or \(2\mathbf{Kw}^*-2\lambda^*\mathbf{w}^*=\mathbf{0}\) (case 2); either \[\mathbf{Kw}^*=\mathbf{0}\quad\text{or}\quad (\mathbf{K}-\lambda^*I)\mathbf{w}^*=\mathbf{0},\ \lambda^*\neq 0.\] In either case, \(\lambda^*\geq 0\) is an eigenvalue of \(\mathbf{K}\), with associated eigenvector \(\mathbf{w}^*\). There are at most \(p\) distinct possibilities \(\{(\lambda_j,\mathbf{w}_j)\}_{j=1}^p\), and for each of them \[\mathbf{w}_j^{\!\top}\mathbf{K}\mathbf{w}_j=\mathbf{w}_j^{\!\top}\lambda_j\mathbf{w}_j=\lambda_j\mathbf{w}_j^{\!\top}\mathbf{w}_j=\lambda_j,\] since \(\mathbf{w}_j^{\!\top}\mathbf{w}_j=1\).

Thus, \[\arg\max_{\|\mathbf{w}\|_2=1}\{\text{Var}(\mathbf{t}_1)\}=\arg\max_{\|\mathbf{w}\|_2=1}\{\lambda_j\}=\mathbf{w}_1=\text{PC}_1,\] since \(\lambda_1\geq \lambda\geq 0\) for all eigenvalues \(\lambda\) of \(\mathbf{K}\).

A similar argument shows that \(\mathbf{w}_j\), \(j=2,\ldots,p\), is the direction along which the variance is the \(j\)th highest, assuming that \(\mathbf{w}_j\) is orthonormal to all the preceding \(\mathbf{w}_{\ell}\), \(\ell=1,\ldots, j-1\), and that the variance is proportional to \(\lambda_j\).

The process is repeated at most \(p\) times, yielding \(r\) non-trivial principal components \(\text{PC}_1,\ldots,\text{PC}_r\), where \(r=\text{rank}(\mathbf{X})\leq p\). Thus, we see that the rotation matrix \(\mathbf{W}\) that maximizes the variance sequentially in the columns of \(\mathbf{T}=\mathbf{XW}\) is the matrix of eigenvectors of \(\mathbf{K}=\mathbf{X}^{\!\top}\mathbf{X}\).

We show how to implement principal component analysis in Sections 11.7.3 and 13.4.3 [the Wines example].

### 15.2.4 The Manifold Hypothesis

**Manifold learning** involves mapping high-dimensional data to a lower dimensional manifold, such as mapping a set of points in \(\mathbb{R}^3\) to a torus shape, which can then be unfolded (or embedded) into a 2D object. Techniques for manifold learning are commonly-used because data is often (usually?) sampled from unknown and underlying sources which cannot be measured directly.

Learning a suitable “low-dimension” manifold from a higher-dimensional space is approximately as complicated as learning the sources (in a machine learning sense). This problem can also be re-cast as finding a set of **degrees of freedom** which can reproduce most of the variability in a dataset.

For instance, a set of multiple photographs of a 3D object taken from different positions but all at the same distance from the object can be represented by two degrees of freedom: the horizontal angle and the vertical angle from which the picture was taken. As another example, consider a set of hand-written drawings of the digit “2” [296]. Each of these drawings can also be represented using a small number of degrees of freedom:

the ratio of the length of the lowest horizontal line to the height of the hand-written drawing;

the ratio of the length of the arch in the curve at the top to the smallest horizontal distance from the end point of the arch to the main vertical curve;

the rotation of the digit as a whole with respect to some baseline orientation, etc.

These two scenarios are illustrated in Figure 15.9.

Dimensionality reduction and manifold learning are often used for one of three purposes:

to reduce the overall dimensionality of the data while trying to preserve the data variance;

to display high-dimensional datasets, or

to reduce the processing time of supervised learning algorithms by lowering the dimensionality of the processed data.

PCA, for instance, provides a sequence of best linear approximations to high-dimensional observations (see previous section); the process has fantastic theoretical properties for computation and applications, but data is not always well-approximated by a fully linear process.

In this section, the focus is on non-linear dimensionality reduction methods, most of which are a variant of **kernel PCA**: LLE, Laplacian eigenmap, isomap, semidefinite embedding, and \(t-\)SNE. By way of illustration, the different methods are applied to an “S”-shaped coloured 2D object living in 3D space (see Figure 15.12).

#### Kernel Principal Component Analysis

High-dimensional datasets often have a non-linear nature, in the sense that a linear PCA may only weakly capture/explain the variance across the entire dataset. This is, in part, due to PCA relying on Euclidean distance as opposed to **geodesic distance** – the distance between two points *along* the manifold, that is to say, the distance that would be recorded if the high-dimensional object was first unrolled (see Figure 15.10).

Residents of the Earth^{294} are familiar with this concept: the Euclidean distance (“as the mole burrows”) between Sao Paulo and Reykjavik is the length of the shortest tunnel joining the two cities, whereas the geodesic distance (“as the crow flies”) is the arclength of the **great circle** through the two locations (see Figure 15.11).

High-dimensional manifolds can be unfolded with the use of **transformations** \(\Phi\) which map the input set of points \[\{\mathbf{x}_1,\ldots, \mathbf{x}_n\}\subseteq \mathbb{R}^p\] onto a new set of points \[\{\Phi(\mathbf{x}_1),\ldots, \Phi(\mathbf{x}_n)\}\subseteq \mathbb{R}^m,\] with \(m\geq p\).

If \(\Phi\) is chosen so that \(\sum_{i=1}^n \Phi(\mathbf{x}_i) = \mathbf{0}\) (i.e., the transformed data is also centered in \(\mathbb{R}^m\)), we can formulate the kernel PCA objective in \(\mathbb{R}^p\) as a linear PCA objective in \(\mathbb{R}^m\):
\[\text{min} \left\{\sum_{i=1}^n\|\Phi(\mathbf{x}_i)-V_qV_q^{\!\top}\Phi(\mathbf{x}_i)\|^2\right\},\] over the set of \(m\times q\) matrices \(V_q\) with orthonormal columns, where \(q\) is the desired dimension of the manifold.^{295}

In practice, it can be quite difficult to determine \(\Phi\) explicitly; in many instances, it is inner-product-like quantities that are of interest to the analyst.

The problem can be resolved by working with **positive-definite kernel functions** \(K:\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}_+\) which satisfy \(K(\mathbf{x},\mathbf{y})=K(\mathbf{y},\mathbf{x})\) for all \(\mathbf{x},\mathbf{y}\in \mathbb{R}^p\) and \[\sum_{i=1}^k\sum_{j=1}^kc_ic_jK(\mathbf{x}_i,\mathbf{x}_j)\geq 0\] for any integer \(k\), coefficients \(c_1,\ldots,c_k\in \mathbb{R}\) and vectors \(\mathbf{x}_1, \ldots, \mathbf{x}_k\in \mathbb{R}^p\), with equality if and only if \(c_1=\cdots=c_k=0\).^{296}

Popular data analysis kernels include the:

- **linear kernel** \(K(\mathbf{x},\mathbf{y})= \mathbf{x}^{\!\top}\mathbf{y}\);
- **polynomial kernel** \(K(\mathbf{x},\mathbf{y})= (\mathbf{x}^{\!\top}\mathbf{y}+r)^{k}\), \(k\in\mathbb{N}\), \(r\geq 0\), and
- **Gaussian kernel** \(K(\mathbf{x},\mathbf{y})= \exp\left\{-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right\}\), \(\sigma>0\).
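A minimal kernel PCA sketch with the Gaussian kernel, in Python (the function name and all variables are ours; this centres the kernel matrix in feature space rather than computing \(\Phi\) explicitly):

```python
import numpy as np

def gaussian_kernel_pca(X, q=2, sigma=1.0):
    # kernel matrix with the Gaussian kernel
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2 * sigma**2))
    # centre the kernel matrix in feature space: K <- HKH,
    # with H = I - (1/n) ee^T, so that sum_i Phi(x_i) = 0 implicitly
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    # leading q eigenpairs of the centred kernel matrix
    lam, V = np.linalg.eigh(Kc)
    lam, V = lam[::-1][:q], V[:, ::-1][:, :q]
    # kernel PC scores: projections of the Phi(x_i) on the leading axes
    return V * np.sqrt(np.maximum(lam, 0.0))

Y = gaussian_kernel_pca(np.random.default_rng(3).uniform(size=(30, 3)))
```

Note that the computation never forms \(\Phi\): only the kernel evaluations \(K(\mathbf{x}_i,\mathbf{x}_j)\) are needed, which is the point of the kernel trick.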

Most dimension reduction algorithms can be re-expressed as some form of kernel PCA.

#### Locally Linear Embedding

**Locally linear embedding** (LLE) is another manifold learning approach which addresses the problem of nonlinear dimension reduction by computing low-dimensional, neighbourhood-preserving embedding of high-dimensional data.

The main assumption is that for any subset \(\{\mathbf{x}_i\}\subseteq \mathbb{R}^p\) lying on some sufficiently well-behaved underlying \(d-\)dimensional manifold \(\mathcal{M}\), each data point and its neighbours lie on a **locally linear patch** of \(\mathcal{M}\). Using translations, rotations, and rescaling, the (high-dimensional) coordinates of each locally linear neighbourhood can be mapped to a set of \(d-\)dimensional global coordinates of \(\mathcal{M}\).

This needs to be done in such a way that the relationships between neighbouring points are preserved. This can be achieved in 3 steps:

1. identify the punctured neighbourhood \(N_i=\{i_1,\ldots,i_k\}\) of each data point \(\mathbf{x}_i\) *via* \(k\) nearest neighbours (this could also be done by selecting all points within some fixed radius \(\varepsilon\), but then \(k\) is not constant anymore, which complicates matters);
2. find the weights \(z_{i,j}\) that provide the best linear reconstruction of each \(\mathbf{x}_i\in \mathbb{R}^p\) from its respective punctured neighbourhood,^{297} i.e., solve \[\min_{\mathbf{Z}}\left\{\sum_{i=1}^n \left\|\mathbf{x}_i-\textstyle{\sum_{j\in N_i}}z_{i,j}\mathbf{x}_{N_i(j)}\right\|^2\right\},\] where \(\mathbf{Z}=\left(z_{i,j}\right)\) is an \(n\times n\) matrix (\(z_{i,j}=0\) if \(j \not\in N_i\)), and
3. find the low-dimensional embedding (or code) vectors \(\mathbf{y}_i\in \mathcal{M}(\subseteq\mathbb{R}^d)\) and neighbours \(\mathbf{y}_{N_i(j)}\in \mathcal{M}\) for each \(i\) which are best reconstructed by the weights determined in the previous step, i.e., solve \[\min_{\mathbf{Y}}\left\{\sum_{i=1}^n \left\|\mathbf{y}_i-\textstyle{\sum_{j\in N_i}}z_{i,j}\mathbf{y}_{N_i(j)}\right\|^2\right\}=\min_{\mathbf{Y}} \left\{\text{Tr}\left(\mathbf{Y}^{\!\top}\mathbf{Y}L\right)\right\},\] where \(L=(I-\mathbf{Z})^{\!\top}(I-\mathbf{Z})\) and \(\mathbf{Y}\) is an \(n\times d\) matrix.

If the global coordinates of the sampled points are centered at the origin and have unit variance (which can always be done by adding an appropriate set of restrictions), it can be shown that \(L\) has a zero eigenvalue, with the constant vector as associated eigenvector; this trivial pair is discarded. The \(j^{\text{th}}\) column of \(\mathbf{Y}\) is then simply the eigenvector associated with the \(j^{\text{th}}\) smallest non-zero eigenvalue of \(L\) [297].
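In practice, the three steps above are packaged in standard libraries; a short sketch using scikit-learn's implementation on a synthetic curved manifold (the swiss roll stands in for the "S" shape of Figure 15.12):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# a classic curved 2D manifold embedded in 3D space
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# LLE with k = 10 punctured-neighbourhood size, embedding dimension d = 2
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y = lle.fit_transform(X)
```

The choice of \(k\) matters: too small and the neighbourhood graph disconnects, too large and the locally linear patch assumption breaks down.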

#### Laplacian Eigenmaps

**Laplacian eigenmaps** are similar to LLE, except that the first step consists in constructing a **weighted graph** \(\mathcal{G}\) with \(n\) nodes (number of \(p-\)dimensional observations) and a set of edges connecting the neighbouring points.^{298} In practice, the edges’ weights are determined either:

by using the inverse exponential with respect to the Euclidean distance \[w_{i,j}=\exp\left(-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{s}\right),\] for all \(i,j\), for some parameter \(s>0\), or

by setting \(w_{i,j}=1\), for all \(i,j\).

The embedding map is then provided by the following objective \[\min_{\mathbf{Y}} \sum_{i=1}^{n}\sum_{j=1}^{n}\left\{w_{i,j}\|\mathbf{y}_i-\mathbf{y}_j\|^2\right\}=\min_{\mathbf{Y}} \left\{\text{Tr}(\mathbf{Y}L\mathbf{Y}^{\!\top})\right\},\] subject to appropriate constraints, with the **Laplacian** \(L\) given by \(L=D-W\), where \(D\) is the (diagonal) **degree matrix** of \(\mathcal{G}\) (the sum of weights emanating from each node), and \(W\) its **weight matrix**. The Laplacian eigenmap construction is identical to the LLE construction, save for their definition of \(L\).
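A toy Laplacian eigenmap can be written directly from these definitions; the Python sketch below (function name and simplifications are ours) uses a fully connected graph with heat-kernel weights, whereas practical implementations connect only neighbouring points and solve a generalized eigenproblem:

```python
import numpy as np

def laplacian_eigenmap(X, d=2, s=1.0):
    # heat-kernel weights on a fully connected graph
    # (a simplification: in practice only neighbours are connected)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / s)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # graph Laplacian
    # discard the trivial zero eigenvalue; keep the next d eigenvectors
    lam, V = np.linalg.eigh(L)
    return V[:, 1:d + 1]

Y = laplacian_eigenmap(np.random.default_rng(4).normal(size=(40, 3)))
```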

#### Isomap

**Isomap** follows the same steps as LLE except that it uses **geodesic distance** instead of Euclidean distance when looking for each point’s neighbours (as always, neighbourhoods can be selected with \(k\)NN or with a fixed \(\varepsilon\)). These neighbourhood relations are represented by a graph \(\mathcal{G}\) in which each observation is connected to its neighbours *via* edges with weight \(d_x(i,j)\) between neighbours. The geodesic distances \(d_{\mathcal{M}}(i,j)\) between all pairs of points on the manifold \(\mathcal{M}\) are then estimated in the second step.
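Isomap is also available off the shelf; a short scikit-learn sketch on the "S"-shaped object of Figure 15.12 (generated here with scikit-learn's S-curve helper):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# the "S"-shaped 2D object living in 3D space
X, _ = make_s_curve(n_samples=500, random_state=0)

# kNN graph (k = 10), shortest-path geodesic distances, 2D embedding
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)
```

Internally, the geodesic distances are approximated by shortest paths in the neighbourhood graph, which is what distinguishes Isomap from plain (Euclidean) multidimensional scaling.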

#### Semidefinite Embedding

**Semidefinite embeddings** (SDE) involve learning the kernel \(K(\mathbf{x},\mathbf{z})=\Phi(\mathbf{x})^{\!\top}\Phi(\mathbf{z})\) from the data before applying the kernel PCA transformation \(\Phi\), which is achieved by formulating the problem of learning \(K\) as an instance of **semidefinite programming**.

The distances and angles between observations and their neighbours are preserved under transformations by \(\Phi\): \(\|\Phi(\mathbf{x}_i)-\Phi(\mathbf{x}_j)\|^2 = \|\mathbf{x}_i - \mathbf{x}_j\|^2,\) for all \(\mathbf{x}_i,\mathbf{x}_j \in \mathbb{R}^p\).

In terms of the kernel matrix, this constraint can be written as \[K(\mathbf{x}_i,\mathbf{x}_i)- 2K(\mathbf{x}_i,\mathbf{x}_j) + K(\mathbf{x}_j,\mathbf{x}_j) = \|\mathbf{x}_i-\mathbf{x}_j\|^2,\] for all \(\mathbf{x}_i,\mathbf{x}_j \in \mathbb{R}^p\). By adding an objective function to maximize \(\text{Tr}(K)\), that is, the variance of the observations in the learned feature space, SDE constructs a semidefinite program for learning the **kernel matrix** \[K=\left(K_{i,j}\right)_{i,j=1}^n=\left(K(\mathbf{x}_i,\mathbf{x}_j)\right)_{i,j=1}^n,\] from which we can proceed with kernel PCA.

#### Unified Framework

All of the above algorithms (LLE, Laplacian Eigenmaps, Isomap, SDE) can be rewritten in the kernel PCA framework:

in the case of LLE, if \(\lambda_{\text{max}}\) is the largest eigenvalue of \(L=(I-\mathbf{Z})^{\!\top}(I-\mathbf{Z})\), then \(K_{\text{LLE}}=\lambda_{\text{max}}I-L\);

with \(L=D-W\), where \(D\) is a (diagonal) degree matrix with \(D_{i,i}=\sum_{j=1}^nW_{i,j}\), the corresponding \(K_{\text{LE}}\) is related to commute times of diffusion on the underlying graph, and

with the Isomap element-wise squared geodesic distance matrix \(\mathcal{D}^2\), \[K_{\text{Isomap}}=-\frac{1}{2}\left(I-\frac{1}{n}\mathbf{ee}^{\!\top}\right)\mathcal{D}^{2}\left(I-\frac{1}{n}\mathbf{ee}^{\!\top}\right),\] where \(\mathbf{e}\) is an \(n-\)dimensional vector consisting solely of 1’s (note that this kernel is not always positive semi-definite).

#### \(t-\)SNE

There are a few relatively new manifold learning techniques that do not fit neatly in the kernel PCA framework: Uniform Manifold Approximation and Projection (UMAP, see Section 15.4.4) and **\(t-\)Distributed Stochastic Neighbour Embedding** (\(t-\)SNE).

For a dataset \(\{\mathbf{x}_i\}_{i=1}^n\subseteq \mathbb{R}^p\), the latter involves calculating probabilities \[\begin{gathered} p_{i,j}=\frac{1}{2n}\left\{\frac{\exp(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/2\sigma_i^2)}{\textstyle{\sum_{k\neq i}\exp(-\|\mathbf{x}_i-\mathbf{x}_k\|^2/2\sigma_i^2) }} +\frac{\exp(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/2\sigma_j^2)}{\textstyle{\sum_{k\neq j}\exp(-\|\mathbf{x}_j-\mathbf{x}_k\|^2/2\sigma_j^2) }}\right\},\end{gathered}\] which are proportional to the similarity of points in high-dimensional space \(\mathbb{R}^p\) for all \(i,j\), and \(p_{i,i}\) is set to 0 for all \(i\).^{299} The bandwidths \(\sigma_i\) are selected in such a way that they are smaller in denser data areas.

The lower-dimensional manifold \(\{\mathbf{y}_i\}_{i=1}^n\subseteq \mathcal{M}\subseteq\mathbb{R}^d\) is selected in such a way as to preserve the similarities \(p_{i,j}\) as much as possible; this can be achieved by building the (reduced) probabilities \[q_{i,j}=\frac{(1+\|\mathbf{y}_i-\mathbf{y}_j\|^2)^{-1}}{\textstyle{\sum_{k\neq i}(1+\|\mathbf{y}_i-\mathbf{y}_k\|^2)^{-1}}}\] for all \(i,j\) (note the asymmetry) and minimizing the **Kullback-Leibler divergence** of \(Q\) from \(P\): \[\text{KL}(P||Q)=\sum_{i\neq j}p_{i,j}\log\frac{p_{i,j}}{q_{i,j}}\] over possible coordinates \(\{\mathbf{y}_i\}_{i=1}^n\) [298].
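In practice the divergence is minimized by gradient descent, as implemented in scikit-learn; a short sketch on synthetic clustered data (the cluster locations are ours):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)

# two well-separated synthetic clusters in R^10
X = np.vstack([rng.normal(0, 1, size=(50, 10)),
               rng.normal(8, 1, size=(50, 10))])

# perplexity plays the role of the bandwidth selection for the sigma_i
tsne = TSNE(n_components=2, perplexity=20, random_state=0)
Y = tsne.fit_transform(X)
```

The heavy-tailed \(t\) distribution in the \(q_{i,j}\) is what allows moderately distant points in \(\mathbb{R}^p\) to be placed far apart in the embedding, mitigating the crowding problem.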

#### MNIST Example

In [299], the methods above are used to learn \(2-\)dimensional manifolds for the MNIST dataset [300], a database of handwritten digits.

The results for 4 of those are shown in Figure 15.14.

The analysis of optimal manifold learning methods remains fairly subjective, as it depends not only on the outcome, but also on how much computing power is used and how long it takes to obtain the mapping. Naïvely, one would expect to see the coordinates in the reduced manifold congregate in 10 (or more) distinct groups; in that regard, \(t-\)SNE seems to perform admirably on MNIST.

### References

J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” *Science*, vol. 290, no. 5500, pp. 2319–2323, 2000.

S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” *Science*, vol. 290, no. 5500, pp. 2323–2326, 2000.