# A tractable latent variable model for nonlinear dimensionality reduction


Edited by Nathan Kallus, Cornell University, New York, NY, and accepted by Editorial Board Member Donald B. Rubin May 6, 2020 (received for review September 15, 2019)

## Significance

Latent variable models (LVMs) are powerful tools for discovering hidden structure in data. Canonical LVMs include factor analysis, which explains the correlation of a large number of observed variables in terms of a smaller number of unobserved ones, and Gaussian mixture models, which reveal clusters of data arising from an underlying multimodal distribution. In this paper, we describe a conceptually simple and equally effective LVM for nonlinear dimensionality reduction (NLDR), where the goal is to discover faithful, neighborhood-preserving embeddings of high-dimensional data. Tools for NLDR can help researchers across all areas of science and engineering to better understand and visualize their data. Our approach elevates NLDR into the family of problems that can be studied by especially tractable LVMs.

## Abstract

We propose a latent variable model to discover faithful low-dimensional representations of high-dimensional data. The model computes a low-dimensional embedding that aims to preserve neighborhood relationships encoded by a sparse graph. The model both leverages and extends current leading approaches to this problem. Like t-distributed Stochastic Neighborhood Embedding, the model can produce two- and three-dimensional embeddings for visualization, but it can also learn higher-dimensional embeddings for other uses. Like LargeVis and Uniform Manifold Approximation and Projection, the model produces embeddings by balancing two goals—pulling nearby examples closer together and pushing distant examples further apart. Unlike these approaches, however, the latent variables in our model provide additional structure that can be exploited for learning. We derive an Expectation–Maximization procedure with closed-form updates that monotonically improve the model’s likelihood: In this procedure, embeddings are iteratively adapted by solving sparse, diagonally dominant systems of linear equations that arise from a discrete graph Laplacian. For large problems, we also develop an approximate coarse-graining procedure that avoids the need for negative sampling of nonadjacent nodes in the graph. We demonstrate the model’s effectiveness on datasets of images and text.

A common problem arises in many fields of science and engineering—how to discover low-dimensional representations of high-dimensional data. These representations can help to visualize complex patterns; they can also support the faster indexing and querying of large data collections.

Current leading approaches in this area have demonstrated remarkable successes. Algorithms such as t-distributed Stochastic Neighborhood Embedding (t-SNE) (1), LargeVis (2), and Uniform Manifold Approximation and Projection (UMAP) (3) have produced stunning visualizations of high-dimensional datasets; they have also tackled extremely large problems with efficient data structures (4, 5) and clever randomizations (6, 7). In practical applications, they have largely supplanted earlier approaches, based on spectral methods (8–12), for manifold learning and nonlinear dimensionality reduction (NLDR).

The models for t-SNE, LargeVis, and UMAP share a common element: They all appeal to probabilistic notions of similarity and dissimilarity. Despite this shared grounding, however, these approaches bear little resemblance to other canonical probabilistic models for unsupervised learning. Problems such as factor analysis (13) and clustering (14) are amenable to especially tractable latent variable models (LVMs). The problem of NLDR is equally simple to state: Given a set of high-dimensional (observed) inputs, discover a corresponding set of low-dimensional (latent) outputs that preserves the neighborhood relationships between them. It is natural to ask whether NLDR, too, admits an equally tractable LVM.

In this paper, we make this connection explicit. We develop an LVM that provides another way of conceptualizing the recent breakthroughs in NLDR. We also describe a coarse-graining procedure to break the main logjam of learning in NLDR—the large number of pairwise interactions between nonneighboring inputs. The rest of the paper presents our LVM, evaluates its performance on datasets of images and text, and places it more fully in the context of previous work.

## LVM

Our LVM is designed to learn a faithful embedding of the high-dimensional inputs into a space of much lower dimensionality.

To begin, we associate, to each input, a low-dimensional output; together, these outputs constitute the embedding that the model learns.

We should not expect too much from such a mapping, however. For example, distances are unlikely to be preserved: A neighborhood-preserving embedding may need considerable flexibility to situate the outputs in a space of much lower dimensionality. We can appeal, however, to some weaker properties of the desired mapping, which we again express in terms of the model’s latent variables.

These criteria suggest a natural way to measure the quality of an embedding, which can then be iteratively optimized. We start by considering how well the embedding behaves with respect to a particular pair of inputs: The probabilities in Eqs. **3** and **4** measure the degrees to which the corresponding Gaussian latent variables are drawn together or pushed apart.

We can measure the overall quality of the LVM’s embedding by accumulating these scores over multiple pairs of inputs. As we shall see, it will be convenient to invest some pairs of inputs with a larger role in this process than others. Allowing for a weighted sum of scores, we obtain an overall log (conditional) likelihood, given in Eq. **6**. In what follows, we treat the weights in Eq. **6** as constants, so that the remaining optimization is over the model’s other parameters; that is, we assume the weights have been fixed, and focus on the more computationally intensive problem of maximum likelihood (ML) estimation. As a final aside, we note that LargeVis (2) and UMAP (3), although not formulated as LVMs, are based on objective functions of this same general form (involving sums over pairs of similar and dissimilar examples) but that differ in the weighting of individual terms and the way they assess similarity.

## EM Algorithm

To proceed, we avail ourselves of the Expectation–Maximization (EM) algorithm (15) for ML estimation in LVMs. When applied to maximum effect, EM algorithms yield closed-form parameter updates that monotonically improve the likelihood in these models. EM algorithms have been widely used for ML estimation in Gaussian mixture models (14), factor analysis (13), and hidden Markov models (16), where the posterior statistics can be efficiently computed. Inference in these LVMs (as required by the E-step of the EM algorithm) is especially tractable because either the latent variables are discrete, with dependencies that are represented by a trivial or sparsely connected graph, or they are continuous and their posterior distributions are multivariate Gaussian.
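To make the flavor of such closed-form, monotone EM updates concrete, here is a short Python sketch for one of the canonical cases cited above, a one-dimensional Gaussian mixture. This is an illustration of the EM template, not the model of this paper; all function and variable names are our own.

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iters=50, seed=0):
    """EM for a one-dimensional Gaussian mixture: closed-form E- and
    M-steps that monotonically increase the log-likelihood."""
    rng = np.random.default_rng(seed)
    pi = np.full(n_components, 1.0 / n_components)    # mixing weights
    mu = rng.choice(x, n_components, replace=False)   # component means
    var = np.full(n_components, np.var(x))            # component variances
    history = []
    for _ in range(n_iters):
        # E-step: posterior responsibilities via Bayes rule.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        r = np.exp(log_p - log_norm[:, None])
        history.append(log_norm.sum())
        # M-step: closed-form re-estimates from posterior statistics.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mu, var, pi, np.array(history)

# Demo: a mixture of two well-separated Gaussians.
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(0.0, 1.0, 200),
                    data_rng.normal(5.0, 1.0, 200)])
mu, var, pi, ll = em_gmm_1d(x)
```

Each iteration's log-likelihood is at least as large as the previous one, which is the hallmark of EM referred to throughout this section.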

The LVM in this paper is interesting in two respects. First, although its latent variables are continuous, with posterior distributions that are not always multivariate Gaussian, exact inference remains tractable. Second, unlike most LVMs for unsupervised learning, ours does not attempt to learn a so-called generative model of the data; instead, the LVM’s parameters enter through the log conditional likelihood in Eq. **6**, and it is these parameters that provide a low-dimensional representation of the data. *SI Appendix* gives a full derivation of the expectation step (E-step) and the maximization step (M-step) of the EM algorithm; here we present only the key results.

### Inference (E-step).

We can express the parameter updates for the EM algorithm most simply in terms of the posterior statistics of the LVM’s continuous latent variables. For each pair of inputs that contributes to Eq. **6**, the relevant posterior distributions are given by Bayes rule in Eq. **7**. Substituting Eqs. **1**–**5** into Eq. **7**, we see that the first of these distributions depends on the ratio of the probabilities in Eqs. **3** and **4**; this ratio will simplify the expressions for many posterior statistics of interest.

The most important statistics to compute for the latent variables are their posterior means. These means are available in closed form; *SI Appendix* gives the corresponding expressions for each of the latent variables.

The other posterior statistics we need for the EM algorithm are the expected squared Euclidean distances between the latent variables and their prior means. These, too, are available in closed form; *SI Appendix* gives the corresponding expressions.

### Learning (M-step).

The EM algorithm for LVMs is based on maximizing a lower bound on the log-likelihood (15). There are two ways to derive EM updates for our LVM. The first (standard) approach employs a single lower bound on Eq. **6** to derive joint updates for all of the model’s parameters; we instead derive separate updates, one for the outputs and one for the variances, which we present in turn.

We begin by giving the update for the outputs. Each iteration of this update solves a sparse system of linear equations whose coefficient matrix is the sum of two terms: The first is the graph Laplacian in Eq. **18**, while the second is a nonnegative diagonal matrix that incorporates the dissimilarity weights. Systems of this form are sparse, symmetric, and diagonally dominant (SDD), and this update is an instance of the *Laplacian paradigm* (18), in which fast solvers for SDD systems are used as algorithmic primitives in problems involving large graphs.
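The structure of this update can be illustrated with standard sparse solvers. The Python sketch below is our own construction (a toy path graph, an arbitrary 0.1 diagonal, and SciPy's conjugate-gradient routine standing in for the combinatorial multigrid solver (17) used in our experiments); it assembles a graph Laplacian plus a nonnegative diagonal and solves the resulting SDD system:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Toy neighborhood graph: a path on n nodes, as a sparse adjacency matrix.
n = 100
i = np.arange(n - 1)
A = sp.coo_matrix((np.ones(n - 1), (i, i + 1)), shape=(n, n))
A = A + A.T

# Coefficient matrix: graph Laplacian (degree matrix minus adjacency)
# plus a nonnegative diagonal, which makes the system nonsingular.
lap = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
M = lap + sp.diags(np.full(n, 0.1))

# Solve M z = b by conjugate gradients; fast SDD solvers of this kind are
# the algorithmic primitive behind the Laplacian paradigm.
b = np.random.default_rng(0).standard_normal(n)
z, info = cg(M, b)
assert info == 0  # converged
```

Because the coefficient matrix is SDD, iterative solvers of this kind converge quickly and scale to very large, sparse graphs.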

Next, we give the update for the variances. In terms of the posterior statistics in Eqs. **12**–**15**, this update takes an especially simple form. Together, the updates culminating in Eq. **20** are guaranteed to increase the log conditional likelihood in Eq. **6** (except at stationary points).

## Length Scales and Pairwise Similarities

The goal of our LVM is to learn an embedding that maps nearby inputs to nearby outputs and distant inputs to distant outputs. To begin, however, we must formalize which pairs of inputs we regard as nearby in the first place. As is common, we encode these neighborhood relationships by a sparse graph. Then we use this graph to derive the length scales in Eqs. **3** and **4** and the weights in Eq. **6**.

### Neighborhood Graph.

It would be simplest to regard the inputs as neighbors whenever one is among the k-nearest neighbors (kNN) of the other, but a graph constructed in this way may be disconnected or too sparse.

For our LVMs, we have used a simple two-parameter procedure to construct sparse neighborhood graphs, where both parameters (k and s, introduced below) take small integer values.

Our procedure starts by computing the kNN of each input and encoding these neighborhood relationships by a binary matrix K, whose nonzero entries in each row mark that input’s k-nearest neighbors; note that K is not, in general, symmetric.

In the first step, we compute the minimum spanning tree of the undirected graph over the inputs; including the edges of this tree guarantees that the final neighborhood graph is connected.

In the second step, we accumulate the first s powers of the matrix K; the value of s (typically, small) can be regarded as the number of steps of a directed random walk. The nonzero entries of this accumulated sum indicate pairs of inputs that are linked by a directed path of at most s kNN steps.

The final step of our procedure combines the results of the previous two. We define a neighborhood graph whose edges are the union of those found above: the edges of the minimum spanning tree and the pairs linked by short random walks.
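Under our reading of the three steps above, a minimal Python sketch looks as follows; the function name, the brute-force distance computation, and the exact symmetrization are our own choices, made for clarity rather than efficiency.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def neighborhood_graph(X, k=3, s=2):
    """Sparse neighborhood graph: kNN edges densified by s-step walks,
    with a minimum spanning tree added to guarantee connectivity."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-neighbors
    # Binary kNN matrix K: row i marks the k nearest neighbors of input i.
    K = np.zeros((n, n), dtype=int)
    K[np.arange(n)[:, None], np.argsort(d2, axis=1)[:, :k]] = 1
    # Step 1: minimum spanning tree of the full distance graph.
    D = np.sqrt(np.where(np.isinf(d2), 0.0, d2))
    T = minimum_spanning_tree(sp.csr_matrix(D)).toarray() > 0
    # Step 2: accumulate the first s powers of K (paths of <= s kNN hops).
    P = np.zeros((n, n), dtype=bool)
    Ks = np.eye(n, dtype=int)
    for _ in range(s):
        Ks = Ks @ K
        P |= Ks > 0
    # Step 3: union of both edge sets, symmetrized, without self-loops.
    G = (T | T.T) | (P | P.T)
    np.fill_diagonal(G, False)
    return G

# Demo on random inputs: the graph comes out symmetric and connected.
X = np.random.default_rng(0).standard_normal((60, 4))
G = neighborhood_graph(X, k=3, s=2)
n_parts = connected_components(sp.csr_matrix(G.astype(float)), directed=False)[0]
```

For large n, the brute-force distance matrix would of course be replaced by an (approximate) nearest-neighbor search.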

### Coarse Graining.

Finally, we use the neighborhood graph to determine the length scales in Eqs. **3** and **4** and the weights in Eq. **6**. The edges that are present in this graph indicate the pairs of inputs that should be mapped to nearby outputs, while the edges that are absent indicate the pairs that should not. When the neighborhood graph is sparse, the number of pairs in the latter category will be prohibitively large to enumerate for all but the smallest datasets. Algorithms such as LargeVis (2) and UMAP (3) overcome this difficulty by a negative sampling procedure (7) for nonadjacent nodes in the graph. As an alternative, we describe a recursive coarse-graining procedure, where the number of levels (ℓ) of recursion can be adjusted to learn from larger numbers of inputs.

*Case* ℓ = 0.

We choose the weights and length scales to achieve two complementary goals: first, to shrink the distances between neighbors, and second, to ensure that neighbors remain closer than nonneighbors. In the base case, this is especially simple: We set the weights from the neighborhood graph, and we set the length scales in Eqs. **3** and **4** so that neighbors are drawn closer together than nonneighbors.

*Case* ℓ = 1.

Fig. 1 shows the basic intuition behind coarse graining. Suppose that n is too large to model the dissimilarity between every pair of nonneighboring inputs. In this case, we designate a subset of the inputs as *landmarks* and assign each input to its closest landmark; similar ideas (20–23) have been used in many algorithms for NLDR. After coarse graining, the model penalizes the separation between inputs assigned to different landmarks. The length scales in Eq. **26** were chosen with this in mind, so that ML estimation after coarse graining will separate inputs belonging to different landmarks by at least the same amount as before coarse graining. As in the base case, we also rescale the weights before proceeding with EM; *SI Appendix* provides further details.

It remains to specify how many landmarks to choose. In practice, we choose enough landmarks that inputs assigned to different landmarks are genuinely dissimilar, but few enough that the dissimilarities between all pairs of landmarks can be modeled explicitly.
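The bookkeeping behind this trade-off can be sketched in a few lines of Python. The construction below is entirely our own (greedy farthest-point sampling to pick landmarks, and a simple count of pairs rather than the paper's actual weighting); it only illustrates how one level of coarse graining shrinks the number of dissimilarity pairs that must be modeled.

```python
import numpy as np

def farthest_point_landmarks(X, m, seed=0):
    """Pick m spread-out landmarks by greedy farthest-point sampling."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(m - 1):
        idx.append(int(np.argmax(d)))  # farthest from the current set
        d = np.minimum(d, np.linalg.norm(X - X[idx[-1]], axis=1))
    return np.array(idx)

def coarse_grain_counts(X, m):
    """Number of dissimilarity pairs before vs. after one level of
    coarse graining (illustrative accounting only)."""
    n = len(X)
    landmarks = farthest_point_landmarks(X, m)
    # Assign every input to its closest landmark.
    dists = np.linalg.norm(X[:, None] - X[landmarks][None], axis=2)
    sizes = np.bincount(np.argmin(dists, axis=1), minlength=m)
    flat = n * (n - 1) // 2  # all pairs, with no coarse graining
    # After coarse graining: landmark-landmark pairs plus pairs of inputs
    # that share a landmark (these still need explicit modeling).
    coarse = m * (m - 1) // 2 + int((sizes * (sizes - 1) // 2).sum())
    return flat, coarse

flat, coarse = coarse_grain_counts(
    np.random.default_rng(0).standard_normal((1000, 5)), m=32)
```

With roughly balanced assignments, the coarse count grows far more slowly than the quadratic count of all pairs, which is the point of the construction.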

*Case* ℓ≥2.

For large n, it may remain prohibitive to explicitly model the dissimilarity between pairs of landmarks and/or between pairs of inputs assigned to the same landmark. In this case, the coarse-graining procedure can be applied recursively. For example, if there are too many landmarks, we can group them under *superlandmarks* and model their separation at even longer length scales; or, if there are too many inputs assigned to the same landmark, we can divide them among *sublandmarks* and model their separation at a shorter length scale. In the idealized case of perfectly balanced assignments, a similar scaling argument (in *SI Appendix*) shows that each additional level of coarse graining reduces the number of nonzero dissimilarity weights still further.

## Results

We have implemented the EM algorithm for this LVM in MATLAB. All results were obtained on an Intel Core i5 CPU running at 1.3 GHz. In each experiment, we ran the EM algorithm for 400 iterations and used a momentum term to accelerate the update in Eq. **20**. We also initialized the outputs as described in *SI Appendix*.
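The momentum term mentioned above is a generic acceleration device for iterative updates. As a sketch (on a toy quadratic objective of our own choosing, not the model's actual update), heavy-ball momentum looks like this:

```python
import numpy as np

# Heavy-ball momentum on a toy quadratic objective 0.5 z'Az - b'z.
# The matrix, step size, and momentum coefficient are our own choices.
A = np.diag([1.0, 10.0])  # deliberately ill-conditioned
b = np.array([1.0, 1.0])
step, beta = 0.09, 0.5

z = np.zeros(2)
velocity = np.zeros(2)
for _ in range(200):
    grad = A @ z - b                          # gradient of the quadratic
    velocity = beta * velocity - step * grad  # accumulate momentum
    z = z + velocity                          # accelerated iterate

# The iterates approach the solution of A z = b.
```

The same wrapper applies to any fixed-point update: each new step is nudged in the direction of the previous one, which damps the oscillations that slow plain iterations on ill-conditioned problems.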

Fig. 2 shows a two-dimensional (2D) visualization of the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits. Further details of this experiment appear in *SI Appendix*, which also explores the effects of different choices for k, s, and ℓ.

Our remaining experiments highlight the LVM’s ability to learn higher-dimensional embeddings. Fig. 3 shows the performance of a 9NN classifier on the LVM’s embeddings of varying dimensionality.
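An evaluation of this kind is easy to reproduce. The sketch below (plain NumPy, on synthetic clusters of our own making) scores an embedding by the accuracy of a k-nearest-neighbor majority vote, the same measure used in Fig. 3 with k = 9:

```python
import numpy as np

def knn_accuracy(train_X, train_y, test_X, test_y, k=9):
    """Accuracy of a k-nearest-neighbor classifier (majority vote)."""
    d2 = ((test_X[:, None] - train_X[None]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]   # k closest training points
    votes = train_y[nearest]                  # their labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return (pred == test_y).mean()

# Demo: two well-separated synthetic clusters standing in for the
# low-dimensional embeddings of two classes.
rng = np.random.default_rng(0)
train_X = np.concatenate([rng.normal(0, 0.5, (100, 3)),
                          rng.normal(5, 0.5, (100, 3))])
train_y = np.repeat([0, 1], 100)
test_X = np.concatenate([rng.normal(0, 0.5, (40, 3)),
                         rng.normal(5, 0.5, (40, 3))])
test_y = np.repeat([0, 1], 40)
acc = knn_accuracy(train_X, train_y, test_X, test_y, k=9)
```

An embedding that preserves neighborhoods well yields high kNN accuracy even in few dimensions, which is precisely what such plots measure.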

We also experimented on a subset of normalized word vectors from the word2vec model trained on the GoogleNews corpus (7). The original model produced a vector of dimensionality 300 for each word in a 3-million-word vocabulary; we used an abridged subset of this vocabulary (see Data Availability).

## Perspective

Our model builds directly on earlier approaches. LargeVis (2) and UMAP (3) optimize a likelihood of the same general form as Eq. **6**, but with a different specific form for the probability in Eq. **5** that pairs of outputs are colocated. An EM algorithm similar to the one in this paper was also investigated for an earlier model of distance metric learning (26).

Our work was heavily inspired by the success of previous visualizations, as well as a desire to replicate them with longstanding methodologies. Visualizations from t-SNE (1, 4), LargeVis, and UMAP have emerged as core tools in exploratory data analysis; in addition, current implementations of LargeVis and UMAP scale very well to large datasets, as do leading implementations of t-SNE (5). While the main bottleneck of our approach is currently the search for exact nearest neighbors, this is not an intrinsic bottleneck: Like these other methods, we could also turn to approximate nearest neighbor search (6, 27).

We emphasize, however, that the general problem of NLDR has larger goals than visualization (which focuses only on embeddings in three or fewer dimensions). This general problem is where more work remains to be done, and also where our formulation is likely to yield the most benefits. LVMs can be extended in many directions—most notably, by incorporating additional latent variables—and tractable LVMs can serve as a basis for approximate inference in less tractable ones (28). Notwithstanding the power of current approaches, we anticipate many further developments along these lines.

### Data Availability.

Code and scripts are available at https://osf.io/wy793/. The dataset of MNIST handwritten digits is available at http://yann.lecun.com/exdb/mnist/, and the Matlab word2vec dataset is available at https://github.com/chrisjmccormick/word2vec_matlab.

## Acknowledgments

The MATLAB experiments were made possible by the combinatorial multigrid solver distributed by J. Koutis and the abridged word2vec dataset distributed by C. McCormick on GitHub. A similar debt is owed to the anonymous reviewers, whose suggestions improved the paper in numerous ways.

## Footnotes

- ^{1}Email: saul{at}cs.ucsd.edu.

Author contributions: L.K.S. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The author declares no competing interest.

This article is a PNAS Direct Submission. N.K. is a guest editor invited by the Editorial Board.

Data deposition: Code and scripts are available in Open Science Framework at https://osf.io/wy793/.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1916012117/-/DCSupplemental.

Published under the PNAS license.

## References

- J. Bourdeau, J. Hendler, R. Nkambou, I. Horrocks, B. Y. Zhao
- J. Tang, J. Liu, M. Zhang, Q. Mei
- L. McInnes, J. Healy
- L. van der Maaten
- S. Srinivasan et al.
- W. Dong, M. Charikar, K. Li
- C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean
- J. B. Tenenbaum, V. de Silva, J. C. Langford
- S. T. Roweis, L. K. Saul
- D. L. Donoho, C. E. Grimes
- R. R. Coifman et al.
- R. A. Redner, H. F. Walker
- A. P. Dempster, N. M. Laird, D. B. Rubin
- L. E. Baum, T. Petrie, G. Soules, N. Weiss
- I. Koutis, G. L. Miller, R. Peng
- D. A. Spielman, S.-H. Teng
- L. K. Saul, Y. Weiss, L. Bottou
- M. A. Carreira-Perpiñán, R. S. Zemel
- S. Becker, S. Thrun, K. Obermayer
- V. de Silva, J. B. Tenenbaum
- S. Thrun, L. K. Saul, B. Schölkopf
- J. C. Platt
- N. Pezzotti, T. Höllt, B. Lelieveldt, E. Eisemann, A. Vilanova
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner
- F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger
- M. Der, L. K. Saul
- J. Johnson, M. Douze, H. Jégou
- D. M. Blei, A. Kucukelbir, J. D. McAuliffe
## Article Classifications

- Physical Sciences
- Applied Mathematics