# Visualizing probabilistic models and data with Intensive Principal Component Analysis

Edited by William Bialek, Princeton University, Princeton, NJ, and approved May 31, 2019 (received for review October 5, 2018)

## Significance

We introduce Intensive Principal Component Analysis (InPCA), a widely applicable manifold-learning method for visualizing general probabilistic models and data. Using replicas to tune dimensionality in high-dimensional data, we use the zero-replica limit to discover a distance metric which preserves distinguishability in high dimensions, and an embedding with superior visualization performance. We apply InPCA to the ΛCDM model of cosmology, which predicts the angular power spectrum of the cosmic microwave background, allowing visualization of the space of model predictions (i.e., different universes).

## Abstract

Unsupervised learning makes manifest the underlying structure of data without curated training and specific problem definitions. However, the inference of relationships between data points is frustrated by the “curse of dimensionality” in high dimensions. Inspired by replica theory from statistical mechanics, we consider replicas of the system to tune the dimensionality and take the limit as the number of replicas goes to zero. The result is an intensive embedding, which not only is isometric (preserving local distances) but also allows global structure to be more transparently visualized. We develop Intensive Principal Component Analysis (InPCA) and demonstrate clear improvements in visualizations of the Ising model of magnetic spins, a neural network, and the dark energy cold dark matter (ΛCDM) model of cosmology.

Visualizing high-dimensional data is a cornerstone of machine learning, modeling, big data, and data mining. These fields require learning faithful and interpretable low-dimensional representations of high-dimensional data and, almost as critically, producing visualizations which allow interpretation and evaluation of what was learned (1–4). Unsupervised learning, which infers features from data without manually curated data or specific problem definitions (5), is especially important for high-dimensional, big data applications in which specific models are unknown or impractical. In high dimensions, the relative distances between features become small and most points are orthogonal to one another (6). A trade-off between preserving local and global structure must often be made when inferring a low-dimensional representation. Classic manifold learning techniques include linear methods such as principal component analysis (PCA) (7) and multidimensional scaling (MDS) (8), which preserve global structure but at the cost of obscuring local features. Existing nonlinear manifold learning techniques, such as t-distributed stochastic neighbor embedding (t-SNE) (9) and diffusion maps (10), preserve the local structure while maintaining only some qualitative global patterns such as large clusters. Uniform manifold approximation and projection (UMAP) (11) better preserves topological structures in data, a global property.

In this article, we develop a nonlinear manifold learning technique which achieves a compromise between preserving local and global structure. We accomplish this by developing an isometric embedding for general probabilistic models based on the replica trick (12). Taking the number of replicas to zero, we reveal an intensive property (an information density characterizing the distinguishability of distributions) that ameliorates the canonical orthogonality problem and “curse of dimensionality.” We then describe a simple, deterministic algorithm that can be used for any such model, which we call Intensive Principal Component Analysis (InPCA). Our method quantitatively captures global structure while preserving local distances. We first apply InPCA to the canonical Ising model of magnetism, which inspired the zero-replica limit. Next, we show how InPCA can capture and summarize the learning trajectory of a neural network. Finally, we visualize the dark energy cold dark matter (ΛCDM) model of cosmology and compare InPCA with t-SNE and diffusion maps.

## Model Manifolds of Probability Distributions

Any measurement obtained from an experiment with uncertainty can generally be understood as a probability distribution. For example, when some data x are observed with normally distributed noise ξ of variance σ², the measurement is described by a Gaussian distribution of width σ centered on the noiseless prediction.

We define a probabilistic model as a parameterized family of distributions p(x | θ): each choice of parameters θ assigns a likelihood to every possible observation x, and varying the parameters traces out a model manifold of probability distributions.

## Hypersphere Embedding

We promised an embedding which both is isometric and preserves global structures. We satisfy the first promise by considering the hypersphere embedding, which maps each distribution p to the vector of square roots √p. Because probabilities sum to one, every distribution is embedded on the unit hypersphere, and the Euclidean distance between embedded points is the Hellinger distance.

As the dimension of the data increases, almost all features become orthogonal to each other, and most measures of distance lose their ability to discriminate between the smallest and largest distances (19). For the hypersphere embedding, we see that as the dimension of x increases, the inner product in the Hellinger distance of Eq. **4** becomes smaller as the probability is distributed over more dimensions. In the limit of large dimension, all nonidentical pairs of points become orthogonal and equidistant around the hypersphere (a constant distance of √2 apart).
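The hypersphere embedding can be sketched in a few lines of NumPy; the distributions below are invented for illustration, not taken from the paper.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance: the Euclidean distance between the hypersphere
    embeddings sqrt(p) and sqrt(q) of two discrete distributions."""
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])

# Every distribution embeds onto the unit hypersphere: ||sqrt(p)|| = 1.
unit_norm = np.linalg.norm(np.sqrt(p))

# The squared distance takes the form 2 * (1 - <sqrt p, sqrt q>).
d2 = hellinger(p, q) ** 2

# Distributions with disjoint support are maximally separated, at sqrt(2).
d_max = hellinger(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

The bound of √2 on all pairwise distances is exactly what causes the concentration problem described above.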

To illustrate this problem with the hypersphere embedding, consider the Ising model, which predicts the likelihood of observing a particular configuration of binary random variables (spins) on a lattice. The probability of a spin configuration is determined by the Boltzmann distribution and is a function of a local pairwise coupling and a global applied field. The dimension is determined by the number of spin configurations (2^N for N spins). We visualize the hypersphere embedding of Eq. **3** by projecting the predictions onto the first three principal components of X. Fig. 1*A* shows this projection of the model manifold of a 2 × 2 Ising model, and Fig. 1*B* shows a larger lattice, whose manifold wraps farther around the hypersphere.

A natural way to increase the dimensionality of a probabilistic model is by drawing multiple samples from the distribution. If D is the dimension of x, then N identical draws from the distribution will have dimension D^N. This is illustrated in Fig. 1*C*, where we drew four replica samples from the same model. Note that compared with the original 2 × 2 model, the model manifold of the four-replica 2 × 2 model “wraps” more around the hypersphere, just like the larger lattice of Fig. 1*B*.
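A toy numerical illustration of how replication drives distinct models toward orthogonality (the coin probabilities here are invented for the example):

```python
import numpy as np

def sqrt_overlap(p, q):
    """Inner product <sqrt p, sqrt q> between hypersphere embeddings."""
    return float(np.sum(np.sqrt(p * q)))

# Two nearby biased coins (illustrative values).
p = np.array([0.6, 0.4])
q = np.array([0.4, 0.6])

# N independent draws form a product distribution, so the overlap of the
# replicated models is the single-draw overlap raised to the N-th power:
# even very similar models become orthogonal (overlap -> 0, Hellinger
# distance -> sqrt(2)) as the number of replicas grows.
for N in (1, 10, 1000):
    overlap_N = sqrt_overlap(p, q) ** N
    hellinger_N = np.sqrt(2 * (1 - overlap_N))
```

Because the overlap decays exponentially in N, every pair of distinct models saturates at the maximal distance √2, which is the “wrapping” seen in Fig. 1*C*.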

## Replica Theory and the Intensive Embedding

We saw in Fig. 1 that increasing the dimension of the data led to a saturation of the distance function of Eq. **4**. This problem is referred to as the loss of relative contrast or the concentration of distances (19), and overcoming it requires a non-Euclidean distance function, discussed below. In the previous section we saw that the same saturation of distance could be produced by adding replicas, increasing the embedding dimension. Fig. 2*A* shows this process taken to an extreme: the model manifold of the Ising model with many replicas, in which all distinct models have become nearly orthogonal and equidistant. Fig. 2*B* shows the result of this analysis, the intensive embedding, where the distance concentration has been cured and the inherent 2D structure of the Ising model has been recovered.

To find the intensive embedding, we must first find the distance between replicated models. The likelihood for N replicas of a system is given by the product of the individual likelihoods. Substituting this replicated likelihood into the Hellinger distance of Eq. **4** gives a distance per replica, $d_N^2(p,q)/N = 2\left(1 - \langle\sqrt{p},\sqrt{q}\rangle^N\right)/N$. Taking the limit of zero replicas, $N \to 0$, yields the intensive distance $d^2(p,q) = -2\log\sum_x \sqrt{p(x)\,q(x)}$, whose small-distance expansion reproduces the local metric of Eq. **5**, so that we can be confident the intensive embedding distance preserves local structures.

Importantly, the intensive distance does not satisfy the triangle inequality (and is thus non-Euclidean): The distance between points can go to infinity, rather than being constrained by the finite radius of the hypersphere embedding. Because of this, the intensive embedding overcomes the loss of relative contrast (19) discussed at the beginning of this section. Distances in the intensive embedding maintain distinguishability in high dimensions, as illustrated in Fig. 2*B*, wherein the 2D nature of the Ising model has been recovered. We hypothesize that this process, which cures the curse of dimensionality for models with too many samples, will also cure it for models with intrinsically high dimensionality. The intensive distance obtained here is proportional to the Bhattacharyya distance (21); the zero-replica limit of the Hellinger divergence thus provides a new route to deriving it. The importance of this is discussed further in the following section.
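Two defining properties of the intensive distance $d^2(p,q) = -2\log\sum_x\sqrt{p(x)q(x)}$ can be checked numerically: it is exactly extensive under replication (hence an information *density* per replica), and it agrees with the Hellinger distance for nearby distributions. The example distributions below are invented:

```python
import numpy as np

def intensive_d2(p, q):
    """Squared intensive distance: -2 log of the Bhattacharyya coefficient."""
    return float(-2.0 * np.log(np.sum(np.sqrt(p * q))))

p = np.array([0.7, 0.3])
q = np.array([0.4, 0.6])

# Replicating both models doubles the log-overlap, so the squared intensive
# distance of the two-replica product distributions is exactly twice the
# single-replica value (extensivity in the replica number).
p2 = np.kron(p, p)   # two-replica product distribution
q2 = np.kron(q, q)
extensivity_gap = intensive_d2(p2, q2) - 2 * intensive_d2(p, q)

# For nearby distributions, -2 log(overlap) ~ 2 (1 - overlap), so the
# intensive distance matches the Hellinger distance locally.
r = np.array([0.51, 0.49])
s = np.array([0.50, 0.50])
local_gap = intensive_d2(r, s) - 2 * (1 - np.sum(np.sqrt(r * s)))
```

Both gaps are numerically negligible, which is the content of the isometry claim: local structure is unchanged, while large distances are stretched rather than saturated.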

### Connection to Least Squares.

Consider the concrete and canonical paradigm of models with additive Gaussian noise. Evaluating Eq. **9** for the case of nonlinear least squares shows that the intensive distance between two sets of model predictions reduces to their sum of squared residuals, scaled by the noise variance.
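For concreteness, here is a sketch of that reduction for two Gaussian models with means $f(\theta_1)$, $f(\theta_2)$ and common variance $\sigma^2$ (notation introduced here for illustration):

```latex
\int_{-\infty}^{\infty}
  \sqrt{\mathcal{N}\!\left(x;\, f(\theta_1), \sigma^2\right)\,
        \mathcal{N}\!\left(x;\, f(\theta_2), \sigma^2\right)}\, dx
  = \exp\!\left( -\frac{\left(f(\theta_1) - f(\theta_2)\right)^2}{8\sigma^2} \right),
\qquad\text{so}\qquad
d^2(\theta_1, \theta_2)
  = -2 \log(\text{overlap})
  = \frac{\left(f(\theta_1) - f(\theta_2)\right)^2}{4\sigma^2}.
```

The squared intensive distance is thus proportional to the squared residual between predictions, recovering the least-squares cost up to a constant factor.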

## Intensive Principal Component Analysis

Classical PCA takes a set of data examples and infers features which are linearly uncorrelated (7). The features to be analyzed with PCA are compared via their Euclidean distance. Can we generalize this comparison to use our intensive embedding distance? Given a matrix of data examples X, classical PCA diagonalizes the covariance matrix of the centered features.

The principal components can also be obtained from the cross-covariance matrix, whose entries are the centered inner products between pairs of examples rather than between pairs of features.

Writing our data matrix in the hypersphere embedding of Eq. **3** for replicated systems, the cross-covariance takes the form of Eq. **6**. As with the intensive embedding, we can take the limit as the number of replicas goes to zero to find the intensive cross-covariance matrix of Eq. **18**.

In summary, InPCA is achieved by the following procedure: (*i*) Compute the cross-covariance matrix from a set of probability samples, using the zero-replica limit of Eq. **18**. (*ii*) Compute the eigenvalue decomposition of the cross-covariance matrix. (*iii*) Compute the coordinate projections T from the eigenvectors, scaled by the square roots of the magnitudes of the corresponding eigenvalues. (*iv*) Plot the projections using the columns of T.
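The four steps can be sketched as follows. This is a minimal reading of the procedure (MDS-style double centering of the pairwise intensive distances), not the authors' released code; the biased-coin sampling at the end is an illustrative input.

```python
import numpy as np

def inpca(P):
    """Minimal InPCA sketch. Rows of P are sampled discrete probability
    distributions (each row sums to 1). Returns coordinates T and the
    eigenvalues (which may be negative: the embedding is Minkowski-like)."""
    n = P.shape[0]
    # (i) pairwise intensive distances d^2 = -2 log <sqrt p, sqrt q>,
    # double-centered into a cross-covariance analogue.
    overlap = np.sqrt(P) @ np.sqrt(P).T
    d2 = -2.0 * np.log(np.clip(overlap, 1e-300, None))
    J = np.eye(n) - np.ones((n, n)) / n      # centering projector
    W = -0.5 * J @ d2 @ J
    # (ii) eigendecomposition, ranked by eigenvalue magnitude.
    evals, evecs = np.linalg.eigh(W)
    order = np.argsort(-np.abs(evals))
    evals, evecs = evals[order], evecs[:, order]
    # (iii) coordinates scaled by sqrt(|eigenvalue|); (iv) plot columns of T.
    T = evecs * np.sqrt(np.abs(evals))
    return T, evals

# Illustrative input: the 1D family of biased coins discussed below.
theta = np.linspace(0.05, 0.95, 50)
P = np.stack([theta, 1 - theta], axis=1)
T, evals = inpca(P)
```

With this input, the first column of T varies monotonically with the coin's bias and the second is symmetric about the fair coin, matching the bias/variance interpretation of the first two components described below.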

### Neural Network MNIST Digit Classifier.

To demonstrate the utility of InPCA, we use it to visualize the training of a two-layer convolutional neural network (CNN), constructed using TensorFlow (22) and trained on the MNIST dataset of handwritten digits (23). A set of 55,000 images was used to train the network, which was then used to predict, for each image in an additional set of 10,000 test images, the likelihood of its being each digit between 0 and 9. We use softmax (24) to calculate the probabilities from the category estimates supplied by the network. The CNN thus defines a likelihood over digit labels for every test image, and its training trajectory traces a path on the corresponding model manifold, which we visualize with InPCA.

## Properties of the Intensive Embedding and InPCA

The space characterized by our intensive embedding has two weird properties: First, it is formally 1D, yet there are multiple orthogonal directions upon which it can be projected; and second, it is Minkowski-like, in that it has negative squared distances, violating the triangle inequality. We posit that, fundamentally, this second property is what allows InPCA to cure the orthogonality problem.

We begin with a discussion of the 1D nature of the embedding space. The embedding dimension is given by $D^N$, which tends to one in the zero-replica limit; the orthogonal directions we plot are instead supplied by the eigenvectors found in step *ii* of our algorithm. Visualizations produced by InPCA are therefore cross sections of a space of dimension equal to the number of sampled points of the model manifold p, instead of the dimension D of the data or its replicated counterpart.

In the limit of zero replicas in Eq. **18**, the positive-definite, Wishart structure of the cross-covariance matrix is lost. It is therefore possible to have negative squared distances. The non-Euclidean nature of the embedding (flat, but Minkowski-like) does not suffer from the concentration of distances which plagues Euclidean measures in high dimensions, thus allowing the model manifold to be “unwound” from the N-sphere and for InPCA to produce useful, low-dimensional representations.

Finally, the eigenvalues of InPCA correspond to the cross-sectional widths of the model manifold. We see this quite explicitly with the following example of a biased coin (specifically, in Fig. 4*B*) where the eigenvalues extracted from InPCA map directly to the manifold widths measured along the direction of the corresponding InPCA eigenvector. Therefore, we see that InPCA produces a hierarchy of directions, ordered by the global widths of the model manifold. Note that, as with classical PCA, this correspondence depends on how faithfully the model manifold was originally sampled; that is, InPCA can tell you about the structure of the manifold only from observed points.

### Biased Coins.

To illustrate the properties of InPCA, we use it to visualize one of the simplest probabilistic models: a biased coin. A biased coin has one parameter, the odds ratio of heads to tails, and so forms a 1D manifold. Fig. 4*A* shows the first two InPCA components for the manifold of biased coins, for 2,000 sampled points with probabilities uniformly spread between 0 and 1 (excluding the endpoints, since they are orthogonal to all other coins and thus infinitely far away). The two extracted InPCA components correspond to the bias and variance of the coin, respectively. The hierarchy of components extracted from InPCA therefore corresponds to known features of the model (i.e., they are meaningful).

The importance of the negative squared distances is illustrated in Fig. 4. The contour lines representing constant distance from a fair coin are hyperbolas: Points can be a finite distance from a fair coin yet an infinite distance from each other. As two oppositely biased coins become increasingly biased, their distance from each other grows without bound (an outcome of a coin which always lands on heads will never match an outcome of a coin which always lands on tails), yet all points remain a finite distance from a fair coin. Note that all points lie in the left and right portions of Fig. 4*A*, representing net positive distances (the intensive pairwise distances are all positive).
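The divergence described above is easy to check numerically; ε below is an invented bias parameter for two nearly deterministic coins.

```python
import numpy as np

def intensive_dist(p, q):
    """Intensive distance between discrete distributions: the square root
    of -2 log of the Bhattacharyya coefficient."""
    return float(np.sqrt(-2.0 * np.log(np.sum(np.sqrt(p * q)))))

eps = 1e-12                          # small invented bias parameter
heads = np.array([1 - eps, eps])     # almost always heads
tails = np.array([eps, 1 - eps])     # almost always tails
fair = np.array([0.5, 0.5])

# The two extreme coins diverge from each other (distance -> infinity as
# eps -> 0), yet each stays a bounded distance from the fair coin
# (squared distance -> log 2). The triangle inequality fails.
d_ht = intensive_dist(heads, tails)
d_hf = intensive_dist(heads, fair)
d_tf = intensive_dist(tails, fair)
```

Here d_ht exceeds d_hf + d_tf by a wide margin, the hyperbolic geometry that lets InPCA keep extreme models distinguishable while anchoring them near the fair coin.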

### Comparing with t-SNE and Diffusion Maps.

We compare our manifold learning technique with two standard methods, t-SNE and diffusion maps, by applying each one to the six-parameter ΛCDM model of the cosmic microwave background.

To determine the likelihood functions, we use the Code for Anisotropies in the Microwave Background (CAMB) software package to generate power spectra (26). We perform a Monte Carlo sampling of 50,000 points around the best-fit parameters provided by the 2015 Planck data release (25), with sample weights based on the intensive distance to the best fit.

In Fig. 5 we show the first two components of the manifold embedding for InPCA, t-SNE, and diffusion maps. To apply t-SNE and diffusion maps to probabilistic data we must provide a distance; we therefore use our intensive distance, from Eq. **9**, for consistency and ease of comparison. In all three cases, the first component is directly related to the primordial fluctuation amplitude, while the remaining components differ markedly between methods; figures detailing these comparisons are provided in *SI Appendix*.

Such stark differences between manifold learning methods are surprising, as all techniques aim to extract important features in the data distribution, i.e., important geometric features in the manifolds. Given the ranges of sampled parameters, one would expect the variation in the Hubble constant to relate in some way to one of the dominant components, as it does for InPCA. Figures illustrating the effect of different parameters are provided in *SI Appendix*, following results from ref. 27.

There are two important differences between InPCA and other methods. First, InPCA has no tunable parameters and yields a geometric object defined entirely by the model distribution. For example, t-SNE embeddings rely on parameters such as the perplexity, a learning rate, and a random seed (yielding nondeterministic results), and the diffusion maps rely on a diffusion parameter and choice of diffusion operator, all of which must be manually optimized to obtain good results. Second, t-SNE and diffusion maps embed manifolds in Euclidean spaces in a way which aims to preserve local features. However, InPCA seeks to preserve both global and local features, by embedding manifolds in a non-Euclidean space.

## Summary

In this article, we introduce an unsupervised manifold learning technique, InPCA, which captures low-dimensional features of general, probabilistic models with wide-ranging applicability. We consider replicas of a probabilistic system to tune its dimensionality and consider the limit of zero replicas, deriving an intensive embedding that ameliorates the canonical orthogonality problem. Our intensive embedding provides a natural, meaningful way to characterize a symmetric distance between probabilistic data and produces a simple, deterministic algorithm to visualize the resulting manifold.

## Acknowledgments

We thank Mark Transtrum for guidance on algorithms and for useful conversations. We thank Pankaj Mehta for pointing out the connection to MDS. K.N.Q. was supported by a fellowship from the Natural Sciences and Engineering Research Council of Canada, and J.P.S. and K.N.Q. were supported by the National Science Foundation (NSF) through NSF Grants DMR-1312160 and DMR-1719490. M.D.N. was supported by NSF Grant AST-1454881.

## Footnotes

^{1}To whom correspondence may be addressed. Email: knq2@cornell.edu.

Author contributions: K.N.Q., C.B.C., F.D.B., M.D.N., and J.P.S. designed research; K.N.Q. performed research; K.N.Q. contributed new reagents/analytic tools; K.N.Q. analyzed data; and K.N.Q., C.B.C., F.D.B., M.D.N., and J.P.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: All code used to generate the figures is available through public repositories on Github. Fig. 1 and Fig. 2 can be generated from code on https://github.com/katnquinn/Ising_ModelManifold, Fig. 3 can be generated from code on https://github.com/katnquinn/IntensiveEmbedding, Fig. 4 from code found on https://github.com/katnquinn/1Spin, and Fig. 5 from code on https://github.com/katnquinn/CMB_ModelManifold and from Code for Anisotropies in the Microwave Background software found at https://camb.info/.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1817218116/-/DCSupplemental.

Published under the PNAS license.

## References

1. M. F. De Oliveira, H. Levkowitz
2. S. Liu, D. Maljovec, B. Wang, P. T. Bremer, V. Pascucci
3. J. A. Lee, M. Verleysen
4. A. Zimek, E. Schubert, H. P. Kriegel
5. K. P. Murphy
6. H. P. Kriegel, P. Kröger, A. Zimek
7.
8.
9.
10. R. R. Coifman et al.
11. L. McInnes, J. Healy, J. Melville
12. M. Mézard, G. Parisi, M. Virasoro
13. B. B. Machta, R. Chachra, M. K. Transtrum, J. P. Sethna
14.
15.
16. E. Hellinger
17. M. Gromov
18. S. Amari, H. Nagaoka
19. K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft
20.
21. A. Bhattacharyya
22. M. Abadi et al.
23. Y. LeCun, C. Cortes, C. J. Burges
24. C. M. Bishop
25. Planck Collaboration
26.
27.
28. K. Quinn
29. K. Quinn
30.

## Article Classifications

- Physical Sciences
- Applied Mathematics