## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Modeling and replicating statistical topology and evidence for CMB nonhomogeneity

Edited by Larry Wasserman, Carnegie Mellon University, Pittsburgh, PA, and approved August 29, 2017 (received for review April 25, 2017)

## Significance

Under the general heading of “topological data analysis” (TDA), the recent adoption of topological methods for the analysis of large, complex, and high-dimensional data sets has established that the abstract concepts of algebraic topology provide powerful tools for data analysis. However, despite the successes of TDA, most applications have lacked formal statistical veracity, primarily due to difficulties in deriving distributional information about topological descriptors. We present an approach, Replicating Statistical Topology (RST), which takes the most basic descriptor of TDA, the persistence diagram, and, using models based on Gibbs distributions and Markov chain Monte Carlo, provides replications of it. These allow for formal statistical hypothesis testing, without requiring costly, or perhaps intrinsically unavailable, replications of the original data set.

## Abstract

Under the banner of “big data,” the detection and classification of structure in extremely large, high-dimensional, data sets are two of the central statistical challenges of our times. Among the most intriguing new approaches to this challenge is “TDA,” or “topological data analysis,” one of the primary aims of which is providing nonmetric, but topologically informative, preanalyses of data which make later, more quantitative, analyses feasible. While TDA rests on strong mathematical foundations from topology, in applications, it has faced challenges due to difficulties in handling issues of statistical reliability and robustness, often leading to an inability to make scientific claims with verifiable levels of statistical confidence. We propose a methodology for the parametric representation, estimation, and replication of persistence diagrams, the main diagnostic tool of TDA. The power of the methodology lies in the fact that even if only one persistence diagram is available for analysis—the typical case for big data applications—the replications permit conventional statistical hypothesis testing. The methodology is conceptually simple and computationally practical, and provides a broadly effective statistical framework for persistence diagram TDA analysis. We demonstrate the basic ideas on a toy example, and the power of the parametric approach to TDA modeling in an analysis of cosmic microwave background (CMB) nonhomogeneity.

As a consequence of the current explosion in size, complexity, and dimensionality of data sets, there has been a growing need for the development of powerful but concise summary statistics and visualization methods that facilitate understanding and decision-making. A singularly novel approach, which has been particularly promising in areas as widespread as biology and medicine (1⇓–3), neurophysiology, (4), cosmology (5, 6), and materials science (7), has been via the application of the powerful, and rather abstract, concepts of algebraic topology to develop what generally now falls under the label of “topological data analysis” (TDA). While approaching complex data from a topological viewpoint is not entirely new—it underlies Tukey’s “Exploratory data analysis” of the 1960s (8) and the more recent approach by Friston and coworkers to brain imaging data (9)—TDA differs from all its forebears in its sophisticated exploitation of recent developments in computational topology. In particular, much of TDA has become almost synonymous with an analysis based on some version of persistent homology (10⇓–12), represented visually as barcodes, persistence diagrams (PDs), or related representations (13⇓⇓⇓–17).

With relatively few exceptions, notably refs. 17⇓⇓⇓⇓–22 (see additional citations in *SI Appendix*, *Homology and Persistent Homology*) TDA has not used statistical methodology as part of its approach, and, as a consequence, has typically been unable to associate clearly defined levels of statistical significance to its discoveries. While there may be a variety of reasons for this, one of the main obstacles to doing so is that the mathematical challenges involved in computing the statistical distributions of topological quantifiers have so far proven to be intractable. This is despite the fact that the measure-theoretic issues involved in defining probability measures which support notions such as expectations, variances, percentiles, and conditional probabilities have been effectively solved, for example, refs. 23⇓–25.

One approach adopted by refs. 18⇓–20 and others to circumvent these difficulties has been to reduce persistence diagrams to a single test statistic, often related to bottleneck norms, and then to adopt standard statistical resampling methods to analyze this statistic. If multiple diagrams are available, then the resampling can be done on them. However, since TDA is typically used in areas of very large data sets, the availability of replicates is rare, and consequently this approach is impracticable in most applications. An alternative approach is to (sub)sample points from the persistence diagram, and compute statistics on the subsamples. The problem with this approach, however, is that the true random object here is the full persistence diagram, and thus it is often unclear what is the precise meaning of the statistics produced this way.

We introduce an approach, based on generating a sequence of persistence diagrams which has similar statistical properties to those of the one generated by the data. The individual concepts underlying the method are not difficult, and follow a number of clearly defined stages. First, a parametric model is adopted that is sufficiently flexible to model an extremely wide class of persistence diagrams. The model we use is a class of Gibbs distributions, since these have a long history of success in modeling point sets (ref. 26 and its bibliography), which, essentially, is what a persistence diagram is. Having estimated parameters, we then exploit the fact that Gibbs distributions lend themselves to simulation by Markov chain Monte Carlo (MCMC) methods, and apply MCMC to produce a simulated sequence of diagrams. Since the underlying raison d’être of this approach is that persistence diagrams provide an excellent summary of topology, and statistics computed off the diagrams themselves furnish even more succinct summaries, we call this procedure “replicating statistical topology” (RST), and its introduction and descriptions of its implementation are the main contributions of the paper. We believe that these ideas, integrated with the existing techniques of TDA, provide another significant contribution toward putting TDA on a more solid statistical footing. To support this, we treat one toy example, showing that the technique works as predicted, and then study the fascinating and important topic of nonhomogeneities in the cosmic microwave background (CMB) radiation via parameter estimation of our Gibbs model.

## TDA and Persistence Diagrams

As homology is an algebraic method for describing topological features of shapes and functions, so persistent homology is an extension of this method for both enriching these descriptions and for describing how topology undergoes changes. We shall use it to describe the upper-level sets of real-valued functions *SI Appendix*.

Persistent homology is undeniably the most popular tool in the burgeoning area of TDA, one of the main reasons for which is the fact that it is easily visualized via barcode diagrams. Continuing with the upper-level set example of the previous paragraph, and starting with

In Fig. 1, *Left*, we see a sample *Middle*, we see the corresponding kernel density estimate, defined by*Right*, we have the corresponding PD of the upper-level set filtration of

While the PD in Fig. 1 performs as expected, and it is easy to identify the points that, a priori, we knew had to be there, there are many other points in the diagrams. Were we not in the situation of knowing ahead of time which points were “really” significant, it would not have been clear how to discount the additional points. We will treat this issue in *Example 1: Two Circles*.

There is another class of problems, in which the general structure of the PD, including points near the diagonal, is of more importance than the behavior of a handful of outliers. These are typically problems in which the topology is complex, and occurs at a number of different scales. Examples include tree-structured data objects such as brain artery trees (27), and a variety of cosmological structures (5, 6), including the CMB that we treat in *Example 2: CMB Nonhomogeneity*.

First, however, we must describe the general approach.

## Gibbs Measures for Persistence Diagrams

Given a finite collection *SI Appendix*, *Probability Models for Persistent Diagrams*. The normalization

As in all applications of Gibbs distributions, success depends on an appropriate choice for the energy function. Here is a way to do it for

We now define a Hamiltonian, at effective interaction distance *SI Appendix*, *Probability Models for Persistent Diagrams*, working with energy densities rather than absolute energies (i.e., without the **3**) leads to more-robust numerical procedures.

Now, given a PD **2** with Hamiltonian Eq. **3**.

While this may seem a rather arbitrary form for the distribution of a PPD, there are a number of facts justifying it. The first is the trivial observation that any multivariate distribution can be written in the form of Eq. **2**, simply by taking **3**. Finally, there is the issue, discussed in detail in *SI Appendix*, *Probability Models for Persistent Diagrams*, that we will often use these distributions not as exact models for PPDs but rather as a tool in a perturbative analysis. In these cases, the convenience of the models is more important than whether or not they provide a perfect fit to PPD data.

The determination of **4** scale for the order of magnitude of the data, which is unimportant topologically. (For cases for which

### Pseudolikelihood.

Given

The standard way around this problem, which we adopt, is the pseudolikelihood approach (32, 33). This originated in the context of point cloud data with spatial dependence, which is, essentially, a description of a PD. In particular, it exploits the inherent spatial Markovianness of a Gibbs distribution to replace the overall probability of, in our case, a random PPD

### Model Specification and Parameter Estimation.

While it might be expected that PPDs originating from different physical phenomena might require quite different models, we have found, in all of the examples that we tried, that taking **3**—so that the largest cluster size was 3 or 4—was both efficient and sufficient. If a lower *SI Appendix*, *Probability Models for Persistent Diagrams*, we describe some of this experimentation, giving examples of when these models are, and are not, successful in fitting PDs.

## RST and MCMC

We refer the reader to refs. 35 and 36 for technical background to this section, in which we describe a standard Metropolis–Hastings MCMC approach to replicating PDs. In particular, see ref. 35, section 10.3.3, in which the particular approach we take is called “Metropolis-within-Gibbs” and its properties are discussed.

Given a pseudolikelihood as in *Pseudolikelihood* (with known or estimated parameters), generating simulated replications of the associated point set via MCMC is not hard, but first we need some definitions.

First, given a **6** depends on

The basic step of the algorithm, which describes the update of the point set *Algorithm 1*.

To obtain *Algorithm 1* for a burn-in period. Then, starting with the final PPD from the burn-in, run the algorithm a further *SI Appendix*, *Probability Models for Persistent Diagrams* for details and practical guidelines for choosing these parameters.

Given the collection of

The higher levels are driven by the specific application, but the basic idea is to compute simpler, real, vector, or function-valued statistics off the simulated PDs, and take their empirical distribution as an estimate of the true, underlying distribution of the statistic. The same statistic, computed off the original PD, can then be tested for statistical significance against this empirical distribution in standard fashion. Diagrammatically, treating persistence-based TDA as a sequence of three steps,

## Examples

We treat two examples. One is a toy problem, for which the true situation is known, to see how and if RST works. The second studies the topology of CMB, and the analysis required is far more subtle. Details of both are given in *SI Appendix*. For both cases, we emphasize the point implied in the preceding paragraph, that our main interest is in the replication of the PDs, and not the particular method of statistical analysis following that.

### Example 1: Two Circles.

As a simple (but representative) test case, take a random sample from two circles, as in Fig. 1. Note that, while there are many points corresponding to the

However, there are certainly enough

Adopting the approach of RST, we estimated the parameters for a Gibbs distribution for the model with pseudolikelihood Eq. **5** for the

Using these three sets of simulations, we computed a number of statistics, but report on only one set here: the order statistics of the distances of the points in the PD to the diagonal. Given the points *SI Appendix*, *Probability Models for Persistent Diagrams*. These include a comparison with the bootstrap, confidence interval-based techniques of ref. 18. Using the same kernel bandwidth for the density estimate Eq. **1** that we used, these techniques identified only one point in each of the

In summary, blindly applying RST to simulate PDs, and taking the simplest of all statistics, showed (correctly) strong statistical evidence for two connected components in the topological space (two circles) which generated the PD, with borderline (but misleading) evidence for a third component. Despite the fact that the PD has a number of points far from the diagonal, and quite close to the third-farthest point (Fig. 1), these were (correctly) considered statistically insignificant. Thus, in this toy example, with the simplest of statistical quantifiers, RST competes favorably with existing methodologies.

### Example 2: CMB Nonhomogeneity.

Current cosmological theory describes a phase of rapid inflation in the primordial universe roughly

The CMB is the thermal radiation, generated as the universe cooled, some 300,000 years after the hypothesized Big Bang. Amazingly, it is measurable still today, and, since the temperature fluctuations in the observable CMB follow the pattern of the quantum perturbations from the inflationary era, it enables the mapping of the fluctuations in the distribution of matter in the early universe.

CMB data are directional, measuring fluctuations in radiation coming into a satellite from different directions. The first, satellite-based, detailed measurement of the CMB was carried out by the Cosmic Background Explorer (COBE) probe in the early 1990s, followed a decade later by the Wilkinson Microwave Anisotropy Probe experiment. Most recently, the high-precision Planck mission measured temperature anisotropies of the CMB to an accuracy of

There are many mathematical models for CMB, the most common being a homogeneous, isotropic Gaussian random field (38⇓⇓⇓–42). Both assumptions of Gaussianity and homogeneity have been challenged recently, from both theoretical and empirical viewpoints (43, 44), and it is homogeneity that we now address, parametrically, using our Gibbs model. (Non-Gaussianity has been addressed, geometrically, in refs. 45 and 46.)

To test homogeneity, we first cut out a ring around the equator of

The next step is to generate five smoothed versions of the CMB in each cap, which we do with five different Gaussian kernels, with full width at half maximum 300’, 180’, 120’, 90’, and 60’. The highest level of smoothing (300’) suppresses most of the fine-scale variation seen in Fig. 2, while the 60’ level leads to no visually distinguishable difference. For each such smoothing, we produce PDs generated by the upper-level set filtration, for both

The two PDs of Fig. 3 are quite similar, and, apart from a handful of outliers, it is hard to see any consistent differences between them. However, fitting a Gibbs model with pseudolikelihood Eq. **5**, again taking *SI Appendix, Example 2 - Analyzing CMB Data*. For reasons described there, we have more faith in the second, more conservative, of these tests.

The results provide statistical confirmation for differences between northern and southern PDs, with the most significant differences at the intermediate smoothing levels. (This is clearest from the *SI Appendix*, Tables S2 and S3.) While we do not have a definitive physical explanation for this, it is most likely due to the effect of interactions between objects that evolved due to the true primordial CMB fluctuations and foreground phenomena that evolved at later epochs. However, whatever the cosmological reason underlying Table 1, the implication is that it is unreasonable to blandly assume that the northern and southern cap CMB maps are realizations of the same stochastic process. In other words, a hypothesis of homogeneity is questionable.

From the point of view of this paper, however, our main discovery is not cosmological but lies in demonstrating the ability of the Gibbs model, which assumes nothing about the original data nor about how PDs express properties of the underlying data, to differentiate between complex structures using purely topological methods. Consequently, we believe that the approach described here will open the door to developing a wide variety of (semiparametric) statistical methods for further applications of TDA.

## Acknowledgments

Two splendid referees are responsible for making us work hard, think a lot, and produce over 30 pages of *Supporting Information*, which should clarify most of the imprecision forced on us by PNAS’s six-page format. Among others, we thank Jose Blanchet, Herbert Edelsbrunner, Jingchen Liu, Anthea Monod, Sayan Mukerjee, and Katherine Turner for helpful conversations in various stages of this research, which was supported in part by the research projects Stochastic Algebraic Topology and Its Applications II (AFOSR, FA9550-15-1-0032) and Understanding Random Systems via Algebraic Topology (European Research Council Advanced Grant 320422).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: radler{at}technion.ac.il.

Author contributions: R.J.A. designed research; R.J.A. and S.A. performed research; S.A. and P.P. analyzed data; and R.J.A. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission..

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1706885114/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵.
- Nicolau M,
- Levine A,
- Carlsson G

- ↵.
- Chan J,
- Carlsson G,
- Rabadan P

- ↵.
- Giusti C,
- Pastalkova E,
- Curto C,
- Itskov V

- ↵.
- Curto C

- ↵
- ↵
- ↵.
- Hiraoka Y, et al.

- ↵
- ↵
- ↵.
- Edelsbrunner H

*A Short Course in Computational Geometry and Topology*, Springer Briefs in Applied Sciences and Technology (Springer, New York). - ↵.
- Edelsbrunner H,
- Harer J

*Surveys on Discrete and Computational Geometry*, Contemporary Mathematics (Am Math Soc, Providence, RI) Vol 453, PP 257–282. - ↵.
- Edelsbrunner H,
- Harer J

- ↵
- ↵.
- Carlsson G

- ↵.
- Ghrist R

- ↵.
- Ghrist R

- ↵.
- Wasserman L

- ↵.
- Chazal F, et al.

- ↵.
- Fasy B, et al.

- ↵.
- Robinson A,
- Turner K

- ↵.
- Bobrowski O,
- Mukherjee S,
- Taylor J

- ↵.
- Bubenik P

- ↵.
- Mileyko Y,
- Mukherjee S,
- Harer J

- ↵.
- Munch E, et al.

- ↵
- ↵.
- Banerjee S,
- Carlin BP,
- Gelfand AE

*Hierarchical Modeling and Analysis for Spatial Data*, Monographs on Statistics and Applied Probability (CRC, Boca Raton, FL), 2nd Ed, Vol 135. - ↵.
- Bendich P,
- Marron JS,
- Miller E,
- Pieloch A,
- Skwerer S

- ↵.
- Georgii HO

*Gibbs Measures and Phase Transitions*, de Gruyter Studies in Mathematics (Walter de Gruyter, Berlin), 2nd Ed, Vol 9. - ↵.
- Adcock A,
- Carlsson E,
- Carlsson G

- ↵.
- Kahle M

*Algebraic Topology: Applications and New Directions*, Contemporary Mathematics (Am Math Soc, Providence, RI),Vol 620, PP 201–221. - ↵.
- Bobrowski O,
- Kahle M

- ↵.
- Besag J

- ↵.
- Chalmond B

*Modeling and Inverse Problems in Imaging Analysis*, transMaître H,Applied Mathematical Sciences (Springer, New York), Vol 155. - ↵.
- Burnham KP,
- Anderson DR

- ↵.
- Robert CP,
- Casella G

*Monte Carlo Statistical Methods*, Springer Texts in Statistics (Springer, New York), 2nd Ed. - ↵.
- Brooks S,
- Gemna A,
- Jones G,
- Meng XL

- ↵.
- Bobrowski O,
- Kahle M,
- Skraba P

- ↵
- ↵
- ↵
- ↵
- ↵.
- Planck Collaboration

- ↵
- ↵
- ↵.
- Planck Collaboration

- ↵.
- Buchert T,
- France MJ,
- Steiner F

- ↵.
- Planck Collaboration

## Citation Manager Formats

### More Articles of This Classification

### Physical Sciences

### Related Content

- No related articles found.

### Cited by...

- No citing articles found.