# Geometric anomaly detection in data

See allHide authors and affiliations

Edited by Tom C. Lubensky, University of Pennsylvania, Philadelphia, PA, and approved July 2, 2020 (received for review February 1, 2020)

## Significance

The problem of fitting low-dimensional manifolds to high-dimensional data has been extensively studied from both theoretical and computational perspectives. As datasets get more heterogeneous and complicated, so must the spaces that are used to approximate them. Stratified spaces, built out of manifold pieces coherently glued together, form natural candidates for such geometric models. The key difficulty encountered when fitting stratified spaces to data is that none of the sampled points can be expected to lie exactly on the low-dimensional singular strata. The present work uses local cohomology to overcome this difficulty. Here, we describe an efficient and practical framework for singularity detection from finite samples and demonstrate its ability to detect interfaces in real and simulated data.

## Abstract

The quest for low-dimensional models which approximate high-dimensional data is pervasive across the physical, natural, and social sciences. The dominant paradigm underlying most standard modeling techniques assumes that the data are concentrated near a single unknown manifold of relatively small intrinsic dimension. Here, we present a systematic framework for detecting interfaces and related anomalies in data which may fail to satisfy the manifold hypothesis. By computing the local topology of small regions around each data point, we are able to partition a given dataset into disjoint classes, each of which can be individually approximated by a single manifold. Since these manifolds may have different intrinsic dimensions, local topology discovers singular regions in data even when none of the points have been sampled precisely from the singularities. We showcase this method by identifying the intersection of two surfaces in the 24-dimensional space of cyclo-octane conformations and by locating all of the self-intersections of a Henneberg minimal surface immersed in 3-dimensional space. Due to the local nature of the topological computations, the algorithmic burden of performing such data stratification is readily distributable across several processors.

The *manifold hypothesis* (1) forms a cornerstone of modern data science; it assumes that the points in typical datasets tend to cluster near some unknown manifold of dimension substantially lower than the ambient dimension of the data. Manifold learning and dimensionality reduction techniques (2) rely on this assumption in order to infer faithful low-dimensional representations of high-dimensional data. Examples of such methods include a) classical *principal component analysis* (3, 4), where the approximating manifold is an affine subspace; b) *visual perception* (5), where continuous changes of configurations of an object yield smoothly varying changes along a curved manifold; c) *subspace clustering* (6), where data are clustered into disjoint sets that are well approximated by affine subspaces; and d) *generative adversarial networks*, which naturally produce data on pairs of manifolds (7). In sharp contrast to this profusion, one encounters a remarkable dearth of techniques designed for the analysis of data sampled from non-manifold, or singular, spaces. Among the simplest examples of singular spaces are unions of two manifolds along a common submanifold (as shown in Fig. 1); these arise organically when more than one class of data is present in the same set of observations. Recent techniques for the analysis of such heterogeneous data [see, for instance, *capsule networks* (8)] have focused primarily on coherently fusing together the multiple data classes.

The present work is motivated by an antipodal philosophy—singular regions of spaces that underlie modern datasets are inherently interesting, they will play an increasingly important role in the future of data analysis, and it is therefore of paramount importance to be able to detect these singularities directly from the data points. The task of fitting singular spaces to data is rendered difficult by the generic lack of observations which are located exactly on the geometric anomalies. For instance, most natural ways of sampling finitely many points from the space in Fig. 1 will not produce even a single point lying on the anomalous circle. Singular spaces occur quite naturally in several areas of data science—for instance, low-rank matrix approximation amounts to optimization over the determinantal variety of bounded-rank matrices, which is not a manifold (ref. 9, lecture 9). Moreover, even the simplest machine learning architectures, such as the multilayer perceptron, exhibit singularities in their parameter spaces (ref. 10, ch. 12.2). Here, we describe an algorithm that can detect singular regions from finitely many data points even when the points only lie near—rather than precisely on—the singularities. Our approach is based on *local cohomology* (11) and the theory of *stratifications* (12), which form particularly rich and fruitful enterprises in the study of singular spaces that arise in algebraic topology (13) and geometry (14). Recent computational advances in these fields (15, 16) have made it possible to bring this formidable theory to bear on the very concrete task of analyzing data that live on, or even near, spaces that are far more complicated than manifolds. Most of this existing machinery requires defining equations for a space in order to construct a stratification; in sharp contrast, here we only make use of a finite point sample.

Manifolds of dimension n are characterized by the requirement that a small neighborhood around each point should resemble the n-dimensional Euclidean disk (up to a standard equivalence relation called homeomorphism). While there can be no algorithmic procedure to determine whether two n-manifolds are homeomorphic or not (17) for *cohomology*, which assigns a sequence

Since the boundary of an n-dimensional disk is an *annular neighbors* *persistent cohomology* of

## Results

Local persistent cohomology successfully identifies all of the non-manifold regions in two completely different datasets whose underlying spaces are known to admit singularities. The first of these is the *conformation space of the cyclo-octane molecule* *conformations*, in three-dimensional (3D) space. The locations of hydrogen atoms are completely determined by those of the carbon atoms, so each conformation may be represented by a point in

Our second dataset is obtained by uniformly sampling points from the nonorientable *Henneberg minimal surface*, which is an immersion of the punctured 2D projective space in standard 3D space. The results are depicted in Fig. 3: again, the points that lie near the four self-intersections are manifestly separated from manifold-like points and boundary points. Additional details involving both datasets, including an explicit parameterization of the Henneberg minimal surface, can be found in *Materials and Methods*.

## Discussion

The two singular circles in the cyclo-octane conformation space were originally discovered using a local version of principal component analysis (PCA). This method uses spectral techniques to fit affine subspaces to local neighborhoods of data points (18, 19). While such methods work remarkably well for detecting intersections of flat manifolds (i.e., manifolds with curvature almost zero), they tend to require extremely dense samples and very small local neighborhood sizes in the presence of high curvature. Our approach differs substantially from such affine embedding techniques because cohomology, being a purely topological invariant, remains largely agnostic to the vagaries of local geometry such as curvature. Thus, it identifies dimensionally anomalous regions correctly even in highly curved regimes. As another pleasant side effect of the relative coarseness of cohomology, one obtains a far greater degree of robustness to the choice of neighborhood size than for local PCA. This latter phenomenon is illustrated for the cyclo-octane dataset in Fig. 4.

The enormous quantities of heterogeneous data being generated by modern experimental tools demand a concordant increase in the variety and sophistication of available geometric models. The procedure described here enables us to transcend the ubiquitous manifold hypothesis by allowing us to fit singular spaces to datasets. Aside from the data-dependent choice of radius parameters r and s, which determine the sizes of annular neighborhoods

## Materials and Methods

Detailed accounts of the first three topics described below may be found in the textbooks of Hatcher (21), Oudot (22), and Kirwan and Woolf (12), respectively.

### The Cohomology of Simplicial Complexes.

A *simplicial complex* K is a collection of subsets of a finite set V (usually called the set of vertices) satisfying the following condition: if *cochains*. It is possible to construct a sequence of *coboundary operators* *cohomology* of K is the quotient vector space

Cohomology is an extremely well-studied (21) descriptor of simplicial complexes and related spaces; it enjoys many wonderful properties, but only two of them are relevant to our purposes here. First, it is a *homeomorphism invariant*, meaning that any two different triangulations of the same space X will produce identical cohomologies even though the cochain spaces and coboundary operators might be wildly different. For instance, the cohomology vector spaces of an n sphere depend neither on geometric intricacies (such as its radius or its embedding in Euclidean space) nor on the combinatorics of a particular choice of simplicial decomposition. Second, cohomology is *functorial* with respect to the subcomplex relation among simplicial complexes. A subset L of simplices in K is called a subcomplex if it happens to be a simplicial complex in its own right. Whenever L is a subcomplex of K, there are well-defined linear maps

### The Persistent Cohomology of Data.

Given a finite dataset P embedded in Euclidean space *Vietoris–Rips* simplicial complex *persistence modules*, and their systematic study—which forms the theoretical core of topological data analysis—has been greatly facilitated by three miraculous properties.

The first property is algebraic—although persistence modules appear to involve an infinite amount of information prima facie, any V arising from the Vietoris–Rips cohomology of a finite dataset *barcode* of P. The second property is computational; barcodes can be extracted via elementary matrix algebra, and there are several software packages dedicated to their efficient computation (23). The third crucial property of persistence modules is geometric and takes the form of a *stability theorem* (20). Roughly, this result asserts that if the points of P are perturbed by an amount

### Stratified Spaces.

Singular spaces, such as algebraic varieties and quotients of group actions on manifolds, are often analyzed via their *stratifications*. We remark that most stratifications are derived from algebraic or analytic equations, rather than data. Each stratification *strata*, are open i-dimensional submanifolds of Y. Every simplicial complex, for instance, admits a natural stratification whose i strata are precisely the i-*simplices*. It is customary to impose two additional constraints on the strata in order to render the study of stratified spaces tractable. The first requirement, called the *frontier axiom*, ensures that the set of all strata is partially ordered by the boundary relation *equisingularity* or *normal triviality*, imposes severe topological constraints on intersections of small neighborhoods in Y around various points of a single i-stratum with the higher strata

As a consequence of equisingularity, to each i-stratum σ one can assign a single *link* of σ, so that the following property holds. For each point y in σ and all choices of small neighborhoods

### Datasets.

The cyclo-octane dataset, which was introduced by Martin et al. (18), consists of 6,040 points in

### Algorithm and Implementation.

The procedure *Geometric Anomaly Detection* discovers intersections of dimension *Geometric Anomaly Detection* decomposes the original dataset P into the k-manifold points

*Algorithm*:*Geometric Anomaly Detection*.

**In:** Finite point set

**Out:** A partition of P into subsets

This algorithm as written will also cast points lying in certain nongeneric intersections of k manifolds into

More relevant for practical applications is the fact that this algorithm can be suitably iterated in order to also detect certain lower-dimensional singularities as follows. Since the subset of points *Geometric Anomaly Detection* a second time, with P being replaced by

## Data Availability.

Both the cyclo-octane and the Henneberg datasets have been made available at https://github.com/stolzbernadette/Geometric-Anomalies/tree/master/Data_Sets. Our Matlab implementation of the *Geometric Anomaly Detection* algorithm is available at https://github.com/stolzbernadette/Geometric-Anomalies.

## Acknowledgments

We thank Barbara Mahler for performing the isomap projection of the cyclo-octane data to

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: nanda{at}maths.ox.ac.uk.

Author contributions: B.J.S., J.T., H.A.H., and V.N. designed research; B.J.S., J.T., H.A.H., and V.N. performed research; B.J.S. analyzed data; and B.J.S., J.T., H.A.H., and V.N. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Database deposition: Both the cyclo-octane and the Henneberg datasets have been made available at GitHub (https://github.com/stolzbernadette/Geometric-Anomalies/tree/master/Data_Sets). Our Matlab implementation of the geometric anomaly detection algorithm is available at GitHub (https://github.com/stolzbernadette/Geometric-Anomalies).

- Copyright © 2020 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- C. Fefferman,
- S. Mitter,
- H. Narayanan

- ↵
- J. A. Lee,
- M. Verleysen

- ↵
- ↵
- J. B. Tenenbaum,
- V. De Silva,
- J. C. Langford

- ↵
- H. Sebastian Seung,
- D. D. Lee

- ↵
- R. Vidal

- ↵
- T. Che,
- Y. Li,
- A. Paul Jacob,
- Y. Bengio,
- W. Li

- ↵
- I. Guyon et al.

- S. Sabour,
- N. Frosst,
- G. E. Hinton

- ↵
- J. Harris

- ↵
- S.-i. Amari

- ↵
- V. Nanda

- ↵
- F. Kirwan,
- J. Woolf

- ↵
- M. Goresky,
- R. MacPherson

- ↵
- W. Fulton

- ↵
- ↵
- G. Henselman,
- R. Ghrist

- ↵
- A. A. Markov

- ↵
- ↵
- S. Martin,
- J.-P. Watson

- ↵
- D. Cohen-Steiner,
- H. Edelsbrunner,
- J. Harer

- ↵
- A. Hatcher

- ↵
- S. Oudot

- ↵
- N. Otter,
- M. A. Porter,
- U. Tillmann,
- P. Grindrod,
- H. A. Harrington

- ↵
- H. Hong,
- C. Yap

- A. Tausz,
- M. Vejdemo-Johansson,
- H. Adams

- ↵
- U. Bauer

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Applied Mathematics