# Robust continuous clustering

See allHide authors and affiliations

Edited by David L. Donoho, Stanford University, Stanford, CA, and approved August 7, 2017 (received for review January 13, 2017)

## Significance

Clustering is a fundamental experimental procedure in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Despite the importance and ubiquity of clustering, existing algorithms suffer from a variety of drawbacks and no universal solution has emerged. We present a clustering algorithm that reliably achieves high accuracy across domains, handles high data dimensionality, and scales to large datasets. The algorithm optimizes a smooth global objective, using efficient numerical methods. Experiments demonstrate that our method outperforms state-of-the-art clustering algorithms by significant factors in multiple domains.

## Abstract

Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank.

Clustering is one of the fundamental experimental procedures in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Data-clustering algorithms have been developed for more than half a century (1). Significant advances in the last two decades include spectral clustering (2⇓–4), generalizations of classic center-based methods (5, 6), mixture models (7, 8), mean shift (9), affinity propagation (10), subspace clustering (11⇓–13), nonparametric methods (14, 15), and feature selection (16⇓⇓⇓–20).

Despite these developments, no single algorithm has emerged to displace the

We present a clustering algorithm that is fast, easy to use, and effective in high dimensions. The algorithm optimizes a clear continuous objective, using standard numerical methods that scale to massive datasets. The number of clusters need not be known in advance.

The operation of the algorithm can be understood by contrasting it with other popular clustering techniques. In center-based algorithms such as

Our formulation is based on recent convex relaxations for clustering (25, 26). However, our objective is deliberately not convex. We use redescending robust estimators that allow even heavily mixed clusters to be untangled by optimizing a single continuous objective. Despite the nonconvexity of the objective, the optimization can still be performed using standard linear least-squares solvers, which are highly efficient and scalable. Since the algorithm expresses clustering as optimization of a continuous objective based on robust estimation, we call it robust continuous clustering (RCC).

One of the characteristics of the presented formulation is that clustering is reduced to optimization of a continuous objective. This enables the integration of clustering in end-to-end feature learning pipelines. We demonstrate this by extending RCC to perform joint clustering and dimensionality reduction. The extended algorithm, called RCC-DR, learns an embedding of the data into a low-dimensional space in which it is clustered. Embedding and clustering are performed jointly, by an algorithm that optimizes a clear global objective.

We evaluate RCC and RCC-DR on a large number of datasets from a variety of domains. These include image datasets, document datasets, a dataset of sensor readings from the Space Shuttle, and a dataset of protein expression levels in mice. Experiments demonstrate that our method significantly outperforms prior state-of-the-art techniques. RCC-DR is particularly robust across datasets from different domains, outperforming the best prior algorithm by a factor of 3 in average rank.

## Formulation

We consider the problem of clustering a set of

The RCC objective has the following form:

The function **1** into a convex optimization problem. However, convex functions—even the

The second term in objective **1** is related to the mean shift objective (9). The RCC objective differs in that it includes an additional data term, uses a sparse (as opposed to a fully connected) connectivity structure, and is based on robust estimation.

Our approach is based on the duality between robust estimation and line processes (28). We introduce an auxiliary variable **1** and **2** are equivalent with respect to

Objective **2** can be optimized by any gradient-based method. However, its form enables efficient and scalable optimization by iterative solution of linear least-squares systems. This yields a general approach that can accommodate many robust nonconvex functions

While the presented approach can accommodate many estimators in the same computationally efficient framework, our exposition and experiments use a form of the well-known Geman–McClure estimator (31),**1** and **2** equivalent with respect to

## Optimization

Objective **2** is biconvex on **2** turns into a linear least-squares problem. We exploit this special structure and optimize the objective by alternatingly updating the variable sets

When **5** into Eq. **2**, which yields objective **1** with respect to

When **[****2****]** in matrix form and obtain a simplified expression for solving **7** can be solved independently and in parallel.

The RCC algorithm is summarized in *Algorithm 1: RCC*. Note that all updates of **2**.

The algorithm uses graduated nonconvexity (32). It begins with a locally convex approximation of the objective, obtained by setting

The parameter **1** balances the strength of the data terms and pairwise terms. The reformulation of RCC as a linear least-squares problem enables setting **7** suggests that the data terms and pairwise terms can be balanced by setting

Additional details concerning Algorithm 1 are provided in *SI Methods*.

## SI Methods

### Initialization and Output.

We initialize the optimization with

### Connectivity Structure.

The connectivity structure

### Graduated Nonconvexity.

The penalty function in Eq. **3** is nonconvex and its shape depends on the value of the parameter

### Parameter Setting.

The termination conditions are set to maxiterations = 100 and

For RCC-DR, the sparse coding parameters are set to

### Implementation.

We use an approximate nearest-neighbor search to construct the connectivity structure (54) and a conjugate gradient solver for linear systems (55).

### The RCC-DR Algorithm.

The RCC-DR algorithm is summarized in *Algorithm S1: Joint Clustering and Dimensionality Reduction*.

## Joint Clustering and Dimensionality Reduction

The RCC formulation can be interpreted as learning a graph-regularized embedding

We begin by considering an initial formulation for the RCC-DR objective:**10** is problematic because in the beginning of the optimization the representation **1**, but in formulation **10** the contamination of **3** for both

To optimize objective **11**, we introduce line processes **5**. The variables

The dictionary

The variables

A precise specification of the RCC-DR algorithm is provided in Algorithm S1.

## Experiments

### Datasets.

We have conducted experiments on datasets from multiple domains. The dimensionality of the data in the different datasets varies from 9 to just below 50,000. `Reuters-21578` is the classic benchmark for text classification, comprising 21,578 articles that appeared on the Reuters newswire in 1987. `RCV1` is a more recent benchmark of 800,000 manually categorized Reuters newswire articles (38). (Due to limited scalability of some prior algorithms, we use 10,000 random samples from RCV1.) `Shuttle` is a dataset from NASA that contains 58,000 multivariate measurements produced by sensors in the radiator subsystem of the Space Shuttle; these measurements are known to arise from seven different conditions of the radiators. `Mice Protein` is a dataset that consists of the expression levels of 77 proteins measured in the cerebral cortex of eight classes of control and trisomic mice (39). The last two datasets were obtained from the University of California, Irvine, machine-learning repository (40).

`MNIST` is the classic dataset of 70,000 hand-written digits (41). `Pendigits` is another well-known dataset of hand-written digits (42). The Extended Yale Face Database B (`YaleB`) contains images of faces of 28 human subjects (43). The YouTube Faces Database (`YTF`) contains videos of faces of different subjects (44); we use all video frames from the first 40 subjects sorted in chronological order. Columbia University Image Library (`COIL-100`) is a classic collection of color images of 100 objects, each imaged from 72 viewpoints (45). The datasets are summarized in Table 1.

### Baselines.

We compare RCC and RCC-DR to 13 baselines, which include widely known clustering algorithms as well as recent techniques that were reported to achieve state-of-the-art performance. Our baselines are

### Measures.

Normalized mutual information (NMI) has emerged as the standard measure for evaluating clustering accuracy in the machine-learning community (51). However, NMI is known to be biased in favor of fine-grained partitions. For this reason, we use adjusted mutual information (AMI), which removes this bias (52). This measure is defined as follows:

### Results.

Results on all datasets are reported in Table 2. In addition to accuracy on each dataset, Table 2 also reports the average rank of each algorithm across datasets. For example, if an algorithm achieves the third-highest accuracy on half of the datasets and the fourth-highest one on the other half, its average rank is 3.5. If an algorithm did not yield a result on a dataset due to its size, that dataset is not taken into account in computing the average rank of the algorithm.

RCC or RCC-DR achieves the highest accuracy on seven of the nine datasets. RCC-DR achieves the highest or second-highest accuracy on eight of the nine datasets and RCC achieves the highest or second-highest accuracy on five datasets. The average rank of RCC-DR and RCC is 1.6 and 2.4, respectively. The best-performing prior algorithm, LDMGI, has an average rank of 4.9, three times higher than the rank of RCC-DR. This indicates that the performance of prior algorithms is not only lower than the performance of RCC and RCC-DR, it is also inconsistent, since no prior algorithm clearly leads the others across datasets. In contrast, the low average rank of RCC and RCC-DR indicates consistently high performance across datasets.

### Clustering Gene Expression Data.

We conducted an additional comprehensive evaluation on a large-scale benchmark that consists of more than 30 cancer gene expression datasets, collected for the purpose of evaluating clustering algorithms (53). The results are reported in Table S3. RCC-DR achieves the highest accuracy on eight of the datasets. Among the prior algorithms, affinity propagation achieves the highest accuracy on six of the datasets and all others on fewer. Overall, RCC-DR achieves the highest average AMI across the datasets.

### Running Time.

The execution time of RCC-DR optimization is visualized in Fig. 2. For reference, we also show the corresponding timings for affinity propagation, a well-known modern clustering algorithm (10), and LDMGI, the baseline that demonstrated the best performance across datasets (48). Fig. 2 shows the running time of each algorithm on randomly sampled subsets of the 784-dimensional MNIST dataset. We sample subsets of different sizes to evaluate runtime growth as a function of dataset size. Performance is measured on a workstation with an Intel Core i7-5960x CPU clocked at 3.0 GHz. RCC-DR clusters the whole MNIST dataset within 200 s, whereas affinity propagation takes 37 h and LDMGI takes 17 h for 40,000 points.

### Visualization.

We now qualitatively analyze the output of RCC by visualization. We use the MNIST dataset for this purpose. On this dataset, RCC identifies 17 clusters. Nine of these are large clusters with more than 6,000 instances each. The remaining 8 are small clusters that encapsulate outlying data points: Seven of these contain between 2 and 11 instances, and one contains 148 instances. Fig. 3*A* shows 10 randomly sampled data points *B*. Fig. 3*C* shows 2 randomly sampled data points from each of the small outlying clusters. Additional visualization of RCC output on the Coil-100 dataset is shown in Fig. S3.

Fig. 4 compares the representation

## Discussion

We have presented a clustering algorithm that optimizes a continuous objective based on robust estimation. The objective is optimized using linear least-squares solvers, which scale to large high-dimensional datasets. The robust terms in the objective enable separation of entangled clusters, yielding high accuracy across datasets and domains.

The continuous form of the clustering objective allows it to be integrated into end-to-end feature learning pipelines. We have demonstrated this by extending the algorithm to perform joint clustering and dimensionality reduction.

## SI Experiments

### Datasets.

For Reuters-21578 we combine the train and test sets of the Modified Apte split and use only samples from categories with more than five examples. For RCV1 we consider four root categories and a random subset of 10,000 samples. For text datasets, the graph is constructed on PCA projected input. The number of PCA components is set to the number of ground-truth clusters. We compute term frequency–inverse document frequency features on the 2,000 most frequently occurring word stems.

On YaleB we consider only the frontal face images and preprocess them using gamma correction and DoG filter. For YTF we use all of the video frames from the first 40 subjects sorted in chronological order. For all image datasets we scale the pixel intensities to the range

### Baselines.

For

Unlike the presented algorithms, many baselines rely on multiple executions with random restarts. To maximize their reported accuracy, we use 10 random restarts for these baselines. Following common practice, for

Most of the baselines require setting one or more parameters. For a fair comparison, for each algorithm we tune one major parameter across datasets and use the default values for all other parameters. For all algorithms, the tuned value is selected based on the best average performance across all datasets. Parameter settings for the baselines are summarized in Table S1. The notation (m : s : M) indicates that parameter search is conducted in the range

### Additional Accuracy Measure.

For completeness, we evaluate the accuracy of RCC, RCC-DR, and all baselines, using the NMI measure (51, 52). The results are reported in Table S2.

### Results on Gene Expression Datasets.

Table S3 lists AMI results on more than 30 cancer gene expression datasets collected by de Souto et al. (53). The maximum number of samples across datasets is only

### Robustness to Hyperparameter Settings.

The parameters of the RCC algorithm are set automatically based on the data. The RCC-DR algorithm does have a number of parameters but is largely insensitive to their settings. In the following experiment, we vary the sparse-coding parameters *A* and *B* compares the sensitivity of RCC-DR to these parameters with the sensitivity of the best-performing prior algorithms to their key parameters. For each baseline, we use the default search range proposed in their respective papers. The *x* axis in Fig. S1 corresponds to the parameter index. As Fig. S1 demonstrates, the accuracy of RCC-DR is robust to hyperparameter settings: The relative change of RCC-DR accuracy in AMI on YaleB is 0.005, 0.008, and 0 across the range of

### Robustness to Dataset Imbalance.

We now evaluate the robustness of different approaches to imbalance in class sizes. This experiment uses the MNIST dataset. We control the degree of imbalance by varying a parameter

### Visualization.

Fig. S3*A* shows 10 randomly sampled data points *B* shows the corresponding representatives

### Learned Representation.

One way to quantitatively evaluate the success of the learned representation *Left* reports the performance of multiple baselines when they are given, as input, the representation *Right* reports corresponding results when the baselines are given the representation

The results indicate that the performance of prior clustering algorithms improves significantly when they are run on the representations learned by RCC and RCC-DR. The accuracy improvements for

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: sohilas{at}umd.edu.

Author contributions: S.A.S. and V.K. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1700770114/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵.
- MacQueen J

- ↵.
- Shi J,
- Malik J

- ↵.
- Dietterich TG,
- Becker S,
- Ghahramani Z

- Ng AY,
- Jordan MI,
- Weiss Y

- ↵
- ↵.
- Banerjee A,
- Merugu S,
- Dhillon IS,
- Ghosh J

- ↵.
- Teboulle M

- ↵.
- McLachlan G,
- Peel D

- ↵
- ↵.
- Comaniciu D,
- Meer P

- ↵.
- Frey BJ,
- Dueck D

- ↵.
- Vidal R

- ↵.
- Elhamifar E,
- Vidal R

- ↵.
- Soltanolkotabi M,
- Elhamifar E,
- Candès EJ

- ↵.
- Ben-Hur A,
- Horn D,
- Siegelmann HT,
- Vapnik V

- ↵.
- Langford J,
- Pineau J

- Kulis B,
- Jordan MI

- ↵.
- Friedman JH,
- Meulman JJ

- ↵
- ↵
- ↵.
- Pan W,
- Shen X

- ↵
- ↵.
- Jain AK

- ↵.
- Everitt BS,
- Landau S,
- Leese M,
- Stahl D

- ↵
- ↵.
- Arthur D,
- Vassilvitskii S

- ↵.
- Getoor L,
- Scheffer T

- Hocking T,
- Joulin A,
- Bach FR,
- Vert J

- ↵.
- Chi EC,
- Lange K

- ↵.
- Brito M,
- Chávez E,
- Quiroz A,
- Yukich J

- ↵.
- Black MJ,
- Rangarajan A

- ↵.
- Green PJ

- ↵.
- Marchetti Y,
- Zhou Q

- ↵.
- Geman S,
- McClure DE

- ↵.
- Blake A,
- Zisserman A

- ↵.
- Mobahi H,
- Fisher III JW

- ↵.
- Yang Q,
- Wooldridge M

- Wang Z, et al.

- ↵.
- Flammarion N,
- Palanisamy B,
- Bach FR

- ↵.
- Aharon M,
- Elad M,
- Bruckstein A

- ↵.
- Parikh N,
- Boyd SP

- ↵.
- Lewis DD,
- Yang Y,
- Rose TG,
- Li F

- ↵.
- Higuera C,
- Gardiner KJ,
- Cios KJ

- ↵.
- Lichman M

- ↵
- ↵.
- Alimoglu F,
- Alpaydin E

- ↵.
- Georghiades AS,
- Belhumeur PN,
- Kriegman DJ

- ↵.
- Wolf L,
- Hassner T,
- Maoz I

*Proceedings of IEEE CVPR 2011*, eds Felzenszwalb P., Forsyth D., and Fua P. (IEEE Computer Society, New York), Vol 1, pp 529–534. - ↵.
- Nene SA,
- Nayar SK,
- Murase H

- ↵.
- Koller D,
- Schuurmans D,
- Bengio Y

- Zhao D,
- Tang X

- ↵.
- Boutilier C

- Nie F,
- Xu D,
- Tsang IW,
- Zhang C

- ↵
- ↵.
- Fitzgibbon A,
- Lazebnik S,
- Perona P,
- Sato Y,
- Schmid C

- Zhang W,
- Wang X,
- Zhao D,
- Tang X

- ↵.
- Zhang W,
- Zhao D,
- Wang X

- ↵
- ↵.
- Vinh NX,
- Epps J,
- Bailey J

- ↵
- ↵.
- Muja M,
- Lowe DG

- ↵.
- Guennebaud G, et al.

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Computer Sciences