Robust continuous clustering

Significance Clustering is a fundamental experimental procedure in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Despite the importance and ubiquity of clustering, existing algorithms suffer from a variety of drawbacks and no universal solution has emerged. We present a clustering algorithm that reliably achieves high accuracy across domains, handles high data dimensionality, and scales to large datasets. The algorithm optimizes a smooth global objective, using efficient numerical methods. Experiments demonstrate that our method outperforms state-of-the-art clustering algorithms by significant factors in multiple domains. Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank.

Despite these developments, no single algorithm has emerged to displace the k -means scheme and its variants (21). This is despite the known drawbacks of such center-based methods, including sensitivity to initialization, limited effectiveness in high-dimensional spaces, and the requirement that the number of clusters be set in advance. The endurance of these methods is in part due to their simplicity and in part due to difficulties associated with some of the new techniques, such as additional hyperparameters that need to be tuned, high computational cost, and varying effectiveness across domains. Consequently, scientists who analyze large high-dimensional datasets with unknown distribution must maintain and apply multiple different clustering algorithms in the hope that one will succeed. Books have been written to guide practitioners through the landscape of data-clustering techniques (22).
We present a clustering algorithm that is fast, easy to use, and effective in high dimensions. The algorithm optimizes a clear continuous objective, using standard numerical methods that scale to massive datasets. The number of clusters need not be known in advance.
The operation of the algorithm can be understood by contrasting it with other popular clustering techniques. In center-based algorithms such as k -means (1,24), a small set of putative cluster centers is initialized from the data and then iteratively refined. In affinity propagation (10), data points communicate over a graph structure to elect a subset of the points as representatives. In the presented algorithm, each data point has a dedicated representative, initially located at the data point. Over the course of the algorithm, the representatives move and coalesce into easily separable clusters. The progress of the algorithm is visualized in Fig. 1.
Our formulation is based on recent convex relaxations for clustering (25,26). However, our objective is deliberately not convex. We use redescending robust estimators that allow even heavily mixed clusters to be untangled by optimizing a single continuous objective. Despite the nonconvexity of the objective, the optimization can still be performed using standard linear leastsquares solvers, which are highly efficient and scalable. Since the algorithm expresses clustering as optimization of a continuous objective based on robust estimation, we call it robust continuous clustering (RCC).
One of the characteristics of the presented formulation is that clustering is reduced to optimization of a continuous objective. This enables the integration of clustering in end-to-end feature learning pipelines. We demonstrate this by extending RCC to perform joint clustering and dimensionality reduction. The extended algorithm, called RCC-DR, learns an embedding of the data into a low-dimensional space in which it is clustered. Embedding and clustering are performed jointly, by an algorithm that optimizes a clear global objective.
We evaluate RCC and RCC-DR on a large number of datasets from a variety of domains. These include image datasets, document datasets, a dataset of sensor readings from the Space Shuttle, and a dataset of protein expression levels in mice. Experiments demonstrate that our method significantly outperforms prior state-of-the-art techniques. RCC-DR is particularly robust across datasets from different domains, outperforming the best prior algorithm by a factor of 3 in average rank.

Formulation
We consider the problem of clustering a set of n data points. The input is denoted by X = [x1, x2, . . . , xn ], where xi ∈ R D . Our approach operates on a set of representatives U = [u1, u2, . . . , un ], where ui ∈ R D . The representatives U are initialized at the corresponding data points X. The optimization operates on the representation U, which coalesces to reveal the cluster structure latent in the data. Thus, the number of clusters Significance Clustering is a fundamental experimental procedure in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Despite the importance and ubiquity of clustering, existing algorithms suffer from a variety of drawbacks and no universal solution has emerged. We present a clustering algorithm that reliably achieves high accuracy across domains, handles high data dimensionality, and scales to large datasets. The algorithm optimizes a smooth global objective, using efficient numerical methods. Experiments demonstrate that our method outperforms state-ofthe-art clustering algorithms by significant factors in multiple domains. need not be known in advance. The optimization of U is illustrated in Fig. 1.
The RCC objective has the following form: Here E is the set of edges in a graph connecting the data points. The graph is constructed automatically from the data. We use mutual k -nearest neighbors (m-kNN) connectivity (27), which is more robust than commonly used kNN graphs. The weights wp,q balance the contribution of each data point to the pairwise terms and λ balances the strength of different objective terms. The function ρ(·) is a penalty on the regularization terms. The use of an appropriate robust penalty function ρ is central to our method. Since we want representatives ui of observations from the same latent cluster to collapse into a single point, a natural penalty would be the 0 norm (ρ(y) = [y = 0], where [·] is the Iverson bracket). However, this transforms the objective into an intractable combinatorial optimization problem. At another extreme, recent work has explored the use of convex penalties, such as the 1 and 2 norms (25,26). This has the advantage of turning objective 1 into a convex optimization problem. However, convex functions-even the 1 norm-have limited robustness to spurious edges in the connectivity structure E, because the influence of a spurious pairwise term does not diminish as representatives move apart during the optimization. Given noisy real-world data, heavy contamination of the connectivity structure by connections across different underlying clusters is inevitable. Our method uses robust estimators to automatically prune spurious intercluster connections while maintaining veridical intracluster correspondences, all within a single continuous objective.
The second term in objective 1 is related to the mean shift objective (9). The RCC objective differs in that it includes an additional data term, uses a sparse (as opposed to a fully connected) connectivity structure, and is based on robust estimation.
Our approach is based on the duality between robust estimation and line processes (28). We introduce an auxiliary variable lp,q for each connection (p, q) ∈ E and optimize a joint objective over the representatives U and the line process L = {lp,q }: wp,q lp,q up − uq 2 2 + Ψ(lp,q ) .
Here Ψ(lp,q ) is a penalty on ignoring a connection (p, q): Ψ(lp,q ) tends to zero when the connection is active (lp,q → 1) and to one when the connection is disabled (lp,q → 0). A broad variety of robust estimators ρ(·) have corresponding penalty functions Ψ(·) such that objectives 1 and 2 are equivalent with respect to U: Optimizing either of the two objectives yields the same set of representatives U. This formulation is related to iteratively reweighted least squares (IRLS) (29), but is more flexible due to the explicit variables L and the ability to define additional terms over these variables. Objective 2 can be optimized by any gradient-based method. However, its form enables efficient and scalable optimization by iterative solution of linear least-squares systems. This yields a general approach that can accommodate many robust nonconvex functions ρ, reduces clustering to the application of highly optimized off-the-shelf linear system solvers, and easily scales to datasets with hundreds of thousands of points in tens of thousands of dimensions. In comparison, recent work has considered a specific family of concave penalties and derived a computationally intensive majorization-minimization scheme for optimizing the objective in this special case (30). Our work provides a highly efficient general solution.
While the presented approach can accommodate many estimators in the same computationally efficient framework, our exposition and experiments use a form of the well-known Geman-McClure estimator (31), where µ is a scale parameter. The corresponding penalty function that makes objectives 1 and 2 equivalent with respect to U is

[4]
Optimization Objective 2 is biconvex on (U, L). When variables U are fixed, the individual pairwise terms decouple and the optimal value of each lp,q can be computed independently in closed form. When variables L are fixed, objective 2 turns into a linear leastsquares problem. We exploit this special structure and optimize the objective by alternatingly updating the variable sets U and L.
As a block coordinate descent algorithm, this alternating minimization scheme provably converges. When U are fixed, the optimal value of each lp,q is given by [5] This can be verified by substituting Eq. 5 into Eq. 2, which yields objective 1 with respect to U. When L are fixed, we can rewrite [2] in matrix form and obtain a simplified expression for solving U, where ei is an indicator vector with the ith element set to 1. This is a linear least-squares problem that can be efficiently solved using fast and scalable solvers. The linear least-squares formulation is given by Here I ∈ R n×n is the identity matrix. It is easy to prove that is a Laplacian matrix and hence M is symmetric and positive semidefinite. As with any multigrid solver, each row of U in Eq. 7 can be solved independently and in parallel.
The RCC algorithm is summarized in Algorithm 1: RCC. Note that all updates of U and L optimize the same continuous global objective 2.
The algorithm uses graduated nonconvexity (32). It begins with a locally convex approximation of the objective, obtained by setting µ such that the second derivative of the estimator is positive (ρ(y) > 0) over the relevant part of the domain. Over the iterations, µ is automatically decreased, gradually introducing nonconvexity into the objective. Under certain assumptions, such continuation schemes are known to attain solutions that are close to the global optimum (33).
The parameter λ in the RCC objective 1 balances the strength of the data terms and pairwise terms. The reformulation of RCC as a linear least-squares problem enables setting λ automatically. Specifically, Eq. 7 suggests that the data terms and pairwise terms can be balanced by setting The value of λ is updated automatically according to this formula after every update of µ. An update involves computing only the largest eigenvalue of the Laplacian matrix A. The spectral norm of X is precomputed at initialization and reused. Additional details concerning Algorithm 1 are provided in SI Methods.

Joint Clustering and Dimensionality Reduction
The RCC formulation can be interpreted as learning a graphregularized embedding U of the data X. In Algorithm 1 the dimensionality of the embedding U is the same as the dimensionality of the data X. However, since RCC optimizes a continuous and differentiable objective, it can be used within end-to-end feature learning pipelines. We now demonstrate this by extending RCC to perform joint clustering and dimensionality reduction. Such joint optimization has been considered in recent work (34,35). The algorithm we develop, RCC-DR, learns a linear mapping into a reduced space in which the data are clustered. The V: Initialize u i = x i , lp,q = 1, µ max xp − xq 2 2 , λ = χ A 2 . VI: while C t − C t−1 < ε or t < maxiterations do VII: Update lp,q using Eq. 5 and A using Eq. 8.
XI: Output clusters given by the connected components of G.
mapping is optimized as part of the clustering objective, yielding an embedding in which the data can be clustered most effectively. RCC-DR inherits the appealing properties of RCC: Clustering and dimensionality reduction are performed jointly by optimizing a clear continuous objective, the framework supports nonconvex robust estimators that can untangle mixed clusters, and optimization is performed by efficient and scalable numerical methods. We begin by considering an initial formulation for the RCC-DR objective: Here D ∈ R D×d is a dictionary, zi ∈ R d is a sparse code corresponding to the i th data sample, and ui ∈ R d is the lowdimensional embedding of xi . For a fixed D, the parameter ν balances the data term in the sparse coding objective with the clustering objective in the reduced space. This initial formulation 10 is problematic because in the beginning of the optimization the representation U can be noisy due to spurious intercluster connections that have not yet been disabled. This had no effect on the convergence of the original RCC objective 1, but in formulation 10 the contamination of U can infect the sparse coding system via Z and corrupt the dictionary D. For this reason, we use a different formulation that has the added benefit of eliminating the parameter ν: Here we replaced the 2 penalty on the data term in the reduced space with a robust penalty. We use the Geman-McClure estimator 3 for both ρ1 and ρ2.
To optimize objective 11, we introduce line processes L 1 and L 2 corresponding to the data and pairwise terms in the reduced space, respectively, and optimize a joint objective over U, Z, D, L 1 , and L 2 . The optimization is performed by block coordinate descent over these groups of variables. The line processes L 1 and L 2 can be updated in closed form as in Eq. 5. The variables U are updated by solving the linear system UM dr = ZH, [12]

∼1
For each dataset, the number of instances, number of dimensions, number of ground-truth clusters, and the imbalance, defined as the ratio of the largest and smallest cardinalities of ground-truth clusters, are shown. For each dataset, the maximum AMI is highlighted in bold. Some prior algorithms did not scale to large datasets such as MNIST (70,000 data points in 784 dimensions). RCC or RCC-DR achieves the highest accuracy on seven of the nine datasets. RCC-DR achieves the highest or second-highest accuracy on eight of the nine datasets. The average rank of RCC-DR across datasets is lower by a multiplicative factor of 3 or more than the average rank of any prior algorithm. NA, not applicable. where and H is a diagonal matrix with hi,i = l 1 i . The dictionary D and codes Z are initialized using principal component analysis (PCA). [The K-SVD algorithm can also be used for this purpose (36).] The variables Z are updated by accelerated proximal gradient-descent steps (37), and ω t = t t+3 . The prox ε . 1 operator performs elementwise soft thresholding: [15] The variables D are updated usinḡ where β is a small regularization value set to β = 10 −4 tr(ZZ ). A precise specification of the RCC-DR algorithm is provided in Algorithm S1.

Experiments
Datasets. We have conducted experiments on datasets from multiple domains. The dimensionality of the data in the different datasets varies from 9 to just below 50,000. Reuters-21578 is the classic benchmark for text classification, comprising 21,578 articles that appeared on the Reuters newswire in 1987. RCV1 is a more recent benchmark of 800,000 manually categorized Reuters newswire articles (38). (Due to limited scalability of some prior algorithms, we use 10,000 random samples from RCV1.) Shuttle is a dataset from NASA that contains 58,000 multivariate measurements produced by sensors in the radiator subsystem of the Space Shuttle; these measurements are known to arise from seven different conditions of the radiators. Mice Protein is a dataset that consists of the expression levels of 77 proteins measured in the cerebral cortex of eight classes of control and trisomic mice (39). The last two datasets were obtained from the University of California, Irvine, machine-learning repository (40).
MNIST is the classic dataset of 70,000 hand-written digits (41). Pendigits is another well-known dataset of hand-written digits (42). The Extended Yale Face Database B (YaleB) contains images of faces of 28 human subjects (43). The YouTube Faces Database (YTF) contains videos of faces of different subjects (44); we use all video frames from the first 40 subjects sorted in chronological order. Columbia University Image Library (COIL-100) is a classic collection of color images of 100 objects, each imaged from 72 viewpoints (45). The datasets are summarized in Table 1.
Measures. Normalized mutual information (NMI) has emerged as the standard measure for evaluating clustering accuracy in the machine-learning community (51). However, NMI is known to be biased in favor of fine-grained partitions. For this reason, we use adjusted mutual information (AMI), which removes this bias (52). This measure is defined as follows:  Here H (·) is the entropy, MI(·, ·) is the mutual information, and c andĉ are the two partitions being compared. For completeness, Table S2 provides an evaluation using the NMI measure.
Results. Results on all datasets are reported in Table 2. In addition to accuracy on each dataset, Table 2 also reports the average rank of each algorithm across datasets. For example, if an algorithm achieves the third-highest accuracy on half of the datasets and the fourth-highest one on the other half, its average rank is 3.5. If an algorithm did not yield a result on a dataset due to its size, that dataset is not taken into account in computing the average rank of the algorithm. RCC or RCC-DR achieves the highest accuracy on seven of the nine datasets. RCC-DR achieves the highest or secondhighest accuracy on eight of the nine datasets and RCC achieves the highest or second-highest accuracy on five datasets. The average rank of RCC-DR and RCC is 1.6 and 2.4, respectively. The best-performing prior algorithm, LDMGI, has an average rank of 4.9, three times higher than the rank of RCC-DR. This indicates that the performance of prior algorithms is not only lower than the performance of RCC and RCC-DR, it is also inconsistent, since no prior algorithm clearly leads the others across datasets. In contrast, the low average rank of RCC and RCC-DR indicates consistently high performance across datasets.
Clustering Gene Expression Data. We conducted an additional comprehensive evaluation on a large-scale benchmark that consists of more than 30 cancer gene expression datasets, collected for the purpose of evaluating clustering algorithms (53). The results are reported in Table S3. RCC-DR achieves the highest accuracy on eight of the datasets. Among the prior algorithms, affinity propagation achieves the highest accuracy on six of the datasets and all others on fewer. Overall, RCC-DR achieves the highest average AMI across the datasets.
Running Time. The execution time of RCC-DR optimization is visualized in Fig. 2. For reference, we also show the corresponding timings for affinity propagation, a well-known modern clustering algorithm (10), and LDMGI, the baseline that demonstrated the best performance across datasets (48). Fig. 2 shows the running time of each algorithm on randomly sampled subsets of the 784-dimensional MNIST dataset. We sample subsets of different sizes to evaluate runtime growth as a function of dataset size. Performance is measured on a workstation with an Intel Core i7-5960x CPU clocked at 3.0 GHz. RCC-DR clusters the whole MNIST dataset within 200 s, whereas affinity propagation takes 37 h and LDMGI takes 17 h for 40,000 points.
Visualization. We now qualitatively analyze the output of RCC by visualization. We use the MNIST dataset for this purpose. On this dataset, RCC identifies 17 clusters. Nine of these are large clusters with more than 6,000 instances each. The remaining 8 are small clusters that encapsulate outlying data points: Seven of these contain between 2 and 11 instances, and one contains 148 instances. Fig. 3A shows 10 randomly sampled data points xi from each of the large clusters discovered by RCC. Their corresponding representatives ui are shown in Fig. 3B. Fig. 3C shows 2 randomly sampled data points from each of the small outlying clusters. Additional visualization of RCC output on the Coil-100 dataset is shown in Fig. S3. Fig. 4 compares the representation U learned by RCC to representations learned by the best-performing prior algorithms, LDMGI and N-Cuts. We use the MNIST dataset for this purpose and visualize the output of the algorithms on a subset of 5,000 randomly sampled instances from this dataset. Both of the prior algorithms construct Euclidean representations of the data, which can be visualized by dimensionality reduction. We use t-SNE (23) to visualize the representations discovered by the algorithms. As shown in Fig. 4, the representation discovered by RCC cleanly separates the different clusters by significant margins. In contrast, the prior algorithms fail to discover the structure of the data and leave some of the clusters intermixed.

Discussion
We have presented a clustering algorithm that optimizes a continuous objective based on robust estimation. The objective is optimized using linear least-squares solvers, which scale to large high-dimensional datasets. The robust terms in the objective enable separation of entangled clusters, yielding high accuracy across datasets and domains.
The continuous form of the clustering objective allows it to be integrated into end-to-end feature learning pipelines. We have demonstrated this by extending the algorithm to perform joint clustering and dimensionality reduction.