## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Semisoft clustering of single-cell data

Edited by Haiyan Huang, University of California, Berkeley, CA, and accepted by Editorial Board Member Charles F. Stevens November 27, 2018 (received for review October 24, 2018)

## Significance

Growth typically involves differentiation of cells from progenitors into more specialized descendants, often involving lineages of pure and transitional cells to achieve final form. Recent technology has enabled estimation of gene expression profiles of single cells and these profiles theoretically differentiate pure cell types. What is missing from the analytical toolbox is an efficient technique to classify pure and transitional cells from their profiles. Here we propose semisoft clustering with pure cells (SOUP). This algorithm performs well in the hard-clustering problem for pure cell types and excels at identifying transitional cells with soft memberships. Moreover, SOUP provides an estimate of the developmental trajectories based on the estimated cell type membership that naturally adapts to cells in transition.

## Abstract

Motivated by the dynamics of development, in which cells of recognizable types, or pure cell types, transition into other types over time, we propose a method of semisoft clustering that can classify both pure and intermediate cell types from data on gene expression from individual cells. Called semisoft clustering with pure cells (SOUP), this algorithm reveals the clustering structure for both pure cells and transitional cells with soft memberships. SOUP involves a two-step process: Identify the set of pure cells and then estimate a membership matrix. To find pure cells, SOUP uses the special block structure in the expression similarity matrix. Once pure cells are identified, they provide the key information from which the membership matrix can be computed. By modeling cells as a continuous mixture of K discrete types we obtain more parsimonious results than obtained with standard clustering algorithms. Moreover, using soft membership estimates of cell type cluster centers leads to better estimates of developmental trajectories. The strong performance of SOUP is documented via simulation studies, which show its robustness to violations of modeling assumptions. The advantages of SOUP are illustrated by analyses of two independent datasets of gene expression from a large number of cells from fetal brain.

Development often involves pluripotent cells transitioning into other cell types, sometimes in a series of stages. For example, early in development of the cerebral cortex (1), one progression begins with neuroepithelial cells differentiating to apical progenitors, which can develop into basal progenitors, which will transition to neurons. Moreover, there are diverse classes of neurons, some arising from distinct types of progenitor cells (2, 3). By the human midfetal period there are myriad cell types and the foundations of typical and atypical neurodevelopment are already established (4). While the challenges for neurobiology in this setting are obvious, some of them could be alleviated by statistical methods that permit cells to be classified into pure or transitional types. We develop such a method here. Similar scenarios arise with the development of bone-marrow–derived immune cells, cancer cells, and disease cells (5); hence we envision broad applicability of the proposed modeling tools.

Different types of cells have different transcriptomes or gene expression profiles (4). Thus, they can be identified by these profiles (6), especially by expression of certain genes that tend to have cell-specific expression (marker genes). Characterization of these profiles has recently been facilitated by single-cell RNA sequencing (scRNA-seq) techniques (7, 8), which seek to quantify expression for all genes in the genome. For single cells, the number of possible sequence reads is limited and therefore the data can be noisy. Nonetheless, cells of the same and different cell types can be successfully clustered using these data (6, 9⇓⇓–12).

What is missing from the clustering toolbox is a method that recognizes development, with both pure type and transitional cells. In this paper, we develop an efficient algorithm for semisoft clustering with pure cells (SOUP). SOUP intelligently recovers the set of pure cells by exploiting the block structures in a cell–cell similarity matrix and also estimates the soft memberships for transitional cells. We also incorporate a gene selection procedure to identify the informative genes for clustering. This selection procedure is shown to retain fine-scaled clustering structures in the data and substantially enhances clustering accuracy. Incorporating soft-clustering results into methods that estimate developmental trajectories yields less biased estimates of developmental courses.

We first document the performance of SOUP via extensive simulations. These show that SOUP performs well in a wide range of contexts; it is superior to natural competitors for soft clustering; and it compares quite well, if not better, than other clustering methods in settings ideal for hard clustering. Next, we apply it to two single-cell datasets from fetal development of the prefrontal cortex of the human brain. In both settings SOUP produces results congruent with known features of fetal development.

## Results

### Model Overview.

Suppose we observe the expression levels of n cells measured on p genes and let *i*) pure cells, each belonging to a single cluster and requiring a hard cluster assignment, and (*ii*) mixed cells (transitional cells) that are transitioning between two or more cell types and hence should obtain soft assignments. With K distinct cell types, to represent the soft membership, let

Let

In practice, many genes will not follow the developmental trajectory described by Eq. **1**; however, it is expected that the expression of many marker genes and other highly informative genes will transition smoothly between cluster centers during development (for example, the genes featured in ref. 13). In particular, one can empirically check the plausibility of Eq. **1** for marker genes; see *Case Studies* below for details. Moreover, because SOUP’s inferences are based on the empirical cell–cell similarity matrix **2**, a weaker assumption than Eq. **1**. Indeed, similar assumptions are implicit in many algorithms that estimate developmental trajectories (14⇓⇓–17). Gene expression is also likely to have nonconstant variance, depending on gene and cell type. However, our pure cell search algorithm does not depend on the diagonal entries of A, and our estimate of Θ is based on spectral decomposition of A, so the method remains robust to moderate fluctuation of diagonal entries of A unless the magnitude of noise is unrealistically large.

As a graphical illustration of the SOUP model, we simulate an example with a developmental trajectory of type1 → type2 → type3. A fraction of the genes were chosen to have differential expression across cell types, and of these a fraction change nonlinearly between cell types (Fig. 1*A*). Regardless of the violations of Eq. **1**, the cells depict a smooth transition between cell types (Fig. 1*B*).

Similar factorization problems to that of Eq. **2** have appeared in previous literature under different settings. The most popular are the mixed-membership stochastic block model (MMSB) (18) and topic modeling (for example, refs. 19⇓–21). However, it is nontrivial to extend these algorithms to our scenario. A similar formulation also appeared in nonnegative matrix factorization (NMF), where nonnegative rank-K matrices Θ and C are estimated such that *i*) The NMF problem is nonidentifiable without introducing nontrivial assumptions, and (*ii*) SOUP does not rely on the nonnegativeness of C, which makes it more broadly applicable to scRNA-seq data after certain preprocessing steps, such as batch-effect corrections, which can result in negative values. Recent work in ref. 23 considered the problem of overlapping variable clustering under latent factor models. Despite the different setup, the model comes down to a problem similar to Eq. **2**, and the authors proposed the latent-model approach to overlapping clustering (LOVE) algorithm to recover the variable allocation matrix, which can be treated as a generalized membership matrix. LOVE consists of two steps: (*i*) finding pure variables and (*ii*) estimating the allocations of the remaining overlapping variables. Both steps rely on a critical tuning parameter that corresponds to the noise level, which can be estimated using a cross-validation procedure. When we applied the LOVE algorithm to our single-cell datasets, however, we found it sensitive to noise, leading to poor performance (*SI Appendix*). Nonetheless, inspired by the LOVE algorithm, SOUP works in a similar two-step manner, while adopting different approaches in both parts. Most importantly, SOUP parameters are intuitive to set, and it is illustrated to have robust performance in both simulations and real data.

#### SOUP algorithm.

The SOUP algorithm involves finding the set of pure cells and then estimating Θ. Pure cells play a critical role in this problem. Intuitively, they provide valuable information from which to recover the cluster centers, which further guides the estimation of Θ for the mixed cells. In fact, it has been shown in ref. 23 that the existence of pure cells is essential for model (2) in ref. 23 to be identifiable, and we restate the theorem below.

### Theorem 1 (Identifiability).

*Model* (2) *is identifiable up to the permutation of labels*, *if* (*a*) Θ *is a membership matrix*; (*b*) *there exist at least two pure cells per cluster*; *and* (*c*) Z *is full rank.*

These assumptions are minimal, because in most single-cell datasets, it is natural to expect the existence of at least a few pure cells in each type, and Z usually has larger entries along the diagonal.

The details of SOUP are presented in *Methods* and *SI Appendix*. As an overview, to recover the pure cells the key is to notice the special block structure formed by the pure cells in the similarity matrix A. SOUP exploits this structure to calculate a purity score for each cell. This calculation requires two tuning parameters: ϵ, the fraction of most similar neighbors to be examined for each cell, and γ, the fraction of cells declared as pure after ranking the purity scores. After selection, the pure cells are partitioned into K clusters, by standard clustering algorithms such as K-means. The choice of K is guided by empirical investigations, including a sample splitting procedure (*SI Appendix*).

To recover Θ, consider the top K eigenvectors of the similarity matrix A, denoted as *Theorem 2*). In practice, we plug in the sample similarity matrix Â to obtain an estimate

### Theorem 2 (SOUP clustering).

*In model* (2), *let* *be the top* K *eigenvectors of* A *and* I *be the set of pure cells. Under the same assumptions as those of Theorem 1*, *the optimization problem**has a unique solution* *such that*

The majority membership probability is

#### Developmental trajectories.

SOUP provides two outcomes not available from hard-clustering procedures such as in refs. 24⇓–26: soft membership probabilities,

### Performance Evaluation.

#### Simulations.

There are no direct competitors of SOUP for semisoft clustering in the single-cell literature, and here we use the following three candidates for comparison: NMF, where we use the standard algorithm from ref. 22 to solve for nonnegative

Although SOUP is derived from a linear model, it is robust and applicable to general scRNA-seq data. To illustrate this, we use the splat algorithm in the Splatter R package (30) to conduct simulations. Splatter is a single-cell simulation framework that generates synthetic scRNA-seq data with hyperparameters estimated from a real dataset. The algorithm incorporates expected violations of the model assumptions (*SI Appendix*). We simulate 500 genes and 300 pure cells from four clusters. Mixed cells are simulated along a developmental path and the number varies from 100 to 500.

For comparable evaluation across different scenarios with different cell numbers, we present the average *A*). In particular, with 100, 300, and 500 mixed cells, the true proportions of pure cells in the data are 75%, 50%, and 37.5%, respectively. Note that we always set

One of the biggest challenges in single-cell data is the existence of dropouts (31), where the mRNA for a gene fails to be amplified before sequencing, producing a “false” zero in the observed data. We see that SOUP remains robust and outperforms all other algorithms (Fig. 2*B*).

#### SOUP as hard clustering.

Although SOUP aims at recovering the full membership matrix Θ, it can also be used as a hard-clustering method by labeling each cell as the majority type. We benchmark SOUP as a hard-clustering method on seven labeled public single-cell datasets (refs. 6 and 10; details in *SI Appendix*, Table S6). We compare SOUP to three popular single-cell clustering algorithms: (*i*) SC3, or single-cell consensus clustering (24); (*ii*) CIDR, or clustering through imputation and dimensionality reduction (25); and (*iii*) Seurat, named for Georges Seurat (26). Because we aim at hard clustering, here we set *SI Appendix*, Table S6).

### Case Studies.

#### Fetal brain cells I.

We apply SOUP to a fetal brain scRNA-seq dataset, with 220 developing fetal brain cells between 12 and 13 gestational weeks (GW) (9). Guided with marker genes, these single cells are labeled with seven types in the original paper: two subtypes of apical progenitors (AP1, AP2), two subtypes of basal progenitors (BP1, BP2), and three subtypes of neurons (N1, N2, N3). We refer to these as Camp labels after the lead author of ref. 9. At this age many cells are still transitioning between different types, providing valuable information regarding brain development. Therefore, instead of the traditional hard-clustering methods, SOUP can be used to recover the fine-scaled soft-clustering structure.

We run SOUP with *SI Appendix*, Fig. S6*A*). For these data, when cells are in various developmental stages, hard clustering appears to overfit the data.

Next, we examine the soft assignments. For each cluster k, we label it by an anchor gene, which is the marker gene defined in ref. 9 that has the largest anchor score, **1**. In the top three PCs space, the cells show a smooth developmental trajectory between clusters (Fig. 5*A*), which is also consistent with Eqs. **1** and **2**.

To model the developmental trajectories we plot the cluster centers determined directly by SOUP (softSOUP) and by hard clustering (hardSOUP). Fitting a MST to the cluster centers, softSOUP identifies two lineages, AP-BP-N and AP-N (Fig. 5*A*), both of which were previously described in ref. 9, while hardSOUP identifies less intuitive BP-AP-N and AP-N lineages (Fig. 5*B*). Using Slingshot to fit smooth branching curves to these lineages via simultaneous principal curves, hardSOUP recovers AP-N and BP-N transitions, and the artificial BP1–AP2 transition in the initial MST fit is dropped (Fig. 5*D*). However, the AP–BP transition is still missing. softSOUP MST successfully reveals AP-N and AP-BP-N transitions (Fig. 5 *A* and *C*), thus capturing the true transition of cell types leading to neurons by accounting for the soft membership structures.

#### Fetal brain cells II.

We next applied SOUP to a richer dataset with 2,309 single cells from human embryonic prefrontal cortex (PFC) from 8 GW to 26 GW (33). Using the Seurat package (26) the authors identified six major clusters: neural progenitor cells (NPC), excitatory neurons (EN), interneurons (IN), astrocytes (AST), oligodendrocyte progenitor cells (OPC) and microglia (MIC), which are referred to as Zhong labels after the lead author of ref. 33. Our objective is to evaluate the developmental trajectories of the major cell types, after excluding IN and MIC, which are known to originate elsewhere and migrate to the PFC (33). After several iterations of hard clustering by SOUP to remove IN and MIC cells (*SI Appendix*, Tables S1–S3) 1,503 cells remain, and they cluster into *A*); however, many cells have low majority membership probabilities (*SI Appendix*, Fig. S8) and do not strongly favor a particular cluster (*SI Appendix*, Table S4). To illustrate this feature we display cells assigned to clusters 3 (NPC) and 7 (EN), color coded by the majority membership probability (Fig. 6*B*). The two clusters divide the PC space evenly, with the pure cells identifying the cluster centers, while many nonpure cells can be best described as transitioning between clusters. SOUP captures the transitional nature by soft clustering.

The SOUP trajectories reveal two developmental paths (Fig. 7): a neuronal lineage showing NPCs evolving to ENs (clusters: *SI Appendix*, Table S4). Our results are similar to those in ref. 33; however, we found that NPCs evolve to OPCs and then to ASTs (clusters:

Additional strengths of SOUP are highlighted by analyses described in *SI Appendix*, which investigate gene expression as a function of cell membership to cluster and the proximity of cells to the neuronal trajectory (Fig. 7 and *SI Appendix*, Fig. S9). In particular, we evaluate the final clusters of the neuronal lineage, clusters 5 and 6. In terms of gene expression, cells in cluster 6 shows all of the hallmarks of neuronal development, including low expression of neuronal markers in immature neurons and much higher expression in maturing neurons. There is also some evidence of heterogeneity of expression of genes marking neurons in some cells, consistent with differentiation into different neuronal subtypes. For cells from cluster 5, the evidence is far less clear: The majority of cells manifest neuronal markers at high levels, consistent with maturing neurons; yet, there is also expression of a substantial set of NPC markers in these neurons, a puzzling feature that could be either a technical artifact or an unanticipated developmental feature of deep-layer projection neurons.

## Discussion

We develop SOUP, a semisoft clustering algorithm for single-cell data. SOUP fills the gap of modeling uncertain cell labels, including cells that are transitioning between cell types, which is ubiquitous in single-cell datasets. SOUP outperforms generic soft-clustering algorithms and, if treated as hard clustering, it also achieves comparable performance to that of state-of-the-art single-cell clustering methods. By using soft-clustering input, it can provide an estimate of developmental trajectories that is less biased and these results reflect valuable information regarding developmental patterns. We present the results from two case studies based on expression of human fetal brain cells and find SOUP reveals patterns of development not apparent in prior published analyses.

As is typical for clustering algorithms, selecting the optimal number of clusters, K, is challenging. We recommend balancing input from several empirical approaches and iterating over a range of K to determine a good choice. Notably, applying SOUP to different numbers of clusters reveals hierarchical structure among the cell types. To determine fine-scale structure within major cell types, SOUP can be applied iteratively to subsets of cells.

Using SOUP to obtain soft membership probabilities and then estimate developmental trajectories provides two complementary views of the data. Some cells can be reliably assigned to a cluster and these cells constitute pure types, which can be highly informative. Other cells are transitioning and estimated membership will fall within two, or even more, cell types. Examining the membership probabilities, and the placement on a developmental trajectory, provides critical information about the developmental processes and offers a parsimonious and scientifically meaningful alternative to estimating a large number of discrete cell types.

Notably, although SOUP is derived under a generic additive noise model and does not explicitly model the technical noise such as dropouts, we find it to be robust when applied to realistic simulations and to a variety of single-cell datasets. Moreover, it is computationally efficient. SOUP takes less than 15 min for 3,600 cells and 20,000 genes, benchmarked on a Linux computer equipped with an AMD Opteron Processor 6320 at 2.8 GHz. Therefore, SOUP is a versatile tool for single-cell analyses.

## Methods

### SOUP.

Our SOUP algorithm contains two steps: (*i*) Find the set of pure cells and (*ii*) estimate Θ. Pure cells play a critical role in this problem. Intuitively, they provide valuable information from which to recover the cluster centers, which further guides the estimation of Θ for the mixed cells. Once the pure cells are identified, then the algorithm proceeds as described in *Results*.

### Find Pure Cells.

Denote the set of pure cells in cluster k as**2**, the pure cells form K blocks in A, where the entries in these blocks are also the maxima in their rows and columns, ignoring the diagonal. Specifically, define *SI Appendix*, Theorem S1).

In practice, we plug in the sample similarity matrix *SI Appendix*.

### Tuning Parameters.

The two tuning parameters of SOUP are the quantiles, ϵ and γ, both intuitive to set. The quantile γ should be an estimate of the proportion of pure cells in the data, of which we usually have prior knowledge. In practice, we find that SOUP remains stable even when γ is far from the true pure proportion, and it is helpful to use a generous choice. Throughout this paper, we always set

### Gene Selection.

It is usually expected that not all genes are informative for clustering. For example, housekeeping genes are unlikely to differ across cell types and hence provide limited information for clustering other than introducing extra noise. Therefore, it is desirable to select a set of informative genes before applying SOUP clustering. Here, we combine two approaches for gene selection: (*i*) the DESCEND algorithm proposed in ref. 35 based on the Gini index and (*ii*) the Sparse PCA (SPCA) algorithm (36) (*SI Appendix*).

The R package of SOUP is available at https://github.com/lingxuez/SOUPR.

## Acknowledgments

This work was supported by National Institute of Mental Health Grants R37MH057881 and R01MH109900 and the Simons Foundation Grants SFARI 402281 and 367561.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: roeder{at}andrew.cmu.edu.

Author contributions: L.Z., J.L., B.D., and K.R. designed research; L.Z., J.L., B.D., and K.R. performed research; L.Z., J.L., and K.R. contributed new reagents/analytic tools; L.Z., L.K., B.D., and K.R. analyzed data; and L.Z., J.L., L.K., B.D., and K.R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. H.H. is a guest editor invited by the Editorial Board.

Data deposition: The R package of SOUP is available on GitHub (https://github.com/lingxuez/SOUPR).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1817715116/-/DCSupplemental.

- Copyright © 2019 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- ↵
- ↵
- ↵
- Silbereis JC,
- Pochareddy S,
- Zhu Y,
- Li M,
- Sestan N

- ↵
- ↵
- Darmanis S, et al.

- ↵
- ↵
- ↵
- Camp JG, et al.

- ↵
- Baron M, et al.

- ↵
- Zeisel A, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Mao X,
- Sarkar P,
- Chakrabarti D

*Proceedings of the 34th International Conference on Machine Learning*. Available at proceedings.mlr.press/v70/mao17a.html. Accessed December 18, 2018. - ↵
- Arora S,
- Ge R,
- Moitra A

*2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS)*. Available at https://ieeexplore.ieee.org/document/6375276. Accessed December 18, 2018. - ↵
- Arora S, et al.

- ↵
- Lee DD,
- Sugiyama M,
- Luxburg UV,
- Guyon I,
- Garnett R

- Huang K,
- Fu X,
- Sidiropoulos ND

- ↵
- Leen TK,
- Dietterich TG,
- Tresp V

- Lee DD,
- Seung HS

- ↵
- Bing X,
- Bunea F,
- Ning Y,
- Wegkamp M

- ↵
- ↵
- Lin P,
- Troup M,
- Ho JW

- ↵
- ↵
- ↵
- Bezdek JC

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Zhu X,
- Bergles DE,
- Nishiyama A

- ↵
- Wang J, et al.

- ↵

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Statistics