# Gene expression distribution deconvolution in single-cell RNA sequencing

^{a}Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104;^{b}Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104;^{c}Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104;^{d}Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA 19104

See allHide authors and affiliations

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved May 29, 2018 (received for review December 6, 2017)

## Significance

We developed deconvolution of single-cell expression distribution (DESCEND), a method to recover cross-cell distribution of the true gene expression level from observed counts in single-cell RNA sequencing, allowing adjustment of known confounding cell-level factors. With the recovered distribution, DESCEND provides reliable estimates of distribution-based measurements, such as the dispersion of true gene expression and the probability that true gene expression is positive. This is important, as with better estimates of these measurements, DESCEND clarifies and improves many downstream analyses including finding differentially expressed genes, identifying cell types, and selecting differentiation markers. Another contribution is that we verified using nine public datasets a simple “Poisson-alpha” noise model for the technical noise of unique molecular identifier-based single-cell RNA-sequencing data, clarifying the current intense debate on this issue.

## Abstract

Single-cell RNA sequencing (scRNA-seq) enables the quantification of each gene’s expression distribution across cells, thus allowing the assessment of the dispersion, nonzero fraction, and other aspects of its distribution beyond the mean. These statistical characterizations of the gene expression distribution are critical for understanding expression variation and for selecting marker genes for population heterogeneity. However, scRNA-seq data are noisy, with each cell typically sequenced at low coverage, thus making it difficult to infer properties of the gene expression distribution from raw counts. Based on a reexamination of nine public datasets, we propose a simple technical noise model for scRNA-seq data with unique molecular identifiers (UMI). We develop deconvolution of single-cell expression distribution (DESCEND), a method that deconvolves the true cross-cell gene expression distribution from observed scRNA-seq counts, leading to improved estimates of properties of the distribution such as dispersion and nonzero fraction. DESCEND can adjust for cell-level covariates such as cell size, cell cycle, and batch effects. DESCEND’s noise model and estimation accuracy are further evaluated through comparisons to RNA FISH data, through data splitting and simulations and through its effectiveness in removing known batch effects. We demonstrate how DESCEND can clarify and improve downstream analyses such as finding differentially expressed genes, identifying cell types, and selecting differentiation markers.

- single-cell transcriptomics
- RNA sequencing
- differential expression
- Gini coefficient
- highly variable genes

Cells are the basic biological units of multicellular organisms. Within a cell population, individual cells vary in their gene expression levels, reflecting the dynamism of transcription across cells (1⇓⇓⇓–5). Traditional microarray and bulk RNA-sequencing (RNA-seq) technologies profile the average gene expression level of all cells in the population. In contrast, recent single-cell RNA-seq (scRNA-seq) methods enable the quantification of a much richer set of properties of the gene expression distribution across cells. For example, measures of dispersion such as the coefficient of variation (CV) and the Gini coefficient can be used to elucidate biological states that are not reflected in the population average (6⇓–8). Two-state mixture models, alternatively, allow a better understanding of transcriptional regulation at the single-cell level (5, 9⇓–11).

However, it is challenging to compute such distribution-based statistics of true gene expression due to the technical noise in scRNA-seq data (12⇓⇓⇓–16). Single-cell RNA-seq protocols are complex, involving multiple steps each contributing to the substantially increased noise level of scRNA-seq relative to bulk RNA-seq. Unique molecular identifiers (UMI) (17) were introduced as a barcoding technique to reduce amplification noise, but the observed expression distribution computed from observed UMI counts is, for most genes, still a poor representation of their true expression distribution.

Recently, many computational methods for scRNA-seq analysis have been proposed, including methods for quantifying dispersion, for characterizing expression “states” using mixture models, and for finding differentially expressed genes (6, 18⇓⇓⇓⇓⇓⇓⇓–26). Although some of these methods have taken technical noise into consideration, to our knowledge, there is currently no method for recovering the cross-cell gene expression distribution from scRNA-seq data, unless strong assumptions are made about this distribution. There is also a lack of methods for comparing aspects of this true biological distribution beyond the mean, especially when there is a need to adjust for confounding factors. In fact, there is still active debate on the technical noise distribution for scRNA-seq data, and a proper technical noise model is critical for studying the underlying distribution.

Here we develop deconvolution of single-cell expression distribution (DESCEND), a statistical method that deconvolves the true cross-cell gene expression distribution from observed scRNA-seq counts and quantifies how this distribution depends on cell-level covariates. Examples of common cell-level covariates are cell size, defined as the total number of RNA molecules in a cell, and cell-cycle stage. DESCEND adopts the “G-modeling” distribution deconvolution framework (27), which avoids constraining parametric assumptions. The accuracy of DESCEND is evaluated using population-matched RNA fluorescence in situ hybridization (FISH) data, in silico sample splitting, and parametric and down-sampling simulations. Our evaluations show that, under very reasonable data quality assumptions, DESCEND can accurately deconvolve the true gene expression distribution, leading to improved estimates of expression dispersion and proportion of cells with zero/nonzero expression. We benchmark against existing methods (6, 25, 28) for analyses in which comparable methods exist, but focus on demonstrating the unique applications of DESCEND in the case studies that follow.

Although the DESCEND framework can be used with any technical noise model, the datasets we use in this paper all use UMIs, for which we have decided upon a satisfactory noise model. There is a lot of debate regarding what noise model to use, even for UMI-based scRNA-seq data. Through a reanalysis of nine public datasets, we show that a Poisson distribution is sufficient to capture the technical noise in single-cell UMI counts, once the cross-cell differences in capture and sequencing efficiency have been accounted for. Thus, DESCEND adopts a “Poisson-alpha” noise model for single-cell UMI counts, which allows fast computation and stable estimation. We show using the data from Tung et al. (29) that, with this noise model, DESCEND can effectively remove artificial differences between known experimental batches.

We demonstrate the applications of DESCEND through three case studies. The first case study illustrates how to conduct differential expression analysis under a two-state model for gene expression. Here, we also conduct a transcriptome-wide examination of how gene expression distributions are associated with cell size, again using population-matched RNA FISH to validate our findings. Next, we illustrate how to use the DESCEND estimated dispersion measures, and their standard errors, for downstream inference. The second case study focuses on cell type identification, and the third case study focuses on the selection of marker genes for cell differentiation.

## Results

### Model Overview.

Fig. 1 gives an overview of the DESCEND framework. The observed counts in an scRNA-seq experiment are a noisy reflection of true expression levels. We model the observed count

Currently, the noise model used by DESCEND is designed for single-cell experiments that use UMIs. For extension to non-UMI read counts, see *Discussion*. For UMI-based single-cell RNA-seq data, DESCEND uses the default noise model

In the default noise model, DESCEND sets *Materials and Methods* and *SI Appendix*, *Mathematical Details of DESCEND*.

The true gene expression distribution *Materials and Methods*).

One meaningful aspect of **3** and **4** refer to the underlying, unobserved, true gene expression distribution. The concepts of nonzero fraction and nonzero mean have appeared, under varying definitions and differing names, in single-cell studies (5, 25, 30), yet many existing approaches to estimate them (5, 18, 30⇓–32) do not account for technical noise. If the population from which the cells are sampled can be assumed to be ergodic, then a two-state transcriptional bursting model (9, 11, 33), formulated as a periodic stochastic dynamic process, leads to a Poisson-Beta distribution for gene expression across cells. In that scenario, Eqs. **3** and **4** can be derived from the burst frequency and burst size parameters defined in the Poisson-Beta distribution. However, the strong ergodicity assumption is, in most cases, too ideal for scRNA-seq experiments, in which cell populations are unavoidably heterogeneous even when limited to a specific cell type. In DESCEND, we choose not to assume the Poisson-Beta distribution and instead focus on the more transparent quantities [**3**, **4**].

DESCEND allows the modeling of covariate effects on both the nonzero fraction and nonzero mean. When covariates are specified, DESCEND uses a log-linear model for the covariate effect on nonzero mean and a logit model for the covariate effect on nonzero fraction. In this case, *Materials and Methods*).

A cell-level covariate that we study in detail is the cell size, defined as the total RNA molecule count in the cell. Cell size can be estimated by spike-in molecules added to the cell after RNA extraction: Let **2**; then**1**.

DESCEND also computes standard errors and performs hypothesis tests on features of the underlying biological distribution, such as dispersion, nonzero fraction, and nonzero mean. See *Materials and Methods* for details.

## Model Assessment and Validation

### Technical noise model for UMI-based scRNA-seq experiments.

For UMI-based scRNA-seq data, Kim et al. (14) gave an analytic argument for a Poisson error model, which we discuss and clarify in *SI Appendix*, *Mathematical Details of DESCEND*. Several studies 18, 34, (35) used spike-in sets and bulk RNA splitting experiments to explore the technical noise in scRNA-seq data, finding that a Poisson distribution for UMI-based counts is plausible, but raised the issue of overdispersion. While the analyses from these studies were insightful, we believe that they failed to account for the inevitable random variations across cells/samples in the spike-in input at low concentrations. We reexamined the spike-in data from nine UMI-based scRNA-seq datasets, covering seven scRNA-seq protocols (6, 7, 29, 36⇓⇓⇓–40). All of the data except for those in ref. 29 have also been analyzed in ref. 40, which showed that capture efficiency varies substantially across cells within each experiment and across experimental protocols. We show that, after accounting for the cell-to-cell variation in efficiency, the technical noise of UMI counts is simply Poisson in most cases.

For External RNA controls Consortium (ERCC) spike-in “genes,” the observed count for each gene in each cell depends on the number of input molecules and the technical noise. Due to the low spike-in concentration added to each cell, the number of input molecules for each spike-in is not fixed, but random with a certain target expectation. If we assume that the molecules in the spike-in dilution are randomly dispersed, then the number that results in each cell partition is Poisson with mean computable from the dilution ratios (see *SI Appendix*, *SI Text*, for more details). If the molecules in the spike-in dilution are not randomly dispersed, e.g., due to clumping, or if there are uncontrolled batch issues, then the input number of spike-in molecules for each cell would be overdispersed compared with the Poisson.

The key point here is that the input quantity of spike-in molecules is not fixed across cells, as assumed by current studies, but random with Poisson noise in the ideal case of perfect random dispersion with no batch variation. Such randomness in the input should not be counted as part of the technical noise of scRNA-seq experiments, as that is due to the spike-in process. Previous analyses of spike-in data have attributed this input-level variation to technical noise, thus inflating their estimates of technical noise dispersion.

To assess whether the

Fig. 2*A* shows that the DESCEND-recovered distribution in all but one (37) of the nine UMI datasets has overdispersion *Materials and Methods* for discussion).

### Evaluation of deconvolution accuracy using RNA FISH as gold standard.

Next, we evaluate the accuracy of DESCEND on the data from ref. 13, where Drop-seq and RNA FISH are both applied to the same melanoma cell line. A total of 5,763 cells and 12,241 genes were kept for analysis from the Drop-seq experiment, with median 1,473 UMIs per cell. Of these genes, 24 were profiled using RNA FISH (*VCF* and *FOSL1* were removed from the original data; *SI Appendix*, *SI Text*). We further excluded genes with zero UMI count in more than *GAPDH* (41).

Both CV and Gini coefficients recovered using DESCEND match well with corresponding values from RNA FISH (Fig. 2*B*) for all 11 genes (*GAPDH* excluded). In comparison, Gini and CV computed on the original Drop-seq counts, standardized by library size *SI Appendix*, *SI Text*) shows bias toward 0 compared with values calculated from RNA FISH. This analysis also shows that the 1-SD error bars given by DESCEND reasonably quantify the uncertainty of its estimates.

DESCEND provides reasonably accurate estimates of the nonzero fraction, despite the low sequencing depth of this dataset (Fig. 2*B*). In contrast, the naive estimate, derived from the proportion of nonzero raw counts for each gene, is grossly inflated due to the low sequencing depth and is not a reliable estimator of nonzero fraction. DESCEND outperforms a recent method QVARKS, which estimates the nonzero fraction (called “ON fraction” in ref. 25) using a Bayesian approach.

Finally, the shape of the relative gene expression distribution (Fig. 2*C*) given by DESCEND matches that from FISH. In comparison, the distribution of the library-size standardized observed counts is quite different from that of their FISH counterparts, showing severe zero inflation and increased skewness.

### Assessment of estimation accuracy and test validity by simulations.

We further evaluate the accuracy of DESCEND by in silico sample splitting and by parametric and down-sampling simulations. For all in silico evaluations, we start with the observed counts of the 820 oligodendrocyte cells from ref. 7, for which ERCC spike-ins are available to estimate cell-specific efficiencies. Details of each simulation are given in *SI Appendix*, *SI Text*.

First, in the sample-splitting experiment, the 820 cells are randomly split into two equal-sized groups. Since the data are split randomly, there should not be any real differences between groups. The agreement in DESCEND estimates of nonzero mean, nonzero fraction, and Gini coefficients between the two groups (Fig. 2*D*) indicates that the procedure has low variance and is robust to the randomness of observed counts.

The above experiment gives a model-free assessment of DESCEND estimation variance. To assess the estimation bias, we then performed a parametric simulation experiment where true gene expression counts were simulated from a lognormal distribution with cell size as covariate and where noise was simulated from our proposed noise model. True values of all involved parameters were set to be estimates from real data. We see that the estimation of covariate effects on the nonzero fraction/mean (Fig. 2*D*), for which there is no RNA FISH validation, is reliable. Nonzero fraction, CV, and the Gini also get accurate and unbiased estimates (*SI Appendix*, Fig. S1*A*). In addition, with the Benjamini–Hochberg procedure, DESCEND effectively controls the false discovery rate (FDR) in the test of whether the nonzero fraction is 1 (Fig. 2*D*).

Finally, we perform down-sampling simulations to assess how DESCEND performs under varying sequencing depth. The top 150 genes with highest total UMI count are selected and their UMI counts are treated as true expression levels. These values are then down-sampled at *D* and *SI Appendix*, Fig. S1*A*).

There is, of course, an endless list of parameters for which evaluations can be performed. We have merely summarized here evaluations that are relevant for the case studies discussed later in this paper.

### Batch effects can be removed in differential analysis by adding batch as covariate.

Tung et al. (29) performed scRNA-seq on three human iPSC cell lines, with three technical replicates per cell line, and showed that there can be substantial variation between technical replicates. Ref. 29 further showed that simple ERCC spike-in adjustment and library size normalization cannot effectively remove this technical “batch effect.” We apply DESCEND to these data to see whether using batch as a covariate can effectively remove technical differences between replicates.

Starting from the data of ref. 29, we created two groups of cells, each containing 150 cells obtained by pooling 50 cells randomly selected from each of the three individuals. Thus, the two groups of cells should have no biological differences. However, the replicates (batches) are manually chosen to preserve the technical batch effect between the two groups: The first group contains cells sampled from one replicate for each subject: NA19098 replicate 1, NA19101 replicate 2, and NA19239 replicate 1; the second group contains cells sampled from another replicate from each subject: NA19098 replicate 3, NA19101 replicate 1, and NA19239 replicate 2. With the two groups of cells constructed in this way, any detection made during differential testing must be a false positive due to the technical differences between replicates (batch effects).

DESCEND was applied to these data to test for differences in Gini coefficient and CV between the two groups (Fig. 2*E* and *SI Appendix*, Fig. S1*B*). Without consideration of batch, DESCEND indeed finds many (false positive) differences in both Gini and CV. However, with batch added as covariate in the DESCEND model, the dispersion estimates from the two groups are comparable, and no significant detections are made. The fact that spike-in–based normalization cannot effectively remove this batch effect, which is effectively removed by the DESCEND model, indicates that batch effects are gene specific. We also conducted differential dispersion analysis between two biologically different samples (the three replicates from NA19101 vs. the three replicates from NA19239), with batch as covariate, and found some significant changes in Gini. The fact that significant differences are found between biologically different samples, but not between biologically identical samples, suggests that DESCEND still has the power to detect biological signals while removing batch effect.

## Case Studies

### Cell-size effect on expression distribution and differential testing of nonzero fraction and mean.

At the single-cell level, many genes are bursty, being inactive in some cells and active in other cells. In Eqs. **3** and **4**, we defined the nonzero fraction and nonzero mean to characterize this heterogeneity in the true underlying gene expression. Using DESCEND, we analyze the scRNA-seq data of mouse hippocampal region from Zeisel et al. (7), where the 3,005 cells were classified into seven major cell types. Our goal is to characterize the dependence of nonzero fraction and nonzero mean on cell size and to find genes that are differentially expressed in these parameters between different cell types, controlling for cell size. Recall that cell size is estimated as Eq. **5**.

First, consider the transcriptome-wide patterns of the DESCEND deconvolved nonzero fraction and nonzero mean, without adjusting for cell size; these were estimated for each gene in each cell type separately, with no added covariates (details in *SI Appendix*, *SI Text*). As shown in Fig. 3*A*, the deconvolved gene expression distributions for most genes have much larger nonzero fraction in the neuron cell types (CA1 pyramidal, S1 pyramidal, and interneurons) compared with the nonneuron cell types (astrocytes–ependymal, endothelial–mural, microglia, and oligodentrocytes), thus suggesting that more genes are in the “on” state in neurons. However, neurons are known to be larger cells, and, indeed, cell-size estimates are substantially larger for neurons compared with nonneurons in this dataset (*SI Appendix*, Fig. S2*A*). Can the global increase in nonzero fraction in neurons simply be attributed to increased cell size? To answer this question we need to first quantify how cell size affects the gene expression distribution.

We applied DESCEND, with cell size as a covariate, to obtain the deconvolved cell-size–adjusted gene expression distribution for each gene in each of the seven cell types. The coefficients estimated by DESCEND allow us to assess, for each gene, whether its nonzero mean has superlinear, linear, or sublinear growth with cell size and whether its nonzero fraction increases, remains constant, or decreases with cell size. See statistical details in *Materials and Methods*. Taking the endothelial–mural cells as an example, we find that for most genes, nonzero fraction increases with cell size as the coefficients are positive (Fig. 3*C*). The mean trend across genes is that a doubling of cell size is associated with at least a doubling of the odds of observing at least one transcript. We also find that, globally, nonzero mean has a slightly sublinear dependence on cell size as many coefficients are below 1 (Fig. 3*D*). The sublinear dependence of nonzero mean on cell size is consistent with previous findings in ref. 41, which used RNA FISH to study a small set of genes and found their expression to have increased concentration in smaller cells, although the quantity measured in ref. 41 directly reflects transcription burst size. These relationships between expression distribution and cell size are consistent across all seven cell types in this study (*SI Appendix*, Fig. S2 *E* and *F*).

It is important to emphasize here that both cell size and true expression distribution are not directly observable quantities and that this relationship between cell size and nonzero mean/fraction is not a direct consequence of larger observed library size leading to larger change of having a nonzero count for a gene. To demonstrate further that the discovered relationships are biological, not technical, we conducted a parallel analysis of the RNA FISH data of ref. 13 (*SI Appendix*, *SI Text*). For the 23 genes (excluding GADPH) available in the RNA FISH data, we observe the same trends as above: The nonzero fraction increases with cell size, with a mean odds ratio of at least 2, and the nonzero mean increases sublinearly with cell size (Fig. 3 *C* and *D* and *SI Appendix*, Fig. S2*C*). The fact that this trend is observed using two different technologies and for eight different cell types (seven by scRNA-seq, melanoma cell line by RNA FISH) suggests that it may be a general relationship between cell size and single-cell gene expression distributions.

Fig. 3*B* shows the nonzero fractions across genes within each cell type, estimated by applying DESCEND with cell size as a covariate. After adjusting for differences in cell size, the transcriptome-wide patterns in nonzero fraction/mean are much more similar across cell types. This suggests that the increased nonzero fraction in neuron cells can mostly be attributed to cell-size differences. For example, compare two cell types: endothelial–mural and pyramidal CA1 cells. Before cell-size adjustment, 879 genes show significant decrease of nonzero fraction in pyramidal CA1 at FDR of *E* and *SI Appendix*, Fig. S2*B*); the results change substantially after cell-size adjustment, with only 84 significant genes, 78 of which were in the original set of 879 genes. This highlights the importance of cell-size adjustment in differential testing.

We also test for the change on nonzero mean between endothelial–mural and pyramidal CA1 cells. Across genes, the estimated change in nonzero fraction does not seem correlated with the estimated change in nonzero mean (Fig. 3*H*), indicating that, after accounting for cell size, the two types of change are unrelated. Differential testing results on nonzero mean and nonzero fraction, taken together, give a more detailed characterization of differential expression (*SI Appendix*, Fig. S3).

### DESCEND improves the accuracy of cell type identification by a better selection of highly variable genes.

One major step in cell type identification is the selection of highly variable genes (HVG) before applying any dimension reduction and clustering algorithms (6, 34, 37, 42). Filtering out genes with low variation reduces noise while keeping the genes that are more likely to be cell subpopulation markers. Current pipelines select HVGs mainly by computing dispersion measurements, such as CV and Fano factor, directly on the raw or library-normalized counts or by variance decomposition. However, as shown in Fig. 2*B*, these methods are not as accurate as DESCEND in estimating the true biological dispersion of the gene expression levels. Furthermore, compared with CV and Fano factor, the Gini coefficient is a more robust indicator of dispersion (see *Materials and Methods* for derivation), and we have shown that DESCEND allows accurate estimate of this indicator. Here, we examine whether DESCEND-selected HVGs improve the accuracy of cell type identification when used with existing clustering algorithms.

We consider cell type identification in two datasets where somewhat reliable cell type labels are available. One consists of the 3,005 cells in ref. 7, which were classified into seven major cell types with the help of domain knowledge. The other is obtained from 10 purified cell populations derived from peripheral blood mononuclear cells (PBMC) in ref. 39, where 1,000 cells were sampled randomly from each cell type and combined to form a 10,000-cell dataset. Since the PBMC data consist mostly of immune cell subtypes (CD4^{+} helper T cells, CD4^{+}/CD25^{+} regulatory T cells, CD4^{+}/CD45RA^{+}/CD24^{−} naive T cells, CD4^{+}/CD45RO^{+} memory T cells, CD8^{+} cytotoxic T cells, and CD8^{+}/CD45RA^{+} naive cytotoxic T cells) which are well known to have similar transcriptomes, it is a more challenging test case. For both datasets, we treat the cell type labels given in their original papers as the gold standard.

There are many existing cell clustering methods for scRNA-seq. To focus on evaluating the effectiveness of the initial HVG selection step, we limit to Seurat, one of the most widely used algorithms, and compare clustering results of Seurat (Version 2.1) with a modified version of Seurat where the initial HVG selection step is replaced by DESCEND. Seurat selects HVGs by ranking the Fano factors of the normalized counts, yielding a list with only *A*) for both datasets. To compare cell clustering accuracy, we use the adjusted Rand index, which ranges from 0, for a level of similarity expected by chance, to 1 for identical clusters (43). The number of clusters is set to the truth for both datasets and both pipelines. Fig. 4*B* shows that with DESCEND, Seurat achieves consistently better results than its original version. Seurat, like most other clustering algorithms, first performs dimension reduction using principal components analysis (PCA), and the number of principal components (PCs) affects downstream clustering. As seen in Fig. 4*B*, the accuracy boost obtained from DESCEND-based HVG selection is consistent across the number of PCs used for dimension reduction.

### Dispersion analysis using the DESCEND-estimated Gini coefficient highlights early-stage differentiation markers for mES cells.

We apply DESCEND to data from Klein et al. (6), where single cells were sampled from a differentiating mESC population before and at 2 d, 4 d, and 7 d after Leukemia inhibitory factor (LIF) withdrawal. In these data, while pluripotency markers and differentiation-related genes have changes in mean expression over time, due to complete transcriptome remodeling, almost all genes have significant change in mean expression even by day 2 (*SI Appendix*, Fig. S4*C*). Thus, change in mean is not an effective measure for identifying differentiation markers. Instead, we used DESCEND to test for change in expression dispersion across the early time points, under the rationale that early differentiation markers would exhibit high heterogeneity at this initial stage.

First, consider expression dispersion as quantified by the Gini coefficients computed from the observed counts distribution, before deconvolution by DESCEND: For most genes, they are much higher at days 2 and 4, compared with days 0 and 7 (Fig. 5*A*). Although this pattern is consistent with our expectation that cells should exhibit higher heterogeneity during differentiation, compared with directly before or directly afterward, it is confounded by the substantially lower sequencing depth for the day 2 and 4 samples (*SI Appendix*, Fig. S4*A*). Are the higher Gini coefficients in days 2 and 4 due to the technical reason of lower sequencing depth or the biological reason of increased population heterogeneity? We applied DESCEND to compute Gini coefficients of the true expression distribution, thus removing the technical confounder of sequencing depth. DESCEND-estimated Gini coefficients confirm that Gini coefficients are indeed higher at days 2 and 4, compared with days 0 and 7 (Fig. 5*B*), as expected for this evolving population.

The differentiation of mES cells upon LIF withdrawal is a poorly characterized process. Ref. 6 observed that, whereas at day 7, almost *Krt8*, *Krt18*, *Tagln*, *Cald1*, *Tpm1*, and *Fxyd6*, does not show significant increase until day 4 (Fig. 5*C*), when almost half of the cells show complete transcriptome reprogramming (*SI Appendix*, Fig. S5*A*). In comparison, these known marker genes belong to a much smaller set of genes that show a dramatic increase in Gini coefficient between day 0 and day 2 (Fig. 5*C* and *SI Appendix*, Fig. S4*C*).

DESCEND allows the computation of SEs, and from these SEs we assessed the significance of change in Gini coefficient between day 0 and day 2. At an FDR threshold of *SI Appendix*, Fig. S4*C*). Of the 56 genes with significant change in Gini coefficient but not in mean computed by DESCEND, many are meaningful differentiation markers. For instance, the genes *Tagln*, *Anxa2*, *H19*, *Sparc*, and *Ccno* are listed in ref. 6 as marker genes for the cell types that appear during differentiation (*SI Appendix*, Fig. S4*D*). *Jun*, *Anxa3*, *Klf6*, *Fos*, and *Dusp4* can also be marker genes for the epiblast cells as these genes show significant increase in mean expression at the later stages (day 4 or day 7) (*SI Appendix*, Fig. S4*E*).

DESCEND also allows a more detailed characterization of the activity of pluripotency factors during differentiation (44, 45). As discussed in ref. 6, the expression levels of some pluripotency genes drop gradually (*Pou5f1*, *Dppa5a*, *Sox2*) while those of others drop rapidly (*Nanog*, *Zfp42*, *Klf4*) during differentiation. This is revealed by the trend of the DESCEND-estimated Gini coefficient (Fig. 5*E*) across time. The rapid drop-off genes react early during differentiation, and thus their Gini coefficients increase immediately at day 2 (*SI Appendix*, Fig. S5). In comparison, the gradual drop-off genes react late during differentiation and thus their Gini coefficients remain unchanged at day 2, starting to increase only at day 4. In contrast, this difference in early vs. late expression drop-off is not visible by mean expression due to heterogeneity between cells with regard to their differentiation timing. As a negative control, the cell-cycle marker gene *Ccnd3* has no significant change in DESCEND-estimated Gini coefficient during differentiation, agreeing with the fact that its expression heterogeneity across cells should be steady during differentiation.

## Discussion

We have described DESCEND, a method for gene expression deconvolution for scRNA-seq. All results in this paper are based on a Poisson model for UMI counts. The deconvolution accuracy of DESCEND was extensively assessed through comparisons to population-matched RNA FISH data and through data splitting and simulation experiments. DESCEND’s noise model allows it to remove known batch effects, as demonstrated on data from ref. 29.

DESCEND’s formulation allows more complex noise models, which would be necessary for analysis of scRNA-seq data without UMIs where there is amplification bias and zero inflation beyond what could be captured by Poisson sampling (24, 46). But such models contain many more parameters, and the estimation of these parameters is nontrivial. Some aspects of the deconvolved distribution, such as Eqs. **3** and **4**, are highly sensitive to the noise model and the estimation quality of the technical parameters. More efforts are needed to develop robust error models for non-UMI scRNA-seq data.

Even in UMI-based scRNA-seq data, technical noise may have substantial overdispersion, and a negative binomial distribution may be more appropriate. DESCEND accepts the negative binomial distribution with a known overdispersion parameter θ. As shown in Fig. 2*A*, θ can be estimated using the spike-in genes as the square of the CV limit when the expected number of input molecules increases. The overdispersion may also be cell or gene specific, but a more realistic model with more parameters may not always lead to better analyses if those parameters cannot be estimated reliably from data. So far, we have settled on a simple model with at most one overdispersion parameter for UMI-based data.

Without covariate adjustment, DESCEND currently requires a few seconds for deconvolution of the distribution of a single gene with hundreds of cells and 10–20 s when there are thousands of cells. Adding covariates and performing likelihood-ratio tests increase the computation cost by roughly three or four times. Computation can be easily parallelized across genes.

Accuracy of DESCEND estimation increases with number of cells and with sequencing coverage. For example, for the Drop-seq data from ref. 13, although the average UMI count per cell is only around 1,500, the large number of cells (a few thousand) allows accurate DESCEND estimation. For the data from refs. 6 and 7, there are only a few hundred cells for each condition, but the high sequencing depth and the prefiltering allow good estimates.

## Code Availability

The R package DESCEND is available on the Github repository https://github.com/jingshuw/descend. All source code and intermediate analysis results that were used to generate the figures in this paper are at repository https://github.com/jingshuw/DESCEND_manuscript_source_code.

## Materials and Methods

### Model.

We introduce the model, estimation procedure, and inference framework here, but leave technical details and a full discussion to the accompanying mathematical supplement in *SI Appendix*, *Mathematical Details of DESCEND*.

The observed count **1**. In this case,

DESCEND allows more complex technical noise models with gene-specific batch effects and Beta-binomial noise distribution. These extensions are described in *SI Appendix*, *Mathematical Details of DESCEND*.

Without covariates, our model assumes that true expression

### Modeling the Nonzero Component.

Let

As in ref. 27, we discretize to simplify estimation. That is, we assume *SI Appendix*, *SI Text* for how a is chosen.

### Penalized Maximum-Likelihood Estimation.

Now we combine zero inflation and covariates adjustment with density h to get the likelihood of the observed data **8**) (the formula for **11** to account for Eq. **9**,

Following ref. 27, we maximize a penalized log-likelihood to estimate the parameters

### Statistical Inference.

Ref. 27 showed that second-order approximations provide useful inference on model parameters. By Taylor expansion of the log-likelihood around the true values of

Now consider differential testing between two populations of cells. Let *P*-values computation.

DESCEND uses likelihood-ratio tests with the unpenalized likelihood

### Finding HVGs.

We use quantile smooth regression (default quantile 0.5) to fit a smooth curve of the relationship between the mean of deconvolved distributions and Gini across genes, using the R package quantreg (47). The dispersion score of each gene is computed as the distance of the Gini from the curve, which is further normalized by its SE. We select HVGs as the genes whose normalized scores are larger than a threshold T (default value is 10).

### Randomness of ERCC Spike-in Counts in scRNA-seq Experiments.

Randomness of both the input counts and technical noise contributes to the randomness of ERCC spike-ins observed counts. First, the actual count of each spike-in gene in each cell deviates from its target count computed from dilution ratios and is approximately Poisson given the following three assumptions: (*i*) The molecules are distributed evenly in the dilution, (*ii*) each molecule moves independently, and (*iii*) the distribution of molecules in dilution does not change during the spike-in process. If any of these assumptions fail, the actual count landing in each cell would be overdispersed compared with Poisson. Such randomness is also observed empirically. For example, in ref. 7, 339 of the 3,005 cells have at least one ERCC gene (among the 38 spike-in genes with expected input count

In the ERCC data of ref. 37, the variances of the DESCEND deconvolved distributions are roughly

## Acknowledgments

J.W. is supported by NIH Grant 2 R01HG006137 and the Wharton Dean’s Fund. M.L. is supported by NIH Grants R01GM108600, R01GM125301, and R01HL113147. N.R.Z. is supported by NIH Grant R01HG006137.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: nzh{at}wharton.upenn.edu.

Author contributions: J.W. and N.R.Z. conceptualized the study; J.W. formulated the model, developed the algorithms and methods, and executed the simulations and real data analysis; J.W., M.L., and N.R.Z. planned the model validation and case studies; E.T., H.D., S.S., J.M., and A.R. provided the RNA FISH and Drop-Seq data of the same melanoma cell line; and J.W., M.H., M.L., and N.R.Z. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1721085115/-/DCSupplemental.

- Copyright © 2018 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Zeisel A, et al.

- ↵
- ↵
- ↵
- ↵
- Jiang Y,
- Zhang NR,
- Li M

- ↵
- ↵
- Torre E, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Gu J,
- Du Q,
- Wang X,
- Yu P,
- Lin W

- ↵
- ↵
- ↵
- Balcan MF,
- Weinberger KQ

- Prabhakaran S,
- Azizi E,
- Carr A,
- Pe’er D

- ↵
- ↵
- Narayanan M,
- Martins AJ,
- Tsang JS

- ↵
- Korthauer KD, et al.

- ↵
- ↵
- ↵
- ↵
- McDavid A, et al.

- ↵
- Delmans M,
- Hemberg M

- ↵
- ↵
- ↵
- ↵
- ↵
- Jaitin DA, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Papatsenko D,
- Xu H,
- Ma’ayan A,
- Lemischka I

- ↵
- Jia C,
- Kelly D,
- Kim J,
- Li M,
- Zhang N

- ↵
- Koenker R

*quantreg: Quantile Regression*, R package Version 5.34.

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Statistics