# Parameterizing cell-to-cell regulatory heterogeneities via stochastic transcriptional profiles

^{a}Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908;^{b}Institute of Computational Biology, Helmholtz Center Munich, German Research Center for Environmental Health, 85764 Neuherberg, Germany; and^{c}Institute for Mathematical Sciences, Technical University Munich, 85747 Garching, Germany

See allHide authors and affiliations

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved December 27, 2013 (received for review June 18, 2013)

## Significance

Cell-to-cell variations in gene regulation occur in a number of biological contexts, such as development and cancer. Discovering regulatory heterogeneities in an unbiased manner is difficult owing to the population averaging that is required for most global molecular methods. Here, we show that we can infer single-cell regulatory states by mathematically deconvolving global measurements taken as averages from small groups of cells. This averaging-and-deconvolution approach allows us to quantify single-cell regulatory heterogeneities while avoiding the measurement noise of global single-cell techniques. Our method is particularly relevant to solid tissues, where single-cell dissociation and molecular profiling is especially problematic.

## Abstract

Regulated changes in gene expression underlie many biological processes, but globally profiling cell-to-cell variations in transcriptional regulation is problematic when measuring single cells. Transcriptome-wide identification of regulatory heterogeneities can be robustly achieved by randomly collecting small numbers of cells followed by statistical analysis. However, this stochastic-profiling approach blurs out the expression states of the individual cells in each pooled sample. Here, we show that the underlying distribution of single-cell regulatory states can be deconvolved from stochastic-profiling data through maximum-likelihood inference. Guided by the mechanisms of transcriptional regulation, we formulated plausible mixture models for cell-to-cell regulatory heterogeneity and maximized the resulting likelihood functions to infer model parameters. Inferences were validated both computationally and experimentally for different mixture models, which included regulatory states for multicellular function that were occupied by as few as 1 in 40 cells of the population. Importantly, when the method was extended to programs of heterogeneously coexpressed transcripts, we found that population-level inferences were much more accurate with pooled samples than with one-cell samples when the extent of sampling was limited. Our deconvolution method provides a means to quantify the heterogeneous regulation of molecular states efficiently and gain a deeper understanding of the heterogeneous execution of cell decisions.

Cell-to-cell differences in transcriptional or posttranslational regulation can give rise to heterogeneous phenotypes within a population (1⇓⇓⇓⇓⇓–7). There are several elegant techniques for monitoring regulatory states in single cells after a network of marker and effector genes has been identified (8⇓⇓⇓⇓–13). However, the options are much more limited when seeking to discover novel states without a predefined network. At the transcript level, global methods have been developed to profile single cells by oligonucleotide microarrays (14, 15) or RNA sequencing (16⇓⇓–19). However, generally such approaches overlook the considerable technical variation in RNA extraction (20) and reverse transcription (21) when applied to the limited starting material of single cells. Single-cell profiles also retain the biological noisiness (22) associated with each cell’s isolation and handling. These confounding sources of variation cannot be separated from reproducible heterogeneities in regulation unless many (>50) cells are individually profiled (9). Therefore, challenges remain for single-cell methods to discover regulatory heterogeneities in a reliable, unbiased, and efficient way.

An attractive alternative to single-cell methods is to analyze sets of population-averaged data and define regulatory signatures for discrete subpopulations. Existing approaches for transcriptomic data are able to deconvolve mixed cellular states computationally, but they require hundreds of coexpressed markers (23) or calibration with purified cell populations (24, 25). Usually, the size or identity of regulatory states is not defined beforehand and their discovery is what motivates the study (9, 11, 26). Certain states may also lack well-defined surface markers that would allow purification. It thus remains unclear whether computational inference with multiple cell averages can track quantitative characteristics of regulatory states not previously thought to exist.

As a hybrid between single-cell and mixture-based approaches, we previously developed a technique that applies probability theory to transcriptome-wide measurements (27). The method begins with random collections of up to 10 cells isolated in situ where cell-to-cell regulatory heterogeneities could possibly reside. Each of these “stochastic samples” is then profiled for overall mRNA expression by using a heavily customized cDNA amplification procedure together with oligonucleotide microarrays (20, 27). The process of random sampling is repeated 15–20 times to build a distribution of 10-cell averages. Transcripts with stark cell-to-cell variations can be distinguished statistically because of binomial fluctuations in single-cell expression that convolve their 10-cell averages. Last, candidate heterogeneities are clustered on a gene-by-gene basis according to the patterns of their sampling fluctuations to indicate putative regulatory states in single cells (27).

Stochastic-profiling experiments are quantitative and highly reproducible as a result of the 10-fold increase in starting material compared with a single cell (20). However, a recognized drawback of the approach is that explicit information about single cells is “lost” in the 10-cell averages. Here, we report that one can recover this information computationally and reconstruct the single-cell distribution of regulatory states with remarkable accuracy. Our method combines maximum-likelihood estimation with mixture models that are grounded in known mechanisms of transcriptional regulation. This approach of maximum-likelihood inference quantifies the single-cell characteristics of each regulatory state, including the probability that a cell will reside in one state or the other. Our predictions are validated with independent gene-specific observations in single cells, and we demonstrate for one very rare state (∼2–3% of the population) that it is important for normal morphogenesis of breast epithelial cells in 3D culture. Last, we show that, when sampling is limited to fewer than 20 observations, the parameterization of regulatory states is substantially more accurate when given 10-cell data compared with one-cell data. Maximum-likelihood inference now enables stochastic profiling to bridge the gap between -omics datasets and single-cell information.

## Results

### Probability Models for Heterogeneous Transcriptional Regulation.

To make reliable single-cell inferences, it was critical to start with simple probabilistic models of gene expression that were biologically accurate. Our method considers genes that exhibit two distinct regulatory states in a population of cells (19, 20, 27). Within each state, the cell-to-cell variation of expression was originally described by a lognormal distribution according to measurements of high-copy transcripts in single mammalian cells (28, 29). We tested whether there was a mechanistic foundation for using two lognormal subpopulations by examining a standard model of regulated gene expression (30, 31) (*SI Appendix*, Fig. S1). In this model, transcript levels per cell are determined by the kinetics of polymerase binding–unbinding, transcriptional elongation, and mRNA degradation. The relative magnitudes of the kinetic rate parameters together govern the steady-state distribution of transcripts in the population (32), allowing different regulatory states to be simulated.

For parameter sets where the probability of observing zero transcripts per cell was near zero, we found that the lognormal distribution was a suitable approximation of basal expression (Fig. 1*A*, blue). Parameter sets yielding median expression levels as low as 20 copies per cell showed only minor skewness in quantile–quantile comparisons with a lognormal distribution (Fig. 1*A*, blue inset). Starting with this basal distribution, we simulated a second cellular regulatory state by increasing the rate of polymerase binding, decreasing the rate of mRNA degradation, or both (Fig. 1*A*, orange). The apparent rate of polymerase binding increases upon recruitment by transcription factors that are expressed or activated heterogeneously within a population of cells (4, 5). Conversely, mRNA stabilization occurs posttranscriptionally through dedicated signal-transduction pathways activated by environmental stresses and proinflammatory stimuli (33). We found that either mechanism of gene up-regulation led to right-shifted distributions that were lognormal (Fig. 1*A*, orange inset). These simulations indicated that lognormal random variables were appropriate for the regulated expression of mid- to high-abundance transcripts.

One drawback of the lognormal distribution is that it has no support at zero copies (34), making it poor for capturing low-abundance genes that are completely silenced in some cells. To identify an alternative in this circumstance, we reconfigured the parameters of the model and defined a steady-state population where most cells would contain close to zero transcripts (Fig. 1*B*, blue). As noted before (32), this regulatory state was best captured by an exponential distribution (Fig. 1*B*, blue inset). Importantly, we found that when the kinetic parameters of a basal exponential state were modified to create a second right-shifted state (Fig. 1*B*, orange), the resulting distributions were lognormal (Fig. 1*B*, orange inset). Together, we conclude that the basic mechanisms of gene expression lead to steady-state distributions described by probability models that are relatively simple.

### Deconvolution of Random 10-Cell Averages by Maximum-Likelihood Inference.

Our results from the gene-expression model suggested that single-cell regulatory heterogeneities could be depicted as a mixture of two lognormal states or as a mixture of an exponential state and a lognormal state (Fig. 1). Either mixture gives rise to a probability distribution characterized by four key parameters. The lognormal–lognormal (LN–LN) mixture requires the log-mean expression of the two regulatory states (*µ*_{1} and *µ*_{2}), the log-SD for biological noise (*σ*), and the expression frequency (*F*) describing the probability that cells will occupy the higher regulatory state (step 1, Fig. 2*A*). (For simulations, the two lognormal states are assumed to share a common *σ*, but in practice we test whether inferences are improved when each lognormal state is allowed its own noise parameter; see below.) Thus, an LN–LN gene that is expressed at an approximately eightfold higher level in 20% of the population with a coefficient of variation (CV) of ∼50% is captured by *µ*_{1} – *µ*_{2} = 2, *F* = 20%, and *σ* = 0.48.

The exponential–lognormal (EXP–LN) mixture also requires *σ* and *F*, along with a single log-mean for the high lognormal state (*µ*) and a rate parameter for the low exponential state (*λ*) (step 1, *SI Appendix*, Fig. S2). The rate parameter relates to how quickly the lower distribution decays above zero copies per cell. For example, a rate parameter of *λ* = 1 creates a distribution that has ∼37% overlap with that of a high lognormal state of *µ* = 0.5 and *σ* = 0.225 whereas *λ* = 3 causes only a ∼6.3% overlap. We modeled two distinct regulatory states by restricting the simulations to rate parameters that caused negligible overlap with the high lognormal state (*λ* > 3). Together, the different mixture models enabled us to simulate stochastic-profiling data by summing the expression of 10 cells randomly sampled from the appropriate two-state distribution (step 2, Fig. 2*A* and *SI Appendix*, Fig. S2).

To infer the most-likely parameters from a collection of random 10-cell samples, we derived maximum-likelihood estimators for the LN–LN and EXP–LN mixtures (*Methods* and *SI Appendix*, *Methods*). Maximum-likelihood estimation requires a defined probability density function (pdf). The stochastic-sampling pdf is the convolution of 10 binomial choices drawn from the two underlying distributions in the mixture (step 3, Fig. 2*A* and *SI Appendix*, Fig. S2). The pdf has a ≤11-modal shape where each mode corresponds to choosing 0–10 cells from the high regulatory state. The most-likely parameter combination was calculated by maximizing the likelihood function (*Methods*), yielding parameters with interval estimates that best explained the data (step 4, Fig. 2*A* and *SI Appendix*, Fig. S2). By performing this maximum-likelihood estimation, we could “invert” stochastic profiling data to infer single-cell characteristics from 10-cell samples.

### Theoretical and Experimental Validation of Maximum-Likelihood Inference.

We evaluated the performance of our approach by using computational simulations of 10-cell samples with known distribution parameters. First, it was important to identify the minimum number of random samples needed to ensure accurate parameter estimation. Given hundreds to thousands of samples, we found that robust and accurate estimates were obtained for all model parameters irrespective of the mixture type (*SI Appendix*, Fig. S3 *A* and *B*). Conversely, with very few samples (∼20 or fewer), the convolved distributions were incompletely populated and our resulting estimates were highly uncertain and sometimes inaccurate for the LN–LN and EXP–LN mixtures. The transition between the two regimes occurred at 50–100 samples, which we defined as the approximate number of data points required for effective maximum-likelihood inference of single transcripts.

We next used simulations to identify the parameter ranges where maximum-likelihood inference makes accurate estimates of each regulatory state. Starting with the LN–LN mixture, we perturbed *µ*_{1}, *µ*_{2}, *σ*, or *F* individually while keeping the other three parameters fixed and simulated 50 random 10-cell samples. For a wide range of subpopulation log-means (*µ*_{1} and *µ*_{2}), maximum-likelihood inference accurately inferred model parameters with negligible bias (Fig. 2 *B* and *C*). We also observed good performance when altering the expression frequency (*F*). Accuracy declined near *F* = 50%, when the two subpopulations offset one another and disguise as a distribution with large *σ* (Fig. 2*D*). Nevertheless, the estimation procedure still accurately and confidently captured ∼70% of the total parameter space (*F* = 0–35% over the range of 0–50%). For the log-SD (*σ*), performance declined only when this parameter was extremely high (Fig. 2*E*). Parameter estimates were accurate until *σ* reached ∼0.8, corresponding to a ∼95% CV that is higher than nearly all genes examined thus far (35, 36). None of the mixture parameters could be reliably inferred from higher-order moments of the 10-cell distributions, although low *F* or high *σ* correlated with a slight increase in skewness (*SI Appendix*, Fig. S4). These results indicated that maximum-likelihood inference could extract parameters that were otherwise inaccessible by descriptive statistics.

We repeated the simulations for the EXP–LN mixture and arrived at very similar conclusions. As long as *λ* and *μ* were large enough to prevent overlap of the two regulatory states, we found that parameter estimates were accurate, although the variance of inferred *σ* was somewhat higher than in the LN–LN mixture (*SI Appendix*, Fig. S5). Together, these simulations suggested that maximum-likelihood inference is able to deconvolve a wide range of regulatory heterogeneities from 10-cell samples.

To examine the accuracy of maximum-likelihood inference with real 10-cell samples, we focused on expression of the detoxifying enzyme superoxide dismutase 2 (*SOD2*) during breast-epithelial acinar morphogenesis. We used a culture model in which immortalized human breast epithelial cells are grown as single-cell clones in reconstituted basement-membrane ECM to form 3D organotypic spheroids (37, 38). Earlier stochastic-profiling studies of developing spheroids had suggested that there were two *SOD2* regulatory states among the ECM-attached cells (27, 39). To apply maximum-likelihood inference, we deeply sampled *SOD2* expression by quantitative PCR (qPCR) in 81 random samples of 10 ECM-attached cells (Fig. 2*F*, *Left*). Using these data, we maximized the likelihood of the LN–LN and EXP–LN models, as well as that of a relaxed LN–LN model, which allowed each regulatory state to have its own log-SD (*σ*_{1} and *σ*_{2}). The three estimates were compared by using the Bayesian information criterion (BIC) score to calculate the quality of the fit relative to the number of inferred parameters (*SI Appendix*, Table S1). The best overall estimate was the mixture model that parameterized two distinct regulatory states with the lowest BIC score.

For the 10-cell measurements of *SOD2*, we found that the LN–LN mixture was slightly preferred over the EXP–LN mixture (Fig. 2*F*, *Right* and *SI Appendix*, Table S1). The inability to discriminate clearly between these two models was likely caused by the basal regulatory state, which could be described as an exponential distribution (*λ* = 46) or a lognormal distribution with a very small log-mean (*µ*_{2} = –4.1) given the sampling data. Regardless, the two models predicted similar *SOD2* expression frequencies among ECM-attached cells: 23% (13–33%) for the LN–LN mixture vs. 19% (12–27%) for the EXP–LN mixture. To determine the accuracy of this shared prediction, we directly measured *F* in 3D spheroids by RNA FISH (Fig. 2*G*). Scoring individual cells with high *SOD2* fluorescence intensity, we calculated an expression frequency of ∼26%. This measurement closely agreed with the inferred parameter of the LN–LN mixture (the better-scoring model; Fig. 2*H* and *SI Appendix*, Table S1) and lay within the estimated confidence interval (CI) of the EXP–LN mixture. By resampling the 10-cell *SOD2* data, we found that at least 50 observations were required to arrive at an accurate result (*SI Appendix*, Fig. S3*C*), confirming our earlier estimates using simulated data (*SI Appendix*, Fig. S3 *A* and *B*). The *SOD2* parameterization suggested that maximum-likelihood inference could correctly extract single-cell information from 10-cell sampling data.

### Maximum-Likelihood Inference of Coordinated Stochastic Transcriptional Profiles.

Programs of gene expression are often controlled by common upstream factors that enforce the regulatory state. We reasoned that coordinated single-cell gene programs would be the product of an overarching regulatory heterogeneity characterized by a shared *F*. If true, then it should be possible to estimate the expression frequency more confidently and with fewer samples by extending maximum-likelihood inference to gene clusters with coordinated 10-cell fluctuations.

We extended the approach as follows (Fig. 3*A*). First, each gene within the cluster was assigned its own *µ*_{1} and *µ*_{2} (or *µ* and *λ* for the EXP–LN mixture) to account for gene-to-gene differences in expression level and detection sensitivity. Next, we assumed that the genes within a cluster share a common *F* and *σ* (or *F*, *σ*_{1}, and *σ*_{2} in the relaxed LN–LN mixture), implying a shared mechanism of regulation (39, 40). Therefore, each mixture model of a cluster of *g* genes involved 2*g* + 2 or 2*g* + 3 parameters. Even for small gene programs (*g* ≤ 10), this parameter search space was too large for nonconvex optimization methods to maximize the global likelihood function quickly (*Methods*). To increase the speed and efficiency of estimation, the cluster was broken down into smaller four-gene subgroups to infer log-means for each gene in the subgroup together with local estimates of *F*, *σ *, and *λ* (steps 1 and 2, Fig. 3*A*). After log-means were locally estimated, the remaining parameters were globally inferred by remaximizing the likelihood function for the entire gene cluster while retaining the local gene-specific estimates of *µ*_{1} and *µ*_{2} (LN–LN mixture) or *µ* (EXP–LN mixture) (steps 3 and 4, Fig. 3*A*). As before, selection of the LN–LN, relaxed LN–LN, and EXP–LN mixture model was made according to the lowest BIC score (*SI Appendix*, Table S1). This revised formulation of maximum-likelihood inference enabled accurate and confident estimates of the expression frequency while requiring only approximately one-third of the sample size (*SI Appendix*, Fig. S6).

We tested our extension of maximum-likelihood inference by extracting from an earlier study two coexpression clusters that were completely uncharacterized (27) (*SI Appendix*, Fig. S7). These clusters contained one to two dozen genes with strongly coordinated expression fluctuations across 16 samples of 10 ECM-attached cells, but the patterns of fluctuation were markedly different (Fig. 3 *B* and *C*). Accordingly, when we inferred the parameters for the two clusters, the model predicted two very different expression frequencies. The first “infrequent” gene cluster was predicted to be up-regulated in ∼25% of the ECM-attached population (Fig. 3*B*). The LN–LN mixture model was preferred over the EXP–LN or relaxed LN–LN mixtures (*SI Appendix*, Table S1), although all three models converged upon similar values for *F*. By contrast, the expression frequency of the “rare” second cluster was predicted to be ∼10% by the LN–LN mixture (Fig. 3*C*), which was the best-scoring model of the three (*SI Appendix*, Table S1). Our parameterization of the two clusters emphasizes the mosaicked regulatory states that evolve even in a very simple model of tissue architecture (27, 38, 41).

To test whether the predicted values of *F* were accurate within the coexpressed clusters, we designed and validated riboprobes for four or five genes in each cluster and quantified their frequency of high expression by RNA FISH (*SI Appendix*, Fig. S8 *A* and *B*). We found that transcripts in the infrequent expression cluster were strongly expressed in three to five ECM-attached cells per acinus cross-section (Fig. 3 *D* and *F* and *SI Appendix*, Fig. S9*A*), yielding an average expression frequency of ∼25%. Conversely, genes in the rare expression cluster (Fig. 3 *E* and *G* and *SI Appendix*, Fig. S9*B*) were strongly expressed in one or two ECM-attached cells per acinus cross-section, consistent with an expression frequency of ∼10%. The expression frequencies of both clusters closely agreed with the inferred *F* parameters, suggesting that our extended inference approach was effective and accurate.

We evaluated the estimates of expression frequency more broadly by selecting four additional clusters from the same dataset for parameterization (*SI Appendix*, Fig. S7) (27). The clusters showed distinct fluctuation patterns and consequently led to *F* estimates that ranged from less than 5% to greater than 25% (Fig. 4 *A–D*, *Upper*). We validated riboprobes for a representative gene in each cluster and scored the expression frequency (Fig. 4 *A*–*D*, *Lower* and *SI Appendix*, Fig. S8*C*). Together with the earlier clusters, we observed a strong correlation between the expression frequency inferred computationally and the manual counts obtained by RNA FISH (*R* = 0.89, Fig. 4*E*). The accuracy of the manual counts was further confirmed by correlation with an expression-frequency index derived from digital image analysis of segmented acini (*Methods* and *SI Appendix*, Fig. S10). Taken together, these data indicate that maximum-likelihood inference accurately infers single-cell expression frequencies from cluster-wide patterns of 10-cell fluctuations.

### Identification of a Peculiar, Very Rare Transcriptional Regulatory State.

Maximum-likelihood inference provides critical information about the state distribution and expression frequency of any gene cluster identified by stochastic profiling to be heterogeneously regulated. As a proof-of-concept application, we screened gene clusters from the 3D profiling data (27) to identify unusual regulatory states that warranted follow-up study. One cluster was notable among those surveyed because the predicted expression frequency of the high regulatory state was very rare (*F* = 2.3%). The very rare cluster was also distinguished by its strong concordance with the relaxed LN–LN mixture compared with the alternative mixture models (*SI Appendix*, Table S1). Moreover, the log-mean of the low regulatory state was extremely low (*μ*_{2} ∼–3.3), suggesting that the cluster was at or below detection in the population. Within this coexpression cluster, we recognized several genes that were strongly associated with breast cancer, including the breast cancer susceptibility gene *BRIP1* [alternatively called *FANCJ* or *BACH1* (42)], the breast cancer-associated gene *IRF2* (43), and the zinc-finger gene *HIVEP2*, which is frequently down-regulated or mutated in breast cancer (44, 45) (Fig. 5*A*). We speculated that genes within the cluster were tightly regulated so that they could be activated in a restricted cellular context where their expression was critical.

Among the genes in the very rare cluster, we were most intrigued by the phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit δ isoform (*PIK3CD* alternatively called *p110δ*). Three-dimensional breast epithelial cultures abundantly express two other PI3K isoforms, *PIK3CA* and *PIK3CB* (Fig. 5*B* and *SI Appendix*, Fig. S11), and it is generally thought that any PI3K isoform can support proliferation and survival (46). Nevertheless, we found that the low-copy expression of *PIK3CD* was transcriptionally up-regulated with delayed kinetics compared with the other PI3K isoforms (Fig. 5*B*), suggesting a unique regulatory mechanism. When *PIK3CD* abundance was visualized in single cells by RNA FISH, we observed a striking pattern. Most cells lacked *PIK3CD* or expressed it at very low levels; however, we consistently identified a sporadic subpopulation of cells (roughly one or two cells every other acinus cross-section) with high *PIK3CD* expression (Fig. 5*C* and *SI Appendix*, Fig. S8*D*). The overall frequency of cells in the *PIK3CD*^{hi} state was somewhat higher than the estimates of *F* for the cluster, but the inferred frequency agreed with the very rare expression of two other members of the cluster, *FEM1A* and IRF2 (*SI Appendix*, Fig. S12). Together, these observations pointed to an acute (and likely transient) regulatory event triggering the selective induction of cluster genes in single ECM-attached cells.

We next asked whether *PIK3CD* was specifically important for normal acinar morphogenesis. To eliminate the very rare *PIK3CD*^{hi} subpopulation, we perturbed p110δ by two independent methods: RNA interference and the p110δ-specific small-molecule inhibitor, IC87114 (ref. 47, Fig. 5*D*, and *SI Appendix*, Fig. S13). When shPIK3CD cells were placed in 3D culture, we found that acini were larger and distorted, suggesting a defect in proliferation arrest. Using phosphorylated Rb (pRb) as a proliferative marker, we observed that shPIK3CD acini were still cycling after 15 d of 3D culture when shGFP control acini had quiesced (Fig. 5 *E* and *F*). Furthermore, when control cells were cultured with IC87114, we observed sustained proliferation that phenocopied *PIK3CD* knockdown (Fig. 5 *E* and *F*). These data together indicate that p110δ activity stemming from the very rare *PIK3CD*^{hi} regulatory state is critical for normal proliferation arrest of breast epithelia in 3D culture. More broadly, our results with the very rare cluster illustrate how maximum-likelihood inference can be used to hone in on gene programs with an expression frequency or regulatory pattern of interest.

### Comparison with Alternative Deconvolution Methods.

We compared the performance of our method to other computational approaches for deconvolving mixed populations (48⇓–50). The alternative methods invoked different mathematical formalisms—Bayesian statistics (48), nonnegative matrix factorization (49), and principal component analysis (50)—and none had previously been applied to transcriptional profiles of small samples. Using the sampling fluctuations within the infrequent, rare, and very rare clusters, we attempted estimates of expression frequency and found that all were substantially less accurate than our approach (Table 1). The comparison illustrates that our method is uniquely effective at parameterizing transcriptional regulatory states within cell populations.

### Direct Comparison of Single-Cell and 10-Cell Sampling Strategies.

Maximum-likelihood inference reconstructs the single-cell expression distribution without the need to measure single cells. Ignoring the technical challenges of global single-cell methods (17, 20, 21, 27), it should also be theoretically possible to recreate the complete expression distribution by measuring many individual cells. However, it was not clear whether single-cell profiling would be as effective as stochastic profiling when reconstructing from a limited number of 1- or 10-cell samples. We anticipated that low expression frequencies would be particularly difficult for single-cell methods to characterize because of uncertainty in reliably capturing the rare regulatory state.

To compare single-cell profiling with stochastic profiling, we repeatedly simulated 1- or 10-cell measurements of gene clusters with similar characteristics to those previously examined (Figs. 3 *F* and *G* and 5*A* and *Methods*). The three 12-gene clusters varied in their expression fraction—infrequent (*F* = 25%), rare (*F* = 10%), and very rare (*F* = 5%)—and the very rare cluster was simulated as an LN–LN mixture or an EXP–LN mixture. When the number of observations was limited to 16 (as in the actual data), we found that maximum-likelihood inference provided superior estimates of *F* when using 10-cell groups (Table 2). Estimates from simulated observations of 16 single cells showed substantially higher mean squared error (MSE) for all gene clusters compared with 16 10-cell observations. The larger MSE of one-cell estimates was caused by increases in both the bias and variance of the estimate, whose magnitudes depended on the cluster characteristics and mixture model. These computational simulations provide an upper bound on performance, because experimental error from actual single-cell experiments (17, 19) should blur the data much more. By collecting a greater total number of cells when observations are limited, maximum-likelihood inference of stochastic 10-cell profiles provides a more accurate picture of the single-cell distribution than single-cell profiles.

## Discussion

Maximum-likelihood inference of mixed regulatory states enables accurate single-cell expression characteristics to be gleaned from 10-cell measurements. For individual genes, the model requires a large number of samples to obtain precise estimates, and its advantage over explicit single-cell methods is debatable. However, by extending the approach to coregulated gene clusters (27, 39, 40), we can infer expression frequencies much more robustly than single-cell methods when the extent of sampling is limited. In fact, after identifying heterogeneously regulated genes at the transcriptome-wide level by stochastic profiling (20, 27), global inferences are achievable with the same number of random 10-cell samples. Maximum-likelihood inference can thus be immediately incorporated into stochastic profiling studies that seek a further understanding of single-cell regulation (20, 39).

Multiple studies have demonstrated that heterogeneous phenotypes are primed by earlier regulatory nonuniformities in gene expression (1⇓⇓⇓⇓⇓–7). However, to date, these discoveries have relied on either predefined intracellular circuits or a mix of screening and serendipity. By combining stochastic profiling with maximum-likelihood inference, one can now examine the single-cell transcriptome for expression frequencies or other regulatory patterns that correlate with a downstream phenotype of interest. Such programs are most likely to contain one or more triggers of the heterogeneous phenotype. For example, our follow-on work with *PIK3CD* suggests that it may enforce a quiescent phenotype in a subpopulation of cells that would otherwise enter the cell cycle.

One day, it may be possible to measure the genome, transcriptome, and proteome accurately and cheaply in single cells. While progress is being made toward this goal (9, 16, 17, 35, 51), in the meantime it is valuable to develop alternative methods with less-stringent sample requirements. Our study shows that a surprising amount of quantitative single-cell information can be deconvolved mathematically from measurements with 10-fold more starting material. The “average” cell might indeed be a myth (52), but that does not mean that small-sample averages of cells cannot point to the truth.

## Methods

### Single-Cell Model of Regulated Gene Expression.

Distributions of transcripts per cell were generated under the steady-state approximation as previously described (30, 31). The basal lognormal regulatory state (Fig. 1*A*, blue) was defined with the following model parameters: *k*_{binding} = 5, *k*_{unbinding} = 10, *k*_{elongation} = 50, and *k*_{degradation} = 1.

The exponential regulatory state (Fig. 1*B*, blue) was defined with the following model parameters: *k*_{binding} = 0.5, *k*_{unbinding} = 10, *k*_{elongation} = 50, and *k*_{degradation} = 1. Basal regulatory states were perturbed by increasing *k*_{binding} by 10-fold (lognormal) or 20-fold (exponential), decreasing *k*_{degradation} by 3.3-fold (lognormal) or fivefold (exponential), or both. Probability densities were compared with the lognormal and exponential test distributions by integrating over integer copy numbers to generate a representative observation set. Observations and distributions were compared with the qqplot function in MATLAB (The MathWorks).

### Simulations of Random 10-Cell Samples.

Simulated 10-cell expression profiles were generated in MATLAB with the statistics toolbox or in R. The LN–LN model assumes a binomial distribution for the two regulatory states and a lognormal distribution of the transcripts within each state. For a random *n*-cell sampling (here *n* = 10), the number of cells drawn from the high regulatory state (*h*) was specified by a binomial distribution with parameters *n* and *F*. Next, *h* expression measurements were randomly drawn from a lognormal distribution with log-mean *μ*_{1} and log-SD *σ*. The remaining *n* – *h* expression measurements were also drawn from a lognormal distribution with log-mean *μ*_{2} and log-SD *σ*. The sum of *n* measurements constituted one stochastic *n*-cell sample. In the EXP–LN model, transcripts from the basal regulatory state were drawn from an exponential distribution with rate parameter *λ*. This procedure was repeated for the indicated number of random samples.

### Derivation of LN–LN Maximum Likelihood Estimator.

To derive the LN–LN maximum-likelihood estimator, we began with a mixed population of cells occupying one of two regulatory states. The basal regulatory state expresses a transcript (*g*) at a low level with log-mean and log-SD *σ*. The induced regulatory state expresses the transcript at a higher level with log-mean and log-SD *σ*. The probability of drawing a single cell from the high regulatory state is characterized by the parameter *F*.

According to the two-state model, the single-cell expression for transcript *g* follows the pdf:

where and are defined as

The *i*^{th} random sample of transcript *g*, , is the sum of *n* independent single-cell expression measurements (here, *n* = 10):

where is the expression of transcript *g* in the *j*^{th} cell of the *i*^{th} random sample. Together, the random sample describing the *n*-cell mixture has the pdf

represents the binomial selection of cells from the basal or induced regulatory states with probabilities *F* and 1 – *F*, respectively. is the density of a sum of independent random variables representing the *n*-cell draw from the mixture model:

The pdf for the sum of lognormally distributed random variables was approximated as previously described (53).

When expanded to a cluster of *m* transcripts, the log-likelihood function for the model parameters given *k* random *n*-cell samples is

where and are vectors containing the transcript-specific log-means for the two regulatory states: and . The log-likelihood functions assume that the expression levels of each transcript are independent as defined by the specific mixture model and *F*. Derivation of all three maximum-likelihood estimators is included in *SI Appendix*, *Methods**.*

### Maximum-Likelihood Parameter Estimation and Model Selection.

The derived log-likelihood functions in Eqs. **12**–**14** of *SI Appendix*, *Methods* are maximized by the most likely combination of parameters for the data . To estimate the parameters for the LN–LN mixtures, we required that . This constraint ensured identifiability because . We also transformed *F* with the logit function and and *σ* with the logarithm function to enable the use of faster, unconstrained optimization algorithms.

Because the log-likelihood function was multimodal, it precluded the straightforward use of gradient-based approaches to find globally optimal parameter combinations. We solved the high-dimensional nonconvex global optimization problem by combining genetic and simplex algorithms. First, the log-likelihood function was computed at randomly drawn parameter combinations to identify high-likelihood regions in parameter space at computationally low cost. In the regions of highest log likelihood, we then used the Nelder–Mead algorithm (54) to identify local maxima of the likelihood function. We further localized the global optimum by repeating a random search of parameter space around the optimum identified by the Nelder–Mead algorithm. The resulting high-likelihood regions were used to seed another Nelder–Mead optimization. The iterations of random search and Nelder–Mead optimization continued until convergence.

For estimating model parameters from transcriptional clusters, we first considered smaller subgroups of the cluster of interest. The best balance of computational time and stability of the resulting parameter estimates was achieved with four-gene subgroups (*SI Appendix*, Fig. S6). The log likelihood of each subgroup was optimized by the algorithm described above to identify the most-likely parameters for the transcripts in the subgroup. Based on the subgroup estimate, we then kept fixed and (for the LN–LN and relaxed LN–LN models) or (for the EXP–LN model) and globally inferred *F* and *σ* (or *F*, *σ*_{1}, and *σ*_{2} for the relaxed LN–LN model, or , *F*, and *σ* for the EXP–LN model) by using the optimization algorithm described above. To confirm that the global optimum for each model had been identified, we pursued a constrained optimization in parallel, which required that the two regulatory states be sufficiently distinct from each other. Specifically, the density of the high regulatory state was constrained to be greater than the low regulatory state in the domain between the mode of the high state and the largest observation in the dataset. The likelihoods of the constrained and unconstrained optimizations were compared, and the higher likelihood inference was selected as the best parameterization for that mixture model. Last, the three mixture models were compared according to their BIC score:

where is the vector of inferred parameters, *c* is the number of inferred parameters in the model, *m* is the number of transcripts in the cluster, and *k* is the number of *n*-cell random samples for each transcript. The best model predicted two distinct regulatory states with the lowest BIC score (*SI Appendix*, Table S1).

Approximate 95% CIs for the best model were estimated by numerically computing the inverse Hessian matrix of the negative log-likelihood function evaluated at the optimal parameter combination. Each *i*^{th} diagonal element (*d*_{i}) of this matrix leads to the confidence in the *i*^{th} inferred parameter () as follows:

Source code for the maximum-likelihood parameter estimation can be found at http://hmgu.de/icb/StochasticProfiling_ML.

### Inference Comparisons of 1- and 10-Cell Random Samples.

We simulated measurements for various gene clusters as described above with either *n* = 1 or *n* = 10, *m* = 12, and *k* = 16 with the mixture model and *F* specified in Table 1. Values of , , , , and *σ* were drawn randomly from the individual transcripts comprising the inferences of Figs. 3 *F* and *G* and 4*A*. Model parameters were inferred as described above with the correct value of *n* in *SI Appendix*, Eqs. **12** and **14**. The inference procedure was repeated 100 times, yielding estimates (*j* = 1, 2, ... 100) for each true parameter . This gives the following Monte Carlo estimates of bias, variance, and mean-squared error:

### Cell Lines and Culture Conditions.

Cell lines and 3D culture conditions are described in *SI Appendix*, *Methods*.

### Stochastic Sampling.

Stochastic samples of *SOD2* were collected as previously described (20, 27, 39). Briefly, 3D cultures were snap-frozen and sectioned at day 10 of morphogenesis. Random 10-cell samples of ECM-attached acinar cells were achieved by laser-capture microdissection from cryosections. The RNA collected from these samples was amplified with a custom small-sample mRNA amplification procedure and quantified by qPCR or microarray (20, 27, 39). Microarray-based expression clusters were identified based on correlated expression fluctuations as described (27, 39).

### Image Aquisition and Processing.

RNA FISH, immunofluorescence, confocal microscopy, manual expression-frequency scoring, and image processing are described in *SI Appendix*, *Methods**.*

### Digital Scoring of Expression-Frequency Index.

Multicolor RNA FISH images were acquired with wheat germ agglutinin (WGA), the riboprobe of interest, and the loading-control riboprobes as described in *SI Appendix*, *Methods*. Individual ECM-attached cells were manually segmented in ImageJ using the WGA, riboprobe, and loading-control stains to determine cell boundaries. The segmented regions of interest (ROIs) were saved as a single ZIP file in ImageJ. The pixels within each ROI were extracted and compared against a null pixel distribution composed of a random set of pixels from segmented cells within the same image. The 85th–95th percentiles of the cell ROI and the null distribution were compared after bootstrapping each distribution 300 times. A cell was scored in the high regulatory state if the bootstrapped 90% CI of the cell ROI was consistently greater than the bootstrapped 90% CI of the null distribution when evaluated from the 85th–95th percentile of pixels. Performing the same analysis on the loading-control riboprobes showed that fewer than 1% of all cells segmented showed detectable differences in total RNA expression. Therefore, the expression-frequency index for a field of view was quantified as the number of cells detected in the high regulatory state divided by the total number of cells segmented. At least 18 fields of view with at least 10 cells per field were acquired for each gene analyzed. Source code for image analysis can be found at http://hmgu.de/icb/StochasticProfiling_ML.

### shRNA Cloning and Lentiviral RNAi.

shRNA sequences against *PIK3CD* were cloned based on the targeting sequences suggested by the RNAi Consortium, except that the XhoI restriction site in the shRNA loop was changed to a PstI site for easier diagnosis during cloning. shGFP control was used as previously described (55). Primers were annealed at 95 °C in annealing buffer (10 mM Tris⋅HCl, 100 mM NaCl, and 1 mM EDTA) for 5 min on a thermocycler and cooled slowly to room temperature by unplugging the instrument. Annealed primers were phosphorylated in vitro with T4 polynucleotide kinase (New England Biolabs) and then cloned into pLKO.1 puro (56) that had been digested with EcoRI and AgeI. Lentiviruses were packaged and transduced into MCF10A-5E cells as previously described (39). Stable lines were screened for knockdown efficiency by immunoblotting.

### Molecular Biology.

Riboprobe synthesis, qPCR, immunoblotting, and validation of the IC87114 inhibitor are described in *SI Appendix*, *Methods**.*

## Acknowledgments

We thank Jason Papin and Eric Greenwald for critically reviewing an early version of this manuscript. This work was supported by American Cancer Society Grant 120668-RSG-11-047-01-DMC (to K.A.J.), National Institutes of Health Director’s New Innovator Award Program Grant 1-DP2-OD006464 (to K.A.J.), German Science Foundation Grants SPP 1395 and SPP 1356 (to F.J.T.), and German Academic Exchange Service Integrated Action 54367112 (to F.J.T.). K.A.J. is further supported by the Pew Scholars Program in the Biomedical Sciences and the David and Lucile Packard Foundation. S.S.B. is supported by a Graduate Research Fellowship from the National Science Foundation. C.F. and F.J.T. are supported by the European Union within the European Research Council Grant LatentCauses.

## Footnotes

↵

^{1}S.S.B. and C.F. contributed equally to this work.- ↵
^{2}To whom correspondence may be addressed. E-mail: kjanes{at}virginia.edu or fabian.theis{at}helmholtz-muenchen.de.

Author contributions: S.S.B., C.F., A.R., F.J.T., and K.A.J. designed research; S.S.B., C.F., and A.R. performed research; S.S.B. and C.F. contributed new reagents/analytic tools; S.S.B., C.F., F.J.T., and K.A.J. analyzed data; and S.S.B., C.F., and K.A.J. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1311647111/-/DCSupplemental.

## References

- ↵
- Slack MD,
- Martinez ED,
- Wu LF,
- Altschuler SJ

- ↵
- ↵
- Singh DK,
- et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Bendall SC,
- et al.

- ↵
- ↵
- ↵
- Kurimoto K,
- et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Reiter M,
- et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- Loo LH,
- et al.

- ↵
- ↵
- Bengtsson M,
- Ståhlberg A,
- Rorsman P,
- Kubista M

- ↵
- Warren L,
- Bryder D,
- Weissman IL,
- Quake SR

- ↵
- ↵
- ↵
- Munsky B,
- Neuert G,
- van Oudenaarden A

- ↵
- ↵
- Limpert E,
- Stahel WA,
- Abbt M

- ↵
- ↵
- ↵
- ↵
- ↵
- Wang L,
- Brugge JS,
- Janes KA

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Foukas LC,
- Berenjeno IM,
- Gray A,
- Khwaja A,
- Vanhaesebroeck B

- ↵
- ↵
- Erkkilä T,
- et al.

- ↵
- ↵
- Tolliver D,
- et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Systems Biology