# Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations

^{a}Department of Statistics, Stanford University, Stanford, CA 94305;^{b}Department of Biomedical Data Science, Stanford University, Stanford, CA 94305;^{c}Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305;^{d}Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, Department of Automation, Tsinghua University, 100084 Beijing, China;^{e}Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100080 Beijing, China;^{f}Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, 650223 Kunming, China

See allHide authors and affiliations

Contributed by Wing Hung Wong, June 14, 2018 (sent for review April 4, 2018; reviewed by Andrew D. Smith and Nancy R. Zhang)

## Significance

Biological samples are often heterogeneous mixtures of different types of cells. Suppose we have two single-cell datasets, each providing information on a different cellular feature and generated on a different sample from this mixture. Then, the clustering of cells in the two samples should be coupled as both clusterings are reflecting the underlying cell types in the same mixture. This “coupled clustering” problem is a new problem not covered by existing clustering methods. In this paper, we develop an approach for its solution based on the coupling of two nonnegative matrix factorizations. The method should be useful for integrative single-cell genomics analysis tasks such as the joint analysis of single-cell RNA-sequencing and single-cell ATAC-sequencing data.

## Abstract

When different types of functional genomics data are generated on single cells from different samples of cells from the same heterogeneous population, the clustering of cells in the different samples should be coupled. We formulate this “coupled clustering” problem as an optimization problem and propose the method of coupled nonnegative matrix factorizations (coupled NMF) for its solution. The method is illustrated by the integrative analysis of single-cell RNA-sequencing (RNA-seq) and single-cell ATAC-sequencing (ATAC-seq) data.

Biological samples of interest in clinical or experimental studies are often heterogeneous mixtures; i.e., a sample may consist of many subpopulations of cells with distinct cellular states. To resolve the heterogeneity and to characterize the constituent subpopulations, it is necessary to generate functional genomic data at the single-cell level. An exciting recent development in genomics technology has been the emergence of methods for single-cell (sc) measurements; for example, scRNA sequencing (scRNA-seq) (1) enables transcription profiling, scATAC sequencing (scATAC-seq) (2) identifies accessible chromatin regions, and sc-bisulfite sequencing (3) measures DNA methylation, all at the single-cell level.

Often, the first step in the analysis of single-cell data is clustering, that is, to classify cells into the constituent subpopulations. Clustering methods for scRNA-seq data are discussed in refs. 4 and 5, and clustering of scATAC-seq data is described in ref. 6. Existing methods, however, do not address the increasingly common situation where two or more types of sc-genomics experiments are performed on different subsamples from the same cell population. For example, Fig. 1*A* depicts the situation when one subsample is analyzed by scRNA-seq while another is analyzed by scATAC-seq. Although the clustering methods developed for scRNA-seq and scATAC-seq were each shown to be capable of identifying distinct cell types, the association of gene expression changes to chromatin accessibility dynamics better defines cell types and lineages, especially in complex tissues (7, 8). To connect these two assays, one might suggest to separately cluster each data type, followed by an integration afterward. However, such approaches can be problematic because scATAC-seq and scRNA-seq data do not always possess a similar power for detection of cell types (9). Furthermore, clusters may be data type specific due to technical noise. So it is advantageous to systematically couple the two clustering processes in such a way that the clustering of the cells in the scRNA-seq sample can also make use of information from the scATAC-seq sample, and vice versa (10). In this paper, we formulate this “coupled clustering” problem as an optimization problem and introduce a method, named coupled nonnegative matrix factorizations (coupled NMF), for its solution.

## Approach

### Coupled NMF.

We first introduce our approach in general terms. Let *O* be a *p*_{1} by *n*_{1} matrix representing data on *p*_{1} features for *n*_{1} units in the first sample; then a “soft” clustering of the units in this sample can be obtained from a nonnegative factorization *O = W*_{1}*H*_{1} as follows: The *i*th column of *W*_{1} gives the mean vector for the *i*th cluster of units, while the *j*th column of *H*_{1} gives the assignment weights of the *j*th unit to the different clusters. Similarly, clustering of the second sample can be obtained from the factorization *E = W*_{2}*H*_{2}, where *E* is the *p*_{2} by *n*_{2} matrix of data on *p*_{2} features (which are different from the features measured in the first sample) for *n*_{2} units in the second sample. To couple two matrix factorizations, we introduce a term *A* is a “coupling matrix.” The construction of *A* is application specific but depends on the assumption that, based on scientific understanding or prior data, it is possible to identify a subset of features in one sample that are linearly predictable from the features measured in the other sample. In such a situation, we can take *A* to be the matrix representation of the linear prediction operator.

Once the coupling is defined this way, we can obtain the factorizations of the two data matrices by solving the following optimization problem (Fig. 1*C*):*W*_{1} and *W*_{2}. After solving the optimization, the cluster profile and the cluster assignments for the *k*th cluster in the first sample can be obtained respectively from the *k*th column of *W*_{1} and the *k*th row of *H*_{1}. Similarly, the clustering in the second sample can be obtained from *W*_{2} and *H*_{2} (Fig. 1*D*).

### Application in Single-Cell Genomic Data.

We apply the coupled NMF approach to cluster scRNA-seq and scATAC-seq data. In this application, *i*th region in the *j*th cell (6). By a region we mean either a predefined regulatory element (RE) or a peak called from merged scATAC-seq data. From scRNA-seq data, we compute the data matrix *E* where *g*th gene in the *h*th cell (11). Details are given in *Materials and Methods*, Data Processing. Note that the scATAC-seq and the scRNA-seq data are not measured in the same cell (Fig. 1*A*).

To get the coupling matrix *A*, we take advantage of our recent work on modeling paired gene expression and chromatin accessibility data (7) and use a diverse panel of cell lines with both expression and accessibility data to train a prediction model of gene expression from accessibility (see *Materials and Methods* for details). After fitting the model, we select a set of “well-predicted” genes (named gene set *S*) and use this set of genes’ RE–gene associations to couple the two types of data. In this application, *AW*_{1} gives the cluster-specific predictions of the expression of genes based on the cluster-specific accessibilities of REs, and hence the trace term enforces our expectation that the expression of genes should be consistent with the predictions based on accessibility of REs. As the coupling matrix *A* is noisy, we can refine the coupling iteratively, as follows, to get a better result. We assign single cells to clusters according to the assignment weights given by *H*_{1} and *H*_{2}. After getting the cluster results, we choose cluster-specific genes based on scRNA-seq clustering. We restrict the rows of *A* and *W*_{2} to the cluster-specific genes and get a submatrix of *A* and *W*_{2}. Then we replace the *A* and *W*_{2} in the trace term by the submatrices and recluster the cells by optimizing the objective function in Eq. **1**. We continue until the cluster assignments are not changed by further iterations.

## Results

### Results on Simulation Data.

We first evaluate the performance of our method in a simulation study. Single-cell datasets are simulated by sampling reads from a bulk dataset (*Materials and Methods*). The bulk datasets used in our simulation study are from two very similar cell types from a hematopoietic differentiation process, namely common myeloid progenitor (CMP) and megakaryocyte erythroid progenitor (MEP) (12). For each of these two cell types, we first generated 100 scRNA-seq datasets and 100 scATAC-seq datasets (*Materials and Methods*).

To simulate a scRNA-seq dataset from a mixed population with two cell types, we simply mix the 200 scRNA-seq data from two cell lines together and treat them as a single scRNA-seq dataset. We apply *k*-means and NMF to cluster the mixed cells. We run *k*-means 50 times with different random initial values and choose the result that gives the minimum total sum of within-cluster distances. Similarly, we run NMF 50 times and choose the result that gives the minimum approximation error in the Frobenius norm. The results of all 50 runs on scRNA-seq and scATAC-seq data by *k*-means and NMF are shown in *SI Appendix*, Fig. S1. Finally, we perform coupled NMF clustering based on both the 200-cell mixture of the scRNA-seq sample and the 200-cell mixture of the scATAC-seq sample (*SI Appendix*, Fig. S3). The performance of the three clustering results (*k*-means on scRNA-seq only, NMF on scRNA-seq only, and coupled NMF on both scRNA-seq and scATAC-seq) is presented in Fig. 2*A*. A similar comparison on the clustering results of scATAC-seq is illustrated in Fig. 2*B*. The convergence of coupled NMF is given in *SI Appendix*, Fig. S2. It is seen that coupling leads to greatly improved results, reducing the assignment error rate by more than threefold over the other two methods (Fig. 2*C*).

### Assessment of Prediction Model Before Coupling.

We are interested in applying coupled NMF to analyze data generated from differentiation of a mouse embryonic stem cell, namely scRNA-seq and scATAC-seq at day 4 after retinoic acid (RA) treatment (*Materials and Methods*). Before analyzing the single-cell data, we want to assess whether the model learned from the diverse panel (i.e., matrix *A*) provides reliable predictive power to connect chromatin accessibility and gene expression in this biological context. We thus generated bulk RNA-seq and ATAC-seq at day 4 of the RA treatment (named RA day 4). Using the model trained on the diverse panel, we predicted the expression of genes in set *S* at RA day 4. *SI Appendix*, Fig. S3 shows the observed vs. predicted gene expressions. It is seen that genes in *S* were predicted with high accuracy (*R*^{2} = 0.75, *r* = 0.87). This gives us confidence in using the model to initiate the coupling.

### Results on Real Single-Cell Data.

Next, we use coupled NMF to analyze RA day-4 scRNA-seq and scATAC-seq data. We first perform coupled NMF with *K* = 2 (i.e., two clusters) and then visualize the clustering result on Spearman correlation-based t-distributed stochastic neighbor embedding (t-SNE) plots. There are clearly two separated clusters in both t-SNE plots of scATAC-seq and scRNA-seq. Increasing the number of clusters to three, we can see three well-separated clusters in t-SNE plots. However, when *K* is increased to 4 or 5, the separation among clusters is no longer clear (Fig. 3*A* and *SI Appendix*, Fig. S4). We also calculate clustering stability based on the method in Brunet et al. (13) for *K* = 2–5 (*SI Appendix*, Fig. S5). The results show that clustering results are stable when *K* = 2 or 3, while the results are not stable when *K* is increased to 4 or 5. Hence, we set *K* = 3 for the remaining of the analysis.

For each of the three clusters, we identify cluster-specific transcription factors (TFs) based on their expression from RNA-seq data and compare their motif activities in scATAC-seq data (Fig. 3). Here the motif activity of a TF in a cell reflects the enrichment level of the TF’s motif in accessible REs in a scATAC-seq dataset (*Materials and Methods*). Fig. 3*B* shows the motif activities and expressions of some cluster-specific TFs on the t-SNE plots (Fig. 3*B* and *SI Appendix*, Fig. S6). Fig. 3*C* shows the heat maps of motif activities and expressions for a subset of cluster-specific TFs, namely those with expression transcripts per million (TPM) greater than 10 in at least 40 cells. It is seen that cluster-1–specific TFs (e.g., Ebf1, Lhx1, and Neurod1) have high motif activities in the corresponding cluster of the scATAC-seq sample. Similarly, cluster-2–specific TFs, Gata4, Foxa2, and Jun, have high motif activity in cluster 2 and cluster-3–specific TFs, Rfx4, Sox2, Sox9, Pou3f2, and Pou3f4, have high motif activity in cluster 3. This result shows that our method leads to highly consistent TF expression and TF motif activities within each of the inferred constituent subpopulations.

To further assess the coupled NMF results, we select cluster-specific genes from RNA-seq data and cluster-specific peaks from ATAC-seq data. We test whether the cluster-specific genes from scRNA-seq data are significantly overlapped with the genes with nearby (100 kb) cluster-specific ATAC-seq peaks by performing Fisher’s exact test based on the overlap of two sets of genes (*SI Appendix*, Fig. S7). Fig. 3*D* gives the *P* values for all possible pairings of the RNA-seq clusters with ATAC-seq clusters. It is seen that the pairings identified by coupled NMF indeed gave dramatically more significant *P* values and higher fold changes than the other possible pairings.

### Coupled Clustering of Single Cells Sheds Light on Stem Cell Differentiation.

The cluster-specific gene expression profiles and chromatin accessibility profiles provided by our method can provide useful insight into the constituent subpopulations. First, we use cluster-specific peaks from scATAC-seq data to annotate the clusters. We collect previously determined enhancers in mouse tissues at seven developmental stages from 11.5 d postconception until birth (14). Fig. 4 *A*–*C* shows the degree of overlap of our cluster-specific peaks with these developmental enhancers for different tissues and at different developmental stages. The number represents 10,000 times the Jaccard index (intersection over union) and NA indicates that enhancer data for that tissue in that stage are not available. The results show that cluster-1–specific peaks are enriched in forebrain and midbrain enhancers at E12.5 and E13.5. Cluster-2–specific peaks are enriched in heart enhancers at E15.5 and E16.5. Cluster-3–specific peaks are enriched in forebrain enhancers from E12.5 to E16.5 and also in midbrain, hindbrain, and neural tube. In addition, we also collect experimentally validated tissue-specific enhancers from the VISTA database (https://enhancer.lbl.gov/) and overlap them to cluster-specific peaks. Fig. 4*D* shows the percentage of tissue-specific VISTA enhancers overlapped to cluster-specific peaks. Only those tissues with at least one enhancer overlapping with the cluster-specific peaks are shown. Enhancers from nervous system-associated tissue (neural tube, cranial nerve, hind brain, midbrain, forebrain, trigeminal V, dorsal root ganglion, eye, nose) have overlap with cluster-specific peaks from cluster 1 and cluster 3. Cluster-2–specific peaks are enriched in blood vessel enhancers and heart enhancers. These results suggest that clusters 1 and 3 may be related to nervous system tissues and cluster 2 may be related to heart tissue.

Next, we analyzed cluster-specific genes from scRNA-seq data. Fig. 4*E* presents the most enriched gene ontology (GO) terms, their *P* values, and fold changes in each cluster. The results show that cluster 2 is enriched in blood vessel development and cardiovascular system development, while clusters 1 and 3 are enriched in nervous system-associated terms. The results from scRNA-seq–based annotation are consistent with the results from scATAC-seq–based analysis. Although clusters 1 and 3 are nervous system-associated clusters, there are interesting differences. Cluster 1 is more enriched in axon guidance and neuron projection guidance, while cluster 3 is more enriched in brain development and oligodendrocytes differentiation. It seems cluster 1 is related to neuron-specific development and cluster 3 is more related to general nervous system development. Overall our results suggest that the RA-induced stem cell at day 4 is a mixture of cells related to neuron, cardiovascular system, and nervous system. These results are largely consistent with previous studies (15, 16).

We can construct cluster-specific gene regulatory networks as graphs with directed edges from the cluster-specific peaks to the cluster-specific genes that are within 100 kb distance and directed edges from cluster-specific TFs to cluster-specific peaks containing significant matches to the corresponding motifs. These cluster-specific subnetworks are presented in *SI Appendix*, Fig. S8. It is seen that Klf7, Ebf1, Sox11, and Nhlh1 are playing an important role in the network for cluster 1; Gata4, Gata6, Sox17, Foxa2, Ap1 complex, and Tead family are important in cluster 2; and Rarb, Nr2f1, Rfx4, Sox2, Sox9, Sox21, Pou3f2, Pou3f3, and Pou3f4 are important in cluster 3.

## Discussion

In this paper, we proposed a coupled clustering method and applied it to single-cell genomic data. We emphasize that the measurement of multiple data types in the same cell is technically challenging due to the complex cellular reactions. Our method utilizes external information to integrate gene expression and chromatin accessibility that are not measured on the same cell. In the simulation study, we showed that the coupled NMF outperforms clustering results derived from just one data type. Moreover, we showed that our method identifies important peaks and genes that characterize cellular heterogeneity in the context of the RA-induced stem cell. The proposed method enables a systematic mapping of peaks to genes, informative for downstream analysis such as inferring gene regulatory networks at the single-cell level.

As far as we know, coupled clustering is a problem different from other complex clustering tasks such as biclustering or multiview clustering. Biclustering (17⇓–19), also called block clustering or coclustering, has been used widely to cluster subjects and cluster genes simultaneously based on a *p* by *n* data matrix of expression measurements on *p* genes for *n* subjects. The same data matrix is used in the clustering in gene space as well as the clustering in subject space. In contrast, two different data matrices are used in coupled clustering of two separate samples. In multiview clustering (20), the set of features measured on each subject can be divided into two independent subsets; for example, one of them may represent gene expression measurements while the other represent accessibility measurements. The important difference between multiview clustering and coupled clustering is that in the former setting all features are measured on each subject, whereas in the latter one only one of the subsets can be measured on any subject. Clearly, coupled clustering is a more challenging task and requires external information such as subject domain knowledge or prior data to initialize the coupling.

## Materials and Methods

### Construction of Data Matrices.

From scATAC-seq data, we compute a data matrix *O*, where *i*th region in the *j*th cell (6). By region we mean union of predefined REs and peaks. From scRNA-seq data, we compute the data matrix *E* where *g*th gene in the *h*th cell (11). Details are given in *Materials and Methods*, *Data Processing*. Note that the scATAC-seq and the scRNA-seq data are not measured in the same cell (Fig. 1*A*).

### Construction of Coupling Matrix.

Our approach to the contraction of *A* is to look for a subset of genes whose expression is highly predictable from chromatin accessibility of REs. To do this, we take advantage of our recent work on modeling paired gene expression and chromatin accessibility data (on bulk samples) across diverse cellular contexts (7). From the paired expression and chromatin accessibility (PECA) model in that work, for each gene *g*, we can extract a set *S*_{g} of REs that regulate that gene. We consider the regression model of target gene (TG) expression (denoted as

We estimate the parameter *α*_{g} by fitting the penalized least-squares problem (Eq. **3**) based on expression and accessibility data on a diverse panel of cell lines [56 cell lines in the case of mouse and 148 cell lines in the case of human (Dataset S1)],

where δ is determined by fivefold cross-validation. After fitting the model, we select a set of well-predicted genes for which regression *R*^{2} is greater than 0.8. In this way, we selected 5,966 well-predicted genes in mouse and selected 6,253 well-predicted genes in human. In coupling matrix *A =* (

### Optimization Algorithm.

We optimize the object function in Eq. **1** by a multiplicative update algorithm,

where *i*th row and the *j*th column in matrix

### Cluster-Specific Features.

We apply a *t* test to define the cluster-specific genes and cluster-specific peaks, and the default *P*-value cutoff is 0.0001.

### Evaluation of the Clustering Results.

We evaluate the results in terms of consistency of true expression values and the predicted values. We calculated the *AW*_{1} with *R*. We use the determinant of correlation matrix *R* to measure the consistency of true expression values with the predicted values. Higher determinant means higher diagonal of the matrix, which means higher correlation between matched clusters and lower correlation between unmatched clusters. When *K* = 1, the determinant simply reflects the correlation between true gene expression and predicted gene expression. When *K* > 1, the determinant will integrate information from both within-cluster correlations and between-cluster correlations. For example, suppose there are three true clusters in the population but in our clustering result one of the true clusters is randomly split into two subclusters. In this situation, the clustering result has four clusters and all four within-cluster correlations will be high; however, the determinant will be close to zero because the correlation vectors due to the two subclusters will be highly colinear. Thus, the use of the determinant instead of the product of within-cluster correlations will offer protection against overpartitioning.

### Parameter Selection.

We solve optimization problems *R* can be used to select the tuning parameters. We choose the tuning parameters which give the highest determinant (*SI Appendix*, Fig. S9). The number of clusters *K* can be determined by a method similar to that in ref. 13.

### TF Motif Activity.

We use the software chromVAR (22) to calculate the TF motif activity on each single cell based on its scATAC-seq data.

### Single-Cell Sample at RA Day 4.

We generated a heterogeneous biological population of cells that arise from the same origin. Specifically, we used the hanging-drop technique to form embryonic bodies (EBs) from mouse embryonic stem cells (mESCs) and induced differentiation by RA treatment. After 4 d of induction, we sample cells for bulk RNA-seq and bulk ATACC-seq experiments for use in validating the coupling. To test the coupled-NMF clustering method, we also generated scATAC-seq and scRNA-seq on the RA day-4 population. After removing low read-count cells (3,000 in RNA-seq and 10,000 in ATAC-seq), we get ATAC-seq data and RNA-seq data on 415 and 463 single cells, respectively.

### Data Processing.

We align the scATAC-seq reads to reference genome mm9 and remove duplicates. We employed MACS2 (23) to do peak calling by merging all of the reads from all of the single cells. We consider only the narrow peaks which are at least present (one or more reads) on 10 cells. Read counts for each region on each cell are calculated by bedtools (24) with the intersect command. Features defined from scATAC-seq data consist of an openness index on regions including REs and narrow peaks from MACS2. REs include promoters and enhancers. We use REs that regulate at least one TG from the PECA network (7).

scRNA-seq raw reads are mapped to mm10 by STAR (25) with ENCODE options. Gene expression TPM are calculated by RSEM (26). The transcriptome annotation we use is GENCODE vM16.

### Simulation of scRNA-Seq and scATAC-Seq.

We simulate scRNA-seq data for each single cell from bulk RNA-seq data by following the Splatter pipeline (27). Specifically, it includes three steps: (*i*) adding noise on expression data *T*, *ii*) getting expected read counts per gene *N* is the total number of read counts in bulk data, *i*, and *P* reflects the sequencing depth for each single cell *iii*) getting the observed read counts for each gene,

### Experimental Design of RA-Induced mESC Differentiation.

mESC lines R1 were obtained from ATCC. The mESCs were first expanded on a mouse embryonic fibroblasts feeder layer previously irradiated. Then, subculturing was carried out on 0.1% bovine gelatin-coated tissue culture plates. Cells were propagated in mESC medium consisting of Knockout DMEM supplemented with 15% Knockout Serum Replacement, 100

mESCs were differentiated using the hanging-drop method (28). Trypsinized cells were suspended in differentiation medium (mESC medium without LIF) to a concentration of 50,000 cells/mL. Twenty-microliter drops (∼1,000 cells) were then placed on the lid of a bacterial plate and the lid was placed upside down. After 48 h incubation, EBs formed at the bottom of the drops were collected and placed in the well of a six-well ultralow attachment plate with fresh differentiation medium containing 0.5 μM RA for up to 4 d, with the medium being changed daily.

### scATAC-Seq.

We followed the scATAC-seq protocol published by Buenrostro et al. (2) with the following modifications. The EBs were first incubated with StemPro Accutase cell dissociation reagent (Gibco) at 37 °C for 10 min, and then the EBs were gently pipetted for an additional 15 min to obtain a single-cell suspension. To further remove nondissociated EBs, the cell suspension was filtered sequentially with a 40-μM cell strainer (BD Falcon) and a 20-μM pluriStrainer (pluriSelect). After washing three times with C1 DNA Seq Cell Wash Buffer, cells at a concentration of 350–400 cells/μL were loaded on the C1 Single-Cell Auto Prep System (Fluidigm, Inc.). Single cells were captured and processed on a 10- to 17-μM IFC microfluidic chip using ATAC-seq scripts (2). A total of seven IFC chips were included in this study. The library was sequenced on Illumina NextSeq with 75-bp paired-end reads.

### scRNA-Seq.

To prepare a scRNA-seq library, we followed the SMART-Seq v4 Ultra Low Input RNA Kit for the Fluidigm C1 System (Clontech Laboratories, Inc.). The EBs were first dissociated with Accutase as described previously. Cells at a concentration of 200–250 cells/μL were then loaded on the C1 Single-Cell Auto Prep System (Fluidigm, Inc.). The single cells were captured and processed on a 10- to 17-μM IFC microfluidic chip, using SMART-Seq v4 scripts. A total of five IFC chips were included in this study. After harvest, cDNA concentration for each sample was measured using the Fragment Analyzer Automated CE System (Advanced Analytical Technologies, Inc.) and the cDNA concentration we used for Nextera XT library preparation is ∼0.2 ng/μL. The library was sequenced on Illumina HiSeq with 100-bp paired-end reads.

### Software and Data.

Software and simulation data are available at http://web.stanford.edu/∼zduren/CoupledNMF/. Single-cell gene expression data and chromatin accessibility data of RA induction have been deposited in the GEO database under accession nos. GSE115968 and GSE115970.

## Acknowledgments

This work was supported by National Institutes of Health Grants R01HG007834, R01GM109836, and P50HG007735 and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13000000).

## Footnotes

↵

^{1}Z.D., X.C., and M.Z. contributed equally to this work.- ↵
^{2}To whom correspondence should be addressed. Email: whwong{at}stanford.edu.

Author contributions: H.Y.C., Y.W., and W.H.W. designed research; Z.D., X.C., W.Z., and A.T.S. performed research; Z.D., M.Z., and W.H.W. analyzed data; and Z.D., M.Z., and W.H.W. wrote the paper.

Reviewers: A.D.S., University of Southern California; and N.R.Z., University of Pennsylvania.

The authors declare no conflict of interest.

Data deposition: The single- cell gene expression data and chromatin accessibility data of RA induction reported in this paper have been deposited in the Gene Expression Omnibus (GEO) database, https://www.ncbi.nlm.nih.gov/geo (accession nos. GSE115968 and GSE115970).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1805681115/-/DCSupplemental.

- Copyright © 2018 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Zamanighomi M, et al.

- ↵
- Duren Z,
- Chen X,
- Jiang R,
- Wang Y,
- Wong WH

- ↵
- ↵
- ↵
- ↵
- ↵
- Lara-Astiaso D, et al.

- ↵
- Brunet J-P,
- Tamayo P,
- Golub TR,
- Mesirov JP

- ↵
- Gorkin D, et al.

- ↵
- Lin S-C, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- Bickel S,
- Scheffer T

*Proceedings of the IEEE International Conference on Data Mining (ICDM)*, pp 19–26. - ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Wang X,
- Yang P

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Statistics

- Biological Sciences
- Genetics