Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations
- (a) Department of Statistics, Stanford University, Stanford, CA 94305;
- (b) Department of Biomedical Data Science, Stanford University, Stanford, CA 94305;
- (c) Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305;
- (d) Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, Department of Automation, Tsinghua University, 100084 Beijing, China;
- (e) Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100080 Beijing, China;
- (f) Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, 650223 Kunming, China
Contributed by Wing Hung Wong, June 14, 2018 (sent for review April 4, 2018; reviewed by Andrew D. Smith and Nancy R. Zhang)

Significance
Biological samples are often heterogeneous mixtures of different types of cells. Suppose we have two single-cell datasets, each providing information on a different cellular feature and each generated on a different sample from this mixture. The clusterings of cells in the two samples should then be coupled, because both reflect the underlying cell types in the same mixture. This “coupled clustering” problem is new and is not covered by existing clustering methods. In this paper, we develop an approach to solving it based on the coupling of two nonnegative matrix factorizations. The method should be useful for integrative single-cell genomics analysis tasks such as the joint analysis of single-cell RNA-sequencing and single-cell ATAC-sequencing data.
Abstract
When different types of functional genomics data are generated on single cells from different samples of cells from the same heterogeneous population, the clustering of cells in the different samples should be coupled. We formulate this “coupled clustering” problem as an optimization problem and propose the method of coupled nonnegative matrix factorizations (coupled NMF) for its solution. The method is illustrated by the integrative analysis of single-cell RNA-sequencing (RNA-seq) and single-cell ATAC-sequencing (ATAC-seq) data.
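To make the idea of coupling two factorizations concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not state the objective, so the sketch assumes one plausible form in which each data matrix is factorized separately and a trace term rewards agreement between the two feature-loading matrices through a nonnegative coupling matrix `A`. The function name `coupled_nmf`, the matrix `A`, the weights `lam1`, `lam2`, and `mu`, and the multiplicative update rules are all illustrative assumptions.

```python
import numpy as np


def coupled_nmf(X1, X2, A, k, lam1=1.0, lam2=0.01, mu=0.1, n_iter=200, seed=0):
    """Illustrative coupled NMF (an assumed formulation, not the paper's code).

    X1 : (p1, n1) nonnegative matrix, features x cells, first data type
    X2 : (p2, n2) nonnegative matrix, features x cells, second data type
    A  : (p2, p1) nonnegative coupling matrix linking the two feature sets
         (hypothetical; how such a matrix is obtained is not covered here)
    k  : number of clusters shared by the two factorizations

    Assumed objective, minimized over nonnegative W1, H1, W2, H2:
        0.5 * ||X1 - W1 H1||^2 + 0.5 * lam1 * ||X2 - W2 H2||^2
        - lam2 * trace(W2^T A W1) + 0.5 * mu * (||W1||^2 + ||W2||^2)
    The mu penalty keeps the coupling reward bounded when
    mu >= lam2 * (largest singular value of A).
    """
    rng = np.random.default_rng(seed)
    eps = 1e-10  # guards against division by zero
    p1, n1 = X1.shape
    p2, n2 = X2.shape
    W1, H1 = rng.random((p1, k)), rng.random((k, n1))
    W2, H2 = rng.random((p2, k)), rng.random((k, n2))

    for _ in range(n_iter):
        # Multiplicative updates: negative gradient part over positive part.
        # Every factor is nonnegative, so the iterates stay nonnegative.
        W1 *= (X1 @ H1.T + lam2 * (A.T @ W2)) / (W1 @ (H1 @ H1.T) + mu * W1 + eps)
        H1 *= (W1.T @ X1) / (W1.T @ W1 @ H1 + eps)
        W2 *= (lam1 * (X2 @ H2.T) + lam2 * (A @ W1)) / (
            lam1 * (W2 @ (H2 @ H2.T)) + mu * W2 + eps
        )
        H2 *= (W2.T @ X2) / (W2.T @ W2 @ H2 + eps)

    return W1, H1, W2, H2


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X1 = rng.random((6, 8))   # e.g. 6 features measured on 8 cells
    X2 = rng.random((5, 7))   # e.g. 5 features measured on 7 other cells
    A = rng.random((5, 6))    # placeholder coupling matrix
    W1, H1, W2, H2 = coupled_nmf(X1, X2, A, k=2)
    # Assign each cell to the cluster with the largest coefficient.
    print(H1.argmax(axis=0), H2.argmax(axis=0))
```

Under these assumptions, column `j` of `W1` and column `j` of `W2` describe the same cluster in the two feature spaces, so the labels `H1.argmax(axis=0)` and `H2.argmax(axis=0)` refer to a shared set of `k` clusters even though the two samples contain different cells.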
Footnotes
¹ Z.D., X.C., and M.Z. contributed equally to this work.
² To whom correspondence should be addressed. Email: whwong@stanford.edu.
Author contributions: H.Y.C., Y.W., and W.H.W. designed research; Z.D., X.C., W.Z., and A.T.S. performed research; Z.D., M.Z., and W.H.W. analyzed data; and Z.D., M.Z., and W.H.W. wrote the paper.
Reviewers: A.D.S., University of Southern California; and N.R.Z., University of Pennsylvania.
The authors declare no conflict of interest.
Data deposition: The single-cell gene expression data and chromatin accessibility data of RA induction reported in this paper have been deposited in the Gene Expression Omnibus (GEO) database, https://www.ncbi.nlm.nih.gov/geo (accession nos. GSE115968 and GSE115970).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1805681115/-/DCSupplemental.
- Copyright © 2018 the Author(s). Published by PNAS.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
Article Classifications
- Physical Sciences
- Statistics
- Biological Sciences
- Genetics