# Algorithm for cellular reprogramming

^{a}Department of Computational Medicine and Bioinformatics, Medical School, University of Michigan, Ann Arbor, MI 48109;^{b}Department of Curriculum Design, IXL Learning, Raleigh, NC 27560;^{c}Department of Pediatrics and Communicable Diseases, University of Michigan, Ann Arbor, MI 48109;^{d}Department of Biological Sciences, University of Maryland, College Park, MD 20742;^{e}Department of Hematology/Oncology, University of Michigan, Ann Arbor, MI 48109;^{f}Department of Mathematics, University of Michigan, Ann Arbor, MI 48109;^{g}John A. Paulson School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138

See allHide authors and affiliations

Edited by Steven Henikoff, Fred Hutchinson Cancer Research Center, Seattle, WA, and approved September 26, 2017 (received for review July 14, 2017)

## Significance

Reprogramming the human genome toward any desirable state is within reach; application of select transcription factors drives cell types toward different lineages in many settings. We introduce the concept of data-guided control in building a universal algorithm for directly reprogramming any human cell type into any other type. Our algorithm is based on time series genome transcription and architecture data and known regulatory activities of transcription factors, with natural dimension reduction using genome architectural features. Our algorithm predicts known reprogramming factors, top candidates for new settings, and ideal timing for application of transcription factors. This framework can be used to develop strategies for tissue regeneration, cancer cell reprogramming, and control of dynamical systems beyond cell biology.

## Abstract

The day we understand the time evolution of subcellular events at a level of detail comparable to physical systems governed by Newton’s laws of motion seems far away. Even so, quantitative approaches to cellular dynamics add to our understanding of cell biology. With data-guided frameworks we can develop better predictions about, and methods for, control over specific biological processes and system-wide cell behavior. Here we describe an approach for optimizing the use of transcription factors (TFs) in cellular reprogramming, based on a device commonly used in optimal control. We construct an approximate model for the natural evolution of a cell-cycle–synchronized population of human fibroblasts, based on data obtained by sampling the expression of 22,083 genes at several time points during the cell cycle. To arrive at a model of moderate complexity, we cluster gene expression based on division of the genome into topologically associating domains (TADs) and then model the dynamics of TAD expression levels. Based on this dynamical model and additional data, such as known TF binding sites and activity, we develop a methodology for identifying the top TF candidates for a specific cellular reprogramming task. Our data-guided methodology identifies a number of TFs previously validated for reprogramming and/or natural differentiation and predicts some potentially useful combinations of TFs. Our findings highlight the immense potential of dynamical models, mathematics, and data-guided methodologies for improving strategies for control over biological processes.

In 1989, pioneering work by Weintraub et al. (1) successfully reprogrammed human fibroblasts into muscle cells via overexpression of transcription factor (TF) MYOD1, becoming the first study to demonstrate that the natural course of cell development could be altered. In 2007, Yamanaka and coworkers (2) changed the paradigm further by successfully reprogramming human fibroblasts into an embryonic stem-cell–like state [induced pluripotent stem cells (iPSCs)], using four TFs: POU5F1, SOX2, KLF4, and MYC. This work showed that a differentiated cell state could be reverted to a more pluripotent state. These discoveries have changed the trajectory of regenerative medicine, opening the possibility of generating needed cell types on demand for repairing damaged or diseased tissues. Ultimately, patient-derived fibroblasts could be used in autologous transplantations to minimize immune incompatibility.

These remarkable findings also demonstrate that the genome is a system capable of being controlled via an external input of TFs. In this context, determining how to push the cell from one state to another is, at least conceptually, a classical problem of control theory (3). The difficulty arises in the fact that the dynamics—and even proper representations of the cell state and inputs—are not well defined in the context of cellular reprogramming. Nevertheless, it seems natural to treat reprogramming as a problem in control theory, with the final state being the desired reprogrammed cell. In this paper, we provide such a framework based on empirical data and demonstrate the potential of this framework to provide insights into cellular reprogramming (4).

Our goal is to mathematically identify TFs that can directly reprogram human fibroblasts into a desired target cell type. As part of our methodology, we create a model for the natural dynamics of proliferating human fibroblasts, using time series data collected throughout the cell cycle. We couple data from bioinformatics with methods of mathematical control theory—a framework that we dub data-guided control (DGC). We use this model to determine a principled way to identify the best TFs for efficient reprogramming.

Previously, selection of TFs for reprogramming has been based largely on trial and error, typically relying on TF differential expression between cell types for initial predictions. Recent work has sought to predict TFs for reprogramming the cell state (5⇓⇓–8). Rackham et al. (7) devised a predictive method based on differential expression, as well as gene and protein network data. Our approach is fundamentally different in that we take a dynamical systems point of view, opening avenues for investigating efficiency (probability of conversion), timing (when to introduce TFs), and optimality (minimizing the number of TFs and amount of input).

Our method identifies TFs previously found to reprogram human fibroblasts into embryonic stem-cell–like cells, muscle cells, and many additional target cell types. Furthermore, our analysis predicts the points in the cell cycle at which the introduction of TFs might most efficiently affect a desired change of cell state. In addition, we demonstrate the efficacy of using topologically associating domains (TADs) for genome dimension reduction. Implicit in this approach is the notion of distance between cell types, which is measured in terms of the amount of transcriptional change required to transform one cell type into another. In this way, we are able to provide a comprehensive quantitative view of human cell types based on the respective distances between them.

Our framework separates into three parts:

*i*) Define the state. Use structure and function observations of the initial and target cell types’ genomes to define a comprehensive state representation.*ii*) Model the dynamics. Apply model identification methods to approximate the natural dynamics of the genome from time series data.*iii*) Define and evaluate the inputs. Infer from bioinformatics (TF binding location and function) where TFs can influence the genome and then quantify controllability properties with respect to the target cell type.

The actual dynamics of the genome are undoubtedly very complicated, but as is often done in mathematical modeling studies, we use measurements to identify a linear approximation. This will take the form of a difference equation that is widely studied in the control systems literature, (9):

## Methods

### Genome-State Representation and Dimension Reduction: 𝐱 𝒌 .

The state representation x in Eq. **1** is the foundation for any control system and is critical for controllability analysis. To fully represent the state of a cell, a high number of measurements would need to be taken, including gene expression, protein level, chromatin conformation, and epigenetic measurements. As a simplification, we assume that the gene expression profile is a sufficient representation of the cell state.

Gene expression for a given cell is dependent on a number of factors, including (but not limited to) cell type, cell-cycle stage, circadian-rhythm stage, and growth conditions. To best capture the natural fibroblast dynamics from population-level data, time series RNA-seq was performed on cells that were cell-cycle and circadian-rhythm synchronized in normal growth medium conditions (*SI Appendix*). Before data collection, all cells were temporarily held in the first stage of the cell cycle, G_{0}/G_{1}, via serum starvation. Upon release into the cell cycle, the population was observed every *SI Appendix*, Fig. S1). Because of this, we define

An obstacle to using g to represent x in a dynamical systems approach is the computational feasibility of studying a system with over 20,000 variables, necessitating a dimension reduction. A comprehensive genome-state representation should include aspects of both structure and function and simultaneously have low enough dimension to be computationally reasonable. Along these lines, we propose a biologically inspired dimension reduction based on TADs.

The advent of genome-wide chromosome conformation capture (Hi-C) allowed for the studying of higher-order chromatin structure and the subsequent discovery of TADs (10). TADs are inherent structural units of chromosomes: contiguous segments of the 1D genome for which empirical physical interactions can be observed (11). Moreover, genes within a TAD tend to exhibit similar activity, and TAD boundaries have been found to be largely cell-type invariant (11, 12). TADs group structurally and functionally similar genes, serving as a natural dimension reduction that preserves important genomic properties. Fig. 1 depicts an overview of this concept. We computed TAD boundaries from Hi-C data via an algorithm that uses Fielder vector partitioning, described in Chen et al. (13) (*SI Appendix*).

Let

### State Transition Matrix: 𝐀 𝒌 .

Given the data we have, perhaps the most direct way to model the evolution of TAD activity level would be to assume a model of the form

Define a time-dependent state transition matrix

### Input Matrix and Input Signal: 𝐁 , 𝐮 𝒌 .

With the natural TAD-level dynamics established in the context of our control Eq. **1**, we turn our attention to quantifying methods for control.

A TF can regulate a gene positively or negatively by binding to a specific DNA sequence near a gene and encouraging or discouraging transcription. The degree to which a TF activates or represses gene expression depends on the specific TF–gene interaction, which is influenced by a variety of factors that are difficult to quantify. Let

Extensive TF perturbation experiments would be needed to determine *SI Appendix*). Position frequency matrices (PFMs), which give information on TF–DNA binding probability, were obtained for 547 TFs from a number of publicly available sources (*SI Appendix*, Fig. S2).

Although many TFs can do both in the right circumstances, most TFs have a tendency toward either activator or repressor activity (15). That is, if TF m is known to activate (repress) most genes, we can say with some confidence that TF m is an activator (repressor), so *SI Appendix*). The remaining TFs were labeled unknown for lack of conclusive evidence and were evaluated as both an activator and a repressor in separate calculations. Here, we define

TFBSs are cell-type invariant since they are based strictly on the linear genome. However, it is known that for a given cell type, certain areas of the genome may be opened or closed, depending on epigenetic aspects. To capture cell-type–specific regulatory information, we obtained publicly available gene accessibility data (DNase-seq) on human fibroblasts (GSM1014531). DNase-seq extracts cell-type–specific chromatin accessibility information genome-wide by testing the genome’s sensitivity to the endonuclease DNase I and sequencing the nondigested genome fragments. These data are used for our initial cell type to determine which genes are available to be controlled by TFs (16). Here, we define *SI Appendix*).

We approximate

Since we are working off a TAD-dimensional model, our input matrix B must match this dimension. Let

The amount of control input is captured in

With all variables of our control Eq. **1** defined, we can now attempt to predict which TFs will most efficiently achieve cellular reprogramming from some

### Selection of TFs.

Our general procedure for scoring TFs is explained as follows. Eq. **1** has an explicit solution that is given below. The first few terms are

If

When appropriate, we write

Let *lsqnonneg* function to solve Eq. **7**, which gives

Let

We consider different scenarios for the type of input regime in the results. The first one assumes the input signal is constant

### Remark.

Subsets of TFs were chosen for each calculation based on the following criteria: ≥10-fold expression increase in target state compared with initial state and ≥10 RPKM in target state. These criteria are used to select differentially expressed TFs and TFs that are sufficiently active in the target state.

## Results

### Quantitative Measure Between Cell Types.

To best use our algorithm to predict TFs for reprogramming, compatible data on target cell types must be collected. For this, we explore a number of publicly available databases where RNA-seq has been collected, along with RNA-seq data collected in our laboratory. The ENCODE Consortium has provided data on myotubes and embryonic stem cells (ESCs) (*SI Appendix*) (18). The GTEx portal provides RNA-seq data on a large variety of different human tissue types (19). Although each GTEx experiment is performed on tissue samples, thus containing multiple different cell types, we use these data as more general cell-state targets.

To give a numerical structure to cell-type differences, conceptually similar to Waddington’s epigenetic landscape, we calculate *A* shows *SI Appendix*). GTEx RNA-seq data are scaled to keep total RPKM difference between time series fibroblast and GTEx fibroblast RNA-seq minimal (*SI Appendix*).

### TF Scores.

To assess our method’s predictive power, a subset of target cell types is presented here that has validated either TF reprogramming methods or TFs highly associated with the target cell type. Additional predicted TFs for reprogramming are included in *SI Appendix*. We note that although experimentally validated TFs provide the best current standard for comparison, we believe experimental validation with our predicted TFs may provide more efficient and comprehensive reprogramming results. For all reprogramming regimes presented in this section, fibroblast is used as the initial cell type due to the availability of synchronized time series data, and all TFs are introduced at

For conversion of fibroblast to myotubes, the top predicted single-input TFs are MYOG and MYOD1, both of which are known to be crucial for myogenesis. While MYOD1 is the classic master regulator reprogramming TF for myotube conversion, activation of downstream factor MYOG is necessary for full conversion (20). For fibroblast to ESC conversion, a number of TFs known to be necessary for pluripotency are predicted, including MYCN, ZFP42, NANOG, and SOX2 (2). With the knowledge that no single TF has been shown to fully reprogram a fibroblast to an embryonic state, combinations of TFs are more informative for this analysis. The top-scoring combination of three TFs is MYCN, NANOG, and POU5F1—three well-known markers for pluripotency (2). Interestingly, POU5F1 scores poorly when input individually, but is within the top set of three TFs when used in combination with MYCN and NANOG. Left ventricle reprogramming includes TFs that are known to be necessary for natural differentiation in the top score for all one to three combinations. These include GATA4 (a known TF in fibroblast to cardiomyocyte reprogramming), HEY2, and IRX4 (21⇓–23).

### Time-Dependent TF Addition.

Fibroblast to ESC conversion was of particular interest in our analysis as this is a well-studied regime with a number of validated TFs (with a variety of reported efficiencies), and this conversion is promising for its regenerative medicine application. High-scoring TFs yield many that are known markers for pluripotency, but the top combination of three, MYCN, NANOG, and POU5F1, has not been used specifically together, to our knowledge. Here, we analyzed how the TF combination would score if input at different points throughout the cell cycle.

Time-dependent analysis of the top-scoring ESC TFs reveals that scores vary widely, depending on the time of input. MYCN and NANOG show a strong preference for input at the beginning of the cell cycle, while POU5F1 shows a slight preference for input toward the end of the cell cycle, with the highest score achieved when MYCN and NANOG are input at 0 h and POU5F1 is input at 32 h. Analysis on how the time of input control affects μ is shown in Fig. 3*C*. Time-dependent analysis was also conducted for the top combination of three TFs for fibroblast to left ventricle. This analysis predicted that the best reprogramming results would occur if GATA4 is given immediately (0 h), with IRX4 and HEY2 given later (24 and 32 h, respectively).

## Discussion

The results from this algorithm show promise in their prediction of known reprogramming TFs and demonstrate the importance of including time series data for gene network dynamics. Time of input control has shown to have an impact on the end cell state, in line with what has been shown in natural differentiation (24).

While we believe that this is the best model currently available for predicting TFs for reprogramming, we are aware of its limitations and assumptions. TAD-based dimension reduction is based on the observation that genes within them correlate in expression over time, although we lack definitive proof of regulation by shared transcriptional machinery (11). This assumption was deemed necessary for dimension reduction in the context of deriving transition matrix

Although this program can score TFs relative to other TFs in a given reprogramming regime, it is difficult to predict a μ threshold that would guarantee conversion. Additionally, rigorous experimental testing will be required to validate these findings and determine how our u vector translates to TF concentration. This is a product of the large number of assumptions that we have made to develop the initial framework for a reprogramming algorithm. With finer resolution in the time series gene expression, more subtle aspects of the genomic network may be observed, allowing for better prediction.

Our proposed DGC framework successfully identified known TFs for fibroblast to ESC and fibroblast to muscle cell reprogramming regimes. We use a biologically inspired dimension reduction via TADs, a natural partitioning of the genome. This comprehensive state representation was the foundation of our framework, and the success of our methods motivates further investigation of the importance of TADs as functional units to control the genome.

A dynamical systems view of the genome allows for analysis of timing, efficiency, and optimality in the context of reprogramming. Our framework is the first step toward this view. The successful implementation of time-varying reprogramming regimes would open unique avenues for direct reprogramming. This template can be used to develop regimes for changing any cell into any other cell, for applications that include reprogramming cancer cells and controlling the immune system. Our DGC framework is well equipped for designing personalized cellular reprogramming regimes. Finally, this framework can serve as a general technique for investigating the controllability of networks strictly from data.

## Materials and Methods

Hi-C and RNA-seq data were collected from cell-cycle– and circadian-rhythm–synchronized proliferating human fibroblasts of normal karyotype. Data were collected every 8 h, spanning 56 h. Publicly available data were used for target cell types. Detailed materials and methods are provided in Chen et al. (11), Dataset S1, and in *SI Appendix*.

## Acknowledgments

We thank Robert Oakes, Emily Crossette, and Sijia Liu for their critical reading of the manuscript and helpful discussions. We especially thank James Gimlett and Srikanta Kumar at Defense Advanced Research Projects Agency (DARPA) for support and encouragement. This work is supported, in part, by the DARPA Biochronicity, Deep-Purple, and FunCC programs.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: indikar{at}umich.edu.

Author contributions: R.B. and I.R. designed research; S.R., G.P., A.B., R.B., and I.R. performed research; S.R., G.P., S.L., H.C., M.B., M.S.W., A.B., R.B., and I.R. analyzed data; and S.R., G.P., L.A.M., S.L., A.B., R.B., and I.R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1712350114/-/DCSupplemental.

- Copyright © 2017 the Author(s). Published by PNAS.

This is an open access article distributed under the PNAS license.

## References

- ↵
- Weintraub H, et al.

- ↵
- ↵
- Brockett R

- ↵
- Rajapakse I,
- Groudine M,
- Mesbahi M

- ↵
- ↵
- ↵
- ↵
- Michael DG, et al.

- ↵
- Aström KJ,
- Murray RM

- ↵
- ↵
- Chen H, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Fischer A,
- Schumacher N,
- Maier M,
- Sendtner M,
- Gessler M

- ↵
- Nelson DO,
- Jin DX,
- Downs KM,
- Kamp TJ,
- Lyons GE

- ↵

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Biophysics and Computational Biology

- Biological Sciences
- Systems Biology