Using machine learning to identify disease-relevant regulatory RNAs

September 17, 2013
110 (39) 15516-15517
A complex eukaryotic genome contains several hundred to thousands of transcriptional and posttranscriptional regulators, which together make it possible to encode specific patterns of gene expression for many different conditions (1). The cooperation between transcription factors (TFs) and microRNAs (miRNAs) has been a particularly interesting topic, because they can regulate each other and define molecular network motifs with quantitative properties that either regulatory process alone cannot easily achieve (2, 3). Consequently, TF–miRNA interactions have been found to play important roles in biomedically relevant processes ranging from cancer to stem cell differentiation (4, 5). Computational approaches are increasingly used to integrate data interrogating different layers of gene regulation and successfully predict the specific context under which regulatory factors play important roles. In PNAS, Schulz et al. provide a compelling example of how machine learning is successfully applied to expression and regulatory sequence data to identify new roles for miRNAs in lung development and related diseases such as pulmonary fibrosis (6).

Machine Learning in Systems Biology

Machine learning techniques, in particular classification, have a long and strong history in computational biology. Many genomewide inference problems related to molecular sequence and functional analysis can be phrased as classification problems. To this end, each sample is represented by a set of features (such as matches to a set of TF binding sites or simply all DNA/RNA substrings of a certain length). Sometimes, the classification result alone is of interest [such as when annotating protein domains or protein function (7)]; in more complex scenarios, classifiers are used in the context of a larger system [such as splice site recognizers in a model that annotates complex transcript structures (8)].
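As a minimal sketch of the second kind of feature representation mentioned above (all substrings of a certain length), the following Python function turns a sequence into a fixed-length count vector. The function name, the choice of k, and the example sequence are illustrative assumptions, not taken from any of the cited studies.

```python
from collections import Counter
from itertools import product


def kmer_features(seq: str, k: int = 3, alphabet: str = "ACGT") -> list[int]:
    """Count every length-k substring of seq, reported in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One entry per possible k-mer, so all samples share the same vector layout.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


# Hypothetical example sequence; with k=2 the vector has 4**2 = 16 entries.
vec = kmer_features("ACGTACGT", k=2)
```

Each sample then becomes one row of a feature matrix that a classifier can be trained on; k-mers absent from a sequence simply receive a count of zero.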
Genomewide data are inherently noisy, and our knowledge of the underlying biological processes is incomplete. This suggests the use of approaches that are able to use existing knowledge and learn from the data in a manner that allows for generalizing well to new examples. Probabilistic approaches do not make use of hard yes/no classification rules; rather, they use distributions to represent how frequently or specifically features occur in samples from different classes and how they are correlated with each other. Learning (or training) refers to the estimation of these distributions from available data. If a successful classifier can be trained, the assumption is that this tells us something about the biological question by identifying useful features, i.e., those that contributed most to the classification performance. It is this aspect that connects machine learning to systems biology: Based on quantitative models, testable computational predictions can be made, for instance, about the role of regulatory factors.
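To make the notion of learning as the estimation of distributions concrete, here is a small sketch of a Bernoulli naive Bayes classifier in Python. It is a deliberately simple stand-in: it estimates how frequently each binary feature occurs in each class, but it assumes features are independent given the class and therefore ignores the correlations mentioned above. The toy data, labels, and function names are made up for illustration.

```python
import math


def train_naive_bayes(samples, labels, pseudocount=1.0):
    """Estimate P(class) and P(feature present | class) from binary vectors."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    model = {}
    for c in classes:
        rows = [x for x, y in zip(samples, labels) if y == c]
        prior = len(rows) / len(samples)
        # Smoothed frequency with which each feature occurs in class c.
        probs = [
            (sum(r[j] for r in rows) + pseudocount) / (len(rows) + 2 * pseudocount)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model


def posterior(model, x):
    """Return P(class | x), assuming features are independent given the class."""
    log_scores = {}
    for c, (prior, probs) in model.items():
        s = math.log(prior)
        for xj, pj in zip(x, probs):
            s += math.log(pj if xj else 1.0 - pj)
        log_scores[c] = s
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Toy data: columns are hypothetical motif-presence flags, labels are made up.
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
y = ["up", "up", "up", "down", "down", "down"]
model = train_naive_bayes(X, y)
post = posterior(model, [1, 0])
```

The output is a probability for each class rather than a hard yes/no call, and the estimated per-class feature frequencies can be inspected to see which features separate the classes.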

Predictive Models for Gene Expression

Although the expression of any given gene is defined by only a small subset of the hundreds of possible factors, the regulatory sites recruiting the factors are short and degenerate and therefore frequently appear in the genome by chance. Predicting expression from sequence alone is therefore a highly challenging task; sequence itself is static, so to identify context-specific features, it is crucial to know in which condition they may play a role. Protocols that measure in vivo protein–nucleotide interactions, such as ChIP for DNA or cross-linking and immunoprecipitation for RNA, provide this information but only for the specific biological states that are experimentally profiled. This information may therefore not be available for the system under investigation and/or may be too cumbersome to obtain for the set of all potentially relevant regulators.
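The point that short sites appear frequently by chance can be illustrated with a back-of-the-envelope calculation. The sketch below assumes independent, equally frequent bases and scans one strand only; the site length, scanned region length, and gene count are illustrative assumptions rather than values from the study.

```python
def expected_chance_matches(seq_length: int, site_length: int, n_variants: int = 1) -> float:
    """Expected number of matches to a short site in random sequence.

    Assumes independent, equally frequent bases (a deliberate simplification)
    and counts one strand only. n_variants is the number of distinct
    sequences accepted as a match, which models degeneracy.
    """
    positions = max(seq_length - site_length + 1, 0)
    return positions * n_variants / 4 ** site_length


# A 7-nt site scanned over 1 kb of sequence per gene, for 20,000 genes
# (all three numbers are illustrative assumptions).
per_gene = expected_chance_matches(1000, 7)
genome_wide = per_gene * 20000
```

Under these assumptions a single exact 7-nt site is expected about 0.06 times per gene, or on the order of 1,200 times across 20,000 genes, before any biological function is involved; allowing degenerate variants scales the count up proportionally.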
Schulz et al. combine static regulatory sequence information (predicted TF binding and miRNA target sites) with dynamic expression data of protein-coding genes and miRNAs collected during differentiation, to predict which factors best explain changes in expression over time. A specific set of miRNAs and their target transcripts were found to be differentially regulated in lung development. Observations were validated in cell culture experiments; crucially for future translational work, they were also found to be consistent with lung gene expression data from patients with idiopathic pulmonary fibrosis.
To leverage genomewide data successfully, one needs to carefully specify the problem that is to be solved. For instance, in the paper at hand, classifiers are built for subsets of previously coregulated genes whose expression diverges at specific time points. These classifiers are then used within the context of a model for the whole time course experiment, which allows for tolerating errors or incomplete knowledge: a gene can be put in the “wrong” class at a time point if its expression changes are more consistent with genes in other classes before and after. This only works when many examples are available; clearly, any single sequence-based prediction of a functional site (such as a microRNA target) can be wrong and, indeed, often is. Furthermore, every gene runs its own specific expression program, so there is a tradeoff when defining coexpressed genes.
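The idea that expression evidence from the rest of the time course can override a sequence-based class call can be sketched as follows. This is not the model of Schulz et al.; it is a simplified illustration that multiplies a classifier probability at a split by Gaussian likelihoods of the observed expression under each candidate path. The path names, means, standard deviation, and expression values are all hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and sd at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def path_posterior(classifier_prob, path_means, expression, sd=1.0):
    """Combine a sequence-based class probability with expression evidence.

    classifier_prob: {path: P(path | regulatory features)} at the split
    path_means:      {path: [mean expression at each later time point]}
    expression:      observed expression of one gene at those time points
    """
    scores = {}
    for path, prior in classifier_prob.items():
        like = 1.0
        for obs, mu in zip(expression, path_means[path]):
            like *= gaussian_pdf(obs, mu, sd)
        scores[path] = prior * like
    total = sum(scores.values())
    return {path: s / total for path, s in scores.items()}


# Hypothetical gene: the sequence classifier favours "down", but its
# expression after the split tracks the "up" path.
post = path_posterior(
    classifier_prob={"up": 0.3, "down": 0.7},
    path_means={"up": [1.0, 2.0, 2.0], "down": [-1.0, -2.0, -2.0]},
    expression=[0.8, 1.5, 2.2],
)
```

In this toy case the gene ends up assigned to the "up" path with near certainty even though the sequence-based classifier alone would have placed it in "down", which is the kind of error tolerance described above.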
To avoid overfitting a model to the data (i.e., learning things from the training data that are not generally important), the model complexity needs to be controlled: the fewer data one has, the fewer parameters can be reliably estimated. Sparsity constraints directly incorporate this into the training by modifying the objective function that is optimized, which becomes a tradeoff between how well the model explains the data and how complicated the model is, i.e., how many of the possible features it effectively uses (9). This implicit feature selection provides small, frequently interpretable (if not necessarily unique) sets of candidates with possible direct or indirect function. With enough data, machine learning can lead to meaningful models for genomewide data: not for a single gene, but for a class of genes with a similar behavior. In this context, a regulatory network specifies “modules” of regulators and sets of genes and not individual interactions between molecules (10).
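A sparsity constraint of the lasso type (9) can be sketched in a few lines. The example below uses plain linear regression with an L1 penalty, solved by cyclic coordinate descent, as a stand-in for whatever classifier is used in practice; the toy design matrix, response values, and penalty strength are assumptions chosen for illustration.

```python
def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero; return exactly 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w


# Toy data: y is driven mainly by feature 0; features 1 and 2 only pick up noise.
X = [[1, 1, -1], [-1, 1, 1], [1, -1, 1], [-1, -1, -1]]
y = [2.2, -1.8, 2.0, -2.4]
w_sparse = lasso(X, y, lam=0.5)  # penalised fit
w_dense = lasso(X, y, lam=0.0)   # ordinary least squares for comparison
```

With the penalty switched on, the weights of the two weakly supported features become exactly zero and only feature 0 is retained, at the cost of some shrinkage of its weight; without the penalty all three features receive nonzero weights. The first term of the objective measures fit to the data, the second measures model complexity, and `lam` sets the tradeoff between them.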

Accounting for Different Types of Regulators

Extending the authors’ previous work (11), a distinguishing feature of this study is its inclusion of small noncoding RNAs. miRNAs have a well-documented role as lineage specifiers that act as repressors of genes important for a previous developmental stage (12, 13). To reflect this, miRNAs are constrained in the model to exert negative roles on expression levels. In the context of regulatory networks, miRNAs have additionally been found to stabilize gene expression levels rather than repress them (14, 15); at coarse temporal resolution, such mild repression may manifest itself as coactivation, as both a miRNA and its targets are now active and were not before. Documented roles of small RNAs are ever expanding; more precise data on the production and decay of mRNAs will allow for relaxing this constraint within models that investigate a wider spectrum of miRNA functions in other systems.
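One simple way to impose a sign constraint of this kind is projected gradient descent: after every update, weights that must not be positive are clipped at zero. The sketch below applies this to a least-squares fit and is only an illustration of the principle, not the procedure used by Schulz et al.; the column roles (one TF, one miRNA), the data, and the function name are hypothetical.

```python
def fit_sign_constrained(X, y, nonpositive, lr=0.1, n_iter=500):
    """Least-squares fit by projected gradient descent.

    nonpositive: indices of features (here, hypothetical miRNA columns)
    whose weights are forced to be <= 0 after every gradient step.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        residual = [sum(X[i][k] * w[k] for k in range(p)) - y[i] for i in range(n)]
        grad = [sum(residual[i] * X[i][j] for i in range(n)) / n for j in range(p)]
        w = [w[j] - lr * grad[j] for j in range(p)]
        # Projection: a constrained weight may repress or do nothing, never activate.
        for j in nonpositive:
            w[j] = min(w[j], 0.0)
    return w


# Toy data: column 0 is a TF feature, column 1 a miRNA feature. Here the
# miRNA feature is positively associated with expression.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [1.5, -0.5, 0.5, -1.5]
w_free = fit_sign_constrained(X, y, nonpositive=[])
w_constrained = fit_sign_constrained(X, y, nonpositive=[1])
```

In the unconstrained fit the miRNA feature receives a positive weight, whereas the constrained fit sets it to zero and explains the data through the TF feature alone. This also illustrates the caveat raised above: if a miRNA and its targets rise together, a strictly repressive constraint will leave that miRNA out of the model.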
A considerable part of the regulatory code is located distally from the genes, such as in enhancers. Due to the lack of relevant data on the location of these regions, approaches (including that of Schulz et al.) have often been constrained to gene-proximal noncoding regions. Because of ongoing efforts such as the Encyclopedia of DNA Elements (ENCODE) or the NIH Roadmap Epigenomics Consortium, comprehensive in vivo data on open chromatin, or chromatin states defined by histone modifications, have become available (1, 16). Making use of condition-specific information such as the state of chromatin has proven to be a promising inroad, as evidenced by successful studies on dissecting the regulatory code for specific gene sets (17, 18). This provides an opportunity for further improvements, because the noise in locating functional elements in large noncoding sequences can be reduced by orders of magnitude. Knowing the precise locations of functional sites will be especially important when including interaction terms between features, i.e., when relaxing the assumption of independence between regulators.

From Genomewide Data to Validated Function

The study by Schulz et al. shows convincingly how genomewide data can be used to identify regulators with disease implications. Quantifying the expression programs that run in the normal and diseased state and building models to decode the regulatory sequence-based contribution are complementary to classical statistical genetics, where variants with associations are identified from large populations (19, 20). Noncoding variants can be distributed and compensatory across loci, and multiple coregulated genes may in turn contribute to a clinical phenotype.
Genomewide profiles have often been regarded with skepticism by quantitative systems biology researchers who aim at modeling precise interactions with biophysical approaches. Initiatives such as ENCODE have provided a wealth of functional genome annotation, but the focus of such initiatives has been on generating the data and not on integrating and modeling it to answer specific biological questions. Despite criticism leveled at some of the wide-reaching conclusions reached by ENCODE, the data now available to computational systems biology are vast and in themselves of immense value. We can expect many further contributions that demonstrate the possibilities opened up by the applications of machine learning to decipher the players in gene regulation that go awry in specific diseases.


References

MB Gerstein, et al., Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
M Megraw, S Mukherjee, U Ohler, Sustained-input switches for transcription factors and microRNAs are central building blocks of eukaryotic gene circuits. Genome Biol 14, R85 (2013).
O Hobert, Gene regulation by transcription factors and microRNAs. Science 319, 1785–1786 (2008).
KA O’Donnell, EA Wentzel, KI Zeller, CV Dang, JT Mendell, c-Myc-regulated microRNAs modulate E2F1 expression. Nature 435, 839–843 (2005).
A Marson, et al., Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134, 521–533 (2008).
MH Schulz, et al., Reconstructing dynamic microRNA-regulated interaction networks. Proc Natl Acad Sci USA 110, 15686–15691 (2013).
A Krogh, M Brown, IS Mian, K Sjölander, D Haussler, Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol 235, 1501–1531 (1994).
C Burge, S Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78–94 (1997).
R Tibshirani, Regression shrinkage and selection via the lasso: A retrospective. J R Stat Soc B 73, 273–282 (2011).
E Segal, N Friedman, D Koller, A Regev, A module map showing conditional activity of expression modules in cancer. Nat Genet 36, 1090–1098 (2004).
J Ernst, O Vainas, CT Harbison, I Simon, Z Bar-Joseph, Reconstructing dynamic regulatory maps. Mol Syst Biol 3, 74 (2007).
RC Lee, RL Feinbaum, V Ambros, The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–854 (1993).
N Bushati, A Stark, J Brennecke, SM Cohen, Temporal reciprocity of miRNAs and their targets during the maternal-to-zygotic transition in Drosophila. Curr Biol 18, 501–506 (2008).
E Hornstein, N Shomron, Canalization of development by microRNAs. Nat Genet 38, S20–S24 (2006).
J Tsang, J Zhu, A van Oudenaarden, MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Mol Cell 26, 753–767 (2007).
J Ernst, et al., Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
A Natarajan, GG Yardimci, NC Sheffield, GE Crawford, U Ohler, Predicting cell-type-specific gene expression from regions of open chromatin. Genome Res 22, 1711–1722 (2012).
X Dong, et al., Modeling gene expression using chromatin features in various cellular contexts. Genome Biol 13, R53 (2012).
K Chen, N Rajewsky, Natural selection on human microRNA binding sites inferred from SNP data. Nat Genet 38, 1452–1456 (2006).
AA Pai, et al., The contribution of RNA decay quantitative trait loci to inter-individual variation in steady-state gene expression levels. PLoS Genet 8, e1003000 (2012).

Information & Authors


Published in

Proceedings of the National Academy of Sciences
Vol. 110 | No. 39
September 24, 2013
PubMed: 24046375


Submission history

Published online: September 17, 2013
Published in issue: September 24, 2013


See companion article on page 15686.



Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, 13125 Berlin, Germany


Author contributions: U.O. wrote the paper.

Competing Interests

The author declares no conflict of interest.
