PHYSICAL SCIENCES / APPLIED PHYSICAL SCIENCES
The human disease network
*Center for Complex Network Research and Department of Physics, University of Notre Dame, Notre Dame, IN 46556;
Center for Cancer Systems Biology (CCSB) and ¶Department of Cancer Biology, Dana–Farber Cancer Institute, 44 Binney Street, Boston, MA 02115;
Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115;
Department of Physics, Korea University, Seoul 136-713, Korea; and ||Department of Pediatrics and the McKusick–Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205
Edited by H. Eugene Stanley, Boston University, Boston, MA, and approved April 3, 2007 (received for review February 14, 2007)
| Abstract |
|---|
biological networks | complex networks | human genetics | systems biology | diseasome
Here we take a conceptually different approach, exploring whether human genetic disorders and the corresponding disease genes might be related to each other at a higher level of cellular and organismal organization. Support for the validity of this approach is provided by examples of genetic disorders that arise from mutations in more than a single gene (locus heterogeneity). For example, Zellweger syndrome is caused by mutations in any of at least 11 genes, all associated with peroxisome biogenesis (10). Similarly, there are many examples of different mutations in the same gene (allelic heterogeneity) giving rise to phenotypes currently classified as different disorders. For example, mutations in TP53 have been linked to 11 clinically distinguishable cancer-related disorders (11). Given the highly interlinked internal organization of the cell (12–17), it should be possible to improve the single gene–single disorder approach by developing a conceptual framework to link systematically all genetic disorders (the human "disease phenome") with the complete list of disease genes (the "disease genome"), resulting in a global view of the "diseasome," the combined set of all known disorder/disease gene associations.
| Results |
|---|
Although the HDN layout was generated independently of any knowledge of disorder classes, the resulting network is naturally and visibly clustered according to major disorder classes. Yet, there are visible differences between different classes of disorders. Whereas the large cancer cluster is tightly interconnected due to the many genes associated with multiple types of cancer (TP53, KRAS, ERBB2, NF1, etc.) and includes several diseases with strong predisposition to cancer, such as Fanconi anemia and ataxia telangiectasia, metabolic disorders do not appear to form a single distinct cluster but are underrepresented in the giant component and overrepresented in the small connected components (Fig. 2a). To quantify this difference, we measured the locus heterogeneity of each disorder class and the fraction of disorders that are connected to each other in the HDN (see SI Text). We find that cancer and neurological disorders show high locus heterogeneity and also represent the most connected disease classes, in contrast with metabolic, skeletal, and multiple disorders that have low genetic heterogeneity and are the least connected (SI Fig. 7).
Properties of the DGN. In the DGN, two disease genes are connected if they are associated with the same disorder, providing a complementary, gene-centered view of the diseasome. Given that the links signify related phenotypic association between two genes, they represent a measure of their phenotypic relatedness, which could be used in future studies, in conjunction with protein–protein interactions (6, 7, 19), transcription factor-promoter interactions (20), and metabolic reactions (8), to discover novel genetic interactions. In the DGN, 1,377 of 1,777 disease genes are connected to other disease genes, and 903 genes belong to a giant component (Fig. 2b). Whereas the number of genes involved in multiple diseases decreases rapidly (SI Fig. 6d; light gray nodes in Fig. 2b), several disease genes (e.g., TP53, PAX6) are involved in as many as 10 disorders, representing major hubs in the network.
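The gene-centered projection described above can be sketched in a few lines. The following is a minimal illustration, not the authors' pipeline: it assumes the diseasome is available as a list of (disorder, gene) pairs, and it assumes that the HDN is built symmetrically, linking two disorders when they share at least one gene. The function name `project` and the toy associations are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def project(associations):
    """Project (disorder, gene) pairs onto disorder-disorder and gene-gene links."""
    genes_of = defaultdict(set)      # disorder -> associated genes
    disorders_of = defaultdict(set)  # gene -> associated disorders
    for disorder, gene in associations:
        genes_of[disorder].add(gene)
        disorders_of[gene].add(disorder)
    # Assumed HDN rule: two disorders are linked if they share at least one gene.
    hdn = {frozenset(pair)
           for disorders in disorders_of.values()
           for pair in combinations(sorted(disorders), 2)}
    # DGN rule from the text: two genes are linked if they share a disorder.
    dgn = {frozenset(pair)
           for genes in genes_of.values()
           for pair in combinations(sorted(genes), 2)}
    return hdn, dgn

# Toy associations, invented for illustration only.
toy = [("disorder A", "G1"), ("disorder A", "G2"),
       ("disorder B", "G2"), ("disorder C", "G3")]
hdn_links, dgn_links = project(toy)
```

Storing each link as a `frozenset` makes the links undirected, so the two networks can be compared or intersected with other edge sets directly.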
Functional Clustering of HDN and DGN. To probe how the topology of the HDN and DGN deviates from random, we randomly shuffled the associations between disorders and genes, while keeping the number of links for each disorder and disease gene in the bipartite network unchanged. Interestingly, the average size of the giant component of 10⁴ randomized disease networks is 643 ± 16, significantly larger than 516 (P < 10⁻⁴; for details of statistical analyses of the results reported hereafter, see SI Text), the actual size of the HDN (SI Fig. 6c). Similarly, the average size of the giant component from randomized gene networks is 1,087 ± 20 genes, significantly larger than 903 (P < 10⁻⁴), the actual size of the DGN (SI Fig. 6e). These differences suggest important pathophysiological clustering of disorders and disease genes. Indeed, in the actual networks disorders (genes) are more likely linked to disorders (genes) of the same disorder class. For example, in the HDN there are 812 links between disorders of the same class, an 8-fold enrichment with respect to 107 ± 10 links obtained between the same set of nodes in the randomized networks. This local functional clustering accounts for the small size of the giant components observed in the actual networks.
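One way to picture the randomization test is the sketch below. It is an illustration under stated assumptions rather than the procedure of the SI Text: the shuffle is implemented as repeated pairwise swaps of genes between two associations (a standard degree-preserving scheme, chosen here for simplicity), and the giant component is measured over disorders connected through shared genes. The names `shuffle_associations` and `giant_disorder_component`, the number of swaps, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def shuffle_associations(pairs, n_swaps, rng):
    """Degree-preserving shuffle: swap the genes of two randomly chosen associations."""
    edges = list(pairs)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (d1, g1), (d2, g2) = edges[i], edges[j]
        new1, new2 = (d1, g2), (d2, g1)
        # Skip swaps that change nothing or would duplicate an association.
        if d1 == d2 or g1 == g2 or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def giant_disorder_component(pairs):
    """Size of the largest group of disorders connected through shared genes."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    first_disorder = {}  # gene -> first disorder seen with that gene
    for disorder, gene in pairs:
        root = find(disorder)
        if gene in first_disorder:
            parent[root] = find(first_disorder[gene])  # merge the two groups
        else:
            first_disorder[gene] = disorder
    disorders = {d for d, _ in pairs}
    return max(Counter(find(d) for d in disorders).values())

# Toy associations, invented for illustration only.
toy = [("A", "g1"), ("B", "g1"), ("B", "g2"), ("C", "g2"), ("D", "g3"), ("E", "g4")]
observed = giant_disorder_component(toy)
rng = random.Random(0)
random_sizes = [giant_disorder_component(shuffle_associations(toy, 100, rng))
                for _ in range(200)]
mean_random = sum(random_sizes) / len(random_sizes)
```

Comparing `observed` with the distribution in `random_sizes` mirrors the comparison in the text between the actual giant component and the randomized average; an empirical P value could be estimated as the fraction of randomized sizes at or below the observed one.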
Disease-Associated Genes Identify Distinct Functional Modules. For several disorders known to arise from mutations in any one of a few distinct genes, the corresponding protein products have been shown to participate in the same cellular pathway, molecular complex, or functional module (21, 22). For example, Fanconi anemia arises from mutations in a set of genes encoding proteins involved in DNA repair, many of them forming a single heteromeric complex (23). Yet, the extent to which most disorders and disorder classes correspond to distinct functional modules in the cellular network has remained largely unclear. If genes linked by disorder associations encode proteins that interact in functionally distinguishable modules, then the proteins within such disease modules should more likely interact with one another than with other proteins. To test this hypothesis, we overlaid the DGN on a network of physical protein–protein interactions derived from high-quality systematic interactome mapping (6, 7) and literature curation (6). We found that 290 interactions overlap between the two networks, a 10-fold increase relative to random expectation (P < 10⁻⁶; Fig. 3a).
Disease genes encoding proteins that interact within common functional modules should tend to be expressed in the same tissue. To measure this, we introduced the tissue-homogeneity coefficient of a disorder, defined as the maximum fraction of genes among those belonging to a common disorder that are expressed in a specific tissue in a microarray data set obtained for 10,594 genes across 36 healthy tissues (25). We found that 68% of disorders exhibited almost perfect tissue-homogeneity (Fig. 3b), compared with 51% expected by chance (P < 10⁻⁵).
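The tissue-homogeneity coefficient defined above can be illustrated with a short sketch. It assumes that expression has already been reduced to a binary call per gene and tissue (how a gene is called "expressed" from the microarray data is not specified here), and that genes without expression data are simply ignored; both are assumptions, as are the name `tissue_homogeneity` and the toy data.

```python
def tissue_homogeneity(disorder_genes, expressed_in):
    """Maximum, over tissues, of the fraction of a disorder's genes expressed there."""
    # Keep only genes for which expression calls are available (assumption).
    genes = [g for g in disorder_genes if g in expressed_in]
    if not genes:
        return None
    tissues = set().union(*(expressed_in[g] for g in genes))
    return max(
        (sum(t in expressed_in[g] for g in genes) / len(genes) for t in tissues),
        default=0.0,
    )

# Toy expression calls (gene -> tissues where it is expressed), invented for illustration.
expressed_in = {"G1": {"liver", "kidney"}, "G2": {"liver"}, "G3": {"brain"}}
coefficient = tissue_homogeneity(["G1", "G2", "G3"], expressed_in)
perfect = tissue_homogeneity(["G1", "G2"], expressed_in)
```

A coefficient of 1 means that at least one tissue expresses every gene of the disorder, which corresponds to the "perfect tissue-homogeneity" case in the text.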
Finally, disease genes that participate in a common functional module should also show high expression profiling correlation (26). The distribution of Pearson correlation coefficients (PCCs) for the coexpression profiles of pairs of genes associated with the same disorder was shifted toward higher values compared with that of a random control (Fig. 3c; P < 10⁻⁶, χ² test). Similarly, the average PCC over all pairs of genes within a given disorder shows a significant shift from the random reference (Fig. 3d), with a small but clearly distinguishable peak in the distribution around PCC ≈ 0.75. This peak corresponds to ≈33 disorders with average PCC > 0.6 for which all genes are highly coexpressed in most tissues, including Heinz body anemia (PCC = 0.935), Bethlem myopathy (PCC = 0.835), and spherocytosis (PCC = 0.656).
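The per-disorder average PCC can be sketched as follows, assuming each gene comes with an expression profile measured across the same ordered set of tissues and that no profile is constant. The helper names `pearson` and `mean_pairwise_pcc` and the toy profiles are hypothetical and only illustrate the calculation.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mean_pairwise_pcc(genes, profiles):
    """Average PCC over all pairs of genes associated with one disorder."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        return None  # a disorder with a single gene has no pairs
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

# Toy expression profiles across four tissues, invented for illustration.
profiles = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 3, 2, 1]}
coexpressed = mean_pairwise_pcc(["G1", "G2"], profiles)
mixed = mean_pairwise_pcc(["G1", "G2", "G3"], profiles)
```

In the toy data the first disorder has perfectly coexpressed genes, while adding an anticorrelated gene pulls the average below zero.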
In summary, genes that contribute to a common disorder (i) show an increased tendency for their products to interact with each other through protein–protein interactions, (ii) have a tendency to be expressed together in specific tissues, (iii) tend to display high coexpression levels, (iv) exhibit synchronized expression as a group, and (v) tend to share GO terms. Together, these findings support the hypothesis of a global functional relatedness for disease genes and their products and offer a network-based model for the diseasome. Cellular networks are modular, consisting of groups of highly interconnected proteins responsible for specific cellular functions (21, 22). A disorder then represents the perturbation or breakdown of a specific functional module caused by variation in one or more of the components producing recognizable developmental and/or physiological abnormalities.
This model offers a network-based explanation for the emergence of complex or polygenic disorders: a phenotype often correlates with the inability of a particular functional module to carry out its basic functions. For extended modules, many different combinations of perturbed genes could incapacitate the module, as a result of which mutations in different genes will appear to lead to the same phenotype. This correlation between disease and functional modules can also inform our understanding of cellular networks by helping us to identify which genes are involved in the same cellular function or network module (21, 22).
Centrality and Peripherality. An early indication of the connection between the structure of a cellular network and its functional properties was the finding that in Saccharomyces cerevisiae highly connected proteins or "hubs" are more likely encoded by essential genes (15, 16). This prompted a number of recent studies (27, 28) to formulate the hypothesis that human disease genes should also have a tendency to encode hubs. Yet, previous measurements found only a weak correlation between disease genes and hubs (29), resulting in an important mystery: what is the role, if any, of the cellular network in human diseases? Are disease genes more likely to encode hubs in the cellular network?
Our initial analysis appears to support the hypothesis that disease genes, given their impact on the organism, display a tendency to encode hubs in the interactome (27, 28): disease-related proteins have a 32% larger number of interactions (6, 7) with other proteins (average degree) than the nondisease proteins (see SI Fig. 9), and high-degree proteins are more likely to be encoded by genes associated with diseases than proteins with few interactions (P = 1.6 × 10⁻¹⁷; Fig. 4a). Next, we show, however, that despite this apparent correlation, the relationship between diseases and hubs hides deep differences between various disease genes.
First, we find that essential proteins show a tendency to be associated with hubs (P = 1.3 × 10⁻¹⁷; Fig. 4c), displaying a much stronger trend than the one observed for all disease proteins (Fig. 4a). This raises an important question: Could the observed correlation between disease genes and hubs (Fig. 4a) be the sole consequence of the fact that a small fraction (22%) of disease genes is also essential? To address this question we measured the degree dependence of the nonessential disease proteins (Fig. 4d). Surprisingly, the correlation between hubs and disease proteins entirely disappears. Thus, the vast majority of disease genes (78%), those that are nonessential, do not show a tendency to encode hubs, indicating that the observed weak correlation between hubs and disease genes (Fig. 4a) was entirely due to the few essential genes within the disease gene class.
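The stratification argument above (checking whether the hub trend survives once essential genes are set aside) can be illustrated with a small sketch. This is one plausible way to tabulate the data, not the analysis behind Fig. 4: the degree bins, the record layout `(degree, is_disease, is_essential)`, the function name, and the toy records are all assumptions.

```python
from collections import defaultdict

def disease_fraction_by_degree(genes, bins, exclude_essential=False):
    """Fraction of disease genes in each degree bin (half-open bins: lo <= k < hi)."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [disease genes, all genes]
    for degree, is_disease, is_essential in genes:
        if exclude_essential and is_essential:
            continue  # restrict the analysis to nonessential genes
        for lo, hi in bins:
            if lo <= degree < hi:
                counts[(lo, hi)][0] += int(is_disease)
                counts[(lo, hi)][1] += 1
                break
    return {b: disease / total for b, (disease, total) in counts.items()}

# Toy records (degree, is_disease, is_essential), invented for illustration.
toy_genes = [(1, False, False), (2, True, False),
             (20, True, True), (25, False, False), (30, True, True)]
bins = [(0, 10), (10, 100)]
all_genes = disease_fraction_by_degree(toy_genes, bins)
nonessential = disease_fraction_by_degree(toy_genes, bins, exclude_essential=True)
```

In the toy data the high-degree bin looks enriched for disease genes only while essential genes are included, which is the pattern of confounding the paragraph describes.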
To carry out its basic functions, the cell needs to maintain the coordinated activity of important functional modules, driving in a relatively synchronized manner the expression patterns of the most important genes. Therefore, one expects that the expression pattern of both essential and disease genes will be synchronized with a significant number of other genes. To test this, we determined the average gene coexpression coefficient $\langle \mathrm{PCC} \rangle_i$, the average of $\mathrm{PCC}_{ij}$ over $j$, between an essential (or nonessential disease) gene $i$ and all other genes $j$ in the cell, calculating the $\mathrm{PCC}_{ij}$ values from healthy human tissue microarray measurements (25). Confirming our expectation, we find that genes that display high average coexpression $\langle \mathrm{PCC} \rangle_i$ with all other genes are more likely to be essential than those that show small or negative $\langle \mathrm{PCC} \rangle_i$ (P = 1.7 × 10⁻⁴; Fig. 4e). Surprisingly, however, nonessential disease genes show the opposite effect, being associated with genes whose expression pattern is anticorrelated or uncorrelated with other genes, and underrepresented among the genes that are highly synchronized ($\langle \mathrm{PCC} \rangle_i > 0.2$) (P = 2.6 × 10⁻⁸; Fig. 4f). Thus, the expression pattern of nonessential disease genes appears to be decoupled from the overall expression pattern of all other genes, whereas essential genes have a tendency to be coupled to the rest of the cell.
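A compact way to compute the average coexpression of every gene with all other genes is sketched below. It relies on the identity that, for z-scored profiles, the PCC of two genes is the mean of the product of their z-scores, so the sum over all partners can be obtained from a single column total instead of a loop over gene pairs. The normalization by the number of other genes, the function names, and the toy profiles are assumptions; profiles are assumed to be non-constant and measured over the same tissues.

```python
from math import sqrt

def zscore(profile):
    """Standardize a profile using the population standard deviation."""
    n = len(profile)
    mean = sum(profile) / n
    sd = sqrt(sum((v - mean) ** 2 for v in profile) / n)
    return [(v - mean) / sd for v in profile]

def mean_coexpression(profiles):
    """Average PCC of each gene with all other genes, without an explicit pair loop."""
    z = {gene: zscore(p) for gene, p in profiles.items()}
    n_genes = len(z)
    n_tissues = len(next(iter(z.values())))
    # Column totals: sum of z-scores over all genes, tissue by tissue.
    totals = [sum(column) for column in zip(*z.values())]
    return {
        # Subtracting the gene's own z-score removes the self-correlation term.
        gene: sum(a * (s - a) for a, s in zip(zg, totals)) / (n_tissues * (n_genes - 1))
        for gene, zg in z.items()
    }

# Toy expression profiles across four tissues, invented for illustration.
profiles = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 3, 2, 1]}
average_pcc = mean_coexpression(profiles)
```

In the toy data G3 is anticorrelated with both other genes and receives an average of -1, the kind of decoupled gene the text associates with nonessential disease genes.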
Finally, we asked whether housekeeping genes, expressed in all tissues, have a tendency to encode disease genes. We find that the more tissues in which a gene is expressed, the higher the likelihood that it will be essential (P = 2.8 × 10⁻¹⁶; Fig. 4g). The opposite is true for nonessential disease genes: they have a tendency to be expressed in a few tissues (P = 1.4 × 10⁻⁶; Fig. 4h). Similarly, we found that only 9.9% of housekeeping genes correspond to disease genes, compared with 13.5% of nonhousekeeping genes, a significant 36% difference (P = 3.6 × 10⁻⁶). In contrast, 59.8% of housekeeping genes annotated with mouse phenotype were essential, compared with 40.5% for nonhousekeeping genes (P < 10⁻⁴).
These results support the somewhat unexpected conclusion that nonessential disease genes are not associated with hubs (27, 28), show smaller correlation in their expression pattern with the rest of the genes in the cell than expected from random, and have a tendency to be expressed in only a few tissues. Therefore, contrary to earlier hypotheses and our expectations, the vast majority of nonessential disease genes occupy functionally peripheral and topologically neutral positions in the cellular network. In stark contrast, essential genes are likely to encode hubs, show highly synchronized expression with the rest of the genes, and are expressed in most tissues, being overrepresented among housekeeping genes. Thus, essential genes are topologically and functionally central.
This unexpected peripherality of most disease genes can be best explained by using an evolutionary argument. Mutations in topologically central, widely expressed genes are more likely to result in severe impairment of normal developmental and/or physiological function, leading to lethality in utero or early extrauterine life and to eventual deletion from the population. Only mutations compatible with survival into the reproductive years are likely to be maintained in a population. Therefore, disease-related mutations in the functionally and topologically peripheral regions of the cell give a higher chance of viability.
Disease genes whose mutations are somatic should not be subject to the selective pressure discussed above. Instead, somatic mutations that lead to severe disease phenotypes should more likely affect the functional center. To test the predictive power of this selection-based argument, we studied separately the properties of somatic cancer genes (Cancer Genome Census; www.sanger.ac.uk/genetics/CGP/Census) and found that they (i) are more likely to encode hubs, (ii) show higher coexpression with the rest of the genes in the cell, and (iii) are more represented among housekeeping genes (SI Fig. 10). The observed functional and topological centrality of somatic cancer genes fits well with our current understanding that many cancer genes play critical roles in cellular development and growth (11).
| Discussion |
|---|
An important tool in this quest is the HDN that represents a genome-wide roadmap for future studies on disease associations. The accompanying detailed diseasome map (SI Fig. 13), showing all disorders and the genes associated with different disorders, offers a rapid visual reference of the genetic links between disorders and disease genes, a valuable global perspective for physicians, genetic counselors, and biomedical researchers alike.
To test whether the conclusions obtained in this work are robust to the incompleteness of the OMIM coverage, we expanded our study to include not only genes with identified mutations linked to the specific disease phenotype, but also those that satisfy the less stringent criterion that the phenotype has not been mapped to a specific locus (18). This expansion increased the number of disease-associated genes from 1,777 to 2,765, but also introduced noise in the data, because the link between many of the newly added genes and diseases is less stringent. Yet, the overall organization of the expanded diseasome map remains largely unaltered (SI Fig. 11), and none of the trends uncovered in Fig. 4 are affected by this extension (SI Fig. 12), supporting the robustness of our findings to further expansion of the OMIM database. Thus, although the maps shown in Fig. 2 and SI Fig. 13 will inevitably undergo local changes with the discovery of new disease genes, this will not change the overall organization and layout of the HDN significantly, because the HDN reflects the underlying cellular network-based relationship between genes and functional modules.
| Footnotes |
|---|
Abbreviations: DGN, disease gene network; HDN, human disease network; GO, Gene Ontology; OMIM, Online Mendelian Inheritance in Man; PCC, Pearson correlation coefficient.
**To whom correspondence may be addressed. E-mail: alb{at}nd.edu or marc_vidal{at}dfci.harvard.edu
Author contributions: D.V., B.C., M.V., and A.-L.B. designed research; K.-I.G. and M.E.C. performed research; K.-I.G. and M.E.C. analyzed data; and K.-I.G., M.E.C., D.V., M.V., and A.-L.B. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0701361104/DC1.
© 2007 by The National Academy of Sciences of the USA