Goh et al. 10.1073/pnas.0701361104.
Fig. 5. A subset of the Morbid Map, providing the disorder-disease gene associations. We parsed the MM list to merge eleven complementation groups of Fanconi anemia into a single disorder (id: 523) and two entries of fibromatosis into a single disorder (id: 537). The same procedure was reported for all entries, mapping 2,929 disorder entries in the MM into 1,286 distinct human disorder IDs. The full list of such disorders is provided in SI Table 1. The original MM list consists of the second to the fifth columns.
Fig. 6. Characterizing the topology of the HDN and the DGN. (A and B) Distribution of (A) the size s (number of genes involved in a disorder) and (B) the degree k (number of other disorders a disorder shares genes with) of all disorders in the HDN. Gray symbols correspond to a linear binning, while red dots represent the logarithmicallybinned data, maintaining the same statistical significance in each bin. The continuous lines represent the fit to the logbinned data, following the generalized powerlaw f(x) = c(x + a)b with (A) b »2.7 and (B) b » 6.5, obtained from the least-square fit. (C) Distribution of the cluster sizes in the HDN (blue), where the isolated peak at 516 corresponds to the size of the giant component. The component size distribution for the randomized network is also shown (light blue). (D) Histogram of the number of disorders a gene is involved in, identifying the four genes associated with the largest number of disorders. (E) Distribution of the connected component sizes in the DGN (blue). The largest component contains 903 genes. The component size distribution for randomized networks is also shown (light blue).
Fig. 7. Genetic heterogeneity and connectivity of disorder classes. Red (blue) denotes the enrichment (depletion) of the measure in the corresponding cell. Dimcolored cell denotes the observed value is not statistically significant (P > 102). The statistical significance of each value is calculated with the randomized HDN obtained from the randomized disorderdisease gene associations.
Fig. 8. Gene Ontology Homogeneity. The GO homogeneity of disorders for the GO categories biological process (Left), molecular function (Center), and cellular component (Right). Red bars represent the actual histogram and the blue bars denote the random control, obtained for each disorder by choosing the same number of genes randomly.
Fig. 9. Average degrees of groups of human genes. Average degree of disease genes, essential genes, and nonessential disease genes, as well as all genes. Error bar denotes the standard error.
Fig. 10. Centrality of somatic cancer genes. Plotted in each panel is the fraction of somatic cancer genes as a function of the average degree <k> (Left), the average coexpression coefficient <r> (Center), and the number of tissues expressed nT (Right). All three quantities show positive trends, suggesting that somatic cancer genes are topologically and functionally central.
Fig. 11. Layout of the HDN with the extended data set. HDN with the extended dataset consists of 944 disorders with at least one link to other disorders and 576 disorders form the giant component.
Fig. 12. Functional characteristics of disease and essential genes for the extended data set. (a) The fraction of disease genes among those whose protein products interact with k other proteins. (b) The fraction of genes with lethal mouse phenotype among those whose protein products interact with k other proteins. (c) The same as in a, but excluding the proteins with lethal mouse phenotypes. (d) Average degree of disease, essential, and nonessential disease genes, as well as all genes (random). The fraction of essential genes (e) and disease genes (f) among those whose average PCC with other genes is <r>. Gray horizontal lines in a-c and e-f indicate the global average and the gray bars in e-f show the number of genes (without scales) in each bin indicating that the high fluctuations at low and high <r> are due to the small number of genes in those bins. (g) Average PCC of genes in different categories. (h) Fraction of disease genes (orange) and of genes with lethal (red) and nonlethal (light blue) mouse phenotypes among housekeeping and nonhousekeeping genes.
Fig. 13. Bipartite-graph representation of the diseasome. A disorder (circle) and a gene (rectangle) are connected if the gene is implicated in the disorder. The size of the circle represents the number of distinct genes associated with the disorder. Isolated disorders (disorders having no links to other disorders) are not shown. Also, only genes connecting disorders are shown.
SI Text
S1. Construction of the Diseasome Map
The most complete and best-curated list of known disorder-gene associations is maintained in the Morbid Map (MM) of the Online Mendelian Inheritance in Man (OMIM) (1). Each entry of the MM is composed of four fields, the name of the disorder, the associated gene symbols, its corresponding OMIM id, and the chromosomal location. We downloaded the MM file on December 21, 2005. Out of 4,043 MM entries, we selected 2,929 entries with the "(3)" tag, for which there is strong evidence that at least one mutation in the particular gene is causative to the disorder. We then parsed these 2,929 disorder terms into 1,284 distinct disorders by merging disease subtypes of a single disease, based on their given disorder names. For example, the eleven complementation groups of Fanconi anemia are merged into the disease "Fanconi anemia" and the two fibromatosis entries are merged into the disease "fibromatosis" [see Supporting Information (SI) Fig. 5]. The merging was done first automatically by running a string-match script and then each entry was verified manually. Each disease was then assigned a unique disease ID.
Subsequently we classified each disorder into 20 primary disorder classes manually, following the classification scheme shown in Fig. 2. The classification is based on the physiological system affected by the disorder. For example, 113 disorders constitute a "cancer" class and 41 disorders such as atherosclerosis and stroke constitute a "cardiovascular" class. When a disorder affects multiple systems we tried to assign it to a primary class based on which system was most affected; if a primary class was not evident then the disorder was placed into the "multiple" class. While disease categorization is often a cause of heated debate, occasional misclassification will not affect our results. Disorders having distinct multiple clinical features are assigned to the "multiple" class, with 155 disorders in this category. For 31 disorders there was insufficient information available for a clear class assignment, thus we annotated these into an "unclassified" class. Therefore every disorder was annotated into one of 22 disorder classes.
Finally, each gene symbol was mapped onto an Entrez ID, generating the list of disease gene-disorder class associations available as SI Table 1.
SI Tables 2 and 3 summarize the properties of the diseasome map. For each disorder Table 2 tabulates the disorder class, the degree k in the HDN, the class-degree k (number of distinct disorder classes it connects to), the size s (number of associated genes) and the list of the associated disease genes. For each gene Table 3 tabulates the class of its associated disorders, and the number and the list of disorders to which it is associated.
S2. Properties of the HDN and DGN
The layouts of the giant components of both HDN and DGN in Fig. 2 were generated by a force-directed algorithm, followed by a local rearrangement for visual clarity, while leaving the network's overall layout unperturbed.
In the HDN, the number of genes associated with a disorder, s, has a broad distribution (SI Fig. 6a), indicating that most disorders relate to a few disease genes, while a handful of disorders relate to dozens of genes (SI Table 2). The HDN exhibits a skewed degree (k) distribution (SI Fig. 6b), with most disorders linked to only a few other disorders, while a few disorders (SI Table 2) represent hubs that are connected to a large number of distinct disorders.
In the complementary DGN, while the number of genes involved in multiple diseases decreases rapidly, several disease genes are involved in as many as ten disorders, representing major hubs in the network (SI Fig. 6d).
S3. Component Size Distributions of HDN and DGN
The topology of the HDN and GDN deviates from random. To obtain random controls we randomly shuffled the disorder-disease gene associations, while keeping both the number of genes that a disorder is associated with and the number of disorders that a gene is implicated in unchanged. From these randomized disorder-disease gene associations we obtained the two projections of randomized diseasome, the randomized HDN and the randomized GDN. To control for cluster size distribution, we generated 104 independent randomized samples.
The component size distributions, n(s), defined as the number of components with size s, for the two network projections obtained after the randomizations are shown in light blue in SI Fig. 6 c and e. Both HDN and DGN have significantly smaller giant components than expected random, due to the functional clustering. Thus, HDN and DGN represent an intermediate structure between a completely randomized network with a very large giant component and a functionally fully segregated network which would be broken into isolated clusters, each representing a disorder class. Apart from the giant component, in both networks the sizes of disconnected components are distributed approximately as a power law, with the exponent approximately -3 both times. A solid conclusion on the distribution is difficult to draw due to the limited statistics.
S4. Genetic Heterogeneity and Connectivity of Disorder Classes
Genetic heterogeneity, specifically locus heterogeneity, means that mutations in more than one gene lead to similar disorder phenotypes (2). We measured the genetic heterogeneity of a disorder class as the average number of genes in the disorders belonging to the selected class (i.e., the average size of nodes in the class in the HDN). To quantify the statistical significance of the observed values, we randomized class annotations by randomly shuffling the disorder-class associations. According to the obtained P-values we identified significantly enriched (red) and significantly depleted (blue) classes (SI Fig. 7). The cancer and neurological disorder classes show high genetic heterogeneity, while the metabolic, skeletal, and multiple disorder classes show low genetic heterogeneity. We also calculated for each disorder class the fraction of disorders that are connected to each other in the HDN to quantify the "connectivity" of the particular disorder class. The statistical significance of the observed connectivity was assessed by the randomized disorder-disease gene associations (SI Fig. 7). By this measure, cancer is the most connected class and the metabolic disorder class is the least connected.
S5. Protein-Protein Interaction Data
To obtain a comprehensive human protein-protein interaction (PPI) data we combined two high quality systematic yeast two-hybrid experiments (3, 4) with PPIs obtained from literature by manual curation (3). The integrated set of PPIs contains 22,052 non-self-interacting, non-redundant interactions between 7,533 genes. The list of PPIs used is available as SI Table 4.
S6. Random Control for the PPI-GDN Overlap
To generate the random control of the overlap between the PPI network and the GDN in Fig. 3a, we first identified 1,203 genes that are present both in the PPI network and the GDN and for each disorder i the number of genes ni that are present in the PPI network. To obtain the random control of the overlap, we calculated for each disease i the number of PPIs between the ni nodes selected randomly among the 1,203 genes while keeping the degrees of the associated nodes. We performed this procedure for every disorder to obtain the number of all overlaps for a single random configuration. We generated 106 independent random configurations to obtain significant statistics and P-value.
S7. Gene Ontology Analysis
If the HDN shows modular organization then a group of genes associated with the same common disorder should share similar cellular and functional characteristics, as annotated in Gene Ontology (GO; ref. 5). To investigate this, we measured the GO homogeneity (GH) of each disorder as the maximum fraction of genes in the same disorder that have the same GO terms. It is defined as
GHi = maxj [nji/ni],
where in this case ni denotes the number of genes in the disorder i that have any GO annotations, and nji the number of genes that have the specific GO term j. We calculated GHi separately for each branch of GO, biological process (BP), molecular function (MF), and cellular component (CC). As expected, we find a significant elevation in the GH with respect to random controls in all three branches (SI Fig. 8). For example, we find a 23-fold increase in the perfect homogeneity for BP (79% vs. 3.4%), a 13-fold increase (75% vs. 5.5%) for MF, and a 9-fold increase (79% vs. 8.8%) for CC. To obtain the random control of the GO homogeneity distribution for each disorder we picked the same number of genes randomly in the GO annotation data and calculated their GO homogeneity. We generated 104 random instances to reach statistical significance.
S8. Gene Expression Microarray Data
To calculate the coexpression correlation between wild-type human gene transcripts, we used microarray data available for 36 normal human tissues (6). By matching Entrez gene ID, 1,357 human disease genes had corresponding microarray probes (76% of human disease genes). A gene is considered to be "expressed" if the P-value associated with its transcript abundance is less than the threshold, P < 0.02 (6). We consider a gene as housekeeping gene if it is expressed in all 36 tissues. Genes that are not expressed in any examined tissues are excluded from the analysis.
S9. Tissue Homogeneity
The tissue homogeneity (TH) coefficient quantifies whether genes that are implicated in the same disorders tend to be expressed in similar human tissues. We define the TH of a disorder i as
THi = maxj [nji/ni],
where ni denotes the number of genes in the disorder i that are expressed in at least one tissue, nji the number of genes that are expressed in the tissue j among them, and maxj [×] denotes the function returning the maximum-value argument across j. TH has the maximal value 1 if all of the genes are expressed together in at least one tissue, and takes the minimum value 1/n when all are expressed in different tissues. To obtain the random control of the tissue homogeneity distribution we picked the same number of genes randomly in the microarray data for each disorder and calculated their tissue homogeneity. We generated 105 random instances to reach statistical significance.
S10. Random Controls for Gene Expression Analysis
To obtain the random control of the Pearson correlation coefficient (PCC) distributions for the gene expression in Fig. 3c and d, we calculated the distribution of all gene pairs in the microarray data (Fig. 3c) and the average PCC between the same number of genes chosen randomly from the microarray data (Fig. 3d). To obtain the P-values, we perform the c2 -test, calculating c2 values between the random normalized histograms obtained from the reference distributions (blue) and the actual distribution (red), performing 106 independent runs to obtain significant statistics.
S11. Mouse Phenotype Data
To predict the essentiality of a human gene, we used the phenotype information of the corresponding mouse orthologs. A human gene was defined as "essential" if a knock-out of its mouse ortholog confers lethality. We obtained the human-mouse orthology and mouse phenotype data from Mouse Genome Informatics (http://www.informatics.jax.org) on January 3, 2006. We considered the classes of embryonic/prenatal lethality and postnatal lethality as lethal phenotypes, and the rest of phenotypes as non-lethal ones. There were 1,267 mouse-lethal human orthologs, of which 398 have known human disease associations (22% of human disease genes).
S12. Significance Analyses of the Results in Fig. 4
To assess the statistical significance of the reported results, we apply the linear regression model and performed the c2 test for the significance of the measured trends dictated by the regression coefficient (7). In particular, we take log2k as the dependent variable in Fig. 4a, c, and d, which is more appropriate than k due to the power-law degree distribution. We found that the measured trends described by the linear regression model are statistically significant, with associated with P values 4.2 ´ 10-6 (a), 2.8 ´ 10-5 (c), 1.4 ´ 10-4 (d), 1.1 ´ 10-16 (e), and 3.5 ´ 10-7 (f).
In addition, we assess the significance of the "strength" of the trends using the linear regression model. The slope A is obtained as the coefficient in the linear regression model and quantifies the strength (magnitude) of the trend. The P value of the observed trend AO will be the probability that we have A equal or larger (smaller, for the negative trend in f) than AO purely by chance. To calculate the P value, we randomized our sample by which we randomly redistribute the attributes (genes associated with diseases in a, for example) and perform the linear regression to obtain randomized (null) values of A, denoted by AR. We found that AR approximately follows a Gaussian distribution with zero mean and standard deviation sA. Thus, we can calculate the P value of the observed AO from the Z-score defined as Z = (AO-‹AR›)/sA = AO/sA. In SI Table 5, we summarize the results of the significance analysis according to the described procedure.
S13. Centrality of Somatic Cancer Genes
The selection-based argument presented in the paper does not apply to the diseases that are caused by somatic mutations. Thus, for example, cancers caused by somatic mutations need not be at the functional periphery. Instead, given the severe physiological damage, often leading to death, resulting from such mutations, these mutations are expected to affect the functional center of the cell. To test this, we studied the properties of genes whose somatic mutations are known to induce cancer. We obtained the list of somatic cancer genes from the Cancer Gene Census (www.sanger.ac.uk/genetics/CGP/Census). From the analysis (SI Fig. 10) we found that these somatic cancer genes indeed are (i) more likely to encode hub proteins, (ii) show higher coexpression with the rest of the genes in the cell, and (iii) are more represented among housekeeping genes, confirming our expectation that somatic cancer genes are functionally central.
S14. Analysis of the Extended Data Set
In the analysis presented in the paper we included only the disorder-disease gene associations for which the wild-type gene is mapped and the mutation thereof is clearly demonstrated to be associated with the disorder, among all of the data catalogued in the Morbid Map in OMIM (1). In this section, we relax this criterion, and perform the same analysis with all of the disorder-disease genes associations listed in the Morbid Map, including those with weaker evidences, for which only the mapping of either the wild-type gene [tag "(1)"] or the disease phenotype itself [tag "(2)"] is known (1). Technically this extension did not require any further data curation, given that entries denoted by "(3)" tag in the OMIM database correspond to genes for which there is strong evidence that at least one mutation in the particular gene is causative of the disorder. In the earlier study we focused only on these genes (see Section S1). In the extended study presented here, we included also those entries that have "(1)" or "(2)" tags. This extension will increase the coverage of the data but at the same time introduce potential errors in the form of disease genes that may not turn out to be associated with the specific disorder. The objective of this extension is to test the robustness of the main findings of current study to the increase in the coverage and the presence of noise in the dataset.
With this extension, we obtain a list of associations between 1,578 disorders and 2,765 disease genes, from 4,043 Morbid Map entries as of December 21, 2005. Thus the coverage in the genome increases by 50%. First we generate the analog of Fig. 2a, the layout of the HDN (SI Fig. 11). The overall layout of the disorder classes and the overall position of diseases remain largely unaltered. Next we perform the analysis analogous to that shown in Fig. 4. The overall behavior that (i) the essential genes likely encode hubs whereas the nonessential disease genes are not particularly associated with hubs and (ii) the highly coexpressed genes are depleted with the disease genes while enriched with essential genes is also manifest clearly (SI Fig. 12). Overall, the main findings of the current study are robust to the extension of the dataset which both increases the coverage and introduces potential errors.
Supporting References
1. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Nucleic Acids Res 33:D514-D517.
2. Nussbaum RL, McInnes RR, Willard HF (2004) Thompson & Thompson Genetics in Medicine (Saunders, Philadelphia), 6th Ed.
3. Rual J-P, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al. (2005) Nature 437:1173-1178.
4. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner N, Schoenherr A, Koeppen S, et al. (2005) Cell 122:957-968.
5. Gene Ontology Consortium (2006) Nucleic Acids Res 34:D322-D326.
6. Ge X, Yamamoto S, Tsutsumi S, Midorikawa Y, Ihara S, Wang SM, Aburatani H (2005) Genomics 86:127-141.
7. Daniel WW (1991) Biostatistics: A Foundation for Analysis in the Health Sciences (Wiley, New York).