Bayesian approach to transforming public gene expression repositories into disease diagnosis databases

Haiyan Huang, Chun-Chi Liu, and Xianghong Jasmine Zhou
PNAS April 13, 2010 107 (15) 6823-6828; https://doi.org/10.1073/pnas.0912043107
Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved February 19, 2010 (received for review October 26, 2009)

H.H. and C.-C.L. contributed equally to this work.


Abstract

The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, thus far, this resource has been far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Particularly, we have developed a systematic framework, including a two-stage Bayesian learning approach, to achieve the diagnosis of one or multiple diseases for a query expression profile along a hierarchical disease taxonomy. Our approach, including standardizing cross-platform gene expression data and heterogeneous disease annotations, allows analyzing both sources of information in a unified probabilistic system. A high level of overall diagnostic accuracy was shown by cross validation. It was also demonstrated that the power of our method can increase significantly with the continued growth of public gene expression repositories. Finally, we showed how our disease diagnosis system can be used to characterize complex phenotypes and to construct a disease-drug connectivity map.

The rapid accumulation of high-throughput genomic data offers an unprecedented opportunity to study human diseases. The National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (1), with more than 330,000 gene expression profiles and an annual growth rate of 150%, is currently the largest database of its kind. The GEO systematically documents the molecular basis of many disease types, including heart disease, mental illness, infectious disease, and a wide variety of cancers. This repository could serve as a rich resource for diagnosis: by screening the enormous number of disease expression datasets in an automated fashion, it should be possible to rapidly narrow down disease candidates for a query expression profile. Such a screening approach would be particularly useful when the potential disease is not obvious or lacks biochemical diagnostic tests.

We aim to turn the NCBI GEO expression repository into an automated disease diagnosis database, such that a query gene expression profile can be assigned to one or multiple disease concepts. This effort requires the effective integration of the two major information sources in the GEO database, namely quantitative expression data and complex phenotypic information. Such integrative analysis is essential to exploiting the full power of public gene expression databases and tackling the ultimate scientific goal of genomics research: linking genotypes to phenotypes. The problem of searching and querying microarray databases has attracted considerable attention. However, existing works either query only the expression data with an expression signature to identify relevant microarray datasets (2–4), or query only the phenotype meta-data with a specific phenotype term to search for datasets of related phenotypes (5 and 6). In this paper, going beyond such simple database query approaches, we describe a unified framework for jointly modeling the two information sources. By this means, the heterogeneous public repository is transformed into a database with standardized expression profiles and phenotype terms suitable for diagnosis purposes. An automated Bayesian analysis of this database then links standardized query expression profiles to probable disease classes. This task is not trivial because of the large amount of complex, heterogeneous data in public repositories; it is less of a challenge when microarray-based disease diagnosis studies are limited in scale, e.g., confined to a single laboratory (7 and 8) or targeting specific types of disease (9–11).

Following a preprocessing phase (i.e., standardizing the cross-platform expression data and the complex phenotype information), we formulate the disease diagnosis question as a hierarchical multilabel classification (HMC) problem (12). That is, we categorize a standardized query gene expression profile into multiple disease classes following a hierarchical disease taxonomy. The standardization of a profile is based on its comparison against a control array in order to remove cross-platform/lab systematic variations. We developed a two-stage learning approach to achieve the diagnosis: we first build independent Bayesian classifiers for each disease class, then integrate their predictions within a Bayesian network model. The network model allows for collaborative error correction across classes in the disease hierarchy. This two-stage learning approach interprets both genomic and phenotypic data under a unified probabilistic framework, thereby constituting an advance over existing microarray diagnostic methods in both scale and depth.

To validate our approach, we collected 9,169 human microarray experiments from major platforms in the NCBI GEO database and constructed 110 disease classes. Cross validation demonstrates a high level of overall diagnostic accuracy (95%). Moreover, we show that the predictive power of our system is expected to increase significantly as public gene expression repositories continue to grow.

The proposed disease diagnosis system can also be applied to reveal unique relationships between diseases and drugs, if the query expression profile concerns the treatment effect of known drugs. Querying a large number of drug-treatment profiles against our diagnosis system, we established a disease-drug connectivity map. Interestingly, a large number of known drug side effects were recovered and many unique disease-drug associations were discovered. The principle here is similar to that of the landmark study by Lamb et al. (“connectivity map”) (13), where disease-drug connections were inferred by comparing a disease profile to a specifically constructed reference compendium of drug-treatment profiles. Our approach complements that work in that we provide a rigorous prediction scoring system and, more importantly, make it possible to use the entire body of heterogeneous public gene expression data as the reference compendium.

Results

Construction of a Disease Diagnosis Database.

Our analysis scheme is sketched in Fig. 1. The first phase concerns data preprocessing and it involves two tasks: (i) standardizing the expression data to remove platform or laboratory differences, and (ii) transforming the heterogeneous phenotype information, embedded in texts, into a workable format.

Fig. 1.

Major steps of the disease diagnosis system: (1) Preprocess the public microarray repositories to build the diagnosis database with standardized expression and phenotype data. (2) Diagnose a query profile via a two-stage Bayesian approach: at the first stage, we build Bayesian classifiers for each UMLS concept; at the second stage, we integrate the individual predictions with a Bayesian network model to allow collaborative error-correction over all classes in the hierarchy (red nodes represent diagnosed disease concepts).

For the first task, we performed the standardization by ranking the expression values within each array profile, and then taking the logarithm of the expression rank ratio between two arrays, one of a disease and another of normal conditions, within the same dataset (14). The resulting log-rank-ratio vector is termed a standardized profile; its components reflect the level and direction of differential expression in disease-related genes. Given two standardized profiles, we measure their similarity by Pearson correlation. This measure is favored due to its sensitivity to large absolute values, which in our case correspond to highly differentially expressed genes. Our standardization is in principle similar to the Gene Set Enrichment Analysis (13). The rationale is that differentially expressed genes are likely to carry the most stable information on disease characteristics regardless of differences in platform or lab. With our method, researchers need not choose a threshold of significance for differentially expressed genes. Extensive testing has demonstrated the effectiveness of our method (SI Text). Note that a typical GEO dataset consists of several subsets, each subset having a group of replicated samples. We term the set of standardized profiles derived from all replicated arrays in a pair of disease and normal subsets, a standardized dataset. Naturally, the standardized profiles within a standardized dataset are considered replicates. For this initial study, we focus on those GEO datasets that contain at least one pair of disease and normal subsets. We note that the requirement of normal subsets poses a potential limitation for our standardization approach. Further discussions on our approach and its possible alternatives are in SI Text.
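The standardization and similarity steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, ranks are assumed to run from 1 (lowest expression) upward, and tied values are assumed to share their average rank, since these conventions are not specified here.

```python
import math

def ranks(values):
    """Rank values within one array (1 = lowest); ties share the average rank.
    Rank direction and tie handling are assumptions of this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def log_rank_ratio(disease, normal):
    """Standardized profile: log of the rank ratio, gene by gene."""
    return [math.log(d / n) for d, n in zip(ranks(disease), ranks(normal))]

def pearson(a, b):
    """Pearson correlation between two standardized profiles.
    Undefined (division by zero) if either profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy arrays over four genes (made-up values, same gene order in both arrays).
disease = [5.0, 1.0, 3.0, 9.0]
normal = [2.0, 4.0, 3.5, 8.0]
profile = log_rank_ratio(disease, normal)
```

In this toy case the first gene moves from rank 1 in the normal array to rank 3 in the disease array, so its component is log 3, while genes whose rank is unchanged contribute 0. Large absolute components therefore dominate the Pearson correlation, which is the property the text relies on.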

To standardize the disease information provided with microarray datasets, we use the Unified Medical Language System (UMLS) (15). Its associated text-mining tools (MetaMap) (16) allow for the automatic extraction of phenotype concepts from text annotations (17 and 18). Note that each microarray dataset has phenotype descriptions at both the dataset level and the subset level. In our Bayesian diagnosis analysis, we give the subset-level annotations a higher credence due to their stronger association with individual samples.

This preprocessing phase results in a new database consisting of standardized expression profiles and disease concepts (i.e., UMLS concepts). Hereafter, we refer to this collection as the “disease diagnosis database.” More details on constructing this database are in Methods and SI Text.

A Bayesian Framework for Automated Disease Diagnosis.

We formulate the task of automated disease diagnosis as a hierarchical multilabel classification problem. The goal is to place a query profile into one or multiple UMLS disease classes, following the hierarchical UMLS disease taxonomy. We introduce a two-stage Bayesian learning approach: first a classifier is built for each disease class, and then the predictions of individual classifiers are combined to allow collaborative error correction across classes in the hierarchy. We summarize the key steps of this method below; a complete description is in SI Text.

Given a query array to be diagnosed and a control array profiling a normal sample (from the same lab and platform), we derive the log-rank-ratio vector x; this is the standardized query profile. Let Qx,k = 1 when x is diagnostic of the UMLS concept Uk, and Qx,k = 0 otherwise. That is, Qx,k is a binary label indicating membership of x in the disease class k. Building a Bayesian classifier for disease class k is equivalent to deriving the posterior distribution of Qx,k given two pieces of information: (i) s = {sx,i, i = 1,…,M}, where M is the number of standardized datasets in our database and sx,i is a list of similarity scores (i.e., Pearson correlation coefficients) between x and the profiles in the ith standardized dataset; and (ii) e = {ei,k, i = 1,…,M}, where ei,k indicates whether the UMLS concept Uk is an annotation of the ith standardized dataset (see Methods). We define ei,k = 2 if Uk occurs at the subset level, ei,k = 1 if it occurs at the dataset level, and ei,k = 0 if the annotation does not occur.

The relationship between Qx,k, e and s is illustrated in Fig. 2. By itself, e provides no useful information for inferring the value of Qx,k unless the similarity scores s are also known; that is, P(Qx,k|e) = P(Qx,k). In these terms, the target Bayesian posterior can be expressed as

\[ P(Q_{x,k} \mid s, e) \propto P(s \mid Q_{x,k}, e)\, P(Q_{x,k}). \tag{1} \]

The prior P(Qx,k) can be empirically estimated from the database. The computation of P(s|Qx,k,e) is more involved due to the complex properties of e and s. For instance, the UMLS annotations e include items at the dataset and subset levels, and may also suffer from text-mining errors. To take into account such complexities and facilitate the modeling of P(s|Qx,k,e), we introduced a set of latent binary random variables T = {Ti,k} whose values are not observable but can be inferred from e: Ti,k = 1 when the ith standardized dataset is related to the UMLS concept Uk, and Ti,k = 0 otherwise. (The principles for defining P(T|e) are in Methods.) Accordingly, we can express the target posterior in Eq. 1 as

\[ P(Q_{x,k} \mid s, e) \propto P(Q_{x,k}) \sum_{T} P(s \mid Q_{x,k}, T)\, P(T \mid e). \tag{2} \]

We now need to determine P(s|Qx,k,T). By assuming independence among the standardized datasets, we can decompose P(s|Qx,k,T) into the computation of individual terms P(sx,i|Qx,k,Ti,k), where sx,i is a vector denoting the similarity scores between x and the profiles in the ith standardized dataset (i = 1,…,M). Due to the difficulty in modeling and deriving P(sx,i|Qx,k,Ti,k) (SI Text), we alternatively considered the ratio P1(sx,i)/P0(sx,i), referred to simply as “P1/P0” hereafter. P1 denotes the distribution of sx,i when the query x and the ith standardized dataset are associated with a common disease, and P0 the distribution when they are not. We modeled P1/P0 by a log-linear regression: log(P1/P0) = λ0 + λ1 × Mean(sx,i) + ϵi, where λ0 and λ1 are estimated independently of the query profile and ϵi is a Gaussian error term. Further details are in Methods and SI Text.
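One way to see how these pieces could combine is the following Python sketch. It is a simplification under stated assumptions, not the procedure in SI Text: the Gaussian error term is dropped; dataset i is assumed to follow P1 only when both Qx,k = 1 and Ti,k = 1, and P0 otherwise; and the Ti,k are assumed independent given e. Under those assumptions, summing over T factorizes dataset by dataset, and the posterior odds equal the prior odds times the product over i of [p_i × (P1/P0)_i + (1 − p_i)], where p_i = P(Ti,k = 1|e). All function and parameter names are hypothetical.

```python
import math

def likelihood_ratio(mean_similarity, lam0, lam1):
    """P1/P0 from the log-linear model, with the error term omitted (assumption)."""
    return math.exp(lam0 + lam1 * mean_similarity)

def posterior_q(prior, mean_sims, p_related, lam0, lam1):
    """Posterior P(Q = 1 | s, e) for one disease class.

    prior      -- empirical prior P(Q = 1), strictly between 0 and 1
    mean_sims  -- Mean(s_{x,i}) for each standardized dataset i
    p_related  -- P(T_{i,k} = 1 | e) for each standardized dataset i
    """
    log_odds = math.log(prior / (1.0 - prior))
    for s_bar, p_t in zip(mean_sims, p_related):
        ratio = likelihood_ratio(s_bar, lam0, lam1)
        # Marginalize the latent T_{i,k}: a related dataset contributes P1/P0,
        # an unrelated one contributes a neutral factor of 1.
        log_odds += math.log(p_t * ratio + (1.0 - p_t))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Made-up inputs: three datasets, two of them annotated with the concept.
p = posterior_q(prior=0.05,
                mean_sims=[0.40, 0.35, -0.02],
                p_related=[0.9, 0.6, 0.01],
                lam0=-1.0, lam1=8.0)
```

A dataset with p_i close to zero leaves the odds almost unchanged regardless of its similarity to the query, which is how unannotated datasets are effectively ignored in this sketch.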

Fig. 2.

Information chart for the posterior inference of Qx,k. We wish to estimate the probability that a query profile x is diagnosed with the UMLS concept Uk, given ei,k and sx,i with i = 1,…,M.

Putting all the details together, we can infer Qx,k (SI Text). This process is repeated for all UMLS disease classes. Those UMLS concepts with the posterior probability above a predefined threshold are considered significant, and generate an initial set of diagnostic annotations for the query profile x. Note that in practice, the query data may contain replicates of disease and normal arrays. In that situation, we derive the log-rank ratio vectors from all possible pairs of query disease and normal samples, and x represents a list of replicated standardized query profiles. sx,i will then include all the similarity scores between every replicated standardized query profile and every profile in the ith standardized dataset. The inclusion of replicates can enhance the robustness of the P1(sx,i)/P0(sx,i) estimation, but all other procedures remain the same. We also note that in Bayesian analysis, the effects of the prior distribution and the significance threshold selection tend to vanish as the data accumulate. We have also demonstrated that our Bayesian classifiers, built by carefully modeling the specific properties of noisy data such as e and s, outperformed Support Vector Machine, the most commonly used classification method (SI Text).

Next, we use the UMLS hierarchical disease taxonomy to leverage the predictions made for individual UMLS concepts. In particular, we exploit a Bayesian network model defined on the UMLS hierarchy to resolve inconsistencies in the initial set of diagnostic predictions (19) (details are in SI Text). Given the high level of information exchange and integration among the UMLS concepts along the disease hierarchy, this procedure is expected to improve the accuracy and robustness of the diagnosis.

Performance Assessment of the Automated Disease Diagnosis System.

To validate our framework, we used an initial set of GEO microarray datasets containing at least one disease subset and at least one normal subset. In total, we collected 9,169 microarray experiments and constructed 110 disease classes, each containing from 3 to 62 standardized datasets. The 110 classes covered a wide spectrum of diseases: cardiovascular disease, neoplasms, CNS disorders, skin disorders, and metabolic diseases, to name a few.

Using the leave-one-out cross-validation approach detailed in Methods, our diagnoses achieved an overall accuracy of 95% (precision 82% and recall 20%). The recall rate of 20% is comparable to that observed in an analogous hierarchical multilabel classification problem for predicting gene functions. In the mouse model, the best performance in such an application was achieved with a recall rate of 20% and a precision of 41% (20). Varying the threshold for classification, we plot the precision and recall curves in Fig. 3A. Not surprisingly, the performance of our method is significantly enhanced after applying collaborative error correction along the disease hierarchy. An example diagnosis in the context of the UMLS hierarchy is depicted in Fig. 3B. The prediction results for a subset of prevalent diseases are listed in Table 1.
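High accuracy alongside low recall is typical when each query carries only a few of the many possible labels. The Python sketch below shows this on toy data, assuming the standard definitions of precision, recall, and accuracy pooled over all query-class pairs; the pooling scheme and the function name are assumptions, and the numbers are made up rather than taken from the study.

```python
def pooled_metrics(truth, predicted):
    """Precision, recall and accuracy pooled over all (query, class) pairs.

    truth, predicted -- lists of rows, one row per query, one 0/1 entry per class.
    """
    tp = fp = fn = tn = 0
    for truth_row, pred_row in zip(truth, predicted):
        for t, p in zip(truth_row, pred_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Two toy queries over ten classes: five true labels, one of them predicted.
truth = [[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
predicted = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
precision, recall, accuracy = pooled_metrics(truth, predicted)
```

Here the classifier recovers only one of five true labels (recall 0.2) yet is right on 16 of 20 query-class pairs, because most pairs are true negatives. With 110 classes and few labels per query, the same effect would be stronger.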

Fig. 3.

Validation results and case examples. (A) Precision-recall plots by pooled disease classes. The blue curve shows the performance after Stage I diagnosis, and the red curve shows the final performance after Stage II refinement. (B) An example illustrating the error correction by the Stage II refinement. The query profile studies uterine leiomyomas obtained from fibroid afflicted patients (GDS484). The profile is annotated with four concepts by UMLS text mapping: Connective/Soft Tissue Neoplasm, Muscle tissue neoplasm, fibroid tumor, and uterine fibroids. The Stage I diagnosis predicted four concepts (red nodes) with one false positive (lymphoblastic leukemia), and one false negative (uterine fibroids). The false positive prediction is later corrected by Stage II refinement. (C) The figure presents the 110 disease classes and their hierarchical relationships. The red nodes represent diagnosed disease concepts for GDS563: (1) Nervous system disorder (2) Neuromuscular diseases (3) Myopathy (4) Musculoskeletal diseases (5) Congenital, Hereditary, and Neonatal diseases and abnormalities (CHNDA) (6) Genetic diseases, inborn (7) Genetic diseases, x-linked (8) Muscular disorders, atrophic (9) Muscular dystrophies (10) Muscular Dystrophy, Duchenne. (D) The prediction performance decreases with the data reduction.

Table 1.

Prediction result of a subset of prevalent diseases

We further exemplify the performance of our approach using the NCBI GEO dataset GDS563. This dataset was produced to identify modifying factors and pathogenic pathways involved in Duchenne Muscular Dystrophy (DMD). It consists of 24 microarrays from two subsets: 12 from DMD patients and 12 from unaffected control patients, which form a standardized dataset with 12 × 12 = 144 replicated standardized profiles. Masking the known phenotypic annotations and querying this standardized dataset against our database, we predicted 10 UMLS concepts (see Fig. 3C). All of the 10 predictions, positioned coherently along the UMLS hierarchy, agree 100% with the known annotations. Another example with less perfect yet more typical prediction performance is shown in SI Text.

A closer examination of the results shows further interesting features of our method. One example comes from the result for a query profiling the T-cells of HIV patients (GDS2649). Even though HIV is not included in the 110 disease classes of our diagnosis database due to the lack of sufficient training data, we obtain the relevant concept RNA virus infection that can describe the characteristic of the HIV disease. This implies that our system can not only diagnose known diseases, but may also identify important features of understudied or unknown diseases.

In general, the prediction performance on individual disease classes increases with the number of datasets in the class. For example, among disease classes containing only three datasets, the best precision achieved was 41% with a recall of 23%. For disease classes containing seven datasets, the same precision (41%) was achieved when the recall was 43%. In fact, for classes with seven datasets, the best achieved precision was 97% with 33% recall. To further confirm the important role of class size, we randomly reduced the number of datasets in the disease diagnosis database by 20%, 40%, or 80%. Fig. 3D demonstrates that both precision and recall significantly increase with the number of datasets. This behavior highlights the advantage of multiple dataset integration and demonstrates that the power of our approach can increase significantly with the continued growth of public gene expression repositories.

Construction of a Disease-Drug Connectivity Map.

Our approach can be generalized from disease diagnosis to building disease/phenotype connections when the query’s phenotype is known. The connections between drugs and diseases are of special interest. These can be discovered by applying our diagnosis system to queries involving drug (or small molecule) treatments. The diagnosis results would link the queries (involving drug-treatment effects) into one or more well-known disease classes, facilitating the establishment of unique links between diseases and drugs.

We constructed a connectivity map using 1,248 queries characterizing the phenotypic differences between “drug treated” and “untreated” subjects (these 1,248 queries were not in our established disease diagnosis database). We only kept predicted diseases that were not already included in the query’s annotations. A new drug-disease connection was considered significant if queries concerning a given drug treatment were predominantly classified into the same disease class (see Methods and SI Text for details). In total, we found 234 significant drug-disease links, unique to the queries’ phenotype description, connecting 99 drug concepts to 43 disease classes (Fig. 4).

Fig. 4.

Disease-drug connectivity map. The map contains 234 significant connections between 99 drug concepts (pink nodes) and 43 disease concepts (blue nodes). (A) The network structure of the connectivity map. (B) Close-up of the Doxorubicin subnetwork. (C) Close-up of the obesity subnetwork.

To verify that the linked diseases and drug treatments truly share common molecular mechanisms, we used an independent resource: the Online Mendelian Inheritance in Man database (OMIM) (21). For each disease or drug, we compiled a list of associated genes based on the knowledge in OMIM (SI Text). We then assessed the statistical significance of the “link” between a disease and a drug based on the intersection of the disease-related genes and the drug-related genes using the hypergeometric test. Strikingly, 42.4% of the significant drug-disease links identified emerge also as statistically significant by this criterion with hypergeometric p value ≤ 0.05.
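The overlap test can be sketched in a few lines of Python. This is a generic one-sided hypergeometric test, not the authors' code: the size of the gene universe and the gene counts below are hypothetical, and the way gene lists are compiled from OMIM (SI Text) is not reproduced.

```python
from math import comb

def hypergeom_pvalue(universe, n_disease, n_drug, overlap):
    """P(X >= overlap), where X is the number of disease-related genes found
    among n_drug genes drawn without replacement from `universe` genes,
    n_disease of which are disease-related."""
    upper = min(n_disease, n_drug)
    total = comb(universe, n_drug)
    tail = sum(comb(n_disease, k) * comb(universe - n_disease, n_drug - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Hypothetical counts: 20,000 genes in the universe, 40 linked to the disease,
# 25 linked to the drug, 3 shared between the two lists.
p_value = hypergeom_pvalue(20000, 40, 25, 3)
significant = p_value <= 0.05
```

The result depends on the chosen gene universe: a larger universe makes any fixed overlap look more surprising, so the background set needs to be fixed before comparing links.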

On close examination, Fig. 4A (see SI Text for a detailed version) reveals many known drug side effects as well as some unique associations. To take an interesting example, we found that the anticancer drug doxorubicin is linked to several diseases besides cancer/tumor (Fig. 4B). A predicted connection to rheumatoid arthritis confirms the drug’s newly proposed role as an antiarthritic agent (22); a connection to skin disorder reflects its known side effect of skin eruptions (23). Most notably, the connection to cardiovascular diseases points to doxorubicin’s potentially fatal toxicity to the heart muscles, which is cumulative over the patient’s lifetime (24). This dosage-limiting cardiotoxicity, whose mechanism is not yet understood, has so far severely limited the usage of doxorubicin. Previous work has suggested that doxorubicin may break down the myofilament protein titin, leading to myocyte cell death (25). Our prediction, based on a global differential gene expression comparison, suggests a more fundamental transcriptional mechanism: doxorubicin may trigger cardiomyocyte-specific expression signatures. In fact, two transcription regulators, GATA4 and CARP, have already been implicated in such cardiomyocyte changes (26). Our result thus points out a unique direction in the design of cardioprotection strategies that would allow wider application of this effective oncologic agent.

Also of interest are the numerous links between “obesity” and anticancer drugs shown in Fig. 4C. Cancer and obesity have both become increasingly prevalent, and a physical basis for this relationship can be found in certain shared key molecules such as insulin, PDGF, VEGF, and cytokines. Accumulating evidence suggests that leptin and adiponectin, key hormones in regulating appetite and metabolism, may play important roles in regulating cancer cell growth and proliferation (27). In fact, weight gain is a common side effect in women undergoing adjuvant chemotherapy for breast cancer (28). Previous findings suggest that weight gain during adjuvant chemotherapy is associated with increased recurrence and poorer survival (29). Thus, understanding the molecular mechanism of this side effect is of therapeutic importance. The global similarity between obesity-specific expression and anticancer drug-treatment expression may lead to the discovery of shared mechanisms at the transcription level, allowing optimized treatments to be designed.

Discussion

We have proposed a computational approach capable of transforming large public microarray repositories into an automated disease diagnosis database. Compared to existing studies on searching and querying microarray databases, which focus on either expression data or phenotype data alone (2–6), we model both data sources in a unified framework. Integrating these two data types requires a careful consideration of their complex properties as well as various sources of noise. We employ a two-stage probabilistic framework that assigns one or multiple disease and phenotype labels to a query profile. While most medical diagnostic tests narrowly focus on one or a few diseases and conditions, our approach promises the ability to distinguish a large number of conditions efficiently and in parallel. As the available training datasets become more numerous and homogenous, the classification power of our system should increase dramatically. The framework presented here will also benefit from ongoing efforts to develop more advanced UMLS text-mining tools, as well as an improved and more comprehensive community standard of phenotype annotations.

As demonstrated by our disease-drug connectivity map, our approach shows great promise as a tool for linking a broad range of phenotypes to diseases through shared molecular mechanisms. This potential can be enhanced by extending the scope of queries and the range of disease classes in our diagnosis database. For instance, queries could be crafted to reveal the subtle distinctions between related phenotypes such as “type I vs. type II diabetes”, “metastasis vs. nonmetastasis cancer”, or two differentiation stages of embryonic stem cells. It is likely that the major characteristics of such subtle distinctions could be jointly described by the disease classes in our diagnosis database. On the other hand, our diagnostic database can also be extended by adding phenotype classes beyond disease, for example stress response, drug perturbation, and cell differentiation. Further investigations along this direction could facilitate the construction of a phenotype connection map, and potentially redefine our views of disease and phenotype classification.

In our approach, the gene expression data are transformed into log-rank-ratio vectors, which are dimensionless and not tied to any technology platform. As such our system should also be applicable to mRNA-seq data when they become widely available in public repositories. On the other hand, the capability of mRNA-seq technology in measuring absolute gene expression levels may ultimately remove the requirement of control samples (for deriving the standardized profiles), and expand the set of queryable samples. Finally, as our Bayesian learning framework takes similarity scores between query profiles and database disease profiles as input, we expect our method would continue to be applicable as technology evolves, as long as such similarity scores are available.

Methods

Microarray Data Collection and Filtering.

We selected microarray datasets from the NCBI GEO database according to the following criteria: (i) all samples were of human origin; (ii) data were generated with one of the major platforms (e.g., Affymetrix HG-U95A, HG-U133A, or HG-U133 Plus 2) that share a large number (8,358) of common genes; (iii) each dataset contains at least one disease subset and at least one normal subset. This selection resulted in 100 GEO datasets, comprising 9,169 experiments.

Standardizing Expression Data and Phenotype Data.

As described in Results, we standardized the expression data by taking log-rank-ratios between disease and normal profiles. For instance, given a dataset containing a disease subset with m replicated samples and a normal subset with n replicated samples, we obtain mn log-rank-ratio vectors (standardized profiles) that together constitute a standardized dataset. The 100 GEO datasets selected for this study gave rise to 196 standardized datasets.
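The pairing step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that a log-rank-ratio is the logarithm of the ratio between a gene's within-sample rank in the disease sample and its rank in the normal sample (the exact definition is given in Results), and the function names `gene_ranks` and `standardize_dataset` are hypothetical.

```python
import math


def gene_ranks(profile):
    """Return 1-based ranks of the expression values (1 = lowest value)."""
    order = sorted(range(len(profile)), key=lambda g: profile[g])
    ranks = [0] * len(profile)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks


def standardize_dataset(disease_samples, normal_samples):
    """Pair every disease sample with every normal sample.

    Assumed definition (see lead-in): each standardized profile holds
    log(rank in disease sample / rank in normal sample) for every gene.
    With m disease and n normal samples this yields m * n profiles.
    """
    disease_ranks = [gene_ranks(s) for s in disease_samples]
    normal_ranks = [gene_ranks(s) for s in normal_samples]
    return [
        [math.log(d / n) for d, n in zip(d_ranks, n_ranks)]
        for d_ranks in disease_ranks
        for n_ranks in normal_ranks
    ]


# Toy data: m = 2 disease samples, n = 3 normal samples, 4 genes each.
disease = [[5.0, 1.0, 3.0, 9.0], [4.0, 2.0, 8.0, 6.0]]
normal = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], [9.0, 8.0, 7.0, 6.0]]
profiles = standardize_dataset(disease, normal)
print(len(profiles), len(profiles[0]))  # 6 4
```

Because only ranks enter the calculation, the resulting vectors do not depend on the measurement scale of the platform that produced the raw intensities.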

UMLS provides an extensive catalog of medical concepts, but in this study we concentrated on human disease concepts. We determined the phenotypic context of each GEO dataset from two sources (30): the Medical Subject Headings (MeSH) of its PubMed record, and the summary description in GEO. Both texts were parsed to identify relevant UMLS concepts using the MetaMap program (16). These concepts provide the dataset-level annotations. Furthermore, we parsed subset descriptions and sample descriptions to identify additional UMLS concepts. These UMLS concepts are the subset-level annotations. Example UMLS annotations can be found in SI Text. After discarding UMLS concepts that are too general or too rare (SI Text), we were left with 110 concepts, each representing a disease class.

Modeling the Latent Variables T.

We defined P(Ti,k|ei,k) according to the following principles: (i) P(Ti,k = 1|e) is larger if Uk is mapped to the ith standardized dataset at the GEO subset level (ei,k = 2) rather than the dataset level (ei,k = 1); (ii) P(Ti,k = 1|e) is smaller if many diverse UMLS concepts are mapped to the ith standardized dataset (that is, we discount noisy annotations); and (iii) we assign P(Ti,k = 1|e) a very small (close to zero) value if Uk is not mapped to the ith standardized dataset. The last condition is designed to correct possible text-mining errors. More details are in SI Text. The introduction of T to the model allows a researcher to assign varied levels of learning credence to the training data based on their confidence in the UMLS annotations. This property cannot be incorporated into traditional classification approaches.
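As a purely illustrative sketch, the three principles can be encoded in a small Python function. The functional form and every constant below (`base_subset`, `base_dataset`, `discount`, `eps`) are hypothetical choices made for this example; the definition actually used is given in SI Text and may differ substantially.

```python
def annotation_credence(e, n_concepts, base_subset=0.9, base_dataset=0.6,
                        discount=0.1, eps=1e-3):
    """Toy stand-in for P(T = 1 | e); all constants are hypothetical.

    e          -- 0: concept not mapped to the standardized dataset,
                  1: mapped at the dataset level,
                  2: mapped at the subset level
    n_concepts -- number of UMLS concepts mapped to the standardized dataset
    """
    if e == 0:
        # Principle (iii): tiny but nonzero, so a missed annotation
        # is not ruled out entirely.
        return eps
    # Principle (i): subset-level mappings earn more credence.
    base = base_subset if e == 2 else base_dataset
    # Principle (ii): discount datasets that carry many concepts.
    return max(eps, base / (1.0 + discount * (n_concepts - 1)))


print(annotation_credence(2, 1) > annotation_credence(1, 1))   # True
print(annotation_credence(2, 10) < annotation_credence(2, 1))  # True
```

Any function that is monotone in the same directions would satisfy the stated principles equally well; the point of the sketch is only to make the three constraints concrete.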

The Log-Linear Regression.

The model structure is based on extensive observations that the mean value of sx,i is effective in distinguishing between P1(sx,i) and P0(sx,i). Our other efforts to model P1(sx,i)/P0(sx,i) included the following regression, which contains more factors related to the distribution of sx,i: [equation image not available]. Studies using the available data did not reveal any clear advantage to using this model over that described in the text, which is not surprising since the mean value had already proven effective at distinguishing between P0(sx,i) and P1(sx,i). Any improvement on this regression would further increase the effectiveness of our method.

Cross-Validation Procedure.

We evaluated the performance of our disease diagnosis scheme as follows. Considering each GEO dataset in turn, we took all standardized profiles derived from that dataset out of the database and used the remaining data to train the model. The resulting system was then used to diagnose the left-out data. More details are in SI Text. We repeated this procedure for all 100 GEO datasets, and assessed the overall classification performance using three measures: (i) Precision = TP/(TP + FP), (ii) Recall = TP/(TP + FN), and (iii) Accuracy = (TP + TN)/(TP + TN + FP + FN). TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
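The leave-one-dataset-out split and the three performance measures can be sketched as follows. This is a minimal illustration under stated assumptions: the dataset identifiers and profile placeholders are hypothetical, the training and diagnosis steps are omitted, and precision is computed with its standard definition, TP/(TP + FP).

```python
from collections import defaultdict


def leave_one_dataset_out(profiles):
    """Yield (held_out_id, training_pairs, test_profiles) for each dataset.

    profiles -- list of (dataset_id, standardized_profile) pairs.
    All profiles derived from the held-out dataset are removed together,
    so no profile from that dataset is used for training.
    """
    by_dataset = defaultdict(list)
    for dataset_id, profile in profiles:
        by_dataset[dataset_id].append(profile)
    for held_out in by_dataset:
        train = [(d, p) for d, ps in by_dataset.items() if d != held_out for p in ps]
        yield held_out, train, by_dataset[held_out]


def classification_measures(tp, tn, fp, fn):
    """Return (precision, recall, accuracy) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy


# Hypothetical identifiers and placeholder profiles, for illustration only.
toy = [("GDS_A", "p1"), ("GDS_A", "p2"), ("GDS_B", "p3")]
for held_out, train, test in leave_one_dataset_out(toy):
    print(held_out, len(train), len(test))

precision, recall, accuracy = classification_measures(tp=8, tn=80, fp=2, fn=10)
```

Holding out whole datasets, rather than individual profiles, keeps profiles that share samples from appearing on both sides of the split.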

Significance Evaluation of the Disease-Drug Links.

We applied our disease diagnosis system/database to 1,248 queries involving drug treatments. The diagnosis results linked the queried drugs to one or more UMLS disease concepts. After removing redundant drug/disease concepts and excluding the disease classes that were already in the query’s known annotations, 1,720 unique disease-drug links were left for consideration. We evaluated each link using a hypergeometric test (SI Text). Because of the high level of data dependency and the possible violation of the assumptions of the hypergeometric distribution, we adjusted the hypergeometric p value by a bootstrap p value (SI Text). Finally, we identified 234 significant links (FDR < 0.3).
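For readers unfamiliar with the hypergeometric test, a minimal sketch of an upper-tail p value is given below. The way the counts are set up here is an assumption made for illustration (N queries in total, K of them involving a given drug, n of them diagnosed with a given disease class, k in the overlap); the setup actually used, and the bootstrap adjustment, are described in SI Text and are not reproduced.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws).

    Under the assumed setup (see lead-in), a small value suggests that the
    drug and the disease class co-occur in more queries than expected by chance.
    """
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Toy numbers: 10 queries, 3 involve the drug, 4 are diagnosed with the
# disease class, and 2 fall in the overlap.
p_value = hypergeom_upper_tail(k=2, N=10, K=3, n=4)
print(round(p_value, 4))  # 0.3333
```

The sketch uses exact integer arithmetic from the standard library, which is adequate for counts of the size reported here.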

Acknowledgments

We thank Dr. Ming-Chih J. Kao for his generous contribution of clinical knowledge to this study. We thank Dr. Frank Alber for his assistance in preparing the manuscript. We also thank the anonymous reviewers for their helpful comments. This project was supported by the National Institutes of Health Grants R01GM074163 (to X.J.Z.) and R21EY019094 (to H.H.), and the National Science Foundation Grants 0515936 and 0747475 (to X.J.Z.).

Footnotes

  • ²To whom correspondence may be addressed. E-mail: xjzhou{at}usc.edu or hhuang{at}stat.berkeley.edu.
  • Author contributions: H.H. and X.J.Z. designed research; H.H., C.-C.L., and X.J.Z. performed research; H.H., C.-C.L., and X.J.Z. contributed new reagents/analytic tools; H.H., C.-C.L., and X.J.Z. analyzed data; and H.H., C.-C.L., and X.J.Z. wrote the paper.

  • The authors declare no conflict of interest.

  • This article is a PNAS Direct Submission.

  • This article contains supporting information online at www.pnas.org/cgi/content/full/0912043107/DCSupplemental.

References

  1. Edgar R, Domrachev M, Lash AE (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30:207–210.
  2. Horton PB, Kiseleva L, Fujibuchi W (2006) RaPiDS: an algorithm for rapid expression profile database search. Genome Inform Ser 17(2):67–76.
  3. Tanner SW, Agarwal P (2008) Gene vector analysis (Geneva): A unified method to detect differentially-regulated gene sets and similar microarray experiments. BMC Bioinformatics 9(1):348.
  4. Hibbs MA, et al. (2007) Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23(20):2692–2699.
  5. Zhu Y, et al. (2008) GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics 24(23):2798–2800.
  6. Shah NH, et al. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 10(Suppl 2):S1.
  7. Alizadeh AA, et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503–511.
  8. Golub TR, et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537.
  9. Nilsson B, Andersson A, Johansson M, Fioretos T (2006) Cross-platform classification in microarray-based leukemia diagnostics. Haematologica 91:821–824.
  10. Stec J, et al. (2005) Comparison of the predictive accuracy of DNA array-based multigene classifiers across cDNA arrays and Affymetrix GeneChips. J Mol Diagn 7:357–367.
  11. Warnat P, Eils R, Brors B (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 6:265.
  12. Koller D, Sahami M (1997) Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning (ICML) 170–178.
  13. Lamb J, et al. (2006) The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929–1935.
  14. Liu CC, et al. (2009) Integrative disease classification based on cross-platform microarray data. BMC Bioinformatics 10(Suppl 1):S25.
  15. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32:D267–270.
  16. Aronson AR (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium, 17–21.
  17. Butte AJ, Kohane IS (2006) Creation and implications of a phenome-genome network. Nat Biotechnol 24:55–62.
  18. Dudley JT, Tibshirani R, Deshpande T, Butte AJ (2009) Disease signatures are robust across tissues and experiments. Molecular Systems Biology 5:307.
  19. Barutcuoglu Z, Schapire RE, Troyanskaya OG (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22:830–836.
  20. Pena-Castillo L, et al. (2008) A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 9(Suppl 1):S2.
  21. Hamosh A, et al. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33(Database Issue):D514–D517.
  22. Jackson JK, Higo T, Hunter WL, Burt HM (2008) Topoisomerase inhibitors as anti-arthritic agents. Inflamm Res 57(3):126–134.
  23. Kim RJ, Peterson G, Kulp B, Zanotti KM, Markman M (2005) Skin toxicity associated with pegylated liposomal doxorubicin (40 mg/m²) in the treatment of gynecologic cancers. Gynecol Oncol 97(2):374–378.
  24. Shan K, Lincoff AM, Young JB (1996) Anthracycline-induced cardiotoxicity. Annals of Internal Medicine 125(1):47–58.
  25. Peng X, Chen B, Lim CC, Sawyer DB (2005) The cardiotoxicology of anthracycline chemotherapeutics: translating molecular mechanism into preventative medicine. Mol Interv 5(3):163–171.
  26. Horenstein MS, Vander Heide RS, L’Ecuyer TJ (2000) Molecular basis of anthracycline-induced cardiotoxicity and its prevention. Mol Genet Metab 71:436–444.
  27. Garofalo C, Surmacz E (2006) Leptin and cancer. J Cell Physiol 207:12–22.
  28. Rio GD, et al. (2002) Weight gain in women with breast cancer treated with adjuvant cyclophosphomide, methotrexate and 5-fluorouracil. Analysis of resting energy expenditure and body composition. Breast Cancer Res Tr 73(3):267–273.
  29. Camoriano JK, et al. (1990) Weight change in women treated with adjuvant therapy or observed following mastectomy for node-positive breast cancer. J Clin Oncol 8:1327–1334.
  30. Butte AJ, Chen R (2006) Finding disease-related genomic experiments within an international repository: First steps in translational bioinformatics. Proceedings of the AMIA Symposium, pp 106–110.
Bayesian approach to transforming public gene expression repositories into disease diagnosis databases
Haiyan Huang, Chun-Chi Liu, Xianghong Jasmine Zhou
Proceedings of the National Academy of Sciences Apr 2010, 107 (15) 6823-6828; DOI: 10.1073/pnas.0912043107

Copyright © 2018 National Academy of Sciences.