Commentary

Illuminating the dark matter in metabolomics

Ricardo R. da Silva, Pieter C. Dorrestein, and Robert A. Quinn
  a. Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA 92093;
  b. Núcleo de Pesquisa em Produtos Naturais e Sintéticos, Departamento de Física e Química, Faculdade Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, São Paulo 14040-903, Brazil;
  c. Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, La Jolla, CA 92037

PNAS October 13, 2015 112 (41) 12549-12550; first published October 1, 2015; https://doi.org/10.1073/pnas.1516878112

Despite the more than 100-y history of mass spectrometry, it remains challenging to link the large volume of known chemical structures to the data obtained with mass spectrometers. Presently, only 1.8% of spectra in an untargeted metabolomics experiment can be annotated. This means that the vast majority of information collected by metabolomics is “dark matter,” chemical signatures that remain uncharacterized (Fig. 1). For a genomic comparison, 80% of predicted genes in the Escherichia coli genome are known. In a bacteriophage metagenome, a well-known frontier of biological dark matter, the fraction of known genes is 1–30%, depending on the sample (1). Thus, one could argue that we know more about the genetics of uncultured phage than we do about the chemistry within our own bodies. Much of the chemical dark matter may include known structures, but they remain unidentified because their reference spectra are not available in mass spectrometry databases. The only way to overcome this challenge is through the development of computational solutions. In PNAS, Dührkop et al. describe the development of such a computational tool, called CSI (compound structure identification):FingerID (2). The tool is designed to aid in the annotation of chemistries that can be observed by mass spectrometry. CSI:FingerID uses fragmentation trees to connect tandem MS (MS/MS) data to chemical structures found in public chemistry databases. Tools such as this can allow metabolomics with mass spectrometry to become as commonly used and scientifically productive as sequencing technologies have become in the field of genomics.

Fig. 1.

Millions of MS/MS spectra can be generated from a natural sample, such as this coral reef, but the vast majority of spectra are from unknown molecules. CSI:FingerID can help illuminate the chemical dark matter.

There are >60 million molecules in PubChem, yet only 220,000 MS/MS spectra, representing about 20,000 molecules, are accessible for untargeted metabolomics experiments (3). Chemists and biologists attempting to identify a mass spectrum without a match in a reference database, such as GNPS, Metlin, NIST, MassBank, and others, must often resort to Googling the parent mass or manually entering it into PubChem or similar chemical databases, hoping to find a match (3–5). The alternative is complete de novo structure elucidation, an even more laborious task that can require years of work and high-level expertise to isolate and determine the structure of a single molecule. To put this in perspective, a modern-day metabolomics experiment with hundreds to thousands of independent samples can easily contain 1 million unique spectra. Assuming that spectral matching takes approximately 10 min per spectrum for a trained eye, a gross underestimate, it would take 19 y of nonstop data analysis for a single project. This is obviously an unrealistic endeavor, especially considering that mass spectrometers will become even faster and more sensitive in the future.
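The two estimates in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the 365-day year and the variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# Assumption: a year of "nonstop" analysis is 365 days of 24 hours.
SPECTRA = 1_000_000           # unique spectra in a large experiment
MINUTES_PER_SPECTRUM = 10     # manual matching time per spectrum
MINUTES_PER_YEAR = 60 * 24 * 365

years = SPECTRA * MINUTES_PER_SPECTRUM / MINUTES_PER_YEAR

# Share of PubChem molecules that have a reference MS/MS spectrum.
# PubChem holds more than 60 million molecules, so this is an upper bound.
MOLECULES_WITH_SPECTRA = 20_000
PUBCHEM_MOLECULES = 60_000_000
coverage_percent = 100 * MOLECULES_WITH_SPECTRA / PUBCHEM_MOLECULES

print(f"{years:.1f} years of nonstop analysis")
print(f"at most {coverage_percent:.3f}% of PubChem molecules have reference spectra")
```

Even at an optimistic 10 min per spectrum, the manual route is out of reach, and the reference libraries cover only a small fraction of a percent of the known structures.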

The method presented by Dührkop et al. (2) is divided into three phases. In the first phase, called the learning phase, a database of tandem mass spectra of reference compounds is used to train a set of predictors for known molecular properties (the fingerprint). Using the data from these reference spectra, the method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. The tree assigns molecular formulas to the corresponding fragment peaks in the MS/MS spectrum, and fragments are connected by the assumed losses. The algorithm then tries to recover the identity and connectivity of the atoms in a molecular structure. With the predicted structure from the fragmentation tree, the method searches for multiple similarity measures for molecular structural comparison (called kernels) to improve the performance of molecular fingerprint prediction. A molecule's fingerprint is based on its molecular properties, retrieved from publicly available known structures (e.g., in PubChem or the literature).
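To make the fragmentation tree idea concrete, here is a minimal sketch of the data structure only: nodes carry fragment molecular formulas and each edge carries the loss between parent and child. It does not compute the tree that best explains a spectrum, which is a separate optimization problem. The class and function names and the example formulas are hypothetical and are not taken from CSI:FingerID.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass, field


def parse_formula(text: str) -> Counter:
    """Turn a formula such as 'C6H12O6' into element counts."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", text):
        counts[element] += int(number) if number else 1
    return counts


def format_formula(counts: Counter) -> str:
    """Write element counts back as a formula (elements in alphabetical order)."""
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()) if n > 0)


@dataclass
class FragmentNode:
    formula: str
    children: list[FragmentNode] = field(default_factory=list)

    def add_fragment(self, formula: str) -> FragmentNode:
        """Attach a child fragment; it must be a sub-formula of this node."""
        parent, child = parse_formula(self.formula), parse_formula(formula)
        if any(child[el] > parent[el] for el in child):
            raise ValueError(f"{formula} is not a sub-formula of {self.formula}")
        node = FragmentNode(formula)
        self.children.append(node)
        return node

    def losses(self):
        """Yield (parent, loss, child) for every edge below this node."""
        for child in self.children:
            loss = parse_formula(self.formula) - parse_formula(child.formula)
            yield self.formula, format_formula(loss), child.formula
            yield from child.losses()


# Made-up example tree, not measured data.
root = FragmentNode("C6H12O6")
fragment = root.add_fragment("C6H10O5")
fragment.add_fragment("C5H8O4")
edges = list(root.losses())
for parent, loss, child in edges:
    print(f"{parent} --[-{loss}]--> {child}")
```

Storing only formulas on the nodes and deriving each loss by subtraction keeps the edges consistent with the nodes by construction.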

In the second phase, a Support Vector Machine classifier is trained using the kernel similarities to separate molecular structures into a class that contains the molecular property and one that does not. Such classification is repeated for all molecular properties present in the fingerprint. With the classifier built in the previous step, the method proceeds to the prediction phase. Here, given the MS/MS spectrum of an unknown compound, the task is to calculate its kernel similarities against all compounds in the reference dataset. A fragmentation tree is again built and the result is a predicted fingerprint of the unknown compound. Dührkop et al. (2) point out that the machine-learning basis of the method allows for improvement in performance with additional reference MS/MS data. In the metabolomics and natural product community, the benefit from publicly available annotated reference spectra is becoming increasingly evident. One such resource is a part of the Global Natural Product Social Molecular Networking effort at gnps.ucsd.edu, which the authors extensively used for the development of CSI:FingerID. Such reference collections are crucial for the development of search tools, because machine-learning methods perform better with more comprehensive training sets. Studies such as this one will hopefully stimulate groups that isolate and characterize specific molecules to share their data. Data-sharing will facilitate the prediction and detection of new structures within the same molecular class, which will be enormously beneficial to both the mass spectrometry and life sciences communities. Dührkop et al. (2) refer to the use of orthogonal information (retention time, infrared and UV spectroscopy, and so forth) as a way to “manually” refine the best spectral match. There are several automated methods for using such orthogonal information, but most of them are limited to a specific experimental setup (6, 7). The availability of datasets covering different organisms and experimental procedures will allow the use of the full informational content of a mass spectrum, resulting in improved identification scores.
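The idea of training one binary classifier per fingerprint property and then predicting a fingerprint from kernel similarities can be sketched as follows. To stay dependency-free, this toy swaps in a kernel perceptron for the Support Vector Machine and a plain dot product on made-up feature vectors for the kernels the method computes; all names and data here are assumptions for illustration only.

```python
def linear_kernel(a, b):
    """Toy similarity between two feature vectors (stand-in for the real kernels)."""
    return sum(x * y for x, y in zip(a, b))


def train_property(K, labels, epochs=10):
    """Kernel perceptron for one fingerprint property; labels are +1 or -1."""
    n = len(labels)
    alpha = [0] * n  # number of mistakes made on each training example
    for _ in range(epochs):
        for i in range(n):
            score = sum(alpha[k] * labels[k] * K[k][i] for k in range(n))
            predicted = 1 if score > 0 else -1
            if predicted != labels[i]:
                alpha[i] += 1
    return alpha


def predict_fingerprint(kernel_row, models):
    """Predict every property from similarities to the reference compounds."""
    bits = []
    for alpha, labels in models:
        score = sum(a * y * k for a, y, k in zip(alpha, labels, kernel_row))
        bits.append(1 if score > 0 else 0)
    return tuple(bits)


# Hypothetical reference set: feature vectors and known two-property fingerprints.
references = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
fingerprints = [(1, 0), (1, 0), (0, 1), (0, 1)]

# Learning: kernel matrix between references, then one classifier per property.
K = [[linear_kernel(a, b) for b in references] for a in references]
models = []
for j in range(len(fingerprints[0])):
    labels = [1 if fp[j] else -1 for fp in fingerprints]
    models.append((train_property(K, labels), labels))

# Prediction: similarities of an unknown against all references.
unknown = (0.8, 0.2)
row = [linear_kernel(unknown, ref) for ref in references]
predicted = predict_fingerprint(row, models)
print(predicted)
```

Because each property gets its own classifier, adding reference compounds improves every property predictor independently, which is consistent with the point that more reference data should improve performance.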

The final stage of CSI:FingerID is the scoring phase. With the predicted fingerprint of an unknown molecule, one can retrieve all structures in a structure database that match the same molecular formula. For each candidate molecular structure, its fingerprint is scored against the predicted fingerprint. Dührkop et al. (2) benchmarked their tool and found an enormous improvement in the scoring function compared with similar algorithms (8, 9). In the last few years, computational methods for structural assessment of metabolomics data have seen significant development. For the two large-scale MS/MS datasets that were tested, the method achieved more correct identifications than the next-best available search algorithm. Dührkop et al.’s (2) method also provides fivefold more unique and correct identifications. The CSI:FingerID tool is available as a web server, providing an easy-to-use tool for wet laboratory scientists. The next steps in the tool’s evolution will be the ability to process multiple spectra at the same time in a batch process and a standalone version that runs on the user’s own computer. These options will speed the analysis workflow of complex metabolomics datasets. The method has the potential to improve identification in metabolomics experiments by expanding the search space beyond that available in spectral libraries. Dührkop et al. (2) also point to the potential to search databases containing hypothetical simulated compounds, expanding the search space by an order of millions (10). Matching spectra in a metabolomics experiment to molecules whose structure has not yet been elucidated may well be in reach within the next few years.
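The retrieval and ranking step described above can be illustrated with a small sketch. The scoring used here, a simple count of agreeing fingerprint positions, is an assumption chosen for clarity and is not the scoring function of Dührkop et al.; the database entries and names are likewise hypothetical.

```python
def agreement_score(predicted, candidate):
    """Count fingerprint positions where the candidate agrees with the prediction."""
    return sum(p == c for p, c in zip(predicted, candidate))


def rank_candidates(predicted, formula, database):
    """Keep candidates with the query's molecular formula, best score first."""
    same_formula = [entry for entry in database if entry["formula"] == formula]
    return sorted(
        same_formula,
        key=lambda entry: agreement_score(predicted, entry["fingerprint"]),
        reverse=True,
    )


# Made-up structure database entries.
database = [
    {"name": "candidate_A", "formula": "C6H12O6", "fingerprint": (1, 0, 1, 1)},
    {"name": "candidate_B", "formula": "C6H12O6", "fingerprint": (1, 1, 0, 0)},
    {"name": "candidate_C", "formula": "C5H10O5", "fingerprint": (1, 0, 1, 0)},
]

predicted_fingerprint = (1, 0, 1, 0)
ranking = rank_candidates(predicted_fingerprint, "C6H12O6", database)
for entry in ranking:
    score = agreement_score(predicted_fingerprint, entry["fingerprint"])
    print(entry["name"], score)
```

Note that candidate_C is excluded despite a perfect fingerprint match, because filtering by molecular formula happens before any scoring.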

As tools such as CSI:FingerID begin to illuminate more of the chemical dark matter, some form of chemical ontology must be agreed upon to better classify and bin structures into groups of related compounds. A classification hierarchy will allow the research community to link metabolites to their associated biological processes, whether or not the specific metabolite in question is biologically characterized. Such an ontology would greatly benefit from biological information about where a particular molecule or molecular family comes from and what it does. Many compounds in structure databases are chemically synthesized and not produced naturally. Although these compounds broaden the molecular space of these databases, they are most often not clearly differentiated from natural products. For CSI:FingerID, Dührkop et al. (2) enrich the molecular property information with molecules that have known biological activity (11) and weight these signatures with higher scores in their identifications. This is crucial to avoid convoluting the search with synthetic compounds (12), as strategies to differentiate signatures of metabolites and synthetic compounds improve the quality of results from search tools (13, 14). Databases with biologically relevant chemical information are becoming available, including the ChEBI database, Kyoto Encyclopedia of Genes and Genomes, and others (15, 16). These will be extremely useful to chemists and biologists as they apply computational tools to more complex systems with increasingly complex chemistry.

The field of genomics was made possible by the development of algorithms for comparing nucleic acid sequences to identify relatedness in genetic information. In the late 1980s and early 1990s hundreds of these algorithms were developed, including the basic local alignment search tool (BLAST) (17). Since then the field has exploded and technologies for sequencing millions of nucleic acids have been developed to capture the genetic information in our biological world. In metabolomics, the technological advances are already in place. Mass spectrometers are incredible machines capable of measuring the mass of molecules with unprecedented accuracy, on a massive scale, in timeframes of less than a second. However, computational resources analogous to BLAST and the NCBI’s GenBank database are only in their infancy. This year is the 25th anniversary of the release of BLAST. CSI:FingerID is an example of the type of tool required to expand the power of metabolomics and catch up to the successes of genomics. These tools are fundamental to harnessing mass spectral information and similar in their synthesis and setting to the early tools developed for genomics that revolutionized the field of biology. CSI:FingerID and other algorithms will help metabolomics catch up to the field of genomic bioinformatics, despite its 25-y head start, and begin to illuminate the diverse chemistry in our biological world.

Acknowledgments

R.R.d.S. is supported by the São Paulo Research Foundation (FAPESP-2015/03348-3).

Footnotes

  • To whom correspondence should be addressed. Email: pdorrestein@ucsd.edu.
  • Author contributions: R.R.d.S., P.C.D., and R.A.Q. wrote the paper.

  • The authors declare no conflict of interest.

  • See companion article on page 12580.

References

  1. Mokili JL, Rohwer F, Dutilh BE (2012) Metagenomics and future perspectives in virus discovery. Curr Opin Virol 2(1):63–77.
  2. Dührkop K, Shen H, Meusel M, Rousu J, Böcker S (2015) Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci USA 112:12580–12585.
  3. Johnson SR, Lange BM (2015) Open-access metabolomics databases for natural product research: present capabilities and future potential. Front Bioeng Biotechnol 3:22.
  4. Vaniya A, Fiehn O (2015) Using fragmentation trees and mass spectral trees for identifying unknown compounds in metabolomics. Trends Analyt Chem 69:52–61.
  5. Bouslimani A, Sanchez LM, Garg N, Dorrestein PC (2014) Mass spectrometry of natural products: Current, emerging and future technologies. Nat Prod Rep 31(6):718–729.
  6. Pluskal T, Uehara T, Yanagida M (2012) Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal Chem 84(10):4396–4403.
  7. Stanstrup J, Gerlich M, Dragsted LO, Neumann S (2013) Metabolite profiling and beyond: Approaches for the rapid processing and annotation of human blood serum mass spectrometry data. Anal Bioanal Chem 405(15):5037–5048.
  8. Heinonen M, Shen H, Zamboni N, Rousu J (2012) Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics 28(18):2333–2341.
  9. Shen Y, Yin C, Su M, Tu J (2010) Rapid, sensitive and selective liquid chromatography-tandem mass spectrometry (LC-MS/MS) method for the quantification of topically applied azithromycin in rabbit conjunctiva tissues. J Pharm Biomed Anal 52(1):99–104.
  10. Kind T, Fiehn O (2010) Advances in structure elucidation of small molecules using mass spectrometry. Bioanal Rev 2(1-4):23–60.
  11. Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525.
  12. Allen F, Greiner R, Wishart D (2014) Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11(1):98–110.
  13. Peironcely JE, Reijmers T, Coulier L, Bender A, Hankemeier T (2011) Understanding and classifying metabolite space and metabolite-likeness. PLoS One 6(12):e28966.
  14. Ruttkies C, Gerlich M, Neumann S (2013) Tackling CASMI 2012: Solutions from MetFrag and MetFusion. Metabolites 3(3):623–636.
  15. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28(1):27–30.
  16. Hastings J, et al. (2013) The ChEBI reference database and ontology for biologically relevant chemistry: Enhancements for 2013. Nucleic Acids Res 41(Database issue):D456–D463.
  17. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410.