Illuminating the dark matter in metabolomics
- (a) Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA 92093;
- (b) Núcleo de Pesquisa em Produtos Naturais e Sintéticos, Departamento de Física e Química, Faculdade Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, São Paulo 14040-903, Brazil;
- (c) Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, La Jolla, CA 92037

Despite more than 100 y of mass spectrometry, it remains challenging to link the large volume of known chemical structures to the data obtained with mass spectrometers. Presently, only 1.8% of spectra in an untargeted metabolomics experiment can be annotated. This means that the vast majority of information collected by metabolomics is “dark matter,” chemical signatures that remain uncharacterized (Fig. 1). For a genomic comparison, 80% of predicted genes in the Escherichia coli genome are known. In a bacteriophage metagenome, a well-known frontier of biological dark matter, the fraction of known genes is 1–30%, depending on the sample (1). Thus, one could argue that we know more about the genetics of uncultured phage than we do about the chemistry within our own bodies. Much of the chemical dark matter may consist of known structures that go unrecognized because their reference spectra are not available in mass spectrometry databases. The only way to overcome this challenge is through the development of computational solutions. In PNAS, Dührkop et al. describe the development of such a computational tool, called CSI (compound structure identification):FingerID (2). The tool is designed to aid in the annotation of chemistries that can be observed by mass spectrometry. CSI:FingerID uses fragmentation trees to connect tandem MS (MS/MS) data to chemical structures found in public chemistry databases. Tools such as this can allow metabolomics with mass spectrometry to become as commonly used and scientifically productive as sequencing technologies have become in the field of genomics.
Fig. 1. Millions of MS/MS spectra can be generated from a natural sample, such as this coral reef, but the vast majority of spectra are from unknown molecules. CSI:FingerID can help illuminate the chemical dark matter.
There are >60 million molecules in PubChem, yet only 220,000 MS/MS spectra, representing about 20,000 molecules, are accessible for untargeted metabolomics experiments (3). Chemists and biologists attempting to identify a mass spectrum without a match in a reference database, such as GNPS, Metlin, NIST, MassBank, and others, must often resort to Googling the parent mass or manually entering it into PubChem or similar chemical databases, hoping to find a match (3–5). The alternative is complete structure elucidation de novo, an even more laborious task that can require years of work and high-level expertise to isolate and determine the structure of a single molecule. To put this in perspective, a modern-day metabolomics experiment with hundreds to thousands of independent samples can easily contain 1 million unique spectra. Assuming that spectral matching takes approximately 10 min per spectrum for a trained eye, a gross underestimate, it would take 19 y of nonstop data analysis for a single project. This is obviously an unrealistic endeavor, especially considering that mass spectrometers will become even faster and more sensitive in the future.
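The 19-y figure follows directly from the two numbers quoted above. A quick check, assuming round-the-clock analysis and a 365-day year (both assumptions of this sketch, not statements from the commentary):

```python
# Back-of-the-envelope check of the manual-annotation estimate.
SPECTRA = 1_000_000               # unique spectra in one large experiment (from the text)
MINUTES_PER_SPECTRUM = 10         # manual matching time per spectrum (from the text)
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: nonstop work, 365-day year

total_minutes = SPECTRA * MINUTES_PER_SPECTRUM
years = total_minutes / MINUTES_PER_YEAR
print(f"{years:.1f} years of nonstop analysis")
```

With an 8-h working day instead of nonstop analysis, the same arithmetic would give roughly three times as long.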
The method presented by Dührkop et al. (2) is divided into three phases. In the first phase, called the learning phase, a database of tandem mass spectra of reference compounds is used to train a set of predictors for known molecular properties (the fingerprint). Using the data from these reference spectra, the method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. The tree assigns molecular formulas to the corresponding fragment peaks in the MS/MS spectrum, and fragments are connected by the assumed losses. The algorithm then tries to recover the identity and connectivity of the atoms in a molecular structure. With the predicted structure from the fragmentation tree, the method uses multiple similarity measures for molecular structural comparisons (called kernels) to improve the performance of molecular fingerprint prediction. A molecular fingerprint encodes the molecular properties of a compound, which are derived from publicly available known structures (e.g., in PubChem or the literature).
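To make the fragmentation-tree idea concrete, the following minimal sketch represents a tree as fragment formulas linked to their parents and derives the loss on each edge by subtracting formulas. It only illustrates the data structure; it is not the optimization that Dührkop et al. use to find the best tree. The formulas and the helper names (`parse_formula`, `loss_between`, `tree_parent`) are illustrative assumptions.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Turn 'C9H12NO2' into Counter({'H': 12, 'C': 9, 'O': 2, 'N': 1})."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def loss_between(parent, child):
    """Formula of the loss on the edge parent -> child, or None if impossible."""
    p, c = parse_formula(parent), parse_formula(child)
    if any(c[el] > p[el] for el in c):
        return None  # a fragment cannot contain atoms its parent lacks
    diff = {el: p[el] - c[el] for el in p if p[el] > c[el]}
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(diff.items()))

# Hypothetical tree: each key is a fragment formula, each value is its parent.
tree_parent = {
    "C8H10N": "C9H12NO2",  # the root is the precursor ion formula
    "C8H7": "C8H10N",
}

edges = {(parent, child): loss_between(parent, child)
         for child, parent in tree_parent.items()}
for (parent, child), loss in edges.items():
    print(f"{parent} -> {child}  (loss: {loss})")
```

Storing only the parent of each fragment keeps the structure a tree by construction, and the losses never need to be stored because they can be recomputed from the two formulas on an edge.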
In the second phase, a support vector machine classifier is trained using the kernel similarities to separate molecular structures into a class that contains the molecular property and one that does not. This classification is repeated for every molecular property present in the fingerprint. With the classifiers built in the previous step, the method proceeds to the prediction phase. Here, given the MS/MS spectrum of an unknown compound, the task is to calculate its kernel similarities against all compounds in the reference dataset. A fragmentation tree is again built, and the result is a predicted fingerprint of the unknown compound. Dührkop et al. (2) point out that the machine-learning basis of the method allows for improvement in performance with additional reference MS/MS data.
In the metabolomics and natural product community, the benefit from publicly available annotated reference spectra is becoming increasingly evident. One such resource is a part of the Global Natural Product Social Molecular Networking effort at gnps.ucsd.edu, which the authors extensively used for the development of CSI:FingerID. Such reference collections are crucial for the development of search tools, because machine-learning methods perform better with more comprehensive training sets. Studies such as this one will hopefully stimulate groups that isolate and characterize specific molecules to share their data. Data sharing will facilitate the prediction and detection of new structures within the same molecular class, which will be enormously beneficial to both the mass spectrometry and life sciences communities. Dührkop et al. (2) refer to the use of orthogonal information (retention time, infrared and UV spectroscopy, and so forth) as a way to “manually” refine the best spectral match. There are several automated methods for using such orthogonal information, but most of them are limited to a specific experimental setup (6, 7).
The availability of datasets covering different organisms and experimental procedures will allow the use of the full informational content of a mass spectrum, resulting in improved identification scores.
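The per-property classification and the prediction step described above can be sketched as follows. This is a toy under stated assumptions: a kernel perceptron stands in for the support vector machine, a cosine similarity on binned peaks stands in for the kernels of Dührkop et al., and the spectra, the two properties, and all function names are hypothetical.

```python
def cosine_kernel(a, b):
    """Cosine similarity between two spectra given as {mass bin: intensity}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def decision(alpha, spectra, signs, query):
    """Kernel decision value: sum_i alpha_i * y_i * k(x_i, query)."""
    return sum(a * y * cosine_kernel(s, query)
               for a, y, s in zip(alpha, signs, spectra))

def train_property(spectra, signs, epochs=20):
    """Kernel perceptron (stand-in for an SVM) for one molecular property."""
    alpha = [0.0] * len(spectra)
    for _ in range(epochs):
        for i, x in enumerate(spectra):
            if signs[i] * decision(alpha, spectra, signs, x) <= 0:
                alpha[i] += 1.0  # misclassified: give this example more weight
    return alpha

# Hypothetical reference data: four spectra and their two-bit fingerprints.
reference_spectra = [
    {77: 1.0, 105: 0.5},
    {77: 0.8, 45: 0.6},
    {45: 1.0, 60: 0.4},
    {60: 1.0, 88: 0.7},
]
reference_fingerprints = [[1, 0], [1, 1], [0, 1], [0, 0]]

# Learning: one binary classifier per fingerprint property.
models = []
for p in range(len(reference_fingerprints[0])):
    signs = [1 if fp[p] else -1 for fp in reference_fingerprints]
    models.append((train_property(reference_spectra, signs), signs))

def predict_fingerprint(query):
    """Prediction: compare the query with every reference, one bit per model."""
    return [1 if decision(alpha, reference_spectra, signs, query) > 0 else 0
            for alpha, signs in models]

unknown = {77: 0.9, 45: 0.5, 105: 0.2}
predicted = predict_fingerprint(unknown)
print(predicted)
```

Because each property has its own classifier, adding reference spectra improves every bit of the fingerprint independently, which is one way to read the remark that performance grows with more reference data.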
The final stage of CSI:FingerID is the scoring phase. With the predicted fingerprint of an unknown molecule, one can retrieve all structures with the same molecular formula from a structure database. For each candidate molecular structure, its fingerprint is scored against the predicted fingerprint. In the last few years, computational methods for structural assessment of metabolomics data have seen significant development. Dührkop et al. (2) benchmarked their tool and found an enormous improvement in the scoring function compared with similar algorithms (8, 9). For the two large-scale MS/MS datasets that were tested, the method achieved more correct identifications than the next-best available search algorithm. Dührkop et al.’s (2) method also provides fivefold more unique and correct identifications. The CSI:FingerID tool is available as a web server, providing an easy-to-use tool for wet laboratory scientists. The next steps in the tool’s evolution will be the ability to process multiple spectra at the same time in a batch process and a standalone version that runs on the user’s own computer. These options will speed the analysis workflow of complex metabolomics datasets. The method has the potential to improve identification in metabolomics experiments by expanding the search space beyond that available in spectral libraries. Dührkop et al. (2) also point to the potential to search databases containing hypothetical simulated compounds, expanding the search space by millions of structures (10). Matching spectra in a metabolomics experiment to molecules whose structure has not yet been elucidated may well be in reach within the next few years.
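The scoring step can be illustrated with a deliberately simple stand-in. The sketch below ranks candidate structures by the Tanimoto similarity between their fingerprints and the predicted fingerprint. Tanimoto similarity is a common baseline and is used here as an assumption; it is not the scoring function developed by Dührkop et al., and the fingerprints and candidate names are hypothetical.

```python
def tanimoto(a, b):
    """Shared set bits divided by bits set in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

# Hypothetical predicted fingerprint of the unknown compound.
predicted = [1, 1, 0, 1, 0, 0]

# Hypothetical candidates that share the unknown's molecular formula.
candidates = {
    "candidate_A": [1, 1, 0, 1, 0, 0],
    "candidate_B": [1, 0, 0, 1, 1, 0],
    "candidate_C": [0, 0, 1, 0, 1, 1],
}

scores = {name: tanimoto(predicted, fp) for name, fp in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

Restricting the candidates to one molecular formula before scoring keeps the list short, so even a simple per-candidate comparison like this one remains cheap against a database of millions of structures.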
As tools such as CSI:FingerID begin to illuminate more of the chemical dark matter, some form of a chemical ontology must be agreed upon to better classify and bin structures into groups of related compounds. A classification hierarchy will allow the research community to link metabolites to their associated biological processes, whether or not the specific metabolite in question is biologically characterized. Such an ontology would greatly benefit from biological information about where a particular molecule or molecular family comes from and what it does. Many compounds in structure databases are chemically synthesized and not produced naturally. Although these compounds broaden the molecular space of these databases, they are most often not clearly differentiated from natural products. For CSI:FingerID, Dührkop et al. (2) enrich the molecular property information with molecules that have known biological activity (11) and weight these signatures with higher scores in their identifications. This is crucial to avoid cluttering the search with synthetic compounds (12), as strategies to differentiate signatures of metabolites and synthetic compounds improve the quality of results from search tools (13, 14). Databases with biologically relevant chemical information are becoming available, including the ChEBI database, the Kyoto Encyclopedia of Genes and Genomes, and others (15, 16). These will be extremely useful to chemists and biologists as they apply computational tools to more complex systems with increasingly complex chemistry.
The field of genomics was made possible by the development of algorithms for comparing nucleic acid sequences to identify relatedness in genetic information. In the late 1980s and early 1990s hundreds of these algorithms were developed, including the basic local alignment search tool (BLAST) (17). Since then the field has exploded, and technologies for sequencing millions of nucleic acids have been developed to capture the genetic information in our biological world. In metabolomics, the technological advances are already in place. Mass spectrometers are incredible machines capable of measuring the mass of molecules with unprecedented accuracy, on a massive scale, in timeframes of less than a second. However, computational resources analogous to BLAST and the NCBI’s GenBank database are only in their infancy. This year is the 25th anniversary of the release of BLAST. CSI:FingerID is an example of the type of tool required to expand the power of metabolomics and catch up to the successes of genomics. These tools are fundamental to harnessing mass spectral information and resemble, in their conception and setting, the early tools developed for genomics that revolutionized the field of biology. CSI:FingerID and other algorithms will help metabolomics catch up to the field of genomic bioinformatics, despite its 25-y head start, and begin to illuminate the diverse chemistry in our biological world.
Acknowledgments
R.R.d.S. is supported by the São Paulo Research Foundation (FAPESP-2015/03348-3).
References
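1. [entry not recoverable]
2. Dührkop K, Shen H, Meusel M, Rousu J, Böcker S
3. [entry not recoverable]
4. [entry not recoverable]
5. [entry not recoverable]
6. [entry not recoverable]
7. [entry not recoverable]
8. Heinonen M, Shen H, Zamboni N, Rousu J
9. [entry not recoverable]
10. [entry not recoverable]
11. Klekota J, Roth FP
12. Allen F, Greiner R, Wishart D
13. [entry not recoverable]
14. [entry not recoverable]
15. Kanehisa M, Goto S
16. Hastings J, et al.
17. [entry not recoverable]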
Article Classifications
- Physical Sciences
- Chemistry
- Biological Sciences
- Cell Biology