De novo peptide sequencing by deep learning

Edited by John R. Yates III, Scripps Research Institute, La Jolla, CA, and accepted by Editorial Board Member David Baker June 26, 2017 (received for review April 6, 2017)
July 18, 2017
114 (31) 8247-8252

Significance

Our method, DeepNovo, introduces deep learning to de novo peptide sequencing from tandem MS data, the key technology for protein characterization in proteomics research. DeepNovo achieves major improvements in sequencing accuracy over state-of-the-art methods and subsequently enables complete assembly of protein sequences without assisting databases. Our model is retrainable to adapt to any source of data and provides a complete end-to-end training and prediction solution, an important feature given the massive and growing amounts of data. Our study also presents an innovative approach that combines deep learning and dynamic programming to solve optimization problems.

Abstract

De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. The DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state-of-the-art methods, achieving 7.7–22.9% higher accuracy at the amino acid level and 38.1–64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5–100% coverage and 97.2–99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any source of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach to solving optimization problems by using deep learning and dynamic programming.
Proteomics research focuses on large-scale studies to characterize the proteome, the entire set of proteins, in a living organism (1–5). In proteomics, de novo peptide sequencing from tandem MS data plays the key role in the characterization of novel protein sequences. This field has been actively studied over the past 20 y, and many de novo sequencing tools have been proposed, such as PepNovo, PEAKS, NovoHMM, MSNovo, pNovo, UniNovo, and Novor, among others (6–19). The recent “gold rush” into mAbs has undoubtedly elevated the application of de novo sequencing to a new horizon (20–23). However, computational challenges still remain, because MS/MS spectra contain much noise and ambiguity that require rigorous global optimization with various forms of dynamic programming that have been developed over the past decade (8–10, 12, 13, 15–19, 24).
In this study, we introduce neural networks and deep learning to de novo peptide sequencing and achieve major breakthroughs on this well-studied problem. Deep learning has recently brought about a revolution in many research fields (25), repeatedly breaking state-of-the-art records in image processing (26, 27), speech recognition (28), and natural language processing (29). It now forms the core of the artificial intelligence platforms of several technology giants, such as Google, Facebook, and Microsoft, as well as many startups in the industry. Deep learning has also made its way into the biological sciences (30) [for instance, in the field of genomics, deep neural network models have been developed for predicting the effects of noncoding single-nucleotide variants (31), predicting protein–DNA and protein–RNA binding sites (32), protein contact map prediction (33), and MS imaging (34)]. The key aspect of deep learning is its ability to learn multiple levels of representation of high-dimensional data through its many layers of neurons. Furthermore, unlike in traditional machine learning methods, those feature layers are not predesigned based on domain-specific knowledge; hence, they have more flexibility to discover complex structures in the data.
The task of de novo peptide sequencing is to reconstruct the amino acid sequence of a peptide given an MS/MS spectrum and the peptide mass. A spectrum can be represented as a histogram of intensity vs. mass (more precisely, m/z) of the ions acquired from the peptide fragmentation inside a mass spectrometer (Fig. 1A). The problem bears some similarity to the recently trending topic of “automatically generating a description for an image.” In that research, a convolutional neural network (CNN; i.e., a type of feed-forward artificial neural network consisting of multiple layers of receptive fields) is used to encode or “understand” the image. Then, a long short-term memory (LSTM) recurrent neural network (RNN) (35) is used to decode or “describe” the content of the image (36, 37). The research is exciting, because it tries to connect image recognition and natural language processing by integrating two fundamental types of neural networks, CNN and LSTM.
Fig. 1.
The DeepNovo model for de novo peptide sequencing. (A) Spectra are processed by the CNN spectrum-CNN and then used to initialize the LSTM network. (B) DeepNovo sequences a peptide by predicting one amino acid at each iteration. Beginning with a special symbol start, the model predicts the next amino acid by conditioning on the input spectrum and the output of previous steps. The process stops if, in the current step, the model outputs the special symbol end. (C) Details of a sequencing step in DeepNovo. Two classification models, ion-CNN and LSTM, use the output of previous sequencing steps as a prefix to predict the next amino acid.
In our de novo sequencing problem, the research is carried to the next extreme, where exactly 1 of 20^L amino acid sequences can be considered as the correct prediction (L is the peptide length, and 20 is the total number of amino acid letters). Another challenge is that peptide fragmentation generates multiple types of ions, including a, b, c, x, y, z, internal cleavage, and immonium ions (38). Depending on the fragmentation method, different types of ions may have quite different intensity values (peak heights), and yet, the ion type information remains unknown from the spectrum data. Furthermore, there are plenty of noise peaks mixed together with the real ions. Finally, the predicted amino acid sequence should have its total mass approximately equal to the given peptide mass. This challenge points to a complicated problem of pattern recognition and global optimization on noisy and incomplete data. The problem is typically handled by global dynamic programming (8–10, 12, 13, 15–19, 24), divide and conquer (11), or integer linear programming (14). Hence, a naïve application of existing deep learning architectures does not work directly on this problem. Neural networks are known to be good at simulating human capabilities, such as senses and intuition, rather than such precise optimization tasks. Thus, de novo peptide sequencing is a perfect case for exploring the boundaries of deep learning.
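To make the scale of this search space concrete, it can be computed directly (a trivial illustrative sketch; `candidate_space` is our own helper, not part of any sequencing tool):

```python
def candidate_space(length, alphabet_size=20):
    """Number of possible amino acid sequences for a peptide of a
    given length: exactly one of alphabet_size**length candidates
    is the correct prediction."""
    return alphabet_size ** length

# A peptide of length 10 already has 20**10 (about 10 trillion)
# candidate sequences, and only those whose total residue mass
# matches the measured peptide mass are even admissible.
```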
In this study, we have succeeded in designing a deep learning system, DeepNovo, for de novo peptide sequencing. Our model features a sophisticated architecture of CNNs and LSTM networks together with local dynamic programming. DeepNovo has beaten the decade-long-standing state-of-the-art records of de novo sequencing algorithms by a large margin: 7.7–22.9% at the amino acid level and 38.1–64.0% at the peptide level. Similar to other deep learning-based models, DeepNovo takes advantage of high-performance graphics processing units (GPUs) and massive amounts of data to offer a complete end-to-end training and prediction solution. The CNN and LSTM networks of DeepNovo can be jointly trained from scratch given a set of annotated spectra obtained from spectral libraries or database search tools. This architecture allows us to train both general and specific models to adapt to any source of data. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, an important downstream application of peptide sequencing. This application previously required de novo sequencing, database search, and homology search together to succeed (21) but now can be done by using DeepNovo alone.

Results

DeepNovo Model.

The DeepNovo model is briefly illustrated in Fig. 1. The model takes a spectrum as input and tries to sequence the peptide by predicting one amino acid at each iteration (Fig. 1 A and B). The sequencing process begins with a special symbol “start.” At each sequencing step, the model predicts the next amino acid by conditioning on the input spectrum and the output of previous steps. The process stops if, in the current step, the model outputs the special symbol “end.” Backward sequencing is performed in a similar way to form the bidirectional sequencing, and the highest-scoring candidate is selected as the final prediction.
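The iterative sequencing process just described can be sketched as a plain decoding loop; `predict_next` below is a hypothetical stand-in for the trained model's per-step prediction, not DeepNovo's actual API:

```python
def greedy_decode(spectrum, predict_next, max_len=50):
    """Sketch of DeepNovo's forward sequencing pass: start from the
    special 'start' symbol, repeatedly predict the next amino acid
    conditioned on the spectrum and the current prefix, and stop
    when the model emits the special 'end' symbol."""
    prefix = ["<start>"]
    while len(prefix) <= max_len:
        symbol = predict_next(spectrum, prefix)
        if symbol == "<end>":
            break
        prefix.append(symbol)
    return prefix[1:]  # the predicted peptide, without the start symbol

# Toy stand-in model that always "sequences" the peptide PEPT:
canned = iter(["P", "E", "P", "T", "<end>"])
print(greedy_decode(None, lambda s, p: next(canned)))  # ['P', 'E', 'P', 'T']
```

In the full model, backward decoding runs the same loop from the C terminus, and the higher-scoring direction provides the final prediction.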
Details inside a sequencing step are described in Fig. 1C. DeepNovo incorporates two classification models that use the output of previous sequencing steps as a prefix to predict the next amino acid. In the first model, the prefix mass is first calculated as the sum of its amino acids’ masses and the corresponding terminal mass. Then, each amino acid type is appended to the prefix, and the corresponding theoretical b- and y-fragment ions are identified. For each fragment ion, an intensity window of size 1.0 Da around its location on the input spectrum is retrieved. The combined intensity profile of the fragment ions then flows through a CNN, called ion-CNN. The ion-CNN learns local features (the peaks) and summarizes the overall information provided by the fragment ions in the spectrum (SI Text).
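A minimal sketch of the theoretical-ion step, using an assumed subset of monoisotopic residue masses and the simplified mass convention of the worked example in SI Text (where the b ion of prefix P-E-P-T is its residue-mass sum, about 424.2 Da); the y-ion formula here is likewise an illustrative assumption:

```python
# Assumed monoisotopic residue masses in Da (illustrative subset).
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768}
H2O = 18.01056

def theoretical_b_y(prefix, total_residue_mass):
    """b/y fragment masses for a prefix, under a simplified convention:
    b = residue-mass sum of the prefix; y = mass of the complementary
    suffix plus water (assumption). `total_residue_mass` is the summed
    residue mass of the whole peptide."""
    b = sum(RESIDUE_MASS[aa] for aa in prefix)
    y = total_residue_mass - b + H2O
    return b, y

b, y = theoretical_b_y(["P", "E", "P", "T"], total_residue_mass=525.24)
# b is about 424.2 Da, as in the worked example in SI Text.
```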
The second model of DeepNovo is an LSTM network, the most popular type of RNN (35). The LSTM model represents each amino acid class by an embedding vector [i.e., a collection of parameters that characterize the class, similar to word2vec (39)]. Given a prefix, the model looks up the corresponding embedding vectors and sequentially puts them through the LSTM network. Moreover, DeepNovo also encodes the input spectrum and uses it to initialize the cell state of the LSTM network (36, 37). For that purpose, the spectrum is discretized into an intensity vector that subsequently flows through another CNN, called spectrum-CNN, before being fed to the LSTM network (Fig. 1A and SI Text).
The outputs of the two models are finally combined to produce a probability distribution over the amino acid classes. The next amino acid can be selected as the one with the highest probability or sampled from the distribution. Moreover, given the peptide mass and the prefix mass, DeepNovo calculates the suffix mass and uses the knapsack dynamic programming algorithm to filter out those amino acids with masses that do not fit the suffix mass. This processing guarantees that final candidate sequences will have the correct peptide mass. Combining all together, DeepNovo then performs beam search, a heuristic search algorithm that explores a fixed number of top candidate sequences at each iteration, until it finds the optimum prediction. Additional details of the model can be found in Methods and SI Text.
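The knapsack filter described above can be sketched as a reachability table over discretized masses (a simplified sketch with an illustrative resolution and a two-residue alphabet; DeepNovo's actual implementation details are in SI Text):

```python
def reachable(residue_masses, target_mass, resolution=0.01):
    """Knapsack-style dynamic programming: mark every discretized
    mass in [0, target_mass] that can be written as a sum of residue
    masses. Index i stands for mass i * resolution."""
    n = int(round(target_mass / resolution))
    table = [False] * (n + 1)
    table[0] = True
    steps = [int(round(m / resolution)) for m in residue_masses]
    for i in range(1, n + 1):
        table[i] = any(i >= s and table[i - s] for s in steps)
    return table

def filter_candidates(residue_masses, suffix_mass, resolution=0.01):
    """Keep an amino acid only if, after placing it, the remaining
    suffix mass is still reachable -- this is how the filter guarantees
    that final candidates match the given peptide mass."""
    table = reachable(residue_masses, suffix_mass, resolution)
    kept = []
    for m in residue_masses:
        rest = int(round((suffix_mass - m) / resolution))
        if 0 <= rest < len(table) and table[rest]:
            kept.append(m)
    return kept

# With only glycine (57.02146 Da) and alanine (71.03711 Da) available,
# a suffix mass of G + A keeps both residues, whereas 100.0 Da keeps
# neither, because no combination of G and A can complete it.
```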

Datasets and Benchmark Criteria.

We evaluated the performance of DeepNovo compared with current state-of-the-art de novo peptide sequencing tools, including PEAKS [version 8.0 (40)], Novor (19), and PepNovo (12). For performance evaluation, we used two sets of data, low resolution and high resolution, from previous publications. The low-resolution set includes seven datasets (41–47) (Table S1). The first five datasets were acquired from the Thermo Scientific LTQ Orbitrap with the collision-induced dissociation (CID) technique. The other two were acquired from the Thermo Scientific Orbitrap Fusion with the higher-energy collisional dissociation (HCD) technique. The high-resolution set includes nine datasets acquired from the Thermo Scientific Q-Exactive with the HCD technique (48–56) (Table S2). We chose data from a wide variety of species and research groups to ensure an unbiased evaluation. All datasets can be downloaded from the ProteomeXchange database and the Chorus database. More details about the datasets and liquid chromatography (LC)-MS/MS experiments can be found in Tables S1 and S2 and the original publications.
Table S1.
Summary of seven low-resolution datasets used in our experiments
Species | No. of RAW files | Total no. of spectra | No. of PSMs (1% FDR) | Precursor mass error tolerance (ppm) | Fragment ion error tolerance (Da) | Accession no.
Mus musculus | 40 | 792,148 | 355,514 | 10 | 0.6 | PXD002247
Caenorhabditis elegans | 108 | 1,125,050 | 437,097 | 20 | 0.8 | PXD000636
Escherichia coli | 70 | 3,239,116 | 1,174,817 | 20 | 0.5 | PXD002912
Drosophila melanogaster | 18 | 681,968 | 178,853 | 20 | 0.5 | PXD004120
H. sapiens | 27 | 804,473 | 497,191 | 10 | 0.6 | PXD002179
Saccharomyces cerevisiae | 6 | 558,564 | 280,377 | 20 | 0.5 | Link is available in ref. 46
Pseudomonas aeruginosa | 25 | 2,781,682 | 603,601 | 20 | 0.3 | PXD004560
FDR, false discovery rate; RAW, a Thermo Scientific mass spectrometer binary data file format.
Table S2.
Summary of nine high-resolution datasets used in our experiments
Species | No. of RAW files | Total no. of spectra | No. of PSMs (1% FDR) | Precursor mass error tolerance (ppm) | Fragment ion error tolerance (Da) | Accession no.
Vigna mungo | 19 | 735,618 | 37,775 | 20 | 0.05 | PXD005025
Mus musculus | 9 | 276,648 | 37,021 | 10 | 0.05 | PXD004948
Methanosarcina mazei | 16 | 800,768 | 164,421 | 10 | 0.05 | PXD004325
Bacillus | 14 | 571,615 | 291,783 | 30 | 0.05 | PXD004565
Candidatus endoloripes | 9 | 1,862,619 | 150,611 | 20 | 0.05 | PXD004536
Solanum lycopersicum | 60 | 603,506 | 290,050 | 15 | 0.05 | PXD004947
Saccharomyces cerevisiae | 5 | 277,077 | 111,312 | 20 | 0.05 | PXD003868
Apis mellifera | 17 | 822,069 | 314,571 | 20 | 0.05 | PXD004467
H. sapiens | 26 | 684,821 | 130,583 | 20 | 0.02 | PXD004424
FDR, false discovery rate; RAW, a Thermo Scientific mass spectrometer binary data file format.
We used PEAKS DB software [version 8.0 (40)] with a false discovery rate of 1% to search those datasets against the UniProt database and the taxon of the sample. The peptide sequences identified from the database search were assigned to the corresponding MS/MS spectra and then used as ground truth for testing the accuracy of de novo sequencing results. Tables S1 and S2 show the summary of PEAKS DB search results for the low- and high-resolution datasets, respectively.
We performed leave-one-out cross-validations. In each validation, all except one of the datasets were used for training DeepNovo (from scratch), and the remaining dataset was used for testing. Other tools have already been trained by their authors and were only tested on all datasets. It should be noted that the training datasets and the testing dataset come from different species. The cross-validation is to guarantee unbiased training and testing and does not give DeepNovo any advantage. All tools were configured with the same settings, including fixed modification carbamidomethylation, variable modifications oxidation and deamidation, and fragment ion and precursor mass error tolerances (Tables S1 and S2).
To measure the accuracy of de novo sequencing results, we compared the real peptide sequence and the de novo peptide sequence of each spectrum. A de novo amino acid is considered “matched” with a real amino acid if their masses are different by less than 0.1 Da and if the prefix masses before them are different by less than 0.5 Da. Such an approximate match is used instead of an exact match because of the resolution of the benchmark datasets. We calculated the total recall (and precision) of de novo sequencing as the ratio of the total number of matched amino acids over the total length of real peptide sequences (and predicted peptide sequences, respectively) in the testing dataset. We also calculated the recall at the peptide level (i.e., the fraction of real peptide sequences that were fully correctly predicted). Most importantly, all sequencing tools report confidence scores for their predictions. The confidence scores reflect the quality of predicted amino acids and are valuable for downstream analysis [e.g., reconstructing the entire protein sequence from its peptides (21)]. Setting a higher threshold of confidence scores will output a smaller set of peptides with high precision but will leave the rest of the dataset without results, hence leading to lower recall and vice versa. Hence, given the availability of recall, precision, and confidence scores, it is reasonable to draw precision–recall curves and use the area under the curve (AUC) as a summary of de novo sequencing accuracy (57). These measures of sequencing accuracy have been widely used in previous publications (10, 12, 19).
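The matching criterion can be sketched as follows, under assumed residue masses; the alignment strategy below (advancing whichever running prefix mass is smaller) is our own illustrative choice, not necessarily the exact procedure used in the benchmarks:

```python
# Assumed monoisotopic residue masses in Da (illustrative subset).
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276}

def matched_amino_acids(real, predicted, aa_tol=0.1, prefix_tol=0.5):
    """Count 'matched' amino acids: a predicted residue matches a real
    one if their masses differ by < aa_tol Da and the prefix masses
    before them differ by < prefix_tol Da. Recall divides this count
    by the real length; precision divides it by the predicted length."""
    i = j = matched = 0
    real_prefix = pred_prefix = 0.0
    while i < len(real) and j < len(predicted):
        if (abs(real_prefix - pred_prefix) < prefix_tol
                and abs(MASS[real[i]] - MASS[predicted[j]]) < aa_tol):
            matched += 1
            real_prefix += MASS[real[i]]
            pred_prefix += MASS[predicted[j]]
            i += 1
            j += 1
        elif real_prefix + MASS[real[i]] <= pred_prefix + MASS[predicted[j]]:
            real_prefix += MASS[real[i]]   # advance the lighter side
            i += 1
        else:
            pred_prefix += MASS[predicted[j]]
            j += 1
    return matched
```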

Comparison of De Novo Sequencing Accuracy.

Fig. 2 and Fig. S1 show the precision–recall curves and the AUC of de novo sequencing tools on the seven low-resolution datasets. DeepNovo considerably outperformed the other tools across all seven datasets. In particular, for Homo sapiens, the AUC of DeepNovo was 33.3% higher than that of PEAKS (0.48/0.36 = 1.333) and 11.6% higher than that of Novor (0.48/0.43 = 1.116). PEAKS and Novor often came in second place, whereas PepNovo fell behind, probably because it has not been updated with recent data. We also noticed that Novor performed relatively better on CID data, whereas PEAKS performed relatively better on HCD data. The AUC of DeepNovo was 18.8–50.0% higher than PEAKS, 7.7–34.4% higher than Novor, and overall, 7.7–22.9% higher than the second-best tool across all seven datasets. The improvement of DeepNovo over the other methods was larger on HCD data than on CID data, probably because the HCD technique produces better fragmentation and hence more patterns for DeepNovo to learn. The superior accuracy over state-of-the-art sequencing tools on a wide variety of species shows the powerful and robust performance of DeepNovo.
Fig. 2.
The precision–recall curves and the AUC of PepNovo, PEAKS, Novor, and DeepNovo. (A) Precision–recall curves on H. sapiens. (B) Precision–recall curves on Saccharomyces cerevisiae. (C) Precision–recall curves on Pseudomonas aeruginosa. (D) AUC of four sequencing tools on seven datasets. C. elegans, Caenorhabditis elegans; D. melanogaster, Drosophila melanogaster; E. coli, Escherichia coli; M. musculus, Mus musculus.
Fig. S1.
The precision–recall curves of the de novo sequencing results on the other four low-resolution datasets. (A) The precision–recall curves on Mus musculus. (B) The precision–recall curves on Caenorhabditis elegans. (C) The precision–recall curves on Escherichia coli. (D) The precision–recall curves on Drosophila melanogaster.
Fig. 3 A and B shows the total recall and precision, respectively, of de novo sequencing results on the seven datasets. Here, we used all sequencing results from each tool, regardless of their confidence scores. Again, DeepNovo consistently achieved both higher recall and precision than other tools. DeepNovo recall was 8.4–30.2% higher than PEAKS and 3.9–22.1% higher than Novor. DeepNovo precision was 2.3–18.1% higher than PEAKS and 2.4–20.9% higher than Novor.
Fig. 3.
Total recall and precision of PepNovo, PEAKS, Novor, and DeepNovo on seven datasets. (A) Recall at amino acid level. (B) Precision at amino acid level. (C) Recall at peptide level. C. elegans, Caenorhabditis elegans; D. melanogaster, Drosophila melanogaster; E. coli, Escherichia coli; M. musculus, Mus musculus; P. aeruginosa, Pseudomonas aeruginosa; S. cerevisiae, Saccharomyces cerevisiae.
Fig. 3C shows the total recall of de novo sequencing tools at the peptide level. MS/MS spectra often have missing fragment ions, making it difficult to predict a few amino acids, especially those at the beginning or the end of a peptide sequence. Hence, de novo-sequenced peptides are often not fully correct. Those few amino acids may not increase the amino acid-level accuracy much, but they can result in substantially more fully correct peptides. As shown in Fig. 3C, DeepNovo greatly surpassed other tools; its recall at the peptide level was 38.1–88.0% higher than PEAKS and 42.7–67.6% higher than Novor. This result shows the advantage of the LSTM model in DeepNovo that makes use of sequence patterns to overcome the limitation of MS/MS missing data.
Fig. S2 shows the evaluation results on the nine high-resolution datasets. Novor and PepNovo were not trained with this type of data, and hence, their performance was not as good as PEAKS and DeepNovo. As can be seen from Fig. S2, the AUC of DeepNovo outperformed that of PEAKS across all nine datasets from 1.6 to 33.3%. Fig. S3 shows that the total amino acid recall of DeepNovo was 0.2–5.7% higher than that of PEAKS for eight datasets and 3.1% lower than PEAKS for the H. sapiens dataset. At the peptide level, the total recall of DeepNovo was 5.9–45.6% higher than PEAKS across all nine datasets.
Fig. S2.
The precision–recall curves and the AUCs of PepNovo, Novor, PEAKS, and DeepNovo on nine high-resolution datasets. (A) Precision–recall curves on Vigna mungo. (B) Precision–recall curves on Mus musculus. (C) Precision–recall curves on Methanosarcina mazei. (D) Precision–recall curves on Bacillus. (E) Precision–recall curves on Candidatus endoloripes. (F) Precision–recall curves on Solanum lycopersicum. (G) Precision–recall curves on Saccharomyces cerevisiae. (H) Precision–recall curves on Apis mellifera. (I) Precision–recall curves on Homo sapiens. (J) AUC of four sequencing tools on nine datasets.
Fig. S3.
Total recall and precision of PepNovo, Novor, PEAKS, and DeepNovo on nine high-resolution datasets. (A) Recall at the amino acid level. (B) Precision at the amino acid level. (C) Recall at the peptide level. A. mellifera, Apis mellifera; C. endoloripes, Candidatus endoloripes; M. mazei, Methanosarcina mazei; M. musculus, Mus musculus; S. cerevisiae, Saccharomyces cerevisiae; S. lycopersicum, Solanum lycopersicum; V. mungo, Vigna mungo.
We also evaluated DeepNovo, Novor, and PEAKS on three testing datasets in the Novor paper (19). The results were consistent with those reported earlier, and DeepNovo achieved 4.1–12.1% higher accuracy than the other tools (Fig. S4).
Fig. S4.
The precision–recall curves and the AUCs of PEAKS, Novor, and DeepNovo on three public real datasets. (A) Precision–recall curves on Ubiquitin. (B) Precision–recall curves on UPS2. (C) Precision–recall curves on U2OS. (D) AUC of three sequencing tools on three real datasets.

Performance of Neural Network Models in DeepNovo.

The improvement of DeepNovo over state-of-the-art methods comes from its two classification models, ion-CNN and LSTM, and the knapsack dynamic programming algorithm. Fig. 4 shows a detailed breakdown of how those components contributed to the total recall. DeepNovo has options to use its models individually or collectively, making it very convenient for additional research and development. The neural networks can be trained together, or they can be trained separately and combined via the last hidden layer, a common training technique for multimodal neural networks. Although combining multiple models does not simply add up their individual accuracy gains, Fig. 4 suggests that there is still much room to improve the LSTM network, and that will be our priority for additional development of DeepNovo.
Fig. 4.
The contributions of DeepNovo’s components to its total recall on seven datasets. C. elegans, Caenorhabditis elegans; D. melanogaster, Drosophila melanogaster; E. coli, Escherichia coli; M. musculus, Mus musculus; P. aeruginosa, Pseudomonas aeruginosa; S. cerevisiae, Saccharomyces cerevisiae.

Reconstructing Antibody Sequences with DeepNovo.

In this section, we present a key downstream application of DeepNovo for complete de novo sequencing of mAbs. We trained the DeepNovo model with an in-house antibody database and used it to perform de novo peptide sequencing on two antibody datasets, the WIgG1 light and heavy chains of mouse (21). Note that the two testing datasets were not included in the training database. De novo peptides from DeepNovo were then used by the assembler ALPS (21) to automatically reconstruct the complete sequences of the antibodies (Figs. S5 and S6). For the light chain (length of 219 aa), we were able to reconstruct a single full-length contig that covered 100% of the target with 99.5% accuracy (218/219). For the heavy chain (length of 441 aa), we obtained three contigs together covering 97.5% of the target (430/441) with 97.2% accuracy (418/430). This application of whole-protein sequencing previously required both de novo peptide sequencing and database search together to succeed but now can be achieved with DeepNovo alone. This result further shows the major advantage of DeepNovo for de novo protein sequencing. In addition, we also showed that DeepNovo was able to identify peptides that could not be detected by database search (Fig. S7).
Fig. S5.
DeepNovo assembly result for the WIgG1 light chain. (A) BLAST alignment of the full-length assembled contig against the target light chain. (B) Details of the alignment in A. The red bars indicate the mismatches between the assembled light-chain sequence and the target light-chain sequence.
Fig. S6.
DeepNovo assembly result for the WIgG1 heavy chain. (A) BLAST alignment of the top-assembled contigs against the target heavy chain. (B) Details of the alignment in A. The red bars indicate the mismatches between the assembled heavy-chain sequence and the target heavy-chain sequence.
Fig. S7.
Identification of spectra that have high de novo scores but elude database search. (A) The number of spectra identified by searching the human–yeast database, searching the yeast database only, and using DeepNovo. (B) The Venn diagram of spectra matched with human peptides and spectra identified by DeepNovo only.

Discussion

De novo peptide sequencing is a challenging and computationally intensive problem that involves both pattern recognition and global optimization on noisy and incomplete data. In this study, we proposed DeepNovo, a deep neural network model that combines recent advances in deep learning and dynamic programming to address this problem. DeepNovo integrates CNNs and LSTM networks to learn features of tandem mass spectra, fragment ions, and sequence patterns for predicting peptides. Our experimental results show that DeepNovo consistently surpassed state-of-the-art records in de novo peptide sequencing.
Interestingly, existing methods for de novo peptide sequencing rely heavily on rigorous global dynamic programming or graph-theoretical algorithms to address the global optimization problem. Here, we use knapsack, a “local” version of dynamic programming, to simply filter amino acids not suitable for the suffix mass, and we do not perform backtracking. This result implies that (i) the neural networks in DeepNovo learn better features that can bypass the global optimization problem and that (ii) DeepNovo can be further enhanced with more advanced search algorithms.
It should be noted that both the method and the training data are crucial for model performance. For example, deep learning models often learn directly from raw data and require a large amount of training data. Some other machine learning models may rely on well-designed features based on domain-specific knowledge and may need less training data. Our study shows that DeepNovo and our training data achieved better de novo sequencing results than other existing methods and their respective training data. A more comprehensive benchmark study of de novo sequencing methods could be done by collecting well-annotated, gold standard training and testing datasets. Such a benchmark study is a potential direction for future research.
Some database search engines and postprocessors, such as MS-GF+ (58) and Percolator (59), allow us to retrain their model parameters to adapt to a particular dataset and hence increase the peptide identification rate. Similarly, PepNovo (12) includes the option to retrain its scoring models for de novo sequencing. DeepNovo is also retrainable and provides a complete end-to-end training and prediction solution. Retrainability is an important feature given the massive amounts of data coming from several types of instruments and diverse species as well as different experiment designs. DeepNovo can first be trained on a huge amount of data to obtain a general model and then retrained on a much smaller yet more targeted data source to reach the final data-specific model. Training data simply consist of a list of spectra and their corresponding peptides, and such annotated data can be found in spectral libraries, such as the NIST Mass Spectral Library, or retrieved by using database search tools [e.g., PEAKS DB (40)].
Because our work introduces deep learning to de novo peptide sequencing, there were no guidelines on how to design the architecture of the neural networks or how to train them with tandem MS data. However, the lack of guidelines also means that there is still a lot of room for improvement. Going deeper is definitely an option. Another interesting idea is that protein sequences from different species may be considered as different languages. Hence, we need to explore how to train the LSTM network for a general model versus a species-specific model. We can also train models to target a particular class of instruments or fragmentation techniques. All of those are potential directions for additional research.
Although DeepNovo is presented here in the context of de novo peptide sequencing, the idea can easily be extended to sequence database search, because both share the same problem of matching a spectrum to a peptide. Moreover, we believe that DeepNovo can be further developed for the analysis of data-independent acquisition, in particular, the problem of inferring multiple sequences from a tandem mass spectrum that includes fragments from many different peptides. With the LSTM RNN, DeepNovo can learn patterns of peptide sequences in addition to the fragment ion information. The additional information of sequence patterns can help address the ambiguity of inferring multiple peptides from a spectrum.
After recent breakthroughs of deep learning in image processing, speech recognition, and natural language processing, DeepNovo makes important progress on de novo peptide sequencing, a fundamental and long-standing research problem in the field of bioinformatics. Our work opens a door for combining deep learning with other sophisticated algorithms to solve optimization problems, especially those with complicated mixing of signals and noises. This research will enable more applications of deep learning in the near future.

Methods

The architecture of DeepNovo is described in Fig. 1. More in-depth details of the model together with deep learning background are available in SI Text. DeepNovo is implemented and tested on the Google TensorFlow library, Python API, release r0.10.
To train DeepNovo, a dataset is randomly partitioned into three sets: training, validation, and testing. As mentioned earlier, because of the one-to-many relationship between peptides and spectra, it is important to make sure that the three sets do not share peptides, to avoid overfitting. The training dataset is processed in minibatches. At each training step, a minibatch is randomly selected from the training dataset and fed to the model. The model is provided with a real prefix and asked to predict the next amino acid. The output logits and the real amino acid are then used to calculate the cross-entropy loss function. Next, backpropagation is performed to calculate the gradients (partial derivatives) and update the parameters using the Adam optimizer (60). During training, we periodically calculate the loss function on the validation set and save the model if there is an improvement. More training details can be found in SI Text.
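The per-step loss in the training loop above can be sketched in plain NumPy (the actual implementation uses TensorFlow and the Adam optimizer; this standalone function only illustrates the cross-entropy computed from the output logits and the real next amino acid):

```python
import numpy as np

def step_cross_entropy(logits, target_index):
    """Cross-entropy loss for one sequencing step: `logits` holds the
    model's unnormalized score for each output symbol, and
    `target_index` is the class of the real next amino acid
    (teacher forcing on a real prefix)."""
    shifted = logits - np.max(logits)            # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

# With uniform logits over 26 symbols, the loss equals log(26):
# the model is maximally uncertain about the next amino acid.
```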

SI Text

DeepNovo Model.

Input processing.

As shown in Fig. 1A, a tandem mass spectrum is often presented as a histogram plot of intensity vs. mass (more precisely, m/z). The underlying raw format (e.g., mgf) is simply a list of pairs of mass and intensity. In DeepNovo, we discretize a spectrum into a vector, called an intensity vector, in which masses correspond to indices and intensities are values. This representation assumes a maximum mass and also depends on a mass resolution parameter. For instance, if the maximum mass is 5,000 Da and the resolution is 0.1 Da, then the vector size is 50,000, and every 1 Da of mass is represented by 10 bins in the vector. In this implementation, we consider two types of data: low resolution (0.1 Da) and high resolution (0.01 Da). High-resolution data often allow de novo peptide sequencing tools to achieve better accuracy.
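As a concrete illustration, the discretization above can be sketched as follows; the peak-list input format and the keep-the-maximum tie-breaking rule per bin are our assumptions, not details from the paper.

```python
# Sketch of the intensity-vector discretization: masses become indices,
# intensities become values. Collisions keep the larger intensity (assumption).
def spectrum_to_vector(peaks, max_mass=5000.0, resolution=0.1):
    size = int(round(max_mass / resolution))      # 50,000 bins at 0.1 Da
    vector = [0.0] * size
    for mass, intensity in peaks:
        index = int(round(mass / resolution))
        if 0 <= index < size:
            vector[index] = max(vector[index], intensity)
    return vector

# A two-peak toy spectrum as a list of (mass, intensity) pairs:
vec = spectrum_to_vector([(424.2, 1500.0), (547.3, 800.0)])
```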

Ion-CNN model.

The ion-CNN model is designed to learn features of fragment ions in a spectrum. The input is a prefix (i.e., a sequence including the “start” symbol and the amino acids that have been predicted up to the current iteration). The output is a probability distribution over the 20 amino acid residues, their modifications, and three special symbols: “start,” “end,” and “padding.” In this paper, we consider three variable modifications (oxidation and deamidations); hence, there is a total of 26 symbols for prediction. For example, in Fig. 1 B and C, the prefix consists of four symbols: “start,” “P,” “E,” and “P.” Symbol “T” is predicted as the next amino acid by sampling from, or selecting the highest probability in, the model’s output probability distribution.
Given the input prefix, DeepNovo first computes the prefix mass (i.e., the sum of the masses of the N terminus and the amino acids in the prefix) (Fig. 1C). Next, DeepNovo tries to add each of the 26 symbols to the current prefix and updates its mass accordingly. For each candidate, the corresponding masses of b and y ions are calculated. In this implementation, we use eight ion types: b, y, b(2+), y(2+), b−H2O, y−H2O, b−NH3, and y−NH3 (24). Given an ion mass, DeepNovo identifies its location on the intensity vector using the mass resolution. For example, the prefix of four symbols start, P, E, and P with the next candidate T will have a b ion of mass 424.2 Da, which corresponds to index 4,242 on the intensity vector of resolution 0.1 Da. DeepNovo then extracts an intensity window of size 10 around the ion location. Thus, for each input prefix, DeepNovo computes a 3D array of shape 26×8×10. Deep learning libraries often process data in batches to take advantage of parallel computing. Here, we use a batch size of 128 (i.e., we process 128 prefixes at the same time, and their arrays are packed into a 4D array of shape 128×26×8×10). We further transpose the shape into 128×8×10×26 (the reason is explained below). This final array, denoted by X_{128×8×10×26}, is similar to the common data setting of image processing, where the first dimension is the number of images, the second is the height, the third is the width, and the fourth is the number of channels (e.g., three for red–green–blue or one for black–white).
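A simplified sketch of the candidate-ion lookup follows. Only the plain singly charged b and y ions are shown (the paper uses eight ion types), and the monoisotopic residue, proton, and water masses are standard values; the exact mass conventions used by DeepNovo are not specified here, so treat them as assumptions.

```python
# Simplified candidate-ion lookup: ion mass -> intensity-vector index -> window.
# Residue masses are standard monoisotopic values (assumption, truncated table).
RESIDUE = {'P': 97.05276, 'E': 129.04259, 'T': 101.04768}
PROTON, H2O = 1.00728, 18.01056

def b_ion_mass(prefix):
    """Singly charged b ion for the given prefix of residues."""
    return sum(RESIDUE[a] for a in prefix) + PROTON

def y_ion_mass(prefix, peptide):
    """Singly charged y ion of the complementary suffix of `peptide`."""
    suffix = sum(RESIDUE[a] for a in peptide) - sum(RESIDUE[a] for a in prefix)
    return suffix + H2O + PROTON

def intensity_window(vector, ion_mass, resolution=0.1, size=10):
    """Window of `size` bins centered on the ion's location in the vector."""
    center = int(round(ion_mass / resolution))
    half = size // 2
    return vector[max(0, center - half):center + half]

# Prefix P-E-P with candidate T against peptide PEPT:
window = intensity_window([0.0] * 50000, b_ion_mass('PEP'))
```

The b/y pair is complementary: for any prefix, the b- and y-ion masses sum to the peptide mass plus water plus two protons, which is a useful sanity check.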
The ion-CNN model is a CNN with two convolutional layers and two fully connected layers (Fig. 1C) (27, 61). The first convolutional layer uses a 4D kernel W_{1×3×26×32} and a bias term B_{32} to transform the input array X_{128×8×10×26} into a new array Y_{128×8×10×32}. This convolution operator slides 26×32 = 832 receptive fields (filters) of size 1×3 of the kernel W over the input array X and performs a series of dot products and additions as follows:

Y_{i,j,k,l} = Σ_{m=1}^{26} Σ_{n=1}^{3} W_{1,n,m,l} · X_{i,j,k+n−1,m} + B_l,
[S1]

where 1 ≤ i ≤ 128, 1 ≤ j ≤ 8, 1 ≤ k ≤ 10, 1 ≤ l ≤ 32, and the third dimension of X is padded with 0 when needed. The purpose of convolution is to learn as many local features as possible through several different filters. Hence, the kernel W is often called the “feature detector,” and the output Y is called the “feature map.” As can be seen from Eq. S1, we perform the convolution along the third dimension of X (i.e., the intensity window) to learn the bell-shaped features (i.e., peaks) (Fig. 1C). We also use different sets of filters for different amino acids. This is the best setting that we found after trying multiple convolution combinations over ions and/or amino acids; however, more investigation with more data is worth trying in future development.
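Eq. S1 can be checked numerically on small tensors (a batch of 2 instead of 128, and 4 filters instead of 32, to keep the loop fast); the zero-padding convention below follows the equation's index range and is otherwise an assumption.

```python
import numpy as np

# Numerical check of Eq. S1: a direct loop translation vs. a vectorized einsum.
rng = np.random.default_rng(1)
I, J, K, M, N, L = 2, 8, 10, 26, 3, 4          # reduced batch and filter counts
X = rng.normal(size=(I, J, K, M))
W = rng.normal(size=(1, N, M, L))
B = rng.normal(size=L)

# 1-based index k+n-1 (n = 1..3) becomes 0-based k+n on X padded at the end.
Xp = np.pad(X, ((0, 0), (0, 0), (0, 2), (0, 0)))
Y_loop = np.zeros((I, J, K, L))
for i in range(I):
    for j in range(J):
        for k in range(K):
            for l in range(L):
                Y_loop[i, j, k, l] = sum(
                    W[0, n, m, l] * Xp[i, j, k + n, m]
                    for m in range(M) for n in range(N)
                ) + B[l]

# Same computation vectorized: sliding windows + einsum over (n, m).
windows = np.stack([Xp[:, :, n:n + K, :] for n in range(N)], axis=3)  # I,J,K,N,M
Y_vec = np.einsum('ijknm,nml->ijkl', windows, W[0]) + B
```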
The linear convolution is followed by an activation with the rectified linear unit [ReLU; i.e., f(x) = max(0, x)]. Activation functions are used to add nonlinearity to neural network models, and ReLU is currently the most popular choice because of its many advantages (62). Thus, the output Z of the first convolutional layer is obtained by applying the ReLU function on Y elementwise:

Z_{i,j,k,l} = ReLU(Y_{i,j,k,l}).
[S2]
The second convolutional layer is applied on top of the first in a similar way, with another kernel V_{1×2×32×32}. Adding more convolutional layers does not show significant improvement in accuracy, probably because the bell-shaped features are not too complicated to learn. We also apply max-pooling, but it seems to have little impact, because the dimensionality is not large.
The convolutional layers are followed by a fully connected layer, often called a hidden layer, of 512 neuron units (Fig. 1C). As the name suggests, each unit is connected to every output of the previous convolutional layer to process all local features together. This connection is done via a linear matrix multiplication and addition as follows:

Y^{hidden}_{128×512} = ReLU(X^{hidden}_{128×2,560} · W^{hidden}_{2,560×512} + B^{hidden}_{512}).
[S3]

Notice that the output of the previous convolutional layer, of shape 128×8×10×32, is first reshaped into X^{hidden}_{128×2,560} to be compatible with the matrix multiplication operator. The ReLU function is again applied elementwise after the linear operations.
The final fully connected layer has 26 neuron units, which correspond to the 26 symbols to predict. It is connected to the previous hidden layer in the same way as in Eq. S3, except that there is no ReLU activation.
We also apply dropout, an important technique to prevent neural networks from overfitting (63). We use dropout after the second convolutional layer with probability 0.25 and after the first fully connected layer with probability 0.5. The idea of dropout is that neuron units are randomly activated (or dropped) at every training iteration so that they do not coadapt. At the testing phase, all units are activated, and their effects are averaged by the dropout probability.
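A minimal sketch of this train/test behavior of dropout follows; the scaling convention (multiplying by the keep probability at test time) matches the description above, and the toy input is illustrative.

```python
import numpy as np

# Minimal dropout sketch: units dropped at random during training, all units
# kept (scaled by the keep probability) at test time.
rng = np.random.default_rng(2)

def dropout(x, drop_prob, training):
    if not training:
        return x * (1.0 - drop_prob)        # average effect at test time
    mask = rng.random(x.shape) >= drop_prob  # keep each unit with prob 1 - p
    return x * mask

x = np.ones(10000)
train_out = dropout(x, 0.5, training=True)   # roughly half the units zeroed
test_out = dropout(x, 0.5, training=False)   # all units kept, scaled by 0.5
```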

Spectrum-CNN and LSTM model.

The spectrum-CNN coupled with LSTM model is designed to learn sequence patterns of amino acids of the peptide in association with the corresponding spectrum. We adopt this idea from a recently trending topic of “automatically generating a description for an image.” In that research, a CNN is used to encode or “understand” the image, and an LSTM RNN (35) is used to decode or “describe” the content of the image (36, 37). Here, we consider the spectrum intensity vector as an image (with one dimension and one channel) and the peptide sequence as a caption. We use the spectrum-CNN to encode the intensity vector and the LSTM to decode the amino acids.
Spectrum-CNN: Simple version.
The input to the spectrum-CNN is an array of shape 128×1×50,000×1, where 128 is the batch size and 50,000 is the size of the intensity vectors, given the maximum mass of 5,000 Da and the resolution of 0.1 Da. Because the input size is very large, we first try a simple version of the spectrum-CNN that includes two convolutional layers, each with four filters of size 1×4, and one fully connected layer of 512 neuron units. We also use ReLU activation, max-pooling, and dropout in the same way as for the ion-CNN model described above.
It should be noted that the pattern recognition problem with tandem mass spectra is quite different from traditional object recognition problems. Usually, an object is recognized by its shape and features (e.g., face recognition). In a tandem mass spectrum, however, an amino acid is identified by two bell-shaped signals (i.e., peaks) separated by a distance that has to precisely match the amino acid’s mass. Because distance is involved, our simple spectrum-CNN and other common CNN models may not be good enough.
Spectrum-CNN: Advanced version.
To take the distance into account, we slice each input intensity vector into pieces based on the amino acid masses. For instance, given that the mass of alanine (“A”) is 71.0 Da and the resolution is 0.1 Da, we slice the intensity vector from index 710 to the end to create a new vector. We pad the new vector with 0 so that it has the same size as the original one and concatenate the two along the second dimension to obtain an array of shape 128×2×50,000×1. We repeat this procedure for all 26 symbols and construct a new input array of shape 128×2×50,000×26.
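The slice-and-pad preprocessing can be sketched as follows; the three residue masses and the tiny example spectrum are illustrative. The point of the construction is that two peaks separated by one residue mass line up in the shifted copy.

```python
import numpy as np

# Slice-and-pad preprocessing: for each symbol, shift the intensity vector
# left by the symbol's mass (in bins), pad with zeros, and stack it with the
# original so that peak pairs separated by that mass align.
def shift_by_mass(vector, mass, resolution=0.1):
    offset = int(round(mass / resolution))
    shifted = np.zeros_like(vector)
    shifted[:len(vector) - offset] = vector[offset:]  # slice, then zero-pad
    return shifted

RESIDUE = {'A': 71.03711, 'G': 57.02146, 'S': 87.03203}  # 3 of the 26 symbols

vector = np.zeros(50000)
vector[1000] = 1.0          # a peak at 100.0 Da
vector[1000 + 710] = 1.0    # a second peak one alanine (71.0 Da) away

layers = [np.stack([vector, shift_by_mass(vector, m)]) for m in RESIDUE.values()]
stacked = np.stack(layers, axis=-1)   # shape (2, 50000, 3) for one spectrum
```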
After this preprocessing, we apply the first convolutional layer with a kernel of shape 2×10×26×32. The idea is to capture two bell-shaped signals in the same filter of size 2×10. This convolutional layer is followed by another with a kernel of shape 1×5×32×64 and one fully connected layer of 512 neuron units. Again, we use ReLU activation, max-pooling, and dropout. Note that here we use max-pooling aggressively, because the intensity vectors are very sparse.
It should be noted that the goal of our spectrum-CNN is not to make accurate predictions of the next amino acid, as the ion-CNN does. Instead, the spectrum-CNN only tries to pick up signals of which amino acids are present in the spectrum and to provide that information to the LSTM model so that it can better learn sequence patterns of amino acids. The spectrum-CNN output is a vector of size 512, corresponding to the 512 neuron units of its fully connected layer.
LSTM model.
LSTM networks, a special kind of RNNs, are the most widely used models to handle sequential data in natural language processing and speech recognition (35). RNNs are called “recurrent,” because they repeat the same computations on every element of a sequence, and the next iteration depends on the networks’ “memory” of previous steps. For example, one could predict the next word in a sentence given the previous words. In the problem of de novo peptide sequencing, we want to predict the next amino acid, a symbol, given the previous ones (i.e., the prefix) (Fig. 1 B and C). This assumption is reasonable, because amino acids do not just appear in a random order in protein sequences. In other words, protein sequences may speak their own “languages.”
Because of limited space, we do not try to include all details of LSTMs and RNNs in this manuscript. We use the standard LSTM model, which can be found in many articles in the literature, such as refs. 35–37, or in online resources. Here, we only discuss some important configurations of our LSTM model. First, we use embedding vectors of size 512 to represent each of the 26 symbols, similar to the common word2vec approach (39) that uses embedding vectors to represent words in a vocabulary. The embedding vectors form a 2D array Embedding_{26×512}. Thus, the input to the LSTM model at each iteration is a vector of size 512. Second, the output of the spectrum-CNN is used to initialize the LSTM model (i.e., it is fed as the zeroth input). Third, the LSTM architecture consists of one layer of 512 neuron units and dropout layers with probability 0.5 for input and output. The recurrent iterations of the LSTM model can be summarized as follows:
x_0 = CNN_spectrum(I),
x_{t−1} = Embedding_{a_{t−1},·}, t > 1,
s_t = LSTM(x_{t−1}),

where I is the spectrum intensity vector, a_{t−1} is the symbol predicted at iteration t−1, Embedding_{i,·} is row i of the embedding array, and s_t is the output of the LSTM model that will be used to predict the symbol at iteration t, t = 1, 2, 3, …. Similar to the ion-CNN model, we also add a fully connected layer of 26 neuron units to perform a linear transformation of the 512 LSTM output units into signals of the 26 symbols to predict.
Last but not least, LSTM networks often iterate from the beginning to the end of a sequence. However, to achieve a general model for diverse species, we found that it is better to apply the LSTM on short k-mers. This topic requires additional analysis with more data to find an optimal solution.
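The recurrence above can be sketched with a standard LSTM cell in numpy (sizes reduced from 512 to 8 for readability); the weights, the stand-in spectrum features, and the example symbol sequence are all placeholders, not DeepNovo's parameters.

```python
import numpy as np

# Standard LSTM cell (ref. 35) driving the recurrence: spectrum features
# initialize the state, then embeddings of predicted symbols are fed in.
rng = np.random.default_rng(3)
NUM_SYMBOLS, D = 26, 8                      # the paper uses D = 512

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wx = rng.normal(0, 0.1, (D, 4 * D))         # input-to-gates weights
Wh = rng.normal(0, 0.1, (D, 4 * D))         # hidden-to-gates weights
b = np.zeros(4 * D)
Embedding = rng.normal(0, 0.1, (NUM_SYMBOLS, D))

def lstm_step(x, h, c):
    gates = x @ Wx + h @ Wh + b
    i, f, o, g = np.split(gates, 4)         # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

h, c = np.zeros(D), np.zeros(D)
x0 = rng.normal(size=D)                     # stands in for CNN_spectrum(I)
h, c = lstm_step(x0, h, c)                  # spectrum initializes the LSTM
outputs = []
for symbol in [4, 11, 4]:                   # previously predicted symbols a_{t-1}
    h, c = lstm_step(Embedding[symbol], h, c)
    outputs.append(h)                       # s_t, later mapped to 26 logits
```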

Integrating ion-CNN and LSTM models.

To combine the ion-CNN and LSTM models, we first concatenate the outputs of their second-to-last layers, each of size 512, to form a vector of size 1,024. Then, we add a fully connected layer of 1,024 neuron units with ReLU activation and dropout with probability 0.5, followed by another fully connected layer of 26 neuron units to perform a linear transformation into signals of the 26 symbols to predict (Fig. 1C). Thus, the final output of the DeepNovo neural networks is a vector of 26 signals, often called logits (unscaled log probabilities). This logits vector is further used to calculate the loss function during training or the prediction probabilities during testing.
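A sketch of this output head follows, with random placeholder weights and dropout omitted for brevity.

```python
import numpy as np

# Output head: concatenate the two 512-unit feature vectors, apply a ReLU
# layer of 1,024 units, then a linear layer producing 26 logits.
rng = np.random.default_rng(4)
ion_features = rng.normal(size=512)          # ion-CNN, second-to-last layer
lstm_features = rng.normal(size=512)         # LSTM, second-to-last layer

W1 = rng.normal(0, 0.05, (1024, 1024)); b1 = np.zeros(1024)
W2 = rng.normal(0, 0.05, (1024, 26));   b2 = np.zeros(26)

combined = np.concatenate([ion_features, lstm_features])   # size 1,024
hidden = np.maximum(0.0, combined @ W1 + b1)               # ReLU layer
logits = hidden @ W2 + b2                                  # unscaled log probs

probs = np.exp(logits - logits.max())
probs /= probs.sum()        # softmax turns logits into prediction probabilities
```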
In this section, we have described all details of the DeepNovo model. All weight and bias parameters (i.e., the Ws and Bs) of the CNNs, the embedding vectors, and the parameters of the LSTM are estimated and optimized during the training process. In addition, DeepNovo performs bidirectional sequencing and uses two separate sets of parameters, forward and backward, except for the spectrum-CNN and the embedding vectors. The hyperparameters, such as the numbers of layers, the numbers of neuron units, the size of the embedding vectors, the dropout probabilities, and the number and types of fragment ions, can be configured to define an instance of the DeepNovo model.

De Novo Peptides Identified by DeepNovo but Missed by Database Search.

DeepNovo is able to find high-quality matches that elude database search identification. To show this advantage, we performed the following experiment on a dataset from the Clinical Proteomic Tumor Analysis Consortium (CPTAC).
A yeast lysate was spiked with a mixture of 48 human proteins (Sigma-Aldrich UPS1). The sample was then analyzed three times on a Thermo LTQ-Orbitrap instrument. We used PEAKS DB to perform database search at a false discovery rate of 1%. We first searched this dataset against a combined database including both human and yeast proteins. As shown in Fig. S7A, the total number of identified peptide-spectrum matches (PSMs) was 18,306, including 16,617 from yeast and 1,689 from human. Next, we searched this dataset against the yeast database only and found 16,693 PSMs.
We then used DeepNovo to perform de novo sequencing on the whole dataset. After excluding the 16,693 spectra identified by the yeast-only database search and selecting the top 50% high-confidence results, we found 7,146 spectra identified by DeepNovo only. Among those 7,146 spectra, 1,524 matched human peptides identified in the first round of database search, covering ∼93% (1,524/1,631) of the total human PSMs (Fig. S7B). Thus, DeepNovo was able to identify human peptides that eluded the second round of database search. This result shows the importance of de novo sequencing when database information is missing.

Training DeepNovo with MS/MS Data.

Here, we would like to emphasize some important techniques for training DeepNovo. MS/MS data have a special property: the same peptide can appear multiple times in a dataset with different spectra. Such spectra may have different fragment ions, and even if they share some major ions, the intensities of those ions vary from spectrum to spectrum. The model is able to learn common features of different spectra that come from the same peptide, but those features do not generalize well to other peptides. This will lead to overfitting if we randomly partition a dataset into training, validation, and testing sets (a common practice in most model training tasks): the model will perform well on those three sets, but its performance will degrade on a new dataset. Thus, it is essential to make sure that the training, validation, and testing sets do not share common peptides. In addition, we found that it is preferable to collect more data from a wide variety of sources than to add more data from the same source. This observation may be related to the one-to-many relationship between peptides and spectra mentioned earlier.
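The peptide-disjoint partition described above can be sketched as follows; the split ratios and the (peptide, spectrum) data format are illustrative.

```python
import random

# Peptide-disjoint split: group spectra by peptide first, then assign whole
# peptide groups to one split, so no peptide appears in two sets.
def split_by_peptide(spectra, ratios=(0.8, 0.1, 0.1), seed=0):
    """`spectra` is a list of (peptide, spectrum_id) pairs."""
    peptides = sorted({pep for pep, _ in spectra})
    random.Random(seed).shuffle(peptides)
    n = len(peptides)
    cut1 = int(ratios[0] * n)
    cut2 = cut1 + int(ratios[1] * n)
    assign = {pep: (0 if i < cut1 else 1 if i < cut2 else 2)
              for i, pep in enumerate(peptides)}
    sets = ([], [], [])
    for pep, sid in spectra:
        sets[assign[pep]].append((pep, sid))
    return sets  # (train, validation, test)

# Two spectra of the same peptide must land in the same split:
data = [('PEPTIDE', 1), ('PEPTIDE', 2), ('SEQ', 3), ('MASS', 4), ('NOVO', 5)]
train, valid, test = split_by_peptide(data)
```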

GPUs and Big Data: Two Advantages of Neural Network Models.

Recent breakthroughs in neural networks and deep learning are driven by two main engines: powerful GPUs and massive amounts of data. Both also fit nicely into the problem of de novo peptide sequencing. De novo peptide sequencing is well-known as a computation-intensive optimization problem, and modern MS instruments often produce data faster than many sequencing tools can analyze in real time. Recently, Novor has greatly improved the speed and is able to keep up with the rate of data acquisition. However, it is still highly desirable to make use of high-performance hardware, such as GPUs, instead of traditional central processing units (CPUs). DeepNovo is implemented on the Google TensorFlow platform and is able to run on both GPUs and CPUs. Moreover, TensorFlow scales up easily to multiple GPUs, CPUs, and even different workstations, making the most of available computational resources.
In this study, we used only 50,000 spectra from each dataset for training (i.e., about 10% of the total data; for testing, we still used all data). Even with that limited amount of training data, the accuracy of DeepNovo was already 7.7–22.9% higher than the current state of the art. Although model accuracy does not always increase simply with the amount of training data, we believe that neural network models, such as DeepNovo, are the ideal choice and can benefit the most from huge proteomics databases, such as PRIDE, MassIVE, and others.

Data Availability

Data deposition: DeepNovo is publicly available for non-commercial uses. The source code of DeepNovo is stored on GitHub (https://github.com/nh2tran/DeepNovo). All training and testing datasets, pretrained models, and source code of DeepNovo can also be downloaded from the FTP server of the MassIVE database via the following link: ftp://Nancyzxll:[email protected] (user account: Nancyzxll, password: DeepNovo2017).

Acknowledgments

We thank Lin He for discussions. We also thank Nicole Keshav, Zac Anderson, and Brian Munro for proofreading the manuscript. This work was partially supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Grant OGP0046506, China's Key Research and Development Program under Grant 2016YFB1000902, Beijing Advanced Innovation Center for Imaging Technology Grant BAICIT-2016031, and Canada Research Chair program.

Supporting Information

Supporting Information (PDF)

References

1
RS Johnson, K Biemann, The primary structure of thioredoxin from Chromatium vinosum determined by high-performance tandem mass spectrometry. Biochemistry 26, 1209–1214 (1987).
2
LA Martin-Visscher, et al., Isolation and characterization of carnocyclin a, a novel circular bacteriocin produced by Carnobacterium maltaromaticum UAL307. Appl Environ Microbiol 74, 4756–4763 (2008).
3
N Hatano, T Hamada, Proteome analysis of pitcher fluid of the carnivorous plant Nepenthes alata. J Proteome Res 7, 809–816 (2008).
4
J Catusse, J-M Strub, C Job, A Van Dorsselaer, D Job, Proteome-wide characterization of sugarbeet seed vigor and its tissue specific expression. Proc Natl Acad Sci USA 105, 10262–10267 (2008).
5
JV Jorrín-Novo, et al., Fourteen years of plant proteomics reflected in Proteomics: Moving from model species and 2DE-based approaches to orphan species and gel-free platforms. Proteomics 15, 1089–1112 (2015).
6
JA Taylor, RS Johnson, Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 11, 1067–1075 (1997).
7
JA Taylor, RS Johnson, Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal Chem 73, 2594–2604 (2001).
8
T Chen, MY Kao, M Tepel, J Rush, GM Church, A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J Comput Biol 8, 325–337 (2001).
9
V Dancík, TA Addona, KR Clauser, JE Vath, PA Pevzner, De novo peptide sequencing via tandem mass spectrometry. J Comput Biol 6, 327–342 (1999).
10
B Ma, et al., PEAKS: Powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 17, 2337–2342 (2003).
11
Z Zhang, De novo peptide sequencing based on a divide-and-conquer algorithm and peptide tandem spectrum simulation. Anal Chem 76, 6374–6383 (2004).
12
A Frank, P Pevzner, PepNovo: De novo peptide sequencing via probabilistic network modeling. Anal Chem 77, 964–973 (2005).
13
B Fischer, et al., NovoHMM: A hidden Markov model for de novo peptide sequencing. Anal Chem 77, 7265–7273 (2005).
14
PA DiMaggio Jr, CA Floudas, De novo peptide identification via tandem mass spectrometry and integer linear optimization. Anal Chem 79, 1433–1446 (2007).
15
L Mo, D Dutta, Y Wan, T Chen, MSNovo: A dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry. Anal Chem 79, 4870–4878 (2007).
16
H Chi, et al., pNovo: De novo peptide sequencing and identification using HCD spectra. J Proteome Res 9, 2713–2724 (2010).
17
K Jeong, S Kim, PA Pevzner, UniNovo: A universal tool for de novo peptide sequencing. Bioinformatics 29, 1953–1962 (2013).
18
H Chi, et al., pNovo+: De novo peptide sequencing using complementary HCD and ETD tandem mass spectra. J Proteome Res 12, 615–625 (2013).
19
B Ma, Novor: Real-time peptide de novo sequencing software. J Am Soc Mass Spectrom 26, 1885–1894 (2015).
20
K Maggon, Monoclonal antibody “gold rush.” Curr Med Chem 14, 1978–1987 (2007).
21
NH Tran, et al., Complete de novo assembly of monoclonal antibody sequences. Sci Rep 6, 31730 (2016).
22
N Bandeira, V Pham, P Pevzner, D Arnott, JR Lill, Automated de novo protein sequencing of monoclonal antibodies. Nat Biotechnol 26, 1336–1338 (2008).
23
A Guthals, KR Clauser, AM Frank, N Bandeira, Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J Proteome Res 12, 2846–2857 (2013).
24
B Ma, R Johnson, De novo sequencing and homology searching. Mol Cell Proteomics 11, O111.014902 (2012).
25
Y LeCun, Y Bengio, G Hinton, Deep learning. Nature 521, 436–444 (2015).
26
D Ciresan, A Giusti, LM Gambardella, J Schmidhuber, Deep neural networks segment neuronal membranes in electron microscopy images. Adv Neural Inf Process Syst 25, 2843–2851 (2012).
27
A Krizhevsky, I Sutskever, G Hinton, ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25, 1097–1105 (2012).
28
G Hinton, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process Mag 29, 82–97 (2012).
29
I Sutskever, O Vinyals, Q Le, Sequence to sequence learning with neural networks. Adv Neural Inf Process Syst 27, 3104–3112 (2014).
30
N Rusk, Deep learning. Nat Methods 13, 35 (2016).
31
J Zhou, OG Troyanskaya, Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 12, 931–934 (2015).
32
B Alipanahi, A Delong, MT Weirauch, BJ Frey, Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33, 831–838 (2015).
33
S Wang, S Sun, Z Li, R Zhang, J Xu, Accurate de novo prediction of protein contact map by ultra-deep learning model. PLOS Comput Biol 13, e1005324 (2017).
34
P Inglese, et al., Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer. Chem Sci (Camb) 8, 3500–3511 (2017).
35
S Hochreiter, J Schmidhuber, Long short-term memory. Neural Comput 9, 1735–1780 (1997).
36
A Karpathy, FF Li, Deep visual-semantic alignments for generating image descriptions. Conf Comput Vis Pattern Recognit Workshops 2015, 3128–3137 (2015).
37
O Vinyals, A Toshev, S Bengio, D Erhan, Show and tell: A neural image caption generator. Conf Comput Vis Pattern Recognit Workshops 2015, 3156–3164 (2015).
38
H Steen, M Mann, The ABC’s (and XYZ’s) of peptide sequencing. Nat Rev Mol Cell Biol 5, 699–711 (2004).
39
T Mikolov, I Sutskever, K Chen, G Corrado, J Dean, Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26, 3111–3119 (2013).
40
Bioinformatics Solutions Inc. (2016) PEAKS Studio (Bioinformatics Solutions Inc., Waterloo, ON, Canada), Version 8.0.
41
A Grosche, et al., The proteome of native adult Muller Glial cells from murine retina. Mol Cell Proteomics 15, 462–480 (2016).
42
E Marza, et al., Genome-wide screen identifies a novel p97/CDC-48-dependent pathway regulating ER-stress-induced gene transcription. EMBO Rep 16, 332–340 (2015).
43
VK Pettersen, KA Mosevoll, PC Lindemann, HG Wiker, Coordination of metabolism and virulence factors expression of extraintestinal pathogenic Escherichia coli purified from blood cultures of patients with sepsis. Mol Cell Proteomics 15, 2890–2907 (2016).
44
B Hampoelz, et al., Pre-assembled nuclear pores insert into the nuclear envelope during early development. Cell 166, 664–678 (2016).
45
Y Zhang, et al., Tissue-based proteogenomics reveals that human testis endows plentiful missing proteins. J Proteome Res 14, 3583–3594 (2015).
46
AS Hebert, et al., The one hour yeast proteome. Mol Cell Proteomics 13, 339–347 (2014).
47
J Peng, J Cao, FM Ng, J Hill, Pseudomonas aeruginosa develops Ciprofloxacin resistance from low to high level with distinctive proteome changes. J Proteomics 152, 75–87 (2017).
48
AL Paiva, JT Oliveira, GA de Souza, IM Vasconcelos, Label-free proteomics reveals that Cowpea severe mosaic virus transiently suppresses the host leaf protein accumulation during the compatible interaction with Cowpea (Vigna unguiculata [L.] Walp.). J Proteome Res 15, 4208–4220 (2016).
49
N Nevo, et al., Impact of cystinosin glycosylation on protein stability by differential dynamic stable isotope labeling by amino acids in cell culture (SILAC). Mol Cell Proteomics 16, 457–468 (2017).
50
L Cassidy, D Prasse, D Linke, RA Schmitz, A Tholey, Combination of bottom-up 2D-LC-MS and semi-top-down GelFree-LC-MS enhances coverage of proteome and low molecular weight short open reading frame encoded peptides of the Archaeon Methanosarcina mazei. J Proteome Res 15, 3773–3783 (2016).
51
DR Reuß, et al., Large-scale reduction of the Bacillus subtilis genome: Consequences for the transcriptional network, resource allocation, and metabolism. Genome Res 27, 289–299 (2017).
52
JM Petersen, et al., Chemosynthetic symbionts of marine invertebrate animals are capable of nitrogen fixation. Nat Microbiol 2, 16195 (2016).
53
CI Mata, et al., In-depth characterization of the tomato fruit pericarp proteome. Proteomics 17, 1–2 (2017).
54
G Seidel, et al., Quantitative global proteomics of Yeast PBP1 deletion mutants and their stress responses identifies glucose metabolism, mitochondrial, and stress granule changes. J Proteome Res 16, 504–515 (2017).
55
H Hu, et al., Proteome analysis of the hemolymph, mushroom body, and antenna provides novel insight into honeybee resistance against varroa infestation. J Proteome Res 15, 2841–2854 (2016).
56
W Cypryk, M Lorey, A Puustinen, TA Nyman, S Matikainen, Proteomic and bioinformatic characterization of extracellular vesicles released from human macrophages upon Influenza A virus infection. J Proteome Res 16, 217–227 (2017).
57
J Davis, M Goadrich, The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, eds W Cohen, A Moore (ACM, New York), pp. 233–240 (2006).
58
S Kim, PA Pevzner, MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun 5, 5277 (2014).
59
L Käll, JD Canterbury, J Weston, WS Noble, MJ MacCoss, Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 4, 923–925 (2007).
60
DP Kingma, J Ba, Adam: A method for stochastic optimization. arXiv:1412.6980 (2014).
61
Y LeCun, et al., Backpropagation applied to handwritten zip code recognition. Neural Comput 1, 541–551 (1989).
62
X Glorot, A Bordes, Y Bengio, Deep sparse rectifier neural networks. JMLR Workshop Conf Proc 15, 315–323 (2011).
63
N Srivastava, G Hinton, A Krizhevsky, I Sutskever, R Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res 15, 1929–1958 (2014).

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 114 | No. 31
August 1, 2017
PubMed: 28720701

Submission history

Published online: July 18, 2017
Published in issue: August 1, 2017

Keywords

  1. deep learning
  2. MS
  3. de novo sequencing

Notes

This article is a PNAS Direct Submission. J.R.Y. is a guest editor invited by the Editorial Board.

Authors

Affiliations

Ngoc Hieu Tran1
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada;
Xianglilan Zhang1
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada;
State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing 100071, China;
Lei Xin
Bioinformatics Solutions Inc., Waterloo, ON N2L 6J2, Canada
Baozhen Shan
Bioinformatics Solutions Inc., Waterloo, ON N2L 6J2, Canada
Ming Li2
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada;

Notes

2
To whom correspondence should be addressed. Email: [email protected].
Author contributions: N.H.T., B.S., and M.L. designed research; N.H.T., X.Z., and L.X. performed research; N.H.T. contributed new reagents/analytic tools; N.H.T., X.Z., and L.X. analyzed data; and N.H.T., X.Z., and M.L. wrote the paper.
1
N.H.T. and X.Z. contributed equally to this work.

Competing Interests

The authors declare no conflict of interest.
