Research Article

Genome-based peptide fingerprint scanning

Michael C. Giddings, Atul A. Shah, Ray Gesteland, and Barry Moore
PNAS January 7, 2003 100 (1) 20-25; https://doi.org/10.1073/pnas.0136893100
Communicated by Clyde A. Hutchison III, University of North Carolina, Chapel Hill, NC (received for review July 31, 2002)


Abstract

We have implemented a method that identifies the genomic origins of sample proteins by scanning their peptide-mass fingerprint against the theoretical translation and proteolytic digest of an entire genome. Unlike previously reported techniques, this method requires no predefined ORF or protein annotations. Fixed-size windows along the genome sequence are scored by an equation accounting for the number of matching peptides, the number of missed enzymatic cleavages in each peptide, the number of in-frame stop codons within a window, the adjacency between peptides, and duplicate peptide matches. Statistical significance of matching regions is assessed by comparing their scores to scores from windows matching randomly generated mass data. Tests with samples from Saccharomyces cerevisiae mitochondria and Escherichia coli have demonstrated the ability to produce statistically significant identifications, agreeing with two commonly used programs, peptident and mascot, in 86% of samples analyzed. This genome fingerprint scanning method has the potential to aid in genome annotation, identify proteins for which annotation is incorrect or missing, and handle cases where sequencing errors have caused framing mistakes in the databases. It might also aid in the identification of proteins in which recoding events such as frameshifting or stop-codon read-through have occurred, elucidating alternative translation mechanisms. The prototype is implemented as a client/server pair, allowing the distribution, among a set of cluster nodes, of a single or multiple genomes for concurrent analysis.

Peptide mass fingerprinting is a principal protein identification technique that was introduced in 1993 by several groups (1–3). Fractions from the separation of a protein sample by means of 2D gel electrophoresis or multidimensional HPLC are enzymatically digested and analyzed by MS. The resulting peptide mass fingerprints are matched against a sequence database to identify the proteins present in the sample. Commonly used computer programs such as peptident (4), profound (5), mascot (6), and sherpa (7) match peptide fingerprints by comparing the masses in the fingerprint to those derived by in silico digestion of predicted or confirmed database protein sequences. Misidentified or unidentified ORFs can present a major challenge to the process, as can sequencing insertion/deletion errors (indels). Furthermore, current methods cannot readily detect proteins generated by various alternative processing mechanisms observed at each stage of protein production. Examples include transcriptional slippage (8), alternative splicing (9), internal initiation (10, 11), and recoding (12, 13), the latter of which includes nonstandard translational phenomena such as programmed frameshift and stop codon read-through.

Proteins produced by such mechanisms may be absent from a protein database, leaving no positive search target. A frameshift product might be incorrectly identified as the in-frame product (with lower confidence because of no matches past the frameshift site), or not at all, because there are too few peptides matching in the original frame. The normal and transframe proteins translated from a sequence prone to programmed frameshift will be identified as only a single product. Without new search methods, such “under-identifications” may end up hiding mechanisms of biological significance from researchers.

The present genome-search approach was conceived to aid in the detection of alternative products among proteins from mitochondria of Saccharomyces cerevisiae separated by HPLC and analyzed by electrospray ionization (ESI)-MS of both intact proteins and their tryptic-digest products (B.M., C. Nelson, A.A.S., A. J. Baucum, M.C.G., R. Chowdry, J. Simmons, N. Wills, J. Atkins, and R.G., unpublished work). The genome fingerprint scanning (GFS) application matches peptide mass fingerprint data to a genomic locus without reference to ORF, protein anchor, or other annotation. The entire putative proteome is translated from the full genome sequence and digested by using the rules for a particular protease. The program matches masses from the peptide fingerprint to those generated by the in silico digestion, then scans in windows across the genome to identify regions where a high density of hits indicates a putative genomic origin for one of the sample proteins. The process is summarized in Fig. 1.

Figure 1

Illustration of the matching process. Masses derived from MS analysis are matched within a tolerance Δ% against the in silico genome digest. The matching peptides from the digest are then mapped positionally back onto the chromosome whence they were derived, and clusters are located to determine protein hits.

The in silico digestion of the raw S. cerevisiae proteome (translated genome) in all six reading frames using rules for the enzyme trypsin generates 8.9 million hypothetical peptides. The profusion of in silico fragments generated by the simulated digestion of a whole genome results in a large background of spurious hits against which to discern genuine matches. For example, in 100 full-genome searches against 40 randomly generated masses at 0.05% mass tolerance, an average of 145,190 or 1.6% of the S. cerevisiae fragments match in each search, equivalent to an average of 3,630 fragments per input mass. If evenly distributed along forward and reverse strands, a hit would occur every 185 nt. In reality, the hits are unevenly distributed, forming clusters of varying density determined by factors such as the location of lysine and arginine-encoding codons across the genome. In any case, masses from the sample protein invariably match many regions of the genome in addition to their genomic origin. Conversely, because of competitive ionization of peptide species, only a portion of the peptides expected from a given coding region are observed by means of MS analysis.

The method we have developed appears able to discriminate real protein hits from the random background when using a loose match tolerance of 0.05% (500 ppm). Tests with data from liquid chromatography-separated yeast mitochondrial proteins analyzed with ESI-MS (Micromass Quattro II, Manchester, U.K.) and from matrix-assisted laser desorption ionization (MALDI)-MS analysis of 2D gel-separated Escherichia coli proteins produced statistically significant identifications (P < 0.05), the majority in agreement with mascot or peptident.

The GFS method has promise for identifying proteins involving nonstandard translation and also has potential to be used for genome annotation based directly on observed proteins (14, 15). Researchers have also recently developed methods for scanning raw genomic data by using tandem MS data (16, 17). Tandem MS involves an additional step where select peptides are introduced into a collisional unit and bombarded with heavy atoms. The resulting stepwise fragmentation pattern can be used to reconstruct the peptide sequence. Although approaches that incorporate tandem and higher-order MS are generally agreed to provide the most authoritative identifications, the analysis of spectra can be complicated by posttranslational modifications (18) or internal (nonterminal) dissociation of peptide bonds (19). In contrast, peptide mass fingerprinting has the benefit of simplicity and lower equipment cost. We believe that GFS and the tandem-MS genome-scanning methods could be used complementarily to produce a high-confidence, genome-based characterization of protein samples.

Computational Methods

In silico digestion of an entire genome sequence generates masses for all peptides that might be produced by its in vivo translation and subsequent proteolytic digestion. Translation and digestion are performed in all forward and reverse frames. The process proceeds from 5′ to 3′ on each sequence, keeping an in-process list of not-yet-terminated fragments and a final master list of terminated fragments. As each in-frame codon is encountered, the mass of its amino acid translation is added to each fragment of the in-process list. A new fragment is added to the in-process list after each cleavage or stop codon and at each start codon. Terminated fragments are transferred to the master list when cleavage sites and stop codons are encountered. Any fragment falling below a length threshold of three codons is discarded.

The enzyme trypsin cleaves after lysine and arginine, encoded by a total of eight codons. In the standard genetic code, the start codon is ATG (also coding internal methionine) and stop codons are TAA, TAG, and TGA. For mtDNA, start and stop codons are ATG, ATA and TAA, TAG, respectively. The system supports the use of alternative genetic codes by using separate translation dictionaries, selected by annotation in the FASTA header of a sequence, and defaulting to standard nuclear encoding.

Duplicate fragments (produced, e.g., when a start codon follows a cleavage codon) are prevented by checking a queue containing fragments recently transferred for any fragment with the same start codon and mass before transferring the new fragment to the final digest list. Incomplete proteolytic digestion of a protein sample results in peptides containing internal cleavage sites. The presence of such missed cleavages requires that the in silico digest peptides also contain them. The program uses the variable b to specify the maximal number of missed cleavage sites (breaks) allowed internally to a fragment. For efficiency, b is generally kept low, i.e., b ≤ 2.
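The digestion described above can be illustrated with a minimal Python sketch. It is a simplification and not the GFS implementation: it translates a single reading frame at a time, splits on stop codons, cleaves after lysine and arginine, and enumerates peptides with up to b missed cleavages. It does not start new fragments at start codons or use the streaming in-process list. The function names, the monoisotopic residue masses, and the example sequence are assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in Da; a peptide adds one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    """Return the reverse strand, so the three reverse frames can be translated."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))


def translate(dna, frame=0):
    """Translate one reading frame (0, 1, or 2) of an uppercase DNA string."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(frame, len(dna) - 2, 3))


def tryptic_peptides(protein, max_missed=2, min_len=3):
    """Yield (start, peptide, missed_cleavages) for each stop-free stretch."""
    pos = 0
    for segment in protein.split("*"):
        # Cut points: segment start, after every K or R, and segment end.
        cuts = [0] + [i + 1 for i, aa in enumerate(segment) if aa in "KR"]
        if cuts[-1] != len(segment):
            cuts.append(len(segment))
        for i in range(len(cuts) - 1):
            for missed in range(max_missed + 1):
                j = i + 1 + missed
                if j >= len(cuts):
                    break
                pep = segment[cuts[i]:cuts[j]]
                if len(pep) >= min_len:  # discard fragments under three codons
                    yield pos + cuts[i], pep, missed
        pos += len(segment) + 1  # skip past the stop codon


def peptide_mass(pep):
    """Monoisotopic mass of a neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in pep) + WATER


if __name__ == "__main__":
    protein = translate("ATGGCTAAAGGTCGTTGGTAA")
    for start, pep, missed in tryptic_peptides(protein, max_missed=1):
        print(start, pep, missed, round(peptide_mass(pep), 4))
```

A whole-genome digest would repeat this for frames 0 to 2 of both the sequence and its reverse complement, and an alternative genetic code (such as the mitochondrial one) would swap in a different codon table.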

In silico digestion of the nuclear plus mitochondrial genome of S. cerevisiae with b = 2 generates 8.9 million fragments and requires ≈200 MB of memory. The program stores the mass, start point, length, number of breaks, and reading frame of each fragment. The computational complexity of the digestion process is roughly linear, or O(n), in correspondence to genome size n. The matching process is O(nm), with m representing the size of the mass list. Further algorithmic details are provided in Supporting Text, which is published as supporting information on the PNAS web site, www.pnas.org.

Mass data from either MALDI-MS or ESI-MS is matched to the in silico digestion to within a tolerance level calculated as a percentage of the experimental mass (Δ%). Each matching fragment is mapped back to its position on the chromosome (Fig. 1). Two scans across each chromosome evaluate match criteria for windows of size w to identify those with a high score (scoring discussed below). For all experiments herein w = 500. The first scan scores windows in 100-nt increments, providing an internal histogram of score distributions used by the program to establish a cutoff level such that only the top n scoring clusters are examined on the subsequent scan. Currently, n is fixed at a value of 10. Regions with windows scoring above the threshold are examined in further detail by an extension scan used to determine the full extent of the hit-cluster region. The extension scan starts with each high-scoring window and proceeds backward in 50-nt steps, considering each window of size w until the score falls below a defined cutoff. This marks the start of the full region. A scan forward from the original window in 50-nt steps again considers windows of size w until the same cutoff is reached, marking the end of the extended region. The cutoff currently used is half of the score of the 10th highest-scoring window found on the initial scan.
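The matching and two-stage scanning steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the GFS code: fragments are represented as hypothetical (mass, start, length) tuples, the first-pass score is a plain hit count rather than the scoring function discussed below, and matching uses a binary search over sorted masses, which is a design choice of this sketch rather than something the text specifies.

```python
from bisect import bisect_left, bisect_right


def match_masses(observed, digest, tol_percent=0.05):
    """Return digest fragments within tol_percent of any observed mass.

    digest must be a list of (mass, start_nt, length_nt) sorted by mass.
    """
    masses = [frag[0] for frag in digest]
    hits = []
    for m in observed:
        delta = m * tol_percent / 100.0  # tolerance as a percentage of the mass
        lo = bisect_left(masses, m - delta)
        hi = bisect_right(masses, m + delta)
        hits.extend(digest[lo:hi])
    return hits


def scan_windows(hits, chrom_len, w=500, step=100):
    """First pass: count hits starting in each window [pos, pos + w)."""
    starts = sorted(frag[1] for frag in hits)
    counts = []
    for pos in range(0, max(chrom_len - w, 0) + 1, step):
        n = bisect_left(starts, pos + w) - bisect_left(starts, pos)
        counts.append((pos, n))
    return counts


def extend_region(score_at, seed_pos, cutoff, chrom_len, w=500, step=50):
    """Extension scan: grow a region around a high-scoring seed window.

    score_at(pos) returns the score of the window starting at pos. The scan
    steps backward, then forward, until a window scores below cutoff.
    """
    left = seed_pos
    while left - step >= 0 and score_at(left - step) >= cutoff:
        left -= step
    right = seed_pos
    while right + step + w <= chrom_len and score_at(right + step) >= cutoff:
        right += step
    return left, right + w


if __name__ == "__main__":
    digest = sorted([(1000.0, 50, 30), (1000.4, 2000, 27),
                     (1500.0, 120, 45), (2000.0, 3000, 60)])
    hits = match_masses([1000.2, 1500.5], digest)
    print(len(hits), scan_windows(hits, chrom_len=4000)[:3])
```

In a fuller version the cutoff passed to extend_region would be derived from the first-pass scores, for example half the score of the 10th highest-scoring window as described above.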

Several scoring methods have been investigated. Simple measures include the total number of hits (matches) per window and the percentage of the DNA sequence contained within matching fragments. The former tends to favor regions containing multiple repeats of a peptide matching one of the input masses and fails to account for the lower frequency of peptides with missed cleavages. Sequence coverage scores are inflated by high-mass fragments. A scoring function was developed to address these issues. It considers the following aspects of each window of size w: (i) the number of hits in a single frame; (ii) the number of possible hits in the same frame (i.e., the total digested fragment count); (iii) the number of missed cleavages in each fragment; (iv) the number of in-frame stop codons encountered in the window before the current fragment; (v) duplicate mass matches; and (vi) abutment of fragments.

Multiplicative combination of these attributes is important for scoring features such as the number of preceding stop codons, missed cleavages, and duplicate fragments matched. For example, whereas a region with one in-frame stop may still be considered because it could be caused by a sequence error or stop codon read-through (13, 20), multiple in-frame stops are increasingly unlikely. If the probability of an in-frame stop codon being present is p, then the probability of s occurrences is p^s. The scoring equation does not attempt to model actual probabilities, because the data required to ascertain realistic probabilities or frequencies of these occurrences would be very difficult to obtain. Instead, the scoring function is intended to maximally discriminate randomly formed clusters from real hits, while allowing for occurrences such as sequence errors or missed cleavages.

Assignment of penalty values to each of the listed factors allows their multiplicative combination. The values are: c_b, penalty for missed cleavages, default 0.6; c_s, penalty for preceding in-frame stops, default 0.4; c_d, penalty for preceding duplicate-mass matches, default 0.6; and c_ā, penalty for N terminus not abutting a preceding fragment, default 0.9.

For a window containing t hypothetical fragments, h of which match experimentally measured peptide masses, we calculate a window score s by summing the penalty products for each fragment j = 1 … h:

\[ s = 100 \cdot \frac{h - d}{t} \sum_{j=1}^{h} c_b^{\,b(f_j)} \, c_s^{\,s(f_j)} \, c_d^{\,d(f_j)} \, c_{\bar{a}}^{\,1 - a(f_j)} \]

The functions b(f_j), s(f_j), d(f_j), and a(f_j) return counts for the number of breaks in a fragment (e.g., Lys and Arg codons), the number of stops preceding a fragment, the number of other preceding matches for this mass in the window, and whether or not (1 or 0) the amino terminus of the fragment abuts a preceding fragment, respectively. The term t normalizes the results for the total number of digest fragments in the window, but used by itself can skew results toward windows with a small number of possible fragments; h counterbalances this by giving weight to the number of hits in the window, and is reduced by d, the number of duplicate matches in the window. Scores are multiplied by a scaling factor of 100 to simplify the histogram analysis.
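A small Python sketch may make the multiplicative scoring concrete. The exact algebraic form used here is an assumption reconstructed from the definitions in the text: each matched fragment contributes the product of its penalties, the contributions are summed, and the sum is weighted by (h − d)/t and scaled by 100. The Match class, the treatment of d as the number of matches that have a preceding duplicate, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    breaks: int      # b(f_j): missed cleavages inside the fragment
    stops: int       # s(f_j): in-frame stops preceding the fragment
    duplicates: int  # d(f_j): preceding matches for the same mass
    abuts: bool      # a(f_j): N terminus abuts a preceding fragment


def window_score(matches, t, c_b=0.6, c_s=0.4, c_d=0.6, c_a=0.9, scale=100.0):
    """Score one window holding t digest fragments, len(matches) of them hits.

    Assumed form: scale * (h - d) / t * sum of per-fragment penalty products.
    """
    h = len(matches)
    if h == 0 or t == 0:
        return 0.0
    # d: matches in the window that duplicate an earlier mass match (assumed).
    d = sum(1 for m in matches if m.duplicates > 0)
    total = sum(
        c_b ** m.breaks
        * c_s ** m.stops
        * c_d ** m.duplicates
        * c_a ** (0 if m.abuts else 1)  # penalty applies only when not abutting
        for m in matches
    )
    return scale * (h - d) / t * total


if __name__ == "__main__":
    example = [
        Match(breaks=0, stops=0, duplicates=0, abuts=True),
        Match(breaks=1, stops=0, duplicates=0, abuts=False),
        Match(breaks=0, stops=1, duplicates=1, abuts=True),
    ]
    print(window_score(example, t=10))
```

Because the penalties multiply, a fragment behind two in-frame stops is weighted by 0.4 × 0.4 = 0.16, which reflects the p^s reasoning given for stop codons.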

These scores are used to differentiate statistically significant match regions from the backdrop of random hits. Methods to assign probability estimators to fingerprint matches have been described by several groups. One approach is to derive a probabilistic function describing the likelihoods of alternative identifications, as was done in profound (5), or to assign a probability that a given protein match is by chance as was done in mascot (6). Another is to use randomized data to establish a baseline for assigning significance to matches with real data (21).

We use a method similar to the latter, by which we calculate the significance of a match region as a function of its window score and the total number of masses in the experimental spectrum. Each value in a randomly selected subset of a large set of experimental peptide masses is perturbed by a random modification representing the addition or subtraction of H, C, O, and N atoms. A large number of such mass lists is searched against the genome fragment database, establishing a histogram to represent the range of the null hypothesis (i.e., that any given result is caused by chance). The histogram is used to define Ps, the probability in a single genome scan that a randomly chosen set of masses, of the same size, would achieve a score equal to or above the score considered (derivation in Supporting Text). The scoring methods used by GFS and mascot cannot be directly equated because the GFS significance score describes the probability of a false positive in a complete genome scan, whereas mascot produces a score that describes the probability of a false positive for each protein considered.
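The randomization procedure can be sketched in Python as below. This is a hedged illustration, not the derivation given in Supporting Text: the perturbation here adds or subtracts a single randomly chosen atom, which is only one possible reading of the description, and empirical_p simply reports the fraction of random trials whose best window score reaches the observed score. All function names and the example numbers are hypothetical.

```python
import random

# Monoisotopic atomic masses in Da.
ATOM_MASS = {"H": 1.007825, "C": 12.0, "N": 14.003074, "O": 15.994915}


def perturb(mass, rng):
    """Add or subtract one randomly chosen H, C, N, or O atom (assumed scheme)."""
    sign = rng.choice([-1, 1])
    return mass + sign * ATOM_MASS[rng.choice("HCNO")]


def random_mass_list(pool, size, rng):
    """Draw size masses from an experimental pool and perturb each one."""
    return [perturb(m, rng) for m in rng.sample(pool, size)]


def empirical_p(score, null_top_scores):
    """Fraction of random-trial top scores that are >= the observed score."""
    return sum(1 for s in null_top_scores if s >= score) / len(null_top_scores)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [1000.0, 1200.0, 1500.0, 1800.0, 2100.0]
    print(random_mass_list(pool, 3, rng))
    # With 1,000 trials, a score above every null score would give P < 0.001.
    print(empirical_p(30, [10, 20, 30, 40]))
```

Each random list would be run through the full genome scan, and the top window score of each run collected into the null distribution passed to empirical_p.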

The system consists of a client-server pair with a UNIX command-line interface. The server performs the in silico digestion and keeps the resulting database in memory, obviating the recomputation of the genome digest over multiple MS analyses. Each client receives a peptide mass list, connects to the server for processing, and outputs the results. The client generates a formatted HTML file displaying the clusters for the 10 highest-scoring windows, with matched fragments highlighted in different colors according to the reading frame in which they are found to facilitate visual identification of transframe events. The entire region's score and the highest score for any contained window of size w are reported. The significance is calculated as a function of the latter, maximal fixed window-size score.

Server parameters include the number of missed cleavages to be calculated, window size, a directory containing all FASTA-formatted genome sequences, and whether to use average or monoisotopic masses. Client parameters include the file containing a peptide mass list, the host name of the server, the TCP port number, Δ% mass tolerance, and parameters related to random trials. The programs currently run on the UNIX command line of Mac OS X, with potential deployment on Linux. We plan to make a graphical interface accessible to other researchers via the web. The current prototype code is available free of charge to nonprofit researchers.

Laboratory Methods

S. cerevisiae Mitochondria.

The S. cerevisiae strain used was BY 4743 Diploid, His 3Δ, Leu 2Δ, Ura 3Δ. Yeast cultures were grown in YPGE (1% yeast extract/2% bactopeptone/2% glycerol/2% ethanol) media at 30°C to an OD600 of <1.0. Mitochondria were isolated as described (22). Oxyliticase (Enzogenetics, Corvallis, OR) was used instead of or in addition to Zymolase (ICN) in some preparations. Mitochondria were lysed by sonication in 50 mM 3-cyclohexylamino-1-propane sulfonic acid (CAPS), pH 10.5. Polyethyleneimine was added to a final concentration of 0.1% to precipitate nucleic acids. After a 20-min incubation at 4°C, samples were centrifuged at 60,000 rpm for 2 h in a Beckman TL-100 centrifuge (TLA 100.3 rotor).

Cleared mitochondrial lysate was separated on a PerSeptive Biosystems (Framingham, MA) BioCad Sprint HPLC system with a 4.6 × 100 mm column packed with Poros 20 HQ (strong anion exchange) media (PerSeptive Biosystems). The running buffer was 50 mM CAPS, pH 10.5, and proteins were eluted from the column with 0 to 1 M NaCl gradient over five column volumes. Collected fractions were further separated on the same system with Poros 20 R2 (reversed phase) media. The running buffer was 0.1% trifluoroacetic acid/15% acetonitrile, and proteins were eluted from the column with a 15–45% acetonitrile gradient. Fractions were lyophilized then digested with modified sequencing grade trypsin (Promega) as per vendor instructions.

Molecular weights of proteins and peptides were determined by using positive-ion electrospray MS on a Quattro-II mass spectrometer (Micromass). ESI generates a series of multiply charged molecular ions from which mass assignments are derived for each protein or peptide. Molecular masses for peptides were determined by manually deisotoping and deconvolving the mass-to-charge (m/z) spectra.

E. coli.

Data from an analysis of selected proteins from two strains of E. coli, CSH 142 and CSH 156, were provided by workers at Kendrick Labs (Madison, WI), who performed 2D electrophoresis by using the methods described (23). A brief summary follows. Proteins were added as standards to the gel: myosin (220 kDa), phosphorylase A (94 kDa), catalase (60 kDa), actin (43 kDa), carbonic anhydrase (29 kDa), and lysozyme (14 kDa) (Sigma). Spots with large differences in expression level between the two strains were selected for analysis. The bands/spots were cut and digested by using 0.06 μg of modified trypsin (sequencing grade, Roche Molecular Biochemicals) in 13–15 μl of 0.025 M Tris, pH 8.5. The tubes were placed in a heating block at 32°C and left overnight. Peptides were extracted with 2 × 50 μl of 50% acetonitrile/2% trifluoroacetic acid and then the combined extracts were dried and resuspended in matrix solution, 4-hydroxy-α-cyanocinnamic acid in 50% acetonitrile/0.1% trifluoroacetic acid with two standards, angiotensin and bovine insulin. An aliquot of 0.7 μl was spotted onto the sample plate, completely dried, and washed twice with water. A PerSeptive Voyager DE-RP MALDI–time-of-flight mass spectrometer was used to analyze digest samples in the linear or reflector mode (Applied Biosystems). The National Center for Biotechnology Information and/or GenPept databases were searched by the service lab by using profound (http://prowl.rockefeller.edu/cgi-bin/ProFound), ms-fit (http://prospector.ucsf.edu), and peptident (http://us.expasy.org/tools/peptident.html).

Results

Data from ESI-MS analysis of S. cerevisiae mitochondrial proteins and MALDI-MS analysis of E. coli proteins were used to test the GFS system. We rejected the use of synthetic data for reasons similar to those cited by Perkins et al. (6), e.g., because the real experimental factors that play into the observed data are not well enough understood to be modeled as required for the generation of realistic synthetic data.

On the other hand, the absence of authoritative identifications for the analyzed proteins requires that we evaluate performance by comparing our results with those of other algorithms. Our performance assessment is based on comparisons with both peptident (4) and mascot (6), well-recognized tools used for the data analysis in our yeast proteome project. In most cases both programs were used for comparison; however, there are a few for which only mascot was used, because it later became the default analysis tool in the S. cerevisiae project. For mascot analyses in our proteome project, proteins with a match score >70 were considered significant identifications.

As opposed to the single protein samples typically produced by 2D gel separation, our yeast mitochondrial samples contain multiple proteins, a result of the method of HPLC separation used. The regular presence of multiple proteins imposes certain constraints on the system. For example, it greatly complicates the consideration of mutually exclusive possibilities required to develop a Bayesian formula for probability assessment of identifications as was done in profound (5). An alternative is to base comparisons on the distribution of randomized data, as we have done here. The increased number of peptides in a multiprotein analyte further increases the background noise level, making matches more difficult to distinguish.

To establish the significance levels of scores we performed multiple repetitions of experiments by using randomized mass lists of varying lengths. For S. cerevisiae, 1,000 repeat trials were performed, with lists of length 20–100 in increments of five. An example histogram of scores produced from one such set of trials is shown in Fig. 3, which is published as supporting information on the PNAS web site. The process was repeated for E. coli with lists of 20–50 random masses, incremented by five. The program keeps a summary histogram plotting scores against the number of w = 500-nt windows achieving each score. A plot of the Ps < 0.001 (maximum scores) and Ps < 0.05 values versus mass-list size is in Fig. 2. The curves show a close to linear relationship between the number of masses input and confidence thresholds, with the Ps < 0.05 curve varying more smoothly because of the higher quantity of data available to establish it.

Figure 2

Confidence levels established by sets of random mass-list trials against the S. cerevisiae whole-genome digest. ⧫, Top scores in 1,000 trials, equating to Ps < 0.001. ■, The scores above which Ps < 0.05. The dotted and solid lines are a linear regression for the two data sets, shown along with their equations and R values.

Twenty-two samples were chosen for the comparative performance analysis, 18 from yeast and four from E. coli. The yeast samples were selected at random from the larger pool of those available. The program was run with each mass list, and results were manually parsed, with the top two scoring clusters considered in each case. The score for each cluster region was compared with the null hypothesis to compute the P value by interpolating between the two nearest mass-list sizes (increments of five).
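The interpolation step can be sketched as a short Python helper. This is an assumed reading of the procedure: for a given cluster score, a P value is available from the random trials at each tabulated mass-list size (multiples of five), and the value for the actual number of masses is obtained by linear interpolation between the two nearest sizes. The function name, the clamping at the ends of the table, and the numbers in the example are hypothetical.

```python
def interpolate_p(n_masses, p_by_size):
    """Linearly interpolate a P value between the two nearest tabulated sizes.

    p_by_size maps a mass-list size (e.g. 20, 25, 30) to the P value that the
    random trials assign to the score of interest at that size.
    """
    sizes = sorted(p_by_size)
    # Clamp outside the tabulated range (an assumption of this sketch).
    if n_masses <= sizes[0]:
        return p_by_size[sizes[0]]
    if n_masses >= sizes[-1]:
        return p_by_size[sizes[-1]]
    for lo, hi in zip(sizes, sizes[1:]):
        if lo <= n_masses <= hi:
            frac = (n_masses - lo) / (hi - lo)
            return p_by_size[lo] + frac * (p_by_size[hi] - p_by_size[lo])


if __name__ == "__main__":
    # Hypothetical P values for one score at list sizes 20 and 25.
    print(interpolate_p(22, {20: 0.01, 25: 0.06}))
```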

The genome position of each identified cluster was compared with ORF annotations for yeast or E. coli. An ORF encompassing and in-frame with the cluster region was noted as a match. For the yeast searches, an in-house program was used to match positions to annotated ORFs in a mid-2001 download from the Saccharomyces Genome Database (http://genome-www.stanford.edu/Saccharomyces/). E. coli searches were performed manually against the National Center for Biotechnology Information database by using the summary data at www.ncbi.nlm.nih.gov/cgi-bin/Entrez/altik?gi=115&db=Genome (late 2001). The parameters for all reported experiments were w = 500, Δ% = 0.05, and b = 2, with monoisotopic peptide masses used for the yeast/ESI experiments and average peptide masses for the E. coli/MALDI experiments. Fig. 4, which is published as supporting information on the PNAS web site, is an illustration of typical output from the program.

Seventeen of the 22 samples had a top-scoring cluster region with significance Ps < 0.05. Of the five not within this threshold, four had top-scoring clusters falling in ORFs identified in the database as mitochondrially localized proteins; the remaining region did not correspond to an obvious ORF. The top-scoring GFS-identified ORF was also top scoring for one of the other programs in 16 cases, including one case where mascot and GFS agreed that no significant matches were present. When counting agreement between either first- or second-place significant protein matches, the GFS and mascot/peptident corroboration increases to 19 samples or 86%. Table 1 shows six representative results, including two disagreements and one where the second-place identification was corroborated. The disagreements provide insights into differences between the algorithms. Sample 2378 F9 with GFS score 121 (Ps ≈ 0.004) mapped to an ORF for the mitochondrial precursor of alcohol dehydrogenase (ADH3), whereas the other programs identified the sample protein as phosphoglycerate kinase (PGK). Given the high significance and mitochondrial localization of ADH3, it is likely that GFS correctly identified a different component of the sample than the other programs. GFS also corroborated the PGK identification, which was its second-highest-scoring region.

Table 1

Representative results for five yeast samples and one E. coli sample

Another analyte was identified by GFS as PET127, a protein annotated in GenBank as a component of the mitochondrial translation system. For this sample neither mascot nor peptident found anything significant. It appears from the GFS program output that only a small, ≈10-kDa portion of the protein was present during tryptic digestion, which would account for the poor significance score. It would also explain the difficulty mascot had with this, because this is a small piece of a large 93-kDa protein predicted in the ORF database. In another case, GFS identified POM152 (Ps ≈ 0.03), whereas mascot found nothing significant and peptident had a weak match for ABC1. As with the previous case, the parent protein predicted by the database (Saccharomyces Genome Database) is large: 151 kDa. This finding highlights an advantage of position-based peptide matching. It is likely that a portion of analyzed proteins has undergone in vivo proteolysis, causing incomplete peptide coverage. Because the peptide coverage, when averaged over a large protein, will be low, such cases can confound searches that rely on ORF or protein annotation. With the genome-based positional scanning of GFS, parent protein size does not directly affect its performance unless a protein is much smaller in size than the window used for analysis.

Table 2 illustrates the detection of multiple proteins within a sample containing a large number of peptides (88 total). To calculate the significance for each match region, we remove from the experimental mass list all masses that were matched to a higher-scoring region and are not contained in the current region. We then calculate the significance corresponding to the region's score and this reduced number of masses. The rationale is that this procedure is equivalent to removing the previously matched masses and rerunning the scan. GFS is unique among MS search algorithms for its ability to establish significance levels based only on the number of masses input. The use of a fixed window size for scanning simplifies the detection of multiple proteins and the assessment of their significance in a single-pass analysis.

Table 2

Results for the top five scoring clusters in a GFS analysis of a sample containing multiple proteins
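
The mass-removal step used to assign significance to each region in Table 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GFS implementation: the function name effective_mass_counts and the representation of a region as a (score, matched masses) pair are hypothetical. The sketch returns, for each region, the number of masses on which its significance would be based; the mapping from score and mass count to Ps is omitted because it depends on the random-trial data.

```python
def effective_mass_counts(experimental_masses, regions):
    """Return (score, n_masses) for each region, highest score first.

    regions is a list of (score, matched_masses) pairs (hypothetical layout).
    For each region, masses matched to a higher-scoring region are removed
    from the experimental list unless the current region also matches them.
    """
    ordered = sorted(regions, key=lambda r: r[0], reverse=True)
    claimed = set()   # masses matched by higher-scoring regions so far
    results = []
    for score, matched in ordered:
        matched = set(matched)
        removed = claimed - matched   # claimed elsewhere, absent from this region
        remaining = [m for m in experimental_masses if m not in removed]
        results.append((score, len(remaining)))
        claimed |= matched
    return results


if __name__ == "__main__":
    masses = [800.4, 912.5, 1045.6, 1200.7, 1310.8, 1502.9]
    regions = [
        (50, [800.4, 912.5, 1045.6]),   # top-scoring region
        (30, [1045.6, 1200.7]),         # shares one mass with the top region
    ]
    for score, n in effective_mass_counts(masses, regions):
        print(score, n)
```

In this toy input the second region is evaluated against four masses rather than six, because two masses were matched only by the top-scoring region.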

Identifications for all four E. coli samples matched the standard results, and all of them had GFS scores with significance Ps < 0.001. The stronger E. coli results are likely caused by the samples each having only one protein and by higher mass accuracy from both the instrument and data analysis methods. The ESI process generates ions with multiple charge states, whereas MALDI typically produces singly charged ions. For the ESI data, we lacked appropriate software to perform deconvolution of multiply charged spectra into straight mass spectra; our manual deconvolution may have been less accurate. Given the lower quality of these data and the relaxed match criteria used (500 ppm), the performance of GFS on the ESI/yeast data is promising.
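
For readers unfamiliar with charge deconvolution, the underlying conversion is simple arithmetic once the charge state of a peak is known: for a positive ion formed by protonation, the neutral mass is z × (m/z − m_proton). The Python sketch below shows only this conversion. It assumes the charge has already been assigned, which is the difficult part of deconvolution, and it does not represent the manual procedure used here; the function name neutral_mass is hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Convert an observed m/z to a neutral mass, assuming [M + zH]z+ ions."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    # The same peptide observed at two charge states gives one neutral mass.
    print(neutral_mass(1001.007276, 1))
    print(neutral_mass(501.007276, 2))
```

A singly charged MALDI peak needs only the z = 1 case, which is why the MALDI data require no deconvolution step.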

Discussion

The empirically determined working parameter sets sufficed for practical operation on these distinct data sets but should be optimized by further experimentation to match the properties of the experimental equipment used. For example, mass tolerance should be decreased for higher mass accuracy data, effectively reducing the number of random hits. Optimization of window size is potentially complex, depending on protein size, gene composition, and the local configuration of matched fragments. If spliced genes are analyzed, the window size may have to be much larger, with a second, smaller window scan to identify putative exons.
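
The effect of the mass tolerance on matching can be illustrated with a short Python sketch. It is a generic parts-per-million comparison, not code from GFS, and the function name within_ppm is hypothetical. It shows how a candidate that is accepted at 500 ppm is rejected once the tolerance is tightened, which is how a smaller tolerance reduces random hits.

```python
def within_ppm(observed, theoretical, tol_ppm):
    """True if observed lies within tol_ppm parts per million of theoretical."""
    error_ppm = abs(observed - theoretical) / theoretical * 1e6
    return error_ppm <= tol_ppm


if __name__ == "__main__":
    observed, theoretical = 1000.4, 1000.0   # mass error of about 400 ppm
    print(within_ppm(observed, theoretical, 500))  # relaxed tolerance
    print(within_ppm(observed, theoretical, 50))   # tighter tolerance
```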

We performed an experiment to investigate the extent of the random backdrop for large mammalian genomes. We repeated 1,000 queries with randomly generated 41- and 61-length mass lists against the in silico digest of human chromosome XIV, comprising 1/35th of the human genome. The background from this experiment should be roughly equivalent to scanning the entire genome 28 times. The maximum score for 41-mass lists was 107, and for 61-mass lists it was 134, indicating that a real match scoring above those thresholds would have a significance value of less than 1/28 ≈ 0.04. Our yeast sample (2378 E2) that had a 41-mass input list produced a score of 114, which for yeast is a Ps value < 0.001, and for a genome of human size and composition would still have significance < 0.04. A 61-mass sample that had a Ps ≈ 0.004 for yeast would have a Ps ≈ 0.13 if matched against a human-sized genome. These data indicate that scanning a much larger genome is statistically feasible. Improved MS data, allowing reduction of the tolerance from the present 500 ppm to ≈50 ppm, should provide a proportional 10-fold reduction in the random backdrop, improving the likelihood of success. However, the identification of proteins consisting of multiple exons remains a considerable challenge that has not yet been addressed.
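
The arithmetic behind these estimates can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes, as the text does, that random hits scale in proportion to the amount of sequence scanned and to the width of the tolerance window.

```python
def genome_equivalents(n_queries, genome_fraction):
    """Whole-genome scans represented by n_queries against a genome fraction."""
    return n_queries * genome_fraction


def significance_bound(n_equivalents):
    """Upper bound on Ps for a score never reached in n_equivalents random scans."""
    return 1.0 / n_equivalents


def backdrop_reduction(old_tol_ppm, new_tol_ppm):
    """Fold reduction in random hits, assuming hits scale with tolerance width."""
    return old_tol_ppm / new_tol_ppm


if __name__ == "__main__":
    scans = genome_equivalents(1000, 1 / 35)   # about 28.6 genome-equivalents
    print(scans)
    print(significance_bound(int(scans)))      # 1/28, the bound quoted in the text
    print(backdrop_reduction(500, 50))         # tolerance tightened 500 -> 50 ppm
```

Rounding the number of genome-equivalents down to 28, as the text does, gives the slightly more conservative bound.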

Although our implementation is a proof-of-principle prototype, it has produced results that contribute useful information to our research on yeast mitochondrial proteins. Applied to a genome less thoroughly annotated than that of yeast, this method could contribute to the identification of proteins for which the database representation is not complete. Potential future enhancements include the automatic assignment of probability values from the regression of random trials, integration with ORF databases for automatic output of the ORF covering any high-scoring cluster, and a web-based interface. We are optimizing the code for high-throughput analysis on modern vector processors and plan to deploy the system to provide concurrent search of multiple mass lists and several genomes.

Acknowledgments

We thank Pavel Baranov, Hendrick Labs, and the Protein Chemistry Core Facility at Columbia University (New York) for providing the E. coli data and Chad Nelson for producing the ESI-MS data for S. cerevisiae. This work was supported by National Institutes of Health Genome Scholar Award HG00044 (to M.C.G.) and Department of Energy Grant DE-FG03-99ER62732 (to R.G.).

Footnotes

    • § To whom correspondence should be addressed. E-mail: giddings@unc.edu.

    Abbreviations

    ESI, electrospray ionization; GFS, genome fingerprint scanning; MALDI, matrix-assisted laser desorption ionization.
    • Received July 31, 2002.
    • Accepted November 12, 2002.
    • Copyright © 2003, The National Academy of Sciences

    Genome-based peptide fingerprint scanning
    Michael C. Giddings, Atul A. Shah, Ray Gesteland, Barry Moore
    Proceedings of the National Academy of Sciences Jan 2003, 100 (1) 20-25; DOI: 10.1073/pnas.0136893100
