BIOLOGICAL SCIENCES / BIOPHYSICS
Growth of novel protein structural data
Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305-5126
Contributed by Michael Levitt, December 29, 2006 (received for review October 12, 2006)
| Abstract |
|---|
|
|
|---|
number of folds | Protein Data Bank | structural genomics
In the past, structural data have grown as protein structures were solved to answer key biological questions. The value of the structures outside their biological context was increasingly appreciated thanks to theoretical work that showed how a known protein structure can be used to model the structure of a protein with a closely related sequence (7, 8). This field of homology modeling is now a major preoccupation of human modelers and automatic modeling servers alike (9, 10). After Chothia (11) hypothesized that the number of different protein shapes was finite and perhaps as small as 1,000, it seemed feasible to determine structures of representative proteins and to then derive most other structures by homology modeling (12). This was the basis of the Protein Structure Initiative started by the United States National Institutes of Health in 1999 (13). Similar initiatives were started elsewhere, especially at Riken in Japan (14) and SPINE in Europe (15), to give a worldwide effort in a new field known as structural genomics or SG (the SPINE initiative does not aim to extend coverage of structure space).
Despite considerable investment in new methods for solving protein structures, little attention was given to the need to track the progress of these initiatives until the Chandonia and Brenner article at the beginning of 2006 (16). That article uses three different measures of structural novelty: (i) the number of structures that are sequence-unique at different levels of similarity, as measured by sequence identity or other more sophisticated scores, (ii) the number of matches to a sequence in a PFAM family that had no previous members of known structure, and (iii) the number of new Structural Classification of Proteins (SCOP) folds. Use of multiple measures is problematic because the overall novelty score will depend on how the measures are weighted. Use of matches to SCOP depends on manual curation, which is generally not up-to-date (SCOP Version 1.67 used by Chandonia and Brenner did not include PDB files released after May 15, 2004).
Here, we present a robust and reliable statistical method that quantifies novel protein structural data accurately. It uses sequence alone and does not require that structures be examined or compared. Account is taken of the sequence identity of multiple chains solved in the same PDB entry, sequence similarity of chains in different entries and the different number of residues in each entry. A weighted count method is introduced to eliminate sequence redundancy by down-weighting the contribution of sequences with more sequence neighbors already in the PDB. This measure is numerically very similar to the number of clusters found by hierarchical clustering. Although clustering is most like conventional classification schemes, the weighted count is much easier to compute and also more robust. Using this measure on data deposited at different times, we find that the change in the number of weighted chains at a 25% sequence identity threshold is the best predictor of the corresponding number of entries in the hand-curated SCOP protein classification (17); this allows prediction of the category sizes in SCOP well before the actual numbers are available.
The PDB is not growing exponentially (18, 19). Instead, the annual growth rate over the past 33 years fluctuates between 6% and 150% with three periods of >30% annual growth (October 1972 to July 1976, October 1980 to April 1982, and October 1988 to April 1994) followed by a steady decrease in growth rate since 1997. Growth rates are essentially the same for PDB entries, nonidentical chains, sequence clusters, and weighted counts.
| Results |
|---|
|
|
|---|
One expects the number of clusters, NC25, and the weighted chain count, NW25, to be numerically similar, and the agreement we see here gives us greater confidence to use NW25, a more robust measure that is also much easier to calculate. In clustering, links are always symmetrical in that if A links to B then B links to A. Because of differences in chain lengths, some links will be different from A to B than from B to A. Asymmetric links were left out of the clustering to give NC25, which is much closer to NW25.
Growth of the PDB.
The rapid growth of protein structural data is clear from Fig. 2a. All measures of structure seem to agree on this log scale plot, and the curves are approximately straight lines, which would seem to imply exponential growth. The Dickerson equation (18), which predicted the number of PDB files to be (1/0.19) × exp(0.19(t − 1960)), where t is the year, fits the real data very well for the 10 years from 1996 to 2006 but does much less well for the period from 1978 to 1995. Close examination of the changes with time of the percentage growth rates (Fig. 2b), which are also very similar for all measures, shows that the growth rates change dramatically with time. Were the growth of structural data to be exponential, the percentage growth rate would be constant. Instead, it shows three peaks. The first peak, occurring between 1972 and 1976 and including an additional 27 PDB files, corresponds to the initial explosion of crystallography that took place after the first structures were solved in England in the decade 1960 to 1970. The second peak is most likely due to the greater ease with which structures could be solved thanks to the spread of Digital Equipment Corporation's VAX 780 virtual memory computers first introduced in 1978. The third peak, occurring between 1991 and 1997 and including over 250 PDB files, corresponds to the availability of intense beams of x-rays from synchrotrons coupled with the use of crystals cooled in liquid nitrogen. Header records of these PDB files show that the frequency of use of crystal cooling rose rapidly between 1995 and 1997, settling down to ~45% of solved PDB files. Use of synchrotron radiation rose slowly between 1995 and 2000; it has been used for about half the structures solved since then.
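The link between exponential growth and a constant percentage growth rate can be checked numerically. The following is a minimal Python sketch, not the analysis used for Fig. 2: it evaluates the Dickerson expression quoted above and the simple percentage growth defined in Methods (change divided by the current number). The function names are illustrative.

```python
import math


def dickerson_count(year: float, rate: float = 0.19, origin: float = 1960.0) -> float:
    """Number of PDB files predicted by (1/rate) * exp(rate * (year - origin))."""
    return (1.0 / rate) * math.exp(rate * (year - origin))


def percent_growth(count_fn, year: float) -> float:
    """Change over the preceding year divided by the current count, in percent."""
    current = count_fn(year)
    return 100.0 * (current - count_fn(year - 1.0)) / current


# For a purely exponential curve the rate is the same in every year.
rates = [percent_growth(dickerson_count, year) for year in (1980, 1995, 2006)]
expected = 100.0 * (1.0 - math.exp(-0.19))  # about 17.3% per year
```

Any departure of the observed yearly rate from this constant, such as the three peaks described above, is therefore evidence against simple exponential growth.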
The ratio ΔNW25/ΔNCHA, which measures the structural novelty of solved protein structures, is remarkably constant with a value of 0.18 ± 0.03 since 1992 (Fig. 2a).
Predicting SCOP Growth.
The linear dependence of SCOP category sizes on the number of PDB files shown in Fig. 3a seems to offer a way to predict these sizes. As a simple rule of thumb, since 1997, ~10.7% of deposited PDB files correspond to new families, 5.1% correspond to new superfamilies, and 2.9% correspond to new folds. Such prediction is important because manual curation is very labor intensive, and the lag time between deposition of PDB files and their SCOP classification can be significant: since the last release on 1 October 2004 (1.69), the number of deposited files has increased by ~30%. Caution is required in predicting the size of SCOP categories from the number of deposited files: given a new PDB chain, there is no way one can tell whether it will start a new SCOP family before Alexei Murzin examines its sequence and structure. The probability that the new chain will be a new family is, therefore, random at 0.07 (NFAM/NCHA = 0.07). Given a new chain with a weight of 1 (no sequence neighbors at the 25% ID threshold), one expects a higher probability that it will be the first member of a new family. Thus, one expects the size of the categories to depend more on NW25, a measure of novel structural information.
Is NW25 or NPDB a better predictor of the size of a SCOP category? We use the correlated count and correlated weight methods (see Methods). Odds are calculated for sequence-unique chains predicting new families (OFAM:W25) and also for new families predicting new folds (OFOL:FAM). In both cases, the overall odds are significantly favorable at OFAM:W25 = 5.0 ± 0.7 and OFOL:FAM = 10.5 ± 2.1. This means that when a new chain is sequence-unique, it is five times more likely to be the first member of a new SCOP family than expected by chance (for PDB chains, which predict families at random, OFAM:CHA = 1). The probability that a sequence-unique chain starts a new family is five times larger than random, but the value is still small at 0.35 (5 × 0.07). The OFAM:W25 values fluctuate with time but the overall curve is flat, indicating that the power of NW25 to predict NFAM remains unchanged with time (SI Fig. 5). The results obtained with correlated weights also show the power of NW25 to predict NFAM in that the correlation coefficient between the weight of each added chain and whether it starts a new family is significant with an overall value of CFAM:W25 = 0.56 ± 0.07. Both the odds and correlation coefficients decrease as the sequence threshold used to calculate the sequence weight is increased: as the %ID is increased from 25 to 100, OFAM:W25 decreases from 5.0 to 1.8, whereas CFAM:W25 decreases from 0.57 to 0.28 (SI Fig. 6). This is expected because a method that detects remote homology better should be a better predictor of whether a sequence-unique chain starts a new family.
Because NW25 is the best predictor of SCOP category sizes, Fig. 3b plots the sizes of SCOP families, superfamilies, and folds against NW25. The curves are not linear and show saturation: as the amount of novel structural data grows, new categories are being formed more slowly. The sizes of the SCOP family, superfamily, and fold categories on August 20, 2006, the date of this analysis, are 3,757, 1,847, and 1,097. Use of saturating functions with more adjustable parameters had no effect on the fit or extrapolation.
Contribution of SG.
Fig. 4a shows the growth in novel structural data since 2000. In the six-and-a-half-year period to August 2006, NW25 has increased from 2,059 to 7,792, a factor of 3.8. If those structures that have been solved by the worldwide SG initiative are omitted, growth is from 2,044 to 5,899, a factor of 2.9. Without the data from SG, the growth in NW25 is almost linear with time with 216 weighted chain counts added per year. The SCOP category sizes behave similarly (using the real data to October 2004 and predicted data based on the saturating functions in Fig. 3a, thereafter) with increases from 1,301 to 3,757, 817 to 1,847, and 537 to 1,097 for families, superfamilies, and folds, respectively (factors of 2.9, 2.3, and 2.0). The percentage of the annually deposited novel structural data measured by NW25 that comes from SG has risen steadily and since the beginning of 2005 is 50% (Fig. 4b, % yearly ΔNW25 Only SG).
| Discussion |
|---|
|
|
|---|
An important property that distinguishes the weighted chain method from clustering is that it can be used on a subset of the data. Using the weights of a small random subset of just 2% of the sequences gives estimates of NW25 that are accurate to 5%. The time dependence of the estimated NW25 is also accurate after 1990, and better results can be obtained by averaging over five different random subsets. It would be interesting to calculate the value of NW25 for all known sequences and use this to estimate the total number of SCOP families, superfamilies, and folds that would be found were all these sequenced proteins to have their structures solved. The current National Center for Biotechnology Information nonredundant database contains ~3,000,000 sequences and a total number of residues that is ~120 times larger than the corresponding database of unique chains from the PDB. Using FASTA on all these pairs would require 120² times more computer time than the 200 days used by the PDB all-vs.-all comparison; this is close to 3,000,000 days or 8,000 years on an Intel Xeon 2.8-GHz processor. Using a random subset of 1,000 query sequences would require 1,000/3,000,000 = 0.03% of the sequence comparisons, take 60 days, and give an estimate of NW25 accurate to a few percent.
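The quadratic scaling of the all-vs.-all comparison can be made explicit with a few lines of arithmetic. The sketch below uses only the round figures quoted above; the variable names are illustrative.

```python
PDB_ALL_VS_ALL_DAYS = 200   # stated cost of the PDB all-vs.-all FASTA run
RESIDUE_RATIO = 120         # nonredundant database is ~120 times larger in residues
NR_SEQUENCES = 3_000_000    # approximate number of nonredundant sequences
SUBSET_SIZE = 1_000         # random query subset considered in the text

# All-vs.-all cost grows with the square of the database size.
full_days = PDB_ALL_VS_ALL_DAYS * RESIDUE_RATIO ** 2
full_years = full_days / 365.25

# Fraction of the query sequences (and hence comparisons) in the random subset.
subset_fraction = SUBSET_SIZE / NR_SEQUENCES
```

This gives 2,880,000 days, or just under 7,900 years, in line with the rounded values quoted in the text.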
A potential deficiency of our study is that we use the polypeptide chain in a PDB file as the basic unit. SCOP uses domains that are found by examination of the structure; using their definitions would make objective comparison with SCOP impossible and would also mean we had to wait for the SCOP domain definitions. We can parse chains into domains using sequence alignment. If disjoint regions of a long chain are sequence-matched to other chains that do not show any match to one another, then the long chain can be split into domains. Preliminary tests of automatic splitting with a 40% sequence identity threshold give a total of 51,765 chains versus the original number of 44,220 chains and have a reassuringly small effect on the results presented here. Correct parsing of chains into domains is difficult (20–22).
As noted in Results, the rate of growth of SCOP categories with deposited novel structural data as measured by NW25 is slowing down (the ratio NFAM/NW25 is decreasing in Fig. 3b). If a similar plot is drawn for the most recent CATH 3.0.0 classification (23), the effect is more marked, probably because many of the most recently deposited PDB files are not being classified by CATH. It is not clear whether this is a property of cluster-based classification in general or of the SCOP classification in particular. In a preliminary test, we find that NC25/NW25 decreases with time but by <1/3 of the decrease in NFAM/NW25; a proper test requires a more sensitive method of detecting chain similarity.
The present method of detecting similarity by using pairwise sequence alignment is not sensitive enough. It would be preferable to use more sensitive matching methods that use multiple sequences (like PSI-BLAST; ref. 24). One could also match known structures using structural alignment (25). Parsing the chains into structural domains could be done as described above for sequence matching and augmented with automated domain finding programs (26).
It would seem that fitting the growth of the SCOP categories to the weighted chain count, NW25 (Fig. 3b), allows one to estimate the total number of SCOP folds as NW25 becomes very large. This is a question that has received a great deal of attention since Chothia (11) suggested that there are <1,000 protein folds in all biology. Follow-up work (27) showed that with better assumptions, the number must be substantially larger. Since then, the estimated number of folds has varied from 650 (28) to >10,000 (29), with many other estimates in between these extremes (30–37). From the saturating function fit in Fig. 3b, we find that the maximum value of NFOL is 1,613. Any extrapolation assumes that the selection of proteins for structure determination will continue as it was in the past, which is unlikely. In fact, the smooth dependence of the size of SCOP categories on the number of released PDB files (Fig. 3a) is surprising because the priorities for selecting proteins for structure determination have changed greatly over the past three decades.
Estimating the Contribution of SG.
Protein SG centers began contributing structures to the PDB 10 years ago. Initial growth was very slow, and by the year 2000 only 36 out of 11,802 PDB files could be traced to SG (SI Table 1). Over the next six and a half years, an additional 3,134 SG files were deposited. More than half of this growth (1,918 PDB files) has occurred since October 1, 2004, the date of the most recent SCOP release (1.69). Clearly, one needs a better way to estimate the contribution of SG than waiting for SCOP. In their study, Chandonia and Brenner (16) estimated the contribution of SG by counting chains that are unique at a 30% identity threshold, by counting occurrences in SCOP 1.67 (dated 15 May 2004), and by counting the number of new PFAM families. They concluded that in the year 2005, SG centers had contributed about half the first structures in a protein family. In Fig. 4b, we show that the SG contribution to NW25 over this same period is 50%. More importantly, we show that in the subsequent 18 months to August 20, 2006, this level remained steady. We estimate SG centers to have contributed half of the SCOP families, superfamilies, and folds in the two and a half years since January 1, 2004. It will be very interesting to see how these estimates compare with the real numbers that are likely to be available soon in the next release of SCOP. An earlier study by Todd et al. (38) used very time-consuming manual inspection of released three-dimensional structures and was only able to study proteins deposited into the PDB by July 31, 2003. From Fig. 4b, it is clear that by that date, the contribution of SG to novel structure was ~25% or half its current value.
Is the increasing contribution of SG to novel protein structures causing a decrease in the novelty of non-SG structures? This is examined by looking at the ratio ΔNW25/ΔNCHA for recently deposited non-SG structures (SI Table 1). From October 1, 2004 to August 22, 2006, NW25 increased from 4,598 to 5,899, whereas NCHA increased from 30,037 to 41,204 so that ΔNW25/ΔNCHA = 0.117. For the period from January 1, 2000 to September 30, 2004, ΔNW25/ΔNCHA = 0.140 so that non-SG structures are less novel than they used to be. The rate of growth of novel non-SG structure has been constant at 15% since 2000 but for the preceding 4 years it was higher at >20% (Fig. 2b). Thus, SG may have impacted negatively on conventional structure determination as measured by the novelty and the rate of growth of the structures released. It may also have allowed crystallographers to examine biologically interesting systems without regard to structural novelty. With a 3.8 times higher level of structural novelty and more rapid growth, SG has effectively doubled the rate of novel structure determination.
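As a check, the novelty ratio for the October 2004 to August 2006 period follows directly from the non-SG counts quoted above:

$$
\frac{\Delta N_{W25}}{\Delta N_{CHA}} = \frac{5{,}899 - 4{,}598}{41{,}204 - 30{,}037} = \frac{1{,}301}{11{,}167} \approx 0.117
$$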
| Conclusions |
|---|
|
|
|---|
| Methods |
|---|
|
|
|---|
At the time data were downloaded (August 20, 2006), there were 35,805 released PDB files for proteins, 44,374 nonidentical protein chains (we eliminate multiple copies of the same chain in a PDB file), and 3,102 PDB files from SG. For details on PDB file selection and deposit and release dates, see SI Text.
SCOP Data Sets Used. This work used the SCOP 1.69 classification dated July 2005 (http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.69). Note that although this file is dated July 2005, it only includes the 24,037 PDB files with a release date earlier than October 1, 2004. We also used http://scop.mrc-lmb.cam.ac.uk/scop/count.html to give the history of the SCOP classification based on 13 SCOP releases from October 20, 1997.
Sequence Matching. The degree of sequence similarity of pairs of PDB entries varies greatly: some pairs of PDB files have identical sequences, whereas others show no significant similarity. All chains that are highly similar in sequence or structure can be considered to contribute redundant structural information to the database. Here, we focus on sequence similarity, which is easier to calculate and can also be applied more widely than structural comparison (25). Sequence similarity is measured by comparing all 44,374 nonidentical chain sequences with one another using Pearson's FASTA program (40). Only matches with e-values below 10⁻² were kept.
Sequence matching of chains as described above has a serious problem in that FASTA is a local sequence alignment method. If sequence A consists of two parts, A1 and A2, where A1 is identical to sequence B and A2 is identical to sequence C, then FASTA will find that A is identical to B and also identical to C but could find that B and C are totally dissimilar. The correct way to solve this problem is to parse all of the sequences and split sequence A into two chains A1 and A2. Such parsing is not trivial (20–22), and here we deal with the problem in a different way.
If the FASTA percent identity of a match between chain A and chain B is p, the aligned region consists of N residues, and the length of chain A is LA, then the effective percent identity is calculated as pE = p × (N/LA). Note that pE < p unless the entire length of chain A is aligned. The effective percent identity is now different for the link from chain A to chain B versus the link from chain B to chain A. With this scheme, partial matching is heavily down-weighted. Thus, in the above example, if the regions A1 and A2 are the same length, then for the match of A to B pE will be 50% rather than 100%, whereas for the match of B to A it will be 100%.
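The two-part example above can be reproduced with a minimal Python sketch of the effective percent identity. The function name and the chain lengths (200 residues for A, 100 for B) are illustrative assumptions, not values from the study.

```python
def effective_identity(percent_identity: float, aligned_residues: int, chain_length: int) -> float:
    """Return pE = p * (N / L_A): identity scaled by the aligned fraction of the query chain."""
    if chain_length <= 0 or not 0 <= aligned_residues <= chain_length:
        raise ValueError("aligned_residues must lie between 0 and chain_length")
    return percent_identity * aligned_residues / chain_length


# Chain A (200 residues) is A1 followed by A2; chain B (100 residues) is identical to A1.
pe_a_to_b = effective_identity(100.0, aligned_residues=100, chain_length=200)  # normalized by L_A
pe_b_to_a = effective_identity(100.0, aligned_residues=100, chain_length=100)  # normalized by L_B
```

The same alignment thus yields 50% from A to B but 100% from B to A, which is the asymmetry exploited by the weighting scheme.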
Links Between Structures Based on Sequence Identity. Chains A and B aligned by FASTA are considered to be linked if the pE value for A to B is above the sequence %ID, varied from 25% to 100%, in steps of 5%. Such links between protein chains can be used in two ways: (i) join sequence-related chains into clusters (cluster method), and (ii) weight each chain by its number of neighbors (weight method).
Cluster Method.
In the cluster method, links are used to organize chains into clusters so that each cluster contains all of the chains that are linked to any member of the cluster (single linkage clustering). In this form of clustering, the distance between objects is either above or below the threshold: objects in the same cluster must be connected by a path of links and do not need to be linked directly. The number of clusters at each percentage identity cutoff is denoted NC (NC25 at the 25% cutoff). The count NC must include the singleton clusters, the chains that make no links. In the cluster method, the links are effectively symmetrical in that it does not matter whether A is linked to B or B is linked to A. Here, the effective percent identity from A to B, pE(A→B), may be very different from that from B to A, pE(B→A). Links for which pE(A→B) and pE(B→A) differ by >20 percentage points are termed asymmetric and are omitted in most of the clustering runs.
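One way to count single-linkage clusters is with a union-find structure, as in the sketch below. This is an illustration rather than the code used in the study. It assumes that effective identities are supplied for ordered chain pairs, that a missing reverse entry counts as 0, and that a link exists when either direction reaches the threshold (taken here as "at or above"); all names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def count_clusters(
    chains: Sequence[str],
    effective_ids: Dict[Tuple[str, str], float],
    threshold: float = 25.0,
    max_asymmetry: Optional[float] = 20.0,
) -> int:
    """Count single-linkage clusters, singleton clusters included."""
    parent = {chain: chain for chain in chains}

    def find(chain: str) -> str:
        # Follow parent pointers to the cluster representative (path halving).
        while parent[chain] != chain:
            parent[chain] = parent[parent[chain]]
            chain = parent[chain]
        return chain

    for (a, b), pe_ab in effective_ids.items():
        pe_ba = effective_ids.get((b, a), 0.0)  # assumption: missing reverse match is 0
        if max(pe_ab, pe_ba) < threshold:
            continue  # neither direction reaches the %ID threshold
        if max_asymmetry is not None and abs(pe_ab - pe_ba) > max_asymmetry:
            continue  # asymmetric link, omitted from clustering
        parent[find(a)] = find(b)

    return len({find(chain) for chain in chains})


# A = A1 + A2, B = A1, C = A2 (asymmetric links); E and F are near-identical; D is a singleton.
chains = ["A", "B", "C", "D", "E", "F"]
effective_ids = {
    ("A", "B"): 50.0, ("B", "A"): 100.0,
    ("A", "C"): 50.0, ("C", "A"): 100.0,
    ("E", "F"): 90.0, ("F", "E"): 95.0,
}
n_filtered = count_clusters(chains, effective_ids)
n_all_links = count_clusters(chains, effective_ids, max_asymmetry=None)
```

With asymmetric links omitted, the unrelated chains B and C are no longer merged through the two-domain chain A, so the toy set gives five clusters instead of three.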
Weighted Count Method.
In the weighted count method, chains that are related to other chains are down-weighted because they contribute less to the total body of novel structural data. In this method, each chain is given a weight, and these weights are summed to get a weighted number of chains. The links found by sequence matching and used above for clustering are also used to calculate the weight of chain i as $W_i = 1/(n_{neib} + 1)$, where $n_{neib}$ is the number of neighboring chains linked to chain i. This scheme has properties that are intuitively sensible: singleton chains have no neighbors ($n_{neib} = 0$), so their weight is $W_s = 1$; chains that are part of a completely connected cluster of size n will each have a weight of 1/n, and the total weight of the cluster will be $W_{cc} = n(1/n) = 1$. Thus, adding identical copies of a particular object has no effect on the weight of that class of objects. Once the weight of every chain is defined, the weighted number of chains for the particular sequence ID is taken as the sum of the chain weights, $N_W = \sum_i W_i$. Because each object is assigned a weight, it is easy to calculate the total number of weighted residues as $M_W = \sum_i W_i m_i$, where $m_i$ is the chain length. It is also easy to calculate the total number of chains or residues belonging to any particular subset of PDB files (e.g., deposited before a certain date, solved by SG, solved in a certain geographical location, etc.). In the weight method, the links are asymmetrical in that the links to chain A are used to determine its weight, while different links are used to determine the weight of chain B; thus, all links are used.
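The weighting scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming that the directed link from A to B (the one normalized by the length of A) counts toward the weight of A only, and that a link exists when the effective identity is at or above the threshold; function and variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def chain_weights(
    chains: Sequence[str],
    effective_ids: Dict[Tuple[str, str], float],
    threshold: float = 25.0,
) -> Dict[str, float]:
    """Weight each chain by 1 / (number of linked neighbors + 1)."""
    neighbors = defaultdict(set)
    for (a, b), pe_ab in effective_ids.items():
        if a != b and pe_ab >= threshold:
            neighbors[a].add(b)  # directed link A -> B affects only the weight of A
    return {chain: 1.0 / (len(neighbors[chain]) + 1) for chain in chains}


# One singleton (D) plus a completely connected cluster of three identical chains.
trio = ["X", "Y", "Z"]
effective_ids = {(a, b): 100.0 for a in trio for b in trio if a != b}
lengths = {"D": 150, "X": 90, "Y": 90, "Z": 90}

weights = chain_weights(["D"] + trio, effective_ids)
n_w = sum(weights.values())                           # weighted chain count
m_w = sum(weights[c] * lengths[c] for c in weights)   # weighted residue count
```

The three identical chains together contribute a weight of 1, so the weighted chain count is 2 and the weighted residue count is 150 + 90 = 240, exactly as if only one copy had been deposited.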
Measuring Growth of Structural Data.
Growth of structural data with time is found by considering what was in the PDB at two different times. This involves eliminating all structures deposited (or alternatively released) after a particular date to give a modified PDB. This set of structures is then analyzed by the cluster and weighted count methods using a sequence %ID that ranges from 25% to 100% to give the number of clusters, NC, and the weighted chain count, NW, at different dates. The number of deposited PDB files, NPDB, and the number of nonidentical chains, NCHA, are also recorded. To get the contribution of the subset of structures solved by SG, these PDB files are omitted and the analysis is repeated: the SG contribution is then the difference between the counts with and without the SG structures. The contributions of SG to SCOP are more problematic as a SCOP fold discovered first by SG and later discovered without it would not be recorded by the above omission method. Instead, we use all of the data and count the first member of each SCOP category for PDB entries from SG and not from SG.
Growth rates of these numbers are calculated for each quarter and smoothed to give the annual growth rate (see legend to SI Fig. 7 for details). Growth rates are always expressed as the change in number divided by the current number (simple percentage growth).
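The simple percentage growth defined above can be sketched as follows. The smoothing described in the legend to SI Fig. 7 is not reproduced here; the sketch only computes a year-over-year change from quarterly snapshots, and the counts are made-up toy values.

```python
from typing import List, Sequence


def annual_growth_percent(counts: Sequence[float], periods_per_year: int = 4) -> List[float]:
    """Year-over-year change divided by the current count, in percent, for each snapshot."""
    rates = []
    for i in range(periods_per_year, len(counts)):
        current = counts[i]
        change = current - counts[i - periods_per_year]
        rates.append(100.0 * change / current)
    return rates


# Hypothetical quarterly counts (e.g., of weighted chains) over six quarters.
quarterly_counts = [100, 105, 110, 115, 120, 125]
rates = annual_growth_percent(quarterly_counts)
```

Because the change is divided by the current number rather than the earlier one, a doubling within a year corresponds to a growth rate of 50%, not 100%.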
Correlation to SCOP Category Size.
Measuring the correlation between the change of the size of a particular SCOP category and an easily measured quantity like the change in the number of PDB files (ΔNPDB), the number of unique chains (ΔNCHA), or the weighted chain count (ΔNW25) is more difficult than one might expect. As the SCOP categories increase with increases in NPDB, NCHA, and NW25, they all seem to be highly correlated.
Chains are ordered by date of release, and each chain is marked with four 0s or 1s depending on whether it is the first member of a new family, a new superfamily, or a new fold, or is sequence-unique (no sequence neighbors at the particular threshold so that its weight is 1.0). In doing this, all of the SCOP domains that are contained in a particular chain are used. We then take nonoverlapping sets of 500 structures sorted by increasing deposit date and calculate the number of times that a sequence-unique chain starts a new SCOP family, superfamily, or fold. This number is normalized by the frequency that would be expected by chance to give the odds that a sequence-unique chain starts a new SCOP category more often than expected by chance. Specifically, the odds that a sequence-unique chain will be the first member of a new SCOP family are OFAM:W25 = 500 × KW25:FAM/(KW25 × KFAM), where KW25 is the number of sequence-unique chains at 25% ID, KFAM is the number of first members of new SCOP families, and KW25:FAM is the number of cases where sequence-unique chains are also first members of a new family. This method is known here as the correlated count method.
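The correlated count calculation can be sketched in Python as below. This is an illustration under stated assumptions: the two 0/1 flag lists are already sorted by date, incomplete trailing windows are skipped, and windows with no sequence-unique chains or no new families yield NaN. The function name is hypothetical and the example uses a window of 10 rather than 500 to keep it short.

```python
from typing import List, Sequence


def odds_new_category(
    is_unique: Sequence[int],
    is_new: Sequence[int],
    window: int = 500,
) -> List[float]:
    """Odds = window * K_both / (K_unique * K_new) for each nonoverlapping window."""
    odds = []
    for start in range(0, len(is_unique) - window + 1, window):
        unique = is_unique[start:start + window]
        new = is_new[start:start + window]
        k_unique = sum(unique)                                   # sequence-unique chains
        k_new = sum(new)                                         # first members of a category
        k_both = sum(1 for u, f in zip(unique, new) if u and f)  # chains that are both
        if k_unique == 0 or k_new == 0:
            odds.append(float("nan"))
            continue
        odds.append(window * k_both / (k_unique * k_new))
    return odds


# Toy window of 10 chains: 2 are sequence-unique, 1 starts a new family, and it is unique.
unique_flags = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
new_family_flags = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_odds = odds_new_category(unique_flags, new_family_flags, window=10)
```

In this toy window the chance expectation for a coincidence is 2 × 1/10 = 0.2, so the single observed coincidence gives odds of 5.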
Because there is no way to tell whether a new PDB file or new chain is going to be a member of a new family without examining its sequence or structure, the odds that a new PDB file or a new chain will start a new SCOP family are 1.0 [on average, KPDB:FAM = (KW25 x KFAM)/500]. Thus, if the corresponding odds for sequence-unique chains are >1, KW25 will be a better predictor of the number of SCOP categories. Rather than consider only the sequence-unique chains with a weight of 1.0, it is possible to use the actual weight of each structure at the time it is added to the PDB. Correlation coefficients can then be calculated between the weight of each structure and a value that is 0 or 1 depending on whether the structure was the first member of a new SCOP family, superfamily, or fold. This method is known here as the correlated weight method.
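The correlated weight method amounts to a correlation between a continuous weight and a 0/1 indicator. The sketch below assumes the ordinary Pearson coefficient, which the text does not state explicitly, and uses made-up weights and flags.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weights at the time of deposition and new-family flags for five chains.
weights = [1.0, 0.5, 0.25, 1.0, 0.2]
starts_new_family = [1, 0, 0, 1, 0]
c_fam = pearson(weights, starts_new_family)
```

A coefficient near 1 would mean that high-weight chains almost always start new families; the overall value reported in Results is considerably lower than in this toy case.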
SCOP 1.71 dated December 4, 2006, includes all PDB entries released by January 18, 2005. There are 1,626 additional PDB files, and 159, 50, and 26 additional SCOP families, superfamilies, and folds, respectively. From the dependence on NW25 (Fig. 3b), we predict SCOP increases of 162, 62, and 33, respectively. Using the dependence on NPDB would give less accurate SCOP predictions of 174, 83, and 47, respectively.
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: PDB, Protein Data Bank; SCOP, Structural Classification of Proteins; SG, structural genomics; ID, identity threshold.
*E-mail: michael.levitt{at}stanford.edu
Freely available online through the PNAS open access option.
Author contributions: M.L. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
The author declares no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0611678104/DC1.
© 2007 by The National Academy of Sciences of the USA