BIOLOGICAL SCIENCES / BIOPHYSICS
On the origin and highly likely completeness of single-domain protein structures



*Center of Excellence in Bioinformatics, University at Buffalo, State University of New York, 901 Washington Street, Buffalo, NY 14203; and
Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138
Edited by Harold A. Scheraga, Cornell University, Ithaca, NY, and approved December 30, 2005 (received for review October 27, 2005)
| Abstract |
|---|
atoms, these results also suggest that the observed protein folds are insensitive to the details of side-chain packing. Sequence specificity enters both in fine-tuning the structure and in thermodynamically stabilizing a given fold with respect to the set of alternatives. When the models are scanned against a three-dimensional active-site library, close geometric matches are frequently found. Thus, the presence of active-site-like geometries also seems to be a consequence of the packing of compact, secondary structural elements. These results have significant implications for the evolution of protein structure and function.
evolution | Protein Data Bank | protein folding | protein structure prediction
In recent work that builds on the other studies (8, 12, 13), we suggested that the library of single-domain proteins already found in the PDB is essentially complete in the sense that single-domain PDB structures provide a set of structures from which any other single-domain protein can be modeled (9, 14). By using sensitive structural alignment algorithms that assess the structural similarity of two protein structures, even when proteins belonging to different secondary structure classes are compared (e.g., comparing α-proteins to α/β- and β-proteins), protein structures in the PDB can be found with very similar topology; i.e., the arrangement of their secondary structural elements (α-helices and/or β-strands) is similar (9). Moreover, protein structure space is extremely dense in that there are many apparently nonhomologous structures that give acceptable structural alignments to an arbitrarily selected single-domain protein. However, the structural alignment usually has unaligned regions or gaps. Starting from these alignments, state-of-the-art refinement algorithms can build full-length models that are of biological utility [with an average root-mean-square deviation (rmsd) to native of 2.3 Å for the backbone atoms] (14). Furthermore, incorrectly folded models generated by structure prediction algorithms also have structural analogues in the PDB, an observation again consistent with PDB completeness (15). Nevertheless, one might argue that comparing PDB structures against themselves as well as with structures generated using knowledge-based potentials extracted from the PDB (which retain some features of native proteins), although suggestive that the PDB is complete, does not establish that the universe of single-domain protein structures is complete; nor, even if true, does it establish the reason for such completeness.
Here, we address these issues and show the surprising result that the highly likely completeness of the PDB results from the requirement of having compact arrangements of hydrogen-bonded (H-bonded), secondary structure elements and nothing more. By studying compact homopolypeptide conformations having a typical distribution of secondary structures, we further show that the resulting library of computer-generated compact structures is found in the current PDB, and, conversely, the generated library of compact structures is complete, i.e., all compact, single-domain proteins in the PDB have a structural analogue in a rather small set of computer-generated models. These studies go significantly beyond previous work, where relatively small supersecondary structural elements are generated assuming that the protein is a homopolymer confined to a semiflexible tube that mimics H-bonding (16), to show that by using a simpler, physics-based force field, the complex topologies of single-domain proteins result. Furthermore, if we scan the set of randomly generated, compact structures against a three-dimensional active-site template library (17), close geometric matches for a considerable number of known active sites can be found. The possible implications of these results for both protein design and evolution are discussed below.
| Results |
|---|
α/β proteins, the order of α-helices and β-strands is randomly chosen, each with 50% probability.
Global Folds of Compact Homopolypeptides with Protein-Like Secondary Structures Are All in the PDB.
Collapsed, low-energy conformations of 100- and 200-residue-long, sticky homopolypeptides were generated for the reduced protein model, whereas, because of computational cost, only 100-residue homopolypeptides were considered in the detailed atomic model (18). For each chain length in the reduced protein model, a set of chains with 150 different secondary-structure assignments is simulated (50 α-, 50 α/β-, and 50 β-proteins). For the atomic model, because its H-bond scheme does not work well for β-strands, mainly α-proteins result. For both protein representations, the topologies of the generated computer models for the set of compact, homopolypeptide chains are highly divergent. Typically, the population of the largest cluster is <5% of the total number of structures, and there is minimal energetic separation between different clusters. In contrast, in a typical structure prediction on a real protein sequence, the largest cluster population is 50% (19).
We selected pairs of structurally related proteins by their TM-score, a metric of structural similarity, identified by the structural alignment program TM-ALIGN (15). Compared with the conventional rmsd between a pair of structures, the TM-score is more sensitive to the similarity in global topology of the compared structures. It is normalized so that its magnitude is independent of protein size, with a value of 0.30 and a standard deviation of 0.01, for the best structural alignment of an average pair of randomly related structures (15, 20) and a value of 1.0 for two identical protein structures.
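As a rough illustration of how a size-normalized score of this kind behaves, the sketch below evaluates the commonly published TM-score expression for a fixed set of aligned-residue distances. The formula and the distance scale d0 are quoted from the general TM-score literature rather than from this article, and the function names `tm_d0` and `tm_score` are hypothetical. A real TM-score also maximizes over superpositions, which this sketch does not attempt.

```python
import math


def tm_d0(l_target):
    """Length-dependent distance scale d0 (in Å) of the standard TM-score.

    Assumes the usual definition d0 = 1.24 * (L - 15)^(1/3) - 1.8, which is
    only positive for chains of at least 19 residues.
    """
    if l_target < 19:
        raise ValueError("d0 is not positive for chains shorter than 19 residues")
    return 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_target):
    """TM-score for ONE given superposition.

    distances: distances (Å) between aligned residue pairs.
    l_target:  length of the protein the score is normalized by.
    Unaligned residues contribute nothing, so partial coverage lowers the score.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


if __name__ == "__main__":
    # Identical structures: every aligned distance is zero.
    print(tm_score([0.0] * 100, 100))
    # Same chain, but only half of the residues aligned.
    print(tm_score([0.0] * 50, 100))
```

Because each residue pair contributes at most 1 and the sum is divided by the target length, the score stays between 0 and 1 regardless of protein size, which is the property the text relies on.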
Fig. 1 A and B show the rmsd vs. coverage plots for 100-residue-long chains of the atomic and reduced protein models, respectively, where each point represents a computer model matched with the PDB structure of the highest TM-score. TM-scores on the order of 0.45 (with a z-score of 15) are indicative of highly significant structural similarity. In all cases, the randomly generated compact structures have related folds in the PDB. The atomic models have an average rmsd of 3.9 Å to their closest structural neighbors in the PDB, 83% average coverage, and an average TM-score of 0.52 (z-score of 22). For the 100-residue-long reduced models, these numbers are 3.9 Å, 83%, and 0.51 (z-score of 21), respectively. Thus, there is no difference in average results between the atomic and reduced protein models, indicative of their robustness and invariance to model details. This similarity further indicates that the helix-length distribution in the atomic model, in particular, and most likely in general, is dictated by the balance between compactness and H-bonding. In Fig. 1 Right, we show representative examples of structures belonging to the different secondary structural classes of proteins compared with the closest PDB structure. It is evident that protein structures of quite complex topology are generated and that all have close structural matches in the PDB.
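The z-scores quoted alongside the TM-scores appear consistent with a simple standardization against the random-pair baseline stated earlier (mean 0.30, standard deviation 0.01). The short sketch below makes that arithmetic explicit; the function name `tm_z_score` is hypothetical, and the assumption that the article computes z-scores exactly this way is ours.

```python
# Baseline for the best alignment of randomly related structures,
# taken from the values quoted in the text.
RANDOM_MEAN = 0.30
RANDOM_SD = 0.01


def tm_z_score(tm):
    """Number of standard deviations a TM-score lies above the random baseline."""
    return (tm - RANDOM_MEAN) / RANDOM_SD


if __name__ == "__main__":
    for tm in (0.45, 0.51, 0.52):
        print(tm, round(tm_z_score(tm)))
```

Under this assumption, the three values printed by the loop round to the z-scores of 15, 21, and 22 reported in the text.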
|
-proteins; nevertheless, the global topology is matched, with the majority of the core region aligned. Based on our previous work, rather high-quality comparative models could be built from these alignments (14), even if one secondary-structural element is missed as can sometimes happen in the most extreme cases. It is precisely in this sense that all compact homopolypeptide structures are in the PDB. This essential point is discussed in further detail below and in Supporting Materials and Methods and Figs. 6 and 7, which are published as supporting information on the PNAS web site. Thus, the results summarized in Fig. 1 strongly suggest that the requirements to generate the complex topologies found in the PDB are inherently geometric and just involve the packing of compact structures containing H-bonded, secondary-structure elements.
Is the presence of H-bonded, secondary structures necessary to reproduce the set of single-domain protein structures found in the PDB at a reasonable level of accuracy, or is compactness alone sufficient? To examine this issue, we generated an ensemble of compact, freely jointed chains (FJC) (21) that lack both regular secondary structure and H-bonds, but that retain Cα atom-excluded volume interactions. We then performed the identical analysis as in Fig. 1. The results are summarized in Fig. 2 and are qualitatively different (see also Fig. 8, which is published as supporting information on the PNAS web site). For the resulting ensemble of compact FJC models that are 100 and 200 AA residues in length, the average TM-score is 0.30. This value is just the average TM-score of structural alignments between two randomly related structures. As shown in the typical examples of Fig. 2, the structures very poorly resemble real proteins both at the level of the global fold as well as in their local chain geometry. Thus, compactness alone does not recover protein-like topologies, nor does it generate appreciable secondary structure (22).
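For readers unfamiliar with the term, a freely jointed chain is a walk of fixed-length bonds whose directions are chosen independently and uniformly. The sketch below generates such a chain; it is a simplification that omits both the excluded-volume interactions and the collapse to a compact state used in the article. The bond length of 3.8 Å (a typical Cα–Cα virtual bond) and the function name `freely_jointed_chain` are assumptions made for illustration.

```python
import math
import random


def freely_jointed_chain(n_beads, bond_length=3.8, seed=None):
    """Return n_beads (x, y, z) points joined by bonds of fixed length.

    Each bond direction is drawn uniformly on the unit sphere, independently
    of all previous bonds. No excluded volume and no compaction are applied.
    """
    rng = random.Random(seed)
    chain = [(0.0, 0.0, 0.0)]
    for _ in range(n_beads - 1):
        # Uniform direction on the sphere: uniform z and uniform azimuth.
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(1.0 - z * z)
        x0, y0, z0 = chain[-1]
        chain.append((x0 + bond_length * r * math.cos(phi),
                      y0 + bond_length * r * math.sin(phi),
                      z0 + bond_length * z))
    return chain


if __name__ == "__main__":
    chain = freely_jointed_chain(100, seed=1)
    print(len(chain), math.dist(chain[0], chain[1]))
```

Such a chain has no preferred local geometry, which is why it carries no helices or strands unless they are imposed separately.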
|
α-proteins, 116 β-proteins, 580 α/β-proteins, and 4 proteins with little if any secondary structure. Here, we exclude proteins having irregular, extended structures by using a radius of gyration (G) cutoff, i.e., $G < 1.5\,G_0$, where $G_0$ ($= 2.2L^{0.38}$) denotes the average value of radius of gyration for a protein of length L (23). Nevertheless, a significant number of PDB structures with dangling tails remain after filtration, thereby making structure comparison with the compact, homopolypeptide library a somewhat more difficult test. As shown in Fig. 3A, if we use the set of 15,000 clustered structures generated for the 200-residue, compact, sticky homopolypeptide chains (150 proteins, each with a distinct, randomly selected pattern of secondary structure times the top 100 clusters), then the resulting library of generated compact structures is complete with respect to the PDB. In fact, single-domain proteins in the current PDB structural repertoire can be matched to the compact structure fold library with an average rmsd of 4 Å, 75% coverage, and TM-score = 0.47 (z-score of 17).
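The compactness filter described above can be sketched in a few lines. The code computes the radius of gyration of a set of coordinates and compares it with 1.5 times the length-dependent average 2.2 L^0.38 quoted in the text. The function names are hypothetical, and the choice to measure G over one point per residue in Å is an assumption, since the article does not spell out which atoms are used.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
                  for p in coords) / n
    return math.sqrt(mean_sq)


def expected_rg(length):
    """Average radius of gyration G0 = 2.2 * L^0.38 for a protein of length L."""
    return 2.2 * length ** 0.38


def is_compact(coords, factor=1.5):
    """True if G < factor * G0, with L taken as the number of points."""
    return radius_of_gyration(coords) < factor * expected_rg(len(coords))


if __name__ == "__main__":
    # A fully extended 100-residue chain (3.8 Å spacing) fails the filter.
    rod = [(3.8 * i, 0.0, 0.0) for i in range(100)]
    print(round(expected_rg(100), 2), is_compact(rod))
```

For a 100-residue protein the cutoff works out to roughly 19 Å, so a fully extended chain is rejected by a wide margin while a globular arrangement passes.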
|
virtual bonds could be constructed from the structures), we selected the 10 worst PDB-compact homopolypeptide matches on the basis of their TM-score whose value is 0.37; not surprisingly, many have dangling tails that are responsible for this relatively low TM-score. As described in Table 1 and Figs. 9–11, which are published as supporting information on the PNAS web site, these alignments cover 2/3 of the core of the protein. Full-length models can be built by using the protein structure prediction program TASSER (19, 35); the average TM-score after TASSER modeling improved to 0.62 (z-score of 32). In all but one case (again because of a dangling tail), TASSER also improved the quality of the core regions. It is in this sense that structural space is complete: The compact homopolypeptide models are buildable, and the global topology of all proteins in the PDB can be recovered by using straightforward modeling techniques to add the unaligned residues that mainly occur in the loops. The final model sometimes contains minor modifications in the core. In Fig. 3B, we reduce the size of the compact homopolypeptide library to 7,000 structures by reclustering the set of 15,000 models, a similar size to the PDB library used in Fig. 1. Now, the average rmsd is 4 Å, with 75% average coverage and a TM-score of 0.46 (z-score of 16). In Fig. 3C, we again reduce the number of models by half to 3,500 distinct structures by reclustering the 7,000 models using a smaller TM-score cutoff. Here, the average rmsd is 4.1 Å, the average coverage is 74%, and the average TM-score is 0.45 (z-score of 15). Thus, even when the structure library is reduced by half, the set of representative homopolypeptide conformations is still a complete representation of the PDB. Moreover, as indicated by the trend shown in Fig. 3, the space covered by such structures is very dense with many compact, sticky homopolypeptide structures that give acceptable structural alignments to PDB structures. In Fig. 3 Lower, we show structure alignments of representative PDB structures for the three different secondary structure classes to members of the compact, 15,000-member sticky homopolypeptide structural library. This library and the set of alignments to the PDB150 set are included in Supporting Materials and Methods.
The fact that the library of compact sticky homopolypeptide structures (that have not been subject to any evolutionary selection) is complete with respect to the PDB as well as the converse argues that both are highly likely to be complete. That is, they fully represent the set of topological arrangements of secondary-structural elements that single-domain proteins may adopt. Furthermore, structures of acceptable quality can be built by using the structural alignment as the starting conformation. This probable completeness is the result of the packing of H-bonded, secondary structure in compact proteins. This finding also explains why misfolded decoys generated by protein structure prediction algorithms are found in the PDB, because they too are just compact structures containing H-bonded, secondary-structural elements.
How can it be that such an apparently small number of compact structures is complete for single-domain protein structures, especially because we only consider 150 distinct secondary structure patterns (a number arbitrarily chosen for reasons of computational cost)? The reason is that a given structure can be the source of many different structural alignments, all of which can yield buildable, full-length protein models. The set of compact structures with randomly selected protein-like secondary structures can be thought of as a set of "basis vectors" or building blocks that span the space of single-domain folds. Because structural alignments sample an exponentially large number of possibilities (24), given a reasonable set, the ability to cover the PDB converges rather rapidly as a function of the number of disparate protein structures, a picture confirmed by Fig. 3.
Nonlocal Substructures Bearing a Close Relationship to Active-Site Geometries Are Found in the Compact, Sticky Homopolypeptide Structure Library.
Given the global similarity between single-domain proteins and the set of compact sticky homopolypeptide structures, we next examine the corresponding relationship between nonlocal substructures (local in space, but not local in sequence). Because of their biological relevance, we explored the extent to which the geometry of functionally important, nonlocal substructures is also a consequence of the packing of compact, secondary-structural elements. We first scanned 750 sticky homopolypeptide structures (150 proteins with distinct secondary structure times the top five clusters for the 200 AA models) and the same number of native structures (a nonredundant set at a 40% sequence identity cutoff), with a library of sequence-independent, active-site templates, the Automated Functional Template (AFT) library (17). Each AFT contains three to five functional residues and comprises the functional residues' Cα and Cβ atoms and the Cα atoms of the adjacent residues. The Cβ atoms partially account for the orientation of the active-site side chains. To eliminate the direct influence of evolution that would lead to trivial results, before native structures were scanned, all enzymes sharing the first two EC digits with that of the AFT under analysis were excluded.
As shown in Fig. 4, in both sets, we find substructures whose geometries are very close to those of active sites, even though we remove from consideration those native structures corresponding to enzymes functionally related to the AFT under analysis. For instance, with a tolerance of 0.5 Å in the distance rmsd (drmsd) from the restrictive cutoff (the maximum drmsd observed between a true positive hit and the corresponding AFT) (17), we detected matches for 23% of the AFTs in at least 1% of the homopolypeptide structures and matches for 31% of the AFTs in at least 1% of the native structures (see Fig. 12, which is published as supporting information on the PNAS web site). Both distributions are remarkably similar, bearing in mind that the AFTs are directly derived from very specific arrangements of functional residues in native enzyme active sites. Thus, the existence of active-site-like geometries also seems to be a consequence of the packing of compact, secondary-structural elements. They occur at a remarkably high frequency, even under conditions where there is no selection pressure to adopt such geometries. Furthermore, if we require matches with a tolerance of a 0.5-Å drmsd in at least one of 3,500 sticky homopolypeptide structures (the same set shown in Fig. 3C, which is complete with respect to the PDB), then we observe that the set is 48% complete with respect to our active-site library.
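The distance rmsd used for template matching can be illustrated with the usual textbook definition: the root-mean-square difference between corresponding interatomic distances in two equally sized point sets. The sketch below assumes that definition, which the article does not restate, and the helper `matches_template` with its parameter names is hypothetical. It simply adds the 0.5 Å tolerance to a per-template restrictive cutoff, as described in the text.

```python
import math
from itertools import combinations


def drmsd(a, b):
    """Distance rmsd between two equally sized sets of 3D points.

    Compares all pairwise intra-set distances, so no superposition is needed
    and the value is unchanged by rotating or translating either set.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two point sets of equal size (at least 2 points)")
    sq = [(math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
          for i, j in combinations(range(len(a)), 2)]
    return math.sqrt(sum(sq) / len(sq))


def matches_template(site, template, restrictive_cutoff, tolerance=0.5):
    """Hypothetical hit test: drmsd within the cutoff plus a tolerance (Å)."""
    return drmsd(site, template) <= restrictive_cutoff + tolerance


if __name__ == "__main__":
    template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in template]
    print(drmsd(template, shifted))
```

Because only internal distances enter, a candidate substructure can be scored against a template without first finding an optimal superposition, which keeps large library scans cheap.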
|
| Conclusions |
|---|
By studying the completeness of a library of compact homopolypeptides that contain a protein-like distribution of H-bonded, secondary-structural elements, we have demonstrated that the resulting set of computer-generated, compact structures can be found in the PDB. Conversely, for single-domain proteins in the PDB, even when a very small set of secondary structural elements is used (here, 150 different sequential arrangements), the resulting library is likely complete at the level of low-to-moderate resolution structures. That is, the generated structures contain the majority, if not all, of the core secondary structure elements of all compact, single-domain proteins, and structures of biological utility can be generated with simple modeling procedures that use one of these compact homopolypeptide structures as the starting template. This finding suggests that both the PDB and the compact homopolypeptide structural libraries are complete. Furthermore, it is highly likely that a necessary and sufficient condition for this completeness is the packing of compact, H-bonded secondary-structural elements. Although this conclusion might seem trivial, it is commonly believed that the complex folds adopted by proteins are the result of the fine tuning of the details of side-chain packing and are specially selected for during the course of evolution. This work suggests the contrary: the library of folds that are adopted results from relatively simple and robust considerations of the packing of compact, H-bonded secondary-structural elements. In essence, single-domain proteins are in the small chain limit: they have a relatively small number of secondary-structural elements whose random packing yields a set of structures that span the space of protein folds.
When the chains are completely flexible (i.e., lacking in secondary structure) and their number of degrees of freedom is on the order of the number of residues, this is not the case, and the resulting compact structure fold space is not complete.
Because our results suggest that the PDB has already explored the universe of compact single-domain protein folds, the target selection strategy of structural genomics (10, 31) might need to be revisited to focus either on multiple domain and multimeric proteins, where the PDB is most likely not yet complete (32), and/or on the selection of single-domain protein sequence families whose folds cannot be assigned by using state-of-the-art structure-prediction tools (33–35). Finally, we note that just as the likely completeness of the PDB at the level of global folds arises from geometric factors, the set of compact, sticky homopolypeptides contains the approximate geometry of many active sites in enzymes. Together, these results suggest a simple first-order picture of the origin and probable completeness of the folds in the PDB that is inherently geometric and that arises from the general physical chemical principles of the packing of H-bonded, secondary-structural elements in compact structures, with a remarkable richness of detail that follows from these few, simple assumptions.
| Methods |
|---|
atoms that are confined to a high coordination number lattice (19). Both models represent each side chain by a Cβ atom. Although isosteric to polyalanine, these are generic protein representations that depict the most minimal geometric features shared by all proteins and should allow us to examine the most general features underlying the origin of the set of protein folds. Additional methodological details are in Supporting Materials and Methods.
Structure Generation and Analysis.
Folding starts from a set of randomly generated, expanded states. The resulting compact structures were clustered based on their mutual structural similarity and ordered according to their population using the SPICKER structure clustering algorithm (37). The top 5, 10th, and then every 25th structure to the 200th structure were compared with a template library of 6,967 proteins that cover the PDB at a 50% pairwise sequence identity cutoff. The structural similarity of each pair of native and homopolypeptide structures was assessed by using a recently developed structural alignment algorithm, TM-ALIGN (15), which uses the TM-score (20) as the metric of structural similarity. We also report the corresponding rmsd and coverage (the fraction of aligned residues) from the best structural alignment. Additional details are in Supporting Materials and Methods and also Table 2, which is published as supporting information on the PNAS web site.
| Footnotes |
|---|
Abbreviations: AFT, Automated Functional Template; PDB, Protein Data Bank; rmsd, rms deviation; drmsd, distance rmsd.
To whom correspondence should be sent at the present address: Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, 250 14th Street NW, Atlanta, GA 30318. E-mail: skolnick{at}gatech.edu
Author contributions: Y.Z., I.A.H., A.K.A., E.S., and J.S. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
Conflict of interest statement: No conflicts declared.
This paper was submitted directly (Track II) to the PNAS office.
© 2006 by The National Academy of Sciences of the USA