Sequence physical properties encode the global organization of protein structure space

Edited by Harold A. Scheraga, Cornell University, Ithaca, NY, and approved July 7, 2009 (received for review April 3, 2009)
Abstract
It is demonstrated that, properly represented, the amino acid composition of protein sequences contains the information necessary to delineate the global properties of protein structure space. A numerical representation of amino acid sequence in terms of a set of property factors is used, and the values of those property factors are averaged over individual sequences and then over sets of sequences belonging to structurally defined groups. These sequence sets then can be viewed as points in a 10-dimensional space, and the organization of that space, determined only by sequence properties, is similar at both local and global scales to that of the space of protein structures determined previously.
Evaluating the degree of structural homology between protein sequences is a significant outstanding problem in biomedical research. That this problem remains open is apparent from the persistence of interest in the “remote homolog” problem—the observation that in any reasonably large group of sequences that fold to a specified, common architecture, there will be pairs of sequences that are not related by any currently known criterion.
Accurate methods for structural homology detection depend on an understanding of the sequence code underlying fold selection. There are intriguing hints that this code may be less complex than once thought. That the average amino acid compositions of proteins can give reasonably accurate classifications of structural class (1–3) and fold family (4–12) is well known. But a classification scheme gives no information about quantitative relationships between the classes under consideration, because it reflects only local details of the underlying space of protein structures. Quantitating those relationships requires a sequence-based metric function capable of objectively measuring the distance between 2 arbitrarily selected classes.
We have delineated the organization of structure space in previous work (13–15). Those results were obtained using only structural data, with no reference to sequence information. The picture that emerged, subsequently verified by Yee and Dill (16), is that of a structure gradient in which all-helical structures are concentrated at one extreme of the space and all-sheet/barrel structures are concentrated at the other extreme, with mixed alpha/beta structures in the intervening region.
In the present work we demonstrate that when protein sequences are represented appropriately, the average amino acid properties of those sequences encode a similar picture of the global organization of protein sequence space. We demonstrate the existence of a metric function, based entirely on sequence properties, that reproduces the known characteristics of structure space.
Sequence Model
Rigorous determination of the characteristics of protein sequences requires that they be analyzed numerically, using a representation that is both complete and nonredundant. Representations that rely on arbitrarily chosen sets of physical properties of the amino acids generally are both incomplete and correlated. This problem was addressed by Kidera and coworkers (17, 18), who performed a factor analysis on all available sets of physical properties of the 20 amino acids. They demonstrated that all of these data can be represented by a set of 10 property factors, which together carry 86% of the variance of the entire property database. Therefore, to a very good approximation, an amino acid X can be represented numerically as a 10-vector,

$$ X \rightarrow \left( f_{1}(X), f_{2}(X), \ldots, f_{10}(X) \right) \qquad (1) $$

It follows that an N-residue sequence can be written as a set of 10 numerical strings of length N, each of which describes the variation of one of the property factors along the length of the protein. The property factors are linearly independent by construction, and therefore the 10 strings together give a complete, uncorrelated description of the physical properties of the sequence. The definitions of the property factors are given in supporting information (SI) Table S1.
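The encoding described above can be illustrated with a minimal sketch. The property-factor values of Table S1 are not reproduced in this text, so the table below is filled with seeded random placeholders; the names `PROPERTY_FACTORS` and `encode_sequence` are hypothetical and chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_FACTORS = 10

# Placeholder table (assumption): the real property-factor values come from
# Table S1 and are NOT these numbers. Seeded random values stand in for them.
_rng = random.Random(0)
PROPERTY_FACTORS = {
    aa: [_rng.gauss(0.0, 1.0) for _ in range(N_FACTORS)] for aa in AMINO_ACIDS
}


def encode_sequence(seq, table=PROPERTY_FACTORS):
    """Return 10 numerical strings, each of length len(seq).

    strings[m][i] is the value of property factor m at residue i.
    """
    vectors = [table[aa] for aa in seq]          # N rows of 10 values
    return [list(col) for col in zip(*vectors)]  # transpose: 10 rows of N


strings = encode_sequence("MKTAYIAKQRLE")
print(len(strings), len(strings[0]))  # number of factors, sequence length
```

Storing the result as 10 strings of length N, rather than N vectors of length 10, mirrors the description in the text, where each string tracks one property factor along the chain.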
Applying this representation requires a database of protein sequences. We have constructed a very large set of sequences taken from the CATH database (19, 20). The organization of CATH is ideal for this investigation, because domains are organized in a hierarchical fashion based (in order of increasing detail) on class, architecture, topology, and homology. In the present work, we wanted to use a comprehensive sequence/structure database that reflects the composition of the entire Protein Data Bank, rather than relying on selection criteria. A primary consideration in avoiding biased results is eliminating sequences with a high degree of similarity from the database. We therefore began with a subset of the entire CATH database, CathDomainSeqs.S35.ATOM.v3.1.020, which was selected by the CATH curators to be representative of the entire database while containing no pairs of sequences with sequence identity exceeding 35%. This value is generally considered to mark the lower limit of sequence relatedness, and thus our working database is composed entirely of sequence pairs that are in the “twilight zone.” It contains no pairs that can be considered homologs in the traditional sense.
We further adjusted the database by removing all sequences with missing residues and all sequences with fewer than 60 amino acids. We were left with a data set of 7,056 sequences known to be complete and unrelated by any standard criterion. The highest level of the CATH hierarchy consists of 4 classes. C = 1 contains all-helical structures, C = 2 contains sheet/barrel structures, and C = 3 contains mixed alpha/beta structures. The very small class C = 4 (73 sequences) contains proteins whose only common feature is a lack of regular structure, and is not considered in this work. Our final database contained 1,538 sequences with C = 1, 1,690 sequences with C = 2, and 3,755 sequences with C = 3. Sequences ranged in length from 60 to 1,146 aa, and the total number of residues in the database was 1,114,667.
The sets of sequences in which we are interested here are those characterized by common values of the 3 identifiers C, A, and T. These are sequences known to fold to similar architectures but for which no specification of sequence homology is given. We restrict our attention to those CAT classes in the database that have at least 20 members. There are 59 such classes, constituting 6% of the 980 CAT classes in the database. These contain a total of 4,319 sequences—60% of the sequences in the database. The groups included in the present study are shown in Table S2. It should be noted that this database is significantly larger than the databases used in earlier work (13, 16).
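The filtering and grouping steps described above can be sketched as follows. The input layout, a list of `(cath_code, sequence, has_missing_residues)` tuples with codes in the dotted `C.A.T.H` form, is an assumption made for this example, as is the function name `select_cat_groups`; the thresholds (60 residues, 20 members, exclusion of C = 4) are taken from the text.

```python
from collections import defaultdict


def select_cat_groups(domains, min_len=60, min_members=20):
    """Group domains by (C, A, T) and keep only well-populated groups.

    domains: iterable of (cath_code, sequence, has_missing_residues),
    a hypothetical record layout used only in this sketch.
    """
    groups = defaultdict(list)
    for cath_code, seq, has_missing in domains:
        c, a, t = (int(part) for part in cath_code.split(".")[:3])
        # Drop incomplete or short sequences, and the unstructured class C = 4.
        if has_missing or len(seq) < min_len or c == 4:
            continue
        groups[(c, a, t)].append(seq)
    return {cat: seqs for cat, seqs in groups.items() if len(seqs) >= min_members}


# Toy records; min_members is lowered to 2 so the tiny example keeps a group.
toy = [
    ("1.10.8.10", "A" * 60, False),
    ("1.10.8.20", "C" * 75, False),    # same CAT, different H
    ("2.60.40.10", "D" * 80, False),   # only one member in its CAT group
    ("1.10.8.10", "E" * 59, False),    # too short
    ("3.40.50.300", "F" * 90, True),   # missing residues
    ("4.10.10.10", "G" * 70, False),   # class 4, excluded
]
selected = select_cat_groups(toy, min_members=2)
print(sorted(selected))
```

Lowering `min_members` to a range check (10 to 19 members) would give the kind of selection used for the test set later in the paper, although the exact procedure used there is not spelled out.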
For every sequence S in the database, we can define the sequence-averaged value of the mth property factor f_{m},

$$ \langle f_{m} \rangle_{S} = \frac{1}{N_{S}} \sum_{i=1}^{N_{S}} f_{m}(X_{i}) \qquad (2) $$

where N_{S} is the number of residues in the sequence and X_{i} is the residue at position i. We can further average these quantities over the set of N_{Q} sequences that belong to some predefined set {Q},

$$ \langle f_{m} \rangle_{Q} = \frac{1}{N_{Q}} \sum_{S \in \{Q\}} \langle f_{m} \rangle_{S} \qquad (3) $$

The N_{Q} sequences in {Q} are then represented by the 10-vector of averaged property factors,

$$ \mathbf{Q} = \left( \langle f_{1} \rangle_{Q}, \langle f_{2} \rangle_{Q}, \ldots, \langle f_{10} \rangle_{Q} \right) \qquad (4) $$

We refer to this as the averaged property factor (APF) representation of the sequence class Q. It should be noted that the sequence-averaged property factors in eq. (2) are the k = 0 Fourier components of the 10 numerical strings that together represent the sequence (21, 22). This observation provides a direction for further generalization of the results, through inclusion of higher Fourier components in the analysis of sequence space.
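The two averaging steps, and the remark that the sequence average is the k = 0 Fourier component, can be checked numerically with a short sketch. The property-factor table is again a seeded random placeholder rather than the values of Table S1, the function names are hypothetical, and the 1/N normalization of the Fourier sum is an assumption chosen so that the k = 0 term equals the mean.

```python
import cmath
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_FACTORS = 10

# Placeholder property factors (assumption): stand-ins for Table S1.
_rng = random.Random(0)
TABLE = {aa: [_rng.gauss(0.0, 1.0) for _ in range(N_FACTORS)] for aa in AMINO_ACIDS}


def sequence_average(seq, table=TABLE):
    """Average each property factor over the residues of one sequence."""
    n = len(seq)
    return [sum(table[aa][m] for aa in seq) / n for m in range(N_FACTORS)]


def group_average(seqs, table=TABLE):
    """Average the per-sequence averages over a set of sequences.

    Every sequence carries equal weight, regardless of its length.
    """
    per_seq = [sequence_average(s, table) for s in seqs]
    return [sum(v[m] for v in per_seq) / len(per_seq) for m in range(N_FACTORS)]


def fourier_component(values, k):
    """k-th discrete Fourier component with an assumed 1/N normalization."""
    n = len(values)
    return sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
               for i, v in enumerate(values)) / n


group = ["ACDEFGHIK", "LMNPQRSTVWY", "ACACAC"]
apf = group_average(group)

# The k = 0 component of the factor-1 string equals its sequence average.
string_1 = [TABLE[aa][0] for aa in group[0]]
k0 = fourier_component(string_1, 0)
print(abs(k0 - sequence_average(group[0])[0]) < 1e-12)
```

Components with k > 0 would capture periodic variation of a property factor along the chain, which is the generalization the text points to.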
The 10-vectors Q_{CAT} for the 59 CAT classes can be thought of as the position vectors of these classes in 10-space. To understand the relationships between classes established by the Euclidean metric inherent in the APF representation, we need to visualize the distribution of the corresponding points. We therefore performed a principal components analysis (PCA) (23) of the 10-vectors.
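A PCA of this kind can be sketched without external libraries. The numerical method below (power iteration with deflation on the sample covariance matrix) is a choice made for this example and is not stated in the paper; the 59 input points are synthetic, seeded random vectors, not the actual APF vectors.

```python
import math
import random


def center(points):
    """Subtract the mean of each coordinate."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - mean[j] for j in range(d)] for p in points]


def covariance(centered):
    """Sample covariance matrix of already-centered points."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def top_eigenpairs(cov, k, iters=500, seed=0):
    """Leading k eigenpairs by power iteration, deflating after each one."""
    rng = random.Random(seed)
    m = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in m]
        for _ in range(iters):
            w = mat_vec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))
        pairs.append((lam, v))
        for a in range(len(m)):          # remove the component just found
            for b in range(len(m)):
                m[a][b] -= lam * v[a] * v[b]
    return pairs


# Synthetic stand-in for the 59 group vectors in 10-space (assumption).
rng = random.Random(1)
scales = [6.0, 3.0, 1.5] + [0.5] * 7
points = [[rng.gauss(0.0, s) for s in scales] for _ in range(59)]

centered = center(points)
cov = covariance(centered)
pairs = top_eigenpairs(cov, k=3)
total_variance = sum(cov[j][j] for j in range(len(cov)))
fractions = [lam / total_variance for lam, _ in pairs]
projection = [[sum(x * vj for x, vj in zip(row, v)) for _, v in pairs]
              for row in centered]
print([round(f, 3) for f in fractions])
```

The `fractions` list is the quantity reported in the paper as the share of variance carried by each eigenvector, and `projection` holds the coordinates one would plot in a three-dimensional view of the space.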
Results
The PCA results are summarized in Table 1. The first 3 eigenvectors carry a total of 67.2% of the variance of the entire data set; therefore, a low-dimensional representation of the structure space is both feasible and meaningful. Each of the principal components includes contributions from all 10 property factors. The principal components are listed in Table 2.
A projection of the APF space onto the first 3 eigenvectors of the PCA is shown in Fig. 1. It can be seen that the distribution of CAT groups, identified by structural class (i.e., by the value of the CATH classifier C), is isomorphic to that obtained from purely structural considerations, in that the all-helical and all-sheet/barrel groups occupy opposite extremes of the space, separated by alpha/beta structures. To make this observation quantitative, hyperplanes separating the regions corresponding to the 3 C classes were determined, using a minimum squared error (MSE) algorithm (23). The ability of these hyperplanes (which are defined in Table S3) to separate the classes is summarized in Table 3. It can be seen that the separation between classes, although not perfect, is very clean. The relatively few misclassified groups arise from an inability of fairly simplistic, unoptimized hyperplane classifiers to completely separate the points in the 3 regions, and the misclassifications are entirely consistent with the large-scale structure of the space. P values were calculated for the observed distributions of groups with all 3 C values, and all satisfy P < .0001. Optimization of the hyperplanes, or use of a more flexible separation function, may produce a perfect classification of the 59 CAT groups.
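As an illustration of a minimum squared error hyperplane, the sketch below fits a single two-class separator by solving the least-squares normal equations. It is not the author's procedure: the actual hyperplanes are defined in Table S3, the paper separates 3 classes rather than 2, and the data here are synthetic clusters in 3 dimensions. The function names are hypothetical.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def fit_mse_hyperplane(points, labels):
    """Weights (last entry is the bias) minimizing sum (w.x + b - y)^2, y = +/-1."""
    rows = [list(p) + [1.0] for p in points]
    d = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]
    xty = [sum(r[a] * y for r, y in zip(rows, labels)) for a in range(d)]
    return solve_linear(xtx, xty)


def side(w, p):
    """Which side of the hyperplane a point falls on (+1 or -1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, p)) + w[-1] >= 0 else -1


# Two synthetic, well-separated clusters (assumption, not APF data).
rng = random.Random(2)
pos = [[rng.gauss(3.0, 0.5), rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(20)]
neg = [[rng.gauss(-3.0, 0.5), rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(20)]
points, labels = pos + neg, [1] * 20 + [-1] * 20

w = fit_mse_hyperplane(points, labels)
accuracy = sum(side(w, p) == y for p, y in zip(points, labels)) / len(points)
print(accuracy)
```

One plausible way to extend this to 3 classes is to fit one such hyperplane per pair of classes or per class against the rest, but the text does not say which arrangement was used.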
A related question of potential interest is the predictive power of this approach. As a preliminary test, a test data set (Table S4) was constructed comprising 60 CAT groups from the original database that have between 10 and 19 members. By construction, the sequences in this data set have <35% pairwise sequence identity with each other and with the sequences in the 59-group development set. This data set is expected to be a challenging test of any classification procedure for two reasons: (i) The small size of the CAT groups makes the averages over sequence properties in eq. (3) less reliable, and (ii) the disjunction between CAT classes in the 2 data sets guarantees that the groups in the test database differ significantly from those in the development set.
Application of the MSE hyperplanes that classify the development set to the classification of the CAT groups in the test set gives an overall accuracy of 67.8% (Table 4). A more sophisticated classification was then carried out using a support vector machine (SVM), with an RBF (Gaussian) kernel, giving an overall accuracy of 81.7%. It is of interest to compare this result to other recent results on sequence classification. This comparison is complicated by 2 factors: (i) A wide spectrum of methods was used, some of which combine multiple classes of preoptimized descriptive properties whose statistical independence has not been investigated, and (ii) the training and testing sequence databases on which those studies are based differ widely in both size and difficulty.
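For readers unfamiliar with the RBF (Gaussian) kernel mentioned above, the following sketch computes the kernel and the resulting Gram matrix, which is the quantity an SVM works with in place of raw inner products. It does not train an SVM. The kernel width `gamma` is a hypothetical parameter: the value used in the paper is not reported in this text.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma=1.0):
    """Pairwise kernel values for a list of points."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Toy points in 2 dimensions, standing in for group vectors.
points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
gram = gram_matrix(points, gamma=0.5)
for row in gram:
    print([round(v, 6) for v in row])
```

Because the kernel depends only on Euclidean distance, nearby groups in APF space have kernel values close to 1 and distant groups have values close to 0, which is what allows a curved decision boundary where a single hyperplane falls short.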
A recent study comparing results on class prediction using 16 different methods found accuracies between 77% and 99.5% (24). The data sets used in that study were those of Chou (25) and of Zhou (26), containing 204–498 sequences. The present work is based on a much larger data set, and, because it is directed toward delineating the global structure of sequence space, classifies CAT groups rather than individual sequences. More importantly, the sequence descriptors are not optimized for correspondence with a preexisting classification. Nevertheless, the SVM results are consonant with previous results.
Further confirmation of the power of the APF representation comes from a complete-linkage clustering of the 59 CAT groups. This gives a set of superclusters, the members of which are CAT groups, each containing at least 20 sequences. Cluster compositions at the 7-supercluster level are given in Table 5. Almost every cluster is dominated by one value of C, indicating that the APF parameters encode information capable of distinguishing the structure classes. At the same time, the clusters straddle the borders between classes in a manner consistent with the large-scale structure of sequence space, as revealed by the PCA and shown in Fig. 1.
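Complete-linkage clustering merges, at each step, the two clusters whose most distant pair of members is closest. A minimal sketch follows, using Euclidean distance and toy 2-dimensional points in place of the APF 10-vectors; the function name and the stopping rule (a target number of clusters) are choices made for this example.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Three visually obvious groups of toy points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (21, 0)]
clusters = complete_linkage(points, n_clusters=3)
print(clusters)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Calling the same function with `n_clusters=7` on 59 group vectors would give a partition at the 7-supercluster level, assuming the same distance measure is used as in the paper.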
Discussion
It should be emphasized that our results were obtained without using structural information, and that the chemical data used did not include the actual sequences of amino acids along the chain, only sequence- and group-averaged values of the amino acid property factors. Clearly, when amino acid physical properties are appropriately represented, their averages encode not only membership in fold families, but also the global organization of protein structure space.
We have demonstrated an unexpectedly simple connection between chemical constitution and structure in proteins. We have also shown that the principal components can be used as a metric function to quantitate the differences between groups. Further explorations of the implications of this metric are underway.
Acknowledgments
I thank Professor Igor Kuznetsov for very helpful discussions. This work was supported by the National Library of Medicine of the National Institutes of Health (Grant LM06789).
Footnotes
 ^{1}Email: shalom.rackovsky{at}mssm.edu

Author contribution: S.R. designed research, performed research, analyzed data, and wrote the paper.

The author declares no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0903433106/DCSupplemental.
References
 ↵
 Nakashima H,
 Nishikawa K,
 Ooi T
 ↵
 ↵
 ↵
 ↵
 ↵
 Dubchak I,
 Muchnik I,
 Holbrook SR,
 Kim SH
 ↵
 Hobohm W,
 Sander C
 ↵
 Ding CHQ,
 Dubchak I
 ↵
 ↵
 Shen HB,
 Chou KC
 ↵
 ↵
 ↵
 ↵
 Rackovsky S
 ↵
 Walker JM
 Rackovsky S
 ↵
 ↵
 ↵
 ↵
 ↵
Available at http://cathwww.biochem.ucl.ac.uk/latest/index.html. Accessed May 9, 2007.
 ↵
 Rackovsky S
 ↵
 ↵
 Duda RO,
 Hart PE,
 Stork DG
 ↵
 ↵
 ↵