Solving the protein sequence metric problem
See allHide authors and affiliations

Edited by Walter M. Fitch, University of California, Irvine, CA, and approved March 22, 2005 (received for review December 14, 2004)
Abstract
Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These highdimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helixloophelix proteins. The transformed scores offer a general solution for analyzing a wide variety of sequence analysis problems.
A major obstacle to rigorous statistical analyses of biological sequence data is the socalled “sequence metric problem,” i.e., use of alphabetic letter codes to characterize sequence elements. Letter codes lack a natural underlying metric for comparison. For example, the amino acid leucine (L) is more similar in its physiochemical properties to valine (V) than leucine is to alanine (A). However, the alphabetic “distance” between these letters in the alphabet does not reflect these relationships. Using single letters as nominal variables in sequence analyses results in a significant loss of resolution and information about physiochemical properties of amino acids when compared to interval variables.
Previous authors circumvented sequence metric problems in different ways. Some generated ad hoc quantitative indices to summarize amino acid variability (1, 2). However, ad hoc indices generally summarize only part of the total variability in amino acid attributes. If a numerical index approach is to be effective, indices must (i) represent the proximate causes of amino acid variability; (ii) reflect interpretable partitions of total amino acid variation; and (iii) resolve intercorrelations among relevant amino attributes.
More recently, researchers accepted the alphabetic character of these data and used information theory, e.g., entropy and mutual information, to describe variability and covariability among amino acid sites (310). Several authors (3, 4) carried out multivariate analyses on mutual information matrices to better understand the dimensionality and patterns that underlie multidimensional sequence data. The information theoretic approach to alphabetic data are a significant improvement over ad hoc indices; however, it still has serious shortcomings. For example, it is difficult to describe inverse (negative) relationships among sequence sites, such as those found with compensatory variation associated with amino acid charge or size (5, 6). Furthermore, this approach provides little information about the underlying causal complexity of observed intersite covariation.
Herein, we describe multivariate statistical analyses of a large number of amino acid attributes to resolve this sequence metric problem. Factor analysis is used to derive a small set of numerical values that summarize large and interpretable components of amino acid variation. This approach has many positive features and greatly facilitates statistical analysis of sequence data. Numerical scores produced in this way are of general utility and can be used in many types of analysis without modification. Our goals are to elucidate latent structure of multidimensional amino acid attribute data, describe the major patterns of interpretable covariation among these attributes, and explore some underlying causal components of multivariate attribute variation. Our approach facilitates (i) understanding dimensionality of multivariate sequence information; (ii) elucidation of multidimensional patterns of correlated amino acid attribute variation; (iii) understanding interrelationships between sequence, structural, and functional variation; (iv) standardization of data input for many different types of analyses; (v) decomposing sequence variation into its underlying evolutionary, structural, and functional components; and (vi) modeling dynamics of protein variability.
Materials and Methods
An amino acid index is a set of 20 numerical values representing different physiochemical and biological properties. The data analyzed here were obtained from an online database (AAIndex) containing 494 such amino acid indices (www.genome.ad.jp/dbget/aaindex.html). These indices include general attributes, such as molecular volume or size, hydrophobicity, and charge, as well as more specific measures, such as the amount of nonbonded energy per atom or side chain orientation angle.
Factor analysis was used to produce a subset of numerical descriptors that would summarize the entire constellation of amino acid physiochemical properties. A powerful exploratory statistical procedure, factor analysis simplifies highdimensional data by generating a smaller number of “factors” that describe the structure of highly correlated variables (1113). The resultant factors are linear functions of the original data, fewer in number than the original, and reflect clusters of covarying traits that describe the underlying or “latent structure” of the variables. Factor analysis models assume that observation i denoted x_{i} ∈ R ^{p} , can be decomposed into , where is linear, and f_{i} ∼ N_{k} (0, I_{k} ), u_{i} ∼ N_{p} (0, ψ), where ψ is diagonal, all f_{i} and u_{j} are independent, and k < p. The new set of inferred variables f_{i} are called common or latent factors, whereas u_{j} are called unique factors. Factor analysis differs from principal components analysis in that the latter does not distinguish between common and unique variance; with principal components, all u_{i} = 0.
The Λ matrix contains the factor coefficients λ _{jk} , which give the contribution of trait j to common factor k. The factor coefficient is a regression coefficient quantifying the relationship between the trait and the common factor. The communality is the proportion of the variation in the trait accounted for by the common factors. The uniqueness is the portion of variability in the trait not accounted for by the common factors. Factors are rotated to simple structure to improve their interpretation. Such rotations maximize the number of 1, 0, and +1 factor coefficients and ensure that traits with high coefficients occur on only one or a few factors. Rotation methods include orthogonal and oblique solutions. The promax algorithm for oblique simple structure rotation is used here (11). Factor analysis produces a new set of synthetic traits called factor scores that are linear combinations of the original variables. These scores are the multivariate analog of univariate attribute amino acid indices, and they position each amino acid in a multivariate spectrum of attribute variability. Factor scores are then used to transform each alphabetic letter in aligned protein sequences to biologically meaningful numerical values. These numerically transformed sequences are then used to statistically evaluate hypotheses about evolution and structure of groups of proteins. These numerical transformations of amino acids are general and can be used in a broad range of analyses.
To understand the latent structure of sequence variation, amino acid variability can be partitioned into underlying causal components (6, 14) such that V _{total} = V _{phylogeny} + V _{structure} + V _{function} + V _{interaction} + V _{stochastic}. The model includes constraints on amino acid variability due to evolutionary history, structural and functional constraints, and random events not explained by the other main effects. The interaction component is necessary to account for possible nonlinearity. Herein, we examine the relationship between pairwise distances among amino acids based on the factor scores versus amino acid substitution matrices. This examination is done to explore the relative magnitude of structural versus phylogenetic components of protein variability. Substitution matrices estimate the rate amino acids changes over evolutionary time. Hence, relationships between patterns in amino acid variability versus changes in substitution matrix elements provide information about the underlying causes of sequence variability. Numerous amino acid substitution matrices exist; the six used here are JTT (15), Gonnet (16), WAG (17), and three different BloSum (18) matrices covering a range of evolutionary divergence. The Markov transition matrix P for each substitution matrix was computed from the logodds scoring matrix and equilibrium amino acid probabilities, logarithmically transformed to a rate matrix, Q, then further reduced to the frequencyindependent amino acid interconversion rate, S, normalized such that (1/20)∑S_{ii} = 1 (19). The 190 elements of each substitution matrix were then compared with elements of five distance matrices generated from the five sets of factor scores. The extent of association between elements was determined by a productmoment correlation coefficient. A significant correlation indicates that variation in that particular attribute dimension is evolutionary in nature.
Results and Discussion
Attribute Selection. Factor analysis on the 494 attributes show that these data were highly redundant. A subset of 54 interpretable amino acid attributes was selected based on relative magnitude of factor coefficients, statistical distributional properties, relative ease of interpretation, and perceived structural importance. Factor analysis of these 54 amino acid attributes gave five “clusters,” or patterns of highly intercorrelated physiochemical variables (Table 1), a reduction in dimensionality of two orders of magnitude from the original 494 attributes. Table 5, which is published as supporting information on the PNAS web site, lists the subsets of amino acid attributes that exhibit very large coefficients for the five factors.
Factor I is bipolar (large positive and negative factor coefficients) and reflects simultaneous covariation in portion of exposed residues versus buried residues, nonbonded energy versus free energy, number of hydrogen bond donors, polarity versus nonpolarity, and hydrophobicity versus hydrophilicity (Table 1). The variable with the largest positive value was average nonbonded energy per atom computed as an average of pairwise interactions between constituent atoms (20). Another energy variable with a large negative coefficient is transfer free energy, which represents the partition coefficient of a particular amino acid between buried and accessible molar fractions (21). For simplicity, we designate Factor I as a polarity index. Table 5 shows those attributes from the original database with high pairwise product moment correlation coefficients (>0.85) with Factor I.
Factor II is a secondary structure factor. There is an inverse relationship of relative propensity for various amino acids in various secondary structural configurations, such as a coil, a turn, or a bend versus the frequency in an αhelix.
Factor III relates to molecular size or volume with high factor coefficients for bulkiness, residue volume, average volume of a buried residue, side chain volume, and molecular weight. A large negative factor coefficient occurs for normalized frequency of a lefthanded αhelix.
Factor IV reflects relative amino acid composition in various proteins, number of codons coding for an amino acid, and amino acid composition. These attributes vary inversely with refractivity (22) and heat capacity (23).
Factor V refers to electrostatic charge with high coefficients on isoelectric point and net charge. As expected, there is an inverse relationship between positive and negative charge.
Factor Scores. The predicted value for each amino acid on each factor (factor score) can be substituted for alphabetic letter codes in sequence analyses (Table 2). Thus, each amino acid letter code can be translated into five highly interpretable numerical variables. Databases of aligned sequences are converted to five separate matrices of numerical values to facilitate statistical analyses of homologous amino acids (aligned sites) that can be analyzed individually or simultaneously. Conventional statistical analyses are now possible on these transformed sequence data. A threedimensional projection of scores for Factors IIII clarifies the relationships among the 20 amino acids (Fig. 1). Conventionally, amino acid attributes are classified into functional groups. However, because of the diversity of amino acid attributes in the original database, other interpretations may exist about patterns of physiochemical variation. So the unweighted pair group method with arithmatic mean cluster analyses are shown to clarify multidimensional relationships among amino acids based on these independent sets of factor scores (Fig. 2).
Communality values, the sum of the squared factor coefficients for each physiochemical attribute, reflect the portion of the common variation in a trait explained by the five factors. Many of the traits expressed high communality values (>0.9), suggesting that they have high factor coefficients on at least one factor and that the fivefactor model is sufficient (Table 1). However, some of the 54 attributes, like relative mutability (24), do not fit this model very well and have a large unique component of their variability. No attempt was made to deliberately seek out physicochemical variables that do not fit this statistical model. Clearly, there may be others among the 494 variables in the online database with significantly high unique fractions of their variability. Such variables should be analyzed by other methods.
To explore the effect of factor rotation, an orthogonal varimax rotation was compared with the nonorthogonal promax solution by computing pairwise productmoment correlation coefficients for the factor pattern coefficients (n = 54 per factor). Correlation coefficients ranged from 0.99 between varimax (Factor I) and promax (Factor I) to 0.96 between varimax (Factor V) and promax (Factor V). The correlation among factor scores ranged from 0.99 between varimax and promax (Factor III) to 0.95 between varimax and promax (Factor IV). Thus, whether one uses an orthogonal or nonorthogonal rotation has little effect on the interpretation of patterns of coefficients.
Interfactor Correlations. Is there significant correlation among factors? Such relationships can be estimated by the eigenvector correlations or through computation of correlation coefficients between scores from each factor. Factors III and V exhibit significant intercorrelation (r = 0.43), i.e., there is a higherorder relationship between molecular size and charge in these analyses. Furthermore, correlations between factor scores suggest a significant correlation (r = 0.64, P < 0.001) between Factors III and V.
Phylogenetic Versus Structural Variation. Can we partition variability in these five multivariate patterns of physiochemical attributes into underlying causal components? This question is explored through relationships between factor scores and substitution matrices. Six different substitution matrices based on different models of evolutionary change and degrees of evolutionary relationship are examined, including JTT, Gonnet, WAG, and three versions of BloSum. Factor scores from Table 2 were used to create separate distance matrices for polarity, propensity for secondary structure, molecular size, codon diversity, and charge among the 20 amino acids. These five matrices were compared element by element to JTT, WAG, and the BloSum30/60/90 substitution matrices. A productmoment correlation was also computed to assess the strength of relationship. Significant correlation between factor scores and elements of a substitution matrix suggests that factor score patterns contain a significant evolutionary (phylogenetic) component.
Table 3 shows the correlation between the five factor scores and the substitution matrix values. Correcting for multiple tests, there is a significant correlation between Factor I scores and all substitution matrices. Thus, a strong evolutionary basis exists for a complex pattern of covariation involving the extent of amino acid accessibility, polarity, hydrophobicity, and related attributes. These relationships have been substantiated in various experimental and analytical analyses. Atchley et al. (6) described in considerable detail patterns of variability in buried hydrophobic versus accessible hydrophilic sites in the dimerization domain of basic helixloophelix (bHLH) proteins. These observed patterns were related to natural selection, evolutionary change, and phylogenetic divergence.
Factor IV, which reflects codon and amino acid diversity, exhibits a weaker but still highly significant correlation between physiochemical attributes and evolutionary change in patterns of substitution. Significant correlations are noted with Gonnet and the three BloSum matrices.
The remaining three factors indicate no significant association between physiochemical attribute variation and evolutionary patterns of amino acid substitution. Thus, variation in propensity to form various secondary structural configurations (Factor II), molecular size (Factor III), and charge (Factor V) cannot be ascribed to evolutionary divergence but rather to nonevolutionary changes in structure and function.
Applications of Factor Scores to Sequence Analysis. Let us explore DNA binding patterns in the bHLH proteins as an example of how factor scores of amino acid attributes can be used in sequence analyses. bHLH proteins bind DNA through a hexanucleotide Ebox (CANNTG). There are five groups of bHLH proteins distinguished by the interactions among 13 amino acids in the basic DNAbinding region. These five DNAbinding groups and the included proteins have been discussed in detail (5, 6, 2527). Here, we analyze some physiochemical aspects of amino acid variation in the DNAbinding region of 196 bHLH sequences from widely divergent proteins in a large number of organisms. The number of sequences in these groups, Groups A, B, C, D, and E, are 83, 72, 16, 9, and 16, respectively. It is well documented that the relative amino acid composition at sites 5, 8, 9, and 13 define these five binding groups (25). However, structural and functional aspects of these and other sites in the DNAbinding region are not well understood.
A series of univariate analyses of variance was carried out on the Factor I scores for the 13 amino acid sites in the basic region. Highly significant differences among these five groups exist for 12 of the 13 amino acid sites (Fig. 3). Only site 6 does not exhibit significant differences between the five binding groups. The factor score means for Factor I in each of the five binding groups are ranked. Means that do not differ significantly in a Bonferronicorrected test are circled in Fig. 3. Such results can help answer important questions about biological sequences, e.g., physiochemical aspects of DNA binding, evolutionary consequences of amino acid composition at certain sites at nodes in a phylogenetic tree, and the basis for amino acid changes in multiple sequence alignments.
A multigroup discriminant analysis (canonical variate analysis) (11) of these five binding groups was carried out by using the Factor Itransformed sequence data for amino acid sites 113 (Table 4). Discriminant analysis finds an optimal subset of variables that maximize separation of a priori defined groups (i.e., the DNAbinding groups). Discriminant analysis explores the question “What combination of amino acid sites within the DNAbinding region best discriminate these five groups of proteins based on traits reflecting polarity, hydrophobicity, and accessibility?” Table 4 gives the first two discriminant function vectors and accounts for 91% of the total variance. Vector 1 has large coefficients on amino acid sites 5, 8, and 13, with an inverse relationship between site 8 versus 5 and 13. Consideration of the group means shows that this vector discriminates Group A from Groups B and E. Group A binds to an CAGCTG Ebox and always has an R or K residue in amino acid site 8. Groups B and E bind to a CACGNG Ebox and have H, R, or K residues at site 5 and R at site 13. This relationship explains the large discriminant function coefficients for these sites and that site 8 varies inversely from sites 5 and 13. Vector 2 distinguishes Groups C and D and has large coefficients on sites 9, 10, and 12. bHLH proteins that directly bind DNA always have an E residue at site 9, whereas nonDNAbinding proteins lack the E. Proteins in Groups C and D do not directly bind DNA. These simple analyses focusing only on Factor I data suggest that this overall approach can be very powerful for understanding important aspects of sequence variability.
Conclusions
These results provide a method to quantify alphabetic information inherent to biological sequences. This approach produces five multidimensional patterns that summarize a large portion of the known variability among amino acids. These index or factor scores can be used to transform alphabetic letter codes for amino acids to five numerical values that are highly interpretable and that explain much of the information in the literature on amino acid attributes. The set of numerical scores provided here can be used in a wide variety of other types of analyses directed toward understanding the evolutionary, structural, and functional aspects of protein variability.
Acknowledgments
We are grateful to Michael J. Buck for contributing to preliminary stages of the analyses. This work was supported by National Institutes of Health Grant GM45344 and by funds from North Carolina State University.
Footnotes

↵ § To whom correspondence should be addressed. Email: bill{at}atchleylab.org.

Author contributions: W.R.A. designed research; J.Z., A.D.F., and T.D. performed research; T.D. contributed new reagents/analytic tools; W.R.A., J.Z., and A.D.F. analyzed data; and W.R.A. wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviation: bHLH, basic helixloophelix.
 Copyright © 2005, The National Academy of Sciences
References

↵
Grantham, R. (1974) Science 185 , 862864. pmid:4843792
 ↵

↵
Atchley, W. R. & Buck, M. J. (2005) J. Mol. Evol., in press.

↵
Atchley, W. R. & Fernandes, A. D. (2005) Proc. Natl. Acad. Sci. USA 102 , 64016406. pmid:15851686
 ↵

↵
Atchley, W. R., Wollenberg, K. R., Fitch, W. M., Terhalle, W. & Dress, A. W. (2000) Mol. Biol. Evol. 17 , 164178. pmid:10666716

Korber, B. T., Farber, R. M., Wolpert, D. H. & Lapedes, A. S. (1993) Proc. Natl. Acad. Sci. USA 90 , 71767180. pmid:8346232
 ↵

↵
Johnson, R. A. & Wichern, D. W. (2002) Applied Multivariate Statistical Analysis (Prentice Hall, Upper Saddle River, NJ).

Jolliffe, I. T. (1986) Principal Component Analysis (Springer, New York).

↵
Krzanowski, W. J. (1988) Principles of Multivariate Analysis: A User's Perspective (Clarendon, New York).

↵
Wollenberg, K. R. & Atchley, W. R. (2000) Proc. Natl. Acad. Sci. USA 97 , 32883291. pmid:10725404

↵
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992) Comput. Appl. Biosci. 8 , 275282. pmid:1633570

↵
Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992) Science 256 , 14431445. pmid:1604319

↵
Whelan, S. & Goldman, N. (2001) Mol. Biol. Evol. 18 , 691699. pmid:11319253

↵
Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89 , 1091510919. pmid:1438297

↵
Goldman, N. & Whelan, S. (2002) Mol. Biol. Evol. 19 , 18211831. pmid:12411592
 ↵
 ↵

↵
McMeekin, T. L., Groves, M. L. & Hipp, N. J. (1964) in Amino Acids and Serum Proteins, ed. Stekol, J. A. (Am. Chem. Soc., Washington, DC), pp. 54.

↵
Hutchens, J. O. (1970) in Handbook of Biochemistry, ed. Sober, H. A. (CRC, Cleveland), pp. B60B61.

↵
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. O. (Natl. Biomed. Res. Found., Washington, DC), Vol. 5, Suppl. 3, p. 352.

↵
Atchley, W. R. & Fitch, W. M. (1997) Proc. Natl. Acad. Sci. USA 94 , 51725176. pmid:9144210

Ledent, V., Paquet, O. & Vervoort, M. (2002) Genome Biol. 3 , research0030.1research0030.18. pmid:12093377

↵
Ledent, V. & Vervoort, M. (2001) Genome. Res. 11 , 754770. pmid:11337472