# Global characteristics of protein sequences and their implications

Edited* by Harold A. Scheraga, Cornell University, Ithaca, NY, and approved March 30, 2010 (received for review February 3, 2010)

## Abstract

Computational studies of the relationships between protein sequence, structure, and folding have traditionally relied on purely local sequence representations. Here we show that global representations, on the basis of parameters that encode information about complete sequences, contain otherwise inaccessible information about the organization of sequences. By studying the spectral properties of these parameters, we demonstrate that amino acid physical properties fall into two distinct classes. One class is comprised of properties that favor sequentially localized interaction clusters. The other class is comprised of properties that favor globally distributed interactions. This observation provides a bridge between two classic models of protein folding—the collapse model and the nucleation model—and provides a basis for understanding how any degree of intermediacy between these two extremes can occur.

Bioinformatic studies of protein sequences have concentrated almost exclusively on their local properties. The relationship between local sequence properties and local folding has been extensively examined. Sequence homology studies have concentrated on developing methods for establishing local equivalences between corresponding residues in pairs of sequences. It has become increasingly clear, however, that a purely local view of protein sequences is not adequate. In a number of recent studies (1–6), we have demonstrated quantitatively that there are intrinsic limitations to the informatic power of local descriptions of protein sequence, particularly with respect to the encoding of structural information. It is clear, however, that sequence *does* completely determine protein structure, and it therefore follows that folding instructions must be encrypted in global, rather than local, sequence information. In the present work, we discuss some fundamental global properties of protein sequences and examine their implications for mechanisms of protein folding.

## Model

A necessary preliminary to any meaningful discussion of sequence characteristics is the conversion of protein sequences into a numerical form amenable to systematic analysis. We follow a procedure set forth in previous work (7–9), by using the 10 Kidera property factors (10, 11), which form an orthonormal and essentially complete basis set for the known physical properties of the amino acids, to represent an amino acid as a 10-vector. (The Kidera factors are given in Table 1.) A complete protein sequence is then represented by a set of 10 *N*-member numerical strings, each of which records the course of one property factor along the *N*-residue sequence. These strings can be Fourier transformed, leading to a representation of the sequence by a set of sine and cosine Fourier coefficients. Each of these coefficients, which is labeled by a wave number *k* and a property identifier *l*, encodes information about the entire sequence of the protein. Furthermore, the Fourier components are determined by information associated with different intrinsic length scales in the sequence: coefficients with wave number *k* contain information (7) about structural features of size ∼*N*/*k*. The Fourier decomposition of a sequence is therefore, by construction, complete and orthonormal with respect to both physical properties and wave number. This approach provides a method for systematically studying the presence in each property factor of features on specific scales and for doing so in a uniform manner in sequences of varying lengths. (This scaling is an advantage of working in *k* space. Inspection of features at a specified length scale along the sequence would, of course, involve different *k* values in sequences of different length.)
In this respect it differs from Fourier methods and other periodicity-based approaches previously proposed by a number of workers (12–18), who have used these tools to examine the role of sequence in local structure formation and particularly in studying the role of hydrophobicity in protein sequences. Recent studies (19) have used simple potentials to relate sequence and secondary structure prediction in model proteins. The present work is distinct both in the nature of the sequence representation used and in presenting a systematic study of the complete set of physical properties over a wide range of wave numbers.

In recent work (9) we studied the informatic properties of the *k* = 0 Fourier coefficient. It was demonstrated that this component contains information about protein sequences that correctly encodes the *structural* relationships between proteins. This finding is particularly remarkable in view of the fact that the *k* = 0 coefficient contains information about sequence composition but not about the actual sequential arrangement of residues along the chain. In the present work we ask how information is encoded by Fourier components of sequences with *k* > 0. We are particularly interested in relating the *k*-space properties of protein sequences to the folding mechanisms of proteins. We formulate this interest in terms of three specific questions:

1. At what values of *k* are unusually large Fourier coefficients observed, for each property factor?
2. Do the observed occurrences of large Fourier coefficients form coherent patterns?
3. What are the implications of these patterns for folding mechanisms?

Our methodology is straightforward. Because we are interested only in the magnitudes of the Fourier coefficients, we study the behavior of the sine and cosine power spectra, the elements of which are squared Fourier coefficients. It is necessary to determine whether the observed value of a power spectral element differs significantly from that one would expect at random. We determine the statistical significance of spectral magnitudes by calculating *Z* functions for the power spectra:

$$Z(k,l) = \frac{\left[a_k^{(l)}\right]^2 - \left\langle \left[a_k^{(l)}\right]^2 \right\rangle_P}{\sigma} \tag{1}$$

Here $a_k^{(l)}$ is the sine or cosine Fourier coefficient with wave number *k* for property *l* (where 1 ≤ *l* ≤ 10), the subscripted brackets $\langle\,\cdot\,\rangle_P$ denote an average over all possible permutations of the sequence, and *σ* denotes the standard deviation of the power spectral element over the ensemble of possible sequence permutations. By measuring the value of the power spectral element relative to the expected value over all sequence permutations, we determine the contribution of the specific sequence to the power spectrum, beyond that provided by sequence amino acid composition. The *Z* function for *k* > 0 differs in this respect from the *k* = 0 Fourier coefficient and provides information complementary to that obtained at *k* = 0. Determination of the *Z* functions was greatly simplified by the fact that the averages and standard deviations in Eq. **1** can be calculated analytically and exactly (8).
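The *Z* function described above (the observed power spectral element minus its average over sequence permutations, divided by the standard deviation over permutations) can be sketched numerically. The paper evaluates the permutation averages analytically and exactly (8); the sketch below substitutes a Monte Carlo estimate from random shuffles, which is an assumption made purely for illustration. The function names, the 1/√*N* normalization, the number of shuffles, and the toy property string are likewise hypothetical.

```python
import math
import random

def power_element(values, k, kind="cos"):
    """Squared sine or cosine Fourier coefficient at wave number k."""
    n = len(values)
    trig = math.cos if kind == "cos" else math.sin
    coeff = sum(v * trig(2.0 * math.pi * k * j / n)
                for j, v in enumerate(values)) / math.sqrt(n)
    return coeff * coeff

def z_score(values, k, kind="cos", n_shuffles=2000, seed=0):
    """Monte Carlo estimate of Z: (observed - permutation mean) / permutation sd.

    Random shuffles stand in for the exact analytic averages used in the paper.
    """
    rng = random.Random(seed)
    observed = power_element(values, k, kind)
    shuffled = list(values)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # shuffling preserves composition, destroys order
        samples.append(power_element(shuffled, k, kind))
    mean = sum(samples) / n_shuffles
    var = sum((s - mean) ** 2 for s in samples) / (n_shuffles - 1)
    return (observed - mean) / math.sqrt(var)

# Hypothetical property string with strong order on a length scale of N/2.
profile = ([1.0] * 10 + [-1.0] * 10) * 2
z_sin = z_score(profile, 2, kind="sin")
```

Because every shuffle keeps the composition of the string fixed, a large value of `z_sin` reflects the arrangement of values along the chain rather than their overall composition.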

We define a *signal* in the power spectrum by the equation

$$Z(k,l) \geq 1.645 \tag{2}$$

This condition is the standard criterion for a power spectral element that is larger than average at the 5% confidence level. We then seek those values of *k* and *l* at which spectral signals are observed with high frequency.

To study this question, a dataset was assembled from the CATH database (20, 21). The dataset was based on the CathDomainSeqs.S35.ATOM.v3.1.020 subset of CATH, which was constructed to be representative of the entire database while containing no sequence pairs with identity greater than 35%. The entire dataset thus lies in the “twilight zone” and contains no pairs that can be considered to be homologous in the traditional sense. This subset of CATH was edited in order to remove all sequences with missing segments or sequence uncertainties. This redaction left 7,056 sequences, each of which was subjected to the Fourier and spectral analyses described above.

The first two questions we have posed can be answered by counting the occurrences of spectral signals in the sequences of the dataset as a function of *k* and *l*. By the properties of the Fourier transform, the presence of a signal in a given sequence, at given *k* and *l*, is independent of the presence of signals in the same sequence at different values of *k* and *l*, or in other sequences. Therefore, the statistical significance of the observed number of signals at *k*, compared to the average number observed over all values of *k*, can be calculated by using Bernoulli statistics. We have determined this significance for a wide range of *k* (1 ≤ *k* ≤ 60), for each of the property factors (1 ≤ *l* ≤ 10). For some values of *k* and *l*, the observed number of signals *N*_{s}(*k*,*l*) will be significantly larger than average, and for others *N*_{s}(*k*,*l*) will be significantly smaller than average.
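The counting argument above can be sketched as a binomial tail test. In this minimal example, each sequence is treated as an independent Bernoulli trial whose success probability is the average signal rate over all *k*, and the signal count at each *k* is flagged as significantly high (+1), significantly low (-1), or neither (0). The function names, the 5% threshold, and the signal counts are hypothetical and are not taken from the paper.

```python
import math

def binomial_pmf(n, p, i):
    """P(X = i) for X ~ Binomial(n, p), evaluated in log space for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log(1.0 - p))
    return math.exp(log_pmf)

def usage_flags(counts, n_sequences, alpha=0.05):
    """Flag each count as +1 (high), -1 (low), or 0 relative to the mean rate."""
    # Average per-sequence signal probability over all wave numbers.
    p = sum(counts) / (len(counts) * n_sequences)
    flags = []
    for c in counts:
        upper = sum(binomial_pmf(n_sequences, p, i)
                    for i in range(c, n_sequences + 1))  # P(X >= c)
        lower = sum(binomial_pmf(n_sequences, p, i)
                    for i in range(0, c + 1))            # P(X <= c)
        flags.append(1 if upper < alpha else -1 if lower < alpha else 0)
    return flags

# Hypothetical signal counts at three wave numbers in 1,000 sequences.
flags = usage_flags([50, 100, 150], n_sequences=1000)
```

Working in log space with `math.lgamma` avoids overflow in the binomial coefficients when the number of sequences runs into the thousands, as it does for the dataset described above.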

## Results

In Fig. 1 we summarize the variation with *k* of the number of signals (in combined data from both sin and cos spectra). Rather than plotting *N*_{s}(*k*,*l*), we plot, as a function of *k*, the values of an auxiliary function *λ*(*k*; *l*), which takes the value -1 if signal usage is significantly lower than average, +1 if usage is significantly higher than average, and 0 if usage does not differ significantly from average. The following points are evident by inspection:

- Four of the property factors have very pronounced significance patterns, which fall into one of two classes. The first two property factors exhibit runs of high signal usage at low *k* and low signal usage at higher values of *k*. Conversely, factors 3 and 4 exhibit runs of low signal usage at low values of *k*, and property factor 4 exhibits a particularly pronounced run of high signal usage over a large range of elevated values of *k*.
- Deviations from average for the remaining six property factors occur as isolated cases, and strong patterns are not as clearly visible by inspection. There are, however, hints in these properties too of behavior corresponding to one or the other of the two classes.

In order to investigate quantitatively the accuracy of these empirical impressions, we measured pairwise distances between the signal usage patterns *λ*(*k*; *l*) by using a correlation function metric:

$$C(l,l') = \frac{\overline{\lambda(k;l)\,\lambda(k;l')} - \overline{\lambda(k;l)}\;\overline{\lambda(k;l')}}{\sqrt{\left[\overline{\lambda(k;l)^{2}} - \overline{\lambda(k;l)}^{\,2}\right]\left[\overline{\lambda(k;l')^{2}} - \overline{\lambda(k;l')}^{\,2}\right]}} \tag{3}$$

where the overbar indicates an average over all values of *k*. The matrix of correlation coefficients is a similarity matrix for the set of functions {*λ*(*k*; *l*)} and can be used as input for a message passing algorithm (22). Message passing is distinct from other clustering methods in that it does not require the presupposition of a specified number of clusters but rather determines the number of clusters present directly from the input data. We find that the 10 signal usage spectra fall into two classes, corresponding exactly to the visual impression produced by the plots in Fig. 1. Class *C*_{1} comprises those property factors that exhibit statistically elevated signal usage at low values of *k* and low usage at high *k*. Class *C*_{2} comprises those property factors that exhibit the opposite behavior.
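The similarity matrix used as clustering input can be sketched as follows, assuming that the correlation function metric is the usual Pearson correlation coefficient taken over *k*. The three signal usage patterns are hypothetical, chosen only to show patterns of the same and of opposite type; the message passing step itself is not shown.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length patterns, averaged over k."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
    var_a = sum(x * x for x in a) / n - mean_a ** 2
    var_b = sum(y * y for y in b) / n - mean_b ** 2
    return cov / math.sqrt(var_a * var_b)

def similarity_matrix(patterns):
    """Matrix of pairwise correlations between signal usage patterns."""
    return [[correlation(p, q) for q in patterns] for p in patterns]

# Hypothetical usage patterns (+1 high, 0 average, -1 low) over eight k values.
low_k_rich = [1, 1, 1, 0, 0, -1, -1, -1]
low_k_rich_2 = [1, 1, 0, 0, 0, -1, -1, -1]
high_k_rich = [-1, -1, -1, 0, 0, 1, 1, 1]

matrix = similarity_matrix([low_k_rich, low_k_rich_2, high_k_rich])
```

Patterns of the same type correlate strongly and positively, while patterns of opposite type correlate negatively, which is the structure a clustering algorithm would pick up. A pattern that is constant over *k* has zero variance and would need separate handling in this sketch.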

## Discussion

These two distinct patterns of *k* dependence correspond to distinctly different physical behaviors. As we noted above, a signal in property *l* at wave number *k* = *k*_{0} arises from the existence of physical features in the sequence on a length scale ∼*N*/*k*_{0}. It follows that large values of the properties in class *C*_{1} are distributed preferentially in a few sequentially long regions, separated by long regions in which the property value is low. Conversely, large values of the properties in *C*_{2} are preferentially distributed in many small, closely spaced regions. These contrasting behaviors are illustrated in Fig. 2. Large values of a particular property factor imply strong interactions arising from a corresponding term in the intramolecular potential energy. The two behavior types therefore correspond to different physical interaction patterns. These different interaction patterns in turn are likely to lead to different folding mechanisms. Folding governed by properties in *C*_{1} will take place under the influence of interactions strongly localized in a limited number of well-separated regions, leading to folding by a nucleation-like mechanism. Folding governed by properties in *C*_{2} will take place under the influence of interactions in small regions distributed proximally along the entire length of the chain. These regions can interact with neighboring regions, leading to delocalized folding modes (“collapse”), in much the same way that periodic nearest-neighbor interactions in solids lead to delocalized collective excitations.

With this picture in mind, we note that Kidera et al. showed (10, 11) that, of the 10 property factors, the first four carry the largest part (68%) of the variance of the dataset and that those four principal property factors are essentially single amino acid properties. Two of these principal factors occur in class *C*_{1}, the class of localized interactions—helix/bend preference and side-chain size. The remaining principal property factors—extended structure preference and hydrophobicity (10, 11)—fall in *C*_{2}, the class of collective, delocalized interactions. The latter observation is particularly intriguing, because it suggests that the formation of extended structure may occur by a collective mechanism that shares certain underlying physical features with hydrophobic collapse.

We have demonstrated the existence of two types of physical properties, with clear differences in global sequence behavior, which we suggest favor the two classic, diametrically opposite prototypical folding mechanisms—nucleation and collapse. In specific sequences, of course, multiple signals in properties from both classes may be present simultaneously, and the balance between the strengths of their corresponding interactions, and the relationships between their wave numbers, will determine the folding mechanism of the protein. This approach provides a unified framework in which an entire range of folding mechanisms can be induced. We are continuing to investigate the implications of these observations.

## Acknowledgments

This work was supported by the National Library of Medicine of the National Institutes of Health, through Grant LM06789.

## Footnotes

- ^{1}E-mail: Shalom.Rackovsky@mssm.edu.

Author contributions: S.R. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

## References

- (1) Rackovsky S
- (2) Rackovsky S
- (5) Solis AD, Rackovsky S
- (7) Rackovsky S
- (9) Rackovsky S
- (12) Eisenberg D, et al.
- (13) Xiong H, et al.
- (22) Frey BJ, Dueck D
