New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Expanding protein universe and its origin from the biological Big Bang

Communicated by Dudley R. Herschbach, Harvard University, Cambridge, MA (received for review July 8, 2002)
Abstract
The bottomup approach to understanding the evolution of organisms is by studying molecular evolution. With the large number of protein structures identified in the past decades, we have discovered peculiar patterns that nature imprints on protein structural space in the course of evolution. In particular, we have discovered that the universe of protein structures is organized hierarchically into a scalefree network. By understanding the cause of these patterns, we attempt to glance at the very origin of life.
It is well known that many proteins with undetectable sequence similarity—as low as expected for random sequences (8–9%)—share a similar threedimensional structure (fold) (1–4). The possibility that many dissimilar sequences fold into the same stable threedimensional structure has been demonstrated in a variety of simplified models (5, 6) and is understood on theoretical grounds. General models of protein evolution that are based solely on the requirement of protein stability reproduced observed conservation of amino acids in proteins with dissimilar sequences with reasonable accuracy (4, 7). Several authors have also pointed out that diverse functions can be carried out by proteins of the same fold (8–11). However, a striking observation is that different folds are represented to a different degree in genomes: some folds are represented by many nonhomologous and functionally diverse proteins, whereas other folds are uniquely represented by a single sequence (orphans, i.e., domains that are not structurally similar to any other domains; refs. 12–14). A possible physical or biological reason for such variability in fold representation is one of the major unsolved problems of molecular biophysics.
One suggested explanation of the observed variability in fold representation is based on the premise of convergent evolution. It is presumed that evolution has reached equilibrium in the protein sequence space. Thus, more “designable” folds that can be encoded by many sequences have a higher representation in genomes (15–17). This proposal, called the “designability principle,” is based on phenomenological considerations (18) and on observations drawn from exhaustive enumeration of all sequences in simplified two and threedimensional lattice protein models. However, the designability principle has not been successful thus far in predicting the actual structural features of known folds that render them highly designable, in contrast to less populated and orphan folds. Furthermore, the underlying assumption of equilibrium in sequence space is difficult to justify if one considers the sheer size of sequence space.
It is difficult to determine the evolutionary relation between proteins based on their sequence similarity when it is as low as that between randomly selected proteins (8–9%; ref. 4). Given the large number of sequences that correspond to a specific fold (4, 19), structure is a more robust protein characteristic than sequence. To this end, we focus on the analysis and possible genesis of the protein universe based on the structural rather than the sequential classification of proteins. Such analysis may be complicated by the fact that structural similarity is not always rigorously defined. Two popular databases, SCOP (20) and CATH (21), use a semiintuitive definition of folds that is somewhat subjective. The FSSP database—based on the DALI structure comparison algorithm (2)—defines a quantitative measure of structural similarity, the Z score. However, selection of the threshold value Z_{min} of the Z score, beyond which proteins are considered structurally similar, also introduces an element of ambiguity into FSSPbased family classification. In a recent paper, Domany and coworkers (22) provided a quantitative relationship between FSSP, CATH, and SCOP classifications. These authors noted that the matrix of pairwise Z scores can be viewed as a weighted graph, where each two proteins that have similarity Z > 2 (Z = 2 is the minimal Z score reported in FSSP) are connected by an edge that carries weight corresponding to the Z score similarity between these two proteins. Clustering algorithms developed for weighted graphs then can be used to identify fold families. However, clustering of weighted graphs is not exact as it may depend on the chosen algorithm and other factors. Another well known problem with structural classification of whole proteins presented in FSSP is socalled “floats,” where two structurally unrelated proteins having a common “promiscuous” domain are identified as structurally similar.
To overcome some of these difficulties, we employ a graph representation of the protein domain universe, in which we consider only protein domains that do not exhibit pairwise sequence similarity in excess of 25%, and each such protein domain represents a node of the graph. We use protein domains as identified by Dietmann and Holm in the FSSP database of protein domains (23). Structural similarity between each pair of protein domains is characterized by their DALI Z score (23). We define a structural similarity threshold Z_{min} and connect any two domains on our graph that have DALI Z score Z ≥ Z_{min} by an edge. Thus, we create the protein domain universe graph (PDUG). It is crucial to note that, in contrast to weighted graphs considered in (22), the PDUG is an unweighted graph where each edge that made it above threshold is considered equally. Clustering of such an unweighted graph represents its partitioning into disjoint clusters that can be carried out exactly by using the classical depthfirst search algorithm (24). Each disjoint cluster represents a family of structurally related proteins in which each protein is presented only once (Fig. 1). Disjoint PDUG clusters are, in principle, equivalent to the fold classification level of the SCOP database (20).
Although the fact that PDUG is an unweighted graph significantly simplifies its analysis, such simplification comes at a price of possible dependence of the results on the selection of threshold value Z_{min}. Thus our first goal is to evaluate how PDUG depends on selection of the threshold value Z_{min} and whether there is a preferred choice of this parameter. To this end, we study the properties of the largest cluster (giant component; ref. 25) of the PDUG as a function of the cutoff similarity score Z_{min}. As a key control throughout this work, we compare the properties of the PDUG to that of random graphs that have the same number of nodes and edges as PDUG and the same distribution of pairwise Z scores, but for which the edges are connected randomly. To generate such random graphs, we take the same number of nodes as in PDUG (3,464) and connect randomly chosen pairs by edges. This procedure is repeated until the total number of edges reaches its value (39,479) in the original PDUG (by “original PDUG,” we mean PDUG at the lowest possible threshold Z_{min} = 2, where all pairs of domains reported by DALI as structurally similar with Z ≥ 2 are connected by an edge). Next, each edge is given a weight, Z score, drawn randomly from a distribution of Z scores for edges observed in a real, original PDUG. Finally, we remove the edges of a random graph with Z < Z_{min} and identify disjoint clusters by using a depthfirst clustering algorithm.
We compute the size of the largest cluster in PDUG and random control graph as a function of Z_{min} (26). We find a pronounced transition of the size of the largest cluster in PDUG at Z_{min} = Z_{c} ≈ 9 (Fig. 2a). The random graphs feature a similar transition but at a higher value of Z_{min} = Z_{c} ≈ 11 (Fig. 2b). The distribution of cluster sizes depends significantly on whether Z_{min} > Z_{c} or Z_{min} < Z_{c} for both the PDUG and random graphs. We also find that the probability density, P(M), of cluster sizes M for both the PDUG and random graphs follows a power law at their respective Z_{c}: P(M) ∼ M^{−2.5} (Fig. 2c). The observed powerlaw behavior of P(M) is simply a consequence of criticality at Z_{c} as it is featured prominently both for the PDUG and random graphs. The powerlaw probability density of cluster sizes is a generic percolation phenomenon that has been observed and explained in both percolation (27, 28) and random graph theories (25). Recently, Gerstein and coworkers (29) reported a powerlaw distribution for fold family sizes derived from the SCOP database and attributed the observed powerlaw distribution to a certain evolutionary mechanism. However, we find that random graphs feature the same powerlaw distribution for fold family sizes and are simply explained by percolation theory (25, 27, 28).
To characterize the structural properties of the PDUG, we compute the probability 𝒫(k) of the number of edges per node k taken at Z_{min} = Z_{c} for individual clusters. It is known that 𝒫(k) distinguishes random graphs from various graphs observed in science and technology (26). In drastic contrast with the equivalent random graph, the PDUG is scalefree, with 𝒫(k) ∼ k^{−1.6} with a high degree of statistical significance (P < 10^{−8}; Fig. 3a). The powerlaw fit of 𝒫(k) is most accurate at Z_{min} ≈ Z_{c} and deteriorates noticeably above and below Z_{c}. The fit at Z_{min} > Z_{c} quickly becomes meaningless as the range of values of connectivity k rapidly diminishes as greater Z_{min} leads to mostly disconnected domains. At Z_{min} < Z_{c}, the powerlaw fit also becomes problematic in the whole range of k because at large values of k (50–100), 𝒫(k) shows some nonmonotonic behavior that can be interpreted as a maximum at large k (the data are insufficient to conclude that with certainty). However, the remarkable property of a maximum 𝒫(k) at k = 0, i.e., dominance of orphans, remains manifest at all Z_{min} values. This finding is in striking contrast with a random graph which is not scalefree at any value of Z_{min} (Fig. 3b) and where 𝒫(k) allows almost perfect Gaussian fit with maximum at higher values of k.§
The discovery of the scalefree character of the protein domain universe is striking and represents the main result of this paper. It has immediate evolutionary implications by pointing to a possible origin of all proteins from a single or a few precursor folds—a scenario akin to that of the origin of the universe from the Big Bang. An alternative scenario, whereby protein folds evolved de novo and independently, would have resulted in random PDUG (similar to the one shown in Fig. 3b) rather than that observed in the scalefree one.
The genesis of scalefree networks observed in other areas of science and technology, such as the world wide web, scientific collaboration networks, and domain combinations in proteomes (26, 30–32), has been explained by the peculiar dynamics of their creation. In particular, several dynamic models featuring “preferential” attachment have been proposed (26, 33–35). However, many models predict an exponent α ≥ 2, whereas in our case, α ≈ 1.6.
It is quite suggestive that the origin of the observed scalefree character of the PDUG lies in the evolutionary dynamics of protein fold genesis as a result of divergent evolution from one or a few precursor domains. To this end, we develop a minimalistic model that aims to explain the scalefree PDUG. Specifically, we assume, as do several other models (29, 35), that new proteins evolve as a result of an increase in the gene population primarily by means of duplication with subsequent divergence of sequences by mutations, as well as more dramatic changes such as deletions of certain parts sequences and even possible reshuffling of some structural elements (foldons; refs. 36 and 37).
Our evolutionary dynamics model starts with a single node representing an initial protein. At each time step t, a new protein is generated by means of gene duplication; hence, the total number of proteins created by time step t is exactly t. The creation of a new protein, t + 1, at time step t occurs through gene duplication of some protein k chosen at random (1 ≤ k ≤ t) from the available gene pool generated up to time t. When a protein k is chosen (node A_{k}), its offspring protein, t + 1, is generated and is represented by node A_{t}_{+1}. Importantly, our evolutionary time step is large enough to allow many mutations as well as more dramatic changes in sequences such as insertions/deletions or shuffling of structural elements to occur in the offspring protein such that sequence similarity with the parent protein is lost (4). Such mutations may or may not lead to significant structural divergence of the offspring from its parent protein because the landscape in sequence space is complex (4).
To account for that uncertainty, we assign the distance w between the parent protein and its offspring as a random number uniformly distributed in the interval 0 ≤ w ≤ 1 (Fig. 4a). This quantity may have a conceptual physical meaning of RMSD, an (inverse) Z score, or any other measure of structural similarity between two proteins. If this structural distance is below some critical value w_{max}, we assign a bond (edge of the graph) between the newly created node A_{t}_{+1} and its parent node A_{k}. Otherwise, a new structural family (fold) is created around the protein t + 1, which is an orphan at time t + 1 (Fig. 4b). If the structural divergence between a new protein t + 1 and its parent does not exceed threshold value w_{max}, so that an edge between nodes A_{t}_{+1} and A_{k} is created, we also attempt to connect a newborn A_{t}_{+1} to the set of nodes A_{i} that are themselves connected by edges to the parent node A_{k}. The reason for this step is that structural similarity may be transitive; i.e., if a newly born protein t + 1 is similar to its parent k, it might also be similar to other proteins that are themselves similar to k. To evaluate whether structural neighbors of k are similar to its offspring t + 1, we use a simple geometrical ansatz. Because w is a measure of structural divergence, we choose w to satisfy the rule of triangle, where the maximal and minimal sizes of any side of a triangle are determined by the sizes of the second and third sides. In this case, the triangle is formed by the new node A_{t}_{+1}, its parent node A_{k}, and A_{i}, a structural neighbor of A_{k} whose distance to A_{t}_{+1} we wish to evaluate. According to the ansatz, we select the distance between A_{i} to A_{t}_{+1} as a random number, w(A_{t}_{+1}, A_{i}), from the uniform distribution in the interval
w(A_{t}_{+1},A_{k}) − w(A_{k},A_{i}) ≤ w(A_{t}_{+1},A_{i}) ≤ w(A_{t}_{+1},A_{k}) + w(A_{k},A_{i}).
If w(A_{t}_{+1},A_{i}) < w_{max}, we assign a bond between A_{t}_{+1} and A_{i} (Fig. 4c).
Additionally, we account for the effects of random mutations and other sequence changes (deletions/insertions) that accumulate over coarsegrained time step t and lead to structural divergence in the protein universe (like the expanding universe in astronomy). To this end, at each time step we increase the structural distances w between all proteins by a small number D: w_{ij} → w′_{ij} = w_{ij} + D (Fig. 4d). If the new weight w′_{ij} exceeds w_{max}, we remove a bond between A_{i} and A_{j}.
We perform simulations for 3,500 maximal nodes and average over 20 runs to compute the dependence of the largest cluster size (giant component) in the generated graphs on w_{max} and find that there is a transition at w_{max} ≈ 0.75 (Fig. 4e). Hence, w_{max} ≈ 0.75 corresponds to a critical region in this model, analogous to Z_{c} ≈ 9 for the PDUG. Indeed, we find (Fig. 4f) that the probability 𝒫(k) of node connectivity follows a powerlaw up to finitesize effects at large k, and that the powerlaw exponent is ≈ 1.6, close to that in the PDUG. Importantly, the results presented in Fig. 4f are averaged over 10^{2} realizations of generated networks, although each individual realization yields a similar distribution, albeit a noisier one, so that for an individual realization, the powerlaw estimate belongs to an interval between 1.4 and 1.9. The presented model, being coarsegrained, does not aim at a detailed and specific description of protein evolution. However, it illustrates that divergent evolution is a likely scenario that leads to scalefree PDUG.
Our method of clustering protein structures provides a number of insights. First of all, using graph theory for protein structure classification removes the ambiguities that are inherent in the highly useful, albeit manual, approaches to structural classification of proteins (20, 21). Perhaps not surprisingly, we observed that the structure of the graph representing the protein domain universe depends on the Z_{min} threshold value of Z score, above which protein domains are considered structurally similar and are connected by an edge of the graph. However, at a certain critical value, Z_{min} = Z_{c}, the structure of the PDUG becomes remarkably universal, simple, and amenable to theoretical understanding from an evolutionary standpoint.
An important component of our analysis is random control where PDUG is compared with random graph. Our results show that random weighted graph having the same weight (Z score) distribution as PDUG features the same clustersize distribution. Because clusters in PDUG can be associated with foldlevel classification of protein structure, this observation suggests that nonuniform distribution of nonhomologous proteins over folds may not be due to special features of “most popular” protein folds, as suggested by some researchers (17, 18). However, that does not necessarily imply that observed protein folds are not selected based on their physical properties (8). It may well be that the divergent evolution scenario described here occurs only on these selected folds, whereas unfeasible ones are not observed in nature. However, our analysis points out that an explanation of the nonuniform distribution of nonhomologous proteins over observed folds does not require one to invoke a designability principle (17) or related conjectures about the nonuniform density of sequences in the space of protein folds (18).
We discovered that the structure of the PDUG is, by far, nonrandom, but rather represents a scalefree network featuring the powerlaw distribution of number of edges per node. The most striking qualitative aspect of the observed distribution is the much greater number of orphans compared with random graph control. Importantly, this qualitative feature remains prominent at any value of threshold Z_{min}, despite the fact that powerlaw fits of 𝒫(k) get worse when Z_{min} deviates from Z_{c}. A natural explanation of this finding is from a divergent evolution perspective. The model of divergent evolution presented here is in qualitative agreement with PDUG, as it produces large (compared with random graph) number of orphans at all values of w_{max}. Orphans are created in the model mostly through gene duplication and their subsequent divergence from precursor. This conjecture may be meaningful biologically, because duplicated genes may be under less pressure and, hence, prone to structural and functional divergence. The divergent evolution model presented here is a schematic one, as it does not consider many structural and functional details, and its assumptions about the geometry of protein domain space in which structural diffusion of proteins occurs may be simplistic. However, its success in explaining qualitative and quantitative features of PDUG supports the view that all proteins might have evolved from a few precursors.
Finally, we want to comment on the robustness of our results. Indeed, powerlaw scaling of 𝒫(k) is observed only in a range of threshold values Z_{min}; special algorithms were applied to discern such behavior. Does nature use similar algorithms and select similar thresholds? In our opinion, nature is not concerned at all with powerlaws and with algorithms in their generation. We believe that creation of functionally (and as a consequence of it, structurally) diverse proteins could have been one of the driving forces of evolution. Our motivation in this work is to “spy” on nature by using algorithms as devices to see implications of evolutionary processes in existing proteins. Selection of a particular value of threshold Z_{min} means just a choice of conditions at which these spying devices are most effective in discerning natural events from random ones.
Acknowledgments
We thank A. G. Murzin, L. Holm, and B. Dominy for helpful discussions. We are especially grateful to L. Mirny for his advice on the project and to E. Deeds for his help. We thank N. Grishin, A. Grosberg, and A. Finkelstein for their reading of the manuscript and their useful critical comments. This work is supported by National Institutes of Health Grant GM52126 (to E.I.S.) and National Institutes of Health National Research Service Award Fellowship GM20251 (to N.V.D.).
Footnotes

↵‡ To whom correspondence should be addressed. Email: dokh{at}med.unc.edu.

↵§ Because of the small number of domains with >25% pairwise sequence similarity, we included all domains in PDUG. We also tested whether this small number of homologous domains affected our 𝒫(k); we found that if we discard one of these homologs, the resulting 𝒫(k) does not change significantly.
Abbreviations

PDUG, protein domain universe graph
 Received July 8, 2002.
 Accepted August 16, 2002.
 Copyright © 2002, The National Academy of Sciences
References
 ↵
 ↵
 ↵
 ↵
 ↵
 Thirumalai D.
 ↵
 ↵

 Hasson M. S.
 ↵

 Holm L.
 ↵

 Govindarajan S.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Sedgewick R.
 ↵
 Bollob'as B.
 ↵
 ↵
 ↵
 Stauffer D.
 ↵
 ↵
 Albert R.
 ↵
 ↵
 ↵
 ↵
 Panchenko A.