New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Lowdimensional, freeenergy landscapes of proteinfolding reactions by nonlinear dimensionality reduction

Communicated by James L. Kinsey, Rice University, Houston, TX, May 1, 2006 (received for review February 6, 2006)
Abstract
The definition of reaction coordinates for the characterization of a proteinfolding reaction has long been a controversial issue, even for the “simple” case in which one single freeenergy barrier separates the folded and unfolded ensemble. We propose a general approach to this problem to obtain a few collective coordinates by using nonlinear dimensionality reduction. We validate the usefulness of this method by characterizing the folding landscape associated with a coarsegrained protein model of src homology 3 as sampled by molecular dynamics simulations. The folding freeenergy landscape projected on the few relevant coordinates emerging from the dimensionality reduction can correctly identify the transitionstate ensemble of the reaction. The first embedding dimension efficiently captures the evolution of the folding process along the main folding route. These results clearly show that the proposed method can efficiently find a lowdimensional representation of a complex process such as protein folding.
The folding of a protein to its functional (native) state can be viewed as a chemical reaction, for which the ensemble of unfolded configurations constitutes the reactant and the native state is the product. Generally, the characterization of chemical reactions requires the location of the reactants, products, and transition states on a freeenergy surface. Simple models (socalled “reaction profiles” or “reaction coordinate diagrams”) are often used to describe the change in free energy as a function of the progress of the reaction from reactant to product. Clearly, a reaction profile is meaningful if the process of interest can be described in terms of one or a few collective coordinates. For instance, in a dissociation reaction in which a diatomic molecule splits into the constituent atoms, the distance between the two atoms provides a natural choice for the reaction coordinate, and the progress of the reaction can be quantitatively characterized in terms of this coordinate. For more complex reactions, the definition of a set of reaction coordinates is a nontrivial task. Because of the high dimensionality of a protein configurational space, this problem is particularly challenging and is a source of significant debate in proteinfolding studies.
Here, we present an approach to the definition of reaction coordinates for the theoretical characterization of a proteinfolding freeenergy landscape based on the idea of nonlinear dimensionality reduction. Modern dimensionality reduction techniques allow us to define a fast and efficient procedure that uses a significant sample of configurations along the folding to extract the most relevant global coordinates that can effectively describe the process. We prove the efficiency and robustness of this method by applying it to study the folding of the src homology 3 (SH3) domain, as obtained from simulation with a coarsegrained protein model (1).
The possibility of using only a few global coordinates to characterize the mechanism through which a protein “organizes” its constituent atoms into a compact functional structure has important practical implications. It is worth mentioning, for example, that a quantitative comparison between simulation and experiment in protein folding often relies on the assumption that it is possible to identify the folding transition state and/or intermediate state ensembles from the analysis of the simulated folding (and/or unfolding) trajectories. However, the definition of these ensembles is generally based on the choice of the reaction coordinates (2–4). Alternative definitions of reaction coordinates have been discussed in the literature (2, 5–8), as have different methods for the identification of a set of transitionstate structures (3, 4, 9). Most of the discussion revolves around the validity of empirical reaction coordinates that are commonly used in this endeavor. Commonly used empirical reaction coordinates in folding studies are defined to condense in a parameter the information on the degree of similarity with the native structure. Examples of such coordinates include the fraction of native contacts formed, Q (2, 5, 10–12), the average shortest path length, 〈L〉 (13, 14), the radius of gyration, R _{g} (12), or the partial contact order pCO (13, 15). The theoretical justification for the use of these structural reaction coordinates relies on the fact that, generally, proteins are minimally frustrated systems and their folding mechanism can be described as a diffusion process in a funnellike energy landscape where the potential depth is strongly correlated with the degree of nativeness (16–19). This argument is not sufficient to ensure a perfect a priori correspondence between a given ensemble of structures experimentally detected [as, for instance, the transitionstate ensemble, experimentally characterized by Φvalue analysis (20, 21)] and the corresponding ensembles obtained on a lowdimensional landscape defined through these reaction coordinates.
The parameter P _{fold}, defined as the probability of a protein structure to reach the folded state before the unfolded state, has been proposed as an ideal reaction coordinate for proteinfolding studies (2, 8, 14, 22, 23). However, the calculation of P _{fold} is computationally so expensive that it becomes unfeasible for many systems of interest. The definition of new strategies to estimate the intrinsic dimensionality of a folding reaction and the definition of the reaction coordinates themselves remain paramount issues in folding studies (5). The approach presented here addresses both of these questions.
What Is the Intrinsic Dimensionality of a Folding Landscape?
A protein conformation is usually described by the Cartesian coordinates of its constituent atoms; a protein structure with N atoms is therefore completely specified by 3N parameters. However, these parameters are not independent of each other. Clearly, the constraints of maintaining intact the covalent bonds and angles and other steric factors effectively reduce the degrees of freedom of a protein molecule. In addition, the high cooperativity of the folding process strongly suggests that the motion of different parts of the protein is correlated along the productive folding route(s), further reducing the effective dimensionality of the configurational space. These considerations lead us to assume that most of the relevant conformations visited by a protein throughout the folding process lie on a lowdimensional manifold embedded in the much higherdimensional space described by the Cartesian coordinates.
In folding/unfolding simulations, molecular dynamics trajectories provide a sampling of configurations populating the embedded manifold that we wish to characterize. Given a sample of protein configurations along the folding process, we address the problem of finding a lowdimensional embedding such that the shape of the underlying manifold is preserved. The final goal is to rigorously define a lowdimensional, freeenergy landscape that could be used to quantitatively characterize a folding simulation. The density of states populated on the manifold needs to be preserved as well so that free energy can be estimated directly from the lowdimensional embedding. In practice, the main question underlying the definition of this embedded folding landscape is whether a base set of coordinates exists in which very few show considerable variation and all of the others remain almost constant during the considered reaction; mathematically, this is a problem of dimensionality reduction. Similar problems are common in a number of disparate fields. For instance, dimensionality reduction plays an important role in image analysis and recognition, where the essential information distributed over a large number of pixels needs to be captured by few global parameters that can be quantitatively and meaningfully compared (24–27).
The definition of an embedded, folding, freeenergy landscape by dimensionalityreduction techniques can reduce the systematic error associated with the choice of empirical reaction coordinates in the calculations of ensemble averages on particular regions of the landscape (such as, for instance, transitionstate ensembles). An important feature of dimensionality reduction is that, usually, the quality of the embedding can be expressed as a function of the number of dimensions chosen, which allows one to estimate a priori the error associated with a set of reaction coordinates. Ideally, one could automatically compute an embedding that preserves, say, 99% of the features of the original data. Unlike empirical reaction coordinates, the dimensions of an embedding are completely uncorrelated so that the number of dimensions of an accurate embedding is the same as the number of dimensions of the submanifold. Minima and saddle points of a specified function of the embedding coordinates (such as free energy) can be automatically identified in an embedding, which is important if more than two or three dimensions are needed to capture the features of the original data, in which case it is not possible to visually identify the folded and unfolded minima or transition paths between them.
Dimensionality Reduction of Folding Simulations
The problem addressed by dimensionalityreduction techniques is to find the best ddimensional embedding for N objects in an ndimensional space. Ideally, the embedding is much more compact than the original representation, and dependencies between dimensions are removed. Dimensionalityreduction techniques fall broadly into two categories: linear and nonlinear. Principal component analysis (PCA) (28) is probably the best known (and widely used) linear technique. Essentially, PCA computes a hyperplane that passes through the data points as best as possible in a leastsquares sense. The principal components are the tangent vectors that describe this hyperplane. These vectors are ordered by the amount of variance they exhibit on the data. So the first principal component corresponds to the best possible projection onto a line, the first two correspond to the best possible projection onto a plane, and so on. Clearly, if the manifold of interest is inherently nonlinear, the lowdimensional embedding obtained by means of PCA is severely distorted. PCA is commonly used in the analysis of nearequilibrium fluctuations sampled by molecular dynamics simulations (29–36), because one can usually assume that the manifold of interest can be reasonably approximated by its tangent hyperplane around an equilibrium point. However, the extent of conformational changes involved in a folding process prohibits any a priori linearization of the manifold, and nonlinear techniques need to be used. The fact that empirical reaction coordinates routinely used in protein folding studies cannot be reduced to a linear combination of the Cartesian coordinates underscores the inadequacy of linear dimensionality reduction techniques to characterize a folding landscape.
The isomap Algorithm for Nonlinear Dimensionality Reduction.
Although several nonlinear dimensionality reduction techniques have been proposed [especially in the context of image analysis (37), speech recognition (38), and climate data analysis (40, 41)], the development of new methods is still an active area of research. The technique we used here for the characterization of folding landscapes is based on the recently proposed isomap algorithm (42). The basic idea of isomap is to define a lowdimensional embedding that preserves as best as possible geodesic distances between all pairs of data points in the sample under consideration (42). The geodesic distance between a pair of points that lie on a surface (or, more generally, a manifold) is defined as the length of the shortest path between them, when the path is confined to lie on the surface. In the context of the isomap algorithm, we assume that a lowdimensional embedded manifold exists, but we do not have an explicit mathematical formulation for the manifold; rather, we have a discrete set of points sampled from this manifold. Therefore, we can only approximate the geodesic distance on the manifold between a general pair of points, say x and y. The idea of isomap is to estimate the shortest possible path from x to y by “hopping” through neighboring points and adding up the short distances between them. Fig. 1illustrates this idea on a simple case of an embedding. The data points shown in Fig. 1 a lie on a twodimensional torus embedded in a threedimensional space. The application of the isomap algorithm to this set of data produces the twodimensional manifold shown in Fig. 1 b, where the network of neighboring points is also shown. A pair of points has been chosen to show the approximation of the geodesic distance between them. The red curve connecting this pair of points represents the shortest “hopping” distance within the network of neighbors, both in the original threedimensional space (Fig. 1 a) and on the resulting twodimensional embedded surface (Fig. 1 b).
This approximation for the geodesic distance holds provided that the data represents a good sampling of the embedded manifold (that is, the sampling of the manifold is sufficiently dense). If a poor data sampling is used, then connecting a point to its nearest neighbors may result in taking “shortcuts” across the surface, artificially shortening distances.
In addition, this approximation for the geodesic distance relies on the definition of neighboring points. Each point should be connected to a number of neighboring points that is not too small (to ensure the connectivity of the manifold) or too large (to minimize the occurrence of shortcuts), that is, a robust definition of “neighboring points” is required. Neighboring points can be defined around each point either by choosing a fixed number of neighbors (that is, the k closest points according to a distance measure, as for instance the least rms deviation), or by assigning a cutoff distance and considering all of the points residing within it.
In the work presented here, for each point we have defined its nearest neighbors as the k points with the smallest rms deviation to it, with k = 9. The robustness of this definition is discussed in detail in the supporting information, which is published on the PNAS web site, where several tests on the validity of the approximations used are presented.
In practice, the isomap algorithm consists of the following three steps:

First, the nearest neighboring points are computed for each point. The nearest neighbor network (or graph) is represented by the list of neighbors for each point, with their associated distances.

The second step involves computing the shortest paths between every pair of points. As explained above, the distances along these shortest paths within the neighbor network are added up to approximate the geodesic distance between the points.

Finally, a lowdimensional representation of the points (i.e., protein configurations for the case considered here) is found such that the interpoint Euclidean distances best preserve their geodesic distances.
Scalable isomap (ScIMAP) and Its Application to Large Folding Simulations.
Although the idea is very straightforward, the basic isomap algorithm as described above suffers from performance bottlenecks when used on large data sets and cannot be directly applied to the study of folding reactions. The major bottleneck when working with molecular data resides in the computation of all neighboring points, for every point, especially if the distance measure is computationally expensive. There are also problems in the computation of the shortest paths for all pairs and in the computation of the final coordinates for the embedding, because these require working with huge matrices (for which, for example, eigenvalues are needed). These bottlenecks render computationally impractical the application of the isomap algorithm to study protein simulations, where the number of conformations sampled is generally N ≫ 100,000, and the protein size yields a very expensive distance measure. We use here the basic idea of isomap as a starting point to define a procedure that is computationally scalable to very large data sets. We introduce a number of nontrivial modifications to suit the basic isomap algorithm to analyze large data sets, such as protein conformations generated in folding/unfolding trajectories. We call this modified procedure ScIMAP.
First of all, the computational bottlenecks present in isomap can be strongly reduced by using landmark points, as has been proposed in recent literature (43, 44). We designate n _{L} data points (i.e., protein configurations) to be landmarks, where n _{L} ≪ N. Rather than computing allpairs shortest paths, we just compute the shortest path from each landmark to every other point. The use of landmarks reduces the shortestpath computation time by a factor n _{L}/N. The intuition for landmarkbased isomap is that if the manifold is lowdimensional, each point can be located by considering its distance to only a small number of landmarks. In theory, if n _{L} ≥ d + 1 and the landmarks are in general position, then there are enough landmarks to locate each point. If the landmarks are chosen randomly, then n _{L} needs to be sufficiently larger than d to guarantee stability (see the supporting information for details).
Although it is more space and timeefficient than the basic version of the algorithm, landmarkbased isomap is still not practical to compute lowdimensional embeddings of large molecular trajectories (typically >100,000 conformations). To obtain a good coverage of the conformational manifold (which is, in turn, essential to ensure the validity of the geodesic approximation and to obtain accurate free energy estimates), it is necessary to compute embeddings of very large trajectories.
To make the problem computationally tractable, we can make the following observation. We expect that low freeenergy areas on a folding landscape will have a very high sampling density, resulting in a multitude of conformations. In particular, for a canonical sampling at the temperature close to the folding temperature, a significant fraction of the sampled points is expected to fall near the protein’s folded state, typically with little spatial variability. These redundant conformations, although crucial to estimate the density of states (thus to compute all thermodynamic averages), are not necessary to infer the overall shape of the manifold we are trying to recover and can be initially ignored to obtain the embedding geometry. Once the lowdimensional embedding has been found, the discarded conformations can be reinserted into place by computing their lowdimensional coordinates as a combination of the coordinates of their nearest neighbors, which is a less demanding process than solving for the global manifold shape of the original data set. The reinserted points now restore the original density of states of the trajectory around the folded state and allow for the computation of thermodynamic quantities.
The reinsertion of conformations into a precomputed lowdimensional embedding can also be used to further enrich the resolution of the landscape, for instance by adding configurations sampled at different temperatures [that can be combined in free energy calculations (45, 46)]. Moreover, the reinsertion of configurations provides a way to test the robustness of the procedure to extract the lowdimensional embedding. If some of the configurations to be reinserted are in regions where their closest neighbors are in fact far apart, the approximation used is not valid. In the application presented below, all of the configurations initially filtered out could be reinserted without experiencing such a problem.
Finally, after the problem size has been reduced as outlined above, the algorithm can be run in parallel on a cluster of computers. In particular, we adopted parallel, iterative procedures (47, 48) that efficiently compute the top m ≪ N eigenvalues of the distance matrix as required by the embedding method. The details are presented in the supporting information.
Results: Folding Landscape of SH3 as a LowDimensional Embedded Manifold
We tested ScIMAP outlined above by applying it to characterize the proteinfolding landscape obtained from simulation of a coarsegrained model of an SH3 domain. The basic ideas of the model are detailed in a recent publication (1), where a comparison of the results with experimental data also are presented. The purpose of the application presented here is not to further validate this coarsegrained protein model but rather to show how nonlinear dimensionality reduction can be used to estimate the intrinsic dimensionality of the configurational space explored in folding simulations, and to “naturally” define a set of orthogonal reaction coordinates associated to the relevant dimensions.
The performance of a dimensionalityreduction procedure can be estimated by monitoring the residual variance σ_{r}(d, n) as a function of the number of dimensions considered for the embedded manifold, d, and the dimensionality, n, of the original space. Following the definition used in ref. 42, the residual variance σ_{r}(d, n) can be computed as σ_{r}(d, n) = 1 − R ^{2}(D̂ _{d} , D _{n} ), where R(D̂ _{d} , D _{n} ) is the correlation coefficient taken over all of the entries of matrices D̂ _{d} and D _{n} . The matrix D̂ _{d} contains all of the pairwise distances as obtained on the ddimensional embedding, whereas the matrix D _{n} stores the corresponding geodesic distances as computed in the original (ndimensional) space. In the case of the SH3 folding simulations that we are considering here, the original space has dimensionality n = 3 × 57 = 171. The function σ_{r}(d, n) monotonically decreases as the number of embedding dimensions, d, increases, up to the limit value σ_{r}(n, n) = 0 when d = n. By definition, the maximum possible value of the residual variance is σ_{r}(d, n) = 1, if the distances computed on the ddimensional embedded manifold are completely uncorrelated with the geodesic distances computed in the original space. If σ_{r}(d, n) drops close to zero for small values of d ≪ n, then the space of interest can be well approximated by considering only d embedding dimensions.
Fig. 2 shows that the embedded landscape associated to the folding simulations of the coarsegrained model of SH3 has extremely low residual variance (blue points), even when only one dimension is considered. Namely, σ_{r}(1, n) ≃ 0.08, σ _{r} (2, n) ≃ 0.04, and σ_{r}(3, n) ≃ 0.02. These values give an estimate of the distortion introduced when one, two, or three embedding dimensions are used as reaction coordinates to describe the folding landscape. The small magnitude of these values is evident when they are compared with the corresponding residual variance obtained when PCA is used on the same data (red points in Fig. 2). These results support the idea that the folding landscape of SH3 can be essentially described by one reaction coordinate, in agreement with results from previous work (1, 49).
Freeenergy surfaces can be computed as a function of the embedding coordinates. Fig. 3 shows the freeenergy profile obtained when only the first dimension is used as a reaction coordinate for the folding process. These results are obtained for a temperature very close to the folding temperature T _{f}. One main barrier separates the freeenergy minima corresponding to the unfolded and folded states, as expected in a twostate folding process. On this reaction profile, the transition state can be defined as the ensemble of states with a value of the first embedding coordinate corresponding to the top of the free energy barrier. For a twostate folding process, the parameter P _{fold} provides a stringent test for the identification of the transitionstate ensemble (2, 8, 10, 14, 22, 23). Each individual configuration around the freeenergy barrier (namely, each conformation with a value of the first embedding coordinate x _{1} in the range −7 < x _{1} < 0) is labeled with a value of P _{fold} by means of a set of 100 ancillary simulations starting from it. For each small interval x _{1} ± Δx _{1} an average value of P _{fold} is computed over all conformations with a corresponding x _{1} within that interval, whereas the variance is reported as an error bar. Fig. 3 Inset shows that the range of values on the first embedding coordinate x _{1} ≃ −4 corresponding to the location of the freeenergy barrier has an associated value of P _{fold} ≃ 0.5. The red circle in Fig. 3 Inset identifies the P _{fold} values corresponding to the top of the freeenergy barrier (that is, around x _{1} ≃ −4). Remarkably, the transition state identified by means of the onedimensional free energy profile F(x _{1}) as a function of the first embedding coordinate, x _{1}, is in full agreement with the ensemble obtained by a thorough P _{fold} analysis: The top of the free energy barrier corresponds to P _{fold} ≃ 0.5. The theoretical folding probability (39) associated to the onedimensional free energy F(x _{1}) is also shown in Fig. 3 Inset. The folding probability P _{t} is in agreement with the calculated P _{fold} values on most of the interval considered, particularly at the transition state. Deviations between the average value of the calculated P _{fold} and the theoretical folding probability P _{t}(x _{1}) are observed around the folded state (x _{1} ≃ −6) and can be explained in terms of the variation of free energy along the second embedding dimension in this region (as it appears from Fig. 4, discussed below).
It is worth noting that for the protein model considered here, the P _{fold} analysis required >12,000 central processing unit (CPU) hours on an 2.2GHz Intel Xeon processor and was performed for a small subset of configurations,** and the embedding procedure was completed in <500 CPU hours (<24 CPU hours running on 20 processors) and provides information on the whole configurational space.
Not surprisingly, the transitionstate ensemble from the onedimensional embedded manifold of the SH3 model also is in good agreement with what can be obtained by using the parameter Q as an empirical reaction coordinate (data not shown). Previous studies have shown that Q is a robust reaction coordinate for some twostate folding proteins (5, 10, 49), SH3 being one of them, which may not be the case in general, however, particularly for more complex folding reactions for which more than one reaction coordinate is needed.
Additional information on the folding process is obtained when the first two embedding dimensions are considered reaction coordinates in the freeenergy calculation. Fig. 4 presents the twodimensional embedded freeenergy landscape as a function of the first two embedding dimensions.
Fig. 4 a shows a contour plot of the free energy. Again, as expected for a twostate folding protein, two distinct freeenergy minima appear: one that is more localized corresponding to the folded state and one with a larger basin corresponding to the unfolded state. The freeenergy gradient field is superimposed to the freeenergy contour plot in Fig. 4 b. The transitionstate ensemble on this twodimensional landscape can be defined by considering the “Continental Divide,” i.e., the separatrix between the basin corresponding to the folded and unfolded states. In practice, a point on the landscape is considered in the basin of a given minimum if the gradient flux starting from that point leads to the minimum. The transitionstate ensemble is then defined as all regions on the landscape where gradient fluxes leading to opposite minima meet. The transitionstate region so defined is depicted in Fig. 4 b.
It is clear from Fig. 4 that the most populated folding route (defined by the minimum freeenergy path on this landscape) closely follows the first embedding dimension. However, deviations from the main folding route are probable, because a nonnegligible amount of structures lie outside the minimum freeenergy path (≈15% of structures lie within the light orange freeenergy level on Fig. 4).
It is important to clarify that the existence of a main folding route does not mean that the folding mechanism follows a deterministic pathway where one single protein structure evolves into the next one along the pathway. Each point along this route on the lowdimensional landscape represents a large ensemble of structures that are not necessarily similar to each other. The fact that a single parameter (i.e., the first embedding dimension, in this case) captures the evolution of the folding process simply means that it is possible to define a “macroscopic” quantity condensing into a single number the common features of the ensemble of structures populated at a given stage of the folding process. The first embedding coordinate describes the evolution of this parameter from the unfolded to the folded ensembles.
Fig. 4 c presents the results from the P _{fold} analysis superimposed on the twodimensional embedded landscape. The comparison of Fig. 4 b and c reveals that the region with P _{fold} ≃ 0.5 matches the separatrix region identified by the diverging gradient fluxes. The variance of P _{fold} measured in each twodimensional interval (x _{1} ± Δx _{1}, x _{2} ± Δx _{2}) is δP _{fold} ≃ 0.12, which is significantly lower than the variance δP _{fold} ≃ 0.2 observed in the onedimensional case (see Fig. 3). The larger uncertainty obtained when only one embedding dimension is considered accounts for the fluctuations observed along the second embedding coordinate.
Conclusions
We have proposed a general procedure to obtain a lowdimensional free energy landscape associated with a simulated proteinfolding reaction. By using ScIMAP [based on isomap (42)], an embedded folding manifold is extracted from a large set (≈2,000,000) of protein conformations sampled throughout extensive folding/unfolding simulations of a coarsegrained model of SH3. The first few embedding coordinates provide a set of reaction coordinates independent of each other. The quality of the embedding can be expressed as a function of the number of dimensions considered. This feature provides an estimate of the error introduced when the first few d embedding dimensions are used as reaction coordinates to describe the simulated folding process. As a consequence, it is possible to estimate the intrinsic dimensionality of a simulated folding process.
The application of ScIMAP to the folding of a coarsegrained protein model of an SH3 domain reveals that its folding landscape is essentially onedimensional. The first embedding dimension captures the evolution of the folding process along the main folding route. However, additional features emerge when two or three dimensions are considered. For instance, the twodimensional free energy landscape as a function of the first two embedding dimensions reveals deviations around the main folding route, populated with a lower probability. The simulated folding reaction considered in this paper is known to be a twostate folding process, where no intermediate states are significantly populated. For such kind of processes, the calculation of the transition probability (or P _{fold} parameter) provides a strict a posteriori test for the “goodness” of a reaction coordinate on the identification of the transitionstate ensemble. Remarkably, a thorough P _{fold} analysis confirms that protein configurations in the transitionstate region as identified on the embedded freeenergy landscape have P _{fold} ≃ 0.5. Moreover, fluctuations around this average value of P _{fold} significantly decrease when the transitionstate region is identified on the two dimensional freeenergy landscape (defined by means of the first two embedding coordinates), with respect to a onedimensional, freeenergy landscape (where only the first embedding coordinate is used). These results validate the use of the first few embedding dimensions as optimal reaction coordinates to characterize the protein folding reaction, at least for the protein model used here.
Acknowledgments
We thank Alessandro Mossa for contributions to initial stages of this project; Peter Wolynes, Attila Stzabo, Yannis Kevrekidis, Tolya Kolomeisky, and Michele Vendruscolo for insightful comments; and members of the L.E.K. and C.C. groups for stimulating discussions. This work was supported by National Science Foundation Career Grant CHE0349303 (to C.C.), National Science Foundation Grants ITR0205671 (to L.E.K.) and CCF0523908 and CNS0454333 (to L.E.K. and C.C.), Advanced Technology Program Grant 00360400102003, Robert A. Welch Foundation Norman Hackermann Young Investigator Award and Grant C1570 (to C.C.), and the Sloan Foundation (L.E.K.). The Rice Terascale Cluster used for the calculations is supported by National Science Foundation Grant EIA0216467 and by Intel and Hewlett Packard.
Footnotes
 ^{‖}To whom correspondence may be addressed. Email: kavraki{at}rice.edu or cecilia{at}rice.edu

↵ ^{‡}Present address: Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292.

Author contributions: P.D., M.M., H.S., L.E.K., and C.C. performed research; L.E.K. and C.C. designed research; P.D., M.M., H.S., L.E.K., and C.C. analyzed data; and P.D., M.M., H.S., L.E.K., and C.C. wrote the paper.

Conflict of interest statement: No conflicts declared.

↵**The P _{fold} parameter was computed for ≈8,000 protein configurations. The total number of configurations used in the definition of the embedded freeenergy landscape is 1,818,000.
 Abbreviations:
 SH3,
 src homology 3;
 PCA,
 principal component analysis;
 ScIMAP,
 scalable isomap.
Abbreviations:
 © 2006 by The National Academy of Sciences of the USA
References

↵
 Das P. ,
 Matysiak S. ,
 Clementi C.
 ↵
 ↵

↵
 Best R. ,
 Hummer G.

↵
 Cho S. ,
 Levy Y. ,
 Wolynes P.
 ↵

↵
 Ma A. ,
 Dinner A.

↵
 Rhee Y. ,
 Pande V.
 ↵
 ↵
 ↵
 ↵

↵
 Dokholyan N. ,
 Li L. ,
 Ding F. ,
 Shakhnovich E.

↵
 Das P. ,
 Wilson C. ,
 Fossati G. ,
 WittungStafshede P. ,
 Matthews K. ,
 Clementi C.
 ↵

↵
 Nymeyer H. ,
 García A. ,
 Onuchic J.
 ↵
 ↵

↵
 Ferst A. ,
 Sato S.
 ↵
 ↵
 ↵
 ↵

↵
 Turk M. ,
 Pentland A.

↵
 Benito M. ,
 Pena D.

↵
 Cho E. ,
 Kim D. ,
 Lee S.

↵
 Jolliffe I.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Levy Y. ,
 Caflisch A.
 ↵

↵
 Weinberger K. Q. ,
 Saul L. K.
 ↵
 ↵
 ↵

↵
 Gamez A. Z. ,
 Zhou C. S. ,
 Timmermann A. ,
 Kurths J.

↵
 Tenenbaum J. ,
 de Silva V. ,
 Langford J.

↵
 de Silva V. ,
 Tenenbaum J.
 Becker S. ,
 Thrun S. ,
 Obermayer K.

↵
 de Silva V. ,
 Tenenbaum J.
 ↵
 ↵

↵
 Arnoldi W.

↵
 Saad Y.
 ↵