BIOLOGICAL SCIENCES / BIOCHEMISTRY
Computationally designed libraries of fluorescent proteins evaluated by preservation and diversity of function


Divisions of Biology and of Chemistry and Chemical Engineering, and Howard Hughes Medical Institute, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125
Contributed by Stephen L. Mayo, October 31, 2006 (received for review August 11, 2006)
| Abstract |
|---|
GFP | library design | protein design | protein engineering | high-throughput screening
Designed libraries can be synthesized for roughly the same cost as a designed sequence by recognizing the opportunities in gene synthesis for the combinatorial shuffling of sequence diversity (14–17). Although many algorithms have now been proposed to design such combinatorial libraries (7–9, 11, 12), few computationally designed libraries have been characterized experimentally (9, 18, 19), and, to our knowledge, there have been no controlled experiments comparing these methods with each other or with libraries of randomly generated sequence diversity. The results of such a comparison would be hard to predict, especially because none of these methods models protein function explicitly. Instead, these algorithms attempt to model protein stability as a surrogate for protein function on the assumption that libraries with a greater fraction of well-folded proteins are more likely to contain variants with the desired function.
Here, we evaluate seven designed combinatorial libraries of GFPs, including one with mutations picked at random. Preservation and diversity of function were judged by using distributions of brightness and color, respectively, compiled from measurements made in vivo with a monochromator-based plate reader. GFP from Aequorea victoria modified by S65T (20) (GFP-S65T) was chosen as a reference sequence for each design algorithm because this variant is less extensively engineered than other variants whose structures have been solved to similarly high resolution. Positions 57–72 were targeted for this test because they form the longest contiguous stretch of core positions in the GFP-S65T structure (21). The structure of GFP-S65T is illustrated in Fig. 1A, with the targeted positions shown in yellow. Because random core mutations are generally more disruptive than random surface mutations (22, 23), it was assumed that targeting core positions would provide better differentiation of designed libraries according to preservation and diversity of function criteria. Contiguity was imposed to allow an economical and high-fidelity cassette-based library synthesis [see supporting information (SI) Text]. Where possible, libraries were controlled for both theoretical size and the precise distribution of mutation levels within each library, because one would expect these factors to affect library quality when controlled for the same method of design.
| Results |
|---|
Preservation of Function.
For each of the designed libraries, and for the epPCR library, emission spectra were recorded for approximately 1,500 bacterial cultures expressing GFP variants. We define the brightness and color of each spectrum sampled as its integrated emission intensity and average position, respectively. Because it is not clear how best to define a functional sample, we have quantified each library's preservation of function in three ways. For each library, the percentage of samples that have at least one-half, one-tenth, and one-fiftieth the brightness of cultures expressing GFP-S65T are presented as bar graphs in Fig. 2. By all three of these measures, most of the designed libraries performed considerably better than the Random library. Only 1.6% of samples from the Random library had at least one-fiftieth the brightness of cultures expressing GFP-S65T. Although the SCMFORBIT 322 library had a larger fraction of functional samples than the Random library by this most inclusive definition of function, the SCMFORBIT 322 library had a similar fraction by the most exclusive definition. The relatively poor performance of these two libraries is probably due in part to the relatively large frequencies with which these libraries introduce ionizable side chains to the protein core.
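The brightness, color, and preservation-of-function measures defined above can be illustrated with a minimal sketch. It assumes each spectrum is sampled on an evenly spaced wavelength grid and takes "average position" to mean the intensity-weighted mean wavelength; the function names and toy numbers are illustrative only and are not taken from the original analysis, which is described in SI Text.

```python
def brightness(intensities, step_nm=1.0):
    """Integrated emission intensity (rectangle rule on an evenly spaced grid)."""
    return sum(intensities) * step_nm


def color(wavelengths_nm, intensities):
    """Average spectral position, assumed here to be the
    intensity-weighted mean wavelength."""
    total = sum(intensities)
    return sum(w * i for w, i in zip(wavelengths_nm, intensities)) / total


def preservation_of_function(sample_brightness, reference_brightness,
                             fractions=(1 / 2, 1 / 10, 1 / 50)):
    """Percentage of samples at least as bright as each fraction of the
    reference (here, cultures expressing the reference sequence)."""
    n = len(sample_brightness)
    return {
        f: 100.0 * sum(b >= f * reference_brightness for b in sample_brightness) / n
        for f in fractions
    }


# Toy data: five hypothetical cultures measured against a reference brightness of 100.
toy_brightness = [80.0, 30.0, 12.0, 3.0, 0.5]
toy_result = preservation_of_function(toy_brightness, 100.0)
```

Reporting all three thresholds side by side, rather than a single cutoff, mirrors the point made in the text that there is no obviously best definition of a functional sample.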
1% and 10% of samples being at least one-half and one-fiftieth as bright, respectively, as cultures expressing GFP-S65T. The Q69R mutation, because it introduces an ionizable side chain to the protein core, would seem to be responsible for much of the weaker performance of the MSA-based libraries, compared with the DBISORBIT and CORBIT libraries, which instead introduce the Q69L mutation. However, even if it is assumed that the Q69R mutation always disrupts function and that the Q69L mutation never disrupts function, less striking differences among these libraries must account for at least half the observed differences in performance. Multiple epPCR libraries were synthesized by using different mutation rates. Only the library that appeared to have a fraction of functional samples similar to that of the DBISORBIT library was characterized in detail to compare average mutation levels and diversity of function under this constraint. Despite the fact that random mutations are generally tolerated at surface positions better than at core positions (22, 23), the average number of nonsynonymous mutations for genes in this epPCR library was determined by sequencing to be 2.5, roughly half the average of 4.5 mutations per gene for the core-directed DBISORBIT library.
Diversity of Function. Because the dimmest samples have colors biased by emission from molecules other than GFP, here, we consider only those samples with at least one-half the brightness of cultures expressing GFP-S65T. Of the 11,575 spectra sampled, 701 met this criterion. The redmost and bluemost of these spectra are illustrated in Fig. 1B.
The diversity of function for a library of fluorescent proteins may be associated with either its extremes of color or its dispersion of color. The former we define as the difference between the positions of the redmost and bluemost spectra in a library. Fig. 3 illustrates the set of colors sampled for each library with black marks, such that the separation between leftmost and rightmost marks illustrates a library's performance according to this extremes-of-function metric. Dispersion of function we define as the difference between the positions of the spectra that lie one quartile above and below the median for a library. In Fig. 3, this median is illustrated with a white bar on top of a red box that illustrates the positions of the first and third quartiles.
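Both diversity metrics reduce to simple order statistics on the colors of the bright samples in a library. The sketch below is one way to compute them; the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, because the text does not specify how quartiles were interpolated, and the function names are hypothetical.

```python
import statistics


def extremes_of_function(colors_nm):
    """Separation between the redmost and bluemost spectra in a library."""
    return max(colors_nm) - min(colors_nm)


def dispersion_of_function(colors_nm):
    """Difference between the third and first quartiles of color
    (interquartile range); the quartile convention is an assumption."""
    q1, _median, q3 = statistics.quantiles(colors_nm, n=4, method="inclusive")
    return q3 - q1


# Toy colors (nm) for the bright samples of one hypothetical library.
toy_colors = [508.0, 510.0, 511.0, 512.0, 520.0]
toy_extremes = extremes_of_function(toy_colors)
toy_dispersion = dispersion_of_function(toy_colors)
```

In this toy library a single red-shifted sample dominates the extremes metric while barely affecting the dispersion metric, which is why the two are reported separately.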
A complementary illustration of the preservation and diversity of function sampled from each library is provided as SI Fig. 5. For each library, the width of each spectrum sampled is plotted against its color with a circle of area proportional to its brightness. Although SI Fig. 5 does not characterize the libraries with the statistical rigor of Figs. 2 and 3, it does provide additional support for the clustering and ranking of the designed libraries described above. It also reveals a striking correlation between emission line shape and emission color among the brightest samples in each library. We have investigated the physical mechanisms that may be responsible for this trend with additional measurements (T.P.T., C.L.V., M. A. Mena, D.N., B. D. Olafson, P. S. Daugherty, and S.L.M., unpublished work).
| Discussion |
|---|
12 different amino acids per position to preserve function as well as the DBISORBIT and CORBIT libraries. Finding any two core positions in GFP-S65T that could accept such great diversity, let alone two between positions 57 and 72, would seem to be an especially difficult problem. Fig. 3 illustrates that diversity of function tends to increase with preservation of function among the seven designed libraries. This result justifies an approach to library design in which protein stability is modeled as a surrogate for protein function (7–9, 11, 12), as long as mutations are directed toward positions likely to perturb function. Moreover, this result suggests that improvements in modeling protein stability should yield designed libraries that sample a wider array of protein functions.
A frequently desired trait among GFP variants has been red-shifted emission (29, 32, 33). Although the vast majority of the bright variants sampled from the epPCR library have emission spectra nearly identical to cultures expressing GFP-S65T, the one sample from this library with a substantial red-shift did have the redmost spectrum sampled in our test. The corresponding GFP gene was sequenced and was determined to have the V224I and M233K mutations. Only the V224I mutation is in the core of the protein and close to the chromophore, suggesting that it is primarily responsible for the observed red shift. The fact that neither of these mutations involves the positions targeted in the test underscores the way the performance of a designed library is intrinsically limited by the quality of the information in the design, such as the choice of positions targeted for mutation. Nevertheless, the far greater number of almost identically red-shifted samples from the DBISORBIT and CORBIT libraries indicates that our best information at present is a valuable tool with which to complement epPCR for sampling diverse functions.
Even though red-shifted emission is frequently desired for GFPs, other measures described here may be more relevant to the extrapolation of these results to other protein engineering projects. Such projects typically aim to increase the stability of an enzyme, its rate of catalysis, or the affinity of a protein for a ligand (28, 34). Because denatured GFP does not fluoresce (35), one interpretation of Fig. 2 is that the algorithms that preserved function best did so by disrupting the global structure of GFP the least. According to this interpretation, we would predict that the algorithms used to design the DBISORBIT and CORBIT libraries would also perform best when attempting to stabilize an enzyme with core-directed mutations. However, the relative performance of the MSA-based methods might be expected to increase in this case if the covariances among amino acid frequencies important for protein stability can be extracted from evolutionary noise (13, 36, 37).
The emission spectrum of GFP is a reporter on the local structure of its chromophore. In other words, a more varied sampling of spectral properties is equivalent to a more varied sampling of structures at the "active site" of GFP. Thus, based on Figs. 2 and 3, we can predict that the algorithms used to design the DBISORBIT and CORBIT libraries will provide the most diverse sampling of active-site structures in functional enzymes. Structure-based computational methods should thus prove especially useful for relatively low-throughput screening projects in which libraries made by epPCR, even those with low mutation rates, cannot be screened thoroughly.
Binding between a protein and its ligand might also be improved most efficiently by sampling with the greatest frequency those perturbations to the structure of the binding interface that do not completely disrupt the global structure of the complex. Thus, if a structure of the bound complex is available, in this case, too, we would recommend using structure-based computational methods of library design to suggest a small number of mutations at each of many buried positions in or near the binding interface. However, if binding to a novel ligand is desired, it may be necessary to disrupt the structure of the protein more significantly than when improvements in binding to a known ligand are desired. In this case, the kinds of mutations suggested by these algorithms may be overly conservative, especially if the new ligand has a different charge. Because selections for protein binding frequently have much greater throughput than the plate reader-based screen we have implemented here, it is worth noting that the DBIS algorithm can be used to design libraries of practically any size.
In summary, we have shown that small combinatorial libraries can exhibit considerable diversity of function if designed well. Based on the design and results of this test, we recommend complementing more widely used strategies for generating functional diversity, such as epPCR and combinatorial saturation mutagenesis, with a strategy that defines a combinatorial library by a single conservative mutation at each of many positions close to a protein's active site. We have found structural information as used by the DBIS algorithm or the method of Hayes et al. (9) to be more successful than limited evolutionary information in identifying compatible conservative mutations. Although currently limited by the need for an accurate structure, the utility of the structure-based design algorithms should improve as methods improve for docking ligands onto proteins and for determining protein structures from protein sequences. Indeed the great promise of these methods for library design is that they might be used to implement a knowledge-based approach to engineering totally novel functions for which no natural protein exhibits even the slightest glimmer of the desired function. In the meantime, this approach to protein engineering should prove especially useful for investigations of protein structure–function relationships (T.P.T., C.L.V., M. A. Mena, D.N., B. D. Olafson, P. S. Daugherty, and S.L.M., unpublished work), where, ideally, large numbers of differently functional variants would be related by the same small set of mutations.
| Methods |
|---|
Fig. 4 illustrates the main components of the generalized DBIS algorithm. A symmetric matrix of rotamer singles and pairs energies is first calculated by using a template structure and rotamer library (13). This rotameric representation of the sequence design problem is then projected onto a smaller matrix with one row and one column for each combination of amino acid and targeted position (see below). These amino acid singles and pairs energies are then combined to build the set-based representation of the combinatorial library design problem by filling a matrix with one row and one column for each set of amino acids considered at each position in the library design. The number of these sets can be reduced from the 2^20 − 1 unique sets of 1 to 20 amino acids any number of ways: here, we have imposed both a set size constraint to limit sets to specific numbers of amino acids and a genetic code constraint to limit even these sets to those combinations of amino acids that can be introduced with degenerate codons during primer synthesis. To impose a composition constraint, such that the composition of the library is biased toward the inclusion or exclusion of a specific sequence (e.g., the wild-type sequence), we have applied benefits to some amino acid singles energies. Lastly, a diversity benefit that increases with set size is introduced to the set singles energies to favor larger sets over smaller sets during optimization.
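The set-size and genetic-code constraints can be illustrated by enumerating the amino acid sets that a single degenerate codon can encode. The sketch below is a hedged illustration and not the authors' implementation: it assumes the standard genetic code, allows any nonempty mixture of the four bases at each codon position, and discards degenerate codons that encode a stop. All names are hypothetical.

```python
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Every nonempty mixture of bases at one codon position (15 options).
MIXTURES = [m for n in range(1, 5) for m in combinations(BASES, n)]

# Number of unique nonempty sets of the 20 amino acids, before any constraint.
N_UNCONSTRAINED_SETS = 2 ** 20 - 1


def encodable_sets(allowed_sizes):
    """Amino acid sets of the allowed sizes that one degenerate codon can
    encode. Assumption: codons that also encode a stop are excluded."""
    found = set()
    for first, second, third in product(MIXTURES, repeat=3):
        encoded = frozenset(
            CODON_TABLE[a + b + c] for a in first for b in second for c in third
        )
        if "*" not in encoded and len(encoded) in allowed_sizes:
            found.add(encoded)
    return found


one_or_two = encodable_sets({1, 2})
```

Under these assumptions the Q/L and Q/R pairs discussed in Results are both encodable (for example with the codon mixtures C-[T/A]-A and C-[A/G]-A), whereas a pair such as W/D is not. The size of `one_or_two` can be compared with the number of one- or two-membered sets reported in Library Design Methods.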
|
|
|
where Erot(ir) and Erot(ir, jcurrent) are rotamer singles and pairs energies, respectively, and jcurrent is the rotamer defined by the amino acid at position j in the template structure. Within the set of rotamers r at position i corresponding to amino acid a, ir ∈ ia, the rotamer that minimizes Epm(ir) is represented as imin,a. If there exists some ir ∈ ia that has survived a previous rotamer pruning step (see SI Text), the amino acid singles energy for amino acid a at position i, Eaa(ia), is then set equal to
|
|
where the composition benefit Ecomp(ia) has a user-defined value that biases optimization toward or away from libraries that include amino acid a at position i. Otherwise Eaa(ia) is set equal to the cutoff value used to prune rotamers, 20 kcal/mol, such that these amino acids are effectively eliminated from the calculation; a value similar to some of the better rotamer singles energies could conceivably improve library design for some applications by complementing the conservative nature of our structure-based method with a desired degree of randomness. Assignment of the amino acid energies in this manner effectively prunes the rotamers in the calculation to no more than one rotamer per amino acid per position. In SI Text, we show that high-scoring sequences in core design tend to use a very small subset of rotamers and that minimizing Epm(ir) is an effective way to identify this subset.
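The projection from rotamers to amino acids described above amounts to keeping, for each amino acid at a position, the single surviving rotamer with the lowest Epm. The sketch below shows only that selection step. It takes the Epm values and the set of rotamers that survived pruning as given inputs, and it does not compute the singles or pairs energies themselves; the data layout and names are assumptions made for illustration.

```python
CUTOFF_KCAL_PER_MOL = 20.0  # pruning cutoff quoted in the text


def representative_rotamers(e_pm, survived, amino_acids):
    """For one position, map each amino acid to the surviving rotamer that
    minimizes Epm, or to None if no rotamer of that amino acid survived.

    e_pm     : dict mapping (amino_acid, rotamer_id) -> Epm value
    survived : set of (amino_acid, rotamer_id) pairs that passed pruning
    """
    best = {}
    for aa in amino_acids:
        candidates = [
            (energy, rot)
            for (a, rot), energy in e_pm.items()
            if a == aa and (a, rot) in survived
        ]
        # Amino acids mapped to None would receive the cutoff energy and are
        # thereby effectively eliminated from the calculation.
        best[aa] = min(candidates, key=lambda c: c[0])[1] if candidates else None
    return best


# Toy example with hypothetical energies (kcal/mol) at one position.
toy_e_pm = {("L", 0): -3.2, ("L", 1): -5.1, ("Q", 0): 4.0, ("R", 0): 35.0}
toy_survived = {("L", 0), ("L", 1), ("Q", 0)}
toy_best = representative_rotamers(toy_e_pm, toy_survived, ["L", "Q", "R"])
```

After this step there is at most one rotamer per amino acid per position, which is what allows the much smaller amino acid matrix to stand in for the full rotamer matrix.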
If there exists some ir ∈ ia and some js ∈ jb that have survived the rotamer pruning step, the amino acid pairs energy, Eaa(ia, jb), is then set equal to
|
|
Otherwise Eaa(ia, jb) is set equal to the cutoff value used to prune rotamers, 20 kcal/mol, such that these amino acids are effectively eliminated from the calculation; a value similar to some of the better rotamer pairs energies could conceivably improve library design for some applications by complementing the conservative nature of our structure-based method with a desired degree of randomness.
For the set of amino acids a represented by x, a set singles energy, Eset(ix), is calculated at each position i as
|
|
where Nx is the number of amino acids in set x, and L is a factor used to control the size of the optimal library. We refer to the second term in this equation as a diversity benefit and to L as a diversity benefit scale factor. Faced with two libraries of the same size, the logarithmic form of the diversity benefit will tend to favor the one with sequence diversity distributed over a greater number of positions. A quadratic form would have the opposite effect and may be more desirable, depending on one's application. Of course, the functional form for the diversity benefit is inconsequential when only two set sizes are considered in a design, as was the case in designing the DBISORBIT and DBISORBIT 44 libraries (see below). For sets x and y at positions i and j, the set pairs energy is then calculated as
|
|
The composition of the optimal combinatorial library was thus defined by the optimal combination of these set singles and pairs energies. In designing the DBISORBIT and DBISORBIT 44 libraries, we first imposed Ecomp(ia) = 0 at all positions; if the GMEL for the value of L that gives the desired library size did not include the GFP-S65T sequence, we iteratively altered Ecomp(ia) in 5 kcal/mol increments for the missing GFP-S65T residues until this sequence was recovered in the designed library.
Library Design Methods. Composition, set-size, and genetic-code constraints were enforced for all tested design algorithms to facilitate comparisons among them. The genetic-code constraint allowed each library to be constructed at minimal cost and effectively applied some of the physicochemical information that may exist in the genetic code to the process of design (it is notable that there were large differences in performance among libraries, although each shared this constraint). Relaxing the genetic-code constraint would change the composition of each designed library substantially and could alter the observed performance ranking.
One set of rotamer singles and pairs energies (calculated as described in SI Text) was used in four different ways to design the DBISORBIT, DBISORBIT 44, CORBIT, and SCMFORBIT 322 libraries. In order for the DBIS algorithm to yield a library of 2^9 sequences that included GFP-S65T, all values of Ecomp(ia) were set equal to 0, except Ecomp(63T) = 10 kcal/mol, and Ecomp(69Q) = 5 kcal/mol; the only sets considered at each position were the 95 unique sets of either one or two amino acids that can be defined by the use of mixed bases during primer synthesis; L was set equal to 6.5. In order for the DBIS algorithm to yield a library of 4^4 sequences that included GFP-S65T, all values of Ecomp(ia) were set equal to 0 except Ecomp(63T) = 10 kcal/mol, and Ecomp(69Q) = 10 kcal/mol; the only sets considered at each position were the 113 unique sets of either one or four amino acids that can be defined by the use of mixed bases during primer synthesis; L was set equal to 4.6.
The SCMFORBIT 322 library was designed by applying the method of Voigt et al. (7) in the following way. Each rotamer was first assigned a probability equal to the inverse of the number of rotamers at its position. The self-consistent mean-field solution was then calculated for an initial temperature of 50,000 K. As the temperature was lowered in 100 K increments, the solution from each previous temperature was used as the initial configuration for the next temperature. Saturation mutagenesis was directed to the two positions with site entropies >1.0 at a final temperature of 1,000 K.
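The annealing schedule described above can be sketched with a generic self-consistent mean-field iteration. This is a minimal illustration under stated assumptions, not the implementation of Voigt et al.: each rotamer's mean-field energy is taken to be its singles energy plus the probability-weighted pairs energies with all other positions, probabilities are Boltzmann weights, all positions are updated synchronously, the number of sweeps per temperature is arbitrary, and site entropy is computed with natural logarithms over amino acid probabilities. All names are hypothetical.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)


def mean_field_step(probs, singles, pairs, temperature):
    """One synchronous update of all rotamer probabilities."""
    updated = []
    for i, e_i in enumerate(singles):
        energies = []
        for r, e_single in enumerate(e_i):
            e = e_single
            for j, p_j in enumerate(probs):
                if j != i:
                    e += sum(p * pairs(i, r, j, s) for s, p in enumerate(p_j))
            energies.append(e)
        e_min = min(energies)  # shift energies for numerical stability
        weights = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies]
        z = sum(weights)
        updated.append([w / z for w in weights])
    return updated


def anneal(singles, pairs, t_start=50_000, t_final=1_000, t_step=100, sweeps=20):
    """Start from uniform probabilities and reuse each solution as the
    initial configuration for the next, lower temperature."""
    probs = [[1.0 / len(e_i)] * len(e_i) for e_i in singles]
    t = t_start
    while t >= t_final:
        for _ in range(sweeps):
            probs = mean_field_step(probs, singles, pairs, t)
        t -= t_step
    return probs


def site_entropy(rotamer_probs, rotamer_amino_acids):
    """Entropy of the amino acid distribution at one position."""
    by_aa = {}
    for p, aa in zip(rotamer_probs, rotamer_amino_acids):
        by_aa[aa] = by_aa.get(aa, 0.0) + p
    return -sum(p * math.log(p) for p in by_aa.values() if p > 0)


def no_coupling(i, r, j, s):
    """Toy pairs energy: positions do not interact."""
    return 0.0


# Toy problem: position 0 has three equally good rotamers, position 1 has one
# strongly preferred rotamer (energies in kcal/mol are hypothetical).
toy_singles = [[0.0, 0.0, 0.0], [0.0, 10.0]]
toy_probs = anneal(toy_singles, no_coupling)
entropy_0 = site_entropy(toy_probs[0], ["A", "V", "L"])
entropy_1 = site_entropy(toy_probs[1], ["Q", "R"])
```

In this toy case only position 0 exceeds the entropy threshold of 1.0 and would be selected for saturation mutagenesis.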
The CORBIT library was designed by applying the consensus method of Hayes et al. (9) in the following way. The GMEC for this design problem was used as the initial configuration for a Monte Carlo trajectory through conformation space. One million steps were used for each of 100 cycles during which temperature oscillated between 4,000 K and 150 K. Only the 1,010 unique amino acid sequences with the best energies sampled were retained for further analysis. At 9 of 15 positions, there appeared at least one mutation that could be introduced to GFP-S65T by a single nucleotide substitution. The CORBIT library was thus defined by the one such mutation that appeared with the greatest frequency at each of these nine positions. (At 1,000 sequences a unique library could not be defined by this method because both alanine and threonine appeared with equal frequency at position 58.) Three apparent deficiencies of this consensus method were addressed by developing the DBIS algorithm: first, Monte Carlo-based sampling of the energy landscape is by its nature both inexhaustive and random; second, disruptive combinations of amino acids might arise when a library is designed without accounting for correlations in an alignment; and third, even if correlations were accounted for, any alignment with enough sequences to truly reflect global trends in these correlations would likely be too large to be practical.
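The consensus step described above, choosing at each position the most frequent mutation that is reachable from the reference sequence by a single nucleotide substitution, can be sketched as follows. The reference codons in the example are hypothetical placeholders rather than the actual GFP-S65T coding sequence, the standard genetic code is assumed, and a tie is reported as an error to mirror the ambiguity noted in the text.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_substitution_neighbors(codon):
    """Amino acids (other than the encoded one and stop) reachable by
    changing exactly one nucleotide of the codon."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
                if aa != "*" and aa != CODON_TABLE[codon]:
                    reachable.add(aa)
    return reachable


def consensus_library(sequences, reference_codons):
    """Map position index -> most frequent reachable mutation among the
    sampled sequences; positions with no reachable mutation are skipped."""
    library = {}
    for pos, codon in enumerate(reference_codons):
        allowed = single_substitution_neighbors(codon)
        counts = Counter(seq[pos] for seq in sequences if seq[pos] in allowed)
        if not counts:
            continue
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            raise ValueError(f"tie between mutations at position index {pos}")
        library[pos] = ranked[0][0]
    return library


# Toy example: two positions with hypothetical reference codons CCA (P) and CAG (Q).
toy_sequences = ["AL", "AL", "TQ", "PR", "AQ"]
toy_library = consensus_library(toy_sequences, ["CCA", "CAG"])
```

A tie such as equal counts of alanine and threonine at a proline position raises an error here, which corresponds to the situation in which the text reports that a unique library could not be defined.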
The CMSA and SE/CMSA libraries were each designed with the same alignment of naturally occurring fluorescent proteins according to similar consensus methods. Of the 48 GFP homologs aligned by Shagin et al. (26), we used only the 36 homologs labeled as either GFPs, YFPs, cyan fluorescent proteins or red fluorescent proteins. To design the CMSA library, a consensus method derived from the one used by Hayes et al. (9) was used. At 12 of the positions between 57 and 72, there appeared at least one mutation that could be introduced to GFP-S65T by a single nucleotide substitution. The nine positions that had at least one such mutation represented at least four times were mutated to whichever of these mutations occurred with the greatest frequency at each position. Because two such mutations occurred with greatest frequency at positions 62 and 72, we elected, in each case, to introduce the mutation that happened to be shared with the DBISORBIT library. The approach used to design the CMSA library thus directs mutations away from the positions that exhibit the least conservation. To explore the possibility that these least-conserved positions might tolerate mutation best, the SE/CMSA library was designed by directing mutations to the 9 positions (of 12) that had the greatest site entropies,
|
|
where p(ia) is the frequency of amino acid a at position i, and the sum is taken over all amino acids for which p(ia) > 0. The mutations introduced at these positions were chosen by the same considerations used to design the CMSA library. We did not use any design algorithms that used pair-wise correlations among the mutations in the MSA, because this alignment was rather small and there may be considerable evolutionary noise in such correlations (36, 37).
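Ranking positions by site entropy in an alignment can be sketched in a few lines. The example assumes the usual Shannon form with natural logarithms, treats every character in a column as an amino acid (gaps are not handled), and uses a toy alignment; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy of the amino acid frequencies in one alignment column.
    Only amino acids that actually occur (p > 0) contribute to the sum."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())


def most_variable_positions(alignment, candidate_positions, k):
    """Return the k candidate positions with the greatest site entropies,
    sorted by position index."""
    entropies = {
        pos: site_entropy([seq[pos] for seq in alignment])
        for pos in candidate_positions
    }
    ranked = sorted(candidate_positions, key=lambda pos: entropies[pos], reverse=True)
    return sorted(ranked[:k])


# Toy alignment of four hypothetical three-residue sequences.
toy_alignment = ["AAV", "AVL", "AAI", "AVM"]
toy_top = most_variable_positions(toy_alignment, [0, 1, 2], 2)
```

A fully conserved column has zero entropy, so this ranking directs mutations toward the least-conserved candidate positions, as intended for the SE/CMSA design.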
The Random library was designed by using a Python script to pick one mutation at random at each of the nine positions mutated in the DBISORBIT library.
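The original script is not reproduced in the text, so the following is only a sketch of what such a random pick might look like. It assumes the caller supplies, for each mutated position, the list of mutations that satisfy the constraints applied to all libraries (for example, those that can be paired with the reference residue in a single degenerate codon); the positions and candidate lists shown are illustrative placeholders.

```python
import random


def random_library(candidates, rng):
    """Pick one mutation uniformly at random at each position.

    candidates : dict mapping position -> list of permissible mutant residues
    rng        : a random.Random instance, so that the pick is reproducible
    """
    return {pos: rng.choice(options) for pos, options in sorted(candidates.items())}


# Hypothetical candidate mutations at two positions; a real design would list
# all nine positions mutated in the reference library.
toy_candidates = {63: ["A", "S"], 69: ["L", "R"]}
toy_random = random_library(toy_candidates, random.Random(0))
```

Passing an explicitly seeded generator keeps the random control reproducible, which matters when a single randomly chosen library serves as the baseline for all comparisons.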
Procedures used to synthesize and characterize libraries, including data analysis and error estimation, are provided in SI Text.
| Footnotes |
|---|
Abbreviations: DBIS, diversity benefit applied to interacting sets; epPCR, error-prone PCR; MSA, multiple sequence alignment; SCMF, self-consistent mean-field; SE, site entropy.
To whom correspondence should be addressed. E-mail: steve{at}mayo.caltech.edu
Author contributions: T.P.T. and S.L.M. designed research; T.P.T., C.L.V., and D.N. performed research; T.P.T. contributed new reagents/analytic tools; T.P.T. and C.L.V. analyzed data; and T.P.T. wrote the paper.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0609647103/DC1.
© 2007 by The National Academy of Sciences of the USA