## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Inferring fitness landscapes by regression produces biased estimates of epistasis

Edited by Richard E. Lenski, Michigan State University, East Lansing, MI, and approved April 23, 2014 (received for review January 15, 2014)

## Significance

The dynamics of evolution depend on an organism’s fitness landscape: the mapping from genotypes to reproductive capacity. Knowledge of the fitness landscape can help resolve questions, such as how quickly a pathogen will acquire drug resistance or by what pattern of mutations. However, direct measurement of a fitness landscape is impossible because of the vast number of genotypes. Here, we critically examine regression techniques used to approximate fitness landscapes from data. We find that such regressions are subject to two inherent biases that distort the biological quantities of greatest interest, often making evolution seem less predictable than it actually is. We discuss methods that may mitigate these biases in some cases.

## Abstract

The genotype–fitness map plays a fundamental role in shaping the dynamics of evolution. However, it is difficult to directly measure a fitness landscape in practice, because the number of possible genotypes is astronomical. One approach is to sample as many genotypes as possible, measure their fitnesses, and fit a statistical model of the landscape that includes additive and pairwise interactive effects between loci. Here, we elucidate the pitfalls of using such regressions by studying artificial but mathematically convenient fitness landscapes. We identify two sources of bias inherent in these regression procedures, each of which tends to underestimate high fitnesses and overestimate low fitnesses. We characterize these biases for random sampling of genotypes as well as samples drawn from a population under selection in the Wright–Fisher model of evolutionary dynamics. We show that common measures of epistasis, such as the number of monotonically increasing paths between ancestral and derived genotypes, the prevalence of sign epistasis, and the number of local fitness maxima, are distorted in the inferred landscape. As a result, the inferred landscape will provide systematically biased predictions for the dynamics of adaptation. We identify the same biases in a computational RNA-folding landscape as well as regulatory sequence binding data treated with the same fitting procedure. Finally, we present a method to ameliorate these biases in some cases.

An organism’s fitness or expected reproductive output is determined by its genotype, environment, and, possibly, the frequencies of other genotypes in the population. In the simplified setting of a fixed environment without frequency-dependent effects, as in many experimental populations (1⇓⇓⇓–5), fitnesses are described by a map from genotypes to reproductive rates called the fitness landscape.

The dynamics of an adapting population fundamentally depend on characteristics of the organism’s fitness landscape (6⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓–24). However, mapping out an organism’s fitness landscape is virtually impossible in practice because of the coarse resolution of fitness measurements and because of epistasis: the fitness contribution of one locus may depend on the states of other loci. To account for all possible forms of epistasis, a fitness landscape must assign a potentially different fitness to each genotype, and the number of genotypes increases exponentially with the number of loci.

As a result of these practical difficulties, fitness landscapes have been directly measured in only very limited cases, such as for individual proteins, RNA molecules, or viruses. Even in these limited cases, genetic variation was restricted to a handful of genetic sites (25⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓–41). Alternatively, one might try to infer properties of a fitness landscape from a time series of samples from a reproducing population. Despite considerable effort along these lines (19, 42⇓–44), this approach is difficult, and such inferences from time series can be subject to systematic biases (45). As a result, very little is known about fitness landscapes in nature, despite their overwhelming importance in shaping the course of evolution.

Technological developments now allow researchers to assay growth rates of microbes or enzymatic activities of individual proteins and RNAs for millions of variants (46⇓–48). As a result, researchers are now beginning to sample and measure larger portions of the fitness landscapes than previously possible. Nonetheless, even in these cases, the set of sampled genotypes still represents a tiny proportion of all genotypes and also, likely, a tiny proportion of all viable genotypes.

To draw conclusions from the limited number of genotypes whose fitnesses that can be assayed, researchers fit statistical models, notably by penalized regression, that approximate the fitness landscape based on the data available. This situation is perhaps best illustrated by recent studies of fitness for the HIV-1 virus based on the measured reproductive capacity of HIV-derived amplicons inserted into a resistance test vector (49, 50). These HIV genotypes were sampled from infected patients. [An alternative approach, often used for measuring activities of an individual enzyme, is to introduce mutations randomly into a WT sequence (47, 51⇓⇓–54).] Whereas the entire fitness landscape of HIV-1 consists of reproductive values for roughly 2^{1,800} ≈ 10^{600} genotypes, only ≈ 70,000 genotypes were assayed in the experiment (49). Researchers therefore approximated the fitness landscape by penalized regression, based on the measured data, using an expansion in terms of main effects of loci and epistatic interactions between loci. The principle goal of estimating the underlying fitness landscape was to assess the extent and form of epistasis (49) and more generally, understand how adaptation would proceed on such a landscape (50).

These studies (49, 50) and other high-throughput fitness measurement studies (46⇓–48) produce massive amounts of data, but not nearly enough data to determine an entire fitness landscape. This difficulty presents the field with several pressing questions. Do statistical approximations based on available data faithfully reproduce the relevant aspects of the true fitness landscape and accurately predict the dynamics of adaptation? Or do biases arising from statistical fits or measurement noise influence the conclusions that we draw from such data?

Here, we begin to address these fundamental questions about empirical fitness measurements and how they inform our understanding of the underlying fitness landscape and evolution on the landscape. We study the effects of approximating a fitness landscape from data in terms of main and epistatic effects of loci. We show that such approximations, which are required to draw any general conclusions from a limited sample of genotypes, are subject to two distinct sources of biases. Although these biases are known features of linear regressions, they have important consequences for the biological quantities inferred from such fitness landscapes. These biases systematically alter the form of epistasis in the inferred fitness landscape compared with the true underlying landscape. In particular, the inferred fitness landscape will typically exhibit less local ruggedness than the true landscape, and it will suggest that evolutionary trajectories are less predictable than they actually are in the true landscape.

Most of our analysis is based on samples from mathematically constructed fitness landscapes. However, we argue that the types of biases that we identify apply generally and in more biologically realistic situations. Indeed, we show that the same types of biases occur in RNA-folding landscapes as well as empirically measured regulatory sequence binding landscapes.

Although it may be impossible to completely remove these biases, we conclude by suggesting steps to mitigate the biases in some cases.

## Results

### Statistical Approximations of Fitness Landscapes.

A function that maps genotype to fitness may be written as an expansion in terms of main effects and interactions between loci (55⇓⇓⇓–59):*x*_{i} represents a genetic variant at locus *i*, *y* is the fitness (typically the logarithm of growth rate), and *L* is the number of loci. The nucleotides ATGC or any number of categorical variables may be encoded by dummy variables represented by *x*_{i}, which equal either +1 or −1 in this study (*Materials and Methods*). The term *β*_{i} and *β*_{ij}, etc., the best-fit coefficients can be inferred by linear regression.

Experimental data are now sufficiently extensive that both the additive and pairwise epistatic coefficients, *β*_{i} and *β*_{ij}, can often be estimated, whereas three-way and higher-order interactions are typically omitted from the statistical model of the fitness landscape. We refer to a statistical model with only additive and pairwise interactions as a quadratic model. Even in the quadratic case, the statistical model may involve more free coefficients than empirical observations, and therefore overfitting could become a problem. Techniques to accommodate this problem and the biases that they introduce are discussed below.

### Bias Arising from Penalized Linear Regressions.

The first type of bias that we study arises from the use of penalized regressions, which are required when a large number of parameters must be inferred from a limited amount of data. Under standard linear regression with limited data, overfitting can cause the magnitudes of inferred coefficients to be large, resulting in positive and negative effects that cancel out to fit the observed fitness measurements. The standard remedy for overfitting is a so-called penalized least squares regression, such as ridge regression or least absolute shrinkage and selection operator (LASSO) regression (60), which constrains the complexity of the inferred model by limiting the magnitudes of the inferred coefficients. For example, in fitting a quadratic landscape to sampled HIV-1 fitness measurements, Hinkley et al. (49) used a form of penalized linear regression to avoid overfitting their data.

Although often required when fitting complex fitness landscapes to data, the penalized least square regression has some drawbacks. In general, the mean squared error (MSE) of any regression can be decomposed into a bias and a variance. The Gauss–Markov theorem guarantees that the standard least squares linear regression produces the lowest possible MSE that has no bias, whereas penalized least squares can reduce the MSE further by adding bias in exchange for a reduction in variance (60). Thus, to provide predictive power for the fitnesses of unobserved genotypes, these regressions necessarily produce biased fits. Although the accuracy of predicting unobserved fitnesses may be improved by such a biased fit, other quantities of biological interest derived from these predictions, such as measures of epistasis, may be distorted by the bias.

To quantify the biases introduced by penalized least square regression we compared mathematically constructed fitness landscapes with the landscapes inferred from a quadratic model fit by ridge regression (similar results hold for LASSO regressions) (*Discussion*). Our analyses are based on two types of mathematical fitness landscapes. The widely used *NK* landscapes of Kauffman and Levin (7) and Kauffman and Weinberger (8) comprise a family of landscapes that range from additive to highly epistatic depending on the parameter *K*, which determines the number of (typically sparse) interactions between sites (7⇓⇓–10). We also study polynomial landscapes, which consist of additive effects and all possible pairwise and three-way interactions. In these landscapes, the amount of epistasis can be tuned by controlling the proportion of fitness variation that arises from the additive contributions, pairwise interactions, and three-way interactions (*Materials and Methods*).

We constructed *NK* and polynomial landscapes with only additive and pairwise effects, we sampled genotypes and fitness from these landscapes, and we fit a quadratic model of the landscape based on the sampled data using penalized least square regression. For both *NK* and polynomial landscapes, we found that the inferred landscape tends to overestimate the fitnesses of low-fitness genotypes and underestimate the fitnesses of high-fitness genotypes (polynomial landscape in Fig. 1 and *NK* landscape in Fig. S1). Thus, there is a fitness-dependent bias in the inferred landscape compared with the true underlying landscape. The extent of this bias depends on the amount of penalization used, which in turn, depends on the amount of data sampled relative to the number of free coefficients in the statistical model. When the number of independent samples equals the number of free coefficients, these biases disappear (Fig. 1, red), but whenever data are in short supply, these biases arise and can be substantial in magnitude (Fig. 1).

The only way to avoid this bias entirely is to obtain at least as many independent observations as model parameters, which is typically unfeasible for realistic protein lengths or genome sizes. Furthermore, as we will discuss below, these inherent biases have important consequences for our understanding of epistasis in the fitness landscape and our ability to predict the dynamics of adaptation.

### Bias Arising from a Misspecified Model.

Even when there is sufficient data so that a penalized regression is unnecessary, there is another source of potential bias in the inferred fitness landscape caused by variables that are omitted from the statistical model but present in the true landscape (e.g., higher-order interactions between loci) (59). In this case, the estimated coefficients of the statistical model will be biased in proportion to the amount of correlation between the omitted variables and the included variables (61). Uncorrelated omitted variables, by contrast, may be regarded as noise, which we discuss below.

Interactions of different orders (e.g., three-way and pairwise interactions) are generally correlated with each other, unless the genotypes are sampled randomly and encoded as ±1, forming an orthogonal basis (57, 62). In this case, which rarely applies to samples drawn from an evolving population, the omitted interactions may be regarded as noise, and the estimated coefficients are guaranteed to be unbiased. However, even in this case, the inferred *y* values may still be biased.

Fig. 2 illustrates the biases arising from model misspecification. To produce Fig. 2, we fit quadratic models to fitness landscapes that contain higher-order interactions. In both the cubic polynomial (Fig. 2) and *NK* (Fig. S2) landscapes, fitnesses that are very high or very low are likely to contain contributions from higher-order interactions with positive or negative effects, respectively. However, these higher-order interactions are not estimated by the statistical model, and therefore the inferred model overestimates low fitnesses and underestimates high fitnesses. Bias arising from model misspecification is qualitatively similar to bias arising from penalized regression, which is discussed above. Bias from a misspecified model can be large, but it would be not be visible in a plot of residuals versus inferred fitnesses.

The misspecified model bias shown in Fig. 2 is a form of regression to the mean, and it is present even in a simple univariate regression with a large amount of noise (63). The slope in Fig. 2, which plots true fitness *y* against the residual *R*^{2} − 1, where *R*^{2} denotes the coefficient of determination of the original regression (*Materials and Methods* shows a derivation). Whether regression to the mean is viewed as bias depends on the interpretation of the statistical model. If one assumes that the model cannot be improved by adding any more predictor variables (i.e., that the noise is caused by purely random factors as opposed to unknown systematic factors), then the regression results are unbiased, and the observed negative slope between *y* simply reflects the fact that the regression cannot estimate the noise. However, in situations when there is a systematic signal that is missing from the statistical model, such as when fitting a quadratic model to a fitness landscape that contains higher-order interactions, then the regression is biased to the mean in proportion to the amount of variance that is not explained by the model. This phenomenon is not caused by measurement noise but by the omission of relevant variables. In the experiments summarized in Fig. 2 there is no measurement noise in the fitnesses, and therefore the negative slope shown in Fig. 2 reflects a true bias: the misspecified model overestimates low fitnesses and underestimates high fitnesses.

With carefully tuned parameters, other forms of misspecified model bias are also possible (Fig. S3). In any case, whatever form it takes, misspecified model bias has consequences for how accurately the landscape inferred from an experiment will reflect the amount of epistasis in the true landscape or predict the dynamics of adaptation, which we will show below.

### Bias Arising from Wright–Fisher Sampling.

A third difficulty that arises when fitting a statistical model to measured fitnesses is the presence of correlations between observed states of loci in sampled genotypes. An adapting population does not explore sequence space randomly, but rather is guided by selection to higher-fitness genotypes. Sequences sampled from a population under selection will thus tend to have correlated loci because of either shared ancestry or epistasis.

Correlated variables do not themselves bias inferred coefficients (at least when the model is specified correctly), but they can inflate the variance of those estimates (64). Predictions from the inferred model are not affected in expectation, provided that the new data have the same correlations as the original training data. However, in the context of the expansion in Eq. **1**, if there are correlations between the included variables, then there are also correlations between the omitted higher-order interactions and the included variables. Thus, sampling from a Wright–Fisher (WF) population will exacerbate the misspecified model bias. As a result, the inferred

To illustrate biases that arise from sampling a population under selection we simulated WF populations for 100 generations on cubic polynomial and *Materials and Methods*). The resulting fits exhibit very high

### Extrapolative Power to Predict Fitnesses.

One of the motivations for fitting a statistical model of a fitness landscape is to predict the fitnesses of genotypes that were not sampled or assayed in the original experiment. This procedure immediately raises the questions: how much predictive power do such statistical fits have, and how does their power depend on the form of the underlying landscape from which genotypes have been sampled as well as the form of the fitting procedure?

Although extrapolation is easy to visualize in a linear regression with one component of *x*, it cannot be plotted as easily in high dimensions, where it is sometimes called hidden extrapolation (64). In the discrete, high-dimensional space of genotypes, no genotype is between any two other genotypes, and therefore every prediction is, in some sense, an extrapolation rather than an interpolation.

Given experimental data, it may be hard to determine if a model is extrapolating accurately or not (65). Here, we quantify the accuracy of extrapolation explicitly using mathematical fitness landscapes. Fig. 4 illustrates the ability of statistical fits to predict fitnesses of genotypes that were not sampled in the training data for a range of models and regressions with varying degrees of misspecification. Fig. 4 quantifies the amount of error when predicting the fitnesses of genotypes that are one or two mutations from the training data as well as predicting fitness of random, unsampled genotypes. Away from the training data, the bias and variance increase with each mutation, which is reflected by the lower squared correlation coefficients between true and inferred values. The predictions are progressively worse as the amount of model misspecification increases.

A statistical model that has a good fit to the training data (i.e., a high

It is interesting to compare the extrapolative power of landscapes fitted to genotypes sampled from a WF population with genotypes sampled randomly. The dashed line in Fig. 4 indicates the expected

### Biases Influence the Amount of Epistasis in the Inferred Landscape.

The dynamics of an adapting population depend fundamentally on the form of epistasis—that is, the way in which fitness contributions from one locus depend on the status of other loci. Indeed, one of the primary goals in fitting a fitness landscape to empirical data is to quantify the amount and form of its epistasis to understand how adaptation will proceed.

Given the two sources of biases discussed above, which are inherent to fitting fitness landscapes to empirical data and exacerbated by sampling from populations under selection, this question arises: how do these biases influence the apparent form of epistasis in the fitted landscape? In this section, we address this question by comparing the form of epistasis in the true underlying fitness landscape with the form of epistasis in the inferred landscape obtained from fitting a quadratic model to sampled genotypes. There are several measures of epistasis known to influence the dynamics of adaptation. We will focus on three measures commonly used in the experimental literature on epistasis.

One way to measure epistasis, which reflects the degree of predictability in adaptation, is to reconstruct all of the possible genetic paths between a low-fitness ancestral genotype and a high fitness-derived genotype sampled from an experimental population (18, 24, 31, 33, 36, 66⇓⇓–69). The proportion of such paths that are accessible or monotonically increasing in fitness is then a natural measure of epistasis. When this proportion is high, many possible routes of adaptation are allowable, suggesting that the evolutionary trajectory cannot be easily predicted in advance. When this proportion is small, it suggests that the evolutionary trajectory is more predictable, at least in a large population.

To generate data similar to what would arise in an evolution experiment, we ran WF simulations on mathematical fitness landscapes (*Materials and Methods*). Each population began monomorphically, and the most populated genotype at the end of the simulation was taken as the derived genotype, which typically contained between five and seven mutations compared to the ancestral genotype. Genotypes and their associated fitnesses were sampled from the population after 100 generations and used to fit a quadratic model of the landscape. Fig. 5*A* shows an example of all of the mutational paths between the ancestral and derived genotypes separated by five mutations for both the true and the inferred fitnesses. Because low fitnesses are likely to be overestimated and high fitnesses are likely to be underestimated, the bias in the inferred landscape tends to eliminate fitness valleys. As a result, the number of accessible paths is higher in the inferred landscape than it is in the true underlying landscape (Fig. 5*A*). Epistasis seems to be less severe, and adaptation seems able to take more paths than it actually can take. This effect occurs systematically: we have observed it over many realizations of different underlying fitness landscapes (Fig. 5*B*).

We also investigated two other measures of epistasis: the number of local maxima in the fitness landscape (7, 8, 48, 70, 71) and the prevalence of sign epistasis between pairs of mutations (18, 41, 71, 72). Both of these quantities are global measures of epistasis that depend on the entire landscape, as opposed to the local measure of accessible paths between an ancestral and a derived genotype. Generally speaking, local maxima tend to slow adaptation to very high fitnesses, although valley crossing can occur in large populations (73⇓⇓–76). Sign epistasis occurs when the fitness effect of a mutation at one site changes sign depending on the status of a second site. Reciprocal sign epistasis is a subset of sign epistasis, and it occurs when the second site is also sign epistastic with the first site.

Fig. S5 compares the true and inferred amounts of these two global measures of epistasis. Both quantities can be either underestimated or overestimated by the inferred landscape depending on the circumstances. When the model is misspecified (Fig. S5, *Lower*) and without penalization or local sampling of genotypes, both measures of epistasis are heavily underestimated, because the model is unable to capture the local maxima and sign epistasis caused by three-way interactions. However, when genotypes are sampled locally (Fig. S5, *Upper*) (i.e., sampled within a few mutations around a focal genotype) and penalized regression is applied, then the inferred landscape is influenced by both penalization bias and extrapolation error. The penalization bias tends to smooth the inferred landscape and eliminate local maxima, whereas extrapolation adds noise to the estimated fitnesses and may create spurious local maxima. Which of these two effects dominates depends on the amount of data sampled and how it is distributed. Sign epistasis seems to be less sensitive to extrapolation and bias when the model is well-specified, presumably because it depends only on the signs of effects and not magnitudes. In all cases considered, however, the inferred landscapes exhibit systematically biased global measures of epistasis.

### More Realistic Landscapes.

To complement the studies above, which are based on mathematically constructed fitness landscapes, we also investigated more realistic landscapes: computational RNA-folding landscapes and empirical data relating regulatory sequences and expression levels (i.e., a regulatory sequence binding landscape) (*Materials and Methods*). The RNA-folding landscape and the regulatory sequence binding landscape (Fig. 6) both exhibit the same form of bias that we observed in the mathematical fitness landscapes.

In the case of the RNA-folding landscape (Fig. 6*A*), measurements are made without noise and there are sufficient data to avoid the need for penalized regression. Thus, the bias to the mean fitness seen in Fig. 6*A* with *B*), however, contain some measurement noise that comprises about 10–24% of the variance (65), whereas the

The form of the biases observed when fitting a quadratic model to these realistic fitness landscapes (Fig. 6) is similar to the biases observed when fitting quadratic models to

### Reducing Bias When Fitting Fitness Landscapes.

Bias arising from a misspecified model may be reduced by adding relevant missing variables. In the context of the expansion in Eq. **1**, this approach requires adding higher-order interactions, such as triplets of sites, quartets, etc. In practice, this approach is often infeasible, because the number of such interactions is extremely large and relatively few of them may be present in the data. In Fig. 7*A*, we show that the bias in the statistical model can be reduced by fitting a model that includes three-way interactions. However, there is still some residual bias because of penalized regression. Thus, incorporating additional predictor variables effectively trades bias caused by model misspecification for bias caused by a more severe penalized regression.

It is not always necessary to use penalized regression, especially when higher-order interactions are sparse. In such cases, it may be appropriate to first select a limited number of relevant variables and then use standard regression that avoids the bias associated with penalization. Selecting the variables to use in the statistical model may be done by hand, based on prior knowledge, or by statistical methods, such as LASSO (60, 77) (*Materials and Methods*). This approach is expected to perform well only when a relatively small number of variables/interactions is, in fact, present in the true fitness landscape.

As a proof of principle, we have shown how LASSO followed by standard unpenalized regression reduces bias in fits to the cubic polynomial and

## Discussion

Our ability to measure the genotype–fitness relationship directly in experimental populations is advancing at a dramatic pace. However, we can never hope to measure but a tiny fraction of an entire fitness landscape, even for individual proteins. As a result, there is increasing need to fit statistical models of landscapes to sampled data. Here, we have shown that such statistical fits can warp our view of epistasis in the landscape and, in turn, our expectations for the dynamics of an evolving population.

We have identified two distinct sources of biases: penalized regression and model misspecification. Our analyses of the effects of penalized regression have been performed using ridge regression, but the same qualitative results hold for LASSO regressions (Figs. S6–S10 are analogous to Fig. 1, Fig. S1, Fig. 4, Fig. S4, and Fig. 5*B*, respectively), and we would also expect them to hold for other forms of penalization, such as elastic net or the generalized kernel ridge regression in refs. 49 and 50. Notably, the bias arising from penalized regression has a slightly different form depending on whether genotypes are encoded in the

Aside from these biases, we have also shown that statistical fits have poor predictive power. Even when a fitted landscape exhibits a large cross-validated

Common measures of epistasis may be grossly distorted by statistical fits to sampled data. Interestingly, different measures of epistasis can be affected differently. For example, the number of paths accessible between an ancestral and derived genotype will be systematically overestimated in such fits, suggesting that evolution is less predictable than it actually is. The number of local maxima in the inferred landscape can be severely over- or underestimated, whereas the prevalence of sign epistasis is less prone to bias (Fig. S5). Estimates of pairwise sign epistasis may be more robust, because quadratic fits capture pairwise interactions, whereas the number of local maxima can depend on higher-order interactions.

The problems of bias to smoothness because of penalized regression and ruggedness because of extrapolation error can coexist in the same dataset, which was observed in a recent study of regulatory sequence binding data (65). Which of these two effects will dominate is unclear, in general, because it depends on the underlying landscape and the form of sampling. As a result, it is difficult to interpret the predictions made by quadratic fits to fitness landscapes, such as fits made to HIV data (50). Nevertheless, it is often possible to at least deduce the presence of epistatic interactions from sampled data (49, 65).

In some cases, such as when the true landscape has sparse high-order interactions between loci, a combination of variable selection followed by unpenalized regression may ameliorate the biases that we have identified. However, the degree to which this approach will reduce bias will surely depend on the biological context. Thus, researchers should incorporate as much prior biological knowledge as possible when choosing a statistical model and fitting procedure. At the very least, it is important that researchers be aware of the biases inherent in fitting statistical models of fitness landscapes to data.

## Materials and Methods

### Parameterization of Genotypes.

To build a statistical model with linear and interacting terms (Eq. **1**), genetic sequences must be encoded as dummy variables,

### Polynomial Landscapes.

We constructed polynomial landscapes by terminating the expansion (Eq. **1**) at the third order and specifying the coefficients in such a way as to control the contributions to the total variation in log fitness that arise from interactions of each order. In particular, the variance of *y* is

where

where *i*th order, *L* is the number of loci, and the total variance is

N K Landscapes.

We followed a standard (7, 8, 78) procedure for constructing *N* denotes the length of the binary string defining the genotype (*K* other sites, yielding a lookup table with *N* lookup tables each with *K*. *K* is constant across sites and genotypes for a particular landscape, and the identities of the *K* sites on which a given locus depends are drawn uniformly from the *k*th-order interactions is proportional to

### RNA-Folding Landscape.

The RNA-folding landscape was generated by the Vienna RNA software (79). The target secondary structure was chosen as the most common structure observed in a sample of 10,000 structures generated from random genotypes of length 15. The fitness function was defined as *h* is hamming distance to the target in the tree edit metric. Training data consisted of

*Escherichia Coli* Regulatory Sequence Binding Landscape.

The data consisted of 129,000 sequences of the *Escherichia coli* lac promoter and associated gene expression levels. Each sequence contained 75 nt and contained roughly seven mutations relative to the WT. *E. coli* were FACS-sorted by expression levels into nine bins, and the bin numbers serve as the phenotype for fitting the quadratic model. In this case, LASSO was used for regression. More details are in ref. 65.

### Penalized Regression.

A quadratic fit with ridge regression was used (unless otherwise stated), which identifies the coefficients that minimize

where the last sum is taken over all coefficients denoted as *λ* was determined by first calculating the 10-fold cross-validated MSE and its SE for many different values: *λ* as the largest

### WF Simulations.

Monte Carlo simulations of adaptation were based on standard WF dynamics (80). A population consists of *N* individuals, each with a genotype consisting of a bit string of length 20. The population replaces itself in discrete generations, such that each individual has a random number of offspring in proportion to its fitness, which is determined by its genotype through the fitness landscape. In practice, this simulation is done efficiently with a multinomial random number generator. Mutations are defined as bit flips, and they are introduced in every individual at each generation with probability *U* per genome. The number of individuals receiving a mutation is, thus, binomially distributed, because double mutations are not allowed. The populations were initialized as monomorphic for a low-fitness genotype chosen by generating 100 random genotypes and picking the one with the lowest fitness. Simulations were run for 100 generations; the resulting population was reduced to unique genotypes, and those genotypes, with the corresponding fitnesses, were used as the training data for regressions. The number of unique genotypes in the population is sensitive to *U*, *N*, the number of generations, and

### Plots of True Fitness Versus Residuals.

The figures plotting true fitness vs. residuals were produced using a Gaussian moving window applied to the raw data. For each value of true fitness *y*, a mean and an SD were calculated by weighting all of the data points by a Gaussian with a width of *σ* and normalized by the sum of all of the weights for the given *y* value. This procedure provides a sense of the distribution at a given *y* without regard to the density of points. Areas on the extremes of *y* had few points to estimate a mean and variance, and they were excluded if the sum of weights was smaller than the 10% percentile of the distribution of all normalization factors. We used the smoothing parameter *A* and Figs. S3 and S6; *B* and Figs. S1, S2, and S7;

### Slope of True Fitness Versus Residuals.

A scatter plot of true fitnesses, *y*, vs. estimated fitnesses inferred by regression, *y* by using a second regression:

where *β* denotes the slope, found by minimizing the MSE:

Recall that *r* is the residual from the initial regression and *y* axis and inferred values on the *x* axis and observing no relationship. In the text, by contrast, we show plots of the true values *y* vs. the residuals *y* are underestimated and genotypes with small *y* are overestimated. This type of bias is a form of regression to the mean. We can calculate the slope of *y* as follows:

and with a similar calculation, we find

If we have mean-centered data *y* vs.

## Acknowledgments

We thank A. Feder, J. Draghi, and D. McCandlish for constructive input and P. Chesson for clarifying the usage of the word “between.” J.B.P. acknowledges funding from the Burroughs Wellcome Fund, the David and Lucile Packard Foundation, US Department of the Interior Grant D12AP00025, US Army Research Office Grant W911NF-12-1-0552, and Foundational Questions in Evolutionary Biology Fund Grant RFP-12-16.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. E-mail: jplotkin{at}sas.upenn.edu.

Author contributions: J.O. and J.B.P. designed research; J.O. performed research; J.O. analyzed data; and J.O. and J.B.P. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1400849111/-/DCSupplemental.

## References

- ↵
- ↵
- Lenski RE,
- Travisano M

- ↵
- ↵
- Blount ZD,
- Borland CZ,
- Lenski RE

- ↵
- Woods RJ,
- et al.

- ↵
- ↵
- ↵
- ↵
- Macken CA,
- Perelson AS

- ↵
- ↵
- Perelson AS,
- Macken CA

- ↵
- Newman MEJ,
- Engelhardt R

- ↵
- ↵
- ↵
- ↵
- ↵
- Tokuriki N,
- Tawfik DS

- ↵
- ↵
- Kryazhimskiy S,
- Tkacik G,
- Plotkin JB

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Burch CL,
- Chao L

- ↵
- ↵
- ↵
- ↵
- ↵
- Hayden EJ,
- Wagner A

- ↵
- Weinreich DM,
- Delaney NF,
- Depristo MA,
- Hartl DL

- ↵
- ↵
- Lozovsky ER,
- et al.

- ↵
- Hietpas RT,
- Jensen JD,
- Bolon DN

- ↵
- Hall DW,
- Agan M,
- Pope SC

- ↵
- Chou HH,
- Chiu HC,
- Delaney NF,
- Segrè D,
- Marx CJ

- ↵
- Khan AI,
- Dinh DM,
- Schneider D,
- Lenski RE,
- Cooper TF

- ↵
- ↵
- ↵
- ↵
- ↵
- Illingworth CJR,
- Parts L,
- Schiffels S,
- Liti G,
- Mustonen V

- ↵
- ↵
- Illingworth CJR,
- Mustonen V

- ↵
- ↵
- Pitt JN,
- Ferré-D’Amaré AR

- ↵
- Jacquier H,
- et al.

- ↵
- Jiménez JI,
- Xulvi-Brunet R,
- Campbell GW,
- Turk-MacLeod R,
- Chen IA

- ↵
- ↵
- ↵
- ↵
- Kinney JB,
- Murugan A,
- Callan CG Jr.,
- Cox EC

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Friedman J,
- Hastie T,
- Tibshirani R

- ↵
- Wooldridge JM

- ↵
- ↵
- Chernick MR,
- Friis RH

- ↵
- Kutner MH,
- Nachtsheim CJ,
- Neter J

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Iwasa Y,
- Michor F,
- Nowak MA

- ↵
- ↵
- ↵
- Covert AW 3rd.,
- Lenski RE,
- Wilke CO,
- Ofria C

- ↵
- Tibshirani R

- ↵
- Kauffman S

- ↵
- ↵
- Ewens WJ