# Detecting selection with a genetic cross

See allHide authors and affiliations

Edited by Matthew W. Hahn, Indiana University Bloomington, Bloomington, IN, and accepted by Editorial Board Member Daniel L. Hartl August 4, 2020 (received for review July 7, 2020)

### This article has a Correction and a Letter. Please see:

- Correction for Fraser, Detecting selection with a genetic cross - October 12, 2020
- Relationship between Research Article and Letter - February 22, 2021

## Significance

Natural selection is the force that underlies the spectacular adaptations of all organisms to their environments. However, not all traits are under selection; a key question is which traits have been shaped by selection, as opposed to the random drift of neutral traits. Here, I develop a test of selection on quantitative traits that can be applied to almost any genetic cross between divergent populations or species. The test is robust to a wide range of potential confounders and has greater power to detect selection than existing tests. Applied to empirical data, the test reveals widespread selection in both domesticated and wild species, allowing selection to be detected more easily and powerfully than previously possible.

## Abstract

Distinguishing which traits have evolved under natural selection, as opposed to neutral evolution, is a major goal of evolutionary biology. Several tests have been proposed to accomplish this, but these either rely on false assumptions or suffer from low power. Here, I introduce an approach to detecting selection that makes minimal assumptions and only requires phenotypic data from ∼10 individuals. The test compares the phenotypic difference between two populations to what would be expected by chance under neutral evolution, which can be estimated from the phenotypic distribution of an F_{2} cross between those populations. Simulations show that the test is robust to variation in the number of loci affecting the trait, the distribution of locus effect sizes, heritability, dominance, and epistasis. Comparing its performance to the QTL sign test—an existing test of selection that requires both genotype and phenotype data—the new test achieves comparable power with 50- to 100-fold fewer individuals (and no genotype data). Applying the test to empirical data spanning over a century shows strong directional selection in many crops, as well as on naturally selected traits such as head shape in Hawaiian *Drosophila* and skin color in humans. Applied to gene expression data, the test reveals that the strength of stabilizing selection acting on mRNA levels in a species is strongly associated with that species’ effective population size. In sum, this test is applicable to phenotypic data from almost any genetic cross, allowing selection to be detected more easily and powerfully than previously possible.

Trait-based tests of selection aim to distinguish the effects of two major forces of evolution: natural selection and neutral drift. Because many factors affect trait divergence—e.g., population size, divergence time, and genetic architecture—distinguishing these two forces is seldom straightforward. Several types of trait-based selection tests have been proposed, all of which view neutrality as a null model, but which differ in how they assess this null and in the type of data they require [reviewed in chapter 12 of Walsh and Lynch (1)].

For example, time series tests use phenotypic measurements of a single species over time, typically from the fossil record (a stratophenetic series). If the trait shows departure from the neutral expectation of a random walk—e.g., many more time steps with trait increases than decreases—then neutrality is rejected. The key assumption is that environmental changes do not affect these phenotypic trends, which is difficult to justify considering how much environments can change over the millions of years typically covered in a stratophenetic series.

A more widely used approach is known as Q_{ST}, where the population structure of phenotypic variance is compared to the analogous genetic metric F_{ST}. By utilizing genetic crosses in common garden experiments, the confounding effects of environment can be controlled, allowing selection to be assessed in a wide range of species (2). Limitations of this approach include low power [requiring data from >10 populations (3)] and several assumptions about epistasis and mutation rates (*SI Appendix*). However, an improved Q_{ST}-based method has sufficient power to detect selection using only a few populations (4).

Another widely used test is known as the quantitative trait locus (QTL) sign test (5, 6). In this test, QTL are first mapped using genotype and phenotype data from a genetic cross between two divergent parental lines. Under neutrality (and the absence of ascertainment bias), QTL directionality—i.e., which parent’s allele increases the trait at each QTL—is expected to be binomially distributed around 50%, much like a series of coin flips (Fig. 1 *A*, *Left*). In contrast, under lineage-specific selection, QTL directions will be biased in one direction (Fig. 1 *A*, *Right*). Although this test is quite robust due to its minimal assumptions, it also suffers from low power: A minimum of eight QTL (which is rarely reached in practice; see *SI Appendix*) is required to achieve a nominal *P* < 0.01.

The sign test’s low power is largely due to the fact that it only uses QTL directionality information, while ignoring the phenotypic divergence between the two parental lines. However, the parental traits contain important information: If a trait evolves under directional selection, it will diverge much faster than under neutrality (Fig. 1*B*). If it were possible to estimate the divergence expected by chance under neutrality, then this could be used as a null hypothesis; parental trait divergence significantly greater than this expectation would suggest lineage-specific directional selection, whereas divergence less than this would suggest stabilizing selection.

Indeed, this intuitive logic underlies another class of trait-based methods, “rate tests,” that ask whether the phenotypic divergence of multiple populations is consistent with neutral drift (1, 7). The neutral expectation is estimated from population genetic theory, using parameters such as the effective population size, the mutational variance, and the number of generations since population divergence. Since these parameters and their sampling variances can typically only be roughly estimated (at best), and several strong assumptions must also be made, rate tests are viewed as qualitative guides rather than quantitative tests of neutrality (1, 7) (*SI Appendix*).

In this work, I sought to develop a trait-based test of selection with the robustness of the sign test, while utilizing the framework of rate tests to increase the power to detect selection.

## Results

The logic underlying rate tests could lead to a more rigorous test of selection if the expected divergence under neutrality could be more accurately estimated. This can be achieved using the neutral model of the sign test, where the genetic variants underlying QTL (quantitative trait nucleotides, or QTNs) have no directionality bias (Fig. 1 *A*, *Left* and Fig. 1 *B*, purple). How could the distribution of phenotypes expected under this model of neutrality be estimated? One way would be to measure the effect size of every QTN and then predict the phenotypes resulting from random allelic combinations. However, there is a simpler solution: The F_{2} trait distribution represents exactly this null model. Regardless of the QTN directionalities in the parents, the F_{2} phenotypes result from random combinations of the segregating alleles; this randomness mimics the random directionality expected under neutral evolution (Fig. 1 *A*, *Left*, Fig. 1 *B*, purple). Therefore, every F_{2} individual can be thought of as a random draw from the distribution of potential parental phenotypes resulting from neutral evolution. If the parental divergence is significantly greater than expected based on the phenotypic variance observed in the F_{2} population, then neutrality can be rejected in favor of directional selection (Fig. 1 *C*, *Top*). If instead the parents are significantly less diverged than expected—known as transgressive segregation—then stabilizing selection is inferred (Fig. 1 *C*, *Bottom*).

I will begin with a simple model of a trait in a haploid species where all QTN are additive with equal effect sizes, and there is no environmental variation or trait measurement error (i.e., broad-sense heritability *H*^{2} = 1). Let *n*_{q} denote the number of QTN for this trait that differ between two strains/populations. Under neutrality, traits diverge from the ancestor like a random walk, proportional to *B*, purple). For two lineages evolving independently, the expected absolute difference will be *P* = 0.5 and *n* = *n*_{q} (it is only one draw since once one parent’s allelic states are defined, the other’s must be the opposite for any QTN segregating in the cross); for *n*_{q} greater than ∼20, this approximates a normal distribution. The square of this difference is proportional to the parental trait variance; dividing this variance by the variance expected by chance under neutral evolution—i.e., the variance of the F_{2} trait distribution, as discussed above—results in the test statistic (denoted *v* and illustrated in Fig. 1*C*):_{2} phenotypic variance and *F*(1, *n*_{F2} − 1) under neutrality, where *n*_{F2} is the number of F_{2} individuals. This approximates a χ^{2} distribution with 1 degree of freedom for *n*_{F2} greater than ∼20.

We can now relax the simplifying assumptions above. In diploids, the expected variance in the F_{2} is half that between the parents; this is accounted for by multiplying the denominator by a constant, denoted *c* (more generally, any factors that affect the phenotypic variance in the progeny—including dominance and other cross designs such as backcrosses or recombinant inbred lines [RILs]—can be accommodated by adjusting the value of this constant; *SI Appendix*). Allowing environmental variation and trait measurement error is equivalent to adding random noise to both the numerator and denominator. Correcting for this yields the following:*n*_{p1} and *n*_{p2} are the number of replicate individuals measured for each parental strain, and *SI Appendix*). All of these terms can be estimated from phenotype data in a single genetic cross, provided that multiple individuals of each parental type are phenotyped. Note that Eq. **1** is a special case of Eq. **2**, where *c* = 1 (for haploids) and *H*^{2} = 1 (hence

To explore the behavior of *v* as a neutral null model, I conducted simulations of neutral traits in parental strains and their F_{2} progeny (*Methods*). These simulations allowed the precise manipulation of individual parameters to assess their effects on *v*. These are presented as quantile–quantile (QQ) plots, where points following the *Y* = *X* line represent adherence to the expected F distribution. Points above and below the line represent values of *v* greater and less than expected under the null, respectively.

The test followed the expected distribution of *v* closely for a range of different QTL effect size distributions (Fig. 2*A*). This includes the exponential distribution, which is thought to be a reasonable approximation for QTL (8, 9). However, extremely skewed distributions—e.g., a monogenic trait where one QTL explains all trait variance—will lead to half of all F_{2} individuals identical to one of the parents and *v* values tightly distributed around 1, resulting in little power to detect any deviation from the null. Therefore, the test is only useful for polygenic traits.

How polygenic must a trait be? More important than the number of genetic variants affecting the trait—which is likely to be quite large for complex traits (10)—is the number of independently segregating QTN, which is limited by the recombination index (RI) (11). The RI is defined as the haploid number of chromosomes plus the number of recombinations per meiosis; it represents the number of independent genomic regions segregating in a cross. Adherence of *v* to the null improves with greater RI (Fig. 2*B*) since more shuffling of the two parental genomes leads to more normally distributed *v* cannot deviate far from 1. The *v* test behaves conservatively in crosses with RI less than ∼20 (Fig. 2*B*). Fortunately, this is rarely an issue in practice since the mean RI for plants is ∼30 and for animals is ∼40 (12). [For RILs, there are more generations for recombination so the mean “effective RI” is ∼45 for plants and ∼58 for animals (*Methods*).]

Sample size is another important consideration. Although more samples are always preferable, *v* follows the null distribution even with only three F_{2} individuals (Fig. 2*C*).

Other potential sources of noise include environmental variability and measurement error. Although these affect both parental and F_{2} phenotypic variances, they have less effect on the parental estimates when parental replicates are included, because taking the mean of each parental strain reduces this noise. Without correcting for this effect (i.e., setting *H*^{2} = 1, and thus **2**), low *H*^{2} leads to severe underestimates of *v* (Fig. 2 *D*, *Left*). However, including a correction for this (Eq. **2**) precisely accounts for this effect (Fig. 2 *D*, *Right*).

The final effect explored via simulation is genetic interaction, which is the context dependence of phenotypic effects either between loci (epistasis) or between alleles at the same locus (dominance). Epistasis can take many forms, but the large-scale pattern most often observed is known as diminishing returns epistasis (13⇓–15), where strains with the greatest trait value have lower values than expected from additivity. To model this, I transformed the simulated trait values as *t* is each trait value. Since this affects both *v* (Fig. 2*E*). I also modeled synergistic (increasing returns) epistasis as *t*^{2}, and again found no effect (Fig. 2*E*). However, epistasis in more extreme forms could obscure any signal of selection (*SI Appendix*). Dominance can be accounted for by adjusting the value of *c* to offset its effect on *E* and *SI Appendix*).

In sum, simulations of neutral evolution show that *v* is robust with respect to the number of loci affecting the trait, the distribution of locus effect sizes, environmental variability, measurement error, dominance, and epistasis.

Having established the behavior of *v* under neutrality, I explored its power by simulating directional selection. These were identical to the neutral simulations except that all QTN had concordant parental directionality (Fig. 1 *A*, *Right*; Fig. 1 *B*, blue). I compared the *v* test to the sign test to assess their relative performance in a range of parameter settings (other selection tests use very different types of data, precluding a direct comparison). Both tests utilized the same simulated F_{2} phenotype data for each cross; the sign test was also provided with optimal genotype data, meaning that each QTN was genotyped without error, had no linkage with other QTN, and no other genetic markers were included. Both tests result in a probability of neutrality (*P*_{nut}) value for each trait (*Methods*). More extreme *P*_{nut} values represent greater power to detect selection, with points above the diagonal representing crosses with greater power for the *v* test and those below the diagonal indicating greater power for the sign test.

Even with optimal genotype data, at small sample sizes (*n*_{F2} = 10 or 100), very few QTLs are mapped, resulting in the sign test’s low power (Fig. 3). In contrast, the *v* test often rejects the null with *P*_{nut} < 10^{−4} even at *n*_{F2} = 10. The *v* test is generally more powerful than the sign test at *n*_{F2} < 10^{3}. However, the sign test generally outperforms the *v* test at very large sample sizes (*n*_{F2} > 10^{3} when *H*^{2} > 0.8, and *n*_{F2} > 10^{4} for all *H*^{2}).

To further explore the *v* test’s power at small sample sizes, I simulated crosses with a total of 10, 20, or 30 phenotyped individuals (including the parents). For example at *n* = 10, 93% of traits rejected the null at *P*_{nut} < 0.05 when *H*^{2} = 0.1, and 99% when *H*^{2} = 0.8 (*SI Appendix*, Fig. S1*A*). The *v* test performed well with 10 individuals even when selection was weak (with up to 20% of QTN acting in opposition to the majority) and heritability was low (*SI Appendix*, Fig. S2). In contrast, the sign test required 500 to 1,000 phenotyped and genotyped individuals to reach the same power (*SI Appendix*, Fig. S1*B*). This difference in power of the two tests makes sense considering that although they are both evaluating the same neutral null model, the sign test is doing so directly (by comparing QTL directionalities to the neutral expectation; Fig. 1*A*), whereas the *v* test is doing so indirectly. When enough QTL are mapped, the direct approach of the sign test is superior, but by not needing to map QTL, the *v* test requires fewer individuals, as well as no genotype data.

Due to its generality, the *v* test can be applied to data from almost any genetic cross where both parental strains and their F_{2} (or other cross) progeny were phenotyped. This includes most QTL studies, as well as genetic studies from before genotyping was possible. To explore the test empirically, I collected published data for 126 traits from 21 species (Dataset S1). The *v*-test *P*_{nut} values for artificially selected traits (in crops, livestock, and laboratory experiments) revealed a strong skew toward low *P*_{nut} values indicating directional selection, as expected (16) (Fig. 4*A*; for a discussion of trait ascertainment bias—a major caveat for all trait-based selection tests—see *Discussion* and *SI Appendix*). Three of the most significant traits were from maize, including data for ear length and seed weight published in 1913, suggesting intense artificial selection on these traits prior to this date (17).

In contrast, the *P*_{nut}-value distribution for traits naturally selected in the wild showed a less extreme skew (Fig. 4*B* and Dataset S1; comparing distributions, Wilcoxon *P* = 2 × 10^{−5}), with peaks at both low and high *P* values suggesting a wide range of selection pressures. The most significant trait for directional selection was male head shape in a cross between two species of Hawaiian *Drosophila* (18) (data from reciprocal F_{2} crosses and an F_{6} cross are all significant; *SI Appendix*). This is a well-known example of rapid morphological evolution, potentially due to sexual selection (19), but whether the divergence could instead be explained by genetic drift was not previously testable. The next most significant trait was human skin color, measured in the “F_{2}” grandchildren of West Africans and Europeans (20). Despite the small sample size (*n*_{F2} = 14), the *v* test is significant at all three reflectance wavelengths tested (Dataset S1). Human skin color has long been thought to be adaptive based on its correlation with local ultraviolet radiation (21); these results provide independent confirmation for the role of selection. The third most significant wild trait was burrowing behavior in *Peromyscus* mice (22), as measured by burrow length in an interspecies backcross.

The *v* test can also be applied to molecular-level traits such as gene expression levels, which avoids the effects of trait ascertainment bias (see *Discussion*). The BXD collection of mouse RILs is an excellent test case, with gene expression data available for 16 tissues. Performing the *v* test revealed that the *P*_{nut}-value distributions of all tissues were shifted toward stabilizing selection to varying degrees (Fig. 4*C*). To estimate the strength of stabilizing selection in each tissue, I calculated the fraction of genes with *v*-test *P*_{nut} > 0.99. All 16 tissues had values between 1.0 and 1.8% (note this should not be interpreted as the fraction of genes under stabilizing selection, which is likely to be far higher). Interestingly, gene expression from six different regions of the brain had significantly stronger stabilizing selection than in the 10 nonbrain tissues (Fig. 4*C*). This is consistent with previous reports of slower evolution of gene expression in the mammalian brain compared to other tissues (23, 24), and suggests that this slower evolution is at least partially due to greater selective constraint [as opposed to a lower mutational variance, which can also lead to a slower evolutionary rate (1)].

Another type of molecular trait measured in the BXD cross is metabolite levels in liver (25). Applying the *v* test to these metabolomic data, cohorts fed two diets (high-fat and normal chow) showed *P*_{nut} values strongly skewed toward one (Fig. 4*C*). Therefore, the measured metabolites appear to be under stronger stabilizing selection than mRNA levels.

Despite the variation in stabilizing selection on gene expression across the 16 tissues, all tissues were relatively close to the neutral expectation (1% of genes at *P*_{nut} > 0.99). To compare this to other species, I analyzed gene expression data from genetic crosses of five additional species (*Saccharomyces cerevisiae*, *Oryza sativa*, *Arabidopsis thaliana*, *Brassica rapa*, and *Caenorhabditis elegans*; Dataset S2). In contrast to mouse, some species had much stronger stabilizing selection (e.g., *B. rapa* with 10.7% of genes under stabilizing selection). One hypothesis to explain this wide range of values is that natural selection is expected to be stronger in species with larger effective population sizes (*N*_{e}). Direct measurements of *N*_{e} for these species are not possible, but an indirect indicator is the fraction of neutral genomic positions that are heterozygous (known as π), which is expected to increase with *N*_{e} (26). Plotting the strength of stabilizing selection against published values of π, I observed a strong correlation (Fig. 4*D*). This suggests that *N*_{e} (or another factor correlated with *N*_{e}; *SI Appendix*) may be a major determinant of stabilizing selection on gene expression levels, as has been previously proposed (23, 24).

The peaks of gene expression *P*_{nut} values near zero (Fig. 4*D*) suggest that directional selection may also be detectable from these data. For example, in *S. cerevisiae* genes with low *P*_{nut} are highly enriched for roles in mitochondrial translation (false discovery rate [FDR] = 6 × 10^{−23} among genes down-regulated in the laboratory strain BY; no enrichment among genes up-regulated in BY). In *B. rapa*, the defense response to other organisms is the most enriched function among genes with low *P*_{nut} (FDR = 0.02). Similarly in *C. elegans*, immune response was the most enriched (FDR = 0.04).

## Discussion

In this work, I introduce a test of selection that combines the logic of rate tests with the neutral null model of the sign test. The result is a test that is simple, robust, and more powerful than existing tests. This will allow researchers to assess selection on traits of interest any time they perform a genetic cross. Moreover, if multiple traits are measured in the same cross, results from the test can be directly compared to assess the strength of selection on diverse phenotypes (as in Fig. 4*C*).

There are many potential extensions to this test. For example, the *v*-test framework could be applied any time the genomes of two divergent populations are mixed, including naturally admixed populations. In this case, the only modification needed would be in the calculation of *c*, representing the expected ratio of parental variance to admixed progeny variance (*SI Appendix*). Other extensions could include testing multiple correlated traits simultaneously, focusing only on additive effects, or estimating confidence intervals (*SI Appendix*).

It is important to note that results from all trait-based tests of selection must be treated with caution when trait ascertainment bias is present. If traits are chosen for study based on the extent of their divergence between populations, then the neutral model no longer holds. For example, imagine 100 neutral traits; in any properly calibrated selection test, we would expect ∼5 of these to reach a nominal *P* < 0.05. If these same five are the only traits included in a study (e.g., because they have the strongest phenotypic divergence), then they will appear to be inconsistent with neutrality. In some cases it is possible to correct for ascertainment bias, either by modifying the test itself (5) or by using a more conservative *P* value threshold (*SI Appendix*). However, the ideal solution is to analyze traits that were selected for study independently of the parental trait values, which by definition would lack any ascertainment bias. The most widespread examples of this are molecular-level traits such as the levels of mRNAs, proteins, metabolites, etc. Similarly, standardized phenotyping (27) can be free of bias.

Notably, Eq. **2** is identical to the widely used Castle–Wright (CW) estimator for the number of loci underlying divergence in a quantitative trait (28⇓–30) (although it was derived independently). The maximum possible value of this estimator is the RI of the species being crossed, resulting in a strong downward bias for most complex traits, which can have thousands of variants contributing (10). It is therefore rather fortuitous that this severely biased estimator is also precisely F-distributed under the null hypothesis of neutrality, even though neutrality had no role in its original derivation and the F distribution has no role in its traditional interpretation (28, 29). Furthermore, the true number of loci underlying a trait (what the CW estimator aims to estimate) is not indicative of selection; a neutral trait could have any number of underlying loci, so this cannot be used to assess a trait’s neutrality.

How can this one equation have two seemingly unrelated interpretations? The CW estimator requires a number of restrictive assumptions, including that all QTLs must act in the same direction with respect to their parent of origin (as in Fig. 1 *A*, *Right*). Rather than being an assumption of the *v* test, this concordant QTL directionality is exactly what the *v* test was designed to detect. Therefore, the connection between the two interpretations rests on the fact that the strength of the signal detected by the CW estimator—the number of reinforcing QTLs acting in the same parental direction—is indicative not only of the number of loci, but also of any selection that has acted on those loci since the divergence of the two parental strains.

One consequence of this mathematical homology is that the hundreds of published CW estimator values dating back to 1921 (28) can now be immediately reinterpreted as tests of neutral evolution (*SI Appendix*), even when the phenotype data from these studies are not available.

## Methods

### Neutral Simulations.

Parameters in neutral simulations are shown in Fig. 2: QTN effect size distribution, RI (the number of independently segregating QTN), *n*_{F2}, *H*^{2}, epistasis, and dominance (*c* = 2 for additive, *c* = 1 for bidirectional dominant, and *c* = 4/3 for unidirectional dominant; *SI Appendix*). Parameter values were chosen to reflect typical published datasets rather than optimal parameters for test performance. First, I generated effect sizes for the specified number of QTN by sampling from the specified distribution. In neutral evolution, QTN directions in each parent are random (Fig. 1*A*), so the first parent’s traits were determined by flipping the sign of each effect size with 50% chance and then summing the values. The second parent’s traits were calculated the same way but with all signs flipped from the first parent (each QTN increases the parental difference by twice its effect size, assuming parents are homozygous). F_{2} traits were determined by multiplying each effect size by one number chosen randomly from a set of four numbers that represent the four possible F_{2} genotypes at each locus and their resulting phenotypic effects ([0 1 1 2] for additive, [0 2 2 2] for unidirectional dominant, or [0 0 2 2] for bidirectional dominant; *SI Appendix*), separately for every individual, and summing across all QTN. When *H*^{2} < 1, random noise was also added to each parental and F_{2} individual as a normally distributed variable with zero mean and SD _{2} trait values before noise is added, since *t* is the trait value. Parental within-strain variances (_{2} population. From these variables, *v* was calculated using Eq. **2** and converted into a *P* value based on the *F*(1, *n*_{F2} − 1) cumulative distribution.

### Selection Simulations.

Selection was simulated using the framework described above, with one difference: omitting the step of flipping the sign of each effect size with 50% chance. This meant that all QTNs were reinforcing in their directionalities. *v* was then calculated as described above and converted to a *P* value based on the *F*(1, *n*_{F2} − 1) cumulative distribution. Parameter values are listed in Fig. 3 legend.

To calculate the sign test *P* value, QTLs must first be mapped. The genotype of each QTN variant (randomly generated as described above) was provided in a genotype matrix, with no genotyping error and no additional genetic markers (this represents an unrealistic best-case scenario for QTL mapping). Pearson’s correlation coefficient between each marker and the F_{2} phenotypes were then converted into logarithm of odds (LOD) scores (31) as *P* value was the sign test *P* value.

The simulations in *SI Appendix*, Figs. S1 and S2 were identical to those in Fig. 3, but with different parameter values and different visualizations. One thousand simulations were performed for each combination of parameter values shown. In *SI Appendix*, Fig. S2, true positives were defined as crosses simulating selection where the *v*-test *P* value was below a given cutoff; false positives were crosses simulating neutral evolution where the *P* value was below the same cutoff (the cutoff varied from 0 to 1 to generate each receiver operating characteristic curve). Selection strength was represented by the fraction of QTNs with reinforcing directionality. This cannot be translated into a selection coefficient since it would depend on how the trait relates to fitness, the population size, time since population divergence, and many other parameters.

### Empirical Analysis.

Traits were collected for Fig. 4*A* from two sources: Wright (29), tables 15.1–15.8, and Lynch and Walsh (31), table 9.6. Traits were collected for Fig. 4*B* from Lynch and Walsh (31), table 9.6; Rieseberg (16), table 5; and literature searches for cichlid and *Peromyscus* data.

For traits in Fig. 4 *A* and *B* and Dataset S1, *H*^{2} values were estimated as follows. If values were provided by the authors of the original study, these were used. If not, the environmental variance was estimated using Wright’s preferred method (29), a weighted average of within-strain variances: _{1} were not available, then I used the sample size of each parental type as weights: *H*^{2} (likely due to overestimation of *H*^{2} < 0.1 were set to 0.1. For the two cases of traits with outbred parents (burrowing and parenting behavior in *Peromyscus*), *H*^{2} may be underestimated due to within-strain genetic variation contributing to *H*^{2} = 0.4 for these traits [which is higher than the heritability of most behavioral traits (32)], resulting in less significant values of *P*_{nut} (*SI Appendix*).

Data in Fig. 4*C* were collected from http://www.genenetwork.org/ (selecting species, mouse; group, BXD family; type, any tissue with parental data). Since most mouse gene expression datasets in Fig. 4*C* had only one sample per parental strain, *H*^{2}, **2** was set to *H*^{2} = 0.31 for all genes, this being the median value estimated for yeast (33) (specifically the median of 1 − *e*, where *e* is the residual gene expression variance not explained by either additive or pairwise epistatic effects). I selected yeast because it had the largest number of gene expression profiles from a genetic cross of any species, a comprehensive heritability analysis performed by the original authors, and the closest π to mouse. Note that these modifications affect the *y*-axis values in Fig. 4*C*, but not the relative relationship between points; any values could have been used without affecting the trend shown. The complete list of tissues and stabilizing selection scores are in Dataset S2.

Gene expression data for the six species in Fig. 4*D* were from the following sources: *S. cerevisiae* (33), *A. thaliana* (34), *B. rapa* (35) (normal phosphorus condition), *C. elegans* (36, 37) (control condition), *O. sativa* (38), and *Mus musculus* (see above). Published expression data from other species’ crosses were not usable (e.g., no parental data). For mouse, the median stabilizing selection level across all 16 tissues was used. To avoid spuriously low or negative estimates of *H*^{2}, for all six species any genes with *H*^{2} < 0.1 were set to 0.1 (as described above). As above, to facilitate comparison across datasets, I assumed half of the parental variance was genetic, and half environmental. All *P*_{nut} values are listed in Dataset S2. π values were taken from Leffler et al. (39) as the median of autosomal π estimates for each species. No values were listed for *O. sativa* or *B. rapa*, so other published estimates were used (40, 41). For *O. sativa*, both parents of the genetic cross (Zhengshan 97 and Minghui 63) were in population group XI, so the π for this group was used. Using a more recently published π estimate for *S. cerevisiae* [0.18%, which is the median π across all 14 nonmosaic populations (42)] yielded a slightly stronger Pearson correlation (*r* = 0.935). Omitting *B. rapa* as an outlier also strengthened the correlation (*r* = 0.960). Gene Ontology process enrichments were calculated with GOrilla (43).

### Estimating RI.

RI was estimated as [mean cM/100] + [mean haploid chromosome number] for 189 plants and 140 animals (12). RILs experience around twice as many detectable recombinations (defined as those occurring in a heterozygous genomic region) as an F_{2} cross, regardless of the exact number of generations of inbreeding, so for RILs the recombination values were doubled. Backcrosses experience about half as many detectable recombinations as an F_{2} cross.

## Data Availability.

There are no newly generated data underlying this work.

## Acknowledgments

I thank M. Feldman, A. Kern, S. Otto, D. Petrov, M. Schumer, and the H.B.F. laboratory for helpful discussions and feedback, C. Hu and J. Bloom for sharing data, and P. Nut for inspiration. H.B.F. is supported by NIH Grant 1R01GM13422801.

## Footnotes

- ↵
^{1}Email: hbfraser{at}stanford.edu.

Author contributions: H.B.F. designed research, performed research, analyzed data, and wrote the paper.

The author declares no competing interest.

This article is a PNAS Direct Submission. M.W.H. is a guest editor invited by the Editorial Board.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2014277117/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- B. Walsh,
- M. Lynch

- ↵
- ↵
- R. B. O’Hara,
- J. Merilä

- ↵
- O. Ovaskainen,
- M. Karhunen,
- C. Zheng,
- J. M. C. Arias,
- J. Merilä

- ↵
- H. A. Orr

- ↵
- ↵
- ↵
- S. P. Otto,
- C. D. Jones

- ↵
- ↵
- ↵
- ↵
- ↵
- S. Kryazhimskiy,
- D. P. Rice,
- E. R. Jerison,
- M. M. Desai

- ↵
- H.-H. Chou,
- H.-C. Chiu,
- N. F. Delaney,
- D. Segrè,
- C. J. Marx

- ↵
- A. I. Khan,
- D. M. Dinh,
- D. Schneider,
- R. E. Lenski,
- T. F. Cooper

- ↵
- L. H. Rieseberg,
- A. Widmer,
- A. M. Arntz,
- J. M. Burke

- ↵
- R. A. Emerson,
- E. M. East

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- E. G. Williams et al.

- ↵
- M. Kimura

- ↵
- ↵
- W. E. Castle

- ↵
- S. Wright

- ↵
- C. C. Cockerham

- ↵
- M. Lynch,
- B. Walsh

- ↵
- ↵
- ↵
- M. A. L. West et al.

*Arabidopsis*. Genetics 175, 1441–1450 (2007). - ↵
- J. P. Hammond et al.

*Brassica rapa*. Plant Physiol. 156, 1230–1241 (2011). - ↵
- M. G. Sterken et al

*Caenorhabditis elegans*. bioRxiv:10.1101/651885 (28 September 2019). - ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Evolution