The evolution of skin pigmentation associated variation in West Eurasia

Skin pigmentation is a classic example of a polygenic trait that has experienced directional selection in humans. Genome-wide association studies have identified well over a hundred pigmentation-associated loci, and genomic scans in present-day and ancient populations have identified selective sweeps for a small number of light pigmentation-associated alleles in Europeans. It is unclear whether selection has operated on all the genetic variation associated with skin pigmentation as opposed to just a small number of large-effect variants. Here, we address this question using ancient DNA from 1158 individuals from West Eurasia covering a period of 40,000 years combined with genome-wide association summary statistics from the UK Biobank. We find a robust signal of directional selection in ancient West Eurasians on skin pigmentation variants ascertained in the UK Biobank, but find this signal is driven mostly by a limited number of large-effect variants. Consistent with this observation, we find that a polygenic selection test in present-day populations fails to detect selection with the full set of variants; rather, only the top five show strong evidence of selection. Our data allow us to disentangle the effects of admixture and selection. Most notably, a large-effect variant at SLC24A5 was introduced to Europe by migrations of Neolithic farming populations but continued to be under selection post-admixture. This study shows that the response to selection for light skin pigmentation in West Eurasia was driven by a relatively small proportion of the variants that are associated with present-day phenotypic variation. Significance Some of the genes responsible for the evolution of light skin pigmentation in Europeans show signals of positive selection in present-day populations. Recently, genome-wide association studies have highlighted the highly polygenic nature of skin pigmentation. It is unclear whether selection has operated on all of these genetic variants or just a subset. By studying variation in over a thousand ancient genomes from West Eurasia covering 40,000 years we are able to study both the aggregate behavior of pigmentation-associated variants and the evolutionary history of individual variants. We find that the evolution of light skin pigmentation in Europeans was driven by frequency changes in a relatively small fraction of the genetic variants that are associated with variation in the trait today.


Introduction 42
Skin pigmentation exhibits a gradient of variation across human populations that tracks with 43 latitude (1). This gradient is thought to reflect selection for lighter skin pigmentation at higher 44 latitudes, as lower UVB exposure reduces vitamin D biosynthesis which affects calcium 45 homeostasis and immunity (2-4). Studies of present-day and ancient populations have revealed 46 signatures of selection at skin pigmentation loci (5-9), and single nucleotide polymorphisms 47 (SNPs) associated with light skin pigmentation at some of these genes exhibit a signal of 48 polygenic selection in Western Eurasians (10). However, this observation, the only documented 49 signal of polygenic selection for skin pigmentation, is based on just four loci (SLC24A5, 50 SLC45A2, TYR, and APBA2/OCA2). 51 52 Therefore, while the existence of selective sweeps at a handful of skin pigmentation loci is well-53 established, the evidence for polygenic selection-a coordinated shift in allele frequencies across 54 many trait-associated variants (11)-is less clear. Recently, genome-wide association studies 55 (GWAS) of larger samples and more diverse populations (12-15) have emphasized the polygenic 56 architecture of skin pigmentation. This raises the question of whether selection on skin 57 pigmentation acted on all of this variation or rather was driven by selective sweeps at a relatively 58 small number of loci. 59 60 The impact of demographic transitions on the evolution of skin pigmentation also remains an 61 open question. The Holocene history (~12,000 years before present (BP)) of Europe was marked 62 by waves of migration and admixture between three highly diverged populations: hunter-63 phenotypes of ancient peoples, we can assess the extent to which they carried the same light 73 pigmentation alleles that are present today. This allows us to identify which pigmentation-74 associated variants have changed in frequency due to positive selection and the timing of these 75 selective events. We present the first systematic survey of the evolution of European skin 76 pigmentation-associated variation, tracking over a hundred loci over 40,000 years of human 77 history-almost the entire range of modern human occupation of Europe. 78 difference between Mesolithic hunter-gatherers and Early Upper Paleolithic populations (0.097 rs7870409 near RALGPS1-light alleles decreased in frequency more than can be explained by 135 changes in ancestry ( Fig. 2 B & C). The dark allele at rs6120849 (EDEM2) is also associated 136 with increased sitting and standing height in the UK Biobank, and the dark allele of rs7870409 137 (RALGPS1) is associated with increased arm and trunk fat percentage (62). These observations 138 raise the possibility of pleiotropic effects constraining the evolution of these loci or of spurious 139 pleiotropy due to uncorrected population structure in the GWAS. 140 141 On the other hand, the changes in frequencies of several variants appear to have been driven 142 largely by changes in ancestry. For example, rs4778123 near OCA2, rs2153271 near BNC2 and 143 rs3758833 near CTSC show evidence that their changes in frequency were consistent with 144 changes in genome-wide ancestry ( Fig. 2 B & C). At SLC24A5, rs2675345 shows evidence of 145 change with both ancestry and time, suggesting that even after the spread of the light allele from 146 migrating populations in the Neolithic (7), selection continued to occur post-admixture. Again, 147 not all of these cases involve an increase in the frequency of the light pigmentation allele over 148 time. The light allele of rs4778123 (OCA2) was at high frequency in hunter-gatherers but lower 149 in later populations (Fig. S4). From the manually curated set of SNPs (Fig. 2 D-F), rs12203592 150 near IRF4 also displays a marked effect of ancestry with higher light allele frequency in hunter-151 gatherers as well. While rs12203592 was not present in the UK Biobank summary statistics, 152 another SNP at the IRF4 locus, rs3778607, was present but with a lower ancestry effect ( Fig S5). 153 154 Lack of PBS selection signal across pigmentation SNPs in source populations of Europeans 155 As we described in the previous section, the frequencies of several skin pigmentation SNPs 156 appear to have been driven by changes in ancestry. This observation suggests that their 157 frequencies have diverged between the European source populations and then subsequently been 158 driven by admixture. Frequency differences across the source populations might reflect the 159 action of either genetic drift or selection. To examine this question, we looked for evidence of 160 selection at skin pigmentation SNPs in each ancestral population using the population branch 161 statistic (PBS) test (63). We assigned ancient individuals to the groups hunter-gatherer, Early Few of the skin pigmentation loci show extreme PBS values (Fig. 3, Fig. S7). On the Early 166 Farmer and Steppe ancestry branches, but not hunter-gatherer branch, the SLC24A5 locus 167 exhibited the strongest signal ( Fig. 3 B & C), indicating selection at the locus in both Early 168 Farmer and Steppe source populations. The RABGAP1 locus also exhibited an elevated signal of 169 selection in Early Farmer and Steppe, which remained when using a 20-SNP window size (Fig.  170 S8 B and C). However, unlike the SLC24A5 variant, this SNP (rs644490) did not show a 171 substantial effect of ancestry from the individual SNP regression analysis. Across the ancestral 172 groups, we observed an elevated PBS at the OCA2 locus around rs9920172, with the most 173 extreme value on the hunter-gatherer branch. As expected, this SNP does not show an effect of 174 ancestry in allele frequency change over time. As a whole, the PBS values of the 170 skin 175 pigmentation SNPs do not significantly deviate from the genome-wide distribution of PBS for 176 any of the ancient source populations (Fig. S9). 177

178
Signals of selection are restricted to the SNPs with the largest effect sizes 179 Since our results suggest that pigmentation-associated variants exhibit different changes in allele 180 frequency, we further examined the signal of selection found in Fig. 1. In general, SNPs with 181 larger GWAS effect sizes changed more in frequency over time than those with small effect sizes 182 (P = 2.8 x 10 -18 ; adjusted R 2 =0.36) (Fig. S10A). To understand the contribution of these large 183 GWAS effect size SNPs, we iteratively removed SNPs from the polygenic score calculations 184 ( Fig 4A). Removing the SNP rs16891982 at SLC45A2, we still found a significant decrease of 185 score over time (P < 1 x 10 -4 ), but the estimated rate of change (time) decreased by 39 percent. 186 Removing the top two SNPs at the SLC45A2 and SLC24A5 locus further attenuated the signal 187 with a decrease in time of 58 percent (P = 3.96 x 10 -4 ). We removed the five SNPs with the 188 largest GWAS effect sizes and found the effect of time attenuated (P = 0.026) with a decrease in 189 ordered by largest to smallest absolute GWAS effect size (Fig. 4B). Using all 170 SNPs in the 197 test, we failed to observe a signal of polygenic selection (P = 0.27). However, when we restricted 198 to the top 5 largest GWAS effect size SNPs, we found a significant signal (P < 2.5 x 10 -7 ). We 199 observed a significant Qx statistics at P < 0.05 up to the top 30 SNPs, but it appears this signal is suggesting that that differences in allele frequency across ancestry groups were mostly due to 224 genetic drift. One exception is that the light allele at SLC24A5 was nearly fixed in both Early 225 Farmer and Steppe ancestry populations due to selection. However, even for this variant we 226 observe a signal of ongoing selection in our data even after admixture with hunter-gatherers, indicating continued selection after admixture. This is analogous to the rapid selection at the 228 same locus for the light allele introduced via admixture into the KhoeSan, who now occupy 229 southern Africa (12, 68). 230

231
We are also able to test previous claims about selection on particular pigmentation genes. In 232 contrast to reports of positive selection in Europeans at the MC1R locus (69, 70), we find no 233 evidence of positive selection in Europeans. Among UK Biobank SNPs, rs1805007 near MC1R 234 explains a relatively large amount of variation within the UK but is predicted to explain 235 relatively little of the variation between Europe and West Africa (  (Table S1), but it is highly plausible that  We obtained summary statistics for UK Biobank GWAS for skin colour (data-field 1717) from 308 the publicly available release by the Neale Lab (version 3, Manifest Release 20180731) (14) . 309 The GWAS measured self-reported skin colour as a categorical variable (very fair, fair, light 310 olive, dark olive, brown, black). To identify genome-wide significant and independent SNPs, we 311 performed clumping using PLINK v1.90b6.6 (89) with 1000 Genomes GBR as an LD reference 312 panel (--clump-p1 5 x 10 -8 --clump-r2 0.05 --clump-kb 250), and followed up with clumping 313 based on physical distance to exclude SNPs within 100 kb of each other. We made two separate 314 lists of UK Biobank SNPs for the shotgun and capture-shotgun datasets because the capture-315 shotgun dataset was restricted to the 1240K array sites. For the capture-shotgun dataset, we 316 intersected all UK Biobank SNPs with 1240K array SNPs before identifying 170 independent 317 and genome-wide significant SNPs (Table S2), which were used in Figures 1B, 1C, 2B, 2C, 3, We also manually curated a list of skin pigmentation-associated SNPs from the literature (Table  321 S4). We identified 12 suitable studies (12, 13, 95, 96, 76, 79, 81, 90-94), 9 of which were 322 GWAS conducted in diverse populations (Table S5). We did not include SNPs from all the 323 considered papers for our analysis because some studies reported associations that were not 324 genome-wide significant (P < 5 x 10 -8 ). However, we included sub-significant SNPs from a 325 GWAS in Europeans (93) with evidence of replication (P < 5 x 10 -8 ) in the UK Biobank GWAS. 326 For a GWAS in African populations (12), we selected a SNP for each LD block, which was 327 defined by the study authors, with the greatest support based on multiple lines of evidence (e.g.  Table S4. We ended with a final list of 18 SNPs for the manually curated 335 set, 10 of which are present on the 1240K capture array (Table S6). We also made pseudo-336 haploid calls for each of the 18 SNPs in the shotgun dataset, picking a random allele from the 337 reads at each site as for the overall dataset. 338 339 ADMIXTURE analysis on capture-shotgun data 340 We performed unsupervised ADMIXTURE (97) on the dataset of ancient individuals with K=3, 341 which we found to produce the lowest cross-validation error for 2<K<9. The three identified 342 clusters could easily be identified as corresponding to hunter-gatherer, Early Farmer, and Steppe 343 (also referred to as Yamnaya) ancestry. 344 maximum possible score given the SNPs present in the sample: m is the number of skin pigmentation SNPs genotyped in the individual, di is the presence of the 353 dark allele at the i th SNP, and i is the GWAS-estimated effect size (Fig. 1 A-C). For the 354 manually curated list of SNPs where we did not have comparable effect size estimates, we 355 computed an unweighted score, effectively assuming that all variants had the same effect size. 356 This score is the proportion of dark alleles an individual carries out of the SNPs used in the 357 construction of the score:

= ∑
where m is the number of SNPs 358 genotyped in the individual and di is the presence of the dark allele at the i th SNP (Fig. 1 D-F). For the stratified analysis, we divided the capture-shotgun dataset into mutually exclusive 378 ancestry groups. We categorized individuals for this analysis as hunter-gatherer if they carried Individuals we categorized as Steppe had over 30 percent of the ADMIXTURE component and 382 were dated to less than 5000 years BP. Based on these cutoffs, there were 102 hunter-gatherer, 383 499 Early Farmer, and 478 Steppe ancestry assigned individuals. 384 385 Because the fitted model parameters are overdispersed (Fig. S2), we computed P-values from a 386 genome-wide empirical null distribution for βdate. We made this null distribution by running the 387 regression model described above on scores from sets of random, frequency-matched (+/-1%) 388 SNPs across the genome, maintaining the same effect sizes for the weighted score. We matched 389 the derived allele frequency based on the EUR superpopulation frequencies from 1000 Genomes. 390 We reported P-values based on 10,000 random samples. 391 392

Time series analysis of individual SNP allele frequencies 393
We performed logistic regression for each individual SNP using date of the sample and ancestry 394 as covariates. That is, the probability that the haplotype sampled from individual carries the 395 derived allele is pi, where 396 log = + + 1 + … + 10 ,

397
We compared the full model above to a nested model with no principal components by 398 performing a likelihood ratio test (in R we use anova(nested model, full model, test='Chisq')) to 399 obtain a P-value for the ancestry term which encompasses βPC1…βPC10. To obtain a P-value for 400 βdate, we compare a nested model without βdate to the full model. 401

402
Qx polygenic selection test 403 We used the Qx test for directional, polygenic selection on skin pigmentation (10). For this test, 404 we used skin pigmentation SNPs obtained from the UK Biobank in all 1000 Genomes 405 populations. We restricted sites to those on the 1240K and constructed a covariance matrix from 406 a total of 1 million SNPs. To calculate empirical P-values, we sampled a total of 500,000 null 407 genetic values matching skin pigmentation SNPs based on a +/-2% frequency on the minor 408 allele. However, we report 1 million runs for the top 7 SNPs and 2 million runs for the top 5 and 409

Variance in skin pigmentation explained by individual SNPs 413
We estimated the variance explained in the phenotype by a SNP within the UK population using 414 where β is the UK Biobank GWAS-estimated effect size and f is the allele frequency in the GBR 416 population from 1000 Genomes. We calculated the proportion of variance explained as 417 where N is the GWAS sample size (356,530) and σ is the standard error of the estimated β (98). 419 To estimate the variance explained between GBR and YRI, we calculated the between 420 population variance as 421 where f1 and f2 are the allele frequencies of GBR and YRI, β is the UK Biobank GWAS-423 estimated effect size, and favg is 0.5(f1 + f2).