Radiation with reticulation marks the origin of a major malaria vector

Significance Introgressive hybridization is prevalent in recent and rapid animal radiations, and emerging evidence suggests that it leads to the sharing of genetic variation that can facilitate adaptation to new environments and generate novel phenotypes. Here we study a recent and rapid radiation of African mosquitoes in which only one species, An. funestus, is a primary human malaria vector with a continent-wide geographic distribution. We trace the evolutionary history of the group and demonstrate introgression events between multiple species, the most recent of which involved substantial gene flow into An. funestus that preceded its range expansion across tropical Africa. Our findings point to introgression as an underappreciated factor contributing to the acquisition of high malaria vectorial capacity.


SI References
(http://broadinstitute.github.io/picard/). Variants were called using GATK v.3.5 and HaplotypeCaller. Resulting VCF files were converted to consensus sequences in FASTA format using a custom python script, vcf2fastamt.py. For each species an alignment of the mitochondrial genomes was uploaded to NCBI as a pop-set (see Data availability). 5 S4. Whole Genome Alignments (Fig. S3, Table S3) The new de novo assemblies were aligned together with the An. funestus reference (Fig. S3, Table S3) using progressiveCactus v 0.01 (14), to provide a lift-over table (coordinate translation) into a common coordinate system based on the genome coordinates of An. funestus. Prior to genome alignment, masked sites were the MAF was parsed using MafFilter v.2.3 (24) , first by chromosome (SelectChr), and then only retaining alignment blocks that contain all species (Subset). Alignment blocks were merged into a single alignment block with continuous coordinates by adding 'N' characters using MafFilter (Merge).
The HAL file from progressiveCactus and HalTools was then used to generate a lift-over table to allow projection 20 of genomic locations between the individual AFC species and the An. funestus reference genome. A custom python script, liftover.py was used to re-orient alleles and create the projected VCF from the lift-over table (hereafter, the lift-over VCF). Phased information was then added from the con-specific VCFs. All resulting VCFs were then merged into a single VCF using BCFtools v.1.6 (25). The final merged VCF consisted of 160 Mb of accessible sites with 122.78 Mb of sites also having a genotype called for the outgroup An. rivulorum.

S5. Population genomics and Variant Calling (Figs. S4-S11, Tables S4-S5)
In addition to assembling new reference genomes, we also individually re-sequenced the genomes of 42 additional mosquitoes in the AFC (Table S1). For An. longipalpis C, An. parensis, and An. vaneedeni, an additional eight 30 specimens were re-sequenced per species. For An. funestus-like, three additional individuals were re-sequenced. For An. funestus, 15 individuals were re-sequenced from five different countries, including six individuals carrying clade 2 mtDNA (18) (Table S1). No additional individuals were sequenced for An. rivulorum and An. species A.
All re-sequencing libraries were prepared at McGill University and Génome Québec Innovation Centre 35 (Montreal, Canada) and sequenced on 12 lanes of the HiSeq X with 150 paired-end cycles. Sequencing reads were processed as described above (SI Appendix, Text S2), except that con-specific nuclear and mtDNA genome assemblies were used as a reference for mapping with BWA.
Filtered con-specific VCF files were statistically phased using the read-informative phasing option in SHAPEIT2 v.2.r837 (28) . Phase informative read (PIR) files were generated from BAM files using the tool extractPIRs v.1.r68 available with SHAPEIT2. Switch errors were identified with the input-graph option and 100 replicates, using a custom python script. If a site switched between haplotypes in >10% of the replicates, the site was coded as unphased in the VCF.

S5.2 Interspecific demographic inferences for models and simulations
For the purpose of subsequent data simulations (SI, Appendix, Text S8), we reconstructed the demographic history of each AFC species using MSMC2 (29) and con-specific VCF files. Input files for MSMC2 were built 10 using scripts from www.github.com/schiffles/msmc_tools using positive masks (callable sites) generated as described in SI Appendix, Text S5.1. Negative masks included homozygous positions with < 10 read coverage, and positions that were repeat-masked or missing.
For each AFC species, three individuals were combined for a total of six haplotypes. We combined haplotypes 15 only for autosomes. We determined that the results were similar if using individual chromosome arms (e.g., tested on 2R and 3R in An. funestus) versus all the autosomes. Thus, we combine individuals across all autosomes. MSMC2 was run using a r/µ of ~2.9 (determined as described in SI Appendix, Text S5.5), 20 iterations, and the default time pattern. To check convergence, MSMC2 runs were repeated 10 times and the results were averaged across runs.

20
The msmc_tools script multihetsep_bootstrap.py was run to create input files for 20 bootstraps with a chunk size of 10 Mb and 5 chunks per four simulated chromosomes. All results were interpolated to the same generation times using a custom python script, msmc2_interpolate.py. Results from the AFC species were converted to effective population size using the Drosophila mutation rate of 2.8 x 10 -9 per site per generation (30). The 0.025 25 and 0.975 quantiles of the effective population sizes estimated from the bootstrap replicates were plotted using ggplot2 (Fig. S4).

30
Nucleotide diversity (31) and Tajima's D (32) were calculated in non-overlapping windows of 10 and 50 kb using the con-specific VCFs and functions available in scikit-allel. The unfolded site-frequency spectrum (SFS) was calculated using the program est-sfs (33) using An. rivulorum and An. species A as outgroups to polarize the allelic state. The scaled SFS was then plotted using tools in scikit-allel.

35
The distributions of nucleotide diversity were similar among An. funestus populations, but lower for other species (Fig. S5A). Similar to the results from MSMC2, An. funestus-like had the lowest diversity. The distribution of Tajima's D statistic overlapped 0 for all AFC species except An. longipalpis C and An. vaneedeni, where it was strongly negative (Fig. S5B). Consistent with this result, the site-frequency spectrum in An. vaneedeni and An. longipalpis C samples had an excess of low-frequency and high-frequency alleles (Fig. S5C). We provide 40 additional plots of these values along the chromosomes in Figs. S6-S8.

S5.4 Genome-wide recombination rate
A recombination map was constructed for each autosome arm (Fig. S9) using LDJump v.0.2.2 (34) for each 45 species with at least ten individuals (the minimum number required for phasing with SHAPEIT2 v.2.r837). For An. funestus, only the Mozambique population was used, as sample sizes for other populations were three individuals or fewer. FASTA formatted files were created from each con-specific VCF using a custom python script. LDJump was then run using a population look-up table constructed with LDpop (35) and based on the demographic history in SI Appendix, Text S5.2 .
We examined the genetic structure among AFC species using principal component analyses (PCA) on 50,000 randomly chosen segregating sites (Fig. S10) using functions available in scikit-allel and the lift-over VCF. SNPs were chosen if they had no missing data, were segregating in each population, and had a minor allele frequency >10%. Principal component 1 (PC1) and PC2 accounted for on average 22% and 11.3% of the variation, respectively. On the X chromosome five distinct clusters are visible splitting individuals into predefined species.
On chromosomes 2R and 3R, there was no separation along PC2 between individuals belonging to An. longipalpis C and An. parensis. This was in contrast to 3L and 2L where PC2 separated An. longipalpis C and An. parensis, but PC1 did not separate individuals of An. funestus and An. funestus-like. S5.6 Population structure and pairwise genetic distance in An. funestus 10 Because we had population samples of An. funestus from multiple geographic locations, we evaluated its population structure to gain insight about its possible effect on phylogenetic and other analyses. We included An. funestus-like as a reference to evaluate its separation from each An. funestus population (Fig. S11). For each chromosome arm, genetic structure was examined among An. funestus populations using a PCA. PCA was constructed using 50,000 random SNPs from the lift-over VCF in scikit-allel. SNPs were chosen if they had no missing data, were segregating in each population, and had a minor allele frequency >10%.

20
Population divergence times among An. funestus populations were estimated to infer the timing of dispersal across sub-Saharan Africa. Population divergence times were calculated using two methods. The first, using the cross-coalescent rates in MSMC2, accounts for changing effective population sizes. The median crosscoalescent time was assumed to be the divergence time between populations even if the final cross-coalescent rate did not reach a value of 0. The second approach assumes no migration between populations following 25 divergence, and is robust to small sample sizes (36). Calculations were performed using the custom python script, divTime.py.   Table S5). The median estimates were not significantly different (Wilcoxon rank-sum test, p-value = 0.47). Estimates among pairs that included Kenya or Zambia populations, for which there was only a single individual sampled, were highly variable from both methods.

S6.1 Mitochondrial genome phylogeny
Individual consensus mtDNA genomes (SI Appendix, Text S3) were aligned using MAFFT v.7.394 (37). Coding sequences were concatenated after removing the AT-control region, tRNA and rRNA. A Bayesian phylogeny 40 was reconstructed from the concatenated coding sequences using BEAST2 (21) under the GTRGAMMA model and allowing for estimation of different rates per gene but assuming the same tree. BEAST2 was run with three different starting chains each with length 100 million and combined within LogCombiner (Fig. 1B, Fig. S12).

S6.3 Window-based genome phylogenies
Phylogenies were reconstructed along the genome using maximum likelihood in PhyML v.3.1 (43). PhyML was run on the lift-over VCF using scripts available at (www.github.com/simonhmartin/genomics_general)using the phased option to split each individual genotype into two haplotypes. Windows were selected to be 5 kb in length with at least 50 informative positions and less than 50% missing sites. This resulted in 24 Resulting trees (referred to as "gene trees" regardless of their protein-coding content) were then pruned at their tips, to include only one individual per species selected at random using the custom python script, pruneTips.py. This resulted in 24,556 phylogenetic trees with one tip to represent each species. The proportion of each tree on each chromosome arm is shown in Fig. 2 and Fig. S13. The frequencies of commonly observed topologies on the autosomes and the X chromosome were comparable (Fig. S14).
S7. Species networks using D statistics and admixture graphs (Figs. S15-S16, table S6) We built admixture graphs (44) for all the major trees illustrated in Fig. 2A and Fig. S13. First, a table of D statistics (45,46) was calculated with ABBABABAwindows.py(www.github.com/simonhmartin/genomics_general)for all species triplets, using An. rivulorum as an outgroup (Table S6). Next, we fitted up to three admixture events using the add_an_admixture function in admixturegraph (44) (Figs. S15 and S16). After each admixture node was added, we sorted the resulting graphs in ascending order by the cost function. The cost function can be interpreted as the log 20 likelihood of the edge lengths and admixture proportions as graph parameters (44). The top 10 graphs were examined to ensure that each was correctly rooted with An. rivulorum. Only the best scoring graph, randomly selected in case of ties, was used for the next step. We continued to add admixture events until the cost function no longer decreased. Given that we have six species, the cost function was not improved after the addition of a third admixture node, hence we stopped at three admixture events. We compared among the resulting graphs using the likelihood ratio test demonstrated at www.cran.rproject.org/web/packages/admixturegraph/vignettes/admixturegraph.html. Among all pairwise comparisons only three graphs (Fig. 3) could not be significantly rejected by the likelihood ratio test at a corrected p-value of 0.001. (Fig. S17, Tables S7-S10)

30
The results from admixture graph were equivalent for the three network hypotheses involving introgression events projected onto tree i, iii, and vii (Fig. 3). To choose among the three trees, we used a model selection approach in abcrf v.1.8.1 (47). The program abcrf uses a supervised classification analysis of the competing evolutionary models based on a feature vector of population genetic summary statistics. We used a random forest approach over a strict ABC approach, to mitigate what is called the "curse of dimensionality" in the 35 computational analysis of highly multivariate data (48).
To train the random forest classifier under our three models, we used a custom python script, abc_sims.py, which utilizes the coalescent simulation program msmove (github.com/geneva/msmove) and models introgression as a single pulse event. We modeled population size changes using the results from MSMC2 (SI

40
Appendix, Text S5.2, Fig. S4), and represented the variance in observed mutation and recombination rates by drawing these parameter values from the empirical results obtained in SI Appendix, Text S5.3 and S5.4. We defined priors on divergence times using the equation dxy/(2μ) − 2Nanc generations, where dxy is the average pairwise divergence between species X and Y (Table S7) and Nanc is the ancestral population size (49).
We performed 100,000 coalescent simulations for each model with random parameter combinations drawn from priors defined in Table S8. Population genetic summary statistics were calculated using abc_scripts/abc_stats.py, which utilizes the two_popStatsML program of the FILET package (50) and the binned joint site frequency spectrum (51). To avoid biases caused by selection and linked selection, we calculated the observed statistics as the median value in windows of 10 kb located approximately 5 kb from defined protein coding regions on autosome arms; we did not consider the X chromosome. A comparison between non-coding windows and an approach using all windows did not yield different results.
We constructed the random forest in abcrf from 1,000 decision trees using 220 summary statistics with the lda option set to FALSE. We then used the predict function, also with 1,000 decision trees, to estimate the posterior probability of each scenario.
The confusion matrix (Table S9) demonstrates that we were able to confidently distinguish between alternative scenarios with a classification error rate of 3-5% and prior error rate of 0.0335 with 1,000 decision trees. The best model was based on tree vii with average of 600 votes out of 1,000 votes and a posterior probability of 0.68 (Table S10). Importance parameter values indicated that the most informative statistics were those associated with the species pairs Lik-Van, Fun-Lon, and Fun-Par (Fig. S17). (Table S11).

10
We used the abc package (52) to estimate the divergence and introgression times under our best model, tree vii, with three introgression events. Priors were defined as in Table S8. The program abc is not able to utilize the entire vector of population genetic summary statistics we calculated for model selection (SI Appendix, Text S8). Therefore, we used a subsample of 20,000 simulations and estimated the contribution of each summary statistic 15 to the parameter estimation using the regression function in abcrf (53). We then chose the top 25 statistics (sorted in descending order on most to least important) to inform the parameter estimates in the package abc.
We ran an additional 2,000,000 simulations using the custom python scripts noted in SI Appendix, Text S8. Posterior estimates of introgression times and divergence times were estimated in abc using the neural net option with tol = 0.005, sizenet = 10, and numnet = 15 (52). Cross-validation was used to examine the effect of tolerance choice on parameter estimates by running 100 cross-validation simulations and using the rejection method and a vector of tolerances, tols = 0.01, 0.005, 0.0005. Priors and posteriors were examined using the plot function in abc.
Results are presented in Table S11. Divergence and introgression times are presented in generations assuming a mutation rate of 2.8 x 10 -9 mutations per site per generation (30). Values in parentheses are the 95% credible 20 intervals for the value.
In addition to divergence and introgression, we estimated the current effective population size for each species using the vector of population genetic summary statistics. We allowed for each species to have a uniform prior on its current population size and defined by the 95% confidence intervals from MSMC2 (SI Appendix, Text 25 S5.2). For all species the estimates were consistent with the median of the last epoch of MSMC2, except for An. funestus which had an effective population size twice as large as the MSMC2 estimate and An. funestus-like that had a population size three times larger (Table S11).
We verified that each An. funestus population was equally distant from An. parensis using pair-wise genetic distance (dXY). We were particularly interested in whether An. funestus populations geographically closer to An. parensis (sampled from Mozambique, Table S1) have a smaller genetic distance from An. parensis potentially indicative of recent or ongoing introgression. We calculated dXY in 10 kb windows along the genome using scikitallel. We found that the pairwise nucleotide divergence was highly similar between each An. funestus population and An. parensis (dXY avg 0.026, stdev 0.0004).

S10. Identifying genomic regions of introgression by machine learning (Figs. S18-S22, Tables S11-S13)
To characterize where in the genome introgression took place we used the program FILET (Finding Introgressed Loci using Extra Trees Classifiers) (50). FILET uses simulated training data in a random forest classifier to assign windows along the genome as introgressed or non-introgressed, and it can also define the direction of 40 introgression. We chose FILET because we can train the classifier using simulations that are custom to our model for evolution of the AFC (Table S11). This allows us to consider the entire evolutionary history of the AFC with multiple introgression events and specific demographic processes.
The FILET classifier was trained on population genetic summary statistics generated using the coalescent 45 simulator msmove (github.com/geneva/msmove) for each category (migration of species 1 into species 2, mig21; migration of species 2 into species 1, mig12; and no migration, noMig). We expanded our analyses to all pairs of species and not just the three introgression events in tree vii because it was possible that minor events, contributing less to the observed D statistics used by admixturegraph, may still be present among the AFC species. Following the notation of (50), we define TM (time since migration) as a proportion of TD (time since divergence). We only include parameter combinations of introgression times that were as old as 75% of the divergence time for pairs of species. Introgression intensity (PM) was defined as a pulse migration event using the flag -ev in msmove.
We inferred that An. funestus and An. parensis had gene flow during two different times (Events A and C, Fig.  1D). To differentiate these events, we trained the classifier on simulated data under two exclusive scenarios, one that tested migration at event A but barred it at event C, and a second under the converse. The mean divergence among An. funestus populations is consistent with a very recent expansion across Africa (SI Appendix, Text S5.7), suggesting that geographic source of An. funestus should not have a major impact.

10
Accordingly, as our sample size for An. funestus was small for most locations (Table S1), we used the Mozambique population of An. funestus, for which we had six specimens.
We generated 180,000 simulations (60,000 for each migration direction and no migration) with parameter combinations drawn from uniform distributions of TM (0, 0.75 x TD], and PM defined by results from the ABC analyses (Table S11). Separate simulations were performed for the X chromosome, under the assumption of an effective population size of 0.75 relative to the autosomes. We masked our simulated data to match masking in our genome data following suggestions at https://github.com/kern-lab/FILET using a custom python script, makeFILETmask.py. We withheld 10,000 simulations for construction of a confusion matrix and to verify posterior cutoffs (Table S12). In all cases the rate of false positives was very low, however some classifications 20 (mostly involving older events) had a high rate of false negatives in at least one direction.
Population genetic summary statistics for classifying windows as introgressed were constructed for each pair of species (Fun-Lik, Fun-Van, Fun-Lon, Fun-Par, Lik-Van, Lik-Lon, Lik-Par, Van-Lon, Van-Par, Lon-Par) from phased FASTA files masked for missing data and repeat motifs. Each summary statistic was calculated in 10 kb 25 windows with a sliding step of 1 kb, omitting any window for which > 50% of sites were masked or missing. The classifiers were trained using a feature vector of all population genetic summary statistics.
Following classification, we clustered adjacent windows showing evidence of introgression by joining consecutive windows with > 90% probability of introgression (i.e. the probability of no-introgression class <10%).
Plots of genome-wide introgression are presented in Figs. S18-S22 and Fig. 4. Estimates of the amount of introgressed DNA between species pairs, and the amounts and proportions of introgression on the autosomes versus the X chromosome are presented in Table S13. (Table S14) 35

S11. Detecting introgression using branch lengths
As a secondary test of our introgression results we used the branch length test provided in the program QuIBL (54). Species triplets were extracted from the 24,556 phylogenetic trees representing ~122.78 Mb of aligned sequence data (SI Appendix, Text S6). For a given species triplet, QuIBL computes the frequency of a given triplet tree and estimates the distribution of branch lengths under the three alternate trees. We ran QuIBL on 40 every species triplet under default parameters with number of steps (numsteps) equal to 50 and specifying An. rivulorum as root. QuIBL classifies triplet topologies as resulting from either ILS or introgression/speciation by fitting the branch lengths to an expected distribution under each scenario (54). We examined the triplet count for the autosomes (total trees 22,048) and X chromosome (total trees 2,508) and kept only those counts which were derived from introgression (except as noted in Table S14) by conditioning on a Bayes factor of <= -10.0

45
(supporting of the introgression model over ILS). To determine which arrangements were introgressed we evaluated the results against tree vii, the hypothesized species branching order (SI Appendix, Text S8). S12. Introgression and inference from mtDNA (Fig. S23)

50
Complete mtDNA genomes from mosquitoes morphologically identified as An. funestus were recently sequenced and assembled from three localities in southern and Central Africa (55). The Bayesian phylogeny reconstructed from these sequences revealed two deeply diverged lineages, whose last common ancestor was estimated at ~500,000 years ago. To better understand the implication of these results in the context of our findings from the present study, we aligned the individual consensus mtDNA genomes from our study (SI Coding sequences were concatenated after removing the AT-control region, tRNA and rRNA. We reconstructed a Bayesian phylogeny from the concatenated coding sequences using BEAST2 (21) under the GTRGAMMA model, allowing for estimation of different rates per gene but assuming the same tree. BEAST2 was run with three different starting chains each with length 100 million and combined within LogCombiner. The resulting mtDNA genome phylogeny is presented in Fig. S23.  Fig. S1. Confirmation of species assignments through read mapping to ribosomal ITS2 (50)] to the nucleotide diversity in population 2, ibsMaxB (identify by state maximum length for second species), Wright's fixation index (Fst), gmin (59), the ratio of dmin to the nucleotide diversity in population 1 (dd1) (50), and average linkage disequilibrium within populations to average LD within the global population (zx) (50). Species abbreviations: An. funestus (Fun), An. funestus-like (Lik), An. longipalpis C (Lon), An. parensis (Par), and An. vaneedeni (Van). An. funestus (3Ra, 3Rb, 3La, and 2Ra) are indicated by double-headed arrows. Centromeres are represented as black ¼ circles at the right of each chromosome arm. The hatched box represents masked region. An. funestus (3Ra, 3Rb, 3La, and 2Ra) A (Fig1D) An. funestus (3Ra, 3Rb, 3La, and 2Ra)