Genome analyses reveal the hybrid origin of the staple crop white Guinea yam (Dioscorea rotundata)

Significance Guinea yam is an important staple tuber crop in West Africa, where it contributes to the sustenance and sociocultural lives of millions of people. Understanding the genetic diversity of Guinea yam and its relationships with wild relatives is important for improving this important crop using genomic information. A recent genomics study proposed that Guinea yam originated from a wild relative, the rainforest species Dioscorea praehensilis. Our results based on sequencing of 336 Guinea yam accessions do not support this notion; rather, our results indicate a hybrid origin of Dioscorea rotundata from crosses between the savannah species Dioscorea abyssinica and D. praehensilis.

6,124,093 yes yes -  Table S2. Comparison of heterozygosity levels in the five clusters of D. rotundata. Heterozygosity level of an individual is defined as the ratio of number of heterozygous SNPs to the total number of mapped sites to the reference genome. The diagonal cell represents the mean ± standard deviation of the heterozygosity levels of the samples in each cluster. The other cells represent P-values of the difference of the heterozygosity levels between the two clusters as calculated by two-tailed Student t test. Cluster 1 has a significantly higher heterozygosity level than the other clusters.

Materials and Methods
Table of contents S1. Reference assembly S1.1 Whole-genome sequencing using Oxford Nanopore Technology S1.2 Quality control S1.3 De novo assembly S1.4 Polishing and removing duplicated contigs S1.5 Gene prediction and annotation S2. Generation of pseudo-chromosomes by anchoring contigs onto a linkage map S2. Haplotype network analysis of the whole plastid genome S7. Inferring the changes in population size S8. Exploring the possibility of extensive introgression from wild Dioscorea species S1. Reference assembly S1.1 Whole-genome sequencing using Oxford Nanopore Technology To generate version 2 of the Dioscorea rotundata reference genome sequence, we sequenced an F1 individual plant named "TDr96_F1" using the PromethION sequencer (Oxford Nanopore Technologies). "TDr96_F1" was the same individual plant used to obtain version 1 of the D. rotundata reference genome sequence (1). "TDr96_F1" DNA was extracted from fresh leaves as described (1). The DNA was subjected to size selection and purification with a gel extraction kit (Large Fragment DNA Recovery Kit; Zymo Research). The purified DNA was sequenced using PromethION at GeneBay, Yokohama, Japan (http://genebay.co.jp).

S1.2 Quality control
As a first step in our pipeline for genome assembly (Fig. SM1), we removed the lambda phage genome from raw reads with NanoLyse v1.1 (2). We then filtered out reads with an average read quality score of less than 7 and those shorter than 1,000 bases with Nanofilt v2.2 (2). This was followed by trimming of the first 75 bases to remove low-quality bases in all read that were retained. This generated 3,124,439 reads, corresponding to 20.89 Gbp of sequence (Table SM1).

Short reads
Short reads were generated by Illumina platform Step 1 Step 2 Step 3 Step 4  Table SM1. Summary of filtered ONT reads.
-Raw reads were registered in DDBJ under accession number DRR196916.
-Genome coverage was estimated based on the expected genome size of D. rotundata (570Mb).

S1.3 De novo assembly
We assembled filtered long DNA sequence reads with Flye v2.4.2 (3), using 570 Mb as the estimated genome size of D. rotundata (1). This generated 8,721 contigs with N50 of 137,007 base pairs (Step 1 in Table SM2) and a total size of 636.8 Mb, which is larger than the expected D. rotundata genome size of 570 Mb. To evaluate the completeness of the gene set in the assembled contigs, we applied BUSCO analysis (Bench-Marking Universal Single Copy) v3.0.2 (4). For BUSCO analysis, we set "genome" as the assessment mode and used Embryophyta odb9 as the database and obtained 40.7% complete BUSCOs (Step 1 in Table SM2). Table SM2. Summary of the reference assembly.

S1.4 Polishing and removing duplicated contigs
To correct the assembled contigs, we repeatedly polished them with Illumina short reads (Table SM3) using Pilon v1.23 (5) until there was no further change in the % of complete BUSCOs. We aligned Illumina jump reads as single reads to the assembled contigs using the bwa mem command in BWA v0.7.17 (6) and sorted the BAM files with SAMtools v1.9 (7). The BAM files were used to run Pilon with the option "--diploid". We polished the contigs six times. The percentage of complete BUSCOs Step 1 Step 2 Step 3 Step 4 was 89.9% after the first polishing step (Step 2 in Fig. SM1). To remove duplicated contigs, we used Purge Haplotigs v1.0.2 (8), which removes duplicated contigs based on depth and the number of matching bases (Step 3 in Fig. SM1). In Purge Haplotigs, the percent cutoff of aliment coverage was set to 95%. Finally, we polished the contigs again. The percentage of complete BUSCOs was 90.1% after the second polishing process (Step 4 in Fig. SM1). Comparing the features in the old reference genome with the new reference genome, the number of missing bases ("N") was drastically reduced (Table SM4). Table SM3. Sequence list used for polishing.
-All values are calculated after quality control.
-Genome coverage was estimated based on the expected genome size of D. rotundata (570 Mb). Table SM4. Comparison of the old (1) and new reference assemblies.
*In Version 2, contigs were used instead of scaffolds.

S1.5 Gene prediction and annotation
For gene prediction, we used 20 RNA-Seq data sets representing 15 different organs and three different flowering stages in male and female plants (Table SM5). Total RNA was used to construct cDNA libraries using a TruSeq RNA Sample Prep Kit V2 (Illumina) according to the manufacturer's instructions. The extracted RNA was sequenced on the Illumina platforms NextSeq500 and HiSeq4000. In the quality control step, we filtered the reads and discarded reads shorter than 50 bases and those with an average read quality below 20 and trimmed poly(A) sequences with FaQCs v2.08 (9). Quality trimmed reads were aligned to the newly assembled contigs with HISAT2 v2.1 (10) with the options "--no-mixed --no-discordant --dta". Transcript alignments were assembled with StringTie v1.3.6 (11) separately for each BAM file. These GFF files were  (12) with the option "--filter-min-length 150", generating 26,609 gene models within the new assembly (Table SM6). Additionally, coding sequences (CDSs) that were predicted using the previous reference genome (1) were aligned to the newly assembled contigs with Spaln2 v2.3.3 (13). Consequently, 8,889 CDSs that did not overlap with the new gene models were added to the new gene models (Table SM6). Finally, gene models shorter than 75 bases were removed, and InterProScan v5.36 (14) was used to predict ORFs (open reading frames) and strand information for each gene model. We predicted 35,498 genes, including 66,561 transcript variants (Table SM6). For gene annotation, the predicted gene models were searched in the Pfam protein family database using InterProScan (14) and with the blastx command in BLAST+ (15) with the option "-evalue 1e-10", using the Viridiplantae database from UniProt as the target database. The resulting gene models and annotations were uploaded to ENSEMBL (http://plants.ensembl.org/Dioscorea_rotundata/Info/Index).

S2.1 Preparing the mapping population
To develop the chromosome-scale TDr96_F1 genome sequence from the assembled contigs, we generated an F1 population containing 156 individuals by crossing two D. rotundata breeding lines: TDr04/219 as the female parent (P1) and TDr97/777 as the male parent (P2).

S2.2 Whole-genome resequencing
We extracted each DNA sample from dried D. rotundata leaves as described (1). Libraries for PE short reads were constructed using an Illumina TruSeq DNA LT Sample Prep Kit (Illumina). The PE library was sequenced on the Illumina Hiseq4000 platform. A summary of sequence and alignment information is provided in Table SM7 (attached at the bottom of this file).

S2.3 Quality control and alignment
We used FaQCs v2.08 (9) to remove unpaired reads and adapters. We then filtered out reads shorter than 75 bases or those whose average read quality score was 20 or lower with prinseq-lite v0.20.4 lite (16). We also trimmed bases whose average read quality score was below 20 from the 5¢ end and the 3¢ end using the sliding window approach (the window size was five bases, and the step size was one base) in prinseq-lite (16). Subsequently, we aligned the filtered reads of P1, P2, and F1 progenies to the newly assembled contigs (supplementary text S1) using the bwa mem command in BWA (6). After sorting the BAM files, we only retained properly paired and uniquely mapped reads using SAMtools (7).

SNP-type heterozygous markers
SNP-based genotypes for P1, P2, and F1 progenies were obtained as a VCF file. The VCF file was generated as follows: (i) SAMtools v1.5 (7) mpileup command with the option "-t DP,AD,SP -B -Q 18 -C 50"; (ii) BCFtools v1.5 (17) call command with the option "-P 0 -v -m -f GQ,GP"; (iii)  (17) norm command with the option "-m+any" (Fig. SM2). We rejected the variants with low read depth (<10) or low genotype quality scores (<10) in the two parents. We regarded variants with low read depth (<8) or low genotype quality scores (<5) in F1 progenies as missing and only retained the variants with low missing rates (<0.3). Subsequently, only bi-allelic SNPs were selected by the BCFtools (17) view command with the option "-m 2 -M 2 -v snps". Referring to the genotypes in the VCF file, heterozygous genotypes called by unbalanced allele frequency (out of 0.4-0.6 in two parents, and out of 0.2-0.8 in F1 progenies) were regarded as missing, and filtering for missing rate (<0.3) was applied again. Finally, a binomial test was performed to reject SNPs affected by segregating distortion in the F1 progenies. This binomial test assumes that the probability of success rate is 0.5 based on the two-side hypothesis, and we regarded variants having p-value less than 0.2 as segregation distortion.

Presence/absence-type heterozygous markers.
A VCF file was generated to search for positions with contrasting read depth between the two parental plants P1 and P2 using the following commands: (i) SAMtools (7) mpileup command with the option "-B -Q 18 -C 50"; (ii) BCFtools (17) call command with the option "-A"; and (iii) BCFtools (17) view command with the options "-i 'MAX(FMT/DP)≥8 and MIN(FMT/DP)≤0' -g miss -V indels". This means that one of the parents (P1 or P2) has enough read depth (≥8) and another parent has no reads aligned on that region (A in Fig. SM3). Subsequently, we converted continuous positions in the VCF file to a feature that provides the start and end coordinate information of a region using the BEDTools v.2.26 (18) merge command with the option "-d 10 -c 1 -o count". We only retained sufficiently wide features (≥50 bp) in the BED file (the 1st BED). To reject false positives whereby low-depth regions are erroneously regarded as absent regions, we focused on both the boundary regions around each feature and the features themselves. For boundary regions, the 2nd BED file including expanded (twice-sized) features of each feature given in the 1st BED was generated with the BEDTools (18) slop command with the option "-b 0.5 -pct".
Using the depth value in each feature given in the 1st BED, presence/absence-based genotypes for parental plants P1 and P2 and F1 progenies were determined. To verify the rejection of false-positive features, we also referred to the depth values in the boundary regions around each feature. Verified features were only accepted as presence/absence markers. The depth values in each feature were calculated with the SAMtools (7) bedcov command with the option "-Q 0". Also, the depth values in the boundary regions were obtained by subtracting the depth values of the 2nd BED from that of the 1st BED (B in Fig. SM3). For P1 and P2, we regarded genotypes having depth ≥ 8 as present genotypes, meaning the heterozygosity of present and absent, while those having depth < 2 were classified as absent genotypes, meaning the homozygosity of absent. For F1 progenies, we classified markers with depth > 0 and = 0 as present and absent markers, respectively. Finally, we applied the same binomial test for SNP-type heterozygous markers as that used for presence/absence-type heterozygous markers. (continued)

Integration of SNP-type and presence/absence-type heterozygous markers.
To develop parental line-specific linkage maps, we integrated SNP-type and P/A-type (presence/absence-type) heterozygous markers. Two types of markers were defined: Type-1 markers and Type-2 markers. If an SNP-type marker was heterozygous in P1 but homozygous in P2 or if a P/A-type marker was present in P1 and absent in P2, it was classified as a Type-1 marker (P1-heterozygous marker set). Conversely, if a SNP-type marker was homozygous and heterozygous in P1 and P2, respectively, or if a P/A-type marker was absent in P1 but present in P2, it was classified as a Type-2 marker (P2-heterozygous marker set).

S2.5 Anchoring and ordering contigs
Pruning and flanking markers based on Spearman's correlation coefficients. Distance matrices of Spearman's correlation coefficients (ρ) were calculated for every marker pair in each contig in each marker set (P1-heterozygous marker set and P2-heterozygous marker set). According to the histogram of absolute ρ calculated from each contig, most markers on the same contigs were correlated with each other (Fig. SM4). Therefore, we pruned correlated flanking markers to remove redundant markers (Fig. SM5). Accordingly, we obtained 11,389 markers for linkage mapping (Table SM8).   Linkage mapping. The markers obtained as described in the previous section were converted to genotype-formatted data. Based on this genotype-formatted data, genetic linkage maps were constructed using MSTmap (19) with the following parameters: "population_type DH; distance_function kosambi; cut_off_p_value 0.000000000001; no_map_dist 15.0; no_map_size 0; missing_threshold 25.0; estimation_before_clustering no; detect_bad_data no; objective_function ML" for each marker set. After trimming the orphan linkage groups, we solved the complemented-phased duplex linkage groups caused by coupling-type and repulsion-type markers in the pseudo-testcross method. Finally, two parental-specific linkage maps were constructed. These two linkage maps were designated as P1-map (constructed using Type-1 markers) and P2-map (constructed using Type-2 markers) (Fig SM6 and Fig SM7). The linkage groups were visualized by R/qtl (20). The numbering of linkage groups is the same as that used in the previous reference genome (1).

Integration of two parental-specific linkage maps into the chromosome-scale physical genome sequence.
Based on a matrix derived from the contigs shared between the P1-and P2-maps, i.e., linkage groups (Table SM9), the contigs were anchored and linearly ordered as pseudo-chromosomes. During the anchoring and ordering process, we identified contigs whose markers were allocated to different linkage groups. Such contigs were further divided into sub-contigs to ensure that they were not allocated to different pseudo-chromosomes. We divided the contigs at the proper positions as described previously (1). During this procedure, 34 genes including 61 transcript variants were cut and removed. Finally, a previously described method (1) was followed to generate the pseudo physical genome sequence composed of 20 pseudo-chromosomes. To compare the newly generated pseudo-chromosomes with the ones we constructed previously (1), we generated a dot plot with D-Genies (21) (Fig. SM8) and counted the anchored base pairs in the new pseudo-chromosomes (Table SM10). The resulting reference genome, including unanchored contigs, was uploaded to ENSEMBL (http://plants.ensembl.org/Dioscorea_rotundata/Info/Index).

S3.1 Whole-genome resequencing of Guinea yam accessions
For genetic diversity analysis, we selected 333 accessions of D. rotundata maintained at IITA, Nigeria, representing the genetic diversity of Guinea yam landraces and improved lines of West Africa. We extracted DNA from dried leaves of each D. rotundata accession as described (1). Libraries for PE short reads were constructed using an Illumina TruSeq DNA LT Sample Prep Kit (Illumina). The PE library was sequenced on the Illumina Nextseq500 or Hiseq4000 platform. Finally, P1 (TDr04/219) and P2 (TDr97/777) parents used to anchor the contigs and the reference individual "TDr96_F1" were added to the 333 accessions. Therefore, we used a total of 336 accessions for this analysis. A summary of the sequences and alignments is provided in Dataset S1.

S3.2 Quality control, alignment, and SNP calling
We used FaQCs v2.08 (9) and prinseq-lite v0.20.4 lite (16) for quality control. We used the same parameters provided in supplementary text S2.3, but both paired and unpaired reads were aligned to the new reference genome using the bwa mem command in BWA (6) with option "-a". After sorting the BAM files, the VCF file was generated using the SAMtools (7) mpileup command with the option "-t DP,AD,SP -B -Q 18 -C 50", and variants were called by the BCFtools (17) call command with the option "-P 0 -v -m -f GQ,GP". Low-quality variants were rejected using the BCFtools (17) view command with the options "-i 'INFO/MQ≥40, INFO/MQ0F≤0.1, and AVG(GQ)≥5'". We regarded variants with low read depth (<8) or low genotype quality score (<5) as missing, filtered out SNPs with high missing rates (≥0.3) across all samples, and only retained bi-allelic SNPs on the pseudo-chromosomes.  (Fig. SM9). We could not find the best K value based on the cross-entropy criterion, so we defined five clusters for convenience.

S3.4 Polymorphism and ploidy of nuclear genomes
Heterozygosity level and unique alleles. First, we calculated the heterozygosity level in each accession (Fig. S2). We defined the heterozygosity level as follows: where S is the number of heterozygous SNPs and L is the total number of mapped sites in an accession. The heterozygosity levels of each cluster were statistically compared by two-tailed Student t test (Table S2). Second, we counted the unique alleles in each cluster (Fig. S3). An allele was considered unique if it only existed in a cluster even when the allele was a singleton in all accessions.

Flow cytometry.
Ploidy level was estimated by flow cytometry using a Partec Ploidy Analyzer (Sysmex Partec, Gorlitz, Germany). Fully developed fresh young leaves were sampled and chopped with a razor blade (ca. 5 x 5 mm) in 0.4 mL nuclear extraction buffer (solution A of a High-resolution kit; Sysmex Partec, Gorlitz, Germany). The suspension was filtered through a nylon filter (50-μm mesh), and the extracted nuclei were stained with 4′,6-diamino-2-phenylindole solution. After 5 min of incubation at room temperature, the sample was examined in a ploidy analyzer at a rate of 5-20 nuclei/s. The DNA index (DI) of each accession was calculated based on the relative amount of DNA in nuclei at the G1 stage compared to the internal standard. Rice (Oryza sativa L.) was used as an internal standard for calibration of the measurements. Flow cytometry was repeated two or three times with different leaf samples to confirm the DI of each accession. The ploidy levels of each accession were determined by comparing their DI with that of the diploid accession "TDr1673", for which the chromosome number was confirmed microscopically to be 2n = 40. (Dataset S1)

Summary statistics of population genetics.
After removing the triploid accessions of cluster 1, we imputed missing genotypes using BEAGLE v4.1 (24) with default options. We then calculated the summary statistics of population genetics (Table S3). First, we counted segregating sites and singletons in 308 Guinea yam accessions. We also estimated Watterson's 3 (3 4 ! ) (25), pairwise nucleotide diversity (3 4 " ) (26), and Tajima's D (27) in the same dataset. We defined 3 4 ! as follows:  where a is equal to: and 2 7 is the number of average mapped sites in a population and n is the number of sequences. We also defined 3 4 " as: where 2 7 is the number of average mapped sites in a population, n is the number of sequences, and kij is the number of nucleotide differences between the ith and jth sequences.
We also calculated LD decay of 308 Guinea yam accessions (Fig. S4). The SNPs whose minor allele frequencies less than 0.05 were removed from the above SNP set used to calculate 3. LD decay was calculated with 200-kb window and 100-kb step. Ten SNPs were randomly sampled within a window, and all possible combinations of r 2 were calculated using the sampled SNPs within a window.

S4.1 Data preparation
For phylogenomic analysis of African yam, we used 308 Guinea yam accessions sequenced in the present study (excluding cluster 1 triploid accessions), as well as 80 D. rotundata, 29 D. abyssinica, 21 Western D. praehensilis, and 18 Cameroonian D. praehensilis accessions that were sequenced in a previous study (28) using two accessions of the Asian species D. alata as an outgroup (Table  SM11). Of the samples sequenced in the previous study (28), we only used sequences whose species labels matched a species predicted by admixture analysis in the previous study (28). Also, we removed the sequences that were labeled as hybrids in the previous study (28). Two sequences of D. alata downloaded from NCBI were used as the outgroup (Table SM11). Subsequently, read quality control, alignment, and SNP calling of these 458 sequences were conducted using the pipeline described in supplementary text S3.2. Except for the Neighbor-joining (NJ) tree (29) (supplementary text S4.2), we only used SNPs with a missing rate < 0.3 in each targeted species. When the markers were polarized by comparison with the D. alata outgroup, the SNPs at positions where the alleles of D. alata were not completely fixed or where either of the D. alata sequences was missing were filtered out.

S4.2 Neighbor-joining tree
Before constructing the NJ tree (29), we only retained SNPs at positions with no missing data across all five species (D. rotundata, D. abyssinica, Western D. praehensilis, Cameroonian D. praehensilis, and D. alata). When we converted the VCF file including the remaining SNPs to a multi-FASTA file, heterozygous SNPs were converted to IUPAC code to characterize them as ambiguous markers. To construct the NJ tree (29), we ran MEGA X v10. 1.8 (30) using the 463,293 remaining SNPs. In MEGA X (30), the bootstrap value was set to 100 and the other parameters were set as default. Finally, the NJ tree was drawn with GGTREE v2.0.4 (31).

S4.3 Inferring the evolutionary history of wild Dioscorea species using ∂a∂i
To elucidate the evolutionary relationships of the three wild Dioscorea species, D. abyssinica (indicated as A), Western D. praehensilis (P), and Cameroonian D. praehensilis (C), which are closely related to D. rotundata, we performed ∂a∂i analysis (32). This technique allows evolutionary parameters to be estimated based on an unfolded site frequency spectrum. The joint unfolded site frequency spectrum was calculated based on the 17,532 polarized SNPs and was projected down to 25 chromosomes in each species.
First, three phylogenetic models, {{A, P}, C}, {{P, C}, A}, and {{C, A}, P}, were tested without considering migration among the species. The parameter bounds of each population size ranged from 10 -3 to 100, and those of each divergence time ranged from 0 to 3, as suggested in the ∂a∂i manual (https://dadi.readthedocs.io/en/latest/). The grid size was set to (40,50,60). The maximum iteration for an inference was set to 20. Randomly perturbing the initial values using the 'perturb_params' function in ∂a∂i (32), the parameters were inferred 100 times. Under these conditions, the {{A, P}, C} model had the highest likelihood out of the three models (Table S4).
Based on the assumption that {{A, P}, C} represents the true evolutionary relationship among the three wild Dioscorea species, the evolutionary parameters were re-estimated by ∂a∂i (32), allowing symmetric migration among the species. The parameter bounds of each symmetric migration rate ranged from 0 to 20, as also suggested in the ∂a∂i manual. The parameters were inferred 100 times by ∂a∂i (32) with different initial parameters, and the best parameter set was selected based on Akaike information criterion.

S4.4 Inferring the evolutionary history of wild Dioscorea species using fastsimcoal2
To complement our results and to exactly replicate the conditions used in the previous report (28), fastsimcoal2 (33), which was used in the previous study (28), was also used to test these three models ({{A, P}, C}, {{P, C}, A}, and {{C, A}, P}). Until the SNP calling step, we basically followed our own pipeline in supplementary text S3.2 based on the reference genome version 1 including the unanchored contigs (1) to be consistent with the previous study (28). The misclassified samples excluding hybrids were genetically re-classified by admixture analysis following the methods used in the previous study (28). The threshold of missing rate across all samples was set to 0.25, as proposed in the previous study (28). We obtained 87,671 SNPs using our pipeline, fewer than the number of SNPs analyzed in the previous coalescent simulation (28). Therefore, we skipped the down-sampling of the SNPs to 100,000, unlike in the previous study (28). For the other steps and the parameter bounds for the coalescent simulation by fastsimcoal2 (33), we followed the method used in the previous study exactly (28) using the same version of fastsimcoal2 (33).

S5.1 Site frequency spectrum polarized by two candidate progenitors of Guinea yam
We focused on the allele frequencies of 388 D. rotundata sequences, including 80 from the previous study (28), at the SNPs positioned over the entire genome that are oppositely fixed in the two candidate progenitors. The SNP set was generated as described in supplementary text S4.1. Based on this SNP set, 144 SNPs were oppositely fixed in the two candidate progenitors across all pseudo-chromosomes; the allele frequencies of these 144 SNPs were calculated and plotted.

S5.2 Inferring the domestication history of Guinea yam using ∂a∂i
To infer the domestication history of Guinea yam, we used ∂a∂i (32). Using the 15,461 polarized SNPs generated by following the method in supplementary text S4.1, three phylogenetic models, {{A, R}, P}, {{P, R}, A}, and {{A, R}, {P, R}} (hypothesis 1, 2, and 3 in Fig. 2A, respectively) were tested, considering symmetric migration among the species. The parameter bound for the admixed proportion from D. abyssinica ranged from 0 to 1. The other parameter bounds were the same as in supplementary text S4.3. The maximum iteration for an inference was set to 20. The parameters were inferred 100 times by ∂a∂i (32).

S5.3 Comparison of FST on each chromosome among three African yams
FST (34) among the three species (D. abyssinica, [Western] D. praehensilis, and D. rotundata) was calculated in each chromosome. We estimated FST using the formula: @ *+ = " + − " * " + where HT and HS are the expected heterozygosity in the total population and sub-divided population, respectively, which are equal to: where fA1 and fA2 are the allele frequencies in each population (34). Finally, the calculated FST were averaged in each chromosome.

S6. Haplotype network analysis of the whole plastid genome
The sample set used to construct the haplotype network of the whole plastid genome was the same as that used to construct the NJ tree (supplementary text S4.2). We aligned the 458 whole-genome sequences, together with the whole plastid genome of D. rotundata (1), to the newly improved reference genome of D. rotundata. We followed the pipeline described in supplementary text S3.2 for quality control and alignment. Because the plastid genome is haploid, the "--ploidy" option was set to 1 in the BCFtools call command (17) when SNPs were called. Singleton SNPs were removed as unreliable markers. SNPs with more than one low-quality genotype (GQ<127) across the samples were also removed as unreliable markers. We did not allow any missing data. Finally, a haplotype network was constructed using the retained 250 SNPs by the median joining network algorithm (35) implemented in PopART (36).

S7. Inferring the changes in population size
To explore the changes in population sizes, the demographic history of African yams was re-inferred by ∂a∂i (32) allowing migration. By fixing the parameters predicted in supplementary text S5.2 except for population sizes, we re-estimated each population size at the start and end points after the emergence of these species, assuming an exponential increase/decrease in population size. The parameter bounds of population sizes ranged from 10 -3 to 100, and the maximum iteration for an inference was set to 20. The parameters were inferred by ∂a∂i 100 times (32).

S8. Exploring the possibility of extensive introgression from Dioscorea species
To explore the possibility of multiple introgressions from both parental wild yams, the f4 statistic (37,38) was applied to the four clusters of D. rotundata excluding the cluster 1 triploid accessions.
Here, calculation of the f4 statistic requires four populations: PR1 is the first cluster of D. rotundata, PR2 is the second cluster of D. rotundata, PP is a population of (Western) D. praehensilis, and PA is a population of D. abyssinica. We where F( is the observed allele frequency in a window in population Pj. In most windows, A C . is close to zero, which means that the window has a concordant genealogy because the two clusters of D. rotundata have a small genetic distance (B in Fig. SM10). However, if these two clusters of D. rotundata have a large genetic distance and if one or both populations have a small genetic distance from a wild Dioscorea species, then A C . skews from 0. Therefore, a locus having a skewed A C . has a discordant genealogy (C or D in Fig. SM10). For PP (the population of D. praehensilis) and PA (the population of D. abyssinica), the samples sequenced in the previous study (28) were used (Table SM11), and the dataset was prepared as described in supplementary text S4.1. As the first screening, all possible combinations of the clusters of D. rotundata, excluding accessions in cluster 1, were used for PR1 and PR2 (Fig. SM11). In this analysis, we identified an extensive introgression around the SWEETIE gene (4.00 to 4.15 Mb on chromosome 17). Because clusters 2 and 5 have the same genealogy pattern around the SWEETIE gene, we merged them into one population (P25) and used this as PR1. Because cluster 4 has the opposite genealogy pattern to P25 around the SWEETIE gene, we used P4 as PR2. As a result, A C . (D -1 , D . , D 0 , D , ) was calculated for the second screening (Fig. 4). If a locus had |Z(f4)|>5, we regarded it as an outlier (red dots in Fig. 4B). To reveal the relationships of the D. rotundata accessions around the SWEETIE gene, a Neighbor-Net was constructed by SplitsTree v5.1.4 (39) using 308 D. rotundata accessions excluding the accessions in cluster 1, 29 D. abyssinica accessions, and 21 D. praehensilis accessions. A total 458 SNPs from the 4.00-4.15 Mb region on chromosome 17 were converted to multi-FASTA format. At that time, heterozygous genotypes were converted to IUPAC codes. Fig. SM10. Schematic explaining how f4 behaved in this study. "A" represents the population of D. abyssinica. "P" represents the population of D. praehensilis. "R1" represents the first populations of D. rotundata. "R2" represents the second populations of D. rotundata. This figure was adapted from (38).