Extreme heterogeneity in sex chromosome differentiation and dosage compensation in livebearers

Significance Morphologically and functionally distinct X and Y chromosomes have repeatedly evolved across the tree of life. However, the extent of differentiation between the sex chromosomes varies substantially across species. As sex chromosomes diverge, the Y chromosome gene activity decays, leaving genes on the sex chromosomes reduced to a single functional copy in males. Mechanisms have evolved to compensate for this reduction in gene dosage. Here, we perform a comparative analysis of sex chromosome systems across poeciliid species and uncover extreme variation in the degree of sex chromosome differentiation and Y chromosome degeneration. Additionally, we find evidence for a case of chromosome-wide dosage compensation in fish. Our findings have important implications for sex chromosome evolution and regulation.


Supplementary Materials and Methods
Sample collection and sequencing. We collected adult male and female individuals from four guppy species (Poecilia wingei from our laboratory population, Poecilia picta from Guyana, Poecilia latipinna and Gambusia holbrooki from Florida, USA). We chose these samples in order to obtain an even phylogenetic distribution. The species we assessed exhibit clear somatic dimorphisms, including coloration and size, in addition to gonadal differences. Most notably, females possess an enlarged abdomen and anal fin. In males, the anal fin is modified to form a gonopodium (i.e. and intromittent organ), which From each species, we immediately stored head and tail samples from three males and three females in ethanol and, RNAlater, respectively. We extracted DNA from heads with the DNeasy Blood and Tissue Kit (Qiagen) and RNA from tails with the RNeasy Kit (Qiagen), following the manufacturer's instructions. Library preparation and sequencing were performed at The Wellcome Trust Centre for Human Genetics, University of Oxford, following standard protocols and using the Illumina HiSeq 4000 platform.
Genomic DNA was used to construct paired-end (PE) sequencing libraries with short insert sizes (average insert size 500bp) and mate-pair (MP) libraries with long insert sizes (average insert size 2kb) for each individual. The Nextera Mate Pair Sample Preparation Kit was used for preparing mate-pair libraries. We assessed data quality with FastQC v0.11.3 (www.bioinformatics.babraham.ac.uk/projects/fastqc/) and used Trimmomatic v0.36 (1) to trim reads. For both DNA-seq and RNA-seq reads we removed adaptor sequences, regions of low Phred score (reads with average Phred score <15 in sliding windows of four bases and reads with leading/trailing bases with a Phred score < 3) and short reads (if either read in a pair was shorter than 50bp). Genome assembly. We first corrected the reads using Quake v0.3.5 (2) and estimated the optimal assembly k-mer length using KmerGenie v1.6741 (3). We then used SOAPdenovo v2.04 (4) to construct female de novo genome assemblies for P. wingei, P. picta and G. holbrooki and a male assembly for P. latipinna, using both the paired-end and mate-pair reads (Table S2). The paired-end reads were used for both the contig and scaffolding steps of the assembly process, while the mate-pair reads were only used for scaffolding. Additionally, we used the SOAPdenovo GapCloser module to close the gaps resulting from the assembly scaffolding step. Finally, we removed sequences shorter than 1kb from the assemblies.
To improve assembly contiguity and to reconstruct chromosomal fragments for each species, we followed the UCSC chains and nets pipeline from the kentUtils software suite (5) before employing the Reference-Assisted Chromosome Assembly (RACA) algorithm (6). The chains and nets pipeline is designed for building pairwise nucleotide alignments and bridging gaps between pairwise syntenic blocks to construct larger structures (5). A chain alignment represents an ordered pairwise sequence alignment between two species.
A net alignment represents a collection of chains within a genome region, ordered in a hierarchical manner based on synteny scoring. The RACA algorithm incorporates the pairwise alignment files, together with read mapping information to identify syntenic fragments (regions which maintain sequence similarity and order) across the species used. RACA then estimates adjacency between syntenic fragments in each target genome to reconstruct predicted chromosome fragments (PCFs) for each target species (6).
First, for each species, we carried out DNA-seq read mappings to the de novo assemblies using Bowtie2 v2.3.3.1 (7), reporting concordant mappings only (--no-discordant option) and using the appropriate mate orientations according to the insert size of the libraries (-fr option for short-insert libraries and --rf option for long-insert libraries). The resulting alignments were converted into the RACA-specific input format (script available on the RACA website http://bioen-compbio.bioen.illinois.edu/RACA/). We also obtained pairwise alignments using LASTZ (www.bx.psu.edu/%7Ersharris/lastz/; parameters C=0, E=30, H=2000, K=3000, L=3000, O=400, M=50) between a reference species (here we used the X. hellerii genome, obtained from NCBI GenBank Xiphophorus_hellerii-4.0, assembly accession GCA_003331165.1), the target species and an outgroup species (here we used the Medaka, Oryzias latipes, genome, obtained from GenBank ASM223467v1, assembly accession GCA_002234675.1). We then converted these alignments into chains and nets formats following the UCSC axtChain (-minScore=1000, -linearGap=medium), chainAntiRepeat, chainSort, chainPreNet and netSyntenic tools (5). The syntenic chains and nets fragments, together with the paired-end alignments, were used as input files for RACA (resolution=10000 for P. picta and P. latipinna and resolution=1000 for P. wingei and G. holbrooki). For each target species, RACA ordered and oriented target scaffolds into PCFs (Table S2), and we used this positional information of scaffolds in the genome for all further analyses.
Analysis of genomic coverage. For each species, using BWA v0.7.12 (8), we mapped male and female paired-end DNA-seq reads to the de novo scaffolds with positional annotation from RACA, following the aln and sampe alignment steps, and extracted uniquely mapping reads. We then used soap.coverage v2.7.9 (http://soap.genomics.org.cn/) to calculate the coverage (number of times each site was sequenced divided by the total number of sequenced sites) of each scaffold in each sample. For each scaffold, we calculated the male to female (M:F) fold change coverage as log2(average male coverage) -log2(average female coverage).

SNP density analysis.
For each species, using Bowtie1 v1.1.2 (7), we mapped male and female paired-end DNA-seq reads to the de novo scaffolds with positional annotation from RACA, generating map format output files. We sorted the map files by scaffold and converted them into profiles, which represent counts for each of the four nucleotide bases, for each individual using bow2pro v0.1 (http://guanine.evolbio.mpg.de/). For each site, we applied a minimum coverage threshold of 10 and called SNPs as sites with a major allele frequency of 0.3x the total site coverage. We obtained gene information through the expression analysis detailed below and for each gene we calculated the average SNP density as the number of SNPs divided by the number of filtered sites. We excluded SNPs outside of genic regions. For each gene we then calculated M:F fold change SNP density as log2(average male SNP density) -log2(average female SNP density).
Detection of sex chromosome non-recombining regions and strata of divergence. We used the fold change coverage and SNP density estimates to distinguish regions that are homologous and recombining between the sex chromosomes from regions that show full or even partial sex chromosome divergence, and which are hence non-recombining. For each species, we generated 95% confidence intervals based on bootstrapping autosomal M:F coverage ratios and autosomal M:F SNP density ratios separately. For XY systems, we defined non-recombining, older strata of divergence as regions with a significant decrease in M:F coverage ratio outside the 95% confidence interval. In addition, we defined younger strata of divergence as regions with no reduction in male coverage but with a significant increase in M:F SNP density ratio outside the 95% confidence interval.
Conversely, for ZW systems, a significant increase in M:F coverage ratio and a significant decrease in M:F SNP density ratio are expected for older and, respectively, younger regions of divergence.
Gene expression analysis. For each species, using HISAT2 v2.0.4 (9), we mapped male and female RNA-seq reads to scaffolds with positional annotation from RACA, reporting paired (--no-mixed) and concordant (--no-discordant) mappings only and tailoring the alignments for downstream transcript assembly (--dta). We used SAMtools to sort by coordinate and bam convert the sam output files. For each sample, we then used StringTie (10) to obtain transcripts in a GTF file format, which we then merged to assemble a non-redundant set of transcripts for each species. Before further analyses, we filtered the merged GTF file for non-coding RNA (ncRNA) by using BEDtools getfasta We identified sex-biased genes in EdgeR using a minimum of two-fold differential expression (log2 M:F RPKM > 1 for male-biased genes and < -1 for female-biased genes) and a significant p value (padj < 0.05 based on FDR correction for multiple testing (15)).
We tested for an enrichment of GO terms in the non-recombining regions of the sex chromosomes relative to the rest of the genome in each species. We first extracted the longest isoform for each gene from the Danio rerio (GRCz10) coding sequences from Ensembl 84. We then BLASted longest isoforms from each of our target gene sets to the D. rerio sequences with BLASTn v2.3.0 (16), using an e-value cutoff of 10e -10 and a minimum percentage identity of 30%. For genes with multiple alignment hits, we chose the top blast hit based on the highest BLAST score. We then compared D. rerio orthologues for genes in the non-recombining regions with those for genes in the rest of the genome using GOrilla (17).
k-mer analysis. In order to identify shared Y sequence across P. reticulata, P. wingei and P. picta, we followed a k-mer analysis method previously described in Morris et al. 2018 (18). We have previously used this approach to successfully identify shared Y sequence between P. reticulata and P. wingei (18). Briefly, here we used the HAWK pipeline (19) to count k-mers from paired-end DNA-seq reads and identify unique k-mers for each sex in each species. Across all the species, we then identified shared female unique k-mers and shared male unique k-mers, referred to as Y-mers (18).

Allele-specific expression (ASE) analysis.
In order to estimate ASE patterns from RNAseq data, we tailored previously published pipelines (20,21). For each species, we called SNPs separately for males and females using SAMtools mpileup and varscan (22), with parameters --min-coverage 2, --min-ave-qual 20, --min-freq-for-hom 0.90, and excluding triallelic SNPs and Ns. Additionally, we removed SNPs that were not located within genic regions from the final filtered gene dataset. To exclude potential sequencing errors from our SNP dataset, we applied coverage filtering thresholds (20,21). Firstly, we set a minimum site coverage of 15 reads (the sum of major and minor alleles), as a power analysis indicated that at a minimum coverage of 15 reads we have a 78% power to detect a signal of allele-specific expression. Secondly, we applied a variable coverage filter that accounts for the change in the likelihood of sequencing errors at different coverage levels (accounting for an error rate of 1 in 100 and a maximum coverage for a given site of 100,000 (20)). Lastly, to avoid the potential bias in our ASE estimations from the preferential assignment of reads to the reference allele (23), we removed clusters of more than 5 SNPs in 100 bp windows.
If genes have biallelic expression, meaning that alleles from both chromosomes are expressed at the same level, we expect a probability of around 0.5 of recovering reads from either chromosome. For each SNP in the final filtered dataset we tested for ASE by identifying significant deviations from the expected probability of 0.5 using a two-tailed binomial test (p < 0.05). We corrected for multiple testing when running binomial tests on autosomal SNPs. Additionally, we called SNPs as ASE if a minimum of 70% of the reads stemmed from one of the chromosomes. We called genes as ASE if they had at least one SNP with a consistent ASE pattern across all heterozygous samples. We tested for significant differences in ASE patterns between the sexes and between the autosomes and the sex chromosomes using chi-square tests.