Haplotype-resolved genome assembly and implementation of VitExpress, an open interactive transcriptomic platform for grapevine

Significance In contrast to most crops, Vitis vinifera benefited little from classical breeding due to their heterozygosis. Strikingly, the main cultivated grape varieties that we see today have remained the same as they were centuries ago; they have been poorly enriched with traits to adapt to changing environment. However, climate change and concerns about environment call for major changes in viticulture that require transitioning to knowledge-based concepts and advanced genomics tools. We report here the generation of haplotype-resolved genome assemblies for two grapevine cultivars and the setup of VitExpress, an open interactive transcriptomic platform, providing genome browser and integrated web tools for expression profiling and gene correlation studies. These community resources and tools are anticipated to foster advances in several areas of grapevine research.


Chasselas PacBio IsoSeq library preparation
Total RNA was obtained on five grapevine cv.Chasselas tissues (root, stem, leaf, flower and fruit) using the Spectrum Plant Total RNA Kit (Sigma-Aldrich, St. Louis, USA).Analyses of RNA quantity and quality were performed using NanoDrop and Qubit (Thermo Fisher Scientific, Waltham, MA, USA).RNA integrity was also assessed using the Agilent DNF-472 HS RNA (15 nt) kit on the Fragment Analyzer system (Agilent, Santa Clara, CA, USA).For each vine tissue, 900ng of poly(A) RNA was reverse transcribed into cDNA using the NEBNext® Single Cell/Low Input cDNA Synthesis & Amplification Module (New England Biolabs, Ipswich, Massachusetts, USA) according to PacBio recommendations (PN 101-892-000 Version 01).Amplified cDNA was purified using three different ProNex beads ratio, in order to modulate the full-length cDNA transcript size distribution (0.95X for short transcripts, 0.86X for typical transcripts centered around 2kb and 0.82X for long transcripts enrichment).The IsoSeq libraries were constructed using SMRTbell® Template Prep kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA) according to PacBio recommendations (PN 101-892-000 Version 01) and each sample was individually barcoded using SMRTbell barcoded adapter plate 3.0 (102-009-200).The sizes and concentrations of libraries were assessed using the 2100 Bioanalyzer system (Agilent, Santa Clara, CA, USA) and the Qubit dsDNA HS reagents Assay kit.After determining the molarity of each adapter-barcoded sample, they were pooled together according to their size distribution to generate 3 IsoSeq libraries.Sequencing primer v4 and Sequel® II DNA Polymerase 2.1 were annealed and bound, respectively to the SMRTbell libraries.Each of the three libraries was loaded on 1 SMRTcell 8M at an on-plate concentration of 80pM using adaptive loading.Sequencing was performed on the Sequel® II system with Sequel® II Sequencing kit 2.0, a run movie time of 24 hours with 120 min pre-extension step and Software version 11.0 PacBio by Gentyane Genomic Platform (INRAE Clermont-Ferrand, France).

Hi-C sequencing
The genetic material to produce Hi-C reads was extracted from young leaves of Chasselas and Ugni Blanc genotypes.To ensure the most informative Hi-C contact map we chose an endonuclease-based kit.Thus, the number of fragments produced is increased compared to the kits based on a combination of restriction enzymes.The Omni-C library was prepared using Dovetail Omni-C™ Kit (Cantata Bio, Scotts Valley, CA, USA) according to the manufacturer's protocol independently for Chasselas and Ugni Blanc.Briefly, starting from 1.5g of young leaves, chromatin was fixed in place in the nucleus.Fixed chromatin was digested with endonuclease DNase I then extracted.Chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends.After proximity ligation, crosslinks were reversed and the DNA purified from proteins.Purified DNA was treated to remove biotin that was not internal to ligated fragments.Sequencing library was generated using Illumina-compatible adapters.Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of the library.The libraries were sequenced on an Illumina NovaSeq6000 platform.Finally, to minimize the sequencing of chimeric reads, an insert size of 390bp and sequencing size of 150bp has been used.

Genomes assemblies
The full assembly process comprises 4 steps.(i) The pre-processing, consisting in HiFi read extraction (subreads correction), (ii) the raw data profiling, (iii) the graph resolution resulting in the draft primary assembly and the haplotype assemblies, and (iv) the Hi-C scaffolding using the contact-map.The produced assemblies were then assessed and polished.In parallel, the organelles' genomes were also assembled using supplementary contigs of hifiasm assembly results coupled with manual reconstruction.For each HiFi run, only reads considered as viable Circular Consensus Sequences (CCS) were extracted.CCS reads must have at least 3 passes of polymerase and a Phred Score quality greater than 30 (equivalent to read accuracy >0.99).CCS reads were blasted against the NCBI UniVec (1) to detect common contaminants.Only CCS reads displaying less than 10 % match with the contaminant were retained.A 250 bp-long contaminant threshold was applied to exclude nonspecific matches.To check for GC count anomalies of the filtered CCS reads, Kat plot (2) from k-mer matrix table was computed.Estimated CCS coverage was calculated with a hypothetical 500 Mb Vitis vinifera ssp.vinifera genome size for both cultivars.K-mer frequencies (21-length) were computed using Jellyfish2.Major metrics such as heterozygosity rate, the percentage of repetitions and estimated genome length were then inferred with GenomeScope 2.0 (3).HiFi reads were assembled using Hifiasm v.0.16.0-r369 (4) based on its global performance on heterozygous samples and its computational efficiency that allows easy adjustment of parameters.Regarding the heterozygosity of the samples and the use of two batches of CCS reads coming from two different SMRT cells, the similarity threshold for duplicate haplotigs was set up at -s 0.30 after adjustments and imbalance checking between haplotig genomes.Samples with high heterozygosity rate may lead to one assembly being much artificially larger than the other during haplotype phasing process.Therefore, the haplotig deduplication and similarity threshold parameters were set in order to minimize discrepancy in genome length between haplotigs and primary assemblies without altering the overall contiguity.The Hi-C scaffolding was produced taking advantage of linking information from the Hi-C contact map.Hi-C reads were firstly mapped to the draft genome with juicer (5) with default parameters.Then a first version of the scaffolded assembly was produced with 3D de novo assembly pipeline (6).This step was conducted with the parameter -r 0 to prevent automatic polishing, that will be implemented manually in a final step using juicebox software (7).The missing truncated telomeres were then retrieved from the Hi-C free draft assemblies.As for missing telomeres, we aligned independently the chr00 telomeric-like repeat enriched fragments and the raw subreads on telomeres and elongated the chromosomes when subreads aligned perfectly to the informative sequence.

Gap filling of scaffolds junctions
In order to produce the most contiguous assemblies, a remapping of the raw reads against first draft assemblies was processed.Each gap and surrounding areas were manually investigated using Integrative Genome Viewer v.2.12.0 (8).Reads covering the gap with minimum of 5kb overlap on each side with a mapping quality (MAPQ) of at least 60.These reads were then selected to build consensus sequence patches.

Retrieving mitochondrial and chloroplastic genomes
The mitochondrial and chloroplastic sequences were recovered by aligning contigs not related to chromosomes on custom BLAST databases of grapevine organelles' sequences (9).Using a referenceguided assembly strategy, contigs carrying organelle DNA were used to assemble the organelles.
10k SNPs Vitis genotyping array.A robust subset of 10k SNPs extracted from the 18k SNPs Vitis genotyping array covering 783 genotypes (12) was aligned on each primary and haplotype genome sequences with bwa-mem.Flanking sequences of 5p and 3p were aligned independently and SNP positions were then retrieved according to mapping orientation and CIGAR codes.In this way, only perfectly matching flanking sequences were used.R package "snpReady" (13) was used to evaluate Minor Allele Frequency (MAF), Nei's genetic diversity and Hardy-Weinberg equilibrium statistic for each of the SNP markers.The Euclidean genetic distance was computed on the SNP subset to build the distance matrix.SNPs that did not match the PN12X.v2-basedoriented chip were not considered for computing the genetic distance since they most likely correspond to misassembled PN12X.v2portions of the genome.In order to produce phylogenetic trees segregation for our genotypes, we conducted a simulation of diploidy for each haplotype.This was achieved by duplicating the alleles, which consequently resulted in them being homozygous for all markers.As for the real heterozygous assemblies we combined allelic information from each haplotype.Phylogenetic analyses were performed using Discriminant Analysis of Principal Components (DAPC) followed by a K-means clustering with k=8 taking into account the Bayesian Information Criterion (BIC) and using arbitrarily the same cluster number than the 10k SNPs study.The optimal number of PCs was determined by the optim.a.score function from adegenet R v.2.1.10package (14).DAPC crossvalidation was performed with default parameters by checking if the predictive success is maximized and the associated root mean squared error (RMSE) was minimized for our number of PCs.When associating K-means groups to the most representative geographic region of Vitis genotypes according to Bacilieri et al. (15), each K-means cluster refers to one distinct most representative region as described by Laucou et al. (12).

IsoSeq Model production. The first
Step of IsoSeq data processing consists in the production of the full length non chimeric transcripts (FLNC).Circular consensus reads (CCS) were extracted and demultiplexed from raw data.Then the CCS were cleaned from primer sequences and chimeric concatemer.PolyA tails were trimmed and reads were reoriented when needed.The obtained reads were hierarchically clustered to produce one consensus per read cluster.All previous steps were led using PacBio IsoSeq3 (16).Tissue specific FLNC transcripts were then mapped against Chasselas primary assembly.Using TAMA Collapse from Transcriptome Annotation by Modular Algorithms suite (17), the transcripts aligned were collapsed by tissue.Then all collapsed transcripts were combined into a single transcriptome model using TAMA Merge.To specifically find the protein coding transcripts, the transcriptome model was passed to TAMA ORF to get all open read frames (ORFs) and potential Nonsense-mediated Decays (NMD).The resulting transcripts were then "Blasted" using Blastp against UNIREF.An in-house script was then used to filter Blast results and get the list of best transcript candidates.
De novo genome annotation.Using EugeneEp (18), an automatic gene prediction was produced.The Pipeline was fed with multiple protein and transcriptomic evidences.The predicted annotation was filtered using 1,600 RNA-Seq signal coming from the mapping against the Chasselas genome.In parallel, current PNv4 annotation model was mapped using transcripts and proteins.Only transcripts with at least 80% identity and coverage were kept.The resulting model was then merged with the IsoSeq model using TAMA Merge.Repetitive elements were identified using several tools.For low complexity regions, we use Dustmasker (NCBI toolkit https://ncbi.github.io/cxx-toolkit/ ) and for complex repeats, a combination of LTR Harvest (v.2.9.5) (19) and Red (v.2.0) tools (20).Tandem repeats were found using Tandem repeats Finder v.4 (21).To detect de novo transposable elements, Repeat modeler (v.2.0) (22) combined with RepeatMasker (v.4) (23) was used.
Genome annotation "lift over".The gene model built from the Chasselas genome assembly was "lifted on" the Ugni Blanc and the four haplo-genomes.During the mapping process, genes from each chromosome were firstly aligned versus the corresponding putative chromosome to minimize biased gene correspondences, given that duplicated genes might catch the best alignment hit.If no solid hit were found, remaining genes were aligned on the other chromosomes.Genes with at least 80% coverage and 80% identity were finally "lifted" to build high quality models.Translocation events between genomes were then investigated by crossing SyRI outputs and liftoff outputs.Screening for transposable elements by RepeatMasker allowed to highlight structural re-arrangements co-localizing with transposable elements.Annotation models assessment.Assessment was made using BUSCO (24) in transcriptome mode against embryophyta.odb10and eudicots.odb10.The datasets contain 1,614 universal single-copy genes derived from 50 orthologous genomes and 2,326 universal single-copy genes derived from 30 orthologous genomes, respectively, and allow the evaluation of the gene model completeness.
VitExpress platform production.Data from NCBI SRA databank were filtered and retrieved using NCBI SRA Toolkit v3.0 (25).To ensure the data reliability, only samples associated with a publication were selected.For each selected project a manual description was produced using sample metadata crossed with the information found in the related publication.Then, each sample was processed with a unique RNA-Seq quantification pipeline, using fastQC v0.11.7 (26) for quality check, TrimGalore v0.6.5 (27) for sequence cleaning and trimming, STAR RNA-seq aligner v2.7.9a (28) for the mapping step.Finally, the raw count matrix was produced using FeatureCounts from Subread package v2.03 (29).RNA-seq signal tracks was produced using deepTools v3.0.2 (30).In-house R-script was used to produced normalized expression matrix.The web Application platform was developed using Symfony v.5 (31), a PHP framework coupled with JavaScript libraries: ChartJs (32) for simple charts, Inchlib (33) for HeatMaps, and custom implementation of D3.js (34) for the correlation StatTools.From the RNA-Seq analysis results, an object-relational Database was built using Doctrine (35), a collection of PHP libraries allowing the database storage and object mapping with Symfony.Raw read Kmer profiling with GenomeScope.First peak stands for unique heterozygous k-mers (25x coverage) and second peak for unique homozygous k-mers (50x coverage).The coverage of 75x and 100x stands for repetitive homozygous and repetitive heterozygous k-mers, respectively."len" stands for inferred genome length; "uniq" for non repetitive sequence ratio; "aa" for homozygosity rate; "ab" for heterozygosity rate; "kcov" for mean k-mer coverage for heterozygous bases; "err" for error rate of the reads; "dup" for average rate of read duplications; "k" for k-mer size; "p" for ploidy level.Observed (blue): observed k-mer profile; full model (black) : estimated GenomeScope model; unique sequence (yellow): unique k-mer threshold; Errors (red) : error k-mers threshold; k-mer peaks : corresponding to multiple of the kcov peak.

Visualisation and Statistical Tools Description
Expression profiler Normalized expression pattern of genes in different chosen conditions

Expression heatmap
Clusterized normalized expression for genes and conditions

Correlation genes finder
Computes and displays the most correlated genes for one selected gene in a chosen conditions

Correlation network
Computes and displays an expression-based, weigthed correlation network for the selected genes

Isoforms Browser
Displays RNAseq signal tracks and gene isoforms on the reference genome for the chosen conditions

Fig. S1 .
Fig. S1.Full Assembly and annotation pipeline.The workflow of de novo sequencing, assembly and annotation of the grape genome comprises 3 main steps.(A) Raw data production, profiling and validation.(B) Assembly of the circular consensus sequences and integration with the contact map produced by Hi-C.(C) Structural and functional annotation based on Isoseq RNA sequencing data.

Fig. S2.
Fig. S2.Raw read Kmer profiling with GenomeScope.First peak stands for unique heterozygous k-mers (25x coverage) and second peak for unique homozygous k-mers (50x coverage).The coverage of 75x and 100x stands for repetitive homozygous and repetitive heterozygous k-mers, respectively."len" stands for inferred genome length; "uniq" for non repetitive sequence ratio; "aa" for homozygosity rate; "ab" for heterozygosity rate; "kcov" for mean k-mer coverage for heterozygous bases; "err" for error rate of the reads; "dup" for average rate of read duplications; "k" for k-mer size; "p" for ploidy level.Observed (blue): observed k-mer profile; full model (black) : estimated GenomeScope model; unique sequence (yellow): unique k-mer threshold; Errors (red) : error k-mers threshold; k-mer peaks : corresponding to multiple of the kcov peak.

Fig. S3 .
Fig. S3.(A) Hi-C chromosome contact map for Chasselas and Ugni blanc displaying the spatial organization of chromatin.Linkage matrices are built with a 1 kb resolution and loaded on Juicebox.The x-axis represents genomic positions along the genome and the y-axis represents the same positions but mirrored.Each pixel color intensity represents the frequency of interchromosomal and intrachromosomal chromatin contacts between genomic loci.Darker colors indicate higher interaction frequencies.The diagonal line represents self-interactions.(B) Density of the telomeric repeats (AAACCCT)n identified using tidk explore on the chromosomes of Chasselas (left) and Ugni Blanc (middle), and PN_T2T (right).Telomeric motif density is computed with a 1 kb bin size.

Fig. S4 .
Fig. S4.(A) Dot plots displayed by D-genies showing the collinearity between the genomes of Chasselas vs Pinot Noir (top), Chasselas vs Ugni Blanc (bottom).Grid lines delimit the chromosomal scaffolds.Gaps are depicted by breaks in the global contiguity, horizontal breaks represent gaps in assemblies compared to Chasselas, while vertical breaks indicate gaps in our assembly.(B) Structural variations events (inversions, translocations and duplications) revealed by synteny analysis between PNv4, PN_T2T and Chasselas assemblies.

Fig. S5 .
Fig. S5.Examples of reconstruction of PNv4 genes using RNA Isoforms produced by IsoSeq.(A) Fusion event : four PNv4 genes (Vitvi13g04662; Vitvi13g04663; Vitvi13g04666; Vitvi13g01609) defined as only two genes based on IsoSeq.(B) Split event : one PNv4 gene (Vitvi13g04410) shown to correspond to two genes by IsoSeq.First track in blue represents the cumulative RNA-Seq signal from 1,600 samples.Second track in purple correspond to the IsoSeq gene model.Third track represents the PNv4 model remapped against Chasselas assembly.

Fig. S6 .
Fig. S6.VitExpress data processing and integration workflow.Phase I : raw data are gathered from NCBI SRA and subsequently evaluated for quality requirement.Phase II : the collected data undergo manual curation and description, followed by RNA-seq quantification.The entire dataset is then normalized.Phase III : the normalized quantification data, along with relevant metadata, are formatted to fit an object relational mapping database and a web application's back-end system.Phase IV: a userfriendly web-based platform is built to facilitate data exploration, analysis and export with statistical and visualization tools.

Fig. S7 .
Fig. S7.Data overview of VitExpress database.Pie charts provide the data proportion corresponding to species, grape types, developmental stages and organs.Only manually curated data were included in VitExpress platform.

Table S3 .
Annotation of protein-coding and non-coding genes of Chasselas and Ugni Blanc haplogenomes.

Table S4 .
Number of genes impacted by structural variants between haplotypes for Chasselas and Ugni Blanc

Table S5 .
Tools available in VitExpress.