Global biogeography of chemosynthetic symbionts reveals both localized and globally distributed symbiont groups

Significance Knowledge of host–symbiont biogeography is critical to understanding fundamental aspects of symbiosis such as host–symbiont specificity. Marine animals typically acquire their symbionts from the environment, a strategy that enables the host to associate with symbionts that are well-suited to local conditions. In contrast, we discovered that in the chemosymbiotic bivalve family Lucinidae several host species distributed across the globe are all associated with a single cosmopolitan bacterial symbiont. The genetic cohesiveness of this global symbiont species is maintained through homologous recombination across its extensive geographical range. The remarkable flexibility in the lucinid association is advantageous to both host and symbiont as it increases the likelihood of locating a compatible partner across diverse habitats spanning the globe.


Quality filtering, assembly, and bacterial genome binning
Read libraries were trimmed, PhiX contamination filtered, and quality checked using BBMAP v37.61's BBDUK feature (1) and the provided adapter and PhiX databases for contamination detection. All reads below an average q-score of 15 and a minimum length of 100 bp were removed. The minimum contaminant kmer of 21 was used and reads matching the contaminants or below the quality score were filtered out. Libraries were then interleaved (see Jupyter notebook for exact command line).
Individual read libraries were assembled using SPAdes v3. 13.1 (2, 3) with the "metagenome" setting and iterative kmers of 21, 31, 41, 51, 61, 71, 81, and 91. The resulting metagenomic assembly scaffolds were then binned using a combination of Anvi'o v6.1 (4, 5) using CONCOCT v1.1.0 (6) and MetaBAT v2.15 (7) (coverage and no coverage options). All contigs shorter than 1000 bp were removed from the assembly prior to mapping and binning. If available, libraries from the same host and location (maximum of three, treated as biological replicates) were used for binning. If three libraries from the same host and location were not available at the time of binning, binning was attempted with libraries from similar hosts or broad scale geographical areas (Dataset S1). Read libraries were mapped to the scaffold assemblies using BBMAP with default settings. The resulting .sam files were converted to .bam files with samtools v1.9 (8) and sorted using the anvi-init-bam script from Anvi'o. Then the default CONCOCT (using v1.1.0) binning workflow within Anvi'o with a minimum length of 1000 bp was used for binning with Anvi'o. The default MetaBAT (v2.15) binning workflow within Anvi'o with a minimum length of 1500 bp (minimum for the program) was also used. A MetaBAT binning attempt with no coverage information was also attempted. All potential bins for each metagenome were then compared using dRep v2.4.2's dereplicate workflow (9) with default settings and the "best" was selected automatically. If no bin was selected, a manual bin selection (using CheckM v1.1.3 (10) with Gammaproteobacteria gene set) and revision of the best MAG (metagenome assembled genomes) using Anvi'o's 'anvi-refine' was attempted as a method of manually improving quality of MAGs. The bins were then checked for completion using CheckM's taxonomy specific workflow using the previously mentioned gene set. MAGs determined to be 90% or more complete and 5% or less contaminated from the gammaproteobacterial gene set in CheckM were used in further downstream analysis without further refinement. MAGs determined to be 90% complete and more than 5% contaminated were manually refined using 'anvi-refine' in an attempt to improve MAG quality. MAGs that were determined to be 90% or more complete and less than 10% contaminated post refinement, referred to as high-quality MAGs, were used in further downstream analysis.
DESMAN was used to calculate the potential number of strains within each MAG. To do so, metagenomic libraries of MAGs with an average coverage of 30 or more were subsampled to 30 coverage by using the "outm" and downsample options in BBMAP using the seed numbers of 1, 11,42,47, and 117 to reproduce five subsample readsets. Those under 30 coverage were not subsampled and used as is. To increase the sensitivity, a custom set of 173 single copy genes (Table S8) was created by using the DESMAN specific setting to determine single copy COGs with species groups of interest. These genes were used in place of the original 35 genes in the DESMAN snakemake workflow and the five down sampled replicates were used for each MAG. Potential strain numbers in individual clams/metagenomes were calculated using DESMAN v2.1.1 (11) and the snakemake workflow (12) on the program. The 'best run' of each replicate was taken and considered the number of strains for the metagenome.

Phylogenetic analyses
The taxonomy of all MAGs was checked using GTDB v0.3.3 (13) as a confirmation. Only those MAGs taxonomically assigned to Sedimenticolaceae or Chromatiaceae were used in this study. All publicly available MAGs of other lucinid symbionts (Ca. Thiodiazotropha sp. and Ca. Sedimenticola sp.), alongside Allochromatium vinosum as an outgroup, were downloaded for use in this analysis (complete list of accession numbers in Tables S2 and S1). All publicly available MAGs used in this study were quality checked using CheckM's taxonomy specific workflow, which used the previously mentioned Gammaproteobacterial taxonomy set of marker genes. MAGs below 90% completion and/or above 10% contamination were only included in the "Phylogenetic analysis" of this study. Metagenomic read libraries associated with publicly available MAGs (mostly from (14)) were downloaded from NCBI and were also quality filtered using the same workflow as mentioned above. Those metagenomic read libraries that were sequenced to be shorter than 150 bp were filtered using the length cut off of 50 bp.
All available MAGs (irrespective of completion) were used to generate a concatenated gene tree. The CheckM v1.1.3 (10) lineage workflow was used to locate, align, and concatenate the default set of 43 universal marker genes (concatenated marker gene alignment can be found on FigShare (15)). CheckM's automatically generated concatenated protein alignment of these 43 marker genes was then submitted to the W-IQ-TREE server (16) using the default settings (auto substitution model detection, 1000 ultrafast bootstraps, and 1000 replicates of SH-aLRT branch tests). This tree was visualized and rooted with Allochromatium vinosum in iTOL (Interactive Tree Of Life) v5 (17). 16S rRNA gene sequences were retrieved from each MAG using BARRNAP v0.9 (18). Near full length 16S rRNA gene sequences (greater than 1200 bp) were aligned using the online SILVA:SINA aligner (19) using default parameters. The alignment was used to generate a phylogeny and visualized using the workflow mentioned previously.
All MAGs were placed into species groups using FastANI v1.3 (20) which was used to determine ANI (average nucleotide identity) values and a program-suggested species cut off of 95% twoway ANI were used. MAGs near the 95% ANI boundary were submitted to jspecies web server (21) to run an ANIb and ANIm test to check the accuracy of species boundaries.

Localization of symbionts in clam gills
The presence of Ca. T. taylori within select hosts was visualized with FISH (fluorescence in situ hybridization) on 5-µm-thick sections of the gills from Loripes orbiculatus (Mauritania, 2018) and Clathrolucina costata (Panama, 2019). 16S rRNA gene sequences from Ca. T. taylori, assigned to Sedimenticolaceae with SILVA (see Phylogenetic analyses), were used as probe references in probe design. A probe was designed with Geneious Prime (2020.0.5) and Decipher (probe design web tool, (22)). The 16S rRNA gene sequence of Ca. T. endolucinida (a co-occurring species of symbiont) was used as a non-target group to prevent false positives during the process. A formamide series from 0-60%, in 10% steps, was carried out to optimize the hybridization conditions for the Ca. T. taylori and T. endolucinida probes. A gammaproteobacteria-specific probe (Gam42a-Fluos) was included as a positive control (23). All probe sequences used are available in Table S6.
Gill sections were fixed, post dissection, in a 4% paraformaldehyde, 10% sucrose, pH 7.4, 0.01 M PBS solution overnight at 4°C. The gills were then washed in 10% sucrose, pH 7.4, 0.01 M PBS solution for 10 minutes, three times, followed by an increasing ethanol series of 30%, 50%, and 70% (v/v) for 10 minutes each. Dehydrated gills were stored in 70% ethanol at 4°C until embedding. Gills were embedded in 1% low melting agarose for embedding. Additional dehydration and paraffin wax infiltration was performed using the standard protocol in Automatic Tissue Processor Donatello (Diapath, Italy). For initial infiltration, low melting temperature paraffin was used (ATS-200850, Sanova, Austria). Samples were then embedded in medium melting temperature paraffin (ATS-200856, Sanova, Austria). The paraffin-embedded gills of the clams were cross-sectioned at 5 µm using a Leica microtome (Leica, Germany) and subsequently mounted on SuperfrostPlus adhesion slides (Thermo Scientific, USA) in a 37°C water bath. Sections were then dried in a horizontal position at room temperature overnight. Sections were dewaxed in Roti-Histol (Carl Roth, Germany), 10 minutes for three times, followed by two 10 min washes with absolute ethanol and finalized with three five minute washes of 1X PBS. Dewaxed sections were dried using compressed air and stored at 4°C until use.
Gill sections were hybridized using a hybridization buffer containing 50% Formamide, 0.01% SDS (v/v), 0.1 g/mL Dextran sulfate, 10% blocking reagent (Roche, Switzerland), at a final concentration of 0.9M NaCl, 0.02M Tris-HCl (Table S7). Approximately 15 µl of buffer was used per gill section. 2 µl (10 µmol) of each probe (Table S6) was added to each section. Slides were hybridized for 3 hours at 46°C in dark conditions. To maintain a humid atmosphere within the hybridization chamber during incubation, Kimtech Science precision wipes (Kimberly-Clark, USA) partially soaked in 40% or 50% formamide were located underneath the samples. Following the hybridization, the samples were rinsed with pre-warmed 48°C washing buffer (Table S7), rinsed briefly in Milli-Q water, and rapidly dried using compressed air. Dried slides were stained using 15 µl (1 µg/mL) of DAPI and incubated at room temperature in dark conditions for 15 min. Post incubation, the DAPI was rinsed briefly in 4°C Milli-Q water, three times, and dried. After drying, the sections were mounted using ProLong Glass antifade mounting media (Thermo Fisher Scientific, USA), cured overnight at room temperature and stored at 4°C.
Sections were visualized with a Leica TCS SP8 X confocal laser scanning microscope. Images were taken with a 63X Plan-Apochromat oil immersion objective (Refractive Index of glass slide, immersion oil and antifade mounting medium: 1.52) The Leica software LASX (3.7.2.22383) was used for image acquisition and post-processing if necessary.

Functional annotation and pangenomic analysis of bacterial genomes
Anvi'o's pangenomic workflow was used for orthologous group clustering and functional comparisons. COG (Cluster of Orthologous Groups of proteins) functions were annotated and included using 'anvi-run-ncbi-cogs'. Each MAG's ORF (Open Reading Frame) proteins from Anvi'o were annotated using eggNOG-mapper v2 (24) with eggNOG database v5.0 (25) and these were added to the pangenomic database. All ORF proteins were also used to generate a custom functional annotation set by using all Pfam (protein family) HMMs (26). These ORF proteins were searched for all possible domains using HMMER v3.3 (27) to search against the PfamA reference database v33.1 with the model designated thresholds used as cutoff. This output was then parsed and added as a functional annotation to each MAG. DRAM (Distilled and Refined Annotation of Metabolism) v1.1.1 was also used to annotate and visualize the metabolisms of all high-quality MAGs (28).
All high-quality MAGs in the dataset were also used in an Anvi'o pangenome similar to above with a mcl inflation value of 8. Anvi'o pangenomic genomes database and profiles can be found on Figshare (29,30). All MAGs were also submitted to the RAST (Rapid Annotation using Subsystem Technology; https://rast.nmpdr.org/) web server for functional annotation using the default RASTtk pipeline (31). Available metagenomic assemblies were screened for genes and proteins of interest using NCBI's blast+ v 2.8.1 (32) with an e-value threshold of 1e -16 and a word size of 25 (only for blastn). Coverages for genes of interest were also calculated using BBMAP's coverage stats options with a 90% minimum identity to screen for potentially unassembled genes in assemblies.
EggNOG functions and Pfam functions were used for a functional enrichment analysis through 'anvi-get-enriched-functions-per-pan-group'. An enrichment calculation was performed between species of interest using the 95% ANI species cut-offs (as determined by FastANI) as pre-defined groups to create a bacterial interspecies comparison of metabolic functions. Species groups with less than three MAGs were excluded from the enrichment analysis. A second enrichment calculation was performed within species by grouping according to geographical location. Only groups with more than three individuals were used for functional enrichment. Within all analyses, only genes with an 'adjusted q value', a p value adjusted for false discovery rate, of less than 0.05 were considered.

C. orbicularis metatranscriptome analysis
Three C. orbicularis specimens were collected in Guadeloupe (2016), preserved in RNAlater, and stored at -80°C. Total RNA was extracted using TRIzol (Thermo Fisher Scientific) according to manufacturer's instructions. Ribosomal RNA depletion and library construction were carried out at the Vienna Biocenter Core Facilities GmbH (VBCF, Vienna, Austria) as described in Yuen et al.
(2019) (33). The RNA-Seq reads were trimmed and processed as previously described by Yuen et al. (2019). We used BBMAP with the parameters slow = t, ambiguous = best, minid = 0.99 to align the trimmed and filtered reads to the publicly available MAG of Ca. T. endolucinida (GCA_001715975.1), an endosymbiont of C. orbicularis from Guadeloupe (1). FeatureCounts was then used to quantify gene level transcript abundances, which were subsequently converted to transcripts per million (34). Metatranscriptomic results can be found in Dataset S7.

Recombination rates and nucleotide diversity of bacterial symbionts
To infer recombination events in bacterial genomes, we used the maximum likelihood implementation of ClonalFrame, ClonalFrameML (35). MAGs (only high quality) were aligned with progressiveMauve, which is part of the Mauve genome alignment package v2.0 (36). Core genes; i.e., nucleotide sequences that were shared among all MAGs within a group, were extracted with a custom script, stripSubsetLCBs, which can be accessed through the original publication (37). These core genes were then re-aligned with MUSCLE (38) and cleaned with trimAl (39) with the following parameters: -resoverlap 0.75 -seqoverlap 80. RAxML v8.2.10 (40) was used to build a phylogenetic tree based on this trimmed alignment analogously to (41). All alignments are available on Figshare (42)(43)(44). The resulting tree and alignment were fed into ClonalFrameML to calculate the ratio of recombination vs. mutation events with default parameters. All MAGs were used for this analysis (i.e., n = 8 for Ca. T. weberae, n = 10 for Ca. T. lotti, n = 18 for Ca. T. endolucinida, and n = 24 for Ca. T. taylori). The visualization of the recombination events across the phylogeny was computed in R v6.3.2 with the script cfml_results.R with a few manual modifications, which can be downloaded from the ClonalFrameML Github page (Fig. S1A-C). One hundred simulations were run to calculate confidence intervals of the estimates. The cleaned alignments were also used to (i) estimate average nucleotide diversities inside and outside the clonal frame with GUBBINS (Genealogies Unbiased By recomBinations In Nucleotide Sequences (45)), and (ii) build a phylogenetic tree of all MAGs within the Ca. T. taylori group (n = 24) using only the clonal frame. For the latter, the alignment was again submitted to the W-IQ-TREE server using the default settings (auto substitution model detection, 1000 ultrafast bootstraps, and 1000 replicates of SH-aLRT branch tests). This tree was visualized and rooted with Ca. T. sp. "RUGA" in iTOL. All scripts used for the analysis can be found in the associated Jupyter Notebook (46).

Storage Compounds
All species are also predicted to be able to store and utilize glycogen and polyhydroxyalkanoates, as indicated by RAST annotations and the presence of genes annotated as Poly(3hydroxyalkanoate) polymerase (phaC).

Phosphorus
All species have genes for phosphate ABC transporter proteins (pstSCAB genes and phoU), which suggest the ability to take up extracellular phosphate for use inside the cell. All species also have the genes for synthesizing and utilizing polyphosphate as a storage compound for excess phosphate/energy. Notably Ca. T. weberae and Ca. T. sp. RUGA encoded for an organophosphonate utilization operon (phnACDEFGHIJKMNOP genes), which suggest the capability of these two species to utilize compounds containing C-P bonds (47). These compounds (phosphonates) are found to be up to 10% of the total bioavailable phosphorus in marine environments.

Supplementary Discussion
Naming conventions for Lucinid symbionts should be altered to avoid host-specific nomenclature There are now two lucinid symbionts that have been found to associate with multiple host species (Ca. T. taylori and Ca. T. endolucinida) and several host species that associate with different symbiont species (four symbiont species within Ctena orbiculata and three symbiont species with Loripes orbiculatus). With only a small fraction of host species sampled so far, it is likely that there will be more symbionts that associate with multiple host species. This causes host-centric names like "Ca. T. endoloripes", "Ca. T. endolucinida", and "Ca. Sedimenticola endophacoides" to potentially be invalidated and less insightful with future discoveries. Previously, based on limited sampling, it had been assumed that these symbionts only occurred within single host species. While the species name "Ca. T. endolucinida" is accurate as it is found in many lucinid hosts, it is not the only species that is found in many hosts, just as "Ca. S. endophacoides" might not be the only host within Phacoides. A host centric naming system also creates issues for any potential Ca. Thiodiazotropha that are first found in the environment, as found recently in seagrass using FISH (fluorescent in situ hybridization). In this study, we propose a shift in the naming of these species away from a host centric system to a more applicable system of naming focused on those that have made contributions to the fields of lucinid taxonomy and chemosymbiosis. Within the umbrella of "Ca. T. endoloripes" species, previously based on 16S rRNA gene sequences, we found metagenome-assembled genomes (MAGs) that grouped into two different species. Note that we identified two specimens of Loripes orbiculatus hosting both symbionts living peacefully in the same gill. We are proposing the names "Ca. T. weberae" and "Ca. T. lotti" after Miriam Weber and Christian Lott, whose work in the bay of Fetovaia on Elba Island, Italy led to the discovery of a Loripes orbiculatus population that kick-started the research on this species and its symbionts. The cosmopolitan symbiont associated with eight host species across three oceans will be named "Ca T. taylori" after Dr. John Taylor. Dr. John Taylor has spent decades studying and describing the biodiversity within the Lucinidae. His invaluable work on lucinid taxonomy, morphology and systematics has built the foundations for many others to pursue careers studying these wonderful organisms. Our work was only possible because of his contributions, discussions, and extensive museum collection.       T. taylori. The binding and specificity of these probes was tested through a formamide curve. It was determined that both probes bound successfully at 50% formamide with no non-specific binding to the other symbiont. Dataset S1 (separate file).
Sample metadata for individual metagenomes and MAGs.

Dataset S2 (separate file).
Average nucleotide identity values, calculated through FastANI and Jspecies, grouped by species.
Presence of genes of interest in all high-quality MAGs, using eggNOG and RAST as primary sources of functional information.

Dataset S4 (separate file).
Enriched metabolic functions by species group, assessed using eggNOG and PFAM as functional annotation sources.

Dataset S5 (separate file).
Absence of genes of interest confirmation using BLAST, annotation, and coverage.

Dataset S6 (separate file).
GUBBINS values for primary species of interest.