Illuminating structural proteins in viral “dark matter” with metaproteomics

Significance
Marine viruses are abundant and have substantial ecosystem impacts, yet their study is hampered by the dominance of unannotated viral genes. Here, we use metaproteomics and metagenomics to examine virion-associated proteins in marine viral communities, providing tentative functions for 677,000 viral genomic sequences and the majority of previously unknown virion-associated proteins in these samples. The five most abundant protein groups comprised 67% of the metaproteomes and were tentatively identified as capsid proteins of predominantly unknown viruses, all of which putatively contain a protein fold that may be the most abundant biological structure on Earth. This methodological approach is thus shown to be a powerful way to increase our knowledge of the most numerous biological entities on the planet.

Viruses are ecologically important, yet environmental virology is limited by dominance of unannotated genomic sequences representing taxonomic and functional “viral dark matter.” Although recent analytical advances are rapidly improving taxonomic annotations, identifying functional dark matter remains problematic. Here, we apply paired metaproteomics and dsDNA-targeted metagenomics to identify 1,875 virion-associated proteins from the ocean. Over one-half of these proteins were newly functionally annotated and represent abundant and widespread viral metagenome-derived protein clusters (PCs). One primarily unannotated PC dominated the dataset, but structural modeling and genomic context identified this PC as a previously unidentified capsid protein from multiple uncultivated tailed virus families. Furthermore, four of the five most abundant PCs in the metaproteome represent capsid proteins containing the HK97-like protein fold previously found in many viruses that infect all three domains of life. The dominance of these proteins within our dataset, as well as their global distribution throughout the world’s oceans and seas, supports prior hypotheses that this HK97-like protein fold is the most abundant biological structure on Earth. Together, these culture-independent analyses improve virion-associated protein annotations, facilitate the investigation of proteins within natural viral communities, and offer a high-throughput means of illuminating functional viral dark matter.

viruses | marine | proteins

Microorganisms are central to the Earth's ecosystem function (1), and it is becoming increasingly evident that viruses substantially influence microbially driven processes through mortality and manipulation of metabolism via viral-encoded metabolic genes (reviewed in ref. 2), including those involved in photosynthesis (3) and most of central carbon metabolism (4). However, holistic understanding of marine viruses has been limited in part by the dominance of "unknown" genomic sequences encountered when surveying viral communities in nature.
This "viral dark matter" in metagenomes manifests as an inability to obtain functional or taxonomic annotations for most (63-93%) of surveyed sequence space (5), as well as an inability to taxonomically annotate the vast majority (>99%) of viral populations observed in nature (6). Emerging approaches, such as comparison of metagenomes using shared k-mers (7), protein clusters (PCs) (8), and viral populations (6), enable ecological inferences without annotation (reviewed in ref. 9), but further conclusions are hindered by most viral PCs and populations remaining unknown. Taxonomic viral dark matter occurs due to limited representation of viruses in reference databases: 86% of 1,531 sequenced genomes of bacterial and archaeal viruses were isolated from only 3 of 61 known host phyla (10). Some progress is being made using traditional isolation and genome-sequencing techniques to obtain reference genomes for both abundant (11,12) and rare, but ubiquitous (13), marine viruses. However, identifying viral genomic information within microbial genomic datasets and using genome- and network-based analytics to classify these previously unidentified sequences is already rapidly increasing the number of available and classified viral reference genome sequences (10). With the emerging deluge of novel and diverse single-cell genomic datasets that contain viruses (14,15), such methods are likely to uncover viruses for all known phyla in short order, which should presumably greatly illuminate taxonomic viral dark matter.
In contrast, high-throughput advances to resolve our understanding of functional viral dark matter are lagging. Examination of viral genomic sequence space organized into PCs based on similarity has revealed that the global virosphere (the catalog of genes encoded by viruses) is now well sampled in the upper oceans (6) and likely contains less than 3.9 million proteins (16). Although the abundance of viral PCs is becoming well understood, the functions of these PCs remain poorly characterized.

A promising approach to annotate portions of functional viral dark matter could be to elucidate which predicted proteins encode viral structural components. Computationally, artificial neural networks have been used to predict viral capsid and tail proteins from metagenomic data, which has been validated through in vivo expression and visualization of four putative viral structural genes (17). Experimentally, divergent structural proteins from cultivated viral isolates have been annotated using mass spectrometry (MS)-based proteomics (13,18-20). Metaproteomics has now emerged as a powerful tool to investigate microbial communities (21,22), and here we apply this approach to marine viral communities to identify virion-associated proteins and facilitate annotation of the structural components of viral dark matter, generating new insights regarding the structural proteins in natural viral communities.

Results and Discussion
Metaproteomic Datasets for Investigating Wild Marine Viruses. High-throughput experimental MS-based proteomics was applied to four purified marine viral communities from the Mediterranean Sea, Indian Ocean, and Atlantic Ocean (Table S1) collected through the Tara Oceans Expedition (23). After using several experimental approaches to generate metaproteomes (Table S2; see experimental overview in Fig. S1), we selected the sample preparation method that minimized keratin contamination and autotryptic peptides [filter-aided sample preparation 2 (FASP2)] and the mass spectrometer that produced the most peptide spectra (LTQ Orbitrap Velos Pro). We then evaluated three analytical search pipelines to compare these MS-derived peptide spectra against assembled contigs from their paired dsDNA viral metagenomes included in the Tara Oceans Viromes (TOV) dataset (6) (Fig. S1). Among these pipelines, TPP with X! Tandem enabled the identification of the most spectra, nonredundant proteins (i.e., the distinct nonidentical proteins those spectra represent), and PCs (defined as groups of proteins with 60% similarity across 80% coverage; Table S3). Furthermore, 26% of the total spectra were only identified using the TPP with X! Tandem pipeline, and only 8% of total spectra were not identified using this pipeline (Fig. S2A). Finally, the distribution of annotated spectra within the viral functional and taxonomic categories was highly similar among all three pipelines (Fig. S2B; Morisita's Index of 1.0 for each pairwise comparison). We thus generated the Quantitative Dataset consisting of the peptide spectral abundances and annotations obtained only from the FASP2 sample preparation method, the LTQ Orbitrap Velos Pro mass spectrometer, and the TPP with X! Tandem pipeline to quantitatively investigate viral protein abundances (Fig. S1).
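As an illustration of the pairwise pipeline comparison described above, the following minimal Python sketch computes a Morisita-type overlap index between two vectors of spectral counts per annotation category. The source does not state which formulation of Morisita's index was used; the Morisita-Horn simplification shown here is an assumption, and the function name and example counts are hypothetical.

```python
def morisita_horn(counts_a, counts_b):
    """Morisita-Horn overlap between two count vectors over the same categories.

    Assumption: the Horn simplification of Morisita's index; the source does
    not specify the exact formulation. A value of 1.0 means the two profiles
    have identical proportions; 0.0 means no shared categories.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must cover the same categories")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each profile needs at least one count")
    cross = sum(a * b for a, b in zip(counts_a, counts_b))
    # Simpson-type dominance terms for each profile
    dom_a = sum(a * a for a in counts_a) / (total_a * total_a)
    dom_b = sum(b * b for b in counts_b) / (total_b * total_b)
    return 2 * cross / ((dom_a + dom_b) * total_a * total_b)


# Hypothetical spectral counts per category (e.g., capsid, tail, other)
pipeline_1 = [50, 30, 20]
pipeline_2 = [100, 60, 40]  # same proportions, twice as many spectra
print(morisita_horn(pipeline_1, pipeline_2))
```

Because the index depends only on proportions, two pipelines that identify different numbers of spectra can still score close to 1.0 when they distribute those spectra across categories in the same way.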
The Quantitative Dataset consisted of 15,270 spectra representing 697 nonredundant proteins in 296 PCs (Table S3; Dataset S1). The majority (74% of spectral counts) of proteins in this dataset facilitated annotation of previously unannotated virion-associated proteins (i.e., "newly annotated"; Fig. 1). Taxonomically, 24% of the proteins were annotated as belonging to tailed phages (myoviruses, podoviruses, and siphoviruses; Fig. 1). However, there were very few tail proteins in the dataset; among the proteins with previous functional annotations, the majority (23%) were identified as capsid proteins and <1% were identified as tail proteins (Fig. 1), resulting in ∼100-fold more capsid than tail proteins. Two prior proteomic studies of marine phage isolates show that, although all ORFs annotated as tail proteins were detected in the proteomes of myoviruses infecting Synechococcus and Prochlorococcus (24), five of the nine putative tail proteins were not detected in Cellulophaga siphoviruses (13). This suggests that, even in isolates, MS-based proteomic methods may miss tail proteins, presumably due to loss during phage isolation or deficiencies in the sample preparation method (i.e., inefficient digestion with trypsin due to limited K/R residues in these specific proteins or excessive digestion due to having too many K/R residues). In this complex community case using metaproteomics, lower conservation of tail proteins relative to capsids may also hamper their identification through annotation using reference databases (see discussion regarding conservation of viral-associated proteins below).
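The point about K/R content can be made concrete with a small in silico digest. The sketch below applies the commonly used trypsin rule (cleave after K or R unless the next residue is P) and then keeps only peptides within a length window. The window of 6 to 30 residues, the function names, and the toy sequences are illustrative assumptions, not parameters from this study.

```python
def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def in_length_window(peptides, min_len=6, max_len=30):
    """Keep peptides inside a hypothetical MS-friendly length window."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Toy sequences (made up for illustration only)
kr_poor = "A" * 40        # no K/R: one long, undigested peptide
kr_rich = "AKAKAKRK"      # many K/R: only very short peptides
balanced = "AAGGLLVVKTTSSEEDDQR"

for name, seq in [("K/R-poor", kr_poor), ("K/R-rich", kr_rich), ("balanced", balanced)]:
    peptides = tryptic_peptides(seq)
    print(name, len(peptides), "peptides,", len(in_length_window(peptides)), "within window")
```

In this toy setting both extremes yield no peptides inside the window, which mirrors the two failure modes named in the text: too few cleavage sites leave peptides too long, and too many leave them too short.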
Collectively, experimentation with two sample preparation methods, three mass spectrometers, and three analytical search pipelines generated additional peptide spectra beyond the Quantitative Dataset (Fig. S1). Due to the methodological differences, these data could not be combined quantitatively; however, they did provide expanded identification of virion-associated proteins in the four marine viral communities because not all methods identified the same proteins. The resulting Inclusive Dataset (see overview in Fig. S1) contained 1,875 nonredundant proteins grouped into 574 PCs (Table S4), which is ∼2.7- and ∼1.9-fold more proteins and PCs, respectively, than the Quantitative Dataset. Of these proteins, most (991 nonredundant proteins; 53% of the Inclusive Dataset) were again newly identified as virion-associated proteins (Fig. 1), providing functional annotation to 677,376 previously unannotated viral metagenomic reads from these samples, identified here as "structural" based on similarity to peptide spectra using the three analytical search pipelines. The metaproteomes included 176 proteins (9% of the Inclusive Dataset) previously seen in viral isolate experimental proteomes and identified as "viral-associated" or structural (e.g., ref. 13) (Fig. 1). In addition, the metaproteomes provided annotation for 84 previously unannotated hypothetical proteins in viral isolate genomes (4% of the Inclusive Dataset; Fig. 1).
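The fold changes and percentages quoted above follow directly from the reported counts. A short Python check, using only numbers stated in the text (variable names are illustrative):

```python
# Counts reported in the text
quant_proteins, quant_pcs = 697, 296    # Quantitative Dataset
incl_proteins, incl_pcs = 1875, 574     # Inclusive Dataset
newly_annotated = 991
seen_in_isolate_proteomes = 176
isolate_hypotheticals = 84

protein_fold = incl_proteins / quant_proteins
pc_fold = incl_pcs / quant_pcs

print(f"{protein_fold:.1f}-fold more proteins")                 # 2.7-fold
print(f"{pc_fold:.1f}-fold more PCs")                           # 1.9-fold
print(f"{newly_annotated / incl_proteins:.0%} newly annotated")  # 53%
print(f"{seen_in_isolate_proteomes / incl_proteins:.0%} seen in isolate proteomes")  # 9%
print(f"{isolate_hypotheticals / incl_proteins:.0%} isolate hypotheticals")          # 4%
```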
Fig. 1. Virion metaproteomics helps annotate previously unknown viral proteins. Functional annotations and their associated taxonomic annotations (linked by dashed lines) are presented for the Quantitative Dataset based on protein spectral abundances generated from one method and one analytical search pipeline, as well as the Inclusive Dataset that includes all proteins identified using all methods and analytical search pipelines combined. Annotations are based on the top BLASTP match (e value < 0.001) against the viral RefSeq database (full annotation details in Dataset S1). The "Capsids" category includes proteins annotated as head-tail connectors, necks, and portals, whereas the "Other" functional category includes scaffolding proteins and enzymes such as proteases. Hypothetical proteins in genomes of viral isolates are functionally annotated as "Newly annotated" but have a taxonomic annotation.

To further examine the utility of metaproteomic analyses in natural viral samples, we first investigated whether the metaproteomes included proteins within the dominant PCs from the paired viral metagenomes. Of the 200 most abundant PCs in the viral metagenomes of each sample, 9% (72 of 800 PCs total) were experimentally detected in the metaproteomic Inclusive Dataset, including 47 PCs that had no prior functional annotation (Fig. 2). We next examined TOV-generated viral populations (i.e., contigs grouped based on similarity of ≥80% of their genes at ≥95% nucleotide identity) (6) for the presence of PCs detected in the metaproteomes. This showed that the metaproteomic PCs in the Inclusive Dataset were detected in viral populations from the paired viral metagenomes that spanned a large range of population abundances, identifying proteins in the most abundant viral populations, as well as rare populations (Fig. 3A). Applying these same analyses to all 5,476 viral populations detected in the larger, globally distributed TOV dataset (6) revealed that metaproteome-detected PCs were found in populations spanning a large range of abundances across as many as 36 of the 43 samples (Fig. 3B). Together, this combined information (Figs. 1-3) suggests that metaproteomics is a powerful approach to inform annotation of previously unknown genomic content as structural genes in both isolates and variably abundant populations in natural viral communities.
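The population criterion quoted above (≥80% of genes shared at ≥95% nucleotide identity) can be expressed as a simple threshold rule. The following sketch is only a schematic of that rule: how per-gene identities were actually computed is described in ref. 6, not here, so the input format and function name are hypothetical.

```python
def same_population(gene_identities, min_gene_fraction=0.80, min_identity=95.0):
    """Apply the >=80% of genes at >=95% nucleotide identity rule.

    gene_identities: one entry per gene of a contig, giving the percent
    nucleotide identity of its best match in the other contig, or None if
    the gene has no match (hypothetical input format).
    """
    if not gene_identities:
        return False
    shared = sum(
        1 for identity in gene_identities
        if identity is not None and identity >= min_identity
    )
    return shared / len(gene_identities) >= min_gene_fraction


# Made-up examples: 4 of 5 genes pass, versus 3 of 5 genes pass
print(same_population([99.0, 98.2, 97.5, 96.1, 80.0]))  # True
print(same_population([99.0, 98.2, 97.5, 90.0, None]))  # False
```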
Dominant Protein Clusters in Viral Metaproteomes. Within the Quantitative Dataset, one PC (CAM_CRCL_773, previously identified in the Global Ocean Sampling expedition, Pacific Ocean Viromes, and TOV datasets) (5,6,25) was by far the most abundant, representing 57.5% of spectral counts (Fig. 4A). Given this PC's dominance, we applied network analysis to the 400 protein members of this PC in the Inclusive Dataset, which showed two clearly separated groups divergent by ∼30% amino acid identity (Fig. 4B). Within this PC, only 10 of the 400 constituent proteins were previously annotated (as capsid proteins of siphoviruses JD024 and D3112 that infect Pseudomonas), which represented only 1.6% of the PC's spectral counts derived from the Quantitative Dataset (Fig. 4B and Dataset S1). This PC thus included the majority (79%) of the previously unannotated spectra in the Quantitative Dataset (Fig. 1). In silico structural modeling of representative sequences from this PC suggested both groups represent major capsid proteins from phages similar to one another (the lambdoid phages HK97, ref. 26, and BPP-1, ref. 27; Fig. 4 C and D); however, these best fits were relatively weak (template modeling scores, TM scores, lower than the accepted cutoff of 0.5) (28). Thus, this dominant PC appears to be a major capsid protein of previously unexplored marine viruses.
The next four most abundant PCs in the Quantitative Dataset contained a total of 9.8% of the spectral counts (Fig. 4A) and were predominantly annotated as capsid proteins by sequence similarity (Dataset S1) and structural modeling (Fig. S3) of their total ORFs present within the Inclusive Dataset. The most abundant of these four PCs, CAM_CRCL_625, was a T4-like major capsid protein by consensus annotation of the PC's component ORFs (29) and also by structural modeling (30). Moving in order of decreasing spectral abundance, PCs CAM_CRCL_14716 and TARA_183056 were both functionally and taxonomically unannotated by sequence similarity; however, by structural modeling, both had best fits to a capsid protein of cyanophage Syn5 (31), although the TM score for the latter PC was below the recommended cutoff of 0.5 (28). Finally, PC TARA_207964 was annotated as a capsid protein from phage HMO-2011 (which infects Ca. Puniceispirillum marinum of the SAR116 clade) (11) by similarity, but was annotated as the major capsid protein of cyanophage P-SSP7 (32) by structural modeling, likely because there is currently no reference structure available in the modeling database for phage HMO-2011. Collectively, this combination of ORF annotation and structural modeling thus suggested that, of the top five most abundant PCs (which comprised approximately two-thirds of the spectra in the Quantitative Dataset), at least four were capsid proteins. This is consistent with the dominance of capsids in the annotated portion of the metaproteomes (Fig. 1), and with our understanding of virion structural proteins usually being dominated by capsid proteins in proteomes of viral isolates (13,24).
We next sought to examine the global-scale distribution of these five most abundant metaproteome-detected PCs by examining their presence in previously identified TOV viral populations (6). The dominant metaproteome-detected PC (CAM_CRCL_773) was present in a total of 93 viral populations collectively found in every TOV sample across seven oceans and seas (Fig. 5). In contrast, the four next most abundant PCs were present in substantially fewer populations and showed somewhat more restricted geographic distributions. One PC (TARA_183056) was found in 10 populations that were present in every oceanic region examined except the Southern Ocean. Two PCs (CAM_CRCL_625 and TARA_207964) were found in 5 and 11 viral populations, respectively, predominantly present only in the Indian and Atlantic Oceans, and the Mediterranean and Red Seas. Finally, one PC (CAM_CRCL_14716) was present in only one viral population that showed the most geographic restriction, with the highest abundance from the Indian Ocean, where two of the four metaproteomic samples were collected, but low or nonexistent abundance in the remaining locations. Thus, the five most abundant PCs in the four metaproteomes from three stations are present in viral populations with both widespread and regionally restricted distributions.

[Figure legend, partial] Populations originating from a sample with a metaproteome are also indicated, with "originating" defined as that population having the maximum coverage in one of those four samples from which metaproteomes were generated.
Conservation of Virion-Associated Proteins. Conservation of structural similarity in viral capsid proteins, even in the absence of nucleotide sequence similarities, has long been recognized (33,34). It is thus notable that the model-predicted structural similarities of the five most abundant PCs in the Quantitative Dataset (Fig. 4A) are to capsid proteins that all contain the HK97-like fold, including siphophage HK97, HK97-like phage BPP-1, myophage T4, podophage Syn5, and siphophage P-SSP7 (Fig. 4 C and D and Fig. S3) (27,30,31,34). This HK97-like capsid protein fold has been found in viruses infecting organisms from all three domains of life (35) and is suggested to be the most abundant biological structure on Earth, based on the high abundance of total viruses (e.g., refs. 30, 34, and 36). The data presented here support that assertion: not only do the most abundant PCs in the metaproteomes (representing 67% of the Quantitative Dataset; Fig. 4) seem to contain this protein fold, four out of five of these PCs also appear widely distributed in the upper oceans as shown in our analysis of the TOV viral populations (Fig. 5).
Fig. 4 (legend, partial). Line thickness (edge weights) corresponds to amino acid identity, calculated as the number of identical residues within the alignment. Proteins used for structural modeling (C and D) are outlined in thick black. Taxonomic affiliation based on PC annotation is indicated by color. (C) Representative structural model for the group of amino acid sequences on the Left of the network diagram using the I-TASSER prediction server. The best-fit template was major capsid protein 2FS3 (C score, −4.81; 24% identity; TM score, 0.10) of Enterobacteria phage HK97, which infects Escherichia coli. (D) Representative structural model for the group of amino acid sequences on the Right of the network diagram using the I-TASSER prediction server. The best-fit template was major capsid protein 3J4U (C score, −2.16; 24% identity; TM score, 0.43) of the HK97-like Bordetella phage BPP-1.

To further investigate conservation in virion-associated proteins, selective constraints of the PCs from the Inclusive Dataset were examined using the ratio of nonsynonymous to synonymous polymorphisms (pN/pS), which has proven powerful for analysis of microbial metagenomic datasets (37,38). Average pN/pS ratios for PCs in the metaproteome were significantly lower than those determined for all viral metagenome-derived PCs (0.67 vs. 0.84; P < 0.001, Mann-Whitney U test; Fig. 6). For comparison, viral metagenome PCs previously annotated as capsids also had relatively low pN/pS ratios (average, 0.48), whereas ratios for annotated tail proteins were higher (average, 0.69). Together, this information suggests stronger overall negative selection for virion-associated proteins (i.e., increased maintenance of their gene sequences), especially capsid proteins, relative to other viral genome-encoded proteins. This is analogous to previous observations of conservation in housekeeping genes in microorganisms (e.g., ref. 38) and underscores the importance of capsid protein structure maintenance to virion fitness.
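For readers unfamiliar with the pN/pS ratio used in this section, a minimal sketch of its usual definition follows: the number of nonsynonymous polymorphisms per nonsynonymous site, divided by the number of synonymous polymorphisms per synonymous site. The exact procedure used in this study is given in SI Methods; the function name and the example counts below are hypothetical.

```python
def pn_ps(nonsyn_polymorphisms, nonsyn_sites, syn_polymorphisms, syn_sites):
    """Ratio of per-site nonsynonymous to per-site synonymous polymorphism.

    Values below 1 are conventionally read as evidence of negative
    (purifying) selection; lower values indicate stronger constraint.
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_polymorphisms == 0:
        raise ValueError("pN/pS is undefined without synonymous polymorphisms")
    p_n = nonsyn_polymorphisms / nonsyn_sites
    p_s = syn_polymorphisms / syn_sites
    return p_n / p_s


# Made-up gene: 12 nonsynonymous polymorphisms over 300 nonsynonymous sites,
# 10 synonymous polymorphisms over 100 synonymous sites
ratio = pn_ps(12, 300, 10, 100)
print(round(ratio, 2))  # 0.4
```

Under this reading, the lower average reported for metaproteome-detected PCs (0.67) than for all viral metagenome-derived PCs (0.84) corresponds to fewer amino acid-changing polymorphisms being tolerated per available site.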
Genomic Context for Experimentally Detected Viral Proteins. Genomic context frequently improves gene-specific functional and taxonomic interpretations. We thus examined the genomic context of the five most abundant metaproteome-detected PCs via their five longest associated contigs per PC in the TOV dataset (Fig. 7 and Dataset S2). The most abundant PC (CAM_CRCL_773) was present in contigs where few (24-29%) ORFs were annotated and also showed no taxonomic consensus, the latter of which is consistent with >99% of TOV viral populations (6). However, this genomic context did show that CAM_CRCL_773 was present within a genomic region containing ORFs encoding a tail fiber, baseplate, and a terminase, as well as three additional unannotated PCs that were also detected in the metaproteome. Within these contigs, the presence of two tail genes and the significant similarities to tailed virus genes for the majority (90-100%) of annotated ORFs indicate that this dominant PC may belong to previously unidentified Caudovirales.
In contrast, the second most abundant PC (CAM_CRCL_625) was present in contigs that were predominantly taxonomically annotated (58-100% of their ORFs), mainly as genes of Myoviridae infecting highly abundant hosts such as Pelagibacter, Synechococcus, and Prochlorococcus (Fig. 7 and Dataset S2). This PC was again found within a genomic region containing multiple tail and capsid proteins and two terminase subunits. Collectively, this genomic context combined with the sequence-based and structural modeling-based annotations (above) provides strong evidence that CAM_CRCL_625 is a capsid protein of myoviruses.
The third and fourth most abundant PCs (CAM_CRCL_14716 and TARA_183056) were found in predominantly unannotated contigs (11-29% of ORFs annotated; Fig. 7; Dataset S2). The former (CAM_CRCL_14716) was present in only one TOV contig, consistent with its more restricted geographic distribution (Fig. 5). Although the annotations present in both of these PCs' contigs did not allow taxonomic consensus to be reached, each PC occurred within genomic regions containing other metaproteome-detected PCs. Furthermore, the genomic context for TARA_183056 included a terminase gene as well as tail fiber genes, suggesting it may belong to another unidentified member of the Caudovirales.
Finally, the fifth most abundant PC (TARA_207964) was present in predominantly annotated contigs (57-78% annotated ORFs) in which the consensus taxonomy (56-91%) was podophage HMO-2011, a phage infecting a SAR116 bacterium (11) (Fig. 7 and Dataset S2). This matches this PC's annotation reported above via its component metagenomic ORFs. This PC was also present in a well-annotated genomic region that included a metaproteome-detected PC (TARA_40991) annotated as a portal protein, supporting the annotation of this PC (TARA_207964) as a capsid protein of podophage HMO-2011.

Conclusions
In summary, this study establishes environmental metaproteomics as a high-throughput strategy for shedding light on viral dark matter in two ways: (i) defining formerly unannotated proteins as structural, and (ii) revealing which of these proteins are most abundant, thereby focusing further inquiry (e.g., structural modeling). The 1,875 viral proteins observed in these metaproteomes allowed us to newly annotate 991 proteins as primarily structural. Surprisingly, the majority (67%) of the metaproteomic spectra were derived from just five environmentally dominant PCs. With a combination of sequence- and structural modeling-based annotation, these PCs are now predominantly identified as putative capsid proteins of tailed viruses containing the most abundant biological structure on Earth, the HK97-like protein fold. Furthermore, analysis of metaproteomic PCs facilitated understanding of increased selective pressures on genes encoding virion-associated proteins (e.g., capsids). Although this study focused on dsDNA viruses, the approach is generalizable to ssDNA and RNA viruses, which currently require generation of separate metagenomes. Thus, this large-scale annotation strategy and the findings presented here will help guide the experimentation needed to refine structural annotations and offer glimpses of the viral metagenomic dark matter that obfuscates our understanding of the most abundant biological entities on Earth: viruses.

Methods
A detailed description of all metaproteomic, metagenomic, and bioinformatic procedures is provided in SI Methods.
ACKNOWLEDGMENTS. We thank Bonnie Poulos for preparing viral concentrates, Genoscope for viral metagenomic sequencing, members of Tucson Marine Phage Lab for comments on the manuscript, and University Information Technology Services Research Computing Group and the Arizona Research Laboratories Biotechnology Computing for High-Performance Computing Cluster access and support. We thank Kristen Corrier and Manesh Shah of University of Tennessee/Oak Ridge National Laboratory for efforts in filter-aided sample preparation (FASP) preparation of viral samples and MS analyses, and aspects of proteome informatics, respectively. The four viral concentrates were collected as part of exceptional commitment by scientists and sponsors who made the Tara