Elucidating microbial codes to distinguish individuals

Studies of the human microbiome have revealed that both site specificity and individuality play a role in shaping the microbial communities of healthy individuals (1). Longitudinal studies have determined that bacterial strains remain stable over time (2, 3), suggesting that microbial signatures may distinguish individuals. In PNAS, Franzosa et al. evaluate whether this variation within the human microbiome is sufficient to distinguish individuals in a large population (4).
Adapting Classical Computer Science Algorithms for Biological Reality
To evaluate the feasibility of microbiome-based identifiability, Franzosa et al. (4) intertwine microbial ecological theories with computer science algorithms. These two fields are elegantly blended throughout the manuscript as the authors walk readers through algorithm development, taking time to justify each decision with a biological reality. This care in explaining the details ensures the story is accessible to scientists with mixed backgrounds and yields an excellent “teaching” paper as well as a research study. Multidisciplinary appeal is particularly important as more and more research involves collaborations spanning different fields of expertise.
Franzosa et al. begin the task of metagenomic code construction by introducing the concept of a hitting set (5). Fig. 1 demonstrates hitting sets in relation to metagenomic code construction in a population of four individuals. Each individual has several features in varying abundance (Fig. 1A), from which a subset is selected to differentiate individuals. For example, when comparing features present in individual 1 to individual 2, features found exclusively in individual 1 are sufficient to distinguish 1 from 2 and represent the nonhit set. These discriminatory features are combined to create a list of features capable of distinguishing one individual from the rest of the population. In a greedy approach, features most common among the nonhit sets are prioritized to create the hitting set or metagenomic code (Fig. 1B). This equates to prioritizing rare features in the population. Although this is computationally sufficient for creating a minimally sized code, it ignores the biological necessity of prioritizing features that will remain stable over time.
Fig. 1. (A) Four individuals and their color-coded features (circles). More circles indicate a more abundant feature. (B) Example of an efficient greedy approach used to construct a minimal hitting set for individual 1. Venn diagrams demonstrate features present in individual 1 compared with other individuals. For each comparison, the nonhit set is those features found exclusively in individual 1. The candidate hitting set is iteratively created by adding the most common elements among the nonhit sets. Large, unslashed circles represent features included in the code. (C) Simplified example of a biologically informed greedy code construction for individual 1 that prioritizes feature stability. Individual 1’s detectable features are ordered by descending abundance gap. In an iterative manner, the highest ranked feature is used to remove individuals from part A in whom this feature was not found. Only if a feature removes an individual does it become part of the code. For example, the red feature distinguishes individual 1 from individual 2 and therefore is part of individual 1’s code. The code is complete when individual 1 can be distinguished from all other individuals in part A. (D) Possible results include the following: a true positive, individual 1’s code remains stable; a false negative, individual 2 lost the brown feature and therefore his code no longer matches him; a false positive, individual 4 acquired the yellow feature so now matches his own code in addition to individual 2’s code. Large circles comprise unique codes for each individual in part A. Note: individual 3 lacks a unique code because all of his features are shared with individual 1.
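The efficient greedy construction described for Fig. 1B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_hitting_set`, the four-person `population`, and its color-named features are all hypothetical, and ties are broken alphabetically only to make the result reproducible.

```python
from collections import Counter

def greedy_hitting_set(target, others):
    """Return a small set of the target's features that separates the target
    from every other individual, or None if no such code exists."""
    # One nonhit set per comparison: features found only in the target.
    nonhit_sets = [target - other for other in others]
    if any(not s for s in nonhit_sets):
        return None  # the target's features are fully contained in someone else
    code = set()
    remaining = nonhit_sets
    while remaining:
        # Count how many unresolved comparisons each feature would settle.
        counts = Counter(f for s in remaining for f in s)
        # Take the most common feature (alphabetical tie-break, an assumption).
        best = min(counts, key=lambda f: (-counts[f], f))
        code.add(best)
        remaining = [s for s in remaining if best not in s]
    return code

# Hypothetical toy population (presence/absence only).
population = {
    "ind1": {"red", "blue", "green", "yellow"},
    "ind2": {"blue", "green", "brown"},
    "ind3": {"red", "blue"},
    "ind4": {"red", "green", "brown"},
}
codes = {
    name: greedy_hitting_set(
        feats, [o for n, o in population.items() if n != name]
    )
    for name, feats in population.items()
}
```

In this toy population the rare `yellow` feature alone separates `ind1` from everyone else, which illustrates why the efficient approach tends to favor rare features, and `ind3` receives no code because all of its features also occur in `ind1`.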
To elucidate factors promoting stability, Franzosa et al. analyzed samples from the Human Microbiome Project (HMP) (1). In the published HMP study, 120 individuals were sampled at multiple body sites and time points. Samples were subjected to 16S ribosomal RNA gene sequencing to detect archaeal and bacterial communities, and a subset of these samples was also subjected to metagenomic shotgun (MGS) sequencing to capture the entire genomic content of the sample, including the human host, bacteria, fungi, and viruses. Because of the availability of amplicon and MGS sequencing, the authors were able to test strategies using four different microbial features. Temporal stability of these features was evaluated against the ecological measures of prevalence and abundance. Prevalence is the percentage of individuals in a population who possess the feature. Abundance is the relative quantity of the feature in an individual. Unsurprisingly, the abundance of a feature was positively correlated with its stability, or ability to be redetected at a later time point (4). Prevalence was also a strong indicator of stability. Features present at low levels in the population were more frequently lost between time points, whereas prevalent features were more commonly acquired by noncarriers between time points (4).
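The two ecological measures defined above are simple to compute. The sketch below assumes each sample is a mapping from feature name to a raw count; the names `prevalence`, `relative_abundance`, and `samples`, as well as the counts themselves, are hypothetical and not taken from the HMP data.

```python
def prevalence(feature, samples):
    """Percentage of individuals in whom the feature is detected."""
    carriers = sum(1 for counts in samples.values() if counts.get(feature, 0) > 0)
    return 100.0 * carriers / len(samples)

def relative_abundance(feature, counts):
    """Fraction of one individual's total counts contributed by the feature."""
    total = sum(counts.values())
    return counts.get(feature, 0) / total if total else 0.0

# Hypothetical per-individual feature counts.
samples = {
    "ind1": {"red": 30, "blue": 10, "green": 60},
    "ind2": {"blue": 50, "green": 50},
    "ind3": {"red": 5, "blue": 95},
    "ind4": {"green": 100},
}
red_prevalence = prevalence("red", samples)                       # detected in 2 of 4
red_abundance_ind1 = relative_abundance("red", samples["ind1"])   # 30 of 100 counts
```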
Given these associations, metagenomic codes prioritizing rare features will be more susceptible to temporal variation and failure. Thus, to prioritize stable features, Franzosa et al. (4) modified the biologically naïve greedy approach to favor abundant features that would be robust to temporal variation. As visualized in Fig. 1C, this was accomplished by first ordering the features present in an individual by descending abundance gap, that is, the difference between the feature's abundance in that individual and its next highest abundance in the population. This effectively prioritizes features that are abundant in an individual but not overly prevalent in the population, which promotes stability of the feature in the individual and lessens the likelihood of others acquiring the feature over time. This biologically informed approach may generate a metagenomic code that is larger than that generated by the efficient greedy algorithm (Fig. 1 B and C); however, the cost of a larger code is justified given the added insurance of stability.
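The biologically informed construction of Fig. 1C can be sketched as follows. This is an illustration under assumptions rather than the published algorithm: the abundance gap is taken here to be a feature's abundance in the target minus its highest abundance among the other individuals, and the names `abundance_gap`, `stable_code`, and the toy abundance profiles are hypothetical.

```python
def abundance_gap(feature, target, others):
    """Assumed definition: target abundance minus the highest abundance
    of the same feature in any other individual."""
    next_highest = max((o.get(feature, 0.0) for o in others), default=0.0)
    return target[feature] - next_highest

def stable_code(target, others):
    """Build a code from the target's features in descending abundance-gap
    order; return None if some other individual can never be excluded."""
    detectable = [f for f, a in target.items() if a > 0]
    ranked = sorted(detectable, key=lambda f: (-abundance_gap(f, target, others), f))
    remaining = list(range(len(others)))  # individuals not yet excluded
    code = []
    for feature in ranked:
        if not remaining:
            break
        # Individuals lacking this feature are excluded by it.
        still_matching = [i for i in remaining if others[i].get(feature, 0.0) > 0]
        if len(still_matching) < len(remaining):
            code.append(feature)  # kept only because it excluded someone
            remaining = still_matching
    return code if not remaining else None

# Hypothetical relative abundances for four individuals.
ind1 = {"red": 0.5, "blue": 0.25, "green": 0.15, "yellow": 0.1}
ind2 = {"blue": 0.3, "green": 0.3, "brown": 0.4}
ind3 = {"red": 0.2, "blue": 0.8}
ind4 = {"red": 0.1, "green": 0.5, "brown": 0.4}

code_ind1 = stable_code(ind1, [ind2, ind3, ind4])
code_ind3 = stable_code(ind3, [ind1, ind2, ind4])
```

With these toy profiles the code for `ind1` begins with the abundant `red` feature, which excludes `ind2`, and then adds `yellow` to exclude the rest. The result is longer than a code built from the rarest feature alone, consistent with the trade-off between code size and stability discussed above.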
Evaluating Accuracy of Unique Metagenomic Codes Derived from Sequencing Data
Using the biologically informed approach for hitting sets, Franzosa et al. (4) constructed codes from four different metagenomic features, two taxon-level and two gene-level.
The different features were then evaluated for their ability to generate a code that was specific to an individual in a sample population, stable over time, and unlikely to match an individual in an unseen population. The authors found that gene-level features produced population-level unique codes for the majority of individuals, whereas taxon-level codes more frequently could not be generated (4). Often, the source of a failure was an individual’s taxon-level features (species) being contained within other individuals’ communities. This is probable given the limited number of species that have been found to colonize each site of the body (1). The gene-level codes (species marker genes and windows of bacterial genomes) overcome this limitation by exploiting strain-level differences that provide a larger reservoir from which to pull variants. Similar results were published by Schloissnig et al. (2), who found that species relative abundances were insufficient to distinguish individuals at multiple time points, whereas SNP variation patterns might suffice. Other studies have also highlighted the extensive strain-level variability between individuals in the gut (6) and skin (7), further emphasizing that species-level comparisons fail to capture important diversity within a community.
Next, Franzosa and colleagues assessed the stability of their metagenomic codes using the HMP longitudinal data. For each code, they determined which of the scenarios depicted in Fig. 1D occurred over time. Possible fates for a code include the following: true positive, an individual’s code derived at time 1 still uniquely identifies him among the population at time 2; false negative, an individual lost a component of his code between time 1 and 2 and so is no longer recognized; false positive, an individual gains a feature over time and so is now identified by both his own code and someone else’s code. By calculating the occurrence of each of these scenarios, the authors found taxon-level codes to be unstable with an average of 14% true positives across body sites. In comparison, gene-level codes were much more stable with 52% true positives (4). This discrepancy exists because marker-based codes are composed of fewer, more stable taxa, whereas taxon-level codes require inclusion of less abundant taxa to achieve uniqueness. Stability was highest in the gut, where marker-based codes had an average of 86% true positives and only 2% false positives.
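The three fates listed above can be illustrated with a small per-code classifier. This sketch makes simplifying assumptions that may differ from the rules used in the study: a code matches a sample when every feature in the code is present, and any match to a non-owner is counted as a false positive. The function `classify_code`, the codes, and the time 2 samples are hypothetical.

```python
def classify_code(owner, code, later_samples):
    """Classify the fate of one code against samples from a later time point."""
    # A sample matches when it contains every feature in the code.
    matches = {name for name, feats in later_samples.items() if code <= feats}
    if matches == {owner}:
        return "true positive"   # still identifies the owner, and only the owner
    if matches - {owner}:
        return "false positive"  # someone other than the owner now matches
    return "false negative"      # nobody matches, so the owner is lost

# Hypothetical codes derived at time 1.
codes = {
    "ind1": {"red", "yellow"},
    "ind2": {"brown", "blue"},
    "ind4": {"brown", "red"},
}
# Hypothetical samples at time 2.
time2 = {
    "ind1": {"red", "blue", "green", "yellow"},  # unchanged
    "ind2": {"blue", "green"},                   # lost brown
    "ind3": {"red", "brown"},                    # lost blue, acquired brown
    "ind4": {"red", "green", "brown"},           # unchanged
}
outcomes = {owner: classify_code(owner, code, time2) for owner, code in codes.items()}
```

Here `ind1` remains uniquely identified, `ind2` is no longer recognized after losing `brown`, and the code for `ind4` picks up a spurious match because `ind3` acquired `brown`.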
To predict the robustness of their metagenomic codes against previously unseen subjects, Franzosa et al. (4) computationally inferred the population size for which they expected a particular code to be unique. Depending on the feature type and body site, a code was predicted to be unique among hundreds of individuals. For stool, codes were predicted to be unique within ∼700 individuals, a number validated with 85 stool metagenomes from a cohort of healthy Danish subjects (8). These results demonstrate the feasibility of generating microbial codes that remain stable over time and unique within a pool of hundreds of unseen individuals.
Implications of Personal Microbiome Signatures
Using a classical computer science algorithm adapted to biological reality, Franzosa et al. generated microbiome codes predicted to be unique among hundreds of individuals. In their discussion, the authors address the ethical implications of microbiome-based identifiability. Concerns of linking study participants to their sequencing data are not unfounded. In 2013, Gymrek et al. (9) used publicly available Internet resources to identify 50 individuals who had participated in a genomic study. In the study, surnames were inferred from genomic sequencing using genealogy databases linking the two. By combining the recovered surname with additional demographic information, i.e., subject’s year of birth and state of residency, the authors were able to link several individuals with their sequencing data.
Although no database exists linking microbial communities and surnames, certain pieces are coming together. For example, recent studies (10, 11) explore the similarity of microbial signatures within families. The possibility that individuals could be identified from microbiome data needs to be considered in the context of projects such as American Gut (americangut.org), which use crowdsourcing to collect thousands of microbiome samples from individuals curious to learn their personal microbial composition. On their website and in their consent form, American Gut cautions users, “it is theoretically possible that you might be identifiable from your data.” Users should carefully consider this possibility because personal information including diet (12), health status (13), age, and geography (14) can potentially be inferred from their microbial communities.
References
- 3. Faith JJ, et al.
- 4. Franzosa EA, et al.
- 9. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y
- 11. Lax S, et al.
- 12. Wu GD, et al.
- 13. Greenblum S, Turnbaugh PJ, Borenstein E
Article Classifications
- Biological Sciences
- Microbiology