Commentary

Elucidating microbial codes to distinguish individuals

Allyson L. Byrd and Julia A. Segre
PNAS June 2, 2015 112 (22) 6778-6779; first published May 26, 2015; https://doi.org/10.1073/pnas.1507731112
Allyson L. Byrd
aMicrobial Genomics Section, Translational and Functional Genomics Branch, National Human Genome Research Institute, Bethesda, MD 20892;
bDepartment of Bioinformatics, Boston University, Boston, MA 02215
Julia A. Segre
aMicrobial Genomics Section, Translational and Functional Genomics Branch, National Human Genome Research Institute, Bethesda, MD 20892;
  • For correspondence: jsegre@nhgri.nih.gov

See related content:

  • Human microbiome identifiability
    - May 11, 2015

Studies of the human microbiome have revealed that both site specificity and individuality play a role in shaping the microbial communities of healthy individuals (1). Longitudinal studies have determined that bacterial strains remain stable over time (2, 3), suggesting that microbial signatures may distinguish individuals. In PNAS, Franzosa et al. evaluate whether this variation within the human microbiome is sufficient to distinguish individuals in a large population (4).

Adapting Classical Computer Science Algorithms for Biological Reality

To evaluate the feasibility of microbiome-based identifiability, Franzosa et al. (4) intertwine microbial ecological theories with computer science algorithms. These two fields are elegantly blended throughout the manuscript as the authors walk readers through algorithm development, taking time to justify each decision with a biological reality. This care in explaining the details ensures the story is accessible to scientists with mixed backgrounds and yields an excellent “teaching” paper as well as a research study. Multidisciplinary appeal is particularly important as more and more research involves collaborations spanning different fields of expertise.

Franzosa et al. begin the task of metagenomic code construction by introducing the concept of a hitting set (5). Fig. 1 demonstrates hitting sets in relation to metagenomic code construction in a population of four individuals. Each individual has several features in varying abundance (Fig. 1A), from which a subset is selected to differentiate individuals. For example, when comparing features present in individual 1 to individual 2, features found exclusively in individual 1 are sufficient to distinguish 1 from 2 and represent the nonhit set. These discriminatory features are combined to create a list of features capable of distinguishing one individual from the rest of the population. In a greedy approach, features most common among the nonhit sets are prioritized to create the hitting set or metagenomic code (Fig. 1B). This equates to prioritizing rare features in the population. Although this is computationally sufficient for creating a minimally sized code, it ignores the biological necessity of prioritizing features that will remain stable over time.
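The greedy construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Franzosa et al.: features are treated as simple presence/absence sets, and the function name `greedy_code`, the individual labels, and the color-named features are hypothetical and do not reproduce Fig. 1.

```python
from collections import Counter

def greedy_code(target, population):
    """Greedy hitting set over the target's nonhit sets.

    population: dict mapping an individual to the set of features
    detected in that individual (presence/absence only, an assumption).
    Returns a set of features, or None if no unique code exists.
    """
    own = population[target]
    # One nonhit set per other individual: features found only in the target.
    nonhit_sets = [own - feats for name, feats in population.items()
                   if name != target]
    # An empty nonhit set means someone carries every feature of the target.
    if any(not s for s in nonhit_sets):
        return None
    code = set()
    while nonhit_sets:
        counts = Counter(f for s in nonhit_sets for f in s)
        # Pick the feature present in the most remaining nonhit sets;
        # ties are broken alphabetically so the result is deterministic.
        best = min(counts, key=lambda f: (-counts[f], f))
        code.add(best)
        nonhit_sets = [s for s in nonhit_sets if best not in s]
    return code

# Hypothetical four-person population (illustrative only).
population = {
    "ind1": {"red", "blue", "green"},
    "ind2": {"blue", "green", "brown"},
    "ind3": {"blue", "green"},
    "ind4": {"red", "blue", "yellow"},
}

for person in population:
    print(person, greedy_code(person, population))
```

In this toy population, the third individual receives no code because all of that individual's features are also carried by the first, which mirrors the situation noted for individual 3 in the Fig. 1 legend.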

Fig. 1.

(A) Four individuals and their color-coded features (circles). More circles indicate a more abundant feature. (B) Example of an efficient greedy approach used to construct a minimal hitting set for individual 1. Venn diagrams demonstrate features present in individual 1 compared with other individuals. For each comparison, the nonhit set is those features found exclusively in individual 1. The candidate hitting set is iteratively created by adding the most common elements among the nonhit sets. Large, unslashed circles represent features included in the code. (C) Simplified example of a biologically informed greedy code construction for individual 1 that prioritizes feature stability. Individual 1’s detectable features are ordered by descending abundance gap. In an iterative manner, the highest ranked feature is used to remove individuals from part A in whom this feature was not found. Only if a feature removes an individual does it become part of the code. For example, the red feature distinguishes individual 1 from individual 2 and therefore is part of individual 1’s code. The code is complete when individual 1 can be distinguished from all other individuals in part A. (D) Possible results include the following: a true positive, individual 1’s code remains stable; a false negative, individual 2 lost the brown feature and therefore his code no longer matches him; a false positive, individual 4 acquired the yellow feature so now matches his own code in addition to individual 2’s code. Large circles comprise unique codes for each individual in part A. Note: individual 3 lacks a unique code because all of his features are shared with individual 1.

To elucidate factors promoting stability, Franzosa et al. analyzed samples from the Human Microbiome Project (HMP) (1). In the published HMP study, 120 individuals were sampled at multiple body sites and time points. Samples were subjected to 16S ribosomal sequencing to detect archaeal and bacterial communities, and a subset of these samples was also subjected to metagenome shotgun (MGS) sequencing to capture the entire genomic content of the sample including the human host, bacteria, fungi, and viruses. Because of the availability of amplicon and MGS sequencing, the authors were able to test strategies using four different microbial features. Temporal stability of these features was evaluated against the ecological measures of prevalence and abundance. Prevalence is the percentage of individuals in a population who possess the feature. Abundance is the relative quantity of the feature in an individual. Unsurprisingly, the abundance of a feature was positively correlated with its stability, or ability to be redetected at a later time point (4). Prevalence was also a strong indicator of stability. Features present at low levels in the population were more frequently lost between time points, whereas prevalent features were more commonly acquired by noncarriers between time points (4).
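The two ecological measures defined above can be computed directly from a table of feature counts. The following Python sketch assumes a hypothetical input layout (one dictionary of raw counts per individual) and treats any nonzero count as detection; the function names and the example numbers are illustrative and are not taken from the HMP data.

```python
def relative_abundance(counts):
    """Abundance: share of each feature within one individual's sample."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: n / total for feature, n in counts.items()}

def prevalence(feature, samples):
    """Prevalence: percentage of individuals in whom the feature is detected.

    Any nonzero count is treated as detection (an assumption; a real
    analysis would likely apply a detection threshold).
    """
    carriers = sum(1 for counts in samples.values()
                   if counts.get(feature, 0) > 0)
    return 100.0 * carriers / len(samples)

# Hypothetical raw counts for four individuals (illustrative only).
samples = {
    "ind1": {"red": 6, "blue": 3, "green": 1},
    "ind2": {"blue": 5, "brown": 5},
    "ind3": {"blue": 2, "green": 2},
    "ind4": {"red": 1, "blue": 1, "yellow": 2},
}

print(relative_abundance(samples["ind1"]))
print(prevalence("blue", samples), prevalence("brown", samples))
```

Here the blue feature is carried by everyone (high prevalence) and the brown feature by a single individual (low prevalence), the two extremes whose stability behavior is contrasted in the text.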

Given these associations, metagenomic codes prioritizing rare features will be more susceptible to temporal variation and failure. Thus, to prioritize stable features, Franzosa et al. (4) modified the biologically naïve greedy approach to favor abundant features that would be robust to temporal variation. As visualized in Fig. 1C, this was accomplished by first ordering the features present in an individual by descending abundance gap, that is, the difference in abundance between the feature and the next most abundant in the population. This effectively prioritizes features in codes that are abundant in an individual but not overly prevalent in the population. This promotes stability of the feature in the individual and lessens the likelihood of others acquiring the feature over time. This biologically informed approach may generate a metagenomic code that is larger than that generated by the efficient greedy algorithm (Fig. 1 B and C); however, the cost of a larger code is justified given the added insurance of stability.
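A simplified version of this biologically informed construction, following the steps in the Fig. 1C legend, might look like the Python sketch below. Two points are assumptions rather than details confirmed by the commentary: the abundance gap is interpreted here as the target's abundance of a feature minus the highest abundance of that feature in any other individual, and a feature counts as "not found" when its abundance is zero. The function name and example abundances are hypothetical.

```python
def abundance_gap_code(target, samples):
    """Build a code for `target` by ranking features on abundance gap.

    samples: dict mapping an individual to {feature: relative abundance}.
    Returns an ordered list of features, or None if the target cannot be
    separated from everyone else.
    """
    own = samples[target]
    others = {name: s for name, s in samples.items() if name != target}

    def gap(feature):
        # Assumed definition: target's abundance minus the highest
        # abundance of the same feature among the other individuals.
        rival = max((s.get(feature, 0.0) for s in others.values()),
                    default=0.0)
        return own[feature] - rival

    detectable = [f for f, a in own.items() if a > 0]
    ranked = sorted(detectable, key=lambda f: (-gap(f), f))

    remaining = set(others)  # individuals not yet excluded
    code = []
    for feature in ranked:
        excluded = {n for n in remaining
                    if others[n].get(feature, 0.0) == 0}
        if excluded:  # only features that remove someone enter the code
            code.append(feature)
            remaining -= excluded
        if not remaining:
            return code
    return None

# Hypothetical relative abundances (illustrative only).
samples = {
    "ind1": {"blue": 0.5, "red": 0.3, "green": 0.2},
    "ind2": {"blue": 0.4, "green": 0.1, "brown": 0.5},
    "ind3": {"blue": 0.6, "green": 0.4},
    "ind4": {"blue": 0.3, "red": 0.1, "yellow": 0.6},
}

print(abundance_gap_code("ind1", samples))
print(abundance_gap_code("ind3", samples))
```

Because features are visited in stability-oriented order rather than chosen to cover the most nonhit sets, the resulting code is not guaranteed to be minimal, which is the trade-off the paragraph above describes.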

Evaluating Accuracy of Unique Metagenomic Codes Derived from Sequencing Data

Using the biologically informed approach for hitting sets, Franzosa et al. (4) constructed codes from four different metagenomic features, two taxon-level and two gene-level.

The different features were then evaluated for their ability to generate a code that was specific to an individual in a sample population, stable over time, and unlikely to match an individual in an unseen population. The authors found that gene-level features produced population-level unique codes for the majority of individuals, whereas taxon-level codes more frequently could not be generated (4). Often, the source of a failure was an individual’s taxon-level features (species) being contained within other individuals’ communities. This is probable given the limited number of species that have been identified to colonize each site of the body (1). The gene-level codes, built from species marker genes and windows of bacterial genomes, overcome this limitation by exploiting strain-level differences that provide a larger reservoir from which to pull variants. Similar results were published by Schloissnig et al. (2), who found that species relative abundances were insufficient to distinguish individuals at multiple time points, whereas SNP variation patterns might. Other studies have also highlighted the extensive strain-level variability between individuals in the gut (6) and skin (7), further emphasizing that species-level comparisons fail to capture important diversity within a community.

Next, Franzosa and colleagues assessed the stability of their metagenomic codes using the HMP longitudinal data. For each code, they determined which of the scenarios depicted in Fig. 1D occurred over time. Possible fates for a code include the following: true positive, an individual’s code derived at time 1 still uniquely identifies him among the population at time 2; false negative, an individual lost a component of his code between time 1 and 2 so is no longer recognized; false positive, an individual gains a feature over time so he is now identified by his own code and someone else’s code. By calculating the occurrence of each of these scenarios, the authors found taxon-level codes to be unstable with an average of 14% true positives across body sites. In comparison, gene-level codes were much more stable with 52% true positives (4). This discrepancy exists because marker-based codes are composed of fewer, more stable taxa, whereas taxon-level codes require inclusion of less abundant taxa to achieve uniqueness. Stability was highest in the gut where marker-based codes had an average of 86% true positives and only 2% false positives.
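The three fates can be expressed as a small classification routine. The Python sketch below is a simplification under stated assumptions: a code "matches" an individual when every feature in the code is present at the later time point, any match to a non-owner is scored as a false positive (whether or not the owner also matches), and the codes and time 2 profiles shown are hypothetical rather than taken from the HMP data.

```python
def code_fate(owner, code, later_profiles):
    """Classify what happened to one individual's code at a later time.

    later_profiles: dict mapping an individual to the set of features
    detected at the later time point.
    """
    hits = {name for name, feats in later_profiles.items()
            if set(code) <= feats}  # all code features present
    if hits - {owner}:
        return "false positive"   # the code now matches someone else
    if owner in hits:
        return "true positive"    # the code still matches only its owner
    return "false negative"       # the code no longer matches anyone

# Hypothetical codes built at time 1 (illustrative only).
codes = {
    "ind1": {"red", "green"},
    "ind2": {"brown"},
    "ind4": {"yellow"},
}

# Hypothetical profiles at time 2: ind2 lost brown, ind3 gained yellow.
time2 = {
    "ind1": {"red", "blue", "green"},
    "ind2": {"blue", "green"},
    "ind3": {"blue", "green", "yellow"},
    "ind4": {"red", "blue", "yellow"},
}

fates = {owner: code_fate(owner, code, time2)
         for owner, code in codes.items()}
print(fates)

true_positive_rate = (sum(f == "true positive" for f in fates.values())
                      / len(fates))
print(true_positive_rate)
```

Tallying the fates across all codes gives the kind of true positive and false positive percentages reported in the paragraph above.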

To predict the robustness of their metagenomic codes against previously unseen subjects, Franzosa et al. (4) computationally inferred the population size for which they expected a particular code to be unique. Depending on the feature type and body site, a code was predicted to be unique among hundreds of individuals. For stool, codes were predicted to be unique within ∼700 individuals, a number validated with 85 stool metagenomes from a cohort of healthy Danish subjects (8). These results demonstrate the feasibility of generating microbial codes that remain stable over time and unique within a pool of hundreds of unseen individuals.

Implications of Personal Microbiome Signatures

Using a classical computer science algorithm adapted to biological reality, Franzosa et al. generated microbiome codes predicted to be unique among hundreds of individuals. In their discussion, the authors address the ethical implications of microbiome-based identifiability. Concerns about linking study participants to their sequencing data are not unfounded. In 2013, Gymrek et al. (9) used publicly available Internet resources to identify 50 individuals who had participated in a genomic study. In the study, surnames were inferred from genomic sequencing using genealogy databases linking the two. By combining the recovered surname with additional demographic information, i.e., subject’s year of birth and state of residency, the authors were able to link several individuals with their sequencing data.

Although no database exists linking microbial communities and surnames, certain pieces are coming together. For example, recent studies (10, 11) explore the similarity of microbial signatures within families. The possibility that individuals could be identified from microbiome data needs to be considered in the context of projects such as American Gut (americangut.org), which use crowdsourcing to collect thousands of microbiome samples from individuals curious to learn their personal microbial composition. On their website and in their consent form, American Gut cautions users, “it is theoretically possible that you might be identifiable from your data.” Users should carefully consider this possibility because personal information including diet (12), health status (13), age, and geography (14) can potentially be inferred from their microbial communities.

Footnotes

  • 1To whom correspondence should be addressed. Email: jsegre@nhgri.nih.gov.
  • Author contributions: A.L.B. and J.A.S. wrote the paper.

  • The authors declare no conflict of interest.

  • See companion article on page E2930.

References

  1. Human Microbiome Project Consortium (2012) Structure, function and diversity of the healthy human microbiome. Nature 486(7402):207–214.
  2. Schloissnig S, et al. (2013) Genomic variation landscape of the human gut microbiome. Nature 493(7430):45–50.
  3. Faith JJ, et al. (2013) The long-term stability of the human gut microbiota. Science 341(6141):1237439.
  4. Franzosa EA, et al. (2015) Identifying personal microbiomes using metagenomic codes. Proc Natl Acad Sci USA 112:E2930–E2938.
  5. Selman B (2008) Computational science: A hard statistical view. Nature 451(7179):639–640.
  6. Greenblum S, Carr R, Borenstein E (2015) Extensive strain-level copy-number variation across human gut microbiome species. Cell 160(4):583–594.
  7. Oh J, et al., NISC Comparative Sequencing Program (2014) Biogeography and individuality shape function in the human skin metagenome. Nature 514(7520):59–64.
  8. Arumugam M, et al., MetaHIT Consortium (2011) Enterotypes of the human gut microbiome. Nature 473(7346):174–180.
  9. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y (2013) Identifying personal genomes by surname inference. Science 339(6117):321–324.
  10. Schloss PD, Iverson KD, Petrosino JF, Schloss SJ (2014) The dynamics of a family’s gut microbiota reveal variations on a theme. Microbiome 2:25.
  11. Lax S, et al. (2014) Longitudinal analysis of microbial interaction between humans and the indoor environment. Science 345(6200):1048–1052.
  12. Wu GD, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334(6052):105–108.
  13. Greenblum S, Turnbaugh PJ, Borenstein E (2012) Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proc Natl Acad Sci USA 109(2):594–599.
  14. Yatsunenko T, et al. (2012) Human gut microbiome viewed across age and geography. Nature 486(7402):222–227.