Identification of individuals by trait prediction using whole-genome sequencing data
Contributed by J. Craig Venter, June 28, 2017 (sent for review February 7, 2017; reviewed by Jean-Pierre Hubaux, Bradley Adam Malin, and Effy Vayena)

Significance
By associating deidentified genomic data with phenotypic measurements of the contributor, this work challenges current conceptions of genomic privacy. It has significant ethical and legal implications for personal privacy, the adequacy of informed consent, the viability and value of data deidentification, the potential for police profiling, and more. We invite commentary and deliberation on what these findings mean for research in genomics, for investigatory practices, and for society more broadly. Although some scholars and commentators have addressed the implications of DNA phenotyping, this work suggests that a deeper analysis is warranted.
Abstract
Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Taken individually, the predictions for a large fraction of the traits have limited accuracy beyond what ancestry and demographic information provide. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort, and an average of 5 of 10 when the held-out individuals were either all African American or all European. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.
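As a rough illustration of the matching task described above (deciding which genomic sample and which phenotype record come from the same person within a small pool), the following Python sketch scores every genome-phenotype pair and picks the best-scoring record for each genome. It is not the maximum entropy algorithm developed in this work: it substitutes a simple Gaussian similarity score, and the function names, the simulated traits, and the prediction error level `error_sd` are all hypothetical.

```python
import random


def match_score(predicted, observed, error_sd):
    """Gaussian log-likelihood (up to a constant) of observed traits given predictions.

    Assumption: trait prediction errors are independent and normally distributed.
    """
    return -sum(((o - p) / s) ** 2 for p, o, s in zip(predicted, observed, error_sd)) / 2


def reidentify(predicted_by_genome, observed_by_person, error_sd):
    """For each genome, return the index of the best-scoring phenotype record."""
    matches = []
    for predicted in predicted_by_genome:
        scores = [match_score(predicted, obs, error_sd) for obs in observed_by_person]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


def simulate_pool(pool_size=10, n_traits=5, error_sd=0.5, seed=0):
    """Simulate one pool with synthetic traits (hypothetical data, not the study cohort)."""
    rng = random.Random(seed)
    # Observed (standardized) traits for each person in the pool.
    observed = [[rng.gauss(0, 1) for _ in range(n_traits)] for _ in range(pool_size)]
    # Genome-based predictions: the true trait plus prediction noise.
    predicted = [[o + rng.gauss(0, error_sd) for o in person] for person in observed]
    matches = reidentify(predicted, observed, [error_sd] * n_traits)
    correct = sum(m == i for i, m in enumerate(matches))
    return correct, pool_size
```

Calling `simulate_pool()` returns the number of correct matches in one simulated pool of 10. Because the traits are synthetic, that count says nothing about the reidentification rates reported here; the sketch only shows how several weak per-trait predictions can be combined into a single matching decision.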
Footnotes
¹To whom correspondence may be addressed. Email: jcventer@jcvi.org or clippert@humanlongevity.com.
²Present address: Forensic Biology Unit, Alameda County Sheriff's Office, Oakland, CA 94605.
Author contributions: C.L., M.C.M., F.J.O., and J.C.V. designed research; C.L., M.C.M., V.L., and F.J.O. devised the method for reidentification; C.L., M.C.M., and C.X. performed research; C.L., R.S., M.C.M., E.Y.K., O.A., A.H., A.B., P.G., V.L., K.Y., T.W., M.Z., W.-Y.Y., C.C., T.L., C.W.H.L., B.H., C.X., J.P., S.B., and Y.T. contributed new reagents/analytic tools; C.L., R.S., M.C.M., E.Y.K., O.A., A.H., A.B., P.G., K.Y., T.W., M.Z., W.-Y.Y., T.L., C.W.H.L., and J.P. contributed phenotype prediction models; C.L., R.S., M.C.M., E.Y.K., S.L., O.A., A.H., A.B., P.G., V.L., K.Y., T.W., C.C., S.R., H.T., C.X., R.K.R., and F.J.O. analyzed data; C.L., F.J.O., and J.C.V. supervised the data analysis; A.T., R.K.R., and J.C.V. supervised the study cohort; C.L., M.C.M., A.T., and R.K.R. wrote the paper; and C.L., M.C.M., E.Y.K., S.L., O.A., A.H., A.B., P.G., K.Y., T.W., M.Z., W.-Y.Y., and R.K.R. wrote the supporting information.
Reviewers: J.-P.H., Ecole Polytechnique Fédérale de Lausanne; B.A.M., Vanderbilt University; and E.V., University of Zurich.
Conflict of interest statement: The authors are employees of and own equity in Human Longevity Inc.
Data deposition: Access to genome data is possible through a managed access agreement (www.hli-opendata.com/docs/HLIDataAccessAgreement061617.docx).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1711125114/-/DCSupplemental.
Freely available online through the PNAS open access option.