Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets
Edited by Andrew G. Clark, Cornell University, Ithaca, NY, and approved April 10, 2017 (received for review December 6, 2016)

Significance
We describe a method for identifying in distinct genetic datasets observations that represent the same person. By using correlations among genetic markers close to one another in the genome, the method can succeed even if the datasets contain no overlapping markers. We show that the method can link a dataset similar to those used in genomic studies with another dataset containing markers used for forensics. Our approach can assist in maintaining backward compatibility with databases of existing forensic genetic profiles as systems move to new marker types. At the same time, it illustrates that the privacy risks that can arise from the cross-linking of databases are inherent even for small numbers of markers.
Abstract
Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching—the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people—one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications—we find that 90–98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99–100% when ∼30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers—including databases of forensic significance.
With the increasing abundance of genetic data, the usefulness of a genetic dataset now depends in part on the possibility of productively linking it with other datasets. Thus, for example, genome-wide association study samples typed with different SNP sets are routinely combined by cross-imputation, in which markers typed only in a subset of samples are probabilistically imputed in each sample, so that all markers can be analyzed in all samples (1–3). Similarly, datasets gathered on short tandem repeat (STR) markers with different protocols can be computationally adjusted to enlarge samples for joint analysis when sets of alleles at individual markers differ between datasets (4, 5). Such efforts magnify the value of genetic datasets without requiring coordinated genotyping.
One issue that arises in combining multiple datasets is the record-matching problem: the identification of dataset entries that, although labeled differently in different datasets, represent the same underlying entity (6, 7). In a genetic context, record matching involves the identification of the same individual genome across multiple datasets when unique identifiers, such as participant names, are unavailable. This task is relatively simple when large numbers of SNPs are shared between marker sets: if records from different datasets match at enough of the shared SNPs, then they can be taken to represent the same individual.
What if no markers are shared between two genetic datasets? Can genotype records that rely on disjoint sets of markers be linked? Genetic record matching with no overlapping markers has many potential uses. Datasets could become cross-searchable even if no effort has been made to include shared markers in different marker sets. Record matching between new and old marker sets could determine whether an individual typed with a new set has appeared in earlier data, thereby facilitating deployment of new marker sets that are backward-compatible with past sets.
The presence of linkage disequilibrium (LD)—nonindependence of genotypes at distinct markers, primarily those that are proximate on the genome—can enable record matching without shared markers. As a result of LD between markers in different datasets, certain genotype pairs are more likely to co-occur, so that some potential record pairings are more likely than others. The principle applies even to different marker types not often genotyped together, such as SNPs and STRs, provided that LD exists across marker types [as is true of SNPs and STRs (8, 9)].
Relying on this principle, we devised an LD-based record-matching algorithm and evaluated its performance with nonoverlapping marker sets: one of SNPs and the other of STRs. Using 872 people from 52 populations (Table S1), we considered SNPs on a genotyping array used for population genetics and genome-wide association (10). For our STR set, we examined the Combined DNA Index System (CODIS) loci commonly used in forensic genetics (11) as well as subsets of a larger set of 432 STRs typed in the same people (12).
Table S1. Sample sizes by population
Our STR application enables record matching in forensic genetic contexts, where STRs are widely used. Record matching between SNP and STR panels has two additional motivations specific to forensics. First, SNP technological advances enable cost-effective genotyping of large numbers of SNPs, which could allow more precise genetic inferences than are possible with current STR panels. However, forensic testing in the United States continues to rely largely on the 13 STRs selected in the 1990s (13, 14), increasing to 20 STRs for new profiles beginning in 2017 (15), partly because millions of profiles for the 13 STRs have already been gathered in law enforcement databases (16). Reliable record matching between SNP and STR profiles could facilitate development of a backward-compatible SNP set that enables new SNP profiles to be matched against known STR profiles collected in past decades.
Second, the legality of the use of forensic genetic markers in light of US constitutional protections against unreasonable searches is based partly on a premise that these markers provide only the capacity for identification and no other information about a person (17–19). To test this premise, many investigations have examined phenotypic associations with the CODIS markers, mostly concluding that such associations are small enough to be unimportant (17, 20, 21). Record matching of CODIS and SNP data would make it possible to link a CODIS profile to a whole-genome SNP profile that could enable consequential phenotypic predictions, potentially undermining the claim that the CODIS markers are phenotypically trivial. Thus, applying record matching with forensic markers is important for establishing the level of “genetic privacy” present in a forensic marker profile.
Results
We split 872 people into two disjoint subsets: a training set for learning associations between STR alleles and their surrounding SNP haplotypes and a test set for assessing record-matching accuracy. We considered 10 schemes with varying fractions of the full data allocated to training and test subsets; for each scheme, we examined 100 random assignments of people to the two subsets (100 “partitions”). We focus on a scheme with intermediate sizes for the training set (75%; n = 654) and the test set (25%; n = 218).
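The partitioning step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `partition`, the integer stand-ins for sample identifiers, and the use of consecutive seeds are all assumptions made for the example.

```python
import random

def partition(ids, train_fraction, test_fraction, seed):
    """Randomly split ids into disjoint training and test subsets."""
    rng = random.Random(seed)          # seeded so each partition is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

people = list(range(872))              # hypothetical stand-ins for sample IDs
# 100 random partitions for the 75% / 25% scheme
partitions = [partition(people, 0.75, 0.25, seed) for seed in range(100)]
train, test = partitions[0]
print(len(train), len(test))           # 654 218
```

Taking the test set from the slice that follows the training slice guarantees that the two subsets never share a person, which is what allows the test set to serve as an independent check on matching accuracy.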
Imputation Accuracy.
In principle, one way to link records is by genotype imputation, in which alleles of untyped loci are probabilistically predicted using genotypes at nearby typed loci (2, 3). If STR genotypes can be imputed from SNPs with perfect accuracy, then a complete set of STR genotypes can be produced from neighboring SNPs (22).
We assessed imputation accuracy at the CODIS loci using Beagle (23), imputing genotypes at each STR based on SNP genotypes within a 1-Mb window centered on the STR. First, in the training set, we used Beagle to phase the SNP genotypes together with the STR genotypes, producing a set of estimated haplotypes that included the STR alleles. Second, we imputed STR genotypes in the test set using the phased haplotypes from the training set as a reference panel.
Considering all of the CODIS markers, Beagle imputation accuracies exceed the accuracy of a null imputation method that ignores LD with nearby SNPs (Fig. 1), but they are lower than typical SNP imputation accuracies (2, 3, 24). Combining across the 13 loci and across 100 partitions into training and test sets, the null imputation method produces a mean of 11.7 of 26 alleles imputed correctly, whereas imputing with Beagle leads to a corresponding mean of 15.2. These accuracies are similar to those obtained at non-CODIS tetranucleotide STRs (Fig. S1). As has been seen previously (24), imputation accuracy is negatively correlated with measures of genetic diversity (Table S2), and the larger space of possible genotype predictions for multiallelic STRs renders their imputation accuracies lower than those observed for lower-diversity SNP loci.
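To make the comparison concrete, the null baseline and the allelic accuracy measure might be computed as in the sketch below. It assumes that the null method imputes the single most frequent genotype in the training set and that accuracy counts the alleles (0, 1, or 2 per person) shared by imputed and true genotypes; the toy genotypes and all function names are hypothetical.

```python
from collections import Counter

def null_genotype(training_genotypes):
    """Most frequent unordered diploid genotype in the training set (assumed null rule)."""
    counts = Counter(tuple(sorted(g)) for g in training_genotypes)
    return counts.most_common(1)[0][0]

def alleles_correct(imputed, truth):
    """Number of alleles (0, 1, or 2) shared by two diploid genotypes."""
    shared = Counter(imputed) & Counter(truth)   # multiset intersection
    return sum(shared.values())

def allelic_accuracy(imputed_genotypes, true_genotypes):
    """Fraction of alleles imputed correctly across individuals."""
    total = sum(alleles_correct(i, t)
                for i, t in zip(imputed_genotypes, true_genotypes))
    return total / (2 * len(true_genotypes))

# Toy repeat-count genotypes at one STR locus (made-up values)
training = [(8, 9), (9, 8), (8, 8), (9, 11)]
test = [(8, 9), (9, 9), (11, 12)]

guess = null_genotype(training)                      # (8, 9)
accuracy = allelic_accuracy([guess] * len(test), test)
print(guess, accuracy)                               # (8, 9) 0.5
```

Summing `alleles_correct` over 13 loci for one person gives a count out of 26, the scale on which the means of 11.7 and 15.2 are reported.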
Fig. 1. Allelic imputation accuracies for 13 CODIS loci. The figure shows imputation accuracy for the partition of 872 individuals into training (75%) and test (25%) sets that yielded median (51st greatest) record-matching accuracy by the Hungarian method among 100 partitions. Beagle accuracy is obtained by imputing the STR genotype assigned the highest imputation probability by Beagle. Null accuracy is obtained by imputing the same high-frequency STR genotype in all individuals regardless of nearby SNP genotypes. Vertical lines represent 95% confidence intervals based on 10,000 bootstrap resamples of individuals from the test set. Beagle accuracies are significantly higher (Wilcoxon signed rank test, two-tailed p < 0.05) than null accuracies at all loci except one (D18S51; p = 0.09). Beagle accuracy is also higher when measuring total numbers of alleles imputed correctly in each person (p < 2.2 × 10^−16). Beagle and null accuracies are negatively correlated with heterozygosities reported in Table S2.
Fig. S1. Allelic imputation accuracies for 431 non-CODIS tetranucleotide STR loci. The plot considers the partition of the data represented in Fig. 1. Beagle imputation accuracy is obtained by imputing the STR genotype assigned the highest imputation probability by Beagle. Null imputation accuracy is obtained by imputing the same STR genotype for all people, irrespective of nearby SNP genotypes. Markers are sorted from left to right by null accuracy. Across all loci, the mean null accuracy is 0.497, and the mean Beagle accuracy is 0.624. Note that ref. 11 compared 432 rather than 431 non-CODIS tetranucleotides with the CODIS loci; we omitted TPO-D2S, an alias for the CODIS locus TPOX.
Table S2. Allelic imputation accuracies and expected heterozygosities for 13 CODIS loci
Match Scores.
Because imputation accuracies are not near one, records cannot be linked by simply imputing STR genotypes and identifying the record that matches the imputed genotype. It is nevertheless possible to combine imputation information across loci, producing a score that quantifies agreement between a set of STR genotypes and a set of SNP genotypes.
We term the set of L STR genotypes carried by an individual an “STR profile,” and we term the set of SNP genotypes of an individual (aggregating neighboring SNPs for all of the STR markers) a “SNP profile.” R_i represents the STR profile for an individual i, with the diploid genotype at the lth locus in the profile denoted R_{il}. Similarly, S_j is the SNP profile for individual j, and S_{jl} is the set of diploid SNP genotypes in SNP profile j in the window around the lth STR.
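Fellegi and Sunter (6) proposed match scores interpretable as log-likelihood ratios comparing the hypotheses that two records are drawn from the same or different people. For each possible SNP–STR profile pair, we computed the match score
\[ \lambda(R_i, S_j) = \ln \frac{P(R_i \mid S_j, M)}{P(R_i \mid S_j, \bar{M})}, \qquad [1] \]
where M denotes the hypothesis that R_i and S_j are drawn from the same person and \bar{M} denotes the hypothesis that they are drawn from different people.
STR genotypes at distinct loci are assumed to be independent in accord with the distant chromosomal locations of the CODIS loci. Consequently, the probability of observing STR profile R_i given SNP profile S_j and M is a product
\[ P(R_i \mid S_j, M) = \prod_{l=1}^{L} P(R_{il} \mid S_{jl}, M). \qquad [2] \]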
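Because the score is a log-likelihood ratio and loci are treated as independent, it can be accumulated as a sum of per-locus terms. The sketch below assumes that two probabilities are already available for each locus: the probability of the observed STR genotype given the neighboring SNPs under a match (for example, an imputation probability) and its probability under a nonmatch (for example, a population genotype frequency). The function name, the probability floor, and the numbers are illustrative assumptions, not values from the study.

```python
import math

def match_score(p_match, p_nonmatch, floor=1e-10):
    """Sum of per-locus log-likelihood ratios for one STR-SNP profile pair.

    p_match[l]    : P(STR genotype at locus l | SNP window, same person)
    p_nonmatch[l] : P(STR genotype at locus l | different people)
    floor         : assumed guard against log(0); not taken from the source
    """
    score = 0.0
    for pm, pn in zip(p_match, p_nonmatch):
        score += math.log(max(pm, floor) / max(pn, floor))
    return score

# Two hypothetical loci: the SNPs make the observed genotypes
# 5x and 2x more likely than they would be for an unrelated person.
score = match_score([0.5, 0.4], [0.1, 0.2])
print(round(score, 4))   # 2.3026, i.e. ln(10)
```

Positive scores favor the hypothesis that the two profiles come from the same person, and negative scores favor a nonmatch; many weakly informative loci can add up to a decisive total.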
For each partition into training and test sets, we computed match scores for each possible SNP–STR profile pairing in the test set (Fig. 2A). The method produces larger match scores when the profiles match than when they do not match (p < 2.2 × 10^−16) (Materials and Methods and Fig. 2B). To understand the potential of the method, we used the match-score matrix to declare matches between STR and SNP profiles in four scenarios.
Fig. 2. Match scores of records that truly match and match scores of nonmatches. (A) The matrix of match scores (Eq. 1) comparing 218 CODIS STR profiles with 218 SNP profiles for the data partition represented in Fig. 1. Each cell gives a match score for the pairing of a SNP profile with a CODIS profile. Scores pairing a given CODIS profile with each SNP profile appear in a column, and scores pairing a given SNP profile with each CODIS profile appear in a row. Darker colors represent larger values. Population memberships are colored by geographic region (Africa, orange; Europe, blue; Middle East, yellow; Central/South Asia, red; East Asia, pink; Oceania, green; Americas, purple). Of 52 populations in our dataset (Table S1), 47 appear in the test set shown. True matches are on a diagonal from the bottom left to the top right, and they tend to have higher match scores than off-diagonal nonmatches. Population structure is also visible (Table S3). For example, SNP profiles from Africans tend to have low match scores with non-Africans, and match scores of nonmatches tend to be higher when both CODIS and SNP profiles are from Native Americans. (B) Kernel density estimate for match scores. We applied a normal kernel with bandwidth chosen by Silverman’s rule (option nrd0 in the density function in R) to the matrix entries in A. Nonmatches tend to have negative log-likelihood match scores, whereas true matches tend to have positive scores.
One-to-One Matching.
We first considered the alignment of a pair of datasets on the same samples: we have n STR profiles and n SNP profiles to be matched, and it is known that each STR profile is from the same person as exactly one SNP profile. The pairing is not known and may not be trivial to determine even given an informative match-score matrix because a single STR profile might have the highest match score for multiple SNP profiles or vice versa. Given the match scores of each SNP profile with each STR profile, we conduct one-to-one matching by finding the SNP–STR profile pairing that maximizes the sum of the match scores over all paired profiles. Finding this pairing is a special case of the linear sum assignment problem solvable by the “Hungarian method” (25).
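A compact pure-Python sketch of the assignment step follows. It uses a standard O(n³) potentials-based formulation of the Hungarian method on negated scores, so that minimizing cost maximizes the total match score. The 3 × 3 matrix is made up for illustration, and nothing here is taken from the authors' implementation.

```python
def max_score_assignment(scores):
    """Return a[i] = column assigned to row i, maximizing the summed score.

    Rows are SNP profiles and columns are STR profiles (square matrix assumed).
    """
    n = len(scores)
    cost = [[-s for s in row] for row in scores]   # maximize score = minimize cost
    INF = float("inf")
    u = [0.0] * (n + 1)        # row potentials (1-indexed)
    v = [0.0] * (n + 1)        # column potentials (1-indexed)
    p = [0] * (n + 1)          # p[j] = row currently matched to column j
    way = [0] * (n + 1)        # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:     # reached an unmatched column
                break
        while j0:              # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment

# Rows 0 and 1 both score highest with column 0, so picking each row's best
# column independently would reuse a column; the joint optimum does not.
scores = [[5, 4, 0],
          [6, 1, 0],
          [0, 0, 3]]
assignment = max_score_assignment(scores)
total = sum(scores[i][j] for i, j in enumerate(assignment))
print(assignment, total)       # [1, 0, 2] 13
```

The example shows why a joint optimization is needed: the best total pairs row 0 with its second-best column so that row 1 can take column 0.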
Under the null hypothesis of random matching of STR and SNP profiles, the number of correct assignments among n people is distributed as the number of fixed points in a random permutation of length n. This quantity has expectation 1 and is approximately Poisson(1)-distributed (ref. 26, chap. 3, section 5). Applying the Hungarian method to the match-score matrix leads to highly accurate matches in the test set (Table 1). For 100 partitions into training and test sets, it gives a median of 214 of 218 (98.2%) correct assignments. In 18 of 100 cases, all assignments are correct. Even in the lowest-accuracy trial, 204 of 218 records are matched correctly, an extremely improbable result under random matching (p ≈ 2.8 × 10^−385).
Table 1. Record-matching accuracies for the CODIS markers
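The order of magnitude of this probability can be checked with the Poisson(1) approximation. The sketch below works in logarithms because the value is far below the smallest positive floating-point number; the function name and the number of series terms are choices made for this illustration.

```python
import math

def log10_poisson1_upper_tail(k, terms=50):
    """log10 of P(X >= k) for X ~ Poisson(1), computed in log space.

    Uses sum_{x >= k} 1/x! = (1/k!) * (1 + 1/(k+1) + 1/((k+1)(k+2)) + ...).
    """
    series, term = 0.0, 1.0
    for t in range(terms):
        series += term
        term /= (k + 1 + t)
    log_p = -1.0 - math.lgamma(k + 1) + math.log(series)   # natural log
    return log_p / math.log(10)

value = log10_poisson1_upper_tail(204)
exponent = math.floor(value)
mantissa = 10 ** (value - exponent)
print(f"p is about {mantissa:.1f}e{exponent}")   # p is about 2.8e-385
```

The series converges after a few terms because each successive term is smaller by a factor of more than 200.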
It is possible to decrease the rate of false-positive matches by requiring that matched pairs exceed a minimum match-score threshold, leaving the remaining records unpaired. Fig. 3A shows the proportion of accurately assigned, inaccurately assigned, and unassigned cases as the threshold is varied for partitions with the maximum, median, and minimum numbers of correct assignments when all pairs are assigned. In the lowest-accuracy trial, 67.9% of profiles (148 of 218) can be matched accurately before a single error is made.
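The thresholding procedure can be sketched as follows, assuming each proposed pair carries its match score and a flag recording whether it is a true match (known in the test set). The function names, the choice to treat a score equal to the threshold as assigned, and the toy data are assumptions for illustration.

```python
def proportions_at_threshold(pairs, threshold):
    """Return (unassigned, correct, incorrect) proportions for one threshold.

    pairs: list of (match_score, is_correct) for the proposed pairings.
    """
    n = len(pairs)
    assigned = [ok for score, ok in pairs if score >= threshold]
    correct = sum(assigned)
    incorrect = len(assigned) - correct
    return ((n - len(assigned)) / n, correct / n, incorrect / n)

def correct_before_first_error(pairs):
    """Count correct pairs that outrank the highest-scoring incorrect pair."""
    count = 0
    for score, ok in sorted(pairs, key=lambda p: p[0], reverse=True):
        if not ok:
            break
        count += 1
    return count

# Toy proposed pairings: (match score, whether the pairing is a true match)
pairs = [(9.1, True), (7.4, True), (5.0, False), (4.2, True), (-1.0, True)]
props = proportions_at_threshold(pairs, 4.0)
n_safe = correct_before_first_error(pairs)
print(props, n_safe)   # (0.2, 0.6, 0.2) 2
```

Sweeping the threshold from high to low traces out the kind of curve shown in Fig. 3, and `correct_before_first_error` corresponds to the count reported as 148 of 218 in the lowest-accuracy trial.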
Fig. 3. The proportions of profiles unassigned, correctly assigned, and incorrectly assigned as the match-score threshold is varied. When the threshold is large, all profiles are unassigned (lower left vertex). Gradually lowering the threshold leads to assignment of all profiles, tracing a curve to the right edge. Of 100 partitions into training and test sets, the figure plots trials with maximum, median, and minimum accuracies when all possible profiles are paired. (A) One-to-one matching. (B) One-to-many matching selecting the STR profile that best matches a query SNP profile. (C) One-to-many matching selecting the SNP profile that best matches a query STR profile. (D) Needle-in-haystack matching counting the proportion of true matches with match score that exceeds the maximal match score among nonmatches. In D, after the match-score threshold is lower than the largest match score among nonmatches, all pairings are marked incorrect.
Figs. S2A and S3 display one-to-one matching results when the training-set and test-set sizes are varied. As the training-set size increases, matching accuracy increases because a larger reference haplotype set increases imputation accuracy in the test set (Fig. S4). Matching accuracy declines as the size of the test set increases because the difficulty of the matching problem increases when there are more records to be matched.
Fig. S2. The median proportion of test-set CODIS and SNP records matched correctly as a function of the sizes of the training and test sets. We divided the data into training and test sets in 1,000 ways, examining training sets of sizes 436, 545, 654, and 763, representing 50, 62.5, 75, and 87.5% of the data. For each training-set size, we used test-set sizes that were multiples of 109 (1/8 of 872), so that the sum of training-set and test-set sizes did not exceed 872. For each of 10 possible schemes for the proportions representing the training and test sets, we considered 100 random divisions of the data, using the same 100 partitions in all analyses for a given scheme. (A) One-to-one matching. (B) One-to-many matching selecting the STR profile that best matches a query SNP profile. (C) One-to-many matching selecting the SNP profile that best matches a query STR profile. (D) Needle-in-haystack matching. In D, the vertical axis has the same scale as in the other panels.
Fig. S3. Proportions of the sample unassigned, correctly assigned, and incorrectly assigned as a function of the match-score threshold under one-to-one matching using the Hungarian method. Each panel considers different proportions (training, test) of the total data (n = 872) allocated into training and test sets, with 100 allocations according to those proportions. (A) 1/2, 1/8. (B) 1/2, 1/4. (C) 1/2, 3/8. (D) 1/2, 1/2. (E) 5/8, 1/8. (F) 5/8, 1/4. (G) 5/8, 3/8. (H) 3/4, 1/8. (I) 3/4, 1/4. (J) 7/8, 1/8. The figure design follows Fig. 3.
Fig. S4. The median value of the mean allelic imputation accuracy across 13 CODIS markers as a function of the size of the training set. Beagle and null imputation accuracies follow Fig. 1. The median is taken across 100 partitions into training and test sets. Imputation accuracies are plotted for all 10 schemes for the sizes of training and test sets; multiple test-set sizes produce similar values at a fixed training-set size, and they are represented by overlapping plotted points. The lines connect the median values for the test-set sizes at given training-set sizes.
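The count of 10 schemes and 1,000 partitions follows directly from the constraint on set sizes, as the short enumeration below shows. Variable names are chosen for this illustration.

```python
N = 872
unit = N // 8                      # 109 people, one eighth of the sample
train_sizes = (436, 545, 654, 763)

# Test-set sizes are multiples of 109 that, together with the training set,
# do not exceed the 872 available people.
schemes = [(train, test)
           for train in train_sizes
           for test in range(unit, N - train + 1, unit)]

for train, test in schemes:
    print(f"train {train} ({train / N:.1%}), test {test} ({test / N:.1%})")

n_partitions = len(schemes) * 100  # 100 random divisions per scheme
print(len(schemes), n_partitions)  # 10 1000
```

The four training-set sizes admit 4, 3, 2, and 1 test-set sizes, respectively, which gives the 10 schemes.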
One-to-Many Matching.
In some scenarios, it cannot be assumed that a one-to-one correspondence exists between profiles of one type and profiles of the other type. To examine these cases, we relax the assumption that each entry in each dataset matches exactly one entry in the other dataset. In this more challenging problem, representing the alignment of a pair of databases with partially overlapping but nonidentical samples, one STR (or SNP) profile is a “query,” and we seek the SNP (or STR) profile that matches the query. Here, rather than using a matrix of match scores all at once, we consider a row or column vector, quantifying the suitability of matches of the query in one database to candidate profiles in another. This vector corresponds to either a row or a column of the matrix used for one-to-one matching; it is possible that a candidate might be identified as the best match for many queries representing different individuals.
When a SNP profile in the test set is used as a query, we choose as its match the STR profile in the test set that has the largest match score with that SNP profile. Across 100 data partitions, pairing 218 SNP profiles to their highest scoring STR match produces a median accuracy of 91.3% (199 of 218) (Table 1). The minimum accuracy is 86.2% (188 of 218), and the maximum is 95.9% (209 of 218). As was seen in one-to-one matching, matching accuracy increases with the size of the training set and declines as the size of the test set increases (Figs. S2B and S5).
Fig. S5. Proportions of the sample unassigned, correctly assigned, and incorrectly assigned as a function of the match-score threshold under one-to-many matching that attempts to find the CODIS profile that matches a query SNP profile. Each panel considers different proportions (training, test) of the total data (n = 872) allocated into training and test sets, with 100 allocations according to those proportions. (A) 1/2, 1/8. (B) 1/2, 1/4. (C) 1/2, 3/8. (D) 1/2, 1/2. (E) 5/8, 1/8. (F) 5/8, 1/4. (G) 5/8, 3/8. (H) 3/4, 1/8. (I) 3/4, 1/4. (J) 7/8, 1/8. The figure design follows Fig. 3.
Similarly, when an STR profile in the test set is the query, we choose as its match the SNP profile in the test set that has the largest match score with that STR profile. Pairing STR profiles to their highest-scoring SNP match produces a median accuracy of 89.9% (196 of 218) (Table 1), with a minimum of 85.3% (186 of 218) and a maximum of 95.4% (208 of 218). Accuracy increases with training-set size and declines with test-set size (Figs. S2C and S6).
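Both query directions reduce to taking the largest entry in a row or a column of the match-score matrix. The sketch below assumes rows index SNP profiles, columns index STR profiles, and the true match of row i is column i; the matrix values and function names are invented for illustration.

```python
def best_str_for_each_snp(scores):
    """For each SNP query (row), index of the highest-scoring STR profile."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def best_snp_for_each_str(scores):
    """For each STR query (column), index of the highest-scoring SNP profile."""
    n_rows = len(scores)
    return [max(range(n_rows), key=lambda i: scores[i][j])
            for j in range(len(scores[0]))]

def accuracy(choices):
    """Fraction of queries whose chosen index equals their own index."""
    return sum(1 for i, c in enumerate(choices) if c == i) / len(choices)

scores = [[4, 1, 0],
          [5, 3, 1],
          [0, 2, 6]]

snp_queries = best_str_for_each_snp(scores)   # [0, 0, 2]
str_queries = best_snp_for_each_str(scores)   # [1, 1, 2]
print(snp_queries, accuracy(snp_queries))
print(str_queries, accuracy(str_queries))
```

In this toy matrix, STR profile 0 is returned as the best match for two different SNP queries, which cannot happen under one-to-one matching and is one reason that one-to-many accuracy is lower.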
Fig. S6. Proportions of the sample unassigned, correctly assigned, and incorrectly assigned as a function of the match-score threshold under one-to-many matching that attempts to find the SNP profile that matches a query CODIS profile. Each panel considers different proportions (training, test) of the total data (n = 872) allocated into training and test sets, with 100 allocations according to those proportions. (A) 1/2, 1/8. (B) 1/2, 1/4. (C) 1/2, 3/8. (D) 1/2, 1/2. (E) 5/8, 1/8. (F) 5/8, 1/4. (G) 5/8, 3/8. (H) 3/4, 1/8. (I) 3/4, 1/4. (J) 7/8, 1/8. The figure design follows Fig. 3.
As in the one-to-one matching case, it is possible to achieve higher confidence that proposed pairings are correct if some true matches can be missed. Fig. 3 B and C shows the proportions of accurately assigned, inaccurately assigned, and unassigned cases as the match-score threshold is varied. In the lowest-accuracy cases, when a SNP profile is the query, 41.3% of profiles (90 of 218) are assigned accurately before a single erroneous match is made, and when an STR profile is the query, 56.0% of profiles (122 of 218) are assigned accurately before an error is made.
Needle-in-Haystack Matching.
An even more difficult problem arises when only one among all possible SNP–STR profile pairings is a true match. This scenario represents the case in which a database query to locate a match is performed only for one profile. Perfect accuracy is achieved if the match-score distributions for matches and nonmatches do not overlap. To evaluate accuracy in this scenario, we recorded the fraction of true matches with match scores exceeding the largest score among nonmatching profiles.
Across partitions into training and test sets, the median percentage of true matches with match scores exceeding the maximum match score among nonmatches is 45.0% (98 of 218) (Table 1). The minimum is 8.3% (18 of 218), and the maximum is 73.4% (160 of 218). As in the other cases, matching accuracy increases with increasing training-set size and declines with increasing test-set size (Figs. S2D and S7).
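The needle-in-haystack criterion can be sketched as below. The example reads the criterion as comparing each true-match score with the single largest nonmatch score anywhere in the matrix, and it assumes that true matches lie on the main diagonal; the matrix and function name are illustrative.

```python
def needle_in_haystack_fraction(scores):
    """Fraction of true matches scoring above every nonmatch in the matrix.

    Assumes a square matrix whose diagonal entries are the true matches.
    """
    n = len(scores)
    max_nonmatch = max(scores[i][j]
                       for i in range(n) for j in range(n) if i != j)
    hits = sum(1 for i in range(n) if scores[i][i] > max_nonmatch)
    return hits / n

scores = [[4, 1, 0],
          [5, 3, 1],
          [0, 2, 6]]
fraction = needle_in_haystack_fraction(scores)
print(fraction)   # one of three true matches exceeds the top nonmatch score of 5
```

A single high-scoring nonmatch anywhere in the matrix lowers this fraction for every profile, which helps explain the wide range (8.3 to 73.4%) across partitions.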
Fig. S7. Proportions of the sample unassigned, correctly assigned, and incorrectly assigned as a function of the match-score threshold under needle-in-haystack matching. Each panel considers different proportions (training, test) of the total data (n = 872) allocated into training and test sets, with 100 allocations according to those proportions. (A) 1/2, 1/8. (B) 1/2, 1/4. (C) 1/2, 3/8. (D) 1/2, 1/2. (E) 5/8, 1/8. (F) 5/8, 1/4. (G) 5/8, 3/8. (H) 3/4, 1/8. (I) 3/4, 1/4. (J) 7/8, 1/8. The figure design follows Fig. 3.
Adding STRs.
Record matching proceeds by accumulating information about the agreement of a pair of records across loci. Thus, adding more loci is expected to increase record-matching accuracy. To evaluate the effect of the number of loci, we repeated our matching procedures in non-CODIS STR sets of varying size (Fig. 4). For each procedure, accuracy increases as more loci are considered. Median accuracy increases to 97.2% (212 of 218) in 20-locus panels for one-to-many matching procedures and 71.6% (156 of 218) for needle-in-haystack matching. Almost all 30-locus panels (99 of 100) produce perfect matching accuracy in one-to-one matching, and most produce accuracy above 99% in one-to-many matching (84 of 100 with query SNP profiles; 51 of 100 with query STR profiles). With 50-STR panels, in the median trial, 96.8% of true matches (211 of 218) have match scores exceeding the highest match score among unmatched pairs.
Fig. 4. Record-matching accuracy as a function of number of STRs. For each number of loci, 100 random locus sets are analyzed for the data partition in Fig. 1; results are shown horizontally jittered. (A) One-to-one matching. (B) One-to-many matching selecting the STR profile that best matches a query SNP profile. (C) One-to-many matching selecting the SNP profile that best matches a query STR profile. (D) Needle-in-haystack matching.
Discussion
We have shown that genetic records can potentially be linked even if they contain nonoverlapping sets of markers. Despite the small number of markers in one of our datasets—13 STRs—multimarker profiles can be matched to genome-wide SNP profiles with median accuracies in excess of 90% (Table 1). Furthermore, record-matching accuracy increases with the number of markers, and with only a few dozen markers, accuracy nears 100% (Fig. 4).
The fact that such high match accuracies are achievable despite relatively low imputation accuracies at individual STRs is perhaps surprising. In domesticated cattle, McClure et al. (22) observed that SNP haplotypes are highly predictive of the allele of an STR lying within the haplotype and that STR profiles could, therefore, be imputed with high accuracy. In humans, however, LD between STRs and surrounding SNPs is weaker, with many distinct STR alleles appearing on the same SNP haplotype in a population and with multiple SNP haplotypes possessing the same STR allele (8, 9). Nevertheless, because some LD does exist and because SNP-based imputation accuracies exceed null imputation accuracies, LD information can be accumulated across markers to permit record matching—not unlike the manner in which small differences in allele frequency across populations can be accumulated across markers to enable inference of ancestry (27, 28).
Imputation accuracy is negatively correlated with genetic diversity measures in the samples in which genotypes are imputed, and thus, it is often greater in low-diversity populations than in high-diversity populations (24, 29). Here, this effect accumulates across loci, and mean match scores are, therefore, greater for matches in low-heterozygosity Native Americans than in high-heterozygosity Africans (Table S3). The increase in match scores applies also to nonmatching pairs in low-diversity populations, however, because the greater similarity among individuals in such populations inflates the mean match score for these pairs (Fig. 2). Because match scores are inflated in low-diversity populations for both matches and nonmatches, differences between mean match scores for matches and nonmatches are similar across groups, so that the potential to separate matches from nonmatches need not be greatest in low-diversity groups (Table S3).
Table S3. Mean match scores for matching and nonmatching pairs of individuals subdivided by geographic region
Our study adds to a growing list of record-matching scenarios in genetics. Vohr et al. (30) used SNPs to calculate likelihood ratio-based match scores, relying on a phased reference panel to assist in assigning low-coverage sequence reads to the same individual. Conceptually similar methods have also been used to identify mismatches between genotypes and expression data (31, 32). Our method augments this work by using the record-matching framework and enabling matches between markers of different types, even when imputation accuracy is far below one.
The fact that CODIS imputation accuracies are relatively low (Fig. 1) suggests that, from a SNP profile, it is unlikely that the full CODIS STR profile of an individual can be reliably imputed. However, if that profile already appears in a CODIS STR database, then a match between the SNP profile and its associated STR profile might be possible to identify. The feasibility of record matching suggests a form of backward compatibility with the CODIS STR database, in which SNP rather than STR profiles would be collected on future samples and queried against both new SNP databases and existing STR data. STRs could then be typed on such samples to validate an STR–STR match only if a strong SNP–STR match is suggested. Although matching accuracy was imperfect, the fact that stringent match-score thresholds permitted many accurate matches before the first error was made suggests that backward compatibility by record matching and exclusion of unlikely pairings may be achievable for a substantial fraction of samples.
The utility of record matching in advancing forensic genetics depends on the degree to which it scales to typical forensic dataset sizes, numbering in the thousands or millions of profiles. We find that record-matching accuracy increases as larger training sets become available but decreases with the test-set size. An increase in the number of CODIS loci from 13 to 20 (15) increases the potential of record matching: accuracies were considerably higher in our 20-STR scenarios than in the 13-STR CODIS examples. Four loci in the 2017 update to the CODIS markers are available in the data of ref. 12, and combining them with the 13-locus set indeed provides a substantial accuracy increase (Table 1). We expect that accuracy could potentially be increased further if STR genotypes were produced by a procedure that obtains the full DNA sequence of the STRs rather than the length of the repeated unit, thereby subdividing repeat-length alleles into finer allelic classes (33, 34). With this approach, the level of resolution at which alleles are classified as distinct could be tuned to a level that maximizes the record-matching accuracy.
Even with quite large test sets, it is plausible that some profiles could be paired with high confidence. If the match scores
The potential for record matching of SNP and CODIS STR profiles, especially with augmented CODIS panels, uncovers new risks to privacy. Some record pairings have match scores so large that they are improbable in the absence of a true match. Thus, authorized or unauthorized analysts equipped with two datasets, one with SNP genotypes and another with CODIS genotypes, could possibly identify some pairs of records that are likely to represent the same person. For people with records linked in this way, CODIS genotypes would reveal genomic SNP genotypes that could, in turn, reveal much more information than the CODIS genotypes themselves—such as precise ancestry estimates, health and identification information that accompanies SNP records, and predictions for genetically influenced phenotypes. In this sense, contrary to the view that CODIS genotypes expose no phenotypes (17, 20, 21), a CODIS profile on a person together with a SNP database—if the person is in the database—in principle may contain all of the phenotypic information that can be reliably predicted from the SNP record. Conversely, participants in biomedical research or personal genomics who have consented to share their SNP genotypes may be subject to a previously unappreciated risk: identification in a forensic STR database.
As in other situations in which data aggregation can unexpectedly reveal genetic information at the individual level (35⇓–37), it is desirable to reevaluate the privacy of forensic STR profiles in light of the widespread availability of diverse SNP profiles to researchers and the public. Because our record-matching methods can potentially be extended beyond the detection of identical people to the detection of relatives—matching a SNP profile of an individual to an STR profile of a relative—we expect that privacy considerations will extend to this scenario as well.
Materials and Methods
Data.
From the Human Genome Diversity Panel, we examined previously reported genotypes on 872 samples—the intersection of 938 unrelated samples with SNP genotypes reported (10), a subset of 1,048 samples with STR genotypes reported (12), and 978 samples with CODIS genotypes reported (11). Population information appears in Table S1. Non-CODIS STRs included 431 tetranucleotides and trinucleotide D22S1045, which is in the 2017 CODIS update (15). We obtained non-CODIS STR positions by querying University of California-Santa Cruz (UCSC) Genome Browser’s BLAT (38) using the locus RefSeq sequence (table S1 of ref. 12) and build hg18; for CODIS loci, UCSC Genome Browser queries used the locus name.
Phasing and Imputation.
In Beagle 4.1 (23), we set the number of iterations to 10. When phasing, we used defaults for all other parameters: maxlr = 50,000, lowmem = false, window = 50,000, overlap = 3,000, impute = true, cluster = 0.005, ne = 1 million, err = 0.0001, seed = -99,999, and modelscale = 0.8. When imputing STRs, we set gprobs = true and maxlr = 1 million, and we used a linkage map based on GRCh36 coordinates.
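As a rough illustration of how the settings listed above could be passed to Beagle, the following sketch assembles two command lines. The parameter names `maxlr`, `lowmem`, `window`, `overlap`, `impute`, `cluster`, `ne`, `err`, `seed`, `modelscale`, and `gprobs` are those given in the text. The argument names `niterations`, `gt`, `ref`, `map`, and `out` are assumed here to be the corresponding Beagle 4.1 arguments, applying `niterations` to the imputation run is also an assumption, and every file name is a hypothetical placeholder.

```python
def format_value(value):
    """Render a Python value in the key=value style used on the command line."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def beagle_command(jar_path, params):
    """Build an argument list such as ['java', '-jar', jar, 'key=value', ...]."""
    return ["java", "-jar", jar_path] + [
        f"{key}={format_value(value)}" for key, value in params.items()
    ]


# Phasing settings as listed in the text; file names are hypothetical.
PHASING_PARAMS = {
    "gt": "snps_unphased.vcf.gz",   # hypothetical input file
    "out": "snps_phased",           # hypothetical output prefix
    "niterations": 10,              # assumed argument name for "number of iterations"
    "maxlr": 50000,
    "lowmem": False,
    "window": 50000,
    "overlap": 3000,
    "impute": True,
    "cluster": 0.005,
    "ne": 1000000,
    "err": 0.0001,
    "seed": -99999,
    "modelscale": 0.8,
}

# Imputation settings stated in the text; file names are hypothetical.
IMPUTATION_PARAMS = {
    "gt": "test_snps_phased.vcf.gz",      # hypothetical target file
    "ref": "training_snps_strs.vcf.gz",   # hypothetical reference panel
    "map": "genetic_map_b36.map",         # hypothetical linkage map file
    "out": "strs_imputed",                # hypothetical output prefix
    "niterations": 10,
    "gprobs": True,
    "maxlr": 1000000,
}

if __name__ == "__main__":
    print(" ".join(beagle_command("beagle.jar", PHASING_PARAMS)))
    print(" ".join(beagle_command("beagle.jar", IMPUTATION_PARAMS)))
```

The resulting lists could be passed to `subprocess.run` once the placeholder paths are replaced with real files.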
For each STR, windows extended 500 kb in both directions from the STR midpoint (GRCh36 coordinates corresponding to UCSC hg18). For non-CODIS loci, the number of SNPs in such windows ranged from 80 to 547, with a median of 262. For CODIS loci, the range was 164–655, with a median of 272.
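The window definition above can be sketched as follows. This is a minimal illustration, assuming SNP positions on the same chromosome are available as a sorted list of base-pair coordinates and that the window boundaries are inclusive; the function and variable names are hypothetical.

```python
from bisect import bisect_left, bisect_right

WINDOW_HALF_WIDTH = 500_000  # 500 kb on each side of the STR midpoint


def snps_in_window(sorted_snp_positions, str_start, str_end,
                   half_width=WINDOW_HALF_WIDTH):
    """Return the SNP positions within half_width of the STR midpoint.

    sorted_snp_positions must be sorted in increasing order and refer to
    the same chromosome and genome build as the STR coordinates.
    """
    midpoint = (str_start + str_end) / 2
    first = bisect_left(sorted_snp_positions, midpoint - half_width)
    last = bisect_right(sorted_snp_positions, midpoint + half_width)
    return sorted_snp_positions[first:last]


if __name__ == "__main__":
    positions = [100, 400_000, 1_000_000, 1_500_000, 1_600_000]
    print(snps_in_window(positions, 999_900, 1_000_100))
```

Counting the returned positions for each STR would give the per-locus SNP counts summarized above.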
Imputation Accuracy.
Imputation accuracy was assessed as the number of accurately imputed alleles (24). Null imputations were made disregarding the neighboring SNP genotypes by imputing the genotype that, under Hardy–Weinberg equilibrium with the allele frequencies estimated in the training set, is predicted to lead to the highest accuracy. Denoting the alleles at a locus 1, …, K in decreasing order of their frequencies p1, …, pK, the most frequent homozygote was imputed if
To verify that this condition for “null” imputations produces the highest accuracy, note that, if the most frequent homozygote is always imputed, then for each individual homozygous for allele 1 (frequency
If instead, the most frequent heterozygote is imputed, then the number of alleles imputed correctly is two for individuals heterozygous for the two most frequent alleles (frequency 2p1p2), one for homozygotes for allele 1 or 2 (frequency
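The comparison behind the null imputation can be worked through numerically. The sketch below assumes Hardy–Weinberg proportions and counts, for each candidate null genotype, the expected number of correctly imputed alleles per individual. The term for heterozygotes carrying exactly one of the two most frequent alleles, with frequency 2(p1 + p2)(1 − p1 − p2), is an assumption that follows from Hardy–Weinberg proportions rather than a quotation from the text, as is the choice to break ties in favor of the homozygote. Function names are hypothetical.

```python
def expected_correct_homozygote(p1):
    """Expected correct alleles when genotype 1/1 is always imputed.

    Two alleles are correct for 1/1 individuals (frequency p1^2) and one
    for heterozygotes carrying allele 1 (frequency 2*p1*(1 - p1)).
    The sum simplifies to 2*p1.
    """
    return 2 * p1**2 + 1 * 2 * p1 * (1 - p1)


def expected_correct_heterozygote(p1, p2):
    """Expected correct alleles when genotype 1/2 is always imputed."""
    other = 1 - p1 - p2
    two_correct = 2 * (2 * p1 * p2)          # true genotype 1/2
    one_correct_hom = p1**2 + p2**2          # true genotype 1/1 or 2/2
    one_correct_het = 2 * (p1 + p2) * other  # one of alleles 1, 2 plus another allele
    return two_correct + one_correct_hom + one_correct_het


def null_genotype(allele_freqs):
    """Pick the null imputation from a dict mapping allele -> frequency."""
    ranked = sorted(allele_freqs.items(), key=lambda kv: kv[1], reverse=True)
    a1, p1 = ranked[0]
    if len(ranked) == 1:
        return (a1, a1)
    a2, p2 = ranked[1]
    if expected_correct_homozygote(p1) >= expected_correct_heterozygote(p1, p2):
        return (a1, a1)
    return (a1, a2)


if __name__ == "__main__":
    print(null_genotype({"10": 0.9, "11": 0.1}))
    print(null_genotype({"10": 0.4, "11": 0.35, "12": 0.25}))
```

Under these assumptions the heterozygote total simplifies to 2p1 + 2p2 − p1² − p2², so the homozygote is preferred when p1² + p2² is at least 2p2.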
Match Scores.
To avoid likelihoods of zero in match-score computations, any diploid genotype assigned probability zero by Beagle was given probability 0.0005, one-half the lowest permissible nonzero probability in the Beagle version that we used. Probabilities were then renormalized to sum to one. For genotypes that were missing or that included alleles unobserved in the training set, probabilities were set equal under all hypotheses about M so that these genotypes did not affect match scores.
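The zero-probability adjustment can be illustrated with a short sketch. It assumes the imputed genotype probabilities for one individual at one STR are held in a plain list; the function name is hypothetical.

```python
FLOOR = 0.0005  # value given in the text for genotypes assigned probability zero


def floor_and_renormalize(genotype_probs, floor=FLOOR):
    """Replace zero probabilities with `floor`, then rescale to sum to one."""
    floored = [floor if p == 0 else p for p in genotype_probs]
    total = sum(floored)
    return [p / total for p in floored]


if __name__ == "__main__":
    adjusted = floor_and_renormalize([0.7, 0.3, 0.0])
    print(adjusted, sum(adjusted))
```

After this step every genotype has a strictly positive probability, so the logarithm of a likelihood built from these values is always finite.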
Testing Match Scores of True Matches Against Nonmatches.
To account for dependencies among values in the same column or row of the match-score matrix, we fit a linear mixed model with crossed random effects using entries from the matrix in Fig. 2A: Yij = β0 + β0i + β0j + β1Xij + εij. Here, Yij is the match score for the pairing of the ith STR and jth SNP profiles, β0 is a global intercept, β0i is a random intercept for scores involving the ith STR profile, β0j is a corresponding random intercept for the jth SNP profile, the indicator variable Xij is one if Yij represents a true match (i = j) and zero otherwise, and εij is a normal disturbance with expectation zero and constant variance. We fit the model with the lmer function in the R package lme4, computing p values by Satterthwaite approximation with package lmerTest. This model was strongly preferred over models that excluded random effects for either STR or SNP profiles (Akaike Information Criterion and Bayesian Information Criterion differences >3,000). The estimate for
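To make the structure of the mixed model concrete, the sketch below reshapes a match-score matrix into one record per STR–SNP pairing, with the true-match indicator and the two grouping factors that receive crossed random intercepts. It assumes the matrix is ordered so that true matches lie on the diagonal (i = j); the field names are hypothetical.

```python
def to_long_format(match_scores):
    """Convert a square match-score matrix into one record per (i, j) pairing.

    match_scores[i][j] is the score for the ith STR profile paired with the
    jth SNP profile; true matches are assumed to lie on the diagonal.
    """
    records = []
    for i, row in enumerate(match_scores):
        for j, score in enumerate(row):
            records.append({
                "y": score,          # Y_ij, the match score
                "str_profile": i,    # grouping factor for the STR random intercept
                "snp_profile": j,    # grouping factor for the SNP random intercept
                "x": int(i == j),    # X_ij, one for a true match and zero otherwise
            })
    return records


if __name__ == "__main__":
    toy_scores = [[5.0, -1.0], [-2.0, 4.0]]
    for record in to_long_format(toy_scores):
        print(record)
```

With data in this layout, a model of this form would typically be specified in lme4 formula syntax as `y ~ x + (1 | str_profile) + (1 | snp_profile)`; this formula is offered as an illustration and is not taken from the text.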
Acknowledgments
We thank H. Greely, E. Halperin, and N. Rudin for discussions and B. Browning for assistance with Beagle. This work was supported by NIH Grant R01HG005855 and National Institute of Justice Grant 2014-DN-BX-K015.
Footnotes
- ¹To whom correspondence should be addressed. Email: noahr@stanford.edu.
Author contributions: M.D.E., B.F.B.A.-H., J.Z.L., and N.A.R. designed research; M.D.E., T.J.P., and N.A.R. performed research; M.D.E. and N.A.R. analyzed data; and M.D.E., B.F.B.A.-H., T.J.P., J.Z.L., and N.A.R. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1619944114/-/DCSupplemental.
Freely available online through the PNAS open access option.
References
- 1. de Bakker PI, et al.
- 5. Pemberton TJ, DeGiorgio M, Rosenberg NA
- 7. Winkler WE
- 9. Willems T, Gymrek M, Highnam G, Mittelman D, Erlich Y; 1000 Genomes Project Consortium
- 10. Li JZ, et al.
- 11. Algee-Hewitt BFB, Edge MD, Kim J, Li JZ, Rosenberg NA
- 16. Federal Bureau of Investigation
- 18. Greely HT, Kaye DH
- 19. Maryland v. King, 133 S. Ct. 1958 (2013).
- 26. Riordan J
- 28. Edge MD, Rosenberg NA
- 30. Vohr SH, Buen Abad Najar CF, Shapiro B, Green RE
- 31. Westra HJ, et al.
- 32. Broman KW, et al.
- 34. Warshauer DH, King JL, Budowle B
- 36. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y
- 38. Kent WJ