# Statistical inference of the generation probability of T-cell receptors from sequence repertoires

^{a}Joseph Henry Laboratories, Princeton University, Princeton, NJ 08544; ^{b}Laboratoire de Physique Statistique, UMR8550, Centre National de la Recherche Scientifique and École Normale Supérieure, 24 rue Lhomond, 75005 Paris, France; ^{c}Laboratoire de Physique Théorique, UMR8549, Centre National de la Recherche Scientifique and École Normale Supérieure, 24 rue Lhomond, 75005 Paris, France; and ^{d}Simons Center for Systems Biology, Institute for Advanced Study, Princeton, NJ 08544

Contributed by Curtis G. Callan, Jr., July 27, 2012 (sent for review June 19, 2012)

## Abstract

Stochastic rearrangement of germline V-, D-, and J-genes to create variable coding sequence for certain cell surface receptors is at the origin of immune system diversity. This process, known as “VDJ recombination”, is implemented via a series of stochastic molecular events involving gene choices and random nucleotide insertions between, and deletions from, genes. We use large sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to infer the statistical properties of these basic biochemical events. Because any given CDR3 sequence can be produced in multiple ways, the probability distribution of hidden recombination events cannot be inferred directly from the observed sequences; we therefore develop a maximum likelihood inference method to achieve this end. To separate the properties of the molecular rearrangement mechanism from the effects of selection, we focus on nonproductive CDR3 sequences in T-cell DNA. We infer the joint distribution of the various generative events that occur when a new T-cell receptor gene is created. We find a rich picture of correlation (and absence thereof), providing insight into the molecular mechanisms involved. The generative event statistics are consistent between individuals, suggesting a universal biochemical process. Our probabilistic model predicts the generation probability of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the potential diversity of the T-cell repertoire and to understand why some sequences are shared between individuals. We argue that the use of formal statistical inference methods, of the kind presented in this paper, will be essential for quantitative understanding of the generation and evolution of diversity in the adaptive immune system.

Receptor proteins on the surfaces of B and T cells in the immune system interact with pathogens, recognize them and initiate an immune response. The diversity of these receptors is the outcome of a remarkable process in which germline DNA is edited to produce a repertoire of (T or B) cells with varied antigen receptor genes (1). The process is called “VDJ recombination” because the germline contains multiple versions of so-called V-, D-, and J-genes, particular instances of which are quasi-randomly selected, stochastically edited, and joined together to produce a new surface receptor gene each time a new immune system cell is generated.

The statistical distribution of these biochemical events (and the resulting receptor coding sequences) in a population of newly created receptors is an important quantity: It contains information about the in vivo functioning of the biochemical editing mechanism and provides the baseline for a quantitative assessment of the downstream workings of selection in the adaptive immune system. Here, we address the problem of inferring this distribution from the large T-cell sequence repertoires that are becoming available via high-throughput sequencing technology (2–5). In particular, we focus purely on the subset of receptor sequences that are nonproductive, due to a reading frame shift or an accidental stop codon, to isolate the statistics of the molecular mechanism from the effects of selection on the functional repertoires.

In the beta chain of human T-cell receptors (the focus of this work), the germline has 48 different V-genes, 2 D-genes, and 13 J-genes. VDJ recombination proceeds by first joining a D-gene with a J-gene and then a V-gene with the DJ junction. First, the recombination activating gene (RAG) protein complex brings two randomly chosen D- and J-genes together, cuts out the intervening chromosomal DNA, and forms a hairpin loop at the end of each gene (6, 7). In further steps (8, 9) the hairpin loops are opened, creating overhangs at the end of both genes that may eventually survive as P-nucleotides (short inverted repeats of gene terminal sequence) (10). This is followed by nucleotide deletions and insertions at the junctions and ends with ligation. The process is then repeated between a random V-gene and the DJ junction. The end product is the so-called CDR3 region of the receptor gene: a short, highly variable region that plays an essential role in determining the antigen specificity of the cell.

Each recombined sequence can thus be thought of as the outcome of a generative event described by several random variables (Fig. 1): V-, D-, and J-gene choices, deletions of variable numbers of nucleotides from the selected genes, insertions of random nucleotides between them, and the possible creation of P-nucleotides (short palindromic nucleotides as in Fig. 1*A* at the 3^{′} end of the D-gene). From the set of observed CDR3 sequences, we wish to infer the underlying probability distribution of these generative events.

To date, this inference has been done via a deterministic alignment procedure that assigns a unique event to each sequence (2–4). However, because individual CDR3 sequences can arise in multiple ways (see Fig. 1), this assignment must be done probabilistically. Deterministic alignment introduces spurious biases and correlations in the statistics of generative events (Fig. 2). Thus, a statistical inference procedure is needed to accurately infer the underlying event probability distribution from the data. In this paper we present such a method, based on likelihood maximization via an iterative expectation-maximization algorithm (11), and apply it to recent data on human T-cell receptor sequences.

## Analysis

We work with sequence data on CD4+ T-cell beta chain CDR3 regions obtained from nine human subjects by methods described in refs. 4 and 5 (see Acknowledgments). In these experiments, T cells are collected from a blood sample and sorted into “naïve” (CD45RO−) and “memory” (CD45RO+) compartments, DNA is extracted, and sequence reads long enough to capture a 5^{′} piece of the J-gene, a 3^{′} piece of the V-gene, and the variable sequence lying in between are obtained.

Each sequence is read multiple times, and a clustering algorithm is used to correct for sequencing error (4, 5). This process produces a dataset consisting of an average of 232,000 (140,000) unique CDR3 sequences from the naïve (memory) compartments for each individual subject. Each unique sequence comes with a multiplicity reflecting the prevalence of that particular cell type in the blood sample.

Roughly 14% of the unique CDR3 sequences are “nonproductive,” i.e., either their J-genes have been shifted out of the correct reading frame or the CDR3 sequences have a premature stop codon. They arise from a recombination event on one of a cell’s two chromosomes that failed to make a functional receptor, followed by a successful recombination on the other chromosome. Such sequences should not be subject to functional selection (5), and their statistics should reflect only the VDJ recombination process (see *SI Appendix*, section 10 for evidence that the non-productive constraint introduces no bias). Because this is our primary concern, we focus our analysis on the nonproductive CDR3 sequences, of which there are an average of 35,000 (22,000) in the naïve (memory) compartments for each individual subject. We analyze the naïve and memory data sets separately to be able to verify the absence of selection effects. Our data sets are available online (see *SI Appendix*, sections 1 and 2 for details).

### Structure of Recombination Event Distributions.

Each CDR3 generating recombination event can be fully characterized by a set *E* of discrete variables comprising: the identities of the V-, D- and J-genes selected for recombination^{*} (V,D,J); the numbers of bases deleted from the 3′ end of the V-gene (del*V*), the 5′ end of the J-gene (del*J*), and both ends of the D-gene (del5′*D* and del3′*D* for the 5′ and 3′ ends, respectively); the number of palindromic nucleotides at each of the gene ends (pal*V*, pal*J*, pal5′*D*, pal3′*D*); the specific sequence (*x*_{1},…,*x*_{insVD}) of length ins*VD* inserted at the VD junction; and the specific sequence (*y*_{1},…,*y*_{insDJ}) of length ins*DJ* inserted at the DJ junction (see Fig. 1). We choose a convention in which both sequences are read in the 5′ to 3′ direction, but the VD (DJ) inserted sequence is read from the sense (antisense) strand.

We seek a joint distribution over all of these variables containing the minimal set of dependences between the variables that is required to self-consistently capture the observed correlations in the data. We find that the following factorized form for the probability of a recombination event *E* (defined by specific values for all the event variables) successfully captures all the significant correlations between sequence features that are present in the data (see Fig. 2):

$$
P_{\mathrm{recomb}}(E) = P(V)\,P(D,J)\,P(\mathrm{del}V|V)\,P(\mathrm{ins}VD)\,P(\mathrm{del}5'D,\mathrm{del}3'D|D)\,P(\mathrm{ins}DJ)\,P(\mathrm{del}J|J)\,\prod_{i=1}^{\mathrm{ins}VD} p_{VD}(x_i|x_{i-1})\,\prod_{i=1}^{\mathrm{ins}DJ} p_{DJ}(y_i|y_{i-1}) \qquad [1]
$$

The various factors are normalized joint or conditional distributions on their respective arguments. *P*(*V*) and *P*(*D*,*J*) account for the fact that the various genes have different usage probabilities (and that D- and J-gene usage is correlated). The factors *P*(del*V*|*V*), etc., are distributions on the number of nucleotide deletions, conditioned on the gene being deleted (deletion profiles turn out to be very gene-dependent). *P*(ins*VD*) and *P*(ins*DJ*) give the probabilities of different numbers of nucleotide insertions at each junction. The parameters *p*_{VD}(*x*_{i}|*x*_{i−1}) and *p*_{DJ}(*y*_{i}|*y*_{i−1}) account for possible nucleotide bias in the insertions: They give the conditional probabilities of inserting a specific nucleotide given the identity of the immediately 5′ nucleotide, with *x*_{0} referring to the last nucleotide at the 3′ end of the truncated V-gene on the sense strand for a VD insertion, and *y*_{0} to the last nucleotide at the end of the truncated J-gene on the antisense strand for a DJ insertion.
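As an illustration of how such a factorized event probability can be evaluated, the following minimal sketch multiplies the factors named above for a single event. All parameter values, gene names, and dictionary keys (`toy_params`, `toy_event`, `p_recomb`, `insertion_prob`) are hypothetical placeholders chosen for the example, not inferred quantities, and treating the two D-gene deletions as one joint table is an assumption of the sketch.

```python
def insertion_prob(inserted, context, transition):
    """First-order Markov probability of an inserted sequence.

    `context` is the nucleotide immediately 5' of the insertion (x_0 or y_0);
    `transition[a][b]` is the probability of inserting b right after a.
    """
    prob = 1.0
    for nucleotide in inserted:
        prob *= transition[context][nucleotide]
        context = nucleotide
    return prob


def p_recomb(event, params):
    """Probability of one recombination event under a factorized model."""
    e, p = event, params
    return (
        p["V"][e["V"]]                                   # V-gene usage
        * p["DJ"][(e["D"], e["J"])]                      # joint D,J usage
        * p["delV"][e["V"]][e["delV"]]                   # deletions given V
        * p["delD"][e["D"]][(e["del5D"], e["del3D"])]    # both D ends given D
        * p["delJ"][e["J"]][e["delJ"]]                   # deletions given J
        * p["insVD"][len(e["xVD"])]                      # VD insertion length
        * p["insDJ"][len(e["yDJ"])]                      # DJ insertion length
        * insertion_prob(e["xVD"], e["x0"], p["pVD"])    # VD inserted bases
        * insertion_prob(e["yDJ"], e["y0"], p["pDJ"])    # DJ inserted bases
    )


# Toy, made-up parameters (hypothetical, for illustration only).
uniform = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
toy_params = {
    "V": {"V1": 0.6, "V2": 0.4},
    "DJ": {("D1", "J1"): 0.5, ("D1", "J2"): 0.5},
    "delV": {"V1": {0: 0.5, 1: 0.5}, "V2": {0: 1.0}},
    "delD": {"D1": {(0, 0): 1.0}},
    "delJ": {"J1": {0: 1.0}, "J2": {0: 1.0}},
    "insVD": {0: 0.5, 2: 0.5},
    "insDJ": {0: 0.5, 1: 0.5},
    "pVD": uniform,
    "pDJ": uniform,
}
toy_event = {
    "V": "V1", "D": "D1", "J": "J1",
    "delV": 1, "del5D": 0, "del3D": 0, "delJ": 0,
    "xVD": "AC", "x0": "T",   # VD insertion, read on the sense strand
    "yDJ": "G", "y0": "C",    # DJ insertion, read on the antisense strand
}
toy_probability = p_recomb(toy_event, toy_params)
```

Storing each factor as its own lookup table mirrors the conditional structure of the model: adding or removing a dependence only changes which key indexes the corresponding table.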

P-nucleotides do not appear explicitly in Eq. **1**: we treat them as “negative” deletions (i.e., a palindrome of half-length 2, as in Fig. 1*A*, is counted as a deletion of value -2). This is possible because we find that when the number of nucleotide deletions is greater than zero, occurrences of palindromic nucleotides at the end of the gene segment are completely explained by chance insertions of the corresponding nucleotides (see *SI Appendix*, section 11 and Fig. S10). Thus, true P-nucleotides, not attributable to chance insertions, only occur in association with zero nucleotide deletions and it is consistent to label them as negative deletions.

The factors in our equation for *P*_{recomb}(*E*) [Eq. **1**] are probability distributions on event variables that take on a finite number of values. Specifying this joint distribution requires a total of 2,865 probabilities (more than 90% of which are needed for the deletion length probabilities of the individual V-, D- and J-genes). Despite the large number of probabilities to be inferred, we are able to determine them accurately and without overfitting. We emphasize that our goal is to obtain an accurate description of recombination event statistics, and not (yet) to explain those statistics mechanistically.

### Generation Probability and Likelihood of Observed Sequences.

The probability *P*_{gen}(σ) of generating a specific CDR3 sequence σ is the sum of the probabilities of all recombination events *E*_{σ} that produce σ:

$$
P_{\mathrm{gen}}(\sigma) = \sum_{E_{\sigma}} P_{\mathrm{recomb}}(E_{\sigma}) \qquad [2]
$$

The likelihood *L*(σ) of observing a specific CDR3 sequence read σ, however, must take into account residual sequencing error as well as allelic variation and is given by a sum over a larger set of recombination events that generate sequences close to σ:

$$
L(\sigma) = \sum_{E} \sum_{a} P_{\mathrm{recomb}}(E)\,P(a|E)\,P_{\mathrm{err}}(\sigma|E,a) \qquad [3]
$$

$$
P_{\mathrm{err}}(\sigma|E,a) = \left(\frac{R}{3}\right)^{n_{\mathrm{err}}} (1-R)^{L-n_{\mathrm{err}}} \qquad [4]
$$

In the latter equation, *n*_{err} is the number of mismatches between the observed read σ and the CDR3 sequence that would be produced by the recombination event *E* with allele choices *a*. *L* is the length of the sequence read. The mismatch rate *R* is determined in the inference with the rest of the distribution parameters and reflects both sequencing error and unknown allelic variation. In practice, we only consider recombination events that lead to CDR3 sequences with at most a few mismatches from σ. The sum over alleles^{†} arises because we do not know a priori which alleles are present and reads may not go deep enough into the gene sequence to clearly distinguish alleles from each other (12). The probabilities of the different alleles, given a gene, are also inferred and are expected to differ from individual to individual.
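The distinction between the generation probability and the read likelihood can be sketched on a toy list of candidate events. This is an illustration under stated assumptions, not the analysis software: each candidate is reduced to a hypothetical `(probability, produced_sequence)` pair, alleles are omitted, only candidates of the same length as the read are compared, and sequencing errors are assumed to be independent per base with rate `error_rate` spread equally over the three wrong nucleotides.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def p_gen(sigma, events):
    """Sum the probabilities of all events that produce exactly sigma."""
    return sum(prob for prob, seq in events if seq == sigma)


def likelihood(sigma, events, error_rate, max_mismatches=2):
    """Likelihood of a read, summing over events that produce nearby sequences."""
    total = 0.0
    for prob, seq in events:
        if len(seq) != len(sigma):
            continue  # the toy error model has substitutions only
        n_err = hamming(seq, sigma)
        if n_err > max_mismatches:
            continue  # keep only events with at most a few mismatches
        total += (
            prob
            * (error_rate / 3) ** n_err
            * (1 - error_rate) ** (len(sigma) - n_err)
        )
    return total


# Hypothetical candidate events: two distinct events yield the same sequence.
toy_events = [(0.2, "ACGT"), (0.1, "ACGT"), (0.3, "ACGA"), (0.4, "TTTT")]
toy_p_gen = p_gen("ACGT", toy_events)
toy_likelihood = likelihood("ACGT", toy_events, error_rate=0.03)
```

With `error_rate=0.0` the likelihood reduces to the generation probability, since only exact matches contribute.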

The likelihood of the whole dataset is then the product of the individual sequence likelihoods: ∏_{σ} *L*(σ). This expression depends implicitly on the parameters defining the generative probability distribution (along with the allele distributions and the sequencing error parameter), and we infer their correct values by maximizing it using an expectation-maximization algorithm (11, 13) (see *SI Appendix* for algorithmic details). In order to identify universal features of the diversity generation machinery, we perform this inference separately for each individual subject. Our analysis software is available online (see *SI Appendix* for details).
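The logic of expectation maximization with ambiguous event assignments can be shown on a deliberately small toy problem. The sketch below is not the algorithm of the *SI Appendix*: it assumes a single categorical hidden variable, and each observation is represented only by the hypothetical set of hidden categories compatible with it.

```python
def em_event_probabilities(observations, categories, n_iter=200):
    """Maximum likelihood category probabilities from ambiguous observations.

    Each observation is the set of hidden categories that could have produced it.
    """
    probs = {c: 1.0 / len(categories) for c in categories}
    for _ in range(n_iter):
        # E-step: share each observation among its compatible categories,
        # in proportion to the current probability estimates.
        counts = {c: 0.0 for c in categories}
        for compatible in observations:
            norm = sum(probs[c] for c in compatible)
            for c in compatible:
                counts[c] += probs[c] / norm
        # M-step: renormalize the expected counts.
        total = sum(counts.values())
        probs = {c: counts[c] / total for c in categories}
    return probs


# Six observations can only come from "a", two only from "b", two from either.
toy_observations = [{"a"}] * 6 + [{"b"}] * 2 + [{"a", "b"}] * 2
toy_estimate = em_event_probabilities(toy_observations, ["a", "b"])
```

In this toy case the estimate converges to 0.75 for `a`, whereas a deterministic rule assigning both ambiguous observations to the more frequent category would give 0.8, a small-scale analog of the bias introduced by deterministic alignment.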

## Results

In what follows, we present results of our analysis of naïve, nonproductive, CDR3 sequence repertoires of nine individuals (see *SI Appendix* for a parallel analysis of memory sequence repertoires). Selected results data files are available online (see *SI Appendix* for details).

### Correlations Between Event Variables.

It is important to verify that correlations not present in the assumed structure of the probability distribution [Eq. **1**] are in fact not present in the data. To perform this self-consistency check, we use the inferred generative distribution to compute the probability-weighted counts distribution of recombination event variables in the data and then use this distribution to calculate the mutual information of all pairs of event variables. The matrix of mutual information values is shown in the upper-triangular part of Fig. 2*A*, where the entries outlined in red are dependences accounted for by individual factors in our assumed form of *P*_{recomb}(*E*) [Eq. **1**], entries outlined in green are indirect dependences that can be induced by these factors, and the rest would vanish if the data were perfectly described by the assumed structure of *P*_{recomb}(*E*). There are a few detectable correlations that are not consistent with the assumed structure: (ins*VD*,del*V*),(ins*DJ*,del*J*), and (V,D). They are, however, all so weak (mutual information < 0.02 bits) that we do not model them explicitly (indeed, they might arise from subtle biases in our inference procedure).

For comparison, in the lower-triangular part of Fig. 2*A* we show the mutual information values of all pairs of variables, but now calculated from a deterministic assignment of events to sequences based on maximal alignments. The resulting distributions exhibit spurious correlations that are absent from the corrected, maximum likelihood estimate (MLE) of the distributions. For instance, the numbers of insertions at the two junctions are found to be independent in our analysis, while the uncorrected estimate shows a dependence (Fig. 2 *B* and *C*).
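The mutual information between two event variables can be computed from probability-weighted counts along the following lines. This is a generic sketch; the input format, a list of hypothetical `(x, y, weight)` triples, is an assumption made for the example.

```python
from collections import defaultdict
from math import log2


def mutual_information(weighted_pairs):
    """Mutual information (in bits) of two variables from weighted joint counts."""
    joint = defaultdict(float)
    total = 0.0
    for x, y, weight in weighted_pairs:
        joint[(x, y)] += weight
        total += weight

    # Marginal distributions of each variable.
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), weight in joint.items():
        px[x] += weight / total
        py[y] += weight / total

    mi = 0.0
    for (x, y), weight in joint.items():
        pxy = weight / total
        if pxy > 0:
            mi += pxy * log2(pxy / (px[x] * py[y]))
    return mi


# Independent variables carry no mutual information; identical ones carry 1 bit.
independent = [(0, 0, 0.25), (0, 1, 0.25), (1, 0, 0.25), (1, 1, 0.25)]
identical = [(0, 0, 0.5), (1, 1, 0.5)]
mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

Using fractional weights rather than integer counts is what allows each sequence to contribute to several candidate events at once.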

### Gene Usage Distributions.

The inferred frequencies of V- and J-genes vary significantly from gene to gene, a phenomenon for which no mechanistic explanation has yet been given. In particular, linear location on the chromosome does not explain the pattern of either V- or J-gene usage (see *SI Appendix*, Fig. S4 *A* and *C*). The usage frequencies are consistent between individuals, though of all the inferred parameters in *P*_{recomb}, these usage patterns show the most relative variation between individuals.

The pattern of D-gene use conditioned on J-gene choice (*SI Appendix*, Fig. S4*D*) reveals the known mechanistic constraint prohibiting utilization of D-genes that lie 3^{′} of the chosen J-gene (1, 5). The inferred distribution assigns a total probability of less than 0.1% for joining events using TRBD2 and any TRBJ1 gene. We note that such a determination is impossible without probabilistic analysis due to the uncertainty in identifying genes in specific sequences. The dependence between V-gene choice and D- or J-gene choice is very weak to nonexistent (with mutual information less than 0.01 bits). Thus, we believe that previously reported correlations in the use of these genes (14) reflect the effects of selection rather than VDJ recombination. Finally, we note the presence of pseudo V-genes that occur in almost 10% of the nonproductive CDR3s (see *SI Appendix* for more details).

### Nucleotide Insertions.

In Fig. 3 we show the factors related to insertions in the inferred distribution *P*_{recomb}(*E*). The VD and DJ insertions are uncorrelated (Fig. 2) and their length distributions are nearly identical, with exponential tails (Fig. 3*A*). The nucleotide frequencies in the inserted segments are not uniform and are well explained by a dinucleotide Markov model where the probability of inserting A, C, G, or T depends on the immediately 5′ nucleotide (see Fig. 3*B*). The VD inserted segment, on the sense strand, and the DJ inserted segment, on the antisense strand, show a preference for Cs. The frequencies of trinucleotides are almost perfectly accounted for by the dinucleotide preferences (Fig. 3*C*), suggesting that the sequence statistics are fully captured by dinucleotide statistics. Additionally, the VD insertion dinucleotide bias, taken on the sense strand in the 5′-3′ direction, is virtually identical to the DJ insertion dinucleotide bias, taken on the antisense strand in the 5′-3′ direction. This suggests that the mechanism of junctional nucleotide insertions is strand specific and occurs on opposite strands for the VD and DJ junctions. The molecular mechanistic basis of these features is not evident.
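One way to check whether trinucleotide frequencies are accounted for by dinucleotide statistics is to compare observed trinucleotide frequencies with the first-order Markov prediction f(ab) f(bc) / f(b). The sketch below illustrates this comparison on a made-up sequence; the function names are hypothetical, and the procedure is an assumption about how such a check could be done rather than a description of how Fig. 3*C* was produced.

```python
from collections import Counter


def ngram_freqs(seqs, n):
    """Relative frequencies of all length-n windows across the sequences."""
    counts = Counter(s[i:i + n] for s in seqs for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def predicted_trinucleotide(tri, mono, di):
    """Markov prediction P(abc) = P(ab) * P(bc) / P(b)."""
    a, b, c = tri
    return di.get(a + b, 0.0) * di.get(b + c, 0.0) / mono[b]


# Made-up inserted segment (hypothetical data, for illustration only).
toy_insertions = ["ACACACAC"]
mono = ngram_freqs(toy_insertions, 1)
di = ngram_freqs(toy_insertions, 2)
tri = ngram_freqs(toy_insertions, 3)

observed_aca = tri["ACA"]
predicted_aca = predicted_trinucleotide("ACA", mono, di)
```

On real data the two quantities would be compared across all 64 trinucleotides; close agreement indicates that no dependence beyond nearest neighbors is needed.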

### Nucleotide Deletions.

Because there is a strong correlation between number of deletions and gene identity (see the entries for *I*(del*V*,*V*) and *I*(del*J*,*J*) in Fig. 2), we allow for gene-dependent deletion profiles in *P*_{recomb}(*E*) [Eq. **1**]. The results for a few genes are shown in Fig. 4*A* (see *SI Appendix*, Figs. S12–S16 for all the profiles). P-nucleotides are counted as negative deletions as they occur only in association with zero nucleotide deletions (see *SI Appendix*, Fig. S10). The profiles have substantial variation from gene to gene, suggestive of a nuclease activity that depends on sequence context, but they are highly consistent between individuals. We have modeled this context dependence using a position weight matrix summing independent contributions from the bases in a six nucleotide window (four 3^{′} and two 5^{′}) around the cutting point to the log probability of deletion (see Fig. 4*B* and *SI Appendix*, Fig. S11 for details). We find that only bases 3^{′} of the deletion site have a strong effect on the probability, with T and A nucleotides having the greatest contribution, consistent with previous observations (15). This simple model, which ignores both the P-nucleotides as well as the effects of distance from the end of the gene, does reasonably well in explaining the variation in deletion probabilities (*r*^{2} = 0.7). This modeling is simply to suggest that the complexity of the observed deletion distributions may ultimately be explained by a parsimonious mechanistic model that reflects the underlying biochemistry of the deletion process.
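A position weight matrix of the kind described above assigns to each candidate cut point a score that is the sum of independent per-base contributions within a fixed window. The following sketch illustrates the scoring step only; the weights are invented, the function name `deletion_log_score` is hypothetical, and the indexing convention (two bases 5′ of the cut followed by four bases 3′ of it) is an assumption of the example.

```python
def deletion_log_score(gene_seq, cut, weights, n_5prime=2, n_3prime=4):
    """Sum of per-position, per-base contributions around a cut point.

    `cut` is the index of the first base 3' of the cut; `weights[i][base]` is the
    contribution of `base` at window position i (0 is the most 5' position).
    """
    start, end = cut - n_5prime, cut + n_3prime
    if start < 0 or end > len(gene_seq):
        raise ValueError("window does not fit inside the gene sequence")
    window = gene_seq[start:end]
    return sum(weights[i][base] for i, base in enumerate(window))


# Invented weights: only the two positions just 3' of the cut contribute.
toy_weights = [{base: 0.0 for base in "ACGT"} for _ in range(6)]
toy_weights[2]["T"] = 1.0
toy_weights[3]["A"] = 0.5

toy_gene = "AACGTTAGC"
score_cut4 = deletion_log_score(toy_gene, 4, toy_weights)  # window "CGTTAG"
score_cut5 = deletion_log_score(toy_gene, 5, toy_weights)  # window "GTTAGC"
```

Because the contributions add in log space, such a model has only a few parameters per window position, far fewer than one deletion profile per gene.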

### Consistency of Distributions Across Individuals.

The insertion profiles, and the many different gene-dependent deletion profiles, are very consistent between individuals (Figs. 3 and 4 and *SI Appendix*), suggesting the action of a universal molecular mechanism of rearrangement and providing convincing evidence against overfitting. We note that finite sample size statistics account for less than 50% of the observed interindividual variance (indicated by the error bars) in some of our plots, possibly reflecting biological variation.

### Potential Diversity of Repertoire.

Our inferred distribution of recombination events [Eq. **1**] implies a probability distribution *P*_{gen}(σ) on the space of all CDR3 sequences [Eq. **2**] whose entropy *S*_{seq} is a measure of the potential sequence diversity of VDJ recombination. Because multiple recombination events can lead to the same sequence, we cannot calculate *S*_{seq} directly. We do, however, have an explicit description of *P*_{recomb}, the entropy of which we can calculate: *S*_{recomb} = 52 bits; in addition, we can show that sequence entropy and recombination event entropy are related by

$$
S_{\mathrm{seq}} = S_{\mathrm{recomb}} - \langle S(E|\sigma)\rangle_{\sigma} \qquad [5]
$$

where the correction term, ⟨*S*(*E*|σ)⟩_{σ} ≈ 5 bits, is the entropy of recombination events that give the same sequence (which we know for sequences in the repertoire as a byproduct of our inference), averaged over sequences. This means that CDR3 sequences can be generated in approximately 32 different ways, on average, by VDJ recombination; this is the fundamental reason why we must resort to probabilistic inference methods. The total sequence diversity of 47 bits corresponds to a potential CDR3 repertoire size of approximately 10^{14} sequences^{‡}. This is to be compared with the estimated 4 × 10^{6} unique CDR3 sequences in an individual (4, 16), the approximately 10^{11} T cells in the blood of an individual (17), and the approximately 10^{13} potential peptide-MHC complexes (18). Although convergent recombination means that the sequence entropy cannot be neatly partitioned into contributions from gene choice, deletions, and insertions, the entropy of recombination events *S*_{recomb} can be so partitioned (Fig. 5*A*). We note that the bulk (60%) of the recombination entropy comes from the nucleotide insertions, and little from gene choice (5 bits from V and 4 bits from D and J), consistent with previous estimates (19). For comparison, uniform usage of the genes would result in an entropy of 5.9 bits for V and 4.7 bits for D- and J-gene choices.
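The arithmetic connecting these entropies to repertoire sizes can be verified in a few lines. The sketch below uses only the values quoted in the text (52 and 47 bits, 2 D-genes and 13 J-genes); the helper `entropy_bits` and the variable names are introduced here for illustration.

```python
from math import log2


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


S_RECOMB_BITS = 52   # entropy of recombination events, from the text
S_SEQ_BITS = 47      # entropy of CDR3 sequences, from the text

# Average entropy of events producing the same sequence, and the
# corresponding effective number of ways to generate a sequence.
correction_bits = S_RECOMB_BITS - S_SEQ_BITS
ways_per_sequence = 2 ** correction_bits

# Effective number of sequences for 47 bits of entropy (about 1.4e14).
potential_repertoire = 2 ** S_SEQ_BITS

# Uniform usage over the 2 x 13 = 26 possible D,J pairs.
uniform_dj_bits = entropy_bits([1 / 26] * 26)
```

Here `ways_per_sequence` evaluates to 32 and `uniform_dj_bits` to about 4.7, matching the figures quoted above.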

### Overlap of Repertoires Between Individuals.

Some sequences appear in the repertoires of more than one individual, and we can ask whether their number and specific identities are consistent with chance on the basis of our generative distribution *P*_{gen}(σ). Some shared sequences appear simultaneously in too many repertoires to be valid and are probably due to intersample contamination (see *SI Appendix* for details). Eliminating clearly identifiable questionable cases, we are left with 21 sequences that occur in the nonproductive repertoires of two individuals and none that occur in more than two.

The total number of shared sequences between the repertoire samples of any pair of individuals with sample sizes *N*_{1} and *N*_{2} is expected to be Poisson distributed with mean *N*_{1}*N*_{2}⟨*P*_{gen}⟩, where ⟨*P*_{gen}⟩ = Σ_{σ} *P*_{gen}(σ)^{2}. Note that although the specific shared sequences are likely to have high probabilities of generation, the number of shared sequences, without regard to their identities, is determined by ⟨*P*_{gen}⟩, which is the average value of *P*_{gen} over the potential repertoire. We estimate this quantity by taking the mean of *P*_{gen} over the observed repertoire.
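The expected overlap between two samples can be sketched as follows, assuming that every generation probability is small enough for the mean number of shared sequences to be approximated by the product of the two sample sizes and the average of *P*_{gen} over its own distribution. The toy distribution and sample sizes are hypothetical.

```python
from math import exp, factorial


def mean_pgen(pgen_values):
    """Average of P_gen over its own distribution, i.e. the sum of P_gen squared."""
    return sum(p * p for p in pgen_values)


def expected_shared(n1, n2, average_pgen):
    """Approximate mean number of sequences shared by samples of size n1 and n2."""
    return n1 * n2 * average_pgen


def poisson_pmf(k, mean):
    """Probability of observing exactly k shared sequences."""
    return exp(-mean) * mean ** k / factorial(k)


# Toy distribution: 1,000 equally likely sequences (made-up, for illustration).
toy_pgen = [1 / 1000] * 1000
toy_mean = expected_shared(10, 10, mean_pgen(toy_pgen))
prob_no_overlap = poisson_pmf(0, toy_mean)
```

Summing `poisson_pmf` over all pairs of individuals, each with its own sample sizes, gives the expected number of pairs sharing a given number of sequences.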

In Fig. 5*B*, we compare the expected number of pairs of individuals with a certain number of shared sequences (calculated as a sum of Poisson distributions over the pairs) to the observed number of such pairs, showing excellent agreement. The specific shared sequences have particularly high generation probabilities according to our distribution, with a median value of approximately 10^{-8} compared to the repertoire median of approximately 10^{-14} (Fig. 5*C*). Because the generative distribution is trained on individual repertoires, and is highly consistent between individuals, its success in accounting for recurring sequences between individuals is a nontrivial test of its validity. We find similar results for the shared sequences among the memory repertoires (see *SI Appendix*, Fig. S6).

Convergent recombination has been proposed as an explanation for the occurrence of “public” T-cell receptors (20–22). However, the recombination entropy *S*(*E*|σ) is only weakly correlated with the generation probability *P*_{gen}(σ) (correlation coefficient 0.13, see *SI Appendix*, Fig. S7), and we find that the shared nonproductive sequences in our data do not have higher recombination entropies than other sequences.

### Results from Other Repertoires.

Inference of *P*_{recomb}(*E*) from the nonproductive memory repertoires of the same nine individuals leads to results identical with those reported above for the naïve nonproductive repertoires (see *SI Appendix*, Figs. S5 and S6). The consistency of the inferred generative distribution between these repertoires as well as between the nine individuals is strong evidence that the nonproductive CDR3 sequence statistics, memory or naïve, reflect only the basic recombination process and not selection. In *SI Appendix*, Fig. S8 we show the distribution of generation probabilities of CDR3 sequences from the productive repertoires. Although it is tempting to apply our approach to the productive sequence repertoires, it would be inconsistent to do so: These sequences have passed selection filters, thymic and adaptive, and we have no analog of Eq. **1** to parametrize the probability of such success. This is an important subject for future investigation.

## Discussion

We have presented a method for inferring the statistics of VDJ recombination events from the large T-cell receptor sequence repertoires that are made available by high-throughput sequencing. We emphasize the crucial importance of using a probabilistic approach: The typical CDR3 sequence can be produced by about 32 different recombination events, and using a deterministic assignment of events to each sequence results in systematic biases and spurious correlations. Our general approach allows us to cope with not-yet-indexed alleles (12) and, most importantly, with sequencing errors, an essential task given the rapid growth of high-throughput but error-prone sequencing technologies.

Because we focus on nonproductive sequences, our results describe the probability distribution over CDR3 sequences produced by the recombination machinery before any functional selection has occurred. Its remarkable reproducibility across individuals and repertoires (naïve and memory) provides compelling evidence for the consistency and accuracy of our method. The obtained distribution is a central feature of the adaptive immune system and serves as a baseline (or, in evolutionary terms, a neutral model) for analyzing the subsequent processes of the immune system. By calculating the entropy of the generative distribution, we can estimate the potential diversity of the CDR3 sequences (approximately 10^{14} sequences) and the contributions of insertions, deletions and gene choices to this entropy. We find that insertions contribute most (60%) of the diversity.

We are able to evaluate the probability of generating any specific CDR3 sequence (including as yet unobserved ones). This probability could be used to estimate the strength of selection on a sequence or group of sequences, or the likelihood that a sequence is shared between individuals or repertoires. Thus, it could help better characterize the significance of shared or public T-cell receptor sequences (22). We have verified that the sequences that are shared between the nonproductive repertoires of different individuals in our data are consistent with the predictions of the inferred probability distribution (Fig. 5 *B* and *C*), a very stringent test of its accuracy.

The recombination event distributions also provide insight into the molecular mechanism of recombination and should serve as a starting point for detailed mechanistic models of recombination. We find that the recombination processes at the two junctions are essentially independent of each other and that insertion events are independent of gene choice and deletions. The inferred distribution confirms that a D-gene can only recombine with downstream J-genes. We derive a precise model for the composition of inserted nucleotides, based solely on frequencies of dinucleotides. We also show that a relatively crude model of sequence-specific nuclease activity can account for the deletion probabilities reasonably well. Our observed distribution, which is specified by a large number of probabilities, should be reproduced by parsimonious, but more realistic, mechanistic models.

We have focused on characterizing the molecular generation of nucleotide sequences that code for T-cell receptors. The functional receptor repertoire is first shaped by this molecular process and then by thymic selection and adaptation to pathogens. Quantitative models of the latter processes are needed for understanding the adaptive immune system. Whereas the underlying biochemistry conveniently served to parametrize our sequence distributions, finding an analogous functionally relevant parametrization of amino-acid sequences to model the effects of selection is much more challenging (23). Statistical analysis of the productive receptor repertoires, with our precise characterization of the unselected repertoire in hand, will hopefully aid in this effort.

## Acknowledgments

We are grateful to H. Robins and collaborators for making the datasets on which this work is based available to us. The work of C.G.C. was supported in part by National Science Foundation (NSF) Grant PHY-0957573 and by US Department of Energy Grant DE-FG02-91ER40671. The work of A.M. was supported in part by the NSF Physics of Living Systems program (PHY-1022140). C.G.C. thanks the Institute for Advanced Study for hospitality during the performance of part of this work.

## Footnotes

^{1}To whom correspondence should be addressed. E-mail: ccallan{at}princeton.edu.

Author contributions: A.M., T.M., A.M.W., and C.G.C. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

^{*}Here we distinguish only the genes, not their various alleles. The gene list includes germline pseudogenes: They cannot produce functioning receptor proteins but, because we work with non-coding VDJ rearrangements, pseudogene sequences can appear in the data.

^{†}We use the known alleles for each gene listed in the IMGT database (24) augmented by a few additional variants observed in the data (see *SI Appendix* for details).

^{‡}Recall that this estimate is for the β-chain only. The α-chain will yet add more diversity to this estimate.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1212755109/-/DCSupplemental.

## References

- (1) Murphy KP, Travers P, Walport M, Janeway C
- (2) Freeman JD, Warren RL, Webb JR, Nelson BH, Holt RA
- (3) Weinstein JA, et al.
- (4) Robins HS, et al.
- (5) Robins HS, et al.
- (11) McLachlan GJ, Krishnan T
- (13) Dempster A, Laird N, Rubin D
- (15) Gauss GH, Lieber MR
- (16) Arstila TP, et al.
- (19) Cabaniols JP, Fazilleau N, Casrouge A, Kourilsky P, Kanellopoulos JM
- (20) Quigley MF, et al.
- (21) Venturi V, et al.
- (23) Mora T, Walczak AM, Bialek W, Callan CG


## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Statistics