# Insights into immune system development and function from mouse T-cell repertoires

^{a}Joseph Henry Laboratories, Princeton University, Princeton, NJ 08544;^{b}Laboratoire de Physique Théorique, UMR8549, CNRS, École Normale Supérieure, 75005 Paris, France;^{c}Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ 08903;^{d}Institute for Advanced Study, Princeton NJ 08540;^{e}Laboratoire de Physique Statistique, UMR8550, CNRS, École Normale Supérieure, 75005 Paris, France

See allHide authors and affiliations

Contributed by Curtis G. Callan Jr., January 5, 2017 (sent for review November 9, 2016; reviewed by Benjamin Chain and Arup K. Chakraborty)

### This article has a Correction. Please see:

## Significance

The immune system defends against pathogens in part via a diverse population of T cells that display different surface receptor proteins [T-cell receptors (TCRs)] designed to recognize MHC-presented foreign peptides. Receptor diversity is produced by an initial random gene recombination process, followed by selection for proteins that fold correctly and bind weakly to self-peptides. Using data from mice of different ages, from embryo to young adult, we quantify the changes with time in the way receptors are generated and selected for function. We find a strong increase in repertoire diversity, occurring shortly after birth, due to a sharp increase in the number of random nucleotide insertions in the primitive TCR gene recombination process. Differences between thymic and blood TCR sequence distributions allow us to infer subtle details of this “turning on” of the mouse immune system.

## Abstract

The ability of the adaptive immune system to respond to arbitrary pathogens stems from the broad diversity of immune cell surface receptors. This diversity originates in a stochastic DNA editing process (VDJ recombination) that acts on the surface receptor gene each time a new immune cell is created from a stem cell. By analyzing T-cell receptor (TCR) sequence repertoires taken from the blood and thymus of mice of different ages, we quantify the changes in the VDJ recombination process that occur from embryo to young adult. We find a rapid increase with age in the number of random insertions and a dramatic increase in diversity. Because the blood accumulates thymic output over time, blood repertoires are mixtures of different statistical recombination processes, and we unravel the mixture statistics to obtain a picture of the time evolution of the early immune system. Sequence repertoire analysis also allows us to detect the statistical impact of selection on the output of the VDJ recombination process. The effects we find are nearly identical between thymus and blood, suggesting that our analysis mainly detects selection for proper folding of the TCR receptor protein. We further find that selection is weaker in laboratory mice than in humans and it does not affect the diversity of the repertoire.

The adaptive immune system relies on cell surface receptors to recognize an unpredictable array of foreign pathogens. T cells perform their surveillance function through a highly diverse repertoire of T-cell receptors (TCRs): Any individual TCR recognizes only a small subset of the foreign peptides that it may encounter, and the system defends against a broad range of pathogens by having TCRs of many different specificities. This receptor diversity is created by a stochastic DNA editing process (VDJ recombination) that acts on the TCR gene each time a new immune cell is created from a stem cell and acts independently on each of the two chains (alpha and beta) that compose the receptor.

The resulting repertoire diversity can now be studied in great detail, using high-throughput sequencing of lymphocyte receptor repertoires (1⇓⇓⇓⇓⇓⇓–8). In previous work, we used human sequence repertoires to develop methods for inferring the details of the stochastic process of VDJ recombination (9, 38) and for characterizing the statistical effects of thymic selection (10). In this paper, we apply these methods to T-cell

It is known that B- and T-cell receptors formed in embryonic or neonatal individuals are less diverse than in adults: They have fewer nontemplated insertions (11) due to the absence of terminal deoxynucleotidyl transferase (TdT) expression, the enzyme responsible for these insertions (12⇓–14). Whereas this observation has been confirmed by deep sequencing of human TCR repertoires (15, 16), the precise form and time-resolved dynamics of VDJ recombination have not been assessed.

We analyze mouse T cells taken simultaneously from thymus and blood in the same individual to gain insights into the dynamics of repertoire development. First, because the periphery accumulates T cells produced at different times, whereas new T cells pass quickly through the thymus and reflect conditions at a single time point, the statistical structure of the blood T-cell repertoire can be very different from that of the thymus. This difference poses a challenge for statistical inference, and in this paper we develop a method for describing such repertoires derived from a mixture of different conditions. Second, the ability to compare thymus with blood sequence repertoires allows for a more refined view of the stages of receptor selection for functionality, from initial receptor generation to eventual passage into the periphery, and leads to a qualitatively different picture for mice than previously reported for humans.

## Results

### Inferring the Statistics of VDJ Recombination.

A new TCR gene is created from germline DNA by a series of stochastic events: choosing gene segments, deleting bases from the ends of the chosen gene segments, and inserting nucleotides between the modified gene segments. Because the same sequence can be generated by distinct recombination events, standard tools that assign a unique recombination scenario to each sequence (17⇓⇓–20) give biased estimates that would limit our ability to detect the developmental changes that interest us here. In previous work (9, 10) we showed how to overcome this problem, using an approach that assigns probabilities to different ways of generating a sequence (see *Materials and Methods* for details). This approach allows us to accurately quantify and track diversity as a function of developmental age, in both the thymus and the periphery.

Here, we apply these methods to thymic and peripheral sequence repertoires of TCR beta-chain (TRB) genes of mice of varying ages: 17 d after conception and 4 d, 21 d, and 42 d after birth (see *Materials and Methods* and Table S1 for a complete summary of data). A distinct set of generation parameters was inferred for each of these repertoires. Only out-of-frame sequences, which are nonproductive and thus thought to be free of selection effects, were included in the inference.

### Thymic Repertoires Reveal Mechanism of Diversity Maturation.

The best-documented element of VDJ recombination known to change between fetal and adult life is the number of nontemplated (N) insertions at the junctions. In Fig. 1 we plot the marginal distributions of the number of N insertions at the VD and DJ junctions inferred from out-of-frame thymic sequences of mice at a sequence of ages. During the passage from fetal to mature animal, this distribution changes dramatically and rapidly. In the embryonic mouse, 90% of the sequences have no insertions, whereas this fraction drops to 10% in adult mice. This trend is consistent with previous observations in neonates (11) and is explained by the low level of expression of TdT before birth (12, 14). TdT is turned on after birth, and Fig. 1 indicates that the asymptotic level of TdT in adults must be reached before 21 d, as there are no noticeable differences between 21 d and 42 d. The distribution at 4 d shows an intermediate situation, roughly halfway between embryonic and adult. Whereas Fig. 1 shows that the inferred distribution of insertions is identical at the VD and DJ junctions in the thymus of fetal or adult mice, it is different at 4 d. This difference could be due to the temporal ordering of VDJ recombination: DJ recombination occurs before VD recombination, with a short time delay between the two, and the rise of TdT expression during this delay could explain the increased mean number of VD insertions relative to that of DJ insertions.

We asked whether features of the recombination process other than the number of insertions changed between embryonic and mature mice. We found that D and J gene choice did not change significantly (*t* test corrected for multiple testing), whereas a few V genes had their use significantly (

To quantify the overall change in diversity between TRB sequence repertoires at different ages, we calculated the Shannon entropy of their distributions. Entropy can be decomposed as a sum of contributions from gene choices, deletions, and insertions, from which a correction for convergent recombination must be subtracted (9). We find that the diversity of generated nucleotide sequences increased from 21 bits in fetal mice to 30 bits in adult mice. The change in repertoire diversity during this transition is almost entirely due to the change in the insertion profile, as can be seen in Fig. 2, *Inset*, where the different contributions to the sequence entropies of the fetal and mature sequence repertoires are compared.

The entropy is mathematically equal to the negative of the mean over the sequence repertoire of the logarithm of the generation probability. A plot of the distribution of generation probabilities,

### Peripheral Repertoires Reflect Past History of Thymic Diversity.

Our analysis of thymic repertoires has shown that the generative probability distribution for VDJ recombination changes dramatically in the days and weeks after birth. To understand how the evolution of VDJ recombination impacts repertoires, we need to account for the fact that peripheral compartments accumulate cells generated across earlier times, as sketched in Fig. 3. As a result, sequence repertoires must be described by a mixture of generative models with varying parameters that reflect past states of the generation process.

To use our inference procedure to quantify the state of mixing of repertoires, we must make some simplifying assumptions. First, given our observation that other features vary rather little with age (Figs. S1–S4), we assume that only the statistics of the untemplated insertions change with time. Second, we assume that the instantaneous insertion distribution function,

The distribution **1**, the mixture distribution depends only on the mean and variance of

We estimate **2** in *Materials and Methods*, using our inference technique, and then adjusting **1**. The data points in Fig. 4*A* show the mean effective TdT level *B* report the associated values of var(*A* shows that the blood repertoire transitions from embryonic (*A*, *Inset*). Fig. 4*B* shows that, although embryonic and adult repertoires have no mixing,

The behavior of the data displayed in Fig. 4 can be better understood by comparison with a simple model (*Materials and Methods* and *Mixture Model*). In this model the effective TdT level *A*, dashed curve), recombined cells are created at a rate that increases rapidly with time, and cells reside in the thymus for 3 d on average, after which they are released into the periphery. Although model parameters were chosen to reproduce the observed behavior quantitatively, we did not attempt a formal fit to the data, because of the paucity of data points. Results for *A*) and also accounts for the observed level of mixing as a function of time in blood and thymus (Fig. 4*B*). The model curves in Fig. 4*B* are parametric in time (time stamps added for clarity) and it is significant that the data points lie close to points on the model curves at the right age.

### Selection Shapes the In-Frame Repertoire.

Our discussion so far has focused on the evolution of the generative model for VDJ recombination, a model inferred from nonproductive, out-of-frame sequences. We now discuss what can be learned from in-frame sequences. Because they can code for functional surface receptor proteins, their statistics will be modified, relative to the generative model statistics, by selection effects. To quantify selection, we focus on the complementarity-determining region 3 (CDR3) of the beta chain, the region thought to encode most of the functional diversity of the T-cell repertoire. We define the CDR3 as the amino acid sequence running from a conserved cysteine in the V segment to a conserved phenylalanine in the J segment. We associate to each possible CDR3 amino acid sequence *Materials and Methods* for details).

We inferred selection motifs for a variety of thymic and blood repertoires (Fig. S6). These motifs are very consistent between thymus and blood of mice of the same age, with weaker consistency between mice of different ages (Fig. S7). Similarity between blood and thymus may seem surprising, as we could have expected a significant fraction of TRBs from thymic cells to have been sequenced before any selection effect, making them statistically closer to out-of-frame sequences. These observations suggest that our selection factors primarily capture selection for the ability of the coded protein to fold into a displayable receptor and may not capture more subtle effects such as negative selection against self-recognition. In Fig. S6, we also display patterns of correlation between mature mouse selection factors and quantitative amino acid biochemical properties; significant, but hard-to-interpret, patterns are apparent.

Our method attaches two hidden variables to each in-frame sequence: its probability *A* shows that this correlation does not hold for mice: The *B*). For mice, selection is a weak effect: The distribution over *B*). Consequently, selection purges a large fraction of sequences and substantially modifies the repertoire statistics.

## Mixture Model

In this model, T cells are introduced into the thymus with rate

The mean and variance of **2** of the main text.

## Discussion

VDJ recombination is a stochastic process that produces the initial diversity on which the adaptive immune system relies to develop a functional and diverse repertoire of receptor specificities. Previous studies have shown that this diversity is limited in neonates compared with adults, either by biasing the choice of gene segments (22⇓⇓–25) or by having a small number of N insertions (11, 26). Combining high-throughput sequencing with statistical analysis of murine T-cell receptor beta chains, we analyzed the dynamics of maturation of VDJ recombination. This analysis allowed us to precisely quantify, in bits, how diversity increases with age, from embryo to adult. We found that the most significant change in the recombination statistics was the number of untemplated N insertions, which sharply increases around the age of 4 d, from almost no insertions to the amount found in adults. Low numbers of insertions in neonates and during embryonic development are common to both B- and T-cell receptors (27) and are attributed to low TdT expression (12⇓–14). Diversity can be further reduced in embryos by concentrating gene use on only a few combinations, as was shown for Ig in mice (22, 23), humans (24), and more recently zebrafish, using high-throughput sequencing (25). Similar observations were made on human TCR beta chains (28, 29). By contrast, we found only minor differences in TRB gene use between embryonic and adult mice (Fig. S1), meaning that the reduced number of N insertions is the only factor limiting diversity in the embryo relative to the adult.

One can only speculate about the biological function of the lack of N insertions in embryos and very young individuals. Rearrangements with no insertions may encode particular specificities that are effectively innate. The invariant TCRs of mucosal-associated invariant T cells (MAIT) and natural killer T cells (NKT) are specific examples of such genetically encoded receptors (30). These TCRs, which lack N insertions, are formed with high probability by VDJ recombination (31) and are further selected to be very conserved. Receptors lacking N insertions may provide neonates with a minimal set of innate-like specificities, ensuring basic immunity (32), which is later completed by the full diversity of receptors endowed with N insertions.

Our analysis highlights the importance of focusing on the underlying statistical ensembles from which repertoires are drawn, rather than looking for significance in the sequences themselves. Although sequence repertoires are contingent and noisy, with little to no overlap between individuals, their statistical properties are consistent between individuals, as was already noted for humans (9, 21). Crucially, a statistical treatment is essential for tracking the precise dynamics of N insertions with age, as deterministic assignments give systematically biased estimates of these numbers (Fig. S8). In our study of the development of repertoire diversity, we avoided the confounding factor of possibly time-dependent selection by analyzing out-of-frame rearrangements.

Using an analysis tailored to the productive repertoire, we were able to assign to any CDR3 sequence a statistical measure of selection, quantifying the probability that an amino acid sequence, once generated, goes on to become a functional receptor. The inferred selection motifs were very similar in thymic and blood repertoires. This lack of difference suggests that our measure of selection is mainly sensitive to basic selection against misfolding of the encoded receptor protein (which would have the same effect on thymic and blood repertoires) and is relatively insensitive to thymic selection against self-recognizing receptors (which would be present in blood but not in thymus). Negative selection in the thymus of course exists, but previous work on statistically characterizing its nature has shown that strong effects are localized in the few residues that actually contact presented peptides (33⇓–35). Because we are statistically characterizing selection across the more extensive CDR3 region (11 aa long on average), it is perhaps not surprising that localized negative selection effects are washed out. It would clearly be productive to incorporate such insights into our approach.

The study of mice raises interesting and puzzling questions about sequence diversity. We found that the mature mouse repertoire is 9 bits (or

The mature mouse repertoire is 14 bits (or

## Materials and Methods

### Datasets.

The data used in our analyses are 87-bp (and 60-bp) nucleotide sequences covering the variable region of the rearranged mouse TRB gene. The sequences were obtained by Adaptive Biosciences, using their TRB DNA sequencing protocol (including error correction on the basis of multiple reads of each unique DNA sequence) applied to biological samples provided by two of the authors (A.J.L. and C.S.D.). The samples comprised blood, spleen, and thymus samples from mice sacrificed at four different ages: 17-d embryo and 4 d, 21 d, and 42 d postbirth (the library preparation and sequencing for day 42 thymic samples were replicated). The mice were Black 6 laboratory mice (Jackson Laboratories) raised in standard laboratory conditions. The animal care committee of the New Jersey Medical School provided approvals for the experiments carried out with mice in this publication. The numbers of unique sequences in the various datasets were a few tens of thousands on average (with a few datasets providing more than

### Stochastic Model for VDJ Recombination.

Out-of-frame data sequences were used to infer the statistical ensemble of sequences produced directly by the VDJ recombination process. We assume that the probability distribution for the generative events involved in VDJ recombination of TRB has the form (9, 38)**2** gives the probability of recombination scenarios, not sequences. To obtain the probability of generating a specific sequence, one must sum the expression in Eq. **2** over all of the recombination scenarios that result in that sequence. We determine the component probability distributions in Eq. **2**,

The main assumption underlying Eq. **2** is its simple product structure, reflecting the independence of the enzymes that carry out different steps of the process. Another assumption is that the probability of inserting a given N nucleotide depends only on the identity of the nucleotide that precedes it (Markov assumption). We self-consistently checked the validity of these assumptions by verifying a posteriori that almost no unaccounted correlations between the recombination events were left in the data that were not explicitly assumed (*Validation of the Structure of the Sequence Generation Model* and Fig. S9) and by showing that the statistics of triplets of N insertions were well predicted by the Markov model (Fig. S5). We also compared our probabilistically inferred distributions of recombination scenario variables with distributions assembled from assignments made by a standard VDJ alignment software package (17). We found that these nonprobabilistic alignment methods greatly overestimate the fraction of sequences with no N nucleotide insertions and significantly violate the D-J pairing rule imposed by genome topology, whereas the probabilistic method does not (Fig. S8). This discrepancy is what motivates our use of a probabilistic approach. The inferred model features were very reproducible across individuals of the same age (Figs. S2 and S4).

### Selection Model.

Following previous work on human TRB sequences (21), we associate to each possible CDR3 amino acid sequence *Z* enforces an overall normalization. The set of all these subfactors defines a motif of selection across all possible TRB sequences, and a likelihood maximization procedure allows us to infer the best selection factors from the data.

To test the consistency of the selection model, Eq. **3**, we also learned a more general model where, instead of taking the

### Code Availability.

The Matlab software for implementing the inference procedures is available at princeton.edu/∼ccallan/MousePaper/software/. The results of the inference, along with instructions on how to use these files to recreate the figures in this paper, are available at princeton.edu/∼ccallan/MousePaper/results/.

### Model of Repertoire Maturation.

TCRs are produced in the thymus with a time-dependent effective TdT level

## Validation of the Structure of the Sequence Generation Model

Our inference procedure rests on a presumption of independence of the various factors in the generative model and a verification of that independence is an important aspect of our analysis. To address this issue, we compute the mutual information—a nonparametric measure of dependence between random variables—between the various recombination scenario variables and compare these numbers between the data and the generative model inferred from the same data.

The mutual information between two random variables

On the other hand, the inference procedure assigns multiple scenarios, each with its own probability, to each data sequence. For any pair of scenario variables (e.g., insVD and delJ) we can use these assignments over all of the data sequences to populate a list of pairs of values, weighted by scenario probabilities. From this list we can then compute the mutual information between the two elements of the pair, using the Treves–Panzeri correction (39) to account for small sample sizes. The mutual information computed in this fashion is displayed in the above-diagonal halves of the matrices of Fig. S9. The model form is considered accurate if the obtained mutual information agrees with that predicted by the model, i.e., if the matrices of Fig. S9 are symmetric.

## Selection Model

The selection factor is defined as the fold change between the probability of generation of a TRB sequence

## Acknowledgments

The work of Y.E., T.M., and A.M.W. was supported in part by European Research Council Starting Grant 306312. The work of Y.E. was supported in part by The V Foundation for Cancer Research Grant D2015-032. The work of C.G.C. and Z.S. was supported in part by National Science Foundation Grant PHY-1305525.

## Footnotes

↵

^{1}Z.S. and Y.E. contributed equally to this work.- ↵
^{2}To whom correspondence should be addressed. Email: ccallan{at}princeton.edu.

Author contributions: Z.S., Y.E., C.G.C., A.J.L., T.M., and A.M.W. designed research; Z.S., Y.E., C.S.D., C.G.C., T.M., and A.M.W. performed research; Z.S., Y.E., C.G.C., T.M., and A.M.W. analyzed data; and Z.S., Y.E., C.G.C., A.J.L., T.M., and A.M.W. wrote the paper.

Reviewers: B.C., University College London; and A.K.C., Massachusetts Institute of Technology.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1700241114/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵.
- Weinstein JA,
- Jiang N,
- White RA,
- Fisher DS,
- Quake SR

- ↵.
- Boyd SD, et al.

- ↵.
- Robins HS, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Murugan A,
- Mora T,
- Walczak AM,
- Callan CG

- ↵.
- Elhanati Y,
- Marcou Q,
- Mora T,
- Walczak AM

- ↵.
- Feeney AJ

- ↵.
- Bogue M,
- Gilfillan S,
- Benoist C,
- Mathis D

- ↵.
- Komori T,
- Okada A,
- Stewart V,
- Alt FW

- ↵.
- Gilfillan S,
- Dierich A,
- Lemeur M,
- Benoist C,
- Mathis D

- ↵.
- Britanova OV, et al.

- ↵.
- Pogorelyy MV, et al.

- ↵.
- Brochet X,
- Lefranc MP,
- Giudicelli V

- ↵.
- Thomas N,
- Heather J,
- Ndifon W,
- Shawe-Taylor J,
- Chain B

- ↵
- ↵.
- Yu Y,
- Ceredig R,
- Seoighe C

- ↵.
- Elhanati Y,
- Murugan A,
- Callan CG,
- Mora T,
- Walczak AM

- ↵
- ↵.
- Perlmutter RM,
- Kearney JF,
- Chang SP,
- Hood LE

- ↵.
- Schroeder HW, et al.

- ↵.
- Jiang N, et al.

- ↵.
- Feeney AJ

- ↵.
- Schroeder HW,
- Zhang L,
- Philips JB

- ↵
- ↵.
- Raaphorst FM,
- Kaijzel EL,
- van Tol MJ,
- Vossen JM,
- van den Elsen PJ

- ↵
- ↵
- ↵
- ↵.
- Kosmrlj A,
- Jha AK,
- Huseby ES,
- Kardar M,
- Chakraborty AK

- ↵
- ↵
- ↵
- ↵.
- Madi A, et al.

- ↵.
- Elhanati Y, et al.

- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Biophysics and Computational Biology