Research Article

Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire

Jacob Glanville, Wenwu Zhai, Jan Berka, Dilduz Telman, Gabriella Huerta, Gautam R. Mehta, Irene Ni, Li Mei, Purnima D. Sundar, Giles M. R. Day, David Cox, Arvind Rajpal, and Jaume Pons
  1. Research Informatics, Rinat-Pfizer Inc., 230 East Grand Avenue, South San Francisco, CA 94080;
  2. Protein Engineering, Rinat-Pfizer Inc., 230 East Grand Avenue, South San Francisco, CA 94080; and
  3. Target Generation Unit, Pfizer Inc., 230 East Grand Avenue, South San Francisco, CA 94080


PNAS December 1, 2009 106 (48) 20216-20221; https://doi.org/10.1073/pnas.0909775106
  1. Communicated by Richard A. Lerner, The Scripps Research Institute, La Jolla, CA, September 21, 2009 (received for review August 25, 2009)

  2. J.G., W.Z., and J.B. contributed equally to the work.


Abstract

Antibody repertoire diversity, potentially as high as 10^11 unique molecules in a single individual, confounds characterization by conventional sequence analyses. In this study, we present a general method for assessing human antibody sequence diversity displayed on phage using massively parallel pyrosequencing, a novel application of Kabat column-labeled profile Hidden Markov Models, and translated complementarity determining region (CDR) capture-recapture analysis. Pyrosequencing of domain amplicon and RCA PCR products generated 1.5 × 10^6 reads, including more than 1.9 × 10^5 high quality, full-length sequences of antibody variable fragment (Fv) variable domains. Novel methods for germline and CDR classification and fine characterization of sequence diversity in the 6 CDRs are presented. Diverse germline contributions to the repertoire with random heavy and light chain pairing are observed. All germline families were found to be represented in 1.7 × 10^4 sequences obtained from repeated panning of the library. While the most variable CDR (CDR-H3) presents significant length and sequence variability, we find a substantial contribution to total diversity from somatically mutated germline encoded CDRs 1 and 2. Using a capture-recapture method, the total diversity of the antibody library obtained from a human donor Immunoglobulin M (IgM) pool was determined to be at least 3.5 × 10^10. The results provide insights into the role of IgM diversification, display library construction, and productive germline usages in antibody libraries and the humoral repertoire.

  • HMM
  • phage display
  • pyrosequencing
  • CDRs

The humoral immune response recognizes novel molecular surfaces by exposure to a vast repertoire of potential binding partners (1). Antibody paratopes, the agents of humoral molecular recognition, mediate specific binding through a protein-antigen interface that varies dramatically between molecules. When confronted with a novel antigen, the chance that any given antibody in the pool will bind is low. Therefore, it is primarily the diversity of the antibody repertoire that determines whether a specific complementary paratope will be recovered (2).

Under such selective pressures, a number of mechanisms to maximize the recognition potential of the antibody repertoire have evolved. Antibody paratopes are found at the hypervariable region of a light and heavy chain heterodimer. Each chain contributes 3 loops to a spatial cluster of complementarity determining regions (CDRs). CDRs 1 and 2 are encoded in germline V-segment loci: 51 VH and 70 Vκ/λ loci, each with unique amino acid encodings, exist in a typical human haplotype (3–5). Diversity in each chain is determined by combinatorial VH-(DH)-JH (for the heavy) or Vκ/λ-Jκ/λ (for the light) rearrangements, P and N-addition, junctional flexibility, and somatic hypermutation of variable domain nucleotides, with a concentration on CDR encoding regions (6, 7). The combinatorial association of such stochastically generated light and heavy chains has the potential to generate many orders of magnitude more diversity than can be uniquely displayed on the 10^11 B-cells in a single individual's lymphocyte population (2, 8). With each antibody variable fragment (Fv) encoded by at least 650 base pairs, the presented repertoire is potentially 4 orders of magnitude larger than the entire human diploid genome (6.4 × 10^9 bp).
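The order-of-magnitude comparison in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted there; the variable names are ours and purely illustrative.

```python
import math

# Figures quoted in the text
b_cells = 1e11             # B-cells in one individual's lymphocyte population
fv_length_bp = 650         # minimum length encoding one Fv, in base pairs
diploid_genome_bp = 6.4e9  # human diploid genome size, in base pairs

# Total bases needed if every B-cell displayed a distinct Fv
repertoire_bp = b_cells * fv_length_bp
ratio = repertoire_bp / diploid_genome_bp
orders_of_magnitude = math.log10(ratio)

print(f"repertoire: {repertoire_bp:.2e} bp")
print(f"ratio to diploid genome: {ratio:.0f}x "
      f"(about {orders_of_magnitude:.1f} orders of magnitude)")
```

The ratio comes out just above 10^4, which is where the "4 orders of magnitude" statement comes from.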

Such extreme sequence diversity poses multiple challenges to repertoire characterization efforts. Achieving sufficient sampling depth to determine total diversity is impractical with Sanger-based sequencing (6, 7). High-throughput sequencing methods, while able to address sampling depth, have until recently produced read lengths under 200 bp, too short to span 3 CDRs in a single read (9). While in other settings these technologies can rely on assembly to overcome short read lengths (10), the diverse yet repetitive character of antibody Fv reads causes assembly to either fail or return erroneous chimeric contigs that do not represent individual population members (10, 11). Once sequences are obtained, somatic hypermutation and junctional diversity pose a challenge to reliable CDR boundary identification (12–14). The problem is significant: a recent study reports that over 10% of the variable domain sequences in the Kabat antibody database have been misnumbered by existing methods (15). Given these limitations, past diversity assessment efforts have focused on low-resolution length-based CDR3 spectratyping (16) and local nucleotide-level V-(D)-J assessment of more limited TCR β-chain (10) and zebrafish repertoires (17). While these approaches provide valuable insights into specific features of binding site diversification, it has not yet been feasible to characterize the combined effects of diversification on the complete translated paratope at molecular resolution.

Long-read high-throughput sequencing chemistry, Bayesian fold recognition and a single chain variable fragment (scFv) architecture have created an opportunity for complete paratope repertoire analysis. Recent advances in high-throughput pyrosequencing chemistry have allowed 10^6 sequences of 400 bp to be generated in a single run: deep enough for capture-recapture diversity assessments and long enough to span all 3 CDRs of a chain in a single read. Once read, a novel application of Kabat-labeled profile Hidden Markov Models (HMM) borrows from advances in remote homology fold recognition to provide a fast, O(n), highly accurate unified Bayesian framework for domain recognition, CDR boundary identification, and multiple sequence alignment (18–21). With reliable access to the entire CDR contribution of a variable domain in a single read combined with shotgun reads spanning the heavy-light chain pairing, it becomes possible to directly estimate and characterize the number of unique binding surfaces presented by an antibody library repertoire.

Accurate diversity assessment is of particular interest during the construction of combinatorial antibody repertoires (22, 23). Phage display libraries allow an antibody repertoire to be queried with a candidate antigen directly, without the need to proceed through in vivo immunization (24, 25). A number of strategies for introducing repertoire diversity during library construction have been proposed (26–28) but existing methods to assess the final functional library diversity are based on estimates of transformation efficiency and limited sequence sampling. Here we present the design and assessment of a scFv library built directly from the complete germline diversity of 654 human donor Immunoglobulin M (IgM) repertoires; a lymphocyte reservoir that includes naïve, memory, plasma, and preimmune somatically altered paratopes (29–31). The available diversity of the entire library was assessed directly using high-throughput pyrosequencing to generate datasets large enough to perform capture-recapture diversity (17) and chain assortment estimates. We also compare the repertoire and diversity of the input library to functional binders derived from panning the library against 16 diverse antigens. The results provide a powerful method for monitoring diversity during future library construction efforts and fundamental insights into the strategy of functional paratope diversification elected by evolutionary forces.

Results

Antibody Library Generation.

Heavy and light chain V-genes from 654 healthy human donors were separately amplified by PCR using an equimolar mixture of degenerate family primers (32, 33) individually validated at a common reaction condition of 25 PCR cycles at 94 °C for 45 sec, 58 °C for 45 sec and 72 °C for 60 sec. The heavy and light domain products were randomly associated in a scFv VH-(G4S1)3 linker-VL architecture. A total of 120 μg of scFv antibody repertoire from 72 μg of VH-Vκ and 48 μg of VH-Vλ was obtained and ligated into a display vector. Three hundred and ten transformations yielded 302 ml total library volume. Plating of serial dilutions for colony counting resulted in an estimated 3.1 × 10^10 (SD 0.7 × 10^10) successful transformants containing scFv antibodies in the total remaining 301 ml library pool.

Antibody Library Selections.

Sequences for more than 1.7 × 10^4 nonredundant antibodies were obtained from output generated by panning against 16 human and nonhuman targets. These sequences were CDR clustered using the profile HMM CDR and germline classification methods used on the library. For each antigen, an average of 30 representative sequences from distinct CDR clusters were selected for further characterization: affinities obtained ranged from less than 100 pM to over 1 μM.

454 Sequencing.

Each of the 2 samples, rolling circle amplified (RCA) shotgun and variable domain PCR amplicon, was sequenced using the GS FLX Titanium large PicoTiter plate in 2 separate sequencing runs. The 2 sequencing runs combined yielded 1,452,529 and 1,602,399 raw well reads for the shotgun and amplicon library, respectively. After the signal processing step of the 454 data analysis pipeline, where reads may be rejected by multiple signal and quality filters, we obtained 923,876 (shotgun) and 554,310 (amplicon) quality filter-passing reads.
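As a quick consistency check, the filter pass rates implied by these read counts can be computed directly. This is a small sketch using only the numbers quoted above; the dictionary names are illustrative.

```python
# Read counts quoted in the text
raw_reads = {"shotgun": 1_452_529, "amplicon": 1_602_399}
passed_reads = {"shotgun": 923_876, "amplicon": 554_310}

# Fraction of raw well reads surviving the signal and quality filters
pass_rate = {name: passed_reads[name] / raw_reads[name] for name in raw_reads}
total_passed = sum(passed_reads.values())

for name, rate in pass_rate.items():
    print(f"{name}: {rate:.1%} of raw reads passed filtering")
print(f"total filter-passing reads: {total_passed:,}")
```

The combined total of filter-passing reads is roughly 1.5 × 10^6, which appears to correspond to the read count given in the abstract.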

Accuracy of Profile HMM CDR Classification and Kabat Numbering.

In 779 benchmark cases, 99.8% of CDR-H3 loops were classified correctly, and all other CDRs received perfect classification (supporting information (SI) Fig. S1A). The single error was due to assignment of stem residue, H102, to a neighboring insert state at the C terminus of the H3 (Fig. S1B). With a CDR boundary insert correction, all cases classified correctly.

Using the scFv HMM (Smith/Waterman local alignment, expectation value <10^-10, >70% match state occupancy in all FW regions in single reading frame), 96,303 heavy and 98,946 light chain reads spanning entire variable domains in a single reading frame were identified in the pyrosequencing results, aligned, and CDR labeled.

GS-Linker Assessment.

Of the subset of RCA shotgun library reads that spanned the GS-linker and framework regions of VH and Vκ/λ domains to either side, 95.6% appeared as expected by design. The remaining 4.4% had predominantly single errors that could be genuine linker errors or pyrosequencing read errors.

Germline Classification of Library Clones.

In >250,000 simulations, sequences with less than 30 mutations in the V-segment were never misclassified. Even with up to 50 simulated mutations, 99.97% of test sequences were correctly classified, with only 5.8% receiving reduced family-level classification (Fig. S2 A and B). This limited rate of resolution reduction corrects 84% of sequences misclassified by a naïve approach. By comparison, 95% of all antibodies recovered from the library differed by less than 30 mutations from the closest germline allele (Fig. S2C).

Using the classification method, all Ig-bearing library sequence reads were classified to germlines (Table 1). Forty-eight heavy and 53 light known functional V-segment germline loci were encountered at least 10 times in the full-domain sequence reads (Tables S1 and S2). In addition, 2 germlines listed as pseudogenes by International ImMunoGeneTics Information System (IMGT) (HV3-h and HV3–71) were recovered in rearranged form (127 and 240 times, respectively). Germlines sampled were consistent between variable-domain amplicon and RCA-amplified shotgun library samples and resemble the distribution found in Kabat database, although some differences are observed (Kolmogorov-Smirnov assessments: Amplicon vs. RCA D = 0.0930; P = 0.989, Amplicon vs. panned D = 0.1163, P = 0.917, Amplicon vs. Kabat D = 0.1395, P = 0.765).

Table 1.

Partial list of V-segment germline distribution in 100% nonredundant Kabat sequences (Kabat), scFv domain-specific amplicons (ampl.), scFv RCA, and Sanger sequencing from 1.7 × 10^4 nonredundant round 2 and 3 binders against 16 unique antigen targets (Pann.). Forty-eight of 51 IMGT functional HV germlines and 53 of 70 K/L germlines were recovered (see Tables S1 and S2 for complete list).

Heavy and Light Chain Pairing.

Heavy and light chain family pairings in the library were found to occur in proportion to the abundance of the respective families: indistinguishable from a null model of random assortment (χ^2 observed vs. expected: H/L: P value: 0.9512) (Fig. 1 A and B). While most of the dominant germline families are represented in the chain pairings of leads generated after panning, in panned sequences we observe nonrandom assortment of families (χ^2 panned vs. library H/L: P value: 0.00107), illustrated by the deemphasis of KV4 and the increased lambda contribution (Fig. 1C).
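The null model of random assortment can be made concrete: under independence, the expected count for a heavy/light family pair is the product of the two family totals divided by the number of paired reads. The following is a minimal sketch of that test statistic, not the authors' code; the function name and the toy count tables are hypothetical.

```python
def chi_square_random_assortment(counts):
    """Chi-square statistic for independence of heavy and light families.

    counts[i][j] is the number of reads pairing heavy family i with
    light family j. Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            # Expected count if pairing follows family abundance only
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical toy tables (2 heavy families x 2 light families)
proportional = [[40, 20], [20, 10]]  # pairs track family abundance exactly
skewed = [[50, 10], [10, 20]]        # same marginals, non-random pairing

stat_random, dof = chi_square_random_assortment(proportional)
stat_skewed, _ = chi_square_random_assortment(skewed)
print(stat_random, stat_skewed, dof)
```

Converting the statistic to a P value additionally requires the chi-square survival function with the returned degrees of freedom, which a statistics library can supply.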

Fig. 1.

Heavy and light family frequencies and pairing observed in 18,158 RCA reads. (A) Heavy (Left) and light chain (Right) family frequencies observed in the library. Only families at 1% chain frequency are shown. (B) Heavy and light chain family pairings occur in proportion to the abundance of their partner, indistinguishable from a null model of random assortment by χ^2 (P value: 0.9512). (C) Heavy and light chain family pairings in second and third round panning show pairing preferences that cannot be explained by random assortment (P value: 0.00107).

CDR-H3 Diversity.

The CDR-H3 length distribution was consistent across site-directed and RCA-amplified shotgun library preparation approaches. A Poisson distribution with mean 11.5 (Kabat 95–102), as observed by others, was consistent with results found here (7), although an increase in H3 of length 5 was observed (Fig. 2A).
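For reference, the shape implied by a Poisson distribution with mean 11.5 can be tabulated directly. This is an illustrative sketch only; the length range printed is our choice, and no observed counts from the study are reproduced.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k under a Poisson(mean) model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


MEAN_H3_LENGTH = 11.5  # mean quoted in the text (Kabat 95-102)

# Expected fraction of CDR-H3 loops at each length under the model
expected_fraction = {k: poisson_pmf(k, MEAN_H3_LENGTH) for k in range(1, 32)}
most_likely_length = max(expected_fraction, key=expected_fraction.get)

for length in (5, 11, 12, 20):
    print(f"length {length:2d}: {expected_fraction[length]:.4f}")
print("most likely length:", most_likely_length)
```

Under this model length 5 is expected to be rare, so an excess at that length stands out against the fitted curve.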

Fig. 2.

CDR-H3 length and amino acid composition for the most common length bin. (A) Observed CDR-H3 length diversity from 65,240 amplicon and 22,769 RCA reads. (B) Position specific amino acid frequencies for 10,281 length 11 (determined by positions 95–102) CDR-H3s. (C) Number of nongermline encoded amino acids found in CDRs 1 and 2, by domain type.

Somatic Mutation Distribution.

In V-segment encoded CDRs (1 and 2), 17% of sequences were unaltered from germline, while 78% of sequences had between 1 and 6 aa mutations (Fig. 2C; see Tables S1 and S2 for details). The definition of somatic mutation used counts distance from closest germline allele and could therefore be inflated by novel allelic and copy number loci variations not found in IMGT. Position specific scoring matrices of all sequences for each germline show a pattern of somatic hypermutation consistent with that previously reported (34).
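The mutation count described above, distance from the closest germline allele, can be sketched at the amino acid level as follows. This is a simplified illustration under stated assumptions: it compares only equal-length CDR strings, and the germline names and sequences are made up for the example rather than taken from IMGT.

```python
def mutations_from_closest_germline(cdr, germline_cdrs):
    """Return (germline_name, n_differences) for the nearest germline CDR.

    Only germline CDRs of the same length as the query are considered;
    returns (None, None) if there are none. Ties keep the first match.
    """
    best_name, best_dist = None, None
    for name, germline in germline_cdrs.items():
        if len(germline) != len(cdr):
            continue
        dist = sum(a != b for a, b in zip(cdr, germline))
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Hypothetical germline-encoded CDR sequences (names are placeholders)
toy_germlines = {"germline_A": "SYAMS", "germline_B": "SYGMH"}

closest, n_mutations = mutations_from_closest_germline("SYAMN", toy_germlines)
print(closest, n_mutations)
```

As the text notes, a count defined this way is inflated whenever the true germline allele is missing from the reference set, because the query is then compared with a more distant allele.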

Total Diversity Estimate.

The observed diversity in the heavy chain is dominated by contributions from the CDR-H3 (Fig. 3A), while that observed in the light chain is more evenly contributed by all 3 CDRs (Fig. 3B). In the total paratope contribution from the heavy chain, diversity contributions from H1 and H2 more than double the diversity found in H3 alone. In the light chain, combined CDR diversity is more than 6-fold higher than diversity in any single CDR. Figure 3 C and D, showing percent recapture at sampling depths for the heavy and light chain CDRs, respectively, recapitulates patterns observed in the diversity estimates (Fig. 3 A and B) with CDR-H1 and CDR-H2 approaching saturation at lower sampling depths than CDR-H3. In the light chain, CDR-L2 saturates at lower sampling depths than both CDR-L1 and CDR-L3. In general, the approach to saturation for the light chain is more rapid than the heavy chain counterpart. A lower bound estimate of 2.2 × 10^5 (SD 2.2 × 10^3) diversity for the heavy domain, and 1.6 × 10^5 (SD 0.8 × 10^3) for the light domain is obtained by nonredundant capture-recapture at M = 33,000 as rarefaction trends toward asymptote in Figs. 3 A and B.

Fig. 3.

Diversity estimates for heavy and light chain CDRs in the antibody library. (A and B) 10 capture-recapture rarefaction results for CDR1, 2, 3 and concatenated CDRs for heavy and light chain, respectively. (C and D) Percent recapture during rarefaction analysis.

Discussion

Direct analysis of phage displayed paratope diversity was made possible by recent advances in long-read pyrosequencing, a novel application of profile HMM-based sequence analysis, and syntenic placement of Fv chains in the scFv construct. Long read lengths allowed the entire CDR contributions from variable chains to be assessed in concert without assembly. A combination of variable-domain amplicon and RCA of plasmid library sample preparations used for sequencing allowed for the depth required for effective capture-recapture based diversity assessments, heavy and light chain pairing assignment, and scFv construct integrity (including the GS linker) to be evaluated directly. HMM-based residue labeling provided the flexibility required for accurate CDR identification and residue-specific somatic hypermutation rate assessment in a diverse sequence space. Germline analysis provided a mechanism to assess the dispersion of potential paratope diversity in an antibody repertoire.

Considering only CDRs and using a stringent definition of diversity that requires at least 2 amino acid mutations from any other sequence for each chain to be considered unique, a lower bound diversity estimate of 2.2 × 10^5 was determined for the heavy chain CDRs, and 1.6 × 10^5 for the light chain CDRs. While it is possible that this strict definition of diversity underestimates the total number of unique molecules available in the library, doing so minimizes the chance that read errors in assembly-free sequence could contribute to artificially inflating the recapture diversity estimates. Given the observed random pairing of heavy and light chain variable domain families in the library (Fig. 1), we estimate the combined library diversity of unique nonredundant paratopes to be near 3.5 × 10^10. This estimate is quite similar to the predicted number of transformants recovered during library construction (3.1 × 10^10, SD 0.7 × 10^10). While rarefaction studies do not show complete saturation for the concatenated CDRs at sampling depths displayed (Fig. 3 A and B), the projected asymptotes suggest that the diversity of a well-constructed phage displayed antibody library is limited by transformation efficiency.
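The combined estimate follows from multiplying the two per-chain lower bounds, which is valid only under the random pairing assumption stated above. A short sketch of the arithmetic, using the figures quoted in the text (variable names are ours):

```python
# Per-chain lower-bound diversity estimates quoted in the text
heavy_diversity = 2.2e5
light_diversity = 1.6e5

# Under random heavy/light pairing, paratope diversity is the product
combined_diversity = heavy_diversity * light_diversity

# Transformant count from colony counting (mean and SD quoted in the text)
transformants_mean = 3.1e10
transformants_sd = 0.7e10

# How far the sequence-based estimate sits from the colony-count mean
deviation_in_sd = (combined_diversity - transformants_mean) / transformants_sd

print(f"combined diversity: {combined_diversity:.2e}")
print(f"deviation from transformant estimate: {deviation_in_sd:.1f} SD")
```

The product is about 3.5 × 10^10, which lies within one standard deviation of the transformant count.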

The single largest source of paratope diversity in antibodies is derived from variation in the heavy chain CDR3 loop. In the library, CDR-H3 lengths ranged from 1 to 31 aa, following an apparent Poisson distribution (Fig. 2A). While every amino acid could be found at most positions in CDR-H3, the distribution of abundances varied depending on sequence length (Fig. S3) in a manner characteristic of these loops previously observed in human antibodies (27, 34).

Diversity in V-segment encoded CDR 1 and 2 is dispersed by germline origin and diversified by somatic hypermutation. The 48 heavy and 53 light germlines found in the library provide 2,544 distinct potential heterodimer pairings. Library antibodies pair randomly at the family level in the scFv construct. If this trend can be assumed to hold true at the level of germlines, then based on frequencies observed (Table 1; see Tables S1 and S2 for complete list) and a library size of 10^10, one can expect to find between 10^5 and 10^7 unique combinations of most heterodimer germline pairs in the library. In each chain, germline encoded CDRs 1 and 2 vary by an average of 4.4 aa from the next closest germline. This gap in CDR sequence space between germlines is thoroughly explored by somatic hypermutation, with 78% of antibody chains displaying between 1 and 6 aa mutations in their V-segment encoded CDRs. The total contribution of germline origins and somatic mutation is substantial: estimates that consider all 3 CDRs on each chain result in a 12-fold increase in total diversity compared with an estimate derived from H3 and L3 alone (Fig. 3).
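The pairing count and the expected number of library members per germline pair can be reproduced with simple arithmetic. In this sketch the germline counts and library size come from the text, while the specific frequencies passed to the function are hypothetical examples, not values from Table 1.

```python
n_heavy_germlines = 48
n_light_germlines = 53
potential_pairings = n_heavy_germlines * n_light_germlines

LIBRARY_SIZE = 1e10  # library size used in the text


def expected_pair_count(freq_heavy, freq_light, library_size=LIBRARY_SIZE):
    """Expected library members carrying one heavy/light germline pair,
    assuming germlines pair at random in proportion to their frequency."""
    return library_size * freq_heavy * freq_light


# If every germline were equally abundant (an idealization)
uniform_count = expected_pair_count(1 / n_heavy_germlines, 1 / n_light_germlines)

# Hypothetical rarer pair: heavy germline at 0.5%, light germline at 1%
rare_count = expected_pair_count(0.005, 0.01)

print(potential_pairings)
print(f"{uniform_count:.2e}", f"{rare_count:.2e}")
```

Both example values fall inside the 10^5 to 10^7 range given in the text; very abundant or very rare germline pairs would fall outside it.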

When panning 16 diverse protein test antigens with the phage displayed antibody library, multiple binders were always successfully recovered. In sequencing over 1.7 × 10^4 leads recovered during the second and third round of panning, antibodies were found to be derived from almost every germline available in the library. For each antigen panned, at least 30 unique antibody clusters were identified that explored multiple distinct epitopes across their respective target's surface. The diversity of binders recovered is a strong indication that the deliberate inclusion of diverse frameworks into the library contributed directly to accessible diversity. While some variation in germline frequency (Table 1) and chain pairings (Fig. 1C) was observed, the overall similarity in distributions between the phage display library and panned binder products suggests that germlines were recruited in rough proportion to their availability. This indicates that the display of antibody on phage in a scFv format does not significantly restrict the available diversity accessible to in vitro panning.

Even with 310 transformations, diversity came close to saturating transformants. While heavy and light chain shuffling and pooling of cDNA from multiple donors during library construction makes it difficult to speculate on whether this is a direct reflection of the diversity of individual antibody repertoires or of the degree of humoral overlap between individuals, it does suggest that the maximum possible repertoire diversity supported by human germlines may yet be fully explored by phage display. While this library design was successful, biases during construction can always impact the final functional diversity and not be readily noticeable using traditional diversity estimation techniques. In future library development projects, deep sequencing in conjunction with germline and CDR analysis could provide a powerful quality assurance technique to identify and correct biases that may be introduced during each stage of library construction. This general methodology has immediate applications in quality assurance during library construction as well as potential applications in antibody repertoire assessment in disease states.

Materials and Methods

Construction of Human Naïve scFv Library.

Total RNA and/or mRNA were obtained from 637 healthy human peripheral blood leukocyte donors (BioChain Institute and Clontech) and 17 human spleens (BioChain Institute, Clontech, and OriGene). First strand cDNA was synthesized by using human heavy chain constant region primer HuIgM (5′-TGGAAGAGGCACGTTCTTTTCTTT-3′), human κ constant region primer HuCκ (5′-AGACTCTCCCCTGTTGAAGCTCTT-3′) and human λ constant region primer HuCλ (5′-TGAAGATTCTGTAGGGGCCACTGTCTT-3′) according to vendor specifications (Invitrogen). Human heavy and light chain V-genes were separately amplified by PCR using an equimolar mixture of degenerate family primers (32, 33): 9 VH and 4 JH for VH genes, 7 Vκ and 5 Jκ for Vκ genes, and 9 Vλ and 3 Jλ for Vλ genes at a final concentration of 10 pmol/μl of each primer. The amplified products were assembled as VH-(G4S1)3 linker-VL scFv antibodies according to Marks and Bradbury (32). The assembled scFv antibody repertoire was purified, cut with SfiI, and cloned into a vector. TG1 competent cells (Stratagene) were transformed with the scFv vector by electroporation using BTX 1 mm gap cuvettes in 310 parallel reactions. Transformation efficiency was estimated by colony counting of plated serial dilutions drawn from 1 ml of the 302 ml posttransformant pool before any incubation.

Selection of Human Antibodies from scFv Phage Display Library.

Phage antibodies were prepared from scFv library glycerol stocks according to published protocols (33). The specific antibodies to 16 diverse antigens were selected and screened by ELISA and BIAcore assay after 3–4 rounds of biopanning through either immobilized antigen at 5–10 μg/ml for solid phase or biotinylated antigen at 20–200 nM for solution phase depending on the antigens according to standard protocols (33). The obtained antibodies were sequenced and grouped by CDR clustering for further analysis.

454 Titanium Sequencing.

Two sample preparation strategies were used for sequencing the scFv library. High-depth bidirectional variable domain coverage was provided by PCR amplification of the VH and Vλ/Vκ scFv insert regions using amplicon specific primers complementary to the constant regions of the vector and the GS-linker and harboring the 454 Titanium adaptor sequences to generate Ig amplicon libraries. Vector composition and read error-rate were assessed by sequencing single-stranded RCA-shotgun libraries generated by RCA of the whole vector, followed by random shearing and ligation of 454 Titanium adaptors. The sequencing runs using the Roche/454 Genome Sequencer FLX were set up according to the 454 Titanium Sequencing protocol. Please see SI Text for RCA and shotgun Library preparation, PCR amplicon library preparation (including primer sequences used, Table S3), and sample library titration for 454 Titanium sequencing. Information on controls performance, loading density, and signal intensities in addition to signal processing and run yield are listed as well.

Sequence Analysis.

Translation, multiple sequence alignment, Kabat numbering and identification of structurally conserved CDR boundary positions were performed with HMMER profile hidden Markov models (18) designed to represent the scFv architecture. The HMM was trained with normalized concatenations of 95% nonredundant IMGT (35) germline V and J segment amino acid multiple sequence alignments, a direct GS-linker encoding, and a permissive insert D segment. Direct mapping of Kabat numbering system to columns in the HMM allowed specific Kabat positions and ranges to be identified. CDRs were bounded by conserved Kabat positions H1 31–35, H2 51–61, H3 93–102; L1 24–34, L2 50–56, and L3 89–97 that could be identified with high accuracy (details in SI Text). The approach was evaluated with a benchmark of 779 superposed nonredundant antibody structures from multiple species. Reads with frames bearing 10^-10 or better expectation values to the model were aligned to and annotated by the profile. Seventy percent match state coverage in neighboring framework regions was a read-through requirement for CDR and GS-linker analysis.
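Once each residue carries a Kabat position, pulling out a CDR reduces to a range lookup. The sketch below assumes an upstream step (such as the profile HMM alignment described above) has already produced the numbering; the input format, function name, and toy sequences are hypothetical, and the ranges are the boundary positions quoted in the text, treated here as inclusive.

```python
# Kabat boundary positions as quoted in the text (treated as inclusive)
CDR_RANGES = {
    "H1": (31, 35), "H2": (51, 61), "H3": (93, 102),
    "L1": (24, 34), "L2": (50, 56), "L3": (89, 97),
}


def extract_cdr(numbered_residues, cdr_name):
    """Return the residues that fall inside the named CDR range.

    numbered_residues is a list of (kabat_number, insertion_code, residue)
    tuples in sequence order, e.g. (100, "A", "G") for position 100A.
    Insertion positions are kept because they share the base number.
    """
    start, end = CDR_RANGES[cdr_name]
    return "".join(
        residue
        for number, _insertion, residue in numbered_residues
        if start <= number <= end
    )


# Hypothetical Kabat-numbered fragments (not sequences from the study)
toy_light = [(48, "", "I"), (49, "", "Y"), (50, "", "D"), (51, "", "A"),
             (52, "", "S"), (53, "", "N"), (54, "", "R"), (55, "", "A"),
             (56, "", "T"), (57, "", "G"), (58, "", "I")]
toy_heavy = [(92, "", "C"), (93, "", "A"), (94, "", "R"), (95, "", "D"),
             (100, "", "Y"), (100, "A", "G"), (101, "", "D"),
             (102, "", "Y"), (103, "", "W")]

print(extract_cdr(toy_light, "L2"))
print(extract_cdr(toy_heavy, "H3"))
```

Because insertions such as 100A share their base position number, loops of any length are captured by the same fixed range, which is what makes column-labeled numbering convenient for variable-length CDRs.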

Germline Classification.

Classification of V segment germlines was performed by nucleotide comparison to IMGT database of allelic variants using BLAST (36). Classification was made to germline, and a subset of very similar germlines were pregrouped during classification (indicated by a solidus in Table 1). To address increased risk of misclassification in a mutation-rich sequence space, confidence in classification was assigned using a benchmarked probabilistic framework. Confidence that the top hit was the correct germline was determined by an expression (presented as an embedded image in the source and not reproduced here) where Si is the confidence that primary hypothesis i is correct, λ is the common alignable query sequence length, μ is the observed number of mutations between query and germline i, δ is the distance difference between the primary hypothesis i and the alternate g, and G is the total number of germlines. In cases where the sum of alternate hypotheses was greater than the cutoff 10^-3, family classification was attempted. The method's ability to classify sequences with somatic mutations and read errors was benchmarked by simulation: >250,000 sequences derived from human frameworks and bearing progressive simulated somatic mutation loads.

Diversity Assessment.

Nonredundant functional binder diversity was assessed per domain, using only translated CDR sequences from single reading frames spanning the entire variable domain. By ignoring silent mutations and any mutations that occurred in the framework, sequencing error effects were minimized and only variation most likely to impact binding specificity was evaluated. Diversity was determined with capture-recapture rarefaction as previously described (17) with 2 modifications: 1) the entity compared when assessing recapture was the translated amino acid content of CDRs, and 2) a rigorous nonredundant diversity assessment, counting any CDR concatenation with less than 2 amino acid differences as being a functionally equivalent recapture, was performed for each estimate.
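The two modifications can be illustrated with a small two-sample estimate. The text does not spell out the estimator beyond citing (17), so this sketch assumes the classic Lincoln-Petersen form (marked × caught / recaptured); the function names and toy sequences are hypothetical, and the quadratic comparisons are only practical for small inputs.

```python
def functionally_equivalent(a, b):
    """True if two CDR concatenations differ by fewer than 2 amino acids.
    Sequences of different length are treated as distinct."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) < 2


def nonredundant(sequences):
    """Greedy collapse: keep a sequence only if no kept one is equivalent."""
    kept = []
    for seq in sequences:
        if not any(functionally_equivalent(seq, k) for k in kept):
            kept.append(seq)
    return kept


def capture_recapture_estimate(first_sample, second_sample):
    """Lincoln-Petersen estimate of total diversity from two samples of
    translated CDR strings. Returns None when nothing is recaptured."""
    marked = nonredundant(first_sample)
    caught = nonredundant(second_sample)
    recaptured = sum(
        any(functionally_equivalent(c, m) for m in marked) for c in caught
    )
    if recaptured == 0:
        return None
    return len(marked) * len(caught) / recaptured


# Toy translated CDR concatenations (made up for illustration)
sample_1 = ["ARDY", "ARDF", "GGGG", "CCCC"]  # ARDF collapses into ARDY
sample_2 = ["ARDW", "TTTT", "CCCC", "WWWW"]  # ARDW and CCCC are recaptures

estimate = capture_recapture_estimate(sample_1, sample_2)
print(estimate)
```

Counting single-residue variants as recaptures lowers the estimate, which matches the stated goal of producing a conservative figure that sequencing errors cannot inflate.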

Acknowledgments

We would like to acknowledge Lin T. Guey, Peter Henstock, Tenshang Joh, and Albert Seymour for their assistance in statistical analyses. We are appreciative of the helpful correspondence with Joshua Weinstein regarding his application of capture-recapture for antibody diversity assessment. We are also grateful to Andrea Rossi for reading the manuscript and sharing his insights on structural biology considerations.

Footnotes

  • ²To whom correspondence may be addressed. E-mail: jaume.pons{at}pfizer.com or arvind.rajpal{at}pfizer.com
  • Author contributions: J.G., W.Z., J.B., A.R., and J.P. designed research; J.G., W.Z., J.B., D.T., G.H., G.R.M., I.N., L.M., and P.D.S. performed research; J.G., W.Z., J.B., G.M.R.D., and D.C. contributed new reagents/analytic tools; J.G. analyzed data; and J.G., W.Z., J.B., P.D.S., A.R., and J.P. wrote the paper.

  • Conflict of interest statement: All authors are employees of Pfizer Inc.

  • See Commentary on page 20137.

  • This article contains supporting information online at www.pnas.org/cgi/content/full/0909775106/DCSupplemental.

References

  1. Kindt TJ, Capra JD (1984) The Antibody Enigma (Plenum Press, New York).
  2. Perelson AS, Oster GF (1979) Theoretical studies of clonal selection: Minimal antibody repertoire size and reliability of self-non-self discrimination. J Theor Biol 81:645–670.
  3. Huber C, et al. (1993) The V kappa genes of the L regions and the repertoire of V kappa gene sequences in the human germ line. Eur J Immunol 23:2868–2875.
  4. Kawasaki K, et al. (1995) The organization of the human immunoglobulin lambda gene locus. Genome Res 5:125–135.
  5. Matsuda F, et al. (1998) The complete nucleotide sequence of the human immunoglobulin heavy chain variable region locus. J Exp Med 188:2151–2162.
  6. Tonegawa S (1983) Somatic generation of antibody diversity. Nature 302:575–581.
  7. Wu TT, Kabat EA (1970) An analysis of the sequences of the variable regions of Bence Jones proteins and myeloma light chains and their implications for antibody complementarity. J Exp Med 132:211–250.
  8. Trepel F (1974) Number and distribution of lymphocytes in man. A critical analysis. Klin Wochenschrift 52:511–515.
  9. Margulies M, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380.
  10. Freeman JD, Warren RL, Webb JR, Nelson BH, Holt RA (2009) Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome Res 19:1817–1824.
  11. Mavromatis K, et al. (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4:495–500.
  12. Gelfand I, Kister A, Kulikowski C, Stoyanov O (1998) Algorithmic determination of core positions in the VL and VH domains of immunoglobulin molecules. J Comput Biol 5:467–477.
  13. Honegger A, Pluckthun A (2001) Yet another numbering scheme for immunoglobulin variable domains: An automatic modeling and analysis tool. J Mol Biol 309:657–670.
  14. Lefranc MP, et al. (2003) IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol 27:55–77.
  15. Abhinandan KR, Martin AC (2008) Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains. Mol Immunol 45:3832–3839.
  16. Gorski J, et al. (1994) Circulating T cell repertoire complexity in normal individuals and bone marrow recipients analyzed by CDR3 size spectratyping. Correlation with immune status. J Immunol 152:5109–5119.
  17. Weinstein JA, Jiang N, White RA, III, Fisher DS, Quake SR (2009) High-throughput sequencing of the zebrafish antibody repertoire. Science 324:807–810.
  18. Eddy SR (2000) Profile hidden Markov models for biological sequence analysis. HMMER, Available at http://hmmer.janelia.org.
  19. Eddy SR, Mitchison G, Durbin R (1995) Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 2:9–23.
  20. Karplus K, et al. (1997) Predicting protein structure using hidden Markov models. Proteins Suppl 1:134–139.
  21. Park J, et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 284:1201–1210.
  22. Huse WD, et al. (1989) Generation of a large combinatorial library of the immunoglobulin repertoire in phage lambda. Science 246:1275–1281.
  23. Barbas CF, III, Lerner RA (1991) Combinatorial immunoglobulin libraries on the surface of phage (phabs): Rapid selection of antigen-specific Fabs. Methods: Companion Methods Enzymol 2:119–124.
  24. Griffiths AD, et al. (1993) Human anti-self antibodies with high specificity from phage display libraries. EMBO J 12:725–734.
  25. Marks JD, et al. (1991) By-passing immunization. Human antibodies from V-gene libraries displayed on phage. J Mol Biol 222:581–597.
  26. Hoet RM, et al. (2005) Generation of high-affinity human antibodies by combining donor-derived and synthetic complementarity-determining-region diversity. Nat Biotechnol 23:344–348.
  27. Knappik A, et al. (2000) Fully synthetic human combinatorial antibody libraries (HuCAL) based on modular consensus frameworks and CDRs randomized with trinucleotides. J Mol Biol 296:57–86.
  28. Vaughan TJ, et al. (1996) Human antibodies with sub-nanomolar affinities isolated from a large non-immunized phage display library. Nat Biotechnol 14:309–314.
  29. Klein U, Kuppers R, Rajewsky K (1997) Evidence for a large compartment of IgM-expressing memory B cells in humans. Blood 89:1288–1298.
  30. Weller S, et al. (2004) Human blood IgM “memory” B cells are circulating splenic marginal zone B cells harboring a prediversified immunoglobulin repertoire. Blood 104:3647–3654.
  31. Weller S, et al. (2008) Somatic diversification in the absence of antigen-driven responses is the hallmark of the IgM+ IgD+ CD27+ B cell repertoire in infants. J Exp Med 205:1331–1342.
  32. Marks JD, Bradbury A (2004) PCR cloning of human immunoglobulin genes. Methods Mol Biol 248:117–134.
  33. Marks JD, Bradbury A (2004) Selection of human antibodies from phage display libraries. Methods Mol Biol 248:161–176.
  34. Zemlin M, et al. (2003) Expressed murine and human CDR-H3 intervals of equal length exhibit distinct repertoires that differ in their amino acid composition and predicted range of structures. J Mol Biol 334:733–749.
  35. Lefranc MP, et al. (2009) IMGT, the international ImMunoGeneTics information system. Nucleic Acids Res 37:D1006–D1012.
  36. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389.
Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire
Jacob Glanville, Wenwu Zhai, Jan Berka, Dilduz Telman, Gabriella Huerta, Gautam R. Mehta, Irene Ni, Li Mei, Purnima D. Sundar, Giles M. R. Day, David Cox, Arvind Rajpal, Jaume Pons
Proceedings of the National Academy of Sciences Dec 2009, 106 (48) 20216-20221; DOI: 10.1073/pnas.0909775106


Article Classifications

  • Biological Sciences
  • Biochemistry

Related Articles

  • Twenty years of combinatorial antibody libraries, but how well do they mimic the immunoglobulin repertoire?
    - Nov 23, 2009

Copyright © 2021 National Academy of Sciences. Online ISSN 1091-6490