Genetics
Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence



















Department of Genetic Information,
Division of Molecular Life Science, Tokai University School of Medicine,
Bohseidai, Isehara, Kanagawa 259-1193, Japan;
Centre for Bioinformatics and Biological
Computing, School of Information Technology, Murdoch University, Murdoch WA
6150, Australia;
Bioscience Research Laboratory,
Fujiya Company, Limited, 228 Soya, Hadano, Kanagawa 257-0031, Japan;
¶National Institute of Genetics Center for
Information Biology and DDBJ, 1111 Yata, Mishima, Shizuoka 411-8540, Japan;
||Japan Biological Information Research Center,
National Institute of Advanced Industrial Science and Technology, TIME24
Building 10F, 2-45 Aomi, Koto-ku, Tokyo 135-0064, Japan; and

Institut National de la Santé
et de la Recherche Médicale-Contrat de Recherche
Stratégique, Laboratoire d'Immunogénétique
Moléculaire Humaine, Centre de Recherche d'Immunologie et
d'Hématologie, 4 Rue Kirschleger, 67085 Strasbourg, France
Edited by Masatoshi Nei, Pennsylvania State University, University Park, PA and approved April 23, 2003 (received for review January 29, 2003)
| Abstract |
|---|
6 million years ago), the
molecular basis of traits unique to humans vs. their closest relative, the
chimpanzee, is largely unknown. This report describes a large-scale
single-contig comparison between human and chimpanzee genomes via the sequence
analysis of almost one-half of the immunologically critical MHC. This
1,750,601-bp stretch of DNA, which encompasses the entire class I along with
the telomeric part of the MHC class III regions, corresponds to an orthologous
1,870,955 bp of the human HLA region. Sequence analysis confirms the existence
of a high degree of sequence similarity between the two species. However, and
importantly, this 98.6% sequence identity drops to only 86.7% taking into
account the multiple insertions/deletions (indels) dispersed throughout the
region. This is functionally exemplified by a large deletion of 95 kb between
the virtual locations of human MICA and MICB genes, which
results in a single hybrid chimpanzee MIC gene, in a segment of the
MHC genetically linked to species-specific handling of several viral
infections (HIV/SIV, hepatitis B and C) as well as susceptibility to various
autoimmune diseases. Finally, if generalized, these data suggest that
evolution may have used the mechanistically more drastic indels instead of the
more subtle single-nucleotide substitutions for shaping the recently emerged
primate species.
98.77% nucleotide and >99% amino acid identity with us
(2,
3). However, there are
important biomedical (as well as obvious morphological and cognitive)
differences between the two species, which thus far have eluded any molecular
explanation within this supposedly 1% diversity range. Among these are our
differential handling of a number of infectious agents, e.g., HIV (progression
to AIDS) (4), late
complications of hepatitis B and C
(5,
6), as well as susceptibility
to Plasmodium falciparum
(7), which are of utmost public
health importance. The molecular basis of these distinctive traits is thought
to be in large part encoded within the MHC, where MHC class I molecules sample
pathogen-derived antigenic peptides for recognition by the CD8+ T cell
receptor expressing cytotoxic T cells
(8). We have already reported the complete sequence and gene map of the 3.7-Mb human chromosome 6p21.3-located MHC (alternatively called the human leukocyte antigen or HLA) gene complex (9, 10). This is a gene-rich (224 identified loci) highly polymorphic (with some MHC genes having >400 alleles) genomic segment that is associated with a myriad (>100) of mostly autoimmune but also infectious disorders for which our molecular knowledge, for the most part, remains rudimentary. It is precisely this extremely high level of MHC polymorphism and heterozygosity that is believed to confer a selective advantage to the host in encountering the extraordinarily diverse pathogen-derived antigenic repertoire (8). The human MHC is composed of three distinct regions, designated from the centromere to the telomere as the class II, III, and I regions. The telomeric 1.8-Mb class I region harbors two notable (but not only) multicopy gene families, HLA and MIC (10, 11), which are thought to have arisen from repeated gene duplications (10, 12) and which engage a host of critical immune receptors: the T cell receptor as well as Ig and lectin-like inhibitory and activatory receptors (13, 14).
Despite the facts that structural and/or functional orthologues for all human HLA genes have been found in chimpanzee (15-19) (HLA-A/B/C/E/F/G vs. Patr-A/B/C/E/F/G) and that there is no doubt that the MHC biology between these two close species is nearly interchangeable, the genomic architecture of chimp MHC is unknown, although it is assumed to be closely linear to that of human. Our aim was to capitalize on our detailed knowledge of the human HLA region to jump-start a large-scale comparative genomic analysis with regard to that of the chimpanzee (P. troglodytes) MHC (called Patr). Not only will the chimpanzee MHC sequence provide an in-depth analysis of this important genomic region between two such closely related species, but it also has the intrinsic power to unravel the molecular basis for some important biological differences between us and the chimpanzee. In this regard, we present 1.75 Mb of continuous genomic sequence linking the Lymphotoxin B (LTB) gene in the telomeric area of the class III region to Patr-F locus (chimpanzee HLA-F orthologue) at the telomeric end of the MHC class I region.
| Materials and Methods |
|---|
2 kb in length, were
PCR-generated from the human HLA and MIC genes (exons
2-4) as well as several MHC-based sequence tagged sites by using cloned
human genomic DNA as a template. The final contig map was constructed by
comparison with the complete sequence of the human MHC
(9,
10).
DNA Sequencing and Analysis. Fourteen chimpanzee BAC clones that
covered approximately 1.75 Mb from the LTB to Patr-F genes were
completely and bidirectionally shotgun sequenced with an average redundancy of
7.0x, which was sufficient for assembly and analysis of the entire
sequence using previously established procedures
(10,
20). The chimpanzee sequence
was compared with our previously published human sequence (GenBank accession
nos. AP000502-000521) (10, 20). Sequence alignments were
performed and homologies determined by using the programs contained within the
GENETYX Ver. 11
(www.sdc.co.jp/genetyx)
and DNASIS (Hitachi, Tokyo) software packages. Dot matrix analysis was
performed by using HARRPLOT Ver. 2 as part of the GENETYX package. The
nucleotide diversity (21)
profile was constructed after determining the percent nucleotide difference
between the human and chimpanzee sequences for a sliding window of 1 kb. The
diversity profile was then drawn by using the graphics output of Microsoft
EXCEL. All indels were removed from the alignments to standardize the number
of nucleotides examined within each window. Finally, repetitive elements were
identified within the contiguous sequences by using the REPEATMASKER webserver
(A. F. A. Smit and P. Green,
http://ftp.genome.washington.edu/cgi-bin/RepeatMasker).
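The windowed diversity profile described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GENETYX/EXCEL workflow used in the study: the function names, the `-` gap character, and the choice of non-overlapping windows (`step` equal to `window`) are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def strip_indel_columns(human: str, chimp: str, gap: str = "-") -> Tuple[str, str]:
    """Drop every alignment column in which either sequence carries a gap."""
    if len(human) != len(chimp):
        raise ValueError("aligned sequences must have equal length")
    kept = [(h, c) for h, c in zip(human.upper(), chimp.upper())
            if h != gap and c != gap]
    return "".join(h for h, _ in kept), "".join(c for _, c in kept)


def diversity_profile(human: str, chimp: str,
                      window: int = 1000, step: int = 1000) -> List[float]:
    """Percent nucleotide difference per window over the gap-free alignment."""
    h, c = strip_indel_columns(human, chimp)
    profile = []
    for start in range(0, len(h) - window + 1, step):
        # Count mismatching columns inside the current window.
        mismatches = sum(1 for a, b in zip(h[start:start + window],
                                           c[start:start + window]) if a != b)
        profile.append(100.0 * mismatches / window)
    return profile


# Toy alignment (hypothetical): one gap column and one substitution.
profile = diversity_profile("ACGT-ACGTA", "ACGTTACCTA", window=3, step=3)
print(profile)
```

Because the gap columns are removed before windowing, every window covers the same number of compared nucleotides, which is the standardization the text refers to.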
| Results and Discussion |
|---|
95 kb shorter than the corresponding 1,796,912-bp HLA
class I region. Fig. 1 depicts
a detailed comparative genomic map between these two MHC regions. Analysis of
the repeat content reveals an occupancy rate of 52.03% of the region by such
sequences as compared with 51.11% for the human counterpart. These were
respectively composed of 17.66% (chimpanzee)/16.79% (human) short interspersed
elements (SINEs), 17.87%/18.10% long interspersed elements (LINEs), and
12.98%/12.88% LTR elements. A detailed breakdown of the repeat content of the
entire region is provided in Table 1, which is published as supporting
information on the PNAS web site (www.pnas.org).
As expected, there is considerable similarity between the genomic organization
of human and chimpanzee MHCs. The chimpanzee sequence contains 41 putative
coding genes and 59 noncoding or pseudogenes, which are matched with
orthologous loci in identical orientations within the human MHC
(9,
10). Interestingly, a detailed
sequence analysis revealed the existence of 64 indels, each >100 bp in
length (Fig. 1). Most of these
indels include repetitive elements, such as Alu, LINE, LTR, etc., as well as
the frequent insertion of the repeated sequence, SVA, within the chimpanzee
sequence. Importantly, the indels were directly responsible for the major
differences observed between the two species. These include the loss of three
human pseudogenes, DHFRP, HCGII-4, and MICF, and the
presence of only a single chimpanzee MIC gene in the region
corresponding to the two human functional MICA and MICB
genes, at the centromeric end of the class I region. This single chimpanzee
MIC gene, Patr-MIC, was therefore produced as a result of a
large 95-kb deletion between the corresponding human MICA and
MICB genes, following a scenario that we reconstruct below.
A 95-kb Deletion Between the Human MICA and MICB Genes Leads to the
Generation of a Single Chimeric Patr-MIC Gene. The human MICA and
MICB genes are believed to result from a genomic duplication that
occurred 33-44 million years ago (Mya) (22, 23), hence well before the
separation of the chimpanzee from the human lineage approximately 6 Mya (24)
(Fig. 2A). Therefore, we asked from which ancestral MIC gene (A or B)
this single Patr-MIC gene originated, i.e., which MIC
gene was deleted from the chimpanzee genome? Because the human MICA
and MICB genes have a relatively high sequence similarity, it was not
possible to settle the issue by dot plot analysis (data not shown).
Consequently, and to thoroughly address this question, we performed both
structural as well as similarity analyses of the human and chimpanzee
MIC genes. Structural analysis showed that the 5' flanking
sequences and the first intron of the Patr-MIC gene have retained all
of the signature retroelements characteristic of the orthologous region of
human MICA, whereas the 3' flanking sequences of the
Patr-MIC display all of the characteristic retroelements of the human
MICB (Fig.
2B). In accordance with this retroelement profiling,
similarity investigations unveiled that exon 1 to intron 2 of
Patr-MIC show greater sequence similarity with corresponding regions
of MICA rather than with MICB, whereas the opposite is seen
for Patr-MIC exons 5 and 6 as well as the 3' noncoding region
(Table 2, which is published as supporting information on the PNAS web site).
Finally, the polymorphic (GCT)n (n = 4, 5, 6, 9, 10)
short-tandem repeat, which exists only within the fifth exon (transmembrane
domain) of MICA but not MICB
(19), is also absent from the
same exon in Patr-MIC. However, because the sequences between exon 3
and intron 4 of the Patr-MIC are equally homologous to orthologous
regions within MICA and MICB, one could not establish the
exact position of the recombinational event. Nevertheless, on the basis of
sequence differences, we have narrowed the recombination breakpoint down to a
segment located between the ends of MICA's second and MICB's
fourth introns.
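The logic used to bracket the breakpoint (5' segments resembling MICA, 3' segments resembling MICB, and an uninformative stretch in between) can be illustrated with a small sketch. It assumes the three sequences are already aligned to common coordinates; the function names, segment boundaries, and toy sequences are hypothetical and are not the study's data or software.

```python
from typing import Dict, Tuple


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def closer_parent(hybrid: str, parent_a: str, parent_b: str,
                  segments: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Label each segment by the parent it resembles more: 'A', 'B', or 'tie'."""
    labels = {}
    for name, (start, end) in segments.items():
        id_a = percent_identity(hybrid[start:end], parent_a[start:end])
        id_b = percent_identity(hybrid[start:end], parent_b[start:end])
        labels[name] = "A" if id_a > id_b else "B" if id_b > id_a else "tie"
    return labels


# Toy aligned sequences (hypothetical, not real MIC data).
mica_like = "AAAA" + "GGGG" + "AAAA"
micb_like = "CCCC" + "GGGG" + "CCCC"
hybrid = "AAAA" + "GGGG" + "CCCC"
segments = {"5' part": (0, 4), "middle": (4, 8), "3' part": (8, 12)}
labels = closer_parent(hybrid, mica_like, micb_like, segments)
print(labels)
```

In this toy case the breakpoint can only be placed somewhere within the "tie" segment, which mirrors why the exact recombination position could not be established from the segment between exon 3 and intron 4.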
The existence of a single functional MIC gene centromeric of the major classical class I locus, Patr-B here, is not exclusive to chimpanzee, because it has been recognized in other primates, including humans. The gorilla, indeed, appears to have only one MIC gene with a strong sequence similarity to the human MICA (25, 26). In humans, individuals carrying the HLA-B*4801 allele (rare in Caucasians but more common in Northeast Asians as well as Native Americans) have lost the MICA locus also due to a large 100-kb genomic deletion surrounding and including this locus (albeit the genomic breakpoints are distinct from those observed in chimpanzee) (27, 28). All in all, it is quite intriguing that an equal-sized deletion involving this very same region and genes (MICA/B) has happened at distinct points in time in several different primate species. This very phenomenon might also be the reason why rodents are devoid of MIC genes, because the putative location of functional mouse MICA and MICB genes, the segment linking H2-D (equivalent to HLA-B or Patr-B) and BAT1, is substantially shortened compared with the human MHC: 40 instead of 173 kb (11, 29, 30). The molecular basis of the existence of such an apparently "deletion-prone" segment between MICA and MICB remains to be established, but this could be due to the existence of a HERV-L sequence, which contains a 2.5-kb AT-rich insertion in its 5' LTR, which might therefore serve as a recombination hot spot (23, 31).
Nucleotide, Amino Acid, and Structural Similarities Between Human and Chimpanzee Orthologous Sequences. Fig. 3 compiles our similarity analysis with respect to nucleotide and amino acid diversity among 35 orthologous human/chimpanzee genes identified here, of a total of 41 putative coding sequences. The average nucleotide and amino acid identities were 98.9% and 98.3%, respectively (Table 3, which is published as supporting information on the PNAS web site). This relatively lower amino acid identity might be the result of positive selection aimed to maintain genetic polymorphism in the MHC (that is MHC class I) genes (32). Indeed, once genes were divided into MHC (hereafter designating MHC class I or MHC-I) and non-MHC loci, it was found that sequence identities were 99.3%/99.1% (nucleotide/amino acid) for the 28 non-MHC genes and "only" 97.1%/95.0% for the seven MHC-I genes, including the solo MIC gene. Furthermore, when MHC-I genes themselves were subdivided into classical/polymorphic (HLA-ABC-Patr-ABC), nonclassical/nonpolymorphic (HLA-EFG-Patr-EFG), as well as nonclassical but polymorphic (MICA/MICB-Patr-MIC), the nucleotide/amino acid identities were 96.2%/93.4%, 98.8%/98.5%, and 95.1%/89.6%, respectively. This analysis of nucleotide and amino acid similarities implies that, in addition to positive selection acting on MHC genes, a resolute degree of purifying selection acts primarily on the non-MHC genes to maintain their structural conservation (32). This makes sense, because most of these non-MHC genes are involved in basic (homeostatic) cellular functions that require interindividual as well as interspecies homogeneity. In contrast, MHC-I genes have to constantly adapt themselves to the microbiological habitat of every species (exceptions to this observed dichotomy are SPR1, SEEK1, and HCGIX-4; further functional characterization of these loci might answer this apparent discrepancy).
Comparative Nucleotide Diversity Profiling. A "nucleotide
diversity (substitution: single-nucleotide polymorphism) profile" was
generated across the entire 1.68-Mb gap-free (indels excluded) aligned genomic
sequence by using a sliding window of 1,000 bp
(Fig. 4). The average degree of
nucleotide identity between the chimpanzee and the human for this region
(again excluding indels) is 98.6%, which is similar to the earlier estimation
of 98.77% (1.23% nucleotide difference)
(7). However, this nucleotide
difference is not constant across the entire MHC. For instance, within the two
non-MHC gene-rich clusters (Fig. 4, left, LTB to BAT1 gene; center, IEX-1 to
HSR1 gene), it is approximately 0.7%, which is five to nine times less than the
average nucleotide difference of 3.5-6.7% around the classical MHC genes. This variation in
nucleotide difference implies again that purifying selection is acting to
maintain conservation much more strongly throughout these non-MHC gene-rich
clusters (including their intergenic regions), whereas in contrast, the
classical class I gene regions (including their intergenic sequences) show a
lower degree of similarity, probably as a result of overdominant selection
necessary to maintain polymorphism
(33).
As expected, the genomic segments surrounding the MHC genes, except for the nonclassical HLA/Patr-G loci, reveal a continuously high diversity profile, especially around the classical class I loci. This high degree of nucleotide variation may be the result of positive selection, the existence of multicopied sequences, as well as a hitchhiking effect due to the cumulative effect of balancing selection acting on the MHC loci in linkage disequilibrium (33). In contrast, the 35 kb surrounding the HLA/Patr-G genes is highly conserved, displaying only a 0.9% nucleotide difference, unlike the regions next to other MHC genes. This low level of nucleotide variation between the HLA/Patr-G genes might be connected with the biology of the HLA/Patr-G molecule, which is implicated in maternofetal immunity. Finally and interestingly, the diversity profile between the chimpanzee and human sequences closely resembles that previously obtained between different human MHC haplotypes (33-37).
Interestingly, once the indels are taken into account, the above-observed 98.6% sequence identity drops to only 86.7% (substitution, 1.4%; indels, 11.9%). This indel-included 86.7% identity may be a better representation of whole-genome sequence similarity between the human and the chimpanzee, as confirmed by a recently published study comparing a number of fragmented chimpanzee sequences with their human counterparts (38).
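The drop from 98.6% to 86.7% follows directly from the two percentages quoted above. The snippet below is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Percent of aligned sequence affected, as stated in the text above.
substitution_pct = 1.4   # single-nucleotide substitutions
indel_pct = 11.9         # insertions/deletions

identity_without_indels = 100.0 - substitution_pct
identity_with_indels = 100.0 - (substitution_pct + indel_pct)

print(f"{identity_without_indels:.1f}% -> {identity_with_indels:.1f}%")
```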
Further Analysis on Mismatched Sequence Between the Human and the
Chimpanzee. More precise comparative analysis of indels and substitutions
(transitions and transversions) was carried out by using mismatched sequences
between the two species. When the 1,870,955-bp human and 1,750,601-bp
chimpanzee sequences were aligned, the total length of mismatched sequence
was 252,252 bp, with substitutions representing 9.6% (24,221 bp) and indels
90.4% (228,031 bp). Thus, the major difference between the human and
chimpanzee sequences is overwhelmingly attributable to indels. With regard to
substitutions, they were further classified into transitions (16,680 bp, 6.6%)
and transversions (7,541 bp, 3.0%). Among indels, single-nucleotide indels
were represented only by 1,230 bp (0.5%), because most indels were longer than
a single base pair. Fig.
5A shows a diversity profile of transitions,
transversions, and single-nucleotide indels using a sliding window of 1,000 bp
across the entire aligned sequence. The 1.91-Mb segment was divided into the
MHC class I (multicopy) and non-MHC (single-copy) gene regions
(Fig. 5A).
Fig. 5B shows the
percentage of continuous indels that exceed 2 bp in length. Diversity profiles
give very similar patterns between transitions and transversions, although
transitions, as expected, occurred more frequently than transversions
(Fig. 5A). When
focused on single-nucleotide indels (1,230 bp), one notices only a single peak
between HLA-B/Patr-B and HLA-C/Patr-C
(Fig. 5A), with the
remainder of the region showing an even alteration of
approximately 0.06% with no
significant peaks (even within the polymorphic MHC genes). However, when
longer indels are included, indels appear to have accumulated in the MHC
class I-harboring regions (1,013,321 bp in total), which contain 191,512 bp of
indels (84.4% of the total 226,801 bp), as compared with 35,289 bp (15.6%) in
the non-MHC segments (897,959 bp).
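The base-pair counts quoted in this section are internally consistent, which can be verified with a short tally. The sketch below simply recomputes the reported percentages from the reported counts; the dictionary keys and the helper `pct` are illustrative names.

```python
# Base-pair counts as reported in the text above.
substitutions = {"transitions": 16_680, "transversions": 7_541}
indels = {"single_nucleotide": 1_230, "longer": 226_801}
longer_indels_by_region = {"MHC class I": 191_512, "non-MHC": 35_289}

total_substitutions = sum(substitutions.values())
total_indels = sum(indels.values())
total_mismatched = total_substitutions + total_indels


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


print("substitutions:", pct(total_substitutions, total_mismatched), "%")
print("indels:", pct(total_indels, total_mismatched), "%")
for region, bp in longer_indels_by_region.items():
    print(region, pct(bp, indels["longer"]), "% of longer indels")
```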
The GC contents of the human and chimpanzee sequences were similar to each
other, 45.9% and 46.1%, respectively. These contents were much higher than the
average GC content of the entire human genome (41.0%) but lower than that
expected from random nucleotide distribution (50%). By investigation of each
of 24,221 substitutions, transition (T↔C, A↔G) and transversion (T, C↔A, G)
were found to contribute to 68.9% and 31.1% of the total
substitutions, respectively (Fig. 6). This percentage of transition in the total
substitutions is approximately
10% higher than that reported in the previous studies using 16 pseudogenes
for sequence comparison (59.3%)
(36,
37). When considering
individual transition and transversion pathways, T↔C and A↔G were
found to have almost similar percentages of the total substitutions between
them, but G↔C (9.1%) and A↔T (6.1%) gave higher and lower
percentages, respectively, as compared with G↔T and A↔C as well as those
obtained in the previous studies
(38, 39). Further, although the MHC
gene regions tend to maintain a high degree of genetic polymorphism, the
ratios within nucleotide substitutions from and to each base were almost the
same between the MHC class I (multicopy) gene and non-MHC (single-copy) gene
regions (Fig. 7, which is published as supporting information on the PNAS web
site).
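The classification used in this paragraph (purine-purine or pyrimidine-pyrimidine changes are transitions; purine-pyrimidine changes are transversions) is easy to express in code. The following is a minimal sketch, assuming a pairwise alignment given as two equal-length strings; the function names and toy sequences are hypothetical and do not reproduce the software used in the study.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_substitution(a: str, b: str) -> str:
    """Return 'transition' or 'transversion' for two different bases."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid or a == b:
        raise ValueError(f"not a substitution between A/C/G/T: {a!r}, {b!r}")
    # Same chemical class on both sides means a transition.
    return "transition" if (a in PURINES) == (b in PURINES) else "transversion"


def substitution_spectrum(human: str, chimp: str) -> Counter:
    """Count transitions and transversions, skipping gaps and ambiguity codes."""
    counts = Counter()
    valid = PURINES | PYRIMIDINES
    for h, c in zip(human.upper(), chimp.upper()):
        if h in valid and c in valid and h != c:
            counts[classify_substitution(h, c)] += 1
    return counts


def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    return 100.0 * sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


# Toy alignment (hypothetical): one A/G transition, two A/T transversions.
spectrum = substitution_spectrum("ACGTACGT", "GCGTTCGA")
print(spectrum["transition"], spectrum["transversion"])

# Shares implied by the counts reported in the text (16,680 and 7,541 of 24,221).
print(round(100 * 16_680 / 24_221, 1), round(100 * 7_541 / 24_221, 1))
```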
In summary, this work reports the sequence of one-half of the chimpanzee
MHC, which to date represents the longest continuous sequence within this
species, our closest evolutionary relative. Comparative genomics with the
orthologous human MHC class I region unveils a wealth of information, the most
salient being the existence of a large number of indels that appear to be the
main driving force behind the observed differences between the two species.
Hence our perceived sequence divergence of only 1% between these two species
appears to be erroneous, because this work, along with another recently
published analysis, puts both species much further apart: >10% here and
approximately 5% in the other study (40), albeit the latter
compared shorter segments of both genomes. This relatively high and previously
unexpected degree of sequence divergence might have functional implications
not only within the coding sequences themselves but also within regulatory
elements (41,
42). Within the MHC per se,
the most notable effect of indels appears to be the generation of a single
chimeric Patr-MIC by fusion of MICA and MICB. This,
along with other indels as well as nucleotide substitutions [which could be
dubbed "transspecies single-nucleotide polymorphisms (SNPs)"],
might therefore directly contribute to the patent difference between these two
closely linked species with regard to susceptibility to a number of infectious
as well as autoimmune disorders, most of which are primarily linked to the
MHC. The study of these transspecies SNPs might further help to pinpoint the
most ancient and perhaps functionally relevant human SNPs among the increasing
numbers that are being continuously identified.
| Acknowledgements |
|---|
| Footnotes |
|---|
Abbreviation: BAC, bacterial artificial chromosome.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. AB100082-100087 and BA000041).

To whom correspondence should be addressed. E-mail:
hinoko{at}is.icc.u-tokai.ac.jp.