AGRICULTURAL SCIENCES
Sequence composition and genome organization of maize








*Plant Genome Initiative at Rutgers, Waksman Institute, Rutgers, The State University of New Jersey, Piscataway, NJ 08854;
Munich Information Center for Protein Sequences, Institute for Bioinformatics, GSF Research Center for Environment and Health, D-85764 Neuherberg, Germany; and
Arizona Genomics Institute and ¶Arizona Genomics Computational Laboratory, University of Arizona, Tucson, AZ 85721
Communicated by Brian A. Larkins, University of Arizona, Tucson, AZ, August 20, 2004 (received for review July 22, 2004)
| Abstract |
|---|
Maize has a genome roughly 80% the size of the human genome. To gain global insight into the organization of its genome, we have sequenced the ends of large insert clones, yielding a cumulative length of one-eighth of the genome with a DNA sequence read every 6.2 kb, thereby describing a large percentage of the genes and transposable elements of maize in an unbiased approach. Based on the cumulative 307 Mb of sequence, repeat sequences occupy 58% and genic regions occupy 7.5% of the genome. A conservative estimate predicts about 59,000 genes, which is higher than in any other organism sequenced so far. Because the sequences are derived from bacterial artificial chromosome clones, which are ordered in overlapping bins, tagged genes are also ordered along continuous chromosomal segments. Based on this positional information, roughly one-third of the genes appear to consist of tandemly arrayed gene families. Although the ancestor of maize arose by tetraploidization, fewer than half of the genes appear to be present in two orthologous copies, indicating that the maize genome has undergone significant gene loss since the duplication event.
maize genome | whole-genome sequence tags | map-based sequence | whole-genome duplication | gene families
A significant feature emerging from the genome sequences of Arabidopsis and rice is the large number of genes compared to mammalian genomes. Although the gene number in Arabidopsis is comparable to that in human and mouse, rice appears to have a much larger gene set, due largely to gene amplification (3). Does gene number in plants correlate at all with genome size (4)? The question arises particularly because many crop plants have undergone whole-genome duplication. Maize is an interesting example of such a duplication event. In contrast to wheat, maize has not maintained homoeologous chromosomes, but rather has undergone reassortment of the homoeologous regions acquired from the two progenitor genomes. These regions were first demonstrated cytologically between nonhomologous chromosomes (5) and later by genetic linkage mapping of nontandem gene duplicates (6). A comprehensive genetic analysis of homoeologous regions was performed with DNA markers (7) and by comparative mapping to close relatives of maize (8, 9).
Although genetic and cytogenetic analyses provided us with a global view of the organization of the maize genome, a more detailed analysis will come from its DNA sequence. In fact, the maize genome is most likely the next plant genome that will be sequenced after Arabidopsis and rice. However, because of its suspected greater sequence complexity than the human genome, maize is also thought to be the next technical challenge in genome sequencing. When large genomes are dissected into overlapping bacterial artificial chromosome (BAC) clones, these clones can be assembled into megabase (Mb)-sized chromosomal fragments through fingerprinting methods. After the fragments are assigned to their chromosomal locations, sequences from the BAC clones become positioned relative to the genetic map, thereby serving as anchors for other sequences.
Following this concept, we have generated about 475,000 maize BAC end sequences (BESs) with a cumulative length of 307 Mb, covering about one-eighth of the genome. BES reads averaged 647 bp with an average distribution of one end every 6.2 kb of the genome. Besides constituting a framework for sequencing the maize genome, the BESs provided us with a comprehensive, quantitative data set, which allowed us to assess maize transposable element (TE) and gene content. Moreover, by examining the physical linkage of BESs, we determined that a large proportion of the maize gene set consists of tandemly arrayed gene families, and that a heavy loss of unlinked duplicated genes must have occurred during the transition from a tetraploid to a diploid species.
| Materials and Methods |
|---|
BAC clones were retrieved from 384-well storage plates and grown in 96 deep-well plates to saturation. DNAs were extracted with Whatman Unifilters by using a Tomtec liquid handling system. DNA precipitates were resuspended in 40 µl of buffer; samples of 10 µl were used for the forward and reverse sequencing of the BAC ends with universal primers; another 10-µl aliquot was used for DNA fingerprinting. About 75% of the clones were assembled into a physical map of contiguous, overlapping clones (www.genome.arizona.edu/fpc/maize and http://pgir.rutgers.edu) by using the program FINGERPRINTED CONTIGS (FPC) (13). More than 50% (297,961 BACs) of the fingerprinted clones were sequenced from the ends, 176,643 from both ends and 121,318 from one end only, yielding a total of 474,604 BESs with an average read length of 647 bases at Q16. DNA sequences were processed by using LUCY software (14) and deposited in the genome survey sequence (GSS) section of the GenBank database.
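The clone and read counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the function and variable names are illustrative and not part of the original analysis.

```python
# Figures quoted in the text above.
BACS_BOTH_ENDS = 176_643   # clones sequenced from both ends
BACS_ONE_END = 121_318     # clones sequenced from one end only
MEAN_READ_BP = 647         # average read length at Q16


def bes_summary(both_ends, one_end, mean_read_bp):
    """Return (clones, reads, cumulative_mb) for a BAC end sequencing run."""
    clones = both_ends + one_end
    reads = 2 * both_ends + one_end          # two reads per paired clone
    cumulative_mb = reads * mean_read_bp / 1e6
    return clones, reads, cumulative_mb


clones, reads, cumulative_mb = bes_summary(
    BACS_BOTH_ENDS, BACS_ONE_END, MEAN_READ_BP
)
print(clones, reads, round(cumulative_mb))  # 297961 474604 307
```

The totals agree with the 297,961 BACs, 474,604 BESs, and 307 Mb reported in the text.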
Gene Prediction Parameters. Masked BESs (details in Results and Discussion) were analyzed for their coding potential by applying extrinsic (homology-based) and intrinsic (ab initio gene prediction) criteria and methods. Detection of genes by a single read (average, 647 bp) has limitations similar to those of EST-based approaches, but can be enhanced by using combinatorial evidence scores. Therefore, in our homology-based methods, gene content analysis of the BES data is predicated on the tentative consensus (TC) sequences of the SPUTNIK database (15), which contains structured ESTs from all major plant-derived EST collections (550,000 TCs; >2.3 × 10⁶ ESTs).
| Results and Discussion |
|---|
Construction of a Database of Maize Repeat Elements. Besides sufficient coverage of the genetic map, an essential first step in studying the content of the maize genome is a meaningful definition of repeat sequences. Accordingly, we first set out to determine a biologically relevant threshold of repeat identity for inclusion in our analyses. One challenge in distinguishing between coding and noncoding portions of the genome is separating protein-encoding TEs from gene families. Because maize repeats are typically longer than a single sequence read, the BES data set was not used for de novo repeat discovery. Instead, the repeat database was built from completely sequenced maize BAC clones and other related genomic sequences already deposited in the GenBank database, along with a survey of GenBank entries screened with WU-BLAST (17) Version 2.0 for typical repeat features such as polyproteins of retroelements, LTRs, and other sequence repeat motifs (Fig. 4, which is published as supporting information on the PNAS web site). After collapsing a total of 7,760 sequences (15.2 Mb), 74% (5,700 sequences) remained as nonredundant reference sequences.
To define a repeat identity threshold that would enable selective and sensitive repeat detection and classification, a BLAST-identity limit-for-repeat detection was determined for several data sets by plotting the degree of detection of repeat elements against the identity threshold applied (Fig. 5, which is published as supporting information on the PNAS web site). Four data sets of GSSs were analyzed. Because a large percentage of the maize genome consists of TEs, several attempts were made to use fractionation methods to increase the information content of genome sequence by constructing genomic subclone libraries that are depleted in TEs. Three such fractionation methods have been reported for the maize genome. The first (high C0t-derived, HC) is a library derived by reassociation kinetics of denatured genomic DNA of inbred B73 (16); the second (methyl-filtered, MF) is a library generated by in vivo filtration of methylated from nonmethylated DNA of inbred B73 (16, 18); and the third (RescueMu) is a library derived from junction sequences of genomic insertion sites of the maize transposable element Mutator from other inbred lines (19). Therefore, in addition to the BES collection, the analysis used these specialized whole-genome data sets that were designed to specifically reduce representation of repeated elements. Fig. 5, which is published as supporting information on the PNAS web site, shows a plot of the degree of sequence identity of hits versus the percentage of hits falling into each class. The BESs show a higher level of repeat content than the fractionated sequence representations (MF, HC, and RescueMu), but the overall shapes of the curves are similar. Below an identity limit of 55%, the curves reach a plateau. 
All four GSS collections show a steep drop in the percentage of repeat nucleotides above a 60% identity limit, with the curves from fractionated data becoming less steep above 65-70%, suggesting that a threshold in this range will be most suitable. Lower thresholds are likely to lead to nonspecific masking or overmasking of GSS data. Based on these observations, an identity threshold of 70% for repeat masking was applied to all data sets.
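The threshold scan described above can be sketched as a simple sweep: for each identity cutoff, sum the nucleotides covered by repeat hits at or above that cutoff. This is a minimal illustration, not the pipeline used in the study. The hit list is hypothetical toy data, the function names are invented for the example, and hits are assumed not to overlap, which a real masking step would have to enforce.

```python
def masked_fraction(hits, total_bp, threshold):
    """Fraction of nucleotides covered by repeat hits with identity >= threshold.

    hits: iterable of (percent_identity, aligned_length_bp) pairs.
    Assumes hits do not overlap (a simplification for this sketch).
    """
    covered = sum(length for identity, length in hits if identity >= threshold)
    return covered / total_bp


def sweep(hits, total_bp, thresholds):
    """Evaluate the masked fraction at each identity threshold."""
    return [(t, masked_fraction(hits, total_bp, t)) for t in thresholds]


# Hypothetical toy data: (percent identity, aligned length in bp).
toy_hits = [(95.0, 300), (82.0, 200), (71.0, 150), (63.0, 100), (56.0, 50)]
TOTAL_BP = 1000

for t, frac in sweep(toy_hits, TOTAL_BP, range(50, 100, 10)):
    print(f"identity >= {t}%: {frac:.0%} of nucleotides masked")
```

Plotting the masked fraction against the cutoff for each data set would give curves of the kind described for Fig. 5, and the region where the curves flatten suggests a workable masking threshold.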
The TE/Repeat Elements of Maize. The BESs provided us with the most comprehensive data set of the repeat content of the maize genome to date. Previous reports projected the amount of repetitive DNA in maize to be in the order of 60-80% (20). Applying conservative parameters, the BES data set arrived at the lower end of previous calculations. As derived from the BES collection, the number of repeat elements present in maize may be close to 58% in terms of number of nucleotides, which might be a slight under-representation in comparison to the 63% detected within a random sheared library (16) (Table 1); this might be attributable to a suppression of centromeric sequences within the BES collection. Not unexpectedly, the largest class of repeat elements is TEs, which were first discovered in maize (21) and have been studied genetically for many years. In recent years, as more maize genomic sequences have become available, they have also been studied extensively at the molecular level (22). Therefore, a sequence similarity-based classification scheme can be used to determine the copy number of different TE families (Table 2). The class I elements (retroelements) dominate over the class II elements (DNA transposons) by a huge margin of 56% to 1% of the genome sequences, respectively. In general, most plant genomes are rich in LTR-retrotransposons and miniature inverted-repeat TEs (22). On the other hand, the number of non-LTR retroelements, like short and long interspersed nuclear elements, is very small (<0.2%), in contrast to the human genome with >25% (23).
Features of TE. Different TEs have differential insertion specificities that are characteristic of each class. For example, in plants, the Ty1/copia elements were first identified as insertions near maize genes, whereas the highly repetitive Ty3/gypsy elements have a preference to insert into or near other repetitive elements (25). Moreover, in maize and other plant species, the class II TEs such as Ac/Ds, En/Spm, Mu and miniature inverted-repeat TEs are known to insert preferentially into genes and low-copy-number DNA, which are relatively hypomethylated. This finding explains the presence of class II elements in all three fractionated libraries (Table 1).
It is becoming clear from the analysis of many genomes that TEs are a ubiquitous feature in the organization of chromosomes. Surprisingly, even compact genomes, like those of pufferfish (0.4 Gb), were recently found to exhibit a richer diversity of retrotransposons than the human and mouse genomes, which are roughly 7 times larger (26). A very wide range of variation is observed in the structure of retroelements, ranging from nucleotide substitutions and small insertions/deletions to large rearrangements (27). Retroelements are therefore thought to evolve faster than the nonrepetitive portion of any genome (22). Because TEs comprise the single most abundant component (40-80%) of many large genomes, TE mining is an essential part of genome research that will enable us to determine their key role in genome evolution (28).
How Well Are TEs Fractionated from the Rest of the Genome? Comparison of the BESs to data sets of fractionated genome sequences revealed that the three fractionated libraries have reduced repeat sequence representation to a limited extent (3- to 6-fold, Table 1). To assess the effectiveness with which repeat elements were reduced in the fractionated data sets, the comparison was extended to the different classes of repeat elements. We observed biases in which types of repeat sequences were represented in each library type (Fig. 6, which is published as supporting information on the PNAS web site). In the MF and HC data sets, the dominant class of repeat elements is LTR retrotransposons. The RescueMu data set contains similar numbers of LTR retrotransposons and DNA transposons, indicating that Mu insertions occur in both classes of TEs, although at a low frequency. The dominance of class I TEs in the MF and HC data sets might reflect that LTR retrotransposons are frequently nested and/or fragmented and, therefore, recalcitrant to filtration by methylation or reassociation kinetics. Interestingly, the MF fraction has a significant proportion of centromere-specific repeats, indicating the possible importance of nonmethylation of chromatin structure in centromere function. This finding would also be consistent with the transcription of centromere-specific retrotransposons, as has been suggested for rice centromeres (29). Thus, different fractionation techniques tag interesting functional aspects of genomic sequences, but are unlikely to provide a simple separation of genomic DNA into genes and TEs.
Significant Levels of Transcripts from Repeat Elements. To identify the transcribed portion of the genome, we first compared masked and unmasked BESs against large EST/TC data sets from different plant species. Such a comparison allowed us to determine the degree of sequence similarity between species and to detect contamination by sequences derived from nonmaize species. In addition, it gave us an estimate of the representation of repeat elements within EST databases. As a reference set for the BESs, we again used the three genomic libraries (HC, MF, and RescueMu) that have reduced repeat element representation. All data sets (both nonmasked and repeat masked) were screened in all six translated frames (TBLASTX with an E-value threshold of 10⁻³⁵) against a collection of EST clusters and TCs from maize and six other members (sugarcane, rice, wheat, barley, Sorghum, and Secale) of the Gramineae family, along with four legumes (Leguminosae), Medicago, Phaseolus, Glycine, and Lotus, and also from Arabidopsis (Brassicaceae). The species specificities and the phylogenetic distances correlated, consistent with the lack of nonmaize sequences within the BES collection (Fig. 1). The difference in the hit percentages between nonmasked and repeat-masked sequences indicates that EST databases contain sequences from retroelements. In general, the differences in BES hit rates were caused by only a small portion of the respective ESTs/TCs. Even at high stringency (10⁻⁵⁰), 1.5% of the TCs have a very high number of nonspecific hits. Because nonmasked BESs contain a large proportion of retrotransposon-related genes, the small proportion of TE-related ESTs becomes heavily emphasized. There is also an increase in the hit rate of nonmasked as compared to repeat-masked sequences in other plant species (sugarcane, rice, wheat, barley, Sorghum, and Glycine), indicating a conservation of transcribed TEs between species (30).
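The comparison of masked and unmasked hit rates can be illustrated with a small sketch. It assumes that the best (smallest) E-value per read has already been extracted from the search output and that a hit counts when its E-value is at or below the cutoff; the read identifiers, the E-values, and the function name are hypothetical.

```python
def hit_rate(best_evalues, n_reads, cutoff=1e-35):
    """Share of reads whose best hit has an E-value at or below the cutoff.

    best_evalues: dict mapping read id -> smallest E-value among its hits.
    n_reads: total number of reads searched, including reads with no hit.
    """
    passing = sum(1 for evalue in best_evalues.values() if evalue <= cutoff)
    return passing / n_reads


# Hypothetical toy data: the same five reads searched before and after masking.
N_READS = 5
unmasked = {"r1": 1e-60, "r2": 1e-40, "r3": 1e-10, "r4": 1e-52}
masked = {"r1": 1e-60, "r3": 1e-10}

print(hit_rate(unmasked, N_READS), hit_rate(masked, N_READS))  # 0.6 0.2
```

In this toy case the drop from 0.6 to 0.2 after masking would be attributed to reads that matched transcribed repeat elements rather than genes.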
3 kb (31). This result suggests the presence of about 59,000 genes in maize, significantly higher than the 45,000 estimated for rice (3), but with a lower average gene density of 1 per 40 kb. If the same analysis is applied to the fractionated GSS collection, gene coverage increases only by a factor of 2 (HC), 2.5 (RescueMu), and 3 (MF) (Fig. 7, which is published as supporting information on the PNAS web site).
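The figures in this paragraph are mutually consistent, as a short calculation shows. The sketch rests on two assumptions that the surviving text does not state outright: that the 3 kb value is an average gene length, and that the genome size can be taken as gene number times the average spacing of 40 kb. The 7.5% genic fraction is the value reported for the BES data set.

```python
GENES = 59_000            # estimated gene number from the text
KB_PER_GENE = 40          # average gene density: 1 gene per 40 kb
GENIC_FRACTION = 0.075    # genic regions as a share of the sequence
AVG_GENE_BP = 3_000       # assumption: the 3 kb figure is an average gene length

# Genome size implied by gene number and spacing.
genome_bp = GENES * KB_PER_GENE * 1_000

# Gene number implied by genic fraction and average gene length.
implied_genes = GENIC_FRACTION * genome_bp / AVG_GENE_BP

print(genome_bp / 1e6)        # implied genome size in Mb
print(round(implied_genes))   # implied gene number
```

Under these assumptions the implied genome size is 2,360 Mb and the implied gene number comes back to 59,000, so the density, genic fraction, and gene count describe the same estimate from different angles.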
Because most BESs cover only a small portion of a gene, we enhanced the functional analysis of these sequences by forming TCs from BES-tagged known transcripts (Materials and Methods) by applying a stringent E-value threshold of 10⁻²⁰. In this way we could overcome the restriction of comparably short BESs with the longer TC-derived peptide sequences and arrive at a better representation of BESs assigned to functional categories (Table 3). About 9.1% have a match to plant-derived ESTs (Sputnik/TC, E-value threshold of 10⁻³⁵). Among the TC hits, 7,628 were derived from maize, representing 22% of the total unique ESTs known from maize.
Genome Topology. What percentage of genes is tandemly arrayed, and how many genes were derived from each of the two progenitors of maize (unlinked duplicate genes)? The latter question helps us assess how much of the whole-genome duplication is still recognizable at the gene level. For both questions, however, we needed to determine which BESs are physically linked. This became possible because the same clones have also been fingerprinted and assembled into contigs by using FPC (13). TCs based on BESs, as described above, with overlapping match characteristics were grouped into bins of homologous TCs or gene family signatures (GFSs). Because GFSs can be linked to fingerprinted contigs via individual BES names, the relative positions of GFSs can be determined and regional correlations can be analyzed. Therefore, the combination of the FPC information of BAC clones and their sequence information enabled us to address questions on the extent of local genic duplications (e.g., tandem arrays) and of global duplications.
The TCs were grouped into 9,129 distinct GFSs by using the BES-directed clustering strategy. After filtering for highly abundant GFSs, 9,038 were used for the subsequent analysis. By using this approach, we detected 3,064 individual tandem arrays. As expected from the distribution of BESs along the chromosome, the sizes ranged from 2 to 15 members, with arrays of 2 members being the most frequent (73%) (Fig. 2). Of the total of 21,098 BESs associated with GFSs, 7,427 fell into this category, which results in an estimate of one-third (35%) of the genes being organized in tandem arrays in maize. Although this number represents a lower limit for maize, it already exceeds the degree of tandem gene duplication found in rice (25%) (31) and certainly in Arabidopsis (17%) (1).
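The grouping logic behind the tandem-array count can be sketched as follows. This is a simplified illustration and not the clustering procedure used in the study: it treats any two BESs with the same GFS on the same fingerprinted contig as members of one array, whereas the actual criteria for physical linkage may have been stricter. All identifiers in the toy data are hypothetical.

```python
from collections import defaultdict


def tandem_arrays(tags):
    """Group BES tags by (contig, GFS) and keep groups with two or more members.

    tags: iterable of (bes_id, contig_id, gfs_id) triples.
    """
    groups = defaultdict(list)
    for bes_id, contig_id, gfs_id in tags:
        groups[(contig_id, gfs_id)].append(bes_id)
    return {key: members for key, members in groups.items() if len(members) >= 2}


# Hypothetical toy data.
toy_tags = [
    ("bes1", "ctg1", "gfsA"),
    ("bes2", "ctg1", "gfsA"),
    ("bes3", "ctg1", "gfsB"),
    ("bes4", "ctg2", "gfsA"),
    ("bes5", "ctg2", "gfsC"),
    ("bes6", "ctg2", "gfsC"),
    ("bes7", "ctg2", "gfsC"),
]

arrays = tandem_arrays(toy_tags)
in_arrays = sum(len(members) for members in arrays.values())
print(len(arrays), in_arrays, f"{in_arrays / len(toy_tags):.0%}")  # 2 5 71%

# The proportion reported in the text, from the published counts.
print(f"{7_427 / 21_098:.0%}")  # 35%
```

The last line reproduces the 35% figure from the counts of 7,427 array members among 21,098 GFS-associated BESs.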
10⁻²⁰), we examined FPC-derived BAC contigs for BESs anchored within the respective contigs and associated them with GFSs. Of 1,802 fingerprinted contigs tested, 1,078 (60%) had at least two GFSs located on two corresponding contigs and 513 (28%) had at least three corresponding GFSs. Moreover, 34% (7,175 of 21,098) of individual GFSs anchored to BESs fell into this category, implying exhaustive molecular traces of the ancient tetraploidization of the maize genome. Nevertheless, considering that we tagged about 70% of the genes, this finding represents a very conservative estimate, and the degree of retained duplicates might be markedly higher.
| Conclusions |
|---|
86%. This heavy loss of duplicate genes would be consistent with the change from a tetravalent genome of the progenitors of maize to today's diploid genome, which could be referred to as the diploidization process. Interestingly, a similar process seems to have occurred in yeast, although to a more extreme level with nearly 90% loss of duplicated genes (34). However, in contrast to yeast, the remaining gene number has increased dramatically because of tandemly amplified gene families. One explanation could come from the phylogenetic analysis of the 41-member zein gene family, which indicated that tandem gene amplification occurred within the last 4.5 million years (35), whereas the two progenitors of maize arose about 11.9 million years ago (mya) and hybridized to form maize between 11.9 and 4.8 mya (36). Although we knew that the maize genome is rich in LTR-type retrotransposons, their density must be quite variable and they may account for only slightly more than half of the genome size. The predicted gene sequences make up only 7.5% of the genome, and all repeat elements make up 58% of the genome, but what is located in the remaining 34.5%? Interestingly, two recent examples have shown that unique sequences potentially contain important regulatory features and can be separated from the coding regions by the insertion of retroelements in the range of 100 kb (37, 38). Therefore, the space between the known repeat elements and the identifiable coding region models will require more functional analysis.
| Acknowledgements |
|---|
| Footnotes |
|---|
Freely available online through the PNAS open access option.
Abbreviations: Gb, gigabase(s); Mb, megabase(s); BAC, bacterial artificial chromosome; BES, BAC end sequence; TE, transposable element; GSS, genome survey sequence; TC, tentative consensus; HC, high C0t-derived; MF, methyl-filtered; GFS, gene family signature.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. can be found in Table 4, which is published as supporting information on the PNAS web site).
To whom correspondence should be addressed. E-mail: messing{at}mbcl.rutgers.edu.
© 2004 by The National Academy of Sciences of the USA
| References |
|---|