PLANT BIOLOGY
Use of DNA barcodes to identify flowering plants




*Department of Botany and
Laboratories of Analytical Biology, National Museum of Natural History, Smithsonian Institution, P.O. Box 37012, Washington, DC 20013-7012; and
Department of Biology, University of Pennsylvania, Philadelphia, PA 19104
Contributed by Daniel H. Janzen, April 15, 2005
| Abstract |
|---|
Methods for identifying species by using short orthologous DNA sequences, known as "DNA barcodes," have been proposed and initiated to facilitate biodiversity studies, identify juveniles, associate sexes, and enhance forensic analyses. The cytochrome c oxidase 1 sequence, which has been found to be widely applicable in animal barcoding, is not appropriate for most species of plants because of a much slower rate of cytochrome c oxidase 1 gene evolution in higher plants than in animals. We therefore propose the nuclear internal transcribed spacer region and the plastid trnH-psbA intergenic spacer as potentially usable DNA regions for applying barcoding to flowering plants. The internal transcribed spacer is the most commonly sequenced locus used in plant phylogenetic investigations at the species level and shows high levels of interspecific divergence. The trnH-psbA spacer, although short (approximately 450 bp), is the most variable plastid region in angiosperms and is easily amplified across a broad range of land plants. Comparison of the total plastid genomes of tobacco and deadly nightshade, enhanced with trials on widely divergent angiosperm taxa, including closely related species in seven plant families and a group of species sampled from a local flora encompassing 50 plant families (for a total of 99 species, 80 genera, and 53 families), suggests that the sequences in this pair of loci have the potential to discriminate among the largest number of plant species for barcoding purposes.
angiosperm | internal transcribed spacer | Plummers Island | species identification | trnH-psbA
DNA barcoding follows the same principle as does the basic taxonomic practice of associating a name with a specific reference collection in conjunction with a functional understanding of species concepts (i.e., interpreting discontinuities in interspecific variation). Presently, some controversy exists over the value of DNA barcoding (7), largely because of the perception that this new identification method would diminish rather than enhance traditional morphology-based taxonomy, that species determinations based solely on the amount of genetic divergence could result in incorrect species recognition, and that DNA barcoding is a means to reconstruct phylogenies when it is actually a tool to be used largely for identification purposes (8-10). In support of barcoding as a species identification process, Besansky et al. (11), Janzen (12, 13), Hebert et al. (1-4), and Kress (14) have offered arguments for the utility of DNA barcoding as a powerful framework for identifying specimens. Our objective in this paper is not to debate the validity of using barcodes for plant identification, but rather to determine appropriate DNA regions for use in flowering plants.
A portion of the mitochondrial CO1 gene was deliberately chosen for use in animal identification when DNA barcoding was proposed (1), and its broad utility in animal systems has been demonstrated in subsequent pilot studies (1-5). The taxonomic limits of CO1 barcoding in animals are not fully known, but it has proven useful to discriminate among species in most groups tested (2). The choice of a DNA region usable for barcoding has been little investigated in other eukaryotes, whereas in prokaryotes, rRNA genes are favored for identifications (e.g., ref. 15). Among plants, especially angiosperms, DNA-based identifications, although not strictly through the use of DNA barcodes, have been creatively used to reconstruct extinct herbivore diets (16, 17), to identify species of wood (18), to correlate roots growing in Texas caves with the surface flora (19), and to determine species used in herbal supplements (20). However, some of these identifications have not been entirely successful at the species level, and DNA barcoding per se has not yet been applied to plants. The primary reason that barcoding has not been applied to plants by the emerging initiative is that plant mitochondrial genes, because of their low rate of sequence change, are poor candidates for species-level discrimination. The divergence of CO1 coding regions among families of flowering plants has been documented to be only a few base pairs across 1.4 kb of sequence (21, 22). Furthermore, plants rapidly change their mitochondrial genome structure (23), thereby precluding the existence of universal intergenic spacers that otherwise would be appropriately variable unique identifiers at the species level (e.g., ref. 24).
For plant molecular systematic investigations at the species level, the internal transcribed spacer (ITS) region of the nuclear ribosomal cistron (18S-5.8S-26S) is the most commonly sequenced locus (25). This region has shown broad utility across photosynthetic eukaryotes (with the exception of ferns) and fungi and has been suggested as a possible plant barcode locus (26). Species-level discrimination and technical ease have been validated in most phylogenetic studies that employ ITS, and a large body of sequence data already exists for this region (>36,000 angiosperm sequences were available in GenBank in December 2004, although these sequences have not been filtered for taxa, so it is not certain how many species are represented). However, the limitations of this nuclear region in some taxa are well established. ITS has reduced species-level variability in certain groups (especially recently diverged taxa on islands), divergent paralogues that require cloning of multiple copies, and secondary structure problems resulting in poor-quality sequence data (25, 27). In some cases, the preferential amplification of endophytic or contaminating fungi may occur, although this can be eliminated with plant-specific primer design (28, 29).
An advantage of the ITS region is that it can be amplified in two smaller fragments (ITS1 and ITS2) adjoining the 5.8S locus, which has proven especially useful for degraded samples. The quite conserved 5.8S region in fact contains enough phylogenetic signal for discrimination at the level of orders and phyla (29), although identification at this taxonomic level is not the concern of barcoding. Alignments are trivial to optimize for 5.8S due to the few indels found in plants and fungi (30). In contrast, for phylogenetic reconstruction, ITS or any rapidly evolving noncoding region can require complex sequence alignment for homology assessments. Thus, the 5.8S locus can serve as a critical alignment-free anchor point for search algorithms that make sequence comparisons for both phylogenetic and barcoding purposes. The utility of conserved regions such as 5.8S to generate a pool of nearest neighbors for refined comparisons will be critical for effective database searches, especially when comparing a sequence that has no identical match in a sequence library. BLAST searches of GenBank with our ITS data (see below) returned the correct matches. This success suggests that despite alignment concerns, current search algorithms will be fast and effective at using ITS for species-level identifications, given an adequate database for comparison. For all of these reasons, ITS, even with its recognized limitations, is a prime candidate as an effective locus for DNA barcoding in plants.
However, the recognition that ITS has certain functional limitations for DNA barcoding of plants is a compelling argument that a search for additional loci is warranted. For phylogenetic investigations, the plastid genome has been more readily exploited than the nuclear genome and may offer for plant barcoding what the mitochondrial genome does for animals. It is a uniparentally inherited, nonrecombining, and, in general, structurally stable genome. Universal primers are available for a number of loci and intergenic spacers that are evolving at a variety of rates. The plastid locus most commonly sequenced by plant systematists for phylogenetic purposes is rbcL, followed by the trnL-F intergenic spacer, matK, ndhF, and atpB (e.g., refs. 31-33). rbcL has been suggested as a candidate for plant barcoding (34), even though it has generally been used to determine evolutionary relationships at the generic level and above. Besides rbcL and atpB, all of the latter plastid loci have been used at the species level with various degrees of success. Most of them (except the trnL-F spacer) require full-length sequences of >1 kb to yield enough sequence length to discriminate species. Most relevant to plant barcoding, no region of the plastid genome has been found to have the high level of variation seen in most animal CO1 barcodes, although a few intergenic spacers have shown more promise than any plastid locus now in general use (33).
When evaluating other genetic loci appropriate for plant DNA barcoding, three criteria must be satisfied: (i) significant species-level genetic variability and divergence, (ii) an appropriately short sequence length so as to facilitate DNA extraction and amplification, and (iii) the presence of conserved flanking sites for developing universal primers. With regard to sequence length, we note that in CO1 barcoding systems, the 600- to 700-bp length fortuitously matches high-quality sequence data from average capillary sequencer reads, although it is expected that routine read length will improve with new technology. An important rationale for using short sequences also resides in the need to obtain useful data from potentially degraded samples found in museum specimens. Amplicon size and gene copy number have been shown to account for much of the variability of amplification success: smaller sizes and increased copy number promote greater success with PCR, presumably by increasing the likelihood that a desired sequence has been preserved (18).
| Materials and Methods |
|---|
2% (Table 1) were categorized as the most variable segments, and therefore the most promising of the plastid genome for DNA barcoding when normalized for length. The nuclear ITS region and plastid rbcL gene were used as baseline comparisons for these chloroplast test regions (Table 1). To further narrow down the number of remaining regions usable for barcoding purposes, we applied a sequence criterion of 300-800 bp and a stable presence across multiple plastid genomes of both monocots and dicots.
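The length-and-presence screen described above can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the `Region` record, the helper `passes_screen`, and the placeholder regions and lengths are hypothetical. Only the 300-800 bp window and the requirement that a region be present in both monocot and dicot plastid genomes come from the text.

```python
from dataclasses import dataclass

# Length window stated in the text (inclusive bounds are an assumption).
MIN_BP, MAX_BP = 300, 800


@dataclass(frozen=True)
class Region:
    """Hypothetical record for one candidate plastid region."""
    name: str
    length_bp: int      # placeholder length, not a value from Table 1
    in_monocots: bool   # region found in the monocot genomes examined
    in_dicots: bool     # region found in the dicot genomes examined


def passes_screen(region: Region) -> bool:
    """True if the region meets the length window and is present in both groups."""
    within_length = MIN_BP <= region.length_bp <= MAX_BP
    return within_length and region.in_monocots and region.in_dicots


# Placeholder candidates, for illustration only.
candidates = [
    Region("region_A", 450, True, True),    # passes both tests
    Region("region_B", 1200, True, True),   # too long
    Region("region_C", 520, False, True),   # absent from monocots
]

kept = [r.name for r in candidates if passes_screen(r)]
print(kept)
```

Keeping the screen as a single predicate makes it easy to tighten or relax the window later, for example if routine read lengths improve.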
| Results |
|---|
The results of our intrageneric tests across eight genera in the first taxon set demonstrated conspicuous differences between the nine plastid regions with respect to our three barcoding criteria: amplification success, sequence length, and sequence divergence. Only three regions (trnH-psbA, rpl36-rpf8, and trnL-F) were successfully amplified for all eight genera and 19 species; the other regions, including ITS, could not be amplified in one or more taxa (Table 2). Sequence length in the nine plastid regions ranged from 204 to 1,240 bp, with mean length in all but two (ycf6-psbM and psbM-trnD) falling within our 300- to 800-bp optimum length criterion (Table 2). ITS had the highest between-species sequence divergence values in four of the five genera successfully amplified (Table 2), with a mean sequence divergence of 2.81% across the five genera. trnH-psbA ranked first in divergence value in six of the eight genera and in 11 of the 14 species pairs, compared with the other eight plastid regions; trnV-atpE and trnC-ycf6 ranked highest for the remaining two genera and three species pairs (Table 2). trnH-psbA ranked highest (1.24%) in mean percent sequence divergence across all genera, whereas trnV-atpE (0.29%) and ycf6-psbM (0.30%) ranked lowest (Table 2).
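For readers who want to reproduce a percent sequence divergence of the kind reported above, a minimal Python sketch follows. It assumes an uncorrected pairwise distance (the proportion of differing sites among aligned, unambiguous positions); this excerpt does not state which distance measure was actually used, so the function `percent_divergence` and the toy sequences are illustrative assumptions rather than the authors' method.

```python
def percent_divergence(seq_a: str, seq_b: str) -> float:
    """Uncorrected pairwise divergence (%) between two aligned sequences.

    Columns where either sequence has a gap or an ambiguous base are
    skipped, so indels do not contribute to this measure.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code: not a comparable site
        compared += 1
        if a != b:
            differences += 1

    if compared == 0:
        raise ValueError("no comparable sites")
    return 100.0 * differences / compared


# Toy aligned sequences (hypothetical, 10 columns, one substitution).
species_1 = "ACGTACGTAC"
species_2 = "ACGTTCGTAC"
print(percent_divergence(species_1, species_2))
```

A per-genus value can then be averaged over all species pairs in the genus, and a per-region mean taken across genera, which is one plausible way to arrive at summary figures like those in Table 2.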
In our broader taxonomic sampling of the Plummers Island flora in which only herbarium material was used, none of the loci could be successfully amplified for all of the 83 species tested, which we suggest may be related to primer design or to more fundamental changes in gene structure during herbarium specimen preparation and storage (see ref. 33). Amplification success was highest for trnH-psbA (100%), followed by rbcL (5' half; 95%), and ITS (88%, although high-quality sequence data were not obtained from all ITS amplifications). We could not detect any general correlation between specimen age and amplification success, indicating that herbarium specimens in apparently good condition and as old as 20 years can be successfully used to establish DNA-sequence reference libraries. Moreover, amplification of full-length ITS was possible (results not shown) for the five specimens of Erysimum cheiranthoides collected between 1897 and 1997 (Fig. 2), indicating that significantly older specimens also may be used.
Because of the high sequence divergence value in the majority of genera in our taxon set one and the high amplification success of the trnH-psbA spacer in all of our test samples, this region became the focus of our examination of the plastid genome for further analyses of barcoding potential. The trnH-psbA amplicon ranged from 247 to 1,221 bp, whereas the intergenic spacer alone (excluding primer-binding regions and small regions of flanking exon) ranged from 119 to 1,094 bp across 53 families of flowering plants, including both the Plummers Island species and the taxonomic groups (extremes were Thalictrum and Trillium, respectively; see Table 2 and Table 5, which is published as supporting information on the PNAS web site). Most taxa (92%) had amplicons falling between 340 and 660 bp, which is within our suggested length criterion for successful barcoding. All species in our sampling had unique trnH-psbA spacer sequences, which bears directly on the suitability of this region for barcoding plants.
| Discussion |
|---|
We suggest that the trnH-psbA intergenic spacer is the best plastid option for a DNA barcode sequence that has good priming sites, length, and interspecific variation. In our trials across a diverse set of genera in seven plant families, three plastid regions (trnH-psbA, rpl36-rpf8, and trnL-F) ranked highest with respect to amplification success and appropriate sequence length, but trnH-psbA demonstrated nearly 3 times the percentage sequence divergence of these other two regions (1.24% in trnH-psbA vs. 0.44% in both rpl36-rpf8 and trnL-F; Table 2). The two spacers with the next highest mean sequence divergence after trnH-psbA (atpB-rbcL at 0.63% and trnC-ycf6 at 0.55%) could not be amplified in one or more of the test genera. In only one genus (Solidago; Asteraceae) did exceptionally low sequence divergence in trnH-psbA prevent discrimination among the three species tested, although insertion/deletion differences still allowed us to distinguish among the species. This lack of sequence divergence between taxa was true for one or more species pairs in ITS and all other plastid spacers, except atpB-rbcL, in our test sample. In only 2% of our samples did homopolymer regions adversely affect sequence quality in trnH-psbA.
For a number of reasons, we refrained from a statistical test of differences among mean sequence divergences of the nine spacer regions. First, the sample size in our survey was too restricted to provide a meaningful statistical test (although the standard error of the mean of trnH-psbA does not directly overlap with the means of any of the other spacers). More importantly, as pointed out by Shaw et al. (33), genera within and between families of plants are phylogenetically nonequivalent, i.e., lineages recognized as genera may have quite different divergence rates depending on the various life history traits of the included species. Therefore, statistical comparisons between genera with respect to genetic distance are not valid or warranted at this time. Our intent in calculating these mean percent divergences across loci is to provide a qualitative evaluation of each spacer region for barcoding purposes. In this respect, we consider the high divergence value of trnH-psbA, which permits species discrimination in the largest number of taxa we tested (six of the eight genera and 11 of the 14 species pairs), as strong support for its use as a plant barcode.
The universality of trnH-psbA for differentiating among all flowering plant species clearly needs further investigation (see below), especially in taxa with extremely short spacers that may not contain enough sequence variation for species-level discrimination (e.g., Thalictrum and Solidago in our study and Minuartia in ref. 33). This spacer region also is present in other nonflowering land plants. In a search of GenBank, we found that the trnH-psbA spacer has been successfully amplified in angiosperms, gymnosperms, ferns, mosses, and liverworts, although we do not know at this time the degree of between-species divergence. Further study is needed to determine whether this plastid region is as variable in the nonflowering plants as we have shown for our test angiosperms, and therefore whether it is of broad utility as a barcode across the total spectrum of land plants.
Our findings on the properties of trnH-psbA agree with Shaw et al. (33) in their extensive survey of noncoding plastid DNA for phylogenetic purposes. By applying our barcode criteria (i.e., length considerations and universality) to the framework of their study, we conclude that trnH-psbA has greater potential for species-level discrimination than any other locus they analyzed. Similar to our results, they demonstrated that trnH-psbA amplified and sequenced easily with an average length of 465 bp across the 30 taxa they surveyed. Although this region was the second most variable of the 21 spacers they tested in terms of potentially informative characters, they ranked its utility for phylogenetic purposes as low (tier 3) because of its short length. Our analysis of the number of nucleotide substitutions within genera across all taxa in the 21 plastid regions presented by Shaw et al. (33) indicates that the trnH-psbA spacer has the highest percentage nucleotide difference (0.0135 difference per base pair), even though at least 8 of the 21 other regions showed a greater total number of nucleotide substitutions because of their longer length. The interspecific nucleotide differences in trnH-psbA ranged from 18% to 105% higher than that of the other eight most variable plastid regions. Because short sequence length is an important criterion for barcoding, the high frequency of nucleotide differences of trnH-psbA, in combination with its relatively short length, is a significant advantage. Other studies also have shown a high percentage of interspecific divergence for trnH-psbA, and in most cases, the highest in all plastid regions tested (e.g., refs. 44-48).
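The distinction drawn above between the total number of substitutions and the number of differences per base pair is simple arithmetic, sketched here in Python. The counts and lengths are hypothetical placeholders, not values from Shaw et al. (33); the helper names `per_site_rate` and `percent_higher` are likewise illustrative.

```python
def per_site_rate(substitutions: int, length_bp: int) -> float:
    """Nucleotide differences per base pair for one region."""
    if length_bp <= 0:
        raise ValueError("length must be positive")
    return substitutions / length_bp


def percent_higher(rate: float, other: float) -> float:
    """How much higher `rate` is than `other`, as a percentage of `other`."""
    if other <= 0:
        raise ValueError("reference rate must be positive")
    return 100.0 * (rate - other) / other


# Hypothetical example: a short spacer with fewer total substitutions
# than a longer region, but more differences per base pair.
short_spacer = per_site_rate(substitutions=6, length_bp=450)
long_region = per_site_rate(substitutions=9, length_bp=1200)

print(f"short spacer: {short_spacer:.4f} differences per bp")
print(f"long region:  {long_region:.4f} differences per bp")
print(f"short spacer is {percent_higher(short_spacer, long_region):.0f}% higher")
```

Normalizing by length is what lets a short region outrank longer ones for barcoding even when the longer regions accumulate more substitutions in total.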
Despite this high level of interspecific variation, trnH-psbA has found only limited use in species-level phylogenetic reconstruction because of the short length as well as the difficulty of alignments resulting from a high number of indels (e.g., refs. 49-51). In contrast with the problems of indels for phylogenetic construction, we suspect that indels will ultimately enhance the information needed for species identifications, once the appropriate informatics tools for barcoding are developed. In the set of species we sampled, sequences were alignable within genera, but problematic above that rank. In the one case (Solidago) where sequence divergence was not sufficient to separate species, the presence of unique indels allowed easy discrimination among the taxa. Blaxter (34) advocates ease of alignment as a criterion when evaluating the utility of barcode loci. We do not consider difficulty of alignment to be a major obstacle to the applicability of either ITS or trnH-psbA for the primary purpose of DNA barcoding, i.e., identification. Although ease of alignment is desirable, it is not necessary for barcoding. Searches in GenBank by using our data from both loci with a BLAST search returned correct identities at both the gene and species level. BLAST searches are anchored and canalized by conserved regions in both loci, 5.8S in ITS and the small region of flanking exon for trnH-psbA. Intraspecific variation in both ITS and trnH-psbA is known to be relatively low, compared with interspecific variation (27, 52), although in the present study, our intraspecific sampling was insufficient to address this issue.
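To illustrate how indels could serve as identifying characters when substitutions are absent, as in the Solidago case, the following Python sketch counts indel events in a pairwise alignment. It is a hypothetical example, not a tool used in this study: the function name `count_indel_events`, the rule that a contiguous run of gaps in one sequence counts as a single event, and the toy sequences are all assumptions.

```python
def count_indel_events(aln_a: str, aln_b: str) -> int:
    """Count indel events between two aligned sequences.

    A contiguous run of columns with a gap ('-') in exactly one of the
    two sequences is treated as one event. Columns gapped in both
    sequences carry no information for this pair and end any open run.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("sequences must be aligned to the same length")

    events = 0
    previous = None  # which sequence was gapped in the previous column
    for a, b in zip(aln_a, aln_b):
        if a == "-" and b != "-":
            state = "a"
        elif b == "-" and a != "-":
            state = "b"
        else:
            state = None
        if state is not None and state != previous:
            events += 1  # a new gap run starts here
        previous = state
    return events


# Toy alignment: identical bases at every shared site, one 3-bp indel.
taxon_1 = "ACGT---ACGT"
taxon_2 = "ACGTTTTACGT"
print(count_indel_events(taxon_1, taxon_2))
```

In this toy pair the substitution-based divergence is zero, yet the single indel event still separates the two sequences, which is the kind of signal an identification tool could exploit.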
The extraction of DNA from specimens in herbarium collections was highly successful. This success may be due to the specimens having been air-dried and in a good state of preservation as evidenced by the generally green appearance of the leaves selected for extraction (Fig. 2). Plant voucher specimens vary in how and when they are dried after being pressed. If specimen-drying facilities are not immediately available, especially in humid tropical climates, botanists often treat pressed specimens with ethanol to temporarily preserve them against fungal attack and degradation. Alcohol has been shown to be detrimental to recovering high-quality DNA (53), although how it will affect the short sequences needed for barcoding is unknown. We are encouraged by the fact that museum specimens of insects dried from ethanol storage readily yield CO1 sequences. A more thorough investigation and optimization of methods to extract high-quality barcode DNA from herbarium collections in a high-throughput format will be critical to efficiently build a sequence-database library for plant DNA barcodes. Our positive results by using well preserved specimens indicate that the a priori selection of apparently undegraded plant samples will be an important determinant of success. Fortunately, herbaria often have more than one specimen per species among which to select for successful DNA barcoding.
We have shown here that there are gene sequences suitable for DNA barcoding of flowering plants. It may be necessary to employ more than one locus to attain species-level discrimination across all flowering plant species. Algorithms for combining barcoding sequences from two or more DNA regions to yield species-level unique identifiers are now needed. We believe that ITS and trnH-psbA serve as good starting points for large-scale testing of DNA barcoding across a large sample of angiosperms. A good test would be to expand taxon sampling through the application of both ITS and trnH-psbA to barcode the estimated 8,000 species of flowering plants of Costa Rica (54).
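As one very simple illustration of what combining two loci could look like, the Python sketch below treats the pair of sequences from both regions as a composite identifier and reports which species remain indistinguishable. It is a toy under stated assumptions, not a proposed algorithm: the function `unresolved_groups`, the exact-match rule, and the placeholder species and sequences are hypothetical, and a real system would need to tolerate intraspecific variation and sequencing error.

```python
from collections import defaultdict


def unresolved_groups(library, loci):
    """Return groups of species that share identical sequences at all `loci`.

    `library` maps species name -> {locus name -> sequence}. Species that
    fall into a group of size one are uniquely identified and are omitted.
    """
    groups = defaultdict(list)
    for species, sequences in library.items():
        key = tuple(sequences[locus] for locus in loci)  # composite identifier
        groups[key].append(species)
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)


# Placeholder reference library (sequences are invented for illustration).
library = {
    "species_1": {"ITS": "ACGTAC", "trnH-psbA": "TTGACA"},
    "species_2": {"ITS": "ACGTAC", "trnH-psbA": "TTGGCA"},
    "species_3": {"ITS": "ACGAAC", "trnH-psbA": "TTGGCA"},
}

print(unresolved_groups(library, ["ITS"]))
print(unresolved_groups(library, ["trnH-psbA"]))
print(unresolved_groups(library, ["ITS", "trnH-psbA"]))
```

In this toy library each locus alone leaves one pair of species unresolved, whereas the two-locus identifier separates all three.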
| Acknowledgements |
|---|
| Footnotes |
|---|
Freely available online through the PNAS open access option.
Abbreviations: ITS, internal transcribed spacer; CO1, cytochrome c oxidase 1.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. DQ005959-DQ006232).
To whom correspondence should be addressed. E-mail: kressj{at}si.edu.
© 2005 by The National Academy of Sciences of the USA
| References |
|---|