Genetics
The diploid genome sequence of Candida albicans







*Stanford Genome Technology Center, Palo Alto, CA 94304; Research Center for Pathogenic Fungi and Microbial Toxicoses, Chiba University, Chiba 260-8673, Japan; Departments of Stomatology, ||Microbiology and Immunology, and **Pharmaceutical Chemistry, University of California, San Francisco, CA 94143; ¶Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, MN 55455; and 3938 Paseo Grande, Moraga, CA 94556
Contributed by Ronald W. Davis, March 8, 2004
| Abstract |
|---|
We present the diploid genome sequence of the fungal pathogen Candida albicans. Because C. albicans has no known haploid or homozygous form, sequencing was performed as a whole-genome shotgun of the heterozygous diploid genome in strain SC5314, a clinical isolate that is the parent of strains widely used for molecular analysis. We developed computational methods to assemble a diploid genome sequence in good agreement with available physical mapping data. We provide a whole-genome description of heterozygosity in the organism. Comparative genomic analyses provide important clues about the evolution of the species and its mechanisms of pathogenesis.
Whole-genome shotgun (WGS) sequencing has been successfully applied to very large genomes; however, standard assembly software does not allow for the possibility of two homologs with varying degrees of similarity and does not assemble such sequences correctly unless the sequence is nearly homozygous throughout the genome. To assemble the C. albicans diploid genome sequence, we began with PHRAP, the widely used assembly program (www.phrap.org). Application of PHRAP resulted in an assembly (Assembly 6) in which the sum of the contigs significantly exceeded the haploid genome size. Here we describe the conversion of this standard PHRAP assembly into a diploid assembly that is in good agreement with available physical mapping data. The diploid sequence assembly reveals the nature and extent of heterozygosity in strain SC5314. Together with the gene set inferred from the sequence, these results provide significant insights into C. albicans evolution and pathogenesis.
| Methods |
|---|
20%. Genes believed to be single copy were often found on two contigs, suggesting that homologous sequences were sometimes assembled into separate contigs. Standard finishing experiments designed to close gaps, normally undertaken after completing an assembly, were inappropriate if most apparent gaps were caused by separate assembly of heterozygous sequence, not lack of data. We call regions of the assembly where homologs assembled together "nearly homozygous" and regions of separated assembly "heterozygous." Although separated homologs usually had similar sequence, similarity alone was insufficient to identify them amid the many duplicated sequences in the genome. Sequence alignments between separated homologs, however, do have distinctive properties that are created as a byproduct of assembly. To reconstruct the diploid genome sequence from the PHRAP assembly, it was necessary to identify heterozygous regions of the assembly, align separated homologs, and appropriately join them. As shown in Fig. 1 b and c, the logic of single-copy assembly, applied to diploid sequence, dictates that separated homologs must give rise to what we call "terminal alignments."
In practice, identification of terminal alignments was not always straightforward. In general, sequence at the ends of PHRAP contigs came from a single read and therefore was often of low quality and sometimes chimeric. This prevented most homologous terminal alignments from reaching the very end of the contig. We used PHRAP's base quality scores to perform a statistical test for assessing terminality, and suspected chimeric contig ends were identified and trimmed in the diploid assembly (see Supporting Text).
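The terminality idea can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes that an alignment counts as terminal when, at each of its two ends, at least one of the two contigs runs out, and it replaces the quality-score-based statistical test with a simple hypothetical `slack` parameter that tolerates a few unaligned low-quality bases at a contig end. The `Alignment` class and all field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Alignment between contigs A and B, same orientation, half-open coordinates."""
    a_len: int    # length of contig A
    a_start: int  # first aligned base on A
    a_end: int    # one past the last aligned base on A
    b_len: int    # length of contig B
    b_start: int  # first aligned base on B
    b_end: int    # one past the last aligned base on B


def is_terminal(aln: Alignment, slack: int = 0) -> bool:
    """Return True if each end of the alignment reaches the end of A or of B.

    `slack` (an assumption of this sketch) is the number of unaligned bases
    tolerated at a contig end, standing in for the statistical test.
    """
    left_ok = aln.a_start <= slack or aln.b_start <= slack
    right_ok = (aln.a_len - aln.a_end) <= slack or (aln.b_len - aln.b_end) <= slack
    return left_ok and right_ok


# The end of A overlaps the start of B: terminal.
dovetail = Alignment(a_len=1000, a_start=600, a_end=1000, b_len=800, b_start=0, b_end=400)
# A match in the interior of both contigs, as a dispersed repeat would give: not terminal.
internal = Alignment(a_len=1000, a_start=300, a_end=500, b_len=800, b_start=200, b_end=400)
# A dovetail that stops a few bases short of both contig ends.
ragged = Alignment(a_len=1000, a_start=600, a_end=990, b_len=800, b_start=15, b_end=405)
```

With `slack=0` the ragged example is rejected, while a small tolerance such as `slack=20` accepts it, which mirrors why low-quality contig ends had to be accounted for.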
Special methods were devised to assemble across transposon insertions and large substitutions such as the mating-type-like locus. In a process analogous to finishing in single-copy WGS, manual assembler directives based on physical mapping data, paired plasmid clone sequences, and known GenBank C. albicans sequences were used to guide the assembly. A more detailed description of the assembly is presented in Supporting Text, Table 6, and Fig. 6, which are published as supporting information on the PNAS web site.
Identification of Heterozygosity in Strain SC5314. In nearly homozygous regions of the PHRAP assembly, where highly similar homologs assembled together, polymorphisms were identified by scanning PHRAP contigs for positions having a pattern of high-quality disagreements between individual reads. Similar methods have been used to find polymorphisms in the human (9) and Anopheles (10) whole-genome assemblies. By aligning homologous supercontig pairs, we were able to identify many additional polymorphisms between homologs that PHRAP had assembled separately. Both methods of polymorphism discovery use base quality scores to distinguish true polymorphisms from sequencing errors (see Supporting Text).
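As a rough illustration of what "a pattern of high-quality disagreements between individual reads" could mean at a single assembly position, the following Python sketch flags a column when at least two different bases are each supported by several high-quality reads. The thresholds `min_quality` and `min_support` are hypothetical values chosen for the example; the actual criteria are described in Supporting Text and are not reproduced here.

```python
from collections import Counter


def candidate_polymorphism(column, min_quality=30, min_support=2):
    """Return the sorted bases that look like alleles at one position, or [].

    column: list of (base, quality) pairs, one per read covering the position.
    min_quality, min_support: illustrative thresholds, not the paper's values.
    """
    # Ignore low-quality calls so that likely sequencing errors cast no vote.
    counts = Counter(base for base, quality in column if quality >= min_quality)
    supported = sorted(base for base, n in counts.items() if n >= min_support)
    # A candidate needs at least two well-supported, different bases.
    return supported if len(supported) >= 2 else []


heterozygous = [("A", 40), ("A", 38), ("G", 45), ("G", 35), ("A", 12)]
low_quality_disagreement = [("A", 40), ("A", 38), ("G", 10), ("G", 8)]
single_read_disagreement = [("A", 40), ("A", 40), ("G", 40)]
```

Only the first column is reported as a candidate; the other two show disagreements that are either low quality or seen in a single read.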
| Results |
|---|
The genome size and physical map of C. albicans have been examined primarily in strain CBS5736 and its derivatives (3). No significant differences were found between the electrophoretic karyotype of the sequencing strain SC5314 and CBS5736. Size estimates of the SC5314 chromosomes are presented in Table 1. Given the assumptions made in determining genome size, the assembled haploid genome sequence of C. albicans is in remarkably good agreement with estimates of genome size derived from physical criteria. Supercontigs with sequenced map markers were assigned to the chromosomes from which the markers derive. The varying levels of coverage of individual chromosomes, lowest on chromosome R, relate to the number and distribution of markers on the physical map.
Our assembled rDNA sequence (see additional data at http://genome-www.stanford.edu/candida-pnas2004-supplement) gives a repeating unit of 12,756 base pairs and indicates that the haploid genome encodes 55 copies of the shorter, intronless class of rDNA (see Table 1). The arrangement of the rRNA genes in strain SC5314 is similar to that in Saccharomyces cerevisiae with the addition of a low-complexity region of 2 kb. This region varies among strains and is used in various DNA typing schemes. Analysis of traces that contain partial rDNA sequences suggests that the repeat is located between supercontigs 10247 and 2511. Physical mapping data had previously placed the rDNA near markers on 2511.
As in S. cerevisiae, a relatively small fraction of C. albicans genes contain introns. Unlike some other fungal species, C. albicans does not appear to have extensively spliced genes. The C. albicans intron structure is generally similar to that of Saccharomyces. C. albicans and its close relatives translate the codon CUG as serine rather than the usual leucine in nuclear genes (11). Approximately two-thirds of the ORFs make use of this unusual codon.
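The practical consequence of the CUG reassignment for anyone translating C. albicans ORFs can be shown with a minimal Python sketch. It builds the standard genetic code from its conventional TCAG-ordered string and derives a second table in which CTG (CUG in RNA) encodes serine. The names `STANDARD`, `CANDIDA`, and `translate` are illustrative and not taken from any particular library.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}

# Same table with the single reassignment described in the text: CUG -> serine.
CANDIDA = dict(STANDARD, CTG="S")


def translate(seq, table):
    """Translate a nucleotide sequence codon by codon; trailing partial codons are ignored."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


def uses_cug(seq):
    """Return True if any in-frame codon of the sequence is CUG."""
    dna = seq.upper().replace("U", "T")
    return any(dna[i:i + 3] == "CTG" for i in range(0, len(dna) - 2, 3))
```

For example, `translate("ATGCTGTAA", STANDARD)` gives `"ML*"`, whereas `translate("ATGCTGTAA", CANDIDA)` gives `"MS*"`, so a protein predicted with the standard table would carry leucine wherever the organism inserts serine.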
Heterozygosity. The diploid assembly highlights the extent of natural heterozygosity in C. albicans. The analysis described in Methods yielded a total of 62,534 high-confidence polymorphisms for the entire genome. Single base substitutions made up >89% of the high-confidence polymorphism set, with a 2:1 ratio of transitions to transversions (Table 7, which is published as supporting information on the PNAS web site). Homologs assembled separately by PHRAP account for 19% of the genome but contain 65% of the polymorphisms. The overall average frequency of polymorphism is one in 237 bases, considerably higher than observed in human or Anopheles sequence, probably in part due to our detection of separately assembled regions as homologs. The significance of the extensive allelic differences in C. albicans is unknown, but they may increase genetic diversity (12, 13) and contribute to the evolution of drug resistance (14).
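The figures in this paragraph can be cross-checked with a little arithmetic, sketched below in Python. The sketch uses only numbers quoted in the text: 62,534 polymorphisms at one per 237 bases implies about 14.8 Mb of sequence surveyed, and 65% of the polymorphisms in 19% of the genome implies a polymorphism density roughly eight times higher in separately assembled regions than elsewhere. It also shows how a substitution is classified as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion. The function names are illustrative.

```python
POLYMORPHISMS = 62_534
BASES_PER_POLYMORPHISM = 237

# Sequence length implied by the two quoted figures (about 14.8 Mb).
implied_bases = POLYMORPHISMS * BASES_PER_POLYMORPHISM


def density_ratio(frac_genome, frac_polymorphisms):
    """Polymorphism density inside a region relative to the rest of the genome."""
    inside = frac_polymorphisms / frac_genome
    outside = (1 - frac_polymorphisms) / (1 - frac_genome)
    return inside / outside


# A<->G and C<->T exchanges are transitions; every other substitution is a transversion.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= set("ACGT"):
        raise ValueError("expected two different bases from A, C, G, T")
    return "transition" if frozenset((ref, alt)) in TRANSITIONS else "transversion"


enrichment = density_ratio(0.19, 0.65)
```

Because each base has one possible transition and two possible transversions, a 2:1 ratio in favor of transitions is well above what equal substitution rates would produce.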
The polymorphisms in the C. albicans genome, like those in human and Anopheles, are distributed quite unevenly across its genome. Table 1 lists the overall frequencies by chromosome. The excess polymorphisms on chromosomes 5 and 6 are explained by just a few highly diverged regions described below. The low overall polymorphism on chromosomes 3 and 7 results from very large nearly homozygous regions; Fig. 3 shows the distribution of polymorphisms along chromosome 7. The large nearly homozygous regions are near the telomeres, likely the result of mitotic recombination. Although the general location of the centromere is known from translocation data (B.B.M., unpublished data), the more polymorphic regions do not point to a more specific location.
50 kb from MTL on the same supercontig. At this latter site, examination of the homologous supercontig indicates that it contains an inversion of sequence with otherwise low levels of polymorphism. The inversion is itself flanked by inverted repeats and could have occurred in vivo or as an outcome of the PHRAP assembly step. Localized inversions are a major feature of fungal genome evolution (15). Among the other highly polymorphic sites are a second inversion and known gene families containing low-complexity sequences.
We found 3,579 ORFs containing high-confidence polymorphisms. In 2,792 of these, the polymorphisms alter protein translation. Among the protein differences, for 94 there was no ORF (100 amino acids or greater) on the homologous supercontig obviously encoding an allele, and for 57 others the ORF was fragmented into more than one ORF on the homologous supercontig. The effects of heterozygosity in C. albicans coding regions have not yet been extensively explored; however, significant phenotypic differences between parent strains and heterozygous mutants have been reported (17).
Among the 6,699 indels, there is a general decline in frequency with increasing length except at multiples of three bases. The excess of indels with length a multiple of three is concentrated almost completely in the coding fraction of the genome as defined by the reduced ORF set (Fig. 4). Three-base indels are not surprising in low-complexity regions of proteins such as homopolymer tracts.
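The reasoning behind the multiple-of-three excess is that an indel whose length is divisible by three leaves the reading frame intact, so it is more likely to be tolerated within coding sequence. A minimal Python sketch of the corresponding tabulation follows; the input lengths are made-up example data, not values from the paper, and the function name is illustrative.

```python
from collections import Counter


def indel_length_profile(lengths):
    """Return (histogram of indel lengths, fraction that preserve the reading frame)."""
    if not lengths:
        return Counter(), 0.0
    histogram = Counter(lengths)
    # Lengths divisible by three add or remove whole codons.
    in_frame = sum(n for length, n in histogram.items() if length % 3 == 0)
    return histogram, in_frame / len(lengths)


# Hypothetical indel lengths (in bases) for two classes of sequence.
coding_lengths = [3, 3, 6, 1, 3, 9, 2]
noncoding_lengths = [1, 1, 2, 1, 3, 4, 2, 5]

coding_hist, coding_in_frame = indel_length_profile(coding_lengths)
noncoding_hist, noncoding_in_frame = indel_length_profile(noncoding_lengths)
```

Running the same tabulation separately on coding and noncoding indels is one simple way to see whether the in-frame excess is concentrated in the coding fraction.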
The most striking differences between S. cerevisiae and C. albicans are found in oxidative metabolism. In addition to common components in their electron transport chains, C. albicans also encodes a typical complex I. C. albicans has both the mitochondrially encoded and the nuclear-encoded subunits of this complex, which is found in most eukaryotes but absent in S. cerevisiae. An increased role for respiration in C. albicans is suggested by numerous differences, including a pyruvate dehydrogenase kinase to regulate the flow from glycolysis into the tricarboxylic acid cycle, the lipase family mentioned above and other enzymes in fatty acid catabolism, and additional amino acid catabolic pathways.
Sulfur metabolism also appears to differ between these two yeasts. C. albicans has genes likely to encode a direct pathway to cysteine in addition to a transsulfuration pathway from homocysteine. Genes encoding cysteine catabolic enzymes may also be present. These additional cysteine pathways might reflect an increased significance for glutathione metabolism in C. albicans.
The increased filamentation responses found in C. albicans would be expected to require alterations in genes for structural proteins and for cell cycle regulation. Among the differences in structural proteins, a kinesin-like gene most closely resembles the type found in Aspergillus. In the cell cycle, a number of differences from S. cerevisiae appear in the subunits of the anaphase-promoting complex.
Finally, the genome sequence reveals a number of adaptations for environmental sensing and response. C. albicans' ability to pass through the digestive tract requires it to cope with widely varying pH environments. C. albicans has a number of genes related to the pH regulatory genes of Aspergillus (23) and encodes a small family of chloride channels with members resembling types expressed in a variety of mammalian tissues. Also of note are differences in genes in the calmodulin signaling pathway, including a protein kinase related to one implicated in sensing surface contact in a plant pathogen.
| Discussion |
|---|
It is possible that additional sequence data might have closed some of the remaining true coverage gaps. From the eight chromosomes and the assembly gaps due to the copies of the MRS, one has 20 contigs as a lower bound. The remaining gaps have diverse origins in other repeat sequences, true gaps in the coverage, regions that may not be readily cloned in Escherichia coli, and overlaps too short for the conservative approach to joins we have used. We periodically assembled available sequence, and the results are summarized in Table 5. Increasing coverage yielded contigs whose sum clearly exceeded the haploid genome size; however, with assembly 6, there was a precipitous drop in the number of large contigs and in the total sequence contained within small contigs. The superassembly process continued these trends while delivering a product very close to independently derived estimates of the genome size.
C. albicans biology also suggests that the limitations of the superassembly will not have severe practical effects. The only highly suspect areas of the sequence relate to the largest repeated sequences, particularly the ALS genes and the MRS (3, 19). Both of these sequence families have allelic and strain variation. Naturally occurring translocations via the MRS have been observed. Because much of the interest in C. albicans derives from the diversity of clinical isolates, the disproportionate effort required to assemble these sequences in one strain would have limited value.
Our diploid genome sequence catalogs polymorphisms in both protein encoding genes and potential regulatory sequences. This should greatly facilitate the search for additional loci where allelic differences are significant for pathogenesis. In addition to providing likely sites for regulation, variable numbers of TRs are useful markers for both population genetics and epidemiology.
The release of the C. albicans genome sequence to the public domain at various stages of completion has already accelerated research in the biology and disease processes of this important pathogen. The availability of a diploid genome sequence will now take these studies to a new level.
| Acknowledgements |
|---|
| Footnotes |
|---|
Data deposition: The sequence reported in this paper has been deposited in the GenBank database (accession no. AACQ00000000). The version described in this paper is the first version, AACQ01000000.
Present address: Department of Anesthesia, Stanford University, Stanford, CA 94305.

To whom correspondence should be addressed. E-mail: dbowe{at}stanford.edu.
| References |
|---|