High-quality draft assemblies of mammalian genomes from massively parallel sequence data
Contributed by Eric S. Lander, November 23, 2010 (sent for review October 8, 2010)

Abstract
Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (∼100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.
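The scaffold N50 figures quoted above can be read with a small worked example. The sketch below assumes the common definition of N50: the largest length L such that scaffolds of length at least L together contain at least half of all assembled bases. The abstract does not spell out the exact convention used, so the function name `n50` and the toy lengths are illustrative assumptions, not part of ALLPATHS-LG.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Assumed definition: walk the lengths from longest to shortest and
    return the length at which the running total first reaches at least
    half of the summed length of all scaffolds.
    """
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional arithmetic.
        if 2 * running >= total:
            return length


# Toy example with hypothetical scaffold lengths (in bases).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

Under this definition, a few very long scaffolds raise N50 even when many short scaffolds remain, which is why N50 is usually reported alongside accuracy and genome coverage rather than on its own.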
Footnotes
1 To whom correspondence may be addressed. E-mail: lander@broadinstitute.org or jaffe@broadinstitute.org.
Author contributions: S.G., I.M., D.P., F.J.R., J.N.B., B.J.W., T.S., G.H., D.A., M.C., R.D., L.W., R.N., A.G., and D.B.J. performed research; T.P.S., S.S., A.M.B., C.N., and E.S.L. analyzed data; S.G. and I.M. led the genome assembly team; S.G., I.M., D.P., F.J.R., J.N.B., B.J.W., T.S., G.H., and D.B.J. developed and implemented algorithms; T.P.S. and S.S. generated SOAPdenovo and ABySS assemblies; D.A., M.C., R.D., L.W., R.N., and A.G. performed laboratory research and development; and C.N., E.S.L., and D.B.J. wrote the paper.
The authors declare no conflict of interest.
Data deposition: The sequence data reported in this paper have been deposited in the NCBI Short Read Archive (study names Human_NA12878_Genome_on_Illumina and Mouse_B6_Genome_on_Illumina 2) and in the DDBJ/EMBL/GenBank database (accession nos. AEKP00000000, AEKQ00000000, and AEKR00000000).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1017351108/-/DCSupplemental.
Freely available online through the PNAS open access option.