Table 1. Comparison of selected assemblies
| Statistic | Notes | WGA | CSA | HG06 | WGSA | NCBI-28 | NCBI-34 |
|---|---|---|---|---|---|---|---|
| Assembly | a | WGA | CSA | HG06 | WGSA | NCBI-28 | NCBI-34 |
| Producer | b | Celera | Celera | UCSC | Celera | NCBI | NCBI |
| Method | c | WG | C | H | WG | H | H |
| Data source | d | Combined | Combined | IHGSC | Celera | IHGSC | IHGSC |
| Associated date | e | Nov. 2000 | Jan. 2001 | Dec. 2000 | Dec. 2001 | Dec. 2001 | Oct. 2003 |
| Intrinsic measures | | | | | | | |
| acgt in assembly, Mbp | f | 2,587 | 2,656 | 2,742 | 2,696 | 2,853 | 2,865 |
| acgt unmapped, Mbp | g | 280 | 60 | 37 | 36 | 58 | 22 |
| No. of contigs | h | 221,036 | 169,157 | 133,667 | 211,493 | 47,117 | 512 |
| No. of scaffolds | i | 118,968 | 54,061 | 76,058 | 4,940 | 42,754 | 447 |
| N50 contig length, kbp | j | 53 | 98 | 110 | 23 | 575 | 29,105 |
| N50 scaffold length, kbp | j | 3,563 | 2,954 | 331 | 29,133 | 613 | 36,791 |
| Scaffold span, Mbp | k | 2,848 | 2,909 | 2,833 | 2,819 | 2,855 | 2,869 |
| RefSeq (50% cov, 95% id) | l | 17,348 | 18,305 | 18,122 | 19,149 | 18,810 | 19,613 |
| Segmental duplication, Mbp | m | 27.3 | 54.5 | 108.0 | 69.5 | 120.0 | 152.3 |
| Seg. dup. in unmapped, Mbp | n | 13.9 | 5.1 | 2.9 | 8.3 | 2.7 | 5.1 |
| Confirmed conflicted mates | o | 0.38% | 0.91% | 5.61% | 0.31% | 2.44% | 0.28% |
| Mates linking mapped + unmapped | p | 1.52% | 0.16% | 0.03% | 0.13% | 0.02% | 0.01% |
| Comparison to NCBI-34 | | | | | | | |
| No. of matches | q | 256,021 | 208,148 | 150,624 | 308,371 | 60,544 | |
| No. of runs | q | 12,560 | 47,540 | 71,291 | 7,315 | 23,024 | |
| No. of clumps | q | 1,595 | 1,187 | 3,189 | 339 | 2,951 | |
| acgt in matches, Mbp | r | 2,498 | 2,520 | 2,495 | 2,657 | 2,653 | |
| Extra sequence, Mbp | s | 89 | 136 | 247 | 38 | 200 | |
| Missing sequence, Mbp | t | 367 | 345 | 370 | 208 | 212 | |
| acgt in runs, Mbp | u | 2,557 | 2,650 | 2,553 | 2,759 | 2,682 | |
| N50 match length, kbp | v | 27 | 33 | 47 | 15 | 306 | |
| N50 run length, kbp | v | 1,204 | 441 | 203 | 1,959 | 954 | |
| N50 clump length, kbp | v | 5,404 | 5,931 | 1,809 | 33,501 | 2,765 | |
| Percent of acgt in matches to NCBI-34 in: | | | | | | | |
| Global HCS | w | 79.86% | 78.65% | 72.73% | 95.96% | 77.35% | |
| Unmapped scaffolds | x | 8.76% | 1.50% | 0.78% | 0.78% | 0.41% | |
| Mismapped scaffolds | y | 10.69% | 18.41% | 17.14% | 2.45% | 21.11% | |
| Scaffold-incompat. matches | z | 0.68% | 1.44% | 9.35% | 0.81% | 1.12% | |
| Potentially chimeric scaffolds | aa | 9 | 33 | 666 | 25 | 97 | |
| Chimeric acgt, Mbp | bb | 10 | 27 | 112 | 13 | 21 | |
| No. of small conflicts | cc | 3,474 | 6,165 | 14,582 | 3,912 | 1,586 | |
| acgt in small conflicts, Mbp | dd | 7 | 9 | 121 | 8 | 9 | |

  • More extensive results are contained in Tables 3 and 8 and Data Set 8 on the PNAS web site.

  • a Assembly gives the acronym used in the text.

  • b Producers are Celera (www.celeradiscoverysystem.com), University of California, Santa Cruz (UCSC, www.genome.ucsc.edu), and NCBI (www.ncbi.nlm.nih.gov).

  • c Method identifies the computational approach used to produce each assembly: WG, whole-genome; C, compartmental; H, hierarchical.

  • d Data sources are Celera (shotgun reads plus public BAC ends), IHGSC (HGP data), or a combination (Celera data plus a subset of human genomic data from GenBank).

  • e Dates shown are assembly completion date (Celera), data freeze date (UCSC), or release date (NCBI).

  • f Unambiguous bases in the assembly consensus sequence (including “acgt unmapped”).

  • g Unambiguous bases not assigned to specific chromosome locations.

  • h Contiguous sequence built of overlapping sequencing reads.

  • i Chains of linked contigs.

  • j A base has a 50% chance of being in a contig or scaffold at least this long.

  • k Sum of the lengths of the scaffolds, including internal Ns.

  • l RefSeqs alignable at 50% coverage and 95% identity thresholds.

  • m Bases in matches to segmental duplications in NCBI-34.

  • n Subset of segmental duplication unmapped in this assembly.

  • o Percent of mate pairs indicating a possible misassembly. Mate pair data indicate relative orientation and distance between pairs of sequencing reads. Celera fragments were aligned to each assembly. Where two or more pairs of fragments imply the same rearrangement, they are counted as a possible misassembly.

  • p Percent of aligned mate pairs with one fragment aligned to an unmapped scaffold (see “acgt unmapped”) while the other is aligned to a mapped scaffold.

  • q Analysis of A2Amapper's one-to-one mapping between each assembly and NCBI-34. Matches, runs, and clumps are successively less restrictive local alignments, derived from the one-to-one mapping, as described in the text. Informally, matches never include gaps >10 bp, runs never span conflicting matches, and clumps never span a conflict >50 kbp.

  • r Unambiguous bases within matches.

  • s Unambiguous bases of each assembly outside all matches.

  • t Unambiguous bases of NCBI-34 outside matches to this assembly.

  • u Unambiguous NCBI-34 bases within runs.

  • v A matched base has a 50% chance of being in a match/run/clump at least this long.

  • w Percent of matched bases in the maximal set of consistent matches, defined by heaviest common subsequence (HCS).

  • x Percent of matched bases in unmapped scaffolds.

  • y Percent of matched bases in scaffolds that disagree with NCBI-34 in chromosome assignment, order, or orientation.

  • z Percent of matched bases in scaffold-incompatible matches, where a match is incompatible with its scaffold if it conflicts with the (length-weighted) majority of matches in the scaffold.

  • aa Scaffolds with a consistent subset of incompatible matches >50 kbp.

  • bb Unambiguous bases in minority subset(s) for potentially chimeric scaffolds.

  • cc Runs of incompatible matches not counted in “Potentially chimeric scaffolds.”

  • dd Unambiguous bases in matches in small conflicts.
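
The N50 statistic defined in notes j and v can be made concrete with a short sketch. This is only an illustration of the definition (a base has a 50% chance of lying in a piece at least this long), not the code used to produce the table. The function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the largest length L such that pieces of length >= L
    together contain at least half of all bases.

    `lengths` may hold contig, scaffold, match, run, or clump lengths,
    all in the same unit.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down, accumulating bases until
    # at least half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kbp (not taken from the table).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
example_n50 = n50(example_lengths)
```

With these made-up lengths the total is 300 kbp, and the two longest pieces (80 and 70) already cover 150 kbp, so the N50 is 70 kbp. Because the statistic weights pieces by the bases they contain, a few very long scaffolds can dominate it even when most scaffolds are short.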