Whole-genome shotgun assembly and comparison of human genome assemblies

PNAS Istrail et al. 10.1073/pnas.0307971100.

Supporting Information

Files in this Data Supplement:

Supporting Dataset 1

Sensitivity and specificity of A2Amapper. Accuracy estimation of the one-to-one mappings produced by A2Amapper. Comparison to independently derived and mapped biological features; comparison to correct mappings between simulated sequences; comparison to random noise.

Supporting Dataset 2

Analysis of the RefSeq content of each of the assemblies considered. Provides additional detail to the analysis in the main text.

Supporting Dataset 3

Chromosomal arm coverage. Graphs of match coverage as a function of basepair distance from the centromere, based on A2Amapper matches of each assembly to NCBI-34.

Supporting Dataset 4

Putative deletions in, and filler for, NCBI-34. List of selected matches between NCBI-34 and WGSA, as determined by A2Amapper. Matches indicate putative PAC deletions in NCBI-34.

Supporting Dataset 5

Putative deletions in, and filler for, NCBI-34. List of selected matches between NCBI-34 and WGSA, as determined by A2Amapper. Matches indicate putative WGSA filler for NCBI-34 gaps.

Supporting Dataset 6

Putative deletions in, and filler for, NCBI-34. List of selected matches between NCBI-34 and WGSA, as determined by A2Amapper. Matches indicate putative NCBI-34 filler for NCBI-34 gaps.

Supporting Dataset 7

Dot plot representations of every human chromosome. A selection of genome assemblies is superimposed on each image. Each assembly is plotted against NCBI-34, which is represented on the horizontal axis. Each line represents matches as determined by A2Amapper. Images are zoomable using freely available viewers such as Adobe Acrobat.

Supporting Dataset 8

Statistics on A2Amapper mappings between every pair from the set of human genome assemblies considered. Each statistic occurs in a separate matrix. Includes glossary.

Supporting Figure 3

Analysis of RefSeq mappings. Included in Supporting Dataset 2.

Supporting Figure 4

RefSeq mapping rates. Included in Supporting Dataset 2.

Supporting Figure 5

Run and match coverage of chromosomal arms. Included in Supporting Dataset 3.

Supporting Figure 6

Non-N coverage of chromosomal arms. Included in Supporting Dataset 3.

Supporting Figure 7

Dot plot representations of alignments of WGSA and NCBI-34, highlighting two regions discussed in the main text. A putative orientation error in NCBI-34 near the centromere of chromosome 6.

Supporting Figure 8

Dot plot representations of alignments of WGSA and NCBI-34, highlighting two regions discussed in the main text. A putative WGSA filler for an NCBI-34 gap near a telomere of chromosome 6.

Supporting Table 3

Counts of matches and base pairs in matches between every chromosome of every assembly considered. For each assembly pair, summary counts segregate mapped (to a chromosome) from unmapped sequence.

Supporting Table 4

Exon preservation rates of 11 mapping methods. Included in Supporting Dataset 1.

Supporting Table 5

Error rates of 11 mapping methods. Included in Supporting Dataset 1.

Supporting Table 6

Lists of RefSeq mappings that covered more bases in WGSA.

Supporting Table 7

Lists of RefSeq mappings that covered more bases in NCBI-34.

Supporting Table 8

Assembly comparison table. Source for Table 1 in the main text. Includes additional statistics (rows) and assemblies (columns) not included in Table 1. Contains two worksheets, accessible with the tabs at the bottom of the screen in Excel; one worksheet summarizes the other. Contains some comments (accessible by mouseover in Excel) on fact sources. File format is Microsoft Excel 2002 for Windows.

Supporting Table 9

Spreadsheets on intrachromosome mates. The spreadsheets include many worksheets, accessible with the tabs at the bottom of the screen in Excel. Spreadsheet file format is Microsoft Excel 2002 for Windows.

Supporting Table 10

Spreadsheets on interchromosome mates. The spreadsheets include many worksheets, accessible with the tabs at the bottom of the screen in Excel. Spreadsheet file format is Microsoft Excel 2002 for Windows.

Supporting Table 11

Putative deletions in NCBI-34. Included in Supporting Dataset 4.

Supporting Table 12

Putative WGSA filler for NCBI-34. Included in Supporting Dataset 5.

Supporting Table 13

Putative mappings for NCBI-34 unmapped sequence. Included in Supporting Dataset 6.

Supporting Text 1

Description of clump construction. The main text includes clump analysis of order and orientation in selected assemblies.

Supporting Text 2

Mate pair analysis. Order and orientation analysis based on mate pair data from Celera’s whole genome shotgun sequencing effort. Description of methods and results.

PNAS February 17, 2004 vol. 101 no. 7 1916-1921