Supporting Materials and Methods

Alignment of Simulated Sequence Reads to the Human Genome. We tested two methods, BLASTZ and Smith-Waterman, to align simulated sequence reads to the nonrepetitive portion of the human genome (UCSC hg16, July 2003; National Center for Biotechnology Information Build 34). In each case, we achieved very high specificity and sensitivity by applying an S1–S2 filter (see Appendix 1) with an empirically derived cutoff. BLASTZ was used with the parameters K = 2,500 and L = 2,500, followed by the filter (S1–S2) > 3,000. Smith-Waterman alignments were generated by using a DeCypher machine (TimeLogic, Carlsbad, CA). Because the mutation model of the simulated reads is known (it follows from the algorithms used to generate them), we theoretically derived linear-gap alignment parameters such that the optimal alignment is the single most likely path given the observed mutated read [the Viterbi path (1)]. Empirically, we achieved slightly better read placement by being 25% more tolerant of gaps than the theoretically derived parameters; we therefore applied the more tolerant parameters instead.
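As an illustrative sketch of this style of derivation (the published parameters themselves are not reproduced here), linear-gap scores can be obtained as log-odds of the mutation-model event probabilities against a uniform background, so that the highest-scoring alignment coincides with the most likely (Viterbi) path; the function names and the form of the 25% gap-tolerance adjustment are our assumptions:

```python
import math

def linear_gap_scores(p, q, background=0.25):
    """Illustrative log-odds scores for a linear-gap aligner, given
    substitution probability p and indel rate q (a fraction of p).
    The background base frequency is assumed uniform (0.25)."""
    match = math.log((1.0 - p) / background)     # base unchanged
    mismatch = math.log((p / 3.0) / background)  # one of 3 other bases
    gap = math.log(p * q)                        # per-base indel event
    return match, mismatch, gap

def relax_gap(gap_score, tolerance=0.25):
    """Hypothetical 25%-more-tolerant gap score: gap scores are
    negative, so shrinking their magnitude makes gaps cheaper."""
    return gap_score / (1.0 + tolerance)
```

At p = 0.5 and q = 0.06, for example, the match score is positive while the mismatch and (much larger in magnitude) gap scores are negative, as expected for a log-odds scheme.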

We selected 500 gap-free, 650-bp intervals at random from the human genome sequence. We then created three simulated datasets, each starting from the same 500 human "reads" and randomly mutating each read to p = 50%, 55%, or 60% nucleotide divergence, as well as introducing insertions and deletions, each at a rate q = 6% of the substitution rate. For the simulated mutations, each base of the read was subject to any combination of three independent events: substitution to one of the other three bases with probability p, deletion with probability pq, and insertion of a random base after it with probability pq. Allowing for back-substitutions, these models correspond to branch lengths of D = 0.825, 0.99, and 1.205. The 500 reads (in their original form and at the three levels of mutation), as well as the raw Smith-Waterman alignments for the three levels of mutation, are supplied as supporting data (see www.nisc.nih.gov/data).
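The per-base mutation procedure, and the Jukes–Cantor-style correction that relates the divergence p to the quoted branch lengths D, can be sketched as follows (function names are ours; the original simulation code is not reproduced here):

```python
import math
import random

BASES = "ACGT"

def mutate_read(read, p, q, rng=random):
    """Apply, per base, the three independent events described above:
    substitution (prob p), deletion (prob p*q), and insertion of a
    random base afterward (prob p*q)."""
    out = []
    for b in read:
        kept = b
        if rng.random() < p:                     # substitution
            kept = rng.choice([x for x in BASES if x != b])
        if rng.random() < p * q:                 # deletion
            kept = ""
        out.append(kept)
        if rng.random() < p * q:                 # insertion after
            out.append(rng.choice(BASES))
    return "".join(out)

def branch_length(p):
    """Jukes-Cantor correction, which allows for back-substitutions."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The correction reproduces the quoted values to within rounding: branch_length(0.50) ≈ 0.824, branch_length(0.55) ≈ 0.991, and branch_length(0.60) ≈ 1.207.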

Alignment of Random Mouse Sequences to the Human Genome. We selected 500 gap-free, 650-bp intervals at random from the mouse genome sequence and aligned them to the human genome by using the Smith-Waterman approach described in the previous section. Specifically, we used double-affine gap semiglobal dynamic programming alignments with match reward +19, mismatch penalty 0, gap open penalty 30, and gap extend penalty 3 for the first 20 bases and 1 thereafter. Success or failure of each resulting alignment was assessed by comparing the placement of each aligned read to the mouse–human synteny maps (Michael Kamal, Broad Institute, personal communication). The synteny maps provided a corresponding human location for 475/500 (95%) of the mouse sequences, and we measured sensitivity and specificity relative to the reduced dataset of 475 sequences. Sequences, alignments, and synteny maps are available (see www.nisc.nih.gov/data).
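The double-affine gap penalty can be expressed as a simple cost function; the sketch below encodes the scheme exactly as parameterized above (it is not the DeCypher implementation):

```python
def gap_cost(length, open_pen=30, ext_short=3, switch=20, ext_long=1):
    """Cost of a gap of `length` bases under the double-affine scheme:
    open penalty 30, then 3 per base for the first 20 bases of the gap
    and 1 per base thereafter."""
    if length <= 0:
        return 0
    return (open_pen
            + ext_short * min(length, switch)
            + ext_long * max(0, length - switch))
```

For example, a 1-base gap costs 33, a 20-base gap costs 90, and each base beyond 20 adds only 1 (a 21-base gap costs 91), which makes long gaps comparatively cheap.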

Aligning Individual Sequence Reads from Targeted Regions. To test the feasibility of aligning individual sequence reads to the human genome sequence (the reasons for which are described in Appendix 2), we randomly selected 1,590 BAC-based sequence reads used to generate the finished mouse sequence for ENCODE region ENm001. Each of the 1,590 sequence reads was aligned to the human genome sequence by using either BLASTZ (2) or a hardware-optimized Smith-Waterman algorithm (3), in each case recording whether the best placement (assessed by means of an S1–S2 filter; see Appendix 1) was within a 3-Mb region (hg16:chr7:115000000-118000000) that contains ENm001. The sequence reads used in these studies are available (see www.nisc.nih.gov/data).

BLASTZ and Smith-Waterman correctly aligned 51.5% (819/1,590) and 66.4% (1,056/1,590) of the sequence reads, respectively. Of note, 48% (770/1,590) of the sequence reads were correctly aligned by both methods, with 3% (49/1,590) and 18% (286/1,590) of the sequence reads aligning only with BLASTZ or Smith-Waterman, respectively. Interestingly, BLASTZ, which is a more practical alignment method with sufficient speed to align entire mammalian genomes (see Appendix 3), aligns most of the sequence reads aligned by Smith-Waterman. Thus, the reads aligned by BLASTZ are essentially a subset of those aligned by Smith-Waterman.
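The overlap statistics above follow from simple set arithmetic over the counts given in the text; the short check below uses only those published numbers:

```python
# Counts taken from the text (reads correctly aligned out of 1,590).
both, blastz_only, sw_only, total = 770, 49, 286, 1590

blastz = both + blastz_only   # correctly aligned by BLASTZ
sw = both + sw_only           # correctly aligned by Smith-Waterman

def pct(n, denom=total):
    """Percentage of `denom`, rounded to one decimal place."""
    return round(100.0 * n / denom, 1)
```

About 94% (770/819) of the BLASTZ-aligned reads were also aligned by Smith-Waterman, which is the sense in which the BLASTZ set is essentially a subset of the Smith-Waterman set.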

There is substantial regional variation in the alignability of mouse and human sequences, and ENm001 has a higher level of human–mouse conservation [≈50% (4)] than the genome-wide average of ≈40% (5). For this reason, we similarly analyzed a different targeted region (mm3.chr16:91,521,333-92,411,984, near ENm005) with a reportedly lower level of human–mouse conservation; a correspondingly lower alignment rate was encountered, with 34% (170/500) of the random 650-bp mouse sequence reads correctly aligning to the human genome sequence by Smith-Waterman.

Generating Low-Redundancy Mouse Genome Assemblies. Random subsets of mouse sequence reads [generated during the whole-genome shotgun sequencing of the mouse genome (5)] were compiled that provided 2- and 3-fold redundancy of the mouse genome. Specifically, this involved selecting reads (in increments of 96-well plates of subclones, with sequences generated from both ends of each subclone) from four subclone libraries with three average insert sizes (4, 10, and 40 kb). The subsets were chosen such that the redundancy in Q20 bases exceeds the given multiple (2 or 3) of 2.7 Gb (the estimated size of the mouse genome). By this criterion, the full whole-genome shotgun assembly of the mouse genome provided 7.5-fold redundancy. The distribution of redundancy provided by the different subclone libraries (with the distinct average insert sizes) was as follows: (i) 2-fold redundant dataset (1.8-fold with 4 kb, 0.1-fold with 10 kb, 0.1-fold with 40 kb); and (ii) 3-fold redundant dataset (2.7-fold with 4 kb, 0.2-fold with 10 kb, 0.1-fold with 40 kb).
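The plate-selection step can be sketched as a greedy accumulation of Q20 bases until the target redundancy is reached; the input format (one Q20 base count per 96-well plate) is our assumption for illustration:

```python
MOUSE_GENOME_BP = 2.7e9  # estimated mouse genome size

def pick_plates(plate_q20_counts, target_fold, genome_bp=MOUSE_GENOME_BP):
    """Select plates (in order) until the cumulative Q20 base count
    reaches target_fold times the genome size.  Returns the indices
    of the chosen plates and the achieved fold redundancy."""
    chosen, q20 = [], 0
    for i, count in enumerate(plate_q20_counts):
        if q20 >= target_fold * genome_bp:
            break
        chosen.append(i)
        q20 += count
    return chosen, q20 / genome_bp
```

In practice the selection was stratified across the four subclone libraries (4-, 10-, and 40-kb inserts) to give the library mix listed above; the sketch omits that stratification.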

We used the whole-genome shotgun assembly program ARACHNE (6, 7) to assemble the above-generated 2- and 3-fold redundant sets of sequence reads. ARACHNE was also used for the previously reported mouse genome assembly (5). Although the underlying source code has changed slightly, we used essentially the same settings and parameters. The resulting alignments are available on request.

In addition to the above genome-wide analyses, six finished mouse BAC sequences were chosen for in-depth analyses (summarized in Fig. 2). An initial 10 finished mouse BAC sequences were selected at random from among ≈6,000 that were used in generating a recent mouse assembly (by parsing an AGP file). Of the 10 selected BACs, 4 derived from mouse chromosome X; we discarded these and analyzed the remaining 6 BACs. Their GenBank accession numbers and originating mouse chromosomes are AC122196 (chr5), AC131773 (chr15), AL662846 (chr11), AL591433 (chr11), AL672244 (chr16), and AC131702 (chr7).

Analysis of Low-Redundancy Hedgehog Sequence. The following analysis was performed to generate the data summarized in Fig. 3. A hedgehog BAC (GenBank AC139340, version 3) orthologous to the ENm001 region was chosen, and the 3,055 sequence reads used to generate the sequence of this BAC were extracted from the National Center for Biotechnology Information trace archive. Random subsets of reads were then chosen to provide 1-, 2-, 3-, and 7-fold redundancy of the BAC. For all four subsets, the unassembled reads were aligned to the orthologous human sequence by using BLASTZ (with the parameters K = 2,500 and L = 2,500). The human sequence had already been "softmasked" by using REPEATMASKER. For comparison, the complete BAC sequence was also aligned to the human sequence by using BLASTZ.

To compare how the different redundancies affect coverage and alignment, a human–hedgehog alignment was generated for each of the different datasets, as follows:

• All aligned reads were sorted by their start position relative to the human sequence.

• A hedgehog alignment was started by taking the first aligned read (from the sorted reads).

• Each alignment was progressively extended by adding an aligned read only if it extended the existing alignment. If an aligned read overlapped the existing alignment, its alignment was truncated so that only new sequence was added.
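The steps above amount to a greedy tiling along the human coordinate axis; a minimal sketch, with reads represented as hypothetical (start, end) intervals in human coordinates:

```python
def tile_alignments(aligned_reads):
    """Greedily tile aligned reads along the human sequence.  A read
    contributes only if it extends past the current right end; an
    overlapping read is truncated to contribute only new sequence."""
    segments = []
    cur_end = None
    for start, end in sorted(aligned_reads):
        if cur_end is None:
            segments.append((start, end))       # first aligned read
            cur_end = end
        elif end > cur_end:                     # extends the alignment
            segments.append((max(start, cur_end), end))
            cur_end = end
        # otherwise: fully contained, contributes nothing
    return segments
```

For example, a read spanning 5-15 added after one spanning 0-10 contributes only the 10-15 portion, and a read entirely inside the existing tiling is dropped.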

Generation of Low-Redundancy Sequence Datasets for MCS Analyses. We generated sequence assemblies (by using PHRAP) with sequence reads that provided 0.5-, 1-, 2-, 3-, 4-, 5-, and 6-fold sequence redundancy of individual BAC clones. In all cases, finished sequence had already been generated for the BAC, and a random read-picking method was used that normalized for reads derived from over-represented regions (i.e., areas of overlap between adjacent BACs). The assemblies plus unassembled reads for each level of sequence redundancy were then used for MCS detection. These data are available upon request and are essentially a subset of the data reported elsewhere (4, 8-10).

Generation of "Isoread" MCS Curves. To estimate the sensitivity of MCS detection for equivalent numbers of sequence reads distributed across different numbers of species (data presented in Fig. 4), we first generated spline-smoothed curves (with 6 degrees of freedom) through actual data points and interpolated intermediate read equivalents. The curves were generated with S-PLUS (Insightful, Seattle). Intermediate isoread equivalents were calculated for 5 and 11 species by using actual data points (at 0.5-, 1-, 2-, 3-, 4-, 5-, and 6-fold sequence redundancy) generated with 8 species. For example, the isoread equivalents of 8 species at 2-fold redundancy are 5 species at 3.2-fold redundancy (8 × 2/5) and 11 species at 1.5-fold redundancy (8 × 2/11).
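The isoread conversion is a one-line calculation, holding the total number of reads (species × fold redundancy) constant:

```python
def isoread_fold(base_species, base_fold, target_species):
    """Per-species redundancy for `target_species` species that uses
    the same total number of reads as `base_species` species at
    `base_fold` redundancy."""
    return base_species * base_fold / target_species
```

This reproduces the worked example in the text: 8 species at 2-fold is equivalent to 5 species at 3.2-fold or 11 species at about 1.5-fold.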

Phylogenetic Tree Generation and Selection of Species for Low-Redundancy Sequencing. The phylogenetic tree used for branch length calculations was derived by merging several published phylogenetic trees [large eutherian tree (11); primate-specific tree (12); noneutherian tree kindly provided by William Murphy (Texas A&M University, College Station; personal communication)]. This process is described below. Such merging was necessary because no one tree contained all of the species under consideration [e.g., primates, eutherian mammals, monotremes and marsupials, and all of the species currently selected for sequencing by the ENCODE project (13)]. Using the merged tree and taking into account the set of eutherian mammals whose genomes have been sequenced, we formulated a "theoretically optimal" set of eutherian mammals for future sequencing (14). This algorithm and its results are summarized in Appendix 4, and practical and biological considerations for species selection are described in Appendix 5.

The tree-merging process involved the following: Three trees relating various mammalian species, each of which was constructed by maximum-likelihood inference using multiple alignments of PCR-amplified gene fragments (see Fig. 6), were merged. These trees (with minor changes such as using common names for species or merging/splitting a few very closely related chimeric taxa) were provided in Newick format by William Murphy (personal communication). We assumed that, within each tree, the branch lengths were proportional to the actual number of substitutions per site that have accumulated at neutrally evolving sites, but that the proportionality constant may be different for the different trees. We first merged T1 and T3 to form T13, then merged T13 and T2 to form T123, and finally added individual species and alternate entries for synonymous species names (see Fig. 6 for T1, T2, and T3 nomenclature). The method for merging trees involved verifying topological compatibility, rescaling one of the trees by a factor computed by averaging over several branches shared by both trees, and finally cutting and pasting to replace an entire clade on one tree with a more comprehensive version of that clade from the other tree. The details of this merging process are provided in Appendix 6.
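The rescaling step can be sketched as averaging length ratios over branches shared by the two trees; the dict-of-branch-lengths input is our simplification (the actual trees were in Newick format), and the branch names are hypothetical:

```python
def rescale_factor(branches_a, branches_b):
    """Factor that puts tree B's branch lengths on tree A's scale:
    the mean of length ratios over branches present in both trees."""
    shared = sorted(set(branches_a) & set(branches_b))
    ratios = [branches_a[b] / branches_b[b] for b in shared]
    return sum(ratios) / len(ratios)
```

Multiplying every branch of tree B by this factor makes the shared branches agree (on average) with tree A, after which a clade from B can be cut and pasted into A.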

1. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, U.K.).

2. Schwartz, S., Kent, W. J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R. C., Haussler, D. & Miller, W. (2003) Genome Res. 13, 103-107.

3. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.

4. Margulies, E. H., NISC Comparative Sequencing Program, Maduro, V. V. B., Thomas, P. J., Tomkins, J. P., Amemiya, C. T., Luo, M. & Green, E. D. (2005) Proc. Natl. Acad. Sci. USA 102, 3354-3359.

5. International Mouse Genome Sequencing Consortium (2002) Nature 420, 520-562.

6. Jaffe, D. B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J. P., Zody, M. C. & Lander, E. S. (2003) Genome Res. 13, 91-96.

7. Batzoglou, S., Jaffe, D. B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J. P. & Lander, E. S. (2002) Genome Res. 12, 177-189.

8. Margulies, E. H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. & Green, E. D. (2003) Genome Res. 13, 2507-2518.

9. Margulies, E. H., NISC Comparative Sequencing Program & Green, E. D. (2004) Cold Spring Harbor Symp. Quant. Biol. 68, 255-263.

10. Thomas, J. W., Touchman, J. W., Blakesley, R. W., Bouffard, G. G., Beckstrom-Sternberg, S. M., Margulies, E. H., Blanchette, M., Siepel, A. C., Thomas, P. J., McDowell, J. C., et al. (2003) Nature 424, 788-793.

11. Murphy, W. J., Eizirik, E., O’Brien, S. J., Madsen, O., Scally, M., Douady, C. J., Teeling, E., Ryder, O. A., Stanhope, M. J., de Jong, W. W., et al. (2001) Science 294, 2348-2351.

12. Eizirik, E., Murphy, W. J., Springer, M. S. & O’Brien, S. J. (2004) in Anthropoid Origins: New Visions, eds. Ross, C. F. & Kay, R. F. (Kluwer, New York), pp. 45-64.

13. ENCODE Project Consortium (2004) Science 306, 636-640.

14. O’Brien, S. J., Eizirik, E. & Murphy, W. J. (2001) Science 292, 2264-2266.