Previous Article |
Table of Contents
| Next Article
BIOLOGICAL SCIENCES / APPLIED BIOLOGICAL SCIENCES / APPLIED MATHEMATICS
An algorithm for assembly of ordered restriction maps from single DNA molecules




*Department of Mathematics, University of Southern California, 3620 South Vermont Avenue, KAP 108, Los Angeles, CA 90089-2532;
Laboratory for Molecular and Computational Genomics, Departments of Genetics and Chemistry, University of Wisconsin, Biotechnology Center, 425 Henry Mall, Madison, WI 53706; and
Department of Computational Molecular Biology and Bioinformatics, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089-2910
Edited by Philip P. Green, University of Washington School of Medicine, Seattle, WA, and approved August 24, 2006 (received for review May 17, 2006)
| Abstract |
|---|
|
|
|---|
whole-genome shotgun optical mapping | map assembler
1 µm) on elongated DNA molecules. The combination of a charged glass surface and fluid flow guided by a microfluidic device (3) simultaneously elongates and deposits DNA random chains as well defined stripes within the device. Because a critical density of charge is maintained on these surfaces, absorbed and elongated molecules under tension uniquely "flag" restriction enzyme cleavage sites as visible gaps formed due to relaxation of adjacent DNA. The distance, or mass of each consecutive restriction fragment is determined by integrated fluorescence intensity measurements against an internal standard. Collectively, these actions produce oriented, labeled molecules that work in concert with downstream image processing, yielding a massive set of restriction maps as relatively compact data files. Due to the enormous throughput of this system, a genome is redundantly spanned by individual restriction maps supporting "shotgun" assembly techniques for whole genome analysis. However, genome assembly is inherently complicated by the fact that measurements are made on random individual DNA molecules, which cannot benefit from averaging steps intrinsic to bulk measurement techniques used by common DNA sequencing platforms; no amplification step is used during optical mapping. This finding places another level of complexity within the genome assembly step, where further error reduction must be must be performed after acquisition. Such errors are characterized as: (i) spurious, or false restriction sites, (ii) partial digestion, or missing cuts (where restriction sites are not observed in optical maps), (iii) small fragments (<2 kb) are underrepresented in maps, (iv) sizing error, and (v) chimeric maps that result from images of ambiguously overlapping DNA molecules. Ordered restriction maps reveal structural detail across a genome in ways that are only surpassed by DNA sequence data, and recent findings show prevalent structural variations in human populations (4, 5) with many loci linked to diseases. Also, cancer genomes are notoriously rife with aneuploidy and structural aberrations fostered by unchecked genomic instability, which when fully characterized at high-resolution present new routes for diagnostics (6) and treatment. Our current techniques for discovery of structural alterations are somewhat bound by the limitations imposed by DNA hybridization or cost (sequencing). For example, genomic microarrays do reveal deletions (7), but are confounded by common genomic repeats and cannot discern inversions (8). Furthermore, insertions and other genomic events not represented on a chip array cannot be assayed and go undiscovered. As such, ordered restriction maps broadly reveal genome structural events potentiating their discovery and physical characterization in one step. Accordingly, the scalable de novo assembly approach presented here will greatly facilitate the construction of physical maps for this emerging field of human and tumor biology.
| Prior Optical Mapping Algorithms |
|---|
|
|
|---|
Some formulations of this problem were demonstrated to be NP-hard (12), whereas others allowed polynomial time algorithms (11). These algorithms were designed for reconstruction of short restriction maps using cloned DNA substrates. These methods produced accurate consensus restriction maps but could not be applied to "shotgun" optical mapping data, which is most typical form of data in current optical mapping system. In "shotgun" optical mapping, maps of DNA molecules are produced by random shearing of genomic DNA. This implies that optical maps represent random parts of genome rather than identical DNA molecules.
Ananthraman et al. designed a Bayesian method that could accommodate shotgun optical mapping data by searching over a large space of order assignments, but their algorithm had deficiencies in terms of scalability to large genomes.¶ Consequently, application of this algorithm to genome assemblies more complex than bacteria required additional extensive ad hoc approaches.
As such, there is a need for new algorithms that are specifically designed for handling many computational issues inherent to the assembly of large genomes, such as plant and mammalian. Here, we approach optical map assembly problem by using an approach that is quite different from existing restriction map reconstruction algorithms. Our method utilizes an overlaplayoutconsensus strategy commonly used in existing DNA sequence assemblers because it proved to be practical even for assembling sequence reads from very large genomes. As a key step of our assembly method, we use a highly efficient error correction method to eliminate false positive overlaps and chimeric maps that otherwise render assembly problem highly ambiguous.
| DNA Sequence Assemblers |
|---|
|
|
|---|
| An Optical Map Assembly Method |
|---|
|
|
|---|
5 Mb), one microbial genome (34.5 Mb), one plant genome (430 Mb), and one human genome. Comparison of the two bacterial maps produced by our map assembly method to the known DNA sequences confirms the high accuracy of our method. The overlap structure that we employ for representation of region connectivity is capable of accommodating assemblies from polymorphic genomic regions such as those found in diploid organisms and populations of tumor cells with highly aberrant genomes. | Optical Mapping Measurements |
|---|
|
|
|---|
N(Y,
2Y) for
2
0.3), where Y represents the true genomic size of corresponding region of DNA (17). This finding implies, for example, that for a 20-kb DNA fragment, 80% of the measurements are within 3.3 kb of 20 kb. | Results |
|---|
|
|
|---|
|
500x coverage), and map to sequence alignments showed no false or missing cuts and only seven very small, missing restriction fragments (0.10.7 kb).
In terms of map assembly, the greater genome size and complexity of eukaryotic genomes dictate that significantly larger and more informative sets of optical maps must be considered. Accordingly, assemblies for Thalassiosira pseudonana, Oryza sativa ssp. japonica (rice), and Homo sapiens used only optical maps containing 15 or more restriction fragments as a strict filter for map quality, commesurate with the current throughput of the optical mapping system, to reduce the amount of computation while ensuring sufficient genome coverage. Looking at Table 1, we see very long computational times for calculating the overlaps for T. pseudonana and rice, but subsequent layout-consensus and refinement steps are rapid; taking only
1 h on a desktop computer. In terms of assembly accuracy of O. sativa ssp. japonica genome, of maps which aligned with the threshold to the reference genome, 91% of their total length matched by using local and gapped-local alignment. The nonaligned part concentrated most often in the end of the maps due to lower optical map coverage. Details of the alignments revealed that
1.4% of contig restriction sites did not align to any restriction site in the sequence (this corresponds to
1.4 extra sites per 1 Mb of DNA sequence). Also, 13% of genomic DNA sequence restriction sites did not match to any contig site (in the original optical map data set, 20% of sites are missing). Also, the vast majority of fragments <1 kb was not represented by our consensus maps, and 1,107 of 1,575 fragments between 1 and 3 kb were missing in our consensus maps. Finally, nine of 288 contigs produced alignment patterns consistent with misassemblies (we should note, however, that the published rice genome may also contain misassemblies because it is not entirely finished).
Following the same steps as before, the human genome assembly (from a tumor sample derived from a complete hydatidiform mole; S. Reslewic, personal communication) produced 219 assembled contigs (150 Mb) containing 10 or more optical maps. These were compared through local and gapped alignment to NCBI human build 35 (22) showing 171 (109 Mb) of 219 (124 Mb) contigs aligning with high scores (q score, 11). We attribute a low amount of assembled contig mass to low effective coverage of this human data set. Specifically, even when optical maps are aligned to the reference genome, only 20% of these maps score above threshold. In terms of accuracy of the assembled contigs, we observed 81 extra sites (one extra site per 1.3 Mb of reference sequence), and 210 missing sites of 6,153 reference map sites. Also, vast majority of fragments under 1 kb was not represented by our consensus maps, and 420 of 645 fragments between 1 and 3 kb were not represented. Finally, two assembled contigs showed patterns that appeared to be possible misassemblies.
| Discussion |
|---|
|
|
|---|
| Methods |
|---|
|
|
|---|
1/100 of a second on an average PC, and requires O(mn) storage. Given the potentially large number of pairwise comparisons required in the overlap step, a question arises as to whether something can be done to reduce the number of candidate pairs by some heuristic method without calculating all pairwise alignments. In sequencing, this problem is addressed by means of k-mer hashing (e.g., k = 23) that allows detection of a high percentage of correct overlaps by finding long word matches and thus avoids the expensive dynamic programming step for many pairs of sequence reads. The reason why this idea works well for sequence reads produced by the Sanger sequencing method is because the errors in sequence reads are usually somewhat rare. This rarity is due to the fact that averaging over a large population of clonal sequences results high sequence read accuracy (>95%). We investigated the possibility of geometric hashing of adjacent optical map fragments for quick elimination of spurious overlaps (23). However, because optical mapping is a single-molecule mapping technology, averaging is not embedded in the primary measurement, as it is in Sanger sequencing, which results in higher error frequency when compared with sequence reads. False cuts and especially missing cuts are ubiquitous in optical maps, thus requiring the tuple size k to be small to ensure a high probability of finding true overlaps. On the other hand, if k is chosen to be small, only a small number of nonoverlapping map pairs can be eliminated before the alignment step, and a large fraction of maps still need to be aligned. Therefore, we abandoned hashing in favor of exhaustive pairwise comparisons for the purpose of finding accurate overlaps.
Overlap Graph Construction. A directed overlap graph G = (V, E) is defined by a set of nodes V corresponding to individual optical maps taken in a particular orientation, and a set of directed edges E, corresponding to high-quality overlaps between optical maps. Initially, all pairwise overlaps of optical maps are calculated by using our likelihood-based scoring method (17). The quality of reported overlaps is based on the site match measure that we term "q score" (A.V., unpublished data). Overlaps with q scores exceeding a specified threshold are considered to be accurate and are selected for construction of the graph. Specifically, they are sorted according to their q score values and progressively added to the graph in decreasing order of significance. Each map is placed in a particular orientation (normal or reverse), and the graph is grown as more edges are added. At this point, orientation consistency is checked. If an edge, suggested for addition to the graph, connects two nodes within the same component of the graph, orientation of maps represented by these nodes must be consistent with orientation of maps within the suggested overlap. If orientation is inconsistent, the edge is not added to the graph and therefore not considered in further analysis. This step accomplishes elimination of false overlaps with inconsistent orientation. The idea behind this step is to embed correct overlaps into the graph as early as possible. False overlaps generally have low q scores; hence, by the time they are considered for the addition to the graph, we hope that enough accurate overlaps are embedded already to purge the spurious orientation-inconsistent overlaps from further consideration. Unlike the greedy merge procedure used in CAP (14), this step does not create a problem if a false edge is incorporated early in the graph. As will become clear below, our graph correction procedure will eliminate such an edge, so corresponding regions will remain unconnected instead of causing a spurious consensus map to be reported.
Previously, we explained that each map in the overlap graph appears in a particular orientation o
{N, R}, where N stands for normal orientation (fragments appear in the same order as in the map stored in the file) and R stands for reverse (fragments appear in the opposite order compared with the way the map is stored in the file). Furthermore, edges within the overlap graph can be of two types: containment edges and noncontainment edges. A containment edge connects two maps, one of which is contained by another through their pairwise alignment. More precisely, if map M1 represents genomic region G1 and map M2 represents genomic region G2, than map M1 contains map M2 if G1
G2. In this case, M1 is called master map, and map M2 is called a slave map. Therefore, noncontainment edges represent overlaps of maps that contain both common and unique regions.
Below we describe how we assign edge directions and weights in the overlap graph. Suppose that maps M1 and M2 appear in orientation o1 and o2 in their pairwise overlap. The edge weights in the overlap graph are given by estimates of genomic distances between optical map midpoints deduced from their overlap, so that weight (M
M
) = dist(M
, M
). Fig. 1 gives an illustration how this distance can be calculated for maps M1 and M2. Based on the positions of map centers C1 and C2, we can identify the largest common alignment block (u1, u2; v1, v2) that does not contain map centers C1 and C2. We use corresponding map regions with sizes B1 = |u2 u1|, B2 = |v2 v1|, A1 = |u1 C1|, and E2 = |C2 v2| to estimate the distance: dist(M
, M
) = (1) I{u1
C1, C2
v2} x (A1 + (B1 + B2)/2 + E2). Of course, all of the numbers are defined by the map orientation; for example, v2 = 
fi for M
and v2 = ||M2|| 
fi for M
. Here fi are fragment sizes of map M2 and k is the index of the site corresponding to v2. It is clear that changing the direction of the edge will change the sign of the edge weight, but not its absolute value. So, when constructing the graph, we choose edge directions so that edge weights are positive.
|
|
1 = o1 and
2 = o2. If the edge is consistent, it is included in the graph, otherwise it is skipped.
Elimination of False Edges with Consistent Orientation.
We can use the proposed genomic distance between optical maps to our advantage to eliminate orientation consistent false edges still present in the overlap graph. For every node Ni in the graph, we perform a depth-first search of specified depth taking only outgoing edges and collect all nodes Nj such that there exist multiple independent paths through the graph connecting nodes Ni and Nj. For each of those paths P
we compute its spanning genomic distance D
by adding weights of edges taken along the edges of each path. Distances from correct paths must be distributed according to the distance error model. In our previous analysis (17), we have established that for genomic region of size y, its size X, estimated from the optical map, is normally distributed with mean EX = y and variance Var(X) =
2y. Naturally, we adopt the same error model for the genomic distances between optical maps, because they are calculated by adding sizes of fragments within optical maps. Unfortunately, for a given pair of maps Mi and Mj, represented by nodes Ni and Nj, their true genomic distance is generally unknown. To overcome this limitation, we want to find the path P
with distance D
connecting Ni and Nj, that maximizes the size of the cluster of paths connecting Ni and Nj with distances found within
standard deviations
D
of D
. If there is more than one path within such a cluster (including the path corresponding to D
), then all of the edges within those paths are marked as "confirmed" (because we found multiple evidence of connectivity with agreeable distance). Unconfirmed edges are then eliminated from the overlap graph along with all isolated nodes.
Chimeric Map Elimination. Even after elimination of false edges, some chimeric maps may be present in the overlap graph. Chimeric maps have a very distinctive appearance in the overlap graph (Fig. 2), namely, a set of maps L(M) overlapping with the left part of chimeric map M and a set of maps R(M) overlapping with the right part of M must not be locally connected in the overlap graph other than through that chimeric map. This observation is based on the fact that right and left parts of map H must belong to different regions. Therefore, potential chimeric maps are identified as local articulation nodes, removal of which disconnects the local subgraph (in this case, edge direction is not important, because finding articulation points suffices). Provided enough optical map coverage, we can use this as a strategy for finding chimeric maps. From every node adjacent to map M, we perform a breadth first search of specified depth without taking paths through M to discover all immediate neighbors of M. If we fail, the node corresponding to M is removed from the graph.
Genomic distances between map centers as we described in this section provide simple, yet powerful measures that can be used to filter out overlap errors that make assembly problematic. Conveniently, this method can also be extended into assembly of sequences, where distance would represent nucleotide count between centers of sequence reads based on their overlap. A corresponding distance error model would be derived from parameters of the sequencing instrument, where the amount of sequencing errors will depend on the DNA content of a particular genomic region. Naturally, this distance would allow detection of false overlaps between sequence reads and aid the sequence assembly process.
Layout and Consensus Map Reconstruction.
After false edges and chimeric maps are eliminated from the overlap graph, the graph should break into components representing islands of overlapping maps. For each component of the graph, we want to extract a contig representing the genomic region corresponding to this island. In other words, we must find a cycle-free path through a subgraph maximizing the genomic distance spanned by this path. To do this, we first identify sources within the component and perform depth-first search from each source assigning the longest distance accumulated along the paths to each of the discovered nodes. Consider node Ni with weight wi corresponding to the longest spanning distance of the incoming path ending at that node. If Oi is a set of edges incoming to Ni, then Si gives the set of nodes for which edges from Oi are outgoing. The maximization recursion for Ni outgoing is given by
|
|
where wj is the the weight stored at the node Nj
Si and wi
j = dist(Ni, Nj) is the distance between maps Ni and Nj. The heaviest path is found by locating the node with the largest weight and the heaviest path ending at that node. The prescribed merging of optical maps within a corresponding island is thus given by the set of maps as they appear along this heaviest path starting from a relevant source node.
Each graph component yields a draft map by combining portions of corresponding optical maps based on their pairwise overlap relations (Fig. 3). Although the draft map is a concatenate of multiple optical maps, each fragment in it is given by a fragment from a single optical map. Therefore, a draft map inherently contains errors in the form of missing cuts, false cuts, and fragment size inaccuracies. These are subject to further correction, which we accomplish through a map refinement procedure (24). More specifically, maps from the corresponding overlap graph component are used to improve draft map accuracy by (i) removing false cuts, (ii) adding missing cuts, and (iii) reestimating fragment sizes (Fig. 4). To accomplish this, we align optical maps to the draft map and perform hypothesis testing to identify positions of draft map where sites need to be added and/or deleted. Fragment size reestimation is accomplished by taking average of fragments of optical maps corresponding to fragments of the draft map. This procedure is repeated until no further corrections can be made. During each iteration of the refinement procedure, optical maps are realigned to yield more accurate corrections. The refinement process converges rapidly and equilibrium is usually reached within 1013 iterations. The corrected consensus map is reported as a map representing the corresponding island.
|
|
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
To whom correspondence should be addressed. E-mail: valouev{at}usc.eduAuthor contributions: A.V. and M.S.W. designed research; A.V. performed research; A.V. and D.C.S. contributed new reagents/analytic tools; A.V., D.C.S., and S.Z. analyzed data; and A.V., D.C.S., and M.S.W. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS direct submission.
¶ Ananthraman, T., Schwartz, D, Mishra, B. The Seventh International Conference on Intelligent Systems for Molecular Biology, 1999. ![]()
© 2006 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||