An Eulerian path approach to DNA fragment assembly

Contributed by Michael S. Waterman
Abstract
For the last 20 years, fragment assembly in DNA sequencing followed the “overlap–layout–consensus” paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical “overlap–layout–consensus” approach in favor of a new euler algorithm that, for the first time, resolves the 20-year-old “repeat problem” in fragment assembly. Our main result is the reduction of fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. euler, in contrast to the celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.
Children like puzzles, and they usually assemble them by trying all possible pairs of pieces and putting together pieces that match. Biologists assemble genomes in a surprisingly similar way, the major difference being that the number of pieces is larger. For the last 20 years, fragment assembly in DNA sequencing mainly followed the “overlap–layout–consensus” paradigm (1–6). Trying all possible pairs of pieces corresponds to the overlap step, whereas putting the pieces together corresponds to the layout step of the fragment assembly. Our new euler algorithm is very different from this natural approach—we never even try to match the pairs of fragments, and we do not have the overlap step at all. Instead, we do a very counterintuitive (some would say childish) thing: we cut the existing pieces of a puzzle into even smaller pieces of regular shape. Although it indeed looks childish and irresponsible, we do it on purpose rather than for the fun of it. This operation transports the puzzle assembly from the world of a difficult Layout Problem into the world of the Eulerian Path Problem, with polynomial algorithms for puzzle assembly (in the context of DNA sequencing).
Although the classical approach culminated in some excellent programs (phrap, cap, tigr, and celera assemblers among them), critical analysis of the “overlap–layout–consensus” paradigm reveals some weak points. The major difficulty is that there is still no polynomial algorithm for the solution of the notorious “repeat problem” that amounts to finding the correct path in the overlap graph (layout step). Unfortunately, this problem is difficult to overcome in the framework of the “overlap–layout–consensus” approach. All the programs we tested made errors (up to 19% misassembled contigs) while assembling shotgun reads from the bacterial sequencing projects (Table 1; Fig. 1). These genomes were assembled despite the fact that there is no error-free fragment assembler today (by errors we mean incorrect assemblies rather than unavoidable base-calling errors). Biologists “pay” for these errors at the time-consuming finishing step (7).
How can one resolve these problems? Surprisingly enough, an unrelated area of DNA arrays provides a hint. Sequencing by Hybridization (SBH) is a 10-year-old idea that never became practical but (indirectly) created the DNA arrays industry. Conceptually, SBH is similar to fragment assembly; the only difference is that the “reads” in SBH are much shorter l-tuples. In fact, the first approaches to SBH (8, 9) followed the “overlap–layout–consensus” paradigm. However, even for error-free SBH data, the corresponding layout problem leads to the NP-complete Hamiltonian Path Problem. Pevzner (10) proposed a different approach that reduces SBH to an easy-to-solve Eulerian Path Problem in the de Bruijn graph.
Because the Eulerian path approach transforms a once difficult layout problem into a simple one, a natural question is: “Could the Eulerian path approach be applied to fragment assembly?” Idury and Waterman mimicked fragment assembly as an SBH problem (11) by representing every read of length n as a collection of n − l + 1 overlapping l-tuples (contiguous short strings of fixed length l). At first glance, this transformation of every read into a collection of l-tuples (breaking the puzzle into smaller pieces) is a very short-sighted procedure, because information about the sequencing reads is lost. However, the loss of information is minimal for large l and is well paid for by the computational advantages of the Eulerian path approach. In addition, lost information can be restored at later stages.
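To make the read-to-l-tuple transformation concrete, here is a minimal illustrative sketch (our own, not part of the published euler code); the function name `ltuples` is hypothetical:

```python
def ltuples(read, l):
    """Return the n - l + 1 overlapping l-tuples of a read of length n,
    the Idury-Waterman transformation of a read into SBH-style data."""
    return [read[i:i + l] for i in range(len(read) - l + 1)]

read = "ATGGCGT"          # n = 7
print(ltuples(read, 4))   # ['ATGG', 'TGGC', 'GGCG', 'GCGT']: 7 - 4 + 1 = 4 tuples
```

Every consecutive pair of these l-tuples overlaps in l − 1 letters, which is exactly what the de Bruijn graph construction below exploits.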
Unfortunately, the Idury–Waterman approach, although very promising, did not scale up well. The problem is that sequencing errors transform a simple de Bruijn graph (corresponding to an error-free SBH) into a tangle of erroneous edges. Moreover, repeats pose serious challenges even in the case of error-free data. This paper abandons the “overlap–layout–consensus” approach in favor of an Eulerian superpath approach. This reduction led to the euler software that generated error-free solutions for all large-scale assembly projects that were studied.
The difficulties with fragment assembly led to the introduction of double-barreled (DB) DNA sequencing (15, 16) that was first used in 1995 in the assembly of Haemophilus influenzae (17). The published fragment assemblers ignore the DB data, with the only exceptions being the celera assembler (Myers et al., ref. 18) and gigassembler (Kent and Haussler, ref. 19). Although a number of sequencing centers use heuristic procedures to utilize DB data, such approaches often fail for complex genomes. euler-db analyzes DB data in a new way by transforming mate-pairs “read_{1} – intermediate gap of length d – read_{2}” into mate-reads “read_{1} – intermediate DNA sequence of length d – read_{2}” in most cases (a roughly 2-fold increase in the effective read length). Assuming the clone length is 5 kb, it transforms the original fragment assembly problem with N reads of length 500 bp into a new problem with about N/2 reads of length 5,000 bp. From some perspective, euler-db provides an algorithmic shortcut for the still unsolved experimental problem of increasing the read length. This approach typically resolves all repeats except perfect repeats that are longer than the insert length.
New Ideas
The classical “overlap–layout–consensus” approach to fragment assembly is based on the notion of the overlap graph. The DNA sequence in Fig. 2a consists of four unique segments A, B, C, D, and one triple repeat R. Every read corresponds to a vertex in the overlap graph, and two vertices are connected by an edge if the corresponding reads overlap (Fig. 2b). The fragment assembly problem is thus cast as finding a path in the overlap graph visiting every vertex exactly once, a Hamiltonian Path Problem. The Hamiltonian Path Problem is NP-complete, and no efficient algorithms for solving it are known. This is why fragment assembly of highly repetitive genomes is a notoriously difficult problem.
This paper suggests an approach to the fragment assembly problem based on the notion of the de Bruijn graph. In an informal way, one can visualize the construction of the de Bruijn graph by representing a DNA sequence as a “thread” with repeated regions covered by a “glue” that “sticks” them together (Fig. 2c). The resulting de Bruijn graph (Fig. 2d) consists of 4 + 1 = 5 edges (we assume that the repeat edge is obtained by gluing three repeats and has multiplicity three). In this approach, every repeat corresponds to an edge rather than a collection of vertices in the layout graph.
One can see that the de Bruijn graph (Fig. 2d) is a much simpler representation of repeats than the overlap graph (Fig. 2b). What is more important is that the fragment assembly is now cast as finding a path visiting every edge of the graph exactly once, an Eulerian Path Problem. There are two Eulerian paths in the graph: one of them corresponds to the sequence reconstruction ARBRCRD, whereas the other one corresponds to the sequence reconstruction ARCRBRD. In contrast to the Hamiltonian Path Problem, the Eulerian Path Problem is easy to solve even for graphs with millions of vertices, because there exist linear-time Eulerian path algorithms (20). This is a fundamental difference between the euler algorithm and conventional approaches to fragment assembly.
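The toy example of Fig. 2 can be checked with a short sketch of Hierholzer's linear-time Eulerian path algorithm (an illustration under our own encoding of the graph, not the authors' implementation; the vertex names S, v, w, E are invented for the sketch):

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's linear-time algorithm on a directed multigraph.
    `edges` is a list of (label, u, v); returns edge labels in path order."""
    adj = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for label, u, v in edges:
        adj[u].append((label, v))
        out_deg[u] += 1
        in_deg[v] += 1
    # start at a vertex with one more outgoing than incoming edge, if any
    start = edges[0][1]
    for u in list(out_deg):
        if out_deg[u] - in_deg[u] == 1:
            start = u
    stack, path = [(None, start)], []
    while stack:
        label, u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            stack.pop()
            if label is not None:
                path.append(label)
    return path[::-1]

# toy multigraph of Fig. 2: unique segments A, B, C, D and repeat R of multiplicity 3
edges = [("A", "S", "v"), ("R", "v", "w"), ("R", "v", "w"), ("R", "v", "w"),
         ("B", "w", "v"), ("C", "w", "v"), ("D", "w", "E")]
print(eulerian_path(edges))  # one of the two valid reconstructions, e.g. A R C R B R D
```

The algorithm uses each of the seven edges exactly once; which of the two valid reconstructions it reports depends only on the order in which ties are broken.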
Although de Bruijn graphs have algorithmic advantages over overlap graphs, it is not clear how to construct de Bruijn graphs from collections of sequencing reads. The described “gluing” idea requires knowledge of the finished DNA sequence that is not available until the last step of the assembly, a Catch-22. Below we show how to construct de Bruijn graphs from the read data without knowing the finished DNA sequence.
The existing assembly algorithms postpone the consensus step (error correction in reads) until the end of the fragment assembly. euler reverses this practice by making consensus the first step of fragment assembly.
Let s be a sequencing read (with errors) derived from a genome G. If the sequence of G is known, then the error correction in s can be done by aligning the read s against the genome G. In real life, the sequence of G is not known until the last “consensus” step of the fragment assembly. It is another Catch-22: to assemble a genome, it is highly desirable to correct errors in reads first, but to correct errors in reads, one has to assemble the genome first. To bypass this Catch-22, let us assume that, although the sequence of G is unknown, the set G_{l} of all l-tuples present in G is known. Of course, G_{l} is also unknown, but G_{l} can be reliably approximated without knowing the sequence of G. Our error correction idea uses this approximation to correct errors in reads. In contrast to existing algorithms based on pairwise alignments, it utilizes the multiple alignments of short substrings to modify the original reads and to create a new instance of the fragment assembly problem with a greatly reduced number of errors.
Imagine an ideal situation where error correction has eliminated all errors, and we deal with a collection of error-free reads. Is there an algorithm to reliably assemble such error-free reads in a large-scale sequencing project? At first glance, the problem looks simple but, surprisingly enough, the answer is no: we are unaware of any algorithm that solves this problem. For example, phrap, cap3, and tigr assemblers (with default settings) make 17, 14, and 9 assembly errors, respectively, while assembling real reads from the NM genome. All these algorithms still make errors while assembling the error-free reads from the NM genome (although the number of errors reduces to 5, 4, and 2, respectively). euler made no assembly errors and produced fewer contigs with real data than other programs produced with error-free data.
To achieve such accuracy, euler has to restore information about sequencing reads that was lost in the construction of the de Bruijn graph. Our Eulerian Superpath idea addresses this problem. Every sequencing read corresponds to a path in the de Bruijn graph called a read-path, and the fragment assembly problem corresponds to finding an Eulerian path that is consistent with all read-paths, an Eulerian Superpath Problem.
Error Correction and Data Corruption
Sequencing errors make implementation of the SBH-style approach to fragment assembly difficult. To bypass this problem, we reduce the error rate by a factor of 35–50 and make the data almost error-free by solving the Error Correction Problem. We use the NM sequencing project (13) as an example. NM is one of the most “difficult-to-assemble” and “repeat-rich” bacterial genomes completed so far. It has 126 long nearly perfect repeats up to 3,832 bp in length (not to mention many imperfect repeats). The length of the genome is 2,184,406 bp. The sequencing project resulted in 53,263 reads (coverage is 9.7), with 255,631 errors distributed over these reads. This amounts to 4.8 errors per read (an error rate of 1.2%).
Our error correction procedure uses an approximation of G_{l} (the set of all l-tuples in genome G) rather than the sequence G to correct sequencing errors. An l-tuple is called solid if it belongs to more than M reads (where M is a threshold) and weak otherwise. A natural approximation for G_{l} is the set of all solid l-tuples from a sequencing project.
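As an illustration of the solid/weak distinction (a hypothetical helper, not euler itself), one can classify l-tuples by the number of reads supporting them:

```python
from collections import Counter

def classify_ltuples(reads, l, M):
    """Split the l-tuples of a read set into solid (present in > M reads)
    and weak. The solid set approximates G_l, the l-tuple set of the
    unknown genome."""
    counts = Counter()
    for read in reads:
        # count each l-tuple once per read it occurs in
        for t in set(read[i:i + l] for i in range(len(read) - l + 1)):
            counts[t] += 1
    solid = {t for t, c in counts.items() if c > M}
    weak = set(counts) - solid
    return solid, weak

reads = ["ATGGC", "ATGGC", "ATGAC"]   # one read carries a putative error (A vs G)
solid, weak = classify_ltuples(reads, 3, 1)
print(sorted(solid))   # ['ATG', 'GGC', 'TGG']: 3-tuples supported by >1 read
print(sorted(weak))    # ['GAC', 'TGA']: 3-tuples seen in a single read only
```

With adequate coverage, l-tuples induced by a sequencing error typically occur in one read only, which is what makes the solid set a reasonable proxy for G_{l}.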
Let T be a collection of l-tuples called a spectrum. A string s is called a T-string if all its l-tuples belong to T. Our approach to error correction leads to the following.
Spectral Alignment Problem.
Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.
A similar problem was considered by Pe'er and Shamir (21) in a different context of resequencing by hybridization. In the context of error corrections, the solution of the Spectral Alignment Problem makes sense only if the number of mutations is small. In this case, the Spectral Alignment Problem can be efficiently solved by dynamic programming even for large l.
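The paper solves spectral alignment by dynamic programming; the brute-force sketch below, feasible only for tiny examples, conveys the definition (the function names are our own, not euler's):

```python
from itertools import combinations, product

def is_t_string(s, T, l):
    """True if every l-tuple of s belongs to the spectrum T."""
    return all(s[i:i + l] in T for i in range(len(s) - l + 1))

def spectral_align(s, T, l, max_mut=2):
    """Minimum number of substitutions turning s into a T-string, or None if
    more than max_mut are needed. Exhaustive search over mutation sets; the
    paper uses dynamic programming, which scales to realistic reads."""
    for k in range(max_mut + 1):
        for pos in combinations(range(len(s)), k):
            for subs in product("ACGT", repeat=k):
                t = list(s)
                for p, b in zip(pos, subs):
                    t[p] = b
                if is_t_string("".join(t), T, l):
                    return k
    return None

T = {"ATG", "TGG", "GGC", "GCA"}       # spectrum of the (unknown) genome ATGGCA
print(spectral_align("ATGGCA", T, 3))  # 0: already a T-string
print(spectral_align("ATGTCA", T, 3))  # 1: one substitution (T -> G) suffices
```

As noted above, the answer is meaningful for error correction only when the reported number of mutations is small.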
Spectral alignment of a read against the set of all solid l-tuples from a sequencing project suggests the error corrections that may change the sets of weak and solid l-tuples. Iterative spectral alignments with the set of all reads and all solid l-tuples gradually reduce the number of weak l-tuples, increase the number of solid l-tuples, and lead to elimination of many errors in bacterial sequencing projects. Although the Spectral Alignment Problem helps to eliminate errors (and we use it as one of the steps in euler), it does not adequately capture the specifics of DNA sequencing.
The Error Correction Problem described below is a better model for fragment assembly that leads to elimination of ≈97% of errors in a typical bacterial project.
We found that some reads from the NM project have very poor spectral alignment. These reads are likely to represent contamination, vector, isolated reads, or an error in the sequencing pipeline. It is a common practice in sequencing centers to discard such “poor-quality” reads, and we adopt this approach. Another important advantage of spectral alignment is the ability to identify chimeric reads. Such reads are characterized by good spectral alignments of the prefix and suffix parts (6), which, however, cannot be extended to a good spectral alignment of the entire read.
Given a collection of reads (strings) S = {s_{1}, …, s_{n}} from a sequencing project and an integer l, the spectrum of S is a set S_{l} of all l-tuples from the reads s_{1}, …, s_{n} and s̄_{1}, …, s̄_{n}, where s̄ denotes the reverse complement of read s. Let Δ be an upper bound on the number of errors in each DNA read. A more adequate approach to error correction motivates the following.
Error Correction Problem.
Given S, Δ, and l, introduce up to Δ corrections in each read in S in such a way that the size of S_{l} is minimized.
An error in a read s affects at most l l-tuples in s and l l-tuples in s̄ and usually creates 2l erroneous l-tuples that point to the same sequencing error (2d for positions within a distance d < l from the endpoint of the reads). A greedy approach to the Error Correction Problem is to look for error corrections in the reads that reduce the size of S_{l} by 2l (or 2d for positions close to the endpoints). This simple procedure already eliminates 86.5% of the errors in sequencing reads. euler uses a more involved approach that eliminates 97.7% of sequencing errors and transforms the original sequencing data with 4.8 errors per read on average into almost error-free data with 0.11 errors per read on average (22).
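A minimal sketch of the greedy idea, assuming substitutions only, ignoring reverse complements, and recomputing the spectrum naively (illustrative, and far less involved than euler's actual procedure):

```python
from collections import Counter

def spectrum(reads, l):
    """Multiset of l-tuples of a read collection (reverse complements omitted)."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - l + 1):
            c[r[i:i + l]] += 1
    return c

def greedy_correct(reads, l):
    """One pass of the greedy heuristic: accept a substitution in a read if it
    shrinks the number of distinct l-tuples in S_l. A sketch, not EULER itself."""
    reads = list(reads)
    for idx, r in enumerate(reads):
        for pos in range(len(r)):
            for b in "ACGT":
                if b == r[pos]:
                    continue
                cand = r[:pos] + b + r[pos + 1:]
                trial = reads[:idx] + [cand] + reads[idx + 1:]
                if len(spectrum(trial, l)) < len(spectrum(reads, l)):
                    reads[idx] = cand
                    break  # at most one correction per position
    return reads

reads = ["ATGGCA", "ATGGCA", "ATGTCA"]   # third read has an error at position 3
print(greedy_correct(reads, 3))          # the erroneous T is corrected back to G
```

Correcting the error collapses seven distinct 3-tuples into four, which is exactly the signal the greedy criterion rewards.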
A word of caution is in order. Our error-correction procedure is not perfect when deciding which nucleotide, say A or T, is correct in a given l-tuple within a read. If the correct nucleotide is A, but T is also present in some reads covering the same region, the error-correction procedure may assign T instead of A to all reads, i.e., introduce an error rather than correct it. Because our algorithm sometimes introduces errors, data corruption is probably a more appropriate name for this approach! Introducing an error in a read is not such a bad thing, as long as the errors from overlapping reads covering the same position are consistent (i.e., they correspond to a single mutation in a genome). An important insight is that, at this stage of the algorithm, we do not care much whether we correct or introduce errors in the sequencing reads. From an algorithmic perspective, introducing a consistent error that simply corresponds to changing a nucleotide in a final assembly is not a big deal. It is much more important to make sure that we eliminate the competition between A and T at this stage, thus reducing the complexity of the de Bruijn graph. In this way, we eliminate false edges in our graph and deal with this problem later: the correct nucleotides are easily reconstructed either by a majority rule or by a variation of the Churchill–Waterman algorithm (23). For the NM sequencing project, orphan elimination corrects 234,410 errors and introduces 1,452 errors.
Eulerian Superpaths
Given a set of reads S = {s_{1}, …, s_{n}}, define the de Bruijn graph G(S_{l}) with vertex set S_{l−1} (the set of all (l − 1)-tuples from S) as follows. An (l − 1)-tuple v ∈ S_{l−1} is joined by a directed edge with an (l − 1)-tuple w ∈ S_{l−1} if S_{l} contains an l-tuple whose first l − 1 nucleotides coincide with v and whose last l − 1 nucleotides coincide with w. Each l-tuple from S_{l} corresponds to an edge in G. If S contains a single sequence s_{1}, then this sequence corresponds to a path visiting each edge of the de Bruijn graph, a Chinese Postman path (20). The Chinese Postman Problem is closely related to the problem of finding a path visiting every edge of a graph exactly once, an Eulerian Path Problem (24). One can transform the Chinese Postman Problem into the Eulerian Path Problem by introducing multiplicities of edges in the de Bruijn graph. For example, one can substitute every edge in the de Bruijn graph by k parallel edges, where k is the number of times the edge is used in the Chinese Postman path. If S contains a single sequence s_{1}, this operation creates k “parallel” edges for every l-tuple repeating k times in s_{1} (23). Finding Eulerian paths is a well-known problem that can be efficiently solved in linear time. We assume that S contains the complement of every read and that the de Bruijn graph can be partitioned into two subgraphs (the “canonical” one and its reverse complement).
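The construction of G(S_{l}) can be sketched directly from the definition (an illustrative helper, not the authors' code; multiplicities are kept by listing repeated edges):

```python
def de_bruijn(reads, l):
    """Build the de Bruijn graph G(S_l): vertices are (l-1)-tuples, and each
    l-tuple contributes a directed edge prefix -> suffix (with multiplicity)."""
    edges = []
    for r in reads:
        for i in range(len(r) - l + 1):
            t = r[i:i + l]
            edges.append((t[:-1], t[1:]))   # (l-1)-prefix -> (l-1)-suffix
    return edges

# a single sequence yields a path visiting every edge (a Chinese Postman path)
print(de_bruijn(["ATGGCA"], 3))
# [('AT', 'TG'), ('TG', 'GG'), ('GG', 'GC'), ('GC', 'CA')]
```

Because each read contributes its l-tuples as consecutive edges, a read traces a connected walk in this graph, which is the read-path used below.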
With real data, the errors hide the correct path among many erroneous edges. The graph corresponding to the error-free data from the NM project has 4,039,248 vertices (roughly twice the length of the genome), whereas the graph corresponding to real sequencing reads has 9,474,411 vertices (for 20-tuples). After the error-correction procedure, this number is reduced to 4,081,857.
A vertex v is called a source if indegree(v) = 0, a sink if outdegree(v) = 0, and a branching vertex if indegree(v) · outdegree(v) > 1. For the NM genome, the de Bruijn graph has 502,843 branching vertices for original reads (for l-tuple size 20). Error corrections simplify this graph and lead to a graph with 382 sources and sinks and 12,175 branching vertices. Because the de Bruijn graph gets very complicated even in the error-free case, taking into account the information about which l-tuples belong to the same reads (information that was lost after the construction of the de Bruijn graph) helps us to untangle this graph.
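A sketch of how branching vertices delimit the graph's structure: the hypothetical helper below extracts maximal non-branching paths, whose internal vertices have indegree = outdegree = 1 (isolated cycles are omitted for brevity; this is our illustration, not euler's code):

```python
from collections import defaultdict

def non_branching_paths(edges):
    """Maximal non-branching paths of a directed graph. Internal vertices have
    indegree = outdegree = 1; such a path between branching vertices is a
    candidate repeat in the de Bruijn graph."""
    out_adj, in_deg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        out_adj[u].append(v)
        in_deg[v] += 1
    def one_in_one_out(v):
        return in_deg[v] == 1 and len(out_adj[v]) == 1
    paths = []
    for u in list(out_adj):
        if not one_in_one_out(u):              # a path must start at a branching end
            for v in out_adj[u]:
                path = [u, v]
                while one_in_one_out(path[-1]):
                    path.append(out_adj[path[-1]][0])
                paths.append(path)
    return paths

# v1 -> v2 -> v3 is a repeat: two entrances into v1, two exits from v3
edges = [("a", "v1"), ("b", "v1"), ("v1", "v2"), ("v2", "v3"),
         ("v3", "c"), ("v3", "d")]
print(non_branching_paths(edges))  # five paths, including the repeat ['v1', 'v2', 'v3']
```

The middle path of this toy graph is exactly the kind of repeat defined next: two entrances, two exits, and no branching inside.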
A path v_{1} … v_{n} in the de Bruijn graph is called a repeat if indegree(v_{1}) > 1, outdegree(v_{n}) > 1, and indegree(v_{i+1}) = outdegree(v_{i}) = 1 for 1 ≤ i ≤ n − 1 (Fig. 3). Edges entering the vertex v_{1} are called entrances into a repeat, whereas edges leaving the vertex v_{n} are called exits from a repeat. An Eulerian path visits a repeat a few times, and every such visit defines a pairing between an entrance and an exit. Repeats may create problems in fragment assembly, because there are a few entrances into a repeat and a few exits from a repeat, but it is not clear which exit is visited after which entrance in the Eulerian path. A read-path covers a repeat if it contains an entrance into and an exit from this repeat. Every covering read-path reveals some information about the correct pairings between entrances and exits. A repeat is called a tangle if there is no read-path containing this repeat (Fig. 3). Tangles create problems in fragment assembly, because pairings of entrances and exits in a tangle cannot be resolved via the analysis of read-paths. To address this issue, we formulate the following generalization of the Eulerian Path Problem:
Eulerian Superpath Problem.
Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.
To solve the Eulerian Superpath Problem, we transform both the graph G and the system of paths 𝒫 in this graph into a new graph G_{1} with a new system of paths 𝒫_{1}. Such a transformation is called equivalent if there exists a one-to-one correspondence between Eulerian superpaths in (G, 𝒫) and (G_{1}, 𝒫_{1}). Our goal is to make a series of equivalent transformations that lead to a system of paths 𝒫_{k}, with every path being a single edge. Because all transformations on the way from (G, 𝒫) to (G_{k}, 𝒫_{k}) are equivalent, every solution of the Eulerian Path Problem in (G_{k}, 𝒫_{k}) provides a solution of the Eulerian Superpath Problem in (G, 𝒫).
Below, we describe a simple equivalent transformation that solves the Eulerian Superpath Problem in the case when the graph G has no multiple edges. Let x = (v_{in}, v_{mid}) and y = (v_{mid}, v_{out}) be two consecutive edges in graph G, and let 𝒫_{x,y} be a collection of all paths from 𝒫 that include both these edges as a subpath. Informally, an x, y-detachment bypasses the edges x and y via a new edge z and directs all paths in 𝒫_{x,y} through z, thus simplifying the graph. However, this transformation affects other paths and needs to be defined carefully. Define 𝒫_{→x} as a collection of paths from 𝒫 that end with x and 𝒫_{y→} as a collection of paths from 𝒫 that start with y. The x, y-detachment is a transformation that adds a new edge z = (v_{in}, v_{out}) and deletes the edges x and y from G (Fig. 4a). This detachment alters the system of paths 𝒫 as follows: (i) substitute z for x, y in all paths from 𝒫_{x,y}, (ii) substitute z for x in all paths from 𝒫_{→x}, and (iii) substitute z for y in all paths from 𝒫_{y→}. Because every detachment reduces the number of edges in G, the detachments will eventually shorten all paths from 𝒫 to single edges and will reduce the Eulerian Superpath Problem to the Eulerian Path Problem.
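A minimal sketch of the x, y-detachment for the no-multiple-edge case, representing edges as vertex pairs and paths as edge lists (our own toy encoding, not the authors' data structures):

```python
def detach(edges, paths, x, y):
    """x,y-detachment sketch for a graph without multiple edges: replace the
    consecutive edges x = (v_in, v_mid), y = (v_mid, v_out) by z = (v_in, v_out)
    and reroute the affected paths, following rules (i)-(iii) of Fig. 4a."""
    z = (x[0], y[1])
    edges = [e for e in edges if e not in (x, y)] + [z]
    new_paths = []
    for p in paths:
        q, i = [], 0
        while i < len(p):
            if p[i] == x and i + 1 < len(p) and p[i + 1] == y:
                q.append(z); i += 2          # (i) paths through x, y now go through z
            elif p[i] == x and i + 1 == len(p):
                q.append(z); i += 1          # (ii) paths ending with x
            elif p[i] == y and i == 0:
                q.append(z); i += 1          # (iii) paths starting with y
            else:
                q.append(p[i]); i += 1
        new_paths.append(q)
    return edges, new_paths

x, y = ("vin", "vmid"), ("vmid", "vout")
edges = [("a", "vin"), x, y, ("vout", "b")]
paths = [[("a", "vin"), x, y], [y, ("vout", "b")]]
print(detach(edges, paths, x, y))  # both paths now run through z = ('vin', 'vout')
```

After the detachment, both example paths pass through the new edge z, and the graph has one edge fewer, which is the progress measure the argument above relies on.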
However, in the case of graphs with multiple edges, the detachment procedure may lead to errors, because “directing” all paths from the set 𝒫_{→x} through a new edge z may not be an equivalent transformation. In this case, the edge x may be visited many times in the Eulerian path, and it may or may not be followed by the edge y on some of these visits.
For illustration purposes, let us consider a simple case when the vertex v_{mid} has the only incoming edge x = (v_{in}, v_{mid}) with multiplicity 2 and two outgoing edges y1 = (v_{mid}, v_{out1}) and y2 = (v_{mid}, v_{out2}), each with multiplicity 1. In this case, the Eulerian path visits the edge x twice; in one case, it is followed by y1, and in another case, it is followed by y2. Consider an x, y1-detachment that adds a new edge z = (v_{in}, v_{out1}) after deleting the edge y1 and one of two copies of the edge x. This detachment (i) shortens all paths in 𝒫_{x,y1} by substitution of x, y1 by a single edge z, and (ii) substitutes z for y1 in every path from 𝒫_{y1→}. This detachment is an equivalent transformation if the set 𝒫_{→x} is empty. However, if 𝒫_{→x} is not empty, it is not clear whether the last edge of a path P ∈ 𝒫_{→x} should be assigned to the edge z or to the (remaining copy of) edge x.
To resolve this dilemma, one has to analyze every path P ∈ 𝒫_{→x} and decide whether it “relates” to 𝒫_{x,y1} (in this case, it should be directed through z) or to 𝒫_{x,y2} (in this case, it should be directed through x). By “relates” to 𝒫_{x,y1} (𝒫_{x,y2}), we mean that every Eulerian superpath visits y1 (y2) immediately after visiting P.
Two paths are called consistent if their union is a path again. A path P is consistent with a set of paths 𝒫 if it is consistent with all paths in 𝒫 and inconsistent otherwise (i.e., if it is inconsistent with at least one path in 𝒫). There are three possibilities (Fig. 5): (i) P is consistent with exactly one of the sets 𝒫_{x,y1} and 𝒫_{x,y2}, (ii) P is inconsistent with both 𝒫_{x,y1} and 𝒫_{x,y2}, and (iii) P is consistent with both 𝒫_{x,y1} and 𝒫_{x,y2}.
In the first case, the path P is called resolvable, because it can be unambiguously related to either 𝒫_{x,y1} or 𝒫_{x,y2}. An edge x is called resolvable if all paths in 𝒫_{→x} are resolvable. If the edge x is resolvable, then the described x, y-detachment is an equivalent transformation after the correct assignments of last edges in every path from 𝒫_{→x}. In our analysis of the NM project, we found that 18,026 among 18,962 edges in the de Bruijn graph are resolvable.
The second condition implies that the Eulerian Superpath Problem has no solution, because P, 𝒫_{x,y1}, and 𝒫_{x,y2} impose three different scenarios for just two visits of the edge x. After discarding the poorquality and chimeric reads, we did not encounter this condition in the NM project.
The last condition (P is consistent with both 𝒫_{x,y1} and 𝒫_{x,y2}) corresponds to the most difficult situation. If this condition holds for at least one path in 𝒫_{→x}, the edge x is called unresolvable, and we postpone analysis of this edge until all resolvable edges are analyzed. We observed that equivalent transformation of other resolvable edges often resolves previously unresolvable edges. However, some edges cannot be resolved even after the detachments of all resolvable edges are completed. Such situations usually correspond to tangles, and they have to be addressed by another equivalent transformation called a cut.
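The consistency test that drives these case distinctions can be sketched as follows (an illustrative predicate under a simplified model in which paths are edge lists and consistency means agreement on every aligned overlap):

```python
def consistent(p, q):
    """Two edge paths are consistent if their union is again a path, i.e. they
    agree wherever they overlap (one contains the other, or a suffix of one
    equals a prefix of the other)."""
    for shift in range(-len(q) + 1, len(p)):
        # align q against p starting at offset `shift` and compare the overlap
        overlap = [(p[i], q[i - shift])
                   for i in range(max(0, shift), min(len(p), shift + len(q)))]
        if overlap and all(a == b for a, b in overlap):
            return True
    return False

print(consistent(["e1", "x", "y1"], ["x", "y1", "e2"]))  # True: union e1 x y1 e2
print(consistent(["e1", "x", "y1"], ["x", "y2"]))        # False: y1 vs y2 conflict
```

In the second example the two paths disagree on which exit follows x, which is precisely the kind of conflict that makes a path relate to 𝒫_{x,y1} but not to 𝒫_{x,y2}.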
Consider a fragment of graph G with 5 edges and 4 paths y3 − x, y4 − x, x − y1, and x − y2 (Fig. 4b). In this symmetric situation, x is a tangle, and there is no information available to relate any of the paths y3 − x and y4 − x to any of the paths x − y1 and x − y2. An edge x = (v, w) is removable if (i) it is the only outgoing edge for v and the only incoming edge for w, and (ii) x is either the initial or the terminal edge for every path P ∈ 𝒫 containing x. An x-cut transforms 𝒫 into a new system of paths by simply removing x from all paths in 𝒫_{→x} and 𝒫_{x→} without affecting the graph G itself (Fig. 4b). Obviously, an x-cut is an equivalent transformation if x is a removable edge.
Detachments and cuts proved to be powerful techniques to untangle the de Bruijn graph and to reduce the fragment assembly to the Eulerian Path Problem for all studied bacterial genomes. However, there is still a gap in the theoretical analysis of the Eulerian Superpath Problem in the case when the system of paths is not amenable to either detachments or cuts.
euler with CloneEnd Data
One hundred twenty-six long nearly perfect repeats in the NM genome tangle the de Bruijn graph and make it difficult to analyze. euler-db untangles this graph by using the clone-end DB data. euler-db maps every read into some edge(s) of the de Bruijn graph. After this mapping, most mate-pairs of reads correspond to paths that connect the positions of these reads in the de Bruijn graph (provided the distance between these positions in the graph is approximately equal to the estimated distance between reads from the mate-pairs). euler-db views such paths as long artificial mate-reads and analyzes them with the same Eulerian Superpath algorithm that is used for the analysis of standard reads.
Mapping reads into the de Bruijn graph allows one to identify errors in the DB data. In most cases, both reads r_{1} and r_{2} from the mate-pair (r_{1}, r_{2}) are mapped to the same graph component. In this case, one can find a path between these reads and compare the length d(r_{1}, r_{2}) of this path with the estimated distance l(r_{1}, r_{2}) between clone ends. In most cases, such a path is unique, and its length approximately matches the clone length [d(r_{1}, r_{2}) ≈ l(r_{1}, r_{2})]. In this case, we reconstruct the intermediate sequence between the reads and transform the mate-pair into a mate-read. In case the difference between d(r_{1}, r_{2}) and l(r_{1}, r_{2}) is beyond the acceptable variation in the clone length, it most likely indicates an error in the DB data. In the case of multiple paths between r_{1} and r_{2} in the de Bruijn graph, we transform the mate-pair into the corresponding mate-read if the clone length l(r_{1}, r_{2}) matches the length of exactly one path between r_{1} and r_{2}.
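A sketch of this mate-pair check on a toy acyclic graph with edge lengths (the vertex names and the ±10% tolerance are our own assumptions): enumerate paths whose length falls within the acceptable clone-length window and upgrade the pair only when exactly one path fits:

```python
from collections import defaultdict

def paths_within(adj, lengths, u, v, lo, hi):
    """All paths from u to v whose total edge length lies in [lo, hi]; a
    mate-pair is upgraded to a mate-read only when exactly one path fits.
    Assumes the explored fragment of the graph is acyclic."""
    out = []
    def walk(node, dist, trail):
        if dist > hi:
            return                       # prune: already longer than the window
        if node == v and lo <= dist <= hi:
            out.append(trail)
        for nxt in adj[node]:
            walk(nxt, dist + lengths[(node, nxt)], trail + [nxt])
    walk(u, 0, [u])
    return out

adj = defaultdict(list, {"r1": ["a", "b"], "a": ["r2"], "b": ["r2"]})
lengths = {("r1", "a"): 2000, ("a", "r2"): 3000,
           ("r1", "b"): 1000, ("b", "r2"): 1500}
# clone length ~5,000 +/- 10%: exactly one path fits, so the pair becomes a mate-read
print(paths_within(adj, lengths, "r1", "r2", 4500, 5500))  # [['r1', 'a', 'r2']]
```

Widening the window until a second path fits would make the pair ambiguous, in which case the sketch (like the procedure above) would leave the mate-pair untransformed.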
We emphasize important differences between our approach and other DB assemblers. The celera assembler masks repeats, generates a large set of contigs, and pieces these contigs together by using the DB data. There are two types of contigs subject to the DB step of celera's algorithm: G-contigs flanked by gaps in read coverage (on at least one side) and R-contigs flanked by repeats (on both sides). In a genome with many repeats, the number of R-contigs may significantly exceed the number of G-contigs. For example, in the NM project, phrap generates 160 contigs, and only half of them are G-contigs, whereas in the L. lactis project, phrap generates 62 contigs, and only 7 of them are G-contigs.
The celera assembler provides an excellent solution for assembly of G-contigs, but it does not distinguish between the two types of contigs. After masking repeats, the celera assembler generates a large number of contigs; only a small portion of them are G-contigs. Because the resulting contigs are short, this leads to a rather complicated DB step (Myers et al., ref. 18) that may cause disassembly for short contigs with limited DB data.
In contrast to other DB assemblers, euler-db distinguishes between G-contigs and R-contigs. euler-db does not mask repeats but instead uses them (combined with DB data) as a powerful tool for resolving R-contigs. The NM sequencing project illustrates the advantages of using (rather than masking) repeats in fragment assembly. euler-db reduces the number of contigs from 149 to 117 if only mate-pairs from non-repeated regions are used. Use of mate-pairs that partially overlap the repeats further reduces the number of contigs to 91 (Fig. 1). euler-db typically resolves all tangles except tangles that are longer than the length of the insert. In the NM project (with insert length up to 1,800), euler left only 5 unresolved tangles of length 3,610, 3,215, 2,741, 2,503, and 1,831.
After completing euler-db, we build scaffolds by using DB data as “bridges” between different contigs (euler-sf). euler-sf combines the 91 contigs into 60 scaffolds, thus closing most gaps that are shorter than the insert length and further simplifying the finishing step (Fig. 1). Myers et al. (18) described an excellent solution of the scaffolding problem. Recently, Kent and Haussler (19) described a different greedy approach to the scaffolding problem that resulted in a successful draft assembly of the human genome. euler-sf uses a similar strategy, but it scaffolds a smaller set of longer contigs (mainly G-contigs). See ref. 25 for details.
Conclusions
Finishing is a bottleneck in large-scale DNA sequencing. Of course, finishing is an unavoidable step to extend the islands and close the gaps in read coverage. However, existing programs produce many more contigs than the number of islands, thus making finishing more complicated than necessary. What is worse, these contigs are often assembled incorrectly, thus leading to the time-consuming contig verification step. Even a single misassembly forces biologists to conduct total genome screening for assembly errors. euler bypasses the “repeat problem,” because the Eulerian Superpath approach transforms imperfect repeats into different paths in the de Bruijn graph. As a result, euler does not even notice repeats unless they are long perfect repeats.
Difficulties in resolving repeats led to the introduction of DB DNA sequencing and the breakthrough sequencing efforts reported by the Human Genome Program (26) and celera (27). The celera assembler is a two-stage procedure that includes masking repeats at the overlap–layout–consensus stage with further ordering of contigs via DB data. The DB step of the celera assembler is influenced by the previous “overlap–layout–consensus” step. All “overlap–layout–consensus” algorithms are “afraid” of repeats, and the celera assembler has no choice but to mask the repeats. euler does not require masking the repeats but instead provides a clear view of repeats (tangles) in the genome. These tangles may be resolved by DB information (euler-db). In addition, euler has excellent scaling potential for eukaryotic genomes, because there exist linear-time algorithms for the Eulerian Path Problem.
Acknowledgments
We are grateful to Herbert Fleischner, Michael Fonstein, Dmitry Frishman, Eugene Goltsman, David Harper, Alexander Karzanov, Zufar Mulyukov, Julian Parkhill, Steven Salzberg, and Alexei Sorokine for useful discussions and providing sequencing data.
Abbreviations
SBH, Sequencing by Hybridization; DB, double-barreled; NM, N. meningitidis.
 Accepted June 7, 2001.
 Copyright © 2001, The National Academy of Sciences
References
Green, P.
Bonfield, J. K., Smith, K. F. & Staden, R.
Sutton, G., White, O., Adams, M. & Kerlavage, A.
Huang, X. & Madan, A.
Gordon, D., Abajian, C. & Green, P.
Weber, J. & Myers, G.
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., et al.
Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz, S. A., Mobarry, C. M., Reinert, K. H., Remington, K. A., et al.
Kent, W. & Haussler, D. (2001) Genome Res., in press.
Fleischner, H.
Pe'er, I. & Shamir, R.
Pevzner, P. A. & Waterman, M. S.
Pevzner, P. A.
Pevzner, P. A. & Tang, H. (2001) Bioinformatics 17, in press.
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al.