A binary search approach to whole-genome data analysis

Contributed by Eviatar Nevo, August 2, 2010 (sent for review February 6, 2010)
September 10, 2010
107 (39) 16893-16898

Abstract

A sequence analysis-oriented binary search-like algorithm was transformed to a sensitive and accurate analysis tool for processing whole-genome data. The advantage of the algorithm over previous methods is its ability to detect the margins of both short and long genome fragments, enriched by up-regulated signals, at equal accuracy. The score of an enriched genome fragment reflects the difference between the actual concentration of up-regulated signals in the fragment and the chromosome signal baseline. The “divide-and-conquer”-type algorithm detects a series of nonintersecting fragments of various lengths with locally optimal scores. The procedure is applied to detected fragments in a nested manner by recalculating the lower-than-baseline signals in the chromosome. The algorithm was applied to simulated whole-genome data, and its sensitivity/specificity were compared with those of several alternative algorithms. The algorithm was also tested with four biological tiling array datasets comprising Arabidopsis (i) expression and (ii) histone 3 lysine 27 trimethylation CHIP-on-chip datasets; Saccharomyces cerevisiae (iii) spliced intron data and (iv) chromatin remodeling factor binding sites. The analyses’ results demonstrate the power of the algorithm in identifying both the short up-regulated fragments (such as exons and transcription factor binding sites) and the long—even moderately up-regulated zones—at their precise genome margins. The algorithm generates an accurate whole-genome landscape that could be used for cross-comparison of signals across the same genome in evolutionary and general genomic studies.
The analysis of signal distribution across the genome is important for whole-genome studies such as transcriptome expression, chromatin remodeling, and identification of transcription factor (TF) binding sites. Signal distribution can be generated either by tiling array hybridizations (1) or via mapping of next-generation sequencing (NGS) reads (2, 3) on the known genome. The major goal of the analysis of signal distribution is to detect genome fragments that are enriched with up-regulated signals. These fragments could be TF binding sites that came out of ChIP-chip (ChIP-seq) data experiments (1, 3), chromatin remodeling (4, 5), or genome transcription (6) events. The main algorithmic challenge here is to detect accurately short and long up-regulated genome fragments, generating a whole-genome landscape for the signal. In particular, the analysis and comparison of the landscape could lead to important evolutionary implications within and between populations and species, for instance, how whole-genome activations are coordinated in different populations and where the genome hotspots for population divergence are located.
Three popular algorithms for the analysis of whole-genome data and our approach are briefly described.

1. Model-Based Analysis of Tiling Arrays: MAT Algorithm.

The model-based analysis of tiling arrays (MAT algorithm) (7) is based on running a sliding window over the genome to detect fragments that are enriched with high signals. The score for a fragment's enrichment is similar to a t test, and the algorithm detects all significantly enriched windows. An enriched fragment is defined as a union of highly intersecting, significantly enriched windows, with the t test-like score applied to the entire fragment. The default window size is 600 bp.
The apparent disadvantage of the MAT approach is the constant window size in each run. Up-regulated fragments that are much shorter than the window size may be overlooked, because the calculation involves summation over the entire window. The low signals in the vicinity of the short up-regulated fragment will reduce the score that is assigned to the window below the detection threshold. On the other hand, long fragments that are only moderately enriched might be overlooked, too, because windows that cover only part of the long fragment would be assigned low scores that are below the detection threshold. Thus, application of the MAT algorithm using all possible window sizes is needed to identify all enriched fragments and their precise margins. However, this would account for n2 operations, which is a huge number, taking into account that n is about a million Affymetrix probes, for example, for the Arabidopsis chromosome 1 of 30,000,000 bp length.

2. Chromosomal Map of Tiling Array Hybridizations: The TileMap Algorithm.

The chromosomal map of tiling array hybridizations (TileMap algorithm) (8) applies a hidden Markov model (HMM) to detect significantly enriched fragments. Each probe could be in a “target” or “zero” state. The HMM procedure with t test-transformed probe signals is used to infer the hidden probe states, and regions of interest are selected based on the posterior probability for a probe to be in the target state. The HMM transition probability remaining in the target state is high only for the genome neighboring probes. Thus, the rate of detection of long regions that are only moderately enriched is low. Relatively short and moderately enriched fragments could be missed by TileMap as well, because of the sparse distribution of higher signals inside these fragments.
In the second working mode, TileMap utilizes the moving average method, instead of HMM, for combining neighboring probes. The final summary statistics for each probe is calculated as the average over a window with the probe in the window center. The method results are sensitive to the window-size choice. Thus, one should use shorter windows for smaller expected fragments and longer windows for larger fragments. As in MAT, the constant sliding window size is disadvantageous.

3. Rank Statistics-Based Enrichment-Site Prediction: The TAS Algorithm.

Affymetrix Tiling Array Software (TAS algorithm) (9) performs a statistical evaluation of each probe P value using the Wilcoxon signed-rank test inside a sliding window centered on the probe. In the following step, based on a threshold, the probes are divided into significant and insignificant. The significant probes are joined in the finally reported enrichment sites. This is done by a clustering procedure that depends on two parameters: the “maximal gap”—the maximally allowed distance between the positive probes—and the “minimal run”—the minimal site length. The parameter choice is specific to the expected fragment structure and size.

4. Our Binary Search Algorithm.

For any type of whole-genome signal, the binary search (BS) algorithm (based on ref. 10) provides the identification of a series of nonintersecting chromosome fragments enriched by the up-regulated signals of the given type, where up-regulation is considered with respect to the average level of the signal along the chromosome. The algorithm generates nonoverlapping fragments that can be of any length, and the significance of their enrichment is measured in standard deviations (SDs) of a normal distribution. The asymptotic normality of the fragment's statistics is based on an assumption of the mutual independence of signals or differentials for any two probes of the chromosome. In reality, however, we deal with an extreme value distribution because the score of each fragment in the series is maximized by the procedure. Thus, if the up-regulated signals are not independent and are grouped inside a fragment of the chromosome, then (i) it will be registered by the algorithm as a significant deviation from the null hypothesis on the uniform distribution of the signal across the chromosome; (ii) the precise margins of the fragment with the biggest deviation from the null hypothesis (highest score) will be determined; and (iii) the fragment will be assigned a score in SD units, reflecting the fragment's signal deviation from the null hypothesis. Because the score of the fragment enrichment is measured in SD units, the chromosome profiles of enrichment for signals of a different nature and with different absolute values are normalized and can be compared with each other.
After identification of the basic series of significantly enriched fragments, each of the series’ fragments is treated using the same procedure. Namely, the good and poor subfragments are detected by diminishing the low signals inside the fragment (see Materials and Methods for details), and the poor subfragments are removed. The process is reiterated in a nested manner until all poor subfragments are removed. As a result, the whole chromosome landscape of a fragment's enrichment is prepared.
The level of nested iterations is regulated by a parameter. If the parameter is high, the BS algorithm works in a “long” mode without any iteration. Under a lower parameter (“short” mode), the recursion is working on several nested levels, removing the low-score subfragments from all initially detected fragments. The short mode is good for detection of short activation fragments, such as exons, where the low-signal introns must be removed from activated zones. However, the short mode could be irrelevant for detecting long active zones, because the BS-short will frequently be slicing long active zones while considering random appearances of the low-signal islands as poor subfragments. In the long mode, the initially detected high-score fragments are not subdivided.
Theoretically, the BS algorithm roughly gives equivalent results to the ones obtained using the multiple MAT software runs with all n possible window sizes (n is the chromosome length). However, because of its “divide-and-conquer” nature, the BS algorithm has better scaling and performs only n*log(n) operations per chromosome (instead of n2 operations in the case of multiple MAT runs).
The algorithmic details can be found in Materials and Methods, Appendix, and SI Text 1.

Results

Receiver Operating Characteristic Curves for Simulated Data.

The receiver operating characteristic (ROC) sensitivity versus specificity curves, for simulation both of exon and TF binding site activations with low and high superimposed noise, demonstrated that the BS algorithm significantly overperformed the three other algorithms. For example, for exon activation with low noise, the ROC curves are shown in Fig. 1. The BS curve is obviously the best, and TAS is next. However, there is a significant difference between BS and TAS in the number of collected false positives (FP). Indeed, for the same number of true positives (590,000 probes: 80% of all true positives), BS collected 38,500 false positives, and TAS collected almost three times more (110,000 probes). A comparison for exon activation with high-noise conditions also showed that the BS algorithm was superior to the other algorithms (Fig. S1).
Fig. 1.
The ROC curves of the four algorithms on the simulated exon data with low superimposed noise. Each curve depicts sensitivity [true positive (TP) rate] versus specificity [false positive (FP) rate].
For the simulated TF binding sites, with the same noise levels, the performance of all four algorithms was worse (Figs. S2 and S3). However, again, BS outperformed the three other algorithms. Specifically, under low-noise levels, the BS algorithm collected 80% of the true positives (38,500 probes) with the false discovery rate (FDR) = 23.7%. The TAS algorithm, the best alternative, collected 80% of the true positives with FDR = 35.2%, and the MAT algorithm collected 80% of the true positives with FDR = 64.8%.
The performance of the BS and TAS algorithms with very long modestly enriched fragments was checked on the simulated extended activated genome fragments (from 100,000 to 1,600,000 bp in length). For both algorithms, we examined the distribution of distances from the detected fragment margins to the closest margin of the true fragment for detected fragments of high score (>10). The BS-short mode was inaccurate in the detection of a fragment's margins. It generated a distribution of distances with an average of about 100,000 bp and SD = ∼300,000 bp. The TAS algorithm gave even worse results. Indeed, for a TAS window equal to 70 bp, the distribution's average was ∼450,000 bp and SD = ∼400,000 bp. For a TAS window equal to 1,000 bp, the distribution's average was 600,000 bp and SD = 450,000 bp, which is not an improvement over the shorter window. This very high inaccuracy level reflects the extensive fragment subdivision by both algorithms. In contrast, the BS-long mode detected the true fragment margins 1,000 times more accurately. Indeed, its distribution average was equal to ∼400 bp and SD = 300 bp.

Arabidopsis Whole-Genome Expression Data.

Our results showed a corroboration between the expression enrichment landscape obtained using the BS algorithm (short mode) and the underlying exon structure. For instance, the exon structure of gene PMR5 is clearly correlated with the expression enrichment landscape (62% of overlapping; Fig. 2).
Fig. 2.
Detection of the exon-specific signal enrichments in the neighborhood of the PMR5 gene (chr2, positions 12,812,803–12,816,439) by the BS algorithm. The presentation consists of two panels. (A) The sum of raw log-signals across three replicates of Arabidopsis sibling data (2) in the gene neighborhood. The exon locations in this panel are marked by “ex” on the x axis. (B) The fragments (blue and pink bars) that appeared to be enriched by up-regulated signals as detected by the BS algorithm (short mode). The height of the blue bars is the enrichment score for a fragment. The pink bars depict the average fragment's signal in SD units. The detected fragments overlap with the exon locations (depicted by the yellow negative bars) in 62% of all probes in exons.

Arabidopsis Genome H3K27me3 CHIP-on-Chip Data.

The BS analysis of Arabidopsis histone 3 lysine 27 trimethylation (H3K27me3) data (4) revealed many activated fragments. Besides narrow gene-associated histone trimethylated fragments, the analysis found several long fragments. In particular, the long region (∼250,000 bp) in the pericentromeric zone of chromosome 2 is highly enriched by up-regulated H3K27me3 signals (Fig. 3). This example demonstrates that the BS algorithm can detect both short and very long enriched fragments with their accurate margins.
Fig. 3.
The histone 3 lysine 27 trimethylated fragments in Arabidopsis chromosome 2 as detected by the BS algorithm. The blue bars in the main picture depict the detected enriched fragments, and the pink bars show the average fragment's level of log-signals in SD units. An extended fragment (∼250,000 bp) enriched by up-regulated signals was detected by the BS-long algorithm in the centromere region (the fragment's start position is about 3,250,000 bp). The two insets give a closer look at this region. The left inset shows how activations were detected by the BS-long mode, and the right shows how the BS-short algorithm detected activations.

Saccharomyces cerevisiae Genome Data on Spliced Introns.

The BS algorithm was applied to the zero-limited differentials between quantile normalized dbr1 (mutant) and BY (wild type) tiling array raw log-signals from ref. 5. The detected BS landscape (short mode of the algorithm) was contrasted with the fragments detected by TAS in ref. 5 (ZZ137-ZZ248-bw20_pvalue.00028_fdr.05.bed file). Also, the reference list of 272 known introns used in ref. 5 was used for contrasting the BS and ZZ137-ZZ248 findings with real introns. The BS-short detected fragments that were significantly enriched by (dbr1 − BY) differentials matched well with the detected fragments in ref. 5 and with known long introns according to three individual data replicates and according to the average of replicates across chromosomes (Fig. 4 and Figs. S4S6). However, in comparison with ref. 5, the BS algorithm applied to signal data generated fewer fragments as “new introns” because an independent appearance of the same fragment in all three replicates was expected. Often, the identified new intron in ref. 5 was detected by the BS algorithm as the enrichment in one replicate only. For instance, in ref. 5, the new intron was detected in chromosome 1 by a procedure that takes all replicates into consideration. However, this new intron was not supported by separated BS analyses of all three replicates of the (dbr1 − BY) differential data (Fig. S4).
Fig. 4.
S. cerevisiae genome CHIP-on-chip data on the intron lariat (chromosome 7). The high enrichments (blue, light blue, and khaki color—upper three profiles for replicates) detected by the BS algorithm (short mode) perfectly matched the detections in ref. 5 (red profile) and with known longer introns (bottom profile, long dark blue bars). Both algorithms overlooked the shorter known introns (short dark blue bars), probably because there were no hybridization signals from these introns due to faster degradation of the short lariat.

Discussion

The BS algorithm could be considered a representative of the class of recursive segmentation algorithms (11). The positive- and negative-score fragments without division points inside are final. The existence of the negative-score indivisible fragments makes the approach to be the binary search-like algorithm because a significant portion of the negative-score indivisible intervals is deleted from the studied pool in each cycle.
The recursive segmentation (BS-short mode) of initially detected long activated fragments could be a problem, because we may lose truly activated long genome intervals by slicing them into short portions. If the long fragments of high score have to be subdivided, an evaluation can be performed based on several levels of applied nesting recursion (several negative thresholds) as follows. If the recursion slices the fragment into smaller and smaller pieces, then the individual short fragments have no biological sense, but the entire fragment might be biologically meaningful.
A high accuracy of genome segmentation by the BS algorithm into fragments of different sizes, with variable levels of signal enrichment inside them, allows a superimposition of the whole-genome landscapes for different signal types. The landscape superimposition, as well as generation of the “signal-corroboration” landscapes based on the point covariations of different signals, is very helpful for the detection of hidden genome regulations and regulatory elements. Indeed, the genome mechanics of functional and evolutionary biology of an organism could be elucidated by the detection of genome “hotspots”—the fragments where the epigenetic and expression signals are enriched in a coordinated manner.
Without superimposed noise in the simulated data, the BS algorithm provides 100% correct detection of up-regulated fragments. Thus, the quality of detection of fragments enriched by higher signals depends on the level of superimposed noise. Also, the length of enriched fragments has a substantial impact upon their correct identification. Indeed, the short fragments with superimposed noise of a high level could be missed by the algorithm more frequently, because their scores will be insignificant due to superimposed noise. This is why sensitivity/specificity of the BS algorithm is worse for the simulated TF binding site data.
Here the segmentation of a genome by the BS algorithm was discussed in application to tiling microarray signals. Such application is reliable because signals of the genome neighboring probes are technologically independent—probes are randomly distributed over the chip. On the other hand, the NGS-generated signals in genome neighboring positions are highly dependent, because they are frequently collected from the same reads. To overcome such local dependence on raw data, the straightforward approach for transcriptome data is as follows: The NGS signals of neighborhoods are accumulated in their centers, and the distance between neighboring centers is arranged to be substantially more than the length of the read.
The Poisson distribution model is widely used in analysis of ChIP-Seq data (12). The question is whether the BS-algorithm segmentation will detect the enriched fragments that are deviations from the null Poisson hypothesis. The simulation of the read distribution as a superimposition of the uniform distribution of random reads across a “chromosome” and the distribution of “signal” reads concentrated in some TF binding intervals show that the Poisson and normal distribution-based score profiles are highly correlated and have the same maxima (segmentation points) (SI Text 2, Fig. S10). Therefore, one can use the BS algorithm to find the Poisson model-based enriched fragments.

Two Conclusive Notes.

As a precedent of the iterative binary segmentation procedure, the algorithm of Vostrikova (13) has to be mentioned. Also, although our algorithm does not require a window, it does have tuning parameters governing the stopping rule. The C++ code of the BS algorithm, MSWindows and Linux executables, program instructions, and working examples can be found at http://evolution.haifa.ac.il/index.php/people/item/146. The program is used through the command-line mode.
BS-algorithm basic theorems and schema can be found in Appendix and SI Text 1.

Materials and Methods

BS-Algorithm Details.

Similarly to ref. 7, the score of a fragment was based on the normalized differential of the probe log-signal from the general mean of all probe log-signals in a chromosome. The differential was normalized by the truncated SD of the probe log-signals in the chromosome. Thus, the normalized differential for log-signals of the probe m was
Now, consider a set of consequent probe differentials of a fragment (N numbers a1, a2,…, aN). Denote the score of the interval [i,j] as . Denoting the length of the interval between i and j by , we have . This score is distributed N(0,1) under the null hypothesis of normality and mutual independence of log-signals in the interval.
The BS algorithm segments the entire chromosome by the sequential binary division into intervals, in one of which the entire unknown chromosome's best-score fragment should be located as a whole. Because the globally best fragment will also be the locally best, the search procedure is recursive, looking for the best-score fragment inside every generated segment.
The segmentation is based on two segmentation corollaries from lemmas of ref. 10 about numerical inequalities (see Appendix for details). The statement of the first corollary is as follows. Let us calculate the score profile F(j): scores of subsequent enclosed fragments, each starting at the beginning of a chromosome. If for some chromosome position, (i) subsequent portions of the above enclosed fragments that end in positions before the selected position all have positive scores and (ii) subsequent portions of the above enclosed fragments that end in a position after the selected position all have negative scores, then the best-score fragment lies either to the left or to the right from the selected position but does not cover it—the position is a dividing one in the segmentation for searching the best-score interval.
The second segmentation corollary provides another type of division. In the expanding series of enclosed fragments that start at the beginning of the zone and have positive scores, the maximum-score fragment can be found. Thus, the corollary states that the end of the maximum-score fragment of the series is a division position. Specifically, the entire best-score fragment inside the zone should be to the left from this division position.
Another segmentation point could be found based on the reversed second corollary for the negative signal deviations from the chromosome's average. Indeed, because the scores of subsequent enclosed fragments on the right from the first corollary division point are negative, based on reversed Corollary 2 from Appendix, the point of minimal negative-score fragment of the series will be the division point for the detectable poorest subfragment. The poorest fragment will be to the left from this point and, therefore, the best-score subfragment will be to the right from this point. This division point is applied if a score of the negative fragment that provides this minimum of the score profile is less than the negative threshold (NT).
The algorithm starts from the entire chromosome as the only segment. For every generated segment, the algorithm goes into it in a binary search manner, looking for the best-score fragment inside this segment, until some segment cannot be further divided. The score of every indivisible segment is estimated and, if it is more than a threshold, the fragment is considered to be the enriched zone. This is the logic of the BS-long mode. According to the BS-long segmentation stopping rule, all generated segments have positive maxima of score profiles at their margins when calculating the profiles from left to right and from right to left. However, according to Corollary 2, the best-score subfragment of the interval has to be left of the right segment's margin and right of the left margin. This could be the case when the best-score subfragment is not the entire segment. Therefore, one has to look inside every segment that contains positive (higher than the chromosome average) and negative signals, both from theoretical and practical viewpoints. Indeed, sometimes from a biological consideration, we prefer to split a segment despite the fact that there are no better-score fragments inside. For instance, in the Arabidopsis genome, the introns are very small, and the entire gene may have a better score than each of its individual exons. Nevertheless, from a biological point of view, we prefer a segmentation of the gene into exons.
Via a trick, the BS-short algorithm manages to go inside some of the indivisible segments, attempting to divide them further and to remove the internal poor subfragments from these fragments. The trick consists of the proportional diminishing of weak fragment's signals, that is, signals that are lower than the chromosome's average (if they exist), until the fragment's average is equal to the chromosome average. After such diminishing of weak signals (i.e., an increase of their negative differentiation from the chromosome average), the internal fragment's structure will be exposed to the BS algorithm, and the algorithm might further divide the fragment. Indeed, because the original fragment was detected as an enriched one, its score profile is positive across the entire fragment, and reaches a maximum at the fragment's end. Therefore, the putative best and poor subfragments are hidden from the BS segmentation rules. By diminishing the weak fragment's signals, we change the behavior of the score profile along the fragment. Now, the profile has to reach zero at least once, at the fragment's end. Thus, the corollaries are working, and all subfragments enriched by good signals (that are higher than the chromosome average) will be detected by BS segmentation. If after segmentation the score of some fragment is lower than the threshold (i.e., the fragment is enriched by weak signals), this segment will be declassified. Thus, the BS-short algorithm is acting in a recursive manner, trying to subdivide every detected good fragment; it also removes the insignificant-score subfragments from this fragment until the minimum of the fragment's score profile in each detected interval is higher than the predefined NT (see next paragraph). The mathematical basis of the BS algorithm and the algorithm's schema are described in Appendix and SI Text 1.

Recursive Segmentation and Stopping Rule.

A stopping rule for the recursive subsegmentation of fragments is defined by the NT parameter as a threshold for the negative score of a poor fragment that provides the minimum negative value of the score profile inside the detected fragment. To run the algorithm in pure long mode, the NT threshold has to be a very high negative value. However, the natural strategy is to run the algorithm under several NT thresholds both in long and in the recursive (short) modes to detect all biologically interesting active fragments (Discussion).

Sensitivity/Specificity Analysis with Simulated Data.

To study the BS-algorithm reaction to different scenarios of higher signal distribution across the genome and signal intensity variability and compare the BS results with the results of the three other algorithms, we performed an extensive evaluation with simulated data. The simulated probe signals were either on the baseline level or on a few higher signal levels. Higher signals cover either exons of Arabidopsis genes or selected TF binding sites. In a special simulation mode, slightly higher signals cover several long zones of each chromosome.
Various levels of noise were superimposed on the above data. The noise level is higher in up-regulated simulated fragments than in the baseline regions. In this way, we tried to emulate the typical noise distribution in microarray experiments and in inherently different distributions of noise in the NGS data. As expected, the most prominent feature influencing the performance of all of the algorithms was the SD of the superimposed noise. Indeed, the performance of all of the algorithms degraded with the increase in the noise variance.

Biological Datasets and Goals of Their Analyses.

For each type of signal in the four biological datasets, a whole-genome landscape of the enrichment by up-regulated signals was prepared. For some of the datasets, instead of log-signals, we analyzed the per-probe differentials (i.e., the differences of two log-signals of different natures for the same probe). We used the “zero-limited” differentials, that is, a zero is substituted for the differential in cases where the difference of the two log-signals was negative. The reason for producing such modified differentials is as follows. The algorithm estimates the enrichment of the fragment by calculating the (normalized) sum over all of the fragment's differentials. Thus, the presence of highly negative differentials inside the fragment would diminish the sum of positive differentials of the fragment. With zero-limited differentials, the positive differentials would not be diminished by nonrandom negative differentials. With this, the detection rate depends only on the height and abundance of the positive differentials.

Arabidopsis expression data.

We applied the BS algorithm to the whole-genome expression of the seedling Arabidopsis (6) to check whether the exon structure of expressed genes could be revealed by the BS enrichment analysis.

Arabidopsis histone 3 lysine 27 trimethylation CHIP-on-chip data.

The goal of our analysis was to detect histone 3 lysine 27 trimethylated-enriched genome fragments, including long ones based on the tiling array data of ref. 4. The histone 3 lysine 27 trimethylation (H3K27me3) regulations were calculated as zero-limited differentials. Specifically, the H3K27me3 regulation is defined as the difference between the log-signals of CHIP-on-chip hybridization of two DNA samples extracted by α-H3K27me3 antibodies and by α-H3 antibodies. It is assumed that α-H3K27me3 antibodies extract DNA fragments associated with histone 3 lysine 27 trimethylation, and α-H3 antibodies access all histone 3 associated with DNA fragments (4).

The spliced intron data in the S. cerevisiae genome.

In the yeast S. cerevisiae, the absence of the debranching enzyme causes the accumulation of lariat RNAs. The accumulation allows a comparison of tiling array signals of RNA from the debranching mutant (dbr1) to the wild-type parent strain (DBR+ or BY4743 strain) and thus the identification of lariats on a genome-wide scale (5). As it is stated in ref. 5, the enrichment of a genome fragment in the dbr1 signal in comparison with BY4743 (BY) putatively indicates the presence of an intron. Such enrichment was detected in ref. 5 by the sliding-window method, where the Wilcoxon rank-sum test between quantile normalized dbr1 and BY log-signals indicates the window enrichment by the lariat. P value of the test was assigned to the central probe of the window (5). The goal of our study was to compare the BS-algorithm results with the results in ref. 5.

S. cerevisiae Genome CHIP-on-Chip Data on the Chromatin Remodeling Factor ISW2 Binding Sites.

See SI Text S3 and Figs. S4S6 for data description and analysis results.

Appendix

Adaptation and Proof of The Basic Theorems From Ref. 10.

Lemma 1. Let 1 ≤ ij < kN, Aij ≤ 0, Aik > 0. Then Aj+1,k > Aik.
Proof. Denote , , and . Since Sik = Sij + Sj+1,k, Sij ≤ 0 as it follows from Aij ≤ 0 that SikSj+1,k is valid. The inequality Aj+1,k > Aik follows from the fact that nik > nj+1,k.
Corollary 1. Let sub-interval [i, k] of the interval [i, q] have zero score (Aik = 0). All enclosed intervals [i, k−δ,] have positive scores (Ai,k−δ > 0), and the intervals [i, k+ε] have zero or negative scores (Ai,k ≤ 0) for all ε. Then k cannot be inside the maximum score sub-interval of the interval [i, q].
Proof. Assuming the opposite, let us consider that k is inside the sub-interval of the maximum score [p, q], 1 ≤ ip < k < qjN. Since Aik = 0, the sum Sik = 0 also. From Aiq ≤ 0, it follows that Siq ≤ 0. From Siq = Sik + Sk+1,q it follows that Sk+1,q ≤ 0 (and, correspondingly, Ak+1,q ≤ 0). Since [p,q] is the maximum score interval on [i, j], it follows that ApqAi,k−δ > 0. Since Ak+1,q ≤ 0, it follows that Apk > 0. According to Lemma 1 for intervals [p, k] and [k+1, q], Apk > Apq that contradicts the maximality of Apq.
Theorem 1. Let all intervals [i, h], 1 ≤ i < hjN have non-negative scores. Let interval [p, q], 1 ≤ ip < qjN, be the sub-interval of a positive score that has a maximal score inside the interval [i, j], i.e., Ap,qAmn, for any pair mn, mp, nq, 1 ≤ im < njN. Let the interval [i, k] have the maximum positive score among all intervals [i, h], 1 ≤ i < hjN, i.e., AikAih. Then, k cannot be inside interval [p, q].
Corollary 2: the entire maximum score sub-interval of the interval [i, j] has to be either to the left from k, or to the right from k.
Proof. Let's assume the opposite that k is inside the interval of maximal score [p, q]. Thus, 1 ≤ i < p < k < qjN. Let Si,p-1, Sp,k, Sk+1,q are sums of elements for intervals [i, p−1],[p, k],[k+1, q], and Ai,p−1, Ap,k, Ak+1,q are scores for these intervals. Let scores of the three most important intervals [p, q],[i, k], and [i, q] will be denoted as X (= Apq), W (= Aik), and Y (= Aiq). Then:
Let us prove that Y > W that contradicts the theorem's condition on maximality of W, (W = AikAih for all h). Indeed, according to [5]:
According to [3] and [6]:
If the right part of the equation [7] is negative, then the left part is negative also. The negativity of the left part means that Y > W.
Let us prove that the right part of [7] (denoted as Ω) is negative. If we put X2 instead of Y2 in the right part of [7] then:
because X > Y due to maximality of X score among scores of all sub-fragments of the interval [i, j].
According to [3]:
The sum Sk+1,q cannot be negative because, in this case, the score of [p, k] interval will be bigger than the score of [p, q] interval, which is the maximal score (X). Indeed, if Sk+1,q is negative then Sp,q < Sp,k and np,q > np,k. Thus, the inequality Apk > Apq = X will be valid which contradicts to maximality of X. The sum Si,p−1 is positive or zero also because of the scores, and thus sums of all intervals [i, h] are positive according to the theorem's conditions. Therefore, elimination of the negative term (−2∗Si,p−1Sk+1,q) will increase the left part of [8], and:
Let us add and deduct S2p,k to the left part of [9]:
or:
Using [2] and [4] we will get:
Since X is the maximum of all sub-fragment's scores, the left part of [10] is negative, and therefore Ω is also negative. From negativity of Ω (right part of equality [7]) it follows that Y > W. Since Y is Aiq, this inequality contradicts the theorem's condition that WAih for all h 1 ≤ i < hjN. Therefore, our assumption is wrong, and k cannot be inside the interval of maximal score [p, q]. The theorem is proven.

Acknowledgments

We thank Andrei Leontovich and Valery Kirzner for helpful discussions. We are very thankful to Nir BenTal and the reviewers of the paper. E.N. thanks the Ancell Teicher Research Foundation for Genetics and Molecular Evolution for financial support. Research of S.K. and L.B. was supported by the BLOOMNET ERA-PG Grant of the European Commission's Seventh Framework Programme.

Supporting Information

Supporting Information (PDF)
Supporting Information
sapp.doc

References

1
XS Liu, Getting started in tiling microarray analysis. PLoS Comput Biol 3, 1842–1844 (2007).
2
M Margulies, et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005).
3
ER Mardis, ChIP-seq: Welcome to the new frontier. Nat Methods 4, 613–614 (2007).
4
X Zhang, et al., Whole-genome analysis of histone H3 lysine 27 trimethylation in Arabidopsis. PLoS Biol 5, e129 (2007).
5
Z Zhang, JR Hesselberth, S Fields, Genome-wide identification of spliced introns using a tiling microarray. Genome Res 17, 503–509 (2007).
6
K Hanada, X Zhang, JO Borevitz, WH Li, SH Shiu, A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res 17, 632–640 (2007).
7
WE Johnson, et al., Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci USA 103, 12457–12462 (2006).
8
H Ji, WH Wong, TileMap: Create chromosomal map of tiling array hybridizations. Bioinformatics 21, 3629–3636 (2005).
9
S Ghosh, HA Hirsch, E Sekinger, K Struhl, TR Gingeras, Rank-statistics based enrichment-site prediction algorithm developed for chromatin immunoprecipitation on chip experiments. BMC Bioinformatics 7, 434 (2006).
10
AM Leontovich, LI Brodsky, AE Gorbalenya, Construction of the full local similarity map for two biopolymers. Biosystems 30, 57–63 (1993).
11
W Li, P Bernaola-Galván, F Haghighi, I Grosse, Applications of recursive segmentation to the analysis of DNA sequences. Comput Chem 26, 491–510 (2002).
12
Y Zhang, et al., Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008).
13
LJ Vostrikova, Detection of “disorder” in multidimensional random processes. Soviet Mathematics Doklady 24, 55–59 (1981).

Information & Authors

Information

Published in

Go to Proceedings of the National Academy of Sciences
Proceedings of the National Academy of Sciences
Vol. 107 | No. 39
September 28, 2010
PubMed: 20833816

Classifications

Submission history

Published online: September 10, 2010
Published in issue: September 28, 2010

Keywords

  1. genome segmentation
  2. tiling array
  3. next-generation sequencing

Acknowledgments

We thank Andrei Leontovich and Valery Kirzner for helpful discussions. We are very thankful to Nir BenTal and the reviewers of the paper. E.N. thanks the Ancell Teicher Research Foundation for Genetics and Molecular Evolution for financial support. Research of S.K. and L.B. was supported by the BLOOMNET ERA-PG Grant of the European Commission's Seventh Framework Programme.

Authors

Affiliations

Leonid Brodsky1 [email protected]
Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; and
Simon Kogan
Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; and
Eshel BenJacob
School of Physics and Astronomy, Tel Aviv University, Tel Aviv 69978, Israel
Eviatar Nevo1 [email protected]
Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; and

Notes

1
To whom correspondence may be addressed. E-mail: [email protected] or [email protected].
Author contributions: L.B. and E.N. designed research; L.B. and S.K. performed research; L.B., S.K., E.B., and E.N. analyzed data; and L.B., E.B., and E.N. wrote the paper.

Competing Interests

The authors declare no conflict of interest.

Metrics & Citations

Metrics

Note: The article usage is presented with a three- to four-day delay and will update daily once available. Due to ths delay, usage data will not appear immediately following publication. Citation information is sourced from Crossref Cited-by service.


Citation statements




Altmetrics

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited by

    Loading...

    View Options

    View options

    PDF format

    Download this article as a PDF file

    DOWNLOAD PDF

    Get Access

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Personal login Institutional Login

    Recommend to a librarian

    Recommend PNAS to a Librarian

    Purchase options

    Purchase this article to access the full text.

    Single Article Purchase

    A binary search approach to whole-genome data analysis
    Proceedings of the National Academy of Sciences
    • Vol. 107
    • No. 39
    • pp. 16751-17059

    Media

    Figures

    Tables

    Other

    Share

    Share

    Share article link

    Share on social media