Quantitative prediction of NF-κB DNA– protein interactions

We describe a general method based on principal coordinates analysis to predict the effects of single-nucleotide polymorphisms within regulatory sequences on DNA–protein interactions. We use binding data for the transcription factor NF-κB as a test system. The method incorporates the effects of interactions between base pair positions in the binding site, and we demonstrate that such interactions are present for NF-κB. Prediction accuracy is higher than with profile models, confirmed by crossvalidation and by the experimental verification of our predictions for additional sequences. The binding affinities of all potential NF-κB sites on human chromosome 22, together with the effects of known single-nucleotide polymorphisms, are calculated to determine likely functional variants. We propose that this approach may be valuable, either on its own or in combination with other methods, when standard profile models are disadvantaged by complex internucleotide interactions.

T he genetic basis of complex disease may stem partly from diversity in gene regulation. Eukaryotic genes are regulated largely through the interactions of transcription factors and their assembly into multicomponent enhancer complexes (1). Each gene is regulated by a unique combination of activators that produce activation and transcription specific to it. The arrangement of transcription factor-binding sites and their bound activators generates protein-DNA and protein-protein interactions unique to an enhancer (2). DNA variation in the enhancer is known to affect transcription factor binding, for example in the NF-B and Oct-1 sites in the tumor necrosis factor promoter (3,4), the SP-1 site in the Matrix metalloproteinase-2 promoter (5), and the AP-1-binding site in the Matrix ␥-carboxyglutamic acid protein promoter (6).
There is a clear need for a quantitative model of how variations at the binding site affect binding affinity. The simplest nonquantitative representation of a binding site is as a consensus sequence and is best suited to highly conserved sites. The profile or position-weight matrix (7)(8)(9), in which variation is modeled at each position in the binding site, independently of the neighboring base pairs, is more widely applicable. Usually profiles predict the likelihood that a sequence is a binding site rather than binding affinity. This subtle distinction arose historically because profiles were constructed from a multiple alignment of sequences that bind in vivo, although in fact binding affinity and binding probability are interchangeable (9).
Profiles are very useful, but they cannot model interactions between positions within the binding site (9,10). They normally describe only fixed-length motifs, whereas some DNA-binding proteins can bind to variable length sites (11). Finally, it is not always possible to construct a multiple alignment of binding sites, because there may be no consistent choice for the sequences' orientations such that every pair of sequences is optimally aligned. This is a serious problem in microarray (12,13) and SELEX (14,15) studies, where all binding sites are tested. SELEX uses a pool of random sequences to assay binding but tends to provide little information about low-affinity sites, which are important in multifactor complexes.
Here we introduce a quantitative model that predicts the effects of nucleotide variations within the binding motif on binding affinity. It is applicable to any DNA-binding protein, and requires only binding data from a subset of sites, chosen to span the space of the binding sites. Interactions between positions within the binding site are incorporated by using principal coordinate (PC) analysis to map the DNA-binding sites into a Euclidean space, such that the distance between any two sites' spatial coordinates approximates the dissimilarity between their DNA sequences. A site's binding affinity is modeled as a function of its coordinates. We illustrate the model with an analysis of binding affinity in the NF-B͞Rel family of transcription factors. NF-B, which binds with DNA through the Rel homology domain (RHD) (16,17), coregulates many genes and has a critical role in immunity, inflammation, and apoptosis (18). Crossvalidation and experimental verification of predicted NF-B binding affinities confirm the model's accuracy, and comparisons are made with profile (P) models to demonstrate the presence of base pair interactions. Finally, we analyze two finished human genome regions for potential NF-B-binding sites containing singlenucleotide polymorphisms (SNPs) likely to affect binding significantly.

Methods
PC Model. Many observations are required to fit a statistical model that systematically tests the significance of each position in the DNA motif and the interactions between positions, etc. Instead, we made the assumption that similar DNA motifs will tend to have similar binding affinities, although certain consensus positions, or combinations of positions, will be more influential than others.
For two oligonucleotides i, j of the same length, with sequences S i , S j , let h(S i , S j ) be the Hamming distance between S i , S j , defined as the number of corresponding positions that differ, and let S j be the reverse complement of S j . Define a dissimilarity measure between S i , S j as i.e., the minimum of the distance between S i and S j or its reverse complement. For example, GGGATGGCCC and GGAATT-GCCC differ at positions 3 and 6 thus have Hamming distance 2, and the second sequence's reverse complement GGGCAAT-TCC has Hamming distance 5, so d is 2.
Metric scaling (19,20) projects sequence i to a point x i in n-1-dimensional space (where n is the number of sequences) such that

[2]
This paper was submitted directly (Track II) to the PNAS office.
Although d ij is not a metric, and hence the embedding into Euclidean space is only approximate, it is sufficient for our purposes.
The logarithm of binding affinity y i is modeled by linear regression on the m largest principal coordinates containing most of the variance between the sequences:

[3]
The mean i includes the experiment batch effect, b is a vector of coefficients estimated by least squares, and e i is the residual error. The columns (principle coordinates) of the matrix x ik are orthogonal, so the least-squares estimates of the b k are uncorrelated.
The predicted relative log-binding affinity for sites r, s is z rs ϭ (x r Ϫ x s )b, a linear function of the vector displacement between their coordinates. Nonsignificant coefficients in b are set to zero for prediction. Prediction variance is where s k is the SE of the coefficient b k . Predicted affinities are either expressed relative to the affinity of sequence GGGGT-TCCCC (used as an experimental control on each gel) or ranked from 0 (weakest) to 1 (strongest). Any sequence r not in the training set is embedded into the sequence space by kernel density mapping with coordinate where K(d) ϭ 10 Ϫd is a kernel density. The relative binding affinity is then calculated as before.
Profile (P) and Symmetric Additive Profile (SAP) Models. We fitted two quantitative P models to test whether the PC model improved the fit and whether interactions were present. The inner 6-bp variable core of the NF-B site comprises three pairs of nucleotides, in which position n on one strand is identified with position 11-n on the reverse strand. For instance, the pairs AG (3,8), AG (4,7), and TT (5,6) represent ggAATACCcc ccTTATGGgg . The log-binding affinity is modeled as the sum of effects of the three pairs, i.e., log y ϭ ϩ p 38 ϩ p 47 ϩ p 56 ϩ e. In the P model, there are 10 df, 2 each for p 38 and p 47 , and 5 for p 56 . In the simpler SAP model, the effects of the nucleotides within each pair are additive and symmetric; e.g., the effect of an A at position 1 is the same as an A at position 6; there are 5 df, one each for p 38 and p 47 , and 3 for p 56 . Analyses used software written in PERL and the NAG C library (http:͞͞www.nag.co.uk) (available on request) and the SPLUS 2000 statistical analysis package.

Results
Wide Variation in NF-B p50p50 and p50p65 Binding Affinities. We assayed by electrophoretic mobility-shift assay (EMSA) 52 of a possible 256 variants of the GGRRNNYYCC NF-B motif for binding affinity to recombinant p50p50 homodimer and p50p65 heterodimer complexes (Fig. 1A). The consensus used generalizes the original NF-B motif GGGRNNYYCC by introducing variation at position 3. The 52 sites were chosen to span the consensus sequence space (see Methods). The data are plotted in Fig. 1B. With both dimers, we observed up to a 1,000-fold reproducible variation in binding among the sites (published as supporting information on the PNAS web site, www.pnas.org). Although the original version of the consensus is too restrictive (e.g., GGAAATTTCC has relatively high p50p65 affinity), neither discriminates well between high-and low-affinity sites. A single-nucleotide change between two sequences, depending on context, can have little effect on binding affinity (e.g., GGGAT-ACCCC, GGGATATCCC, p50p50 ratio 1.5) or can cause a significant change (e.g., GGGGCTTCCC, GGGGCTCCCC, p50p50 ratio 5.7) (Fig. 1B).
PC Model Fitting. We used metric scaling (19) to embed the DNA sequences in a Euclidean space, such that the distance between the points representing any pair of sequences approximated their sequence dissimilarity, taking into account that reverse complement sequences have the same binding as forward ones (Eq. 1). Each sequence was initially mapped to a vector in a highdimensional space, but most of the variance in the coordinates that correlates with binding affinity maps to a 12-dimensional subspace.
The logarithm of a sequence's binding affinity y was modeled by least-squares linear regression on its principal coordinates, plus a term for the experimental inter-gel effect (Eq. 3). Models were fitted to the raw data, not to the grand means, and hence incorporate the variation between duplicated observations. Regression coefficients were estimated for each dimer. The regressions explained over 90% of the variance (Table 1). Only eight (p50p50) and nine (p50p65) coefficients of 12 in the regression were significant (P-value Ͻ0.05). The significant coefficients identify principal coordinates that influence binding affinity ( Table 5, which is published as supporting information on the PNAS web site), whereas nonsignificant coefficients indicate where the consensus can vary without affecting binding.
The 52 oligonucleotides assayed span the sequence space, in that each of the putative 256 NF-B-binding sites differs from at least one of the 52 (or its reverse) by no more than one nucleotide. Consequently, the PC model should predict binding affinity across all 256 potential binding sites and will be more accurate than a single observation as it reduces the effect of experimental variability.
Comparison of PC and P Models. (i) MATINSPECTOR (7) (www. genomatix.de) predicts binding affinity by using the TRANSFAC database (23,24). It predicts affinity by comparing the sequence to a profile of the binding site. All sites scoring below a threshold are treated as nonbinding and set to 0. We used the lowest allowed threshold (0.70). Of the 52 sequences assayed, 71% had nonzero MATINSPECTOR scores with p50p50 and 84% with p50p65. These values were compared with our experimental data, which were ranked on a scale of 0-1 after averaging duplicated observations and correcting for gel effects. The correlations of experimental data with MATINSPECTOR predictions were 0.69 (p50p50) and 0.51 (p50p65), compared with correlations of 0.90 (p50p50) and 0.89 (p50p65) for experimental data versus PC predictions.
(ii) We fitted our NF-B-binding data to two quantitative P models, one a simple P and the other a SAP (see Methods). The SAP model performs significantly worse than both P and PC (Table 1). For p50p50, the PC and P models have similar goodness-of-fit, but interestingly they explain different aspects of the data, because the result of fitting one model after the other is highly significant. In contrast, for p50p65, PC fits much better than P; fitting the PC after P is highly significant (P Ͻ 10 Ϫ8 ), but fitting the profile after the PC model is only marginally significant (P Ͻ 0.05). For the purposes of fair comparison, no attempt was made to optimize the models by removing nonsignificant terms; however, the fits of the optimal PC models are also reported. Thus, the PC model fitted the data marginally (p50p50) or significantly (p50p65) better than the P models.
Crossvalidation. After correcting for variability between experiments and averaging the affinities for each duplicated sequence, each of the 52 sequences of the training set was excluded in turn, and the linear regression repeated on the remaining 51. The binding affinity of the excluded sequence was predicted and compared with the observed value. The crossvalidated predictions, as assessed by the correlation coefficient between the observed and prediction affinities, were only slightly less accurate than the predictions when the observation was included in the regression (Table 1, 10-val correlation coefficients). As an affinity predictor, the PC model outperformed the P models for both p50p50 and p50p65. Fig. 2 compares the crossvalidated PC predictions and observed data for p50p50 and p50p65. The predictions are accurate, and the errors are stable over the wide range of binding affinities.
Prediction of New Sites. We assayed the affinities of nine additional oligonucleotides (three each with low, moderate, and high predicted affinities). The comparison of ranked experimental binding data with the PC prediction is shown in Table 2. In all cases, the differences were within 10 percentile points. Predictions for all 256 sites are published as supporting information on the PNAS web site (Table 6). Overall, the PC model predicts quantitative binding affinity in the NF-B motif accurately and is an improvement over profiles.
Interaction Between Nucleotide Positions. Fitting the PC model after fitting P models significantly improved the fit (Table 1) We looked for sequence characteristics that correlated with each principle coordinate, k: first the training set sequences i were sorted by their coordinates x ik , and then at each consensus position the average sequence composition was computed within the top and bottom 25% of the sorted sequences. Large differences in composition between the high and low groups indicated variation along that direction of sequence space. Fig. 3 shows the result of subtracting the bottom-quartile frequencies from the top quartile for four major significant principal coordinates, at each position within the variable central 6-bp core of the binding site. For example, PC1 involves the outer core positions 3, 4, 7, 8, in a contrast between GGGG**CCCC (high affinity) and GGAA**TTCC (low affinity). In contrast, PC3 involves the inner core positions 5, 6 GGA*AA**CC vs. GGG*GC**CC.
Differences Between p50p50 and p50p65. The corresponding regression coefficients for p50p50 and p50p65 were broadly similar in sign and magnitude, and the correlation between p50p50 and p50p65 raw data was about 80% after correcting for gel effects. Nevertheless, there are systematic differences in binding affinity between the dimers. One dimer was regressed on the other to remove common features, and the residuals regressed on the principle coordinates (data not shown); significant coefficients indicate binding affinity differences. PC3 and PC10, which mainly involve the inner core positions 5 and 6, were the main distinguishing factors.  (11,198) were identified on chromosome 22 and 1,161 motifs in the MHC. On chromosome 22, the frequencies of the individual motifs ranged widely from 5 to 133 occurrences. In contrast, on a randomized version of the chromosome, which preserved local sequence composition in each 10-kb window of the original, the frequencies ranged from 15 to 42 (Fig. 5, which is published as supporting information on the PNAS web site). There was a significant correlation (R ϭ 0.61, P Ͻ 0.01) between the frequencies of the sites on chromosome 22 and the MHC.
NF-B motifs were classified as (i) within 1 kbp upstream of a gene, (ii) within exons or introns, (iii) within 1 kbp downstream of a gene, or (iv) elsewhere. About 10% of the motifs occurred significantly more often (5-30 times more than expected) in upstream or downstream regions (P Ͻ 0.01). There was a slight but significant under-representation of high-affinity p50p65 sites across chromosome 22 overall (P Ͻ 0.01), and an overrepresentation of high-affinity p50p50 sites in upstream regions (P Ͻ 0.01). Similar results hold for the MHC region. Consequently, selection for higher-affinity NF-B-binding sites may have occurred near genes and may have been inhibited elsewhere, although it is also possible this phenomenon is due to variation in sequence composition.
Of 20,000 SNPs reported for chromosome 22, 89 are within NF-B-like-binding motifs. In 53 cases, the variant no longer matches the NF-B consensus and is expected to have low affinity. This was partially confirmed when five of these SNPs, selected at random, were assayed by EMSA. Four showed significant loss of affinity (Fig. 4, compare lanes 1 and 2, etc.). The exception was the high-affinity site GGGGATTCCC, in which the mutation C 10 ͞T 10 had almost no effect on its p50p65binding affinity (1.3-fold reduction), although p50p50 affinity was significantly reduced (5.5-fold). In the 36 cases where the variant still matches the NF-B consensus, we predicted the binding affinity of both wild type and variant. In nine of these cases, the predicted affinity is strongly affected by the polymorphism, defined as a change in rank of greater than 25 percentiles for either dimer (Table 3).

Discussion
We have described a method to predict the quantitative effect of sequence variation on binding affinity, applicable to the analysis of any DNA-binding protein. It requires experimental binding  data from only a selection of potential binding sites chosen so that all other sites are within a single base change of this training set. For NF-B, binding data from only 52 of 256 decamers were used to predict the remainder. The least number of sequences spanning the consensus sequence space is 23. However, it is desirable that the number of data points should be large enough to reduce prediction error because of experimental variation. With microarrays, it is possible to assay the training set in a single experiment (12,13). In theory, data could also be acquired from crystallographic or NMR studies generated for a single site (25), but data from many structures would be needed to match the predictive power of the model presented here.
The PC analysis projects the set of binding sites into a space that reflects accurately the differences between the sequences. Directions in sequence space along which the binding sites do not vary are implicitly ignored, so parameters that could not contribute to the model are never included. We confirmed its accuracy by crossvalidation and by experimentally verifying our predictions for nine additional motifs not included in the training set. For NF-B, binding affinity was modeled as a linear function of the sequences' coordinates. Nonlinear models did not improve the fit significantly (data not shown) but might be appropriate in other contexts. The PC model is accurate at prediction within the space of sequences matching the binding consensus but not necessarily outside of that space.
The PC approach compares favorably with profile-based methods. For instance, MATINSPECTOR (7) predictions show much lower correlation with observed binding than does our PC model. However, because the profile database used by MATIN-SPECTOR lacks our empirical data, as a fairer test we compared the PC model with two P models created from those data. The PC model performs significantly better than both P models for p50p65 and marginally better for p50p50. The crossvalidated correlation between predictions and observations is higher with the PC model in all cases, and moreover the fit improves significantly when the PC model is combined with either P model. Because the only additional information encoded by the PC model relates to the covariation of nucleotide positions in the binding site, these interactions must have a role in NF-B binding. The extent to which such interactions occur in general remains to be seen, and a comprehensive study is desirable. The PC model is therefore valuable, either on its own or in combination with other methods, when empirical data are limited, and where internucleotide interactions occur. Profiles will continue to be useful, but we have shown they have limitations, and that one can do better.
Although the binding affinities for p50p50 and p50p65 are strongly correlated, we identified two directions in sequence space where the dimers had significantly different regression coefficients. The main distinguishing factors between the two complexes predominantly involve the inner core positions 5 and 6, defined as NN in the original consensus. We can therefore predict those sequences that are likely to bind differently to the two complexes. Kunsch et al. (14) came to a similar conclusion based on the SELEX-type assay with homodimeric NF-B com- Fig. 3. Sequence characteristics of four major principal coordinates. PC1, PC2, and PC4 are significant for both p50p50 and p50p65 complexes, PC3 is significant for p50p65 only (as indicated in brackets). For each statistically significant principal coordinate, sequences were sorted along the coordinate. Mean base frequencies were computed for the Upper and Lower 25% of sites at each position within the 6-bp core of the NF-B site, e.g., for principal coordinate 1, 12͞13 upper-quartile sequences had a G at position 3 and one had an A, whereas 10͞13 lower-quartile sequences had an A and 3 had a G. The result of subtracting the sequence composition from the Upper and Lower quartiles is plotted (PC1, position 3: 9 for G and Ϫ9 for A). Thus PC1 represents a contrast between GGGG**CCCC (high affinity) and GGAA**TTCC (low affinity). plexes, in which the preferential binding motif for p50p50 was GGGGATYCCC, whereas p65p65 preferred GGGRNTTTCC. Our results extend this observation. We find a number of motifs that have a higher predicted preference for one of the dimers (e.g., GGAAAGTTCC, 5 times higher for p50p65 that for p50p50). This is of particular biological relevance, as p50p65 is a potent transcriptional activator, and p50p50 is believed to act as a repressor (26,27).
Those sequences that define the original NF-B consensus and that rank high in our model were discovered by their roles in the transcriptional regulation of Ig, IFN-␤, IL-6, E-selectin, tumor necrosis factor (TNF), etc. As it excludes some sequences that bind strongly, e.g., the human TNF promoter (22), we broadened the consensus slightly. Fig. 4 suggests the consensus may need to be widened still further-for instance, a pair of flanking guanidines is not always necessary.
Our analysis of chromosome 22 and the MHC region indicates we can predict which NF-B sites are likely to bind. These predictions should be tested in vivo (28,29) to determine whether other factors such as chromatin are involved. For example, the binding site for yeast Repressor-activated protein 1 is found throughout the yeast genome, but binding occurs preferentially to potential promotors in intergenic regions (30). It will also be interesting to examine the regions of extended cross-species homology, as it is believed that long-range regulatory sequences tend to be conserved among mammals (31).
We can readily predict those SNPs occurring within binding sites that are likely to have functional effects. Previously (3), we demonstrated that a SNP within the NF-B motif of the tumor necrosis factor promoter dramatically decreased the binding of p50p50, yet left the binding of p50p65 practically unchanged and resulted in a higher level of gene transcription. We identified about 50 SNPs located within NF-B-like motifs on chromosome 22 that would alter NF-B binding to DNA. These SNPs may be good candidates for disease association studies.
Finally, NF-B is only one among a few scores of common transcription factors that regulate a majority of genes. It should therefore be feasible to construct quantitative models for all these factors by performing a relatively small number of binding assays. Armed with these data, one would be well placed to predict the binding affinities at the complex of transcription sites that regulate each gene (32,33) and then begin to model how gene expression varies as a function of polymorphisms within the binding sites. The predicted rank for either NF-B dimer changed by more than 0.25 by the SNP.