Physical constraints and functional characteristics of transcription factor–DNA interaction

Edited by David R. Nelson, Harvard University, Cambridge, MA, and approved July 11, 2002 (received for review December 21, 2001)
Abstract
We study theoretical “design principles” for transcription factor (TF)–DNA interaction in bacteria, focusing particularly on the statistical interaction of the TFs with the genomic background (i.e., the genome without the target sites). We introduce and motivate the concept of programmability, i.e., the ability to set the threshold concentration for TF binding over a wide range merely by mutating the binding sequence of a target site. This functional demand, together with physical constraints arising from the thermodynamics and kinetics of TF–DNA interaction, leads us to a narrow range of “optimal” interaction parameters. We find that this parameter set agrees well with experimental data for the interaction parameters of a few exemplary prokaryotic TFs, which indicates that TF–DNA interaction is indeed programmable. We suggest further experiments to test whether this is a general feature for a large class of TFs.
With rapid advances in the sequencing and annotation of entire genomes, the task of understanding the associated regulatory networks becomes increasingly prominent. Currently, many experimental and computational efforts are devoted to deciphering the genetic wiring diagram of a cell (1–3). Most of these efforts are focused on locating the functional DNA-binding sites of transcription factors (TFs). This knowledge, together with the genomic sequences, will provide a qualitative picture of which gene products may directly affect the expression of which genes. While obtaining such wiring diagrams is tremendously important for the eventual understanding of gene regulation at the system level, this knowledge in itself is not sufficient for the quantitative understanding of system-level effects. This has been shown dramatically in a detailed experimental study of the regulation of the endo16 gene in sea urchin development (4), which revealed an intricate regulatory function where a dozen or so TFs control the expression of a single gene. It would have been impossible to infer even the gross qualitative features of the transcriptional control from the knowledge of the binding sites alone.
A major obstacle to progress is the lack of a quantitative understanding of the physical interaction between the TFs. However, even the simpler interaction between TFs and DNA sequences is not so well understood quantitatively: It is common to classify a potential TF-binding DNA sequence in a “digital” manner—either the sequence is designated for TF binding, or it is not. In this view of TF–DNA interaction, differences between the TF-binding sequences are only nuisances that impede straightforward bioinformatic methods of target-sequence discovery. On the other hand, there are plenty of examples where differences between target sequences are known to be functionally important (5). In many cases, the binding of a TF to one site occurs only in the presence of some other TF, while the binding of the same TF to a different site does not require other TFs. This flexibility in function often is accomplished by differences in the binding sequences and is believed to be the basis for combinatorial control and signal integration in gene regulation (6). Also, different binding sites of the same TF can be “tuned” to bind at different TF concentrations, as suggested by a recent study of the Escherichia coli flagella assembly system (7). If further experimental studies confirm that tuning of binding thresholds indeed is used genome-wide to establish desired gene-regulatory functions, then TF–DNA binding should be regarded more in an analog instead of a digital manner.
In this work, we report our theoretical study on the “design” of TF–DNA interaction, assuming the analog scheme of operation. Specifically, we impose the functional requirement that the threshold concentration for TF binding to a site can be controlled over a wide range by the choice of the sequence alone; we refer to this as the “programmability” of TF–DNA binding. Taken together with thermodynamic and kinetic constraints, this functional requirement leads to a narrow range of “optimal” TF–DNA interaction parameters. We then compare our result to experimentally known parameters for exemplary TFs to determine whether the design of these TFs indeed would allow the analog scheme of operation.
To focus our discussion, we limit ourselves exclusively to the case of bacterial TFs, which are the best characterized experimentally. We study both the equilibrium occupancy of a target sequence and the dynamics of locating the target. Von Hippel, Berg, and Winter have already discussed many aspects of these issues in a series of seminal articles (8–12). Our study is built firmly on their work but includes a number of additional issues: (i) the effect of sequence-specific binding to the genomic background (nontarget sequences) on the equilibrium occupation of a target sequence, (ii) kinetic traps arising statistically from the genomic background, and (iii) the desired programmability of TF–DNA binding. We adopt the model developed by von Hippel and Berg (11) and allow both the sequence-specific and nonspecific modes of TF–DNA binding. Sequence-specific binding occurs if the binding sequence is sufficiently close to the best binding sequence and is governed quantitatively by a specificity parameter. For typical bacterial TFs with binding sequences that are no more than 15 bases long, we find that our physical and functional requirements are best satisfied within a narrow regime of intermediate specificity, amounting to the loss of ≈2 k_{B}T for each additional base mismatch from the best binding sequence. Furthermore, the kinetic constraint favors a low threshold to nonspecific binding, while the programmability requirement pushes the threshold to larger values. The optimal tradeoff value only depends on the genome size and lies ≈16 k_{B}T above the energy of the best binding sequence for a genome of 10^{7} bases. These values correspond well with the interaction parameters of a number of well-characterized TFs, which suggests that programmability of TF–DNA binding is compatible with the reality of protein–DNA interaction and may be used by the organism to accomplish biological functions.
We hope to stimulate further experiments determining the interaction parameters for a wider range of TFs (see Discussion). These experiments could either strengthen or falsify the programmability concept depending on whether the interaction parameters are generally in agreement with our prediction.
Model of TF–DNA Interaction
Much of our knowledge on the details of TF–DNA interaction is derived from extensive biochemical experiments on a few exemplary systems dating back to pioneering work in the late 1970s (8–10, 13–15) and continuing through recent years (16–20). Furthermore, detailed structural information is available for many TFs from various structural families (21). Based on this knowledge, quantitative models of TF–DNA interaction have been established (8, 11, 12, 17). Together with the recent availability of genomic sequences, these models can be used to characterize the thermodynamics as well as the dynamics of TFs with genomic DNA in a cell. We briefly review the primary model of TF–DNA interaction in this section, which serves to introduce our notation and formulate the problem.
Biochemical and structural experiments, e.g., using lac repressor (9, 14, 20), have established firmly that (i) TFs bind closely to the DNA with a free energy ΔG_{ns} (with respect to the cytoplasm) regardless of its sequence due to electrostatic interaction alone, and (ii) additional sequence-specific binding energy can be gained (via hydrogen bonds) if the binding sequence is close to the recognition sequence of the TF. Let the total binding (free) energy of a TF to a sequence s→ = {s_{1}, s_{2}, …, s_{L}} of L nucleotides s_{i} ∈ {A,C,G,T} be ΔG[s→] (with respect to the cytoplasm), and let s→* be the best binding sequence. ΔG[s→] becomes sequence-independent, ΔG[s→] ≡ ΔG_{ns}, if s→ is far from s→*. This is believed to occur via a change in the conformation of the TF from one that allows more hydrogen-bond formation to another that brings the positive charges of the TF closer to the negatively charged DNA backbone (10).
For this study, it will be convenient to measure all energies with respect to that of the best binder, ΔG[s→*]. Let us define E[s→] ≡ ΔG[s→] − ΔG[s→*]. Furthermore, we will introduce the threshold energy E_{ns} ≡ ΔG_{ns} − ΔG[s→*], where TF–DNA binding switches from the specific to the nonspecific mode (for lac repressor, E_{ns} ≈ 10 kcal/mol). Then given the above model of TF–DNA interaction and assuming that the TF is bound to the DNA essentially all the time,¶ all thermodynamic quantities regarding this TF can be computed from the partition function∥
\[ Z = \sum_{j=1}^{N} \left( e^{-\beta E[\vec{s}_{j}]} + e^{-\beta E_{ns}} \right), \tag{1} \]
where β^{−1} = k_{B}T ≈ 0.6 kcal/mol and s→_{j} denotes the subsequence of the genomic sequence {s_{1}, s_{2}, … , s_{N}} from position j to j + L − 1. The binding length of a typical bacterial TF is L = 10–20 bp. The length of the genomic sequence, N, is typically several million bp.
The form of the binding energy E[s→] has been studied experimentally for several TFs (16–19). In particular, recent experiments on the TF Mnt from bacteriophage P22 (16) support the earlier model (11) that the contribution of each nucleotide in the binding sequence to the total binding energy is approximately independent and additive, i.e.,
\[ E[\vec{s}] = \sum_{i=1}^{L} \mathcal{E}_{i}(s_{i}). \tag{2} \]
For the TFs Mnt, Cro, and λ repressor, the parameters of the “energy matrix” ℰ_{i}(s_{i}) have actually been determined experimentally by in vitro measurements of the equilibrium binding constants K[s→] ∝ e^{−βE[s→]} for every single-nucleotide mutant of the best binding sequence s→* (16, 18, 19). Due to our definition of the energy scale, ℰ_{i}(s_{i}) = 0 for s_{i} = s*_{i} and ℰ_{i}(s_{i}) > 0 for s_{i} ≠ s*_{i}; the latter will be referred to as “mismatch energies.” While the simple form of the binding energy (Eq. 2) certainly will not hold for all TFs, and di- and trinucleotide correlation effects are likely to be important in many cases [e.g., to some extent for lac repressor (20)], the key results of our study are not sensitive to such correlations as long as there is a wide range of binding energies for different binding sequences. Thus we will adopt the simple form (Eq. 2) for this study. For the three well studied TFs, the mismatch energies are typically in the range of 1–3 k_{B}T. While the threshold energies E_{ns} have not been measured carefully for these TFs, it is believed that nonspecific binding does not occur until the binding sequences are at least 4–5 mismatches away from s→* (G. Stormo, private communication).
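The additive energy model and the partition function can be illustrated with a minimal Python sketch. The energy matrix below is hypothetical (it is not measured data for any TF), the function names are ours, and the partition function is assumed to be a sum over all genomic positions of the Boltzmann weights for specific and nonspecific binding, with all energies in units of k_{B}T.

```python
import math

# Hypothetical 3-position energy matrix in units of k_B T (illustration only).
# Each row maps a nucleotide to its mismatch energy; the best base has energy 0.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 2.0, "G": 1.5, "T": 2.5},
    {"A": 1.0, "C": 0.0, "G": 2.0, "T": 3.0},
    {"A": 2.0, "C": 2.5, "G": 0.0, "T": 1.0},
]


def binding_energy(site, matrix=ENERGY_MATRIX):
    """Additive binding energy E[s] = sum_i E_i(s_i), in units of k_B T."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal the matrix length L")
    return sum(row[base] for row, base in zip(matrix, site))


def partition_function(genome, e_ns, matrix=ENERGY_MATRIX):
    """Sum exp(-E[s_j]) + exp(-E_ns) over every window of length L.

    A linear sequence has N - L + 1 windows; a circular genome would have N.
    """
    length = len(matrix)
    total = 0.0
    for j in range(len(genome) - length + 1):
        window = genome[j:j + length]
        total += math.exp(-binding_energy(window, matrix)) + math.exp(-e_ns)
    return total


print(binding_energy("ACG"))             # best binder of the toy matrix
print(partition_function("ACGT", 16.0))  # two windows: ACG and CGT
```

For a real TF one would substitute a measured energy matrix and the genomic sequence; the toy values here only show how the sum is organized.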
Genomic Background and Target Recognition
Thermodynamics.
Let us first consider the binding of a single TF to its target sequence, denoted by s→_{t}. We will assume that thermal equilibrium can be reached within the relevant cellular time scale and discuss the important kinetics issue afterward. The effectiveness of the binding of the TF to its target is then described by the equilibrium binding probability P_{t}, which depends not only on the binding energy E_{t} ≡ E[s→_{t}] but also on the interaction with the rest of the genomic sequence. If the contribution of this genomic background to the partition function is Z_{b}, then the binding probability to the target is given by
\[ P_{t} = \frac{1}{1 + e^{\beta (E_{t} - F_{b})}}, \tag{3} \]
where F_{b} = −k_{B}T ln Z_{b} is the effective binding energy (or free energy) of the entire genomic background. Eq. 3 is a sigmoidal function of E_{t} with a (soft) threshold at F_{b}, i.e., a TF binds (with probability P_{t} > 0.5) if E_{t} < F_{b}. Since E_{t} ≥ 0 by definition, we must have
\[ F_{b} \gtrsim 0 \tag{4} \]
in order for a target sequence to be recognized by a single TF (we consider multiple TFs below). The background contribution can be computed for any given TF and genome according to Eq. 1 if the binding-energy matrix, the threshold energy E_{ns}, and the genomic sequence are known. We will instead seek a description that is independent of the specifics of the genomic sequences and energy matrices. To accomplish this, we observe first that for the few well studied TFs, the interaction of the TF with the genomic background can be well approximated by the interaction of the TF with random nucleotide sequences of the same length and single-nucleotide frequencies p(s). This is illustrated in Fig. 1A, where the histogram of binding energies obtained by using the binding-energy matrix ℰ_{i}(s) for the TF Cro on the E. coli genome (solid line) coincides well with the histogram of the same energy matrix applied to random nucleotide sequences (circles).
Moreover, there appears to be hardly any positional correlation in the binding energies along the genome, as shown by the “energy landscape” in Fig. 1B (see legend for details). In the following, we will therefore describe the effect of the genomic background by treating it as a random nucleotide sequence for a generic TF. In particular, we will describe the genomic background partition function by Z_{b} = Z_{sp} + N⋅e^{−βE_{ns}}, where the contribution due to sequence-specific binding is
\[ Z_{sp} = \sum_{\vec{s} \in S(N)} e^{-\beta E[\vec{s}]}, \tag{5} \]
with S(N) denoting a given collection of N random nucleotide sequences of length L drawn according to the frequency p(s) for each nucleotide s.
Even with the random sequence approximation (Eq. 5), computation of the background energy F_{b} = −k_{B}T ln Z_{b} is nontrivial in principle: From its definition, it is clear that F_{b} is a random variable, and its precise value will depend on the actual collection of sequences S(N). We are interested in the typical value of F_{b}, a reasonable approximation of which is its statistical average, F̄_{b} ≡ −k_{B}T $\overline{\ln Z_{b}}$. [We use an overbar to denote averages over an ensemble of different sequence collections S(N).] Computing the average $\overline{\ln Z_{b}}$, however, is difficult to do for an arbitrary energy matrix ℰ_{i}(s) short of performing numerical simulations. An alternative is to compute the ensemble average of Z_{b}, i.e., Z̄_{b} = Z̄_{sp} + N⋅e^{−βE_{ns}}, where
\[ \bar{Z}_{sp} = N \prod_{i=1}^{L} \left[ \sum_{s} p(s)\, e^{-\beta \mathcal{E}_{i}(s)} \right] \tag{6} \]
with the single-nucleotide frequencies p(s), and assume that
\[ \bar{F}_{b} \approx -k_{B}T \ln \bar{Z}_{b}. \tag{7} \]
This is, for example, the approach taken by Stormo and Fields (17) in their analysis of the TF Mnt.** We note in passing that Z̄_{b} can be written more compactly in terms of the density of states Ω_{sp}(E) for specific binding (the normalized version of the histogram in Fig. 1A), i.e.,
\[ \bar{Z}_{b} = \sum_{E} \Omega_{sp}(E) \left[ e^{-\beta E} + e^{-\beta E_{ns}} \right]. \tag{8} \]
Eq. 7 is based on the so-called annealed approximation $\overline{\ln Z} \approx \ln \bar{Z}$, which is valid for the genomic sequence length N → ∞ but not always appropriate for finite N, e.g., if the partition function is dominated by a few low-energy terms. Much is known from statistical physics about systems of the type defined by the partition function Z_{sp} in Eq. 5, generically known as the random-energy model or REM,†† introduced by Derrida (22). It turns out that the annealed approximation is valid as long as the system's entropy is significantly larger than zero, reflecting the contribution of many terms in the partition sum. We will see further below that proper function of the TFs requires the system to be in a regime where the annealed approximation is safely applicable. We thus will take the validity of Eq. 7 for granted. In this case, the condition in Eq. 4 for the recognition of the target sequence by a single TF becomes
\[ \bar{Z}_{sp} + N e^{-\beta E_{ns}} \lesssim 1. \tag{9} \]
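The difference between the annealed and the quenched average can be made concrete with a small numerical sketch. It assumes an illustrative two-state energy matrix (every non-optimal base costs the same energy, in units of k_{B}T), uniform nucleotide frequencies, and deliberately small N and L so that it runs quickly; all names are ours.

```python
import math
import random

BASES = "ACGT"


def two_state_matrix(eps, length):
    """Illustrative matrix: best base 'A' at every position, eps otherwise."""
    return [{b: (0.0 if b == "A" else eps) for b in BASES} for _ in range(length)]


def annealed_z_sp(matrix, freqs, n_sites):
    """Ensemble average of Z_sp: N * prod_i sum_s p(s) exp(-E_i(s))."""
    z = float(n_sites)
    for row in matrix:
        z *= sum(freqs[b] * math.exp(-row[b]) for b in BASES)
    return z


def sampled_z_sp(matrix, freqs, n_sites, rng):
    """Z_sp for one random collection S(N) of n_sites sequences."""
    weights = [freqs[b] for b in BASES]
    total = 0.0
    for _ in range(n_sites):
        site = rng.choices(BASES, weights=weights, k=len(matrix))
        total += math.exp(-sum(row[b] for row, b in zip(matrix, site)))
    return total


def free_energies(matrix, freqs, n_sites, n_ensembles, seed=0):
    """Return (annealed, quenched) free energies estimated from the same samples."""
    rng = random.Random(seed)
    zs = [sampled_z_sp(matrix, freqs, n_sites, rng) for _ in range(n_ensembles)]
    annealed = -math.log(sum(zs) / len(zs))            # -ln(mean Z)
    quenched = -sum(math.log(z) for z in zs) / len(zs)  # -mean(ln Z)
    return annealed, quenched


uniform = {b: 0.25 for b in BASES}
matrix = two_state_matrix(2.0, 6)
print(annealed_z_sp(matrix, uniform, 200))
print(free_energies(matrix, uniform, 200, 20))
```

By Jensen's inequality the annealed estimate never exceeds the quenched one; the two agree closely when many sequences contribute to the sum, which is the regime the text argues is relevant.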
Search Dynamics.
To carry out their function properly, TFs not only need to have a high equilibrium binding probability to their targets but also must be able to locate them in a reasonably short time (e.g., less than a few minutes) after they have been activated by an inducer or freshly produced by a ribosome. This constitutes a constraint on the “search dynamics” of TFs.
In their nonspecific binding mode, TFs are still strongly associated with the DNA but are able to diffuse (i.e., slide) randomly along the genome (8–10). However, pure 1D diffusion would be an inefficient search process, because it is very redundant (e.g., a 1D random walker always returns to its starting point). For instance, assuming generously a 1D diffusion constant of D_{1} ≈ 1 μm^{2}/sec (10), one finds a time T_{1D} ∼ N^{2}/D_{1} ∼ 10^{6} sec for a single TF to diffuse around a bacterial genome of length N ≈ 5 × 10^{6} bp (≈1 mm). Thus, to find a target within a few minutes via 1D diffusion, one would need at least 100 TFs per cell to search in parallel (so that the search length N is reduced by a factor of 100). On the other hand, there are well documented examples where regulation is accomplished effectively by only a few TFs in a cell (e.g., ≈10 for lac repressor in E. coli; ref. 24).
As studied in detail by Winter, Berg, and von Hippel (8–10), the search dynamics of TFs involves instead a combination of sliding along the DNA at short length scales and hopping between different segments of DNA (either over the dissociation barrier through the cytoplasm or by direct intersegment transfer; see Fig. 2A). This search mode is much faster (given the high DNA concentration inside the cell), because the dynamics is essentially 3D diffusion beyond the hopping scale, and 3D diffusion is much less redundant than 1D diffusion. For example, if the TFs were not bound to the DNA at all, a single TF of a few nanometers in linear dimension ℓ would locate its target in a cell volume V_{cell} of several μm^{3} in the average first passage time of T_{3D} = V_{cell}/(4πℓD_{3}) ∼ 10 sec, given a 3D diffusion constant on the order of D_{3} ∼ 10 μm^{2}/sec (25). The search time T_{3D/1D} for the combined 1D/3D diffusion under in vivo conditions can be estimated to be comparable to T_{3D} (10). Hence, the search time is short enough to comfortably allow even a single TF to locate its target within the physiological time scale.
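The order-of-magnitude estimates for the 1D and 3D search times can be reproduced with a few lines of Python. The specific inputs below (a 3 nm target size and a 3 μm^{3} cell volume) are our assumed readings of “a few nanometers” and “several μm^{3}”, and the function names are hypothetical.

```python
import math


def time_1d(length_um, d1_um2_per_s, n_searchers=1):
    """Scaling estimate T ~ (N/n)^2 / D_1 when n TFs each scan a stretch N/n."""
    return (length_um / n_searchers) ** 2 / d1_um2_per_s


def time_3d(volume_um3, target_size_um, d3_um2_per_s):
    """First-passage estimate T_3D = V_cell / (4 * pi * l * D_3)."""
    return volume_um3 / (4.0 * math.pi * target_size_um * d3_um2_per_s)


genome_length_um = 1000.0  # about 1 mm of DNA
print(time_1d(genome_length_um, 1.0))                   # single TF, pure sliding
print(time_1d(genome_length_um, 1.0, n_searchers=100))  # 100 TFs in parallel
print(time_3d(3.0, 0.003, 10.0))                        # free 3D diffusion
```

With these inputs the single-TF sliding time is 10^{6} sec, 100 parallel searchers bring it down to about 100 sec, and the 3D estimate comes out at roughly 8 sec, in line with the figures quoted in the text.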
In the study of the search dynamics reviewed above, binding of the TF to the genomic background was assumed to occur at a single energy value, namely, the nonspecific energy ΔG_{ns} (8). On the other hand, the energy landscape of Fig. 1B clearly shows that the random genomic background contains many isolated sites with binding energies far below ΔG_{ns}. These sites constitute kinetic traps that, in principle, can impede the local search process drastically if the energy difference to their surroundings is sufficiently large.‡‡ Thus to understand the search dynamics fully, we need to characterize the effect of kinetic traps in the genomic background: What is the constraint on the design of TF–DNA interaction imposed by requiring that the effect of kinetic traps be negligible?
At each binding sequence s→_{j} with energy E_{j} ≡ E[s→_{j}] < E_{ns}, the TF typically spends a time τ_{j} = τ_{0}⋅e^{β(E_{ns} − E_{j})}, where τ_{0} is the average “waiting time” of the TF at a nonspecific binding site. Along the search path of the TF, the average waiting time τ̄ per binding site then is given simply by
\[ \bar{\tau} = \frac{\tau_{0}}{N} \sum_{E} \Omega_{sp}(E) \left[ e^{\beta (E_{ns} - E)}\, \theta(E_{ns} - E) + \theta(E - E_{ns}) \right]. \tag{10} \]
Here we assumed as before that the genomic sequence is random such that the sequence-specific binding energy E can be treated as a random variable drawn from the distribution Ω_{sp}(E). The second term, with the help of the unit step function θ(x), is used to express the fact that there is no kinetic trap for the (majority of) sites with E > E_{ns}.
A comparison of Eqs. 10 and 8 for the average partition function immediately yields the important relation§§
\[ \frac{\bar{\tau}}{\tau_{0}} \approx \frac{\bar{Z}_{b}}{N}\, e^{\beta E_{ns}}, \tag{11} \]
since in Eq. 8, the second term dominates for E > E_{ns}. As expected, the kinetic trap factor τ̄/τ_{0} grows exponentially with E_{ns}, the threshold to nonspecific binding. On the other hand, we note from Z̄_{b} = Z̄_{sp} + N⋅e^{−βE_{ns}} (see Thermodynamics) that the trap factor can be made to be of order 1 such that the dynamical analysis of refs. 8–10 remains qualitatively valid if Z̄_{sp} ≤ N⋅e^{−βE_{ns}}. The physical meaning of this condition is that the average effect of the kinetic traps can be rendered small if the sum of the waiting times does not exceed the order of the plain diffusion time. As we will see, this can be accomplished by choosing the binding-energy matrix ℰ_{i}(s) and E_{ns} appropriately. Combining this kinetic constraint with Eq. 9, we obtain the condition
\[ \bar{Z}_{sp} \lesssim N e^{-\beta E_{ns}} \lesssim 1 \tag{12} \]
for the rapid recognition of a target sequence by a single TF.
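The size of the kinetic trap factor can be checked numerically under stated assumptions. The sketch below assumes a two-state energy model with uniform nucleotide frequencies (so that the number of mismatches is binomially distributed), an Arrhenius-like waiting time τ_{0}e^{β(E_{ns} − E)} at sites below the threshold and τ_{0} at all other sites, and energies in units of k_{B}T. The function names are ours.

```python
import math


def two_state_density(n_sites, length, k):
    """Expected number of random sites with k mismatches (uniform base frequencies)."""
    return n_sites * math.comb(length, k) * 0.75 ** k * 0.25 ** (length - k)


def trap_factor_exact(eps, length, n_sites, e_ns):
    """Average waiting time per site in units of tau_0, summed over mismatch classes."""
    total = 0.0
    for k in range(length + 1):
        energy = eps * k
        weight = math.exp(e_ns - energy) if energy < e_ns else 1.0
        total += two_state_density(n_sites, length, k) * weight
    return total / n_sites


def trap_factor_approx(eps, length, n_sites, e_ns):
    """Approximation (Z_sp + N exp(-E_ns)) * exp(E_ns) / N = 1 + Z_sp exp(E_ns) / N."""
    zeta = (1.0 + 3.0 * math.exp(-eps)) / 4.0
    z_sp = n_sites * zeta ** length
    return 1.0 + z_sp * math.exp(e_ns) / n_sites


n_sites = 1.0e7
e_ns = math.log(n_sites)
print(trap_factor_exact(2.0, 15, n_sites, e_ns))
print(trap_factor_approx(2.0, 15, n_sites, e_ns))
```

For these illustrative parameters both expressions give a trap factor between 2 and 3, i.e., of order 1, and raising e_ns by a few k_{B}T increases it exponentially.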
Programmability of Binding Threshold
Multiple TFs.
There are of course typically multiple copies of the same TF in the cell, and the regulatory function is accomplished if any one of these TFs binds to the target sequence. If the cell contains n copies of a given TF, then the occupation probability for the target sequence, Eq. 3, is replaced by the Fermi distribution (or “Arrhenius function”) P_{t} = 1/[1 + e^{β(E_{t} − μ)}], since each binding sequence can be occupied at most by one TF. The chemical potential μ(n) is determined implicitly from the condition¶¶
\[ n = \sum_{E} \frac{\left[ \Omega_{sp}(E) + N\,\delta(E - E_{ns}) + \delta(E - E_{t}) \right]}{1 + e^{\beta (E - \mu)}}, \tag{13} \]
where the quantity in brackets represents the total density of states. In the simplest scenario, where steric exclusion between TFs bound to the nontarget sequences is negligible, one has (11)
\[ \mu(n) = F_{b} + k_{B}T \ln n. \tag{14} \]
This is empirically found to be a good approximation for those TFs with known binding-energy matrices as shown in Fig. 2B. We will adopt the form of Eq. 14 for the chemical potential of a generic TF in this study; a general argument will be given later to justify this choice even for the case where multiple target sequences are present in the same genome.
Using Eq. 14, the occupation probability can be written more succinctly, P_{t} = 1/[1 + ñ_{t}/n], where
\[ \tilde{n}_{t} \equiv e^{\beta (E_{t} - F_{b})} \tag{15} \]
denotes the (soft) threshold concentration of the TF for occupation of the target sequence.
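The occupation probability P_{t} = 1/[1 + ñ_{t}/n] is easy to evaluate directly. The sketch below assumes that the threshold concentration is the Boltzmann factor of the target energy measured relative to the background free energy, with energies in units of k_{B}T; the function names are ours.

```python
import math


def threshold_concentration(e_t, f_b=0.0):
    """Soft threshold n~_t, assumed here to be exp(E_t - F_b) in units of k_B T."""
    return math.exp(e_t - f_b)


def occupancy(n, e_t, f_b=0.0):
    """Target occupation probability P_t = 1 / (1 + n~_t / n) for n TF copies."""
    return 1.0 / (1.0 + threshold_concentration(e_t, f_b) / n)


# Occupancy of a site with E_t = 4 k_B T as the TF copy number increases.
for copies in (1, 10, 100, 1000):
    print(copies, occupancy(copies, 4.0))
```

The occupancy equals 1/2 exactly when n equals the threshold concentration, which is what makes ñ_{t} a natural measure of the binding threshold.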
Programmability.
The allowed values of the background free energy F_{b} for the binding of the target sequence obviously depend on the TF concentration n. For example, we have the condition in Eq. 4 for n = 1, while smaller values are allowed for n > 1. It thus appears that the allowed F_{b} values are different for the different TFs, because they would typically be present in the cell with different concentrations. On the other hand, even for a given TF species, the desired binding threshold may not be at a single concentration for different target sites but can vary depending on functional demands. For example, it can be desirable to turn on different genes/operons at different TF concentrations to maintain a temporal order in the expression of different operons as the concentration of the controlling TF gradually changes over time. This effect was observed recently for the E. coli flagella assembly (7) and SOS response systems (U. Alon, private communication).
As another example, consider the case where a particular TF A is involved in the regulation of two operons, X and Y. Suppose it is desired that A activates the transcription of operon X on its own at a concentration n_{A}, while operon Y should be activated only if A is present (at the same concentration n_{A}) together with another TF B that can bind cooperatively with A. It is desirable then to have a strong binding site for A in the regulatory region of operon X such that its threshold ñ_{A,X} < n_{A}, and a weak binding site in the regulatory region of operon Y, with a threshold ñ_{A,Y} > n_{A}. The latter ensures that the operon Y will not be activated accidentally by fluctuations in n_{A} alone, and only when the TF B is present would the attractive interaction between A and B induce the two to bind to their targets.
The above examples show that it is functionally desirable to have the ability to set the binding threshold ñ_{t} of a given TF for each of its target sequences s→_{t} individually. As is clear from the defining expression (Eq. 15), this can be done only through the choice of the target sequence s→_{t}, which affects E_{t}, because the other variable, F_{b}, is fixed for a given TF. We refer to the ability to control the binding threshold ñ_{t} through the choice of the target sequence s→_{t} alone as programmability of the binding threshold. Assuming that programmability is a desirable feature of TF–DNA interaction (since sequence changes can be accomplished easily by point mutation if the functional need arises), we seek to determine the specifics of the TF–DNA interaction, e.g., the binding matrix ℰ_{i}(s), the length of the binding sequence L, and the threshold energy E_{ns}, which allow the targets to be maximally programmable.
TwoState Model and Parameter Selection.
Specifically, let us require programmability of the binding threshold over the entire range ñ = 1 … 10^{3}, since typical cellular TF concentrations range from a few to a few hundred per cell. The lower bound ñ ≈ 1 immediately imposes the condition in Eq. 4 on F_{b}, or, taking also the kinetic constraint into account, the condition in Eq. 12. Furthermore, to tune ñ throughout the desired range with a reasonable resolution, it is necessary to have the ability to change E_{t} from 0 to k_{B}T ln 10^{3} ≈ 7k_{B}T in small increments. This requires the nonzero entries of the binding-energy matrix ℰ_{i}(s) to take on small values. Which choices for the TF–DNA interaction parameters [ℰ_{i}(s), L, E_{ns}] can simultaneously satisfy the latter requirement and condition (Eq. 12)?
The combined effect of these physical constraints and functional demands is understood best by simplifying the energy matrix ℰ such that we retain the essential and generic aspect of sequence-specific binding while eliminating all TF-specific details. Toward this end, we adopt the two-state model originally introduced by von Hippel and Berg (11), characterizing all of the nonzero entries of the significant positions∥∥ in the energy matrix by a single value, i.e.,
\[ \mathcal{E}_{i}(s) = \begin{cases} 0 & \text{if } s = s^{*}_{i}, \\ \varepsilon\, k_{B}T & \text{if } s \neq s^{*}_{i}, \end{cases} \tag{16} \]
where ɛ is a dimensionless “discrimination energy” (in units of k_{B}T). It describes the energetic preference of the TF for the optimal binding sequence s→* and is a crucial parameter controlling the specificity of the TF. Within the two-state model, the binding energy to the target s→_{t} is simply ɛ times the total number of mismatches between the target and the best binder s→*, i.e., E[s→_{t}]/k_{B}T = ɛ⋅|s→_{t} − s→*|, where |…| denotes the Hamming distance between two sequences. Clearly, programmability is best satisfied with a small ɛ, which enhances the resolution of the programmable binding threshold.
The two-state model (Eq. 16) also allows an explicit evaluation of the condition in Eq. 12 via the formula Eq. 6 for Z̄_{sp}. Assuming for simplicity equal single-nucleotide frequencies in the background (i.e., p(s) = 1/4), the quantity in the bracket of Eq. 6 is evaluated easily. We have Z̄_{sp}(ɛ, L) = N⋅ζ^{L}(ɛ), where ζ ≡ Σ_{s}e^{−βℰ(s)}p(s) = (1 + 3e^{−ɛ})/4. Note that ζ^{−1} is in the range between 1 and 4 and can be regarded as the effective size of the nucleotide “alphabet” as “seen” by the TF in the specific binding mode. The maximum value ζ^{−1} = 4 is attained if the energy matrix has infinite discrimination, ɛ → ∞, while no discrimination can be achieved at ɛ = 0 where ζ^{−1} = 1. In Fig. 3A, we indicate the allowed region Z̄_{sp}(ɛ, L) ≤ 1 in the parameter space of (ɛ, L) with the boundary L*(ɛ) = ln N/ln ζ^{−1}(ɛ) defined by Z̄_{sp}(L*, ɛ) = 1. From Fig. 3, it is clear that the desire for small ɛ pushes the system to the boundary at Z̄_{sp} = 1. Along the boundary, the smallest ɛ is given by the largest allowable binding length L. For typical bacterial TFs with binding sequences that are no longer than ≈15 bp (usually dimers), we find ɛ ≈ 2.
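The boundary L*(ɛ) = ln N/ln ζ^{−1}(ɛ) and its inverse can be evaluated directly. In the sketch below the inversion formula for ɛ at a given L follows from solving ζ(ɛ) = N^{−1/L}; the function names and the choice N = 10^{7} are ours.

```python
import math


def zeta(eps):
    """zeta = (1 + 3 exp(-eps)) / 4 for equal nucleotide frequencies."""
    return (1.0 + 3.0 * math.exp(-eps)) / 4.0


def boundary_length(eps, n_sites):
    """L*(eps) = ln N / ln(1 / zeta(eps)); requires eps > 0."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return math.log(n_sites) / -math.log(zeta(eps))


def boundary_eps(length, n_sites):
    """Discrimination energy on the boundary for a given binding length L.

    Solves zeta(eps) = N**(-1/L); a solution exists only if L > ln N / ln 4.
    """
    target_zeta = n_sites ** (-1.0 / length)
    if 4.0 * target_zeta <= 1.0:
        raise ValueError("binding length too short for this genome size")
    return -math.log((4.0 * target_zeta - 1.0) / 3.0)


print(boundary_length(2.0, 1.0e7))  # binding length on the boundary at eps = 2
print(boundary_eps(15, 1.0e7))      # smallest eps compatible with L = 15
```

For N = 10^{7} and L = 15 this gives ɛ slightly above 2, consistent with the value quoted in the text.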
Although the result on ɛ is somewhat specific to the two-state model, the need for Z̄_{sp} → 1 imposed by the programmability consideration forces the threshold energy to take on the value
\[ E_{ns} \approx k_{B}T \ln N \approx 16\, k_{B}T \tag{17} \]
(for N ∼ 10^{7}) according to the condition in Eq. 12 independent of the specifics of the binding-energy matrix ℰ. It also follows that
\[ F_{b} \approx 0, \tag{18} \]
such that the binding threshold is simply given by
\[ \tilde{n}_{t} \approx e^{\beta E_{t}}. \tag{19} \]
The dependence of ñ on the number of mismatches for the two-state model is shown in Fig. 3B. We see that at the optimal parameter choice of (ɛ = 2, L = 15), each mismatch increases the binding threshold ñ by nearly 10-fold. In principle, further fine-tuning can be accomplished by using small variations in the mismatch energies.
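The resulting numbers are simple to tabulate. The sketch below assumes the two-state model with F_{b} ≈ 0, a threshold energy equal to ln N, and energies in units of k_{B}T; the function names are ours.

```python
import math


def nonspecific_threshold(n_sites):
    """Threshold energy E_ns = ln N in units of k_B T."""
    return math.log(n_sites)


def binding_threshold(mismatches, eps=2.0, f_b=0.0):
    """Threshold concentration exp(eps * mismatches - F_b) for a two-state site."""
    return math.exp(eps * mismatches - f_b)


print(nonspecific_threshold(1.0e7))
for k in range(5):
    print(k, binding_threshold(k))
```

With ɛ = 2 the thresholds for 0 to 3 mismatches are roughly 1, 7, 55, and 400 TFs per cell, so a handful of point mutations spans the range ñ = 1 … 10^{3}.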
Discussion
The key results of this study, that maximal programmability of the binding threshold ñ requires the TF–DNA interaction to satisfy the conditions in Eqs. 17 and 18, can be conveniently summarized graphically using the density of states Ω_{sp}(E). In Fig. 4, the density of states is plotted with the normalization that max_{E} Ω_{sp}(E) = N, as indicated by the horizontal dotted line. The background free energy F_{b} can be obtained using the Legendre construction: One draws the line (the dashed line in the semilog plot of Fig. 4) such that it just touches Ω_{sp}(E). F_{b} then can be read off as the intercept of the dashed line on the E axis, which should be in the vicinity of the origin according to Eq. 18. Similarly, E_{ns} (as given by Eq. 17) can be read off as the E coordinate where the dashed line intersects the horizontal dotted line.
The point where the dashed line is tangent to Ω_{sp}(E) also is physically meaningful: The E coordinate of the tangent point gives the ensemble-averaged binding energy E_{0} ≡ Σ_{E} EΩ_{sp}(E)e^{−βE}/Z_{sp}. The vertical coordinate N_{0} of the tangent point is given by the relation F_{b} = E_{0} − k_{B}T ln N_{0}, which expresses the fact that the dominant contribution to the background free energy stems from the N_{0} sequences of energy ≈E_{0} in the collection of N random sequences: The Boltzmann weight of those sequences with E > E_{0} is too small to contribute to the partition sum, while for E < E_{0}, there are too few sequences.
The value of N_{0} is an important characteristic of the system. S = ln N_{0} is known as the “entropy” of this system, and H = ln(N/N_{0}) is known as the “relative entropy”; the latter has been used to characterize the specificity of the TF–DNA interaction (17). As mentioned before, the annealed approximation is valid only if many terms contribute to the partition sum, i.e., if N_{0} ≫ 1. For the two-state model (Eq. 16), the values of ɛ and L corresponding to the line N_{0} = 1 are far from the line L*(ɛ) selected by the maximal programmability criterion; this justifies the use of the annealed approximation. At the optimal parameter of ɛ = 2 and L = 15, we have N_{0} ≈ 10^{3} ≫ 1. The corresponding relative entropy is H ≈ 7 (≈10 bits).
The large value of N_{0} also provides us with an intuitive understanding of the simple dependence (Eq. 14) of the chemical potential μ on the cellular TF concentration n (see Fig. 2B). As mentioned already, the expression (Eq. 14) is obtained if multiple occupancy of the background sequences is negligible at the TF concentration n. Since there is a large number (i.e., N_{0}) of binding sequences that contribute significantly to the net effect of background binding, multiple occupancy of these sequences is indeed not likely if n < N_{0}. Thus for N_{0} ∼ O(10^{3}), the expression (Eq. 14) can be taken as a good approximation of the chemical potential over the typical range of cellular TF concentration n = 1 … 10^{3}, as shown in Fig. 2B for the three known TFs. We expect this result to hold even if there are multiple target sequences, say m_{t}, the binding energy E_{t} of which is much lower than E_{0} as long as E_{t} > k_{B}T ln m_{t} such that F_{b} is not affected by the addition of these target sequences to the density of states. Having μ(n) independent of the number of targets is a desirable functional robustness property from a system perspective, because one wouldn't want to perturb the recognition of the TFs and the existing targets by the addition of a few new targets. It will be interesting to see to what extent this feature is preserved by studying the energetics of TFs with a large number of target sites, e.g., the catabolite repressor protein CRP in E. coli (5).
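The relative entropy quoted for the two-state model can be checked with a short calculation. The sketch assumes uniform nucleotide frequencies and uses only the sequence-specific part of the partition sum, so that H = ln N − ln N_{0} = −L ln ζ − E_{0}, with E_{0} the Boltzmann-averaged energy in units of k_{B}T; in this form the genome size N drops out. The function name is ours.

```python
import math


def relative_entropy(eps, length):
    """Relative entropy H = ln(N / N_0) of the two-state model, in nats."""
    w = 3.0 * math.exp(-eps)             # total weight of the three mismatched bases
    ln_zeta = math.log((1.0 + w) / 4.0)  # ln of the per-position factor zeta
    e0 = length * eps * w / (1.0 + w)    # Boltzmann-averaged binding energy E_0
    return -length * ln_zeta - e0


h_nats = relative_entropy(2.0, 15)
print(h_nats, h_nats / math.log(2.0))  # nats and bits
```

For ɛ = 2 and L = 15 this evaluates to about 7 nats, or about 10 bits, matching the value given in the text.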
Finally, we compare the values of the optimal interaction parameters according to our theory to those of the well studied TFs. From the values listed in Table 1, we see that all the available data are in the neighborhood of the expectation based on the maximal programmability criterion. We do not suggest here that programmability was necessarily the selective driving force that constrained the TF–DNA interaction to its observed form (there could be other reasons, e.g., biochemical restrictions, for the interaction to be of this form). However, the rough correspondence between theory and observation does indicate that it is possible (and perhaps even very likely) that TFs generally have the required energetics for their binding threshold to be programmable over a wide range.
One obvious shortcoming of the above comparison is that the three TFs for which the interaction parameters are known are all from bacteriophages and may not represent typical prokaryotic TFs. It therefore will be very important to experimentally determine the interaction parameters for a variety of different TFs. The results of a sufficient number of such studies will inform us whether programmability is a generic feature of TF–DNA interaction. Knowledge of this kind can be very helpful in developing appropriate coarse-grained models of gene regulation at the system level. In particular, quantitative relations of the type suggested by Eq. 19 will be necessary for an eventual quantitative description of gene-regulatory networks. Also, this knowledge would have important implications for the evolution of gene regulation (26, 27).
Acknowledgments
We acknowledge useful discussions with G. Stormo, P. von Hippel, and K. Sneppen on many aspects of TF–DNA interaction. We are also grateful for the hospitality of the Institute for Theoretical Physics in Santa Barbara, where some of the work was carried out. This research is supported in part by National Science Foundation Grant DMR-9971456. U.G. was supported in part by a German fellowship from the Deutscher Akademischer Austauschdienst, and T.H. was supported in part by a Burroughs Wellcome functional genomics award.
Footnotes

↵† U.G. and J.D.M. contributed equally to this work.

↵‡ To whom reprint requests should be addressed. Email: gerland{at}physics.ucsd.edu.

↵§ Present address: NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.

This paper was submitted directly (Track II) to the PNAS office.

↵¶ In vivo measurements for the case of lac repressor found less than 10% of the TFs were unbound (15). This agrees well with an estimate based on a typical prokaryotic cell volume of 3 μm^{3}, a genome length of 5 × 10^{6} bases, and a nonspecific binding constant on the order of 10^{4} M^{−1} under physiological conditions (13), which yields a fraction of unbound TFs at a few-percent level.
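The arithmetic behind this estimate can be sketched as follows. This is a minimal illustration under assumptions not stated in the footnote: a simple two-state equilibrium between free and nonspecifically bound TFs, one nonspecific site per base of the genome, and sites in large excess so that the free-site concentration equals the total. The function name `unbound_fraction` is illustrative only.

```python
AVOGADRO = 6.02214076e23  # particles per mole


def unbound_fraction(cell_volume_um3, genome_length_bases, k_ns_per_molar):
    """Fraction of TFs not bound to DNA in a two-state equilibrium.

    Assumes one nonspecific site per base and sites in large excess over
    TFs, so that fraction = 1 / (1 + K_ns * [sites]).
    """
    volume_liters = cell_volume_um3 * 1e-15  # 1 um^3 = 1e-15 L
    site_concentration = genome_length_bases / (AVOGADRO * volume_liters)  # mol/L
    return 1.0 / (1.0 + k_ns_per_molar * site_concentration)


# Values quoted in the footnote: 3 um^3, 5e6 bases, K_ns ~ 1e4 per molar.
fraction = unbound_fraction(3.0, 5e6, 1e4)
print(f"unbound fraction = {fraction:.3f}")
```

With these inputs the site concentration is a few millimolar and the unbound fraction comes out between 3% and 4%. Counting the reverse complement as additional sites would double the site concentration and lower the fraction to roughly 2%, still at the few-percent level.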

↵∥ One also should include the reverse complement of the genomic sequence in the evaluation of the partition function Z. In order not to make the notation too complicated, we extend the definition of “genomic sequence” to include its complement.

↵** In ref. 17, the nonspecific binding was not included so that = and the energy scale was shifted such that Z_{b} = N.

↵†† In many applications, including protein folding (23), the REM was introduced to approximate the random background interaction. The TF–DNA interaction as defined by Eq. 5 represents one of the few systems for which the REM description is directly applicable.

↵‡‡ Note that the additional sequence-specific binding energy to a “spurious site” in the background equally increases the kinetic barrier for sliding to a neighboring site as well as for dissociation into the cytoplasm.

↵§§ Note that this relation is actually independent of the additive form of the binding energy (Eq. 2).

↵¶¶ Here, the exclusion between overlapping binding sites can be neglected, because n ≪ N. Also, we have not included the (unimportant) exclusion between the specific and unspecific binding mode at a given site.

↵∥∥ Note that the energy matrices for most TFs contain a number of (fixed) positions that have no strong preference for any of the nucleotides. We will not consider these positions in the ensuing discussion of the two-state model and will use L to refer to the total number of significant positions.
Abbreviations

TF, transcription factor
 Received December 21, 2001.
 Copyright © 2002, The National Academy of Sciences
References
 ↵
 Davidson E. H.

 Berman B. P.
 ↵
 ↵
 Neidhardt F. C.
 ↵
 Ptashne M.
 ↵
 Kalir S.
 ↵
 ↵
 ↵
 ↵
 von Hippel P. H.
 ↵
 ↵
 ↵
 ↵
 Kao-Huang Y.
 ↵
 ↵
 ↵
 Sarai A.
 ↵
 Takeda Y.
 ↵
 ↵
 ↵
 ↵
 Bryngelson J. D.
 ↵
 ↵
 Elowitz M. B.
 ↵
 Sengupta A. M.
 ↵
 ↵