# A maximum entropy framework for nonexponential distributions

^{a}Department of Mathematics, Oregon State University, Corvallis, OR 97331; ^{b}Laufer Center for Physical and Quantitative Biology, Departments of Physics and Chemistry, State University of New York, Stony Brook, NY 11794; and ^{c}Department of Systems Biology, Columbia University, New York, NY 10032

Contributed by Ken A. Dill, November 7, 2013 (sent for review June 26, 2013)

## Significance

Many statistical distributions, particularly among social and biological systems, have “heavy tails,” which are situations where rare events are not as improbable as would have been guessed from more traditional statistics. Heavy-tailed distributions are the basis for the phrase “the rich get richer.” Here, we propose a basic principle underlying systems with heavy-tailed distributions. We show that it is the same principle (maximum entropy) used in statistical physics and statistics to estimate probabilistic models from relatively few constraints. The heavy-tail principle can be expressed in terms of shared costs and economies of scale. The probability distribution we derive is a mathematical digamma function, and we show that it accurately fits 13 real-world data sets.

## Abstract

Probability distributions having power-law tails are observed in a broad range of social, economic, and biological systems. We describe here a potentially useful common framework. We derive distribution functions for situations in which a “joiner particle” *k* pays some form of price to enter a community of size $k-1$, where costs are subject to economies of scale. Maximizing the Boltzmann–Gibbs–Shannon entropy subject to this energy-like constraint predicts a distribution having a power-law tail; it reduces to the Boltzmann distribution in the absence of economies of scale. We show that the predicted function gives excellent fits to 13 different distribution functions, ranging from friendship links in social networks, to protein–protein interactions, to the severity of terrorist attacks. This approach may give useful insights into when to expect power-law distributions in the natural and social sciences.

Probability distributions are often observed to have power-law tails, particularly in social, economic, and biological systems. Examples include distributions of fluctuations in financial markets (1), the populations of cities (2), the distribution of Web site links (3), and others (4, 5). Such distributions have generated much popular interest (6, 7) because of their association with rare but consequential events, such as stock market bubbles and crashes.

If sufficient data are available, finding the mathematical shape of a distribution function can be as simple as curve-fitting, with a follow-up determination of the significance of the mathematical form used to fit it. However, it is often interesting to know if the shape of a given distribution function can be explained by an underlying generative principle. Principles underlying power-law distributions have been sought in various types of models. For example, the power-law distributions of node connectivities in social networks have been derived from dynamical network evolution models (8–17). A large and popular class of such models is based on the preferential attachment rule (18–27), wherein it is assumed that new nodes attach preferentially to the largest of the existing nodes. Explanations for power laws are also given by Ising models in critical phenomena (28–34), network models with thresholded “fitness” values (35), and random-energy models of hydrophobic contacts in protein interaction networks (36).

However, such approaches are often based on particular mechanisms or processes; they often predict particular power-law exponents, for example. Our interest here is in finding a broader vantage point, as well as a common language, for describing a range of distributions, from power law to exponential. For deriving exponential distributions, a well-known general principle is the method of maximum entropy (Max Ent) in statistical physics (37, 38). In such problems, you want to choose the best possible distribution from all candidate distributions that are consistent with a certain set of constrained moments, such as the average energy. For this type of problem, which is highly underdetermined, a principle is needed for selecting a “best” mathematical function from among alternative model distribution functions. To find the mathematical form of the distribution function $p_k$ over states $k$, the Max Ent principle asserts that you should maximize the Boltzmann–Gibbs–Shannon (BGS) entropy functional $S[\{p_k\}] = -\sum_k p_k \log p_k$ subject to constraints, such as the known value of the average energy $\langle E \rangle$. This procedure gives the exponential (Boltzmann) distribution, $p_k \propto e^{-\beta E_k}$, where *β* is the Lagrange multiplier that enforces the constraint. This variational principle has been the subject of various historical justifications. It is now commonly understood as the approach that chooses the least-biased model that is consistent with the known constraint(s) (39).
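For readers who want the intermediate step, the standard Lagrange-multiplier calculation behind the Boltzmann form can be sketched as follows. The symbols $p_k$ and $E_k$ (probability and energy of state $k$) and the normalization multiplier $\alpha$ are notational assumptions introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L} &= -\sum_k p_k \ln p_k
  - \alpha\Big(\sum_k p_k - 1\Big)
  - \beta\Big(\sum_k p_k E_k - \langle E \rangle\Big), \\
\frac{\partial \mathcal{L}}{\partial p_k} &= -\ln p_k - 1 - \alpha - \beta E_k = 0
  \quad\Longrightarrow\quad
  p_k = \frac{e^{-\beta E_k}}{\sum_i e^{-\beta E_i}} .
\end{aligned}
```

The multiplier $\alpha$ is eliminated by normalization, and $\beta$ is fixed by requiring that the distribution reproduce the known average energy.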

Is there an equally compelling principle that would select fat-tailed distributions, given limited information? There is a large literature that explores this. Inferring nonexponential distributions can be done by maximizing a different mathematical form of entropy, rather than the BGS form. Examples of these nontraditional entropies include those of Tsallis (40), Rényi (41), and others (42, 43). For example, the Tsallis entropy is defined as $S = \frac{K}{q-1}\left(1 - \sum_k p_k^q\right)$, where *K* is a constant and *q* is a parameter for the problem at hand. Such methods otherwise follow the same strategy as above: maximizing the chosen form of entropy subject to an extensive energy constraint gives nonexponential distributions. The Tsallis entropy has been applied widely (44–53).

However, we adopt an alternative way to infer nonexponential distributions. To contrast our approach, we first switch from probabilities to their logarithms. Logarithms of probabilities can be parsed into energy-like and entropy-like components, as is standard in statistical physics. Said differently, a nonexponential distribution that is derived from a Max Ent principle requires that there be nonextensivity in either an energy-like or entropy-like term; that is, it is nonadditive over independent subsystems, not scaling linearly with system size. Tsallis and others have chosen to assign the nonextensivity to an entropy term, and retain extensivity in an energy term. Here, instead, we keep the canonical BGS form of entropy, and invoke a nonextensive energy-like term. In our view, only the latter approach is consistent with the principles elucidated by Shore and Johnson (37) (reviewed in ref. 39). Shore and Johnson (37) showed that the BGS form of entropy is uniquely the mathematical function that ensures satisfaction of the addition and multiplication rules of probability. Shore and Johnson (37) assert that any form of entropy other than BGS will impart a bias that is unwarranted by the data it aims to fit. We regard the Shore and Johnson (37) argument as a compelling first-principles basis for defining a proper variational principle for modeling distribution functions. Here, we describe a variational approach based on the BGS entropy function, and we seek an explanation for power-law distributions in the form of an energy-like function instead.

## Theory

### Assembly of Simple Colloidal Particles.

We frame our discussion in terms of a joiner particle that enters a cluster or community of particles, as shown in Fig. 1. This is a natural way to describe the classical problem of the colloidal clustering of physical particles; it is readily shown (reviewed below) to give an exponential distribution of cluster sizes. However, this general description also pertains more broadly, such as when people populate cities, links are added to Web sites, or when papers accumulate citations. We want to compute the distribution, $p_k$, of populations of communities having size $k = 1, 2, \ldots, N$.

To begin, we express a cumulative cost of joining. For particles in colloids, this cost is expressed as a chemical potential, i.e., a free energy per particle. If $\mu_j$ represents the cost of adding a particle to a cluster of size $j$, the cumulative cost of assembling a whole cluster of *k* particles, starting from a single particle, is the sum

$$w_k = \sum_{j=1}^{k-1} \mu_j. \tag{1}$$

Max Ent asserts that we should choose the probability distribution that has the maximum entropy among all candidate distributions that are consistent with the mean value $\langle w \rangle$ of the total cost of assembly (54),

$$p_k = \frac{e^{-\lambda w_k}}{\sum_i e^{-\lambda w_i}}, \tag{2}$$

where λ is a Lagrange multiplier that enforces the constraint.

In situations where the cost of joining does not depend on the size of the community a particle joins, $\mu_k = \mu^\circ$, where $\mu^\circ$ is a constant. The cumulative cost of assembling the cluster is then

$$w_k = (k-1)\,\mu^\circ. \tag{3}$$

Substituting into Eq. **2** and absorbing the Lagrange multiplier λ into $\mu^\circ$ yields the grand canonical exponential distribution, well known for problems such as this:

$$p_k = \frac{e^{-\mu^\circ k}}{\sum_i e^{-\mu^\circ i}}. \tag{4}$$

In short, when the joining cost of a particle entry is independent of the size of the community it enters, the community size distribution is exponential.
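As a numerical illustration, the sketch below builds the maximum-entropy distribution from a list of cumulative costs and checks that a size-independent joining cost gives a geometric (exponential) distribution. It assumes cumulative costs of the form $w_k = (k-1)\mu^\circ$; the function names and the value of `mu0` are hypothetical choices made for this example.

```python
import math

def maxent_distribution(costs, lam=1.0):
    """Return p_k proportional to exp(-lam * w_k) for cumulative costs w_k."""
    w_min = min(costs)  # shift by the minimum for numerical stability
    weights = [math.exp(-lam * (w - w_min)) for w in costs]
    z = sum(weights)
    return [x / z for x in weights]

def constant_cost_cumulative(mu0, n_max):
    """Cumulative cost w_k = (k - 1) * mu0 for cluster sizes k = 1..n_max."""
    return [(k - 1) * mu0 for k in range(1, n_max + 1)]

mu0 = 0.5  # illustrative intrinsic joining cost (assumption)
p = maxent_distribution(constant_cost_cumulative(mu0, 200))

# For an exponential distribution, successive probabilities have a constant ratio.
ratios = [p[k + 1] / p[k] for k in range(10)]
print(ratios[:3], math.exp(-mu0))
```

The same `maxent_distribution` helper accepts any other cumulative cost vector, which is how a size-dependent joining cost would enter.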

### Communal Assemblies and Economies of Scale.

Now, we develop a general model of communal assembly based on economies of scale. Consider a situation where the joining cost for a particle depends on the size of the community it joins. In particular, consider situations in which the costs are lower for joining a larger community. Said differently, the cost-minus-benefit function is now allowed to be subject to economies of scale, which, as we note below, can also be interpreted instead as a form of discount in which the community pays down some of the joining costs for the joiner particle.

To see the idea of an economy-of-scale cost function, imagine building a network of telephones. In this case, a community of size 1 is a single unconnected phone. A community of size 2 is two connected phones, etc. Consider the first phone: The cost of creating the first phone is high because it requires initial investment in the phone assembly plant. And the benefit is low, because there is no value in having a single phone. Now, for the second phone, the cost-minus-benefit is lower. The cost of producing the second phone is lower than the first because the production plant already exists, and the benefit is higher because two connected phones are more useful than one unconnected phone. For the third phone, the cost-minus-benefit is even lower than for the second because the production cost is even lower (economy of scale) and because the benefits increase with the number of phones in the network.

To illustrate, suppose the cost-minus-benefit for the first phone is 150, for the second phone is 80, and for the third phone is 50. To express these cost relationships, we define an intrinsic cost for the first phone (joiner particle), 150 in this example. We define the difference in cost-minus-benefit between the first and second phones as the discount provided by the first phone when the second phone joins the community of two phones. In this example, the first phone provides a discount of 70 when the second phone joins. Similarly, the total discount provided by the two-phone community is 100 when the third phone joins the community.

In this language, the existing community is paying down some fraction of the joining costs for the next particle. Mathematically, this communal cost-minus-benefit function can be expressed as

$$\mu_k = \mu^\circ - \frac{k\,\mu_k}{k_0}. \tag{5}$$

The quantity on the left side of Eq. **5** is the total cost-minus-benefit when a particle joins a *k*-mer community. The joining cost has two components, expressed on the right side: each joining event has an intrinsic cost $\mu^\circ$ that must be paid, and each joining event involves some discount $k\mu_k/k_0$ that is provided by the community. Because there are *k* members of the existing community, the quantity $\mu_k/k_0$ is the discount given to a joiner by each existing community particle, where $k_0$ is a problem-specific parameter that characterizes how much of the joining cost burden is shouldered by each member of the community. In the phone example, we assumed . The value of $k_0 = 1$ represents fully equal cost-sharing between joiner and community member: each communal particle gives the joining particle a discount equal to what the joiner itself pays. The opposite extreme limit is represented by $k_0 \to \infty$; in this case, the community gives no discount at all to the joining particle.
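The cost-sharing rule can be checked numerically. The sketch below assumes the self-consistent form $\mu_k = \mu^\circ - k\mu_k/k_0$ for the cost of joining a community of size $k$, which solves to $\mu_k = \mu^\circ k_0/(k + k_0)$. The function names and the values `mu0 = 150`, `k0 = 1` are illustrative assumptions; with them the costs for $k = 0, 1, 2$ come out as 150, 75, and 50, close to but not identical to the 150, 80, 50 quoted in the phone illustration.

```python
def joining_cost(k, mu0, k0):
    """Cost of joining a community of size k: mu_k = mu0 * k0 / (k + k0)."""
    return mu0 * k0 / (k + k0)

def community_discount(k, mu0, k0):
    """Total discount paid by the k existing members: k * mu_k / k0."""
    return k * joining_cost(k, mu0, k0) / k0

mu0, k0 = 150.0, 1.0  # illustrative values only (assumptions)
for k in range(4):
    mu_k = joining_cost(k, mu0, k0)
    discount = community_discount(k, mu0, k0)
    # The joiner's share plus the community's share always equals mu0.
    print(k, round(mu_k, 2), round(discount, 2), round(mu_k + discount, 2))
```

Raising `k0` shrinks the per-member discount, and in the limit of very large `k0` the joining cost stays at `mu0` for every community size.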

The idea of communal sharing of cost-minus-benefit is applicable to various domains; it can express that one person is more likely to join a well-populated group on a social networking site because the many existing links to it make it easier to find (i.e., lower cost) and because its bigger hub offers the newcomer more relationships to other people (i.e., greater benefit). Or, it can express that people prefer larger cities to smaller ones because of the greater benefits that accrue to the joiner in terms of jobs, services, and entertainment. (In our terminology, a larger community pays down more of the cost-minus-benefit for the next immigrant to join.) We use the terms “economy of scale” (EOS) or “communal” to refer to any system that can be described by a cost function, such as Eq. **5**, in which the community can be regarded as sharing in the joining costs, although other functional forms might also be of value for expressing EOS.

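Rearranging Eq. **5** gives $\mu_k = \mu^\circ k_0/(k + k_0)$. The total cost-minus-benefit, $w_k$, of assembling a community of size *k* is

$$w_k = \sum_{j=1}^{k-1} \mu_j = \mu^\circ k_0 \sum_{j=1}^{k-1} \frac{1}{j + k_0} = \mu^\circ k_0\, \psi(k + k_0) + \text{constant}, \tag{6}$$

where $\psi$ is the digamma function, which for integer argument is $\psi(k) = -\gamma + \sum_{j=1}^{k-1} j^{-1}$ ($\gamma = 0.5772\ldots$ is Euler’s constant), and the constant term $-\mu^\circ k_0\, \psi(1 + k_0)$ will be absorbed into the normalization.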
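The digamma form of the cumulative cost can be verified against the direct sum. The sketch below assumes a joining cost $\mu_j = \mu^\circ k_0/(j + k_0)$ summed over $j = 1, \ldots, k-1$, and uses a small pure-Python digamma routine (upward recurrence plus the standard asymptotic series) written for this example; it is not taken from the paper, and the parameter values are arbitrary.

```python
import math

def digamma(x):
    """Digamma function for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x  # recurrence psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def cumulative_cost_sum(k, mu0, k0):
    """w_k as the direct sum of joining costs for community sizes 1..k-1."""
    return sum(mu0 * k0 / (j + k0) for j in range(1, k))

def cumulative_cost_digamma(k, mu0, k0):
    """w_k = mu0 * k0 * [psi(k + k0) - psi(1 + k0)]."""
    return mu0 * k0 * (digamma(k + k0) - digamma(1.0 + k0))

mu0, k0 = 0.8, 2.5  # illustrative parameter values (assumptions)
for k in (1, 2, 10, 100):
    print(k, cumulative_cost_sum(k, mu0, k0), cumulative_cost_digamma(k, mu0, k0))
```

In practice a library digamma routine could replace the hand-written one; the direct sum is kept here so that the identity is checked rather than assumed.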

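From this cost-minus-benefit expression (Eq. **6**), for a given $k_0$, we can now uniquely determine the probability distribution by maximizing the entropy. Substituting Eq. **6** into Eq. **2** (with λ again absorbed into $\mu^\circ$) yields

$$p_k = \frac{e^{-\mu^\circ k_0\, \psi(k + k_0)}}{\sum_i e^{-\mu^\circ k_0\, \psi(i + k_0)}}. \tag{7}$$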

Eq. **7** describes a broad class of distributions. These distributions have a power-law tail for large *k*, with exponent $\mu^\circ k_0$, and a cross-over at $k = k_0$ from exponential to power law. To see this, expand $\psi(k + k_0)$ asymptotically and drop terms of order $1/k^2$; this yields $p_k \sim (k + k_0 - 1/2)^{-\mu^\circ k_0}$, so Eq. **7** obeys a power law for large *k*, and becomes a simple exponential in the limit of $k_0 \to \infty$ (zero cost-sharing). One quantitative measure of a distribution’s position along the continuum from exponential to power law is the value of its scaling exponent, $\mu^\circ k_0$. A small exponent indicates that the system has extensive social sharing, thus power-law behavior. As the exponent becomes large, the distribution approaches an exponential function. Eq. **7** has a power-law scaling only when the cost of joining a community has a linear dependence on the community size. The linear dependence arises because the joiner particle interacts identically with all other particles in the community.
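Both limits can be checked numerically. The sketch below assumes unnormalized log-probabilities of the form $-\mu^\circ k_0 \sum_{j=1}^{k-1} 1/(j + k_0)$. It estimates the local log-log slope between $k$ and $2k$, which should approach $-\mu^\circ k_0$ for $k \gg k_0$, and the step between consecutive log-probabilities, which should approach $-\mu^\circ$ when $k_0$ is very large. Function names and parameter values are hypothetical.

```python
import math

def log_weight(k, mu0, k0):
    """Unnormalized log p_k = -mu0 * k0 * sum_{j=1}^{k-1} 1/(j + k0)."""
    return -mu0 * k0 * sum(1.0 / (j + k0) for j in range(1, k))

def loglog_slope(k, mu0, k0):
    """Slope of log p_k against log k, estimated between k and 2k."""
    return (log_weight(2 * k, mu0, k0) - log_weight(k, mu0, k0)) / math.log(2.0)

mu0, k0 = 1.25, 2.0  # illustrative values; scaling exponent mu0 * k0 = 2.5
for k in (1, 10, 100, 5000):
    print(k, loglog_slope(k, mu0, k0))

# Zero cost-sharing limit: with very large k0, log p_k falls by about mu0 per step.
step = log_weight(11, 0.5, 1e6) - log_weight(10, 0.5, 1e6)
print(step)
```

The slope is shallower than the asymptotic exponent at small `k` and steepens toward it as `k` grows past `k0`, which is the cross-over described in the text.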

What is the role of detailed balance in our modeling? Fig. 1 shows no reverse arrows from *k* to $k-1$. The principle of Max Ent can be regarded as a general way to infer distribution functions from limited information, irrespective of whether there is an underlying kinetic model. So, it poses no problem that some of our distributions, such as scientific citations, are not taken from reversible processes.

## Results

Eq. **7** and Fig. 2 show the central results of this paper. Consider three types of plots. On the one hand, exponential functions can be seen in data by plotting $\log p_k$ vs. *k*. Or, power-law functions are seen by plotting $\log p_k$ vs. $\log k$. Here, we find that plotting $\log p_k$ vs. a digamma function provides a universal fit to several disparate experimental data sets over their full distributions (Fig. 3). Fig. 2 shows fits of Eq. **7** to 13 datasets, using $\mu^\circ$ and $k_0$ as fitting parameters that are determined by a maximum-likelihood procedure (see *SI Text* for dataset and goodness-of-fit test details). The $\mu^\circ$ and $k_0$ characterize the intrinsic cost of joining any cluster, and the communal contribution to sharing that cost, respectively.
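A two-parameter maximum-likelihood fit of this kind can be sketched with a simple grid search. This is only an illustration and not the procedure described in the *SI Text*. It assumes $\log p_k = -\mu^\circ k_0 \sum_{j=1}^{k-1} 1/(j + k_0)$ up to normalization on $k = 1, \ldots, n_{\max}$, and it uses idealized, noise-free expected counts generated from known parameters rather than real data. All names, grids, and values are hypothetical.

```python
import math

def log_pmf(mu0, k0, n_max):
    """Normalized log p_k for k = 1..n_max under the digamma-type distribution."""
    logw, acc = [], 0.0
    for k in range(1, n_max + 1):
        logw.append(-mu0 * k0 * acc)  # acc holds sum_{j=1}^{k-1} 1/(j + k0)
        acc += 1.0 / (k + k0)
    m = max(logw)
    log_z = m + math.log(sum(math.exp(v - m) for v in logw))
    return [v - log_z for v in logw]

def log_likelihood(counts, mu0, k0):
    """Log-likelihood of histogram counts (counts[0] is the count for k = 1)."""
    return sum(c * lp for c, lp in zip(counts, log_pmf(mu0, k0, len(counts))))

def grid_mle(counts, mu0_grid, k0_grid):
    """Return the (mu0, k0) grid point with the highest log-likelihood."""
    best = max((log_likelihood(counts, m, k), m, k) for m in mu0_grid for k in k0_grid)
    return best[1], best[2]

# Idealized expected counts from known parameters (synthetic, for illustration).
true_mu0, true_k0 = 0.8, 3.0
counts = [1e5 * math.exp(lp) for lp in log_pmf(true_mu0, true_k0, 200)]

mu0_hat, k0_hat = grid_mle(counts,
                           [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
                           [1.0, 2.0, 3.0, 4.0, 5.0])
print(mu0_hat, k0_hat)
```

For real data a continuous numerical optimizer would normally replace the grid, and a goodness-of-fit test would be needed in addition to the point estimates.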

Rare events are less rare under fat-tailed distributions than under exponential distributions. For dynamical systems, the risk of such events can be quantified by the coefficient of variation (CV), defined as the ratio of the SD to the mean. For equilibrium/steady-state systems, the CV quantifies the spread of a probability distribution, and is determined by the power-law exponent, $\mu^\circ k_0$. Systems with small scaling exponents () experience an unbounded, power-law growth of their CV as the system size *N* becomes large, . This growth is particularly rapid in systems with , because the average community size diverges at . For these systems, is observed. Several of our datasets fall into this high-risk category, such as the number of deaths due to terrorist attacks (Table 1).
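The dependence of the CV on system size can be explored numerically. The sketch below assumes that "system size" means a cutoff $N$ on the community size, so that $p_k \propto \exp[-\mu^\circ k_0 \sum_{j=1}^{k-1} 1/(j + k_0)]$ is normalized over $k = 1, \ldots, N$; that reading, the function name, and the parameter values are assumptions made for illustration. It compares a small scaling exponent (2.5), for which the variance grows with the cutoff, with a large one (6), for which the CV settles to a constant.

```python
import math

def truncated_cv(mu0, k0, n):
    """CV (SD / mean) of the distribution truncated to community sizes k = 1..n."""
    acc = 0.0            # running sum_{j=1}^{k-1} 1/(j + k0)
    z = m1 = m2 = 0.0    # normalization, first and second raw moments
    for k in range(1, n + 1):
        w = math.exp(-mu0 * k0 * acc)
        z += w
        m1 += k * w
        m2 += k * k * w
        acc += 1.0 / (k + k0)
    mean = m1 / z
    var = m2 / z - mean * mean
    return math.sqrt(var) / mean

sizes = (100, 1000, 10000)
cv_small_exponent = [truncated_cv(1.25, 2.0, n) for n in sizes]  # exponent 2.5
cv_large_exponent = [truncated_cv(3.0, 2.0, n) for n in sizes]   # exponent 6.0
print(cv_small_exponent)
print(cv_large_exponent)
```

The contrast between the two rows is the point of the example: only the small-exponent case keeps spreading as larger communities become possible.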

## Discussion

We have expressed a range of probability distributions in terms of a generalized energy-like cost function. In particular, we have considered types of costs that can be subject to economies of scale, which we have also called “community discounts.” We maximize the BGS entropy, subject to such cost-minus-benefit functions. This procedure predicts probability distributions that are exponential functions of a digamma function. Such a distribution function has a power-law tail, but reduces to a Boltzmann distribution in the absence of EOS. This function gives good fits to distributions ranging from scientific citations and patents, to protein-protein interactions, to friendship networks, and to Web links and terrorist networks—over their full distributions, not just in their tails.

Framed in this way, each new joiner particle must pay an intrinsic buy-in cost to join a community, but that cost may be reduced by a communal discount (an economy of scale). Here, we discuss a few points. First, both exponential and power-law distributions are ubiquitous. How can we rationalize this? One perspective is given by switching viewpoint from probabilities to their logarithms, which are commonly expressed in a language of dimensionless cost functions, such as energy . There are many forms of energy (e.g., gravitational, magnetic, electrostatic, springs, and interatomic interactions). The ubiquity of the exponential distribution can be seen in terms of the diversity and interchangeability of energies.

A broad swath of physics problems can be expressed in terms of the different types of energy and their ability to combine, add, or exchange with each other in various ways. Here, we indicate that nonexponential distributions, too, can be expressed in a language of costs, particularly those that are shared and are subject to economies of scale. Second, where do we expect exponentials vs. power laws? What sets Eq. **5** apart from typical energy functions in physical systems is that EOS costs are both independent of distance and long-ranged (the joiner particle interacts with all particles in a given community). Consequently, when the system size becomes large, due to the absence of a correlation length-scale, the energy of the system does not increase linearly with system size, giving rise to a nonextensive energy function. This view is consistent with the appearance of power laws in critical phenomena, where interactions are effectively long-ranged.

Third, interestingly, the concept of cost-minus-benefit in Eq. **5** can be further generalized, also leading to either Gaussian or stretched-exponential distributions. A Gaussian distribution results when the cost-minus-benefit function grows linearly with cluster size, $\mu_k \sim k$; this would arise if the joiner particle were to pay a tax to each member of a community, and this leads to a total cost of $w_k \sim k^2$ (Eq. **1**). These would be “hostile” communities, leading to mostly very small communities and few large ones, because a Gaussian function drops off even faster with *k* than an exponential does. An example would be a Coulombic particle of charge *q* joining a community of *k* other such charged particles, as in the Born model of ion hydration (55). A stretched-exponential distribution can arise if the joiner particle instead pays a tax to only a subset of the community. For example, in a charged sphere with strong shielding, if only the particles at the sphere’s surface interact with the joiner particle, then $\mu_k \sim k^{2/3}$ and $w_k \sim k^{5/3}$, leading to a stretched-exponential distribution. In these situations, EOS can affect the community-size distribution not only through cost-sharing but also through the topology of interactions.
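These generalized cases can be compared in a few lines. The sketch below assumes a joining cost $\mu_j = c\, j^{a}$ for a community of size $j$, where the exponent $a$ and prefactor $c$ are illustrative parameters introduced here: $a = 0$ is the constant-cost case, $a = 1$ a tax paid to every member, and $a = 2/3$ a tax paid only to surface particles. Summing the joining costs gives a cumulative cost that grows as $k^{a+1}$, so $\log p_k$ falls off as $-k$, $-k^2$, or $-k^{5/3}$ respectively.

```python
import math

def cumulative_cost(k, a, c=1.0):
    """w_k = sum_{j=1}^{k-1} c * j**a, the cost of assembling a cluster of size k."""
    return sum(c * j ** a for j in range(1, k))

def growth_exponent(k, a):
    """Estimate d(log w_k)/d(log k) from the ratio w_{2k} / w_k."""
    return math.log(cumulative_cost(2 * k, a) / cumulative_cost(k, a)) / math.log(2.0)

cases = {"constant cost": 0.0, "tax to every member": 1.0, "tax to surface only": 2.0 / 3.0}
for label, a in cases.items():
    # Expect a value close to a + 1: 1 (exponential), 2 (Gaussian), 5/3 (stretched).
    print(label, round(growth_exponent(1000, a), 3))
```

Because the unnormalized probability is the exponential of minus the cumulative cost, the growth exponent of `cumulative_cost` directly sets the shape of the tail.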

Finally, we reiterate a matter of principle. On the one hand, nonexponential distributions could be derived by using a nonextensive entropy-like quantity, such as those of Tsallis, combined with an extensive energy-like quantity. Here, instead, our derivation is based on using the BGS entropy combined with a nonextensive energy-like quantity. We favor the latter because it is consistent with the foundational premises of Shore and Johnson (37). In short, in the absence of energies or costs, the BGS entropy alone predicts a uniform distribution; any other alternative would introduce bias and structure into $p_k$ that is not warranted by the data. Models based on nonextensive entropies intrinsically prefer larger clusters, but without any basis to justify them. The present treatment invokes the same nature of randomness as when physical particles populate energy levels. The present work provides a cost-like language for expressing various different types of probability distribution functions.

## Acknowledgments

We thank A. de Graff, H. Ge, D. Farrell, K. Ghosh, S. Maslov, and C. Shalizi for helpful discussions, and K. Sneppen, M. S. Shell, and H. Qian for comments on our manuscript. Support for this work was provided by a US Department of Defense National Defense Science and Engineering Graduate Fellowship (to J.P.), the National Science Foundation and Laufer Center (J.P. and K.A.D.), and Department of Energy Grant PM-031 from the Office of Biological Research (to P.D.D.).

## Footnotes

^{1}To whom correspondence should be addressed. E-mail: dill@laufercenter.org.

Author contributions: J.P. and K.A.D. designed research; J.P. and P.D.D. performed research; J.P. and K.A.D. contributed new reagents/analytic tools; J.P. analyzed data; and J.P., P.D.D., and K.A.D. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1320578110/-/DCSupplemental.

## References

- ↵
- ↵
- Zipf GK

- ↵
- ↵
- Newman M

- ↵
- ↵
- Taleb NN

- ↵
- Bremmer I,
- Keats P

- ↵
- ↵
- ↵
- Maslov S,
- Krishna S,
- Pang TY,
- Sneppen K

- ↵
- ↵
- Leskovec J,
- Chakrabarti D,
- Kleinberg J,
- Faloutsos C,
- Ghahramani Z

- ↵
- ↵
- ↵
- Fortuna MA,
- Bonachela JA,
- Levin SA

- ↵
- ↵
- Pang TY,
- Maslov S

- ↵
- Simon H

- ↵
- ↵
- Barabási AL,
- Albert R

- ↵
- ↵
- Yook S-H,
- Jeong H,
- Barabási A-L

- ↵
- ↵
- ↵
- ↵
- ↵
- Peterson GJ,
- Pressé S,
- Dill KA

- ↵
- ↵
- Yeomans J

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Deeds EJ,
- Ashenberg O,
- Shakhnovich EI

- ↵
- ↵
- ↵
- ↵
- ↵
- Rényi A

- ↵
- Aczél J,
- Daróczy Z

- ↵
- Amari S-i

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Gell-Mann M,
- Tsallis C

- ↵
- Dill K,
- Bromberg S

- Born M (1920) [Volumes and heats of hydration of ions]. *Z Phys* 1:45–48. German.

## Article Classifications

- Physical Sciences
- Applied Mathematics