# Understanding scaling through history-dependent processes with collapsing sample space

See allHide authors and affiliations

Edited by H. Eugene Stanley, Boston University, Boston, MA, and approved March 13, 2015 (received for review November 4, 2014)

## Significance

Many complex systems reduce their flexibility over time in the sense that the number of options (possible states) diminishes over time. We show that rank distributions of the visits to these states that emerge from such processes are exact power laws with an exponent −1 (Zipf’s law). When noise is added to such processes, meaning that from time to time they can also increase the number of their options, the rank distribution remains a power law, with an exponent that is related to the noise level in a remarkably simple way. Sample-space-reducing processes provide a new route to understand the phenomenon of scaling and provide an alternative to the known mechanisms of self-organized criticality, multiplicative processes, or preferential attachment.

## Abstract

History-dependent processes are ubiquitous in natural and social systems. Many such stochastic processes, especially those that are associated with complex systems, become more constrained as they unfold, meaning that their sample space, or their set of possible outcomes, reduces as they age. We demonstrate that these sample-space-reducing (SSR) processes necessarily lead to Zipf’s law in the rank distributions of their outcomes. We show that by adding noise to SSR processes the corresponding rank distributions remain exact power laws,

A typical feature of aging is that the number of possible states in a system reduces as it ages. Whereas a newborn can become a composer, politician, physicist, actor, or anything else, the chances for a 65-y-old physics professor to become a concert pianist are practically zero. A characteristic feature of history-dependent systems is that their sample space, defined as the set of all possible outcomes, changes over time. Many aging stochastic systems (such as career paths) become more constrained in their dynamics as they unfold (i.e., their sample space becomes smaller over time). An example for a sample-space-reducing (SSR) process is the formation of sentences. The first word in a sentence can be sampled from the sample space of all existing words. The choice of subsequent words is constrained by grammar and context, so that the second word can only be sampled from a smaller sample space. As the length of a sentence increases, the size of the sample space of word use typically reduces.

Many history-dependent processes are characterized by power-law distribution functions in their frequency and rank distributions of their outcomes. The most famous example is the rank distribution of word frequencies in texts, which follows a power law with an approximate exponent of −1, the so-called Zipf’s law (1). Zipf’s law has been found in countless natural and social phenomena, including gene expression patterns (2), human behavioral sequences (3), fluctuations in financial markets (4), scientific citations (5, 6), distributions of city (7) and firm sizes (8, 9), and many more (see, e.g., ref. 10). (Some of these examples are, of course, not associated with SSR processes.) Over the past decades there has been a tremendous effort to understand the origin of power laws in distribution functions obtained from complex systems. Most of the existing explanations are based on multiplicative processes (11⇓⇓–14), preferential mechanisms (15⇓–17), or self-organized criticality (18⇓–20). Here we offer an alternative route to understand scaling based on processes that reduce their sample space over time. We show that the emergence of power laws in this way is related to the breaking of a symmetry in random sampling processes, a mechanism that was explored in ref. 21. History-dependent random processes have been studied generically (22, 23), however not with the rationale to understand the emergence of scaling in complex systems.

## Results

### The Pure SSR Process and Zipf's Law.

The essence of SSR processes can be illustrated by a set of *N* fair dice with different numbers of faces. The first dice has one face, the second has two faces (coin), the third one three, and so on, up to dice number *N*, which has *N* faces. The faces of a dice are numbered and have respective face values. To start the SSR process, take the dice with the largest number of faces (*N*) and throw it. The result is a face value between 1 and *N*; say it is *K*. We now take dice number *i* between 1 and *L*. We now take dice number *ϕ*. As the process unfolds, *ϕ* generates a single sequence of strictly decreasing numbers *i*. An intuitive realization of this process is depicted in Fig. 1. The probability that the process *ϕ* visits the particular site *i* in a sequence is the visiting probability *N*. Take the process *ϕ* and let *ϕ* directly generates a 1 with probability *ϕ* first generates 2 with probability *N*, the probability to hit *i* in the first step is *j*, *i* in the next step with probability

If the process *ϕ* is repeated many times, meaning that once it reaches dice number 1 we start by throwing dice number *N* again, we are interested in how often a given site *i* is occupied on average. The occupation probability for site *i*, given that there are *N* possible sites, is denoted by *ϕ*. Although in general the visiting probability *ϕ* both probabilities only differ by a normalization factor. This is so because any sequence generated by *ϕ* is strictly decreasing and contains any particular site *i* at most once. Further, any sequence ends on site 1, meaning

An alternative picture that illustrates the history-dependence aspect of the same SSR processes is shown in Fig. 2. In Fig. 2, *Left* we show an independent and identically distributed (iid) stochastic process, where the space of potential outcomes is *N* sites of *i* to site *j* is *i*, we obviously find that this is constant over time,*Right*). If at time *t* the ball is at level (site) *i*, at *i* during a downward sequence is again *ϕ*, the same proof applies.

### The Role of Noise in SSR Processes.

It is conceivable that in many real systems nestedness of SSR processes is not realized perfectly and that from time to time the sample space can also expand during a sequence. In the above example this would mean that from time to time random upward moves are allowed, or equivalently, that the nested process *ϕ* is perturbed by noise. In the context of the scenario depicted in Fig. 2 we look at a superposition of the SSR *ϕ* and the unconstrained random walk *λ* to denote the mixing ratio, the nested SSR process *i*, with probability *λ* it jumps (downward) to any of site *N* sites, *λ* that the sample space for the next throw is *λ* the process is *ϕ* and stops, and with probability *i* more than once. This implies in general that the visiting probability *N* and write

Note that *ϕ* produces one realization of *N*, the average sequence length is *ϕ* with a process *ϕ*, except for the case when site *N*-faced dice and thus restarts the process *ϕ* (in the numerical simulations we stop the process after *M* restarts). For *ϕ*, and the end point of the previous one (see also Fig. 5*A*). Replacing *ϕ* by **2** ensures that we have an infinitely long, noisy sequence, which is denoted by *i* is reached from site *j* in the next time step in an infinite and noisy SSR process. It reads**3** and **4** we get*λ* is nothing but the mixing parameter for the noise component. For *λ*. Note that Eq. **6** is a statement about the rank distribution of the system. Often statistical features of systems are presented as frequency distributions, that is, the probability that a given site (state) is visited *k* times, *p* is a power law with exponent *λ*, **6** implies that we are able to understand a remarkable range of exponents in frequency distributions, **6** and numerical simulations (Fig. 3*A*). The slope of the measured rank distributions in log scale, *λ*. Fitting was carried out by using the method proposed in ref. 24.

### Convergence Speed of SSR Distributions.

From a practical side the question arises of how fast SSR processes converge to the limiting occupation distribution given by Eq. **6**. In other words, what is the distance between the sample distribution *p*^{(λ)}_{T} of the process Φ^{(λ)}_{∞} and *p*^{(λ)}, after *T* individual jumps? In Fig. 3*B* we show the Euclidean distance of the distribution after *T* jumps *λ*. For the pure random case

## Examples

### Sentence Formation and Zipf’s Law.

One example for a SSR process of the presented type is the process by which sentences are formed. During the creation of a sentence, grammatical and contextual constraints have the effect of a reducing sample space [i.e., the space (vocabulary) from which a successive word in a sentence can be sampled]. Clearly, the process of sentence formation is not expected to be strictly SSR, and we expect deviations from an exact Zipf’s law in the rank distribution of words in texts. In Fig. 4 we show the empirical distribution of word frequencies of Darwin’s *The Origin of Species*, which shows an approximate power law with a rank exponent of *M* corresponds to the number of sentences in a text. In the simulation we use

### SSR Processes and Random Walks on Networks.

SSR processes can be related to random walks on directed networks, as depicted in Fig. 5*A*. There we start the process from a start node, from which we can reach any of the *N* nodes with probability *ϕ* above. However, if *B* we show the result for the node occupation distribution for the process *B*, *Inset* (black dashed line), which shows the observed path rank distribution for the

### SSR Processes and Fragmentation Processes.

One important class of aging systems are fragmentation processes, such as objects that repeatedly break at random sites into ever smaller pieces (see, e.g., refs. 28 and 29). A simple example demonstrates how fragmentation processes are related to SSR processes. Consider a stick of a certain initial length *L*, such as a spaghetto, and mark some point on the stick. Now take the stick and break it at a random position. Select the fragment that contains the mark and record its length. Then break this fragment again at a random position, take the fragment containing the mark, and again record its length. One repeats the process until the fragment holding the mark reaches a minimal length, say the diameter of an atom, and the fragmentation process stops. The process is clearly of the SSR type because fragments are always shorter than the fragment they come from. In particular, if the mark has been chosen on one of the endpoints of the initial strand of spaghetti, then the consecutive fragmentation of the marked fragment is obviously a continuous version of the SSR process *ϕ* discussed above. Note that even though the length sequence of a single marked fragment is an SSR process, the size evolution of all fragments is more complicated, because fragment lengths are not independent from each other: Every spaghetti fragment of length *x* splits into two fragments of respective lengths,

## Discussion

The main result of Eq. **6** is remarkable insofar as it explains the emergence of scaling in an extremely simple and hitherto unnoticed way. In SSR processes, Zipf’s law emerges as a simple consequence of breaking a directional symmetry in stochastic processes, or, equivalently, by a nestedness property of the sample space. More general power exponents are simply obtained by the addition of iid random fluctuations to the process. The relation of exponents and the noise level is strikingly simple and gives the exponent a clear interpretation in terms of the extent of violation of the nestedness property in strictly SSR processes. We demonstrate that SSR processes converge equally fast toward their limiting distributions, as uncorrelated iid processes do.

We presented several examples for SSR processes. The emergence of scaling through SSR processes can be used straightforwardly to understand Zipf’s law in word frequencies. An empirical quantification of the degree of nestedness in sentence formation in a number of books allows us to understand the variations of the scaling exponents between the individual books (26). SSR processes can be related to diffusion processes on directed networks. For a specific example we demonstrated that the visiting times of nodes follow a Zipf’s law, and could further reproduce very general recent findings of path-visit distributions in random walks on networks (27). Here we presented results for a complete directed graph; however, we conjecture that SSR processes on networks and the associated Zipf’s law of node-visiting distributions are tightly related and are valid for much more general directed networks. We demonstrated how SSR processes can be related to fragmentation processes, which are examples of aging processes. We note that SSR processes and nesting are deeply connected to phase-space collapse in statistical physics (21, 30⇓–32), where the number of configurations does not grow exponentially with system size (as in Markovian and ergodic systems), but grows subexponentially. Subexponential growth can be shown to hold for the phase-space growth of the SSR sequences introduced here. In conclusion, we believe that SSR processes provide a new alternative view on the emergence of scaling in many natural, social, and man-made systems. It is a self-contained, independent alternative to multiplicative, preferential, self-organized criticality and other mechanisms that have been proposed to understand the origin of power laws in nature (33).

## Acknowledgments

We thank two anonymous referees who made us aware of the relation to network diffusion and fragmentation and Álvaro Corral for his helpful comments. This work was supported by Austrian Science Fund FWF under Grant KPP23378FW.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: stefan.thurner{at}meduniwien.ac.at.

Author contributions: B.C.-M., R.H., and S.T. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

## References

- ↵.
- Zipf GK

- ↵
- ↵
- ↵
- ↵.
- Price DJ

- ↵
- ↵
- ↵.
- Axtell RL

- ↵.
- Saichev A,
- Malevergne Y,
- Sornette S

- ↵.
- Newman MEJ

- ↵.
- Simon HA

- ↵.
- Jackson W

- Mandelbrot B

- ↵
- ↵
- ↵
- ↵
- ↵.
- Barabasi A-L,
- Albert R

- ↵
- ↵
- ↵
- ↵.
- Hanel R,
- Thurner S,
- Gell-Mann M

- ↵
- ↵.
- Clifford P,
- Stirzaker D

- ↵
- ↵.
- Feller W

- ↵.
- Thurner S,
- Hanel R,
- Corominas-Murtra B

- ↵
- ↵.
- Tokeshi M

- ↵
- ↵
- ↵
- ↵.
- Hanel R,
- Thurner S

- ↵.
- Mitzenmacher M

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Statistics