New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
A Poissonian explanation for heavy tails in email communication

Edited by Steven Strogatz, Cornell University, Ithaca, NY, and accepted by the Editorial Board September 3, 2008 (received for review January 11, 2008)
Abstract
Patterns of deliberate human activity and behavior are of utmost importance in areas as diverse as disease spread, resource allocation, and emergency response. Because of its widespread availability and use, email correspondence provides an attractive proxy for studying human activity. Recently, it was reported that the probability density for the interevent time τ between consecutively sent emails decays asymptotically as τ^{−α}, with α ≈ 1. The slowerthanexponential decay of the interevent time distribution suggests that deliberate human activity is inherently nonPoissonian. Here, we demonstrate that the approximate powerlaw scaling of the interevent time distribution is a consequence of circadian and weekly cycles of human activity. We propose a cascading nonhomogeneous Poisson process that explicitly integrates these periodic patterns in activity with an individual's tendency to continue participating in an activity. Using standard statistical techniques, we show that our model is consistent with the empirical data. Our findings may also provide insight into the origins of heavytailed distributions in other complex systems.
The analysis of social and economic data has a long and illustrious history (1–3). Despite their idiosyncratic complexity, a number of striking statistical regularities are known to describe individual and societal human behavior (4–7). These regularities are of enormous practical importance because they provide insight into how individual behaviors influence social and economic outcomes. Indeed, much of the current research on complex systems aims to quantify the impact of individual agents on the organization and dynamics of the system as a whole (8, 9). Before we can predict how individuals affect, for example, the organization of systems, it is paramount to understand the behavior of the individual agents.
The current availability of digital records has made it much easier for researchers to quantitatively investigate various aspects of human behavior (10–21). In particular, email communication records are attracting much attention as a proxy for quantifying deliberate human behavior because of the omnipresence of email communication and availability of email records (13, 14, 16, 18). The data, however, do not provide a detailed record of all of the activities in which each individual participates; we do not know, for instance, when an individual is sleeping, eating, walking, or even browsing the web. The resulting uncertainty in deliberate human activity thus poses a fundamental challenge to quantifying and modeling of human behavior.
Researchers commonly account for uncertainty or lack of information through stochastic models. One of the simplest stochastic models for human activity is a point process in which independent events occur at a constant rate ρ. Such processes are referred to as homogeneous Poisson processes, and they are used to describe a large class of phenomena, including some aspects of human activity (22). Homogeneous Poisson processes have two wellknown statistical properties: the time between consecutive events, the interevent time τ, follows an exponential distribution, p(τ) = ρe^{−ρτ}, and the number of events N_{T} during a time interval of duration T time units follows a Poisson distribution with mean ρT.
Several recent studies of deliberate human activity, including email correspondence, have focused on the former property. These studies have reported that the empirical distribution of interevent times decays asymptotically as a power law, p(τ) ∝ τ^{α}, with exponent α ≈ 1 (13, 14, 18, 23). Other studies have identified a similar powerlaw scaling in the interevent time distribution of many other facets of human behavior, such as file downloads (10–12), letter correspondence (15, 17, 18), library usage (17), broker trades (17), web browsing (17, 19), human locomotor activity (20), and telephone communication (21). These observations are in stark contrast to the predictions of a homogeneous Poisson process, suggesting that a more suitable null model with which to compare mechanistic models of human activity is a truncated powerlaw model with scaling exponent α = 1.*
The heavytailed nature of the distribution of interevent times prompts us to search for the mechanisms responsible for its emergence. Two main classes of mechanisms can be considered: (i) human behavior is primarily driven by rational decision making, which introduces correlations in activity, thereby giving rise to heavy tails;^{†} (ii) human behavior is primarily driven by external factors such as circadian and weekly cycles, which introduces a set of distinct characteristic time scales, thereby giving rise to heavy tails.^{‡} Whereas the former interpretation has been shown to give rise to a truncated powerlaw distribution of interevent times, the latter has been rejected by some authors (17, 18). Indeed, even though Hidalgo (24) investigated a model with seasonal changes in activity rates that is able to generate data with an approximate powerlaw decay in the distribution of interevent times with exponents α ≈ 2 or α ≈ 1, the α ≈ 1 case requires a specific relationship between the rates of activity ρ_{i} and the corresponding duration of the seasons T_{i} over which each rate holds. It has therefore been argued that seasonality alone can only robustly give rise to heavytailed interevent time distributions with exponent α ≈ 2 (17).
Here, we demonstrate that the distribution of interevent times in email correspondence patterns display systematic deviations from the truncated powerlaw null model because of circadian and weekly patterns of activity. We subsequently propose a mechanistic model that incorporates these observed cycles, and a simulated annealing procedure to nonparametrically estimate its parameters. We then use Monte Carlo hypothesis testing to demonstrate that the predictions of our model are consistent with the observed heavytailed interevent time distribution. Finally, we discuss the implications of our findings for modeling human activity patterns and, more generally, complex systems.
Empirical Patterns
We study a database of email records for 3,188 email accounts at a European university over an 83day period (23). Each record comprises a sender identifier, a recipient identifier, the size of the email, and a time stamp with a precision of 1 s. We preprocess the dataset and identify a set of 394 accounts that provide enough data to quantify human activity and that are likely neither spammers nor listservs [see Preprocessing of the Data in supporting information (SI) Text and Fig. S1].
To gain some intuition about email activity patterns, let us consider a fictitious student, Katie.^{§} Katie arrives at the university 20 min before her Thursday morning class. During this time, she decides to check her email and sends 3 emails. Katie checks her email after lunch and sends a brief email to a friend before her next class. Later that evening, Katie sends 4 more emails once she has finished her homework. Katie does not check her email again until the following day when she sends emails intermittently between attending classes, completing homework assignments, and meeting social engagements. Katie spends the weekend without email access and doesn't send another email until Monday. Katie's email activity, which is similar to many email users, is both periodic and cascading. That is, there are periodic changes in her activity rate, which account for her sleep and work patterns, and there are cascades of activity—active intervals—of varying length when Katie primarily focuses on email correspondence (Fig. 1).
If our intuition about deliberate human activity is correct, then the periodic pattern of activity should manifest itself in the interevent time statistics, particularly when compared with the predictions of the truncated powerlaw null model that does not account for temporal periodicities (see Null Model in SI Text). Specifically, we anticipate that email users typically send emails during the same 8hour periods of the day. We therefore expect the data to have significantly more interevent times between 24 ± 8 h—the time required to send emails on consecutive workdays—than the truncated powerlaw model predictions. We therefore expect that the null model underestimates the number of interevent times between 16 and 32 h. Because of the normalization of the probability density, the truncated powerlaw model will overestimate other interevent times. These predictions are all confirmed by the data, suggesting that periodicity is a fundamental aspect of human activity (Fig. 2).
Model
We propose a model of email usage that incorporates the hypothesized periodic and cascading features of human activity. We account for periodic activity with a primary process, which we model as a nonhomogeneous Poisson process. Whereas a homogeneous Poisson process has a constant rate ρ, a nonhomogeneous Poisson process has a rate ρ(t) that depends on time. In our model, the rate ρ(t) depends on time in a periodic manner; that is, ρ(t) = ρ(t + W), where W is the period of the process. Consistent with our observations (Fig. 3), we relate the rate of the nonhomogeneous Poisson process to the daily and weekly distributions of active interval initiation, p_{d}(t) and p_{w}(t): where the period W is 1 week, and the proportionality constant N_{w} is the average number of active intervals per week.^{¶}
We further assume that each event generated from the primary process initiates a secondary process, which we model as a homogeneous Poisson process with rate ρ_{a} (see Additional Evidence for a Homogeneous Poisson Cascade in SI Text). We refer to these “cascades of activity” as active intervals, during which N_{a} additional events occur where N_{a} is drawn from some distribution p(N_{a}). Once the N_{a} events have occurred in the active interval, the activity of the individual is again governed by the primary process defined by Eq. 1. Our model thus mimics how individuals like Katie use email: Katie sends emails sporadically throughout the day, but once she starts checking her email, it is relatively easy to send additional emails in rapid succession. We refer to the resulting model as a cascading nonhomogeneous Poisson process.^{‖}
Results
To compare our model with the empirical data, we first need to estimate the parameters of our model from the data. Ideally, the data would specify which events belong to the same active intervals—the active interval configuration C—so that we could estimate the distributions p_{d}(t), p_{w}(t), and p(N_{a}). The data we analyze, however, do not specify the actual active interval configuration C_{o}, so it is not evident whether, for example, p(N_{a}) should be described by a normal or exponential distribution.
Because we do not know a priori the functional form of the activity pattern in the cascading process, we cannot use the formalism implemented by, for example, Scott and coworkers (28, 29). Instead, we introduce a new method that enables us to nonparametrically infer the empirical distributions p_{d}(t), p_{w}(t), and p(N_{a}) from the data.
Given a particular active interval configuration C, we can easily calculate all of our model's parameters and compare its predictions with the empirical data: N_{w} is the average number of active intervals per week; p_{d}(t) and p_{w}(t) are the probabilities of starting an active interval at a particular time of day and week, respectively; the active interval rate ρ_{a} is the inverse of the average interevent time in active intervals; and the probability of N_{a} additional events occurring during an active interval p(N_{a}) is estimated directly from the active interval configuration (Fig. 3). We then manipulate the active interval configuration C to find the active interval configuration Ĉ that gives a best estimate of the observed interevent time distribution (see Methods). This method allows us to infer the bestestimate distributions p̂_{d}(t), p̂_{w}(t), and p̂(N_{a}), given the data and our proposed model, without making any assumptions on their functional forms.
We next compare the predictions of the cascading nonhomogeneous Poisson process with the empirical cumulative distribution of interevent times P(τ) for all 394 users under consideration in the present study (see SI Appendix). Because we are using the empirical data to estimate the parameters for our model—that is, the estimated parameters depend on the data—we must use Monte Carlo hypothesis testing (30, 31) to assess the significance of the agreement between the predictions of our model and the empirical data (see Monte Carlo Hypothesis Testing in SI Text). The visual agreement of our model's predictions are confirmed by P values clearly above our 5% rejection threshold (Fig. 4).
In fact, the cascading nonhomogeneous Poisson process can only be rejected at the 5% significance level for 1 user, indicating that our model cannot be rejected as a model of human dynamics. By comparison, the truncated powerlaw null model is rejected at the 5% significance level for 344 users. Indeed, the null model is always rejected for many more users than the cascading nonhomogeneous Poisson process regardless of the rejection threshold selected, and our model does not display the large systematic deviations from the data that are observed for the truncated powerlaw null model (Fig. 5)
Discussion
Our results clearly demonstrate that circadian and weekly cycles, when coupled to cascading activity, can accurately describe the heavy tails observed in email communication patterns. The question then is, would rational decision making, together with circadian and weekly cycles, be equally able to describe the statistical patterns observed for email communication? Even if the answer to this question is affirmative, parsimony suggests that rational decision making is not a necessary component of human activity patterns, given our simpler explanation.
In addition to providing a good description of email communication patterns, we surmise that our model is readily applicable to many other conscious human activities. For instance, most people make telephone calls sporadically throughout the day. After a telephone call has been made, it is effortless to make another telephone call. Similarly, individuals run errands throughout the month. Once an individual runs one errand, it is easier to run another errand during the same trip than it is to run errands again the following day. Both of these anecdotes are illustrative of the way humans tend to optimize their time and effort to accomplish the tasks in their daily routines, a process that is captured by the periodic and cascading mechanisms in our model.
The particular periodic and cascading features that are incorporated into our model depend on the activity under consideration. For instance, sexual activity is influenced by menstrual cycles (32), and airline travel is influenced by seasonality (33). Furthermore, our model can also be generalized to cases in which the parameters are not stationary. This may be important, for instance, in the case of Darwin and Einstein's letter correspondence in which the number of letters sent per year increased 100fold over 40 years (15, 18).
Although our model is only designed to account for a single activity (email correspondence), it can easily be extended to incorporate the multitude of activities in which any individual participates. To facilitate the inclusion of additional activities, it is useful to interpret our model as a nonstationary hidden Markov point process (28, 34). Within this framework, an individual switches between any two activities i and j with some probability defined by a nonstationary Markov transition matrix T_{ij}(t) that depends on time t. For instance, our model can be redefined as a nonstationary hidden Markov point process that switches between two states: a state in which an individual is not composing emails and a state in which an individual is composing emails. Predictions of models that incorporate more than one activity can then be verified against data that records several activities for a single individual.
Our model further suggests an experiment (35) that not only records when an individual has sent an email, but also when that individual is using a computer or actively using an email client. This additional data would provide direct empirical evidence for describing active intervals. In the absence of such data, we have developed a simulated annealing procedure that allows us to nonparametrically infer the hidden Markov structure of our model, providing insight into how to compare our model with other cascading point processes (25, 26).
Although our model provides an accurate description of when an email is sent, a question left unaddressed is to determine who the probable recipient of that email is going to be. For instance, one might speculate that emails are sent randomly with some Poissonian rate to acquaintances or individuals who share common interests. Alternatively, it is plausible that emails are sent based on a perceived priority of important tasks, perhaps in response to previous correspondence (14). When combined with our model that statistically describes when individuals send emails, quantifying the likely recipient of an email will provide an important step toward describing how the structure of email and social networks evolve.
Our study also provides a clear demonstration of how hypothesis testing (30, 36) can objectively assess the validity of a proposed model—a procedure we vehemently advocate. Using this methodology, we demonstrate that although both models reproduce the asymptotic scaling of the observed interevent time distribution, our model is consistent with the entire interevent time distribution, whereas the truncated powerlaw null model is not.
The consequences of our findings are clear; demonstrating that a model reproduces the asymptotic powerlaw scaling of a distribution does not necessarily provide evidence that the model is an accurate mechanistic description of the underlying process. Indeed, there is mounting evidence that some purported powerlaw distributions in complex systems may not be power laws at all (37–39). There may be a common explanation for these apparent power laws: Complex systems are inherently hierarchical, but the distinct levels in the hierarchy are difficult to distinguish (40). In the case of email correspondence, for example, the active intervals are not recorded in the data, thereby concealing the various scales of email activity. This demonstrates how the mixture of scales of activity can give rise to scalefree activity patterns. We suspect that similar mixtureofscales explanations (41–45) may provide a basis for the reported universality of heavytailed distributions in complex systems.
Methods
Area Test Statistic.
We quantify the agreement between a model ℳ(θ) with parameters θ and dataset 𝒟 by measuring the area A between the empirical cumulative distribution function P_{𝒟}(u) and the model cumulative distribution function P_{ℳ}(u∣θ): We specify u = ln τ, which is roughly uniformly distributed, to improve the numerical efficiency of our simulated annealing procedure. The area test statistic is advantageous because it is easy to interpret, and it retains more information about the distribution than many other test statistics (see Area Test Statistic in SI Text).
Identifying Active Intervals.
If we knew the actual active interval configuration C_{o}, it would be straightforward to compute the parameters θ_{o} = {N_{w},p_{d}(t),p_{w}(t),ρ_{a},p(N_{a})} of the cascading nonhomogeneous Poisson process. The data, however, do not identify the actual active interval configuration C_{o}, so we must use heuristic methods (see Simulated Annealing Procedure in SI Text) to determine the bestestimate active interval configuration Ĉ, from which we can compute the bestestimate parameters θ̂. We use simulated annealing to minimize the area test statistic A (Eq. 2) for the interevent time distribution. Thus, identifying active intervals that are consistent with our expectations for our model reduces to finding the bestestimate active interval configuration Ĉ, which minimizes the area A between the empirical data and the predictions of the cascading nonhomogeneous Poisson process.
Our simulated annealing procedure is as follows. Starting from a random active interval configuration C in which adjacent events are randomly assigned to the same active interval, we compute the parameters θ of the cascading nonhomogeneous Poisson process, then we numerically estimate the cumulative distribution P_{ℳ}(u∣θ), and, finally, we measure the area test statistic A(C) of the active interval configuration C. The active interval configuration is modified to a new configuration C′ by either merging two adjacent active intervals or by splitting an active interval. If the new configuration C′ reduces the area test statistic, then the new configuration is unconditionally accepted. Otherwise, the configuration is conditionally accepted with probability exp(−(A(C′) − A(C))/T), where T is the effective “temperature” measured in units of the area test statistic A. After attempting 2N configurations at each temperature so that each pair of N consecutive events might be merged and split, we reduce the temperature T by 5% until the active interval configuration settles at the bestestimate Ĉ without moving for 5 consecutive cooling stages.
Throughout the simulated annealing procedure, we track the lowest area test statistic configuration. If the system has settled in a configuration that is not the lowest area test statistic configuration, the system is placed in the lowest area test statistic configuration, and the system is cooled further. We have verified that our simulated annealing procedure accurately identifies active intervals and estimates parameters θ in synthetically generated cascading nonhomogeneous Poisson process datasets (see Simulated Annealing Procedure in SI Text and Fig. S5).
Acknowledgments
We thank R. Guimerà, M. SalesPardo, M. J. Stringer, E. N. Sawardecker, S. M. Seaver, and P. McMullen for insightful comments and suggestions. R.D.M. and D.B.S. thank the National Science Foundation (NSF)–Integrative Graduate Education and Research Traineeship Program (DGE9987577) for partial funding during this project. A.E.M. is supported by NSF Grant DMS0709212. L.A.N.A. gratefully acknowledges the support of NSF Award SBE 0624318 and of the W. M. Keck Foundation.
Footnotes
 ^{1}To whom correspondence should be addressed. Email: amaral{at}northwestern.edu

Author contributions: R.D.M., D.B.S., A.E.M., and L.A.N.A. designed research; R.D.M. and D.B.S. performed research; R.D.M. and D.B.S. analyzed data; and R.D.M., D.B.S., A.E.M., and L.A.N.A. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. S.S. is a guest editor invited by the Editorial Board.

↵* For simplicity, we use a truncated power law with an exponent of α = 1 as our null model. Similar conclusions are reached when the powerlaw scaling exponent is fit to the data or when other heavytailed null models [e.g., lognormal or loguniform distributions (16)] are considered.

↵† If humans make decisions based on their own previous memories, then we might expect that humans are heavily influenced by recent events. That is, the probability ρdt that an event will happen in a time interval dt is not constant but is, instead, a decreasing function of the time elapsed since the last event (18).

↵‡ This interpretation does not rely on highlycompetent human behavior and allows for the possibility that human activity, and hence the time dependence of ρ, is modulated by instinct, the environment, or social stimuli.

This article contains supporting information online at www.pnas.org/cgi/content/full/0800332105/DCSupplemental.

↵§ We suspect that most users had access to their email only at the university because the data are obtained from a European university prior to 2004 (J. P. Eckmann, personal communication).

↵¶ In specifying N_{w} as the average number of active intervals per week, we are implicitly assuming that the fraction of time spent in active intervals is very small. We have verified that this is the case for all users under consideration. Also, it is important to choose the time step Δt in the binning of the empirical p_{d}(t) to be sufficiently small such that the probability of an event occurring at time t is ρ(t)Δt ≪ 1. We choose Δt = 1/N_{w} hours, which meets this criterion while still maintaining computational feasibility.

↵‖ Our model is similar in spirit to the Neyman–Scott cascading point process (25, 26) and the Hawkes selfexciting process (27), except that in our model (i) the primary process is modulated periodically by a nonhomogeneous rate, and (ii) the active intervals are nonoverlapping.
 © 2008 by The National Academy of Sciences of the USA
References
 ↵
 Smith A
 ↵
 Pareto V
 ↵
 Zipf GK
 ↵
 Stanley MHR,
 et al.
 ↵
 Huberman BA,
 Pirolli PLT,
 Pitkow JE,
 Lukose RM
 ↵
 ↵
 ↵
 Newman MEJ
 ↵
 Castellano C,
 Fortunato S,
 Loreto V
 ↵
 ↵
 ↵
 ↵
 Johansen A
 ↵
 ↵
 ↵
 Stouffer DB,
 Malmgren RD,
 Amaral LAN
 ↵
 Vázquez A,
 Oliveira JG,
 Dezsõ Z,
 Goh KI,
 Kondor I,
 Barabási AL
 ↵
 Vázquez A
 ↵
 Dezsõ Z,
 Almaas E,
 Lukács A,
 Rácz B,
 Szakadát I,
 Barabási AL
 ↵
 ↵
 Candia J,
 et al.
 ↵
 Daley DJ,
 VereJones D
 ↵
 Eckmann JP,
 Moses E,
 Sergi D
 ↵
 Hidalgo C
 ↵
 Neyman J,
 Scott EL
 ↵
 Lowen SB,
 Teich MC
 ↵
 Hawkes AG
 ↵
 Scott SL
 ↵
 Scott SL,
 Smyth P
 ↵
 D'Agostino RB,
 Stephens MA
 ↵
 Press WH,
 Teukolsky SA,
 Vetterling WT,
 Flannery BP
 ↵
 Udry J,
 Morris NM
 ↵
 Kulendran N,
 King ML
 ↵
 Elliott RJ,
 Aggoun L,
 Moore JB
 ↵
 Watts DJ
 ↵
 Sivia DS,
 Skilling J
 ↵
 ↵
 ↵
 Clauset A,
 Shalizi CR,
 Newman MEJ
 ↵
 SalesPardo M,
 Guimerà R,
 Moreira AA,
 Amaral LAN
 ↵
 Silcock H
 ↵
 ↵
 ↵
 Willinger W,
 Govindan R,
 Jamin S,
 Paxson V,
 Shenker S
 ↵
 Motter AE,
 de Moura APS,
 Grebogi C,
 Kantz H