Previous Article |
Table of Contents
| Next Article
PHYSICAL SCIENCES / SOCIAL SCIENCES / COMPUTER SCIENCES / SOCIAL SCIENCES
Tracing information flow on a global scale using Internet chain-letter data

,
*Department of Computer Science, Carleton College, Northfield, MN 55057; and
Department of Computer Science, Cornell University, Ithaca, NY 14853
Edited by Ronald L. Graham, University of California at San Diego, La Jolla, CA, and approved January 25, 2008 (received for review September 6, 2007)
| Abstract |
|---|
|
|
|---|
social networks | algorithms | epidemics | diffusion in networks
Here, we trace these types of large-scale information-spreading processes at a person-by-person level using methods to reconstruct the propagation of massively circulated Internet chain letters, and from these observations we propose a new set of principles for how such processes work. We focus in particular on two such chain letters, which exhibit tree-like patterns of dissemination that are quite similar to each other but are initially in conflict with the intuitive picture of how information spreads in these settings. Rather than expanding to many individuals in a few steps, the trees are very narrow and continue reaching people several hundred levels deep. We describe a mathematical model that produces trees with this characteristic structure, grounded fundamentally in the observations that social networks are highly clustered and that information can take widely varying amounts of time to traverse different edges in the network. The simple structure of the model, and the fact that it is based on earlier empirical studies of human response times (19–21), thus suggests a possible basis for this narrow and deeply reaching style of information transmission in the local dynamics of communication within highly clustered social networks.
| Reconstructing the Spread of Internet Chain Letters |
|---|
|
|
|---|
We performed a similar analysis for a second chain letter, a petition that began circulating in 1995, purporting to organize political support for continued United States governmental funding of National Public Radio (NPR) and the Public Broadcasting System (PBS). Through similar means to those used for the Iraq petition, we acquired 316 distinct copies of the NPR petition, comprising a total of 13,052 people. The dissemination of the two chain letters exhibited qualitatively very similar structures, and for purposes of the discussion here, we focus on the analysis of the chain letter associated with the Iraq petition. Although both petitions in fact had their origins in hoaxes and naive misunderstandings, as a large fraction of the most widespread Internet chain letters do (22, 23), this fact is immaterial to our purposes, especially because almost all signatories to each appeared to believe them to be authentic; hence, we are studying genuine instances of the dissemination of individual pieces of information along links in the global social network.
People may in general receive a copy of the chain letter multiple times, but if each appends his or her name to just one copy, then the full propagation of the letter can be represented as a tree structure: recipients are nodes, the originator is the root, and node w is a child of node v if w appends its name directly below v's. Moreover, if this is the case, then each copy of the letter represents a path through the propagation tree, and the observable portion of the tree can be reconstructed simply by superimposing these paths (Fig. 1 A and B). Inspection of the chain letters indicates that recipients in the observable portion did appear to almost uniformly forward the letter just once, and hence reconstruction of a tree provides a reasonable approximation to the actual propagation process. However, the superposition of the lists on the 637 letters deviates from a tree structure because of extensive noise in the data: Some recipients reordered the list of names on their copy of the letter in ways closely analogous to the kinds of chromosomal rearrangements one finds due to sequence mutation events in biological settings (24, 25) (Fig. 1C). We observed examples of point mutations (in some petition copies, names were replaced by the names of political figures), insertion/deletion events (there were a number of small blocks of 1–5 names that were present in the middle of the list in some petition copies and absent in other copies), duplication events (blocks of 2–20 names that were duplicated in some petition copies, sometimes immediately adjacent within the list and sometimes hundreds of names later), block rearrangements (in one petition, two pairs of blocks of 2–3 names were swapped relative to their position in all other copies that contained the same names), and one hybridization event (the names at the ends of two copies of the petition were intermingled after their common prefix in a third copy).
|
Because any tree on this node set would have 19,301 edges (one fewer than the number of nodes), we need to delete a proportionally small number of edges (483) from the graph G to produce a tree. We do this deletion in a way that removes links inconsistent with a tree in the least consequential way possible. Specifically, for each edge e = (v, w), we define the evidence for e to be the number of distinct copies of the petition that exhibit edge e. Using the evidence for each edge as its weight, we compute a directed spanning tree of G (also known as a branching or arborescence) of the maximum possible weight; this computation can be done efficiently at the scale of our data using an algorithm due to Edmonds (26). (We use an implementation from the LEMON project, http://lemon.cs.elte.hu.) Thus, we produce a spanning tree in which the total evidence for all edges, under our definition, is as large as possible. Finally, after the construction of the spanning tree, certain nodes no longer lie on a path from the root node to an individual who posted their copy of the letter. We delete such nodes, producing the final tree we use for our analysis; this tree contains 18,119 total nodes with 557 leaves, all of whom posted their copy of the letter, and 63 internal nodes that also posted.
| The Structure of the Dissemination Tree |
|---|
|
|
|---|
The tree reconstructed from the data, however, reveals a structure that is very different from the picture suggested by simple epidemic models: the median distance to the root over all nodes is nearly 300, and >90% of the nodes have exactly one child. Fig. 2 depicts the full tree, with a zoomed-in view of the tree in Fig. 3 to illustrate the characteristic structure. SI Appendix contains a high-resolution image of the full tree. The full superposition of the lists without correction for noise, although it deviates from a precise tree topology, exhibits a qualitatively very similar structure, indicating that these properties are intrinsic to the spreading of the chain letter, and not an artifact of the reconstruction process. Moreover, qualitatively similar structures are exhibited in the propagation of the other large-scale chain letter for which we have data. (See SI Appendix for more detail and an image of the tree associated with this other chain letter.)
|
|
As a result, we frame the problem of modeling the tree as follows: Is there a class of simple and plausible generative processes that, when run on real social networks, produce synthetic trees of the characteristic structure we observe—deep, narrow, and with most nodes having one child? A negative answer would suggest that what we are seeing is the result of unobserved idiosyncrasies in the collective behavior that produced the lists. If the answer is positive, however, it argues that this type of structure is in fact achievable by natural mechanisms, suggesting that deep patterns of transmission are in fact a robust form of information-spreading and potentially focusing the search for more detailed theories about why we observe it in real life.
| Modeling the Structure of the Dissemination Tree |
|---|
|
|
|---|
Our models will start at a randomly chosen initiating node and construct trees spreading outward from this node, with portions of the tree made visible by some nodes posting their copy of the message. We will then assess how closely the structure of the observable portion of the constructed trees resembles the propagation tree of the real chain letter, using three metrics: the median node depth, the width, and the fraction of nodes with exactly one child. Here, the depth of a node is defined as its distance from the root, and the width of a tree is defined as the maximum size of a set of nodes that all possess the same depth. In all cases, the metrics will be averaged over multiple independent simulation runs on the LJ network: Each simulation run is continued until the observable tree first reaches the size of the real chain-letter propagation tree, and those runs in which the tree fails to reach this size are omitted. Omitting simulated trees that fail to grow large enough is consistent with our goal of studying properties of information diffusion conditional on reaching a large population; in real life, most circulated e-mail messages never spread widely, but we are interested in the structure of those that do.
Our models all incorporate the following two principles: Many recipients may choose not to forward the letter at all, and only a few recipients will choose to post the letter publicly. Thus, we introduce a discard-rate parameter
, specifying the probability that a given recipient discards the message and takes no further action on it, and a post-rate parameter
, specifying the probability that each recipient publicly posts his or her copy of the letter. In keeping with findings from earlier experiments based on e-mail forwarding (27), we set the discard-rate to the default value 0.65, although we find that reasonable variations do not qualitatively change the findings. The post-rate is a parameter that we will more explicitly vary. Public posting is the only means by which portions of the tree become observable: When a recipient posts the letter, his or her full path from the root becomes visible, and hence in general a node on the tree is observable at the end of the process if and only if one of its descendants posted a copy of the letter. We will be studying the structure of the observable portions of the trees produced by our models, as we do with real chain letters.
We first consider a model based on a direct application of these probabilistic ingredients. We choose a random root node and construct a tree in unit time steps. In each step, each new recipient of the letter discards it independently with probability
and otherwise forwards it to all neighbors (posting with probability
). Any neighbor w that has not received the letter previously becomes a new recipient in the next time step; if w receives the letter from multiple senders, it chooses one of these senders arbitrarily as its parent in the tree. Finally, once the process terminates, we look at the observable portion of the tree, consisting of the union of all paths from the root to the nodes that posted their copy of the letter.
Although such a model is very natural, it produces trees that compare poorly to the real chain-letter data. Simulating this model on the LJ network, the observable portion of the tree has a median depth 5.0, width 9,625, and single-child fraction 19.04% (averaged over 10 independent runs) with
= 0.10, and very similar properties for other small values of
. This wide divergence from the real data cannot be remedied simply by having recipients send to a smaller set of neighbors; if each recipient who forwards does so to a random subset of 4 or 5 of his or her neighbors, then the width remains in the thousands, the median depth remains <50, and the single-child fraction remains <70%. The central problem is that this style of random epidemic process seems unable to produce trees whose observable portions are very large, yet with a number of children per node so highly concentrated around 1.
| Models Based on Asynchronous Response Times |
|---|
|
|
|---|
before acting on the message, where
is distributed according to the density function f(x) = x–
for an exponent
. This accords with the findings of recent studies of human response times to a spectrum of communication types including e-mail (19–21), which find such distributions with exponents
ranging between 1 (with cut-off) and 3/2. We find that our results remain qualitatively consistent across this range; for the results described here, we use
= 3/2 as a default.
The specifics of the model with asynchronous response times are as follows. Time proceeds continuously, rather than in discrete steps, and when a given node w in the network first receives a copy of the letter, at time t, it first decides whether to participate in the process at all, choosing to do so with probability 1 –
. If w chooses to participate, it then chooses a random waiting time
distributed as above. Between times t and t +
, node w may receive multiple copies of the letter (including the initial one it received at time t). At time t +
, node w selects the copy of the letter it has received with the longest list of names (breaking ties arbitrarily), forwards it to all its neighbors, and publicly posts this copy with probability
. As before, when the process terminates, we consider the observable portion of the tree.
This asynchronous pattern of response has a "serializing" effect in networks with large clustering coefficient (18), as the LJ network has: If the neighbors of a forwarding node are mutually connected, then they will forward the letter to each other as they act on it in order, producing a single long list with all of their names rather than many distinct shorter lists, each containing one of their names. In the observable tree, this change will tend to produce deeper "runs" of nodes in which each node has exactly one child, precisely the structure that we observe. This way in which real-valued response times produce paths with a greater number of hops is analogous to phenomena in the analysis of shortest paths in graphs with random edge lengths (30), although the two types of models have different structures, arising from different generative mechanisms.
Asynchronous response is a step toward trees with the correct structure, but it is not enough by itself; consequently, we introduce a second extension to the model as well. This second extension is based on the fact that recipients actually have two natural ways of reacting to the message other than discarding it: they can forward it to their neighbors in the network, as before, or they can group-reply to the set of corecipients on the e-mail message they receive; in the latter case, these corecipients each receive a copy of the letter with the recipient's name appended. Thus, we keep the details of the previous model the same as before, with one addition: for a back-rate parameter β, a nondiscarding recipient node w at time t +
forwards the letter to its own neighbors as before with probability 1 – β, and otherwise it group-replies with probability β.
Combined with asynchronous response times, group-replying further amplifies the serializing effect of having copies of the letter handled in sequence by the set of nondiscarding neighbors of a node, with each appending its name and thus producing a single long path in the tree. However, increasing the back-rate also reduces the progress of the letter to new nodes in the graph, because group-replying rather than forwarding to neighbors only provides copies of the letter to nodes that have already received it at least once. With a high back-rate, the letter is thus less likely to ever reach a large set of nodes. Thus, it becomes natural to study trade-offs in the tree structure as a function of β.
Fig. 4 A–C shows the median depth, width, and single-child fraction of trees produced as the back-rate β and post-rate
are each varied independently between 0 and 1 (with the discard-rate
fixed to 0.65, although analogous results hold for other discard-rates in the range between 0.5 and 0.75). For high back-rates around 0.95, combined with low post-rates around 0.22, we obtain trees that approximately match the propagation tree of the real chain letter in all three metrics (Fig. 4 D–F).
|
| Discussion |
|---|
|
|
|---|
As noted earlier, the way in which the spreading pattern is made visible to us by the data—through lists of signatories—means that we lack detailed information about recipient lists and time-stamps on all but a handful of individual messages. As a result, our modeling efforts have, by necessity, focused on arguing that the unusual structures we observe are capable of arising from simple generative processes, thus suggesting that this style of information transmission can in fact have a natural basis, and attempting to expose some of its plausible qualitative ingredients. With more detailed information—for example, in an analysis that had access to many or most of the message headers—we could study the response times and overlaps in recipient lists among adjacent nodes in the tree and thus assess the alignment of these models to the detailed mechanics of message-sending, not just to global parameters (depth, width, single-child fraction) of the tree itself.
Overall, then, Internet-based snapshots of information diffusion can potentially provide us with insight into some of the global dynamics underlying social phenomena such as opinion formation and political mobilization. The fact that the observed diffusion occurs along trees that are so deep and narrow suggests that the paths traversed by information through social networks can be more complex than might have been supposed, with the large number of steps giving the diffusion a certain fragility and presenting greater opportunities for the information to be altered or lost as it spreads. The pattern of the diffusion also seems initially in conflict with the small-world nature of the social network in which it is embedded; but the models discussed here show that such patterns are capable of arising from natural processes operating in real social networks. In the end, the structure of a small world, in which most people are connected by short paths, need not be at odds with a world in which an antiwar appeal, embedded in an e-mail chain letter, can pass through several hundred intermediaries before arriving in one's inbox.
| Acknowledgments. |
|---|
|
|
|---|
| Footnotes |
|---|
To whom correspondence may be addressed. E-mail: dlibenno{at}carleton.edu or kleinber{at}cs.cornell.eduFreely available online through the PNAS open access option.
Author contributions: D.L.-N. and J.K. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0708471105/DC1.
© 2008 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
Related articles in PNAS:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||