## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# General methodology for inferring failure-spreading dynamics in networks

Edited by Yueyue Fan, University of California, Davis, CA and accepted by Editorial Board Member Susan Hanson July 18, 2018 (received for review December 26, 2017)

## Significance

Failure spreading widely exists in many systems, but methodologies devised to understand its dynamics so far are domain-constrained and demonstrate limited applicability across different systems. This paper tackles this issue from a reverse perspective of failure-spreading processes: It takes the spreading outcomes as inputs and seeks to infer the spreading process that gives rise to the outcomes, instead of the other way around as the prevalent approaches do. Because failure-spreading outcomes are commonly observed for different systems, we envision that this approach is generally applicable and provides a promising avenue to potentially unify research on spreading dynamics across disciplines. This research will facilitate understanding system dynamics and developing control techniques for them at different systems, scales, and dimensions.

## Abstract

A generic modeling framework to infer the failure-spreading process based on failure times of individual nodes is proposed and tested in four simulation studies: one for cascading failures in interdependent power and transportation networks, one for influenza epidemics, one benchmark test case for congestion cascade in a transportation network, and one benchmark test case for cascading power outages. Four general failure-spreading mechanisms—external, temporal, spatial, and functional—are quantified to capture what drives the spreading of failures. With the failure time of each node given, the proposed methodology demonstrates remarkable capability of inferring the underlying general failure-spreading mechanisms and accurately reconstructing the failure-spreading process in all four simulation studies. The analysis of the two benchmark test cases also reveals the robustness of the proposed methodology: It is shown that a failure-spreading process embedded by specific failure-spreading mechanisms such as flow redistribution can be captured with low uncertainty by our model. The proposed methodology thereby presents a promising channel for providing a generally applicable framework for modeling, understanding, and controlling failure spreading in a variety of systems.

Many natural, social, or technical phenomena, such as epidemics (1, 2), viral marketing (3, 4), metabolic reaction knockouts (5, 6), traffic congestions (7, 8), and power blackouts (9⇓⇓–12), are known to be underpinned by a spreading process, in which failures (e.g., disease infections) initially hitting a small set of nodes propagate to a much larger set in a network. Given the ubiquity of failure-spreading processes, understanding and controlling their dynamics from network perspectives has been an active research field since the early 2000s (13⇓–15).

In this paper, we propose and test a general methodology for inferring the dynamics of failure-spreading processes. The time of failure for each node and a minimum amount of knowledge in the network topology are taken as inputs. The methodology seeks to infer the mechanisms underlying a spreading process and how failures propagate from node to node. Four general spreading mechanisms—external, temporal, spatial, and functional (16⇓⇓⇓–20)—are conceived as abstractions of a variety of domain-specific spreading mechanisms. A stochastic cascade model based on maximum likelihood estimation (MLE) is devised to estimate parameters associated with these four general spreading mechanisms and to reconstruct the node-to-node failure-propagation process. In this stochastic cascade model, failure propagation from one node to another is viewed as a probabilistic event driven by the four general spreading mechanisms, resulting in node failures also being probabilistic. MLE is used to maximize the likelihood of nodes failing at the given failure times, with the following three assumptions: (*i*) failed nodes may recover but will not fail again during the study period, (*ii*) failures of two nodes are conditionally independent given the four general spreading mechanisms, and (*iii*) among the four general spreading mechanisms there exists no interaction between external factors and failure-propagation mechanisms (i.e., temporal, spatial, and functional).

A number of existing methods for modeling spreading processes are known in the literature of various disciplines. Examples of these methods include percolation theory (21⇓–23), game theory (24, 25), the sandpile model (26, 27), the branching process (4, 28), agent-based models (29), and other domain-specific models, such as OPA for power outages (the name OPA comes from the three institutions that collaborated to develop this model: Oak Ridge National Laboratory, Power Systems Engineering Research Center at the University of Wisconsin, and the University of Alaska) (30). These lines of research, which we call collectively the “forward approach” in the rest of this paper, typically presume a network topology and a specific spreading mechanism and seek to understand how they (nonlinearly) affect outcome quantities such as the final failure size (31). The proposed methodology, which we call the “backward approach,” differs from the forward approach by reversing the traditional modeling procedure. Instead of taking spreading mechanisms and network topologies for inputs and outputting the outcome of the spreading process, the proposed backward approach takes the outcome of the spreading process (i.e., time of failure for each node) for input and seeks to infer detailed network topologies and spreading mechanisms. It thus advances the state of the art by addressing three limitations of the forward approach. First, the forward approach relies on knowledge of real-world network topologies (32, 33), which is not always available. For example, the detailed physical layouts of power networks are often kept secret for commercial or security reasons (16, 34), and the structural connectivities in human brains are not yet fully understood at the subcortical level (20, 35). Second, learning the process that leads to the spreading outcomes (e.g., failure sizes) is still intricate, if not impossible, through the forward approach (36). This is because, with the exception of few state-of-the-art models (37⇓–39), solutions to the network dynamic equations can only be obtained at equilibrium conditions when the spreading process has been concluded and the networks are stable. The transient states of the networks, while the spreading process is proceeding, are difficult to analyze (40, 41). This limitation is also a result of the fact that most models in the forward approach focus on processes and quantities at the network level, such as network fragmentation and the size of the giant component (31), while the local node-to-node propagation process is simplified or overlooked. Third, the underlying spreading mechanisms may vary across different systems. Therefore, a model tailored for one specific system has limited generalizability to other systems. Modeling approaches with specific spreading mechanisms inhibit identifying generic patterns of spreading processes across different systems (42). The proposed backward approach addresses these three limitations by removing the requirement for network topologies as an input, formulating the node-to-node failure-propagation process probabilistically, and adopting the four general spreading mechanisms, respectively.

We tested and validated our proposed model by comparing inference results to simulation data in four simulation studies: (a) cascading failures in interdependent power and transportation networks in New York City (NYC) during Hurricane Sandy, (b) a hypothetical influenza epidemic in NYC, (c) a congestion cascade scenario in the Sioux Falls benchmark network (43), and (d) cascading power outages in the Wood–Wollenberg 6-bus benchmark system (44). The Sioux Falls benchmark network and the Wood–Wollenberg 6-bus benchmark system are widely used cases for testing newly developed models in transportation network analysis and power system analysis, respectively. Results for (a) and (b) verify that the proposed methodology is mathematically sound: The backward approach is capable of precisely inferring the parameters associated with the spreading mechanisms as well as recovering the simulated failure-spreading processes. The results for (c) and (d) cross-validate the robustness of the backward approach: The backward approach demonstrates solid performance when the failure data are independently generated using well-established domain-specific spreading mechanisms, such as user rerouting in a roadway network and load rebalancing in a power grid. The simulation study (d) also validates the backward approach by comparing the inferred failure-spreading process with data generated by a fully validated simulation procedure that has been tested in its ability to replicate a real-world process of cascading power outages (32, 45⇓–47). All comparisons in the simulation studies between the inferred failure-spreading processes and simulation data return consistent patterns, suggesting the accuracy, robustness, and generalizability of the proposed methodology.

## Model

The stochastic cascade model formulates the likelihood of a set of node failures given the four general spreading mechanisms, as shown in the upper line of Eq. **1**:*n* nodes and **E** represents external factors which could vary with contexts. In general, external factors determine the onset of a spreading process by giving rise to initial failures (also known as seeds) (18, 48). *j* to node *i*, which accounts for the three failure-propagation mechanisms: temporal dependence, spatial dependence, and functional dependence. These three failure-propagation mechanisms capture how the lapse of time, the separation in geographical space, and the connections between node-level processes (e.g., individual activities in an epidemic network) change the dynamics of failure propagations between nodes. The parameters *N* is the set of nodes in the network(s). Under the conditional independence assumption, the likelihood of the set of node failures can be decomposed into a sequence of (conditionally) independent failures of individual nodes (49), as presented in the second line of Eq. **1**.

The node failure probability **1** can be reformulated using the hazard function **2**. A hazard function, when defined in a discrete-time setting, is the instantaneous failure probability at a given time step **2**:**2** concern cases of observed node failures, right-censored nodes, and left-censored nodes, respectively. Proof of Eq. **2** is provided in *SI Appendix*. Eq. **2** shows that node failure probabilities in all three cases reduce to functional forms in terms of hazard functions.

For node *i*, its failure hazard at a given time step **3**:*j* to node *i* in a stochastic setting, there exist uncertainties regarding when node *j* fails. Conceptually, as long as *j* fails before *j* to *i* happening at time step *j* to node *i*, **4**:*j* to node *i*. Proof of Eq. **4** is provided in *SI Appendix*. Eqs. **3** and **4** decompose the failure hazard of node *i* at time *Materials and Methods*.

Inference in the above stochastic cascade model is thus formulated as an MLE problem with the likelihood function defined by Eq. **1**. MLE has been applied to modeling the spreading dynamics of disease (50⇓–52) and information (53⇓–55). These two lines of research, however, only account for temporal dependence (i.e., the failure time interval) between nodes and require multiple instances of cascades for the model estimation. The present model differs from the existing ones by synthesizing the four general spreading mechanisms, thus enabling the estimation with a single cascade instance.

## Results

The above inference model is tested by comparing the inferred failure-spreading processes with the simulated failure-spreading processes in the four simulation studies. For the study of interdependent infrastructure cascading failures (a) and influenza epidemic (b), we randomly generate the parameter values and simulate failure-spreading processes based on the generated parameter values. The failure time of each node is produced from the simulations and is fed into the above inference model to obtain the inferred parameter values and inferred failure-spreading processes. For the study of congestion cascade (c) and cascading power outages (d), we employ two simulators that replicate real-world failure-spreading processes to generate node failure times in the transportation network and power network, respectively. These two simulators assume spreading mechanisms specific to transportation systems and power systems and are independent of the inference model. The failure times are then used as inputs to the inference model. The detailed formulations of the inference model, which involve conceiving the functional forms for *Materials and Methods*.

### (a) Interdependent Infrastructure Cascading Failures.

The four general spreading mechanisms are adopted as abstractions of the complex and difficult-to-model (specific) spreading mechanisms behind cascading failures of interdependent infrastructures (34). The external factors include the impacts of high winds and storm surges (57), evaluated by parameters *SI Appendix*) and is characterized by *t* values from the MLE, are shown in Table 1. A *t* value is the standardized deviation of a parameter estimate from its preset value, evaluated as

All parameter estimates in Table 1 have insignificant deviations from their preset values, indicated by *t* values within the range [−1.96, 1.96]. This demonstrates the capability of our model to accurately quantify how the general spreading mechanisms contribute to the emergence of cascading failures.

The numbers of failed nodes due to external factors and failure propagations at each time step are shown in Fig. 1. Their temporal patterns show remarkable correspondence between the simulated curves (Fig. 1*A*) and the inferred ones (Fig. 1*B*). In terms of trends, the total number of failed nodes and number of failures due to propagations keep increasing until time step 5, before they drop to close to zero around time step 14. Overall this trend is consistent with the empirical results on disaster impacts during Hurricane Sandy (57). The number of failed nodes due to external factors decreases straight to zero at time step 6. The highest numbers of failed nodes per time step for the simulation and inference result are also similar with a value of 1,050 and 1,037, respectively, both at time step 5. The almost identical temporal patterns of simulation and inference result provide strong evidence that the model can capture the underlying cascading failure process given cascading failure outcomes (node failure times). This is further demonstrated by the comparison of failure hazards in Fig. 2.

The inferred hazard functions (Fig. 2*B*) present patterns similar to the simulated ones (Fig. 2*A*). The hazard of failure due to external factors trends downward from the beginning of the simulation, while the hazard of failure due to propagation reaches a peak around time step 6 and decreases thereafter. The result suggests that the model is able to recover the dynamics of general spreading mechanisms in a scenario where they are constantly evolving. In addition, although the failure hazards due to external factors and due to propagation assume similar functional forms (*Materials and Methods*), the differences in their temporal trends suggest that a single functional form in the detailed model formulation could capture different dynamics for the general spreading mechanisms. This property suggests our model’s versatility in handling diverse types of failure-spreading processes.

The spatial patterns of infrastructure cascading failures in the simulation and inference result are presented in Fig. 3. The inference outcomes (Fig. 3 *B* and *D*) accurately capture the simulated failure time of power service and transportation service (Fig. 3 *A* and *C*) in the majority of the census tracts in NYC. The inferred failures and simulated failures show comparable spatial patterns in terms of how they spread geographically: Early-stage failures (at time steps 1–3) are scattered across the region, surrounded by failures that occurred later in time. This implies that certain census tracts act as centers that propagate failures of infrastructure service to their surrounding census tracts.

In addition to the temporal and spatial patterns of failure spreading, we also look into the node-to-node failure-propagation patterns, which are compared between the simulated patterns (Fig. 4*A*) and the inferred patterns (Fig. 4*B*) by first visualizing the propagation networks (as shown in the graphs) and then quantifying the degree distributions of the propagation networks. In a propagation network, a (directed) link from node A to node B signifies that A is the most likely source to have propagated failure to B. In other words, a propagation network depicts the most likely paths through which failures propagate from the initial seeds to the rest of the network.

In Fig. 4, the visualized propagation networks show comparable patterns between simulation (Fig. 4*A*) and inference (Fig. 4*B*): Nodes form tiny clusters among which few links are present. This suggests that in both the simulated data and the inference result the node-to-node propagation patterns are “local,” where any single node only influences a few other nodes rather than having a widespread influence over the entire infrastructure networks. This is also implied by the degree distributions shown in Fig. 4, on which power-law distributions can be fitted. The simulated propagation network and the inferred propagation network have notably matched degree distributions, with both similar densities for each degree category and close values of the fitted power-law coefficients.

### (b) Influenza Epidemic.

We apply the inference model to a hypothetical influenza-spreading scenario in NYC. Its setting is different from the infrastructure cascading failures in case (a) in five aspects. First, each node in the influenza-spreading network represents an individual person, which means the size of the epidemic network is much larger than the interdependent infrastructure networks where a node represents a given infrastructure service in a zone. The total number of nodes (individuals) in the influenza-spreading network is 4,414,610 out of a population of 8,175,133 in NYC, considering the effect of vaccination (59, 60). Second, a recovery process is modeled in the formation of influenza epidemics. The susceptible–infected–recovered model is used to simulate the influenza epidemic, in which each infected (or equivalently, “failed”) individual recovers and is no longer infective after 6 d in infection (61). Third, external factors are not considered in the epidemic study, as inflow of influenza infection cases from outside the study region is assumed to be controllable through travel restrictions (62). Fourth, a temporal dependence parameter *t* values, are presented in Table 2.

All four parameters in Table 2 are accurately estimated, with estimates close to their preset values and deviations being insignificant (

The simulated epidemic process and the inferred epidemic process in Fig. 5 show almost identical patterns. In terms of the number of influenza cases per time step, the curves for the simulated cases and inferred cases almost overlap: The inferred number of cases is within 1% deviation from the simulated number of cases for any time step. Two waves of infection outbreaks, one around time step 13 and the other around time step 25, are noticeable in both the simulated and inferred epidemic process. The reproduction numbers in both the simulation and inference result show decreasing trends and cross the threshold of 1 (0 at log-scale) at the same time step of 6. These findings demonstrate our model’s ability to recover the temporal patterns of an influenza spreading process. In terms of the spatial patterns, Fig. 6 illustrates the proportion of infected population in each census tract at time steps 10, 20, and 30 for the inferred epidemic process (Fig. 6 *A*–*C*) and the simulation (Fig. 6 *D*–*F*), respectively. It shows highly consistent spatial patterns between the two: The disease first emerges and spreads locally in southern Brooklyn, gradually covers the entire borough, and propagates to Manhattan, Queens, and Staten Island by the end of the study period.

The node-to-node failure-propagation patterns are analyzed at both zone level and individual level. The zone-level propagation networks, which indicate whether influenza spreads from one zone (census tract) to another, are illustrated in Fig. 7. The simulated propagation network (Fig. 7*A*) and the inferred propagation network (Fig. 7*B*) share a noticeable topological pattern: The network organization evolves from a lattice-like structure observed at early stages to a tree-like structure later in the epidemic process. The former pattern is recognizable in the upper part of the networks in Fig. 7, which is characterized by small, overlapping cycles, and the latter is apparent in the lower part of the networks, showing “spikes” that are long branches in the tree structures. This finding adds to the evidence that the proposed model is capable of capturing the failure-propagation patterns at the network level as well as how the patterns evolve over time.

The degree distributions of the propagation networks at both zone level and individual level are evaluated in Fig. 8. At the zone level (Fig. 8 *A* and *B*), the node degrees follow Poisson distributions, which implies that zone-to-zone propagations take place independently of each other and can be approximated by Bernoulli processes. The degrees of individuals in the propagation networks are fitted into power-law distributions, suggesting that most people infect only a few others while a small number may have transmitted influenza to a large group. The proposed inference model successfully reconstructs the simulated degree distributions at the two scales, with similar shapes of the distributions as well as close Poisson/power-law coefficients between the simulated propagation networks and inferred propagation networks.

### (c) Congestion Cascade.

In a roadway network, when a link (i.e., a road segment) gets congested, travelers using this link will choose a new route to bypass the congested link(s) (64). This behavior causes a redistribution of traffic flow from the congested link(s) to other links and could overload other links, resulting in a congestion cascade.

We apply the inference model to a congestion cascade by modeling it as a failure-spreading process where failure of a link is interpreted as being congested. The Sioux Falls network, shown in Fig. 9*A*, serves as the benchmark network (43) to which the inference results are compared. The network is initially operating at user-equilibrium state (65) in which every user chooses the fastest route in travel time among all possible routes. At time step 1, links 6 → 8 (meaning the road segment from node 6 to node 8 in Fig. 9*A*) and 8 → 6 would fail as they are already overloaded under user equilibrium and become the seeds for subsequent failure propagations. At every time step, overloaded links fail and their flow is redistributed to the surviving links at the next time step. The time step when each link fails is recorded and fed as input into the inference model. The simulated link failure times are illustrated in Fig. 9*B*. The number of failed links per time step is shown in Fig. 9 *C* and *D* for the simulation data and inference result, respectively.

The congestion cascade process is concluded in four time steps when the network reaches a new equilibrium where no new flow redistributions or congestions/failures happen. The numbers of congested/failed links at every time step in the simulation data and inference result demonstrate remarkable similarities, with deviations within ±2. Both numbers increase for three time steps before dropping at the fourth time step. As no more failure happens after time step 4, the surviving links at the end of time step 4 become right-censored, whose numbers are also the same for the simulation data and inference result.

The simulated and inferred failure-propagation networks are shown in Fig. 10 *A* and *B*, respectively. Every node in Fig. 10 corresponds to a link (i.e., a road segment) in the Sioux Falls network (Fig. 9*A*). Both propagation networks show a separation of two clusters, one initialized by the failure of link 6 → 8 (representing the road segment from 6 to 8 in Fig. 9*A* but denoted as node 6 → 8 in Fig. 10) and the other by the failure of link 8 → 6. More interestingly, the topologies of the clusters initialized by link 6 → 8 in the simulation data and inference result are exactly the same. For the cluster extending from link 8 → 6, the model is successful in recovering the key hubs for propagating failures. In Fig. 10*A*, the links 8 → 6, 8 → 9, and 9 → 5 are identified as hubs, as each of them propagates failures to more than one link. They thus play a more important role in facilitating the failure propagations than other links and could serve as potential candidates for strengthening to inhibit the progress of failure spreading. These three links emerge as hubs in the inferred failure-propagation network (Fig. 10*B*) as well. A noticeable difference between the simulated propagation network and the inferred one is the overestimation of the number of hubs. The inference model misidentifies links 11 → 4 and 21 → 24 as hubs, which does not agree with the simulated propagation network. This difference likely results from the power-law degree distribution assumed for the functional dependence (*SI Appendix*) in the inference model. Due to the fat tail in a power-law distribution, it tends to generate a larger number of high-degree links (road segments) that have functional dependence with multiple other links (road segments), compared with a binomial distribution which suggests more random occurrence of functional dependences. This finding casts doubts on the common belief that power-law distributions are prevalent in transportation systems (66).

### (d) Cascading Power Outages.

We use a validated simulation model called DCSIMSEP (45, 46) and replicate a detailed simulation process on the 6-bus benchmark system (44) as implemented by Hines et al. (47). DCSIMSEP simulates multiple complex mechanisms in power outage cascades, including grid separation, load rebalancing, load shedding, flow redispatch, and overloading failures (*SI Appendix*). This simulator was validated by comparing its simulation results with prior research results in 41 power networks (32). Hines et al. (47) used this simulator to generate 1,000 instances of outage cascade in the 6-bus benchmark system and calculated the probability of outage propagating from one power line to another, which they called an “influence graph” (*SI Appendix*).

Instead of using 1,000 cascade instances, we randomly generate one power outage cascade with simulation settings identical to those in ref. 47 and apply the proposed inference model to this one cascade to obtain the propagation network. The 6-bus benchmark system, the simulated outage cascade (power line failure times) upon it, and the inferred power line failure times are illustrated in Fig. 11.

The simulated failure times and the inferred ones are exactly the same except for two power lines, namely line 1–4 (i.e., the power line connecting node 1 and 4 in Fig. 11) and line 2–3. The influence graph obtained by Hines et al. (47) is shown in Fig. 12*A*, in comparison with the inferred propagation network in Fig. 12*B*.

The influence graph and the inferred propagation network are consistent in two key aspects. First, for every power line that fails in the simulation its most likely source of failure in the propagation network coincides with the one in the influence graph. This consistency is evident for the propagations from line 2–4 to lines 1–4, 2–5, and 4–5, which present high probabilities (marked by dark bold arrows) in both the influence graph and propagation network. For line 2–3, the propagation network also correctly identifies its most likely source of failure as line 2–5. Second, failure-propagation probabilities from line 3–5 to lines 1–4 and 2–3 and from line 2–5 to 4–5 are all insignificant (<0.05) in the propagation network, which corresponds to the small probabilities of these propagation paths in the influence graph. Two propagation paths, one from line 3–5 to 4–5, and the other from line 2–4 to 2–3, show differences between the influence graph and propagation network. The failure-propagation probability from 3–5 to 4–5 is significantly larger in the influence graph than in the propagation network. This could be attributed to the propagation network’s being based on only one cascade instance, which has insufficiently characterized the propagation likelihood from 3–5 to 4–5. The use of only one cascade in the propagation network is also responsible for another difference between the influence graph and propagation network: The lines 1–2, 3–6, and 5–6 are isolated in the propagation network, as no failure happens to them in this single-cascade simulation.

### Computational Complexity.

We evaluated the computational complexity of the proposed MLE and find a theoretical range from *O*(*n*) to *SI Appendix*). When node failures are distributed evenly in time across the study period (i.e., uniformly distributed), the computation time is close to the upper bound; when node failure times have a more concentrated distribution around a certain value, the computation time moves toward the lower bound. Considering that spreading events are typically fast processes characterized by large numbers of failures in a short span of time (67, 68), it is expected that when applied to real-world data the proposed model should demonstrate a performance close to *O*(*n*). This result is verified by a numerical analysis of the relationship between computation time and network size (i.e., the number of individuals) in the influenza epidemic simulations, as depicted in Fig. 13. In this analysis, the population in every census tract is decreased by a common scaling factor, and the computation time is evaluated as the scaling factor changes. Fig. 13 supports that the inference model has a computational complexity close to *O*(*n*) by demonstrating a close-to-linear relationship between the computation time and the scaling factor.

## Conclusion and Discussion

Modeling failure-spreading processes is an active and promising research field across multiple disciplines, considering their ubiquity and many unaddressed challenges in the current state of the art. The ultimate goals along this line of research are to understand and control the spreading of failures. A backward-approach modeling framework aimed at these goals is proposed and tested here. Four general spreading mechanisms, which serve as abstractions of a variety of domain-specific failure-spreading mechanisms, are assumed to underlie all failure-spreading processes. They are formulated into a survival-analysis-based model formulation that describes how the four general spreading mechanisms give rise to a cascade of node failures. MLE is used to estimate the parameters associated with the general spreading mechanisms. With the estimated parameters, the spreading process and node-to-node failure-propagation patterns are reconstructed.

The proposed modeling framework demonstrates good performance both in quantification of the four general spreading mechanisms and in reconstructing the failure-spreading process temporally and spatially as well as the local node-to-node failure-propagation patterns. In the simulation study of interdependent infrastructure cascading failures (a) and influenza epidemic (b) where the parameter values are preset in the simulations, the proposed model accurately infers the parameters in both cases. The temporal-spatial patterns of the failure-spreading processes, as well as the node-to-node propagation patterns are compared between the simulation data and inference results in all four simulation studies. The simulated and inferred temporal-spatial patterns show remarkable correspondence consistently throughout the four studies. The proposed model is also capable of recovering properties of the simulated failure-propagation patterns, such as the degree distributions of the propagation networks, the clustering of failures, the emergence of hubs in the propagation process, and the most likely failure-propagation paths to individual nodes, as demonstrated by all four simulation studies. Particularly in the case of cascading power outages (d), the inferred node-to-node failure-propagation probabilities using one cascade instance are consistent with the result from a validated simulation approach using the same power network but 1,000 cascade instances. This further demonstrates the capacity of the proposed approach in inferring failure-spreading processes under non-data-intensive circumstances, considering that certain types of failure-spreading data such as data on large-scale infrastructure failures are often sparse.

One challenge in modeling failure spreading is that (specific) spreading mechanisms vary from case to case and in different systems. Consequently, the specific spreading mechanisms involved in a real-world failure-spreading instance are usually not immediately known following the failures, and only knowable after a period of investigations and study (69, 70). The proposed approach demonstrates potential capacity to address this challenge by using a high-level abstraction of specific spreading mechanisms, which are represented by the four general spreading mechanisms. Two properties of the proposed approach could be summarized from the four simulation studies. First, the model demonstrates the flexibility of integrating knowledge in specific spreading mechanisms. In the influenza epidemic simulation study (b), where the specific spreading mechanism (i.e., close human contact) is known, this specific mechanism can be formulated into the model. In other words, the proposed approach can be operational both at an abstract level with the four general spreading mechanisms and at a more detailed level with domain-specific spreading mechanisms. This provides the proposed approach with the capacity to model both a well-understood spreading process (e.g., the spreading of some infectious diseases) as well as an emergent spreading process where there is little knowledge. Second, the ability of the proposed approach in inferring the dynamics of failure-spreading processes can be independent of the specific spreading mechanisms underlying the spreading processes in specific systems. This is shown through the simulation studies of congestion cascade (c) and cascading power outages (d), where the simulation data are generated by distinct (specific) spreading mechanisms. We show that our inference model recovers notably similar spreading processes and node-to-node failure-propagation patterns with their simulated counterparts that are independently generated using domain-specific simulators. This finding has two suggestions: (*i*) the proposed model has general applicability to a variety of systems and spreading processes; and (*ii*) the model demonstrates robustness against misspecifications of its detailed formulations that quantify the spreading mechanisms.

One limitation of the proposed approach is the approximations made in formulating the likelihood of node failure due to external factors and the likelihood of node-to-node failure propagation, by assuming a probability distribution of node failure time (due to external factors) and a distribution of failure time interval, respectively. The approximations can cause two potential issues. First, it creates an interaction effect among the three failure-propagation mechanisms. In particular, temporal dependence is made interactive with spatial dependence and functional dependence. This is because the parameterization of the probability distributions in terms of spatial dependence and functional dependence affects the shape of the distributions which capture temporal dependence. The interaction among the propagation mechanisms is a potential problem when the time dimension has an effect on the node-to-node failure propagations (e.g., periodicity in propagation occurrences) independent from spatial and functional factors. Interpreting the modeling results to capture the dynamics of each failure-propagation mechanism would then be difficult in this case, as their dynamics would confound each other. Second, a predefined and fixed probability distribution may be insufficient to characterize the dynamics of the failure-spreading mechanisms under certain circumstances. This is especially the case for a spreading process with a long life span, such as a global influenza epidemic which could last for months or even years. In this case, the spreading mechanisms themselves could be evolving over time, resulting in shifts in the parameter values or even the type of probability distributions. Addressing this issue would require an inference model formulation that is capable of updating itself, which is a potentially interesting topic for future research.

## Materials and Methods

The detailed model formulations are described in this section. The general functional forms for **5** and **6**, respectively:*i* and node *j*. Two specific functional forms for **7** and **8**. In the influenza epidemic (b), congestion cascade (c), and cascading power outages (d), **9**. With no external factors considered, *i* and failure propagation from node *j* to node *i* at a single time step, respectively, and *j* to node *i*. More specifically in the infrastructure cascading failures (a), the instantaneous failure likelihood due to external factors **10**:*i*. The instantaneous failure-propagation likelihood, **11**:*i* and node *j* and *i* is functionally dependent on node *j*.

The influenza epidemic study (b) has a unique formulation of failure-propagation rate **12**:*i* and *j* are located in, *i*,*j*) is an indicator of whether

## Acknowledgments

We thank M. Elizabeth Halloran (University of Washington) for her invaluable help in revising and preparing this paper. This work was supported by National Science Foundation Civil, Mechanical and Manufacturing Innovation Grant 1536340 and National Institutes of Health Grant 1R01GM108731-01A1.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: qzchen{at}uw.edu.

Author contributions: X.G. and C.C. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. Y.F. is a guest editor invited by the Editorial Board.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1722313115/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- Ferguson NM,
- Donnelly CA,
- Anderson RM

- ↵
- Teunis P, et al.

- ↵
- Kempe D,
- Kleinberg J,
- Tardos É

- ↵
- ↵
- Takemoto K,
- Tamura T,
- Akutsu T

- ↵
- Smart AG,
- Amaral LAN,
- Ottino JM

- ↵
- Ma R,
- Ban X,
- Pang J-S

- ↵
- Wu JJ,
- Sun HJ,
- Gao ZY

- ↵
- Hines P,
- Balasubramaniam K,
- Sanchez EC

- ↵
- Albert R,
- Albert I,
- Nakarado GL

- ↵
- Brummitt CD,
- D’Souza RM,
- Leicht EA

- ↵
- Motter AE,
- Lai Y-C

- ↵
- Newman MEJ

- ↵
- Moore C,
- Newman MEJ

- ↵
- ↵
- Rinaldi SM,
- Peerenboom JP,
- Kelly TK

- ↵
- Bassett DS, et al.

- ↵
- Yang Y,
- Nishikawa T,
- Motter AE

- ↵
- ↵
- ↵
- ↵
- Watts DJ

- ↵
- Li D, et al.

- ↵
- Wang W-X,
- Lai Y-C,
- Armbruster D

- ↵
- Matjaž P

- ↵
- Huang L,
- Yang L,
- Yang K

- ↵
- ↵
- ↵
- ↵
- Dobson I,
- Carreras BA,
- Lynch VE,
- Newman DE

- ↵
- Barabási AL

- ↵
- Hines P,
- Cotilla-Sanchez E,
- Blumsack S

- ↵
- ↵
- Ouyang M

- ↵
- Forstmann BU, et al.

- ↵
- Baldick R, et al.

- ↵
- ↵
- Barrat A,
- Barthelemy M,
- Vespignani A

- ↵
- Yang Y,
- Motter AE

- ↵
- ↵
- ↵
- Lorenz J,
- Battiston S,
- Schweitzer F

- ↵
- LeBlanc LJ,
- Morlok EK,
- Pierskalla WP

- ↵
- Wood AJ,
- Wollenberg BF

- ↵
- ↵
- Rezaei P,
- Hines PDH,
- Eppstein MJ

- ↵
- ↵
- ↵
- Koski T,
- Noble J

- ↵
- ↵
- Danon L, et al.

- ↵
- Netrapalli P,
- Sanghavi S

- ↵
- Gomez Rodriguez M,
- Leskovec J,
- Schölkopf B

- ↵
- Gomez-Rodriguez M,
- Leskovec J,
- Balduzzi D,
- Schölkopf B

- ↵
- Du N,
- Song L,
- Woo H,
- Zha H

- ↵
- ↵
- Guan X,
- Chen C

- ↵
- ↵
- Gany F,
- Rau-Murthy R,
- Mujawar I

- ↵
- Centers for Disease Control and Prevention

- ↵
- ↵
- Germann TC,
- Kadau K,
- Longini IM,
- Macken CA

- ↵
- Belik V,
- Geisel T,
- Brockmann D

- ↵
- Scott DM,
- Novak DC,
- Aultman-Hall L,
- Guo F

- ↵
- Wardrop JG

- ↵
- Akbarzadeh M,
- Memarmontazerin S,
- Soleimani S

- ↵
- Newman DE, et al.

- ↵
- ↵
- ↵
- U.S.-Canada Power System OutageTask Force

*Final report on the August 14th blackout in the United States and Canada: Causes and recommendations*(US Dept of Energy, Washington, DC and National Resources Canada, Ottawa, ON, Canada), Report.

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Engineering

- Social Sciences
- Sustainability Science