# Modeling stochastic processes in disease spread across a heterogeneous social system

^{a}Data61, Commonwealth Scientific and Industrial Research Organisation, Pullenvale, QLD 4069, Australia;^{b}Health & Biosecurity, Commonwealth Scientific and Industrial Research Organisation, Canberra, ACT 2601, Australia;^{c}School of Information Technology and Electrical Engineering, University of Queensland, St Lucia, QLD 4072, Australia;^{d}School of Computer Science and Engineering, University of New South Wales, Kensington, NSW 2052, Australia

Edited by Burton H. Singer, University of Florida, Gainesville, FL, and approved November 21, 2018 (received for review January 25, 2018)

## Significance

This study infers probabilistic infection routes of a vector-borne disease by modeling the internal dynamics of metapopulations, driven by human mobility, as multivariate stochastic processes. In this way, our proposed model uncovers the self-exciting and mutually exciting nature of disease spread across a heterogeneous social system with rich context. Our model is a general extension of networked Hawkes processes, providing the flexibility to add constraints (presence of diffusion medium) and to use domain knowledge (cross-metapopulation connectivity), and thus covering both direct and indirect diffusion processes such as contact-based and vector-borne disease spread. Our model is readily applicable to a wide range of intragroup and intergroup diffusion processes in social and natural systems and can infer probabilistic causality between discrete events.

## Abstract

Diffusion processes are governed by external triggers and internal dynamics in complex systems. Timely and cost-effective control of infectious disease spread critically relies on uncovering underlying diffusion mechanisms, which is challenging due to invisible infection pathways and time-evolving intensity of infection cases. Here, we propose a new diffusion framework for stochastic processes, which models disease spread across metapopulations by incorporating human mobility as topological pathways in a heterogeneous social system. We apply Bayesian inference with the stochastic Expectation–Maximization algorithm to quantify underlying diffusion dynamics in terms of exogeneity and endogeneity and estimate cross-regional infection flow based on Granger causality. The effectiveness of our proposed model is shown by using comprehensive simulation procedures (robustness tests with noisy data considering missing or delayed human case reporting in real situations) and by applying the model to real data from 15-y dengue outbreaks in Australia.

Diffusion processes in the real world often produce non-Poisson distributed event sequences, where interevent times are highly clustered in the short term but separated by long-term inactivity (1). Examples are observed in both human and natural activities such as resharing microblogs in online social media (2, 3), citing scholarly publications (4, 5), a high incidence of crime along hotspots (6, 7), and aftershock sequences near the seismic center (8). These all imply that an event occurrence is likely triggered by preceding events in cascades of different scales, and that the timing of discontinuous events conveys information about underlying diffusion mechanisms.

Based on point process approaches, uncovering such feedback mechanisms between preceding and triggered events has drawn significant attention from a wide range of scientific communities (2–9), since it helps predict diffusion trends and establish cost-effective strategies for the promotion or restriction of the diffusion process (10). When it comes to epidemics, an accurate understanding of underlying dynamics is crucial for the timely control of infectious disease spread. However, uncovering disease dynamics is very challenging due to unobservable transmission routes and limited information on private social networks, unlike explicit cited–citing relationships of documents in online social media (2, 3, 9, 11) or in academic publications (4, 5). Moreover, large international and domestic travel volumes have increased the uncertainty of infection pathways. Thus, the quantification of exogenous and endogenous effects is essential to overcome these challenges and understand emergent bursts of outbreaks, yet it has been largely neglected in epidemic studies (12–14).

In this study, we propose the Latent Influence Point Process model (LIPP) for disease spread across a heterogeneous social system by incorporating three major counterbalancing factors: (*i*) exogenous influence covering environmental heterogeneity, (*ii*) endogenous influence attributed to macrolevel interactions between metapopulations, and (*iii*) a time decay effect. We apply Bayesian inference using the stochastic Expectation–Maximization algorithm, which enables us to quantify the reflexivity of metapopulations, i.e., the level of feedback on event occurrences (15, 16), driven by external and internal dynamics in a complex system, and to estimate infection flow between metapopulations based on Granger causality.

We first conduct simulations to generate synthetic data as ground truth by varying parameter settings so as to mimic real data. We also add variations to the generated datasets for reproducing (*i*) random missing, (*ii*) nonrandom missing, and (*iii*) time-delayed mechanisms of human case reporting. With 1,200 datasets in total, we evaluate the model performance (recovery of infection flow between regions and of model parameters) and compare with competing baselines. Our model recovers cross-regional infection flow well, with greater than 85% accuracy (over 70% for noisy data) with a 95% confidence interval, and outperforms baseline models. For real data, we investigate dengue spread in Queensland, Australia, during a 15-y period (2002–2016). We find that dengue outbreaks become more globally interconnected across multiple regions through human mobility, leading to more complex behavior of disease spread over time. In terms of reflexivity, precursory growth and symmetric decline of outbreaks in metropolitan or populated regions are attributed to slow but persistent feedback on preceding outbreaks via intergroup dynamics, while abrupt growth but sharp decline in remote or peripheral regions is driven by rapid but inconstant feedback (abrupt outbreaks during an intensive period) via intragroup dynamics. Additionally, similar diffusion trends between two populous cities reflect synchronous feedback mechanisms of regional social systems, which is likely due to large volumes of external visitors and heavy reciprocal fluxes between the cities. That is, human mobility is a vital factor of mutual excitations across regions.

## Methods

We first explain our data collection and propose a diffusion framework. It quantifies exogenous and endogenous dynamics in disease spread and infers probabilistic transmission routes and cross-regional infection flow across a heterogeneous social system.

### Data Collection.

We investigate dengue outbreaks in Queensland, Australia, from 2002 to 2016, using data provided by Queensland Health. Dengue is a mosquito-borne viral disease transmitted among humans by mosquito vectors, whose outbreak risk is rapidly increasing worldwide (17). The data contain records of anonymized infected individuals, including onset dates, residence postcodes, and acquisition places if available. For understanding cross-regional infections at a macro level, we categorize residence postcodes into 15 regions, which correspond to Statistical Area Level 4 as defined by the Australian Statistical Geography Standard. Based on selected target regions, we create an event sequence of infections, where each event is a tuple consisting of occurrence time and region identity.

We also incorporate human mobility as topological heterogeneity across multiple regions, which reflects macrolevel internal dynamics in a social system. To obtain structural connectivity between regions, we use three types of travel data: the International Visitor Survey, the National Visitor Survey, and geo-tagged Twitter posts (see *SI Appendix*, section S1 for detailed statistics and measurements).

### Background.

We consider a Hawkes process (18) as our fundamental diffusion framework, since it is a non-Markovian extension of a Poisson process and thus realizes the clustering of events in the real world. A general univariate Hawkes process is defined with an intensity function,
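In standard textbook notation, which is an assumption here and may differ from the symbols used in the rest of this article, such an intensity function is commonly written as

```latex
\lambda(t) \;=\; \mu(t) \;+\; \sum_{i:\, t_i < t} \phi(t - t_i)
```

where $\mu(t) \ge 0$ is the background (exogenous) intensity, $t_i$ are the times of past events, and $\phi(\cdot) \ge 0$ is a memory kernel that determines how strongly, and for how long, each past event raises the rate of future events.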

Fig. 1 shows the embodiment of a Hawkes process to a disease-spread scenario. As shown in Fig. 1*A*, disease infections are represented as a single arrival process. It is reframed as Fig. 1*B* by considering self-excitations and mutual excitations (intraregion and interregion disease transmissions). As discussed in the Introduction, such cross-regional outbreaks are accelerated by human mobility (solid and dashed arrows in Fig. 1*C*). The infection pathways from regions where the vector is absent to other regions (dashed lines) result from an infected individual (international or domestic visitor) transiting through the vector-free regions. That is, human mobility allows bidirectional infection pathways (mutual excitations) among vector-free and vector-present regions.

In this context, the objective of our framework is to model bursty behavior (clustered in time and space) of disease outbreaks across metapopulations by incorporating human mobility as topological pathways in a heterogeneous social system.

### Proposed Model.

We now propose the LIPP model, which incorporates the exogeneity and endogeneity of a social system as major components for realizing the bursty diffusion processes in the real world. Based on inputs of a spatial and temporal event sequence and cross-regional human mobility, our model aims to quantify the reflexivity (level of feedback on prior events) (15, 16) of a social system using estimated model parameters and to infer transmission routes and infection flow between regions.

Suppose that we observe an event sequence D consisting of N spatiotemporal events in a set of regions R during an observation time period *B*, we consider multiple timelines separated by event occurrence regions. For each region r, the history of events consists of two different types of event sequences,

We incorporate three major counterbalancing components into our framework: (*i*) exogenous influence covering environmental heterogeneity of target regions, (*ii*) endogenous influence attributed to macrolevel interactions between metapopulations, and (*iii*) a time decay effect with an exponential memory kernel. Details are discussed in *Exogeneity* and *Endogeneity*.
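As a rough illustration of how these three components combine, the sketch below evaluates the conditional intensity of a generic multivariate Hawkes process with an exponential memory kernel. It is a minimal sketch under stated assumptions, not the LIPP specification: the names `mu`, `A`, and `omega`, and the kernel form `A[s, r] * omega * exp(-omega * dt)`, are hypothetical stand-ins for the background intensity, the cross-regional influence (for example, one weighted by human mobility), and the decay rate.

```python
import numpy as np

def intensity(t, region, event_times, event_regions, mu, A, omega):
    """Conditional intensity of `region` at time `t` (generic multivariate Hawkes).

    mu[r]    : background (exogenous) rate of region r            (assumed name)
    A[s, r]  : influence of an event in region s on region r      (assumed name)
    omega    : decay rate of the exponential memory kernel        (assumed name)
    """
    event_times = np.asarray(event_times, dtype=float)
    event_regions = np.asarray(event_regions, dtype=int)
    past = event_times < t                      # only earlier events can excite
    dt = t - event_times[past]                  # elapsed time since each past event
    weights = A[event_regions[past], region]    # source-region -> target-region influence
    endogenous = np.sum(weights * omega * np.exp(-omega * dt))
    return mu[region] + endogenous
```

With no past events the function returns the background rate alone; each additional past event adds a contribution that decays exponentially with elapsed time.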

#### Exogeneity.

In region r, events can occur independently of a previous event history, due to external influence. This is modeled with a Poisson process with a background intensity,

#### Endogeneity.

Contrary to exogenous infections, internal dynamics in a social system drives bursts of events through interactions between individuals over social networks, so it is called internal influence (10). Our model incorporates cross-regional human mobility as macrolevel endogenous effects on diffusion, and the intensity brought by mutual excitations across multiple regions is defined as*C*. The second term,

### Bayesian Inference.

We apply Bayesian inference to estimate the latent influence of each region in our proposed model by using a gamma distribution as a conjugate prior and thus obtaining a gamma posterior. We also introduce the latent index variables Z = *B* (dashed lines). By using the stochastic expectation–maximization (EM) algorithm, we learn our model parameters and estimate probabilistic transmission routes (see *SI Appendix*, section S2 for details).
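The following sketch illustrates one iteration of this kind of inference on a generic multivariate Hawkes process. It is illustrative only and does not reproduce the algorithm in *SI Appendix*, section S2: the exponential kernel, the influence matrix `A`, the decay `omega`, and the prior hyperparameters `shape0` and `rate0` are all assumptions. It computes the probability that each event was triggered by each earlier event, samples one latent parent per event (the stochastic step), and applies the standard Gamma-Poisson conjugate update to the background rates.

```python
import numpy as np

def triggering_probabilities(times, regions, mu, A, omega):
    """P[i, j] = P(event i was triggered by event j) for j < i;
    P[i, i] = P(event i is a background event). Times must be sorted."""
    times = np.asarray(times, dtype=float)
    regions = np.asarray(regions, dtype=int)
    n = len(times)
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = mu[regions[i]]                               # exogenous share
        dt = times[i] - times[:i]
        P[i, :i] = A[regions[:i], regions[i]] * omega * np.exp(-omega * dt)
        P[i] /= P[i].sum()                                     # normalize the row
    return P

def sample_parents(P, rng):
    """Draw one latent parent per event; parent == own index means background."""
    return np.array([rng.choice(len(row), p=row) for row in P])

def gamma_posterior_background(parents, regions, n_regions, T,
                               shape0=1.0, rate0=1.0):
    """Gamma(shape0, rate0) prior on each mu[r] plus a Poisson count of
    background events over [0, T) gives a Gamma posterior (shape, rate)."""
    regions = np.asarray(regions, dtype=int)
    is_background = parents == np.arange(len(parents))
    counts = np.bincount(regions[is_background], minlength=n_regions)
    return shape0 + counts, np.full(n_regions, rate0 + T)
```

Iterating these steps, together with analogous updates for the remaining parameters, would give a stochastic EM loop; the sampled parents are what make probabilistic transmission routes available as a by-product.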

#### Granger causality.

A multivariate Hawkes process is a linear dependence structure of mutually exciting point processes, whose notion has been shown to reflect the Granger causality (20, 21). Granger causality is a statistical belief that the knowledge of a possible cause should improve (“Granger-causes”) the prediction of the subsequent effect (22). Finding causality between two stochastic random variables in a time series is related to learning multivariate Hawkes kernels in parametric (4, 11, 23) or nonparametric ways (24, 25).

#### Granger causality of multivariate Hawkes process.

In the context of Granger causality, Eqs. **2**–**4** can be reformulated with a more general linear dependence structure of an R-dimensional Hawkes process as

## Simulation

### Synthetic Data Generation.

We generate synthetic data by using an exact sampler (26), which samples time moments without approximation by decomposing a random variable into multidimensional nonhomogeneous Poisson processes based on the superposition property. This enables us to obtain marks (event types: regions) of triggering events to be used as ground truth of infection flow. Accordingly, Fig. 2 illustrates examples of true and estimated region-to-region matrices of infection flow from our synthetic datasets by varying the number of regions (see *SI Appendix*, section S3 for our simulation algorithm).
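To show how ground-truth triggering information can come out of a simulation, the sketch below generates a multivariate Hawkes process through its branching (cluster) representation. This is a swapped-in technique chosen for brevity, not the exact sampler of ref. 26 or the algorithm in *SI Appendix*, section S3. The parameter names `mu`, `A`, and `omega` and the exponential kernel are assumptions, and the process is assumed to start with no history before time 0.

```python
import numpy as np

def simulate_hawkes_branching(mu, A, omega, T, rng):
    """Simulate on [0, T) and return (times, regions, parents).

    parents[i] is the index of the event that triggered event i in the
    time-sorted output, or -1 for a background (exogenous) event.
    A[s, r] is the expected number of region-r offspring per region-s event.
    """
    n_regions = len(mu)
    events = []  # (time, region, parent position in `events` or -1)
    # Background events: homogeneous Poisson process in each region.
    for r in range(n_regions):
        for t in rng.uniform(0.0, T, size=rng.poisson(mu[r] * T)):
            events.append((t, r, -1))
    # Offspring: every event spawns Poisson(A[s, r]) children in region r,
    # each delayed by an Exponential(omega) waiting time.
    i = 0
    while i < len(events):
        t, s, _ = events[i]
        for r in range(n_regions):
            for d in rng.exponential(1.0 / omega, size=rng.poisson(A[s, r])):
                if t + d < T:
                    events.append((t + d, r, i))
        i += 1
    # Sort by time and remap parent positions to the sorted indices.
    order = np.argsort([e[0] for e in events])
    rank = np.empty(len(events), dtype=int)
    rank[order] = np.arange(len(events))
    times = np.array([events[j][0] for j in order], dtype=float)
    regions = np.array([events[j][1] for j in order], dtype=int)
    parents = np.array(
        [rank[events[j][2]] if events[j][2] >= 0 else -1 for j in order],
        dtype=int,
    )
    return times, regions, parents
```

Counting the pairs `(regions[parents[i]], regions[i])` over all triggered events gives a region-to-region matrix that can serve as ground truth for infection flow. The entries of `A` should be small enough (spectral radius below 1) for cascades to stay subcritical.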

### Robustness Test.

In real situations, infection reports are often missing or delayed, which makes it more challenging to learn model parameters. In this regard, we add variations to the synthetic data in three different ways: (*i*) random missing, (*ii*) clustered missing, and (*iii*) time delay, each at levels of 5%, 10%, and 15% (1,200 test cases in total). With the synthetic data, we conduct robustness tests with respect to the recovery of infection flow and model parameters. For the parameter recovery, we also evaluate relative strengths between estimated parameters, which is important to validate our subsequent interpretations of underlying diffusion processes with real data. As a result, our proposed model is robust to noisy data, as shown in *SI Appendix*, Table S3. Clustered missing data affect the model performance the most, followed by random missing and time-delayed data, but the accuracy rates remain over 70% (see *SI Appendix*, section S3 for data variations, and see *SI Appendix*, section S4 for test results).
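The three kinds of perturbation can be sketched as follows. The exact mechanisms used in the study are described in *SI Appendix*, section S3 and are not reproduced here; the function names, the contiguous-block definition of clustered missingness, and the uniform delay with a `max_delay` bound are all assumptions made for illustration.

```python
import numpy as np

def drop_random(times, regions, frac, rng):
    """Random missing: drop each event independently with probability `frac`."""
    keep = rng.random(len(times)) >= frac
    return times[keep], regions[keep]

def drop_clustered(times, regions, frac, rng):
    """Clustered missing: drop one contiguous run covering about `frac` of events."""
    n = len(times)
    k = int(round(frac * n))
    start = rng.integers(0, n - k + 1)          # random start of the missing block
    keep = np.ones(n, dtype=bool)
    keep[start:start + k] = False
    return times[keep], regions[keep]

def delay_reports(times, regions, frac, max_delay, rng):
    """Time delay: shift a random `frac` of events later, then re-sort by time."""
    times = times.copy()
    n = len(times)
    k = int(round(frac * n))
    idx = rng.choice(n, size=k, replace=False)  # events whose reports are delayed
    times[idx] += rng.uniform(0.0, max_delay, size=k)
    order = np.argsort(times, kind="stable")
    return times[order], regions[order]
```

Each function takes time-sorted NumPy arrays and returns a perturbed copy, so a clean synthetic dataset can be reused across the 5%, 10%, and 15% levels.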

### Comparisons with Baselines.

We also compare our proposed model, “LIPP with prior” based on Bayesian inference, with two baselines: (*i*) “LIPP without prior” based on maximum likelihood estimation (MLE) and (*ii*) a recent competing approach, called “MLE-SGLP,” which learns causality structures of nonparametric multivariate Hawkes processes based on MLE with sparse-group lasso (25). As evaluation metrics, inference errors are measured for all 1,200 synthetic datasets, and the results demonstrate that our model outperforms the two baselines (see *SI Appendix*, section S4 for details).

## Case Study: Dengue Spread

In this section, we conduct experiments on real data, whose results are interpreted with estimated model parameters based on the verification of parameter and infection-flow recovery with synthetic data in *Simulation*. For the experiments on real data, we set the observation time window as 1 y to examine time-evolving diffusion dynamics with a fine-grained time resolution.

### Cross-Regional Infection Flow.

As discussed earlier, infection pathways are unobservable, so we estimate the probability that each preceding event has triggered a current event by using the stochastic EM algorithm. Fig. 3 shows examples of constructed transmission routes based on estimated pairwise probabilities of triggering and triggered dengue cases, for the 3 y with the largest outbreaks during the 15-y period. Here, each node represents a dengue case, color-coded by region. As the figure shows, earlier dengue outbreaks tend to be more locally clustered, but, over the years, they become globally interconnected across regions, leading to more complex behavior of infectious disease spread. Based on the estimated transmission routes in Fig. 3, the corresponding infection flow between regions is illustrated at a macro level in Fig. 4. As the figure shows, the spread of dengue becomes more far-reaching across Queensland over time.
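The step from pairwise triggering probabilities to a region-to-region flow can be sketched as a simple aggregation. This is a minimal sketch under assumptions: `P` is a hypothetical lower-triangular matrix in which `P[i, j]` is the estimated probability that case `j` triggered case `i` and `P[i, i]` is the probability that case `i` is exogenous. The function name and this layout are not taken from the article.

```python
import numpy as np

def infection_flow(P, regions, n_regions):
    """flow[s, r] = expected number of cases in region r triggered by cases in region s."""
    regions = np.asarray(regions, dtype=int)
    flow = np.zeros((n_regions, n_regions))
    for i in range(len(regions)):
        for j in range(i):                      # only earlier cases can be triggers
            flow[regions[j], regions[i]] += P[i, j]
    return flow
```

Diagonal entries of `flow` collect intraregion transmission and off-diagonal entries collect interregion transmission. Adding the exogenous probabilities on the diagonal of `P` to the total of `flow` recovers the number of cases.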

These all are consistent with event raster plots in Fig. 5, exhibiting increasing dengue outbreaks all over the regions throughout the year in 2013, compared with local outbreaks during the intensive period in 2003 and 2009. Such spatial expansion of infectious diseases can be attributed to the increase in travel volumes (12, 27).

### Reflexivity of a Regional Social System in Disease Spread.

A Hawkes process generalizes a nonhomogeneous Poisson process by allowing the self-exciting nature via preceding events, as discussed in Eq. **1**. The linearity of the conditional intensity **2**. Accordingly, we quantify the level of exogeneity
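For reference, the branching ratio of a Hawkes process is conventionally defined from the memory kernel as below. The notation is generic and assumed rather than taken from the article's equations: $\phi$ is the kernel, $\alpha$ and $\omega$ are a hypothetical excitation weight and decay rate for an exponential kernel, and the last relation holds only for a stationary process with constant background intensity $\mu$ and $b < 1$.

```latex
b \;=\; \int_{0}^{\infty} \phi(\tau)\, d\tau ,
\qquad
\phi(\tau) = \alpha\,\omega\, e^{-\omega \tau} \;\Rightarrow\; b = \alpha ,
\qquad
\bar{\lambda} \;=\; \frac{\mu}{1 - b}
```

Here $b$ is the expected number of events directly triggered by a single event, so in the stationary case a fraction $b$ of all events is endogenous and a fraction $1 - b$ is exogenous.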

#### Behavioral split.

Fig. 6 summarizes these quantifications for regions with the largest number of dengue cases during the 15-y period. In general, the background intensity μ hardly changes, while the branching ratio b increases over time in metropolitan or populated areas such as Brisbane Inner City (BIC), Gold Coast (GC), and Sunshine Coast (SC) relative to remote or peripheral areas such as Cairns, Outback, and Townsville. In Fig. 6, *Left*, these two groups of regions also exhibit different growth patterns of dengue cases: precursory growth and symmetric decline in populous regions (BIC, GC, and SC) versus abrupt rise and sharp drop in peripheral regions (Cairns, Outback, and Townsville) in 2003 and 2009, showing a split in behavior. Additionally, mosquito vectors are present in the three peripheral regions, while they are absent in the populous areas.

#### Intragroup and intergroup dynamics.

Precursory growth in the major population centers is likely due to high reachability from statewide regions, i.e., a high probability of importation of infected individuals. However, the absence of mosquito vectors in these regions allows no further excitations by previous outbreaks, leading to symmetric decline. In other words, dengue outbreaks in these populous regions are driven by strong but unsustainable intergroup dynamics. On the other hand, abrupt growth but sharp decline in the peripheral regions is attributed to rapid but inconstant excitations via mosquito vector transmissions. These strong but unstable intragroup dynamics are possibly affected by time-varying vector density and visitor volumes. That is, nonuniformly distributed mosquito vectors statewide, unbalanced human mobility between regions, and time-varying visitor volumes jointly lead to such a behavioral split. Interestingly, BIC and GC exhibit similar growth patterns and reflexivity of a regional social system, which implies that endogenous feedback mechanisms are synchronous. This is likely due to human mobility patterns: The large volumes of external visitors and heavy reciprocal fluxes between the two cities are likely to drive mutual excitations.

## Discussion

The spread of infectious diseases leads to formation of event clusters in both space and time. Such spatiotemporal events are well realized by a point process, due to its flexible consideration of lasting impact of bursty behaviors rather than a current snapshot (4), and thus it is widely used as a mathematical tool in diverse research areas (28, 29). In this context, we proposed a model, LIPP, which generalizes a multidimensional Hawkes process by incorporating macrolevel internal dynamics of metapopulations, driven by human mobility.

### Extension of Networked Hawkes Processes.

Our proposed memory kernel in Eq. **4** can be reformulated with element-wise matrix multiplication as

### Cross-Domain Implications.

In real situations, tracking infection routes often depends on time-consuming site investigations or surveys of the travel routes of infected patients. Based on such efforts and expert knowledge, a single outbreak identification (ID) is assigned to a collection of cascading (or ongoing) local transmissions possibly initiated by the same index case (see *SI Appendix*, Fig. S2 for the reference of outbreak IDs provided by Queensland Health). Outbreak IDs are currently the best-known data source for coupling cases, but a considerable proportion of cases are left unknown or possibly misidentified, without linkages between coupled cases. Here, our estimation of probabilistic transmission routes can provide investigators or experts with an initial reference of infection pathways for efficient tracking and timely control of disease spread, reducing response time and cost under resource constraints.

For understanding the origin of a burst, the interplay between external shock and internal dynamics in complex systems has also been of great interest across disciplines (10, 30). We quantified the level of exogeneity and endogeneity of clustered bursts by incorporating environmental heterogeneity and internal dynamics between metapopulations. That is, our approach can reveal the rich context that underlies time-evolving subgroup interactions in the real world.

All these aspects increase the applicability of our proposed model to a wide range of intragroup and intergroup diffusion processes in social and natural systems at a macro level. Additionally, microlevel investigations, such as targeting subregions and analyzing detailed socioeconomic factors, would help obtain a holistic view of underlying diffusion mechanisms, which is an interesting direction for future work.

## Acknowledgments

We would like to thank Queensland Health and Tourism Research Australia for providing data. Also, we thank Cassie Jansen at Queensland Health for the key role in guiding us in defining the problem that needs to be solved.

## Footnotes

^{1}To whom correspondence should be addressed. Email: minkyoung.kimm{at}gmail.com.

Author contributions: D.P. and R.J. jointly conceived the study, guided the analysis, and interpreted the results; M.K., D.P., and R.J. designed research; M.K. performed research; M.K. contributed new reagents/analytic tools; M.K. analyzed data; and M.K. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1801429116/-/DCSupplemental.

- Copyright © 2019 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- (2) Kim M, Newth D, Christen P
- (3) Zhao Q, Erdogdu MA, He HY, Rajaraman A, Leskovec J
- (4) Kim M, McFarland D, Leskovec J
- (5) Shen H, Wang D, Song C, Barabási AL
- (7) Short MB, Bertozzi AL, Brantingham PJ
- (9) Crane R, Sornette D
- (10) Kim M, Paini D, Jurdak R
- (11) Iwata T, Shah A, Ghahramani Z
- (15) Filimonov V, Sornette D
- (16) Hardiman SJ, Bercot N, Bouchaud JP
- (17) World Health Organization
- (19) Cinlar E
- (20) Didelez V
- (21) Eichler M, Dahlhaus R, Dueck J
- (22) Granger CW
- (23) Linderman S, Adams R
- (24) Lewis E, Mohler G
- (25) Xu H, Farajtabar M, Zha H
- (26) Dassios A, Zhao H
- (27) Wesolowski A, et al.
- (28) Daley DJ, Vere-Jones D
- (29) Snyder DL, Miller MI
- (30) Roehner B, Sornette D, Andersen JV

## Article Classifications

- Physical Sciences
- Computer Sciences

- Biological Sciences
- Applied Biological Sciences