An open challenge to advance probabilistic forecasting for dengue epidemics

Significance Forecasts routinely provide critical information for dangerous weather events but not yet for epidemics. Researchers develop computational models that can be used for infectious disease forecasting, but forecasts have not been broadly compared or tested. We collaboratively compared forecasts from 16 teams for 8 y of dengue epidemics in Peru and Puerto Rico. The comparison highlighted components that forecasts did well (e.g., situational awareness late in the season) and those that need more work (e.g., early season forecasts). It also identified key facets to improve forecasts, including using multiple model ensemble approaches to improve overall forecast skill. Future infectious disease forecasting work can build on these findings and this framework to improve the skill and utility of forecasts.


This PDF file includes:
Supplementary Text Fig. S1 to S5 Tables S1 to S4

Supplementary Text Forecast Methods
Teams reserved the right to publish their own research and therefore only provided standardized forecasts and brief descriptions of their approaches. The following is a summary of those approaches including citations for those that have been published. Some high-level model characteristics are summarized in Table S1.
Team A. The model is a dynamical two-strain susceptible-exposed-infectious-recoveredsusceptible (SEIRS) compartmental model with a multi-life stage model for vector populations. Parameters were derived from the literature and data included dengue case data, precipitation, and minimum and maximum temperatures.
Team B. Forecasts for each target-location were generated from an ensemble of three types of statistical models: Holt-Winters exponential smoothing (time series smoothing of historical dengue at local, seasonal, and long-term scales), multidimensional analogues (on historical dengue data and historical precipitation data), and historical average models (the seasonal distributions of historical cases). Ensemble components were assigned individual weights for each target-location pair based on mean absolute error in predictions over the previous 4 years (1).
Team C. The model used a single strain susceptible-infected-recovered (SIR) model with an ensemble adjustment Kalman filter to sample model parameters consistent with all historical dengue case data available at the time of each prediction (2).
Team D. The model used the K-spectral centroid clustering algorithm to generate normalized clusters of incidence patterns from similar seasons to a sliding window of normalized cases in the current season. The most similar curve was selected and scaled to project mean incidence for the rest of the season. Probabilities for each target were specified based on previous season error for the same targets.
Team E. Two models were used to generate forecasts: a negative binomial generalized linear regression model (including time, recently observed cases, temperature, El Niño, the normalized difference vegetation index, and temperature-scaled R0) and a Gaussian process model (including season week, incidence at the end of the previous season, a seasonal sine wave, and an indicator for severe seasons that is estimated for the current season). The Gaussian process model was used for all forecasts with less than 7 seasons of historical data and all early season forecasts, and the negative binomial regression for later season forecasts once more historical data had accumulated (3).
Team F. A generalized linear model was built with numerous variables consisting of different case and environmental variables at different lags. Variables were selected by minimizing collinearity and optimizing fit (on mean absolute error and the Akaike and Bayesian information criteria). For total incidence in San Juan predictions were generated from a second model; a linear regression of cumulative cases reported up to that forecast week in a bootstrap sample of previous seasons on the total number of cases reported.
Team G. A susceptible-infected-recovered (SIR) human and vector compartmental model was fit by iteratively sampling parameters, generating a predictive distribution, evaluating the likelihood on the data, and updating the parameter density. An extremely randomized trees regressor was used to generate independent estimates based on a suite of incidence and weather variables. Estimates from both models were weighted and combined to generate the forecasts.
Team H. Five models were developed to predict subcomponents of dengue risk using historical case data: (1) a seasonal pattern (fitted to a transformed and normalized cosine functions), (2) a local logistic curve (fitted to the current season data), (3) a seasonal autoregressive integrated moving average, (4) a model to predict incidence based on estimated population susceptibility for each season (based on incidence in the previous two seasons), and (5) a model predicting season incidence based on early season incidence. These five model outputs were fitted to historical data with a random forest model to make a single forecast at each time point.
Team I. A 4-strain human compartmental model, with susceptible-exposed-infectious-recovered (SEIR) for unique strain sequence combinations (128 compartments) was coupled with a multilifecycle compartmental mosquito model. Parameters were initialized based on literature review and sampled using a genetic algorithm to calibrate the model to the data. The calibrated model was used to generate stochastic simulations for the forecasts.
Team J. Forecasts were made using an ensemble of three models: an empirical Bayes model (using current and historical dengue data) (4), a pinned spline model (using dengue, lagged precipitation, and temperature data), and an empirical prior (using only dengue data from previous seasons). Individual model components were weighted to optimize leave-one-out crossvalidation.
Team K. Principal component analysis was used to select four variables from the numerous dengue and climate variables available: lagged minimum temperature, dew point temperature, specific humidity, and precipitation. These variables were used in a multinomial logistic regression to generate forecasts.
Team L. Forecasts were generated by averaging across three models: a neural network informed by a susceptible-infectious-recovered-susceptible (SIRS) compartmental model coupled with a 4component (egg, immature, susceptible adult, and infectious adult) vector population model, a neural network time series model, and a Bayesian time series model. All models used temperature, precipitation, and historical case data.
Team M. For San Juan, a non-parametric additive autocorrelation model (dimension: 21, time delay: 3, forecasting steps: 52) was fitted using log-transformed data from previous seasons. For Iquitos, a seasonal autoregressive integrated moving average model (SARIMA(3,0,2)(0,1,1)52) was used. For both locations, forecasts were made for the entire season and not updated as new data became available within the season.
Team N. The model used ordinary least squares fit to the relationship of cumulative cases up to the current week to each target on historical data and predict incidence for the remainder of the season. Prediction distributions were modified to inflate kurtosis and avoid predictions that were highly unlikely given historical data.
Team O. The model was a non-parametric kernel-density state space reconstruction using logtransformed and smoothed historical dengue data (5). Parameters and lags were estimated independently for each forecast horizon by minimization of the cross-validation mean absolute error of point predictions.
Team P. The model was a Bayesian statistical, time-series regression model including smoothed, lagged dengue case data, lagged climate variables (precipitation, minimum temperature, and relative humidity), and an indicator for serotype switches within the past two years. Cases were modeled as a negative binomial process with an offset for estimated population size.

Climate and environmental data
Daily historical temperature and precipitation observations were made available from the Global Historical Climatology Network. Remotely sensed estimates of precipitation and Normalized Difference Vegetation Index were provided for the areas around both locations from the National Oceanic and Atmospheric Administration (NOAA) Climate Data Records. Temperature, precipitation, dew point, relative humidity, and specific humidity estimates were provided from the National Centers for Environmental Prediction Climate Forecast System Reanalysis. These data sources are maintained and quality controlled by NOAA.

Regression models
We compared logarithmic scores for all team forecasts and the baseline forecasts using Bayesian generalized linear models to estimate the conditional distribution of transformed scores given specific sets of variables potentially related to forecast skill. We first truncated each binned prediction to the range of 0.001 to 0.999 to avoid logarithmic scores of negative infinity or zero. We then calculated the corresponding logarithmic scores and converted those scores to surprisal values (−log( ( )) by changing the sign. This outcome variable, surprisal, has the advantage of being on the same scale as the logarithmic score but is continuous and positive and therefore suitable to be approximated by a Gamma distribution. The truncation at 0.001 poses one potential complication: for models with many forecasts assigning zero probability to the outcome, the truncation leads to an artificial density at that specific surprisal (−log(0.001) ≅ 6.9). However, the Bayesian models treat these observations as uncertain and via Markov Chain Monte Carlo sampling they are distributed across a wider range of scores.
After transformation, we used generalized linear Bayesian regression models to assess the effects of multiple variables (e.g. 1 ) on the Gamma-distributed surprisal values (using the identity link to measure effects on the logarithmic score scale): where is the dispersion, > and 1 are regression coefficients. To be consistent with the rest of the manuscript, we reported the estimated effects from the regressions on the logarithmic score scale rather than the surprisal scale.
We fitted regression models with Stan (http://mc-stan.org/) using the stan_glm function in the rstanarm package (http://mc-stan.org/rstanarm/). For each regression, we ran four chains with a burn-in of 500 samples, then collected an additional 1,000 samples and thinned by two to attain 2,000 samples across the four chains. All models were checked for convergence and autocorrelation. To compare regression models we used leave-one-out cross validation with Pareto-smoothed importance sampling to estimate the expected log pointwise predictive density (ELPD) (6).
To assess the relationship between scores and target, location, or season characteristics, we fitted a series of regression models to identify extrinsic factors associated with score variability including: location, target, location-target specific entropy (described below), season, forecast week (week of season), a testing/training season indicator, peak incidence (normalized by location), season incidence (normalized by location), and peak week (centered by location). We calculated target-location entropy (∑ log , where is the bin-specific frequency) as the entropy of each target relative to the target-specific bins over all seasons in the training and testing datasets for each location separately (excluding two seasons in Iquitos with no clear peaks: First, we fitted a base regression model with covariates for target, forecast week, location and a location-season interaction term to capture variation by season (Table S3). We found that scores varied significantly across at least some components of all of these factors. We then removed the target variable and added two target-related covariates -target-location specific entropy and the number of bins for the probabilistic forecast -finding that the number of bins was associated with the differences in scores by target. We compared the ELPD for this model (-18,182) to the base regression model (-18,184) and found a negligible difference (2.4, standard error (SE): 3.5). We then removed the entropy variable and the location-season interaction terms and added four covariates potentially associated with season-specific scores: an indicator for training versus testing, centered peak week, normalized peak incidence, and normalized season incidence. Scores varied by centered peak week and normalized peak incidence with a slightly lower ELPD (-18,237, difference: -55, SE: 13). We then removed the training season and normalized season incidence variables, keeping only forecast week, location, number of forecast bins, normalized peak incidence, and centered peak week. Compared to the base model, this model had a lower ELPD (-18,234, difference: -50, SE: 15) with a reduced set of variables that could account for key differences in scores. We did not assess additional variables that could explain the betweenlocation difference because with only two locations there would be no resolution between alternative binary effectors.
Next, we used the final model described above to assess differences between forecasting approaches, using variables for mechanistic (if the forecast included at least one mechanistic component), climate (if the forecast used at least one climate variable), and ensemble (if the forecast was based on more than one model) (see Table S1 and above for more model-specific details). Additional important distinctions (e.g. modeling a vector population or using serotype data) were not assessed because not enough forecasts incorporated those approaches to enable comparison. For example, only two forecasts used serotype data and each used it in a different way so any comparison would be confounded by many other possible distinctions (e.g. which climate data were used).
Finally, we compared calibration (see Fig. S4 for the specific metric) by forecasting approach with the same three characteristics (mechanistic, climate, and ensemble) while controlling for target and location (Table S4). Fig. S1. Peak week and peak incidence forecasts at weeks 0, 12, 24, and 36 for all testing seasons. The solid black lines indicate the most recent dengue data that were available to teams to inform these forecasts and the dashed line indicates the data that became available later in the season. The colored points represent point estimates for each team while the bars represent 50% and 95% prediction intervals (dark and light, respectively). The colors match the legend in Fig. 2 and the labels in Fig. S2. Corresponding forecasts for seasonal incidence are shown in Fig. S2.  Fig. S2. Total seasonal incidence forecasts at weeks 0, 12, 24, and 36 for all testing seasons. The solid black lines indicate the number of cases reported up to week 12 or week 24 and the dashed lines indicate the total at the end of the season. Grey horizontal lines indicate the bins for each forecast. Points are the point estimates and shading represents the 50% and 95% prediction intervals (dark and light, respectively). Open bars at the top indicate that the 95% interval includes the highest bin: greater than 1,000 or 10,000 cases, for Iquitos or San Juan, respectively.   S4. Calibration compared to logarithmic score by team and target. Calibration was calculated as described in Fig. S3. Better calibration (lower) was general associated with better logarithmic scores (higher) for all target-location pairs (Iquitos-Peak week R 2 : 0.55, Iquitos-Peak incidence R 2 : 0.91, Iquitos-Season incidence R 2 : 0.95, San Juan-Peak week R 2 : 0.59, San Juan-Peak incidence R 2 : 0.79, San Juan-Season incidence R 2 : 0.78). Despite lower logarithmic scores, peak week calibration is generally better because the greater number of bins (52 vs. 11) leads to more low or no probability predictions for outcomes that did not happen, which are therefore well calibrated.