Research Article

Accurate estimation of influenza epidemics using Google search data via ARGO

Shihao Yang, Mauricio Santillana, and S. C. Kou
PNAS November 24, 2015 112 (47) 14473-14478; first published November 9, 2015; https://doi.org/10.1073/pnas.1515373112
Shihao Yang
(a) Department of Statistics, Harvard University, Cambridge, MA 02138;
Mauricio Santillana
(b) School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138;
(c) Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02115
  • For correspondence: kou@stat.harvard.edu msantill@fas.harvard.edu
S. C. Kou
(a) Department of Statistics, Harvard University, Cambridge, MA 02138;
  • For correspondence: kou@stat.harvard.edu msantill@fas.harvard.edu
Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved September 30, 2015 (received for review August 6, 2015)


Significance

Big data generated from the Internet have great potential in tracking and predicting massive social activities. In this article, we focus on tracking influenza epidemics. We propose a model that utilizes publicly available Google search data to estimate current influenza-like illness activity level. Our model outperforms all available Google-search–based real-time tracking models for influenza epidemics at the national level of the United States, including Google Flu Trends. Our model is flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for estimation and prediction at multiple temporal and spatial resolutions for other social events.

Abstract

Accurate real-time tracking of influenza outbreaks helps public health officials make timely and meaningful decisions that could save lives. We propose an influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available Google-search–based tracking models, including the latest version of Google Flu Trends, even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics but also captures changes in people’s online search behavior over time. ARGO is also flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions.

  • digital disease detection
  • seasonal influenza
  • big data
  • influenza-like illnesses activity real-time estimation
  • autoregressive exogenous model

Big data sets are constantly generated nowadays as the activities of millions of users are collected from Internet-based services. Numerous studies have suggested great potential of these big data sets to detect/manage epidemic outbreaks [influenza (1–6), Ebola (7), dengue (8)], predict changes in stock prices (9, 10) and housing prices (11), etc. In 2009, Google Flu Trends (GFT), a digital disease detection system that uses the volume of selected Google search terms to estimate current influenza-like illnesses (ILI) activity, was identified by many as a good example of how big data would transform traditional statistical predictive analysis (12). However, significant discrepancies between GFT’s flu estimates and those measured by the Centers for Disease Control and Prevention (CDC) in subsequent years led to considerable doubt about the value of digital disease detection systems (13). Although multiple articles have identified methodological flaws in GFT’s original algorithm (14–16) and have led to incremental improvements (14, 16) (see also googleresearch.blogspot.com/2014/10/google-flu-trends-gets-brand-new-engine.html), a statistical framework that is theoretically sound and capable of accurate estimation is still lacking. Here we present such a framework that culminates in a method that outperforms all existing methodologies for tracking influenza activity using Internet search data.

Influenza outbreaks cause up to 500,000 deaths a year worldwide, and an estimated 3,000–50,000 deaths a year in the United States (17). Our ability to effectively prepare for and respond to these outbreaks heavily relies on the availability of accurate real-time estimates of their activity. Existing methods to predict the timing, duration, and magnitude of flu outbreaks remain limited (18). Well-established clinical methods to track flu activity, such as the CDC’s ILINet, report the percentage of patients seeking medical attention with ILI symptoms (www.cdc.gov/flu/). Although CDC’s %ILI is only a proxy of the flu activity in the population, it can help officials allocate resources in preparation for potential surges of patient visits to hospital facilities. See refs. 19–21 for further discussion.

CDC’s ILI reports have a delay of 1–3 wk due to the time for processing and aggregating clinical information. This time lag is far from optimal for decision-making purposes. To alleviate this information gap, multiple methods combining climate, demographic, and epidemiological data with mathematical models have been proposed for real-time estimation of flu activity (18, 21–25). In recent years, methods that harness Internet-based information have also been proposed, such as Google (1), Yahoo (2), and Baidu (3) Internet searches, Twitter posts (4), Wikipedia article views (5), clinicians’ queries (6), and crowdsourced self-reporting mobile apps such as Influenzanet (Europe) (26), Flutracking (Australia) (27), and Flu Near You (United States) (28). Among them, GFT has received the most attention and has inspired subsequent digital disease detection systems (3, 8, 29–32). Interestingly, Google has never made their raw data public, thus making it impossible to reproduce the exact results of GFT.

We highlight three limitations of the original GFT algorithm, previously identified in refs. 15 and 16. First, it was shown that a static approach, which does not take advantage of newly available CDC’s ILI activity reports as the flu season evolves, produced model drift, leading to inaccurate estimates. Second, the idea of aggregating the multiple query terms (the independent variables in the GFT model) into a single variable did not allow for changes in people’s Internet search behavior over time (and thus changes in query terms’ abilities to track flu) to be appropriately captured. Third, GFT ignored the intrinsic time series properties, such as seasonality of the historical ILI activity, thus overlooking potentially crucial information that could help produce accurate real-time ILI activity estimates.

Our Contribution

The methodology presented here produces robust and highly accurate ILI activity level estimates by addressing the three aforementioned shortcomings of the multiple GFT engines. In addition, we provide a theoretical framework that, for the first time to our knowledge, justifies the prevailing use of linear models in the digital disease detection literature by incorporating causality arguments through a hidden Markov model. This theoretical framework contains, as a special case, the model developed in ref. 16. Our model not only achieves the goal of (i) dynamically incorporating new information from CDC reports as it becomes available and (ii) automatically selecting the most useful Google search queries for estimation as in ref. 16, but also largely improves estimation by (iii) including the long-term cyclic information (seasonality) from past flu seasons on record as input variables and (iv) using a 2-y moving window (which immediately precedes the desired date of estimation) for the training period to capture the most recent changes in people’s search patterns and time series behavior (33). Our methodology efficiently builds a prediction model from individual search frequency as well as the past records of ILI activity. It uses both sources of information more efficiently than simply combining GFT with autoregressive terms as suggested in ref. 15, because GFT is not optimally aggregated to provide additional information on top of time series information. Furthermore, we provide a quantitative efficiency metric that measures the statistical significance of the improvement of our methodology over other alternatives. For example, our method is twice as accurate as the method that combines GFT with autoregressive terms. Finally, even though we use as input only the publicly available, low-quality data from the Google Correlate and Google Trends websites, our method significantly improves on the latest version of GFT.

We name our model ARGO, which stands for AutoRegression with GOogle search data. Statistically speaking, ARGO is an autoregressive model with Google search queries as exogenous variables; ARGO also employs L1 (and potentially L2) regularization to achieve automatic selection of the most relevant information.

Results

Retrospective estimates of influenza activity (ILI activity level, as reported by the CDC) were produced using our model, ARGO, for the time period of March 29, 2009 through July 11, 2015, assuming we had access only to the historical CDC’s ILI reports up to the previous week of estimation. We compared ARGO’s estimates with the ground truth: the CDC-reported weighted ILI activity level, published typically with 1- or 2-wk delay, by calculating a collection of accuracy metrics described in Materials and Methods. These metrics include the root-mean-squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), correlation with estimation target, and correlation of increment with estimation target. For comparison, we calculated these accuracy metrics for (i) GFT estimates (accessed on July 11, 2015), (ii) estimates produced using the method of Santillana et al. (6, 16), (iii) estimates produced by combining GFT with a lag-3 autoregressive model, AR(3), as suggested in ref. 15, (iv) estimates produced with an AR(3) autoregressive model (4, 15), and (v) a naive method that simply uses the value of the prior week’s CDC ILI activity level as the estimate for the current one. For fair comparison, all benchmark models (ii–iv) are dynamically trained with a 2-y moving window.
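The naive benchmark (v) is simple enough to state in code. The following is a minimal sketch, not the authors' implementation; the function name and the weekly values are made up for illustration.

```python
def naive_estimates(ili):
    """Estimate each week's ILI level by the previous week's reported value.

    `ili` is a list of weekly ILI activity levels in time order. The first
    week has no predecessor, so the estimates start at the second week.
    """
    return ili[:-1]


# Made-up weekly %ILI values, for illustration only.
ili = [1.2, 1.5, 2.1, 3.0, 2.6]
estimates = naive_estimates(ili)   # estimates for weeks 2..5
targets = ili[1:]                  # reported values for weeks 2..5
errors = [e - t for e, t in zip(estimates, targets)]
```

By construction this benchmark always trails the reported series by one week, which is why it serves as a floor for the other methods.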

Table 1 summarizes these accuracy metrics for all estimation methods for multiple time periods. The “Whole period” column shows that ARGO’s estimates outperform all other alternatives, in every accuracy metric for the whole time period. The other columns of Table 1 show the performance of all of the methods for the 2009 off-season H1N1 flu outbreak, and each regular flu season since 2010. Fig. 1 displays the estimates against the observed CDC-reported ILI activity level.

Table 1.

Comparison of different models for the estimation of influenza epidemics

Fig. 1.

Estimation results. (Top) The estimated ILI activity level from ARGO (thick red), contrasting with the true CDC’s ILI activity level (thick black) as well as the estimates from GFT (green), method of ref. 16 (blue), GFT plus AR(3) model (dark yellow), and AR(3) model (dashed gray). The two background shades, white and yellow, reflect two data sources, Google Correlate and Google Trends, respectively. The dash-dotted purple vertical line separates Google Correlate data with search terms identified on March 28, 2009 and May 22, 2010. (Middle) The estimation error, defined as estimated value minus the CDC’s ILI activity level. (Bottom) Zoomed-in plots for estimation results in different study periods. (A) The H1N1 flu outbreak period. (B) The 2012–2013 regular flu season. (C) The 2014–2015 regular flu season. A regular flu season is defined as week 40 of one year to week 20 of the following year.

Close inspection shows that, in the post-2009 regular flu seasons, ARGO uniformly outperformed all other alternative estimation methods in terms of RMSE, MAE, MAPE, and correlation. ARGO avoids the notorious overshooting problem of GFT, as seen in Fig. 1. During the 2009 off-season H1N1 flu outbreak, ARGO had the smallest MAPE. In terms of RMSE and MAE, ARGO (relative RMSE = 0.640, relative MAE = 0.584) had the second best performance, underperforming slightly only the GFT+AR(3) model (relative RMSE = 0.580, relative MAE = 0.570). In terms of correlation, ARGO (r = 98.5%) had similar performance to the (potentially in-sample data of) GFT (r = 98.9%) (14) and GFT+AR(3) models (r = 98.6%) and outperformed all of the other alternatives.

To assess the statistical significance of the improved prediction power of ARGO, we constructed a 95% confidence interval for the relative efficiency of ARGO compared with other benchmark methods. The relative efficiency of method 1 to method 2 is the ratio of the true mean-squared error of method 2 to that of method 1 (34), which can be estimated by its observed value (see Eq. 4); its confidence interval can be constructed by stationary bootstrap of the error residual time series (35). Table 2 shows that ARGO is estimated to be at least twice as efficient as any other alternative, and the improvement in accuracy is highly statistically significant.

Table 2.

Estimate of relative efficiency of ARGO compared with other models with 95% confidence interval (CI)

It is well known that CDC reports undergo revisions, weeks after their initial publication, that respond to internal consistency checks and lead to more accurate estimates of patients with ILI symptoms seeking medical attention. Thus, the available historical CDC information, in a given week, is not necessarily as accurate as it will be. We tested the effect of using (potentially inaccurate) unrevised information by obtaining the historical unrevised and revised reports, and the dates when the reports were revised, from the CDC website for the time period of our study. We used only the information that would have been available to us, at the time of estimation, and produced a time series of estimates for the whole time period described before. We compared our estimates to all other methods and found that ARGO still outperformed them all. Moreover, the values of all five accuracy metrics for ARGO essentially did not change, suggesting a desirable robustness to revisions in CDC’s ILI activity reports. The results are shown in Table S1.

Table S1.

Comparison of different models for the estimation of influenza epidemics, with weekly CDC’s ILI activity level that excludes forward-looking information from ILI activity report revision

We faced an additional challenge in producing real-time estimates for the latest portion of the 2014–2015 flu season. At the time of writing this article, the only data available to us for the week of March 28, 2015 and later came from the Google Trends website. The information from Google Trends has even lower quality than that from Google Correlate and changes every week. These undesired changes affected the quality of our estimates. To assess the stability of ARGO in the presence of these variations in the data, we obtained the search frequencies of the same query terms from the Google Trends website on 25 different days during the month of April 2015 and produced a set of 25 historical estimates using ARGO. The accuracy metrics associated with these estimates are shown in Table S2. This table shows that, despite the observed variation in the Google Trends data, ARGO is threefold more stable than the method of ref. 16, and still outperforms, on average, any other method.

Table S2.

Mean and SD of accuracy metrics when using Google Trends data accessed at different dates

Discussion

Strength of ARGO.

The results presented here demonstrate the superiority of our approach in terms of both accuracy and robustness, compared with all existing flu tracking models based on Google searches. The value of these results is even higher given the fact that they were produced with low-quality input variables. It is highly likely that our methodology would lead to even more accurate results if we were given access to the input variables that Google uses to calculate their estimates.

The combination of seasonal flu information with dynamic reweighting of search information appears to be a key factor in the enhanced accuracy of ARGO. The level of ILI activity last week typically has a significant effect on the current level of ILI activity, and ILI activity half a year ago and/or 1 y ago could provide further information, as shown in Fig. S1, which reflects a strong temporal autocorrelation. The integration of time series information leads to a smooth and continuous estimation curve and prevents undesired spikes. However, simply adding GFT to an autoregressive model is suboptimal compared with ARGO, because simply treating GFT as an individual variable does not allow adjustment for time series information at the resolution of individual query terms, and many terms included in GFT may no longer provide extra information once time series information is incorporated. In fact, once the time series information is included, fewer Google search query terms remain significant. For example, among 100 Google Correlate query terms, ARGO selected 14 terms, on average, each week, whereas the method of ref. 16 and GFT (1) selected 38 and 45 terms, respectively, each week on average. The combination of ARGO’s smoothness and sparsity leads to a substantial reduction in the estimation error, as observed in Tables 1 and 2, where ARGO shows improved performance in all evaluation metrics over the whole time period and is twice as efficient as GFT+AR(3).

Fig. S1.

Dynamic coefficients for ARGO. Red color represents positive coefficients, blue color represents negative coefficients, white color represents zero, and gray color represents missing values. Missing values can be the result of (i) query terms not identified by Google Correlate and (ii) Google Trends data not available for particular query terms. The black horizontal dashed line separates Google search queries from autoregressive lags. The yellow vertical dashed line separates coefficients trained on Google Correlate data from those trained on Google Trends data, and the green vertical dashed line separates query terms identified on March 28, 2009 from those identified on May 22, 2010.

Our methodology allows us to transparently understand how Google search information and historical flu information complement one another. Time series models tend to be slow in response to sudden observed changes in CDC’s ILI activity level. The AR(3) model shows this “delaying” effect, despite its seemingly good correlation. Google searches, on the other hand, are better at detecting sudden ILI activity changes, but are also very sensitive to the public’s overreaction.

To investigate further the responsiveness (comovement) of ARGO toward the change in ILI activity, we calculated the correlation of increment between each estimation model and CDC’s ILI activity level. The correlation of increment between two time series $a_t$ and $b_t$ is defined as $\mathrm{Corr}(a_t - a_{t-1},\, b_t - b_{t-1})$, which measures how well $a_t$ captures the changes in $b_t$. Table 1 shows that ARGO has similar capability to that of GFT and the method of ref. 16 in capturing the changes in ILI level, and outperforms the time series model AR(3) uniformly.
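The correlation of increment can be computed directly from its definition. The sketch below uses only the Python standard library; the function names and the two short series are illustrative assumptions, with the second series built as a one-week-delayed copy of the first to mimic the "delaying" effect discussed above.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def increments(series):
    """Week-over-week differences: series[t] - series[t-1]."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def corr_of_increment(a, b):
    """Corr(a_t - a_{t-1}, b_t - b_{t-1})."""
    return pearson(increments(a), increments(b))


# Made-up series: `delayed` repeats `target` one week late.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0]
delayed = [1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0]

level_corr = pearson(target, delayed)
increment_corr = corr_of_increment(target, delayed)
```

Comparing `level_corr` with `increment_corr` separates agreement in overall level from agreement in week-to-week movement, which is the distinction the metric is meant to expose.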

Time series information (seasonality) tends to pull ARGO’s estimate toward the historical level. This was evident at the onset of the off-season H1N1 flu outbreak (week ending May 2, 2009), which resulted in ARGO’s underestimation. ARGO self-corrected its performance the following week by shifting a portion of model weights from the time series domain to the Google searches domain. Conversely, at the height of the 2012–2013 season, ARGO, GFT, and the method of ref. 16 all missed the peak due to an unprecedented surge of search activity. ARGO achieved the fastest self-correction by redistributing the weights not only across Google terms but also across time series terms, missing the peak by only 1 wk, as opposed to 2 wk for ref. 16 and about 4 wk for GFT. It is important to note that although we have used CDC’s ILI as our gold standard for influenza activity in the US population, and data from Google Correlate/Trends as our independent variables, our methodology can be immediately adapted to any other suitable ILI gold standard and/or set of independent variables.

Limitations and Next Steps.

Although ARGO displays a clear superiority over previous methods, it is not fail-proof. Because it relies on the public’s search behavior, any abrupt changes to the inner workings of the search engine or any changes in the way health-related search information is displayed to users will affect the accuracy of our methodology (36, 37). We expect that ARGO will be fast at correcting itself if any such change takes place in the future. As in any predictive method, the quality of past performance does not guarantee the quality of future performance. In this article, we fixed the search query terms after 2010 so as to directly compare our results with GFT, which has kept the same query terms since 2010; future application of ARGO may update search terms more frequently. ARGO can be easily generalized to any temporal and spatial scales for a variety of diseases or social events amenable to tracking through Internet searches or services (3, 4, 8, 9, 29, 30, 38, 39). Further improvements in influenza prediction may come from combining multiple predictors constructed from disparate data sources (40). After the initial submission of this article in May 2015, Google announced that GFT would be discontinued and that their raw data would be made accessible to selected scientific teams. This announcement happened soon after the GFT team published a manuscript that proposed a new time series-based method for the (now discontinued) GFT engine (41). This new development makes our contribution timely and useful in providing a transparent method for disease tracking in the future.

Materials and Methods

All data used in this article are publicly available. Therefore, IRB approval is not needed.

Google Data.

To avoid forward-looking information in our out-of-sample predictions, and to make the search term selection in our approach consistent with the main revision to GFT (14) immediately after the H1N1 pandemic, we obtained the highest-correlated terms to the CDC’s ILI using Google Correlate (www.google.com/trends/correlate) for two different time periods. For the first time period (pre-H1N1 period), we inserted only CDC’s ILI data from January 2004 to March 28, 2009 into Google Correlate, and used the resulting most highly correlated search terms as independent variables for our out-of-sample predictions for the time period April 4, 2009 through May 22, 2010. For the second time period (post-H1N1), we inserted only CDC’s ILI data from January 2004 to May 22, 2010 into Google Correlate to select new search terms, as done in ref. 14. These last search terms were used as independent variables for all subsequent predictions presented in this work. Tables S3 and S4 show all query terms identified. For the pre-H1N1 period (the first time period), the terms from Google Correlate include spurious (or overfitted) terms like “march vacation” or “basketball standings,” as discussed in ref. 15. However, Fig. S1 shows that these spurious terms were often not selected by ARGO, i.e., ARGO would give them zero weights, demonstrating its robustness. For the post-H1N1 time period, the updated query terms from Google Correlate include mostly flu-related terms (see Table S4). This suggests that spurious terms were “filtered out” by including off-season flu data. For the time period of March 28, 2015 up to the date of submission of this article, we acquired search frequencies for this set of query terms from Google Trends (www.google.com/trends; date of access: July 11, 2015) as Google Correlate only provides data up to March 28, 2015 at the time of writing this article.

Table S3.

All search phrases identified by Google Correlate using data as of March 28, 2009

Table S4.

All search phrases identified by Google Correlate using data as of May 22, 2010

Google Correlate standardizes the search volume of each query to have mean zero and SD 1 across time and contains data only from 2004 to March 2015. To make Google Correlate data compatible with Google Trends data, we linearly transformed the Google Correlate data to the same scale of 0–100 in our analysis. We used Google Correlate data up to its last available date, and then switched to Google Trends data afterward. This is indicated in Fig. 1 by different shades of the background. We used the latest version of GFT (fourth version, revised in October 2014) weekly estimates of ILI activity level as one of our comparison methods. GFT is available at www.google.org/flutrends/about (date of access: July 11, 2015).
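The text does not spell out which linear map was used to bring the standardized Google Correlate series onto the 0–100 scale. The sketch below assumes a min-max rescaling (minimum to 0, maximum to 100), which is one linear transformation with that range; the function name and sample values are hypothetical.

```python
def rescale_to_0_100(series):
    """Linearly map a series so its minimum is 0 and its maximum is 100.

    Assumption: min-max rescaling. The source only states that a linear
    transformation to the 0-100 scale was applied.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [100.0 * ((v - lo) / (hi - lo)) for v in series]


# Made-up standardized search volumes (mean 0, SD 1 style values).
standardized = [-1.2, -0.4, 0.1, 2.3, 0.6]
scaled = rescale_to_0_100(standardized)
```

Because the map is linear and increasing, the ordering and relative spacing of weeks are preserved; only the units change.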

CDC’s Data.

We use the weighted version of CDC’s ILI activity level as the estimation target (available at gis.cdc.gov/grasp/fluview/fluportaldashboard.html; date of access: July 11, 2015). The weekly revisions of CDC’s ILI are available at the CDC website for all recorded seasons (from week 40 of a given year to week 20 of the subsequent year). For example, ILI report revision at week 50 of season 2012–2013 is available at www.cdc.gov/flu/weekly/weeklyarchives2012-2013/data/senAllregt50.htm; ILI report revision at week 9 of season 2014–2015 is available at www.cdc.gov/flu/weekly/weeklyarchives2014-2015/data/senAllregt09.html.

Formulation of Our Model.

Our model ARGO is motivated by a hidden Markov model. The logit-transformed CDC-reported ILI activity level $\{y_t\}$ is the intrinsic time series of interest. We impose an autoregressive model with lag $N$ on it, which implies that the collection of vectors $\{y_{(t-N+1):t}\}_{t \ge N}$ is a Markov chain (this captures the clinical fact that flu lasts for a period, but not indefinitely). The vector of log-transformed normalized volume of Google search queries at time $t$, $X_t$, depends only on the ILI activity at the same time, $y_t$ (this follows the intuition that flu occurrence causes people to search flu-related information online). The Markovian property on block $y_{(t-N+1):t}$ leads to the (vector) hidden Markov model structure

\[
\begin{array}{ccccccc}
y_{1:N} & \rightarrow & y_{2:(N+1)} & \rightarrow & \cdots & \rightarrow & y_{(T-N+1):T} \\
\downarrow & & \downarrow & & & & \downarrow \\
X_N & & X_{N+1} & & & & X_T
\end{array}
\tag{1}
\]

Our formal mathematical assumptions are

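  • (assumption 1) $y_t = \mu_y + \sum_{j=1}^{N} \alpha_j y_{t-j} + \epsilon_t$, $\epsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$

  • (assumption 2) $X_t \mid y_t \sim \mathcal{N}_K(\mu_x + y_t \beta,\, Q)$

  • (assumption 3) conditional on $y_t$, $X_t$ is independent of $\{y_l, X_l : l \ne t\}$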

where $\beta = (\beta_1, \beta_2, \ldots, \beta_K)^\top$, $\mu_x = (\mu_{x_1}, \mu_{x_2}, \ldots, \mu_{x_K})^\top$, and $Q$ is the covariance matrix. To make the variables more normal, we transform the original ILI activity level $p_t$ from $[0,1]$ to $\mathbb{R}$ using the logit function, obtaining $y_t$, and transform the Google search volumes from $[0,100]$ to $\mathbb{R}$ using the log function, obtaining $X_t$. The log function is appropriate because Google search frequencies usually have an exponential growth rate near peaks and are artificially scaled to $[0,100]$ by dividing by the running maximum. Because Google Trends is in integer scale from 0 to 100, we add a small number $\delta = 0.5$ before the transformation to avoid taking the log of 0. The predictive distribution $f(y_t \mid y_{1:(t-1)}, X_{1:t})$ is normal with mean linear in $y_{(t-N):(t-1)}$ and $X_t$ and constant variance (see Supporting Information). This observation leads to Eq. 2, which defines the ARGO model.
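The linearity of the predictive mean can be sketched from assumptions 1–3 alone. What follows is an outline under the additional assumption that $Q$ is invertible, not a reproduction of the argument in Supporting Information; the symbols $m_t$ and $v$ are introduced here only for illustration.

```latex
% Assumption 3 makes X_{1:(t-1)} uninformative about y_t once y_{1:(t-1)}
% is known, and makes X_t depend on the past only through y_t, so
\[
f(y_t \mid y_{1:(t-1)}, X_{1:t})
  \;\propto\; f(y_t \mid y_{(t-N):(t-1)}) \, f(X_t \mid y_t).
\]
% Write the autoregressive mean from assumption 1 as
\[
m_t = \mu_y + \sum_{j=1}^{N} \alpha_j y_{t-j}.
\]
% Both factors are Gaussian in y_t (assumptions 1 and 2). Completing the
% square in y_t gives a normal distribution with
\[
v = \left( \frac{1}{\sigma^2} + \beta^\top Q^{-1} \beta \right)^{-1},
\qquad
E\!\left[ y_t \mid y_{1:(t-1)}, X_{1:t} \right]
  = v \left[ \frac{m_t}{\sigma^2} + \beta^\top Q^{-1} (X_t - \mu_x) \right].
\]
```

The variance $v$ does not depend on $t$, and the mean is an affine function of the lagged values $y_{t-1}, \ldots, y_{t-N}$ and of $X_t$, which is the structure used in the regression form of the model.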

The ARGO Model.

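Let $y_t = \mathrm{logit}(p_t)$ be the logit-transformed CDC’s (weighted) ILI activity level $p_t$ at time $t$, and $X_{i,t}$ the log-transformed Google search frequency of term $i$ at time $t$. Our ARGO model is given by

\[
y_t = \mu_y + \sum_{j=1}^{N} \alpha_j y_{t-j} + \sum_{i=1}^{K} \beta_i X_{i,t} + \epsilon_t,
\qquad \epsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\tag{2}
\]

where $X_t$ can be thought of as the exogenous variables to time series $\{y_t\}$.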
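To make the regressors of Eq. 2 concrete, here is a minimal sketch of the two transformations and of one row of the design matrix. The helper names, the toy data, and the zero-based week index are assumptions made for illustration; the toy example uses three lags and two search terms rather than the values used in the article.

```python
import math

DELTA = 0.5  # offset from the text; avoids log(0) when a search volume is 0


def logit(p):
    """Map an ILI activity level in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def log_volume(x, delta=DELTA):
    """Log-transform a Google search volume on the 0-100 scale."""
    return math.log(x + delta)


def design_row(y, X, t, n_lags):
    """Regressors of Eq. 2 at week t: [y_{t-1}, ..., y_{t-N}, X_{1,t}, ..., X_{K,t}].

    `y` is the logit-transformed ILI series and `X[t]` holds the K
    log-transformed search volumes for week t.
    """
    lags = [y[t - j] for j in range(1, n_lags + 1)]
    return lags + list(X[t])


# Made-up data: 5 weeks, K = 2 search terms, N = 3 lags.
ili_level = [0.010, 0.012, 0.020, 0.031, 0.025]
search = [[3, 0], [5, 1], [12, 4], [30, 9], [22, 6]]

y = [logit(p) for p in ili_level]
X = [[log_volume(v) for v in week] for week in search]
row = design_row(y, X, t=4, n_lags=3)
```

The row for week `t` uses only ILI values from earlier weeks together with search volumes from week `t` itself, matching the real-time setting in which the current week's ILI value is not yet reported.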

Parameter Estimation of ARGO Model.

We chose $N = 52$ (weeks) to capture the within-year seasonality in ILI activity, and $K = 100$ (Google search terms) following the data availability from Google Correlate. Because we have more independent variables than the number of observations (152 coefficients against 104 weekly observations in the training window), the usual maximum likelihood estimate (ordinary least squares) method will fail. Therefore, we impose regularization for parameter estimation. In general we have three kinds of penalties, L1 penalty (42), L2 penalty (43), and a linear combination of L1 and L2 penalties (44). All parameters are dynamically trained every week with a 2-y (104-wk) rolling window.
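The rolling window can be expressed as simple index arithmetic. This is a sketch with hypothetical function names and a zero-based integer week index of my choosing; it only illustrates which weeks enter the fit for a given estimation week.

```python
def training_weeks(t, window=104):
    """Weeks used to fit the model that estimates week t: t-window, ..., t-1."""
    return range(t - window, t)


def first_week_needed(t, window=104, n_lags=52):
    """Earliest week whose ILI value enters the fit for week t.

    The oldest training week, t - window, needs its own n_lags lagged
    values as regressors, so the data must reach back window + n_lags weeks.
    """
    return t - window - n_lags


t = 300                      # hypothetical estimation week
weeks = training_weeks(t)    # weeks 196 through 299
earliest = first_week_needed(t)
```

Refitting on this window every week is what lets the coefficients follow recent changes in search behavior while discarding older patterns.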

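In a given week, the goal is to find parameters $\mu_y$, $\alpha = (\alpha_1, \ldots, \alpha_{52})$, and $\beta = (\beta_1, \ldots, \beta_{100})$ that minimize

\[
\sum_t \Bigl( y_t - \mu_y - \sum_{j=1}^{52} \alpha_j y_{t-j} - \sum_{i=1}^{100} \beta_i X_{i,t} \Bigr)^2
+ \lambda_\alpha \|\alpha\|_1 + \eta_\alpha \|\alpha\|_2^2
+ \lambda_\beta \|\beta\|_1 + \eta_\beta \|\beta\|_2^2
\tag{3}
\]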

where λα,λβ,ηα, and ηβ are hyperparameters. Ideally, we would like to use cross-validation to select all four hyperparameters. However, because we have only 104 training data points at a given week due to the 2-y moving window, the cross-validation result is highly noisy. Thus, we need to prespecify some of the hyperparameters. For model simplicity and sparsity, combining with the evidence seen from cross-validation, we set ηα=ηβ=0, leading to L1 penalization on both autoregressive and Google search terms. With the remaining λα and λβ, the cross-validation results still have considerable variance. By the same sparsity and simplicity consideration, we further constrained λα=λβ. Therefore, the ARGO model we finally propose is Eq. 3 with constraint ηα=ηβ=0 and λα=λβ. A detailed discussion of our specification of the hyperparameters is provided in Supporting Information (see Table S5).
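As an illustration of the final specification (Eq. 3 with both L2 terms set to zero and a single shared L1 weight), the sketch below fits an L1-penalized least-squares problem by cyclic coordinate descent on a rolling window. It is a pure-Python toy, not the authors' R package: the function names, the number of sweeps, the fixed value of `lam`, and the data are all assumptions for the example, and in the paper the penalty weight is chosen by cross-validation rather than fixed. Library implementations often scale the loss differently, so their penalty weights are not directly comparable to the one used here.

```python
def soft_threshold(rho, thresh):
    """Soft-thresholding operator used by the L1 coordinate update."""
    if rho > thresh:
        return rho - thresh
    if rho < -thresh:
        return rho + thresh
    return 0.0


def lasso_fit(rows, targets, lam, n_sweeps=500):
    """Minimize sum_t (y_t - mu - x_t . w)^2 + lam * ||w||_1.

    Plain cyclic coordinate descent; the intercept mu is not penalized.
    """
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    mu = sum(targets) / n
    resid = [y - mu for y in targets]            # current residual y - mu - X w
    col_sq = [sum(r[k] ** 2 for r in rows) for k in range(p)]
    for _ in range(n_sweeps):
        shift = sum(resid) / n                   # exact update of the intercept
        mu += shift
        resid = [r - shift for r in resid]
        for k in range(p):
            if col_sq[k] == 0.0:
                continue
            # Inner product of column k with the residual that excludes w[k].
            rho = sum(rows[t][k] * resid[t] for t in range(n)) + col_sq[k] * w[k]
            new_w = soft_threshold(rho, lam / 2.0) / col_sq[k]
            delta = new_w - w[k]
            if delta != 0.0:
                for t in range(n):
                    resid[t] -= rows[t][k] * delta
                w[k] = new_w
    return mu, w


def rolling_fit(rows, targets, week, lam, window=104):
    """Fit on the `window` weeks immediately preceding index `week`."""
    lo = max(0, week - window)
    return lasso_fit(rows[lo:week], targets[lo:week], lam)


# Toy data (made up): the target depends on the first column only.
rows = [[0.0, 1.0], [1.0, -1.0], [2.0, 1.0], [3.0, -1.0], [4.0, 1.0], [5.0, -1.0]]
targets = [1.0 + 2.0 * r[0] for r in rows]
mu_hat, w_hat = rolling_fit(rows, targets, week=6, lam=0.5, window=6)
```

In the toy data the second column carries no signal, so the L1 penalty is expected to drive its coefficient to zero while only slightly shrinking the first; this is the sparsity argument made in the text.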

Table S5.

Comparison of different specifications of hyperparameters for in-sample study period

Accuracy Metrics.

The RMSE, MAE, and MAPE of estimator $\hat p$ to the target ILI activity level $p$ are defined, respectively, as $\mathrm{RMSE}(\hat p_t,p_t)=\big[(1/n)\sum_{t=1}^{n}(\hat p_t-p_t)^2\big]^{1/2}$, $\mathrm{MAE}(\hat p_t,p_t)=(1/n)\sum_{t=1}^{n}|\hat p_t-p_t|$, and $\mathrm{MAPE}(\hat p_t,p_t)=(1/n)\sum_{t=1}^{n}|\hat p_t-p_t|/p_t$. The correlation of estimator $\hat p$ to the target ILI activity level $p$ is their sample correlation coefficient. The correlation of increment between $\hat p_t$ and $p_t$ is defined as

$$\text{Corr. of increment}(\hat p_t,p_t)=\operatorname{Corr}(\hat p_t-\hat p_{t-1},\;p_t-p_{t-1}).$$
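The metrics defined above translate directly into code. The sketch below is a plain-Python rendering of those definitions; the function names and the sample series are chosen for the example and do not come from the paper.

```python
import math


def rmse(est, target):
    """Root mean squared error."""
    n = len(est)
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(est, target)) / n)


def mae(est, target):
    """Mean absolute error."""
    return sum(abs(e - p) for e, p in zip(est, target)) / len(est)


def mape(est, target):
    """Mean absolute percentage error (as a fraction); targets must be nonzero."""
    return sum(abs(e - p) / p for e, p in zip(est, target)) / len(est)


def correlation(a, b):
    """Sample (Pearson) correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def correlation_of_increment(est, target):
    """Correlation between week-to-week changes of the two series."""
    d_est = [est[t] - est[t - 1] for t in range(1, len(est))]
    d_tgt = [target[t] - target[t - 1] for t in range(1, len(target))]
    return correlation(d_est, d_tgt)


# Made-up ILI levels (as proportions), for illustration only.
p_hat = [0.010, 0.022, 0.030, 0.041]
p = [0.011, 0.020, 0.033, 0.040]
summary = (rmse(p_hat, p), mae(p_hat, p), mape(p_hat, p),
           correlation(p_hat, p), correlation_of_increment(p_hat, p))
```

The correlation of increment rewards an estimator for getting the direction and size of week-to-week changes right, which the plain correlation can mask when both series share a strong seasonal trend.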

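The relative efficiency of estimator $\hat p^{(1)}$ to estimator $\hat p^{(2)}$ is $e(\hat p^{(1)},\hat p^{(2)})=\mathrm{MSE}_{\mathrm{true}}^{(2)}/\mathrm{MSE}_{\mathrm{true}}^{(1)}$, where $\mathrm{MSE}_{\mathrm{true}}^{(i)}=E[(\hat p^{(i)}-p)^2]$, which can be estimated by

$$\hat e(\hat p^{(1)},\hat p^{(2)})=\frac{\mathrm{MSE}_{\mathrm{obs}}^{(2)}}{\mathrm{MSE}_{\mathrm{obs}}^{(1)}},\qquad\text{where}\quad \mathrm{MSE}_{\mathrm{obs}}^{(i)}=\frac{1}{n}\sum_{t=1}^{n}\big(\hat p_t^{(i)}-p_t\big)^2.\tag{4}$$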

The 95% confidence interval can be constructed by the time series stationary bootstrap method (35), where the replicated time series of the error residual is generated using geometrically distributed random blocks with mean length 52 (which corresponds to 1 y). We obtain the basic bootstrap confidence interval for log{e(p^(1),p^(2))} and then recover the original scale by exponentiation. The nonparametric bootstrap confidence interval takes the autocorrelation and cross-correlation of the errors into account, and is insensitive to the mean block length.
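The bootstrap procedure described above might be sketched as follows. Several details are assumptions made for the illustration and are not spelled out in the text: the two error series are resampled jointly with the same indices (so that their cross-correlation is preserved), blocks wrap around the end of the series, quantiles are taken by simple order statistics, and the function names, the number of replicates, and the seed are arbitrary.

```python
import math
import random


def stationary_bootstrap_indices(n, mean_block, rng):
    """Index sequence with geometrically distributed block lengths (mean mean_block)."""
    p_new = 1.0 / mean_block
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p_new:
            idx.append(rng.randrange(n))       # start a new block at a random week
        else:
            idx.append((idx[-1] + 1) % n)      # continue the block, wrapping around
    return idx


def log_relative_efficiency(err1, err2):
    """log(MSE_obs(2) / MSE_obs(1)) for two error series of equal length."""
    mse1 = sum(e * e for e in err1) / len(err1)
    mse2 = sum(e * e for e in err2) / len(err2)
    return math.log(mse2 / mse1)


def efficiency_ci(err1, err2, mean_block=52, n_boot=1000, level=0.95, seed=0):
    """Point estimate and basic bootstrap interval for the relative efficiency."""
    rng = random.Random(seed)
    n = len(err1)
    theta = log_relative_efficiency(err1, err2)
    reps = []
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        reps.append(log_relative_efficiency([err1[i] for i in idx],
                                            [err2[i] for i in idx]))
    reps.sort()
    a = (1.0 - level) / 2.0
    q_lo = reps[int(math.floor(a * (n_boot - 1)))]
    q_hi = reps[int(math.ceil((1.0 - a) * (n_boot - 1)))]
    # Basic interval on the log scale reflects the quantiles around theta,
    # then exponentiation recovers the original scale.
    return math.exp(theta), (math.exp(2 * theta - q_hi), math.exp(2 * theta - q_lo))


# Made-up error series: the second estimator's errors are twice as large.
err_a = [0.1 + 0.01 * (t % 7) for t in range(120)]
err_b = [2.0 * e for e in err_a]
estimate, (lower, upper) = efficiency_ci(err_a, err_b, n_boot=200)
```

Working on the log scale keeps the interval for the ratio positive after exponentiation, which is presumably why the text builds the interval for $\log e$ first.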

SI Materials and Methods

Details of our methodology are presented as follows. First, the predictive distribution in the formulation of the ARGO model and the corresponding assumptions are described; second, the statistical strategy to determine the hyperparameters of the ARGO model is explained; third, the results of two sensitivity analyses aimed at testing the robustness of the ARGO methodology are presented, (i) with respect to subsequent revisions of CDC’s ILI activity reports and (ii) with respect to observed variation of the input variables coming from Google Trends data; fourth, the exact search query terms identified by Google Correlate with different data access dates are presented; and fifth, a heat map showing the coefficients for the time series and Google search terms dynamically trained by ARGO is included.

The R package that implements the ARGO method is available at the authors' websites (www.people.fas.harvard.edu/∼skou/publication.htm).

SI Predictive Distribution in the Formulation of ARGO Model

To improve normality for both the input variables and the dependent variable, the CDC-reported ILI activity level was logit-transformed, and the linearly normalized volumes of Google search queries were log-transformed. To avoid taking the log of 0, we add a small number $\delta=0.5$ before the log transformation. These transformations led to two sets of variables: the intrinsic (influenza epidemic activity) time series of interest $\{y_t\}$ and the (Google search) variable vector $X_t$ at time $t$, whose distribution depends only on $y_t$. Our formal mathematical assumptions are

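  • (assumption 1) $y_t=\mu_y+\sum_{j=1}^{N}\alpha_j y_{t-j}+\epsilon_t,\quad \epsilon_t\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2)$

  • (assumption 2) $X_t\mid y_t\sim\mathcal{N}_K(\mu_x+y_t\beta,\,Q)$

  • (assumption 3) conditional on $y_t$, $X_t$ is independent of $\{y_l,X_l: l\neq t\}$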

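where $\beta=(\beta_1,\beta_2,\ldots,\beta_K)^\intercal$, $\mu_x=(\mu_{x_1},\mu_{x_2},\ldots,\mu_{x_K})^\intercal$, and $Q$ is the covariance matrix. The predictive distribution $f(y_{t+1}\mid y_{1:t},X_{1:(t+1)})$ is given by

$$y_{t+1}\mid y_{1:t},X_{1:(t+1)}\sim\mathcal{N}\left(\Big(\frac{1}{\sigma^2}+\beta^\intercal Q^{-1}\beta\Big)^{-1}\Big(\frac{\mu_y+\alpha^\intercal y_{(t-N+1):t}}{\sigma^2}+\beta^\intercal Q^{-1}(X_{t+1}-\mu_x)\Big),\ \Big(\frac{1}{\sigma^2}+\beta^\intercal Q^{-1}\beta\Big)^{-1}\right),\tag{S1}$$

which is a normal distribution whose mean is linear in $y_{(t-N+1):t}$ and $X_{t+1}$ (equivalently, after shifting the time index by one, in $y_{(t-N):(t-1)}$ and $X_t$) and whose variance is constant.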
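One way to see where Eq. S1 comes from, sketched under assumptions 1 to 3: the autoregression supplies a normal prior for $y_{t+1}$ given the past, the search data supply a normal likelihood, and completing the square in $y_{t+1}$ combines the two. The shorthand $m_t$ for the autoregressive mean is introduced here only for the derivation and does not appear in the paper.

```latex
\begin{aligned}
f(y_{t+1}\mid y_{1:t},X_{1:(t+1)})
 &\propto f(y_{t+1}\mid y_{(t-N+1):t})\; f(X_{t+1}\mid y_{t+1}) \\
 &\propto \exp\Big\{-\tfrac{1}{2\sigma^2}(y_{t+1}-m_t)^2
   -\tfrac12\,(X_{t+1}-\mu_x-y_{t+1}\beta)^\intercal Q^{-1}(X_{t+1}-\mu_x-y_{t+1}\beta)\Big\},
   \qquad m_t=\mu_y+\alpha^\intercal y_{(t-N+1):t} \\
 &\propto \exp\Big\{-\tfrac12\Big(\tfrac{1}{\sigma^2}+\beta^\intercal Q^{-1}\beta\Big)y_{t+1}^2
   +\Big(\tfrac{m_t}{\sigma^2}+\beta^\intercal Q^{-1}(X_{t+1}-\mu_x)\Big)y_{t+1}\Big\}.
\end{aligned}
```

The last line is the kernel of a normal density in $y_{t+1}$ with precision $1/\sigma^2+\beta^\intercal Q^{-1}\beta$ and mean equal to the linear coefficient divided by that precision, which matches the mean and variance stated in Eq. S1. The first step uses assumption 3 to drop the earlier search vectors and the earlier values of $y$ from the likelihood of $X_{t+1}$.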

SI Determination of the Hyperparameters for ARGO

The optimized parameters of the ARGO model, $\mu_y$, $\alpha=(\alpha_1,\ldots,\alpha_N)$, and $\beta=(\beta_1,\ldots,\beta_K)$, are obtained by

$$\arg\min_{\mu_y,\alpha,\beta}\ \sum_t\Big(y_t-\mu_y-\sum_{j=1}^{52}\alpha_j y_{t-j}-\sum_{i=1}^{100}\beta_i X_{i,t}\Big)^2+\lambda_\alpha\|\alpha\|_1+\eta_\alpha\|\alpha\|_2^2+\lambda_\beta\|\beta\|_1+\eta_\beta\|\beta\|_2^2.\tag{S2}$$

The training period consists of a 2-y (104-wk) rolling window that immediately precedes the desired date of estimation. The hyperparameters are $\lambda_\alpha$, $\lambda_\beta$, $\eta_\alpha$, and $\eta_\beta$. We tested the performance of ARGO with the following specifications of hyperparameters:

  • (specification 1) restrict $\eta_\alpha=\eta_\beta=0$ and $\lambda_\alpha=\lambda_\beta$, cross-validate on $\lambda_\alpha$. This is our proposed ARGO with the same L1 penalty for Google search terms and autoregressive lags.

  • (specification 2) restrict $\eta_\alpha=\eta_\beta=0$, cross-validate on $(\lambda_\alpha,\lambda_\beta)$. This is ARGO with separate L1 penalties for Google search terms and autoregressive lags.

  • (specification 3) restrict $\eta_\alpha=\eta_\beta$ and $\lambda_\alpha=\lambda_\beta=0$, cross-validate on $\eta_\alpha$. This is ARGO with the same L2 penalty for Google search terms and autoregressive lags.

  • (specification 4) restrict $\lambda_\alpha=\lambda_\beta=0$, cross-validate on $(\eta_\alpha,\eta_\beta)$. This is ARGO with separate L2 penalties for Google search terms and autoregressive lags.

  • (specification 5) restrict $\lambda_\alpha=\lambda_\beta$ and $\eta_\alpha=\eta_\beta$, cross-validate on $(\lambda_\alpha,\eta_\alpha)$. This is ARGO with the same elastic net (both L1 and L2) penalty for Google search terms and autoregressive lags.
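The five specifications differ only in how the four hyperparameters of Eq. S2 are tied together. The sketch below encodes that mapping and the penalty term; the function names and the convention that `a` and `b` denote the one or two cross-validated values are assumptions made for the illustration.

```python
def hyperparameters(spec, a, b=None):
    """Return (lam_alpha, lam_beta, eta_alpha, eta_beta) for specification 1-5.

    a, b: the value(s) selected by cross-validation under that specification.
    """
    if spec == 1:            # same L1 penalty
        return (a, a, 0.0, 0.0)
    if spec == 2:            # separate L1 penalties
        return (a, b, 0.0, 0.0)
    if spec == 3:            # same L2 penalty
        return (0.0, 0.0, a, a)
    if spec == 4:            # separate L2 penalties
        return (0.0, 0.0, a, b)
    if spec == 5:            # same elastic net: a weights L1, b weights L2
        return (a, a, b, b)
    raise ValueError("spec must be an integer from 1 to 5")


def penalty(alpha, beta, lam_alpha, lam_beta, eta_alpha, eta_beta):
    """Penalty part of Eq. S2 for coefficient lists alpha and beta."""
    l1_a = sum(abs(v) for v in alpha)
    l1_b = sum(abs(v) for v in beta)
    l2_a = sum(v * v for v in alpha)
    l2_b = sum(v * v for v in beta)
    return lam_alpha * l1_a + eta_alpha * l2_a + lam_beta * l1_b + eta_beta * l2_b


# Made-up coefficients and penalty weights, for illustration only.
example = penalty([1.0, -2.0], [3.0], *hyperparameters(5, 0.5, 0.1))
```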

Table S5 summarizes the in-sample estimation performance for our proposed ARGO, together with the other specifications of hyperparameters. It is apparent from the table that the L1 penalty generally outperforms the L2 penalty. The L1 penalty tends to shrink the coefficients of unnecessary independent variables to be exactly zero, and thus eliminates redundant information; on the other hand, the L2 penalty can only shrink the coefficients to be close to zero. As a result, L2 penalized coefficients are not as sparse as their L1 counterparts. Furthermore, from Table S5, we see that ARGO with separate L1 penalties (specification 2) outperforms ARGO with separate L2 penalties (specification 4), in terms of both RMSE and MAE. Similarly, ARGO with the same L1 penalty (specification 1) outperforms ARGO with the same L2 penalty (specification 3), in terms of both RMSE and MAE.

The elastic net model, which combines L1 penalty and L2 penalty, does not provide any error reduction. In the cross-validation process of setting (λα,ηα) for the elastic net model, 70 wk out of 116 in-sample weeks showed that the smallest cross-validation mean error when restricting ηα=0 (i.e., zero L2 penalty) is within 1 SE of the global smallest cross-validation mean error, suggesting that restricting L2 penalty term to be zero (i.e., ηα=0) will introduce little bias. Therefore, for the simplicity and sparsity of the model, we drop the L2 penalty terms and use only the L1 penalty.

Next, we want to decide between the remaining two specifications, ARGO with separate L1 penalties (specification 2) and ARGO with the same L1 penalty (specification 1). One might argue that Google search terms and autoregressive lags are different sources of information and thus should have different L1 penalties. However, empirical evidence in Table S5 shows that, again, giving extra flexibility to (λα,λβ) does not generate improvement compared with fixing λα=λβ. In the cross-validation process of setting (λα,λβ) for separate L1 penalties, 99 wk out of 116 in-sample weeks showed that the smallest cross-validation mean error when restricting λα=λβ (i.e., same L1 penalty) is within 1 SE of the global smallest cross-validation mean error. This may well be due to the gain from variance reduction when imposing the restriction λα=λβ. Based on the same simplicity and sparsity consideration, we finally decided to restrict ηα=ηβ=0 and λα=λβ in the setting of hyperparameters for ARGO.
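The week-by-week check described above (whether the best restricted setting is within 1 SE of the global best) can be sketched as follows. The text does not say which standard error is used; this sketch assumes the SE at the global minimizer, as in the usual one-standard-error rule. The function name, the grid, and the numbers are invented for the example.

```python
def restriction_within_one_se(cv_mean, cv_se, is_restricted):
    """True if the best restricted grid point is within 1 SE of the global best.

    cv_mean, cv_se: dicts keyed by hyperparameter tuples
    is_restricted: predicate selecting grid points that satisfy the restriction
    """
    best = min(cv_mean, key=cv_mean.get)
    restricted = [k for k in cv_mean if is_restricted(k)]
    best_restricted = min(restricted, key=cv_mean.get)
    return cv_mean[best_restricted] <= cv_mean[best] + cv_se[best]


# Made-up cross-validation results on a 2 x 2 grid of (lam_alpha, lam_beta).
cv_mean = {(0.1, 0.1): 0.052, (0.1, 0.2): 0.050,
           (0.2, 0.1): 0.058, (0.2, 0.2): 0.055}
cv_se = {k: 0.004 for k in cv_mean}

same_penalty_ok = restriction_within_one_se(
    cv_mean, cv_se, lambda k: k[0] == k[1])
```

Counting the weeks for which such a check succeeds would give tallies of the kind reported in the text (for example, 99 of 116 in-sample weeks for the shared L1 penalty).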

SI Revision of CDC’s ILI Activity Reports

Within a flu season, CDC reports are constantly revised to improve their accuracy as new information is incorporated. Thus, CDC’s weighted ILI figures displayed in previously published reports may change in subsequent weeks. As a consequence, in a given week, the available CDC ILI information from the most recent weeks may be inaccurate. To test the robustness of ARGO in the presence of these revisions and mimic the real-time tracking in our retrospective predictions, we trained ARGO and all other alternative models based on the following schedule.

Suppose $z_{i,j}$ is the CDC-reported ILI activity level of week $i$ accessed at week $j$. Because CDC’s ILI activity report is often delayed by 1 wk, in week $j$ the historical ILI activity-level data we have are $\{z_{i,j}: i\le j-1\}$. Due to revisions, the ILI activity level of week $i$ accessed at different weeks, $z_{i,i+1},z_{i,i+2},\ldots$, may differ but will eventually converge to a finalized value $z_{i,\infty}$. Hence, to avoid using forward-looking information, in week $j$ we train all models with the ILI activity level accessed at that week, $\{z_{i,j}: i\le j-1\}$. In this sense, any future revision beyond week $j$ is not incorporated in the training at week $j$. However, for the accuracy metrics, the estimation target remains the finalized ILI activity level $(z_{i,\infty}, i=1,2,\ldots)$.
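The schedule above amounts to keeping every report vintage and, in week $j$, training only on the values as they were reported in week $j$. A minimal sketch follows; the dictionary layout, the function name, and the numbers are hypothetical and chosen only to illustrate the indexing.

```python
def training_series(z, week_j):
    """ILI levels available in week j: {z[i, j] : i <= j - 1}.

    z: dict mapping (i, j) -> ILI level of week i as reported in week j
    """
    return {i: v for (i, j), v in z.items() if j == week_j and i <= week_j - 1}


# Made-up vintages: week 1 is first reported in week 2 and revised in week 3.
z = {
    (1, 2): 0.010,
    (1, 3): 0.011,
    (2, 3): 0.020,
}
# Finalized values, used only as the target of the accuracy metrics.
z_final = {1: 0.012, 2: 0.021}

train_week_2 = training_series(z, 2)
train_week_3 = training_series(z, 3)
```

Evaluating against `z_final` while training on `training_series(z, j)` keeps the retrospective study free of information that would not have been available in real time.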

Table S1 shows the estimation results when using the aforementioned schedule. Note that ARGO still outperforms all other alternative models. Moreover, the absolute values of all four accuracy metrics for ARGO trained this way essentially do not change compared with ARGO trained with finalized ILI activity level as studied in Table 1 of the main text, indicating the robustness of ARGO.

The weekly revisions of CDC’s ILI activity reports are available at the CDC website from week 40 of the year to week 20 of the subsequent year for all seasons studied in this article. For example, ILI activity level revisions at week 50 of season 2012–2013 are available at www.cdc.gov/flu/weekly/weeklyarchives2012-2013/data/senAllregt50.htm; ILI activity report revision at week 9 of season 2014–2015 is available at www.cdc.gov/flu/weekly/weeklyarchives2014-2015/data/senAllregt09.html (the webpage has suffix “htm” for seasons before 2014–2015 and suffix “html” for 2014–2015 season). In this retrospective case study, when the revisions of ILI activity level were not available for a particular week during the off-season period, the finalized ILI activity level was used instead.

SI Variations of Google Trends Data

Google Trends historical data constantly change as a consequence of renormalizations and algorithm updates. To study the robustness of ARGO to Google Trends data revisions, we obtained the search frequencies of the search query terms identified by Google Correlate on May 22, 2010 (see Fig. S1 and Table S4) from the Google Trends website (www.google.com/trends) on 25 different days in April 2015. We studied the variability of ARGO’s performance when using these 25 different versions of Google Trends data as input variables for the common time period of September 28, 2014 to March 29, 2015. We studied the 2014–2015 flu season only partially (up to March 2015) because this is the longest study period covered by all of the obtained versions of Google Trends data, at the time (May 1, 2015) of the first submission of this article. We want to emphasize that Google Correlate data were only available up to February 2014 when accessed in April 2015.

Despite the inevitable variation caused by revisions of the low-quality data from Google Trends, ARGO still achieves considerable stability compared with the method of Santillana et al. (16) during this time period. Table S2 suggests that ARGO is threefold more robust than the method of ref. 16. The incorporation of time series information helps ARGO achieve stability. As an extreme example, the AR(3) model relies entirely on the time series information and is thus independent of Google Trends data revisions. GFT, formulated with the original search variables as inputs, is, by construction, insensitive to the changes in Google Trends data. For this portion of the study, we included the signal from GFT for context only, and we treat it as exogenous in our analysis. Based on the results from previous time periods, it is highly likely that if we had access to Google’s internal raw data (i.e., historical search volume for disease-related phrases), we would have achieved the same stability as well. However, even with these low-quality data, ARGO outperforms GFT uniformly on all versions of data in terms of both RMSE and MAE.

Detailed Description of Google Correlate Data.

Tables S3 and S4 list the search query phrases identified by Google Correlate as of March 28, 2009 and May 22, 2010, respectively. The March 2009 version included spurious terms such as “college.basketball.standings,” “march.vacation,” “aloha.ski,” “virginia.wrestling,” etc. These spurious terms did not appear in the May 2010 version.

Dynamic Coefficients for ARGO.

Fig. S1 shows the coefficients for the time series and Google search terms dynamically trained by ARGO via a heat map. The level of ILI activity last week is seen to have a significant effect on the current level of ILI activity, and ILI activity half a year ago and/or 1 y ago could provide further information, as the figure shows. Among Google Correlate query terms, ARGO selected 14 terms out of 100, on average, each week.

Acknowledgments

S.C.K.’s research is supported in part by National Science Foundation Grant DMS-1510446.

Footnotes

  • ¹To whom correspondence may be addressed. Email: kou@stat.harvard.edu or msantill@fas.harvard.edu.
  • Author contributions: M.S. and S.C.K. designed research; S.Y., M.S., and S.C.K. performed research; S.Y. analyzed data; and S.Y., M.S., and S.C.K. wrote the paper.

  • The authors declare no conflict of interest.

  • This article is a PNAS Direct Submission.

  • This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1515373112/-/DCSupplemental.


References

  1. Ginsberg J, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014.
  2. Polgreen PM, Chen Y, Pennock DM, Nelson FD, Weinstein RA (2008) Using Internet searches for influenza surveillance. Clin Infect Dis 47(11):1443–1448.
  3. Yuan Q, et al. (2013) Monitoring influenza epidemics in china with search query from baidu. PLoS One 8(5):e64323.
  4. Paul MJ, Dredze M, Broniatowski D (2014) Twitter improves influenza forecasting. PLOS Curr Outbreaks 10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117.
  5. McIver DJ, Brownstein JS (2014) Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLOS Comput Biol 10(4):e1003581.
  6. Santillana M, Nsoesie EO, Mekaru SR, Scales D, Brownstein JS (2014) Using clinicians’ search query data to monitor influenza epidemics. Clin Infect Dis 59(10):1446–1450.
  7. Wesolowski A, et al. (2014) Commentary: Containing the Ebola outbreak–the potential and challenge of mobile network data. PLOS Curr Outbreaks 10.1371/currents.outbreaks.0177e7fcf52217b8b634376e2f3efc5e.
  8. Chan EH, Sahai V, Conrad C, Brownstein JS (2011) Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Negl Trop Dis 5(5):e1206.
  9. Preis T, Moat HS, Stanley HE (2013) Quantifying trading behavior in financial markets using Google trends. Sci Rep 3:1684.
  10. Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8.
  11. Wu L, Brynjolfsson E (2015) The future of prediction: How Google searches foreshadow housing prices and sales. Economic Analysis of the Digital Economy, eds Goldfarb A, Greenstein SM, Tucker CE (Univ Chicago Press, Chicago), pp 89–118.
  12. Helft M (November 11, 2008) Google uses searches to track flu’s spread. NY Times. Available at www.nytimes.com/2008/11/12/technology/internet/12flu.html?_r=0#. Accessed July 11, 2015.
  13. Butler D (2013) When Google got flu wrong. Nature 494(7436):155–156.
  14. Cook S, Conrad C, Fowlkes AL, Mohebbi MH (2011) Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS One 6(8):e23610.
  15. Lazer D, Kennedy R, King G, Vespignani A (2014) Big data. The parable of Google Flu: Traps in big data analysis. Science 343(6176):1203–1205.
  16. Santillana M, Zhang DW, Althouse BM, Ayers JW (2014) What can digital disease detection learn from (an external revision to) Google Flu Trends? Am J Prev Med 47(3):341–347.
  17. World Health Organization (2014) Influenza (seasonal) (World Health Org, Geneva), Fact Sheet 211.
  18. Shaman J, Karspeck A (2012) Forecasting seasonal outbreaks of influenza. Proc Natl Acad Sci USA 109(50):20425–20430.
  19. Lipsitch M, Finelli L, Heffernan RT, Leung GM, Redd SC, 2009 H1n1 Surveillance Group (2011) Improving the evidence base for decision making during a pandemic: The example of 2009 influenza A/H1N1. Biosecur Bioterror 9(2):89–115.
  20. Nsoesie EO, Brownstein JS, Ramakrishnan N, Marathe MV (2014) A systematic review of studies on forecasting the dynamics of influenza outbreaks. Influenza Other Respi Viruses 8(3):309–316.
  21. Chretien JP, George D, Shaman J, Chitale RA, McKenzie FE (2014) Influenza forecasting in human populations: A scoping review. PLoS One 9(4):e94130.
  22. Nsoesie E, Marathe M, Brownstein J (2013) Forecasting peaks of seasonal influenza epidemics. PLoS Curr 5:5.
  23. Soebiyanto RP, Adimi F, Kiang RK (2010) Modeling and predicting seasonal influenza transmission in warm regions using climatological parameters. PLoS One 5(3):e9450.
  24. Shaman J, Karspeck A, Yang W, Tamerius J, Lipsitch M (2013) Real-time influenza forecasts during the 2012–2013 season. Nat Commun 4(2837):2837.
  25. Yang W, Lipsitch M, Shaman J (2015) Inference of seasonal and pandemic influenza transmission dynamics. Proc Natl Acad Sci USA 112(9):2723–2728.
  26. Paolotti D, et al. (2014) Web-based participatory surveillance of infectious diseases: The Influenzanet participatory surveillance experience. Clin Microbiol Infect 20(1):17–21.
  27. Dalton C, et al. (2009) Flutracking: A weekly australian community online survey of influenza-like illness in 2006, 2007 and 2008. Commun Dis Intell Q Rep 33(3):316–322.
  28. Smolinski MS, et al. (2015) Flu near you: Crowdsourced symptom reporting spanning two influenza seasons. Am J Public Health 105(10):2124–2130.
  29. Althouse BM, Ng YY, Cummings DA (2011) Prediction of dengue incidence using search query surveillance. PLoS Negl Trop Dis 5(8):e1258.
  30. Ocampo AJ, Chunara R, Brownstein JS (2013) Using search queries for malaria surveillance, Thailand. Malar J 12(1):390.
  31. Scarpino SV, Dimitrov NB, Meyers LA (2012) Optimizing provider recruitment for influenza surveillance networks. PLOS Comput Biol 8(4):e1002472.
  32. Davidson MW, Haim DA, Radin JM (2015) Using networks to combine “big data” and traditional surveillance to improve influenza predictions. Sci Rep 5:8154.
  33. Burkom HS, Murphy SP, Shmueli G (2007) Automated time series forecasting for biosurveillance. Stat Med 26(22):4202–4218.
  34. Everitt BS, Skrondal A (2002) The Cambridge Dictionary of Statistics (Cambridge Univ Press, Cambridge, UK).
  35. Politis DN, Romano JP (1994) The stationary bootstrap. J Am Stat Assoc 89(428):1303–1313.
  36. Tsukayama H (October 13, 2014) Google is testing live-video medical advice. Washington Post. Available at https://www.washingtonpost.com/news/the-switch/wp/2014/10/13/google-is-testing-live-video-medical-advice/. Accessed April 20, 2015.
  37. Gianatasio D (November 10, 2014) How this agency cleverly stopped people from googling their medical symptoms: The right ads at the right time. Adweek. Available at www.adweek.com/adfreak/how-agency-cleverly-stopped-people-googling-their-medical-symptoms-161331. Accessed April 20, 2015.
  38. Yang AC, Tsai SJ, Huang NE, Peng CK (2011) Association of Internet search trends with suicide death in Taipei City, Taiwan, 2004–2009. J Affect Disord 132(1-2):179–184.
  39. Cavazos-Rehg PA, et al. (2015) Monitoring of non-cigarette tobacco use using Google Trends. Tob Control 24(3):249–255.
  40. Santillana M, et al. (2015) Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol 11(10):e1004513.
  41. Lampos V, Miller AC, Crossan S, Stefansen C (2015) Advances in nowcasting influenza-like illness rates using search query logs. Sci Rep 5:12760.
  42. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc, B 58(1):267–288.
  43. Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.
  44. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 67(2):301–320.
Accurate estimation of influenza epidemics using Google search data via ARGO
Accurate influenza epidemics estimation via ARGO
Shihao Yang, Mauricio Santillana, S. C. Kou
Proceedings of the National Academy of Sciences Nov 2015, 112 (47) 14473-14478; DOI: 10.1073/pnas.1515373112


Article Classifications

  • Physical Sciences
  • Applied Mathematics


Copyright © 2020 National Academy of Sciences. Online ISSN 1091-6490