The Internet is an important source of health information. Thus, the frequency of internet searches may provide information regarding infectious disease activity. As an example, we examine the relationship between searches for influenza and actual influenza occurrence. Using search queries from http://search.yahoo.com, between March 2004 and May 2008, we counted daily unique queries, originating in the U.S. and containing influenza-related search terms. Counts were divided by the total number of searches, and the resulting daily fraction of searches was averaged over the week. We estimated linear models, using searches with one- to ten-week lead times as explanatory variables, to predict the percentage of positive influenza cultures and also deaths due to pneumonia and influenza in the U.S. Using the frequency of searches, our models predicted an increase in positive influenza cultures 1–3 weeks in advance (p < 0.0001) and similar models predicted an increase in mortality from pneumonia and influenza up to five weeks in advance (p < 0.0001). Search-term surveillance may provide an additional tool for disease surveillance.
The Internet has dramatically changed how people search for medical information. Over the past decade, an increasing amount of information has become available on websites, especially for infectious diseases. For example, public health organizations at the local, state, national and international level now routinely provide health-related information via their websites. These sites provide important updates about infectious disease activity and outbreaks. Also, most medical journals are available on-line, and to facilitate searching for journal articles the National Library of Medicine web-site now contains over 16 million citation records [1].
In addition to medical journals and public health websites, news websites supply a constant stream of updated health information. Also, several commercial firms organize medical information exclusively for clinicians, some catering specifically to infectious disease physicians and microbiologists [2]. Professional societies like the Infectious Diseases Society of America, the American Society of Microbiology, and the Society for Healthcare Epidemiology of America also support websites with relevant scientific information, position statements, and practice guidelines. Some of these societies support electronic communities focused on infectious diseases, expanding the flow of medical information between clinicians and public health officials [3,4,5].
To capitalize on the dynamic nature of web-based information, investigators have launched efforts to exploit this information for disease surveillance. For example, the Global Public Health Intelligence Network (GPHIN), developed by the Public Health Agency of Canada, continuously monitors media sources and web-based information related to disease outbreaks around the world [6]. GPHIN data are not available to the general public. However, a relatively new site at HealthMAP.org monitors information from a variety of sources and displays results in real time on a world map [7]. Access to this website is free and available to the public.
An estimated 113 million people in the U.S. use the Internet to find health-related information [8]. Searchers include patients and their families as well as healthcare providers [8, 9, 10, 11]. However, the large number of health-related sites has made it difficult to find specific information that is credible and reliable. Thus, Internet search engines (e.g., Ask, Google, and Yahoo) are now essential for Internet users to find information. In fact, most people searching for medical information use a search engine [8].
On a typical day, 8 million people are searching for health-related information [8]. Thus, the pattern of how and when people search may provide clues or early indications about future concerns and expectations. For example, an analysis of internet search terms related to jobs and job opportunities has produced accurate and useful statistics about the unemployment rate [12]. Similarly, searches for health-related information might also yield useful health statistics. Eysenbach [13], unable to get access to search engine query logs, demonstrated that clicks on a “sponsored link” on Google Adsense, triggered by Canadian searchers entering “flu” or “flu symptoms”, accurately anticipated the Flu Watch reports collected by the Public Health Agency of Canada.
Thus, analyzing actual search query logs for terms related to infectious diseases may provide a unique supplement to traditional infectious disease surveillance systems. The Centers for Disease Control and Prevention (CDC) influenza surveillance program identifies disease as, or after, it occurs, and therefore does not provide advance warnings. Furthermore, the CDC's data regarding influenza activity are no longer current when released to healthcare providers. To supplement influenza surveillance, several forms of syndromic surveillance have been suggested ranging from analysis of over-the-counter-medication sales to school absentee records [14]. As another supplemental form of surveillance, we describe how internet search query logs may help detect changes in disease activity. Using influenza as an example, we examine the temporal relationship between the search terms related to a disease and the actual cases of disease occurrence to determine if, and to what extent, an increase in search frequency matches or precedes actual disease activity.
To measure influenza disease occurrence, we used two types of U.S. influenza surveillance data. The first type was based on weekly influenza cultures [15]. Each week during the influenza season, clinical laboratories throughout the U.S., which are either members of the World Health Organization (WHO) Collaborating Laboratories or National Respiratory and Enteric Virus Surveillance System (NREVSS), report the total number of respiratory specimens tested and the number that test positive for influenza.
The second type of data summarizes weekly mortality from pneumonia and influenza [15]. These data are collected from the 122 Cities Mortality Reporting System. Each week, the participating cities report the total number of death certificates received and also the number that list pneumonia or influenza as the underlying and/or contributing cause of death. Based on these data, we obtain national influenza mortality figures. To match the date range of our Internet-search data, both types of influenza-surveillance data that we used were collected from March 2004 to May 2008.
Search query logs were obtained from Yahoo! and they cover the period from March 2004 through May 2008. From the Internet Protocol (IP) address associated with a search, we attempted to identify the geographic location (i.e., U.S. Census region) from which the search was initiated. The number of unique queries that came from the U.S. and contained influenza-related terms was counted daily. We excluded searches from outside the U.S. because the influenza season varies geographically. These daily influenza-search counts were divided by the total number of all U.S.-originated searches for each day to obtain the daily fraction of influenza-related searches. This normalization removed the possible effect of the overall growth of searches. As the influenza surveillance data were reported weekly, we used a weekly influenza-related search fraction by taking the average of the daily fraction for each week.
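The normalization and weekly averaging described above can be sketched as follows. This is only an illustrative sketch, not the code used in the study: the function name, the dictionary inputs, and the use of ISO calendar weeks to define a week are all assumptions, since the text does not state how weeks were delimited.

```python
from collections import defaultdict
from datetime import date


def weekly_search_fraction(daily_flu, daily_total):
    """Average the daily influenza-search fraction within each week.

    daily_flu, daily_total: dicts mapping datetime.date -> query count
    (hypothetical input format). Weeks are ISO weeks, an assumption.
    Returns {(iso_year, iso_week): mean daily fraction}.
    """
    by_week = defaultdict(list)
    for day, flu_count in daily_flu.items():
        total = daily_total.get(day, 0)
        if total == 0:
            continue  # skip days with no recorded search traffic
        iso = day.isocalendar()
        # Daily fraction = influenza-related queries / all U.S. queries
        by_week[(iso[0], iso[1])].append(flu_count / total)
    # Weekly value = mean of the daily fractions in that week
    return {week: sum(fr) / len(fr) for week, fr in by_week.items()}
```

Averaging daily fractions, rather than dividing weekly totals, gives each day equal weight regardless of its overall search volume, which matches the description in the text.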
We obtained 2 series of influenza-related search fraction data at the national level: (1) the fraction of US search queries that contain the terms “influenza” or “flu” but do not contain the terms “bird,” “avian,” or “pandemic” and (2) the fraction of US search queries that contain the terms “influenza” or “flu” but do not contain the terms “bird,” “avian,” “pandemic,” “vaccine,” “vaccination,” or “shot.”
By restricting these series to queries that did not contain the terms “bird,” “avian,” and “pandemic,” we attempted to remove searches for avian influenza rather than seasonal influenza. Also, because most influenza vaccination occurs before the influenza season, we excluded all obvious vaccination-related searches.
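The two inclusion rules can be illustrated with a short sketch. The function names are hypothetical, and matching on whole lowercase words is an assumption: the text does not say whether terms were matched as whole words or as substrings.

```python
import re

FLU_TERMS = {"influenza", "flu"}
AVIAN_TERMS = {"bird", "avian", "pandemic"}
VACCINE_TERMS = {"vaccine", "vaccination", "shot"}


def tokens(query):
    """Split a query into a set of lowercase alphabetic words."""
    return set(re.findall(r"[a-z]+", query.lower()))


def in_series_1(query):
    """Series 1: influenza terms present, avian/pandemic terms absent."""
    t = tokens(query)
    return bool(t & FLU_TERMS) and not (t & AVIAN_TERMS)


def in_series_2(query):
    """Series 2: as series 1, and vaccination-related terms also absent."""
    return in_series_1(query) and not (tokens(query) & VACCINE_TERMS)
```

Under these assumptions, a query such as "flu shot near me" counts toward series 1 but not series 2, and "bird flu" counts toward neither.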
We also classified weekly influenza-related search data into 9 US Census regions. Census-region data were normalized by total searches within that region. Because we identified the geographic location of origin from the IP address, there were cases for which we were not able to identify the exact region in which a search originated, but we were able to identify that the search came from within the United States. Therefore, the sum of the search data for the 9 US Census regions does not equal the amount of data at the national level. For each US Census region, we obtained only 1 series of data: weekly search data from the region for queries that contain the terms “influenza” or “flu” but do not contain the terms “bird,” “avian,” “pandemic,” “vaccine,” “vaccination,” or “shot.”
To define the relationship between culture-positive cases of influenza and influenza-related searches, we examined the relationship between influenza culture data and influenza-related searches at the national level. These data are presented as a time series in Figure 1.
The fraction of influenza-related search queries and the rates of positive influenza cultures follow similar patterns, but a sharp increase in searches precedes the sharp increase in the rate of positive cultures. Using the culture data, we fitted the following linear model, which includes a time-trend variable, to test how well search frequency predicts positive influenza cultures:

$$c_t = \beta_0 + \beta_1 t + \beta_2 s_{t-x} + \varepsilon_t$$

where $t$ is a time trend (measured in weeks), $c_t$ is the rate of positive influenza cultures received during week $t$, $s_{t-x}$ is the search frequency in week $t-x$, the $\beta$ terms are regression coefficients, and $\varepsilon_t$ is the error term. To determine the appropriate lag (in weeks), we examined eleven values for $x$ (0 through 10) and compared the $R^2$ value for each model. The model with a search term lagged by one week fit best. However, models with lags up to 3 weeks in advance of culture data fit similarly in terms of $R^2$. A summary of the regression results for the 0–10-week-lag search-term models is presented in Table 1.
The coefficient on the time trend is not significantly different from zero in any of the models. However, there is a positive relationship between the fraction of influenza-related queries and positive influenza culture rates two weeks later (p < 0.001). The large coefficient on $s_{t-2}$ reflects the fact that influenza-related search frequency is measured as a fraction of all searches. The predicted values from the 2-week model and the actual culture data are presented in Figure 2.
Figure 2. Predicted values for positive influenza cultures based on searches and actual values, by week.
We also fit separate models with lags from 1–10 weeks for each of the 9 U.S. census regions. Results were similar to the national model, with the best-fitting models predicting positive influenza cultures 1–3 weeks in advance. The average $R^2$ at 2 weeks was 0.3788. However, values varied from a high of 0.5729 in the East South Central region to a low of 0.1656 in the Mid Atlantic region.
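The lag-selection procedure used for the culture models (regress the culture rate on a time trend and the search fraction lagged by x weeks, then compare $R^2$ across lags) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the fit uses ordinary least squares via the normal equations, each lag is fitted on the weeks available for that lag, and the search fractions are assumed to have been rescaled (for example, to searches per million) so the system is numerically well conditioned.

```python
def fit_lagged_model(c, s, lag):
    """OLS fit of c[t] = b0 + b1*t + b2*s[t-lag].

    c, s: equal-length lists of weekly values (hypothetical inputs).
    Returns ([b0, b1, b2], r_squared).
    """
    rows = [(1.0, float(t), s[t - lag]) for t in range(lag, len(c))]
    y = [c[t] for t in range(lag, len(c))]
    k = 3
    # Augmented normal equations: [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def best_lag(c, s, max_lag=10):
    """Return the lag in 0..max_lag whose model has the highest R^2."""
    return max(range(max_lag + 1),
               key=lambda x: fit_lagged_model(c, s, x)[1])
```

The same functions would apply to the mortality series by passing weekly deaths in place of culture rates. Because larger lags leave fewer usable weeks, $R^2$ values across lags are computed on slightly different samples in this sketch.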
Figure 3 plots the time series of influenza-related searches and influenza mortality for the U.S. To examine the relationship between searches and mortality, as described for the culture data, we fitted the following linear model to test how well search frequency predicts influenza mortality:

$$m_t = \beta_0 + \beta_1 t + \beta_2 s_{t-x} + \varepsilon_t$$

where $m_t$ is the total number of deaths from pneumonia and influenza in week $t$, the $\beta$ terms are regression coefficients, $\varepsilon_t$ is the error term, and all other variables are as defined earlier. A model incorporating searches at time $t-5$ fits slightly better than other models with a search variable ranging from time $t$ to time $t-10$. All of the regression results using searches from 0–10 week lags are listed in Table 2. A positive relationship exists between the fraction of influenza-related search queries and pneumonia and influenza mortality 5 weeks later (p < 0.001). The large coefficient on $s_{t-5}$ reflects the fact that influenza-related search frequency is measured as a fraction of all searches and thus takes on small values, on the order of $10^{-6}$. Figure 4 shows the predicted values from the 5-week model and the actual mortality data.
Figure 4. Predicted values for mortality from influenza and pneumonia based on searches and actual values, by week.
Finally, we fit models with lags from 0–10 weeks for each of the 9 U.S. census regions. Results were similar to the national model: for the best fitting models, searches peaked 4–6 weeks before deaths from influenza and pneumonia. The average R2 at 5 weeks was 0.3041. However, values varied from a high of 0.4250 in the East North Central region to a low of 0.1227 in the Pacific region.
Influenza recurs each season in regular cycles, but the geographic location, timing, and size of each outbreak vary, complicating efforts to produce reliable and timely estimates of influenza activity. However, we found that a distinct temporal association exists between influenza-related search-term frequency and influenza disease activity. On a national level, influenza-related search-term activity seems to precede an increase in the number of cultures positive for influenza and deaths attributable to pneumonia and influenza. Furthermore, the two temporal relationships, between searches and positive cultures and between searches and mortality, correspond to the epidemiology of influenza, because the number of deaths from pneumonia typically peaks a few weeks after a peak in the number of influenza cases.
Investigators have suggested several supplemental approaches for influenza surveillance, at prediagnosis and diagnosis stages. Prediagnosis approaches mainly include the analysis of information collected before specific influenza-related diagnoses are made, including analyses of telephone triage calls [16], purchases of over-the-counter medications for respiratory diseases [17–20], and school absenteeism [21]. In contrast, diagnosis-level approaches attempt to gather clinical data from emergency department visits [22–24] or microbiologic sources in as close to real time as possible. The timeliness of influenza surveillance approaches has recently been thoroughly reviewed elsewhere [14]. Prediction markets have also been used to provide future estimates of influenza activity by aggregating both prediagnostic and postdiagnostic information [25]. In general, the efforts described herein provide information days to weeks in advance of traditional sources, but it is difficult to compare these approaches, because different geographic regions were studied, different statistical approaches were used, and some reports include only 1 influenza season [14]. To generalize these approaches to the national level would require merging several data sources from different geographic areas and multiple firms (in the case of pharmacy data or billing data). In contrast, search query data are efficiently collected in a standard usable form and aggregate both prediagnostic and postdiagnostic information. Although it is difficult to compare with other methods, analysis of Internet search terms seems to perform reasonably well. In addition, data are easy to collect, and unlike other nontraditional forms of surveillance, search data can easily be used to study other diseases.
If future results are consistent with these findings, search-term surveillance may provide an important and cost-effective supplement to traditional disease-surveillance systems. In the case of influenza, a few weeks of lead time could help inform epidemiological investigations and assist with both prevention and treatment efforts. Search terms classified by different geographic regions may provide even more useful information. For example, we fit linear models using data from the nine census regions and found that influenza-related searches are statistically significantly related to influenza mortality. Models in some regions perform better than others, suggesting that information in some regions may generate searches in other regions. Further work is needed to examine the spatial relationship between searches and the geographic spread of influenza. However, because culture and mortality data are not uniformly reported at the state level, our geographic analysis stopped at the census region level.
Despite the promise of using search data for surveillance purposes, there are several limitations. First, with only four years of data, the inferential conclusions from time-series analysis are limited. A second limitation: we need to account for the possibility that some searches may be generated by news reports or a “celebrity effect” instead of actual disease activity. For example, the publication of a medical journal article about influenza may generate searches with no relationship to disease occurrence and the same may be true if a celebrity contracts a specific disease. Cancer researchers, using Yahoo search queries, found that daily variations of search frequency were heavily influenced by news reports [26]. However, Internet searches for specific cancers were still correlated with their estimated incidence and mortality. Also, a news item causing a large increase in search volumes should be easy to identify and rather short lived given the half-life of most news cycles.
The limited geographic data gleaned from search terms is a third limitation of search data. Geographic search data are extracted from IP addresses and may not always represent actual geographic location. Privacy issues represent another significant limitation. The search data described in this paper were aggregated across users for 9 census regions. However, search data with much finer geographic information, linked to individuals across multiple searches for different topics, could represent a privacy concern. Thus, we envision health investigators only using aggregated search volumes over larger geographic regions for surveillance purposes. Finally, access to search query logs from search engines will need to be made available to investigators. Other attempts to study actual search query data for public health reasons have been unsuccessful [13].
In addition to data from search engines, data gleaned from website hits or web searches on specific websites may also provide useful information about disease activity. For example, the number of articles retrieved on the site Healthlink, a consumer health information website maintained at the Medical College of Wisconsin, was correlated with influenza activity [27]. Thus, searches for specific diseases on high-traffic websites (e.g., a state health department) may provide important time-series data because they capture the number and, to some extent, the geographic location (via IP address) of people investigating the activity of a specific disease. Searches for specific medical conditions on the National Library of Medicine's PubMed website may indicate changing patterns in infectious disease activity or potential adverse drug events. Furthermore, changes in volume of searches on commercial websites (e.g., Up-To-Date, MD Consult) may indicate gaps in clinical knowledge, or the need for clinical trials. Data from such sites may be more representative of what healthcare providers are searching for as opposed to the general public.
We propose that search-term surveillance may represent a novel and inexpensive way of performing supplemental disease surveillance. Using search series is not limited to influenza; it could also be used to monitor emerging and re-emerging infectious diseases and to detect changes in phenomena related to chronic illnesses. Surveillance of symptom-based searches (e.g., diarrhea) may help detect outbreaks if search levels rise above an established baseline. De-identified search volumes for sexually transmitted infections (e.g., syphilis) may provide public health officials indications of disease trends in advance of official reports of disease activity. Although search probably provides some aggregation of news reports, it also adds a behavioral component by signaling how important topics are to searchers. Thus, analysis of search data may also reveal how people respond to medical news and may provide indications about their concerns and future expectations. Despite several limitations, the ability to detect trends and confirm observations from traditional surveillance approaches makes this new form of surveillance a promising area of research at the interface between computer science, epidemiology, and medicine.
This work was in part supported by the Robert Wood Johnson Foundation, Pioneer Portfolio.
Potential Conflicts of Interest: P.M.P. has been a member of the Emerging Trends in Seasonal Influenza Advisory Panel of Roche Laboratories, Inc. D.M.P. and Y.C. are employees of Yahoo! Research. F.D.N. has no potential conflicts to declare.