The ongoing COVID-19 outbreak is a global pandemic and could at some point affect up to two thirds of the population. Containment is possible if a local outbreak is detected early. However, not all surveillance and health care systems have the capacity or infrastructure to find early cases. In recent years, Google search trends (GT) have been studied as a potential early warning system for various infectious diseases. The results are mixed. For seasonal influenza, GT can predict the timing of the peaks accurately, but predicting case numbers is more difficult (see for example here). The question is whether GT can be used to predict outbreaks of COVID-19 in various countries. One major problem with GT data is that it is unclear whether search trends reflect disease activity in a population or the population's reaction to media coverage of a disease. The following short study investigates this issue.
In this case study I will investigate three questions:
Which countries that are currently affected by COVID-19 show a substantial increase in GT activity in January/February 2020 compared to the previous three years?
Is the GT a reflection of media coverage for COVID-19 and how does GT activity relate to epidemiological data for COVID-19?
Do we see any increases in GT activity in unaffected countries that are not related to media coverage?
I have retrieved two sets of GT data: the weekly “web search” data for the period of 2017-01-01 to 2020-03-05 (3-year data) and the daily “web search” and “news” data for the period of 2020-01-01 to 2020-03-05 (60-day data). I assume here that the GT activity for “news” is a good reflection of the actual media coverage.
The search terms I collected GT data on are: “pneumonia”, “cough” and “fever” for web data and “coronavirus” for news coverage. GT was queried with the gtrendsR R package.
I have translated these terms into local languages for each country using the translateR R package. The COVID-19 data were digitized from the WHO situation reports; the datasets can be found in my GitHub repository.
I first used the 3-year GT data in countries with at least one case and calculated the mean relative search activity for the pre- and post-COVID periods (Jan 2017 to Dec 2019 and Jan to Feb 2020) and the mean increase as post-mean/pre-mean. I then looked at the 60-day GT data for all countries with at least a twofold increase in pre- vs. post-COVID-19 activity. I have omitted data from countries that generally have a low search volume, defined as 10 or more days (i.e. >=6.5%) with no search activity. For the remaining countries, I smoothed the 60-day GT data with a moving average over a 3-day window to reduce the noise and compared the “web search” activity to the “news” activity and the incident confirmed cases.
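The three steps above (fold increase, low-volume filter, smoothing) can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis: the function names are hypothetical, and the centered window with shortened windows at the edges is an assumption, since the text only states that a 3-day moving average was used.

```python
from statistics import mean


def fold_increase(pre, post):
    """Mean post-period activity divided by mean pre-period activity."""
    pre_mean = mean(pre)
    if pre_mean == 0:
        raise ValueError("pre-period mean is zero, ratio is undefined")
    return mean(post) / pre_mean


def is_low_volume(daily, min_zero_days=10):
    """True if the daily series has 10 or more days with no search activity."""
    return sum(1 for v in daily if v == 0) >= min_zero_days


def moving_average(daily, window=3):
    """Centered moving average (assumption); edges use the neighbours that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(daily)):
        lo = max(0, i - half)
        hi = min(len(daily), i + half + 1)
        smoothed.append(mean(daily[lo:hi]))
    return smoothed
```

A country would then be kept for the 60-day comparison if `fold_increase(...) >= 2` for “pneumonia” and `is_low_volume(...)` is false for its daily series.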
I first looked into the search activity of the past 3 years for “pneumonia”, “cough” and “fever” in the most widely spoken local language in all countries with at least one case of coronavirus as of March 5, 2020.
Many European countries show seasonal peaks for all search terms, which is indicative of seasonal influenza activity. Pneumonia seems to be the search term that stands out for many countries.
The following plot shows the x-fold increase in mean search activity for the three search terms in the language spoken by the majority (first half of the bars) and in English (second half of the bars).
Almost all affected countries show a small increase in activity, but countries with widespread transmission show a massive increase in search activity for pneumonia after January 1, 2020 compared to the last 3 years (e.g. Taiwan, China, Japan, Hong Kong). These are the countries with at least a twofold increase in GT activity for “pneumonia”:
For countries with at least twofold increase in activity for “pneumonia”, I compared the web search trends, the media coverage according to GT news activity and the incident confirmed cases of COVID-19. I have omitted countries with a low search volume, because these data provide more noise than information. The time series data of GT activity were smoothed to reduce the noise.
The same for English:
Generally, the GT curves are smoother for countries with widespread transmission: Singapore, South Korea, Indonesia, Hong Kong, Vietnam, Japan, Italy and Germany. For these countries it appears that web and news GT activity as well as COVID-19 activity coincide over time. We can use cross-correlation with different time lags to examine whether the web activity precedes the news activity or not. For this I calculated the Pearson correlation coefficient between web search hits and news hits for each country and language with lags of -7 to 6 days. The following plots show the distribution of the lag times for which the correlation was maximal for each country, for the local language (left) and English (right):
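The lag search described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the code used for the analysis: the function names are hypothetical, and the sign convention (lag k pairs web[t + k] with news[t], so a negative best lag means web activity leads news activity) is an assumption, as the text does not specify it.

```python
from math import isnan, sqrt


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def best_lag(web, news, lags=range(-7, 7)):
    """Return (lag, r) with maximal correlation between web[t + lag] and news[t]."""
    n = len(web)
    best = None
    for lag in lags:
        if lag >= 0:
            w, m = web[lag:], news[:n - lag]
        else:
            w, m = web[:n + lag], news[-lag:]
        r = pearson(w, m)
        if isnan(r):
            continue  # skip lags where one of the overlapping windows is constant
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Hypothetical daily series: the web curve shows the same bump two days earlier.
news = [0] * 10 + [1, 4, 9, 16, 9, 4, 1] + [0] * 13
web = news[2:] + [0, 0]
lag, r = best_lag(web, news)  # lag is -2 here, i.e. web leads news by two days
```

Note that `range(-7, 7)` covers the lags -7 to 6 mentioned in the text, and that both series must have the same length.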
These estimates show that, on average, the time lag between web and news activity at which the correlation is highest is 0. This is the correlation of the smoothed GT web and news activity by country for the most widely spoken local language at a time lag of 0:
For countries with smoother GT data and widespread transmission (Germany, Italy, Japan), there is very good agreement between web and news activity at time lag 0. The median correlation coefficient is 0.48. As expected, countries with noisier GT data have a lower correlation.
Hong Kong, Indonesia, Singapore, Macao, Taiwan, South Korea and Malaysia seem to show an increase in web search activity for pneumonia in the first half of January without corresponding news activity, but it is unclear whether this is due to alternative media coverage (i.e. not reflected in GT news activity), reactions to rumours, noise or actual cases googling their symptoms.
Finally, I have looked at the GT activity for two unaffected countries (as of March 5, 2020): Turkey and Kazakhstan. The following plots show the web search activity for pneumonia in the last three years (upper) and the web search and news activity in the last 60 days (lower):
For Kazakhstan it appears that the first increase in web activity at the beginning of January does not coincide with increased news coverage and could be indicative of people googling their coronavirus symptoms, but again it could also be related to media coverage not reflected in GT news. Kazakhstan has so far not reported any cases of COVID-19, but the country is geographically close to China and the two countries have commercial relations.
In summary, the results show that web activity, news coverage and cases increase more or less simultaneously in many countries. Since these cases were infected several days prior to the confirmation date, searches by people googling their symptoms should rise before the confirmed cases do. The data show no such lead, so there is little evidence that increased search behaviour is the result of people googling their symptoms. It rather seems to be a reaction to media coverage. GT is an interesting source of information, but its usefulness for detecting cases in this pandemic is questionable.
In general, if GT data are used for outbreak and case detection, it is important to distinguish reactive and pro-active googling behaviour. Increases in search behaviour as a reaction to media coverage seem to be common given the results above. Web trends should be cross-checked with media coverage for the same time frame. I have tried to approximate media coverage through the “news” activity on GT, but it is not entirely clear whether this is a good representation of media coverage. As always, more studies are needed.
Google is officially blocked in China and Iran, which means that the GT activity is probably not representative of the whole population. A quick search on Baidu trends showed that the search trend for pneumonia is very similar to the results downloaded from Google. However, trend data from Baidu can only be retrieved as a plot as far as I know, which is why I have not included it here.
The main problem with GT data is that they are relative data points scaled between 0 and 100 for the time period of observation. The time series for the same search trend in the same country can look different if we widen or narrow the time period of observation. It is thus difficult to determine whether an increase is just stochastic noise or an actual signal. Generally we can assume that changes in smoother curves are more substantial (i.e. meaningful) than changes in noisy curves. It is also difficult to compare trends for web and news activity because of the relative scaling. Ideally, web search activity would be measured in absolute numbers to understand thresholds and distinguish real increases from stochastic noise.
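The effect of the relative scaling can be illustrated with a small sketch. It is a simplification under stated assumptions: the absolute counts are made up, the function name is hypothetical, and the scaling shown (each value divided by the maximum of the observation window, times 100) leaves out any other normalization or sampling that Google applies.

```python
def rescale_to_index(counts):
    """Scale a series so that its maximum within the window becomes 100."""
    peak = max(counts)
    if peak == 0:
        return [0] * len(counts)
    return [round(100 * c / peak) for c in counts]


# Hypothetical absolute search counts for five consecutive days.
counts = [20, 30, 40, 200, 50]

wide = rescale_to_index(counts)        # [10, 15, 20, 100, 25]
narrow = rescale_to_index(counts[:3])  # [50, 75, 100]
```

The first three days carry identical absolute counts in both windows, yet they appear as 10, 15, 20 in the wide window and as 50, 75, 100 in the narrow one, which is why an index value cannot be interpreted without knowing the observation period.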