1. Introduction
Global access to the World Wide Web has increased remarkably, with just over 63% of the global population accessing the Internet in 2022 [1]. Although access to the search giant Google is limited in some regions (such as China and North Korea), it dominates search activity in most other territories (Table 1). Knowing what the world is searching for online gives researchers the opportunity to identify and respond to emerging trends in a timely manner. As such, access to search activity has long been regarded as a holy grail for researchers, and various tools have been used to assess such patterns.
Specific to Google’s search engine, the unrestricted Google Trends platform (https://trends.google.com/, accessed on 22 September 2022) is open for all to explore how specific demographics searched for certain keywords. Google Trends retrieves a relative search volume (RSV), a metric ranging from 0 to 100 that is based on the proportional popularity of the keyword in a specific geographic region for the selected period. Although this platform gives the user an indication of the dates or times at which a specific phrase was searched most frequently, it does not allow users to compare results from different periods [2]. For example, searching the same keyword for different time periods within the same geographical boundary yields different results (Figure 1). Since RSV is a scaled metric, based on the number of searches for the keyword within the selected geographical limits, comparisons between regions are also not possible with Google Trends data [3]. The extraction of Google Trends data can be automated, to some extent, using unofficial application programming interfaces (APIs), such as pytrends (v4.8.0) for Python [4].
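The window dependence of RSV can be illustrated with a toy model. The sketch below assumes, as a simplification, that RSV is each period's search share divided by the largest share in the requested window, scaled to 100 and rounded. The function name `relative_search_volume` and all numbers are hypothetical, and Google's actual computation may differ in its details.

```python
def relative_search_volume(shares):
    """Scale proportional search shares so that the peak of the window equals 100."""
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * share / peak) for share in shares]


# Hypothetical weekly search shares (keyword searches / all searches in the region).
shares = [0.002, 0.004, 0.010, 0.005, 0.001]

# Same region, same first two weeks, but two different request windows.
full_window = relative_search_volume(shares)        # window includes the peak week
short_window = relative_search_volume(shares[:2])   # window ends before the peak

print(full_window)
print(short_window)
```

Under this assumed scaling, the first week is reported as 20 in the longer window and as 50 in the shorter one, although the underlying search share is identical. This is why values from separate requests cannot be compared directly.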
For those interested in comparing search activity in different regions or time periods, Google offers limited access to the Google Health Trends (GHT) API. Access can be requested from https://bit.ly/3xpYFJo. GHT was used to explore the search behavior of African Internet users related to the COVID-19 pandemic, as a prediction tool for dengue fever in Brazil, and to gauge interest in pre-exposure prophylaxis in the United States of America [5,6,7], with mixed reports on the effectiveness of this tool in infodemiology and epidemiology.
Recently, Google announced via email that the GHT API “will be improved by providing higher precision responses by using a more comprehensive sample of search requests” (Supplementary Email S1). These changes were implemented on 18 July 2022, with all data from 1 January 2022 onward being altered to include this new, more comprehensive sample of search requests. Google also indicated that any changes in search interest dating from 1 January 2022 might be attributable to this change.
Such changes impact ongoing research, especially when future research efforts seek to compare periods before and after the implementation of such a change. The GHT documentation has also not yet been updated to indicate that the change was made, which creates a risk of erroneous conclusions being drawn in the future. Here, I present an investigation into whether this change implemented by Google indeed had an impact on the GHT data retrieved, and provide the first evidence that future research using the GHT platform should refrain from comparing data obtained from 1 January 2022 onwards with data from before 2022.
2. Materials and Methods
2.1. Data Extraction from the Google Trends API
The use of Freebase IDs, or in the absence of a Freebase ID, the corresponding Google Knowledge Graph Identifiers (GKGIs), allows searching for specific terms regardless of the searcher’s input language, since Google aggregates search values based on these identifiers. For example, searches conducted for ‘watre’ (sic), ‘水’, ‘l’eau’, ‘जलम्’, ‘metsi’ or ‘amanzi’ would be categorized as a search for ‘/m/0838f’ corresponding to the English word ‘water’. Freebase IDs or GKGIs were identified using the Google Knowledge Graph Search API and used as search terms on the GHT API, according to the recommendation by Google. Using Freebase IDs, therefore, allows for comparable search data across linguistically different searches.
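As an illustration of the identifier lookup described above, the following minimal sketch builds a request URL for the Google Knowledge Graph Search API and extracts identifiers from a response. The sample response is abbreviated and hypothetical, the helper names are not from the original study, and no network request is made here. An API key would be needed for a real request.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_lookup_url(query, api_key, limit=1):
    """Build the request URL for a Knowledge Graph entity search."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urlencode(params)


def extract_entity_ids(response):
    """Return the identifiers in a search response, without the 'kg:' prefix."""
    ids = []
    for element in response.get("itemListElement", []):
        entity_id = element.get("result", {}).get("@id", "")
        if entity_id.startswith("kg:"):
            entity_id = entity_id[len("kg:"):]
        ids.append(entity_id)
    return ids


# Abbreviated, hypothetical response for the query 'water'.
sample_response = {
    "itemListElement": [
        {"result": {"@id": "kg:/m/0838f", "name": "Water"}}
    ]
}

print(build_lookup_url("water", "YOUR_API_KEY"))
print(extract_entity_ids(sample_response))
```

The identifier returned in this way (here `/m/0838f`) would then be used as the search term, instead of a language-specific keyword.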
The presented study was based on two different datasets extracted from the GHT API. First, the probabilities of short search sessions for 421 Freebase IDs (Supplementary Table S1) were extracted for 30 countries (Table 1) before the recent update to the Google Trends sampling strategy. These extractions were carried out between 9 and 12 June 2022 for a different research project, the author having no prior knowledge of the pending change in the GHT random sampling strategy. A second extraction was performed on 22 July 2022, after the changes had been made to the GHT API. Weekly probabilities for short search sessions were extracted for the period from 6 January 2019 to 22 May 2022, resulting in 177 weeks’ worth of data for each of the searched terms in all countries. The extractions were carried out using a Python script as per Google’s guidelines [10]. The only modification was that the process was automated by including for loops to conduct the extractions for different countries.
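The looping structure described above can be sketched as follows. This is not the script used in the study: the `fetch` argument is a hypothetical placeholder for whatever function performs a single GHT API request, and the country codes are illustrative. The sketch also checks that weekly steps over the stated period give 177 data points.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Return every date from start to end (inclusive) in 7-day steps."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(days=7)
    return dates


def extract_all(countries, terms, fetch):
    """Call fetch(country, term) for every combination and collect the results."""
    results = {}
    for country in countries:
        for term in terms:
            results[(country, term)] = fetch(country, term)
    return results


weeks = weekly_dates(date(2019, 1, 6), date(2022, 5, 22))


def fake_fetch(country, term):
    # Hypothetical stand-in for a GHT API request: one placeholder value per week.
    return [0.0] * len(weeks)


series = extract_all(["ZA", "BR"], ["/m/0838f"], fake_fetch)
print(len(weeks), len(series))
```

Keeping the request itself behind a single function makes it straightforward to repeat the same extraction on another date with identical parameters.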
2.2. Statistical Analyses
Statistical analyses were performed in R (R Core Team, v4.2.0, 2022), using the RStudio integrated development environment. The raw data extracted were plotted as two separate time series, applying locally estimated scatterplot smoothing (LOESS) to visually identify potential trends. Spearman correlation coefficients between the data obtained from the two extractions were calculated and summarized. Thereafter, a new time series was constructed by calculating the difference between the values retrieved via the Google Trends API before and after the update of 18 July 2022. These time series of differences were also plotted. Anomalies (data points that fall outside the normal fluctuation range of a time series) in the different time series were detected using the AnomalyDetection package for R [8], and the anomaly time series were plotted using the internal plotting functions of R, as well as ggplot2 [9].
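The correlation and differencing steps can be sketched in a few lines. The analyses in this study were performed in R; the Python version below is only an illustration that uses the standard library, with hypothetical data. The direction of the difference (after minus before) is an assumption, since it is not specified above.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def difference_series(before, after):
    """Pointwise difference, assumed here to be 'after' minus 'before'."""
    return [a - b for b, a in zip(before, after)]


# Hypothetical values for one search term in one country.
before = [10, 20, 30, 40, 50]
after = [12, 19, 33, 41, 80]

print(spearman(before, after))
print(difference_series(before, after))
```

In this example the rank order is unchanged, so the Spearman coefficient is 1 even though the last value differs by 30. This is why the difference series is inspected in addition to the correlation.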
3. Results
In total, 12,630 time series were extracted both before and after the implemented change to the Google Trends API, plotted with the application of LOESS and visually inspected for potential trends. These figures are publicly available at https://doi.org/10.25415/ujhb.20424642.v2. Visual inspection indicated a high degree of similarity between the extracted data points for 2019–2021, with divergences in general trends occurring more frequently in the data from 2022 onward (see Figure 2 for an example).
For the extracted timeframes, a high correlation was observed between the data extracted before and after the update for the years 2019–2021, with respective median correlation values [interquartile range] of 0.955 [0.93; 0.98], 0.961 [0.93; 0.98] and 0.956 [0.93; 0.98] for these years (Figure 3). However, for the first months of 2022, the median correlation for the 421 included search terms was much lower, 0.262 [0.04; 0.53].
Since 177 data points (corresponding to weekly search activity) were extracted from the Google Trends API for each time series, a total of 2,235,510 data points were included in this study, of which ~7.42% (165,953) were identified as anomalies using the AnomalyDetection package for R. Plots of the data points identified as anomalies in the difference plots are publicly available at https://doi.org/10.25415/ujhb.20430924.v1. Some anomalies in a constructed difference time series are expected, because Google updates the uniformly distributed random sample of searches from which the data are extracted on a daily basis. This was the case for the anomalies detected for 2019–2021 (Table 2). However, most (79.40%) of the anomalies detected in the collected data originated in 2022. The median values of the anomalous differences between the two extractions were similar for 2019–2021, while the median for 2022 was double that of previous years.
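The dataset size follows directly from the study design, and the tallying of anomalies by year can be sketched as below. Note that the flagging rule used here (distance from the median in units of the scaled median absolute deviation) is a deliberately simple stand-in and is not the Seasonal Hybrid ESD test implemented in the AnomalyDetection package. The threshold, the function names and the example values are all hypothetical.

```python
from collections import Counter
from statistics import median

# Dataset size implied by the design: 421 terms x 30 countries x 177 weeks.
n_series = 421 * 30
n_points = n_series * 177
anomaly_share = 165953 / n_points


def flag_anomalies(values, threshold=3.0):
    """Flag points further than threshold * scaled MAD from the median."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return [False] * len(values)
    scaled_mad = 1.4826 * mad  # consistency constant for normally distributed data
    return [abs(v - centre) / scaled_mad > threshold for v in values]


def share_by_year(years, flags):
    """Return the fraction of all flagged points that falls in each year."""
    counts = Counter(year for year, flagged in zip(years, flags) if flagged)
    total = sum(counts.values())
    return {year: count / total for year, count in counts.items()} if total else {}


# Hypothetical difference series with two observations per year.
years = [2019, 2019, 2020, 2020, 2021, 2021, 2022, 2022]
diffs = [0.1, -0.2, 0.0, 0.2, -0.1, 0.1, 5.0, 6.0]

flags = flag_anomalies(diffs)
print(n_series, n_points, round(100 * anomaly_share, 2))
print(share_by_year(years, flags))
```

With the reported counts, the arithmetic reproduces 12,630 series, 2,235,510 data points and an anomaly share of about 7.42%.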
All 30 countries included in this investigation returned an increased number of anomalies in the 2022 data, with between 46.09% (China) and 96.18% (India) of the anomalies in these time series occurring in 2022 (Table 3).
4. Discussion
The Google Trends API gives researchers the ability to access search trends from most countries around the world. Little is known regarding the sampling strategy that Google implements to construct the GHT database, apart from the statement in the GHT API Getting Started Guide:
“Numbers are calculated on a uniformly distributed random sample of Google web searches done since 2004, updated once a day, thus there may be some variance between similar requests” [10].
As such, fluctuations in data retrieved on different extraction days are expected. Although such variance can affect data for a specific search term on a specific day, general trends in time series remain highly correlated between data extracted on different days. For the two datasets extracted before and after the changes were made to the Google sampling strategy, a high degree of correlation was observed for the data for 2019–2021 (Figure 3). This is in line with the notification received about the changes made to the sampling strategy: in its email, Google indicated that the changes would only affect data from 1 January 2022 onward (Supplementary Email S1).
These changes in the sampling strategy resulted in a greater range of correlation values between the older and newer datasets for the year 2022 to date (Figure 3A,E), as well as a lower median correlation value. The low similarity between the data extracted before and after the change in sampling strategy is indicative of the implemented change to the sample from which the Google Trends data are retrieved. By detecting anomalies in the difference between these two time series, it was possible to show that the changes implemented to the GHT sampling strategy mostly increased the returned values (Table 2), with the median value of these unexpected differences in 2022 being double the value of previous years. In the 30 countries investigated, the majority of unexpected data points from the differenced time series occurred in 2022 (Table 3). Visual inspection of the plotted time series showed that most search terms followed an increasing trend during the first months of 2022.
Since this newly implemented change to the sampling strategy results in predominantly higher search volumes being returned, data for the year 2022 extracted prior to 18 July 2022 can no longer be compared with data extracted after this date. However, the high level of correlation for previous years indicates that, in most cases, comparative studies focused on dates prior to 1 January 2022 could still be accurate, considering the minor variance introduced by Google’s daily updates to the sample dataset. As mentioned elsewhere [11], caution should be exercised in the interpretation of single extractions of GHT API data, in which sampling variance may be falsely interpreted as changes in search trends. Therefore, it is advised that extractions of GHT API data be repeated on different dates and analyzed accordingly.
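One simple way to act on this advice is to summarize repeated extractions per time point, so that sampling variance becomes visible alongside the trend. The sketch below is a minimal illustration under that assumption. The function name and the values are hypothetical, and other summaries (for example, medians or ranges) may be equally appropriate.

```python
from statistics import mean, stdev


def summarise_extractions(extractions):
    """Return (mean, sample standard deviation) per time point across extractions."""
    summary = []
    for values in zip(*extractions):
        summary.append((mean(values), stdev(values)))
    return summary


# Three hypothetical extractions of the same three weeks, made on different dates.
extractions = [
    [10, 20, 30],
    [12, 20, 33],
    [11, 20, 27],
]

summary = summarise_extractions(extractions)
for week, (avg, spread) in enumerate(summary, start=1):
    print(week, avg, spread)
```

A change between two periods that is small relative to the spread across extraction dates would then be treated with caution rather than read as a shift in search behavior.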
The presented study was not without limitations. Owing to the short timeframe between the announcement that the GHT sampling strategy would change and the date of implementation of these changes, only data from a single extraction prior to the implemented change could be analyzed. It is also uncertain which increases were due to the changes made to GHT and which were attributable to chance variation in the GHT sample dataset on the day of extraction. Although this limits the quantification of the changes made to the sampling algorithm, the results indicate that the changes impacted the data obtained from the service, that search probabilities increased for most search terms from 1 January 2022 onward, and that comparative studies using data extracted after the implemented changes should be interpreted with caution.
5. Conclusions
This is the first report providing evidence that the recent changes to the sampling strategy implemented by Google impacted the comparability of GHT API data, particularly for comparisons of search trends from before and after 1 January 2022. Although the improved sampling strategy may result in a more accurate representation of search trends, caution should be exercised regarding any increased search trends observed for dates from 1 January 2022 onward in data extracted after 18 July 2022. Furthermore, it is not possible to determine whether these changes indeed give a more representative view of the use of the Google search engine by individuals. Although such changes may impact current research activities involving the GHT API, the improved sensitivity that may arise from this change and the benefits of having an improved GHT API may, in the future, result in better predictions, which could be especially useful when using the Google Trends API for public health monitoring.
Funding
This research received no external funding.
Institutional Review Board Statement
No ethical review was required, as the data used in this research were extracted from publicly available resources.
Informed Consent Statement
Not applicable.
Data Availability Statement
Acknowledgments
The author wishes to express his gratitude to the Institute for Intelligent Systems, University of Johannesburg, for supporting this open access publication, and to Esmé Grobler for the language editing of the manuscript.
Conflicts of Interest
The author declares that there is no conflict of interest.
References
1. International Telecommunication Union. Global Connectivity Report 2022; International Telecommunication Union: Geneva, Switzerland, 2022; p. 186.
2. Google. FAQ about Google Trends Data. Available online: https://support.google.com/trends/answer/4365533?hl=en (accessed on 22 September 2022).
3. Cebrián, E.; Domenech, J. Is Google Trends a quality data source? Appl. Econ. Lett. 2022, 29, 1–5.
4. Hogue, J.; DeWilde, B. pytrends. 2022. Available online: https://pypi.org/project/pytrends/ (accessed on 5 January 2022).
5. Fulk, A.; Romero-Alvarez, D.; Abu-Saymeh, Q.; Saint Onge, J.M.; Peterson, A.T.; Agusto, F.B. Using Google Health Trends to investigate COVID-19 incidence in Africa. PLoS ONE 2022, 17, e0269573.
6. Romero-Alvarez, D.; Parikh, N.; Osthus, D.; Martinez, K.; Generous, N.; Del Valle, S.; Manore, C.A. Google Health Trends performance reflecting dengue incidence for the Brazilian states. BMC Infect. Dis. 2020, 20, 252.
7. Farkhad, B.F.; Nazari, M.; Chan, M.S.; Albarracin, D. State health policies and interest in PrEP: Evidence from Google Trends. AIDS Care 2022, 34, 331–339.
8. Vallis, O.; Hochenbaum, J.; Kejariwal, A.; Rudis, B.; Tang, Y. AnomalyDetection: Anomaly Detection Using Seasonal Hybrid Extreme Studentized Deviate Test; R package version 2.0.1. 2018.
9. Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: Berlin/Heidelberg, Germany, 2016.
10. Google. Health Trends API Getting Started Guide. Available online: https://sites.google.com/a/google.com/health-trends-api-getting-started-guide/?pli=1 (accessed on 22 September 2022).
11. Raubenheimer, J.E. Google Trends Extraction Tool for Google Trends Extended for Health data. Softw. Impacts 2021, 8, 100060.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).