3.1. A Simulation Study
The proposed scan statistic is first applied in the year 2004 to the simulated datasets used in [14], which have been specifically designed for evaluating and comparing the statistical power of spatiotemporal disease outbreak detection methods. These simulated datasets can be found at http://www.satscan.org/datasets (accessed on 25 December 2021).
These datasets were simulated by Martin Kulldorff based on the geography and population of New York City, and include the effects of outbreaks of a hypothetical disease of varying size and location. Accordingly, geographic coordinates (representing the approximate center of each zip code) and population numbers for 176 New York City zip codes were used for these datasets. Assuming that outbreaks occurred in seventeen different locations in New York City, each with either a high or a moderate risk, 34 data files were created for a period of 31 days, with a random number of cases of the hypothetical disease. Each file contains 1000 simulated datasets. For each dataset, the total number of randomly allocated cases is 100 multiplied by the number of days (i.e., 31 × 100 = 3100 cases). The number 100 was chosen to reflect the occurrence rate of certain syndromes common to the syndromic surveillance system of New York City emergency department visits.
In the null model scenario of this context, each person living in New York City is equally likely to contract the disease, and hence each case is assigned to a particular zip code on any given day with a probability proportional to the population of that zip code. When generating data for geographically localized outbreaks, an increased risk was assigned to the zip codes in which the outbreaks are assumed to have occurred. Consequently, for each such zip code and day combination, the corresponding population was multiplied by an assigned relative risk.
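The allocation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the generator used to produce the benchmark files: the function name `allocate_cases`, its parameters, and the toy populations are hypothetical, and the sketch assumes that all cases of a dataset are drawn independently over zip code and day cells with weights equal to the (risk-inflated) populations.

```python
import random
from collections import Counter


def allocate_cases(populations, n_days=31, cases_per_day=100,
                   outbreak_zips=frozenset(), outbreak_days=None,
                   relative_risk=1.0, seed=None):
    """Allocate n_days * cases_per_day cases to (zip_code, day) cells.

    Hypothetical helper: each cell is weighted by the zip code population,
    multiplied by `relative_risk` when the cell lies inside the outbreak.
    With relative_risk=1.0 this reduces to the null model.
    """
    rng = random.Random(seed)
    days = range(1, n_days + 1)
    if outbreak_days is None:
        # Assumption: the outbreak covers the whole period unless stated.
        outbreak_days = set(days)

    cells, weights = [], []
    for zip_code, population in populations.items():
        for day in days:
            in_outbreak = zip_code in outbreak_zips and day in outbreak_days
            cells.append((zip_code, day))
            weights.append(population * (relative_risk if in_outbreak else 1.0))

    total_cases = n_days * cases_per_day
    counts = Counter(rng.choices(cells, weights=weights, k=total_cases))
    # Report every cell, including those that received no cases.
    return {cell: counts.get(cell, 0) for cell in cells}


# Toy example: the population figures and the relative risk are invented.
toy_populations = {"11211": 90000, "10044": 12000, "10314": 85000}
dataset = allocate_cases(toy_populations, outbreak_zips={"11211"},
                         relative_risk=3.0, seed=1)
print(sum(dataset.values()))  # 3100 = 31 days x 100 cases
```

Setting `relative_risk=1.0` reproduces the null model, in which only the population sizes drive the allocation.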
This study attempts to detect, within a larger geographic region, the areas in which a disease outbreak has occurred, using the improved Spatiotemporal Mann-Whitney Scan Statistic. The wider geographical region includes areas belonging to the four main boroughs of New York City: Brooklyn (A), Manhattan (B), Staten Island (C) and the Bronx (E). The simulated datasets that were created assuming that an outbreak occurred only in Williamsburg, Brooklyn (A), were selected for applying the novel spatiotemporal MW scan statistic. The chosen datasets are simulated with a high risk of outbreak for a period of 31 days. Of these datasets, 10 were chosen randomly for this study.
The main objective in this scenario is to determine whether the exact outbreak zip code, 11211, of area A is identified by this scan statistic. If this zip code is not included in the most likely cluster, secondary clusters are found using the method introduced in [1] until this area is detected.
According to the results (see Appendix A), all the zip codes identified in the most likely cluster of each sample belong to area A. In other words, out of the four boroughs considered, the scan statistic detected A as the outbreak area in all cases. Hence, the Spatiotemporal MW Scan Statistic performs well in detecting the area of disease outbreak in a larger geographical region. Furthermore, on average, it detects four areas surrounding the exact outbreak zip code in the most likely cluster. Moreover, the Spatiotemporal MW Scan Statistic identifies the exact outbreak zip code in the most likely cluster 80% of the time. Therefore, it is reasonable to suggest that the Spatiotemporal MW Scan Statistic can effectively detect the exact disease outbreak zip code in the majority of cases. In the samples where the exact zip code was not detected in the most likely cluster, it was detected in the first secondary cluster, implying that the spatiotemporal scan statistic is able to detect the exact outbreak zip code in one of its significant clusters.
3.2. An Application to COVID-19 Data
Second, we applied the proposed scan statistic to reported COVID-19 cases in China for the period from January 2020 to May 2020. This dataset contains cases reported in 33 major spatial areas in China. The data were extracted from the data repository of the Center for Systems Science and Engineering at Johns Hopkins University.
Our main objective in this context is to determine disease outbreak areas with a significantly higher number of reported cases using the improved Spatiotemporal Mann-Whitney Scan Statistic. According to the results, 30 out of 33 locations are included in the most significant cluster. To further confirm these results, we produced the following spatiotemporal visualizations.
The chart in Figure 1 was obtained excluding the Hubei region, since it was an outlier with a significantly higher number of cases over the period. The regions that are not included in the most significant cluster are boxed in Figure 1. According to the visualization, these regions show little or no fluctuation over the considered period compared to the other regions. Furthermore, no significant peaks of cases can be seen in those regions over the five months.
Table 1 shows the spatial averages of each region in ascending order. Accordingly, three of the areas which are not included in the cluster have the lowest spatial averages.
Moreover, according to Figure 2, the three areas which are not included in the cluster are located away (towards the left) from the rest of the locations. Even though the regions Macau and Ningxia have relatively low spatial averages, they are located closer to the larger outbreak areas and hence they could have been included in the most significant cluster.
3.3. A Comparison Study
Third, we conducted a comparison study on the simulated datasets mentioned in Section 3.1 to assess and compare the performance of the novel Spatiotemporal Mann-Whitney Scan Statistic with that of the widely used Space Time Scan Statistic and Space Time Permutation Scan Statistic. Because no common measure is available that can be used for both parametric and non-parametric scan statistics, a novel performance measure is introduced for this comparison.
This novel performance measure is introduced as a Total Score which integrates four sub-scores concerned with the following four major aspects in identifying disease outbreak areas.
As stated in Section 3.1, the data are simulated assuming that outbreaks occurred in the five zip codes of Williamsburg, Roosevelt Island, Bulls Head, La Guardia and West Farms of the five major boroughs Brooklyn (A), Manhattan (B), Staten Island (C), Queens (D) and the Bronx (E), respectively. Among these five major boroughs, Manhattan (B) and the Bronx (E) are relatively small areas. When an outbreak occurs in a zip code of a smaller area, it is highly likely that the disease spreads quickly across the area and that cases are reported from several parts of it. Thus, such a small area can be quickly detected when assessing the number of incidents.
Therefore, if a scan statistic can identify at least one zip code belonging to only one of the areas B and E, Score 1 is equal to 1, whereas if it can identify at least one zip code in each of the areas B and E, Score 1 is equal to 2. Hence, Score 1 can take the values 0, 1 or 2.
In contrast to Score 1, Score 2 is used to assess the ability of scan statistics to detect larger areas given that an outbreak occurred in a particular zip code. Unlike the above scenario, when an outbreak occurs in a zip code of a larger area, it is unlikely that the disease spreads to the entire region or that cases are reported from the entire region. Thus, large areas such as A, C and D will not be quickly identifiable. Therefore, for each of the areas A, C and D in which a scan statistic can identify at least one zip code, 2 points are awarded. Thus, Score 2 can take the values 0, 2, 4 or 6.
Irrespective of whether the area of the outbreak zip code is small or large, a scan statistic should be rewarded if it can identify the zip codes surrounding the outbreak zip code. Identifying the zip codes surrounding the exact outbreak zip codes helps to narrow down the region in which health officials should take action. Thus, Score 3 is defined to assess the ability of scan statistics to detect zip codes adjacent to the exact outbreak zip code. Accordingly, the standard adjacent zip codes of each outbreak zip code were identified, and 0.1 points are awarded for each outbreak zip code for which a scan statistic can identify at least one adjacent zip code. Since there are five outbreak zip codes, Score 3 can take the values 0, 0.1, 0.2, 0.3, 0.4 or 0.5.
Score 4 serves to assess the ability of the scan statistic to detect the exact zip codes in which the outbreak occurred. If a scan statistic can identify the exact zip code in one of its significant clusters, it should be given extra points. Accordingly, for the identification of each exact zip code 0.3 marks are given. Hence, Score 4 can take the values 0, 0.3, 0.6, 0.9, 1.2 or 1.5.
The total score, which is the sum of the four above-mentioned sub-scores, can be used to assess the overall performance of the scan statistics in detecting disease outbreak areas. Accordingly, the total score can take any value between 0 and 10, the maximum being 2 + 6 + 0.5 + 1.5 = 10. In the presence of a single data sample, the total scores obtained for each technique can be compared directly: the higher the total score, the better the performance of the scan statistic. In the presence of many samples, the total scores are compared based on their coefficient of variation across the samples: the lower the coefficient of variation, the better the performance of the scan statistic.
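The scoring rules above can be illustrated with a short Python sketch. It reflects one reading of the definitions (1 point per small area detected, 2 points per large area detected, 0.1 per outbreak zip code with a detected neighbor, 0.3 per exact outbreak zip code detected); the function names and argument layout are hypothetical, and the coefficient of variation is assumed here to use the sample standard deviation.

```python
from statistics import mean, stdev

SMALL_AREAS = ("B", "E")       # Manhattan, the Bronx
LARGE_AREAS = ("A", "C", "D")  # Brooklyn, Staten Island, Queens


def total_score(areas_detected, adjacent_hits, exact_hits):
    """Sum the four sub-scores for one data sample.

    areas_detected: set of area labels with at least one detected zip code.
    adjacent_hits:  number of outbreak zip codes (0-5) for which at least
                    one adjacent zip code was detected.
    exact_hits:     number of exact outbreak zip codes (0-5) detected.
    """
    score1 = 1 * sum(area in areas_detected for area in SMALL_AREAS)
    score2 = 2 * sum(area in areas_detected for area in LARGE_AREAS)
    score3 = 0.1 * adjacent_hits
    score4 = 0.3 * exact_hits
    # Round to one decimal to hide floating-point noise in the 0.1 steps.
    return round(score1 + score2 + score3 + score4, 1)


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean (needs >= 2 scores)."""
    average = mean(scores)
    if average == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return stdev(scores) / average


# Illustrative inputs only; these are not results from the study.
print(total_score({"A", "B"}, adjacent_hits=2, exact_hits=1))
print(coefficient_of_variation([6.4, 7.0, 5.8]))
```

A method that detects every area, every neighborhood and every exact zip code reaches the maximum of 10 under this reading, which matches the stated range of the total score.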
According to the results (Appendix B), both the Space Time Scan Statistic and the Space Time Permutation Scan Statistic perform better in identifying outbreaks in narrow areas than in wider areas. The Space Time Permutation Scan Statistic performs well in detecting the surrounding areas of exact outbreak zip codes, while the Space Time Scan Statistic performs well in identifying exact outbreak zip codes. The experimental results of Section 3.1 reveal that the novel Spatiotemporal MW Scan Statistic performs well in detecting the area of disease outbreak in a larger geographical region. Moreover, it can effectively detect the exact disease outbreak zip code in the majority of cases. Even when the exact outbreak zip code was not detected in the most likely cluster, it was detected in one of the secondary clusters by this scan statistic.