1. Introduction
Over the past decades, the serious pollution caused by rapid global economic development has attracted the attention of researchers and the general public worldwide. According to the World Health Organization [1], approximately 90% of people worldwide breathe air that does not meet air quality standards, and around three million deaths each year are related to outdoor air pollution. Particles suspended in the air can harm the human respiratory and cardiovascular systems, possibly leading to cardiopulmonary diseases [2,3]. The concentration of PM2.5 reflects the degree of contamination. Full recognition and accurate forecasting of PM2.5 concentration can guide the government in taking timely action to reduce pollutant emissions and prevent hazardous exposure for the public; thus, related methods have been studied extensively [4,5,6,7].
Methods for predicting PM2.5 concentrations can be divided into two types: empirical and physical models. Physical models adopt meteorological principles and statistical methods to simulate the emission, dispersion, transformation, diffusion, and removal of pollutants [7,8], and thereby predict the spatiotemporal distribution of air pollutants. Physical models provide explicit insight into the physical-chemical processes of the diffusion and transformation of multiple pollutants and present the direct link between pollutant emission and air pollution. However, they depend on prior knowledge and on the accuracy of emission data, which may introduce errors. Comparative analyses have demonstrated that well-developed site-specific empirical models achieve higher prediction accuracy than physical methods [9]; thus, we explored an empirical model to predict air quality in this study. Empirical studies avoid complicated theoretical models and simply adopt statistics-based methods to predict air quality, and many empirical models have been proposed. The air quality and weather of surrounding sites affect the PM2.5 concentration of the central site because pollutants and other meteorological elements diffuse across areas. This effect is referred to as spatial correlation. Traditional empirical methods address spatial correlation in two ways. One is to consider all sites in the study area [10,11]. However, sites far from the target site can reduce the prediction accuracy of the models, and these methods have difficulty handling prediction tasks across a large area. The other considers only the K-nearest neighbor (KNN) sites and has been extensively applied in PM2.5 forecasting to select the most related sites [12,13,14]. In these studies, the distance between the target site and each other site was calculated, and the K nearest sites were selected and input to the model for training. This approach decreases the number of input features and avoids interference from less relevant sites. The most popular distance criterion is geographical distance. However, pollutant diffusion is driven by wind, and these models neglected wind when selecting related sites.
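The KNN site selection by geographical distance described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited studies: the haversine formula is one common choice of geographical distance, and the function names and station coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def k_nearest_sites(target, sites, k):
    """Return the ids of the k sites closest to `target` by geographical distance."""
    ranked = sorted(
        (sid for sid in sites if sid != target),
        key=lambda sid: haversine_km(*sites[target], *sites[sid]),
    )
    return ranked[:k]

# Hypothetical (lat, lon) coordinates; these are not the real station locations.
sites = {
    "A": (39.90, 116.40),
    "B": (39.91, 116.41),
    "C": (39.95, 116.50),
    "D": (40.10, 116.80),
}
nearest = k_nearest_sites("A", sites, k=2)
print(nearest)
```

Because the ranking depends only on position, the selected neighbors never change with the weather, which is the limitation the wind-based criterion is meant to address.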
Researchers have extensively used neural network methods to predict air quality [15,16,17,18,19]. Most of these methods are based on multivariate linear models and the multilayer perceptron, which are not designed for time series data (e.g., PM2.5 concentration). They achieved lower accuracy than recurrent neural networks, such as the long short-term memory (LSTM) neural network, in handling temporal correlation [20]. Compared with ordinary recurrent neural networks, LSTM [21] inherently captures long- and short-term trends and mitigates the exploding and vanishing gradient problems; thus, it is widely used in time series forecasting [13]. Li and Peng [11] fed the historical PM2.5 concentrations of 12 monitoring sites in Beijing, together with their meteorological data, into an LSTM model and simultaneously predicted the hourly PM2.5 concentration of all sites. Soh [14] and Zhao [22] employed LSTM and fully connected (FC) layers to extract spatiotemporal correlation features from the historical air quality and meteorological data of target and neighbor sites. They found that LSTM has higher prediction accuracy than other models, such as the ordinary recurrent neural network and support vector regression. However, spatial correlation was not fully considered in the above studies. The FC layer is designed for processing 1D features; therefore, high-dimensional spatial features can be input into FC only after flattening, possibly causing loss of spatial information. By comparison, the convolutional neural network (CNN) [23] extracts spatial correlation features by convolving adjacent elements of high-dimensional matrices with moving kernels and can therefore make fuller use of spatial information. CNN has been applied widely in computer vision to capture the spatial features of images [24,25]. When the data of multiple related sites are integrated into a matrix, its elements are spatially correlated; thus, CNN is suitable for extracting spatial features of multiple sites. Qin and Yu [26] proposed a two-step forecasting model that combined CNN, to capture the spatial features of the air quality and meteorological data of multiple sites, with LSTM, to learn the temporal correlations. However, instead of predicting the PM2.5 concentration of a specific site, this model predicted the concentration for the target city as a whole and thus cannot provide the spatial distribution of pollution. Huang and Kuo [27] proposed the APNet model, which used CNN to capture the correlation among the air quality and meteorological data of the target station and LSTM to extract temporal features. Wen and Liu [28] used CNN to capture the spatial correlation of the PM2.5 data of multiple stations and then used LSTM to capture the temporal tendency. However, neither study sufficiently considered the spatiotemporal air quality and meteorological data of surrounding sites.
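The contrast drawn above between flattening for an FC layer and convolving with a moving kernel can be illustrated with a small sketch. It assumes a hypothetical 3 × 4 site-by-feature matrix and a fixed 2 × 2 averaging kernel; in a trained CNN the kernel weights are learned, and, as in most deep learning frameworks, the operation shown is technically a cross-correlation.

```python
def conv2d_valid(matrix, kernel):
    """Slide `kernel` over `matrix` (no padding, stride 1) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [
        [
            sum(matrix[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical matrix: rows are sites ordered by proximity, columns are features.
site_matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
]
kernel = [[0.25, 0.25], [0.25, 0.25]]  # fixed averaging kernel, for illustration only

features = conv2d_valid(site_matrix, kernel)          # each output mixes adjacent sites
flattened = [v for row in site_matrix for v in row]   # what an FC layer would receive
print(features)
print(flattened)
```

Each convolution output combines values from neighboring rows and columns, so adjacency in the matrix is used directly. In the flattened vector, the row boundaries are no longer explicit and the FC layer has to learn any neighborhood structure from data.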
The current study proposes an LSTM-CNN multi-step-ahead forecasting model using dynamic wind field distance (LSTM-CNN-DWFD) to fill this gap. First, we selected the K nearest sites as neighbor sites in accordance with a defined dynamic wind field distance instead of the commonly used geographical distance. Second, we built an LSTM-CNN hybrid neural network to provide predictions. We input the historical air quality and meteorological conditions of the target site and neighbor sites into a local LSTM and FC to capture the temporal features of each site. Then, we used CNN to capture spatiotemporal features. We concatenated the spatiotemporal features with the weather forecasts and input the result into FC to predict the hourly PM2.5 concentration of the target site over the next 24 h. We selected the PM2.5 and weather data of Beijing collected from 1 May 2014 to 30 April 2015 as experimental data and conducted six-fold rolling origin experiments on 36 target monitoring stations. Comparisons with six other methods confirmed the effectiveness and superiority of the proposed model in predicting PM2.5 concentration. This study makes the following contributions: (1) when using KNN to select related sites, wind impact was taken into consideration by replacing geographical distance with dynamic wind field distance; (2) spatiotemporal data, which include the air quality and weather data of the target and neighbor sites, were fully considered in the proposed LSTM-CNN-DWFD model. The results show that the proposed method was more effective in selecting related sites and extracting spatiotemporal features and achieved higher accuracy in multi-scale predictions of PM2.5 concentration than the other methods.
3. Results
The experimental data included the hourly air pollutant observations, meteorological factor observations, and weather forecasts of the 36 stations in Beijing from 1 May 2014 to 30 April 2015. A total of 8760 samples were collected for each station; Section 2.1 describes the data in detail. Because evaluation results based on a single forecast origin can be unreliable when the forecasts are sensitive to randomness and systematic business cycle effects [36], rolling origin has become a widely used evaluation technique in time series studies [22,37,38]. In the rolling origin method, the time series data are divided into several periods. The first several periods are selected as the training set, and the next period is selected as the test set. The forecasting origin then moves to the next period in turn, and forecasts are produced from each origin [39]. The rolling origin method partially controls for effects arising from a particular origin. In this study, considering the required sample size, the moving window and the forecast window were both set to one month, and six-fold rolling origin experiments were conducted on each station in Beijing. The results of all 216 experiments (36 stations × 6 folds) were used to evaluate the performance of the proposed model. Table 1 shows the time span of each training set and test set. The training and test sets of each fold were determined in time order, which guarantees that they do not overlap.
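The rolling origin procedure can be sketched as below. The sketch assumes an expanding training window (all months before the test month); the fold boundaries actually used are those listed in Table 1, and the helper names are hypothetical. The test months it produces, November 2014 to April 2015, match the test periods described in the Discussion.

```python
def month_range(start, n):
    """Return n consecutive (year, month) pairs beginning at `start`."""
    year, month = start
    months = []
    for _ in range(n):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def rolling_origin_folds(months, n_folds):
    """Expanding-window rolling origin: each of the last n_folds months is a test set once."""
    first_test = len(months) - n_folds
    return [
        {"train": months[:first_test + k], "test": months[first_test + k]}
        for k in range(n_folds)
    ]

months = month_range((2014, 5), 12)            # May 2014 .. April 2015
folds = rolling_origin_folds(months, n_folds=6)
for i, fold in enumerate(folds, start=1):
    print(f"Fold{i}: train {fold['train'][0]}..{fold['train'][-1]}, test {fold['test']}")

n_experiments = 36 * len(folds)                # 36 stations, one model per station and fold
```

Because every test month lies strictly after its training months, no observation is used for both fitting and evaluation within a fold.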
Before building the LSTM-CNN-DWFD model, several parameters must be preset. According to Section 2.2.1, setting the parameter in the definition of dynamic wind field distance to 3 or 1 causes the wind impact to be considered inadequately or excessively. Therefore, we set this parameter to 2 in the following experiments. The best values of the other parameters were determined via rolling origin experiments on Site 1: the number of selected neighbor sites K was set to 9, and the time lag was set to 3, 6, and 12 for the 1–6, 7–12, and 13–24 h predictions, respectively. To avoid overfitting of the neural network, a dropout layer and early stopping were employed in our experiments.
To demonstrate the effectiveness of the proposed LSTM-CNN-DWFD model, Section 3.1 presents the spatiotemporal distribution of its prediction error, which details its prediction performance. Section 3.2 and Section 3.3 present a series of comparisons with six other methods to show its advantages in extracting spatiotemporal features and to confirm the superiority of the proposed KNN-DWFD method in selecting related sites.
3.1. Performance of LSTM-CNN-DWFD Model
This subsection shows the forecasting performance of the LSTM-CNN-DWFD model. Figure 7 plots the predicted PM2.5 concentration against the observations at Site 1 from 1 March 2015 to 31 March 2015. The red and blue lines represent the observations and forecasts of PM2.5 concentration, respectively. The R² values of the next 1–6 h predictions were 0.85, 0.81, 0.76, 0.70, 0.64, and 0.59, respectively; accordingly, the variance explained by our model decreases as the prediction time increases. As shown in Figure 7a–f, as the prediction time increases, the predicted peak around 9 March (the circles in Figure 7) falls progressively further below the true value; hence, the prediction accuracy decreases. Additionally, Figure 7g,h shows that, during 13–24 h, the range of the predicted PM2.5 concentration was larger than that in 7–12 h. In other words, the level of uncertainty increases in 13–24 h. However, the mean of the observations of each interval (red lines) consistently fell within the predicted range.
Table 2 shows the prediction error of the LSTM-CNN-DWFD model in each fold. The lowest RMSE and MAE of all folds occurred at the next 1–6 h, followed by the values at 7–12 h, whereas the values at 13–24 h were typically the highest. Thus, the prediction accuracy tends to decrease as the prediction time increases. The average RMSE (MAE) values of the 216 experiments were 43.90 (29.17), 57.89 (42.16), and 63.14 (47.64) for 1–6, 7–12, and 13–24 h, respectively. The standard errors of RMSE (MAE) across the 36 stations were 9.48 (6.84), 12.13 (9.16), and 12.73 (9.88), respectively, which illustrates the stability of the LSTM-CNN-DWFD model. Additionally, the RMSEs of the 1–6, 7–12, and 13–24 h predictions in Fold1–4 ranged over 45–52, 58–73, and 62–82, respectively, whereas in Fold5–6 they ranged over 34–35, 42–47, and 44–52, respectively. Hence, the RMSEs of Fold5–6 were generally lower than those of the other folds. The MAE results show the same trend. Therefore, the prediction accuracy in spring (Fold5–6) tends to be higher than in autumn and winter (Fold1–4).
Figure 8 presents the spatial distribution of the prediction error in each fold for the next 1–6 h prediction task. The bluer the color, the lower the prediction error (RMSE or MAE).
As shown in Figure 8, the site colors become bluer as the fold number increases; that is, the prediction error decreases. The spatial distribution of the prediction error of all folds shows that the RMSE and MAE of southern sites were higher than those of northern sites. Hence, the prediction performance at northern sites is better than that at southern sites.
3.2. Effectiveness of LSTM-CNN-DWFD in Extracting Spatiotemporal Correlation
We proposed a multi-step-ahead forecasting model LSTM-CNN-DWFD in this paper. The LSTM-CNN part was designed to extract spatiotemporal correlation from the input data. To show the effectiveness of its architecture, we conducted three groups of comparison experiments between LSTM-CNN and five baseline models. We chose the neighbor sites in accordance with geographical distance in all models.
(a) Evaluate the effect of neighbor sites’ data: LSTM-NN versus LSTM-CNN, CNN, and LSTM-FC [22]. LSTM-NN adopted an LSTM layer to capture the temporal trend of the historical data of the target station and used an FC layer to integrate weather forecasts. It did not consider the spatiotemporal data of neighbor sites. By contrast, CNN, LSTM-FC, and LSTM-CNN took the historical data of the target and neighbor stations as well as weather forecasts as input.
(b) Evaluate the effectiveness in feature extraction: LSTM-CNN versus CNN and LSTM-FC. The input data of these models were the same, but their neural network differed. CNN adopted a convolution layer and a pooling layer to capture spatiotemporal features and integrated weather forecasts by an FC layer. The LSTM-FC separately trained local LSTM models in the target and neighbor sites similar to LSTM-CNN model. However, it adopted FC to extract the spatial feature from the outputs of local LSTM models. Finally, the weather forecasts were also integrated by FC layer in LSTM-FC.
(c) Evaluate the effect of weather forecasts: LSTM-NN versus LSTM and APNet [27]. LSTM and APNet took only the historical data of the target station as input, thereby neglecting the effect of weather forecasts. LSTM used an LSTM layer to capture the temporal correlation of the input. APNet first used three 1D-convolution and batch normalization layers to compress the input data; it then used LSTM to capture temporal features.
For fairness, all LSTM layers used above were stateful. These models were trained and tested on the 36 stations using six-fold rolling origin, and the mean RMSE, MAE, and R² over the 216 experiments indicate their general prediction performance.
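For reference, the three evaluation criteria can be computed as in the following sketch. It assumes the standard definitions of root mean square error, mean absolute error, and the coefficient of determination; the exact formulas used in this study are those given in the methods section, and the sample values are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def r_squared(obs, pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly PM2.5 observations and forecasts for one experiment.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
scores = {"RMSE": rmse(obs, pred), "MAE": mae(obs, pred), "R2": r_squared(obs, pred)}
print(scores)
```

RMSE penalizes large errors more heavily than MAE, which is why the two criteria can rank models differently, while R² expresses the share of the observed variance that the forecasts explain.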
Table 3 shows the prediction error of the different methods in the three prediction intervals. The best prediction performance (the smallest RMSE and MAE and the largest R²) in each column is marked in boldface. As shown in Table 3, LSTM-NN showed higher RMSE and MAE and lower R² at all prediction scales than CNN, LSTM-FC, and LSTM-CNN. This result is explained by LSTM-NN neglecting the spatiotemporal correlation of neighbor sites, which lowers its prediction accuracy. The RMSEs of the proposed LSTM-CNN model for 1–6, 7–12, and 13–24 h were 44.68, 58.77, and 63.40, respectively, all lower than those of CNN and LSTM-FC. In terms of MAE and R², LSTM-CNN performed the best in the 1–6 and 13–24 h prediction tasks, with lower MAE and higher R². Therefore, in 1–6 and 13–24 h predictions, LSTM-CNN has the highest prediction accuracy. In the 7–12 h prediction task, CNN, LSTM-FC, and LSTM-CNN showed the best performance in terms of R², MAE, and RMSE, respectively. Therefore, the overall performance of the three models can be regarded as close in 7–12 h predictions. In addition, the RMSE and MAE of APNet and LSTM were clearly higher than those of LSTM-NN, and their R² values were much lower. Hence, neglecting weather forecasts causes a substantial loss of prediction accuracy.
3.3. Effectiveness of LSTM-CNN-DWFD in Selecting Related Sites
The KNN-DWFD part in the LSTM-CNN-DWFD model was designed to select K neighbor sites considering wind impact. This subsection compares the prediction performance of the forecasting model under KNN in accordance with geographical distance (in LSTM-CNN model) and dynamic wind field distance (in LSTM-CNN-DWFD model).
3.3.1. The Difference of KNN and KNN-DWFD in Selecting Neighbor Sites
To demonstrate how KNN-DWFD selects neighbor sites, Figure 9 compares the five nearest sites to Site 1, in accordance with dynamic wind field distance, under different wind directions. Here, * represents Site 1, and the triangles represent the five nearest sites. The bluer the color, the nearer the site and the higher its affect degree.
As shown in Figure 9, Sites 6 and 24 are approximately the same geographical distance from Site 1; however, they lie in different directions, so their affect-degree ranks change with the wind direction. At one time, when the wind direction at Site 1 was northwest, Site 24 was in the upwind area, making it more related to Site 1 than the other sites (excluding Site 2, the nearest surrounding site). By contrast, at another time, when the wind direction at Site 1 had changed to southeast, Site 6 was almost exactly upwind. Hence, the affect degree of Site 6 was higher than that of the other sites (again excluding Site 2). This illustrates that KNN-DWFD can dynamically select the most related surrounding sites according to the dynamic wind field.
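The geometric idea behind this example, that two equidistant sites swap rank when the wind turns, can be sketched as follows. This is only an illustration of the upwind test and is not the dynamic wind field distance defined in Section 2.2.1. It assumes planar coordinates (x east, y north), the meteorological convention that wind direction is the direction the wind blows from, and hypothetical site positions.

```python
import math

def bearing_deg(from_xy, to_xy):
    """Compass bearing (0 = north, 90 = east) from one planar point to another."""
    dx = to_xy[0] - from_xy[0]
    dy = to_xy[1] - from_xy[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def upwind_offset_deg(target_xy, neighbor_xy, wind_from_deg):
    """Angle in [0, 180] between the bearing to the neighbor and the wind-from direction.

    An offset near 0 means the neighbor is upwind of the target; near 180 means downwind.
    """
    diff = abs(bearing_deg(target_xy, neighbor_xy) - wind_from_deg) % 360.0
    return min(diff, 360.0 - diff)

target = (0.0, 0.0)
site_nw = (-5.0, 5.0)   # hypothetical neighbor to the northwest
site_se = (5.0, -5.0)   # hypothetical neighbor to the southeast, at the same distance

for wind_from, label in [(315.0, "northwest wind"), (135.0, "southeast wind")]:
    offsets = {
        "site_nw": upwind_offset_deg(target, site_nw, wind_from),
        "site_se": upwind_offset_deg(target, site_se, wind_from),
    }
    most_upwind = min(offsets, key=offsets.get)
    print(label, offsets, "most upwind:", most_upwind)
```

A purely geographical criterion would treat the two sites identically, whereas any criterion that uses this offset ranks them differently under the two wind directions.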
3.3.2. Comparing Prediction Performance of LSTM-CNN and LSTM-CNN-DWFD
Six-fold rolling origin experiments were performed to build LSTM-CNN and LSTM-CNN-DWFD models at the 36 stations. Table 4 shows the mean RMSE, MAE, and R² at multiple prediction scales, where Models 1 and 2 represent LSTM-CNN and LSTM-CNN-DWFD, respectively. In the 1–6 and 7–12 h prediction tasks, LSTM-CNN-DWFD showed lower RMSE and MAE and higher R², indicating higher prediction accuracy. For the 13–24 h prediction task, the MAE of LSTM-CNN-DWFD was 47.54, slightly higher than that of LSTM-CNN. However, its RMSE was 63.14 and its R² was 34.21%, both better than those of LSTM-CNN. Thus, LSTM-CNN-DWFD was, on the whole, more accurate than LSTM-CNN at all prediction scales. In the 7–12 h prediction, LSTM-CNN, LSTM-FC, and CNN had the lowest RMSE, the lowest MAE, and the highest R², respectively, as shown in Table 3. However, LSTM-CNN-DWFD further improved on LSTM-CNN in all respects: its RMSE, MAE, and R² for the 7–12 h prediction were 57.89, 42.16, and 49.43%, respectively, all better than those of LSTM-CNN, LSTM-FC, and CNN. Therefore, judged by RMSE, MAE, and R² together, LSTM-CNN-DWFD has the best prediction performance.
The density of stations strongly affects the significance of spatial correlations: the higher the density, the nearer the neighbor sites and the more significant the spatial correlation. Table 5, Table 6, and Table 7 show the prediction error of LSTM-CNN and LSTM-CNN-DWFD in regions with different station densities at the three prediction intervals. Here, station density is measured by the number of surrounding sites within 1.5 km of the target station. The number in parentheses is the number of target stations located in the corresponding density region.
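The density measure described above amounts to a radius count, which can be sketched as follows. The sketch assumes planar station coordinates in kilometres and hypothetical station positions; the function name is illustrative only.

```python
import math

def count_within_radius(target, stations, radius_km):
    """Number of other stations whose planar distance to `target` is at most radius_km."""
    tx, ty = stations[target]
    return sum(
        1
        for sid, (x, y) in stations.items()
        if sid != target and math.hypot(x - tx, y - ty) <= radius_km
    )

# Hypothetical planar coordinates in km; not the real station layout.
stations = {"S1": (0.0, 0.0), "S2": (1.0, 0.0), "S3": (0.0, 1.4), "S4": (3.0, 3.0)}
density = {sid: count_within_radius(sid, stations, radius_km=1.5) for sid in stations}
print(density)
```

Grouping the target stations by this count gives the density regions over which the prediction errors are compared.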
The distribution of stations in Beijing is uneven. A total of 15 stations have no more than two sites within 1.5 km, whereas nine stations have more than 12 sites within the same distance. The comparison among Table 5, Table 6, and Table 7 shows that the highest prediction accuracy of both models for the 1–6, 7–12, and 13–24 h prediction tasks occurred in the medium-density region, as indicated by lower RMSE and MAE and higher R². By contrast, the accuracy in the lowest- and highest-density regions was relatively worse.
Nonetheless, for , the s of LSTM-CNN-DWFD at 7–12 and 13–24 h are 62.35 and 67.24, respectively, both of which were lower than that of LSTM-CNN. For and region, the RMSE and MAE of LSTM-CNN-DWFD were generally all lower than LSTM-CNN at multiple prediction scales (except for the at 13–24 h prediction in region), and of LSTM-CNN-DWFD were all higher. Hence, LSTM-CNN-DWFD showed a better prediction accuracy at all prediction scales and all regions with different densities than the other models. In addition, as the density of stations increases, the difference among the RMSE, MAE, and of the two models increases, which means the superiority of LSTM-CNN-DWFD increases in areas where spatial correlation is important.
4. Discussion
This paper proposed a novel PM2.5 forecasting model, LSTM-CNN-DWFD, which uses a hybrid neural network to extract spatiotemporal features and takes wind impact into consideration when selecting related surrounding sites. To demonstrate the advantage of the proposed model, six-fold rolling origin experiments were conducted; Section 3 shows the results. The experimental data were restricted to a single year, and the test sets covered part of autumn (Nov 2014 in Fold1), a whole winter (Dec 2014 to Feb 2015 in Fold2–4), and part of spring (Mar 2015 to Apr 2015 in Fold5–6). As shown in Table 2 and Figure 8, in terms of season, LSTM-CNN-DWFD performed better in spring than in autumn and winter; in terms of space, it performed better in the north than in the south. Similar results were obtained by Zhao [22] and Bai [40], both of whom found that performance was worst in winter, followed by autumn, spring, and summer. Thus, the proposed model is expected to achieve higher accuracy if the test set is expanded to a longer time period. The seasonal difference in prediction accuracy results from variations in the atmospheric environment and human activities. The atmospheric environment in autumn and winter (Nov 2014 to Feb 2015) was more stable than in spring (Mar 2015 to Apr 2015), with a lower temperature (0.96 versus 10.72 °C) and a lower wind speed (7.16 versus 7.72). The stable atmospheric structure in winter led to weak diffusion of PM2.5 in both the horizontal and vertical directions. In addition, human activities in winter (e.g., heating and the use of festival firecrackers) contributed anthropogenic emissions. According to Liang and Zou [41], heating activities in winter have contributed an increase of more than 50% (on average) in PM2.5 concentration in Beijing since 2010. Ye and Chen [42] found that the tradition of setting off firecrackers directly aggravated air pollution during the Chinese New Year. As a result, the variations of PM2.5 concentration in winter were more dramatic and the peaks were higher, as Figure 10 shows. Consequently, the higher contribution of anthropogenic emissions and the higher peaks of PM2.5 concentration in winter made predicting air quality from meteorological factors more difficult, and the prediction accuracy was lower. The spatial difference arises because pollution is worse in the south (as shown in Figure 2) and the monitoring stations in the south are fewer and farther apart than those in the north. Hence, the CNN-based spatial relation extractor cannot capture the spatial dependence well there.
From the comparison between LSTM-CNN and the five baseline models in Table 3, three useful findings can be drawn.
(1) From comparison (a), CNN, LSTM-FC, and LSTM-CNN exhibited lower RMSE and MAE and higher R² than LSTM-NN. This result is explained by neighbor stations strongly affecting the pollution of the target station through the transport of pollutants. Similar conclusions were drawn by Zhao [22] and Wen [28], who compared the performance of models with and without the surrounding sites considered. Therefore, considering related neighbor stations can further improve prediction accuracy.
(2) From comparison (b), LSTM-CNN had higher prediction accuracy than CNN and LSTM-FC, especially in 1–6 and 13–24 h. This result is due to the architecture of the SRE part of LSTM-CNN, which combines LSTM with CNN. Compared with CNN, the LSTM layer in SRE is more suitable for processing time series data. The recurrent cell of LSTM contains an input gate, a forget gate, and an output gate. These three gates enable the recurrent neuron to store the long-term tendency of the input time series while extracting useful short-term tendencies. This was also confirmed in the work of Li and Peng [11]. By contrast, CNN has no recurrent neuron in its architecture and is therefore less suited to learning the temporal dependency of time series data. Hence, LSTM extracts temporal features more efficiently than CNN. By comparing the city-scale air quality prediction performance of CNN-alone and LSTM-alone models, Qin and Yu [26] also found that CNN performed poorly in long-term sequence prediction. In addition, compared with LSTM-FC, the CNN layer in SRE can directly handle the 2D spatiotemporal matrix and extract spatiotemporal features from it. FC, however, accepts only 1D data as input, so the 2D spatiotemporal matrix must be flattened to 1D before FC can process it. Flattening causes some loss of the spatiotemporal dependency among the elements of the 2D matrix. Therefore, CNN can use the spatiotemporal information more fully than FC and extract additional deep spatiotemporal features. By combining LSTM with CNN in SRE, the proposed LSTM-CNN model showed higher prediction accuracy than CNN and LSTM-FC.
(3) From comparison (c), LSTM-NN performed better than LSTM and APNet, especially at 13–24 h. This illustrates that weather forecast data are highly related to the future PM2.5 concentration, especially for long-term prediction. Introducing the weather forecasts can improve prediction performance.
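The gate mechanism mentioned in finding (2) can be made concrete with a single-unit LSTM step. This is a minimal sketch of the standard LSTM equations with scalar, hand-picked weights; it is not the network trained in this study, and the parameter values are hypothetical, chosen so that the forget gate stays open and the input gate stays closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information is written
    f = sigmoid(pre("forget"))        # how much of the old cell state is kept
    o = sigmoid(pre("output"))        # how much of the cell state is exposed
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # hidden state (short-term output)
    return h, c

# Hypothetical weights: forget gate saturated open, input gate saturated closed.
params = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.8
for x in [5.0, -3.0, 2.0, 7.0, -1.0]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With these settings the cell state stays close to its initial value across the five inputs, which is the sense in which the gates let the cell retain a long-term tendency. Opening the input gate instead lets the cell follow the most recent input.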
Compared with LSTM-CNN, which selected neighbor sites in accordance with geographical distance, LSTM-CNN-DWFD selected neighbor sites in accordance with dynamic wind field distance and obtained higher prediction accuracy. Tables 4–7 compare the prediction accuracy of LSTM-CNN and LSTM-CNN-DWFD. Both models performed worse in the lowest- and highest-density regions. This lower prediction accuracy results from the number of neighbor stations K being fixed at 9 in our experiments: in the low-density regions, some less relevant sites were introduced into the model, whereas in the high-density regions, some highly relevant sites were ignored. Therefore, a more adaptive selection method could be explored so that the number of selected surrounding sites adapts to the density of sites. Nonetheless, the results show that LSTM-CNN-DWFD performed well for the 1–6, 7–12, and 13–24 h prediction tasks and in all regions with different densities. Moreover, the higher the density of stations, the more important the spatial correlation and the more significant the superiority of LSTM-CNN-DWFD. This result is explained by spatial correlation being anisotropic under the influence of wind. Geographical distance, however, assumes that spatial dependency is affected only by distance and thus treats spatial correlation as isotropic. In this study, the proposed dynamic wind field distance introduced wind direction into the evaluation of the distance between sites, making it more suitable than geographical distance for representing the spatial relations between sites. Consequently, the neighbor sites selected in LSTM-CNN-DWFD contributed more to the PM2.5 concentration prediction of the target station than the neighbor sites selected in LSTM-CNN. Similar trends have been reported in studies on the spatial interpolation of air pollutants. By introducing wind direction and wind speed into the evaluation of the distance between sites, Li [43] and Li [44] both improved the accuracy of spatial interpolation of pollutants. However, neither discussed applications in air quality forecasting. In future studies, wind speed could be introduced into the definition of wind field distance, which we believe would further improve the prediction performance of the forecasting model.
Compared with six other methods, the proposed LSTM-CNN-DWFD model showed the highest prediction accuracy in forecasting hourly PM2.5 concentration. The LSTM-CNN architecture was shown to be more effective in extracting spatiotemporal features, and dynamic wind field distance fits the spatial correlation better than geographical distance. Because of the limited sample size of the dataset, performance in summer and autumn was not sufficiently evaluated. However, as mentioned above, related studies have demonstrated that performance in summer and autumn is better than in winter [22,40]; thus, the proposed model is expected to achieve higher accuracy if a longer time period is covered in the test set.
5. Conclusions
This study presented a site-specific forecasting model, LSTM-CNN-DWFD, to predict air pollutant concentrations over the next 24 h using historical air quality data, meteorological data, and weather forecasts. By combining LSTM and 2D-CNN, the proposed model simultaneously handled the long- and short-term temporal trends and the spatial dependency of the spatiotemporal data. Additionally, a new KNN method, KNN-DWFD, was used to choose highly related neighbor stations with the wind effect considered. Accurate and stable predictions were realized by combining KNN-DWFD and LSTM-CNN in LSTM-CNN-DWFD. In the six-fold rolling origin comparison experiments for the 1–6, 7–12, and 13–24 h prediction tasks conducted on the 36 stations in Beijing, LSTM-CNN-DWFD had the highest prediction accuracy, taking RMSE, MAE, and R² as indicators. The following are the main findings of this study:
(1) The historical air quality and meteorological data of neighbor stations are valuable spatiotemporal data, and fully utilizing these data can considerably improve prediction accuracy. Additionally, taking weather forecasts into consideration also helps predict the future PM2.5 concentration, especially for long-term prediction.
(2) By combining local LSTM models and CNN, the proposed LSTM-CNN model captures spatiotemporal features more efficiently than CNN and LSTM-FC. Hence, it exhibited better prediction performance than the other models, as indicated by its low RMSE and MAE and high R².
(3) We proposed a dynamic wind field distance to replace geographical distance in a new KNN method, KNN-DWFD. The comparison results show that it fits the spatial correlation better than geographical distance. LSTM-CNN-DWFD adapts better to different prediction times and density levels than LSTM-CNN, thereby providing more accurate and stable predictions, as indicated by its low RMSE and MAE and high R².
Future studies should focus on the following aspects: (1) Develop a method to choose the number of neighbor stations adaptively for areas with different densities of stations so that the forecasting model can fit the spatial correlations well accordingly; (2) explore a wind field distance definition that simultaneously considers the impact of wind speed and direction and not only the wind direction; (3) explore other patterns to introduce wind impact into the spatial dependency.