1. Introduction
Air pollution is a serious environmental problem attracting increasing attention worldwide [1]. With the rapid development of the Chinese economy and the acceleration of industrialization, urban air pollution is worsening. Fine particulate matter (PM2.5), one of the main air pollutants, contains a large amount of toxic and harmful substances owing to its small particle size. It not only remains in the atmosphere for a long time but also travels long distances, reducing air visibility and seriously affecting our living environment and physical health. In response, the Chinese government established air quality monitoring stations in most cities to measure PM2.5 and other air pollutant concentrations in real time [2]. However, the expensive equipment inevitably imposes a significant financial burden on the government [3,4]. Beyond monitoring, there is a rising demand for forecasting future air quality. Clearly, predicting real-time and future PM2.5 concentrations is essential for air pollution control and for preventing the health problems that air pollution causes.
With the development of machine learning in recent years, artificial neural networks (ANNs), support vector regression (SVR), and other methods have been successfully applied to the prediction of air pollutant concentrations. Zheng et al. [5] used the spatial features of roads, factories, and parks in the prediction area to predict PM10 and NO2 concentrations. Li et al. [6] used SVR to predict the PM2.5 concentration of a target station from the observations of surrounding monitoring stations. Although these methods exploited the spatial features that affect pollutant concentrations, they did not consider the temporal correlation of air pollutants or the time-delay characteristics of PM2.5.
Because the relevant atmospheric environment is dynamic, the recurrent neural network (RNN) is especially suitable for simulating the temporal evolution of air pollutant distributions: RNNs can handle arbitrary input sequences and can therefore learn temporal patterns [7]. Ong et al. [8] used meteorological data to predict PM2.5 concentrations with an RNN. Feng et al. [9] combined random forest (RF) and an RNN to analyze and forecast the PM2.5 concentration over the next 24 h in Hangzhou, China. When faced with long time lags, however, the traditional RNN may suffer from vanishing and exploding gradients [10]. Moreover, these RNN-based methods do not take full advantage of spatial features. Additionally, feature states at different past times affect future PM2.5 concentrations to different degrees; existing studies extracted only the temporal correlation of historical data and did not consider these time-dependent effects.
To tackle these problems, we propose an attention-based convolutional neural network (CNN)–long short-term memory (LSTM) model, AC-LSTM, for predicting PM2.5 concentrations over the next 24 h. The proposed AC-LSTM model comprises a one-dimensional CNN, an LSTM network [10], and an attention-based network. As a representative RNN, the LSTM network overcomes the vanishing- and exploding-gradient problems of the traditional RNN thanks to its special cell structure [10], and it can capture the spatiotemporal correlation and interdependence of air quality-related time-series data. The joint one-dimensional CNN extracts spatiotemporal features from the air quality data and local spatial correlation features of PM2.5 concentrations among air monitoring stations. The attention mechanism has proven effective in image recognition [11], machine translation [12], and sentence summarization [13]. Therefore, we apply the attention mechanism [12] in the AC-LSTM model to capture how strongly the past feature states at different times affect the PM2.5 concentration.
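As a minimal illustration of this weighting idea, the following NumPy sketch scores a set of per-hour hidden states with a hypothetical learned scoring vector, normalizes the scores with a softmax, and forms the weighted sum; the dimensions and parameters are illustrative assumptions, not those of the trained model.

```python
import numpy as np

def attention_pool(hidden_states, w, b):
    """Weigh per-hour hidden states by importance and aggregate them.

    hidden_states: (k, d) array, one d-dimensional state per past hour.
    w, b: parameters of a scoring layer (assumed learned elsewhere).
    """
    scores = np.tanh(hidden_states @ w + b)          # (k,) raw importance per hour
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the k past hours
    context = weights @ hidden_states                # (d,) attention-weighted sum
    return context, weights

rng = np.random.default_rng(0)
h = rng.standard_normal((12, 8))                 # e.g., 12 past hours, 8-dim states
ctx, att = attention_pool(h, rng.standard_normal(8), 0.0)
```

The weights sum to one, so hours judged more relevant contribute more to the aggregated context vector that feeds the prediction.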
The major contributions of this paper are as follows: (1) by analyzing the spatiotemporal correlation of air quality data, we propose a novel deep learning method that captures the spatiotemporal dependency of air pollutant concentrations to predict PM2.5 concentrations over the next 24 h; (2) according to how strongly past feature states affect the PM2.5 concentration, the attention-based layer weights the past feature states in our predictive model to improve prediction accuracy; (3) by comparing six popular machine learning methods on the air pollution prediction problem, we validate the practicality and feasibility of the proposed model for PM2.5 concentration prediction.
2. Overview of the AC-LSTM Framework
As shown in Figure 1, the framework of our approach consists of three major parts: model input, feature extraction, and aggregation and prediction. Since the PM2.5 concentration is strongly affected by spatiotemporal features, the recent air pollutant concentrations, meteorological data, and PM2.5 concentrations of all adjacent stations are stacked into an input tensor for the one-dimensional CNN layer. The CNN layer extracts spatiotemporal features, and the LSTM layer then learns the spatiotemporal correlation. Because past states at different times affect the PM2.5 concentration differently, the attention-based layer weights the feature states of the past hours. Finally, the aggregation and prediction of the proposed model is performed.
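A rough Keras sketch of this three-part pipeline (the experiments later in the paper use Keras) might look as follows; the time lag `k`, feature count `f`, and all layer sizes here are illustrative assumptions, not the configuration actually reported in Section 4.1.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

k, f = 12, 20  # time lag and per-hour feature count (assumed values)

inputs = layers.Input(shape=(k, f))
# one-dimensional CNN: extracts local spatiotemporal features from the input tensor
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
# LSTM learns the temporal correlation; return_sequences keeps one state per hour
x = layers.LSTM(64, return_sequences=True)(x)
# attention: score each past hour, normalize with softmax, form the weighted sum
scores = layers.Dense(1, activation="tanh")(x)                  # (batch, k, 1)
weights = layers.Softmax(axis=1)(scores)                        # weights over k hours
context = layers.Flatten()(layers.Dot(axes=1)([weights, x]))    # (batch, 64)
outputs = layers.Dense(1)(context)                              # predicted PM2.5

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
```

The `Dot` layer contracts the time axis, so the model aggregates the per-hour LSTM states into a single attention-weighted context vector before the final regression layer.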
Figure 2 describes how the model predicts the PM2.5 concentrations of the next 24 h. As shown in Figure 2, Xt represents the input data of the model at time t (e.g., the air quality data and meteorological data in Figure 1), Yt+1 represents the predicted PM2.5 concentration at time t + 1, and k represents the time lag. We group the air quality data within a particular time lag to form different inputs (shown in the broken rectangle) for multiscale predictors, which are trained as separate models for different time intervals. The time lag of the model input indicates how many hours in the past the input data reach. Each blue arrow in Figure 2 represents a different predictor. A separate model is trained for each hour of the next 3 h. The next 4–24 h are divided into three time intervals, i.e., 4–6, 7–12, and 13–24 h, and separate models are trained to predict the mean PM2.5 concentration during each interval.
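The grouping of inputs and interval-mean targets described above could be sketched as follows, assuming a single hourly PM2.5 series; `make_samples` is a hypothetical helper, and the real model input additionally stacks pollutant, meteorological, and adjacent-station features.

```python
import numpy as np

def make_samples(pm25, k, horizon):
    """Pair each k-hour input window with the mean PM2.5 over a target interval.

    horizon: (lo, hi) in hours ahead, e.g. (4, 6) -> mean over hours t+4..t+6;
    (1, 1) reduces to the next-hour prediction task.
    """
    lo, hi = horizon
    X, y = [], []
    for t in range(k, len(pm25) - hi):
        X.append(pm25[t - k:t])                      # past k hours as model input
        y.append(np.mean(pm25[t + lo:t + hi + 1]))   # mean over the target interval
    return np.array(X), np.array(y)

series = np.arange(48, dtype=float)                  # toy hourly series
X1, y1 = make_samples(series, k=12, horizon=(1, 1))    # next-hour predictor
X46, y46 = make_samples(series, k=12, horizon=(4, 6))  # 4-6 h interval predictor
```

Each choice of `horizon` yields its own training set, matching the separate models trained per hour for the next 3 h and per interval beyond that.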
4. Results and Discussion
The collected dataset is divided into two parts: the data of the first 28 months are used to train the models, and the data of the last 8 months are used to test their performance against the benchmarks. The mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R2) are used as evaluation metrics for the different models in this paper.
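For reference, the three evaluation metrics can be written out directly (NumPy only; the arrays below are toy values, not results from the paper):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([30.0, 45.0, 60.0, 50.0])  # toy observed PM2.5 values
p = np.array([32.0, 44.0, 58.0, 54.0])  # toy predicted values
```

Lower MAE and RMSE indicate smaller errors, while R2 closer to 1 indicates predictions that track the observed values more closely.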
4.1. Experimental Set-Up
This section describes the hardware and software environment of the experiments and the configuration of hyperparameters [2]. The code for all prediction methods in this paper was written in Python. Our model and the other deep learning comparison models were implemented with Keras, an open-source deep learning library based on TensorFlow. All experiments were conducted on a server with two NVIDIA GTX 1080Ti graphics processing units (GPUs) and an Intel Xeon E5 central processing unit (CPU).
There are several hyperparameters in the AC-LSTM prediction model, including the time lag, the number of LSTM layers, the number of nodes in each LSTM layer, and the learning rate; these must be set before the model structure is built. Keeping all other parameters unchanged, we determined the optimal value of each hyperparameter experimentally. In the end, we built our model with four LSTM layers of 800 nodes each, and the learning rate was set to 0.0001 in all experiments. This setting outperformed all others in our experiments.
The time lag is one of the most important hyperparameters. It determines the number of past hours used in the model input and is necessary for the multiscale prediction tasks. We therefore evaluated the model with different time lags to find the optimal one. At each time lag, we predicted the next-hour PM2.5 concentrations of all stations in the training set. The resulting MAE and RMSE are compared in Table 3: the RMSE was lowest at a time lag of 10, while the MAE was lowest at a time lag of 14. According to the analysis in Section 3.4 and previous studies on RNNs [10], too small a time lag prevents the temporal correlation of the time-series data from being fully learned and decreases the prediction accuracy, whereas too large a time lag lengthens training and introduces unnecessary noise. As a compromise, the time lag in our model was set to 12 for the one-hour prediction task. For prediction tasks at other time scales, the optimal lag can be found experimentally in the same way.
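The lag sweep described above can be sketched as follows; for brevity, a trivial mean-of-window predictor stands in for retraining AC-LSTM at every lag, so the numbers it produces illustrate only the selection procedure, not the paper's results.

```python
import numpy as np

def evaluate_lag(series, k):
    """MAE/RMSE of predicting hour t from the mean of the previous k hours."""
    errs = np.array([np.mean(series[t - k:t]) - series[t]
                     for t in range(k, len(series))])
    return np.mean(np.abs(errs)), np.sqrt(np.mean(errs ** 2))

rng = np.random.default_rng(1)
series = np.cumsum(rng.standard_normal(500))  # toy PM2.5-like hourly series

# sweep candidate time lags and keep the MAE/RMSE pair for each
results = {k: evaluate_lag(series, k) for k in range(2, 25, 2)}
best_by_rmse = min(results, key=lambda k: results[k][1])
```

In the real experiment, `evaluate_lag` would be replaced by training the model with lag `k` and scoring it on the training set, and MAE and RMSE may favor different lags, as Table 3 shows.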
4.2. Effects of Different Features
The input of our model comprises three types of features: pollutant concentrations (Fp), meteorological data (Fm), and the PM2.5 concentrations of adjacent monitoring stations (Fa). To evaluate the effectiveness of the different features in the proposed AC-LSTM model, we conducted experiments with different feature combinations and computed the errors on the multiscale prediction tasks. Because Taiyuan has only a small number of monitoring stations, we used the PM2.5 concentrations of all stations rather than only the adjacent ones. The effects of the various features in AC-LSTM are shown in Table 4 and Table 5. As can be seen, gradually adding features generally improved the prediction accuracy of the model. Except for the MAE of the next-1-h and 13–24-h prediction tasks, the model with all three feature types as input performed best overall. This shows that the past feature states and the PM2.5 concentrations of adjacent monitoring stations help predict the PM2.5 concentration.
4.3. Model Convergence
After setting appropriate model parameters, it was necessary to verify whether AC-LSTM converges during training. We therefore recorded the training loss of the AC-LSTM model on the one-hour PM2.5 prediction task, as shown in Figure 8, and compared it with three other deep learning methods (simple RNN, LSTM, and CNN–LSTM). All models in Figure 8 used the same parameters, and the mean square error (MSE) on the normalized data served as the training loss function. As Figure 8 shows, all models converged by epoch 20. In Figure 8a, after 20 epochs of the one-hour prediction task, the MSE losses of the three models were close, but that of AC-LSTM was slightly smaller than those of LSTM and CNN–LSTM. At epoch 1, the MSE loss of the simple RNN model in Figure 8b was nearly 100 times greater than those of the three models in Figure 8a. Moreover, at epoch 80, the loss of the simple RNN model was 0.00154, whereas none of the three models in Figure 8a exceeded 0.0015. Evidently, LSTM, CNN–LSTM, and AC-LSTM in Figure 8a converged better because of their special memory cell architecture.
4.4. Model Comparison
To verify the feasibility and efficacy of the proposed model, we compared AC-LSTM with six state-of-the-art models: support vector regression (SVR) [6], random forest regression (RFR) [9], multilayer perceptron (MLP) [25], simple RNN [9,26], LSTM [27], and CNN–LSTM [28]. After training all models on the same training and testing datasets, the PM2.5 concentrations of all stations at different time scales were predicted for performance evaluation. For our AC-LSTM model, the appropriate time lag and hyperparameters for each prediction scale were selected in the same way as above. Furthermore, each experiment was repeated five times, and the averaged results were used for comparison, as shown in Table 6 and Table 7.
Table 6 and Table 7 compare the prediction results of our approach and the six others in terms of MAE and RMSE, from which several interesting observations can be highlighted. Firstly, the performance of all models gradually deteriorated as the prediction horizon lengthened. Detailed comparison results of each model on the different-scale prediction tasks are shown in Figure A1, Figure A2, Figure A3, and Figure A4 in Appendix A, which make it even more apparent that the prediction accuracy of SVR, RFR, and MLP worsened with longer horizons. The lack of sufficient and directly relevant input data makes it difficult to predict PM2.5 concentrations over longer future periods.
Secondly, the four deep learning methods, i.e., simple RNN, LSTM, CNN–LSTM, and AC-LSTM, performed much better than the three traditional shallow learning methods, SVR, RFR, and MLP, particularly when predicting beyond one hour. As can be seen from Table 6 and Table 7, the MAE and RMSE of the four deep learning methods were relatively low, and their predicted values on the multiscale prediction tasks were closer to the observed values in Figure A1, Figure A2, Figure A3, and Figure A4.
Thirdly, as Figure A1 and the tables show, the prediction accuracy of the three non-deep-learning models on the one-hour prediction task was comparable to that of the four deep learning models. However, according to the goodness-of-fit plots for all models in Figure A5, the predicted values of the three models were relatively dispersed, and their R2 values were lower than those of the four deep learning models, whose predicted values clustered close to the 45-degree line (y = x). On one hand, this shows the limitation of the conventional approaches; on the other hand, it demonstrates the superior performance of the deep learning models in modeling the long-term dependency needed for effective prediction of future PM2.5 concentrations. The reason is that the three traditional shallow models cannot process time-series data and fail to learn the temporal correlation of air pollutants. By contrast, the simple RNN is able to predict PM2.5 concentrations over the next 24 h, and LSTM, CNN–LSTM, and AC-LSTM further improve on it by overcoming the defects of the conventional RNN.
Furthermore, according to the tables, the MAE and RMSE of the AC-LSTM model were the lowest among all benchmarked models, except for the MAE of the 13–24 h prediction task. The predicted values of AC-LSTM on the multiscale prediction tasks were closer to the observed values in Figure A1, Figure A2, Figure A3, and Figure A4. Moreover, the R2 of the AC-LSTM model on the one-hour PM2.5 prediction task in Figure A5 was the highest. With the attention mechanism added, AC-LSTM outperformed LSTM and CNN–LSTM on the multiscale prediction tasks. These results show that the proposed AC-LSTM model can effectively learn the spatiotemporal correlation of air pollutants and is suitable for predicting future urban PM2.5 concentrations.
However, our study has several limitations. Emissions have a significant impact on air quality, but because emission data are difficult to obtain, the data collected in this paper do not include emissions from factories and vehicles in the area, which affects the prediction accuracy of our model. Moreover, when a sudden pollution incident occurs, the PM2.5 concentration changes abruptly; whether the proposed model can predict such changes well remains to be demonstrated.
5. Conclusions and Future Work
In this paper, we propose an attention-based CNN–LSTM model to predict urban PM2.5 concentrations over the next 24 h. By taking the pollutant concentrations in the air quality data, the meteorological data, and the PM2.5 concentrations of adjacent monitoring stations as input, the model learns the spatiotemporal correlation and long-term dependence of PM2.5 concentrations. At the same time, the attention mechanism captures the importance of feature states at different past times and further improves the prediction accuracy of the model. The experimental results show that the AC-LSTM model improved performance on the multiscale prediction tasks. The main conclusions of this paper are as follows:
Through the analysis of air quality data, we found that PM2.5 concentration has a strong spatiotemporal correlation. Owing to air flow, the PM2.5 concentration in the predicted area is easily affected by the PM2.5 concentrations at adjacent monitoring stations, and because PM2.5 stays in the air for a long time, past feature states also affect future PM2.5 concentrations. This motivated the design of a spatiotemporal model for effective prediction of PM2.5 concentrations;
The experimental results indicate that, compared with using only the pollutant concentrations of the air monitoring stations, adding the meteorological data and the PM2.5 concentrations of the adjacent monitoring stations improves the prediction accuracy of the model, especially for prediction tasks on time scales over one hour;
The proposed AC-LSTM model can be applied to multiscale predictors over different time gaps. Compared with traditional machine learning methods such as SVR, MLP, and RFR, its prediction accuracy improved significantly, especially when predicting PM2.5 concentrations more than one hour ahead. Compared with deep learning methods such as simple RNN, LSTM, and CNN–LSTM, AC-LSTM produced predictions with lower MAE and RMSE owing to the attention mechanism introduced into the LSTM model.
Although the proposed model can support the multiscale prediction of PM2.5 concentrations in the temporal domain, in the future, we will also explore its expansion for multiscale prediction in the spatial domain. In addition, the model will also be extended for predicting other pollutants. Last but not least, sensing data, especially satellite data, will also be utilized for large-scale prediction of the PM2.5 concentrations and other pollutants for early warning of air pollution and the protection of people’s health.