1. Introduction
With the rapid development of industrialization and urbanization in recent decades, PM2.5 emissions in developing countries have increased substantially. Serious PM2.5 pollution has had many adverse effects on economic activities. For example, as of 2015, every 5 μg/m³ increase in PM2.5 concentration, all other things being equal, reduces GDP per capita by about 2500 Chinese yuan [1]. Therefore, accurately predicting PM2.5 concentration has become increasingly important. The commonly used prediction methods can be divided into two categories: statistical and machine learning algorithms. Statistical methods predict air quality by applying statistics-based models, such as the autoregressive integrated moving average (ARIMA) model [2,3,4], the multiple linear regression (MLR) model [5,6,7], and the generalized additive model (GAM) [8,9,10]. However, these earlier linear models assume that the relationship between the variables and the target labels is linear, which is not suitable for nonlinear and non-stationary air quality prediction problems.
To overcome this limitation, researchers began to adopt nonlinear machine learning methods. For example, Yang et al. [11] used support vector regression (SVR) to predict the PM2.5 concentration in Beijing and verified that the accuracy of the proposed model was better than that of other methods. Li et al. [12] proposed a stacked autoencoder (SAE) model for air quality prediction and demonstrated that it performed better than linear models such as ARIMA. Feng et al. [13] used an ensemble of back-propagation (BP) neural networks to predict daily pollutant emissions from biomass combustion. Zhang et al. [14] combined the genetic algorithm (GA) with an artificial neural network (ANN) to predict local indoor air quality under two ventilation models. Although these nonlinear machine learning methods have achieved satisfactory performance in predicting air pollution, they are unable to learn the long-term effects of air pollution, because they are shallow networks with few model parameters. The generalization ability of these models on complex prediction problems is therefore limited.
To address the limited capacity of models with few parameters, researchers have recently turned to deep neural networks, which have been widely used in image processing, natural language understanding, and so forth [15,16,17,18]. For example, Seng et al. [19] proposed a multi-output, multi-index supervised learning comprehensive prediction model (MMSL) based on long short-term memory (LSTM) to predict the overall air quality in Beijing. Yan et al. [20] used a CNN-LSTM model based on spatial–temporal clustering to predict air quality at multiple sites in Beijing; experiments show that CNN-LSTM and LSTM generally perform better than the BP neural network. Feng et al. [21] proposed a WRF/RNN-based method to predict the air pollutants in Hangzhou over the next 24 h. Qin et al. [22] proposed a dual-stage attention-based recurrent neural network (DA-RNN), where the attention mechanism is used in the input stages of the encoder and decoder so that the most relevant input features can be selected adaptively. Liu et al. [23] proposed a dual-stage two-phase attention-based recurrent neural network (DSTP-RNN), where a DSTP-based structure is used to enhance the spatial correlation of the exogenous series, and a two-phase attention mechanism is used to generate stable attention weights. However, this method only uses the data of one site, without considering the influence of data from other sites on the model. To solve this problem, in our previous work [24] an improved attention-based dual-stage two-phase fully connected (DSTP-FC) model was proposed to improve the accuracy of PM2.5 concentration prediction, where an exogenous series correlation method is used to calculate the relationship between the target series and the exogenous series, and PM2.5 concentrations are predicted by a modified DSTP model.
Although advanced deep learning methods can achieve good results in air quality prediction, they all need sufficient historical data to train the models; for datasets with very little data, they do not provide good predictions. To solve this data shortage problem, Ma et al. [25] proposed a transfer learning-based bidirectional long short-term memory (TL-BiLSTM) network to predict the air quality of new stations lacking data. This method transfers the knowledge learned from existing air quality monitoring stations to new monitoring stations to improve the prediction accuracy of the new stations. Fong et al. [26] proposed a transfer learning model combining LSTM and RNN to predict the concentrations of air pollutants. Their method inputs the data of all source domain sites into the model for pre-training, then adds network layers and inputs the data of the target domain to train and predict its air quality. Fang et al. [27] proposed a hybrid deep transfer learning strategy based on long short-term memory (LSTM) and domain adversarial neural networks (DANN), where the temporal features of the source and target buildings are extracted by LSTM, and DANN is used to find the domain-invariant features between the source and target buildings through domain adaptation.
The above-mentioned methods have achieved satisfactory performance in the case of new-site data shortages, but some problems remain to be studied further. For example, the temporal feature extractors of these models are all based on LSTM, which treats all input features equally and fails to pay attention to the important ones. The TL-BiLSTM model performs single-site transfer, and when the source domain contains multiple sites, it is unclear which source site should be selected for transfer. The LSTM-RNN model inputs the data of all source domain sites into the model for pre-training, which is unsuitable when the number of source sites is large, because a large amount of redundant data would be input, resulting in over-fitting and computational problems [28].
To deal with the problems above, an improved hybrid transfer learning-based deep learning model is proposed in this paper for PM2.5 concentration prediction. When the amount of data in the target domain is small, the model cannot be well trained using the target domain data alone. Moreover, if a transfer learning-based method is applied naively, the model trained on the source domain data is not applicable to the target domain when the source and target domain data have different distributions. Thus, the motivation of this study is to use a domain-adaptive transfer learning method to find the domain-invariant characteristics between the source domain and the target domain, and to use the data of both domains to predict the PM2.5 concentration in the data-scarce target domain.
The main contributions of this paper are summarized as follows: (1) An improved hybrid transfer learning model combining a dual-stage two-phase (DSTP) model and a domain adversarial neural network (DANN) is proposed; (2) The maximum mean discrepancy (MMD) is introduced into transfer learning-based air quality prediction to select the station in the source domain that is most suitable for transfer to the target domain; (3) An improved DSTP model is used to extract the spatial–temporal features of the source domain and the target domain. Various experiments on several cities in China are conducted, and the results verify the efficiency and generalization ability of the proposed method.
This paper is organized as follows: Section 2 describes the proposed method and presents the structure of the proposed deep learning-based model; Section 3 presents the experiments and results; Section 4 discusses the performance of different feature extractors, the generalization ability and robustness of the proposed method, and the setting of hyperparameters; Section 5 provides the conclusion and possible future research directions.
2. Proposed Model
In this paper, a hybrid transfer learning model is proposed. The input of the model includes the historical air quality and meteorological data of the source and target domains. Firstly, the MMD-based source domain site selection method is used to find the source domain site closest to the target domain. Then the data of the two sites are input into the improved DSTP model together. A feature extractor based on the DSTP model extracts the spatial–temporal features of the training data from the source and target site data. The obtained spatial–temporal features are input into the domain classification model and the regression prediction model, respectively. A domain adversarial neural network (DANN) is used to find the domain-invariant features between the source domain and the target domain through the adversarial domain adaptation of the DSTP feature extractor and the domain classifier. Finally, the regression prediction model based on a fully connected layer predicts the values of the source and target sites. The test data of the target site are input into the pre-trained DSTP-DANN model for PM2.5 concentration prediction. The framework of the proposed model is shown in Figure 1 and is described in detail below.
Remark 1. The method presented in this paper differs from methods that fine-tune by freezing the first few layers of a model. Instead, it conducts transfer learning through the domain adaptation of an adversarial neural network. DANN combines domain adaptation and feature learning in one training process, so that domain-invariant features can be learned for prediction. The proposed transfer learning-based model trained on source domain site data can then be used to assist in predicting target site data without degradation of the prediction performance due to domain shift.
2.1. Site Selection for Source Domain Based on MMD
Because there are many source domain sites, it is necessary to measure the distance between the distributions of the source domain sites and the target domain site, and to select the source domain site closest to the target domain site. Recent studies have shown that the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space is an effective method for estimating the distance between two distributions [29]. Given samples from two distributions, the mean discrepancy for a function $f$ is obtained by subtracting the means of $f$ over the two samples, and MMD is the supremum of this mean discrepancy over a class of functions. For the convenience of calculation, the squared form of MMD is generally adopted. The process of using MMD to estimate the difference between two domains is as follows.
The source domain site data in a given source domain are denoted as:

$$X_s = \{x_1, x_2, \ldots, x_n\}$$

where $x_i$ represents the source domain site data and $n$ represents the number of source domain samples. The target site data in the target domain are denoted as:

$$X_t = \{z_1, z_2, \ldots, z_m\}$$

where $z_j$ represents the target domain site data and $m$ represents the number of target domain samples. The nonlinear mapping function in the reproducing kernel Hilbert space is denoted as $\phi(\cdot)$. Then the squared form of MMD is defined as follows:

$$\mathrm{MMD}^2(X_s, X_t) = \left\| \frac{1}{n}\sum_{i=1}^{n}\phi(x_i) - \frac{1}{m}\sum_{j=1}^{m}\phi(z_j) \right\|_{\mathcal{H}}^2$$
The difference in distribution between two domains is measured as the distance between the two data distributions: the smaller the MMD value, the closer the two domains are. MMD has been widely used in transfer learning algorithms [30,31,32]. The proposed method selects the source domain site most suitable for transfer to the target domain site by calculating the MMD-based similarity between the source domain and the target domain.
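To make the selection step concrete, the following is a minimal sketch of the squared-MMD computation via the kernel trick. The paper does not specify the kernel; a Gaussian (RBF) kernel is assumed here, and the station names and tensor shapes are purely illustrative.

```python
import torch

def mmd_squared(xs: torch.Tensor, xt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between source samples xs (n, d) and
    target samples xt (m, d), using the kernel trick:
    MMD^2 = E[k(xs, xs')] - 2 E[k(xs, xt)] + E[k(xt, xt')]."""
    def rbf(a, b):
        # Pairwise squared Euclidean distances -> Gaussian kernel values.
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return rbf(xs, xs).mean() - 2 * rbf(xs, xt).mean() + rbf(xt, xt).mean()

# Pick the source station whose feature distribution is closest to the target.
# `stations` maps hypothetical station names to (samples, features) tensors.
stations = {"A": torch.randn(500, 8), "B": torch.randn(500, 8) + 0.5}
target = torch.randn(300, 8)
best = min(stations, key=lambda name: mmd_squared(stations[name], target).item())
```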
2.2. Spatial–Temporal Feature Extraction Based on DSTP
The central site is the site to be predicted, and the best matching site of the central site is determined by the exogenous series correlation method. The main reason to use the DSTP model is that it produces stable attention weights by applying a dual-stage attention mechanism in the encoder stage, so that temporal and spatial features can be extracted simultaneously [24].
Given all sites’ data, each site contains $n$ exogenous series and a target series (the series to be predicted). Within the window size $T$ of the central site collection, the $k$-th exogenous series is represented by:

$$x^k = (x_1^k, x_2^k, \ldots, x_T^k)^\top \in \mathbb{R}^T$$

All exogenous series within window size $T$ are represented by:

$$X = (x^1, x^2, \ldots, x^n)^\top \in \mathbb{R}^{n \times T}$$

The target series is represented by:

$$y = (y_1, y_2, \ldots, y_T)^\top \in \mathbb{R}^T$$
In this study, the encoder adopts a two-stage attention mechanism, which aims to learn the spatial correlation among the exogenous series of the central site collection, the exogenous series of its matching sites, and the target series. Specifically, the first stage of attention learns the spatial correlation between the exogenous series of the central site collection and the exogenous series of the matching sites. The second stage of attention attends to the weighted features again, learning the spatial correlation among the exogenous series of the central site collection, its target series, and the matching sites' target series. This two-stage spatial mechanism ensures that the learned spatial correlations are stable. The decoder is a temporal attention mechanism designed to learn the temporal correlation among the encoder hidden states, the target series of the central site collection, and the target series of the matching site.
2.2.1. First Stage of Attention
The data from the central site and its matching sites are input into the model together, so that the exogenous series relationships between them can be learned, which improves the accuracy of PM2.5 concentration prediction. The exogenous series correlation method is used to find matching sites. Given the $k$-th feature $x_t^k$ of the central site collection at time $t$, the $k$-th feature $\hat{x}_t^k$ of the exogenous series of the best matching site can be obtained by the exogenous series correlation method [24]. The spatial correlation between the exogenous attributes of the central site collection and the matching site is learned by the input attention mechanism as:

$$e_t^k = v_e^\top \tanh\!\left(W_e[h_{t-1}; s_{t-1}] + U_e[x^k; \hat{x}^k]\right)$$

where $[\cdot\,;\cdot]$ is a concatenation operation, and $W_e$, $U_e$, $v_e$ are the parameters to learn; $h_{t-1}$ and $s_{t-1}$ are the hidden state and cell state of the encoder LSTM unit at the previous time step. After $e_t^k$ is calculated, the Softmax function is used to normalize it and obtain the attention weight $\alpha_t^k$:

$$\alpha_t^k = \frac{\exp(e_t^k)}{\sum_{i=1}^{n}\exp(e_t^i)}$$

$\alpha_t^k$ is determined by $h_{t-1}$, $s_{t-1}$, the $k$-th feature $x_t^k$ of the current input, and the $k$-th feature $\hat{x}_t^k$ of the matching site, and measures the importance of the $k$-th feature at time $t$. $\tilde{x}_t$ is the combination of all weighted features at time $t$, defined as follows:

$$\tilde{x}_t = (\alpha_t^1 x_t^1, \alpha_t^2 x_t^2, \ldots, \alpha_t^n x_t^n)^\top$$

Then, $\tilde{x}_t$ and the hidden state $h_{t-1}$ are input into the LSTM layer to update the hidden state $h_t$ of the current time step, and $h_t$ is input into the attention of the second stage.
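As a concrete illustration, the sketch below implements the first-stage scoring and reweighting in PyTorch. The class name, the layer sizes, and the use of a single linear layer over the full concatenation (equivalent to the block matrices $W_e$ and $U_e$) are assumptions made for the sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InputAttention(nn.Module):
    """Sketch of the first-stage (spatial) attention: scores each of the n
    exogenous series against the encoder LSTM state, then reweights the
    features at time t."""
    def __init__(self, n_series: int, window: int, hidden: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * hidden + 2 * window, hidden),  # acts as W_e and U_e
            nn.Tanh(),
            nn.Linear(hidden, 1),                        # acts as v_e
        )

    def forward(self, h, s, x, x_hat, t):
        # h, s:      (batch, hidden) previous LSTM hidden/cell state
        # x, x_hat:  (batch, n_series, window) central- and matching-site series
        n = x.size(1)
        hs = torch.cat([h, s], dim=1).unsqueeze(1).expand(-1, n, -1)
        e = self.score(torch.cat([hs, x, x_hat], dim=2)).squeeze(-1)  # (batch, n)
        alpha = torch.softmax(e, dim=1)                               # weights
        return alpha * x[:, :, t]  # weighted input x_tilde at time t
```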
2.2.2. Second Stage of Attention
This module aims to learn the spatial correlation between the exogenous series and the target series of the central site collection and the target series of the matching sites. The specific method is to combine the target series of the central site collection with the exogenous series at the corresponding time and add the target series of the best matching site. The attention weights for this attention mechanism are computed as follows:

$$l_t^k = v_d^\top \tanh\!\left(W_d[h'_{t-1}; s'_{t-1}] + U_d[\tilde{x}_t^k; y_t; \hat{y}_t] + b_d\right)$$

where $W_d$, $U_d$, $v_d$, $b_d$ are the parameters to be learned; $h'_{t-1}$ and $s'_{t-1}$ are the hidden state and cell state of the encoder LSTM unit at the previous time step; and $q$ is the hidden size in the second attention module.

After $l_t^k$ is calculated, it is normalized by the Softmax function to get $\beta_t^k$. The corresponding target variable $y_t$ and the matching site's target $\hat{y}_t$ are concatenated with the $k$-th attribute $\tilde{x}_t^k$ to form a new vector $z_t^k$, namely:

$$z_t^k = [\tilde{x}_t^k; y_t; \hat{y}_t]$$

Note that the weight $\beta_t^k$ measures the importance of $z_t^k$ at time $t$, and every attribute value at every time step has its corresponding weight:

$$\tilde{z}_t = (\beta_t^1 z_t^1, \beta_t^2 z_t^2, \ldots, \beta_t^n z_t^n)^\top$$

Then, $\tilde{z}_t$ and the hidden state $h'_{t-1}$ are input into the LSTM layer to update the hidden state $h'_t$ at the current time step, and $h'_t$ is input into the temporal attention stage.
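For clarity, a minimal sketch of how the second-stage input vectors $z_t^k$ could be assembled is given below; the tensor names and shapes are illustrative, and the attention scoring itself would mirror the first-stage sketch above.

```python
import torch

# Hypothetical shapes: batch of 32, n = 8 exogenous series.
x_tilde = torch.randn(32, 8)   # first-stage weighted features at time t
y_t = torch.randn(32, 1)       # central-site target value at time t
y_hat_t = torch.randn(32, 1)   # matching-site target value at time t

# z_t^k = [x_tilde_t^k; y_t; y_hat_t] for every series k -> (batch, n, 3)
z_t = torch.cat([x_tilde.unsqueeze(-1),
                 y_t.unsqueeze(1).expand(-1, 8, -1),
                 y_hat_t.unsqueeze(1).expand(-1, 8, -1)], dim=-1)
```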
2.2.3. Decoder with Temporal Attention
The decoder with temporal attention can adaptively select the encoder hidden states most relevant to the target series by weighting them. The encoder with spatial attention outputs the hidden states, and the decoder learns their temporal relations through the attention mechanism within a window of size $T$. Based on the hidden state $d_{t-1}$ and cell state $s^d_{t-1}$ of the decoder LSTM unit at the previous time step, the attention weight of each encoder hidden state at time $t$ can be calculated. The attention weights for the temporal attention mechanism are as follows:

$$g_t^i = v_o^\top \tanh\!\left(W_o[d_{t-1}; s^d_{t-1}] + U_o h'_i\right)$$

where $W_o$, $U_o$, $v_o$ are parameters to learn; $p$ is the hidden size of the third attention module, and $h'_i$ is the $i$-th encoder hidden state of the second attention module. After $g_t^i$ is calculated, it is normalized by the Softmax function to get $\gamma_t^i$. The context vector $c_t$ is defined as follows:

$$c_t = \sum_{i=1}^{T} \gamma_t^i h'_i$$

The temporal relationship between all the hidden states of the central site collection and the target series of the matching sites is learned again by concatenating the target series of the matching sites:

$$\tilde{y}_t = \tilde{w}^\top [y_t; \hat{y}_t; c_t] + \tilde{b}$$

where $[y_t; \hat{y}_t; c_t]$ is the concatenation of the two target series and the context vector, and $\tilde{w}$, $\tilde{b}$ are the parameters that map the concatenation to the size of the decoder hidden state. Then, $\tilde{y}_t$ and $d_{t-1}$ are input into the LSTM layer to update the hidden state $d_t$ at the current time step. The final multi-step prediction formula is as follows:

$$\hat{y}_{T+1}, \ldots, \hat{y}_{T+\tau} = v_y^\top \left(W_y[d_T; c_T] + b_w\right) + b_v$$

where $W_y$ and $b_w$ are parameters that map the concatenation to the size of the decoder hidden state; $[d_T; c_T]$ represents the concatenation of the decoder hidden state and the context vector; $v_y$ is the weight and $b_v$ is the bias; and $\tau$ is the number of time steps to predict in the future. The linear function produces the final prediction result.
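The decoder side can be sketched analogously; the module below computes the temporal attention weights and the context vector $c_t$. As before, the names and layer sizes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Scores every encoder hidden state against the previous decoder state
    and returns the context vector c_t as their weighted sum."""
    def __init__(self, enc_hidden: int, dec_hidden: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dec_hidden + enc_hidden, enc_hidden),  # W_o, U_o
            nn.Tanh(),
            nn.Linear(enc_hidden, 1),                            # v_o
        )

    def forward(self, d, s, H):
        # d, s: (batch, dec_hidden) previous decoder hidden/cell state
        # H:    (batch, T, enc_hidden) encoder hidden states in the window
        T = H.size(1)
        ds = torch.cat([d, s], dim=1).unsqueeze(1).expand(-1, T, -1)
        g = self.score(torch.cat([ds, H], dim=2)).squeeze(-1)   # (batch, T)
        gamma = torch.softmax(g, dim=1)                          # weights
        return (gamma.unsqueeze(-1) * H).sum(dim=1)              # context c_t
```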
2.3. DSTP-DANN Based on Transfer Learning
Figure 2 shows the proposed DSTP-DANN structure based on transfer learning (denoted TL-DSTP-DANN). The TL-DSTP-DANN structure consists of three main components: a feature extractor, a regression predictor, and a domain classifier. The feature extractor is based on the improved DSTP model (see Section 2.2), and the regression predictor and domain classifier are both fully connected layers.
The training optimization loss of the TL-DSTP-DANN model includes the regression loss and the domain classification loss. The regression loss for PM2.5 prediction is defined as the mean squared error:

$$L_{reg} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad (16)$$

where $n$ is the batch size of the training data; $y_i$ and $\hat{y}_i$ represent the actual and predicted values of PM2.5, respectively. The loss for domain label classification is defined as the binary cross-entropy:

$$L_{dom} = -\frac{1}{n}\sum_{i=1}^{n}\left[d_i \log \hat{d}_i + (1 - d_i)\log(1 - \hat{d}_i)\right] \quad (17)$$

where $d_i$ and $\hat{d}_i$ represent the actual domain label and the predicted domain label, respectively. In this study, we set the source domain label to 0 and the target domain label to 1.
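In PyTorch terms, the two losses map directly onto the built-in criteria; a brief sketch follows, assuming the domain classifier ends in a sigmoid so that its output is a probability.

```python
import torch.nn as nn

regression_loss = nn.MSELoss()  # Equation (16)
domain_loss = nn.BCELoss()      # Equation (17); labels: 0 = source, 1 = target

def total_loss(y_pred, y_true, d_pred, d_true):
    # The model loss is the sum of the two losses, as described in the text.
    return regression_loss(y_pred, y_true) + domain_loss(d_pred, d_true)
```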
In the training process, to obtain domain-invariant features, the distributions of the source and target features should be made as similar as possible. The feature-mapping parameters are sought to maximize the loss of the domain classifier, while at the same time the domain classifier parameters are sought to minimize that loss. This min–max optimization between the two objectives cannot be realized directly by gradient updates in the back-propagation of neural networks; it is achieved by inserting a gradient reversal layer (GRL) between the feature extractor and the domain classifier.
In this paper, DANN is used to search for domain-invariant features between the source domain and the target domain through the domain adaptation of the DSTP feature extractor and the domain classifier. The main reason for using DANN is that it combines domain adaptation and feature learning in one training process, so that the learned parameters can be directly applied to the target domain without reducing the prediction accuracy due to domain shift [27].
The idea of this paper is very similar to that of generative adversarial networks (GANs). The generative model G corresponds to the feature extractor; its goal is to produce features whose domains the domain classifier cannot correctly identify (i.e., the two feature distributions should be as similar as possible). The discriminative model D corresponds to the domain classifier; its goal is to distinguish whether the extracted features come from the source domain or the target domain. GANs are trained through the competition between G and D.
During the training process, the two models G and D can be enhanced simultaneously by competing with each other. Because of the existence of the discriminative model D, G can learn the features shared by the two distributions without requiring much prior knowledge or a prior distribution, until the extracted features become indistinguishable (that is, D cannot tell whether the features extracted by G come from the source domain or the target domain, so G and D reach a Nash equilibrium [33]).
In the proposed TL-DSTP-DANN model, the GRL acts as an identity transform during forward propagation, and during backward propagation it takes the gradient from the subsequent layer and changes its sign. In particular, the GRL can be regarded as a pseudo-function $R_\lambda(x)$, whose forward and backward propagation processes are:

$$R_\lambda(x) = x$$

$$\frac{\mathrm{d}R_\lambda}{\mathrm{d}x} = -\lambda Q$$

where $Q$ is an identity matrix and $\lambda$ is a positive hyperparameter that realizes the trade-off between the regression loss and the domain classification loss; the setting of $\lambda$ follows [34]. Because the difference in magnitude between the regression loss and the domain classification loss is relatively large, the model loss is the sum of the regression loss and the domain classification loss. The GRL layer is followed by the domain classifier, and the hyperparameter $\lambda$ in the GRL layer balances the two loss functions.
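The GRL has a standard implementation as a custom autograd function; the sketch below follows the pseudo-function above (identity forward, gradient scaled by $-\lambda$ backward).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, multiplies the
    incoming gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient with respect to lam itself.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: features pass unchanged into the domain classifier, while gradients
# flowing back into the feature extractor are reversed and scaled by lambda.
# domain_prob = domain_classifier(grad_reverse(features, lam=0.1))
```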
In this paper, the source domain site data are denoted as $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $x_i^s$ and $y_i^s$ represent the source domain site's exogenous data and target data, respectively, and $n$ represents the number of source domain samples. The target domain site data are denoted as $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{m}$, where $x_j^t$ and $y_j^t$ represent the target domain site's exogenous data and target data, respectively, and $m$ represents the number of target domain samples. The expression of the final objective "pseudo-function" is:

$$E(\theta_f, \theta_r, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_{reg}\!\left(G_r(G_f(x_i^s)), y_i^s\right) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n} L_{dom}\!\left(G_d(G_f(x_i^s)), d_i\right) + \frac{1}{m}\sum_{j=1}^{m} L_{dom}\!\left(G_d(G_f(x_j^t)), d_j\right)\right) \quad (20)$$
where $\theta_f$, $\theta_r$, $\theta_d$ denote the network connection weights of the feature extractor, the regression predictor and the domain classifier, respectively; $G_f$, $G_r$, $G_d$ represent the feature extractor, the regression predictor and the domain classifier, respectively. The gradient descent method is used to update the learning weights in the TL-DSTP-DANN model, which is expressed as follows:

$$\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial L_{reg}}{\partial \theta_f} - \lambda \frac{\partial L_{dom}}{\partial \theta_f}\right) \quad (25)$$

$$\theta_r \leftarrow \theta_r - \mu \frac{\partial L_{reg}}{\partial \theta_r} \quad (26)$$

$$\theta_d \leftarrow \theta_d - \mu \frac{\partial L_{dom}}{\partial \theta_d} \quad (27)$$

where $\mu$ represents the learning rate. The pseudo-code of the proposed TL-DSTP-DANN training process is shown in Algorithm 1.
Algorithm 1 TL-DSTP-DANN model training process.
Input: source domain site data $D_s$, target domain site data $D_t$
Output: parameters of the model $\theta_f$, $\theta_r$, $\theta_d$
1: for each training epoch do
2:   Forward:
3:   Calculate the regression loss $L_{reg}$ by Equation (16)
4:   Calculate the loss of domain label classification $L_{dom}$ by Equation (17)
5:   Calculate the loss of the "pseudo-function" $E$ by Equation (20)
6:   Backward:
7:   Calculate the gradients of $E$ with respect to $\theta_f$, $\theta_r$, $\theta_d$
8:   Update:
9:   Update the network weight parameters by Equations (25)–(27)
10: end for
11: return $\theta_f$, $\theta_r$, $\theta_d$
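A hedged sketch of Algorithm 1 in PyTorch follows. The three sub-networks, the data tensors, and the hyperparameter values are assumed to exist and are illustrative; `grad_reverse` is the GRL sketched earlier. A single backward pass over the summed loss realizes Equations (25)–(27), because the GRL flips the sign of the domain-loss gradient inside the feature extractor.

```python
import torch

def train(feature_extractor, regressor, domain_classifier,
          src_x, src_y, tgt_x, tgt_y, lam=0.1, epochs=100, lr=1e-3):
    params = (list(feature_extractor.parameters())
              + list(regressor.parameters())
              + list(domain_classifier.parameters()))
    opt = torch.optim.SGD(params, lr=lr)  # plain gradient descent, per Eqs. (25)-(27)
    mse, bce = torch.nn.MSELoss(), torch.nn.BCELoss()
    for _ in range(epochs):
        x = torch.cat([src_x, tgt_x])
        y = torch.cat([src_y, tgt_y])
        d = torch.cat([torch.zeros(len(src_x), 1),   # source label = 0
                       torch.ones(len(tgt_x), 1)])   # target label = 1
        feats = feature_extractor(x)
        loss = (mse(regressor(feats), y)
                + bce(domain_classifier(grad_reverse(feats, lam)), d))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return feature_extractor, regressor, domain_classifier
```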
5. Conclusions and Future Work
In this paper, a hybrid transfer learning strategy combining a dual-stage two-phase model and adversarial domain adaptation is proposed to predict PM2.5 concentration, especially for new sites with relatively little historical data. Firstly, the maximum mean discrepancy (MMD) is introduced into the proposed model to select the most suitable source domain site. Then, the data from the source domain and the target domain are input together into an improved DSTP model, which extracts the spatial–temporal features of both. DANN finds domain-invariant features between the source domain and the target domain by fusing the extracted spatial–temporal features. Finally, the PM2.5 concentration in the target domain is predicted by a regression predictor. To evaluate the performance of the proposed model, we use air quality data from the Beijing sites to assist in predicting PM2.5 concentrations at the Tianjin and Guangzhou sites. The main experimental results are as follows: (1) Compared with other transfer learning prediction models (including TL-LSTM, TL-BiLSTM, TL-DSTP), the proposed TL-DSTP-DANN model reduces MAE, RMSE and MAPE by more than 8.5%; (2) Transfer learning obviously improves the performance of PM2.5 prediction at newly built monitoring stations with insufficient data; (3) The comprehensive experimental results of the improved DSTP model combined with DANN are better than those of CNN, LSTM, and BiLSTM. Compared with BiLSTM, the MAE of the improved DSTP model decreases by 12.05%, and the MAPE decreases by 20%.
In our future work: (1) The current dataset contains only historical air pollutant concentrations and meteorological data, lacking relatively important geographical data, even though geographical factors have an impact on PM2.5 concentrations. In the future, the proposed method could provide higher prediction accuracy if datasets containing geographical information are available; (2) An algorithm needs to be investigated to determine which of multiple source domain datasets is most suitable for transfer learning to the target domain; (3) It remains to be studied whether the method proposed in this paper can be used to predict other air pollutants, such as O3 and SO2.