1. Introduction
With the global trend and people’s attention, how to monitor the air quality scientifically and effectively and how to further prevent and control air pollution has become a hot topic. The problem of air pollution is very complex, which is characterized by multi pollution coexistence, multi-scale correlation, and multi process evolution. In order to solve this complex problem, it is particularly important to strengthen the construction of air quality monitoring and air quality management information. Only effective prediction, analysis, and research on air quality can effectively improve air quality. How to effectively use the real-time monitoring data of each city’s automatic air monitoring station, mine its internal information, use the monitoring data to build a bridge for analyzing the pollution problem [
1], effective improvement of air quality, improvement of people’s living environment to maintain people’s health is an urgent problem to be solved.
The air quality index (AQI) is calculated by monitoring the concentration of fine particulate matter (PM
2.5), inhalable particulate matter (PM
10), sulfur dioxide (SO
2), nitrogen dioxide (NO
2), ozone (O
3), and carbon monoxide (CO). In recent years, due to the increasing consumption of various energy resources and the increase of cumulative emissions, the problem of air pollution has seriously increased. More and more attention has been paid to the study of air pollution. In order to better adapt to the global trend and create a good air environment, using data mining technology to establish air quality analysis, the forecast model has become an important topic [
2,
3,
4,
5,
6,
7,
8,
9].
Collecting AQI data is key for monitoring pollution problems. To solve the AQI problem, various approaches have been used for data mining, including artificial neural network (ANN), genetic algorithm (GA), decision tree (DT), random forest (RF), and support vector machine (SVM) [
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20]. Each method has a single basic point of view and provides a general performance analysis of air quality indicators, but it is difficult to distinguish the best method. Recent studies have proposed various intelligent systems, and the results seem applicable [
21,
22,
23,
24,
25,
26]. However, these investigated methods have an important shortcoming: they cannot simultaneously provide parameter optimization for an algorithm as well as decision rules. For AQI evaluation, decision rules can be updated according to datasets in the evaluation process and can be used to predict new evaluation results. For that we would aim to investigate an algorithm based on the characteristics of AQI decision rule establishment and parameter optimization. Then the urban air quality forecast and analysis can be based on an intelligent algorithm with parameter optimization and decision rules. We therefore propose an intelligent algorithm that combines DT and simulated annealing (SA), in which DT generates decision rules and SA converges to a global optimum, and the parameters of DT are determined by SA. The rules extracted in this paper could be used to analyze collected information, then forecast a new AQI. In what follows, we review the decision tree in
Section 2, introduce the proposed algorithm in
Section 3, and analyze the simulation results and discussions in
Section 4. Finally, we draw the conclusion.
2. A Brief Description of the Decision Tree Algorithm
In our previous work [
27,
28], we applied the DT algorithm in anomaly intrusion detection and found it to have excellent classification performance. The DT has the advantages of intuitive expression and convenient operation and is widely used in research [
29,
30,
31,
32,
33,
34,
35,
36]. It consists of a root node, a child node, and a leaf node. After the structure is established, the required data are tested, starting from the root node. Depending on the different data attributes, the sub-node selects a property and moves to another sub-node recursively until the leaf is reached. Nodes and leaf nodes are the classifications for data prediction. When a DT is constructed, the attribute with the highest information gain rate is the split attribute of the current node. With recursive calculation, the information gain rate of the calculated attributes becomes smaller and smaller, and in the latest stage, the attribute with relatively large information gain rate will be selected as the splitting attribute, and the DT uses the
Gini coefficient minimization criterion to perform feature selection to generate a binary tree [
29,
30]. The
Gini coefficient minimization criterion is calculated as follows:
indicates the probability that the selected sample belongs to the
k class; the probability that the sample is split is
. For a given sample set
, the
Gini index is:
Here,
is the sample belonging to the
class in
, and
is the number of classes. If the sample set
is divided into two parts
and
according to whether a feature A takes a certain value a, namely:
then under the condition of feature
, the Gini index of set
is defined as:
The
Gini index
Gini represents the uncertainty value of the set
, and the
Gini index
Gini (
D,
A) represents the uncertainty value of the set
after
partitioning. The larger the
Gini index value, the greater the uncertainty result of the sample set. When using the DT algorithm, the two parameters of minimum case (
) and the pruning confidence factor (
) will have different combinations when facing different problems or cases [
29]. In this paper, the SA algorithm is used to adjust and determine the best combination of these two parameters and the best solution of the problem.
3. The Proposed Algorithm
This paper proposes an algorithm for urban air quality forecast and analysis that is based on an intelligent algorithm with parameter optimization and decision rules. In the study, in order to verify the performance of the proposed algorithm, we use Beijing air quality data in CSV (Comma-Separated Values) format [
37]. A partial original data is shown in
Table 1. The seven different features are listed in
Table 2 [
38]. As shown in
Table 2, these pollutants cause poor air quality, affect human living environment and harm human health. The real-time historical data of AQI from 1 January 2017 to 6 October 2018 in the District of Dongcheng, Beijing, 11270 AQI instances with seven different features were collected.
Table 3 presents partial data for the resulting AQI.
According to Environmental Air Quality Standards GB 3095-2012, discrete AQI data are classified according to pollution levels one (excellent) through six (serious). The corresponding relationship between AQI and air quality level is shown in
Table 4.
Table 5 presents partial data for air quality level.
Metropolis introduced SA and proposed an importance sampling method—i.e., accepting new states with probability—called the Metropolis criterion [
39]. This is the basic idea of SA algorithms. Kirkpatrick et al. first proposed the simulated annealing algorithms in 1983 [
40,
41]. SA makes the optimal solution asymptotically convergent and is widely used to solve optimization problems. In recent years, with the rapid increase of information, there has been a huge amount of data (big data) which is larger than the traditional data. Under such a large amount of AQI data, how to find useful data from it has become an important issue. DT is based on the tree structure, presenting the data rules, enabling analysts to understand the implicit knowledge of the data and interpret it, which is widely used in various fields [
30,
31,
32]. However, before establishing the decision tree model, it is necessary to set its relevant parameters, which will affect the result. Under different parameter combinations, if the parameter values are not adjusted properly, the classification result will be poor. Because parameters minimum case (
M) and the pruning confidence factor (
) of the DT will be different due to different problems, it is very time-consuming to manually adjust them.
Therefore, this paper proposes an intelligent algorithm combining DT and SA, and studies an algorithm based on AQI decision rule establishment and parameter optimization. Then, based on the intelligent algorithm of parameter optimization and decision rules, the urban air quality is predicted and analyzed. This study combines the advantages of DT and SA. DT generates decision rules, SA converges to the global optimum, and the parameters minimum case (M) and the pruning confidence factor () of the DT determined by SA. The rules extracted in this paper can be used to analyze the collected information and then forecast a new AQI.
Figure 1 shows a flow chart of the proposed algorithm; the AQI dataset is pre-processed as training and testing data, then initial values for the parameters are proposed; after that, the initial solution can be generated randomly. The proposed algorithm begins with four parameters, namely
, where
denotes the number of generations,
represents the initial temperature,
represents the final temperature that stops the proposed algorithm if the current temperature is lower than
, and
is the coefficient controlling the cooling rate,, respectively. The current temperature
is set to be the same as
. The solution is represented as seven features followed with two variables
, and
as shown in
Table 6. An initial solution
is randomly generated according to the representation of solution in
Table 6. For each generation, the next solution
is generated from
by randomly swapping these seven features and randomly generating these values of four variables in the current solution.
T is decreased after running
generations, according to a formula
, where 0 <
< 1. Let
denotes the testing accuracy of
, and
denote the difference between
and
; that is
. The probability of replacing
with
, where
is the current solution and
is the next solution, given that
> 0, is
. This is accomplished by generating a random number
∈ [0,1] and replacing the solution with
if <
. Meanwhile, if
≤ 0, the probability of replacing
with
is one. In the proposed algorithm, SA and DT are performed to optimize parameters (
and
) to increase the testing accuracy for selected features and build the decision rules. The proposed algorithm is repeated until
is lower than
. Thereafter, the best testing accuracy, and decision rules are reported.
The proposed approach uses the accuracy based on the confusion matrix, which can test the performance of the classification method. The confusion matrix is shown as
Table 7.
TP, FP, FN, and TN represent true positive class, false positive class, false negative class, and true negative class, respectively. The predicted value is a positive example, which is recorded as P (positive). The predicted value is a negative example, which is recorded as N (negative). When the predicted value is the same as or opposite to the actual value, they are recorded as T (true) or F (false), respectively. Four results of defining of examples in the dataset after model classification are: TP: predicted positive class or actually positive class; FP: predicted positive class or actually negative class; TN: predicted negative class or actually negative class; FN: predicted negative class or actually positive class. The classification accuracy calculation formula is as follows:
The receiver operating characteristic curve (ROC curve) and area under the curve (AUC) can test the performance of classification results. Because ROC curve has a good characteristic, when the distribution of positive and negative samples in the test set is changed, ROC curve can be still unchanged. Class imbalance often occurs in the actual data set, that is, there are many more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the test data may change with time. The area under ROC curve is calculated as the evaluation method of imbalanced data. It can comprehensively describe the performance of classifier under different decision thresholds. AUC calculation formula is as follows: