1. Introduction
In line with the rapid development of various industrial sectors, air pollution frequently occurs worldwide, including in Malaysia. According to the World Health Organization (WHO) [
1], air pollution is defined as the contamination of indoor and outdoor environments by impurities that modify the natural features of the environment. Data collected by WHO reveal that most of the global population breathes highly contaminated air that exceeds WHO guidelines. Air pollution can cause detrimental effects on human health, especially the respiratory system, and becomes one of the fundamental sources of morbidity and mortality [
1].
In Malaysia, the air pollutant index (API) adopts six main air pollutants and serves as an indicator to deliver accurate and insightful information on air quality status in any area to the public [
2]. Rani et al. [
3] analyzed the trend of the API in Malaysia from years 2010 to 2015 based on various categories by using XLSTAT. In October 2010, the concentration of particulate matter 10 μm or less in diameter, better known as PM
10, was extremely high in some areas in Johor following the occurrence of forest fires in Indonesia, which led to high API values as the highest relative subindex of monitored pollutants that account for the API readings [
3,
4]. This suggests that such fine dust often found in polluted air contributes greatly to the variability of the API [
3].
Particulate matter is not just the main air pollutant in the Southeast Asia region, but is also identified as the most severe city pollutant around the globe [
5,
6]. For instance, most of the daily average PM
10 concentration at three monitoring stations in Buenos Aires from the years 2010 to 2018 exceeded the standard limit of WHO guidelines, that is, 50
g/m
3 [
7]. Some research findings highlight that the particulate matter concentrations have certain correlations with the weather conditions, four seasons and monsoons [
8,
9,
10].
Due to the increasing public awareness of the dangers of air pollution, numerous air quality-related studies have been performed using various statistical and deep learning models, including forecasting and clustering. Clustering is an exploratory data analysis technique that investigates the fundamental structure of data [
11]. By adopting the clustering technique, the data are assigned into several distinct groups based on their degree of similarity before any further analysis or modeling can be performed. As the data within the cluster can be treated using the same analysis technique, it can save costs and computation time. There are several types of clustering methods, such as partitional clustering, hierarchical clustering and fuzzy clustering. Hierarchical clustering groups similar objects into clusters that eventually merge into a single cluster, whereas fuzzy clustering is a soft-clustering technique in which the objects can be clustered into more than one cluster. As a partitional clustering method, the k-means algorithm is one of the most common and popular techniques since it can be implemented easily [
12]. It classifies data with closer centroid values into the same cluster such that the differences between the clusters are maximized. For instance, k-means clustering was used to analyze the significant changes in air quality in Southampton [
13]. While Kim et al. [
14] applied this algorithm to cluster monitoring stations in the United States based on different temporal patterns of PM
2.5, Beaver and Palazoglu [
15] adopted it to classify classes of ozone episodes in San Francisco.
Air quality time series clustering in Malaysia is often utilized to identify the pattern between the clusters and categorize the area into zone based on the pollution level so that government policies can be executed accurately [
16]. In this context, Suris et al. [
17] clustered the PM
10 data in Malaysia using dynamic time warping (DTW) as the dissimilarities measure. Adopting four clustering techniques, that is, k-means, partitioning around medoid (PAM), agglomerative hierarchical clustering (AHC) and fuzzy k-means (FKM), the results show that the clusters were formed mainly on the basis of the region and geographical location of the stations instead of the station category and local economic activities. A similar result was obtained by Rahman et al. [
11], whereby the stations were classified into high, medium and low pollution regions, respectively, using the AHC technique based on the daily average PM
2.5 concentration.
As climatic and environmental issues concern society, air quality forecasting has become the focus among researchers as an accurate prediction that can reduce the effect of pollution on humans and the biosphere [
18]. Therefore, various types of prediction models have been applied in previous studies. For instance, Aditya et al. [
19] used the logistic regression and autoregression (AR) models to detect air quality and predict the concentration of PM
2.5. A similar approach is shown in the research by Bhalgat et al. [
18], which adopted AR and autoregression integrated moving average (ARIMA) models to predict the concentration of sulfur dioxide (SO
2). Meanwhile, Guo et al. [
20] used a geographically and temporally weighted regression model to calibrate the spatiotemporal dynamic PM
2.5 concentrations to manage haze pollution in China. The random forest method is also deemed capable of modelling various concentrations of air pollutants, such as PM
2.5 and ozone [
21,
22]. In fact, random forest regression is believed to predict air pollutant concentrations more accurately than linear regression and decision trees [
23].
In recent years, neural networks have been preferred by researchers rather than the abovementioned traditional models due to their ability to fit non-linear data with higher accuracy [
10]. The long short-term memory (LSTM) model is a deep learning method modified based on the concept of the recurrent neural network (RNN). Given its strength in solving the shortcomings of the RNN model, such as poor performance with tasks that involve long-term dependency and a vanishing and exploding gradient, the LSTM is found to be suitable to predict sequential data, including time series data. The outstanding performance of the LSTM model is observed through a lower root mean squared error (RMSE) in predicting the prices of gold [
24] and Bitcoin [
25], as well as influenza-like illnesses and respiratory diseases [
26].
In terms of air quality prediction, the LSTM model also possesses great potential to give an accurate result [
27]. The findings obtained by Bakar et al. [
28] show that the multivariate LSTM model predicted the PM
10 concentration at five selected monitoring stations most accurately with the lowest RMSE values, followed by the univariate LSTM model and the univariate ARIMA model. Aiming to increase prediction accuracy, hybrid models that involve a combination of techniques are gaining popularity in the research field. Zhang et al. [
29] discovered that the combination of principal component analysis (PCA) and least squares support vector machine (LSSVM) can reduce the noise in meteorological data, hence giving more accurate predictions in API than the ARIMA model. The PCA–ANN model that uses only the significant parameters also seems competitive in giving a better prediction than the standalone artificial neural network (ANN) model [
30].
For the case of clustering-based LSTM model, it considers the changes in features that are more specific in each cluster, making it an ideal choice to improve prediction accuracy. Yulita et al. [
31] utilized fuzzy clustering and bidirectional LSTM (Bi-LSTM) to obtain higher accuracy and precision in classifying sleep stages. In accordance with the findings obtained in the study on the load prediction for dynamic spectrum allocation performed by Liu et al. [
32] using AHC–LSTM, Li et al. [
33] also found that type-2 fuzzy clustering-based LSTM can increase the accuracy with a much shorter model training time in long-term traffic volume prediction than the LSTM, random forest, back propagation network (BPN) and deep neural network (DNN).
Besides the abovementioned combinations, k-means clustering is also one of the widely used techniques in hybrid models. Ao et al. [
10] first clustered meteorological data according to seasons using the k-means algorithm, then combined the clustering results with the air pollutant concentrations to be input into the Bi-LSTM model. It was found that the proposed model outperforms the other models as it can overcome the continuous fluctuation in meteorological conditions. Using the k-means–LSTM model, Baca et al. [
34] also obtained a better air quality prediction in Andahuaylas, Peru.
Air quality prediction is indeed important for society to take preliminary preparations and preventive measures against poor air conditions. In order to figure out the potential of the hybrid model in predicting the daily average PM10 concentration in Malaysia, this study proposes a clustering-based LSTM model and compares its performance with the univariate LSTM model without clustering. Being a state-of-the-art deep learning method, the LSTM model usually outperforms conventional forecasting models in prediction accuracy. However, it is too time-consuming and unrealistic to construct the model individually for each station, especially in real-life applications. If the model is trained based on a few samples and generalizes its finding to all stations, it might cause an undesirably low accuracy at some stations outside the sampling. Therefore, such a combination of techniques is deemed capable of increasing the prediction accuracy with much less computation time, thus proving to be more efficient than the classical forecasting technique.
3. Results and Discussion
3.1. Descriptive Analysis
The dataset was split into a training set and a test set by a ratio of 8:2, where the training set consists of data ranging from 5 July 2017 to 30 September 2018 and the test set comprises the last four months, that is, from 1 October 2018 to 31 January 2019.
Table 1 shows the minimum value, maximum value and quartiles for the whole dataset.
3.2. Time Series K-Means Clustering
Before the clustering and modeling phases were carried out, the training set was scaled into a range of [0, 1] by adopting min–max normalization. Then, the 60 monitoring stations were clustered based on the k-means algorithm. To identify the optimum
clusters, the values of WGSS were calculated and visualized in
Figure 3 for
.
By using the elbow method, the optimum number of clusters was estimated to be between
and 4. To further validate the goodness of separation, the silhouette index was applied to the identified candidates.
Table 2 shows the silhouette scores for each number of clusters.
Based on the table above, has the highest silhouette score, while has the lowest index. A higher index score indicates a better partitioning of the data, hence is said to be the optimum number of clusters.
The clustering results show that Cluster 1 consists of 19 stations, whereas Cluster 2 comprises 41 stations.
Table 3 lists the cluster membership for the daily average PM
10 concentration according to the stations.
Figure 4 shows the distribution of stations according to clusters.
It was found that most stations in Cluster 1 are in the more developed states along the west coast of Peninsular Malaysia, such as Selangor, Perak, Pulau Pinang and Kuala Lumpur. On the other hand, Cluster 2 is mainly made up of stations that are widely distributed in the less developed states around the east coast of Peninsular Malaysia and east Malaysia, including Terengganu, Kelantan, Sabah and Sarawak.
Moreover, the number of stations based on categories according to the clusters is shown in
Figure 5.
The figure above demonstrates that most stations in Cluster 1 are located in suburban and urban areas in Klang Valley with only one station falling in the rural and industrial areas, respectively. In addition, the majority of the stations in Cluster 2 are categorized as suburban, followed by rural, industrial and urban. On top of that, it was observed that there are more stations located in suburban, rural and industrial areas in Cluster 2 as compared to Cluster 1, which has more urban stations.
After classifying the test set into the clusters, the minimum values, maximum values and quartiles according to the clusters are tabulated in
Table 4.
Table 4 highlights that the range of the daily average PM
10 concentration for the whole dataset in Cluster 2, that is, 231.45
g/m
3, is much higher than the range of 173.66
g/m
3 in Cluster 1. The station locations that mainly spread in the neighboring states might give rise to this situation in accordance with a similar level of haze pollution carried by the monsoon winds [
8,
9]. On the other hand, the median of the daily average concentration of PM
10 of the whole dataset in Cluster 1 is higher than Cluster 2 by 9.08
g/m
3. Such a circumstance is believed to be closely related to the fact that most stations in Cluster 1 are in highly developed areas, including Klang Valley and Pulau Pinang [
11].
The time plots of the daily average concentration of PM
10 for the training set and test set of the selected stations in each cluster are extracted and visualized in
Figure 6 and
Figure 7, respectively.
From
Figure 6, it can be seen that the stations within each cluster have a similar and stable time series pattern across the time range, except for a few spikes observed during a certain period. The drastic increase in the concentration of PM
10 for both clusters around August until mid-September 2018 seems to be closely associated with the transboundary haze that affected most areas of Malaysia at that point.
According to Yusof [
48], the unhealthy API readings were recorded in some states due to haze originating from North Sumatra and West Kalimantan at the time. The situation became worse and lasted until September as the southwest monsoon wind blew toward Peninsular Malaysia. Some states also experienced hot and dry climates with less rainfall, giving rise to the increase in the daytime temperature. Such weather caused wildfires in certain locations, for instance, the occurrence of peatland fires in Klang, Selangor [
49]. As a result, the air quality decreased at station CA21B in Klang, followed by an increase in the daily average concentration of PM
10 to the maximum value of 180.23
g/m
3 in Cluster 1.
Referring to the time plots in Cluster 2, the highest daily average concentration of PM
10 during the hazy period was recorded by station CA55Q, which is located in Permyjaya, Miri, Sarawak. This situation was deemed to be primarily driven by the forest fires at the nearby Industrial Training Institute, Permyjaya, which reduced the air quality in Miri and worsened the hazy conditions. According to Kawi [
50], the API reading in Miri reached an unhealthy level of 130 in the morning on 19 August 2018. In conjunction with the nearly unhealthy API readings caused by the wildfire smoke from West Kalimantan, Indonesia, the PM
10 concentration at other stations in Sarawak, such as Bintulu, Mukah, Sibu and Sarikei, also reported an increase during the hazy period.
Generally, the values of the test set data are at a lower level compared to the training set, that is, not exceeding 75
g/m
3 in both clusters, as shown in
Figure 7. It then leads to a small difference of 3.41
g/m
3 in the data range between both clusters based on
Table 4.
In a nutshell, the time series k-means clustering has assigned the stations into two clusters with a size of 19 and 41 stations, respectively. This result forms the basis of the proposed model.
3.3. Construction of Hybrid Models
A multivariate LSTM model was trained based on the training set for each cluster. An optimum setting of the values of the hyperparameters was tuned manually to achieve the best model performance in the training phase. After a few trials, it was found that the models for both clusters perform well under the same hyperparameter settings as tabulated in
Table 5.
By applying the settings above, the MSE and RMSE, as well as the computation time were computed to evaluate the fitness of the hybrid models to the training set, as shown in
Table 6.
As depicted in the table above, the RMSE values for both of the hybrid models are significantly low in the training phase, indicating that the constructed models can learn the trend of the training set well. In terms of the training time, both models required a similar duration, between 83 s and 85 s.
3.4. Construction of Univariate LSTM Models
By using the same hyperparameter settings with the corresponding hybrid models as shown in
Table 5, a univariate LSTM model was constructed independently for each station. The model performance and computation time were recorded in
Table 7 to assess the degree of fitness of each model to the training set.
Overall, the RMSE values for the univariate LSTM models during the training phase are comparatively higher than the hybrid models, indicating a more unsatisfied fitness to the training set. Nevertheless, there are 38 stations with RMSE values lower than 0.1 in the training phase. In addition, about 74 s to 99 s were needed to train the univariate models.
3.5. Comparison of Prediction Performance between Hybrid Models and Univariate LSTM Models
The prediction performance was computed by comparing the predicted values and the actual test data based on three accuracy metrics, namely RMSE, MAE and MAPE. Then, the difference in prediction performance between the two models was measured based on RPD for each metric. If a model has a smaller value than another for at least two metrics, then it is said to have a better prediction performance. Moreover, a hybrid model is said to have comparable prediction accuracy to the univariate model if the RPD values are less than or equal to 50%.
Table 8 displays the abovementioned values for all the stations; the smaller values of accuracy metrics and RPD values below or equal to 50% are listed in bold.
Based on
Table 8, the hybrid model has recorded a lower value for at least two accuracy metrics at two stations in Cluster 1, which are CA16W and CA17W. Despite having a better prediction performance for most stations, the univariate model does not significantly outperform the hybrid model based on RPD values. This is because the RPD values are more than 50% for at least two accuracy metrics at only four stations, which are CA21B, CA22B, CA33J and CA34J. Hence, a conclusion stating that the proposed model has a competitive prediction performance in Cluster 1 can be drawn.
On the other hand, it is highlighted that the proposed model is capable of giving a more accurate prediction for station CA02K based on much lower RMSE, MAE and MAPE values compared to the univariate model. Focusing on the RPD values, the prediction performance of the proposed model only varies significantly from the univariate model at 13 stations in Cluster 2.
There are 39 stations with an RPD less than or equal to 50% for RMSE. Among these stations, 12 of them have RPD values within 0–10%, 6 stations have RPD around 10–20%, 10 stations and 3 stations have a range of 20–30% and 30–40%, respectively, while the rest have RPD values within 40–50%. Meanwhile, most of the satisfactory RPD values based on MAE fall in the range of 0–10% (12 stations), followed by the range of 30–40% (9 stations), 10–20% and 20–30% (8 stations, respectively) and 40–50% (6 stations). Lastly, 47 stations have an RPD less than or equal to 50% for MAPE. It is observed that most of the RPD values based on MAPE fall in the range of 0–10% (18 stations), followed by 10–20% (12 stations), 20–30% (7 stations), 40–50% (6 stations) and 30–40% (4 stations). In short, the hybrid model can output a competitive prediction performance compared to the univariate model, as it records an acceptable range of RPD values based on all three metrics.
If the prediction performance of the hybrid model does not significantly vary from the univariate model based on RPD for at least two accuracy metrics at each station, then it can be concluded that the proposed model is suitable to forecast the PM
10 concentration at that station. From
Table 8, the hybrid model seems to be potentially adopted as the PM
10 prediction model for 43 stations (71.67%), whereas the univariate LSTM model is more suitable to be employed for the stations in Johor, Terengganu and Sabah.
Figure 8 shows the actual and predicted values for selected stations from both clusters. Both models can fit the actual data trend well for stations CA10A (Cluster 1) and CA01R (Cluster 2). Plots from other stations were also investigated and similar results were observed.
To summarize, the prediction accuracy of the hybrid model does not significantly deviate from the univariate model, as the RPD values are within the 50% acceptable range at 43 stations for 71.67% of the stations. This has proven the capability of the hybrid model to predict the PM10 concentration at a similar accuracy level to the univariate model. Furthermore, the hybrid model can capture and fit the actual data trend quite well for most stations with a rather shorter computation time than the univariate LSTM model. This is closely related to the fact that only one hybrid model is constructed for each cluster, whereas the univariate model is individually constructed for each station, leading to a total model training time of 4951.842 s for 60 univariate models and just 168.237 s for two hybrid models. Such a rather shorter computation time without any drawback on prediction performance or trend fitness has made the hybrid model a more ideal forecasting model.
Nevertheless, the occurrence of hazy conditions at certain periods in the training set that negatively affected the air quality of each location at different levels is one of the factors that leads to a better prediction accuracy of the univariate LSTM model for some stations. The PM
10 concentration increases drastically during hazy days in conjunction with the high emissions of particulate matter and greenhouse gases. On the other hand, PM
10 is at a low concentration during normal days as the aerosol particles are released by mobile sources, including motor vehicles, and stationary sources, such as factories [
6]. Due to the nature of the hybrid model that uses the data from all the stations within the same clusters to predict the PM
10 values without considering much about the localized pollution level as in the univariate model, this might cause the tendency to overestimate PM
10 for some stations that are less affected by the transboundary haze.
In addition, the concentration of PM
10 is mainly influenced by other meteorological factors, such as wind speed, temperature and relative humidity [
6]. The concentration of particulates is found to have a correlation with the temperature, wind speed, dew point and air pressure [
6,
19]. In accordance with this, Zhang et al. [
51] found that there is a significant correlation between particulates and relative humidity during the winter season in Nanyang. Meanwhile, Pineda Rojas et al. [
7] also revealed that the high daily average PM
10 concentration is often recorded when the sky cover and relative humidity are low. Similar to the finding that the PM
10 concentration is high during the southwest monsoon season [
9], Yassen and Jahi [
8] discovered that the TSP concentration in Klang Valley is higher during that season as compared to the rainy season. Thus, it can be concluded that different real-time meteorological conditions at each station will influence the concentration of particulate matter and lead to a slightly lower prediction accuracy of the hybrid model for some stations.
4. Conclusions
In brief, this study proposed a novel hybrid model that combines both the k-means clustering technique and the state-of-the-art LSTM model in predicting the daily average PM10 concentration in Malaysia. Throughout the study, comparisons were made between the hybrid model and the univariate LSTM model in terms of prediction performance, trend fitting and computation time.
In this study, 60 air quality monitoring stations were divided into two distinct clusters by adopting the time series k-means clustering method. Cluster 1 consists of 19 stations that are mainly distributed in highly developed areas, such as Klang Valley and Pulau Pinang, such that most of them fall under the urban and suburban categories. On the other hand, Cluster 2 comprises 41 suburban and rural stations that are located mainly on the east coast of Peninsular Malaysia, Sabah and Sarawak. The within-cluster time series patterns are quite similar and relatively stable with a few unexpected spikes, especially during the transboundary hazy period.
The results show that the hybrid model can give a comparable prediction performance to the univariate LSTM model based on the RPD values for three accuracy metrics. In terms of fitting the actual trend, the hybrid model can capture the patterns of daily average PM10 concentration, although it gives a poorer result compared to the univariate model for some stations due to several factors, such as the hazy period in the training set that contaminated the air quality at a different level and the varying meteorological conditions at each location. In addition, the hybrid model significantly outperforms the univariate LSTM model based on its much shorter training time, suggesting the capability of the proposed model to effectively increase the prediction efficiency in real-life applications.
As for the future research direction, it is suggested to consider the other meteorological factors, especially wind speed, during the clustering phase to reduce their impacts on the PM10 concentration. Moreover, the hourly PM10 concentration also warrants further study so that the public can better plan their daily activities beforehand. In such a context, two-step k-means clustering could be implemented to better capture the variation in the PM10 concentration before constructing the forecasting model for each subclass of the main clusters. Last but not least, a comparison between hybrid models that employ different forecasting methods, such as ARIMA, gated recurrent unit (GRU) and LSSVM models, can be carried out to identify which combination of techniques can predict the PM10 concentration better.