1. Introduction
Air pollution remains one of the most pressing global health challenges, identified as the second leading risk factor for premature death worldwide. In 2021 alone, air pollution was responsible for approximately 8.1 million deaths globally, underscoring its profound impact on human health [1]. Fine particulate matter (PM2.5) refers to particles with an aerodynamic diameter of 2.5 µm or less. PM2.5 particles are particularly concerning because they are small enough to penetrate deep into the lungs and even enter the bloodstream, posing significant risks to human health. The Global Burden of Disease (GBD) study estimated that ambient PM2.5 exposure was responsible for approximately 4.14 million deaths globally in 2019 [2]. These particles are associated with a wide range of health outcomes, including stroke, ischemic heart disease, chronic obstructive pulmonary disease (COPD), and lung cancer [3,4,5,6,7,8]. The respiratory system, especially the lungs, is vulnerable to PM2.5-induced toxicity; the resulting inflammation and impaired immune responses increase susceptibility to respiratory infections [9]. Growing evidence suggests that PM2.5 exposure is also linked to neurodegenerative diseases, as the small size of the particles enables them to reach the brain via the olfactory nerve [10]. Recent trends have shown an alarming increase in PM2.5 emissions due to wildfires, exacerbated by climate change and land management practices. Wildfire-related PM2.5 pollution has been observed to travel long distances, affecting regions far beyond the initial fire location [11]. Wildfires in the western United States have increased in frequency and intensity since the mid-1980s, primarily driven by rising temperatures and earlier spring snowmelt [12]. Climate projections suggest that the area affected by wildfires in the western U.S. could expand by 54% between 2046 and 2055 compared to 1996–2005 [13]. During severe wildfire events, PM2.5 levels can spike to hazardous levels, exceeding the Environmental Protection Agency's (EPA) threshold of 225.5 µg/m3 for hazardous air quality [14]. Given the severe health impacts and increasing frequency of extreme PM2.5 pollution events, it is critical to implement proactive measures such as improved air quality monitoring, stricter emission control policies, and enhanced public health advisories.
PM2.5 forecasting is critical for protecting public health by enabling timely interventions, reducing exposure to hazardous air pollution, and supporting broader pollution management efforts. However, PM2.5 forecasting is challenging due to the complex interactions among atmospheric chemistry, meteorological variability, and human activities, which cause rapid fluctuations in pollutant levels [15,16]. Additionally, capturing temporal variations and spatial distributions is crucial for effective exposure assessment and health impact evaluation. Various modeling approaches have been used for PM2.5 forecasting, ranging from traditional statistical methods, such as the Autoregressive Integrated Moving Average (ARIMA), to Artificial Intelligence and Machine Learning (AI/ML) models. These AI/ML techniques include nonlinear models such as Support Vector Regression (SVR) and Artificial Neural Networks (ANNs), which have shown promise in capturing complex relationships in air quality data [17,18]. While ANNs have been widely applied to PM2.5 forecasting, their shallow structures often limit feature learning in complex datasets [19]. Recent Deep Learning (DL)-based approaches, including Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models, have significantly improved both spatial and temporal pattern modeling [20]. Hybrid models combining CNNs and LSTMs have further enhanced forecasting accuracy, particularly for datasets with spatiotemporal complexity [21,22]. However, DL models still face challenges such as vanishing gradients and limited long-term dependency modeling [21]. Transformer models, initially designed for Natural Language Processing (NLP) [23], have shown promise for long-term PM2.5 forecasting due to their ability to capture long-range dependencies [24]. Unlike recurrent models, Transformers rely on self-attention mechanisms that allow more efficient information flow across sequences [23]. The Informer model [25] improves temporal embeddings to learn non-stationary and long-range temporal dependencies; however, it focuses solely on "temporal attention" and overlooks spatial relationships between variables. The authors of [26] addressed this by developing a graph Transformer that captures dynamic spatial dependencies, using sparse attention to prune less relevant nodes. The Spacetimeformer [27] advanced this further by flattening multivariate time series to handle spatial and temporal influences jointly. Recent models such as the Sparse Attention-based Transformer (STN) [28] effectively reduce time complexity while capturing long-term dependencies in PM2.5 data. Similarly, the SpatioTemporal (ST)-Transformer [29] was designed to improve spatiotemporal predictions of PM2.5 concentrations in wildfire-prone areas.
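To make the self-attention mechanism concrete, the sketch below shows a minimal PyTorch encoder that maps a window of past multivariate observations to a one-step-ahead PM2.5 forecast. It is illustrative only: the layer sizes, feature count, learned positional embedding, and single-step head are assumptions, and it uses standard dense multi-head attention rather than the sparse variants of Informer, STN, or the model developed in this study.

```python
# Minimal sketch (not this study's exact architecture): a Transformer encoder
# that maps a window of past multivariate observations to a one-step-ahead
# PM2.5 forecast. All dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

class PM25Transformer(nn.Module):
    def __init__(self, n_features=8, d_model=64, n_heads=4, n_layers=2, seq_len=24):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)                 # embed raw features
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learned positions
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128,
            dropout=0.1, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                                # one-step forecast

    def forward(self, x):
        # x: (batch, seq_len, n_features) -- past hourly observations
        h = self.input_proj(x) + self.pos_embed
        h = self.encoder(h)               # multi-head self-attention over the sequence
        return self.head(h[:, -1, :])     # predict from the last time step's representation

# Example: a batch of 32 sequences, 24 hourly steps, 8 assumed predictor features
model = PM25Transformer()
y_hat = model(torch.randn(32, 24, 8))     # -> shape (32, 1)
```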
An often-overlooked challenge in PM2.5 forecasting is data imbalance, particularly in predicting high pollution levels. AI/ML models have demonstrated strong performance in forecasting PM2.5 at lower concentrations but often struggle to accurately capture extreme pollution events in which PM2.5 levels exceed 60 µg/m3 [30]. Studies have consistently shown that PM2.5 concentrations tend to be underestimated during severe pollution episodes because high-value events are underrepresented in the training data [31,32]. This imbalance results from the rarity of extreme pollution spikes, making it difficult for models to generalize and predict these critical conditions effectively [33,34,35]. Although this challenge is well known, relatively few studies have focused on solutions for improving predictions of extreme PM2.5 levels [36,37,38]. An effective strategy to address this imbalance is data augmentation, which expands the training dataset by introducing varied and informative samples, improving data diversity and quality. This approach enhances the representation of underrepresented patterns, leading to better model robustness and generalization [39,40,41]. Undersampling and oversampling are data augmentation techniques developed to address the challenges of imbalanced datasets, each employing a distinct strategy to adjust the data distribution and improve model performance. Oversampling techniques increase the representation of the minority class by generating or duplicating data points to improve data diversity and representation. Random oversampling, the simplest approach, duplicates minority-class instances but can lead to overfitting in conventional models [42]. The Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN) are widely used oversampling methods for mitigating the effects of imbalanced datasets [43], with variants such as SMOTE with k-means also being prominent [44,45].
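As a brief illustration of these oversamplers, the snippet below applies SMOTE and ADASYN from the imbalanced-learn package to a synthetic dataset in which a continuous PM2.5-like target has been binarized at an assumed 35.5 µg/m3 cutoff; the data, cutoff, and feature dimensions are hypothetical and serve only to show the resampling call.

```python
# Illustrative only: SMOTE/ADASYN operate on class labels, so the continuous
# PM2.5 target is binarized at an assumed cutoff before resampling.
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))                      # hypothetical predictor matrix
pm25 = rng.gamma(shape=2.0, scale=6.0, size=5000)   # synthetic, right-skewed PM2.5 values
y = (pm25 > 35.5).astype(int)                       # minority class = "high" PM2.5

X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
X_adasyn, y_adasyn = ADASYN(random_state=0).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_smote), np.bincount(y_adasyn))
```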
Conversely, undersampling techniques reduce the dominance of the majority class by removing data points, aiming to create a more balanced representation. Random undersampling deletes majority-class instances at random but risks information loss [42]. Undersampling methods are therefore often combined with clustering approaches to balance datasets while preserving data structure: the data are grouped into several clusters, for example with k-means, and representative points are then selected from each cluster to minimize information loss [46,47].
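A minimal sketch of this cluster-based undersampling idea, assuming k-means from scikit-learn and a nearest-to-centroid selection rule, is given below; the cluster count, per-cluster quota, and helper name cluster_undersample are illustrative choices rather than the exact procedure used in this study.

```python
# Cluster-based undersampling sketch: cluster the majority class with k-means
# and keep the samples nearest to each centroid as representatives.
# The cluster count and selection rule are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_major, n_keep, n_clusters=50, random_state=0):
    km = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10).fit(X_major)
    # distance of every majority sample to its own cluster centroid
    dist = np.linalg.norm(X_major - km.cluster_centers_[km.labels_], axis=1)
    per_cluster = max(1, n_keep // n_clusters)       # roughly even quota per cluster
    keep_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        nearest = members[np.argsort(dist[members])[:per_cluster]]
        keep_idx.extend(nearest)
    return np.array(keep_idx[:n_keep])

# Example: shrink a 10,000-sample majority class to 2,500 representatives
X_major = np.random.default_rng(1).normal(size=(10000, 8))
X_reduced = X_major[cluster_undersample(X_major, n_keep=2500)]
```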
Several studies have applied oversampling and undersampling techniques to address data imbalance and improve model performance in the context of PM2.5 modeling. The authors of [48] aimed to improve the estimation accuracy of high PM2.5 concentrations by using an AugResNet model with random oversampling and SMOTE. While their approach improved performance on high-value PM2.5 datasets, the study was limited to a single cutoff threshold and to PM2.5 retrieval rather than forecasting, which restricts its broader applicability. The authors of [49] employed LSTM, GRU, and hybrid GRU + LSTM models with linear interpolation for data augmentation, expanding the dataset without addressing the imbalance between high and low PM2.5 concentrations. Because their approach targeted general dataset expansion rather than imbalance, it can lead to overfitting, as synthetically increasing the dataset size does not introduce new variability. The authors of [50] tackled the dataset shift problem between urban and rural PM2.5 data, addressing differences in predictor variable density using multiple imputation by chained equations; however, that study focused on correcting biases caused by variable-density disparities rather than on general PM2.5 forecasting, which limits its relevance to broader PM2.5 prediction challenges.
The current research on PM2.5 forecasting reveals critical gaps that need further investigation.
One major challenge is data imbalance, where high PM2.5 concentration events, particularly during extreme pollution episodes such as wildfires, are significantly underrepresented in datasets. This imbalance often leads to poor model performance in forecasting these critical pollution levels, as models struggle to generalize effectively under such conditions [31,32,33].
Another gap is the limited application of augmentation techniques tailored to address this imbalance. While methods like SMOTE and ADASYN have shown promise in balancing datasets [48,49], their use in PM2.5 forecasting, particularly for extreme pollution events, remains limited. Most studies have focused on general dataset expansion without targeting rare, high-concentration events, which can result in overfitting rather than improved generalization [42].
Lastly, the underexplored potential of Transformer models presents another critical gap. Despite their success in long-term sequence modeling across various domains [22,23,24], Transformers have been insufficiently investigated for PM2.5 forecasting, especially in urban environments where pollution poses significant health risks. Their ability to capture long-term dependencies and complex spatiotemporal patterns remains underutilized in extreme PM2.5 event forecasting [25,29].
This study addresses the identified research gaps by applying data augmentation techniques, specifically cluster-based undersampling with varying majority-to-minority class ratios, to improve the representation of high PM2.5 concentrations in the training data. The majority–minority cutoff thresholds are selected based on two EPA-defined criteria, emphasizing the importance of robust models capable of accurately forecasting elevated PM2.5 levels in real-world scenarios. The study leverages a Transformer-based architecture with multi-head sparse attention, inspired by models such as Informer and Spacetimeformer, to tackle the challenge of long-term dependency modeling. The specific research objectives are listed below:
Augment the imbalanced PM2.5 dataset with cluster-based undersampling under different combinations of majority-to-minority class ratios (a minimal sketch of this resampling step follows this list).
Investigate the impact on model performance of two minority–majority cutoff thresholds based on limits set by the EPA.
Build and train a Transformer model to leverage the capabilities of multi-head attention in the context of PM2.5 forecasting.
Develop a robust forecasting model that accurately predicts PM2.5 concentrations, particularly during extreme pollution spikes caused by events like wildfires in New York City, Philadelphia, and Washington, D.C.
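As referenced in the first objective above, the following minimal sketch shows how an EPA-based cutoff and a majority-to-minority partial sampling ratio might be combined with a cluster-based undersampler such as the hypothetical cluster_undersample helper sketched earlier; the variable names, comparison direction at the cutoff, and ratio handling are assumptions, not the study's exact implementation.

```python
# Sketch: split training data at an EPA-based cutoff, then undersample the
# majority (low-PM2.5) class so the minority (high-PM2.5) class makes up a
# chosen share of the resampled data, e.g. 20/80 (minority/majority).
import numpy as np

def resample_by_ratio(X, y, cutoff=35.5, minority_share=0.20, undersampler=None):
    high = np.where(y > cutoff)[0]               # minority: high-PM2.5 samples (all kept)
    low = np.where(y <= cutoff)[0]               # majority: low-PM2.5 samples
    n_low_keep = int(len(high) * (1 - minority_share) / minority_share)
    n_low_keep = min(n_low_keep, len(low))
    if undersampler is not None:                 # e.g. the cluster_undersample sketch above
        keep = low[undersampler(X[low], n_low_keep)]
    else:                                        # fallback: random undersampling
        keep = np.random.default_rng(0).choice(low, size=n_low_keep, replace=False)
    idx = np.concatenate([high, keep])
    return X[idx], y[idx]

# Example with synthetic data: keep all high-PM2.5 samples plus a four-times
# larger subset of low-PM2.5 samples (the 20/80 configuration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 8))
y_demo = rng.gamma(shape=2.0, scale=6.0, size=5000)
X_bal, y_bal = resample_by_ratio(X_demo, y_demo, cutoff=35.5, minority_share=0.20)
```

In this notation, cutoff=12.1 with minority_share=0.5 would correspond to the 50/50 configuration, and cutoff=35.5 with minority_share=0.2 to the 20/80 configuration examined in Section 4.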
The remainder of this paper is organized as follows: Section 2 details the data, including the study area and data description. Section 3 outlines the methodology, covering data preprocessing, collocation, cutoff thresholds, cluster-based undersampling, the Transformer model architecture, and the model training and evaluation process. Section 4 presents experimental results, including accuracy assessment, partial sampling ratios, cutoff thresholds, and time series analysis. Section 5 discusses the findings, while Section 6 concludes the study with a summary of key insights and potential directions for future research.
4. Experiments and Results
4.1. Accuracy Assessment
The optimal partial sampling ratios were selected by comparing model performance metrics across augmented datasets and the original, unaugmented dataset at each cutoff threshold.
For experiments performed with a cutoff threshold of 12.1 µg/m3, accuracy metrics are displayed in Table 7. As the resampling ratio becomes more balanced, moving from 10/90 toward 50/50, both RMSE and MAE generally decrease, indicating improved model performance. The best overall performance is observed at the 50/50 ratio, where the RMSE reaches 2.757, the MAE is 1.044, and R2 achieves a value of 0.850. This R2 value suggests that the 50/50 ratio offers the strongest correlation between forecasted and true PM2.5 values, making it the most effective configuration for balanced data.
Accuracy metrics for experiments performed with a cutoff threshold of 35.5 µg/m3 are shown in Table 8. Interestingly, the 20/80 resampling ratio emerges as the optimal configuration overall, achieving the lowest RMSE (2.080) and MAE (1.386), alongside the highest R2 value of 0.914. This strong performance suggests that a 20/80 ratio balances the trade-off between capturing minority and majority points while minimizing error. The same ratio also delivers the best results for high-value PM2.5 points, with an RMSE of 15.353, an MAE of 10.077, and an R2 value of 0.778, demonstrating that it is particularly effective for extreme pollution levels.
When comparing models trained on the original dataset to those trained on resampled datasets, the original-data models consistently underperform, particularly in terms of RMSE and R2. This pattern emphasizes the value of resampling techniques for improving model accuracy.
4.2. Partial Sampling Ratio Comparison
At a cutoff threshold of 12.1 µg/m3, evaluating model performance across varying partial sampling ratios on both the full dataset and the high-value points reveals a clear trend: the 50/50 partial sampling ratio consistently yields optimal results, as displayed in Figure 6. For the full dataset, RMSE decreases as the sampling ratio becomes more balanced, reaching its lowest point at the 50/50 ratio. This indicates that a more balanced data distribution significantly enhances forecast accuracy. Similarly, the R2 value steadily increases, peaking at the 50/50 ratio, signaling the model's improved ability to capture long-range dependencies at this balanced ratio.
For high-value points, the results further underscore the importance of balanced resampling. RMSE declines markedly and MAE gradually decreases as the ratio approaches 50/50. The model's highest R2 value at this ratio confirms its strongest performance in predicting high-value points. Overall, the 50/50 sampling ratio emerges as the optimal configuration, demonstrating that more evenly distributed data enhances the model's performance, particularly in forecasting high-value events.
Patterns of model performance across varying partial sampling ratios change for the cutoff threshold of 35.5 µg/m3, as presented in Figure 7. For the whole dataset, RMSE decreases as the partial sampling ratio becomes more balanced, reaching its minimum at 20/80. However, as the ratio becomes more balanced at 30/70, 40/60, and 50/50, RMSE slightly increases, indicating that the most balanced ratios do not necessarily lead to the best performance. Likewise, the R2 value peaks at 20/80 but declines for more balanced ratios, suggesting that a more even data distribution does not always improve model performance.
For high-value points, RMSE shows a sharp decline from its value based on the original data, continuing to decrease at the 20/80 ratio, with further stabilization beyond this point. MAE follows a similar trend, with a steep drop at 20/80 and stabilization thereafter. This indicates that the 20/80 partial sampling ratio effectively minimizes errors for high-value points. Similarly, R2 improves significantly with slightly more balanced resampling, reaching its peak at 20/80, and begins to drop afterward, highlighting the model’s best performance at this ratio.
Overall, the 20/80 ratio provides the best performance for both the full dataset and high-value points, delivering the lowest RMSE and highest R2. Models trained on the original data perform the worst in terms of RMSE and R2, underscoring the value of resampling for improving forecast accuracy, particularly for high-value points.
The discrepancy between RMSE and MAE in the original dataset arises from the nature of these metrics. RMSE amplifies the impact of large errors due to its squaring mechanism, making it highly sensitive to outliers, whereas MAE treats all errors equally, offering a more robust reflection of average performance (Chai and Draxler, 2014; Willmott and Matsuura, 2005). This suggests that the original dataset likely contains a few large outliers that inflate RMSE without significantly affecting MAE. As the partial sampling ratio becomes more balanced, the model improves accuracy when predicting high-value outliers (leading to lower RMSE) but loses some accuracy in predicting low-value events (causing a slight increase in MAE).
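As a concrete numerical illustration of this difference, the short snippet below computes RMSE and MAE on two hypothetical residual vectors that differ only in a single large outlier; the values are invented solely to show how the squaring step inflates RMSE.

```python
# Hypothetical residuals: identical except for one large outlier in the second set.
import numpy as np

def rmse(e): return float(np.sqrt(np.mean(np.square(e))))
def mae(e):  return float(np.mean(np.abs(e)))

errors_typical = np.array([1.0, -1.5, 0.5, -0.8, 1.2])
errors_outlier = np.array([1.0, -1.5, 0.5, -0.8, 30.0])   # one badly missed pollution spike

print(rmse(errors_typical), mae(errors_typical))   # ~1.06 and 1.00
print(rmse(errors_outlier), mae(errors_outlier))   # ~13.45 and 6.76: RMSE inflates far more
```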
4.3. Cutoff Threshold Comparison
Models trained on the 35.5 µg/m3 threshold consistently outperformed those trained on the 12.1 µg/m3 threshold in terms of RMSE and R2, as demonstrated in Figure 8. The resampling ratio plays a crucial role in model performance, with the 20/80 ratio emerging as optimal for the 35.5 µg/m3 threshold, while the 50/50 ratio works best for the 12.1 µg/m3 threshold. This disparity is largely driven by the nature of the data captured at each threshold. The higher 35.5 µg/m3 threshold likely includes a more concentrated set of high-value points, making a less balanced ratio such as 20/80 more effective, since the distinct minority points do not require as much balancing. In contrast, the 12.1 µg/m3 threshold includes more low-value points, necessitating a 50/50 ratio to adequately represent both minority and majority groups.
RMSE, which squares the differences before averaging, amplifies larger errors, making it more sensitive to a few large deviations from actual values. This explains why models trained on the 12.1 µg/m3 threshold performed worse in terms of RMSE, as the larger prediction errors had a greater impact. However, MAE, which treats all errors equally, performed better for the 12.1 µg/m3 threshold, indicating that while the errors were frequent, they were smaller in magnitude.
For the 12.1 µg/m3 threshold, RMSE consistently decreases as the sampling ratio becomes more balanced, reaching its minimum at the 50/50 ratio. This trend highlights the importance of equal representation of minority and majority classes in improving overall accuracy for a lower threshold value. Conversely, MAE initially increases from 10/90 to 30/70 but significantly improves at 50/50. R2 also shows steady improvement with increasing balance, reaching its peak at 50/50, where the model captures the strongest correlation between predicted and actual values.
For the 35.5 µg/m3 threshold, RMSE shows a sharp decrease as the resampling ratio shifts from 10/90 to 20/80, indicating improved model performance by reducing large prediction errors. However, as the ratio becomes more balanced beyond 20/80, RMSE starts to increase slightly, suggesting that the model begins to overfit to the minority class while losing accuracy for low-value points. R2 follows a similar trend, increasing to a peak at 20/80, where it achieves the highest variance explanation. Beyond this optimal ratio, R2 declines, reflecting the reduced ability to accurately capture the distribution of both high- and low-value points.
The chosen thresholds, 12.1 µg/m3 and 35.5 µg/m3, while aligned with EPA air quality standards, have inherent limitations that may impact the generalizability of the findings. First, these thresholds are specific to U.S. regulatory definitions and may not capture the nuances of air quality classifications used in other regions, such as Europe or Asia, limiting the global applicability of the model. Additionally, these fixed thresholds may oversimplify the dynamic and continuous nature of PM2.5 pollution levels, potentially misclassifying borderline cases and reducing sensitivity in capturing real-world fluctuations. The reliance on static thresholds also fails to account for seasonal or geographic variations in PM2.5 concentrations, which could alter the distribution of high- and low-value points and affect model training. Furthermore, by focusing only on two thresholds, the study may overlook potential insights that could arise from exploring a wider range of cutoff values, especially for datasets with different pollutant distributions. These limitations highlight the need for future work to explore adaptive or region-specific thresholds and assess their impact on model performance.
4.4. Time Series Analysis
Figure 9 presents the time series comparison of observed and forecasted PM2.5 for the models trained on the original dataset versus the dataset augmented with a cutoff threshold of 35.5 µg/m3 and a partial sampling ratio of 20/80. For all three cities, the model trained on the original dataset shows strong accuracy for lower PM2.5 concentrations, particularly for values below 30 µg/m3, reflected in the high similarity between forecasted and observed values at these low levels. However, the model struggles to predict higher PM2.5 concentrations, reaching a ceiling in magnitude when faced with extreme pollution events, as evidenced in the red-boxed regions. This limitation arises from the imbalanced dataset: because the majority of points consist of lower values, the model prioritizes them over the rarer high-value points. As a result, the model is unable to fully capture extreme PM2.5 events, showing that forecast accuracy tends to decline as PM2.5 levels increase.
In contrast, models trained on the augmented dataset, using a 35.5 µg/m3 cutoff threshold and a 20/80 partial sampling ratio, demonstrate improved performance in capturing high-value PM2.5 events. Although there is a trade-off, where the model’s accuracy for lower PM2.5 levels is slightly reduced, this adjustment leads to significantly better RMSE and R2 measures. The forecasted values in the red-boxed regions are much closer to the observed peaks, demonstrating that the model trained on augmented data is better equipped to handle rare and extreme pollution levels. The trade-off is seen in the slightly worse MAE, as the augmented dataset introduces more diversity and some smaller errors that MAE treats equally, while RMSE emphasizes the larger improvements in extreme cases. The model built on the augmented data is better suited to handle high-value points, which is particularly beneficial in scenarios where predicting extreme pollution is more critical than maintaining perfect accuracy at lower concentrations.
The key contrast between the two trained models lies in the distributional focus: the original dataset performs better on low-level PM2.5 concentrations but struggles with extreme values. In comparison, the augmented dataset sacrifices some accuracy at lower concentrations to better capture the high-value events, which are crucial for understanding and managing pollution spikes. This trade-off is especially visible in the improvements in terms of RMSE, which penalizes large errors more severely. These results show that the model trained on augmented data is significantly better at predicting higher PM2.5 values.
5. Discussion
The underestimation of high pollutant levels has been frequently discussed in previous studies [31]. This research addresses the challenge by applying data augmentation techniques before training the deep learning model. One of the key contributions of this study is the exploration of cluster-based undersampling, implemented at different cutoff thresholds and partial sampling ratios, which helped mitigate class imbalance and improve model performance. Our findings indicate that the higher cutoff threshold of 35.5 µg/m3 resulted in superior model performance compared to the lower threshold of 12.1 µg/m3, as the 35.5 µg/m3 threshold more effectively differentiated between low- and high-value points. The optimal partial sampling ratio for the 35.5 µg/m3 cutoff threshold was found to be 20/80. Previous studies, such as [49], explored data augmentation through linear interpolation to generate synthetic data and increase dataset size. Their approach significantly improved the performance of models such as GRU and LSTM, yielding up to a 31% improvement in MAPE. However, that study primarily focused on increasing the overall volume of data without addressing class imbalance, which is a critical challenge in the prediction of extreme air pollution events. The authors of [48] directly tackled dataset imbalance using random oversampling to increase the representation of high-value samples. While their approach helped increase the representation of high-value samples, it led to overfitting on these samples and subsequently degraded the model's performance on the whole dataset. In contrast, our use of cluster-based undersampling allowed the model to avoid overfitting to high-value samples, resulting in improved prediction performance not only for the high-value samples but also for the dataset overall. Our findings nevertheless align with [48] in highlighting the importance of partial sampling ratios, with their study identifying 30/70 as optimal for certain datasets, reinforcing the idea that fully balanced datasets are not always the best approach. Other studies, including [72], suggest that each dataset's unique characteristics necessitate tailored sampling strategies. In our case, the 20/80 ratio paired with the 35.5 µg/m3 cutoff provided the best performance in capturing high-value points without over-suppressing the majority class, underscoring the importance of strategic undersampling for achieving balanced model generalization.
The results of this study not only highlight the effectiveness of cluster-based undersampling and tailored cutoff thresholds in improving PM2.5 forecasting but also carry broader implications for air quality management. By addressing the frequent underestimation of high pollutant levels, our methodology contributes to more accurate identification of critical pollution episodes, which is essential for timely public health interventions. The superior performance achieved using a 35.5 µg/m3 cutoff threshold underscores the importance of selecting thresholds that align with the dataset’s characteristics and the targeted application. This finding suggests that air quality models must adopt region-specific or context-driven thresholds to ensure reliable predictions, especially when forecasting extreme pollution levels. Moreover, the partial sampling ratio of 20/80 demonstrates that balancing the dataset does not necessarily mean achieving equal representation of classes; rather, an optimal balance must consider the distribution and nature of the data to maximize model performance.
Future work could enhance the model by incorporating additional data sources that influence PM2.5 levels. Urban traffic data, which are crucial for accounting for vehicle emissions, and industrial activity data from factories and power plants would provide more detailed insights into pollution spikes. Including weather data such as wind patterns and forecasts could improve the model's accuracy in predicting pollutant dispersion across regions. In addition to data augmentation through cluster-based undersampling, more advanced techniques such as Generative Adversarial Networks (GANs) could be explored to generate realistic synthetic data for extreme pollution events, which are rare but critical to forecast [77]. Another promising avenue would be extending the model to perform multistep predictions, forecasting PM2.5 concentrations over multiple time steps rather than just the next step, which would be particularly valuable for air quality forecasting over longer periods such as days or weeks. Moreover, extending the methodology to other pollutants, such as nitrogen dioxide (NO2), sulfur dioxide (SO2), and ozone (O3), would allow for a comprehensive air quality forecasting framework, enabling cities to predict and address multiple pollutants simultaneously. Given that many pollutants interact synergistically to exacerbate health effects, multi-pollutant models would enhance the precision of interventions. Additionally, applying this approach to different regions or urban areas would help validate the model's generalizability. Regional variations in pollution sources, meteorological factors, and population density may require adaptive strategies, such as incorporating localized data or adjusting the undersampling strategy to align with regional conditions.
From a methodological perspective, future extensions include the integration of multistep forecasting, enabling predictions over longer time horizons. This would be particularly valuable for planning city-level interventions, such as scheduling traffic restrictions or industrial shutdowns during predicted high-pollution periods. The incorporation of additional data sources, such as traffic, industrial activity, and weather forecasts, could further enhance the model’s robustness by capturing critical predictors of PM2.5 variations. Lastly, advanced techniques like Generative Adversarial Networks (GANs) could be explored to synthesize data for rare but impactful extreme pollution events, addressing a key limitation of current datasets. These extensions would not only refine the methodology but also position it as a cornerstone for developing next-generation air quality forecasting systems.
6. Conclusions
This study demonstrates that the 35.5 µg/m3 threshold consistently outperforms the 12.1 µg/m3 threshold across key metrics like RMSE and R2, likely due to its better representation of higher pollution values. The choice of partial sampling ratio proved crucial, with 50/50 optimal for the 12.1 µg/m3 threshold and 20/80 optimal for the 35.5 µg/m3 threshold, effectively balancing the need to capture both frequent and extreme pollution events. The model with the best performance (RMSE: 2.080, MAE: 1.386, R2: 0.914) utilized the 35.5 µg/m3 threshold and a 20/80 partial sampling ratio. Overall, models trained on resampled data significantly outperformed those trained on the original dataset, demonstrating the importance of data augmentation in handling imbalanced datasets and improving forecast accuracy, especially for high-value pollution scenarios.
The findings of this study have important practical implications. Accurate PM2.5 forecasting is essential for timely public health interventions, particularly in urban areas prone to extreme pollution levels. By tailoring threshold selection and resampling strategies to the characteristics of the data, forecasting models can provide more reliable predictions, enabling policymakers and city planners to take targeted actions to mitigate health and environmental risks.
Future research could build on these contributions by exploring additional thresholds tailored to specific regional air quality standards, ensuring broader applicability of the methodology. Incorporating supplementary data sources such as urban traffic patterns, industrial activity, and meteorological variables could further enhance the model's ability to capture the complex factors driving PM2.5 fluctuations. Advanced techniques like Generative Adversarial Networks (GANs) could be employed to generate synthetic data for rare, extreme pollution events, addressing data scarcity challenges. Expanding the geographic scope of the model to include diverse regions and testing its performance with different pollutants such as NO2, SO2, and O3 could create a comprehensive air quality forecasting system. Lastly, extending the model to perform multistep predictions would provide long-term forecasting capabilities, supporting more effective planning and intervention strategies over extended periods. These directions offer promising opportunities to refine and expand the impact of PM2.5 forecasting models on air quality management for public health and disaster events such as wars and wildfires [78].