1. Introduction
Paso del Norte (PdN) is the largest metropolitan area on the border between the United States of America and Mexico, with an estimated population of 2.4 million. The region comprises three large cities: El Paso, Texas; Las Cruces, New Mexico; and Ciudad Juarez, Mexico, all of which share the PdN airshed. Like any other developing metropolis, PdN faces worsening air quality. Additionally, because it is an international cross-border location, its air quality is a growing concern for both the United States and Mexico [1,2].
Fine particulate matter (PM2.5) is an air pollutant with an aerodynamic diameter of less than or equal to 2.5 µm, which becomes hazardous to people’s health when its concentration in the air exceeds a certain standard. These small particles can absorb a variety of chemical components, including metals, salts, poisons, organic compounds, and biological groups, such as pollens [3]. PM2.5 levels are rising as a result of automobiles, power generation, and other anthropogenic factors. Ammonium sulfate, ammonium nitrate, and organic and elemental carbon are all major components of PM2.5 [4]. These PM2.5 components have a significant impact on human health and can lead to cardiovascular problems [5]. The particles can penetrate even the smallest airways of the lungs, causing increased respiratory oxidative stress and inflammation [6].
The ambient PM2.5 concentrations in the PdN region have surpassed the U.S. Environmental Protection Agency’s (EPA) National Ambient Air Quality Standards (NAAQS) on many occasions. The majority of PM2.5 in this area originates from geological and industrial sources as well as vehicle exhaust and household cooking and heating. The desert environment is characterized by sporadic calm winds, frequent stagnation due to high atmospheric stability, and sporadic shallow convective and nocturnal boundary layer heights, all of which contribute to the rising PM2.5 concentrations [7].
In recent years, machine learning algorithms have proven feasible for predicting the concentration of air pollutants, and researchers across the globe have applied a range of algorithms and techniques to this task. Several studies [8,9,10,11,12] found that machine learning methods, including deep learning [13], random forest [14], and ensemble models [15], are highly capable of estimating PM2.5 concentration on different temporal and spatial scales.
According to the literature reviewed above, the majority of the existing forecasting models are capable of predicting daily PM2.5 concentrations and high PM days; however, because of the complex geography and topography of the Paso del Norte region [7,16,17], as well as its exceptional meteorological conditions, these analyses cannot be applied directly to our study area. Therefore, this study conducts an in-depth analysis of historical PM2.5 concentrations and proposes an efficient ML method for forecasting future high/low PM concentration days in the PdN region. The novelties of this study are as follows: (1) analyzing the temporal characteristics of PM2.5 concentration patterns by month based on the historical data collected from designated locations in the PdN region; (2) analyzing PM2.5 concentration data using several ML models with good prediction effectiveness and comprehensible results to address various inadequacies from prior studies; (3) identifying the primary variables causing the high particulate matter concentration in this area; and (4) investigating the complex link between PM2.5 and meteorological and other air pollutant variables based on ground station data using various ML techniques.
Researchers have conducted a number of studies in this area to better understand the chemical and physical processes responsible for high PM2.5 concentrations [16,18,19,20]. The majority of these investigations were diagnostic, or they modeled the situation using an idealized profile [21,22] or a specific method that was limited by the technology available at the time [23]. Furthermore, the topography of the study area makes it difficult for air quality models to predict pollutants accurately [24]. The approach in our study aims to address those limitations and fill this research gap.
In this study, various machine learning (ML) algorithms are used to predict particulate matter concentration. The study uses data on air pollutants and meteorological variables collected from several locations in the Paso del Norte region from 2014 to 2019. The ML models used include ridge regression, logistic regression, MARS (multivariate adaptive regression splines), SVM (support vector machine), and RF (random forest). This study has three objectives: first, to predict high/low PM levels; second, to determine the features that contribute to high PM concentrations; and finally, to forecast the PM concentration values using penalized linear regression. A detailed analysis was conducted to determine how PM concentrations relate to other air pollutants and meteorological variables.
This article is organized as follows. Section 2 presents a basic overview of the machine learning methods and the regularization techniques used to select the best model. A significant part of Section 3 is dedicated to the details of the experimental data, including an explanation of the data properties and air quality standards. Section 4 discusses the association between the variables and other exploratory analyses. Then, Section 5 presents results from the application of ML models to the data sets, as well as the accuracy and parameter estimates of the models. Finally, Section 6 summarizes the fitted models used to classify the PM2.5 levels, along with the most important variables responsible for a high or low PM2.5 concentration in the study area.
3. Data Background
The air quality index (AQI) is used to report the daily air quality for a specific location. It indicates whether the air is clean or polluted and conveys the associated health risks. Different countries use different air quality indices. Table 1 shows the average PM2.5 concentration in the United States for a 24-h period.
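As an illustration of how a 24-h average concentration maps to an AQI category, the following minimal sketch assumes the U.S. EPA PM2.5 breakpoints that were in force before the 2024 revision. These breakpoints, the function names, and the category labels are assumptions for illustration and may not match Table 1 exactly.

```python
# Assumed pre-2024 U.S. EPA 24-h PM2.5 breakpoints (ug/m3).
# Each entry is (exclusive upper bound, category name); values may differ from Table 1.
PM25_CATEGORIES = [
    (12.1, "Good"),
    (35.5, "Moderate"),
    (55.5, "Unhealthy for Sensitive Groups"),
    (150.5, "Unhealthy"),
    (250.5, "Very Unhealthy"),
]


def daily_mean(hourly_values):
    """Average a list of hourly concentrations into a 24-h mean."""
    if not hourly_values:
        raise ValueError("at least one hourly value is required")
    return sum(hourly_values) / len(hourly_values)


def pm25_category(concentration):
    """Return the AQI category name for a 24-h mean PM2.5 concentration."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    for upper_exclusive, name in PM25_CATEGORIES:
        if concentration < upper_exclusive:
            return name
    return "Hazardous"


if __name__ == "__main__":
    # Hypothetical hourly readings, for demonstration only.
    readings = [10.0, 14.0, 30.0, 26.0]
    print(pm25_category(daily_mean(readings)))
```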
El Paso is a non-attainment city for carbon monoxide (CO) and PM
, and it has several days of high PM
concentration during the months from May to September. Data from urban, suburban, industrial, and rural areas were used to calculate PM
precursor substances in
Table 2. The data sets included both air pollutants and meteorological variables.
The Paso del Norte (PdN) region has become a major environmental concern for both countries in recent decades. The PdN region is made up of the cities of El Paso, Texas; Ciudad Juarez, Mexico; and a few more cities from New Mexico [35]. Our study area includes deserts such as the Chihuahuan Desert, mountain ranges, shared rivers, wetlands, state parks, and protected areas. Around 12 million people live along the border, almost equally divided between the two countries. With 0.7 million people, El Paso is the U.S.’s eighth largest city; adjacent to it, another 1.3 million people live in Ciudad Juarez, Mexico [23]. El Paso, a southwestern U.S. city, has a typical warm and arid climate, but its air pollution levels are typically high because of the industrial activity along the U.S./Mexico border, as well as the unique meteorological conditions created by geography [18,20]. Due to a mix of high population density, industrial effects, and weather conditions, El Paso has historically been in non-attainment of the U.S. NAAQS for O3, CO, PM10, and PM2.5 [19,36].
In this work, data were collected from the Continuous Ambient Monitoring Stations (CAMS) of the Texas Commission on Environmental Quality (TCEQ). Hourly average PM2.5 concentrations at ground level were obtained from different CAMS in the Paso del Norte region. Figure 2 displays the AQI days from 2014 to 2019 in the El Paso region. As shown, most of the moderate and unhealthy AQI days occurred in the summer and winter seasons. In addition, recent years have shown increases in unhealthy days throughout the year.
4. Exploratory Data Analysis
El Paso, Texas, is considered to have some of the highest levels of PM in the United States, with a history of PM exceedances every year. To demonstrate the trend for high PM days, we conducted an extensive study of the years from 2014 to 2019.
Figure 4 shows box-and-whisker plots of all the meteorological and air pollutant variables used in our study. The vertical axis gives the numerical values of the variables, and the horizontal axis gives their names. Because most of the data were collected in the summer season, when high PM concentration days occur, the mean outdoor temperature is around 80 degrees Fahrenheit. The relative humidity is around 30–40%, and ozone is around 50–60 parts per billion.
Figure 5 illustrates the correlation between all of the predictors of PM2.5 in the data set. Several meteorological variables, such as wind speed, resultant wind speed, and maximum wind gust, have a stronger positive correlation with PM2.5. In contrast, dew point temperature and relative humidity have a negative correlation with the target variable, i.e., PM2.5.
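The pairwise correlations discussed here can be illustrated with a minimal sketch of the Pearson correlation coefficient. The paper does not state which correlation measure was used, so Pearson is an assumption; the function name and the toy values are hypothetical and are not taken from the study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented toy values: a predictor that rises with PM and one that falls.
    pm = [2.0, 4.0, 6.0, 8.0, 10.0]
    rising = [1.0, 2.0, 3.0, 4.0, 5.0]
    falling = [5.0, 4.0, 3.0, 2.0, 1.0]
    print(pearson_r(rising, pm), pearson_r(falling, pm))
```

A value near +1 corresponds to the positive associations described for the wind variables, and a value near -1 to the negative associations described for dew point temperature and relative humidity.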
Figure 6 shows the scatter plot matrix, with the slope values between pairs of variables and histograms on the diagonal. Each histogram gives a sense of the shape of the univariate distribution of the corresponding variable. Additionally, above each scatter plot, the slope of the linear fit is shown with its statistical significance indicated by one asterisk (*), which denotes p < 0.05, or two asterisks (**), which denote p < 0.01. As illustrated, all pollutants are positively associated with our target variable, PM2.5, and these relationships are statistically significant.
5. Results
This section discusses the data processing approaches, the results, and the application of machine learning models for predicting the ground-level PM2.5 concentration in the atmosphere. In this study, 70% of the data was used to train the prediction models, and the remaining 30% was held out as test data to evaluate them. Regularization techniques are used to obtain the best model by balancing the bias-variance trade-off. To predict PM2.5, we first use penalized regression based on several meteorological variables. Lasso regression was used to predict PM2.5 with a reduced set of predictors. The coefficients obtained from the lasso and its evaluation metrics are presented in Table 3.
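A random hold-out split of the kind described here can be sketched as follows. This is a minimal illustration, assuming a simple shuffled 70/30 split with a fixed seed; the function name, the seed, and the lack of stratification are assumptions, and the paper does not describe its exact splitting procedure.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly split rows into (train, test), keeping the original row order."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


if __name__ == "__main__":
    observations = list(range(100))  # stand-in for 100 daily records
    train, test = train_test_split(observations)
    print(len(train), len(test))
```

The test rows are used only for the final evaluation; any tuning, such as cross-validation, is carried out on the training rows.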
We also used the ridge and elastic net regression methods to overcome the multicollinearity issue and to predict PM2.5 using all the variables. Figure 7 shows the cross-validation path of the tuning parameter λ for the above three models. From these figures, we see how the tuning parameter λ was picked using cross-validation. The two dotted lines mark two candidate λ values: the left one gives the minimum cross-validation error, and the right one gives the most highly regularized model whose error is within one standard error of the minimum, for a fixed α.
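The two candidate values marked on such cross-validation plots can be computed as in the sketch below. It assumes the common one-standard-error convention, and that the mean cross-validation error and its standard error are already available for each candidate value. The function name and the numbers in the demonstration are hypothetical.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambda_min minimizes the mean CV error; lambda_1se is the largest
    (most regularized) lambda whose mean CV error is within one standard
    error of that minimum.
    """
    if not (len(lambdas) == len(cv_mean) == len(cv_se)) or not lambdas:
        raise ValueError("inputs must be non-empty and of equal length")
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_mean) if err <= threshold
    )
    return lambdas[best], lambda_1se


if __name__ == "__main__":
    # Invented CV summaries for four candidate penalties.
    lambdas = [0.01, 0.1, 1.0, 10.0]
    cv_mean = [1.00, 0.90, 0.95, 1.50]
    cv_se = [0.05, 0.06, 0.05, 0.08]
    print(select_lambdas(lambdas, cv_mean, cv_se))
```

Choosing the larger of the two values trades a small increase in estimated error for a simpler, more heavily shrunk model.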
In the logistic regression, we used lasso regularization with the L1 penalty and obtained the tuning parameter λ by cross-validation. The L1 penalty is useful for variable selection and shrinkage because it forces some of the coefficient estimates to be exactly zero [38]. Table 4 lists the coefficients of the predictors; Nitrogen Dioxide, Oxides of Nitrogen, Wind Speed, Resultant Wind Direction, Std. Dev. Wind Direction, Outdoor Temperature, and Relative Humidity are the important factors for PM2.5 classification.
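Why an L1 penalty produces exact zeros while an L2 penalty only shrinks can be seen in a simplified setting. The sketch below assumes a linear model with an orthonormal design, where the lasso solution is the soft-thresholded least-squares estimate and the ridge solution is a proportional shrinkage. This is an illustration of the mechanism, not the penalized logistic regression fitted in the study, and the function names and coefficients are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam, clipping at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(beta_ols, lam):
    """Lasso estimates for an orthonormal design, objective
    0.5 * ||y - X b||^2 + lam * ||b||_1."""
    return [soft_threshold(b, lam) for b in beta_ols]


def ridge_orthonormal(beta_ols, lam):
    """Ridge estimates for an orthonormal design, objective
    0.5 * ||y - X b||^2 + 0.5 * lam * ||b||^2."""
    return [b / (1.0 + lam) for b in beta_ols]


if __name__ == "__main__":
    beta_ols = [2.0, -0.3, 0.5]  # invented least-squares coefficients
    print(lasso_orthonormal(beta_ols, 0.5))
    print(ridge_orthonormal(beta_ols, 0.5))
```

In this setting, every least-squares coefficient whose magnitude does not exceed the penalty is set exactly to zero by the lasso, whereas ridge keeps all coefficients non-zero.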
In addition to the L1 penalty, we used the logistic regression model with an L2 penalty to reduce the multicollinearity issue of the data set. The tuning parameter λ is optimized via ten-fold cross-validation until we achieve the best predictive model. For the MARS model, cross-validation on the training data was used to choose a reliable classifier. This model regulates the training process with the residual sum of squares (RSS). In the random forest model, five hundred trees were used, and three variables were sampled at each split to classify the levels. Using the mean decrease accuracy and mean decrease Gini indices, we ordered the predictors based on their importance. Figure 8 shows that the predictors Nitrogen Dioxide, Wind Speed, Oxides of Nitrogen, and Maximum Wind Gust are important variables for high PM2.5 levels in the atmosphere.
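The mean decrease Gini index is built from the impurity reduction achieved by individual splits. The sketch below computes that reduction for a single split, as a minimal illustration; a random forest sums such reductions per variable and averages them over its trees. The function names and the "high"/"low" toy labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity reduction obtained by splitting parent into left and right."""
    n = len(parent)
    if n == 0 or len(left) + len(right) != n:
        raise ValueError("left and right must partition a non-empty parent")
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_child


if __name__ == "__main__":
    parent = ["high"] * 4 + ["low"] * 4
    # A split that separates the classes perfectly removes all impurity.
    print(gini_decrease(parent, ["high"] * 4, ["low"] * 4))
    # A split that leaves both children mixed 50/50 removes none.
    mixed = ["high", "high", "low", "low"]
    print(gini_decrease(parent, mixed, mixed))
```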
Table 5 shows the prediction mean squared error and misclassification rate of the models; we also analyzed the confusion matrices to choose the best classifier for the PM2.5 concentration. Lastly, the kernel SVM was studied to classify the PM2.5 concentration, using ten-fold cross-validation and different cost levels. The optimized cost and accuracy were found at a parameter value of 0.001.
Table 6 presents the RMSE and R-squared values at the tuning parameter λ for the three penalized models.
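For reference, the two regression metrics named here can be computed as in the following minimal sketch. The function names and the toy values are hypothetical and are not taken from the study's results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]   # invented values
    predicted = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by one
    print(rmse(observed, predicted), r_squared(observed, predicted))
```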
Model Accuracy
This section compares our proposed models using a variety of evaluation metrics (see Table 3). Sensitivity is the true positive rate (TPR), i.e., the fraction of actual positives in the target variable that are correctly detected. Specificity is the true negative rate (TNR), i.e., the fraction of actual negatives that are correctly recognized. The ROC curve is also presented, where the X axis shows the false positive rate, or 1 - specificity, and the Y axis shows the true positive rate, or sensitivity. A confidence interval is a range of values that is likely to include a population value with a certain degree of confidence. Accuracy is the proportion of all predictions that are correct. The diagonal line of the ROC plot corresponds to a classifier that performs no better than chance (area under the curve of 0.5) and divides the ROC space (see Figure 9). For a good classifier, the area under the curve tends towards one. From Table 7 and Figure 9, we conclude that the random forest model performs well compared to the others.
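The classification metrics defined in this section can be computed as in the sketch below. It assumes a binary target with "high" as the positive class and uses the rank-based definition of the area under the ROC curve: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function names and toy data are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="high"):
    """Return sensitivity, specificity, and accuracy for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def roc_auc(y_true, scores, positive="high"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    truth = ["high", "high", "low", "low"]
    predicted = ["high", "low", "low", "low"]
    scores = [0.9, 0.4, 0.5, 0.1]  # invented predicted probabilities of "high"
    print(classification_metrics(truth, predicted))
    print(roc_auc(truth, scores))
```

An area of 0.5 corresponds to the diagonal of the ROC plot, and an area of 1 to a classifier that ranks every positive case above every negative case.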
6. Conclusions
In recent years, scientists have proposed and implemented numerous models for forecasting and predicting air pollutants across different geographical locations. The purpose of this study was to analyze methodologies for predicting ground-level fine particle concentrations. The results suggest that machine learning techniques are effective for predicting PM concentrations based on meteorological and air pollution variables. Several meteorological parameters, including temperature, wind speed, and relative humidity, and different air pollutants, including CO, NOx, and ozone, were used for classification.
Our proposed penalized regression models with L1 and L2 regularization provide important features for detecting high or low PM2.5 days. To determine the significant predictors for high PM2.5 concentration days, we used various ML algorithms such as random forest, MARS, logistic regression, and SVM. Cross-validation on the training data was used to examine the various cost functions, yielding the models’ tuning parameters. The tuning parameter determines which model is best for predicting the test data. After applying the optimized predictive models to the test data, several metrics were computed to assess the predictions. In addition, the accuracy, sensitivity, specificity, precision, and recall metrics were compared (see Table 6) to obtain the best classifier. According to our empirical results, the random forest model correctly classifies 92.73% of PM2.5 data as high or low, with a confidence interval of 89% to 96%. It also demonstrates that several variables, such as Nitric Oxide, Wind Speed, and Maximum Wind Gust, have a significant impact on high PM2.5 concentrations. The areas under the ROC curves for all ML approaches are shown in Figure 9, where the RF and SVM models show high accuracy in classifying high and low PM2.5 days. The results of this study can contribute to an evaluation of the long-term effects of PM2.5 air pollution and the diseases caused by exposure to PM2.5. Furthermore, the analysis provides valuable information that can be useful in the prevention and control of air pollution in the binational airshed. Future work will focus on predicting unanticipated increases in PM2.5 in the study area during the peak season of air pollution using deep learning, e.g., LSTM (long short-term memory) analysis, a feed-forward neural network using multiple neurons, stochastic approaches [39], causality discovery approaches [40], etc. Further, this study can serve as the basis for future advanced research projects involving machine learning/deep learning-based air pollution prediction, since extended historical data can be collected for training tailored to this region.