1. Introduction
Industrial emission gases, a crucial output component of manufacturing processes [1,2], are often rich in odourous substances [3,4], leading to concerns about air quality due to their atmospheric mobility and potential impact on the environment and receptor population [1,5]. Recent studies have highlighted the connection between atmospheric emissions and the chemical structure of odourous molecules, emphasising the need for effective odour management strategies [6,7]. With a growing public concern about air pollution, odours act as an early indicator of environmental pollution, prompting increased attention to monitoring and control strategies [8,9,10].
The population’s capacity to perceive a specific odour is highly subjective, as the same stimulus can evoke significantly different sensations among different individuals or even within the same individual across various circumstances [11,12], or under different environmental conditions, such as high humidity or high CO2 concentration [13,14,15]. It is crucial to bear in mind that olfactory discomfort emerges from multiple factors: odour does not directly correspond to the odour-causing molecule, as it is not an inherent feature of the molecule itself. Instead, it corresponds to a sensation triggered by the substance once interpreted by the olfactory system [6,16]. Even when not toxic, offensive odour compounds can induce symptoms such as irritation of the respiratory tract, chest tightness, palpitations, drowsiness, and mood swings [17,18]. Furthermore, certain malodourous chemicals, like ethylbenzene, toluene, and benzene, can have severe health effects [19].
Due to the complexity of industrial odourous emissions-related challenges, adopting a comprehensive approach to olfactory nuisance management is crucial [1,7,20]. Regulation for controlling olfactory nuisances spans various hierarchical levels, from European legislation to local regulations, reflecting the need to address odour emissions at both supranational and local levels [6,21]. Nowadays the European Union mandates control over the emission of odour-causing substances during plant operations to safeguard people’s life quality [22]: European directives require technical standards published by CEN, with UNI EN 13725:2022 being the reference standard for odour determination by dynamic olfactometry [11].
Dynamic olfactometry, as outlined in EN 13725:2022, employs human noses as sensors to assess odours by their impact on trained individuals, who undergo a screening process to determine the concentration of odourants in a gaseous sample [7,11,23]. Olfactometry is widely acknowledged as the most sensitive method for assessing odour quality [24]. However, it has limitations such as time and effort requirements, the need for specialised laboratories separate from sampling sites, the inability for on-site or real-time measurements, high uncertainty levels, potential exposure of panelists to hazardous substances, and a lack of precise implementation guidance, resulting in inconsistent approaches [23]. Furthermore, sensorial analysis is inherently unreproducible [25].
Over the past 15 years, Instrumental Odour Monitoring Systems (IOMS) have gained increasing popularity as tools for assessing odour impact [26,27]. Sensor-based instrument methodologies employ artificial olfactory systems that replicate the capabilities of human smell, allowing for continuous operation directly in ambient air [28,29,30]. Among these systems, e-Noses (electronic olfactory systems) have emerged as the most widely used [26,27,31]: they are devices with chemical sensors and a pattern recognition system used to detect and identify various odours [30,32]. It is crucial to clarify that e-Noses do not conduct a chemical analysis of the analysed mixture but instead provide an olfactory fingerprint [33,34]. Despite promising results, practical applications of e-Noses in real-life scenarios remain limited due to:
- (1) Technical challenges, such as sensor insensitivity or lack of selectivity [35,36] and susceptibility to interference from temperature and humidity [27,37], which make it difficult to establish a direct correlation between the chemical fingerprint and human perception of odour.
- (2) Lack of standardisation, which represents a significant barrier to the widespread adoption of e-Noses on an industrial scale [31,38].
- (3) Undefined total uncertainty of the measured data in terms of reproducibility and reliability, which are necessary parameters for making instrument comparisons [39].
Nevertheless, the use of IOMS for continuous olfactory nuisance monitoring has proved particularly beneficial in industrial process control, enabling prompt intervention in plant management by activating operational measures when critical odour thresholds are exceeded [40].
The theoretical conversion from chemical concentration to odour concentration could be feasible, but the mechanism of human odour recognition is not yet fully understood [41]. Hence, experimental investigations are required. The simplest method involves directly using the concentration of a single substance, which can work well for individual substances or constant composition groups [42,43]. In this study, the experimental method for converting chemical concentration to odour concentration will be addressed.
After the e-Nose training, a mathematical relationship with an odour level can be derived; it is based on the recorded sensor concentration values and facilitates the estimation of odour concentration. However, it is important to clarify that this mathematical relationship is not absolute but applies only to the specific type of odour being detected [6,27]. Internationally, standard methods have been developed to implement this knowledge, including Draft P2520.1 [44] and TC264 WG41 on electronic sensors for odorant monitoring [45]. The first method focuses on creating chemically standardised mixtures, while the second method relies on comparing electronic noses with reference values obtained through dynamic olfactometry.
This study aimed to evaluate the reliability of data collected by an e-Nose near a wastewater treatment plant and to explore the correlation between the chemical concentration readings of the sensors in the electronic nose and odour concentration values. A wastewater treatment plant is classified as a passive diffuse odourous source, where certain surfaces or openings release unpleasant odours that disperse at an unpredictable airflow rate [46]. Area sources release emissions from large surfaces and are categorised as active or passive. Active sources have outward airflow, like biofilters or aerated heaps, while passive sources, such as landfill surfaces and wastewater tanks, rely on processes like equilibrium or convection for mass transfer to the air [47]. The sedimentation phase of the treatment process is the most odourous, with an odour emission factor (OEF) of 190,000 ouE/m3 of treated effluent [47,48]. This phase generates odourous substances due to anaerobic and anoxic conditions. The primary odourous compounds produced during wastewater treatment include hydrogen sulphide, ammonia, organic sulphur compounds, reduced organic sulphur compounds, and amines [49,50]. Sulphur compounds, mainly hydrogen sulphide, are released during the anaerobic breakdown of organic matter, while ammonia is produced from protein decomposition [6]. Indeed, it is possible to identify alterations in air quality attributable to a plant by analysing measurements taken in ambient air, even at a distance from the facility.
Random Forest models can assess how different chemical concentrations can predict odour data, which is crucial for odour management in industries. These models quantify the dependence of odour data on chemical concentrations and identify the importance of each variable. Selecting the most relevant chemical species for analysis is essential to accurately reflect real-world conditions, as some species significantly influence odour concentration. Additionally, the linear correlation model utilises theoretical formulas and linear regression to determine if the key chemical variables align with the electronic nose’s data. Collectively, this dual approach strengthens the electronic nose’s reliability for continuous odour monitoring, helping address odour-related issues and improve community relations. This approach serves to validate odour concentration measurements conducted during environmental monitoring using an e-Nose, aiding in the development of an accurate characterisation strategy. Such activities are preparatory steps for future field validation with the assistance of dynamic olfactometry.
2. Materials and Methods
2.1. Site and Equipment
The study analysed data from the ETL3000 e-Nose located at a sewage treatment plant dedicated to wastewater purification, presented in Figure 1. The e-Nose is placed at the red spot near the sedimentation basins, which are highlighted with green points. The plant is located near a residential area, hence the need for precise monitoring to identify and potentially prevent spikes in odour concentration that could affect the residential area in the future.
The electronic nose is the ETL3000 (ORION, Veggiano (PD), Italy), an air quality monitoring instrument for indicative measurement (as outlined in European Directive 2008/50/EC [51]). The electronic nose was trained following the guidelines proposed by the Lombardy Region [52]: reference odours, which are representative of the emissions from the specific activities being studied, were chosen. The sensors of the ETL3000 were calibrated using controlled concentrations of the selected reference odours. Calibration entailed generating response curves for each sensor by exposing them to known concentrations of odours.
During the training phase, the ETL3000 was exposed multiple times to the reference odours under controlled laboratory conditions. Sensor responses were recorded, establishing a comprehensive database of odour patterns. The accuracy of the ETL3000 was validated by testing it with the reference odours and comparing its identifications with those obtained from standard olfactometric methods. Cross-validation techniques were employed to ensure reliability and precision in odour detection and quantification. Subsequently, the ETL3000 was deployed in real emission scenarios. Field testing involved comparing its readings with traditional olfactometric measurements to fine-tune its sensitivity and accuracy. Adjustments were made as necessary to enhance its performance.
The data collection period spanned from 18 October 2023 to 13 November 2023, therefore for a period of 27 days. In addition to measuring odour concentration (ouE/m3), the electronic nose recorded various descriptors of the analysed site. These included data from a weather station capturing physical parameters such as wind speed (m/s), wind direction (deg), temperature (°C), atmospheric pressure (Pa), precipitation (mm), relative humidity (%), and UV index, along with readings from four chemical sensors. These chemical sensors monitored levels of volatile organic compounds (VOCs, in ppm) with a SENS-IT PID sensor and a SENS-IT Electrochemical sensor, hydrogen sulphide (H2S, in ppb) with a SENS-IT Electrochemical sensor, and benzene (C6H6, in ppb) with a SENS-IT Thick Film Metal Oxide Semiconductor sensor, contributing to the overall assessment of odour.
The system features a variety of sensors integrated into modules. These include metal-oxide semiconductors (MOS) that can detect both organic and inorganic molecules, specialised electrochemical sensors (EC), and selective photoionisation detectors (PID). Sensors designed for C6H6 detection (from 0 ppb to 30 ppb) employ Thick Film Metal Oxide Semiconductor (TF-MOS) technology. The sensor’s active surface consists of a specific nanostructured semiconductor metal oxide. Initially, atmospheric oxygen is adsorbed onto the sensor’s surface, leading to charge transfer from the semiconductor to oxygen molecules. Subsequently, the target gas reacts with the adsorbed oxygen (via redox reactions), releasing electrons into the semiconductor’s conduction band. The current signals generated by the sensors during these reactions allow the concentration of the target gas to be measured directly. Additionally, monitoring of H2S (from 0 ppb to 3000 ppb) and VOCs (from 0 ppm to 25 ppm) is carried out by employing traditional electrochemical (EC) technology. In an EC sensor, the target chemical interacts with the electrode surface and causes a change in electrical properties, such as voltage or current, which can be measured and correlated with the concentration of the target chemical. A photoionisation detector (PID) uses ultraviolet (UV) light to ionise VOC molecules in an air sample. When VOCs enter the ionisation chamber, the UV light knocks electrons off the VOC molecules, creating positive ions and free electrons. These ionised particles generate an electrical current when attracted to electrodes within the chamber. The strength of this current is proportional to the concentration of VOCs, providing real-time measurement. PIDs are commonly used for monitoring air quality in industrial and environmental settings.
2.2. Data Analysis
Odour data are processed every ten minutes using an algorithm that correlates the descriptors of the electronic nose. A preliminary investigation was conducted through the development of a correlation matrix. The data for this matrix were processed following a smoothing operation, which helped to reduce variability and improve the quality of the analysis. This matrix was calculated by constructing a linear model for each variable, treating each one as the dependent (or outcome) variable in relation to all other variables in the dataset, which served as independent (or predictor) variables. For these models, the coefficient of determination (R2) was calculated to understand how effectively the models could explain the variance of each individual variable. A deeper investigation was conducted using the R software. The analysis was performed first with the Random Forest algorithm [53,54] and then with least squares regression [55].
As G. Biau and E. Scornet reported in 2016 in ‘A Random Forest guided tour’ [54], Random Forest (RF) is a supervised learning algorithm that represents an evolution of decision trees. RF was used to develop the correlation between the data more accurately, as it is a machine learning algorithm used to solve classification problems and to predict continuous variables based on a set of predictors. A decision tree is a tree structure in which attributes are evaluated at nodes, leading to the final classification in the leaves. The evaluation criterion for attributes is the Information Gain (IG), which measures the variation in information entropy from a previous state to a subsequent one. It is commonly used when deciding which feature (or attribute) to use to split the data into decision tree nodes during the training process. The higher the entropy, the lower the predictability of the event.
RF addresses the overfitting issue by being an ensemble method that generates k trees through bootstrap sampling of the dataset (a statistical technique used to estimate a sample distribution without making parametric assumptions about the underlying population), selecting attributes randomly at each split (node) and evaluating Information Gain, described in Equation (1), to obtain indications of model robustness as follows:
$$IG(T, a) = H(T) - H(T \mid a) \qquad (1)$$
where ‘T’ represents the training dataset, ‘a’ is the value of an attribute, and ‘H’ represents information entropy defined in Equation (2):
$$H(T) = -\sum_{x=1}^{k} P_x \log_2 P_x \qquad (2)$$
where ‘Px’ is the probability that event ‘x’ occurs, and ‘k’ represents the number of attributes.
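The quantities referred to as Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes the usual conditional-entropy form of Information Gain for a categorical attribute; the toy labels and the names `odour_class`, `h2s_level` and `wind_flag` are hypothetical and are not taken from the study's dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy of the labels minus the weighted entropy after splitting
    the data on the values of one attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: an odour class and two candidate split attributes.
odour_class = ["high", "high", "low", "low"]
h2s_level = ["above", "above", "below", "below"]  # separates the classes
wind_flag = ["a", "b", "a", "b"]                  # carries no information

print(information_gain(odour_class, h2s_level))
print(information_gain(odour_class, wind_flag))
```

In this toy case the first attribute removes all uncertainty about the class, while the second removes none, which is the behaviour a tree exploits when choosing a split.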
RF combines predictions from all the trees, thereby reducing the risk of overfitting. Its hyperparameters—parameters external to machine learning models that affect the training process but are not directly learned from the data—include the number of trees (ntree), the number of randomly sampled variables (mtry), the number of samples for training (sampsize), the minimum size of terminal nodes (nodesize), and the maximum number of final nodes (maxnodes).
For the data analysis, three distinct Random Forest models were utilised: RF0, RF1, and RF2. The models differ in the division between the training (sampsize) and test sets, which is discussed in detail in the Results and Discussion section. For the development of all three models, the following parameters were used: the number of trees (ntree) was set to 2000; the number of variables sampled (mtry) was set to 2; and the minimum node size (nodesize) was set to 1. Additionally, the variable Ncores was set to 8, indicating the number of processors used to expedite the calculations, and set.seed(12345) was called so that random processes yield consistent results, allowing for reproducibility in statistical analyses and machine learning.
In the RF method, variables play a significant role in determining the accuracy and effectiveness of the model. RF assigns an importance score to each chemical variable, reflecting its contribution to the model’s ability to make accurate predictions. The impurity-based importance metric provided by the model’s importance attribute offers a comprehensive measure of each variable’s contribution to reducing uncertainty in the predictions. By summing the reduction in node impurity each variable achieves across all trees, the model identifies which variables are the most influential in shaping predictions. This measure is crucial for understanding the underlying structure of the data and prioritising key features. The use of the Random Forest model in odour analysis is particularly effective due to its accuracy and robustness in handling complex data. This approach allows for predicting odour concentrations by simultaneously analysing various chemical substances and identifying those most influential on odour perception. Additionally, Random Forest captures nonlinear interactions between variables and provides insights into the importance of each factor, enhancing the understanding of the elements contributing to odours. Its ability to manage missing data makes it a reliable method for environmental monitoring. The predictive variables used for this study are the measured concentrations of H2S, C6H6, and the two measurements of VOCs. Further information on RF can be found in the literature [56,57,58].
Since the objective is to identify the relationship that best reflects the available data, the most important chemical variables identified through the RF model can be incorporated into the logarithmic model described in the literature and outlined below.
Typically, a relationship that links chemical concentrations (Ci) is used to evaluate the odour concentration (CcOD), as described in Equation (3). It involves multiplying a specific coefficient (kc) by the quotient obtained from dividing the measured chemical concentration (Ci) by the odour threshold value (OTVi), which is the minimum concentration of the specific odourous compound in the air required for it to be detected [31,59,60]. An additional parameter complementing odour concentration is odour intensity (OI). Intensity measures the perceived strength of an odour, which is influenced by both the odourant and the individual perceiving it [61]. In contrast, concentration quantifies the actual amount of odour present in the air. Although they are often considered interchangeable, concentration represents the objective quantity of the odour, while intensity reflects the subjective perception of its strength [62]. The OI mixture formula is given in Equation (4) and is calculated by the Weber–Fechner law [63].
Together, these parameters offer distinct but complementary insights into the olfactory experience. In literature, concentrations corresponding to the olfactory thresholds of many compounds have been experimentally determined. These values are applicable only when referring to pure substances. However, in the presence of mixtures, effects such as independence, additivity, synergy, and antagonism can occur [64]. Determining the proportionality constant (kc) for this calculation typically involves olfactometric measurements and a linear regression analysis, as demonstrated in studies like [43,63].
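The verbal description of Equations (3) and (4) can be sketched in Python as follows. The exact published form of the two equations is not reproduced here, so the sketch assumes an additive form, CcOD = kc · Σ(Ci/OTVi), and the generic Weber–Fechner relation OI = kw · log10(C/Cref). The names `kc`, `k_w` and `c_ref`, as well as the readings and threshold values, are hypothetical placeholders rather than literature data.

```python
import math


def odour_concentration(concentrations, thresholds, kc=1.0):
    """Assumed additive form: kc times the sum of C_i / OTV_i."""
    return kc * sum(concentrations[s] / thresholds[s] for s in concentrations)


def odour_intensity(c_od, k_w=1.0, c_ref=1.0):
    """Generic Weber-Fechner relation: intensity grows with log10 of the
    stimulus relative to a reference level (k_w and c_ref are assumptions)."""
    return k_w * math.log10(c_od / c_ref)


# Hypothetical sensor readings and odour threshold values, both in ppb.
readings = {"H2S": 50.0, "C6H6": 10.0}
otv = {"H2S": 0.5, "C6H6": 2000.0}

c_od = odour_concentration(readings, otv, kc=1.0)
print(c_od, odour_intensity(c_od))
```

With such a form, a compound whose concentration is far above its threshold dominates the sum, whereas a compound far below its threshold contributes almost nothing; mixture effects such as synergy or antagonism are not captured by this additive sketch.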
An additional investigation into the mathematical model was carried out using the least squares method. The confidence intervals (Equation (5)) for the parameter estimates in the linear models were calculated using the following formula:
$$CI = \beta \pm t_{\alpha/2,\,df} \cdot SE(\beta) \qquad (5)$$
where β represents the estimated coefficient; tα/2,df is the critical value from the Student’s t-distribution for the chosen confidence level (95%) and the degrees of freedom (df = n − k), where ‘n’ is the number of observations and ‘k’ is the number of estimated parameters, including the intercept; SE(β) is the standard error associated with the estimated coefficient. This formula allows for the computation of the range within which we expect the true parameter values to fall, with a given level of confidence. A detailed discussion of the linear regression model is included in the Supplementary Materials. This approach leverages the linear relationships between the selected variables and the target to gain a more detailed understanding of the factors influencing the model’s output. This allows for the integration of the nonlinear information captured by RF with a more interpretable structure provided by the linear model.
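As an illustration of the confidence interval calculation, the following Python sketch fits a single-predictor least squares line and returns the interval for the slope. It is a simplified example, not the study's R workflow: the critical value `t_crit` is passed in by the user (for instance from a t-table), and the data are hypothetical.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Fit y = b0 + b1 * x by least squares and return (b1, lower, upper),
    where the bounds are b1 -/+ t_crit * SE(b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2  # two estimated parameters: intercept and slope
    se_b1 = math.sqrt(rss / df / sxx)
    return b1, b1 - t_crit * se_b1, b1 + t_crit * se_b1


# Hypothetical data; 3.182 is the two-sided 95% critical value for df = 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, lower, upper = slope_confidence_interval(x, y, t_crit=3.182)
print(slope, lower, upper)
```

For a dataset with thousands of observations, the critical value approaches 1.96, so the width of the interval is driven almost entirely by the standard error.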
3. Results and Discussion
Within the analysis algorithm and in the generated graphs, the various descriptive variables representing the site are denoted as follows: ‘wint’ for wind intensity, ‘wdir’ for wind direction, ‘temp’ for temperature, ‘pres’ for atmospheric pressure, ‘prec’ for precipitation, ‘rhum’ for relative humidity, ‘uvidx’ for UV index, ‘VEC’ for VOCs concentration determined with the EC sensor, ‘VPID’ for VOCs concentration determined with the PID sensor, ‘VH2S’ for H2S concentration, and ‘VC6H6’ for C6H6 concentration. Additionally, the concentration of odourous substances is indicated by ‘CcOD’.
Because the data are experimental, consisting of signals of environmental origin recorded by the electronic nose, it is necessary to use a smoothing model to filter out unwanted noise. Smoothing makes it easier to analyse or visualise underlying trends or patterns. This technique is particularly useful when raw data are affected by random fluctuations or irregularities that can hinder interpretation. One of the most common techniques is the so-called “moving average”, which involves calculating the average of values in the data series over a specified time interval. In this case, the moving average for the current odour concentration value was computed using the preceding five odour concentrations.
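A trailing moving average of this kind can be sketched in a few lines of Python. The sketch assumes that each smoothed value is the mean of the current reading and up to five preceding readings; whether the current value is included in the window is an assumption, and the function name `smooth` and the sample values are hypothetical.

```python
def smooth(values, n_prev=5):
    """Trailing moving average: mean of each value and the n_prev values
    before it (fewer at the start of the series)."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - n_prev)
        window = values[start:i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical ten-minute odour readings with one isolated spike.
raw = [10.0, 12.0, 300.0, 11.0, 9.0, 10.0, 12.0]
print(smooth(raw))
```

The isolated spike is spread over the following readings instead of appearing as a single extreme value, which is the noise-damping effect sought before the correlation analysis.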
The correlation matrix (Table 1) provides a comprehensive summary of the relationships between various environmental variables and odourous concentration ‘CcOD’ within the dataset. The matrix presents the correlation coefficients, which quantify the linear relationships between pairs of variables. Coefficients range from −1 to 1: the value 1 represents a perfect positive relationship, the value 0 indicates the absence of any relationship, and the value −1 signifies a perfect negative relationship.
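The pairwise coefficients in such a matrix can be computed as in the following Python sketch, which uses the Pearson correlation coefficient. The column names follow the variable names used in this section, but the numerical values are hypothetical and serve only to show the two extreme cases.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values chosen to give a perfect positive and negative relation.
data = {
    "CcOD": [1.0, 2.0, 3.0, 4.0],
    "VH2S": [2.0, 4.0, 6.0, 8.0],
    "temp": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(data)
print(matrix["CcOD"]["VH2S"], matrix["CcOD"]["temp"])
```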
The odourous concentration ‘CcOD’ ranges from 2.2 ouE/m3 to 18,972.2 ouE/m3, with a mean value of 519.76 ouE/m3. The data at the 98th percentile reaches 4868 ouE/m3.
An analysis of the intensity and direction of the wind in relation to the concentration of odour was subsequently carried out.
Observing Table 1, it is evident that the relationships for VEC and VPID are not particularly significant, as indicated by their lack of correlation with CcOD. The data for VC6H6 exhibit a significant correlation with CcOD, and VH2S shows an even stronger correlation with CcOD. The detailed numerical correlations in Table 1 make it easier to interpret the linear relationships among the variables, particularly in relation to the odourous concentration CcOD. The weather variables ‘temp’, ‘wint’, ‘wdir’, ‘pres’, ‘prec’, ‘rhum’, and ‘uvidx’ show no significant correlation with the odour concentration. For temperature in particular, this could be due to the limited duration of the dataset (27 days): during October–November there is generally no significant temperature variation, and the mean temperature was 15.7 °C with a standard deviation of 4.5 °C.
3.1. Analysis of Wind Direction and Intensity and Implications for Odour
The wind rose chart in Figure 2 illustrates the distribution of wind frequencies in terms of direction and intensity. The directions are represented by cardinal and intercardinal points: N (North), NE (North-East), E (East), SE (South-East), S (South), SW (South-West), W (West), NW (North-West). The bars extending from the centre of the chart indicate the direction from which the wind is blowing. The length of the bars represents the frequency with which the wind blows from a certain direction. The concentric circles represent frequency percentages, with increments of 5% (5%, 10%, 15%, 20%). The colours of the bars indicate the wind speed (in m/s), according to the legend at the bottom: blue for 0–2 m/s, green for 2–4 m/s, yellow for 4–6 m/s, and red for 6–20.78 m/s. The average wind speed is reported in the chart as “mean = 0.89512 m/s”. The percentage of time the wind is calm, with very low wind speed, is indicated as “calm = 41.1%”. Most of the wind comes from directions varying between N and NE and between S and SE, with a predominance of weak to moderate winds (blue and green) and some instances of stronger winds (yellow and red). Compared to the annual wind rose, this analysis shows a higher frequency of winds coming from the north. This is due to the seasonal presence of the Tramontana wind, which is common during the late autumn period. The N direction seems to have the highest frequency of calm wind. The percentage of calm, 41.1%, is quite high, suggesting that in this dataset there are many periods of very weak or absent wind.
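The frequencies displayed in a wind rose can be tabulated as in the following Python sketch, which assigns each reading to one of eight 45° sectors and counts calm periods separately. The calm threshold of 0.5 m/s is an assumption (the value used for Figure 2 is not stated), and the function names and sample readings are hypothetical.

```python
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def sector(direction_deg):
    """Map a direction in degrees (0 = N, clockwise) to a 45-degree sector
    centred on the corresponding cardinal or intercardinal point."""
    return SECTORS[int(((direction_deg % 360) + 22.5) // 45) % 8]


def wind_rose_table(speeds, directions, calm_threshold=0.5):
    """Return (calm percentage, {sector: percentage of non-calm readings
    relative to all readings}). calm_threshold is an assumed value in m/s."""
    n = len(speeds)
    calm = sum(1 for s in speeds if s < calm_threshold)
    counts = {name: 0 for name in SECTORS}
    for s, d in zip(speeds, directions):
        if s >= calm_threshold:
            counts[sector(d)] += 1
    return 100.0 * calm / n, {k: 100.0 * v / n for k, v in counts.items()}


# Hypothetical readings: wind speed in m/s and direction of origin in degrees.
calm_pct, freq = wind_rose_table([0.1, 2.0, 3.0, 0.2], [0.0, 225.0, 90.0, 180.0])
print(calm_pct, freq)
```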
An in-depth analysis is conducted by observing the distribution of odour concentration in relation to wind direction and frequency (Figure 3). Each segment represents the concentration levels of odour as influenced by the wind coming from different directions. The directions are represented by cardinal points: N, NE, E, SE, S, SW, W, and NW. The segments extending from the centre indicate the direction from which the wind is blowing.
The segment colours represent the odour concentration, according to the legend on the right. The length of the segments represents the frequency of wind from each direction. The concentric circles represent frequency percentages, with increments of 5% and 10%.
In Figure 3, the mean odour concentration is reported as “mean = 521.08 ouE/m3”. The percentage of calm conditions is indicated as “calm = 31.9%”. Higher odour concentrations (red and orange segments) are notably present in the N, SE, and S directions, indicating that odour levels are higher when the wind blows from these directions. The chart also shows moderate odour concentrations (blue and green segments) spread across various directions. The calm condition percentage (31.9%) suggests that nearly a third of the time, wind conditions are calm, leading to lower odour dispersion. The different calm period percentages in Figure 2 and Figure 3 arise because Figure 2 reflects low or absent wind speeds, while Figure 3, focusing on odour transport, shows fewer calm periods as low wind is less effective for dispersing odour. This wind–odour association highlights wind’s role in odour movement. High calm readings may be influenced by elevated areas nearby, and calm periods were excluded from the analysis dataset.
Considering the relative positions of the primary sedimentation tanks and the electronic nose shown in Figure 1, it was essential to narrow the analysis to wind directions specifically originating from the 225° (SW) to 325° (NW) range. Since these angles correspond to the directions from which the wind is approaching, analysing this specific range enhances the likelihood of detecting odours originating directly from the tanks.
Figure 4 illustrates the box plot of odour concentrations concerning the wind directions NW, W, and SW. Table 2 reports values of odour concentration for various percentiles. The box plot shows the distribution of odour concentrations for the wind directions Northwest (NW), West (W), and Southwest (SW), with the y-axis on a logarithmic scale to better visualise central data and reduce the visual impact of outliers, making it easier to compare both typical and extreme values in a balanced way. This choice is helpful as the concentrations vary over a wide range, from 10 to over 10,000 ouE/m3. The plot reveals that the median odour concentrations across the three directions are similar, around 100 ouE/m3. The central part of the distribution (the 25th to 75th percentiles) shows values concentrated within a limited range, while the numerous outliers indicate occasional high peaks. Overall, the plot highlights that, although moderate values are most frequent, there are also episodes of high concentration in all three directions.
Odour concentrations tend to increase significantly from lower percentiles (25th and 50th) to higher percentiles (75th and 98th). This indicates that while most odour concentrations remain relatively low, there are some very high peaks. At the maximum concentrations (98th percentile), the highest concentrations occur with winds from the SW, peaking at 4547.4 ouE/m3; odour concentrations for NW reach 3766.11 ouE/m3 at the 98th percentile; for winds from W, the highest recorded concentration is 2904.0 ouE/m3.
The SW direction shows the highest odour concentrations overall, suggesting a significant odour source located southwest of the measurement point. NW and W show slightly lower but still significant concentrations compared to SW.
For odourous emissions, there is no national law in Italy that sets clear and uniform limits for odour concentration, leaving the matter to regional authorities. For instance, the only limit set by the Lombardy Region is 300 ouE/m3 for compost production plants, used as a reference to prevent negative impacts on air quality [65,66].
The concentration limit of 1 ouE/m3 pertains to the odour concentration perceived in areas with sensitive receptors, such as residential areas or public places, and is used in olfactory impact studies based on dispersion simulation [67]. This more restrictive measure ensures that the odour does not reach levels that are annoying or harmful to people near the facility. In general, olfactory nuisance is defined as the odour exceeding the limit for more than 2% of the monitoring time (i.e., about 175 h over a one-year monitoring period) [52].
As shown in Figure 1, the e-Nose is positioned near the plant boundary line; nevertheless, the frequency with which the electronic nose recorded concentrations greater than 300 ouE/m3 was examined, and such values occurred in 26.36% of cases, as reported in Figure 5. The scale on the x-axis is adjusted to allow observation of the values contained within the 98th percentile. The graph related to the entire dataset is included in the Supplementary Materials (Figure S1).
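The exceedance check described above amounts to counting the share of readings above a limit and comparing it with the 2% criterion. A minimal Python sketch follows; the function names and the sample series are hypothetical, and the default limit and allowed fraction are simply the values quoted in the text.

```python
def exceedance_fraction(values, limit):
    """Fraction of readings strictly above the limit."""
    return sum(1 for v in values if v > limit) / len(values)


def is_nuisance(values, limit=300.0, allowed_fraction=0.02):
    """True when the limit is exceeded for more than the allowed share
    of the monitoring time (2% by default, as quoted in the text)."""
    return exceedance_fraction(values, limit) > allowed_fraction


# Hypothetical odour concentrations in ouE/m3.
readings = [120.0, 450.0, 80.0, 310.0, 95.0, 60.0, 700.0, 150.0]
print(exceedance_fraction(readings, 300.0), is_nuisance(readings))

# The same 2% criterion expressed in hours over one year of monitoring.
print(0.02 * 365 * 24)
```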
A more specific analysis has been carried out regarding the weather variables of interest. It was possible to analyse data from both chemical detectors and odour concentrations under different meteorological conditions.
The smoothed data of CcOD, VEC, VPID, VH2S, and VC6H6 were analysed under the following conditions: (i) wind direction between 225° and 325°, the sector from which odour from the sedimentation tank can potentially be intercepted by the electronic nose; this directional range filters the odour data specifically related to the water treatment plant by exploiting the position of the electronic nose relative to the sedimentation tank; (ii) a temperature greater than 15 °C and a relative humidity greater than 70%, since water vapour in the ambient air can significantly increase the perception of odour. The data selected in this way were compared with the entire dataset. There is a lack of correlation between CcOD and the variables VPID and VEC, both in the entire dataset and in the data selected under these meteorological conditions; a more significant correlation is observed between CcOD and the variables VH2S and VC6H6. For VH2S, the R2 value increases from 0.60 to 0.68, while for VC6H6 it rises from 0.44 to 0.60 when the data are selected according to these meteorological conditions. The relevant figures are provided in the Supplementary Materials (Figures S2–S5).
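The meteorological selection and the comparison of R2 values can be sketched as follows. This is an illustrative sketch under stated assumptions: the record keys (`wind_dir`, `temp`, `rh`, `v_h2s`, `c_od`) and the sample rows are hypothetical, the wind bounds are treated as inclusive, and R2 is computed as the squared Pearson correlation of a simple linear relationship.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def select_conditions(rows, wind=(225.0, 325.0), min_temp=15.0, min_rh=70.0):
    """Keep rows with wind from the tank sector, temperature > 15 °C and RH > 70%."""
    return [
        r for r in rows
        if wind[0] <= r["wind_dir"] <= wind[1]
        and r["temp"] > min_temp
        and r["rh"] > min_rh
    ]


# Hypothetical records (illustrative values, not study data).
rows = [
    {"wind_dir": 250, "temp": 18, "rh": 80, "v_h2s": 1.0, "c_od": 200.0},
    {"wind_dir": 300, "temp": 20, "rh": 75, "v_h2s": 2.0, "c_od": 400.0},
    {"wind_dir": 230, "temp": 16, "rh": 90, "v_h2s": 4.0, "c_od": 800.0},
    {"wind_dir": 90, "temp": 22, "rh": 85, "v_h2s": 3.0, "c_od": 100.0},   # wrong sector
    {"wind_dir": 260, "temp": 10, "rh": 85, "v_h2s": 0.5, "c_od": 900.0},  # too cold
    {"wind_dir": 280, "temp": 19, "rh": 40, "v_h2s": 2.5, "c_od": 50.0},   # too dry
]
selected = select_conditions(rows)
r2_all = r_squared([r["v_h2s"] for r in rows], [r["c_od"] for r in rows])
r2_selected = r_squared([r["v_h2s"] for r in selected], [r["c_od"] for r in selected])
print(f"R2 all data: {r2_all:.2f}, R2 selected data: {r2_selected:.2f}")
```

The sample rows are constructed so that the filtered subset is perfectly linear; real data would show a smaller improvement, such as the increase from 0.60 to 0.68 reported for VH2S.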
3.2. Data Pre-Processing and Machine Learning with Random Forest Models
Data are pre-processed with a logarithmic transformation before the Random Forest algorithm is applied. This is advantageous for several reasons. It helps stabilise the variance and reduces the impact of outliers, making the model more robust. Additionally, it brings highly skewed data closer to a normal distribution, which improves the algorithm’s performance. The transformation also compresses the range of values, bringing them onto a similar scale, and simplifies relationships in the data, allowing the Random Forest to detect and model patterns more effectively [68].
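The effect of the logarithmic pre-processing on a skewed variable can be illustrated with a short sketch. The base of the logarithm is not stated in the text, so base 10 is an assumption here, and the sample values are hypothetical.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed odour concentrations in ouE/m3 (not study data).
raw = [30.0, 60.0, 120.0, 250.0, 500.0, 4500.0]
logged = [math.log10(v) for v in raw]   # assumed base-10 transformation
restored = [10.0 ** v for v in logged]  # inverse transformation back to ouE/m3

print(f"skewness before: {skewness(raw):.2f}, after: {skewness(logged):.2f}")
```

Predictions made on the logarithmic scale need the inverse transformation, as in `restored`, before they can be compared with concentration limits expressed in ouE/m3.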
Because the dataset is limited, consisting of less than one month of readings taken every ten minutes, three different approaches were followed:
RF0 (Figure 6): the first 30% of the data were used to train the algorithm, and the resulting predictive model was applied to the remaining 70% of the dataset.
RF1 (Figure 7): a random sample of 30% of the data was used to train the algorithm, and the resulting predictive model was applied to the remaining 70% of the dataset.
RF2 (Figure 8): a random sample of 30% of the data was used to train the algorithm, and the resulting predictive model was applied to the complete dataset, including the same 30% used for training.
A preliminary check is conducted using all the data available in the dataset to observe the model’s performance. RF1 is preferable because its predictions are evaluated only on data that were not used for training, unlike RF2, where the prediction set includes the training data. Additionally, RF1 is not tied to a specific period, unlike RF0, which uses the first 30% of the data for training.
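The difference between the three schemes lies only in which rows are used for training and which for prediction. The sketch below builds the corresponding index sets with the Python standard library; the function name, the seed, and the dataset size are hypothetical, and fitting the Random Forest itself is left out.

```python
import random


def split_schemes(n, train_fraction=0.3, seed=0):
    """Return {name: (train_idx, eval_idx)} for the RF0, RF1 and RF2 schemes."""
    n_train = round(n * train_fraction)
    all_idx = list(range(n))

    # RF0: chronological split, first 30% for training, the rest for prediction.
    rf0 = (all_idx[:n_train], all_idx[n_train:])

    # RF1: random 30% for training, prediction on the held-out 70% only.
    rng = random.Random(seed)
    train = sorted(rng.sample(all_idx, n_train))
    train_set = set(train)
    held_out = [i for i in all_idx if i not in train_set]
    rf1 = (train, held_out)

    # RF2: same random training sample, prediction on the complete dataset.
    rf2 = (train, all_idx)

    return {"RF0": rf0, "RF1": rf1, "RF2": rf2}


schemes = split_schemes(n=20)
for name, (train_idx, eval_idx) in schemes.items():
    overlap = len(set(train_idx) & set(eval_idx))
    print(name, "train:", len(train_idx), "eval:", len(eval_idx), "overlap:", overlap)
```

Only RF2 shows a non-zero overlap between training and evaluation rows, which is the reason its metrics are expected to look optimistic.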
Each set of graphs corresponding to a random forest model comprises three distinct plots. Panel A illustrates the correlation between the observed variable and its predicted counterpart generated by the model as a heatmap scatter plot, in which colour intensity indicates regions with a higher concentration of values (the red region), offering insights into the data distribution. The ideal scenario is represented by the purple line, which follows the y = x trajectory and indicates perfect alignment between observed and predicted CcOD values.
Panel B shows an error plot, a visual representation of the difference between observed and predicted values across the dataset. Finally, Panel C is a histogram of the errors, offering a comprehensive view of the model’s predictive performance across the dataset.
In the histograms of errors, a Gaussian distribution centred around error 0 is observed in Figure 7 and Figure 8. In Figure 6, the error histogram (panel C) does not exhibit a Gaussian shape, indicating that errors are more prevalent. In Figure 7, noticeable and occasionally significant error spikes are evident. In Figure 8, corresponding to model RF2, error spikes persist but are less pronounced than those observed in model RF0, while the Gaussian error distribution remains narrow.
Table 3 summarises the coefficients of the trend line and the R-squared values of the three RF models extended to the entire dataset.
For RF0, the positive intercept and slope of less than 1 indicate an underestimation of predictions compared to the measured values. The R-squared value of 0.31 suggests a relatively weak correlation between predicted and measured data, and the relatively high RMSE (Root Mean Squared Error) of 1.125 indicates overall low accuracy for this model. In RF1, the intercept is close to zero and the slope is nearly 1, indicating better alignment between predicted and measured data. The R-squared value of 0.7 shows a good correlation, and the RMSE is reduced to 0.87. For RF2, the intercept is close to zero, and the slope slightly above 1 indicates a slight overestimation in the predictions. It has the highest R-squared value (0.75) and the lowest RMSE (0.75). The discrepancy in R-squared values among the models can be attributed to how the training values are used within the predictive model. Model RF2, whose prediction set includes the training values, achieves a higher R-squared value, whereas RF0, whose training set is not randomly sampled, yields a lower one. However, although model RF2 achieves a higher R-squared value, it is not the preferable choice: data used for training should not also appear in the prediction set, which makes model RF1 the more favourable option. For the three models, the importance of the variables VEC, VPID, VH2S, and VC6H6 was calculated; the results are summarised in Table 4.
In the evaluation of variable importance within the Random Forest models, VH2S was identified as the most significant variable, particularly in RF1 and RF2, where it accounted for 49.5% of importance; in RF0, its importance was 40.8%. VC6H6 also demonstrated substantial influence, contributing 36% in RF0 and 30% in RF1 and RF2. VPID had moderate importance, ranging from 19.5% to 22.4%, while VEC was the least impactful, contributing only 0.8% in RF0 and 1% in the other two models. All models therefore agree in identifying VH2S and VC6H6 as the most important variables for model structuring, with VPID having a minor influence and VEC a negligible one. The likely reason why VPID and VEC do not show significant importance in the RF models is that they are measured with PID and electrochemical sensors, both of which detect thousands of compounds, potentially dominated by those with lesser olfactory impact. Therefore, to use these types of sensors effectively in odour-related applications in the future, it may be necessary to perform compound speciation. PID sensors are highly sensitive and can detect VOCs at very low concentrations, but they lack selectivity and can be influenced by numerous interfering compounds present in the environment. This makes it challenging to use these sensors in applications where distinguishing between different VOCs is necessary [69].
Electrochemical sensors, in turn, can be susceptible to interference from other gases in the environment. Additionally, their sensitivity can be affected by environmental conditions such as temperature and humidity, which can compromise measurement accuracy [70].
3.3. Analysis on Data < 1000 ouE/m3
An additional analysis was conducted with the random forest models RF0, RF1, and RF2, considering only odour concentration data below 1000 ouE/m3. The investigation focuses on CcOD values lower than 1000 ouE/m3 because of the higher density of data below this threshold.
In Figure 9, Figure 10 and Figure 11, the wider spread of the Gaussian curve in the error histogram indicates that a larger share of the data is affected by errors. In panel C of Figure 9 and Figure 10, which shows the error between observed and predicted data, the peak frequency of the errors is noticeably shifted and is no longer centred on 0: the highest frequency corresponds to errors lower than 0. The shift is approximately 100 ouE/m3 in RF0 and around 60 ouE/m3 in RF1, while in RF2 (Figure 11) it is approximately 30 ouE/m3. These observations align with the earlier discussion regarding the significance of the data. An increased number of error spikes is observed for all three models.
Table 5 summarises the coefficients of the trend line and the R-squared values of the three RF models for odour concentrations below 1000 ouE/m3.
In RF0, the high intercept and very low slope coefficient suggest that there is almost no relationship between predicted and measured values. The very low R-squared (0.01) confirms the absence of a meaningful correlation, and the RMSE of 1.17 indicates poor overall accuracy, meaning this model is not effective in capturing the underlying pattern of the data. In the RF1 model, the trend line lies close to the reference line: it has the lowest intercept and the highest slope among the three models, and the slope above 1 suggests a slight overestimation of the predictions. However, the R-squared value (0.27) indicates a weak correlation, and the RMSE of 0.83 shows moderate accuracy. In RF2, an R-squared value of 0.47 indicates a moderate correlation, and the RMSE of 0.71 reflects improved accuracy. Despite its improved metrics, the RF2 model cannot be considered because the training set is included in the test set. It therefore cannot be determined whether a relationship exists between the observed and predicted odour concentration values below 1000 ouE/m3. The Random Forest models developed on the entire dataset and for concentrations below 1000 ouE/m3 were validated using 5-fold cross-validation. The related considerations are presented in the Supplementary Materials (Tables S1 and S2).
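The partitioning behind 5-fold cross-validation can be sketched as follows. The text does not state whether the folds were shuffled or contiguous, so contiguous, unshuffled folds are an assumption here, and the function name and dataset size are hypothetical.

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first n % k folds receive one extra row so that all rows are used.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(12, k=5))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {number}: train on {len(train_idx)} rows, test on rows {test_idx}")
```

Each row is used for testing exactly once and never appears in the training set of its own fold, which avoids the overlap that affects RF2.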
The e-Nose system, despite certain limitations, proves valuable for managing industrial odour nuisance emergencies. It is particularly suited to situations requiring continuous air quality monitoring during odour-related events. Notably, the e-Nose delivers more reliable measurements when odour concentrations exceed 1000 ouE/m3, which makes these data instrumental for the timely management of olfactory nuisance emergencies. By leveraging this information, environmental authorities can communicate effectively with the public and intervene swiftly when odour levels surpass critical thresholds. However, one major challenge is the lack of a defined uncertainty associated with e-Nose measurements: currently, no specific regulation ensures that these measurements have the repeatability, reproducibility, and overall uncertainty characteristics needed for full reliability. After Random Forest has been used to generate predictions and assess variable importance, further exploration of the relationships between the predictor variables and the response variable can provide clearer insights. This analysis helps illustrate how each chemical predictor might directly influence odour responses, enabling industries to act on the most odouriferous chemical compounds. The aim is to deploy the e-Nose in a manner similar to that set out in the EN 14181 standard [71], enabling it to differentiate between various odour types in public areas and to use this feedback to modify the chemical configuration of facilities with odour-related challenges.
3.4. Relationship Between Chemical Concentration and Odour Concentration
The first method investigated to relate chemical concentration data to odour data is the formula previously described in Equation (3). Because of the negligible importance of the variable VEC and the lesser impact of the variable VPID in the RF models (both related to VOC measurements), these variables were not used. The odour threshold value used for VH2S (OTVH2S), which corresponds to H2S, is 0.41 ppb, and that used for VC6H6 (OTVC6H6), which corresponds to benzene, is 2700 ppb. Equation (6) applies this formula to the variables of interest, VH2S and VC6H6; ‘k0’ and ‘k1’ represent the experimental coefficients fitted to the study data.
The values of the logarithmic model are reported in Table 6.
The coefficient of determination (R-squared) is 0.6, meaning that 60% of the variation in the dependent variable can be explained by the independent variables in the model. The standard error for k0 is 2.34 × 10^2, while for k1 it is 2.5 × 10^−2. Both standard errors are relatively low compared to the coefficient estimates, indicating a good level of precision in the estimation. The reported t-values are −5.05 × 10^−5 for k0 and 8.77 × 10 for k1, and both coefficients are highly significant (p-value < 0.001), indicating that both terms make a substantial contribution to the model. The p-value associated with the F-test is less than 2 × 10^−16, indicating that the model as a whole is statistically significant. The 95% confidence intervals indicate the range of plausible values for the model parameters. The confidence interval for the intercept ranges from −1.168 × 10^3 to −1.078 × 10^3, while the interval for the coefficient of the composite variable lies between 3.26 and 3.41. Both intervals are relatively narrow, implying that the parameter estimates are precise and have low uncertainty. Furthermore, the interval for the composite variable is entirely positive and does not include zero, confirming that this variable has a significant, positive effect on CcOD. Overall, the model represents a reasonably good fit, although a significant portion of the variability (40%) remains unexplained. Visually, the fit is good for average odour concentrations, while the model is much less accurate at very high and very low odour concentrations. The results are shown in Figure 12.
Since only a moderate correlation was obtained with the literature model, a linear model is also investigated. Using the least squares method, the potential correlation between CcOD and the sum of the variables VH2S and VC6H6, weighted by the respective experimental coefficients k3 and k4 and with intercept k0, is identified. The following linear model is investigated with Equation (7):
Table 7 reports the coefficient values of the least squares model. The model indicates that both VH2S and VC6H6 are statistically significant predictors of the dependent variable CcOD.
The coefficient k0 has a standard error of 2.19 × 10, k3 has a standard error of 4.31 × 10^−2, and k4 has a higher standard error of 5.83 × 10^3. Despite being higher, the standard error of k4 remains small relative to its estimate, indicating good precision across coefficients. All coefficients have very high absolute t-values (k0 = −63.3, k3 = 61.2, k4 = 34.1), suggesting that each is highly significant and contributes strongly to the model, k0 and k3 in particular. The p-values associated with the coefficients are very small (<2 × 10^−16), indicating strong evidence against the null hypothesis that the coefficients are equal to zero. The 95% confidence intervals provide insight into the precision of the coefficient estimates. For the intercept, the interval ranges from −1.37 × 10^3 to −1.29 × 10^3, a relatively narrow range that indicates a high level of precision. The coefficient k3 is similarly precise, with a confidence interval between 2.46 and 2.63. In contrast, the coefficient k4 shows a wider confidence interval, spanning from 1.8 × 10^5 to 2.1 × 10^5, suggesting greater variability in this estimate; despite this broader interval, the estimate remains significantly positive. These confidence intervals underline the robustness of the estimates, especially for k3, while highlighting some degree of uncertainty for k4.
The R-squared value is 0.7, indicating that approximately 70% of the variability in the dependent variable is explained by the independent variables in the model, as is evident in Figure 13.
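A least squares fit of this kind can be sketched with NumPy. Equation (7) is not reproduced in this text, so the sketch assumes the form CcOD = k0 + k3·VH2S + k4·VC6H6, with k3 paired with VH2S and k4 with VC6H6; the sensor readings and coefficients below are hypothetical and are not the values reported in Table 7.

```python
import numpy as np


def fit_linear_model(v_h2s, v_c6h6, c_od):
    """Least squares fit of c_od = k0 + k3 * v_h2s + k4 * v_c6h6; returns (coeffs, r2)."""
    v_h2s = np.asarray(v_h2s, dtype=float)
    v_c6h6 = np.asarray(v_c6h6, dtype=float)
    c_od = np.asarray(c_od, dtype=float)
    # Design matrix: a column of ones for the intercept k0, then the two predictors.
    design = np.column_stack([np.ones_like(v_h2s), v_h2s, v_c6h6])
    coeffs, *_ = np.linalg.lstsq(design, c_od, rcond=None)
    residuals = c_od - design @ coeffs
    ss_res = float(residuals @ residuals)
    ss_tot = float(np.sum((c_od - c_od.mean()) ** 2))
    return coeffs, 1.0 - ss_res / ss_tot


# Hypothetical readings and coefficients, used only to check that the fit recovers them.
v_h2s = [10.0, 20.0, 35.0, 50.0, 80.0, 65.0]
v_c6h6 = [0.10, 0.40, 0.20, 0.60, 0.30, 0.50]
true_k0, true_k3, true_k4 = -50.0, 2.5, 100.0
c_od = [true_k0 + true_k3 * h + true_k4 * b for h, b in zip(v_h2s, v_c6h6)]

coeffs, r2 = fit_linear_model(v_h2s, v_c6h6, c_od)
print("k0, k3, k4 =", np.round(coeffs, 3), "R2 =", round(r2, 3))
```

Because the synthetic data are noise-free, the fit returns the generating coefficients and an R2 of 1; with field data the residual scatter would lower R2, as with the value of 0.7 reported above.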
In summary, the model suggests that both the VH2S and VC6H6 variables have a significant impact on the dependent variable, and the least squares model predicts the outcome variable better than the literature model (Table 8). The Literature Model has an R2 of 0.6, indicating that it explains 60% of the variance in the data, which suggests a moderate fit. Its Mean Squared Error (MSE) is 750.8 × 10^3, and the Root Mean Squared Error (RMSE) is 866.5, showing significant deviations between predicted and actual values. The Mean Absolute Error (MAE) is 590.9, reflecting limited precision. In contrast, the Least Squares Model performs better, with an R2 of 0.7, explaining 70% of the variance. It has a lower MSE of 612.8 × 10^3 and RMSE of 782.8, indicating improved accuracy. The MAE of 515.0 also suggests greater reliability. However, there is still potential for further improvement, such as incorporating additional variables and refining the calibration of the electronic nose.
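The four metrics used in this comparison can be computed as in the sketch below. The function name and the sample values are hypothetical; R2 is computed here as one minus the ratio of the residual to the total sum of squares, which is an assumption about how the reported values were obtained.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, MSE, RMSE and MAE for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),  # same unit as the observations
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Hypothetical odour concentrations in ouE/m3 (not study data).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Since RMSE is the square root of MSE, the two must be read on consistent scales: the reported RMSE values of 866.5 and 782.8 correspond to MSE values of about 750.8 × 10^3 and 612.8 × 10^3.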
4. Conclusions
The analysis aimed to assess the reliability of data collected by an electronic nose near a wastewater treatment plant and to investigate the correlation between chemical concentration sensor readings and odour concentration values. This approach supports the validation of odour concentration measurements conducted during environmental monitoring with an electronic nose, aiding in the development of an accurate characterisation strategy. A good strategy for selecting relevant odour data is to focus exclusively on measurements taken under wind conditions suited to the relative positions of the sedimentation tank and the electronic nose.
The analysis established that the variables VH2S and VC6H6 significantly influence odour concentration, underscoring their crucial role in odour perception in this environment. Despite the challenges associated with overfitting in machine learning, the Random Forest approach with three different models (RF0, RF1, RF2) provided useful predictive performance, particularly with the RF1 model, which avoids overlap between training and prediction data: the R2 value for RF1 is 0.7, a correlation that can be considered acceptable for environmental data. All RF models consistently identified VH2S and VC6H6 as the most important variables for predicting odour concentration, highlighting the significance of H2S and benzene levels in determining odour intensity. VPID had a lesser impact, while VEC had negligible importance in model construction.
The analysis revealed a significant correlation between CcOD and the readings of certain chemical sensors, particularly VH2S and VC6H6. These variables exhibited a more pronounced influence on odour concentration compared to VEC and VPID, indicating the significance of specific compounds contributing to the overall odour perception.
Further supporting the significant influence of VH2S and VC6H6 on odour concentration, linear regression analysis identified both variables as statistically significant predictors of odour concentration, with strong evidence against the null hypothesis. The best correlation was achieved not with the literature model but with the sum of the concentrations, each multiplied by a coefficient. The multiple linear regression model yielded a coefficient of determination of 0.70, indicating that approximately 70% of the variability in odour concentration can be explained by the independent variables. The statistically significant F-statistic and the relatively small standard errors of the model coefficients underscore the overall significance of the regression model. The hybrid approach, combining the Random Forest analysis with the least squares model, can offer a valuable balance between model complexity and interpretability. It leverages the predictive capabilities of Random Forest while also providing a clearer understanding of the factors influencing the outcome through the linear model.
The study has several limitations, including relatively high prediction errors and difficulties in modelling very high and very low odour concentration values. This highlights the inherent difficulty of training the electronic nose and the need to account for the relationship between odour concentration and odour intensity; the latter is closely related to the sensory experience generated by the odour stimulus in the olfactory system and follows a logarithmic pattern, while human odour perception is extremely complex and not yet fully understood. Furthermore, correlation proved challenging for odour concentrations below 1000 ouE/m3, reflecting reduced reliability of the data provided by the electronic nose in this range and the complexity of correlating chemical concentrations with perceived odour intensity. Despite these challenges, the electronic nose system plays an important role in the effective management of emergencies associated with industrial odour nuisance, as it helps to recognise which of the detected compounds are primarily responsible for the odour nuisance event. Future research should aim to extend the duration of the dataset and include additional variables to improve the model’s accuracy and reliability. It will also be necessary to investigate the use of specific chemical sensors tailored to the type of odour nuisance and to conduct a comparison with dynamic olfactometry for thorough field validation. Overall, this study demonstrates the usefulness of electronic noses in environmental monitoring and the need for specific standardised procedures for training the system on odour-causing compounds.