Prediction of Complex Odor from Pig Barn Using Machine Learning and Identifying the Influence of Variables Using Explainable Artificial Intelligence

Lee, Do-Hyun; Lee, Sang-Hun; Woo, Saem-Ee; Jung, Min-Woong; Kim, Do-yun; Heo, Tae-Young

doi:10.3390/app122412943


by Do-Hyun Lee ¹,†, Sang-Hun Lee ¹,†, Saem-Ee Woo ²,†, Min-Woong Jung ², Do-yun Kim ¹ and Tae-Young Heo ¹,*

¹ Department of Information & Statistics, Chungbuk National University, Cheongju-si 28644, Republic of Korea

² Animal Environment Division, National Institute of Animal Science, RDA, 1500, Kongjwipatjwi-ro, Iseo-myeon, Wanju-gun 55365, Republic of Korea

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2022, 12(24), 12943; https://doi.org/10.3390/app122412943

Submission received: 14 November 2022 / Revised: 9 December 2022 / Accepted: 9 December 2022 / Published: 16 December 2022


Abstract:

Odor is a serious problem worldwide, and odor prediction research has therefore been conducted consistently to help prevent it. The odor substances that make up complex odors are known, but complex odors do not depend linearly on those substances. In addition, the relationships among odor substances, such as synergy and antagonism, differ depending on their combination. Research is needed to clarify these relationships, but it remains insufficient. Therefore, this study took a data-based approach. The complex odor was predicted using various machine learning methods, and the effect of odor substances on the complex odor was verified using an explainable artificial intelligence method. In this study, according to the Malodor Prevention Act in Korea, complex odors are divided into two categories: acceptable and unacceptable. Analysis of variance and correlation analysis were used to determine the relationships between variables. Six machine learning methods (k-nearest neighbor, support vector classification, random forest, extremely randomized tree, eXtreme gradient boosting, and light gradient boosting machine) were used as predictive classification models, and the best predictive method was chosen using various evaluation metrics. As a result, the support vector machine, which performed best in five out of six evaluation metrics, was selected as the best model (f1-score = 0.7722, accuracy = 0.8101, sensitivity = 0.7372, specificity = 0.8656, positive predictive value = 0.8196, and negative predictive value = 0.8049). In addition, the partial dependence plot method from explainable artificial intelligence was used to understand the influence and interaction effects of odor substances.

Keywords:

odor substance; odor prediction; machine learning; explainable artificial intelligence

1. Introduction

Over the past few decades, both meat consumption and livestock numbers have grown worldwide. Consequently, the resulting tremendous increases in livestock sheds have caused significant environmental problems. In particular, odor is the most important problem related to livestock sheds, causing significant damage to not only farmers but also residents of the surrounding area. Therefore, many countries have implemented odor prevention policies to prevent odor problems, and research is currently underway in various fields [1,2]. Furthermore, several studies have been conducted to solve this odor problem. For example, researchers created a real-time monitoring electronic nose (e-Nose) system using the backpropagation method to classify unpleasant odors around cattle ranches [3], while Yan et al. (2020) [4] generated predictive models for binary ester, aldehyde, and aromatic hydrocarbon mixtures using a support vector regression algorithm. In addition, another study attempted to statistically solve the problem using a spatial analysis system [5].

Rincón et al. (2019) [6] proposed odor activity value analysis and partial least squares regression to estimate odor concentration. This tool-based approach was found to reduce the cost of odor monitoring through dynamic olfactory measurements. Barczak et al. (2022) [7] collected 56 biosolid samples from two wastewater treatment plants located in Sydney, Australia, and examined the material discharged from these samples using analytical and sensory methods, including olfactory detection ports and dynamic olfactory measurements. Cangialosi et al. (2021) [8] suggested an instrumental odor monitoring system that uses machine learning models to classify odor sources and quantify odor concentrations. Kang et al. (2020) [9] used an artificial neural network to predict odor concentrations in sewage treatment plants. Mulrow et al. (2020) [10] trained meteorological, operational, and H₂S sensor data for three days and developed an advanced warning predictor for local odor complaints.

Likewise, research has been widely conducted to predict or identify the causative substances of odors. In particular, data-based research using machine learning is being actively conducted. Several studies have also examined the influence of, or relationships between, causative substances. Zhu et al. (2019) [11] used machine learning methods to develop predictive models for the yield and carbon content of biochar (C-char) based on pyrolysis data of lignocellulosic biomass. Moreover, a partial dependence plot (PDP) was applied to identify the interactions between factors. Qi et al. (2022) [12] established a random forest regression model optimized with an artificial bee colony to rapidly screen coal fly ash. Furthermore, after the correlation between the chemical composition and amorphous phase was determined, the relationships between variables were interpreted using the PDP and Shapley methods.

One of the main purposes of this study was to supplement and develop the results of our previous studies. To produce more varied and meaningful results, we used explainable artificial intelligence (XAI) methods. Many previous studies have used XAI to interpret the results of chemical composition analyses. Wojtuch et al. (2021) [13] developed a methodology to display the structural contributors that have the greatest influence on a specific model output for the metabolic stability of chemical properties and examined the contribution of chemical components to the analytical model results using Shapley additive explanations (SHAP). Chakkingal et al. (2022) [14] determined the contributions of W/FCO, H₂/CO, and temperature to methane yield by observing the coupled impact of these factors with PDPs and SHAP analysis. Grimmig et al. (2021) [15] applied machine learning to the analysis of engine oil components used in industrial engines to test the effects of various physicochemical parameters on the performance improvement of combustion engines. As described above, chemical composition analyses based on machine learning have been conducted in various fields, and XAI has been used to interpret the results.

As such, studies using explainable AI have been conducted for various chemical substances beyond odor analysis. However, research on the relationships between the substances that constitute odors is relatively scarce. Therefore, in this study, data-based predictive analysis using machine learning was performed to predict odors, and we investigated how and to what extent odor substances affect odors. Whether an odor was acceptable was determined according to the Malodor Prevention Act of the Republic of Korea. Currently, this act regulates the concentration of complex odors measured by olfactory measurement and 17 designated odor substances measured by instrumental analysis. We divided the odor classes in each case based on this law.

A complex odor increases or decreases depending on the interaction between chemical compounds. In general, since the relationship between chemical compounds is not linear, the value of the complex odor may vary depending on the combination of compounds. In particular, since chemical compounds can affect complex odors depending on their combination, research is needed to find causal relationships, such as synergy or antagonism [16]. In this study, analysis using XAI’s PDP technique was performed to understand the relationship between various chemical compounds.

In this study, we used 212 sample observations. The objective variable was complex odor, which is a measurement standard for emission allowance. Complex odor refers to an odor that is a mixture of two or more odor substances, and the measured complex odor is classified into one of two categories according to the Malodor Prevention Act: one that is allowed to be discharged or one that cannot be discharged [17]. The explanatory variables were 15 odor substances, seasonal variables derived from the measurement time, and odor measurement locations (inside of the pig barn, outside of the pig barn, and site boundaries). We replaced the missing values in the data using the multivariate imputation technique [18]. Before generating a predictive model, we performed variance inflation factor (VIF) analysis, correlation analysis, and analysis of variance (ANOVA). For the classification model, we used the following machine learning techniques as predictive models: k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), extremely randomized tree (Extra-Trees), eXtreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Accuracy, f1-score, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), which are mainly used in classification models, were used as model evaluation metrics. As a result, the SVM model was found to be the best model.

Finally, we applied explainable artificial intelligence (XAI) techniques, which are used to interpret machine learning results, to the predictive model selected through the above process. Among several XAI techniques, we used the PDP technique, which can be used when a dependency relationship exists between variables, to analyze the effects of odor substances on complex odors and the interactions between odor substances.

2. Materials and Methods

2.1. Materials

2.1.1. Study Area

Odor compounds were collected from the interiors of finishing barns on 10 pig farms of various sizes in Korea. Samples were gathered between April and November 2018 from a total of 57 pig barns with breeding densities ranging from 0.75 to 1.1 heads/m². Forced ventilation was applied to eight pig barns, and winch curtains and forced ventilation were applied in parallel to three pig barns. Regarding floor type, five barns had partial slats (slat ratios varied from 25 to 70%), two had manure scraper systems, and one had full slats. Figure 1 shows a diagram of the pig barns described above.

2.1.2. Data Sampling

Pig odor is the result of anaerobic fermentation in livestock manure, which is caused by long-term storage and incomplete digestion of nutrients in feed [19,20]. Since this odor is caused by a mixed gas in which several substances exist in combination, it is not appropriate to evaluate it as a single substance. Thus, we analyzed complex odors, and 15 odor species as follows: ammonia (NH₃), hydrogen sulfide (H₂S), methyl mercaptan (MM), dimethyl sulfide (DMS), dimethyl disulfide (DMDS), acetic acid (ACA), propionic acid (PPA), butyric acid (BTA), isobutyric acid (IBA), valeric acid (VLA), isovaleric acid (IVA), phenol (Ph), para-cresol (pC), indole (ID), and skatole (SK).

Complex odor was calculated using the air dilution olfactory method of the standard odor test, and ammonia was measured using an ultraviolet-visible spectrophotometer (UV-2700, Shimadzu, Japan) at a wavelength of 640 nm. A GC with a pulsed-flame photometric detector (PFPD, 456-GC, Scion Instruments, The Netherlands) outfitted with a CP-Sil 5CB column and a thermal desorption (TD, Unity + air server, Markes, UK) system with a cold trap was used to examine sulfur compounds simultaneously. Oven temperature was maintained at 60 °C for 3 min, after which the temperature was increased to 160 °C at 8 °C/min, then maintained at that state for 9 min. All samples were acquired at least three times, and each experiment was repeated three times for each condition [21].

Using a three-bed tube (Carbopack C:Carbopack B:Carbopack X, 1:1:1), volatile organic compounds (VOCs), volatile fatty acids (VFAs), phenols, and indoles were sampled at 0.1 L/min for 5 min. The analysis system used a GC/MSD connected to a thermal desorption (TD) unit. For the TD (Unity + air server, Markes, UK) cold trap condition, split flow was 10:1, flow path temperature was maintained at 150 °C, and the trap temperature was raised from a low of 5 °C to a high of 300 °C. A GC (6890N/5973N, Agilent, Santa Clara, CA, USA) fitted with a CP-Wax52CB column (60 m × 0.25 mm × 0.25 µm) and an MSD was used to analyze VOCs, VFAs, phenols, and indoles. Oven temperature was 45 °C for the first 5 min, then increased to 250 °C at a rate of 5 °C/min, and was then maintained at 250 °C for 4 min. The ion source temperature was maintained at 230 °C [19]. Table 1 and Table 2 summarize the conditions of the odor substance analysis device and the collected complex odors and odor substances, respectively.

This study used a total of 212 data points. According to the Malodor Prevention Act, a complex odor is classified as a dischargeable odor (discharge, 0) if the complex odor value is 500 or less (15 or less at the site boundary), and as a non-dischargeable odor (no discharge, 1) if the value is greater. As a result, 101 observations were classified as discharge and 111 as no discharge. The explanatory variables included 15 different odor substances, which were continuous variables, and the odor measurement location and season variables were used as categorical variables. As shown in Table 2, the location variables were categorized as In, Out, and Boundary, and the season variables were categorized as spring (March–May), summer (June–August), fall (September–November), and winter (December–February).
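The labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the sample values are hypothetical, and only the two thresholds (500 in general, 15 at the site boundary) come from the text.

```python
# Thresholds taken from the text; everything else is illustrative.
GENERAL_LIMIT = 500    # In and Out locations
BOUNDARY_LIMIT = 15    # site boundary

def label_complex_odor(value, location):
    """Return 0 (discharge) or 1 (no discharge) for one measurement."""
    limit = BOUNDARY_LIMIT if location == "Boundary" else GENERAL_LIMIT
    return 0 if value <= limit else 1

# Hypothetical measurements: (complex odor value, location).
samples = [(300, "In"), (700, "Out"), (15, "Boundary"), (20, "Boundary")]
labels = [label_complex_odor(v, loc) for v, loc in samples]
print(labels)  # [0, 1, 0, 1]
```

Values equal to the limit are treated as dischargeable, which matches the "or less" wording of the rule.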

2.2. Methods

In this study, we created a predictive classification model that predicts binary-classified complex odors and used XAI to investigate the influence of odor substances. The study process consists of data preprocessing, optimal model selection, and analysis of odor substances using XAI. In the data preprocessing process, missing values were replaced through multiple imputation, and relationships between variables were identified through ANOVA and correlation analysis. Next, the process of selecting the optimal model compared six machine learning models, and the process of dividing and analyzing the data at a ratio of seven to three was repeated 30 times, and the results were evaluated with six metrics. Lastly, the influence of odor substances was identified by using the PDP of the XAI. Figure 2 shows the workflow of this study, and this section introduces the methods used in the study.

2.2.1. Preprocessing

Before generating the predictive model, several preprocessing steps were performed. First, the missing values were imputed using the multiple imputation method. Additionally, for continuous variables among the explanatory variables, correlation analysis was used to determine the relationships between the variables. For categorical variables, ANOVAs were performed to determine whether the mean differences by category were statistically significant.

Multiple Imputation

Multiple imputation is a method that applies several univariate imputations to find the optimal imputed value [22]. In the present study, after generating m imputed datasets by applying multiple univariate imputations to incomplete data, the estimator and standard error were calculated using the m imputed datasets.
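The text does not state how the estimates from the m imputed datasets were combined. One common choice is Rubin's rules, sketched below under that assumption; the function name and the example numbers are hypothetical.

```python
import math
from statistics import mean, variance

def pool_estimates(estimates, std_errors):
    """Combine m per-dataset estimates with Rubin's rules (assumed here)."""
    m = len(estimates)
    pooled = mean(estimates)                       # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between         # total variance
    return pooled, math.sqrt(total)

# Hypothetical estimates and standard errors from m = 3 imputed datasets.
pooled, pooled_se = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled standard error grows with the spread of the estimates across imputations, which is how the extra uncertainty caused by missing values enters the result.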

Analysis of Variance and Correlation Analysis

ANOVA is a widely used statistical method for group comparisons. It determines whether a difference exists between group means by comparing the variance between the groups with the variance within the groups using an F-test. In this study, we used an ANOVA to verify whether a difference existed in the average amount of complex odor production by category. Correlation analysis [23] identifies the relationship between two or more quantitative variables. The correlation coefficient, the result of the correlation analysis, ranges from −1 to +1. A correlation coefficient of +1 indicates that the two variables have a perfectly positive linear relationship, −1 indicates that they have a perfectly negative linear relationship, and 0 indicates that they have no linear relationship.
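As a rough illustration of the two statistics, the sketch below computes the Pearson correlation coefficient and the F statistic that a one-way ANOVA is based on. The function names and data are hypothetical, and the p-value lookup that a full ANOVA requires is omitted.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

r = pearson_r([1, 2, 3], [2, 4, 6])               # perfectly positive linear relationship
f = anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])    # three hypothetical categories
```

A larger F value means the category means differ more than the scatter within categories would suggest.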

2.2.2. Classification Model

To select the best predictive classification model, six machine learning models were compared. Furthermore, to improve the performance of the model, the k-fold cross-validation method, which improves the generalization performance of the model, and the grid search (GS) method, which finds the optimal hyperparameter, were used.

K-Nearest Neighbor

KNN [24] is a method for allocating the most frequent class among the k instances closest to instance x. It uses a distance metric to determine the nearest instance, and the Euclidean distance is primarily used [25].
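A minimal sketch of this voting rule, assuming Euclidean distance and a small hypothetical training set (the function name and data are illustrative only):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Assign x the most frequent class among its k nearest training instances."""
    # Sort training instances by Euclidean distance to x.
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature data with classes 0 and 1.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
pred_a = knn_predict(train_X, train_y, [0.2, 0.1])
pred_b = knn_predict(train_X, train_y, [5.5, 5.5])
```

Because the vote depends on raw distances, features on different scales should be standardized first.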

Support Vector Machine

The support vector machine (SVM) method is a technique used to create a boundary between points displayed in a multidimensional space [26]. It aims to find a boundary, called a hyperplane, that optimally and linearly separates the data after they are transformed into a high-dimensional space using nonlinear mapping. SVMs can be used to learn models for almost any purpose, including classification and numerical prediction. In this study, a hyperplane was used as a classifier to determine whether a complex odor belonged to the discharge or no discharge class. The goal is to find the maximum-margin hyperplane, that is, the hyperplane that separates the two groups with the largest possible distance between them. This maximum possible margin could provide high predictive accuracy.

Random Forest

RF [27] is a method that extends the decision tree method, which expresses decision-making rules in the form of a tree, with the idea of bagging [28]. To improve generalization performance, RF combines the results of multiple trees, each generated from randomly selected variables, as an ensemble, which mitigates the overfitting and large variance of individual trees. In the ensemble process, averaging is primarily used for regression, and voting is used for classification.

Extremely Randomized Trees

Extra-trees [29] is a tree-based ensemble method that applies the ensemble method to unpruned trees according to the classical top-down procedure. In contrast to other tree-based ensemble methods, the cutoff, which is a criterion for splitting nodes, is selected completely at random, and the entire sample is used instead of a portion of the sample selected using the bootstrap method.

eXtreme Gradient Boosting

XGBoost [30] is a boosting technique, in which each model affects the next one, unlike bagging, which uses several models that are independent of each other. It implements gradient boosting [31], a representative boosting technique, with parallel processing to reduce the amount of computation and increase computation speed. It has the advantage of preventing overfitting and handling sparse data.

Light Gradient Boosting Machine

LightGBM [32] uses two techniques, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), to improve the efficiency and scalability of the gradient-boosting decision tree (GBDT). LightGBM achieves almost the same performance as the existing GBDT while training more than 20 times faster, using GOSS to handle large numbers of data instances and EFB to handle large numbers of variables. Moreover, with these two techniques, it can achieve much better performance than XGBoost in terms of calculation speed and memory usage.

K-Fold Cross-Validation

K-fold cross-validation [33] divides the dataset into k mutually exclusive folds of approximately the same size, trains the model with k - 1 folds of data, and evaluates the model’s performance with the other fold. K-fold cross-validation has a total of k evaluation results, the average of which is used as the final evaluation result.
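The fold construction can be sketched as follows. This is an illustration with hypothetical function names; `score_fn` stands in for any routine that trains on the training indices and returns a score on the test indices.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

def cross_validate(score_fn, n, k=5, seed=0):
    """Average the k evaluation results into one final score."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / k

splits = list(k_fold_indices(10, 5))
# Placeholder score: the share of samples held out in each fold.
avg = cross_validate(lambda train, test: len(test) / (len(train) + len(test)), 10, k=5)
```

Every observation is used for testing exactly once, which is what makes the averaged score less dependent on a single split.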

2.2.3. Performance Evaluation Metrics

Using six evaluation metrics (accuracy, precision, specificity, recall, NPV, and f1-score), we evaluated the performance of the classification model. The application of these various evaluation metrics could provide a more detailed explanation of model performance [34].

In the present study, accuracy refers to the ratio of correctly classified cases to all classified cases. In the confusion matrix table (Table 3), accuracy can be calculated by dividing the sum of TP and TN by the sum of all cases in the confusion matrix. Precision, which is also known as the PPV, is an evaluation metric for measuring positive predictive performance. It can be measured by dividing the TP value by the sum of TP and FP. Recall, which is also known as sensitivity, represents how many actual positive cases are classified as positive and is the ratio of TP cases to the sum of TP and FN. Specificity is the ratio of TN cases to the sum of FP and TN cases. NPV is the ratio of TN cases to the sum of FN and TN cases and shows the model’s performance in classifying TN cases among predicted negative cases. The f1-score combines precision and recall into one metric value: it is the harmonic mean of precision and recall and therefore considers both.
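For reference, the six metrics can be written in terms of the four confusion matrix counts as in the sketch below. The function name and the counts are hypothetical and are not taken from Table 3 or Table 9.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the six evaluation metrics from confusion matrix counts."""
    precision = tp / (tp + fp)   # PPV
    recall = tp / (tp + fn)      # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
```

With these counts the accuracy is 0.7, precision 0.75, recall 0.6, and specificity 0.8, which shows how a single model can look quite different depending on the metric.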

2.2.4. Explainable Artificial Intelligence

Complex odors were classified based on whether emission was allowed. However, machine learning results generally do not provide specific information about how the classification output was reached. Therefore, we used an XAI method to obtain detailed information regarding the classification results. In this study, we used the PDP as the XAI method. Although several other XAI techniques are available, their results can be affected when correlations exist between variables. Since the PDP was free from such assumptions, it was used as the XAI method for the final analysis results.

PDP is an explainable AI method that can interpret the relationship between a variable and a model’s predictive values. Moreover, it was designed to quantify the contribution to the classification result of the model according to the change in the value of the variable. In this study, the PDP results for the main variable were visualized to facilitate an understanding of the effects of the change in the variables on the classification results of the model. The PDP formula is as follows:

\hat{f}_{x_{v}} (x_{v}) = E_{x_{m}} [\hat{f} (x_{v}, x_{m})] = \int \hat{f} (x_{v}, x_{m}) \, d P (x_{m})

(1)

Here, x_{v} is the variable for which the partial dependence function is visualized, and x_{m} denotes the other variables used in the machine learning analysis. x_{v} usually consists of one or two variables. The variable vectors x_{v} and x_{m} together make up the entire variable space X. To obtain the influence of x_{v}, the model output is averaged over the observed values of x_{m} while x_{v} is held at a fixed value, so that the resulting function shows how the model's prediction changes with x_{v}.
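In practice, partial dependence is usually estimated by fixing the variable of interest at each grid value and averaging the model's predictions over the observed values of the other variables. The sketch below illustrates that estimate with a hypothetical toy model rather than the SVM used in this study; all names and numbers are assumptions for illustration.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction over the data with one feature fixed at each grid value."""
    pd_values = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value   # overwrite the variable of interest
            total += predict(modified)  # other variables keep their observed values
        pd_values.append(total / len(X))
    return pd_values

def toy_model(row):
    # Hypothetical model with an interaction between the two features.
    return 2.0 * row[0] + row[0] * row[1]

X = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
pd_curve = partial_dependence(toy_model, X, feature=0, grid=[0.0, 1.0, 2.0])
print(pd_curve)  # [0.0, 5.0, 10.0]
```

Because the second feature has a mean of 3 in this toy data, the averaged curve is 5 times the grid value, which is the marginal effect the plot would display.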

3. Results

3.1. Preprocessing

In this study, preprocessing was performed before generating the predictive model. First, the missing values in the data were replaced using the multiple imputation method. To match the different units of the explanatory variables, standardization was performed by setting the mean of each variable to 0 and the standard deviation to 1. In addition, to understand the relationship between the response and the explanatory variables, correlation analysis and a multicollinearity check were performed for the continuous explanatory variables, and an ANOVA was performed for the categorical variables.
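The standardization step can be sketched as below. Whether the population or the sample standard deviation was used is not stated in the text; the population version is assumed here, and the values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population SD assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical concentrations of one odor substance.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

After this step every variable is expressed in units of its own standard deviation, so distance-based models such as KNN and SVM are not dominated by variables with large raw values.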

3.1.1. Missing Imputation

The missing values in the data were found only in the odor substance variables and were of two types: not available (NA), caused by a malfunction of the sensor that measures the odor, and not detected (ND), caused by the sensor’s detection limit. There were a total of 113 missing values, with 90 NAs and 23 NDs, and their distribution is shown in Figure 3. In this study, both types of missing values were replaced using the same method. Following Lee et al. (2022) [18], the multiple imputation technique was used instead of removing missing values.

Table 4 shows the summary statistics of the continuous variables after replacing the missing values. Standardization was then applied to the imputed data to match the units across variables.

3.1.2. Correlation Analysis

A correlation analysis was performed to analyze the relationship between the response and the continuous explanatory variables. The odor substances having a high correlation with the complex odor were IBA (0.583) and ACA (0.504), and the correlation coefficient of other odor substances was found to be less than 0.5 (Table 5). As shown on the right side of Figure 4, it can be seen that the substances belonging to the same class (sulfur compounds or VOCs) among odor substances have high correlation values. Accordingly, VIF analysis was applied to determine whether an independent relationship between explanatory variables could be assumed.

VIF [35] is a criterion for judging whether a multicollinearity problem is present. An explanatory variable is usually judged to have a multicollinearity problem if its VIF is greater than 10. In Equation (2), {VIF}_{i} is the VIF value of the i-th variable and r_{i} is the R-squared value obtained when the i-th variable is regressed on the remaining explanatory variables.

{VIF}_{i} = \frac{1}{1 - r_{i}}

(2)
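As a small worked example of Equation (2), the sketch below uses the special case of only two explanatory variables, where the R-squared value equals the squared Pearson correlation between them. The function names and data are hypothetical; with more variables, r_{i} would come from a multiple regression instead.

```python
from statistics import mean

def r_squared_two_vars(x, y):
    """R-squared of regressing x on y alone (the squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def vif(r_squared):
    """Variance inflation factor from an R-squared value, as in Equation (2)."""
    return 1.0 / (1.0 - r_squared)

# Two hypothetical, almost collinear explanatory variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
vif_x1 = vif(r_squared_two_vars(x1, x2))
```

An R-squared value of 0.9 corresponds to a VIF of 10, so the usual cutoff flags variables that are at least 90% explained by the others.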

We found that the VIF values of the four variables (PPA, BTA, IBA, and VLA) exceeded 10, indicating multicollinearity. This means that a linear relationship exists between the explanatory variables; thus, a linear model would not be suitable as a predictive model. To use a linear model, techniques, such as variable selection or variable extraction, which remove correlations between explanatory variables, should be used. However, in this case, the effects of all the odor substances examined in this study on the complex odor could not be determined because the shape of the original variable was changed or removed. Therefore, the study was conducted using machine learning, a nonlinear modeling method, rather than a linear model.

3.1.3. Analysis of Variance

The location and season variables are categorical variables. First, the measurement location variables were created using the one-hot encoding method (a method for expressing categorical variables as 0 and 1) to generate in-pig barn (In), out-of-pig barn (Out), and boundary (Boundary) variables. The season variables were divided into four seasons: spring (March–May), summer (June–August), fall (September–November), and winter (December–February), and then one-hot encoding was applied, as shown in Figure 5. However, no data were present for winter; therefore, the winter variable was removed.
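One-hot encoding of the location variable can be sketched as follows; the function name and the sample values are hypothetical.

```python
def one_hot(values, categories):
    """Express each categorical value as a row of 0/1 indicators, one per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical location labels for four observations.
locations = ["In", "Out", "Boundary", "In"]
encoded = one_hot(locations, ["In", "Out", "Boundary"])
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

A category with no observations, such as winter in this dataset, would produce an all-zero column, which is why dropping it loses no information.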

An ANOVA was performed to determine whether the average amount of occurrence differed across the categories of each categorical variable generated in this way, and thus whether each categorical variable was meaningful as a variable. In Table 6, for variables with significant differences in mean, the letters (A, B, and C) attached to the numbers group the means of the categories. For example, in the case of hydrogen sulfide, there is no difference between the categories In and Out because the averages of the two categories are in the same group. On the other hand, since Boundary is in a different group from In and Out, its difference from them is significant. As shown in Table 6, four variables (DMS, DMDS, ID, and SK) showed no difference in the average amount of occurrence per location. For the remaining variables, a difference was observed in the average amount of occurrence per location (p-value < 0.05).

Regarding the average amount of occurrence by location, In was the highest for complex odor, NH₃, ACA, PPA, BTA, IVA, VLA, and p-C, and the amount decreased as the distance increased. H₂S and IBA were similar for In and Out, but decreased for Boundary. Finally, MM and Ph were high and at a similar level for In, Out, and Boundary.

As Table 7 shows, no differences were observed in the average amount of occurrence per season across seven variables (complex odor, NH₃, H₂S, DMDS, p-C, ID, and SK) for the measurement time variable (season). For the nine variables excluding these, the average generation differed by season (p-value < 0.05).

Regarding the average occurrence amount by season, a difference was observed only in the spring for DMS, ACA, PPA, BTA, and IVA. In all cases, the average occurrence was highest in the spring. VLA showed a difference in the average occurrence only in the fall, during which the amount was the lowest. For MM, the difference in the average occurrence amount was significant in summer and fall, and the average occurrence amount was the largest in summer. Finally, for IBA and Ph, the difference in average occurrence was significant only in spring and fall, and the average occurrence amount was the largest in spring. As such, for both location and season, the difference in average occurrence by category was found to be significant for most variables. These can be considered suitable for use as variables.

3.2. Predictive Classification Model

To predict binary-classified complex odors based on the Malodor Prevention Act, the following six machine learning models were compared: KNN, SVM, RF, Extra-Trees, XGBoost, and LightGBM. In order to avoid overfitting problems, the analysis process was repeated 30 times by dividing the training data and test data in a ratio of seven to three. The optimal parameters were calculated using five-fold CV and GS, and Table 8 shows the parameter ranges and parameter values for the best f1-score for each model.
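The repeated seven-to-three split can be sketched as below. This is an illustration only: `evaluate` is a hypothetical callback that would fit a model on the training indices (including any cross-validated grid search) and return a score on the test indices, and the function name and seed handling are assumptions.

```python
import random
from statistics import mean, stdev

def repeated_holdout(evaluate, n_samples, n_repeats=30, test_ratio=0.3, seed=0):
    """Repeat a random train/test split and summarize the resulting scores."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_ratio)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                        # a new random split each repeat
        test, train = idx[:n_test], idx[n_test:]
        scores.append(evaluate(train, test))
    return mean(scores), stdev(scores)

# Placeholder evaluation that only records the split sizes.
sizes = []
def record_sizes(train, test):
    sizes.append((len(train), len(test)))
    return 1.0

avg_score, sd_score = repeated_holdout(record_sizes, n_samples=212)
```

Reporting the mean and standard deviation over the repeats shows how much a score depends on the particular split rather than on the model.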

Accuracy, f1-score, sensitivity, specificity, PPV, and NPV were used to select the optimal predictive model among the six possibilities. In Table 9, the score for each model is the average of the 30 analysis results, and the value in parentheses indicates the standard deviation. SVM showed the best performance in five out of six metrics (f1-score, accuracy, specificity, PPV, and NPV).

In addition to the above evaluation metrics, model performance was also confirmed using three other measures. First, the receiver operating characteristic (ROC) curve shows performance at each threshold using the true positive rate (TPR) and false positive rate (FPR). Second, the area under the ROC curve (AUC) takes a value between 0 and 1, and the closer it is to 1, the better the model. Finally, the Matthews correlation coefficient (MCC), a correlation coefficient used in binary classification, takes a value between −1 and 1, and the closer it is to 1, the better the model.
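The two summary values can be computed as in the sketch below. The AUC is obtained here through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case), which is equivalent to the area under the ROC curve. The function names and data are hypothetical.

```python
import math

def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical labels, scores, and counts for illustration only.
auc_value = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
mcc_value = mcc(tp=30, fp=10, fn=20, tn=40)
```

In this example three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.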

Figure 6 shows the ROC curves generated using the parameters with the best f1-score for each model; the AUC and MCC values are given in the legend. SVM showed the best performance on all three of these measures.

In the selected predictive model (SVM), important variables among the 15 odor substances were identified through variable importance (VI) [36,37], as shown in Figure 7. Note that a variable with a large VI value is important for building the predictive model; it is not necessarily a variable that strongly affects the fluctuation of the complex odor. NH₃, PPA, and H₂S showed the highest VI values, while DMDS and IVA showed the lowest.
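The text does not state how VI was computed for the SVM. One common model-agnostic option is permutation importance, in which a feature's importance is the drop in a score after that feature's values are shuffled. The sketch below illustrates this technique only; it is an assumption, not necessarily the procedure behind Figure 7, and the toy model and data are hypothetical.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy stand-in for a fitted classifier: uses feature 0, ignores feature 1.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

vi = permutation_importance(toy_model, X, y)
print(vi)
```

In this toy case the ignored feature receives an importance of exactly zero, which mirrors the point made above: VI measures how much the model relies on a variable, not how strongly that variable drives the complex odor itself.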

Summarizing the three analyses (VI, ANOVA, and correlation analysis), results are shown below only for four variables, NH₃, VLA, PPA, and BTA, which among the 15 odor substances had high VI values, significant differences in the ANOVAs, and high correlation coefficients with complex odor.

3.3. Explainable Artificial Intelligence

Figure 8 presents the partial dependence of the four main variables: NH₃, PPA, BTA, and VLA. As shown in Figure 8a, NH₃ values below approximately 0.3 ppm had little effect on the model’s classification of an observed case as “no discharge”; above 0.3 ppm, NH₃ had a positive effect on classification as “no discharge”. In the PDP of PPA (Figure 8b), the variable had little effect on the classification until the PPA value reached approximately 0.7 ppb; beyond that point, as the PPA value increased, it had a negative effect on classification as “no discharge”. In the PDP of BTA (Figure 8c), the variable did not significantly affect the classification until the BTA value reached approximately 1.5 ppb, after which it had a positive effect on classification as “no discharge” as the value increased. In the PDP of VLA (Figure 8d), the value had a positive influence from approximately 0.5 ppb; however, from approximately 1.1 ppb onward, as the VLA value increased, it had a negative influence on classification as “no discharge”.

Overall, increasing NH₃ and BTA values pushed the classification toward “no discharge” (a positive relationship), whereas increasing PPA values pushed it away from “no discharge” (a negative relationship). For VLA, the direction of the effect changed as the value increased, so the relationship was not simply positive or negative.
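The isolation plots in Figure 8 can be understood through a short sketch of how partial dependence is computed: one feature is swept over a grid while all other features keep their observed values, giving one curve per case (the blue lines in Figure 8), and these curves are then averaged (the green line). The code below is a minimal illustration; `toy_proba` is a hypothetical stand-in for the fitted classifier’s predicted probability of “no discharge”, and its threshold and coefficients are arbitrary.

```python
def ice_curves(predict_proba, X, feature, grid):
    """One curve per sample: model output as `feature` sweeps over `grid`."""
    curves = []
    for row in X:
        curve = []
        for value in grid:
            modified = list(row)        # copy so the data are not mutated
            modified[feature] = value
            curve.append(predict_proba(modified))
        curves.append(curve)
    return curves

def partial_dependence(curves):
    """Average the per-sample curves point by point."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]

# Toy stand-in for P("no discharge" | x); NOT the study's SVM.
# Feature 0 raises the probability above 0.3; feature 1 lowers it slightly.
def toy_proba(row):
    p = 0.2
    p += 0.6 if row[0] > 0.3 else 0.0
    p -= 0.1 if row[1] > 0.7 else 0.0
    return min(max(p, 0.0), 1.0)

X = [[0.1, 0.2], [0.5, 0.9], [0.8, 0.4]]   # hypothetical observations
grid = [0.0, 0.2, 0.4, 0.6]

curves = ice_curves(toy_proba, X, feature=0, grid=grid)
pdp = partial_dependence(curves)
print([round(v, 3) for v in pdp])
```

A flat stretch of the averaged curve followed by a jump, as in this toy output, is the kind of pattern described above for NH₃ around 0.3 ppm.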

Plotting the partial dependence of two variables on a grid, with one variable on the x-axis and the other on the y-axis, shows how joint changes in the two variables affect the model’s predictions. Figure 9 shows the six PDP interaction plots obtained by pairing the four main odor species two at a time.
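The two-variable case can be sketched in the same way as the one-variable case: both features are set to each combination of grid values, the remaining features keep their observed values, and the predictions are averaged. Choosing two of the four main variables gives the six pairs shown in Figure 9. In the sketch below, the logistic `toy_proba` and its coefficients are arbitrary assumptions for illustration and do not represent the fitted SVM.

```python
import math
from itertools import combinations

def partial_dependence_2d(predict_proba, X, feat_a, feat_b, grid_a, grid_b):
    """Average prediction for every (a, b) combination on the grid."""
    surface = []
    for a in grid_a:
        surface_row = []
        for b in grid_b:
            total = 0.0
            for row in X:
                modified = list(row)
                modified[feat_a], modified[feat_b] = a, b
                total += predict_proba(modified)
            surface_row.append(total / len(X))
        surface.append(surface_row)
    return surface

# Toy stand-in model with arbitrary coefficients (illustration only).
def toy_proba(row):
    z = 1.0 * row[0] - 1.0 * row[1] + 0.5 * row[2] - 0.5 * row[3]
    return 1.0 / (1.0 + math.exp(-z))

names = ["NH3", "PPA", "BTA", "VLA"]
X = [[0.2, 0.5, 1.0, 0.3], [1.5, 0.1, 0.4, 0.9], [0.7, 0.8, 2.0, 0.2]]
grid = [0.0, 1.0, 2.0]

pairs = list(combinations(range(len(names)), 2))   # 4 choose 2 = 6 pairs
surfaces = {
    (names[a], names[b]): partial_dependence_2d(toy_proba, X, a, b, grid, grid)
    for a, b in pairs
}
print(len(surfaces), "interaction surfaces")
```

Each surface is a small matrix of averaged probabilities; drawing it as a heat map or contour plot gives a display of the kind shown in Figure 9.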

Figure 9a is the interaction plot of NH₃ and BTA. As confirmed in Figure 9a, when both NH₃ and BTA increased, the probability that the classification model classified as “no discharge” also increased. As shown in Figure 9b, as the values of both NH₃ and VLA increased, the probability that the model would classify them as “no discharge” increased. However, VLA showed a decrease in the probability of classification as “no discharge” after the VLA value was greater than approximately 8.96.

Figure 9c shows that, as the value of NH₃ increased, the model’s probability of classification as “no discharge” increased; however, as the PPA value increased, the probability of classification as “no discharge” decreased. As shown in Figure 9d, as the PPA value decreased and the BTA value increased, the probability of classification as “no discharge” increased. From the results presented in Figure 9e, we could determine that the model’s probability of classification as “no discharge” increased when the values for both PPA and VLA decreased. In particular, it was observed that a VLA value of approximately 3.0 or less and a PPA value of approximately 2.0 or less were highly likely to be classified as “no discharge”. Finally, as shown in Figure 9f, as the VLA value increased, the probability of classification as “no discharge” decreased, while as the BTA value increased, the probability of classification as “no discharge” increased.

4. Discussion

This study was conducted as a follow-up to a previous study [18]. The number of data points, which was a limitation of the previous study, increased from 57 to 212, and model performance improved as a result. Nevertheless, 212 samples is still rather small for a data-driven study. With so little data, problems can arise; for example, some overfitting may remain in this study even though several methods were used to guard against it.

In general, there is no linear dependence between odor substances and complex odors. In addition, the effect of odor substances on the complex odor depends on how they are combined, through interactions such as synergy and antagonism, and research on this is insufficient [16]. Research is therefore needed to identify the causal relationship, which we attempted to do based on the data.

According to the PDP results, for NH₃, VLA, and BTA, the probability of being classified as “no discharge” increased as the value increased, which is consistent with expert domain knowledge. For PPA, however, the probability of being classified as “no discharge” decreased as the value increased, which is somewhat contrary to expert domain knowledge. Since the sense of smell is affected not only by chemical effects but also by a person’s physiological state, a model that explains complex odor by substance concentrations alone is likely to contain errors. Additional research is needed on such a negative relationship [6]. Moreover, the partial dependence values of PPA (the y-axis of the PDP isolation plot) were found to be small. More accurate results could likely be obtained by securing more data.

Currently, attempts are being made to reduce complex odors by managing the concentration of each odor substance in actual pig barns, and the concentration of odor substances is measured in real time through sensors in some pig barns. Based on the results of this study, it is expected that complex odors can be managed by designating thresholds for each odor substance. In addition, it is expected that the performance of the model will continuously improve as data is accumulated through sensors in pig barns.

5. Conclusions

Odor is a severe problem, to the extent that government policies have been implemented to reduce odors from livestock facilities worldwide. To solve this problem, research on predicting the occurrence of odors is continuously being conducted. In particular, research has recently been conducted using machine learning based on data. However, although various chemical studies have provided predictive research on odor substances, analysis studies on the influence of odor substances or relationships between odor substances using machine learning are insufficient. Therefore, in this study, we investigated the influence and interaction of odor substances beyond odor prediction.

In this study, we used a sample of 212 data points. As the response variable, the complex odor value was classified as either acceptable or unacceptable according to the Malodor Prevention Act. To predict it, 15 odor substances, location, and season were used as explanatory variables; measurement location (in, out, and boundary) and measurement season (spring, summer, fall, and winter) were categorical variables.

First, missing values in the data were replaced using the multiple imputation method before the predictive model was built. Next, to examine the relationships between variables, correlation analyses were performed for continuous variables, and ANOVAs were used for categorical variables. For the odor substances, which are continuous variables, the correlations among them were high, whereas their correlations with the response variable, complex odor, were low. This indicates that, for the data used in the present study, prediction using a nonlinear model was better than prediction using a linear model. In the ANOVAs, a significant mean difference by category was observed for 11 odor substances (NH₃, H₂S, MM, ACA, PPA, IBA, BTA, IVA, VLA, Ph, and p-C) with the measurement location variable, and for nine odor substances (MM, DMS, ACA, PPA, IBA, BTA, IVA, VLA, and Ph; Table 7) with the season variable. Thus, both categorical variables could be considered suitable.

Next, the results of six machine learning techniques (KNN, SVM, RF, Extra-Trees, LightGBM, and XGBoost) were compared to select the optimal complex odor predictive model. The GS and five-fold cross-validation methods were used to select the optimal parameters in the learning process, and accuracy, f1-score, sensitivity, specificity, PPV, and NPV were used as evaluation metrics. As a result, the SVM method was found to be the best for five of the six evaluation metrics.

Finally, the PDP method was applied to the SVM model, which showed the best performance, to interpret the predictive results, and the effects of individual odor substances and their interactions could be confirmed. Results were reported for four variables, NH₃, VLA, PPA, and BTA, which among the 15 odor substances were significant in the ANOVA and had high values in the correlation analysis and in the variable importance of the SVM model.

First, a PDP isolation plot was used to compare the influence of each of the four odor species. As a result, it was observed that the influence on the classification of the model rapidly increased after reaching a certain level for all four components. As the values of NH₃, VLA, and BTA increased, they demonstrated a positive effect on exceeding the emission limit for complex odors. However, as the PPA value increased, it appeared to have a negative effect on exceeding the emission limit.

Additionally, the PDP interaction plots gave the interactions among the four odor species as six plots. For NH₃, the effect on exceeding the emission limit was found to be insignificant when compared with the other substances. Conversely, PPA was found to have a greater effect on exceeding the emission limit for complex odor than the odor species it was paired with. Specifically, the interaction plot of PPA and VLA showed that, when both PPA and VLA values were very small, the probability of exceeding the emission limit was high.

This study was conducted as a follow-up to the previous study. Since there is generally no linear dependence between odor components and complex odors, the main purpose of this study was to elucidate the causal relationship between them. As a result of the analysis, it was determined which model had good performance in classifying the dataset, and it was possible to identify the main variables affecting classification. Moreover, through the application of the XAI method to the analysis results, it was possible to interpret the classification results of the model. We hope that these studies and findings will help manage complex odors in actual pig barns.

Author Contributions

Software, D.-H.L. and S.-H.L.; Formal analysis, D.-H.L. and S.-H.L.; Data curation, D.-H.L. and S.-H.L.; Writing—original draft, S.-H.L.; Writing—review & editing, D.-H.L., D.-y.K., S.-E.W., M.-W.J. and T.-Y.H.; Supervision, T.-Y.H.; Funding acquisition, S.-E.W., M.-W.J. and T.-Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korean Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) and Korea Smart Farm R&D Foundation (KosFarm) through the Smart Farm Innovation Technology Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA) and the Ministry of Science and ICT (MSIT), Rural Development Administration (RDA) (421020-03). This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) through the “2025 Livestock Industrialization Technology Development Program”, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA)(grant number: 321088-5).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wojnarowska, M.; Sagan, A.; Plichta, J.; Plichta, G.; Szakiel, J.; Turek, P.; Sołtysik, M. The influence of the methods of measuring odours nuisance on the quality of life. Environ. Impact Assess. Rev. 2021, 86, 106491.
2. Torkey, H.; Atlam, M.; El-Fishawy, N.; Salem, H. A novel deep autoencoder based survival analysis approach for microarray dataset. PeerJ Comput. Sci. 2021, 7, e492.
3. Hidayat, R.; Wang, Z.-H. Odor classification in cattle ranch based on electronic nose. Int. J. Data Sci. 2021, 2, 104–111.
4. Yan, L.; Wu, C.; Liu, J. Visual analysis of odor interaction based on support vector regression method. Sensors 2020, 20, 1707.
5. Wojnarowska, M.; Ilba, M.; Szakiel, J.; Turek, P.; Sołtysik, M. Identifying the location of odour nuisance emitters using spatial GIS analyses. Chemosphere 2021, 263, 128252.
6. Rincón, C.A.; De Guardia, A.; Couvert, A.; Wolbert, D.; Le Roux, S.; Soutrel, I.; Nunes, G. Odor concentration (OC) prediction based on odor activity values (OAVs) during composting of solid wastes and digestates. Atmos. Environ. 2019, 201, 1–12.
7. Barczak, R.J.; Możaryn, J.; Fisher, R.M.; Stuetz, R.M. Odour concentrations prediction based on odorants concentrations from biosolid emissions. Environ. Res. 2022, 214, 113871.
8. Cangialosi, F.; Bruno, E.; De Santis, G. Application of Machine Learning for Fenceline Monitoring of Odor Classes and Concentrations at a Wastewater Treatment Plant. Sensors 2021, 21, 4716.
9. Kang, J.-H.; Song, J.; Yoo, S.S.; Lee, B.-J.; Ji, H.W. Prediction of odor concentration emitted from wastewater treatment plant using an artificial neural network (ANN). Atmosphere 2020, 11, 784.
10. Mulrow, J.; Kshetry, N.; Brose, D.A.; Kumar, K.; Jain, D.; Shah, M.; Kunetz, T.E.; Varshney, L.R. Prediction of odor complaints at a large composite reservoir in a highly urbanized area: A machine learning approach. Water Environ. Res. 2020, 92, 418–429.
11. Zhu, X.; Li, Y.; Wang, X. Machine learning prediction of biochar yield and carbon contents in biochar based on biomass characteristics and pyrolysis conditions. Bioresour. Technol. 2019, 288, 121527.
12. Qi, C.; Wu, M.; Zheng, J.; Chen, Q.; Chai, L. Rapid identification of reactivity for the efficient recycling of coal fly ash: Hybrid machine learning modeling and interpretation. J. Clean. Prod. 2022, 343, 130958.
13. Wojtuch, A.; Jankowski, R.; Podlewska, S. How can SHAP values help to shape metabolic stability of chemical compounds? J. Cheminform. 2021, 13, 74.
14. Chakkingal, A.; Janssens, P.; Poissonnier, J.; Barrios, A.J.; Virginie, M.; Khodakov, A.Y.; Thybaut, J.W. Machine learning based interpretation of microkinetic data: A Fischer–Tropsch synthesis case study. React. Chem. Eng. 2022, 7, 101–110.
15. Grimmig, R.; Lindner, S.; Gillemot, P.; Winkler, M.; Witzleben, S. Analyses of used engine oils via atomic spectroscopy–Influence of sample pre-treatment and machine learning for engine type classification and lifetime assessment. Talanta 2021, 232, 122431.
16. Blazy, V.; de Guardia, A.; Benoist, J.C.; Daumoin, M.; Guiziou, F.; Lemasle, M.; Wolbert, D.; Barrington, S. Correlation of chemical composition and odor concentration for emissions from pig slaughterhouse sludge composting and storage. Chem. Eng. J. 2015, 276, 398–409.
17. The Malodor Prevention Act Institution. The Malodor Prevention Act in Korea. Available online: https://easylaw.go.kr/CSP/CnpClsMainBtr.laf?popMenu=ov&csmSeq=1405&ccfNo=2&cciNo=2&cnpClsNo=1#copyAddress (accessed on 10 November 2022).
18. Lee, D.-H.; Woo, S.-E.; Jung, M.-W.; Heo, T.-Y. Evaluation of Odor Prediction Model Performance and Variable Importance according to Various Missing Imputation Methods. Appl. Sci. 2022, 12, 2826.
19. Jang, Y.N.; Jung, M.W. Biochemical changes and biological origin of key odor compound generations in pig slurry during indoor storage periods: A pyrosequencing approach. BioMed Res. Int. 2018, 2018, 3503658.
20. Jensen, B.B.; Jørgensen, H. Effect of dietary fiber on microbial activity and microbial gas production in various regions of the gastrointestinal tract of pigs. Appl. Environ. Microbiol. 1994, 60, 1897–1904.
21. Jang, Y.-N.; Hwang, O.; Jung, M.-W.; Ahn, B.-K.; Kim, H.; Jo, G.; Yun, Y.-M. Comprehensive analysis of microbial dynamics linked with the reduction of odorous compounds in a full-scale swine manure pit recharge system with recirculation of aerobically treated liquid fertilizer. Sci. Total Environ. 2021, 777, 146122.
22. Allison, P.D. Multiple imputation for missing data: A cautionary tale. Sociol. Methods Res. 2000, 28, 301–309.
23. Gogtay, N.J.; Thatte, U.M. Principles of correlation analysis. J. Assoc. Physicians India 2017, 65, 78–81.
24. Aldayel, M.S. K-Nearest Neighbor classification for glass identification problem. In Proceedings of the 2012 International Conference on Computer Systems and Industrial Informatics, Sharjah, United Arab Emirates, 18–20 December 2012; pp. 1–5.
25. Salem, H.; Shams, M.Y.; Elzeki, O.M.; Abd Elfattah, M.; F. Al-Amri, J.; Elnazer, S. Fine-tuning fuzzy KNN classifier based on uncertainty membership for the medical diagnosis of diabetes. Appl. Sci. 2022, 12, 950.
26. Pradhan, A. Support vector machine-a survey. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 82–85.
27. Cutler, A.; Cutler, D.R.; Stevens, J.R. Random forests. In Ensemble Machine Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 157–175.
28. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
29. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
30. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
31. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
32. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
33. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the IJCAI, Montreal, QC, Canada, 20–25 August 1995; pp. 1137–1145.
34. Grandini, M.; Bagli, E.; Visani, G. Metrics for multi-class classification: An overview. arXiv 2020, arXiv:2008.05756.
35. Robinson, C.; Schumacker, R.E. Interaction effects: Centering, variance inflation factor, and interpretation issues. Mult. Linear Regres. Viewp. 2009, 35, 6–11.
36. Strobl, C.; Boulesteix, A.-L.; Zeileis, A.; Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform. 2007, 8, 25.
37. Wei, P.; Lu, Z.; Song, J. Variable importance analysis: A comprehensive review. Reliab. Eng. Syst. Saf. 2015, 142, 399–432.

Figure 1. Odor-collecting device installed inside/outside/boundary of pig barns.

Figure 2. The overall workflow of the conducted study.

Figure 3. Map of missing data.

Figure 4. Correlation plot; (left) correlation bar plot of explanatory variables about response variables; (right) heatmap of data.

Figure 5. Number of counts for location and season.

Figure 6. ROC curve, AUC, and MCC result for each model.

Figure 7. Variable importance of odor substances.

Figure 8. Partial dependence isolation plots for four main odor species (NH₃, PPA, BTA, and VLA). The blue line means all analyzed cases in the data, and the green line in the center means the average of all blue lines.

Figure 9. Partial dependence interaction plots for four main odor species (NH₃, PPA, BTA, and VLA). (a) Interaction plot of NH₃ and BTA. (b) Interaction plot of NH₃ and VLA. (c) Interaction plot of PPA and NH₃. (d) Interaction plot of PPA and VLA. (e) Interaction plot of BTA and VLA. (f) Interaction plot of BTA and PPA.

Table 1. Summary of the study’s various analytical methods.

Variable	Sampling Method	Analytical Instrument	Analytical Conditions
Response variable (Complex odor)	Lung Sampler and Polyester Aluminum bag (10 L)		Air dilution method, Korea
Explanatory variable (Ammonia)	Solution Absorption	UV/vis (Shimadzu)	Wavelength range 640 nm
Explanatory variable (Four Sulfur Compounds)	Lung Sampler and Polyester Aluminum bag (10 L)	GC/PFPD (456-GC, Scion instruments)	Column: CP-Sil 5CB (60 m × 0.32 mm × 5 μm) Oven Condition: 60 °C (3 min) → (8 °C/min)→ 160 °C (9 min)
Explanatory variable (Ten VOCs)	Tenax TA tube adsorption	GC/FID (CP-3800, Varian)	Column: DB-WAX (30 m × 0.25 mm × 0.25 μm) Oven Condition: 40 °C →(8 °C/min)→ 150 °C→(10 °C/min) → 230 °C

Table 2. Summary of study variables.

Class	Variable Name (Abbreviation)		Unit	MDL *	Type
Response variable	Complex Odor		$\mathrm{OU_E\,m^{-3}}$		0: discharge; 1: no discharge
Explanatory variables	Ammonia (NH₃)		ppm	0.08	Float64
	Sulfur compounds	Hydrogen sulfide (H₂S)	ppb	0.06	Float64
		Methyl mercaptan (MM)		0.07	Float64
		Dimethyl sulfide (DMS)		0.08	Float64
		Dimethyl disulfide (DMDS)		0.05	Float64
	Volatile Organic Compounds (VOCs)	Acetic acid (ACA)	ppb	0.07	Float64
		Propionic acid (PPA)		0.34	Float64
		Isobutyric acid (IBA)		0.52	Float64
		n-Butyric acid (BTA)		0.96	Float64
		Isovaleric acid (IVA)		0.49	Float64
		n-Valeric acid (VLA)		0.53	Float64
		Phenol (Ph)		0.09	Float64
		p-Cresol (p-C)		0.06	Float64
		Indole (ID)		0.40	Float64
		Skatole (SK)		0.38	Float64
	Location				In: inside of the pig barn; Out: outside of the pig barn; Boundary: site boundaries
	Season				Spring: March–May; Summer: June–August; Fall: September–November; Winter: December–February

* MDL = Method Detection Limit.

Table 3. Binary-class confusion matrix.

Confusion Matrix		True
Confusion Matrix		Class 0 (Discharge)	Class 1 (No Discharge)
Predict	Class 0 (discharge)	True Positive (TP)	False Positive (FP)
Predict	Class 1 (no discharge)	False Negative (FN)	True Negative (TN)

Table 4. Summary of the data after the imputation method.

	Complex Odor	Ammonia	Hydrogen Sulfide	Methyl Mercaptan	Dimethyl Sulfide	Dimethyl Disulfide	Acetic Acid	Propionic Acid
Mean	800	2.72	208.51	6.04	5.65	0.12	240.52	184.06
STD	1467	3.99	386.96	15.53	40.51	0.44	407.90	291.47
Min	3	0.00	0.04	0.06	0.00	0.02	0.15	0.13
Median	300	1.06	56.25	0.07	0.08	0.05	27.81	17.70
Max	10000	22.24	2484.00	120.00	462.00	4.28	2446.00	2109.69
	Iso-Butyric Acid	Butyric Acid	Iso-Valeric Acid	Valeric Acid	Phenol	p-Cresol	Indole	Skatole
Mean	19.28	159.94	46.071	85.78	6.78	34.27	2.18	3.15
STD	37.08	262.03	83.60	192.73	12.70	60.15	7.08	9.15
Min	0.04	0.52	0.08	0.28	0.06	0.00	0.02	0.04
Median	2.01	7.93	4.42	5.52	1.85	2.89	0.86	1.52
Max	380.00	1455.52	743.69	1869.40	125.72	481.20	95.28	127.13

Table 5. Correlation analysis and variance inflation factor (VIF) results.

Variables	Correlation	VIF	Variables	Correlation	VIF
Ammonia	0.50	2.47	Butyric acid	0.35	20.40
Hydrogen sulfide	0.27	1.42	Iso-valeric acid	0.28	34.93
Methyl mercaptan	0.39	1.53	Valeric acid	0.24	26.34
Dimethyl sulfide	0.02	1.22	Phenol	0.29	6.52
Dimethyl disulfide	0.11	1.48	p-Cresol	0.42	4.86
Acetic acid	0.50	8.82	Indole	0.09	1.07
Propionic acid	0.46	58.70	Skatole	0.07	1.10
Iso-butyric acid	0.58	6.35

Table 6. Analysis of variance results for location. The superscripts (^A, ^B, ^C) on the numbers indicate the grouping of the category means for variables with significant differences in mean.

Variables	In	Out	Boundary	F-Value	p-Value	Variables	In	Out	Boundary	F-Value	p-Value
Complex odor	$1533^{A}$	$708^{B}$	$73^{C}$	33.84	<0.001	Iso-butyric acid	${36.23}^{A}$	${17.96}^{A}$	${2.01}^{B}$	33.16	<0.001
Ammonia	${5.74}^{A}$	${2.54}^{B}$	${0.49}^{C}$	38.52	<0.001	Butyric acid	${323.15}^{A}$	${173.78}^{B}$	${11.95}^{C}$	30.48	<0.001
Hydrogen sulfide	${357.28}^{A}$	${275.63}^{A}$	${16.27}^{B}$	16.95	<0.001	Iso-valeric acid	${95.74}^{A}$	${48.04}^{B}$	${4.81}^{C}$	23.63	<0.001
Methyl mercaptan	${14.26}^{A}$	${4.24}^{B}$	${1.18}^{B}$	13.90	<0.001	Valeric acid	${186.40}^{A}$	${85.17}^{B}$	${6.81}^{C}$	16.31	<0.001
Dimethyl sulfide	11.87	6.08	0.11	1.40	0.2501	Phenol	${12.66}^{A}$	${6.28}^{B}$	${2.60}^{B}$	11.40	<0.001
Dimethyl disulfide	0.20	0.13	0.06	1.82	0.1639	p-Cresol	${69.12}^{A}$	${36.60}^{B}$	${3.38}^{C}$	24.49	<0.001
Acetic acid	${495.79}^{A}$	${218.76}^{A}$	${45.75}^{C}$	25.48	<0.001	Indole	2.45	1.82	2.36	0.16	0.8489
Propionic acid	${380.37}^{A}$	${177.05}^{B}$	${21.23}^{C}$	33.91	<0.001	Skatole	4.35	2.28	3.13	0.85	0.4309

Table 7. Analysis of variance results for season. The superscripts (^A, ^B, ^C) on the numbers indicate the grouping of the category means for variables with significant differences in mean.

Variables	Spring	Summer	Fall	F-Value	p-Value	Variables	Spring	Summer	Fall	F-Value	p-Value
Complex odor	641	681	768	0.19	0.8291	Iso-butyric acid	${35.10}^{A}$	${16.44}^{AC}$	${12.94}^{BC}$	7.43	0.0017
Ammonia	1.44	2.60	3.20	2.15	0.1194	Butyric acid	${428.16}^{A}$	${130.03}^{B}$	${103.68}^{B}$	20.92	<0.001
Hydrogen sulfide	213.68	240.98	169.28	0.79	0.4552	Iso-valeric acid	${116.53}^{A}$	${48.41}^{B}$	${21.88}^{B}$	15.61	<0.001
Methyl mercaptan	${3.81}^{A}$	${9.39}^{AB}$	${3.08}^{AC}$	4.21	0.0162	Valeric acid	${192.99}^{A}$	${103.72}^{A}$	${34.09}^{B}$	8.52	0.0012
Dimethyl sulfide	${37.24}^{A}$	${0.68}^{B}$	${0.83}^{B}$	10.67	<0.001	Phenol	${12.10}^{A}$	${7.62}^{AC}$	${4.23}^{BC}$	4.63	0.0108
Dimethyl disulfide	0.08	0.17	0.08	1.06	0.3485	p-Cresol	51.85	28.25	34.15	1.68	0.1882
Acetic acid	${377.45}^{A}$	${157.86}^{B}$	${269.19}^{B}$	3.88	0.0221	Indole	1.28	1.25	3.42	2.43	0.0904
Propionic acid	${365.33}^{A}$	${167.31}^{B}$	${132.45}^{B}$	7.62	0.0016	Skatole	5.01	2.69	3.07	0.69	0.5039

Table 8. Results of hyperparameter optimization using grid search and five-fold cross-validation.

Model	Parameter	Range	Optimal Parameter
KNN	n_neighbors	*Range (1, 15, 1)	4
	weights	‘uniform’, ‘distance’	‘distance’
	algorithm	‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’	‘auto’
SVM	C	[0.05, 0.5, 1.0]	0.5
	kernel	‘linear’, ‘rbf’	‘linear’
RF	n_estimators	Range (10, 100, 5)	10
	max_depth	Range (1, 15, 1)	10
	min_samples_leaf	Range (1, 10, 1)	2
	min_samples_split	[2, 5, 8, 10, 12]	2
	criterion	‘gini’, ‘entropy’	‘entropy’
Extra-Trees	n_estimators	Range (1, 3, 1)	3
	max_depth	Range (1, 7, 1)	5
LightGBM	num_leaves	[31, 127]	31
	max_depth	Range (3, 5, 1)	3
	min_child_weight	[ $1 \times 10^{-7}$ , $1 \times 10^{-3}$ , $1 \times 10^{-1}$ , $1 \times 10^{1}$ , $1 \times 10^{3}$ , $1 \times 10^{5}$ ]	$1 \times 10^{-7}$
	min_data_in_leaf	[30, 50, 100]	30
XGBoost	max_depth	Range (3, 10, 1)	3
	min_child_weight	Range (3, 5, 1)	3

*Range (start, end + 1, step).

Table 9. Results of the predictive models. Each value is the mean over the 30 runs, with the standard deviation in parentheses. Underlining marks the best score for each evaluation metric.

	F1-Score	Accuracy	Sensitivity	Specificity	PPV	NPV
KNN	0.74 (0.05)	0.74 (0.03)	0.77 (0.09)	0.72 (0.08)	0.72 (0.08)	0.77 (0.09)
SVM	0.77 (0.05)	0.81 (0.02)	0.74 (0.08)	0.87 (0.04)	0.82 (0.06)	0.80 (0.04)
RF	0.76 (0.05)	0.78 (0.02)	0.74 (0.08)	0.82 (0.06)	0.79 (0.05)	0.78 (0.04)
Extra-Trees	0.72 (0.05)	0.76 (0.04)	0.70 (0.11)	0.80 (0.11)	0.77 (0.10)	0.76 (0.08)
LightGBM	0.76 (0.05)	0.78 (0.03)	0.75 (0.09)	0.81 (0.07)	0.77 (0.07)	0.79 (0.07)
XGBoost	0.74 (0.05)	0.77 (0.03)	0.72 (0.10)	0.81 (0.07)	0.78 (0.06)	0.76 (0.08)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
