1. Introduction
Sludge is produced as a byproduct of treating wastewater; it is made up of solid material that has been removed from the water. As sludge is composed of waste originating from various sources, it contains not only common microorganisms and floating and dissolved solids, but also carbon compounds, which are usually derived from the waste of living organisms, and inorganic substances, such as heavy metals and toxic compounds [
1]. If these substances are released into the environment without proper treatment, they can contaminate water bodies and cause eutrophication, or if they penetrate into the soil, they can lead to plant death. Additionally, the release of sulfide from the sludge into the air can cause air pollution [
2]. The sludge is therefore processed further to (i) reduce its toxicity, (ii) recover valuable nutrients, (iii) produce biogas, and (iv) reduce greenhouse gas production. Sludge from wastewater treatment plants (WWTPs) can be put through an anaerobic digestion process, in which the organic matter is broken down in the absence of oxygen. Methane and carbon dioxide are the main components of the biogas produced in this way [
3,
4]. However, digestion efficiency and gas production are affected by many factors, including sludge characteristics and external conditions such as operation, temperature, and pH inside the digester [
5]. The demand for energy in modern society is constantly increasing, while traditional fossil energy is still the main source [
6]. With the growth of the global population and economy, rising environmental awareness, and changing energy policies, energy demand is diversifying, and the demand for renewable energy is increasing accordingly. Biogas is a renewable gaseous fuel derived from biomass that can be used as an alternative to traditional fossil fuels. Recovering energy from waste sludge through anaerobic digestion is therefore attractive, and improving the efficiency and gas production of sludge anaerobic digestion is an important issue in current WWTP research and practice [
7,
8]. All in all, through the anaerobic digestion of sludge, the amount of waste disposed and the negative impact of landfills on the environment are reduced, and the gas generated during the digestion process is used as energy, reducing greenhouse gas emissions and improving the environment and soil quality [
9].
Anaerobic digestion takes place in the absence of oxygen. Anaerobic microorganisms, such as bacteria, produce biogas by decomposing organic matter. This process is divided into four stages: hydrolysis, acid production (acidogenesis), acetic acid and hydrogen production (acetogenesis), and methane production (methanogenesis) [
10,
11]. Generally, the traditional WWTP uses a two-stage digestion process [
12,
13]. Before anaerobic digestion, pretreatment is required, such as thermal, ultrasonic, or chemical pretreatment. The raw sludge is sent to the thickening tank and then fed continuously or batchwise into the first-stage digester, where it is stirred [14,15]. Stirring brings the sludge into full contact with the microorganisms at the bottom and keeps the mixing state in the digester stable [
16,
17]. The generated methane is then collected into the methane production unit, and the remaining sludge is sent to the secondary digester for concentration and supernatant separation. In general, mesophilic anaerobic digestion is used during the operation of WWTP biogas digesters, and the temperature is maintained at 30~37 °C [
18]. As a parameter to judge whether the digestion process is running normally, pH is generally maintained at 6.8~7.2 [
19]. However, due to the complexity of anaerobic digestion, the overall digestion efficiency and gas production are in general low. Therefore, in order to improve its efficiency and methane production, data mining techniques are used to analyze relevant data in the anaerobic digestion process, identify and analyze relevant variables and parameters in the anaerobic digestion process, and then propose optimized parameters to improve efficiency and gas production.
In addition, the methane produced from anaerobic digestion can be used to provide energy for equipment operation within the sewage treatment plant, thus reducing operational costs. Moreover, methane is a clean and renewable energy source that can be used as fuel in various fields, such as transportation and residential gas supply. With this in mind, the development of anaerobic digestion technology can not only solve the problem of organic waste disposal, but also contribute to the development and utilization of new energy sources.
The performance of anaerobic digestion systems depends on various factors, including growth factors, operational parameters, system type, and digester type [
20]. Growth factors, such as temperature, pH, and organic acids, can be controlled in the digester, while operational parameters, such as the hydraulic retention time (HRT) and organic loading rate (OLR), directly affect the system’s stability and treatment efficiency and also influence the growth factors. Due to the complexity of anaerobic digestion, the overall digestion efficiency and gas production are, in general, low. Thus, effectively controlling the OLR and HRT is one way to improve the efficiency and gas production of anaerobic digestion [
21].
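As an illustration of the two operational parameters named above, the following minimal sketch uses the standard textbook definitions: HRT is the digester working volume divided by the daily sludge inflow, and OLR is the daily mass of volatile solids fed per unit of digester volume. The function names and the example values are hypothetical and are not taken from the plant data.

```python
def hydraulic_retention_time(volume_m3: float, flow_m3_per_day: float) -> float:
    """HRT in days: digester working volume divided by daily sludge inflow."""
    if flow_m3_per_day <= 0:
        raise ValueError("flow must be positive")
    return volume_m3 / flow_m3_per_day


def organic_loading_rate(flow_m3_per_day: float, vs_kg_per_m3: float,
                         volume_m3: float) -> float:
    """OLR in kg VS per m3 of digester per day."""
    if volume_m3 <= 0:
        raise ValueError("volume must be positive")
    return flow_m3_per_day * vs_kg_per_m3 / volume_m3


# Hypothetical digester (illustrative values only, not from the study).
volume = 6000.0   # working volume, m3
flow = 300.0      # sludge inflow, m3/day
vs = 30.0         # volatile solids concentration of the feed, kg VS/m3

hrt = hydraulic_retention_time(volume, flow)
olr = organic_loading_rate(flow, vs, volume)
print(f"HRT = {hrt:.1f} d, OLR = {olr:.2f} kg VS/(m3 d)")
```

With these definitions, OLR equals the feed VS concentration divided by the HRT, so at a fixed feed concentration the two parameters cannot be adjusted independently.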
In order to improve anaerobic digestion efficiency and methane production, data mining techniques are used to analyze relevant data from the anaerobic digestion process. Because the amount of data is large, data mining can automate the analysis: it discovers trends and patterns in the data, establishes corresponding data models, compares parameter data under different operating conditions, and finds the optimal operating parameters, thereby optimizing production. In addition, the efficiency and gas production of anaerobic digestion can be predicted, so that parameters and treatment processes can be adjusted in a timely manner. Identifying and analyzing the relevant variables and parameters of the anaerobic digestion process thus makes it possible to propose optimized parameters that improve efficiency and gas production.
Elena Rossi et al. [
22] aimed to develop a multiple regression model to predict biogas production from the dry anaerobic digestion of organic waste. The authors first evaluated the correlations between variables using Pearson correlation analysis and then applied MLR, selecting the temperature, pH, TS content, mixture, carbon–nitrogen ratio, OLR and HRT, nutritional availability, and toxic compounds as input variables, with biogas as the output variable. The experiments demonstrated that MLR can be used for preliminary evaluation and potential energy prediction. Fuqing Xu et al. [23] aimed to develop a method to predict the gas production of lignocellulosic biomass in low-temperature solid-state anaerobic digestion (SS-AD). The authors used MLR to predict gas production from the type of biomass, its chemical composition, and the operating parameters and compared the accuracy of the resulting models with that of traditional prediction models. The multiple regression model established from the experimental data accurately predicted the gas production of different biomasses in SS-AD. The results show that this forecasting method can provide a valuable reference for forecasting.
NA Perendeci et al. [
24] analyzed the correlation between the biochemical composition of microalgal and cyanobacterial biomass and gas production using Pearson correlation analysis. The gas production and biochemical composition of microalgal and cyanobacterial samples from different sources were assessed. The results showed that gas production was positively correlated with the total protein, total sugar, total organic carbon, and total ash content of the biomass and negatively correlated with its crude fat and cellulose content. This result allows gas production to be predicted from the biochemical components and thus provides a reference for predicting and optimizing gas production. Yongwoon Park et al. [25] used Pearson correlation analysis to study the effect of different biomass mixtures on gas production during anaerobic digestion and calculated the Pearson correlation coefficient between the biomass mixing ratios and gas production. The results showed that the mixing ratio of different biomasses had a significant effect on gas production. Pearson correlation analysis can therefore provide value for predicting and optimizing the gas production of anaerobic digestion.
Artificial neural networks (ANNs) are mathematical models that simulate the neural networks of the human brain and are used to deal with nonlinear problems and perform pattern recognition [
26,
27]. The use of ANNs in the field of sludge anaerobic digestion has been rarely explored and applied in real field applications [
28,
29,
30]. Previously, Liliana Mafalda studied the use of ANN methods to predict gas production and chemical oxygen demand (COD) removal rates during anaerobic digestion [31]. The model uses six parameters. The author standardized the initial data and used an ANN trained with backpropagation (BP) to train and verify the model. The study showed that the ANN method can predict gas production and the COD removal rate well in the anaerobic digestion process and can serve as an effective tool for predicting key parameters. However, the author did not analyze the error of the prediction results or verify the stability and reliability of the model, both of which require further research. Yuchen Wu et al. established a backpropagation neural network and used a genetic algorithm to build a model for optimizing anaerobic digestion. The model can be used to predict gas production at different temperatures. The results show that the model fits the actual data well and has important practical value. However, the authors did not fully discuss the scope and limitations of the model, which will affect its accuracy in practical applications [
32]. P. Sakiewicz et al. proposed an innovative artificial neural network method for simulating the operation of a biogas–wastewater treatment system. The method predicts the relationship between the output of the system and the parameters of the system’s operation. The results show that the model can predict the impact of operating parameters on system performance and provide a reference for system optimization. However, the author did not conduct a comparative analysis between other modeling methods and the ANN and did not fully discuss the scope of the application and stability of the model [
33]. Taken together, these studies indicate that an ANN is useful for optimizing anaerobic digestion parameters. It can model complex nonlinear relationships and analyze and optimize multiple factors jointly, thereby effectively improving anaerobic digestion efficiency and gas production.
ANNs are applied in many fields, such as digital twins, mathematics, and medicine. T.I. Zohdi [34] used self-adaptive digital twin technology to simulate and predict the spread of flames in complex environments and to model fire spread trends. Another article by the same author [35] optimized a digital twin framework for aerial firefighting and pilot safety; machine learning algorithms (MLA) were used to determine the optimal dynamics of the aircraft so that the release efficiency of the fire retardant is maximized. Jinyong Ying et al. [36] proposed a new deep learning structure, a multi-scale fusion network, which accurately computes the solution while maintaining its physical properties and continuity and solves the elliptic interface problem well. Yinghao Chen et al. [37] addressed the many clinical similarities and overlaps between Crohn’s disease (CD) and intestinal tuberculosis (ITB) by using a fusion neural network (FCNN) to distinguish between CD and ITB. The FCNN was trained on the patients’ relevant examination indicators to establish a model, whose accuracy was higher than that of MLR and other models.
Although methane is a new energy source, its output in WWTPs is not high. This study aimed to identify the underlying factors contributing to the low digestion efficiency and gas production in a wastewater treatment plant (WWTP) by collecting and processing operational data, finding the best operating parameters, and optimizing and predicting anaerobic digestion efficiency and gas production. Data mining, Pearson correlation analysis, multiple linear regression, and artificial neural networks are applied to sludge anaerobic digestion in order to improve its efficiency and application.
2. Materials and Methods
2.1. Data Mining
The data were collected from a wastewater treatment plant (WWTP) in Daegu, Korea. The facility combines wastewater treatment with an anaerobic digestion process: the wastewater is sequentially filtered and purified through sand chambers and settling tanks, and the remaining material is sent to thickening tanks and then biologically treated, that is, fermented in digesters. The resulting biogas is collected and used in boilers or other applications, and the remaining dewatered sludge is burned as fuel. The WWTP provided operational data for almost seven years, covering various parameters of plant operation, including sludge thickening, digesters, storage tanks, dewatering, and gas collection. The analysis in this study focused on the digester variables that are closely related to the anaerobic digestion of the sludge. The data used in the study were collected automatically by the plant on a daily basis from January 2014 to December 2021.
2.2. Statistical Pre-Analysis
The initial stage of statistical analysis involves identifying the relevant variables of interest, including both independent and dependent variables. After identifying the variables, exploratory data analysis (EDA) is conducted to better understand the data and uncover any patterns or trends. This is accomplished by utilizing various visualization techniques, such as scatterplots or histograms, to examine the distribution of the data and identify relationships between variables.
Seasonal analysis of these data provides insight into the performance of the digestion process and shows how anaerobic digestion parameters can be optimized to improve treatment efficiency and gas production. Next, the data were normalized and outliers were removed in order to determine the operating factors that would improve the efficiency of the digestion process. The relationship between the operational factors and digestion efficiency was tested, and the conditions for the optimal operation of digestion were then determined. Finally, based on the preliminary results, the optimal operating factors that can increase digestibility were derived (Figure 1). The main factors in the operation of the biogas digester include sludge inflow, temperature, pH, TS, VS, VS%, organic acid, alkalinity, OLR, and HRT. In addition, the seasonal data are divided by quartile into three categories: the upper quartile (Q3, Q4) is the Good Case, the lower quartile (Q2, Q1) is the Bad Case, and the data in between are the Normal Case.
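The text does not state which normalization or outlier rule was applied. The following is only a sketch under two common assumptions: outliers are removed with the 1.5 × IQR rule, and the remaining values are standardized to z-scores. The function names and the sample values are hypothetical.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not the paper's)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / sd for v in values]


# Hypothetical daily gas production values (m3/day) with one obvious outlier.
gas = [12000, 12500, 13000, 11800, 12200, 40000, 12700]
cleaned = remove_outliers_iqr(gas)
z_scores = standardize(cleaned)
print(cleaned)
print([round(z, 2) for z in z_scores])
```

Standardizing after outlier removal keeps a single extreme reading from inflating the standard deviation and compressing the remaining values.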
2.3. Pearson Analysis
First, Pearson correlation analysis was used to explore and verify the correlation of digestion efficiency and methane production with the other variables. To ensure the quality and accuracy of the data, processed and standardized data were used. Pearson correlation analysis is a descriptive statistical analysis that describes the linear relationship between variables in a dataset and is often used to measure the correlation between two variables. The value of the correlation coefficient r (Formula (1)) lies between −1 and 1. The closer r is to 1 or −1, the stronger the positive or negative correlation between the two variables: in a positive correlation, one variable increases as the other increases, whereas in a negative correlation, one variable decreases as the other increases. Values of r = 1 and r = −1 indicate a perfect positive and a perfect negative linear correlation, respectively. Intermediate values indicate some degree of correlation; for example, r = 0.3 means that there is a weak correlation between the two variables. However, Pearson correlation analysis can only describe whether there is a linear relationship between two variables; it cannot establish a causal relationship between them [38]. When the Pearson correlation coefficient is close to 0, the degree of linear correlation between the two variables is very low; that is, the variables are either unrelated or related nonlinearly. The results of this statistic can play an important role in data analysis and forecasting because they help determine the strength and direction of the relationship between variables [39]. In addition to calculating the correlation coefficient r, it is also necessary to verify its significance level, the p-value. The p-value is an indicator used to judge whether the Pearson correlation coefficient of the sample data is statistically significant, and results with a p-value less than 0.05 are generally considered statistically significant [40].
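$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \tag{1}$$
$x_i$, $y_i$: Variables.
$x_i^2$, $y_i^2$: The square of the variables.
r: Pearson correlation coefficient.
n: Number of paired observations.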
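The Pearson coefficient described in this section can be computed directly from sums of the variables, their squares, and their products. The sketch below uses the standard computational form of the coefficient; the function name and the sample series are hypothetical, and the p-value, which requires the t-distribution, is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient from raw sums (computational form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant series")
    return numerator / denominator


# Hypothetical series: y_up rises with x, y_down falls with x.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

A perfectly increasing linear series gives r = 1 and a perfectly decreasing one gives r = −1, which matches the interpretation of the sign of r as the direction of the relationship.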
2.4. Multiple Linear Regression Analysis (MLR)
Multiple linear regression analysis (MLR) is a traditional statistical method that establishes the relationship between variables based on a mathematical model. It establishes a linear relationship between the response variable and multiple predictor variables and uses this relationship to make predictions. One of the advantages of MLR is that models can be established and outcomes predicted quickly [41]. MLR was performed on the parameters that are correlated with digestion efficiency and gas production. MLR was used to model a linear relationship between two or more independent variables $X_1, X_2, \ldots, X_k$ and a dependent variable $Y$. The general form of the multiple regression model is shown as Formula (2) [38]
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \tag{2}$$
$\beta_0$: Y-intercept.
$Y$: dependent variable.
$X_1, X_2, \ldots, X_k$: independent variable.
$\beta_1, \beta_2, \ldots, \beta_k$: parameter.
$\varepsilon$: the error term.
When building a regression model, it is assumed that a linear equation approximately holds between the predictor variables $X_1, X_2, \ldots, X_k$ and the dependent variable $Y$. The least squares method is used to estimate the regression coefficients of the fitted model, Formula (3). The estimates are the coefficients that minimize the sum of squared errors between the values predicted by the model and the actual values, so that the predictions of the regression model are as accurate as possible.
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \tag{3}$$
$b_0$: Y-intercept.
$\hat{Y}$: predicted Value.
$X_1, X_2, \ldots, X_k$: independent Variables.
$b_1, b_2, \ldots, b_k$: parameters.
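The least squares estimation described above can be sketched without external libraries by forming and solving the normal equations. This is only an illustration under stated assumptions: the function names `fit_mlr` and `predict` are hypothetical, the data are synthetic, and a production analysis would normally rely on a statistics package instead.

```python
def fit_mlr(X, y):
    """Least squares fit; X is a list of predictor rows. Returns [b0, b1, ..., bk]."""
    rows = [[1.0] + list(r) for r in X]          # prepend 1 for the intercept
    k = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; the fit is not unique")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= factor * M[col][j]
    # Back substitution.
    b = [0.0] * k
    for i in reversed(range(k)):
        s = sum(M[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (M[i][k] - s) / M[i][i]
    return b


def predict(b, row):
    """Predicted value for one row of predictors."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], row))


# Synthetic data generated exactly from y = 2 + 3*x1 - 1*x2 (no noise).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [2 + 3 * x1 - x2 for x1, x2 in X]
coef = fit_mlr(X, y)
print([round(c, 6) for c in coef])
```

Because the synthetic response has no noise, the fit should recover the generating coefficients up to rounding; with real plant data the residuals would be nonzero.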
$R^2$ is the coefficient of determination, which measures the goodness of fit of the regression model and indicates the proportion of the variation in the dependent variable that is explained by the model. The value ranges from 0 to 1: the smaller the difference between the predicted and actual values, the closer $R^2$ is to 1; conversely, when there is no relationship between the predicted and actual values, $R^2$ = 0. Formula (4) and related Formulas (5)–(7) are as follows:
$$R^2 = \frac{SSR}{SST} \tag{4}$$
$R^2$: coefficient of Determination.
$SSR$: regression sum of squares.
$SST$: total sum of squares.
$$SST = SSR + SSE \tag{5}$$
$SST$: total sum of squares.
$SSR$: regression sum of squares.
$SSE$: error sum of squares.
$$SSR = \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2 \tag{6}$$
$SSR$: regression sum of squares.
$\hat{y}_i$: predictive value.
$\bar{y}$: the mean of the sample.
$$SSE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{7}$$
$SSE$: error sum of squares.
$\hat{y}_i$: predictive value.
$y_i$: the actual value of the i-th observation.
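The sums of squares behind the coefficient of determination can be checked numerically. The sketch below assumes the predictions come from a least squares fit with an intercept, which is the condition under which the total sum of squares splits exactly into the regression and error parts. The function name and the data are hypothetical; the predicted values are the least squares line fitted to the sample values shown.

```python
def sums_of_squares(actual, predicted):
    """Return (R2, SST, SSR, SSE) for paired actual and predicted values."""
    mean_y = sum(actual) / len(actual)
    sst = sum((y - mean_y) ** 2 for y in actual)                 # total
    ssr = sum((p - mean_y) ** 2 for p in predicted)              # regression
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))   # error
    return ssr / sst, sst, ssr, sse


# Hypothetical observations and their least squares predictions
# (fitted line: y_hat = 0.09 + 1.97 * x for x = 1..5).
actual = [2.1, 3.9, 6.2, 7.8, 10.0]
predicted = [2.06, 4.03, 6.00, 7.97, 9.94]

r2, sst, ssr, sse = sums_of_squares(actual, predicted)
print(f"R2 = {r2:.4f}, SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
```

For predictions that do not come from a least squares fit with an intercept, SSR + SSE need not equal SST, and the ratio SSR/SST is then no longer a reliable measure of fit.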
However, the performance of the regression model should not be evaluated on the coefficient of determination $R^2$ alone; other indicators, such as the percentage error ($PE$), root mean square error ($RMSE$), and mean absolute error ($MAE$), are also needed to evaluate the accuracy of the model. $PE$ expresses the difference between a predicted value and the actual value as a percentage of the actual value; $RMSE$ is the square root of the mean squared difference between the predicted and actual values; $MAE$ is the mean of the absolute differences between the predicted and actual values. $PE$, $RMSE$, and $MAE$ are all indicators used to evaluate the accuracy of the model: the smaller the error value, the higher the accuracy of the model, and vice versa. However, $PE$ is used to compare the prediction accuracy of different models, whereas $RMSE$ and $MAE$ are used to evaluate the prediction accuracy of a single model. The relevant Formulas (8)–(10) are as follows.
$$PE = \frac{\left|y_i - \hat{y}_i\right|}{y_i} \times 100\% \tag{8}$$
$PE$: percentage error.
$\hat{y}_i$: predictive value.
$y_i$: the actual value of the i-th observation.
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \tag{9}$$
$RMSE$: root mean square error.
$\hat{y}_i$: predictive value.
$y_i$: the actual value of the i-th observation.
$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right| \tag{10}$$
$MAE$: mean absolute error.
$\hat{y}_i$: predictive value.
$y_i$: the actual value of the i-th observation.
$n$: total samples.
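The three error indicators can be sketched as follows. The RMSE and MAE follow their standard definitions; the percentage error is computed here per observation as the absolute difference relative to the actual value, which is an assumption about the exact form intended in this section. The function names and the sample values are hypothetical.

```python
import math


def percentage_errors(actual, predicted):
    """Per-observation percentage error (assumed form: |y - y_hat| / |y| * 100)."""
    if any(y == 0 for y in actual):
        raise ValueError("percentage error is undefined when an actual value is 0")
    return [100.0 * abs(y - p) / abs(y) for y, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / n


# Hypothetical actual and predicted values.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

pe = percentage_errors(actual, predicted)
print(pe, round(rmse(actual, predicted), 3), round(mae(actual, predicted), 3))
```

RMSE penalizes large individual errors more heavily than MAE because the differences are squared before averaging, so RMSE is never smaller than MAE on the same data.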
2.5. Artificial Neural Network (ANN)
An artificial neural network (ANN) is a technique that mimics the workings of neurons in the human brain by learning patterns between inputs and outputs to predict outcomes [
42]. Its power lies in its ability to capture complex nonlinear relationships between input and output variables, and it can be used to process large, complex datasets. It learns from data by adjusting its weights and biases and can then make predictions for new data.
In this study, a multilayer perceptron (MLP) artificial neural network was used to avoid the performance problems caused by having too many or too few neurons. An MLP is a feed-forward neural network consisting of an input layer (independent variables), a hidden layer, and an output layer (dependent variables) (Figure 2). The input layer receives the data, the hidden layer processes the data, and the output layer produces the result [33]. The connections between the input layer and the hidden layer carry the weights applied to the input data, and the connections between the hidden layer and the output layer carry the weights applied to the hidden layer nodes. Each node in an MLP has an activation function that transforms the node’s weighted input into an output signal. The weights on the connections determine how much each node contributes to the output. In an MLP, each node receives the outputs of all nodes in the previous layer, and each connection has a weight; these weights are used to calculate the output of each node. Therefore, the hidden layer nodes and weights affect the output layer results, and the adjustment of the weights is the key training process that ensures that the network produces accurate output. Both the hidden layer and the output layer have a bias node, whose output is a constant and which is used to adjust the offset of the entire network.
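The forward pass of such a network can be sketched in a few lines. The following is an illustration only: the sigmoid hidden activation, the linear output node, the layer sizes, and all weight values are assumptions chosen for the example and are not the configuration or the trained weights of this study.

```python
import math


def sigmoid(z):
    """Logistic activation (assumed here; the study's activation is not stated)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights[j][i] links input i to node j."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one hidden layer (sigmoid) -> linear output layer."""
    hidden = dense(inputs, w_hidden, b_hidden, sigmoid)
    return dense(hidden, w_out, b_out, lambda z: z)


# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output (untrained weights).
w_hidden = [[0.5, -0.5], [1.0, 1.0]]
b_hidden = [0.0, 0.0]
w_out = [[2.0, -1.0]]
b_out = [0.5]

output = mlp_forward([0.0, 0.0], w_hidden, b_hidden, w_out, b_out)
print(output)
```

Training would adjust `w_hidden`, `b_hidden`, `w_out`, and `b_out` (for example by backpropagation) so that the outputs approach the measured digestion efficiency and gas production; that step is omitted from this sketch.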
The purpose of this study was to investigate the improvement in anaerobic digestion efficiency and gas production during the digestion process. Therefore, the digestion efficiency and gas production were selected as output variables, and many factors related to anaerobic digestion, such as sludge inflow, temperature, pH value, TS, VS, VS%, organic acid, alkalinity, the HRT, and OLR were used as input variables, and the ANN model was established for prediction and analysis, so as to reflect the digestion efficiency and gas production of the anaerobic digestion process.
3. Results and Discussion
3.1. Effect of Seasonality on Gas Production
Generally, gas production is affected by seasonal factors, and the higher the temperature, the higher the gas production. This is because a higher temperature promotes the metabolism and growth of microorganisms in the digester, increasing the degradation rate of organic waste. In one study, digesters monitored in autumn and winter produced significantly more gas in autumn than in winter, and gas production was lower at low temperatures than at high temperatures [
43].
Dong Jin Lee et al. [
44] investigated the seasonal effects of the organic loading rate and acid phase on methane production during anaerobic digestion at a food wastewater treatment plant in southern Korea. The results showed that seasonal changes have a significant impact on methane production during anaerobic digestion. Due to the high temperature and rain in the summer, acid production in the digester of the food processing plant is high, and the concentration of VFAs increases sharply, which inhibits methane production and leads to a decrease in gas production in the summer. In November, methane production rose, and gas production increased due to cooler temperatures and less rainfall.
3.2. The Relationship between Gas Production and Season
Based on the existing gas production data, this study uses the quartile approach to divide the data set into three categories [
45]: Good Case, Bad Case, and Normal Case. Among them, Good Case and Bad Case represent the upper quartile and lower quartile respectively, while Normal Case refers to the data between these two quartiles. This classification helps to analyze gas production data and identify trends. In particular, quartile values are crucial for distinguishing Good Cases from Bad Cases and represent boundaries that divide gas production data into different categories. This classification method can effectively help researchers better understand gas production data, extract useful information, and provide a reference for follow-up research.
The quartiles divide the dataset into four equal parts. In this method, the upper quartile of digestion efficiency and gas production is defined as the Good Case, and the lower quartile is defined as the Bad Case. Specifically, Good Case values for digestion efficiency range from 45.1% to 83%, Bad Case values range from 0% to 29%, and values between 29% and 45.1% are considered the Normal Case. Good Case values for gas production range from 14,000 to 21,000 m³/day, Bad Case values range from 1400 to less than 10,000 m³/day, and values between 10,000 and 15,000 m³/day are the Normal Case. These ranges can be used to assess the performance of the digestion process in terms of digestion efficiency and gas yield. When the digestion efficiency is higher than 45.1% or the gas production is higher than 14,000 m³/day, the digestion performance is good; conversely, when the digestion efficiency is lower than 29% or the gas production is lower than 10,000 m³/day, the digestion performance is poor. This statistical method can therefore be used to monitor and evaluate the performance of the production process and provides a way to judge its quality based on statistical principles. On this basis, the low production in the summer and autumn is classified as a “Bad Case”, and the high production in the spring and winter is classified as a “Good Case”.
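The quartile-based labeling described above can be sketched with the Python standard library. The treatment of values that fall exactly on a quartile boundary and the quantile interpolation method are assumptions of this sketch, and the sample values are hypothetical rather than plant data.

```python
import statistics


def classify_by_quartile(values):
    """Label each value Good/Normal/Bad Case using the upper and lower quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    labels = []
    for v in values:
        if v >= q3:            # upper quartile (boundary handling is an assumption)
            labels.append("Good Case")
        elif v <= q1:          # lower quartile
            labels.append("Bad Case")
        else:                  # between the two quartiles
            labels.append("Normal Case")
    return q1, q3, labels


# Hypothetical daily gas production values (m3/day).
gas = [8000, 9500, 11000, 12000, 13000, 14500, 16000, 20000]
q1, q3, labels = classify_by_quartile(gas)
for value, label in zip(gas, labels):
    print(value, label)
```

Counting the labels per season then gives the Good Case and Bad Case shares that are compared between seasons.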
Gas production and digestion efficiency are affected by seasonal factors. Factors, such as external temperature and rainfall, will affect the digester, thereby affecting gas production and digestion efficiency [
44,
46]. According to the results in
Figure 3 and
Figure 4, the proportion of Good Cases is relatively large in the winter and spring; the percentages for digestion efficiency are 10% and 49%, and the percentages for gas production are 47% and 80%, respectively. In contrast, the digestion efficiency and gas production in the summer and autumn are not good, accounting for 57% and 22% of the total percentage, respectively, which differs from the general expectation that digestion efficiency and gas production are higher in the summer and autumn. According to the research of E Sánchez et al. [47], a tropical climate, especially in the rainy season, adversely affects anaerobic digestion because of the high temperature and high humidity. The digester temperature becomes too high and excess water dilutes the substrate in the digester, while the humid, hot environment reduces the activity of the microorganisms; this in turn affects the stability of the internal environment of the digester and its digestion efficiency and ultimately the gas production. The seasonal trend in gas production and digestion efficiency observed here could likewise be attributed to the rainy season in South Korea. During the rainy season, the concentration of organic matter and substrates in the wastewater flowing into the WWTP decreases. This reduction ultimately lowers the efficiency of anaerobic digestion, the main process that produces gas in wastewater treatment plants. Therefore, gas production is lower in the summer and autumn. Conversely, in the spring and winter, less rainfall and higher concentrations of organic matter and substrates boost the efficiency of anaerobic digestion and increase gas production, so these seasons fall into the “Good Case” category.
Table 1 and
Table 2 provide information on the various parameters associated with the digestion system. The tables show the parameters within the digester, including temperature, pH, and others, along with the maximum, minimum, and average values of each parameter in both the Good and Bad Cases; these data assist in assessing the performance and reliability of the system. A Good Case is a situation where the system is performing as expected, and a Bad Case is a situation where it is not. No significant difference in the main parameters was observed between the Good Case and the Bad Case, indicating that the system operates in a relatively stable state at this stage. The low gas production may therefore be due to seasonal factors. South Korea’s summer and autumn are the rainy seasons. When rainwater flows into the sewage treatment plant, the temperature and pH in the digester change. Typically, in order to maintain a good digestion environment, the temperature fluctuation should not exceed 1 °C, and if it exceeds 2~3 °C, it will have a great impact on anaerobic digestion [
48]. In addition, the organic matter load and HRT, marked in red in the table, are the only parameters that can be adjusted in the actual operation of the plant in this study. For the other parameters, accurate data may not be obtainable through laboratory operation. Therefore, during actual operation, special attention needs to be paid to these adjustable parameters in order to optimize the performance and efficiency of the system. At the same time, the parameters that cannot be changed need to be treated as constants in order to better understand the performance and behavior of the digester. Statistical analysis is therefore necessary to understand in more detail the impact of the existing parameters on the gas yield of the western WWTP. It can identify which parameters have the most significant impact on gas production and how to optimize them to improve the efficiency and quality of gas production. It also helps to understand and control the variables in the production process so that gas can be produced more reliably. On this basis, more effective production strategies and programs can be formulated to improve the production capacity and benefits of the western WWTP.
Digestion efficiency generally refers to the rate at which organic matter is degraded by anaerobic digestion microorganisms into organic products, and gas is produced in the process. In the example in
Figure 5, the relationship between the digestion efficiency and gas production was examined. An R² value of 0.9002 indicated a very strong positive correlation between sludge anaerobic digestibility and gas production, implying a direct relationship between these two variables. It has been shown that a higher sludge digestibility leads to an increase in gas production during anaerobic digestion, and conversely, a lower digestibility leads to a decrease in gas production [
49]. The strength of this relationship was significant, and 90.02% of the variation in gas production could be explained by the variation in sludge anaerobic digestibility. In practice, this information is useful to the anaerobic digestion operator and can be used to optimize the process.
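The coefficient of determination quoted above can be reproduced from paired observations. The following is a minimal sketch, assuming a simple least-squares line of gas production on digestion efficiency; the function name and the sample values are illustrative assumptions and are not data from the plant.

```python
from statistics import mean

def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical values: digestion efficiency (%) and gas production (m3/day)
efficiency = [40.0, 45.0, 50.0, 55.0, 60.0]
gas = [900.0, 1010.0, 1080.0, 1230.0, 1290.0]
print(round(r_squared(efficiency, gas), 4))
```

For a single predictor, this quantity equals the square of the Pearson coefficient, which is why it can be read as the share of variation in gas production explained by digestion efficiency.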
3.3. Pearson Correlation Analysis Results
Statistical conventions are agreed rules that address the problems caused by uncertainty and subjectivity in data. By following these rules, statistical results can be made more objective and reliable. Statistics is used to analyze data in many areas, including environmental science [
50]. In Pearson correlation analysis, statistical conventions need to be followed to ensure the accuracy and validity of the results. For example, the data are checked for normality during preprocessing, correlation coefficients are calculated, and significance levels and test statistics are set for hypothesis testing to judge whether the results are meaningful. According to statistical conventions, the absolute value of r is usually divided into the following grades: 0.8–1.0, very strong correlation; 0.6–0.79, strong correlation; 0.4–0.59, moderate correlation; 0.2–0.39, weak correlation; 0.0–0.19, very weak or no correlation [
51]. According to the results of Pearson’s correlation analysis, there is a moderate relationship between digestion efficiency and sludge inflow, pH, and HRT and a strong relationship with alkalinity; digestion efficiency shows no correlation with the other parameters. Gas production is correlated with pH, VS%, OLR, HRT, and digestion efficiency and is strongly correlated with alkalinity.
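The grading convention above can be written as a small lookup. This is a minimal sketch: the function name is hypothetical, and because the listed grades leave small gaps between bands (for example, between 0.79 and 0.8), treating each band as a half-open interval is an assumption made here.

```python
def correlation_grade(r):
    """Map a Pearson coefficient to the verbal grade listed in the text."""
    a = abs(r)  # the grades apply to the absolute value of r
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak or none"

# Coefficients reported for digestion efficiency in this study
for name, r in [("alkalinity", 0.631), ("HRT", 0.490), ("influx", -0.498)]:
    print(name, r, correlation_grade(r))
```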
Shuang Zhang et al. [
52] studied the response of the semi-continuous anaerobic digestion of food waste to a gradually increasing temperature, including the methanogenic communities, a correlation analysis, and the energy balance. They used Pearson analysis to assess the linear relationship between the different parameters and further investigated the response of the parameters, as well as methane production. The correlation analysis showed that methane production was positively correlated with temperature, while the correlation between pH and methane production was weak. Neither relationship is strong, but this does not mean that the variables are unrelated.
In this study, the relationship between the two variables can be analyzed according to the correlation coefficient in
Figure 6. The correlation coefficients between digestion efficiency and influx, pH, and HRT were −0.498, 0.498, and 0.490, respectively, indicating a moderate correlation. This means that these three variables are related to digestion efficiency during digestion. The correlation coefficient between digestion efficiency and alkalinity was 0.631, a strong correlation, so alkalinity was more strongly associated with digestion efficiency. The correlation coefficients between gas production and VS%, OLR, and digestion efficiency were 0.464, 0.508, and 0.590, respectively, indicating a moderate correlation. This means that these three variables are related to gas production during digestion. The correlation coefficient between gas production and alkalinity was 0.793, a strong correlation, so alkalinity was more strongly associated with gas production. The correlation coefficients between gas production and pH, organic acids, and HRT were 0.334, 0.381, and 0.282, respectively, indicating only a weak linear relationship. Pearson correlation analysis can detect and describe only linear relationships, not nonlinear ones. The weak or absent correlations therefore suggest that any relationship between these factors and digestion efficiency or gas production may be non-linear. It is worth noting that Pearson correlation analysis can demonstrate only a linear association between two variables, not a causal relationship. In other words, a strong correlation between variable a and variable b shows only that the two variables have a strong linear relationship, not that an increase or decrease in one causes a change in the other.
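The limitation to linear relationships can be illustrated with synthetic numbers. In this minimal sketch (the data are made up and the helper name is an assumption), one response is an exact linear function of x and the other an exact quadratic function; both are fully determined by x, but only the linear one is picked up by the Pearson coefficient.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]     # exact linear dependence on x
y_quadratic = [v ** 2 for v in x]     # exact but non-linear dependence on x

print(pearson_r(x, y_linear))
print(pearson_r(x, y_quadratic))
```

The first coefficient is 1 and the second is 0, even though the quadratic response depends on x just as strictly as the linear one.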
3.4. Comparison between MLR and ANN
In this study, IBM SPSS Statistics 27 was used. Two models, Multiple Linear Regression (MLR) and an Artificial Neural Network (ANN), were compared to determine which was more suitable for the studied dataset. It should be noted that the MLR and ANN models were built from the same data. During experimental modeling, the MLR model included digestion efficiency, gas production, etc. The accuracy and reliability of the ANN predictions were higher (
Table 3). The error of the ANN model was lower than that of the MLR model in terms of the percentage error, root mean square error (RMSE), and mean absolute error (MAE). These indicators were used to measure and evaluate the accuracy and reliability of the models. The results show that the ANN model can predict digestion efficiency and methane production with high accuracy and reliability, which means that it makes more accurate and reliable predictions for our dataset than the MLR model. While MLR models may perform well on some datasets, this study shows that ANN models are a more reliable and accurate choice when dealing with large amounts of data. The main advantage of the ANN model is that it can handle a large amount of nonlinear data, so it has wide applicability in many practical applications. In addition, the ANN model has a self-learning ability: its performance can be improved through training on a large number of datasets. To sum up, this study uses the ANN model when processing large amounts of data to obtain more accurate and reliable prediction results. This is not to say that the MLR model is useless. For situations with less data and obvious linear relationships, the MLR model is still a viable choice.
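The error indicators used for the comparison can be computed as sketched below. The root mean square error and mean absolute error follow their standard definitions; the exact form of the percentage error is not given in the text, so the mean absolute percentage error used here is an assumption, as are the function names and sample values.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_percentage_error(actual, predicted):
    """Mean absolute percentage error in percent (assumed definition)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted gas production values
actual = [100.0, 200.0]
predicted = [110.0, 190.0]
print(rmse(actual, predicted), mae(actual, predicted),
      mean_percentage_error(actual, predicted))
```

Lower values of all three indicators mean that the predictions lie closer to the measured values, which is the sense in which one model is judged better than the other.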
The Pearson analysis shows that not all parameters have a linear relationship with digestion efficiency and gas production, and most parameters show a weak correlation or no correlation. This suggests that the relationships between these parameters and digestion efficiency and gas production may be nonlinear. The advantage of the ANN lies in handling nonlinear relationships, which can be modeled well through neurons and weights.
3.5. ANN Training Test Range
This section details the role of the training and validation split in a multilayer perceptron and the impact of the minimum and maximum numbers of hidden-layer units. In the experiments, we set different training-to-validation ratios and different minimum and maximum numbers of hidden-layer units, analyzed the prediction results, and evaluated the performance of the model. It is usually necessary to analyze multiple input variables to determine their impact on the output variables. This study performed calculations on 17 available input variables using IBM SPSS Statistics 27. These input variables include OLR, HRT, and other parameters that affect anaerobic digestion. The dataset consists of 96 sets of data; by default, 70% are used for training and 30% for validation, the minimum number of units in the hidden layer is 1, and the maximum number of units is 60.
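The default partition described above can be sketched as follows. This is an illustration only: the function name and the fixed seed are assumptions, and the statistics package may assign cases to partitions by a different random procedure, so its exact counts can differ slightly.

```python
import random

def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(96)
print(len(train_idx), len(val_idx))
```

With 96 records and a 70% training share, this gives 67 training and 29 validation cases.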
In learning, the ratio of the training set to the validation set is very important; a training share that is too large or too small will affect the performance of the model. If the training share is too large, too few samples remain for validation, and the estimate of how well the model generalizes becomes unreliable; if the training set is too small, the model tends to overfit and fails to generalize its predictions to new data. Therefore, during the training of an ANN model, the proportions of the training and validation sets are generally kept at 70% and 30% or at 80% and 20% [
53]. The minimum and maximum numbers of hidden-layer units also affect the performance of the multilayer perceptron. If the number of units is too small, the model will be underfitted; if it is too large, the model will be overfitted, that is, dependent on the training data and unable to generalize, so that its prediction performance on a new dataset is poor [
54].
To evaluate the performance of the model, different proportions of the training and validation sets and different minimum and maximum numbers of hidden-layer units were set in the experiment, and the performance of the model was evaluated from the prediction results. The test results show that the prediction of digestion efficiency and gas production is best when the training set is 70% and the validation set is 30%, the minimum and maximum numbers of hidden-layer units are 5 and 6, the activation function of the hidden layer is the hyperbolic tangent function, and the output layer function is the identity function. Therefore, according to the prediction results of this experiment, the proportions of the training and validation sets in the multilayer perceptron should be kept at 70% and 30%, and the minimum and maximum numbers of hidden-layer units should be kept at 5 and 6, to obtain the best prediction results.
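The selected architecture, a hidden layer with hyperbolic tangent activation followed by an identity output, corresponds to the forward pass sketched below. This is a minimal illustration with a single output and hypothetical weights; it is not the trained network from this study, and the function and parameter names are assumptions.

```python
from math import tanh

def mlp_predict(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: one tanh hidden layer, identity output (single output)."""
    hidden = [
        tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Identity output function: the weighted sum is returned unchanged
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical network: 2 standardized inputs (e.g. OLR and HRT), 5 hidden units
w_hidden = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.6], [0.3, -0.3]]
b_hidden = [0.0, 0.1, -0.1, 0.2, 0.0]
w_out = [0.5, -0.2, 0.3, 0.1, 0.4]
b_out = 0.05
print(mlp_predict([0.5, -1.0], w_hidden, b_hidden, w_out, b_out))
```

The number of rows in `w_hidden` is the number of hidden units, which is the quantity bounded by the minimum and maximum settings discussed above.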
Wei-Yao Chen et al. [
30] manipulated and modeled the operating parameters of biogas in a pre-commercial integrated anaerobic-aerobic bioreactor (IAAB) using ANN techniques. As a result, the COD removal rate increased by 23.3%, and the methane yield increased by 13.4%. R. Yukesh Kannah et al. [
55] used an ANN to predict biogas production from the substrate concentration in a mixed upflow anaerobic digester reactor treating landfill leachate. Different OLRs were used in the study, and the accuracy of the ANN predictions was evaluated. As a result, the methane production increased, and the COD removal rate also increased significantly.
3.6. Schemes for Increasing Digestion Efficiency and Gas Production
Observations from the actual operation of a WWTP show that the control and adjustment of multiple parameters, including temperature, pH, and flow rate, are crucial for its efficient operation. There are complex interactions among these parameters, so choosing the right combination of control parameters is of great importance for the WWTP digester. Such a combination of parameters can also be referred to as a vector. In this study, we chose the organic matter load and hydraulic retention time as control vectors. The OLR and HRT are two important parameters affecting digestion efficiency and methane production [
56]. Among them, the OLR refers to the amount of organic matter entering the reactor per unit time, and the HRT refers to the average time that wastewater stays in the reactor. When both the OLR and hydraulic retention time are appropriate, the digestion process can be made more stable, improving digestion efficiency and methane production [
57]. In the WWTP, the OLR and HRT significantly affect the degradation of organic matter by microorganisms in the digester. If the OLR is too high or the HRT is too short, the microorganisms in the digester may not be able to fully degrade the organic matter, resulting in reduced digestion efficiency and even the accumulation of volatile fatty acids, which affects the stability of the entire digestion system. Conversely, if the OLR is too low or the HRT is too long, the microorganisms in the reactor may lose their vitality, and the digestion efficiency and methane production will also decrease. Therefore, it is necessary to balance the OLR and HRT to maintain the activity and stability of microorganisms in the digester and thereby increase the degradation rate of organic matter to improve digestion efficiency and gas production.
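The two control parameters defined above can be computed from the digester volume, the feed flow, and the feed concentration. This is a minimal sketch under stated assumptions: the OLR is expressed on a volatile solids basis per unit digester volume, and the function names and all numerical values are hypothetical rather than taken from the plant.

```python
def hydraulic_retention_time(volume_m3, flow_m3_per_day):
    """HRT in days: digester volume divided by the daily feed flow."""
    return volume_m3 / flow_m3_per_day

def organic_loading_rate(flow_m3_per_day, vs_kg_per_m3, volume_m3):
    """OLR in kg/m3.day: organic mass fed per day per unit digester volume."""
    return flow_m3_per_day * vs_kg_per_m3 / volume_m3

# Hypothetical digester: 3000 m3, fed 100 m3/day of sludge with 40 kg VS/m3
hrt = hydraulic_retention_time(3000.0, 100.0)
olr = organic_loading_rate(100.0, 40.0, 3000.0)
print(hrt, round(olr, 2))
```

Under these definitions the OLR equals the feed concentration divided by the HRT, so for a fixed feed concentration a shorter HRT implies a higher OLR; this coupling is one reason the two parameters have to be balanced together.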
Normally, the range of the HRT in the sludge anaerobic digestion process is 15 to 30 days, and this range is obtained based on practical experience and scientific research [
57,
58]. However, because treatment performance and raw water quality differ between wastewater treatment plants, the HRT range needs to be determined according to the actual situation in practice. In this study, the HRT was assigned a range of 10 to 35 days, which is slightly wider than the usual range. The wider range takes into account differences in factors such as the equipment and raw water quality of WWTPs, so that an appropriate HRT can be found that ensures sludge stability while maximizing digestion efficiency and gas production. The OLR is also an important parameter in the sludge anaerobic digestion process. In this study, the range of the OLR was carefully considered and controlled to achieve a better treatment effect. The OLR needs to be set according to the quality of the raw water, the treatment capacity of the plant, and other factors. The range of OLR values was calculated from the quartile values (minimum, first quartile, median, third quartile, and maximum) of the OLR of the wastewater treatment plant.
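The quartile values used to set the OLR range form a five-number summary, which can be computed as sketched below. The sample OLR values and the function name are hypothetical, and the interpolation rule (`method="inclusive"`) is an assumption; statistical packages differ in how they interpolate quartiles, so results on real data may differ slightly.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical OLR records in kg/m3.day
olr_records = [1.26, 0.86, 1.46, 1.06, 1.16]
print(five_number_summary(olr_records))
```

Adjacent pairs of these five values then define the candidate OLR intervals that are combined with the HRT intervals.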
By sequentially combining the organic matter load in
Table 4 with the HRT, the gas production and digestion efficiency were each predicted. The results showed that when the organic load ranged from 1.26 to 1.46 (kg/m³·day) and the HRT ranged from 10 to 15 days, the gas production was the highest, increasing by 2.8%. This shows that, within this range of organic matter load and HRT, the gas production of the biogas digester is optimal. When the organic matter load ranged from 0.86 to 1.06 (kg/m³·day) and the HRT ranged from 21 to 25 days, the predicted digestion efficiency was 0.7% higher than the actual value (
Table 5). To improve digestion efficiency and gas production at the same time, the organic matter load was finally maintained at 1.26 to 1.46 (kg/m³·day), and the HRT range was 26 to 30 days. The results showed that the gas production increased by 1.3% and the digestion efficiency increased by 0.5% (
Figure 7 and
Figure 8).
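The sequential combination of OLR and HRT ranges amounts to a grid scan over all pairs of ranges. The sketch below illustrates the idea only: `toy_predict` is a made-up placeholder standing in for the trained network, evaluating each range at its midpoint is an assumption, and the ranges listed are just those quoted in the text rather than the full contents of the table.

```python
from itertools import product

def best_combination(olr_ranges, hrt_ranges, predict):
    """Return the (OLR range, HRT range) pair with the highest predicted value."""
    def midpoint(r):
        return (r[0] + r[1]) / 2.0
    return max(
        product(olr_ranges, hrt_ranges),
        key=lambda pair: predict(midpoint(pair[0]), midpoint(pair[1])),
    )

def toy_predict(olr, hrt):
    """Placeholder response surface, NOT the model from this study."""
    return -((olr - 1.0) ** 2) - ((hrt - 20.0) ** 2) / 100.0

olr_ranges = [(0.86, 1.06), (1.26, 1.46)]   # kg/m3.day
hrt_ranges = [(10, 15), (21, 25), (26, 30)]  # days
print(best_combination(olr_ranges, hrt_ranges, toy_predict))
```

Replacing the placeholder with a fitted model's prediction function for gas production or digestion efficiency gives the corresponding best pair for that output.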
In this study, the influence of the OLR and HRT on the digestion efficiency and gas production of the digester was determined. The results showed that within a certain range of the OLR and HRT, the digestion efficiency and gas production can be optimized, but in some cases the predicted value is lower than the actual digestion efficiency, which indicates that the predictive model needs further optimization. It is important to note that these results are based on the specific conditions of the wastewater treatment plant studied. Therefore, in practical applications, the OLR and HRT ranges need to be adjusted and optimized for the individual wastewater treatment plant to obtain better treatment and digestion.
The results in
Table 6 and
Table 7 show that within the studied OLR range of 1.26 to 1.46 (kg/m³·day) and HRT range of 26 to 30 days, the prediction model can improve digestion efficiency and gas production to a certain extent. Especially in the summer and autumn, the improvement of digestion efficiency and gas production was more significant. These results are of great significance for improving the treatment efficiency and reducing treatment costs of sewage treatment plants. However, although the error between the predicted and actual values was within an acceptable range, the digestion efficiency and gas production decreased in the winter. In addition, in the actual operation process, there may be other factors, such as the actual ambient temperature, which may affect the digestion efficiency and gas production, so further research is needed to determine the specific mechanism of these effects.
After the artificial neural network model is established, it must be evaluated and verified to judge its accuracy and reliability. An important indicator is the agreement between the predicted and actual values, which is usually measured by R². It can be seen from
Figure 9 that the R² between the predicted and actual values of the digestion efficiency was 0.9417, and the R² between the predicted and actual values of the gas production was 0.9562. This means that the model is highly predictive and fits the data well, and such high R² values support using the model under different parameter combinations to improve gas production and digestion efficiency.
Using the OLR of 1.26–1.46 (kg/m³·day) and the HRT of 26–30 days in the Bad Case prediction, it can be seen from
Table 7 that both digestion efficiency and gas production improved significantly: the predicted digestion efficiency is 5.2% higher than the actual value, and the predicted gas production is 8.8% higher than the actual value. It can be seen from
Table 8 that the OLR and HRT ranges used in this study can also be applied to abnormal operating conditions, but the OLR and HRT need to be adjusted according to the actual situation in operation to ensure the maximum improvement of digestion efficiency and gas production.
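The improvements quoted as percentages compare a predicted value with the corresponding actual value. A minimal sketch of that arithmetic follows, assuming the improvement is expressed relative to the actual value; the function name and the numbers are hypothetical and chosen only to show the calculation.

```python
def percent_change(actual, predicted):
    """Relative difference of the predicted value from the actual value, in percent."""
    return 100.0 * (predicted - actual) / actual

# Hypothetical gas production values (m3/day)
print(round(percent_change(1000.0, 1088.0), 1))
```

A positive result means that the prediction under the chosen OLR and HRT exceeds the measured value, and a negative result means that it falls below it.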