3.1. Exploratory Data Analysis (EDA)
EDA was conducted to preprocess and assess the dataset. The distributions of the input features and target variables, and the relationships between them, were analyzed to identify outliers, handle missing values, and clean the dataset. Missing values for some biomass property features were filled using other literature from the same research group that reported the missing value for the same feedstock. The final preprocessed and cleaned dataset consisted of 166 datapoints for hydrogen yield prediction and 118 datapoints each for CO, CO2, and CH4 gas yield predictions.
The distribution of the dataset was visualized using the box plot presented in Figure 1. From Figure 1, it can be observed that the ranges of H2, CO, CO2, and CH4 gas yields were (0.02–8.13 mmol/g), (0.00–1.64 mmol/g), (0.03–13.01 mmol/g), and (0.02–7.35 mmol/g), respectively, while their means were 1.84, 0.38, 3.53, and 1.86 mmol/g, respectively. Similarly, the ranges of the ‘Temperature’, ‘Time’, ‘Concentration’, ‘Pressure’, ‘C’, ‘H’, ‘N’, ‘S’, ‘VM’, ‘Moisture’, and ‘Ash’ input features were (300–651 °C), (10–80 min), (1.64–35.00 wt%), (22–29 MPa), (36.10–85.00%), (3.39–6.80%), (0.00–6.40%), (0.00–11.20%), (21.90–95.00%), (1.71–13.69%), and (0.00–16.70%), respectively. The dataset covered a wide range of SCWG process conditions for lignocellulosic biomass, which broadens the scope of the machine learning models. Furthermore, the dataset included both model compounds and real feedstocks, incorporating 28 different types of lignocellulosic feedstocks: cellulose, xylose, lignin, kraft lignin, soybean straw, flax straw, canola straw, rice straw, cotton stalk, wheat straw, canola hull, canola meal, pinewood, orange peel, aloe vera rind, banana peel, coconut shell, lemon peel, pineapple peel, sugarcane bagasse, timothy grass, horse manure, pinecone, canola hull fuel pellet, canola meal fuel pellet, oat hull fuel pellet, barley straw fuel pellet, and partially burnt wood.
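The ranges and means reported above are the kind of per-column summary that a box plot encodes. A minimal sketch of how such statistics might be computed is given below. The column names and the handful of values are hypothetical placeholders, not the study's data, and `summarize` is an illustrative helper rather than the authors' code.

```python
from statistics import mean, median

def summarize(columns):
    """Return {name: (min, max, mean, median)} for each numeric column."""
    summary = {}
    for name, values in columns.items():
        clean = [v for v in values if v is not None]  # skip missing entries
        summary[name] = (min(clean), max(clean), mean(clean), median(clean))
    return summary

# Hypothetical toy values (mmol/g and deg C), for illustration only.
toy = {
    "H2_yield": [0.5, 1.0, None, 3.0],
    "Temperature": [300, 450, 600, 651],
}
stats = summarize(toy)
for name, (lo, hi, avg, med) in stats.items():
    print(f"{name}: range {lo}-{hi}, mean {avg:.2f}, median {med}")
```

Missing entries are simply skipped here; the study instead filled them from related literature before computing the statistics.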
Correlations between pairs of input features, between pairs of output variables, and between input features and output variables are visualized using the Pearson correlation coefficient (PCC) matrix (Figure 2). From Figure 2, temperature and volatile matter (VM) both have the highest PCC of 0.29 with hydrogen yield. This indicates that hydrogen yield tends to increase with temperature, although the linear correlation is modest. It can also be observed that the hydrogen content (H) of biomass is correlated with VM, with a PCC of 0.64. Furthermore, hydrogen content has a comparatively high correlation with hydrogen yield, with a PCC of 0.24. This suggests that biomass with high volatile matter usually has high hydrogen content, which enhances the hydrogen yield. Among output parameters, hydrogen yield is strongly positively correlated with CO2 yield. This is because, in SCWG, hydrogen is produced mainly via reforming and the water–gas shift reaction: the reforming reaction mainly produces hydrogen with CO, which either undergoes the water–gas shift reaction to produce more hydrogen or is consumed via a methanation reaction to produce methane. Since CO2 is produced along with hydrogen via the water–gas shift reaction, the yields of CO2 and H2 are correlated.
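For reference, the PCC between two variables is their covariance divided by the product of their standard deviations, so it ranges from -1 to 1. The sketch below shows one way a PCC matrix could be built from column data. The function names and the toy columns are assumptions made for illustration and do not reproduce the values in Figure 2.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pcc_matrix(columns):
    """Return {(name_a, name_b): PCC} for every pair of columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical toy columns, not the study's data.
toy = {
    "Temperature": [300.0, 400.0, 500.0, 600.0],
    "H2_yield": [0.4, 0.9, 1.1, 2.6],
    "CO_yield": [1.2, 0.9, 0.5, 0.2],
}
matrix = pcc_matrix(toy)
```

A PCC only captures linear association, which is one reason the non-linear models discussed in Section 3.2 can extract more from the same features.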
3.2. Evaluation of Machine Learning Models
LR, GPR, ANN, SVM, DT, RF, XGB, and CatBoost machine learning models were trained on the cleaned and preprocessed dataset to predict the gas yields of SCWG of lignocellulosic biomass. The hyperparameters of these models were optimized for each gas yield using the GA and PSO optimizer algorithms. The parameters of the GA and PSO optimizer algorithms are presented in Table 1 and Table 2. The hyperparameters and their ranges for optimization of each machine learning model for hydrogen yield are presented in Table 3, together with the optimized hyperparameters obtained by the GA and PSO optimizer algorithms. It can be observed that, although both are heuristic optimization algorithms, GA and PSO arrived at different optimized hyperparameters. This is due to differences in the search mechanisms the two algorithms use to find optimal hyperparameters. The hyperparameters of each machine learning model for the prediction of the other gas yields were also optimized using the GA and PSO optimizers.
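To make the PSO search mechanism concrete, the following is a minimal sketch of a particle swarm optimizer over a box-bounded hyperparameter space. It is not the authors' implementation: the inertia and acceleration coefficients (`w`, `c1`, `c2`), the swarm size, and the `toy_error` objective are all assumptions chosen for illustration, whereas the settings actually used in the study are those listed in Table 2. In practice the objective would be a cross-validated error of the model being tuned.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer over a box-bounded search space."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the allowed range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_error(params):
    """Hypothetical stand-in for a validation error surface (not real model output)."""
    learning_rate, max_depth = params  # max_depth treated as continuous here
    return (learning_rate - 0.1) ** 2 + ((max_depth - 6) / 10) ** 2

bounds = [(0.01, 0.5), (2, 12)]
best, best_err = pso_minimize(toy_error, bounds)
```

A GA explores the same space by selection, crossover, and mutation of a population instead of velocity updates, which is why the two optimizers can settle on different hyperparameters for the same model.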
Unoptimized, GA-optimized, and PSO-optimized machine learning models were compared and evaluated using their R2 and MSE values. The R2 and MSE values obtained during training and testing of the machine learning models are presented in Figure 3, Figure 4, Figure 5 and Figure 6. From Figure 3, it can be observed that for prediction of hydrogen yield, the LR model demonstrated poor performance, with the lowest test R2 of just 0.14 and a very high test MSE of 1.79. This is because the LR model fits a linear function: it explains linear relationships between input features and output very well, but it cannot capture complex, non-linear relationships. The poor performance of the LR model therefore signifies a non-linear relationship between the input features and hydrogen yield. GPR and SVM models demonstrated only moderate performance, with test R2 of 0.50 and 0.56, respectively. The GPR model is also based on regression analysis and likewise suffered from the non-linear relationships among SCWG features. The low performance of the SVM model was also due to the non-linearity of the SCWG process, which affects the hyperplane separation and thus the performance of the SVM model.
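The two metrics used throughout this section can be written out directly. The sketch below uses the standard definitions (MSE as the mean squared residual, R2 as one minus the ratio of residual to total sum of squares). The sample values are made up for illustration and are not results from this study.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted yields (mmol/g).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), r2_score(y_true, y_pred))
```

A model that always predicts the mean of the observed values scores an R2 of 0, so a test R2 of 0.14 means the model is only slightly better than that baseline.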
The ANN model usually shows relatively good performance for thermochemical processes. However, here the ANN model demonstrated only moderate performance, with a test R2 of 0.59. This is because simple ANN models are susceptible to overfitting on small datasets, especially those with complex non-linear relationships [34]. This is confirmed by its relatively high training R2 of 0.96, which highlights the overfitting and bias of the ANN model for prediction of the hydrogen yield. The literature also confirms the susceptibility of the ANN model to overfitting for prediction of the hydrogen yield. Zhao et al. [35] reported that, for prediction of the hydrogen yield from SCWG, ANN and GPR models suffered from overfitting because they rely on a single training process and, unlike ensemble tree models, do not utilize statistical averaging or bootstrap sampling. The SVM model is susceptible to overfitting for similar reasons.
Among all unoptimized machine learning models, tree-based models demonstrated high predictive power for hydrogen yield. The unoptimized XGB model showed a high test R2 of 0.78 and a low test MSE of 0.45. The XGB model also demonstrated high prediction capability during training, with a high training R2 of 0.999. This shows the balanced performance of the unoptimized XGB model in both training and testing. The unoptimized CatBoost model also performed well, with a test R2 of 0.72, followed by the unoptimized RF model with a test R2 of 0.69. Among tree-based models, the simple DT model demonstrated the lowest test R2 of 0.61. A simple decision tree model is susceptible to overfitting, which is minimized in ensemble tree models. Ensemble tree models utilize a group of simple decision trees and either average the predictions of the trees, in the case of the RF model, or correct the error of the preceding tree sequentially, in the case of the XGB and CatBoost models [36]. This reduces the bias of a single decision tree model and minimizes overfitting of the dataset by the machine learning model.
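The difference between averaging and sequential error correction can be illustrated with one-split trees (stumps) on a single feature. This is a deliberately simplified sketch: real RF, XGB, and CatBoost models use deeper trees, feature subsampling, regularization, and other refinements that are omitted here, and all function names and data are hypothetical.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree (stump) on a single feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - lmean) ** 2 for v in left)
               + sum((v - rmean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all x identical: fall back to predicting the mean
        overall = sum(y) / len(y)
        return lambda v: overall
    _, split, left_mean, right_mean = best
    return lambda v: left_mean if v <= split else right_mean

def bagging(x, y, n_trees=25, seed=0):
    """RF-style ensemble: average stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(x)) for _ in x]  # sample rows with replacement
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda v: sum(s(v) for s in stumps) / len(stumps)

def boosting(x, y, n_trees=25, learning_rate=0.5):
    """Boosting-style ensemble: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    residual = [yi - base for yi in y]
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, residual)
        stumps.append(stump)
        # Shrink the correction and update what is still unexplained.
        residual = [r - learning_rate * stump(xi) for r, xi in zip(residual, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)

# Hypothetical toy data with a step change.
x = [1, 2, 3, 4]
y = [0.0, 0.0, 1.0, 1.0]
stump = fit_stump(x, y)
bagged = bagging(x, y)
boosted = boosting(x, y)
```

In the bagging variant every tree sees a different resample and the average smooths out individual errors, while in the boosting variant each new tree targets what the previous trees left unexplained.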
The use of GA and PSO optimizers improved the prediction power of nearly all machine learning models. In general, machine learning models tuned by the PSO algorithm outperformed those tuned by the GA algorithm. This is due to the difference in how the two algorithms search for optimal solutions, which resulted in different optimized hyperparameters. Since the performance of a machine learning model is dictated by its hyperparameters, GA- and PSO-optimized models differ in their prediction capabilities. Among all unoptimized, GA-optimized, and PSO-optimized machine learning models, the PSO-optimized XGB model demonstrated the highest test R2 of 0.84 and the lowest test MSE of 0.34. Interestingly, while the PSO optimizer improved the test R2 of the XGB model, it also reduced the training R2 to 0.98. In comparison, the training R2 scores were 0.999 for the unoptimized model and 0.98 for the GA-optimized model. This highlights the effectiveness of the PSO optimizer algorithm in improving the robustness of the model by minimizing its bias and overfitting, which resulted in improved performance on the test dataset. PSO-optimized CatBoost also performed well, with a high test R2 of 0.80 and a low test MSE of 0.41. The order of test R2 values among PSO-optimized machine learning models was XGB-PSO > CAT-PSO > RF-PSO > DT-PSO > ANN-PSO > GPR-PSO > SVM-PSO. On the other hand, the order of R2 among GA-optimized machine learning models was CAT-GA > XGB-GA > RF-GA > SVM-GA > DT-GA > GPR-GA > ANN-GA.
For prediction of CH4 yield, among unoptimized machine learning models, the CatBoost model showed superior performance, with a high test R2 of 0.78 (Figure 4). The unoptimized RF model also performed well, with a test R2 of 0.66. Similar to the prediction of hydrogen yield, the LR model performed worst among all machine learning models for prediction of methane yield, with the lowest test R2 of 0.34 and the highest test MSE of 1.60. The performance of the machine learning models was improved by the PSO and GA optimizer algorithms; in general, however, the improvement over the unoptimized models was larger with the PSO optimizer than with the GA optimizer. The three best performing models were all CatBoost models: the PSO-optimized CatBoost model had the highest test R2 of 0.83, followed by 0.81 for the GA-optimized CatBoost model and 0.78 for the unoptimized CatBoost model. GA- and PSO-optimized XGB models showed comparable performance, with test R2 of 0.77 and 0.76. LR, GPR, and DT models demonstrated the worst performance among all machine learning models.
Similar to the prediction of hydrogen and methane yields, CatBoost and XGB were the best performing machine learning models for prediction of the CO yield of SCWG of lignocellulosic biomass (Figure 5). Among unoptimized machine learning models, the CatBoost model had the highest test R2 of 0.86, followed by 0.83 for both the XGB and RF models. The improvement over the unoptimized models was again larger with the PSO optimizer algorithm than with the GA optimizer algorithm. The PSO-optimized CatBoost model was the best performing machine learning model for prediction of CO yield, with the highest test R2 of 0.94, followed by the GA-optimized CatBoost and PSO-optimized XGB models. LR and GPR were the worst performing machine learning models.
The CatBoost model was also able to predict the CO2 gas yield of the SCWG process (Figure 6). Among unoptimized machine learning models, the CatBoost model demonstrated high prediction performance, with a test R2 of 0.83, followed by 0.81 for XGB and 0.78 for RF. The PSO optimizer further improved the performance of the machine learning models: the CatBoost-PSO model had the highest test R2 of 0.92 among all machine learning models, followed by 0.90 for XGB-PSO. Similar to the prediction of the other gas yields, the PSO optimizer performed better than the GA optimizer in improving the performance of the unoptimized machine learning models. SVM, GPR, and LR had the worst performance in the prediction of CO2 gas yield.
Thus, boosting ensemble-based machine learning models such as CatBoost and XGB were clearly the best performing models for the prediction of gas yields of the SCWG process. This is attributed to the tabular, structured nature of the dataset, for which ensemble tree-based models tend to perform best [37]. Boosting ensemble tree models have also demonstrated superior prediction power in other thermochemical processes such as pyrolysis [38], hydrothermal liquefaction [39], and hydrothermal carbonization [40]. Moreover, these boosting models combine the decisions of multiple simple tree models, each of which learns from the errors of the preceding tree. This limits overfitting, especially for smaller datasets. Studies have also shown that ensemble boosting algorithms outperform even deep learning models on a variety of tabular datasets [41].
In conclusion, the XGB-PSO model demonstrated the highest prediction power for H2 yield, and the CatBoost-PSO model for CH4, CO, and CO2 yields, for SCWG of lignocellulosic biomass. Overall, the PSO optimizer was more effective than the GA optimizer for hyperparameter tuning of the machine learning models. Owing to this superior performance, XGB-PSO was selected for further analysis of the prediction of H2 yield, and CatBoost-PSO for CH4, CO, and CO2 gas yields.
3.3. Feature Analysis and Summary Plots
The impact of input features and their relative importance for the prediction of gas yields of the SCWG process were studied using SHAP analysis. SHAP analysis helps to overcome the black-box nature of machine learning models: SHAP values quantify the contribution of each feature to an individual prediction of a machine learning model. The ‘base case’ refers to the model’s prediction using no feature information. Thus, a positive SHAP value indicates that a feature has increased the prediction relative to the base case, while a negative value indicates that it has decreased the prediction. A SHAP value of zero suggests the feature has no impact on the prediction relative to the base case. This metric offers an intuitive means to interpret complex model predictions.
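The relationship between the base case and the SHAP values can be seen in a small worked example. The sketch below computes exact Shapley values by enumerating all feature coalitions, replacing features outside a coalition with values from a background set. This brute-force approach is only feasible for a few features and is not the algorithm used by SHAP tooling for tree models; the linear `toy_model`, the instance, and the background rows are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of `model` at instance `x`.

    Features outside a coalition are replaced by rows from `background`
    and the model output is averaged over those rows.
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    base = value(set())  # prediction with no feature information
    return base, phi

def toy_model(features):
    """Hypothetical linear model, used only to make the result easy to check."""
    return 3.0 * features[0] - 2.0 * features[1]

x = [2.0, 1.0]
background = [[0.0, 0.0], [2.0, 3.0]]
base, phi = shapley_values(toy_model, x, background)
print(base, phi)  # base case plus the SHAP values equals the prediction for x
```

Here the first feature pushes the prediction up from the base case and so does the second, and the base value plus the sum of the SHAP values recovers the model output for the instance.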
Feature importance plots, summary plots, and heat maps of SHAP values of input features for prediction of H2, CH4, CO, and CO2 yields are presented in Figure 7, Figure 8, Figure 9 and Figure 10. From Figure 7, it can be observed that temperature was the most dominant feature for prediction of hydrogen yield by the XGB-PSO model, with a feature importance of 21.93%, followed by hydrogen content (H, 15.38%), ash content (ash, 11.68%), time (10.84%), concentration (9.41%), and carbon content (C, 7.30%). The high feature importance of temperature for hydrogen yield is due to the endothermic nature of reforming, hydrolysis, and water–gas shift reactions, which are favored at high reaction temperatures, enhancing the hydrogen yield [42]. The SHAP summary plot also shows that an increase in temperature increased the SHAP values for hydrogen yield, shifting the points to the right (more positive). A more detailed analysis of each instance is provided in the heat map of SHAP values for hydrogen yield.
Hydrogen content was the second most dominant feature for prediction of hydrogen yield, and hydrogen yield increased with hydrogen content. This shows that the hydrogen content of biomass contributes to the hydrogen yield, and biomass with higher hydrogen content is recommended to achieve high hydrogen yield. An increase in ash content also improved hydrogen yield. This is due to the presence of alkali and alkaline earth metals (AAEMs) in the ash of lignocellulosic biomass [43]. These AAEMs have catalytic effects in promoting reforming, hydrolysis, and water–gas shift reactions, which improve the hydrogen yield [44]. However, very high ash content decreased the hydrogen yield, as indicated in the summary plot. Thus, only an optimum amount of ash in biomass is beneficial for hydrogen yield. The nature of the ash components should also be considered, as some components are more catalytically active for SCWG than others. For example, silica (SiO2) present in ash has very little catalytic activity compared to highly active potassium (K) for the SCWG reaction.
An increase in reaction time (time) also increased the hydrogen yield, since a longer reaction time allows the hydrolysis and reforming reactions that produce hydrogen and CO to proceed, and enables the water–gas shift reaction of CO in excess water to produce more hydrogen [45]. Therefore, a longer reaction time is beneficial for high hydrogen yield. However, an increase in feedstock concentration decreased the hydrogen yield, since hydrogen is also produced from supercritical water (SCW), which acts as a reactant in reforming, hydrolysis, and water–gas shift reactions. Feedstock concentration measures the amount of biomass in the water–biomass mixture, so a high concentration indicates a greater amount of biomass in a relatively low amount of water. Therefore, at high feed concentrations, that is, at low water content in the feedstock mixture, reforming, hydrolysis, and water–gas shift reactions diminish as per Le Chatelier’s principle, which decreases the hydrogen yield [46]. An increase in carbon content (C) improved the hydrogen yield only up to a certain extent; a further increase did not improve it. Thus, biomass with high carbon content is beneficial only up to a point.
These results are in agreement with the reported literature on SCWG of lignocellulosic biomass, where temperature is the most important parameter for the SCWG process, followed by reaction time and feedstock concentration, while reaction pressure is the least influential reaction parameter [47]. Among biomass properties, hydrogen content (H), ash content (ash), and carbon content (C) were the most dominant features, which indicates that biomass with high hydrogen content and moderate ash and carbon content is recommended for high hydrogen yield. Overall, biomass properties had a combined feature importance of 52.91%, compared to 47.09% for SCWG process parameters, for the prediction of hydrogen yield. This highlights the importance of screening for the most suitable lignocellulosic biomass feedstocks to achieve high hydrogen yield. Thus, although optimization of SCWG conditions is necessary for improving hydrogen yield, more attention should be paid to the selection of suitable lignocellulosic biomass for holistic optimization of the SCWG process.
Feature analysis based on CatBoost-PSO for methane yield showed that the carbon content (C) of the biomass was the most dominant feature, with the highest feature importance of 24.85%, followed by temperature (17.65%) and volatile matter (VM) of biomass (8.78%) (Figure 8). An increase in the carbon content of the biomass increased the methane yield. This is because the carbon in the SCWG product gas originates from the carbon content of the biomass. An increase in reaction temperature also increased the methane yield, which is due to the hydrogenation and methanation of produced CO and CO2 at high reaction temperatures [48]. Similarly, an increase in volatile matter (VM) of biomass enhanced methane yield due to the ease of gasification of volatile matter in SCWG. Volatile matter represents alcohols, ketones, aldehydes, and organic acids, which are intermediates of gasification in SCW and are easily converted into gaseous products such as methane, H2, and CO2 [13]. Therefore, an increase in the volatile matter of biomass increased methane yield. As with hydrogen yield, biomass properties outweighed SCWG reaction conditions, with a cumulative feature importance of 63.19% compared to 36.81%.
For CO gas yield, feature analysis revealed that ash content (ash), volatile matter (VM), temperature, time, and concentration were the most influential features, with feature importance of 16.93, 15.35, 15.05, 14.27, and 11.08%, respectively (Figure 9). Low to moderate ash content had a negative impact: even though AAEMs in ash enhance reforming reactions, they also enhance the water–gas shift reaction, so most of the produced CO is consumed for hydrogen production. Only at very high ash content, where the water–gas shift reaction attains equilibrium and is no longer enhanced by additional AAEMs, does the CO yield increase. This is consistent with the hydrogen yield results, where very high ash content decreased the hydrogen yield. Similarly, an increase in volatile matter decreased the CO yield, as high volatile matter promotes the further conversion of CO into methane and hydrogen via water–gas shift, methanation, and hydrogenation reactions. An increase in temperature and time also decreased CO yield. At high reaction temperature, water–gas shift reactions dominate and consume the CO. A short reaction time does not allow sufficient time for further conversion of produced CO by consecutive water–gas shift and methanation reactions, so CO yield is high at short reaction times and decreases at longer reaction times. However, an increase in feedstock concentration increased the CO yield. This is due to the diminished activity of water–gas shift, methanation, and hydrogenation reactions at high feedstock concentrations, which leaves CO unconverted and thus increases its yield [49].
Similar to methane yield, carbon content (C) of the biomass was the most dominant feature for prediction of CO2 gas yield, with a feature importance of 29.73%, followed by temperature (12.64%), volatile matter (VM, 9.14%), and time (8.84%) (Figure 10). This is because most of the carbon in the product gas comes from the biomass itself. Thus, an increase in carbon content of biomass increased the CO2 gas yield. An increase in temperature increases the conversion of CO to CO2 and hydrogen by enhancing the water–gas shift reaction. Similarly, an increase in volatile matter of biomass favors the production of gaseous products due to the ease of gasification of volatile matter, resulting in an increase in the yield of CO2 [50].
An increase in time also allows sufficient time for further conversion of produced CO gas by enhanced water–gas shift reactions at longer reaction time, which increases the CO2 gas yield.
It can be observed that even though temperature has a high influence among SCWG process features, biomass properties as a whole have feature importance of 52.91, 63.19, 57.54, and 68.37%, compared to 47.09, 36.81, 42.46, and 31.63% for SCWG process parameters, for prediction of H2, CH4, CO, and CO2 gas yields, respectively. Thus, biomass characteristics play a key role in the SCWG degradation mechanism of lignocellulosic biomass, which influences the gas distribution of the SCWG process. The characteristics of biomass should therefore be considered while optimizing SCWG process parameters. Biomass properties and SCWG process parameters also have interactive effects during the gasification of lignocellulosic biomass in SCW. Hence, studying the interactive effects of these input features on gas yields is important for understanding the degradation mechanism of lignocellulosic biomass in SCWG.
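The grouped shares quoted above can be reproduced from per-feature importances by simple summation. The sketch below assumes, as is common practice but not stated in this section, that the percentage importances are mean absolute SHAP values normalized to 100%. The grouping of features into process parameters and biomass properties follows the feature list in Section 3.1, while the SHAP rows are hypothetical.

```python
PROCESS = {"Temperature", "Time", "Concentration", "Pressure"}

def importance_percent(shap_rows, names):
    """Normalize mean |SHAP| per feature so that all features sum to 100%."""
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / len(shap_rows)
                for j in range(len(names))]
    total = sum(mean_abs)
    return {name: 100.0 * m / total for name, m in zip(names, mean_abs)}

def group_shares(percent):
    """Split the total importance into process and biomass shares."""
    process = sum(v for k, v in percent.items() if k in PROCESS)
    return {"process": process, "biomass": sum(percent.values()) - process}

# Hypothetical SHAP values for three features over two samples.
names = ["Temperature", "H", "Ash"]
shap_rows = [[0.4, -0.3, 0.1], [-0.6, 0.3, 0.3]]
percent = importance_percent(shap_rows, names)
shares = group_shares(percent)

# Grouped shares reported in the text: (biomass %, process %) per gas.
reported = {"H2": (52.91, 47.09), "CH4": (63.19, 36.81),
            "CO": (57.54, 42.46), "CO2": (68.37, 31.63)}
for gas, (biomass, process) in reported.items():
    print(gas, round(biomass + process, 2))  # each pair sums to 100
```

Taking absolute values before averaging matters, because a feature whose positive and negative contributions cancel on average can still be highly influential.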
3.4. Two-Way SHAP Analysis
SHAP dependency plots for investigating the interactive effects of the most dominant input features on the prediction of gas yields are presented in Figure 11, Figure 12, Figure 13 and Figure 14. In a SHAP two-way dependency plot, the x-axis shows the value of feature 1 and the primary y-axis shows its effect as the SHAP value for the target variable (gas yield). The value of feature 2 is represented on the secondary y-axis using a color gradient. This helps to visualize the interactive effects of the two input variables on the SHAP values of the gas yields.
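As a rough illustration of what such a plot contains, the sketch below assembles the points of a two-way dependence plot from precomputed SHAP values and then compares the mean SHAP value of feature 1 for low and high values of feature 2. The helper names, the median split, and the data are assumptions made for this example; they are a crude numerical stand-in for reading the color gradient, not the procedure used to produce Figures 11 to 14.

```python
from statistics import median

def dependence_points(X, shap_rows, names, feature, color_by):
    """Return (x, shap, color) triples for a two-way SHAP dependence plot."""
    i, j = names.index(feature), names.index(color_by)
    return [(row[i], s[i], row[j]) for row, s in zip(X, shap_rows)]

def _mean(values):
    return sum(values) / len(values) if values else float("nan")

def interaction_summary(points):
    """Mean SHAP of feature 1 for low vs. high values of feature 2."""
    cut = median(c for _, _, c in points)
    low = [s for _, s, c in points if c <= cut]
    high = [s for _, s, c in points if c > cut]
    return _mean(low), _mean(high)

# Hypothetical feature values and SHAP values (not the study's data).
names = ["Temperature", "H"]
X = [[400, 4.0], [600, 4.0], [400, 6.5], [600, 6.5]]
shap_rows = [[-0.2, -0.1], [0.3, -0.1], [-0.1, 0.2], [0.9, 0.4]]

points = dependence_points(X, shap_rows, names, "Temperature", "H")
low, high = interaction_summary(points)
```

In this toy case the SHAP value of temperature is larger when hydrogen content is high, which is the kind of pattern an interactive effect produces in the plot.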
From Figure 11, it can be observed that the input features had interactive effects on the prediction of hydrogen yield. An increase in temperature for biomass with high hydrogen content resulted in the highest SHAP values for hydrogen yield. High SHAP values can also be achieved for biomass with moderate hydrogen content at high reaction temperatures. However, high hydrogen content at low reaction temperatures does not necessarily translate into high hydrogen yield. Similarly, moderate ash content helped to achieve high hydrogen yields at high reaction temperatures, whereas a low reaction temperature did not result in high hydrogen yield even at optimum ash content. Hydrogen content (H) and ash content did not show much interaction at low hydrogen content and low ash content. Only at an optimum ash content did an increase in hydrogen content result in the highest hydrogen yield.
Similarly, for time and hydrogen content, the highest hydrogen yield was obtained at the highest hydrogen content and longer reaction time. However, at shorter reaction times and low hydrogen content, these features did not show much interaction for hydrogen yield. This indicates that for efficient conversion of the hydrogen content of biomass into hydrogen gas during gasification, a high reaction temperature, a long reaction time, and an optimum ash content are required. Reaction temperature and reaction time showed interactive behavior, with the highest hydrogen yields obtained at high reaction temperature and longer reaction time. However, a comparable hydrogen yield can also be obtained at short reaction times if the reaction temperature is high. Similarly, reaction time and concentration showed interactive behavior, and the highest hydrogen yield was obtained at low feedstock concentration and longer reaction times. A high hydrogen yield was obtained even at moderate to high concentrations at longer reaction times.
For prediction of methane yield, carbon content (C) and temperature showed strong interactive behavior (Figure 12). The highest SHAP values for methane yield were obtained at high carbon content and high reaction temperatures, and strong interactive behavior was observed at moderate to high carbon content and high reaction temperatures. Volatile matter (VM) and carbon content also showed strong interactive behavior: biomass with high volatile matter usually had moderate to high carbon content, which resulted in the highest methane yield. Temperature and VM also had strong interactive behavior, and high methane yield was obtained at high reaction temperature and high volatile matter. However, high SHAP values for methane yield were also observed at moderate VM and high temperatures, or at moderate temperatures for biomass with high VM content. Temperature and time showed interactive behavior at high temperature and longer reaction time. Similarly, hydrogen and carbon content of biomass had an interactive effect on methane yield at high carbon and high hydrogen content, which resulted in the highest SHAP values for methane yield.
Interestingly, ash content and volatile matter (VM) of biomass showed interactive behavior on SHAP values of CO yield: the highest SHAP values were observed only at high ash content and moderate volatile matter (Figure 13). Temperature had strong interactive behavior with ash content, volatile matter, and time. High SHAP values of CO yield were obtained at low temperatures and low ash content of biomass. Similarly, low temperature and moderate volatile matter resulted in the highest SHAP values of CO yield, and high CO yields were obtained at short reaction times and low temperatures. Time also had strong interactive behavior with concentration and volatile matter for CO yield: high SHAP values were observed at short reaction times and high concentrations, and at low volatile matter and short reaction times. Volatile matter and concentration also had a strong interactive influence on CO yield, with high SHAP values obtained at low volatile matter and high feedstock concentration.
Two-way SHAP analysis for prediction of CO2 yield showed that the carbon content (C) of biomass and temperature had a strong interactive influence on SHAP values of CO2 yield. The highest values of CO2 yield were observed at high carbon content and high temperatures (Figure 14). Similarly, carbon content also showed interactive behavior with volatile matter, where an increase in both volatile matter and carbon content increased the SHAP values of CO2 yield. Temperature and volatile matter had interactive effects at high values of both, where an increase in volatile matter at high reaction temperature increased the SHAP values of CO2 yield. However, comparably high SHAP values were also observed at moderate volatile matter and high temperatures.
Temperature and time also had a strong interactive influence on CO2 yield. Longer reaction time and high temperature had the highest SHAP values of CO2 yield. Hydrogen content also demonstrated strong interactive behavior with volatile matter and carbon content of biomass on SHAP values of CO2 yield. High volatile matter and high hydrogen content showed the highest values of CO2 yield. This is due to the relationship between volatile matter and hydrogen content of biomass, as volatile matter of biomass represents high quantities of organic acids, hydrocarbons, alcohols, aldehydes, and ketones. These compounds usually have higher amounts of hydrogen atoms; thus, an increase in volatile matter also represents an increase in hydrogen content, which had a positive interactive influence on the SHAP values of CO2 yield. High values of hydrogen content and carbon content also resulted in high values of CO2 yield.
Thus, the degradation of lignocellulosic biomass follows a complex reaction mechanism, and input variables such as SCWG reaction conditions and biomass properties have an interactive influence during SCWG of lignocellulosic biomass. These interactive influences of input features have an effect on the product distribution and individual gas yields of the SCWG process. Two-way SHAP analysis highlighted the strong interactive influence of the most dominant features on yields of H2, CH4, CO, and CO2. This shows that the optimization of SCWG of lignocellulosic biomass is a complex process and requires careful simultaneous tuning of various parameters to maximize the hydrogen yield of the SCWG process.
Thus, this study presented a novel and comprehensive application of machine learning models for SCWG of lignocellulosic biomass to elucidate the interactive effects of input features and their complex relationship with gas yields. Utilization of only lab-scale batch reactor data helped to better capture the relationships between input variables and gas yields with minimum influence of other unaccounted-for variables. However, it also resulted in limited scope of the prediction models only to a lab-scale batch reactor. Nevertheless, the main objective of this study was to understand the complex degradation behavior of lignocellulosic biomass during SCWG and interactive effects of input variables on gas yields. This study presented a groundwork for comprehensive optimization of SCWG reaction conditions and selection of suitable biomass, especially at industrial scale, to maximize hydrogen gas yields with a high degree of certainty. This will foster efforts being made for commercialization of SCWG at an industrial scale.