1. Introduction
The rapid growth of the global population alongside the ongoing trend in urbanisation has resulted in a significant surge in energy requirements. This escalating demand for energy necessitates the exploration of innovative, dependable, and environmentally friendly energy sources. One such solution is the utilisation of hydrogen (H2) gas as a versatile and sustainable energy carrier [1]. Therefore, it is crucial to possess a comprehensive understanding of the thermodynamic properties of H2 under various conditions. This knowledge is indispensable for effectively navigating the behaviour of H2 across diverse pressure (P), temperature (T), and environmental contexts. By delving into the interplay of H2’s thermodynamic characteristics, researchers and engineers can make informed decisions, optimise processes, and ensure the safe and efficient utilisation of H2 in a wide range of applications. Whether in energy systems, industrial processes, or scientific investigations, a profound grasp of H2’s thermodynamics empowers us to harness its potential with precision and confidence.
H2 holds substantial significance within the realms of both the petroleum and chemical industries, exemplifying its multifaceted utility. In the pursuit of enhancing the quality of heavy petroleum fractions, a pivotal strategy involves elevating the H2-to-carbon ratio. This objective is achieved by incorporating H2 into hydrocarbons through the hydrocracking process [2]. Consequently, H2 solubility in hydrocarbon systems emerges as a pivotal thermodynamic parameter, exerting considerable influence over the design, optimisation, and efficiency of diverse chemical and petroleum industrial processes, as well as the associated equipment.
The solubility of H2 in hydrocarbon systems is influenced by P, T, and the nature of the hydrocarbon compound. From a thermodynamic perspective, the solubility of H2 in hydrocarbons increases with increasing T, P, and hydrocarbon Carbon Number (CN). This trend has been substantiated by experimental findings documented in the literature [3,4,5]. Elevated P and T, together with a higher hydrocarbon CN, foster greater interaction between H2 molecules and the hydrocarbon matrix, leading to enhanced solubility.
Field and laboratory measurements of H2 solubility in hydrocarbons provide precise results, but both methods are demanding in terms of time and resources. Moreover, comprehensive experiments involving heavy hydrocarbon systems at elevated P and T carry considerable risk, rendering this option unappealing within the industry. Consequently, the rapid and accurate determination of H2 solubility is of utmost importance, and the industry seeks an approach that efficiently balances accuracy and speed. Rapid and precise H2 solubility determination has transformative implications, fostering safe and efficient decision-making within various sectors, including the chemical and petroleum industries.
Empirical correlations, Equations of State (EoS), and intelligent strategies present promising avenues for predicting H2 solubility in hydrocarbon systems, offering expedited and cost-effective alternatives to experimental measurements. Nonetheless, the inherent complexity and non-linear nature of H2 solubility’s dependence on P, T, and the characteristics of n-alkanes complicate the effectiveness of traditional empirical correlations and EoS methods. One of the challenges with EoS methods is the time-consuming process of calibrating various parameters for each specific system. This involves extensive adjustments that can be computationally intensive, particularly when striving for high accuracy across different n-alkanes and operational conditions. Furthermore, near the critical point, where phase behaviour is particularly sensitive, EoS models often struggle to maintain accuracy. The non-linear interactions between H2 and hydrocarbons become even more pronounced in these regions, further complicating the prediction process [6,7,8]. Consequently, the development and application of advanced predictive models, potentially incorporating Machine Learning (ML) techniques, emerge as valuable pursuits in enhancing the accuracy and reliability of H2 solubility predictions. Such models can better navigate the intricate relationships that underlie H2 solubility behaviour across diverse hydrocarbon systems and operational conditions. A comprehensive literature review on the mentioned paradigms is provided in our previous study [9].
Recent advancements in ML and deep learning have seen their application in various aspects of renewable energy research, such as optimising the operation of electricity–gas–heat-integrated multi-energy microgrids under uncertainties [10], enhancing security in real-time vehicle-to-grid dispatch [11], improving power forecasting in renewable power plants through novel graph structures [12], calculating dew point pressure in gas condensate reservoirs [13], and the application of Decision Trees (DTs) for the calculation of H2 solubility in different chemicals [14]. In line with these developments, our study leverages Deep Neural Networks (DNNs) to accurately predict H2 solubility in n-alkanes, contributing to the efficient design of H2-based energy systems. This work underscores the growing importance of advanced modelling techniques in promoting sustainable energy solutions.
The primary objective of this study is to evaluate the feasibility of employing DNNs for predicting H2 solubility in n-alkanes. The investigation focuses on two pivotal aspects. First, we analyse the impact of distinct model structures on predictive performance. Second, we investigate the influence of incorporating dropout layers to mitigate overfitting. To achieve these goals, three distinct DNN models are constructed, compiled, and trained following robust methodologies, and extensive assessments are carried out to verify that each model delivers reliable predictions. In the final stage of this study, a comprehensive stability analysis is executed to assess both the accuracy and precision of the developed models and to ascertain their generalisability. Through this process, we gain valuable insights into the models’ performance consistency and their ability to extrapolate knowledge to previously unseen data.
This paper comprises four distinct sections, each serving a specific purpose in addressing the research objectives. It begins with a concise introduction that outlines the context and aims of this study. Following this, Section 2 presents a detailed description of the modelling approaches and the database utilised, providing insights into their composition and characteristics. Section 3 presents the results and discussion, covering diverse aspects including the development of predictive models, analysis of errors, evaluation of stability, and a comparison with existing literature models; this section provides a comprehensive understanding of the models’ performance and their implications. The paper concludes in Section 4 with a summary of key insights derived from this study’s findings and an outline of future prospects.
2. Modelling
The initial phase of model development entails data acquisition, a critical foundation for building a robust ML model. The next step involves dividing the database. In this study, the dataset is separated into three sets: training, validation, and testing. Although extensive data cleaning and quality checks were carried out in our previous study [9], the database was reviewed here again for any dubious samples. During model fitting, it is imperative to use only the training and validation sets. The developed model was then applied to the testing set. The following sections provide detailed discussions of data splitting, model development, and testing data modelling.
The framework illustrated in Figure 1 serves as a roadmap, outlining the sequence of steps integral to the development of the model. As shown, there are three main steps: data preparation, training (enclosed by the blue dashed line), and testing (enclosed by the red dashed line). Data preparation includes database development and splitting. The scaler and model are developed in the training phase and are then used in the testing step. During the training phase, it is crucial to utilise only the training and validation subsets. This deliberate isolation is designed to enhance the model’s ability to generalise beyond the specific instances on which it has been trained. By restricting the model’s exposure to the testing set, the integrity of the evaluation process is maintained, ensuring that performance assessments remain unaffected by any unintended familiarity with the testing data. Following model development, the resulting model is subsequently applied to the testing set. This stage serves as a test of the model’s predictive capability and its ability to generalise to unseen data. A thorough evaluation against the testing set validates the model’s real-world applicability and its capacity to provide informed predictions beyond the training context.
The following sections examine in detail the complexities of data partitioning, the methodologies used in model development, and the rigorous evaluation of the developed models. Through these discussions, a comprehensive understanding of the challenges and details of the methodology is presented, along with the valuable insights it can generate.
2.1. Database Development
A full presentation of the database development is provided in our previous study [9]. All the experimental samples were gathered from the open literature [3,5,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41]. The database underwent an additional review to ensure data quality. To avoid sparse data samples, we focused on a specific range of pressure (0.101–559.5 MPa) and temperature (92.3–664.05 K). Compared to the previous study [9], the operational variable cut-offs were adjusted, and no Hat-outliers [42,43] were excluded from the database.
The solubility of H2 in n-alkanes depends on two primary categories of independent variables: the type of n-alkane and operational factors. While a range of characteristics can be used to describe n-alkanes, this study focused specifically on the critical features essential for accurate estimation. The selected critical features for this study include CN, critical temperature (TC) in Kelvin (K), and critical pressure (PC) in MPa. Operational factors are represented by P and T, which reflect the conditions under which solubility is measured. The characteristics of the n-alkanes utilised in this study are detailed in Table 1.
Furthermore, to enhance the analysis, two engineered features—dimensionless temperature (TD) and dimensionless pressure (PD)—are introduced. These features are derived by dividing the actual values of T and P by their respective critical values. Consequently, the modelling process encompasses three types of features: three molecular characteristics (CN, TC, and PC), two operational variables (T and P), and two engineered features (TD and PD). All of these function as independent variables in the predictive model.
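For illustration, a minimal sketch of how these engineered features could be computed with pandas is given below; the column names and the critical-property values (shown here for methane and n-decane) are only illustrative.

import pandas as pd

# Illustrative records: CN, critical temperature (K), critical pressure (MPa),
# and the operational temperature (K) and pressure (MPa) of each sample.
df = pd.DataFrame({
    "CN": [1, 10],
    "Tc": [190.6, 617.7],
    "Pc": [4.60, 2.11],
    "T":  [298.15, 523.15],
    "P":  [10.0, 5.0],
})

# Dimensionless (reduced) temperature and pressure.
df["TD"] = df["T"] / df["Tc"]
df["PD"] = df["P"] / df["Pc"]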
2.2. Data Split into Training and Testing Sets
In this study, the database was divided into three distinct sets: training, validation, and testing. The model fitting process employs the training data, while the validation dataset was utilised to assess the performance of the trained models during the training phase. Upon successful completion of both training and validation, the model was subsequently tested using data that were not seen during the training phase. In the previous study [9], the performance of these models was evaluated using n-eicosane, comprising 36 samples. To ensure a fair and equitable comparison, the same methodology was adopted in this study.
To enhance the robustness of the results, data splitting in this study was conducted based on n-alkanes rather than individual data samples. Specifically, the division was performed on the count of n-alkanes, ensuring that all samples associated with a particular chemical were consistently assigned to the same dataset. This methodology encourages the model to independently learn the complexities of developing isotherms, rather than simply focusing on the task of imputing missing data points.
Figure 2 provides a clear illustration of the partitioning of data into training and testing sets for three distinct chemicals at a fixed P. Figure 2a illustrates the sample-wise data division, while Figure 2b depicts the group-wise division. It is worth noting that in sample-wise splitting, the testing data points are interspersed with the training data, facilitating the potential for predicting testing data through interpolation. In contrast, group-wise splitting assigns the testing data to a chemical that is not represented in the training set. Essentially, this method necessitates that the model understands and predicts underlying trends based on the distinct characteristics of each chemical.
Table 2 presents the various sets, detailing the names of the n-alkanes alongside the corresponding sample counts for each set. Of the 15 n-alkanes, 9 are assigned to the training set, 3 to the validation set, and 3 to the testing set, resulting in a nominal data split ratio of 60:20:20. However, owing to the differing sample counts for the various n-alkanes, the actual split ratio based on sample count is approximately 76:13:11. This discrepancy arises primarily from the relatively high number of methane samples (297) included in the training set. It is noteworthy that both the validation and testing sets comprise n-alkanes that were not part of the training phase, thereby ensuring a robust evaluation of the model.
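A minimal sketch of such a group-wise split is shown below, assuming a pandas DataFrame df with an "alkane" column identifying the chemical of each sample; scikit-learn’s GroupShuffleSplit is used here purely for illustration, since any assignment that keeps all samples of a chemical together would serve.

from sklearn.model_selection import GroupShuffleSplit

def group_split(data, group_col="alkane", test_size=0.2, seed=0):
    # Split on chemicals (groups) rather than on individual samples, so that
    # every sample of a given n-alkane ends up in the same subset.
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(gss.split(data, groups=data[group_col]))
    return data.iloc[train_idx], data.iloc[test_idx]

# Nominal 60:20:20 split by chemical count: first carve out the testing
# chemicals, then split the remainder into training and validation.
train_val, test = group_split(df, test_size=0.2)
train, val = group_split(train_val, test_size=0.25)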
2.3. Input Preparation
As previously indicated, the steps of data cleaning and quality assessment, including the exclusion of duplicates and of extreme P and T values as well as feature extraction, were executed in our preceding study [9]. In this study, additional measures were taken: the operational parameter cut-offs were adjusted to remove sparse samples, outliers identified by the Hat-method [42,43,51] were retained, and the database underwent another thorough review to exclude any dubious samples.
Constructing ML models using scaled data is regarded as sound practice. In this study, standardisation was employed as the scaling technique. This process involves subtracting the mean value of a feature from each individual feature value and then dividing the result by the standard deviation of that feature. As a result of this transformation, the feature achieves a mean of 0 and a standard deviation of 1, facilitating consistent and standardised comparisons between different features.
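A minimal sketch of this standardisation step with scikit-learn is given below; the feature names are hypothetical, and the scaler is fitted on the training data only and then reused for the validation and testing sets, as indicated in Figure 1.

from sklearn.preprocessing import StandardScaler

feature_cols = ["CN", "Tc", "Pc", "T", "P", "TD", "PD"]  # hypothetical names
scaler = StandardScaler()

# Fit the scaler on the training set only, then apply the same transformation
# to the validation and testing sets.
X_train = scaler.fit_transform(train[feature_cols])
X_val = scaler.transform(val[feature_cols])
X_test = scaler.transform(test[feature_cols])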
While scaling may not be obligatory for non-parametric models such as DT-based models, owing to their inherent insensitivity to feature scaling, its significance becomes pronounced for distance-centric models such as the DNN and the Support Vector Machine (SVM). These models depend significantly on the distance metrics between data points, and the presence of unscaled features may distort these distance computations, adversely affecting the model’s performance and convergence. Furthermore, employing scaled data generally results in reduced computational time during the modelling phase. When input features are normalised to a similar scale, the convergence of algorithms can be expedited, facilitating quicker optimisation. Additionally, using scaled data often contributes to a more stable training process, as it mitigates the risk of features with large values dominating the learning process.
The characteristics of the training, validation, and testing data are provided in Table 3. As shown, the skewness and kurtosis of the operational parameters are close to zero. This indicates that their distribution is close to normality, suggesting a balanced dataset without significant outliers or extreme values. A near-zero skewness implies a symmetric distribution of the data around the mean, while a near-zero kurtosis indicates that the data’s tails are not heavy, thus reducing the likelihood of anomalies. This balance in the dataset enhances the reliability and accuracy of the model’s predictions. Another important point is that the CN is not considered a categorical feature. Considering CN as a categorical variable would limit the model to the n-alkanes encountered during the training phase, which is not desirable. Our goal is to develop a model applicable to all possible n-alkanes, including those not used for training, ensuring broader applicability and robustness.
2.4. Model Development
The primary objective of this study is to evaluate the effectiveness of DNNs in predicting H2 solubility across a range of n-alkanes. To achieve this aim, three distinct DNNs were developed. The varied architectural compositions of these models offer a comprehensive framework for examining the effects of incorporating batch normalisation [52] and dropout layers [53], as well as variations in layer arrangement. Batch normalisation enhances the training speed and stability of DNNs, while dropout mitigates overfitting by randomly omitting units and their connections during the training process. It is important to highlight that this study utilised Python, along with Keras running on the TensorFlow backend, for the modelling process. A list of all the packages employed, along with their respective versions and the specifications of the computer system used for modelling, can be found in Appendix A.
2.4.1. Model Construction
Keras was utilised to construct the models. Considering the necessity of evaluating layer concatenation, the functional Application Programming Interface (API) was selected over the simpler sequential API. Three models, designated as DNN 1, DNN 2, and DNN 3, were examined, with the details of these models outlined in the subsequent section.
DNN 1 represents the most straightforward model under consideration. As depicted in Figure 3a, this model consists of three hidden layers, each containing 30 neurons.
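A minimal sketch of how a network of this shape could be expressed with the Keras functional API is given below; the ReLU activation is an assumption, since the activation function is not specified here.

from tensorflow import keras
from tensorflow.keras import layers

n_features = 7  # CN, TC, PC, T, P, TD, and PD

inputs = keras.Input(shape=(n_features,))
x = layers.Dense(30, activation="relu")(inputs)  # hidden layer 1
x = layers.Dense(30, activation="relu")(x)       # hidden layer 2
x = layers.Dense(30, activation="relu")(x)       # hidden layer 3
outputs = layers.Dense(1)(x)                     # scaled log-solubility
dnn1 = keras.Model(inputs, outputs, name="DNN_1")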
DNN 2 represents an enhanced version of DNN 1, achieved by integrating batch normalisation and dropout into every hidden layer, as illustrated in Figure 3b. There are discrepancies among researchers regarding the nomenclature of these layers; some would classify DNN 2 as a 10-layer network, comprising nine hidden layers and one output layer. In this study, as shown in Figure 3, we adopted the term “block” to refer to a unit that encompasses the primary layer (Dense) along with its associated components (batch normalisation and dropout). To aid clarity, distinct colours were assigned to each type of layer.
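Under the same assumptions as the previous sketch, one such block, and DNN 2 assembled from three of them, could be sketched as follows.

from tensorflow import keras
from tensorflow.keras import layers

def block(x, units=30, rate=0.05):
    # One "block": a dense layer followed by batch normalisation and dropout.
    x = layers.Dense(units, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(rate)(x)
    return x

inputs = keras.Input(shape=(7,))
x = block(block(block(inputs)))
dnn2 = keras.Model(inputs, layers.Dense(1)(x), name="DNN_2")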
The configuration of DNN 3 is illustrated in Figure 4. Similar to DNN 2, it comprises three blocks consisting of dense layers, batch normalisation, and dropout layers. However, the first block exhibits a notable distinction. As previously mentioned, the target variable depends on three primary inputs: P and T, which represent the operational parameters, along with the type of n-alkane. Furthermore, two additional features, PD and TD, were derived by integrating operational and molecular characteristic attributes.
As illustrated in Figure 4, the network inputs are categorised into three segments: Input 1, Input 2, and Input 3. These segments represent molecular characteristics (comprising three features), engineered features (encompassing two features), and operational features (incorporating two features), respectively. Each segment is connected to a hidden layer consisting of ten units. Following the processes of batch normalisation and dropout, these segments are concatenated to create a layer comprising thirty units. The subsequent architecture is consistent with that of DNN 2.
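A minimal sketch of this multi-input arrangement is given below, reusing the block helper from the previous sketch; the ordering of the segments and the activation functions are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

in_mol = keras.Input(shape=(3,), name="molecular")    # CN, TC, PC
in_eng = keras.Input(shape=(2,), name="engineered")   # TD, PD
in_op  = keras.Input(shape=(2,), name="operational")  # T, P

# Each segment passes through its own 10-unit block, after which the three
# branches are concatenated into a 30-unit representation.
merged = layers.Concatenate()(
    [block(in_mol, units=10), block(in_eng, units=10), block(in_op, units=10)])

# The remaining two blocks mirror DNN 2.
x = block(block(merged))
outputs = layers.Dense(1)(x)
dnn3 = keras.Model([in_mol, in_eng, in_op], outputs, name="DNN_3")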
To provide a more comprehensive insight into the model’s structure, a dropout ratio of 0.05 was employed, meaning that approximately 5% of the neurons were temporarily excluded during training. This approach enhances generalisation and mitigates the risk of overfitting. The Adam optimiser was selected to compile the models, which is standard practice in model optimisation. By iteratively adjusting the model’s parameters, the optimiser minimises the chosen loss function. In this case, the loss function was defined as the Mean Squared Error (MSE), a suitable choice for regression tasks, quantifying the average squared difference between predicted and actual values.
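The compilation step itself is brief; a sketch is shown below, leaving the Adam learning rate at the Keras default since it is not stated here.

# Compile each network with the Adam optimiser and the MSE loss.
for model in (dnn1, dnn2, dnn3):
    model.compile(optimizer="adam", loss="mse")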
The parameter count, which includes both trainable and non-trainable parameters, is thoroughly detailed in Table 4. This count has a direct impact on the complexity of the model and its potential performance. Notably, the inclusion of batch normalisation adds both trainable and non-trainable parameters to the model. This technique not only stabilises and accelerates the training process but also enhances the overall performance of the model.
Notably, DNN 3 stands out by featuring fewer trainable parameters compared to its counterpart, DNN 2. This reduction results from the lack of interconnections between the various input types in its input layer. This streamlined architecture not only diminishes the overall complexity of the model but also aligns effectively with the specific modelling objectives.
2.4.2. Model Training
After constructing and compiling the models, the subsequent phase entails fitting them to the data. During this stage, the models are trained using the provided dataset, with the number of “epochs” and the “batch size” playing crucial roles. Specifically, “epochs” refers to the number of times the entire dataset is iterated over during the training phase, while “batch size” determines the number of data points processed before the model’s parameters are updated. In this study, the models were trained for 1000 epochs with a batch size of 64.
A critical aspect of this process is monitoring the “validation loss”. This metric provides valuable insights into the model’s performance on unseen validation data, helping to ensure that the model does not become excessively tailored to the training data and retains its ability to generalise to new information. The purpose of tracking the validation loss is to identify the point at which the model’s performance on the validation dataset is optimised. Once this optimal performance stage is reached, the model’s configuration is saved as the best iteration using a callback. This “best model” configuration then serves as a reference for future applications and comparisons, ensuring that the most effective model iteration is preserved.
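A minimal sketch of this training step is given below, shown here for DNN 2 with a Keras ModelCheckpoint callback that keeps the weights of the epoch with the lowest validation loss; y_train and y_val denote the scaled, log-transformed solubilities, and for DNN 3 the feature matrices would additionally be split into the three input segments.

from tensorflow import keras

# Save the weights of the best epoch, as judged by the validation loss.
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_loss", save_best_only=True)

history = dnn2.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=1000, batch_size=64,
    callbacks=[checkpoint], verbose=0)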
Figure 5 visually illustrates the convergence of the loss function, represented by the MSE, for both the training and validation datasets. This representation indicates the extent to which the model’s predictions align with the actual data points, providing insight into its predictive effectiveness.
Examining DNN 1 reveals that, although the training error decreases, there is no corresponding improvement in the validation error. Even after 200 epochs, the validation error exhibits a slight upward trend. This phenomenon, referred to as overfitting, suggests that the model has become excessively tailored to the training data, which compromises its ability to generalise to new, unseen data points.
To address the issue of overfitting, dropout—a technique that temporarily deactivates a subset of neurons during training—was judiciously employed. The implementation of dropout helps mitigate overfitting by improving the model’s capacity to generalise beyond the training data. When comparing training losses, DNN 2 and DNN 3 demonstrate higher values than DNN 1. However, both DNN 2 and DNN 3 show a significant reduction in validation loss without raising concerns about overfitting.
A notable distinction emerges when comparing DNN 2 and DNN 3. DNN 3 demonstrates superior performance with respect to validation data, highlighting its enhanced capability to capture underlying patterns within the data. This improved performance contributes to better generalisation on unseen samples.
2.4.3. Predicting the Testing Data
Upon successfully training the models and identifying the best-performing one based on validation loss, the next step involves applying this model to the test data. This process allows for the evaluation of the model’s predictive performance on previously unseen data points.
Before inputting the testing data into the network, it is crucial to apply the same scaling to the data. Additionally, the models were trained on a logarithmically transformed target variable. Once the model generates predictions, an inverse transformation is performed to revert the solubility values to their original scale. This process consists of two main steps: first, the inverse scaling procedure is carried out to reverse the initial data scaling; second, 10 is raised to the power of the resulting values to reverse the logarithmic transformation. This yields the predicted solubilities on their original scale.
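A minimal sketch of this two-step inverse transformation is given below, assuming the target was prepared by taking the base-10 logarithm of the mole fraction (stored in a hypothetical column x_H2) and then standardising it with its own scaler.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Forward transformation applied before training: log10, then standardisation.
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(np.log10(train[["x_H2"]]))
y_val = y_scaler.transform(np.log10(val[["x_H2"]]))

# Inverse transformation applied to the network output on the testing set.
y_pred_scaled = dnn2.predict(X_test)
y_pred = 10 ** y_scaler.inverse_transform(y_pred_scaled)  # back to mole fraction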
3. Results and Discussion
The dependent variable (x) is closely linked to three independent variables: P, T, and the specific chemical type. Together, these independent factors influence the target variable x. To characterise the chemicals under consideration comprehensively, a variety of descriptors can be applied. Each descriptor adheres to its own distinct statistical distribution, highlighting the limitation of relying on a single descriptor. Therefore, exploring multiple descriptors is essential for a more accurate understanding. In addition to the three primary characteristics, two engineered dimensions, PD and TD, are introduced. These engineered variables provide a standardised framework for incorporating P and T, facilitating a more cohesive analysis.
To model the target variable x, representing the mole fraction of H2, a logarithmic transformation of the original data is selected. This approach is informed by a significant observation: the distribution of x exhibits a lognormal pattern, with values predominantly clustering around zero (see Figure 6). The logarithmic transformation serves two key purposes. Firstly, it produces an approximately normal distribution, which is a common assumption in statistical modelling. Secondly, and perhaps more critically, it prevents the generation of negative predictions for values close to zero. This consideration is vital to ensure that the model’s predictions remain consistent with the physical constraints of the data.
It is noteworthy that DNN models, in contrast to their DT counterparts, possess a unique capability for extrapolation. This allows DNN models to generate predictions that extend beyond the predefined range of target values. Consequently, this feature enhances the model’s versatility and its capacity to offer insights into scenarios that fall outside the range of the training data.
3.1. Statistical Error Analyses
This section presents a comprehensive assessment of each model’s performance, employing both graphical illustrations and statistical methods. The previously mentioned MSE, calculated using logarithmically transformed and scaled solubility values, is not utilised. Instead, the evaluation focuses on calculating the Root-Mean-Squared Error (RMSE) related to the actual solubility values expressed in mole fraction units. This adjustment facilitates a more direct and accessible understanding of the error scale. Additionally, the Symmetric Mean Absolute Percentage Error (SMAPE) is calculated, which ranges from 0 to 100%. The formulations for the model metrics utilised are provided in Table 5.
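The exact formulations are given in Table 5; for reference, a sketch of the two metrics as typically computed is shown below, using the SMAPE variant whose denominator is the sum of the absolute values, which is consistent with the stated 0–100% range.

import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Bounded between 0 and 100% for positive solubilities.
    return 100.0 * np.mean(np.abs(y_pred - y_true)
                           / (np.abs(y_true) + np.abs(y_pred)))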
Table 6 provides a comprehensive overview of the model metric values derived from the models developed in this study, each evaluated across distinct datasets. The use of these model metrics offers a quantitative perspective for assessing the performance of the models under various conditions. The top performer in each dataset—training, validation, and testing—is highlighted using bold formatting, which improves the clarity of their identification.
Upon thorough evaluation, DNN 1 is the best model based on its performance on the training set, achieving an RMSE of 0.006991 and an SMAPE of 1.82%. However, its effectiveness appears to diminish when applied to the validation and testing datasets, as indicated by RMSE values of 0.014867 and 0.014058, and SMAPE values of 4.43% and 6.70%, respectively. In contrast, DNN 2 and DNN 3 display a notable consistency in their ability to generalise beyond the training data. Both models demonstrate similar error rates in the training and testing sets. Specifically, DNN 2 has a testing set RMSE of 0.007050 and an SMAPE of 3.24%, while DNN 3 shows a testing set RMSE of 0.009641 and an SMAPE of 3.28%. Remarkably, DNN 3 stands out for its superior predictive accuracy on the validation sets, outperforming its peers in this regard.
Table 7 presents the model metrics associated with DNN 3 for each n-alkane within the training, validation, and testing sets, providing a detailed view of the model’s predictive accuracy. The SMAPE for all n-alkanes ranges from 1.29% to 4.94% in the validation and testing sets. This relatively narrow error margin across diverse n-alkanes indicates that DNN 3 is highly effective at generalising from the training data to unseen data, maintaining a high level of accuracy even when predicting the solubility of H2 in n-alkanes not included in the model’s training phase.
The consistency of low SMAPE values across different n-alkanes suggests that the model has not only captured the underlying physical relationships governing H2 solubility but also generalised these relationships well to new data. This ability to generalise is crucial for the practical application of the model in real-world scenarios, where it may need to predict solubility for n-alkanes beyond those included in the initial dataset. Essentially, the DNN 3 model’s performance metrics underscore its robustness and reliability, demonstrating that it has effectively learned the governing physical patterns of H2 solubility in n-alkanes. This strong performance supports the model’s potential use in various industrial applications, where accurate and reliable solubility predictions are essential for optimising H2-based processes and systems.
Figure 7 presents a scatter plot that juxtaposes predicted values against actual experimental values in the upper section, while the lower section depicts the alignment of the Standard Prediction Error (SPE) with the experimental values. These values are derived from the validation and testing sets, predicted using the DNN 3 model. To facilitate the comparison of data samples, both plots share a common x-axis, ensuring a coherent alignment between the upper and lower sections. A closer examination of the scatter plot reveals that the majority of data points are situated near the 45-degree line, indicating a strong correlation between the model’s predictions and the actual experimental values. Additionally, the SPE plot demonstrates that most data samples exhibit SPE values constrained within −10% and 10%. This figure demonstrates the model’s exceptional performance when tested with unseen n-alkanes, indicating that it has effectively identified the fundamental physical patterns and key relationships governing their behaviour. Its ability to predict the behaviour of new n-alkanes not included in the training dataset confirms its capacity to generalise beyond the training data. This robustness highlights the model’s potential for practical applications across various scenarios involving n-alkanes, showcasing its capability to provide valuable insights in relevant fields.
3.2. Comparison with the Literature Models
In our previous study [9], we developed and tested a basic DT model and three ensemble models: Gradient Boosting (GB), Random Forest (RF), and Extra Trees (ET). Notably, ensemble models aggregate multiple simple DT models, with each employing distinct aggregation techniques. The ensemble models from our prior research utilised a considerable number of simple estimators, specifically 84 estimators for the GB model, 70 for the RF model, and 90 for the ET model.
This study presents a more robust method for data separation, which significantly enhances data quality. To ensure a fair comparison, similar to our prior study [9], extra testing was conducted using n-eicosane samples, which were not included during the training of the model. Figure 8 illustrates the cumulative distribution function plot, depicting the absolute SPE for n-eicosane. This plot clearly demonstrates the superior performance of the DNN models in comparison to both the basic DT model and the ensemble ET model. Notably, the DNN models also exhibit greater efficacy than the GB and RF models.
Additionally, Table 8 provides model metrics for both our previous study [9] and the DNN models developed in the current research. Among the models published in our earlier work [9], only the RF predictions exhibit a close alignment with the DNN models, with all maintaining an SMAPE of less than 5%. Furthermore, the generalisability observed in the DNN models may be attributed to the robust data separation methodology employed in this study.
Nevertheless, it is important to note that the cut-off values for operational parameters were adjusted in this study, and several incorrectly recorded data points were either excluded or corrected. Consequently, the comparison may not definitively demonstrate that the DNN is superior to ensemble DT-based models. Rather, it highlights that the models developed in this study represent a significant advancement towards achieving greater accuracy and reliability.
3.3. Model Stability
The development of a DNN model involves various elements that introduce a degree of uncertainty. This investigation focuses on two primary factors contributing to this uncertainty. The first factor arises from the initial randomisation of the model’s weights, while the second pertains to the random partitioning of data into training, validation, and testing sets. To ensure the model’s effectiveness, it must be capable of effectively managing and adapting to these inherent sources of randomness.
To conduct a comprehensive investigation into the effects of stochastic model training and the initialisation of models with random weights, a rigorous procedure was established that involved the creation, compilation, and fitting of 50 networks. Particular emphasis was placed on minimising other sources of randomness throughout the experimental process. A key aspect of the methodology was the use of identical datasets, which ensured consistency across the various phases of training and evaluation. The hyperparameters detailed in Section 2.4 were consistently applied during both the compilation and fitting of the models. However, to optimise computation time, all models were trained for 600 epochs instead of the originally planned 1000. While this adjustment may result in a slight reduction in prediction accuracy compared to previous sections, it effectively illustrates the impact of randomness.
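A minimal sketch of this first stability experiment is given below; build_model() is a hypothetical helper wrapping the construction and compilation steps of Section 2.4, and the data matrices are those of the earlier sketches, held fixed across all 50 trials.

import numpy as np

smape_scores = []
for trial in range(50):
    model = build_model()  # hypothetical helper: build and compile a fresh DNN
    model.fit(X_train, y_train,
              validation_data=(X_val, y_val),
              epochs=600, batch_size=64, verbose=0)
    y_hat = 10 ** y_scaler.inverse_transform(model.predict(X_test))
    smape_scores.append(smape(test["x_H2"], y_hat.ravel()))

print(np.mean(smape_scores), np.std(smape_scores))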
Upon completing each iteration of model training, a rigorous testing phase was conducted using the designated testing data. The evaluation metric employed was the SMAPE. Figure 9 provides a visual representation of this process, depicting the SMAPE values obtained from a diverse set of 50 DNNs, each subjected to distinct training processes. This figure includes a histogram (subplot (a)) and a QQ plot (subplot (b)). Notably, both graphical representations collectively support the conclusion of a normal distribution of the errors.
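A minimal sketch of how such a figure could be reproduced from the 50 SMAPE values is shown below, using matplotlib for the histogram and scipy for the QQ plot.

import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(smape_scores, bins=10)                       # subplot (a): histogram
ax1.set_xlabel("SMAPE (%)")
stats.probplot(smape_scores, dist="norm", plot=ax2)   # subplot (b): QQ plot
plt.tight_layout()
plt.show()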
The normal distribution of errors across the 50 distinct DNNs offers valuable insights into the model’s stability and robustness. This distribution indicates a consistent performance across various training processes, suggesting that the model’s behaviour is not unduly affected by random factors, which results in predictable outcomes. Furthermore, models exhibiting normally distributed errors tend to demonstrate greater robustness, as they are resilient to variations in training conditions. This resilience significantly enhances their ability to generalise effectively to new, unseen data.
In contrast, the subsequent experiment revealed a more pronounced influence of randomness arising from the data partitioning process. A systematic approach was employed for data partitioning, beginning with the segregation of the data samples belonging to n-eicosane, which was designated as the additional testing set in the previous study [9]. These samples were set aside for testing across all models. The remaining dataset was then divided into three distinct subsets (training, validation, and testing) in a ratio of 60:20:20. This division was executed using a group-based methodology aimed at preserving the integrity and coherence of data groups throughout the modelling process.
Figure 10 illustrates both the associated histogram (shown in subplot (a)) and the QQ plot (displayed in subplot (b)). Unlike the first experiment, where the errors exhibited a well-defined normal distribution, the errors in the second experiment displayed a distribution that deviated from the normal pattern. This observation suggests that the randomness introduced by data partitioning had a more substantial impact on the model’s performance than the randomness introduced by weight initialisation. Consequently, inconsistent data partitioning can lead to increased variability in the model’s performance, hindering its ability to generalise effectively to new data. The notable degree of randomness observed in the second experiment can be attributed to the differing distribution of data across the various sets.
This study’s findings underscore the importance of appropriate data partitioning, especially in scenarios where available experimental data samples are limited. In such instances, achieving a consistent distribution of data across different sets is crucial for minimising the adverse effects of randomness on the model’s performance. Notably, when a more extensive dataset is available for training, the potential impact of randomness introduced by data partitioning may be reduced, owing to the larger sample size. This observation further highlights the significance of strategic data management, which can ultimately lead to more reliable and robust model outcomes.
Figure 11 provides a graphical representation of the outcomes derived from the two experiments conducted: the training randomness experiment and the splitting randomness experiment. In this visual depiction, the x-axis and y-axis represent the logarithmically transformed solubility values and P, respectively, for n-eicosane. To facilitate a deeper understanding, the instances were arranged in isotherms. The key observation from this figure is the marked contrast in model performance between the training trials that employed a fixed data partitioning approach and those that used different data partitions. Notably, the trials using the fixed approach demonstrate superior accuracy and precision compared to those constructed with different data partitioning strategies.
In Figure 11, the experimental data points are represented by circles on the graph. The full-coloured intervals indicate the 95% confidence interval for model predictions in the first experiment, while the pale-coloured intervals correspond to the second experiment. A significant trend is observed, as the experimental data points align more closely with the full-coloured intervals, which suggests an improvement in model accuracy. Furthermore, the lengths of these full-coloured intervals are notably shorter than those of the pale-coloured intervals, reflecting an increased precision in prediction.
However, it is important to acknowledge that, despite the overall accuracy and precision, there are instances where the target values fall outside the prediction intervals. These occurrences underscore the limitations of the model and reveal areas where predictive errors persist. It is also important to note that the experiment was designed to demonstrate the effects of randomness, and therefore only a limited number of epochs were considered.