1. Introduction
Geotechnical evaluation of wellbores, which is crucial for hydraulic fracturing design, sand production management and prediction, fault stability and reactivation analysis, and wellbore stability, requires accurate and comprehensive knowledge of the mechanical properties of rocks [1].
The uniaxial compressive strength (UCS) of rocks is a fundamental measure of drillability in geotechnical engineering, particularly in the oil and gas sectors. UCS quantifies the maximum axial load a rock can endure before failure, reflecting the rock’s resistance to compressive forces. In drilling applications, higher UCS values signal harder formations, which demand increased energy and precise adjustments to drilling parameters—such as bit type, weight on bit, and rotary speed—to achieve efficient penetration. Accurately estimating UCS enables engineers to anticipate the energy requirements and bit wear, aiding in optimal drilling parameter selection and cost reduction.
UCS significantly influences the rate of penetration (ROP) in drilling. Rocks with lower UCS allow for higher ROP, facilitating faster drilling and minimizing tool wear, while those with higher UCS reduce ROP, requiring adjustments to drilling methods and equipment. Therefore, reliable UCS prediction models are vital for enhancing drillability assessments, proactively managing wellbore stability and improving drilling efficiency in complex geological formations.
One commonly used method to evaluate these properties is through the measurement of UCS in the laboratory [2]. However, this method can be highly sensitive to the loading process of the core sample and is destructive in nature [3]. As an alternative, log-based methods that indirectly measure rock strength have been proposed, but their precision and accuracy have not been fully validated by reliable data [4]. Given the challenges and expenses associated with standard laboratory tests, indirect methods are more promising for practical use [2].
Sonic travel time is a physical property of rocks that is utilized for studying rock mechanics and reservoir evaluation. This property varies with the lithology, rock texture, fluid content, and porosity of the rock [5]. In cases where limited core samples are available, sonic and neutron logs can be used to estimate rock properties. Researchers have developed six empirical equations relating carbonate rock strength to measured geophysical properties, which are listed in Table 1 [6,7,8,9,10]. Another approach for predicting rock strength is to use drilling data based on ROP models [11,12]. These models can be used with all types of bits, although tri-cone bits (TCBs) are preferred due to their wide range of use [13].
Koolivand-Salooki et al. [14] developed a method for determining the UCS of rock formations using Genetic Programming (GP). This approach utilized parameters such as total formation porosity, Bulk Density, and water saturation obtained from various logging techniques, including sonic, neutron, gamma ray, and electric logs. The elastic moduli were derived from compressional and shear sonic logs using mathematical correlations, and the rock UCS was estimated using the empirical correlations of Wang and Plumb. The study analyzed approximately 5000 data points from three wells in an Iranian oil field to develop the GP model for UCS prediction. The model was fine-tuned using UCS data from core samples and validated with two separate datasets. The estimated UCS values from the GP model closely matched those obtained from analytical methods based on well-log data [14].
McElroy and colleagues [15] introduced an ANN modelling approach for predicting the UCS of oil well cement, specifically class "H". This research analyzed 195 cement samples, incorporating varying concentrations of pre-dispersed nanoparticles, including nanosilica (nano-SiO2), nanoalumina (nano-Al2O3), and nanotitanium dioxide (nano-TiO2), across different temperature conditions. The effectiveness of these nanoparticles was assessed through transmission electron microscopy (TEM) images. The ANN model included one input layer, one hidden layer, and one output layer. Its performance was superior to that of Multi-Linear Regression (MLR) and Random Forest (RF) regression algorithms in terms of statistical accuracy. Based on their findings, the developed ANN model was a highly accurate and non-destructive alternative to traditional UCS tests, offering cost- and time-saving advantages to the petroleum industry [15].
Hiba et al. [16] carried out a study investigating the geomechanical parameters used for field planning and development, focusing specifically on the tensile strength (Ts) and UCS of rock. Given the time-consuming nature of laboratory measurements, the researchers employed non-destructive techniques to expedite and enhance the reliability of predictions. They used an ANN to predict Ts and UCS from drilling data obtained from two Middle Eastern fields. The ANN was highly accurate in predicting both parameters during the training phase and remained effective during the testing and validation phases, with an average absolute percentage error (AAPE) of 0.59% [16].
Ibrahim et al. [17] investigated the use of machine learning to predict the UCS and tensile strength (T0) of carbonate rocks from well-log data. They applied RF and decision tree (DT) algorithms to data from a Middle Eastern reservoir, identifying gamma ray, compressional time, and Bulk Density as the key predictive factors. The study found both models to be highly accurate, with RF slightly outperforming DT. Specifically, RF achieved a correlation coefficient (R) of 0.97 and an AAPE of 0.65% for UCS prediction, and an R of 0.99 and an AAPE of 0.28% for T0. These results suggest that machine learning offers a reliable and efficient method for estimating rock strength parameters, though further research is needed for other geological formations [17].
This study aims to evaluate the effectiveness of various machine learning models in predicting the UCS of rocks within the context of oil and gas wells, which is a key factor in maintaining wellbore stability and optimizing drilling operations. By comparing Linear Regression, ensemble methods (such as Random Forest, Gradient Boosting, XGBoost, and LightGBM), support vector machine-based regression (SVM-SVR), and multilayer perceptron artificial neural network (MLP-ANN) models, this research seeks to identify the most reliable and accurate approaches. This study not only highlights the importance of selecting suitable machine learning models for geotechnical applications but also advances the field by providing valuable insights into rock behaviour under drilling conditions. These insights pave the way for further research, ultimately improving the understanding of geomechanical properties and their impact on drilling operations.
4. Methodology
The methodology employed in this study aimed to comprehensively evaluate the predictive performance of various machine learning models in forecasting the uniaxial compressive strength (UCS) of rocks encountered in oil and gas wells. The analysis centred on five key input parameters: weight on bit (WOB), Sonic Transit Time (DT), Neutron Porosity (NPHI), rate of penetration (ROP), and Bulk Density (RHOB). Each of these features plays a significant role in determining the UCS of the rock. For instance, Bulk Density provides insight into the mineral composition and compaction of the rock, which directly correlates with its mechanical strength. Sonic Transit Time reflects the elasticity and acoustic properties of the rock, while Neutron Porosity indicates the porosity level, which affects fluid saturation and overall rock strength.
A correlation heatmap and scatter plots were generated (Figure 1 and Figure 2) to visually depict the relationships between UCS and the input variables, facilitating a preliminary understanding of the dataset's characteristics.
The dataset, comprising 111 data points, was meticulously curated to ensure its representativeness and suitability for model training. To assess model performance accurately and prevent overfitting, we employed a standard 80:20 train–test split and, during model development, six-fold cross-validation to further validate the models and ensure robustness. This approach mitigated the risk of overfitting, helping to ensure that our findings are not only reliable but also generalizable to unseen data. We acknowledge that the modest size of the dataset may limit the generalizability of our conclusions, and we plan to expand it in future studies to further validate our results and improve model robustness.
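A minimal sketch of this evaluation protocol is given below, assuming the 111 samples are stored in a CSV file (the file name "ucs_data.csv", the "UCS" target column, and the random seed are illustrative, not taken from the study's data release):

```python
# Minimal sketch of the 80:20 split and 6-fold cross-validation protocol.
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("ucs_data.csv")                 # hypothetical data file
X = df[["WOB", "DT", "NPHI", "ROP", "RHOB"]]     # the five input parameters
y = df["UCS"]                                    # target: uniaxial compressive strength

# Standard 80:20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 6-fold cross-validation on the training set to check robustness.
cv = KFold(n_splits=6, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(random_state=42), X_train, y_train,
    cv=cv, scoring="r2"
)
print(f"6-fold CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```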
Each machine learning model, including Linear Regression, Random Forest, Gradient Boosting, XGBoost, LightGBM, SVM-SVR, and MLP-ANN, underwent a systematic training process. This process involved feeding the models with the training dataset and iteratively adjusting their parameters to minimize prediction errors and optimize predictive accuracy. For the Random Forest model, the best hyperparameters identified were a maximum depth of 10, a minimum of 2 samples per leaf, a minimum sample split of 5, and 100 estimators. In the case of Gradient Boosting, the optimal settings included a learning rate of 0.05, a maximum depth of 5, and 100 estimators, which facilitate effective learning without overfitting. XGBoost demonstrated improved performance with a learning rate of 0.15, a maximum depth of 5, and 100 estimators. The MLP-ANN model exhibited its best performance with a logistic activation function, a single hidden layer of 100 neurons, a constant learning rate, and stochastic gradient descent (SGD) as the solver. For the SVM-SVR model, the optimal hyperparameters comprised a regularization parameter C of 10, a gamma value set to "scale", and a linear kernel. Finally, LightGBM achieved its best predictive results with a learning rate of 0.15, a maximum depth of 5, and 100 estimators.
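For reference, these reported settings can be expressed directly as constructor arguments in scikit-learn, XGBoost, and LightGBM. The following is a reconstruction, not the study's original code: parameters not stated in the text are left at library defaults, and the random seeds are illustrative.

```python
# Reconstruction of the reported best hyperparameters as model constructors.
# Unlisted parameters keep library defaults; random_state values are illustrative.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

models = {
    "Linear Regression": LinearRegression(),  # no tuned hyperparameters reported
    "Random Forest": RandomForestRegressor(
        max_depth=10, min_samples_leaf=2, min_samples_split=5,
        n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(
        learning_rate=0.05, max_depth=5, n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(
        learning_rate=0.15, max_depth=5, n_estimators=100, random_state=42),
    "LightGBM": LGBMRegressor(
        learning_rate=0.15, max_depth=5, n_estimators=100, random_state=42),
    "SVM-SVR": SVR(C=10, gamma="scale", kernel="linear"),
    "MLP-ANN": MLPRegressor(
        activation="logistic", hidden_layer_sizes=(100,),
        learning_rate="constant", solver="sgd", random_state=42),
}
```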
Following model training, a rigorous evaluation was conducted using diverse statistical metrics. These metrics included Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), Mean Percentage Error (MPE), Median Absolute Error, the R2 score, the adjusted R2 score, Mean Squared Logarithmic Error (MSLE), Mean Bias Error (MBE), and the geometric and symmetric variants of MAPE. By analyzing these metrics, we were able to quantify the predictive accuracy, bias, and overall performance of each model, enabling a robust comparison and selection of the most suitable model for UCS prediction in oil and gas wells.
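Most of these metrics are available in scikit-learn; MPE, MBE, and the adjusted R2 have no built-in implementation and are computed by hand in the sketch below. The sign conventions assumed for MPE and MBE are our own, as the paper does not state them.

```python
# Sketch of the evaluation metrics listed above.
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_percentage_error,
    median_absolute_error, r2_score, mean_squared_log_error,
)

def evaluate(y_true, y_pred, n_features):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "RMSE": np.sqrt(mse),
        "MSE": mse,
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),  # fraction; x100 for %
        "MPE": np.mean((y_true - y_pred) / y_true),     # signed error (assumed convention)
        "MedAE": median_absolute_error(y_true, y_pred),
        "R2": r2,
        "Adj. R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "MSLE": mean_squared_log_error(y_true, y_pred), # requires non-negative values
        "MBE": np.mean(y_pred - y_true),                # positive => overestimation (assumed)
    }
```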
5. Results and Discussion
The visual representations in Figure 3 and Figure 4 offer a comprehensive overview of the comparative performance of the various machine learning models in predicting UCS values for the training and testing datasets, respectively. These figures serve as valuable tools for assessing the predictive accuracy and generalization capabilities of the different models in the context of geotechnical engineering applications.
Figure 3 presents a comparative analysis of predicted and actual UCS values against the data index for the training dataset using the different machine learning models. In the Linear Regression model (Figure 3a), the predicted UCS values (red line) demonstrate a general alignment with the actual values (blue dots), but significant deviations are evident, particularly in regions where the actual values exhibit sharp fluctuations. These deviations suggest that the Linear Regression model struggles to capture the non-linear patterns in the data accurately. Random Forest (Figure 3b) and Gradient Boosting (Figure 3c) exhibit closer alignment with the actual UCS values than Linear Regression. Both models reduce the magnitude of the deviations, reflecting their capability to handle non-linearities better than a simple linear model; among these, Gradient Boosting appears to have a slight edge, maintaining a tighter fit throughout the dataset. XGBoost (Figure 3d) further improves the accuracy, displaying a strong correlation between the predicted and actual values across the entire data index. Its predictions remain stable even under variations in the underlying data characteristics, such as different geological formations or changes in drilling parameters, and the minimal deviations observed indicate that it generalizes effectively from the training data to unseen samples, minimizing the risk of overfitting and making it a reliable tool under various operational conditions. LightGBM (Figure 3e) demonstrates a performance comparable to XGBoost, with a similarly tight fit and minimal errors; both boosting algorithms appear to offer superior predictive accuracy owing to their advanced ensemble learning techniques. The SVM-SVR model (Figure 3f) shows a reasonable fit but with more pronounced deviations in certain sections of the data index, suggesting that while SVM-SVR is effective, it may not be as versatile as the boosting methods in capturing the full complexity of the UCS values. Finally, the MLP-ANN model (Figure 3g) provides a fit comparable to the boosting models, with predictions closely following the actual values; the neural network's ability to model intricate patterns in the data contributes to its high predictive performance.
Figure 4 provides a comparative analysis of the predictive models in estimating UCS against actual values using the testing dataset. Linear Regression (Figure 4a) reveals substantial discrepancies between predicted and actual UCS values, which are particularly notable in several spikes and troughs where the model fails to capture the variability in the data; this reinforces the earlier observation that Linear Regression is less effective in modelling the complex, non-linear relationships inherent in the dataset. As depicted in Figure 4b, Random Forest shows improved performance compared to Linear Regression, with predictions more closely aligned with actual values; however, some deviations persist, indicating that while Random Forest handles non-linearities better, it may still miss certain intricacies in the data. Gradient Boosting (Figure 4c) demonstrates a closer fit to the actual UCS values, reducing the magnitude of prediction errors compared to both Linear Regression and Random Forest; the model's ensemble learning capability enhances its predictive accuracy, although minor deviations are still present. As demonstrated in Figure 4d, XGBoost maintains a robust alignment with actual values across the testing dataset, further validating its efficacy in handling complex data patterns, and the minimal deviations observed suggest that it effectively generalizes the underlying data structure. LightGBM (Figure 4e) displays performance on par with XGBoost, with predictions closely following the actual UCS values; its ability to capture detailed data patterns is evident, though occasional deviations indicate slight overfitting or data-specific challenges. SVM-SVR (Figure 4f) exhibits reasonable predictive accuracy but with noticeable deviations in several regions of the data index, suggesting that while SVM-SVR is effective in certain scenarios, it does not capture the full complexity of the UCS values as consistently as the ensemble methods. As depicted in Figure 4g, MLP-ANN shows strong predictive performance, with predictions aligning closely with actual values throughout the dataset; the neural network's capability to model complex, non-linear relationships contributes to its high accuracy, although minor deviations suggest room for further optimization. The comparative analysis of both the training and testing phases underscores that ensemble methods such as Gradient Boosting, XGBoost, and LightGBM, along with neural network approaches like MLP-ANN, generally outperform simpler models like Linear Regression and SVM-SVR in predicting UCS values. These advanced models demonstrate superior generalization capabilities, making them more reliable for practical applications involving complex, real-world phenomena.
Figure 5 provides a comparative analysis of scatter plots illustrating the relationship between predicted and actual UCS values for the different modelling techniques. Each subplot displays the actual UCS values on the x-axis and the corresponding predicted UCS values on the y-axis, along with a red dashed line indicating the ideal 1:1 prediction line and a shaded region representing prediction uncertainty. As shown in Figure 5a, the Linear Regression model shows a broad distribution of points around the ideal line, indicating a moderate fit with noticeable variance, particularly for higher UCS values. As depicted in Figure 5b, the Random Forest model demonstrates improved alignment with the 1:1 line, suggesting better prediction accuracy and less dispersion than Linear Regression. As illustrated in Figure 5c,d, the Gradient Boosting and XGBoost models exhibit closer clustering of points around the ideal prediction line, signifying higher predictive precision and reduced variability; this observation highlights the effectiveness of ensemble techniques in capturing the underlying patterns in the data. Similarly, as shown in Figure 5e, LightGBM also shows a strong correlation between predicted and actual values, though with slightly more dispersion than Gradient Boosting and XGBoost. Furthermore, the SVM-SVR model (Figure 5f) presents robust performance, with most points lying near the ideal line and within the uncertainty bounds; however, a few outliers deviate significantly, indicating some limitations in the model's generalization capability. Lastly, the MLP-ANN model (Figure 5g) demonstrates satisfactory predictive performance, with the majority of points closely following the ideal line. Nonetheless, there is a slight tendency towards higher variability at the extremes of the UCS range, suggesting that while MLP-ANN captures the overall trend effectively, it may struggle with extreme values.
Figure 6 illustrates the relative importance of input features for various machine learning models. Understanding feature importance is essential for interpreting model behaviour and identifying key drivers of predictions.
Figure 6a highlights RHOB as the most influential predictor in the Linear Regression model, followed by NPHI and DT. The radar plot indicates a significant reliance on RHOB, with the other features playing relatively minor roles. This suggests that in the linear model, RHOB holds dominant explanatory power, likely due to its strong linear relationship with the target variable.
Figure 6b shows a similar trend, with RHOB again being the most important feature for the Random Forest model. However, the spread of importance is slightly more balanced, with DT and NPHI also contributing significantly. This indicates that the Random Forest model captures more complex interactions among features compared to Linear Regression.
Figure 6c,d both display a notable emphasis on RHOB, but with a more pronounced role for NPHI and DT for the Gradient Boosting and XGBoost models. The radar plots for these models reveal a more distributed importance among the features, suggesting that the boosting methods are effective in leveraging multiple features to enhance predictive accuracy.
Figure 6e shows a more balanced distribution of feature importance, with WOB, RHOB, and NPHI all contributing significantly to the LightGBM model. This model’s radar plot is more uniform compared to others, indicating that LightGBM utilizes a diverse set of features to make predictions, potentially leading to better generalization.
Figure 6f provides a bar chart of feature importances based on absolute coefficients. Here, NPHI emerges as the most influential feature, followed by DT and RHOB for the SVM-SVR model. This distribution reflects the model’s ability to capture complex, non-linear relationships where multiple features significantly impact the outcome.
Figure 6g uses permutation importance to measure feature relevance. NPHI and DT show the highest importance, indicating their critical role in the neural network’s predictions. The reliance on these features suggests that the MLP-ANN model effectively captures intricate patterns in the data.
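As a sketch of how the permutation importances in Figure 6g could be computed, the following uses scikit-learn's model-agnostic implementation, reusing the `models` dictionary and data split from the earlier sketches; in practice the inputs would typically be standardized before fitting an MLP.

```python
# Permutation importance for the MLP-ANN model: each feature is shuffled in
# turn, and the resulting drop in test-set R2 measures the model's reliance on it.
from sklearn.inspection import permutation_importance

mlp = models["MLP-ANN"].fit(X_train, y_train)
result = permutation_importance(
    mlp, X_test, y_test, n_repeats=30, random_state=42, scoring="r2"
)
for name, importance in zip(X.columns, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```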
Figure 7 illustrates the Taylor plots for the seven models. As shown in Figure 7a, the correlation coefficient for the Linear Regression model is moderate, suggesting reasonable but not exceptional agreement between predicted and observed UCS values; the model's standard deviation is lower than that of the observations, indicating that Linear Regression underestimates the variability in UCS values. The Random Forest model (Figure 7b) shows a higher correlation coefficient than Linear Regression, indicating a stronger relationship between predictions and actual values, and a standard deviation closer to that of the observations, suggesting that Random Forest better captures the variability in UCS values. Figure 7c demonstrates an even higher correlation coefficient, nearing 1.0, for the Gradient Boosting model, which implies very strong agreement between the model predictions and the actual UCS values; the standard deviation of the predictions aligns closely with that of the observations, indicating that Gradient Boosting effectively captures the variability in the data. As shown in Figure 7d, the XGBoost model also exhibits a high correlation coefficient, similar to Gradient Boosting, and a standard deviation that closely matches the observed values, indicating that XGBoost predicts UCS values with a high degree of accuracy and reliability. As depicted in Figure 7e, LightGBM shows a strong correlation coefficient, slightly lower than those of Gradient Boosting and XGBoost but still indicative of good predictive performance; its standard deviation is close to the observed value, although a slight deviation suggests some minor discrepancies in capturing the full range of data variability. Figure 7f presents a good correlation coefficient for the SVM-SVR model, although not as high as those of the ensemble methods; its standard deviation is comparable to the observed values, indicating that SVM-SVR captures the data variability well, albeit with occasional prediction inaccuracies. Figure 7g shows a strong correlation coefficient and a standard deviation that aligns well with the observed values for the MLP-ANN model, indicating that it captures both the trend and the variability in the UCS data effectively; however, like SVM-SVR, it may produce occasional outliers or prediction errors.
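For reference, a Taylor diagram condenses two statistics per model; a minimal sketch of their computation follows (the paper does not describe its plotting code, so this is an assumption about the standard construction):

```python
# The two quantities a Taylor diagram summarizes for each model.
import numpy as np

def taylor_stats(y_true, y_pred):
    corr = np.corrcoef(y_true, y_pred)[0, 1]    # Pearson correlation coefficient
    sd_ratio = np.std(y_pred) / np.std(y_true)  # <1 => variability underestimated
    return corr, sd_ratio
```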
The analysis of residual distributions for various machine learning models is critical for understanding their predictive performance. Residuals, which are the differences between observed and predicted values, should ideally exhibit a random pattern centred around zero; any discernible trends or patterns may suggest that the model is not effectively capturing underlying relationships within the data. This analysis is instrumental in identifying biases in the model, revealing whether it tends to consistently overestimate or underestimate predictions. Moreover, it helps to detect non-linearities that may require additional features or interaction terms for better representation. Assessing the residuals also aids in evaluating the homogeneity of variance; the presence of heteroscedasticity can compromise the reliability of the model’s predictions.
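A minimal sketch of these residual diagnostics is given below, assuming a fitted estimator `model` from the earlier sketches and the observed-minus-predicted sign convention (the paper does not state which convention it uses):

```python
# Residual diagnostics: a histogram shows centring, skew, and outliers;
# plotting residuals against actual UCS reveals trends that indicate bias.
import matplotlib.pyplot as plt

# Residuals here are observed minus predicted (an assumed convention).
residuals = y_test - model.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=20)                  # distribution of residuals
ax1.set(xlabel="Residual", ylabel="Count")
ax2.scatter(y_test, residuals)                # residuals vs. actual UCS
ax2.axhline(0, color="red", linestyle="--")   # zero line for reference
ax2.set(xlabel="Actual UCS", ylabel="Residual")
plt.tight_layout()
plt.show()
```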
Figure 8 presents the residual distributions of the seven models. As shown in Figure 8a, Linear Regression demonstrates a wide spread of residuals, with several outliers at both ends; the distribution appears slightly skewed to the left, indicating that the model tends to underpredict in some instances, and the presence of multiple residual peaks suggests that it might not fully capture the underlying data patterns, leading to heterogeneous residuals. As depicted in Figure 8b, the Random Forest technique shows a more centred residual distribution, although it still exhibits some left skewness; the residuals are more tightly clustered around the mean than those of Linear Regression, suggesting better overall predictive accuracy, but the model still struggles with extreme values, as evidenced by residuals extending far from the mean. Figure 8c,d present similar residual distributions for the Gradient Boosting and XGBoost models, respectively, with both showing a noticeable concentration around the mean and a reduction in extreme residuals; this indicates that these models are effective in minimizing prediction errors and handling variance within the data, although both distributions show slight left skewness, suggesting occasional underpredictions. As demonstrated in Figure 8e, the LightGBM model displays a distinctive pattern, with a significant peak around a small positive residual value; this indicates a slight bias in the model's predictions, which consistently overestimate to a small extent, but the distribution is relatively narrow, suggesting high accuracy in most predictions. Figure 8f,g show the residuals for the SVM-SVR and MLP-ANN models, respectively, with wider spreads than the boosting methods but narrower than Linear Regression and Random Forest; the SVM-SVR residuals are fairly symmetrical around the mean, implying balanced prediction errors, while MLP-ANN shows a right-skewed distribution, indicating a tendency towards overprediction.
Figure 9 illustrates the residuals of predicted versus actual UCS values for the seven models. Residuals are the differences between observed values and the values predicted by the models, and analyzing them helps evaluate model performance by identifying any patterns or biases in the predictions. As demonstrated in Figure 9a, the residuals of the Linear Regression model exhibit a noticeable spread around the zero line, with a tendency to increase as the actual UCS values increase; this pattern suggests that the model may underpredict higher UCS values and overpredict lower UCS values, indicating a potential linear bias in the predictions. Figure 9b shows residuals that are more tightly clustered around the zero line than those of Linear Regression, although some noticeable outliers remain; the residuals do not display a clear pattern, indicating that Random Forest provides a more balanced prediction across the range of UCS values but still has room for improvement in reducing prediction errors. Figure 9c presents residuals that are fairly well distributed around the zero line, with fewer outliers than both Linear Regression and Random Forest, indicating that Gradient Boosting has strong predictive capability and effectively minimizes bias, providing accurate predictions across the range of UCS values. As illustrated in Figure 9d, XGBoost also exhibits a well-distributed pattern of residuals around the zero line, similar to Gradient Boosting; the absence of a clear trend or bias in the residuals further confirms XGBoost's robustness and accuracy in predicting UCS values. As depicted in Figure 9e, the LightGBM model displays residuals that are somewhat more scattered, with a few noticeable outliers, particularly at higher UCS values; while LightGBM generally performs well, these outliers suggest occasional overprediction or underprediction, indicating variability in the model's accuracy. Figure 9f shows a relatively balanced distribution of residuals around the zero line for SVM-SVR, although there are several instances of significant positive and negative residuals, suggesting that while SVM-SVR can predict UCS values with reasonable accuracy, it may struggle with certain data points, leading to occasional large errors. Figure 9g reveals residuals that are spread more widely around the zero line, with several outliers, especially at the lower end of the UCS range; this dispersion indicates that MLP-ANN has difficulty maintaining consistent prediction accuracy across the range of UCS values, resulting in higher variability in its predictions. Overall, the residual analysis in Figure 9 highlights that ensemble methods such as Gradient Boosting and XGBoost provide the most accurate and unbiased predictions, with residuals closely clustered around the zero line and minimal outliers. Random Forest and LightGBM also perform well but exhibit slightly more variability. Linear Regression and MLP-ANN show higher dispersion and noticeable patterns in residuals, indicating potential biases and less reliable predictions, while SVM-SVR offers reasonable accuracy but with occasional large residuals.
The comparative analysis reveals that ensemble methods, particularly Gradient Boosting and XGBoost, deliver superior predictive accuracy and reliability for UCS prediction, minimizing residuals and reducing extreme prediction errors. Random Forest and LightGBM also perform well, albeit with slightly more variance. Linear Regression and MLP-ANN show moderate predictive capabilities with higher variability and wider residual distributions, indicating less precise predictions. SVM-SVR, while generally accurate, handles errors better than Linear Regression but less effectively than the boosting methods and remains prone to occasional significant errors.
Figure 10 depicts the relationship between the actual UCS data and the predictions made by the Golubev and Rabinovitch [9] model. The red triangles represent the data points, which generally align with the blue linear fit line, indicating a strong positive correlation. The R2 value of 0.7761 suggests a moderate fit, implying that the model explains approximately 77.61% of the variance in the actual UCS data. This degree of correlation indicates that the Golubev and Rabinovitch [9] model can predict UCS values with moderate accuracy. However, the spread of data points around the fit line also reveals some deviations and potential outliers, which may be due to factors such as rock heterogeneity or model limitations.
Figure 11 illustrates the comparison between the real UCS data and the estimates produced by the Rzhevsky and Novick [41] model. The coefficient of determination, R2, is 0.7656, signifying that the model accounts for approximately 76.56% of the variability in the UCS data. This R2 value suggests that the Rzhevsky and Novick [41] model is a moderate predictor of UCS. Nonetheless, the dispersion of data points around the regression line points to some discrepancies and outliers, potentially arising from inherent model limitations.
Figure 12 presents a comparative analysis of the measured UCS data against the predictions made by Nabaei et al.'s [10] model. The R2 value of 0.7674 suggests a moderately positive correlation between the model's predictions and the actual UCS values. However, some scatter around the line indicates that, while the model captures the general trend of the data, there are discrepancies and potential outliers that could be attributed to variances in measurement conditions or inherent limitations of the model. The alignment of the majority of the data points along the line of best fit implies that Nabaei et al.'s [10] model can moderately estimate UCS values within a specific range, although the spread of the data suggests that further refinement could enhance its predictive accuracy.
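The goodness-of-fit statistics reported for Figures 10-12 can be reproduced with an ordinary least-squares fit between the measured UCS values and each empirical estimate. The sketch below assumes the empirical predictions have already been computed from the Table 1 correlations, which are not reproduced here; the array names are illustrative.

```python
# Sketch of the goodness-of-fit computation behind Figures 10-12, assuming
# `ucs_measured` and `ucs_empirical` are NumPy arrays holding the laboratory
# UCS values and the corresponding empirical-correlation estimates.
from scipy.stats import linregress

fit = linregress(ucs_measured, ucs_empirical)  # least-squares linear fit
r_squared = fit.rvalue ** 2                    # e.g. ~0.7761 for Figure 10
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, R2={r_squared:.4f}")
```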
Table 3 compares the outputs of the different machine learning models in terms of statistical indicators such as R2, RMSE, MAE, MAPE, and MPE. Linear Regression exhibited the lowest RMSE of 4.72 and MSE of 22.27, indicating its effectiveness in minimizing prediction errors. Additionally, it showed the lowest MAPE (3.38) and MPE (−0.0002), implying minimal percentage error. XGBoost and MLP-ANN also demonstrated competitive performance in terms of MAPE and MPE. In terms of MAE, XGBoost achieved the lowest value (2.578), followed closely by Linear Regression (2.993) and SVM-SVR (3.337). Linear Regression exhibited the highest R2 value of 0.8849, indicating its ability to explain approximately 88.49% of the variance in the UCS values. SVM-SVR also demonstrated a high R2 (0.8660), followed by XGBoost (0.8542). Adjusted R2 values were consistent with the R2 values, with Linear Regression exhibiting the highest adjusted R2 of 0.8511. Linear Regression and SVM-SVR exhibited the lowest MSLE values, indicating their effectiveness in minimizing logarithmic prediction errors, whereas LightGBM showed the highest MSLE, suggesting higher variability in the accuracy of its predictions. Regarding MBE, Linear Regression had a positive bias (0.0742), indicating slight overestimation, while the other models exhibited varying degrees of bias. Overall, Linear Regression emerged as the top-performing model across multiple evaluation metrics, showcasing its robustness and effectiveness in predicting UCS values in oil and gas wells. However, XGBoost and SVM-SVR also demonstrated competitive performance, highlighting the importance of considering multiple models in predictive modelling tasks.
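For reference, the adjusted R2 penalizes R2 for the number of predictors $p$ relative to the number of evaluated samples $n$:

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}$$

Assuming the metrics were computed on the 20% test split of the 111 samples ($n \approx 23$) with the five input features ($p = 5$), this formula reproduces the reported Linear Regression value: $1 - (1 - 0.8849) \times 22/17 \approx 0.8511$.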
6. Conclusions
In this study, we explored the efficacy of various machine learning models in predicting the UCS of rocks in oil and gas wells. Through rigorous experimentation and analysis, we evaluated the performance of seven distinct models: Linear Regression, Random Forest, Gradient Boosting, XGBoost, LightGBM, SVM-SVR, and MLP-ANN. Our investigation aimed to identify the most accurate and reliable model for UCS prediction, which is crucial for optimizing drilling operations and ensuring wellbore stability in the petroleum industry.
While RHOB consistently appears as a significant feature across most models, the importance of other features such as NPHI and DT varies depending on the model used. Ensemble methods like Gradient Boosting, XGBoost, and LightGBM demonstrate a more balanced utilization of features, enhancing their predictive performance. In contrast, Linear Regression relies heavily on RHOB, reflecting its simplicity and limitations in capturing complex relationships. SVM-SVR and MLP-ANN highlight the importance of NPHI and DT, indicating their effectiveness in modelling non-linear interactions.
Our findings underscore the superiority of ensemble methods, particularly Gradient Boosting and XGBoost, in accurately predicting UCS values. These models demonstrate robustness, reliability, and superior generalization capabilities, making them ideal choices for practical applications in geotechnical engineering.
Additionally, our results highlight the effectiveness of MLP-ANN in capturing the complex, non-linear relationships inherent in the UCS dataset. While MLP-ANN exhibits strong predictive performance, it occasionally struggles with extreme values, indicating opportunities for further optimization.
Furthermore, Random Forest and LightGBM also exhibit commendable performance, albeit with slightly more variability compared to Gradient Boosting and XGBoost. These models provide viable alternatives, especially in scenarios where computational efficiency is a concern.
On the other hand, Linear Regression and SVM-SVR models, while providing moderate predictive capabilities, fall short of capturing the full complexity of the UCS dataset. These simpler models are outperformed by ensemble methods and MLP-ANN in terms of predictive accuracy and reliability.
Our study underscores the importance of employing advanced machine learning techniques for UCS prediction in oil and gas wells. By leveraging these methodologies, the petroleum industry can benefit from enhanced decision-making processes that lead to improved drilling efficiency and safety. The adoption of superior predictive models not only optimizes operational parameters but also contributes to sustainable practices by minimizing the risks associated with drilling operations. Ultimately, our research paves the way for further exploration and application of machine learning in geotechnical contexts, highlighting the significant potential for ongoing improvements in the field.