1. Introduction
In the oil industry, drilling is the process of creating long, narrow boreholes using tools such as drill bits, drill strings, and drilling fluids, and it is a critical step in the exploration and development of petroleum resources [1]. Optimizing the rate of penetration (ROP) shortens drilling cycles and reduces costs; thus, establishing a high-accuracy ROP prediction model is of significant importance [2]. ROP is influenced by controllable factors (such as drill bit type, weight on bit, and rotation speed) and uncontrollable factors (such as formation lithology and formation pressure) [3]. An effective ROP prediction model therefore not only describes changes in ROP but also helps minimize downtime and improve drilling efficiency.
Early prediction methods primarily relied on physical laws and empirical rules, such as the models proposed by Galle and Woods (1963), Bingham (1965), and Bourgoyne and Young (1974) [4], which were established through mathematical formulas or physical experiments. However, due to the complexity of drilling environments, these models struggled to capture the nonlinear relationships among multiple variables, resulting in suboptimal predictive performance. With advances in artificial intelligence, AI-based models for ROP prediction have gradually emerged. These models, particularly deep learning architectures, have demonstrated strong generalization performance, prompting many researchers to improve them for ROP prediction [5]. Ashrafi et al. [6] combined particle swarm optimization (PSO) and genetic algorithms (GA) with artificial neural networks (ANNs), finding that the hybrid ANNs significantly outperformed traditional backpropagation-trained ANNs in accuracy. Similarly, Al-AbdulJabbar et al. [7] integrated an adaptive differential evolution optimization algorithm with ANNs for ROP prediction in carbonate reservoirs. AI-based models can effectively fit the complex nonlinear relationships present in drilling data, markedly surpassing traditional physical models. However, ANN-based ROP prediction models typically treat the data as static, neglecting the temporal dynamics of drilling operations, which limits their effectiveness in practical applications.
As understanding of drilling time series data has deepened, researchers have begun to incorporate time series models for ROP prediction. Recurrent neural networks (RNNs) have been widely used in ROP prediction due to their ability to capture historical information [8]. Encinas et al. [9] combined RNNs with multilayer perceptrons, significantly improving ROP prediction accuracy by leveraging the sequential nature of drilling operations and achieving better results than traditional machine learning models such as random forests. Etesami et al. [10] validated the potential of RNNs within a drilling training framework, although their prediction accuracy reached only 0.90, indicating room for further optimization in ROP prediction tasks. These studies highlight that, despite their effectiveness in certain contexts, RNNs face significant limitations in capturing long-term dependencies because of vanishing and exploding gradients when processing long sequences. This has prompted researchers to explore more advanced models, such as long short-term memory networks (LSTMs), to enhance predictive performance.
Compared with ANNs and RNNs, LSTMs are particularly adept at handling long-term dependency information, owing to their memory cell structure, which enables outstanding performance in ROP prediction [11]. Safarov et al. [12] conducted a thorough comparison of traditional machine learning and deep learning methods (such as RNNs and LSTMs) in ROP prediction; however, their study did not delve deeply into data partitioning, a critical factor for achieving accurate ROP predictions. Liu et al. [13] designed a model that connects LSTM and RNN for ROP prediction, but their research focused solely on deep wells, raising questions about the model’s generalization capability. Zhang et al. [14] proposed a model combining generative adversarial networks (GANs) with LSTM to predict ROP in continuous coiled tubing operations. However, LSTMs can only utilize input information from prior time points during prediction and cannot integrate outputs from the entire time series. This limitation of unidirectional analysis prevents traditional LSTM models from adequately addressing the complex nonlinear relationships and temporal dynamics among drilling parameters. While LSTMs offer improvements over other models, further exploration and optimization are still necessary in certain aspects.
Bidirectional long short-term memory networks (BiLSTMs), as an extension of unidirectional LSTM networks, can capture bidirectional information within time series, thereby enhancing the accuracy of temporal tasks [15]. For instance, Kocoglu et al. [16] found that BiLSTM outperformed traditional and other deep learning methods in predicting production from multiple wells in the Marcellus formation. Liang et al. [17] proposed a hybrid model combining BiLSTM and random forests (RF) for shale gas production forecasting, successfully addressing complex nonlinear and non-stationary characteristics. Because drilling data span long time intervals, unidirectional networks may neglect bidirectional information when modeling long-term dependencies, which ultimately affects prediction accuracy. BiLSTM not only effectively captures contextual information from logging curves but also enhances the model’s sensitivity to changes in nonlinear features, thereby improving data utilization and predictive performance.
The introduction of bidirectional computation in BiLSTM effectively enhances the accuracy of temporal tasks; however, it also increases model complexity and training costs. To address the computational burden associated with BiLSTM, attention mechanisms have been incorporated into ROP prediction. The attention mechanism dynamically optimizes the weight allocation of input features, highlighting key components and thereby improving the model’s efficiency and accuracy. Cheng et al. [18] combined the attention mechanism with LSTM, resulting in a significant improvement in the model’s predictive performance. Similarly, Song et al. [19] proposed an attention-based BiLSTM model for forecasting wind and wave energy, demonstrating that the attention mechanism markedly enhanced the model’s performance. Although traditional attention mechanisms excel in optimizing model performance, they still face challenges, such as high computational complexity and attention diffusion when dealing with long-term dependencies [20]. To address these issues, the temporal pattern attention (TPA) mechanism was introduced [21]. By dynamically allocating weights across the time series, TPA captures key temporal information, further enhancing model prediction accuracy. In ROP prediction tasks, TPA utilizes a time-series-based attention weight matrix to help BiLSTM more effectively capture important features within the contextual information, significantly improving both prediction accuracy and efficiency.
The combination of convolutional neural networks (CNN) and long short-term memory (LSTM) networks offers significant advantages in feature extraction and has been widely applied to time series prediction tasks [22]. CNN effectively extracts features from time series data through progressive convolution and pooling operations, while LSTM selectively updates and outputs information from the memory cells using its gating mechanism. This hybrid structure enhances the model’s ability to capture data features, thereby improving the effectiveness of time series learning [23]. In multivariate time series data, there is often local correlation between different variables. Compared with 1D-CNN, which can only extract features along a single dimension, two-dimensional convolutional neural networks (2D-CNN) can perform convolution operations simultaneously across both the time steps and variable dimensions. This advantage allows 2D-CNN to more effectively capture relationships between multiple variables, making it particularly well-suited for complex multivariate time series analysis. For instance, Jonkers et al. [24] combined 2D-CNN with conformal quantile regression (CQR) for regional wind power forecasting and found that the model, when handling high-dimensional input data, exhibited fewer parameters and lower computational complexity compared with transformer models. Additionally, the 2D-CNN model embedded with Laplacian attention proposed by Tuyen et al. [25] demonstrated the potential to analyze input sequence features from multiple perspectives. Thus, 2D-CNN not only performs convolutions across features and time steps but also effectively handles the complex multidimensional characteristics present in drilling tasks. This capability significantly enhances the model’s ability to capture nonlinear relationships, thereby improving overall predictive performance. Compared with 1D-CNN, the feature extraction advantages of 2D-CNN affirm its suitability for complex time series forecasting tasks.
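To make the contrast between convolving along one dimension and along two concrete, the following is a minimal pure-Python sketch of a stride-1, unpadded 2D cross-correlation over a toy window of time steps (rows) by drilling variables (columns). The function name `conv2d_valid`, the toy data, and the kernel values are illustrative assumptions, not part of the proposed model.

```python
def conv2d_valid(window, kernel):
    """Stride-1, unpadded 2D cross-correlation (the operation a CNN layer applies)."""
    rows, cols = len(window), len(window[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(rows - k_rows + 1):
        out_row = []
        for j in range(cols - k_cols + 1):
            acc = 0.0
            for a in range(k_rows):
                for b in range(k_cols):
                    acc += window[i + a][j + b] * kernel[a][b]
            out_row.append(acc)
        out.append(out_row)
    return out


# Toy window (illustrative values): 4 time steps (rows) x 3 variables (columns).
window = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
]

# A 2 x 1 kernel sees two consecutive time steps of a single variable.
time_only = conv2d_valid(window, [[1.0], [-1.0]])

# A 2 x 2 kernel mixes two time steps of two neighbouring variables.
time_and_variable = conv2d_valid(window, [[1.0, 1.0], [1.0, 1.0]])
```

Here `time_only` keeps one output column per variable, while each entry of `time_and_variable` depends on two adjacent variables at once, which is the cross-variable interaction the paragraph above attributes to 2D kernels.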
Despite significant progress in ROP prediction using deep learning, existing studies still face the following challenges:
- (1) Excessive redundancy exists among the input data related to ROP, and different features contribute unequally to ROP prediction, which can easily lead to model overfitting.
- (2) The complex nonlinear relationships between input variables have not been fully explored, and the effectiveness of feature extraction requires improvement.
- (3) ROP prediction is a time-series problem, and traditional methods have limitations in extracting long-term dependencies, which negatively affects prediction accuracy.
These issues collectively reduce the predictive capability of the models.
To address the aforementioned challenges, leveraging the complementary strengths of 2D-CNN and BiLSTM networks is crucial. BiLSTM excels at capturing temporal features from sequential data but has limited capacity in extracting nonlinear relationships between input features, especially when dealing with complex input parameters, which can lead to reduced prediction accuracy. In contrast, 2D-CNN offers strong nonlinear feature extraction capabilities, yet its local sensitivity and inductive bias limit its ability to fully utilize global information in time-series data. Therefore, combining these two models can facilitate the efficient extraction and analysis of both global and local features within the data, leading to improved predictive performance.
In summary, fully exploring feature correlations and bidirectional temporal information within drilling data is critical to improving the accuracy of ROP prediction. To address these challenges and provide more precise predictions, this study proposes an improved hybrid prediction model, CBT-LSTM. The model is evaluated experimentally using data from four wells in a Chinese oilfield. The results indicate that the CBT-LSTM model outperforms the other models in most cases, demonstrating superior predictive accuracy.
The main contributions of this paper include:
- (1) Proposing a hybrid model (CBT-LSTM) that integrates 2D-CNN, BiLSTM, and the temporal pattern attention (TPA) mechanism. In this model, 2D-CNN is used to extract nonlinear feature relationships from the input data, BiLSTM enhances the global understanding of time-series data, and TPA further improves BiLSTM’s focus on key temporal features by reducing redundant information, thereby enhancing prediction accuracy.
- (2) Conducting comparative experiments with traditional neural network methods and testing on wells of different depths. The CBT-LSTM model demonstrated superior performance in metrics such as mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), and R2, indicating its significant feasibility and strong generalization capability.
- (3) Validating the model’s robustness by introducing 10%, 20%, and 30% noise and missing values into the training set. The results show that the model maintains strong predictive performance even in high-noise environments, supporting its practical applicability in real-world scenarios.
The remainder of this paper is structured as follows. Section 2 provides a detailed description of the research methodology and the architecture of the CBT-LSTM model. Section 3 outlines the general steps involved in ROP prediction, including data preprocessing, the construction of the sliding window, and the model evaluation metrics. Section 4 presents the selection of hyperparameters for the CBT-LSTM model, discusses and analyzes the experimental results, and highlights the limitations and directions for future research. Finally, Section 5 summarizes the paper and presents the conclusions.
4. Experiments, Results, and Discussion
4.1. The Impact of Different Convolution Kernel Combinations on 2D-CNN Modeling
When constructing deep learning models, the configuration of neuron parameters is crucial, as different settings can significantly affect the overall performance of the model. In 2D-CNN, multiple convolution kernels are used to learn various features, effectively addressing the multivariate correlation issues in drilling data. Therefore, the size and number of convolution kernels play a key role in the performance of the CNN model. To determine the optimal convolution kernel configuration, experiments were conducted to compare the model’s performance under different settings. Specifically, the number of layers in the 2D-CNN was fixed at three, with traditional 3 × 3 convolution kernels, as well as 2 × 1 and 3 × 1 kernels, being selected to reduce computational cost, as per [45]. To simplify the network design, the stride was set to the default value of 1, and zero padding was applied. Additionally, the number of convolution kernels was kept consistent across the three convolutional layers. As shown in Table 4, when the convolution kernel size was 2 × 1 and the number of kernels was 64, the model achieved the highest R2 value of 0.9624, indicating the best predictive performance.
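The computational saving from the narrower kernels can be illustrated by counting trainable parameters. The sketch below is a rough estimate only: it assumes a single input channel and three identical convolutional layers with 64 kernels each, and the helper name `conv2d_param_count` is hypothetical.

```python
def conv2d_param_count(kernel_size, n_filters, n_layers=3, in_channels=1):
    """Weights plus biases for a stack of identical 2D convolutional layers."""
    k_h, k_w = kernel_size
    total = 0
    channels = in_channels  # assumption: the input window has one channel
    for _ in range(n_layers):
        # Each filter has k_h * k_w * channels weights and one bias.
        total += (k_h * k_w * channels + 1) * n_filters
        channels = n_filters  # the next layer sees n_filters feature maps
    return total


# The three kernel sizes compared in the text, each with 64 kernels per layer.
counts = {k: conv2d_param_count(k, 64) for k in [(3, 3), (3, 1), (2, 1)]}
# counts == {(3, 3): 74496, (3, 1): 24960, (2, 1): 16704}
```

Under these assumptions the 2 × 1 configuration needs fewer than a quarter of the parameters of the 3 × 3 configuration.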
4.2. Hyperparameter Optimization
All experiments were conducted on a Windows system with an Intel i5-12400 CPU (Manufacturer: Intel, Santa Clara, CA, USA), 32 GB of RAM (Manufacturer: KINGBANK, Shenzhen, China), and an NVIDIA GeForce RTX 3060 12 GB GPU (Manufacturer: NVIDIA, Santa Clara, CA, USA). The development was performed using Python 3.9 and TensorFlow 2.10.
To find the optimal combination of model hyperparameters, grid search, random search, and particle swarm optimization (PSO) were used for hyperparameter tuning. Grid search systematically evaluates every combination of hyperparameter values within a predefined grid. This approach guarantees that the best configuration on the grid is found, but it can be computationally expensive, especially for large search spaces. Random search samples hyperparameter combinations at random from the defined parameter space. Compared with grid search, random search explores large search spaces more efficiently, as it can locate near-optimal configurations with less computational effort. PSO is an optimization algorithm that simulates the foraging behavior of bird flocks. It leverages swarm intelligence to dynamically adjust particle positions in the search space, allowing for efficient exploration and identification of the optimal solution. PSO is particularly advantageous in complex, multi-dimensional search spaces, offering a balance between exploration and exploitation.
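The difference between the two simpler strategies can be sketched in a few lines. The search space and the `toy_score` function below are placeholders chosen for illustration (the actual ranges are those given in Table 5, and a real run would train and validate the model for each candidate).

```python
import itertools
import random

# Hypothetical search space; the ranges actually used are listed in Table 5.
space = {
    "batch_size": [32, 64, 128],
    "learning_rate": [0.001, 0.005, 0.01],
    "dropout1": [0.1, 0.2, 0.3],
    "dropout2": [0.1, 0.2, 0.3],
}


def grid_candidates(space):
    """Every combination of the listed values (exhaustive)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def random_candidates(space, n_trials, seed=0):
    """n_trials independent draws, one value per hyperparameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]


def toy_score(cfg):
    """Stand-in for a validation loss; lower is better."""
    return (
        (cfg["learning_rate"] - 0.005) ** 2
        + (cfg["dropout1"] - 0.2) ** 2
        + (cfg["dropout2"] - 0.2) ** 2
        + (cfg["batch_size"] - 64) ** 2 * 1e-6
    )


grid = grid_candidates(space)                      # 3 * 3 * 3 * 3 = 81 candidates
best_grid = min(grid, key=toy_score)
best_random = min(random_candidates(space, n_trials=20), key=toy_score)
```

Grid search evaluates all 81 candidates, whereas random search spends only 20 evaluations and may or may not hit the best grid point.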
In this study, the selected hyperparameters for optimization are batch size, learning rate, dropout rate of the first dropout layer, and dropout rate of the second dropout layer. The approximate range of these hyperparameters was determined based on relevant literature, with the specific search ranges provided in Table 5. The number of iterations was set to 100 to ensure sufficient optimization space. When applying PSO, the particle count was set to 40, with an inertia weight of 0.7, an individual learning factor of 0.5, a social learning factor of 2.5, and a maximum of 100 iterations, as per [2]. These settings aim to strike a balance between convergence speed and exploration of the search space, ensuring the model reaches optimal hyperparameter configurations effectively.
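As an illustration of how these PSO settings enter the update rule, the following is a minimal global-best PSO sketch in pure Python. It uses the particle count, inertia weight, learning factors, and iteration limit stated above; the objective `toy_loss`, the two-dimensional bounds, and the position clamping are assumptions made for the example and are not taken from the study.

```python
import random


def pso_minimize(objective, bounds, n_particles=40, n_iters=100,
                 w=0.7, c1=0.5, c2=2.5, seed=0):
    """Basic global-best PSO; positions are clamped to the box given by bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_loss(x):
    """Hypothetical stand-in for validation loss over (learning rate, dropout)."""
    lr, dropout = x
    return ((lr - 0.005) / 0.01) ** 2 + (dropout - 0.2) ** 2


bounds = [(0.0001, 0.01), (0.0, 0.5)]  # assumed ranges, for illustration only
best_x, best_val = pso_minimize(toy_loss, bounds)
```

In an actual tuning run, `toy_loss` would be replaced by a function that trains the network with the candidate hyperparameters and returns its validation error, which makes each evaluation far more expensive than in this toy setting.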
The hyperparameter tuning results obtained from the three search methods are shown in Table 6. From the perspective of model accuracy, comparing R2 and MAE values, the PSO method performed the best, while random search and grid search yielded similar results. This indicates that PSO more effectively enhances the neural network model’s accuracy and is recommended over the other two methods. In the hyperparameter combination derived from PSO, the batch size was 64, the learning rate was 0.005, dropout1 was 0.2, and dropout2 was 0.2. Under this hyperparameter configuration, the model achieved the highest R2 value of 0.9684, indicating optimal predictive performance. Therefore, in all subsequent experiments, the model’s hyperparameters were fixed to these optimal values.
4.3. Cross-Validation Method and Performance Analysis
During model development, cross-validation is commonly used because it makes more efficient use of the data samples: each sample is used for both training and validation across different splits. This yields more validation results, helping to identify the optimal model parameters and reducing the risk of overfitting. Among these methods, K-fold cross-validation divides the dataset D into K approximately equal and mutually exclusive subsets, which is particularly suitable for large datasets as it reduces the possibility of overfitting. This method is considered a non-exhaustive cross-validation approach. Leave-P-out cross-validation is a more exhaustive cross-validation method. It involves removing P samples from the entire dataset to create all possible training and testing sets. For a dataset with n samples, this method generates $\binom{n}{P}$ training–testing pairs. In this study, 5-fold cross-validation, 10-fold cross-validation, and leave-P-out cross-validation methods were used for model comparison. The results are shown in Table 7.
The results indicate that the 10-fold cross-validation method performed the best across all evaluation metrics, with the lowest MAE, RMSE, and MAPE values, and the highest R2 value, reaching 0.9769. This suggests that 10-fold cross-validation not only improves the model’s accuracy but also enhances its generalization ability on unseen data. In comparison, while the 5-fold cross-validation method is relatively simple, it falls slightly short in terms of predictive accuracy. The leave-P-out cross-validation method demonstrated performance similar to the 10-fold method but, due to its computational complexity, may be less efficient for practical applications. Therefore, it is recommended to prioritize 10-fold cross-validation during model evaluation to achieve more reliable predictive results and better model performance assessment.
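The splitting schemes compared above can be sketched as follows. This is a minimal illustration that assumes contiguous, unshuffled folds (whether the study shuffled samples before splitting is not stated); the function names are hypothetical.

```python
import math


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, mutually exclusive folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder evenly
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = k_fold_indices(n_samples, k)
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def leave_p_out_pair_count(n_samples, p):
    """Number of distinct training-testing pairs: n choose p."""
    return math.comb(n_samples, p)


splits = list(k_fold_splits(n_samples=23, k=5))
pairs_small = leave_p_out_pair_count(10, 2)    # 45
pairs_large = leave_p_out_pair_count(1000, 2)  # 499500
```

The pair count grows combinatorially with the dataset size, which is why leave-P-out becomes costly long before K-fold does.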
4.4. Ablation Experiments
To validate the effectiveness of the improvements proposed in this paper, three sets of ablation experiments were designed to evaluate the roles of BiLSTM and TPA. Four models participated in the ablation study: CBT-LSTM, 2D-CNN-LSTM-TPA, 2D-CNN-BiLSTM-SelfAttention, and 2D-CNN-BiLSTM. All other parameters of these models were kept consistent.
Figure 11 shows the changes in loss values during the training process for the four models. As seen in the figure, the training loss of the CBT-LSTM model continuously decreases throughout the training process and exhibits lower loss values compared with the other models. The comparison results are shown in Figure 12. Table 8 shows the comparison of evaluation metrics for ablation experiments on the test set. The figure and table show that the proposed model achieves the highest prediction accuracy. Replacing BiLSTM with a standard LSTM results in decreased prediction accuracy. This is likely because, while LSTM effectively handles time series data, it lacks the bidirectional capability necessary to fully capture complex dependencies. Additionally, the model’s prediction accuracy declines in the absence of an attention mechanism. Replacing TPA with self-attention also leads to a reduction in prediction accuracy. Although the self-attention mechanism is effective, TPA is better at capturing complex temporal variations and key patterns, significantly enhancing the model’s predictive performance. These ablation experiments demonstrate that the inclusion of BiLSTM and TPA plays a crucial role in improving the model’s predictive performance, fully validating the proposed enhancements.
4.5. Contrast Experiments
To verify the significant advantages of the proposed ROP prediction model over other commonly used ROP prediction models, comparisons were made with 1D-CNN-LSTM, LSTM-Attention, BiLSTM, and ANN algorithms.
Figure 13 shows the comparison between the actual ROP values and the predicted ROP values of different models. From the graph, it is evident that the predicted ROP curve obtained by the proposed model aligns more closely with the actual data curve compared with the ROP predictions from other models.
Figure 14 illustrates the linear relationship between the predicted and measured ROP values for five different models. The CBT-LSTM model shows the closest correlation between the predicted and measured ROP, indicating its superior performance. The prediction results of different models are shown in Table 9. After comparing the performance metrics of these five models, the CBT-LSTM model demonstrated the best performance in key evaluation metrics such as R2, MAE, MAPE, and RMSE.
The superior performance of the CBT-LSTM model can be attributed in part to its use of a sliding window, which transforms the time-series data into a 2D structure, enabling the 2D-CNN to capture temporal–spatial relationships more effectively than other models. This model achieved the lowest MAE, RMSE, and MAPE, and the highest R2, indicating its ability to accurately predict ROP with significantly improved accuracy compared with other neural networks. Considering the evaluation of all results, the CBT-LSTM model demonstrates superior performance across all evaluation metrics, further proving its practicality in ROP prediction.
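The sliding-window transformation mentioned above can be sketched as follows. The example assumes that each window of consecutive time steps is paired with the ROP value at the step immediately after the window; the exact label alignment and window length used in the study are defined in Section 3, so the function name and toy data here are illustrative only.

```python
def sliding_windows(features, targets, window):
    """Turn a (T x F) series into (window x F) samples.

    Each sample is paired with the target at the step right after its window.
    """
    samples, labels = [], []
    for start in range(len(features) - window):
        samples.append([row[:] for row in features[start:start + window]])
        labels.append(targets[start + window])
    return samples, labels


# Toy series: 6 time steps, 2 features per step (illustrative values).
features = [[float(t), float(10 * t)] for t in range(6)]
targets = [float(100 * t) for t in range(6)]

X, y = sliding_windows(features, targets, window=3)
# X holds 3 samples, each a 3 x 2 grid that a 2D convolution can scan
# along both the time axis and the feature axis.
```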
4.6. Model Robustness Validation
During the drilling data collection and transmission process, factors such as sensor inaccuracies, unstable communication lines, and electromagnetic interference can lead to measurement errors and data loss, which in turn cause data quality issues. To assess the robustness of the model, noise is typically introduced into the training samples. By comparing the changes in the model’s performance metrics before and after the introduction of noise, the model’s sensitivity to varying levels of noise can be measured. This approach helps evaluate how well the model can handle noisy and incomplete data, which is crucial for ensuring reliable performance in real-world drilling applications.
In the Well A dataset, outliers and missing values had already been removed and the data had been denoised. To further validate the robustness of the model, this research introduced 10%, 20%, and 30% noise and missing values into the original training set. Specifically, the total number of elements to be processed was calculated based on the specified percentages, and random indices were selected for those elements. For each selected index, there was a 50% probability of adding noise and a 50% probability of setting it to a missing value. The noise added was drawn from a normal distribution with a mean of 0 and a standard deviation of 0.1. All hyperparameters were set to their default values. No denoising was applied to the introduced noise. However, as PCA dimensionality reduction cannot handle NaN values, the missing information in the features was filled using polynomial interpolation.
Figure 15 shows the comparison between the predicted values and the actual values for the Well A dataset after the addition of noise.
Table 10 lists the performance metrics of the model using the original data versus the data with added noise.
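The corruption procedure described above can be sketched in pure Python as follows. The sketch operates on a flat list of values and uses a fixed seed; how the training array was flattened and whether the noise was applied before or after normalization are not stated, so those details, together with the function name `corrupt`, are assumptions.

```python
import math
import random


def corrupt(values, fraction, noise_std=0.1, seed=0):
    """Return a corrupted copy of values and the sorted indices that were altered."""
    rng = random.Random(seed)
    corrupted = list(values)
    n_selected = int(round(fraction * len(values)))
    selected = rng.sample(range(len(values)), n_selected)  # distinct random indices
    for idx in selected:
        if rng.random() < 0.5:
            # additive Gaussian noise with mean 0 and standard deviation noise_std
            corrupted[idx] += rng.gauss(0.0, noise_std)
        else:
            # mark the element as missing
            corrupted[idx] = math.nan
    return corrupted, sorted(selected)


clean = [0.5] * 1000                            # placeholder training values
noisy, touched = corrupt(clean, fraction=0.2)   # 20% of the elements are altered
n_missing = sum(math.isnan(v) for v in noisy)
```

The NaN entries produced this way would then need to be filled (by polynomial interpolation in the study) before PCA is applied.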
From the results presented in the figures and table, it can be observed that adding 10% noise has a minimal impact on the model’s performance. However, as the noise level increases to 30%, the model’s R2 value decreases significantly, though it remains at 0.8061. This demonstrates that the model retains strong robustness, even with a higher noise level. These findings provide strong support for the model’s application in complex drilling environments, where data quality may be compromised due to various external factors.
4.7. Model Generalization Verification
To validate the generalization capability of the CBT-LSTM model, it was applied to three other wells (Wells B, C, and D) in the same region for training and testing. The prediction results for these wells are shown in Figure 16, Figure 17 and Figure 18. Figure 16a, Figure 17a and Figure 18a display the fit between the actual ROP and the predicted ROP for the three wells, demonstrating that the two curves generally remain consistent. Figure 16b, Figure 17b and Figure 18b show scatter plots of the actual ROP versus the predicted ROP, with most blue points distributed near the orange line, indicating a strong correlation between the actual and predicted ROP values. Figure 16c, Figure 17c and Figure 18c illustrate the relative error between the actual and predicted values. The errors are relatively small for Well B, and although there are fluctuations for Wells C and D, the errors remain within an acceptable range.
The evaluation metric results are presented in Table 11. The model generally performs well in most cases, especially on the datasets for Wells B and D. The performance is slightly inferior on the Well C dataset, possibly due to the model’s insufficient learning capability for shallower well depths. Overall, the model accurately captures the complex nonlinear relationship between the actual ROP and its features, demonstrating good generalization performance across different wells.
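For reference, the four evaluation metrics reported throughout this section can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed to match those given in Section 3; the function name and the toy values are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MAPE (in percent), RMSE and R2 under their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when a true value is zero; the toy data avoids that case.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "R2": r2}


# Toy values for illustration only; they are not data from the study.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```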
4.8. Limitations and Future Directions
Despite achieving high accuracy and generalization capability in ROP prediction through neural networks, the main limitation of this study lies in the scope of the data. First, the various parameters in the drilling process may differ, as different regions and formations involve distinct bit designs, operations, and geological parameters. A mismatch between the model’s training conditions and the actual application environment could lead to a decrease in predictive accuracy. Moreover, the complexity and uncertainty of subsurface geological conditions across different global regions mean that this model has not yet been fully validated in diverse real-world settings, particularly in wells outside the original test area. This constitutes a major limitation of the current research.
In future work, the model should be evaluated on more general datasets, such as data from different oil and gas wells and varied operational scenarios, especially highly inclined wells. Additionally, future work will explore integrating cutting-edge deep learning techniques, such as reinforcement learning and transfer learning, to mine the intrinsic patterns in the data more deeply and further enhance the model’s predictive accuracy and robustness. Finally, the improved model will be integrated into real-time monitoring systems, enabling real-time safety prediction and monitoring during the drilling process.
5. Conclusions
To address the issues of low ROP prediction accuracy and insufficient utilization of data features, this paper proposes a novel ROP prediction model called CBT-LSTM, which integrates 2D-CNN, BiLSTM, and temporal pattern attention (TPA). In this model, 2D-CNN is responsible for extracting complex feature relationships from the processed data, BiLSTM captures bidirectional information within the data, and TPA dynamically assigns feature weights to enhance the network’s ability to extract critical information.
The experiments were conducted using data from four vertical wells in a Chinese oilfield. First, noise was reduced using an SG filter, and features were selected using five different correlation coefficient methods. Principal component analysis (PCA) was then applied for dimensionality reduction, and a sliding window approach was used to convert one-dimensional sequential data into two-dimensional spatial sequence data. By comparing various hyperparameter optimization algorithms and cross-validation methods, the optimal combination was identified. Additionally, ablation experiments were conducted to validate the importance of BiLSTM and TPA in improving the model’s performance.
To validate the effectiveness and potential advantages of the proposed model, a comprehensive comparison was conducted with benchmark models including 1D-CNN-LSTM, LSTM-Attention, BiLSTM, and ANN. The CBT-LSTM model achieved MAE, RMSE, MAPE, and R2 values of 0.0295, 0.0357, 9.3101%, and 0.9769, respectively, demonstrating higher prediction accuracy than the other models. Additionally, to test the robustness of the model, noise and missing values were introduced into the training data from Well A. When the proportion of corrupted elements was 10%, 20%, and 30%, the model’s R2 values were 0.9583, 0.8943, and 0.8061, respectively. Although the model’s accuracy declined with increasing noise, R2 remained above 0.80, indicating strong resilience in handling anomalies and further validating its robustness. Finally, in generalization experiments on the other three wells, the model achieved R2 values exceeding 0.95, confirming its strong generalization ability across different wells and operational conditions.
By combining theoretical knowledge with practical implementation, this study conducted extensive experiments to validate the effectiveness of the proposed model. The experimental results clearly demonstrate the superiority of the CBT-LSTM model. This research not only provides an effective approach to improving ROP prediction accuracy but also offers crucial technical support for the optimization of real-time drilling operations.