1. Introduction
The tight reservoir in the WZ block is located in the Beibu Gulf sea area. The reservoir is characterized by strong heterogeneity, complex pore structure, and poor pore connectivity. Porosity is low, usually between 5% and 20%; permeability is low, distributed between 0.04 and 140 mD, mostly within 10 mD. At the same time, the water saturation of tight reservoirs is high, usually between 30% and 90%.
Machine learning originated in the 1950s [1,2] and was not applied to oil and gas production until the 1980s [3,4,5]. In the 21st century, with the gradual expansion of machine learning algorithms and the improvement in computational power, machine learning has made significant strides in many fields. Jani D.B. (2017) used an ANN (Artificial Neural Network) model to accurately predict the performance of solid desiccant cooling systems [6]. Nasirzadeh F. (2020) employed an ANN model combined with the PI method to provide a novel approach for predicting labor productivity [7]. Soroush Ahmadi (2024) utilized the response surface method to optimize the corrosion inhibition performance of 2-mercaptobenzothiazole (2-MBT) for carbon steel in 1 M HCl [8]. Machine learning is also becoming increasingly prevalent in oil and gas production [9,10,11]. Because of the complexity of low-permeability tight reservoirs, productivity is affected by many factors. To obtain more accurate productivity predictions, identifying the main control factors can reduce the complexity of the model and accelerate its training. Li X. (2013) used three parameters, permeability, porosity, and first closure pressure, as model inputs to predict productivity [12]. Cao Q. (2016) used thickness, average porosity, average clay content, and density as model inputs, but the prediction accuracy on the test set was low [13]. The features mapped by such conventional parameters cannot fully represent the productivity data, which ultimately limits the prediction results. As the feature extraction ability of models improved, researchers began to add fracturing information to the model parameters. Researchers such as Alimkhanov Rustam (2014), Yue Ming (2024), Wu Lei (2023), and Qin Ji (2022) added information generated by fracturing to their models [14,15,16,17]. Alimkhanov Rustam (2014) [17] used geological information from the Povkh oilfield before and after fracturing (total and net thickness, porosity, permeability coefficient, oil saturation coefficient, net-to-gross ratio, reservoir interval, macroscopic heterogeneity, transmissibility) as parameters to predict the fracturing effect and then optimize the fracturing parameters. Wu Lei (2023) [14] and Qin Ji (2022) [15] used the fracture half-length and the number of fractures as part of the input parameters to build models that can be used to optimize fracturing parameters.
In recent years, with the widespread adoption of machine learning, some researchers have used data mining methods to predict production capacity [18,19,20,21]. Dong (2022) [19] used a regression tree model combined with the Spearman correlation coefficient and a recursive feature elimination algorithm to rank the importance of influencing factors and predict initial production capacity. Hui G. (2021) [21] used four methods (linear regression, neural network, regression tree, and decision tree) to predict natural gas production capacity in the Fox Creek area; the regression tree and decision tree methods achieved better prediction accuracy. As development needs grew, a single model could no longer meet actual production needs, so researchers began to study composite models [22,23]. Liu Jie (2023) [22] used a KNN-BP (K-Nearest Neighbors Backpropagation) neural network model to predict the productivity of tight sandstone gas reservoirs in the SM block. Compared with a single network model (BP neural network) and other algorithms (support vector machine, random forest, linear regression), the prediction accuracy of the composite neural network model was higher. Fargalla and Mandella Ali M. (2024) [23] proposed a new model called TimeNet. The model combines a convolutional neural network, a bidirectional gated recurrent unit, an attention mechanism, and Time2Vec, which can not only capture complex nonlinear temporal information but also extract spatial formation characteristics. It performed very well in productivity prediction for the Fenchuganj conventional sandstone gas field and the Marcellus shale gas field. The choice of model also depends on the problem at hand. Machine learning models are divided into unsupervised and supervised models. Unsupervised models, such as the K-means (K-means Clustering) model, are suitable for clustering analysis. Li Yuanzheng (2022) used the K-means model to classify reservoirs based on pore structure parameters [24]. Supervised models include support vector machines, long short-term memory neural networks, deep neural networks, etc. Based on a long short-term memory neural network, Fu (2023) predicted production capacity with time-series parameters, which were used to predict daily production data [25].
Considering the special reservoir physical conditions in the WZ area and the large gap in the specific oil production index between wells, there may be a linear relationship between the main control factors and the specific oil production index, whereas neural network models excel at nonlinear problems. Furthermore, conventional neural network models often use static parameter samples for training and prediction, meaning that parameters with dynamic changes are simply averaged or replaced with the median of their change range and treated as static parameters. However, reservoirs in the WZ region are tight reservoirs whose properties change dramatically with depth, making a static approach inappropriate. In this study, Pearson correlation analysis and partial correlation analysis are used to identify the main controlling factors for productivity. Subsequently, dynamic parameter samples are constructed from the main controlling factors that vary with depth in the original data. These dynamic parameters are then processed by an LSTM (Long Short-Term Memory) network, which converts them into static features. The remaining controlling factors are combined with the converted static features to serve as the input of a BP network, which then predicts productivity. The LSTM network excels at extracting features from dynamic data and performs better than simply averaging or taking the median of the change range of dynamically varying parameters. Meanwhile, the BP network is well suited to nonlinear problems and is convenient to set up, with a short training cycle.
2. Production Capacity Prediction Model Establishment Process
(1) Problem Description. In view of the special productivity conditions in this area, the specific productivity index under multi-mechanism control is recast as a specific productivity index determined by fixed characteristic parameters (logging parameters, crude oil parameters, formation parameters, construction parameters), forming a set of mathematical mappings from multiple characteristic parameters to the target parameter. This simplifies the description of the practical problem and allows a capacity prediction model to be established quickly and efficiently.
(2) Data Preparation. The neural network model belongs to supervised learning, and a sample data set must be available to update the model parameters. Before training, the sample set is constructed from the original data according to the initially selected feature parameters and divided into a training set, a validation set, and a test set. The training set is used in the forward propagation of the model to form the computational network; the validation set is used to guide the updating of the network's parameters; and the test set is used to evaluate the model's prediction performance once training is complete.
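The split described above can be sketched as follows. This is a minimal numpy illustration; the 70/15/15 proportions and the function name are assumptions for demonstration, not the exact split used in the study.

```python
import numpy as np

def split_dataset(X, y, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples, then cut them into training, validation,
    and test subsets (fractions here are illustrative)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(len(X) * train_frac)
    n_val = int(len(X) * val_frac)
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Toy example: 43 samples with 8 features each, as in this study's data set.
X = np.random.rand(43, 8)
y = np.random.rand(43)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 30 6 7
```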
(3) Establish and optimize the model. Because the original data contain abundant logging and physical property data but limited other geological parameter data, and considering this inconsistency in data volume, the data set is established in two ways: in the first, each influencing parameter of a sample is reduced to a static value by averaging; in the second, the logging and physical property parameters recorded over continuous depth are kept as dynamic parameters, while the other parameters are treated as static. Two network models, the BP neural network and the LSTM-BP neural network, are established correspondingly. Model optimization is divided into hyperparameter and parameter optimization. Hyperparameters are set manually before the model is established, while parameters are optimized by the loss function and the optimization function during training on the training and validation sets. The selection of hyperparameters and parameters determines the prediction performance of the model.
(4) Model evaluation. The test set is input into the optimized model, and the predicted values are output. Four evaluation indices are selected: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R2). These are used to evaluate the prediction performance of the model, and the model with the best prediction performance is selected.
(5) Model application. The model is applied in actual production: verification samples are collected from the original data according to the main influencing factors of capacity, and the samples are input into the model for capacity prediction.
3. Data Feature Preprocessing
3.1. Data Reduction
After collecting the original data of the study area, the influencing factors related to tight oil wells are preliminarily screened, mainly including logging parameters (natural gamma, acoustic time difference, density, deep resistivity, shale content), formation parameters (effective permeability, porosity, volume coefficient, skin factor, effective thickness, water saturation), crude oil parameters (oil relative density, crude oil viscosity), and construction parameters (oil well radius). Some parameters are treated as dimensionless in the calculation (Table 1).
3.2. Analysis of Main Control Parameters of Production Capacity
Before establishing the two network models, it is necessary to consider whether the characteristic parameters of the data set are correlated with the predictors and whether the characteristic parameters overlap with one another. Feature selection can not only reduce the number of feature parameters, accelerate model training, and reduce the possibility of overfitting but also improve the generalization ability of the model, so that it maintains a certain robustness in practical applications; otherwise, a model may predict well on the test set yet perform poorly in actual application.
The data set structures of the two models are different. The data set combining dynamic and static parameters is essentially used to extract the internal mathematical information of the dynamic parameters; its purpose is the same as that of the static parameter data set, but the mixed dynamic and static structure makes it difficult to analyze the main control factors directly. Therefore, the data set composed entirely of static parameters is used to analyze the main control parameters. Pearson correlation analysis and partial correlation analysis are jointly used for feature selection. The purpose is to remove parameters weakly correlated with the target parameter and to retain feature parameters that are weakly correlated with one another but strongly correlated with the target parameter. In this way, the difficulty of model training is reduced, and model fitting is accelerated.
First, Pearson correlation analysis was used. Pearson correlation analysis is a statistical method that measures the degree of linear correlation between two variables. It is based on the Pearson correlation coefficient, usually represented by the symbol 'r', which lies between −1 and 1. It reveals the strength of relationships, which helps to identify the factors with the greatest impact on productivity. On the one hand, the Pearson correlation coefficient can identify key factors that have a significant linear correlation with productivity; on the other hand, it can flag factors that show no linear correlation and may therefore act nonlinearly, providing a basis for the use of neural networks.
r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}} (1)

In Equation (1): r is the correlation coefficient; n is the number of samples; x_i and y_i are the two characteristic parameter values; \bar{x} and \bar{y} are the mean values of the two characteristic parameters.
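Equation (1) can be computed directly; the following is a minimal numpy sketch (the variable names are illustrative, not taken from the study's data set):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient r, as in Equation (1)."""
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm**2) * np.sum(ym**2)))

# Two perfectly linearly related parameters give |r| = 1.
porosity = np.array([0.08, 0.10, 0.12, 0.15, 0.18])
index = 2.0 * porosity + 0.01        # exact linear relation (synthetic)
print(pearson_r(porosity, index))    # 1.0 up to floating-point rounding
```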
Through the Pearson correlation analysis (Figure 1), it can be found that the volume coefficient, water saturation, acoustic time difference, shale content, porosity, and effective permeability have a high correlation with the specific oil production index. In the figure, squares closer to blue represent a stronger negative correlation, while those closer to red indicate a stronger positive correlation. The correlation coefficients between the specific productivity index and the oil well radius, crude oil viscosity, and oil relative density are low. It is worth noting that the specific productivity index has a high correlation with the shale content but a very low correlation with the relative density of oil, while there is a certain correlation between the shale content and the relative density of oil. This indicates that the relationship between the relative density of oil and the specific productivity index is nonlinear (an absolute correlation coefficient close to 1 indicates a strong linear correlation, while a value close to 0 suggests that any remaining relationship is nonlinear). Therefore, a neural network structure is helpful for fitting such nonlinear relations.
Partial correlation measures the relationship between two variables while controlling for the influence of other variables. It helps control for confounding factors, allowing a more accurate assessment of the independent effect of a specific factor on productivity. Through partial correlation analysis (Figure 2), it can be found that the contributions of the oil well radius, crude oil viscosity, oil relative density, and natural gamma to the specific oil production index are small. At the same time, water saturation, deep resistivity, acoustic time difference, and shale content fall in the same contribution interval. In parameter selection, parameters with a small contribution and a high correlation with other parameters can be eliminated; among parameters with similar contributions, those that are less correlated with the other parameters, that can represent other unselected parameters, and that do not conflict with the already selected parameters should be retained.
Pearson correlation coefficients help to initially identify factors related to productivity and the strength of their relationships. Partial correlation, in turn, provides a more refined analysis by controlling for the influence of other variables, helping to determine the independent effect of a specific factor. Combining the two analyses, the oil well radius, crude oil viscosity, oil relative density, and natural gamma are eliminated because their contributions are too small. Density and deep resistivity, as well as acoustic time difference and shale content, make relatively equal contributions. Density has a significant correlation with water saturation and the volume coefficient and also has a certain correlation with the unselected natural gamma, so it is retained; deep resistivity has a strong correlation with the effective thickness and volume coefficient and a certain correlation with density, so deep resistivity is eliminated. The same reasoning applies to the acoustic time difference and shale content pair, and the acoustic time difference is eliminated. The final training sample parameters are the volume coefficient, water saturation, density, effective thickness, skin factor, shale content, porosity, and effective permeability, with the specific oil production index as the target parameter.
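To illustrate how partial correlation strips out a confounder, the sketch below computes partial correlations via the inverse of the correlation matrix (one standard formulation; not necessarily the routine used in the study, and the data are synthetic). Two parameters driven by a common factor show a high raw correlation that nearly vanishes once that factor is controlled for.

```python
import numpy as np

def partial_corr(data):
    """Partial correlation of each pair of columns, controlling for all
    remaining columns, via the precision (inverse correlation) matrix."""
    prec = np.linalg.inv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

rng = np.random.default_rng(1)
z = rng.normal(size=200)               # common driver (e.g. a rock property)
a = z + 0.1 * rng.normal(size=200)     # two parameters both driven by z
b = z + 0.1 * rng.normal(size=200)
data = np.column_stack([a, b, z])

print(round(np.corrcoef(a, b)[0, 1], 2))   # high raw correlation
print(round(partial_corr(data)[0, 1], 2))  # near zero once z is controlled
```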
3.3. Data Processing
Considering the specific application scenario, the data set is divided into three parts: training set, validation set, and test set. The training set and the validation set jointly optimize the model parameters; that is, they establish the functional relationship between the characteristic parameters and the specific oil production index. The test set itself does not participate in the training process. Its role is to evaluate the prediction performance and generalization ability of the optimized model after the model has established a complete prediction function, ensuring that the model can handle unknown data. Evaluating the model on data that were not used in training improves the authenticity and applicability of the assessment, so the prediction accuracy obtained on the test set is more reliable.
In the raw data, it is inevitable that some parameter values are missing or anomalous. To address this, a nearest-neighbor approach is used; in machine learning, similar samples tend to have similar parameter values. The basic process is as follows: identify the missing or anomalous parameters of a sample; form a candidate set of other samples in which those parameters are present and valid; calculate the Euclidean distance between the sample to be repaired and each candidate over the remaining parameters; select the three candidates with the smallest distances as similar samples; and use the average of their values for the affected parameter as the replacement value.
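The repair procedure above can be sketched as follows, for a single missing value. The well data are purely illustrative, and this simplified version handles one missing cell at a time; the study's actual implementation may differ in detail.

```python
import numpy as np

def knn_impute(samples, row, col, k=3):
    """Replace the missing value samples[row, col] (NaN) with the average of
    that parameter over the k samples nearest in the remaining parameters."""
    other_cols = [c for c in range(samples.shape[1]) if c != col]
    donors = np.flatnonzero(~np.isnan(samples[:, col]))
    donors = donors[donors != row]
    # Euclidean distance over the parameters present in the target row
    dists = np.linalg.norm(samples[np.ix_(donors, other_cols)]
                           - samples[row, other_cols], axis=1)
    nearest = donors[np.argsort(dists)[:k]]
    return float(samples[nearest, col].mean())

# Illustrative samples: three features per well; row 2 has a missing value.
wells = np.array([[0.10, 0.9, 35.0],
                  [0.11, 1.0, 36.0],
                  [0.12, 1.1, np.nan],
                  [0.13, 1.2, 37.0],
                  [0.30, 3.0, 80.0],
                  [0.31, 3.1, 81.0]])
print(knn_impute(wells, row=2, col=2))  # 36.0, the mean of the 3 nearest wells
```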
In the original data, the characteristic parameters have different physical dimensions and inconsistent numerical ranges, which makes the model difficult to fit. To make the characteristic parameters comparable, parameters expressed as percentages are first converted to decimals, and then all characteristic parameters are normalized. Normalization converts variables with different ranges into the same range, so that each sample value falls between 0 and 1. For a feature x in the same sample set, the normalization formula is:
x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} (2)

In Equation (2): x* is the normalized value of the characteristic parameter; x is the original value of the characteristic parameter; x_min is the minimum value of the characteristic parameter; x_max is the maximum value of the characteristic parameter. Throughout model training, normalization is performed separately for each sample set. Likewise, when evaluating the model, the prediction results are in normalized form, and an anti-normalization (inverse) operation is needed to obtain the real predicted data.
4. Model Establishment and Evaluation
4.1. Model Evaluation Indicators
The mean absolute error (MAE) is the mean of the absolute error between the predicted and actual values. It is a linear index, meaning that all individual differences are weighted equally in the average. It clearly indicates the gap between predicted and actual values without considering the direction of the error, is not sensitive to outliers, and is simple to calculate.

\eta_{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left| y_i - \hat{y}_i \right| (3)

In Equation (3): η_MAE is the mean absolute error, m³·d⁻¹·MPa⁻¹; y_i is the real value of the specific oil production index, m³·d⁻¹·MPa⁻¹; ŷ_i is the predicted value of the specific oil production index, m³·d⁻¹·MPa⁻¹; m is the number of samples.
The root mean square error (RMSE) is the square root of the mean of the squared differences between the predicted values and the actual observed values. Because the differences are squared, larger errors are weighted more heavily, so it better reflects the accuracy of the model. It has good mathematical properties, such as differentiability, so it can easily be used in optimization algorithms. The smaller its value, the better the prediction. The expression is:

\eta_{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left( y_i - \hat{y}_i \right)^{2}} (4)

In Equation (4): η_RMSE is the root mean square error, m³·d⁻¹·MPa⁻¹.
The mean absolute percentage error (MAPE) represents the mean prediction error as a percentage of the actual value, so it is easy to understand and can be used to compare prediction performance across different data sets, because it is relative to the actual value. Compared with other criteria, it is more intuitive and is a relative measure. The smaller its value, the better. The expression is:

\eta_{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left| \frac{y_i - \hat{y}_i}{y_i} \right| (5)

In Equation (5): η_MAPE is the mean absolute percentage error.
The coefficient of determination (R²) represents the proportion of the variance in the dependent variable that can be explained by the independent variables, ranging from 0 to 1. Being dimensionless, coefficients of determination from different data sets can be compared directly. The closer to 1, the stronger the model's ability to explain the data, and vice versa.

R^{2} = 1 - \frac{\sum_{i=1}^{m}\left( y_i - \hat{y}_i \right)^{2}}{\sum_{i=1}^{m}\left( y_i - \bar{y} \right)^{2}} (6)

In Equation (6): R² is the coefficient of determination; ȳ is the mean of the real values of the specific oil production index, m³·d⁻¹·MPa⁻¹.
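Equations (3)–(6) can be computed together; the following sketch uses toy values purely for illustration:

```python
import numpy as np

def mae(y, yhat):   # Equation (3)
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):  # Equation (4)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):  # Equation (5), in percent
    return float(np.mean(np.abs((y - yhat) / y)) * 100.0)

def r2(y, yhat):    # Equation (6)
    return float(1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2))

y_true = np.array([0.05, 0.08, 0.20, 0.50])  # specific oil production index
y_pred = np.array([0.06, 0.07, 0.22, 0.45])
print(mae(y_true, y_pred), rmse(y_true, y_pred))
print(round(mape(y_true, y_pred), 1), round(r2(y_true, y_pred), 3))
```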
4.2. Model Design and Evaluation
A BP neural network is usually composed of an input layer, a hidden layer, and an output layer. The input layer accepts external data, and each input node corresponds to a feature or attribute of the input data. The hidden layer lies between the input layer and the output layer and can have multiple layers. Each layer contains multiple neurons (nodes), and each neuron receives input from the previous layer of neurons and generates output for the next layer (Figure 3). BP neural networks are characterized by a simple structure and a short training time, and they can achieve good results with fully static parameter samples. However, in productivity prediction there is a process of converting dynamic parameters to static ones, and using the average or median of dynamic values inevitably loses feature information. The BP model is a conventional neural network model, so using the BP neural network as a control group is of significant value.
The BP neural structure uses sample data with fully static parameters (Figure 4). The root mean square error is used as the loss function, which makes it straightforward for gradient descent methods to update the weights. The backpropagation algorithm adjusts the model weights and biases until they converge to a reasonable range. Finally, the test set is used to evaluate the model.
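The training procedure just described can be sketched as a minimal one-hidden-layer BP network in numpy. The synthetic data, sigmoid activation, layer sizes, and learning rate are all illustrative assumptions; the gradient applied is that of the mean squared error, which shares its minimizer with the RMSE loss tracked below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static-parameter samples: 8 features -> specific oil production index.
X = rng.random((35, 8))
y = (X @ rng.random(8) / 8.0).reshape(-1, 1)   # synthetic target in [0, 1]

# One hidden layer with sigmoid activation; weights randomly initialized.
W1, b1 = rng.normal(0.0, 0.5, (8, 16)), np.zeros((1, 16))
W2, b2 = rng.normal(0.0, 0.5, (16, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr, losses = 0.1, []
for epoch in range(500):
    h = sigmoid(X @ W1 + b1)                    # forward pass
    yhat = h @ W2 + b2
    err = yhat - y
    losses.append(float(np.sqrt(np.mean(err ** 2))))   # RMSE loss value
    # Backpropagation: gradient of the mean squared error
    g2 = 2.0 * err / len(X)                     # dL/d(yhat)
    g1 = (g2 @ W2.T) * h * (1.0 - h)            # dL/d(hidden pre-activation)
    W2 -= lr * (h.T @ g2)
    b2 -= lr * g2.sum(axis=0)
    W1 -= lr * (X.T @ g1)
    b1 -= lr * g1.sum(axis=0)

print(losses[0], "->", losses[-1])              # loss decreases over training
```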
Considering the large span of each layer, the selected parameter value is the average of the continuous values within the layer. Because of the large span of the layer and the uneven distribution of crude oil within it, taking the average introduces a certain error. At the same time, each layer is divided into smaller sub-layers whose number is not uniform. To extract this layer information more accurately, the LSTM long short-term memory network is selected.
According to the actual raw data sources, shale content, effective permeability, porosity, water saturation, and density change with depth and are therefore dynamic parameters. The volume coefficient, skin factor, and effective thickness are derived from logging data and belong to static data. The LSTM-BP neural structure uses a sample structure that combines dynamic and static parameters (Figure 5).
In the LSTM-BP neural network, the LSTM network structure can process sequence data well and capture long-term dependencies within it (Figure 6). In the overall structure, the LSTM captures the feature relationships in the dynamic parameters and outputs feature data. The static parameters and the feature data output by the LSTM are merged as the input of the BP neural network, which finally outputs the target parameter.
The LSTM-BP neural network structure is created, and the initial model weights are randomly initialized. The training samples are input into the model to produce predicted values; the root mean square error is used as the loss function; and the weights are optimized by gradient descent and reloaded into the model. After each round of weight updates, the test-set loss is calculated and recorded (Figure 7). Once all training iterations are complete, the best weights are selected. When the model is used for prediction, the training set and multiple rounds of training are no longer required. The LSTM-BP model not only extracts data features from dynamic changes but also leverages the efficient nonlinear fitting capability of the BP neural network, providing better results for productivity forecasting.
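The dynamic-to-static conversion and the merge step can be sketched as a single forward pass with random, untrained weights. The dimensions, gate layout, and data are illustrative assumptions; in the study, the entire composite network is trained end-to-end rather than run once with random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_features(seq, Wx, Wh, b, hidden=8):
    """Run one LSTM cell over a depth sequence of dynamic parameters and
    return the final hidden state as a static feature vector."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in seq:                        # one step per depth sample
        z = Wx @ x + Wh @ h + b          # all four gates computed at once
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                # long-term cell state
        h = o * np.tanh(c)               # short-term hidden state
    return h

n_dyn, n_static, hidden = 5, 3, 8
Wx = rng.normal(0, 0.3, (4 * hidden, n_dyn))
Wh = rng.normal(0, 0.3, (4 * hidden, hidden))
b = np.zeros(4 * hidden)

seq = rng.random((120, n_dyn))    # 120 depth points x 5 dynamic parameters
static = rng.random(n_static)     # e.g. volume coefficient, skin, thickness

features = lstm_features(seq, Wx, Wh, b, hidden)
bp_input = np.concatenate([features, static])  # merged input of the BP part
print(features.shape, bp_input.shape)          # (8,) (11,)
```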
Verification results for a small sample are shown in Table 2. For oil wells with actual production indices in the range of 0.01–0.1 m³·d⁻¹·MPa⁻¹, the relative prediction error of the LSTM-BP model is similar to that of the BP model. However, when the production index exceeds 0.1 m³·d⁻¹·MPa⁻¹, the LSTM-BP model outperforms the BP model in prediction accuracy. This is because the sample data set contains more samples in the 0.01–0.1 m³·d⁻¹·MPa⁻¹ range, so during training both models achieve good results there. In the range above 0.1 m³·d⁻¹·MPa⁻¹, however, the limited number of samples cannot meet the training needs of the BP model, whereas the LSTM-BP model can provide accurate predictions even with fewer samples by extracting more precise features. For production indices in the 0.01–0.1 m³·d⁻¹·MPa⁻¹ range, the magnitude of the production index is small; any change in the main control parameters greatly affects the final predicted value, making both models somewhat inadequate for wells of such small magnitude. Nevertheless, the LSTM-BP model demonstrates superior generalizability.
The total number of samples in the data set is 43, with 35 samples used for training the model and 8 for testing. After constructing the BP and LSTM-BP neural network models, the LSTM-BP network achieved a mean absolute error of 0.0765, a root mean square error of 0.10, a mean absolute percentage error of 21.18%, and a coefficient of determination (R²) of 0.97 on the test set (Table 3). The mean absolute error of the BP model differs from that of the LSTM-BP model by 0.05, but the mean absolute percentage error differs by as much as 10%. This indicates that while the BP model can accurately predict some test samples, it fails on others, and the poor predictions drag down the overall accuracy. In addition, the mean absolute percentage error of the BP model exceeds 30%, which is not acceptable for practical applications. This demonstrates that the LSTM-BP model, by extracting features from the dynamic parameter data, provides a more effective capability for production prediction.
5. Applicability Analysis
Conventional production prediction methods commonly use analytical approaches, employing Darcy's law combined with the thickness of each characterized segment to obtain the production rate of the entire test segment:

Q = \sum_{i} \frac{2\pi K_i h_i \left( p_e - p_{wf} \right)}{\mu_o B_o \left( \ln\left( r_e / r_w \right) + S \right)}

In the formula, p_e is the formation pressure (MPa); p_wf is the bottom-hole flowing pressure (MPa); μ_o is the viscosity of the oil (mPa·s); B_o is the oil formation volume factor; K_i is the oil-phase permeability of the layer segment represented by test point i (mD); h_i is the thickness of the layer segment represented by test point i (m); r_e and r_w are the detection radius of the test segment and the wellbore radius, respectively (m); S is the skin factor of the test segment.
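The analytical approach can be sketched as follows. Unit-conversion constants (which depend on the unit system) are deliberately absorbed into a single assumed factor set to 1, so only the relative behavior of the result is meaningful here; the layer values are illustrative.

```python
import math

def darcy_rate(pe, pwf, mu_o, Bo, layers, re, rw, S, unit_const=1.0):
    """Radial Darcy inflow summed over the tested layer segments.
    `layers` is a list of (K_i, h_i) pairs; `unit_const` absorbs the
    unit conversion (assumed 1.0 in this sketch)."""
    drawdown = pe - pwf
    total = 0.0
    for K_i, h_i in layers:
        total += (unit_const * 2.0 * math.pi * K_i * h_i * drawdown
                  / (mu_o * Bo * (math.log(re / rw) + S)))
    return total

# Two layer segments (K_i in mD, h_i in m); rate is linear in K_i * h_i,
# so doubling every permeability doubles the predicted rate.
layers = [(5.0, 2.0), (1.0, 4.0)]
q1 = darcy_rate(20.0, 15.0, 2.0, 1.2, layers, 100.0, 0.1, 1.5)
q2 = darcy_rate(20.0, 15.0, 2.0, 1.2,
                [(k * 2, h) for k, h in layers], 100.0, 0.1, 1.5)
print(round(q2 / q1, 6))  # 2.0
```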
Ten samples from the training set and eight from the test set were selected to form the prediction samples, and the LSTM-BP neural network was applied to them for production prediction. First, the original data are normalized through feature engineering. The characteristic parameters of each sample are divided into dynamic and static parameters, and the dynamic parameters are processed by the LSTM part of the network. The static parameters are then merged with the LSTM-processed data; the combined data are fed into the BP part for capacity prediction; and, finally, the prediction results are produced. Three methods, the analytical approach, the BP model, and the LSTM-BP model, are used to predict the production rates of these 18 samples for comparative analysis (Figure 8).
After the model is established, its generalization ability is an important indicator. Generalization means that the trained model should not only fit the known samples but also make effective predictions on unknown samples. A comparative analysis of scatter plots of true versus predicted values effectively reflects the generalization ability of a model. The scatter plot shows that the analytical method, constrained by the limitations of its idealized assumptions, cannot fully capture the data characteristics of the whole sample. Its predictions deviate in the production range below 0.25 m³·d⁻¹·MPa⁻¹; because the idealized model cannot represent the entire sample set, its predicted values are not evenly distributed around the 45° reference line. The predicted values of the BP neural network are likewise not evenly distributed near the reference line. This shows that although the BP neural network achieved a good coefficient of determination and low relative error on the test set, it cannot predict effectively when faced with unknown samples, and its generalization ability is poor. The LSTM-BP neural network has higher prediction accuracy on the test set, and its scatter points are evenly distributed on both sides of the reference line. This shows that the LSTM-BP neural network achieves more accurate predictions for both known and unknown samples, and its generalization ability is strong. Because the original data contain few oil well samples with a production index greater than 1 m³·d⁻¹·MPa⁻¹, the training of all three models is affected, leading to poor prediction performance in this range. However, the LSTM-BP model, owing to its ability to extract dynamic parameter features, mitigates the impact of the uneven sample distribution to some extent.
The LSTM-BP model achieves good prediction results in the WZ region. Because it is designed to extract features from dynamically changing parameters, it is particularly effective for production prediction problems in which dynamic parameters are prevalent. In principle, it can be applied not only to tight reservoirs but also to conventional reservoirs and deep offshore reservoirs; however, region-specific training samples are required for re-training in each new area.
Due to the limited number of data samples in the WZ region, a more complex weight optimization function could not be used during model training. If more data samples become available, more effective optimization functions can be employed for weight updates, and the model's performance can be further enhanced. Future research will focus on the selection of optimization functions and the expansion of the data set to optimize the model's network structure.
6. Conclusions
(1) Based on the main controlling factors of oil well productivity in the WZ area, samples composed of purely static parameters and of combined dynamic and static parameters were used, and the corresponding neural network models were established to realize oil well productivity prediction in the WZ area.
(2) The Pearson correlation coefficient and partial correlation coefficient were used to determine that the main controlling factors for production capacity in the WZ region are the formation volume factor, water saturation, density, effective thickness, skin factor, shale content, porosity, and effective permeability.
(3) Comparing the prediction performance of the models under the two sample structures, the LSTM-BP neural network with dynamic and static parameters has a better prediction effect on the test set, with a coefficient of determination of 0.97. At the same time, the LSTM-BP neural network model also shows a good prediction effect on the 18 reserved oil wells, demonstrating strong generalization ability. Therefore, the LSTM-BP neural network was selected to predict the oil well productivity of the tight reservoir in the WZ area.
(4) The LSTM-BP model outperforms both the analytical method and the BP model in terms of how its predicted values align with the 45° reference line. The LSTM-BP model’s predictions are not only evenly distributed on both sides of the reference line but also exhibit smaller errors relative to the reference line. This results in a significant improvement in the production prediction accuracy.