1. Introduction
Oil-bearing evaluation is of great significance for the identification of potential oil-bearing areas and the exploration of remaining sweet spots in the secondary description of reservoirs, which is vital to further develop and enhance oil recovery for wells with long-term production histories [1,2,3,4].
There are numerous reports on the prediction of oil bearing in previous studies. NMR (nuclear magnetic resonance) is frequently employed to predict the oil bearing of shale, for example, to divide the different hydrocarbon-containing components in shale [5]. The authors of Ref. [6] applied nuclear magnetic resonance to study the Chang 7 shale in the Yan’an area; the distribution of movable oil can be forecast based on differences in the relaxation mechanisms of the different hydrogen-containing components in the rock measured by NMR. However, the NMR method is expensive and is only routinely implemented for the examination of oil-bearing cores, and its detection range is too limited to characterize the whole oil reservoir.
In addition, lithofacies can also be used for oil-bearing prediction. In general, lithofacies constraints can only identify a fixed oil-bearing range, and it is extremely difficult to accurately determine hydrocarbon bearing with them. The authors of Ref. [7] studied the oil-bearing properties of different lithofacies based on core extraction, fluorescent thin sections, rock pyrolysis, total organic carbon (TOC), and nuclear magnetic resonance (NMR) testing. The results confirmed that the oil-bearing and physical properties of shale are better than those of shell limestone. The authors of Ref. [8] studied the physical properties of lacustrine deep-water turbidite reservoirs, revealing that the petrophysical properties seem to have a significant impact on hydrocarbon accumulation in turbidite sandstones. The authors of Ref. [9] used core, geological analysis, production, and logging data to characterize the lithology, oil-water layers, and oil layer thickness of the Chang 7 tight sandstone reservoir; as a reservoir identification criterion, the normalized difference curve was superimposed on the resistivity curve to accurately determine the thickness of tight, fine-grained sandstone reservoirs. However, this method is mainly used to calculate oil layer thickness and requires a considerable amount of data. In addition, the authors of Ref. [10] utilized the thermal terahertz analysis (TTA) method to detect the oil-bearing characteristics of desert reservoirs, but this method is only applicable to macro-level oil-bearing predictions.
The previous studies indicate that seismic data is frequently used to predict oil bearing in the plane [11,12,13]. However, the resolution of seismic data is too coarse to make an accurate oil-bearing prediction.
Well logging data is also often utilized for the identification of hydrocarbon-bearing intervals, the prediction of oil bearing, and the discovery of remaining oil [14,15,16]. These processes mostly rely on expert experience since the nonlinear relationship between the complex variables and multi-type data in the actual oil field is difficult to capture. Fortunately, the large amount of available oil field data (i.e., geological information, oil testing, well logging) makes it possible to identify oil bearing [17,18,19,20,21,22].
In recent years, data-driven models have been widely adopted in the oil industry for production prediction, injection-production parameter optimization, analysis of deposits with high paraffin content, fracturing parameter optimization, recovery factor forecasting, etc. [23,24,25,26,27,28].
A small number of scholars have conducted research on the application of machine learning to oil bearing. The authors of Ref. [29] evaluated machine learning algorithms suitable for borehole geomagnetic data to predict the remaining oil saturation in the vertical direction. Six data-driven models were established and tested, and the results showed that the AdaBoost-random forest model performed best, with a prediction accuracy of 87%; however, this method requires corresponding borehole-to-ground electromagnetic data. The authors of Ref. [30] investigated the impact of physical properties (e.g., thickness, permeability, porosity, net-to-gross ratio, etc.) on oil saturation. Ten data-driven methods were applied, including random forest, lasso, gradient boosting, ridge regression, AdaBoost, elastic net regression, support vector machine, linear regression, multilayer perceptron, and polynomial regression. This study revealed that NWIV (adjacent water injection variation) was the variable with the greatest influence on OSV (oil saturation variation); however, it addressed the prediction of dynamic oil bearing and could not be used for the identification and prediction of static oil bearing. The authors of Ref. [31] estimated gas hydrate saturation with machine learning (ML) algorithms, including ridge regression, decision tree, k-nearest neighbor, reduced-order models, and neural networks. The results showed that the k-nearest neighbor model performed best, with an accuracy of 82.96%.
In this paper, high-resolution logging data were adopted to predict the oil bearing of wells with low production rates over long production periods. The purpose was to combine various types of oil field data and evaluate the oil bearing of the reservoir with data-driven models, thereby providing theoretical guidance for enhancing the oil recovery (EOR) of wells with low production rates and long production histories.
2. Data Preparation
2.1. Geology of the Target Block
The target area of this study was the Chang 8 ultra-low permeability reservoir in the Changqing oil field, Ordos Basin, China. The main rock type of the reservoir is siltstone (80%). The porosity and permeability of the siltstone are relatively low, which leads to the low porosity and permeability of this area. The average thickness of the oil layer is 14.6 m, the average porosity is 9.97%, the average permeability is 0.56 × 10⁻³ μm², and the original formation pressure is 15.8 MPa. A total of 287 oil-producing wells and 94 water injection wells were put into production in December 2006. The wellhead liquid rate of the block is 740 t/day, the oil rate is 576 t/day, and the average oil rate of a single well is 2.0 t/day with a 22.1% water cut.
The development wells in this area were put into production around July 2005 and generally have production histories of approximately 10 years. The average daily oil rate per well dropped from 2.8 to 0.6 t/day. As shown in Figure 1, the average monthly oil production of a single well in the target block has been decreasing since the block was put into production.
Figure 2 shows the current oil saturation field map of the target block. It can be seen from the figure that there is still a large amount of remaining oil in the target block that needs to be re-evaluated and recovered.
In the late stage of oil field development, the remaining oil is of vital significance to improve the production and recovery of a single well. Therefore, quickly re-evaluating the oil bearing of wells with long production histories is a necessary and urgent task that can provide guidance for the development of the oil field in the later period.
Oil-bearing prediction from well logging data depends mainly on the rich experience of geological engineers and is often miscellaneous and tedious work; the interpretation results given by different engineers may vary widely. Therefore, in order to improve the oil-bearing prediction of oil wells, it is necessary to fully exploit the characteristics of the logging data. In addition, there is a complex nonlinear relationship between oil-bearing capacity and logging data, which is extremely challenging for general models or equations to describe. Machine learning models are strongly adaptable to nonlinear relationships and can extract the hidden features and relationships behind disorganized oil field data.
2.2. Oil Field Dataset
The oil field data collected from a block in the Changqing oil field, Ordos Basin, China included logging, perforation, and production data. There are more than 400 wells in the target block, including 361 old wells. The raw data contained 8 variables, namely spontaneous potential (SP), natural gamma ray (GR), acoustic transit time (AC), resistivity (RT), porosity (POR), permeability (PERM), water saturation (SW), and perforation. The oil-bearing labels were 1 for an oil-bearing layer and 0 for an interlayer. The distribution and statistical information of the oil field data are shown in Figure 3 and Table 1, respectively.
2.3. Data Processing Procedures
In this paper, the oil field data, including well logging, perforation, oil testing, and production data, were collected from 361 vertical wells in the Changqing oil field, Ordos Basin, China. Before model training, it is necessary to process the raw data and convert them into usable, easily identified data points. In general, the detailed procedure of oil-bearing prediction can be divided into oil field data collection, data processing, feature selection, data splitting, model evaluation, and prediction, as shown in Figure 4. Among these steps, data processing, which includes filling missing records, noise reduction, perforation coding, and oil-bearing labeling, is crucial for predictive capability.
- (1) Sort out the oil field data. The well logging, perforation, production, and test data collected from the oil field were classified and sorted. Each column in every row was carefully checked: if more than 50% of the values in a record were missing, the record was deleted; if only individual values were missing in a row, the neighboring average was used to fill them.
- (2) Perforation data coding. For the oil-bearing identification problem in this paper, the well perforation data are categorical and needed to be coded before training. The one-hot encoding technique, also known as one-bit effective encoding, was used to convert the categorical perforation data into numerical data: each state is encoded by a state register with its own independent register bit, and only one bit is active at a time, so a variable with m possible values becomes m binary variables. One-hot encoding handles categorical data and expands the feature space to a certain extent using only 0 s and 1 s, as shown in Figure 5 (a minimal encoding and splitting sketch is also given after this list).
- (3) Oil-bearing labeling. The oil content was determined from the oil testing data: if there was oil production, the record was labeled 1; otherwise, it was labeled 0.
- (4) Variable ranking. Multiple variables provide a large amount of information for feature extraction; however, the correlation between multiple variables also increases the complexity of the problem. In this paper, information gain was introduced to analyze the correlations among the variables and rank them. The performance of the well was then predicted using the preferred variables.
- (5) Training and testing data splitting. The processed well logging data and the marked oil production data together constitute the oil-bearing evaluation dataset. Before training, it was necessary to divide it into training and testing datasets. If the proportion of the training dataset is too small, the trained model will be over-fitted and its generalization performance will be poor; if the proportion is too large, the testing dataset will be too small and the reliability of the test results will be reduced. Thus, the data were randomly split into 90% training and 10% testing datasets.
- (6) Model training. After splitting the data, data-driven models, such as neural network, support vector machine, and random forest models, were trained on the training dataset. The model parameters were compared and adjusted to optimize the prediction accuracy of each model.
- (7) Oil-bearing prediction. The testing dataset was used to predict oil bearing, an integrated model was explored to improve model accuracy, and, finally, application to the actual oil field was achieved.
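As a minimal, hedged sketch of steps (2) and (5), the Python snippet below one-hot encodes a perforation column and performs the 90/10 random split described above. The file name and column names (e.g., oilfield_dataset.csv, PERF, LABEL) are hypothetical placeholders rather than the actual schema of the field dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real dataset schema may differ.
df = pd.read_csv("oilfield_dataset.csv")  # columns: SP, GR, AC, RT, POR, PERM, SW, PERF, LABEL

# Drop records with more than 50% missing values, then fill isolated gaps
# in the numeric logs with a neighboring average (linear interpolation).
df = df.dropna(thresh=int(0.5 * df.shape[1]))
log_cols = ["SP", "GR", "AC", "RT", "POR", "PERM", "SW"]
df[log_cols] = df[log_cols].interpolate(limit_direction="both")

# One-hot ("one-bit effective") encoding of the categorical perforation data.
df = pd.get_dummies(df, columns=["PERF"])

# 90%/10% random split into training and testing datasets.
X = df.drop(columns=["LABEL"])
y = df["LABEL"]  # 1 = oil-bearing layer, 0 = interlayer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
```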
2.4. Information Gain
“Information entropy” is the most commonly used indicator to measure the purity of a sample set. Assume that, in the training dataset D, |D| is the sample capacity, that is, the number of samples in D, and that there are K classes Ck (k = 1, 2, …, K), where |Ck| is the number of samples belonging to Ck and the sum of all |Ck| equals |D|. The information entropy of dataset D can then be calculated using the following formula [32]:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$
According to each feature A (i.e., each logging curve in this paper), D (oil bearing in this paper) is divided into n subsets D1, D2, …, Dn, where Di represents the subset of D corresponding to the ith value of feature A, |Di| is the number of samples in Di, and the sum of all |Di| equals |D| (i = 1, 2, …, n). The set of samples in Di that belong to Ck is recorded as Dik (i.e., the intersection), and |Dik| is the number of samples in Dik. The conditional information entropy of D given the selected feature A can be calculated as follows:

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$
Then, the information gain of feature A with respect to dataset D is:

$$\mathrm{Gain}(D, A) = H(D) - H(D \mid A)$$
Generally speaking, the greater the information gain, the greater the “purity improvement” obtained by using feature A for partitioning. It should be noted that most of the data in this study were metric variables, whereas the above method handles categorical variables. Therefore, an extended method based on information gain was adopted for metric variables. The processing steps of this method are as follows:
Given a training dataset D and a metric variable a, assume that a takes n different values on D. These values are first sorted from smallest to largest and denoted {a1, a2, a3, …, an}. Based on a division point t, D is divided into the subsets Dt− and Dt+, where Dt− contains the samples whose value of attribute a is no greater than t and Dt+ contains the remaining samples. The division of D is the same for any value of t taken between ai and ai+1; thus, for the metric variable a, only a set of n − 1 candidate division points needs to be examined:

$$T_a = \left\{ \frac{a_i + a_{i+1}}{2} \;\middle|\; 1 \le i \le n - 1 \right\}$$

That is, the midpoint of each interval [ai, ai+1) is taken as a candidate division point. In this way, metric variables are handled in the same way as categorical variables, and the optimal division point for partitioning the sample set is selected using the following formula:

$$\mathrm{Gain}(D, a) = \max_{t \in T_a} \mathrm{Gain}(D, a, t) = \max_{t \in T_a} \left[ H(D) - \sum_{\lambda \in \{-, +\}} \frac{\left| D_t^{\lambda} \right|}{|D|} H\!\left( D_t^{\lambda} \right) \right]$$
where Gain(D, a, t) is the information gain of the sample set D after dichotomizing it at the division point t. When dividing, the division point that gives the largest Gain(D, a, t) is chosen.
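As an illustration of the bi-partition procedure above, the following NumPy sketch computes the information gain of a metric variable over the candidate midpoints and returns the optimal division point; the function and variable names are illustrative, not part of the original workflow.

```python
import numpy as np

def entropy(labels):
    # Information entropy H(D) of a label vector (0 = interlayer, 1 = oil-bearing).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_gain(x, y):
    # Information gain Gain(D, a) of a metric variable x for labels y,
    # maximized over the n - 1 candidate midpoints of consecutive sorted values.
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    base = entropy(ys)
    candidates = (xs[:-1] + xs[1:]) / 2.0
    best_gain, best_t = -1.0, None
    for t in np.unique(candidates):
        left, right = ys[xs <= t], ys[xs > t]
        if len(left) == 0 or len(right) == 0:
            continue
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if base - cond > best_gain:
            best_gain, best_t = base - cond, t
    return best_gain, best_t

# Example: rank two hypothetical logging curves by their information gain.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
for name, curve in {"RT": rng.normal(20, 5, 200) + 8 * labels,
                    "GR": rng.normal(90, 15, 200)}.items():
    gain, t = best_split_gain(curve, labels)
    print(f"{name}: gain = {gain:.3f} at split point {t:.2f}")
```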
3. Principles of Data-Driven Models
3.1. Support Vector Machine
SVM (support vector machine) is a common machine learning method for classification and regression. In Figure 6, the dots and circles represent the two classes separated by the support vector machine, and the red dots and red circles mark the crucial support vectors for classification. In the field of machine learning, the SVM is a supervised learning model and one of the most robust and generalizable of the well-known data mining algorithms; it is usually used for pattern recognition.
Suppose the training sample set is S = {(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)}, where n represents the number of training samples. The support vector machine learns from the training samples and establishes a function describing the relationship between the input variables and the output, as follows:

$$f(x) = \omega^{T} \varphi(x) + b$$

where φ(x) represents the mapping function, ω represents the weight vector, and b represents the threshold.
In practical cases, it is difficult for the data to be linearly separable, and it is difficult to judge whether seemingly linearly separable results are caused by overfitting. To solve this problem, the concept of “soft margins” was introduced. Soft margins allow some samples to violate the constraints while keeping the majority of samples linearly separable. To measure the tolerance of the soft margin to samples that do not meet the constraints, a penalty term is introduced, and its coefficient is set as the penalty factor C (C > 0). The larger the value of C, the smaller the model's tolerance of constraint violations; when C is infinite, all samples are forced to satisfy the constraints.
By introducing a kernel function to replace the inner-product operation $\varphi(x_i)^{T}\varphi(x_j)$, the prediction decision function of the support vector machine can be obtained as follows:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i \kappa(x, x_i) + b \right)$$

where αi are the Lagrange multipliers and κ(xi, xj) is the kernel function, given by:

$$\kappa(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$$
Table 2 shows the kernel functions commonly used in the support vector machine model.
When the selected kernel function is a Gaussian kernel (also known as a radial basis function, RBF), the parameter γ mainly defines the influence of a single sample on the classification hyperplane. When γ is relatively small, the effect of a single sample on the classification hyperplane is relatively small, and it is not easily selected as a support vector. Conversely, when γ is relatively large, a single sample has a greater impact on the classification hyperplane and is more easily selected as a support vector, so the model will have more support vectors.
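As a minimal sketch of the soft-margin RBF support vector machine described above, the snippet below trains an SVC on synthetic stand-in data; the values of C and γ (gamma) are illustrative starting points, not the parameters tuned in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the eight processed logging/perforation inputs.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# RBF (Gaussian) kernel SVM with penalty factor C and kernel parameter gamma.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma=0.1))
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))
```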
3.2. Artificial Neural Network
An artificial neural network involves forward propagation of the signal and back propagation of the error. During forward propagation, input samples are fed into the input layer, processed by the hidden layer, and then transmitted to the output layer. If the predicted value of the output layer does not match the target value, the error is passed back through back propagation to the neurons of each layer, which modifies the weight of each neuron. This cycle of forward propagation and error back propagation is carried out repeatedly until the error of the network output is reduced to an acceptable level.
As shown in Figure 7, for a specific training sample Xi = (x1, x2, …, xn), the output vector of the hidden layer is Yi = (z1, z2, …, zp), the output value of the output layer is Oi, and the expected output is yi. The weight matrix from the input layer to the hidden layer is represented by V = (V1, V2, …, Vp), where the column vector Vj is the weight vector corresponding to the jth neuron in the hidden layer. The weight vector from the hidden layer to the output layer is represented by W = (w1, w2, …, wp), where wk is the weight connecting the kth neuron in the hidden layer to the output layer neuron.
Table 3 shows the equations of common activation functions.
In this paper, the gradient descent method was used to adjust the weights; the adjustment of each weight is proportional to the error gradient, scaled by the learning rate. For a three-layer neural network, the numbers of neurons in the input and output layers depend on the specific problem. If the number of neurons in the hidden layer is too small, the prediction and generalization capabilities of the network are reduced; if it is too large, network training converges slowly or not at all and fault-tolerance performance decreases. To the best of the authors' knowledge, there is no consensus on how to choose the number of neurons in the hidden layer. In this paper, as few hidden layer neurons were selected as possible under the condition of satisfying the error requirements.
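The sketch below is a hedged illustration of such a three-layer network trained with gradient descent, using scikit-learn's MLPClassifier on synthetic stand-in data; the hidden-layer size, learning rate, and iteration count are illustrative rather than the values selected in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed oil field features.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# One small hidden layer; weights are adjusted by stochastic gradient descent
# with a fixed learning rate until the output error is acceptably small.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                  solver="sgd", learning_rate_init=0.05,
                  max_iter=3000, random_state=0),
)
ann.fit(X_train, y_train)
print("ANN test accuracy:", ann.score(X_test, y_test))
```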
3.3. Random Forest
Random forest is a machine learning algorithm based on decision trees, as shown in Figure 8. Three algorithms are commonly used to build decision trees, namely ID3, C4.5, and CART. CART splits nodes according to the Gini index: it adopts a recursive binary segmentation so that the dataset is divided into the two branches with the smallest Gini index. This paper adopted the random forest algorithm based on CART. The Gini index is expressed as follows:

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$$

where k is the number of categories and pi is the probability of category i appearing.
Dataset S can be split into two sub-sample sets, S1 and S2, according to a given condition. The Gini split index of dataset S is then given by Equation (10):

$$\mathrm{Gini}_{split}(S) = \frac{m_1}{m}\,\mathrm{Gini}(S_1) + \frac{m_2}{m}\,\mathrm{Gini}(S_2)$$

where m is the number of samples in dataset S, m1 is the number of samples in S1, and m2 is the number of samples in S2.
Random forest is an ensemble learning algorithm proposed by Breiman; it is an ensemble classifier composed of multiple decision trees. Each decision tree in the random forest is built from a bootstrap sample, and the category receiving the majority of votes from the combined decision trees is selected as the final classification result. The construction process of the random forest is as follows (a brief implementation sketch follows the list):
Assume that, in the original training dataset D, n is the number of samples, M is the total number of features, and K is the number of decision trees to be constructed.
- (1) Extract the training subsets. For each tree, n samples were randomly drawn with replacement (i.e., bootstrap sampling) from the original training dataset D, and the samples not drawn formed the out-of-bag dataset.
- (2) Build a decision tree. First, a subset of m features (m < M) was randomly selected from the M features; then, the optimal splitting point was selected according to the relevant criterion and the node was divided into sub-nodes. Meanwhile, the training dataset was divided among the corresponding nodes until all nodes were obtained.
- (3) Random forest generation. Step (2) was repeated until K decision trees had been established, forming the random forest {ti, i = 1, 2, …, K}.
- (4) Using the K decision trees in the random forest, the predicted results {t1(x), t2(x), …, tK(x)} were obtained.
- (5) The mode of the predicted results of the individual decision trees was taken as the final prediction for each sample.
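The following sketch illustrates steps (1)-(5) with scikit-learn's RandomForestClassifier on synthetic stand-in data: K CART trees are grown on bootstrap samples, each split considers a random subset of features under the Gini criterion, and the majority vote gives the final class. The parameter values are illustrative, not the configuration used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed oil field features.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# K = 100 CART trees; each tree is fitted on a bootstrap sample and each split
# considers a random subset of sqrt(M) features under the Gini criterion.
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            max_features="sqrt", bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("Out-of-bag accuracy:", rf.oob_score_)
print("Test accuracy (majority vote):", rf.score(X_test, y_test))
```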