1. Introduction
There are enormous shale resources distributed worldwide and it requires advanced exploration and development strategies to get economic production. Due to the application of horizontal wells and hydraulic fracturing technologies, the shale reservoirs have achieved an economic production, which plays an important role in world’s gas supply. For shale reservoir, the production performance is influenced by many factors such as geology, drilling, completion. A production model that contains a comprehensive set of variables is required for production prediction and optimization.
Decline curve analysis (DCA) has been wildly adopted for production forecast and it is a fast method by matching production rate-time history data. Traditional DCA models [
1] is designed for conventional reservoir and assume a boundary-dominated flow, which is not applicable for shale wells with multiple flow regimes. Many DCA models have been specifically designed for shale gas reservoirs [
2,
3,
4,
5,
6,
7]. However, DCA models still have some drawbacks as it requires a long production history to get a reliable result, which means they are mainly applied to on-production wells [
8]. In addition, it is impossible to accommodate additional geological and completion information to improve prediction accuracy, which means it is difficult to optimize production.
Reservoir numerical simulation is another approach to forecast production performance of unconventional reservoirs. It is a physically driven model and can provide accurate results when the data is complete and precise. However, it is difficult and expensive to gather the necessary data such as the complex hydraulic fracture distribution and properties. Moreover, the gas flow system in shale formation is poorly understood [
9] and simulation techniques requires large amounts of computational time [
8].
The emerging machine learning technique has provided a potential method for production modeling as a result of advanced computing powers and access to large data set. Liao et al. [
10] used random forest, Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM) to build a stacking model and identified that stimulated length, total stage count, pumped proppant per stage, pumped fluid per length and injection rate are the most important factors for Wapiti-Montney tight gas formation. However, the model did not include reservoir parameters such as total organic carbon content (TOC). Hirschmiller et al. [
11] used recursive feature elimination to select features and used random forest to predict and optimize well performance. Sheikhi et al. [
12] used linear regression, Random forest, Gradient Boost, XGBoost, Bagging, ExtraTrees and neural network to build models and applied Individual Conditional Expectation and Partial Dependency to assess completion performance, but the input variables selected for the model were limited to completion information. Shelley et al. [
13] built a feed forward neural network model to estimate Wolfcamp production and evaluate the economics for completion designs. Han et al. [
14] used deep neural network and exploratory data analysis to build robust model for production prediction and provide some insights about shale reservoirs. However, the input variables were limited to completion factors. Wang et al. [
15] used random forest, adaptive boosting, support vector machine and neural network to estimate the well performance. It was concluded that random forest has the best performance and these models were useful for designing hydraulic fracture treatments. Liang et al. [
16] used multi-objective random forest to predict dynamic production data. However, there remains a challenge to choose the right method for production prediction. In general, multiple linear regression can only describe linear relationship. Regression tree requires a lot of data to perform well. The construction of neural network is time-consuming and tedious [
15]. Compared with neural work, regression tree and multiple linear regression, support vector regression and gaussian process regression have a better performance for small data sets [
17,
18].
The originality of this paper is that geological and completion data are coupled to more accurately describe the reservoir property. In addition, feature selection is used to provide a variety of geologic and completion parameters specifically related to production.
In this study, a workflow based on data driven approach was proposed for production modeling. A large data set (including geological and completion factors) in Duvernay formation was used to illustrate the application of workflow. Grey relation analysis and Pearson Correlations were applied to screen the geological and engineering parameters related to shale gas productivity. Gaussian process regression, support vector regression, and multiple linear regression were applied for production predicting. A tornado plot was used to identify the most important factors for production performance.
2. Methodology
Figure 1 is a flowchart illustrating the procedures of the study, Pearson correlation degree and grey relation degree is calculated to select features. Gaussian process regression, support vector regression, and multiple linear regression were applied to model cumulative production. After training and testing the model, the model with the best performance is selected to applied sensitivity analysis.
2.1. Pearson Correlations
The Pearson correlation degree (PCD) is one of the commonly used relation measurement standards, it can measure the monotonic relationship between different variables [
19]. The Pearson-correlation formula is as follows:
where
Cov(
x,
y) denotes covariance between
x and
y,
Var(
x) is the variance of
x, and
Var(
y) represents the variance of
y.
PCD varies from −1 to 1. The value of −1 or 1 shows that there is a strong linear relationship between two variables, whereas 0 means no linear relationship. Normally, the relationship strength of variables is estimated on the basis of following value ranges [
20], i.e., 0.8–1.0: very strong correlation; 0.6–0.8, strong correlation; 0.4–0.6, moderate correlation; 0.2–0.4, weak correlation; 0.0–0.2 very weak correlation.
2.2. Grey Relation Analysis
The grey relation analysis, which was originally proposed by Deng et al. [
21], provides a multi-factor analysis method which describes the posture relationship among factors. In a sample data, the grey relation degree (GRD) represents the change trend (speed, size, direction), A high GRD between two factors means a strong correlation. On the contrary, a small value means a weak correlation [
22,
23]. The advantage of grey correlation analysis is that it requires less data and calculation compared to other multi-factor analysis methods [
24] (random forest, regression, etc.). The basic steps of grey relation analysis are as follows:
2.2.1. Determine the Reference Series and Comparison Series
Reference series is a data series which reflect the system behavior. Comparison series is a data series composed of variables affecting system behavior.
represents reference series,
denote comparison series. The whole data series are represented as matrix (2):
where
, I = 0,1,2, …, m; n is the length of variable series.
2.2.2. Dimensionless Processing
In general, each variable series has a specific physical meaning and dimensions or order of magnitude vary from variable to variable. Therefore, it’s important to process raw data before using, the following formula is used for dimensionless:
After dimensionless processing, the data matrix can be represented as matrix (4):
2.2.3. Calculate Grey Relation Degree
Compute the absolute value between reference series and comparison series and get absolute difference matrix (5), then transform matrix (5) into correlation coefficient matrix (7) using formula (6), finally, use Equation (8) to calculate grey relation degree.
2.3. Multiple Linear Regression (MLR)
Multiple linear regression describes the relationship between several independent variables,
X and a dependent variable,
Y.
Y is often called response variable and independent variables X are named predictor variables. Equation (9) is the universal form of a multiple linear regression model.
where
is the number of predictor variables,
is the noise terms, it is assumed that noise terms are uncorrelated and have independent and identical normal distributions with mean zero and constant variance [
19,
25,
26]. The noise terms represent some variables that not include but may have contribution to
Y.
2.4. Support Vector Regression (SVR)
Support vector machine (SVM) is a supervised learning method, which was originally identified by Cortes et al. [
27]. SVM has been comprehensively used as a robust technique for classification and SVR is a regression technique used from SVM. SVR is regarded as a nonparametric method as it relies on kernel functions. Kernel function can project the data to a higher dimensional feature space, hence avoiding the non-linearity in lower space. The aim of SVR is to find a function
f(
x) that diverge from response y by a value no more than
, and in the meantime the function
f(
x) as even as possible (
Figure 2). More details about SVR are described in Al-Azani et al. [
28], Da Silva et al. [
29], El-Sebakhy et al. [
30], Li et al. [
31] and Schuetter et al. [
26].
In SVR, the choice of kernel function directly determines the performance of the model, the main kernel functions used in this study are shown as follows:
2.5. Gaussian Process Regression (GPR)
Gaussian progress regression (GPR) is a non-parametric model. A Gaussian process (GP) refers to a collection of random variables, any finite number of random variables in this collection has a joint gaussian distribution [
32,
33]. GP has quite good adaptability for dealing with some complex problems such as small samples, high dimensionality, and nonlinearity. GP is determined by its mean function and covariance function, and it inherits many properties of the Gaussian distribution [
34,
35]. A finite dimensional subset of a Gaussian process obeys a Gaussian distribution.
Mean function is defined as:
Covariance function is defined as:
where:
, the
GP can be defined as:
As for regression question, the model can be defined as:
Assuming the noise
ε ~ N (0,
), then the prior probability distribution of the observation value
y is obtained:
y and
have a joint Gaussian distribution:
where:
, it is a
n ∗
n positive definite matrix, the matrix elements
describe the correlation between
and
,
is used to measure the correlation between data 𝑋 and
,
is used to describe the correlation between data
.
Then the posterior probability distribution of
can be expressed as:
where:
and represent the covariance and mean of .
Kernel function controls the accuracy of GP as GP is parameterized by a mean function and a kernel function. The main kernel functions used in this study are shown as follows:
Squared Exponential Kernel:
where
is the characteristic length scale and
is the signal deviation.
Exponential Kernel
where
is the characteristic length scale and
r is the Euclidean distance between
and
.
Mater 5/2:
where
r is the Euclidean distance between
and
.
Rational Quadratic Kernel:
where
is the characteristic length scale,
r is the Euclidean distance between
and
and
is a positive-valued scale-mixture parameter.
2.6. Goodness-of-Fit Metrics
There are many measures for evaluating the quality of a model. In this study, two techniques are used: root mean squared errors (RMSE) and R-squared value (Note that the calculation of R-squared is not always as easy as squaring the Pearson correlation, as R-squared can be smaller than 0. However, in linear regression, the square of the Pearson correlation is equal to R-squared). R-squared value indicates the part of variability in the response which is interpreted by the model, it is defined as the ratio of explained sum of squares (ESS) to total sum of squares (TSS):
RMSE can be interpreted as the average distance between the residuals and zero.
3. Problem Description
The Duvernay Formation, an Upper Devonian source rock, lies within Western Canada Basin in Alberta, Canada. It is an organic-rich mudstones surrounded by Leduc reef complexes [
36]. Because of high formation pressure coefficient (1.7–2.0) and significant degree of undersaturation, Duvernay formation is a tight liquid-rich reservoir. The phases of hydrocarbon liquids vary from wet gas in southwest part of the reservoir to black oil in the northeast portion due to various thermal maturity [
37,
38,
39,
40]. The depth of Duvernay is 3000–4150 m deep, and the buried depth rises from southwest to northeast. The study area is located in west shale basin, with an area of around 737.92 km
2. The basic reservoir parameters in study area are described in
Table 1.
The difference of well performance can be attributed to two factors, geological factors and completion factors. Usually, geological factors include contents of brittle minerals, formation pressure, porosity, permeability, reservoir thickness, TOC, thermal maturity, natural fractures, gas content, condensate gas ratio, etc. Completion factors include total horizontal lateral length, total fluid amount, total proppant amount, number of stages, number of clusters, sand contents, cluster distance, stage distance, etc.
Currently, there are 159 horizontal wells in the Duvernay Formation in the study area, the whole wells have a production time of more than one year, 70 wells have a production time of more than 5 years. In the study area, PNP (Plug & perf) segmented completion technology is adopted, high-viscosity fracturing fluid and small particle size quartz sand proppant are used in the hydraulic fracturing process, the clusters for each stage vary from 4 to 10 and the pump rate per stage varies from 10 m
3/min to 14 m
3/min. In the past 8 years, the stage spacing has decreased from 90 m to 50 m and the proppant intensity has increased from 1 t/m to 4 t/m. In this study, 11 variables contained geological information and completion parameters are collected from 159 shale gas wells in Duvernay Formation according the data availability and statistical efficiency. As the hydrocarbon phase in study area is gas condensate, the responses are divided into condensate production and gas production. A list of the whole variables is illustrated in
Table 2.
4. Feature Selection
In grey relation analysis, two responses M6COND and M6GAS are defined as the reference series, and the comparison series are defined as 11 geological and completion predictors. Normally, the smaller the grey relation degree (GRD), the higher the difference is between the comparison series and the reference series.
Table 3 shows GRD and rank between responses and predictors, and the influencing parameters with GRD bigger than 0.69 are selected. The geological factors include CGR, TOC, Sg, the completion factors contain Fluid, PROP, Clusters, Stages, Lateral length.
Figure 3 illustrates Pearson correlation degree (PCD) between responses and inputs, PCD bigger than 0.2 are selected. The result is the same as grey relation analysis, PCD between Thickness, POR, PERM and two responses are all less than 0.2, which means there is a weak correlation between them. Therefore, Thickness, POR and PERM are eliminated from the dataset.
Finally, combined with Pearson correlations and grey relation analysis method, predictors selected as key factors to build statistical models are as follows: Fluid, PROP, Clusters, Stages, Lateral Length, Sg, TOC, CGR. The histogram plots of all selected variables are demonstrated in
Figure 4.
Table 4 shows the results of Kolmogorov-Smirnov (K-S) test. The results illustrate that the significance of all variables (except for Lateral Length) is less than 0.05, meaning that Fluid, PROP, Clusters, Stages, Sg, TOC, CGR are not normally distributed. Therefore, a normal score method (a method which can return a normal distribution dataset) is applied to transform those non-normal distribution variables.
5. Model Development
In this study, Matlab 2019 was used to build the model. To avoid overfitting, the total data was separated into a training set, and a testing set. A total of 129 samples were used as the training set and the remaining 30 samples were used to test the effectiveness of three machine learning models. Training set are performed a 4-fold cross-validation. In this method, the dataset is partitioned into 4 disjoint sets or folds. For each fold, the model is trained using the out-of-fold data and model performance is assessed using in-fold data. Finally, calculate the average test error over all folds. This method performs an excellent estimate of the predictive accuracy and makes full use of all the training data, it works quite well for small data sets. The most difficult part for GPR model and SVR model is parameter optimization. To solve this problem, a Bayesian optimization was applied. Bayesian optimization internally maintains a Gaussian process model of the objective function, and uses objective function evaluations to train the model. After optimization, the hyperparameters of SVR and GPR were demonstrated in
Table 5 and
Table 6:
Table 7 and
Table 8 illustrate the model performance for the prediction of M6GAS and M6COND respectively. The model performance for training set and testing set is compared in
Figure 5. As for M6GAS prediction, the root mean squared errors (RMSE) of the testing set are found to be 280.54 × 10
4 m
3 for GPR, 366.25 × 10
4 m
3 for SVR, 377.72 × 10
4 m
3 for MLR. GPR model shows the highest R
2 and the lowest RMSE in both training set and testing set. Especially in testing set, GPR model shows R
2 at 0.8 and RMSE at 280.54 × 10
4 m
3, it means that the model is able to explain 80% of the gas production variance in the study area and on average there is around 280.54 × 10
4 m
3 uncertainty in the prediction of gas production within 1st 6 producing months for each well. It is found that GPR model shows the best performance for forecasting the gas production. It should be noted that SVR method performs a comparable prediction accuracy in training set compared with GPR. However, the differences of RMSE and R
2 between training set and testing set are very large, indicating that SVR undergoes overfitting problems.
As for M6COND prediction, the result is the same as M6GAS prediction, the root mean squared errors (RMSE) of the testing set are found to be 1884.3 t for GPR, 1903.7 t for SVR, 2544.5 t for MLR. GPR model demonstrates the highest R2 at 0.83 and the lowest RMSE at 1884.3 t in the testing set, it also means that on average there is around 1884.3 t uncertainty in the prediction of M6COND for each well and the model can illustrate 83% of the oil production variance in Duvernay formation. In addition, as for GPR model, the differences of RMSE and R2 between training set and testing set are relatively small, indicating that GPR doesn’t undergo overfitting or underfitting problems.
Considering that the R-squared value of most machine learning based model varied from 0.5–0.8 (Luo et al. [
41], Kong et al. [
9], Wang et al. [
15]), the GPR models built in this study are able to provide relative high accuracy for production prediction. Compared with MLR and SVR, GPR has better performance for the prediction of M6COND and M6GAS. Therefore, GPR method is selected to apply sensitivity analysis.
6. Sensitivity Analysis
Figure 6 is a Tornado Plot that describes the sensitivity of 8 geological and completion factors on M6GAS. In the process of sensitivity analysis, the factors are set based on the statistical distribution of input features used in machine learning model, which is shown in
Figure 6. The base case used the average value of each input feature. The sensitivity values are calculated by setting each interested factor to its maximum and minimum values one-factor-at-a-time as shown in
Table 9, while the other factors are set to their base values. The red bars in
Figure 6 equal to a sensitivity measure (compared with a base example) when a factor is designed to its maximum value and the green bars in
Figure 6 equal to a sensitivity measure (compared with a base example) when a factor is designed to its minimum values. It can be observed from
Figure 6 that Fluid, Stages and TOC are regarded as the most significant factors. It is followed by CGR, Clusters and Lateral Length. Sg and PROP shows the least effect on M6GAS. It can be inferred that these least important factors could have an unimportant important on M6GAS.
Figure 7 is a Tornado Plot that indicates the sensitivity of 8 geological and completion factors on M6COND. It is clear that CGR, Fluid and TOC are the most important factors. The following factors are Stages, Clusters, and Sg. The effect of Sg and PROP is marginal. In the study area, the production of condensate leads to higher economic efficiency, so it is significant to drill wells in high CGR areas as CGR plays the most important role in condensate production. Furthermore, a more progressive completion treatment (such more fluid volume) is required to get more production.
It is significant to note that this model may provide some suggestions for the study area in Duvernay Formation, it is not applicable to apply this model in a completely different reservoir. Moreover, the model doesn’t take parent and child well interference into consideration, which may limit the model prediction accuracy for new offset wells.
7. Conclusions
In this study, grey relation analysis and Pearson Correlation analysis were applied to demonstrate how to select significant geologic and completion factors. Then, MLR, SVR, and GPR were used to predict cumulative production with selected factors. At last, the best performance model was used to conduct sensitivity analysis.
Based on data-driven methodology presented in this paper, it is concluded that:
Based on grey relation analysis and Pearson Correlation analysis, 8 parameters are selected as input in predictive model, they are Fluid, PROP, Clusters, Stages, Lateral Length, Sg, TOC, CGR.
GPR model has the best performance among three predictive models, it results in highest R-squared and lowest RMSE. In the testing set, the model shows a R-squared of 0.8 with a RMSE of 280.54 × 104 m3 in predicting M6GAS and gives an R-squared of 0.83 with a RMSE of 1884.3 t in predicting M6COND.
Using GPR model, sensitivity analysis indicates that Fluid, Stages and TOC are the most important features for M6GAS. As for M6COND, CGR, Fluid, and TOC are the most important features.
The approach includes feature selection, model development, and sensitivity analysis. The results and technique may provide some advice in making development decisions in Duvernay formation. The method may apply to different shale reservoirs