Article

A Comparison of Machine Learning Algorithms in Predicting Lithofacies: Case Studies from Norway and Kazakhstan

by Timur Merembayev 1, Darkhan Kurmangaliyev 2, Bakhbergen Bekbauov 2 and Yerlan Amanbek 1,*

1 Department of Mathematics, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
2 KazMunayGas Engineering LLP, Nur-Sultan 010000, Kazakhstan
* Author to whom correspondence should be addressed.
Energies 2021, 14(7), 1896; https://doi.org/10.3390/en14071896
Submission received: 13 February 2021 / Revised: 11 March 2021 / Accepted: 22 March 2021 / Published: 29 March 2021

Abstract

Defining distinctive areas of the physical properties of rocks plays an important role in reservoir evaluation and hydrocarbon production, as core data are challenging to obtain from all wells. In this work, we study the evaluation of lithofacies using machine learning algorithms for classification based on various well log data from Kazakhstan and Norway. We also use wavelet-transformed data in the machine learning algorithms to identify geological properties from the well log data. Numerical results are presented for multiple oil and gas reservoir datasets, which contain more than 90 released wells from Norway and 10 wells from a Kazakhstan field. We compare machine learning algorithms including KNN, Decision Tree, Random Forest, XGBoost, and LightGBM. Model performance is evaluated using metrics such as accuracy, Hamming loss, and a penalty matrix. In addition, the influence of the dataset features on the prediction is investigated for the machine learning algorithms. The results show that the Random Forest model has the best score among the considered algorithms, and the results are consistent with the outcome of the SHapley Additive exPlanations (SHAP) framework.

1. Introduction

In many applications, it is important to understand the geological structure of formations based on the available data. The key features of the complex subsurface can be defined by geophysicists, whose experience in identifying lithotypes helps to improve the accuracy of the labels in the well logs. Such work requires many hours and additional data from different sources such as seismic surveys, cores, etc. One possible solution is to use machine learning algorithms to accelerate the accurate prediction process in a systematic way. To achieve appropriate accuracy, data-driven algorithms require a large amount of data, which should be used in a balanced way during training. Traditionally, the most common features of a region are identified by geophysicists, and then uncommon features are estimated from additional well log data using knowledge of the relationships between lithotypes and logs such as PS, RHOB, and NPHI.
To determine lithotypes, geophysicists work in stages: first, shale and sandstone are identified, usually from gamma-ray logs, with PS, RHOB, and NPHI sometimes used for control. After these rocks, uncommon lithotypes are isolated only by their characteristic features. The more features (curves) available, the better the accuracy of lithotype determination. An inexperienced geologist without knowledge of the geology of the field may not accurately distinguish similar lithotypes; therefore, trained models can compensate for the lack of field-specific knowledge among geophysicists.
Machine learning can be an effective tool to enrich geoscience workflows. Geostatistical approaches were proposed in many studies [1,2,3,4] to reduce the uncertainty of subsurface properties using large datasets.
There are several works regarding the application of data analysis methods to mining areas [5,6]. The importance of lithofacies detection for uranium mining is discussed and investigated in [7,8], where machine learning algorithms are used to solve multilabel lithofacies classification. The in situ leaching of uranium requires a better understanding of the permeable and impermeable rock types.
The authors of [9] compared machine learning algorithms from the scikit-learn framework (MLPClassifier, DecisionTreeClassifier, RandomForestClassifier, and SVC) on data from offshore wells. The algorithms were applied to three standard data templates and a practical data template in a lithology classification problem for wells from International Ocean Discovery Program (IODP) Expeditions. They used a dataset with the lithology subdivided into GP (group GP), G1 (group 1), G2 (group 2), and G3 (group 3). The comparison analysis showed that the multilayer perceptron (MLP) method had better results in the lithology classification for the practical template: the lithology of the G2 group.
In [10], the authors proposed using embedded feature selection (EFS) and LightGBM to predict the permeability of a reservoir. The EFS result was based on five features (DEPTH, AC, DEN, FMIT, and GR) out of 22 features and achieved an R² of 0.9457. Commonly used feature selection methods include filter feature selection (FFS) and wrapper feature selection (WFS); the authors compared several selection methods, namely the mutual information regression (MIR) in FFS and the recursive feature elimination (RFE) in WFS. A similar comparison was made between LightGBM, Random Forest, and XGBoost. The best result was obtained by EFS + LightGBM: an R² of 0.9712 and an RMSE of 0.5959.
The authors of [11,12] presented the application of oil production exploration and development data to generate high-performance predictive models and optimal classifications of geology, reservoirs, and fluid characteristics. Deep learning algorithms also have the potential to solve problems in geoscience, particularly lithology classification [13,14,15].
In [16], the authors investigated data preprocessing methods for well logs, such as dimensionality reduction and wavelet analysis, in order to improve the accuracy of the group method of data handling (GMDH) for lithological classification. Wavelet analysis was used for the decomposition of the log signals fed to the GMDH algorithm. The authors of [17] proposed using the continuous wavelet transform of the well log data to detect geological boundaries. One application of the wavelet coefficients is to measure the edge of the boundary strength, where the boundary strength is a measure of the geological thickness of units. In this method, instead of solving a multivariate classification, additional features were generated to detect the boundaries of the formations. Multi-element geochemical data taken from 259 drill holes were studied, and the method's efficiency was shown for data with a maximum depth of 600 m.
In this paper, we investigate the prediction of lithofacies using machine learning algorithms for the geological data of Kazakhstan and Norway. We consider machine learning methods such as KNN, Decision Tree, Random Forest, XGBoost, and LightGBM, with and without wavelet-transformed data. Gamma ray (GR), medium deep reading resistivity measurement (RMED), compressional wave sonic log (DTC), neutron porosity log (NPHI), bulk density log (RHOB), etc. are considered as the input data of the machine learning models. In addition, the results of the supervised learning are provided in the SHapley Additive exPlanations (SHAP) visualization framework by indicating significant well logs. Our research question is the following: how accurately can supervised machine learning algorithms predict lithofacies based on the geophysical well log data from Norway and Kazakhstan fields?
The rest of the paper is organized as follows. In the next section, we describe the wavelet transformation, data analysis, and machine learning algorithms. Numerical results of algorithms are presented in Section 3. Section 4 concludes the paper.

2. Methodology

We first describe the wavelet transformation and then the workflow of the machine learning algorithms. Next, the data analysis and data preparation are presented. We briefly describe the considered machine learning algorithms for supervised multi-label classification.

2.1. Wavelet Transformation

We use the Gaussian wavelet transformation for edge detection in the geological formation. The second-order derivative of the Gaussian function is also known as the Mexican hat wavelet. Inflection points of the Mexican hat wavelet represent edges of objects in the signal. Applying the wavelet transformation to a given signal generates new artificial data which can be useful for further analysis.
The physical meaning of the wavelet transform is to calculate the joint energy spectrum of signals in the frequency-time domain and to identify both the frequency and time information of the distinct modes [18].
Wavelet transformation decomposes a geophysical log into a combination of signals at different frequencies. This makes it possible to determine which frequency bands of a log are noise and which contain the actual data. It provides a one-to-one mapping of the original log, so we can go back and forth between the original and transformed data.
The integral wavelet transform of a function $f(x)$ with respect to a mother wavelet $\psi$ is given by

$$ W_\psi(s, \tau) = \int_{-\infty}^{+\infty} f(x)\, \psi_{s,\tau}(x)\, dx, $$

where

$$ \psi_{s,\tau}(x) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{x - \tau}{s}\right) $$

and $s > 0$ and $\tau$ are the scale factor and shift, respectively.
For the wavelet transformation, we used the Ricker wavelet, also known as the "Mexican hat wavelet":

$$ \psi(x) = \frac{2}{\sqrt{3\sigma}\, \pi^{1/4}} \left(1 - \left(\frac{x}{\sigma}\right)^{2}\right) \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right). $$
To illustrate the above explanation, we conduct wavelet transformation of the geophysical logs from Kazakhstan; see Figure 1. Figure 1a shows the application of the wavelet transform to two logs, and a log scale is used in Figure 1b to better display the result of the transformation.
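This transformation can be reproduced with standard Python tooling. The following minimal sketch computes the Mexican hat CWT of a gamma-ray log with PyWavelets; the file name, column name, and scale grid are illustrative assumptions, since the paper does not specify its implementation.

```python
# Sketch: Mexican hat ("mexh") continuous wavelet transform of one log.
# File name, column name, and scale grid are illustrative assumptions.
import numpy as np
import pandas as pd
import pywt

logs = pd.read_csv("well_logs.csv")             # hypothetical well log table
gr = logs["GR"].interpolate().to_numpy()        # gamma-ray curve, gaps interpolated

scales = np.arange(1, 31)                       # scale factors s in the CWT
coeffs, freqs = pywt.cwt(gr, scales, "mexh")    # coeffs: (n_scales, n_samples)

# Log-scaled magnitude, analogous to the display used in Figure 1b.
cwt_mag = np.log1p(np.abs(coeffs))
```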
We have followed the general workflow of a machine learning classifier, which is illustrated in Figure 2. Our classifier-building process consists of the following steps:
  • Data preprocessing.
  • Application of the wavelet transformation to generate new features.
  • Finding hyperparameters and constructing the machine learning algorithm as a classifier of lithofacies.
  • Training the model on the well log data with the lithology labeled by a geophysicist or geologist.
  • Evaluating the trained classifier according to a specified score on the test dataset.
The initial stage starts with the generation of new features from the current well logs, as sketched below. Next, the model is trained on the new dataset, which includes the wavelet-transformed well logs. The trained model is evaluated by estimating the accuracy on the test dataset.
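As a concrete illustration of the feature-generation step, the sketch below appends wavelet coefficients of selected logs as new columns. The choice of logs and scales is an assumption: the paper reports seven generated features but does not list the exact scales used.

```python
# Sketch: augmenting a well log DataFrame with Mexican hat CWT features.
# Logs and scales below are assumptions, not the authors' exact settings.
import numpy as np
import pandas as pd
import pywt

def add_wavelet_features(df: pd.DataFrame, log_names, scales=(2, 4, 8)) -> pd.DataFrame:
    """Append Mexican hat CWT coefficients of the given logs as new features."""
    out = df.copy()
    for name in log_names:
        coeffs, _ = pywt.cwt(out[name].interpolate().to_numpy(),
                             np.asarray(scales), "mexh")
        for s, row in zip(scales, coeffs):
            out[f"{name}_cwt{s}"] = row        # one new column per (log, scale)
    return out

# Example: augmenting two logs yields six extra feature columns.
# logs = add_wavelet_features(logs, ["GR", "RHOB"])
```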

2.2. Data Analysis

We consider the well log data from an offshore field in the North Sea, near Norway. The study area contains 98 wells with a maximum depth of 5000 m. The dataset consists of interpreted lithofacies and 22 wireline log curves, including gamma ray (GR), medium deep reading resistivity measurement (RMED), compressional wave sonic log (DTC), neutron porosity log (NPHI), bulk density log (RHOB), and others. Digital measurements were recorded at 0.1 m intervals; see Table 1 for abbreviations and Table 2 for descriptive statistics of the dataset.
The interpreted lithofacies contain 12 classes. Each lithofacies type corresponds to a code (number) which is used in machine learning training and prediction: 0: Sandstone, 1: Sandstone/Shale, 2: Shale, 3: Marl, 4: Dolomite, 5: Limestone, 6: Chalk, 7: Halite, 8: Anhydrite, 9: Tuff, 10: Coal, 11: Basement.
For data exploration, we use the Cegal library (https://github.com/cegaldev/cegaltools, accessed on 22 March 2021), a geoscience tool for loading, plotting, and evaluating well log data using Python scripts. It is also an interactive tool to visualize data details and dependence. Figure 3 shows one well with its logs.
The distribution of lithology types in log scale is presented in Figure 4. We have a similar distribution of lithology classes for the training and test datasets.

2.3. Data Preparation

The dataset contains some missing data. Key reasons for missing data are technical problems during data acquisition, cost optimization during geophysical logging, the human factor, and others. We utilize the Missingno library [19] to detect data gaps in the provided dataset; it helps to locate the affected logs. Figure 5 presents one well whose logs contain missing data, either over the full depth of the well or with some gaps.
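A minimal sketch of this inspection step, assuming the logs are loaded into a pandas DataFrame from a placeholder file, follows the Missingno API [19]:

```python
# Sketch: locating data gaps per log with Missingno; file name is a placeholder.
import pandas as pd
import missingno as msno

logs = pd.read_csv("well_logs.csv")
msno.matrix(logs)     # presence/absence pattern of each column along depth
msno.bar(logs)        # completeness of each log

# Share of missing values per log, used to select the retained curves.
print(logs.isna().mean().sort_values())
```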
After careful study and statistical analysis of the logs for missing data, we decided to concentrate on the following logs, which have a smaller percentage of missing data: DEPTH_MD, CALI, RSHA, RMED, RDEP, RHOB, GR, NPHI, PEF, DTC, SP, and BS.

2.4. Algorithms

There are various machine learning algorithms, and each has its own advantages and disadvantages for solving geoscience problems. In this paper, we carried out a comparison analysis of five algorithms: K-nearest neighbors (kNN) [20], Decision Tree [21], Random Forest Classifier (RFC) [22], extreme gradient boosting (XGBoost) [23], and LightGBM [24]. They are also explored with and without the additional features generated by the wavelet transformation. In this research, we used scikit-learn [25], a Python framework, for kNN, Decision Tree, and the Random Forest classifier; XGBoost and LightGBM have their own Python frameworks. A condensed sketch of the comparison loop is given below.
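The sketch uses default constructors in place of the tuned hyperparameters of Tables 3–5, and synthetic data so that it runs standalone; these are assumptions for illustration, not the paper's configuration.

```python
# Sketch: training the five classifiers on a shared split. Synthetic data
# stands in for the well logs; hyperparameters are defaults, not Tables 3-5.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=5000, n_features=12,
                           n_informative=8, n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# kNN is scale-sensitive, so all inputs are standardized for a uniform pipeline.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "kNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    print(name, model.score(X_test_s, y_test))
```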
K-Nearest Neighbors (kNN) is a machine learning method that has been used for data mining [20]. Each data point has a location in a multidimensional space whose axes correspond to the features of the dataset. The trained model defines an optimal count of neighbors for the training dataset, and for a new (test) data point the model finds its K nearest neighbors. kNN has the advantage of being nonparametric. The method is sensitive to scale, so standardizing the data is mandatory to eliminate differences in scale. Performance can be an issue when the dataset is very large, although special methods can be applied to reduce the search space.
Decision trees are data mining methods that have been successfully used for classification problems. They were developed by Morgan and Sonquist in 1963, who applied the algorithm to determinants of social conditions [21]. One advantage of decision trees is that they are computationally fast and can handle high-dimensional data. On the other hand, a single decision tree can overfit the data, and the algorithm is greedy; therefore, it keeps growing deeper in the tree.
The random forest was introduced by Breiman as an ensemble of tree classifiers [22]. The key idea of the algorithm is to take the values of a random vector from an aggregated bootstrap sample (training dataset) and then to train many decision trees. However, the trained ensemble can contain a lot of trees and thus requires more computational resources.
The main advantage of XGBoost is parallelization. XGBoost is a scalable version of the gradient boosting machine algorithm and has shown efficiency in several machine learning applications. As described in [23], XGBoost is an ensemble of classification and regression trees and works for data with nonlinear features. The key idea is to use weak trees and to enhance their accuracy at each iteration: taking into account the prediction error of the already trained ensemble, the next tree classifier is trained to correct that error.
LightGBM is a relatively new framework that has found wide application in machine learning and data science. The main issue of gradient boosting algorithms is that they process all the data to find the possible split points, which impacts performance. LightGBM modifies the search for optimal splits to improve this [24].
Based on the training dataset, we calculated the main hyperparameters for Random Forest; see Table 3. The main hyperparameters for XGBoost and LightGBM are presented in Table 4 and Table 5, respectively.
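Hyperparameters such as those in Table 3 can be obtained with a cross-validated search. The sketch below uses a randomized search over an assumed grid around the reported values; the paper does not state which search procedure was actually used.

```python
# Sketch: randomized, cross-validated hyperparameter search for Random Forest.
# The grid is an illustrative assumption around the values in Table 3.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [30, 50, 70, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "max_features": [5, 10, "sqrt"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=25,
    cv=5,
    n_jobs=8,                         # parallel mode on eight cores, as in Section 3
)
search.fit(X_train_s, y_train)        # arrays from the previous sketch
print(search.best_params_)
```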
The prediction performance of the algorithms is evaluated by three statistical quality indicators: the Jaccard metric (accuracy), the Hamming loss, and the penalty metric. The resulting scores are reported in Table 6.
The Jaccard metric is computed as

$$ J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}. $$

The Hamming loss is defined as

$$ L_{\mathrm{Hamming}}(y, \hat{y}) = \frac{1}{n_{\mathrm{labels}}} \sum_{j=0}^{n_{\mathrm{labels}} - 1} 1(\hat{y}_j \neq y_j). $$
To estimate the accuracy of the models, a penalty matrix is used, derived from the averaged input of a representative sample. This allows petrophysically unreasonable predictions to be scored by a degree of "wrongness".
The scoring matrix is defined as follows:

$$ S = -\frac{1}{N} \sum_{i=1}^{N} A_{\hat{y}_i, y_i}, $$

where $N$ is the number of samples, $y_i$ is the true lithology label, and $\hat{y}_i$ is the predicted lithology label.
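A sketch of how the three scores can be computed is shown below. Accuracy and Hamming loss follow scikit-learn; the penalty matrix is competition-specific, so a random placeholder with a zero diagonal stands in for it, and the label arrays are synthetic so the snippet runs standalone.

```python
# Sketch: the three evaluation scores. A[i, j] is the penalty for predicting
# class i when the true class is j; the matrix below is a random placeholder.
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 12, size=1000)        # stand-in true lithology codes
y_pred = rng.integers(0, 12, size=1000)        # stand-in predicted codes

def penalty_score(y_true, y_pred, A):
    """Negative mean penalty over all samples, as in the scoring matrix S."""
    return -np.mean(A[y_pred, y_true])

print("accuracy:", accuracy_score(y_true, y_pred))       # Jaccard metric / accuracy
print("hamming loss:", hamming_loss(y_true, y_pred))     # fraction of wrong labels

A = rng.random((12, 12)) * (1 - np.eye(12))              # placeholder penalty matrix
print("penalty score:", penalty_score(y_true, y_pred, A))
```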

3. Results

Computations are performed on a desktop machine (3.2 GHz Intel Core i7-8700 processor) with 32 GB RAM. Hyperparameter tuning and cross-validation operations are time-consuming; therefore, they are computed in parallel using eight cores.

3.1. Lithofacies Prediction for the Norway Data

The comparison of the selected algorithms has been performed on 12 features and an additional seven features generated by the wavelet transformation, for a total of 19 features. Table 6 shows the scores of the models on the test dataset by the Jaccard metric (accuracy), Hamming loss, and penalty metric. We observe that the RFC has the highest score on the test set, with an accuracy, penalty matrix score, and Hamming loss of 0.948, −0.1289, and 0.0473, respectively. Thus, the RFC was selected for a detailed analysis of lithofacies classification. The classification report for the RFC model (12 features) can be found in Table 7. Evaluating the precision information in Table 7, we notice that the lowest values were computed for Dolomite (4) and Coal (10). A reason for such values could be the lack of representation of these lithofacies classes in the dataset.
To understand the good accuracy of the RFC model for lithology classification, we use the SHAP package to verify the results, which are consistent with another study [26]. SHAP is a good tool for explaining different models, and it provides an importance value for each feature. SHAP builds an explanatory model for a single row–prediction pair to explain the result of a prediction. The SHAP values are calculated by averaging the marginal contributions over all possible feature combinations.
SHAP does not enable us to determine the probabilities of predicted classes in multi-label classification. The explanation models (tree and kernel) cannot output probabilities due to the constraint associated with nonlinear transformations, but they provide the raw margin values of the objective function which fit the model.
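The analysis itself follows the shap library's tree explainer API [26]. The sketch below assumes a fitted Random Forest named rf_model and a test feature frame X_test; both names are placeholders.

```python
# Sketch: SHAP values for the fitted Random Forest. For a multi-class model,
# classic shap versions return a list with one array of per-feature values
# per lithology class.
import shap

explainer = shap.TreeExplainer(rf_model)       # rf_model: trained RandomForestClassifier
shap_values = explainer.shap_values(X_test)    # one entry per lithology class

# Global importance (mean |SHAP| per feature across classes), cf. Figure 6.
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Per-class beeswarm plot, e.g. for the Sandstone class (code 0), cf. Figure 7a.
shap.summary_plot(shap_values[0], X_test)
```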
Figure 6 shows the global importance for the 12 classes, calculated as the average of absolute SHAP values. SHAP ranks the input features by their mean SHAP value; the magnitude of the value indicates the importance of the feature in the prediction of a certain class (higher means more influential). The GR feature influences the model prediction in all lithology classes; other features have less influence compared with GR.
Figure 7 gives an additional explanation of the model: the influence of the input features on the model prediction of lithology classes, and their distributions. SHAP calculates a Shapley value for the input features and instances and plots them on the figure. The y-axis lists the input features in order of importance for the model prediction, from top to bottom. Each dot on the plots is colored by the value of the selected variable, from low (blue) to high (red). SHAP chooses the selected variable for each feature based on its correlation values. Figure 7a–l illustrates the influence of features on each lithology class. We note that the GR feature has high SHAP values and impacts the model prediction of the following lithology classes: Sandstone, Limestone, Chalk, Halite, Anhydrite, and Coal. However, for the Sandstone/Shale, Shale, Marl, and Basement classes, the GR feature tends to have negative SHAP values. We can see the influence of the GR, DTC, and RHOB variables on almost all lithology classes. On the other hand, some lithology classes such as Tuff and Coal have different important features in the model prediction.
Due to the different nature of the Coal properties compared with the other classes, RHOB and NPHI were found to be significant features in the prediction of Coal. Moreover, RHOB, DTC, RMED, and GR were dominant features in the forecast of the Dolomite and Limestone lithologies.

3.2. Lithofacies Prediction for the Kazakhstan Data

We carried out numerical experiments in the above-mentioned way for wells in a Kazakhstan oil and gas field; the study area contains 10 wells with a maximum depth of 1700 m. The lithology of the field primarily consists of clay, coal, limestone, dolomite, and sand. The data contain well logs such as thermal neutron porosity, caliper, gamma ray, temperature, resistivity, sonic, and others. The information from the well logs was recorded at every foot of the logged formation.
The data were split into training and test datasets of 75% and 25%, respectively. In Figure 8, the distribution of lithologic types for the training and test datasets is presented in log scale, and the distributions have a similar shape. The total dataset has 59,423 rows and 23 features; the training dataset contains 47,538 rows and the test dataset contains 11,885 rows.
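A stratified split is one way to obtain the similar class distributions shown in Figure 8; whether the split was stratified is not stated in the paper, so the sketch below is an assumption, with X and y standing for the Kazakhstan feature matrix and lithology labels.

```python
# Sketch: 75/25 split preserving class proportions (stratification assumed).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(len(X_train), len(X_test))   # 47,538 and 11,885 rows for this dataset
```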
Based on the results for the Norway dataset, we used the Random Forest Classifier, which showed the best result on all three metrics, for the data from the Kazakhstan field. Table 8 gives the three scores that summarize the performance of the Random Forest Classifier on the test dataset for the different lithofacies types. The Random Forest Classifier shows precise results on this dataset as well.
Class 2 (Dolomite) was not precisely predicted; see Table 9. The reason for such values may be the imbalanced dataset.
Figure 9 shows the global importance for the five classes. The PHIE (effective porosity) and PHIT (total porosity) features influence the model prediction most strongly for the Clay (3) and Sand (0) classes.
Figure 10 shows the influence of the input features on the model prediction of the lithology classes. The SHAP values of the Sand class are higher for the PHIE and PHIT features. The colors of the PHIE and PHIT values indicate a threshold that can split the positive and negative influence of these features on the model prediction; see Figure 10a. The SHAP values of the Limestone class are also higher for the PHIE and PHIT features; see Figure 10b. The model found a dependence on depth for Limestone; most likely, this class is located at a defined depth in this field. The SHAP values of the Dolomite, Clay, and Coal classes are higher for the PHIE, PHIT, PEFZ, and RHOZ features; see Figure 10c,d. High values of the PHIE and RHOZ features have a positive influence on the model prediction of Dolomite. Lower values of the PHIE and PHIT features have a positive influence on the model prediction of Clay. Lower values of the PHIE, PEFZ, and RHOZ features have a positive influence on the model prediction of Coal.

4. Conclusions

This paper analyzes supervised learning algorithms for well log data from Norway and Kazakhstan, with and without additional wavelet-transformed features. Our focus was on data from offshore and onshore reservoirs. The findings suggest that our fitted Random Forest model shows the best results among the considered algorithms. A cross-validation methodology was applied in the machine learning models. Machine learning algorithms, in particular the Random Forest method, can be integrated into specific geophysical software to perform lithology classification automatically based on well logs, without using information from drill cuttings, core samples, and the like. This can improve the efficiency of solving some geophysical interpretation problems.
The considered methods (kNN, Random Forest, Decision Tree, etc.) are verified as a good set of methods for well log data, as they enable solving the nonlinear problem of lithological classification. The Random Forest model has an accuracy of 0.948, a penalty matrix score of −0.1289, and a Hamming loss of 0.0473 for 12 features, and an accuracy of 0.938, a penalty matrix score of −0.1697, and a Hamming loss of 0.0624 for 19 features, including the features generated from the wavelet transformation of the data. The scores of the algorithms that used the wavelet-transformed data are similar to the scores of the algorithms trained only on the data without the wavelet transformation. However, we believe that such additional features could help for other problems (e.g., regression) in geoscience, such as the identification of permeability or porosity.
We used the SHAP framework to explore the impact of features on the targeted classification and to detect the complex relationships between features. The SHAP results for our dataset showed that the significant features for the prediction of some lithology classes were GR, DTC, and RHOB. However, some classes such as Tuff and Coal can be detected by other features (NPHI and RDEP).
In our future research, we intend to concentrate on deep learning algorithms such as 1D-CNN, LSTM, and RNN for the prediction of multi-label lithofacies classification, porosity, and permeability using well log data.

Author Contributions

Conceptualization, T.M. and Y.A.; methodology, T.M. and Y.A.; software, T.M.; validation, D.K. and T.M.; formal analysis, T.M. and D.K.; resources, D.K. and B.B.; writing—original draft preparation, T.M., D.K. and Y.A.; writing—review and editing, T.M., B.B. and Y.A.; visualization, T.M.; supervision, Y.A.; funding acquisition, Y.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Nazarbayev University, Grant No. 110119FD4502, the SPG fund and the Ministry of Education and Science of the Republic of Kazakhstan, Grant No. AP08052762.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable.

Acknowledgments

T.M. and Y.A. wish to acknowledge the research grant, No. AP08052762, from the Ministry of Education and Science of the Republic of Kazakhstan and the Nazarbayev University Faculty Development Competitive Research Grant (NUFDCRG), Grant No. 110119FD4502.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ohl, D.; Raef, A. Rock formation characterization for carbon dioxide geosequestration: 3D seismic amplitude and coherency anomalies, and seismic petrophysical facies classification, Wellington and Anson-Bates Fields, Kansas, USA. J. Appl. Geophys. 2014, 103, 221–231.
  2. Wang, X.; Yang, S.; Zhao, Y.; Wang, Y. Improved pore structure prediction based on MICP with a data mining and machine learning system approach in Mesozoic strata of Gaoqing field, Jiyang depression. J. Pet. Sci. Eng. 2018, 171, 362–393.
  3. Amanbek, Y.; Merembayev, T.; Srinivasan, S. Framework of Fracture Network Modeling using Conditioned Data with Sequential Gaussian Simulation. arXiv 2020, arXiv:2003.01327.
  4. Sun, Z.; Jiang, B.; Li, X.; Li, J.; Xiao, K. A Data-Driven Approach for Lithology Identification Based on Parameter-Optimized Ensemble Learning. Energies 2020, 13, 3903.
  5. Ai, X.; Wang, H.; Sun, B. Automatic Identification of Sedimentary Facies Based on a Support Vector Machine in the Aryskum Graben, Kazakhstan. Appl. Sci. 2019, 9, 4489.
  6. Osintseva, N.; Danko, D.; Priezzhev, I.; Iskaziyev, K.; Ryzhkov, V. Combination of classic geological/geophysical data analysis and machine learning: Brownfield sweet spots case study of the middle Jurassic Formation in Western Kazakhstan. In SEG Technical Program Expanded Abstracts 2020; Society of Exploration Geophysicists: Tulsa, OK, USA, 2020; pp. 2176–2180.
  7. Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for classification geology data from well logging. In Proceedings of the 2018 14th International Conference on Electronics Computer and Computation (ICECCO), Kaskelen, Kazakhstan, 29 November–1 December 2018; pp. 206–212.
  8. Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for stratigraphy classification on uranium deposits. Procedia Comput. Sci. 2019, 150, 46–52.
  9. Bressan, T.S.; de Souza, M.K.; Girelli, T.J.; Junior, F.C. Evaluation of machine learning methods for lithology classification using geophysical data. Comput. Geosci. 2020, 139, 104475.
  10. Zhou, K.; Hu, Y.; Pan, H.; Kong, L.; Liu, J.; Huang, Z.; Chen, T. Fast prediction of reservoir permeability based on embedded feature selection and LightGBM using direct logging data. Meas. Sci. Technol. 2020, 31, 045101.
  11. Tan, F.; Luo, G.; Wang, D.; Chen, Y. Evaluation of complex petroleum reservoirs based on data mining methods. Comput. Geosci. 2017, 21, 151–165.
  12. Kanaev, I.S. Automated Missed Pay Zones Detection Method Based on BV10 Member Data of Samotlorskoe Field. In SPE Russian Petroleum Technology Conference; Society of Petroleum Engineers: Houston, TX, USA, 2020.
  13. Al-Mudhafar, W.J. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Pet. Explor. Prod. Technol. 2017, 7, 1023–1033.
  14. Kim, S.; Kim, K.H.; Min, B.; Lim, J.; Lee, K. Generation of synthetic density log data using deep learning algorithm at the Golden field in Alberta, Canada. Geofluids 2020, 26.
  15. Zhang, D.; Yuntian, C.; Jin, M. Synthetic well logs generation via Recurrent Neural Networks. Pet. Explor. Dev. 2018, 45, 629–639.
  16. Shen, C.; Asante-Okyere, S.; Yevenyo Ziggah, Y.; Wang, L.; Zhu, X. Group method of data handling (GMDH) lithology identification based on wavelet analysis and dimensionality reduction as well log data pre-processing techniques. Energies 2019, 12, 1509.
  17. Hill, E.J.; Pearce, M.A.; Stromberg, J.M. Improving automated geological logging of drill holes by incorporating multiscale spatial methods. Math. Geosci. 2020, 53, 1–33.
  18. Pathak, R.S. The Wavelet Transform; Springer Science & Business Media: Berlin, Germany, 2009; Volume 4, p. 178.
  19. Bilogur, A. Missingno: A missing data visualization suite. J. Open Source Softw. 2018, 3, 547.
  20. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  21. Rokach, L.; Maimon, O. Decision trees. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin, Germany, 2005; pp. 165–192.
  22. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  23. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y. Xgboost: Extreme Gradient Boosting. R Package Version 0.4-2; R Package Vignette: Madison, WI, USA, 2015; pp. 1–4.
  24. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
  25. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  26. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
Figure 1. Result of applying the continuous wavelet transformation.
Figure 2. Flowchart of the workflow for the machine learning algorithms.
Figure 3. Visualization of well logs.
Figure 4. Histogram of lithology facies in log scale.
Figure 5. Real values of logs with missing data.
Figure 6. Importance factor of each input variable for the features.
Figure 7. Summary plots of the influence of features on model predictions by class for the Norway data.
Figure 8. Histogram of lithology facies for the Kazakhstan training and test datasets in log scale.
Figure 9. Importance factor of the Random Forest model for each input variable.
Figure 10. Summary plots of the influence of features on model predictions by class for the Kazakhstan dataset.
Table 1. The well log abbreviations.

| Log Name | Log Description |
| --- | --- |
| LITHOFACIES_LITHOLOGY | Interpreted lithofacies |
| RDEP | Deep reading resistivity measurement |
| RSHA | Shallow reading resistivity measurement |
| RMED | Medium deep reading resistivity measurement |
| RXO | Flushed zone resistivity measurement |
| RMIC | Micro resistivity measurement |
| SP | Self potential log |
| DTS | Shear wave sonic log (us/ft) |
| DTC | Compressional wave sonic log (us/ft) |
| NPHI | Neutron porosity log |
| PEF | Photoelectric factor log |
| GR | Gamma ray log |
| RHOB | Bulk density log |
| DRHO | Density correction log |
| CALI | Caliper log |
| BS | Borehole size |
| DCAL | Differential caliper log |
| ROPA | Average rate of penetration |
| SGR | Spectral gamma ray log |
| MUDWEIGHT | Weight of drilling mud |
| ROP | Rate of penetration |
| DEPTH_MD | Measured depth |
| x_loc | X location of sample |
| y_loc | Y location of sample |
| z_loc | Z (TVDSS) location of sample |
Table 2. Descriptive statistics of the well logs in the full dataset.

| Statistic | DEPTH_MD | CALI | RSHA | RMED | RDEP | RHOB | GR | NPHI | PEF | DTC | SP | BS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 2184.1 | 12.2 | 5.8 | 4.8 | 10.6 | 2.0 | 70.9 | 0.2 | 3.6 | 105.5 | 44.3 | 7.0 |
| standard deviation | 997.2 | 5.0 | 74.1 | 53.8 | 113.4 | 0.8 | 34.2 | 0.2 | 8.9 | 40.8 | 70.9 | 6.4 |
| min | 136.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | −999.0 | 0.0 |
| 25% | 1418.6 | 8.9 | 0.0 | 0.9 | 0.9 | 2.0 | 47.6 | 0.0 | 0.0 | 83.7 | 0.0 | 0.0 |
| 50% | 2076.6 | 12.4 | 0.6 | 1.4 | 1.4 | 2.2 | 68.4 | 0.2 | 2.9 | 105.3 | 40.4 | 8.5 |
| 75% | 2864.4 | 15.7 | 1.5 | 2.6 | 2.5 | 2.5 | 89.0 | 0.4 | 4.6 | 139.3 | 70.4 | 12.3 |
| max | 5436.6 | 28.3 | 2193.9 | 1988.6 | 1999.9 | 3.5 | 1077.0 | 1.0 | 383.1 | 320.5 | 526.5 | 26.0 |
Table 3. Main hyperparameters for Random Forest Classification.

| Hyperparameter | Symbol | Parameter Value |
| --- | --- | --- |
| The number of trees in the forest | n_estimators | 200 |
| The maximum depth of the trees | max_depth | 70 |
| The minimum number of samples required to be at a leaf node | min_samples_leaf | 1 |
| The minimum number of samples required to split an internal node | min_samples_split | 2 |
| The number of features for the best split | max_features | 10 |
Table 4. Main hyperparameters for XGBoost Classification.

| Hyperparameter | Symbol | Parameter Value |
| --- | --- | --- |
| Number of boosted trees to fit | n_estimator | 526 |
| Minimum sum of instance weight | min_child_weight | 11 |
| Maximum depth of a tree | max_depth | 12 |
| Minimum loss reduction required to make a further partition on a leaf node of the tree | gamma | 8 |
| L2 regularization term on weights | lambda | 1.36 |
| L1 regularization term on weights | alpha | 0.23 |
| Boosting learning rate | learning_rate | 0.73 |
Table 5. Main hyperparameters for LightGBM Classification.

| Hyperparameter | Symbol | Parameter Value |
| --- | --- | --- |
| Number of boosted trees to fit | n_estimator | 216 |
| Minimum sum of instance weight | min_child_weight | 4.12 |
| Maximum depth of a tree | max_depth | 11 |
| Minimum loss reduction | min_split_gain | 0.08 |
| L1 regularization term on weights | lambda_l1 | 2.69 |
| L2 regularization term on weights | lambda_l2 | 4.27 |
| Boosting learning rate | learning_rate | 0.05 |
Table 6. Comparison of the three scores of the models on the test dataset, for the original dataset (12 features) and for the original plus generated features (19 features).

| Model | Accuracy (12) | Penalty Matrix (12) | Hamming Loss (12) | Accuracy (19) | Penalty Matrix (19) | Hamming Loss (19) |
| --- | --- | --- | --- | --- | --- | --- |
| kNN | 0.926 | −0.1796 | 0.0672 | 0.801 | −0.5237 | 0.1969 |
| Random Forest | 0.948 | −0.1289 | 0.0473 | 0.938 | −0.1697 | 0.0624 |
| Decision Tree | 0.820 | −0.4810 | 0.1832 | 0.8167 | −0.4810 | 0.1826 |
| XGBoost | 0.855 | −0.3812 | 0.1418 | 0.8621 | −0.3681 | 0.1631 |
| LightGBM | 0.897 | −0.2600 | 0.0984 | 0.9013 | −0.2599 | 0.1378 |
Table 7. Classification report of the RFC.

| Lithofacies Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| 0 | 0.94 | 0.95 | 0.94 | 33,697 |
| 1 | 0.89 | 0.92 | 0.90 | 29,227 |
| 2 | 0.98 | 0.96 | 0.97 | 147,278 |
| 3 | 0.90 | 0.94 | 0.92 | 6447 |
| 4 | 0.46 | 0.87 | 0.61 | 185 |
| 5 | 0.81 | 0.94 | 0.87 | 9746 |
| 6 | 0.97 | 0.97 | 0.97 | 2085 |
| 7 | 1.00 | 0.99 | 0.99 | 1684 |
| 8 | 0.93 | 0.94 | 0.93 | 198 |
| 9 | 0.94 | 0.97 | 0.96 | 2954 |
| 10 | 0.73 | 0.90 | 0.80 | 586 |
| 11 | 1.00 | 1.00 | 1.00 | 16 |
| accuracy | | | 0.95 | 234,103 |
| macro avg | 0.88 | 0.95 | 0.91 | 234,103 |
| weighted avg | 0.96 | 0.95 | 0.95 | 234,103 |
Table 8. Comparison of the three scores of the Random Forest Classifier on the test dataset, for the original dataset and for the original plus generated features.

| Model | Accuracy (orig.) | Penalty Matrix (orig.) | Hamming Loss (orig.) | Accuracy (orig. + gen.) | Penalty Matrix (orig. + gen.) | Hamming Loss (orig. + gen.) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 0.977 | −0.061 | 0.0227 | 0.975 | −0.068 | 0.0253 |
Table 9. Classification report of the RFC for the Kazakhstan field test data.

| Lithofacies Class | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| 0 | 0.97 | 0.99 | 0.98 | 4045 |
| 1 | 0.78 | 0.89 | 0.83 | 47 |
| 2 | 0.38 | 0.88 | 0.53 | 64 |
| 3 | 0.99 | 0.97 | 0.98 | 7620 |
| 4 | 0.96 | 0.99 | 0.97 | 109 |
| accuracy | | | 0.98 | 11,885 |
| macro avg | 0.82 | 0.94 | 0.86 | 11,885 |
| weighted avg | 0.98 | 0.98 | 0.98 | 11,885 |