1. Introduction
Understanding the geological structure of formations from the available data is important in many applications. Geophysicists define the key features of the complex subsurface, and their experience in identifying lithotypes helps to improve the accuracy of the labels in well logs. Such experience requires many hours of work and additional data from different sources, such as seismic surveys, cores, etc. One possible solution is to use machine learning algorithms to accelerate the accurate prediction process in a systematic way. To achieve appropriate accuracy, data-driven algorithms require a large amount of data, which should be used in a balanced way during training. Traditionally, geophysicists identify the most common features of a region first and then estimate uncommon features from additional well log data, using the known relationships among lithotypes in logs such as SP, RHOB, and NPHI.
To determine lithotypes, geophysicists work in stages: first, shale and sandstone are identified, often using gamma-ray logs, sometimes cross-checked against SP, RHOB, and NPHI. After these rocks, less common lithotypes are isolated only by their characteristic features. The more features (curves) are available, the better the accuracy of determining the lithotypes. An inexperienced geologist without knowledge of the geology of the field may not accurately distinguish similar lithotypes; therefore, the use of trained models can compensate for a lack of field-specific knowledge among geophysicists.
Machine learning can be an effective tool to enrich geoscience workflows. Geostatistical approaches have been proposed in many studies [1,2,3,4] to reduce the uncertainty of subsurface properties using large datasets.
There are several works on the application of data analysis methods to mining areas [5,6]. The importance of lithofacies detection for uranium mining is discussed and investigated in [7,8], where machine learning algorithms are used to solve multilabel lithofacies classification. The in situ leaching of uranium requires a good understanding of the permeable and impermeable rock types.
The authors of [9] compared machine learning algorithms from the scikit-learn framework (MLPClassifier, DecisionTreeClassifier, RandomForestClassifier, and SVC) on data from offshore wells. The algorithms were applied to three standard data templates and a practical data template in a lithology classification problem for wells from International Ocean Discovery Program (IODP) expeditions, using a lithology subdivision into GP (group GP), G1 (group 1), G2 (group 2), and G3 (group 3). The comparison analysis showed that the multilayer perceptron (MLP) method gave the best results in the lithology classification for the practical template: the lithology of the G2 group.
In [10], the authors proposed using embedded feature selection (EFS) and LightGBM to predict reservoir permeability. Commonly used feature selection methods include filter feature selection (FFS) and wrapper feature selection (WFS). The EFS result was based on five of the 22 features (DEPTH, AC, DEN, FMIT, and GR) and reached an R2 of 0.9457. Furthermore, the authors compared several selection methods: mutual information regression (MIR) in FFS and recursive feature elimination (RFE) in WFS. The same comparison was done for LightGBM against Random Forest and XGBoost. The best result was obtained with EFS+LightGBM: an R2 of 0.9712 and an RMSE of 0.5959.
The authors of [11,12] presented the application of oil production exploration and development data to generate high-performance predictive models and optimal classifications of geology, reservoirs, and fluid characteristics. Deep learning algorithms are also promising for solving geoscience problems, particularly lithology classification [13,14,15].
In [16], the authors investigated data preprocessing methods for well logs, such as dimensionality reduction and wavelet analysis, in order to improve the accuracy of the group method of data handling (GMDH) for lithological classification; wavelet analysis was used to decompose the log signals for the GMDH algorithm. The authors of [17] proposed using the continuous wavelet transform of well log data to detect geological boundaries. One application of the wavelet coefficients is to measure edge (boundary) strength, which is a measure of the thickness of geological units. In their method, instead of solving a multivariate classification problem, additional features were generated to detect the boundaries of the formations. Multi-element geochemical data taken from 259 drill holes were studied, and the method's efficiency was shown for data with a maximum depth of 600 m.
In this paper, we investigate the prediction of lithofacies using machine learning algorithms for geological data from Kazakhstan and Norway. We consider machine learning methods such as kNN, Decision Tree, Random Forest, XGBoost, and LightGBM, with and without wavelet-transformed data. The gamma ray (GR), medium deep reading resistivity measurement (RMED), compressional waves sonic log (DTC), neutron porosity log (NPHI), bulk density log (RHOB), and other logs are considered as input data for the machine learning models. In addition, the results of the supervised learning are presented in the SHapley Additive exPlanations (SHAP) visualization framework, indicating the significant well logs. Our research question is the following: how accurately can supervised machine learning algorithms predict lithofacies based on the geophysical well log data from Norwegian and Kazakhstani fields?
The rest of the paper is organized as follows. In the next section, we describe the wavelet transformation, data analysis, and machine learning algorithms. Numerical results of algorithms are presented in
Section 3.
Section 4 concludes the paper.
2. Methodology
We first describe the wavelet transformation and then the workflow for the machine learning algorithms. Next, the data analysis and data preparation are presented. Finally, we briefly describe the considered machine learning algorithms for supervised multi-label classification.
2.1. Wavelet Transformation
We use a Gaussian-family wavelet transformation for edge detection in the geological formation. The negative, normalized second derivative of the Gaussian function is known as the Mexican hat wavelet; strong responses of this wavelet correspond to edges (sharp transitions) in the signal. Applying the wavelet transformation to a given signal generates new artificial data which can be useful for further analysis.
The physical meaning of the wavelet transform is to calculate the joint energy spectrum of signals in the frequency-time domain and identify both the frequency and time information of the distinct modes [
18].
Wavelet transformation decomposes a geophysical log into a combination of signals at different frequencies. It allows determining which frequency bands of a log are noise and which contain the actual data. It provides a one-to-one mapping of the original log, so we can go back and forth between the original and transformed data.
The integral wavelet transform of a function $f(t)$ with respect to a mother wavelet $\psi(t)$ is given by
$$ W_f(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} f(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) \mathrm{d}t, $$
where $a$ and $b$ are the scale factor and shift, respectively.
For the wavelet transformation, we used the Ricker wavelet, also known as the “Mexican hat wavelet”:
$$ \psi(t) = \frac{2}{\sqrt{3\sigma}\, \pi^{1/4}} \left(1 - \frac{t^{2}}{\sigma^{2}}\right) e^{-t^{2}/(2\sigma^{2})}. $$
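A minimal NumPy sketch of this transform: the Ricker wavelet above, plus a naive convolution-based CWT applied to a synthetic step "log". The step location, scales, and sampling interval are invented for illustration; a production implementation would typically use a library such as PyWavelets.

```python
import numpy as np

def ricker(t, sigma):
    # Ricker ("Mexican hat") wavelet: the negative, normalized second
    # derivative of a Gaussian with width parameter sigma
    a = 2.0 / (np.sqrt(3.0 * sigma) * np.pi ** 0.25)
    return a * (1.0 - (t / sigma) ** 2) * np.exp(-(t ** 2) / (2.0 * sigma ** 2))

def cwt_ricker(signal, scales, dt=1.0):
    # naive CWT by direct convolution: one row of coefficients per scale
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        n = int(10 * s / dt) | 1              # odd sample count, centered on 0
        t = (np.arange(n) - n // 2) * dt
        w = ricker(t, s)
        out[i] = np.convolve(signal, w, mode="same") * dt / np.sqrt(s)
    return out

# synthetic zero-mean "log": a lithological boundary at 100 m
depth = np.arange(0.0, 200.0, 0.5)
log = np.where(depth < 100.0, -1.0, 1.0)
coeffs = cwt_ricker(log, scales=[2.0, 4.0, 8.0], dt=0.5)
# the strongest coefficient magnitude sits near the boundary
edge_depth = depth[np.argmax(np.abs(coeffs[1]))]
```

The extrema of the response to a step sit about one scale-width away from the step, which is what makes the coefficients usable as edge-strength features.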
To illustrate the above explanation, we conduct wavelet transformation of the geophysical logs from Kazakhstan, see
Figure 1.
To better display the result of the wavelet transformation of the logs, we use a log scale in Figure 1b. Figure 1a shows the application of the wavelet transform to two logs.
We have followed the general workflow of a machine learning classifier which is illustrated in
Figure 2. Our process of the classifier model consists of the following steps:
Data preprocessing.
Application of the wavelet transformation to generate new features.
Finding hyperparameters and construction of machine learning algorithm as a classifier of lithofacies.
Training of the model on the well log data with the labeled lithology by geophysicist or geologist.
Evaluation of the trained model of classifier according to specified score based on the test dataset.
The initial stage starts with the generation of new features from the existing well logs. Next, the model is trained on the new dataset, which includes the wavelet-transformed well logs. The trained model is evaluated by estimating its accuracy on the test dataset.
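The split–fit–evaluate portion of this workflow can be sketched with scikit-learn. The synthetic "GR"/"RHOB" stand-in data and all parameter values here are illustrative assumptions, not the paper's dataset or configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# synthetic stand-in for well logs: two "facies" with shifted GR statistics
n = 1000
gr = np.where(rng.random(n) < 0.5, rng.normal(40, 10, n), rng.normal(90, 10, n))
rhob = 2.0 + 0.005 * gr + rng.normal(0, 0.05, n)
y = (gr > 65).astype(int)                  # label: shale-like vs sand-like
X = np.column_stack([gr, rhob])

# split, fit, and evaluate on held-out data (workflow steps 4-5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
```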
2.2. Data Analysis
We consider the well log data from an offshore field in the North Sea, near Norway. The study area contains 98 wells with a maximum depth of 5000 m. The dataset consists of interpreted lithofacies and 22 wireline log curves, including gamma ray (GR), medium deep reading resistivity measurement (RMED), compressional waves sonic log (DTC), neutron porosity log (NPHI), bulk density log (RHOB), and others. Digital measurements were recorded at 0.1 m intervals, see
Table 1 and
Table 2 for abbreviations and descriptions of the dataset.
The interpreted lithofacies contains 12 classes. Lithofacies type corresponds to codes (number) which are used in machine learning training and prediction: 0: Sandstone, 1: Sandstone/Shale, 2: Shale, 3: Marl, 4: Dolomite, 5: Limestone, 6: Chalk, 7: Halite, 8: Anhydrite, 9: Tuff, 10: Coal, 11: Basement.
For data exploration we use the Cegal tools library (https://github.com/cegaldev/cegaltools, accessed on 22 March 2021), a geoscience tool for loading, plotting, and evaluating well log data using Python scripts. It is also an interactive tool to visualize data details and dependencies.
Figure 3 shows one well with its logs.
Distributions of lithology types in log scale are presented in
Figure 4. We have a similar distribution of lithology classes for training and test datasets.
2.3. Data Preparation
The dataset contains some missing data. Key reasons for missing data are technical problems during data acquisition, cost optimization during geophysical logging, human factors, and others. We utilize the Missingno library [19] to detect gaps in the provided dataset and to locate them within the logs. In Figure 5, one well is presented; its logs contain missing data, either missing over the full depth of the well or with some gaps. After careful study and statistical analysis of the logs for missing data, we decided to concentrate on the following logs, which have a smaller percentage of missing data: DEPTH_MD, CALI, RSHA, RMED, RDEP, RHOB, GR, NPHI, PEF, DTC, SP, and BS.
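Missingno renders visual gap matrices; the per-log missing fraction that underlies such a decision can be computed directly with pandas. The column names below mirror log mnemonics from the dataset, but the values and gap interval are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "GR":   rng.normal(60, 15, n),
    "RHOB": rng.normal(2.4, 0.2, n),
    "DTC":  rng.normal(80, 10, n),
})
# punch a gap into DTC to mimic an interval that was not logged
df.loc[100:199, "DTC"] = np.nan   # .loc slices are label-inclusive: 100 rows

# fraction of missing samples per log, sorted worst-first
missing = df.isna().mean().sort_values(ascending=False)
```

Logs whose missing fraction exceeds a chosen cutoff would then be dropped from the feature set.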
2.4. Algorithms
There are various machine learning algorithms, and each has its own advantages and disadvantages for solving geoscience problems. In this paper, we performed a comparison analysis of five algorithms: K-nearest neighbors (kNN) [20], the Decision Tree [21], the Random Forest Classifier (RFC) [22], extreme gradient boosting (XGBoost) [23], and LightGBM [24]. They are also explored with and without the additional features obtained from the wavelet transformation. In this research, we used scikit-learn [25], a Python framework, for the kNN, Decision Tree, and Random Forest classifiers; XGBoost and LightGBM have their own Python frameworks.
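A minimal comparison loop in the spirit of this section, run on synthetic multi-class data with scikit-learn's cross-validation (the dataset shape and hyperparameters are illustrative stand-ins, not the paper's configuration; XGBoost and LightGBM are omitted because they require their own packages):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic multi-class stand-in for the lithofacies problem
X, y = make_classification(n_samples=1000, n_features=12, n_informative=8,
                           n_classes=5, random_state=0)

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# mean 5-fold cross-validated accuracy per model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```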
K-Nearest Neighbors (kNN) is a machine learning method that has been used for data mining [
20]. Each point (data point) has location in a multidimensional space, where the space consists of axis or features of current datasets. The trained model defines an optimal count of neighbors for the trained dataset and when we have a new (test) data point the model finds the K nearest neighbors for the test dataset. KNN has the advantage of being nonparametric. The method is sensitive to scale, so standardizing data is mandatory to eliminate differences in scale. It can be an issue when the dataset is very large, the application of special methods can solve the issue to decrease the space.
Decision tree methods are data mining methods that have been successfully used for classification problems. Decision trees were developed by Morgan and Sonquist in 1963, who applied the algorithm to determinants of social conditions [21]. One advantage of decision trees is that they are computationally fast and can handle high-dimensional data. On the other hand, a single decision tree can overfit the data, and the algorithm is greedy; therefore, it keeps growing the tree deeper.
The random forest was introduced by Breiman as an ensemble of tree classifiers [22]. The key idea of the algorithm is to draw the values of a random vector from an aggregated bootstrap sample (the training dataset) and then to train many decision trees. However, the trained model can contain many trees and thus requires more computational resources.
The main advantage of XGBoost is parallelization. XGBoost is a scalable version of the gradient boosting machine algorithm and has shown efficiency in several machine learning applications [23]. XGBoost is an ensemble of classification and regression trees and works for data with nonlinear features. The key idea is to use weak trees and improve the ensemble's accuracy at each iteration: taking into account the prediction error of the already trained ensemble, the next tree classifier is trained to correct that error.
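The "each new tree fits the error of the current ensemble" idea can be sketched with depth-1 regression trees (stumps) under squared loss, where the negative gradient is simply the residual. This is a toy sketch of the boosting principle, not of XGBoost's actual regularized, parallelized implementation.

```python
import numpy as np

def fit_stump(x, r):
    # best single-split stump minimizing squared error against residual r
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return lambda q: np.where(q <= s, lv, rv)

def boost(x, y, n_rounds=50, lr=0.1):
    pred = np.full_like(y, y.mean())
    for _ in range(n_rounds):
        residual = y - pred            # negative gradient of squared loss
        h = fit_stump(x, residual)     # weak learner fits the ensemble's error
        pred = pred + lr * h(x)        # shrunken update of the ensemble
    return pred

x = np.linspace(0, 6, 200)
y = np.sin(x)
pred = boost(x, y)
mse = np.mean((y - pred) ** 2)
```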
LightGBM is a relatively new framework with wide application in machine learning and data science. The main issue of gradient boosting algorithms is that they process all the data to evaluate the possible split points, which impacts performance; LightGBM modifies the search technique for optimal splits [24].
Based on the training dataset, we tuned the main hyperparameters for Random Forest, see
Table 3. The main hyperparameters for XGBoost and LightGBM are presented in
Table 4 and
Table 5, respectively.
The prediction performance of the algorithms is evaluated by three statistical quality indicators: the Jaccard metric (accuracy), the Hamming loss, and the penalty metric. The reader is referred to
Table 5.
The Jaccard metric is computed as
$$ J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}. $$
The Hamming loss is defined as
$$ L_{\mathrm{H}}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(y_i \neq \hat{y}_i\right). $$
To estimate the accuracy of the models, a penalty matrix is used, derived from the averaged input of a representative sample. This allows petrophysically unreasonable predictions to be scored by a degree of “wrongness”. The scoring matrix is defined as follows:
$$ S = -\frac{1}{N} \sum_{i=1}^{N} P\left(y_i, \hat{y}_i\right), $$
where N is the number of samples, $y_i$ is the true lithology label, and $\hat{y}_i$ is the predicted lithology label.
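The three scores can be sketched directly on toy labels. The 3×3 penalty matrix below is invented for illustration; the real matrix encodes petrophysical similarity between the 12 lithofacies classes.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    # fraction of mismatched labels
    return np.mean(y_true != y_pred)

def penalty_score(y_true, y_pred, P):
    # average penalty, negated as in the text: 0 for a correct label,
    # larger for geologically implausible confusions
    return -np.mean(P[y_true, y_pred])

# toy penalty matrix (illustrative values): confusing class 0 with class 2
# is penalized more heavily than confusing class 0 with class 1
P = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])

y_true = np.array([0, 1, 2, 2, 0])
y_pred = np.array([0, 1, 2, 0, 1])

acc = np.mean(y_true == y_pred)   # Jaccard metric reported as accuracy
hl = hamming_loss(y_true, y_pred)
pen = penalty_score(y_true, y_pred, P)
```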
3. Results
Computations are performed on a desktop machine (3.2 GHz Intel Core i7 8700 processor) with 32 GB RAM. Tuning hyperparameters and cross-validation operations are time-consuming; therefore, they are computed in parallel mode using eight cores.
3.1. Lithofacies Prediction for the Norway Data
The comparison of the selected algorithms has been performed on 12 features plus seven additional features generated by the wavelet transformation, for a total of 19 features.
Table 6 shows the scores of the models on the test dataset by the Jaccard metric (accuracy), Hamming loss, and penalty metric. We observe that the RFC has the highest score on the test set, with an accuracy, penalty matrix score, and Hamming loss of 0.948, −0.1289, and 0.0473, respectively. Thus, the RFC was selected for a detailed analysis of the lithofacies classification. The classification report for the RFC model (12 features) can be found in
Table 7. By evaluating the precision information from Table 7, we noticed that the lowest values were computed for Dolomite (4) and Coal (10). A reason for such values could be the underrepresentation of these lithofacies classes in the dataset.
To understand the good accuracy of the RFC model for lithology classification, we use the SHAP package to verify the results, which are consistent with another study [26]. SHAP is a good tool for explaining different models, providing an importance value for each feature. SHAP builds an explanatory model for a single row–prediction pair to explain the result of the prediction; the SHAP values are calculated by averaging over all possible feature combinations. SHAP does not enable us to determine the probabilities of predicted classes in multi-label classification. The explanation models (tree and kernel) cannot output probabilities due to constraints associated with nonlinear transformations, but they provide the raw margin values of the objective function fitted by the model.
Figure 6 shows the global importance for the 12 classes, calculated as the average of the absolute SHAP values. SHAP ranks the input features by the mean SHAP value; the magnitude of the value indicates the importance of the feature in the prediction of a certain class (higher means more influential). The GR feature influences the model prediction in all lithology classes; other features have less influence compared with GR.
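SHAP itself requires the shap package; the same question, which log drives the prediction, can be probed more crudely with permutation importance from scikit-learn, shown here only as a lightweight stand-in for the SHAP analysis. The feature names and data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n = 1000
gr   = rng.normal(60, 20, n)       # informative: drives the label
nphi = rng.normal(0.25, 0.05, n)   # informative: drives the label
cali = rng.normal(8.5, 0.3, n)     # uninformative noise
y = ((gr > 60) & (nphi > 0.25)).astype(int)
X = np.column_stack([gr, nphi, cali])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# importance = accuracy drop when one feature column is shuffled
imp = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]   # most important first
```

Shuffling an informative log destroys the model's accuracy, while shuffling the noise feature leaves it essentially unchanged, mirroring the low SHAP bars of unimportant logs.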
Figure 7 gives additional explanation of the model: the influence of the input features on the model prediction of the lithology classes, and their distributions. SHAP calculates a Shapley value for the input features and instances and plots it in the figure. The y-axis lists the input features in order of importance for the model prediction, from top to bottom. Each dot on the plots is colored by the value of the selected variable, from low (blue) to high (red). SHAP chooses the selected variable for each feature based on its correlation values.
Figure 7a–l illustrates the influence of the features on each lithology class. We note that the GR feature has high SHAP values and impacts the model prediction of the following lithology classes: Sandstone, Limestone, Chalk, Halite, Anhydrite, and Coal. However, for the Sandstone/Shale, Shale, Marl, and Basement classes, the GR feature tends to have negative SHAP values. We can see the influence of the GR, DTC, and RHOB variables on almost all lithology classes. On the other hand, some lithology classes, such as Tuff and Coal, have different important features in the model prediction. Due to the different nature of the Coal properties compared with the other classes, RHOB and NPHI were found to be significant features in the prediction of Coal. Moreover, RHOB, DTC, RMED, and GR were dominant features in the forecast of the Dolomite and Limestone lithologies.
3.2. Lithofacies Prediction for the Kazakhstan Data
We carried out numerical experiments in the above-mentioned way for wells in a Kazakhstan oil and gas field; the study area contains 10 wells with a maximum depth of 1700 m. The lithology of the field primarily consists of clay, coal, limestone, dolomite, and sand. The data contain well logs such as thermal neutron porosity, caliper, gamma ray, temperature, resistivity, sonic, and others. The information from the well logs was recorded at every foot of the logged formation.
The data were split into training and test datasets of 75% and 25%, respectively. In Figure 8, the distribution of lithologic types for the training and test datasets is presented in log scale; the distributions have a similar shape. The total dataset has 59,423 rows and 23 features; the training dataset contains 47,538 rows, and the test dataset contains 11,885 rows.
Based on the results for the Norway dataset, we used the Random Forest Classifier, which showed the best result on the three metrics, for the data from the Kazakhstan field. In Table 8, three scores summarize the performance of the Random Forest Classifier on the test dataset for the different lithofacies types. The Random Forest Classifier shows precise results here as well.
Class 2 (Dolomite) was not predicted precisely, see Table 9. The reason for such values can be the imbalanced dataset.
Figure 9 shows the global importance for the five classes. The PHIE (effective porosity) and PHIT (total porosity) features most strongly influence the model prediction for the Clay (3) and Sand (0) classes.
Figure 10 shows the influence of the input features on the model prediction of the lithology classes. The SHAP values of the Sand class are higher for the PHIE and PHIT features. The colors of the PHIE and PHIT values indicate a threshold that can split the positive and negative influence of these features on the model prediction, see Figure 10a. The SHAP values of the Limestone class are also higher for the PHIE and PHIT features, see Figure 10b. The model found a dependence on depth for Limestone; likely, this class is located at a defined depth in the field. The SHAP values of the Dolomite, Clay, and Coal classes are higher for the PHIE, PHIT, PEFZ, and RHOZ features, see Figure 10c,d. High values of the PHIE and RHOZ features positively influence the model prediction of Dolomite. Lower values of the PHIE and PHIT features positively influence the model prediction of Clay. Lower values of the PHIE, PEFZ, and RHOZ features positively influence the model prediction of Coal.
4. Conclusions
This paper analyzes supervised learning algorithms for well log data from Norway and Kazakhstan, with and without the additional wavelet-transformed features. Our focus was on data from offshore and onshore reservoirs. The findings suggest that our fitted Random Forest model shows the best results among the considered algorithms. The cross-validation methodology was applied in the machine learning models. Machine learning algorithms, in particular the Random Forest method, can be integrated into specific geophysical software to perform lithology classification automatically based on well logs, without using information about drill cuttings (sludge), core samples, and other sources. This process can improve the efficiency of solving some geophysical interpretation problems.
The considered methods (kNN, Decision Tree, Random Forest, etc.) proved to be a good set of methods for well log data, as they can solve the nonlinear problem of lithological classification. The Random Forest model has an accuracy of 0.948, a penalty matrix score of −0.1289, and a Hamming loss of 0.0473 for 12 features, and an accuracy of 0.938, a penalty matrix score of −0.1697, and a Hamming loss of 0.0624 for 19 features, including the features generated from the wavelet transformation of the data. The scores of the algorithms trained on the data together with the wavelet-transformed data are similar to the scores of the algorithms trained on the data alone. However, we believe that such additional features could help for other problems (regression) in geoscience, such as the identification of permeability or porosity.
We used the SHAP framework to explore the impact of the features on the target classification and to detect complex relationships between features. The SHAP results on our dataset showed that the significant features for the prediction of most lithology classes were GR, DTC, and RHOB. However, some classes, such as Tuff and Coal, can be detected by other features (NPHI and RDEP).
In our future research, we intend to concentrate on deep learning algorithms such as 1D-CNN, LSTM, and RNN for the prediction of multi-label lithofacies classification, porosity, and permeability using well log data.
Author Contributions
Conceptualization, T.M. and Y.A.; methodology, T.M. and Y.A.; software, T.M.; validation, D.K. and T.M.; formal analysis, T.M. and D.K.; resources, D.K. and B.B.; writing—original draft preparation, T.M., D.K. and Y.A.; writing—review and editing, T.M., B.B. and Y.A.; visualization, T.M.; supervision, Y.A.; funding acquisition, Y.A. All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by the Nazarbayev University, Grant No. 110119FD4502, the SPG fund and the Ministry of Education and Science of the Republic of Kazakhstan, Grant No. AP08052762.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data sharing is not applicable.
Acknowledgments
T.M. and Y.A. wish to acknowledge the research grant, No AP08052762, from the Ministry of Education and the Nazarbayev University Faculty Development Competitive Research Grant (NUFDCRG), Grant No 110119FD4502.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Ohl, D.; Raef, A. Rock formation characterization for carbon dioxide geosequestration: 3D seismic amplitude and coherency anomalies, and seismic petrophysical facies classification, Wellington and Anson-Bates Fields, Kansas, USA. J. Appl. Geophys. 2014, 103, 221–231.
- Wang, X.; Yang, S.; Zhao, Y.; Wang, Y. Improved pore structure prediction based on MICP with a data mining and machine learning system approach in Mesozoic strata of Gaoqing field, Jiyang depression. J. Pet. Sci. Eng. 2018, 171, 362–393.
- Amanbek, Y.; Merembayev, T.; Srinivasan, S. Framework of Fracture Network Modeling using Conditioned Data with Sequential Gaussian Simulation. arXiv 2020, arXiv:2003.01327.
- Sun, Z.; Jiang, B.; Li, X.; Li, J.; Xiao, K. A Data-Driven Approach for Lithology Identification Based on Parameter-Optimized Ensemble Learning. Energies 2020, 13, 3903.
- Ai, X.; Wang, H.; Sun, B. Automatic Identification of Sedimentary Facies Based on a Support Vector Machine in the Aryskum Graben, Kazakhstan. Appl. Sci. 2019, 9, 4489.
- Osintseva, N.; Danko, D.; Priezzhev, I.; Iskaziyev, K.; Ryzhkov, V. Combination of classic geological/geophysical data analysis and machine learning: Brownfield sweet spots case study of the middle Jurassic Formation in Western Kazakhstan. In SEG Technical Program Expanded Abstracts 2020; Society of Exploration Geophysicists: Tulsa, OK, USA, 2020; pp. 2176–2180.
- Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for classification geology data from well logging. In Proceedings of the 2018 14th International Conference on Electronics Computer and Computation (ICECCO), Kaskelen, Kazakhstan, 29 November–1 December 2018; pp. 206–212.
- Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for stratigraphy classification on uranium deposits. Procedia Comput. Sci. 2019, 150, 46–52.
- Bressan, T.S.; de Souza, M.K.; Girelli, T.J.; Junior, F.C. Evaluation of machine learning methods for lithology classification using geophysical data. Comput. Geosci. 2020, 139, 104475.
- Zhou, K.; Hu, Y.; Pan, H.; Kong, L.; Liu, J.; Huang, Z.; Chen, T. Fast prediction of reservoir permeability based on embedded feature selection and LightGBM using direct logging data. Meas. Sci. Technol. 2020, 31, 045101.
- Tan, F.; Luo, G.; Wang, D.; Chen, Y. Evaluation of complex petroleum reservoirs based on data mining methods. Comput. Geosci. 2017, 21, 151–165.
- Kanaev, I.S. Automated Missed Pay Zones Detection Method Based on BV10 Member Data of Samotlorskoe Field. In SPE Russian Petroleum Technology Conference; Society of Petroleum Engineers: Houston, TX, USA, 2020.
- Al-Mudhafar, W.J. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Pet. Explor. Prod. Technol. 2017, 7, 1023–1033.
- Kim, S.; Kim, K.H.; Min, B.; Lim, J.; Lee, K. Generation of synthetic density log data using deep learning algorithm at the Golden field in Alberta, Canada. Geofluids 2020, 26.
- Zhang, D.; Yuntian, C.; Jin, M. Synthetic well logs generation via Recurrent Neural Networks. Pet. Explor. Dev. 2018, 45, 629–639.
- Shen, C.; Asante-Okyere, S.; Yevenyo Ziggah, Y.; Wang, L.; Zhu, X. Group method of data handling (GMDH) lithology identification based on wavelet analysis and dimensionality reduction as well log data pre-processing techniques. Energies 2019, 12, 1509.
- Hill, E.J.; Pearce, M.A.; Stromberg, J.M. Improving automated geological logging of drill holes by incorporating multiscale spatial methods. Math. Geosci. 2020, 53, 1–33.
- Pathak, R.S. The Wavelet Transform; Springer Science & Business Media: Berlin, Germany, 2009; Volume 4; p. 178.
- Bilogur, A. Missingno: A missing data visualization suite. J. Open Source Softw. 2018, 3, 547.
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
- Rokach, L.; Maimon, O. Decision trees. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin, Germany, 2005; pp. 165–192.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y. Xgboost: Extreme Gradient Boosting; R Package Version 0.4-2; 2015; pp. 1–4.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).