Improving Subsurface Characterisation with ‘Big Data’ Mining and Machine Learning
Abstract
1. Introduction
2. Data, Workflow and Methods
2.1. Data
2.2. Workflow and Methods
1. Exploratory data analysis. This first step (Figure 3) constrains the available attributes, their coverage, and variability. Data QC can be done using basic exploratory statistical visualisation tools, for example: histograms; cross-plots; and low-dimensional linear projection. Histograms are used to visualise the data distribution. Cross-plotting each attribute against depth highlights outlier data points and general trends with depth. Finally, linear projection plots are used to assess multivariate data clustering and outliers [25,26]. A circular placement was used, which plots attributes on independent vectors and allows the visualisation of multidimensional data in two-dimensional space [26]. This allows attributes to be assessed independently, thereby assuming no attribute has a higher importance for clustering (as in principal component analysis). The data visualisation was completed using Orange Data Mining software [27].
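The EDA step above was performed in Orange's visual interface; for readers who prefer a scripted route, a minimal sketch of the same ideas might look as follows. The data are synthetic and the column names (`depth_m`, `temp_C`, `api`, `fvf`) are purely illustrative; the circular projection is a RadViz-style construction in which every attribute gets an equally weighted anchor on a unit circle, so no attribute is privileged (unlike PCA).

```python
import numpy as np
import pandas as pd

# Hypothetical reservoir attribute table (synthetic; column names illustrative).
rng = np.random.default_rng(0)
n = 200
depth = rng.uniform(500, 4000, n)
df = pd.DataFrame({
    "depth_m": depth,
    "temp_C": 15 + 0.03 * depth + rng.normal(0, 5, n),
    "api": 20 + 0.004 * depth + rng.normal(0, 3, n),
    "fvf": 1.05 + 1e-4 * depth + rng.normal(0, 0.05, n),
})

# Basic distribution summary (the 'histogram' step in tabular form).
print(df.describe().loc[["mean", "std", "min", "max"]])

# Circular linear projection (RadViz-style): each attribute is a unit-circle
# anchor; each sample is the weighted mean of the anchors, weighted by its
# min-max-normalised attribute values. Outliers plot away from the main cloud.
X = (df - df.min()) / (df.max() - df.min())
angles = 2 * np.pi * np.arange(df.shape[1]) / df.shape[1]
anchors = np.column_stack([np.cos(angles), np.sin(angles)])
weights = X.to_numpy()
proj = weights @ anchors / (weights.sum(axis=1, keepdims=True) + 1e-12)
print("2-D projection shape:", proj.shape)  # (200, 2)
```

Because each projected point is a convex combination of unit-circle anchors, all samples land inside the unit disc, which makes multivariate outliers easy to spot on a single 2-D scatter.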
2. Detection of outliers. Clustering of multivariate data allows outliers and erroneous data to be separated from bulk data prior to further analysis. It is important that outliers are treated separately in the prediction workflow so they do not bias prediction on bulk populations and to avoid imbalance problems [28,29]. Data separated from the bulk population may represent legitimate outliers or small populations behaving differently for a geologically plausible reason, e.g., a giant field or reservoirs with exceptional characteristics due to overpressure or historic subaerial weathering. Hierarchical clustering is used in this study to separate outliers and distinctive groups of geological populations. This agglomerative cluster analysis method groups the data by their proximity in multidimensional space according to the chosen distance metric [30,31,32]. This can efficiently separate outliers in addition to identifying distinct populations of data. Hierarchical clustering was completed using Orange Data Mining software [27].
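A scripted sketch of this step, using SciPy's agglomerative clustering on synthetic data (all values and the choice of Ward linkage are illustrative, not the paper's exact settings): the dendrogram is cut into a small number of groups, and any very small cluster is a candidate outlier population to set aside before prediction.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 3-attribute data: a bulk population plus a small distinct group.
rng = np.random.default_rng(1)
bulk = rng.normal(0, 1, size=(100, 3))
outliers = rng.normal(8, 0.5, size=(3, 3))  # distinct small population
X = np.vstack([bulk, outliers])

# Standardise so no attribute dominates the Euclidean distance metric.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Agglomerative (Ward) clustering; cut the dendrogram into 2 groups.
Z = linkage(Xs, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Small clusters are candidate outlier populations to treat separately.
sizes = np.bincount(labels)[1:]
print("cluster sizes:", sizes)
```

Whether such a small group is an error (e.g., a unit inconsistency) or a geologically meaningful population (e.g., an overpressured reservoir) is then a judgement call informed by step 1.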
3. Grouping similar reservoir and fluid characterisation features. This step identifies groups of reservoir and fluid variables that exhibit similar trends within the data using self-organising maps (SOM) [33]. This unsupervised learning tool groups attributes showing similar dependencies in a reduced space—a 2D lattice. SOM uses competitive learning to assign the input data to neurons on a lattice, arranged by their similarity. The distances between the data on the 2D lattice are visualised by heat maps, where cool colours indicate small distances between data points (clustering) and warm colours represent large distances between data. Here, we use SOM lattice data projection to generate a heat map for each attribute that depicts mutual relationships between the trends in the data, thereby allowing geological attributes to be grouped based on the trends they exhibit. In this data-driven workflow, this step ensures that only physically meaningful trends are brought forward for predictions. SOM was completed using the software ‘ML Office’ [30].
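The SOM itself was run in 'ML Office'; purely as an illustration of the mechanism described above, a tiny self-contained NumPy SOM (competitive learning with a shrinking Gaussian neighbourhood) and its distance 'heat map' (U-matrix) can be sketched as below. The lattice size, decay schedule, and synthetic two-group data are all assumptions for the sketch, not the study's settings.

```python
import numpy as np

class TinySOM:
    """Minimal self-organising map: competitive learning on a 2-D lattice."""
    def __init__(self, rows, cols, dim, seed=0):
        self.rows, self.cols = rows, cols
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0, 0.1, (rows, cols, dim))
        # lattice coordinates, used by the neighbourhood function
        self.grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                         indexing="ij"), axis=-1)

    def bmu(self, x):
        """Best-matching unit: lattice position of the closest neuron."""
        d = np.linalg.norm(self.w - x, axis=2)
        return np.unravel_index(np.argmin(d), d.shape)

    def train(self, data, epochs=20, lr0=0.5, sigma0=2.0):
        n_steps, t = epochs * len(data), 0
        for _ in range(epochs):
            for x in data:
                lr = lr0 * (1 - t / n_steps)            # decaying learning rate
                sigma = max(sigma0 * (1 - t / n_steps), 0.5)
                bi = np.array(self.bmu(x))
                # Gaussian neighbourhood around the best-matching unit
                d2 = ((self.grid - bi) ** 2).sum(-1)
                h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
                self.w += lr * h * (x - self.w)
                t += 1

    def u_matrix(self):
        """Mean distance of each neuron to its lattice neighbours (the 'heat
        map': cool = tightly clustered, warm = far apart)."""
        u = np.zeros((self.rows, self.cols))
        for i in range(self.rows):
            for j in range(self.cols):
                nbrs = [(i + di, j + dj)
                        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= i + di < self.rows and 0 <= j + dj < self.cols]
                u[i, j] = np.mean([np.linalg.norm(self.w[i, j] - self.w[a, b])
                                   for a, b in nbrs])
        return u

# Two synthetic attribute groups with different trends, mapped onto a 6x6 lattice.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.3, (50, 4)), rng.normal(2, 0.3, (50, 4))])
som = TinySOM(6, 6, 4)
som.train(data)
print(som.u_matrix().shape)  # (6, 6)
```

Attributes whose heat maps show similar spatial patterns on the lattice can then be grouped, exactly as described in the text.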
4. Feature selection. This step in the workflow explores the input/output predictor structure in order to select the most relevant combination of inputs for predictions. K-nearest neighbours (KNN) is a straightforward deterministic estimation method based on the similarity of data points that lie close together in multidimensional space [34]. The regression estimate is computed from the optimum number of neighbouring data points; the key parameter to optimise for the most accurate prediction is the number of neighbours, K. We demonstrate that this basic method, with only a single parameter to tune, suits the purpose of finding the combination of input features most relevant to the target variable. Cross-validation is commonly used to find the optimal K value at the minimum of the cross-validation error curve [30] (Figure 4). The shape of the error curve can reveal the relevance of the input dimensions to the output (target) variable. Where the error curve declines with an increasing number of neighbours without a distinctive minimum (Figure 4a,b), the target variable does not depend on the distance between the data in the input space, and the combination of inputs is therefore not relevant to the target variable. An error curve with a distinct minimum corresponds to a KNN estimate for which the target variable can be spatially correlated within the input space (Figure 4c,d). This not only identifies the optimal number of neighbours for a KNN estimate but, more importantly, confirms that the input attributes are relevant features that define the variation of the target variable values in the input space [30]. Feature selection using KNN therefore tells us which combination of input attributes is most relevant for making predictions. KNN cross-validation was completed using the software ‘ML Office’ [30].
5. Prediction of reservoir properties to increase data coverage. In this study, we test the use of Random Forest (RF) to make predictions and fill gaps in attribute values based on the selected relevant features. Random Forest regression is a tree-based learning algorithm proposed by Breiman [35] and has been successfully used for classification of geological data in previous studies [36]. The regression returns the average prediction of the individual trees within an ensemble, or forest [37] (Figure 5). A decision tree reaches a decision by following a path from the tree’s root to a leaf. The decision path consists of nodes, branches, and leaf nodes: each node tests a feature, each branch represents a decision, and each leaf holds an outcome. The RF combines trees built on different combinations of features, making predictions based on the probabilities assigned to branches; these probabilities are updated as data are propagated through the RF in a supervised learning fashion. The RF regression prediction model was trained and tuned using training and validation subsets. A blind test set prediction was computed to demonstrate the overall performance of the predictions. Model generation and testing were completed using Orange Data Mining software [27].
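The train/validate/blind-test procedure described above can be sketched with scikit-learn's `RandomForestRegressor` in place of the Orange widget. The synthetic attribute relationships, the 60/20/20 split, and the hyperparameter grid (number of trees) are assumptions for the sketch, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic reservoir attributes; FVF is the target with gaps to fill.
rng = np.random.default_rng(4)
n = 600
depth = rng.uniform(500, 4000, n)
api = 20 + 0.004 * depth + rng.normal(0, 2, n)
temp = 15 + 0.03 * depth + rng.normal(0, 4, n)
fvf = 1.05 + 1e-4 * depth + rng.normal(0, 0.02, n)

X = np.column_stack([api, depth, temp])
# train / validation / blind-test split (60 / 20 / 20)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, fvf, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Tune a hyperparameter (forest size) on the validation subset.
best = min(
    (RandomForestRegressor(n_estimators=t, random_state=0).fit(X_tr, y_tr)
     for t in (50, 200)),
    key=lambda m: mean_squared_error(y_val, m.predict(X_val)),
)

# Blind-test prediction demonstrates overall performance.
pred = best.predict(X_test)
print(f"blind-test MSE={mean_squared_error(y_test, pred):.5f}, "
      f"R2={r2_score(y_test, pred):.3f}")
```

The blind-test MSE and R² computed here play the same role as the values reported in the results tables below.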
6. Prediction confidence. Prediction confidence was assessed based on multiple prediction models with different inputs and hyperparameter tuning. Test error was used as a measure of confidence. Prediction confidence analyses were completed using Orange Data Mining software [27].
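One simple scripted version of this idea — illustrative only, with synthetic data and an arbitrary set of model variants — trains several models with different input subsets and hyperparameters and compares their test errors; a tight spread suggests a robust prediction, a wide spread flags low confidence.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic two-attribute problem; the target depends on both attributes.
rng = np.random.default_rng(5)
n = 400
x = rng.uniform(0, 10, (n, 2))
y = x[:, 0] * 2 + np.sin(x[:, 1]) + rng.normal(0, 0.3, n)

X_tr, y_tr, X_te, y_te = x[:300], y[:300], x[300:], y[300:]

# Train several models with different inputs / hyperparameters; the spread
# of their test errors is a simple confidence measure.
errors = []
for n_trees, cols in [(50, [0, 1]), (200, [0, 1]), (200, [0])]:
    m = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    m.fit(X_tr[:, cols], y_tr)
    mse = float(np.mean((m.predict(X_te[:, cols]) - y_te) ** 2))
    errors.append(mse)

print("test MSEs:", np.round(errors, 3),
      "| spread:", round(max(errors) - min(errors), 3))
```

Here the variant deprived of the second input performs visibly worse, and the error spread across variants quantifies how sensitive the prediction is to modelling choices.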
3. Results
3.1. Exploratory Data Analysis and Detection of Outliers
3.2. Grouping of Similar Features: SOM
3.3. Feature Selection: KNN
- API: The CV error minimum corresponds to K = 5 neighbours with FVF and temperature as the inputs (Figure 15a). An increasing CV error with increasing neighbours indicates a level of predictability based on the clustering of these data. The other combinations—FVF and depth, and depth and temperature—show little relevance for predicting API, with no apparent minimum on the error curve (Figure 15a). This suggests that depth has little relevance to API values in this reservoir dataset.
- Depth: API and FVF appear to be the most relevant combination of inputs to predict depth. API and temperature show a very weak (but still detectable) minimum (Figure 15b).
- FVF: All pairs of attributes appear to be relevant for predicting FVF. K = 2 neighbours gives the minimum error with API and temperature as inputs, depth and temperature have an optimal K = 4, and API and depth give an optimal K = 8 (Figure 15c).
- Temperature: Clustering of temperature data is seen with respect to API and FVF, with little to no relationship identified with any other combination (Figure 15d).
3.4. Prediction and Confidence
4. Discussion
4.1. Reservoir and Fluid Property Dependencies
- Temperature-Depth Relationship: It is well documented that in the subsurface, temperature tends to increase with depth. The geothermal gradient is a measure of the change in temperature with depth and is closely related to the thermal conductivity of the rocks in the subsurface. Temperatures increase with depth due primarily to the decay of radioactive elements, such as potassium, thorium and uranium, within minerals. Gradients can increase locally where magma emplacement occurs, or where crustal lithology is rich in radioactive elements (as is the case for many granites). Locally, the geothermal gradient may also be lower than expected where highly thermally conductive facies such as salt are present in the subsurface. Additionally, recorded temperatures may deviate from the geothermal gradient due to local drilling effects (measurement artefacts) or human error. In the case of Basins X and Y, both show a linear relationship between depth and temperature, indicating that there are no unusual regions of heat flow (Figure 7, Figure 8 and Figure 18). This may not be the case in all basins, and care should be taken in volcanically active regions and salt basins. We note that Basin Y displays six values that are anomalously high but appear to increase with depth on a separate gradient (Figure 8). Further inspection of these values suggests that they represent measurements in degrees Fahrenheit (while the remaining dataset is in degrees Celsius). Such anomalies demonstrate the importance of step 1 of our workflow, exploratory data analysis, and its role in identifying incorrect or inconsistent units or data entry points.
- Pressure-Depth Relationship: No pore-fluid pressure data are available in the Basin X dataset; however, Basin Y data show a clear linear increase in pressure with depth. A linear increase in both hydrostatic and lithostatic pressure is expected with depth where fluids are in communication. Deviations from normal hydrostatic pressure can occur when fluids cannot escape and become over-pressured. Such outliers may limit the ability to accurately make predictions (regardless of their validity).
- API-Depth Relationship: Fluid gravity (API) is a measure of how heavy or light a hydrocarbon liquid is, and tends to increase with temperature and, therefore, depth. Fluids with API values greater than 10 are lighter than water. Therefore, any values in the dataset lower than 10 can be deemed erroneous or measured in the water leg, and provide no information on the hydrocarbon fluids within the reservoir. The higher the API gravity value, the lighter the hydrocarbon fluid. Therefore, fluids with higher API values (gas, gas condensate) will accumulate in hydrocarbon traps above ‘heavier’ fluids (oils). Where traps are filled to spill, the later addition of gas may cause the displacement and remigration of oil to shallower reservoirs. Where underfilled, oil and gas columns can both be present in the trap. Basin X is a predominantly oil-prone basin and shows a linear trend of increasing API with depth. This is consistent with the assumption that source rock maturation increases with depth. The large degree of noise in the data could be attributed to different migration distances from the source rock. Abnormally low API values at shallower depths have been reported in other basins (e.g., in the North Sea [39]) as a result of biodegradation in lower-temperature reservoirs. However, no significant modification to the API values due to biodegradation is seen in the Basin X dataset (Figure 18). Basin X is oil-dominated; in a mixed oil-gas basin, however, API values are expected to vary more widely with depth. This is what is observed in the Basin Y API-depth cross-plot (Figure 8), which shows a much greater degree of scatter. This suggests that API would be more challenging to predict in a mixed oil-gas basin.
- Viscosity-Depth Relationship: Viscosity is a measure of the oil’s resistance to flow; higher values indicate more resistance. Viscosity is known to be closely associated with API, temperature and pressure [16]. When used with compositional data in the form of Watson’s characterisation factor [40], a clear inverse relationship between chemical composition, API and viscosity of oil can be seen. Viscosity is known to decrease with increasing temperature and pressure (up to the bubble point). In Basin X, the relevance of viscosity is confirmed by the KNN cross-validation curve against the other reservoir fluid attributes (Figure 16), and a good level of predictability is expected. The viscosity-depth plot shows a decreasing trend with depth (Figure 18). Note that viscosity plots on a log scale, which is an additional challenge when completing outlier detection and hierarchical clustering. A number of high cP values occur below ca. 250 m that can likely be attributed to heavy oils resulting from biodegradation.
- FVF-Depth Relationship: The formation volume factor (FVF) is a key input to the HCIIP equation (Figure 2). For an oil accumulation, the oil formation volume factor corrects for the change in volume of oil at stock tank conditions compared to that under the elevated pressure and temperature conditions in the reservoir. FVF is also closely linked to the level of gas saturation (GOR) [7]. Basin X shows FVF values increasing with depth on a nonlinear trend, with increasing scatter with depth (Figure 18). This increasing variability of FVF with depth could be due to increasing solution gas, as expected given the increase in API seen with depth.
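Figure 2 is not reproduced here, but the standard forms of the two quantities this section leans on — API gravity (which makes the 10 °API = water cut-off explicit) and a deterministic STOIIP/HCIIP estimate in which FVF enters as a divisor — can be written as a short sketch. The function names and the example input values are illustrative.

```python
# API gravity from specific gravity (fresh water is 10 deg API by definition):
#   API = 141.5 / SG - 131.5
def api_gravity(sg):
    """API gravity of a liquid with specific gravity sg (water = 1.0)."""
    return 141.5 / sg - 131.5

# Deterministic stock-tank oil initially in place (a standard HCIIP form):
#   STOIIP = GRV * NTG * porosity * (1 - Sw) / Bo
def stoiip(grv_m3, ntg, phi, sw, bo):
    """Stock-tank oil initially in place (m3); bo is the oil FVF."""
    return grv_m3 * ntg * phi * (1 - sw) / bo

print(round(api_gravity(1.0), 1))  # 10.0 -> water plots at the API cut-off
print(f"{stoiip(1e9, 0.8, 0.25, 0.3, 1.2):.3e} m3 stock-tank oil")
```

The division by `bo` makes plain why an error in predicted FVF propagates directly into the in-place volume estimate.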
4.2. Machine Learning as a Predictive Tool for Reservoir Characterisation
4.3. Future Applied Usage of Big Data in Subsurface Science
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Vassiliou, M.S. Historical Dictionary of the Petroleum Industry; Rowman & Littlefield: Lanham, MD, USA, 2018.
- Anand, P. Big Data is a big deal. J. Pet. Technol. 2013, 65, 18–21.
- Holdaway, K.R. Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models; Wiley: Hoboken, NJ, USA, 2014.
- Perrons, R.K.; Jensen, J.W. Data as an asset: What the oil and gas sector can learn from other industries about “Big Data”. Energy Policy 2015, 81, 117–121.
- Mayer-Schönberger, V.; Cukier, K. Big Data: A Revolution that Will Transform How We Live, Work and Think; Houghton Mifflin Harcourt: Boston, MA, USA, 2013.
- Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. Advances in Knowledge Discovery and Data Mining; American Association for Artificial Intelligence: Palo Alto, CA, USA, 1996.
- Standing, M.B. A pressure-volume-temperature correlation for mixtures of California oils and gases. Drill. Prod. Pract. 1947, 275–287.
- Vazquez, M.; Beggs, H.D. Correlations for fluid physical property prediction. In Proceedings of the SPE Annual Fall Technical Conference and Exhibition, Denver, CO, USA, 9–12 October 1977; p. SPE-6719-M.
- Elsharkawy, A.M.; Alikhan, A.A. Correlations for predicting solution gas/oil ratio, oil formation volume factor, and undersaturated oil compressibility. J. Pet. Sci. Eng. 1997, 17, 291–302.
- Glasø, O. Generalized pressure-volume-temperature correlations. J. Pet. Technol. 1980, 32, 785–795.
- Al-Shammasi, A.A. A review of bubblepoint pressure and oil formation volume factor correlations. SPE Reserv. Eval. Eng. 2001, 4, 146–160.
- Tohidi-Hosseini, S.M.; Hajirezaie, S.; Hashemi-Doulatabadi, M.; Hemmati-Sarapardeh, A.; Mohammadi, A.H. Toward prediction of petroleum reservoir fluids properties: A rigorous model for estimation of solution gas-oil ratio. J. Nat. Gas Sci. Eng. 2016, 29, 506–516.
- Gharbi, R.B.; Elsharkawy, A.M. Neural network model for estimating the PVT properties of Middle East crude oils. SPE Reserv. Eval. Eng. 1999, 2, 255–265.
- Gharbi, R.B.; Elsharkawy, A.M.; Karkoub, M. Universal neural-network-based model for estimating the PVT properties of crude oil systems. Energy Fuels 1999, 13, 454–458.
- Ramirez, A.M.; Valle, G.A.; Romero, F.; Jaimes, M. Prediction of PVT properties in crude oil using machine learning techniques MLT. In Proceedings of the SPE Latin America and Caribbean Petroleum Engineering Conference, Buenos Aires, Argentina, 17–19 May 2017.
- Oloso, M.A.; Khoukhi, A.; Abdulraheem, A.; Elshafei, M. Prediction of crude oil viscosity and gas/oil ratio curves using recent advances to neural networks. In Proceedings of the SPE/EAGE Reservoir Characterization & Simulation Conference, Abu Dhabi, UAE, 19–21 October 2009; European Association of Geoscientists & Engineers: Utrecht, The Netherlands, 2009; p. cp-170.
- Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
- Ali, J.K. Neural Networks: A New Tool for the Petroleum Industry? In Proceedings of the European Petroleum Computer Conference, Aberdeen, UK, 15–17 March 1994.
- Saemi, M.; Ahmadi, M. Integration of genetic algorithm and a coactive neuro-fuzzy inference system for permeability prediction from well logs data. Transp. Porous Media 2008, 71, 273–288.
- Karimpouli, S.; Fathianpour, N.; Roohi, J. A new approach to improve neural networks’ algorithm in permeability prediction of petroleum reservoirs using supervised committee machine neural network (SCMNN). J. Pet. Sci. Eng. 2010, 73, 227–232.
- Tahmasebi, P.; Hezarkhani, A. A fast and independent architecture of artificial neural network for permeability prediction. J. Pet. Sci. Eng. 2012, 86, 118–126.
- Bhatt, A. Reservoir Properties from Well Logs Using Neural Networks. Ph.D. Thesis, Norwegian University of Science and Technology, Trondheim, Norway, 2002.
- Tewari, S.; Dwivedi, U.D.; Shiblee, M. Assessment of Big Data analytics based ensemble estimator module for the real-time prediction of reservoir recovery factor. In Proceedings of the SPE Middle East Oil and Gas Show and Conference, Manama, Bahrain, 18–21 March 2019.
- Tahmasebi, P.; Hezarkhani, A. Application of adaptive neuro-fuzzy inference system for grade estimation; case study, Sarcheshmeh porphyry copper deposit, Kerman, Iran. Aust. J. Basic Appl. Sci. 2010, 4, 408–420.
- Koren, Y.; Carmel, L. Visualization of labeled data using linear transformations. In Proceedings of the IEEE Symposium on Information Visualization 2003 (IEEE Cat. No. 03TH8714), Seattle, WA, USA, 19–21 October 2003; pp. 121–128.
- Orange Data Mining. Linear Projection. 2015. Available online: https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/visualize/linearprojection.html (accessed on 6 August 2021).
- Demsar, J.; Curk, T.; Erjavec, A.; Gorup, C.; Hocevar, T.; Milutinovic, M.; Mozina, M.; Polajnar, M.; Toplak, M.; Staric, A.; et al. Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res. 2013, 14, 2349–2353.
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Longadge, R.; Dongre, S. Class imbalance problem in data mining review. arXiv 2013, arXiv:1305.1707.
- Kanevski, M.; Pozdnoukhov, A.; Timonin, V. Machine Learning for Spatial Environmental Data: Theory, Applications and Software; EPFL Press: New York, NY, USA, 2009.
- Liu, Y.; Li, Z.; Xiong, H.; Gao, X.; Wu, J. Understanding of internal clustering validation measures. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, Australia, 13–17 December 2010; pp. 911–916.
- Rokach, L.; Maimon, O. Clustering Methods. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin/Heidelberg, Germany, 2005; pp. 321–352. ISBN 978-0-387-25465-4.
- Kohonen, T. The self-organizing map. Neurocomputing 1998, 21, 1–6.
- Fix, E.; Hodges, J.L. Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties; Technical Report 4; USAF School of Aviation Medicine: Randolph Field, TX, USA, 1951.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Halotel, J.; Demyanov, V.; Gardiner, A. Value of geologically derived features in machine learning facies classification. Math. Geosci. 2020, 52, 5–29.
- Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; IEEE: Piscataway, NJ, USA, 1995; Volume 1, pp. 278–282.
- Rodriguez-Galiano, V.F.; Sanchez-Castillo, M.; Dash, J.; Atkinson, P.M.; Ojeda-Zujar, J. Modelling interannual variation in the spring and autumn land surface phenology of the European forest. Biogeosciences 2016, 13, 3305–3317.
- Larter, S.; Wilhelms, A.; Head, I.; Koopmans, M.; Aplin, A.; Di Primio, R.; Zwach, C.; Erdmann, M.; Telnaes, N. The controls on the composition of biodegraded oils in the deep subsurface—Part 1: Biodegradation rates in petroleum reservoirs. Org. Geochem. 2003, 34, 601–613.
- Watson, K.M.; Nelson, E.F.; Murphy, G.B. Characterization of petroleum fractions. Ind. Eng. Chem. 1935, 27, 1460–1464.
- Bjørlykke, K. Petroleum Geoscience; Springer: Berlin/Heidelberg, Germany, 2015.
- Feblowitz, J. Analytics in oil and gas: The big deal about big data. In Proceedings of the SPE Digital Energy Conference, The Woodlands, TX, USA, 5–7 March 2013; p. SPE-163717-MS.
| Input Clusters | MSE | R² |
|---|---|---|
| Clusters C1–C10 | 0.004 | 0.858 |
| C10, C7, C5 | 0.004 | 0.861 |
| C10 only | 0.002 | 0.828 |
| Variables | Input | MSE | R² |
|---|---|---|---|
| 4 | API, depth, temp, viscosity | 0.004 | 0.861 |
| 3 | API, depth, temp | 0.006 | 0.776 |
| | API, depth, viscosity | 0.005 | 0.811 |
| | API, temp, viscosity | 0.004 | 0.848 |
| | Depth, temp, viscosity | 0.004 | 0.853 |
| 2 | API, depth | 0.010 | 0.661 |
| | API, temp | 0.007 | 0.756 |
| | API, viscosity | 0.006 | 0.793 |
| | Depth, temp | 0.010 | 0.647 |
| | Depth, viscosity | 0.006 | 0.805 |
| | Temp, viscosity | 0.004 | 0.845 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Brackenridge, R.E.; Demyanov, V.; Vashutin, O.; Nigmatullin, R. Improving Subsurface Characterisation with ‘Big Data’ Mining and Machine Learning. Energies 2022, 15, 1070. https://doi.org/10.3390/en15031070