1. Introduction
Multiphase flow patterns have long been a focal point of fluid flow research [1,2,3,4,5], with oil–water two-phase flow forming the foundation of this field. As oilfield development advances into its later stages, the influence of water on flow dynamics becomes increasingly significant, necessitating accurate prediction of oil–water two-phase flow behaviour during extraction. However, the flow states of oil–water two-phase flows are more challenging to predict in horizontal and inclined wells than in vertical wells. This difficulty is exacerbated by the constantly changing development environment: the drilling angle varies with the actual situation, and rapidly changing downhole conditions cause large variations in fluid flow velocity, which in turn strongly affect the flow patterns.
Despite ongoing research, consensus on flow pattern definitions remains elusive due to varying influencing factors and research emphases. Current research relies primarily on visual observation and flow pattern maps, both of which depend on the observer’s judgement, so identification remains qualitative rather than quantitative. Accurate prediction of oil–water two-phase flow patterns is nonetheless crucial for process design, operational safety, and economic efficiency. It can also promote technological innovation and development, enhance production efficiency, reduce risks, and optimise resource utilisation.
In recent years, many scholars have adopted numerical simulation to study fluid flow patterns, modelling the effects of different factors on them. However, simulations often overlook the impact of logging instruments on the fluid in actual wells, leading to discrepancies between simulated and actual downhole flow patterns. Physical experiments, on the other hand, need to be consistent with real wells, which is inconvenient: factors such as temperature and pressure are challenging to replicate fully, and the experiments provide limited data points, are labour intensive, and are error prone.
The advent and continuous development of deep learning and machine learning have made data processing and analysis more efficient and accurate. These technologies have improved productivity and enhanced traditional methods. In many fields, machine learning algorithms are widely used for data prediction. For example, in the financial sector, machine learning has been applied to predict goodwill impairment [6], helping investors identify goodwill impairment risks and mitigate its market impact. Researchers like Zhang Yanan [7] and Zhang Xiangrong [8] have used optimisation algorithms and multi-core learning methods to improve the accuracy of financial risk predictions. Recent advances in the energy sector include the application of machine learning to optimise biodiesel production, as demonstrated by Sukpancharoen et al. (2023) [9], who explored the potential of transesterification catalysts through machine-learning approaches. In addition, Şahin (2023) conducted a comparative study of machine learning algorithms to predict the performance and emissions of diesel/biodiesel/isoamyl alcohol blends [10]. These studies highlight the growing importance of machine learning in improving the efficiency and sustainability of biofuel production and use.
Extreme gradient boosting (XGBoost) is an emerging machine learning algorithm known for its strong modelling capability and fast computation, in which it surpasses many other algorithms. XGBoost has been widely applied in the field of petroleum geology. For instance, Tang Qinxin et al. [11] employed the XGBoost algorithm to build a model for predicting the productivity of fractured horizontal wells, while Zhao Ranlei et al. [12] used XGBoost for lithology identification in volcanic rocks. However, the application of this algorithm to predicting downhole fluid flow patterns remains underdeveloped.
This study aims to leverage the XGBoost algorithm to predict downhole fluid flow patterns and evaluate its performance. Given that the effectiveness of the algorithm is influenced by hyperparameters [13], we utilised the Bayesian optimisation (BO) algorithm to optimise the hyperparameters of XGBoost. The Bayesian optimisation algorithm, known for its global parameter search capability and high efficiency, has been successfully applied across various domains.
For example, in [14], the Bayesian optimisation algorithm was used for precise detection and localisation of targets in remote sensing images, significantly enhancing the accuracy of detection boundaries. In [15], the Bayesian optimisation algorithm was applied to optimise the hyperparameters of XGBoost, resulting in the optimal parameter combination for constructing a grain temperature prediction model. The findings indicated that this model had low prediction error and high accuracy, providing a valuable decision-making tool for temperature control management in granaries. Additionally, [16] proposed a coal spontaneous combustion grading warning model based on Bayesian optimised XGBoost (BO-XGBoost), demonstrating superior stability and classification accuracy of the BO-XGBoost model.
In this study, a multiphase flow simulation experimental apparatus was used to conduct oil–water two-phase flow simulation experiments, collecting 64 sets of flow pattern data. Subsequently, the Bayesian optimisation algorithm was employed to optimise the hyperparameters of XGBoost, thereby aligning the prediction results more closely with actual conditions. This approach provides an effective method for predicting downhole fluid flow patterns, offering a scientific basis for practical engineering applications and fostering the integration of traditional industrial technology with cutting-edge innovations.
The novelty of this work lies in the integration of Bayesian optimisation with the XGBoost algorithm to enhance the prediction accuracy of oil–water two-phase flow patterns. Unlike traditional methods, this approach optimises hyperparameters more efficiently, improving model performance. By systematically combining experimental data with advanced machine learning techniques, this study introduces a robust methodology for accurately predicting complex subsurface fluid dynamics.
4. Experiment
4.1. Design Experiment
A multiphase flow (oil–water) simulator was utilised in the multiphase flow laboratory to conduct the experimental work. The experiments were conducted under ambient temperature (20 °C) and atmospheric pressure. Industrial white oil and tap water were utilised instead of actual downhole oil and water.
Table 1 details the density, viscosity, and surface tension of the oil and water used. In the experiments, a well inclination of 90° corresponded to the horizontal position. Raw data and photographs were recorded during the experiments.
Figure 3 shows the oil–water two-phase flow patterns and a schematic of high-speed camera recordings, including smooth stratified flow, interface mixed stratified flow, water-in-oil emulsion, dispersed water-in-oil and oil-in-water, and dispersed oil-in-water and water, each representing different flow states.
The schematic diagram of the experimental setup is shown in Figure 4.
In this experiment, a total of 64 sets of valid and accurate data were obtained using the simulation experimental apparatus. These data were categorised into five distinct flow patterns: bubble flow, emulsion flow, frothy flow, wavy flow, and stratified flow. For ease of subsequent processing, these patterns were assigned specific codes ranging from 0 to 4, as detailed in Table 2, which also shows the actual images corresponding to each flow type.
Both the experimental data and actual field data were used. The experimental data, which included variables such as well inclination angle, fluid flow rate, water cut, temperature, and pipe diameter, served as the dataset from which the XGBoost and BO-XGBoost algorithms learned. The actual data were then fed into the trained models, and the predicted flow patterns were compared with the actual flow patterns to test the feasibility of the algorithms. The effectiveness of the algorithms was assessed by analysing accuracy under different flow rates, inclinations, and water cuts.
The hyperparameters of the XGBoost model, optimised by the Bayesian optimisation algorithm, are shown in Table 3.
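As a hedged illustration of what such a hyperparameter search optimises, the tuning problem can be written in the following generic form. The symbols here are assumptions introduced for illustration only: θ denotes a hyperparameter vector, Θ the search space, A_CV(θ) a validation score such as cross-validated accuracy, A⁺ the best score observed so far, and μ(θ) and σ(θ) the posterior mean and standard deviation of the surrogate model. Expected improvement is shown as one common acquisition function; the source does not state which acquisition function or validation score was used.

```latex
\theta^{*} = \arg\max_{\theta \in \Theta} A_{\mathrm{CV}}(\theta),
\qquad
\mathrm{EI}(\theta) = \bigl(\mu(\theta) - A^{+}\bigr)\,\Phi(z) + \sigma(\theta)\,\varphi(z),
\qquad
z = \frac{\mu(\theta) - A^{+}}{\sigma(\theta)}, \quad \sigma(\theta) > 0
```

Here Φ and φ are the standard normal cumulative distribution and density functions. At each iteration the next hyperparameter combination to evaluate is the one that maximises the acquisition function, which balances exploring uncertain regions (large σ) against exploiting promising ones (large μ).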
4.2. Prediction Results Analysis
After training, 16 sets of known data with varying well inclinations, water cuts, and flow rates were randomly selected for testing. Following data preprocessing, the trained models predicted flow patterns, and these predictions were compared with actual data to evaluate accuracy. The performance of the models was illustrated using confusion matrices and scatter plots.
Figure 5a shows the unstandardised confusion matrix of the XGBoost algorithm’s predictions on the training set, while Figure 5b presents the standardised confusion matrix on the test set. In the confusion matrix, rows represent observed flow pattern categories, and columns represent predicted categories. The numbers on the axes correspond to the five flow patterns listed in Table 2. Correct predictions are indicated by blue squares on the diagonal, whereas off-diagonal squares give the number of incorrectly predicted samples. In Figure 5b, the numbers on the diagonal represent the probability of correctly predicting the corresponding flow pattern, whereas the off-diagonal numbers represent the probability of predicting an incorrect flow pattern.
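The relationship between the unstandardised (count) and standardised (row-normalised) confusion matrices can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the label lists are hypothetical, and the labels are not the paper's data.

```python
def confusion_matrix(observed, predicted, n_classes):
    """Count matrix: rows are observed classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for obs, pred in zip(observed, predicted):
        matrix[obs][pred] += 1
    return matrix


def normalise_rows(matrix):
    """Divide each row by its total so entries become per-class proportions."""
    normalised = []
    for row in matrix:
        total = sum(row)
        normalised.append([count / total if total else 0.0 for count in row])
    return normalised


# Hypothetical flow pattern codes (0 to 4); NOT the experimental data.
observed = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4]
predicted = [0, 0, 1, 1, 1, 1, 2, 2, 0, 3, 3, 4]

counts = confusion_matrix(observed, predicted, n_classes=5)
proportions = normalise_rows(counts)

for row in counts:
    print(row)
print(proportions[0])  # share of class 0 samples assigned to each predicted class
```

Each row of the standardised matrix sums to 1, so its diagonal entry is the fraction of that observed class predicted correctly.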
Figure 5, Figure 6 and Figure 7 depict the confusion matrices and scatter plots for the XGBoost model’s predictions on the training and test sets. From Figure 5 and Figure 7, it can be observed that the XGBoost model made five incorrect predictions on the training set. Figure 6 and Figure 7 illustrate the test set results, where two bubbly flows were predicted as dispersed flows, one frothy flow as a bubbly flow, and one dispersed flow as a frothy flow. The overall accuracy reached 75% (12 of 16 samples). The XGBoost algorithm demonstrated some level of accuracy in flow pattern prediction, but there is significant room for improvement.
Figure 8, Figure 9 and Figure 10 show the results for the BO-XGBoost model: Figure 8 and Figure 9 show its confusion matrices on the training and test sets, respectively, and Figure 10 shows the scatter plots. From Figure 8 and Figure 10, it is evident that the BO-XGBoost model achieved 100% accuracy on the training set, demonstrating significantly better performance than the XGBoost model. Figure 9 and Figure 10 show only one misprediction in the test set, where one frothy flow was mispredicted as a bubbly flow, giving the BO-XGBoost model 93.75% accuracy. The results highlight the BO-XGBoost algorithm’s marked improvement in learning and predicting flow patterns.
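The two test-set accuracies follow directly from the misprediction counts reported for the 16 test samples. The short check below reproduces that arithmetic; the function name is a hypothetical helper, not part of the original study.

```python
def accuracy(n_samples, n_errors):
    """Fraction of samples predicted correctly."""
    return (n_samples - n_errors) / n_samples


n_test = 16              # test samples reported in the text
xgb_errors = 2 + 1 + 1   # two bubbly, one frothy, one dispersed flow mispredicted
bo_xgb_errors = 1        # one frothy flow mispredicted as bubbly

print(f"XGBoost:    {accuracy(n_test, xgb_errors):.2%}")
print(f"BO-XGBoost: {accuracy(n_test, bo_xgb_errors):.2%}")
```

With only 16 test samples, each misprediction shifts the accuracy by 6.25 percentage points, so the gap between the two models corresponds to three samples.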
Table 4 compares the accuracy and generalisation performance of the BO-XGBoost model with the traditional XGBoost model on the test dataset. The BO-XGBoost model demonstrated a significant improvement, with 93.75% accuracy compared to the XGBoost model’s 75%. Precision increased from 0.788 to 0.967, recall from 0.791 to 0.971, and the F1 score from 0.784 to 0.966, further validating the BO-XGBoost model’s superiority. These results indicate that Bayesian optimisation significantly enhanced the XGBoost model’s predictive accuracy and classification performance.
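For readers who wish to compute such summary metrics themselves, the sketch below derives precision, recall, and F1 from a count confusion matrix. The source does not state the averaging scheme; macro-averaging (an unweighted mean over classes) is assumed here, and the three-class matrix is hypothetical rather than taken from Table 4.

```python
def macro_metrics(matrix):
    """Macro-averaged precision, recall and F1 from a count confusion matrix
    whose rows are observed classes and columns are predicted classes."""
    n = len(matrix)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        observed_k = sum(matrix[k])                        # row total
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / observed_k if observed_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Hypothetical three-class matrix; NOT the values behind Table 4.
example = [
    [3, 1, 0],
    [0, 4, 0],
    [1, 0, 3],
]
precision, recall, f1 = macro_metrics(example)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because macro-averaging weights every class equally, a single error in a class with few samples moves these metrics more than it moves overall accuracy.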
To comprehensively assess the classification performance of the models, we also utilised receiver operating characteristic (ROC) curves. In multi-class classification problems, ROC curves and the area under the curve (AUC) values provide an overall view of the model’s classification capability. The ROC curves in Figure 11 and Figure 12 compare the traditional XGBoost model and the BO-XGBoost model across the different classes of oil–water two-phase flow patterns.
Figure 11 displays the ROC curve for the XGBoost model, with AUC values as follows: Class 0 (0.964), Class 1 (0.857), Class 2 (0.873), Class 3 (1.000), and Class 4 (1.000).
Figure 12 shows the ROC curve for the BO-XGBoost model, with AUC values as follows: Class 0 (0.982), Class 1 (0.929), Class 2 (0.921), Class 3 (1.000), and Class 4 (1.000). A higher AUC value indicates better classification accuracy.
The BO-XGBoost model exhibited higher AUC values for Class 0, Class 1, and Class 2 compared to the traditional XGBoost model. Both models achieved perfect AUC values of 1.000 for Class 3 and Class 4, likely due to small sample sizes.
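A per-class AUC of this kind can be computed in a one-vs-rest fashion, treating one flow pattern as positive and all others as negative. The sketch below uses the pairwise-ranking definition of AUC; the function name, labels, and scores are hypothetical, and the source does not specify how its multi-class ROC curves were constructed.

```python
def auc_one_vs_rest(scores, labels, positive_class):
    """AUC for one class against the rest: the probability that a positive
    sample receives a higher score than a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == positive_class]
    neg = [s for s, y in zip(scores, labels) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted class probabilities; NOT the paper's data.
labels = [0, 0, 0, 1, 1, 2, 2, 2]
scores_class0 = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1, 0.3, 0.05]
scores_class1 = [0.05, 0.1, 0.3, 0.6, 0.7, 0.2, 0.1, 0.15]

auc0 = auc_one_vs_rest(scores_class0, labels, positive_class=0)
auc1 = auc_one_vs_rest(scores_class1, labels, positive_class=1)
print(auc0, auc1)
```

In this toy case the class with only two positive samples reaches an AUC of exactly 1.0, which illustrates why perfect values for sparsely represented classes should be read with caution.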
In summary, the comparative analysis of the ROC curves and their respective AUC values highlights the superior predictive capability of the BO-XGBoost model following Bayesian optimisation. The BO-XGBoost model consistently achieved higher AUC values across most classes, indicating a more robust and accurate classification of oil–water two-phase flow patterns.
Table 5 shows that XGBoost accuracy decreased notably for inclinations of 0°, 60°, and 85°. At 85°, with a flow rate of 300 m³/d and a water cut of 80%, both XGBoost and BO-XGBoost failed to predict accurately. However, at 90°, both models demonstrated accurate predictions. XGBoost achieved 75% overall accuracy, while BO-XGBoost achieved 93.75%, demonstrating the feasibility and precision of both algorithms.
Figure 13 shows the prediction accuracy of each flow type for the two models.
BO-XGBoost’s superior accuracy is due to its advanced hyperparameter optimisation strategy. Unlike traditional XGBoost models, BO-XGBoost uses Bayesian optimisation to find optimal hyperparameters, adapting better to data characteristics and improving accuracy. Despite slightly lower accuracy, XGBoost is advantageous in training speed and ease of implementation, recognised for its stable and efficient performance.
Considering all factors, BO-XGBoost has demonstrated higher prediction accuracy in this study, providing a robust choice for applications requiring high-precision predictions. However, we acknowledge that the choice of model and tuning strategy should be based on specific application needs and resource constraints.
Future research will focus on exploring and studying the performance of BO-XGBoost and XGBoost across various datasets and problem environments, with the goal of providing deeper insights into the selection of machine learning models.
4.3. Model Interpretability and Feature Analysis
While the BO-XGBoost model achieves high prediction accuracy through training, it remains largely a black-box model in terms of interpretability. To address this issue, we employed Shapley additive explanations (SHAP) [21] to interpret the experimental results of the model, analysing the contribution of each feature to the prediction outcomes.
Figure 14 illustrates the key features influencing the oil–water two-phase flow patterns.
From Figure 14, it is evident that the most significant feature affecting the flow pattern was the well inclination angle, followed by the daily production flow rates, with the water cut having the least impact.
In addition to the feature importance plot, we generated detailed feature explanation plots to obtain richer information.
Figure 15 presents the global interpretation of features such as well inclination angles (Angle), flow rates (Flow), and water cut (Con). This comprehensive visualisation explains the contribution of these features to the prediction target, integrating feature values and multi-feature presentations.
In these plots, the vertical axis represents the feature names, and the horizontal axis represents the SHAP values. Each point corresponds to the SHAP value of a feature for a specific sample. Positive SHAP values indicate a positive impact on the prediction, while negative SHAP values indicate a negative impact. The colour of the points represents the feature values, with red points indicating higher values and blue points indicating lower values.
From Figure 15, it can be observed that the well inclination angle had the most substantial impact on the model output, with SHAP values ranging from −1 to 1.5. This indicates that the well inclination angle played a decisive role in predicting the oil–water two-phase flow patterns. The SHAP values for flow rates and water cut varied within narrower ranges but still significantly influenced the model output. The SHAP values for flow rates ranged from 0.0 to 1.0, suggesting a positive impact on the prediction outcome. In contrast, the SHAP values for water cut ranged from −0.5 to 0.5, indicating that in some cases, water cut may have a negative impact on the prediction results.
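To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature model. Everything in it is an assumption for illustration: the scoring function, its coefficients, the sample, and the baseline are hypothetical and are not the trained BO-XGBoost model, and tree-based explainers in practice use faster algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(x)
    n = len(names)
    values = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = {f: (x[f] if f in coalition else baseline[f]) for f in names}
                with_feature = dict(without)
                with_feature[name] = x[name]
                total += weight * (model(with_feature) - model(without))
        values[name] = total
    return values


def toy_model(sample):
    # Hypothetical score with an Angle-Con interaction; NOT the paper's model.
    return (0.02 * sample["Angle"] + 0.002 * sample["Flow"]
            + 0.005 * sample["Con"] + 0.0001 * sample["Angle"] * sample["Con"])


x = {"Angle": 60, "Flow": 300, "Con": 80}         # hypothetical sample
baseline = {"Angle": 0, "Flow": 100, "Con": 20}   # hypothetical reference point

phi = shapley_values(toy_model, x, baseline)
for feature, value in phi.items():
    print(feature, round(value, 4))
print(round(sum(phi.values()), 4), round(toy_model(x) - toy_model(baseline), 4))
```

The final line illustrates the additivity property that underlies plots such as Figure 15: the per-feature contributions sum to the difference between the model output for the sample and the output at the baseline.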
4.4. Limitations of the Study
While our research demonstrates significant improvements in the predictive accuracy of oil–water two-phase flow regimes using the Bayesian-optimised XGBoost algorithm, it is important to acknowledge certain limitations to provide a comprehensive understanding of our study.
Firstly, the experimental data used in this study were obtained under controlled laboratory conditions, which may only partially replicate the complexities and variabilities of real-world reservoir environments. Factors such as temperature variations, reservoir heterogeneities, and the presence of impurities in the fluids were not accounted for in our simulations, potentially affecting the generalisability of our findings.
Secondly, the dataset was limited by the range of water cuts, well inclination angles, and flow rates explored, as well as by the sample size. Although we endeavoured to cover a broad spectrum of conditions, certain flow patterns occurring under extreme or less common operational scenarios might not have been adequately represented. These limitations indicate a need for further studies encompassing a wider range of parameters to enhance the robustness of the predictive model.
Lastly, our study primarily focused on the application of the BO-XGBoost algorithm. Comparative studies involving other advanced machine learning algorithms and optimisation techniques could provide further insights into the relative advantages and potential limitations of our approach. Additionally, incorporating real-time data from field operations and conducting validation studies in actual reservoir conditions would be critical steps toward translating our laboratory-based findings into practical, field-applicable solutions.