1. Introduction
Against the backdrop of growing global energy demand, the development and utilization of offshore oil and gas fields have become a crucial means of meeting this demand [1,2]. However, offshore oil field development faces numerous challenges, the most significant of which is the high cost of offshore drilling combined with a relatively low number of wells. Under these conditions, seismic attribute-based methods for predicting oil- and gas-bearing zones within strata have become the most widely used and effective geophysical approaches. The first type of method relies on a single seismic attribute, such as an amplitude attribute or an instantaneous frequency attribute [3,4], and has proven effective in many cases. However, because the seismic attribute response is a cumulative result of multiple factors within the strata and is influenced by geological complexity, hydrocarbon occurrence, and seismic data acquisition and processing, single-attribute predictions often suffer from significant ambiguity. The second type of method involves multi-attribute linear integration, such as principal component analysis (PCA) and the weighted sum method (WSM) [5,6]. These methods are not only algorithmically simple but also integrate seismic attribute information that reflects the hydrocarbon potential of the strata from multiple perspectives, thereby enhancing the prediction accuracy for oil- and gas-bearing zones. However, because the relationship between hydrocarbon potential and seismic attributes is often complex and nonlinear, these linear methods are limited in their ability to capture the nonlinear response between hydrocarbon content and seismic attributes, which constrains their applicability.
With advances in computer hardware and software and the rapid development of machine learning, machine learning techniques have been successfully applied to the prediction of hydrocarbon-bearing zones. Currently, machine learning methods for predicting hydrocarbon-bearing zones can be broadly categorized into two types: supervised classification and unsupervised classification. The main supervised classification methods include random forest (RF) [7,8], logistic regression (LR) [9,10], and support vector machine (SVM) [11,12]. The major unsupervised classification methods include the expectation–maximization (EM) algorithm [13,14], self-organizing maps (SOM) [15,16], and K-means clustering [17,18]. Supervised and unsupervised classifications are suited to different application scenarios. Supervised classification is suitable for areas with many wells and relies on well-logging interpretation parameters as well as oil testing and trial production results. Unsupervised classification methods are generally applicable to areas with fewer wells [19,20]. Since seismic information is a nonlinear composite response to multiple factors, such as stratigraphic undulations, rock framework, fluid types, seismic acquisition noise, and processing methods, selecting an appropriate unsupervised classification method generally requires comprehensive consideration of the geological characteristics of the target area, the quality of the seismic data, and the well–seismic response relationships.
The WZ6-1 structural area of the Beibu Gulf Basin in the South China Sea has a low exploration level and a limited number of wells, together with well-developed faults, fragmented structures, a complex and variable distribution of oil, gas, and water, and poor seismic data quality. As a result, the distribution of hydrocarbon-bearing zones in this area is poorly understood, and five wells have been unsuccessful. The objectives for further exploration remain unclear, creating an urgent need for an effective oil-bearing zone prediction method that suits the complex geological characteristics and seismic data conditions of this region and can provide technical support for future exploration and development decisions. This paper proposes a method based on the SVD–K-means algorithm for predicting hydrocarbon-bearing zones, which has yielded promising results. First, six types of horizon seismic attributes were selected based on the well–seismic response characteristics of the study area. The seismic attributes were then preprocessed using singular value decomposition (SVD), and the K-means unsupervised nonlinear clustering method was applied to predict hydrocarbon-bearing zones. This approach not only eliminated redundant correlations among the seismic attributes and achieved dimensionality reduction and noise suppression, but also enhanced the nonlinear relationship between the multiple seismic attributes and the hydrocarbon content of the strata. Additionally, it significantly improved the convergence stability of the K-means algorithm, reduced computation, and increased the effectiveness and reliability of the oil-bearing zone prediction results. Based primarily on these prediction results, an exploratory well was drilled, yielding a high-production industrial oil flow. This not only confirmed the reliability of the prediction results but also identified a promising new target area for ongoing hydrocarbon exploration and development in this region.
2. Geologic Background of the Study Area
As shown in Figure 1, the WZ6-1 oil-bearing structure in the Beibu Gulf Basin of the South China Sea is an anticline complicated by faults, divided by two nearly east–west trending faults into three regions: the northern block, central block, and southern block. The southern block is segmented by several near north–south trending radial faults into four fault blocks of varying sizes, named S1, S2, S3, and S4 from west to east. The primary target reservoir layer is the W3IV oil formation of the Oligocene Weizhou Group, consisting of interbedded sandstone and mudstone deposited in a fan delta front environment. The W3IV reservoir has favorable properties, with an average porosity of 26% and permeability of 1120 × 10⁻³ μm². This structure is highly faulted, with uncertain fault-sealing properties, leading to a complex and variable distribution of oil, gas, and water, and limited understanding of the hydrocarbon accumulation patterns. Five wells were drilled in the northern and central blocks, located at higher structural positions, but they were unsuccessful, with no oil-bearing layers encountered. After further evaluation, an exploratory well, WZ6-1S-1, was drilled in the lower structural position in the southern block, specifically in the S2 fault block, which resulted in a successful oil discovery.
After oil was confirmed by drilling in the S2 fault block in the lower structural position of the southern block, the question arose whether the adjacent fault blocks S1, S3, and S4 might also contain oil, given their geological similarity to S2. To assess the oil-bearing potential of these three fault blocks, six horizon seismic attributes closely associated with hydrocarbon presence were selected based on previous studies and the well–seismic response characteristics of this area: arc length, root mean square (RMS) amplitude, dominant frequency, energy half-decay, bandwidth, and instantaneous phase. As shown in Figure 2, when each attribute is used on its own for oil-bearing zone prediction, the spatial distribution of all six seismic attributes is chaotic, with no discernible pattern. Additionally, there is almost no difference in seismic attribute characteristics between the five unsuccessful wells drilled at higher structural positions and the oil-bearing S2 fault block confirmed by drilling in the lower southern block (WZ6-1S-1). The traditional single seismic attribute prediction method is therefore ineffective in this area and cannot determine whether the S1, S3, and S4 fault blocks adjacent to S2 contain oil.
In general, a single seismic attribute often struggles to reveal the hydrocarbon potential of strata that may be implicitly represented within seismic attributes. This is because seismic attributes are a nonlinear composite response to various factors such as strata properties and acquisition/processing effects, rather than an independent response to the hydrocarbon potential of strata. This leads to ambiguity, making it challenging to accurately identify hydrocarbon-bearing zones. Furthermore, a single seismic attribute generally reflects only one physical characteristic of the strata, limiting the information it may contain regarding hydrocarbon potential. Using a linear fusion method for multiple seismic attribute parameters [21,22] allows for a more comprehensive integration of seismic attribute variables related to hydrocarbon potential, reducing ambiguity and thereby improving the reliability of prediction results. However, due to the complex nonlinear response relationship between hydrocarbon potential and seismic attributes, along with the subjectivity in determining the weights of multiple seismic attribute variables, the applicability of linear fusion methods is limited, making it difficult to meet the prediction requirements for hydrocarbon-bearing strata in complex geological conditions. This paper applies the K-means nonlinear clustering method to multiple seismic attribute parameters to achieve nonlinear fusion of seismic attributes. The goal is to uncover the potential nonlinear relationship between seismic attributes and hydrocarbon potential in order to assess the hydrocarbon-bearing potential of the three fault blocks adjacent to S2.
3. K-Means Model Prediction Method
K-means is an unsupervised clustering machine learning algorithm [23,24] that partitions the data into a predefined number of clusters, K. It starts from K randomly selected initial cluster centers and typically assigns each data sample to the nearest cluster center based on Euclidean distance. Suppose the data form a matrix $x \in \mathbb{R}^{N \times M}$, representing $N$ rows (samples) of $M$ types of seismic attribute data, and let $x_k$ denote its $k$-th row. Randomly select $K$ initial cluster centers $c = \{c_1, c_2, \ldots, c_K\}$; the Euclidean distance $d_{c_i}$ between data sample $x_k$ and the $i$-th cluster center is:

$$d_{c_i} = \sqrt{\sum_{j=1}^{M} \left( x_{kj} - c_{ij} \right)^2} \tag{1}$$

In Equation (1), $x_{kj}$ represents the $j$-th seismic attribute in the $k$-th row of the data $x$, and $c_{ij}$ represents the value of the $j$-th dimension of the $i$-th cluster center. Each data sample is assigned to the cluster whose center $c_i$ gives the smallest distance $d_{c_i}$. By calculating the mean of the data samples within each cluster, a new set of cluster centers is formed, calculated as follows:

$$c_i = \frac{1}{n} \sum_{x_k \in C_i} x_k \tag{2}$$

In Equation (2), $C_i$ denotes the set of data samples assigned to the $i$-th cluster, $n$ represents the number of data samples in that cluster, and $c_i$ denotes the newly formed cluster center for the $i$-th cluster. The assignment and update steps are repeated until the within-cluster sum of squared errors (SSE) no longer changes, that is, until it has converged, which yields the final classification result. SSE is calculated as follows:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x_k \in C_i} \left\lVert x_k - c_i \right\rVert^2 \tag{3}$$
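The assignment, center update, and SSE convergence test described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the study: the function name `kmeans`, the tolerance `tol`, the iteration cap `n_iter`, and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np

def kmeans(x, k, n_iter=100, tol=1e-9, seed=0):
    """Plain K-means on an (N, M) array.

    Returns the labels and SSE of the last assignment step, plus the centers.
    """
    rng = np.random.default_rng(seed)
    # Randomly pick K distinct samples as the initial cluster centers.
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    prev_sse = np.inf
    for _ in range(n_iter):
        # Euclidean distance from every sample to every center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Within-cluster sum of squared errors for this assignment.
        sse = float((dist.min(axis=1) ** 2).sum())
        if prev_sse - sse <= tol:  # SSE no longer decreases: stop
            break
        prev_sse = sse
        # Each new center is the mean of the samples assigned to it
        # (assumption: an empty cluster keeps its previous center).
        centers = np.array([
            x[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
    return labels, centers, sse
```

Calling `kmeans(x, 4)` on an attribute matrix with one row per sample returns one label per row; because the initial centers are random, different values of `seed` can give different partitions.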
Before performing K-means clustering, an appropriate value of K, the number of clusters into which the dataset will be divided, is selected based on the needs of the specific research question. Choosing an overly small K value may result in confusion between clusters, leading to the loss of significant data characteristics, while an overly large K value may cause excessive subdivision and overfitting. Typically, metrics such as the within-cluster SSE and the Davies–Bouldin (DB) index are used to evaluate the clustering effectiveness of the K-means algorithm and thereby determine an appropriate K value. The DB index calculation relies on the distance between cluster centers and the dispersion of samples within clusters; a smaller DB index value indicates a better clustering result:

$$\mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \tag{4}$$

In Equation (4), $\sigma_i$ represents the average distance between the samples in the $i$-th cluster and its cluster center, and $d(c_i, c_j)$ denotes the distance between the cluster centers of the $i$-th and $j$-th clusters.
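As a concrete reading of the DB index, the sketch below computes it directly from a labeled data matrix using the two ingredients named above: the within-cluster dispersion and the distance between cluster centers. The function name `davies_bouldin` is an assumption for illustration, and Euclidean distance is assumed throughout.

```python
import numpy as np

def davies_bouldin(x, labels):
    """Davies-Bouldin index of a labeled (N, M) array; smaller is better."""
    ids = np.unique(labels)
    centers = np.array([x[labels == i].mean(axis=0) for i in ids])
    # Dispersion: mean distance of each cluster's samples to its own center.
    sigma = np.array([
        np.linalg.norm(x[labels == i] - c, axis=1).mean()
        for i, c in zip(ids, centers)
    ])
    # Pairwise Euclidean distances between cluster centers.
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # excludes j == i from the maximum
    ratio = (sigma[:, None] + sigma[None, :]) / d
    # Worst-case ratio for each cluster, averaged over all clusters.
    return float(ratio.max(axis=1).mean())
```

For two clusters whose samples each lie 1 unit from their center and whose centers are 10 units apart, the index is (1 + 1) / 10 = 0.2; moving the clusters farther apart lowers it.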
Based on previous research and the well–seismic response characteristics of the WZ6-1 oil-bearing structural area in the Beibu Gulf Basin of the South China Sea, the six selected horizon seismic attributes were used as input to a K-means clustering analysis. This approach aimed to uncover the implicit nonlinear response patterns between multiple seismic attribute parameters and reservoir hydrocarbon content, thereby predicting oil-bearing zones. The appropriate value of K is generally determined from the curves of the sum of squared errors (SSE) and the Davies–Bouldin (DB) index plotted against the K value (Figure 3). Typically, as the K value increases, the SSE value gradually decreases, but the rate of decrease slows down. The inflection point or “elbow” in the curve is often considered an appropriate K value [25,26]. A smaller DB index value indicates greater similarity within each cluster and greater differentiation between clusters, signifying better clustering performance. As shown in Figure 3, K = 4 lies at the inflection point of the SSE curve and at the minimum of the DB index curve. Therefore, the number of clusters was set to four. Additionally, the K-means algorithm requires setting the number of random initializations, because the initial center point of each of the K clusters is selected randomly. Too few random initializations may lead to suboptimal solutions due to unsuitable initial center points, while more random initializations increase computational demand and, particularly with large datasets such as seismic data, significantly extend algorithm runtime. Thus, an appropriate number of random initializations must be chosen. Through repeated testing and parameter tuning, the number of initializations was set to 10 and the number of clusters to 4, yielding the best overall clustering performance.
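The K-selection procedure described above (scan candidate K values, record SSE and DB index, look for the elbow and the DB minimum) could be sketched with scikit-learn as follows. The source does not name a software package, so the use of scikit-learn is an assumption, as are the helper names `scan_k` and `synthetic_attributes`; the synthetic data merely stand in for a real attribute matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def scan_k(x, k_values, n_init=10, seed=0):
    """Return {K: (SSE, DB index)} for each candidate number of clusters."""
    out = {}
    for k in k_values:
        # n_init random initializations; the run with the lowest SSE is kept.
        km = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(x)
        out[k] = (float(km.inertia_), float(davies_bouldin_score(x, km.labels_)))
    return out

def synthetic_attributes(n_per_cluster=50, seed=0):
    """Hypothetical stand-in for an (N, 6) attribute matrix with 4 groups."""
    rng = np.random.default_rng(seed)
    centers = 10.0 * np.eye(4, 6)
    return np.vstack([
        c + rng.normal(0.0, 0.5, size=(n_per_cluster, 6)) for c in centers
    ])

if __name__ == "__main__":
    for k, (sse, db) in scan_k(synthetic_attributes(), range(2, 8)).items():
        print(f"K={k}  SSE={sse:.1f}  DB={db:.3f}")
```

Here `inertia_` is scikit-learn's name for the within-cluster SSE. On real attribute data the elbow is usually less sharp than on this synthetic example, which is why the SSE curve and the DB index are read together.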
Using the six selected horizon seismic attributes and the optimized K-means algorithm parameters determined from the previous experiments, a multi-parameter K-means clustering analysis was performed on the study area, with results shown in Figure 4. When combined with the actual drilling results in the study area, the predictions, as compared to single seismic attribute methods (Figure 2), reveal certain patterns of hydrocarbon accumulation: ① Overall, the high structural position in the northern block is predominantly classified as blue (Class IV) to green (Class III), the central block in the high structural position is mainly green (Class III) to yellow (Class II), and the low structural position in the southern block is primarily red (Class I). This distribution suggests a general trend: the northern block at the high structural position is non-oil-bearing → the central block at the high structural position shows some hydrocarbon indication → the southern block at the low structural position is oil-bearing. ② The wells WZ6-1-2 and WZ6-1-3 in the high structural position of the northern block, along with WZ6-1-1 in the central block, are all dry wells. The prediction results place them in the blue (Class IV) or green (Class III) regions. ③ The wells WZ6-1-A1h and WZ6-1-A2h in the central block at the high structural position are also dry wells. The prediction places them in the yellow (Class II) region. Although both wells showed some hydrocarbon indication during drilling, electric logging interpreted them as water-bearing layers. ④ The high-yield oil well WZ6-1S-1 in the S2 fault block in the low structural position of the southern block is predicted to fall within the red (Class I) region. It is inferred that the red (Class I) region of the K-means clustering multi-parameter seismic attribute fusion is likely most closely associated with hydrocarbon-bearing strata. Additionally, the high structural positions of the adjacent S3 and S4 fault blocks also fall within the red (Class I) region, suggesting potential hydrocarbon-bearing zones. ⑤ Outside the trap of the S2 fault block, which has been confirmed as oil-bearing by the WZ6-1S-1 well, there exists a large red (Class I) region in the low structural position. However, this appears to contradict the general hydrocarbon accumulation pattern.
This issue may be related to the correlation among the input seismic attribute variables ($x$), as the Euclidean distance $d_{c_i}$ and the iterative cluster center $c_i$, according to Equations (1) and (2), both depend on the input seismic attribute variables. If there is redundancy among the input seismic attribute variables, it may cause shifts in the new cluster centers $c_i$ generated through iteration and introduce significant deviations in the computed Euclidean distance $d_{c_i}$. This dual bias ultimately leads to greater error in the classification results of the K-means algorithm. Furthermore, based on Equations (3) and (4), the clustering effectiveness indicators SSE and DB index are also affected by redundancy among the input seismic attribute variables.
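A small numerical example may help show how redundancy biases the Euclidean distance. It is an idealized illustration, not taken from the study data: two samples differ by one unit in each of two independent attributes, and then the first attribute is duplicated to mimic a perfectly redundant input. The helper name `attribute_share` is hypothetical.

```python
import numpy as np

def attribute_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each column."""
    sq = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2
    return sq / sq.sum()

# Two independent attributes: each contributes half of the squared distance.
share_independent = attribute_share([1.0, 0.0], [0.0, 1.0])

# First attribute duplicated (redundant copy): the same underlying property
# now accounts for two thirds of the squared distance.
share_redundant = attribute_share([1.0, 1.0, 0.0], [0.0, 0.0, 1.0])
```

Real seismic attributes are only partially correlated, so the effect is weaker than in this extreme case, but the direction is the same: properties that are represented several times among the inputs are implicitly over-weighted in both the distance and the cluster centers.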
This paper proposes preprocessing the multiple seismic attribute data with SVD, which effectively resolves this issue. Because SVD is an orthogonal decomposition, it is commonly employed to address data redundancy. In the decomposition, the larger singular values correspond to the main informational features embedded in the data, while the smaller singular values correspond to noise interference. SVD therefore not only reduces redundancy among the seismic attribute variables but also achieves dimensionality reduction and suppresses noise interference. Singular value decomposition was performed on the six selected horizon seismic attributes in the study area to obtain the singular values, and an appropriate number of the largest values was retained to reconstruct the seismic attribute data. The K-means algorithm was then applied to the SVD-reconstructed seismic attribute data for predictive classification; this combination is referred to as the SVD–K-means clustering method. The approach overcomes the issues encountered when multi-parameter seismic attribute data are used directly for K-means predictive classification, and it achieved satisfactory results in predicting oil-bearing zones in the WZ6-1 structure.
4. SVD–K-Means Model Prediction Method
SVD decomposes a matrix into the product of three matrices: a matrix of left singular vectors, a diagonal matrix of singular values, and the transpose of a matrix of right singular vectors. Assuming there are M types of seismic attributes and N samples, forming an N × M matrix x, the matrix x can be decomposed as follows:

$$x = U \Sigma V^{T} \tag{5}$$
In Equation (5), $U = \{u_1, u_2, \ldots, u_N\}$ is an $N \times N$ orthogonal matrix, $V = \{v_1, v_2, \ldots, v_M\}$ is an $M \times M$ orthogonal matrix, and $\Sigma$ is an $N \times M$ diagonal matrix with rank $r \le \min(N, M)$. The singular values $\sigma_i$, arranged in descending order along the diagonal of $\Sigma$, satisfy $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$. The matrices $U$ and $V$ can be derived from the eigenvectors of $xx^{T}$ and $x^{T}x$, respectively. The singular values $\sigma_i$ are the square roots of the non-negative eigenvalues of $xx^{T}$ and $x^{T}x$. A larger singular value represents a higher amount of information energy. The number of singular values to retain is determined based on the cumulative contribution rate of the singular values, which controls the amount of information retained [27,28]. This is calculated as follows:
$$P_r = \frac{\sum_{i=1}^{r} \sigma_i}{\sum_{i=1}^{\min(N, M)} \sigma_i} \tag{6}$$

In Equation (6), $r$ represents the number of retained singular values, and $P_r$ denotes the cumulative contribution rate. Typically, the optimal cumulative contribution rate is determined through experimental analysis based on the specific problem, to effectively reduce data redundancy, achieve dimensionality reduction, and suppress noise.
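The cumulative contribution rate can be computed directly from the singular values, as in the sketch below. Two points are assumptions of this example rather than statements from the source: the rate is taken as a ratio of sums of the singular values themselves (some authors use squared singular values instead), and the function names `cumulative_contribution` and `smallest_rank` are hypothetical.

```python
import numpy as np

def cumulative_contribution(x):
    """Singular values of x and their cumulative contribution rate."""
    # NumPy returns the singular values sorted in descending order.
    s = np.linalg.svd(x, compute_uv=False)
    return s, np.cumsum(s) / s.sum()

def smallest_rank(x, threshold):
    """Smallest number of singular values whose rate reaches the threshold."""
    _, p = cumulative_contribution(x)
    # Index of the first rate >= threshold, converted to a count.
    return min(int(np.searchsorted(p, threshold)) + 1, len(p))
```

For a matrix with singular values 4, 3, 2, and 1, the cumulative rates are 0.4, 0.7, 0.9, and 1.0, so a threshold of 0.88 is first reached with three retained values.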
Performing SVD on the six types of seismic attributes yields a singular value distribution curve, as shown in Figure 5, which generally forms an “L” shape. The magnitude of the singular values reflects the amount of information energy contained and is positively correlated with the information content. Larger singular values reflect the main information characteristics, while smaller singular values indicate noise interference. Discarding these smaller singular values and reconstructing the seismic attribute data results in only limited loss of the primary information characteristics. Based on the seismic response characteristics of the six drilled wells in the study area, repeated testing found that retaining the top three singular values, with a cumulative contribution rate of 88.1%, yielded the best prediction results for hydrocarbon zones using the SVD–K-means clustering method.
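The two-stage workflow (reconstruct the attribute matrix from its largest singular values, then cluster the reconstruction) might look like the following sketch. The defaults mirror the values reported in the text (three singular values, four clusters, ten initializations), but the function names, the use of scikit-learn, and in particular the standardization of each attribute before the SVD are assumptions; the source does not state how the attributes were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans

def svd_reconstruct(x, r):
    """Rank-r reconstruction of x from its r largest singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

def svd_kmeans(x, r=3, k=4, n_init=10, seed=0):
    """Cluster the SVD-reconstructed attribute matrix with K-means."""
    # Assumption: attributes are standardized first, since they have
    # different units; the source does not describe this step.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    z_r = svd_reconstruct(z, r)
    km = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(z_r)
    return km.labels_, z_r
```

The reconstruction keeps the original N × M shape, so each row still corresponds to one map location; only the directions associated with the discarded singular values are removed before the distances are computed.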
The results of the oil-bearing zone prediction using the SVD–K-means clustering method are shown in Figure 6, demonstrating good predictive performance: ① The southern S2 fault block, which has been confirmed as oil-bearing by the WZ6-1S-1 well, remains in the red oil-bearing zone (Class I). Compared to the K-means prediction result based directly on seismic attribute data (Figure 4), the red (Class I) oil-bearing area has significantly contracted towards the higher structural position of the S2 fault block trap, almost aligning with the S2 fault block’s trap area, which is consistent with hydrocarbon accumulation patterns. ② The locations of the five dry wells in the high structural positions (WZ6-1-1, WZ6-1-2, WZ6-1-3, WZ6-1-A1h, and WZ6-1-A2h) are all within the yellow (Class II) or green (Class III) regions. The prediction results are consistent with the drilling outcomes. ③ A rolling development well, WZ6-1-A3, was deployed in the high structural position of the S3 fault block within the red (Class I) region, yielding successful results: a total oil layer thickness of 15.8 m was encountered, and the production rate reached 150 m³/d after commissioning. ④ The high structural position of the S4 fault block in the southern block is also within the red (Class I) region, suggesting it as a potential target area for future exploration and development.
As shown in Figure 6, the results of predicting hydrocarbon zones using the SVD–K-means clustering method not only align with drilling results but also match the geological patterns of hydrocarbon enrichment in the study area, as confirmed by the high-production industrial oil flow achieved from the newly drilled well WZ6-1-A3. Why, then, are the results of applying the K-means algorithm directly to the seismic attribute data (Figure 4) less satisfactory? The following analysis explores the causes of this outcome based on clustering effectiveness indicators for the K-means algorithm.
The silhouette coefficient (SC), the Calinski–Harabasz (CH) index, and the DB index are commonly used to evaluate and compare the effectiveness of clustering algorithms [29,30]. The SC measures the compactness of samples within their own clusters and their separation from other clusters. The SC value lies in the range [−1, 1], with values closer to 1 indicating better clustering effectiveness. The specific calculation formula is as follows:

$$\mathrm{SC}(i) = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{7}$$
In Equation (7), $a_i$ represents the average distance between sample $i$ and the other samples within the same cluster, while $b_i$ represents the average distance between sample $i$ and the samples of the nearest other cluster. The average SC value of all samples is used as the overall clustering effectiveness SC. The CH index compares the dispersion between clusters with the dispersion of samples within clusters, with the calculation formula as follows:

$$\mathrm{CH} = \frac{B / (K - 1)}{W / (N - K)} \tag{8}$$
In Equation (8), B represents the between-cluster sum of squared errors, W represents the within-cluster sum of squared errors, N is the total number of samples, and K is the number of clusters. A higher CH value indicates better clustering results. The DB index defined in Equation (4) serves both as a basis for selecting the K value and as a measure of clustering effectiveness.
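To make the silhouette coefficient and CH index concrete, the sketch below computes both from a labeled data matrix. It follows the standard definitions, with Euclidean distances and with the "other cluster" term of the silhouette taken as the nearest other cluster; the function names are assumptions, and the O(N²) distance matrix makes this version suitable only for small examples.

```python
import numpy as np

def silhouette(x, labels):
    """Mean silhouette coefficient with Euclidean distances."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    ids = np.unique(labels)
    s = np.zeros(len(x))
    for i in range(len(x)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a singleton cluster is given a silhouette of 0
        # Mean distance to the other members of the same cluster.
        a = d[i, own].sum() / (own.sum() - 1)
        # Mean distance to the members of the nearest other cluster.
        b = min(d[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())

def calinski_harabasz(x, labels):
    """CH index: (B / (K - 1)) / (W / (N - K)); larger is better."""
    n, ids = len(x), np.unique(labels)
    k = len(ids)
    overall = x.mean(axis=0)
    # Between-cluster term: size-weighted squared center offsets.
    b = sum((labels == c).sum()
            * np.sum((x[labels == c].mean(axis=0) - overall) ** 2) for c in ids)
    # Within-cluster term: squared offsets of samples from their center.
    w = sum(np.sum((x[labels == c] - x[labels == c].mean(axis=0)) ** 2)
            for c in ids)
    return float((b / (k - 1)) / (w / (n - k)))
```

For four samples at (0, 0), (0, 2), (10, 0), and (10, 2) split into a left and a right cluster, B = 100 and W = 4, so CH = (100 / 1) / (4 / 2) = 50.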
Due to the randomness of the initial cluster centers in the K-means clustering algorithm, each clustering result exhibits some variation. Therefore, this study uses the average values of the three evaluation metrics from multiple clustering results for comparative evaluation. As shown in Table 1 and Table 2, the clustering effectiveness of the K-means and SVD–K-means methods is compared using the average values of the three metrics across eight clustering results. The results indicate that SVD–K-means outperforms K-means across all metrics, with an 18.4% improvement in the SC coefficient, a 57.8% increase in the CH index, and a 24.7% improvement in the DB index. The superior clustering effectiveness of SVD–K-means explains its better performance in hydrocarbon zone prediction.
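The averaging procedure can be sketched as follows, using scikit-learn's implementations of the three metrics. Several details are assumptions of the example rather than facts from the source: each run differs only in its random seed, the metrics are evaluated on the same matrix that was clustered, and the helper names `mean_scores` and `relative_change` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

def mean_scores(x, k=4, n_runs=8, n_init=10):
    """Average SC, CH and DB over several independently seeded K-means runs."""
    rows = []
    for seed in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=n_init,
                        random_state=seed).fit_predict(x)
        rows.append((silhouette_score(x, labels),
                     calinski_harabasz_score(x, labels),
                     davies_bouldin_score(x, labels)))
    return np.mean(rows, axis=0)

def relative_change(baseline, candidate):
    """Percentage change of each metric relative to the baseline."""
    baseline = np.asarray(baseline, dtype=float)
    return 100.0 * (np.asarray(candidate, dtype=float) - baseline) / baseline
```

With this convention, an improvement appears as a positive change for SC and CH and as a negative change for the DB index, because a smaller DB value is better.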
5. Conclusions
The reservoir of the WZ6-1 oil-bearing structure in the study area consists of interbedded sandstone and mudstone deposits from a fan delta front, with a highly heterogeneous spatial distribution of sand bodies. Faulting is extensive, resulting in fragmented structures whose fault sealing varies significantly between fault blocks and is poorly constrained. The distribution of oil, gas, and water is complex and variable, and hydrocarbon accumulation mechanisms are not well understood, making it challenging to predict oil- and gas-bearing zones; this has led to five unsuccessful wells. Seismic attributes are not a simple linear response to the single factor of interest, the hydrocarbon potential of the strata, but rather a nonlinear composite response influenced by various factors including strata properties, acquisition, and processing. As a result, conventional single seismic attributes and linear fusion of multiple seismic attribute variables have proven ineffective for predicting hydrocarbon-bearing zones in this area: the results correlate poorly with drilled well data and fail to align with the fundamental geological patterns of hydrocarbon accumulation. The SVD–K-means clustering method proposed in this paper has yielded positive results for predicting oil-bearing zones. Not only does it align well with the drilling results, but it also corresponds to the geological patterns of hydrocarbon accumulation in the study area. This has been further confirmed by the high-yield industrial oil flow obtained from the newly drilled WZ6-1-A3 well, providing crucial technical support for subsequent exploration and development decisions and offering valuable insights for predicting hydrocarbon-bearing zones under similar complex geological conditions.