1. Introduction
Clustering, fuzzy clustering (soft computing), and classification are fundamental techniques in data analysis and pattern recognition [1,2]. Clustering involves grouping similar data points, providing insights into inherent structures within datasets [3]. Fuzzy clustering, a subset of soft computing, extends traditional clustering methods by introducing membership functions that quantify the extent to which data points belong to multiple clusters, thus capturing the inherent ambiguity present in many real-world scenarios [4]. In recent years, clustering has been recognized as an information granulation technology, and the clustering process is referred to as a granulation mechanism. This technology has found application in various fields. Classification, on the other hand, assigns predefined labels to data points based on their characteristics [5].
As we delve into the realm of uncertainty in data analysis, it becomes crucial to explore the significance of uncertain data. Unlike certain data, uncertain data encapsulate a greater volume of information due to their inherent variability and imprecision [6]. Understanding and effectively analyzing uncertainty in data have become essential in addressing the complexities of real-world applications [7]. In recent decades, a plethora of fuzzy set-based approaches [8] have emerged to model the inherent uncertainty (granularity) present in various real-world phenomena. These methodologies quantify information granularity by employing membership functions [9]. Among the many algorithms in fuzzy clustering, fuzzy c-means (FCM) stands out as a widely embraced soft partitioning algorithm and is extensively applied across diverse domains. FCM partitions a given input space into distinct regions (groups, categories) based on a predefined similarity/dissimilarity measure [10]. Within the FCM algorithm, the dataset's underlying structure is articulated through partition matrices and prototypes (clusters) [11].
FCM utilizes membership functions to measure the extent to which each data point (pattern) belongs to the various clusters. Since its origin, this technique has garnered considerable attention through application studies and conceptual developments [12]. It has been shown to significantly enhance the quality of clustering and classification compared with traditional hard partitioning methods. A plethora of enhanced clustering approaches have been developed over time. Among these alternatives, kernel-based FCM (KFCM) [13,14,15] has risen as an intriguing and widely adopted approach. KFCM employs kernel functions that implicitly induce nonlinear transformations, mapping the data from the native space into a higher-dimensional feature space in which the data are expected to exhibit greater separability. By positioning the data in this augmented space, KFCM aims to achieve superior classification performance [16]. The implicit nonlinear transformations provided by the kernel function help capture complex relationships within the data, potentially leading to more accurate and nuanced cluster assignments.
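To make the kernel-induced distance concrete, the following minimal Python sketch shows the squared distance commonly used with a Gaussian kernel, where the mapping into the feature space never has to be computed explicitly; the kernel width sigma2 and the function name are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def gaussian_kernel_distance(x, v, sigma2=10.0):
    """Squared distance between a datum x and a prototype v in the feature
    space induced by the Gaussian kernel K(x, v) = exp(-||x - v||^2 / sigma2).

    Because K(x, x) = K(v, v) = 1, the feature-space squared distance
    ||phi(x) - phi(v)||^2 reduces to 2 * (1 - K(x, v)).
    """
    k = np.exp(-np.sum((np.asarray(x) - np.asarray(v)) ** 2) / sigma2)
    return 2.0 * (1.0 - k)
```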
In summary, the KFCM algorithm extends the traditional FCM method by incorporating a kernel function. The key aspects to consider in the analysis of KFCM include the integration of the kernel function, its impact on nonlinear transformations and data separability in higher-dimensional spaces, and the resulting improvement in classification performance compared with conventional algorithms. A comprehensive analysis involves examining the specific kernel function employed, evaluating how nonlinear mappings contribute to enhanced data separability, and assessing the algorithm's robustness to varying data characteristics [17]. Additionally, investigating the computational complexity introduced by the kernel-based approach and exploring practical applications and use cases provide valuable insights into the algorithm's theoretical foundations and real-world effectiveness.
This research focused on the intersection of fuzzy clustering and uncertainty analysis. We aimed to investigate the meaningful application of fuzzy clustering in the context of uncertainty, leveraging its ability to model and accommodate imprecision and ambiguity in data. By incorporating fuzzy clustering into uncertainty analysis, we anticipated gaining valuable insights into the nuanced patterns and relationships within uncertain datasets, thus enhancing our ability to make informed decisions in the face of complexity and variability.
In the design process, we divided the data into certain and uncertain data based on the membership degrees of each datum to all prototypes, and then inserted data generated with the cloud model technology [18] according to the membership functions to reduce the proportion of uncertain data. Subsequently, the contribution of the boundary data to the prototypes was reallocated, further reducing the proportion of uncertain data. In this optimization process, we employed the classification error (data labels) to supervise the insertion of data. Ultimately, the classification performance was improved by leveraging the enhanced partition matrix.
This paper is structured as follows: Section 2 provides a brief review of the FCM and KFCM methods. In Section 3, we elucidate the principle behind the proposed scheme. Section 4 details the experimental studies on synthetic and publicly available data. Finally, Section 5 summarizes the study.
3. Enhancing Fuzzy Clustering through Innovative Interpolation Techniques
As previously highlighted, when an algorithm is used for classification tasks, its performance is particularly influenced by the presence of uncertain data situated at the cluster boundaries. This part of the data affects the positions of the prototypes, which in turn affect the partition matrix and ultimately the algorithm's classification performance. Therefore, our focus is on optimizing these cluster boundaries.
In fuzzy clustering, the focus is traditionally placed on the maximum values within each column of the membership matrix, as these values play a crucial role in determining the clustering (classification) results. However, we should recognize that non-maximal values also carry valuable information. For instance, these non-maximal values can provide insights into data points situated at the boundaries of clusters, contributing to a more comprehensive understanding of the clustering structure. While the maximum values heavily influence the overall outcome, acknowledging the significance of non-maximal values enhances the nuanced interpretation of the clustering results.
Let u(j),k denote the jth largest of the membership degrees u1,k, …, uc,k of the kth data point, i.e., of the kth column of the partition matrix. We use the standard deviation of these membership values to partition the data into two parts, namely certain (non-boundary) data Xc and uncertain (boundary) data Xu.
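As an illustration of this split, the following Python sketch partitions a dataset according to the standard deviation of each point's membership values; the threshold theta and the exact decision rule are assumptions for illustration, since the concrete criterion used in the paper is not reproduced here.

```python
import numpy as np

def split_by_membership_std(X, U, theta=0.2):
    """Partition X into certain (non-boundary) and uncertain (boundary) data.

    X     : (N, n) data matrix
    U     : (C, N) partition matrix; column k holds the memberships of the
            kth data point to the C clusters
    theta : hypothetical threshold on the per-point membership standard
            deviation (nearly uniform memberships, i.e. a low standard
            deviation, indicate a boundary point)
    """
    stds = U.std(axis=0)                      # one value per data point
    certain = stds >= theta                   # dominated by a single cluster
    Xc, Xu = X[certain], X[~certain]
    return Xc, Xu, certain
```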
To enhance the optimization of the cluster boundaries, we aim to decrease the proportion of data situated within these boundaries (the uncertain data). This can be effectively accomplished by increasing the proportion of certain data. Consequently, we incorporate additional data into the set of certain data. The method adopted in this study is to insert data XI, generated with the cloud model technology according to the membership functions, into the certain data, and to modify the prototype matrix based on the new dataset. During the interpolation process, we used the one-dimensional normal membership cloud model to determine the distribution of the interpolated data for each feature.
The cloud model serves as an uncertainty conversion framework that translates a qualitative concept expressed through natural language values into a quantitative representation. It primarily comprises the forward and backward cloud generators. In this study, our focus lies on data generation for interpolation, which is primarily driven by the principles underlying the forward cloud generator.
The forward cloud generator functions as a mapping tool that translates qualitative information into quantitative data by utilizing three numerical characteristic parameters of the cloud, namely, expectation (Ex), entropy (En), and super-entropy (He), along with the count of cloud droplets (N). The output of this process provides the quantitative positioning of the N cloud droplets within the numerical field space, accompanied by the certainty degree of each droplet with respect to the underlying concept. Given the widespread applicability of normal clouds, this study primarily revolves around their utilization.
A specific procedure for the one-dimensional forward cloud generator is outlined as follows:
Input: Digital parameters (Ex, En, He) embodying the qualitative concept and the number of cloud droplets (N).
Output: N cloud droplets (Xi) and the degree to which each cloud droplet belongs to the concept.
(a) A normal random number Eni is produced, with the expectation set to En and the standard deviation set to He.
(b) A normally distributed random number X is generated with Ex as the mean and Eni as the standard deviation.
(c) The degree to which X belongs to the specified concept is determined through μ = exp(−(X − Ex)²/(2Eni²)).
Steps (a) to (c) are repeated until N cloud droplets have been generated.
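A minimal Python sketch of this generator, following the three steps above, could look as follows; the function name and the use of NumPy are assumptions, while the certainty-degree formula is the standard one for the normal cloud model.

```python
import numpy as np

def forward_normal_cloud(Ex, En, He, N, rng=None):
    """One-dimensional forward normal cloud generator (steps (a)-(c) above).

    Ex : expectation, En : entropy, He : super-entropy, N : droplet count.
    Returns the N cloud droplets and the certainty degree of each droplet.
    """
    rng = np.random.default_rng() if rng is None else rng
    # (a) normal random numbers with mean En and standard deviation He
    En_i = rng.normal(loc=En, scale=He, size=N)
    # (b) droplets with mean Ex and standard deviation |En_i|
    x = rng.normal(loc=Ex, scale=np.abs(En_i))
    # (c) certainty degree of each droplet with respect to the concept
    mu = np.exp(-(x - Ex) ** 2 / (2.0 * En_i ** 2))
    return x, mu
```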
During algorithm execution, the actual parameters of the cloud models are determined by the prototypes and the standard deviation of the membership degrees. The prototypes can then be adjusted based on the new dataset (the certain data augmented with the interpolated data XI) according to (3).
Afterward, the partition matrix can undergo additional refinement utilizing the adjusted prototype matrix. During this optimization stage, we utilize the classification error (original labels of the dataset) to supervise the interpolation process. This refinement of the partition matrix ultimately leads to an enhancement in classification performance.
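To illustrate how the prototypes and the partition matrix can be recomputed on the augmented dataset, the following Python sketch uses the standard FCM update equations; it is a simplified stand-in for the update referenced as (3) and does not include the label-supervised control of the interpolation.

```python
import numpy as np

def update_prototypes(X_aug, U_aug, m=2.0):
    """Recompute the prototypes from the augmented dataset (certain data plus
    interpolated droplets) with the usual FCM weighted mean.

    X_aug : (N + NI, n) augmented data, U_aug : (C, N + NI) memberships.
    """
    w = U_aug ** m
    return (w @ X_aug) / w.sum(axis=1, keepdims=True)     # (C, n) prototypes

def update_partition(X, V, m=2.0, eps=1e-12):
    """Refine the partition matrix from the adjusted prototypes (standard FCM update)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2).T + eps   # (C, N)
    ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))    # (C, C, N)
    return 1.0 / ratio.sum(axis=1)                                    # (C, N)
```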
Figure 1 illustrates the methodology of implementing the proposed scheme in detail.
The total computational complexity of KFCM is O(CN²n), while the computational complexities of FCM and the proposed method are O(CNn) and O(C(N + NI)n), respectively (C is the number of prototypes, N is the total number of original data instances in the n-dimensional space, and NI is the number of inserted data in the n-dimensional space). Typically, the number of inserted data (NI) is much smaller than the total number of data (N). Therefore, in theory, our algorithm runs somewhat slower than FCM but much faster than KFCM.
4. Experimental Studies
In what follows, we aimed to assess the effectiveness of the developed scheme by comparing its performance with that of the FCM and Gaussian kernel function-based FCM (KFCM-G) methods. The primary goal of this extensive series of experiments was to discuss the classification performance of these clustering approaches. A variety of experiments were conducted using both synthetic datasets and publicly available datasets (http://archive.ics.uci.edu/ml, accessed on 3 March 2024) [27].
All data were normalized with the min-max scaling method, which is described as follows:
x′ = (x − x_min)/(x_max − x_min),
where x and x′ represent the original and the preprocessed data values, respectively, and x_min and x_max denote the minimum and maximum values of the corresponding feature. The goal was to thoroughly evaluate the proposed scheme's effectiveness. To ensure consistency, all data were normalized to [0, 1]. The classification rate [4] was utilized as the primary metric in these experiments, given its widespread usage as an index for performance evaluation.
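A minimal Python sketch of this per-feature normalization is given below; the function name is illustrative.

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X to the interval [0, 1] with min-max scaling."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```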
In the experiments, we explored various values of the fuzzification factor in the interval [1.1, 3.1], changing its value with a step size of 0.2. The number of iterations was fixed at 500 to ensure the completion of clustering. We permitted the methods to terminate early if the following condition was met:
max_{i,k} |u_ik(iter) − u_ik(iter − 1)| < ε,
where u_ik(iter − 1) represents the membership matrix from the previous iteration. In numerous instances, Equation (10) was fulfilled before the maximum number of iterations was reached. We let the Gaussian kernel parameter σ² vary from 10 to 100 in increments of 10 to mitigate the computational intensity associated with KFCM-G.
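The termination test can be implemented as a simple comparison of successive partition matrices, as in the following sketch; the tolerance epsilon is a placeholder, since the exact value used in the experiments is not reproduced here.

```python
import numpy as np

def has_converged(U_new, U_old, epsilon=1e-5):
    """Check whether successive partition matrices barely change (cf. Equation (10))."""
    return np.max(np.abs(U_new - U_old)) < epsilon
```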
To gauge the efficacy of the proposed approach, we employed 10-fold cross-validation [28], a widely utilized technique for estimating and validating the classification performance and stability of fuzzy classification models.
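Assuming scikit-learn is available, the 10-fold protocol could be organized as in the following sketch; fit_and_score is a hypothetical callback that trains a classifier on the training folds and returns its classification rate on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, fit_and_score, n_splits=10, seed=0):
    """Run an n_splits-fold cross-validation and report the mean and std of the scores."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [fit_and_score(X[tr], y[tr], X[te], y[te]) for tr, te in kf.split(X)]
    return float(np.mean(scores)), float(np.std(scores))
```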
4.1. Synthetic Data Experiments
The first experiment utilized a two-dimensional synthetic dataset comprising 450 individuals categorized into nine distinct classes. The dataset's geometry is illustrated in Figure 2. Figure 3, Figure 4 and Figure 5 present the clustering outcomes along with the corresponding partition matrices of the three approaches. The experimental results associated with the classification rates and the model parameter values for the synthetic dataset are plotted in Figure 6. It was evident that, through the selection of a judicious fuzzification factor and the incorporation of the interpolation technology, the prototypes were optimized. This resulted in the refinement of the class boundaries.
4.2. Publicly Available Data Experiments
We employed six publicly accessible datasets, detailed descriptions of which are available in the UCI machine learning repository.
Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 show the experimental results associated with the classification rates and the model parameter values of each dataset. It is noteworthy that the classification quality of all these datasets was enhanced through the application of the proposed method. The developed scheme exhibited substantial merits over both the FCM and KFCM methods.
The KFCM exhibited improvements on some specific datasets. Notably, the developed approach consistently achieved higher classification rates compared with both the FCM and other kernel-based clustering algorithms. This superiority can be attributed to the optimization of the proposed method on the cluster boundaries through incorporating specific data into the clustering process, thereby reducing the proportion of uncertain data and refining the prototypes. Consequently, this optimization facilitated more accurate cluster identification.
The observed enhancement in classification performance averaged approximately 6%, with improvements ranging from a minimum of 3% to a maximum of 10%, the latter being the most notable gain achieved by our method.
In summary, our approach achieved the partitioning of the data into certain and uncertain parts by utilizing membership degrees to delineate boundary and non-boundary data. On this basis, we leveraged the cloud model technology for data interpolation and adjusted the prototypes to refine the partition matrix, further enhancing the model's classification performance. This not only enriched and advanced classifier models based on fuzzy clustering technology but also offered valuable insights for uncertainty analysis research.