1. Introduction
As one of the key components of machine learning, fuzzy rule-based classifiers [1,2,3] explore the features of data by constructing fuzzy sets with strong generalization ability and extracting fuzzy rules with good interpretability. Compared with traditional classification algorithms, such as decision trees [4,5], logistic regression [6], naive Bayes [7], and neural networks [8], fuzzy classifiers tend to consider both classification performance and the interpretability of fuzzy rules when designing models. Fuzzy techniques are also combined with some classic classifiers to deal with human cognitive uncertainties in classification problems. Fuzzy decision trees [9] are a famous example, which have been proven an efficient classification method in many areas [10,11,12]. Similarly, there are also multiple classifier systems, such as the fuzzy random forest [13], which are constructed by combining a series of fuzzy decision trees. However, when dealing with data in practice, such as disease diagnosis [14], protection systems [15], natural disaster prediction [16] and financial problems [17], traditional fuzzy rule-based classifiers cannot always extract fuzzy classification rules with good interpretability, which directly leads to a decrease in classification accuracy. This is mainly because real-world data often have the characteristic of imbalance [18,19], that is, the samples of a certain class (called the minority class) are far fewer than those of the other classes (collectively called the majority class). Traditional fuzzy classifiers usually assume that the number of samples contained in each class of the dataset is similar, whereas the classification of imbalanced data focuses on just two classes, viz., the minority class and the majority class.
To resolve the issue of classifying imbalanced data, some auxiliary methods have emerged. Data sampling is a common one; its aim is to balance the dataset before modeling by increasing the sample size of the minority class (oversampling) or decreasing the sample size of the majority class (undersampling). A famous oversampling method called the "synthetic minority over-sampling technique" (SMOTE) was proposed in [20], where the minority class is oversampled by selecting proper nearest-neighbor samples from the minority class. In [21,22], the SMOTE procedure is modified before its first stage: sample weighting and k-means clustering, respectively, are added right before selecting nearest-neighbor samples from the minority class. Similarly, there is also a series of studies based on undersampling that help to better deal with imbalanced data classification [23]. A number of fuzzy algorithms are combined with these data sampling techniques to handle imbalanced data classification. The authors of [24] analyzed the synergy between three fuzzy rule-based systems and preprocessing techniques (data sampling included) for the classification of imbalanced datasets. In [25], new fuzzy decision tree approaches based on hesitant fuzzy sets are proposed to classify imbalanced datasets. An obvious advantage of the above-mentioned data balancing techniques is that they do not depend on a specific classifier and have good adaptability. However, the data are balanced at a cost: oversampling increases the size of the training set, which may cause overfitting, and undersampling removes useful samples from the training data. This is obviously contrary to the nature of data classification. In addition to data sampling, another commonly used approach focuses on modifying the cost function of classification models so that the penalty weight of misclassified minority samples is greater than that of misclassified majority samples. The disadvantage is that there is no universally adequate standard to quantify the penalty weight of the misclassified minority samples.
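To make the oversampling idea concrete, below is a minimal SMOTE-style sketch in Python; it is not the exact procedure of [20], and the function name `smote_sketch` and its parameters are illustrative. Each synthetic point is interpolated between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Interpolate n_new synthetic points between minority samples
    and their k nearest minority neighbors (SMOTE-style sketch)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from the chosen sample to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Example: grow a 20-sample minority class to 100 samples.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_balanced_min = np.vstack([X_min, smote_sketch(X_min, 80, seed=1)])
```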
The above-mentioned methods aim at improving classification performance by adding auxiliary procedures, such as data sampling. However, interpretability is also a significant requirement when constructing classification models. As is well known, fuzzy rules or fuzzy sets constructed by fuzzy algorithms are usually highly interpretable and can reveal the structure of data. For example, the Takagi–Sugeno–Kang fuzzy classifier [26], a famous two-stage fuzzy classification model, obtains its antecedent parameters or initial structure characteristics through clustering algorithms [27,28]. However, when dealing with imbalanced datasets, the samples in the minority class may be treated as outliers or noise points, or be completely ignored, when applying clustering methods. This obviously affects the classification performance on imbalanced datasets. Thus, a fuzzy method considering both classification performance and interpretability should be constructed. Information granules are a concept worth considering.
Information granules [29] are entities composed of data with similar properties or functional adjacency and are the core concept in the field of granular computing (GrC) [30,31]. To a certain degree, an information granule is the indistinguishable minimum entity with explicit semantics which can depict data. This implies that data with similar numerical values can be arranged together and abstracted into an information granule with specific fuzzy semantics, which is consistent with the abstract process of building fuzzy sets around data. Information granules can be built in different formalisms, such as one-dimensional intervals, fuzzy sets, rough sets, and so on. Thus, in view of the high dimensionality and the special geometric structure of imbalanced datasets, we can construct two different collections of information granules to depict the characteristics of the majority class and the minority class [32], respectively. For instance, we can see clearly in Figure 1 that the hollow square-shaped samples can be abstracted into a fuzzy set composed of six cube-shaped information granules, and the solid dot-shaped ones can be abstracted into a fuzzy set containing only one cube-shaped information granule. Intuitively, the data are divided into two parts, i.e., the majority class is represented by a fuzzy set composed of six cubes and the minority one by a fuzzy set composed of only one cube. This intuitive process demonstrates the principle of using information granules to classify imbalanced data. Therefore, the objective of this paper is to generate information granules and then use them to form the fuzzy rules for the classification of imbalanced data. To achieve this objective, we propose a Minkowski distance-based granular classification method, in which information granules in different Minkowski spaces are constructed based on a spectrum of Minkowski distances; these granules can well reveal the geometric structure of both the majority class and the minority class. In other words, our approach aims at improving classification performance by exploring and understanding the geometric characteristics of imbalanced datasets, rather than relying on auxiliary methods. At the first stage of the method, the imbalanced dataset is divided into two partitions according to the class labels, viz., the majority class and the minority class, and each sample in each partition is considered a "spot" information granule. At the second stage, a series of bigger union information granules is constructed in each partition through a Minkowski distance-based merging mechanism. The last stage adjusts the radii of overlapping information granules and builds the granular description of each class by uniting the refined information granules contained in the corresponding partition. Subsequently, the granular Minkowski distance-based classification model for imbalanced datasets is constructed, and two "If-Then" rules emerge to articulate the granular description of each partition and its minority or majority class label. Compared to existing fuzzy classification methods for imbalanced datasets, this paper exhibits the following original aspects:
- The use of the Minkowski distance provides an additional parameter for the proposed fuzzy granular classification algorithm, which helps to understand the geometric characteristics of imbalanced data from more perspectives.
- The constructed union information granules present various geometric shapes and contain different quantities of information granules, which can disclose the structural features of both the majority class and the minority class.
- The proposed Minkowski distance-based modeling method has an intuitive structure and a simple process, and involves no optimization or data preprocessing.
The rest of the paper is organized as follows. The representation of information granules, the Minkowski distance calculation and the merging mechanism between two information granules are introduced in Section 2. In Section 3, the Minkowski distance-based granular classification method is described in detail. The experiments on some imbalanced datasets are presented in Section 4. Section 5 concludes the whole paper.
3. The Proposed Fuzzy Granular Classification Methods for Imbalanced Datasets Based on Minkowski Distance
In this section, the proposed Minkowski distance-based fuzzy granular classification method is detailed. To better demonstrate the modeling process, a normalized $n$-dimensional imbalanced dataset $D$ is used. Since $D$ is imbalanced, we partition it into two subsets: all the samples in the majority class are grouped into one subset, i.e., $X_{maj}$, and the samples in the remaining minority class are grouped into the other, i.e., $X_{min}$. $N_{maj}$ and $N_{min}$ denote the numbers of samples in $X_{maj}$ and $X_{min}$, respectively. The blueprint of the proposed three-stage classification method is shown in Figure 5.
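As a minimal illustration of this first stage, the following Python snippet splits a labeled dataset into the two subsets; the convention that the minority class carries label 1 is an assumption of this sketch, not a prescription of the paper.

```python
import numpy as np

def partition_by_class(X, y, minority_label=1):
    """Stage one: split a normalized dataset into the majority subset
    X_maj and the minority subset X_min according to the class labels."""
    X_min = X[y == minority_label]
    X_maj = X[y != minority_label]
    return X_maj, X_min
```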
3.1. The Construction of Information Granules for Each Class
Considering the large difference between the quantities of samples in the majority class and the minority class of an imbalanced dataset, it is unrealistic to use the same number of information granules to capture the geometric characteristics of both classes. A feasible option is to construct two collections containing different quantities of granules to depict the structures of the two classes. Thus, the task of this section is to elaborate the process of constructing the corresponding collection of information granules on $X_{maj}$ and on $X_{min}$.
Take the majority class $X_{maj}$ as an example. Through the merging mechanism described in Section 2, a collection of information granules can be constructed by the following steps. At the beginning, each point in $X_{maj}$ is regarded as a "spot" information granule whose radius equals zero and whose center is the point itself. Thus, we obtain the initial collection of $X_{maj}$, say, $G_{maj} = \{\Omega^p_1, \Omega^p_2, \ldots, \Omega^p_{N_{maj}}\}$, where $p$ is the Minkowski parameter. Next, the merging mechanism is conducted among these "spot" information granules in $G_{maj}$. Specifically, the distances between any two information granules in $G_{maj}$ are first calculated with (6), and the nearest pair of information granules $\Omega^p_i$ and $\Omega^p_j$ ($i \neq j$) is obtained. Then, suppose that $\Omega^p_i$ and $\Omega^p_j$ are merged into a bigger information granule $\Omega^p_{ij}$, whose radius $r_{ij}$ can be calculated by (8). Here, a key parameter is introduced, namely the radius threshold $\varepsilon$, which adjusts the size of the generated information granules. If the new radius is greater than the predefined $\varepsilon$, i.e., $r_{ij} > \varepsilon$, the merging is not executed and the two information granules $\Omega^p_i$ and $\Omega^p_j$ are kept in $G_{maj}$. Otherwise, $\Omega^p_i$ and $\Omega^p_j$ are merged into $\Omega^p_{ij}$ with (7) and (8). Usually, when designing a granular model, a value of $\varepsilon$ is selected such that $r_{ij} \leq \varepsilon$. In this situation, the collection $G_{maj}$ is updated by removing $\Omega^p_i$ and $\Omega^p_j$ and adding $\Omega^p_{ij}$, so that two "spot" information granules are merged into one new, bigger one. The above merging process is repeated until merging any two information granules in $G_{maj}$ would produce a radius greater than $\varepsilon$; the procedure is summarized in Algorithm 1.
Algorithm 1: The Minkowski distance-based merging process for the majority class subset $X_{maj}$.
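Since the body of Algorithm 1 is presented as a figure in the original layout, the following Python sketch restates the merging loop. Equations (6)-(8) are not reproduced in this section, so the sketch substitutes common conventions, clearly marked as assumptions: the inter-granule distance is taken as the Minkowski distance between centers, the merged center as the midpoint of the two centers, and the merged radius as half the center distance plus the larger of the two radii.

```python
import numpy as np
from itertools import combinations

def minkowski(a, b, p):
    """Minkowski distance between two points; p = np.inf gives the Chebyshev case."""
    if np.isinf(p):
        return float(np.abs(a - b).max())
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

def merge_granules(X, p, eps):
    """Algorithm 1 sketch: start from 'spot' granules (radius 0) and keep
    merging the nearest pair while the merged radius stays within eps.
    Granules are (center, radius) pairs; the distance, center and radius
    updates below are assumed stand-ins for Equations (6)-(8)."""
    granules = [(x.astype(float), 0.0) for x in X]
    while len(granules) > 1:
        # nearest pair of granules by center distance (stand-in for Eq. (6))
        i, j = min(combinations(range(len(granules)), 2),
                   key=lambda ij: minkowski(granules[ij[0]][0],
                                            granules[ij[1]][0], p))
        (ci, ri), (cj, rj) = granules[i], granules[j]
        d = minkowski(ci, cj, p)
        new_radius = d / 2.0 + max(ri, rj)     # stand-in for Eq. (8)
        if new_radius > eps:                   # merging would exceed the threshold
            break
        new_center = (ci + cj) / 2.0           # stand-in for Eq. (7)
        granules = [g for k, g in enumerate(granules) if k not in (i, j)]
        granules.append((new_center, new_radius))
    return granules
```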
Once the merging process is accomplished, a collection of merged information granules is obtained, viz., $G_{maj} = \{\Omega^p_1, \Omega^p_2, \ldots, \Omega^p_{c_{maj}}\}$. Through uniting all the elements in $G_{maj}$, we obtain a "union information granule", say, $U_{maj} = \bigcup_{i=1}^{c_{maj}} \Omega^p_i$. For the minority subset $X_{min}$, the corresponding union can be obtained in the same way, say, $U_{min}$. Although there is a big difference between the quantities of samples in $X_{maj}$ and $X_{min}$, the radius threshold $\varepsilon$ should take the same value when producing both $U_{maj}$ and $U_{min}$, so that the difference between the two classes is reflected in the numbers of granules. In this way, by selecting appropriate values of the radius threshold $\varepsilon$ and the Minkowski parameter $p$, the differences in geometric structure and sample quantity can be captured by the two union information granules, $U_{maj}$ and $U_{min}$.
3.2. The Emergence and Evaluation of the Minkowski Distance-Based Fuzzy Granular Classification Model
Through the processing at the above two stages, the two union information granules $U_{maj}$ and $U_{min}$ are produced to describe the key features of the majority class $X_{maj}$ and the minority class $X_{min}$. Both $U_{maj}$ and $U_{min}$ can depict the distribution and location of samples belonging to their corresponding classes; for instance, $U_{maj}$ may occupy much more Minkowski space than $U_{min}$. However, this does not guarantee that the two unions are non-overlapping. Therefore, before establishing the Minkowski distance-based classification model, the overlap between the two union information granules $U_{maj}$ and $U_{min}$ should be eliminated.
If there is an overlap between $U_{maj}$ and $U_{min}$, some of their element information granules must overlap with each other. For instance, for $\Omega^p_i$ from $U_{maj}$ and $\Omega^p_j$ from $U_{min}$, if the Minkowski distance between their centers is less than the sum of their radii, viz., $d_p(\mathbf{v}_i, \mathbf{v}_j) < r_i + r_j$, they overlap. In order to eliminate the overlap, we make $\Omega^p_i$ and $\Omega^p_j$ tangent to each other by scaling their radii to half of the Minkowski distance between their centers,

$$r_i = r_j = \frac{d_p(\mathbf{v}_i, \mathbf{v}_j)}{2}.$$

In this way, we can eliminate all overlaps between the two union information granules $U_{maj}$ and $U_{min}$ and obtain two refined ones, i.e., $U'_{maj}$ and $U'_{min}$, which can then be tagged with the corresponding majority and minority class labels through "If-Then" rules. A Minkowski distance-based granular classification model containing two fuzzy rules thus emerges, i.e.,

If $\mathbf{x}$ belongs to $U'_{maj}$, then $\mathbf{x}$ is classified into the majority class; if $\mathbf{x}$ belongs to $U'_{min}$, then $\mathbf{x}$ is classified into the minority class. (13)
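A minimal sketch of this overlap-elimination step, reusing `minkowski` and the `(center, radius)` granule representation from the sketch after Algorithm 1; scaling both radii to half the center distance follows the tangency rule stated above.

```python
def eliminate_overlap(granules_maj, granules_min, p):
    """Make every overlapping majority/minority granule pair tangent by
    scaling both radii to half the Minkowski distance between their centers."""
    for i in range(len(granules_maj)):
        for j in range(len(granules_min)):
            ci, ri = granules_maj[i]
            cj, rj = granules_min[j]
            d = minkowski(ci, cj, p)     # from the Algorithm 1 sketch above
            if d < ri + rj:              # overlapping pair detected
                granules_maj[i] = (ci, d / 2.0)
                granules_min[j] = (cj, d / 2.0)
    return granules_maj, granules_min
```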
For a given test sample $\mathbf{x}$, we can calculate its activation levels versus the rule of the majority class and that of the minority class in (13). Since the activation levels are determined by distances, they can be obtained by judging the positional relation between the sample and the union information granules, which covers the following three situations:
- (1) $\mathbf{x} \in U'_{maj}$: the sample is positioned within the boundary of the union information granule of the majority class. We intuitively set the distance between $\mathbf{x}$ and $U'_{maj}$ to zero, viz., $d_p(\mathbf{x}, U'_{maj}) = 0$.
- (2) $\mathbf{x} \in U'_{min}$: the sample is positioned within the boundary of the union information granule of the minority class. We intuitively set the distance between $\mathbf{x}$ and $U'_{min}$ to zero, viz., $d_p(\mathbf{x}, U'_{min}) = 0$.
- (3) $\mathbf{x} \notin U'_{maj}$ and $\mathbf{x} \notin U'_{min}$: the sample $\mathbf{x}$ lies in neither union information granule. The Minkowski distance between the sample $\mathbf{x}$ and a union information granule is then determined as the minimum distance between $\mathbf{x}$ and all information granules in $U'_{maj}$ or $U'_{min}$, i.e., $d_p(\mathbf{x}, U') = \min_{\Omega^p \in U'} d_p(\mathbf{x}, \Omega^p)$. Here $\mathbf{x}$ can be regarded as an information granule whose radius is zero, and the Minkowski distances $d_p(\mathbf{x}, U'_{maj})$ and $d_p(\mathbf{x}, U'_{min})$ can be calculated referring to (6) in Section 2.
Now the activation levels of $\mathbf{x}$ versus the majority class and the minority class, say $\mu_{maj}(\mathbf{x})$ and $\mu_{min}(\mathbf{x})$, can be obtained from the distances $d_p(\mathbf{x}, U'_{maj})$ and $d_p(\mathbf{x}, U'_{min})$, respectively: the smaller the distance, the higher the activation level. After obtaining the activation levels of $\mathbf{x}$ versus the two rules in (13), its class label can be determined by choosing the higher activation level, i.e., $\hat{y} = \arg\max \{\mu_{maj}(\mathbf{x}), \mu_{min}(\mathbf{x})\}$. In the particular case $\mu_{maj}(\mathbf{x}) = \mu_{min}(\mathbf{x})$, $\mathbf{x}$ is a boundary point which can be classified into either class.
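The three-case distance computation and the rule firing can be sketched as follows (again reusing `minkowski`); treating the point-to-granule distance as the center distance minus the radius, floored at zero, is an assumed reading of (6) for a zero-radius granule.

```python
def granule_distance(x, granules, p):
    """Distance from a test sample to a union granule: the minimum over
    member granules of (center distance - radius), floored at zero so that
    samples inside any member granule get distance 0 (cases (1) and (2))."""
    return min(max(minkowski(x, c, p) - r, 0.0) for c, r in granules)

def classify(x, granules_maj, granules_min, p):
    """Fire the two rules of (13): the class whose refined union granule is
    closer to x (i.e., has the higher activation level) is assigned."""
    d_maj = granule_distance(x, granules_maj, p)
    d_min = granule_distance(x, granules_min, p)
    if d_maj == d_min:
        return "boundary"                # equal activation: boundary point
    return "majority" if d_maj < d_min else "minority"
```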
Table 1 shows the confusion matrix of a two-class classification problem. In traditional classification algorithms, classification accuracy is often used as the evaluation index to quantify the performance of a model. According to Table 1, the classification accuracy can be calculated by

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}.$$

However, when facing imbalanced datasets, the samples in the minority class have little effect on the classification accuracy: even if the classification model labels all samples as the majority class, the accuracy is still high. This means that classification accuracy alone can hardly reflect the performance of a classifier on imbalanced datasets. Thus, in this work, we consider the accuracy of each class and use the following geometric mean as the evaluation index:

$$GM = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}. \qquad (17)$$
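A short numeric illustration of why the geometric mean is preferred: with the formulas above (taking the minority class as the positive class), a degenerate classifier that labels every sample as the majority class still reaches high accuracy but scores GM = 0.

```python
import numpy as np

def geometric_mean(tp, fn, tn, fp):
    """GM = sqrt(sensitivity * specificity), cf. Eq. (17), with the
    minority class taken as the positive class."""
    sensitivity = tp / (tp + fn)   # per-class accuracy on the minority class
    specificity = tn / (tn + fp)   # per-class accuracy on the majority class
    return np.sqrt(sensitivity * specificity)

# 400 majority vs. 20 minority samples, everything predicted 'majority':
# accuracy = 400/420 ≈ 95.2%, yet GM = 0 because the minority class is missed.
print(geometric_mean(tp=0, fn=20, tn=400, fp=0))   # 0.0
```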
4. Experiment Studies and Discussion
In this section, a series of experiments based on synthetic and publicly available datasets is conducted, with three purposes: (1) verifying the feasibility of the proposed Minkowski distance-based method for imbalanced data classification, (2) exploring the impact of the two key parameters, viz., the Minkowski parameter $p$ and the radius threshold $\varepsilon$, on the results, and (3) comparing the method with some other methods for imbalanced data classification.
In order to obtain more rigorous experimental results, each attribute of every dataset is first normalized into the unit interval. As for the two parameters, the Minkowski parameter $p$ is set to certain values depending on the dataset, i.e., $p \in \{1, 2, \infty\}$, and the value of the radius threshold $\varepsilon$ ranges from 0.02 to 0.12 with step 0.02. A fivefold cross-validation approach is adopted for higher confidence in the results, where four of the five partitions (80%) are used for training and the remaining one (20%) for testing; the five test partitions together form the whole dataset. The average result over the five partitions is reported for each dataset.
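For reproducibility, the preprocessing and evaluation protocol can be sketched as follows; using a stratified split so that each fold retains some minority samples is an assumption, as the text only specifies fivefold cross-validation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def min_max_normalize(X):
    """Normalize each attribute into the unit interval [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def five_fold_gm(X, y, fit_and_score, seed=0):
    """Fivefold cross-validation: 80% training / 20% testing per fold;
    the five test partitions together cover the whole dataset."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = [fit_and_score(X[tr], y[tr], X[te], y[te])
              for tr, te in cv.split(X, y)]
    return float(np.mean(scores)), float(np.std(scores))
```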
4.1. Synthetic Datasets
Two synthetic datasets showing imbalanced characteristics and unique geometric structures are used; see Figure 6. They are generated in the following way.
- (1) Moon-Blob dataset. This dataset contains 420 samples and two classes. One class, the majority class, shows the shape of a moon and contains 400 samples. It is governed by
$$\mathbf{x} = (\cos\theta + \delta, \; \sin\theta + \delta),$$
where $\theta$ ranges in $[0, \pi]$ and $\delta$ is a noise variable following a normal distribution $N(0, \sigma^2)$. The samples of this moon class are marked in red in Figure 6a. The other class, the minority class, shows the shape of a blob and contains 20 samples; they are randomly generated from the normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, and are marked in blue in Figure 6a.
- (2) Circles dataset. It is a two-dimensional dataset containing 420 samples in two classes. The majority class contains 400 samples and is governed by
$$\mathbf{x} = (s_1 \cos\theta + \delta, \; s_1 \sin\theta + \delta).$$
The minority class contains 20 samples and is governed by
$$\mathbf{x} = (s_2 \cos\theta + \delta, \; s_2 \sin\theta + \delta),$$
where $\theta$ ranges in $[0, 2\pi]$, $s_1$ and $s_2$ are scale factors, and $\delta$ is a noise variable following a normal distribution $N(0, \sigma^2)$. In Figure 6b, the majority class is shown as red points and the minority class as blue points.
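For reference, the two datasets can be regenerated roughly as below; the angle ranges, blob mean and covariance, scale factors and noise level are illustrative stand-ins, since the exact values from the original equations are not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(42)

def moon_blob(n_maj=400, n_min=20, noise=0.05):
    """Moon-Blob: a moon-shaped majority class and a small Gaussian blob."""
    theta = rng.uniform(0.0, np.pi, n_maj)
    moon = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, noise, (n_maj, 2))
    blob = rng.multivariate_normal([0.0, 0.25], 0.01 * np.eye(2), n_min)
    X = np.vstack([moon, blob])
    y = np.r_[np.zeros(n_maj), np.ones(n_min)]   # label 1 marks the minority class
    return X, y

def circles(n_maj=400, n_min=20, s1=1.0, s2=0.4, noise=0.05):
    """Circles: two concentric rings; s1 and s2 play the role of the scale factors."""
    t1 = rng.uniform(0.0, 2.0 * np.pi, n_maj)
    t2 = rng.uniform(0.0, 2.0 * np.pi, n_min)
    outer = s1 * np.c_[np.cos(t1), np.sin(t1)] + rng.normal(0, noise, (n_maj, 2))
    inner = s2 * np.c_[np.cos(t2), np.sin(t2)] + rng.normal(0, noise, (n_min, 2))
    return np.vstack([outer, inner]), np.r_[np.zeros(n_maj), np.ones(n_min)]
```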
The above two imbalanced datasets are generated to validate the feasibility of the Minkowski distance-based method. The experimental results on the two synthetic imbalanced datasets are presented in Table 2 and Table 3, where $\varepsilon$ stands for the radius threshold, $p$ stands for the parameter for calculating Minkowski distances, and each entry reports the average value and the standard deviation of the geometric mean (17) delivered by the models on the testing sets. In each table, the results with the highest average value are highlighted in boldface.
For the Moon-Blob dataset, when the Minkowski parameter $p$ equals 1 and the radius threshold $\varepsilon$ is set to 0.10, the geometric mean reaches its maximum, namely 94.54%. Referring to Figure 7a, we can see clearly that the majority class is exactly covered by the red diamond-shaped information granules; the union information granule presents a moon shape that matches the original distribution of the samples in the majority class. In contrast, far fewer blue information granules are constructed to capture the shape of the minority class. The corresponding classification decision boundary with $p = 1$ and $\varepsilon = 0.10$ is presented in Figure 7b; the darker the color of an area in Figure 7b, the greater the probability that samples in this area belong to the corresponding class. However, when $p = 2$, the generated round-shaped information granules do not cover the samples of the majority and minority classes as well as those with $p = 1$ and $\varepsilon = 0.10$; see Figure 7c. For the Circles dataset, when the Minkowski parameter $p$ is set to ∞ and the radius threshold $\varepsilon$ is set to 0.04, the geometric mean reaches its maximum, namely 95.92%. It can be seen clearly that the union information granules composed of red and blue cubes cover the samples of both the majority and minority classes perfectly.
Apparently, the configuration of the Minkowski parameter $p$ and the radius threshold $\varepsilon$ dramatically affects the classification performance of the Minkowski distance-based granular models constructed by the proposed method. The main reason is that $p$ directly determines the geometric shape of the constructed information granules, while $\varepsilon$ determines their sizes. In detail, different imbalanced datasets have different sample distributions, which leads to diverse geometric structures. The Minkowski parameter $p$ used in our method enriches the geometric shapes of the constructed information granules. This enables the constructed granular classification models to explore the geometric structure of data from multiple perspectives and as accurately as possible, which helps to improve classification performance. The radius threshold $\varepsilon$ is the key parameter of this method for dealing with imbalanced data. Since the minority class occupies tiny spaces, such as the three blue diamonds in Figure 7a, $\varepsilon$ transforms the space occupied by the majority class into a union of similarly tiny spaces (comparable to those occupied by the minority class). The proposed Minkowski distance-based granular classification method thus broadens the perspective of classification modeling and deftly resolves the problem of having too few samples in the minority class of imbalanced data.
4.2. Publicly Available Datasets and Comparison with Other Methods
Twelve publicly available imbalanced datasets are considered from the KEEL repository (https://sci2s.ugr.es/keel/category.php?cat=clas). It is worth mentioning that these imbalanced datasets are obtained by splitting and reorganizing "standard" datasets. Six of them have a low imbalance ratio (the ratio of the number of samples in the majority class to the number in the minority class is less than 9) and the other six have a high imbalance ratio (greater than 9). They are summarized in Table 4, where the names of the datasets, the numbers of attributes, the numbers of samples and their imbalance ratios are presented.
The configuration of the relevant parameters on the publicly available datasets is as follows: for the datasets with a low imbalance ratio, the Minkowski parameter $p$ is set to 1 and the radius threshold $\varepsilon$ to 0.08; for the datasets with a high imbalance ratio, $p$ is set to 2 and $\varepsilon$ to 0.04. Other configurations of the two parameters were also tried, but no evident increase of the geometric mean (17) appeared. In order to validate the performance of the models built by this Minkowski distance-based method, we conducted a comparative study on the twelve publicly available datasets. In addition to the model established by our method, classification models established by other fuzzy learning methods are included: Ishibuchi et al.'s rule learning algorithms [2], Xu et al.'s E-algorithm for imbalanced dataset classification [34], the well-known C4.5 decision tree algorithm [35], and Fernández et al.'s hierarchical fuzzy rule-based classification model with SMOTE preprocessing [36]. The experimental parameter set-up of these classifiers is shown in Table 5.
Table 6 shows the corresponding comparison results (the average geometric mean with its associated standard deviation) on the test partitions for each classification method. In detail, by columns, we include Ishibuchi et al.'s method (denoted Ishibuchi), Xu et al.'s method (the E-Algorithm), the C4.5 algorithm, Fernández et al.'s method (Smote-HFRBCS) and our Minkowski distance-based method.
In light of the geometric-mean values, it is clear that the proposed Minkowski distance-based method obtains higher geometric means than the other methods on ten out of twelve datasets. Especially for the six datasets with a high imbalance ratio, the models established by the proposed method perform much better in terms of the mean result. This is because the granular classification models in this paper are specially designed for the sample quantities, distributions and shapes of imbalanced datasets, for two reasons. One is that the union information granules constructed separately for the majority class and the minority class are capable of showing the difference between the two classes: the union information granule constructed for the majority class contains many more information granules and occupies more Minkowski space than that for the minority class, which ensures that the geometric characteristics of the two highly different classes are captured separately. The other is that the information granules composing each union information granule are produced based on the Minkowski distance with various values of $p$, which gives the generated information granules various geometric shapes. By adjusting the value of the parameter $p$, we can explore the geometric structure of imbalanced data from different perspectives and thus capture the data features more accurately. In summary, the proposed Minkowski distance-based method shows some unique advantages over the other four classification methods in dealing with imbalanced datasets: (i) the condition parts of the models constructed with our method are information granules, which can disclose the geometric characteristics of imbalanced datasets; (ii) the information granules constructed with different values of the Minkowski distance parameter $p$ realize a multi-perspective description of both the majority and minority classes of imbalanced datasets, which helps to improve the classification performance of the constructed models; and (iii) the reasoning process of the granular classification models is based on the datasets themselves, achieving satisfactory classification performance without data sampling preprocessing or additional optimization methods.