Article

A New Under-Sampling Method to Face Class Overlap and Imbalance

by Angélica Guzmán-Ponce 1,†, Rosa María Valdovinos 1,†, José Salvador Sánchez 2,† and José Raymundo Marcial-Romero 1,*,†

1 Facultad de Ingeniería, Universidad Autónoma del Estado de Mexico, Cerro de Coatepec s/n, Ciudad Universitaria, Toluca 50100, Mexico
2 Department of Computer Languages and Systems, Institute of New Imaging Technologies, Universitat Jaume I, 12071 Castelló de la Plana, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2020, 10(15), 5164; https://doi.org/10.3390/app10155164
Submission received: 7 May 2020 / Revised: 29 May 2020 / Accepted: 1 June 2020 / Published: 27 July 2020

Abstract:
Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum spanning tree algorithm to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.

1. Introduction

The class imbalance problem is a challenging situation common to many real-world applications such as fraud detection [1], fault/failure diagnosis [2], face recognition [3], text classification [4], sentiment analysis [5], and credit risk prediction [6], among others. A binary data set is said to be imbalanced when one of the classes (the minority or positive class, C+) has a significantly lower number of instances than the other class (the majority or negative class, C−) [7]. The disproportion between the number of positive and negative instances leads to a bias towards the majority class that may cause a serious deterioration of the classification performance on the minority class.
Many authors have asserted that the class imbalance distribution by itself does not represent a critical problem for classification, but when it is associated with other data complexity factors, then it can significantly decrease the classification performance because traditional classifiers tend to err on many positive instances [8]. García et al. [9] observed that the combination of class imbalance and highly overlapping class distributions results in a significant performance deterioration of instance-based classifiers. Class overlap refers to ambiguous regions of the feature space where the prior probabilities of the classes are approximately equal (i.e., all classes have a similar amount of data), which makes most classifiers more likely to produce a large number of misclassifications [10].
Class noise, which refers to mislabeled instances, can also lead to incorrect classification decisions, especially when combined with class imbalance [11]. The presence of both noise and class imbalance can significantly affect the classification process: mislabeling from the minority class to the majority class increases the imbalance ratio in the data set, whereas mislabeling from the majority class to the minority class produces an even higher difficulty to correctly classify the positive instances [12]. Both class overlap and noise have been commonly tackled through the application of data filtering methods, which are intended to identify and remove mislabeled and atypical data and also to clean possible overlapping between classes [13,14].
The different strategies to address the class imbalance problem are usually categorized into four groups [7,15]:
  • Algorithmic-level methods. These consist in internally biasing the discrimination-based process so as to compensate for the class imbalance.
  • Data-level methods. These perform some sort of data preprocessing with the aim of reducing the imbalance ratio.
  • Cost-sensitive methods. These incorporate distinct misclassification costs into the classification process and assign higher misclassification costs to the errors on the minority class.
  • Ensemble-based techniques. These consist of a combination between an ensemble algorithm and either the data-level or the cost-sensitive approaches. In the first one, data preprocessing is performed before training the classifier, whereas in the second one, the low cost guides the use of the ensemble algorithm.
Conclusions about the best solution for the class imbalance problem are divergent, but data-level methods are probably the most investigated because they are not classifier-dependent, do not need any algorithmic adaptation to the data sets, and can be easily implemented for any problem [16]. Thus, focusing on the data-level approach, resampling methods adjust the data set size with the aim of balancing the class distribution, either by decreasing the number of majority class instances (under-sampling) or increasing the number of minority class instances (over-sampling). A hybrid resampling approach combines both over-sampling and under-sampling strategies. As remarked by several authors, under-sampling methods may throw away many potentially useful data as a result of ignoring most of the majority class, whereas over-sampling methods can lead to overfitting on the minority class and an increase in complexity and execution time [17]. On the other hand, it is worth pointing out that resampling methods have also been used to face the imbalance problem together with class overlap and/or the presence of noisy and borderline instances in the data set [8,18,19,20,21], a common situation that has motivated the research presented in this paper.
Although most traditional resampling methods are based on the k nearest neighbor (kNN) rule [22,23,24], clustering algorithms have been demonstrated to be an even more powerful strategy for addressing the class imbalance problem [16,25,26] because they may reduce the risk of removing potentially useful data and, therefore, can achieve better results than kNN-based resampling methods. On the other hand, despite the fact that over-sampling has been shown to be apparently superior to under-sampling [27,28], the under-sampling approach can be especially useful when dealing with big data applications [29] since these algorithms reduce the data set size and thereby decrease the computational requirements of the subsequent classification process. Bearing both these issues in mind, the present paper introduces a new under-sampling technique (here called DBMIST-US) to face not only the class imbalance but also the overlapping between classes. It deals with the noisy instances in the majority class via a noise filter based on the DBSCAN clustering algorithm [30], combined with a minimum spanning tree (MST) algorithm to reduce the size of the negative class. The reason to combine DBSCAN with the MST approach is that DBSCAN has proven to be a powerful tool for identifying and removing noisy instances and cleaning the overlapping between classes [31], but it does not produce a well-balanced class distribution. By viewing the data set as a weighted complete graph, the MST algorithm allows for discovering the core of the majority class, which is further used to remove as many redundant negative instances as needed to balance both classes.
Hereafter, the remainder of this paper is organized as follows. Section 2 introduces several well-known methods to tackle the class imbalance problem. Details of the DBMIST-US algorithm here proposed are described in Section 3. The experimental setup is reported in Section 4. In Section 5, the results are analyzed and discussed. Finally, Section 6 presents the main conclusions and outlines some open lines to be addressed in the future.

2. Resampling Algorithms to Face Class Imbalance

In this section, a collection of both kNN-based and clustering-based algorithms for dealing with the class imbalance problem is briefly described.

2.1. Neighborhood-Based Algorithms

Most kNN-based under-sampling algorithms are inherited from the literature on prototype selection. These techniques exploit the analysis of the feature space, either by removing instances far from the decision boundary of two classes to reduce redundancy (condensing or thinning) or removing those close to the boundary for generalization (editing, cleaning, or filtering) [32]. Wilson’s editing [33] and Hart’s condensing [22] are the most representative examples of these families of prototype selection methods.
Wilson’s editing (ENN) removes all instances that are misclassified by the kNN rule. Hart’s condensing (CNN) algorithm uses the concept of consistent subset to eliminate the majority class instances that are sufficiently far away from the decision boundary, because these are considered irrelevant for learning. Analogously, Tomek links (TL) [23,34] have also been employed to remove majority class instances since, if two instances form a Tomek link, then either one of them is noise or both are borderline.
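To make these neighborhood-based filters concrete, the following sketch implements Wilson's editing and Tomek-link detection with a brute-force neighbor search (the function names and the k = 3 default are our own choices, not part of the original algorithms' specifications):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Wilson's editing (ENN): keep only instances whose label agrees
    with the majority of their k nearest neighbors."""
    X, y = np.asarray(X, float), np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        nn = np.argsort(d)[:k]
        if np.sum(y[nn] == y[i]) > k // 2:
            keep.append(i)
    return np.array(keep)

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors with different labels (Tomek links)."""
    X, y = np.asarray(X, float), np.asarray(y)
    links = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))              # nearest neighbor of i
        d2 = np.linalg.norm(X - X[j], axis=1)
        d2[j] = np.inf
        if int(np.argmin(d2)) == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links
```

In an under-sampling setting, only the majority-class member of each Tomek link would be removed, as in OSS.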
The one-sided selection (OSS) technique developed by Kubat and Matwin [35] eliminates only the majority class instances that are redundant or noisy (i.e., those that are bordering the minority class). The concept of Tomek links is used to discover the border instances, whereas the Hart’s condensing algorithm is applied to discover the redundant cases (i.e., those that are far enough from the decision boundary).
Laurikkala [24] proposed the neighborhood cleaning rule (NCL) algorithm, which is quite similar to the OSS technique. With NCL, the majority class instances whose class label differs from the class of at least two of their three nearest neighbors are removed by means of Wilson’s editing [33]. In addition, this algorithm also discards the neighbors that belong to the majority class if a minority class instance is misclassified by its three nearest neighbors.

2.2. Clustering-Based Algorithms

Clustering-based under-sampling methods have become a well-grounded alternative to neighborhood-based algorithms for handling class imbalance because they can mitigate the problem of information loss caused by the removal of negative instances [16].
Yen and Lee introduced a cluster-based under-sampling algorithm named SBC (under-sampling based on clustering) [17], which rests upon the idea that a data set may contain various clusters and each cluster i may have a different ratio of majority class instances to minority class instances. Thus, all instances are initially grouped into K clusters; then, a number of negative instances is randomly selected from each cluster according to its ratio of majority class instances to minority class instances. Finally, the selected negative instances from the K clusters are combined with all the instances in the minority class. A similar under-sampling method is the algorithm proposed by Longadge et al. [36], which first clusters the majority class instances into K groups using the K-means algorithm and then selects |C+| × IRi majority class instances from each cluster i, where IRi denotes the imbalance ratio in cluster i. Note that the aim of this method is not to obtain a perfectly balanced class distribution, but to reduce the disproportion between the sizes of the majority and minority classes.
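A minimal sketch of this cluster-based selection idea is given below; the per-cluster quota (proportional to cluster size, so that the retained majority set is about target_ratio times the minority class size) is our own simplified reading of the scheme, and all names are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, X_min, k=3, target_ratio=1.0, seed=0):
    """Cluster the majority class into k groups and randomly keep a
    number of instances from each cluster proportional to its size,
    so the retained majority set is roughly target_ratio * |C+|.
    Returns the indices of the kept majority instances."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_maj)
    target = int(round(target_ratio * len(X_min)))
    keep = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        # each cluster contributes in proportion to its share of C-
        quota = max(1, int(round(target * len(idx) / len(X_maj))))
        keep.extend(rng.choice(idx, size=min(quota, len(idx)), replace=False))
    return np.sort(np.array(keep))
```

As in Longadge et al.'s method, the result is a reduced but not necessarily perfectly balanced majority class.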
The ClusterOSS algorithm introduced by Barella et al. [37] is an improvement of the OSS method. It starts by using the K-means algorithm to cluster the majority class instances. Then, the closest instances to the center of each cluster are used to start the application of OSS, removing borderline and noisy negative instances using the Tomek links. Unlike OSS, this technique does not start the under-sampling process at random, but defines how many and which instances should be chosen by applying the K-means clustering.
Sowah et al. proposed an under-sampling technique named CUST [38] whose main objective is to remove redundant and noisy instances, along with outliers, from the majority class. In a first stage, the algorithm removes noisy negative instances using a technique based on Tomek links. In a second stage, outliers and redundant negative instances are eliminated by using the K-means algorithm to cluster the remaining majority class instances.
Das et al. introduced the ClusBUS algorithm [39], which aims at discarding negative instances that lie in class overlapping regions. First, it clusters the entire data set into K clusters using the DBSCAN algorithm. Then, for those clusters that contain both positive and negative instances, the algorithm removes a number of majority class instances necessary to create a vacuum around the minority class instances. Tsai et al. presented the CBIS technique [40] based on clustering analysis and instance selection: the affinity propagation algorithm groups similar negative instances into K clusters, and then an instance selection process filters out instances in each cluster. Finally, all reduced K clusters together with the instances of minority class form a balanced data set.
Ng et al. proposed the diversified sensitivity-based under-sampling (DSUS) method [41]. It selects useful instances yielding high stochastic sensitivity with respect to the currently trained classifier. To guarantee that only instances close to the decision boundary are selected and preserve the data distribution, negative instances are clustered and only a representative instance of each cluster participates in the selection process.
Lin et al. developed two under-sampling strategies in which the K-means clustering technique is used [26]. Unlike the algorithms described previously, the number of clusters with negative instances is here defined to be equal to the size of the minority class. The first strategy uses the cluster centers to represent the majority class, whereas the second strategy employs the nearest neighbors of the centers. In the Fast-CBUS method introduced by Ofek et al. [16], the minority class is clustered into K clusters by the K-means algorithm and, for each cluster, a similar number of majority class instances close to the minority class instances are sampled.
It is also worth pointing out that some hybrid algorithms combine clustering with other techniques. For instance, the clustering-based balancing (ClusterBal) method proposed by Sun et al. [42] combines clustering with classifier ensembles: it divides the majority class instances into K subsets by clustering, and then all the minority instances are added to each subset to train a focused classifier. On the other hand, the cluster-based evolutionary under-sampling (CBEUS) [43] combines the K-means clustering applied to the majority class with a genetic algorithm.

3. The DBMIST-US Algorithm

In this section, we describe the new DBMIST-US algorithm for handling class overlap and imbalance by under-sampling the data set. The purpose is to obtain a subset of representative instances from the majority class while avoiding the risk of removing instances that may contribute to generating knowledge. The overall process of DBMIST-US is described in Algorithm 1, where IRmax is the only free parameter, representing the maximum imbalance ratio allowed. It starts by dividing a two-class imbalanced data set of n instances into a subset C− with the majority class instances and a subset C+ containing the minority class instances. After this initial step, the algorithm consists of two stages:
  • Cleaning stage (lines 3–7). The DBSCAN clustering method [30] is applied to the set with the negative instances to produce a noise-free subset.
  • Core stage (line 8). An MST is built to get a core representation from the majority class. The result is a subset of the majority class with less dispersion than the set obtained in the previous stage.
Algorithm 1 DBMIST-US
Input: DS = {p1, p2, …, pn}, IRmax
Output: DS′
 1: Split DS into two subsets C− and C+.
 2: C′ ← C−
 3: repeat
 4:   Compute ε and minPts.
 5:   C′ ← DBSCAN(C′, ε, minPts).
 6: until ε and minPts do not change
 7: DS′ ← C+ ∪ C′
 8: C″ ← MSTGraph(DS′, IRmax)
 9: DS′ ← C+ ∪ C″
DBSCAN is a clustering algorithm that assumes that clusters correspond to dense regions in the feature space. It can discover groups without any prior knowledge of the number of clusters in the data, and it works well with noisy data sets. DBSCAN requires two input parameters: ε defines the radius of the neighborhood region (i.e., how close instances should be to each other to be considered part of the same cluster) and minPts establishes the minimum number of instances that must lie in the ε-neighborhood to form a cluster. Unlike the original proposal [44], where the input values of ε and minPts are computed by clustering the data set with the K-means algorithm, our DBSCAN formulation only requires |C−| and |C+| to compute ε and minPts, respectively.
DBSCAN begins by randomly selecting a negative instance pi from the current set C′. If the analyzed instance pi does not have at least minPts neighboring instances within a distance ε, it is considered noisy and removed from the set C′. Note that the rationale behind this is to remove those instances that are too close to the decision boundary. This process is repeated until minPts and ε do not change. After applying the DBSCAN algorithm on C′, an MST is computed by the process MSTGraph described in Algorithm 2.
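The cleaning stage can be sketched with scikit-learn's DBSCAN, which labels noise instances with -1; note that this sketch takes eps and min_pts as plain parameters and does not reproduce how DBMIST-US derives them from |C−| and |C+|:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_clean(X_neg, eps=0.5, min_pts=5):
    """Cleaning-stage sketch: run DBSCAN on the majority class only
    and drop the instances it labels as noise (label -1)."""
    X_neg = np.asarray(X_neg, float)
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_neg)
    return X_neg[labels != -1]
```

In DBMIST-US this filtering would be repeated, recomputing ε and minPts, until the parameters stabilize.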
Algorithm 2 MSTGraph
Input: DS′, IRmax
Output: C″ ⊆ C′
 1: Split DS′ into two subsets C′ and C+.
 2: Build Gw = (V, E), a weighted complete graph from C′.
 3: S ← |C′|
 4: IR ← S / |C+|
 5: C″ ← [ ]
 6: MG ← incidenceMatrix(Gw)
 7: MST ← GetMST(Gw, MG)
 8: while IR > IRmax do
 9:   S ← S·σ²·Z² / (e²·(|C′| − 1) + σ²·Z²)
10:   IR ← S / |C+|
11: end while
12: for all {v, u} in E(MST) do
13:   C″ ← C″ ∪ {u}
14:   C″ ← C″ ∪ {v}
15:   if |C″| > S then
16:     return C″
17:   end if
18: end for

3.1. Core Stage

After cleaning the majority class with DBSCAN, a graph-based approach is used to compute a candidate core of the set C′. A weighted complete graph Gw = (V(G), E(G)) is built from the set of majority class instances C′ given by the DBSCAN-based stage as follows:
  • V(G) = {i | pi ∈ C′}, the set of vertices.
  • E(G) = {{v, u} | v, u ∈ V(G)}, the set of edges.
  • ∀ e = {v, u} ∈ E(G), w(e) = d(pv, pu), where d(pv, pu) is the Euclidean distance between pv and pu.
The purpose of computing an MST is to build a subgraph that connects all vertices in Gw without forming cycles (i.e., there is no path from a vertex back to itself) and whose sum of edge weights is the smallest. Figure 1 shows an illustrative example of an MST (b) computed from a weighted graph (a).
To build an MST, an incidence matrix MG (line 6) is used as input to Prim’s algorithm [45]. The process is as follows:
  • Choose an initial vertex v at random.
  • Select the edge e = {v, u} with the lowest weight in MG and mark v as visited. The next vertex to be analyzed is u.
  • Repeat Step 2, now taking u as the initial vertex of the edge e = {u, z}, as long as there exists a vertex z that has not been visited yet.
  • The MST is completed by backtracking over the already marked vertices until every vertex has been marked.
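The steps above can be sketched as a brute-force Prim's algorithm over the complete Euclidean graph (a didactic O(n³) version; the returned edge list follows the order in which vertices are reached):

```python
import numpy as np

def prim_mst(points):
    """Prim's algorithm on the weighted complete graph whose edge
    weights are Euclidean distances. Returns MST edges (v, u) in the
    order the vertices are reached from vertex 0."""
    pts = np.asarray(points, float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    visited = [0]
    edges = []
    while len(visited) < n:
        best = None
        # pick the cheapest edge leaving the visited set
        for v in visited:
            for u in range(n):
                if u not in visited and (best is None or dist[v, u] < best[2]):
                    best = (v, u, dist[v, u])
        edges.append((best[0], best[1]))
        visited.append(best[1])
    return edges
```

In DBMIST-US, only the first S vertices reached by this traversal would be kept to form the reduced majority class C″.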
By definition, an MST contains every vertex of the graph. Our proposal takes only a subset of the MST: the first S vertices visited are used to build the new majority class C″, where S is the majority class size required to satisfy the maximum imbalance ratio (lines 8–11). To guarantee a 95% confidence level in the imbalance ratio, the values chosen are Z = 1.96, σ = 0.5, and e = 0.05, according to the procedure given by Torres et al. [46].
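Under our reading of lines 8–11 of Algorithm 2, the candidate majority size S is repeatedly shrunk by the finite-population sample-size formula until the imbalance ratio constraint holds; a small sketch with the stated constants (the function name is ours):

```python
def target_majority_size(n_neg, n_pos, ir_max=2.0, Z=1.96, sigma=0.5, e=0.05):
    """Shrink the candidate majority size S with the sample-size
    formula S <- S*sigma^2*Z^2 / (e^2*(|C'| - 1) + sigma^2*Z^2)
    until the imbalance ratio S/|C+| drops to at most ir_max."""
    S = float(n_neg)
    while S / n_pos > ir_max:
        S = S * sigma**2 * Z**2 / (e**2 * (n_neg - 1) + sigma**2 * Z**2)
    return int(S)
```

For example, with |C′| = 1000 and |C+| = 50, two applications of the formula bring S down to 77, giving an imbalance ratio of about 1.54 ≤ 2.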

3.2. Time Complexity

The time complexity of an algorithm is estimated by counting the number of operations performed [47]. Thus, the time complexity of the DBMIST-US algorithm depends on its two stages: the run-time complexity of DBSCAN is O(n²) in the worst case [48], and the time complexity of MSTGraph is also O(n²), which depends on the time complexity of the following processes:
  • The computation of the incidence matrix from Gw takes n² steps.
  • The computation of an MST based on Prim’s algorithm takes n² steps.
Therefore, as these procedures are executed one after another, the overall complexity of DBMIST-US is O(n²) in the worst case.

3.3. Differences between DBMIST-US and Related Works

The algorithm proposed in this work differs from other related methods in several aspects. First, most developments focus on handling either class imbalance or class overlap separately, while DBMIST-US faces both intrinsic data complexities together. Although one can find a few works addressing both problems, in general, they consist of the application of different techniques in a sequential manner. For instance, Chen et al. designed a methodology based on the use of the NCL algorithm to remove the overlapping majority class instances followed by an ensemble to randomly under-sample the majority class [49]. Similarly, Koziarski and Wożniak proposed an energy-based approach to clean the neighborhoods of minority class instances combined with an over-sampling procedure [50]. Other methods have been also proposed to handle imbalanced data sets with noisy and borderline instances together, but they are mostly based on modifying some previous resampling algorithm; in this line, for instance, Sáez et al. proposed an extension of the well-known SMOTE over-sampling algorithm through the inclusion of an iterative ensemble-based noise filter [8].
Regarding the differences between DBMIST-US and the existing clustering-based methods, it is worth pointing out that most under-sampling techniques rely upon the K-means and fuzzy K-means algorithms [16,26,36,37,38,40,41,43,51]. However, it is well-known that K-means may not be sufficiently effective when applied to imbalanced data because it always generates clusters with similar sizes [52]. Another weakness of these clustering algorithms is that they assume some prior knowledge of the number of clusters. To overcome these shortcomings, our proposal employs the DBSCAN algorithm in the filtering stage. Some of the advantages of using DBSCAN are: (i) it can identify arbitrarily shaped clusters and detect noise, and (ii) the number of clusters is not a user-specified parameter. In addition, an MST-based approach is used to under-sample the majority class, which, to the best of our knowledge, is the first use of an MST to face the imbalance problem. A related approach employs the Gabriel graph and the relative neighborhood graph, which are supergraphs of the Euclidean MST, to search for neighbors in SMOTE [53].
Finally, differences between DBMIST-US and other resampling methods that use DBSCAN can also be remarked. Sanguanmak and Hanskunatai proposed the hybrid DBSM algorithm, which employs DBSCAN for under-sampling and the SMOTE technique for over-sampling [54]; here, DBSCAN is used to create clusters from the whole training set, and then 50% of the majority class instances are eliminated from each cluster. Another approach is the DBSMOTE algorithm [55], which integrates DBSCAN and SMOTE; DBSCAN is applied to discover arbitrarily shaped clusters, and SMOTE generates synthetic positive instances around the core of the minority class rather than at the border. The main difference between our proposal and these two methods is that, unlike DBMIST-US, both rely on over-sampling. On the other hand, the DBMUTE technique combines DBSCAN with the under-sampling MUTE algorithm [56] to discover and remove majority class instances from the overlapping region [57], but it is not intended to identify and remove noisy instances.

4. Experimental Set-Up

Extensive experiments were conducted to evaluate the usefulness of the DBMIST-US method when compared to a pool of state-of-the-art under-sampling techniques. In summary, the research questions to investigate in this experimental study were:
Q1.
What is the classification performance of DBMIST-US in comparison to several state-of-the-art under-sampling algorithms?
Q2.
How robust is DBMIST-US across the use of different classification models?
Q3.
What is the impact of each under-sampling algorithm on the imbalance ratio?

4.1. Data Sets

We conducted the experiments on 20 real-life data sets (Table 1) and 12 synthetic data sets taken from the KEEL Data Set Repository (https://sci2s.ugr.es/keel/imbalanced.php#subA). All data sets have two classes and various levels of imbalance (the imbalance ratio ranges from 9.35 to 82).
The experiments on the synthetic data were carried out on three databases with different shapes of the minority class (subclus, clover, and paw) whose instances are randomly and uniformly distributed in a two-dimensional feature space. In all cases, the positive instances are uniformly surrounded by the majority class. Each database comprises 800 instances with an imbalance ratio of 7 and different levels of random noise (0%, 30%, 50%, and 70%) on both classes: data from both the majority and minority classes were randomly mislabeled and some single feature-values were randomly changed.
In subclus, the positive instances are located inside rectangles that form small disjuncts. Clover represents a more complex, nonlinear problem, where the minority class resembles a flower with elliptic petals. In a paw database, the minority class is decomposed into three elliptic sub-regions of varying cardinalities, where two sub-regions are located close to each other, and the remaining smaller sub-region is separated. Figure 2 illustrates two examples of these data sets: 0% and 70% of noise.

4.2. Reference Under-Sampling Algorithms and Classifiers

Besides the DBMIST-US method with a maximum imbalance ratio set to 2, we also implemented several state-of-the-art under-sampling algorithms already described in Section 2: CNN, ENN, TL, NCL, OSS, SBC, and ClusterOSS. In addition, the experiments also included a collection of methods that are popular in the class imbalance literature (see Table 2).
To investigate the robustness of the experimental results and obtain fair comparisons between the under-sampling algorithms, we evaluated the performance across three popular classification models of different nature: an instance-based learner (nearest neighbor, 1NN), a decision tree (non-pruned J48), and a linear classifier (support vector machine with a linear kernel, SVM). The parameters of these classifiers were kept at the default values of the WEKA open source machine learning software [61] throughout the experiments.
The evaluation of the under-sampling algorithms was estimated by 10-fold cross-validation in order to avoid biased results [62]. Each original data set was randomly divided into 10 stratified parts: for each fold, nine blocks were used as the training set and the remaining block for testing each of the three classifiers, which were trained on the data preprocessed by the different under-sampling strategies. Finally, the performance measures of each method were averaged over the 10 trials.
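The protocol can be sketched as follows; this is a hedged stand-in (scikit-learn's StratifiedKFold and a 1NN classifier instead of the WEKA setup), where the under-sampler is any function mapping a training set to its reduced version and is applied to the training folds only:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(undersample, X, y, n_splits=10, seed=0):
    """10-fold stratified CV: under-sample each training fold, train
    1NN on the reduced set, and score the untouched test fold with the
    geometric mean of TPR and TNR. Returns the mean G-mean."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        Xr, yr = undersample(X[tr], y[tr])      # preprocess training data only
        clf = KNeighborsClassifier(n_neighbors=1).fit(Xr, yr)
        yp = clf.predict(X[te])
        tpr = np.mean(yp[y[te] == 1] == 1)
        tnr = np.mean(yp[y[te] == 0] == 0)
        scores.append(np.sqrt(tpr * tnr))
    return float(np.mean(scores))
```

Passing the identity function as the under-sampler reproduces the non-preprocessed baseline used in the tables.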

4.3. Evaluation Metrics

For a two-class problem, Table 3 shows a 2 × 2 confusion matrix where each entry contains the number of correct/incorrect classifications [63]. Thus, the true-positive (TP) and true-negative (TN) values represent the number of instances from each class that were correctly classified, whereas the false-positive (FP) and false-negative (FN) values indicate the number of negative and positive instances misclassified, respectively.
From this confusion matrix, two plain performance evaluation indices can be easily calculated:
  • True-positive rate (or sensitivity). It measures the proportion of positive instances correctly classified, TPR = TP / (TP + FN).
  • True-negative rate (or specificity). It measures the proportion of negative instances correctly classified, TNR = TN / (TN + FP).
The assessment of classification performance is often carried out using more powerful measures that can be derived from straightforward indices such as the true-positive and true-negative rates. The most popular measure is the overall accuracy, but it is not appropriate when the prior class probabilities are very different because it does not consider misclassification costs and is strongly biased towards the majority class [64]. Thus, the performance of classifiers in the context of class imbalance has commonly been evaluated using other metrics, such as the geometric mean, the F-measure, and the area under the ROC curve. For the present experimental study, the geometric mean was chosen since it represents a good trade-off between the accuracy rates measured on each class:
Gmean = √(TPR × TNR)
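As a worked example, the geometric mean can be computed directly from the per-class rates; the helper below is a straightforward implementation for binary labels:

```python
import numpy as np

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR) for a
    two-class problem, computed from the confusion-matrix counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == positive, y_true != positive
    tpr = np.mean(y_pred[pos] == positive) if pos.any() else 0.0
    tnr = np.mean(y_pred[neg] != positive) if neg.any() else 0.0
    return float(np.sqrt(tpr * tnr))
```

Because it multiplies the two rates, a classifier that misclassifies every minority instance (TPR = 0) scores 0 regardless of its accuracy on the majority class, which is exactly the behavior exploited in the SVM analysis of Section 5.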

5. Results

In this section, we report the experimental results obtained by DBMIST-US and several state-of-the-art under-sampling algorithms, pursuing to answer the three research questions stated in Section 4.

5.1. Classification Performance Comparison with State-of-the-Art Methods

The aim of the first block of this experimental study was to compare the performance of DBMIST-US with that of reference methods to assess its competitiveness (Q1), and also to analyze its robustness across classification models (Q2). Table A1, Table A2 and Table A3 in Appendix A report the geometric mean (averaged across the 10 runs) of each under-sampling algorithm using the three classifiers on all the data sets. The classification results on the original (non-preprocessed) data sets were also included as a baseline for performance assessment and usefulness testing.
Figure 3 displays the average Friedman rank of each under-sampling algorithm for each classifier. From these graphs, one can observe that DBMIST-US achieved the lowest average ranks regardless of the classifier used, which means that this method performed the best on both the real-life and the synthetic data sets. In addition, as expected, under-sampling the data led to an improvement in classification performance, except when using the CNN algorithm; in general, CNN showed the highest average Friedman ranks (i.e., the worst performance as measured by the geometric mean), especially with the 1NN and J48 classifiers.
It is also interesting to draw attention to the behavior of CNN, NCL, TL, ENN, OSS, RUSBOOST, and SBC with the SVM classification model since the geometric mean obtained by these algorithms was 0 for many of the experimental data sets, as can be seen in Table A3. Taking into account that the geometric mean on the respective non-preprocessed data sets was also 0, this behavior indicates that those under-sampling algorithms could not handle either the class overlap or the class imbalance or even both correctly. To gain some insight into this situation, Figure 4 shows the TPR (red triangles) and TNR (black squares) obtained by using SVM on each under-sampled synthetic data set. From these spider plots, one can observe that the bad performance of most of those methods comes from a true-positive rate equal to 0, which means that the SVM model misclassified all instances in the minority class.
To shed some light on why the true-positive rate given by those under-sampling algorithms was 0, Figure 5 plots the original paw-50 data set and the under-sampled sets obtained when using either a neighborhood-based method (TL) or a clustering-based method (SBC). One can observe that these algorithms could not solve the class imbalance problem and, consequently, classification by SVM remained biased towards the majority class. On the other hand, the plots show that some overlap between the classes still exists, indicating that not all noisy negative instances could be removed from the data sets.
A particularly interesting problem emerges with the ClusterOSS method, where the performance on the minority class improved significantly, but at the cost of a severe degradation of the true-negative rate. In Figure 4, the spider plot for the paw-50 data set brings up an especially dramatic case, since the TNR was equal to 0. For a deeper understanding of why such a loss of performance in the true-negative rate arose, the scatter plots in Figure 6 display the under-sampled sets produced by ClusterOSS and DBMIST-US. As can be seen, both methods eliminated the problem of class overlap by completely cleaning the decision boundary. However, the removal of many negative instances by the ClusterOSS algorithm interchanged the roles of the classes (i.e., the majority class became the minority one) and placed the negative instances further from each other than from the positive instances, whereas DBMIST-US behaved appropriately by producing a perfectly balanced data set and keeping a sufficient number of useful negative instances in well-defined clusters.
For further confirmation of the behavior of DBMIST-US just discussed, the scatter plots in Figure 7 depict the three synthetic databases with 70% of noise in their original class distribution and after being under-sampled by the DBMIST-US algorithm. Although all of these correspond to problems with high overlap between classes, the under-sampling method was able to properly eliminate the noisy and borderline instances of the majority class and clean the decision boundary. In addition, DBMIST-US considerably reduced the disproportion between positive and negative instances.

5.2. Statistical Significance Analysis

The Wilcoxon’s paired signed-rank test was applied to find statistically significant differences between each pair of under-sampling algorithms at a significance level of α = 0.05. In Figure 8, we present the results of a wins-ties-losses analysis, showing the number of data sets on which DBMIST-US obtained statistically significantly better, equal, or worse performance than the reference algorithms on a pairwise basis, for the three classifiers considered in this experimental study.
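Each pairwise comparison behind the wins-ties-losses counts can be sketched as follows. The scores below are synthetic stand-ins (not the actual experimental results), and the implementation assumes SciPy's `scipy.stats.wilcoxon`, which performs the paired signed-rank test:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-data-set G-mean scores for DBMIST-US and one hypothetical
# reference method over the same 32 data sets (paired samples).
rng = np.random.default_rng(0)
reference = rng.uniform(40, 80, size=32)
dbmist_us = reference + rng.normal(5, 3, size=32)  # consistently higher

# Paired, two-sided Wilcoxon signed-rank test at alpha = 0.05.
stat, p_value = wilcoxon(dbmist_us, reference)
alpha = 0.05
verdict = "significant" if p_value < alpha else "not significant"
print(f"W = {stat:.1f}, p = {p_value:.4f} -> {verdict}")
```

In the actual study, repeating this test for every method pair and counting significant wins, ties, and losses per classifier produces the bars shown in Figure 8.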
As can be seen, DBMIST-US significantly outperformed the state-of-the-art under-sampling methods on a majority of databases, independently of the classifier used. This result suggests that the DBMIST-US algorithm is robust across classification models of a different nature.
Figure 9 summarizes the Wilcoxon’s test to show whether or not there exists any significant difference between DBMIST-US and the reference methods for the three classifiers. In this case, the bars represent how many times an algorithm was significantly better than, equal to, or worse than the remaining techniques. The most important finding is that DBMIST-US was significantly the best under-sampling algorithm for all classifiers, which once again highlights the robustness of this method.

5.3. Evaluation of the Impact on the Imbalance Ratio

This block of the experimental study aimed to investigate the impact of each under-sampling algorithm on the imbalance ratio (Q3), that is, to check the ability of each method to provide well-balanced data sets.
Table 4 reports the imbalance ratio of each under-sampled data set. Initially, the imbalance ratio of the 20 real-life data sets ranged between 9.35 and 82, whereas that of the synthetic databases was 7. After under-sampling the original sets, RUS, EUS, EE, and BC gave perfectly balanced data sets with an imbalance ratio equal to 1; conversely, most data sets resampled by NCL, TL, ENN, OSS, and SBC remained heavily class-imbalanced, with an imbalance ratio very close to the original one. Regarding the DBMIST-US algorithm, imbalance ratios equal or close to 1 indicate that the resulting data sets were well-balanced, demonstrating the effectiveness of the method proposed here. Finally, the ClusterOSS method produced many data sets with an imbalance ratio less than 1, which means that the huge number of negative instances removed made the majority class smaller than the minority class, as already pointed out in Section 5.1.
Another way of evaluating the impact of each under-sampling algorithm on the imbalance ratio is to look at the percentage of average reduction of the imbalance ratio given in the last row of Table 4. The results reveal that the top-4 methods were RUS, EUS, EE, and BC with an average reduction of 91.55%, corresponding to the algorithms that produced perfectly balanced data sets (imbalance ratio equal to 1). However, DBMIST-US, with an average reduction of 91.29%, was very close to the top-4 methods. On the other hand, the average reductions achieved with NCL, TL, ENN, OSS, and SBC were extremely low, which suggests that these algorithms are not the most appropriate for under-sampling.
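The percentage of average reduction reported in Table 4 can be computed as shown below; the three IR values used here are illustrative stand-ins drawn from the ranges discussed above, not the full contents of Table 4:

```python
def imbalance_ratio(n_majority, n_minority):
    """IR = number of majority instances / number of minority instances."""
    return n_majority / n_minority

def avg_ir_reduction(original_irs, resampled_irs):
    """Percentage of average reduction of the IR across data sets."""
    reductions = [100.0 * (o - r) / o for o, r in zip(original_irs, resampled_irs)]
    return sum(reductions) / len(reductions)

# Hypothetical example: three data sets whose IRs drop close to 1
# after under-sampling.
original = [9.35, 23.79, 82.00]
after = [1.00, 1.05, 1.00]
print(f"{avg_ir_reduction(original, after):.2f}%")
```

Note that a resampled IR below 1 (as produced by ClusterOSS on several data sets) would push the per-data-set reduction above 100%, signaling that the class roles have been interchanged.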

6. Conclusions

In this work, we have proposed a new under-sampling method called DBMIST-US to face class overlap and class imbalance through the combination of the DBSCAN clustering algorithm with the generation of an MST. The goal of the proposed technique is to handle both data complexities simultaneously using a two-stage algorithm. In the first stage, the decision boundary is cleaned by removing negative instances identified as noise, whereas the second stage balances the classes by selecting representative instances from the majority class according to a maximum imbalance ratio predefined by the user.
As already introduced in Section 3.3, several differences between DBMIST-US and some recent related under-sampling algorithms can be highlighted: (i) DBMIST-US faces class overlap and class imbalance simultaneously, whereas other methods are designed to tackle these data complexities separately or by applying a series of techniques in a sequential manner, (ii) the baseline clustering-based under-sampling approaches are based on the classical K-means algorithm, whereas our proposal combines DBSCAN with MST, and (iii) the algorithms that rest upon the use of DBSCAN are primarily developed for over-sampling the minority class instead of under-sampling the majority class.
We carried out a set of experiments to validate the performance of the DBMIST-US method against 12 state-of-the-art under-sampling algorithms using three supervised classifiers (1NN, J48, and SVM), tested on both real-life and synthetic data sets with an imbalance ratio ranging from 7 to 82. The results have shown that the proposed method performs well on noisy and borderline problems and better than the remaining algorithms when applied to heavily imbalanced data sets, especially with the SVM classification model. Another research question properly answered in the experimental study refers to robustness, showing that DBMIST-US is highly robust to the use of classifiers of a different nature. It is also worth noting the percentage of average reduction of the imbalance ratio achieved by DBMIST-US, which was comparable to the reduction produced by the best reference algorithms.
The promising results of the DBMIST-US algorithm encourage us to examine some directions for future research. The most immediate issue relates to the generalization of DBMIST-US for handling the multi-class imbalance problem, which emerges as a more challenging situation in many practical applications. A second subject that deserves further research is the multi-label classification of imbalanced data, where the classes are not mutually exclusive; this represents a much more complex classification task because an instance may belong to more than one class. Finally, another avenue for investigation is to analyze how the DBMIST-US algorithm can be used on large-scale data and data streams.

Author Contributions

A.G.-P., R.M.V., and J.S.S. designed the study; A.G.-P. implemented the algorithms and carried out the experiments; J.S.S. and J.R.M.-R. performed the analysis; J.S.S. wrote the manuscript; J.S.S. and R.M.V. revised and validated the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Universitat Jaume I under grant [UJI-B2018-49], the 5046/2020CIC UAEM project and the Mexican Science and Technology Council (CONACYT) under scholarship [702275].

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DBSCAN      Density-Based Spatial Clustering of Applications with Noise
MST         Minimum Spanning Tree
DBMIST-US   DBSCAN and MST Under-Sampling
IR          Imbalance Ratio
kNN         k-Nearest Neighbor
SVM         Support Vector Machine
SMOTE       Synthetic Minority Over-sampling Technique

Appendix A. Classification Results

This appendix summarizes the classification performance (geometric mean) of each under-sampling algorithm using 1NN, J48, and SVM.
Table A1. Geometric mean results obtained with 1NN.
Data Set | Original | RUS | CNN | NCL | TL | ENN | OSS | EUS | EE | BC | RUSBOOST | SBC | ClusterOSS | DBMIST-US
yeast-0-5-6-7-9_vs_4 | 61.2 | 73.5 | 50.7 | 83.0 | 69.0 | 78.9 | 72.2 | 78.4 | 73.2 | 73.5 | 76.7 | 77.1 | 58.9 | 75.4
glass-0-1-6_vs_2 | 47.0 | 74.4 | 51.0 | 66.9 | 52.6 | 67.9 | 52.9 | 68.3 | 76.2 | 73.5 | 64.1 | 53.1 | 33.0 | 76.3
glass2 | 40.7 | 76.2 | 51.6 | 53.1 | 41.0 | 67.1 | 47.4 | 64.7 | 60.0 | 67.6 | 68.7 | 53.1 | 66.9 | 76.9
shuttle-c0-vs-c4 | 99.6 | 99.6 | 100.0 | 99.6 | 99.6 | 99.6 | 100.0 | 99.6 | 99.6 | 99.6 | 99.6 | 99.6 | 98.5 | 99.6
yeast-1_vs_7 | 62.3 | 74.8 | 46.2 | 76.8 | 67.6 | 70.4 | 69.9 | 66.6 | 59.6 | 58.1 | 65.9 | 67.2 | 62.4 | 69.2
glass4 | 86.6 | 79.9 | 65.3 | 91.3 | 86.6 | 87.7 | 83.6 | 80.7 | 84.3 | 84.3 | 87.4 | 86.6 | 56.9 | 92.1
ecoli4 | 86.3 | 95.0 | 81.4 | 89.4 | 86.3 | 92.2 | 86.2 | 92.5 | 100.0 | 100.0 | 92.2 | 91.9 | 98.9 | 100.0
page-blocks-1-3_vs_4 | 98.2 | 98.2 | 94.5 | 98.2 | 98.2 | 98.2 | 98.2 | 94.5 | 94.6 | 96.4 | 96.6 | 99.9 | 92.2 | 98.2
glass-0-1-6_vs_5 | 80.9 | 77.0 | 65.1 | 100.0 | 80.9 | 100.0 | 79.3 | 94.3 | 94.3 | 94.3 | 91.0 | 94.0 | 78.1 | 95.4
yeast-1-4-5-8_vs_7 | 44.1 | 64.5 | 39.7 | 62.7 | 47.9 | 48.0 | 54.2 | 69.7 | 69.3 | 60.0 | 65.1 | 57.0 | 64.9 | 61.3
glass5 | 81.1 | 94.3 | 75.2 | 93.8 | 81.2 | 94.0 | 79.8 | 88.2 | 72.0 | 72.0 | 91.2 | 93.3 | 44.3 | 92.7
yeast-2_vs_8 | 77.0 | 59.8 | 58.3 | 80.6 | 77.3 | 80.5 | 77.4 | 54.8 | 77.5 | 72.5 | 75.1 | 80.2 | 71.5 | 74.5
flare-F | 30.3 | 81.4 | 30.6 | 65.9 | 33.7 | 72.6 | 26.2 | 78.9 | 76.6 | 68.5 | 77.3 | 26.1 | 0.0 | 87.3
yeast4 | 58.8 | 83.3 | 43.7 | 80.1 | 62.2 | 71.3 | 58.9 | 84.3 | 77.4 | 82.4 | 77.1 | 78.6 | 75.4 | 79.8
yeast-1-2-8-9_vs_7 | 47.6 | 56.6 | 42.5 | 65.4 | 54.2 | 60.4 | 60.0 | 64.5 | 63.2 | 69.9 | 65.0 | 47.6 | 69.2 | 66.1
yeast5 | 82.1 | 100.0 | 65.5 | 96.3 | 87.6 | 94.1 | 88.7 | 93.2 | 92.0 | 96.5 | 94.6 | 82.1 | 92.5 | 97.1
ecoli-0-1-3-7_vs_2-6 | 83.7 | 78.2 | 48.8 | 84.4 | 84.0 | 84.4 | 84.2 | 92.6 | 71.4 | 71.4 | 81.9 | 84.2 | 83.7 | 87.3
abalone-17_vs_7-8-9-10 | 50.6 | 77.6 | 44.8 | 66.7 | 45.3 | 61.4 | 50.7 | 73.9 | 78.2 | 73.2 | 78.6 | 58.3 | 44.8 | 78.9
yeast6 | 71.1 | 69.9 | 54.8 | 86.1 | 79.0 | 82.7 | 80.6 | 81.0 | 75.1 | 81.4 | 83.7 | 71.2 | 88.0 | 81.1
poker-8-9_vs_5 | 19.9 | 53.1 | 14.8 | 20.0 | 20.0 | 20.0 | 20.0 | 61.2 | 45.3 | 63.5 | 65.1 | 34.5 | 64.0 | 80.0
subcl-0 | 86.2 | 89.9 | 66.1 | 91.1 | 90.4 | 95.1 | 87.8 | 94.5 | 90.4 | 92.5 | 89.5 | 91.7 | 88.6 | 98.1
subcl-30 | 72.0 | 72.5 | 58.0 | 78.9 | 75.3 | 83.4 | 75.5 | 74.5 | 83.5 | 77.0 | 77.3 | 83.3 | 87.2 | 90.6
subcl-50 | 60.5 | 69.0 | 46.0 | 73.6 | 68.2 | 81.9 | 68.3 | 74.0 | 71.5 | 81.0 | 74.5 | 80.6 | 86.6 | 89.3
subcl-70 | 52.5 | 70.5 | 35.9 | 72.3 | 62.7 | 82.4 | 62.0 | 79.8 | 72.5 | 73.5 | 71.6 | 74.0 | 85.5 | 93.0
clover-0 | 89.1 | 90.4 | 59.5 | 98.7 | 94.8 | 96.8 | 95.1 | 89.9 | 89.4 | 93.4 | 91.6 | 98.1 | 87.6 | 100.0
clover-30 | 80.4 | 78.0 | 50.3 | 86.8 | 81.6 | 88.0 | 81.9 | 78.5 | 80.9 | 76.9 | 82.6 | 87.6 | 80.4 | 96.3
clover-50 | 70.0 | 74.5 | 48.6 | 83.1 | 75.9 | 87.5 | 75.6 | 75.8 | 75.4 | 76.9 | 76.3 | 85.9 | 81.6 | 98.5
clover-70 | 62.5 | 73.9 | 41.1 | 80.4 | 71.9 | 87.4 | 75.3 | 77.5 | 74.9 | 77.0 | 73.4 | 78.7 | 85.0 | 99.1
paw-0 | 93.5 | 94.9 | 60.3 | 98.4 | 96.7 | 98.5 | 95.6 | 97.0 | 93.0 | 92.4 | 95.1 | 98.7 | 71.9 | 97.3
paw-30 | 78.3 | 77.5 | 51.9 | 86.7 | 83.5 | 91.8 | 83.7 | 83.0 | 83.5 | 77.9 | 83.0 | 87.5 | 79.4 | 94.0
paw-50 | 74.1 | 81.5 | 46.3 | 86.0 | 81.8 | 89.4 | 83.1 | 82.8 | 83.0 | 87.0 | 79.9 | 86.1 | 45.9 | 98.1
paw-70 | 64.0 | 77.0 | 45.0 | 78.3 | 74.7 | 89.8 | 75.8 | 74.6 | 79.0 | 77.5 | 76.8 | 82.1 | 86.2 | 98.5
Table A2. Geometric mean results obtained with J48.
Data Set | Original | RUS | CNN | NCL | TL | ENN | OSS | EUS | EE | BC | RUSBOOST | SBC | ClusterOSS | DBMIST-US
yeast-0-5-6-7-9_vs_4 | 65.1 | 70.6 | 60.2 | 70.6 | 65.8 | 74.5 | 64.3 | 73.2 | 70.5 | 68.6 | 77.0 | 74.3 | 71.3 | 76.7
glass-0-1-6_vs_2 | 52.3 | 60.0 | 43.8 | 66.7 | 53.1 | 53.3 | 58.3 | 46.7 | 57.6 | 45.6 | 63.5 | 52.0 | 4.6 | 64.8
glass2 | 53.1 | 90.7 | 22.4 | 57.6 | 33.2 | 58.0 | 52.4 | 55.2 | 73.0 | 61.1 | 65.2 | 58.2 | 65.0 | 72.5
shuttle-c0-vs-c4 | 100.0 | 100.0 | 99.6 | 99.9 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.9 | 99.9 | 97.0 | 100.0
yeast-1_vs_7 | 54.3 | 64.5 | 57.2 | 57.2 | 65.2 | 62.6 | 54.2 | 71.6 | 58.3 | 81.5 | 67.4 | 62.8 | 68.4 | 69.7
glass4 | 82.2 | 73.0 | 73.0 | 82.1 | 82.2 | 86.6 | 79.3 | 88.4 | 92.3 | 92.3 | 89.1 | 72.4 | 9.0 | 92.3
ecoli4 | 82.9 | 87.5 | 92.3 | 80.1 | 82.9 | 80.1 | 86.0 | 76.5 | 87.5 | 92.5 | 86.3 | 77.1 | 93.5 | 82.8
page-blocks-1-3_vs_4 | 96.1 | 100.0 | 94.5 | 99.8 | 99.8 | 96.1 | 97.4 | 98.2 | 96.4 | 94.5 | 97.5 | 97.1 | 93.8 | 98.2
glass-0-1-6_vs_5 | 99.4 | 88.2 | 95.3 | 99.7 | 93.7 | 99.7 | 97.1 | 100.0 | 94.3 | 94.3 | 92.5 | 99.7 | 79.0 | 92.3
yeast-1-4-5-8_vs_7 | 0.0 | 59.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 66.6 | 60.6 | 44.7 | 53.1 | 25.8 | 64.3 | 55.0
glass5 | 98.8 | 100.0 | 85.3 | 99.7 | 99.5 | 99.7 | 98.5 | 94.3 | 94.3 | 94.3 | 93.9 | 100.0 | 49.5 | 95.8
yeast-2_vs_8 | 22.4 | 61.2 | 69.2 | 22.3 | 0.0 | 31.6 | 22.4 | 74.8 | 71.4 | 76.5 | 72.1 | 31.6 | 67.1 | 73.0
flare-F | 15.2 | 85.9 | 32.7 | 62.4 | 0.0 | 56.8 | 0.0 | 84.8 | 86.0 | 79.5 | 84.2 | 0.0 | 0.0 | 90.5
yeast4 | 53.9 | 78.3 | 0.0 | 60.7 | 57.4 | 65.4 | 53.9 | 83.3 | 80.3 | 84.1 | 78.2 | 68.4 | 80.9 | 79.3
yeast-1-2-8-9_vs_7 | 48.2 | 53.7 | 47.9 | 44.6 | 44.6 | 44.6 | 44.6 | 59.9 | 54.2 | 69.7 | 67.3 | 40.8 | 68.4 | 62.8
yeast5 | 86.3 | 96.6 | 77.8 | 96.4 | 89.0 | 93.9 | 88.9 | 87.4 | 94.1 | 89.8 | 94.2 | 88.9 | 90.2 | 96.6
ecoli-0-1-3-7_vs_2-6 | 84.4 | 71.4 | 70.4 | 84.5 | 84.4 | 84.5 | 83.5 | 70.0 | 78.2 | 78.2 | 74.5 | 84.4 | 72.4 | 94.3
abalone-17_vs_7-8-9-10 | 34.7 | 75.0 | 0.0 | 43.3 | 34.6 | 39.3 | 32.1 | 78.8 | 75.0 | 74.0 | 75.5 | 57.0 | 45.9 | 76.0
yeast6 | 73.3 | 72.8 | 70.2 | 77.2 | 73.3 | 73.5 | 67.3 | 82.6 | 71.4 | 88.6 | 81.9 | 71.4 | 80.4 | 84.6
poker-8-9_vs_5 | 0.0 | 47.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 44.0 | 40.0 | 42.3 | 44.6 | 0.0 | 52.5 | 59.6
subcl-0 | 97.8 | 92.3 | 77.9 | 97.8 | 97.8 | 98.3 | 97.2 | 95.5 | 93.9 | 88.6 | 92.3 | 96.6 | 87.3 | 94.4
subcl-30 | 66.7 | 75.9 | 74.0 | 71.4 | 70.2 | 87.5 | 68.5 | 75.3 | 83.2 | 79.7 | 83.5 | 87.3 | 89.4 | 85.2
subcl-50 | 26.4 | 78.0 | 31.3 | 74.8 | 53.4 | 84.8 | 68.2 | 74.7 | 80.8 | 82.0 | 80.9 | 83.2 | 83.7 | 82.6
subcl-70 | 0.0 | 76.5 | 0.0 | 32.6 | 24.4 | 72.6 | 14.1 | 77.8 | 80.7 | 73.0 | 79.9 | 80.1 | 83.5 | 86.7
clover-0 | 63.3 | 85.6 | 11.6 | 65.1 | 66.7 | 66.9 | 66.6 | 83.4 | 77.5 | 78.2 | 81.5 | 92.7 | 82.1 | 83.9
clover-30 | 26.4 | 74.6 | 59.2 | 42.9 | 45.1 | 71.5 | 28.2 | 68.9 | 79.1 | 79.6 | 81.7 | 84.0 | 52.7 | 97.4
clover-50 | 0.0 | 71.8 | 0.0 | 59.1 | 47.9 | 64.3 | 10.0 | 77.7 | 82.1 | 75.5 | 75.8 | 84.2 | 78.4 | 94.7
clover-70 | 0.0 | 68.2 | 27.2 | 22.3 | 0.0 | 67.7 | 0.0 | 71.4 | 80.1 | 62.6 | 75.3 | 82.2 | 80.4 | 97.4
paw-0 | 80.4 | 93.9 | 39.0 | 65.9 | 65.1 | 71.6 | 80.1 | 91.2 | 89.0 | 90.3 | 92.2 | 97.2 | 35.1 | 94.6
paw-30 | 19.9 | 83.4 | 60.1 | 79.7 | 72.5 | 87.5 | 60.7 | 79.2 | 83.2 | 82.8 | 82.3 | 87.6 | 76.0 | 91.1
paw-50 | 28.2 | 80.6 | 52.0 | 82.5 | 51.8 | 87.5 | 51.5 | 86.7 | 86.9 | 85.6 | 82.0 | 89.0 | 30.0 | 94.4
paw-70 | 0.0 | 78.4 | 0.0 | 63.9 | 32.9 | 88.4 | 34.2 | 77.3 | 82.1 | 78.2 | 82.0 | 90.8 | 76.8 | 92.8
Table A3. Geometric mean results obtained with SVM.
Data Set | Original | RUS | CNN | NCL | TL | ENN | OSS | EUS | EE | BC | RUSBOOST | SBC | ClusterOSS | DBMIST-US
yeast-0-5-6-7-9_vs_4 | 0.0 | 78.3 | 55.0 | 42.0 | 0.0 | 37.0 | 0.0 | 78.4 | 75.3 | 74.5 | 77.7 | 44.2 | 30.1 | 79.5
glass-0-1-6_vs_2 | 0.0 | 55.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 45.6 | 47.8 | 44.0 | 32.1 | 0.0 | 0.0 | 57.4
glass2 | 0.0 | 68.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 47.1 | 59.4 | 54.2 | 21.8 | 0.0 | 29.5 | 71.7
shuttle-c0-vs-c4 | 99.6 | 100.0 | 99.6 | 100.0 | 99.6 | 99.6 | 99.6 | 100.0 | 100.0 | 100.0 | 100.0 | 99.6 | 99.9 | 100.0
yeast-1_vs_7 | 0.0 | 81.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 83.3 | 74.1 | 74.8 | 64.5 | 0.0 | 12.7 | 76.6
glass4 | 39.2 | 68.8 | 40.7 | 39.2 | 39.2 | 39.2 | 39.2 | 84.3 | 84.3 | 84.3 | 87.6 | 39.2 | 0.0 | 92.0
ecoli4 | 63.2 | 92.5 | 92.3 | 74.2 | 63.2 | 74.2 | 70.7 | 97.5 | 97.5 | 100.0 | 94.9 | 74.2 | 99.0 | 100.0
page-blocks-1-3_vs_4 | 65.4 | 71.9 | 74.5 | 65.4 | 65.4 | 65.4 | 65.5 | 79.4 | 79.9 | 75.1 | 73.1 | 65.4 | 25.6 | 84.5
glass-0-1-6_vs_5 | 0.0 | 81.6 | 73.9 | 0.0 | 0.0 | 0.0 | 0.0 | 94.3 | 81.6 | 81.6 | 86.4 | 0.0 | 42.3 | 90.2
yeast-1-4-5-8_vs_7 | 0.0 | 60.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 58.9 | 67.8 | 49.4 | 47.3 | 0.0 | 60.5 | 62.0
glass5 | 0.0 | 88.2 | 75.2 | 0.0 | 0.0 | 0.0 | 0.0 | 88.2 | 81.6 | 81.6 | 81.8 | 0.0 | 43.1 | 91.2
yeast-2_vs_8 | 74.1 | 72.3 | 73.4 | 74.2 | 74.2 | 74.2 | 74.2 | 74.2 | 74.2 | 77.5 | 73.3 | 74.2 | 70.6 | 73.0
flare-F | 0.0 | 78.2 | 36.2 | 42.9 | 15.2 | 50.4 | 0.0 | 76.8 | 78.5 | 76.2 | 81.7 | 0.0 | 0.0 | 88.7
yeast4 | 0.0 | 82.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 85.2 | 82.3 | 84.3 | 80.9 | 0.0 | 24.6 | 82.3
yeast-1-2-8-9_vs_7 | 0.0 | 52.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 69.7 | 74.5 | 74.8 | 56.4 | 0.0 | 43.1 | 70.9
yeast5 | 0.0 | 98.9 | 70.9 | 33.7 | 0.0 | 39.9 | 30.2 | 92.9 | 94.1 | 94.1 | 95.8 | 0.0 | 92.6 | 96.3
ecoli-0-1-3-7_vs_2-6 | 84.1 | 78.2 | 78.7 | 84.2 | 84.0 | 84.2 | 83.5 | 92.6 | 84.5 | 84.5 | 84.1 | 84.0 | 82.8 | 92.6
abalone-17_vs_7-8-9-10 | 0.0 | 74.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 69.8 | 75.0 | 69.8 | 69.6 | 0.0 | 0.0 | 72.1
yeast6 | 0.0 | 78.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 88.4 | 81.3 | 87.1 | 87.7 | 0.0 | 86.1 | 86.3
poker-8-9_vs_5 | 0.0 | 51.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 44.9 | 37.9 | 35.8 | 30.1 | 0.0 | 2.5 | 62.1
subcl-0 | 0.0 | 51.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 54.5 | 47.0 | 56.8 | 0.0 | 0.0 | 0.0 | 67.5
subcl-30 | 0.0 | 41.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 54.0 | 41.9 | 50.4 | 0.0 | 0.0 | 76.3 | 59.4
subcl-50 | 0.0 | 47.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50.3 | 48.5 | 52.6 | 0.0 | 0.0 | 77.3 | 62.3
subcl-70 | 0.0 | 52.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 53.2 | 43.0 | 54.8 | 0.0 | 0.0 | 67.6 | 50.3
clover-0 | 0.0 | 54.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 47.1 | 52.0 | 48.4 | 0.0 | 0.0 | 29.5 | 93.5
clover-30 | 0.0 | 55.8 | 27.2 | 0.0 | 0.0 | 0.0 | 0.0 | 50.5 | 58.9 | 58.6 | 0.0 | 0.0 | 2.2 | 54.0
clover-50 | 0.0 | 51.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 58.6 | 51.4 | 56.8 | 0.0 | 0.0 | 74.5 | 53.3
clover-70 | 0.0 | 57.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 57.4 | 57.4 | 41.4 | 0.0 | 0.0 | 80.0 | 61.2
paw-0 | 0.0 | 64.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 47.4 | 39.4 | 57.6 | 0.0 | 0.0 | 0.0 | 85.8
paw-30 | 0.0 | 57.5 | 49.6 | 0.0 | 0.0 | 0.0 | 0.0 | 59.8 | 43.0 | 45.0 | 0.0 | 0.0 | 32.4 | 50.6
paw-50 | 0.0 | 61.4 | 51.3 | 0.0 | 0.0 | 0.0 | 0.0 | 62.0 | 54.1 | 56.7 | 0.0 | 0.0 | 0.0 | 61.9
paw-70 | 0.0 | 49.4 | 17.0 | 0.0 | 0.0 | 0.0 | 0.0 | 55.9 | 55.3 | 49.6 | 0.0 | 0.0 | 77.0 | 27.1

References

  1. Sun, Y.; Kamel, M.S.; Wong, A.K.; Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 2007, 40, 3358–3378. [Google Scholar] [CrossRef]
  2. Codetta-Raiteri, D.; Portinale, L. Dynamic Bayesian networks for fault detection, identification, and recovery in autonomous spacecraft. IEEE Trans. Syst. Man Cybern. Syst. 2015, 45, 13–24. [Google Scholar] [CrossRef]
  3. Zhang, Y.; Zhou, Z.H. Cost-sensitive face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1758–1769. [Google Scholar] [CrossRef] [PubMed]
  4. Liu, C.L.; Hsaio, W.H.; Lee, C.H.; Tao, C.; Kuo, T.H. Semi-supervised text classification with universum learning. IEEE Trans. Cybern. 2015, 46, 462–473. [Google Scholar] [CrossRef]
  5. Gopalakrishnan, V.; Ramaswamy, C. Sentiment learning from imbalanced dataset: An ensemble based method. Int. J. Artif. Intell. 2014, 12, 75–87. [Google Scholar]
  6. García, V.; Marqués, A.I.; Sánchez, J.S. Improving risk predictions by preprocessing imbalanced credit data. In Proceedings of the 19th International Conference on Neural Information Processing, Doha, Qatar, 12–15 November 2012; pp. 68–75. [Google Scholar]
  7. Fernández, A.; García, S.; Galar, M.; Prati, R.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Cham, Switzerland, 2018; Volume 1. [Google Scholar]
  8. Sáez, J.A.; Luengo, J.; Stefanowski, J.; Herrera, F. SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 2015, 291, 184–203. [Google Scholar] [CrossRef]
  9. García, V.; Alejo, R.; Sánchez, J.S.; Sotoca, J.M.; Mollineda, R.A. Combined effects of class imbalance and class overlap on instance-based classification. In Proceedings of the 6th International Conference on Intelligent Data Engineering and Automated Learning, Burgos, Spain, 20–23 September 2006; pp. 371–378. [Google Scholar]
  10. Gupta, S.; Gupta, A. Handling class overlapping to detect noisy instances in classification. Knowl. Eng. Rev. 2018, 33, e8. [Google Scholar] [CrossRef]
  11. Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. Supervised neural network modeling: An empirical investigation into learning from imbalanced data with labeling errors. IEEE Trans. Neural Netw. 2010, 21, 813–830. [Google Scholar] [CrossRef]
  12. Van Hulse, J.; Khoshgoftaar, T.M.; Napolitano, A. A novel noise filtering algorithm for imbalanced data. In Proceedings of the 9th International Conference on Machine Learning and Applications, Washington, DC, USA, 12–14 December 2010; pp. 9–14. [Google Scholar]
  13. Muhlenbach, F.; Lallich, S.; Zighed, D.A. Identifying and handling mislabelled instances. J. Intell. Inf. Syst. 2004, 22, 89–109. [Google Scholar] [CrossRef]
  14. Dong, X.; He, H.; Li, C.; Liu, Y.; Xiong, H. Scene-based big data quality management framework. In Data Science; Springer: Singapore, 2018; pp. 122–139. [Google Scholar]
  15. Barandela, R.; Sánchez, J.; García, V.; Rangel, E. Strategies for learning in class imbalance problems. Pattern Recognit. 2003, 36, 849–851. [Google Scholar] [CrossRef]
  16. Ofek, N.; Rokach, L.; Stern, R.; Shabtai, A. Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem. Neurocomputing 2017, 243, 88–102. [Google Scholar] [CrossRef]
  17. Yen, S.J.; Lee, Y.S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
  18. Napierała, K.; Stefanowski, J.; Wilk, S. Learning from imbalanced data in the presence of noisy and borderline examples. In Proceedings of the 7th International Conference on Rough Sets and Current Trends in Computing, Warsaw, Poland, 28–30 June 2010; pp. 158–167. [Google Scholar]
  19. Alejo, R.; Valdovinos, R.M.; García, V.; Pacheco-Sanchez, J.H. A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recognit. Lett. 2013, 34, 380–388. [Google Scholar] [CrossRef] [Green Version]
  20. García, V.; Sánchez, J.S.; Mollineda, R.A. An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In Proceedings of the 5th Iberoamerican Congress on Pattern Recognition, Valparaiso, Chile, 13–16 November 2007; pp. 397–406. [Google Scholar]
  21. Van-Hulse, J.; Khoshgoftaar, T. Knowledge discovery from imbalanced and noisy data. Data Knowl. Eng. 2009, 68, 1513–1542. [Google Scholar] [CrossRef]
  22. Hart, P. The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 1968, 14, 515–516. [Google Scholar] [CrossRef]
  23. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 769–772. [Google Scholar] [CrossRef] [Green Version]
  24. Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Proceedings of the 8th Conference on Artificial Intelligence in Medicine, Cascais, Portugal, 1–4 July 2001; pp. 63–66. [Google Scholar]
  25. Branco, P.; Torgo, L.; Ribeiro, R.P. A survey of predictive modeling on imbalanced domains. ACM Comput. Surv. 2016, 49, 1–50. [Google Scholar] [CrossRef]
  26. Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409–410, 17–26. [Google Scholar] [CrossRef]
  27. Drummond, C.; Holte, R.C. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the Workshop on Learning from Imbalanced Datasets II, Washington, DC, USA, 21 August 2003; Volume 11. [Google Scholar]
  28. García, V.; Sánchez, J.S.; Marqués, A.I.; Florencia, R.; Rivera, G. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst. Appl. 2019, 1–19. [Google Scholar] [CrossRef]
  29. Basgall, M.J.; Hasperué, W.; Naiouf, M.; Fernández, A.; Herrera, F. An analysis of local and global solutions to address big data imbalanced classification: A case study with SMOTE Preprocessing. In Cloud Computing and Big Data; Springer International Publishing: Cham, Switzerland, 2019; pp. 75–85. [Google Scholar]
  30. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; AAAI Press: Portland, OR, USA, 1996; pp. 226–231. [Google Scholar]
  31. Ijaz, M.F.; Attique, M.; Son, Y. Data-driven cervical cancer prediction model with outlier detection and over-sampling methods. Sensors 2020, 20, 2809. [Google Scholar] [CrossRef]
  32. García, S.; Derrac, J.; Cano, J.; Herrera, F. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 417–435. [Google Scholar] [CrossRef] [PubMed]
  33. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421. [Google Scholar] [CrossRef] [Green Version]
  34. Tomek, I. An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 448–452. [Google Scholar] [CrossRef]
  35. Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, 8–12 July 1997; pp. 179–186. [Google Scholar]
  36. Longadge, R.; Dongre, S.S.; Malik, L. Multi-cluster based approach for skewed data in data mining. IOSR J. Comput. Eng. 2013, 12, 66–73. [Google Scholar] [CrossRef]
  37. Barella, V.H.; Costa, E.P.; Carvalho, A.C.P.L.F. ClusterOSS: A new undersampling method for imbalanced learning. In Proceedings of the 3rd Brazilian Conference on Intelligent Systems, São Carlos, Brazil, 18–23 October 2014; pp. 453–458. [Google Scholar]
  38. Sowah, R.A.; Agebure, M.A.; Mills, G.A.; Koumadi, K.M.; Fiawoo, S.Y. New cluster undersampling technique for class imbalance learning. Int. J. Mach. Learn. Comput. 2016, 6, 205. [Google Scholar] [CrossRef] [Green Version]
  39. Das, B.; Krishnan, N.C.; Cook, D.J. Handling imbalanced and overlapping classes in smart environments prompting dataset. In Data Mining for Service; Springer: Berlin/Heidelberg, Germany, 2014; pp. 199–219. [Google Scholar]
  40. Tsai, C.F.; Lin, W.C.; Hu, Y.H.; Yao, G.T. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf. Sci. 2019, 477, 47–54. [Google Scholar] [CrossRef]
  41. Ng, W.W.Y.; Hu, J.; Yeung, D.S.; Yin, S.; Roli, F. Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans. Cybern. 2015, 45, 2402–2412. [Google Scholar] [CrossRef] [PubMed]
  42. Sun, Z.; Song, Q.; Zhu, X.; Sun, H.; Xu, B.; Zhou, Y. A novel ensemble method for classifying imbalanced data. Pattern Recognit. 2015, 48, 1623–1637. [Google Scholar] [CrossRef]
  43. Kim, H.J.; Jo, N.O.; Shin, K.S. Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction. Expert Syst. Appl. 2016, 59, 226–234. [Google Scholar] [CrossRef]
  44. Smiti, A.; Elouedi, Z. DBSCAN-GM: An improved clustering method based on Gaussian means and DBSCAN techniques. In Proceedings of the IEEE 16th International Conference on Intelligent Engineering Systems, Lisbon, Portugal, 13–15 June 2012; pp. 573–578. [Google Scholar]
  45. Prim, R.C. Shortest connection networks and some generalizations. Bell Syst. Tech. J. 1957, 36, 1389–1401. [Google Scholar] [CrossRef]
  46. Torres, M.; Paz, K.; Salazar, F. Tamaño de una muestra para una investigación de mercado. Boletín Electrónico 2006, 2, 1–13. [Google Scholar]
  47. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  48. Suthar, N.; Indr, P.; Vinit, P. A technical survey on DBSCAN clustering algorithm. Int. J. Sci. Eng. Res. 2013, 4, 1775–1781. [Google Scholar]
  49. Chen, L.; Fang, B.; Shang, Z.; Tang, Y. Tackling class overlap and imbalance problems in software defect prediction. Software Qual. J. 2018, 26, 97–125. [Google Scholar] [CrossRef]
  50. Koziarski, M.; Wożniak, M. CCR: A combined cleaning and resampling algorithm for imbalanced data classification. Int. J. Appl. Math. Comput. Sci. 2017, 27, 727–736. [Google Scholar] [CrossRef] [Green Version]
  51. Xiao, L.; Gao, M.; Su, X. An under-sampling ensemble classification algorithm based on fuzzy C-means clustering for imbalanced data. Data Anal. Knowl. Discov. 2019, 3, 90–96. [Google Scholar] [CrossRef]
  52. Liang, J.; Bai, L.; Dang, C.; Cao, F. The K-means-type algorithms versus imbalanced data distributions. IEEE Trans. Fuzzy Syst. 2012, 20, 728–745. [Google Scholar] [CrossRef]
  53. García, V.; Sánchez, J.S.; Martín-Félez, R.; Mollineda, R.A. Surrounding neighborhood-based SMOTE for learning from imbalanced data sets. Progr. Artif. Intell. 2012, 1, 347–362. [Google Scholar] [CrossRef] [Green Version]
  54. Sanguanmak, Y.; Hanskunatai, A. DBSM: The combination of DBSCAN and SMOTE for imbalanced data classification. In Proceedings of the 13th International Joint Conference on Computer Science and Software Engineering, Khon Kaen, Thailand, 13–15 July 2016; pp. 1–5. [Google Scholar]
  55. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. DBSMOTE: Density-based synthetic minority over-sampling technique. Appl. Intell. 2012, 36, 664–684. [Google Scholar] [CrossRef]
  56. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. MUTE: Majority under-sampling technique. In Proceedings of the 8th International Conference on Information, Communications & Signal Processing, Singapore, 13–16 December 2011; pp. 1–4. [Google Scholar]
  57. Bunkhumpornpat, C.; Sinapiromsaran, K. DBMUTE: Density-based majority under-sampling technique. Knowl. Inf. Syst. 2017, 50, 827–850. [Google Scholar] [CrossRef]
  58. García, S.; Herrera, F. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evol. Comput. 2009, 17, 275–306. [Google Scholar] [CrossRef]
  59. Liu, X.; Wu, J.; Zhou, Z. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2009, 39, 539–550. [Google Scholar] [CrossRef]
  60. Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197. [Google Scholar] [CrossRef]
  61. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Cambridge, MA, USA, 2017. [Google Scholar]
  62. García, V.; Marqués, A.I.; Sánchez, J.S. Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction. Inf. Fusion 2019, 47, 88–101. [Google Scholar] [CrossRef]
  63. Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2012, 42, 463–484. [Google Scholar] [CrossRef]
  64. García, V.; Mollineda, R.A.; Sánchez, J.S. A bias correction function for classification performance assessment in two-class imbalanced problems. Knowl. Based Syst. 2014, 59, 66–74. [Google Scholar] [CrossRef]
Figure 1. Illustration of weighted graphs. The graph on the right depicts an MST of the graph on the left. (a) weighted graph; (b) MST.
Figure 2. Scatter plots of the synthetic data sets with 0% and 70% of noise (points in gray color are for noisy data). (a) subclus-0; (b) clover-0; (c) paw-0; (d) subclus-70; (e) clover-70; (f) paw-70.
Figure 3. Comparison of average Friedman ranks. (a) real data sets; (b) synthetic data sets.
Figure 4. True-positive and true-negative rates using SVM on the synthetic data sets. (a) RUS; (b) CNN; (c) NCL; (d) TL; (e) ENN; (f) OSS; (g) EUS; (h) EE; (i) BC; (j) RUSBOOST; (k) SBC; (l) ClusterOSS; (m) DBMIST-US.
Figure 5. Scatter plots of the paw-50 database for the non-preprocessed set and the sets yielded by TL and SBC. (a) original; (b) TL; (c) SBC.
Figure 6. Scatter plots of the paw-50 database for the non-preprocessed set and the sets yielded by ClusterOSS and DBMIST-US. (a) Original; (b) ClusterOSS; (c) DBMIST-US.
Figure 7. Scatter plots of the synthetic data sets (70% of noise) for the non-preprocessed sets and the sets yielded by DBMIST-US (points in gray color correspond to the instances that were removed).
Figure 8. Comparison of DBMIST-US with reference methods with respect to the number of data sets on which DBMIST-US was statistically significantly better (green), equal (yellow), or worse (red). (a) 1NN; (b) J48; (c) SVM.
Figure 9. Summary of the Wilcoxon test. (a) 1NN; (b) J48; (c) SVM.
Table 1. Summary of the real-life data sets.
| # | Data Set | Class Distribution | #Instances | IR |
|----|------------------------|--------------------|------------|-------|
| 1 | yeast-0-5-6-7-9_vs_4 | 51–477 | 528 | 9.35 |
| 2 | glass-0-1-6_vs_2 | 17–175 | 192 | 10.29 |
| 3 | glass2 | 17–197 | 214 | 11.59 |
| 4 | shuttle-c0-vs-c4 | 123–1706 | 1829 | 13.87 |
| 5 | yeast-1_vs_7 | 30–429 | 459 | 14.30 |
| 6 | glass4 | 13–201 | 214 | 15.47 |
| 7 | ecoli4 | 20–316 | 336 | 15.80 |
| 8 | page-blocks-1-3_vs_4 | 28–444 | 472 | 15.86 |
| 9 | glass-0-1-6_vs_5 | 9–175 | 184 | 19.44 |
| 10 | yeast-1-4-5-8_vs_7 | 30–663 | 693 | 22.10 |
| 11 | glass5 | 9–205 | 214 | 22.78 |
| 12 | yeast-2_vs_8 | 20–462 | 482 | 23.10 |
| 13 | flare-F | 43–1023 | 1066 | 23.79 |
| 14 | yeast4 | 51–1433 | 1484 | 28.10 |
| 15 | yeast-1-2-8-9_vs_7 | 30–917 | 947 | 30.57 |
| 16 | yeast5 | 44–1440 | 1484 | 32.73 |
| 17 | ecoli-0-1-3-7_vs_2-6 | 7–274 | 281 | 39.14 |
| 18 | abalone-17_vs_7-8-9-10 | 58–2280 | 2338 | 39.31 |
| 19 | yeast6 | 35–1449 | 1484 | 41.40 |
| 20 | poker-8-9_vs_5 | 25–2050 | 2075 | 82.00 |
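The IR column in Table 1 is the number of majority-class instances divided by the number of minority-class instances. A small helper (hypothetical, for illustration) reproduces it from the class-distribution column:

```python
def imbalance_ratio(minority, majority):
    """Imbalance ratio: majority-class size over minority-class size."""
    return majority / minority

# e.g. yeast-0-5-6-7-9_vs_4 has a 51-477 class distribution,
# giving 528 instances and an IR of 9.35.
ir = round(imbalance_ratio(51, 477), 2)
```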
Table 2. Other under-sampling methods included in the experiments.
| Method | Description |
|--------|-------------|
| Random under-sampling (RUS) | It balances the data set through the random elimination of instances belonging to the over-sized class. |
| Evolutionary under-sampling (EUS) | It removes instances from the majority class under the guidance of a genetic algorithm [58]. |
| Easy ensemble (EE) | The data set is divided into several subsets by random resampling with replacement, and each subset is then used to train a base classifier of the ensemble with AdaBoost [59]. |
| Balance cascade (BC) | It performs bagging and removes from future selections those negative instances that can be classified correctly with high confidence [59]. |
| Random under-sampling boosting (RUSBOOST) | It combines RUS with the AdaBoost algorithm [60]. |
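As an illustration of the simplest baseline in Table 2, random under-sampling can be sketched as follows. This is a minimal, hypothetical implementation (library versions such as imbalanced-learn's `RandomUnderSampler` provide the same behavior with more options):

```python
import random

def random_under_sample(X, y, majority_label, seed=0):
    """Keep all minority instances; sample the majority class down
    to the minority size, without replacement."""
    rng = random.Random(seed)
    minority = [(x, c) for x, c in zip(X, y) if c != majority_label]
    majority = [(x, c) for x, c in zip(X, y) if c == majority_label]
    kept = rng.sample(majority, len(minority))   # random elimination
    balanced = minority + kept
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [c for _, c in balanced]
```

After this step the imbalance ratio is exactly 1.0, which matches the RUS column of Table 4.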
Table 3. Confusion matrix.
| | Predicted as Positive | Predicted as Negative |
|-------------------|----|----|
| Actually positive | TP | FN |
| Actually negative | FP | TN |
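From the confusion matrix in Table 3, the per-class rates reported in Figure 4, together with the geometric mean often used in imbalanced classification, can be computed as follows (a minimal sketch; the paper's full evaluation protocol is not reproduced here):

```python
def rates(tp, fn, fp, tn):
    """Per-class rates derived from the confusion matrix of Table 3."""
    tpr = tp / (tp + fn)          # true-positive rate (sensitivity)
    tnr = tn / (tn + fp)          # true-negative rate (specificity)
    gmean = (tpr * tnr) ** 0.5    # geometric mean of both rates
    return tpr, tnr, gmean
```

Because the geometric mean collapses whenever one class is entirely misclassified, it penalizes classifiers that ignore the minority class, unlike plain accuracy.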
Table 4. Imbalance ratio of the resampled data sets.
Table 4. Imbalance ratio of the resampled data sets.
OriginalRUSCNNNCLTLENNOSSEUSEEBCRUSBOOSTSBCClusterOSSDBMIST-US
yeast-0-5-6-7-9_vs_49.41.01.78.09.18.27.31.01.01.01.67.50.41.0
glass-0-1-6_vs_219.31.02.28.710.18.49.61.01.01.01.08.94.21.1
glass211.61.02.010.111.49.810.51.01.01.01.59.91.81.1
shuttle-c0-vs-c4 713.91.00.113.813.913.80.31.01.01.01.513.80.41.2
yeast-1_vs_714.31.02.812.114.012.512.11.01.01.01.511.42.01.0
glass415.51.01.014.515.514.94.21.01.01.01.515.65.01.1
ecoli415.81.01.015.215.815.410.31.01.01.01.514.90.91.1
page-blocks-1-3_vs_415.91.01.015.315.815.24.51.01.01.01.514.73.01.0
glass-0-1-6_vs_519.41.01.218.319.318.75.91.01.01.01.418.72.01.1
yeast-1-4-5-8_vs_722.11.04.119.621.819.821.81.01.01.01.516.92.31.0
glass522.81.01.221.622.722.07.61.01.01.01.422.03.81.1
yeast-2_vs_823.11.02.421.922.922.021.21.01.01.01.521.00.81.0
flare-F23.81.03.122.023.721.823.71.01.01.01.523.20.11.1
yeast428.11.02.526.427.826.624.71.01.01.01.522.10.31.1
yeast-1-2-8-9_vs_730.61.04.328.330.228.430.11.01.01.01.529.60.41.0
yeast532.71.01.132.032.632.025.01.01.01.01.532.00.31.1
ecoli-0-1-3-7_vs_2-639.11.02.138.138.938.017.31.01.01.01.437.62.21.3
abalone-17_vs_7-8-9-1039.31.02.537.539.138.237.31.11.01.01.531.00.31.1
yeast641.41.02.940.041.139.935.51.01.01.01.540.20.41.1
poker-8-9_vs_582.01.05.179.481.780.580.61.01.01.01.569.20.71.1
subcl-07.01.01.36.66.96.56.31.01.01.01.55.70.51.0
subcl-307.01.01.36.46.85.96.71.01.01.01.54.70.60.9
subcl-507.01.01.46.36.65.76.51.01.01.01.54.61.51.0
subcl-707.01.01.46.26.65.66.61.01.01.01.54.60.51.1
clover-07.01.01.46.66.96.66.51.01.01.01.55.51.40.9
clover-307.01.01.16.56.86.26.51.01.01.01.55.10.71.2
clover-507.01.01.36.36.75.86.21.01.01.01.55.10.41.0
clover-707.01.01.36.26.65.66.01.01.01.01.54.71.21.1
paw-07.01.02.26.87.06.77.01.01.01.01.56.30.10.9
paw-307.01.01.16.66.86.26.31.01.01.01.55.20.70.8
paw-507.01.01.06.56.76.06.11.01.01.01.55.40.21.0
paw-707.01.01.16.36.65.85.81.01.01.01.55.21.21.1
% Avg. reduction 91.5586.988.553.2810.1823.2491.5191.5591.5587.3417.3390.2691.29

Guzmán-Ponce, A.; Valdovinos, R.M.; Sánchez, J.S.; Marcial-Romero, J.R. A New Under-Sampling Method to Face Class Overlap and Imbalance. Appl. Sci. 2020, 10, 5164. https://doi.org/10.3390/app10155164