Article

A Highly Adaptive Oversampling Approach to Address the Issue of Data Imbalance

Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary
*
Author to whom correspondence should be addressed.
Computers 2022, 11(5), 73; https://doi.org/10.3390/computers11050073
Submission received: 22 February 2022 / Revised: 22 April 2022 / Accepted: 27 April 2022 / Published: 4 May 2022

Abstract

Data imbalance is a serious problem in machine learning that can be alleviated at the data level by balancing the class distribution with sampling. In the last decade, several sampling methods have been published to address the shortcomings of the initial ones, such as noise sensitivity and incorrect neighbor selection. Based on the review of the literature, it has become clear to us that the algorithms achieve varying performance on different data sets. In this paper, we present a new oversampler that has been developed based on the key steps and sampling strategies identified by analyzing dozens of existing methods and that can be fitted to various data sets through an optimization process. Experiments were performed on a number of data sets, which show that the proposed method had a similar or better effect on the performance of SVM, DTree, kNN and MLP classifiers compared with other well-known samplers found in the literature. The results were also confirmed by statistical tests.

1. Introduction

Classification is a typical machine-learning task that aims to assign a set of samples to predefined classes. Each sample represents a given entity with $d$ observed features. More formally, if the samples belong to $k$ different classes (labeled $l_1, l_2, \dots, l_k$), the goal is to build a model based on a sample set $\{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^d$ and the corresponding class labels $\{y_1, y_2, \dots, y_n\}$, where $y_i \in \{l_1, l_2, \dots, l_k\}$ $(i = 1, 2, \dots, n)$, which can then be used to predict the class label of a sample $x \in \mathbb{R}^d$. The set of samples and their corresponding labels used to build the model is called the training data set. If there are only two classes ($k = 2$), the problem is a binary classification problem. If the sizes of the classes differ significantly, the set containing all samples of the smaller class is called the minority set, and the set containing all samples of the larger class is called the majority set.
The classical classification algorithms, such as Naive Bayes, the linear SVM and random trees, were designed for balanced data sets. If the class distribution is uneven, the models built on the data set usually favor the larger class. Data imbalance must also be taken into account in model evaluation, as an inappropriate metric may give a completely misleading picture of the performance of a classifier [1].
We can easily understand the problem by imagining a simple classification task where the samples, represented by triangles and circles, have to be classified into the Triangle class and the Circle class, respectively. The training data set that can be used to build the model has 1000 samples, 990 circles and only 10 triangles. If we consider a trivial classifier that predicts the Circle label for each sample, it would make a good decision for 99% of the samples in the data set; however, in practice, the classifier is completely useless.
A less extreme example is shown in Figure 1a. We built a model with a linear SVM on the data set with 200 circles and 15 triangles. The red line indicates the decision boundary for separating samples belonging to different classes. If a sample to be classified falls to the left of the red line, the classifier will classify it as a triangle—otherwise, as a circle. One can readily see that the location of the decision boundary favors the classification of circles to the detriment of triangles. However, on a less imbalanced version of the data set, the classifier was able to find a good decision boundary (Figure 1b).
We can also observe in Figure 1 that the samples far from the decision boundary can be classified correctly even by the worse classifier. These easily classifiable samples are commonly referred to as safe samples. However, some of the samples near the boundary would be classified differently based on the two models. These samples may play an important role in the construction of the model; however, they are in danger of misclassification as they are very similar to some samples of the other class.
Imbalanced learning has been a focus of interest for two decades, and there are several learning problems in practice where the distribution of classes is skewed (e.g., spam filtering [2], fault detection [3,4], disease risk prediction [5], cancer diagnosis [6], dynamic gesture recognition [7,8], activity monitoring [9] and online activity learning [10]).
We can handle the problem of imbalanced data at the algorithm level; the best-known representatives of this approach are perhaps the cost-sensitive methods. The concept behind them is simple: the misclassification of minority samples costs more (entails a higher penalty) than the misclassification of majority ones [11]. If large amounts of imbalanced data need to be classified, the use of some cost-sensitive deep-learning methods should also be considered [12].
In contrast to the previous approach, data-level algorithms address the root of the problem by changing the number and/or distribution of data before applying general classifiers. For binary classification problems, this can be accomplished by oversampling the smaller class or undersampling the larger one, and hybrid solutions also exist that combine under- and oversampling [1,13]. All three approaches can also be found in the field of deep learning to handle large amounts of imbalanced data [14].
One of the advantages of oversampling is that it preserves the information in the data. In addition, oversampling usually reduces the proportion of samples that are difficult to classify and increases the proportion of the safe samples in the data set, while undersampling tends to remove safe samples from the data set preserving the samples that are difficult to classify. This means that the classification with oversampled data sets is likely to be more efficient than using undersampled ones [15].
In the rest of this paper, we focus primarily on oversampling algorithms (excluding deep-learning-based solutions). Of the many oversampling algorithms in the literature, it is perhaps worth highlighting the historically most important one, SMOTE, released in 2002 [16]. SMOTE attempts to add synthetic samples to the minority set that are similar to, but not duplicates of, the minority samples.
To achieve this goal, each synthetic sample is generated by interpolating a randomly selected minority sample (seed) and one of its k nearest minority neighbors (co-seed and pair). We can already see the advantage of using SMOTE because the original Circle–Triangle data set (Figure 1a) was oversampled by SMOTE to make a more balanced version (Figure 1b).
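To make the interpolation step concrete, the following minimal sketch illustrates the seed/co-seed mechanism described above; it is not the original SMOTE implementation, and the function name, the NumPy array representation of the minority set and the fixed random seed are our own assumptions.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating randomly chosen
    minority seeds with one of their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each sample is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        s = rng.integers(len(X_min))            # seed sample, chosen uniformly
        c = rng.choice(idx[s, 1:])              # co-seed: one of the k minority neighbors
        lam = rng.random()                      # interpolation coefficient in [0, 1)
        synthetic.append(X_min[s] + lam * (X_min[c] - X_min[s]))
    return np.asarray(synthetic)
```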
Nevertheless, SMOTE also has certain disadvantages: it selects samples that are important (e.g., close to the decision boundary), irrelevant (e.g., far from the decision boundary) and misleading (e.g., noise) for the classification as seeds with equal probability, and the situation is similar for the selection of co-seeds. To provide a clear picture of the problem, we supplemented the Circle–Triangle data set with three noise samples denoted by P, Q and R (Figure 2a), and then the data set was oversampled by SMOTE to a similar extent as before.
If we compare Figure 1a to Figure 2a, we can see that the decision boundary has shifted due to the noise. Furthermore, Figure 2b shows that, as a result of oversampling, plenty of new samples appeared around Q and R, thus increasing the degree of overlap between the classes. Moreover, these samples misled the classifier, because they belong to the Triangle class but are similar to circles.
Over the years, more than a hundred SMOTE variants have been published to overcome the drawbacks of the original version. The interested reader can find more about the SMOTE algorithm, including its several variants, in an excellent summary in the literature [17], which also covers current challenges, such as handling streamed data or using semi-supervised learning.
The following section provides a brief overview of oversampling methods. Then, in Section 3, we introduce our ModularOverSampler, which was designed taking into account the common features and differences of the methods found in the literature. Section 4 provides a detailed description of the experimental process and the results. In Section 5, we discuss the limitations of our method and suggest some directions for further research. Finally, conclusions are drawn in Section 6.

2. Related Work

Although there are plenty of oversamplers published in the literature, the theoretical considerations behind them, often based on a priori assumptions, serve the same purpose: to ensure that the distribution and locations of the generated samples are adequate to improve the classification of the samples. A rough structural pattern is also recognizable in a fairly large proportion of the published algorithms; however, by further refining the structure, the diversity of the procedures becomes visible. In reviewing the literature, we present the possible structural steps and their roles, mentioning some typical implementation methods (strategies), as this supports our concept, which is described in a later section.
Some of the oversamplers start with noise filtering to permanently remove samples considered to be noise from the data set [18,19]. Using the filtered data set, we may obtain a clearer picture of which samples are at the decision boundary, where minority samples are in danger of misclassification. However, a noise filter is rarely applied to the minority set, because there is a danger that too many minority samples will be removed, thereby, increasing the degree of imbalance.
In addition, the minority set sometimes really consists of only a few samples because there are problems where the event to be observed is rare (e.g., fault detection [3], bankruptcy prediction [20]). The deletion of samples is only acceptable in very justified cases. For example, an isolated sample can be deleted from the minority class [18] if the overlap of classes is moderate. Samples representing noise and outliers can also be deleted [19].
Another typical step is clustering [21], which can be used to delimit the part of the feature space where it is not worth generating samples. Placing the synthetic samples inside the clusters of the minority set helps to avoid the errors illustrated in Figure 3 [22]. The clustering also paves the way for distinguishing samples that are considered important or insignificant for the learning process.
For example, it may reveal the outliers and the isolated samples (possible noise samples), allowing us to exclude them from the further sampling process [23,24,25]. Some procedures determine which samples are in danger of misclassification and which ones are safe based on the composition of the clusters, assuming that the decision boundary passes through clusters containing both minority and majority samples [26]. Instead of general-purpose clustering, many procedures determine the set of samples in danger, the set of safe samples and the set of noise samples by examining a small neighborhood of each sample (e.g., its k nearest neighbors) [18,27,28,29].
With this, we have reached the next typical step, which is to weight the minority samples according to their importance. At this point, different approaches can be found in the literature, because the assessment of the importance of samples varies from method to method. Some oversamplers consider all the minority samples equally important [20,30], while others exclude samples labeled as noise [24]. Contradictory strategies can also be found among the algorithms.
For example, there are methods that assign a higher weight to samples near the decision boundary [31] than to samples that are far away (including the case where only boundary samples can become seeds [18]); in contrast, Safe-Level-SMOTE generates samples near the safe samples [32]. Furthermore, while many oversamplers ignore samples in clusters that are too small, there is a method that assigns the largest weights to these minority samples, assuming that they are not noise but rare samples [33].
The next step is to choose the seed samples. The selection is mostly made randomly from the samples of the minority set. The distribution of the selection can be uniform or weighted, as described in the previous paragraph.
The synthetic samples can be generated around the selected seeds [34,35]; however, most oversamplers select another sample (called a pair or co-seed) for each seed and create the new samples based on these two samples. The most popular way to generate new samples is to interpolate the seeds and their pairs. Typically, the co-seeds are selected randomly from the k minority nearest neighbors of the seeds [20,23,36] or from the minority samples of the seed's cluster [37]. Although the seeds and their pairs are usually selected from the same cluster, borderline minority seeds can be strengthened if the new samples are created between them and the safe samples [38]. There are some methods that select the co-seeds from the majority set; however, this solution is not common [39].
Sample generation is not always the last step: while some methods reject samples generated in the wrong place during sample generation [26], others apply post-filtering on the oversampled data set [23,36,40]. The second strategy allows the decision about removing or preserving a sample to be made on the modified version of the data set that already includes the new synthetic samples. In this case, the sample removal is not limited to the newly generated samples.

3. Methodology

Researchers who need to sample imbalanced data sets now have more than a hundred oversampling algorithms to choose from, and the increase in the number of samplers has not stopped in recent years. With this in mind, we developed a new algorithm, motivated by the goal of creating a generic sampler that can be adapted well to various data sets.
The proposed method utilizes many ideas from other samplers yet differs fundamentally from previously published solutions due to its modular structure. The main steps are shown in Figure 4. The method consists of a few more steps than we mentioned in the previous section because we have broken down certain steps into smaller ones for flexibility.
For most of the steps, we have defined different strategies which determine the specific methods of performing the steps. (These are described in detail later in this section). Our oversampler can be parameterized with these strategies. The non-round rectangles in Figure 4 indicate interchangeable modules that are swapped according to the strategies passed as parameters. This structure allows us to fundamentally modify the sampling process via parametrization.
Among the strategies, there are some that cannot be used together. A combination of parameters that assigns exactly one strategy to each interchangeable module, such that the assigned strategies can be used together, is called a meaningful parameter combination, or path. The algorithm is completed with an optimization step that selects an appropriate path for the data set to be sampled. We recommend cross-validation for this purpose.

3.1. Noise Filtering

As mentioned in Section 2, only a few methods attempt to permanently remove the noise samples of the minority set due to the risk of information loss. We also limited the noise removal to the majority set, defining noise with the Edited Nearest Neighbors (ENN) rule [30], which considers a sample as noise if the majority of its k-nearest neighbors (kNN) belong to another class. Similar solutions can be found in [41].
In this step, the noise removal algorithm can be used with k = 3 or k = 5, and as a third option, we made it possible to skip the noise filtering step. We will refer to the three strategies as NF_ENN3, NF_ENN5 and NoFilter.
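The following minimal sketch illustrates the majority-side ENN filtering described above; it assumes numeric binary labels and relies on scikit-learn's NearestNeighbors, and the function name is our own, not part of the proposed implementation.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_filter_majority(X, y, k=3, majority_label=0):
    """Remove majority samples whose k nearest neighbors mostly belong to
    another class (Edited Nearest Neighbors rule applied to the majority set)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = np.ones(len(X), dtype=bool)
    for i in np.where(y == majority_label)[0]:
        enemies = np.sum(y[idx[i, 1:]] != y[i])   # neighbors with a different label
        if enemies > k / 2:                       # the majority of neighbors disagree
            keep[i] = False
    return X[keep], y[keep]
```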

3.2. Clustering

We mentioned in Section 2 that there may be different intentions behind the clustering step. Accordingly, the following strategies are defined:
  • DBSCAN_min, DBSCAN_whole: DBSCAN [42] applied on the minority set and on the entire data set, respectively. DBSCAN does not require specifying the number of clusters in advance; it grows clusters from high-density points, where at least minPts points must lie within a distance of ε. During the tests, ε was 0.8 and minPts was 3.
  • BorderSafeNoise: With this strategy, a sample is placed into the NOISE cluster if there are at least $k_1$ samples with a different label among its 7 nearest neighbors, and a sample is added to the SAFE cluster if there is no sample with a different label among its 7 nearest neighbors. All other samples form the BORDER cluster ($k_1 \in \{5, 7\}$). A similar solution can be found in [18].
  • MinMaj: This strategy creates two clusters, the MINOR cluster includes all samples of the minority set, and the MAJOR cluster includes all samples of the majority set.
In the rest of the algorithm description, we denote the resulting clusters by $S$, the set of purely minority clusters by $S_{min}$, the set of purely majority clusters by $S_{maj}$ and the set of mixed clusters by $S_{mixed}$. Furthermore, the size of a cluster or set $c$ is denoted by $|c|$.
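As an illustration, one possible implementation of the BorderSafeNoise strategy is sketched below; the function name and the dictionary-based representation of the clusters (mapping cluster names to sample indices) are our own simplifications.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def border_safe_noise(X, y, k1=5):
    """Partition the samples into NOISE, SAFE and BORDER clusters based on
    the labels of their 7 nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=8).fit(X)            # 7 neighbors + the sample itself
    _, idx = nn.kneighbors(X)
    clusters = {"NOISE": [], "SAFE": [], "BORDER": []}
    for i in range(len(X)):
        enemies = np.sum(y[idx[i, 1:]] != y[i])             # neighbors with a different label
        if enemies >= k1:
            clusters["NOISE"].append(i)
        elif enemies == 0:
            clusters["SAFE"].append(i)
        else:
            clusters["BORDER"].append(i)
    return clusters
```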

3.3. Cluster Weighting

In this step, the clusters can be given weight according to their importance. A normalized version of the weight of the cluster determines the probability of randomly selecting a seed from that cluster. For normalization, the weight of a cluster is divided by the sum of the weights of the clusters. By assigning zero weight, we can completely exclude clusters from the seed selection process. Four weighting strategies are defined: CW_Uniform, CW_BySize, CW_Border, CW_ByInsideEnemy. With the CW_Uniform strategy, we can assign the same positive weight to each cluster with at least one minority sample, creating the possibility that nearly the same number of seeds are selected from these clusters.
As a result, samples of the smaller clusters are expected to be involved in generating more synthetic samples than samples of the larger clusters; therefore, it reduces the imbalance within the minority class, but it does not fully balance the clusters as the solution of Jo and Japkowicz does [33]. The CW_BySize strategy calculates the weight of clusters with at least one minority sample based on their size. More formally, the weight of a cluster $c \in (S_{min} \cup S_{mixed})$ is
w(c) = \frac{|c|}{\sum_{c_i \in (S_{min} \cup S_{mixed})} |c_i|}.
All other clusters are given zero weight. If only pure clusters were generated in the previous step, this method ensures that all minority samples have the same chance of being selected as seed. If there are mixed clusters, their minority samples will have a higher chance of becoming a seed than the minority samples of other clusters. A similar solution can be found in [22]. (To avoid redundant paths, this weighting strategy is not used with the MinMaj clustering).
The CW_Border strategy assigns a weight of 1 to the BORDER cluster and 0 to all others. As a result, only minority samples with a majority neighbor are selected as seeds. Clearly, this weighting only makes sense if BorderSafeNoise clustering was used in the previous step. The CW_ByInsideEnemy, as its name implies, assigns a high weight to the mixed clusters in which the number of majority samples is high relative to the minority samples. Pure clusters receive zero weights.
For a cluster $c \in S_{mixed}$, the weight is
w(c) = \frac{|\{x \in c : x \in MAJOR\}|}{|c|}.
This strategy is used only in combination with DBSCAN_whole and BorderSafeNoise.
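A possible implementation of the CW_BySize weighting is sketched below; the cluster representation (a dict of cluster names to member indices) is an assumption carried over from the earlier sketch, and the function name is ours.
```python
def cluster_weights_by_size(clusters, minority_idx):
    """CW_BySize-style weighting: clusters containing at least one minority
    sample are weighted by their size, then the weights are normalized."""
    minority_idx = set(minority_idx)
    weights = {name: (len(members) if minority_idx & set(members) else 0.0)
               for name, members in clusters.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()} if total else weights
```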

3.4. Cluster Selection

Using the normalized weights (probability values), we randomly select $n_{syn}$ clusters with replacement. In a later step, we will select the seeds from these clusters. We attempted to achieve completely balanced classes; thus, $n_{syn}$ is the difference between the number of majority samples and the number of minority samples after the noise filtering step.
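The cluster selection itself is a straightforward weighted random draw; a minimal sketch follows, where the function and variable names are our own choices.
```python
import numpy as np

def select_clusters(cluster_names, probabilities, n_syn, seed=0):
    """Draw n_syn clusters with replacement according to the normalized
    cluster weights; the seeds are later taken from the drawn clusters."""
    rng = np.random.default_rng(seed)
    return rng.choice(cluster_names, size=n_syn, replace=True, p=probabilities)

# n_syn is the gap between the class sizes after noise filtering, e.g.:
# n_syn = np.sum(y == majority_label) - np.sum(y == minority_label)
```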

3.5. Sample Weighting

This step determines the probability of selecting a sample as a seed. Weighting is performed per cluster, and majority samples are automatically given a weight of 0; therefore, they cannot become seeds. After determining the weights of the samples, normalization is performed by dividing the weights by the total weight of the samples of their own cluster.
For this step, we defined two weighting strategies: SW_Uniform and SW_kNN.
With the SW_Uniform strategy, all minority samples within a cluster are given the same weight. If $x$ is a minority sample of a cluster $c$, the weight assigned to it is
w(x) = \frac{1}{|c \cap MINOR|}.
The SW_kNN strategy follows in the footsteps of those samplers that attempt to create synthetic samples similar to the minority samples that are in danger of misclassification [18,31]. The danger level of a sample is often calculated based on the number of majority samples (enemies) among the sample's k nearest neighbors.
The variant used in this study determines the weight of a minority sample $x$ according to the following formula:
w(x) = |N_5(x) \cap MAJOR|,
where $N_5(x)$ denotes the set of the five nearest neighbors of $x$.
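A sketch of the SW_kNN weighting follows; it assumes numeric labels with 1 marking the minority class, and the function name is our own.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sample_weights_knn(X, y, cluster_members, minority_label=1, k=5):
    """SW_kNN-style weighting: the weight of a minority sample is the number
    of majority samples (enemies) among its k nearest neighbors; majority
    samples get zero weight. Normalization is done per cluster."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[cluster_members])
    weights = np.zeros(len(cluster_members))
    for j, i in enumerate(cluster_members):
        if y[i] == minority_label:
            weights[j] = np.sum(y[idx[j, 1:]] != minority_label)
    total = weights.sum()
    return weights / total if total > 0 else weights
```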

3.6. Seed Selection

To create a multiset of seed samples, we select as many samples from each cluster as the number of times the cluster was previously selected. The sample selection is made randomly with replacement based on the normalized weights (probabilities) that were calculated in the previous step.

3.7. Cluster Selection for Sample Pairs

As the flowchart shows, our oversampler supports sample generation based on a single sample (seed) and based on two samples (a seed and its pair). In the latter case, we need to select a cluster for each seed from which its pair can be selected. These clusters are called the cluster pairs of the seeds. We have defined the following strategies for this step.
CPM_MINOR: The least restrictive strategy; the MINOR cluster is the cluster pair of each seed.
CPM_SAFE: With this method the minority samples of the SAFE cluster are the cluster pairs of each seed. It can only be used in combination with BorderSafeNoise clustering. As a result, minority samples in danger of misclassification are associated with safe samples, and thus the generated synthetic samples can strengthen the position of the seeds without deteriorating the position of the majority samples around them [38].
CPM_SelfCluster: Using this method, the cluster pair of a seed is the set of minority samples of its own cluster, including the case when both the seed and its pair are taken from the BORDER cluster. This option was motivated by the SOI_C [22].

3.8. Creating Candidate Sets

The purpose of this step is to determine a set of pair candidates for each seed using the output of the previous step. In the simplest case (CS_ALL), we keep all the samples, and the candidate sets are simply the outputs of the previous step.
The second option, CS_kNN, covers a popular approach in which the pairs of the seeds come from the seeds' nearest neighbors (see, e.g., SMOTE [16] and Lee's method [43]). During the tests, the five nearest neighbors of a seed formed its candidate set, and the nearest neighbors were selected from the pair cluster of the seed.
The third option, CS_ClusterCenter, allows us to examine a less common solution where cluster centers are used in some way to create new synthetic samples (e.g., AHC [44], SMMO [45]). In our case, a candidate set assigned to a seed consists of a single artificial sample that represents the center of the cluster pair of the seed.
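The two non-trivial candidate-set strategies can be sketched as follows; both functions, their names and the index-based cluster representation are our own simplifications.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def candidate_set_knn(X, seed_idx, pair_cluster, k=5):
    """CS_kNN-style candidates: the k nearest samples of the seed, taken
    from its pair cluster."""
    pool = np.asarray(pair_cluster)
    nn = NearestNeighbors(n_neighbors=min(k, len(pool))).fit(X[pool])
    _, idx = nn.kneighbors(X[seed_idx].reshape(1, -1))
    return pool[idx[0]]                         # indices of the candidate samples

def candidate_set_cluster_center(X, pair_cluster):
    """CS_ClusterCenter-style candidate: one artificial sample, the center
    (mean) of the pair cluster."""
    return X[np.asarray(pair_cluster)].mean(axis=0, keepdims=True)
```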

3.9. Candidate Weighting

At this point in the process, we assign weights to each sample of the candidate set(s) to determine how likely a sample is to be selected as the pair of a certain seed. In this step, we examined the following options. Normalization of weights is performed per candidate set.
SPW_Uniform: operates analogously to SW_Uniform; of course, this time we are working with the candidate sets instead of the clusters. After CS_ClusterCenter, only this option is allowed.
SPW_kNN: works analogously to the SW_kNN presented in Section 3.5. Candidates in danger are given higher weight than others.
SPW_SeedPairDistance: This method gives higher weights to the candidates near the seeds than to those far away from them. The idea appears, e.g., in [36], accomplished by using a weighted kNN in the pair selection. It is suitable for reducing the occurrence of the type of error shown in Figure 4. Considering a seed $x_1$ and a candidate $x_2$ belonging to the candidate set $D$ of $x_1$, the weight for the pair $(x_1, x_2)$ can be calculated as
w(x_1, x_2) = \frac{\max_{p \in D} d(x_1, p)}{d(x_1, x_2)},
where $d(p, q)$ is a distance function. We used the Minkowski distance with p = 1 and p = 2 parameterization during the tests; the former corresponds to the Manhattan distance, the latter to the Euclidean distance. The two versions of the strategy are denoted by SPW_SeedPairDistance_m and SPW_SeedPairDistance_e.
Considerable evidence can be found in the literature that the meaningfulness of the $L_k$ norm worsens with increasing dimensionality for higher values of k [46].
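A sketch of this distance-based candidate weighting is given below; it assumes the max-distance-to-distance form of the weight reconstructed above, and the function name and the numerical guard against zero distances are our own additions.
```python
import numpy as np
from scipy.spatial.distance import cdist

def candidate_weights_by_seed_distance(X, seed_idx, candidates, p=2):
    """SPW_SeedPairDistance-style weighting: candidates close to the seed get
    larger weights than distant ones (Minkowski distance with p = 1 or 2)."""
    d = cdist(X[seed_idx].reshape(1, -1), X[candidates], metric="minkowski", p=p)[0]
    d = np.maximum(d, 1e-12)                  # guard against zero distance
    w = d.max() / d                           # assumed max-distance / distance form
    return w / w.sum()                        # normalize within the candidate set
```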

3.10. Sample Pairs Selection

For each seed, as many pairs are randomly selected with replacement from its candidate set as the number of times the seed occurs in the seeds' multiset (except in cases where a candidate set is empty). Candidates are picked according to the normalized weights specified in the previous step.

3.11. Sample Generation

If generation is based only on seeds, the samples are generated by applying the oversampling part of ROSE [35]. Essentially, a new sample is generated in the neighborhood of a seed. The diameter of this neighborhood is determined by a so-called smoothing matrix, and the location of the new sample within the neighborhood is drawn from a given probability density function. If the probability density function is Gaussian, the generated samples can be considered Gaussian jitters of the seed. The interested reader can find more details in [35].
If both seeds and pairs are available, the new samples are determined by linear interpolation, which is one of the most commonly used solutions: $x_{syn} = x + \lambda \cdot (y - x)$, where $\lambda$ is a random number selected from a standard uniform distribution, $x$ is a seed, and $y$ is a pair sample. At this point, we expand the minority set with the new samples.
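The two generation modes can be sketched as follows; the scalar sigma used in place of the full smoothing matrix of ROSE, as well as the function names, are our own simplifying assumptions.
```python
import numpy as np

def generate_by_interpolation(x_seed, x_pair, rng):
    """New sample on the segment between a seed and its selected pair."""
    lam = rng.random()                               # lambda ~ U[0, 1)
    return x_seed + lam * (x_pair - x_seed)

def generate_by_gaussian_jitter(x_seed, rng, sigma=0.1):
    """ROSE-like generation around a single seed; the scalar sigma stands in
    for the smoothing matrix of the original method."""
    return x_seed + rng.normal(scale=sigma, size=x_seed.shape)

# Example usage:
# rng = np.random.default_rng(0); x_new = generate_by_interpolation(x, y_pair, rng)
```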

3.12. Post-Filtering

The role of the post-filter is to remove samples generated in the wrong place, to smooth the decision boundary and sometimes to remove irrelevant samples (undersampling). For this step, the choices were two different noise filters (PF_ENN5, PF_IPF) and skipping the step (NoFilter).
PF_ENN5: It works analogously to NF_ENN5; however, this time we remove the noise samples from the minority set.
PF_IPF: The Iterative Partitioning Filter (IPF) [47] removes samples from the data set that cannot be consistently classified into the appropriate class by classifiers built on different folds of the data set. The procedure ends if the quotient of the number of detected noises and the size of the original data set remains below p for k consecutive iterations. We set the value of p to 0.01 and the number of folds to 5, and a decision tree was used as the classifier; the same filter is applied in SMOTE-IPF [23]. It can delete both minority and majority samples.
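A rough sketch of an IPF-style post-filter is given below; it is a simplification of the published algorithm, assumes non-negative integer class labels, and the function name and the majority-vote noise criterion are our own choices.
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def ipf_like_filter(X, y, n_folds=5, p=0.01, k=3, seed=0):
    """IPF-style post-filter sketch: iteratively remove samples that the
    decision trees built on different folds misclassify, until the removed
    fraction stays below p for k consecutive iterations."""
    X, y = np.asarray(X), np.asarray(y)
    n_original, calm_rounds = len(X), 0
    while calm_rounds < k:
        if np.min(np.bincount(y)) < n_folds:          # too few samples left per class
            break
        votes = np.zeros(len(X))
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for train_idx, _ in skf.split(X, y):
            tree = DecisionTreeClassifier(random_state=seed).fit(X[train_idx], y[train_idx])
            votes += (tree.predict(X) != y)
        noisy = votes > n_folds / 2                    # majority of the trees misclassify
        calm_rounds = calm_rounds + 1 if noisy.sum() / n_original < p else 0
        X, y = X[~noisy], y[~noisy]
    return X, y
```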

4. Results and Discussion

To measure the performance of our method and to compare it with other ones, we performed two experiments. In the first part of this section, we provide details of the evaluation process, which was carried out in the same way in both cases.

4.1. Oversamplers

In the field of oversampling, Kovács published the largest comparative analysis in 2019, ranking 85 methods based on a well-defined test [48]. According to this analysis, the top 10 out of the 85 oversamplers are Polynomial fitting SMOTE [49], ProWSyn [50], SMOTE-IPF [23], Lee [43], SMOBD [51], G-SMOTE [52], CCR [53], LVQ-SMOTE [54], Assembled-SMOTE [38] and SMOTE-TomekLinks [36]. We involved these methods in the tests, as well as SMOTE itself because of its prevalence. Interestingly, these samplers work with quite different approaches.
However, there are three procedures among them that simply complement SMOTE with a post-filtering step: SMOTE-IPF, SMOTE-TomekLinks and Lee's method. Furthermore, in addition to these samplers, the CCR and the SMOBD also deal with the issue of noise.
Due to the design of the proposed method, it should also be mentioned that the Polynomial fitting SMOTE, whose sampling strategy can be completely changed by modifying its parameters, achieved the best overall score.

4.2. Data

For both tests, we used data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [55] and the UCI Machine Learning Repository [56]. The multi-class classification problems were transformed into binary classification problems; therefore, some of these data sets are related.

4.3. Evaluation Metrics

It is the responsibility of the samplers to properly prepare the data sets for classification; thus, their efficiency is determined by the performance of the classifiers on the sampled data sets.
To measure the performance of the classification, the accuracy (Acc),
Acc = \frac{TN + TP}{TN + TP + FN + FP},
the F1-score
F1 = \frac{2TP}{2TP + FN + FP},
the G-mean score
G = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}},
and the area under the ROC curve (AUC) were used. The latter can be estimated using the trapezoidal rule. In the formulas, $TP$ and $TN$ denote the numbers of correctly classified positive and negative samples, respectively, while $FP$ denotes the number of negative samples misclassified as positive and $FN$ the number of positive samples misclassified as negative. Traditionally, the minority samples are the positive ones. It is worth noting that the accuracy is only slightly affected by the correct or incorrect classification of the minority samples if the data set is highly imbalanced. The other three metrics are less sensitive to the difference in class sizes.
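For reference, the four scores can be computed as in the sketch below; it assumes 0/1 labels with the minority class labeled 1 as the positive class, and uses scikit-learn metrics rather than the authors' evaluation code.
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

def evaluate_binary(y_true, y_pred, y_score):
    """Compute the four scores used in the study, treating the minority
    class (label 1) as the positive class."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)   # TP / (TP + FN)
    specificity = recall_score(y_true, y_pred, pos_label=0)   # TN / (TN + FP)
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred, pos_label=1),
        "G": np.sqrt(sensitivity * specificity),
        "AUC": roc_auc_score(y_true, y_score),   # trapezoidal estimate of the ROC area
    }
```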

4.4. Classifiers

As the effectiveness of the oversamplers is not measured per se, the result obtained may depend on the classifiers involved in the tests. To give a fair chance to all samplers, we used four classifiers operating on different principles, namely a linear support vector machine (SVM), a k-nearest neighbors classifier (kNN), a decision tree (CART) and a feed-forward neural network (MLP), and each classifier was run with multiple parametrizations, mainly following the methodology proposed by Kovács [48].
For the C regularization parameter of the SVM, the values 1, 5 and 10 were used; the kNN was run with k = 3, 5 and 7, both without weighting and with inverse distance weighting. The CART was used with Gini impurity and entropy, and the height of the tree could be 2, 3 or unlimited. The MLP was applied with ReLU and logistic activation functions and a single hidden layer, where the number of nodes was 10%, 50% or 100% of the number of input features.
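The grids above translate roughly into the following scikit-learn instances; this is a sketch of our reading of the described parametrizations, not the authors' exact configuration, and the function name is hypothetical.
```python
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def classifier_grid(n_features):
    """Classifier instances approximating the parameter grids above;
    n_features is the number of input attributes of the given data set."""
    return {
        "SVM": [LinearSVC(C=c) for c in (1, 5, 10)],
        "kNN": [KNeighborsClassifier(n_neighbors=k, weights=w)
                for k in (3, 5, 7) for w in ("uniform", "distance")],
        "DTree": [DecisionTreeClassifier(criterion=c, max_depth=d)
                  for c in ("gini", "entropy") for d in (2, 3, None)],
        "MLP": [MLPClassifier(hidden_layer_sizes=(max(1, int(r * n_features)),),
                              activation=a)
                for a in ("relu", "logistic") for r in (0.1, 0.5, 1.0)],
    }
```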

4.5. Evaluation

To evaluate the performance of the oversampler–classifier pairs we used five-fold cross-validation with three repeats with the smote-variants Python package [56], which guarantees that all the oversamplers are evaluated on the same set of samples during cross-validation, and folds contain a similar proportion of minority and majority samples as the original data set. An oversampled version of the currently selected fold is used to train the classifiers, while testing is done on the original version of the remaining folds.
The different instances of the classifiers and samplers usually provide different results on the same data set depending on their parameters. The highest achieved AUC, F1, G and Acc values are selected as the result of a sampler–classifier pair assuming that users of the methods would also aim to select the best parameterization. Since this step can be considered as parameter optimization, we did not apply a separate optimization step to our method.
The total performance of an oversampler–classifier pair was calculated by averaging the results obtained on the databases.
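A simplified sketch of this evaluation loop is shown below; it assumes an oversampler object exposing a sample(X, y) method (as in the smote-variants package) and a generic scorer callable, and it reproduces the described scheme of training on the oversampled selected fold and testing on the remaining folds, while the exact bookkeeping of the package may differ.
```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold

def evaluate_pair(oversampler, classifier, X, y, scorer, seed=0):
    """Sketch of the evaluation loop: the selected fold is oversampled and
    used for training, while the remaining folds are kept in their original
    form for testing."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=seed)
    scores = []
    for rest_idx, fold_idx in cv.split(X, y):          # fold_idx: the selected fold
        X_tr, y_tr = oversampler.sample(X[fold_idx], y[fold_idx])
        model = clone(classifier).fit(X_tr, y_tr)
        scores.append(scorer(y[rest_idx], model.predict(X[rest_idx])))
    return float(np.mean(scores))
```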

4.6. Experiment 1

The goal of this test was to verify that the paths of the proposed method are sufficiently diverse to find a suitable one for different data sets. We involved 80 data sets in this experiment, whose most important properties are summarized in Table A1.
Our oversampler was tested with the meaningful parameter combinations (paths). For the other methods, we used the parameter combinations provided by the smote-variants package. (These combinations were specified based on the descriptions of the implemented procedures, sometimes supplemented with additional reasonable ones). If it was supported by the algorithm, complete balancing of the data sets was requested.
By performing the test as described in Section 4.5, if the ModularOverSampler achieves the highest scores, it means that there is a path that provides the same or better performance than the best of the other samplers in the comparison. Table 1 shows the number of data sets on which the different sampler–classifier pairs achieved first place. The results without sampling are also provided.
The results support the assumption that the oversamplers should be adapted to the data set being sampled. The different effects of the sampling strategies can also be observed in Figure A1 through a data set with few minority samples. Comparing Figure A1b,c and Figure A1d–f, we can see that the kNN-based pair selection strictly narrows the space where new samples can be placed compared with the cluster-based solutions, because the same seeds are involved in generating many synthetic samples.
It is easy to see that if the degree of imbalance is smaller, the distribution of the samples will not be so distorted. We can also observe that if the minority samples are not grouped on one side of the majority set, oversampling the border set can increase overlap between classes (Figure A1f). In such a situation, the pair selection should be limited to a smaller environment of the seeds.
The ModularOverSampler provides more options than ever before to find a sampler that fits the data set. Although there are some paths that simulate the effect of other oversamplers (for example, the path used to create Figure A1b,c is equivalent to SMOTE and SMOTE-ENN [36]), several paths actually represent new sampling strategies, as the number of different paths exceeds the number of existing oversamplers.
However, it must be acknowledged that, given the performance of today's PCs, it is not yet viable to examine thousands of parameter combinations to sample a data set, especially if one wants to further reduce the risk of over-fitting by increasing the number of folds or the number of repetitions. The other remark we need to make here, in the name of fairness, is that the role of chance in sampling procedures is not negligible. If there is no significant difference in the number of parameters of the samplers, we may assume that none of the samplers gains too much from this effect; however, our method has an advantage in this respect.
To address these issues, we performed another experiment with a limited number of parameter combinations.

4.7. Experiment 2

The aim of the second experiment was to compare the performance of a practical version of our sampler with the other samplers. For the experiment, we formed two completely independent sets, a training set and a test set, from the data sets, taking into account the relationships between the different data sets. The training set, which contained 50 data sets (Table A2), was used to reduce the number of paths. It contains many related data sets and is thus undoubtedly not the most suitable for path selection; however, this was the cost of including more diverse (e.g., in topic, number of samples, number of attributes and imbalance ratio) data sets in the test set (Table A3).
For each data set, we determined which parameterization of our oversampler had the best impact on the classifiers. Again, we used the AUC, F1, G-score and Acc. Thus, a total of 800 paths were obtained as a result: 50 data sets × 4 classifiers × 4 scores. Then, we selected the 35 paths that appeared most frequently in the results (Table A4). Finally, we examined the performance of the proposed method using the selected paths as parameters. Testing was performed in the same manner as in the previous test. The results are shown in Table 2.
The table shows that the kNN classifier combined with ProWSyn achieved a higher G value than when combined with our method; however, in all other cases, our method achieved a better score than the others, and these results are no longer based on a large number of parameter combinations but on the use of different strategies.
To determine for which classifiers and metrics the differences can be considered statistically significant, we performed statistical tests. First, a nonparametric Friedman test was used. As a null hypothesis, we assumed that the effect (on the performance of the classification) of all samplers was the same, which could be rejected at a 95% confidence level. Based on this, it makes sense to investigate whether there is a significant difference between the effect of the proposed method and that of the other ones. To do this, we performed Holm–Bonferroni tests. Based on the results in Table A5, we can conclude that, in many cases, there are significant differences between the effects of our method and the others.
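One possible way to carry out such a comparison is sketched below; the paper does not specify the pairwise statistic, so the use of the Wilcoxon signed-rank test is our own assumption, as are the function name and the score-matrix layout.
```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_to_reference(scores, reference_col=0):
    """Friedman test over all samplers, followed by pairwise comparisons of a
    reference sampler against the others with Holm-Bonferroni correction.
    scores: one row per data set, one column per sampler."""
    _, p_friedman = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
    raw_p = [wilcoxon(scores[:, reference_col], scores[:, j]).pvalue
             for j in range(scores.shape[1]) if j != reference_col]
    _, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
    return p_friedman, p_holm
```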
The results suggest that our method may be a safe choice for data preparation for any classifier involved in the test. We can expect results similar to those of the other methods when it is used before MLP or DTree, and better results when it is used before SVM or kNN. If, for some reason, the runtime of the method is important, the use of another well-performing method, e.g., Polynom-fit-SMOTE or ProWSyn, is recommended. For the kNN and SVC classifiers, one may also want to consider the CCR.

5. Limitations and Recommendation for Further Research

Although the tests presented in the previous section showed that the proposed method performed well with the 35 paths selected in Experiment 2, further investigations are required to select the optimal subset of paths. Furthermore, it would be interesting to determine the set of the best paths for each classifier. As with most traditional oversamplers, the proposed method is designed to handle small to medium-sized data sets. Tests were also performed on such data sets.
As mentioned in the introduction, if a large amount of data is available, we have the option to use deep-learning algorithms. A good example of this is a multi-sensor-based hand gesture recognition system for surgical robot teleoperation, where the authors used an approach based on a multi-layer recurrent neural network [8]. However, to the best of our knowledge, there has been no extensive comparative study of how oversamplers perform on larger data sets or higher dimensional problems (e.g., credit card transactions, medical image processing). Further research in this area would be desirable.

6. Conclusions

In this study, we presented a modular oversampling method (ModularOverSampler) that can be adapted to different data sets by optimizing its parameters. The adaptability of the method was confirmed by a broad test involving 11 oversamplers, 80 imbalanced data sets and four classifiers, namely SVM, DTree, kNN and MLP. The performance of the sampler–classifier pairs was measured using the AUC, F1, G and Acc scores. The proposed method reached the highest score on most of the data sets, which proved that the defined parameter combinations (paths) make the procedure very flexible.
Another experiment, in which we used our method with only 35 parameter combinations, provided evidence that the proposed method competes with the best oversampling methods known in the literature even after drastically reducing the number of the paths. In certain cases, a statistically significant improvement was achieved. Finally, identifying the typical steps of oversampling algorithms and organizing them into modules allows the impact of different implementations of each step on classification to be examined more systematically than before. We believe that the methodology presented in this paper will provide a good basis for many future studies.

Author Contributions

Methodology, S.S. and A.F.; Visualization, S.S.; Writing—original draft, S.S. and A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the construction EFOP-3.6.3-VEKOP-16-2017-00002. The project was supported by the European Union and cofinanced by the European Social Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets used in this study are available at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets.php (accessed on 10 February 2022) or KEEL-dataset repository at https://keel.es (accessed on 10 February 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Data sets from the UCI [57] and KEEL [55] repositories involved in the first experiment. (Multiclass problems were converted to binary ones).
Name | No. of Samples | No. of Minority Samples | No. of Attributes | Name | No. of Samples | No. of Minority Samples | No. of Attributes
abalone-17_vs_7-8-9-102338588glass4214139
abalone-19_vs_10-11-12-131622328glass521499
abalone-20_vs_8-9-101916268glass6214299
abalone-21_vs_8581148habarman306813
abalone-3_vs_11502158hepatitis1553219
abalone9-18731428hypothyroid316315125
car_good1728696iris0150504
car-vgood1728656KC1210932621
cleveland-0_vs_41771323kr-vs-k-zero_vs_fifteen2193276
cm14984923led7digit-0-2-4-6-7-8-9_vs_1443377
ecoli-0_vs_1220777lymphography-normal-fibrosis148623
ecoli-0-1_vs_2-3-5244247new_thyroid1215355
ecoli-0-1_vs_5240206page-blocks-1-3_vs_44722810
ecoli-0-1-3-7_vs_2-628177pc111097721
ecoli-0-1-4-6_vs_5280206pima7682688
ecoli-0-1-4-7_vs_2-3-5-6336297poker-8_vs_614771725
ecoli-0-1-4-7_vs_5-6332256poker-8-9_vs_614852525
ecoli-0-2-3-4_vs_5202207poker-9_vs_7244825
ecoli-0-2-6-7_vs_3-5224227segment0230832923
ecoli-0-3-4_vs_5200207spect-f2675544
ecoli-0-3-4-7_vs_5-6257257vehicle384621218
ecoli-0-4-6_vs_5203206vowel09889013
ecoli-0-6-7_vs_3-5222227winequality-red-8_vs_66561811
ecoli-0-6-7_vs_5220206winequality-red-8_vs_6-78551811
ecoli1336777winequality-white-3_vs_79002011
ecoli2336527winequality-white-3-9_vs_514822511
ecoli3336357winequality-white-9_vs_4168511
ecoli4336207wisconsin6832399
flare-f10664311yeast-0-2-5-6_vs_3-7-8-910049910
german100030029yeast-0-2-5-7-9_vs_3-6-810049910
glass0214709yeast-0-3-5-9_vs_7-85065010
glass-0-1-2-3_vs_4-5-6214519yeast-0-5-6-7-9_vs_45285110
glass-0-1-4-6_vs_2205179yeast-1_vs_7459307
glass-0-1-5_vs_2172179yeast-1-2-8-9_vs_79473010
glass-0-1-6_vs_2192179yeast-1-4-5-8_vs_76933010
glass-0-1-6_vs_518499yeast-2_vs_4514518
glass-0-4_vs_59299yeast-2_vs_84822010
glass-0-6_vs_510899yeast414845110
glass1214769yeast514844410
glass2214179yeast614843510
Table A2. Data sets from the UCI [57] and KEEL [55] repositories used to select the best paths. (Multiclass problems were converted to binary ones).
ecoli-0-1-3-7_vs_2-6 | glass-0-1-2-3_vs_4-5-6 | yeast-0-2-5-6_vs_3-7-8-9
ecoli-0-1-4-6_vs_5 | glass-0-1-4-6_vs_2 | yeast-0-2-5-7-9_vs_3-6-8
ecoli-0-1-4-7_vs_2-3-5-6 | glass-0-1-5_vs_2 | yeast-0-3-5-9_vs_7-8
ecoli-0-1-4-7_vs_5-6 | glass-0-1-6_vs_2 | yeast-0-5-6-7-9_vs_4
ecoli-0-1_vs_2-3-5 | glass-0-1-6_vs_5 | yeast-1-2-8-9_vs_7
ecoli-0-1_vs_5 | glass-0-4_vs_5 | yeast-1-4-5-8_vs_7
ecoli-0-2-3-4_vs_5 | glass-0-6_vs_5 | yeast-1_vs_7
ecoli-0-2-6-7_vs_3-5 | glass0 | yeast-2_vs_4
ecoli-0-3-4-7_vs_5-6 | glass1 | yeast-2_vs_8
ecoli-0-3-4_vs_5 | glass2 | yeast1
ecoli-0-4-6_vs_5 | glass4 | yeast4
ecoli-0-6-7_vs_3-5 | glass5 | yeast5
ecoli-0-6-7_vs_5 | glass6 | yeast6
ecoli-0_vs_1 | winequality-red-8_vs_6 | wisconsin
ecoli1 | winequality-red-8_vs_6-7
ecoli2 | winequality-white-3-9_vs_5
ecoli3 | winequality-white-3_vs_7
ecoli4 | winequality-white-9_vs_4
Table A3. Data sets involved in the second experiment; see [55,57]. (Multiclass problems were converted to binary ones).
abalone-17_vs_7-8-9-10 | hepatitis | lymphography-normal-fibrosis
abalone-19_vs_10-11-12-13 | hypothyroid | new_thyroid1
abalone-20_vs_8-9-10 | iris0 | page-blocks_1-3_vs_4
abalone-21_vs_8 | kc1 | pc1
abalone-3_vs_11 | kddcup-buffer-overflow_vs_back | pima
abalone19 | kddcup-guess-passwd_vs_satan | poker-8-9_vs_6
abalone9_18 | kddcup-land_vs_portsweep | poker_8_vs_6
car-good | kddcup-land_vs_satan | poker_9_vs_7
car-vgood | kddcup-rootkit-imap_vs_back | segment0
cleveland-0_vs_4 | kr_vs_k_one_vs_fifteen | spect-f
cm1 | kr-vs-k-three_vs_eleven | vehicle3
flare-f | kr-vs-k-zero-one_vs_draw | vowel0
german | kr-vs-k-zero_vs_eight
habarman | kr-vs-k-zero_vs_fifteen
Table A4. Paths of the ModularOverSampler selected in Experiment 2. Each row represents a path, where the steps of the algorithm can be seen from left to right. The table is divided into two parts based on the main structure of the paths.
Noise Filtering | Clustering | Cluster Weighting | Sample Weighting | Sampling | Post-Filtering
NF_ENN3 | BorderSafeNoise5 | CW_BySize | SW_kNN | gaussian_jittering | PF_IPF
NF_ENN3 | BorderSafeNoise7 | CW_Border | SW_kNN | gaussian_jittering | PF_ENN5
NF_ENN3 | DBSCAN_min | CW_Uniform | SW_kNN | gaussian_jittering | NoFilter
NF_ENN5 | BorderSafeNoise5 | CW_ByInsideEnemy | SW_kNN | gaussian_jittering | PF_ENN5
NF_ENN5 | BorderSafeNoise7 | CW_Border | SW_kNN | gaussian_jittering | PF_ENN5
NF_ENN5 | BorderSafeNoise7 | CW_ByInsideEnemy | SW_Uniform | gaussian_jittering | NoFilter
NF_ENN5 | DBSCAN_min | CW_Uniform | SW_kNN | gaussian_jittering | PF_ENN5
NF_ENN5 | MinMaj | CW_Uniform | SW_Uniform | gaussian_jittering | PF_IPF
NF_ENN5 | MinMaj | CW_Uniform | SW_kNN | gaussian_jittering | PF_ENN5
NoFilter | BorderSafeNoise7 | CW_Border | SW_Uniform | gaussian_jittering | NoFilter
NoFilter | BorderSafeNoise7 | CW_BySize | SW_Uniform | gaussian_jittering | PF_ENN5
NoFilter | DBSCAN_min | CW_BySize | SW_Uniform | gaussian_jittering | PF_IPF
NoFilter | DBSCAN_whole | CW_ByInsideEnemy | SW_Uniform | gaussian_jittering | PF_IPF
NoFilter | MinMaj | CW_Uniform | SW_Uniform | gaussian_jittering | NoFilter
Noise Filtering | Clustering | Cluster Weighting | Sample Weighting | Cluster Selection | Candidate Selection | Candidates Weighting | Sampling | Post-Filtering
NF_ENN3 | BorderSafeNoise5 | CW_BySize | SW_kNN | CPM_SelfCluster | PCS_ClusterCentre | SPW_Uniform | interpolation | PF_IPF
NF_ENN3 | BorderSafeNoise5 | CW_BySize | SW_kNN | CPM_MINOR | CS_ALL | SPW_Uniform | interpolation | NoFilter
NF_ENN3 | BorderSafeNoise7 | CW_ByInsideEnemy | SW_kNN | CPM_MINOR | PCS_ClusterCentre | SPW_Uniform | interpolation | PF_IPF
NF_ENN3 | BorderSafeNoise7 | CW_BySize | SW_kNN | CPM_SelfCluster | PCS_ClusterCentre | SPW_Uniform | interpolation | NoFilter
NF_ENN5 | BorderSafeNoise5 | CW_BySize | SW_Uniform | CPM_SelfCluster | PCS_ClusterCentre | SPW_Uniform | interpolation | PF_ENN5
NoFilter | DBSCAN_whole | CW_BySize | SW_Uniform | CPM_SelfCluster | PCS_KNN | SPW_Uniform | interpolation | NoFilter
NoFilter | BorderSafeNoise7 | CW_Border | SW_kNN | CPM_SelfCluster | PCS_ClusterCentre | SPW_Uniform | interpolation | PF_IPF
NoFilter | BorderSafeNoise7 | CW_BySize | SW_Uniform | CPM_SelfCluster | PCS_ClusterCentre | SPW_Uniform | interpolation | PF_IPF
NF_ENN5 | BorderSafeNoise7 | CW_ByInsideEnemy | SW_kNN | CPM_MINOR | CS_ALL | SPW_kNN | interpolation | PF_IPF
NoFilter | BorderSafeNoise7 | CW_BySize | SW_kNN | CPM_SelfCluster | CS_ALL | SPW_kNN | interpolation | PF_ENN5
NF_ENN3 | BorderSafeNoise5 | CW_BySize | SW_kNN | CPM_MINOR | CS_ALL | SPW_BySeedPairDistance_m | interpolation | PF_IPF
NF_ENN3 | BorderSafeNoise7 | CW_BySize | SW_Uniform | CPM_MINOR | PCS_KNN | SPW_BySeedPairDistance_m | interpolation | PF_ENN5
NF_ENN5 | DBSCAN_min | CW_Uniform | SW_Uniform | CPM_MINOR | CS_ALL | SPW_BySeedPairDistance_m | interpolation | PF_ENN5
NF_ENN5 | DBSCAN_whole | CW_ByInsideEnemy | SW_kNN | CPM_SelfCluster | CS_ALL | SPW_BySeedPairDistance_m | interpolation | PF_ENN5
NoFilter | DBSCAN_whole | CW_Uniform | SW_kNN | CPM_MINOR | CS_ALL | SPW_BySeedPairDistance_m | interpolation | NoFilter
NF_ENN3 | BorderSafeNoise7 | CW_BySize | SW_Uniform | CPM_SelfCluster | CS_ALL | SPW_BySeedPairDistance_e | interpolation | PF_IPF
NF_ENN3 | MinMaj | CW_Uniform | SW_kNN | CPM_MINOR | PCS_KNN | SPW_BySeedPairDistance_e | interpolation | NoFilter
NoFilter | BorderSafeNoise5 | CW_ByInsideEnemy | SW_kNN | CPM_SelfCluster | CS_ALL | SPW_BySeedPairDistance_e | interpolation | NoFilter
NoFilter | BorderSafeNoise5 | CW_BySize | SW_kNN | CPM_MINOR | CS_ALL | SPW_BySeedPairDistance_e | interpolation | PF_IPF
NoFilter | BorderSafeNoise7 | CW_BySize | SW_Uniform | CPM_SelfCluster | CS_ALL | SPW_BySeedPairDistance_e | interpolation | PF_ENN5
NF_ENN5 | DBSCAN_whole | CW_ByInsideEnemy | SW_Uniform | CPM_MINOR | PCS_KNN | SPW_ByKNNEnemy | interpolation | PF_ENN5
Figure A1. Illustration of the impact of some different sampling strategies. The results were obtained with the ModularOverSampler on the data set (a) and with the following parameters: (b) NoFilter, MinMaj, CW_Uniform, SW_Uniform, CPM_MINOR, CS_KNN, SPW_Uniform, interpolation, NoFilter, (c) NoFilter, MinMaj, CW_Uniform, SW_Uniform, CPM_MINOR, CS_KNN, SPW_Uniform, interpolation, PF_ENN, (d) NoFilter, DBSCAN_min, CW_BySize, SW_Uniform, CPM_SelfCluster, CS_All, SPW_Uniform, interpolation, NoFilter, (e) NoFilter, DBSCAN_min, CW_Uniform, SW_Uniform, CPM_SelfCluster, CS_ClusterCenter, SPW_Uniform, interpolation, NoFilter and (f) NoFilter, BorderSafeNoise, CW_Border, SW_Uniform, CPM_SelfCluster, CS_All, SPW_Uniform, interpolation, NoFilter. The samples of the original minority and majority sets, the newly generated samples and the removed ones are marked by triangles, circles, stars and hyphens, respectively.
Table A5. The results of the Holm–Bonferroni tests.
ModularOverSampler compared by | F1—p | AUC—p | G—p | Acc—p
Assembled-SMOTE | 0.327627 | 0.128139 | 1.000000 | 0.001259
CCR | 0.001865 | 0.001279 | 0.003997 | 0.075417
G-SMOTE | 0.069154 | 0.035956 | 0.003997 | 0.075417
LVQ-SMOTE | 0.000133 | 0.001329 | 0.000404 | 0.000396
Lee | 0.112351 | 0.022422 | 0.282032 | 0.000033
ProWSyn | 0.112351 | 0.446792 | 1.000000 | 0.000007
SMOBD | 0.327627 | 0.067411 | 1.000000 | 0.000419
SMOTE | 0.000000 | 0.000004 | 0.000019 | 0.000000
SMOTE-IPF | 0.000123 | 0.000301 | 0.001519 | 0.000000
SMOTE-TomekLinks | 0.000022 | 0.000011 | 0.001358 | 0.000000
Polynom-fit-SMOTE | 0.112351 | 0.022422 | 0.023436 | 0.151167
no sampling | 0.000000 | 0.000000 | 0.000000 | 0.000362
DTree
ModularOverSampler compared by | F1—p | AUC—p | G—p | Acc—p
Assembled-SMOTE | 0.007945 | 0.025542 | 0.810202 | 0.000001
CCR | 0.101835 | 0.025542 | 0.007811 | 0.527657
G-SMOTE | 0.029078 | 0.000023 | 0.007811 | 0.031673
LVQ-SMOTE | 0.001763 | 0.000445 | 0.000036 | 0.031851
Lee | 0.000017 | 0.000060 | 0.083325 | 0.000000
ProWSyn | 0.044035 | 0.397051 | 0.810202 | 0.000000
SMOBD | 0.001763 | 0.000060 | 0.104556 | 0.000000
SMOTE | 0.000001 | 0.000000 | 0.007811 | 0.000000
SMOTE-IPF | 0.000006 | 0.000006 | 0.024562 | 0.000000
SMOTE-TomekLinks | 0.000000 | 0.000000 | 0.005710 | 0.000000
Polynom-fit-SMOTE | 0.101835 | 0.203519 | 0.082614 | 0.120106
no sampling | 0.000000 | 0.000000 | 0.000000 | 0.003174
KNN
ModularOverSampler compared by | F1—p | AUC—p | G—p | Acc—p
Assembled-SMOTE | 0.956707 | 1.000000 | 0.882724 | 0.000059
CCR | 0.933192 | 1.000000 | 0.882724 | 0.020890
G-SMOTE | 0.162401 | 0.337485 | 0.882724 | 0.031773
LVQ-SMOTE | 0.000356 | 0.000517 | 0.001660 | 0.000004
Lee | 0.055428 | 0.640611 | 0.882724 | 0.000000
ProWSyn | 1.000000 | 1.000000 | 0.882724 | 0.004809
SMOBD | 1.000000 | 1.000000 | 0.882724 | 0.011318
SMOTE | 0.000007 | 0.000208 | 0.049016 | 0.000000
SMOTE-IPF | 0.008097 | 0.071311 | 0.193764 | 0.000000
SMOTE-TomekLinks | 0.001915 | 0.000670 | 0.380950 | 0.000000
Polynom-fit-SMOTE | 0.956707 | 0.019646 | 0.057469 | 0.036107
no sampling | 0.000000 | 0.000000 | 0.000000 | 0.000002
MLP
ModularOverSampler compared by | F1—p | AUC—p | G—p | Acc—p
Assembled-SMOTE | 0.000258 | 0.002433 | 0.588482 | 0.000000
CCR | 0.000000 | 0.002433 | 0.588482 | 0.000000
G-SMOTE | 0.000264 | 0.002433 | 0.010138 | 0.008098
LVQ-SMOTE | 0.000000 | 0.000001 | 0.000437 | 0.000000
Lee | 0.000001 | 0.000000 | 0.003790 | 0.000000
ProWSyn | 0.019546 | 0.056248 | 0.701614 | 0.000172
SMOBD | 0.006383 | 0.002433 | 0.701614 | 0.000010
SMOTE | 0.000000 | 0.000000 | 0.000000 | 0.000000
SMOTE-IPF | 0.000000 | 0.000000 | 0.000902 | 0.000000
SMOTE-TomekLinks | 0.000000 | 0.000000 | 0.000001 | 0.000000
Polynom-fit-SMOTE | 0.019546 | 0.001995 | 0.003790 | 0.008567
no sampling | 0.000000 | 0.000000 | 0.000000 | 0.008567
SVC

References

  1. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  2. Zhao, C.; Xin, Y.; Li, X.; Yang, Y.; Chen, Y. A heterogeneous ensemble learning framework for spam detection in social networks with imbalanced data. Appl. Sci. 2020, 10, 936. [Google Scholar] [CrossRef] [Green Version]
  3. Liu, J. A minority oversampling approach for fault detection with heterogeneous imbalanced data. Expert Syst. Appl. 2021, 184, 115492. [Google Scholar] [CrossRef]
  4. Gui, X.; Zhang, J.; Tang, J.; Xu, H.; Zou, J.; Fan, S. A Quadruplet Deep Metric Learning model for imbalanced time-series fault diagnosis. Knowl. Based Syst. 2022, 238, 107932. [Google Scholar] [CrossRef]
  5. Khalilia, M.; Chakraborty, S.; Popescu, M. Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inform. Decis. Mak. 2011, 11, 1–13. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Fotouhi, S.; Asadi, S.; Kattan, M.W. A comprehensive data level analysis for cancer diagnosis on imbalanced data. J. Biomed. Inform. 2019, 90, 103089. [Google Scholar] [CrossRef] [PubMed]
  7. Su, H.; Hu, Y.; Karimi, H.R.; Knoll, A.; Ferrigno, G.; De Momi, E. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Netw. 2020, 131, 291–299. [Google Scholar] [CrossRef] [PubMed]
  8. Qi, W.; Ovur, S.E.; Li, Z.; Marzullo, A.; Song, R. Multi-Sensor Guided Hand Gesture Recognition for a Teleoperated Robot Using a Recurrent Neural Network. IEEE Robot. Autom. Lett. 2021, 6, 6039–6045. [Google Scholar] [CrossRef]
  9. Qi, W.; Aliverti, A. A multimodal wearable system for continuous and real-time breathing pattern monitoring during daily activity. IEEE J. Biomed. Health Inform. 2019, 24, 2199–2207. [Google Scholar] [CrossRef]
  10. Zhao, P.; Hoi, S.C. Cost-sensitive online active learning with application to malicious URL detection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 919–927. [Google Scholar]
  11. Weiss, G.M. Foundations of imbalanced learning. In Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley-IEEE Press: New York, NY, USA, 2013; pp. 13–41. [Google Scholar] [CrossRef]
  12. Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3573–3587. [Google Scholar]
  13. Guo, H.; Li, Y.; Shang, J.; Gu, M.; Huang, Y.; Gong, B. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239. [Google Scholar]
  14. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 1–54. [Google Scholar] [CrossRef]
  15. García, V.; Sánchez, J.S.; Marqués, A.I.; Florencia, R.; Rivera, G. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst. Appl. 2020, 158, 113026. [Google Scholar] [CrossRef]
  16. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  17. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  18. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  19. Ma, L.; Fan, S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinform. 2017, 18, 169. [Google Scholar] [CrossRef] [Green Version]
  20. Le, T.; Le Son, H.; Vo, M.T.; Lee, M.Y.; Baik, S.W. A cluster-based boosting algorithm for bankruptcy prediction in a highly imbalanced dataset. Symmetry 2018, 10, 250. [Google Scholar] [CrossRef] [Green Version]
  21. Xu, Z.; Shen, D.; Nie, T.; Kou, Y.; Yin, N.; Han, X. A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Inf. Sci. 2021, 572, 574–589. [Google Scholar] [CrossRef]
  22. Sanchez, A.I.; Morales, E.F.; Gonzalez, J.A. Synthetic oversampling of instances using clustering. Int. J. Artif. Intell. Tools 2013, 22, 1350008. [Google Scholar] [CrossRef] [Green Version]
  23. Sáez, J.A.; Luengo, J.; Stefanowski, J.; Herrera, F. SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 2015, 291, 184–203. [Google Scholar] [CrossRef]
  24. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. DBSMOTE: Density-based synthetic minority over-sampling technique. Appl. Intell. 2012, 36, 664–684. [Google Scholar] [CrossRef]
  25. Xu, X.; Chen, W.; Sun, Y. Over-sampling algorithm for imbalanced data classification. J. Syst. Eng. Electron. 2019, 30, 1182–1191. [Google Scholar] [CrossRef]
  26. Hu, F.; Li, H. A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE. Math. Probl. Eng. 2013, 2013, 694809. [Google Scholar] [CrossRef]
  27. Hu, S.; Liang, Y.; Ma, L.; He, Y. MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the 2009 Second International Workshop on Computer Science and Engineering, Qingdao, China, 28–30 October 2009; Volume 2, pp. 13–17. [Google Scholar]
  28. Jiang, Z.; Pan, T.; Zhang, C.; Yang, J. A new oversampling method based on the classification contribution degree. Symmetry 2021, 13, 194. [Google Scholar] [CrossRef]
  29. Zhu, T.; Lin, Y.; Liu, Y. Improving interpolation-based oversampling for imbalanced data learning. Knowl.-Based Syst. 2020, 187, 2018104826. [Google Scholar] [CrossRef]
  30. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, 3, 408–421. [Google Scholar] [CrossRef] [Green Version]
  31. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  32. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482. [Google Scholar]
  33. Jo, T.; Japkowicz, N. Class imbalances versus small disjuncts. ACM Sigkdd Explor. Newsl. 2004, 6, 40–49. [Google Scholar] [CrossRef]
  34. Cateni, S.; Colla, V.; Vannucci, M. Novel resampling method for the classification of imbalanced datasets for industrial and other real-world problems. In Proceedings of the 11th International Conference on Intelligent Systems Design and Applications, Cordoba, Spain, 22–24 November 2011; pp. 402–407. [Google Scholar]
  35. Menardi, G.; Torelli, N. Training and assessing classification rules with imbalanced data. Data Min. Knowl. Discov. 2014, 28, 92–122. [Google Scholar] [CrossRef]
  36. Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  37. Cieslak, D.A.; Chawla, N.V.; Striegel, A. Combating imbalance in network intrusion datasets. In Proceedings of the GrC, Atlanta, GA, USA, 10–12 May 2006; pp. 732–737. [Google Scholar]
  38. Zhou, B.; Yang, C.; Guo, H.; Hu, J. A quasi-linear SVM combined with assembled SMOTE for imbalanced data classification. In Proceedings of the 2013 International Joint Conference on Neural Networks, Dallas, TX, USA, 4–9 August 2013; pp. 1–7. [Google Scholar]
  39. Koto, F. SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level. In Proceedings of the International Conference on Advanced Computer Science and Information System, Tanjung Priok, Indonesia, 18–19 October 2014; pp. 280–284. [Google Scholar]
  40. Chen, L.; Cai, Z.; Chen, L.; Gu, Q. A novel differential evolution-clustering hybrid resampling algorithm on imbalanced datasets. In Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, Phuket, Thailand, 9–10 January 2010; pp. 81–85. [Google Scholar]
  41. Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Proceedings of the Conference on Artificial Intelligence in Medicine in Europe, Cascais, Portugal, 1–4 July 2001. [Google Scholar]
  42. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
  43. Lee, J.; Kim, N.R.; Lee, J.H. An over-sampling technique with rejection for imbalanced class learning. In Proceedings of the Ninth International Conference on Ubiquitous Information Management and Communication, ACM, Bali, Indonesia, 8–10 January 2015; pp. 1–6. [Google Scholar]
  44. Cohen, G.; Hilario, M.; Sax, H.; Hugonnet, S.; Geissbuhler, A. Learning from imbalanced data in surveillance of nosocomial infection. Artif. Intell. Med. 2006, 37, 7–18. [Google Scholar] [CrossRef]
  45. De la Calleja, J.; Fuentes, O.; González, J. Selecting Minority Examples from Misclassified Data for Over-Sampling. In Proceedings of the FLAIRS Conference, Coconut Grove, FL, USA, 15–17 May 2008; pp. 276–281. [Google Scholar]
  46. Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; pp. 420–434. [Google Scholar]
  47. Khoshgoftaar, T.M.; Rebours, P. Improving software quality prediction by noise filtering techniques. J. Comput. Sci. Technol. 2007, 22, 387–396. [Google Scholar] [CrossRef]
  48. Kovács, G. An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl. Soft Comput. 2019, 83, 105662. [Google Scholar] [CrossRef]
  49. Gazzah, S.; Amara, N.E.B. New oversampling approaches based on polynomial fitting for imbalanced data sets. In Proceedings of the 2008 the Eighth Iapr International Workshop on Document Analysis Systems, Nara, Japan, 16–19 September 2008; pp. 677–684. [Google Scholar]
  50. Barua, S.; Islam, M.M.; Murase, K. ProWSyn: Proximity weighted synthetic oversampling technique for imbalanced data set learning. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Gold Coast, QLD, Australia, 14–17 April 2013; Springer: Berlin/Heidelberg, Germany; pp. 317–328.
  51. Cao, Q.; Wang, S. Applying over-sampling technique based on data density and cost-sensitive svm to imbalanced learning. In Proceedings of the 2011 International Conference on Information Management, Innovation Management and Industrial Engineering, Shenzhen, China, 26–27 November 2011; Volume 2, pp. 543–548. [Google Scholar]
  52. Sandhan, T.; Choi, J.Y. Handling imbalanced datasets by partially guided hybrid sampling for pattern recognition. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 1449–1453. [Google Scholar]
  53. Koziarski, M.; Woźniak, M. CCR: A combined cleaning and resampling algorithm for imbalanced data classification. Int. J. Appl. Math. Comput. Sci. 2017, 27, 727–736. [Google Scholar] [CrossRef] [Green Version]
  54. Nakamura, M.; Kajiwara, Y.; Otsuka, A.; Kimura, H. Lvq-smote–learning vector quantization based synthetic minority over–sampling technique for biomedical data. J. BioData Min. 2013, 6, 1–10. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  55. Alcala-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; Garcia, S.; Sanchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data set repository, integration of algorithms and Experimental analysis framework. J. Mult. Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  56. Kovács, G. Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing 2019, 366, 352–354. [Google Scholar] [CrossRef]
  57. UCI Machine Learning Repository: Data Sets. Available online: https://archive.ics.uci.edu/ml/datasets.php (accessed on 10 February 2022).
Figure 1. Illustration of the effect of data set imbalance on classification. The red lines indicate the decision boundaries determined by a linear SVM. Based on the model built on the highly imbalanced data set (a), the circles would be recognized with great accuracy but many triangles would be misclassified. Using a more balanced version of the data set (b), the linear SVM provides a better decision boundary. (Shapes with black contours indicate the samples that were used to determine the decision boundary).
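The effect illustrated in Figure 1 can be reproduced on synthetic data. The sketch below is a toy illustration (not the figure's actual data): it fits a linear SVM to a 990-versus-10 two-dimensional set and to a crudely balanced version of the same problem, and reports minority recall in both cases.

```python
# Toy illustration of Figure 1: how class imbalance shifts a linear SVM boundary.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(42)
majority = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(990, 2))   # "circles"
minority = rng.normal(loc=[2.5, 2.5], scale=1.0, size=(10, 2))    # "triangles"

X = np.vstack([majority, minority])
y = np.array([0] * 990 + [1] * 10)
imbalanced_svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Crude balancing by repeating the minority samples (for illustration only;
# the paper balances the classes with synthetic oversampling instead).
X_bal = np.vstack([majority, np.tile(minority, (99, 1))])
y_bal = np.array([0] * 990 + [1] * 990)
balanced_svm = LinearSVC(C=1.0, max_iter=10000).fit(X_bal, y_bal)

print("minority recall, imbalanced:", recall_score(y, imbalanced_svm.predict(X)))
print("minority recall, balanced:  ", recall_score(y, balanced_svm.predict(X)))
```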
Figure 2. The effect of SMOTE on a noisy data set: (a) the noisy data set, where the noise samples are denoted by P, Q and R. (b) the oversampled version of the data set. The red lines indicate the decision boundaries determined by a linear SVM built on the data sets, and shapes with black contours indicate the samples involved in determining the location of the decision boundary.
Figure 3. A well-known flaw of SMOTE. S is a seed separated from one of its neighbors, P3, by majority samples; therefore, when P3 is chosen as the co-seed of S, the new sample (denoted by the star) is likely to fall into the majority set.
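The interpolation step the caption refers to is the standard SMOTE rule of Chawla et al. [16]: a synthetic point is placed uniformly at random on the segment between a minority seed and one of its k nearest minority neighbors. The sketch below implements only that rule to make the failure mode concrete; the function name and data are illustrative.

```python
# Minimal sketch of the SMOTE interpolation step (after Chawla et al. [16]).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating a random seed
    towards one of its k nearest minority neighbors (the co-seed)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, idx = nn.kneighbors(minority)          # idx[:, 0] is the point itself
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))       # seed S
        j = rng.choice(idx[i, 1:])            # co-seed, e.g. P3 in the figure
        u = rng.random()                      # uniform in [0, 1)
        new_points.append(minority[i] + u * (minority[j] - minority[i]))
    return np.array(new_points)

# Illustrative minority set with separated groups: if S and its co-seed lie in
# different groups, points on the segment between them can land in the region
# occupied by the majority class, as in Figure 3.
minority = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9], [4.0, 4.0], [4.1, 3.9]])
print(smote_like_samples(minority, n_new=3, k=3))
```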
Figure 4. The main steps of the proposed method, where the non-round rectangles represent the modules with varied strategies.
Table 1. The results of Experiment 1: the number of data sets (out of 80) on which each sampler–classifier pair ranked first. (The sum of a column can exceed 80 because, on some data sets, two or more methods shared first place.)
                         SVC                       DTree
Sampler                  F1    AUC   G     Acc     F1    AUC   G     Acc
Assembled-SMOTE          4     5     4     5       4     3     4     6
CCR                      3     8     6     6       9     9     8     19
G-SMOTE                  4     5     4     9       4     3     4     5
Lee                      3     5     3     4       4     3     4     4
LVQ-SMOTE                4     7     6     5       5     5     6     7
ModularOverSampler       75    74    73    72      76    71    74    67
Polynom-fit-SMOTE        9     6     4     7       4     3     5     7
ProWSyn                  4     5     5     6       4     4     4     4
SMOBD                    3     5     4     4       6     3     4     4
SMOTE                    3     5     3     4       4     3     4     4
SMOTE-IPF                3     5     3     4       4     3     4     4
SMOTE-TomekLinks         3     5     3     4       4     3     4     4
no sampling              4     5     4     14      4     3     4     4

                         KNN                       MLP
Sampler                  F1    AUC   G     Acc     F1    AUC   G     Acc
Assembled-SMOTE          5     4     6     5       3     3     4     3
CCR                      12    14    12    24      4     8     8     5
G-SMOTE                  8     5     7     11      3     3     4     5
Lee                      6     5     6     6       3     3     3     3
LVQ-SMOTE                8     4     6     8       4     4     3     3
ModularOverSampler       65    66    68    59      76    71    73    71
Polynom-fit-SMOTE        13    5     11    12      4     3     3     5
ProWSyn                  7     8     7     6       4     5     3     3
SMOBD                    6     4     6     6       3     5     3     3
SMOTE                    5     4     5     5       3     3     3     3
SMOTE-IPF                6     4     6     6       3     3     3     3
SMOTE-TomekLinks         5     3     5     5       3     3     3     3
no sampling              4     4     4     10      0     0     0     4
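Table 1 tallies, for each classifier and metric, on how many of the 80 data sets a sampler produced the best score, with ties credited to every sampler sharing the maximum. A minimal sketch of such a tally is given below; the results table and its column names (dataset, sampler, f1) are hypothetical.

```python
# Hedged sketch (hypothetical column names): counting, for one classifier and
# one metric, on how many data sets each sampler achieves the best score.
# Ties are counted for every sampler sharing the maximum, so column sums can
# exceed the number of data sets, as noted in the caption of Table 1.
import pandas as pd

# results: one row per (data set, sampler) with the cross-validated score.
results = pd.DataFrame({
    "dataset": ["ecoli1", "ecoli1", "glass4", "glass4"],
    "sampler": ["SMOTE", "ModularOverSampler", "SMOTE", "ModularOverSampler"],
    "f1":      [0.61,     0.66,                 0.42,    0.42],
})

best_per_dataset = results.groupby("dataset")["f1"].transform("max")
first_place = results[results["f1"] == best_per_dataset]
print(first_place["sampler"].value_counts())
```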
Table 2. The results of the second experiment. The best values are highlighted in boldface and the worst values in italics.
                         SVC                                 DTree
Sampler                  F1      AUC     G       Acc         F1      AUC     G       Acc
Assembled-SMOTE          0.5449  0.8583  0.8256  0.8783      0.6931  0.8865  0.8591  0.9529
CCR                      0.5440  0.8628  0.8307  0.8730      0.6708  0.8734  0.8013  0.9663
G-SMOTE                  0.5545  0.8654  0.8239  0.9034      0.6864  0.8883  0.8533  0.9554
Lee                      0.5437  0.8570  0.8219  0.8742      0.6848  0.8844  0.8568  0.9526
LVQ-SMOTE                0.5444  0.8584  0.8230  0.8693      0.6731  0.8757  0.8378  0.9567
ModularOverSampler       0.6221  0.8779  0.8340  0.9666      0.7313  0.9041  0.8642  0.9681
Polynom-fit-SMOTE        0.6114  0.8630  0.8251  0.9433      0.7010  0.8873  0.8497  0.9674
ProWSyn                  0.5711  0.8623  0.8321  0.8949      0.6838  0.8883  0.8618  0.9475
SMOBD                    0.5536  0.8603  0.8285  0.8807      0.6895  0.8840  0.8589  0.9527
SMOTE                    0.5399  0.8541  0.8170  0.8605      0.6753  0.8789  0.8486  0.9497
SMOTE-IPF                0.5422  0.8538  0.8203  0.8673      0.6787  0.8789  0.8535  0.9507
SMOTE-TomekLinks         0.5420  0.8519  0.8176  0.8664      0.6743  0.8787  0.8508  0.9504
no sampling              0.5376  0.8254  0.6384  0.9640      0.6267  0.8455  0.7441  0.9652

                         KNN                                 MLP
Sampler                  F1      AUC     G       Acc         F1      AUC     G       Acc
Assembled-SMOTE          0.6804  0.9038  0.8710  0.9372      0.6737  0.9126  0.8849  0.9306
CCR                      0.7411  0.9116  0.8617  0.9700      0.6950  0.9179  0.8843  0.9622
G-SMOTE                  0.6894  0.9000  0.8668  0.9438      0.6692  0.9136  0.8794  0.9347
Lee                      0.6765  0.9018  0.8679  0.9362      0.6655  0.9108  0.8810  0.9290
LVQ-SMOTE                0.7128  0.9065  0.8645  0.9608      0.6818  0.9125  0.8833  0.9395
ModularOverSampler       0.7483  0.9140  0.8768  0.9712      0.7129  0.9251  0.8869  0.9678
Polynom-fit-SMOTE        0.7264  0.9066  0.8741  0.9692      0.7012  0.9131  0.8785  0.9603
ProWSyn                  0.6800  0.9101  0.8786  0.9317      0.6842  0.9139  0.8835  0.9372
SMOBD                    0.6807  0.9004  0.8694  0.9360      0.6728  0.9125  0.8851  0.9328
SMOTE                    0.6747  0.9003  0.8655  0.9353      0.6595  0.9069  0.8728  0.9281
SMOTE-IPF                0.6759  0.9017  0.8675  0.9362      0.6659  0.9105  0.8773  0.9288
SMOTE-TomekLinks         0.6742  0.9007  0.8653  0.9352      0.6650  0.9078  0.8784  0.9297
no sampling              0.7124  0.8924  0.7683  0.9699      0.5387  0.8842  0.6661  0.9614
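Table 2 reports the average F1, AUC, G and accuracy values. The sketch below computes these four metrics for a single binary prediction; it assumes G is the geometric mean of sensitivity and specificity, which is the usual convention in imbalanced learning but is not restated in this excerpt, and the dummy labels and scores are illustrative only.

```python
# Hedged sketch: the four metrics reported in Table 2 for a single run.
# G is assumed here to be the geometric mean of sensitivity and specificity.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "F1":  f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),   # y_score: minority-class probability
        "G":   np.sqrt(sensitivity * specificity),
        "Acc": accuracy_score(y_true, y_pred),
    }

# Example with dummy predictions (1 = minority class):
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.4])
print(evaluate(y_true, y_pred, y_score))
```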
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
