Article

Generation of Controlled Synthetic Samples and Impact of Hyper-Tuning Parameters to Effectively Classify the Complex Structure of Overlapping Region

by Zafar Mahmood 1,†, Naveed Anwer Butt 1,†, Ghani Ur Rehman 2,*,†, Muhammad Zubair 2,†, Muhammad Aslam 3,†, Afzal Badshah 4,† and Syeda Fizzah Jilani 5,*,†
1 Department of Computer Science, University of Gujrat, Punjab 50700, Pakistan
2 Department of Computer Science & Bioinformatics, Khushal Khan Khattak University, Karak 27000, Pakistan
3 School of Computing Engineering & Physical Sciences, University of West Scotland, Glasgow G72 0LH, UK
4 Department of Computer Science & Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan
5 Department of Physics, Aberystwyth University, Aberystwyth SY23 3BZ, UK
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2022, 12(16), 8371; https://doi.org/10.3390/app12168371
Submission received: 3 July 2022 / Revised: 10 August 2022 / Accepted: 18 August 2022 / Published: 22 August 2022

Abstract:
The classification of imbalanced and overlapping data has received considerable attention over the last decade, as most real-world applications comprise multiple classes with an imbalanced distribution of samples. Samples from different classes overlap near class boundaries, creating a complex structure for the underlying classifier. Due to the imbalanced distribution of samples, the underlying classifier favors samples from the majority class and ignores samples from the under-represented minority class. The imbalanced nature of the data, resulting in overlapping regions, greatly affects the learning of various machine learning classifiers, as most machine learning classifiers are designed to handle balanced datasets and perform poorly when applied to imbalanced data. To improve learning on multi-class problems, more expertise is required in both traditional classifiers and problem domain datasets, along with some experimentation and knowledge of hyper-tuning the parameters of the classifier under consideration. Several techniques for learning from multi-class problems have been reported in the literature, such as sampling techniques, algorithm adaptation methods, transformation methods, hybrid methods, and ensemble techniques. In the current research work, we first analyzed the learning behavior of state-of-the-art ensemble and non-ensemble classifiers on imbalanced and overlapping multi-class data. After this analysis, we used grid search techniques to optimize key parameters (by hyper-tuning) of the ensemble and non-ensemble classifiers and determine the optimal set of parameters that enhances learning from multi-class imbalanced classification problems, performed on 15 public datasets. After hyper-tuning, synthetic samples amounting to 20% of each dataset are generated and added to the majority class of the respective dataset to make it more overlapped (a more complex structure). After the addition of the synthetic samples, the hyper-tuned ensemble and non-ensemble classifiers are tested on this complex structure. This paper also includes a brief description of the tuned parameters and their effects on imbalanced data, followed by a detailed comparison of ensemble and non-ensemble classifiers with the default and tuned parameters for both the original and synthetically overlapped datasets. We believe that this paper is the first effort of its kind in this domain and that it will open various research directions with a greater focus on classifier parameters in the field of learning from imbalanced data using machine-learning algorithms.

1. Introduction

Learning from imbalanced data [1] has proven its significance through the efforts devoted to it by the research community over the last few years. Most real-world applications, such as medical diagnosis, protein classification, activity recognition, target detection, microarray research, video streaming, and mining, are imbalanced [1]. In an imbalanced distribution, the samples of one class outnumber those of the other class or classes, which is an obvious hurdle for traditional classifiers learning from multi-class problems. Most traditional classifiers [2], such as k-nearest neighbor (kNN), naive Bayes (NB), artificial neural network (ANN), decision tree (DT), support vector machine (SVM), and logistic regression (LR), are designed for a balanced and linear distribution of the instances between the classes in the training dataset. In such scenarios, multi-class imbalance learning requires more expertise and skill, since as the number of classes in the problem domain increases, so do the challenges of representing the whole problem space accurately. Several problems have been reported in learning from multi-class imbalanced data: the presence of multiple majority and minority classes [1], overlapping class boundaries [3], a small sample size in the minority class [4], and small disjuncts [5], all of which must be taken into account before applying any multi-class imbalance classifier. Different papers discuss these four factors as challenges when conventional classifiers are subject to learning from data of an imbalanced nature.
  • Traditional classifiers are designed for, and indeed perform well on, balanced training sets, resulting in sub-optimal classification performance when applied to imbalanced problems [6].
  • With a skewed distribution, the predicted accuracy is biased toward the class with the greater number of samples, ignoring the rare-class instances, even when the prediction model achieves the best overall precision [7].
  • Both noise and minority class samples are sparsely represented, and learning models sometimes confuse the two, so noise may be incorrectly recognized as minority instances [8].
  • Learning from an imbalanced distribution is comparatively easy if the classes are linearly separable. However, in multi-class imbalanced problems, minority instances overlap with other classes' boundaries in regions where the prior likelihoods of the majority and minority classes are nearly equal [9].
Several approaches have been reported in the literature to cope with the imbalanced data problem, based on data-level methods [10], algorithm adaptation methods [11], and ensemble-based methods [12]. However, the majority of these approaches focus on handling the binary class imbalance problem and cannot be directly applied to multi-class imbalance problems, since the decision boundary must distinguish between more classes. In data-level methods, the original data are amended to relieve the overlapping impact, either by introducing complementary features, by applying data cleansing methods to separate the overlapping classes, or by merging overlapping classes to form meta classes [13]. However, using data-level approaches exclusively to address the imbalance and overlapping issues may result in model overfitting [14] or the loss of useful information [15]. Algorithm-based techniques modify an existing algorithm to deal with the uneven nature of the data and improve the learning of the underlying classifier, but such a modified algorithm cannot be applied as a general method to effectively tackle the overlapping issues [16,17]. The choice of base classifiers, the decision-making process, the number of classifiers used for the construction of an ensemble, the accuracy of the individual models, and the diversity among the individual models in an ensemble are some of the main factors to be carefully studied in an ensemble-based method [18].
Parameter hyper-tuning [19] of the selected model is the process of determining the particular parameter values that optimize the performance of a learning algorithm on a specific dataset. Parameter tuning has proven its vital role in improving the accuracy and overall model performance for both ensemble and non-ensemble classifiers [20,21]. Every classifier has its own set of parameters, which need to be tuned following different tuning steps, such as performing an exhaustive grid search. There are two types of model parameters in every machine-learning algorithm: conventional parameters and hyper-parameters. Conventional parameters are optimized during the training phase of the underlying model, whereas hyper-parameter values are set by the user before the training phase, depending on the problem dataset. In a more complex structure, i.e., with an increasing number of overlapping samples, an optimal set of parameters becomes more important in order to maximize the visibility of the minority class samples [22].
The key contribution of this paper is the analysis of the learning performance of ensemble and non-ensemble classifiers on multi-class imbalanced and overlapped data. A detailed investigation of ensemble and non-ensemble approaches is presented to provide deep insight into the nature of multi-class learning strategies. An exhaustive experiment on the hyper-tuning of six state-of-the-art ensemble and non-ensemble classifiers, to efficiently address the multi-class imbalanced dataset issue and comprehensively compare and improve their performance, was performed using four different evaluation metrics, namely the overall accuracy (ACC), geometric mean (G-mean) [23], F-measure [12], and area under the curve (AUC) [24]. As a third contribution, an algorithm was designed to synthetically generate overlapping samples amounting to 20% of the existing samples in each dataset, to make the dataset more complex and overlapped and to highlight the impact of parameter tuning. The comparison of ensemble and non-ensemble classifiers was carried out on 15 publicly available multi-class imbalanced and overlapped datasets.
This article is organized as follows. The literature survey is covered in Section 2. The background of ensemble-based and non-ensemble-based methods is covered in Section 3 and Section 4. Section 5 discusses the hyper-tuning of the parameters of the classifiers used, Section 6 describes the quantification of class overlapping, and Section 7 discusses the algorithm used to synthetically generate the overlapping samples. Section 8 highlights the experimental setup, datasets, and evaluation methods, and Section 9 covers the results and discussion. We compare the ensemble and non-ensemble approaches in Section 10, and the article is concluded in Section 11.

2. Related Works

The authors in [25] argued for the significance of ensemble approaches as compared to sampling methods and individual classifiers when applied to multi-class imbalanced datasets. The authors compared the ensemble-based approach with the non-ensemble approach to prove the robustness of ensemble methods. Yao and Wang in [26] combined AdaBoost.NC and AdaBoost with sampling techniques, with or without class decomposition, to address the multi-class imbalance problem, and highlighted that the ensemble approach is more effective than oversampling and decomposition techniques. Without class decomposition, AdaBoost.NC shows better performance than AdaBoost, but with an increasing number of classes, their performance also decreases gradually. The authors in [26] showed that the shortcomings of sampling methods cannot be avoided by using standard ensemble approaches if the dataset has multiple classes. Despite the increase in the number of samples in the positive class through oversampling, the distribution of classes in the data space is still imbalanced and dominated by the majority class. Chawla et al. [27] proposed SMOTEBoost, an ensemble-based sampling technique that enhances conventional SMOTE [28] by combining it with AdaBoost.M2. The authors applied SMOTE before the base classifier evaluation, so that each new instance's weight is proportional to the number of samples in the new dataset. After producing the new instances, the original instances' weights are normalized to generate the new distribution. In every iteration, the instance weight of the minority class increases. The authors in [29] highlighted the superior performance of bagging as compared to boosting in a multi-class and noisy environment. Furthermore, bagging techniques are quick to develop and become more powerful when properly ensembled. Similarly, OverBagging [30] combines data preprocessing and bagging techniques to manage the class imbalance issue by increasing the positive class cardinality through the duplication of original examples; at the same time, the instances of the majority (negative) class are included in every bag to increase diversity.
In [31], SMOTEBagging was proposed to counter multi-class imbalance learning issues by making every individual bag significantly diverse. In every iteration during bag creation, the SMOTE resampling rate is defined; this ratio specifies the positive class instances randomly resampled from the original dataset, and the remaining positive class instances are generated by SMOTE. Barandela et al. first proposed UnderBagging in [31], wherein the negative class instances are randomly reduced in every bootstrap sample to match the cardinality of the positive class. This basic, simple version of undersampling, when merged with bagging-based techniques, proves to be more effective than more complex solutions such as BalanceCascade [32] and EasyEnsemble [33]. In [34], the authors highlighted an important problem with the ensemble size and ensemble cardinality (the number of component classifiers in the final ensemble), as it affects the predictive performance, time, and memory of the classification algorithm when applied to imbalanced data, as well as the diversity among the component classifiers. The boosting ensemble method presented in [35] is one of the prominent methods explicitly based on the complementarity among the component classifiers. Through the boosting process, a strong learner is built from a collection of different weak learners (weak in the sense of accuracy when applied to the classification task). In [35], the authors proposed a new ensemble method, twin bounded weighted relaxed support vector machines (TBWRSVM), an extension of the weighted relaxed support vector machine (WRSVM) classifier, for class imbalance problems and outliers. The resulting classifier utilizes twin bounded support vector machines (TBSVM), which provides a quick classification method. The authors in [36] suggested a novel ensemble approach, Dynamic Ensemble Selection for Multi-class Imbalanced datasets (DES-MI), to handle the challenges of the multi-class problem. To improve learning from multi-class problems, the DES-MI model first creates a balanced training set before choosing an appropriate classifier. To assess the accuracy and computational cost of the well-known boosting-based ensemble classifier XGBoost, the authors in [37] examined its general performance, efficiency, competence, and effectiveness while taking into account its sensitivity to the sample size and feature space. According to the authors, the performance of the statistical analysis is greatly improved when parameterizing XGBoost using a Bayesian approach as opposed to utilizing random forests and support vector machines operated on a larger sample size.
The authors in [38] described the suitability of data preprocessing techniques for addressing data imbalance issues. The authors of this study are of the opinion that a balanced training dataset is more robust for improving the overall performance of the classifier for several base classifiers. Zhang and Mani in [39] proposed a new technique that achieves undersampling through the kNN classifier. Based on the characteristics of the data distribution, four undersampling methods based on kNN are proposed, namely NearMiss-1, NearMiss-2, NearMiss-3, and the "most distant" method, in which a small subset of the training data is selected to minimize the skewness in the remaining data. The authors in [40] highlighted that sampling techniques are very effective in dealing with binary classification problems with two target variables but face difficulties when directly applied to solve multi-class classification problems. The authors used the "Mahalanobis Distance-based Over-sampling (MDO)" [41] technique to handle imbalanced class data with mixed attributes, introduced generalized singular value decomposition (GSVD) for complex and mixed-type data, and augmented it with a resampling scheme applied to the mixed type of attributes to optimize the synthesis of samples. In [42], the authors presented a combination of k-means clustering and SMOTE, which results in an effective oversampling method that overcomes the imbalance ratio between and within classes and avoids noise generation. The method proceeds in three steps, clustering, filtering, and sampling, to enhance learning from multi-class classification problems. To address imbalanced classification problems, the authors combined random oversampling and random undersampling techniques in [43] to propose a hybrid sampling SVM approach. Using the undersampling technique, the samples with the least significance were deleted, followed by the oversampling technique to generate samples in the minority class. In [44], the authors first proposed an optimization classification model (OCM) to deal with classification problems using evolutionary computation (EC) techniques. In the second step, the authors proposed a novel algorithm, the self-adaptive fireworks algorithm (SaFWA), based on swarm intelligence, to address the optimization problems. To increase the diversity/range of solutions, four candidate solution generation strategies (CSGSs) were merged with SaFWA. In [45], the authors highlighted multi-class problem solution strategies (decomposition strategies), i.e., transforming an imbalanced multi-class problem into several sub-problems and designing a separate classifier for each (binary decomposition is considered to be the most prominent approach for multi-class decomposition). The proposed model has the flexibility to discard non-competent classifiers to improve the robustness of the combination phase. The competency of a classifier is measured by considering the neighborhood of each sample, augmented with selection criteria (with a threshold option) for a classifier corresponding to the minority class in this neighborhood. The authors in [46] proposed a Bayesian learning probabilistic model to improve the performance of Bayesian classification using a combination of a Kalman filter and k-means. The method is applied to a small dataset just to establish that the proposed algorithm can reduce the time for computing the clusters from the data.
The authors in [47] proposed a deep image analysis–based model for glaucoma diagnosis that uses several features to detect the formation of glaucoma in the retinal fundus. The proposed model is combined with SVM, KNN, and NB to investigate the various aspects related to the prediction of glaucoma in retinal fundus images that help the ophthalmologist make better decisions for the human eye. Some of the prominent existing techniques reported in the literature are listed in Table 1.

3. Ensemble-Based Methods

Ensemble learning, also known as multiple classifier systems, has become an influential solution not only for multi-class imbalance learning but also for two-class imbalance problems and standard classification, as discussed in [48] with regard to the boosting algorithms primarily designed for binary classification. In the literature, different researchers agree on the versatility and effectiveness of ensemble-based learning techniques, in which several component classifier predictions are combined to make the final prediction, improving the performance of individual weak learners trained on a small training dataset in order to build an improved classification-learning model. Ensemble approaches were initially introduced in the early 1990s [49,50], and the authors of [51] argued that combining multiple classifiers (via an ensemble process) could yield better performance than individual classifiers. Mathematically [52], the performance of each individual classifier over a dataset D with M classes is given in Equation (1) as:
w_{i,j} = \frac{2\, p_j(C_i)}{|D_j| + p_j(C_i) + q_j(C_i)} \qquad (1)
where $C_i$ denotes an individual classifier, with $i = 1, 2, \ldots, N$, evaluated on D, and an $N \times M$ matrix W is defined as:
W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,M} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N,1} & w_{N,2} & \cdots & w_{N,M} \end{bmatrix} \qquad (2)
Each element $w_{i,j}$ is defined in Equation (1), where $D_j$ is the set of instances of the dataset belonging to class j, $p_j(C_i)$ is the number of accurate predictions of classifier $C_i$ on $D_j$, and $q_j(C_i)$ is the number of false or incorrect predictions of $C_i$ that an instance belongs to class j. Subsequently, the target class $\hat{y}$ of each unknown instance x in the test set is computed by Equation (3):
\hat{y} = \underset{j}{\arg\max} \sum_{i=1}^{N} w_{i,j}\, \chi_A\left(C_i(x) = j\right) \qquad (3)
where the function $\arg\max$ returns the index corresponding to the largest value in the array, $A = \{1, 2, \ldots, M\}$ is the set of unique class labels, and $\chi_A$ is the characteristic function that takes the prediction $j \in A$ of a classifier $C_i$ on an instance x and creates a vector in which the j-th coordinate takes the value 1 and the rest take the value 0. After several alternatives and improved versions of ensemble classifiers, ensemble methods are nevertheless categorized into three main types, namely boosting, bagging, and stacking. Through the boosting process, a strong learner is built from a collection of different weak learners (weak in the sense of accuracy when applied to the classification task). In boosting-based methods [44], an individual classifier iteratively learns to become specialized on a specific subset of the training dataset. Weighted samples from a subset of the training dataset are used to train a component classifier in a way that emphasizes previously misclassified samples. In boosting methods, experiments are conducted on the training set using different learning models to prompt the classifiers to produce output. The weight-assignment concept is used in the boosting process to assign higher weights or higher costs to the examples that a classifier misclassified. Using a weighted average, the output of each classifier is combined to generate the final output [53]. The basic idea of a boosting-based method is to combine weak learners to build a strong learner and improve the overall accuracy and performance. Among the boosting family, AdaBoost [54] and gradient tree boosting [55] are well-known boosting ensemble methods. In a boosting-based method, the resulting ensemble model is defined as the weighted sum of weak learners, as shown in Equation (4).
S_L(\cdot) = \sum_{l=1}^{L} c_l\, w_l(\cdot) \qquad (4)
where the $c_l$ are coefficients and the $w_l$ are weak learners. Directly finding the best ensemble model of the form in Equation (4) is a difficult optimization problem. Instead, we can use an iterative optimization process to find the coefficients and weak learners that give the best overall additive model, adding the weak learners one by one and looking at each iteration for the best possible pair (coefficient, weak learner) to add to the current ensemble model. We recursively define the $S_l$ as shown in Equation (5):
S_l(\cdot) = S_{l-1}(\cdot) + c_l\, w_l(\cdot) \qquad (5)
where $c_l$ and $w_l$ are chosen such that $S_l$ is the model that best fits the training data and is therefore the best possible improvement over $S_{l-1}$. We can then write this as Equation (6):
(c_l, w_l(\cdot)) = \underset{c,\, w(\cdot)}{\arg\min}\; E\left(S_{l-1}(\cdot) + c\, w(\cdot)\right) = \underset{c,\, w(\cdot)}{\arg\min} \sum_{n=1}^{N} e\left(y_n,\; S_{l-1}(x_n) + c\, w(x_n)\right) \qquad (6)
where $E(\cdot)$ is the fitting error of the given model and $e(\cdot, \cdot)$ is the loss/error function. Thus, instead of "globally" optimizing over all the L models in the sum, we approximate the optimum by optimizing "locally", building and adding the weak learners to the strong model one by one.
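To make the stagewise procedure of Equations (4)–(6) concrete, the following minimal sketch (in Python, assuming X and y are NumPy arrays and scikit-learn is available) adds regression stumps one by one, each fitted to the residual of the current model under a squared-error loss; the choice of loss, weak learner, and parameter values is illustrative and not the exact configuration used in our experiments.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_model(X, y, n_learners=50, learning_rate=0.1):
    # Returns the offset S_0 and the list of (c_l, w_l) pairs of Equation (5).
    ensemble = []
    prediction = np.full(len(y), y.mean())        # S_0(.): constant starting model
    for _ in range(n_learners):
        residual = y - prediction                 # local direction that reduces the squared loss
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        ensemble.append((learning_rate, stump))   # (c_l, w_l) appended to the current model
        prediction += learning_rate * stump.predict(X)
    return y.mean(), ensemble

def predict_additive_model(offset, ensemble, X):
    # Evaluates S_L(x) = offset + sum_l c_l * w_l(x), i.e., Equation (4).
    pred = np.full(X.shape[0], offset)
    for coef, learner in ensemble:
        pred += coef * learner.predict(X)
    return pred

Fitting each stump to the residual corresponds to the "local" optimization step of Equation (6) for the squared-error loss.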
Bagging methods, based on bootstrap aggregation, minimize the prediction variance by producing additional training sets from the original data. Several base classifiers are trained on bootstrap samples of the training dataset, and simple aggregation (majority voting) combines the outputs of these base classifiers into the final output, resulting in a more diverse ensemble, which is a key factor for an ensemble to work efficiently and effectively. A separate classifier is trained on each bootstrap sample of the training set, giving k classifiers after k iterations. From the bagging family, the most prominent method reported in the literature is random forest, a flexible and easy-to-use ensemble-based machine-learning algorithm. Most of the time, random forest gives very good results, even without hyper-tuned parameters. Because of its simplicity, it is widely used for both regression and classification tasks. A variation of the bagging scheme, UnderBagging [2], undersamples the underlying subset of instances before each bagging iteration for multi-class imbalance problems while keeping all the minority class samples in each iteration. Another variation of the bagging scheme is random forest [5], where each base classifier is trained on a bootstrap sample of the underlying training dataset that has been randomly reduced to a small subset of features. In voting-based ensemble methods, predictions from various individual models are combined: two or more component models are created separately from a dataset, an ensemble model then wraps the previously created models, and the predictions of those models are aggregated. The resulting model is used to predict new data. Assume that we have L bootstrap samples (approximations of L independent datasets) of size B, denoted in Equation (7):
\{z_1^1, z_2^1, \ldots, z_B^1\},\; \{z_1^2, z_2^2, \ldots, z_B^2\},\; \ldots,\; \{z_1^L, z_2^L, \ldots, z_B^L\} \qquad (7)
where $z_b^l$ is the b-th observation of the l-th bootstrap sample. We can fit L almost independent weak learners (one on each dataset), as given in Equation (8):
w_1(\cdot),\; w_2(\cdot),\; \ldots,\; w_L(\cdot) \qquad (8)
All the weak learners of Equation (8) are then combined through some kind of averaging process to obtain an ensemble model with a lower variance. For example, we can define our strong model as given in Equation (9):
S_L(\cdot) = \underset{k}{\arg\max}\; \mathrm{card}\left(\{\, l \mid w_l(\cdot) = k \,\}\right) \qquad (9)
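As an illustration of Equations (7)–(9), the following sketch (assuming NumPy and scikit-learn, with an illustrative base learner) trains one classifier per bootstrap sample and combines them by majority vote; it presumes the class labels are encoded as integers 0, ..., M-1.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_learners=25, random_state=0):
    rng = np.random.default_rng(random_state)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(y), size=len(y))       # bootstrap sample {z_1, ..., z_B}
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    votes = np.stack([w.predict(X) for w in learners])   # shape (L, n_samples)
    # Majority vote of Equation (9): the class with the largest cardinality wins.
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])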
Apart from accuracy improvements and being accuracy oriented, most of the standard techniques for creating ensembles face difficulties in identifying the subset of the dataset containing the minority class. To cope with such difficulties, special attention has to be paid to designing ensemble algorithms that handle the class imbalance problem. The effective combination of imbalanced learning, ensemble learning techniques with a base learner, and sampling strategies to confront the class imbalance issue has put forward many proposals with prominent results in the literature, setting aside the conventional categories such as cost-based and kernel-based methods in the imbalanced domain [56]. Regardless of the popularity, versatility, and effectiveness of ensemble methods (which use an independent baseline classifier) as compared to cost-sensitive and improved algorithm-based methods, combining dissimilar classifiers in an ensemble while ensuring their stability and regularity on the underlying training dataset remains a crucial factor for ensuring accuracy when dealing with multi-class classification problems. All the symbols used in this article are described in Table 2.

4. Non-Ensemble-Based Methods

Different research papers have empirically and experimentally reported that the classification of a balanced dataset is relatively straightforward, but it becomes difficult when the data are not balanced [57]. Standard machine learning algorithms assume a balanced distribution of classes in the subset of the dataset used for classification; however, the distribution of instances across classes is not uniform in many real-life situations [58]. Traditional classifiers, initially designed for the classification of balanced datasets, show significant performance on problems with a balanced dataset. On the other hand, real-world data are messy, with an unequal distribution of samples, where traditional classifiers fail to properly identify the relevant target class. Figure 1 shows an imbalanced distribution of samples. This imbalanced distribution of instances causes a bias toward the majority group, making it difficult for a standard learning algorithm to correctly predict an unseen sample. Ignoring the minority class leads to a poor model, as this under-represented class sometimes carries important information, thus providing an impractical classifier for the intended use case. Many standard classification algorithms, namely SVM, ANN, DT, the NB classifier, and KNN, which are designed around balanced training datasets, become less effective due to the skewed distribution of imbalanced classes [59]. Although some of them can be used for small imbalanced datasets, applying traditional algorithms to multi-class and multi-label problems with an imbalanced dataset does not necessarily achieve good performance in terms of accuracy [60], as standard algorithms are built with the assumption that the distribution is balanced. Therefore, when presented with large imbalanced datasets, these algorithms fail to properly represent the distributive characteristics of the data [1]. Classification problems in many real-world applications and scenarios involve multiple classes, in both balanced and imbalanced datasets. The learning environment becomes more complex and challenging when the number of classes in the domain increases and multiple classes overlap with each other, making it difficult to establish a clear decision boundary between any two classes or among the classes.

4.1. Support Vector Machine Classifier

In the experiment section, we used both linear and kernel SVM on the stated datasets to classify multi-class imbalanced data. The RBF kernel is used as the decision function in nonlinear SVM [61], as the data are not linearly separable. The set of parameters [62] that needs to be hyper-tuned to improve the performance of kernel SVM on multi-class imbalanced data comprises C (a regularisation parameter used to balance the training and testing error so that the algorithm generalizes better to unseen data), the gamma γ (also known as the kernel width parameter), which adjusts the decision boundary curvature and the shape of the class-dividing hyperplane, the decision function shape, and the weight assigned to a class or classes. Small values of C may cause the model to underfit and make it difficult to capture the data; on the other hand, large values of C may cause the model to overfit the training data. Similarly, the class-splitting hyper-shape planes change considerably for very large values of γ. If the data in the dataset are not balanced, the balanced class-weight option is employed. The decision function shape decides the decomposition strategy, i.e., whether to apply one-versus-one or one-versus-rest. The parameter search range and optimal parameters are shown in Table 3 and Table 4.
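A minimal sketch of the corresponding search space, written for scikit-learn's SVC, is shown below; the candidate values are placeholders and not the exact ranges of Table 3.

from sklearn.svm import SVC

svm_grid = {
    "C": [0.1, 1, 10, 100],                    # regularisation strength
    "gamma": [0.001, 0.01, 0.1, 1],            # RBF kernel width
    "decision_function_shape": ["ovo", "ovr"], # decomposition strategy
    "class_weight": [None, "balanced"],        # "balanced" for imbalanced classes
}
svm_model = SVC(kernel="rbf")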

4.2. Random Forest Classifier

To improve the overall model performance of the RF classification model, a set of four parameters needs to be hyper-tuned [63]: the number of trees ('n_estimators'), the maximum depth, the maximum number of features, and the minimum sample split. With regard to the number of trees, there are different opinions: some researchers suggest using the default number of trees to obtain more stable results, whereas others argue that a large number of trees should be used. The maximum depth specifies how far a node is expanded; if it is not specified, then all nodes are expanded until the leaf nodes. To achieve the best split, the optimal number of features should be hyper-tuned via the value of 'max_features'. 'min_samples_split' specifies the splitting criterion of an internal node. The parameter search range and optimal parameters are shown in Table 3 and Table 4.
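A sketch of this four-parameter search space, using scikit-learn's RandomForestClassifier, is given below; the listed values are illustrative placeholders rather than the exact ranges of Table 3.

from sklearn.ensemble import RandomForestClassifier

rf_grid = {
    "n_estimators": [100, 200, 500],     # number of trees
    "max_depth": [None, 10, 20],         # None: expand nodes down to the leaves
    "max_features": ["sqrt", "log2"],    # features considered for the best split
    "min_samples_split": [2, 5, 10],     # samples required to split an internal node
}
rf_model = RandomForestClassifier()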

4.3. K-Nearest Neighbor

The basic working procedure of kNN is to find a cluster (subset) of the k instances nearest to the sample being predicted in the underlying training space, thus showing independence from the structure of the data. The set of parameters [64] that needs to be tuned to improve the performance of kNN comprises k, 'weights', and 'leaf_size'. The distance between two points is calculated using the Euclidean distance function. The class holding the majority among the k nearest neighbors is the predicted class for the new sample. K is the key parameter to be tuned (after an exhaustive grid search) to obtain a satisfactory result. Other important parameters are 'weights' and 'leaf_size'. The weights are of two types, namely uniform and distance; for better prediction accuracy, the uniform weight is considered for the multi-class classification problem. The 'leaf_size' parameter is a key factor in the speed of tree construction and querying and the memory required to store the tree.
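A sketch of the kNN search space, using scikit-learn's KNeighborsClassifier with the Euclidean metric, is given below; the candidate values are placeholders.

from sklearn.neighbors import KNeighborsClassifier

knn_grid = {
    "n_neighbors": list(range(1, 21, 2)),  # k, searched over odd values
    "weights": ["uniform", "distance"],
    "leaf_size": [10, 30, 50],             # affects tree build/query speed and memory
}
knn_model = KNeighborsClassifier(metric="euclidean")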

4.4. Gradient Boosting Algorithm

The parameters of the boosting-based model [65] are sub-categorized into three types, namely parameters specific to the tree structure, boosting-specific parameters, and miscellaneous parameters. Since a decision tree is used as the default base learner, parameters specific to the tree structure are associated with each base learner, for example, the minimum number of samples required to split an internal node or the 'max_depth' of each tree. 'learning_rate' and the number of trees are the boosting parameters, directly related to the underlying boosting algorithm.
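A sketch mixing the tree-specific and boosting-specific parameters, using scikit-learn's GradientBoostingClassifier, is shown below; the concrete values are placeholders.

from sklearn.ensemble import GradientBoostingClassifier

gb_grid = {
    "n_estimators": [100, 200, 500],     # boosting parameter: number of trees
    "learning_rate": [0.01, 0.05, 0.1],  # boosting parameter: shrinkage per tree
    "max_depth": [2, 3, 5],              # tree-specific parameter
    "min_samples_split": [2, 5, 10],     # tree-specific parameter
}
gb_model = GradientBoostingClassifier()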

4.5. Decision Tree Algorithm

Decision trees can be applied to both balanced and imbalanced (multi-class) classification and regression problems; they are best employed for nonlinear decisions with a pre-defined class variable or target label. The decision tree enables a predictive classification model with refined accuracy and precision, and provides better stability to the model with ease of classification. During our experiment, we hyper-tuned the following six parameters [66] to make a significant change to the overall performance of the model: 'criterion', 'splitter', 'max_depth', 'min_samples_split', 'min_samples_leaf', and 'max_features'. The criterion decides the quality of a split; the splitter decides the split strategy at each node; how far the tree should grow is decided by the maximum depth; the number of samples required for an internal node to split is based on 'min_samples_split'; the minimum number of samples required at a leaf node is decided by 'min_samples_leaf'; and 'max_features' determines the number of features to be considered when looking for the best split.
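A sketch of this six-parameter search space, using scikit-learn's DecisionTreeClassifier, is given below; the listed values are placeholders for the actual ranges of Table 3.

from sklearn.tree import DecisionTreeClassifier

dt_grid = {
    "criterion": ["gini", "entropy"],    # quality of a split
    "splitter": ["best", "random"],      # split strategy at each node
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": [None, "sqrt", "log2"],
}
dt_model = DecisionTreeClassifier()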

4.6. Logistic Regression

Logistic regression can be used for both binary and multi-class imbalance problems [41]. Although it was initially designed for binary classification, it can be used for multi-class classification by adopting the one-versus-rest decomposition strategy or by modifying the loss function to the cross-entropy loss. To set up logistic regression for multi-class classification, a parameter called 'multi_class' is enabled accordingly. For multi-class classification, the training model uses the one-versus-rest decomposition strategy when the 'multi_class' option is set to 'ovr', and if the 'multi_class' option has the 'multinomial' value, then the cross-entropy loss is used [67]. The default value of 'multi_class' is 'ovr', and currently, the 'multinomial' setting can use one of four possible solvers, namely 'newton-cg', 'sag', 'saga', and 'lbfgs'. During multi-class classification using an exhaustive grid search, the following two parameters have proven to be effective: penalty, and solver with the values 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'. 'liblinear' is a better option for a dataset with a smaller number of classes, whereas 'sag' and 'saga' are the best choices for a dataset with a large number of classes. For multiclass problems, only 'newton-cg', 'sag', 'saga', and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
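A sketch of a multinomial logistic regression setup in scikit-learn, restricted to the solvers that support the multinomial loss, is given below; the penalty and C values are placeholders.

from sklearn.linear_model import LogisticRegression

lr_grid = {
    "penalty": ["l2"],
    "solver": ["newton-cg", "lbfgs", "sag", "saga"],  # solvers that handle the multinomial loss
    "C": [0.01, 0.1, 1, 10],
}
lr_model = LogisticRegression(multi_class="multinomial", max_iter=1000)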

5. Parameter Tuning

Parameter tuning is the process of determining the particular parameter values that optimize the performance of a learning algorithm on a specific dataset [68], improving the accuracy and overall model performance for both ensemble and non-ensemble classifiers. Every classifier has its own set of parameters, which need to be tuned following different tuning steps, such as performing an exhaustive grid search. Most of the time, we assess and compare the underlying models' performance for the best hyper-parameter settings using the grid search technique and response surface methodology (RSM). However, some researchers [69] prefer the mean absolute error (MAE) to compare the performance, which is given in Equation (10):
\mathrm{MAE} = \frac{\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|}{N} \qquad (10)
where $y_i$ and $\hat{y}_i$ denote the actual value and predicted value of observation i, respectively, $e_i = y_i - \hat{y}_i$ denotes the prediction error of observation i, and N denotes the total number of observations in the data. The lower the MAE, the better the model performance.
In this article, for each individual classifier, we tested a series of values using ten-fold cross-validation [70] and a grid search mechanism for parameter tuning until an optimal parameter set showing the overall highest classification accuracy and precision for each classifier was obtained. In the results section, each classifier is compared with and without tuned parameters on 20 publicly available datasets. The comparison of eight state-of-the-art classifiers (ensemble and non-ensemble) is highlighted in Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10. Table 5 and Table 6 show the comparison with conventional parameters in terms of overall accuracy, precision, recall, and F1-score. After carefully observing the different values of the evaluation metrics, six classifiers, GB, RF, DT, KNN, R-SVM, and LR, were selected for hyper-tuning based on their overall significant performance on the same datasets. Table 7 and Table 8 show the comparison of the six selected classifiers after the hyper-tuning of their parameters, while Table 9 and Table 10 show the performance comparison after the synthetically controlled overlapping and hyper-tuning.
Before the parameter tuning process, a grid of parameters is specified to evaluate each algorithm, and for every parameter subset, 10-fold cross-validation is performed to evaluate the model. Within the 10-fold cross-validation, nine folds are used to train the model and one fold is used to validate it. The validation process using the grid search is repeated 10 times so that every fold has a fair chance of being used as a validation set, and the scores from each run are averaged. To save time and space in the hyper-tuning process, we divided the twenty datasets into two groups, namely group 1, wherein the datasets contain five classes or fewer, as shown in Table 3; and group 2, wherein the datasets consist of more than five classes, as shown in Table 4.
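A minimal sketch of this tuning loop, combining an exhaustive grid search with 10-fold (stratified) cross-validation in scikit-learn, is shown below; the scoring metric and random seed are illustrative choices, and the model/grid pairs come from the per-classifier grids sketched in Section 4.

from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune(model, param_grid, X, y):
    # 10 folds: 9 folds train the model, 1 fold validates it, repeated for every fold.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    search = GridSearchCV(model, param_grid, scoring="accuracy", cv=cv, n_jobs=-1)
    search.fit(X, y)
    return search.best_params_, search.best_score_

# Example usage (hypothetical variables):
# best_params, best_cv_accuracy = tune(rf_model, rf_grid, X_train, y_train)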

6. Quantification of Class Overlapping

Most real-world imbalanced problems exhibit overlapping issues, and the joint effect of imbalanced and overlapping samples severely affects the classification performance [71]. Overlapping issues and the classification of data of an imbalanced nature have significantly gained in popularity for their focus on real-world problems; however, a well-defined mathematical characterization of overlapping is still lacking [72], despite different studies in the literature [72,73] having suggested ways to estimate the class overlapping level. A major drawback of these methods is the prior assumption of a normal distribution of the data, which does not hold for the majority of real-world datasets. We modified the formula used in [74], originally designed for binary classification problems with a 2D feature space, into Equation (11) to approximate the overlapping region based on the imbalanced distribution of data.
\text{Overlapping Degree}\,(\%) = \frac{\text{Overlapping Region}}{\text{Minority Class Area}} \times 100 \qquad (11)
The overlapping level for the majority class negative instances in the overlapping region affects the classification performance and is calculated via Equation (12).
\text{Overlapping Degree}\,(\%) = \frac{\text{Overlapping Region}}{\text{Majority Class Area}} \times 100 \qquad (12)
The overlapping region is the shared feature space of the majority and minority class samples with similar attributes, and the majority class area is calculated using the Euclidean distance and the nearest neighbor rule. A class overlap region for two classes $C_i$ and $C_j$ can be described using Equations (13) and (14).
\text{Overlapping: if } P(x \in C_i) \geq 0 \text{ then} \qquad (13)
P(x \in C_j) \geq 0 \text{ must hold, where } x \in \text{overlapping region samples} \qquad (14)
If the probability density of class $C_i$ is greater than or equal to zero, the same must be true for class $C_j$, where $i \neq j$, i.e., the samples of class $C_i$ have similar characteristics to the samples of class $C_j$.
To measure the overlap among the features of different classes in a multiclass dataset, it is necessary to evaluate the discriminative power of the features. If there is any feature with discriminative characteristics, the problem is considered a simple problem. To measure the overlap among the different classes, we used Fisher's maximum discriminative ratio [75], denoted by $F_1$ and given by Equation (15):
F_1 = \frac{1}{1 + \max_{i=1}^{m} r_{f_i}} \qquad (15)
where $r_{f_i}$ is the discriminative ratio for the feature $f_i$ in the dataset. Originally, only the largest discriminative ratio value was stored in $F_1$. $r_{f_i}$ can also be calculated as presented by Orrial in their research article [76], as given by Equation (16):
r_{f_i} = \frac{\sum_{j=1}^{n_c} \sum_{k=1,\, k \neq j}^{n_c} p_{c_j}\, p_{c_k} \left(\mu_{c_j}^{f_i} - \mu_{c_k}^{f_i}\right)^2}{\sum_{j=1}^{n_c} p_{c_j} \left(\sigma_{c_j}^{f_i}\right)^2} \qquad (16)
where $p_{c_j}$ and $p_{c_k}$ represent the proportions of samples in classes $c_j$ and $c_k$, respectively, $\mu_{c_j}^{f_i}$ and $\mu_{c_k}^{f_i}$ denote the means of feature $f_i$ over the samples of classes $c_j$ and $c_k$, and $\sigma_{c_j}^{f_i}$ is the standard deviation of those samples. Both Equations (15) and (16) are employed for binary classification, where the underlying dataset should be decomposed into binary classification problems using the one-versus-one approach. An alternative computation of the discriminative ratio for both multiclass and binary classification problems is presented by Molliendia [77] and is given by Equation (17):
r_{f_i} = \frac{\sum_{j=1}^{n_c} n_{c_j} \left(\mu_{c_j}^{f_i} - \mu^{f_i}\right)^2}{\sum_{j=1}^{n_c} \sum_{l=1}^{n_{c_j}} \left(X_{l,i}^{j} - \mu_{c_j}^{f_i}\right)^2} \qquad (17)
where $n_{c_j}$ represents the number of samples in class $c_j$, $\mu_{c_j}^{f_i}$ denotes the mean of feature $f_i$ across the samples of class $c_j$, $\mu^{f_i}$ is the mean of the $f_i$ values across all classes, and $X_{l,i}^{j}$ represents the individual value of feature $f_i$ for the l-th sample of class $c_j$.
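A sketch of how Equations (15) and (17) can be computed for a multiclass dataset is given below; it assumes X is a NumPy array of shape (n_samples, n_features) and y holds the class labels, and it uses the overall feature mean in the numerator as in Equation (17). Values of $F_1$ close to 1 indicate that no single feature discriminates the classes well, i.e., a higher degree of overlap.

import numpy as np

def fisher_f1(X, y):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)                 # mu^{f_i} over all classes
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2   # numerator of Eq. (17)
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)          # denominator of Eq. (17)
    r = between / np.where(within == 0, np.finfo(float).eps, within) # r_{f_i} per feature
    return 1.0 / (1.0 + r.max())                                     # Equation (15)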

7. Algorithm for Synthetic Control Overlapping in the Majority Class

To check the impact of parameter hyper-tuning on the existing and synthetically overlapped datasets, in this section we propose an algorithm to generate and add 20% synthetic samples to the majority class of each dataset (a consolidated code sketch follows the steps below):
  • Take a multi-class dataset with an imbalanced distribution of data and overlapping samples.
  • Apply preprocessing techniques to the dataset to make it convenient for the underlying classifier. For example, all the categorical values of class labels are converted into numeric values by applying the Label Encoding scheme. Similarly, to scale the underlying data, normalization techniques are applied to scale the features to a range that is centered around zero.
    new_value[..] = LabelEncoder.fit_transform(existing_value)
    The transformation method converts the existing feature value into the desired numeric value.
  • Compute the average distance $d_i$ between each sample $e_i$ belonging to the target class C and its $k_1$ nearest neighbors $N_{e_i}$ that are not of the target class ($e_s$), for each minority class sample with respect to the majority class samples and vice versa.
    d(a, b) = \sqrt{(a_1 - b_1)^2}, \quad \text{where } a, b \in \mathbb{R}^n
    where a and b are two vectors (sample attributes), and d is the distance between the two points. The distance for points with n attributes is given below:
    d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}, \quad \text{where } a, b \in \mathbb{R}^n
  • Select the majority class (MC) from the list of classes in the dataset, for which the synthetic samples S will be created.
    MC_Count = np.unique(MDS[col], return_counts=True)
    where MC_Count is used to hold the sample count of the majority class, MDS is the multi-class dataset, and col is the feature (class label column) to be counted.
  • Choose the first nearest neighbor by selecting the value $k_1 = 1$. The distance $d_i$ between the sample and its neighbor is calculated using the Euclidean distance,
    N_{e_i} = \text{set of } k_1\text{-nearest neighbors of } e_i \in D \text{ of class } MC\_Count
    where $e_i$ is an individual sample of the target class C, $d_i$ is the average distance to the $k_n$ samples of the other class, $m_i$ is the target class, and $N_{e_i}$ is the closest sample of a class other than the target class ($e_s$).
  • Compute the number of synthetic samples, equal to 20% of the target class, to be overlapped in the majority class of the dataset,
    S = \frac{\#T \times x}{100}
    where S is the number of synthetic samples, x is the overlap percentage (20 in our experiments), and T is the subset of D (the multiclass dataset) containing all the samples of the target class C (MC_Count).
  • Compute the synthetic samples using an interpolation scheme.
    y_{new} = y + \mathrm{rand}(0, 1) \times (\hat{y} - y)
    For each minority class sample y, one first obtains its k nearest neighbors from the other minority class samples. Secondly, one chooses one minority class sample $\hat{y}$ among the k neighbors. Finally, one generates the synthetic sample $y_{new}$ by interpolating between $\hat{y}$ and y.
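A consolidated sketch of these steps is given below, under one reading of the procedure: interpolated points are generated between minority-class samples and their minority-class nearest neighbours, labelled as the majority class, and appended, so that the overlapping region of the majority class grows by roughly 20%. The helper names, the scaling choice, and the use of the majority-class size as the 20% base are assumptions made for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder, StandardScaler

def add_controlled_overlap(X, y, overlap_ratio=0.20, k=5, random_state=0):
    rng = np.random.default_rng(random_state)
    y = LabelEncoder().fit_transform(y)                 # step 2: encode class labels
    X = StandardScaler().fit_transform(X)               # step 2: centre and scale features
    labels, counts = np.unique(y, return_counts=True)
    majority = labels[counts.argmax()]                  # step 4: majority class MC
    minority_X = X[y != majority]
    n_new = int(counts.max() * overlap_ratio)           # step 6: 20% synthetic samples
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)   # steps 3 and 5
    _, idx = nn.kneighbors(minority_X)                   # idx[:, 0] is the sample itself
    seeds = rng.integers(0, len(minority_X), size=n_new)
    neighbours = idx[seeds, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))
    # step 7: y_new = y + rand(0,1) * (y_hat - y), interpolated between neighbours
    X_new = minority_X[seeds] + gap * (minority_X[neighbours] - minority_X[seeds])
    y_new = np.full(n_new, majority)                     # label as majority class to force overlap
    return np.vstack([X, X_new]), np.concatenate([y, y_new])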

8. Experimental Setup

A proper comparison of different classifiers for a classification model is a complex and still open challenge. To avoid a biased evaluation of the model, the comparison task must not only evaluate the committed error but also account for the structure and nature of the data. In this section, we compare the ensemble and non-ensemble classifiers over 15 multi-class real datasets to measure the accuracy, precision, recall, and F1-score using the confusion matrix.

Datasets

We used 20 multi-class real datasets for the experiments, downloaded from the UCI [78] and KEEL [79] repositories. To obtain the best result and provide a fair chance for every sample to be evaluated, we used a 10-fold cross-validation scheme. Every dataset consists of a different number of samples, features, and classes, i.e., the instances vary in the range of 150–12,960, while the number of features ranges from 4 to 65 and the number of classes varies between 3 and 20. To effectively demonstrate the impact of increasing the overlapping samples in the underlying datasets, we selected datasets differing in the number of samples and the number of attributes. Moreover, we synthetically overlapped the different datasets, which already had overlapping regions, to highlight the decreasing performance of the underlying classifier with an increasing overlapping ratio. Table 11 highlights some important attributes of the datasets, such as the name of the dataset, the download source, the number of features, the total number of instances, the total number of classes, and the ratio of each class in the dataset.

9. Results and Discussion

This article highlights three different aspects of ensemble and statistical classifiers, depicted in Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10 and supplemented by the relevant graphical presentations in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13, respectively. This section carries out a detailed comparison to highlight the imbalanced and overlapping nature of multi-class classification by applying traditional and ensemble approaches, followed by some statistical analysis. Here, we applied different ensemble classifiers (GB, RF, and DT) and non-ensemble approaches (KNN, linear and kernel SVM, and NB) on 15 multi-class datasets. Each dataset consists of a different number of samples, features, and classes, i.e., the instances vary in the range of 150–12,960, while the number of features ranges from 4 to 65 and the number of classes varies between 3 and 20. Our results consist of three parts: first, we explore the six different classifiers on the 15 multi-class datasets with default parameters and without synthetic overlapping, using confusion matrix values to gain insight into the different performance measures such as accuracy, precision, recall, and F1-score, as depicted in Table 5 and Table 6. In the second step, the six algorithms, namely GB, RF, DT, KNN, R-SVM, and LR, are hyper-tuned (hyper-tuning the set of parameters) to improve the overall performance of the classification model, using 10-fold cross-validation in an exhaustive grid search experiment, as shown in Table 7 and Table 8.
In the third part, the existing datasets (the majority class of each dataset) are synthetically overlapped by 20% of the original samples to highlight the impact of hyper-tuning the parameters, as shown in Table 9 and Table 10. After examining the results, we arrive at the following observations: gradient boosting, an ensemble approach based on boosting, shows remarkable performance in all categories (with and without hyper-tuned parameters, for both the existing and the synthetically overlapped datasets) for most of the multi-class imbalanced datasets. GB is an ensemble method that combines many weak learners to produce a strong learner and deliver improved accuracy. A lower weight and a higher weight are assigned to the predicted outcome if it is correctly classified and misclassified, respectively. Each new tree is fit on a reproduced subset of the original dataset during the boosting process. During the hyper-tuning, we tuned the tree-specific parameters, the boosting-specific parameters, and the miscellaneous parameters. For some datasets in the experiment, all the classifiers (ensemble and non-ensemble) show poor performance in terms of both accuracy and precision.
The reason for this poor performance is that high variance and poor feature engineering make it impossible for some algorithms to show a remarkable performance. Although boosting-based ensembles are prominent in controlling both bias and variance, with high variance and poor feature engineering they too sometimes fail to show a better performance. Along with the gradient boosting algorithm, random forest and decision tree also show some significance compared to the other classifiers used on most of the datasets, particularly on datasets with a greater number of features. The main reason for their significance is the automatic reduction of the number of features through the probabilistic entropy calculation approach. All three ensemble methods, GB, RF, and DT, outperform the non-ensemble classifiers with both conventional and hyper-tuned parameters for almost all the multi-class datasets. The error or misclassification rate can be dramatically reduced by averaging the component classifiers' predictions to produce an optimal final classification report, both in random forest and in the decision tree classifier. In the RFC, multiple trees are used in the ensemble, each constructed on a sample drawn with replacement from the training set. Moreover, the tree-based RFC ensemble selects a subset of features rather than using all the features of the data in the training set, resulting in the randomization of the trees. A decision tree breaks the classification process down into multiple choices about each entry in our feature vector, starting at the root node and going down to the leaf where the actual classification (prediction) is made. Unlike a traditional (black box) learning algorithm, a decision tree is quite natural, as it visualizes and interprets the choices regarding how the tree is formed and then follows a suitable path to the leaf node where the actual prediction or classification is performed. On the other hand, random forest, a collection of decision trees that injects randomness at a certain level, differs from the decision tree classifier. The better performance of RF compared to DT arises because injecting randomness at two levels, in bootstrapping and during node splitting, reduces overfitting, resulting in a more accurate model than the DT model. RF trains each distinct DT on a bootstrapped sample drawn from the initial training dataset. Logistic regression is a linear classifier that can be applied to both linear and nonlinear problems, but compared to the ensemble approaches and SVM with an rbf kernel, the performance of LR is not as good. SVM in both versions (linear and kernel) is one of the most powerful models of machine learning, with an appropriate ability to fit a separating hyperplane on both linear and nonlinear datasets. Fitting a hyperplane for a linear dataset is quite simple through the linear or rbf kernel, but for nonlinear data, the kernel trick is used to fit the hyperplane by projecting the data points into a higher dimension (in this N-dimensional space, the data points can easily be linearly separated). During the hyper-tuning process, when we set the decision function value to "one-versus-rest", it decomposes the multi-class problem into several binary class problems, making it convenient for the SVM to find an optimal hyperplane, thus improving its overall model performance.
To obtain the most successful results, proper feature engineering and feature scaling must be applied to the training dataset, as the LR model is greatly affected by the different value ranges across the dependent variables. On the majority of datasets, the poor performance of NB is because of the assumption that the instances' features are conditionally independent; however, as we know, in the case of the multi-class problem the features depend on each other, making this hypothesis wrong and degrading the overall performance of the NB model. The basic working mechanism of the KNN model is based on the optimal value selected for k; whenever a new sample is subject to prediction, it simply examines the k nearest neighbors from the training set, and the majority class among the k nearest neighbors is taken to be the class of the test sample. Since the boundary regions of different classes in multi-class classification problems overlap with each other, it becomes difficult to examine the k nearest neighbors from the training set.
As a final comment on this comparison, when the classifiers are divided into ensemble and non-ensemble classifiers, the ensemble classifiers outperform the non-ensemble classifiers on almost all the datasets with both conventional and hyper-tuned parameters; this comparison is summarized in more detail in Section 10.
As we stated earlier in Section 9 regarding the synthetic generation of controlled overlapping samples in the existing datasets, Table 5, Table 7, and Table 9 report the accuracy and precision of the stated classifiers under the three scenarios. The results in Table 5 correspond to the scenario in which the ensemble and non-ensemble classifiers were tested on the existing multi-class, overlapped datasets to highlight their baseline performance. The results in Table 7 correspond to the scenario in which the selected parameters of the classifiers were hyper-tuned using 10-fold cross-validation and the grid search technique, and the selected classifiers were then applied to highlight the impact of parameter hyper-tuning on the multi-class, overlapped datasets. Comparing both tables, it is clear that, after hyper-tuning the parameter set, the underlying classifiers improved their performance, as discussed in Section 9. For the third scenario, we synthetically generated overlapping samples and inserted 20% of these samples into the majority class of each dataset to increase its overlapping region; the generation of the synthetic samples is discussed in Section 10. After inserting the synthetic samples, the samples of the majority and minority classes overlap more strongly near the boundary region, which reduces the visibility of the minority class. With the visibility of the minority-class samples compromised, the underlying classifier cannot predict the relevant target class effectively, hence decreasing the overall classifier performance. Table 9 shows the accuracy and precision after the synthetic generation of samples and the tuning of the parameters. If we compare the results in Table 9 with those in Table 5 and Table 7, there is only a slight change in classifier performance, even after the insertion of the synthetic samples into the majority classes. The same justification applies to recall and F1-score, as depicted in Table 6, Table 8, and Table 10.
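The exact generation procedure is described in the section referenced above; purely as an illustration, the sketch below shows one simple way controlled overlapping samples could be produced, by interpolating each selected majority-class sample towards its nearest minority-class neighbour and labelling the result as the majority class, so that roughly 20% additional samples fall inside the overlapping region. The function name, interpolation range, and example data are assumptions for illustration, not the authors' implementation.

```python
# Simplified sketch (assumed procedure, for illustration only) of adding
# controlled overlapping samples to the majority class of a multi-class dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors


def add_overlapping_samples(X, y, majority_label, fraction=0.2, rng=None):
    rng = np.random.default_rng(rng)
    X_maj = X[y == majority_label]
    X_min = X[y != majority_label]

    # For each majority-class sample, find its closest minority-class neighbour.
    nn = NearestNeighbors(n_neighbors=1).fit(X_min)
    _, idx = nn.kneighbors(X_maj)

    # Create new majority-labelled points pulled towards the class boundary.
    n_new = int(fraction * len(X))
    pick = rng.integers(0, len(X_maj), size=n_new)
    alpha = rng.uniform(0.5, 1.0, size=(n_new, 1))
    X_new = X_maj[pick] + alpha * (X_min[idx[pick, 0]] - X_maj[pick])
    y_new = np.full(n_new, majority_label)

    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Illustrative usage on synthetic multi-class data (class 0 is the majority).
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           weights=[0.6, 0.25, 0.15], random_state=0)
X_aug, y_aug = add_overlapping_samples(X, y, majority_label=0, fraction=0.2, rng=0)
print(X.shape, "->", X_aug.shape)
```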

10. Resultant Summary for Ensemble and Non-Ensemble Classifiers

As a final comment about the comparison, if we divide the classifiers into ensemble and non-ensemble (traditional) classifiers, the ensemble classifiers outperform the non-ensemble classifiers on almost all the datasets with conventional and hyper-tuned parameters.
Among the ensemble classifiers, there is no constant winner across all the datasets, but overall, gradient boosting beats all the classifiers in both categories. Among the non-ensemble methods, R-SVM shows a remarkable performance with the rbf kernel for both the conventional and the tuned parameters, as SVM selects a small subset of the training data from the original training set to construct its model. The increasing size and dimensionality of the problem space make it difficult for traditional, non-ensemble classifiers to correctly predict unseen samples in multi-class classification. The main reason behind the poor performance of traditional classifiers is their inability to tackle high bias and high variance. Despite the wide range of machine-learning algorithms available, the data to be processed need to be examined carefully, as they are often biased, have high variance, or are noisy. When such unprocessed data (data with bias, variance, and noise) are classified using traditional algorithms, we usually obtain a model specialized to the training set, which yields low accuracy on unseen data. Due to high bias, the underlying machine-learning algorithm is unable to form a meaningful relationship between the target variable (class labels) and the features, causing underfitting, which reduces the overall performance of the classifier on the testing set. High variance, on the other hand, causes the model to fit the random noise in the training dataset rather than the intended outputs; a high-variance model tends to overfit the training data, resulting in a non-generalized model for prediction.

Compared with traditional classifiers, ensemble approaches try to reduce both the variance and the bias in the training data, resulting in a more robust and generalized model for the multi-class classification problem. The variance-bias tradeoff is a significant problem in almost all machine-learning classifiers, particularly in multi-class classification, where the boundaries of different classes overlap with each other, making it difficult to draw a clear hyperplane that separates the samples of the different classes. To correctly classify unseen data during the validation process, one would ideally choose a machine-learning model that captures the regularities in its training dataset (effectively addressing high variance and bias) while avoiding both under-fitting and over-fitting. Unfortunately, for most traditional classifiers it is not easy to reduce high variance and high bias simultaneously. Ensemble approaches, on the other hand, follow a 'component-classifiers' strategy: in boosting, the samples misclassified at each iteration (due to high variance and bias) are re-emphasized in the training subset for the next component classifier, while in bagging, the individual base learners can be trained in parallel, even on different machines. In short, ensemble approaches act like a 'meta-algorithm', combining different learning models into a single, more robust model to improve the overall performance of the underlying model. The high bias in the training data is reduced mainly via the boosting approach, and the high variance is reduced mainly via the bagging approach. All the acronyms used in this article are defined in Table 12.
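To make the bias-variance point concrete, the following minimal sketch (an assumed setup, not the paper's experimental protocol) compares a single decision tree with a bagged ensemble of trees (variance reduction) and a boosted ensemble (bias reduction) on one multi-class dataset using cross-validation:

```python
# Minimal sketch (assumed setup): single tree vs. bagging vs. boosting
# on a multi-class dataset, evaluated with cross-validation.
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    "bagging (variance reduction)": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0),
    "boosting (bias reduction)": GradientBoostingClassifier(
        n_estimators=50, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```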

11. Conclusions

In this paper, we applied six different algorithms (ensemble and non-ensemble) to 15 multi-class datasets with default parameters and then compared the results with those obtained using hyper-tuned parameters. The results highlight a significant improvement in both categories of algorithms after hyper-tuning a set of parameters with an exhaustive grid search technique. Overall, the ensemble approaches outperform the non-ensemble approaches for the majority of the multi-class classification datasets. Before applying the classification algorithms to the stated imbalanced datasets, we did not apply any data-level approach to balance them. Similarly, after synthetically overlapping the existing datasets, the ensemble classifiers maintain almost the same results when compared with the traditional (statistical) approaches. The results could be further improved by augmenting the ensemble and non-ensemble approaches with data-level approaches.
As a future direction, performing proper feature engineering on the multi-class imbalanced datasets should further improve the results. Combining ensemble classifiers with a cost-sensitive approach (assigning misclassification costs) can also significantly improve the overall performance of the different classifiers. One-class learners, within-class imbalance, small disjuncts, feature selection, stacked ensembles, and sophisticated over-sampling techniques are among the research gaps that need to be specifically addressed in future studies.
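As a rough illustration of the cost-sensitive direction mentioned above (an approach not evaluated in this paper), class weights can be used to assign a higher misclassification cost to the minority classes; the snippet below assumes scikit-learn's built-in 'balanced' weighting on an arbitrary multi-class dataset:

```python
# Hedged sketch (assumed setup): a plain random forest versus a cost-sensitive
# one that weights classes inversely to their frequency.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

plain = RandomForestClassifier(n_estimators=200, random_state=0)
cost_sensitive = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0)

print("plain RF:         ", cross_val_score(plain, X, y, cv=10).mean().round(3))
print("cost-sensitive RF:", cross_val_score(cost_sensitive, X, y, cv=10).mean().round(3))
```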

Author Contributions

All of the authors collaborated to complete this project. Z.M. worked on the methodology. G.U.R. and N.A.B. performed the formal analysis. The manuscript was validated by A.B. In conversation with Z.M. and G.U.R., all the relevant data were collected by M.A. The writing, review, and editing were done by M.Z. The original draft of the manuscript was written by S.F.J. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by TRL Technology Ltd.

Data Availability Statement

This article has no associated data.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  2. Hoens, T.R.; Chawla, N.V. Imbalanced datasets: From sampling to classifiers. In Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley Online Library: Hoboken, NJ, USA, 2013; pp. 43–59. [Google Scholar]
  3. Sáez, J.A.; Quintián, H.; Krawczyk, B.; Woźniak, M.; Corchado, E. Multi-class Imbalanced Data Oversampling for Vertebral Column Pathologies Classification. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Oviedo, Spain, 20–22 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 131–142. [Google Scholar]
  4. Rout, N.; Mishra, D.; Mallick, M.K. Handling imbalanced data: A survey. In International Proceedings on Advances in Soft Computing, Intelligent Systems and Applications; Springer: Berlin/Heidelberg, Germany, 2018; pp. 431–443. [Google Scholar]
  5. Kaur, P.; Gosain, A. Issues and challenges of class imbalance problem in classification. Int. J. Inf. Technol. 2018, 14, 539–545. [Google Scholar] [CrossRef]
  6. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141. [Google Scholar] [CrossRef]
  7. Loyola-González, O.; Martínez-Trinidad, J.F.; Carrasco-Ochoa, J.A.; García-Borroto, M. Correlation of resampling methods for contrast pattern based classifiers. In Mexican Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2015; pp. 93–102. [Google Scholar]
  8. Beyan, C.; Fisher, R. Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognit. 2015, 48, 1653–1672. [Google Scholar] [CrossRef] [Green Version]
  9. Denil, M.; Trappenberg, T. Overlap versus imbalance. In Canadian Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2010; pp. 220–231. [Google Scholar]
  10. Abdi, L.; Hashemi, S. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans. Knowl. Data Eng. 2015, 28, 238–251. [Google Scholar] [CrossRef]
  11. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Algorithm-level approaches. In Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; pp. 123–146. [Google Scholar]
  12. Bi, J.; Zhang, C. An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowl.-Based Syst. 2018, 158, 81–93. [Google Scholar] [CrossRef]
  13. Rahm, E.; Do, H.H. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 2000, 23, 3–13. [Google Scholar]
  14. Rao, K.N.; Reddy, C. A novel under sampling strategy for efficient software defect analysis of skewed distributed data. Evol. Syst. 2020, 11, 119–131. [Google Scholar] [CrossRef]
  15. Perveen, S.; Shahbaz, M.; Keshavjee, K.; Guergachi, A. Metabolic syndrome and development of diabetes mellitus: Predictive modeling based on machine learning techniques. IEEE Access 2018, 7, 1365–1375. [Google Scholar] [CrossRef]
  16. Fu, M.; Tian, Y.; Wu, F. Step-wise support vector machines for classification of overlapping samples. Neurocomputing 2015, 155, 159–166. [Google Scholar] [CrossRef]
  17. Qu, Y.; Su, H.; Guo, L.; Chu, J. A novel SVM modeling approach for highly imbalanced and overlapping classification. Intell. Data Anal. 2011, 15, 319–341. [Google Scholar] [CrossRef]
  18. Sun, Z.; Song, Q.; Zhu, X.; Sun, H.; Xu, B.; Zhou, Y. A novel ensemble method for classifying imbalanced data. Pattern Recognit. 2015, 48, 1623–1637. [Google Scholar] [CrossRef]
  19. Shaukat, S.U. Optimum Parameter Machine Learning Classification and Prediction of Internet of Things (IoT) Malwares Using Static Malware Analysis Techniques. Ph.D. Thesis, University of Salford, Manchester, UK, 2019. [Google Scholar]
  20. Anuragi, A.; Sisodia, D.S.; Pachori, R.B. Epileptic-seizure classification using phase-space representation of FBSE-EWT based EEG sub-band signals and ensemble learners. Biomed. Signal Process. Control 2022, 71, 103138. [Google Scholar] [CrossRef]
  21. Han, Y.; Liu, Y.; Wang, B.; Chen, Q.; Song, L.; Tong, L.; Lai, C.; Konagaya, A. A novel transfer learning for recognition of overlapping nano object. Neural Comput. Appl. 2022, 34, 5729–5741. [Google Scholar] [CrossRef]
  22. Gurunathan, A.; Krishnan, B. A Hybrid CNN-GLCM Classifier for Detection and Grade Classification Of Brain Tumor. Brain Imaging Behav. 2022, 16, 1410–1427. [Google Scholar] [CrossRef]
  23. Vong, C.M.; Du, J.; Wong, C.M.; Cao, J.W. Postboosting using extended G-mean for online sequential multiclass imbalance learning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 6163–6177. [Google Scholar] [CrossRef]
  24. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2020, 17, 168–192. [Google Scholar] [CrossRef]
  25. Fernandes, E.R.; de Carvalho, A.C.; Yao, X. Ensemble of classifiers based on multiobjective genetic sampling for imbalanced data. IEEE Trans. Knowl. Data Eng. 2019, 32, 1104–1115. [Google Scholar] [CrossRef]
  26. Wang, S.; Chen, H.; Yao, X. Negative correlation learning for classification ensembles. In Proceedings of the 2010 international joint conference on neural networks (IJCNN), Barcelona, Spain, 18–23 July 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1–8. [Google Scholar]
  27. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In European Conference on Principles of Data Mining and Knowledge Discovery; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119. [Google Scholar]
  28. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  29. Kotsiantis, S.B. Bagging and boosting variants for handling classifications problems: A survey. Knowl. Eng. Rev. 2014, 29, 78–100. [Google Scholar] [CrossRef]
  30. Alam, T.; Ahmed, C.F.; Zahin, S.A.; Khan, M.A.H.; Islam, M.T. An effective ensemble method for multi-class classification and regression for imbalanced data. In Industrial Conference on Data Mining; Springer: Berlin/Heidelberg, Germany, 2018; pp. 59–74. [Google Scholar]
  31. Feng, W.; Huang, W.; Ren, J. Class imbalance ensemble learning based on the margin theory. Appl. Sci. 2018, 8, 815. [Google Scholar] [CrossRef] [Green Version]
  32. Sun, B.; Chen, H.; Wang, J.; Xie, H. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Front. Comput. Sci. 2018, 12, 331–350. [Google Scholar] [CrossRef]
  33. Van Hulse, J.; Khoshgoftaar, T.M.; Napolitano, A. An empirical comparison of repetitive undersampling techniques. In Proceedings of the 2009 IEEE International Conference on Information Reuse & Integration, Las Vegas, NV, USA, 10–12 August 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 29–34. [Google Scholar]
  34. Bonab, H.; Can, F. Less is more: A comprehensive framework for the number of components of ensemble classifiers. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2735–2745. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  35. Datta, A.; Chatterjee, R. Comparative study of different ensemble compositions in eeg signal classification problem. In Emerging Technologies in Data Mining and Information Security; Springer: Berlin/Heidelberg, Germany, 2019; pp. 145–154. [Google Scholar]
  36. García, S.; Zhang, Z.L.; Altalhi, A.; Alshomrani, S.; Herrera, F. Dynamic ensemble selection for multi-class imbalanced datasets. Inf. Sci. 2018, 445, 22–37. [Google Scholar] [CrossRef]
  37. Georganos, S.; Grippa, T.; Vanhuysse, S.; Lennert, M.; Shimoni, M.; Wolff, E. Very high resolution object-based land use–land cover urban classification using extreme gradient boosting. IEEE Geosci. Remote Sens. Lett. 2018, 15, 607–611. [Google Scholar] [CrossRef] [Green Version]
  38. Kumar, M.; Sheshadri, H. On the classification of imbalanced datasets. Int. J. Comput. Appl. 2012, 44, 145–148. [Google Scholar]
  39. Mani, I.; Zhang, I. kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of Workshop on Learning from Imbalanced Datasets; ICML: Baltimore, MD, USA, 2003; Volume 126, pp. 1–7. [Google Scholar]
  40. Yang, X.; Kuang, Q.; Zhang, W.; Zhang, G. AMDO: An over-sampling technique for multi-class imbalanced problems. IEEE Trans. Knowl. Data Eng. 2017, 30, 1672–1685. [Google Scholar] [CrossRef]
  41. De Caigny, A.; Coussement, K.; De Bock, K.W. A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. Eur. J. Oper. Res. 2018, 269, 760–772. [Google Scholar] [CrossRef]
  42. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef] [Green Version]
  43. Wang, Q. A hybrid sampling SVM approach to imbalanced data classification. In Abstract and Applied Analysis; Hindawi: London, UK, 2014; Volume 2014. [Google Scholar]
  44. Xue, Y.; Zhao, B.; Ma, T.; Pang, W. A self-adaptive fireworks algorithm for classification problems. IEEE Access 2018, 6, 44406–44416. [Google Scholar] [CrossRef]
  45. Krawczyk, B.; Galar, M.; Woźniak, M.; Bustince, H.; Herrera, F. Dynamic ensemble selection for multi-class classification with one-class classifiers. Pattern Recognit. 2018, 83, 34–51. [Google Scholar] [CrossRef]
  46. Karthik, S.; Bhadoria, R.S.; Lee, J.G.; Sivaraman, A.K.; Samanta, S.; Balasundaram, A.; Chaurasia, B.K.; Ashokkumar, S. Prognostic Kalman Filter Based Bayesian Learning Model for Data Accuracy Prediction. Comput. Mater. Contin. 2022, 72, 243–259. [Google Scholar] [CrossRef]
  47. Singh, L.K.; Garg, H.; Khanna, M.; Bhadoria, R.S. An enhanced deep image model for glaucoma diagnosis using feature-based detection in retinal fundus. Med. Biol. Eng. Comput. 2021, 59, 333–353. [Google Scholar] [CrossRef] [PubMed]
  48. Nourzad, S.H.H.; Pradhan, A. Ensemble methods for binary classifications of airborne LiDAR data. J. Comput. Civ. Eng. 2014, 28, 04014021. [Google Scholar] [CrossRef]
  49. Hartman, E.J.; Keeler, J.D.; Kowalski, J.M. Layered neural networks with Gaussian hidden units as universal approximations. Neural Comput. 1990, 2, 210–215. [Google Scholar] [CrossRef]
  50. Kramer, M.A.; Leonard, J. Diagnosis using backpropagation neural networks—Analysis and criticism. Comput. Chem. Eng. 1990, 14, 1323–1338. [Google Scholar] [CrossRef]
  51. Chawla, N.; Eschrich, S.; Hall, L.O. Creating ensembles of classifiers. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; IEEE: Piscataway, NJ, USA, 2001; pp. 580–581. [Google Scholar]
  52. Livieris, I.E.; Kanavos, A.; Tampakas, V.; Pintelas, P. A weighted voting ensemble self-labeled algorithm for the detection of lung abnormalities from X-rays. Algorithms 2019, 12, 64. [Google Scholar] [CrossRef] [Green Version]
  53. Machová, K.; Puszta, M.; Barčák, F.; Bednár, P. A comparison of the bagging and the boosting methods using the decision trees classifiers. Comput. Sci. Inf. Syst. 2006, 3, 57–72. [Google Scholar] [CrossRef]
  54. Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
  55. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  56. Zhu, T.; Pimentel, M.A.; Clifford, G.D.; Clifton, D.A. Unsupervised Bayesian inference to fuse biosignal sensory estimates for personalizing care. IEEE J. Biomed. Health Inform. 2018, 23, 47–58. [Google Scholar] [CrossRef] [PubMed]
  57. Farquad, M.A.H.; Bose, I. Preprocessing unbalanced data using support vector machine. Decis. Support Syst. 2012, 53, 226–233. [Google Scholar] [CrossRef]
  58. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef] [Green Version]
  59. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from imbalanced data streams. In Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; pp. 279–303. [Google Scholar]
  60. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Imbalanced classification with multiple classes. In Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; pp. 197–226. [Google Scholar]
  61. Rajevenceltha, J.; Kumar, C.S.; Kumar, A.A. Improving the performance of multi-parameter patient monitors using feature mapping and decision fusion. In Proceedings of the 2016 IEEE Region 10 Conference (TENCON), Singapore, 22–25 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1515–1518. [Google Scholar]
  62. Friedrichs, F.; Igel, C. Evolutionary tuning of multiple SVM parameters. Neurocomputing 2005, 64, 107–117. [Google Scholar] [CrossRef]
  63. Reif, M.; Shafait, F.; Dengel, A. Meta-learning for evolutionary parameter optimization of classifiers. Mach. Learn. 2012, 87, 357–380. [Google Scholar] [CrossRef] [Green Version]
  64. Batista, G.; Silva, D.F. How k-nearest neighbor parameters affect its performance. In Argentine Symposium on Artificial Intelligence; Citeseer: Princeton, NJ, USA, 2009; pp. 1–12. [Google Scholar]
  65. Anghel, A.; Papandreou, N.; Parnell, T.; De Palma, A.; Pozidis, H. Benchmarking and optimization of gradient boosting decision tree algorithms. arXiv 2018, arXiv:1809.04559. [Google Scholar]
  66. Mantovani, R.G.; Horváth, T.; Cerri, R.; Vanschoren, J.; de Carvalho, A.C. Hyper-parameter tuning of a decision tree induction algorithm. In Proceedings of the 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), Pernambuco, Brazil, 9–12 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 37–42. [Google Scholar]
  67. Couronné, R.; Probst, P.; Boulesteix, A.L. Random forest versus logistic regression: A large-scale benchmark experiment. BMC Bioinform. 2018, 19, 1–14. [Google Scholar] [CrossRef]
  68. Dioşan, L.; Rogozan, A.; Pecuchet, J.P. Improving classification performance of support vector machine by genetically optimising kernel shape and hyper-parameters. Appl. Intell. 2012, 36, 280–294. [Google Scholar] [CrossRef]
  69. Pannakkong, W.; Thiwa-Anont, K.; Singthong, K.; Parthanadee, P.; Buddhakulsomsiri, J. Hyperparameter Tuning of Machine Learning Algorithms Using Response Surface Methodology: A Case Study of ANN, SVM, and DBN. Math. Probl. Eng. 2022, 2022, 8513719. [Google Scholar] [CrossRef]
  70. Wong, T.T.; Yang, N.Y. Dependency analysis of accuracy estimates in k-fold cross validation. IEEE Trans. Knowl. Data Eng. 2017, 29, 2417–2427. [Google Scholar] [CrossRef]
  71. García, V.; Mollineda, R.A.; Sánchez, J.S. On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl. 2008, 11, 269–280. [Google Scholar] [CrossRef]
  72. Sun, H.; Wang, S. Measuring the component overlapping in the Gaussian mixture model. Data Min. Knowl. Discov. 2011, 23, 479–502. [Google Scholar] [CrossRef]
  73. Lee, H.K.; Kim, S.B. An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst. Appl. 2018, 98, 72–83. [Google Scholar] [CrossRef]
  74. Vuttipittayamongkol, P.; Elyan, E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf. Sci. 2020, 509, 47–70. [Google Scholar] [CrossRef]
  75. Jain, S.; Shukla, S.; Wadhvani, R. Dynamic selection of normalization techniques using data complexity measures. Expert Syst. Appl. 2018, 106, 252–262. [Google Scholar] [CrossRef]
  76. Nettleton, D.F.; Orriols-Puig, A.; Fornells, A. A study of the effect of different types of noise on the precision of supervised learning techniques. Artif. Intell. Rev. 2010, 33, 275–306. [Google Scholar] [CrossRef]
  77. Mollineda, R.A.; Sánchez, J.S.; Sotoca, J.M. Data characterization for effective prototype selection. In Iberian Conference on Pattern Recognition and Image Analysis; Springer: Berlin/Heidelberg, Germany, 2005; pp. 27–34. [Google Scholar]
  78. Lichman, M.; Bache, K. UCI Machine Learning Repository; University of California: Los Angeles, CA, USA, 2013. [Google Scholar]
  79. Ali, Z.; Ahmad, R.; Akhtar, M.N.; Chuhan, Z.H.; Kiran, H.M.; Shahzad, W. Empirical Study of Associative Classifiers on Imbalanced Datasets in KEEL. In Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), Zakynthos, Greece, 23–25 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–7. [Google Scholar]
Figure 1. An example of imbalanced class distribution.
Figure 2. Accuracy: ensemble versus non-ensemble before hyper-tuning.
Figure 3. Precision: ensemble versus non-ensemble before hyper-tuning.
Figure 4. Recall: ensemble versus non-ensemble before hyper-tuning.
Figure 5. F1-Score: ensemble versus non-ensemble before hyper-tuning.
Figure 6. Accuracy: ensemble versus non-ensemble after hyper-tuning.
Figure 7. Precision: ensemble versus non-ensemble after hyper-tuning.
Figure 8. Recall: ensemble versus non-ensemble after hyper-tuning.
Figure 9. F1-Score: ensemble versus non-ensemble after hyper-tuning.
Figure 10. Accuracy: ensemble versus non-ensemble after hyper-tuning and synthetic overlapping.
Figure 11. Precision: ensemble versus non-ensemble after hyper-tuning and synthetic overlapping.
Figure 12. Recall: ensemble versus non-ensemble after hyper-tuning and synthetic overlapping.
Figure 13. F1-Score: ensemble versus non-ensemble after hyper-tuning and synthetic overlapping.
Table 1. Weaknesses and Strengths of Existing techniques in the literature.
Solutions | Strength | Weakness
Decomposition strategies [45] | Transform imbalanced multi-class problems into several classes and design a separate classifier for each class. | In the case of multiple classes, handling individual classifiers for each class is time consuming.
Optimization classification model (OCM) using evolutionary computation (EC) techniques and the Self-adaptive Fireworks Algorithm (SaFWA) based on swarm intelligence [44] | OCM deals with classification problems, and SaFWA deals with optimization problems. To increase the diversity/range of the solutions, four candidate solution generation strategies (CSGSs) are merged with SaFWA. | The solution is based on optimization and solution diversity, but lacks in addressing the increasing impact of overlapping samples.
k-means clustering and SMOTE [42] | Results in an effective oversampling method for overcoming the imbalance ratio between and within classes and avoiding noise generation by combining the undersampling and oversampling methods. | In both oversampling and undersampling, the samples in the classes are discarded or increased, while our focus is on increasing the samples in the underlying dataset to minimize visibility.
Mahalanobis distance-based oversampling (MDO) with generalized singular value decomposition (GSVD) [41] | To handle imbalanced class data with mixed attributes, introduces generalized singular value decomposition (GSVD) for complex and mixed-type data, augmented with a resampling scheme applied to the mixed type of attributes to optimize the synthesis of samples. | Without the decomposition technique, the proposed solution is unable to address overlapping issues.
Data preprocessing techniques with sampling strategies and a KNN classifier augmented with different NearMiss and "most distant" methods [38,39] | A balanced training dataset is more robust for improving the overall performance of the classifier for several base classifiers, where a small subset of training data is selected to minimize the skewness in the remaining data. | Focused on preprocessing techniques rather than on hyper-tuning and synthetic overlapping.
Twin bounded weighted relaxed support vector machines (TBWRSVM) [35] | Handles the imbalance and outliers in the problem domain. | Well designed for classifying imbalanced datasets and identifying outlier samples, but lacking in synthetic overlapping samples.
SMOTEBagging [31] | The proposed method significantly counters multi-class imbalance learning issues by creating every individual bag to be expressively diverse. | Only targets the minority class samples to bring them back to equality with the majority class.
BalanceCascade [32] | The negative class instances are arbitrarily condensed in every bootstrap sample to make them equal to the cardinality of the positive class. | Only focuses on sampling techniques to balance the datasets.
Boosting ensemble method [35] | A strong learner is built from a collection of different weak learners (weak in the sense of accuracy when applied to the classification task). | Selection of a base learner is a critical job.
Table 2. Symbol and their description.
Symbol | Description
D_j | Sample of class j
p_j(C_i) | Number of accurate predictions of classifier i
χ_A | Characteristic function which takes into account the prediction j ∈ A
c_ls | Coefficients
w_ls | Weak learners
z_b^l | b-th observation of the l-th bootstrap sample
L | Bootstrap samples
C | Regularization parameter
λ | Kernel width parameter
n_estimator | Number of trees
leaf_size | leaf_size parameter
K | Neighbor at k distance
max_depth | Last leaf node of the tree
y_i and ŷ_i | Actual value and predicted value of observation i
C_i | Class of the ith sample
C_j | Class of the jth sample
rf_i | Discriminative ratio
d(a, b) | Distance between two vectors (two samples of the classes)
d_i | Average distance
k_1 | Nearest neighbors
N_ei | Sample in the target class
MC_Count | Samples in the majority class
Col | Column in the dataset
Loc | Location of the sample
SS | Set of synthetic samples
TT | Subset of D (multiclass dataset)
y_new | New minority
f_i | Feature listed in the dataset
learning_rate and number_of_trees | The boosting parameters directly related to the underlying boosting algorithm
N | Total number of observations
E(.) | Fitting error
e(., .) | Loss/error function
S_l | Model that best fits the training data
p_cj, p_ck | Respective samples in classes c_j and c_k
μ_fi | Mean of the f_i values across all classes
Table 3. Parameters Tuning of the Selected Classifiers for the 9 Datasets.
AlgorithmTuned ParameterSearch RangeBest PDataset
Gradient boostingNumber of estimators[100; 150; 200; 250; 300; 400; 500; 600; 800; 1000; 1200]100IRIS, Car, Counterceptive Used Method, Page_Block, User_Knowledge, Vehicle, Wine, Volcanoes, Wall_Following, Nursey
Learning rate[0.01; 0.02; 0.5; 0.1; 0.2; 0.25; 0.3; 0.4; 0.5]0.5
Min_samples_split[2; 3; 4; 5; 6; 8; 10; 15]2
Max_tree_depth[3; 4; 5; 6; 7; 8; 9; 10; 12; 15]4
Random forestNumber_of_estimators[50, 120; 150; 200; 220; 250; 300; 350; 600; 700]200
Max_features[‘log2’; ‘sqrt’; ‘all’]all
Min_samples_leaf[1; 2; 3; 4; 5; 6; 7; 8; 9; 10]2
Decision Tree‘max_features’:[‘auto’, ‘sqrt’, ‘log2’, Non]None
Criterion‘gini’, ‘entropy’gini
max_depth1, 20, 2
Splitterbest’, ’randombest
‘min_samples_split’:[2, 5, 10],5
‘min_samples_leaf’:[1, 2, 4, 10],1
KNNn_neighbors’:np.arange (1, 15),5
weights’:[‘uniform’, ‘distance’],Uniform
leaf_size’[1, 3, 5]3
RBF support vector machineGamma2 × 10−15, 2 × 10−13, 2 × 10−11, 2 × 10−9, 2 × 10−7, 2 × 10−5, 2 × 10−3, 2 × 10−1, 2 × 101, 2 × 1031, 0.1, 0.01, 0.001, 1, 10, 50, 100, 200, 500100
C2 × 10−1, 2 × 101, 2 × 103, 2 × 105, 0.1, 0.01, 1, 10, 1000.1
Decision_function_shape[O-vs-O, O-vs-R]O-vs-R
Class_weightUniform, balanceduniform
Logistic regressionPenalty[2, 6, 8, 12, 15, 16]12
Solver[‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]‘newton-cg’For large datasets, ‘sag’ and ‘saga’ will be used
Table 4. Parameters Tuning of the Selected Classifiers for the 11 Datasets.
AlgorithmTuned ParameterSearch RangeBest PDataset
Gradient boostingNumber of estimators[100; 150; 200; 250; 300; 400; 500; 600; 800; 1000; 1200]200Soybean, Satimage, Opt_Digits, Glass, LED_Domain, Ecoli, Dermatology, Bridges, Breast_Tissue, Human_Activity_Recognition
Learning rate[0.01; 0.02; 0.5; 0.1; 0.2; 0.25; 0.3; 0.4; 0.5]0.1
Min_samples_split[2; 3; 4; 5; 6; 8; 10; 15]3
Max_tree_depth[3; 4; 5; 6; 7; 8; 9; 10; 12; 15]8
Random forestNumber of estimators[50, 120; 150; 200; 220; 250; 300; 350; 600; 700]200
Max_number of features[‘log2’; ‘sqrt’; ‘all’]all
Min_samples_leaf[1; 2; 3; 4; 5; 6; 7; 8; 9; 10]2
Decision tree‘max_features’:[‘auto’, ‘sqrt’, ‘log2’, None ]None
criterion’gini’, ’entropy’gini
max_depth1, 20, 2
splitter’best’, ’random’best
‘min_samples_split’:[2, 5, 10],5
‘min_samples_leaf’:[1, 2, 4, 10],1
K-NNn_neighbors’:np.arrange (1, 15),5
weights’:[‘uniform’, ‘distance’],Uniform
leaf_size’[1, 3, 5]5
RBF support vector machinegamma2 × 10−15, 2 × 10−13, 2 × 10−11, 2 × 10−9, 2 × 10−7, 2 × 10−5, 2 × 103, 2 × 10−1, 2 × 101, 2 × 1031, 0.1, 0.01, 0.001, 1, 10, 50, 100, 200, 500200
C2 × 10−1, 2 × 101, 2 × 103, 2 × 105, 0.1, 0.01, 1, 10, 1000.01
Decision_function_shape[O-vs-O, O-vs-R]O-vs-R
Class_weightUniform, balanceduniform
Logistic regressionpenalty[2, 6, 8, 12, 15, 16]12
solver[‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]‘newton-cg’‘saga’ and ‘sag’ will be used for a larger dataset
Table 5. Accuracy and precision before hyper-tuning.
Accuracy before Hyper-Tuning | Precision before Hyper-Tuning
Datasets | GB | RF | DT | KNN | R-SVM | LR | GB | RF | DT | KNN | R-SVM | LR
IRIS94.7497.3794.7492.1192.1192.1193.3996.0293.3990.7690.7690.76
Glass94.5995.3291.3594.1194.1194.1193.2493.979092.7692.7692.76
HAR88.4587.386.286.3286.3286.3287.185.9584.8584.9784.9784.97
B_Tissue95.7196.5792.8688.9288.9288.9294.3695.2291.5187.5787.5787.57
Car86.582.8980.9781.681.681.685.1581.5479.6280.2580.2580.25
CMC87.2184.9281.6783838385.8683.5780.3281.6581.6581.65
Ecoli91.9888.8988.3886.8686.8686.8690.6387.5487.0385.5185.5185.51
Nursery82.9483.6781.9988.5788.5788.5781.5982.3280.6487.2287.2287.22
Opt_Digits85.4786.0183.1976.0476.0476.0484.1284.6681.8474.6974.6974.69
Page_Block85.8583.3784.4179.2779.2779.2784.582.0283.0677.9277.9277.92
Satimage85.5384.485.5379.6679.6679.6684.1883.0584.1878.3178.3178.31
Soyaben92.3591.0489.9280.6580.6580.659189.6988.5779.379.379.3
Vehicle82.583.575.3681.6881.6881.6881.1582.1574.0180.3380.3380.33
Volcanoesa80.817981.986.4986.4986.4979.4677.6580.5585.1485.1485.14
Wine96.6196.6194.9278.8278.8278.8295.2695.2693.5777.4777.4777.47
Table 6. Recall and F1-score before hyper-tuning.
Recall before Hyper-Tuning | F1-Score before Hyper-Tuning
Datasets | GB | RF | DT | KNN | R-SVM | LR | GB | RF | DT | KNN | R-SVM | LR
IRIS90.593.1390.587.8787.8787.8787.0389.6687.0384.484.484.4
Glass90.3591.0887.1189.8789.8789.8786.8887.6183.6486.486.486.4
HAR84.2183.0681.9682.0882.0882.0880.7479.5978.4978.6178.6178.61
B_Tissue91.4792.3388.6284.6884.6884.688888.8685.1581.2181.2181.21
Car82.2678.6576.7377.3677.3677.3678.7975.1873.2673.8973.8973.89
CMC82.9780.6877.4378.7678.7678.7679.577.2173.9675.2975.2975.29
Ecoli87.7484.6584.1482.6282.6282.6284.2781.1880.6779.1579.1579.15
Nursery78.779.4377.7584.3384.3384.3375.2375.9674.2880.8680.8680.86
Opt_Digits81.2381.7778.9571.871.871.877.7678.375.4868.3368.3368.33
Page_Block81.6179.1380.1775.0375.0375.0378.1475.6676.771.5671.5671.56
Satimage81.2980.1681.2975.4275.4275.4277.8276.6977.8271.9571.9571.95
Soyaben88.1186.885.6876.4176.4176.4184.6483.3382.2172.9472.9472.94
Vehicle78.2679.2671.1277.4477.4477.4474.7975.7967.6573.9773.9773.97
Volcanoesa76.5774.7677.6682.2582.2582.2573.171.2974.1978.7878.7878.78
Wine92.3792.3790.6874.5874.5874.5888.988.987.2171.1171.1171.11
Table 7. Accuracy and Precision after Parameters Hyper-Tuning.
Accuracy after Hyper-Tuning | Precision after Hyper-Tuning
Datasets | GB | RF | DT | KNN | R-SVM | LR | GB | RF | DT | KNN | R-SVM | LR
IRIS95.7198.3495.7193.0895.0887.8194.3696.9994.3691.7393.7386.46
Glass95.5696.2992.3287.2989.8985.3494.2194.9490.9785.9488.5483.99
HAR89.4288.2787.1782.5783.9781.3188.0786.9285.8281.2282.6279.96
B_Tissue96.6897.5493.8387.8389.5481.5495.3396.1992.4886.4888.1980.19
Car87.4783.8681.9477.0180.2471.0386.1282.5180.5975.6678.8969.68
CMC88.1885.8982.6480.6381.6275.8786.8384.5481.2979.2880.2774.52
Ecoli92.9589.8689.3582.6587.4681.0491.688.518881.386.1179.69
Nursery83.9184.6482.9679.7981.9171.6882.5683.2981.6178.4480.5670.33
Opt_Digits86.4486.9884.1681.5782.5776.4185.0985.6382.8180.2281.2275.06
Page_Block86.8284.3485.3873.8283.1274.2185.4782.9984.0372.4781.7772.86
Satimage86.585.3786.576.3683.7973.5385.1584.0285.1575.0182.4472.18
Soyaben93.3292.0190.8986.8189.4882.5691.9790.6689.5485.4688.1381.21
Vehicle83.4784.4776.3367.7678.1371.4782.1283.1274.9866.4176.7870.12
Volcanoesa81.7879.9782.8772.8575.169.9480.4378.6281.5271.573.7568.59
Wine97.5897.5895.8983.8589.281.8996.2396.2394.5482.587.8580.54
Table 8. Recall and F1-Score after Parameters Hyper-Tuning.
Recall after Hyper-Tuning | F1-Score after Hyper-Tuning
Datasets | GB | RF | DT | KNN | R-SVM | LR | GB | RF | DT | KNN | R-SVM | LR
IRIS90.8793.590.8788.2490.2482.9787.890.4387.885.1787.1779.9
Glass90.7291.4587.4882.4585.0580.587.6588.3884.4179.3881.9877.43
HAR84.5883.4382.3377.7379.1376.4781.5180.3679.2674.6676.0673.4
B_Tissue91.8492.788.9982.9984.776.788.7789.6385.9279.9281.6373.63
Car82.6379.0277.172.1775.466.1979.5675.9574.0369.172.3363.12
CMC83.3481.0577.875.7976.7871.0380.2777.9874.7372.7273.7167.96
Ecoli88.1185.0284.5177.8182.6276.285.0481.9581.4474.7479.5573.13
Nursery79.0779.878.1274.9577.0766.847676.7375.0571.887463.77
Opt_Digits81.682.1479.3276.7377.7371.5778.5379.0776.2573.6674.6668.5
Page_Block81.9879.580.5468.9878.2869.3778.9176.4377.4765.9175.2166.3
Satimage81.6680.5381.6671.5278.9568.6978.5977.4678.5968.4575.8865.62
Soyaben88.4887.1786.0581.9784.6477.7285.4184.182.9878.981.5774.65
Vehicle78.6379.6371.4962.9273.2966.6375.5676.5668.4259.8570.2263.56
Volcanoesa76.9475.1378.0368.0170.2665.173.8772.0674.9664.9467.1962.03
Wine92.7492.7491.0579.0184.3677.0589.6789.6787.9875.9481.2973.98
Table 9. Accuracy and Precision after Parameters Hyper-Tuning and Synthetic Overlapping.
Accuracy after Hyper-Tuning and Synthetic Overlapping | Precision after Hyper-Tuning and Synthetic Overlapping
Datasets | GB | RF | DT | KNN | R-SVM | LR | GB | RF | DT | KNN | R-SVM | LR
IRIS95.1297.7595.1292.3794.3787.193.7596.3893.7591.0193.0185.74
Glass94.9795.791.7386.5889.1884.6393.694.3390.3685.2287.8283.27
HAR88.8387.6886.5881.8683.2680.687.4686.3185.2180.581.979.24
B_Tissue96.0996.9593.2487.1288.8380.8394.7295.5891.8785.7687.4779.47
Car86.8883.2781.3576.379.5370.3285.5181.979.9874.9478.1768.96
CMC87.5985.382.0579.9280.9175.1686.2283.9380.6878.5679.5573.8
Ecoli92.3689.2788.7681.9486.7580.3390.9987.987.3980.5885.3978.97
Nursery83.3284.0582.3779.0881.270.9781.9582.688177.7279.8469.61
Opt_Digits85.8586.3983.5780.8681.8675.784.4885.0282.279.580.574.34
Page_Block86.2383.7584.7973.1182.4173.584.8682.3883.4271.7581.0572.14
Satimage85.9184.7885.9175.6583.0872.8284.5483.4184.5474.2981.7271.46
Soyaben92.7391.4290.386.188.7781.8591.3690.0588.9384.7487.4180.49
Vehicle82.8883.8875.7467.0577.4270.7681.5182.5174.3765.6976.0669.4
Volcanoesa81.1979.3882.2872.1474.3969.2379.8278.0180.9170.7873.0367.87
Wine96.9996.9995.383.1488.4981.1895.6295.6293.9381.7887.1379.82
Table 10. Recall and F1-Score after Parameters Hyper-Tuning and Synthetic Overlapping.
Recall after Hyper-Tuning and Synthetic Overlapping | F1-Score after Hyper-Tuning and Synthetic Overlapping
Datasets | GB | RF | DT | KNN | R-SVM | LR | GB | RF | DT | KNN | R-SVM | LR
IRIS90.2292.8590.2287.4789.4782.287.1189.7487.1184.3686.3679.09
Glass90.0790.886.8381.6884.2879.7386.9687.6983.7278.5781.1776.62
HAR83.9382.7881.6876.9678.3675.780.8279.6778.5773.8575.2572.59
B_Tissue91.1992.0588.3482.2283.9375.9388.0888.9485.2379.1180.8272.82
Car81.9878.3776.4571.474.6365.4278.8775.2673.3468.2971.5262.31
CMC82.6980.477.1575.0276.0170.2679.5877.2974.0471.9172.967.15
Ecoli87.4684.3783.8677.0481.8575.4384.3581.2680.7573.9378.7472.32
Nursery78.4279.1577.4774.1876.366.0775.3176.0474.3671.0773.1962.96
Opt_Digits80.9581.4978.6775.9676.9670.877.8478.3875.5672.8573.8567.69
Page_Block81.3378.8579.8968.2177.5168.678.2275.7476.7865.174.465.49
Satimage81.0179.8881.0170.7578.1867.9277.976.7777.967.6475.0764.81
Soyaben87.8386.5285.481.283.8776.9584.7283.4182.2978.0980.7673.84
Vehicle77.9878.9870.8462.1572.5265.8674.8775.8767.7359.0469.4162.75
Volcanoesa76.2974.4877.3867.2469.4964.3373.1871.3774.2764.1366.3861.22
Wine92.0992.0990.478.2483.5976.2888.9888.9887.2975.1380.4873.17
Table 11. Dataset with Description.
Dataset | Attributes | Samples | Classes | The Ratio of Each Class
IRIS415030.333, 0.333, 0.333
Glass921460.3271, 0.0794, 0.0421, 0.3551, 0.1355, 0.0607
HAR56110,2996(22.94, 77.06)
Breast_Tissue910660.2075, 0.1981, 0.1320, 0.1415, 0.1509, 0.1698
Bridges1310760.1415, 0.1037, 0.0849, 0.4245, 0.1037, 0.1509
Car7172840.2228, 0.405, 0.7002, 0.0381
CMC10147330.4270, 0.2260, 0.3469
Dermatology3536660.306 0.1967, 0.1666, 0.142, 0.1338, 0.0546
Ecoli933680.4255, 0.2291, 0.0059, 0.009, 0.1041, 0.0595, 0.0148, 0.1547
LED_Domain8500100.1233, 0.1355, 0.452, 0.4311, 0.1677, 0.013, 0.2033, 0.1576, 0.432, 0.01233
Nersery912,96050.3333, 0.3291, 0.0001, 0.3120, 0.0253
Page_Block11547350.8978, 0.0601, 0.0051, 0.0159, 0.0210
Satimage37643060.2382, 0.1092, 0.2110, 0.0973, 0.1099, 0.2343
Soyaben36683190.0234, 0.1332, 0.0644, 0.0292, 0.0292, 0.1346, 0.0644, 0.0292, 0.0204, 0.0219, 0.0292, 0.0292, 0.1332, 0.0117, 0.0292, 0.1288, 0.0292, 0.0292, 0.0292
U_Knowledge640350.2506, 0.3200, 0.3027, 0.2382, 0.0645
Vehicle1984640.2576, 0.2505, 0.2565, 0.2352
Volcanoesa4325350.9077, 0.0209, 0.0178, 0.0264, 0.0271
Wine1417830.3988, 0.3314, 0.2696
Wl_Following25545640.4041; 0.3843, 0.1513, 0.0601
Opt_Digits655620190.1017, 0.1016, 0.101, 0.1007, 0.1, 0.0992, 0.0992, 0.0991, 0.0985
Table 12. Acronyms and their definitions.
Acronym | Description
KNN | k-Nearest Neighbor
NB | Naive Bayes
ANN | Artificial Neural Network
DT | Decision Tree
SVM | Support Vector Machine
LR | Logistic Regression
G-mean | Geometric Mean
F1-Score | F-Measure
AUC | Area under Curve
TBWRSVM | Twin Bounded Weighted Relaxed Support Vector Machines
WRSVM | Weighted Relaxed Support Vector Machine
DES-MI | Dynamic Ensemble Selection for Multi-Class Imbalanced Datasets
GSVD | Generalized Singular Value Decomposition
MDO | Mahalanobis Distance-based Oversampling
EC | Evolutionary Computation
SaFWA | Self-Adaptive Fireworks Algorithm
CSGSs | Candidate Solution Generation Strategies
k-SMOTE | k-Means Clustering SMOTE
split | Minimum-Sample-Split
OVO | One-versus-One Decomposition Strategies
OVR | One-versus-Rest Decomposition Strategies
solver, newton-cg, sag, and lbfgs | Multinomial values
RSM | Response Surface Methodology
MAE | Mean Absolute Error
labelencoder.fit | Label Encoding Scheme
MDS | Multiclass Dataset
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

