Article

Evolutionary Mahalanobis Distance-Based Oversampling for Multi-Class Imbalanced Data Classification

Department of Electrical Engineering, National Taipei University of Technology, Taipei 10618, Taiwan
* Author to whom correspondence should be addressed.
Sensors 2021, 21(19), 6616; https://doi.org/10.3390/s21196616
Submission received: 23 July 2021 / Revised: 14 September 2021 / Accepted: 29 September 2021 / Published: 4 October 2021

Abstract

Sensing data are often imbalanced across classes, and oversampling the minority class is an effective remedy. In this paper, an effective oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO utilizes a set of ellipsoids to approximate the decision regions of the minority class. Furthermore, multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson–Kessel algorithm in EMDO to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated based on the Mahalanobis distance within every ellipsoid, and the number of synthetic samples generated in each ellipsoid is determined by the density of minority samples it contains. The results of computer simulations conducted herein indicate that EMDO outperforms most of the widely used oversampling schemes.

1. Introduction

With advancements in sensor technology and the Internet of Things (IoT), vast quantities of sensing data have been collected and analyzed for different applications. Cost-effective sensors are widely used in our everyday lives to collect various types of data for further online or offline analyses and applications. The classification of real-world sensing data is a highly important research topic in the field of data mining and machine learning. However, the data sets collected using sensors or other sensing techniques usually have a skewed class distribution because the number of data points varies greatly between classes. Such data are called imbalanced data. The data utilized in applications such as anomaly detection in high-speed trains [1,2,3], fault diagnosis of motors [4,5,6], fault detection and diagnosis in manufacturing processes [7,8,9], and medical diagnosis [10,11,12] are usually imbalanced. In imbalanced data sets, at least one class of data has significantly more data points than the other classes. Learning on imbalanced data results in poor performance, and this problem has thus attracted considerable research attention in recent years, mainly because the performance of many conventional learning algorithms is degraded by the skewed class distribution of imbalanced data sets [13].
Balanced class distribution [14] or equal weighting of classification errors for every class [13] is generally assumed in most conventional machine learning algorithms. For instance, an imbalanced data set may comprise 95% majority class samples and only 5% minority class samples. With equal weighting of the classification errors, traditional classification approaches tend to overlook several or most of the minority class samples in the attempt to minimize the overall classification error. Consequently, although the overall classification error rate is low, the classification error rate for the minority class is high. Minority class samples are important in certain applications, such as medical diagnosis, anomaly detection, and fault detection and diagnosis: majority class samples usually represent normal conditions, whereas minority class samples represent the abnormal conditions that are key in such applications. Learning approaches for imbalanced data are designed to increase learning accuracy with respect to minority classes without trading off learning accuracy with respect to majority classes.
The learning approaches for imbalanced data can generally be categorized into three types: cost-sensitive learning, data-level learning, and ensemble learning, which are comprehensively reviewed in [15,16]. Cost-sensitive learning assigns higher misclassification costs to minority class samples than to majority class samples. Studies have proposed various learning approaches that adjust the misclassification cost; these approaches involve the radial basis function [17], confusion-matrix-based kernel logistic regression [18], support vector machines [19,20], and deep learning [21,22].
Data-level learning approaches essentially rebalance the skewed data distribution of different classes by removing several majority class samples or by adding new minority class samples, and they can generally be divided into undersampling and oversampling approaches. The main advantage of data-level learning approaches is that they are independent of the classifier; they can be considered a type of data preprocessing. Therefore, data-level learning approaches can be easily integrated with other imbalanced learning approaches. Undersampling removes majority class samples to ensure that the learning results are not overly biased toward the majority class [23,24,25,26]. Undersampling reduces both the number of samples and the computational cost of machine learning, but it tends to reduce the model's capability to recognize majority classes. By contrast, oversampling increases the number of minority class samples by resampling them or by generating synthetic samples. However, resampling by simply replicating the minority class samples does not improve learning of the decision region of minority classes. The synthetic minority oversampling technique (SMOTE) proposed in [27] selects a minority class sample, identifies other minority class samples in its vicinity, and generates a new synthetic sample by linear interpolation between the two points. SMOTE is the most widely used oversampling technique because it is computationally inexpensive. However, SMOTE is prone to overgeneralization because it synthesizes new samples through the random selection of minority class samples. Various adaptive sampling methods based on SMOTE have been proposed to overcome this limitation. The adaptive synthetic sampling approach for imbalanced learning (ADASYN) [28], SMOTEBoost [29], and Borderline SMOTE [30] are effective modified versions of SMOTE. In contrast to SMOTE, other algorithms, such as those in [31,32,33], generate synthetic samples by learning the structure underlying the minority samples.
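For illustration only, the following minimal sketch (Python with NumPy; the function name, parameters, and neighbor-selection details are illustrative assumptions, not the reference SMOTE implementation) shows the interpolation idea described above: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbors.

```python
import numpy as np

def smote_like_interpolation(minority, n_new, k=5, seed=None):
    """Generate n_new synthetic points; each lies on the segment between a
    randomly chosen minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # indices of the k nearest minority neighbors of x (excluding x itself)
        dist = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]
        x_nn = minority[rng.choice(neighbors)]
        gap = rng.random()  # random position on the segment between x and x_nn
        synthetic.append(x + gap * (x_nn - x))
    return np.array(synthetic)
```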
Ensemble learning for imbalanced data integrates traditional machine learning approaches, such as boosting [34], bagging [35], and stacking [36], with cost-sensitive or data resampling imbalanced learning approaches. In [37], SMOTE was integrated with AdaBoost [38] to increase the number of minority samples and to assign higher weights to misclassified minority samples. A similar integration of AdaBoost with a novel synthetic sampling method was proposed in [39]. The performance of boosting, bagging, and other hybrid techniques applied to imbalanced data has been compared in [40,41].
In the methods proposed in [31,32,33], synthetic samples are generated based not on individual minority samples, as in SMOTE [27], but on the underlying structure of the minority samples. Recently, a similar oversampling approach called Mahalanobis distance-based oversampling (MDO) was proposed in [42]. MDO generates synthetic samples based on the structure of the principal component space of minority samples. Each synthetic sample generated by MDO has the same Mahalanobis distance as the minority sample from which it is derived. Because the class mean of the synthetic samples generated by MDO is the same as that of the minority class samples, the covariance structure of the minority class samples is preserved. In [43], a scheme called adaptive Mahalanobis distance oversampling (AMDO) was proposed. AMDO integrates generalized singular value decomposition with MDO to solve the oversampling problem encountered in mixed-type imbalanced data sets. Either MDO or AMDO can be utilized as a direct learning approach for multi-class imbalanced problems.
The oversampling results obtained from MDO or AMDO are equivalent to those obtained by placing the minority class samples and the generated synthetic samples into the principal component space. The minority class samples and the synthetic samples can be considered to be included in a single ellipsoid centered at the class mean, whose orientation depends on the covariance structure of the minority class samples. The synthetic samples do not change the covariance structure of the minority class because they are all generated within this ellipsoid. However, both MDO and AMDO use only one ellipsoid to enclose the minority class samples and synthetic samples. If the decision region of the minority class is disconnected, the decision region approximated using only one ellipsoid may overlap with the decision regions of other classes. This is especially true for imbalanced multi-class data, where samples from different classes may fall inside the single ellipsoid fitted to the target minority class samples. The synthetic samples generated by MDO or AMDO are placed randomly within the single ellipsoid, constrained only to have the same Mahalanobis distance as the associated minority sample. Consequently, the generated synthetic samples tend to be placed in clusters of samples belonging to other classes, which reduces the effectiveness of oversampling. Moreover, certain decision regions (e.g., those that are ring- or belt-shaped) are difficult to approximate with only one ellipsoid.
A novel approach called evolutionary Mahalanobis distance oversampling (EMDO) is proposed in this paper to overcome the limitations of MDO and AMDO. EMDO utilizes multiple ellipsoids to learn the distribution and orientation of minority class samples in parallel. Gustafson and Kessel proposed a clustering algorithm, the Gustafson–Kessel algorithm (GKA) [44], which is similar to the widely used fuzzy c-means clustering approach [45]. The advantage of the GKA over fuzzy c-means is that it utilizes the Mahalanobis norm instead of the Euclidean norm to learn the underlying sample distribution. However, the GKA assumes a fixed volume before learning the center and orientation of every ellipsoid. The GKA is therefore an effective clustering approach for learning the centers and orientations of data clusters, but it is unsuitable for learning the decision regions of data because of its assumption of a fixed ellipsoid size. The GKA was modified in [46,47] to adaptively learn ellipsoid sizes for pattern recognition problems by using a genetic algorithm with a single objective function. In the proposed EMDO, the GKA is integrated with multi-objective particle swarm optimization (MOPSO) [48,49] so that the centers, orientations, and sizes of multiple ellipsoids are learned in parallel while the overall misclassification error is minimized. The misclassification error is defined as the total number of misclassified samples with respect to the union of the multiple ellipsoids. Therefore, EMDO can learn a set of ellipsoids to approximate connected or disconnected complex decision regions with reasonable accuracy. Because multiple ellipsoids are learned in parallel in EMDO, an effective approach is designed to adaptively determine the number of synthetic samples to be generated in every ellipsoid. Similar ideas of designing suitable algorithms to search for model parameters in specific applications are found in [50,51,52].
The technical novelty and main contribution of this paper are summarized as follows.
1) An effective novel oversampling approach called EMDO is proposed for multi-class imbalanced data problems. Unlike MDO and AMDO, which use only one ellipsoid, EMDO learns multiple ellipsoids in parallel to approximate the decision region of the target minority class samples.
2) MOPSO is utilized along with the GKA in EMDO to optimize the parameters, including the centers, orientations, and sizes of the multiple ellipsoids, so that the decision region of the target class is approximated with reasonable accuracy.
3) Synthetic minority samples are generated based on the Mahalanobis distance within every ellipsoid learned by EMDO. A novel adaptive approach is proposed to determine the number of synthetic minority samples to be generated based on the density of minority samples in every ellipsoid.
4) EMDO was evaluated and found to perform better than other widely used oversampling schemes.
The remainder of this paper is organized as follows. Section 2 presents the problem formulation of oversampling for imbalanced data and introduces the GKA, showing that the GKA is suitable for solving the problem formulated herein. Section 3 introduces the multi-objective optimization scheme designed in EMDO, which uses MOPSO. Section 4 details the method for determining the number of ellipsoids required to approximate the decision regions of every class. Section 5 presents the scheme for generating synthetic samples within the learned ellipsoids. Section 6 describes the performance evaluation of EMDO against other widely used oversampling schemes. Finally, Section 7 concludes the study.

2. Problem Statement and GKA

Given a data set $S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^d, \; y_i \in \{1, \dots, p\}, \; i = 1, \dots, N\}$, every $i$th sample $x_i \in S$ belongs to some class $y_i$ among $p$ classes. Let $S_j \subseteq S$ be the set containing the samples belonging to class $j$, $j = 1 \dots p$. Denote $N_j = |S_j|$ as the number of samples in $S_j$, where $N_{\min} = \min_{j=1 \dots p}(N_j)$ and $N_{\max} = \max_{j=1 \dots p}(N_j)$. The data set $S$ is imbalanced if $(N_j / N_{\max}) < 1/IR$ for some class $j$, where $IR$ is a preset imbalance ratio. The value of $IR$ is determined based on the size of $S$ and on the characteristics of the classification problem. Typically, $IR \ge 1.5$. $S_j$ is called a minority set if $(|S_j| / N_{\max}) < 1/IR$, $j = 1 \dots p$. An oversampling technique is applied in this study to overcome the skewed distribution of samples in $S$. If $S_j$ is a minority set, synthetic samples belonging to the same $j$th class are generated in $S_j$ to form an enlarged set $\tilde{S}_j$ such that $|\tilde{S}_j| = N_{\max}/IR$. Denote $\tilde{N}_j$ as the total number of extra synthetic samples generated to balance the minority set $S_j$,
$\tilde{N}_j = N_{\max}/IR - |S_j|$. (1)
Note that there can be more than one minority set in a multi-class problem. An oversampling technique is proposed herein to improve classification accuracy on an imbalanced data set. To generate an adequate number of synthetic samples and place them in the minority sets, the distribution of decision regions of every minority set in the d-dimensional feature space must be located. Multiple ellipsoids are utilized in this study to approximate the decision regions of minority sets. EMDO is proposed to learn these ellipsoids and generate synthetic samples in these ellipsoids for oversampling.
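As a small illustration of Equation (1), the sketch below (Python/NumPy; the function name and the use of a class-label vector are assumptions made for illustration) computes the number of synthetic samples needed per minority class.

```python
import numpy as np

def synthetic_counts(y, imbalance_ratio=1.5):
    """Number of synthetic samples per minority class so that each minority
    class is enlarged to N_max / IR samples (Equation (1))."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() / imbalance_ratio
    return {c: int(np.ceil(target - n))
            for c, n in zip(classes, counts) if n < target}
```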
Assume that $\alpha_j$ ellipsoids approximate the decision region of the $j$th class samples. Denote the center of every $n$th ellipsoid as $v_n^j \in \mathbb{R}^d$. The distance between every $k$th sample $x_k$ and the ellipsoid center $v_n^j$ is defined in the Mahalanobis form as follows:
$\lambda_{nk}^j = \left( (x_k - v_n^j)^T M_n^j (x_k - v_n^j) \right)^{1/2}$, (2)
where $M_n^j \in \mathbb{R}^{d \times d}$ is a norm-inducing matrix. The ellipsoid $\Phi_n^j$ is defined as follows by using the Mahalanobis distance defined in (2):
$\Phi_n^j(x_k) = (x_k - v_n^j)^T M_n^j (x_k - v_n^j) = 1$. (3)
The sample $x_k$ is inside or on the ellipsoid if $\Phi_n^j(x_k) \le 1$, but it is outside the ellipsoid if $\Phi_n^j(x_k) > 1$. Let the decision region of the $j$th class samples in the feature space be denoted as $\mathcal{R}^j$; $\mathcal{R}^j$ is approximated by the union of $\alpha_j$ ellipsoids, that is,
$\mathcal{R}^j \approx \bigcup_{n=1}^{\alpha_j} \Phi_n^j$. (4)
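The following short sketch (Python/NumPy; function names are illustrative) shows how Equations (2)–(4) translate into a membership test for the union of ellipsoids.

```python
import numpy as np

def mahalanobis(x, center, M):
    """Mahalanobis distance of Equation (2) for norm-inducing matrix M."""
    diff = np.asarray(x, dtype=float) - center
    return float(np.sqrt(diff @ M @ diff))

def in_decision_region(x, centers, norm_matrices):
    """True if x lies inside the union of ellipsoids of Equation (4),
    i.e. the distance of Equation (2) is <= 1 for at least one ellipsoid."""
    return any(mahalanobis(x, v, M) <= 1.0 for v, M in zip(centers, norm_matrices))
```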
The GKA is used to learn the $\alpha_j$ ellipsoids in parallel, given that the size of each ellipsoid is assigned. Denote the size of the ellipsoid $\Phi_n^j$ in (3) as $\xi_n^j$, $n = 1 \dots \alpha_j$. The determinant of the norm-inducing matrix $M_n^j$ is inversely proportional to $\xi_n^j$; therefore,
$\det(M_n^j) = 1/\xi_n^j, \quad n = 1 \dots \alpha_j$. (5)
The GKA learns the norm-inducing matrices $M_n^j$ and the ellipsoid centers $v_n^j$ by iteratively calculating an auxiliary fuzzy partition matrix $U^j \in \mathbb{R}^{\alpha_j \times N_j}$ using all $N_j$ samples belonging to the $j$th class. The element $\mu_{nk}^j \in U^j$ represents the membership value of the $k$th sample $x_k$ associated with the $n$th ellipsoid $\Phi_n^j$. The membership values sum to 1 for every $x_k$, that is,
$\sum_{n=1}^{\alpha_j} \mu_{nk}^j = 1, \quad k = 1 \dots N_j$. (6)
The GKA is a fast iterative learning algorithm that efficiently updates the membership values in the fuzzy partition matrix $U^j$ while learning both the norm-inducing matrix $M_n^j$ and the center $v_n^j$ of every $n$th ellipsoid. Note that the GKA learns all $\alpha_j$ ellipsoids in parallel. Denote the matrices containing the ellipsoid centers and the norm-inducing matrices as $V^j$ and $M^j$, respectively, that is, $V^j = [v_1^j, v_2^j, \dots, v_{\alpha_j}^j]$ and $M^j = [M_1^j, M_2^j, \dots, M_{\alpha_j}^j]$. All elements in the triple $(U^j, V^j, M^j)$ are learned by iteratively minimizing the distance in (2) weighted with the membership values in $U^j$, subject to the constraints in (5) and (6). Let $\omega_n^j$, $n = 1 \dots \alpha_j$, and $\varpi_k^j$, $k = 1 \dots N_j$, be the Lagrange multipliers of the constraints in (5) and (6), respectively. The triple $(U^j, V^j, M^j)$ is iteratively learned as follows:
$(U^j, V^j, M^j) = \arg\min \left( \sum_{n=1}^{\alpha_j} \sum_{k=1}^{N_j} (\mu_{nk}^j)^b (\lambda_{nk}^j)^2 + \sum_{n=1}^{\alpha_j} \omega_n^j \left( \det(M_n^j) - 1/\xi_n^j \right) + \sum_{k=1}^{N_j} \varpi_k^j \left( \sum_{n=1}^{\alpha_j} \mu_{nk}^j - 1 \right) \right)$, (7)
where $b$ is an adjustable weighting index. The optimization described in (7) is realized by differentiating (7) with respect to $\mu_{nk}^j$, $v_n^j$, $\omega_n^j$, and $\varpi_k^j$, and by equating the result to 0.
The parameters are obtained as follows:
$\mu_{nk}^j = \dfrac{1}{\sum_{i=1}^{\alpha_j} \left( \lambda_{nk}^j / \lambda_{ik}^j \right)^{2/(b-1)}}, \quad n = 1 \dots \alpha_j, \; k = 1 \dots N_j$; (8)
$v_n^j = \dfrac{\sum_{k=1}^{N_j} (\mu_{nk}^j)^b x_k}{\sum_{k=1}^{N_j} (\mu_{nk}^j)^b}, \quad n = 1 \dots \alpha_j$; (9)
$F_n^j = \dfrac{\sum_{k=1}^{N_j} (\mu_{nk}^j)^b (x_k - v_n^j)(x_k - v_n^j)^T}{\sum_{k=1}^{N_j} (\mu_{nk}^j)^b}, \quad n = 1 \dots \alpha_j$; (10)
and
$M_n^j = \left( \dfrac{1}{\xi_n^j} \det(F_n^j) \right)^{1/d} (F_n^j)^{-1}, \quad n = 1 \dots \alpha_j$. (11)
The iteration in the GKA is stopped when no significant improvement is made in the fuzzy partition matrix $U^j$. Let $(U^j)^{(m)}$ be the fuzzy partition matrix learned in the $m$th iteration; the norm of the difference between $(U^j)^{(m)}$ and $(U^j)^{(m+1)}$ can be defined as
$\delta^j = \left\| (U^j)^{(m+1)} - (U^j)^{(m)} \right\| \equiv \max_{n,k} \left| (\mu_{nk}^j)^{(m+1)} - (\mu_{nk}^j)^{(m)} \right|$. (12)
The GKA iteratively learns $U^j$, $V^j$, and $M^j$ until $\delta^j < \varepsilon^j$, where $\varepsilon^j$ is a small constant. The flowchart of the GKA is illustrated in Figure 1.
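For concreteness, the following condensed sketch (Python/NumPy; assumptions: the fuzziness index b and the ellipsoid sizes are supplied by the caller, and a small constant guards against zero distances) implements the update cycle of Equations (8)–(12).

```python
import numpy as np

def gka(X, sizes, b=2.0, eps=1e-4, max_iter=100, seed=None):
    """Gustafson-Kessel clustering with fixed ellipsoid sizes (Equations (8)-(12)).
    Returns the fuzzy partition U, the centers V, and the norm-inducing matrices M."""
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    c = len(sizes)
    U = rng.random((c, n_samples))
    U /= U.sum(axis=0)                                   # memberships sum to 1, Eq. (6)
    V, M = None, None
    for _ in range(max_iter):
        W = U ** b
        V = (W @ X) / W.sum(axis=1, keepdims=True)       # centers, Eq. (9)
        M = []
        dist = np.empty((c, n_samples))
        for n in range(c):
            diff = X - V[n]
            F = (W[n, :, None] * diff).T @ diff / W[n].sum()          # fuzzy covariance, Eq. (10)
            M_n = (np.linalg.det(F) / sizes[n]) ** (1.0 / d) * np.linalg.inv(F)  # Eq. (11)
            M.append(M_n)
            dist[n] = np.sqrt(np.maximum(np.einsum('ij,jk,ik->i', diff, M_n, diff), 1e-12))
        # membership update, Eq. (8)
        U_new = 1.0 / ((dist[:, None, :] / dist[None, :, :]) ** (2.0 / (b - 1))).sum(axis=1)
        if np.max(np.abs(U_new - U)) < eps:              # stopping rule, Eq. (12)
            U = U_new
            break
        U = U_new
    return U, V, M
```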

3. Multi-Objective Optimization in EMDO

As depicted in Figure 1 and described in Section 2, the GKA optimizes the centers and norm-inducing matrices of multiple ellipsoids in parallel with a preset size for every ellipsoid. If the ellipsoid sizes are set inappropriately, the ellipsoids learned by the GKA cannot accurately include all minority class samples. According to (4), each $j$th-class decision region $\mathcal{R}^j$ is approximated by the union of $\alpha_j$ ellipsoids $\Phi_n^j$ of size $\xi_n^j$, $n = 1 \dots \alpha_j$. Consider the sets $\Phi^j = \{\Phi_1^j, \Phi_2^j, \dots, \Phi_{\alpha_j}^j\}$ and $\Xi^j = \{\xi_1^j, \xi_2^j, \dots, \xi_{\alpha_j}^j\}$. The distance between the $k$th sample $x_k$ and $\mathcal{R}^j$, denoted $L(x_k, \mathcal{R}^j)$, can be defined as the minimum distance between $x_k$ and the center of each ellipsoid, as follows:
$L(x_k, \mathcal{R}^j) \equiv L(x_k, \Phi^j) = \min_{n = 1 \dots \alpha_j} \lambda_{nk}^j$, (13)
where $\lambda_{nk}^j$ is defined in (2). The sample $x_k$ is included in $\mathcal{R}^j$ if $L(x_k, \mathcal{R}^j) \le 1$ and the corresponding class $y_k = j$. Denote a binary function $H(\cdot)$ as follows:
$H(o) = \begin{cases} 1, & \text{if the logic statement } o \text{ is true}; \\ 0, & \text{if the logic statement } o \text{ is false}. \end{cases}$ (14)
The total number of $j$th-class samples included in the set $\Phi^j$ can be calculated as
$O_{included}^j(\Phi^j) = \sum_{k=1}^{N} H\!\left( L(x_k, \Phi^j) \le 1 \text{ and } y_k = j \right)$. (15)
It is possible that several samples that do not belong to the $j$th class are also included in the set of ellipsoids $\Phi^j$. The total number of samples not belonging to the $j$th class but included in $\Phi^j$ can be calculated as
$O_{included \backslash j}(\Phi^j) = \sum_{k=1}^{N} H\!\left( L(x_k, \Phi^j) \le 1 \text{ and } y_k \ne j \right)$, (16)
where $O_{included \backslash j}(\cdot) \le (N - N_j)$.
Referring to (5), the total size of the ellipsoids contained in $\Phi^j$ can be calculated as
$F_1(\Xi^j) = \sum_{n=1}^{\alpha_j} \xi_n^j$. (17)
The proposed EMDO not only minimizes the total ellipsoid size but also aims to simultaneously maximize the number of $j$th class samples included in the set of ellipsoids $\Phi^j$ and minimize the number of samples included in $\Phi^j$ that do not belong to the $j$th class. The misclassification error can be defined as the sum of the number of $j$th class samples not included in $\Phi^j$, calculated as $N_j - O_{included}^j(\Phi^j \mid \Xi^j)$, and the number of samples that are included in $\Phi^j$ but do not belong to the $j$th class, calculated as $O_{included \backslash j}(\Phi^j \mid \Xi^j)$. Therefore, the misclassification error can be defined as
$F_2(\Phi^j \mid \Xi^j) = N_j - O_{included}^j(\Phi^j \mid \Xi^j) + O_{included \backslash j}(\Phi^j \mid \Xi^j)$. (18)
The misclassification error can be utilized as an objective function to optimize the ellipsoid sizes. The set of ellipsoid sizes $\Xi^j = \{\xi_1^j, \xi_2^j, \dots, \xi_{\alpha_j}^j\}$ can be optimized using a multi-objective optimization scheme that minimizes both (17) and (18).
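The two objectives can be evaluated directly from the learned ellipsoids. The sketch below (Python/NumPy; function and variable names are illustrative) computes $F_1$ and $F_2$ for one candidate set of ellipsoids.

```python
import numpy as np

def objectives(X, y, target_class, centers, norm_matrices, sizes):
    """Objectives of Equations (17) and (18): total ellipsoid size F1 and
    misclassification error F2 for one candidate set of ellipsoids."""
    # distance of every sample to the nearest ellipsoid center, Eq. (13)
    dist = np.array([min(np.sqrt((x - v) @ M @ (x - v))
                         for v, M in zip(centers, norm_matrices)) for x in X])
    inside = dist <= 1.0
    n_target = np.sum(y == target_class)
    included_own = np.sum(inside & (y == target_class))      # Eq. (15)
    included_other = np.sum(inside & (y != target_class))    # Eq. (16)
    f1 = float(np.sum(sizes))                                 # Eq. (17)
    f2 = int(n_target - included_own + included_other)        # Eq. (18)
    return f1, f2
```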
MOPSO is utilized to perform this multi-objective optimization by searching for the best set of ellipsoid sizes $\Xi^j$ that minimizes the objective functions $F_1(\cdot)$ and $F_2(\cdot)$. Assume that $G$ particles are utilized in the MOPSO. Denote $\Xi^j(k, g)$ as the $g$th particle in the $k$th iteration and $\bar{\Xi}^j(g)$ as the non-dominated solution subject to the following multi-objective optimization. Note that the multi-objective optimization searches for the non-dominated solution of every particle:
$\bar{\Xi}^j(g) = \mathop{\mathrm{Arg\,min}}\limits_{\Xi^j(k,g), \; x_k \in S, \; g = 1 \dots G} \left( F_1(\Xi^j(k,g)), \; F_2(\Phi^j \mid \Xi^j(k,g)) \right)$. (19)
$\Xi^j(k,g)$ is defined to be a non-dominated solution, as given by (19), if it is not dominated by any other particle, that is, if both $F_1(\Xi^j(k,g)) \le F_1(\Xi^j(k,i))$ and $F_2(\Phi^j \mid \Xi^j(k,g)) \le F_2(\Phi^j \mid \Xi^j(k,i))$ for $i = 1 \dots G$, $i \ne g$. Let
$a_i(\Xi^j(k,g)) = H\!\left( F_1(\Xi^j(k,g)) > F_1(\Xi^j(k,i)) \text{ or } F_2(\Phi^j \mid \Xi^j(k,g)) > F_2(\Phi^j \mid \Xi^j(k,i)) \right)$, (20)
$A(\Xi^j(k,g)) = \sum_{i=1, i \ne g}^{G} a_i(\Xi^j(k,g))$. (21)
The non-dominated solution obtained using the $g$th particle, denoted as $\bar{\Xi}^j(g)$, is updated as $\Xi^j(k,g)$ if $A(\Xi^j(k,g)) = 0$; however, $\bar{\Xi}^j(g)$ remains unchanged if $A(\Xi^j(k,g)) > 0$:
$\bar{\Xi}^j(g) = \begin{cases} \Xi^j(k,g), & \text{if } A(\Xi^j(k,g)) = 0; \\ \bar{\Xi}^j(g), & \text{otherwise}. \end{cases}$ (22)
Only if $\Xi^j(k,g)$ is a non-dominated solution of (19) will it be included in the repository $S_{rep}^j$, which is the set of all non-dominated solutions of (19). The best solution achieved by the $g$th particle, denoted $\Xi_{p\_best}^j(g)$, is updated using this non-dominated solution generated by the $g$th particle, that is, $\Xi_{p\_best}^j(g) = \bar{\Xi}^j(g)$. With reference to (22), if $\bar{\Xi}^j(g)$ is not generated in the current $k$th iteration, $\Xi_{p\_best}^j(g)$ remains unchanged.
It is possible that several of the non-dominated solutions already stored in $S_{rep}^j$ become dominated after a new non-dominated solution is added to $S_{rep}^j$. A filtering process similar to that in (20) and (21) is therefore executed over all of the non-dominated solutions in $S_{rep}^j$. Assume a total of $m_{rep}$ non-dominated solutions are in the repository, including the newly generated one. Each of the $m_{rep}$ non-dominated solutions is compared with all of the other solutions to evaluate whether it is dominated. Denote $\bar{\Xi}_m^j$ as the $m$th non-dominated solution. Let
$a_i(\bar{\Xi}_m^j) = H\!\left( F_1(\bar{\Xi}_m^j) > F_1(\bar{\Xi}_i^j) \text{ or } F_2(\Phi^j \mid \bar{\Xi}_m^j) > F_2(\Phi^j \mid \bar{\Xi}_i^j) \right)$, (23)
$A(\bar{\Xi}_m^j) = \sum_{i=1, i \ne m}^{m_{rep}} a_i(\bar{\Xi}_m^j)$. (24)
$\bar{\Xi}_m^j$ is no longer a non-dominated solution and is excluded from $S_{rep}^j$ if $A(\bar{\Xi}_m^j) > 0$.
An adaptive grid algorithm [53] is applied to $S_{rep}^j$ after the filtering process in (23) and (24) is completed, to place all the non-dominated solutions into several grids. The global best particle $\Xi_{g\_best}^j$ is randomly selected from among the grids in $S_{rep}^j$ by using the roulette wheel selection scheme. After both $\Xi_{p\_best}^j(g)$ and $\Xi_{g\_best}^j$ are determined, each particle is updated as follows:
$\tau(k, g) = \tau(k-1, g) + c_1 \gamma_1 \left( \Xi_{p\_best}^j(g) - \Xi^j(k-1, g) \right) + c_2 \gamma_2 \left( \Xi_{g\_best}^j - \Xi^j(k-1, g) \right)$, (25)
$\Xi^j(k, g) = \Xi^j(k-1, g) + \tau(k, g), \quad g = 1 \dots G$, (26)
where $c_1$ and $c_2$ are preset constants and $\gamma_1, \gamma_2 \in [0, 1]$ are randomly generated real numbers. If no new non-dominated solution can be successfully included in $S_{rep}^j$ after the filtering process for a preset number $K_{thr}$ of iterations, MOPSO is considered saturated, and the iterative learning of particles given by (19)–(26) is stopped.
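The repository maintenance step can be written compactly. The sketch below (Python; names are illustrative) applies the dominance test of Equations (20)–(24) as stated in the paper, under which a candidate is retained only if it is no worse in either objective than every other candidate in the repository.

```python
def dominated(fm, others):
    """Test of Equations (20)-(21)/(23)-(24): the pair fm = (F1, F2) is counted
    as dominated if any other pair beats it in either objective."""
    return any(fm[0] > fi[0] or fm[1] > fi[1] for fi in others)

def filter_repository(repository):
    """Keep only entries that survive the dominance test above.
    Each entry is (candidate_sizes, (F1, F2))."""
    kept = []
    for m, (cand, fm) in enumerate(repository):
        others = [f for i, (_, f) in enumerate(repository) if i != m]
        if not dominated(fm, others):
            kept.append((cand, fm))
    return kept
```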
The optimal solution of the multi-objective optimization problem in (19) is selected from $S_{rep}^j$ after the iterative learning process is stopped. Let $\bar{\Xi}_m^j = \{\bar{\xi}_{m1}^j, \bar{\xi}_{m2}^j, \dots, \bar{\xi}_{m\alpha_j}^j\}$ be the $m$th non-dominated solution in $S_{rep}^j$. The average density of the ellipsoids associated with $\bar{\Xi}_m^j$ is defined as
$d(\Phi^j \mid \bar{\Xi}_m^j) = \dfrac{O_{included}^j(\Phi^j \mid \bar{\Xi}_m^j)}{\sum_{n=1}^{\alpha_j} \bar{\xi}_{mn}^j}$. (27)
The optimal solution $(\Xi^j)^*$ can be selected based on the average density of the non-dominated solutions because the ellipsoids associated with the optimal solution tend to have a small size but contain a large number of samples. However, the average density in (27) cannot be directly utilized as the optimization index for selecting the optimal solution. It is therefore modified by multiplying it with the fraction of $j$th class samples included in the set $\Phi^j$ and with the fraction of samples not belonging to the $j$th class that are excluded from $\Phi^j$. Let the evaluation index for the $m$th non-dominated solution $\bar{\Xi}_m^j$ be $\sigma^j(\bar{\Xi}_m^j)$, which is defined based on (27) as follows:
$\sigma^j(\bar{\Xi}_m^j) = d(\Phi^j \mid \bar{\Xi}_m^j) \times \dfrac{O_{included}^j(\Phi^j \mid \bar{\Xi}_m^j)}{N_j} \times \dfrac{N - N_j - O_{included \backslash j}(\Phi^j \mid \bar{\Xi}_m^j)}{N - N_j}$, (28)
where $O_{included \backslash j}(\Phi^j \mid \bar{\Xi}_m^j)$ denotes the number of samples not belonging to the $j$th class but included in $\Phi^j$. The optimal solution $(\Xi^j)^*$ is then defined as the set of ellipsoid sizes that maximizes the index $\sigma^j(\cdot)$, that is,
$(\Xi^j)^* = \mathop{\mathrm{Arg\,max}}\limits_{\bar{\Xi}_m^j, \; m = 1 \dots m_{rep}} \sigma^j(\bar{\Xi}_m^j)$. (29)
After the optimal ellipsoid sizes are determined by MOPSO according to (19) and (29), the GKA is applied with the optimal ellipsoid sizes $(\Xi^j)^*$ to calculate the remaining optimal ellipsoid parameters, namely the norm-inducing matrices $(M^j)^*$ and the ellipsoid centers $(V^j)^*$. Note that the orientations of all the ellipsoids approximating the $j$th class decision region are determined by the norm-inducing matrices $(M^j)^*$. The proposed MOPSO integrated with the GKA is illustrated in Figure 2.
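The selection rule of Equations (27)–(29) reduces to a few lines of code. The sketch below (Python/NumPy; the candidate tuple layout is an assumption for illustration) evaluates the index and picks the best non-dominated solution.

```python
import numpy as np

def sigma_index(included_own, included_other, sizes, n_target, n_total):
    """Evaluation index of Equations (27)-(28) for one non-dominated solution."""
    density = included_own / np.sum(sizes)                        # Eq. (27)
    own_ratio = included_own / n_target
    other_ratio = (n_total - n_target - included_other) / (n_total - n_target)
    return density * own_ratio * other_ratio                      # Eq. (28)

def best_solution(candidates):
    """Pick the non-dominated solution maximizing sigma (Equation (29)).
    Each candidate is (sizes, included_own, included_other, n_target, n_total)."""
    return max(candidates,
               key=lambda c: sigma_index(c[1], c[2], c[0], c[3], c[4]))
```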

4. Determining Number of Ellipsoids

The ellipsoid parameters, namely the size, center, and orientation, are optimized using MOPSO integrated with the GKA, as described in Section 2 and Section 3. These parameters are calculated under the condition that the total number of ellipsoids $\alpha_j$ used to approximate the $j$th class decision region is assigned in advance, $j = 1 \dots p$. If $\alpha_j$ is too small, the $j$th class samples must be covered by an insufficient number of ellipsoids; several ellipsoids therefore become large, and samples that do not belong to the $j$th class may be included in these large ellipsoids. A binary-class data set with 1000 samples in each class is illustrated in Figure 3; the samples in each class are randomly generated within the range [0, 3] on the X and Y axes. The problem caused by a small $\alpha_j$ is illustrated in Figure 3a. Conversely, if $\alpha_j$ is too large, the $j$th class samples are spread over too many ellipsoids; several ellipsoids overlap with one another, and the ellipsoids are learned inefficiently, as illustrated in Figure 3b. If a suitable $\alpha_j$ is assigned, as illustrated in Figure 3c, the learning result is a set of ellipsoids with appropriate sizes, centers, and orientations.
To determine a suitable number of ellipsoids, the ellipsoid parameters are optimized using MOPSO integrated with the GKA while $\alpha_j$ is varied from 1 to an appropriate number $q$. Denote $(\Xi^j)^*_{\alpha_j = i}$ as the optimal ellipsoid sizes calculated using MOPSO, according to (29), with $\alpha_j$ set to $i$ ellipsoids, $i = 1 \dots q$. The index $\sigma^j((\Xi^j)^*_{\alpha_j = i})$ in (28) is utilized to evaluate the effectiveness and efficiency of different numbers of ellipsoids $\alpha_j$. The suitable number of ellipsoids is the value corresponding to the corner of the curve of $\sigma^j((\Xi^j)^*_{\alpha_j = i})$ with respect to $\alpha_j$. Figure 4 shows a typical curve of $\sigma^j((\Xi^j)^*_{\alpha_j = i})$ with respect to $\alpha_j$; in this figure, $\alpha_j = 6$ is a suitable choice because the corner of the curve appears at $\alpha_j = 6$.
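One plausible way to automate the visual corner selection of Figure 4 is to sweep the candidate numbers of ellipsoids and stop where the marginal gain in the sigma index collapses. The sketch below (Python; the 10% threshold and the knee rule are assumptions made for illustration, not part of the paper) returns such a corner point.

```python
def choose_num_ellipsoids(sigma_by_alpha):
    """Given sigma values for alpha = 1..q (e.g. from repeated MOPSO+GKA runs),
    return the alpha at the curve corner, taken here as the last alpha whose
    gain over the previous alpha is still a sizable fraction of the largest gain."""
    if len(sigma_by_alpha) < 2:
        return 1
    gains = [sigma_by_alpha[i + 1] - sigma_by_alpha[i]
             for i in range(len(sigma_by_alpha) - 1)]
    threshold = 0.1 * max(gains)        # heuristic cutoff (assumption)
    corner = 1
    for alpha, gain in enumerate(gains, start=2):
        if gain >= threshold:
            corner = alpha
    return corner
```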

5. Generating Synthetic Samples

Any set of $j$th class samples with $(|S_j| / N_{\max}) < 1/IR$, $j = 1 \dots p$, is considered a minority set. With reference to (1), $\tilde{N}_j$ synthetic samples are to be generated and added to the minority set. Recall that the minority set of the $j$th class samples is approximated by $\alpha_j$ ellipsoids. The $\tilde{N}_j$ synthetic samples must be distributed over the $\alpha_j$ ellipsoids in proportions based on the density of every $n$th ellipsoid, $n = 1 \dots \alpha_j$. The number of $j$th class samples in the $n$th ellipsoid $\Phi_n^j$ is defined in a manner similar to (15) as
$O_{included}^j(\Phi_n^j) = \sum_{k=1}^{N} H\!\left( \lambda_{nk}^j \le 1 \text{ and } y_k = j \right)$. (30)
The number of samples included in the $n$th ellipsoid but not belonging to the $j$th class is
$O_{included \backslash j}(\Phi_n^j) = \sum_{k=1}^{N} H\!\left( \lambda_{nk}^j \le 1 \text{ and } y_k \ne j \right)$. (31)
The density of ellipsoid $\Phi_n^j$ is defined as
$d(\Phi_n^j) = \dfrac{O_{included}^j(\Phi_n^j)}{\xi_n^j}$. (32)
The weight of the ellipsoid $\Phi_n^j$ for sharing the generated synthetic samples is defined as the reciprocal of the density $d(\Phi_n^j)$ multiplied by the ratio of $O_{included}^j(\Phi_n^j)$ to the total number of samples in $\Phi_n^j$:
$\beta_n^j = \dfrac{1}{d(\Phi_n^j)} \times \dfrac{O_{included}^j(\Phi_n^j)}{O_{included}^j(\Phi_n^j) + O_{included \backslash j}(\Phi_n^j)}$. (33)
Denote the number of synthetic samples added to $\Phi_n^j$ as $\tilde{N}_n^j$, which is determined based on the weight $\beta_n^j$ given in (33):
$\tilde{N}_n^j = \tilde{N}_j \times \dfrac{\beta_n^j}{\sum_{i=1}^{\alpha_j} \beta_i^j}, \quad n = 1 \dots \alpha_j$. (34)
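The allocation of Equations (32)–(34) can be computed per ellipsoid as in the sketch below (Python/NumPy; the guard against empty ellipsoids is an assumption added for numerical safety).

```python
import numpy as np

def allocate_synthetic_samples(n_new, included_own, included_other, sizes):
    """Split the N~_j synthetic samples over the ellipsoids using the weights
    of Equations (32)-(34); the input arrays are indexed by ellipsoid."""
    included_own = np.asarray(included_own, dtype=float)
    included_other = np.asarray(included_other, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    density = np.maximum(included_own, 1e-12) / sizes                 # Eq. (32)
    total = np.maximum(included_own + included_other, 1.0)            # guard (assumption)
    purity = included_own / total
    beta = purity / density                                            # Eq. (33)
    return np.round(n_new * beta / beta.sum()).astype(int)             # Eq. (34)
```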
The scheme for generating synthetic samples in every ellipsoid $\Phi_n^j$ is designed to resolve the oversampling problem for the following two scenarios:
(A) $O_{included}^j(\Phi_n^j) \, / \, \left( O_{included}^j(\Phi_n^j) + O_{included \backslash j}(\Phi_n^j) \right) \ge 0.9$
In this case, more than 90% of the samples in $\Phi_n^j$ belong to the minority class, and the samples not belonging to the $j$th class can be considered noise. The generated synthetic $j$th class samples do not affect the classification accuracy if they are randomly placed in the ellipsoid $\Phi_n^j$. According to (2), $\Delta_n^j = (\lambda_{nk}^j)^2 = (x_k - v_n^j)^T M_n^j (x_k - v_n^j)$. Denote $z_{ni}^j$ as the $i$th eigenvector of $M_n^j$ corresponding to the $i$th eigenvalue $\psi_{ni}^j$, $i = 1 \dots d$. Let $Z_n^j = [z_{n1}^j, \dots, z_{nd}^j]$ and $\Psi_n^j = \mathrm{diag}(\psi_{n1}^j, \dots, \psi_{nd}^j)$. Because $M_n^j$ is a positive-definite matrix,
$\Delta_n^j = (\lambda_{nk}^j)^2 = (x_k - v_n^j)^T Z_n^j \Psi_n^j (Z_n^j)^T (x_k - v_n^j), \quad n = 1 \dots \alpha_j$. (35)
Let $p_{ni}^j = (z_{ni}^j)^T (x_k - v_n^j)$, $i = 1 \dots d$; then, (35) can be rewritten as
$\Delta_n^j = \sum_{i=1}^{d} (p_{ni}^j)^2 \psi_{ni}^j, \quad n = 1 \dots \alpha_j$. (36)
Note that the sample $x_k$ is considered to be included in the ellipsoid if $\Delta_n^j \le 1$. The ellipsoid has the center $v_n^j$, and all the eigenvectors $z_{ni}^j$, $i = 1 \dots d$, are orthogonal to one another. According to (36), every $i$th orthogonal eigenvector intersects the ellipsoid boundary, where $\Delta_n^j = 1$, at $1/\sqrt{\psi_{ni}^j}$ and $-1/\sqrt{\psi_{ni}^j}$.
The synthetic samples $\hat{x}_k^j$ are randomly generated in the ellipsoid $\Phi_n^j$. Each generated synthetic sample $\hat{x}_k^j$ is expressed as a linear combination of the eigenvectors because all eigenvectors $z_{ni}^j$ are orthogonal axes of the ellipsoid, that is,
$\hat{x}_k^j = \sum_{i=1}^{d} z_{ni}^j b_{ni}^j$, (37)
where $b_{ni}^j$ is the projection of the vector $(\hat{x}_k^j - v_n^j)$ onto the eigenvector $z_{ni}^j$. To ensure random generation of the synthetic samples inside $\Phi_n^j$, $b_{ni}^j$ is set to be a random number within the range
$-1/\sqrt{\psi_{ni}^j} \le b_{ni}^j \le 1/\sqrt{\psi_{ni}^j}$, (38)
because each of the eigenvectors intersects the ellipsoid boundary at $1/\sqrt{\psi_{ni}^j}$ and $-1/\sqrt{\psi_{ni}^j}$. However, randomly generating synthetic samples as expressed in (37) and (38) does not guarantee that the generated synthetic sample $\hat{x}_k^j$ is always in the ellipsoid $\Phi_n^j$. For every randomly generated $\hat{x}_k^j$, the Mahalanobis distance is calculated according to (2) as
$\hat{\lambda}_{nk}^j = \left( (\hat{x}_k^j - v_n^j)^T M_n^j (\hat{x}_k^j - v_n^j) \right)^{1/2}$. (39)
The generated $\hat{x}_k^j$ is in $\Phi_n^j$ if $\hat{\lambda}_{nk}^j \le 1$, and no additional processing is required. The generated $\hat{x}_k^j$ is outside $\Phi_n^j$ if $\hat{\lambda}_{nk}^j > 1$, in which case further processing is required: multiplying $\hat{x}_k^j$ by a random number $\kappa$, where $0 < \kappa \le (1/\hat{\lambda}_{nk}^j)$, yields a product $\kappa \hat{x}_k^j$ that lies in $\Phi_n^j$. Denote the finally determined synthetic sample as $\tilde{x}_k^j$:
$\tilde{x}_k^j = \begin{cases} \hat{x}_k^j, & \text{if } \hat{\lambda}_{nk}^j \le 1; \\ \kappa_1 \hat{x}_k^j, & \text{if } \hat{\lambda}_{nk}^j > 1, \end{cases}$ (40)
where $\kappa_1$ is a random number and $\kappa_1 \in (0, 1/\hat{\lambda}_{nk}^j]$.
(B) $O_{included}^j(\Phi_n^j) \, / \, \left( O_{included}^j(\Phi_n^j) + O_{included \backslash j}(\Phi_n^j) \right) < 0.9$
In this case, a larger proportion of the samples in $\Phi_n^j$ do not belong to the $j$th class. Randomly placing synthetic samples in $\Phi_n^j$, as in the previous case, cannot effectively improve the classification accuracy. Borderline SMOTE [30] is therefore modified to generate synthetic samples in this case. The samples located at the borderline between the clusters belonging and not belonging to the $j$th class must first be identified, as in Borderline SMOTE. For every $j$th class sample $x_k^j \in \Phi_n^j$, define the set $S_k^j$ containing all of its $m$-nearest neighbors, where the $m$-nearest neighbors of $x_k^j$ are the samples with the $m$ shortest Mahalanobis distances from $x_k^j$. Note that the Mahalanobis distance is calculated using the same norm-inducing matrix $M_n^j$ as for $\Phi_n^j$, that is,
$\tilde{\lambda}_{ki}^j = \left( (x_k^j - x_i)^T M_n^j (x_k^j - x_i) \right)^{1/2}, \quad x_i \in \Phi_n^j, \; x_i \ne x_k^j$. (41)
A sample whose $m$-nearest neighbors all belong to the $j$th class is not on the borderline. Conversely, a sample $x_k^j \in \Phi_n^j$ is a borderline sample if at least one sample in its set of $m$-nearest neighbors $S_k^j$ does not belong to the $j$th class.
After the borderline samples are identified, the synthetic samples are generated through random interpolation between the borderline sample $x_k^j$ and any other sample $x_l^j \in S_k^j$; that is, the synthetic sample is generated as follows:
$\tilde{x}_k^j = x_k^j + \kappa_2 (x_l^j - x_k^j), \quad x_l^j \in S_k^j$, (42)
where $\kappa_2$ is a random number and $\kappa_2 \in [0, 1]$.
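To make scenario (A) concrete, the sketch below (Python/NumPy; names are illustrative) draws one synthetic sample inside an ellipsoid using the eigenvector construction of Equations (35)–(40). Note that the eigenvector combination of (37) is interpreted here explicitly as an offset that is added to the ellipsoid center.

```python
import numpy as np

def sample_in_ellipsoid(center, M, seed=None):
    """Draw one synthetic sample inside the ellipsoid (x - v)^T M (x - v) <= 1,
    following Equations (35)-(40)."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(M)        # M = Z diag(psi) Z^T, Eq. (35)
    half_axes = 1.0 / np.sqrt(eigvals)          # axis intersections with the boundary
    b = rng.uniform(-half_axes, half_axes)      # random coordinates, Eq. (38)
    offset = eigvecs @ b                        # Eq. (37), taken as offset from the center
    lam = np.sqrt(offset @ M @ offset)          # Mahalanobis distance, Eq. (39)
    if lam > 1.0:                               # shrink back inside, Eq. (40)
        offset *= rng.uniform(0.0, 1.0 / lam)
    return center + offset
```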

6. Simulation

The proposed EMDO was evaluated against other multi-class imbalanced data learning algorithms on different numerical data sets. C4.5 is commonly used as the classifier for verifying oversampling results; for instance, the oversampling approaches in [30,32,39] and [40,41,42,43] all used C4.5. This is mainly because the classification results of C4.5 do not change as long as the parameter setting and the data set are fixed; no randomness exists in its classification results. Note that the number of ellipsoids utilized for the minority class is determined using the scheme proposed in Section 4 and is listed in the rightmost column of Table 1 and Table 6. The five nearest neighbors are considered for synthetic sample generation in case (B) of Section 5.
The simulations in this study were conducted using five-fold cross-validation with 10 independent runs. Every data set was tested using different oversampling schemes and compared with the proposed EMDO. The minority class with the minimum size was selected to validate the effectiveness and efficiency of the proposed EMDO.
Several evaluation metrics were used to evaluate the effectiveness and efficiency of the proposed EMDO. The classification accuracy for the $j$th class is defined as follows:
$P_j = \dfrac{TP_j}{TP_j + FP_j}$, (43)
where $TP_j$ is the number of true-positive classified samples, that is, the samples that are correctly classified as belonging to the $j$th class, and $FP_j$ is the number of false-positive classified samples, that is, the samples that are incorrectly classified as belonging to the $j$th class. The metric $P_{avg}$ is defined as the average classification accuracy over all $p$ classes, that is,
$P_{avg} = \dfrac{1}{p} \sum_{j=1}^{p} P_j$. (44)
The metric $P_{\min}$ refers to the classification accuracy defined in (43) for the minority class with the minimum size. The area under the curve (AUC) [54,55], which is widely used [56,57], measures the capability to separate any pair of classes. Denote $A_{m,n}$ as the AUC between class $m$ and class $n$. The metric AUCm measures the capability of EMDO to separate the minority class with the minimum size from the other classes and is defined as
$\mathrm{AUCm} = \dfrac{1}{p-1} \sum_{n \ne n'} \dfrac{A_{n,n'} + A_{n',n}}{2}$, (45)
where $n'$ denotes the minority class with the minimum size. In addition to AUCm, the average of the AUC over all pairs of classes for a multi-class problem, denoted MAUC, is defined as
$\mathrm{MAUC} = \dfrac{2}{p(p-1)} \sum_{m < n} \dfrac{A_{m,n} + A_{n,m}}{2}$. (46)
To evaluate the imbalance condition of every data set, the maximum imbalance ratio $IR_{\max}$ is defined as
$IR_{\max} = \dfrac{N_{\max}}{N_{\min}}$, (47)
where $N_{\min} = \min_{j=1 \dots p}(N_j)$ and $N_{\max} = \max_{j=1 \dots p}(N_j)$.
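The per-class accuracy metrics can be computed directly from the predictions, as in the sketch below (Python/NumPy; names are illustrative, and the AUC-based metrics of Equations (45) and (46) are not reproduced here).

```python
import numpy as np

def class_metrics(y_true, y_pred, minority_class):
    """Per-class accuracy P_j of Equation (43), their average P_avg of
    Equation (44), and P_min for the smallest minority class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    p = {}
    for j in np.unique(y_true):
        tp = np.sum((y_pred == j) & (y_true == j))
        fp = np.sum((y_pred == j) & (y_true != j))
        p[j] = tp / (tp + fp) if (tp + fp) else 0.0
    p_avg = float(np.mean(list(p.values())))
    return p, p_avg, p[minority_class]
```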
Example 1:
The data sets used in this simulation are the same as those used in [43] for comparing the performance of EMDO against AMDO and other learning algorithms. The data sets used in [43] were mainly from data repositories such as the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [58] and the UCI (University of California, Irvine) Machine Learning Repository [59]. Table 1 describes these data sets. Performance comparisons based on different indices are presented in Table 2, Table 3, Table 4 and Table 5, in which SSMOTE refers to Static-SMOTE [60], GCS to RESCALE [61], ABNC to AdaBoost.NC [62], and OSMOTE to OVOSMOTE [63]. MDO from [42] and MDO+ and AMDO from [43] are also compared with the proposed EMDO in Table 2, Table 3, Table 4 and Table 5. The Baseline algorithm is the classifier C4.5 without any oversampling technique.
To compare the performance of EMDO with those of the other schemes, the rank average of every scheme was calculated. All oversampling schemes were tested on each of the data sets listed in Table 1, and the algorithms were ranked on each metric: the algorithm with the best performance is ranked first, the algorithm with the second-best performance is ranked second, and so on. The average rank of every algorithm is then calculated. Schemes with the same metric value share ranks; for instance, if two schemes tie for second place, they share the second and third ranks and are both ranked 2.5. The means and standard deviations of $P_{\min}$ and $P_{avg}$ are listed in Table 2 and Table 3, respectively, for every oversampling scheme, including the proposed EMDO, applied to the different data sets. According to Table 2 and Table 3, EMDO outperforms or matches all of the other schemes on every data set and has the lowest average rank. The results in Table 2 and Table 3 imply that oversampling with EMDO significantly improves the classification accuracy for the smallest minority class, and that the synthetic samples generated for the minority class also improve the overall average classification accuracy.
The means and standard deviations of AUCm defined in (45) and MAUC defined in (46) are compared for all schemes in Table 4 and Table 5, respectively. As indicated in Table 4, EMDO outperforms the other schemes in separating the smallest minority class from the other classes for every listed data set. Moreover, according to Table 5, the capability of EMDO to separate all pairs of classes in the multi-class problem is superior to that of the other schemes.
Example 2:
The performance of EMDO is evaluated on sensor data in this example. Two data sets, Statlog (Shuttle) from UCI and Mafalda [64] from GitHub, are utilized. The Statlog (Shuttle) data set contains sensor data recorded from NASA's space shuttle, whereas the Mafalda data set contains sensor data recorded from cars of different brands. The characteristics of these two data sets are shown in Table 6, which indicates that both data sets are imbalanced, with maximum imbalance ratios IRmax of 5684.7 and 5.94, respectively. The four indices $P_{\min}$, $P_{avg}$, AUCm, and MAUC are calculated and compared in Table 7 for both data sets with the classifier C4.5. The classification results obtained with the EMDO oversampling scheme are greatly improved compared with the results without EMDO. According to the four evaluation indices listed in Table 7, EMDO helps improve the classification results for both imbalanced data sets.

7. Conclusions

EMDO was demonstrated to outperform competing oversampling approaches in simulations. EMDO performs well because it approximates the decision region of the target minority class with reasonable accuracy by using a set of ellipsoids. In problems involving multi-class imbalanced data, EMDO performs particularly well when the decision region of the minority class is disconnected in the feature space. EMDO learns the sizes, centers, and orientations of the ellipsoids that approximate the minority class decision region from the underlying distribution of minority class samples. The IoT is a key emerging technology, and imbalanced data will become an increasingly common problem as the number of IoT sensors increases; the proposed EMDO is suitable for such multi-class imbalanced data classification problems. Future work includes applying EMDO to the imbalanced data encountered in real-world IoT sensing applications. Moreover, although EMDO is a data-level learning approach, it can easily be integrated with cost-sensitive methods to increase the effectiveness and efficiency of learning; such integrations are another direction for future research.

Author Contributions

Conceptualization, L.Y.; methodology, L.Y.; software, T.-B.L.; validation, T.-B.L.; writing—original draft preparation, T.-B.L.; writing—review and editing, L.Y.; supervision, L.Y.; project administration, L.Y.; funding acquisition, L.Y. Both authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ministry of Science and Technology, Taiwan, grant number MOST 110-2221-E-027-054-MY3.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, Y.; Zhong, X.; Ma, Z.; Liu, H. The outlier and integrity detection of rail profile based on profile registration. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1074–1085. [Google Scholar] [CrossRef]
  2. Kang, S.; Sristi, S.; Karachiwala, J.; Hu, Y.-C. Detection of anomaly in train speed for intelligent railway systems. In Proceedings of the 2018 International Conference on Control, Automation and Diagnosis (ICCAD), Marrakech, Morocco, 19–21 March 2018; pp. 1–6. [Google Scholar]
  3. Wang, H. Unsupervised anomaly detection in railway catenary condition monitoring using auto-encoders. In Proceedings of the IECON 2020 the 46th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 18–21 October 2020; pp. 2636–2641. [Google Scholar]
  4. Qian, G.; Lu, S.; Pan, D.; Tang, H.; Liu, Y.; Wang, Q. Edge computing: A promising framework for real-time fault diagnosis and dynamic control of rotating machines using multi-sensor data. IEEE Sensors J. 2019, 19, 4211–4220. [Google Scholar] [CrossRef]
  5. Maruthi, G.S.; Hegde, V. Application of MEMS accelerometer for detection and diagnosis of multiple faults in roller element bearings of three phase induction motor. IEEE Sensors J. 2016, 16, 145–152. [Google Scholar] [CrossRef]
  6. Tong, Z.Y.; Dong, Z.Y.; Li, M. A new entropy bi-cepstrum based method for DC motor brush abnormality recognition. IEEE Sensors J. 2017, 17, 745–754. [Google Scholar] [CrossRef]
  7. Kim, E.; Cho, S.; Lee, B.; Cho, M. Fault detection and diagnosis using self-attentive convolutional neural networks for variable-length sensor data in semiconductor manufacturing. IEEE Trans. Semicond. Manuf. 2019, 32, 302–309. [Google Scholar] [CrossRef]
  8. Azamfar, M.; Li, X.; Lee, J. Deep learning-based domain adaptation method for fault diagnosis in semiconductor manufacturing. IEEE Trans. Semicond. Manuf. 2020, 33, 445–453. [Google Scholar] [CrossRef]
  9. Ghosh, A.; Qin, S.; Lee, J.; Wang, G. FBMTP: An automated fault and behavioral anomaly detection and isolation tool for PLC-controlled manufacturing systems. IEEE Trans. Syst. Man Cyber. Syst. 2017, 47, 3397–3417. [Google Scholar] [CrossRef]
  10. Quang, X.; Huo, H.; Xia, L.; Shan, F.; Liu, J.; Mo, Z.; Yan, F.; Ding, Z.; Yang, Q.; Song, B.; et al. Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia. IEEE Trans. Med. Imaging 2020, 39, 2595–2605. [Google Scholar]
  11. Liu, N.; Li, E.; Qi, M.; Xu, L.; Gao, B. A novel ensemble learning paradigm for medical diagnosis with imbalanced data. IEEE Access 2020, 8, 171263–171280. [Google Scholar] [CrossRef]
  12. Huda, S.; Yearwood, J.; Jelinek, H.F.; Hassan, M.M.; Fortino, G.; Buckland, M. A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for brain tumor diagnosis. IEEE Access 2016, 4, 9145–9154. [Google Scholar] [CrossRef]
  13. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  14. Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  15. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef] [Green Version]
  16. Guo, H.-X.; Li, Y.-J.; Shang, J.; Gu, M.-Y.; Huang, Y.-Y. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239. [Google Scholar]
  17. Wu, G.; Chang, E.Y. KBA: Kernel boundary alignment considering imbalanced data classification. IEEE Trans. Knowl. Data Eng. 2005, 17, 786–795. [Google Scholar] [CrossRef] [Green Version]
  18. Ohsaki, M.; Wang, P.; Matsuda, K.; Katagiri, S.; Watanabe, H.; Ralescu, A. Confusion-matrix-based kernel logistic regression for imbalanced data classification. IEEE Trans. Knowl. Data Eng. 2017, 29, 1806–1819. [Google Scholar] [CrossRef]
  19. Manevitz, L.M.; Yousef, M. One-class SVMs for document classification. J. Mach. Learn. Res. 2002, 2, 139–154. [Google Scholar]
  20. Raskutti, B.; Kowalczyk, A. Extreme rebalancing for SVMs: A case study. ACM SIGKDD Explor. Newsl. 2004, 6, 60–69. [Google Scholar] [CrossRef]
  21. Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3573–3587. [Google Scholar] [PubMed] [Green Version]
  22. Huang, C.; Li, Y.; Loy, C.C.; Tang, X. Learning deep representation for imbalanced classification. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5375–5384. [Google Scholar]
  23. Ng, W.W.Y.; Hu, J.; Yeung, D.S.; Yin, S.; Roli, F. Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans. Cybern. 2015, 45, 2402–2412. [Google Scholar] [CrossRef] [PubMed]
  24. Tang, Y.; Zhang, Y.Q.; Chawla, N.V.; Krasser, S. SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B 2009, 39, 281–288. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 2009, 39, 539–550. [Google Scholar]
  26. Kang, Q.; Shi, L.; Zhou, M.; Wang, X.; Wu, Q.; Wei, Z. A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4152–4165. [Google Scholar] [CrossRef]
  27. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 26, 321–357. [Google Scholar] [CrossRef]
  28. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  29. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the Knowledge Discovery in Databases: PKDD (Lecture Notes in Computer Science), Cavtat-Dubrovnik, Croatia, 22–26 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119. [Google Scholar]
  30. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  31. Xie, Z.; Jiang, L.; Ye, T.; Li, X. A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. International Conference on Database Systems for Advanced Applications, Taipei, Taiwan, 20–23 April 2015; pp. 3–18. [Google Scholar]
  32. Das, B.; Krishnan, N.C.; Cook, D.J. RACOG and wRACOG: Two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 2015, 27, 222–234. [Google Scholar] [CrossRef] [Green Version]
  33. Pérez-Ortiz, M.; Gutiérrez, P.A.; Hervás-Martínez, C.; Yao, X. Graph-based approaches for over-sampling in the context of ordinal regression. IEEE Trans. Knowl. Data Eng. 2015, 27, 1233–1245. [Google Scholar] [CrossRef]
  34. Schapire, R.E. The boosting approach to machine learning: An overview. In Nonlinear Estimation Classification; Springer: New York, NY, USA, 2003; pp. 149–171. [Google Scholar]
  35. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef] [Green Version]
  36. Polikar, R. Ensemble learning. In Ensemble Machine Learning; Springer: Berlin, Germany, 2012; pp. 1–34. [Google Scholar]
  37. Moniz, N.; Ribeiro, R.P.; Cerqueira, V.; Chawla, N. SMOTEBoost for regression: Improving the prediction of extreme values. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 150–159. [Google Scholar]
  38. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef] [Green Version]
  39. Guo, H.; Viktor, H.L. Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. ACM SIGKDD Explor. Newsl. 2004, 6, 30–39. [Google Scholar] [CrossRef]
  40. Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Trans. Syst. Man Cybern. Part A 2011, 41, 552–568. [Google Scholar] [CrossRef]
  41. Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for the class imbalanced problem: Bagging, boosting, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C 2012, 42, 473–484. [Google Scholar] [CrossRef]
  42. Abdi, L.; Hashemi, S. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans. Knowl. Data Eng. 2016, 28, 238–251. [Google Scholar] [CrossRef]
  43. Yang, X.; Kuang, Q.; Zhang, W.; Zhang, G. AMDO: An over-sampling technique for multi-class imbalanced problems. IEEE Trans. Knowl. Data Eng. 2018, 30, 1672–1685. [Google Scholar] [CrossRef]
  44. Gustafson, D.E.; Kessel, W.C. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of the 1978 IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes, San Diego, CA, USA, 10–12 January 1979; pp. 761–766. [Google Scholar]
  45. Bezdek, J. Pattern Recognition with Fuzzy Objective Function; Plenum Press: New York, NY, USA, 1981. [Google Scholar]
  46. Yao, L.; Weng, K.-S. Imputation of incomplete data using adaptive ellipsoids with linear regression. J. Intell. Fuzzy Syst. 2015, 29, 253–265. [Google Scholar] [CrossRef]
  47. Yao, L.; Weng, K.-S.; Wu, M.S. Evolutionary learning of classifiers for disc discrimination. IEEE/ASME Trans. Mechatron. 2015, 20, 3194–3203. [Google Scholar] [CrossRef]
  48. Reyes-Sierra, M.; Coello, C.A.C. Multi-objective particle swarm optimizers: A survey of the state-of-the art. Int. J. Comput. Intell. Res. 2006, 2, 287–308. [Google Scholar]
  49. Hu, W.; Yen, G.G. Adaptive multi-objective particle swarm optimization based on parallel cell coordinate system. IEEE Trans. Evo. Comp. 2015, 19, 1–18. [Google Scholar]
  50. Chen, V.C.P.; Ruppert, D.; Shoemaker, C.A. Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Oper. Res. 1999, 47, 38–53. [Google Scholar] [CrossRef]
  51. Liu, F.; Ju, X.; Wang, N.; Wang, L.; Lee, W.-J. Wind farm macro-siting optimization with insightful bi-criteria identification and relocation mechanism in genetic algorithm. Energy Convers. Manag. 2020, 217, 112964. [Google Scholar] [CrossRef]
  52. Ahmed, W.; Hanif, A.; Kallu, K.D.; Kouzani, A.Z.; Ali, M.U.; Zafar, A. Photovoltaic panels classification using isolated and transfer learned deep neural models using infrared thermographic images. Sensors 2021, 21, 5668. [Google Scholar] [CrossRef]
  53. Knowles, J.D.; Corne, D.W. Approximating the Nondominated Front Using the Pareto Archived Evolution Strategy. Evol. Comput. 2000, 8, 149–172. [Google Scholar] [CrossRef] [PubMed]
  54. Bradley, A.P. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recog. 1997, 30, 1145–1159. [Google Scholar] [CrossRef] [Green Version]
  55. Tang, K.; Wang, R.; Chen, T. Towards maximizing the area under the ROC curve for multi-class classification problems. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 7–11 August 2011; pp. 483–488. [Google Scholar]
  56. Ferri, C.; Hernandez-Navarro, J.; Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognit. Lett. 2009, 30, 27–38. [Google Scholar] [CrossRef]
  57. Loyola-Gonzalez, O.; Martınez-Trinidad, J.F.; Carrasco-Ochoa, J.A.; Garcıa-Borroto, M. Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing 2016, 175, 935–947. [Google Scholar] [CrossRef]
  58. Alcalá, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Logic Soft. Comput. 2010, 17, 255–287. [Google Scholar]
  59. Frank, A.; Asuncion, A. UCI machine learning repository. 2010. Available online: http://archive.ics.uci.edu/ml (accessed on 5 March 2020).
  60. Fernández-Navarro, F.; Hervás-Martínez, C.; Gutiérrez, P.A. A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recog. 2011, 44, 1821–1833. [Google Scholar] [CrossRef]
  61. Zhou, Z.-H.; Liu, X.-Y. On multi-class cost-sensitive learning. Comput. Intell. 2010, 26, 232–257. [Google Scholar] [CrossRef] [Green Version]
  62. Wang, S.; Yao, X. Multi-class imbalance problems: Analysis and potential solutions. IEEE Trans. Syst. Man Cybern. B 2012, 42, 1119–1130. [Google Scholar] [CrossRef] [PubMed]
  63. Fernández, A.; López, V.; Galar, M.; del Jesus, M.J.; Herrera, F. Analysing the classification of imbalacned data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl.-Based Syst. 2013, 42, 97–110. [Google Scholar] [CrossRef]
  64. Malfada. 13 December 2017. Available online: https://github.com/sisinflab-swot/mafalda (accessed on 3 September 2021).
Figure 1. Flow chart of GKA.
Figure 2. Flow chart of MOPSO.
Figure 3. Learning results with different numbers of ellipsoids α_j: (a) α_j = 3; (b) α_j = 9; (c) α_j = 6.
Figure 4. Typical curve of σ_j((Ξ_j)_{α_j = i}) versus α_j.
Table 1. Characteristics of the data sets for simulations in Example 1.

Data Set | Size | Attributes | Classes | Class Distribution | IRmax | No. of Ellipsoids
Balance | 625 | 4 | 3 | 288/49/288 | 5.88 | NA/4/NA
Hayes-Roth | 132 | 4 | 3 | 51/51/30 | 1.7 | NA/NA/3
New-Thyroid | 215 | 5 | 3 | 150/35/30 | 5 | NA/3/3
Page-Blocks | 5472 | 10 | 5 | 4913/329/28/87/115 | 175.46 | NA/9/4/6/9
Dermatology | 358 | 34 | 6 | 111/60/71/48/48/20 | 5.55 | NA/4/4/3/3/3
Breast-Tissue | 106 | 9 | 6 | 21/15/18/16/14/22 | 1.57 | NA/NA/NA/NA/4/NA
User-Knowledge-Modelling (UKM) | 403 | 5 | 4 | 50/102/129/122 | 2.58 | 4/NA/NA/NA
Vertebral-Column | 310 | 6 | 3 | 60/150/100 | 2.5 | 6/NA/NA
Ecoli | 327 | 7 | 5 | 143/77/52/35/20 | 7.15 | NA/6/4/3/3
NA: not available.
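The IRmax values in Table 1 are consistent with the maximum imbalance ratio, i.e., the size of the largest class divided by the size of the smallest class (for Balance, 288/49 ≈ 5.88). A minimal sketch under that assumption, using class distributions copied from the table:

```python
# Sketch only: IRmax assumed to be (largest class size) / (smallest class size).
# Class distributions are taken from Table 1.
class_distributions = {
    "Balance": [288, 49, 288],
    "New-Thyroid": [150, 35, 30],
    "Ecoli": [143, 77, 52, 35, 20],
}

for name, counts in class_distributions.items():
    ir_max = max(counts) / min(counts)
    print(f"{name}: IRmax = {ir_max:.2f}")
# Balance: 5.88, New-Thyroid: 5.00, Ecoli: 7.15 -- matching the IRmax column.
```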
Table 2. Comparison of P_min (%) for every oversampling scheme on different data sets.

Data Set | Baseline | SSMOTE | GCS | ABNC | OSMOTE | MDO | MDO+ | AMDO | EMDO
Balance | 0.00 ± 0.00 | 8.44 ± 9.18 | 6.44 ± 9.83 | 2.22 ± 4.97 | 12.44 ± 13.41 | 2.00 ± 4.47 | 0.00 ± 0.00 | 10.22 ± 0.50 | 20.41 ± 10.95
Hayes-Roth | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00
New-Thyroid | 83.33 ± 11.79 | 90.00 ± 14.91 | 93.33 ± 9.13 | 93.33 ± 9.13 | 90.00 ± 9.13 | 83.33 ± 16.67 | 96.67 ± 7.45 | 100.00 ± 0.00 | 100.00 ± 0.00
Page-Blocks | 82.67 ± 11.88 | 78.67 ± 6.91 | 93.33 ± 14.91 | 72.67 ± 24.99 | 93.33 ± 9.13 | 75.33 ± 16.26 | 93.33 ± 9.13 | 96.67 ± 7.45 | 96.67 ± 5.33
Dermatology | 95.00 ± 11.18 | 95.00 ± 11.18 | 90.00 ± 13.69 | 100.00 ± 0.00 | 90.00 ± 13.69 | 95.00 ± 11.18 | 95.00 ± 11.18 | 100.00 ± 0.00 | 100.00 ± 0.00
Breast-Tissue | 60.00 ± 27.89 | 40.00 ± 36.51 | 46.67 ± 29.81 | 60.00 ± 27.89 | 53.33 ± 29.81 | 60.00 ± 27.89 | 53.33 ± 29.81 | 60.00 ± 27.89 | 73.36 ± 13.32
UKM | 88.00 ± 13.04 | 92.00 ± 13.04 | 90.00 ± 10.00 | 94.00 ± 8.94 | 88.00 ± 16.43 | 86.00 ± 13.42 | 86.00 ± 13.42 | 94.00 ± 8.94 | 96.00 ± 5.76
Vertebral-Column | 65.00 ± 16.03 | 65.00 ± 19.00 | 60.00 ± 19.90 | 61.67 ± 18.26 | 66.67 ± 5.89 | 68.33 ± 22.36 | 66.67 ± 28.87 | 86.67 ± 9.50 | 90.67 ± 6.02
Ecoli | 65.00 ± 37.91 | 70.00 ± 27.39 | 55.00 ± 32.60 | 70.00 ± 27.39 | 55.00 ± 32.60 | 75.00 ± 30.62 | 90.00 ± 13.69 | 90.00 ± 13.69 | 90.00 ± 12.40
Average | 71.00 | 71.01 | 70.53 | 72.65 | 72.09 | 71.67 | 75.67 | 81.95 | 85.23
Rank Avg. | 6.33 | 5.89 | 6.39 | 5.11 | 5.78 | 5.89 | 5.28 | 2.56 | 1.78
The best result is in bold face.
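The Rank Avg. row appears to be the average rank of each scheme over the nine data sets, with rank 1 assigned to the best P_min on each data set and ties sharing the average rank (the nine rank averages sum to approximately 45, as expected for nine competitors). A small illustration of that ranking procedure, using hypothetical scores rather than the values from Table 2:

```python
import numpy as np
from scipy.stats import rankdata

# Rows = data sets, columns = competing schemes; higher score is better.
# Hypothetical scores for illustration only.
scores = np.array([
    [71.0, 75.0, 80.0],
    [60.0, 60.0, 90.0],
    [55.0, 65.0, 85.0],
])

# Rank within each data set (rank 1 = best); ties receive the average rank.
ranks = np.apply_along_axis(lambda row: rankdata(-row, method="average"), 1, scores)
rank_avg = ranks.mean(axis=0)   # average rank per scheme; lower is better
print(rank_avg)                 # approximately [2.83, 2.17, 1.00]
```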
Table 3. Comparison of P_avg (%) for every oversampling scheme on different data sets.

Data Set | Baseline | SSMOTE | GCS | ABNC | OSMOTE | MDO | MDO+ | AMDO | EMDO
Balance | 56.38 ± 1.75 | 58.38 ± 2.94 | 56.65 ± 4.03 | 62.33 ± 3.29 | 57.86 ± 2.78 | 57.28 ± 2.92 | 55.45 ± 1.66 | 60.37 ± 2.06 | 64.38 ± 3.59
Hayes-Roth | 84.91 ± 6.71 | 84.67 ± 5.06 | 85.33 ± 5.58 | 83.52 ± 7.08 | 84.97 ± 6.74 | 84.91 ± 6.71 | 84.97 ± 6.74 | 84.97 ± 6.74 | 85.61 ± 2.39
New-Thyroid | 88.86 ± 6.02 | 91.08 ± 3.02 | 93.14 ± 4.55 | 93.59 ± 1.45 | 92.54 ± 6.61 | 89.81 ± 6.84 | 94.98 ± 3.59 | 96.54 ± 2.99 | 96.60 ± 2.05
Page-Blocks | 84.30 ± 2.14 | 84.57 ± 1.88 | 88.77 ± 4.49 | 79.70 ± 5.57 | 89.61 ± 2.98 | 81.24 ± 1.79 | 86.13 ± 2.48 | 88.77 ± 1.92 | 90.15 ± 1.87
Dermatology | 95.67 ± 2.05 | 95.73 ± 1.83 | 93.50 ± 2.71 | 97.10 ± 0.75 | 95.31 ± 2.45 | 95.67 ± 2.05 | 96.06 ± 1.62 | 96.88 ± 0.26 | 97.13 ± 0.20
Breast-Tissue | 63.22 ± 3.74 | 60.89 ± 3.94 | 68.78 ± 5.77 | 66.00 ± 4.88 | 65.83 ± 7.05 | 63.22 ± 3.74 | 66.56 ± 2.17 | 63.22 ± 3.74 | 70.60 ± 6.28
UKM | 92.18 ± 2.02 | 92.57 ± 4.85 | 91.03 ± 2.00 | 94.49 ± 2.45 | 91.78 ± 2.32 | 91.45 ± 2.48 | 91.92 ± 2.50 | 94.23 ± 2.14 | 95.24 ± 1.37
Vertebral-Column | 76.44 ± 2.65 | 77.22 ± 4.66 | 75.67 ± 5.86 | 76.67 ± 3.12 | 77.00 ± 4.71 | 78.22 ± 5.55 | 76.56 ± 7.41 | 81.89 ± 2.38 | 85.77 ± 1.53
Ecoli | 74.64 ± 7.88 | 72.81 ± 12.01 | 72.73 ± 11.35 | 76.23 ± 7.14 | 73.12 ± 8.72 | 77.24 ± 7.03 | 82.30 ± 5.21 | 82.44 ± 5.08 | 85.31 ± 1.62
Average | 79.61 | 79.77 | 80.62 | 81.07 | 80.89 | 79.89 | 81.66 | 83.26 | 85.65
Rank Avg. | 7.00 | 6.11 | 6.17 | 4.78 | 5.44 | 6.33 | 4.89 | 3.28 | 1.00
The best result is in bold face.
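Tables 2 and 3 summarize per-class accuracies. A hedged sketch of how such figures can be obtained for a single test split, assuming P_min denotes the recall of the smallest class and P_avg the macro-average of the per-class recalls (these definitions are assumptions made here for illustration, not a statement of the exact experimental protocol):

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels for a 3-class problem; class 2 is the smallest class.
y_true = np.array([0] * 50 + [1] * 30 + [2] * 10)
y_pred = y_true.copy()
y_pred[45:50] = 1   # a few majority-class errors
y_pred[82:86] = 0   # some minority-class samples misclassified

per_class_recall = recall_score(y_true, y_pred, average=None)  # one recall per class
class_counts = np.bincount(y_true)

p_min = 100 * per_class_recall[np.argmin(class_counts)]  # recall of the smallest class
p_avg = 100 * per_class_recall.mean()                    # macro-averaged recall
print(f"P_min = {p_min:.2f}%, P_avg = {p_avg:.2f}%")     # 60.00% and 83.33% here
```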
Table 4. Comparison of AUCm (%) for every oversampling scheme on different data sets.

Data Set | Baseline | SSMOTE | GCS | ABNC | OSMOTE | MDO | MDO+ | AMDO | EMDO
Balance | 56.95 ± 1.91 | 58.12 ± 3.27 | 57.10 ± 3.43 | 60.52 ± 2.89 | 58.58 ± 3.18 | 57.40 ± 2.78 | 56.60 ± 0.88 | 60.61 ± 1.24 | 65.61 ± 4.30
Hayes-Roth | 94.34 ± 2.52 | 94.25 ± 1.90 | 94.50 ± 2.09 | 93.82 ± 2.65 | 94.36 ± 2.53 | 94.34 ± 2.52 | 94.36 ± 2.53 | 94.36 ± 2.53 | 94.67 ± 2.81
New-Thyroid | 91.40 ± 5.33 | 93.90 ± 4.45 | 95.43 ± 3.76 | 95.76 ± 2.37 | 94.04 ± 5.11 | 91.85 ± 6.74 | 97.04 ± 2.94 | 98.37 ± 1.41 | 98.55 ± 1.23
Page-Blocks | 90.75 ± 3.57 | 89.84 ± 2.16 | 94.81 ± 4.66 | 86.82 ± 7.87 | 94.96 ± 3.16 | 87.90 ± 4.66 | 93.97 ± 2.91 | 95.51 ± 2.07 | 96.07 ± 2.24
Dermatology | 97.32 ± 3.46 | 97.39 ± 3.27 | 95.55 ± 4.17 | 99.05 ± 0.28 | 96.09 ± 4.09 | 97.32 ± 3.46 | 97.48 ± 3.23 | 98.98 ± 0.26 | 99.06 ± 0.22
Breast-Tissue | 76.80 ± 7.31 | 72.18 ± 10.10 | 75.80 ± 9.99 | 78.30 ± 8.92 | 76.92 ± 9.75 | 76.80 ± 7.31 | 77.30 ± 8.09 | 76.80 ± 7.31 | 82.37 ± 3.77
UKM | 94.14 ± 3.65 | 94.94 ± 5.27 | 94.19 ± 3.04 | 96.41 ± 2.61 | 94.01 ± 4.64 | 93.33 ± 3.88 | 93.55 ± 3.92 | 96.45 ± 3.02 | 97.26 ± 2.49
Vertebral-Column | 79.25 ± 3.17 | 79.96 ± 5.56 | 78.38 ± 6.36 | 79.42 ± 5.04 | 80.04 ± 2.48 | 81.13 ± 6.76 | 79.54 ± 8.76 | 86.04 ± 1.72 | 89.07 ± 2.00
Ecoli | 83.07 ± 11.54 | 83.43 ± 10.88 | 79.78 ± 11.44 | 84.78 ± 9.21 | 79.74 ± 11.00 | 86.38 ± 9.56 | 91.97 ± 4.61 | 91.59 ± 4.19 | 92.91 ± 2.92
Average | 84.89 | 84.89 | 85.06 | 86.10 | 85.42 | 85.16 | 86.87 | 88.75 | 90.62
Rank Avg. | 7.00 | 6.22 | 6.33 | 4.89 | 5.44 | 6.33 | 4.89 | 2.89 | 1.00
The best result is in bold face.
Table 5. Comparison of MAUC (%) for every oversampling scheme on different data sets.

Data Set | Baseline | SSMOTE | GCS | ABNC | OSMOTE | MDO | MDO+ | AMDO | EMDO
Balance | 67.29 ± 1.31 | 68.79 ± 2.20 | 67.49 ± 3.02 | 71.75 ± 2.47 | 68.39 ± 2.09 | 67.96 ± 2.19 | 66.59 ± 1.25 | 70.27 ± 1.54 | 73.13 ± 2.77
Hayes-Roth | 88.68 ± 5.03 | 88.50 ± 3.79 | 89.00 ± 4.18 | 87.64 ± 5.31 | 88.73 ± 5.06 | 88.68 ± 5.03 | 88.73 ± 5.06 | 88.73 ± 5.06 | 89.15 ± 4.86
New-Thyroid | 91.64 ± 4.52 | 93.31 ± 2.27 | 94.86 ± 3.41 | 95.19 ± 1.09 | 94.40 ± 4.96 | 92.36 ± 5.13 | 96.24 ± 2.69 | 97.40 ± 2.24 | 97.76 ± 2.14
Page-Blocks | 90.19 ± 1.34 | 90.35 ± 1.18 | 92.98 ± 2.80 | 87.31 ± 3.48 | 93.51 ± 1.86 | 88.27 ± 1.12 | 91.33 ± 1.55 | 92.98 ± 1.20 | 93.85 ± 1.17
Dermatology | 97.40 ± 1.23 | 97.44 ± 1.10 | 96.10 ± 1.62 | 98.26 ± 0.45 | 97.19 ± 1.47 | 97.40 ± 1.23 | 97.63 ± 0.97 | 98.13 ± 0.16 | 98.26 ± 0.16
Breast-Tissue | 77.93 ± 2.24 | 76.53 ± 2.36 | 81.27 ± 3.46 | 79.60 ± 2.93 | 79.50 ± 4.23 | 77.93 ± 2.24 | 79.93 ± 1.30 | 77.93 ± 2.24 | 84.07 ± 5.16
UKM | 94.79 ± 1.34 | 95.05 ± 3.23 | 94.02 ± 1.33 | 96.32 ± 1.64 | 94.52 ± 1.54 | 94.30 ± 1.65 | 94.61 ± 1.67 | 96.15 ± 1.42 | 96.91 ± 1.05
Vertebral-Column | 82.33 ± 1.99 | 82.92 ± 3.50 | 81.75 ± 4.39 | 82.50 ± 2.34 | 82.75 ± 3.53 | 83.67 ± 4.16 | 82.42 ± 5.55 | 86.42 ± 1.78 | 89.05 ± 1.33
Ecoli | 84.15 ± 4.93 | 83.01 ± 7.50 | 82.96 ± 7.10 | 85.15 ± 4.46 | 83.20 ± 5.45 | 85.77 ± 4.39 | 88.93 ± 3.26 | 89.03 ± 3.18 | 90.82 ± 1.01
Average | 86.04 | 86.21 | 86.71 | 87.08 | 86.91 | 86.26 | 87.38 | 88.56 | 90.33
Rank Avg. | 7.00 | 6.11 | 6.17 | 4.72 | 5.44 | 6.33 | 4.89 | 3.28 | 1.06
The best result is in bold face.
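MAUC in Table 5 is a multi-class generalization of the area under the ROC curve. A hedged sketch, assuming MAUC is taken as the macro-averaged pairwise (one-vs-one) multi-class AUC, which scikit-learn exposes through roc_auc_score with multi_class='ovo'; the synthetic data set and classifier below are placeholders, not the benchmarks of Table 5:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic imbalanced 3-class problem standing in for a benchmark data set.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # class-membership probabilities, shape (n, 3)

# Macro-averaged one-vs-one AUC over all class pairs (multi-class AUC).
mauc = roc_auc_score(y_te, proba, multi_class="ovo", average="macro")
print(f"MAUC = {100 * mauc:.2f}%")
```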
Table 6. Characteristics of the data sets for simulations in Example 2.

Data Set | Size | Attributes | Classes | Class Distribution | IRmax | No. of Ellipsoids
Statlog (Shuttle) | 58000 | 9 | 7 | 45586/50/171/8903/3267/10/13 | 5684.67 | NA/3/4/5/5/3/1
Mafalda | 23762 | 14 | 3 | 17757/2990/3015 | 5.94 | NA/11/11
NA: not available.
Table 7. Comparison of the performance with and without EMDO for imbalanced sensing data.

Data Set | | P_min | P_avg | AUCm | MAUC
Statlog (Shuttle) | w/ EMDO | 89.33 ± 13.73 | 96.60 ± 1.95 | 99.66 ± 0.47 | 99.32 ± 0.94
Statlog (Shuttle) | w/o EMDO | 60.00 ± 48.99 | 93.21 ± 7.42 | 77.65 ± 6.17 | 92.80 ± 1.66
Mafalda | w/ EMDO | 57.38 ± 18.33 | 72.34 ± 9.33 | 76.39 ± 9.31 | 78.19 ± 8.22
Mafalda | w/o EMDO | 33.17 ± 12.13 | 60.77 ± 10.53 | 65.25 ± 7.88 | 67.80 ± 6.99