
Decomposition and Symmetric Kernel Deep Neural Network Fuzzy Support Vector Machine

Laboratory of Engineering Science, Polydisciplinary Faculty of Taza, Sidi Mohamed Ben Abdellah of Fez, Taza P.O. Box 1223, Morocco
Department of Cyber-Physical Systems, St. Petersburg State Marine Technical University, Saint Petersburg 190121, Russia
Graduate School of Intelligent Data Science, National Yunlin University of Science and Technology, Douliou, Yunlin 640301, Taiwan
Authors to whom correspondence should be addressed.
Symmetry 2024, 16(12), 1585;
Submission received: 20 August 2024 / Revised: 1 November 2024 / Accepted: 15 November 2024 / Published: 27 November 2024
(This article belongs to the Section Computer)


Algorithms involving kernel functions, such as the support vector machine (SVM), have attracted considerable attention within the machine learning community. The performance of these algorithms is strongly influenced by outliers and by the choice of kernel function. This paper introduces a new version of SVM named Decomposition Deep Neural Network Fuzzy SVM (DDNN-FSVM). To this end, we consider an auto-encoder (AE) deep neural network with three layers: input, hidden, and output. Unusually, the AE's hidden layer comprises more neurons than the dimension of the input samples, which promotes linear separability of the data. The encoder operator is then introduced into the FSVM dual to map the training samples to a high-dimensional space. To learn the support vectors and auto-encoder parameters, we add the AE reconstruction loss and regularization terms to the FSVM dual. To learn from large-scale data, we decompose the resulting model into three smaller submodels using Lagrangian decomposition. To solve the resulting problems, we use SMO, ISDA, and SCG, which are suited to optimization problems involving large-scale data. We demonstrate that the optimal values of the three submodels, solved in parallel, provide a good lower bound for the optimal value of the initial model. In addition, thanks to its use of fuzzy weights, DDNN-FSVM is resistant to outliers. Moreover, DDNN-FSVM simultaneously learns the appropriate kernel function and the separator. We tested DDNN-FSVM on several well-known numerical and image datasets and compared it to well-known classifiers on the basis of accuracy, precision, F-measure, G-mean, and recall. On average, DDNN-FSVM improved on the performance of the classic FSVM across all datasets and outperformed several well-known classifiers.

1. Introduction

Supervised techniques such as structural risk minimization and the support vector machine (SVM), both derived from the Vapnik–Chervonenkis dimension, provide efficient tools for both classification and regression. SVM has been extensively employed to tackle a variety of challenges, including pattern matching and image recognition [1,2,3], biotechnology [4,5], and trade outlook [6]. SVM has captured the interest of numerous scientists thanks to its lower error rate compared to alternative training techniques, its relatively rapid training speed, and its easy integration with large-scale data.
In this paper, we introduce a new version of SVM named Decomposition Deep Neural Network Fuzzy Support Vector Machine (DDNN-FSVM). SVMs have difficulty solving large-scale classification problems involving more than two classes; for this reason, various improved versions have been introduced. These fall into three categories: (a) Fuzzy SVM (FSVM), (b) Twin SVM (TSVM), and (c) Distributed SVM (DSVM). FSVM [7,8] was introduced to address the susceptibility of SVMs to noise and outliers in the training data, which leads to overfitting [9]. FSVM and its more recent derivatives [10,11,12] assign a fuzzy membership to each instance so that each contributes in its own way to the construction of the SVM hyperplane, thereby mitigating the influence of outlier data points. TSVM, on the other hand, creates a pair of nonparallel hyperplanes by solving a pair of small quadratic optimization problems, ensuring that each hyperplane is as close as possible to one class while being as distant as possible from the other [13,14]. Thanks to TSVM's reduced computational complexity and enhanced generalization ability, a number of extensions have been introduced, including Twin-Bounded SVM [15], Nonparallel SVM [16], Least Squares TSVM [17], Twin Support Vector Regression [18], and Pinball Loss TSVM (Pin-TSVM) [19]. Multi-class SVMs have also been designed to treat multiclass problems in real time [20,21]. In [22], the authors introduced DSVM. Similar to parallel SVM, DSVM consists of two parallel schemes: a distributed batching scheme for data spreading, and a further refined distributed semiparametric SVM. New alternative techniques have been developed for DSVM, offering a different perspective from previous parallel approaches with excellent benchmark performance and empirical success.
Because of these substantial advantages, DSVM has established itself as a preferred direction for researchers [23,24,25,26,27] and has been extended to multiclass situations.
In addition to the sensitivity of SVMs to noise and outliers, learning the best kernel function for a particular dataset is a major challenge. The conditions for optimality of a good kernel function consist of prior information about optimality, computational techniques for obtaining the kernel, adaptation to an alternative kernel, bounds on the error rate of the learner, and the inherent nature of the structure of the dataset (see Section 3).
The main aim of this paper is to build a hybrid version of the fuzzy SVM model that is robust against outlier samples, learns from large-scale data, and learns kernel parameters iteratively. To achieve these objectives, we combine three well-known components: the FSVM model, the auto-encoder deep neural network, and Lagrangian decomposition. We name the resulting model Decomposition Deep Neural Network Fuzzy SVM (DDNN-FSVM). First, we design a three-layer auto-encoder (AE) consisting of an input layer ($n \in \mathbb{N}$ neurons), a hidden layer (more than $n$ neurons), and an output layer ($n$ neurons). Unusually, the AE's hidden layer contains more cells than the length of the input samples in order to promote linear separation of the data. The encoder operator, which projects the data into a higher-dimensional space, is then introduced into the FSVM dual to map the training samples. To learn the support vectors and auto-encoder parameters, we add the reconstruction loss and the regularization terms to the FSVM dual using a penalty parameter. To learn from large-scale data, we decompose the resulting model into three smaller sub-models using Lagrangian decomposition, namely, the classification model, the reconstruction model, and the encoding model. We use SMO and ISDA to solve the classification model and SCG to solve the reconstruction and encoding models. These algorithms are designed to solve optimization problems involving large-scale data. We demonstrate mathematically that the optimal values of the three sub-models, when solved in parallel, provide a good lower bound for the optimal value of the initial model, which supports the quality of the support vectors. In contrast to conventional kernel learning approaches, DDNN-FSVM learns kernel parameters and support vectors iteratively, enabling the kernel to be adapted to the characteristics of the dataset.
In addition, DDNN-FSVM is resistant to outliers thanks to its reliance on fuzzy weights. Moreover, DDNN-FSVM simultaneously learns the appropriate kernel function and the separator. We tested DDNN-FSVM on several well-known numerical datasets from several fields (biology, health and medicine, social sciences, physics, and chemistry) as well as image sets, then compared it to well-known classifiers on the basis of accuracy, precision, F-measure, G-mean, and recall. On average and across all datasets, DDNN-FSVM improves upon the performance of the classic version of FSVM and outperforms other well-known classifiers.
The main contribution of this paper is the introduction of a new version of fuzzy SVM that implements deep neural networks and Lagrangian decomposition, resulting in the following advantages:
  • DDNN-FSVM is robust against outliers thanks to fuzzy sample characterization.
  • DDNN-FSVM simultaneously learns the appropriate kernel function and the separator.
  • DDNN-FSVM learns from large-scale data using the SMO, ISDA, and SCG algorithms with Lagrangian decomposition.
The rest of this paper is organized as follows: Section 2 outlines the methodology; Section 3 provides a detailed overview of the state of the art in kernel learning techniques; Section 4 presents the essential background on FSVMs; Section 5 provides the essential details of the auto-encoder; Section 6 illustrates the Lagrangian decomposition approach; Section 7 describes the proposed approach; Section 8 presents our experimental results; and Section 9 provides conclusions, limitations, and future scope.

2. Methodology

In this paper, we use the notation in Table 1.
To build our model, we reviewed various works on the kernel function learning problem, recalling models that represent the foundations of our model (see Figure 1):
  • Related work: We analyze the most important methods introduced in the literature to deal with the major problems linked to kernel selection for SVM, namely, kernel optimality constraints (data, base kernel, and kernel learning model) and the kernel selection phase (optimization, kernel matrix/function, and kernel machine); see Section 3.
  • FSVM: We describe the fuzzy version of SVM by providing the formula of the membership coefficient associated with each sample and the fuzzy SVM quadratic dual problem that implements this coefficient. We then provide the update rules of the SMO and Kernel-Adatron (KA) optimization methods, which are designed especially for large-scale quadratic optimization problems; see Section 4.
  • Deep Neural Network: We present the main aspects of the Auto-Encoder (AE) neural network and the Scaled Conjugate Gradient (SCG) algorithm designed to train this type of neural network; see Section 5.
  • Lagrangian Decomposition: We recall the principle of partial Lagrangian relaxation and the idea behind Lagrangian decomposition, especially the copy variables; see Section 6.
Setting up the FSVM deep neural network decomposition model involves several phases:
Phase 1: Implementation of Lagrangian relaxation to design the QP associated with the dual of the fuzzy version of SVM.
Phase 2: Introduction of the operator $\mathrm{encod}(w_e, \cdot)$ in QP-FSVM to map the data into an appropriate artificial space.
Phase 3: Aggregation of the objective function of encoding-QP-SVM and the loss function of AE.
Phase 4: Introduction of a family of copy variables to split the problem obtained at the end of phase 3 into three small subproblems: neural weighted QP-FSVM, decoding loss, and encoding loss.
Phase 5: Solving the second and third subproblems using the SCG method.
Phase 6: Resolution of the first subproblem by one of the SMO or KA methods and L1QP. Note that this subproblem implements the encoder obtained in the previous phase to map the data into the appropriate artificial space.
Phase 7: The support vectors obtained in Phase 6 are used to predict the class of samples not seen by DDNN-FSVM during training.
The flow chart in Figure 2 illustrates these different phases; the * symbol and the squares symbolize, respectively, the optimum separator parameters and the samples to be encoded.

3. Related Work

A kernel function computes the inner product of two data points in a feature space and thereby measures the degree of similarity between these points.
These functions are introduced into algorithms for learning from complex data. Given a dataset $D = \{s_1, \ldots, s_N\}$, a kernel function $k$ can be defined through an inner product $k(s_i, s_j) = \langle \varphi(s_i), \varphi(s_j) \rangle$, $i, j = 1, \ldots, N$, where $\varphi$ is the function that maps the data to a synthetic feature space; the kernel matrix is $K = (k(s_i, s_j))_{i,j = 1, \ldots, N}$.
Well-known kernel functions include the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, the linear kernel $k(x, y) = \langle x, y \rangle$, and the polynomial kernel $k(x, y) = (\langle x, y \rangle + 1)^p$ of degree $p$; $\sigma$ and $p$ are examples of hyperparameters. It can be shown that all these functions correspond to an inner product of points in an artificial space. In this respect, as Mercer's theorem indicates [28], a symmetric function corresponds to an inner product if and only if its matrix $K$ is positive semidefinite (often simply called positive definite in the kernel literature), that is, $c^{\top} K c \geq 0$ for every vector $c \in \mathbb{R}^N$.
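As an illustration of the three kernels above, the following minimal Python sketch builds a kernel matrix and evaluates the quadratic form $c^{\top} K c$ that Mercer's condition constrains. The toy samples and all function names are hypothetical and chosen only for this example; they are not taken from the paper.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: the ordinary inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, p=2):
    """Polynomial kernel (<x, y> + 1)^p of degree p."""
    return (linear_kernel(x, y) + 1.0) ** p

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(samples, kernel):
    """Kernel matrix K with entries K[i][j] = k(s_i, s_j)."""
    return [[kernel(si, sj) for sj in samples] for si in samples]

def quadratic_form(K, c):
    """Value of c^T K c for a coefficient vector c."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

# Hypothetical toy dataset of three 2-D samples.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(samples, gaussian_kernel)
```

Checking the quadratic form on a handful of vectors is only a sanity check, not a proof of positive semidefiniteness; a full check would examine the eigenvalues of $K$.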
Several challenges are linked to the kernel learning problem: (c1) the implicit nature of the feature space prevents any direct analysis, and (c2) verifying that the kernel matrix is positive definite is very difficult in practice. The main concern in the kernel learning area is the performance of a kernel for a particular set of data. The optimality conditions of a good kernel are (o1) prior information for optimality, (o2) computational techniques for obtaining the kernel, (o3) adaptation to an alternative kernel, (o4) bounds on the learner's error rate, and (o5) the inherent nature of the data set structure. Other aspects can also be taken into account: (a1) optimization, (a2) the kernel learning model, (a3) the kernel selection stage, (a4) the optimal kernel obtained, and (a5) optimality conditions in learning the kernels. The different types of kernel learning within a typical scenario appear in Figure 3.
Optimization: Kernel learning may also be expressed as an optimization problem with a specific criterion to be minimized. Convex optimization has been widely employed, as it ensures the uniqueness of the solution [29,30,31,32]. However, these techniques are often not easy to implement, and their performance is not easily evaluated. Alternatively, gradient-based approximations are simple to implement, powerful, and easy to comprehend [33,34]. However, they require that the function be differentiable, which is rarely the case.
Kernel learning model: We distinguish between three model types, namely, data-dependent models, nonparametric models, and parametric models. Data-dependent models adopt a simple specification analogous to Gaussian kernels [35,36]. The distinction is that they have to determine the parameters in light of all the data, which needs to be supported by theoretical evidence. In nonparametric approaches, no predefined template is assumed for the kernel, and the fulfillment of user-derived criteria ensures the kernel's optimality. The problem is that the optimal kernel needs to be derived from both training and test instances. The parametric approach involves identifying parameters given user-derived criteria and a preset model [37,38,39,40,41,42]. Nevertheless, most existing methodologies express the kernel as a linear mixture of base kernels.
Kernel selection phase: This stage may be subdivided into three groups. In the first group, kernel selection is entirely separate from the learning program itself, which renders it standard [39,41,42]. In the second group, two successive stages are carried out, consisting of identification of the best kernel and training of the model [37,38]. The third group introduces a version of SVM that makes use of several kernels and calculates their relative weights while learning. Ultimately, these techniques are suitable only for a specific training algorithm and a particular class of tasks.
Final optimal kernel: First, the kernel is identified according to labeled data [37,38,39,40], unlabeled data [35,43], or a mixture of labeled and unlabeled data [44,45,46], with labeled data being the most widely used. Learning a kernel results in a matrix [35,44,47] or a function [37,38,39]. It is necessary to carry out an extra step in order to build the kernel matrix on the test and training data.
That said, kernel construction algorithms have to satisfy the difficult condition that the kernel matrix to be built is positive definite. In the following, we briefly outline the state of the art on optimality conditions for kernel learning.
Prior knowledge for optimality: When the user has a priori knowledge, for example in the case of diagonal dominance between samples, constrained linear programming may be required in order to formulate the kernel learning problem [48]. Other types of prior knowledge can be found in [49,50].
Statistical approaches for learning the kernel: Among the family of statistical models, we find the natural kernels introduced in [36], which suppose the data to be generated by a known distribution and define the kernel based on the gradient of the log of this distribution. Depending on the specific case, we obtain a plain kernel or the Fisher kernel, which is a particular case of the marginalization kernel [51] inspired by the hidden Markov model. The probability product kernel was introduced in [43]. The Bhattacharyya and Kullback–Leibler kernels are particular cases of this probability product kernel [52]. This kernel function is especially useful in situations where the input features correspond to a probability distribution. These data-dependent techniques are very interesting because they bring out the statistical nature of the data when creating a basis function. Nevertheless, the exact distribution and its parameters are not always known. It is possible to improve probabilistic kernels using Bayesian inference, which produces Gaussian processes [53,54] in which the kernel matrix serves as the covariance of the normal distribution, with the parameters estimated by means of the expectation-maximization procedure.
Adaptation to another kernel: The adaptation operation ensures that the kernel retains its original features while enhancing its capabilities by considering the ideal case. Two measures, namely, kernel alignment and divergence measures, are employed to quantify the similarity between kernel matrices [44,47,55]. However, these tools are ineffective when there are very few labeled instances or none at all in the dataset.
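To make the notion of kernel alignment concrete, the sketch below uses the usual Frobenius-inner-product definition, $A(K_1, K_2) = \langle K_1, K_2 \rangle_F / \sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}$. The text above does not spell out this formula, so it is an assumption here, as are the toy data and the choice of the label matrix $t t^{\top}$ as the reference kernel.

```python
import math

def frobenius_inner(A, B):
    """Frobenius inner product: sum of element-wise products of A and B."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def kernel_alignment(K1, K2):
    """Alignment between two kernel matrices (cosine of the angle between them)."""
    return frobenius_inner(K1, K2) / math.sqrt(
        frobenius_inner(K1, K1) * frobenius_inner(K2, K2)
    )

# Hypothetical 1-D samples with labels; the reference kernel is t t^T.
x = [1.0, 2.0, -1.5]
t = [1, 1, -1]
K_linear = [[xi * xj for xj in x] for xi in x]
K_ideal = [[ti * tj for tj in t] for ti in t]

alignment = kernel_alignment(K_linear, K_ideal)
```

An alignment close to 1 indicates that the candidate kernel already groups the samples much as the labels do, which is why the measure needs labeled instances to be informative.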
Bounds on learner error rates: This class of techniques can be divided into four groups: (a) direct estimation of error rates, (b) implicit bounds on error rates, (c) regularization for kernel learning, and (d) transfer learning frameworks. Cross-validation is conventionally used to determine the parameters, and has often been applied to kernel-based techniques such as SVMs [56]. However, cross-validation only enables hyperparameters to be selected within the range specified by the user. The approach suggested in [37] employs an empirical error criterion [57] to assess the learner's performance. This technique is designed specifically for SVM and uses its decision function. AdaBoost, a common form of boosting, can also be employed to learn kernels [45,58,59]. In this method, a mixture of basic kernel functions is employed to minimize the error function. Nevertheless, the number of parameters that must be specified by the user, including the number of boosting iterations and the number of Gaussian combinations, makes this approach impractical. The objective function of the training process may itself be considered as the optimality criterion. The SVM objective function possesses a range of desirable characteristics, including a sound geometric interpretation and a relatively simple expression as a convex optimization problem. A notable and influential approach within this area is provided by [32], where the kernel learning procedure is specified as two convex optimization problems. The kernel is built over both the training and test datasets. In both formulations, kernel selection is incorporated into the kernel machine (SVM) itself, resulting in a new formulation of the training scheme. Lanckriet's study constitutes a pioneering contribution to the existing literature, describing a parametric kernel function whose optimality can be specified on the basis of a particular collection of data.
While offering a number of benefits, this method is appropriate only in cases where a significant number of training instances exists. One example using this type of methodology for data fusion is discussed in [60]. The use of kernel mixtures considered in [32] was revisited in [40], in which the authors proposed solving the task by means of a procedure analogous to Sequential Minimal Optimization (SMO) [61]. This method has come to be known as multiple kernel learning (MKL), which employs more than one base kernel to reach a decision. A variant of MKL capable of reusing existing SVM implementations by breaking the task down into two stages was developed in [62]. The first stage updates the mixture of base kernels, while the second evaluates the corresponding SVM parameters. A further version of MKL suggested in [34] improves training by incorporating the notion of grouping kernels; in other words, kernels representing comparable sets of features may be linked together. Furthermore, the situation involving a very large set of kernels in MKL was examined in [63]. In an alternative approach, the authors of [64] developed a method for finding the best hyperparameter for a given dataset by varying the hyperparameter so as to reduce the hinge loss in SVMs. Regularization offers a powerful foundation for kernel building, as discussed in [65,66,67], where it was demonstrated that the problem can be written as a convex model by using an adequate loss function, leading to an interior-point method [48,68]. In [66], the authors demonstrated how the optimal kernel matrix for the loss function can be obtained from a suitable convex optimization model. This regularization method was enhanced in [39], which employed DC (difference of convex functions) programming [69] during optimization.
Regularization was also explored in an extension to Bregman divergences in [70], in which the solution was obtained through Newton's method. A more detailed explanation of these procedures is provided in [71]. We can expect to learn a model that will eventually generalize to the test dataset; for instance, the authors of [72] investigated the situation that arises when the problem involves using more than one dataset in a regularization framework to identify the best kernel. This scheme operates by first learning a suitable model and then learning the best kernel, with the model learned by training the SVM. If the extra datasets are unlabeled, then the kernel learning task is addressed through an unsupervised transfer learning process, as in [73]. Once the relevant features have been discovered, the next phase uses this knowledge to identify their transformation in the labeled set. Lastly, the transformed attributes are used in a training scheme such as SVM to search for a suitable model. A drawback of this procedure is that the features identified in the unlabeled dataset are not necessarily representative of the most relevant features in the labeled training set. Comparable methods for employing transfer learning to construct a kernel are examined in [74,75,76,77].
The inherent structure of the dataset: Lately, increasing attention has been focused on exploiting the inherent structure of datasets to identify the best kernels, thereby removing the need for prior expertise on the part of the developer or for the label of each instance. Several types of methods fall under this umbrella, namely, input-space conditions, feature-space conditions, and inherent structure for the purpose of dimensionality reduction. The accuracy of the chosen kernel depends on an underlying property of the dataset deduced from the investigation of the dataset in its input space. For instance, a heat or diffusion kernel is a familiar data-dependent kernel function defined on the statistical manifold of the data, with a strong physical interpretation [78]. An alternative data-dependent kernel is based on the graph Laplacian, which is extensively employed to derive the graph representation of a dataset [35,79]. Prior knowledge provided by the user about the problem can be exploited in parallel under the graph Laplacian concept to derive a novel kernel; see [49] for an illustration and [46,80] for its further extension through active learning. However, nonparametric techniques are likely to encounter the problem of over-fitting. In the category of feature-space conditions, the emphasis is placed on techniques that specify optimality in terms of criteria imposed directly on the items represented in the feature space. For instance, [81] suggested developing a network of kernels to shorten the distance separating pairs of instances of the same class in the feature space. A significant contribution to this group involving a geometric approach was provided in [38]. This process concentrates on supervised learning, especially for SVMs. The mapped points are investigated by means of Riemannian geometry.
A further unsupervised technique is the one suggested in [82], in which a random walk on the graph is used to learn a linear mixture of kernels, which can be derived via a pair of convex optimizations, specifically, linear programming and semidefinite programming. In the final category, kernel techniques and their mapping to a higher-dimensional space, such as in kernel PCA, are employed to reduce dimensionality. The authors of [83] suggested a procedure that uses nonlinearly mapped points to uncover a reduced-dimensional representation of the data in line with kernel PCA. A comparable technique was introduced in [84], where the topological structure of the dataset was considered. The purpose of such algorithms is to derive an adequate embedding of a graph built from the dataset in a specific space. The dimensionality reduction tools described in this subsection tend to be task-specific, and are not readily transferable to other tasks such as classification.
Table 2 provides a summary of the state of the art regarding different approaches used for estimating kernels. In this regard, we point out the following four aspects: name, references, keywords, and drawbacks.

4. Fuzzy Support Vector Machine (FSVM)

The desired SVM separator is defined by the equation $\langle w, s \rangle + b = 0$, where $\langle \cdot, \cdot \rangle$ is the inner product and the pair $(w, b)$ satisfies the conditions $t_i(\langle s_i, w \rangle + b) \geq 1$ for all $i = 1, \ldots, N$. To guarantee the largest margin, it is necessary to maximize $\frac{2}{\|w\|}$. When the patterns are not linearly separable, kernel functions $K$ (fulfilling Mercer's conditions [28]) are introduced to map the instances to a suitable space.
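The separator conditions and the margin $2 / \|w\|$ can be checked numerically. The following minimal Python sketch does so on a hypothetical, linearly separable toy set; the data, the candidate pair $(w, b)$, and the function names are assumptions made for illustration only.

```python
import math

def functional_margins(w, b, samples, labels):
    """Values t_i * (<s_i, w> + b), one per training sample."""
    return [
        t * (sum(wi * si for wi, si in zip(w, s)) + b)
        for s, t in zip(samples, labels)
    ]

def satisfies_constraints(w, b, samples, labels, tol=1e-12):
    """True when every sample satisfies t_i * (<s_i, w> + b) >= 1."""
    return all(m >= 1.0 - tol for m in functional_margins(w, b, samples, labels))

def margin_width(w):
    """Margin 2 / ||w|| between the two supporting hyperplanes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical toy data separated by the vertical axis.
samples = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-3.0, 2.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

margins = functional_margins(w, b, samples, labels)
feasible = satisfies_constraints(w, b, samples, labels)
width = margin_width(w)
```

Scaling $w$ up keeps the constraints satisfied but shrinks $2 / \|w\|$, which is why the margin is maximized subject to the constraints rather than on its own.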

4.1. SVM with Soft Margin

By forming the Lagrangian and expressing the Karush–Kuhn–Tucker (KKT) conditions, we derive a quadratic optimization program with a single linear equality constraint, which is solved to identify the support vectors [85,86]. To handle cases in which these constraints cannot all be satisfied, a number of authors have introduced the concept of a soft margin [87]. In this approach, $N$ slack variables $\xi_i \geq 0$ are introduced, one for each constraint $t_i(\langle s_i, w \rangle + b) \geq 1$. The weighted sum of the slack variables is then incorporated into the cost function, yielding the dual problem in Equation (1):
$$\begin{aligned} \text{Max} \quad & \sum_{i=1}^{N} \varrho_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \varrho_i \varrho_j t_i t_j K(s_i, s_j) \\ \text{Subject to:} \quad & \sum_{i=1}^{N} \varrho_i t_i = 0, \\ & 0 \leq \varrho_i \leq C, \quad i = 1, \ldots, N \end{aligned} \tag{1}$$
where $C$ is the upper bound on the dual variables $\varrho_i$. As SVMs are very susceptible to noise and outliers in the training data, which leads to overfitting [9], FSVM has been introduced to address this concern [7,8].

4.2. Fuzzy Weights for Robust SVM

Fuzzy logic has the ability to capture key information in imprecise environments [88]. For this reason, the Fuzzy Support Vector Machine (FSVM) was introduced to alleviate the effect of outlier samples. In FSVM [8,89], each instance is allocated a fuzzy membership weight that indicates how significant it is. The scheme used to compute these fuzzy membership weights is crucial to the efficiency of FSVMs. The weights of outliers and noisy samples are typically smaller than those of the rest of the instances.
To identify appropriate fuzzy memberships, the attributes of the data need to be examined. In [8], Chun hypothesized that the principal attributes of the dataset are incorporated over time. By imposing boundary conditions, linear and quadratic membership weights can be established.
In [90], the authors employed the class center to create the fuzzy membership. They designated $c_+$ and $\rho_+$ as the mean and radius of class $+1$, and $c_-$ and $\rho_-$ as the mean and radius of class $-1$. The radius of a given class is measured as the largest distance separating its training samples from the center of the class, formally $\rho_+ = \sup_{s_i :\, t_i = +1} \|c_+ - s_i\|$ and $\rho_- = \sup_{s_i :\, t_i = -1} \|c_- - s_i\|$. In this sense, the weight of $s_i$, denoted as $\mu_i$, is described by Formula (2):
$$\mu_i = \begin{cases} 1 - \dfrac{\|c_+ - s_i\|}{\rho_+ + \psi}, & \text{if } t_i = +1 \\[2ex] 1 - \dfrac{\|c_- - s_i\|}{\rho_- + \psi}, & \text{if } t_i = -1 \end{cases} \tag{2}$$
where $\psi > 0$ is designed to avoid the situation where $\mu_i = 0$. In the remainder of this paper, we implement this method when testing the proposed DDNN-FSVM on different datasets, where we denote $\mu_i$ as $m_i$.
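The class-center membership of Formula (2) is straightforward to compute. The sketch below is a minimal Python illustration on hypothetical toy data containing one obvious outlier in the positive class; the function names, the data, and the value of $\psi$ are assumptions, not the authors' implementation.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fuzzy_memberships(samples, labels, psi=1e-3):
    """Membership mu_i = 1 - ||c - s_i|| / (rho + psi), per class (labels +1 / -1)."""
    centers, radii = {}, {}
    for cls in (+1, -1):
        members = [s for s, t in zip(samples, labels) if t == cls]
        centers[cls] = centroid(members)
        # Class radius: largest distance from the class center to a member.
        radii[cls] = max(distance(centers[cls], s) for s in members)
    return [
        1.0 - distance(centers[t], s) / (radii[t] + psi)
        for s, t in zip(samples, labels)
    ]

# Hypothetical data: the sample (5, 5) lies far from the rest of class +1.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0),
           (-1.0, -1.0), (-2.0, -1.0), (-1.0, -2.0)]
labels = [1, 1, 1, 1, 1, -1, -1, -1]

memberships = fuzzy_memberships(samples, labels)
```

Note that the farthest sample of each class always receives a membership close to zero when $\psi$ is small, so the choice of $\psi$ controls how strongly boundary samples are discounted.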
In [91], the global centroid of the dataset in question was deployed. The advantage of such a consideration is that the weight of outliers is considerably reduced. As a result, model performance is greatly enhanced.
The implementation of fuzzy weights leads to the FSVM version; see Equation (3).
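$$\begin{aligned} \text{Max} \quad & \sum_{i=1}^{N} \varrho_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \varrho_i \varrho_j t_i t_j K(s_i, s_j) \\ \text{Subject to:} \quad & \sum_{i=1}^{N} \varrho_i t_i = 0, \\ & 0 \leq \varrho_i \leq m_i C, \quad i = 1, \ldots, N. \end{aligned} \tag{3}$$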
The proposed model in this paper represents an improvement of the model in (3) using the weights described in Equation (2).
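The only difference between the FSVM dual and the soft-margin dual is the per-sample upper bound $m_i C$. The following minimal Python sketch evaluates the dual objective and checks feasibility for a candidate vector of dual variables; the two-sample dataset and the function names are hypothetical and serve only to show how a small membership tightens the box constraint.

```python
def fsvm_dual_objective(rho, labels, K):
    """Dual objective: sum_i rho_i - 1/2 sum_ij rho_i rho_j t_i t_j K_ij."""
    n = len(rho)
    quad = sum(
        rho[i] * rho[j] * labels[i] * labels[j] * K[i][j]
        for i in range(n) for j in range(n)
    )
    return sum(rho) - 0.5 * quad

def fsvm_feasible(rho, labels, memberships, C, tol=1e-9):
    """Check sum_i rho_i t_i = 0 and 0 <= rho_i <= m_i * C for every i."""
    balanced = abs(sum(r * t for r, t in zip(rho, labels))) <= tol
    boxed = all(-tol <= r <= m * C + tol for r, m in zip(rho, memberships))
    return balanced and boxed

# Hypothetical 1-D samples s_1 = 1 (class +1) and s_2 = -1 (class -1), linear kernel.
labels = [1, -1]
K = [[1.0, -1.0], [-1.0, 1.0]]
rho = [0.5, 0.5]
memberships = [1.0, 0.4]  # the second sample is treated as less reliable

objective = fsvm_dual_objective(rho, labels, K)
feasible_large_C = fsvm_feasible(rho, labels, memberships, C=2.0)
feasible_small_C = fsvm_feasible(rho, labels, memberships, C=1.0)
```

With $C = 1$ the second sample's bound is $m_2 C = 0.4$, so the candidate $\varrho_2 = 0.5$ is rejected, whereas it would be admissible in the unweighted soft-margin dual.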

4.3. Algorithm for Optimizing Support Vectors

An important aspect of learning from observed data using SVMs is the use of iterative training schemes whenever the dataset is extremely large.
The two most popular training algorithms that bypass conventional Quadratic Programming (QP) solvers are Kernel-Adatron (KA) [92,93,94,95,96] and Sequential Minimal Optimization (SMO) [97]. From an analytical standpoint, SMO is an extremely user-friendly and refined algorithm. KA, although delivering comparable success on classification tasks with regard to accuracy and computation time, has not gained as many followers.
KA: The standard Adatron procedure is designed specifically to handle linear classification. KA is an adaptation of the conventional Adatron algorithm to the feature space of SVMs. KA solves the dual Lagrangian using gradient ascent. The adjustment $\Delta\varrho_i$ of the dual variable $\varrho_i$ can be calculated according to Equation (4), which is derived from the partial derivative of the problem's objective function (22):
$$\Delta\varrho_i = \delta\,(1 - t_i f_i) \tag{4}$$
where $\delta > 0$ is the learning rate and $f_i$ designates the decision function's response to query $s_i$, i.e., $f_i = \sum_{j} \varrho_j t_j K(s_i, s_j)$. The update of the dual variables $\varrho_i$ is provided by Equation (5):
$$\varrho_i = \max\bigl(0, \min(\varrho_i + \Delta\varrho_i,\; m_i C)\bigr). \tag{5}$$
SMO: Recently, an easy-to-use rule for updating $\varrho_i$ has been derived, including a comprehensive breakdown of the KKT conditions to verify the optimality of the solution (see Equation (6)):
$$\Delta\varrho_i = \frac{1 - t_i f_i}{K(s_i, s_i)}. \tag{6}$$
The learning rule of $\varrho_i$ is presented by Equation (7):
$$\varrho_i = \max\bigl(0, \min(\varrho_i + \Delta\varrho_i,\; m_i C)\bigr). \tag{7}$$
In the experimentation section, we implement these two algorithms by testing FSVM and DNN-SVM on different datasets. A heuristic approach such as the Multi-Objective Particle Swarm Optimization (MOPSO) algorithm can also be used to solve the FSVM dual model, which can improve the quality of the support vectors [98].
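The two update rules in Equations (4)–(7) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the Gaussian kernel, the toy data, the membership values `m`, and the names `rbf_gram`, `train_dual`, and `decision` are all assumptions. As written, the rules enforce only the box constraint $0\le\varrho_i\le m_i C$; the equality constraint of (3) is not handled in this bias-free sketch.

```python
import numpy as np

def rbf_gram(S, gamma=0.5):
    """Gaussian-kernel Gram matrix K[i, j] = exp(-gamma * ||s_i - s_j||^2)."""
    sq = np.sum(S ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * S @ S.T, 0.0)
    return np.exp(-gamma * d2)

def train_dual(K, t, m, C=1.0, rule="smo", delta=0.1, epochs=100):
    """Sweep over the dual variables, applying Eq. (4) or Eq. (6) and clipping."""
    rho = np.zeros(len(t))
    for _ in range(epochs):
        for i in range(len(t)):
            f_i = K[i] @ (rho * t)                    # f_i = sum_j rho_j t_j K(s_i, s_j)
            if rule == "ka":
                step = delta * (1.0 - t[i] * f_i)     # Eq. (4), delta = learning rate
            else:
                step = (1.0 - t[i] * f_i) / K[i, i]   # Eq. (6)
            # Eqs. (5)/(7): clip to the fuzzy box [0, m_i * C]
            rho[i] = max(0.0, min(rho[i] + step, m[i] * C))
    return rho

def decision(K, rho, t):
    """Decision values f_i for every row of the Gram matrix."""
    return K @ (rho * t)

# Toy data (assumed): two well-separated groups, one low-membership sample per class.
S = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = np.array([1.0, 1.0, 0.2, 1.0, 1.0, 0.2])
K = rbf_gram(S)
rho_ka = train_dual(K, t, m, rule="ka")
rho_smo = train_dual(K, t, m, rule="smo")
```

The only place the fuzzy weights enter is the upper clipping bound `m[i] * C`, so a sample with a small membership value can never become a dominant support vector.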

5. Deep Neural Network

This study focuses on the well-known type of deep neural network called auto-encoders (AEs). In this respect, we present the main AE architecture and the Scaled Conjugate Gradient (SCG) algorithm designed to train this type of neural network.

5.1. Auto-Encoder: Architecture and Loss Function

An auto-encoder is an ANN specifically designed to map its inputs into a synthetic feature space. An AE contains two main components, namely, an encoder and a decoder. The encoder maps the input data to a new space, known as the hidden code or bottleneck, while the decoder maps the hidden code back to the input space. The main purpose of an auto-encoder is to minimize the reconstruction error between the initial data and the output produced by the decoder; see Figure 4.
Formally, a two-layer AE is an ANN consisting of two principal components, an encoder and a decoder. Given an input instance $x\in\mathbb{R}^{n}$, the encoder maps it into a synthetic space $\mathbb{R}^{n'}$ with non-interpretable features to obtain $\hat{p}$ through a nonlinear mapping $x\mapsto\mathrm{enc}(w_e,x)$:
$$\hat{p}=\mathrm{enc}(w_e,x) \tag{8}$$
where $w_e$ denotes the weights of the hidden (latent) layer.
The decoder then maps the hidden code $\hat{p}$ back to the input space through a nonlinear mapping $\hat{p}\mapsto\mathrm{dec}(w_d,\hat{p})$:
$$\hat{x}=\mathrm{dec}(w_d,\hat{p}) \tag{9}$$
where $w_d$ is the decoder's weight matrix and $\hat{x}$ is the reconstructed instance.
The tuning of $w_e$ and $w_d$ consists of minimizing a loss function $E(w_e,w_d)$, defined by Equation (10):
$$E(w_e,w_d)=\sum_{i}\bigl\|\mathrm{dec}(w_d,\mathrm{enc}(w_e,s_i))-s_i\bigr\|^{2}+P_{e,r}\|w_e\|^{2}+P_{d,r}\|w_d\|^{2} \tag{10}$$
where the first term is the global reconstruction error, the second and third terms are the autoencoder regularization terms [99,100], and $P_{e,r}$ and $P_{d,r}$ are penalty hyperparameters that provide a compromise between the loss function terms. The loss function is minimized through an optimization process such as SCG in order to update the parameters of the encoder and decoder.
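As a concrete reading of Equation (10), the following sketch evaluates the loss for a single-hidden-layer AE with NumPy. The tanh encoder, the linear decoder, the absence of bias terms, and the layer sizes are assumptions made only for illustration; the source does not fix them at this point.

```python
import numpy as np

def enc(w_e, x):
    """Encoder: maps inputs to the hidden code (tanh is an assumed activation)."""
    return np.tanh(x @ w_e)

def dec(w_d, p_hat):
    """Decoder: maps the hidden code back to input space (assumed linear)."""
    return p_hat @ w_d

def ae_loss(w_e, w_d, S, P_er, P_dr):
    """Equation (10): reconstruction error plus the two weight penalties."""
    recon_err = np.sum((dec(w_d, enc(w_e, S)) - S) ** 2)
    return recon_err + P_er * np.sum(w_e ** 2) + P_dr * np.sum(w_d ** 2)

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))            # 5 samples with n = 3 features
w_e = 0.1 * rng.normal(size=(3, 8))    # 8 hidden units, more than n
w_d = 0.1 * rng.normal(size=(8, 3))
loss = ae_loss(w_e, w_d, S, P_er=0.4, P_dr=0.01)
```

With all weights at zero the decoder outputs zero, so the loss reduces to $\sum_i\|s_i\|^2$, which is a convenient sanity check for any implementation of (10).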

5.2. Auto-Encoder: Learning Algorithms

Given an AE, we collect its weights in a vector $\tilde w\in\mathbb{R}^{M}$, $M\in\mathbb{N}$, and denote the global loss function of the AE by $E(\tilde w)$.
A set of vectors $\tilde p_1,\dots,\tilde p_k$ from $\mathbb{R}^{M}\setminus\{0\}$ is called a conjugate system with respect to a nonsingular symmetric matrix $A$ if it satisfies the following condition:
$$\tilde p_i^{T}A\,\tilde p_j=0\quad(i\neq j,\ i,j=1,2,\dots,k).$$
The set of points $\tilde w$ in $\mathbb{R}^{M}$ that satisfy
$$\tilde w=\tilde w_1+\varrho_1\tilde p_1+\dots+\varrho_k\tilde p_k,\quad\varrho_i\in\mathbb{R},$$
where $\tilde w_1$ is a point in weight space and $\tilde p_1,\dots,\tilde p_k$ are conjugate vectors, is called a $k$-plane, denoted $\pi_k$.
The key idea of this strategy is expressed in the algorithm described below, which is designed to minimize the error function $E(\tilde w)$:
  1. Initialize the vector $\tilde w_1$ and set $k=1$.
  2. Build a descent direction $\tilde p_k$ and choose the step $\varrho_k$ as the minimizer of $E(\tilde w_k+\varrho\,\tilde p_k)$ over $\varrho\in\mathbb{R}^{+}$.
  3. Update the weight vector via the equation $\tilde w_{k+1}=\tilde w_k+\varrho_k\tilde p_k$.
  4. If $E'(\tilde w_k)\neq 0$, then set $k=k+1$ and return to step 2; otherwise, $\tilde w_{k+1}$ is the intended solution.
Identifying the next position involves two distinct phases, namely, calculating the search direction and estimating the displacement step.
The conjugate direction method is a descent strategy in which the search directions and displacement steps are chosen using the second-order expansion $E(\tilde w+\tilde y)\approx E(\tilde w)+E'(\tilde w)^{T}\tilde y+\frac{1}{2}\tilde y^{T}E''(\tilde w)\tilde y$. We denote the quadratic approximation to $E$ in the neighborhood of a point $\tilde w$ by
$$E_{qw}(\tilde y)=E(\tilde w)+E'(\tilde w)^{T}\tilde y+\frac{1}{2}\tilde y^{T}E''(\tilde w)\tilde y.$$
The step from a starting point $\tilde y_1$ to a critical point $\tilde y_{*}$ can be expressed as a linear combination of $\tilde p_1,\dots,\tilde p_N$:
$$\tilde y_{*}-\tilde y_1=\sum_{i=1}^{N}\varrho_i\tilde p_i,\quad\varrho_i\in\mathbb{R} \tag{11}$$
where $\varrho_j$ is provided by Equation (12):
$$\varrho_j=-\frac{\tilde p_j^{T}E'_{qw}(\tilde y_1)}{\tilde p_j^{T}E''(\tilde w)\,\tilde p_j}. \tag{12}$$
It is possible to find the position $\tilde y_{*}$ using Equations (11) and (12).
The conjugate vectors can be calculated recursively. Initially, $\tilde p_1=-E'_{qw}(\tilde y_1)$; next, $\tilde p_{k+1}$ is selected to be the orthogonal projection of $-E'_{qw}(\tilde y_{k+1})$ on the $(N-k)$-plane $\pi_{N-k}$ conjugate to $\pi_k$.
A standard CG can now be described, as shown in Algorithm 1.
Algorithm 1 Conjugate Gradient Algorithm
  1. Select initial weights $\tilde w_1$.
      Set $k=1$ and $\tilde p_1=\tilde r_1=-E'(\tilde w_1)$.
  2. Calculate the required second-order information:
       $\tilde s_k=E''(\tilde w_k)\,\tilde p_k$,
       $\delta_k=\tilde p_k^{T}\tilde s_k$
  3. Calculate the displacement step size:
       $\tilde\mu_k=\tilde p_k^{T}\tilde r_k$
       $\tilde\alpha_k=\tilde\mu_k/\delta_k$
  4. Update the weights:
       $\tilde w_{k+1}=\tilde w_k+\tilde\alpha_k\tilde p_k$
       $\tilde r_{k+1}=-E'(\tilde w_{k+1})$
  5. If $k \bmod N=0$, then restart the algorithm:
       $\tilde p_{k+1}=\tilde r_{k+1}$
      Else, build the new conjugate direction:
       $\tilde\beta_k=\dfrac{|\tilde r_{k+1}|^{2}-\tilde r_{k+1}^{T}\tilde r_k}{\tilde\mu_k}$
       $\tilde p_{k+1}=\tilde r_{k+1}+\tilde\beta_k\tilde p_k$
  6. If $\tilde r_k\neq 0$, then set $k=k+1$ and return to step 2;
      else stop and return $\tilde w_{k+1}$ as the desired minimum.
Several other formulas for $\tilde\beta_k$ can be derived [101]. Among the major drawbacks of CG is that it requires $E$ to be twice differentiable and evaluates both $E'$ and $E''$ at each iteration, which increases its complexity.
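The steps of Algorithm 1 can be sketched as follows. The quadratic test function, the stopping tolerance, and the names `conjugate_gradient`, `grad`, and `hess_vec` are assumptions introduced for illustration; on a strictly convex quadratic the method reaches the minimizer in at most $N$ steps, which makes the example easy to verify.

```python
import numpy as np

def conjugate_gradient(grad, hess_vec, w, tol=1e-10, max_iter=1000):
    """Conjugate gradient with exact second-order information (Algorithm 1)."""
    N = w.size
    r = -grad(w)                   # steepest-descent direction
    p = r.copy()
    for k in range(1, max_iter + 1):
        if np.linalg.norm(r) < tol:
            break
        s = hess_vec(w, p)         # s_k = E''(w_k) p_k
        delta = p @ s
        mu = p @ r
        alpha = mu / delta         # step size
        w = w + alpha * p
        r_new = -grad(w)
        if k % N == 0:
            p = r_new.copy()       # periodic restart
        else:
            beta = (r_new @ r_new - r_new @ r) / mu
            p = r_new + beta * p   # new conjugate direction
        r = r_new
    return w

# Assumed test problem: E(w) = 0.5 * w^T A w - b^T w, with A positive definite.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w_star = conjugate_gradient(lambda w: A @ w - b, lambda w, p: A @ p, np.zeros(2))
```

Here `hess_vec` needs the Hessian explicitly, which is exactly the cost that the scaled variant avoids.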

Scaled Conjugate Gradient

It is possible to estimate the step size with an approach other than the line-search technique. The idea is to estimate the term $\tilde s_k=E''(\tilde w_k)\,\tilde p_k$ in CG with a nonsymmetric approximation of the form given by Equation (13) [102]:
$$\tilde s_k=E''(\tilde w_k)\,\tilde p_k\approx\frac{E'(\tilde w_k+\sigma_k\tilde p_k)-E'(\tilde w_k)}{\sigma_k},\quad 0<\sigma_k\ll 1. \tag{13}$$
In the limit $\sigma_k\to 0$, this approximation tends to the true value of $E''(\tilde w_k)\,\tilde p_k$.
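A quick numerical check of this finite-difference estimate is given below. The quartic test function and the helper name `hessian_vector_fd` are assumptions chosen so that the exact Hessian-vector product is known in closed form.

```python
import numpy as np

def hessian_vector_fd(grad, w, p, sigma=1e-6):
    """One-sided difference of the gradient along p, as in Equation (13)."""
    return (grad(w + sigma * p) - grad(w)) / sigma

def grad_quartic(w):
    """Gradient of the assumed test function E(w) = sum(w**4) / 4."""
    return w ** 3

w = np.array([1.0, 2.0])
p = np.array([1.0, -1.0])
s_approx = hessian_vector_fd(grad_quartic, w, p)
s_exact = 3.0 * w ** 2 * p     # the Hessian is diag(3 * w**2)
```

Only two gradient evaluations are needed and the Hessian is never formed, which is the point of the approximation.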
In [103], the authors modified CG by introducing a scalar $\lambda_k$ to deal with the indefiniteness of $E''(\tilde w_k)$; see Equation (14):
$$\tilde s_k=\frac{E'(\tilde w_k+\sigma_k\tilde p_k)-E'(\tilde w_k)}{\sigma_k}+\lambda_k\tilde p_k. \tag{14}$$
The SCG pseudocode is provided in Algorithm 2. This algorithm takes as input the parameters $(\sigma,\tilde w_1,\lambda_1)$ together with the function $E$ to be minimized and its gradient. The output of SCG is the vector of neural network parameters that minimizes the loss $E$.
Algorithm 2 Scaled Conjugate Gradient Algorithm for Fast Supervised Learning
  1. Choose the weight vector $\tilde w_1$ and scalars $0<\sigma\le 10^{-4}$, $0<\lambda_1\le 10^{-6}$, $\bar\lambda_1=0$.
      Set $\tilde p_1=\tilde r_1=-E'(\tilde w_1)$, $k=1$, and success = true.
  2. If success = true, then calculate the second-order information:
       $\sigma_k=\sigma/|\tilde p_k|$
       $\tilde s_k=\dfrac{E'(\tilde w_k+\sigma_k\tilde p_k)-E'(\tilde w_k)}{\sigma_k}$
       $\delta_k=\tilde p_k^{T}\tilde s_k$
  3. Scale $\delta_k$: $\delta_k=\delta_k+(\lambda_k-\bar\lambda_k)\,|\tilde p_k|^{2}$
  4. If $\delta_k\le 0$, then make the Hessian matrix positive definite:
       $\bar\lambda_k=2\left(\lambda_k-\dfrac{\delta_k}{|\tilde p_k|^{2}}\right)$
       $\delta_k=-\delta_k+\lambda_k|\tilde p_k|^{2}$
       $\lambda_k=\bar\lambda_k$
  5. Calculate the step size:
       $\mu_k=\tilde p_k^{T}\tilde r_k$
       $\alpha_k=\mu_k/\delta_k$
  6. Calculate the comparison parameter:
       $\Delta_k=\dfrac{2\,\delta_k\,[E(\tilde w_k)-E(\tilde w_k+\alpha_k\tilde p_k)]}{\mu_k^{2}}$
  7. If $\Delta_k\ge 0$, then a successful reduction in error can be made:
       $\tilde w_{k+1}=\tilde w_k+\alpha_k\tilde p_k$
       $\tilde r_{k+1}=-E'(\tilde w_{k+1})$
       $\bar\lambda_k=0$, success = true.
      If $k \bmod N=0$, then restart the algorithm:
       $\tilde p_{k+1}=\tilde r_{k+1}$
      Else, build the new conjugate direction:
       $\beta_k=\dfrac{|\tilde r_{k+1}|^{2}-\tilde r_{k+1}^{T}\tilde r_k}{\mu_k}$
       $\tilde p_{k+1}=\tilde r_{k+1}+\beta_k\tilde p_k$
      If $\Delta_k>0.75$, reduce the scale parameter: $\lambda_k=\frac{1}{4}\lambda_k$.
      Else (that is, $\Delta_k<0$):
       $\bar\lambda_k=\lambda_k$
       success = false.
  8. If $\Delta_k<0.25$, then increase the scale parameter: $\lambda_k=\lambda_k+\dfrac{\delta_k(1-\Delta_k)}{|\tilde p_k|^{2}}$
  9. If the steepest descent direction $\tilde r_k\neq 0$, then set $k=k+1$ and go to step 2;
      else terminate and return $\tilde w_{k+1}$ as the desired minimum.
The value of $\sigma$ should be as small as possible, taking the machine precision into account. In addition, SCG can be used to minimize a loss function $E$ even when the Hessian $E''$ is not available, because only $E$ and its gradient $E'$ are evaluated. It should be noted that in this paper we use SCG to solve the subproblems associated with the encoder and decoder losses of the deep neural network introduced into FSVM to map the patterns into a higher-dimensional feature space.
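The scaled conjugate gradient iteration can be sketched compactly in Python. The code follows the standard formulation of SCG from the literature cited as [103]; the function name `scg`, the stopping tolerance, the iteration cap, and the quadratic test problem are assumptions added for illustration.

```python
import numpy as np

def scg(E, grad, w, sigma=1e-5, lam=1e-6, tol=1e-6, max_iter=500):
    """Scaled conjugate gradient: needs only E and its gradient."""
    N = w.size
    lam_bar = 0.0
    r = -grad(w)
    p = r.copy()
    success = True
    delta = 0.0
    for k in range(1, max_iter + 1):
        if np.linalg.norm(r) < tol:            # steepest-descent direction vanished
            break
        p2 = p @ p
        if success:                            # second-order information, Eq. (13)
            sigma_k = sigma / np.sqrt(p2)
            s = (grad(w + sigma_k * p) - grad(w)) / sigma_k
            delta = p @ s
        delta += (lam - lam_bar) * p2          # scale delta
        if delta <= 0:                         # force a positive-definite model
            lam_bar = 2.0 * (lam - delta / p2)
            delta = -delta + lam * p2
            lam = lam_bar
        mu = p @ r
        alpha = mu / delta                     # step size
        w_new = w + alpha * p
        Delta = 2.0 * delta * (E(w) - E(w_new)) / mu ** 2   # comparison parameter
        if Delta >= 0:                         # successful reduction in error
            w = w_new
            r_new = -grad(w)
            lam_bar = 0.0
            success = True
            if k % N == 0:
                p = r_new.copy()               # restart
            else:
                beta = (r_new @ r_new - r_new @ r) / mu
                p = r_new + beta * p
            r = r_new
            if Delta > 0.75:
                lam *= 0.25                    # trust the quadratic model more
        else:
            lam_bar = lam
            success = False
        if Delta < 0.25:
            lam += delta * (1.0 - Delta) / p2  # trust the quadratic model less
    return w

# Assumed test problem: a strictly convex quadratic.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w_min = scg(lambda w: 0.5 * w @ A @ w - b @ w, lambda w: A @ w - b, np.zeros(2))
```

Compared with Algorithm 1, no Hessian and no line search are required; the scalar `lam` plays the role of a trust-region parameter that grows when the quadratic model predicts the error poorly.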

6. Lagrangian Decomposition

Let $f$, $g_i$ ($i=1,\dots,m_1$), and $h_j$ ($j=1,\dots,m_2$) be $m_1+m_2+1$ functions defined on $\mathbb{R}^{n}$. Then, consider the optimization problem in Equation (15):
$$\min\ f(x)\quad\text{subject to}\quad g_i(x)=0,\ i=1,\dots,m_1;\qquad h_i(x)=0,\ i=1,\dots,m_2;\qquad x\in\mathbb{R}^{n}. \tag{15}$$
Partial Lagrangian relaxation: A partial Lagrangian relaxation dualizes the constraints $h_i=0$, that is, it moves them into the objective function weighted by the penalty parameters $\lambda_i$. This yields problem (16), which is easier to solve than (15) [104]:
$$\min\ f(x)+\langle\lambda,h(x)\rangle\quad\text{subject to}\quad g_i(x)=0,\ i=1,\dots,m_1;\qquad x\in\mathbb{R}^{n} \tag{16}$$
where $\langle\cdot,\cdot\rangle$ is the dot product. The best bound on the value of the problem in (15) that this relaxation can provide is obtained by solving the following dual problem (17):
$$\max_{\lambda\in\mathbb{R}^{m_2}}\ \min\bigl\{f(x)+\langle\lambda,h(x)\rangle\ :\ g_i(x)=0,\ i=1,\dots,m_1,\ x\in\mathbb{R}^{n}\bigr\}. \tag{17}$$
Lagrangian decomposition: To illustrate the idea behind this type of relaxation, we consider only the constraint $g$ in (15). Suppose that we can decompose $f$ and $g$ as shown in (18):
$$\min\ f_1(x)+f_2(y)\quad\text{subject to}\quad g(x,y)=g_1(x)+g_2(y)=0;\qquad x,y\in\mathbb{R}^{n}. \tag{18}$$
Dualizing the constraint $g=0$ with the penalty parameter $\lambda$ in the objective function is called Lagrangian decomposition. It yields two problems, (19) and (20), that are easier to solve than (18) [105], because each has fewer variables and constraints:
$$\min\ f_1(x)+\lambda\,g_1(x)\quad\text{subject to}\quad x\in\mathbb{R}^{n} \tag{19}$$
$$\min\ f_2(y)+\lambda\,g_2(y)\quad\text{subject to}\quad y\in\mathbb{R}^{n} \tag{20}$$
The best bound on the value of the problem in (18) that this decomposition can provide is obtained by solving the dual problem (21):
$$\max_{\lambda\in\mathbb{R}}\ \Bigl\{\min\{f_1(x)+\lambda\,g_1(x)\ :\ x\in\mathbb{R}^{n}\}+\min\{f_2(y)+\lambda\,g_2(y)\ :\ y\in\mathbb{R}^{n}\}\Bigr\}. \tag{21}$$
We use this principle in the following section to decompose the DDNN-FSVM model to obtain subproblems that are easier to solve.
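A one-dimensional toy instance makes the mechanism concrete. In the sketch below, $f_1(x)=(x-3)^2$, $f_2(y)=(y+1)^2$, $g_1(x)=x$, and $g_2(y)=-y$ (so the coupling constraint is $x=y$) are assumptions chosen because both subproblems have closed-form minimizers; the dual variable is updated by plain gradient ascent on the constraint residual.

```python
def solve_x(lam):
    """argmin_x (x - 3)^2 + lam * x, the subproblem of type (19)."""
    return 3.0 - lam / 2.0

def solve_y(lam):
    """argmin_y (y + 1)^2 - lam * y, the subproblem of type (20)."""
    return -1.0 + lam / 2.0

def dual_value(lam):
    """Sum of the two subproblem optima: a lower bound on the coupled minimum."""
    x, y = solve_x(lam), solve_y(lam)
    return (x - 3.0) ** 2 + lam * x + (y + 1.0) ** 2 - lam * y

def dual_ascent(lam=0.0, step=0.5, iters=100):
    """Maximize the dual as in (21); its gradient is the residual g1(x) + g2(y)."""
    for _ in range(iters):
        lam += step * (solve_x(lam) - solve_y(lam))
    return lam, solve_x(lam), solve_y(lam)

lam_opt, x_opt, y_opt = dual_ascent()
```

The two subproblems never see each other's variables and could be solved in parallel. For this convex example the bound is tight: the coupled problem has minimizer $x=y=1$ with value 8, and the dual reaches the same value at $\lambda=4$. In a non-convex setting the dual value would in general remain a strict lower bound.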

7. Proposed Decomposition Deep Neural Network Fuzzy Classification Model

Consider an auto-encoder made up of an encoder $x\mapsto\mathrm{enc}(w_e,x)$ and a decoder $y\mapsto\mathrm{dec}(w_d,y)$, where $x\in\mathbb{R}^{n}$ is an input sample, $w_e$ is the encoder operator's parameter matrix, $y\in\mathbb{R}^{n'}$ is the encoder's output, and $w_d$ is the decoder operator's parameter matrix.

7.1. Auto-Encoder for FSVM Artificial Space

Using FSVM to classify patterns requires an appropriate kernel map that sends the patterns to a space of suitable dimension in order to ensure the linear separability of the classes. In our case, we consider an encoder with the specific condition $n'\ge n$. The dual model associated with the FSVM takes the form of Equation (22):
$$\max_{\varrho}\ \sum_{i=1}^{N}\varrho_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\varrho_i\varrho_j t_i t_j K(w_e,s_i,s_j)\quad\text{subject to}\quad\sum_{i=1}^{N}\varrho_i t_i=0,\qquad 0\le\varrho_i\le m_i C,\quad i=1,\dots,N \tag{22}$$
where $K(w_e,s_i,s_j)=\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle$. It is possible to set the auto-encoder parameters $w_e$ and $w_d$ before the classification, then bring $x\mapsto\mathrm{enc}(w_e,x)$ into Equation (22). In our case, we perform classification and coding in parallel in order to tailor the auto-encoder to each dataset. To this end, we introduce the mapping loss and regularization terms in the objective function of the model in (22), which leads to the optimization problem in Equation (23):
$$\max\ \sum_{i=1}^{N}\varrho_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\varrho_i\varrho_j t_i t_j K(w_e,s_i,s_j)-P_d\sum_{i=1}^{N}\bigl\|\mathrm{dec}(w_d,\mathrm{enc}(w_e,s_i))-s_i\bigr\|^{2}-P_{e,r}\|w_e\|^{2}-P_{d,r}\|w_d\|^{2}$$
$$\text{subject to}\quad\sum_{i=1}^{N}\varrho_i t_i=0,\qquad 0\le\varrho_i\le m_i C,\quad i=1,\dots,N,\qquad\varrho_i\in\mathbb{R},\ w_e\in\mathbb{R}^{E},\ w_d\in\mathbb{R}^{D}. \tag{23}$$
This aggregation leads to a very high-dimensional model whose variables are the encoder parameters, the decoder parameters, and the $N$ dual variables, which can cause memory and time problems when solving it. To overcome this, we use the decomposition method described in the next section to decompose the problem in (23) into subproblems of reduced dimensionality, one for each of these three groups of variables.
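The encoder-induced kernel of Equation (22) is simply the Gram matrix of the encoded samples. The sketch below builds it and evaluates the dual objective; the tanh activation, the random weights, and the names `encoder_kernel` and `dual_objective` are assumptions for illustration.

```python
import numpy as np

def enc(w_e, S):
    """Encoder applied row-wise to the samples (tanh is an assumed activation)."""
    return np.tanh(S @ w_e)

def encoder_kernel(w_e, S):
    """K[i, j] = <enc(w_e, s_i), enc(w_e, s_j)>, the kernel used in Eq. (22)."""
    Z = enc(w_e, S)
    return Z @ Z.T

def dual_objective(rho, t, K):
    """Objective of Eq. (22) for given dual variables rho and labels t."""
    v = rho * t
    return rho.sum() - 0.5 * v @ K @ v

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 3))          # 6 samples, n = 3
w_e = rng.normal(size=(3, 10))       # 10 hidden units, more than n
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
K = encoder_kernel(w_e, S)
```

Because the kernel is an inner product of explicit feature vectors, it is symmetric and positive semidefinite for any choice of `w_e`, so the dual remains a concave maximization problem in `rho` while the encoder is being learned.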

7.2. Decomposition for Large Scale Classification

It is possible to decompose the problem in Equation (23) into two smaller subproblems by introducing a family of copy variables. For each sample $s_i$, we introduce a variable $e_i$ and a constraint $e_i=\mathrm{enc}(w_e,s_i)$, with the result that Equation (23) is transformed into Equation (24):
$$\max\ \sum_{i=1}^{N}\varrho_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\varrho_i\varrho_j t_i t_j\langle e_i,e_j\rangle-P_d\sum_{i=1}^{N}\bigl\|\mathrm{dec}(w_d,e_i)-s_i\bigr\|^{2}-P_{e,r}\|w_e\|^{2}-P_{d,r}\|w_d\|^{2}$$
$$\text{subject to}\quad\sum_{i=1}^{N}\varrho_i t_i=0,\quad 0\le\varrho_i\le m_i C,\quad e_i=\mathrm{enc}(w_e,s_i),\quad i=1,\dots,N,\qquad e_i\in\mathbb{R}^{n'},\ \varrho_i\in\mathbb{R},\ w_e\in\mathbb{R}^{E},\ w_d\in\mathbb{R}^{D} \tag{24}$$
where $E$ and $D$ are the number of encoder and decoder parameters, respectively. To decompose this last problem, we introduce the copy constraints into the objective function using the penalty parameters $P_c$, which provides the two subproblems (25) and (26):
$$\max\ \sum_{i=1}^{N}\varrho_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\varrho_i\varrho_j t_i t_j\langle e_i,e_j\rangle+\sum_{i=1}^{N}\langle P_c,e_i\rangle-P_d\sum_{i=1}^{N}\bigl\|\mathrm{dec}(w_d,e_i)-s_i\bigr\|^{2}-P_{d,r}\|w_d\|^{2}$$
$$\text{subject to}\quad\sum_{i=1}^{N}\varrho_i t_i=0,\quad 0\le\varrho_i\le m_i C,\quad i=1,\dots,N,\qquad\varrho_i\in\mathbb{R},\ e_i\in\mathbb{R}^{n'},\ w_d\in\mathbb{R}^{D}, \tag{25}$$
$$\min\ \sum_{i=1}^{N}\langle P_c,\mathrm{enc}(w_e,s_i)\rangle+P_{e,r}\|w_e\|^{2}\quad\text{subject to}\quad w_e\in\mathbb{R}^{E}. \tag{26}$$
Knowing that $e_i=\mathrm{enc}(w_e,s_i)$ for all $i=1,\dots,N$, we begin by solving the problem in (26) and introduce the resulting values of $e_i$ into the objective function of (25), which makes it possible to decompose the latter into two small-dimensional problems (27) and (28):
$$\max_{\varrho}\ \sum_{i=1}^{N}\varrho_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\varrho_i\varrho_j t_i t_j\langle e_i,e_j\rangle\quad\text{subject to}\quad\sum_{i=1}^{N}\varrho_i t_i=0,\quad 0\le\varrho_i\le m_i C,\quad i=1,\dots,N,\quad\varrho_i\in\mathbb{R}, \tag{27}$$
$$\min\ E_d(w_d)=P_d\sum_{i}\bigl\|\mathrm{dec}(w_d,e_i)-s_i\bigr\|^{2}+P_{d,r}\|w_d\|^{2}\quad\text{subject to}\quad w_d\in\mathbb{R}^{D}. \tag{28}$$
It is interesting to note that the models in (28) and (26) are unconstrained optimization problems, indicating that the complexity of the problem in (24) has been broken down using Lagrangian decomposition. In addition, the model in (26) does not take the decoder quality parameters into account, which can lead to a poor-quality mapping. To address this challenge, the model must ensure a high degree of dissimilarity between the instances of the different classes. To this end, we introduce the term $T(w_e)$, provided in Equation (29), into the objective function of the model (26):
$$T(w_e)=\sum_{i\neq j=1}^{N}\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle. \tag{29}$$
This gives the unconstrained optimization problem in Equation (30):
$$\min\ E_p(w_e)=\sum_{i=1}^{N}\langle P_c,\mathrm{enc}(w_e,s_i)\rangle+P_{e,r}\|w_e\|^{2}-\sum_{i\neq j=1}^{N}\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle\quad\text{subject to}\quad w_e\in\mathbb{R}^{E}. \tag{30}$$
In the case where the auto-encoder has a single hidden layer, the following result shows that the objective function of the problem in (30) is a quadratic function if the encoder has a linear transfer function.
Proposition 1. 
Hypothesis (H1): The autoencoder has a single hidden layer. Hypothesis (H2): The encoder has a linear transfer function.
Result: The objective function of the problem in (30) is a polynomial of degree 2.
According to (H1) and (H2), for each sample $s_i$ and each hidden neuron $q$ we have
$$\mathrm{enc}(w_e,s_i)_q=\sum_{l=1}^{n}w^{e}_{l,q}\,s_{i,l}.$$
Then, we have
$$\langle P_c,\mathrm{enc}(w_e,s_i)\rangle=\sum_{q=1}^{H}\sum_{l=1}^{n}P_{c,q}\,w^{e}_{l,q}\,s_{i,l},$$
which is a linear term with respect to $w_e$; on the other hand, we have
$$\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle=\sum_{q=1}^{H}\Bigl(\sum_{l=1}^{n}w^{e}_{l,q}\,s_{i,l}\Bigr)\Bigl(\sum_{l=1}^{n}w^{e}_{l,q}\,s_{j,l}\Bigr),$$
which is a sum of products of two linear terms in $w_e$, hence of degree 2.
Thus, the cost function of the model in (30) is a quadratic function with respect to the variable w e .
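Proposition 1 can be checked numerically: along any line $w_e+a\,d$ in weight space, a polynomial of degree 2 has a vanishing third finite difference. The sketch below assumes a bias-free linear encoder $\mathrm{enc}(w_e,s)=s^{T}w_e$ and random data; the names `projection_cost` and `phi` are illustrative.

```python
import numpy as np

def projection_cost(w_e, S, P_c, P_er):
    """Objective of (30) for a linear, single-hidden-layer encoder (H1, H2)."""
    Z = S @ w_e                           # row i is enc(w_e, s_i)
    G = Z @ Z.T                           # all pairwise inner products
    off_diag = G.sum() - np.trace(G)      # sum over i != j
    return (Z @ P_c).sum() + P_er * np.sum(w_e ** 2) - off_diag

rng = np.random.default_rng(2)
S = rng.normal(size=(6, 3))               # N = 6 samples, n = 3
w_e = rng.normal(size=(3, 4))             # H = 4 hidden neurons
d = rng.normal(size=(3, 4))               # arbitrary direction in weight space
P_c = rng.normal(size=4)

def phi(a):
    """Restriction of the cost to the line w_e + a * d."""
    return projection_cost(w_e + a * d, S, P_c, P_er=0.4)

# Third finite difference; it is zero (up to rounding) for a quadratic.
third_diff = phi(3.0) - 3.0 * phi(2.0) + 3.0 * phi(1.0) - phi(0.0)
# Second finite difference; constant in a for a quadratic.
second_diff_0 = phi(2.0) - 2.0 * phi(1.0) + phi(0.0)
second_diff_1 = phi(3.0) - 2.0 * phi(2.0) + phi(1.0)
```

With a nonlinear activation such as tanh the third difference no longer vanishes, which is why hypothesis (H2) is needed.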
The following results show that solving problems (26), (27), and (30) provides a lower bound for problem (25).    □
Theorem 1. 
(H1) Let $w_{e,d}^{*}$ be the optimal solution to the problem in (26), let $V_{e,d}$ be its value, and let $e_{i,d}^{*}=\mathrm{enc}(w_{e,d}^{*},s_i)$, $i=1,\dots,N$.
(H2) For the problem in (27), we substitute $e_{i,d}^{*}$, obtained with the encoder parameters $w_{e,d}^{*}$, for $e_i$. Let $\varrho_{d}^{*}$ be the optimal solution to the problem in (27), and let $V_{c,d}$ be its value.
(H3) For the problem in (28), we substitute $e_{i,d}^{*}$, obtained with the encoder parameters $w_{e,d}^{*}$, for $e_i$. Let $w_{d,d}^{*}$ be the optimal solution of the problem in (28), and let $V_{d,d}$ be its value.
(H4) Let $(\varrho^{*},w_e^{*},w_d^{*})$ be the solution of the problem in (25), and let $V$ be its value.
Result: We have $V_{c,d}-V_{e,d}-V_{d,d}\le V$.
We have $w_{e,d}^{*}$ as the optimal solution of the problem in (26). Let $e_{i,d}^{*}=\mathrm{enc}(w_{e,d}^{*},s_i)$, $i=1,\dots,N$, and let $V_{e,d}$ be the optimal value of the problem in (26). Then,
$$\sum_{i=1}^{N}\langle P_c,\mathrm{enc}(w_e,s_i)\rangle+P_{e,r}\|w_e\|^{2}-\sum_{i\neq j=1}^{N}\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle\ \ge\ V_{e,d},\quad\forall\,w_e\in\mathbb{R}^{E}. \tag{31}$$
We have $\varrho_{d}^{*}$ as the optimal solution of the problem in (27), in which $e_i$ is set to $e_{i,d}^{*}$, and $V_{c,d}$ is the optimal value of the problem in (27). Then,
$$V_{c,d}\ \ge\ \sum_{i=1}^{N}\varrho_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\varrho_i\varrho_j t_i t_j\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle,\quad\forall\,\varrho_i\in[0,m_i C]\ \text{with}\ \sum_{i=1}^{N}\varrho_i t_i=0. \tag{32}$$
We have $w_{d,d}^{*}$ as the optimal solution of the problem in (28), in which $e_i$ is set to $e_{i,d}^{*}$, and $V_{d,d}$ is the optimal value of the problem in (28). Then,
$$P_d\sum_{i}\bigl\|\mathrm{dec}(w_d,\mathrm{enc}(w_e,s_i))-s_i\bigr\|^{2}+P_{d,r}\|w_d\|^{2}\ \ge\ V_{d,d}. \tag{33}$$
We have $(\varrho_{d}^{*},w_{e,d}^{*},w_{d,d}^{*})$ as a feasible solution of the problem in (25); thus, the optimal value of the latter is greater than or equal to the value of the objective function of (25) at the point $(\varrho_{d}^{*},w_{e,d}^{*},w_{d,d}^{*})$. Thus, from Equations (31)–(33) we have
$$V_{c,d}-V_{e,d}-V_{d,d}\le V.$$
We call the function defined by Equation (34) the projection cost:
$$E_p(w_e)=\sum_{i=1}^{N}\langle P_c,\mathrm{enc}(w_e,s_i)\rangle+P_{e,r}\|w_e\|^{2}-\sum_{i\neq j=1}^{N}\langle\mathrm{enc}(w_e,s_i),\mathrm{enc}(w_e,s_j)\rangle. \tag{34}$$
For a given $e_i$, $i=1,\dots,N$, we call the function defined by Equation (35) the classification cost:
$$E_c(\varrho)=\sum_{i}\varrho_i-\frac{1}{2}\sum_{i,j}\varrho_i\varrho_j t_i t_j\langle e_i,e_j\rangle. \tag{35}$$
For a given $e_i$, we call the function defined by Equation (36) the reconstruction cost:
$$E_d(w_d)=P_d\sum_{i}\bigl\|\mathrm{dec}(w_d,e_i)-s_i\bigr\|^{2}+P_{d,r}\|w_d\|^{2}. \tag{36}$$
In Theorem 1, we show that introducing copying constraints into the objective function of problem in (23) leads to two subproblems for which the sum of the objective function yields only a lower bound. Moreover, when solving (27), we do not need to know the quality of the reconstruction of the samples, which can be considered an advantage from the point of view of complexity.
Lagrangian decomposition allows us to decompose the optimization model, which has a very large number of variables, into three subproblems with a smaller number of variables. In turn, this allows us to solve optimization problems with large-scale data. Notably, SMO, ISDA, and SCG are all designed to solve optimization problems involving large-scale data.    □

7.3. The DDNN-FSVM Algorithm

In this section, we describe the algorithm associated with the proposed method, which calls the SCG, L1QP, and ISDA algorithms.
The pseudocode associated with the DDNN-FSVM algorithm is provided in Algorithm 3. Step 4 cannot be performed before performing Steps 2 and 3. Moreover, Steps 1 and 2 can be performed at the same time. The flow chart in Figure 2 illustrates the different steps in Algorithm 3.
Algorithm 3 DDNN-FSVM Algorithm
Inputs (data): $D=\{(s_1,t_1),\dots,(s_N,t_N)\}$
Inputs (parameters): $0<P_e\le 1$, $0<P_{e,r}\le 1$, $0<P_d\le 1$, $0<P_{d,r}\le 1$, $H$ (number of autoencoder hidden neurons), scalars $0<\sigma\le 10^{-4}$, $0<\lambda_1\le 10^{-6}$, and $C$.
Outputs: $VS^{*}$ (optimal support vectors), $w_e^{*}$ (parameters of the encoder), $w_d^{*}$ (parameters of the decoder).
  1. $w_e^{*}$ = SCG($P_d$, $P_{d,r}$, $\sigma$, $\lambda_1$);
  2. $w_d^{*}$ = SCG($P_e$, $P_{e,r}$, $\sigma$, $\lambda_1$); % This step can be carried out in parallel with step 1.
  3. $m_1,\dots,m_N$ = membership($D$). % The membership degrees are calculated using (2).
  4. $VS^{*}$ = solver($w_e^{*}$, $C$). % The solver can be SMO, L1QP, or ISDA.
Note: Let $N_{hn}$ be the number of hidden neurons in the AE. Let $\phi_e(N_{hn})$ be the complexity of SCG applied to $E_p$, $\phi_d(N_{hn})$ the complexity of SCG applied to $E_d$, and $\phi_c(N_{hn})$ the complexity of the solver (SMO, L1QP, or ISDA) applied to (27). Then, the complexity of DDNN-FSVM is $\phi_{DDNN\text{-}FSVM}(N_{hn})=\phi_e(N_{hn})\vee\phi_d(N_{hn})+\phi_c(N_{hn})$, where $\vee$ represents the max operator in the context of complexity. Algorithm 3 can be used in the healthcare field, for instance to predict antiviral peptide sequences [106].

8. Experimentation

We implemented several algorithms to perform the classification task on different datasets from various fields: FSVM/Optimizer/Kernel(pk) and DNN(pm)/CGS-FSVM/Optimizer/Kernel(pk), where Optimizer $\in\{\mathrm{SMO},\mathrm{ISDA},\mathrm{L1QP}\}$ and Kernel $\in\{\text{linear},\text{Gaussian},\text{polynomial}\}$. For the polynomial kernel, pk = 3, …, 10; pm = (numbNeurs, regulParam) is the set of the DNN parameters, with numbNeurs = 3, …, 30 (empirically defined as a function of the dataset) and regulParam $\in[0.001,0.1]$. We tested Algorithm (27) for several values ($0<P_e\le 1$, $0<P_{e,r}\le 1$, $0<P_d\le 1$, $0<P_{d,r}\le 1$), scalars ($0<\sigma\le 10^{-4}$, $0<\lambda_1\le 10^{-6}$), and values of $C$. We adopted the values that provide a good compromise between accuracy, precision, recall, and f-measure, namely, $P_e=0.31$, $P_{e,r}=0.4$, $P_d=0.72$, $P_{d,r}=0.01$, $\sigma=10^{-5}$, and $\lambda_1=10^{-6}$.
As we split the problem in (26) into three independent subproblems, the mapping programs and the classification tasks were all implemented in a parallel manner.

8.1. DDNN-FSVM to Numerical Data

The algorithm was tested on eight commonly used datasets, which were all divided randomly into 70% training and 30% testing sets [107]. The attributes of each dataset are summarized in Table 3.
In machine learning, classification models are evaluated through a series of metrics such as $\mathrm{Accuracy}=\dfrac{TP+TN}{TP+TN+FP+FN}$, $\mathrm{Precision}=\dfrac{TP}{TP+FP}$, $\mathrm{Recall}=\dfrac{TP}{TP+FN}$, and $\mathrm{F\text{-}Measure}=\dfrac{2\times\mathrm{Recall}\times\mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}$, where TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative, respectively.
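These metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. The g-mean is taken here as the geometric mean of recall and specificity, which is the usual definition but is an assumption with respect to this paper, as are the function name `binary_metrics` and the example counts.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F-measure and g-mean from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)   # assumed definition
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "g_mean": g_mean}

# Example counts (assumed): 40 TP, 10 FP, 45 TN, 5 FN.
scores = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because recall can be written as $1/(1+FN/TP)$, a drop in recall at fixed TP corresponds directly to a larger FN/TP ratio, which is how recall differences between classifiers are read.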
Table 4 presents the accuracy, precision, recall, and f-measure of FSVM on seven datasets (iris, ionosphere, abalone, wine, pima, equilibrium, and ecoli) using three optimizers (SMO, ISDA, and L1QP) and considering linear, Gaussian, and polynomial kernel functions. On average, considering the four performance measures over the seven datasets and considering the Confidence Interval (CI), the versions of FSVM (kernel, Solver) are classed as follows: FSVM (Gauss, ISDA) with 83.86 ± 6.85; FSVM (Gauss, L1QP) with 80.72 ± 9.04; FSVM (linear, L1QP) with 67.4 ± 4.85; FSVM (Gauss, SMO) with 67.4 ± 6.17; FSVM (linear, ISDA) with 65.9 ± 5.65; and FSVM (linear, SMO) with 65.2 ± 3.89. From these results, we conclude that the Gaussian kernel and ISDA algorithm enable FSVM to achieve a good compromise between performance and stability. In fact, the complexity of polynomials of degree greater than 3 allows for very good generation; in addition, fuzzy weights make the SVM more stable.
We carried out an additional experimental study on the kernel and solver in the FSVM context. Table 5 provides the mean and confidence interval of the accuracy, precision, recall, and f-measure of FSVM on seven datasets using three optimizers (SMO, ISDA, and L1QP). Different values of the metrics were recorded for different degrees of the polynomial kernel (between 3 and 10). In this sense, the classifiers can be ordered as follows: FSVM (poly, SMO) with 70.19 ± 8.22; FSVM (poly, ISDA) with 68 ± 8.4; and FSVM (poly, L1QP) with 67.71 ± 7.32. On average, we can experimentally confirm that FSVM (poly) is less efficient and less sensitive than FSVM (Gauss, ISDA). This can be explained by the fact that the distribution of any type of data is the sum of the weighted Gaussian distributions using weights representing the inverse of the data size for each group, while the means and standard deviations are the respective centers and covariance matrices of each group.
We solved the three subproblems (25)–(27) using DDNN ([numNeur, 0.01])/CGS-FSVM/Optimizer, where Optimizer { S M O ,   I S D A ,   L 1 Q P } for different number of neurons (between 3 and 30) depending on the trained datasets. The regulation parameter was chosen experimentally; we tested DDNN-FSVM for different values of these parameters and adopted those associated with the best performance. Table 6 shows the mean and confidence interval of the accuracy, precision, recall, and f-measure results for DDNN-FSVM on the seven datasets using the three optimizers (SMO, ISDA, and L1QP). First, DDNN-FSVM does not need a kernel function, as the DNN allows the data to be mapped to a synthetic feature space where the separation between the data is clearer. Compared to FSVM (kernel, Solver), kernel ∈ {linear, Gaussian, polynom}, and Solver ∈ {SMO, ISDA, L1QP}, DDNN-FSVM ([numNeur, 0.01], CGS) (Optimizer) provides the best performance and highest stability. On average, across all datasets and measures, DDNN-FSVM delivers a 10% improvement over FSVM. On average and across all datasets, DDNN-FSVM improves on the performance of FSVM by 20% in terms of accuracy, 20% in terms of precision, and 12% in terms of f-measure, but reduces the recall by 4%. This means that the FN/TP ratio of DDNN-FSVM is greater than the FN/TP ratio of FSVM. As the accuracy of DDNN-FSVM is greater than the accuracy of FSVM, the FN of DDNN-FSVM is greater than the FN of FSVM.
To examine whether it is possible to reduce the aforementioned degradation in the recall metric, we implemented DDNN-FSVM for different numbers of neurons and different values of the regularization parameter. Figure 5 shows the curves of the different metrics for different numbers of neurons (between 3 and 30) in the DDNN neural network. All metrics reached a value of 100% across all of the datasets under consideration, indicating that it is possible to adjust the number of autoencoder hidden neurons to further improve the performance of DDNN-FSVM.
Figure 6 shows the accuracy, precision, recall, f-measure, and g-means of the DDNN-FSVM classifier applied to four datasets for different values of the DNN regulation parameter. Across all the datasets under consideration, all the metrics reached a value of 100%, indicating that it is possible to manipulate the regulation parameters to further improve the performance of DDNN-FSVM.
To compare DDNN-FSVM to other non-kernel classifiers, we implemented five other classifiers (KNN, BN, DT, Robust Boost, and Random Subspace) to classify four datasets (iris, balance, ecoli, ionosphere). Table 7 provides the accuracy, precision, recall, f-measure, and g-means of the KNN, BN, DT, RobustBoost, Random Subspace, FSVM, and DDNN-FSVM classifiers applied to these four datasets. Considering all datasets, DDNN-FSVM outperforms all other considered classifiers. Its superiority becomes especially clear on the iris data set. This can be explained by the fact that our classification model is a hybrid incorporating a neural network, fuzzy logic, and a robust quadratic optimizer, which offers sufficient complexity to generalize the acquired knowledge and sufficient flexibility to build a stable classification model. In addition, SCG, ISDA, and SMO can be used to solve large-scale optimization models.


We compared DDNN-FSVM and CNN-FSVM using MNIST, a collection of 70,000 images of handwritten digits from 0 to 9, each of size $28\times 28$; see Figure 7. The collection is divided into two parts, the first containing 60,000 images for training and the second containing 10,000 images for testing.
We used 7500 images to train DDNN-FSVM and CNN-FSVM and 2500 images to test them. To solve the weighted QP-FSVM problem, we used either L1QP, SMO, or ISDA. To minimize the projection loss or the reconstruction loss, we used the SCG algorithm described in Section 5.2. Regarding the comparison of DDNN-FSVM and CNN-FSVM, we considered five classification performance measures: accuracy, precision, recall, f-measure, and g-means.
In CNN-FSVM, the first part of the CNN is used to extract the features of each image (ten features). To this end, SCG was used to train the CNN. Figure 8 shows the progress of the loss and the performance of the CNN on the training dataset. We note that the accuracy and the loss function become stable and nearly constant from epoch 20, which can lead to overfitting.
Figure 9 and Figure 10 respectively show the progress of the encoding loss and reconstruction loss on the MNIST training set as a function of the epoch. The loss functions become stable and nearly constant only from epoch 100, which keeps DDNN-FSVM away from overfitting. Compared to the behavior of the loss of the CNN model trained by SCG, our model shows a high capacity to resist overfitting.
To further point out this effect, we tested CNN-FSVM (poly, L1QP), CNN-FSVM (poly, SMO), CNN-FSVM (poly, ISDA), DDNN-FSVM (SCG, SMO), DDNN-FSVM (SCG, L1QP), and DDNN-FSVM (SCG, ISDA) on the MNIST dataset. Table 8 shows the performance results of CNN-FSVM and DDNN-FSVM when applied to MNIST using L1QP, SMO, and ISDA. On average, DDNN-FSVM has the best performance considering accuracy, precision, f-measure, and g-means, but has low performance in terms of recall. This can be explained by the gap due to the decomposition. This defect can be corrected by selecting the best parameters using a genetic approach. However, we remark that DDNN-FSVM does not need a kernel function, as the AE neural network is introduced in FSVM to provide the kernel function.
Finally, it is possible to adjust the DDNN-FSVM parameters to obtain the maximum performance for each dataset in order to improve the performance of FSVM.

8.3. DDNN-FSVM in Healthcare

The Wisconsin Prognostic Breast Cancer (WPBC) dataset is widely used in breast cancer research and machine learning. It comprises features extracted from digitized fine needle aspiration (FNA) images of breast masses, with each feature describing specific properties of the cell nuclei in these images. This information is used to classify cases as either benign or malignant. The dataset comprises 198 instances and 32 attributes, each of which provides essential data on individual samples. These attributes are essential for the development of predictive models for breast cancer prognosis, enabling more accurate diagnosis and treatment planning.
In healthcare, the problem of data imbalance is a common challenge that can have a significant impact on the effectiveness of predictive models and decision-making processes. This imbalance often occurs in clinical datasets where certain conditions or diseases are much rarer than others. Such disparities can lead to models that work well overall but fail to correctly identify or predict outcomes for the minority class, which in this case could represent critical health conditions.
The consequences of this imbalance can be severe, potentially leading to missed diagnoses, ineffective treatments, and ultimately poor patient outcomes. For example, if a machine learning model is primarily trained on data from healthy patients, it may struggle to recognize patterns associated with rare diseases, leading to a high rate of false negatives. In addition, traditional evaluation metrics such as accuracy can be misleading, as a model can achieve high accuracy simply by predicting the majority class.
To deal with data imbalances in healthcare, several strategies can be implemented. Techniques such as oversampling the minority class, undersampling the majority class, or using advanced algorithms designed to handle imbalanced data (such as cost-sensitive learning or ensemble methods) can improve model performance. In addition, focusing on more informative metrics such as precision, recall, and f-measure can offer a clearer picture of a model’s effectiveness in predicting minority classes. Ultimately, addressing data imbalances is crucial to developing reliable predictive models that can improve patient care and health outcomes.
To solve the data imbalance problem, we used an oversampling method called Fuzzy C-Means Center–SMOTE (FCM-CSMOTE), which generates synthetic samples in each cluster by considering its center as the memory of the main data components [108].
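The idea of generating synthetic samples relative to a cluster center can be sketched as follows. This is a simplified illustration, not the FCM-CSMOTE algorithm of [108]: it assumes a single cluster whose center is the plain centroid of the minority samples (a stand-in for a fuzzy C-means center), and the function name `center_oversample` is hypothetical.

```python
import random


def center_oversample(minority, n_new, seed=0):
    """Create n_new synthetic points on the segments joining randomly
    chosen minority samples to the centroid of the minority class."""
    rng = random.Random(seed)
    dim = len(minority[0])
    # Assumption: the centroid stands in for a fuzzy C-means cluster center.
    center = [sum(x[j] for x in minority) / len(minority) for j in range(dim)]
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x[j] + u * (center[j] - x[j]) for j in range(dim)])
    return synthetic


# Toy minority class (illustrative values only).
minority = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
synthetic = center_oversample(minority, n_new=5)
print(len(synthetic), "synthetic samples generated")
```

Because every synthetic point is a convex combination of a real sample and the center, the new samples stay inside the region already occupied by the minority class.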
Table 9 shows the performance of different classifiers applied to the WPBC dataset, with and without class balancing. The tested classifiers were KNN, Bayesian Network, Decision Tree, RobustBoost, Random Subspace, and DDNN-FSVM. Overall, performance varies across classifiers. KNN reaches an accuracy of 80.00 without balancing, but with relatively low precision; its recall improves considerably after balancing, reaching 80.85. The Bayesian Network shows the weakest performance without balancing and only a slight improvement after balancing. On the other hand, Random Subspace achieves the best accuracy and precision in both scenarios, while DDNN-FSVM stands out for its overall performance, achieving the best recall and f-measure values with balancing. These results underline the positive impact of oversampling methods in improving positive case detection, particularly in the medical field, where the classes are often imbalanced.
These results show that oversampling techniques generally improve classifier performance, particularly with regard to recall and f-measure. Random Subspace and DDNN-FSVM perform particularly well, while the Bayesian Network and RobustBoost require adjustments in order to improve their efficiency on imbalanced datasets. Handling imbalanced data in this way is especially important in cancer classification, where accurate detection of positive cases is crucial.

9. Conclusions

Because of its ability to generalize and its sound theoretical basis, the SVM method has attracted the attention of scientists in various fields. Three main versions of this method have been developed: FSVM, TWSVM, and DSVM.
However, classification using the SVM kernel model encounters two major problems, namely, sensitivity to outliers and the problem of learning optimal kernels. Soft versions of SVM such as FSVM have been introduced to reduce the influence of outliers based on a membership weight associated with each instance. Several methods have been introduced for learning appropriate kernels, which are divided into the following categories: optimization-based models, kernel learning models, kernel selection, obtaining the optimal kernel, and obtaining the optimality conditions in kernel learning. Nevertheless, each of these suffers from certain shortcomings.
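The membership weighting used by FSVM can be illustrated with the class-center scheme of Lin and Wang [8], in which a sample's weight decreases with its distance from the mean of its class. This is a sketch under that assumption; the membership function used for DDNN-FSVM is not restated in this section, and the names `fuzzy_memberships` and `delta` are hypothetical.

```python
import math


def fuzzy_memberships(samples, delta=1e-6):
    """Class-center membership: s_i = 1 - d_i / (r + delta), where d_i is the
    distance of sample i to the class mean and r is the largest such distance."""
    dim = len(samples[0])
    center = [sum(x[j] for x in samples) / len(samples) for j in range(dim)]
    dists = [math.dist(x, center) for x in samples]
    radius = max(dists)
    # delta > 0 keeps every weight strictly positive, even for the farthest sample.
    return [1.0 - d / (radius + delta) for d in dists]


# One class with an obvious outlier as the last sample (illustrative values).
one_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
weights = fuzzy_memberships(one_class)
for point, s in zip(one_class, weights):
    print(point, round(s, 3))
```

In the FSVM dual, each weight s_i scales the upper bound of the corresponding multiplier (0 ≤ α_i ≤ s_i C), so a sample far from its class center, such as the last one here, has little influence on the separating hyperplane.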
In this paper, we present a new version of SVM intended to overcome these drawbacks, which we name Deep Decomposition Neural Network Fuzzy Support Vector Machine (DDNN-FSVM). DDNN-FSVM is a version of FSVM based on an original optimization model, and involves an autoencoder that maps the samples to a higher-dimensional space in order to ensure linear separation. The resulting model is very high-dimensional, which makes it difficult to solve directly, especially when dealing with very high-dimensional datasets. For this reason, we use Lagrangian decomposition to introduce copy constraints, thereby decomposing the model into three low-dimensional submodels: the classification model, projection model, and reconstruction model. The first submodel is solved using the SMO, L1QP, and ISDA solvers, which are designed to solve quadratic optimization problems. The second and third submodels are solved using scaled conjugate gradient, which is designed to train multilayer neural networks such as auto-encoders.
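The way copy constraints split a model into independently solvable parts, and why the result is a lower bound, can be written out in generic form. The functions $f_1$, $f_2$, $f_3$ and the multipliers $\lambda$, $\mu$ below are placeholder symbols for illustration and are not the exact classification, projection, and reconstruction submodels of the paper.

```latex
% Generic problem with three coupled terms (f_1, f_2, f_3 are placeholders):
v^{*} = \min_{x} \; f_1(x) + f_2(x) + f_3(x)

% Equivalent form with copy variables y, z and copy constraints:
v^{*} = \min_{x,y,z} \; f_1(x) + f_2(y) + f_3(z)
        \quad \text{s.t.} \quad y = x, \; z = x

% Relaxing the copy constraints with multipliers \lambda and \mu
% separates the problem into three independent subproblems:
L(\lambda,\mu) = \min_{x} \bigl[ f_1(x) + (\lambda + \mu)^{\top} x \bigr]
               + \min_{y} \bigl[ f_2(y) - \lambda^{\top} y \bigr]
               + \min_{z} \bigl[ f_3(z) - \mu^{\top} z \bigr]

% Any point with x = y = z is feasible for the relaxed problem and makes
% the multiplier terms cancel, hence:
L(\lambda,\mu) \le v^{*} \quad \text{for all } \lambda, \mu
```

The three minimizations share no variables, so they can be solved in parallel, and the sum of their optimal values bounds the optimal value of the original model from below.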
We tested DDNN-FSVM on several well-known datasets and compared it to well-known classifiers on the basis of five performance measures: precision, f-measure, g-means, accuracy, and recall. On average, DDNN-FSVM improves all performance measures of the classic FSVM version and outperforms the other tested classifiers. Compared to FSVM (kernel, Solver), with kernel ∈ {linear, Gaussian, polynomial} and Solver ∈ {SMO, ISDA, L1QP}, DDNN ([numNeur, 0.01], CGS)-FSVM (Optimizer) provides the best performance and highest stability. On average, on all datasets and across all metrics, DDNN-FSVM provides an improvement of 10% over the performance of the classic FSVM model. On average and across all datasets, DDNN-FSVM provides an improvement over the performance of FSVM by 20% in terms of accuracy, 20% in terms of precision, and 12% in terms of f-measure, although it results in a degradation of 4% in terms of recall. Lagrangian decomposition is used in this paper to decompose the introduced optimization problem into very low-dimensional submodels; however, the sum of the objective functions of these submodels at the optimum only provides a lower bound. Nevertheless, this bound can be very good if the penalty parameters are well chosen. In addition, poor communication between the mapping space and the reconstruction space due to the decomposition can lead to very poor reconstruction. Finally, like all known classifiers, the performance of DDNN-FSVM is influenced by imbalanced data, and can be improved by combining it with oversampling methods.
In the future, we will use a genetic algorithm with $\mathrm{Accuracy}(D) + \mathrm{Precision}(D) + \mathrm{Recall}(D) + \mathrm{FMeasure}(D) + \mathrm{Gmeans}(D)$ as the fitness function to select the best values of the different parameters of DDNN-FSVM associated with each dataset $D$. In addition, we will use our method to predict instances of breast cancer from mammography images. Moreover, we will use fuzzy C-means to compute the fuzzy weights of the different samples in FSVM in order to improve the bounds of the Lagrangian parameters in the SVM dual.

Author Contributions

Conceptualization, K.E.M.; Methodology, K.E.M. and S.M.; Validation, A.O.; Investigation, A.O.; Writing—original draft, M.R.; Writing—review & editing, A.Z. and S.M. All authors have read and agreed to the published version of the manuscript.


Funding

This research was funded by the Ministry of Science and Higher Education of the Russian Federation as part of the World-Class Research Center Program for Advanced Digital Technologies, grant number 075-15-2022-312, dated 20 April 2022.

Data Availability Statement

The dataset used in the experiments is publicly available at [107].

Conflicts of Interest

The authors declare no conflict of interest.

References


  1. Adankon, M.M.; Cheriet, M. Model selection for the ls-svm. application to handwriting recognition. Pattern Recognit. 2009, 42, 3264–3270. Available online: (accessed on 1 October 2024). [CrossRef]
  2. Guo, G.; Li, S.Z.; Chan, K.L. Support vector machines for face recognition. Image Vis. Comput. 2001, 19, 631–638. Available online: (accessed on 1 October 2024). [CrossRef]
  3. Khan, N.; Ksantini, R.; Ahmad, I.; Boufama, B. A novel SVM + NDA model for classification with an application to face recognition. Pattern Recognit. 2012, 45, 66–79. Available online: (accessed on 1 October 2024). [CrossRef]
  4. Ho, S.Y.; Yu, F.C.; Chang, C.Y.; Huang, H.L. Design of accurate predictors for dna-binding sites in proteins using hybrid svm–pssm method. Biosystems 2007, 90, 234–241. [Google Scholar] [CrossRef] [PubMed]
  5. Park, B.; Im, J.; Tuvshinjargal, N.; Lee, W.; Han, K. Sequence-based prediction of protein-binding sites in dna: Comparative study of two svm models. Comput. Methods Programs Biomed. 2014, 117, 158–167. [Google Scholar] [CrossRef]
  6. Deng, C.; Li, H.; Peng, D.; Liu, L.; Zhu, Q.; Li, C. Modelling the coupling evolution of the water environment and social economic system using pso-svm in the yangtze river economic belt, China. Ecol. Indic. 2021, 129, 108012. Available online: (accessed on 1 October 2024). [CrossRef]
  7. Huang, H.P.; Liu, Y.H. Fuzzy support vector machines for pattern recognition and data mining. Intl. J. Fuzzy Syst. 2001, 4, 826–835. [Google Scholar]
  8. Lin, C.-F.; Wang, S.-D. Fuzzy support vector machines. IEEE Trans. Neural Netw. 2002, 13, 464–471. [Google Scholar]
  9. Wang, Y.; Wang, S.; Lai, K.K. A new fuzzy support vector machine to evaluate credit risk. IEEE Trans. Fuzzy Syst. 2005, 13, 820–831. [Google Scholar] [CrossRef]
  10. Lu, Y.L.; Lei, L.I.; Zhou, M.M.; Tian, G.L. A new fuzzy support vector machine based on mixed kernel function. In Proceedings of the 2009 International Conference on Machine Learning and Cybernetics, Baoding, China, 2–15 July 2009; pp. 526–531. [Google Scholar]
  11. Tang, W.M. Fuzzy svm with a new fuzzy membership function to solve the two-class problems. Neural Process. Lett. 2011, 34, 209–219. [Google Scholar] [CrossRef]
  12. Almasi, O.N.; Rouhani, M. Fast and denoise support vector machine training method based on fuzzy clustering method for large real world datasets. Turk. J. Elec. Eng. Comp. Sci. 2016, 24, 219–233. [Google Scholar] [CrossRef]
  13. Khemchandani, R.; Chandra, S. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910. [Google Scholar]
  14. Tomar, D.; Agarwal, S. Twin support vector machine: A review from 2007 to 2014. Egypt. Inform. J. 2015, 16, 55–69. [Google Scholar] [CrossRef]
  15. Shao, Y.H.; Zhang, C.H.; Wang, X.B.; Deng, N.Y. Improvements on twin support vector machines. IEEE Trans. Neural. Netw. 2011, 22, 962–968. [Google Scholar] [CrossRef] [PubMed]
  16. Tian, Y.; Qi, Z.; Ju, X.; Shi, Y.; Liu, X. Nonparallel support vector machines for pattern classification. IEEE Trans. Cybern. 2014, 44, 1067. [Google Scholar] [CrossRef] [PubMed]
  17. Xu, Y.; Pan, X.; Zhou, Z.; Yang, Z.; Zhang, Y. Structural least square twin support vector machine for classification. Appl. Intell. 2015, 42, 527–536. [Google Scholar] [CrossRef]
  18. Chen, S.; Wu, X.; Zhang, R. A novel twin support vector machine for binary classification problems. Neural Process. Lett. 2016, 44, 795–811. [Google Scholar] [CrossRef]
  19. Xu, Y.; Yang, Z.; Pan, X. A novel twin support vector machine with pinball loss. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 359–370. [Google Scholar] [CrossRef]
  20. Debnath, R.; Takahide, N.; Takahashi, H. A decision based one-against-one method for multi-class support vector machine. Pattern Anal. Appl. 2004, 7, 164–175. [Google Scholar] [CrossRef]
  21. Hsu, C.-W.; Lin, C.-J. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 2002, 13, 415–425. [Google Scholar]
  22. Navia-Vazquez, A.; Parrado-Hernandez, E. Distributed support vector machines. IEEE Trans. Neural Netw. 2006, 17, 1091–1097. [Google Scholar] [CrossRef] [PubMed]
  23. Lu, Y.; Roychowdhury, V.; Vandenberghe, L. Distributed parallel support vector machines in strongly connected networks. IEEE Trans. Neural Netw. 2008, 19, 1167–1178. [Google Scholar]
  24. Flouri, K.; Beferull-Lozano, B.; Tsakalides, P. Distributed consensus algorithms for svm training in wireless sensor networks. In Proceedings of the 2008 16th European Signal Processing Conference, Lausanne, Switzerland, 25–29 August 2008; pp. 1–5. [Google Scholar]
  25. Kim, W.; Stanković, M.S.; Johansson, K.H.; Kim, H.J. A distributed support vector machine learning over wireless sensor networks. IEEE Trans. Cybern. 2015, 45, 2599–2611. [Google Scholar] [CrossRef] [PubMed]
  26. Scardapane, S.; Fierimonte, R.; Di Lorenzo, P.; Panella, M.; Uncini, A. Distributed semi-supervised support vector machines. Neural. Netw. 2016, 80, 43–52. [Google Scholar] [CrossRef]
  27. Yang, Z.; Bajwa, W.U. Rd-svm: A resilient distributed support vector machine. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2444–2448. [Google Scholar]
  28. Mercer, J. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Ser. A Cont. Pap. Math. Phy. Char. 1909, 209, 415–446. [Google Scholar]
  29. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: New York, NY, USA, 2004. [Google Scholar]
  30. Fung, G.; Rosales, R.; Rao, R.B. Feature selection and kernel design via linear programming. In Proceedings of the 20th International Joint Conference on Artificial Intelligence: IJCAI’07, Hyderabad, India, 6–12 January 2007; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2008; pp. 786–791. [Google Scholar]
  31. Hoi, S.C.H.; Lyu, M.R.; Chang, E.Y. Learning the unified kernel machines for classification. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: KDD’06, Philadelphia, PA, USA, 20–23 August 2006; ACM: New York, NY, USA, 2006; pp. 187–196. [Google Scholar] [CrossRef]
  32. Lanckriet, G.R.G.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 2004, 5, 27–72. [Google Scholar]
  33. Cortes, C.; Mohri, M.; Rostamizadeh, A. Learning non-linear combinations of kernels. In NIPS: Advances in Neural Information Processing Systems; Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2009; Volume 22, pp. 396–404. [Google Scholar]
  34. Szafranski, M.; Grandvalet, Y.; Rakotomamonjy, A. Composite kernel learning. In Proceedings of the 25th International Conference on Machine Learning: ICML’08, Helsinki, Finland, 5–9 July 2008; ACM: New York, NY, USA, 2008; pp. 1040–1047. [Google Scholar] [CrossRef]
  35. Herbster, M.; Pontil, M.; Wainer, L. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning: ICML’05, Bonn, Germany, 7–11 August 2005; ACM: New York, NY, USA, 2005; pp. 305–312. [Google Scholar] [CrossRef]
  36. Jaakkola, T.S.; Haussler, D. Exploiting generative models in discriminative classifiers. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, Denver, CO, USA, 30 November–5 December 1998; MIT Press: Cambridge, MA, USA, 1999; pp. 487–493. [Google Scholar]
  37. Adankon, M.M.; Cheriet, M. Optimizing resources in model selection for support vector machine. Pattern. Recognit. 2007, 40, 953–963. [Google Scholar] [CrossRef]
  38. Amari, S.; Wu, S. Improving support vector machine classifiers by modifying kernel functions. Neural. Netw. 1999, 12, 783–789. [Google Scholar] [CrossRef]
  39. Argyriou, A.; Hauser, R.; Micchelli, C.A.; Pontil, M. A dc-programming algorithm for kernel selection. In Proceedings of the 23rd International Conference on Machine Learning: ICML’06, Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, NY, USA, 2006; pp. 41–48. [Google Scholar] [CrossRef]
  40. Bach, F.R.; Lanckriet, G.R.G.; Jordan, M.I. Multiple kernel learning, conic duality, and the smo algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning: ICML’04, Banff, AB, Canada, 4–8 July 2004; ACM: New York, NY, USA, 2004; Volume 2, pp. 125–137. [Google Scholar] [CrossRef]
  41. Chen, B.; Liu, H.; Bao, Z. A kernel optimization method based on the localized kernel fisher criterion. Pattern. Recognit. 2008, 41, 1098–1109. [Google Scholar] [CrossRef]
  42. Bennett, K.P.; Momma, M.; Embrechts, M.J. MARK: A boosting algorithm for heterogeneous kernel models. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; ACM: New York, NY, USA, 2002; pp. 24–31. [Google Scholar] [CrossRef]
  43. Jebara, T.; Kondor, R.; Howard, A. Probability product kernels. J. Mach. Learn. Res. 2004, 5, 819–844. [Google Scholar]
  44. Cristianini, N.; Shawe-Taylor, J.; Elisseeff, A.; Kandola, J. On kernel-target alignment. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2002; Volume 14, pp. 367–373. [Google Scholar]
  45. Hertz, T.; Hillel, A.B.; Weinshall, D. Learning a kernel function for classification with small training samples. In Proceedings of the 23rd international conference on machine learning: ICML’06, Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, NY, USA, 2006; pp. 401–408. [Google Scholar] [CrossRef]
  46. Hoi, S.C.H.; Jin, R. Active kernel learning. In Proceedings of the 25th International Conference on Machine Learning: ICML’08, Helsinki, Finland, 5–9 July 2008; ACM: New York, NY, USA, 2008; pp. 400–407. [Google Scholar] [CrossRef]
  47. Davis, J.V.; Kulis, B.; Jain, P.; Sra, S.; Dhillon, I.S. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning: ICML’07, Corvalis, OR, USA, 20–24 June 2007; pp. 209–216. [Google Scholar] [CrossRef]
  48. Potra, F.A.; Wright, S.J. Interior-point methods. J. Comput. Appl. Math. 2000, 124, 281–302. [Google Scholar] [CrossRef]
  49. Hoi, S.C.H.; Jin, R.; Lyu, M.R. Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the 24th International Conference on Machine Learning: ICML’07, Corvalis, OR, USA, 20–24 June 2007; ACM: New York, NY, USA, 2007; pp. 361–368. [Google Scholar] [CrossRef]
  50. Tsuda, K.; Noble, W.S. Learning kernels from biological networks by maximizing entropy. Bioinformatics 2004, 20, 326–333. [Google Scholar] [CrossRef] [PubMed]
  51. Tsuda, K.; Kin, T.; Asai, K. Marginalized kernels for biological sequences. Bioinformatics 2002, 19, 2149–2156. [Google Scholar] [CrossRef] [PubMed]
  52. Moreno, P.J.; Ho, P.; Vasconcelos, N. A kullback-leibler divergence based kernel for svm classification in multimedia applications. In NIPS; Thrun, S., Saul, L.K., Schölkopf, B., Eds.; MIT Press: Cambridge, MA, USA, 2003. [Google Scholar]
  53. MacKay, D.J.C. Introduction to gaussian processes. NATO ASI Ser. F Comput. Syst. Sci. 1997, 168, 33–165. [Google Scholar]
  54. Rasmussen, C.E. Gaussian processes in machine learning. In Summer School on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2003; Volume 3176, pp. 63–71. [Google Scholar] [CrossRef]
  55. Cristianini, N.; Kandola, J.; Elisseeff, A.; Shawe-Taylor, J. On Optimizing Kernel Alignment; Technology Report; UC Davis Department of Statistics: Davis, CA, USA, 2003. [Google Scholar]
  56. Duan, K.; Keerthi, S.S.; Poo, A.N. Evaluation of simple performance measures for tuning svm hyperparameters. Neurocomputing 2003, 51, 41–59. [Google Scholar] [CrossRef]
  57. Ayat, N.; Cheriet, M.; Suen, C. Automatic model selection for the optimization of svm kernels. Pattern. Recognit. 2005, 38, 1733–1745. [Google Scholar] [CrossRef]
  58. Collins, M.; Schapire, R.E.; Singer, Y. Logistic regression, adaboost and bregman distances. Mach. Learn. 2002, 48, 253–285. [Google Scholar] [CrossRef]
  59. Schapire, R.E.; Singer, Y. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 1999, 37, 297–336. [Google Scholar] [CrossRef]
  60. Lanckriet, G.; Deng, M.; Cristianini, N.; Jordan, M.I.; Noble, W.S. Kernel-based data fusion and its application to protein function prediction in yeast. Pac. Symp. Biocomput. 2004, 11, 300–311. [Google Scholar]
  61. Platt, J.C. Fast Training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning; Scholkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208. [Google Scholar]
  62. Rakotomamonjy, A.; Bach, F.; Canu, S.; Grandvalet, Y. More efficiency in multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning: ICML’07, Corvalis, OR, USA, 20–24 June 2007; ACM: New York, NY, USA, 2007; pp. 775–782. [Google Scholar] [CrossRef]
  63. Gehler, P.; Nowozin, S. Infinite Kernel Learning; Technology Report; Max Planck Institute for Biological Cybernetics: Tuebingen, Germany, 2008. [Google Scholar]
  64. Wang, G.; Yeung, D.Y.; Lochovsky, F.H. A kernel path algorithm for support vector machines. In Proceedings of the 24th International Conference on Machine Learning: ICML’07, Corvalis, OR, USA, 20–24 June 2007; ACM: New York, NY, USA, 2007; pp. 951–958. [Google Scholar] [CrossRef]
  65. Argyriou, A.; Micchelli, C.A.; Pontil, M. Learning convex combinations of continuously parameterized basic kernels. In Learning Theory, Lecture Notes in Computer Science; Auer, P., Meir, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3559, pp. 338–352. [Google Scholar] [CrossRef]
  66. Kim, S.J.; Zymnis, A.; Magnani, A.; Koh, K.; Boyd, S. Learning the kernel via convex optimization. In Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, 31 March–4 April 2008; pp. 1997–2000. [Google Scholar] [CrossRef]
  67. Micchelli, C.A.; Pontil, M. Learning the kernel function via regularization. J. Mach. Learn. Res. 2005, 6, 1099–1125. [Google Scholar]
  68. Freund, R.M.; Mizuno, S. Interior Point Methods: Current Status and Future Directions. Working Papers 3924–3996, Massachusetts Institute of Technology (MIT), Sloan School of Management. 1996. Available online: (accessed on 1 October 2024).
  69. Horst, R.; Thoai, N.V. Dc programming: Overview. J. Optim. Theory Appl. 1999, 103, 1–43. [Google Scholar] [CrossRef]
  70. Li, F.; Fu, Y.; Dai, Y.H.; Sminchisescu, C.; Jue, W. Kernel learning by unconstrained optimization. J. Mach. Learn. Res. 2009, 5, 328–335. [Google Scholar]
  71. Micchelli, C.A.; Pontil, M. Feature space perspectives for learning the kernel. Mach. Learn. 2007, 66, 297–319. [Google Scholar] [CrossRef]
  72. Rückert, U.; Kramer, S. Kernel-based inductive transfer. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2008; pp. 220–233. [Google Scholar] [CrossRef]
  73. Raina, R.; Battle, A.; Lee, H. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, Corvalis, OR, USA, 20–24 June 2007. [Google Scholar]
  74. Abbasnejad, M.E.; Ramachandram, D.; Mandava, R. Optimizing kernel functions using transfer learning from unlabeled data. In Proceedings of the 2009 Second International Conference on Machine Vision, Dubai, United Arab Emirates, 28–30 December 2009; pp. 111–117. [Google Scholar] [CrossRef]
  75. Argyriou, A.; Evgeniou, T.; Pontil, M. Multi-task feature learning. In Advances in Neural Information Processing Systems; Schölkopf, B., Platt, J., Hoffman, T., Eds.; MIT Press: Cambridge, MA, USA, 2007; Volume 19, pp. 41–48. [Google Scholar]
  76. Evgeniou, T.; Micchelli, C.A.; Pontil, M. Learning multiple tasks with kernel methods. JMLR Org. 2005, 6, 615–637. [Google Scholar]
  77. Jebara, T. Multi-task feature and kernel selection for svms. In Proceedings of the Twenty-First International Conference on Machine Learning: ICML’04, Banff, AB, Canada, 4–8 July 2004; ACM: New York, NY, USA, 2004. [Google Scholar] [CrossRef]
  78. Kondor, R.I.; Lafferty, J. Diffusion kernels on graphs and other discrete structures. In Proceedings of the 23rd International Conference on Machine Learning: ICML’02, Las Vegas, NV, USA, 24–27 June 2002; pp. 315–322. [Google Scholar]
  79. Zhu, X.; Kandola, J.; Ghahramani, Z.; Lafferty, J. Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning. In Advances in Neural Information Processing Systems 17; Saul, L.K., Weiss, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 1641–1648. [Google Scholar]
  80. Zhuang, J.; Tsang, I.W.; Hoi, S.C.H. Simplenpkl: Simple non-parametric kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning: ICML’09, Montreal, QC, Canada, 14–18 June 2009; ACM: New York, NY, USA, 2009; pp. 1273–1280. [Google Scholar] [CrossRef]
  81. Yeung, D.Y.; Chang, H.; Dai, G. A scalable kernel-based semisupervised metric learning algorithm with out-of-sample generalization ability. Neural. Comput. 2008, 20, 2839–2861. [Google Scholar] [CrossRef]
  82. Abbasnejad, M.E.; Ramachandram, D.; Mandava, R. An unsupervised approach to learn the kernel functions: From global influence to local similarity. Neural. Comput. Appl. 2010, 19, 631–640. [Google Scholar] [CrossRef]
  83. Weinberger, K.Q.; Sha, F.; Saul, L.K. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty-First International Conference on Machine Learning: ICML’04, Banff, AB, Canada, 4–8 July 2004; ACM: New York, NY, USA, 2004. [Google Scholar]
  84. Shaw, B.; Jebara, T. Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning: ICML’09, Montreal, QC, Canada, 14–18 June 2009; ACM: New York, NY, USA, 2009; pp. 937–944. [Google Scholar] [CrossRef]
  85. Schölkopf, B.; Smola, A.J.; Bach, F. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002. [Google Scholar]
  86. El Moutaouakil, K.; El Ouissari, A.; Touhafi, A.; Aharrane, N. An Improved Density Based Support Vector Machine (DBSVM). In Proceedings of the 2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech), Marrakesh, Morocco, 24–26 November 2020; pp. 1–7. [Google Scholar]
  87. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  88. Safari, K.; Imani, F. A Novel Fuzzy-BELBIC Structure for the Adaptive Control of Satellite Attitude. In Proceedings of the ASME 2022 International Mechanical Engineering Congress and Exposition, Columbus, OH, USA, 30 October–3 November 2022; Volume 3: Advanced Materials: Design, Processing, Characterization and Applications; Advances in Aerospace Technology. ASME: New York, NY, USA, 2022; p. V003T04A033. [Google Scholar] [CrossRef]
  89. El Ouissari, A.; El Moutaouakil, K. Density based fuzzy support vector machine: Application to diabetes dataset. Math. Model. Comput. 2021, 8, 747–760. [Google Scholar] [CrossRef]
  90. Verma, R.N.; Deo, R.; Srivastava, R.; Subbarao, N.; Singh, G.P. A new fuzzy support vector machine with pinball loss. Discov. Artif. Intell. 2023, 3, 14. [Google Scholar] [CrossRef]
  91. Dhanasekaran, Y.; Murugesan, P. Improved bias value and new membership function to enhance the performance of fuzzy support vector machine. Expert Syst. Appl. 2022, 208, 118003. [Google Scholar] [CrossRef]
  92. Frieß, T.-T.; Cristianini, N.; Campbell, I.C.G. The Kernel-Adatron: A Fast and Simple Learning Procedure for Support Vector Machines. In Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, USA, 24–27 July 1998; Shavlik, J., Ed.; pp. 188–196. [Google Scholar]
  93. Huang, T.-M.; Kecman, V. Bias Term b in SVMs Again. In Proceedings of the ESANN 2004, 12th European Symposium on Artificial Neural Networks, Bruges, Belgium, 28–30 April 2004. [Google Scholar]
  94. Joachims, T. Making Large-Scale SVM Learning Practical. Advances in Kernel Methods-Support Vector Learning. 1999. Available online: (accessed on 1 October 2024).
  95. Kecman, V.; Vogt, M.; Huang, T.-M. On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Machines. In Proceedings of the 11th European Symposium on Artificial Neural Networks, ESANN, Bruges, Belgium, 23–25 April 2003; pp. 215–222. [Google Scholar]
  96. Osuna, E.; Freund, R.; Girosi, F. An Improved Training Algorithm for Support Vector Machines. In Proceedings of the Neural Networks for Signal Processing VII, Proceedings of the 1997 Signal Processing Society Workshop, Amelia Island, FL, USA, 24–26 September 1997; pp. 276–285.
  97. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. In Microsoft Research Technical Report MSR-TR-98-14; Microsoft Research: Redmond, WA, USA, 1998. [Google Scholar]
  98. Abbasi, S.; Mousavi, S.S.; Farbod, E.; Sorkhi, M.Y.; Parvin, M. Hybrid data mining and data-driven algorithms for a green logistics transportation network in the post-COVID era: A case study in the USA. Syst. Soft Comput. 2024, 6, 200156. [Google Scholar] [CrossRef]
  99. Moller, M.F. A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 1993, 6, 525–533. [Google Scholar] [CrossRef]
  100. Olshausen, B.A.; Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1. Vis. Res. 1997, 37, 3311–3325. [Google Scholar] [CrossRef] [PubMed]
  101. Yuan, G.; Meng, Z.; Li, Y. A modified Hestenes and Stiefel conjugate gradient algorithm for large-scale nonsmooth minimizations and nonlinear equations. J. Optim. Theory Appl. 2016, 168, 129–152. [Google Scholar] [CrossRef]
  102. Siddiqi, A.H.; Al-Lawati, M.; Boulbrachene, M. Modern Engineering Mathematics; Chapman and Hall/CRC: Boca Raton, FL, USA, 2017. [Google Scholar]
  103. Hilal, W.; Gadsden, S.A.; Yawney, J. Financial fraud: A review of anomaly detection techniques and recent advances. Expert Syst. Appl. 2022, 193, 116429. [Google Scholar] [CrossRef]
  104. Fisher, M.L.; Northup, W.D.N.; Shapiro, J.F. Using duality to solve discrete optimization problems: Theory and computational experience. Math. Program. Study 1975, 3, 56–94. [Google Scholar]
  105. Sun, Y.; Li, Z.; Tian, W.; Shahidehpour, M. A Lagrangian decomposition approach to energy storage transportation scheduling in power systems. IEEE Trans. Power Syst. 2016, 31, 4348–4356. [Google Scholar] [CrossRef]
  106. Kieslich, C.A.; Alimirzaei, F.; Song, H.; Do, M.; Hall, P. Data-driven prediction of antiviral peptides based on periodicities of amino acid properties. In Computer Aided Chemical Engineering; Türkay, M., Gani, R., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; Volume 50, pp. 2019–2024. ISBN 9780323885065. [Google Scholar] [CrossRef]
  107. Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019; Available online: (accessed on 1 October 2024).
  108. Mohammed, R. FCM-CSMOTE: Fuzzy C-Means Center-SMOTE. Expert Syst. Appl. 2024, 248, 123406. [Google Scholar] [CrossRef]
Figure 1. Tools used for building DDNN-FSVM.
Figure 2. DDNN–FSVM flow chart.
Figure 3. Different aspects of algorithms for learning a kernel.
Figure 4. Auto-encoder architecture. The left phase is the encoder that maps the instances into artificial space to find a linear separation. The right phase reconstructs the mapped instances to rejoin their original dimension. Here, $W_e$ and $W_d$ are the corresponding vectors of weights of the hidden layer and output layer, respectively.
Figure 5. Accuracy, sensitivity, precision, recall, f-measure, and g-means of the DDNN-FSVM classifier applied to four datasets for different numbers of DNN hidden neurons and sizes of the targeted space.
Figure 6. Accuracy, sensitivity, precision, recall, f-measure, and g-means of the DDNN-FSVM classifier applied to four datasets for different values of the DNN regulation parameter.
Figure 7. MNIST dataset.
Figure 8. CNN training progress on MNIST dataset.
Figure 9. Encoding loss progress using SCG algorithm on MNIST dataset.
Figure 10. Reconstruction loss progress using SCG algorithm on MNIST dataset.
Table 1. Main acronyms used in the paper.
| Acronym | Full Name | Acronym | Full Name |
|---|---|---|---|
| FSVM | Fuzzy Support Vector Machine | DNN | Deep Neural Network |
| QP | Quadratic Programming | PLR | Partial Lagrangian Relaxation |
| SMO | Sequential Minimal Optimization | LD | Lagrangian Decomposition |
| ISDA | Iterative Single-Data Algorithm | SCG | Scaled Conjugate Gradient |
| L1QP | L1 Dual of Quadratic Problem | enc(w,x) | Mapping x to enc(w,x) via the encoder of matrix parameter w |
| KKT | Karush-Kuhn-Tucker | dec(w,x) | Reconstruction of x via the decoder dec(w,x) of matrix parameter w |
| CG | Conjugate Gradient | DDNN-FSVM | Decomposition and Symmetric Kernel Deep Neural Network Fuzzy Support Vector Machine |
Table 2. Summary of different approaches for kernel identification.
| Category | References | Methods and tools | Limitations |
|---|---|---|---|
| Optimization | [40,46,58,59,60,68] | Convex optimization problem, gradient-based approximations. | Requires differentiability of the function, which is rarely the case. |
| | [29,37,38,39,57,63,69,78] | Data-dependent model, nonparametric model, and parametric model, Gaussian kernels, user-derived […] | Most existing methodologies are expressed as a linear mixture of fundamental kernels. |
| | [29,37,38,39,78] | Identifying the best kernel, then training the model, kernel weights, and SVM. | These techniques are only suitable for a specific training algorithm and only for a particular class of tasks. |
| | [33,35,37,38,39,45,55,57,63,77] | A mixture of labeled and unlabeled data, extra step to build the kernel matrix on the test and training […] | It is necessary to carry out an extra step to build the kernel matrix on the test and training data. Kernel construction algorithms have to deal with the difficult condition of positive definiteness of the kernel matrix to be […] |
| […] for optimality | [31,76,84] | Diagonal dominance between samples, constrained linear […] | The user must have a priori knowledge about the domain from which the data are extracted. |
| […] approaches to learn the kernel | [34,62,69,70,71,77] | Natural kernels, marginalization kernels, Bhattacharyya and Kullback–Leibler kernels, Bayesian inference, expectation-maximization algorithm. | It may not be systematically clear exactly what the distribution and its parameters are. |
| Adaptation to another kernel | [33,44,55] | Kernel alignment and divergence measures, similarity between matrices. | These tools are ineffective when there are very few or no labeled instances in the dataset. |
| […] to learner error rates | [30,32,35,36,37,39,41,42,43,47,49,50,52,53,54,56,57,59,60,61,65,67,73,74,76] | Cross-validation, SVMs, hyperparameters, experiential mistake criteria, decision-making function, AdaBoost, Gaussian combinations, geometric understanding, convex optimization, Sequential Minimal Optimization (SMO), Multiple Kernel Learning (MKL), Convexity Difference (CD), Bregman's divergence, Newton's method. | It may not be systematically clear exactly what the distribution and its parameters […] |
| The inherent […] of the dataset | [31,38,45,50,51,63,64,66,72,82,83] | Entry space preconditions, graph Laplacian, over-fitting, SVM, Riemannian geometry, kernel PCA. | Dimensionality reduction tools tend to be task-specific and not readily transferable to other tasks such as classification. |
Table 3. UCI data description.
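| Dataset | Features | Samples | Number of Classes | Subject Area |
|---|---|---|---|---|
| Wine | 13 | 178 | 3 | Physics and Chemistry |
| Pima | 8 | 768 | 2 | Health and Medicine |
| Balance | 4 | 625 | 5 | Social Science |
| Ionosphere | 34 | 351 | 2 | Physics and Chemistry |
| WPBC | 32 | 198 | 2 | Health and Medicine |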
Table 4. Accuracy, precision, recall, and f-measure of FSVM on seven datasets using three optimizers (SMO, ISDA, and L1QP) considering linear and Gaussian kernel functions.
Table 5. Means and confidence intervals for the accuracy, precision, recall, and f-measure of different FSVMs on seven datasets: FSVM (poly, SMO), FSVM (poly, ISDA), and FSVM (poly, L1QP).
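| Data | Accuracy (SMO) | Precision (SMO) | Recall (SMO) | F_Measure (SMO) | Accuracy (ISDA) | Precision (ISDA) | Recall (ISDA) | F_Measure (ISDA) | Accuracy (L1QP) | Precision (L1QP) | Recall (L1QP) | F_Measure (L1QP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CI (±) | 18.52 | 20.35 | 5.28 | 11.62 | 3.02 | 4.18 | 3.77 | 2.71 | 28.53 | 28.35 | 24.33 | 26.38 |
| CI (±) | 14.65 | 20.73 | 18.68 | 16.95 | 1.79 | 2.81 | 2.92 | 2.60 | 7.56 | 2.36 | 11.85 | 7.54 |
| CI (±) | 12.00 | 10.00 | 11.00 | 12.00 | 14.00 | 11.00 | 12.00 | 12.00 | 9.00 | 10.00 | 9.00 | 10.00 |
| CI (±) | 0.00 | 0.00 | cte | cte | 0.00 | 0.00 | cte | cte | 18.08 | 29.73 | 30.67 | 4.54 |
| CI (±) | 6.06 | 12.32 | 18.68 | cte | 7.77 | cte | 22.92 | cte | 6.47 | 16.26 | 21.54 | 0.93 |
| CI (±) | 12.40 | 13.49 | 3.83 | 9.20 | 0.86 | 0.78 | 1.38 | 0.88 | 10.77 | 10.69 | 11.80 | 11.11 |
| CI (±) | | | | | | | | | | | | |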
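Tables 5 and 6 report each metric as a mean with a "±" confidence half-width over repeated runs. The snippet below is a minimal sketch of one common way to obtain such a pair. The normal-approximation interval with z = 1.96 (a 95% level), the function name `mean_and_ci`, and the sample accuracies are all assumptions for illustration; the source does not state which interval was used.

```python
import math
import statistics


def mean_and_ci(scores, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval.

    z=1.96 corresponds to a 95% interval; this is an assumed choice,
    not necessarily the one used to produce the tables.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    if n < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical accuracies (%) from five repeated runs of one classifier.
runs = [91.0, 93.0, 92.0, 94.0, 90.0]
mean, ci = mean_and_ci(runs)
print(f"{mean:.2f} ± {ci:.2f}")

# A metric that takes the same value in every run has a zero-width interval.
print(mean_and_ci([100.0, 100.0, 100.0]))
```

For small numbers of runs, a Student t quantile in place of `z` would give a wider and more conservative interval.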
Table 6. Means and confidence intervals of the accuracy, precision, recall, and f-measure for DDNN-FSVM on seven datasets using three optimizers: SMO, ISDA, and L1QP.
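| Data | Accuracy | Precision | Recall | F_Measure | Accuracy | Precision | Recall | F_Measure | Accuracy | Precision | Recall | F_Measure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CI (±) | 1.49 | 2.96 | 2.18 | 1.66 | 2.31 | 4.38 | 3.01 | 3.20 | 1.53 | 3.62 | 2.27 | 1.94 |
| CI (±) | 3.48 | 3.94 | 6.98 | 6.71 | 3.65 | 2.99 | 8.05 | 3.50 | 4.05 | 3.55 | 10.71 | 4.59 |
| CI (±) | 1.59 | 1.59 | cte | 1.11 | 1.63 | 1.63 | cte | 1.16 | 1.55 | 1.55 | cte | 1.12 |
| CI (±) | 1.82 | 2.65 | 2.36 | 1.89 | 2.06 | 2.99 | 2.41 | 2.30 | 1.86 | 3.57 | 2.57 | 1.96 |
| CI (±) | | | | | | | | | | | | |
| CI (±) | 0.67 | 0.99 | 0.97 | 0.71 | 0.95 | 1.08 | 1.36 | 1.05 | 0.78 | 1.07 | 1.15 | 0.78 |
| CI (±) | 0.91 | 0.88 | 0.82 | 0.62 | 1. | | | | | | | |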
Table 7. Accuracy, precision, recall, f-measure, and g-means of the KNN, BN, DT, RobustBoost, Random Subspace, FSVM, and DDNN-FSVM classifiers applied to the four datasets.
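| Classifier | Accuracy | Precision | Recall | F-measure | G-means | Accuracy | Precision | Recall | F-measure | G-means |
|---|---|---|---|---|---|---|---|---|---|---|
| Random subspace | 80.00 | 70.00 | 100.00 | 82.35 | 79.06 | 95.35 | 96.67 | 94.57 | 95.60 | 95.40 |
| Decision Tree | 96.55 | 97.06 | 97.06 | 97.06 | 96.44 | 96.19 | 95.95 | 98.61 | 97.26 | 94.68 |
| Random Subspace | 74.14 | 72.22 | 100.00 | 83.87 | 45.88 | 93.33 | 100.00 | 90.28 | 94.89 | 95.01 |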
Table 8. CNN-FSVM and DDNN-FSVM performance on MNIST dataset using L1QP, SMO, and ISDA.
| Deep Model | Acc | Spec | Prec | Rec | FM | GM |
|---|---|---|---|---|---|---|
| CNN-FSVM (Lin., L1QP) | 0.8748 | 0.8776 | 0.8769 | 0.8720 | 0.8744 | 0.8748 |
| CNN-FSVM (Lin., SMO) | 0.89 | 0.8824 | 0.884161 | 0.8976 | 0.89083 | 0.889968 |
| CNN-FSVM (Lin., ISDA) | 0.7808 | 0.6784 | 0.733068 | 0.8832 | 0.801161 | 0.774056 |
| CNN-FSVM (Poly, L1QP) | 0.8028 | 0.7168 | 0.758362 | 0.8888 | 0.818416 | 0.79818 |
| CNN-FSVM (Poly, ISDA) | 0.6768 | 0.3536 | 0.607386 | 1 | 0.755744 | 0.594643 |
| CNN-FSVM (Poly, SMO) | 0.7918 | 0.6976 | 0.745715 | 0.886 | 0.809789 | 0.786118 |
| DDNN-SVM (L1QP, SCG) | 0.8424 | 0.8704 | 0.862712 | 0.8144 | 0.83786 | 0.841935 |
| DDNN-SVM (SMO, SCG) | 0.8426 | 0.8704 | 0.86277 | 0.8148 | 0.838099 | 0.842141 |
| DDNN-SVM (ISDA, SCG) | 0.8376 | 0.8732 | 0.86348 | 0.802 | 0.831605 | 0.836843 |
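The six columns of Table 8 are linked through the binary confusion matrix, which the following sketch makes explicit using the standard definitions. The function name `binary_metrics` and the confusion counts are hypothetical: a balanced test set of 2500 samples per class is assumed, with counts chosen so that the result matches the CNN-FSVM (Lin., SMO) row. The source does not report the actual counts.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute Acc, Spec, Prec, Rec, FM, and GM from binary confusion counts."""
    recall = tp / (tp + fn)              # sensitivity (true positive rate)
    specificity = tn / (tn + fp)         # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical counts: 2500 positives and 2500 negatives (assumed, not from the source).
m = binary_metrics(tp=2244, fn=256, tn=2206, fp=294)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

On a balanced test set, accuracy equals the arithmetic mean of recall and specificity, while the g-mean is their geometric mean. This is why the Acc and GM columns stay close together except when the two rates differ sharply, as in the CNN-FSVM (Poly, ISDA) row.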
Table 9. Accuracy, precision, recall, f-measure, and g-means of the KNN, BN, DT, RobustBoost, Random Subspace, FSVM, and DDNN-FSVM classifiers applied to the WPBC dataset.
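| Classifier | Accuracy | Precision | Recall | F-measure | G-means | Accuracy | Precision | Recall | F-measure | G-means |
|---|---|---|---|---|---|---|---|---|---|---|
| Random subspace | 83.33 | 80.00 | 30.77 | 44.44 | 54.88 | 80.22 | 91.43 | 68.09 | 78.05 | 79.65 |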
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

El Moutaouakil, K.; Roudani, M.; Ouhmid, A.; Zhilenkov, A.; Mobayen, S. Decomposition and Symmetric Kernel Deep Neural Network Fuzzy Support Vector Machine. Symmetry 2024, 16, 1585.
