1. Introduction
Extreme learning machine (ELM) [1,2], as a remarkable training method for single hidden layer feed-forward neural networks (SLFNs) [3], has been widely studied and applied in many fields, such as efficient modeling [4], fashion retailing forecasting [5], fingerprint matching [6], metagenomic taxonomic classification [7], online sequential learning [8], and feature selection [9]. In ELM, the input weights and hidden-layer biases are randomly generated, and the output weights of the network are computed efficiently by minimizing both the training error and the norm of the output weights. In addition, many researchers have tried to extend the ELM model to the support vector machine (SVM) learning framework to solve classification problems [10]. Frenay et al. [11] found that the transformation performed by the first layer of ELM can be viewed as a kernel that can be plugged into SVM. Since SVM-type optimization methods can be utilized to solve the ELM model, an extreme learning machine based on the optimization method (OPTELM) was proposed in [12]. For binary classification problems, traditional ELM needs to process all the training sample points at the same time in the training stage, which is time-consuming. Moreover, traditional ELM trains a single hyperplane to perform the classification task, which greatly restricts its application prospects and direction of evolution. Jayadeva et al. [13] proposed the twin SVM (TWSVM), a well-known non-parallel hyperplane classification algorithm for binary classification. Inspired by TWSVM, Wan et al. [14] proposed the twin extreme learning machine (TELM). Compared with ELM, TELM trains two non-parallel hyperplanes for classification by solving two smaller quadratic programming problems (QPPs). Compared with TWSVM, TELM's optimization problem has fewer constraints, so its training speed is faster and its application prospects are broader. In recent years, researchers have made many improvements to TELM, such as the sparse twin extreme learning machine [15], the robust twin extreme learning machine [16], a time-efficient variant of the twin extreme learning machine [17], and a generalized adaptive robust distance metric driven smooth regularization learning framework [18].
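To make the ELM training scheme above concrete, the following Python/NumPy sketch shows a minimal regularized ELM: a random, untrained hidden layer followed by a closed-form solve for the output weights. The function names, the sigmoid activation, and the regularization parameter C are our own illustrative choices, not the authors' code.

```python
import numpy as np

def elm_train(X, T, L=100, C=1.0, seed=0):
    """Minimal regularized ELM sketch: random hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L))            # random input weights (never trained)
    b = rng.standard_normal(L)                 # random hidden-layer biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid hidden-layer output matrix
    # beta minimizes ||H @ beta - T||^2 + (1/C) * ||beta||^2
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                            # signed decision values
```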
Although the above ELM-based algorithms have good classification performance, they ignore the statistical knowledge contained in the data itself. However, such statistical knowledge is very important for constructing an efficient classifier. Fisher discriminant analysis (FDA) is an effective discriminant tool that minimizes the intra-class divergence of the data while keeping the inter-class divergence constant. From the above discussion, it is natural to construct a new classification model by combining the characteristics of the ELM model and FDA. In recent years, Ma et al. [19] have successfully combined them and proposed the Fisher-regularized extreme learning machine (FELM), which not only retains the efficient solution of ELM but also fully considers the statistical knowledge of the data.
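In the ELM feature space, the statistical knowledge in question is typically captured by the within-class scatter of the hidden-layer outputs. The sketch below (our own illustration, not the authors' code) shows one common construction; a Fisher regularization term of the form beta' * Sw * beta can then be added to the objective to penalize intra-class divergence.

```python
import numpy as np

def within_class_scatter(H, y):
    """Within-class scatter matrix S_w of hidden-layer outputs H (one row per sample)."""
    L = H.shape[1]
    Sw = np.zeros((L, L))
    for c in np.unique(y):
        Hc = H[y == c]
        D = Hc - Hc.mean(axis=0)   # deviations from the class mean
        Sw += D.T @ D              # accumulate the scatter of each class
    return Sw
```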
Although the above models achieve good classification performance, most of them adopt the $L_2$-norm. When the data contain noise or outliers, they cannot deal with them well, which degrades the classification performance of the model. In recent years, researchers have tried to introduce the $L_1$-norm into various models [20,21,22,23] to reduce the impact of outliers. These studies have shown that the $L_1$-norm can reduce the effect of outliers to some extent, but it is still unsatisfactory when the data contain a large number of outliers. Recently, researchers have introduced the idea of truncation into the $L_1$-norm, constructed the capped $L_1$-norm, and applied it to various models [24,25,26]. Many studies [27,28] show that the capped $L_1$-norm not only inherits the advantages of the $L_1$-norm but is also bounded, so it is more robust and approximates the $L_0$-norm to some degree. For instance, by applying the capped $L_1$-norm to the twin SVM, Wang et al. [29] proposed a new robust twin support vector machine (C$L_1$-TWSVM). Based on the twin support vector machine with privileged information (TWSVMPI) [30], a new robust TWSVMPI [31] was proposed by replacing the $L_2$-norm with the capped $L_1$-norm. The new model further improves the anti-noise ability of the original.
To utilize the advantages of the twin extreme learning machine and FDA, we first propose a novel classifier named the Fisher-regularized twin extreme learning machine (FTELM). Considering the instability of the $L_2$-norm with respect to outliers, we then introduce the capped $L_1$-norm into the FTELM model and propose a more robust capped $L_1$-norm FTELM (C$L_1$-FTELM) model.
The main contributions of this paper are as follows:
(1) Based on the twin extreme learning machine and the Fisher-regularized extreme learning machine (FELM), a new Fisher-regularized twin extreme learning machine (FTELM) is proposed. FTELM minimizes the intra-class divergence of the samples while fixing their inter-class divergence, takes full account of the statistical information of the sample data, and trains faster than FELM.
(2) Considering the instability of the $L_2$-norm and the Hinge loss used by FTELM, we replace them with the capped $L_1$-norm and propose a new capped $L_1$-norm FTELM model. C$L_1$-FTELM uses the capped $L_1$-norm to reduce the influence of noise points and, at the same time, utilizes Fisher regularization to incorporate the statistical knowledge of the data.
(3) Two algorithms are designed by utilizing the successive overrelaxation (SOR) [32] technique and the re-weighted technique [27] to solve the optimization problems of the proposed FTELM and C$L_1$-FTELM, respectively (a minimal SOR sketch is given after this list).
(4) Two theorems concerning the convergence and local optimality of C$L_1$-FTELM are proved.
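For reference, the following is a generic projected-SOR sketch for the box-constrained dual QPP form that arises in TWSVM/TELM-type models; the exact updates in the paper follow [32], the relaxation factor omega is typically taken in (0, 2), and all names here are our own illustrative choices.

```python
import numpy as np

def sor_box_qp(Q, c, omega=1.0, tol=1e-3, max_iter=100):
    """Projected SOR sketch for: min_a 0.5*a'Qa - e'a  s.t.  0 <= a <= c."""
    n = Q.shape[0]
    a = np.zeros(n)
    for _ in range(max_iter):
        a_prev = a.copy()
        for i in range(n):
            g_i = Q[i] @ a - 1.0                       # i-th gradient component
            a[i] = np.clip(a[i] - omega * g_i / Q[i, i], 0.0, c)
        if np.linalg.norm(a - a_prev) < tol:           # stop when updates stall
            break
    return a
```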
The organizational structure of this paper is as follows. In Section 2, we briefly review related work. In Section 3, we describe the FTELM model in detail. The robust capped $L_1$-norm FTELM learning framework, along with the related theoretical proofs, is described in detail in Section 4. In Section 5, we describe numerical experiments on artificial and benchmark datasets. We summarize this paper in Section 6.
5. Experiments
The four comparison algorithms are described as follows:
OPTELM: The optimization function of the model consists of minimizing the $L_2$-norm of the weight vector and the empirical loss. It considers neither the construction of two non-parallel hyperplanes for the classification task nor the statistical information of the samples. Moreover, since it uses the $L_2$-norm metric and the Hinge loss, its anti-noise ability is weak.
TELM: The optimization function of the model consists of minimizing the distance from the sample points to the hyperplane as well as the empirical loss. TELM does not fully consider the statistical information of the samples. Moreover, its metric is the $L_2$-norm and its loss function is the Hinge loss, so when the dataset contains noise, the influence of the noisy data is amplified and the classification accuracy is reduced.
FELM: The optimization function of the model includes minimizing the $L_2$-norm of the weight vector, the empirical loss, and the within-class scatter of the sample data. Although FELM takes the statistics of the samples into account, it has to solve a much larger optimization problem than the twin extreme learning machines, which is time-consuming. Moreover, FELM retains the metric and loss used by OPTELM, so its anti-noise ability is weak.
C$L_1$-TWSVM: C$L_1$-TWSVM is built on the twin support vector machine by changing the model's metric and loss to the capped $L_1$-norm. Although C$L_1$-TWSVM has the ability to resist noise, it does not fully take the statistics of the data into account. Meanwhile, C$L_1$-TWSVM needs to solve not only for the weight vector of each hyperplane but also for its bias, which is time-consuming.
5.1. Experimental Setting
All experiments were implemented in MATLAB R2020a on a personal computer (PC) with an AMD Radeon Graphics processor (3.2 GHz) and 16 GB of random-access memory (RAM). For C$L_1$-TWSVM and C$L_1$-FTELM, we set the maximum number of iterations to 100 and the iteration stopping threshold to 0.001. The five models OPTELM, TELM, FELM, FTELM, and C$L_1$-FTELM use the same activation function, and the Gaussian kernel function was used for C$L_1$-TWSVM. The regularization and kernel parameters, as well as the hidden layer node number L, were selected by 10-fold cross-validation and grid search over candidate ranges. Normalization was performed for both the artificial and UCI datasets. For the image datasets, we randomly select 20% of the data as the test set to obtain the classification accuracy of each algorithm. All experiments are repeated 10 times, the average of the 10 test results is used as the performance measure, and the evaluation criterion in this paper is classification accuracy (ACC).
5.2. Experiments on Artificial Datasets
We first conduct experiments on the Banana, Circle, Two spirals, and XOR datasets, which are generated by trigonometric functions (sine, cosine), two circles, two spirals, and two intersecting lines, respectively. The two-dimensional distributions of the four synthetic datasets are shown in Figure 1, where dark blue '+' represents class 1 and cyan '∘' represents class 2.
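For illustration, the following generators produce Circle- and Two spirals-style data of the kind shown in Figure 1. The radii, number of turns, and jitter level are our own guesses, since the paper's exact generation parameters are not given here.

```python
import numpy as np

def make_circle(n=200, r_inner=1.0, r_outer=2.0, jitter=0.1, seed=0):
    """Two concentric circles, one per class, with small Gaussian jitter."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, n)
    inner = rng.integers(0, 2, n).astype(bool)
    r = np.where(inner, r_inner, r_outer)
    X = np.c_[r * np.cos(t), r * np.sin(t)] + jitter * rng.standard_normal((n, 2))
    y = np.where(inner, 1, -1)
    return X, y

def make_two_spirals(n=200, turns=2.0, jitter=0.1, seed=0):
    """Two interleaved spirals, one per class (the second is the first rotated by pi)."""
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0.0, 1.0, n // 2)) * turns * 2.0 * np.pi
    s = np.c_[t * np.cos(t), t * np.sin(t)]
    X = np.vstack([s, -s]) + jitter * rng.standard_normal((n, 2))
    y = np.r_[np.ones(n // 2), -np.ones(n // 2)]
    return X, y
```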
Figure 2 illustrates the accuracies of the four twin algorithms, namely TELM, FTELM, C$L_1$-TWSVM, and C$L_1$-FTELM, on the four datasets with 0%, 20%, and 25% noise. From Figure 2a, we can observe that the classification accuracies of our FTELM and C$L_1$-FTELM on the Banana and Two spirals datasets are higher than those of the other two methods, while on the Circle and XOR datasets the four methods perform similarly. These results show that fully considering the statistical information of the data can effectively improve the classification accuracy of the classifier, which demonstrates that our C$L_1$-FTELM method is effective. From Figure 2b,c, we can see that the overall performance of FTELM is better than that of TELM, again showing the importance of fully considering the statistical information of the samples. At the same time, C$L_1$-FTELM performs best, followed by C$L_1$-TWSVM, which indicates that the capped $L_1$-norm can restrict the influence of noise on the model to a certain range and further demonstrates the effectiveness of using the capped $L_1$-norm. In summary, Figure 2 illustrates the effectiveness of simultaneously considering the sample statistics and changing the distance metric and loss of the model to the capped $L_1$-norm.
To further show the robustness of C$L_1$-FTELM, we add noise at different ratios to the Circle dataset. Figure 3 shows the accuracies of the TELM, FTELM, C$L_1$-TWSVM, and C$L_1$-FTELM algorithms on the Circle dataset under different noise ratios. We plot the accuracy results of the ten experiments at each noise ratio as box plots. By observing the medians in the four subgraphs, we can find that the median of the C$L_1$-FTELM algorithm is much higher than those of the other three algorithms, and the results of C$L_1$-FTELM under the four noise ratios are relatively concentrated. In other words, the variance of the ten experimental results obtained by the C$L_1$-FTELM algorithm is smaller and their mean is larger. These results show that our C$L_1$-FTELM has better stability and a better classification effect in noisy environments, demonstrating the effectiveness and noise resistance of using the capped $L_1$-norm as the distance metric and loss function of the model.
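As one plausible reading of this corruption scheme, the sketch below adds zero-mean Gaussian noise to a chosen fraction of samples, scaled per feature by a noise factor. The exact noise model used in the paper (which samples are corrupted and how the factor scales the noise) is not specified in this excerpt, so this is an assumption.

```python
import numpy as np

def add_feature_noise(X, ratio, factor=0.2, seed=0):
    """Corrupt a fraction `ratio` of samples with zero-mean Gaussian noise whose
    scale is `factor` times each feature's standard deviation."""
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    idx = rng.choice(len(X), size=int(ratio * len(X)), replace=False)
    Xn[idx] += factor * X.std(axis=0) * rng.standard_normal((len(idx), X.shape[1]))
    return Xn
```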
5.3. Experiments on UCI Datasets
In this section, we conduct numerical simulations on UCI datasets. Table 1 describes the features of the UCI datasets in detail. We also add two algorithms (OPTELM and FELM) to verify the classification performance of FTELM and C$L_1$-FTELM on ten UCI datasets.
All experimental results obtained with the optimal parameters are shown in Table 2. Here, the average running time under the optimal parameters is denoted by Time (s), and the average classification accuracy plus or minus the standard deviation is also reported. From Table 2, we can see that FTELM performs better than OPTELM, TELM, and FELM on all ten datasets, which indicates that adding the Fisher regularization term to the TELM framework can significantly improve classification accuracy. In addition, the average training time of the FTELM algorithm on most datasets is smaller than that of the FELM algorithm, which indicates that FTELM inherits the short training time of TELM. We can also see that our C$L_1$-FTELM achieves the highest classification accuracy on all datasets except WDBC. From the above results, we conclude that the Fisher regularization and the capped $L_1$-norm added to the TELM learning framework can effectively improve the performance of the classifier, showing that the proposed FTELM and C$L_1$-FTELM are efficient algorithms.
To further verify the robustness of C$L_1$-FTELM to outliers, we add 20% and 25% Gaussian noise to the ten datasets, respectively. All experimental results are presented in Table 3 and Table 4, from which we find that the classification accuracy of all six algorithms decreases after adding noise. However, our C$L_1$-FTELM achieves the highest classification accuracy on eight of the datasets, which further reveals the effectiveness of using the capped $L_1$-norm in place of the Hinge loss and the $L_2$-norm distance metric. Compared with the other five algorithms, our C$L_1$-FTELM algorithm is more time-consuming, because its training requires iterative computation, the elimination of outliers, and the computation of graph matrices. In addition, we apply different noise factor values (0.1, 0.15, 0.2, 0.25, 0.3) to the Cancer, German, Ionosphere, and WDBC datasets for the six algorithms. The experimental results are given in Figure 4. It can be seen from Figure 4a that when the Breast Cancer dataset contains 10% noise, our FTELM and C$L_1$-FTELM perform comparably, which shows the importance of considering the statistical information of the samples. As the noise ratio increases, the classification accuracy of all methods decreases, but our C$L_1$-FTELM still has the highest accuracy, illustrating the effectiveness of the capped $L_1$-norm. Figure 4b shows that as the noise ratio increases, the accuracies of C$L_1$-TWSVM and C$L_1$-FTELM decline in a similar way, but C$L_1$-FTELM remains the most stable of the six methods under the influence of noise. From Figure 4c,d, we can clearly observe that the anti-noise performance of our C$L_1$-FTELM is the best, which illustrates the effectiveness of using the Fisher regularization term together with the capped $L_1$-norm.
We also conduct experiments on four datasets (Breast cancer, QSAR, WDBC, and Vote) to verify the convergence of the proposed Algorithm 2. As shown in Figure 5, we plot the objective function value at each iteration. The objective function value converges rapidly to a fixed value as the number of iterations increases, which shows that our algorithm drives the objective function value to a local optimum within a limited number of iterations and demonstrates the effectiveness and convergence of Algorithm 2.
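The convergence behavior plotted in Figure 5 can be monitored with a loop of the following shape. This is only a schematic skeleton, not the paper's implementation: `step` and `objective` stand in for Algorithm 2's re-weighted subproblem solve and objective value, and the tolerance matches the 0.001 stopping threshold used in the experimental setting.

```python
def reweighted_iterate(step, objective, theta0, tol=1e-3, max_iter=100):
    """Skeleton of a re-weighted iterative scheme: repeat the subproblem solve
    until the objective decrease falls below tol or max_iter is reached.
    Returns the final iterate and the objective trace for convergence plots."""
    theta = theta0
    trace = [objective(theta)]
    for _ in range(max_iter):
        theta = step(theta)                   # solve the current re-weighted subproblem
        trace.append(objective(theta))
        if abs(trace[-2] - trace[-1]) < tol:  # stopping threshold
            break
    return theta, trace
```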
5.4. Experiments on Image Datasets
The image datasets include Yale, ORL, USPS, and MNIST. Figure 6 illustrates examples from the four high-dimensional image datasets, and the numbers of samples and features of the four image datasets are shown in Table 5. These four image datasets are used to investigate the performance of our FTELM and C$L_1$-FTELM on multi-class classification. Specifically, for the MNIST dataset, we only select the first 2000 samples for the experiment. Table 6 shows the detailed experimental results. As can be seen from these results, our C$L_1$-FTELM and C$L_1$-TWSVM have similar training times, because this paper uses an iterative algorithm to solve the non-convex optimization problem of C$L_1$-FTELM, which is time-consuming. At the same time, C$L_1$-FTELM achieves the highest classification accuracy among the six algorithms on all four datasets (Yale, ORL, USPS, and MNIST). In addition, the classification accuracy of our FTELM algorithm on the four image datasets is second only to that of C$L_1$-FTELM. The above results fully demonstrate the effectiveness of our two algorithms in dealing with multi-class classification tasks.