1. Introduction
Pattern classification is a process related to categorization, concerned with assigning patterns to one or more categories. It plays an important role in many fields, such as media, medicine, science, business, and security [1,2,3,4,5]. A classification problem can be converted to multiple binary classification problems in two ways, one-vs.-rest (OvR) and one-vs.-one (OvO), by creating several binary systems, each responsible for one category (OvR) or one pair of categories (OvO). However, OvR may suffer from unbalanced learning, since the set of negative training data is typically much larger than the set of positive training data. In contrast, OvO may suffer from ambiguities: two or more categories may receive the same number of votes, and one category then has to be picked arbitrarily. Therefore, researchers tend to seek other ways to solve classification problems.
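To illustrate the OvO ambiguity mentioned above, the following sketch (not part of the proposed approach; any off-the-shelf binary learner could be substituted, here scikit-learn's LinearSVC) decomposes a three-category problem into pairwise classifiers and tallies their votes; categories receiving equal votes end up being broken arbitrarily by the final argmax.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC  # any binary classifier would do

def ovo_predict(X_train, y_train, X_test, n_classes=3):
    """Illustrative one-vs.-one decomposition (numpy arrays expected)."""
    votes = np.zeros((len(X_test), n_classes), dtype=int)
    for a, b in combinations(range(n_classes), 2):
        mask = np.isin(y_train, [a, b])                 # keep only the two classes
        clf = LinearSVC().fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        for c in (a, b):
            votes[:, c] += (pred == c)                  # each pairwise classifier votes
    # argmax breaks ties toward the lowest class index -- the ambiguity noted in the text.
    return votes.argmax(axis=1), votes
```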
In general, classification tasks can be divided into two kinds: single-label and multi-label. Single-label classification assigns a pattern to only one category, whereas multi-label classification may assign a pattern to one or more categories. Many algorithms, based on multilayered perceptrons (MLP) [6], decision trees [7,8], k-nearest neighbors (KNN) [9], probability-based classifiers [10], support vector machines (SVM) [11,12], or random vector functional-link (RVFL) networks [13,14], have been developed for pattern classification.
Most developed algorithms are for single-label classification. KNN [9] is a lazy learning algorithm based on the similarity between input patterns and training data. The category to which an input pattern belongs is determined by its k nearest neighbors. Decision tree methods [7,15] build decision trees, e.g., ID3 and C4.5, from a set of training data using the concept of information entropy. An unseen pattern traverses down from the root of the tree until reaching a leaf, which gives the category to which the unseen pattern is assigned. Probability-based classifiers [16,17] assume independence among features. These classifiers can be trained to estimate the parameters necessary for classification. MLP [18,19,20] is a feedforward multi-layer network model consisting of several layers of nodes, with two adjacent layers fully connected to each other. Each node in the network is associated with an activation function, linear or nonlinear. RVFL networks [13,14,21] are also fully connected feedforward networks, with only one hidden layer. The weights between input and hidden nodes are assigned random values, whereas the weights between hidden and output nodes are learned from training patterns. An RBF network [22,23,24,25,26,27] is a two-layer network that uses radial basis functions as the activation functions in the hidden layer. The output of the network is a linear combination of the outputs from the hidden layer. In SVM [11,12], training patterns are mapped to points in a high-dimensional space such that the points of different categories are separated by a gap as wide as possible. For an unseen pattern, the same mapping is performed and the classification is predicted according to the side of the gap on which it falls. For solving multi-label classification problems [28], two approaches are generally adopted [29]: In one approach, a multi-label classification task is transformed into several single-label classification tasks, which are then solved by single-label classification methods. In the other approach, the capability of a specific single-label classification algorithm is extended to handle multi-label data directly [30,31,32,33].
Since they were first introduced in 1988 by Broomhead and Lowe [34], RBF networks have become popular in classification applications. However, as pointed out in [34,35], several issues are encountered in the construction phase of such networks, such as the determination of the number of nodes in the hidden layer, the form and initialization of the basis functions, and the learning of the parameters involved in the networks. In this paper, we present a novel approach for constructing RBF networks for pattern classification problems. An iterative self-constructing clustering algorithm is used to produce a desired number of clusters from the training data. Accordingly, the number of nodes in the hidden layer is determined. Basis functions are then formed, and their centers and deviations are initialized to be the centers and deviations of the corresponding clusters. Then the parameters of the network are refined with a hybrid learning strategy, involving hyperbolic tangent sigmoid functions, steepest descent backpropagation, and the least squares method. As a result, optimized RBF networks are obtained. Our approach offers practical advantages: the number of nodes in the hidden layer is determined and the basis functions are derived automatically, and higher classification rates can be achieved through the hybrid learning process. Furthermore, the approach is applicable to constructing RBF networks for solving both single-label and multi-label pattern classification problems. Experimental results have shown that the proposed approach can be used to solve classification tasks effectively.
We have been working on RBF networks for years and have developed different techniques [26,27,36,37]. This paper is mainly based on the MS thesis of the first author, Z.-R. He (who preferred to use a different English name, Tsan-Jung He, in the thesis) [38], supervised by S.-J. Lee, with the following contributions:
An iterative self-constructing clustering algorithm is applied to determine the number of hidden nodes and associated basis functions.
The centers and deviations of the basis functions are refined through the steepest descent backpropagation method.
Tikhonov regularization is applied in the optimization process to maintain the robustness of the output parameters.
The hyperbolic tangent sigmoid function is used as the activation function of the output nodes when learning the parameters of basis functions.
The rest of this paper is organized as follows: Section 2 gives an overview of related work. The proposed approach is described in detail in Section 3, Section 4 and Section 5. Basis functions are derived by clustering from training data, and hybrid learning is used to optimize network parameters. Experimental results are presented in Section 6, demonstrating the effectiveness of the proposed approach on single-label and multi-label classification. Finally, concluding remarks are given in Section 7.
2. Related Work
Many single-label classification algorithms have been proposed. The extended nearest neighbor (ENN) method [39], unlike the classic KNN, makes predictions in a “two-way communication” style: it considers not only the nearest neighbors of the input pattern, but also the patterns that regard the input pattern as their nearest neighbor. The Natural Neighborhood Based Classification Algorithm (NNBCA) [40] provides a good classification result without artificially selecting the neighborhood parameter; unlike KNN, it predicts a different k for different samples. An enhanced general fuzzy min-max neural network (EGFM) classification model is proposed in [19] to perform supervised classification of data. New hyperbox expansion, overlap, and contraction rules are used to overcome some unidentified cases in some regions. SaE-ELM [41] is an improved algorithm of RVFL networks. In SaE-ELM, the hidden node parameters are optimized by the self-adaptive differential evolution algorithm, whose trial vector generation strategies and associated control parameters are self-adapted in a strategy pool by learning from their previous experiences in generating promising solutions, and the network output weights are calculated using the Moore–Penrose generalized inverse. In [42], the basis functions of an RBF network are interpreted as probability density functions and the weights are seen as prior probabilities; models that output class conditional densities or mixture densities were proposed. In [24], an RBF network based on KNN with an adaptive selection radius is proposed. The adaptive selection radius is calculated according to a population density sampling method, and the RBF network is trained to locate a sound source by solving nonlinear equations involving the time differences of arrival of the sound.
Several problem transformation methods exist for multi-label classification. One popular transformation method, called binary relevance [43], transforms the original training dataset into p datasets, where p is the number of categories involved with the dataset. Each resulting dataset contains all the patterns of the original dataset, with each pattern assigned one of two labels: “belonging to” or “not belonging to” a particular category. Since the resulting datasets are all single-labeled, any single-label classification technique is applicable to them. The label powerset (LP) transformation method [44] creates one binary classifier for every label combination present in the training set. For example, eight possible label combinations are created if the number of categories associated with the original dataset is three. Ensemble methods were developed to create multi-label ensemble classifiers [45,46]: a set of multi-class single-label classifiers is created; for an input example, each classifier outputs a single class, and these predictions are then combined by an ensemble method.
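As a concrete illustration of the binary relevance transformation described above, the sketch below (the helper name and the +1/−1 label convention are choices made here for illustration) splits one multi-label dataset into p single-label binary datasets, each of which can then be handled by any ordinary single-label learner.

```python
import numpy as np

def binary_relevance_datasets(X, Y):
    """Split a multi-label problem into p single-label (binary) problems.

    X : (N, n) feature matrix
    Y : (N, p) label matrix with entries +1 ("belongs to") or -1 ("does not")
    Returns a list of p (X, y_k) pairs, one per category.
    """
    return [(X, Y[:, k]) for k in range(Y.shape[1])]

# Example with 4 patterns, 2 features, 3 categories.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5], [0.7, 0.7]])
Y = np.array([[+1, -1, -1], [-1, +1, -1], [+1, +1, -1], [-1, -1, +1]])
datasets = binary_relevance_datasets(X, Y)
print(len(datasets))      # 3 binary problems
print(datasets[0][1])     # labels for category 1: [ 1 -1  1 -1]
```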
Some classification algorithms/models have been adapted to the multi-label task without requiring problem transformations. Clare [29] is an adapted version of C4.5 for multi-label classification. Other decision tree classification methods based on multi-valued attributes were also developed [47]. Zhang et al. [32] propose several multi-label learning algorithms, including back-propagation multi-label learning (BP-MLL) [48], multi-label k-nearest neighbors (ML-KNN) [49], and multi-label RBF (ML-RBF) [50]. BP-MLL is a multi-label version of the back-propagation neural network. To take care of multiple labels, label co-occurrence is incorporated into the pairwise ranking loss function; however, it has a complex global error function to be minimized. ML-KNN is a lazy learning algorithm which requires a large run-time search. ML-RBF is a multi-label RBF neural network which extends the traditional RBF learning algorithm. Multi-label with Fuzzy Relevance Clustering (ML-FRC), proposed by Lee and Jiang [51], is a fuzzy relevance clustering based method for the task of multi-label text categorization. Kurata et al. [52] treat some of the nodes in the final hidden layer as dedicated neurons for each training pattern assigned to more than one label; these dedicated neurons are initialized to connect to the corresponding co-occurring labels with stronger weights than to the others. A multi-label metamorphic relations prediction approach, named RBF-MLMR [33], is proposed to find a set of appropriate metamorphic relations for metamorphic testing. It uses the Soot analysis tool to generate the control flow graph and labels; the extracted nodes and path properties constitute multi-label datasets for the control flow graph. Then, a multi-label RBF network prediction model is established to predict multiple metamorphic relations.
3. Proposed Approach
Let $X = \{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$ be a finite set of $N$ training patterns, where $\mathbf{x}_p = [x_{p,1}, x_{p,2}, \ldots, x_{p,n}]^T$ is an input vector with $n$ feature values and $\mathbf{y}_p = [y_{p,1}, y_{p,2}, \ldots, y_{p,m}]^T$ is the corresponding category vector with $m$ components of pattern $\mathbf{x}_p$, defined as $y_{p,f} = +1$ if $\mathbf{x}_p$ belongs to category $f$ and $y_{p,f} = -1$ otherwise, for $1 \le f \le m$ and $1 \le p \le N$. Note that $m$ denotes the number of categories associated with the training dataset $X$. For convenience, the categories are labelled as 1, 2, ⋯, $m$, respectively. For single-label classification, one of the elements in $\mathbf{y}_p$ is +1 and the other $m-1$ elements are −1. For multi-label classification, two or more components in $\mathbf{y}_p$ can be +1. Pattern classification is concerned with constructing a predicting model from $X$ such that, for any input vector $\mathbf{x}$, the model can respond with an output indicating which categories $\mathbf{x}$ belongs to.
In this work, we construct RBF networks as predicting models for classification. The data for training and testing the models are collected from openly accessible websites. The number of hidden nodes is determined by clustering on the training data, and the parameters of an RBF network are learned by applying a hybrid learning algorithm. The performance of the constructed models is analyzed by five-fold cross validation, and experiments comparing the performance with that of other methods are conducted.
Our RBF network architecture, shown in Figure 1, is a two-layer network consisting of one hidden layer and one output layer. There are $n$ input nodes, each receiving and feeding forward one feature value of the input vector to the hidden layer. The hidden layer has $J$ nodes, each with a basis function as its activation function. Each input node is fully connected to every node in the hidden layer. The output layer has $m$ nodes, each corresponding to one distinct category. The hidden and output layers are also fully connected, i.e., each node in the hidden layer is connected to each node in the output layer. The weight between node $j$, $1 \le j \le J$, of the hidden layer and node $f$, $1 \le f \le m$, of the output layer is denoted by $w_{f,j}$, for $1 \le f \le m$ and $1 \le j \le J$.
When an input vector $\mathbf{x}$ is presented to the input nodes of the network, the network produces an output vector $\mathbf{o} = [o_1, o_2, \ldots, o_m]^T$ at the output layer. Each output node computes a linear combination of the hidden-layer basis-function outputs plus a bias (Equation (4)) and thresholds the result to +1 or −1 (Equation (5)). Hence each component of $\mathbf{o}$ is either +1 or −1. If component $o_f$ is +1, $1 \le f \le m$, then $\mathbf{x}$ is predicted to belong to category $f$. Note that one or more components in $\mathbf{o}$ can be +1. Therefore, the proposed network architecture is applicable to solving both single-label and multi-label pattern classification problems.
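The forward computation can be sketched as follows. This is an illustrative reconstruction rather than the paper's exact formulation: Gaussian basis functions with per-dimension deviations and a hard sign threshold at the output are assumed.

```python
import numpy as np

def rbf_forward(x, centers, deviations, W, b):
    """One forward pass of the two-layer RBF classifier described in the text.

    x          : (n,)   input vector
    centers    : (J, n) basis-function centers
    deviations : (J, n) basis-function deviations (Gaussian widths assumed)
    W          : (m, J) hidden-to-output weights
    b          : (m,)   output biases
    Returns the (m,) vector of +1/-1 category decisions.
    """
    # Hidden layer: Gaussian radial basis functions (assumed form).
    h = np.exp(-np.sum(((x - centers) / deviations) ** 2, axis=1))
    # Output layer: linear combination of hidden outputs plus bias.
    net = W @ h + b
    # Threshold: +1 means "belongs to category f", -1 means "does not".
    return np.where(net >= 0.0, 1.0, -1.0)
```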
We describe below how the network of Figure 1 is created from the given training set. Two phases, network setup and parameter refinement, are involved. In the first phase, the network structure is built and the initial values of the parameters are set. Then the parameters of the network are optimally refined in the second phase.
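A minimal sketch of the setup idea is given below. The paper's iterative self-constructing clustering algorithm is not reproduced here; ordinary k-means is substituted purely to show how one cluster maps to one hidden node, with the cluster center and per-dimension spread taken as the initial center and deviation of the corresponding basis function.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_hidden_layer(X_train, n_clusters):
    """Derive initial centers/deviations from clusters of the training inputs.

    k-means stands in for the paper's iterative self-constructing clustering
    algorithm; only the cluster -> hidden node mapping is illustrated.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_train)
    centers = km.cluster_centers_                      # (J, n)
    deviations = np.empty_like(centers)
    for j in range(n_clusters):
        members = X_train[km.labels_ == j]
        # Per-dimension spread of the cluster; a small floor avoids zero widths.
        deviations[j] = members.std(axis=0) + 1e-3
    return centers, deviations
```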
5. Parameter Refinement Phase
In this phase, the optimal values of the parameters of the network are learned from the training dataset $X$. These parameters include the centers $\mathbf{c}_j$ and deviations $\mathbf{v}_j$, $1 \le j \le J$, associated with the basis functions in the hidden layer, the weights $w_{f,j}$, $1 \le f \le m$, $1 \le j \le J$, between the hidden layer and the output layer, and the biases $b_f$, $1 \le f \le m$, associated with the output nodes in the output layer. A hybrid learning process is applied iteratively. In each iteration, we first treat all the weights and biases as fixed and use steepest descent backpropagation to update the centers and deviations associated with the basis functions. Then we treat all the centers and deviations associated with the basis functions as fixed and use the least squares method to update the weights and biases associated with the output layer. The process is iterated until convergence is reached.
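The alternation can be summarized as follows; the two step functions are placeholders for the procedures of Sections 5.1 and 5.2, and the convergence test on the training error is an assumption made only for this sketch.

```python
def hybrid_learning(X, Y, centers, deviations, W, b,
                    refine_basis_step, solve_output_step,
                    max_iters=100, tol=1e-6):
    """Alternate the two refinement steps until convergence (sketch only).

    refine_basis_step : callable for the Section 5.1 update
                        (backprop on centers/deviations, W and b fixed)
    solve_output_step : callable for the Section 5.2 update
                        (least squares on W and b, basis functions fixed)
    Both callables are placeholders for the procedures described in the text.
    """
    prev_error = float("inf")
    for _ in range(max_iters):
        centers, deviations = refine_basis_step(X, Y, centers, deviations, W, b)
        W, b, error = solve_output_step(X, Y, centers, deviations)
        if abs(prev_error - error) < tol:   # stop when the training error settles
            break
        prev_error = error
    return centers, deviations, W, b
```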
5.1. Centers and Deviations
First, we describe how to refine the centers and deviations associated with the basis functions in the hidden layer. To make the refinement possible, we express the output $o_f$ in Equation (5) in terms of the hyperbolic tangent sigmoid function, which is analytic, i.e., $o_f = \tanh(\alpha\, net_f) = \frac{e^{\alpha\, net_f} - e^{-\alpha\, net_f}}{e^{\alpha\, net_f} + e^{-\alpha\, net_f}}$ for $1 \le f \le m$, where $\alpha$ is a slope-controlling constant and $net_f$ is defined in Equation (4). Note that $\tanh(\alpha\, net_f)$ provides a good approximation to the sign function and is differentiable at all points.
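For the gradient computations that follow, the derivative of this activation is needed; using the (assumed) shorthand $net_f$ for the quantity defined in Equation (4):

```latex
\frac{\partial o_f}{\partial\, net_f}
  = \alpha \left( 1 - \tanh^{2}(\alpha\, net_f) \right)
  = \alpha \left( 1 - o_f^{\,2} \right), \qquad 1 \le f \le m .
```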
Let us denote the initial values of the centers and deviations, shown in Equation (9), as $\mathbf{c}_j^{(0)}$ and $\mathbf{v}_j^{(0)}$, $1 \le j \le J$, and the initial values of the weights and biases, obtained in Equation (21), as $w_{f,j}^{(0)}$ and $b_f^{(0)}$, $1 \le f \le m$, $1 \le j \le J$. As in [56], the mean squared error for the training set $X$ can be estimated by the squared error $E$ between the desired and actual outputs for a single pattern $\mathbf{x}_p$, where $\mathbf{x}_p$ can be any pattern in the training dataset $X$. By steepest descent backpropagation, the centers and deviations of the basis functions in the hidden layer are updated as $c_{j,i} \leftarrow c_{j,i} - \eta\,\partial E/\partial c_{j,i}$ and $v_{j,i} \leftarrow v_{j,i} - \eta\,\partial E/\partial v_{j,i}$ for $1 \le i \le n$, $1 \le j \le J$, where $\eta$ is the learning rate. From Equation (23), the partial derivatives of $E$ with respect to $c_{j,i}$ and $v_{j,i}$ are obtained by the chain rule through the output activations and the basis-function outputs, yielding the explicit update equations for $1 \le i \le n$, $1 \le j \le J$. These updates can be written compactly in matrix form, one per hidden node, for $1 \le j \le J$.
Note that the algorithm described above is stochastic steepest descent backpropagation, which involves incremental training. However, we perform batch training, in which the complete gradient is computed, after all inputs have been applied to the network, before the centers and deviations are updated. To implement batch training, in each iteration the individual gradients for all the inputs in the training set are averaged to get the total gradient, and the update equations become $c_{j,i} \leftarrow c_{j,i} - \frac{\eta}{N}\sum_{p=1}^{N} \partial E_p/\partial c_{j,i}$ and $v_{j,i} \leftarrow v_{j,i} - \frac{\eta}{N}\sum_{p=1}^{N} \partial E_p/\partial v_{j,i}$ for $1 \le i \le n$, $1 \le j \le J$, where $E_p$ is the mean squared error induced from pattern $\mathbf{x}_p$, $1 \le p \le N$. In matrix form, the batch-training updates are expressed analogously for $1 \le j \le J$.
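A sketch of one batch update is given below. It assumes Gaussian basis functions, tanh(α·net) output activations, and a per-pattern squared error, which is one plausible reading of Equations (4), (5) and (23) rather than the paper's exact formulas.

```python
import numpy as np

def batch_update_basis(X, Y, centers, deviations, W, b, alpha=1.0, eta=0.01):
    """One batch steepest-descent update of the basis-function parameters.

    X : (N, n) inputs, Y : (N, m) targets in {+1, -1}
    centers, deviations : (J, n);  W : (m, J);  b : (m,)
    Gradients of every pattern are averaged before the parameters are updated.
    """
    N = X.shape[0]
    grad_c = np.zeros_like(centers)
    grad_v = np.zeros_like(deviations)
    for x, y in zip(X, Y):
        diff = x - centers                                        # (J, n)
        h = np.exp(-np.sum((diff / deviations) ** 2, axis=1))     # hidden outputs (J,)
        net = W @ h + b                                           # (m,)
        o = np.tanh(alpha * net)                                  # smoothed outputs (m,)
        # dE/dh_j with E_p = sum_f (y_f - o_f)^2, via the tanh derivative.
        delta_h = W.T @ (-2.0 * (y - o) * alpha * (1.0 - o ** 2))  # (J,)
        grad_c += (delta_h * h)[:, None] * 2.0 * diff / deviations ** 2
        grad_v += (delta_h * h)[:, None] * 2.0 * diff ** 2 / deviations ** 3
    centers = centers - eta * grad_c / N          # averaged (batch) gradient step
    deviations = deviations - eta * grad_v / N
    return centers, deviations
```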
5.2. Weights and Biases
Next, we describe how to refine the weights and biases associated with the output layer in each iteration. For pattern $\mathbf{x}_p$, $1 \le p \le N$, by Equation (13), we want the network outputs to equal the desired targets $y_{p,f}$, $1 \le f \le m$, where the hidden-layer outputs are computed with the newly updated centers and deviations obtained in the first part of the current iteration. Collecting these conditions for all the training patterns yields a linear system whose coefficient matrix is the same matrix as shown in Equation (18). As before, we'd like to minimize the Tikhonov-regularized least squares error, and its solution provides the new values of the weights and biases, i.e., $w_{f,j}$ and $b_f$, $1 \le f \le m$, $1 \le j \le J$.
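A minimal sketch of this regularized least squares step is given below; the matrix names and the regularization constant are assumptions made for the sketch, not the paper's notation.

```python
import numpy as np

def solve_output_layer(H, T, lam=1e-3):
    """Tikhonov-regularized least squares for the output weights and biases.

    H   : (N, J) hidden-layer outputs for all N training patterns,
          computed with the newly refined centers and deviations
    T   : (N, m) target vectors (+1/-1)
    lam : regularization constant (assumed name and value)
    Returns W (m, J) and b (m,).
    """
    N = H.shape[0]
    Phi = np.hstack([H, np.ones((N, 1))])        # append a column for the bias
    # Ridge-regression normal equations: (Phi^T Phi + lam * I) A = Phi^T T.
    A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ T)
    W = A[:-1, :].T                              # (m, J) hidden-to-output weights
    b = A[-1, :]                                 # (m,) output biases
    return W, b
```

Solving the normal equations with a small Tikhonov term keeps the system well conditioned even when the hidden-node responses are nearly collinear, which is the robustness motivation stated in the contributions.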
7. Concluding Remarks
We have presented a novel approach to constructing RBF networks for solving supervised classification problems. An iterative self-constructing clustering algorithm is used to determine the number of nodes in the hidden layer. Basis functions are formed, and their centers and deviations are initialized to be the centers and deviations of the corresponding clusters. Then, the parameters of the network are refined with a hybrid learning strategy, involving the steepest descent backpropagation and least squares method. Hyperbolic tangent sigmoid functions and Tikhonov regularization are employed. As a result, optimized RBF networks are obtained. The proposed approach is applicable to constructing RBF networks for solving both single-label and multi-label pattern classification problems. Experimental results have shown that the proposed approach can be used to solve classification tasks effectively.
Note that all the dimensions of a training pattern are treated as equally important, i.e., no different weights are involved. It might be beneficial to incorporate a weighting mechanism that assigns different weights to different dimensions [72]. Furthermore, the clustering algorithm may encounter difficulties finding useful clusters in high-dimensional data sets; dimensionality reduction [65,68,69,70] may need to be applied in this case. We will take these as our future work.