3.1. Preliminary
Table 1 summarizes the terms used to describe the proposed method. Conventional MPGAs for single-label feature selection entail the following processes.
Step 1: Initialization of sub-populations. Each sub-population consists of a pre-defined number of individuals, and each individual represents a feature subset. For example, in the genetic algorithm, each individual is represented as a binary vector called a chromosome, which comprises ones and zeros that represent selected and unselected features, respectively. In particle swarm optimization, each individual is represented as a probability vector: the components of a particle are regarded as the probabilities that the corresponding features will be selected. In most studies, the individuals are initialized randomly.
Step 2: Evaluation using a fitness function. The individuals of each sub-population are evaluated using a fitness function. Given the feature subset represented by an individual, a learning algorithm, such as a naive Bayes classifier, is trained, and the trained classifier is used to predict the label of each test pattern. Given the correct and predicted labels, a fitness value can be computed using evaluation metrics such as accuracy. Intuitively, a feature subset that results in better single-label prediction has a better fitness value.
Step 3: Communication among sub-populations. The sub-populations communicate with each other based on the best individuals in terms of the fitness value. In each sub-population, the worst individual (with the lowest fitness value) is replaced by the best individual of another sub-population.
Step 4: Sub-population update. The individuals generate offspring via genetic operators. First, each sub-population chooses the parents based on fitness values. For example, roulette-wheel selection uses each individual's share of the total fitness of its sub-population as the probability that the individual will be chosen as a parent. Subsequently, the offspring are generated via the crossover of parents or via mutation.
Whenever the individuals are modified in Step 4, they are evaluated in the same manner as in Step 2. During the search process, MPGAs repeat Step 3 → Step 4 → Step 2 until a stopping criterion is met. The left side of Figure 1 presents this process as a flowchart; a minimal code sketch of Steps 1–4 is given below.
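To make Steps 1–4 concrete, the following minimal single-label sketch may help; it is not the implementation evaluated in this paper. It assumes scikit-learn's GaussianNB as the learning algorithm, accuracy as the fitness metric, and a ring topology for Step 3; all function names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def init_subpopulations(m, pop_size, n_features):
    """Step 1: each individual is a random binary chromosome (1 = feature selected)."""
    return [rng.integers(0, 2, size=(pop_size, n_features)) for _ in range(m)]

def fitness(chromosome, X_tr, y_tr, X_te, y_te):
    """Step 2: train a classifier on the selected features and score its accuracy."""
    sel = np.flatnonzero(chromosome)
    if sel.size == 0:                      # an empty subset gets the worst fitness
        return 0.0
    clf = GaussianNB().fit(X_tr[:, sel], y_tr)
    return accuracy_score(y_te, clf.predict(X_te[:, sel]))

def communicate(subpops, fits):
    """Step 3: replace each sub-population's worst individual with the best
    individual of another sub-population (here, the next one on a ring)."""
    m = len(subpops)
    for i in range(m):
        j = (i + 1) % m
        subpops[i][np.argmin(fits[i])] = subpops[j][np.argmax(fits[j])]

def roulette_parent(pop, fits):
    """Step 4: roulette-wheel selection; an individual's chance of parenthood
    equals its share of the sub-population's total (non-negative) fitness."""
    p = np.asarray(fits, dtype=float)
    return pop[rng.choice(len(pop), p=p / p.sum())]
```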
3.2. Motivation and Approach
We designed a novel MPGA that specializes in solving the MLFS problem. To extend the benefits of the communication process used in conventional studies to the MLFS problem, the following issues should be considered:
Through communication between the sub-populations, the discriminating power of multiple labels should be complemented. Additionally, the referenced individuals should be used to generate offspring that are superior to the previous generation.
Feature subsets with high discriminating power for different label subsets can complement each other. Therefore, each sub-population should refer to an individual with the highest discriminating power for a subset of labels that are relatively difficult to discriminate, resulting in improved search capability for the MLFS.
Existing fitness-based parent selection methods may not fully use the individuals referenced from other sub-populations, because in our method these individuals are selected regardless of fitness. This issue can be resolved by ensuring that one of the important individuals in each sub-population is involved whenever offspring are generated.
Figure 1 presents a schematic overview of the proposed MPGA for solving the MLFS problem. In particular, we modified the communication and update processes of the existing MPGA. First, with regard to sub-population communication, the conventional method communicates by exchanging the best individuals among sub-populations. Specifically, the sub-population $P_1$ imports the best individual $x^{b}_{2}$ of $P_2$; then, the worst individual $x^{w}_{1}$ of $P_1$ is replaced by $x^{b}_{2}$. Similarly, $x^{w}_{2}$ of $P_2$ is replaced by $x^{b}_{1}$ of $P_1$. In the proposed label complementary communication for MLFS, the evaluation of the individuals is performed similarly to that in the conventional methods for single-label feature selection; however, the learning algorithm is replaced by a multilabel classification algorithm, such as the multilabel naive Bayes classifier (MLNB) [15], which uses a series of functions that predict each label. Therefore, the discriminating power corresponding to each label can be obtained by reusing the learning algorithm that was trained to evaluate the fitness values of individuals; a detailed description of this process is presented in Section 3.3. As shown in Figure 1, the best individual $x^{b}_{2}$ of $P_2$ lacks sufficient classification performance with regard to a label $l_j$. To complement the discriminating power with regard to $l_j$, $P_2$ refers to the individual $x^{c}$ of $P_1$ that best discriminates $l_j$.
In the sub-population updating step, the conventional method stochastically or deterministically selects the parents of $P_i$ via fitness-based selection. Here, the individual $x^{b}_{j}$ that is imported from $P_j$ is selected and used with a high probability, because it had the highest fitness in $P_j$. In contrast, in the proposed label complementary updating step, the complementary individual $x^{c}$ referenced from the other sub-populations is chosen regardless of fitness. Because the important individuals of $P_i$ are the complementary individual $x^{c}$ and the best individual $x^{b}_{i}$, one of them is selected as a parent. In other words, one of the important individuals is always involved in the generation of offspring. For diversity, the other parent is selected from the remaining individuals at random. Finally, the selected parents generate offspring using a genetic operator.
If an MPGA begins with promising initial sub-populations, then a good-quality feature subset can be found in less time than when it begins with randomly initialized sub-populations. In this study, we introduce a simple but effective initialization method. Given an original feature set $F$ and the number of sub-populations $m$, the spherical $k$-means algorithm partitions $F$ into $m$ clusters $C_1, \dots, C_m$ [28]; herein, the clusters are composed of different features without overlap, such that $F = C_1 \cup C_2 \cup \dots \cup C_m$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Subsequently, each sub-population $P_i$ is initialized via repetitive entropy-based stochastic sampling from cluster $C_i$.
Section 3.3 presents a detailed description of the sampling process.
3.3. Algorithm
Algorithm 1 presents the pseudocode of the proposed method. Each individual (chromosome) is represented by a binary string composed of ones and zeros, representing selected and unselected features, respectively. For simplicity, each sub-population is represented as a set of individuals, i.e., $P_i = \{x_1, x_2, \dots\}$. Additionally, all of the sub-populations have the same number of individuals. In the initialization step (line 4), the individuals of each sub-population are initialized by Algorithm 2 and then evaluated to obtain their fitness values (line 6). In this study, the MLNB is used as the learning algorithm. Given the trained learning algorithm, the fitness values are computed according to the multilabel evaluation metrics detailed in Section 4.1. To evaluate the discriminating power corresponding to each label, our algorithm uses an accuracy metric as used in the fitness evaluation of single-label feature selection methods. For each individual $x_k$ that belongs to $P_i$, the label-specific accuracy vector $a_k = (a_{k1}, \dots, a_{k|L|})$ is computed by reusing the already trained learning algorithm; here, $a_{kj}$ is the accuracy corresponding to the $j$-th label as predicted under $x_k$. Consequently, the label-specific accuracy matrix $A_i$ is computed across all individuals of $P_i$ (line 7).
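As an illustration, the matrix $A_i$ could be assembled as below; `predict_fn` is a hypothetical closure around the MLNB already trained during fitness evaluation, and `Y_true`/`Y_pred` are binary pattern-by-label matrices.

```python
import numpy as np

def label_specific_accuracy(Y_true, Y_pred):
    """Per-label accuracies: component j is the accuracy of the j-th label."""
    return (Y_true == Y_pred).mean(axis=0)

def accuracy_matrix(population, predict_fn, Y_true):
    """Stack the accuracy vectors of all individuals of one sub-population:
    A[k, j] is the accuracy of label j under the k-th feature subset."""
    return np.vstack([label_specific_accuracy(Y_true, predict_fn(x))
                      for x in population])
```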
Algorithm 1 Label Complementary multi-population genetic algorithm for multilabel feature selection
1: Input: D, m; ▹ the multilabel dataset D, the number of sub-populations m
2: Output: S; ▹ the final feature subset S
3: t ← 0; ▹ the number of fitness function calls spent so far
4: P_1, …, P_m ← initialization(m); ▹ use Algorithm 2
5: for each sub-population P_i do
6:     evaluate P_i using D; ▹ compute fitness values via a fitness function
7:     compute the label-specific accuracy matrix A_i for individuals of P_i; ▹ reuse the fitness function
8: end for
9: while (not termination-condition) do
10:     for each sub-population P_i do
11:         x^c ← communication(P_i, A_1, …, A_m); ▹ use Algorithm 3
12:         P_i ← update(P_i, x^c); ▹ use Algorithm 4
13:         evaluate P_i using D;
14:         compute the label-specific accuracy matrix A_i for individuals of P_i;
15:         t ← t + 1;
16:     end for
17: end while
18: S ← the best feature subset so far;
Algorithm 2 Initialization function
1: Input: m; ▹ the number of sub-populations m
2: Output: P_1, …, P_m; ▹ the initial sub-populations
3: for each feature f ∈ F do ▹ the original feature set F
4:     if H(f) = 0 then
5:         F ← F \ {f};
6:     end if
7: end for
8: C_1, …, C_m ← partition F into m clusters; ▹ use the spherical k-means algorithm
9: for i ← 1 to m do
10:     for each individual x ∈ P_i do
11:         initialize x by selecting n features via stochastic sampling; ▹ use Equation (1)
12:     end for
13: end for
After the initialization process, the sub-populations complement each other via the proposed label complementary communication (line 11), i.e., Algorithm 3. Specifically, each sub-population identifies, from the other sub-populations, a complementary individual $x^{c}$ that can complement itself. Next, our algorithm updates the sub-populations using $x^{c}$ via Algorithm 4. All of the sub-populations repeat these processes until the termination condition is met. We use the number of fitness function calls (FFCs) as the termination criterion, and the algorithm conducts the search until the available FFCs are exhausted. Finally, Algorithm 1 outputs the best feature subset.
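A possible shape of this outer loop, with the FFC budget as the termination criterion, is sketched below; `evaluate`, `communicate`, and `update` stand in for the fitness function and Algorithms 3 and 4, and their signatures are assumptions of this sketch.

```python
def search(subpops, evaluate, communicate, update, ffc_budget):
    """Outer loop of Algorithm 1 (sketched); stops when the FFC budget is spent."""
    ffcs, best, best_fit = 0, None, float("-inf")
    while ffcs < ffc_budget:                        # termination: available FFCs exhausted
        for i in range(len(subpops)):
            xc = communicate(i, subpops)            # Algorithm 3: complementary individual
            subpops[i] = update(subpops[i], xc)     # Algorithm 4: next generation of P_i
            fits, spent = evaluate(subpops[i])      # fitness values and FFCs consumed
            ffcs += spent
            k = max(range(len(fits)), key=fits.__getitem__)
            if fits[k] > best_fit:                  # keep the best feature subset so far
                best, best_fit = subpops[i][k], fits[k]
    return best
```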
Algorithm 2 presents the initialization procedure for each sub-population. With regard to lines 3–7, if the entropy of any feature is zero, it is preferentially removed because it does not carry any information. Each cluster $C_i$ of features is generated by the spherical $k$-means algorithm (line 8) and is then used to initialize the corresponding sub-population $P_i$ (lines 9–13). Given each feature $f \in C_i$, its importance score $w(f)$ is calculated as

$$w(f) = \frac{H(f)}{\sum_{f' \in C_i} H(f')} \quad (1)$$

where $H(x)$ is the entropy of a variable $x$. Finally, each individual of $P_i$ is initialized via stochastic sampling based on the importance scores (line 11).
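A sketch of the initialization under stated assumptions: spherical $k$-means is approximated by running scikit-learn's KMeans on L2-normalized feature vectors (which mimics cosine distance), features are assumed discrete so that their entropy can be estimated from value counts, and `n_select` plays the role of $n$.

```python
import numpy as np
from sklearn.cluster import KMeans

def feature_entropy(col):
    """Shannon entropy of one (discrete-valued) feature column."""
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def initialize(X, m, pop_size, n_select, rng):
    """Entropy-guided initialization (Algorithm 2, sketched)."""
    H = np.array([feature_entropy(X[:, j]) for j in range(X.shape[1])])
    keep = np.flatnonzero(H > 0)                      # lines 3-7: drop zero-entropy features
    V = X[:, keep].T.astype(float)
    V /= np.linalg.norm(V, axis=1, keepdims=True)     # unit vectors -> cosine geometry
    cluster_of = KMeans(n_clusters=m, n_init=10).fit_predict(V)
    subpops = []
    for i in range(m):
        C = keep[cluster_of == i]                     # the features of cluster C_i
        w = H[C] / H[C].sum()                         # Equation (1): importance scores
        pop = []
        for _ in range(pop_size):
            x = np.zeros(X.shape[1], dtype=int)
            chosen = rng.choice(C, size=min(n_select, C.size), p=w, replace=False)
            x[chosen] = 1                             # line 11: stochastic sampling
            pop.append(x)
        subpops.append(pop)
    return subpops
```

Sampling without replacement, with probabilities proportional to the importance scores, realizes the repetitive entropy-based stochastic sampling of line 11.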
Algorithm 3 Communication function
1: Input: P, A_1, …, A_m; ▹ the sub-population P, the label-specific accuracy matrices
2: Output: x^c; ▹ the complementary individual x^c
3: b ← find the index of the best individual in P; ▹ the best individual x^b
4: I ← find the index set of labels with the highest error based on a_b;
5: for each individual x_k ∈ Q do ▹ the other sub-populations Q
6:     c_k ← Σ_{j ∈ I} a_{kj}; ▹ the degree of complementarity c_k
7: end for
8: x^c ← the individual with the highest c_k;
Algorithm 3 illustrates the procedure for realizing the label complementary communication between the sub-populations for multiple labels. For simplicity, an input sub-population and the others are represented as $P$ and $Q$, respectively. With regard to lines 3–4, our algorithm finds an index set $I$ of labels for which the best individual $x^{b}$ in $P$ yields the lowest accuracies, where the size of $I$ is set to half the size of the entire label set $L$. To find the complementary individual $x^{c}$ from the other sub-populations $Q$, our algorithm computes the degree of complementarity $c_k$ for each individual $x_k$ in $Q$, where $c_k$ is regarded as the discriminating power with regard to the labels in $I$. Specifically, $c_k$ is calculated by adding the accuracies $a_{kj}$ corresponding to the labels $j \in I$ (line 6). In contrast with the simple communication of exchanging the best individuals, the individual $x^{c}$ referenced from the other sub-populations can complement the discriminating power of the sub-population $P$ for the entire label set $L$, which results in an improved search capability for MLFS.
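A sketch of this communication step, assuming `acc_mats[i][k, j]` holds the accuracy of label $j$ under individual $k$ of sub-population $i$ and `fitness[i][k]` the corresponding fitness value:

```python
import numpy as np

def communicate(i, subpops, acc_mats, fitness):
    """Label complementary communication (Algorithm 3, sketched)."""
    A = acc_mats[i]
    b = int(np.argmax(fitness[i]))            # index of the best individual of P
    q = A.shape[1]                            # number of labels |L|
    I = np.argsort(A[b])[: q // 2]            # half of the labels, worst accuracies first
    best_c, xc = -np.inf, None
    for j in range(len(subpops)):             # scan the other sub-populations Q
        if j == i:
            continue
        c = acc_mats[j][:, I].sum(axis=1)     # degree of complementarity per individual
        k = int(np.argmax(c))
        if c[k] > best_c:
            best_c, xc = c[k], subpops[j][k]
    return xc
```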
Algorithm 4 Update function
1: Input: P, x^c; ▹ the sub-population P, the complementary individual x^c
2: Output: P'; ▹ the new sub-population P'
3: generate new offspring o_1, o_2 by crossover from x^c and the best individual x^b of P;
4: P' ← {o_1, o_2};
5: while |P'| < |P| do ▹ keep the number of individuals in P
6:     p_1 ← select an individual at random among x^c and the best individual x^b of P;
7:     p_2 ← select an individual at random among the remaining individuals of P;
8:     generate new offspring o_1, o_2 by crossover from p_1 and p_2;
9:     P' ← P' ∪ {o_1, o_2};
10: end while
11: run a mutation on overlapping individuals;
Algorithm 4 presents the detailed procedure for generating new offspring. Because the complementary individual $x^{c}$ and the best individual $x^{b}$ in $P$ are considered important, our algorithm first generates offspring from this pair once (lines 3–4). With regard to lines 6–7, our algorithm then conducts parent selection to generate further offspring. In particular, the first parent is randomly selected between $x^{c}$ and $x^{b}$; consequently, one of the important individuals is always involved in the generation of offspring. Furthermore, to generate diverse offspring, the other parent is selected from among the remaining individuals. As shown in line 8, the selected parent pair generates offspring via a restrictive crossover method that is frequently used to control the number of selected features in feature selection [29]. Compared with updating based on fitness-based parent selection, our algorithm can generate offspring superior to the previous generation by actively using the complementary individual $x^{c}$. The generated offspring are sequentially added to $P'$ (line 9). To maintain the number of individuals in each sub-population, the generation process is repeated until the offspring are as numerous as the individuals in $P$, i.e., $|P'| = |P|$. Furthermore, as described in line 11, a restrictive mutation is conducted on overlapping individuals.
Finally, we conducted a time complexity analysis of the proposed method. Most of the time is spent evaluating feature subsets, because the learning algorithm must be trained through complicated sub-procedures for multiple labels [30]. Because the numbers of training patterns and given labels are regarded as constants during the evaluation process, the computation time required to evaluate a feature subset $S$ is determined by the number of selected features $|S|$, i.e., $O(|S| \cdot \delta)$, where $\delta$ represents the assumed basic time associated with the evaluation of a single feature [3]. Given the total number of individuals $v$ and the maximum number of iterations $t_{\max}$, the feature subset evaluation is conducted $v \cdot t_{\max}$ times. Thus, the time complexity of the proposed method is $O(v \cdot t_{\max} \cdot |S| \cdot \delta)$.
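For intuition, a worked instance under purely hypothetical settings, namely $v = 50$ individuals in total, $t_{\max} = 100$ iterations, and $|S| = 20$ selected features:

```latex
\[
  v \cdot t_{\max} \cdot |S| \cdot \delta
  = 50 \times 100 \times 20 \times \delta
  = 10^{5}\,\delta .
\]
```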
3.4. Algorithm: Example
We implement the proposed method on the multilabel toy dataset provided in Table 2 as a representative example. In the table, each text pattern is relevant to multiple labels, where a label is represented as one if relevant and zero otherwise. Specifically, the first pattern includes the terms "Music", "The", "Funny", and "Lovely", but not "Boring"; this pattern can be assigned to the labels "Comedy" and "Disney" simultaneously. For simplicity, we set the number of sub-populations and the number of selected features to two. Additionally, the number of individuals in each sub-population was set to three. To focus on the communication process, the two sub-populations $P_1$ and $P_2$ were initialized at random in the initialization step.
MLNB and multilabel accuracy are used to evaluate each individual. A detailed description of the evaluation metrics, including multilabel accuracy, is given in Section 4.1. Additionally, the fitness value of each individual in each sub-population is calculated as the average value obtained from 10 repeated experiments, and the label-specific accuracy matrix $A_i$ for each sub-population $P_i$ is calculated using the MLNB that was pretrained for fitness evaluation.
In the communication process for $P_1$, our algorithm determines the index set $I$ of labels for which the lowest accuracies are yielded by the best individual $x^{b}_{1}$, as it has the highest fitness in $P_1$. We indicate the important individuals in each sub-population using bold font. In $A_1$, $x^{b}_{1}$ yields its lowest accuracy, 30%, for the labels in $I$. To complement $I$, our algorithm finds the complementary individual $x^{c}$ from $P_2$. Based on $A_2$ and $I$, the degree of complementarity $c_k$ for each individual $x_k$ of $P_2$ is calculated by summing its accuracies over the labels in $I$. Because one individual of $P_2$ attains the highest degree of complementarity, it is chosen as the complementary individual $x^{c}$ for $P_1$. Conventional methods would instead import the best individual $x^{b}_{2}$ of $P_2$, which in our example exhibits a low accuracy of 40% for the labels in $I$. However, our method refers to $x^{c}$ of $P_2$, which has the highest accuracy with regard to $I$. This indicates that our method can further complement the discriminating power of $P_1$ for multiple labels and increase the likelihood of avoiding local optima, resulting in improved multilabel accuracy. This process is similar for $P_2$.
In the update process, $P_1$ selects its best individual $x^{b}_{1}$ and $x^{c}$ as the parental pair once. Next, one of $x^{b}_{1}$ or $x^{c}$ is selected as a parent, and one of the remaining individuals of $P_1$ is selected as the other parent at random. The selected parent pair generates offspring via the genetic operators used in conventional methods. Given $x^{b}_{1}$ and $x^{c}$ as the parent pair, our algorithm generates the offspring 00110 and 10001 via the restrictive crossover. As a result, the feature subset represented by the offspring 10001 achieved a multilabel accuracy of 91%. This search process is repeated until the stopping criterion is met.
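The crossover of this example can be reproduced with the restrictive-crossover sketch from Section 3.3 (repeated here so the snippet runs on its own). The parent bit strings below are hypothetical, since the example's actual chromosomes are not reproduced here; they were chosen only so that, for one shuffle outcome, the offspring match 00110 and 10001.

```python
import numpy as np

def restrictive_crossover(p1, p2, rng):
    """Keep the number of selected features fixed: shared features are
    inherited; the differing ones are shuffled and split (as in Section 3.3)."""
    shared = p1 & p2
    diff = np.flatnonzero(p1 ^ p2)
    rng.shuffle(diff)
    half = diff.size // 2
    o1, o2 = shared.copy(), shared.copy()
    o1[diff[:half]] = 1
    o2[diff[half:]] = 1
    return o1, o2

rng = np.random.default_rng(7)

# Hypothetical parents over the five terms (Music, The, Funny, Lovely, Boring);
# each selects exactly two features, as in the toy example.
parent_a = np.array([1, 0, 1, 0, 0])   # {"Music", "Funny"}
parent_b = np.array([0, 0, 0, 1, 1])   # {"Lovely", "Boring"}

o1, o2 = restrictive_crossover(parent_a, parent_b, rng)
print(o1, o2)   # one possible outcome: [0 0 1 1 0] and [1 0 0 0 1]
```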