1. Introduction
When faced with complicated information presentation, single-label data cannot meet the needs of real-world applications for multi-level and multi-view data processing. Multi-label data [1] are intended to express that a sample belongs to one or more categories simultaneously and are generally applied in the fields of text [2,3], audio [4,5], image [6,7], biology [8,9], and so on. In general, multi-label data involve a high-dimensional feature space with many irrelevant and redundant features, which affects the classification accuracy and increases the computational complexity [10]. As a result, data dimensionality reduction has become a research hotspot with significant attention: it simplifies the feature space and highlights the correlation of features, helping to reveal the relationship between features and labels.
As one of the important approaches to data dimensionality reduction, feature selection [11] aims to select crucial features from the original data, which reduces the influence of irrelevant features and maintains the interpretability of feature subsets. The challenges of multi-label feature selection are the correlation of labels, the redundancy between features, and the excessive dimensionality of the data. Methods fall into three categories based on the search strategy: wrapper, embedded, and filter [12]. Among them, wrapper methods involve information interaction with an off-the-shelf classifier, whose performance is used as a metric for evaluating the quality of feature subsets [13,14]. Embedded methods directly incorporate feature selection into a classifier's training, designing objective functions and adding constraint terms to emphasize the geometric structure [15]. For example, Jian et al. [16] proposed mapping label information to a low-dimensional space to capture label correlations and guide feature selection, minimizing the impact of imperfect label data (MIFS). Huang et al. [17] used manifold learning to transform the logical label space into a Euclidean space, constraining sample similarity with numerical labels (MCLS). Zhang et al. [18] used two different norm regularizations to identify label-specific and group-specific features, incorporating instance and label-group correlations to execute feature selection (GLFS). Zhang et al. [19] applied a low-dimensional embedding to capture local label correlations and co-occurrence relationships, ensuring convergence through norm regularization (MDFS). However, both wrapper and embedded methods incur large space complexity, involve many hyperparameters, and depend on classifiers, which may lead to overfitting [20]. Filter methods evaluate feature importance by considering the distribution of features, which is independent of the classifier and requires relatively little running time [21,22]. They fall into two cases: one transforms multi-label data into single-label data, and the other improves an existing criterion or directly proposes a new one. However, filter methods ignore the influence of the classifier, and their feature importance does not fully consider the correlation of labels.
Traditional multi-label feature selection methods are ill-suited to analyzing the correlation of features, making it difficult to extract the global information in multi-label data. To compensate for this issue, information theory [23] was introduced to calculate correlations through ideas such as statistical analysis, entropy, and mutual information [24,25,26]. Moreover, the relationship between features and labels is quantified to some extent [27], and features with strong label dependency are selected based on evaluation metrics and ranking. However, the relationship between features and labels is usually assessed with a single evaluation metric, which makes it difficult to consider the correlation of labels and fails to grasp the purpose of multi-label classification tasks.
From the perspective of labels, the key issue of feature selection lies in a complete and accurate understanding of the relationship between features and labels. As a heuristic search strategy, evolutionary computation [28] uses an inductive algorithm as a "black box" without any prior knowledge to discover potentially crucial features [29], and the co-occurrence pattern of labels is comprehensively described by an objective. Considering that multi-label data involve multiple objectives, multi-objective optimization [30,31,32] is used to balance these objectives, such as improving classification accuracy and reducing the number of selected features, and feature subsets that excel in a variety of aspects are obtained. However, the Pareto front [33,34] in multi-objective optimization still exhibits obvious feature redundancy for specific labels, making the effect of data dimensionality reduction incomplete.
To obtain the optimal feature subset, it is necessary to explore the relationship between features and labels. Graph theory [35,36] is commonly used to show the connections between objects: nodes usually represent the entities in multi-label data (e.g., features or labels), while edges represent the associations between nodes (e.g., relationships or correlations). In this way, the relationship between features and labels can be analyzed intensively to reduce the data dimensionality more effectively. However, when each label acts as an independent objective, the number of node connections grows exponentially with the label dimension, which affects the efficiency of feature selection.
This paper proposes a multi-label feature selection method based on feature–label subgraph association with graph representation learning (SAGRL). In SAGRL, features and labels are mapped to nodes in a graph structure, and highly relevant nodes are aggregated into the same class to form feature and label sets. These sets are combined to construct feature–label subgraphs, which provide sufficient solution space for feature selection. Furthermore, the features in each subgraph are ranked against multiple objectives from the perspective of labels to replace the previous feature set. For the feature and label nodes in each new subgraph, feature combinations are extracted by graph representation learning. The selected features are ranked to determine their importance and obtain the optimal feature subset, making them more representative. The main contributions of the paper are as follows:
SAGRL is proposed for multi-label feature selection, which considers features and labels as nodes to construct feature–label subgraphs and adjusts their relationship within each subgraph.
Features and labels are mapped into the graph structure and aggregated to mine the intrinsic structural information of multi-label data and to highlight the correlation of features or labels.
The subgraph association is updated based on multi-objective optimization to choose the Pareto front feature set, remove redundant features from the original data, and reconstruct new subgraphs.
The subgraphs are matched via augmenting paths to pair feature and label sets, and the optimal feature subset is obtained by analyzing and ranking the importance of nodes.
The remainder of this paper is structured as follows. In Section 2, the related work is briefly reviewed. In Section 3, the concepts involved in the proposed method are briefly introduced. In Section 4, the proposed method is described in detail. Experimental results are presented and discussed from different aspects in Section 5. Finally, conclusions and future work are presented in Section 6.
4. The Proposed Method
In this section, the proposed multi-label feature selection method is presented. The step-by-step procedure is listed in Algorithm 1. First, the shortcomings of previous methods are briefly summarized, and feasible remedies are given in the motivation. Then, the key techniques of SAGRL are listed specifically: the construction of feature–label subgraphs, the updating of subgraph associations, and the integration and deduplication of matched subgraphs, yielding the optimal feature subset.
Algorithm 1: SAGRL for multi-label feature selection
Require: The feature matrix X contains N samples and d features. The label matrix Y contains N samples and c labels. The clustering numbers $k_1$ and $k_2$ are used for the features and labels.
Ensure: The optimal feature subset F.
1: The feature similarity matrix is calculated by Equation (8) to quantify the correlation of features;
2: The label co-occurrence frequency matrix is calculated by Equation (9) from the frequency of co-occurrence between labels;
3: The two weighted graphs are constructed from these matrices;
4: The feature sets are obtained by spectral clustering of the feature graph ($k_1$ is the number of clusters);
5: The label sets are obtained by spectral clustering of the label graph ($k_2$ is the number of clusters);
6: R is the feature–label similarity matrix calculated by Equation (12);
7: Extract the submatrix of R for each feature–label subgraph;
8: for i = 1 to k do
9: The non-dominated features in the i-th subgraph are found by Equation (7);
10: end for
11: Update the subgraphs with the Pareto-front features;
12: for i = 1 to k do
13: The feature combination of the i-th subgraph is obtained by one-to-one matching;
14: end for
15: The candidate feature set is obtained by the integration and deduplication of the feature combinations;
16: F is obtained via ranking by Equation (13);
4.1. Motivation
Multi-label data involve feature–feature, label–label, and feature–label dependencies, but traditional methods mainly focus on feature redundancy and fail to fully consider the co-occurrence patterns of labels. In multi-objective optimization, when the number of objectives exceeds a threshold, most features may become non-dominated, leaving the feature subset still redundant. In addition, simply regarding multi-label data as a graph structure makes it difficult to reflect potential node connections, which obscures the relationship between features and their corresponding labels.
For the above problems, SAGRL is utilized for multi-label feature selection to efficiently analyze the features corresponding to each label set. Among them, graph representation learning expresses the complex correlations of features or labels by mapping multi-label data into the graph structure, and the feature and label sets are combined to construct feature–label subgraphs. To better associate the subgraphs, the connection relationship between them is adjusted based on multi-objective optimization, and non-dominated sorting is employed to remove highly relevant features. Features and labels in each subgraph are matched one to one, and all feature combinations are integrated and deduplicated to obtain the optimal feature subset.
4.2. Feature–Label Subgraph Construction
Multi-label data are viewed as a graph structure, in which the edges connecting nodes indicate correlation. Spectral clustering is performed from the perspectives of features and labels; the graph is cut to form feature and label sets, which are then combined to construct feature–label subgraphs. The process of feature–label subgraph construction is shown in Figure 2.
To quantify the correlation of features, mutual information is used as a criterion because it captures not only linear relationships but also non-linear feature dependencies; the similarity matrix is defined as Equation (8):

$W = [w_{ij}]_{d \times d}$ (8)

where d denotes the number of features, and $w_{ij}$ indicates the correlation of the i-th and j-th features, calculated by Equation (3). The co-occurrence frequency is used to reveal the correlation of labels, which visually reflects the probability of labels occurring simultaneously in the dataset. The co-occurrence frequency matrix is defined as Equation (9):

$C = [c_{ij}]_{c \times c}$ (9)

where c denotes the number of labels, and $c_{ij}$ is the frequency with which the i-th and j-th labels co-occur.
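As an illustration of Equations (8) and (9), the following Python sketch computes a pairwise mutual-information matrix over discretized features and a normalized label co-occurrence matrix. The function names, the equal-width binning, and the normalization by N are our assumptions, not the authors' implementation.

```python
# Hedged sketch of Eqs. (8) and (9): pairwise mutual information between
# (discretized) features, and label co-occurrence frequencies.
import numpy as np
from sklearn.metrics import mutual_info_score

def feature_similarity(X, bins=10):
    """W[i, j] = MI(f_i, f_j) on equal-width discretized features."""
    Xd = np.stack([np.digitize(col, np.histogram(col, bins)[1][:-1])
                   for col in X.T], axis=1)
    d = Xd.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            W[i, j] = W[j, i] = mutual_info_score(Xd[:, i], Xd[:, j])
    return W

def label_cooccurrence(Y):
    """C[i, j] = relative frequency with which labels i and j co-occur."""
    Y = np.asarray(Y, dtype=float)
    return (Y.T @ Y) / Y.shape[0]   # binary labels: counts normalized by N

X = np.random.rand(100, 6)
Y = (np.random.rand(100, 4) > 0.7).astype(int)
print(feature_similarity(X).shape, label_cooccurrence(Y).shape)
```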
As connections are established between nodes, two weighted graphs are output, notated as $G_F$ (features) and $G_L$ (labels). The graphs are partitioned into $k_1$ and $k_2$ subgraphs, respectively, to reduce the space complexity. Taking $G_F$ as an example, after partitioning, the maximization of edge weights within each subgraph and the minimization of edge weights between different subgraphs are pursued; for k subgraph node sets $A_1, A_2, \cdots, A_k$, the objective function is defined as Equation (10) [67]:

$\min \, \mathrm{cut}(A_1, \cdots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \sum_{u \in A_i, \, v \in \bar{A}_i} w_{uv}$ (10)

where $w_{uv}$ denotes the edge weight between the u-th and v-th nodes (features), $\bar{A}_i$ is the complement of $A_i$, the minimum is taken over all partitions, and cut(·) represents the graph cut operation. Because of the fuzzy constraints, it is necessary to introduce indicator vectors to refine the objective function, and the Lagrange multiplier [68] is used to transform the constrained problem into an unconstrained one, as shown in Equation (11):

$\min_{H} \mathrm{Tr}(H^{\top} L H), \quad \mathrm{s.t.} \; H^{\top} H = I$ (11)

where H and L denote the indicator and Laplacian matrices, and $\lambda$ is an eigenvalue of L. The eigenvectors corresponding to the k smallest eigenvalues are spliced into a matrix as an approximate solution of H, and the clustering results are achieved by discretization. The feature and label sets are combined into subgraphs that provide the entity information for considering the relationship between features and labels.
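The relaxation in Equations (10) and (11) corresponds to the standard spectral clustering routine: form the Laplacian, take the eigenvectors of the k smallest eigenvalues as the relaxed indicator matrix H, and discretize. A minimal sketch, assuming an unnormalized Laplacian and k-means discretization:

```python
# Minimal sketch of the spectral partition behind Eqs. (10)-(11).
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(W, k):
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    H = vecs[:, :k]                            # approximate indicator matrix
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)

W = np.abs(np.random.rand(12, 12)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
print(spectral_partition(W, 3))                # cluster index per feature node
```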
4.3. Feature–Label Subgraph Association Updating
The feature–label similarity matrix is calculated to represent the objective values of features on the corresponding labels, and the submatrix for each subgraph is extracted from it. Labels act as independent objectives, and non-dominated sorting is applied to select the features on the Pareto front in each subgraph when multiple objectives are considered simultaneously. The process of adjusting the feature–label subgraph associations is shown in Figure 3 (using the subgraph circled with red lines in Figure 2 as an example).
The association degree of features and labels is comprehensively considered by calculating the feature–label similarity matrix, which is defined as Equation (12):

$R(x, y) = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{D(x)} \sqrt{D(y)}}$ (12)

where x and y denote features and labels, respectively, Cov(x, y) is the covariance of x and y, and D(·) indicates the variance.
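Equation (12) is a Pearson-style correlation between each feature and each label; the sketch below computes the full d × c matrix R in one pass. Taking the absolute value so that the objectives are sign-free is our assumption.

```python
# Sketch of Eq. (12): feature-label similarity as (absolute) Pearson correlation.
import numpy as np

def feature_label_similarity(X, Y):
    Xc = X - X.mean(axis=0); Yc = Y - Y.mean(axis=0)
    cov = Xc.T @ Yc / X.shape[0]                     # Cov(x, y)
    denom = np.outer(Xc.std(axis=0), Yc.std(axis=0)) # sqrt(D(x)) * sqrt(D(y))
    return np.abs(cov / (denom + 1e-12))             # d x c matrix R

X = np.random.rand(100, 5)
Y = (np.random.rand(100, 3) > 0.5).astype(float)
print(feature_label_similarity(X, Y).round(2))
```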
In multi-objective optimization, non-dominated sorting is used in each subgraph to yield its Pareto front, which makes the selected features fit the different distributions of labels in each subgraph. The 3rd subgraph in Figure 3 shows 6 features (indexed 1, 2, 3, 6, 8, 9) and 3 objectives corresponding to the labels indexed 1, 3, and 5, with the extracted submatrix r. In the non-dominated sorting, the red circles 1, 3, 6, and 9 denote the non-dominated features, which are the members of a Pareto optimal set, while the dominated features are shown as green circles and the objectives as orange circles. Each red-circled feature has a larger similarity value in more than one dimension compared with the others. The features indexed 1, 3, 6, and 9 on the Pareto front are identified from the figure, so the feature–label submatrix is updated to retain only their rows.
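The Pareto-front extraction on a submatrix r can be implemented with a direct dominance check; the quadratic scan below suffices for the small per-subgraph submatrices. The tie-handling convention (weak dominance in all objectives plus one strict improvement) follows the usual definition and is assumed here.

```python
# Sketch of the per-subgraph Pareto-front extraction: a feature is kept if no
# other feature is at least as similar on every label objective and strictly
# more similar on at least one.
import numpy as np

def pareto_front(r):
    n = r.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(r[j] >= r[i]) and np.any(r[j] > r[i]):
                keep[i] = False                # feature i is dominated by j
                break
    return np.flatnonzero(keep)

r = np.array([[0.9, 0.1, 0.4],   # rows: features; columns: label objectives
              [0.2, 0.8, 0.3],
              [0.1, 0.1, 0.2]])  # dominated by both other rows
print(pareto_front(r))           # -> [0 1]
```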
4.4. Feature–Label Subgraph Matching
Since each subgraph includes multiple objectives, some redundant features remain. As a result, a bipartite graph is applied to match features and labels one to one in the new subgraphs, acquiring a series of feature combinations. In particular, the same feature may exist in different subgraphs; the one-to-one matching avoids assigning the same feature combination to different labels. The optimal feature subset is obtained by integrating the feature combinations after ranking, and the process of feature–label subgraph matching is shown in Figure 4.
Figure 4. Feature–label subgraph matching.
Since feature–label subgraph matching requires the two vertex sets (left and right) to be of equal size, a squarization operation on the submatrix is essential. For the 3rd subgraph in Figure 4, with 4 features (indexed 1, 3, 6, 9) and 3 labels (indexed 1, 3, 5), the submatrix is the Pareto-filtered r. The cost matrix is calculated by subtracting this submatrix from the all-ones matrix, and matching the feature–label subgraph yields the assignment B, in which the best matching scheme is underlined. The feature indexes are extracted for each label from B, so that each label is matched with exactly one feature. The blue and yellow circles (features) in the figure are selected repeatedly by different labels, so it is necessary to integrate and deduplicate the feature combinations.
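A hedged sketch of this matching step: the cost matrix is the all-ones matrix minus the similarity submatrix, squarized with dummy cells, and solved with scipy's Hungarian solver as a stand-in for the augmenting-path matcher; the padding convention is our assumption.

```python
# Sketch of the one-to-one feature-label matching on a Pareto-filtered submatrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(r, feat_idx):
    """r: (n_features x n_labels) similarity submatrix after Pareto filtering."""
    n, m = r.shape
    cost = np.ones((max(n, m), max(n, m)))      # squarization with dummy cells
    cost[:n, :m] = 1.0 - r                      # cheap = highly similar
    rows, cols = linear_sum_assignment(cost)
    # keep only real (non-dummy) pairs and map rows back to feature indexes
    return {c: feat_idx[rr] for rr, c in zip(rows, cols) if rr < n and c < m}

r = np.array([[0.9, 0.2, 0.4],
              [0.3, 0.8, 0.1],
              [0.5, 0.4, 0.7],
              [0.6, 0.3, 0.6]])
print(match_features(r, feat_idx=[1, 3, 6, 9]))  # label column -> feature id
```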
Additionally, the PageRank [69] algorithm is utilized to assign weights to the features, ranking them to obtain the optimal feature subset; the weighting of features is defined as shown in Equation (13):

$PR(v_i) = \dfrac{1 - d}{n} + d \sum_{v_j \in I(v_i)} \dfrac{PR(v_j)}{O(v_j)}$ (13)

where $PR(v_i)$ is the PageRank value of node $v_i$, d is the damping factor (usually set to 0.85), n is the number of nodes, $I(v_i)$ is the set of nodes directed to $v_i$, and $O(v_j)$ is the out-degree of $v_j$.
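For reference, a minimal power-iteration implementation of Equation (13); how the directed graph over selected features is built is not fully specified above, so the adjacency matrix here is purely illustrative.

```python
# Sketch of the PageRank weighting in Eq. (13) over a small directed graph.
import numpy as np

def pagerank(A, d=0.85, iters=100):
    """A[i, j] = 1 if node i links to node j; returns PR value per node."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    out[out == 0] = 1                       # avoid division by zero for sinks
    M = A / out                             # row-stochastic transition matrix
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        pr = (1 - d) / n + d * (M.T @ pr)   # Eq. (13), iterated to convergence
    return pr

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(np.argsort(-pagerank(A)))             # feature ranking by PR weight
```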
5. Experimental Studies
In this section, we verify the performance of the proposed method through comprehensive experiments, using datasets from different domains, several evaluation metrics, a common experimental setting, comparisons with other state-of-the-art multi-label feature selection methods, and an analysis of parameter sensitivity.
5.1. Datasets
The datasets used for the experiments are derived from the Mulan (http://mulan.sourceforge.net/datasets.html, accessed: 5 August 2024) and Meka (http://waikato.github.io/meka/datasets/, accessed: 5 August 2024) databases, covering the text, image, audio, and bioinformatics domains. Table 2 summarizes the specifications of these datasets, including dataset name (Datasets), number of samples (Samples), number of features (Features), number of labels (Labels), and feature type (Type). In addition, label cardinality (LC) [48] is the average number of labels associated with each sample, as defined by Equation (14); label density (LD) [48] is the cardinality normalized by the number of labels, as defined by Equation (15); and the domain of the dataset (Domain) is also included.

$LC = \dfrac{1}{N} \sum_{i=1}^{N} |Y_i|$ (14)

$LD = \dfrac{LC}{c}$ (15)
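A two-line computation of LC and LD as reconstructed in Equations (14) and (15):

```python
# Label cardinality (mean labels per sample) and label density (LC / c).
import numpy as np

def lc_ld(Y):
    Y = np.asarray(Y, dtype=float)
    lc = Y.sum(axis=1).mean()      # Eq. (14): average labels per sample
    ld = lc / Y.shape[1]           # Eq. (15): normalized by label count c
    return lc, ld

Y = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]])
print(lc_ld(Y))                    # -> (2.0, 0.5)
```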
5.2. Evaluation Metrics
Hamming Loss, Ranking Loss, Coverage, Average_Precision, macrof1, and microf1 are used as the evaluation metrics [70] to measure the performance of SAGRL. Let $\{(x_i, Y_i) \mid 1 \le i \le p\}$ be a test set, L be the full label set, $f(x_i, y)$ be the real-valued score of label y on instance $x_i$, and $h(x_i)$ be the predicted label set for the unknown instance $x_i$.
Hamming Loss: This metric evaluates the average error rate over all the binary labels:

$\mathrm{HammingLoss} = \dfrac{1}{p} \sum_{i=1}^{p} \dfrac{1}{|L|} \, \| h(x_i) \oplus Y_i \|_1$

where ⊕ is the symmetric difference between two sets, and $\| \cdot \|_1$ denotes the $\ell_1$ norm.
Ranking Loss: This metric evaluates the fraction of reversely ordered label pairs:

$\mathrm{RankingLoss} = \dfrac{1}{p} \sum_{i=1}^{p} \dfrac{|\{(y', y'') \mid f(x_i, y') \le f(x_i, y''), \, (y', y'') \in Y_i \times \bar{Y}_i\}|}{|Y_i| |\bar{Y}_i|}$

where $\bar{Y}_i$ is the complementary set of $Y_i$ in L.
Coverage: This metric evaluates how many steps are needed, on average, to go down the label ranking list so as to cover all the ground-truth labels:

$\mathrm{Coverage} = \dfrac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} \mathrm{rank}_f(x_i, y) - 1$

where $\mathrm{rank}_f(x_i, y)$ returns the rank of y in L when all labels are sorted based on f in descending order.
Average_Precision: This metric evaluates the average fraction of relevant labels ranked higher than a particular label $y \in Y_i$:

$\mathrm{AveragePrecision} = \dfrac{1}{p} \sum_{i=1}^{p} \dfrac{1}{|Y_i|} \sum_{y \in Y_i} \dfrac{|\{y' \in Y_i \mid \mathrm{rank}_f(x_i, y') \le \mathrm{rank}_f(x_i, y)\}|}{\mathrm{rank}_f(x_i, y)}$
Macrof1: This metric evaluates the classification accuracy of a label set, which considers F-measure averaging on each label.
Microf1: This metric evaluates the classification accuracy of a label set, which considers F-measure averaging on the prediction matrix.
For Hamming Loss, Ranking Loss, and Coverage, lower values indicate better performance, while for Average_Precision, macrof1, and microf1, higher values indicate better performance.
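For readers reproducing these metrics, scikit-learn provides close counterparts to all six (note that sklearn's coverage_error counts ranks from 1, hence the −1 below); this mapping is our suggestion, not the authors' evaluation code.

```python
# Computing the six metrics with scikit-learn, given binary ground truth
# Y_true and real-valued ranking scores S from a classifier such as MLKNN.
import numpy as np
from sklearn.metrics import (hamming_loss, label_ranking_loss, coverage_error,
                             label_ranking_average_precision_score, f1_score)

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
S = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.7, 0.2]])
Y_pred = (S >= 0.5).astype(int)

print(hamming_loss(Y_true, Y_pred))
print(label_ranking_loss(Y_true, S))
print(coverage_error(Y_true, S) - 1)   # sklearn counts ranks from 1
print(label_ranking_average_precision_score(Y_true, S))
print(f1_score(Y_true, Y_pred, average='macro'))
print(f1_score(Y_true, Y_pred, average='micro'))
```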
5.3. Experimental Setting
In this section, the proposed method is evaluated on 11 public datasets. To justify its performance, it is compared with several multi-label feature selection methods: PPT-CHI (Pruned Problem Transformation–CHI-square) [38], PPT-MI [37], PMU (Pairwise Multi-Label Utility) [40], D2F [43], MIFS [16], MCLS [17], MDFS [19], MGFS [53], MLACO [29], BMFS [55], and GLFS [18]. All parameters of the comparison methods are set according to the recommendations in the corresponding papers. In each run, 60% of the samples are chosen randomly as the training data and the remaining 40% are used as the test data, and the experimental results are averaged over 20 independent runs. All experiments are performed on Microsoft Windows 10 with an Intel(R) Core(TM) i7-10700 CPU using Matlab R2021b.
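A minimal sketch of this protocol, assuming placeholder select_features and metric_fn callables for the feature selector and the evaluation metric:

```python
# Evaluation protocol: random 60/40 split, 20 independent runs, averaged metric.
# select_features and metric_fn are placeholders, not the authors' code.
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, Y, metric_fn, select_features, n_runs=20):
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, Y_tr, Y_te = train_test_split(
            X, Y, train_size=0.6, random_state=seed)
        feats = select_features(X_tr, Y_tr)          # e.g., SAGRL ranking
        scores.append(metric_fn(X_tr[:, feats], Y_tr, X_te[:, feats], Y_te))
    return float(np.mean(scores))
```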
MLKNN [71] is a commonly used classifier for multi-label classification tasks that extends the traditional k-nearest-neighbor classifier. By considering the neighbor distribution of each label, the Bayes rule is applied to predict the applicability of labels.
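A hedged usage example with the scikit-multilearn implementation of MLKNN (the paper's experiments use Matlab; this Python snippet is only illustrative):

```python
# Illustrative MLkNN usage with scikit-multilearn on random toy data.
import numpy as np
from skmultilearn.adaptation import MLkNN

X_train = np.random.rand(80, 20)
Y_train = (np.random.rand(80, 5) > 0.7).astype(int)
X_test = np.random.rand(40, 20)

clf = MLkNN(k=10)                     # k nearest neighbors, Bayesian posterior
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)          # sparse binary predictions
scores = clf.predict_proba(X_test)    # posterior scores for ranking metrics
print(Y_pred.shape, scores.shape)
```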
When the feature or label dimensionality is low, fewer features are matched to the corresponding labels. For datasets with fewer than 300 features, the evaluation metrics are plotted for {1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50} selected features, and for those with more than 300 features, {1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} are used. In each figure, the horizontal axis represents the number of features, and the vertical axis represents the evaluation metric.
5.4. Results and Discussion
5.4.1. Results for Different Datasets
The curves of the comparison methods on the 11 datasets in terms of the six evaluation metrics are shown in Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15. The experimental results show that the proposed method outperforms most of the comparison methods on the six evaluation metrics, which underscores its effectiveness in accurately identifying and predicting labels. On the Corel5k dataset, the values of all methods on Hamming Loss are around 0.00945, and the convergence trend is unobvious, which may be related to the inherent characteristics of the dataset, as the label dimension is 374 and the LD is only 0.0094. When the number of features is between 50 and 70 on the Social dataset, the additional features have a certain effect on the ranking, and the fluctuations in Ranking Loss, Coverage, and Average_Precision are 0.007, 0.27, and 0.06, respectively. On the Bibtex and Education datasets, SAGRL exhibits the same convergence trend as MDFS with better values; it performs more prominently when the number of selected features is small, showing potential to cope with the "curse of dimensionality". From a domain perspective, the "text" domain demonstrates a clear advantage, particularly on the Bibtex dataset, where performance is superior to the other comparison methods when the number of features is kept between 0 and 20. On the Corel5k dataset, while the convergence trend of the Hamming Loss leaves room for improvement, the other metrics perform well. In contrast, the limited number of datasets in the "audio" and "biology" domains prohibits a more in-depth analysis in these areas.
5.4.2. Comparison of Running Time
Table 3 records the running time (in seconds) of SAGRL and all the comparison methods on the 11 datasets, and the experimental results show that the running time of the proposed method is less than 1 s on every dataset. Even on the Bibtex dataset, it takes only 0.641 s, while MGFS takes 1.197 s and the other methods take 10 s or more. SAGRL is slower than MGFS and PPT-MI by 0.227 s and 0.094 s on the Corel5k and Medical datasets, respectively, but it outperforms them in classification accuracy, and its running time is clearly superior on the other datasets. For example, on the Arts, Education, Enron, and Social datasets, SAGRL takes only 0.03 s, 0.042 s, 0.116 s, and 0.115 s, while even MGFS takes 0.14 s, 0.129 s, 0.243 s, and 0.307 s. SAGRL divides the feature and label sets into smaller, more manageable subgraphs through spectral clustering, minimizing inter-subgraph correlations. Each subgraph is processed individually, reducing the computational load. Combined with the selective ranking and aggregation mechanisms, this keeps the computation time under 1 s on all datasets. The advantage in running time indicates that SAGRL is fast enough for feature selection and exhibits exceptional consistency and reliability across datasets from different domains.
5.4.3. Statistical Analysis of Evaluation Metrics
To statistically compare the performance of SAGRL with the other methods, the Friedman N×N test is used to calculate the ranks on the 11 datasets, and the overall wins/ties/losses of SAGRL versus the other methods are summarized. A lower sum of ranks indicates better performance. In the tables, the numbers in parentheses are the rankings of the methods on the different evaluation metrics, and the last row denotes the ranking sum. From the rankings shown in Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14, the proposed method is ranked in the top 3 on the Bibtex, Emotions, and Medical datasets and is ranked 1st overall on all other datasets. On the Enron, Image, and Scene datasets, it ranks 1st for every evaluation metric.
Table 15 shows the sum of wins/ties/losses for SAGRL versus the other methods on all datasets. According to the last row, among the 66 cases (11 datasets × 6 evaluation metrics), SAGRL significantly wins over PPT-CHI, PPT-MI, PMU, D2F, MIFS, MCLS, MDFS, MGFS, MLACO, BMFS, and GLFS in 61, 62, 66, 65, 66, 64, 59, 53, 66, 50, and 54 cases, respectively.
5.5. Parameter Sensitivity
The clustering numbers
and
for features and labels are essential to the classification accuracy. To investigate the influence of clustering number, further experiments are carried out under different values of
and
combinations as shown in
Figure 16. For fewer than 1000 features,
is set to
,
,
,
, and
, respectively. For more than 1000 features,
is set to
,
,
,
, and
. For
, both are set to
,
, and
[
72]. It is clear from the figure that when
, larger
values (e.g.,
,
) have a lower average Hamming Loss. When
(e.g., Bibtex, Enron, Medical) a smaller k1 (e.g.,
,
) works better. When there are few label dimensions (e.g., Emotions, Image, Scene), larger
values (e.g.,
,
) show a more favorable performance. When the are many label dimensions (e.g., Bibtex, Enron), smaller
values (e.g.,
,
) tend be advantageous. These results validate that the value of
is taken as
for the datasets with fewer features and
for the datasets with high dimensionality. The value of
is taken as
for the datasets with fewer labels and
for the datasets with more label dimensions.
6. Conclusions
In this paper, a novel SAGRL is proposed for multi-label feature selection. To highlight the correlation of features or labels, feature–label subgraphs are constructed based on spectral clustering, which has not been adequately considered in previous filter-based studies. For each subgraph, combined with graph representation learning, highly relevant features are removed using non-dominated sorting, and the relationship between features and labels is adjusted through the connection of nodes in the subgraph. Further, augmenting paths are located to conduct one-to-one matching based on the bipartite graph. The effectiveness of the proposed method is confirmed on 11 multi-label datasets covering the domains of text, image, audio, and biology. Compared with some state-of-the-art multi-label feature selection methods, our approach shows superior performance in terms of both evaluation metrics and running time. Given the uncertainty of category boundaries in multi-label data, there are often more label dimensions than features, which poses a challenge for accurately selecting the crucial features. In the future, it would be useful to extend multi-label feature selection to the multi-modal field by selecting a series of optimal feature subsets that satisfy specific constraints. Moreover, the higher-order correlation of labels could be fully considered to guide the model, learn the co-occurrence patterns among labels, and improve the classification accuracy.