1. Introduction
Recently, great progress in remote sensing technology has provided an increasing number of remote sensing images from satellite-borne and airborne sensors for land cover mapping and monitoring. Generally, since high-resolution remote sensing images depict diverse categories of land cover objects, a single label cannot accurately describe the content of an image. Therefore, compared with single-label classification [1,2,3], multi-label remote sensing image classification is a more practical task. Specifically, a Multi-Label Image Classification (MLIC) method is developed to assign a set of preset land cover labels to each remote sensing image. In this paper, we focus on the multi-label remote sensing image classification task in the field of aerial scene understanding.
Benefiting from the great success of deep learning, deep Convolutional Neural Networks (CNNs) [4,5] and vision Transformers [6,7] have been proposed to extract high-level semantic features and have made steady progress in Single-Label multi-class Image Classification (SLIC). These advanced methods are also exploited in single-label remote sensing image classification. By treating each label in isolation, the multi-label problem can be addressed simply by using SLIC methods to predict whether each label is present. However, MLIC is a more complicated task than SLIC. On the one hand, an aerial image contains multiple land cover objects at different scales, which are related to the sizes of the objects. For example, a car is far smaller than a court, and consequently, "car" is one of the inconspicuous categories. On the other hand, since land cover objects generally co-exist in an aerial scene image, the inter-class relationship is another key to classification. Therefore, the MLIC task requires not only accurate spatial feature extraction, but also modeling of the correlations among multiple concepts. In classical MLIC, the utilization of spatial information and of inter-class correlations are both significant issues. To handle spatial information, some works introduce region proposal techniques [9,10], implicit spatial attention [11,12], or multi-scale features [13]. Nevertheless, these methods neglect the relationships among multiple categories. On the other hand, many works have been proposed to model inter-class correlations. Pioneering approaches [14,15,16,17,18] based on Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) learn the correlations through sequential prediction. However, the performance of RNN-based methods is influenced by the pre-set or learned sequence, and complex label correlations cannot be accurately represented by sequential relations. Other works [19,20] formulate the MLIC task as a structural inference problem based on probabilistic graphical models [21], but their practicality is limited by high computational complexity. Inspired by the great success of the Graph Convolutional Network (GCN) [22] in representing multivariate relations, ML-GCN [23] explicitly models label correlations via a GCN. The Transformer [24], originating in the Natural Language Processing (NLP) field, has achieved great success [25,26,27] in the Computer Vision (CV) area. Inspired by its impressive capability to model long-range dependencies, recent works [28,29] leverage the Transformer to capture label dependencies.
Previous methods have proven effective at dealing with label dependencies. In general, these methods directly model the holistic label dependencies among the generated label embeddings of all categories. Nevertheless, only a portion of the category objects is present in a single image, and the visual features extracted from the image are mostly relevant to the ground-truth labels. Consequently, the computed label dependencies among the nonexistent categories are inaccurate. In this paper, these label dependencies are called redundant dependencies, which bring noise to the classification task. Specifically, as shown in Figure 1, a solid line indicates a higher connection and a dashed line indicates a lower connection between categories. According to the presence or absence of the categories in the image, some of the inter-class relationships are obvious and definite, such as the solid lines (higher connections) and dashed grey lines (lower connections). However, dependencies among nonexistent categories (e.g., car and court) cannot be accurately estimated, which negatively impacts classification.
In short, there are mainly three challenges in multi-label remote sensing image classification tasks. Firstly, land cover objects have different scales in an image, making it more challenging to leverage spatial information. Secondly, there are usually co-occurrence dependencies among the objects of different categories in a remote sensing image. Thirdly, the redundant correlations among the nonexistent categories bring noise to multi-label remote sensing image classification.
To this end, we propose a novel and effective multi-label image classification framework, the Semantic-driven Masked Attention Transformer (S-MAT), which consists of a backbone feature extractor, a Semantic Disentanglement Module (SDM), and a Masked Attention Transformer (MAT). In this paper, we first extract the high-level semantic feature of a remote sensing image with a CNN backbone. Then, the extracted feature is disentangled into a set of label embeddings in the SDM. After that, the MAT is trained to model label correlations and update the input label embeddings adaptively. Meanwhile, we exploit the masked attention in the MAT to restrict attention to the categories with higher confidence and avoid noise from nonexistent categories. Finally, the updated label embeddings are projected to image-level predictions, which are combined with the independent predictions generated directly from the high-level image representation to obtain the final predictions. In addition, we note that masked attention has been applied to other visual tasks, such as Mask2Former [30] for image segmentation. The attention mask in Mask2Former is class-agnostic, whereas, in our work, each element $\mathcal{M}_{ij}$ of the attention mask $\mathcal{M}$ corresponds to the relevance between classes $i$ and $j$ in the attention map.
Comprehensive results on three widely used multi-label image recognition benchmarks show that our S-MAT outperforms recent methods that model label relationships via graph convolutional networks or other strategies. In summary, our main contributions are as follows:
A novel Transformer-based framework, S-MAT, namely Semantic-driven Masked Attention Transformer, is proposed. S-MAT aims to filter out the redundant dependencies and obtain more accurate label dependencies for multi-label aerial scene image classification.
We conduct in-depth studies on the application of masked attention and propose a plug-and-play module, the Masked Attention Transformer (MAT), to constrain attention to the categories with higher confidence and reduce the redundant dependencies among the nonexistent classes. To the best of our knowledge, this is the first application of masked attention to modeling inter-class relationships.
We design a plug-and-play module, namely the Semantic Disentanglement Module (SDM), to disentangle the high-level semantic feature into a set of category-relevant embeddings for each image by locating the attention region of each category.
We conduct comprehensive experiments to verify the effectiveness of the proposed approach. On three widely used multi-label aerial scene image recognition benchmarks, including UC-Merced Multi-label, AID Multi-label, and MLRSNet, our models consistently achieve state-of-the-art results.
The rest of this paper is structured as follows.
Section 2 gives the related works.
Section 3 demonstrates the details of the structure and relative setting of the proposed method.
Section 4 is devoted to the discussion of the experiments.
Section 5 gives the ablation studies.
Section 6 gives the qualitative results.
Section 7 analyzes the experimental results and discusses the differences between the proposed method and previous methods.
Section 8 presents the conclusion.
3. Methods
In this section, we introduce a novel Multi-Label Image Classification (MLIC) framework named the Semantic-driven Masked Attention Transformer (S-MAT), which provides a Transformer-based solution that exploits inter-class relationships to improve classification performance. This section consists of four parts. We first review the Transformer in Section 3.1. Then, we introduce the Semantic Disentanglement Module (SDM) in Section 3.2 and the Masked Attention Transformer (MAT) in Section 3.3. Finally, we briefly describe the final classification and loss function in Section 3.4.
3.1. Recap of Transformer
The standard Transformer [
24] architecture is a typical encoder–decoder architecture. This work is based on the Transformer encoder; thus, we shall introduce the Transformer encoder in the following. The Transformer encoder is a multi-level architecture, in which each layer comprises two key components, namely multi-head self-attention module and feed-forward network module. In the area of natural language processing, the conventional Transformers build relationships among different semantic words in the input language sentences from global perspectives. In other areas, the non-serialized input data need to be preprocessed into a sequence. For example, ViT [
6] proposes to cut the image into multiple patches and flatten them into a sequence. Since the sequence loses positional information on the input, position embedding is introduced to preserve the relative position of each element in the sequence.
Given a sequence $X \in \mathbb{R}^{N \times d}$, where $N$ is the length of $X$, it is converted into queries $Q$, keys $K$, and values $V$ by fully connected layers:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$
where $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, and $W_V \in \mathbb{R}^{d \times d_v}$ are the learnable parameters for channel transformation. $d_k$ and $d_v$ are the channel numbers of the key and value. In this paper, we set $d_k = d_v = d$. The standard dot-product attention with the residual path is defined in Equation (1):
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V + Q. \tag{1}$$
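For concreteness, the following is a minimal PyTorch sketch of the dot-product attention with a residual path in Equation (1); the function name and shapes are illustrative choices of ours, not taken from the authors' implementation.

```python
import torch

def attention_with_residual(Q, K, V):
    """Scaled dot-product attention with a residual path, as in Equation (1).

    Q, K, V: tensors of shape (N, d), assuming d_k = d_v = d as set above.
    """
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (N, N) pairwise similarities
    weights = scores.softmax(dim=-1)               # row-wise attention distribution
    return weights @ V + Q                         # aggregate values, add residual
```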
Then, the updated query embedding $\tilde{Q}$ is fed to a fully connected Feed-Forward Network (FFN) to perform a nonlinear mapping. The FFN consists of two linear transformations with a GELU activation in between. The FFN with a residual path is as follows:
$$\mathrm{FFN}(\tilde{Q}) = \mathrm{GELU}(\tilde{Q} W_1 + b_1)\, W_2 + b_2 + \tilde{Q},$$
where $W$ and $b$ stand for the weight matrices and the biases, and the subscripts index the two fully connected layers. In this paper, we set the dimensions of $W_1$, $b_1$, $W_2$, and $b_2$ to $d \times d_{\mathrm{ff}}$, $d_{\mathrm{ff}}$, $d_{\mathrm{ff}} \times d$, and $d$, respectively.
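A corresponding sketch of the FFN sub-layer is given below; `d_ff` is a hypothetical hidden width, since the exact value is not fixed in the text.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear layers with a GELU in between and a residual path."""

    def __init__(self, d, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d, d_ff)  # W1, b1
        self.fc2 = nn.Linear(d_ff, d)  # W2, b2
        self.act = nn.GELU()

    def forward(self, x):              # x: (N, d)
        return self.fc2(self.act(self.fc1(x))) + x  # nonlinear mapping + residual
```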
In our approach, we replace the standard dot-product attention operator with a masked attention operator based on the meta-architecture described above. The overview of our S-MAT framework is presented in Figure 2. It consists of four main parts: (i) high-level feature extraction of the input image via a pre-trained backbone, (ii) construction of label embeddings in the Semantic Disentanglement Module (SDM), (iii) relationship modeling and embedding refinement in the Masked Attention Transformer (MAT), and (iv) computation of the final prediction logits for each category. Note that our method can be attached to any backbone without intrusive modifications. The details are introduced in the following subsections.
3.2. Semantic Disentanglement Module
Given an image $I \in \mathbb{R}^{3 \times H_0 \times W_0}$, where $H_0$ and $W_0$ are the height and the width of $I$, a high-level feature map $X \in \mathbb{R}^{D_0 \times H \times W}$ is extracted from the backbone. In this paper, the label embeddings are constructed by disentangling the image feature map $X$. The disentanglement is divided into two steps: (1) generation of the class-specific activation $M$; (2) matrix product of $X$ and $M$. The class-specific activation represents the probability of a label appearing at each spatial location, and its generation can be formulated as follows:
$$M = \sigma\big(\varphi_m(X)\big),$$
where $\varphi_m$ denotes the $1 \times 1$ convolution layer that transforms the dimension of $X$ from $D_0$ to the number of classes $C$ in the current dataset and $\sigma$ is the sigmoid function. In this paper, $D_0$ was set to 2048. Each element $M_{c,h,w}$ in the class-specific activation $M \in \mathbb{R}^{C \times H \times W}$ represents the probability of category $c$'s presence in the feature map $X$ at location $(h, w)$. We adopt Global Max Pooling (GMP) on $M$ to generate the predictive logit $\hat{y}^{m} \in \mathbb{R}^{C}$, which is constrained by the loss described in Section 3.4 to learn a more accurate representation $M$. Hence, the label embeddings are constructed by the product of $M$ and the transformed feature map as follows:
$$E = \mathcal{R}(M)\,\mathcal{R}\big(\varphi_e(X)\big)^{\top},$$
where $\varphi_e$ denotes the $1 \times 1$ convolution layer that reduces the dimension of $X$ from $D_0$ to $d$ and $\mathcal{R}$ represents the reshape operation, which squeezes the spatial dimensions $H$ and $W$ into one dimension $HW$. Intuitively, each category $c$ selects its region of interest in the transformed feature map. As a consequence, each embedding $E_c \in \mathbb{R}^{d}$ aggregates the corresponding spatial feature and semantic information.
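The SDM can be sketched in PyTorch as follows; module and variable names are our own illustrative choices, and the embedding dimension `d` is an assumption (the text only fixes $D_0 = 2048$). Since the sigmoid is monotonic, taking GMP over the pre-sigmoid map yields logits consistent with GMP on $M$.

```python
import torch
import torch.nn as nn

class SemanticDisentanglementModule(nn.Module):
    """Disentangles a backbone feature map into C label embeddings."""

    def __init__(self, num_classes, d0=2048, d=512):
        super().__init__()
        self.phi_m = nn.Conv2d(d0, num_classes, kernel_size=1)  # class-specific activation
        self.phi_e = nn.Conv2d(d0, d, kernel_size=1)            # dimension reduction

    def forward(self, x):                        # x: (B, d0, H, W)
        a = self.phi_m(x)                        # (B, C, H, W) pre-sigmoid activation
        m = torch.sigmoid(a)                     # class-specific activation M
        y_m = a.flatten(2).max(dim=-1).values    # (B, C) GMP -> predictive logits
        feat = self.phi_e(x).flatten(2)          # (B, d, HW) transformed feature
        e = m.flatten(2) @ feat.transpose(1, 2)  # (B, C, d) label embeddings E
        return e, y_m
```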
3.3. Masked Attention Transformer
The Transformer has proven its outstanding ability to model long-range dependencies. In particular, the built-in mask matrix, which restricts the scope of attention, makes the Transformer a natural choice for modeling label relationships. The mask matrix was originally intended to eliminate the effect of padding on the sequence during training or to avoid exposing the decoder to future tokens in machine translation. In Mask2Former, the mask matrix is exploited to realize local attention by constraining attention to the foreground region instead of the full feature map. As mentioned in Section 1, one crucial problem in multi-label classification is removing the redundant part and obtaining more accurate label dependencies. To solve this problem, we make the first attempt to introduce masked attention into multi-label classification to mask out the redundant label dependencies, confining attention to the categories with higher confidence. The proposed masked attention is shown in Figure 3.
Suppose the ground-truth label set of the input image is $\mathcal{P}$, which contains $p$ labels, and the set of nonexistent labels is $\mathcal{N}$, which contains $n$ labels. Let $r_{ij}$ represent the relationship between labels $i$ and $j$ drawn from the two label sets $\mathcal{P}$ and $\mathcal{N}$. We believe that there are higher relations among the labels in $\mathcal{P}$, namely $r_{ij}$ is high for $i, j \in \mathcal{P}$. On the contrary, $r_{ij}$ is low for $i \in \mathcal{P}$ and $j \in \mathcal{N}$. However, $r_{ij}$ for $i, j \in \mathcal{N}$ is uncertain and redundant. If we filter out the redundant part, we can obtain more accurate label relationships. In this paper, we simply judge the confidence of each label according to $\hat{y}^{m}$ and then generate the mask $\mathcal{M}$. Since $\hat{y}^{m}$ is not completely accurate, we retain some redundancy in the generation of masks by adjusting the proportion to be filtered out.
To generate the attention mask $\mathcal{M} \in \mathbb{R}^{C \times C}$, we first obtain $I$, the index set of the top-$k$ predictions in $\hat{y}^{m}$, via the $\mathrm{topk}$ operator in PyTorch. Then, the attention mask $\mathcal{M}$ is
$$\mathcal{M}_{ij} = \begin{cases} 0, & i \in I \text{ and } j \in I, \\ -\infty, & \text{otherwise.} \end{cases}$$
Thus, the masked attention can be defined as follows:
$$\mathrm{MaskedAttn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + \mathcal{M}\right) V + Q. \tag{7}$$
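Below is a sketch of the mask generation and masked attention, using the PyTorch topk operator mentioned above. One practical detail is an assumption of ours: rows outside the top-k set would otherwise be fully masked and produce NaNs under softmax, so the sketch also zeroes the diagonal so that every class can at least attend to itself.

```python
import torch

def masked_attention(Q, K, V, y_m, k):
    """Masked attention over C label embeddings of dimension d.

    Q, K, V: (C, d); y_m: (C,) predictive logits from the SDM; k: top-k size.
    """
    C, d = Q.shape
    idx = torch.topk(y_m, k).indices                   # index set I of confident classes
    keep = torch.zeros(C, dtype=torch.bool)
    keep[idx] = True
    mask = torch.full((C, C), float('-inf'))
    mask[keep.unsqueeze(1) & keep.unsqueeze(0)] = 0.0  # M_ij = 0 iff i and j are both in I
    mask.fill_diagonal_(0.0)                           # assumption: avoid fully masked rows
    scores = Q @ K.T / d ** 0.5 + mask
    return scores.softmax(dim=-1) @ V + Q              # masked aggregation + residual
```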
Extending the attention mechanism to a multi-head version enables it to consider different aspects of the label relationships. The masked Multi-Head Attention (MMHA) mechanism is built from $h$ parallel instances of the masked attention in Equation (7), and its definition is as follows:
$$\mathrm{MMHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{MaskedAttn}\big(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\big),$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the parameter matrices, with $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d/h}$. $W^{O} \in \mathbb{R}^{d \times d}$ is the output transform matrix, $h$ is the number of heads, and $\mathrm{head}_i$ is the output of each attention head.
Our masked Transformer encoder is a multi-layer architecture in which each layer consists of a masked multi-head attention mechanism and a Feed-Forward Network (FFN). Given the label embedding from the previous layer $E^{l-1}$, each Transformer encoder layer exploits the label relationships and updates the label embedding $E^{l}$ as follows:
$$\tilde{E}^{l} = \mathrm{MMHA}\big(\hat{E}^{l-1}, \hat{E}^{l-1}, E^{l-1}\big), \qquad E^{l} = \mathrm{FFN}\big(\tilde{E}^{l}\big),$$
where $\hat{(\cdot)}$ denotes the feature after adding the position encoding and $\tilde{E}^{l}$ is an intermediate variable.
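Putting the pieces together, one encoder layer can be sketched as follows. This single-head variant reuses the `masked_attention` and `FeedForward` sketches above and assumes a learned position encoding; the authors' multi-head configuration would split the channels across $h$ heads.

```python
import torch
import torch.nn as nn

class MaskedEncoderLayer(nn.Module):
    """One MAT layer: masked attention over label embeddings, then an FFN."""

    def __init__(self, num_classes, d, d_ff):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.ffn = FeedForward(d, d_ff)
        self.pos = nn.Parameter(torch.zeros(num_classes, d))  # position encoding

    def forward(self, e, y_m, k):      # e: (C, d) label embeddings E^{l-1}
        e_hat = e + self.pos           # position encoding added to queries and keys
        e_tilde = masked_attention(self.q_proj(e_hat), self.k_proj(e_hat),
                                   self.v_proj(e), y_m, k)
        return self.ffn(e_tilde)       # E^l
```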
3.4. Final Classification and Loss Function
After being refined by $L$ Transformer encoder layers, we obtain the final label embeddings $E^{L} \in \mathbb{R}^{C \times d}$ and project them to the logit $\hat{y}^{e} \in \mathbb{R}^{C}$ via a linear projection layer:
$$\hat{y}^{e}_{c} = w_{c}^{\top} E^{L}_{c} + b_{c},$$
where $w_{c} \in \mathbb{R}^{d}$ and $b_{c} \in \mathbb{R}$. Then, the final label confidence $\hat{y}$ is obtained by elementwise summing up $\hat{y}^{m}$ and $\hat{y}^{e}$:
$$\hat{y} = \sigma\big(\hat{y}^{m} + \hat{y}^{e}\big),$$
where $\sigma$ is the sigmoid function. The ground-truth label of the input image is $y = [y_1, \ldots, y_C] \in \{0, 1\}^{C}$, where $y_i$ denotes the absence or presence of label $i$ in the image. The whole framework is trained in an end-to-end manner with the traditional multi-label classification loss as follows:
$$\mathcal{L} = -\sum_{i=1}^{C} \Big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big],$$
where $\hat{y}_i$ is the final confidence of label $i$ produced by the sigmoid function above.
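The final fusion and loss can be sketched as below; `proj` stands for the per-class linear projection, and the helper is illustrative rather than the authors' code. `binary_cross_entropy_with_logits` fuses the sigmoid and the cross-entropy above for numerical stability.

```python
import torch
import torch.nn.functional as F

def final_prediction_and_loss(e_final, y_m, y_true, proj):
    """e_final: (B, C, d) refined embeddings E^L; y_m: (B, C) SDM logits;
    y_true: (B, C) binary ground truth; proj: e.g., nn.Linear(d, 1)."""
    y_e = proj(e_final).squeeze(-1)    # (B, C) logits from the label embeddings
    logits = y_m + y_e                 # elementwise sum of the two prediction streams
    loss = F.binary_cross_entropy_with_logits(logits, y_true.float())
    return torch.sigmoid(logits), loss # final label confidence and training loss
```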
7. Discussion
In this section, we analyze the experimental results and the differences between our S-MAT and previous methods. First, as shown in Table 1 and Table 2, S-MAT not only improves precision and recall, but also keeps them in balance. Consequently, S-MAT always wins the comparison on CF1, which considers both precision and recall. Second, RNN-based methods, such as CA-ResNet-BiLSTM and CM-GM-N-R-BiLSTM, barely surpass the baseline ResNet50. Limited by modeling label relations as a pre-set sequence, RNN-based methods are incapable of capturing accurate inter-class relationships. Unlike RNNs, the Transformer can compute an attention matrix among the labels. Therefore, Transformer-based methods, such as ResNet50-SR-Net and our S-MAT-ResNet50, are superior at modeling the relationships between pairwise labels. Moreover, compared with SR-Net, MAT in the S-MAT framework replaces standard dot-product attention with masked attention, which filters out redundancy to obtain a more accurate inter-class relationship. Meanwhile, the SDM disentangles the global feature map to generate the label embeddings and thus introduces valuable spatial information into the feature representation. As demonstrated in Table 4, the combination of these two modules maximizes recognition performance on the three benchmark datasets.
Figure 9 shows a qualitative example on the AID Multi-label dataset. The first column is the input image; the second and third columns are the methods and their output predictions, respectively. Green labels denote true positives; red labels denote false positives; gray labels denote false negatives. We find that the baseline ResNet-50 fails to distinguish objects with similar appearances, such as "trees", "grass", and "field", since the model only utilizes global spatial features. The other methods obtain a lower false positive rate than the baseline by modeling label dependencies. Inconspicuous objects, such as "cars" and "dock", cannot be recognized by ResNet-50 or CA-ResNet-BiLSTM. While SR-Net misses the prediction of "dock", S-MAT recognizes both "cars" and "dock" accurately and obtains better performance.
It is worth noting that our model relies on massive annotated large-scale datasets to learn the semantic context and the label dependencies of a visual scene. However, massive annotated data are costly and rare, while most aerial scene images are unlabeled. Therefore, learning visual features via self-supervised learning is a topic for our future research.