1. Introduction
The rapid advancement of space information technology and the exponential expansion of remote sensing image data have created a pressing need for the efficient and convenient extraction of valuable information from vast amounts of remote sensing images. In response to this demand, cross-modal retrieval between remote sensing images and text descriptions has emerged as a valuable approach. This retrieval process involves finding text descriptions that match given remote sensing images or identifying remote sensing images that contain relevant content based on text descriptions. The growing attention towards this field highlights its potential in addressing the aforementioned demand.
Recent studies on the cross-modal retrieval of remote sensing images and texts have predominantly followed a two-step approach, involving unimodal feature extraction (Figure 1a) and multimodal interaction (Figure 1b). During the unimodal feature extraction stage, remote sensing images and text data are transformed into numerical representations that capture their semantic content for further statistical modeling. Deep learning techniques, such as convolutional neural networks (CNNs) (e.g., VGGNet [1], ResNet [2]) and vision Transformer networks [3], are commonly employed for extracting image features. Similarly, recurrent neural networks (RNNs) (e.g., LSTM [4], GRU [5]) and Transformer models (e.g., BERT [6]) are utilized for extracting textual features. In the subsequent multimodal interaction stage, the semantic consistencies between image and text features are leveraged to generate comprehensive feature representations that effectively summarize the multimodal data. Baltrusaitis et al. [7] classified multimodal feature representations into joint representations and coordinated representations. Joint representations merge multiple unimodal signals and map them into a unified representation, while coordinated representations process information independently for each modality while incorporating similarity constraints between different modalities. Following this framework, recent methods for multimodal interaction between remote sensing images and texts can be categorized into two groups: multimodal semantic alignment and multimodal fusion encoding.
The upper part of Figure 1b illustrates multimodal semantic alignment methods [8,9,10,11,12,13,14,15,16,17,18]. These approaches aim to align image and text data in a common embedding space based on their semantic information, so that images and texts with similar semantics are positioned closer to each other in this space. During cross-modal retrieval, the similarity between image and text features is determined by measuring their distance in the common embedding space, followed by sorting. For the multimodal interaction, a simple dot product or shallow attention mechanisms are commonly employed to calculate the similarity between images and texts. Triplet loss [19] and InfoNCE loss [20] are utilized either directly or through intermediate variables to impose constraints on the position and distance of image and text features within the embedding space. The bottom half of Figure 1b depicts multimodal fusion encoding methods [21]. This approach feeds remote sensing image and text features into a unified fusion encoder to obtain joint representations of the image–text pairs. Subsequently, a binary classification task known as the image–text matching (ITM) task is performed to determine the degree of compatibility between the image and text. During retrieval, the ITM score is employed as a measure of similarity between the image and text.
Significant advancements have been achieved in the cross-modal retrieval of natural images and texts, resulting in impressive average R@1 accuracies of 75.8% and 95.3% on the MS COCO and Flickr30k datasets, respectively [22]. However, compared to natural images, remote sensing images possess three distinct characteristics. Firstly, they are objective representations of ground objects, leading to intricate and diverse semantic details within the images; this implies that a remote sensing image can be dissected into multiple basic units of semantic expression. Secondly, unlike natural images, remote sensing images lack specific themes and focal points [23], which contributes to their pronounced multi-perspective nature. Consequently, the same remote sensing image can generate various descriptions from different perspectives, encompassing different combinations and permutations of the underlying fine-grained semantic units. Thirdly, remote sensing images of the same geographical area may exhibit variations in color, brightness, resolution, and shooting angle due to factors such as weather conditions, photography equipment, and aircraft positions. These inherent characteristics pose substantial challenges for effective cross-modal retrieval of remote sensing images.
The global similarity between an image and a text commonly arises from a complex aggregation of local similarities between image–sentence instances [24]. Due to the fine-grained semantic composition and multi-perspective nature of remote sensing images, it is essential to capture the intricate correlation clues between image and text at a granular level, establishing connections between specific image regions and the corresponding textual words. To accomplish this, researchers have explored the use of fine-grained unimodal features: region features [25] and patch features [21] have been utilized for images, while word features have been employed for texts [14,21,25]. Fine-grained correlations between images and texts are then established through cross-attention mechanisms between the modalities. However, even with high-performance unimodal encoders, simplistic interaction calculations between the features may still fall short when dealing with complex vision-and-language tasks [26]. To address this limitation, Li et al. [21] introduced a large-scale Transformer network as a multimodal fusion encoder. By leveraging multiple multi-head cross-attention modules, this approach enables complex interaction calculations on the fine-grained features across modalities, thereby further exploring potential fine-grained correlations between the modalities.
However, existing multimodal fusion encoding models for remote sensing image–text retrieval primarily rely on the ITM task as the sole training objective and lack precise supervision signals for capturing fine-grained correlations between images and texts, making it difficult to supervise the correlation between specific words in the text and the corresponding regions in the image. To address this issue, we incorporate the masked language modeling (MLM) task from recent vision-language pre-training (VLP) models [27,28,29]. In the MLM task, certain words in the text are masked, and the model is trained to predict these masked words using contextual information from the masked text and patch-level information from the image. This approach facilitates a more effective capture of fine-grained image–text correlations.
In addition, the variations in remote sensing image acquisition, including weather conditions, sensor configurations, and viewing angles, make it difficult for models to establish fine-grained correlations between remote sensing images and textual data and to accurately determine their similarity. To overcome these challenges, we propose the multi-view joint representations contrast (MVJRC) task, which applies automatic contrast, histogram equalization, brightness adjustment, sharpness adjustment, flipping, rotation, and translation operations to simulate imaging differences. A weight-sharing Siamese network is designed to maximize the similarity between the joint representations of the corresponding text and the augmented views of the same remote sensing image during training. By alternating the update gradients, the model uses the mutual information shared by the joint representations of the same remote sensing image under different views as the supervision signal. The MVJRC task filters out the noise introduced by imaging differences in remote sensing images and enforces strong consistency among the joint representations of a text and the different views of a remote sensing image, making paired samples easier to discriminate. Furthermore, MVJRC provides additional complementary signals for the complex cross-attention modules between modalities, thereby encouraging consistent fine-grained correlations.
The increasing computational complexity associated with large-scale networks reduces the efficiency of measuring the similarity of multimodal data during cross-modal retrieval. While identifying negative samples with low similarity (easy negatives) is straightforward, identifying negative samples with high similarity (hard negatives) often requires a more intricate model. To address this challenge, we propose the retrieval filtering (RF) method, which employs a small-scale network as a filter and uses knowledge distillation [30] to transfer the "knowledge" of similarity measurement from the complex fusion network to the filter. During retrieval, the small-scale filter is first used to screen out easy negatives, and the top k samples with high similarity are then fed into the complex fusion encoder for similarity calculation and re-ranking. By adopting the RF method, retrieval efficiency can be significantly improved with minimal accuracy loss, even for large sample sizes.
In this research, we introduced a multi-task guided fusion encoder (MTGFE) for cross-modal retrieval of remote sensing images and texts. The key contributions of this paper can be summarized as follows:
- (1)
The model was trained using a combination of the ITM, MLM, and MVJRC tasks, enhancing its ability to capture fine-grained correlations between remote sensing images and texts.
- (2)
The introduction of the MVJRC task improved the consistency of feature expression and fine-grained correlation, particularly when dealing with variations in colors, resolutions, and shooting angles of remote sensing images.
- (3)
To address the computational complexity and retrieval efficiency limitations of large-scale fusion coding networks, we proposed the RF method. This method filters out easy negative samples, ensuring both high retrieval accuracy and efficient retrieval performance.
The remainder of this paper is organized as follows. In Section 2, related work on remote sensing image–text cross-modal retrieval, Transformer-based text and image encoders, vision-language pre-training (VLP) models, and contrastive learning is summarized and analyzed. In Section 3, the architecture of our model is described in detail, with a focus on the design of the training tasks. In Section 4, comparative and ablation experiments are conducted to demonstrate the superiority and effectiveness of our method, and the cases in which the method underperforms are analyzed. In Section 5, the discussion and conclusions are presented.
3. Method
To achieve a fine-grained association between remote sensing images and texts, we first utilize the ViT and BERT (first 6 layers and their parameters) models to extract patch and token features from images and texts, respectively. To represent the complex interactions among the fine-grained semantic units of images and texts, we then employ a large-scale Transformer (initialized with the last 6 layers and parameters of BERT) as the fusion encoder to model the fine-grained association between images and texts. To better exploit the image–text association information in the annotated data, we use the MLM task, which takes the ground truth labels (the real words in the manually annotated dataset) of randomly masked tokens as the supervision signal, guiding the model to learn the fine-grained association between images and texts. Meanwhile, the MVJRC task uses the joint representations of a text and differently imaged versions of the same remote sensing image as the supervision signal, ensuring consistency of the joint representation and of the fine-grained association. Additionally, we use the ITM task to align remote sensing images and texts with the supervision signal of whether an image and a text match, enabling cross-modal retrieval between remote sensing images and texts.
Figure 2 illustrates the overall structure of the model. Initially, the visual and language features of the image–text pair are generated separately by their respective unimodal encoders. These features are then paired and passed into the fusion encoder. The model is trained jointly through the ITM, MLM, and MVJRC tasks. During cross-modal image–text retrieval, the results are ranked based on the ITM score and provided to the user. After training the fusion encoder, a small-scale multilayer perceptron (MLP) network is trained using knowledge distillation. This MLP network functions as a retrieval filter to filter out easily identifiable negative samples. Subsequently, the results are re-ranked by the fusion encoder.
3.1. Unimodal Encoder
We select the ViT and BERT models, which leverage self-attention mechanisms, as the unimodal encoders for remote sensing images and texts. These models facilitate the fine-grained semantic representation of unimodal data.
3.1.1. Image Encoder
The image encoder adopts the ViT-B/16 model structure and is initialized using the pre-training weights on ImageNet-1k [47]. Following reference [3], a given image is segmented into multiple 16 × 16 pixel patches. After linear projection, the learnable classification label is embedded as the special token [cls]. The encoding output is $S = \{s_{cls}, s_1, \ldots, s_m\}$, where $s_{cls}$ is the classification label feature, $s_i$ denotes the feature of the $i$-th patch, and $m$ is the number of patches.
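As an illustration of the shapes involved, the following minimal PyTorch sketch (not the authors' code; the tensor names are ours) shows how a 224 × 224 image yields m = 196 patch features plus a [cls] feature; positional embeddings and the Transformer blocks of ViT-B/16 are omitted.

```python
# Minimal sketch of the image encoder's output S = {s_cls, s_1, ..., s_m}:
# a 224x224 image is split into 16x16 patches, linearly projected to 768-d,
# and a learnable [cls] embedding is prepended.
import torch
import torch.nn as nn

patch_size, dim = 16, 768
image = torch.randn(1, 3, 224, 224)                   # one RGB remote sensing image

# Patch embedding via a strided convolution (equivalent to a per-patch linear projection)
patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # learnable classification token

patches = patch_embed(image).flatten(2).transpose(1, 2)    # (1, m, 768), m = 196
S = torch.cat([cls_token.expand(1, -1, -1), patches], 1)   # (1, 1 + m, 768)
print(S.shape)                                             # torch.Size([1, 197, 768])
```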
3.1.2. Text Encoder
The first 6 layers and weights of the pre-trained BERT-base model [6] are used as the text encoder, which therefore has 6 Transformer blocks. Given a text description $T$, WordPiece [37] is first used to obtain the embedded representations of the tokens in the sentence, and a classification label token ([cls]) is added at the start. During the execution of the MLM task (as outlined in Section 3.3), approximately 15% of the tokens are randomly masked and substituted with the special token [mask]; here, $n$ is the length of the token sequence. The encoded text features of the original and masked inputs are represented as $W = \{w_{cls}, w_1, \ldots, w_n\}$ and $W^{msk}$, respectively; here, $w_{cls}$ represents the text classification label feature, which is often used as a global vector for the text in downstream tasks; $w_i$ is the feature vector of the $i$-th token; and $w_{[mask]}$ represents the feature vector of the special token [mask].
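The following sketch illustrates the tokenization and masking step, assuming the HuggingFace bert-base-uncased WordPiece tokenizer as a stand-in for the vocabulary used here; it is illustrative only and omits BERT's 80/10/10 replacement scheme.

```python
# Minimal sketch: tokenize a caption with WordPiece ([CLS]/[SEP] added automatically)
# and randomly replace ~15% of the ordinary tokens with [MASK] to form the MLM input.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # 30,522-token vocabulary
text = "Six water tanks and some pipes beside a pond"

enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()              # (1, n)

special = torch.tensor(tokenizer.get_special_tokens_mask(
    input_ids[0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(input_ids.shape) < 0.15) & ~special    # mask ~15% of ordinary tokens
masked_ids = input_ids.masked_fill(mask, tokenizer.mask_token_id)

print(tokenizer.convert_ids_to_tokens(masked_ids[0].tolist()))
```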
3.2. Multimodal Fusion Encoder
The multimodal fusion encoder comprises six layers of Transformer blocks and operates on fine-grained features, namely image patches and text tokens. To allow more gradient flow to the image encoder, the image features are fed independently into each multi-head cross-attention layer, where they serve as the key and value for the attention calculations. The text tokens, in contrast, are treated as the query and are fed into the multi-head cross-attention layer after the multi-head self-attention layer. The multiple stacked self-attention and cross-attention layers facilitate the calculation of fine-grained correlations between text tokens and image patches, refine the image encoder parameters, and enhance the visual representations.
The fusion encoder is initialized with the weights of the last 6 layers of BERT-base [6]. Each block consists of three sub-layers: a multi-head self-attention layer, a multi-head cross-attention layer, and a feed-forward network (FFN) layer. Within each attention sub-layer, a residual connection is employed, where the input and output are added together prior to layer normalization [48]. The input of the self-attention layer is the embedded feature of the text, $W$; when executing the MLM task, the input is $W^{msk}$. The self-attention layer maintains three learnable parameter matrices, $W^Q$, $W^K$, and $W^V$, for each input token embedding. The calculation for each attention head is provided in Equation (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

where $d_k$ is the dimension of the input key. For the fusion of multi-head attention, the outputs of the individual attention heads $\mathrm{head}_i$ are concatenated along the feature dimension and multiplied by a learnable parameter matrix $W^O$, as shown in Equation (2):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \tag{2}$$

Here, $h$ is the number of attention heads.
The calculation of the multi-head cross-attention layer is similar to that of the multi-head self-attention layer, except that the output of the text embedding $W$ from the self-attention layer is used as $Q$, whereas the visual embedding $S$ is used as $K$ and $V$.
The FFN sub-layer is a fully connected (FC) network with the GELU [49] activation function, which applies a nonlinear transformation to the output of the cross-attention network. The hidden vectors of the last layer are taken as the feature output of the fusion encoder, represented by $U = \{u_{cls}, u_1, \ldots, u_n\}$, where $u_{cls}$ is the classification label feature of the image–text joint representation, $u_i$ is the joint feature corresponding to the $i$-th text token, and $n$ is the number of input text elements.
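A minimal sketch of one such block is given below (our own simplified PyTorch code, not the released implementation): the text features W attend to themselves, then query the image features S through cross-attention, followed by a GELU feed-forward network, with residual connections and layer normalization around each sub-layer.

```python
# Minimal sketch of one fusion-encoder block: text self-attention, text-to-image
# cross-attention, and a position-wise FFN, each with residual + layer normalization.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, W, S):
        W = self.norm1(W + self.self_attn(W, W, W, need_weights=False)[0])   # text self-attention
        W = self.norm2(W + self.cross_attn(W, S, S, need_weights=False)[0])  # text queries image patches
        return self.norm3(W + self.ffn(W))                                   # feed-forward sub-layer

W = torch.randn(2, 40, 768)    # token features for a batch of 2 captions
S = torch.randn(2, 197, 768)   # patch features for the paired images
U = FusionBlock()(W, S)        # joint representation, shape (2, 40, 768)
```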
3.3. Training Task of the Multimodal Fusion Encoder
During the training process, we incorporate three tasks, namely MLM, MVJRC, and ITM, to collectively guide the training of the multimodal fusion encoder.
3.3.1. Masked Language Modeling (MLM)
The MLM task (shown in Figure 3), derived from the BERT [6] model, involves randomly masking 15% of the tokens in a given text sentence. By incorporating the MLM task into the fusion module, the training process is transformed into a self-supervised denoising procedure. This requires the masked tokens to utilize both the unmasked contextual information (through the self-attention mechanism) and additional image information (through the cross-attention mechanism) for reconstruction. This approach strengthens the fine-grained correlations between text tokens and image patches, enhancing their alignment and coherence.
A fully connected MLM head is added after the output of the fusion encoder; its input is the image–text joint representation $U$. The network output employs the SoftMax function for the multi-class prediction and is mapped onto the vocabulary, i.e., the word dictionary introduced from BERT-base, with a length of 30,522. The MLM task minimizes the cross-entropy loss between the predicted value and the ground truth label, as provided in Equation (3):

$$\mathcal{L}_{MLM} = \mathbb{E}_{(I, T^{msk}) \sim D}\; \mathrm{CE}\!\left(y^{msk}, p^{msk}(I, T^{msk})\right) \tag{3}$$

where $y^{msk}$ refers to the ground truth label of the predicted vocabulary item, $(I, T^{msk})$ refers to the image–text pair after the masking operation, $p^{msk}(I, T^{msk})$ refers to the model prediction for the masked vocabulary item, and $\mathrm{CE}$ refers to the cross-entropy loss function.
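A minimal sketch of the MLM head and loss (with placeholder tensors and hypothetical token ids) is shown below; only the masked positions contribute to the cross-entropy.

```python
# Minimal sketch of the MLM head: a linear layer maps each joint token feature to
# vocabulary logits; cross-entropy is evaluated only at masked positions, and
# unmasked positions carry the ignore label -100.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 30522, 768
mlm_head = nn.Linear(dim, vocab_size)

U = torch.randn(2, 40, dim)                            # joint features for 2 masked captions
labels = torch.full((2, 40), -100, dtype=torch.long)   # -100 = position not masked / ignored
labels[0, 5], labels[1, 17] = 2300, 874                # ground-truth ids of the masked words

logits = mlm_head(U)                                   # (2, 40, vocab_size)
loss_mlm = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```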
3.3.2. Multi-View Joint Representations Contrast (MVJRC)
To enhance the coherence of the joint features and capture fine-grained correlations between a specific target and its corresponding text under varying imaging conditions, such as resolution, color, and shooting angle, we propose a weight-sharing MTGFE Siamese network (Figure 4). Various image augmentation operations are employed to simulate the imaging discrepancies of remote sensing images. The joint representation undergoes self-supervised training, where the objective is to maximize the similarity of the joint representations between remote sensing images captured from different views and their corresponding paired text. Specifically, a projection head and a prediction head, denoted as $g$ and $h$, respectively, are added after MTGFE. The projection head $g$ has three FC layers, each followed by a batch normalization (BN) layer [50]; apart from the output layer, each BN layer is followed by the rectified linear unit (ReLU) activation function [51]. The prediction head $h$ consists of two FC layers connected by a BN layer and the ReLU activation function.
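A minimal sketch of these two heads is given below; the 2048-dimensional hidden width and the bottleneck width of the prediction head are our assumptions, as the text does not specify them.

```python
# Minimal sketch of the projection head g (three FC layers, each with BN; ReLU after
# all but the output layer) and the prediction head h (two FC layers with BN + ReLU
# in between), operating on 768-d joint [cls] features.
import torch.nn as nn

dim, hidden = 768, 2048   # hidden width is an assumption

projection_head = nn.Sequential(
    nn.Linear(dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
    nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
    nn.Linear(hidden, dim), nn.BatchNorm1d(dim),        # output layer: BN, no ReLU
)

prediction_head = nn.Sequential(
    nn.Linear(dim, hidden // 4), nn.BatchNorm1d(hidden // 4), nn.ReLU(inplace=True),
    nn.Linear(hidden // 4, dim),
)
```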
For a given image–text pair $(I, T)$, RandAugment [52] is used to obtain two randomly augmented views $(I_1, T)$ and $(I_2, T)$, whose fusion representations are denoted as $U_1$ and $U_2$, respectively; their classification label features $u_{cls}^{1}$ and $u_{cls}^{2}$ are used in the subsequent operations. Let $z_1 = g(u_{cls}^{1})$, $z_2 = g(u_{cls}^{2})$, $p_1 = h(z_1)$, and $p_2 = h(z_2)$, and define $D$ as the negative cosine similarity of two vectors; then

$$D(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2} \tag{4}$$

here, $\lVert \cdot \rVert_2$ represents the $\ell_2$-norm. The objective of the MVJRC task is to maximize the similarity between the joint representations of the differently augmented image–text pairs. The loss function for this task can be defined as follows:

$$\mathcal{L} = \frac{1}{2}D(p_1, z_2) + \frac{1}{2}D(p_2, z_1) \tag{5}$$
The loss for each individual sample is computed, and the mean loss is then calculated within each minibatch. Following reference [44], to prevent the model from collapsing, a stop-gradient operation (stopgrad) is introduced when updating the gradients; that is, each $z$ term is detached, and only the $p$ branch receives a gradient. The mathematical expression for the MVJRC loss is as follows:

$$\mathcal{L}_{MVJRC} = \frac{1}{2}D\!\left(p_1, \mathrm{stopgrad}(z_2)\right) + \frac{1}{2}D\!\left(p_2, \mathrm{stopgrad}(z_1)\right) \tag{6}$$

When updating the encoder parameters for the augmented pair $(I_2, T)$, the first term does not pass any gradient back through $z_2$, whereas the second term passes the gradient through $p_2$ (and vice versa for $(I_1, T)$). See Algorithm 1 for the pseudocode of MVJRC.
Algorithm 1 MVJRC Task Pseudocode
# f: MTGFE Net, our fusion encoding model
# g: projection head
# h: prediction head
for (I, T) in dataloader do
    I1, I2 = aug(I), aug(I)              # image augmentation (two random views)
    z1, z2 = g(f(I1, T)), g(f(I2, T))    # projections of the joint [cls] features
    p1, p2 = h(z1), h(z2)                # predictions
    l = D(p1, z2)/2 + D(p2, z1)/2        # loss
    l.backward()                         # gradient return
    update(f, g, h)                      # parameters update
end for
function D(p, z)                         # calculation of cosine similarity
    z = z.detach()                       # stop gradient
    p = normalize(p, dim=1)              # l2-normalize
    z = normalize(z, dim=1)              # l2-normalize
    s = -(p * z).sum(dim=1).mean()
    return s
end function
3.3.3. Image–Text Matching (ITM)
In order to assess the similarity between images and texts and determine their compatibility, we employ the ITM head to perform a linear mapping of the joint representation onto the [0, 1] interval, where a value approaching 1 indicates greater image–text similarity. During cross-modal retrieval of remote sensing images and texts, the ITM score serves as the ranking criterion and is presented to the user. The ITM head is an FC layer that linearly projects the classification label feature of the joint representation, $u_{cls}$, into a two-dimensional prediction $p^{itm}(I, T)$. The ITM loss quantifies the disparity, in terms of probability distribution, between the prediction and the ground truth label (whether the image and text match in the manually annotated dataset), and is defined as Equation (7).

$$\mathcal{L}_{ITM} = \mathbb{E}_{(I, T) \sim D}\; \mathrm{CE}\!\left(y^{itm}, p^{itm}(I, T)\right) \tag{7}$$

where $y^{itm}$ is the true matching value of the given image–text pair and $\mathrm{CE}$ denotes the cross-entropy loss function. In training, $y^{itm}$ is set to 1 for the input image–text pair $(I, T)$; negative examples are randomly selected for each image and each text in the minibatch, denoted as $(I, T^{-})$ and $(I^{-}, T)$, respectively, for which $y^{itm}$ is set to 0.
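The following sketch (placeholder features, our own code) illustrates the ITM head, the pairing of positive joint features with in-batch negatives, and the use of the softmax "match" probability as the ranking score.

```python
# Minimal sketch of the ITM head and loss: joint [cls] features of matched and
# randomly drawn mismatched pairs are mapped to two logits (match / no match).
import torch
import torch.nn as nn
import torch.nn.functional as F

itm_head = nn.Linear(768, 2)

u_cls_pos = torch.randn(32, 768)              # joint features of matched image-text pairs
u_cls_neg = torch.randn(32, 768)              # joint features of in-batch negative pairs
labels = torch.cat([torch.ones(32, dtype=torch.long),    # 1 = match
                    torch.zeros(32, dtype=torch.long)])  # 0 = no match

logits = itm_head(torch.cat([u_cls_pos, u_cls_neg]))     # (64, 2)
loss_itm = F.cross_entropy(logits, labels)
itm_score = F.softmax(logits, dim=1)[:, 1]               # similarity used for ranking
```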
The overall loss of the MTGFE training is the sum of the three task losses:

$$\mathcal{L} = \mathcal{L}_{ITM} + \mathcal{L}_{MLM} + \mathcal{L}_{MVJRC}$$
3.4. Retrieval Filtering (RF)
Knowledge distillation is a machine learning technique that trains a compact model to mimic a larger, more complex one by transferring knowledge from the larger "teacher" model to the smaller "student" model. To improve the efficiency of MTGFE cross-modal retrieval, after model training, a simple FC network is designed as a retrieval filter (Figure 5), and knowledge distillation transfers knowledge from MTGFE (the teacher model) to the retrieval filter (the student model). The input of the retrieval filter is the concatenation of the image and text classification label features. The filter comprises three FC layers; the first two FC layers are each followed by BN and the ReLU activation function, consistent with the ITM head architecture, and the final linear layer transforms the output of the hidden layer into a two-dimensional vector.
The MTGFE's ITM output and the manually annotated ground truth label (whether the image and text match in the manually annotated dataset) are used as the soft-target and hard-target supervision signals, respectively, for the student model on the same set of image–text samples. The distribution biases with respect to these signals are measured using the Kullback–Leibler (KL) divergence loss and the cross-entropy loss, respectively, calculated as follows:

$$\mathcal{L}_{soft} = \mathrm{KL}\!\left(p^{itm}_{t}(I, T)\,\big\Vert\, p^{itm}_{s}(I, T)\right)$$

$$\mathcal{L}_{hard} = \mathrm{CE}\!\left(y^{itm}, p^{itm}_{s}(I, T)\right)$$

where $\mathrm{KL}$ represents the KL divergence loss, $p^{itm}_{t}$ represents the ITM output of the teacher model, $p^{itm}_{s}$ represents the predicted ITM value of the student model, and $y^{itm}$ represents the ground truth label. Finally, the distillation loss of the model can be obtained as follows:

$$\mathcal{L}_{distill} = \alpha\,\mathcal{L}_{soft} + (1 - \alpha)\,\mathcal{L}_{hard}$$

Here, $\alpha$ denotes a constant hyperparameter. The student and teacher models use the same unimodal encoders to extract features; the difference is that the student model only takes as input the image and text classification label features, $s_{cls}$ and $w_{cls}$.
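A minimal sketch of this distillation objective is given below, following the weighting written above; the exact formulation used in the released implementation may differ.

```python
# Minimal sketch of the distillation loss: the student's ITM logits are pulled toward
# the teacher's ITM distribution (KL term) and toward the ground-truth labels
# (cross-entropy term), combined with the hyperparameter alpha = 0.2.
import torch
import torch.nn.functional as F

alpha = 0.2
teacher_logits = torch.randn(128, 2)                      # ITM output of the frozen MTGFE teacher
student_logits = torch.randn(128, 2, requires_grad=True)  # retrieval-filter (student) output
labels = torch.randint(0, 2, (128,))                      # manually annotated match labels

soft = F.kl_div(F.log_softmax(student_logits, dim=1),
                F.softmax(teacher_logits, dim=1), reduction="batchmean")
hard = F.cross_entropy(student_logits, labels)
loss_distill = alpha * soft + (1 - alpha) * hard
```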
4. Experimental Results and Analysis
To substantiate the efficacy of the proposed method in remote sensing image–text cross-modal retrieval tasks, we performed comprehensive experiments on four publicly available datasets. Furthermore, we conducted ablation tests to provide additional validation for the presented approach. It is important to mention that, in Section 4.5, we exclusively employed the retrieval filtering method to evaluate its effectiveness, whereas the remaining experimental results were computed using MTGFE.
4.1. Datasets and Evaluation Indicators
In the experiments, we used four publicly available remote sensing image–text datasets: UCM-captions [32], Sydney-captions [32], RSICD [23], and RSITMD [8]. The basic information of each dataset is given in Table 1.
In the evaluation, we employed recall at K (R@K), where K represents the rank position (1, 5, and 10), as the performance metric. R@K measures the percentage of correct samples within the top K ranked results for a given query. Additionally, we introduced the mR indicator, which represents the arithmetic mean of R@K values, to evaluate the performance of the proposed method.
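For reference, the following sketch computes R@K and mR from a similarity matrix, assuming a single ground-truth candidate per query; the public datasets provide five captions per image, which requires a straightforward extension.

```python
# Minimal sketch of R@K and mR: the fraction of queries whose ground-truth candidate
# appears within the top-K ranked results, averaged over K = 1, 5, 10.
import torch

def recall_at_k(sim, gt_index, ks=(1, 5, 10)):
    """sim: (num_queries, num_candidates) similarity scores;
    gt_index[i]: index of the correct candidate for query i."""
    ranks = sim.argsort(dim=1, descending=True)              # candidates sorted by score
    hits = (ranks == gt_index.unsqueeze(1))                  # True where the ground truth sits
    return {k: hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks}

sim = torch.randn(100, 500)                  # e.g., ITM scores of 100 queries vs 500 candidates
gt = torch.randint(0, 500, (100,))
r = recall_at_k(sim, gt)
mR = sum(r.values()) / len(r)                # arithmetic mean of R@1, R@5, R@10
```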
4.2. Implementation Details
The experiments were performed on four NVIDIA GeForce RTX 3090 GPUs. All images were standardized to a size of 224 × 224 pixels and augmented using RandAugment [52]. To simulate variations in remote sensing images, nine augmentation methods ("Identity", "AutoContrast", "Equalize", "Brightness", "Sharpness", "ShearY", "TranslateX", "TranslateY", and "Rotate") were selected. However, since strong image augmentations can disrupt the matching relationship between remote sensing images and texts, we applied relatively mild RandAugment parameters, specifically (2, 7): "2" indicates that two methods were randomly chosen from the aforementioned sequence of augmentation methods, while "7" represents the amplitude of the augmentation. For the image, text, and fused representations, the dimensions of the token and patch features were set to 768. We utilized PyTorch's DistributedDataParallel tool for distributed training and incorporated distributed BN. During the training of the multimodal fusion encoder, a batch size of 32 was employed, and the training process spanned 60 epochs. The AdamW optimizer [53] with a weight decay of 0.02 was employed, and a cosine schedule was applied to decay the learning rate from 0.0001 after the first 1000 iterations. When training the student network, the distillation hyperparameter $\alpha$ was set to 0.2, the batch size was adjusted to 128, and the optimizer parameters remained unchanged.
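As a sketch of the augmentation setting, torchvision's RandAugment can approximate the (2, 7) configuration, although it draws from its own default operation pool rather than the nine operations listed above.

```python
# Approximate input pipeline: resize to 224x224 and apply RandAugment with
# num_ops=2 and magnitude=7, mirroring the (2, 7) setting described in the text.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                  # standardize image size
    transforms.RandAugment(num_ops=2, magnitude=7), # mild augmentation strength
    transforms.ToTensor(),
])
```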
4.3. Experimental Results and Analysis
During the experiments, we conducted a comparative analysis of the proposed method against the most up-to-date models, including VSE++ [9], SCAN [10], MTFN [11], AMFMN [8], SAM [14], LW-MCR [18], MAFA-Net [17], FBCLM [21], and GaLR [16]. Table 2 provides an overview of the performance of the proposed method as well as the baseline models on four datasets: UCM-captions, Sydney-captions, RSICD, and RSITMD. The best results are highlighted in bold. In this context, "text retrieval" refers to the task of matching relevant textual descriptions with a query image, while "image retrieval" denotes the task of matching relevant remote sensing images with a query textual description.
In Table 2, the performance metrics for VSE++, SCAN, and MTFN are obtained from reference [8], while the results of the other models are cited from their respective original papers. For the UCM-captions, Sydney-captions, and RSICD datasets, we followed the partitioning of the training, validation, and test sets as defined by the dataset contributors. During the training phase, the model parameters were adjusted solely using the training set, and the performance data presented in Table 2 are exclusively derived from the test set. For the RSITMD dataset, the contributors only provided a division into a training set and a test set; thus, after training our model on the provided training set, its performance was measured on the test set.
Results on UCM-captions: The performance of the proposed approach on the UCM-captions dataset is displayed in the upper left section of Table 2. The mR metric of our method surpassed that of the best baseline model by 9.4%. Except for the R@10 score in text retrieval, the method outperformed the baseline models, showcasing its overall superior performance. Notably, the R@1 scores for text and image retrieval were 18.57% and 12.86% higher than those of the other models, respectively, indicating that our method was more likely to return an accurate result at the top-1 position.

Results on Sydney-captions: The performance of the proposed method on the Sydney-captions dataset is presented in the upper right section of Table 2. The results reveal that the mR of our method surpassed that of the best baseline model by 3.99%. Specifically, the R@1, R@5, and R@10 scores for text retrieval, as well as R@1 for image retrieval, outperformed those of the best baseline model by 15.52%, 8.47%, 8.62%, and 11.8%, respectively. These findings align with the outcomes obtained on the UCM-captions dataset, which also exhibited a substantial enhancement in R@1 performance.

Results on RSICD: The performance of our model on the RSICD dataset is presented in the lower left section of Table 2. Our model performed well, exhibiting superior text retrieval R@1 performance compared to the other models. However, some performance gaps remained for the other indicators when compared to the optimal baseline model.

Results on RSITMD: The performance on the RSITMD dataset can be observed in the lower right section of Table 2. For this dataset, our proposed model achieved higher values for all R@K indicators and mR compared to the other baseline models. This suggests that our model was more effective in capturing the image–text similarity relationships in a dataset with richer text semantics and lower text repeatability.
Our experimental results across the four datasets showcase the competitiveness of our method against the baseline models. In the retrieval task, R@1 is considerably more important than R@5 and R@10, as users prefer the model to return the desired result in the first position rather than having to filter through the results. Except for image retrieval on the RSICD dataset, our method outperformed all other models in terms of R@1 on all four datasets, providing strong evidence of its superior performance. However, it fell short on the other RSICD metrics. To analyze the reasons, we conducted experiments on the validation set of RSICD using the same model and parameters. The R@1, R@5, and R@10 scores were 16.91, 44.24, and 57.86 for text retrieval and 20.20, 39.93, and 53.53 for image retrieval, respectively, with an mR of 38.78. These results significantly outperform the baseline models, suggesting potential dataset imbalances as the cause of the gap on the test set.
Furthermore, we scrutinized the RSICD dataset, which is similar to the UCM-captions and Sydney-captions datasets. These datasets were specifically curated for generating captions for remote sensing images, where the objective is to generate sentences that are similar to the annotated text. Although each image has five textual captions, these five sentences are often repetitive, and there are instances where different remote sensing images have the same or similar textual descriptions. In cross-modal retrieval, such returned samples may align semantically with the query but are deemed incorrect in the evaluation and thus fail to contribute to the metrics. Yuan et al. [8] also noted this limitation of the datasets and quantified the diversity of data samples by the ratio of inconsistent sentences to the number of images; the scores for the UCM-captions, Sydney-captions, and RSICD datasets are 0.97, 1.83, and 1.67, respectively. However, cross-modal retrieval requires discerning the similarity between different samples and needs more diverse samples to improve the discriminative ability of the model. To explore datasets that are more suitable for cross-modal retrieval between remote sensing images and texts, Yuan et al. [8] contributed a more diverse remote sensing image–text dataset called RSITMD, which increases the ratio of inconsistent sentences to the number of images to 4.60. On this dataset, our proposed method demonstrates a significant advantage over the baseline models.
We further analyzed the performance of the different models in Table 2. While the baseline models endeavor to address fine-grained associations between multimodal data through multimodal semantic alignment and multimodal fusion encoding, issues persist. Models such as VSE++, SCAN, MTFN, AMFMN, SAM, LW-MCR, MAFA-Net, and GaLR rely on insufficiently complex interactions between modalities, which limits their performance. The work on multimodal fusion encoding, exemplified by FBCLM, uses a large-scale fusion encoder to mine complex associations between modalities, demonstrating optimal performance across multiple datasets. However, it does not utilize different training tasks to mine more supervision signals that could further promote fine-grained correlation between modalities, which limits the performance of the fusion encoding model. Our approach combines three supervised tasks, MLM, MVJRC, and ITM, to extract richer supervision signals and attain superior multimodal fine-grained associations. By aggregating local similarities between images and texts through a large-scale cross-attention network, the accuracy of cross-modal retrieval is improved. We further analyze the contributions of these three tasks in Section 4.4.

Although methods based on large-scale fusion encoders exhibit superior performance in remote sensing image–text cross-modal retrieval, their computational overhead hampers the retrieval speed. In contrast, multimodal semantic alignment methods can extract remote sensing image and text features offline and obtain the similarity between images and texts through simple calculations, thereby offering superior retrieval speed. To compensate for the low retrieval efficiency of large-scale fusion encoders, we transfer the knowledge about image–text associations learned by the fusion encoder to a small-scale model to improve retrieval efficiency. The details and supporting experiments of this approach are presented in Section 4.5.
4.4. Ablation Studies
On the RSITMD dataset, we performed ablation tests to analyze the contributions of the ITM, MLM, and MVJRC tasks used to train the fusion encoder in terms of fine-grained image–text correlation and cross-modal retrieval. We examined four different task combinations: ITM, ITM + MLM, ITM + MVJRC, and ITM + MLM + MVJRC.
4.4.1. Visualization of Fine-Grained Correlations in Word–Patch
In order to assess the contributions of different tasks to fusion representation, we extracted the attention values of each input word to the corresponding image region from the fifth cross-attention layer of the multimodal fusion encoder. These values were then used to generate a visual heat map illustrating the word–patch correlation. Darker colors indicate a higher correlation between the query word and the image region.
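The following sketch (with random tensors standing in for the real attention weights) shows the kind of post-processing used to turn a token's attention over the 14 × 14 patch grid into an image-sized heat map.

```python
# Minimal sketch: take one text token's cross-attention over the 196 image patches,
# reshape it onto the 14x14 patch grid, and upsample it to the 224x224 image for overlay.
import torch
import torch.nn.functional as F

attn = torch.rand(40, 197)                 # cross-attention of 40 text tokens over [cls]+196 patches
word_idx = 3                               # e.g., the token "pond" (hypothetical position)
patch_attn = attn[word_idx, 1:].reshape(1, 1, 14, 14)        # drop the [cls] column
heatmap = F.interpolate(patch_attn, size=(224, 224), mode="bilinear", align_corners=False)
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)   # normalize to [0, 1]
```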
Figure 6 presents the word–patch correlation heat map for a selected image and the sentence “Six water tanks and some pipes beside a pond” under various task combinations. It should be noted that the words displayed in the map are the result of contextual self-attention processing, thus encompassing contextual information.
The MLM task improved the fine-grained correlation between sentence words and image regions. For example, the words “six” and “pond” accurately matched the six white water tanks and the nearby pond, respectively, although some noise was present in the attention. However, when combining the ITM and MVJRC tasks, the correct association between words and image regions was not achieved. Only when all three tasks (ITM, MLM, and MVJRC) were used together did the words exhibit a strong correlation with the image regions. The global classification label [cls] was linked to a region that semantically matched the entire sentence. Words like “six” (referring to 6 water storage tanks), “tanks”, and “pond” (referring to the nearby pond) were correctly associated with their respective image regions. Compared to scenario b, the correlation between words and the image was more specific and accurate, demonstrating the effectiveness of the proposed MVJRC task in filtering out irrelevant correlations. Regarding the word “pipes”, except for scenario a, none of the other task combinations correctly associated it with an image region. This could be attributed to the low resolution of the target, which made detection challenging, and the lack of relevant samples in the training data.
We conducted additional testing of the proposed method using image–text pairs that had more diverse and detailed text semantics.
Figure 7 illustrates an example where the input text described a “viaduct” scene with multiple objects and included information about its surroundings. The results demonstrated that our method effectively improved the correlation between the text and the image. Even for non-target vocabulary such as “ring”, “surrounded”, and “green”, our method successfully associated them with the appropriate image regions.
Based on the visual analysis of the image–text correlation discussed above, it was observed that the supervision signal provided by the ITM task for fine-grained image–text correlation was not precise enough, leading to overlapping correlation effects. On the other hand, the MLM task played a crucial role in enhancing the fine-grained correlation between images and texts by providing more refined and accurate supervision signals. When combining the ITM and MVJRC tasks, the correlation effects between images and texts intersected, resulting in improved correlation effects compared to when only the ITM and MLM tasks were combined. The addition of the MVJRC task enhanced the mutual information for fine-grained correlation between modalities and improved the consistency of joint representation. By strengthening the consistency of fine-grained correlations between remote sensing images from different perspectives and the associated text, the correlation effects between remote sensing images and texts were significantly enhanced.
4.4.2. Impact of Task Combinations on Retrieval Accuracy
We conducted experiments on the RSITMD dataset, evaluating the contributions of four different task combinations: ITM, ITM + MLM, ITM + MVJRC, and ITM + MLM + MVJRC. The results of these experiments are presented in Table 3.
The experimental results demonstrate that employing the ITM task alone yields a remarkable mR of 38.89, surpassing the accuracy metrics of the current state-of-the-art methods and validating the promoting effect of complex fine-grained interactions between modalities on the accuracy of cross-modal retrieval. When combining the ITM and MLM tasks, all retrieval accuracy metrics show significant improvement, with an increase of 2.25 in mR, underscoring the benefit of the finer-grained supervision signals provided by MLM. However, when combining the ITM and MVJRC tasks, the MVJRC task does not contribute to the retrieval performance, and there is a noticeable decrease in all retrieval accuracy metrics compared to using only the ITM task. When combining the ITM, MLM, and MVJRC tasks, the performance either slightly improves or remains the same compared to the combination of ITM and MLM, with a 0.93 increase in mR; the MVJRC task thus does not provide a significant improvement in retrieval accuracy. The impact of adding the MVJRC task to ITM and to ITM + MLM on the retrieval accuracy aligns with the visual analysis results in Section 4.4.1, indicating that the MVJRC task does not provide a significant gain in image–text association on top of the ITM task alone and may even introduce some noise. After adding the MVJRC task to the combination of ITM + MLM, the visualization of fine-grained correlations between remote sensing image regions and text words is significantly enhanced, but the contribution to the retrieval accuracy metrics is less evident. In some subjective retrieval experiments, the combination of ITM, MLM, and MVJRC tends to return samples that match the retrieval conditions but are not the ground truth samples in the dataset. While this may enhance the user experience, it does not necessarily improve the retrieval accuracy metrics. We attribute this to the limitations of the dataset in terms of sample diversity. The dataset exhibits high intra-class similarity, where remote sensing images of the same scene, such as deserts, airports, and parking lots, have minimal differences, allowing many remote sensing images of the same scene to share the same text description. Additionally, the dataset contains significant category ambiguity; for instance, the same remote sensing image can be classified as airport, barren land, or airplane, which further complicates the measurement of image–text matching. Therefore, exploring datasets and metrics that are more suitable for cross-modal retrieval between remote sensing images and texts is necessary future work.
4.5. Retrieval Filtering Experiments
In order to alleviate the problem of low retrieval efficiency for the large-scale fusion encoder described in Section 3.4, we validated our proposed retrieval filtering method on the RSICD dataset. To accomplish this, the MTGFE model trained on the RSICD dataset was used as the teacher network, and the student filter network was trained using the ITM output of the teacher network together with the ground truth labels. A total of 30 epochs were trained with a batch size of 128. During the testing phase, the first 128 candidates returned by the filter were forwarded to the teacher network, which recalculated the similarities and re-ranked these samples. The combined retrieval indicators are shown in Table 4. The RSICD test set comprises 1093 images and 5465 texts. The average search time for retrieving texts from an image query was reduced from 472.10 ms to 24.70 ms, while the average search time for retrieving images from a text query was reduced from 94.41 ms to 14.27 ms. Remarkably, the average retrieval accuracy mR decreased by only 0.88, demonstrating that the retrieval filtering method substantially enhanced the model's retrieval speed while maintaining a minimal loss in accuracy.
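The two-stage procedure can be summarized by the following sketch (placeholder scoring functions; the real filter and fusion encoder are described in Section 3.4): the filter scores all candidates, the top 128 are re-scored by MTGFE, and the final ranking is returned.

```python
# Minimal sketch of filter-then-re-rank retrieval: a cheap filter screens out easy
# negatives, and only the top-k survivors are re-scored by the expensive fusion encoder.
import torch

def retrieve(query_feat, candidate_feats, filter_score, fusion_score, k=128):
    cheap = filter_score(query_feat, candidate_feats)               # (num_candidates,) filter scores
    top_idx = cheap.topk(k).indices                                 # keep the k best candidates
    expensive = fusion_score(query_feat, candidate_feats[top_idx])  # costly ITM re-ranking
    return top_idx[expensive.argsort(descending=True)]              # final ranking of the survivors

# Example with random scores standing in for the student filter and the MTGFE teacher.
cands = torch.randn(5465, 768)
ranking = retrieve(torch.randn(768), cands,
                   lambda q, c: c @ q,                       # stand-in filter score
                   lambda q, c: torch.randn(c.size(0)))      # stand-in ITM score
```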
The retrieval filtering experiments in this study exclusively comprised simple knowledge distillation experiments. Further investigations, including hyperparameter optimization, parameter distillation, and the exploration of combination strategies between teacher and student networks, have the potential to significantly enhance the performance of retrieval filtering.
5. Conclusions
To address the challenges posed by the fine-grained and multi-perspective features, as well as the significant imaging variations in remote sensing images, this study incorporates the MLM task into existing multimodal fusion coding models and introduces the novel MVJRC task. By combining the ITM, MLM, and MVJRC tasks, the model’s ability to capture fine-grained correlations between remote sensing images and texts is enhanced. Furthermore, this paper proposes the retrieval filtering method to tackle the issue of low retrieval efficiency in large-scale fusion encoders. Experimental evaluations on four public datasets confirm the effectiveness of the proposed method in improving the accuracy and speed of cross-modal retrieval, leading to overall enhanced performance.
The limitation of this study is that the current remote sensing image–text datasets may not be suitable for high-performance cross-modal retrieval. The complex relationship between remote sensing images and texts also requires better evaluation metrics to judge the performance of cross-modal retrieval. This makes it difficult to effectively validate some of the methods we proposed, such as the MVJRC task, in experimental metrics. Additionally, conducting additional knowledge distillation experiments may enhance the efficiency of cross-modal retrieval between remote sensing images and texts. Finally, exploring the concept of good joint representation has yielded various downstream tasks in VLP model studies, thereby opening up possibilities for the joint learning of remote sensing images and texts in applications such as visual question answering, multi-temporal remote sensing image comprehension, and remote sensing image object segmentation.
In future endeavors, we will focus on annotating more diverse remote-sensing image-text datasets and specifying cross-modal retrieval evaluation metrics. Furthermore, our research will extend to exploring joint learning techniques and cross-modal retrieval tasks, leveraging high-performance fusion encoders for analyzing multi-temporal remote-sensing images alongside textual data.