Remote Sensing Cross-Modal Text-Image Retrieval Based on Attention Correction and Filtering
Abstract
1. Introduction
- Enhanced Image Comprehension: To strengthen the model's ability to interpret remote sensing images, we adopt the visual graph neural network (VIG) [15] as the primary image feature extractor. In addition, a multi-level node feature fusion module is introduced so that the model can understand remote sensing images both at different granularities and in their entirety.
- Optimized Text Understanding: We use a BGRU to extract word-level vector features from the text, then apply a graph attention network (GAT) to compute M word-level features that encode positional relationships. Finally, pooling captures the global features of the text (a minimal sketch of this text branch follows this list).
- Attention Correction Unit: We introduce a novel attention correction unit. Visual region features and textual word features are processed by a cross-attention module to generate attention weights, which are then corrected using a global similarity measure. A dedicated attention threshold recalibrates the weights, substantially reducing misalignment in mismatched image-text pairs caused by incidental correlations between individual visual regions and words.
- Attention Filtering Unit: Since not all attention-derived information is relevant, we propose an attention filtering unit. It identifies the attention weights most pertinent to the visual region and textual word features and applies a second attention threshold to filter out inconsequential attention. This attenuates the influence of non-essential visual regions and words when aligning image-text pairs, increasing the likelihood of accurate matches.
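To make the text branch above concrete, the following is a minimal, hedged PyTorch-style sketch: a BGRU produces word-level features, a single graph-attention layer mixes them over a positional word graph, and mean pooling yields the global text feature. The dimensions, vocabulary size, single-head GAT, window-based adjacency, summing of the two GRU directions, and mean pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextEncoder(nn.Module):
    """BGRU word features -> one GAT layer over a positional word graph -> pooling."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional GRU; the two directions are summed to keep hidden_dim.
        self.bgru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.w = nn.Linear(hidden_dim, hidden_dim, bias=False)   # GAT node projection
        self.attn = nn.Linear(2 * hidden_dim, 1, bias=False)     # GAT edge scorer

    def gat_layer(self, h, adj):
        # h: (B, M, D) word features; adj: (B, M, M) 0/1 positional adjacency.
        B, M, D = h.shape
        hw = self.w(h)
        hi = hw.unsqueeze(2).expand(B, M, M, D)                  # h_i repeated over j
        hj = hw.unsqueeze(1).expand(B, M, M, D)                  # h_j repeated over i
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))               # attend only to neighbours
        alpha = torch.softmax(e, dim=-1)
        return F.elu(torch.bmm(alpha, hw))

    def forward(self, tokens, adj):
        out, _ = self.bgru(self.embed(tokens))                   # (B, M, 2*hidden_dim)
        h = out[..., : out.size(-1) // 2] + out[..., out.size(-1) // 2 :]
        words = self.gat_layer(h, adj)                           # relation-aware word features
        return words, words.mean(dim=1)                          # word-level and global features


# Toy usage: a window-of-one positional adjacency over an 8-word sentence.
tokens = torch.randint(0, 10000, (2, 8))
idx = torch.arange(8)
adj = ((idx[None, :] - idx[:, None]).abs() <= 1).float().expand(2, -1, -1)
word_feats, text_feat = TextEncoder()(tokens, adj)
print(word_feats.shape, text_feat.shape)                         # (2, 8, 512) and (2, 512)
```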
2. Related Work
2.1. Cross-Modal Text-Image Retrieval
2.2. Remote Sensing Cross-Modal Text-Image Retrieval
3. Method
3.1. Feature Extraction
3.1.1. Image Feature Extraction
3.1.2. Text Feature Extraction
3.2. Attention Correction and Filtering
1. Cross-Attention Generation: An attention weight matrix is derived with the cross-attention mechanism; it characterizes the relationships between elements of the image and text modalities.
2. Attention Correction via Global Similarity: The initial attention weights are refined using a measure of global similarity. This refinement keeps the attention focused on semantically consistent image regions and the corresponding text segments.
3. Attention Filtering for Relevance Determination: Only the most relevant attention weights are identified and retained. By concentrating on highly relevant areas, non-essential regions or segments are eliminated, reducing noise in the attention mechanism (a hedged sketch of the full pipeline appears after this list).
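The sketch below follows the three stages described above: cross-attention between word and region features, correction of the weights by the global image-text similarity with a threshold p, and filtering of weak weights with a second threshold q. The softmax temperature, the mean pooling used for the global features, and the specific multiplicative gating and thresholding rules are assumptions chosen for illustration; the paper's exact equations are those of Sections 3.2.1, 3.2.2 and 3.2.3.

```python
import torch
import torch.nn.functional as F


def attention_correct_filter(regions, words, p=0.2, q=0.3, temperature=9.0):
    """regions: (B, R, D) image-region features; words: (B, M, D) word features."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # Stage 1 -- cross-attention: each word attends over the image regions.
    sim = torch.bmm(words, regions.transpose(1, 2))              # (B, M, R) cosine similarities
    attn = torch.softmax(temperature * sim, dim=-1)              # base attention weights

    # Stage 2 -- correction: gate the weights with the global image-text similarity
    # and zero out weights below the correction threshold p (illustrative rule).
    g_img = F.normalize(regions.mean(dim=1), dim=-1)             # global image feature (mean pooled)
    g_txt = F.normalize(words.mean(dim=1), dim=-1)               # global text feature (mean pooled)
    g_sim = (g_img * g_txt).sum(dim=-1).clamp(min=0)             # (B,) global similarity
    attn = attn * g_sim[:, None, None]                           # suppress mismatched pairs
    attn = torch.where(attn >= p, attn, torch.zeros_like(attn))

    # Stage 3 -- filtering: keep only the most relevant weights per word; anything
    # below a fraction q of that word's strongest weight is treated as noise.
    keep = attn >= q * attn.amax(dim=-1, keepdim=True)
    attn = attn * keep

    # Attended visual context per word and a simple pooled matching score.
    context = torch.bmm(attn, regions)                           # (B, M, D)
    scores = F.cosine_similarity(context, words, dim=-1)         # (B, M)
    return attn, scores.mean(dim=-1)


# Toy usage: 36 region features and 8 word features per sample.
attn, score = attention_correct_filter(torch.randn(2, 36, 512), torch.randn(2, 8, 512))
print(attn.shape, score.shape)                                   # (2, 8, 36) and (2,)
```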
3.2.1. Base Attention
3.2.2. Attention Correction Unit
3.2.3. Attention Filtering Unit
3.3. Loss Function
4. Experiment
4.1. Datasets and Evaluation Metrics
4.2. Implementation Details
4.3. Parameter Experiment
4.3.1. Attention Correction Threshold p
4.3.2. Attention Filtering Threshold q
4.3.3. Margin
4.4. Comparison with the Other Methods
- VSE++ [61]: This model extracts image features using CNNs and text features using a GRU, and optimizes the model directly with the triplet loss function (a hedged sketch of such a loss appears after this list).
- SCAN [8]: SCAN extracts image regional features via target detection and text word features using a bidirectional GRU. It subsequently aligns them finely using a cross-attention mechanism.
- CAMP [9]: CAMP uses a cross-modal adaptive message-passing mechanism to control the cross-modal information flow, producing the final result through cosine similarity.
- MTFN [23]: This model capitalizes on the fusion of various features to compute cross-modal similarity in an end-to-end manner.
- CMFN [62]: CMFN enhances retrieval performance by individually learning the feature interaction between query text and RS images and modeling the feature association between both modes, thus preventing information misalignment.
- LW-MCR [63]: This lightweight multi-scale cross-modal retrieval method leverages techniques such as knowledge distillation and contrast learning.
- AMFMN [12]: AMFMN employs a multi-scale self-attention module to derive image features. These features then guide text representation, and a dynamically variable triplet loss function optimizes the model.
- GaLR [14]: GaLR combines image features from different levels using a multi-level information dynamic fusion module, eliminating redundancy in the process.
- SWAN [64]: SWAN uses a multi-scale fusion module to extract regional image features and then employs significant feature correlation to formulate a comprehensive image representation.
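Several of the baselines above (e.g., VSE++ and AMFMN) are optimized with a triplet ranking loss, and the margin examined in Section 4.3.3 plays the same role in this kind of objective. The sketch below shows a common bidirectional triplet loss with hardest-negative mining over a batch similarity matrix; it is a hedged illustration of that family of losses, not necessarily the exact loss used in this paper.

```python
import torch


def triplet_ranking_loss(sim, margin=0.2):
    """sim: (B, B) similarity matrix where sim[i, i] scores the matched image-text pair."""
    pos = sim.diag().view(-1, 1)                                           # (B, 1) positive scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    # Hinge violations for retrieving text with an image query and vice versa.
    cost_txt = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image -> negative texts
    cost_img = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> negative images

    # Hardest-negative mining (VSE++ style): keep only the worst violator per query.
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()


# Toy usage: random similarity scores for a batch of 4 image-text pairs.
print(triplet_ranking_loss(torch.randn(4, 4)))
```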
RSICD Dataset | RSITMD Dataset | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Sentence Retrieval | Image Retrieval | mR | Sentence Retrieval | Image Retrieval | mR | | | | | | | |
R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | |||
VSE++ | 3.38 | 9.51 | 17.46 | 2.82 | 11.32 | 18.10 | 10.43 | 10.38 | 27.65 | 39.60 | 7.79 | 24.87 | 38.67 | 24.83 |
SCAN t2i | 4.39 | 10.90 | 17.64 | 3.91 | 16.20 | 26.49 | 13.25 | 10.18 | 28.53 | 38.49 | 10.10 | 28.98 | 43.53 | 26.64 |
SCAN i2t | 5.85 | 12.89 | 19.84 | 3.71 | 16.40 | 26.73 | 14.23 | 11.06 | 25.88 | 39.38 | 9.82 | 29.38 | 42.12 | 26.28 |
CAMP-triplet | 5.12 | 12.89 | 21.12 | 4.15 | 15.23 | 27.81 | 14.39 | 11.73 | 26.99 | 38.05 | 8.27 | 27.79 | 44.34 | 26.20 |
CAMP-bce | 4.20 | 10.24 | 15.45 | 2.72 | 12.76 | 22.89 | 11.38 | 9.07 | 23.01 | 33.19 | 5.22 | 23.32 | 38.36 | 22.03 |
MTFN | 5.02 | 12.52 | 19.74 | 4.90 | 17.17 | 29.49 | 14.81 | 10.40 | 27.65 | 36.28 | 9.96 | 31.37 | 45.84 | 26.92 |
CMFN | 5.40 | 18.66 | 28.55 | 5.31 | 18.57 | 30.03 | 17.75 | 10.84 | 28.76 | 40.04 | 10.00 | 32.83 | 47.21 | 28.28
LW-MCR(b) | 4.57 | 13.71 | 20.11 | 4.02 | 16.47 | 28.23 | 14.52 | 9.07 | 22.79 | 38.05 | 6.11 | 27.74 | 49.56 | 25.55 |
LW-MCR(d) | 3.29 | 12.52 | 19.93 | 4.66 | 17.51 | 30.02 | 14.66 | 10.18 | 28.98 | 39.82 | 7.79 | 30.18 | 49.78 | 27.79 |
AMFMN-soft | 5.05 | 14.53 | 21.57 | 5.05 | 19.74 | 31.04 | 16.02 | 11.06 | 25.88 | 39.82 | 9.82 | 33.94 | 51.90 | 28.74 |
AMFMN-fusion | 5.39 | 15.08 | 23.40 | 4.90 | 18.28 | 31.44 | 16.42 | 11.06 | 29.20 | 38.72 | 9.96 | 34.03 | 52.96 | 29.32 |
AMFMN-sim | 5.21 | 14.72 | 21.57 | 4.08 | 17.00 | 30.60 | 15.53 | 10.63 | 24.78 | 41.81 | 11.51 | 34.69 | 54.87 | 29.72 |
GaLR w/o MR | 6.50 | 18.91 | 29.70 | 5.11 | 19.57 | 31.92 | 18.62 | 13.05 | 30.09 | 42.70 | 10.47 | 36.34 | 53.35 | 31.00 |
GaLR with MR | 6.59 | 19.85 | 31.04 | 4.69 | 19.48 | 32.13 | 18.96 | 14.82 | 31.64 | 42.48 | 11.15 | 36.68 | 51.68 | 31.41 |
SWAN | 7.41 | 20.13 | 30.86 | 5.56 | 22.26 | 37.41 | 20.61 | 13.35 | 32.15 | 46.90 | 11.24 | 40.40 | 60.60 | 34.11 |
ACF (ours) | 8.23 | 20.31 | 30.47 | 7.39 | 23.28 | 36.91 | 21.10 | 15.49 | 34.96 | 49.12 | 12.83 | 37.74 | 55.53 | 34.28 |
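For reference, the mR reported in these tables is consistent with the mean of the six recall values (R@1, R@5, R@10 for sentence retrieval and for image retrieval); for example, the VSE++ row on RSICD averages to 10.43:

```latex
\[
\mathrm{mR} = \frac{1}{6}\sum_{k \in \{1,5,10\}} \left( \mathrm{R@}k^{\,\text{sentence}} + \mathrm{R@}k^{\,\text{image}} \right),
\qquad
\frac{3.38 + 9.51 + 17.46 + 2.82 + 11.32 + 18.10}{6} \approx 10.43 .
\]
```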
4.5. Ablation Study
- M1: Incorporates the GAT for text feature extraction.
- M2: Pertains to image feature extraction supplemented with a multi-scale fusion module.
- M3: Involves the attention correction unit.
- M4: Introduces the attention filtering unit.
- M5: Adds a global similarity component.
- With the M1 module included, the model's mR score rose by 0.38.
- Adding the M2 module increased the mR score by 0.72, a further gain of 0.34 over M1 alone.
- Adding the M3 module increased the mR score by 1.07, a further gain of 0.35 over M1 and M2 combined. At this point, the model achieved the best R@5 in image retrieval.
- Adding the M4 module increased the mR score by 1.56, an improvement of 0.49 over the previous configuration, and yielded the best text retrieval performance.
- With all modules incorporated, the model's mR score rose by 1.71, and the model achieved the best R@1 and R@10 in image retrieval.
- With the M1 module included, the model's mR score rose by 1.30.
- Adding the M2 module increased the mR score by 2.86, a further gain of 1.56 over M1 alone. At this stage, the model achieved the best R@1 in text retrieval.
- Adding the M3 module increased the mR gain to 3.45, a further 0.59 over M1 and M2 combined.
- Adding the M4 module increased the mR gain to 4.14, an improvement of 0.69 over the previous configuration. At this stage, the model achieved the best R@1 and R@5 in image retrieval.
- With all modules incorporated, the mR gain reached 4.53; the model achieved the best R@10 in image retrieval and the best R@5 and R@10 in text retrieval.
4.6. Visual Analysis of Retrieval Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Li, Y.; Ma, J.; Zhang, Y. Image retrieval from remote sensing big data: A survey. Inf. Fusion 2021, 67, 94–115. [Google Scholar] [CrossRef]
- Shyu, C.R.; Klaric, M.; Scott, J.G. GeoIRIS: Geospatial information retrieval and indexing system—Content mining, semantics modeling, and complex queries. IEEE Trans. Geosci. Remote Sens. 2007, 45, 839–852. [Google Scholar] [CrossRef]
- Kandala, H.; Saha, S.; Banerjee, B. Exploring transformer and multilabel classification for remote sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Li, X.; Zhang, X.; Huang, W. Truncation cross entropy loss for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5246–5257. [Google Scholar] [CrossRef]
- Zhao, R.; Shi, Z.; Zuo, Z. High-resolution remote sensing image captioning based on structured attention. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
- Hoxha, G.; Melgani, F.; Demir, B. Toward remote sensing image retrieval under a deep image captioning perspective. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 4462–4475. [Google Scholar] [CrossRef]
- Hoxha, G.; Melgani, F.; Demir, B. Retrieving images with generated textual descriptions. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5812–5815. [Google Scholar]
- Lee, K.H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 201–216. [Google Scholar]
- Wang, Z.; Liu, X.; Li, H.; Sheng, L.; Yan, J.; Wang, X.; Shao, J. CAMP: Cross-modal adaptive message passing for text-image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5764–5773. [Google Scholar]
- Li, K.; Zhang, Y. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4654–4662. [Google Scholar]
- Wang, S.; Wang, R.; Yao, Z. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1508–1517. [Google Scholar]
- Yuan, Z.; Zhang, W.; Fu, K. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. arXiv 2022, arXiv:2204.09868. [Google Scholar] [CrossRef]
- Cheng, Q.; Zhuo, Y.; Fu, P. A deep semantic alignment network for the cross-modal image-text retrieval in remote sensing. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
- Yuan, Z.; Zhang, W.; Tian, C. Remote sensing cross-modal text-image retrieval based on global and local information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
- Han, K.; Wang, Y.; Guo, J. Vision GNN: An Image is Worth Graph of Nodes. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 1–16. [Google Scholar]
- Peng, Y.; Huang, X.; Zhao, Y. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2372–2385. [Google Scholar] [CrossRef]
- Hardoon, D.R.; Szedmak, S.; Shawe, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 2004, 16, 2639–2664. [Google Scholar] [CrossRef] [PubMed]
- Andrew, G.; Arora, R.; Bilmes, J. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1247–1255. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Cho, K.; Merrienboer, B.V.; Gulcehre, C. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Comput. Sci. 2014, 1, 1–15. [Google Scholar]
- Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://hayate-lab.com/wp-content/uploads/2023/05/43372bfa750340059ad87ac8e538c53b.pdf (accessed on 20 January 2025).
- Matsubara, T. Target-oriented deformation of visual-semantic embedding space. IEICE Trans. Inf. Syst. 2021, 104, 24–33. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget, J.; Mirza, M. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Wang, B.; Yang, Y.; Xu, X. Adversarial cross-modal retrieval. In Proceedings of the ACM Multimedia Conference, Mountain View, CA, USA, 23–27 October 2017; pp. 154–162. [Google Scholar]
- Peng, Y.; Qi, J. CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM Trans. Multimed. Comput. Commun. Appl. 2019, 15, 1–24. [Google Scholar] [CrossRef]
- Gu, J.; Ha, J.C.; Joty, S.R. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7181–7189. [Google Scholar]
- Wen, X.; Han, Z.; Liu, Y.S. CMPD: Using cross memory network with pair discrimination for image-text retrieval. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2427–2437. [Google Scholar] [CrossRef]
- Nam, H.; Ha, J.W.; Kim, J. Dual attention networks for multimodal reasoning and matching. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 299–307. [Google Scholar]
- Wang, T.; Xu, X.; Yang, Y. Matching images and text with multi-modal tensor fusion and re-ranking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 12–20. [Google Scholar]
- Ma, L.; Jiang, W.; Jie, Z. Matching image and sentence with multi-faceted representations. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 2250–2261. [Google Scholar] [CrossRef]
- Ji, Z.; Wang, H.; Han, J. Saliency-guided attention network for image-sentence matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5754–5763. [Google Scholar]
- Ren, S.; He, K.; Girshick, R. Faster rcnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
- Karpathy, A.; Li, F.F. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
- Liu, C.; Mao, Z.; Liu, A. Focus your attention: A bidirectional focal attention network for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 3–11. [Google Scholar]
- Wang, Y.; Yang, H.; Qian, X. Position focused attention network for image-text matching. arXiv 2019, arXiv:1907.09748. [Google Scholar]
- Zhang, Q.; Lei, Z.; Zhang, Z. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3536–3545. [Google Scholar]
- Chen, H.; Ding, G.; Liu, X. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12655–12663. [Google Scholar]
- Ji, Z.; Chen, K.; Wang, H. Step-wise hierarchical alignment network for image-text matching. arXiv 2021, arXiv:2106.06509. [Google Scholar]
- Liu, Y.; Wang, H.; Meng, F. Attend, Correct And Focus: A Bidirectional Correct Attention Network For Image-Text Matching. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 2673–2677. [Google Scholar]
- Ge, X.; Chen, F.; Jose, J.M. Structured multi-modal feature embedding and alignment for image-sentence retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 5185–5193. [Google Scholar]
- Scarselli, F.; Gori, M.; Tsoi, A.C. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Shi, B.; Ji, L.; Lu, P. Knowledge Aware Semantic Concept Expansion for Image-Text Matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 5182–5189. [Google Scholar]
- Wang, H.; Zhang, Y.; Ji, Z. Consensus-aware visual-semantic embedding for image-text matching. In Proceedings of the 2020 European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 18–34. [Google Scholar]
- Liu, C.; Mao, Z.; Zhang, T. Graph structured network for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10921–10930. [Google Scholar]
- Nguyen, M.D.; Nguyen, B.T.; Gurrin, C. A deep local and global scene-graph matching for image-text retrieval. arXiv 2021, arXiv:2106.02400. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5. [Google Scholar]
- Shi, Z.; Zou, Z. Can a machine generate humanlike language descriptions for a remote sensing image? IEEE Trans. Geosci. Remote Sens. 2017, 55, 3623–3634. [Google Scholar] [CrossRef]
- Lu, X.; Wang, B.; Zheng, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195. [Google Scholar] [CrossRef]
- Sumbul, G.; Nayak, S.; Demir, B. SD-RSIC: Summarization-driven deep remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6922–6934. [Google Scholar] [CrossRef]
- Hoxha, G.; Melgani, F. A novel SVM-based decoder for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
- Wang, Q.; Huang, W.; Zhang, X. Word–sentence framework for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10532–10543. [Google Scholar] [CrossRef]
- Wang, B.; Zheng, X.; Qu, B. Retrieval topic recurrent memory network for remote sensing image captioning. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 256–270. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, W.; Yan, M. Global visual feature and linguistic state guided attention for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
- Abdullah, T.; Bazi, Y.; Rahhal, M.M.A.; Mekhalfi, M.L.; Rangarajan, L.; Zuair, M. TextRS: Deep bidirectional triplet network for matching text to remote sensing images. Remote Sens. 2021, 12, 405. [Google Scholar] [CrossRef]
- Lv, Y.; Xiong, W.; Zhang, X. Fusion-based correlation learning model for cross-modal remote sensing image retrieval. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
- Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv 2017, arXiv:1707.05612. [Google Scholar]
- Yu, H.; Yao, F.; Lu, W. Text-Image Matching for Cross-Modal Remote Sensing Image Retrieval via Graph Neural Network. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023, 16, 812–824. [Google Scholar] [CrossRef]
- Yuan, Z. A lightweight multi-scale crossmodal text-image retrieval method in remote sensing. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–19. [Google Scholar] [CrossRef]
- Pan, J.; Ma, Q.; Cong, B. Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, Thessaloniki, Greece, 12–15 June 2023; pp. 398–406. [Google Scholar]
RSICD Dataset | RSITMD Dataset | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Threshold p | Sentence Retrieval | Image Retrieval | mR | Sentence Retrieval | Image Retrieval | mR | | | | | | | |
R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | |||
6.92 | 19.76 | 30.34 | 5.94 | 20.82 | 33.94 | 19.62 | 13.72 | 30.31 | 44.47 | 9.87 | 37.12 | 55.66 | 31.78 | |
7.81 | 20.71 | 31.88 | 6.09 | 22.26 | 36.32 | 20.84 | 14.82 | 33.63 | 43.81 | 9.56 | 37.52 | 57.52 | 32.80 | |
8.23 | 20.31 | 30.47 | 7.39 | 23.28 | 36.91 | 21.10 | 15.94 | 34.96 | 49.12 | 12.83 | 37.74 | 55.53 | 34.28 | |
7.39 | 19.69 | 30.74 | 5.74 | 21.83 | 34.97 | 20.06 | 13.94 | 33.41 | 46.24 | 11.24 | 38.36 | 53.50 | 32.78 | |
5.95 | 18.21 | 31.47 | 5.58 | 21.06 | 33.65 | 19.32 | 13.32 | 33.54 | 46.28 | 10.90 | 38.18 | 56.69 | 32.40 | |
6.13 | 19.12 | 29.00 | 5.45 | 20.79 | 34.11 | 19.10 | 12.21 | 29.25 | 42.65 | 10.14 | 36.88 | 55.22 | 31.06 |
RSICD Dataset | RSITMD Dataset | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Threshold q | Sentence Retrieval | Image Retrieval | mR | Sentence Retrieval | Image Retrieval | mR | | | | | | | |
R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | |||
8.23 | 20.31 | 30.47 | 7.39 | 23.28 | 36.91 | 21.10 | 15.49 | 34.96 | 49.12 | 12.83 | 37.74 | 55.53 | 34.28 | |
7.41 | 20.04 | 31.20 | 6.04 | 22.89 | 36.83 | 20.73 | 12.74 | 32.26 | 46.15 | 11.12 | 38.21 | 55.93 | 32.73 | |
7.14 | 20.40 | 32.48 | 6.39 | 21.33 | 33.39 | 20.19 | 12.65 | 31.06 | 44.78 | 11.30 | 36.78 | 54.31 | 31.81 | |
7.04 | 19.76 | 30.74 | 5.23 | 21.81 | 33.92 | 19.75 | 12.83 | 29.87 | 43.94 | 10.58 | 37.69 | 55.74 | 31.77 | |
7.50 | 20.21 | 31.38 | 6.37 | 21.06 | 34.47 | 20.16 | 12.26 | 30.58 | 43.41 | 10.25 | 36.70 | 56.08 | 31.54 | |
7.59 | 20.59 | 30.28 | 5.67 | 21.04 | 31.78 | 19.49 | 11.68 | 30.22 | 42.96 | 10.55 | 37.37 | 55.38 | 31.36 |
RSICD Dataset | RSITMD Dataset | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Margin | Sentence Retrieval | Image Retrieval | mR | Sentence Retrieval | Image Retrieval | mR | ||||||||
R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | |||
6.68 | 19.12 | 29.83 | 6.92 | 23.18 | 35.55 | 20.21 | 12.61 | 30.97 | 45.80 | 12.30 | 38.27 | 55.66 | 32.60 | |
8.23 | 20.31 | 30.47 | 7.39 | 23.28 | 36.91 | 21.10 | 15.49 | 34.96 | 49.12 | 12.83 | 37.74 | 55.53 | 34.28 | |
7.87 | 19.76 | 31.56 | 6.68 | 21.48 | 33.98 | 20.22 | 14.82 | 32.30 | 44.47 | 11.73 | 38.41 | 52.79 | 32.42 | |
6.68 | 17.38 | 30.56 | 6.62 | 22.78 | 36.45 | 20.08 | 12.70 | 32.03 | 44.16 | 10.31 | 37.31 | 55.38 | 31.98 | |
6.95 | 19.21 | 31.11 | 6.17 | 21.10 | 32.96 | 19.58 | 12.17 | 30.31 | 46.02 | 10.88 | 36.90 | 53.81 | 31.68 | |
5.76 | 17.84 | 28.82 | 5.56 | 21.35 | 35.66 | 19.17 | 11.50 | 31.86 | 45.58 | 9.16 | 36.81 | 52.26 | 31.19 |
M1 | M2 | M3 | M4 | M5 | Sentence Retrieval | Image Retrieval | mR | ||||
---|---|---|---|---|---|---|---|---|---|---|---|
R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | ||||||
7.32 | 19.12 | 30.83 | 5.76 | 20.00 | 33.32 | 19.39 | |||||
√ | 7.12 | 20.02 | 30.98 | 5.75 | 20.91 | 33.83 | 19.77 | ||||
√ | √ | 7.50 | 19.76 | 31.75 | 6.39 | 20.42 | 34.82 | 20.11 | |||
√ | √ | √ | 6.04 | 19.30 | 30.92 | 6.57 | 23.29 | 36.63 | 20.46 | ||
√ | √ | √ | √ | 8.34 | 21.04 | 32.48 | 6.11 | 21.57 | 36.19 | 20.95 | |
√ | √ | √ | √ | √ | 8.23 | 20.31 | 30.47 | 7.39 | 23.28 | 36.91 | 21.10 |
M1 | M2 | M3 | M4 | M5 | Sentence Retrieval | Image Retrieval | mR | ||||
---|---|---|---|---|---|---|---|---|---|---|---|
R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | ||||||
11.95 | 28.23 | 41.11 | 10.95 | 34.94 | 51.35 | 29.75 | |||||
√ | 13.76 | 31.02 | 42.57 | 11.05 | 36.08 | 51.81 | 31.05 | ||||
√ | √ | 16.59 | 32.08 | 44.49 | 11.55 | 37.70 | 53.05 | 32.61 | |||
√ | √ | √ | 12.83 | 33.19 | 48.89 | 11.50 | 37.88 | 54.91 | 33.20 | ||
√ | √ | √ | √ | 14.16 | 34.51 | 48.23 | 12.92 | 38.67 | 54.87 | 33.89 | |
√ | √ | √ | √ | √ | 15.49 | 34.96 | 49.12 | 12.83 | 37.74 | 55.53 | 34.28 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).