Image–Text Matching Model Based on CLIP Bimodal Encoding
Abstract
1. Introduction
- We present a novel image–text matching framework based on CLIP, integrating a ViT for image encoding and BERT for text encoding, and leveraging the LiT-tuning paradigm for improved training efficiency (a sketch follows this list).
- We implement a cosine decay strategy for adaptive learning rate adjustments, which improves convergence and model stability during training.
- We conduct extensive experiments on the Flickr30k and WuKong datasets, achieving superior results compared to existing baselines and demonstrating the robustness of our model in multimodal alignment.
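As a rough illustration of the first contribution, below is a minimal sketch of a CLIP-style bimodal encoder under the LiT-tuning paradigm: the image tower is locked and only the text tower and projection heads are trained. The checkpoint names, embedding dimension, and CLS-token pooling are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a CLIP-style dual encoder with LiT-tuning: ViT encodes images,
# BERT encodes text, and both are projected into a shared embedding space.
# Checkpoints and dimensions below are assumptions for illustration.
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, BertModel

class DualEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch32-224-in21k")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # LiT-tuning: lock the image tower; only the text tower and
        # projection heads receive gradient updates.
        for p in self.image_encoder.parameters():
            p.requires_grad = False

    def encode_image(self, pixel_values):
        # CLS-token pooling, then projection and L2 normalization.
        cls = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        return F.normalize(self.image_proj(cls), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        cls = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        return F.normalize(self.text_proj(cls), dim=-1)
```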
2. Related Work
3. Joint Encoding Model for Image–Text Matching
3.1. Image Encoder
3.2. Text Encoder
3.3. Similarity Calculation
4. Experimental Analysis
4.1. Experimental Environment
4.2. Datasets
4.3. Learning Rate Scheduling
4.4. Loss Function
4.5. Experimental Results and Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
- Toyama, J.; Misono, M.; Suzuki, M.; Nakayama, K.; Matsuo, Y. Neural machine translation with latent semantic of image and text. arXiv 2016, arXiv:1611.08459.
- Zhou, T.; Cai, Z.; Liu, F.; Su, J. In pursuit of beauty: Aesthetic-aware and context-adaptive photo selection in crowdsensing. IEEE Trans. Knowl. Data Eng. 2023, 35, 9364–9377.
- Cheng, D.; Chen, L.; Lv, C.; Guo, L.; Kou, Q. Light-guided and cross-fusion U-Net for anti-illumination image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8436–8449.
- Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv 2017, arXiv:1707.05612.
- Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 201–216.
- Zhang, Q.; Lei, Z.; Zhang, Z.; Li, S.Z. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3536–3545.
- Hu, Z.; Luo, Y.; Lin, J.; Yan, Y.; Chen, J. Multi-level visual-semantic alignments with relation-wise dual attention network for image and text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 789–795.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Teney, D.; Liu, L.; van den Hengel, A. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1–9.
- Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539.
- Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10941–10950.
- Nam, H.; Ha, J.-W.; Kim, J. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 299–307.
- Messina, N.; Stefanini, M.; Cornia, M.; Baraldi, L.; Falchi, F.; Amato, G.; Cucchiara, R. ALADIN: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, Graz, Austria, 14–16 September 2022; pp. 64–70.
- Fu, Z.; Mao, Z.; Song, Y.; Zhang, Y. Learning semantic relationship among instances for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15159–15168.
- Pan, Z.; Wu, F.; Zhang, B. Fine-grained image-text matching by cross-modal hard aligning network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19275–19284.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Gu, J.; Meng, X.; Lu, G.; Hou, L.; Niu, M.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X. Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 26418–26431.
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2641–2649.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
- Wang, S.; Wang, R.; Yao, Z.; Shan, S.; Chen, X. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1508–1517.
- Zhang, K.; Mao, Z.; Wang, Q.; Zhang, Y. Negative-aware attention framework for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15661–15670.
- Wang, H.; Zhang, Y.; Ji, Z.; Pang, Y.; Ma, L. Consensus-aware visual-semantic embedding for image-text matching. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Part XXIV, pp. 18–34.
Full Form | Abbreviation |
---|---|
Contrastive Language–Image Pre-training | CLIP |
Vision Transformer | ViT |
Bidirectional Encoder Representations from Transformers | BERT |
Recurrent Neural Network | RNN |
Sequence-to-Sequence | Seq2Seq |
Multi-Layer Perceptron | MLP |
Feed-Forward Neural Network | FFN |
K-Nearest Neighbors | KNN |
Information Noise-Contrastive Estimation Loss | InfoNCE loss |
Recall at n | R@n |
Stochastic Gradient Descent | SGD |
Classification Token | CLS |
Separator Token | SEP |
Comma-Separated Values | CSV |
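For reference, the InfoNCE loss listed above (the contrastive objective used to train CLIP-style models; see Section 4.4) is conventionally implemented as a symmetric cross-entropy over the batch similarity matrix, with matched image–text pairs on the diagonal as positives and all other in-batch pairs as negatives. A generic sketch follows; the temperature value is illustrative, not the paper's exact setting.

```python
# Symmetric InfoNCE over a batch: the diagonal of the similarity matrix
# holds the matched pairs; every off-diagonal entry is an in-batch negative.
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    # Embeddings are assumed L2-normalized, so the matmul yields cosine similarities.
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)               # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)           # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```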
Reference | Model | Key Techniques | Advantages | Limitations |
---|---|---|---|---|
Faghri et al. [5] | Shared Vector Space Model | Cosine similarity, global feature mapping | Simple, effective for overall similarity | Misses fine-grained image–text details |
Lee et al. [6] | SCAN Model | Attention-based fine-grained matching | Captures detailed region–word alignment | Ignores inter-modal relationships |
Radford et al. [9] | CLIP | Contrastive learning, global + fine-grained mapping | Strong generalization and efficiency | Limited handling of subtle, nuanced details |
Wei et al. [12] | Cross-Modal Attention | Multi-head inter- and intra-modal attention | Enhances semantic alignment and matching | High computational cost |
Messina et al. [14] | ALADIN | Fine-grained alignment, KNN search | Fast, efficient cross-modal retrieval | Requires accurate feature alignment |
Our Method | CLIP-Based Bimodal Encoding | ViT + BERT encoders, LiT-tuning, cosine decay | Superior convergence, robust performance, captures both global and fine-grained semantics | Potential increase in computational complexity |
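The cosine decay schedule referenced in the contributions and in our method row above anneals the learning rate from its peak to a floor along half a cosine period. A minimal sketch, with illustrative hyperparameter values:

```python
# Half-period cosine annealing of the learning rate; lr_max and lr_min
# are illustrative, not the paper's reported hyperparameters.
import math

def cosine_decay_lr(step: int, total_steps: int,
                    lr_max: float = 1e-4, lr_min: float = 1e-6) -> float:
    """Learning rate at `step` under cosine decay from lr_max to lr_min."""
    progress = min(step, total_steps) / total_steps   # fraction of training elapsed
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

PyTorch ships the same schedule as `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min)`, which is the usual way to wire it into a training loop.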
Library Name | Version |
---|---|
Python | 3.8 |
PyTorch | 1.12.0 |
Pandas | 1.4.2 |
NumPy | 1.22.4 |
Transformers | 4.21.0 |
Pillow | 9.3.0 |
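Assuming a Python 3.8 interpreter, a hypothetical one-line setup matching the pinned versions above might look like:

```
# Illustrative pip invocation; package names are the PyPI equivalents of the table above.
pip install torch==1.12.0 pandas==1.4.2 numpy==1.22.4 transformers==4.21.0 Pillow==9.3.0
```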
Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | rsum
---|---|---|---|---|---|---|---
SGM [22] | 73.2 | 92.8 | 96.6 | 57.2 | 87.2 | 94.1 | 501.1
NAAF [23] | 75.0 | 94.2 | 97.4 | 58.1 | 83.6 | 89.7 | 498.0
CVSE [24] | 69.1 | 93.3 | 97.4 | 55.5 | 86.9 | 93.8 | 496.0
Ours | 75.2 | 94.1 | 98.1 | 59.3 | 84.2 | 95.2 | 506.1
Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | rsum
---|---|---|---|---|---|---|---
SGM [22] | 71.2 | 83.1 | 90.3 | 46.5 | 63.4 | 72.6 | 427.1
NAAF [23] | 67.0 | 84.5 | 91.4 | 58.3 | 79.6 | 88.4 | 469.2
CVSE [24] | 70.2 | 79.4 | 88.2 | 66.7 | 71.9 | 80.2 | 456.6
Ours | 73.4 | 81.3 | 93.5 | 68.7 | 61.8 | 90.8 | 469.5
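The R@n and rsum figures in the two tables above follow the standard retrieval protocol: for each query, rank all candidates by similarity and check whether the ground-truth match appears in the top n; rsum sums R@1, R@5, and R@10 over both retrieval directions. A generic sketch (assuming a square similarity matrix with one matching text per image), not the authors' exact evaluation script:

```python
# Recall@K and rsum from an image-text similarity matrix.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of query i and candidate j; ground truth on the diagonal."""
    ranks = (-sim).argsort(axis=1)                  # candidates sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return 100.0 * hits.mean()                      # percentage of queries hit in the top k

def rsum(sim: np.ndarray) -> float:
    """Sum of R@1, R@5, R@10 for image-to-text (sim) and text-to-image (sim.T)."""
    return sum(recall_at_k(sim, k) + recall_at_k(sim.T, k) for k in (1, 5, 10))
```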
Image | ViT32 + Bert_base | ViT32 + Bert_base_fin | ViT16 + Bert_base | ViT16 + Bert_base_fin
---|---|---|---|---
[sports car image] | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 7.17 × 10^−14), (‘A puppy sitting on the ground’, 1.82 × 10^−17) | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 3.84 × 10^−12), (‘A puppy sitting on the ground’, 1.51 × 10^−14) | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 4.42 × 10^−11), (‘A puppy sitting on the ground’, 1.87 × 10^−15) | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 3.35 × 10^−12), (‘A puppy sitting on the ground’, 3.36 × 10^−15)
[puppy image] | (‘A puppy sitting on the ground’, 0.998), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 4.42 × 10^−8), (‘A bus is parked on the roadside’, 1.52 × 10^−10) | (‘A puppy sitting on the ground’, 0.996), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.33 × 10^−9), (‘A bus is parked on the roadside’, 1.15 × 10^−9) | (‘A puppy sitting on the ground’, 0.999), (‘A bus is parked on the roadside’, 1.23 × 10^−11), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.80 × 10^−14) | (‘A puppy sitting on the ground’, 0.977), (‘A bus is parked on the roadside’, 2.10 × 10^−12), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 5.82 × 10^−13)
[bus image] | (‘A bus is parked on the roadside’, 1.0), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 8.04 × 10^−7), (‘A puppy sitting on the ground’, 8.00 × 10^−11) | (‘A bus is parked on the roadside’, 0.998), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.59 × 10^−3), (‘A puppy sitting on the ground’, 3.45 × 10^−7) | (‘A bus is parked on the roadside’, 1.0), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 3.19 × 10^−4), (‘A puppy sitting on the ground’, 1.23 × 10^−10) | (‘A bus is parked on the roadside’, 0.999), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 5.68 × 10^−4), (‘A puppy sitting on the ground’, 1.76 × 10^−12)
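The per-caption scores in the table above are consistent with the usual CLIP-style inference step: cosine similarities between one image embedding and each candidate caption embedding are divided by a temperature and passed through a softmax. A minimal sketch, with an illustrative temperature rather than the paper's exact value:

```python
# Softmax over candidate captions for a single image, CLIP-style.
import torch

def caption_probabilities(image_emb: torch.Tensor, text_embs: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """image_emb: (D,), text_embs: (N, D); both assumed L2-normalized."""
    logits = text_embs @ image_emb / temperature   # (N,) scaled cosine similarities
    return logits.softmax(dim=0)                   # probability assigned to each caption
```

With the small temperature typical of CLIP-like models, the softmax is sharply peaked, which explains the near-1.0 score for the correct caption and the vanishingly small scores for the distractors in the table.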
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhu, Y.; Xu, H.; Du, A.; Wang, B. Image–Text Matching Model Based on CLIP Bimodal Encoding. Appl. Sci. 2024, 14, 10384. https://doi.org/10.3390/app142210384