DecoupleCLIP: A Novel Cross-Modality Decouple Model for Painting Captioning
Abstract
1. Introduction
- We decouple painting captions into two aspects, objective content and artistic conception, and propose a two-branch network structure that incorporates local-to-global feature fusion and multimodal fusion (see the sketch following this list).
- We develop a multimodal fusion model that integrates the objective caption text with multi-scale visual features of the image.
- We create a small-scale image captioning dataset comprising both Chinese and Western paintings and conduct extensive experiments on it.
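To make the decoupled design concrete, the following is a minimal PyTorch-style sketch of the two-branch idea, assuming CLIP-derived patch-level (local) and image-level (global) features are supplied as inputs. All class names, layer counts, and dimensions here are illustrative assumptions rather than the authors' implementation; token embeddings, positional encodings, and causal masks are omitted for brevity.

```python
# Illustrative sketch only: two-branch decoupled captioning with
# local-to-global visual fusion and text-visual multimodal fusion.
import torch
import torch.nn as nn


class LocalToGlobalFusion(nn.Module):
    """Fuse local (patch-level) tokens with the global image embedding.
    The concat-and-project scheme is an assumption, not the paper's design."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, local_feats, global_feat):
        # local_feats: (B, N, D) patch tokens; global_feat: (B, D) image embedding
        g = global_feat.unsqueeze(1).expand(-1, local_feats.size(1), -1)
        return self.proj(torch.cat([local_feats, g], dim=-1))  # (B, N, D)


class MultimodalFusion(nn.Module):
    """Cross-attention from objective-caption tokens to the fused visual tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        fused, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return self.norm(text_tokens + fused)


class DecoupledCaptioner(nn.Module):
    """Objective branch captions visual content; the artistic branch decodes
    over a multimodal memory built from objective text plus visual features."""

    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.obj_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.art_decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.l2g = LocalToGlobalFusion(dim)
        self.mm = MultimodalFusion(dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, local_feats, global_feat, obj_embeds, art_embeds):
        # obj_embeds / art_embeds: already-embedded caption tokens, (B, T, D);
        # causal masks are omitted here for brevity.
        visual = self.l2g(local_feats, global_feat)      # (B, N, D)
        obj_out = self.obj_decoder(obj_embeds, visual)   # objective branch
        memory = self.mm(obj_out, visual)                # multimodal fusion
        art_out = self.art_decoder(art_embeds, memory)   # artistic-conception branch
        return self.head(obj_out), self.head(art_out)
```

In this sketch, the artistic-conception branch decodes over a memory built by fusing the objective branch's output with the visual tokens, which is one plausible reading of how the two decoupled captions could interact; the actual interaction in DecoupleCLIP may differ.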
2. Related Work
2.1. Image Captioning
2.2. Painting Captioning
2.3. Multimodal Models
3. Methodology
Algorithm 1: Overview of DecoupleCLIP
3.1. Network Design
3.2. Local-to-Global Feature Fusion
3.3. Multimodal Fusion
4. Experiments
4.1. Datasets
4.2. Experimental Settings
4.3. Evaluation
4.4. Ablation Studies
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015; pp. 3156–3164. [Google Scholar]
- Wang, Y.; Xu, J.; Sun, Y. End-to-End Transformer Based Model for Image Captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 22 February–1 March 2023; pp. 2585–2594. [Google Scholar]
- Wang, D.; Hu, Z.; Zhou, Y.; Hong, R.; Wang, M. A Text-Guided Generation and Refinement Model for Image Captioning. IEEE Trans. Multimed. 2023, 25, 2966–2977. [Google Scholar] [CrossRef]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-Critical Sequence Training for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1179–1195. [Google Scholar]
- Dong, X.; Zhang, G.; Zhan, X.; Ding, Y.; Wei, Y.; Lu, M.; Liang, X. Caption-Aided Product Detection via Collaborative Pseudo-Label Harmonization. IEEE Trans. Multimed. 2023, 25, 1916–1927. [Google Scholar] [CrossRef]
- Wang, C.; Gu, X. Learning Double-Level Relationship Networks for Image Captioning. Inf. Process. Manag. 2023, 60, 103288–103312. [Google Scholar] [CrossRef]
- Luvembe, A.; Li, W.; Li, S.; Liu, F.; Wu, X. CAF-ODNN: Complementary Attention Fusion with Optimized Deep Neural Network for Multimodal Fake News Detection. Inf. Process. Manag. 2024, 61, 103653–103689. [Google Scholar] [CrossRef]
- Lu, J.; Yang, J.; Batra, D.; Parikh, D. Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7219–7228. [Google Scholar]
- Jiang, W.; Zhou, W.; Hu, H. Double-Stream Position Learning Transformer Network for Image Captioning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7706–7718. [Google Scholar] [CrossRef]
- Wang, Y.; Xu, N.; Liu, A.; Li, W.; Zhang, Y. High-order interaction learning for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4417–4430. [Google Scholar] [CrossRef]
- Liu, A.; Zhai, Y.; Xu, N.; Nie, W.; Li, W.; Zhang, Y. Region-aware image captioning via interaction learning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 3685–3696. [Google Scholar] [CrossRef]
- Ge, H.; Yan, Z.; Zhang, K.; Zhao, M.; Sun, L. Exploring Overall Contextual Information for Image Captioning in Human-like Cognitive Style. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1754–1763. [Google Scholar]
- Zhang, J.; Mei, K.; Zheng, Y.; Fan, J. Integrating Part of Speech Guidance for Image Captioning. IEEE Trans. Multimed. 2020, 23, 92–104. [Google Scholar] [CrossRef]
- Prudviraj, J.; Vishnu, C.; Mohan, C. Attentive contextual network for image captioning. In Proceedings of the International Joint Conference on Neural Networks, Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Yu, L.; Zhang, J.; Wu, Q. Dual attention on pyramid feature maps for image captioning. IEEE Trans. Multimed. 2022, 24, 1775–1786. [Google Scholar] [CrossRef]
- Achlioptas, P.; Ovsjanikov, M.; Haydarov, K.; Elhoseiny, M.; Guibas, L. Artemis: Affective Language for Visual Art. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 11569–11579. [Google Scholar]
- Garcia, N.; Vogiatzis, G. How to Read Paintings: Semantic Art Understanding with Multi-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 676–691. [Google Scholar]
- Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Huang, F.; Deussen, O.; Xu, C. Exploring the Representativity of Art Paintings. IEEE Trans. Multimed. 2020, 23, 2794–2805. [Google Scholar] [CrossRef]
- Lu, Y.; Guo, C.; Dai, X.; Wang, F. Data-Efficient Image Captioning of Fine Art Paintings via Virtual-Real Semantic Alignment Training. Neurocomputing 2022, 490, 163–180. [Google Scholar] [CrossRef]
- Yan, J.; Wang, W.; Yu, Y. Affective Word Embedding in Affective Explanation Generation for Fine Art Paintings. Pattern Recognit. Lett. 2022, 161, 24–29. [Google Scholar] [CrossRef]
- Bai, Z.; Nakashima, Y.; Garcia, N. Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5422–5432. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
- Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. Rstnet: Captioning with Adaptive Attention on Visual and Non-Visual Words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 15465–15474. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal Transformer with Multi-View Visual Representation for Image Captioning. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4467–4480. [Google Scholar] [CrossRef]
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, W.; Wei, F.; Dai, J. VL-BERT: Pre-Training of Generic Visual-Linguistic Representations. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Li, G.; Duan, N.; Fang, Y.; Gong, M.; Jiang, D. Unicoder-vl: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11336–11344. [Google Scholar]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 121–137. [Google Scholar]
- Li, W.; Gao, C.; Niu, G.; Xiao, X.; Liu, H.; Liu, J.; Wu, H.; Wang, H. Unimo: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Minneapolis, MN, USA, 2–7 June 2021; pp. 2592–2607. [Google Scholar]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 13–23. [Google Scholar]
- Zhuang, J.; Yu, J.; Ding, Y.; Qu, X.; Hu, Y. Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment. IEEE Trans. Multimed. 2023, 26, 1361–1372. [Google Scholar] [CrossRef]
- Nguyen, V.-Q.; Suganuma, M.; Okatani, T. GRIT: Faster and better image captioning transformer using dual visual features. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Volume 13696, pp. 167–184. [Google Scholar]
- Desai, K.; Johnson, J. Virtex: Learning visual representations from textual annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 11157–11168. [Google Scholar]
Metrics | PureT [2] | GRIT [35] | VirTex [36] | DecoupleCLIP (Ours) |
---|---|---|---|---|
B@1 | 43.40 | 39.33 | 29.73 | 43.20 |
B@2 | 31.43 | 28.80 | 20.08 | 31.70 |
B@3 | 23.45 | 21.78 | 13.93 | 24.15 |
B@4 | 17.33 | 16.23 | 8.93 | 18.18 |
METEOR | 18.55 | 18.08 | 16.34 | 18.13 |
ROUGE-L | 37.40 | 35.78 | 30.45 | 38.08 |
CIDEr | 33.53 | 20.23 | 8.07 | 35.50 |
SPICE | 27.08 | 22.33 | 17.35 | 27.65 |
Metrics | PureT [2] | GRIT [35] | VirTex [36] | DecoupleCLIP (Ours) |
---|---|---|---|---|
B@1 | 41.65 | 28.85 | 24.78 | 41.18 |
B@2 | 29.20 | 20.28 | 16.65 | 29.13 |
B@3 | 21.35 | 14.90 | 11.90 | 21.60 |
B@4 | 15.53 | 10.50 | 7.83 | 15.93 |
METEOR | 16.85 | 11.83 | 15.38 | 16.83 |
ROUGE-L | 35.58 | 28.90 | 27.93 | 35.20 |
CIDEr | 32.05 | 10.68 | 9.05 | 33.88 |
SPICE | 23.28 | 14.65 | 16.75 | 23.95 |
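Here, B@1–B@4 denote BLEU-1 through BLEU-4; together with METEOR, ROUGE-L, CIDEr, and SPICE these are the standard COCO-style captioning metrics, and the tables appear to report them scaled by 100. The excerpt does not state which toolkit was used for evaluation; the sketch below shows one common way to compute these metrics with the pycocoevalcap package (METEOR and SPICE additionally require a Java runtime), purely as an illustration.

```python
# Hedged sketch: computing BLEU-1..4, METEOR, ROUGE-L, CIDEr, and SPICE with
# the pycocoevalcap package. This is not necessarily the toolkit used in the
# paper; it simply illustrates how these standard metrics are obtained.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer


def evaluate(gts, res):
    """gts/res map image_id -> list of {'caption': str}; res holds exactly one
    candidate caption per image, gts holds all reference captions."""
    tokenizer = PTBTokenizer()
    gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)
    scorers = [
        (Bleu(4), ["B@1", "B@2", "B@3", "B@4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
        (Spice(), "SPICE"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(name, list):  # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results
```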
MM Module | B@1 | B@4 | CIDEr | SPICE |
---|---|---|---|---|
co-attention layer | 43.33 | 18.30 | 35.08 | 27.55 |
w/o MM | 42.85 | 17.90 | 33.88 | 26.98 |
Ours | 43.20 | 18.18 | 35.50 | 27.65 |
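In the ablation above, "co-attention layer" replaces the proposed multimodal fusion (MM) module, and "w/o MM" removes it entirely. The excerpt does not describe the co-attention baseline in detail; the sketch below assumes the common formulation in which text and visual tokens attend to each other symmetrically, in contrast to the one-directional fusion sketched earlier.

```python
# Assumed co-attention baseline (illustrative only): text attends to visual
# tokens and visual tokens attend to text, with residuals and layer norms.
import torch.nn as nn


class CoAttentionLayer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, T, D); visual_tokens: (B, N, D)
        t, _ = self.text_to_vis(text_tokens, visual_tokens, visual_tokens)
        v, _ = self.vis_to_text(visual_tokens, text_tokens, text_tokens)
        return self.norm_text(text_tokens + t), self.norm_vis(visual_tokens + v)
```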
Module | B@1 | B@4 | CIDEr | SPICE |
---|---|---|---|---|
w/o Global Fusion and CLIP | 43.00 | 17.10 | 27.60 | 25.48 |
Ours | 43.20 | 18.18 | 35.50 | 27.65 |
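The second ablation removes the global-fusion path together with the CLIP backbone. For context, the sketch below shows one way to obtain the patch-level (local) and pooled (global) visual features that the earlier model sketch assumes, using the Hugging Face CLIP implementation; the checkpoint name, the image path, and the suggested projection layer are placeholders, since the excerpt does not specify which CLIP variant or feature-extraction scheme the paper uses.

```python
# Hedged sketch: extracting global and patch-level CLIP features with the
# Hugging Face transformers CLIP implementation. The checkpoint and file path
# are placeholders chosen for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("painting.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Global feature: pooled + projected image embedding, shape (1, 512).
    global_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Local features: per-patch hidden states from the vision tower,
    # shape (1, 1 + num_patches, 768) for ViT-B/32; drop the CLS token.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    local_feats = vision_out.last_hidden_state[:, 1:, :]

# A learned projection (e.g., nn.Linear(768, 512)) would map local_feats to the
# same dimension as global_feat before the local-to-global fusion step.
```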
Share and Cite
Zhang, M.; Hou, X.; Yan, Y.; Sun, M. DecoupleCLIP: A Novel Cross-Modality Decouple Model for Painting Captioning. Electronics 2024, 13, 4207. https://doi.org/10.3390/electronics13214207