Attention-Guided Image Captioning through Word Information
Abstract
1. Introduction
- We propose a novel word-guided attention (WGA) module for image captioning that determines the relationships among the attention features of an encoded image.
- We apply the WGA with either the previous-step word (PW) or the current-step word (CW). Guided by the previous-step word, the WGA concentrates on covering more objects in the scene and describing the relationships among them; guided by the current-step word, it focuses on extracting finer details and deeper relational information from the current attention region (see the sketch below).
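To make the idea concrete, the following is a minimal PyTorch sketch of how a guiding word embedding could condition attention over encoded region features. The module name, dimensions, and the concatenation-based fusion are our own illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGuidedAttention(nn.Module):
    """Illustrative word-guided attention (WGA) block.

    Region features attend to each other, with the queries additionally
    conditioned on a guiding word embedding (previous-step or current-step
    word, PW / CW). Layer names and sizes are assumptions for illustration.
    """
    def __init__(self, feat_dim=2048, word_dim=512, hidden_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(feat_dim + word_dim, hidden_dim)  # word-conditioned queries
        self.k_proj = nn.Linear(feat_dim, hidden_dim)
        self.v_proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, regions, word_emb):
        # regions:  (batch, num_regions, feat_dim), e.g. detector region features
        # word_emb: (batch, word_dim), embedding of the guiding word
        word = word_emb.unsqueeze(1).expand(-1, regions.size(1), -1)
        q = self.q_proj(torch.cat([regions, word], dim=-1))   # fuse word info into queries
        k, v = self.k_proj(regions), self.v_proj(regions)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5    # scaled dot-product attention
        attn = F.softmax(scores, dim=-1)
        return attn @ v                                        # word-guided region features

# Usage sketch (PW variant): guide attention with the previous-step word.
regions = torch.randn(2, 36, 2048)   # 36 region features per image
prev_word = torch.randn(2, 512)      # embedding of the previous-step word
wga = WordGuidedAttention()
guided = wga(regions, prev_word)     # -> (2, 36, 512)
```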
2. Related Work
2.1. Image Captioning
2.2. Attention Mechanism
3. Methods
3.1. WGA
3.2. Image Captioning Model
3.3. Training and Objectives
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Quantitative Analysis
4.4. Qualitative Analysis
4.5. Ablative Studies
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Kulkarni, G.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1601–1608. [Google Scholar]
- Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From captions to visual concepts and back. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1473–1482. [Google Scholar]
- Yang, Y.; Teo, C.L.; Daumé, H.; Aloimonos, Y. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–31 July 2011; pp. 444–454. [Google Scholar]
- Mitchell, M.; Han, X.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; Daumé, H. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 23–27 April 2012; pp. 747–756. [Google Scholar]
- Zhang, L.; Zhang, Y.; Zhao, X.; Zou, Z. Image captioning via proximal policy optimization. Image Vis. Comput. 2021, 108, 104126. [Google Scholar] [CrossRef]
- Cho, K.; Merrienboer, B.v.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv Preprint 2014, arXiv:1406.1078. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; pp. 3104–3112. [Google Scholar]
- Yang, Z.; Yuan, Y.; Wu, Y.; Cohen, W.W.; Salakhutdinov, R.R. Review networks for caption generation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2369–2377. [Google Scholar]
- Gan, Z.; Gan, C.; He, X.; Pu, Y.; Tran, K.; Gao, J.; Carin, L.; Deng, L. Semantic Compositional Networks for Visual Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1141–1150. [Google Scholar]
- Chen, Y.; Wang, S.; Zhang, W.; Huang, Q. Less Is More: Picking Informative Frames for Video Captioning; Springer International Publishing: Cham, Switzerland, 2018; pp. 367–384. [Google Scholar]
- Gan, C.; Gan, Z.; He, X.; Gao, J.; Deng, L. StyleNet: Generating Attractive Visual Captions with Styles. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 955–964. [Google Scholar]
- Li, R.; Liang, H.; Shi, Y.; Feng, F.; Wang, X. Dual-CNN: A Convolutional language decoder for paragraph image captioning. Neurocomputing 2020, 396, 92–101. [Google Scholar] [CrossRef]
- Wang, H.; Wang, H.; Xu, K. Evolutionary recurrent neural network for image captioning. Neurocomputing 2020, 401, 249–256. [Google Scholar] [CrossRef]
- Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning—Volume 37, Lille, France, 7–9 July 2015; JMLR.org: Brookline, MA, USA, 2015; pp. 2048–2057. [Google Scholar]
- Xiao, F.; Gong, X.; Zhang, Y.; Shen, Y.; Li, J.; Gao, X. DAA: Dual LSTMs with adaptive attention for image captioning. Neurocomputing 2019, 364, 322–329. [Google Scholar] [CrossRef]
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 933–941. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A Neural Image Caption Generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
- Wu, Q.; Shen, C.; Liu, L.; Dick, A.; Hengel, A.V.D. What Value Do Explicit High Level Concepts Have in Vision to Language Problems? In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 203–212. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
- Yang, X.; Tang, K.; Zhang, H.; Cai, J. Auto-Encoding Scene Graphs for Image Captioning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10677–10686. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring Visual Relationship for Image Captioning; Springer International Publishing: Cham, Switzerland, 2018; pp. 711–727. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; Mei, T. Boosting Image Captioning with Attributes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4904–4912. [Google Scholar]
- Corbetta, M.; Shulman, G.L. Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 2002, 3, 201–215. [Google Scholar] [CrossRef] [PubMed]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3242–3250. [Google Scholar]
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306. [Google Scholar]
- Wu, L.; Tian, F.; Zhao, L.; Lai, J.; Liu, T. Word Attention for Sequence to Sequence Text Understanding; AAAI: Palo Alto, CA, USA, 2018. [Google Scholar]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-Critical Sequence Training for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1179–1195. [Google Scholar]
- Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Springer: Zurich, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments; ACL: Ann Arbor, MI, USA, 2005; pp. 228–231. [Google Scholar]
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the 14th European Conference on Computer Vision, ECCV 2016, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 382–398. [Google Scholar]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef] [Green Version]
- Jiang, W.; Ma, L.; Jiang, Y.-G.; Liu, W.; Zhang, T. Recurrent Fusion Network for Image Captioning; Springer International Publishing: Cham, Switzerland, 2018; pp. 510–526. [Google Scholar]
- Ji, J.; Xu, C.; Zhang, X.; Wang, B.; Song, X. Spatio-Temporal Memory Attention for Image Captioning. IEEE Trans. Image Process. 2020, 29, 7615–7628. [Google Scholar] [CrossRef]
- Cai, W.; Liu, Q. Image captioning with semantic-enhanced features and extremely hard negative examples. Neurocomputing 2020, 413, 31–40. [Google Scholar] [CrossRef]
- Guo, L.; Liu, J.; Lu, S.; Lu, H. Show, Tell, and Polish: Ruminant Decoding for Image Captioning. IEEE Trans. Multimed. 2020, 22, 2149–2162. [Google Scholar] [CrossRef]
- Kv, G.; Nambiar, A.; Srinivas, K.S.; Mittal, A. Linguistically-aware attention for reducing the semantic gap in vision-language tasks. Pattern Recognit. 2021, 112, 107812. [Google Scholar] [CrossRef]
- Zhang, Z.; Wu, Q.; Wang, Y.; Chen, F. Exploring region relationships implicitly: Image captioning with visual relationship attention. Image Vis. Comput. 2021, 109, 104146. [Google Scholar] [CrossRef]
- Fei, Z. Memory-Augmented Image Captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 1317–1324. [Google Scholar]
- Yan, C.; Hao, Y.; Li, L.; Yin, J.; Liu, A.; Mao, Z.; Chen, Z.; Gao, X. Task-Adaptive Attention for Image Captioning. IEEE Trans. Circuits Syst. Video Technol. 2021, early access. [Google Scholar]
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
---|---|---|---|---|---|---|---|---|
LSTM [20] | - | - | - | 29.6 | 25.2 | 52.6 | 94.0 | - |
SCST [30] | - | - | - | 30.0 | 25.9 | 53.4 | 99.4 | - |
Adaptive-Attention [27] | 73.4 | 56.6 | 41.8 | 30.4 | 25.7 | - | 102.9 | - |
RFNet [40] | 76.4 | 60.4 | 46.6 | 35.8 | 27.4 | 56.5 | 112.5 | 20.5 |
UpDown [22] | 77.2 | - | - | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 |
Att2in+RD [43] | - | - | - | 34.3 | 26.4 | 55.2 | 106.1 | 19.7 |
UpDown+STAM [41] | 77.4 | 61.5 | 47.6 | 36.5 | 27.4 | 56.8 | 114.4 | 20.5 |
Ours: PW | 77.4 | 61.5 | 47.7 | 36.8 | 28.1 | 57.3 | 117.0 | 21.2 |
Ours: CW | 77.2 | 61.5 | 47.8 | 36.9 | 28.0 | 57.2 | 117.4 | 21.1 |
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
---|---|---|---|---|---|---|---|---|
LSTM [20] | - | - | - | 31.9 | 25.5 | 54.3 | 106.3 | - |
SCST [30] | - | - | - | 34.2 | 26.7 | 55.7 | 114.0 | - |
RFNet [40] | 79.1 | 63.1 | 48.4 | 36.5 | 27.7 | 57.3 | 121.9 | 21.2 |
UpDown [22] | 79.8 | - | - | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
Cai et al. [42] | 80.0 | 64.3 | 49.6 | 37.5 | 28.2 | 58.2 | 126.0 | 21.8 |
UpDown+RD [43] | 80.0 | - | - | 37.8 | 28.2 | 57.9 | 125.3 | - |
UpDown+STAM [41] | 80.2 | 64.4 | 49.7 | 37.7 | 28.2 | 58.1 | 125.9 | 21.7 |
UpDown+LAT [44] | 80.4 | - | - | 37.7 | 28.4 | 58.3 | 127.1 | 22.0 |
VRAtt-Soft [45] | 80.2 | 63.3 | 48.7 | 37.3 | 28.4 | 61.4 | 121.8 | 21.8 |
UpDown+MA [46] | 80.2 | - | - | 37.5 | 28.4 | 58.2 | 125.4 | 22.0 |
Ours: PW | 80.4 | 65.1 | 50.8 | 39.1 | 28.7 | 58.7 | 127.6 | 22.2 |
Ours: CW | 80.6 | 65.2 | 50.9 | 39.1 | 28.7 | 58.8 | 127.2 | 22.1 |
Method | BLEU-4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
---|---|---|---|---|---|
SCST [30] | 35.4 | 27.1 | 56.6 | 117.5 | - |
RFNet [40] | 37.9 | 28.3 | 58.3 | 125.7 | 21.7 |
Yan et al. [47] | 38.4 | 27.8 | 57.9 | 121.6 | 21.5 |
Ours: PW | 39.6 | 28.7 | 59.1 | 128.3 | 22.2 |
Ours: CW | 39.8 | 28.8 | 59.4 | 128.3 | 22.2 |
Image | Captions |
---|---|
(image) | Baseline: A couple of women standing next to each other. Our PW: Two women standing next to each other holding wine glasses. Our CW: Two women drinking wine in a room. GT1: Two young women are sharing a bottle of wine. GT2: Two female friends posing with a bottle of wine. GT3: Two women posing for a photo with drinks in hand. |
(image) | Baseline: A group of people walking down a street. Our PW: A group of people standing in the street with an umbrella. Our CW: A group of people standing under an umbrella. GT1: Several people standing on a sidewalk under an umbrella. GT2: Some people standing on a dark street with an umbrella. GT3: Some people standing on a dark street with an umbrella. |
(image) | Baseline: A close up of a horse in a field. Our PW: A white horse standing in the grass in a field. Our CW: A white horse grazing in a field of grass. GT1: A horse eating grass in a green field. GT2: A while horse bending down eating grass. GT3: A tall black and white horse standing on a lush green field. |
(image) | Baseline: A group of people on skis in the snow. Our PW: A group of people riding skis down a snow covered slope. Our CW: Two men are skiing down a snow covered slope. GT1: Two cross country skiers heading onto the trail. GT2: Two guys cross country ski in a race. GT3: Skiers on their skis ride on the slope while others watch. |
Model | BLEU-1 | BLEU-4 | ROUGE-L | CIDEr-D |
---|---|---|---|---|
Baseline | 79.4 | 36.7 | 57.6 | 122.7 |
+self-att(Dec) | 80.0 | 37.6 | 58.0 | 124.7 |
+self-att(Enc+Dec) | 79.9 | 38.4 | 58.4 | 125.8 |
Full: PW | 80.4 | 39.1 | 58.7 | 127.6 |
Full: CW | 80.6 | 39.1 | 58.8 | 127.2 |
Image | Captions |
---|---|
(image) | Baseline: A couple of women standing next to each other. +self-att(Dec): A couple of women standing next to each other. +self-att(Enc+Dec): Two women are holding wine glasses in a room. Our PW: Two women standing next to each other holding wine glasses. Our CW: Two women drinking wine in a room. |
(image) | Baseline: A group of people walking down a street. +self-att(Dec): A group of people standing in the street. +self-att(Enc+Dec): A group of people standing with an umbrella. Our PW: A group of people standing in the street with an umbrella. Our CW: A group of people standing under an umbrella. |
(image) | Baseline: A close up of a horse in a field. +self-att(Dec): A horse standing in a field. +self-att(Enc+Dec): A horse in the grass in a field. Our PW: A white horse standing in the grass in a field. Our CW: A white horse grazing in a field of grass. |
(image) | Baseline: A group of people on skis in the snow. +self-att(Dec): A man riding skis in the snow. +self-att(Enc+Dec): A group of people skiing down a snow covered slope. Our PW: A group of people riding skis down a snow covered slope. Our CW: Two men are skiing down a snow covered slope. |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).