Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning
Abstract
1. Introduction
- We create a Transformer-based image captioning model that integrates external semantic knowledge representations of image objects into the Transformer encoder. This incorporation enhances the ability of the encoder and decoder Transformers to focus attention on relevant regions and to capture meaningful relationships between image objects throughout caption generation (a minimal illustrative sketch follows this list).
- We conduct a linguistic analysis of social words in the generated captions, offering valuable insights into the efficacy of the proposed model in vision and language tasks.
- We extend the applicability of the proposed model by generating descriptions for the MACE visual captioning dataset, a newly released archival dataset that contains significant historical videos and scenes.
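As a rough illustration of the first contribution, the following minimal PyTorch sketch shows one plausible way to fuse detected-object region features with external semantic (e.g., knowledge-graph) embeddings before a Transformer encoder. The layer sizes, the concatenation-based fusion, and all names here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SemanticFusionEncoder(nn.Module):
    """Toy sketch: fuse object region features with external semantic
    embeddings before a Transformer encoder. Dimensions and the
    concatenation-based fusion are assumptions, not the paper's design."""
    def __init__(self, region_dim=2048, sem_dim=300, d_model=512,
                 n_heads=8, n_layers=3):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)
        self.sem_proj = nn.Linear(sem_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, region_feats, sem_embeds):
        # region_feats: (B, N_objects, region_dim) from an object detector
        # sem_embeds:   (B, N_objects, sem_dim), e.g., knowledge-graph vectors
        r = self.region_proj(region_feats)
        s = self.sem_proj(sem_embeds)
        fused = self.fuse(torch.cat([r, s], dim=-1))
        return self.encoder(fused)  # (B, N_objects, d_model) for the decoder
```

A decoder Transformer would then cross-attend over this semantically enriched memory when generating each caption token.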
2. Background and Related Works
3. Model Architecture
3.1. Transformer Model for Image Captioning
3.2. Leveraging Knowledge Graphs
3.3. Transformer with Semantic-Based Model for Image Captioning
Algorithm 1: Caption Generating Procedure
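Algorithm 1 itself is not reproduced here; as a hedged illustration of what such a caption generation procedure typically looks like, the sketch below implements a generic greedy autoregressive decoding loop. The `model.encode`/`model.decode` interface, `bos_id`, `eos_id`, and `vocab` are assumed names for illustration, not the authors' specification.

```python
import torch

@torch.no_grad()
def generate_caption(model, image_features, vocab, bos_id, eos_id, max_len=20):
    """Hypothetical greedy decoding loop for an encoder-decoder
    captioning Transformer; interfaces are assumptions, not Algorithm 1."""
    memory = model.encode(image_features)        # encoder output (memory)
    tokens = [bos_id]
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)  # (1, t) partial caption
        logits = model.decode(inp, memory)       # (1, t, vocab_size)
        next_id = int(logits[0, -1].argmax())    # greedy choice of next word
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return " ".join(vocab[i] for i in tokens[1:])  # drop the BOS token
```

Beam search would replace the single greedy choice with a set of running hypotheses, but the overall encode-then-decode structure is the same.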
4. Experiments and Results
4.1. Datasets
4.2. Evaluation Metrics
4.3. Experiment Details
4.3.1. Quantitative Evaluation
4.3.2. Ablation Studies
4.3.3. Qualitative Evaluation
5. Discussion
6. Generalization
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Stangl, A.; Verma, N.; Fleischmann, K.R.; Morris, M.R.; Gurari, D. Going beyond one-size-fits-all image descriptions to satisfy the information wants of people who are blind or have low vision. In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility, Virtual, 18–22 October 2021; pp. 1–15. [Google Scholar]
- Jung, J.Y.; Steinberger, T.; Kim, J.; Ackerman, M.S. “So What? What’s That to Do with Me?” Expectations of People with Visual Impairments for Image Descriptions in Their Personal Photo Activities. In Proceedings of the Designing Interactive Systems Conference, Virtual, 13–17 June 2022; pp. 1893–1906. [Google Scholar]
- Yang, Y.; Yu, J.; Zhang, J.; Han, W.; Jiang, H.; Huang, Q. Joint embedding of deep visual and semantic features for medical image report generation. IEEE Trans. Multimed. 2021, 25, 167–178. [Google Scholar] [CrossRef]
- Ayesha, H.; Iqbal, S.; Tariq, M.; Abrar, M.; Sanaullah, M.; Abbas, I.; Rehman, A.; Niazi, M.F.K.; Hussain, S. Automatic medical image interpretation: State of the art and future directions. Pattern Recognit. 2021, 114, 107856. [Google Scholar] [CrossRef]
- Szafir, D.; Szafir, D.A. Connecting human-robot interaction and data visualization. In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, Boulder, CO, USA, 8–11 March 2021; pp. 281–292. [Google Scholar]
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
- Suresh, K.R.; Jarapala, A.; Sudeep, P. Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study. Circuits Syst. Signal Process. 2022, 41, 5719–5742. [Google Scholar] [CrossRef]
- He, S.; Liao, W.; Tavakoli, H.R.; Yang, M.; Rosenhahn, B.; Pugeault, N. Image captioning through image transformer. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
- Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
- Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10971–10980. [Google Scholar]
- Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.W.; Ji, R. Dual-level collaborative transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2286–2293. [Google Scholar]
- Hafeth, D.A.; Kollias, S.; Ghafoor, M. Semantic Representations with Attention Networks for Boosting Image Captioning. IEEE Access 2023, 11, 40230–40239. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10578–10587. [Google Scholar]
- Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1655–1663. [Google Scholar]
- Zhang, J.; Fang, Z.; Wang, Z. Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning. Appl. Intell. 2023, 53, 13398–13414. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Luo, J.; Li, Y.; Pan, Y.; Yao, T.; Feng, J.; Chao, H.; Mei, T. Semantic-conditional diffusion networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23359–23368. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- MACE—The Media Archive for Central England. Available online: https://www.macearchive.org/ (accessed on 10 November 2019).
- Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
- Zhang, Z.; Wu, Q.; Wang, Y.; Chen, F. Exploring region relationships implicitly: Image captioning with visual relationship attention. Image Vis. Comput. 2021, 109, 104146. [Google Scholar] [CrossRef]
- Zhong, X.; Nie, G.; Huang, W.; Liu, W.; Ma, B.; Lin, C.W. Attention-guided image captioning with adaptive global and local feature fusion. J. Vis. Commun. Image Represent. 2021, 78, 103138. [Google Scholar] [CrossRef]
- Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1473–1482. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
- Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; Mei, T. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4894–4902. [Google Scholar]
- Li, N.; Chen, Z. Image Captioning with Visual-Semantic LSTM. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018; pp. 793–799. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
- Yang, X.; Tang, K.; Zhang, H.; Cai, J. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10685–10694. [Google Scholar]
- Guo, L.; Liu, J.; Tang, J.; Li, J.; Luo, W.; Lu, H. Aligning linguistic words and visual semantic units for image captioning. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 765–773. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, VIC, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image captioning: Transforming objects into words. Adv. Neural Inf. Process. Syst. 2019, 32, 11137–11147. [Google Scholar]
- Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8928–8937. [Google Scholar]
- Speer, R.; Chin, J.; Havasi, C. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Li, J.; Yao, P.; Guo, L.; Zhang, W. Boosted Transformer for Image Captioning. Appl. Sci. 2019, 9, 3260. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
- Tian, C.; Tian, M.; Jiang, M.; Liu, H.; Deng, D. How much do cross-modal related semantics benefit image captioning by weighting attributes and re-ranking sentences? Pattern Recognit. Lett. 2019, 125, 639–645. [Google Scholar] [CrossRef]
- Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [Google Scholar] [CrossRef]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 25 July 2004; pp. 74–81. [Google Scholar]
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
- González-Chávez, O.; Ruiz, G.; Moctezuma, D.; Ramirez-delReal, T. Are metrics measuring what they should? An evaluation of Image Captioning task metrics. Signal Process. Image Commun. 2024, 120, 117071. [Google Scholar] [CrossRef]
- Tausczik, Y.R.; Pennebaker, J.W. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 2010, 29, 24–54. [Google Scholar] [CrossRef]
- Pennebaker, J.W.; Booth, R.J.; Francis, M.E. Linguistic Inquiry and Word Count (LIWC2007): A Text Analysis Program; LIWC.net: Austin, TX, USA, 2007. [Google Scholar]
- Zhang, J.; Huang, C.; Chow, M.Y.; Li, X.; Tian, J.; Luo, H.; Yin, S. A Data-model Interactive Remaining Useful Life Prediction Approach of Lithium-ion Batteries Based on PF-BiGRU-TSAM. IEEE Trans. Ind. Inform. 2023, 20, 1144–1154. [Google Scholar] [CrossRef]
- Zhang, J.; Tian, J.; Alcaide, A.M.; Leon, J.I.; Vazquez, S.; Franquelo, L.G.; Luo, H.; Yin, S. Lifetime Extension Approach Based on Levenberg-Marquardt Neural Network and Power Routing of DC-DC Converters. IEEE Trans. Power Electron. 2023, 38, 10280–10291. [Google Scholar] [CrossRef]
Method | BLEU@1 | BLEU@4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|
Base | 79.4 | 35.8 | 28.0 | 58.0 | 118.8 |
Semantic Transformer [15] | 78.6 | 36.0 | 27.6 | 57.7 | 120.9 |
IMFRE-Transformer [20] | 77.1 | 36.4 | 28.3 | 57.1 | 117.1 |
Image Transformer [8] | 80.8 | 39.5 | 29.1 | 59.0 | 130.8 |
SCD-Net [23] | 79.0 | 37.3 | 28.1 | 58.0 | 118.0 |
Ours | 80.0 | 37.7 | 28.9 | 58.5 | 132.0 |
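For context on how scores of this kind are usually obtained, the sketch below computes BLEU, ROUGE-L, and CIDEr with the COCO caption evaluation toolkit. It assumes the `pycocoevalcap` package is installed and that references and candidates are pre-tokenized strings; the toy inputs are illustrative and this may differ from the authors' exact evaluation pipeline.

```python
# Hedged sketch: scoring generated captions against references with the
# pycocoevalcap toolkit (assumed installed; inputs are illustrative only).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map image id -> list of tokenized caption strings.
references = {"img1": ["a man riding a horse on a beach",
                       "a person rides a horse near the ocean"]}
candidates = {"img1": ["a man riding a horse on the beach"]}

bleu_scores, _ = Bleu(4).compute_score(references, candidates)   # BLEU@1..4
rouge_score, _ = Rouge().compute_score(references, candidates)   # ROUGE-L
cider_score, _ = Cider().compute_score(references, candidates)   # CIDEr

print("BLEU@1-4:", [round(s, 3) for s in bleu_scores])
print("ROUGE-L:", round(rouge_score, 3))
print("CIDEr:", round(cider_score, 3))
```

METEOR can be added with the same `compute_score` interface but requires a Java runtime, so it is omitted from this minimal sketch.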
Encoder | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|
ResNet-18 | 74.88 | 57.17 | 54.30 | 29.85 | 21.10 | 47.29 | 111.70 |
ResNet-50 | 77.97 | 59.20 | 57.74 | 34.63 | 24.83 | 50.10 | 121.22 |
ResNet-101 | 80.04 | 60.29 | 60.01 | 37.70 | 28.99 | 58.50 | 132.04 |
Encoder | ΔBLEU@1 | ΔBLEU@2 | ΔBLEU@3 | ΔBLEU@4 | ΔMETEOR | ΔROUGE-L | ΔCIDEr |
---|---|---|---|---|---|---|---|
ResNet-18 | 0.650 | 0.760 | 0.836 | 0.894 | 0.611 | 0.707 | 0.711 |
ResNet-50 | 0.346 | 0.397 | 0.429 | 0.461 | 0.283 | 0.208 | 0.244 |
ResNet-101 | 1.081 | 1.198 | 1.136 | 1.302 | 0.774 | 0.740 | 0.114 |
Encoder | Head Number | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|---|
ResNet-18 | 4 | 74.32 | 60.99 | 59.80 | 26.59 | 18.89 | 41.00 | 98.20 |
ResNet-18 | 8 | 74.88 | 57.17 | 54.30 | 29.85 | 21.10 | 47.29 | 111.70 |
ResNet-18 | 16 | 76.14 | 63.09 | 55.36 | 29.93 | 23.30 | 47.18 | 122.01 |
ResNet-50 | 4 | 75.29 | 59.44 | 58.00 | 30.91 | 21.28 | 59.02 | 107.90 |
ResNet-50 | 8 | 77.97 | 59.20 | 57.74 | 34.63 | 24.83 | 50.10 | 121.22 |
ResNet-50 | 16 | 82.10 | 61.04 | 58.37 | 34.13 | 26.85 | 54.12 | 123.86 |
ResNet-101 | 4 | 75.14 | 62.04 | 61.00 | 28.73 | 25.11 | 56.39 | 115.60 |
ResNet-101 | 8 * | 80.04 | 60.29 | 60.01 | 37.70 | 28.99 | 58.50 | 132.04 |
ResNet-101 | 16 | 79.13 | 71.94 | 71.03 | 42.97 | 31.00 | 60.75 | 134.82 |
Encoder | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|---|---|
ResNet-18 | 74.38 | 71.54 | 69.73 | 68.51 | 46.19 | 71.89 | 62.24 |
ResNet-50 | 81.30 | 79.40 | 78.17 | 77.24 | 53.21 | 79.42 | 71.16 |
ResNet-101 | 75.50 | 72.74 | 71.10 | 69.93 | 49.81 | 76.95 | 76.01 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Abdal Hafeth, D.; Kollias, S. Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning. Sensors 2024, 24, 1796. https://doi.org/10.3390/s24061796