Generalized Image Captioning for Multilingual Support
Abstract
1. Introduction
- Generating image caption data requires a great deal of manual work, but no new data needs to be prepared when the domain object dictionary presented in this study is used.
- Various image captions can be generated from a single image.
- No new model needs to be trained when generating domain-specific image captions.
- We propose a filter captioning model that can generate various image captions; a minimal illustrative sketch of the underlying filtering idea follows this list.
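As a reading aid, the sketch below illustrates how domain-object-dictionary pre-filtering could work: object tags detected in an image are intersected with a per-domain dictionary before caption generation, so swapping dictionaries changes the focus of the caption without retraining. This is a minimal sketch under our own assumptions; the names `DOMAIN_DICTIONARIES`, `detect_object_tags`, and `generate_caption` are illustrative placeholders, not the paper's actual interface.

```python
# Illustrative sketch of domain-object-dictionary pre-filtering (not the paper's code).
# detect_object_tags() and generate_caption() stand in for an object detector and a
# caption decoder; only the filtering logic is shown here.

DOMAIN_DICTIONARIES = {
    # Example domain dictionaries; real dictionaries would be built per domain.
    "traffic": {"car", "bus", "train", "road", "street", "sign", "signal"},
    "weather": {"sky", "snow", "rain", "cloud", "sun"},
}

def filter_tags(detected_tags, domain):
    """Keep only the detected object tags that appear in the chosen domain dictionary."""
    allowed = DOMAIN_DICTIONARIES.get(domain, set())
    filtered = [t for t in detected_tags if t in allowed]
    # Fall back to the unfiltered tags if the domain filter removes everything.
    return filtered or list(detected_tags)

def caption_for_domain(image, domain, detect_object_tags, generate_caption):
    """Generate a domain-focused caption without retraining the caption model."""
    tags = detect_object_tags(image)          # e.g. ["car", "street", "sky", "person"]
    domain_tags = filter_tags(tags, domain)   # e.g. ["car", "street"] for "traffic"
    return generate_caption(image, domain_tags)
```

In this arrangement, supporting a new domain only means adding a new dictionary entry; the detector and the caption model remain unchanged.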
2. Related Work
2.1. Image Caption Dataset
2.2. Image Caption Model
2.3. OCR (Optical Character Recognition)
3. Filter Captioning Model Algorithm
3.1. Image Captioning Model by Language
3.2. Image and Natural Language Understanding Model
3.3. Image Caption Fine-Tuning
Algorithm 1: Image Caption Fine-Tuning
3.4. Image Caption Inference
Algorithm 2: Image Caption Inference
3.4.1. Domain Object Dictionary
3.4.2. Domain Object Pre-Filtering
4. Research Method
4.1. Dataset for Research Subjects
4.2. Creating Object Tags for Filter Captioning Fine-Tuning
4.3. Establishing a Domain Dictionary for Filter Captioning Inference
5. Results
5.1. Filter Captioning Training and Evaluation Metrics
5.2. Filter Captioning Model Training
5.3. Filter Captioning Model Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Filter | Caption (Image 1) | Caption (Image 2)
---|---|---
traffic | There are a few cars waiting for the signal. | A street with a lot of traffic during a snowstorm.
blind | Railways and roads are left and right, so be careful. | There is a road on the left, and it is snowing a lot.
weather | It is a bright day. | It is snowing a lot and piling up.
general | An electric train at an intersection with cars. | Cars drive down the street on a snowy day.
STEPS | BLEU_1 | BLEU_2 | BLEU_3 | METEOR | ROUGE_L | CIDEr |
---|---|---|---|---|---|---|
10,000 | 0.42127 | 0.20266 | 0.11601 | 0.12247 | 0.33375 | 0.09163 |
20,000 | 0.71462 | 0.52953 | 0.36954 | 0.23072 | 0.50887 | 0.89150 |
30,000 | 0.70895 | 0.52509 | 0.36691 | 0.23589 | 0.50220 | 0.91953 |
40,000 | 0.73438 | 0.55464 | 0.39340 | 0.24312 | 0.51699 | 0.96477 |
50,000 | 0.73607 | 0.55556 | 0.39715 | 0.24187 | 0.52158 | 0.97054 |
60,000 | 0.72510 | 0.54554 | 0.39294 | 0.23722 | 0.51872 | 0.95560 |
70,000 | 0.73662 | 0.55446 | 0.39710 | 0.24398 | 0.52163 | 0.98510 |
80,000 | 0.74010 | 0.56100 | 0.39824 | 0.24772 | 0.52673 | 0.99428 |
90,000 | 0.74544 | 0.56322 | 0.40071 | 0.24576 | 0.52677 | 0.99567 |
100,000 | 0.74129 | 0.56459 | 0.40614 | 0.24620 | 0.52770 | 1.00714 |
110,000 | 0.73404 | 0.56014 | 0.40337 | 0.24641 | 0.52487 | 1.00562 |
120,000 | 0.73852 | 0.56181 | 0.40515 | 0.24865 | 0.52758 | 1.01481 |
130,000 | 0.74709 | 0.56888 | 0.40597 | 0.24676 | 0.52776 | 1.00650 |
140,000 | 0.74031 | 0.56349 | 0.40255 | 0.25084 | 0.52789 | 1.01821 |
150,000 | 0.74121 | 0.56514 | 0.40694 | 0.24896 | 0.52990 | 1.00385 |
160,000 | 0.74501 | 0.56633 | 0.4044 | 0.24766 | 0.52776 | 1.00549 |
170,000 | 0.74086 | 0.56254 | 0.39975 | 0.24765 | 0.52614 | 0.98849 |
180,000 | 0.70593 | 0.53553 | 0.38207 | 0.23373 | 0.49929 | 0.95181 |
STEPS | BLEU_1 | BLEU_2 | BLEU_3 | METEOR | ROUGE_L | CIDEr |
---|---|---|---|---|---|---|
10,000 | 0.46965 | 0.26674 | 0.13260 | 0.11563 | 0.35089 | 0.06661 |
20,000 | 0.00011 | 0.00000 | 0.00000 | 0.00008 | 0.00015 | 0.00001 |
30,000 | 0.00000 | 0.00000 | 0.00000 | 0.00001 | 0.00000 | 0.00000 |
40,000 | 0.70185 | 0.51810 | 0.35942 | 0.22918 | 0.50381 | 0.88718 |
50,000 | 0.72191 | 0.54263 | 0.38456 | 0.23680 | 0.51336 | 0.94791 |
60,000 | 0.72714 | 0.54356 | 0.38166 | 0.23899 | 0.51698 | 0.95379 |
70,000 | 0.73415 | 0.55288 | 0.39198 | 0.24205 | 0.51970 | 0.97153 |
80,000 | 0.73601 | 0.55712 | 0.39863 | 0.24615 | 0.52486 | 0.98178 |
90,000 | 0.73091 | 0.55250 | 0.39219 | 0.24363 | 0.52348 | 0.98480 |
100,000 | 0.72987 | 0.54703 | 0.38805 | 0.24623 | 0.52281 | 0.98792 |
110,000 | 0.74525 | 0.56238 | 0.39909 | 0.24368 | 0.52491 | 0.98655 |
120,000 | 0.73675 | 0.55613 | 0.39652 | 0.24493 | 0.52313 | 0.99187 |
130,000 | 0.74340 | 0.56317 | 0.39923 | 0.24049 | 0.52366 | 0.98576 |
140,000 | 0.73526 | 0.55514 | 0.39534 | 0.24632 | 0.52502 | 0.99858 |
150,000 | 0.73647 | 0.56033 | 0.39929 | 0.24671 | 0.52459 | 1.00759 |
160,000 | 0.73844 | 0.55914 | 0.39964 | 0.24705 | 0.52253 | 1.00374 |
170,000 | 0.74422 | 0.56506 | 0.40350 | 0.24360 | 0.52643 | 1.00244 |
180,000 | 0.73687 | 0.55853 | 0.39854 | 0.24690 | 0.52613 | 1.00990 |
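The scores in the two tables above are standard COCO-style caption metrics. As an illustration only, and not the authors' evaluation code, BLEU, ROUGE-L, and CIDEr can be computed with the pycocoevalcap package roughly as follows; METEOR is omitted because its scorer requires a Java runtime, and the image id and captions below are placeholders rather than data from the paper.

```python
# Illustrative computation of COCO-style caption metrics with pycocoevalcap.
# In practice the scorers are run over the full test split, usually after
# PTBTokenizer preprocessing; a single placeholder example is shown here.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Ground-truth references and model outputs, keyed by image id.
gts = {"img_0": ["a bus is parked on the side of a street"]}
res = {"img_0": ["a bus parked on the side of the street"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # [BLEU_1 .. BLEU_4]
rouge_l, _ = Rouge().compute_score(gts, res)
cider, _ = Cider().compute_score(gts, res)         # meaningful only over a full corpus

print("BLEU_1..4:", bleu_scores)
print("ROUGE_L:", rouge_l)
print("CIDEr:", cider)
```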
| | | Default Model | Filter Captioning Model |
|---|---|---|---|
| Image caption | Left | a woman is standing in front of a bus | a bus is parked on the side of a street |
| | Middle | a motorcycle parked on the side of a street | a group of people walking down a street with a road. |
| | Right | a group of people standing next to a train station | a group of people walking down a street with a train |
| Changed object tags | Left | | sidewalk, bus, building, car, street, pole, sign |
| | Middle | | trees, tree, person, road, street, people, line, pole |
| | Right | | street, people, sign, wall, man, train, pole, person, line |
| Domain dictionary | | | bus, train, road, sidewalk, sign, person, building, pole, door, wall, man, people, cars, car, street, trees, line, tree |
Object Dictionary | Image Caption
---|---
- | 밤에 길을 따라 운전하는 교통 신호등 (a traffic light driving along a road at night)
보도 (sidewalk) or 도시 (city) | 밤에 도시 거리를 따라 운전하는 한 무리의 사람들 (a group of people driving along a city street at night)
도로 (road) | 자동차와 신호등이 있는 도시의 거리 (a city street with cars and traffic lights)
건물 (building) | 고층 건물들로 가득 찬 거리 (a street full of high-rise buildings)
Image Objects | 폴 (pole), 빛 (light), 도로 (road), 교통 (traffic), 불 (light), 건물 (building), 도시 (city), 거리 (street), 나무 (tree), 보도 (sidewalk), 숫자 (number), 하늘 (sky)