Novel Advance Image Caption Generation Utilizing Vision Transformer and Generative Adversarial Networks
Abstract
1. Introduction
- i. By integrating Vision Transformers (ViTs) with Generative Adversarial Networks (GANs), this study improves the quality of generated captions by better capturing complex visual information. ViTs enable a more effective understanding of image context, while GANs refine the naturalness and coherence of the generated text, making captions more descriptive and accurate (an illustrative sketch of this encoder–decoder–discriminator pattern follows this list).
- ii. GANs reduce the need for extensive labeled datasets by learning to generate realistic captions through adversarial training. This can allow the model to perform well even with limited labeled data, making it beneficial in applications where labeled data are scarce or expensive to obtain.
- iii. The use of Vision Transformers makes the model more adaptable across various image types and domains, as ViTs generalize better over features in diverse datasets. This enables the model to generate captions that are relevant across a wide range of contexts, enhancing usability for tasks such as image search, accessibility, and visual content description in different environments.
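The exact architecture is specified in Section 3.2; purely as an illustration of the encoder–decoder–discriminator pattern described in these contributions, a minimal PyTorch-style sketch is given below. The module names, dimensions, and the use of a pretrained torchvision ViT-B/16 backbone are assumptions made for readability, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights  # assumed pretrained backbone


class CaptionGenerator(nn.Module):
    """ViT image encoder feeding a Transformer decoder that emits caption token logits."""

    def __init__(self, vocab_size: int, d_model: int = 768, num_layers: int = 4):
        super().__init__()
        self.vit = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()                 # keep the 768-d [CLS] embedding
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        memory = self.vit(images).unsqueeze(1)         # (B, 1, d_model) image context
        tgt = self.embed(tokens)                       # (B, T, d_model) caption prefix
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device), 1)
        return self.out(self.decoder(tgt, memory, tgt_mask=causal))   # (B, T, vocab)


class CaptionDiscriminator(nn.Module):
    """Scores how realistic a caption is, conditioned on the image embedding."""

    def __init__(self, vocab_size: int, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.score = nn.Linear(2 * d_model, 1)

    def forward(self, image_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(self.embed(tokens))            # summarize the caption sequence
        return self.score(torch.cat([image_feat, h[-1]], dim=-1))     # real/fake logit
```

In the full model, the discriminator's judgment on (image, caption) pairs is fed back to the generator through adversarial training (Section 3.3), which is what pushes the generated captions toward more natural and coherent phrasing.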
2. Related Work
- i. Object Hallucination: Similar to other deep learning models, Transformer-based and GAN-based models can sometimes generate captions that include objects that are not present in the image.
- ii. Missing Context: These models often struggle to understand the broader context of the image, leading to captions that may be technically correct but miss the overall meaning of the image.
- iii. Exposure Bias: Most existing models, including those based on Transformers and GANs, suffer from exposure bias: they are trained on ground-truth caption prefixes but must condition on their own past predictions when generating captions at inference time, so early mistakes compound.
- iv. Data Requirement: Like other deep learning models, Transformer-based and GAN-based models require large amounts of labeled data for training.
- v. Computational Resources: Training these models can be computationally intensive and requires powerful hardware.
- vi. Model Interpretability: These models, like many deep learning models, are often referred to as “black boxes” because it can be challenging to understand how they make their predictions.
3. Methodology
3.1. Dataset
3.2. Model Architecture
3.2.1. Generator Architecture
3.2.2. Discriminator Architecture
3.3. Training Procedure
3.3.1. Loss Functions
3.3.2. Optimization Strategy
- 1. RMSProp: The RMSProp optimizer adjusts learning rates based on the magnitude of recent gradients, similar to Adam [31]; it updates the parameters θ by normalizing each gradient with a running average of its squared magnitude (the standard update rules are restated after this list).
- 2. Adagrad: Adagrad's adaptive learning-rate mechanism accumulates the squared gradients of each parameter and scales the update of θ accordingly [32].
- 3. SGD with Momentum: Standard Stochastic Gradient Descent (SGD) with momentum updates θ using a velocity term that accumulates past gradients.
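For reference, the standard forms of the three update rules are given below (learning rate $\eta$, gradient $g_t = \nabla_\theta \mathcal{L}(\theta_t)$, decay factors $\beta$ and $\gamma$, small constant $\epsilon$); the paper's own notation may differ slightly.

```latex
\begin{aligned}
\textbf{RMSProp:}\quad        & E[g^2]_t = \beta\,E[g^2]_{t-1} + (1-\beta)\,g_t^2,
  & \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t}+\epsilon}\,g_t \\
\textbf{Adagrad:}\quad        & G_t = G_{t-1} + g_t^2,
  & \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t}+\epsilon}\,g_t \\
\textbf{SGD with momentum:}\quad & v_t = \gamma\,v_{t-1} + \eta\,g_t,
  & \theta_{t+1} &= \theta_t - v_t
\end{aligned}
```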
3.3.3. Mini-Batch Training
3.4. Evaluation Metrics
- BLEU Score: The primary metric used to assess the quality of the generated captions is the BLEU score. It measures the overlap of n-grams (contiguous sequences of n tokens) between the machine-generated captions and the human-written (ground truth) captions. A higher BLEU score indicates greater overlap, and thus greater accuracy and semantic similarity in the generated captions. The BLEU score therefore offers an objective, quantitative assessment of the model's ability to produce linguistically correct and contextually appropriate captions [29].
- ROUGE Score: In addition to the BLEU score, the ROUGE score is employed to give a more comprehensive evaluation of the produced captions. ROUGE also quantifies the n-gram overlap between the generated captions and the reference (ground truth) captions, but it emphasizes recall, whereas BLEU emphasizes precision, adding a complementary dimension for assessing the relevance and quality of the generated captions. Employing both a precision-oriented metric (BLEU) and a recall-oriented metric (ROUGE) ensures a more thorough and balanced assessment of the model's captioning ability [34,36].
- CIDEr Metric: The CIDEr (Consensus-based Image Description Evaluation) metric further strengthens the evaluation of the produced captions. CIDEr measures the consensus between the produced captions and the reference captions by computing the cosine similarity between their TF-IDF vectors. Complementing the BLEU and ROUGE scores, CIDEr is a valuable tool for assessing the uniqueness and variety of the generated captions, providing a broader analysis of caption quality [35].
- By using these metrics, as summarized in Table 3, the evaluation methodology ensures a fair and nuanced assessment of the generated captions, taking into account both grammatical correctness and contextual relevance; a minimal scoring example follows.
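To make the precision/recall distinction above concrete, the snippet below scores a single candidate caption against one reference using off-the-shelf implementations (nltk for BLEU, the rouge-score package for ROUGE-L); the example strings are invented for illustration, and this is not the authors' evaluation pipeline.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk
from rouge_score import rouge_scorer                                    # pip install rouge-score

reference = "a man riding a wave on top of a surfboard"   # hypothetical ground-truth caption
candidate = "a man riding a surfboard on a wave"          # hypothetical generated caption

# BLEU: n-gram precision of the candidate against the reference(s).
bleu4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),                     # equal weight for 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,       # avoids zero scores on short texts
)

# ROUGE-L: recall-oriented, based on the longest common subsequence.
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"]

print(f"BLEU-4:  {bleu4:.3f}")
print(f"ROUGE-L: precision={rougeL.precision:.3f}, recall={rougeL.recall:.3f}, F1={rougeL.fmeasure:.3f}")
# CIDEr needs a corpus of references to build its TF-IDF weights and is usually computed
# with the pycocoevalcap toolkit rather than per sentence, so it is omitted here.
```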
4. Experimental Setup
4.1. Initial Setup
4.1.1. Data Preparation
4.1.2. Dataset Splitting
4.1.3. Hyperparameter Tuning and Optimization Strategy
4.1.4. Hardware Configuration and Experimentation
4.2. Implementation Details
4.2.1. Data Loader
4.2.2. Training of the Model
5. Results and Discussion
5.1. Implementation Results
5.2. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ghandi, T.; Pourreza, H.; Mahyar, H. Deep learning approaches on image captioning: A review. ACM Comput. Surv. 2023, 56, 1–39.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680.
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223.
- Hu, S.; Shen, Y.; Wang, S.; Lei, B. Brain MR to PET Synthesis via Bidirectional Generative Adversarial Network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020, Proceedings of the 23rd International Conference, Lima, Peru, 4–8 October 2020; Part II; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 698–707.
- Rinaldi, A.M.; Russo, C.; Tommasino, C. Automatic image captioning combining natural language processing and deep neural networks. Results Eng. 2023, 18, 101107.
- van der Lee, C.; Krahmer, E.; Wubben, S. Automated Learning of Templates for Data-to-Text Generation: Comparing Rule-Based, Statistical, and Neural Methods. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg, The Netherlands, 5–8 November 2018; pp. 35–45.
- Hill, T.; Lewicki, P. Statistics: Methods and Applications: A Comprehensive Reference for Science, Industry, and Data Mining; StatSoft, Inc.: Tulsa, OK, USA, 2006.
- NIST/SEMATECH. E-Handbook of Statistical Methods; NIST/SEMATECH: Gaithersburg, MD, USA, 2012.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- He, S.; Liao, W.; Tavakoli, H.R.; Yang, M.; Rosenhahn, B.; Pugeault, N. Image Captioning Through Image Transformer. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
- Liu, W.; Chen, S.; Guo, L.; Zhu, X.; Liu, J. CPTR: Full transformer network for image captioning. arXiv 2021, arXiv:2101.10804.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017.
- Jolicoeur-Martineau, A. The relativistic discriminator: A key element missing from standard GAN. arXiv 2018, arXiv:1807.00734.
- Chen, C.; Mu, S.; Xiao, W.; Ye, Z.; Wu, L.; Ju, Q. Improving Image Captioning with Conditional Generative Adversarial Nets. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8142–8150.
- Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H.; Bennamoun, M. Text to image synthesis for improved image captioning. IEEE Access 2021, 9, 64918–64928.
- Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv 2016, arXiv:1605.09782.
- Mishra, S.; Seth, S.; Jain, S.; Pant, V.; Parikh, J.; Jain, R.; Islam, S.M. Image Caption Generation using Vision Transformer and GPT Architecture. In Proceedings of the 2024 2nd International Conference on Advancement in Computation & Computer Technologies (InCACCT), Gharuan, India, 2–3 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
- Sharma, H.; Srivastava, S. Graph neural network-based visual relationship and multilevel attention for image captioning. J. Electron. Imaging 2022, 31, 053022.
- Ondeng, O.; Ouma, H.; Akuon, P. A review of transformer-based approaches for image captioning. Appl. Sci. 2023, 13, 11103.
- Kolla, T.; Vashisth, H.K.; Kaur, M. Attention Unveiled: Revolutionizing Image Captioning through Visual Attention. In Proceedings of the 2023 Global Conference on Information Technologies and Communications (GCITC), Bengaluru, India, 1–3 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7.
- Zhang, H.; Qu, W.; Long, H.; Chen, M. The Intelligent Advertising Image Generation Using Generative Adversarial Networks and Vision Transformer: A Novel Approach in Digital Marketing. J. Organ. End User Comput. (JOEUC) 2024, 36, 1–26.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part V; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78.
- Lala, C.; Madhyastha, P.S.; Scarton, C.; Specia, L. Sheffield submissions for WMT18 multimodal translation shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 624–631.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
- Oluborode, K.; Kadams, A.; Mohammed, U. An Intelligent Image Caption Generator Model Using Deep Learning. Int. J. Dev. Math. (IJDM) 2024, 1, 162–173.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Vaswani, A. Attention Is All You Need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30.
- Papineni, K. BLEU: A Method for Automatic Evaluation of MT; Research Report, Computer Science RC22176 (W0109-022); IBM T. J. Watson Research Center: Yorktown Heights, NY, USA, 2001.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Hinton, G.; Srivastava, N.; Swersky, K. Neural networks for machine learning. Coursera Video Lect. 2012, 264, 2146–2153.
- Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
- Brownlee, J. A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size; Machine Learning Mastery: San Juan, PR, USA, 2019.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
- Vidyabharathi, D.; Mohanraj, V.; Kumar, J.S.; Suresh, Y. Achieving generalization of deep learning models in a quick way by adapting T-HTR learning rate scheduler. Pers. Ubiquitous Comput. 2023, 27, 1335–1353.
- Ayinde, B.O.; Nishihama, K.; Zurada, J.M. Diversity Regularized Adversarial Deep Learning. In Artificial Intelligence Applications and Innovations, Proceedings of the 15th IFIP WG 12.5 International Conference, AIAI 2019, Hersonissos, Crete, Greece, 24–26 May 2019; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 292–306.
- Santiesteban, S.S.; Atito, S.; Awais, M.; Song, Y.Z.; Kittler, J. Improved Image Captioning Via Knowledge Graph-Augmented Models. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4290–4294.
Authors | Models | Work | Limitations |
---|---|---|---|
van der Lee et al. (2018) [6] | Template-based | Template-based image captioning involves creating predefined structures or patterns for captions and filling in details based on image content. While simple and interpretable, it may lack adaptability for diverse images. | Template-based image captioning has limitations, including inflexibility with diverse images, a lack of context understanding, dependence on predefined patterns, an inability to capture fine details, and challenges with ambiguity and unstructured data. |
Hill et al. (2006) [7] | Statistical Model | Statistical models for image captioning operate by initially extracting relevant features from images through statistical techniques or handcrafted descriptors. These extracted features serve as input to models employing statistical methods such as n-grams or Hidden Markov Models for language modeling. | One of their notable limitations lies in handling ambiguity, leading to the generation of less contextually rich and sometimes generic captions. Despite these challenges, statistical models excel in feature extraction and the establishment of mappings between image features and textual descriptions. |
He et al. (2020) [10] | Transformers | The Transformer processes the extracted image features to generate captions sequentially. The self-attention mechanism in Transformers allows them to focus on relevant parts of the image, facilitating the generation of contextually rich and detailed captions. | Transformers might struggle with handling very long sequences of data due to their self-attention mechanism, leading to increased processing times and memory constraints. Fine-tuning large pre-trained models for specific image-captioning tasks can also be challenging, requiring substantial computational resources and expertise. |
He et al. (2020) [10] | CNN-Transformer [6] | This approach combines properties of both the Transformer and the CNN: a convolutional neural network (CNN) extracts image features, and an attention-based encoder–decoder Transformer model generates the captions. The attention mechanism allows the model to focus on different parts of the image while generating each word of the caption. | Their complexity often leads to computational expense during training and inference, requiring substantial resources. Transformers may not inherently capture long-range dependencies in image data, and interpreting the interactions between convolutional and Transformer layers can be challenging. |
Liu et al. (2021) [11] | End-to-End Transformer | A special token and the extracted visual features are fed into the Transformer-based language model. The Transformer encoder processes this input, capturing contextual information and relationships between tokens. The output of the encoder initializes the decoder, which generates the caption word by word based on the context encoded by the encoder and the previously generated words. Once the generated token IDs have been decoded into human-readable words, post-processing is applied to refine the caption. | The end-to-end Transformer-based architecture for image captioning has a few drawbacks. It relies largely on pre-trained convolutional neural networks (CNNs) for feature extraction, limiting its ability to adapt to various image datasets or capture fine-grained visual features. Additionally, the Transformer architecture may struggle to accurately express long-range dependencies in image data, as it was originally built for sequential data such as text. Interpreting the interactions between the visual and textual processing components is difficult because of the complexity introduced by integrating CNNs and Transformers. |
Goodfellow et al. (2014) [2] | Generative Adversarial Networks (GANs) | Combining generative models with adversarial training. In this framework, a generator network produces captions, and a discriminator network evaluates the quality of the generated captions. Through adversarial training, the generator refines its ability to produce more realistic and contextually relevant captions. GANs leverage a feedback loop between the generator and discriminator, iteratively improving the captioning quality. | Generative Adversarial Networks (GANs) in image captioning face several limitations. One significant challenge is the potential for mode collapse, where the generator produces limited and repetitive captions, lacking diversity. GANs are also known for training instability, requiring careful hyperparameter tuning and regularization techniques to achieve reliable results. |
Jolicoeur-Martineau (2018) [13] | RAGAN (Residual Attention Generative Adversarial Network) | The Residual Attention Generative Adversarial Network (RAGAN) aims to produce high-quality captions for images. On top of a Generative Adversarial Network (GAN) foundation, it applies an attention-based residual learning technique. By using residual connections to preserve the original input and concentrating on the most relevant portions of the image, this method improves the diversity and authenticity of the generated image captions. | Training instability is one of the most common issues in GANs (including RAGAN): convergence is difficult to achieve and training is sensitive to hyperparameter choices. The model can also experience mode collapse, particularly when dealing with large datasets. The attention mechanisms in RAGAN add computational complexity, which can lead to longer training times and higher resource demands. |
Donahue et al. (2016) [16] | BraIN (Bidirectional Adversarial Network) | The Generative Adversarial Network (GAN) architecture is expanded in a Bidirectional Generative Adversarial Network (BraIN) by adding an encoder network alongside the generator and discriminator. | One common challenge is mode collapse, where the generator learns to produce a limited variety of samples, ignoring the diversity of the data distribution. Training a BraIN can also be unstable, as finding the right balance between the generator, discriminator, and encoder is difficult. |
Wang and Cook (2020) [11] | Bidirectional Generative Adversarial Network | The generator in a BraIN uses random noise as input to create synthetic data samples, and the discriminator separates the artificial samples produced by the generator from the real data samples in the training set. | If the encoder fails to learn a meaningful latent space, it cannot capture important features of the data distribution, which results in low-quality and less diverse generated samples. |
Mishra et al. (2024) [17] | ViT + GPT-2 | A novel ViT-GPT-2 model for image captioning, utilizing Vision Transformer as the encoder and GPT-2 as the decoder. | Caption accuracy issues for complex visuals. Need for addressing existing challenges in image captioning. |
Zhang et al. (2024) [21] | VGG + SeqGAN + GA | Focuses on advertising image generation using a framework that integrates GANs and Vision Transformer models, enhancing the effectiveness and attractiveness of advertising content, rather than specifically addressing image caption generation. | Existing methods struggle with diverse advertising content demands. Need for innovative algorithms to improve generation outcomes. |
Kolla et al. (2023) [20] | RLHF + GANs + SCST | Utilizing visual attention, specifically employing Transformers and GANs. These techniques enhance caption quality by leveraging competition between generator and discriminator networks, improving relevance and accuracy in generated textual descriptions. | Although visual attention models aim to enhance the understanding of image content, they may still struggle with nuanced context or abstract concepts. This limitation can result in captions that fail to capture the full essence of the image, particularly in complex scenes. |
Dataset | Description | Advantages |
---|---|---|
COCO [22] | Contains over 200,000 images of everyday scenes and objects. | Large, diverse dataset. |
Flickr30k [23] | Consists of 30,000 images with five captions per image from Flickr. | Diverse images with multiple captions, suitable for real-world scenarios. |
Multi30k [24] | Includes images and captions in multiple languages, facilitating cross-lingual evaluation. | Multilingual support for exploring model performance across languages. |
Visual Genome [25] | Annotated with detailed scene graphs providing rich contextual information. | Enables understanding of semantic structure in visual content. |
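Section 4.2.1 refers to a data loader over image–caption pairs such as these. A minimal PyTorch sketch for the COCO captions split is shown below; the directory paths are placeholders, the one-caption-per-image sampling strategy is an assumption, and tokenization is deliberately left to downstream code rather than reproducing the authors' preprocessing.

```python
import random
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CocoCaptions   # requires the pycocotools package

# Placeholder paths; the actual preprocessing is described in Section 4.1.1.
IMAGE_DIR = "data/coco/train2017"
ANN_FILE = "data/coco/annotations/captions_train2017.json"

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),               # ViT-B/16 expects 224x224 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Each dataset item is (image tensor, list of roughly five reference captions).
dataset = CocoCaptions(root=IMAGE_DIR, annFile=ANN_FILE, transform=preprocess)

def collate(batch):
    """Stack images and pick one reference caption per image; tokenize downstream."""
    images, captions = zip(*batch)
    texts = [random.choice(caps) for caps in captions]
    return torch.stack(images), texts

loader = DataLoader(dataset, batch_size=128, shuffle=True, collate_fn=collate)
```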
Metric | Focus | Advantages | Disadvantages |
---|---|---|---|
BLEU Score [29] | Precision | Objective, Quantifiable | Insensitive to Paraphrasing |
ROUGE Score [34] | Recall | Comprehensive | Computational Complexity |
CIDEr Metric [35] | Consensus | Captures Diversity | Sensitive to Vocabulary |
Parameters | Value |
---|---|
Learning Rate | 1 × 10⁻⁵
Batch Size | 128
Number of Epochs | 100
Optimizer | Adam |
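For concreteness, the snippet below wires the hyperparameters tabulated above into an adversarial training skeleton. The `generator`, `discriminator`, `loader`, and `tokenize` objects are the hypothetical pieces from the earlier sketches, and the adversarial feedback to the generator is omitted because sampling discrete tokens is not differentiable without policy gradients or Gumbel-softmax; treat this as a rough sketch, not the authors' training procedure.

```python
import torch
import torch.nn.functional as F

LEARNING_RATE = 1e-5   # value taken from the hyperparameter table
NUM_EPOCHS = 100       # batch size 128 is set in the DataLoader sketch

# Separate Adam optimizers for the two networks, as is usual in adversarial training.
gen_opt = torch.optim.Adam(generator.parameters(), lr=LEARNING_RATE)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=LEARNING_RATE)
bce = torch.nn.BCEWithLogitsLoss()

for epoch in range(NUM_EPOCHS):
    for images, texts in loader:
        tokens = tokenize(texts)                  # hypothetical tokenizer -> (B, T) ids

        # Discriminator step: real captions should score high, generated ones low.
        with torch.no_grad():
            img_feat = generator.vit(images)      # shared image embedding
            fake = generator(images, tokens).argmax(dim=-1)   # greedy, teacher-forced sample
        d_real = discriminator(img_feat, tokens)
        d_fake = discriminator(img_feat, fake)
        d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
        disc_opt.zero_grad()
        d_loss.backward()
        disc_opt.step()

        # Generator step: teacher-forced cross-entropy against the reference caption.
        # (Adversarial feedback to the generator needs policy gradients or Gumbel-softmax,
        # because sampling discrete tokens is not differentiable; omitted here.)
        logits = generator(images, tokens[:, :-1])
        g_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 tokens[:, 1:].reshape(-1))
        gen_opt.zero_grad()
        g_loss.backward()
        gen_opt.step()
```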