Semi-Supervised Implicit Augmentation for Data-Scarce VQA
Abstract
1. Introduction
- We introduce SemIAug, a non-generative and dataset-agnostic data augmentation approach that performs augmentation using only the images and questions already present within a dataset;
- We propose a simple and effective technique that employs a VLM to match existing questions with new images and reuses the same VLM for answering, which improves VQA performance on datasets with relatively few question-answer pairs, i.e., data-scarce datasets;
- We propose a computationally efficient image-question matching strategy that extracts image and question features using frozen VLMs (a minimal sketch of this matching step is given after this list).
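To make the matching step concrete, the following is a minimal sketch of threshold-based image-question matching over pre-extracted features. The function name, the cosine-similarity criterion, the 0.8 threshold, and the cap of three questions per image are illustrative assumptions rather than the exact SemIAug procedure; in the paper, frozen BLIP/BLIP-2 retrieval encoders would supply the image and question features.

```python
import torch
import torch.nn.functional as F

def match_questions_to_images(image_feats: torch.Tensor,
                              question_feats: torch.Tensor,
                              threshold: float = 0.8,
                              max_per_image: int = 3) -> dict:
    """Assign existing questions to new images by feature similarity.

    image_feats:    (N_img, D) features from a frozen VLM image encoder.
    question_feats: (N_q, D)   features from a frozen VLM text encoder.
    Returns {image_index: [question indices]} for pairs whose cosine
    similarity exceeds `threshold`, keeping at most `max_per_image`
    questions per image (the per-image multiplicity).
    """
    # L2-normalise so the dot product equals cosine similarity.
    img = F.normalize(image_feats, dim=-1)
    qst = F.normalize(question_feats, dim=-1)
    sim = img @ qst.T                      # (N_img, N_q) similarity matrix

    matches = {}
    for i in range(sim.size(0)):
        scores, idx = sim[i].sort(descending=True)
        keep = idx[scores > threshold][:max_per_image]
        matches[i] = keep.tolist()
    return matches
```

For example, `match_questions_to_images(torch.randn(4, 256), torch.randn(100, 256))` returns, for each of the four images, the indices of up to three sufficiently similar candidate questions.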
2. Related Works
3. Methodology
3.1. Image-Question Matching
3.1.1. Noun-Based Filtering
3.1.2. Segregating Rephrased and Diverse Questions
3.1.3. Handling Multiplicity
3.2. Answering Model
4. Experimental Results
4.1. Quantitative Analysis
4.2. Qualitative Analysis
4.3. Ablation Study
4.4. Qualitative Visual Results
5. Discussion
6. Limitations and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1. Dynamic Number of Question Augmentation
Appendix A.2. Pretrained and Finetuned Model Checkpoints
Appendix A.3. More Visual Results
Appendix A.4. Results of Augmentation on VQAv2-Sampled Dataset
Dataset | Type | Answering Setup | Ques. Augmentation | Ques. Count | Multiplicity | Accuracy (%) | Gain (%)
---|---|---|---|---|---|---|---
VQAv2-sampled | original | - | - | 16,000 | − | |
 | augmented | semi-supervised | dynamic | 23,817 | | | −2.06
 | augmented | semi-supervised | fixed | 32,000 | | | −0.26
 | augmented | finetuned BLIP VQA | dynamic | 23,817 | | | +2.35
 | augmented | finetuned BLIP VQA | fixed | 32,000 | | | +3.77
Dataset | Type | Answering Setup | Ques. Augmentation | Ques. Count | Multiplicity | Accuracy (%) | Gain (%)
---|---|---|---|---|---|---|---
VQAv2-sampled | original | - | - | 16,000 | − | |
 | augmented | semi-supervised | dynamic | 23,817 | | | +0.17
 | augmented | semi-supervised | fixed | 32,000 | | | +0.12
 | augmented | finetuned BLIP VQA | dynamic | 23,817 | | | +0.21
 | augmented | finetuned BLIP VQA | fixed | 32,000 | | | +0.2
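As a rough illustration of the answering setups compared in the tables above, the snippet below shows how a frozen BLIP VQA model could produce pseudo-answers for the augmented image-question pairs. It is a sketch assuming the Hugging Face `transformers` checkpoint `Salesforce/blip-vqa-base`; the paper's own finetuned checkpoints and any answer filtering are not reproduced here.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Publicly available BLIP VQA checkpoint (stand-in for the paper's finetuned model).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").eval()

@torch.no_grad()
def pseudo_answer(image_path: str, question: str) -> str:
    """Answer one matched (image, question) pair with the frozen VQA model.

    The generated answer serves as a pseudo-label for the augmented pair,
    which can then be added to the training set.
    """
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```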
References
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
- Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4971–4980. [Google Scholar]
- Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3195–3204. [Google Scholar]
- Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; Mottaghi, R. A-okvqa: A benchmark for visual question answering using world knowledge. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part VIII. pp. 146–162. [Google Scholar]
- Garcia, N.; Ye, C.; Liu, Z.; Hu, Q.; Otani, M.; Chu, C.; Nakashima, Y.; Mitamura, T. A Dataset and Baselines for Visual Question Answering on Art. In Proceedings of the Computer Vision—ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; pp. 92–108. [Google Scholar]
- Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
- Hao, X.; Zhu, Y.; Appalaraju, S.; Zhang, A.; Zhang, W.; Li, B.; Li, M. MixGen: A New Multi-Modal Data Augmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, Waikoloa, HI, USA, 3–7 January 2023; pp. 379–389. [Google Scholar]
- Yang, A.; Miech, A.; Sivic, J.; Laptev, I.; Schmid, C. Just Ask: Learning to Answer Questions from Millions of Narrated Videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1666–1677. [Google Scholar]
- Grunde-McLaughlin, M.; Krishna, R.; Agrawala, M. AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019. Article No. 2. pp. 13–23. [Google Scholar]
- Li, L.; Yatskar, M.; Yin, D.; Hsieh, C.; Chang, K. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
- Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
- Xu, X.; Wu, C.; Rosenman, S.; Lal, V.; Che, W.; Duan, N. Bridgetower: Building bridges between encoders in vision-language representation learning. Proc. AAAI Conf. Artif. Intell. 2023, 37, 10637–10647. [Google Scholar] [CrossRef]
- Li, D.; Li, J.; Li, H.; Niebles, J.C.; Hoi, S.C.H. Align and prompt: Video-and-language pre-training with entity prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4953–4963. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2556–2565. [Google Scholar]
- Qi, D.; Su, L.; Song, J.; Cui, E.; Bharti, T.; Sacheti, A. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv 2020, arXiv:2001.07966. [Google Scholar]
- Lee, D. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop Chall. Represent. Learn. ICML 2013, 3, 896. [Google Scholar]
- Rizve, M.N.; Duarte, K.; Rawat, Y.S.; Shah, M. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. arXiv 2021, arXiv:2101.06329. [Google Scholar]
- Cascante-Bonilla, P.; Tan, F.; Qi, Y.; Ordonez, V. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 6912–6920. [Google Scholar] [CrossRef]
- Xu, Y.; Wei, F.; Sun, X.; Yang, C.; Shen, Y.; Dai, B.; Zhou, B.; Lin, S. Cross-model pseudo-labeling for semi-supervised action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2959–2968. [Google Scholar]
- Kil, J.; Zhang, C.; Xuan, D.; Chao, W.-L. Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 6346–6361. [Google Scholar] [CrossRef]
- Chen, L.; Zheng, Y.; Xiao, J. Rethinking Data Augmentation for Robust Visual Question Answering. In Computer Vision–ECCV 2022; Springer Nature: Cham, Switzerland, 2022; pp. 95–112. ISBN 978-3-031-20059-5. [Google Scholar]
- Khan, Z.; Vijay, K.; Schulter, S.; Yu, X.; Fu, Y.; Chandraker, M. Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 15005–15015. [Google Scholar]
- Loper, E.; Bird, S. Nltk: The natural language toolkit. arXiv 2002, arXiv:cs/0205028. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
- Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; Gao, J. Kat: A knowledge augmented transformer for vision-and-language. arXiv 2021, arXiv:2112.08614. [Google Scholar]
- Lin, Y.; Xie, Y.; Chen, D.; Xu, Y.; Zhu, C.; Yuan, L. Revive: Regional visual representation matters in knowledge-based visual question answering. Adv. Neural Inf. Process. Syst. 2022, 35, 10560–10571. [Google Scholar]
- Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; Kembhavi, A. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv 2022, arXiv:2206.08916. [Google Scholar]
- Tan, H.; Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv 2019, arXiv:1908.07490. [Google Scholar]
- Kamath, A.; Clark, C.; Gupta, T.; Kolve, E.; Hoiem, D.; Kembhavi, A. Webly supervised concept expansion for general purpose vision models. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 662–681. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
Dataset | Model | Ques. Count | Multiplicity | BLIP Retrieval Accuracy | BLIP Retrieval % Gain | BLIP-2 Retrieval Accuracy | BLIP-2 Retrieval % Gain
---|---|---|---|---|---|---|---
OK-VQA (evaluated on test set) | Finetuned Baseline [15] | 9,000 | - | | - | | -
 | SemIAug (Ours) | 18,000 | | | | |
 | | 27,000 | | | | |
 | | 36,000 | | | | |
A-OKVQA (evaluated on val set) | Finetuned baseline [15] | 17,000 | - | | - | | -
 | SemIAug (Ours) | 34,000 | | | | |
 | | 51,000 | | | | |
 | | 68,000 | | | | |
Dataset | Relevance Threshold | Lower Limit | Upper Limit | BLIP Retrieval Multiplicity | BLIP Retrieval Accuracy | BLIP Retrieval % Gain | BLIP-2 Retrieval Multiplicity | BLIP-2 Retrieval Accuracy | BLIP-2 Retrieval % Gain
---|---|---|---|---|---|---|---|---|---
OK-VQA (evaluated on test set) | - | - | - | - | | - | - | | -
 | | 1 | 3 | | 56.14 | | | |
 | | 1 | 3 | | | | | |
 | | 1 | 5 | | | | | |
 | | 1 | 5 | | | | | |
A-OKVQA (evaluated on val set) | - | - | - | - | | - | - | | -
 | | 1 | 3 | | | | | |
 | | 1 | 3 | | | | | |
 | | 1 | 5 | | | | | |
 | | 1 | 5 | | | | | |
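The sketch below illustrates how a relevance threshold combined with lower and upper limits could yield a dynamic per-image multiplicity, as suggested by the column names in the table above. The function and its clipping rule are illustrative assumptions only, not the authors' exact procedure.

```python
import torch

def dynamic_augmentation(sim_row: torch.Tensor,
                         relevance_threshold: float = 0.8,
                         lower_limit: int = 1,
                         upper_limit: int = 3) -> list:
    """Select a variable number of augmented questions for one image.

    sim_row: similarity scores between this image and all candidate questions.
    Keep the questions scoring above `relevance_threshold`, but never fewer
    than `lower_limit` nor more than `upper_limit`, so the per-image
    multiplicity varies between the two limits.
    """
    scores, idx = sim_row.sort(descending=True)
    num_relevant = int((scores > relevance_threshold).sum())
    k = max(lower_limit, min(num_relevant, upper_limit))
    return idx[:k].tolist()
```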
Model | Accuracy (Test)
---|---
(a) KAT (single) [30] |
(b) REVIVE (single) [31] |
(c) Unified-IO (2.8B) [32] |
(d) ALBEF [14] |
(e) BLIP [15] |
(f) BLIP [15] |
(g) BLIP (retrieval) + BLIP (VQA) (SemIAug) |
% gain w.r.t. our baseline implementation (f) |
(h) BLIP-2 (retrieval) + BLIP (VQA) (SemIAug) |
% gain w.r.t. our baseline implementation (f) |
Model | Accuracy (Val)
---|---
(a) ViLBERT [11] |
(b) LXMERT [33] |
(c) GPV-2 [34] |
(d) ALBEF [14] |
(e) BLIP [15] |
(f) BLIP [15] |
(g) BLIP (retrieval) + BLIP (VQA) (SemIAug) |
% gain w.r.t. our baseline implementation (f) |
(h) BLIP-2 (retrieval) + BLIP (VQA) (SemIAug) |
% gain w.r.t. our baseline implementation (f) |
Dataset | Multiplicity | With Rephrased? | Threshold (t) | Accuracy
---|---|---|---|---
OK-VQA (evaluated on test set) | | ✗ | 0.8 | 55.74
 | | ✓ | - | 55.62
 | | ✗ | 0.8 | 56.27
 | | ✓ | - | 55.97
A-OKVQA (evaluated on val set) | | ✗ | 0.8 | 54.96
 | | ✓ | - | 54.64
 | | ✗ | 0.8 | 55.48
 | | ✓ | - | 55.28
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).