Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering
Abstract
1. Introduction
- We propose an interpretable and practical VQA model that unifies visual, textual, and knowledge-based factual representations from different modalities. Guided by the question-parsed tree structure, it performs explicit reasoning with an attention mechanism and the support of an external knowledge base.
- We propose a tree-structured reasoning model built on a modular network, which has three notable advantages. First, the modular network effectively mines and merges features from different modalities and performs parallel reasoning over tree nodes for logical inference. Second, the knowledge base supplies common knowledge beyond the images and text, facilitating collaboration and communication between child and parent nodes. Third, the attention mechanism offers good interpretability by making the logical reasoning explicit, and the gated reasoning model can tailor and update node features.
- The proposed model achieves excellent results on benchmark datasets, including VQA 2.0, CLEVR, and FVQA, which indicates that it adapts well across different tasks. Ablation experiments show the contribution of each module network to the overall performance of the model.
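The gated reasoning over tree nodes described above can be sketched as follows. This is a minimal illustrative example, not the paper's exact formulation: the function name `node_reason`, the gate matrix `W_gate`, and the mean-pooling of child-node features are all assumptions made for the sketch. Each node attends over image region features under the guidance of its question word, then a sigmoid gate decides how much of the new visual evidence to keep versus the evidence passed up from its children.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def node_reason(word_vec, regions, child_feats, W_gate):
    """One reasoning step at a tree node (illustrative sketch).

    word_vec:    (D,)   embedding of the node's question word
    regions:     (R, D) image region features
    child_feats: (K, D) features passed up from K child nodes (K may be 0)
    W_gate:      (D, 2D) hypothetical gate parameters
    """
    # question-guided attention over image regions
    att = softmax(regions @ word_vec)                 # (R,)
    visual = att @ regions                            # (D,) attended visual feature
    # merge evidence from child nodes (parent-child communication)
    child = child_feats.mean(axis=0) if len(child_feats) else np.zeros_like(visual)
    # gate tailors how much child evidence vs. new visual evidence to keep
    g = 1.0 / (1.0 + np.exp(-(W_gate @ np.concatenate([visual, child]))))
    return g * visual + (1.0 - g) * child

# Leaf nodes have no children; a parent reuses its children's outputs.
rng = np.random.default_rng(0)
D, R = 8, 5
regions = rng.normal(size=(R, D))
word = rng.normal(size=D)
W_gate = rng.normal(size=(D, 2 * D))
leaf = node_reason(word, regions, np.zeros((0, D)), W_gate)
parent = node_reason(word, regions, np.stack([leaf, leaf]), W_gate)
```

In the full model each node would additionally query the knowledge base; here the sketch only shows the attention-plus-gate skeleton that makes the tree reasoning parallel and interpretable.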
2. Related Work
2.1. Visual Question Answering
2.2. Knowledge Base
2.3. Neural Module Network
3. Approach
3.1. Overview
3.2. Attention Model
3.3. Reasoning Model
3.4. Knowledge-Based Fact Model
3.5. Answer Prediction
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Comparison with Existing Methods
4.4. Visual Reasoning
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Shih, K.J.; Singh, S.; Hoiem, D. Where To Look: Focus Regions for Visual Question Answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 4613–4621. [Google Scholar]
- Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, TX, USA, 1–5 November 2016; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2016; pp. 457–468. [Google Scholar]
- Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
- Shi, Y.; Furlanello, T.; Zha, S.; Anandkumar, A. Question Type Guided Attention in Visual Question Answering. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Nam, H.; Ha, J.-W.; Kim, J. Dual attention networks for multimodal reasoning and matching. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics Engineers Inc.: Manhattan, NY, USA, 2017; Volume 2017-January, pp. 2156–2164. [Google Scholar]
- Manjunatha, V.; Saini, N.; Davis, L.S.; Soc, I.C. Explicit Bias Discovery in Visual Question Answering Models. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9554–9563. [Google Scholar]
- Johnson, J.; Hariharan, B.; Van Der Maaten, L.; Hoffman, J.; Fei-Fei, L.; Lawrence Zitnick, C.; Girshick, R. Inferring and Executing Programs for Visual Reasoning. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3008–3017. [Google Scholar]
- Hu, R.; Andreas, J.; Rohrbach, M.; Darrell, T.; Saenko, K. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In Proceedings of the 16th IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; Institute of Electrical and Electronics Engineers Inc.: Manhattan, NY, USA, 2017; Volume 2017-October, pp. 804–813. [Google Scholar]
- Zhang, W.F.; Yu, J.; Hu, H.; Hu, H.Y.; Qin, Z.C. Multimodal feature fusion by relational reasoning and attention for visual question answering. Inf. Fusion 2020, 55, 116–126. [Google Scholar] [CrossRef]
- Cao, Q.; Liang, X.; Li, B.; Lin, L. Interpretable Visual Question Answering by Reasoning on Dependency Trees. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 887–901. [Google Scholar] [CrossRef] [Green Version]
- Cao, Q.; Liang, X.; Li, B.; Li, G.; Lin, L. Visual Question Reasoning on General Dependency Tree. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 7249–7257. [Google Scholar]
- Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. DBpedia: A nucleus for a Web of open data. In Proceedings of the 6th International Semantic Web Conference, ISWC 2007 and 2nd Asian Semantic Web Conference, ASWC 2007, Busan, Korea, 11–15 November 2007; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4825 LNCS, pp. 722–735. [Google Scholar]
- Tandon, N.; De Melo, G.; Suchanek, F.; Weikum, G. WebChild: Harvesting and organizing commonsense knowledge from the web. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, 24–28 February 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 523–532. [Google Scholar]
- Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4444–4451. [Google Scholar]
- Ren, M.; Kiros, R.; Zemel, R.S. Exploring models and data for image question answering. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, NIPS 2015, Montreal, QC, Canada, 7–12 December 2015; Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2015; Volume 2015-January, pp. 2953–2961. [Google Scholar]
- Zhang, Y.; Hare, J.; Prugel-Bennett, A. Learning to count objects in natural images for visual question answering. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Yu, Z.; Yu, J.; Fan, J.; Tao, D. Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1839–1848. [Google Scholar]
- Ben-Younes, H.; Cadene, R.; Thome, N.; Cord, M. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Annual Conference on Innovative Applications of Artificial Intelligence, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Palo Alto, CA, USA, 2019; pp. 8102–8109. [Google Scholar]
- Xu, H.; Saenko, K. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In Proceedings of the 21st ACM Conference on Computer and Communications Security, CCS 2014, Scottsdale, AZ, USA, 3–7 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9911 LNCS, pp. 451–466. [Google Scholar]
- Zhu, Y.; Groth, O.; Bernstein, M.; Fei-Fei, L. Visual7W: Grounded question answering in images. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 26 June–1 July 2016; IEEE Computer Society: Washington, DC, USA, 2016; Volume 2016-December, pp. 4995–5004. [Google Scholar]
- Nguyen, D.-K.; Okatani, T. Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 6087–6096. [Google Scholar]
- Yang, Z.; He, X.; Gao, J.; Deng, L.; Smola, A. Stacked attention networks for image question answering. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 26 June–1 July 2016; IEEE Computer Society: Washington, DC, USA, 2016; Volume 2016-December, pp. 21–29. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 6077–6086. [Google Scholar]
- Santoro, A.; Raposo, D.; Barrett, D.G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2017; Volume 2017-December, pp. 4968–4977. [Google Scholar]
- Wu, C.; Liu, J.; Wang, X.; Dong, X. Chain of reasoning for visual question answering. In Proceedings of the 32nd Conference on Neural Information Processing Systems, NeurIPS 2018, Montreal, QC, Canada, 2–8 December 2018; Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2018; Volume 2018-December, pp. 275–285. [Google Scholar]
- Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. FVQA: Fact-Based Visual Question Answering. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2413–2427. [Google Scholar] [CrossRef] [Green Version]
- Wu, Q.; Wang, P.; Shen, C.; Dick, A.; van den Hengel, A. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 26 June–1 July 2016; IEEE Computer Society: Washington, DC, USA, 2016; Volume 2016-December, pp. 4622–4630. [Google Scholar]
- Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, VIC, Australia, 19–25 August 2017; pp. 1290–1296. [Google Scholar]
- Yu, J.; Zhu, Z.; Wang, Y.; Zhang, W.; Hu, Y.; Tan, J. Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recognit. 2020, 108, 107563–107576. [Google Scholar] [CrossRef]
- Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3190–3199. [Google Scholar]
- Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Neural module networks. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 26 June–1 July 2016; IEEE Computer Society: Washington, DC, USA, 2016; Volume 2016-December, pp. 39–48. [Google Scholar]
- Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Learning to compose neural networks for question answering. In Proceedings of the 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2016; pp. 1545–1554. [Google Scholar]
- Chen, D.; Manning, C.D. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2014; pp. 740–750. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 26 June–1 July 2016; IEEE Computer Society: Washington, DC, USA, 2016; Volume 2016-December, pp. 770–778. [Google Scholar]
- Ruby, U.; Yendapalli, V. Binary cross entropy with deep learning technique for Image classification. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 5393–5397. [Google Scholar]
- Johnson, J.; Fei-Fei, L.; Hariharan, B.; Zitnick, C.L.; van der Maaten, L.; Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics Engineers Inc.: Manhattan, NY, USA, 2017; Volume 2017-January, pp. 1988–1997. [Google Scholar]
- Goyal, Y.; Khot, T.; Agrawal, A.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Int. J. Comput. Vis. 2019, 127, 398–414. [Google Scholar] [CrossRef] [Green Version]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8693 LNCS, pp. 740–755. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2014; pp. 1532–1543. [Google Scholar]
- Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Model | Overall | Count | Exist | Compare Numbers | Query Attribute | Compare Attribute |
---|---|---|---|---|---|
Q-type model [36] | 41.8 | 34.6 | 50.2 | 51.0 | 36.0 | 51.3 |
LSTM [36] | 46.8 | 41.7 | 61.1 | 69.8 | 36.8 | 51.8 |
CNN + LSTM [36] | 52.3 | 43.7 | 65.2 | 67.1 | 49.3 | 53.0 |
N2NMN scratch [8] | 69.0 | 55.1 | 72.7 | 78.5 | 83.2 | 50.9 |
N2NMN cloning Expert [8] | 78.9 | 63.3 | 83.3 | 80.3 | 87.0 | 78.5 |
N2NMN policy search [8] | 83.7 | 68.5 | 85.7 | 84.9 | 90.0 | 88.7 |
PG + EE(9 K prog.) [7] | 88.6 | 79.7 | 89.7 | 79.1 | 92.6 | 96.0 |
PG + EE(700 K prog.) [7] | 96.9 | 92.7 | 97.1 | 98.7 | 98.1 | 98.9 |
RN [24] | 95.5 | 90.1 | 97.8 | 93.6 | 97.9 | 97.1 |
QGTSKB (Ours) | 97.6 | 94.2 | 99.1 | 93.6 | 99.3 | 99.2 |
Model | Test-Dev Overall | Test-Dev Yes/No | Test-Dev Number | Test-Dev Other | Test-Standard Overall |
---|---|---|---|---|---|
QTA [4] | 57.99 | 80.87 | 37.32 | 43.12 | 58.24 |
SAN [22] | 58.7 | 79.3 | 36.6 | 46.1 | 58.9 |
HQIC [3] | 61.8 | 79.7 | 38.7 | 51.7 | 62.1 |
DAN [5] | 64.3 | 83.0 | 39.1 | 53.9 | 64.2 |
MFB [17] | 65.9 | 84.0 | 39.8 | 56.2 | 65.8 |
DCN [21] | 66.89 | 84.61 | 42.35 | 57.31 | 67.02 |
Count [16] | 68.09 | 83.14 | 51.62 | 58.97 | 68.41 |
QGTSKB (Ours) | 70.5 | 86.76 | 52.54 | 60.68 | 70.62 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, Q.; Tang, X.; Jian, Y. Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering. Sensors 2022, 22, 1575. https://doi.org/10.3390/s22041575