MM-Transformer: A Transformer-Based Knowledge Graph Link Prediction Model That Fuses Multimodal Features
Abstract
1. Introduction
- (1) This paper proposes a method for fusing multimodal features that makes full use of structural, visual, and textual features: each modality is extracted by a dedicated encoder, and a Transformer then fuses the resulting features. This design effectively reduces the heterogeneity of multimodal entity representations (a minimal fusion sketch is given after this list).
- (2) By fusing feature information from the different modalities at every layer, the model represents entities and relations in multimodal knowledge graphs more comprehensively and better captures the complex interactions among multimodal features.
- (3) Case analysis shows that multimodal feature fusion effectively reduces the bias that any single modality may introduce; analyzing the contribution of each feature to the final result improves the interpretability and credibility of the model.
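To make contribution (1) concrete, below is a minimal sketch of Transformer-based multimodal fusion in PyTorch. The class name, feature dimensions, modality-type embeddings, and mean pooling are illustrative assumptions, not the authors' exact MM-Transformer architecture, which is detailed in Section 3.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Illustrative fusion module: one embedding per modality is treated as a
    token, and a standard Transformer encoder mixes them via self-attention.
    Dimensions and names are assumptions, not the paper's exact design."""

    def __init__(self, struct_dim=200, visual_dim=2048, text_dim=768,
                 hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.struct_proj = nn.Linear(struct_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Learnable embeddings marking which modality each token comes from.
        self.modality_embed = nn.Embedding(3, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, struct_feat, visual_feat, text_feat):
        # Each input: (batch, modality_dim) -> one "token" per modality.
        tokens = torch.stack([
            self.struct_proj(struct_feat),
            self.visual_proj(visual_feat),
            self.text_proj(text_feat),
        ], dim=1)                                    # (batch, 3, hidden_dim)
        ids = torch.arange(3, device=tokens.device)  # modality ids 0..2
        tokens = tokens + self.modality_embed(ids)   # add modality-type embeddings
        fused = self.encoder(tokens)                 # self-attention fuses the modalities
        return fused.mean(dim=1)                     # pooled multimodal entity representation


# Usage with random features standing in for modality-encoder outputs.
model = MultimodalFusion()
s = torch.randn(4, 200)      # e.g., structural embedding from a graph encoder
v = torch.randn(4, 2048)     # e.g., image feature from a visual encoder
t = torch.randn(4, 768)      # e.g., BERT [CLS] feature
print(model(s, v, t).shape)  # torch.Size([4, 768])
```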
2. Related Work
3. Methodology
3.1. Overall Architecture
3.2. Structural Feature Extraction
3.3. Visual Feature Extraction
3.4. Text Feature Extraction
3.5. Multimodal Feature Fusion
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Baselines
- VisualBERT [17], a pre-trained vision–language model with a single-stream structure.
- ViLBERT [16], a pre-trained vision–language model with a two-stream structure.
- IKRL [14], which extends TransE to learn image-based representations of entities and structure-based representations of the knowledge graph, respectively.
- TransAE [19], which combines a multimodal autoencoder with TransE to encode visual and textual knowledge into a unified representation, using the hidden layer of the autoencoder as the entity representation in the TransE model.
- RSME [21], which designs a forget gate with an MRP metric to select valuable images for multimodal knowledge graph embedding learning.
- MKGformer [28], a hybrid Transformer model with multi-level fusion that integrates visual and textual representations.
4.1.3. Experiment Details
5. Experimental Results
5.1. Overall Performance
5.2. Ablation Study
5.3. Visual Analysis
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Huang, X.; Zhang, J.; Li, D.; Li, P. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, 11–15 February 2019; pp. 105–113. [Google Scholar]
- Yih, S.W.; Chang, M.W.; He, X.; Gao, J. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, Beijing, China, 26 July 2015; pp. 1321–1331. [Google Scholar]
- Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 4623–4629. [Google Scholar]
- Huang, J.; Zhao, W.X.; Dou, H.; Wen, J.R.; Chang, E.Y. Improving sequential recommendation with knowledge-enhanced memory networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 8–12 July 2018; pp. 505–514. [Google Scholar]
- Zhang, N.; Jia, Q.; Deng, S.; Chen, X.; Ye, H.; Chen, H.; Tou, H.; Huang, G.; Wang, Z.; Hua, N.; et al. Alicg: Fine-grained and evolvable conceptual graph construction for semantic search at alibaba. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 3895–3905. [Google Scholar]
- Dietz, L.; Kotov, A.; Meij, E. Utilizing knowledge graphs for text-centric information retrieval. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Tokyo, Japan, 8–12 July 2018; pp. 1387–1390. [Google Scholar]
- Yang, Z. Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; p. 2486. [Google Scholar]
- Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Adv. Neural Inf. Process. Syst. 2013, 26, 2787–2795. [Google Scholar]
- Wang, Z.; Zhang, J.; Feng, J.; Chen, Z. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; pp. 1112–1119. [Google Scholar]
- Nathani, D.; Chauhan, J.; Sharma, C.; Kaul, M. Learning attention-based embeddings for relation prediction in knowledge graphs. arXiv 2019, arXiv:1906.01195. [Google Scholar]
- Nguyen, D.Q.; Nguyen, T.D.; Nguyen, D.Q.; Phung, D. A novel embedding model for knowledge base completion based on convolutional neural network. arXiv 2017, arXiv:1712.02121. [Google Scholar]
- Pezeshkpour, P.; Chen, L.; Singh, S. Embedding multimodal relational data for knowledge base completion. arXiv 2018, arXiv:1809.01341. [Google Scholar]
- Mousselly-Sergieh, H.; Botschen, T.; Gurevych, I.; Roth, S. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA, 5–6 June 2018; pp. 225–234. [Google Scholar]
- Xie, R.; Liu, Z.; Luan, H.; Sun, M. Image-embodied knowledge representation learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, VIC, Australia, 19–25 August 2017; pp. 3140–3146. [Google Scholar]
- Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394. [Google Scholar] [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 13–23. [Google Scholar]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
- Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
- Wang, Z.; Li, L.; Li, Q.; Zeng, D. Multimodal data enhanced representation learning for knowledge graphs. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
- Zhao, Y.; Cai, X.; Wu, Y.; Zhang, H.; Zhang, Y.; Zhao, G.; Jiang, N. MoSE: Modality split and ensemble for multimodal knowledge graph completion. arXiv 2022, arXiv:2210.08821. [Google Scholar]
- Wang, M.; Wang, S.; Yang, H.; Zhang, Z.; Chen, X.; Qi, G. Is visual context really helpful for knowledge graph? A representation learning perspective. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 2735–2743. [Google Scholar]
- Shankar, S.; Thompson, L.; Fiterau, M. Progressive fusion for multimodal integration. arXiv 2022, arXiv:2209.00302. [Google Scholar]
- Liang, P.P.; Ling, C.K.; Cheng, Y.; Obolenskiy, A.; Liu, Y.; Pandey, R.; Salakhutdinov, R. Quantifying Interactions in Semi-supervised Multimodal Learning: Guarantees and Applications. In Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Jiang, Y.; Gao, Y.; Zhu, Z.; Yan, C.; Gao, Y. HyperRep: Hypergraph-Based Self-Supervised Multimodal Representation Learning. Available online: https://openreview.net/forum?id=y3dqBDnPay (accessed on 22 September 2023).
- Golovanevsky, M.; Schiller, E.; Nair, A.A.; Singh, R.; Eickhoff, C. One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data. In Proceedings of the ICML 2024 Workshop on Efficient and Accessible Foundation Models for Biological Discovery, Vienna, Austria, 27 July 2024. [Google Scholar]
- Zhang, X.; Yoon, J.; Bansal, M.; Yao, H. Multimodal representation learning by alternating unimodal adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 27456–27466. [Google Scholar]
- Li, X.; Zhao, X.; Xu, J.; Zhang, Y.; Xing, C. IMF: Interactive multimodal fusion model for link prediction. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2572–2580. [Google Scholar]
- Chen, X.; Zhang, N.; Li, L.; Deng, S.; Tan, C.; Xu, C.; Huang, F.; Si, L.; Chen, H. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 904–915. [Google Scholar]
- Gu, W.; Gao, F.; Lou, X.; Zhang, J. Link prediction via graph attention network. arXiv 2019, arXiv:1910.04807. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
- Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250. [Google Scholar]
| Dataset | Ent | Rel | Train | Dev | Test |
|---|---|---|---|---|---|
| FB15K-237-IMG | 14,541 | 237 | 272,115 | 17,535 | 20,466 |
| WN18-IMG | 40,943 | 18 | 141,442 | 5000 | 5000 |
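For reference, the following is a minimal sketch of how such triple files are often loaded and counted, assuming the common tab-separated `head \t relation \t tail` layout used by FB15K-237 and WN18 releases; the file names and directory layout are assumptions.

```python
from pathlib import Path

def load_triples(path):
    """Read tab-separated (head, relation, tail) triples, one per line."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, relation, tail = line.rstrip("\n").split("\t")
            triples.append((head, relation, tail))
    return triples

def dataset_stats(data_dir):
    """Count entities, relations, and split sizes, as in the dataset table above."""
    splits = {name: load_triples(Path(data_dir) / f"{name}.txt")
              for name in ("train", "dev", "test")}
    entities, relations = set(), set()
    for triples in splits.values():
        for h, r, t in triples:
            entities.update((h, t))
            relations.add(r)
    return {"Ent": len(entities), "Rel": len(relations),
            **{name.capitalize(): len(triples) for name, triples in splits.items()}}

# Example (path is an assumption):
# print(dataset_stats("data/FB15K-237-IMG"))
```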
| Model | FB15K-237-IMG Hits@1 ↑ | Hits@3 ↑ | Hits@10 ↑ | MR ↓ | WN18-IMG Hits@1 ↑ | Hits@3 ↑ | Hits@10 ↑ | MR ↓ |
|---|---|---|---|---|---|---|---|---|
| VisualBERT_base [17] | 0.217 | 0.324 | 0.439 | 592 | 0.179 | 0.437 | 0.654 | 122 |
| ViLBERT_base [16] | 0.233 | 0.335 | 0.457 | 483 | 0.223 | 0.552 | 0.761 | 131 |
| IKRL [14] | 0.194 | 0.284 | 0.458 | 298 | 0.127 | 0.796 | 0.928 | 596 |
| TransAE [19] | 0.199 | 0.317 | 0.463 | 431 | 0.323 | 0.835 | 0.934 | 352 |
| RSME [21] | 0.242 | 0.344 | 0.467 | 417 | 0.943 | 0.951 | 0.957 | 223 |
| MKGformer [28] | 0.256 | 0.367 | 0.504 | 221 | 0.944 | 0.961 | 0.972 | 28 |
| MM-Transformer | 0.259 | 0.362 | 0.511 | 215 | 0.948 | 0.968 | 0.976 | 117 |
| Modalities (FB15K-237-IMG) | Hits@1 ↑ | Hits@3 ↑ | Hits@10 ↑ | MR ↓ |
|---|---|---|---|---|
| T | 0.241 | 0.345 | 0.457 | 248 |
| S + T | 0.242 | 0.351 | 0.386 | 232 |
| V + T | 0.256 | 0.367 | 0.504 | 221 |
| S + V + T | 0.259 | 0.362 | 0.511 | 215 |
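The tables above report the standard link-prediction metrics: Hits@K is the fraction of test triples whose correct entity ranks within the top K candidates, and MR is the mean rank of the correct entity (lower is better). A minimal sketch of how these are computed from a list of 1-indexed ranks (the example ranks are hypothetical):

```python
def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Mean rank of the correct entity; lower is better."""
    return sum(ranks) / len(ranks)

# Hypothetical ranks for five test triples.
ranks = [1, 3, 7, 2, 120]
print(hits_at_k(ranks, 1))   # 0.2
print(hits_at_k(ranks, 10))  # 0.8
print(mean_rank(ranks))      # 26.6
```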