Pre-Training and Fine-Tuning with Next Sentence Prediction for Multimodal Entity Linking
Abstract
1. Introduction
- We introduce a pre-training and fine-tuning paradigm for MEL. To the best of our knowledge, ours is the first work to improve MEL performance through pre-training.
- We introduce three categories of NSP tasks that further pre-train the model, connecting the multimodal corpus with entity information in a form well suited to fine-tuning.
- We conduct extensive experiments on a Twitter-based multimodal corpus, covering both plain-text and multimodal settings.
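The linking step implied by these NSP tasks can be pictured as candidate ranking: pair the mention's context with each candidate entity's description, score each pair with an NSP-style classifier, and link to the top-scoring candidate. The sketch below is illustrative only; `toy_nsp_score` is a hypothetical stand-in for the fine-tuned BERT NSP head, not the paper's model.

```python
from typing import Callable, Dict

def link_entity(context: str,
                candidates: Dict[str, str],
                nsp_score: Callable[[str, str], float]) -> str:
    """Return the candidate entity whose description best 'follows' the
    mention context under an NSP-style scorer (higher = more coherent)."""
    best_name, best_score = None, float("-inf")
    for name, description in candidates.items():
        score = nsp_score(context, description)  # pseudo P(description | context)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy scorer standing in for a fine-tuned NSP head: it just counts
# word overlap between the tweet context and the entity description.
def toy_nsp_score(context: str, description: str) -> float:
    return len(set(context.lower().split()) & set(description.lower().split()))

candidates = {
    "Apple Inc.": "technology company that makes the iPhone",
    "apple (fruit)": "edible fruit of the apple tree",
}
print(link_entity("Just bought the new iPhone from Apple!", candidates, toy_nsp_score))
# → Apple Inc.
```

In the paper's setting the scorer would be a BERT-style model whose NSP head judges whether the entity description is a plausible continuation of the (multimodal) mention context; the overlap scorer above only mimics its interface.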
2. Related Work
2.1. Multimodal Entity Linking
2.2. Multimodal Pre-Training
3. Proposed Method
3.1. Task Definition
3.2. Representation
3.3. Pre-Training
3.4. Linking
4. Experiment
4.1. Dataset and Experiment Settings
4.2. Baselines
4.3. Results
4.4. Error Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sevgili, O.; Shelmanov, A.; Arkhipov, M.; Panchenko, A.; Biemann, C. Neural Entity Linking: A Survey of Models Based on Deep Learning. arXiv 2021, arXiv:2006.00575.
- Csomai, A.; Mihalcea, R. Linking Documents to Encyclopedic Knowledge. IEEE Intell. Syst. 2008, 23, 34–41.
- Yang, X.; Gu, X.; Lin, S.; Tang, S.; Zhuang, Y.; Wu, F.; Chen, Z.; Hu, G.; Ren, X. Learning Dynamic Context Augmentation for Global Entity Linking. arXiv 2019, arXiv:1909.02117.
- Adjali, O.; Besançon, R.; Ferret, O.; Le Borgne, H.; Grau, B. Building a Multimodal Entity Linking Dataset From Tweets. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4285–4292.
- Adjali, O.; Besançon, R.; Ferret, O.; Le Borgne, H.; Grau, B. Multimodal Entity Linking for Tweets. In Advances in Information Retrieval; Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12035, pp. 463–478.
- Gan, J.; Luo, J.; Wang, H.; Wang, S.; He, W.; Huang, Q. Multimodal Entity Linking: A New Dataset and A Baseline. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 993–1001.
- Zhang, L.; Li, Z.; Yang, Q. Attention-Based Multimodal Entity Linking with High-Quality Images. In Database Systems for Advanced Applications; Jensen, C.S., Lim, E.P., Yang, D.N., Lee, W.C., Tseng, V.S., Kalogeraki, V., Huang, J.W., Shen, C.Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; pp. 533–548.
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5100–5111.
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-Training of Generic Visual-Linguistic Representations. arXiv 2019, arXiv:1908.08530.
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557.
- Sun, Y.; Zheng, Y.; Hao, C.; Qiu, H. NSP-BERT: A Prompt-Based Zero-Shot Learner Through an Original Pre-Training Task–Next Sentence Prediction. arXiv 2021, arXiv:2109.03564.
- Moon, S.; Neves, L.; Carvalho, V. Multimodal Named Entity Disambiguation for Noisy Social Media Posts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2000–2008.
- Bast, H.; Bäurle, F.; Buchhold, B.; Haußmann, E. Easy Access to the Freebase Dataset. In Proceedings of the 23rd International Conference on World Wide Web—WWW ’14 Companion, Seoul, Korea, 7–11 April 2014; pp. 95–98.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
- Li, W.; Gao, C.; Niu, G.; Xiao, X.; Liu, H.; Liu, J.; Wu, H.; Wang, H. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 2592–2607.
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. VinVL: Revisiting Visual Representations in Vision-Language Models. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5575–5584.
- Hao, W.; Li, C.; Li, X.; Carin, L.; Gao, J. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13134–13143.
- Singh, H.; Shekhar, S. STL-CQA: Structure-Based Transformers with Localization and Encoding for Chart Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 3275–3284.
- Tanaka, R.; Nishida, K.; Yoshida, S. VisualMRC: Machine Reading Comprehension on Document Images. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 13878–13888.
- Shen, W.; Wang, J.; Han, J. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Trans. Knowl. Data Eng. 2014, 27, 443–460.
- Kannan Ravi, M.P.; Singh, K.; Mulang’, I.O.; Shekarpour, S.; Hoffart, J.; Lehmann, J. CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 504–514.
- De Cao, N.; Aziz, W.; Titov, I. Highly Parallel Autoregressive Entity Linking with Discriminative Correction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7662–7669.
| | Number of Tweets | Number of Mentions or Entities |
|---|---|---|
| groundTruth | 57,905 | 1553 |
| KB | 2,478,625 | 18,434 |
| | Unique | Random | Totals |
|---|---|---|---|
| trainSet | 7000 | 3619 | 10,619 |
| devSet | 6923 | 3619 | 10,542 |
| testSet | 7795 | 28,949 | 36,744 |
| trainSet07 | 7000 | 2533 | 9533 |
| trainSet05 | 7000 | 1809 | 8809 |
| Name | Value |
|---|---|
| max len | 300 |
| batch size for pre-training | 4 |
| batch size for fine-tuning | 16 |
| epochs for pre-training | 5 |
| epochs for fine-tuning | 4 |
| lr | |
| optimizer | AdamW |
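The settings in the table above can be captured in a small configuration sketch. The learning-rate value is missing from this excerpt, so the `lr` entry below is only a placeholder, and `total_steps` is a hypothetical helper (not from the paper) showing how batch size and epochs determine the optimizer-step budget.

```python
# Hyperparameters from the paper's settings table. The lr value is not
# shown in this excerpt; 1e-5 is a PLACEHOLDER, not the paper's value.
PRETRAIN_CFG = {"max_len": 300, "batch_size": 4, "epochs": 5,
                "lr": 1e-5, "optimizer": "AdamW"}
FINETUNE_CFG = {"max_len": 300, "batch_size": 16, "epochs": 4,
                "lr": 1e-5, "optimizer": "AdamW"}

def total_steps(num_examples: int, cfg: dict) -> int:
    """Optimizer steps for one run: ceil(num_examples / batch_size) * epochs."""
    steps_per_epoch = -(-num_examples // cfg["batch_size"])  # ceiling division
    return steps_per_epoch * cfg["epochs"]

# e.g. fine-tuning on the 10,619-pair trainSet:
print(total_steps(10_619, FINETUNE_CFG))
# → 2656  (664 steps/epoch × 4 epochs)
```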
| | Dev | Test |
|---|---|---|
| TEXT | | |
| NSP-BERT | 0.6045 | 0.73946 |
| CHOLAN | 0.52832 | 0.70818 |
| Autoregressive | 0.28302 | 0.59606 |
| TEXT-IMAGE | | |
| Att | 0.32634 | 0.53424 |
| FMEL | 0.60406 | 0.73926 |
| PFMEL | 0.60268 | 0.75192 |
| | Dev | Test |
|---|---|---|
| FMEL | 0.60406 | 0.73926 |
| PFMELm | 0.62462 | 0.7465 |
| PFMELmt | 0.59428 | 0.74836 |
| PFMELmtm | 0.59008 | 0.74378 |
| PFMELmtd | 0.59906 | 0.7447 |
| PFMEL | 0.60268 | 0.75192 |
| | Dev | Test |
|---|---|---|
| FMEL | 0.60406 | 0.73926 |
| PFMEL05 | 0.59192 | 0.73916 |
| PFMEL07 | 0.61292 | 0.74758 |
| PFMEL | 0.60268 | 0.75192 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, L.; Wang, Q.; Zhao, B.; Li, X.; Zhou, A.; Wu, H. Pre-Training and Fine-Tuning with Next Sentence Prediction for Multimodal Entity Linking. Electronics 2022, 11, 2134. https://doi.org/10.3390/electronics11142134