A Hierarchical Multi-Task Learning Framework for Semantic Annotation in Tabular Data
Abstract
1. Introduction
- We introduce a novel representation learning framework based on multi-task learning for table semantic annotation. The central idea is to integrate multiple annotation subtasks into a unified learning architecture that exploits inter-task correlations and shared domain knowledge. This enables the model to concurrently learn both the commonalities and the differences among the subtasks, thereby improving the performance of each individual subtask (a minimal architectural sketch follows this list).
- Our model achieves strong performance without relying on external knowledge bases, linked web resources, or pre-trained tabular models, using only the target datasets. This not only reduces dependence on external resources but also lowers implementation cost and complexity, offering greater flexibility and scalability.
- Extensive experiments validate the advantages of the proposed model in multi-task learning and representation learning, as well as its data efficiency and training efficiency.
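To make the framework concrete, the following is a minimal PyTorch sketch of one plausible wiring, assuming a shared transformer encoder over column embeddings, an NER head on the column vectors, a CTA head that additionally consumes the coarse NER logits (the hierarchical cascade), and a CRA head over concatenated column-pair vectors. The cascade ordering, hidden size, and entity-type count are illustrative assumptions rather than the authors' exact design; the CTA and CRA label counts match the SemTab2019 statistics in Section 5.

```python
import torch
import torch.nn as nn

class HierarchicalTableAnnotator(nn.Module):
    """Illustrative shared-encoder model with NER, CTA, and CRA heads."""

    def __init__(self, d_model=256, n_ner=10, n_cta=275, n_cra=550):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared across tasks
        self.ner_head = nn.Linear(d_model, n_ner)
        # Hierarchical coupling (assumed): CTA sees the column vector
        # plus the coarser NER logits.
        self.cta_head = nn.Linear(d_model + n_ner, n_cta)
        # CRA classifies the concatenation of two column vectors.
        self.cra_head = nn.Linear(2 * d_model, n_cra)

    def forward(self, col_embeds, pair_index):
        h = self.encoder(col_embeds)                  # (batch, n_cols, d_model)
        ner_logits = self.ner_head(h)                 # (batch, n_cols, n_ner)
        cta_logits = self.cta_head(torch.cat([h, ner_logits], dim=-1))
        i, j = pair_index
        pair = torch.cat([h[:, i], h[:, j]], dim=-1)  # (batch, 2 * d_model)
        cra_logits = self.cra_head(pair)              # (batch, n_cra)
        return ner_logits, cta_logits, cra_logits

model = HierarchicalTableAnnotator()
cols = torch.randn(1, 3, 256)                         # embeddings for 3 columns
ner, cta, cra = model(cols, pair_index=(0, 1))
```

Because all three heads backpropagate into the same encoder, gradients from each subtask shape a single shared column representation, which is the mechanism the first bullet appeals to.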
2. Related Work
2.1. Table Semantic Annotation
2.2. Pre-Trained Tabular Models
2.3. Multi-Task Learning
3. Notations and Problem Definition
- Problem 1: Column named entity recognition (NER). Given a table $T$ (without table headers or contextual information) and a set of simple named entity types $\mathcal{E}$, the goal is to predict the entity type $e_i \in \mathcal{E}$ of the target column $T_i$ that best describes most of the entities in the column. This process is denoted as $f_{\mathrm{NER}}(T, i) \rightarrow e_i \in \mathcal{E}$.
- Problem 2: Column type annotation (CTA). Given a table $T$ (without table headers or contextual information) and a set of column semantic types $\mathcal{C}$, the goal is to predict the semantic type $c_i \in \mathcal{C}$ of the target column $T_i$ based only on the table content. This process is represented as $f_{\mathrm{CTA}}(T, i) \rightarrow c_i \in \mathcal{C}$.
- Problem 3: Inter-column relationship annotation (CRA). Given a table $T$ (without table headers or contextual information) and a set of inter-column relationship types $\mathcal{R}$, the goal is to predict the relationship type $r_{(i,j)} \in \mathcal{R}$ of the target column pair $(T_i, T_j)$ based solely on the table content. This process can be represented as $f_{\mathrm{CRA}}(T, i, j) \rightarrow r_{(i,j)} \in \mathcal{R}$.
| Notation | Description |
| --- | --- |
| $T$ | Standard relational table without headers |
| $T_i$ | The $i$-th column in table $T$ |
| $T_{i,r}$ | The value of the cell at the $i$-th column, $r$-th row of table $T$ |
| $\mathcal{C}$ | The set of all column semantic types |
| $\mathcal{R}$ | The set of all inter-column relationship types |
| $\mathcal{E}$ | The set of all column named entity types |
| $\mathbf{h}_i$ | Contextual representation of the $i$-th column in the encoding layer |
| $\mathbf{o}_i$, $\mathbf{o}_{(i,j)}$ | Output vector of the $i$-th column or $(i,j)$-th column pair in the task layer |
| $\hat{y}_i$, $\hat{y}_{(i,j)}$ | Prediction probability of the $i$-th column or $(i,j)$-th column pair in the task layer |
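To make the three problems concrete, here is a toy instance. The label vocabularies below (spaCy-style entity tags and DBpedia-style types and properties) are hypothetical stand-ins for $\mathcal{E}$, $\mathcal{C}$, and $\mathcal{R}$.

```python
# A toy three-row table T with two columns and no headers.
table = [
    ["Japan",  "Tokyo"],
    ["France", "Paris"],
    ["Brazil", "Brasilia"],
]

# NER (Problem 1): a coarse entity type per column, drawn from E.
ner_labels = {0: "GPE", 1: "GPE"}

# CTA (Problem 2): a fine-grained semantic type per column, drawn from C.
cta_labels = {0: "dbo:Country", 1: "dbo:City"}

# CRA (Problem 3): a relationship for the column pair (0, 1), drawn from R.
cra_labels = {(0, 1): "dbo:capital"}
```

Note that NER is deliberately coarser than CTA: both columns share the entity type GPE, while CTA distinguishes countries from cities, which is what makes NER a useful auxiliary signal for the other two tasks.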
4. Methodology
4.1. Table Serialization and Column Representations
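The extract preserves no prose under this heading, so the snippet below sketches a common serialization scheme (used, for example, by DODUO [40]): all column values are flattened into one token sequence with a marker token per column, and the marker's contextual embedding serves as the column representation $\mathbf{h}_i$. The use of roberta-base and the single-sequence layout are assumptions, not the authors' confirmed procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

# Toy table: each inner list holds one column's cell values.
columns = [["Japan", "France", "Brazil"], ["Tokyo", "Paris", "Brasilia"]]

# Prefix every column with the CLS marker ("<s>" for RoBERTa); its
# contextual embedding will act as that column's representation h_i.
text = " ".join(tokenizer.cls_token + " " + " ".join(col) for col in columns)
enc = tokenizer(text, add_special_tokens=False, return_tensors="pt",
                truncation=True, max_length=512)

with torch.no_grad():
    hidden = model(**enc).last_hidden_state    # (1, seq_len, 768)

# Pool one vector per column at the positions of the <s> markers.
marker_pos = (enc["input_ids"][0] == tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
column_vectors = hidden[0, marker_pos]         # (num_columns, 768)
print(column_vectors.shape)                    # torch.Size([2, 768])
```

Serializing the whole table at once, rather than one column at a time, lets self-attention mix evidence across columns, which matters for the context-sensitive CTA and CRA predictions.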
4.2. Column Named Entity Recognition
4.3. Column Type Annotation
4.4. Inter-Column Relationship Annotation
4.5. Joint Multi-Task Learning
Algorithm 1: Training strategy of our multi-task learning framework.
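Algorithm 1 itself is not reproduced in this extract, so the snippet below is a minimal sketch of the joint objective it presumably implements: per-task cross-entropy losses combined into a single weighted sum and backpropagated in one pass through the shared parameters. The stand-in logits, the batch sizes, and the equal task weights are assumptions; the CTA and CRA label counts follow SemTab2019.

```python
import torch
import torch.nn.functional as F

# Stand-in logits, as if produced by the three task heads for one batch
# of 8 target columns and 4 column pairs.
ner_logits = torch.randn(8, 10, requires_grad=True)    # 10 entity types (assumed)
cta_logits = torch.randn(8, 275, requires_grad=True)   # 275 column semantic types
cra_logits = torch.randn(4, 550, requires_grad=True)   # 550 relationship types

ner_y = torch.randint(0, 10, (8,))
cta_y = torch.randint(0, 275, (8,))
cra_y = torch.randint(0, 550, (4,))

# Joint objective: a weighted sum of per-task cross-entropies, followed by
# a single backward pass that updates the shared encoder and all heads.
lambdas = (1.0, 1.0, 1.0)  # assumed equal task weights
loss = (lambdas[0] * F.cross_entropy(ner_logits, ner_y)
        + lambdas[1] * F.cross_entropy(cta_logits, cta_y)
        + lambdas[2] * F.cross_entropy(cra_logits, cra_y))
loss.backward()
```

The ablation rows in the results tables ("w/o NER", "w/o CTA", "w/o CRA") correspond to dropping one of these loss terms, so the gap between "Ours" and each ablation measures how much that subtask's gradient contributes to the others.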
5. Experiments
5.1. Datasets
5.2. Experimental Settings
5.3. Results and Analysis
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, Z.; Zhang, S.; Davison, B.D. WTR: A Test Collection for Web Table Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 2514–2520.
- Zhong, V.; Xiong, C.; Socher, R. Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning. arXiv 2017, arXiv:1709.00103.
- Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; Wang, W.Y. TabFact: A Large-Scale Dataset for Table-Based Fact Verification. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Chen, J.; Jiménez-Ruiz, E.; Horrocks, I.; Sutton, C. Learning Semantic Annotations for Tabular Data. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 2088–2094. Available online: https://dl.acm.org/doi/abs/10.5555/3367243.3367329 (accessed on 28 June 2024).
- Chen, J.; Jiménez-Ruiz, E.; Horrocks, I.; Sutton, C. ColNet: Embedding the Semantics of Web Tables for Column Type Prediction. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
- Zhang, Z. Effective and Efficient Semantic Table Interpretation Using TableMiner+. Semant. Web 2017, 8, 921–957.
- Yin, P.; Neubig, G.; Yih, W.t.; Riedel, S. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8413–8426.
- Iida, H.; Thai, D.; Manjunatha, V.; Iyyer, M. TABBIE: Pretrained Representations of Tabular Data. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 3446–3456.
- Ramnandan, S.; Mittal, A.; Knoblock, C.A.; Szekely, P. Assigning Semantic Labels to Data Sources. In Proceedings of the 12th European Semantic Web Conference on The Semantic Web, Berlin/Heidelberg, Germany, 31 May–4 June 2015; pp. 403–417.
- Hulsebos, M.; Hu, K.; Bakker, M.; Zgraggen, E.; Satyanarayan, A.; Kraska, T.; Demiralp, C.; Hidalgo, C. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1500–1508.
- Zhang, D.; Hulsebos, M.; Suhara, Y.; Demiralp, C.; Li, J.; Tan, W.C. Sato: Contextual Semantic Type Detection in Tables. Proc. VLDB Endow. 2020, 13, 1835–1848.
- Takeoka, K.; Oyamada, M.; Nakadai, S.; Okadome, T. Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
- Khurana, U.; Galhotra, S. Semantic Concept Annotation for Tabular Data. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, Queensland, Australia, 1–5 November 2021; pp. 844–853.
- Korini, K.; Bizer, C. Column Type Annotation Using ChatGPT. In Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDBW’23), Vancouver, BC, Canada, 28 August–1 September 2023; Volume 3462, pp. 1–12.
- Korini, K.; Bizer, C. Column Property Annotation Using Large Language Models. In Proceedings of the Extended Semantic Web Conference (ESWC 2024), Crete, Greece, 2024. Available online: https://2024.eswc-conferences.org/wp-content/uploads/2024/04/ESWC_2024_paper_283.pdf (accessed on 28 June 2024).
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Wang, Z.; Dong, H.; Jia, R.; Li, J.; Fu, Z.; Han, S.; Zhang, D. TUTA: Tree-Based Transformers for Generally Structured Table Pre-Training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, Singapore, 14–18 August 2021; pp. 1780–1790.
- Eisenschlos, J.; Gor, M.; Müller, T.; Cohen, W. MATE: Multi-View Attention for Table Transformer Efficiency. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7606–7619.
- Deng, X.; Sun, H.; Lees, A.; Wu, Y.; Yu, C. TURL: Table Understanding through Representation Learning. Proc. VLDB Endow. 2020, 14, 307–319.
- Tang, N.; Fan, J.; Li, F.; Tu, J.; Du, X.; Li, G.; Madden, S.; Ouzzani, M. RPT: Relational Pre-Trained Transformer Is Almost All You Need towards Democratizing Data Preparation. Proc. VLDB Endow. 2021, 14, 1254–1261.
- Wang, F.; Sun, K.; Chen, M.; Pujara, J.; Szekely, P. Retrieving Complex Tables with Multi-Granular Graph Representation Learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 1472–1482.
- Ruder, S. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv 2017, arXiv:1706.05098.
- Brüggemann, D.; Kanakis, M.; Obukhov, A.; Georgoulis, S.; Van Gool, L. Exploring Relational Context for Multi-Task Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15869–15878.
- Xu, Y.; Yang, Y.; Zhang, L. DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
- He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; Neubig, G. Towards a Unified View of Parameter-Efficient Transfer Learning. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
- Karimi Mahabadi, R.; Ruder, S.; Dehghani, M.; Henderson, J. Parameter-Efficient Multi-Task Fine-Tuning for Transformers via Shared Hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 565–576.
- Tay, Y.; Zhao, Z.; Bahri, D.; Metzler, D.; Juan, D.C. HyperGrid Transformers: Towards a Single Model for Multiple Tasks. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
- Zhao, S.; Liu, T.; Zhao, S.; Wang, F. A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 817–824.
- Zhao, H.; Huang, L.; Zhang, R.; Lu, Q.; Xue, H. SpanMlt: A Span-Based Multi-Task Learning Framework for Pair-Wise Aspect and Opinion Terms Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3239–3248.
- Fei, H.; Tan, S.; Li, P. Hierarchical Multi-Task Word Embedding Learning for Synonym Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 834–842.
- Chauhan, D.S.; S R, D.; Ekbal, A.; Bhattacharyya, P. Sentiment and Emotion Help Sarcasm? A Multi-Task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4351–4360.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
- Jiménez-Ruiz, E.; Hassanzadeh, O.; Efthymiou, V.; Chen, J.; Srinivas, K. SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems. In Proceedings of The Semantic Web (ESWC 2020), Crete, Greece, 31 May–4 June 2020; pp. 514–530.
- Hassanzadeh, O.; Efthymiou, V.; Chen, J. SemTab 2022: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets—“Hard Tables” R1 and R2; Zenodo: Geneva, Switzerland, 2022.
- Honnibal, M.; Montani, I. spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing. To Appear 2017, 7, 411–420.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Suhara, Y.; Li, J.; Li, Y.; Zhang, D.; Demiralp, C.; Chen, C.; Tan, W.C. Annotating Columns with Pre-Trained Language Models. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 1493–1503.
Statistics of the two evaluation datasets:

| Dataset | CTA # Type | CTA # Sample | CTA # Table | CRA # Type | CRA # Sample | CRA # Table |
| --- | --- | --- | --- | --- | --- | --- |
| SemTab2019 | 275 | 7614 | 3045 | 550 | 10,438 | 3025 |
| HardTables2022 | 492 | 8072 | 6420 | 402 | 9772 | 6301 |
Results and ablations on SemTab2019:

| Model | CTA Micro F1 | CTA Macro F1 | CRA Micro F1 | CRA Macro F1 |
| --- | --- | --- | --- | --- |
| TaBERT [7] | 0.661 ± 0.012 | 0.412 ± 0.017 | 0.561 ± 0.016 | 0.440 ± 0.019 |
| TABBIE [8] | 0.774 ± 0.010 | 0.607 ± 0.011 | 0.673 ± 0.014 | 0.572 ± 0.010 |
| DODUO [40] | 0.795 ± 0.011 | 0.583 ± 0.013 | 0.690 ± 0.010 | 0.573 ± 0.014 |
| Ours | 0.824 ± 0.009 | 0.636 ± 0.012 | 0.726 ± 0.011 | 0.628 ± 0.013 |
| w/ target only | 0.781 ± 0.010 | 0.532 ± 0.015 | 0.697 ± 0.008 | 0.568 ± 0.011 |
| w/o NER | 0.808 ± 0.011 | 0.599 ± 0.009 | 0.716 ± 0.010 | 0.604 ± 0.013 |
| w/o CTA | - | - | 0.710 ± 0.008 | 0.584 ± 0.013 |
| w/o CRA | 0.798 ± 0.012 | 0.566 ± 0.014 | - | - |
Results and ablations on HardTables2022:

| Model | CTA Micro F1 | CTA Macro F1 | CRA Micro F1 | CRA Macro F1 |
| --- | --- | --- | --- | --- |
| TaBERT [7] | 0.684 ± 0.011 | 0.466 ± 0.014 | 0.429 ± 0.012 | 0.325 ± 0.016 |
| TABBIE [8] | 0.822 ± 0.010 | 0.683 ± 0.010 | 0.551 ± 0.014 | 0.488 ± 0.011 |
| DODUO [40] | 0.846 ± 0.011 | 0.689 ± 0.012 | 0.569 ± 0.011 | 0.476 ± 0.013 |
| Ours | 0.863 ± 0.010 | 0.732 ± 0.009 | 0.615 ± 0.009 | 0.519 ± 0.011 |
| w/ target only | 0.845 ± 0.008 | 0.711 ± 0.010 | 0.549 ± 0.012 | 0.462 ± 0.012 |
| w/o NER | 0.860 ± 0.006 | 0.727 ± 0.011 | 0.613 ± 0.013 | 0.516 ± 0.010 |
| w/o CTA | - | - | 0.556 ± 0.011 | 0.473 ± 0.015 |
| w/o CRA | 0.850 ± 0.008 | 0.720 ± 0.010 | - | - |
Effect of the maximum number of serialized rows per table (SemTab2019):

| Max Rows | CTA Micro F1 | CTA Macro F1 | CRA Micro F1 | CRA Macro F1 |
| --- | --- | --- | --- | --- |
| 2 rows | 0.484 ± 0.006 | 0.338 ± 0.008 | 0.416 ± 0.008 | 0.343 ± 0.009 |
| 4 rows | 0.755 ± 0.009 | 0.518 ± 0.011 | 0.670 ± 0.012 | 0.572 ± 0.015 |
| 8 rows | 0.774 ± 0.011 | 0.584 ± 0.013 | 0.694 ± 0.008 | 0.592 ± 0.012 |
| 16 rows | 0.809 ± 0.009 | 0.610 ± 0.010 | 0.704 ± 0.011 | 0.605 ± 0.012 |
| max length | 0.824 ± 0.009 | 0.636 ± 0.012 | 0.726 ± 0.011 | 0.628 ± 0.013 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).