Automatic Taxonomy Classification by Pretrained Language Model
Abstract
1. Introduction
- Extracting key phrases from a target corpus;
- Generating a taxonomic structure consisting of a hypernym–hyponym relationship from the extracted phrase set;
- Creating detailed relationships between phrases according to the intended use of the ontology.
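A minimal sketch of how these three steps could be chained is shown below; all helper names (`extract_key_phrases`, `classify_pair`, `refine_relations`) are hypothetical placeholders for illustration, not functions from the paper's implementation.

```python
# Hypothetical pipeline sketch of the three steps above; the helper functions
# are placeholders, not part of the paper's released code.
def build_ontology(corpus):
    phrases = extract_key_phrases(corpus)          # step 1: key-phrase extraction
    taxonomy = []
    for a in phrases:
        for b in phrases:
            if a == b:
                continue
            relation = classify_pair(a, b)         # step 2: hypernym / hyponym / synonym / unrelated
            if relation != "unrelated":
                taxonomy.append((a, relation, b))
    return refine_relations(taxonomy)              # step 3: use-specific relationships
```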
2. Related Work
2.1. Lexico-Syntactic-Based Ontology Generation
2.2. Word Vectorization-Based Ontology Generation
2.3. RNN-Based Ontology Generation
2.4. Hypernym and Synonym Matching for BERT
2.5. Ontology Generation Using the Framework
3. Preliminaries
3.1. Relationship Classification between Phrase Pairs
- Word embedding: Converting words into their corresponding vectors puts them in a format that neural networks can handle easily. In addition, because ontology generation often involves rare, task-specific words, learning their representations in advance on a large corpus is preferable.
- Acquisition of contextual information: Phrases used in ontology generation usually consist of only a few words, but the connections between those words are stronger than in ordinary sentences, so contextual information must be captured.
- Concatenation: Because the input for this task consists of two independent phrases, their information must be combined at some stage of the process.
- Classifier: We feed the information obtained in the preceding steps into the classification model and compute the final output label. A minimal sketch of these four components follows this list.
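The sketch below arranges the four components in PyTorch; the LSTM encoder and the layer sizes are illustrative assumptions only, and the paper's classifier (Section 4) instead obtains the embedding and context steps from a pretrained language model.

```python
# A minimal sketch (PyTorch) of the four components listed above.
# The LSTM encoder and the layer sizes are illustrative assumptions;
# the paper's model (Section 4) uses a pretrained language model instead.
import torch
import torch.nn as nn

class PhrasePairClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)             # word embedding
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # context information
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)         # final classifier

    def forward(self, phrase_a_ids, phrase_b_ids):
        # Encode each phrase separately, then concatenate the two summaries.
        _, (h_a, _) = self.encoder(self.embedding(phrase_a_ids))
        _, (h_b, _) = self.encoder(self.embedding(phrase_b_ids))
        pair = torch.cat([h_a[-1], h_b[-1]], dim=-1)                     # concatenation
        return self.classifier(pair)                                     # label logits
```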
3.2. Encoding and Architecture for a Pretrained Model
3.2.1. Byte Pair Encoding
3.2.2. BERT
3.2.3. ALBERT
3.3. Noun Phrase Extraction
Syntactic Parsing
4. Ontology Classifier Using PLM
4.1. Learning Procedure
4.2. Architecture
4.2.1. Preprocess
- Concatenation and Special Token Insertion: First, the input phrase pair is concatenated into one sequence. A classifier token ([CLS]) is inserted before the first phrase, and separator tokens ([SEP]) [12] are placed between the two phrases and after the second phrase.
- Tokenization: The concatenated phrases are split into subwords by the tokenizer corresponding to each language model. The number of resulting subwords is equal to or greater than the number of words in the phrase. A tokenizer sketch follows this list.
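A minimal sketch of this preprocessing with the HuggingFace Transformers library cited in the references; the checkpoint name "bert-base-uncased" and the exact subword split shown in the comment are illustrative assumptions.

```python
# Sketch of the preprocessing steps with the HuggingFace tokenizer;
# "bert-base-uncased" is an illustrative checkpoint choice.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing the two phrases as a pair inserts [CLS] and [SEP] automatically,
# e.g. [CLS] airplane [SEP] jet aero ##plane [SEP] (cf. the preprocessing table).
encoding = tokenizer("airplane", "jet aeroplane")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```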
4.2.2. Pretrained Language Model
4.2.3. Classification Layers
5. Data Collection
5.1. Phrase-Pair Relationship Datasets
5.2. Overview of WordNet
5.3. Dataset Extraction from WordNet
5.4. Dataset of SQuAD V2.0
5.5. Dataset Extraction from SQuAD V2.0 for Ontology Generation
5.5.1. Extraction of Nouns
5.5.2. Extraction of Noun Phrases
6. Experiment
6.1. Training and Validation
6.2. Model Setup
6.3. Ontology Generation for the Real Text
7. Evaluation
7.1. Comparison of Accuracy
7.2. Comparison of Batch Size with BERT-Base
7.3. Ontology-Generation Experiment Results
8. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Oba, A.; Paik, I.; Kuwana, A. Automatic Classification for Ontology Generation by Pretrained Language Model. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems; Artificial Intelligence Practices. Fujita, H., Selamat, A., Lin, J.C.-W., Ali, M., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 210–221. [Google Scholar]
- Bittner, T.; Donnelly, M.; Smith, B. A spatio-temporal ontology for geographic information integration. Int. J. Geogr. Inf. Sci. 2009, 23, 765–798. [Google Scholar] [CrossRef] [Green Version]
- Paik, I.; Komiya, R.; Ryu, K. Customizable active situation awareness framework based on meta-process in ontology. In Proceedings of the International Conference on Awareness Science and Technology (iCAST) 2013, Fukushima, Japan, 2–4 November 2013. [Google Scholar]
- Zhu, H.; Paschalidis, I.C.; Tahmasebi, A. Clinical concept extraction with contextual word embedding. arXiv 2018, arXiv:1810.10566. [Google Scholar]
- Brack, A.; D’Souza, J.; Hoppe, A.; Auer, S.; Ewerth, R. Domain-independent extraction of scientific concepts from research articles. Adv. Inf. Retr. 2020, 12035, 251–266. [Google Scholar]
- Oba, A.; Paik, I. Extraction of taxonomic relation of complex terms by recurrent neural network. In Proceedings of the 2019 IEEE International Conference on Cognitive Computing (ICCC), Milan, Italy, 8–13 July 2019; pp. 70–72. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Duan, S.; Zhao, H. Attention is all you need for Chinese word segmentation. arXiv 2020, arXiv:1910.14537. [Google Scholar]
- Dowdell, T.; Zhang, H. Is attention all what you need?—An empirical investigation on convolution-based active memory and self-attention. arXiv 2019, arXiv:1912.11959. [Google Scholar]
- García, I.; Agerri, R.; Rigau, G. A common semantic space for monolingual and cross-lingual meta-embeddings. arXiv 2021, arXiv:2001.06381. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Klaussner, C.; Zhekova, D. Lexico-syntactic patterns for automatic ontology building. In Proceedings of the Second Student Research Workshop Associated with RANLP 2011, Hissar, Bulgaria, 13 September 2011; pp. 109–114. Available online: https://www.aclweb.org/anthology/R11-2017/ (accessed on 25 October 2021).
- Omine, K.; Paik, I. Classification of taxonomic relations by word embedding and wedge product. In Proceedings of the 2018 IEEE International Conference on Cognitive Computing (ICCC), San Francisco, CA, USA, 2–7 July 2018; pp. 122–125. [Google Scholar]
- Araci, D. FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models. arXiv 2019, arXiv:1908.10063. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2020, arXiv:1909.11942. [Google Scholar]
- Elnagar, S.; Yoon, V.; Thomas, M.A. An Automatic Ontology Generation Framework with An Organizational Perspective. In Proceedings of the 53rd Hawaii International Conference on System Sciences, Grand Wailea, Maui, HI, USA, 7–10 January 2020; Available online: https://aisel.aisnet.org/hicss-53/ks/knowledge_flows/3/ (accessed on 25 October 2021).
- Wang, Y.; Zhu, M.; Qu, L.; Spaniol, M.; Weikum, G. Timely YAGO: Harvesting, Querying, and Visualizing Temporal Knowledge from Wikipedia. In Proceedings of the 13th International Conference on Extending Database Technology, New York, NY, USA, 23–26 March 2010; pp. 697–700. [Google Scholar]
- Gangemi, A.; Guarino, N.; Masolo, C.; Oltramari, A. Sweetening WORDNET with DOLCE. AI Mag. 2003, 24, 13. [Google Scholar] [CrossRef]
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2016, arXiv:1508.07909. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef] [Green Version]
- Heinzerling, B.; Strube, M. Bpemb: Tokenization-free pre-trained sub-word embeddings in 275 languages. arXiv 2017, arXiv:1710.02187. [Google Scholar]
- Mrini, K.; Dernoncourt, F.; Tran, Q.; Bui, T.; Chang, W.; Nakashole, N. Rethinking Self-Attention: Towards Interpretability in Neural Parsing. arXiv 2020, arXiv:1911.03875. [Google Scholar]
- Marcus, M.P.; Marcinkiewicz, M.A.; Santorini, B. Building a Large Annotated Corpus of English: The Penn Treebank. Comput. Linguist. 1993, 19, 313–330. Available online: https://repository.upenn.edu/cis_reports/237/ (accessed on 25 October 2021).
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2020, arXiv:1910.03771. [Google Scholar]
- Miller, G.A.; Beckwith, R.; Fellbaum, C.; Gross, D.; Miller, K.J. Introduction to WordNet: An On-line Lexical Database. Int. J. Lexicogr. 1990, 3, 235–244. [Google Scholar] [CrossRef] [Green Version]
- Rajpurkar, P.; Jia, R.; Liang, P. Know what you don’t know: Unanswerable questions for SQuAD. arXiv 2018, arXiv:1806.03822. [Google Scholar]
- Toutanova, K.; Klein, D.; Manning, C.D.; Singer, Y. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology—NAACL’03, Edmonton, AB, Canada, 27 May–1 June 2003; Volume 1, pp. 173–180. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2020, arXiv:1909.10351. [Google Scholar]
| | Phrase A | Phrase B |
|---|---|---|
| Input | Airplane | Jet aeroplane |
| Concatenation | [CLS] airplane [SEP] jet aeroplane [SEP] | |
| Tokenization | [CLS] airplane [SEP] jet aero ##plane [SEP] | |
| Relation Label | Number of Data |
|---|---|
| Synonym | 135,658 |
| Hypernym | 215,554 |
| Hyponym | 215,554 |
| Unrelated | 500,000 |
| Total | 1,066,766 |
| | Synonym | Hypernym | Hyponym | Unrelated |
|---|---|---|---|---|
| Train | 133,164 | 211,514 | 211,445 | 490,643 |
| Validation | 1233 | 2002 | 2108 | 4657 |
| Test | 1261 | 2038 | 2001 | 4700 |
| Total | 135,658 | 215,554 | 215,554 | 500,000 |
| | BERT-Base | BERT-Large | ALBERT-Base | ALBERT-Large |
|---|---|---|---|---|
| Optimizer | Adam | Adam | Adam | Adam |
| Learning rate | 1 × 10⁻⁵ | 1 × 10⁻⁵ | 1 × 10⁻⁵ | 1 × 10⁻⁵ |
| Transformer layers | 12 | 24 | 12 | 24 |
| Hidden size | 768 | 1024 | 768 | 1024 |
| Embedding size | 768 | 1024 | 128 | 128 |
| Parameters | 108 M | 334 M | 12 M | 18 M |
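A minimal sketch of a fine-tuning setup matching the hyperparameters in the table above (Adam, learning rate 1 × 10⁻⁵, four relation classes); the checkpoint name and the use of `BertForSequenceClassification` are illustrative assumptions rather than the paper's exact code.

```python
# Sketch of a fine-tuning configuration with the table's hyperparameters;
# the checkpoint and head class are illustrative assumptions.
import torch
from transformers import BertForSequenceClassification

# Four labels: synonym, hypernym, hyponym, unrelated.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```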
| | Vocabulary Size |
|---|---|
| Word2vec (previous) | 297,141 |
| BERT-Embedding (Subword Representation) | 30,522 |
| | Accuracy (Four Classes) | Recall | Precision | F1 | Ratio of Calculation Time |
|---|---|---|---|---|---|
| Word2vec + RNN (Previous) | 87.1% | 94.87 | 93.13 | 93.88 | 1.00 |
| BERT-Embedding + RNN | 89.6% | 95.89 | 93.4 | 94.96 | 1.21 |
| BERT-Base | 98.1% | 98.8 | 99 | 98.9 | 6.26 |
| BERT-Large | 98.6% | 99.7 | 99.4 | 99.55 | 15.34 |
| ALBERT-Base | 96.8% | 98.6 | 99.1 | 98.9 | 4.23 |
| ALBERT-Large | 98.3% | 99.3 | 99.05 | 99.17 | 12.01 |
| | Number of Extracted Pairs | Number of Distilled Pairs |
|---|---|---|
| Noun pairs | 1,106,386 | 426,450 |
| Noun phrase pairs | 2,809,133 | 280,397 |
Noun 1 | Noun 2 | Relationship |
---|---|---|
Beyoncé | actress | sub-sup |
Beyoncé | artist | sub-sup |
music | pop | sup-sub |
Grammy | award | sub-sup |
Beyoncé | singer | sub-sup |
family | parents | sup-sub |
B’Day | birthday | synonyms |
female | Beyoncé | sup-sub |
singer | Swift | sup-sub |
career | life | sub-sup |
announcement | tweets | sup-sub |
Barbara | female | sub-sup |
artist | entertainer | sub-sup |
star | performer | sub-sup |
choreography | dance | sub-sup |
video | YouTube | sup-sub |
video | parodies | sup-sub |
albums | music | sub-sup |
records | music | sub-sup |
March | music | sub-sup |
service | Spotify | sup-sub |
service | industry | sub-sup |
women | grandmother | sup-sub |
department | stores | sup-sub |
mother | human | sub-sup |
Chopin | composer | sub-sup |
Chopin | pianist | sub-sup |
era | generation | sup-sub |
birthdate | date | sub-sup |
passages | passages | sup-sub |
commenting | piano-bashing | sup-sub |
student | role | sub-sup |
Liszt | musician | sub-sup |
friendship | relationship | sub-sup |
rift | relationship | sub-sup |
woman | daughter | sup-sub |
woman | mother | sup-sub |
couple | people | sub-sup |
apartment | accommodation | sub-sup |
canon | music | sub-sup |
preludes | music | sub-sup |
sonata | music | sub-sup |
method | technique | sup-sub |
rubato | melody | sub-sup |
environment | reputation | sup-sub |
dynasty | leaders | sub-sup |
suzerainty | region | sub-sup |
Tibet | China | sub-sup |
ethnicities | Han | sup-sub |
King | Emperor | synonyms |
Noun Phrase 1 | Noun Phrase 2 | Relationship |
---|---|---|
official posts | official posts | sup-sub |
succession important posts | hereditary positions | sup-sub |
succession important posts | official posts | sub-sup |
true Han representatives | Han Chinese government | synonyms |
1390 | 14th century | sub-sup |
shamanistic ways | native Mongol practices shamanism blood sacrifice | sup-sub |
event | conflict | sup-sub |
event | war | sup-sub |
aid gelug monks supporters | help | sub-sup |
gelug monasteries | traditional religious sites | sub-sup |
fifth Dalai Lama lozang gyatso | Dalai Lama | sub-sup |
Chinese claims suzerainty Tibet | territory | sub-sup |
portable media players multipurpose pocket computers | iPod | sup-sub |
128 gb iPod touch | iPods | sub-sup |
iPods | digital music players | sub-sup |
product | iPod | sup-sub |
fonts | Chicago font | sup-sub |
commercial use | trademark | sup-sub |
100 db | maximum volume output level | synonyms |
legal limit | user-configurable volume limit | sup-sub |
legal limit | maximum volume output level | sup-sub |
maximum volume output level | user-configurable volume limit | synonyms |
implementation interface | dock connector | sup-sub |
apple lightning cables | new 8pin dock connector named lightning | synonyms |
cars | BMW | sup-sub |
cars | Volkswagen | sup-sub |
advanced menu iTunes | iPod software | sub-sup |
alternative opensource audio formats ogg vorbis flac | several audio file formats | sub-sup |
audio files | midi files | sup-sub |
audio files | mpeg4 QuickTime video formats | sup-sub |
audio files | several audio file formats | synonyms |
iPods library | entire music libraries music playlists | synonyms |
iPods library | iTunes library | synonyms |
main computer library | entire music libraries music playlists | synonyms |
devices | iPhone | sup-sub |
devices | iPod touch | sup-sub |