The Multi-Hot Representation-Based Language Model to Maintain Morpheme Units
Abstract
1. Introduction
- Variation in morpheme separation and in the unknown-token ratio, depending on the corpus used to build the tokenizer;
- Loss of meaning when a single morpheme is split across multiple subword tokens.
2. Materials and Methods
2.1. Input for the Multi-Hot Language Model
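As a rough illustration of the multi-hot input, the sketch below assumes that a morpheme missing from the token list is represented by the set of syllable tokens that compose it, and that the selected embedding rows are summed into a single input vector. The vocabulary, example morphemes, and helper names are illustrative and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Toy token list: a few whole-morpheme tokens plus individual syllable tokens.
# The real token lists contain roughly 21k-38k tokens (see the token-list
# table at the end of this outline); this vocabulary only shows the mechanics.
vocab = {"[PAD]": 0, "[UNK]": 1, "하늘": 2, "이": 3, "하": 4, "늘": 5, "맑": 6, "다": 7}
embedding = nn.Embedding(len(vocab), 8)  # toy hidden size of 8

def multi_hot_vector(morpheme: str) -> torch.Tensor:
    """Build a multi-hot vector over the vocabulary for one morpheme.

    A morpheme that exists in the token list stays one-hot; otherwise every
    syllable of the morpheme activates its own position, so the morpheme is
    still fed to the transformer as a single input unit.
    """
    vec = torch.zeros(len(vocab))
    if morpheme in vocab:
        vec[vocab[morpheme]] = 1.0
    else:
        for syllable in morpheme:
            vec[vocab.get(syllable, vocab["[UNK]"])] = 1.0
    return vec

def embed_morpheme(morpheme: str) -> torch.Tensor:
    """Sum the embedding rows selected by the multi-hot vector."""
    return multi_hot_vector(morpheme) @ embedding.weight  # shape: (hidden,)

# "맑다" is not a single token here, so it activates the syllables 맑 and 다,
# yet still occupies one input position.
print(multi_hot_vector("맑다"))
print(embed_morpheme("맑다").shape)  # torch.Size([8])
```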
2.2. Loss Function for the Multi-Hot Language Model
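Section 3.2 compares softmax cross entropy, binary cross entropy, and a multi-hot softmax cross entropy (see the loss-function table near the end of this outline). The sketch below shows one common way such a loss can be written, assuming the multi-hot target is normalized into a probability distribution over its active positions before taking cross entropy against the softmax of the logits; this is an illustrative assumption, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_hot_softmax_cross_entropy(logits: torch.Tensor,
                                    multi_hot: torch.Tensor) -> torch.Tensor:
    """Cross entropy against a normalized multi-hot target.

    logits:    (batch, vocab) raw scores from the output layer
    multi_hot: (batch, vocab) 0/1 targets with one or more active positions
    """
    # Spread the probability mass uniformly over the active positions, so a
    # morpheme composed of several syllable tokens still acts as one target.
    target = multi_hot / multi_hot.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

def binary_cross_entropy_baseline(logits: torch.Tensor,
                                  multi_hot: torch.Tensor) -> torch.Tensor:
    """Baseline compared in Section 3.2: each vocabulary position is treated
    as an independent binary label instead of a single distribution."""
    return F.binary_cross_entropy_with_logits(logits, multi_hot)
```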
2.3. Configuration of the Model and Token List
- Small Model: 4 transformer layers, 256-dimensional hidden layers, maximum input length of 256 tokens;
- Basic Model: 12 transformer layers, 768-dimensional hidden layers, maximum input length of 512 tokens (a configuration sketch follows this list).
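For reference, the two backbone configurations can be written with the HuggingFace transformers BertConfig, assuming a BERT-style encoder; the attention-head counts, feed-forward sizes, and the vocabulary size chosen here (the BPE + Syllable token list from the table at the end of this outline) are filled in for illustration and are not all stated above. The multi-hot input layer sketched in Section 2.1 would replace the standard embedding lookup.

```python
from transformers import BertConfig

# Small Model: 4 transformer layers, 256-dimensional hidden layers,
# maximum input length of 256 tokens.
small_config = BertConfig(
    vocab_size=20_892,            # BPE + Syllable token list size (illustrative choice)
    num_hidden_layers=4,
    hidden_size=256,
    num_attention_heads=4,        # assumed: hidden_size / 64
    intermediate_size=1024,       # assumed: 4 x hidden_size
    max_position_embeddings=256,
)

# Basic Model: 12 transformer layers, 768-dimensional hidden layers,
# maximum input length of 512 tokens.
basic_config = BertConfig(
    vocab_size=20_892,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,       # assumed
    intermediate_size=3072,       # assumed
    max_position_embeddings=512,
)
```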
3. Results
3.1. Details of the Experimental Areas
- KorQuAD v1.0: A Korean QA corpus constructed by LG CNS. It consists of 60,407 training examples and 5,774 validation examples. This corpus was used to compare machine reading comprehension performance with that of existing language models. Exact match and F1-score were used for the evaluation.
- NER and SRL (NIKL MODU Corpus): The NIKL MODU corpus, constructed by the National Institute of Korean Language, was used for NER and SRL. These two tasks were used to test the model's semantic analysis of natural language. NER uses 15 tags and SRL uses 18 tags. A BIO scheme was used for NER; for SRL, the word segment of the target predicate was attached after the sentence to be analyzed (a sketch of this input construction follows this list). The NER corpus contains 120,066 training sentences and 30,018 test sentences; the SRL corpus contains 108,509 training sentences and 27,501 test sentences. The F1-score was used for the evaluation.
- NSMC (Naver Sentiment Movie Corpus): NSMC is a sentiment classification corpus of movie review comments labeled as positive or negative. Unlike the domains above, it was used to check how well the language model handles a colloquial corpus of everyday language rather than a written corpus such as newspaper text. The corpus contains 150,000 training examples and 50,000 test examples. Accuracy was used for the evaluation.
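The NER and SRL inputs described above can be assembled roughly as follows, assuming a BIO scheme over token-level entity spans and an SRL input in which the target predicate's word segment is appended after the sentence behind a separator token; the tag names, the separator, and the example sentence are placeholders rather than the paper's exact preprocessing.

```python
from typing import List, Tuple

def bio_tags(tokens: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BIO tags given entity spans as (start, end, type) token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

def srl_input(sentence: str, predicate_word_segment: str, sep: str = "[SEP]") -> str:
    """Attach the target predicate's word segment after the sentence to analyze."""
    return f"{sentence} {sep} {predicate_word_segment}"

# Illustrative usage with tag types taken from the NER tag table (PER, LOC).
tokens = ["홍길동", "이", "서울", "에", "갔다"]
print(bio_tags(tokens, [(0, 1, "PER"), (2, 3, "LOC")]))
# ['B-PER', 'O', 'B-LOC', 'O', 'O']
print(srl_input("홍길동이 서울에 갔다", "갔다"))
# 홍길동이 서울에 갔다 [SEP] 갔다
```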
3.2. Comparison of Loss Functions
3.3. Experiments on the Multi-Hot Language Model
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
Number of tokens in each token list:

| Token List | BPE Base | BPE + Syllable | Dictionary + Syllable |
|---|---|---|---|
| All Tokens | 37,027 | 20,892 | 37,822 |
| Syllable Tokens | – | 1,994 | 1,994 |
Performance by loss function:

| Loss Function | Exact Match | F1-Score |
|---|---|---|
| Softmax Cross Entropy | 69.8% | 76.3% |
| Binary Cross Entropy | 43.3% | 55.8% |
| Multi-Hot Softmax Cross Entropy | 73.6% | 80.1% |
Performance of each token list and model size on the evaluation tasks:

| Token List | Size | KorQuAD v1.0 Exact Match | KorQuAD v1.0 F1-Score | NER F1 (Macro) | NER F1 (Micro) | SRL F1 (Macro) | SRL F1 (Micro) | NSMC Accuracy |
|---|---|---|---|---|---|---|---|---|
| BPE base | Small | 75.2% | 81.2% | 78.65% | 86.78% | 43.41% | 54.35% | 86.99% |
| BPE + Syllable | Small | 73.6% | 80.1% | 79.81% | 87.39% | 43.18% | 53.21% | 87.62% |
| Dictionary + Syllable | Small | 74.9% | 81.5% | 79.70% | 87.37% | 43.07% | 53.20% | 87.97% |
| BPE base | Basic | 83.0% | 89.1% | 80.74% | 88.57% | 45.79% | 57.12% | 87.95% |
| BPE + Syllable | Basic | 81.8% | 87.9% | 81.47% | 89.01% | 46.54% | 58.64% | 88.91% |
KorQuAD v1.0 exact match ratio (correct/total and percentage) for the not-divided and divided cases:

| Token List | Not Divided (Correct/Total) | Not Divided (%) | Divided (Correct/Total) | Divided (%) |
|---|---|---|---|---|
| BPE base | 2046/2618 | 78.15% | 2301/3156 | 72.90% |
| BPE + Syllable | 1958/2618 | 74.78% | 2291/3156 | 72.59% |
| BPE + Dictionary | 2002/2618 | 76.47% | 2326/3156 | 73.70% |
NER F1-score by tag type:

| Token List | PER | LOC | ORG |
|---|---|---|---|
| BPE base | 93.84% | 87.53% | 84.26% |
| BPE + Syllable | 94.95% | 87.68% | 85.18% |
| BPE + Dictionary | 94.77% | 87.81% | 85.09% |