Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu
Abstract
1. Introduction
- The existing U-NER corpus, the largest annotated dataset for Urdu named entity recognition (NER), was expanded through “contextual word embeddings augmentation” (CWEA). The original dataset of 50,692 tokens with 16,300 named entities was increased to 160,132 tokens, resulting in the creation of the UNER-II corpus.
- In the extended dataset, we annotated named entities (NEs) into specific classes: person (PER), location (LOC), and organization (ORG).
- Four distinct transformer models (multilingual bidirectional encoder representations from transformers (BERT), RoBERTa-Urdu-small, BERT-base-cased, and BERT-large-cased) were applied to both datasets to classify named entities.
- Our approach, evaluated using precision, recall, and F1-score, significantly outperforms existing state-of-the-art DL-based NER methods for the Urdu language.
- The paper is organized as follows: Section 2 outlines the difficulties specific to Urdu NER. Section 3 reviews the relevant literature. Section 4 details the research approach. Section 5 presents and discusses the findings. Section 6 covers the discussion and error analysis. Finally, Section 7 provides concluding remarks and suggestions for future research.
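The contextual word embeddings augmentation (CWEA) named in the highlights can be sketched as follows. The paper does not publish its implementation, so this is only an illustrative sketch: the `cwea_augment` function and the candidate dictionary are hypothetical stand-ins for a contextual model (in practice, substitutes would come from a masked-LM fill-mask head such as multilingual BERT's), and English tokens stand in for the Urdu corpus.

```python
# Illustrative sketch of CWEA-style augmentation for a tagged sentence.
# Non-entity tokens are replaced with contextual substitutes; annotated
# entity tokens and all tags are preserved, so the NE labels stay valid.

def cwea_augment(tokens, tags, candidates, keep_tag="O"):
    """Return augmented copies of (tokens, tags), one per substitution.

    candidates: dict mapping a token to contextually plausible
    replacements (here a toy dictionary; in practice a masked LM).
    """
    augmented = []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag != keep_tag:          # never rewrite annotated entities
            continue
        for sub in candidates.get(tok, []):
            new_tokens = list(tokens)
            new_tokens[i] = sub
            augmented.append((new_tokens, list(tags)))
    return augmented

# Toy example (English stand-ins; the actual corpus is Urdu).
tokens = ["Iqbal", "visited", "Lahore"]
tags = ["PER", "O", "LOC"]
candidates = {"visited": ["toured", "reached"]}

for sent, _ in cwea_augment(tokens, tags, candidates):
    print(" ".join(sent))
```

Because only `O`-tagged tokens are rewritten, each augmented sentence keeps the original entity annotation, which is what lets the augmented corpus reuse the existing labels.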
2. Urdu NER Challenges
2.1. Lack of Capitalization
2.2. Segmentation
2.3. Cursive Context Sensitivity
2.4. Agglutination
2.5. Diacritics
2.6. Variation in Spelling
2.7. Loanwords from Other Languages
3. Related Work
4. Methodology and Material
4.1. Corpus
4.2. Data Processing and Augmentation
4.3. Pre-Trained BERT Model
4.4. Experimental Setup
5. Results and Analysis
6. Discussion and Error Analysis
Error Analysis
- LOCATION: The model correctly identifies 6400 location instances (true positives). However, it misclassifies 135 instances as organization, while making no errors toward the other or person classes. For example, in Mardan University (مردان یونیورسٹی), the first token was labeled as location, while یونیورسٹی (university) was tagged O.
- ORGANIZATION: For the organization class, the model correctly identifies 5916 instances. There are 29 instances incorrectly labeled as other and 64 as person, indicating some confusion between organizations and these classes. For instance, Bacha Khan University (باچا خان یونیورسٹی) was tokenized into two separate tokens, باچا خان and یونیورسٹی, which were marked as person and other, respectively.
- PERSON: For the person class, the model correctly classified 9730 instances. However, it incorrectly labeled 80 instances as location, 111 as organization, and 40 as other, indicating some confusion between person and these categories. Examples of persons misclassified due to incorrect tokenization and data scarcity were اقبال لاہوری (Iqbal Lahorey) and شیرپاؤ ہسپتال (Sherpao Hospital).
7. Conclusions and Future Work
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Alshammari, N.; Alanazi, S. The impact of using different annotation schemes on named entity recognition. Egypt. Inform. J. 2021, 22, 295–302. [Google Scholar] [CrossRef]
- Yadav, V.; Bethard, S. A survey on recent advances in named entity recognition from deep learning models. arXiv 2019, arXiv:1910.11470. [Google Scholar]
- Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Abdelmajeed, M.; Sadiq, M.T. Automatic detection of offensive language for Urdu and Roman Urdu. IEEE Access 2020, 8, 91213–91226. [Google Scholar] [CrossRef]
- Sundheim, B.M. Overview of results of the MUC-6 evaluation. In Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD, USA, 6–8 November 1995. [Google Scholar]
- Khan, W.; Daud, A.; Shahzad, K.; Amjad, T.; Banjar, A.; Fasihuddin, H. Named entity recognition using conditional random fields. Appl. Sci. 2022, 12, 6391. [Google Scholar] [CrossRef]
- Khattak, A.; Asghar, M.Z.; Saeed, A.; Hameed, I.A.; Hassan, S.A.; Ahmad, S. A survey on sentiment analysis in Urdu: A resource-poor language. Egypt. Inform. J. 2021, 22, 53–74. [Google Scholar] [CrossRef]
- Khan, I.U.; Khan, A.; Khan, W.; Su’ud, M.M.; Alam, M.M.; Subhan, F.; Asghar, M.Z. A review of Urdu sentiment analysis with multilingual perspective: A case of Urdu and roman Urdu language. Computers 2021, 11, 3. [Google Scholar] [CrossRef]
- Riaz, K. Rule-based named entity recognition in Urdu. In Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden, 16 July 2010. [Google Scholar]
- Malik, M.K.; Sarwar, S.M. Urdu named entity recognition and classification system using conditional random field. Sci. Int. 2015, 5, 4473–4477. [Google Scholar]
- Saha, S.K.; Chatterji, S.; Dandapat, S.; Sarkar, S.; Mitra, P. A hybrid named entity recognition system for south and south east asian languages. In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, 12 January 2008. [Google Scholar]
- Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
- Roberts, A.; Gaizauskas, R.J.; Hepple, M.; Guo, Y. Combining Terminology Resources and Statistical Methods for Entity Recognition: An Evaluation. In Proceedings of the LREC, Miyazaki, Japan, 7–12 May 2008. [Google Scholar]
- Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the CoNLL-2003, Edmonton, AB, Canada, 31 May–1 June 2003. [Google Scholar]
- Shaalan, K.; Raza, H. NERA: Named entity recognition for Arabic. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 1652–1663. [Google Scholar] [CrossRef]
- Singh, U.; Goyal, V.; Lehal, G.S. Named entity recognition system for Urdu. In Proceedings of the COLING 2012, Mumbai, India, 8–15 December 2012. [Google Scholar]
- Mukund, S.; Srihari, R.; Peterson, E. An information-extraction system for Urdu—A resource-poor language. ACM Trans. Asian Lang. Inf. Process. (TALIP) 2010, 9, 1–43. [Google Scholar] [CrossRef]
- Zoya; Latif, S.; Latif, R.; Majeed, H.; Jamail, N.S.M. Assessing Urdu Language Processing Tools via Statistical and Outlier Detection Methods on Urdu Tweets. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–31. [Google Scholar] [CrossRef]
- Çoban, Ö.; Özel, S.A.; İnan, A. Deep learning-based sentiment analysis of Facebook data: The case of Turkish users. Comput. J. 2021, 64, 473–499. [Google Scholar] [CrossRef]
- Haq, R.; Zhang, X.; Khan, W.; Feng, Z. Urdu named entity recognition system using deep learning approaches. Comput. J. 2023, 66, 1856–1869. [Google Scholar] [CrossRef]
- Naz, S.; Umar, A.I.; Razzak, M.I. A hybrid approach for NER system for scarce resourced language-URDU: Integrating n-gram with rules and gazetteers. Mehran Univ. Res. J. Eng. Technol. 2015, 34, 349–358. [Google Scholar]
- Collins, M.; Singer, Y. Unsupervised models for named entity classification. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, USA, 21–22 June 1999. [Google Scholar]
- Capstick, J.; Diagne, A.K.; Erbach, G.; Uszkoreit, H.; Leisenberg, A.; Leisenberg, M. A system for supporting cross-lingual information retrieval. Inf. Process. Manag. 2000, 36, 275–289. [Google Scholar] [CrossRef]
- Jahangir, F.; Anwar, W.; Bajwa, U.I.; Wang, X. N-gram and gazetteer list based named entity recognition for Urdu: A scarce resourced language. In Proceedings of the 10th Workshop on Asian Language Resources, Mumbai, India, 9 December 2012; pp. 95–104. [Google Scholar]
- Kanwal, S.; Malik, K.; Shahzad, K.; Aslam, F.; Nawaz, Z. Urdu named entity recognition: Corpus generation and deep learning applications. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2019, 19, 1–13. [Google Scholar] [CrossRef]
- Gali, K.; Surana, H.; Vaidya, A.; Shishtla, P.M.; Sharma, D.M. Aggregating machine learning and rule based heuristics for named entity recognition. In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, Hyderabad, India, 12 January 2008. [Google Scholar]
- Khan, W.; Daud, A.; Alotaibi, F.; Aljohani, N.; Arafat, S. Deep recurrent neural networks with word embeddings for Urdu named entity recognition. ETRI J. 2020, 42, 90–100. [Google Scholar] [CrossRef]
- Ullah, F.; Zeeshan, M.; Ullah, I.; Alam, M.N.; Al-Absi, A.A. Towards Urdu Name Entity Recognition Using Bi-LSTM-CRF with Self-attention. In Proceedings of the International Conference on Smart Computing and Cyber Security: Strategic Foresight, Security Challenges, and Innovation, Gosung, Republic of Korea, 28–29 October 2021; Springer: Singapore, 2021. [Google Scholar]
- Balouchzahi, F.; Sidorov, G.; Shashirekha, H.L. ADOP FERT-Automatic Detection of Occupations and Profession in Medical Texts using Flair and BERT. In Proceedings of IberLEF@SEPLN 2021, Spain, 2021. Available online: https://www.researchgate.net/publication/354795026_ADOP_FERT-Automatic_Detection_of_Occupations_and_Profession_in_Medical_Texts_using_Flair_and_BERT (accessed on 31 July 2024).
- Sathyanarayanan, D.; Ashok, A.; Mishra, D.; Chimalamarri, S.; Sitaram, D. Kannada Named Entity Recognition and Classification using Bidirectional Long Short-Term Memory Networks. In Proceedings of the 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), Mysuru, India, 14–15 December 2018; pp. 65–71. [Google Scholar]
- Dedes, K.; Utama AB, P.; Wibawa, A.P.; Afandi, A.N.; Handayani, A.N.; Hernandez, L. Neural Machine Translation of Spanish-English Food Recipes Using LSTM. JOIV Int. J. Inform. Vis. 2022, 6, 290–297. [Google Scholar] [CrossRef]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Suleman, M.; Asif, M.; Zamir, T.; Mehmood, A.; Khan, J.; Ahmad, N.; Ahmad, K. Floods Relevancy and Identification of Location from Twitter Posts using NLP Techniques. arXiv 2023, arXiv:2301.00321. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Agrawal, A.; Tripathi, S.; Vardhan, M.; Sihag, V.; Choudhary, G.; Dragoni, N. BERT-based transfer-learning approach for nested named-entity recognition using joint labeling. Appl. Sci. 2022, 12, 976. [Google Scholar] [CrossRef]
- Ullah, F.; Ullah, I.; Kolesnikova, O. Urdu named entity recognition with attention bi-lstm-crf model. In Mexican International Conference on Artificial Intelligence; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 3–17. [Google Scholar]
- Dai, X.; Adel, H. An analysis of simple data augmentation for named entity recognition. arXiv 2020, arXiv:2010.11683. [Google Scholar]
- Daud, A.; Khan, W.; Che, D. Urdu language processing: A survey. Artif. Intell. Rev. 2017, 47, 279–311. [Google Scholar] [CrossRef]
- Feng, X.; Feng, X.; Qin, B.; Feng, Z.; Liu, T. Improving Low Resource Named Entity Recognition using Cross-lingual Knowledge Transfer. IJCAI 2018, 1, 4071–4077. [Google Scholar]
- Jin, G.; Yu, Z. A Korean named entity recognition method using Bi-LSTM-CRF and masked self-attention. Comput. Speech Lang. 2021, 65, 101134. [Google Scholar] [CrossRef]
- Gunawan, W.; Suhartono, D.; Purnomo, F.; Ongko, A. Named-entity recognition for indonesian language using bidirectional lstm-cnns. Procedia Comput. Sci. 2018, 135, 425–432. [Google Scholar] [CrossRef]
- Bayer, M.; Kaufhold, M.-A.; Reuter, C. A survey on data augmentation for text classification. ACM Comput. Surv. 2021, 55, 146. [Google Scholar] [CrossRef]
Label | PER | LOC | ORG | O |
---|---|---|---|---|
Total Entities | 2380 | 1547 | 1545 | 45,220 |
Characteristics | Existing Dataset | Extension | UNER-II |
---|---|---|---|
Total tokens | 50,692 | 109,440 | 160,132 |
Person | 2380 | 47,600 | 49,980 |
Location | 1547 | 30,940 | 32,487 |
Organization | 1545 | 30,900 | 32,445 |
Other | 45,220 | -- | 45,220 |
Model | Epochs | Batch Size | Learning Rate | Optimizer | Dropout Rate | Training Duration (μs) | Computational Resources |
---|---|---|---|---|---|---|---|
Neural Network | 10 | 32 | 0.001 | Adam | 0.2 | 1.0041 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
Recurrent Neural Network | 10 | 32 | 0.001 | Adam | 0.2 | 4.0526 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
BiLSTM | 10 | 32 | 0.001 | Adam | 0.2 | 6.606 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
RoBERTa-urdu-small | 5 | 16 | 2 × 10⁻⁵ | AdamW | N/A | 1.77 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
BERT-large-cased | 5 | 16 | 2 × 10⁻⁵ | AdamW | N/A | 4.29 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
BERT-base-cased | 5 | 16 | 2 × 10⁻⁵ | AdamW | N/A | 1.39 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
BERT-multilingual | 5 | 16 | 2 × 10⁻⁵ | AdamW | N/A | 2.2 × 10⁸ | Tesla K80 12 GB GPU and 32 GB RAM |
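The settings in the table above fall into two families (the RNN-style models and the transformer models). A minimal sketch collecting them into plain dictionaries is given below; the `config_for` helper and the dictionary names are illustrative conveniences, not part of the paper's (unpublished) training scripts.

```python
# Hyperparameters from the experimental setup table, grouped by model
# family. Values mirror the reported settings exactly.

RNN_FAMILY = {"epochs": 10, "batch_size": 32, "learning_rate": 1e-3,
              "optimizer": "Adam", "dropout": 0.2}

TRANSFORMER_FAMILY = {"epochs": 5, "batch_size": 16, "learning_rate": 2e-5,
                      "optimizer": "AdamW", "dropout": None}  # N/A in the table

def config_for(model_name):
    """Map a model name from the table to its reported hyperparameter set."""
    transformer_models = {"RoBERTa-urdu-small", "BERT-large-cased",
                          "BERT-base-cased", "BERT-multilingual"}
    return TRANSFORMER_FAMILY if model_name in transformer_models else RNN_FAMILY

print(config_for("BERT-multilingual")["learning_rate"])  # 2e-05
```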
Study | Dataset | Methods | Precision | Recall | F1-Score |
---|---|---|---|---|---|
 | | MEMM | 0.73 | 0.53 | 0.61 |
 | | CRF | 0.77 | 0.61 | 0.68 |
Kanwal et al. [24] | MkPUCIT | NN | 0.76 | 0.75 | 0.75 |
 | | RNN | 0.76 | 0.79 | 0.77 |
Proposed | UNER (Original Data) | NN | 0.87 | 0.78 | 0.82 |
 | | RNN | 0.86 | 0.83 | 0.84 |
 | | BiLSTM | 0.85 | 0.83 | 0.84 |
 | | RoBERTa-Urdu-small | 0.79 | 0.77 | 0.78 |
 | | BERT-large-cased | 0.73 | 0.67 | 0.70 |
 | | BERT-base-cased | 0.64 | 0.59 | 0.62 |
 | | BERT-multilingual | 0.82 | 0.80 | 0.85 |
 | UNER-II (Augmented Data) | NN | 0.96 | 0.96 | 0.96 |
 | | RNN | 0.96 | 0.96 | 0.97 |
 | | BiLSTM | 0.96 | 0.96 | 0.96 |
 | | RoBERTa-Urdu-small (CWEA) | 0.89 | 0.88 | 0.88 |
 | | BERT-large-cased (CWEA) | 0.87 | 0.86 | 0.92 |
 | | BERT-base-cased (CWEA) | 0.88 | 0.89 | 0.91 |
 | | BERT-multilingual (CWEA) | 0.979 | 0.984 | 0.982 |
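The F1-score is the harmonic mean of precision and recall. As a quick sanity check on the table above, the best reported configuration (BERT-multilingual with CWEA: P = 0.979, R = 0.984) reproduces the reported F1 of 0.982 to within rounding:

```python
# F1 as the harmonic mean of precision and recall.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Best reported configuration: BERT-multilingual with CWEA.
print(f"{f1(0.979, 0.984):.3f}")  # 0.981, i.e. the reported 0.982 within rounding
```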
Classes | Location | Organization | Other | Person |
---|---|---|---|---|
Location | 6400 | 135 | 0 | 0 |
Organization | 0 | 5916 | 29 | 64 |
Person | 80 | 111 | 40 | 9730 |
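Assuming the matrix above is read with rows as true classes and columns as predicted classes (in the order Location, Organization, Other, Person), the per-class recall behind the error analysis can be recovered directly. The short sketch below does this from the reported counts; the row-as-truth reading is an assumption, since the paper's table does not label its axes.

```python
# Per-class recall from the reported confusion matrix. Rows are taken as
# true classes; columns follow the order LOC, ORG, O (other), PER.

matrix = {
    "LOC": [6400, 135, 0, 0],
    "ORG": [0, 5916, 29, 64],
    "PER": [80, 111, 40, 9730],
}
diagonal = {"LOC": 0, "ORG": 1, "PER": 3}  # column index of each true class

for cls, row in matrix.items():
    recall = row[diagonal[cls]] / sum(row)  # correct / all true instances
    print(f"{cls}: recall = {recall:.3f}")
```

For example, the 135 location instances misclassified as organization reduce location recall to 6400 / (6400 + 135) ≈ 0.979.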
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ullah, F.; Gelbukh, A.; Zamir, M.T.; Riverόn, E.M.F.; Sidorov, G. Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu. Computers 2024, 13, 258. https://doi.org/10.3390/computers13100258