An Improved Word Representation for Deep Learning Based NER in Indian Languages
Abstract
1. Introduction
2. Related Works
3. Dataset Details
4. Proposed Method
4.1. Generalized Word Representation
4.2. Convnet Based Character Level Word Representation
4.3. Pre-Trained Word Representation
4.4. Affix Level Word Representation
5. Experiments and Results
- Baseline: Since one of the best results in the shared-task competition was reported by a CRF model, we consider it as the baseline.
- Pre-trained + Bi-LSTM-CRF: The method using only the pre-trained word embedding as input to the Bi-LSTM-CRF model.
- Pre-trained + char_Convnet + Bi-LSTM-CRF: The method using the concatenation of the pre-trained word embedding and the character-based word composition vector as input to the Bi-LSTM-CRF model.
- Pre-trained + char_Convnet + affix embedding + Bi-LSTM-CRF: The method using the concatenation of the pre-trained word embedding, the character-based word embedding, and the affix embedding as input to the Bi-LSTM-CRF model (a sketch of this combined representation appears after this list).
- Deep learning in the competition: The best-performing deep learning methods reported in the competition.
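To make the combined input concrete, below is a minimal PyTorch sketch of how the three embeddings can be concatenated and encoded; it is an illustration under stated assumptions, not the authors' implementation. The 300-dimensional pre-trained embedding and the 510-dimensional total come from the settings listed further down; the split of the remaining 210 dimensions into a 150-dimensional character-ConvNet vector plus two 30-dimensional affix (prefix and suffix) embeddings is assumed, as are all class, argument, and shape names. The CRF output layer is abstracted as a linear projection to per-tag scores.

```python
import torch
import torch.nn as nn

class CombinedWordRepresentation(nn.Module):
    """Sketch: pre-trained word embedding + char-ConvNet vector + affix embeddings -> Bi-LSTM."""

    def __init__(self, word_vocab, char_vocab, affix_vocab, num_tags,
                 word_dim=300, char_dim=30, char_filters=150,
                 affix_dim=30, hidden_size=300):
        super().__init__()
        # Pre-trained (e.g., fastText) vectors would be copied into this table before training.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_conv = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.affix_emb = nn.Embedding(affix_vocab, affix_dim)  # shared table for prefix/suffix ids
        self.dropout = nn.Dropout(0.5)
        # 300 + 150 + 30 + 30 = 510-dimensional word representation (assumed split).
        self.bilstm = nn.LSTM(word_dim + char_filters + 2 * affix_dim,
                              hidden_size, bidirectional=True, batch_first=True)
        # A CRF layer would normally decode these per-tag scores.
        self.to_tags = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, words, chars, prefixes, suffixes):
        # words, prefixes, suffixes: (batch, seq); chars: (batch, seq, max_chars)
        b, s, c = chars.shape
        ch = self.char_emb(chars).view(b * s, c, -1).transpose(1, 2)  # (b*s, char_dim, max_chars)
        ch, _ = self.char_conv(ch).max(dim=2)                         # max-pool over characters
        ch = ch.view(b, s, -1)
        x = torch.cat([self.word_emb(words), ch,
                       self.affix_emb(prefixes), self.affix_emb(suffixes)], dim=-1)
        out, _ = self.bilstm(self.dropout(x))
        return self.to_tags(out)                                      # (batch, seq, num_tags)
```

All experiments were run with the following hyperparameter settings: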
- Bi-LSTM hidden size = 300
- Maximum number of epochs = 100
- Early stopping = 30
- Dropout = 0.5
- Pre-trained word embedding size = 300
- Optimizer = Adam
- Batch size = 100
- Initial learning rate =
- Total word representation size = 510
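For completeness, here is a hedged sketch of a training loop consistent with these settings, reusing the `CombinedWordRepresentation` class sketched above. The vocabulary sizes, tag count, and the `train_one_epoch` and `evaluate` routines are hypothetical placeholders; the early-stopping value is read as a patience of 30 epochs, and the 1e-3 learning rate is only Adam's usual default, since no initial learning rate is given above.

```python
import torch

# Illustrative sizes; real values depend on the language and dataset.
model = CombinedWordRepresentation(word_vocab=90000, char_vocab=200,
                                   affix_vocab=5000, num_tags=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr assumed, not reported

best_dev, wait = 0.0, 0
max_epochs, patience = 100, 30  # "Maximum number of epochs" and "early stopping" above
for epoch in range(max_epochs):
    train_one_epoch(model, optimizer, batch_size=100)  # hypothetical user-supplied routine
    dev_acc = evaluate(model)                          # hypothetical user-supplied routine
    if dev_acc > best_dev:
        best_dev, wait = dev_acc, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        wait += 1
        if wait >= patience:  # stop after 30 epochs without improvement
            break
```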
5.1. Impact of Character-Based Word Embedding
5.2. Impact of Affix Embeddings
5.3. Impact of Training Data Size
5.4. Analysis
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Patil, N.; Patil, A.S.; Pawar, B. Survey of named entity recognition systems with respect to Indian and foreign languages. Int. J. Comput. Appl. 2016, 134, 21–26. [Google Scholar] [CrossRef]
- Bindu, M.; Idicula, S.M. Named Entity Identifier for Malayalam Using Linguistic Principles Employing Statistical Methods. Int. J. Comput. Sci. Issues 2011, 8, 185. [Google Scholar]
- Wu, D.; Zhang, Y.; Zhao, S.; Liu, T. Identification of web query intent based on query text and web knowledge. In Proceedings of the 2010 First International Conference on Pervasive Computing, Signal Processing and Applications, Harbin, China, 17–19 September 2010; pp. 128–131. [Google Scholar]
- Etaiwi, W.; Awajan, A.; Suleiman, D. Statistical Arabic Name Entity Recognition Approaches: A Survey. Procedia Comput. Sci. 2017, 113, 57–64. [Google Scholar] [CrossRef]
- Amato, F.; Colace, F.; Greco, L.; Moscato, V.; Picariello, A. Semantic processing of multimedia data for e-government applications. J. Vis. Lang. Comput. 2016, 32, 35–41. [Google Scholar] [CrossRef]
- Fantacci, R.; Gei, F.; Marabissi, D.; Micciullo, L. The Use of Social Networks in Emergency Management. In Wireless Public Safety Networks 2; Elsevier: Amsterdam, The Netherlands, 2016; pp. 25–61. [Google Scholar]
- Kokkinogenis, Z.; Filguieras, J.; Carvalho, S.; Sarmento, L.; Rossetti, R.J. Mobility network evaluation in the user perspective: Real-time sensing of traffic information in twitter messages. In Advances in Artificial Transportation Systems and Simulation; Elsevier: Amsterdam, The Netherlands, 2015; pp. 219–234. [Google Scholar]
- Barathi Ganesh, H.; Soman, K.; Reshma, U.; Mandar, K.; Prachi, M.; Gouri, K.; Anitha, K.; Anand Kumar, M. Overview of Arnekt IECSIL at FIRE-2018 track on information extraction for conversational systems in Indian languages. In Proceedings of the 10th Annual Meeting of the Forum for Information Retrieval Evaluation, Gandhinagar, India, 6–9 December 2018; pp. 18–20. [Google Scholar]
- Zamora, J. Rise of the chatbots: Finding a place for artificial intelligence in India and US. In Proceedings of the 22nd International Conference on Intelligent User Interfaces Companion, Limassol, Cyprus, 13–16 March 2017; pp. 109–112. [Google Scholar]
- Murthy, R.; Khapra, M.M.; Bhattacharyya, P. Improving NER Tagging Performance in Low-Resource Languages via Multilingual Learning. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2018, 18, 9. [Google Scholar] [CrossRef]
- Murthy, V.R.; Bhattacharyya, P. A deep learning solution to Named Entity Recognition. In International Conference on Intelligent Text Processing and Computational Linguistics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 427–438. [Google Scholar]
- Kaur, K. Khushleen@IECSIL-FIRE-2018: Indic Language Named Entity Recognition Using Bidirectional LSTMs with Subword Information. In Proceedings of the 10th Annual Meeting of the Forum for Information Retrieval Evaluation, Gandhinagar, India, 6–9 December 2018. [Google Scholar]
- Thenmozhi, D.; Kumar, B.S.; Aravindan, C. SSN_NLP@ IECSIL-FIRE-2018: Deep Learning Approach to Named Entity Recognition and Relation Extraction for Conversational Systems in Indian Languages; Department of CSE, SSN College of Engineering: Chennai, India, 2018. [Google Scholar]
- Sagar, S.P.; Gollakota, R.K.; Das, A. HiLT@ IECSIL-FIRE-2018: A Named Entity Recognition System for Indian Languages; Indian Institute of Information Technology: Sri City, India, 2018. [Google Scholar]
- Gupta, A.; Ayyar, M.; Singh, A.K.; Shah, R.R. raiden11@ IECSIL-FIRE-2018: Named Entity Recognition For Indian Languages. In Proceedings of the 10th Annual Meeting of the Forum for Information Retrieval Evaluation, Gandhinagar, India, 6–9 December 2018. [Google Scholar]
- Segura Bedmar, I.; Martínez, P.; Herrero Zazo, M. Semeval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (Ddiextraction 2013). In Proceedings of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
- Bossy, R.; Golik, W.; Ratkovic, Z.; Bessières, P.; Nédellec, C. Bionlp shared task 2013—An overview of the bacteria biotope task. In Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013; pp. 161–169. [Google Scholar]
- Uzuner, Ö.; South, B.R.; Shen, S.; DuVall, S.L. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med Inf. Assoc. 2011, 18, 552–556. [Google Scholar] [CrossRef] [Green Version]
- Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
- Ma, X.; Hovy, E. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv 2016, arXiv:1603.01354. [Google Scholar]
- Santos, C.N.D.; Guimaraes, V. Boosting named entity recognition with neural character embeddings. arXiv 2015, arXiv:1505.05008. [Google Scholar]
- Bharadwaj, A.; Mortensen, D.; Dyer, C.; Carbonell, J. Phonologically aware neural model for named entity recognition in low resource transfer settings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1462–1472. [Google Scholar]
- Santos, C.D.; Zadrozny, B. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21–26 June 2014; pp. 1818–1826. [Google Scholar]
- Ling, W.; Luís, T.; Marujo, L.; Astudillo, R.F.; Amir, S.; Dyer, C.; Black, A.W.; Trancoso, I. Finding function in form: Compositional character models for open vocabulary word representation. arXiv 2015, arXiv:1508.02096. [Google Scholar]
- Yadav, V.; Sharp, R.; Bethard, S. Deep affix features improve neural named entity recognizers. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA, 5–6 June 2018; pp. 167–172. [Google Scholar]
- Nair, R.S.S. A Grammar of Malayalam. Available online: http://www.languageinindia.com/nov2012/ravisankarmalayalamgrammar.pdf (accessed on 12 June 2018).
- Hamada, A.; Nayel, H.L.S. Improving NER for Clinical Texts by Ensemble Approach Using Segment Representations. In Proceedings of the ICON 2017 (NLPAI), Calcutta, India, 18–21 December 2017; pp. 197–204. [Google Scholar]
- Cohen, W.W.; Sarawagi, S. Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods. In Proceedings of the Tenth Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 89–98. [Google Scholar]
- Wang, X.; Jiang, X.; Liu, M.; He, T.; Hu, X. Bacterial named entity recognition based on dictionary and conditional random field. In Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 13–16 November 2017; pp. 439–444. [Google Scholar]
- Eftimov, T.; Seljak, B.K.; Korošec, P. A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PLoS ONE 2017, 12, e0179488. [Google Scholar] [CrossRef]
- Alfred, R.; Leong, L.C.; On, C.K.; Anthony, P.; Fun, T.S.; Razali, M.N.B.; Hijazi, M.H.A. A rule-based named-entity recognition for malay articles. In Proceedings of the International Conference on Advanced Data Mining and Applications, Hangzhou, China, 14–16 December 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 288–299. [Google Scholar]
- Wu, Y.; Jiang, M.; Xu, J.; Zhi, D.; Xu, H. Clinical Named Entity Recognition Using Deep Learning Models. In Proceedings of the AMIA Annual Symposium Proceedings, Washington, DC, USA, 4–8 November 2017. [Google Scholar]
- Salini, A.; Jeyapriya, U. Named Entity Recognition Using Machine Learning Approaches. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
- Zhang, L.; Pan, Y.; Zhang, T. Focused named entity recognition using machine learning. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, UK, 25–29 July 2004; ACM: New York, NY, USA, 2004; pp. 281–288. [Google Scholar]
- Sienčnik, S.K. Adapting word2vec to named entity recognition. In Proceedings of the 20th Nordic Conference of Computational Linguistics, Nodalida 2015, Vilnius, Lithuania, 11–13 May 2015; pp. 239–243. [Google Scholar]
- Nita, P.; Ajay, S.; Patil, B.P. Hybrid Approach for Marathi Named Entity Recognition. In Proceedings of the ICON 2017 (NLPAI), Calcutta, India, 18–21 December 2017; pp. 103–111. [Google Scholar]
- Zhou, G.; Su, J. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 473–480. [Google Scholar]
- Malouf, R. Markov models for language-independent named entity recognition. In Proceedings of the COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Stroudsburg, PA, USA, 31 August 2002. [Google Scholar]
- Carreras, X.; Màrquez, L.; Padró, L. Named entity extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning 2002 (CoNLL-2002), Stroudsburg, PA, USA, 31 August 2002. [Google Scholar]
- Li, Y.; Li, W.; Sun, F.; Li, S. Component-enhanced chinese character embeddings. arXiv 2015, arXiv:1508.06669. [Google Scholar]
- Yin, R.; Wang, Q.; Li, P.; Li, R.; Wang, B. Multi-granularity chinese word embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 981–986. [Google Scholar]
- Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
- Chalapathy, R.; Borzeshi, E.Z.; Piccardi, M. Bidirectional LSTM-CRF for clinical concept extraction. arXiv 2016, arXiv:1611.08373. [Google Scholar]
- Plank, B.; Søgaard, A.; Goldberg, Y. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. arXiv 2016, arXiv:1604.05529. [Google Scholar]
- Xu, K.; Zhou, Z.; Hao, T.; Liu, W. A bidirectional LSTM and conditional random fields approach to medical named entity recognition. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics, Cairo, Egypt, 9–11 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 355–365. [Google Scholar]
- Kim, Y.; Jernite, Y.; Sontag, D.; Rush, A.M. Character-Aware Neural Language Models. In Proceedings of the Thirtieth AAAI Conference (AAAI-16), Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
- Dong, C.; Zhang, J.; Zong, C.; Hattori, M.; Di, H. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In Natural Language Understanding and Intelligent Applications; Springer: Berlin/Heidelberg, Germany, 2016; pp. 239–250. [Google Scholar]
- Zhang, Y.; Yang, J. Chinese ner using lattice lstm. arXiv 2018, arXiv:1805.02023. [Google Scholar]
- Yang, J.; Zhang, Y.; Liang, S. Subword encoding in lattice lstm for chinese word segmentation. arXiv 2018, arXiv:1810.12594. [Google Scholar]
- Kuru, O.; Can, O.A.; Yuret, D. Charner: Character-level named entity recognition. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 911–921. [Google Scholar]
- Limsopatham, N.; Collier, N.H. Bidirectional LSTM for named entity recognition in Twitter messages. In Proceedings of the 2nd Workshop on Noisy User-generated Text, Osaka, Japan, 11 December 2016. [Google Scholar]
- Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. arXiv 2016, arXiv:1603.01360. [Google Scholar]
- Bhattu, S.N.; Krishna, N.S.; Somayajulu, D. idrbt-team-a@ IECSIL-FIRE-2018: Named Entity Recognition of Indian languages using Bi-LSTM. In Proceedings of the Working Notes of FIRE 2018-Forum for Information Retrieval Evaluation, Gandhinagar, India, 6–9 December 2018. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
- Barathi Ganesh, H.; Soman, K.; Reshma, U.; Mandar, K.; Prachi, M.; Gouri, K.; Anitha, K. Information Extraction for Conversational Systems in Indian Languages-Arnekt IECSIL. In Proceedings of the Forum for Information Retrieval Evaluation, Gandhinagar, India, 7–9 December 2018. [Google Scholar]
- Forum for Information Retrieval Evaluation. Available online: http://fire.irsi.res.in/fire/2019/home (accessed on 2 February 2018).
- Skymind. A Beginner’s Guide to Neural Networks and Deep Learning. 2017. Available online: https://skymind.ai/wiki/neural-network (accessed on 14 November 2018).
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Twenty-eighth Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Na, S.H.; Kim, H.; Min, J.; Kim, K. Improving LSTM CRFs using character-based compositions for Korean named entity recognition. Comput. Speech Lang. 2019, 54, 106–121. [Google Scholar] [CrossRef]
- Klein, D.; Smarr, J.; Nguyen, H.; Manning, C.D. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003-Volume 4, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 180–183. [Google Scholar]
- Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning Word Vectors for 157 Languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Yu, X.; Faleńska, A.; Vu, N.T. A general-purpose tagger with convolutional neural networks. arXiv 2017, arXiv:1706.01723. [Google Scholar]
- Ajees, A.; Idicula, S.M. CUSAT TEAM@ IECSIL-FIRE-2018: A Named Entity Recognition System for Indian Languages. In Proceedings of the Working Notes of FIRE 2018 - Forum for Information Retrieval Evaluation, Gandhinagar, India, 6–9 December 2018. [Google Scholar]
| Dataset | File Type | File Length | # Sentences | # Words | # Unique Words |
|---|---|---|---|---|---|
| Hindi | Train | 1,548,570 | 76,537 | 1,472,033 | 87,842 |
| Hindi | Test | 519,115 | 25,513 | 493,602 | 43,797 |
| Kannada | Train | 318,356 | 20,536 | 297,820 | 73,712 |
| Kannada | Test | 107,325 | 6,846 | 100,479 | 34,200 |
| Malayalam | Train | 903,521 | 65,188 | 838,333 | 143,990 |
| Malayalam | Test | 301,860 | 21,730 | 280,130 | 67,361 |
| Tamil | Train | 1,626,260 | 134,030 | 1,492,230 | 185,926 |
| Tamil | Test | 542,225 | 44,677 | 497,548 | 89,529 |
| Telugu | Train | 840,904 | 63,223 | 777,681 | 108,059 |
| Telugu | Test | 280,533 | 21,075 | 259,458 | 51,555 |
| Language | Event | Things | Org | Occupation | Name | Location | Other | Average Presence |
|---|---|---|---|---|---|---|---|---|
| Hindi | 99.69 | 99.33 | 99.23 | 99.48 | 94.96 | 98.91 | 96.38 | 98.28 |
| Kannada | 98.85 | 97.11 | 96.85 | 96.92 | 89.17 | 96.94 | 89.40 | 95.03 |
| Malayalam | 94.86 | 96.65 | 97.17 | 95.72 | 90.71 | 96.52 | 86.14 | 93.96 |
| Tamil | 98.34 | 98.30 | 97.95 | 96.93 | 91.72 | 95.13 | 93.05 | 95.91 |
| Telugu | 98.90 | 99.16 | 98.72 | 98.72 | 83.65 | 99.15 | 93.48 | 95.96 |
| Class Avg. | 98.12 | 98.11 | 98.00 | 97.55 | 90.04 | 97.33 | 91.69 | 95.83 |
| Model | Hindi | Kannada | Malayalam | Tamil | Telugu | Average |
|---|---|---|---|---|---|---|
| Baseline (CRF) [64] | 97.67 | 97.03 | 97.44 | 97.36 | 97.72 | 97.44 |
| Bhattu et al. [53] | 97.82 | 97.04 | 97.46 | 97.41 | 97.54 | 97.45 |
| Khushleen Kaur [12] | 96.84 | 96.38 | 96.64 | 96.15 | 96.63 | 96.53 |
| Thenmozhi et al. [13] | 96.73 | 95.63 | 95.87 | 95.55 | 96.77 | 96.11 |
| Sagar et al. [14] | 94.44 | 92.94 | 92.92 | 92.48 | 92.42 | 93.04 |
| Gupta et al. [15] | 91.52 | 92.14 | 90.27 | 87.72 | 90.02 | 90.33 |
| Our model | 98.44 | 97.62 | 98.25 | 98.35 | 98.41 | 98.21 |
| Representation | Hindi | Kannada | Malayalam | Tamil | Telugu | Average |
|---|---|---|---|---|---|---|
| FastText | 96.87 | 96.41 | 96.68 | 96.22 | 96.66 | 96.57 |
| FastText+char_ConvNet | 97.98 | 97.30 | 97.76 | 97.61 | 97.84 | 97.69 |
| FastText+char_ConvNet+Affix | 98.44 | 97.62 | 98.25 | 98.35 | 98.41 | 98.21 |
| Training Data Size | Representation | Average Accuracy (%) |
|---|---|---|
| Complete (100%) | FastText | 96.57 |
| Complete (100%) | FastText+char_ConvNet | 97.69 |
| Complete (100%) | FastText+char_ConvNet+Affix | 98.21 |
| 80% | FastText | 95.83 |
| 80% | FastText+char_ConvNet | 96.87 |
| 80% | FastText+char_ConvNet+Affix | 97.51 |
| 60% | FastText | 95.61 |
| 60% | FastText+char_ConvNet | 96.70 |
| 60% | FastText+char_ConvNet+Affix | 95.61 |
| 40% | FastText | 95.29 |
| 40% | FastText+char_ConvNet | 96.44 |
| 40% | FastText+char_ConvNet+Affix | 96.98 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).