A Neural N-Gram-Based Classifier for Chinese Clinical Named Entity Recognition
Abstract
1. Introduction
- The same character sequence can correspond to different named entities. For example, “泌尿道感染” (urinary tract infection) could be a disease entity or a symptom entity depending on the context;
- Chinese text has no explicit word boundaries, so the quality of word segmentation significantly affects NER performance. For example, “小腸切除術” (small bowel resection) is a treatment entity if it is treated as one segmentation unit. However, if the word segmentation model splits it into “小腸” (small intestine) and “切除術” (resection), the two parts become a body part entity and a treatment entity, respectively;
- Doctors often use abbreviations and informal variants when writing clinical entities, so the same entity can have multiple expressions. For example, “盲腸炎” and “闌尾炎” can both refer to appendicitis.
- We propose an Att-BiLSTM-CRF model for the Chinese CNER task that combines n-gram character embeddings of different lengths and uses no external knowledge. Unlike other approaches in the literature, which rely on domain-specific resources that may limit their ability to generalize, our model can be applied to other datasets.
- We assess the effectiveness of the proposed model on the CCKS-2017 Shared Task 2 dataset. Our model obtains an F-score of 89.33% and outperforms other competitive methods, including CNN-, BiLSTM- and BERT-based models, whose F-scores range from 87.75% to 88.51%.
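To make the idea of character n-grams of different lengths concrete, the following is a minimal sketch of extracting 1-, 2- and 3-gram features for each character position. The function names, the right-padding scheme and the `PAD` symbol are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Sequence

PAD = "□"  # hypothetical padding symbol, chosen for illustration only


def char_ngrams(text: str, n: int, pad: str = PAD) -> List[str]:
    """Return one n-gram per character position.

    The n-gram at position i starts at character i; the end of the text is
    right-padded so that every position yields exactly n symbols.
    (This alignment scheme is an assumption.)
    """
    padded = text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(text))]


def ngram_features(text: str, orders: Sequence[int] = (1, 2, 3)) -> Dict[int, List[str]]:
    """Collect the n-gram sequences for every requested order."""
    return {n: char_ngrams(text, n) for n in orders}


features = ngram_features("小腸切除術")
for order, grams in features.items():
    print(order, grams)
```

Because every order produces one n-gram per character, the sequences stay aligned with the character-level labels, and no word segmentation step is required.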
2. Related Work
3. The Proposed Approach
3.1. N-Gram Character Embeddings
3.2. Neural Entity Recognition Model
4. Experiments
4.1. Dataset and Evaluation Metrics
4.2. Experiment and Results
- LSTM-CRF: An LSTM neural network model with a CRF layer.
- BiLSTM-CRF: A bidirectional LSTM model with a CRF layer [32].
- RD-CNN-CRF: A residual dilated convolutional neural network with a CRF layer, which uses dictionary features built from drug information from Shanghai Shuguang Hospital and from medical literature [33].
- BERT-BiLSTM-CRF: A pre-trained BERT language model that enhances the semantic representation, followed by a BiLSTM network and a CRF layer [36].
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Skounakis, M.; Craven, M.; Ray, S. Hierarchical hidden markov models for information extraction. IJCAI 2003, 2003, 427–433.
- Kang, T.; Zhang, S.; Tang, Y.; Hruby, G.W.; Rusanov, A.; Elhadad, N.; Weng, C. EliIE: An open-source information extraction system for clinical trial eligibility criteria. J. Am. Med. Inform. Assoc. 2017, 24, 1062–1071.
- Yadav, V.; Bethard, S. A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 21–25 August 2018; pp. 2145–2158.
- Wang, X.; Yang, C.; Guan, R. A comparative study for biomedical named entity recognition. Int. J. Mach. Learn. Cybern. 2018, 9, 373–382.
- Hu, J.; Shi, X.; Liu, Z.; Wang, X.; Chen, Q.; Tang, B. HITSZ_CNER: A Hybrid System for Entity Recognition from Chinese Clinical Text; CEUR Workshop Proceedings: Aachen, Germany, 2017; Volume 1976, pp. 25–30.
- Li, L.; Zhao, J.; Hou, L.; Zhai, Y.; Shi, J.; Cui, F. An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records. BMC Med. Inform. Decis. Mak. 2019, 19, 235.
- Gong, L.; Zhang, Z.; Chen, S. Clinical Named Entity Recognition from Chinese Electronic Medical Records Based on Deep Learning Pretraining. J. Healthc. Eng. 2020, 2020, 8829219.
- Wu, G.; Tang, G.; Wang, Z.; Zhang, Z.; Wang, Z. An Attention-Based BiLSTM-CRF Model for Chinese Clinic Named Entity Recognition. IEEE Access 2019, 7, 113942–113949.
- Zhu, Q.; Li, X.; Conesa, A.; Pereira, C. GRAM-CNN: A deep learning approach with local context for named entity recognition in biomedical text. Bioinformatics 2018, 34, 1547–1554.
- Wang, Q.; Zhou, Y.; Ruan, T.; Gao, D.; Xia, Y.; He, P. Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition. J. Biomed. Inform. 2019, 92, 103133.
- Han, X.; Zhou, F.; Hao, Z.; Liu, Q.; Li, Y.; Qin, Q. MAF-CNER: A Chinese Named Entity Recognition Model Based on Multifeature Adaptive Fusion. Complexity 2021, 2021, 6696064.
- Zeng, Q.T.; Goryachev, S.; Weiss, S.; Sordo, M.; Murphy, S.N.; Lazarus, R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: Evaluation of a natural language processing system. BMC Med. Inform. Decis. Mak. 2006, 6, 30.
- Savova, G.K.; Masanz, J.J.; Ogren, P.V.; Zheng, J.; Sohn, S.; Kipper-Schuler, K.C.; Chute, C.G. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. 2010, 17, 507–513.
- Rindflesch, T.C.; Tanabe, L.; Weinstein, J.N.; Hunter, L. EDGAR: Extraction of drugs, genes and relations from the biomedical literature. Biocomputing 2000, 2000, 517–528.
- Aronson, A.R. Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program. In Proceedings of the AMIA Symposium, Washington, DC, USA, 3–7 November 2001; American Medical Informatics Association: Bethesda, MD, USA, 2001; p. 17.
- Gaizauskas, R.; Demetriou, G.; Humphreys, K. Term recognition and classification in biological science journal articles. In Proceedings of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP, Patras, Greece, 2–4 June 2000.
- McDonald, C.J.; Overhage, J.M.; Tierney, W.M.; Dexter, P.R.; Martin, D.K.; Suico, J.G.; Zafar, A.; Schadow, G.; Blevins, L.; Glazener, T.; et al. The Regenstrief medical record system: A quarter century experience. Int. J. Med. Inform. 1999, 54, 225–253.
- Donnelly, K. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 2006, 121, 279.
- Wang, Y.; Yu, Z.; Chen, L.; Chen, Y.; Liu, Y.; Hu, X.; Jiang, Y. Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: An empirical study. J. Biomed. Inform. 2014, 47, 91–104.
- Ju, Z.; Wang, J.; Zhu, F. Named entity recognition from biomedical text using SVM. In Proceedings of the 2011 5th International Conference on Bioinformatics and Biomedical Engineering, Wuhan, China, 10–12 May 2011; pp. 1–4.
- Yin, W.; Kann, K.; Yu, M.; Schütze, H. Comparative study of CNN and RNN for natural language processing. arXiv 2017, arXiv:1702.01923.
- Li, Z.; Zhang, Q.; Liu, Y.; Feng, D.; Huang, Z. Recurrent Neural Networks with Specialized Word Embedding for Chinese Clinical Named Entity Recognition; CEUR Workshop Proceedings: Aachen, Germany, 2017; Volume 1976, pp. 55–60.
- Ouyang, E.; Li, Y.; Jin, L.; Li, Z.; Zhang, X. Exploring N-Gram Character Presentation in Bidirectional RNN-CRF for Chinese Clinical Named Entity Recognition; CEUR Workshop Proceedings: Aachen, Germany, 2017; Volume 1976, pp. 37–42.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Tan, Z.; Wang, M.; Xie, J.; Chen, Y.; Shi, X. Deep semantic role labeling with self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Ma, Q.; Yan, J.; Lin, Z.; Yu, L.; Chen, Z. Deformable Self-Attention for Text Classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1570–1581.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001.
- Alzaidy, R.; Caragea, C.; Giles, C.L. Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2551–2557.
- Ma, X.; Hovy, E. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv 2016, arXiv:1603.01354.
- Li, X.; Zhang, H.; Zhou, X.H. Chinese clinical named entity recognition with variant neural structures based on BERT methods. J. Biomed. Inform. 2020, 107, 103422.
- Unanue, I.J.; Borzeshi, E.Z.; Piccardi, M. Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition. J. Biomed. Inform. 2017, 76, 102–109.
- Qiu, J.; Wang, Q.; Zhou, Y.; Ruan, T.; Gao, J. Fast and accurate recognition of Chinese clinical named entities with residual dilated convolutions. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018; pp. 935–942.
- Strubell, E.; Verga, P.; Belanger, D.; McCallum, A. Fast and accurate entity recognition with iterated dilated convolutions. arXiv 2017, arXiv:1702.02098.
- Zhao, S.; Cai, Z.; Chen, H.; Wang, Y.; Liu, F.; Liu, A. Adversarial training based lattice LSTM for Chinese clinical named entity recognition. J. Biomed. Inform. 2019, 99, 103290.
- Jiang, S.; Zhao, S.; Hou, K.; Liu, Y.; Zhang, L. A bert-bilstm-crf model for chinese electronic medical records named entity recognition. In Proceedings of the 2019 12th International Conference on Intelligent Computation Technology and Automation (ICICTA), Xiangtan, China, 26–27 October 2019; pp. 166–169.
Character | 左 | 側 | 髖 | 部 | 正 | 常 |
---|---|---|---|---|---|---|
BIO | B-BODY | I-BODY | I-BODY | I-BODY | O | O |
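The BIO labels in the table above can be turned back into entity spans with a short decoding routine. The sketch below is one straightforward way to do this; the function name `bio_to_spans` and the choice to ignore an `I-` tag that has no preceding `B-` tag are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[str, str, int, int]  # (text, entity type, start, end-exclusive)


def bio_to_spans(chars: Sequence[str], tags: Sequence[str]) -> List[Span]:
    """Decode character-level BIO tags into entity spans."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append(("".join(chars[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An "I-" tag without a matching open entity is ignored (assumption).
    return spans


chars = list("左側髖部正常")
tags = ["B-BODY", "I-BODY", "I-BODY", "I-BODY", "O", "O"]
print(bio_to_spans(chars, tags))
```

For the example sentence, the four characters tagged `B-BODY` and `I-BODY` are merged into a single body entity, and the two characters tagged `O` are left out.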
Entity Type | Training Set | Testing Set |
---|---|---|
Body | 10,719 | 3021 |
Exam | 9546 | 3143 |
Disease | 722 | 553 |
Symptom | 7831 | 2311 |
Treatment | 1048 | 465 |
Total | 29,866 | 9493 |
Parameter | Value |
---|---|
n-gram | 1, 2, 3 |
character embedding size | 100 |
LSTM hidden units | 100 |
batch size | 16 |
dropout rate | 0.5 |
learning rate | 0.001 |
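For experimentation, the hyperparameters in the table above could be collected in a single configuration object. The sketch below uses a Python dataclass; the class name `TrainConfig` and the field names are hypothetical, and only the values come from the table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameter values from the table; names are illustrative."""
    ngram_orders: Tuple[int, ...] = (1, 2, 3)
    char_embedding_size: int = 100
    lstm_hidden_units: int = 100
    batch_size: int = 16
    dropout_rate: float = 0.5
    learning_rate: float = 0.001


config = TrainConfig()
print(config)
```

Freezing the dataclass keeps a run's settings immutable, which makes it harder to change a value by accident between training and evaluation.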
Models | P (%) | R (%) | F (%) |
---|---|---|---|
LSTM-CRF | 83.59 | 85.28 | 84.42 |
BiLSTM-CRF | 88.22 | 88.53 | 88.37 |
RD-CNN-CRF | 88.64 | 88.38 | 88.51 |
ID-CNN-CRF | 88.30 | 87.21 | 87.75 |
BERT-BiLSTM-CRF | 86.50 | 90.48 | 88.45 |
Our model | 88.53 | 90.13 | 89.33 |
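As a quick consistency check on the table above, the F column can be recomputed from the P and R columns. The sketch below assumes that F is the balanced F1-score, the harmonic mean of precision and recall; the source only calls it an F-score, so this reading is an assumption that the numbers happen to support.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the results table, all in percent.
RESULTS = {
    "LSTM-CRF": (83.59, 85.28, 84.42),
    "BiLSTM-CRF": (88.22, 88.53, 88.37),
    "RD-CNN-CRF": (88.64, 88.38, 88.51),
    "ID-CNN-CRF": (88.30, 87.21, 87.75),
    "BERT-BiLSTM-CRF": (86.50, 90.48, 88.45),
    "Our model": (88.53, 90.13, 89.33),
}

for name, (p, r, reported) in RESULTS.items():
    print(f"{name:16s} reported F = {reported:.2f}, recomputed F = {f_score(p, r):.2f}")
```

The recomputed values agree with the reported ones to within about 0.01, which is the size of discrepancy expected when P and R have themselves been rounded to two decimals.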
Models | P (%) | R (%) | F (%) |
---|---|---|---|
1-g | 87.88 | 89.98 | 88.92 |
2-g | 87.30 | 90.47 | 88.86 |
3-g | 87.21 | 89.84 | 88.50 |
Our model | 88.53 | 90.13 | 89.33 |
Models | Body | Exam | Disease | Symptom | Treatment |
---|---|---|---|---|---|
1-g | 84.47 | 92.87 | 75.76 | 95.04 | 74.75 |
2-g | 84.06 | 93.22 | 75.55 | 95.13 | 73.71 |
3-g | 83.39 | 92.99 | 75.13 | 94.82 | 74.27 |
Our model | 83.99 | 93.82 | 77.89 | 95.23 | 75.88 |
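To see where combining n-gram lengths helps, the combined model's score for each entity type can be compared with the best single n-gram model. The sketch below does this arithmetic on the per-entity table above; the helper name `gain_over_best_single` is hypothetical.

```python
from typing import Dict

# Per-entity scores copied from the table above.
SCORES: Dict[str, Dict[str, float]] = {
    "1-g": {"Body": 84.47, "Exam": 92.87, "Disease": 75.76, "Symptom": 95.04, "Treatment": 74.75},
    "2-g": {"Body": 84.06, "Exam": 93.22, "Disease": 75.55, "Symptom": 95.13, "Treatment": 73.71},
    "3-g": {"Body": 83.39, "Exam": 92.99, "Disease": 75.13, "Symptom": 94.82, "Treatment": 74.27},
    "Our model": {"Body": 83.99, "Exam": 93.82, "Disease": 77.89, "Symptom": 95.23, "Treatment": 75.88},
}


def gain_over_best_single(table: Dict[str, Dict[str, float]],
                          combined: str = "Our model") -> Dict[str, float]:
    """Difference between the combined model and the best single n-gram model."""
    singles = [name for name in table if name != combined]
    gains = {}
    for entity, score in table[combined].items():
        best_single = max(table[name][entity] for name in singles)
        gains[entity] = round(score - best_single, 2)
    return gains


gains = gain_over_best_single(SCORES)
for entity, gain in gains.items():
    print(f"{entity:10s} {gain:+.2f}")
```

On these numbers the largest gains are for Disease and Treatment, while Body is the only entity type for which the combined model does not exceed the best single n-gram model.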
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lin, C.-S.; Jwo, J.-S.; Lee, C.-H. A Neural N-Gram-Based Classifier for Chinese Clinical Named Entity Recognition. Appl. Sci. 2021, 11, 8682. https://doi.org/10.3390/app11188682