Filtered BERT: Similarity Filter-Based Augmentation with Bidirectional Transfer Learning for Protected Health Information Prediction in Clinical Documents
Abstract
1. Introduction
2. Materials and Methods
2.1. Datasets
2.2. Data Preprocessing
2.3. Data Augmentation
2.3.1. Filtered BERT for Augmentation Structure
2.3.2. Filtered BERT-Based Clinical Documents Augmentation
2.4. Named Entity Recognition with BERT
2.4.1. Tokenization and Labeling for the BERT Model
2.4.2. Fine-Tuning BERT
2.5. Evaluation
3. Results
3.1. Data Augmentation
3.2. Results of Named Entity Recognition
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Main Category | Subcategory |
---|---|
NAME | DOCTOR, PATIENT, USERNAME |
PROFESSION | |
LOCATION | HOSPITAL, COUNTRY, ORGANIZATION, ZIP, STREET, CITY, STATE, LOCATION-OTHER |
AGE | |
DATE | |
CONTACT | PHONE, FAX, EMAIL, URL, IPADDR |
ID | MEDICALRECORD, SSN, ACCOUNT, LICENSE, DEVICE, IDNUM, BIOID, HEALTHPLAN, VEHICLE |
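
The category table above can also be read as a lookup structure. The following is a minimal sketch, assuming a plain Python dictionary is an acceptable representation; the names `PHI_CATEGORIES` and `main_category` are hypothetical and do not come from the paper. Categories listed without subcategories (PROFESSION, AGE, DATE) are treated as tags in their own right, which is an assumption of this sketch.

```python
# Main PHI categories and their subcategories, copied from the table above.
# Categories listed without subcategories map to an empty tuple.
PHI_CATEGORIES = {
    "NAME": ("DOCTOR", "PATIENT", "USERNAME"),
    "PROFESSION": (),
    "LOCATION": (
        "HOSPITAL", "COUNTRY", "ORGANIZATION", "ZIP",
        "STREET", "CITY", "STATE", "LOCATION-OTHER",
    ),
    "AGE": (),
    "DATE": (),
    "CONTACT": ("PHONE", "FAX", "EMAIL", "URL", "IPADDR"),
    "ID": (
        "MEDICALRECORD", "SSN", "ACCOUNT", "LICENSE", "DEVICE",
        "IDNUM", "BIOID", "HEALTHPLAN", "VEHICLE",
    ),
}


def main_category(tag: str) -> str:
    """Return the main category for a tag (hypothetical helper).

    A main category without subcategories is assumed to act as its own tag.
    """
    if tag in PHI_CATEGORIES:
        return tag
    for main, subcategories in PHI_CATEGORIES.items():
        if tag in subcategories:
            return main
    raise KeyError(f"unknown PHI tag: {tag}")


print(main_category("ZIP"))
print(main_category("AGE"))
```

A mapping like this makes it straightforward to roll subcategory-level predictions up to the main-category level when comparing results at different granularities.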

Augmentation | Instance | | | | | | | |
---|---|---|---|---|---|---|---|---|
None | He O | has O | h/o O | drug/ETOG O | abuse O | but O | denies O | X O |
Filtered BERT Augmentation | He O | has O | h/o O | drug/ETOG O | abuse O | but O | denied O | X O |

Augmentation | Precision | Recall | F1-Score |
---|---|---|---|
None | 0.8811 | 0.8880 | 0.8845 |
Filtered BERT Augmentation | 0.9265 | 0.9201 | 0.9233 |
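
As a quick consistency check on the table above, the F1-score can be recomputed from the reported precision and recall. This is a minimal sketch, assuming the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` and the `reported` dictionary are illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "None": (0.8811, 0.8880, 0.8845),
    "Filtered BERT Augmentation": (0.9265, 0.9201, 0.9233),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.4f}, reported F1 = {f1:.4f}")
```

For both rows the recomputed value agrees with the reported F1-score to four decimal places.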

Tags | Precision (Before) | Precision (After) | Recall (Before) | Recall (After) | F1-Score (Before) | F1-Score (After) | Support |
---|---|---|---|---|---|---|---|
DOCTOR | 0.73 | 0.88 | 0.76 | 0.90 | 0.75 | 0.89 | 2525 |
PATIENT | 0.80 | 0.89 | 0.74 | 0.89 | 0.76 | 0.89 | 2275 |
USERNAME | 0.95 | 0.98 | 0.85 | 0.91 | 0.90 | 0.95 | 167 |
PROFESSION | 0.08 | 0.29 | 0.26 | 0.34 | 0.12 | 0.31 | 135 |
HOSPITAL | 0.70 | 0.84 | 0.63 | 0.76 | 0.66 | 0.80 | 1665 |
COUNTRY | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 133 |
ORGANIZATION | 0.00 | 0.05 | 0.00 | 0.17 | 0.00 | 0.07 | 88 |
ZIP | 0.92 | 0.98 | 0.77 | 0.99 | 0.84 | 0.99 | 416 |
STREET | 0.28 | 0.86 | 0.32 | 0.74 | 0.30 | 0.79 | 173 |
CITY | 0.46 | 0.69 | 0.49 | 0.52 | 0.47 | 0.59 | 404 |
STATE | 0.60 | 0.83 | 0.94 | 0.84 | 0.73 | 0.83 | 227 |
LOCATION-OTHER | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 16 |
AGE | 0.92 | 0.95 | 0.86 | 0.96 | 0.89 | 0.95 | 621 |
DATE | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 13,024 |
PHONE | 0.77 | 0.90 | 0.69 | 0.86 | 0.73 | 0.88 | 665 |
MEDICALRECORD | 0.96 | 0.95 | 0.95 | 0.98 | 0.95 | 0.97 | 2046 |
DEVICE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 31 |
IDNUM | 0.89 | 0.91 | 0.84 | 0.76 | 0.86 | 0.83 | 675 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kang, M.; Lee, K.H.; Lee, Y. Filtered BERT: Similarity Filter-Based Augmentation with Bidirectional Transfer Learning for Protected Health Information Prediction in Clinical Documents. Appl. Sci. 2021, 11, 3668. https://doi.org/10.3390/app11083668