Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing
Abstract
Featured Application
1. Introduction
2. Materials and Methods
2.1. Data
2.2. SPPMI-Based Keyword Extraction
3. Results
3.1. Experiment Setting
3.2. Keyword Extraction Precision
3.3. Smoking Status Classification
3.4. Frequency Distribution of the Expanded Keywords
4. Discussion
4.1. Limitations of Pre-Trained Language Models
4.2. Implications to Bilingual EHRs
4.3. Strengths and Limitations
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Smoking Status | Family Medicine | Pulmonary and Critical Care Medicine | Total |
---|---|---|---|
Current smokers | 1046 | 84 | 1130 |
Past smokers | 547 | 431 | 978 |
Never smokers | 399 | 144 | 543 |
Unknown | 1520 | 540 | 2060 |
Total | 3512 | 1199 | 4711 |
Never Smoker | Past Smoker | Current Smoker |
---|---|---|
smk never | smk ex | current smoker |
smk negative | smoker ya | smk yr |
비흡연 (non-smoking) | 금연 중 (currently abstaining from smoking) | smoking |
Method / # of Keywords | Top 1 | Top 5 | Top 10 | Top 20 |
---|---|---|---|---|
Word co-occurrence | 38.00% | 35.60% | 30.40% | 29.50% |
PMI vector | 42.00% | 36.40% | 31.20% | 27.40% |
NPMI vector | 40.00% | 35.60% | 32.00% | 27.70% |
PMI score | 38.00% | 36.80% | 37.00% | 34.50% |
NPMI score | 36.00% | 38.40% | 37.40% | 34.20% |
SPPMI ( | 42.00% | 36.40% | 31.20% | 27.40% |
SPPMI ( | 38.00% | 34.80% | 31.80% | 27.40% |
SPPMI ( | 36.00% | 30.40% | 27.00% | 27.20% |
SPPMI-SVD ( | 46.00% | 40.80% | 38.40% | 31.50% |
SPPMI-SVD ( | 46.00% | 41.20% | 35.40% | 29.80% |
SPPMI-SVD ( | 50.00% | 38.80% | 31.20% | 29.50% |
SPPMI-SVD ( | 56.00% | 37.60% | 32.00% | 29.30% |
SPPMI-SVD ( | 42.00% | 36.00% | 32.80% | 30.00% |
SPPMI-SVD ( | 42.00% | 34.40% | 32.80% | 30.10% |
SPPMI-SVD ( | 56.00% | 40.40% | 33.40% | 30.60% |
SPPMI-SVD ( | 44.00% | 32.40% | 29.80% | 29.30% |
SPPMI-SVD ( | 44.00% | 33.20% | 31.00% | 28.50% |
word2vec (c = 2, d = 100) | 10.00% | 6.80% | 7.60% | 7.20% |
word2vec (c = 4, d = 100) | 11.11% | 8.00% | 6.20% | 5.20% |
word2vec (c = 6, d = 100) | 8.82% | 5.20% | 4.60% | 4.00% |
word2vec (c = 2, d = 200) | 10.00% | 9.60% | 8.40% | 7.20% |
word2vec (c = 4, d = 200) | 9.09% | 7.20% | 6.40% | 4.90% |
word2vec (c = 6, d = 200) | 8.00% | 4.80% | 4.60% | 4.20% |
word2vec (c = 2, d = 300) | 16.00% | 9.60% | 7.40% | 5.90% |
word2vec (c = 4, d = 300) | 8.00% | 6.80% | 5.20% | 3.80% |
word2vec (c = 6, d = 300) | 8.00% | 5.20% | 4.60% | 4.60% |
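
The SPPMI and SPPMI-SVD rows in the table above can be illustrated with a minimal sketch. It assumes the commonly used definition SPPMI(w, c) = max(PMI(w, c) − log k, 0) and a symmetric truncated SVD (U·√Σ) for the dense vectors; the article's exact preprocessing, shift values, and dimensions are not recoverable from this page. The function names, the toy vocabulary, and the toy co-occurrence counts are all hypothetical and serve only as an illustration.

```python
import numpy as np


def sppmi_matrix(cooc, k=1.0):
    """Shifted positive PMI from a word-by-context co-occurrence count matrix.

    Assumes SPPMI(w, c) = max(PMI(w, c) - log(k), 0).
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)  # marginal counts of words
    col = cooc.sum(axis=0, keepdims=True)  # marginal counts of contexts
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = -np.inf  # zero counts carry no association
    return np.maximum(pmi - np.log(k), 0.0)


def sppmi_svd_embeddings(sppmi, d):
    """Dense word vectors from a truncated SVD of the SPPMI matrix (assumed U * sqrt(S))."""
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    d = min(d, len(s))
    return u[:, :d] * np.sqrt(s[:d])


def top_keywords(vectors, vocab, seed, n=5):
    """Return the n words most cosine-similar to a seed keyword, excluding the seed."""
    idx = vocab.index(seed)
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    unit = vectors / norms[:, None]
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # never return the seed itself
    order = np.argsort(-sims)[:n]
    return [vocab[i] for i in order]


# Hypothetical toy vocabulary and symmetric co-occurrence counts (not from the article).
vocab = ["smk", "never", "ex", "quit", "fever"]
cooc = np.array([
    [0, 8, 6, 5, 1],
    [8, 0, 1, 1, 0],
    [6, 1, 0, 7, 0],
    [5, 1, 7, 0, 1],
    [1, 0, 0, 1, 0],
])

sppmi = sppmi_matrix(cooc, k=1.0)
embeddings = sppmi_svd_embeddings(sppmi, d=2)
expanded = top_keywords(embeddings, vocab, seed="smk", n=2)
```

In this sketch a larger shift k zeroes out more weak associations, and a smaller d compresses the SPPMI rows more aggressively; both would need tuning on real notes.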
Never Smoker | Past Smoker | Current Smoker |
---|---|---|
흡연 negative (smoking negative) | 년전 ppd (years ago, ppd) | 아직 담배 (still, cigarettes) |
s negative | smoking ya | 못 끊었어요 (could not quit) |
never smoker | quit since | still smoking |
Method / # of Extracted Keywords | 1 | 5 | 10 | 20 |
---|---|---|---|---|
Bag of Words | 90.35% | | | |
LSA | 49.63% | 55.04% | 57.79% | 61.93% |
LDA | 43.69% | 43.69% | 43.69% | 43.69% |
SPPMI ( | 90.67% | 90.99% | 91.30% | 91.30% |
SPPMI ( | 91.20% | 89.93% | 91.52% | 91.83% |
SPPMI ( | 90.67% | 91.20% | 90.99% | 91.41% |
SPPMI-SVD ( | 90.88% | 90.35% | 90.56% | 91.09% |
SPPMI-SVD ( | 90.88% | 90.88% | 91.09% | 91.94% |
SPPMI-SVD ( | 90.46% | 91.09% | 92.15% | 92.79% |
SPPMI-SVD ( | 90.56% | 91.41% | 90.46% | 91.41% |
SPPMI-SVD ( | 90.56% | 91.30% | 91.41% | 91.83% |
SPPMI-SVD ( | 90.56% | 90.88% | 91.94% | 91.52% |
SPPMI-SVD ( | 91.09% | 90.56% | 90.35% | 91.73% |
SPPMI-SVD ( | 90.24% | 90.56% | 91.30% | 91.09% |
SPPMI-SVD ( | 90.77% | 90.77% | 91.30% | 91.30% |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bae, Y.S.; Kim, K.H.; Kim, H.K.; Choi, S.W.; Ko, T.; Seo, H.H.; Lee, H.-Y.; Jeon, H. Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing. Appl. Sci. 2021, 11, 8812. https://doi.org/10.3390/app11198812