Synthesizing Electronic Health Records for Predictive Models in Low-Middle-Income Countries (LMICs)
Abstract
1. Introduction
- Deep generative models for LMICs. We demonstrate, for the first time, that generative models can synthesize data for developing ML models from small datasets collected in LMIC healthcare settings.
- Comprehensive data utility evaluation. We compare the utility of the synthetic data against other commonly used approaches and show that models trained on synthetic data perform better. In a series of experiments that vary the size of the synthetic training set, we also show how the amount of synthetic tabular data affects the performance of the predictive model.
- Interpretability analysis. We conduct a post-hoc SHapley Additive exPlanations (SHAP) analysis to investigate how the choice of training set affects feature importance in the test-set predictions, a new approach for evaluating deep generative models for EHRs.
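
The data utility evaluation described in the second contribution can be sketched as a train-on-synthetic, test-on-real loop over several synthetic training sizes. The sketch below is only illustrative and rests on several assumptions that are not taken from the paper: the "real" cohort is toy data, the generator is a simple class-conditional Gaussian sampler standing in for a deep generative model, the classifier is a small hand-written k-nearest-neighbour scorer, and the downstream model is trained on synthetic rows only. All function names are hypothetical.

```python
import random
from statistics import mean, stdev


def fit_class_gaussians(X, y):
    """Stand-in generator (assumption): per-class feature means/stds and class frequency."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        params[label] = ([mean(c) for c in cols],
                         [stdev(c) for c in cols],
                         len(rows) / len(X))
    return params


def sample_synthetic(params, n, rng):
    """Draw n labelled synthetic rows from the fitted class-conditional Gaussians."""
    labels = list(params)
    weights = [params[label][2] for label in labels]
    X, y = [], []
    for _ in range(n):
        label = rng.choices(labels, weights=weights)[0]
        mu, sd, _ = params[label]
        X.append([rng.gauss(m, s) for m, s in zip(mu, sd)])
        y.append(label)
    return X, y


def knn_scores(X_train, y_train, X_test, k=5):
    """Score each test row by the fraction of positives among its k nearest training rows."""
    scores = []
    for q in X_test:
        dists = sorted((sum((a - b) ** 2 for a, b in zip(q, x)), t)
                       for x, t in zip(X_train, y_train))
        scores.append(sum(t for _, t in dists[:k]) / k)
    return scores


def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_toy_real(n, rng):
    """Hypothetical 'real' cohort: two features, 25% positive outcomes."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 4 == 0 else 0
        X.append([rng.gauss(1.0 * label, 1.0), rng.gauss(0.5 * label, 1.0)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_toy_real(120, rng)   # small real training split
X_test, y_test = make_toy_real(60, rng)      # held-out real test split

generator = fit_class_gaussians(X_train, y_train)
results = {"original": auroc(y_test, knn_scores(X_train, y_train, X_test))}
for size in (200, 500, 1000):                # synthetic training sizes to compare
    X_syn, y_syn = sample_synthetic(generator, size, rng)
    results[size] = auroc(y_test, knn_scores(X_syn, y_syn, X_test))

for name, value in results.items():
    print(f"{name}: AUROC = {value:.3f}")
```

The key design point is that the generator is fitted on the real training split only and every model is scored on the same held-out real test split, so differences between rows reflect the training data rather than the evaluation set.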
2. Materials and Methods
2.1. Dataset Description
2.2. Synthetic Data Generation
2.3. Predictive Modelling Task and Baselines
2.4. Interpretability Analysis
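
The idea of comparing feature importance across training sets can be illustrated with a small, self-contained sketch. It computes exact Shapley values by brute-force coalition enumeration against a single baseline row, which is a swapped-in simplification and not the SHAP library's estimators. The two models, the test rows, and the baseline are hypothetical toy values standing in for predictors trained on different training sets.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take the baseline value."""
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(d)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(d)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


def mean_abs_importance(f, X_test, baseline):
    """Global importance: mean absolute Shapley value per feature over the test rows."""
    totals = [0.0] * len(baseline)
    for x in X_test:
        for i, v in enumerate(shapley_values(f, x, baseline)):
            totals[i] += abs(v)
    return [t / len(X_test) for t in totals]


# Hypothetical models, standing in for the same estimator fitted on two training sets.
def model_a(x):
    return 2.0 * x[0] + 0.5 * x[1]


def model_b(x):
    return 0.5 * x[0] + 2.0 * x[1]


X_test = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
baseline = [1.0, 1.0, 1.0]  # column means of X_test

importance_a = mean_abs_importance(model_a, X_test, baseline)
importance_b = mean_abs_importance(model_b, X_test, baseline)
print("model_a:", [round(v, 3) for v in importance_a])
print("model_b:", [round(v, 3) for v in importance_b])
```

Because both models are explained on the same test rows and baseline, any change in the feature ranking can be attributed to the model, and hence to the data it was trained on. Enumeration costs grow exponentially with the number of features, so this form is only practical for a handful of features.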
3. Results
3.1. Predictive Modelling Task
3.2. Interpretability Analysis
4. Discussion and Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
EHRs | Electronic Health Records |
ML | Machine Learning |
LMICs | Low-Middle-Income Countries |
SMOTE | Synthetic Minority Oversampling TEchnique |
GANs | Generative Adversarial Networks |
VAEs | Variational AutoEncoders |
AUROC | Area Under the Receiver Operating Characteristic Curve |
AUPRC | Area Under the Precision-Recall Curve |
Characteristic | n (%) |
---|---|
Co-Morbidities | |
Diabetes | 35 (9.62%) |
Steroids | 15 (4.12%) |
Chronic Liver | 55 (15.11%) |
Chronic Kidney | 3 (0.82%) |
Demographics | |
Female | 242 (66.48%) |
Age | |
16–45 | 133 (36.54%) |
45–60 | 142 (39.01%) |
60+ | 89 (24.45%) |
Admission Diagnosis | |
Tetanus | 17 (4.67%) |
Sepsis | 45 (12.36%) |
Local Infections | 75 (20.60%) |
Dengue | 204 (56.04%) |
Internal Medicine | 139 (6.32%) |
Outcomes | |
Hospital Acquired Infections | 86 (23.6%) |
Estimator | Training Data | AUROC | AUPRC | Balanced Accuracy |
---|---|---|---|---|
Random Forest | Original | 0.528 (0.386, 0.649) | 0.246 (0.157, 0.377) | 0.462 (0.389, 0.542) |
Random Forest | SMOTE | 0.577 (0.428, 0.713) | 0.281 (0.169, 0.451) | 0.538 (0.419, 0.651) |
Random Forest | Synthetic 200 | 0.511 (0.370, 0.658) | 0.261 (0.153, 0.431) | 0.548 (0.448, 0.648) |
Random Forest | Synthetic 500 | 0.533 (0.397, 0.677) | 0.266 (0.162, 0.440) | 0.555 (0.459, 0.657) |
Random Forest | Synthetic 1000 | 0.592 (0.455, 0.723) | 0.286 (0.185, 0.462) | 0.548 (0.450, 0.661) |
Random Forest | Synthetic 2000 | 0.602 (0.459, 0.743) | 0.295 (0.182, 0.469) | 0.569 (0.471, 0.675) |
Random Forest | Synthetic 2500 | 0.610 (0.460, 0.751) | 0.334 (0.185, 0.542) | 0.569 (0.470, 0.669) |
Random Forest | Synthetic 10,000 | 0.605 (0.479, 0.742) | 0.298 (0.191, 0.481) | 0.569 (0.477, 0.674) |
Support Vector Machines | Original | 0.560 (0.418, 0.699) | 0.267 (0.165, 0.434) | 0.500 (0.500, 0.500) |
Support Vector Machines | SMOTE | 0.568 (0.428, 0.707) | 0.270 (0.170, 0.419) | 0.500 (0.500, 0.500) |
Support Vector Machines | Synthetic 200 | 0.565 (0.427, 0.703) | 0.285 (0.181, 0.454) | 0.548 (0.452, 0.662) |
Support Vector Machines | Synthetic 500 | 0.566 (0.427, 0.707) | 0.287 (0.176, 0.459) | 0.562 (0.470, 0.672) |
Support Vector Machines | Synthetic 1000 | 0.568 (0.436, 0.712) | 0.288 (0.185, 0.470) | 0.548 (0.450, 0.659) |
Support Vector Machines | Synthetic 2000 | 0.565 (0.431, 0.707) | 0.286 (0.177, 0.457) | 0.562 (0.470, 0.660) |
Support Vector Machines | Synthetic 2500 | 0.564 (0.427, 0.690) | 0.286 (0.178, 0.449) | 0.562 (0.465, 0.671) |
Support Vector Machines | Synthetic 10,000 | 0.565 (0.409, 0.708) | 0.292 (0.178, 0.460) | 0.569 (0.476, 0.674) |
K-Nearest Neighbor | Original | 0.526 (0.390, 0.666) | 0.255 (0.154, 0.401) | 0.500 (0.500, 0.500) |
K-Nearest Neighbor | SMOTE | 0.526 (0.391, 0.657) | 0.255 (0.157, 0.405) | 0.500 (0.500, 0.500) |
K-Nearest Neighbor | Synthetic 200 | 0.528 (0.391, 0.675) | 0.280 (0.167, 0.448) | 0.548 (0.451, 0.650) |
K-Nearest Neighbor | Synthetic 500 | 0.520 (0.368, 0.669) | 0.281 (0.168, 0.444) | 0.555 (0.455, 0.662) |
K-Nearest Neighbor | Synthetic 1000 | 0.525 (0.386, 0.669) | 0.281 (0.164, 0.445) | 0.555 (0.465, 0.660) |
K-Nearest Neighbor | Synthetic 2000 | 0.542 (0.405, 0.687) | 0.290 (0.178, 0.457) | 0.555 (0.464, 0.669) |
K-Nearest Neighbor | Synthetic 2500 | 0.536 (0.394, 0.676) | 0.281 (0.173, 0.437) | 0.569 (0.469, 0.666) |
K-Nearest Neighbor | Synthetic 10,000 | 0.546 (0.404, 0.689) | 0.272 (0.171, 0.441) | 0.569 (0.476, 0.675) |
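
Each metric in the table above is reported with a range in parentheses; this section does not state how those ranges were obtained. The sketch below assumes they are percentile bootstrap intervals over the test set, which is one common choice, and shows how AUROC, AUPRC (computed here as step-wise average precision), and balanced accuracy could be estimated that way. The function names, the 1000 resamples, and the 95% level are assumptions, not details from the paper.

```python
import random


def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties kept in input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank   # precision at each recalled positive
    return total / sum(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for hard 0/1 predictions."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(p == 1 for p in pos) / len(pos)
    tnr = sum(p == 0 for p in neg) / len(neg)
    return (tpr + tnr) / 2


def bootstrap_ci(metric, y_true, values, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present in y_true")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:       # skip resamples that lost a class
            continue
        stats.append(metric(yt, [values[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return metric(y_true, values), lower, upper


# Toy test set: 25% positives and a classifier that always predicts the majority class.
y_true = [1, 0, 0, 0] * 10
always_negative = [0] * len(y_true)
print(bootstrap_ci(balanced_accuracy, y_true, always_negative))
```

A classifier that always predicts one class has sensitivity 0 and specificity 1 (or the reverse), so its balanced accuracy is exactly 0.5 on every resample. This is one pattern that would produce rows such as 0.500 (0.500, 0.500) in the table, although the section itself does not confirm that explanation.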
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ghosheh, G.O.; Thwaites, C.L.; Zhu, T. Synthesizing Electronic Health Records for Predictive Models in Low-Middle-Income Countries (LMICs). Biomedicines 2023, 11, 1749. https://doi.org/10.3390/biomedicines11061749