An Improved Machine-Learning Approach for COVID-19 Prediction Using Harris Hawks Optimization and Feature Analysis Using SHAP
Abstract
1. Introduction
1.1. Related Works
1.2. Contributions
- Because COVID-19 is highly contagious, hospitals and diagnostic centers must take extra precautions when testing for it, which increases both costs and health hazards. The proposed model predicts COVID-19 from less expensive inpatient facility data that can be collected at home, rather than from X-rays or CT scans, with the aim of reducing patients’ visits to hospitals and diagnostic centers.
- An ML framework is designed using HHO to detect COVID-19.
- The hyperparameters of the boosting classifiers are optimized using HHO, and the optimized classifiers are then combined into an ensemble classifier that further improves the performance of the model.
- The important features are estimated using SHapley Additive exPlanations (SHAP) analysis.
- The performance is compared with other existing models.
- A decision support system and a clinically operable decision forest are created to support the medical staff.
2. Materials and Methods
2.1. Data Source
2.2. Data Preprocessing
2.3. Classification Algorithms
2.3.1. XGB
2.3.2. CatBoost
2.3.3. RF
- Let N be the original dataset. Each classification tree is built from a bootstrap sample drawn at random from N; the samples that are not drawn form the out-of-bag data.
- At each node of each tree, n variables are first selected at random. A constant m is then set, and m variables are selected from these n variables. When the node is split, the variable with the greatest classification ability among the m variables is chosen according to the Gini index, which measures node impurity. The threshold value of that variable is determined by checking each candidate split point. For a given training set N, one case is selected at random and assigned to some class.
- Every tree grows up to its maximum without any pruning.
- The classification outcome is obtained by the maximum voting result of the classifier.
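The steps above can be illustrated with a short sketch. It is not the authors' implementation: the helper names (`bootstrap_split`, `gini_impurity`, `split_gini`, `best_threshold`, `majority_vote`) are hypothetical, and the Gini index is taken in its standard form, one minus the sum of squared class proportions.

```python
import random
from collections import Counter


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def gini_impurity(labels):
    """Gini index of a node: 1 - sum over classes of p_k squared."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two children after a threshold split."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Check every observed value as a split point and keep the purest split."""
    return min(sorted(set(values)), key=lambda t: split_gini(values, labels, t))


def majority_vote(tree_predictions):
    """The forest's output is the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, `best_threshold([1, 2, 3, 4], [0, 0, 1, 1])` selects the split at 2, which separates the two classes with zero impurity on both sides.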
2.3.4. LGB
2.3.5. Ensemble Methods
2.4. Hyperparameter Optimization Using HHO
2.4.1. Exploration Phase
2.4.2. Shifting between Exploration and Exploitation
2.4.3. Exploitation Phase
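The three stages named in Sections 2.4.1 to 2.4.3 can be sketched as below. This is a minimal illustration that follows the general HHO formulation of Heidari et al. (listed in the references), not the authors' code. The function names, the scalar bounds `lb` and `ub`, the default population size, and the iteration count are all assumptions made for the example. For hyperparameter tuning, `objective` would be something like one minus a cross-validation score, which is also an assumption here.

```python
import math
import random


def levy_step(dim, rng, beta=1.5):
    """Levy-flight step (Mantegna-style), used for the rapid dives."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return [0.01 * rng.gauss(0, sigma) / abs(rng.gauss(0, 1)) ** (1 / beta)
            for _ in range(dim)]


def clip(x, lb, ub):
    """Keep a position inside the box [lb, ub] in every dimension."""
    return [min(max(v, lb), ub) for v in x]


def hho_minimize(objective, dim, lb, ub, n_hawks=20, n_iter=100, seed=0):
    """Minimize objective over [lb, ub]^dim; returns (best position, best value)."""
    rng = random.Random(seed)
    hawks = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_hawks)]
    rabbit, rabbit_fit = None, float("inf")

    def update_rabbit():
        # The "rabbit" is the best solution found so far.
        nonlocal rabbit, rabbit_fit
        for h in hawks:
            f = objective(h)
            if f < rabbit_fit:
                rabbit, rabbit_fit = list(h), f

    for t in range(n_iter):
        update_rabbit()
        mean = [sum(h[d] for h in hawks) / n_hawks for d in range(dim)]
        for i, x in enumerate(hawks):
            # Escaping energy shrinks over time and switches the phase.
            energy = 2 * rng.uniform(-1, 1) * (1 - t / n_iter)
            if abs(energy) >= 1:  # exploration phase
                if rng.random() >= 0.5:
                    xr = rng.choice(hawks)
                    r1, r2 = rng.random(), rng.random()
                    new = [xr[d] - r1 * abs(xr[d] - 2 * r2 * x[d]) for d in range(dim)]
                else:
                    r3, r4 = rng.random(), rng.random()
                    new = [(rabbit[d] - mean[d]) - r3 * (lb + r4 * (ub - lb))
                           for d in range(dim)]
            else:  # exploitation phase
                jump = 2 * (1 - rng.random())
                soft = abs(energy) >= 0.5
                if rng.random() >= 0.5:
                    if soft:  # soft besiege
                        new = [(rabbit[d] - x[d]) - energy * abs(jump * rabbit[d] - x[d])
                               for d in range(dim)]
                    else:  # hard besiege
                        new = [rabbit[d] - energy * abs(rabbit[d] - x[d])
                               for d in range(dim)]
                else:  # besiege with progressive rapid dives (greedy acceptance)
                    ref = x if soft else mean
                    y = clip([rabbit[d] - energy * abs(jump * rabbit[d] - ref[d])
                              for d in range(dim)], lb, ub)
                    step = levy_step(dim, rng)
                    z = clip([y[d] + rng.random() * step[d] for d in range(dim)], lb, ub)
                    fx = objective(x)
                    new = x
                    if objective(y) < fx:
                        new = y
                    elif objective(z) < fx:
                        new = z
            hawks[i] = clip(new, lb, ub)
    update_rabbit()
    return rabbit, rabbit_fit
```

A call such as `hho_minimize(lambda v: sum(c * c for c in v), dim=2, lb=-5.0, ub=5.0)` searches a two-dimensional sphere function. The stored best value never increases from one iteration to the next, because the rabbit is replaced only by a strictly better hawk.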
2.5. Performance Evaluation Metrics
Algorithm 1: Proposed HHO-based COVID-19 prediction algorithm
2.6. Feature Importance Using SHAP Analysis
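As a reminder of what a SHAP value measures, the sketch below computes exact Shapley values for a toy model by enumerating feature subsets. It is an illustration only: the `shap` library's tree-based algorithm cited in the references is far more efficient and is not reproduced here. The function name `shapley_values` and the use of a single `baseline` vector to stand in for "missing" features are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi
```

For a linear model, each feature's attribution reduces to its coefficient times its distance from the baseline, and the attributions always sum to the difference between the prediction and the baseline prediction.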
3. Experimental Result
3.1. HHO Result
3.2. Precision vs. Recall Curve
3.3. Recall vs. Decision Boundary
3.4. ROC Curve
3.5. Feature Importance (SHAP Value Analysis)
4. Discussion
4.1. Comparative Study
References | Classifiers | Dataset Used | ACC | SE | SP | AUC |
---|---|---|---|---|---|---|
Jim et al. [37] | Deep Convolutional Neural Network | Clinical Image Data | 92.5% | 94.2% | 95.6% | |
He et al. [38] | XGB | Clinical, Blood samples of 75 Features | 90% | | | |
Ahamad et al. [39] | XGB, RF, DT, SVM | Demographic and Symptom | 85% | 90% | | |
Li et al. [40] | XGB | Clinical Data | 92.5% | 97.9% | >90% | |
Brinati et al. [41] | DT, RF | Hematochemical Values from Blood Exams | 86% | 95% | | |
Chimmula and Zhang [42] | Deep learning using LSTM | Demographic | 92.67% | | | |
Proposed | Ensemble Method | Clinical Data | 92.68% | 92.68% | 92.68% | 97.80% |
5. Potential Application
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- World Health Organization. WHO Director-General’s Opening Remarks at the Media Briefing on COVID-19-11 March 2020; World Health Organization: Geneva, Switzerland, 2020.
- Du Toit, A. Outbreak of a novel coronavirus. Nat. Rev. Microbiol. 2020, 18, 123.
- Allam, Z. The first 50 days of COVID-19: A detailed chronological timeline and extensive review of literature documenting the pandemic. Surv. COVID-19 Pandemic Implic. 2020.
- World Health Organization. WHO Director-General’s Remarks at the Media Briefing on 2019-nCoV on 11 February 2020; World Health Organization: Geneva, Switzerland, 2020.
- Public Health England. COVID-19: Epidemiology, Virology and Clinical Features; Public Health England: London, UK, 2020.
- Guan, W.j.; Ni, Z.Y.; Hu, Y.; Liang, W.H.; Ou, C.Q.; He, J.X.; Liu, L.; Shan, H.; Lei, C.L.; Hui, D.S.; et al. Clinical Characteristics of Coronavirus Disease 2019 in China. N. Engl. J. Med. 2020, 382, 1708–1720.
- Bartlett, J.G.; Mundy, L.M. Community-acquired pneumonia. N. Engl. J. Med. 1995, 333, 1618–1624.
- Tolksdorf, K.; Buda, S.; Schuler, E.; Wieler, L.H.; Haas, W. Influenza-associated pneumonia as reference to assess seriousness of coronavirus disease (COVID-19). Eurosurveillance 2020, 25, 2000258.
- Grasselli, G.; Pesenti, A.; Cecconi, M. Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: Early experience and forecast during an emergency response. JAMA 2020, 323, 1545–1546.
- Wang, M.; Wu, Q.; Xu, W.; Qiao, B.; Wang, J.; Zheng, H.; Jiang, S.; Mei, J.; Wu, Z.; Deng, Y.; et al. Clinical diagnosis of 8274 samples with 2019-novel coronavirus in Wuhan. MedRxiv 2020.
- Rajaraman, S.; Antani, S. Training deep-learning algorithms with weakly labeled pneumonia chest X-ray data for COVID-19 detection. MedRxiv 2020.
- Yan, L.; Zhang, H.T.; Xiao, Y.; Wang, M.; Guo, Y.; Sun, C.; Tang, X.; Jing, L.; Li, S.; Zhang, M.; et al. Prediction of criticality in patients with severe COVID-19 infection using three clinical features: A machine learning-based prognostic model with clinical data in Wuhan. MedRxiv 2020, 27, 2020.
- Awal, M.A.; Masud, M.; Hossain, M.S.; Bulbul, A.A.M.; Mahmud, S.H.; Bairagi, A.K. A novel bayesian optimization-based machine learning framework for COVID-19 detection from inpatient facility data. IEEE Access 2021, 9, 10263–10281.
- Kassania, S.H.; Kassanib, P.H.; Wesolowskic, M.J.; Schneidera, K.A.; Detersa, R. Automatic detection of coronavirus disease (COVID-19) in X-ray and CT images: A machine learning based approach. Biocybern. Biomed. Eng. 2021, 41, 867–879.
- Saha, P.; Sadi, M.S.; Islam, M.M. EMCNet: Automated COVID-19 diagnosis from X-ray images using convolutional neural network and ensemble of machine learning classifiers. Inform. Med. Unlocked 2021, 22, 100505.
- Rasheed, J.; Hameed, A.A.; Djeddi, C.; Jamil, A.; Al-Turjman, F. A machine learning-based framework for diagnosis of COVID-19 from chest X-ray images. Interdiscip. Sci. Comput. Life Sci. 2021, 13, 103–117.
- Williamson, E.J.; Walker, A.J.; Bhaskaran, K.; Bacon, S.; Bates, C.; Morton, C.E.; Curtis, H.J.; Mehrkar, A.; Evans, D.; Inglesby, P.; et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature 2020, 584, 430–436.
- Buck, S.F. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J. R. Stat. Soc. Ser. B 1960, 22, 302–306.
- Ma, Z.; Chen, G. Bayesian methods for dealing with missing data problems. J. Korean Stat. Soc. 2018, 47, 297–313.
- Mostafa, S.M.; Eladimy, A.S.; Hamad, S.; Amano, H. CBRG: A Novel Algorithm for Handling Missing Data Using Bayesian Ridge Regression and Feature Selection Based on Gain Ratio. IEEE Access 2020, 8, 216969–216985.
- Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- Breiman, L. 1 Random Forests–Random Features; CiteSeerX: State College, PA, USA, 1999.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Awal, M.A.; Hossain, M.S.; Debjit, K.; Ahmed, N.; Nath, R.D.; Habib, G.M.; Khan, M.S.; Islam, M.A.; Mahmud, M.P. An early detection of asthma using BOMLA detector. IEEE Access 2021, 9, 58403–58420.
- Mirjalili, S. Evolutionary algorithms and neural networks. In Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2019; Volume 780.
- Mirjalili, S.; Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 2016, 95, 51–67.
- Pelikan, M.; Goldberg, D.E.; Cantú-Paz, E. BOA: The Bayesian optimization algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, Orlando, FL, USA, 13–17 July 1999; Volume 1, pp. 525–532.
- Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39.
- Heidari, A.A.; Mirjalili, S.; Faris, H.; Aljarah, I.; Mafarja, M.; Chen, H. Harris hawks optimization: Algorithm and applications. Future Gener. Comput. Syst. 2019, 97, 849–872.
- Alabool, H.M.; Alarabiat, D.; Abualigah, L.; Heidari, A.A. Harris hawks optimization: A comprehensive review of recent variants and applications. Neural Comput. Appl. 2021, 33, 8939–8980.
- Hu, J.; Heidari, A.A.; Shou, Y.; Ye, H.; Wang, L.; Huang, X.; Chen, H.; Chen, Y.; Wu, P.; Han, Z. Detection of COVID-19 severity using blood gas analysis parameters and Harris hawks optimized extreme learning machine. Comput. Biol. Med. 2022, 142, 105166.
- Lundberg, S.M.; Lee, S.I. Consistent feature attribution for tree ensembles. arXiv 2017, arXiv:1706.06060.
- Hasan, M.K.; Jawad, M.T.; Dutta, A.; Awal, M.A.; Islam, M.A.; Masud, M.; Al-Amri, J.F. Associating Measles Vaccine Uptake Classification and its Underlying Factors Using an Ensemble of Machine Learning Models. IEEE Access 2021, 9, 119613–119628.
- Islam, S.M.S.; Talukder, A.; Awal, M.A.; Siddiqui, M.M.U.; Ahamad, M.M.; Ahammed, B.; Rawal, L.B.; Alizadehsani, R.; Abawajy, J.; Laranjo, L.; et al. Machine Learning Approaches for Predicting Hypertension and Its Associated Factors Using Population-Level Data From Three South Asian Countries. Front. Cardiovasc. Med. 2022, 9, 839379.
- Howlader, K.C.; Satu, M.; Awal, M.; Islam, M.; Islam, S.M.S.; Quinn, J.M.; Moni, M.A. Machine learning models for classification and identification of significant attributes to detect type 2 diabetes. Health Inf. Sci. Syst. 2022, 10, 2.
- Jim, A.A.J.; Rafi, I.; Chowdhury, M.S.; Sikder, N.; Mahmud, M.P.; Rubaie, S.; Masud, M.; Bairagi, A.K.; Bhakta, K.; Nahid, A.A. An automatic computer-based method for fast and accurate COVID-19 diagnosis. MedRxiv 2020.
- He, X.; Wang, S.; Shi, S.; Chu, X.; Tang, J.; Liu, X.; Yan, C.; Zhang, J.; Ding, G. Benchmarking deep learning models and automated model design for COVID-19 detection with chest ct scans. MedRxiv 2020.
- Ahamad, M.M.; Aktar, S.; Rashed-Al-Mahfuz, M.; Uddin, S.; Liò, P.; Xu, H.; Summers, M.A.; Quinn, J.M.; Moni, M.A. A machine learning model to identify early stage symptoms of SARS-Cov-2 infected patients. Expert Syst. Appl. 2020, 160, 113661.
- Li, W.T.; Ma, J.; Shende, N.; Castaneda, G.; Chakladar, J.; Tsai, J.C.; Apostol, L.; Honda, C.O.; Xu, J.; Wong, L.M.; et al. Using machine learning of clinical data to diagnose COVID-19: A systematic review and meta-analysis. BMC Med. Inform. Decis. Mak. 2020, 20, 1–13.
- Brinati, D.; Campagner, A.; Ferrari, D.; Locatelli, M.; Banfi, G.; Cabitza, F. Detection of COVID-19 infection from routine blood exams with machine learning: A feasibility study. J. Med. Syst. 2020, 44, 1352.
- Chimmula, V.K.R.; Zhang, L. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos Solitons Fractals 2020, 135, 109864.
Variable Name | Description | Variable Type | Ratio of Boolean, 0/1 | Missing Value (Percentage) |
---|---|---|---|---|
BMI | Measures a healthy connection between the weight and height of a person | numerical | | |
Alcohol | An ethanol-based organic compound used in manufacturing different drugs | numerical | | 0.2% |
Cannabis | A psychoactive drug made from the cannabis plant which is also called marijuana | numerical | | 28% |
Contacts-count | Number of individuals in contact with the patient | numerical | | 0.4% |
Age | Defines the number of years of a person’s existence | categorical | | |
COVID-19 Symptoms | Symptoms that can be noticed if a person is COVID-19 positive | categorical | 999899/23527 | |
COVID-19 Contact | Number of contacts with COVID-19-positive individuals | categorical | 973763/49663 | |
Asthma | A persistent lung disease which causes breathing difficulties | categorical | 867913/1555 | |
Kidney-disease | Kidney conditions that create inconvenience while filtering blood in the body | categorical | 1019551/3875 | |
Liver-disease | Diseases that hamper the regular functions of a liver and cause significant damage | categorical | 1020832/2594 | |
Compromised-immune | Having a weak immune system with a chance of getting infected by various diseases easily | categorical | 965426/58000 | |
Heart-disease | Unfavorable heart conditions or diseases of an unhealthy heart | categorical | 1003713/19713 | |
Lung-disease | Some breathing disorders and diseases that affect one’s lung | categorical | 1008154/15272 | |
Diabetes | A disease that occurs when one’s body becomes unable to maintain insulin and the blood sugar becomes high | categorical | 959537/63889 | |
HIV-positive | A physical state in which a person’s body contains fully functional HIV virus and most likely has AIDS | categorical | 949839/73587 | |
Hypertension | High blood pressure due to several health issues and other circumstances | categorical | 879542/1438 | |
Other-chronic | Persistent and long-lasting health conditions | categorical | 949839/73587 | |
Health-worker | A person who works on health-related issues and provides basic healthcare | categorical | 1002710/20716 | |
Sex | Expresses gender differences among humans | categorical | 63389/382573 | |
Smoking | Inhaling smoke from burnt plant material | categorical | | 0.02% |
COVID-19_positive | A physical state in which one’s body contains fully functional COVID-19 virus | categorical | 1011257/12169 | |
Classifiers | No. of Hyperparameter | Hyperparameters | Domain Range |
---|---|---|---|
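Extreme Gradient Boosting | 7 | Learning rate | ( – 0.9) |
Extreme Gradient Boosting | 7 | Colsample_tree | (0.001–1.00) |
Extreme Gradient Boosting | 7 | Min_child_weight | (1–200) |
Extreme Gradient Boosting | 7 | Gamma | ( – 1.0) |
Extreme Gradient Boosting | 7 | Subsample | (0.001–1.0) |
Extreme Gradient Boosting | 7 | Max_depth | (1–200) |
Extreme Gradient Boosting | 7 | Alpha | ( – 1.0) |
Light Gradient Boosting | 10 | Learning_rate | (0.01–1) |
Light Gradient Boosting | 10 | Max_bin | (15–100) |
Light Gradient Boosting | 10 | Num_leaves | (20–100) |
Light Gradient Boosting | 10 | Bagging_fraction | (0.6–1.0) |
Light Gradient Boosting | 10 | Feature_fraction | (0.1–0.9) |
Light Gradient Boosting | 10 | Max_depth | (5–50) |
Light Gradient Boosting | 10 | Subsample | (0.1–1.0) |
Light Gradient Boosting | 10 | Colsample_tree | (0.01–1.0) |
Light Gradient Boosting | 10 | Min_child_samples | (3–100) |
Light Gradient Boosting | 10 | Min_data_in_leaf | (90–120) |
CatBoost | 6 | Depth | (1–12) |
CatBoost | 6 | Colsample_bylevel | (0.01–0.1) |
CatBoost | 6 | Subsample | (0.01–0.1) |
CatBoost | 6 | n_estimator | (100–400) |
CatBoost | 6 | Learning_rate | (0.001–0.01) |
CatBoost | 6 | l2_leaf_reg | (1–9) |
Random Forest | 5 | Min_sample_split | (1,20) |
Random Forest | 5 | Min_sample_leaf | (1,20) |
Random Forest | 5 | n_estimator | (10,1000) |
Random Forest | 5 | Criterion | (“Gini,” “entropy”) |
Support Vector Classifier | 2 | Cost | (0.001,20) |
Support Vector Classifier | 2 | Gamma | (−6,6) |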
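A population-based optimizer such as HHO works on real-valued position vectors, so each position has to be mapped onto the domain ranges in the table before a classifier can be trained. The sketch below shows one way to do this for the CatBoost rows. The dictionary layout, the `decode` helper, the choice to scale positions from the unit interval, and the decision to round `depth` and `n_estimators` to integers are all assumptions made for illustration; the key `l2_leaf_reg` is CatBoost's L2 regularization parameter.

```python
# Domain ranges copied from the CatBoost rows of the table: (low, high, type).
CATBOOST_SPACE = {
    "depth": (1, 12, int),
    "colsample_bylevel": (0.01, 0.1, float),
    "subsample": (0.01, 0.1, float),
    "n_estimators": (100, 400, int),
    "learning_rate": (0.001, 0.01, float),
    "l2_leaf_reg": (1, 9, float),
}


def decode(position, space):
    """Map a position in [0, 1]^d onto the hyperparameter ranges of `space`."""
    params = {}
    for u, (name, (low, high, kind)) in zip(position, space.items()):
        u = min(max(u, 0.0), 1.0)          # keep the coordinate inside [0, 1]
        value = low + u * (high - low)     # linear rescaling to [low, high]
        params[name] = int(round(value)) if kind is int else value
    return params
```

A position of all zeros decodes to the lower bound of every range and a position of all ones to the upper bound, so any vector the optimizer proposes yields a valid configuration.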
Classifiers | CV-Score | AC | Err | F1-Score | FPR | Kappa | MCC | PPV | SEN | SPE | Threat-Score | BAC | AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HHOLGB | 90.13% | 90.47% | 9.53% | 90.47% | 9.53% | 80.94% | 80.95% | 90.48% | 90.47% | 90.47% | 82.60% | 90.47% | 96.40% |
HHOCAT | 86.87% | 86.81% | 13.19% | 86.79% | 13.20% | 73.62% | 73.83% | 87.02% | 86.81% | 86.80% | 76.67% | 86.81% | 94.50% |
HHORF | 91.68% | 91.53% | 8.47% | 91.53% | 8.47% | 83.06% | 83.06% | 91.53% | 91.53% | 91.53% | 84.38% | 91.53% | 97.40% |
HHOXGB | 92.23% | 92.54% | 7.46% | 92.54% | 7.46% | 85.09% | 85.10% | 92.55% | 92.54% | 92.54% | 86.12% | 92.54% | 97.70% |
HHOSVC | 83.50% | 84.54% | 15.46% | 84.29% | 15.48% | 69.07% | 71.33% | 86.83% | 84.54% | 84.52% | 72.89% | 84.53% | 95.60% |
ENSEMBLE_MODEL | 92.38% | 92.67% | 7.33% | 92.67% | 7.33% | 85.34% | 85.35% | 92.67% | 92.67% | 92.67% | 86.34% | 92.67% | 97.80% |
DT | 87.71% | 88.43% | 11.57% | 88.42% | 11.57% | 76.85% | 76.89% | 88.47% | 88.43% | 88.43% | 79.25% | 88.43% | 88.50% |
KNN | 83.93% | 83.89% | 16.11% | 83.64% | 16.09% | 67.80% | 70.01% | 86.15% | 83.89% | 83.91% | 71.94% | 83.90% | 92.20% |
BernoulliNB | 76.61% | 76.11% | 23.89% | 76.11% | 23.89% | 52.22% | 52.22% | 76.12% | 76.11% | 76.11% | 61.43% | 76.11% | 85.20% |
LDA | 79.39% | 79.50% | 20.50% | 79.49% | 20.50% | 59.00% | 59.04% | 79.54% | 79.50% | 79.50% | 65.97% | 79.50% | 84.30% |
QLDA | 70.01% | 69.63% | 30.37% | 68.59% | 30.34% | 39.29% | 42.22% | 72.69% | 69.63% | 69.66% | 52.49% | 69.64% | 86.90% |
SVC (without HHO) | 83.87% | 83.86% | 16.14% | 83.80% | 16.15% | 67.71% | 68.21% | 84.36% | 83.86% | 83.85% | 72.13% | 83.85% | 87.30% |
RF (without HHO) | 86.22% | 86.63% | 13.37% | 86.63% | 13.36% | 73.27% | 73.32% | 86.69% | 86.63% | 86.64% | 76.41% | 86.63% | 86.70% |
LGB (without HHO) | 86.56% | 86.77% | 13.23% | 86.77% | 13.23% | 73.54% | 73.55% | 86.78% | 86.77% | 86.77% | 76.63% | 86.77% | 93.90% |
XGB (without HHO) | 83.38% | 83.16% | 16.84% | 83.15% | 16.84% | 66.31% | 66.32% | 83.16% | 83.16% | 83.16% | 71.17% | 83.16% | 90.10% |
CAT (without HHO) | 86.90% | 86.94% | 13.06% | 86.92% | 13.07% | 73.87% | 74.10% | 87.16% | 86.94% | 86.93% | 76.86% | 86.93% | 94.60% |
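Most of the performance indexes in the table can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; whether the authors computed them exactly this way (for example, per class or with weighted averaging) is an assumption. The function name `binary_metrics` and the dictionary keys are chosen for the example only.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a two-class problem."""
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    sen = tp / (tp + fn)                   # sensitivity (recall)
    spe = tn / (tn + fp)                   # specificity
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    # Agreement expected by chance, used by Cohen's kappa.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "AC": acc,
        "Err": 1 - acc,
        "F1": 2 * ppv * sen / (ppv + sen),
        "FPR": fp / (fp + tn),
        "Kappa": (acc - p_chance) / (1 - p_chance),
        "MCC": (tp * tn - fp * fn) / mcc_den,
        "PPV": ppv,
        "SEN": sen,
        "SPE": spe,
        "Threat": tp / (tp + fn + fp),     # threat score (critical success index)
        "BAC": (sen + spe) / 2,
    }
```

With `tp=90, fp=10, tn=90, fn=10`, accuracy, sensitivity, and specificity are all 0.9 and both Kappa and MCC are 0.8. In this balanced, symmetric case Kappa equals 2·AC − 1, the same relation that the HHOLGB row of the table shows (80.94% against an AC of 90.47%).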
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Debjit, K.; Islam, M.S.; Rahman, M.A.; Pinki, F.T.; Nath, R.D.; Al-Ahmadi, S.; Hossain, M.S.; Mumenin, K.M.; Awal, M.A. An Improved Machine-Learning Approach for COVID-19 Prediction Using Harris Hawks Optimization and Feature Analysis Using SHAP. Diagnostics 2022, 12, 1023. https://doi.org/10.3390/diagnostics12051023