Efficient Data-Driven Machine Learning Models for Cardiovascular Diseases Risk Prediction
Abstract
1. Introduction
- An essential step of the proposed methodology is data preprocessing, which consists of data cleaning and class balancing. Class balancing is performed with the Synthetic Minority Oversampling Technique (SMOTE), so that the dataset’s instances are evenly distributed between the two classes, allowing us to design efficient classification models and predict the occurrence of CVD.
- For feature analysis, three ranking methods (Gain Ratio, Random Forest and Information Gain) were applied to measure the importance of each feature for the CVD class, and a statistical description of the features’ prevalence is also presented.
- Several ML models were evaluated experimentally, with and without SMOTE, using 10-fold cross-validation. They were compared in terms of Accuracy, Recall, Precision and AUC in order to identify the most efficient model for predicting the risk of an instance being diagnosed with CVD.
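To make the class-balancing step in the list above more concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python. It is a simplified illustration and not the implementation used in this study: it samples base points at random, handles numeric features only, and the function name `smote`, its parameters and the toy values are assumptions introduced here.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours.

    Simplified sketch: numeric features only, base points drawn at random.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class of (age, systolic BP) pairs; the values are illustrative only.
minority = [(50.0, 130.0), (55.0, 140.0), (60.0, 150.0), (58.0, 145.0)]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point lies on a segment between two existing minority instances, the generated values stay within the range already covered by the minority class.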
2. Materials and Methods
2.1. Dataset Description
- Age (years) [42]: The participant’s age, which ranges from 30 to 65 years.
- Gender [43]: The participant’s gender. There are 2184 men (34.6%) and 4127 women (65.4%).
- BMI (kg/m²) [44]: The participant’s body mass index.
- Systolic Blood Pressure (Sys BP) (mmHg) [45]: The participant’s systolic blood pressure.
- Diastolic Blood Pressure (Dias BP) (mmHg) [46]: The participant’s diastolic blood pressure.
- Glucose [47]: The participant’s glucose status, in three categories (85.6% normal, 7.4% above normal and 7% well above normal).
- Smoke [48]: Whether the participant smokes or not. A total of 9.2% of the participants smoke.
- Alcohol Intake [49]: Whether the participant consumes alcohol or not. A total of 5.4% of the participants consume alcohol.
- Physical Activity [50]: Whether the participant is physically active or not. A total of 80.1% of the participants are physically active.
- Total Cholesterol [51]: The participant’s total cholesterol status, in three categories (78.1% normal, 12.7% above normal and 9.2% well above normal).
- Cardiovascular Disease (CVD): Whether the participant suffers from cardiovascular disease or not. A total of 1944 participants (30.8%) suffer from cardiovascular disease.
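The counts reported above determine how imbalanced the classes are. The short calculation below derives the class sizes from the reported figures and, assuming that balancing targets a 1:1 class ratio (an assumption made here for illustration), the number of synthetic minority instances that would be needed.

```python
# Figures reported in the dataset description
men, women = 2184, 4127
cvd = 1944

total = men + women            # all participants
non_cvd = total - cvd          # majority class

# Assumption: balancing brings the minority (CVD) class up to the majority size
synthetic_needed = non_cvd - cvd
balanced_total = total + synthetic_needed

print(total, non_cvd)                    # 6311 4367
print(round(100 * cvd / total, 1))       # 30.8
print(synthetic_needed, balanced_total)  # 2423 8734
```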
2.2. Proposed Methodology for CVD Risk Prediction
2.2.1. Data Preprocessing
2.2.2. Features Ranking
2.2.3. Features Prevalence in the Balanced Data
2.3. Machine Learning Models for the CVD Risk Prediction
2.4. Evaluation Metrics
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cardiovascular Diseases. Available online: https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed on 26 December 2022).
- Fuchs, F.D.; Whelton, P.K. High blood pressure and cardiovascular disease. Hypertension 2020, 75, 285–292.
- Cocciolone, A.J.; Hawes, J.Z.; Staiculescu, M.C.; Johnson, E.O.; Murshed, M.; Wagenseil, J.E. Elastin, arterial mechanics, and cardiovascular disease. Am. J. Physiol.-Heart Circ. Physiol. 2018, 315, H189–H205.
- Watkins, D.A.; Beaton, A.Z.; Carapetis, J.R.; Karthikeyan, G.; Mayosi, B.M.; Wyber, R.; Yacoub, M.H.; Zühlke, L.J. Rheumatic heart disease worldwide: JACC scientific expert panel. J. Am. Coll. Cardiol. 2018, 72, 1397–1416.
- d’Alessandro, E.; Becker, C.; Bergmeier, W.; Bode, C.; Bourne, J.H.; Brown, H.; Buller, H.R.; Arina, J.; Ten Cate, V.; Van Cauteren, Y.J.; et al. Thrombo-inflammation in cardiovascular disease: An expert consensus document from the third Maastricht consensus conference on thrombosis. Thromb. Haemost. 2020, 120, 538–564.
- Robinson, S. Cardiovascular disease. In Priorities for Health Promotion and Public Health; Routledge: London, UK, 2021; pp. 355–393.
- Shaito, A.; Thuan, D.T.B.; Phu, H.T.; Nguyen, T.H.D.; Hasan, H.; Halabi, S.; Abdelhady, S.; Nasrallah, G.K.; Eid, A.H.; Pintus, G. Herbal medicine for cardiovascular diseases: Efficacy, mechanisms, and safety. Front. Pharmacol. 2020, 11, 422.
- Jagannathan, R.; Patel, S.A.; Ali, M.K.; Narayan, K. Global updates on cardiovascular disease mortality trends and attribution of traditional risk factors. Curr. Diabetes Rep. 2019, 19, 44.
- Sharifi-Rad, J.; Rodrigues, C.F.; Sharopov, F.; Docea, A.O.; Can Karaca, A.; Sharifi-Rad, M.; Kahveci Karıncaoglu, D.; Gülseren, G.; Şenol, E.; Demircan, E.; et al. Diet, lifestyle and cardiovascular diseases: Linking pathophysiology to cardioprotective effects of natural bioactive compounds. Int. J. Environ. Res. Public Health 2020, 17, 2326.
- Kaminsky, L.A.; German, C.; Imboden, M.; Ozemek, C.; Peterman, J.E.; Brubaker, P.H. The importance of healthy lifestyle behaviors in the prevention of cardiovascular disease. Prog. Cardiovasc. Dis. 2021, 70, 8–15.
- Bays, H.E.; Taub, P.R.; Epstein, E.; Michos, E.D.; Ferraro, R.A.; Bailey, A.L.; Kelli, H.M.; Ferdinand, K.C.; Echols, M.R.; Weintraub, H.; et al. Ten things to know about ten cardiovascular disease risk factors. Am. J. Prev. Cardiol. 2021, 5, 100149.
- Francula-Zaninovic, S.; Nola, I.A. Management of measurable variable cardiovascular disease’ risk factors. Curr. Cardiol. Rev. 2018, 14, 153–163.
- Mensah, G.A.; Roth, G.A.; Fuster, V. The global burden of cardiovascular diseases and risk factors: 2020 and beyond. J. Am. Coll. Cardiol. 2019, 74, 2529–2532.
- Flora, G.D.; Nayak, M.K. A brief review of cardiovascular diseases, associated risk factors and current treatment regimes. Curr. Pharm. Des. 2019, 25, 4063–4084.
- Jagpal, A.; Navarro-Millán, I. Cardiovascular co-morbidity in patients with rheumatoid arthritis: A narrative review of risk factors, cardiovascular risk assessment and treatment. BMC Rheumatol. 2018, 2, 10.
- Silvani, A. Sleep disorders, nocturnal blood pressure, and cardiovascular risk: A translational perspective. Auton. Neurosci. 2019, 218, 31–42.
- Konstantoulas, I.; Kocsis, O.; Dritsas, E.; Fakotakis, N.; Moustakas, K. Sleep Quality Monitoring with Human Assisted Corrections. In Proceedings of the International Joint Conference on Computational Intelligence (IJCCI) (SCIPTRESS 2021), Online Streaming, 25–27 October 2021; pp. 435–444.
- Tadic, M.; Cuspidi, C.; Mancia, G.; Dell’Oro, R.; Grassi, G. COVID-19, hypertension and cardiovascular diseases: Should we change the therapy? Pharmacol. Res. 2020, 158, 104906.
- Shamshirian, A.; Heydari, K.; Alizadeh-Navaei, R.; Moosazadeh, M.; Abrotan, S.; Hessami, A. Cardiovascular diseases and COVID-19 mortality and intensive care unit admission: A systematic review and meta-analysis. medRxiv 2020.
- Winzer, E.B.; Woitek, F.; Linke, A. Physical activity in the prevention and treatment of coronary artery disease. J. Am. Heart Assoc. 2018, 7, e007725.
- Rippe, J.M.; Angelopoulos, T.J. Lifestyle strategies for risk factor reduction, prevention and treatment of cardiovascular disease. In Lifestyle Medicine, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2019; pp. 19–36.
- Karunathilake, S.P.; Ganegoda, G.U. Secondary prevention of cardiovascular diseases and application of technology for early diagnosis. BioMed Res. Int. 2018, 2018, 5767864.
- Dritsas, E.; Trigka, M. Data-Driven Machine-Learning Methods for Diabetes Risk Prediction. Sensors 2022, 22, 5304.
- Fazakis, N.; Kocsis, O.; Dritsas, E.; Alexiou, S.; Fakotakis, N.; Moustakas, K. Machine learning tools for long-term type 2 diabetes risk prediction. IEEE Access 2021, 9, 103737–103757.
- Alexiou, S.; Dritsas, E.; Kocsis, O.; Moustakas, K.; Fakotakis, N. An approach for Personalized Continuous Glucose Prediction with Regression Trees. In Proceedings of the 2021 6th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), Preveza, Greece, 24–26 September 2021; pp. 1–6.
- Dritsas, E.; Alexiou, S.; Konstantoulas, I.; Moustakas, K. Short-term Glucose Prediction based on Oral Glucose Tolerance Test Values. In Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies—HEALTHINF, Online, 9–11 February 2022; Volume 5, pp. 249–255.
- Fazakis, N.; Dritsas, E.; Kocsis, O.; Fakotakis, N.; Moustakas, K. Long-Term Cholesterol Risk Prediction with Machine Learning Techniques in ELSA Database. In Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI) (SCIPTRESS 2021), Online Streaming, 25–27 October 2021; pp. 445–450.
- Dritsas, E.; Fazakis, N.; Kocsis, O.; Fakotakis, N.; Moustakas, K. Long-Term Hypertension Risk Prediction with ML Techniques in ELSA Database. In Learning and Intelligent Optimization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 113–120.
- Dritsas, E.; Alexiou, S.; Moustakas, K. Efficient Data-driven Machine Learning Models for Hypertension Risk Prediction. In Proceedings of the 2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Biarritz, France, 8–10 August 2022; pp. 1–6.
- Dritsas, E.; Trigka, M. Machine Learning Methods for Hypercholesterolemia Long-Term Risk Prediction. Sensors 2022, 22, 5365.
- Dritsas, E.; Alexiou, S.; Moustakas, K. COPD Severity Prediction in Elderly with ML Techniques. In Proceedings of the 15th International Conference on PErvasive Technologies Related to Assistive Environments, Corfu, Greece, 29 June–1 July 2022; pp. 185–189.
- Dritsas, E.; Trigka, M. Supervised Machine Learning Models to Identify Early-Stage Symptoms of SARS-CoV-2. Sensors 2023, 23, 40.
- Dritsas, E.; Trigka, M. Stroke Risk Prediction with Machine Learning Techniques. Sensors 2022, 22, 4670.
- Dritsas, E.; Trigka, M. Machine learning techniques for chronic kidney disease risk prediction. Big Data Cogn. Comput. 2022, 6, 98.
- Dritsas, E.; Trigka, M. Supervised Machine Learning Models for Liver Disease Risk Prediction. Computers 2023, 12, 19.
- Butt, M.B.; Alfayad, M.; Saqib, S.; Khan, M.; Ahmad, M.; Khan, M.A.; Elmitwally, N.S. Diagnosing the stage of hepatitis C using machine learning. J. Healthc. Eng. 2021, 2021, 8062410.
- Dritsas, E.; Trigka, M. Lung Cancer Risk Prediction with Machine Learning Models. Big Data Cogn. Comput. 2022, 6, 139.
- Konstantoulas, I.; Dritsas, E.; Moustakas, K. Sleep Quality Evaluation in Rich Information Data. In Proceedings of the 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), Corfu, Greece, 18–20 July 2022; pp. 1–4.
- Dritsas, E.; Alexiou, S.; Moustakas, K. Metabolic Syndrome Risk Forecasting on Elderly with ML Techniques. In Learning and Intelligent Optimization; Springer: Berlin/Heidelberg, Germany, 2022.
- Dritsas, E.; Alexiou, S.; Moustakas, K. Cardiovascular Disease Risk Prediction with Supervised Machine Learning Techniques. In Proceedings of the ICT4AWE, Online, 23–25 April 2022; pp. 315–321.
- Ilyas, I.F.; Chu, X. Data Cleaning; Morgan & Claypool: San Rafael, CA, USA, 2019.
- Zhang, Y.; Chen, Y.; Ma, L. Depression and cardiovascular disease in elderly: Current understanding. J. Clin. Neurosci. 2018, 47, 1–5.
- Gao, Z.; Chen, Z.; Sun, A.; Deng, X. Gender differences in cardiovascular disease. Med. Nov. Technol. Devices 2019, 4, 100025.
- Elagizi, A.; Kachur, S.; Lavie, C.J.; Carbone, S.; Pandey, A.; Ortega, F.B.; Milani, R.V. An overview and update on obesity and the obesity paradox in cardiovascular diseases. Prog. Cardiovasc. Dis. 2018, 61, 142–150.
- Whelton, S.P.; McEvoy, J.W.; Shaw, L.; Psaty, B.M.; Lima, J.A.; Budoff, M.; Nasir, K.; Szklo, M.; Blumenthal, R.S.; Blaha, M.J. Association of normal systolic blood pressure level with cardiovascular disease in the absence of risk factors. JAMA Cardiol. 2020, 5, 1011–1018.
- Choi, Y.J.; Kim, S.H.; Kang, S.H.; Yoon, C.H.; Lee, H.Y.; Youn, T.J.; Chae, I.H.; Kim, C.H. Reconsidering the cut-off diastolic blood pressure for predicting cardiovascular events: A nationwide population-based study from Korea. Eur. Heart J. 2019, 40, 724–731.
- Kabootari, M.; Hasheminia, M.; Azizi, F.; Mirbolouk, M.; Hadaegh, F. Change in glucose intolerance status and risk of incident cardiovascular disease: Tehran Lipid and Glucose Study. Cardiovasc. Diabetol. 2020, 19, 41.
- Kondo, T.; Nakano, Y.; Adachi, S.; Murohara, T. Effects of tobacco smoking on cardiovascular disease. Circ. J. 2019, 83, 1980–1985.
- Larsson, S.C.; Burgess, S.; Mason, A.M.; Michaëlsson, K. Alcohol consumption and cardiovascular disease: A Mendelian randomization study. Circ. Genom. Precis. Med. 2020, 13, e002814.
- Kraus, W.E.; Powell, K.E.; Haskell, W.L.; Janz, K.F.; Campbell, W.W.; Jakicic, J.M.; Troiano, R.P.; Sprow, K.; Torres, A.; Piercy, K.L.; et al. Physical activity, all-cause and cardiovascular mortality, and cardiovascular disease. Med. Sci. Sport. Exerc. 2019, 51, 1270.
- Soliman, G.A. Dietary cholesterol and the lack of evidence in cardiovascular disease. Nutrients 2018, 10, 780.
- Rattan, V.; Mittal, R.; Singh, J.; Malik, V. Analyzing the Application of SMOTE on Machine Learning Classifiers. In Proceedings of the 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 5–7 March 2021; pp. 692–695.
- Dritsas, E.; Fazakis, N.; Kocsis, O.; Moustakas, K.; Fakotakis, N. Optimal Team Pairing of Elder Office Employees with Machine Learning on Synthetic Data. In Proceedings of the 2021 12th International Conference on Information, Intelligence, Systems & Applications (IISA), Chania Crete, Greece, 12–14 July 2021; pp. 1–4.
- Darst, B.F.; Malecki, K.C.; Engelman, C.D. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet. 2018, 19, 65.
- Tangirala, S. Evaluating the impact of GINI index and information gain on classification using decision tree classifier algorithm. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 612–619.
- Mohammad, A.H. Comparing two feature selections methods (information gain and gain ratio) on three different classification algorithms using arabic dataset. J. Theor. Appl. Inf. Technol. 2018, 96, 1561–1569.
- Powell-Wiley, T.M.; Poirier, P.; Burke, L.E.; Després, J.P.; Gordon-Larsen, P.; Lavie, C.J.; Lear, S.A.; Ndumele, C.E.; Neeland, I.J.; Sanders, P.; et al. Obesity and cardiovascular disease: A scientific statement from the American Heart Association. Circulation 2021, 143, e984–e1010.
- Luo, D.; Cheng, Y.; Zhang, H.; Ba, M.; Chen, P.; Li, H.; Chen, K.; Sha, W.; Zhang, C.; Chen, H. Association between high blood pressure and long term cardiovascular events in young adults: Systematic review and meta-analysis. BMJ 2020, 370, m3222.
- Petrie, J.R.; Guzik, T.J.; Touyz, R.M. Diabetes, hypertension, and cardiovascular disease: Clinical insights and vascular mechanisms. Can. J. Cardiol. 2018, 34, 575–584.
- Berrar, D. Bayes’ theorem and naive Bayes classifier. In Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics; Elsevier: Amsterdam, The Netherlands, 2018; Volume 403.
- Nusinovici, S.; Tham, Y.C.; Yan, M.Y.C.; Ting, D.S.W.; Li, J.; Sabanayagam, C.; Wong, T.Y.; Cheng, C.Y. Logistic regression was as good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 2020, 122, 56–69.
- González, S.; García, S.; Del Ser, J.; Rokach, L.; Herrera, F. A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Inf. Fusion 2020, 64, 205–237.
- Rodríguez, J.J.; Juez-Gil, M.; López-Nozal, C.; Arnaiz-González, Á. Rotation Forest for multi-target regression. Int. J. Mach. Learn. Cybern. 2022, 13, 523–548.
- Kang, K.; Michalak, J. Enhanced version of AdaBoostM1 with J48 Tree learning method. arXiv 2018, arXiv:1802.03522.
- Palimkar, P.; Shaw, R.N.; Ghosh, A. Machine learning technique to prognosis diabetes disease: Random forest classifier approach. In Advanced Computing and Intelligent Technologies; Springer: Berlin/Heidelberg, Germany, 2022; pp. 219–244.
- Dogan, A.; Birant, D. A weighted majority voting ensemble approach for classification. In Proceedings of the 2019 4th International Conference on Computer Science and Engineering (UBMK), Samsun, Turkey, 11–15 September 2019; pp. 1–6.
- Pavlyshenko, B. Using stacking approaches for machine learning models. In Proceedings of the 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2018; pp. 255–258.
- Masih, N.; Naz, H.; Ahuja, S. Multilayer perceptron based deep neural network for early detection of coronary heart disease. Health Technol. 2021, 11, 127–138.
- Cunningham, P.; Delany, S.J. k-Nearest neighbour classifiers-A Tutorial. ACM Comput. Surv. (CSUR) 2021, 54, 1–25.
- Moccia, S.; De Momi, E.; El Hadji, S.; Mattos, L.S. Blood vessel segmentation algorithms—Review of methods, datasets and evaluation metrics. Comput. Methods Programs Biomed. 2018, 158, 71–91.
- WEKA Tool. Available online: https://www.weka.io/ (accessed on 26 December 2022).
- Hunter, R.W.; Dhaun, N.; Bailey, M.A. The impact of excessive salt intake on human health. Nat. Rev. Nephrol. 2022, 18, 321–335.
- Dinesh, K.G.; Arumugaraj, K.; Santhosh, K.D.; Mareeswari, V. Prediction of cardiovascular disease using machine learning algorithms. In Proceedings of the 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), Coimbatore, India, 1–3 March 2018; pp. 1–7.
- Sun, W.; Zhang, P.; Wang, Z.; Li, D. Prediction of cardiovascular diseases based on machine learning. ASP Trans. Internet Things 2021, 1, 30–35.
- Mohan, S.; Thirumalai, C.; Srivastava, G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 2019, 7, 81542–81554.
- Louridi, N.; Amar, M.; El Ouahidi, B. Identification of cardiovascular diseases using machine learning. In Proceedings of the 2019 7th mediterranean congress of telecommunications (CMT), Fez, Morocco, 24–25 October 2019; pp. 1–6.
- Alaa, A.M.; Bolton, T.; Di Angelantonio, E.; Rudd, J.H.; Van der Schaar, M. Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants. PLoS ONE 2019, 14, e0213653.
- Theerthagiri, P.; Vidya, J. Cardiovascular disease prediction using recursive feature elimination and gradient boosting classification techniques. Expert Syst. 2022, 39, e13064.
- Casalino, G.; Castellano, G.; Kaymak, U.; Zaza, G. Balancing accuracy and interpretability through neuro-fuzzy models for cardiovascular risk assessment. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–8.
- Karaboga, D.; Kaya, E. Adaptive network based fuzzy inference system (ANFIS) training approaches: A comprehensive survey. Artif. Intell. Rev. 2019, 52, 2263–2293.
- Cardiovascular Disease Dataset. Available online: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset (accessed on 14 January 2023).
- Nohara, Y.; Matsumoto, K.; Soejima, H.; Nakashima, N. Explanation of machine learning models using Shapley additive explanation and application for real data in hospital. Comput. Methods Programs Biomed. 2022, 214, 106584.
- Chowdhury, S.U.; Sayeed, S.; Rashid, I.; Alam, M.G.R.; Masum, A.K.M.; Dewan, M.A.A. Shapley-Additive-Explanations-Based Factor Analysis for Dengue Severity Prediction using Machine Learning. J. Imaging 2022, 8, 229.

Numerical Attribute | Min | Max | Mean ± StdDev |
---|---|---|---|
Age | 30 | 65 | 52.73 ± 6.86 |
BMI | 15.36 | 58.59 | 27.12 ± 5.11 |
Sys BP | 70 | 220 | 123.8 ± 15.3 |
Dias BP | 40 | 150 | 80.25 ± 9.1 |

Nominal Attribute | Description |
---|---|
Gender | Men 2184 (34.6%), Women 4127 (65.4%) |
Glucose | Normal (85.6%), above normal (7.4%), well above normal (7%) |
Smoke | Yes (9.2%) |
Alcohol Intake | Yes (5.4%) |
Physical Activity | Yes (80.1%) |
Total Cholesterol | Normal (78.1%), above normal (12.7%), well above normal (9.2%) |

Random Forest (SMOTE) Attribute | Rank | Random Forest (No SMOTE) Attribute | Rank | Gain Ratio (SMOTE) Attribute | Rank | Gain Ratio (No SMOTE) Attribute | Rank | Information Gain (SMOTE) Attribute | Rank | Information Gain (No SMOTE) Attribute | Rank |
---|---|---|---|---|---|---|---|---|---|---|---|
Age | 0.253 | SysBP | 0.23907 | SysBP | 0.07589 | SysBP | 0.07243 | SysBP | 0.16307 | SysBP | 0.14602 |
SysBP | 0.2426 | DiasBP | 0.17963 | DiasBP | 0.06083 | DiasBP | 0.05732 | DiasBP | 0.10531 | DiasBP | 0.09216 |
BMI | 0.1897 | Age | 0.12714 | Age | 0.03809 | Cholesterol | 0.03947 | Age | 0.08847 | Cholesterol | 0.03837 |
DiasBP | 0.1893 | Cholesterol | 0.09492 | Cholesterol | 0.02763 | Age | 0.01894 | BMI | 0.03646 | Age | 0.03511 |
Cholesterol | 0.0574 | BMI | 0.04311 | Smoke | 0.02654 | BMI | 0.01489 | Cholesterol | 0.02615 | BMI | 0.02709 |
Gender | 0.0519 | Glucose | 0.02689 | BMI | 0.0177 | Glucose | 0.00806 | Gender | 0.01067 | Glucose | 0.00594 |
Physical activity | 0.03550 | Physical activity | 0.00931 | Alcohol intake | 0.01749 | Physical activity | 0.00157 | Smoke | 0.00935 | Physical activity | 0.00113 |
Smoke | 0.0263 | Smoke | 0.00135 | Physical activity | 0.0146 | Smoke | 0.00026 | Physical activity | 0.00887 | Smoke | 0.00012 |
Alcohol intake | 0.0144 | Alcohol intake | −0.00339 | Gender | 0.01231 | Gender | 0.00006 | Alcohol intake | 0.00416 | Gender | 0.00006 |
Glucose | 0.012 | Gender | −0.00467 | Glucose | 0.00293 | Alcohol intake | 0.00002 | Glucose | 0.00178 | Alcohol intake | 0.000006 |
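
As a rough illustration of how two of the ranking scores in the table above are defined, the sketch below computes Information Gain and Gain Ratio for a single nominal feature in plain Python. It is not the tool used in this study; the helper names and the toy data are assumptions made here, and continuous attributes such as Age or Sys BP would first need to be discretised before a calculation of this kind applies.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def information_gain(feature, labels):
    """H(class) - H(class | feature) for a nominal feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the feature itself."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0
    return information_gain(feature, labels) / split_info


# Toy data (illustrative only): cholesterol status versus CVD label
cholesterol = ["normal", "normal", "high", "high",
               "normal", "high", "normal", "high"]
cvd_label = [0, 0, 1, 1, 0, 1, 1, 0]

print(round(information_gain(cholesterol, cvd_label), 4))  # 0.1887
print(round(gain_ratio(cholesterol, cvd_label), 4))        # 0.1887
```

In this toy case the two scores coincide because the feature splits the data into two equally sized groups, so its own entropy is exactly 1 bit. Gain Ratio differs from Information Gain when the feature's values are unevenly distributed or numerous.
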
Features | Min | Max | Mean ± Std |
---|---|---|---|
Age | 30 | 65 | 53.36 ± 6.76 |
BMI | 15.36 | 58.59 | 27.48 ± 5.17 |
Sys BP | 70 | 220 | 126.46 ± 16.33 |
Dias BP | 40 | 150 | 81.44 ± 9.39 |
Age Groups | Non-CVD | CVD |
---|---|---|
30–34 | 0.01% | 0.00% |
35–39 | 0.53% | 0.10% |
40–44 | 9.53% | 4.24% |
45–49 | 7.05% | 5.60% |
50–54 | 15.14% | 12.06% |
55–59 | 10.03% | 13.57% |
60–64 | 7.51% | 14.24% |
65–69 | 0.21% | 0.19% |

Gender | Non-CVD | CVD |
---|---|---|
Female | 32.84% | 38.33% |
Male | 17.16% | 11.67% |
BMI Classes | Non-CVD | CVD |
---|---|---|
Underweight BMI | 0.70% | 0.14% |
Healthy BMI < 25 | 21.71% | 14.79% |
Overweight BMI < 30 | 17.99% | 19.45% |
Obese I BMI | 6.90% | 9.67% |
Obese II BMI | 2.06% | 4.21% |
Obese III BMI | 0.64% | 1.73% |

Physical Activity | Non-CVD | CVD |
---|---|---|
No | 9.41% | 5.48% |
Yes | 40.59% | 44.52% |
Cholesterol | Non-CVD | CVD |
---|---|---|
Normal | 41.88% | 37.29% |
Above Normal | 5.69% | 4.59% |
Well Above Normal | 2.43% | 8.12% |

Glucose | Non-CVD | CVD |
---|---|---|
Normal | 43.85% | 45.20% |
Above Normal | 3.33% | 2.24% |
Well Above Normal | 2.82% | 2.55% |
Smoke | Non-CVD | CVD |
---|---|---|
No | 45.53% | 45.65% |
Yes | 4.48% | 4.36% |

Alcohol | Non-CVD | CVD |
---|---|---|
No | 45.28% | 48.08% |
Yes | 4.72% | 1.92% |
Sys/Dias Blood Pressure Categories [58] | Non-CVD | CVD |
---|---|---|
Normal Sys BP and Dias BP | 10.37% | 3.53% |
Elevated Sys BP and Dias BP | 2.61% | 1.32% |
Hypertension I Sys BP or Dias BP | 32.24% | 25.96% |
Hypertension II Sys BP or Dias BP | 4.77% | 19.20% |
Models | Parameters |
---|---|
NB | useKernelEstimator: False useSupervisedDiscretization: True |
LR | ridge = useConjugateGradientDescent: True |
MLP | learning rate = 0.1 momentum = 0.2 training time = 200 |
kNN | k = 3 Search Algorithm: LinearNNSearch with Euclidean cross-validate = True |
RF | breakTiesRandomly: True numIterations = 500 storeOutOfBagPredictions: True |
RotF | classifier: Random Forest numberOfGroups: True projectionFilter: PrincipalComponents |
AdaBoostM1 | classifier: Random Forest resume: True useResampling: True |
Stacking | classifiers: Random Forest and Naive Bayes metaClassifier: Logistic Regression |
Voting | classifiers: Random Forest and Naive Bayes combinationRule: average of probabilities |
Bagging | classifiers: Random Forest printClassifiers: True storeOutOfBagPredictions: True |
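
The Voting row above combines its base classifiers with an "average of probabilities" rule. The sketch below illustrates that rule in plain Python, assuming each base model returns a class-probability vector for an instance. The function name and the probability values are hypothetical and are not outputs of the models in this study.

```python
def average_of_probabilities(prob_vectors):
    """Average per-class probabilities across base models (soft voting)."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [
        sum(vector[c] for vector in prob_vectors) / n_models
        for c in range(n_classes)
    ]


# Hypothetical outputs for one instance: [P(Non-CVD), P(CVD)]
rf_probs = [0.30, 0.70]   # e.g. from a Random Forest
nb_probs = [0.60, 0.40]   # e.g. from a Naive Bayes model

avg = average_of_probabilities([rf_probs, nb_probs])
predicted_class = max(range(len(avg)), key=avg.__getitem__)

print([round(p, 2) for p in avg])  # [0.45, 0.55]
print(predicted_class)             # 1 (CVD)
```
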

Model | Accuracy (No SMOTE) | Accuracy (SMOTE) | Precision (No SMOTE) | Precision (SMOTE) | Recall (No SMOTE) | Recall (SMOTE) | AUC (No SMOTE) | AUC (SMOTE) |
---|---|---|---|---|---|---|---|---|
NB | 0.771 | 0.836 | 0.648 | 0.849 | 0.560 | 0.791 | 0.787 | 0.866 |
LR | 0.772 | 0.846 | 0.706 | 0.855 | 0.444 | 0.799 | 0.789 | 0.880 |
MLP | 0.768 | 0.840 | 0.656 | 0.858 | 0.519 | 0.806 | 0.771 | 0.894 |
3NN | 0.714 | 0.833 | 0.544 | 0.801 | 0.446 | 0.807 | 0.695 | 0.811 |
RF | 0.740 | 0.866 | 0.588 | 0.877 | 0.522 | 0.874 | 0.749 | 0.977 |
RotF | 0.752 | 0.872 | 0.614 | 0.875 | 0.527 | 0.860 | 0.759 | 0.940 |
AdaBoostM1 | 0.738 | 0.868 | 0.584 | 0.876 | 0.521 | 0.871 | 0.724 | 0.976 |
Stacking | 0.776 | 0.878 | 0.676 | 0.883 | 0.560 | 0.880 | 0.786 | 0.982 |
Bagging | 0.753 | 0.876 | 0.619 | 0.878 | 0.520 | 0.876 | 0.763 | 0.975 |
Voting | 0.775 | 0.867 | 0.660 | 0.834 | 0.557 | 0.838 | 0.781 | 0.946 |
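
For reference, the four metrics reported in the table above can be computed from true labels and predicted scores as in the following plain-Python sketch. It uses the standard definitions, with AUC taken as the fraction of positive/negative pairs that are ranked correctly. The toy labels, scores and the 0.5 decision threshold are illustrative assumptions, not data from this study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = CVD)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
print(auc(y_true, scores))                     # 0.9375
```

Accuracy, Precision and Recall depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why the two kinds of metric can order the models differently.
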
Reference | Dataset | Proposed Model | Performance |
---|---|---|---|
[40] | [81] | Logistic Regression | AUC 78.4% Accuracy 72.1% |
[73] | Long Beach VA heart disease database | Logistic Regression | Accuracy 86.5% |
[74] | [81] | SVM | AUC 78.84% |
[75] | [81] | Hybrid Random Forest with a linear model (HRFLM) | Accuracy 88.7% |
[76] | [81] | SVM (linear kernel) | Accuracy 86.8% |
[77] | UK Biobank | AutoPrognosis model | AUC 77.4% |
[78] | [81] | Gradient Boosting algorithm | AUC 84% Accuracy 89.7% |
[79] | Not Publicly Available | Neuro-Fuzzy model | Accuracy 91% |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dritsas, E.; Trigka, M. Efficient Data-Driven Machine Learning Models for Cardiovascular Diseases Risk Prediction. Sensors 2023, 23, 1161. https://doi.org/10.3390/s23031161