Theory and Practice of Integrating Machine Learning and Conventional Statistics in Medical Data Analysis
Abstract
1. Introduction
1.1. Past Reviews, Rationale for the Review and Intended Audience
1.2. Review Content
2. Survey Methodology
- i. history of conventional statistics and machine learning in medicine;
- ii. comparison between conventional statistics and machine learning;
- iii. use of machine learning in various fields;
- iv. analysis of medical data using conventional statistics; and
- v. use of machine learning and artificial intelligence in medical analysis.
2.1. Inclusion Criteria
- (i) all papers published between 2015 and 2022
- (ii) all open-access papers that are freely available
- (iii) papers retrieved with the search keywords "conventional statistics", "machine learning", "medical data", "comparison" and "health research". The entries returned by these keywords came from various medical domains, machine learning analyses and statistics in healthcare research, and were not limited to a single type of disease.
2.2. Exclusion Criteria
- (i) all papers not relevant to our topic
- (ii) all papers that are not freely accessible
- (iii) all papers published before 2015
3. Results
3.1. Concepts in Conventional Statistics
3.1.1. Hypothesis Testing and Statistical Inference for Classification
3.1.2. Regression
3.2. Concepts in Machine Learning (ML)
3.2.1. Predictive Analytics
3.2.2. Representation Learning
3.2.3. Reinforcement Learning
3.2.4. Causal Inference/Generative Models
3.3. Advantages and Disadvantages of Conventional Statistics and Machine Learning
3.3.1. Data Management
3.3.2. Computational Power, Interpretation/Explainability and Visualization of Results
3.3.3. Dimensionality Reduction
3.3.4. Frequently Used Models or Methods for Data Assessment
3.4. Case Study to Compare Conventional Statistics and Machine Learning
3.4.1. Imputation and Data Pre-Processing
3.4.2. Significant Factors (CS) and Variable Importance (ML)
3.4.3. Survival Analysis
3.5. Simplified Machine Learning Algorithms and Their Relationship with Conventional Statistics
3.5.1. Decision Tree
3.5.2. Random Forest
3.5.3. Extreme Gradient Boosting
3.5.4. Logistic Regression
3.5.5. Support Vector Machine
3.5.6. Artificial Neural Networks
4. Discussion
4.1. Integration of Conventional Statistics with Machine Learning
4.2. Significance of Machine Learning to Healthcare, Education and Society
4.3. Automation of Machine Learning in Healthcare Research
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
Approach | Concept | Procedure |
---|---|---|
Hypothesis testing (inference) | Research question: Is the null hypothesis false? E.g., the null hypothesis states that there is no difference in survival outcome between patients who underwent mastectomy and those who underwent breast-conserving therapy. Answer: The null hypothesis is false; e.g., survival outcome differs significantly by type of surgical treatment. Decision rule: A statistical test analyzes the data and yields a p-value, which is compared against the significance level and reported together with an odds ratio or hazard ratio and its confidence interval (CI); e.g., p < 0.05, hazard ratio 1.50, CI 1.12–2.30. Decision: Reject the null hypothesis. | Step 1: Identify predictors from related literature. Step 2: Design a hypothesis and compare the similarities and differences using a new dataset. |

Approach | Concept | Procedure |
---|---|---|
Classification | Research question: Is label 1 the target outcome? Answer: Yes, label 1 is the target outcome. Decision rule: A trained classifier analyzes the variables and values of an unlabeled observation and returns a predicted label (1). | Step 1: Split the dataset into training and testing datasets. Step 2: Train a model on the training dataset using a specific algorithm. Step 3: Test the trained model on the remaining (testing) dataset to check how accurately it predicts the results. |
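
The three classification steps in the table above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any study cited here: the data are synthetic, the single feature (a hypothetical tumour size in cm) and the helper names `split`, `train` and `evaluate` are invented for the example, and the "algorithm" is a one-threshold rule standing in for a real learner such as a decision tree.

```python
import random

# Synthetic, hypothetical data: (tumour size in cm, label); label 1 = target outcome.
ROWS = [(1.0, 0), (1.2, 0), (1.5, 0), (2.0, 0), (2.2, 0),
        (2.5, 0), (2.8, 0), (3.0, 0), (3.2, 0), (3.5, 0),
        (4.5, 1), (4.8, 1), (5.0, 1), (5.2, 1), (5.5, 1),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1), (8.0, 1)]


def split(rows, test_fraction=0.3, seed=42):
    """Step 1: shuffle the rows and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train(rows):
    """Step 2: pick the threshold that classifies the most training rows correctly."""
    candidates = sorted({x for x, _ in rows})
    return max(candidates,
               key=lambda t: sum((x >= t) == (y == 1) for x, y in rows))


def evaluate(threshold, rows):
    """Step 3: predict the held-out rows and summarise the confusion matrix."""
    tp = fp = tn = fn = 0
    for x, y in rows:
        predicted = 1 if x >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined when the test set contains no rows of that class.
        return numerator / denominator if denominator else float("nan")

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(rows),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


train_rows, test_rows = split(ROWS)
threshold = train(train_rows)
metrics = evaluate(threshold, test_rows)
print(f"learned threshold: {threshold} cm")
print(metrics)
```

The point of the sketch is the separation of roles: the threshold is learned only from the training rows, and the reported metrics come only from rows the model has not seen.
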
Analysis | Conventional Statistics | Machine Learning |
---|---|---|
1 | Imputations | Imputations |
Objective | To impute missing values based on the pattern of missing data, e.g., missing at random | To impute missing values to maintain the quality of the data |
Method | Missing values are assessed by the percentage of missing data, with an acceptable range of 10–20% | 1. Determine the missing values in the data. 2. Perform multiple imputation using algorithms such as mice, Amelia and missForest (examples of packages available in R) |
Results | Imputed data/clean data | Imputed data/clean data |
2 | Effect Size | Model Evaluation |
Objective | To determine whether the model explains the variability in the data. The unexplained part, often called the residual error, should be as small as possible | To determine the quality of data to be used for further analysis |
Method | Residual analysis; if necessary, the standardized residual error is computed using linear regression | 1. Split data into training and testing sets. 2. Build models using algorithms (e.g., decision tree, support vector machine, etc.) |
Results | Measurable R²: under-fit model (< 0.3); good-fit model (0.3–0.7); over-fit model (preferable) (> 0.7) | Accuracy, sensitivity, specificity, precision, Matthews correlation coefficient, area under the receiver operating characteristic curve (AUROC); good-fit model (> 0.7) |
3 | Significant Factors | Variable Importance |
Objective | To select important independent variables that affect the target variable (dependent variable) | To select important independent variables that affect the target variable (dependent variable) |
Method | 1. Run the analysis using all data. 2. Treat or exclude missing values. 3. Use a chi-square test or logistic regression to choose significant variables | 1. Run variable importance using the best model, i.e., the one that fit the data best in model evaluation. 2. Rank and select important variables for further analysis using the importance score |
Results | p-value and odds ratio (OR) with 95% CI, e.g., OR 2.00, CI 1.51–12.52 | Variable importance score/mean (numerical) and variable importance plot (visualization) |
4 | Survival Analysis | Survival Analysis |
Objective | To determine survival rate using time series data | To determine survival rate (%) using time series data |
Method | Similar to machine learning; only the software differs | 1. Specify the independent variables and the target (survival years). 2. Define survival status (event = dead = 1). 3. Use a machine learning survival package, which computes survival based on the Kaplan–Meier method, to calculate the survival percentage |
Results | Survival rate in percentage, hazard ratio | Survival rate in percentage, hazard ratio |

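
The survival row above relies on the Kaplan–Meier method, in which the survival probability is multiplied by (1 − deaths / number at risk) at every time point where a death occurs. The following is a minimal pure-Python sketch of that estimator, not the implementation of any particular survival package; the function name `kaplan_meier` and the six-patient toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct death time.

    times  -- follow-up time of each patient (e.g., survival years)
    events -- 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve


# Hypothetical toy cohort: event = 1 means dead, 0 means censored.
years = [1, 2, 2, 3, 4, 5]
status = [1, 0, 1, 1, 0, 0]
curve = kaplan_meier(years, status)
for t, s in curve:
    print(f"year {t}: estimated survival {s:.1%}")
```

Censored patients never lower the curve directly; they only shrink the number at risk for later time points, which is why the estimate at year 3 drops by a third even though only one patient dies.
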

Variables (Independent) | Total, n (%) | Survival Status (Dependent): Alive, n (%) | Survival Status (Dependent): Death, n (%) | p-Value 1 |
---|---|---|---|---|
Age (years), median | 51 (42, 61) | 51 (43, 60) | 53 (42, 63) | 0.001 |
Marital status | | | | 0.001 |
Married | 6397 (81.6) | 4554 (82.5) | 1843 (79.3) | |
Not married | 1443 (18.4) | 963 (17.5) | 480 (20.7) | |
Menopausal status | | | | <0.001 |
Natural menopause | 3984 (50.8) | 2675 (48.5) | 1309 (56.3) | |
Pre-menopause | 3347 (42.7) | 2459 (44.6) | 888 (38.2) | |
Surgical menopause | 509 (6.5) | 383 (6.9) | 126 (5.4) | |
Presence of family history | | | | <0.001 |
No | 6357 (81.1) | 4378 (79.4) | 1979 (85.2) | |
Yes | 1483 (18.9) | 1139 (20.6) | 344 (14.8) | |
Race | | | | <0.001 |
Chinese | 5394 (68.8) | 4041 (73.2) | 1353 (58.2) | |
Indian | 921 (11.7) | 608 (11.0) | 313 (13.5) | |
Malay | 1525 (19.5) | 868 (15.7) | 657 (28.3) | |
Method of diagnosis | | | | <0.001 |
Excision | 1617 (20.6) | 1294 (23.5) | 323 (13.9) | |
FNAC | 1886 (24.1) | 1013 (18.4) | 873 (37.6) | |
Imaging only | 35 (0.4) | 31 (0.6) | 4 (0.2) | |
Trucut | 4302 (54.9) | 3179 (57.6) | 1123 (48.3) | |
Classification of breast cancer | | | | <0.001 |
In situ | 366 (4.7) | 348 (6.3) | 18 (0.8) | |
Invasive | 7474 (95.3) | 5169 (93.7) | 2305 (99.2) | |
Laterality | | | | <0.001 |
Bilateral | 97 (1.2) | 26 (0.5) | 71 (3.1) | |
Left | 3553 (45.3) | 2464 (44.7) | 1089 (46.9) | |
Right | 3895 (49.7) | 2830 (51.3) | 1065 (45.8) | |
Unilateral side unknown | 295 (3.8) | 197 (3.6) | 98 (4.2) | |
Cancer stage | | | | <0.001 |
Pre-cancer | 365 (4.7) | 347 (6.3) | 18 (0.8) | |
Curable cancer | 6624 (84.5) | 4956 (89.8) | 1668 (71.8) | |
Metastatic | 851 (10.9) | 214 (3.9) | 637 (27.4) | |
Tumour size (cm), median | 2.7 (1.6, 4.5) | 2.3 (1.5, 3.5) | 4.0 (2.5, 8.0) | <0.001 |
Total axillary lymph nodes removed, median | 11 (4, 16) | 12 (6, 17) | 9 (0, 16) | <0.001 |
Number of positive lymph nodes, median | 0 (0, 2) | 0 (0, 1) | 0 (0, 4) | <0.001 |
Grade of differentiation in tumour | | | | <0.001 |
Good | 2548 (32.5) | 1631 (29.6) | 917 (39.5) | |
Moderate | 2936 (37.4) | 2259 (40.9) | 677 (29.1) | |
Poor | 2356 (30.1) | 1627 (29.5) | 729 (31.4) | |
Estrogen status | | | | <0.001 |
Negative | 3198 (40.8) | 1936 (35.1) | 1262 (54.3) | |
Positive | 4642 (59.2) | 3581 (64.9) | 1061 (45.7) | |
Progesterone status | | | | <0.001 |
Negative | 4157 (53.0) | 2603 (47.2) | 1554 (66.9) | |
Positive | 3683 (47.0) | 2914 (52.8) | 769 (33.1) | |
c-erbB-2 status | | | | <0.001 |
Positive | 1862 (23.8) | 1245 (22.6) | 617 (26.6) | |
Negative | 5148 (65.7) | 3666 (66.4) | 1482 (63.8) | |
Equivocal | 830 (10.6) | 606 (11.0) | 224 (9.6) | |
Primary treatment type | | | | <0.001 |
Chemotherapy | 976 (12.4) | 438 (7.9) | 538 (23.3) | |
Hormone therapy | 270 (3.4) | 100 (1.8) | 170 (7.3) | |
Surgery | 6140 (78.3) | 4812 (87.2) | 1328 (57.2) | |
None | 454 (5.8) | 167 (3.0) | 287 (12.4) | |
Surgery status | | | | <0.001 |
Surgery done | 6740 (86.0) | 5121 (92.8) | 1619 (69.7) | |
No surgery | 1100 (14.0) | 396 (7.2) | 704 (30.3) | |
Type of surgery | | | | <0.001 |
Breast-conserving surgery | 1916 (24.4) | 1661 (30.1) | 255 (11.0) | |
Mastectomy | 4821 (61.5) | 3456 (62.6) | 1365 (58.8) | |
No surgery | 1103 (14.1) | 400 (7.3) | 703 (30.3) | |
Method of axillary lymph node dissection | | | | <0.001 |
Yes | 5553 (70.8) | 4048 (73.4) | 1505 (64.8) | |
SLNB (sentinel lymph node biopsy) | 540 (6.9) | 531 (9.6) | 9 (0.4) | |
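
To make the link between the counts and the p-value column concrete, the sketch below computes a Pearson chi-square test of independence for the marital status rows of the table above (married: 4554 alive, 1843 dead; not married: 963 alive, 480 dead). It assumes a Pearson test without continuity correction, which the table itself does not confirm, and the helper name `chi_square_2x2` is hypothetical. The closed-form p-value used here is valid only for one degree of freedom, i.e., a 2 × 2 table.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2 x 2 count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Chi-square upper tail for df = 1 (assumption: 2 x 2 table only).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the marital status rows: [alive, dead] per group.
marital_counts = [[4554, 1843],   # married
                  [963, 480]]     # not married
chi2, p_value = chi_square_2x2(marital_counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Under these assumptions the statistic comes out near 11.2 with a p-value close to 0.001, in line with the value reported for marital status. Variables with more than two categories need the general chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom, which this shortcut does not cover.
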

Comparison | SPSS | R |
---|---|---|
Method | 1. Under the Survival menu, Life Tables is used to plot the survival curve 2. The log-rank test is used to determine significant differences between the groups of each variable 3. Stage is grouped as Stage 0 (pre-cancer), Stages 1–3 (curable cancer), Stage 4 (metastatic cancer) 4. Tumor size and positive lymph nodes were grouped using clinical guidelines | 1. The survival package is loaded 2. Survival years and survival status are used to calculate the overall survival rate for selected variables 3. Stage is grouped as Stage 0 (pre-cancer), Stages 1–3 (curable cancer), Stage 4 (metastatic cancer) 4. Tumor size and positive lymph nodes were grouped using results from a decision tree |
Results | a. Tumor size; b. Cancer stage; c. Positive lymph nodes | a. Tumor size; b. Cancer stage; c. Positive lymph nodes |
Interpretation | A p-value < 0.05 indicates a statistically significant difference | The survival percentages are extracted from the survival curve to estimate the survival rate of patients. |

Log Rank (Mantel-Cox) | Chi-square | df | Sig. |
---|---|---|---|
Tumor size | 1105.407 | 2 | <0.001 |
Cancer stage | 1721.517 | 2 | <0.001 |
Positive lymph nodes | 234.999 | 3 | <0.001 |
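
The log-rank (Mantel-Cox) statistic reported above compares observed and expected deaths across groups at every event time. The sketch below shows the two-group case only, so it has one degree of freedom, whereas the variables in the table have three or four groups. The function name `logrank_two_groups` and the toy data are hypothetical; this is an illustration of the calculation, not a reproduction of the SPSS output.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with df = 1."""
    records = [(t, e, "a") for t, e in zip(times_a, events_a)]
    records += [(t, e, "b") for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in records if e == 1})

    observed_a = 0
    expected_a = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still under observation just before time t.
        n_a = sum(1 for rt, _, g in records if rt >= t and g == "a")
        n_b = sum(1 for rt, _, g in records if rt >= t and g == "b")
        # Deaths occurring exactly at time t.
        d_a = sum(1 for rt, e, g in records if rt == t and e == 1 and g == "a")
        d_b = sum(1 for rt, e, g in records if rt == t and e == 1 and g == "b")
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the deaths in group a at time t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("log-rank statistic undefined: zero variance")
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, df = 1
    return chi2, p_value


# Hypothetical toy data: group a dies earlier than group b, no censoring.
chi2, p_value = logrank_two_groups([1, 2], [1, 1], [3, 4], [1, 1])
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

With more than two groups the same observed-minus-expected idea applies, but the statistic is built from a covariance matrix and has (number of groups − 1) degrees of freedom, which is why the table lists df values of 2 and 3.
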

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dhillon, S.K.; Ganggayah, M.D.; Sinnadurai, S.; Lio, P.; Taib, N.A. Theory and Practice of Integrating Machine Learning and Conventional Statistics in Medical Data Analysis. Diagnostics 2022, 12, 2526. https://doi.org/10.3390/diagnostics12102526