A Review on Trending Machine Learning Techniques for Type 2 Diabetes Mellitus Management
Abstract
1. Introduction
2. Diabetes
2.1. Definitions
2.2. T2DM Complications and Management
3. Machine Learning Background
3.1. Models
- Logistic Regression: This model uses a logistic function to estimate the probability that a vector of observed features, Xi, belongs to a class, Y. It is the most straightforward approach for the binary classification of T2DM.
- Naive Bayes (NB): NB assumes independence between features and applies Bayes’ Rule to predict the class with the highest resulting probability. For T2DM classification, the probability of each feature within each class is calculated [11].
- K-Nearest Neighbors (KNN): Given an observation, KNN finds the K most similar known data points, tallies their class labels and assigns the majority class to the unknown observation. Similarity can be computed in various ways, most commonly with the Euclidean, Manhattan or Mahalanobis distance. It has been applied to accurately forecast the onset of T2DM [12] and has been preferred over other data-driven Machine Learning algorithms for diabetes risk prediction [13].
- Support Vector Machine (SVM): This model maps the data into a feature space and then searches for the hyperplane that best separates the classes. “Best” is judged by the margin: the larger the margin, the better the separation is considered to be. SVMs have been implemented to predict the diagnosis of diabetes mellitus [14] and are among the most successful and widely used algorithms for both biomarker discovery and diabetes mellitus prediction [15].
- Decision Tree (DT): DT is a data-driven model that makes no assumptions about the data distribution; it constructs a tree-shaped structure of simple if–else rules based on the input features and uses these rules to make predictions. Such models have been used to identify potential interactions between T2DM risk factors [16] and to determine the risk factors associated with T2DM [17].
- Multilayer Perceptron (MLP): MLP is perhaps the simplest form of Artificial Neural Network (ANN). It is a feed-forward ANN that loosely simulates the way the human brain makes decisions and consists of layers of artificial neurons (nodes) that cooperate to produce a numerical output. It has been applied to T2DM diagnosis and to determining the relative importance of risk factors [20], as well as to diabetes risk prediction [21].
- Gradient Boosting: Gradient Boosting refers to a broad family of ensemble models that combine decision trees with gradient-based optimization. In every iteration, the predictions of the trees, also called “weak learners”, are combined to produce a better result than in the previous iteration; after the last iteration, a final prediction is returned. Such models include Extreme Gradient Boosting (XGBoost), the Gradient Boosting Machine (GBM), LightGBM and Categorical Boosting (CatBoost). These are among the most recent successful developments within the Gradient Boosting framework, presenting low computational complexity when employed in T2DM diagnosis and prediction [22,23], as well as in the identification of T2DM predictors [8].
- Ensemble Models: This category includes the voting classifier and the stacking classifier, both of which build on the idea of combining weak classifiers. In the voting classifier, the predictions of the weak classifiers are combined through majority, soft or weighted voting to make the final prediction. In the stacking classifier, the weak classifiers produce probability outputs that are fed to a final classifier, which is trained on these probabilistic outputs rather than on the conventional features and ultimately makes the prediction [24,25,26,27,28].
- In the regression category, most of the aforementioned models can also work efficiently. The classical model belonging to this category, however, is Linear Regression, in which a linear function fitted with the least-squares method predicts continuous values. (A minimal illustrative sketch of several of the models above follows this list.)
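To make the descriptions above concrete, the following minimal sketch uses scikit-learn [30], one of the tools cited in this review, to train the classifiers discussed in this section together with a soft-voting ensemble and to compare them by cross-validated AUC. The dataset file, column names and hyperparameter values are illustrative assumptions, not settings taken from any of the reviewed studies.

```python
# Minimal sketch (illustrative only): comparing the classifiers described above
# with scikit-learn [30] on a generic tabular T2DM dataset.
# The file name, label column and hyperparameters are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier

df = pd.read_csv("t2dm_tabular.csv")                      # hypothetical dataset
X, y = df.drop(columns=["diabetes"]), df["diabetes"]      # binary label: 1 = T2DM

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF)": SVC(kernel="rbf", probability=True),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
    "Gradient Boosting": GradientBoostingClassifier(),
}
# Soft-voting ensemble combining three weak learners of the kinds defined above.
models["Voting Ensemble"] = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("gb", GradientBoostingClassifier()),
                ("svm", SVC(kernel="rbf", probability=True))],
    voting="soft",
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    # Scaling matters for KNN, SVM and MLP; it is harmless for the tree-based models.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

The same structure extends naturally to stacking (scikit-learn’s StackingClassifier) or to the boosting libraries named above (XGBoost, LightGBM, CatBoost) by swapping the estimators.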
3.2. Imputation and Normalization
3.3. Balancing
3.4. Feature Selection
- Filter: Pearson correlation coefficient, chi-squared test and ANOVA.
- Wrapper: Sequential feature selection or backward elimination.
- Embedded: L1 (LASSO) regularization and Ridge Regression (a brief sketch of the three approaches follows this list).
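As an illustration of the three families listed above, the sketch below applies a filter method (ANOVA F-test), a wrapper method (sequential forward selection) and an embedded method (L1-regularized Logistic Regression) with scikit-learn [30]. The dataset, the number of selected features and the regularization strength are assumptions made for demonstration only.

```python
# Minimal sketch of filter, wrapper and embedded feature selection with
# scikit-learn [30]; the dataset and parameter choices are illustrative assumptions.
import pandas as pd
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("t2dm_tabular.csv")                      # hypothetical dataset
X, y = df.drop(columns=["diabetes"]), df["diabetes"]

# Filter: rank features by the ANOVA F-statistic and keep the top 10.
filter_sel = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("Filter:", list(X.columns[filter_sel.get_support()]))

# Wrapper: sequential forward selection driven by a Logistic Regression model.
wrapper_sel = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=10, direction="forward"
).fit(X, y)
print("Wrapper:", list(X.columns[wrapper_sel.get_support()]))

# Embedded: the L1 (LASSO) penalty shrinks uninformative coefficients to exactly zero.
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("Embedded:", list(X.columns[embedded.coef_[0] != 0]))
```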
3.5. Evaluation
4. Relevant Sections
4.1. Related Work
- Dataset structure;
- Top-performing models;
- Most frequent models;
- Complementary techniques;
- Ascending models;
- Performance evaluation.
4.2. Machine Learning Applications in Diabetes
4.2.1. Current-State Classification
4.2.2. Biomarker Regression
4.2.3. Long-Term Prediction
5. Discussion
- Types of hypotheses addressing T2DM through Machine Learning using tabular data.
- Data preprocessing.
- Features involved.
- Selection and identification of most important features.
- Methodology structure towards model building.
- Evaluation metrics.
- Best models.
5.1. Types of Hypotheses Addressing Diabetes through Machine Learning Using Tabular Data
5.2. Data Preprocessing
5.3. Features Involved
5.4. Selection and Identification of the Most Important Features
5.5. Methodology Structure towards Model Building
5.6. Evaluation Metrics
5.7. Best Models
5.8. Limitations of Existing Approaches
5.9. Comparison with Previous Reviews
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- International Diabetes Federation. Available online: https://idf.org/about-diabetes/what-is-diabetes/ (accessed on 23 May 2024).
- Kristo, A.S.; İzler, K.; Grosskopf, L.; Kerns, J.J.; Sikalidis, A.K. Emotional Eating Is Associated with T2DM in an Urban Turkish Population: A Pilot Study Utilizing Social Media. Diabetology 2024, 5, 286–299.
- Kavakiotis, I.; Tsave, O.; Salifoglou, A.; Maglaveras, N.; Vlahavas, I.; Chouvarda, I. Machine Learning and Data Mining Methods in Diabetes Research. Comput. Struct. Biotechnol. J. 2017, 15, 104–116.
- Fregoso-Aparicio, L.; Noguez, J.; Montesinos, L.; García-García, J. Machine learning and deep learning predictive models for type 2 diabetes: A systematic review. Diabetol. Metab. Syndr. 2021, 13, 148.
- Sudharsan, B.; Peeples, M.; Shomali, M. Hypoglycemia Prediction Using Machine Learning Models for Patients With Type 2 Diabetes. J. Diabetes Sci. Technol. 2015, 9, 86–90.
- You, Y.; Doubova, S.V.; Pinto-Masis, D.; Pérez-Cuevas, R.; Borja-Aburto, V.H.; Hubbard, A. Application of machine learning methodology to assess the performance of DIABETIMSS program for patients with type 2 diabetes in family medicine clinics in Mexico. BMC Med. Inform. Decis. Mak. 2019, 19, 221.
- Uddin, M.J.; Ahamad, M.M.; Hoque, M.N.; Walid, M.A.A.; Aktar, S.; Alotaibi, N.; Alyami, S.A.; Kabir, M.A.; Moni, M.A. A Comparison of Machine Learning Techniques for the Detection of Type-2 Diabetes Mellitus: Experiences from Bangladesh. Information 2023, 14, 376.
- Lugner, M.; Rawshani, A.; Helleryd, E.; Eliasson, B. Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data. Sci. Rep. 2024, 14, 2102.
- Sikalidis, A.K. From Food for Survival to Food for Personalized Optimal Health: A Historical Perspective of How Food and Nutrition Gave Rise to Nutrigenomics. J. Am. Coll. Nutr. 2019, 38, 84–95.
- Cloete, L. Diabetes mellitus: An overview of the types, symptoms, complications and management. Nurs. Stand. 2022, 37, 61–66.
- Iparraguirre-Villanueva, O.; Espinola-Linares, K.; Flores Castañeda, R.O.; Cabanillas-Carbonell, M. Application of Machine Learning Models for Early Detection and Accurate Classification of Type 2 Diabetes. Diagnostics 2023, 13, 2383.
- Garcia-Carretero, R.; Vigil-Medina, L.; Mora-Jimenez, I.; Soguero-Ruiz, C.; Barquero-Perez, O.; Ramos-Lopez, J. Use of a K-nearest neighbors model to predict the development of type 2 diabetes within 2 years in an obese, hypertensive population. Med. Biol. Eng. Comput. 2020, 58, 991–1002.
- Dritsas, E.; Trigka, M. Data-Driven Machine-Learning Methods for Diabetes Risk Prediction. Sensors 2022, 22, 5304.
- Viloria, A.; Herazo-Beltran, Y.; Cabrera, D.; Pineda, O.B. Diabetes diagnostic prediction using vector support machines. Procedia Comput. Sci. 2020, 170, 376–381.
- Bernabe-Ortiz, A.; Borjas-Cavero, D.B.; Páucar-Alfaro, J.D.; Carrillo-Larco, R.M. Multimorbidity Patterns among People with Type 2 Diabetes Mellitus: Findings from Lima, Peru. Int. J. Environ. Res. Public Health 2022, 19, 9333.
- Ramezankhani, A.; Hadavandi, E.; Pournik, O.; Shahrabi, J.; Azizi, F.; Hadaegh, F. Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: A decade follow-up in a Middle East prospective cohort study. BMJ Open 2016, 6, e013336.
- Esmaily, H.; Tayefi, M.; Doosti, H.; Ghayour-Mobarhan, M.; Nezami, H.; Amirabadizadeh, A. A Comparison between Decision Tree and Random Forest in Determining the Risk Factors Associated with Type 2 Diabetes. J. Res. Health Sci. 2018, 18, e00412.
- Aguilera-Venegas, G.; López-Molina, A.; Rojo-Martínez, G.; Galán-García, J.L. Comparing and Tuning Machine Learning Algorithms to Predict Type 2 Diabetes Mellitus. J. Comput. Appl. Math. 2023, 427, 115115.
- Wang, X.; Zhai, M.; Ren, Z.; Ren, H.; Li, M.; Quan, D.; Chen, L.; Qiu, L. Exploratory Study on Classification of Diabetes Mellitus through a Combined Random Forest Classifier. BMC Med. Inform. Decis. Mak. 2021, 21, 105.
- Borzouei, S.; Soltanian, A.R. Application of an Artificial Neural Network Model for Diagnosing Type 2 Diabetes Mellitus and Determining the Relative Importance of Risk Factors. Epidemiol. Health 2018, 40, e2018007.
- Mao, Y.; Zhu, Z.; Pan, S.; Lin, W.; Liang, J.; Huang, H.; Li, L.; Wen, J.; Chen, G. Value of machine learning algorithms for predicting diabetes risk: A subset analysis from a real-world retrospective cohort study. J. Diabetes Investig. 2023, 14, 309–320.
- Rufo, D.D.; Debelee, T.G.; Ibenthal, A.; Negera, W.G. Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM). Diagnostics 2021, 11, 1714.
- Khan, A.A.; Qayyum, H.; Liaqat, R.; Ahmad, F.; Nawaz, A.; Younis, B. Optimized Prediction Model for Type 2 Diabetes Mellitus Using Gradient Boosting Algorithm; IEEE Xplore: Piscataway, NJ, USA, 2021.
- Alsadi, B.; Musleh, S.; Al-Absi, H.R.; Refaee, M.; Qureshi, R.; El Hajj, N.; Alam, T. An Ensemble-Based Machine Learning Model for Predicting Type 2 Diabetes and Its Effect on Bone Health. BMC Med. Inform. Decis. Mak. 2024, 24, 144.
- Ganie, S.M.; Malik, M.B. An Ensemble Machine Learning Approach for Predicting Type-II Diabetes Mellitus Based on Lifestyle Indicators. Healthc. Anal. 2022, 2, 100092.
- Morgan-Benita, J.A.; Galván-Tejada, C.E.; Cruz, M.; Galván-Tejada, J.I.; Gamboa-Rosales, H.; Arceo-Olague, J.G.; Luna-García, H.; Celaya-Padilla, J.M. Hard Voting Ensemble Approach for the Detection of Type 2 Diabetes in Mexican Population with Non-Glucose Related Features. Healthcare 2022, 10, 1362.
- Dinh, A.; Miertschin, S.; Young, A.; Mohanty, S. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 2019, 19, 211.
- Fazakis, N.; Kocsis, O.; Dritsas, E.; Alexiou, S.; Fakotakis, N.; Moustakas, K. Machine Learning Tools for Long-Term Type 2 Diabetes Risk Prediction. IEEE Access 2021, 9, 103737–103757.
- Frank, E.; Hall, M.A.; Holmes, G.; Kirkby, R.; Pfahringer, B.; Witten, I.H. Weka: A machine learning workbench for data mining. In Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers; Maimon, O., Rokach, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1305–1314.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Seabold, S.; Perktold, J. statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28–30 June 2010.
- Gray, L.J.; Taub, N.A.; Khunti, K.; Gardiner, E.; Hiles, S.; Webb, D.R.; Srinivasan, B.T.; Davies, M.J. The Leicester Risk Assessment score for detecting undiagnosed Type 2 diabetes and impaired glucose regulation for use in a multiethnic UK setting. Diabet. Med. 2010, 27, 887–895.
- Lindstrom, J.; Tuomilehto, J. The Diabetes Risk Score: A practical tool to predict type 2 diabetes risk. Diabetes Care 2003, 26, 725–731.
- Lai, H.; Huang, H.; Keshavjee, K.; Guergachi, A.; Gao, X. Predictive models for diabetes mellitus using machine learning techniques. BMC Endocr. Disord. 2019, 19, 101.
- Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting Diabetes Mellitus With Machine Learning Techniques. Front. Genet. 2018, 9, 515.
- Zhang, L.; Wang, Y.; Niu, M.; Wang, C.; Wang, Z. Machine learning for characterizing risk of type 2 diabetes mellitus in a rural Chinese population: The Henan Rural Cohort Study. Sci. Rep. 2020, 10, 4406.
- American Diabetes Association. Diabetes Diagnostic Criteria. Available online: https://diabetes.org/about-diabetes/diagnosis (accessed on 21 June 2024).
- De Silva, K.; Lim, S.; Mousa, A.; Teede, H.; Forbes, A.; Demmer, R.T.; Jönsson, D.; Enticott, J. Nutritional markers of undiagnosed type 2 diabetes in adults: Findings of a machine learning analysis with external validation and benchmarking. PLoS ONE 2021, 16, e0250832.
- Phongying, M.; Hiriote, S. Diabetes Classification Using Machine Learning Techniques. Computation 2023, 11, 96.
- Qin, Y.; Wu, J.; Xiao, W.; Wang, K.; Huang, A.; Liu, B.; Yu, J.; Li, C.; Yu, F.; Ren, Z. Machine Learning Models for Data-Driven Prediction of Diabetes by Lifestyle Type. Int. J. Environ. Res. Public Health 2022, 19, 5027.
- Kazerouni, F.; Bayani, A.; Asadi, F.; Saeidi, L.; Parvizi, N.; Mansoori, Z. Type2 Diabetes Mellitus Prediction Using Data Mining Algorithms Based on the Long-Noncoding RNAs Expression: A Comparison of Four Data Mining Approaches. BMC Bioinform. 2020, 21, 372.
- Agliata, A.; Giordano, D.; Bardozzo, F.; Bottiglieri, S.; Facchiano, A.; Tagliaferri, R. Machine Learning as a Support for the Diagnosis of Type 2 Diabetes. Int. J. Mol. Sci. 2023, 24, 6775.
- Kopitar, L.; Kocbek, P.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020, 10, 11981.
- Liu, Q.; Zhang, M.; He, Y.; Zhang, L.; Zou, J.; Yan, Y.; Guo, Y. Predicting the Risk of Incident Type 2 Diabetes Mellitus in Chinese Elderly Using Machine Learning Techniques. J. Pers. Med. 2022, 12, 905.
- Lama, L.; Wilhelmsson, O.; Norlander, E.; Gustafsson, L.; Lager, A.; Tynelius, P.; Wärvik, L.; Östenson, C.G. Machine learning for prediction of diabetes risk in middle-aged Swedish people. Heliyon 2021, 7, e07419.
- Shin, J.; Lee, J.; Ko, T.; Lee, K.; Choi, Y.; Kim, H.S. Improving Machine Learning Diabetes Prediction Models for the Utmost Clinical Effectiveness. J. Pers. Med. 2022, 12, 1899.
- Deberneh, H.M.; Kim, I. Prediction of Type 2 Diabetes Based on Machine Learning Algorithm. Int. J. Environ. Res. Public Health 2021, 18, 3317.
- Sikalidis, A.K.; Kristo, A.S.; Reaves, S.K.; Kurfess, F.J.; DeLay, A.M.; Vasilaky, K.; Donegan, L. Capacity Strengthening Undertaking—Farm Organized Response of Workers against Risk for Diabetes: (C.S.U.—F.O.R.W.A.R.D. with Cal Poly)—A Concept Approach to Tackling Diabetes in Vulnerable and Underserved Farmworkers in California. Sensors 2022, 22, 8299.
(a)

Study and Purpose | Dataset | Complementary Techniques | Important Features | Best Model |
---|---|---|---|---|
[34] Lai et al. Diabetes classification | 13,309 records from CPCSSN. Personal data and recent laboratory results | Misclassification cost matrix, grid search, adjusted threshold, 10-fold cross validation and information gain | FPG, HDL, BMI and triglycerides | GBM AUC 0.847, Misclassification rate 0.189, Sensitivity 0.716 and Specificity 0.837 |
[35] Zou et al. Diabetes identification | 138,000 records with glucose, physical examination, biometric, demographic and laboratory results | Random sampling, mRMR, PCA and 5-fold CV | FPG, weight and age | Random Forest using all 14 available features, with Accuracy 0.8084, Sensitivity 0.8495 and Specificity 0.7673 |
[27] Dinh et al. Diabetes identification | NHANES, survey data and laboratory results | Standardization, majority downsampling, ensemble model weighting optimization, hyperparameter tuning and 10-fold CV | Waist circumference, age, blood osmolality, sodium, blood urea nitrogen and triglycerides | Case 1: (a) With survey data: XGBoost, AUC 0.862, Precision, Recall and F1-Score all 0.78; (b) With laboratory results: AUC 0.957, Precision, Recall and F1-Score all 0.89. Case 2: (a) With survey data: Ensemble, AUC 0.737, Precision, Recall and F1-Score all 0.68; (b) With laboratory results: XGBoost, AUC 0.802, Precision, Recall and F1-Score all 0.74 |
[36] Zhang et al. Diabetes identification | 36,652 records from Henan rural cohort, including sociodemographic, anthropometric, biometric, laboratory result and history of disease data | SMOTE, hyperparameter tuning and 10-fold CV | Urinary parameters, sweet flavor, age, heart rate and creatinine | Experiments with and without laboratory results: both XGBoost methods: AUC 0.872 and 0.817, Accuracy 0.812 and 0.702, Sensitivity 0.76 and 0.789, and Specificity 0.871 and 0.694 |
[38] De Silva et al. Diabetes identification | 16,429 records from NHANES with nutritional, behavioral, socioeconomic and non-modifiable demographic features | MICE imputation, minority class oversampling, ROSE and SMOTE. Hyperparameter tuning, CV and odds ratio | Folate, self-reported diet health, number of people in household, and total fat and cigarette consumption | Logistic Regression trained on minority oversampling dataset: AUC 0.746 |
[39] Phongying et al. Diabetes identification | 20,227 records from the Department of Medical Services in Bangkok, including demographic, biometric, blood pressure and heart rate results and family history of diabetes data | MinMax normalization, Gain ratio, interaction variables and hyperparameter tuning | BMI and family history of diabetes | Random Forest trained on interaction variable dataset, achieving 0.975 Accuracy, 0.974 Precision and 0.966 Recall |
[41] Kazerouni et al. Diabetes identification | 200 records, Shohadan Hospital, Tehran (100 T2DM) | Standardization, 10-fold cross validation | Long non-coding RNA (lncRNA) expression for predicting T2DM and diabetes detection on an RNA molecular basis | SVM: AUC 0.95, Sensitivity 95% and Specificity 86% |
[42] Agliata et al. Diabetes identification | Balanced dataset (NHANES, MIMIC-III and MIMIC-IV) | Standardization | Glucose level, triglyceride level, HDL, systolic blood pressure, diastolic blood pressure, gender/sex, age, weight and Body Mass Index (BMI) | Binary classifier (NN): Accuracy approximately 86%, AUC 0.934 |
[7] Uddin et al. Diabetes identification | 508-record dataset from Bangladesh | SMOTE and random oversampling, feature selection by recursive feature elimination | Age, having diabetes in the family, regular intake of medicine and extreme thirst | Ensemble Technique: Accuracy of 99.27% and F1-score of 99.27% |
[40] Qin et al. Diabetes identification | 17,833 records (NHANES), including demographic, dietary, examination and questionnaire features | SMOTE, backward feature selection with objective AIC and SHAP | Sleep time, energy and age | CatBoost: AUC 0.83, Accuracy 0.821, Sensitivity 0.82 and Specificity 0.519 |
[22] Rufo et al. Diabetes identification | 2109 records from ZMHDD hospital, including demographic, anthropometric, blood pressure, cholesterol, pulse rate and FBS data | Median imputation, MinMax normalization, Pearson correlation coefficient, hyperparameter tuning and 10-fold CV | FPG, total cholesterol and BMI | LightGBM: AUC 0.98, Accuracy 0.98, Sensitivity 0.99 and Specificity 0.96 |
[26] Morgan-Benita et al. Diabetes identification | 1787 records from Centro Medico Nacional Siglo XXI in Mexico City, including sociodemographic, anthropometric and laboratory data, such as HDL, diastolic pressure under treatment and systolic pressure without treatment | Standardization, LASSO feature selection and hyperparameter tuning with 10-fold CV | Lipid level in treatment and hypertension treatment | SVM: AUC 0.928, Accuracy 0.898, Sensitivity 0.878 and Specificity 0.923 |
[13] Dritsas et al. Diabetes identification | 520-record dataset from Kaggle, including symptoms such as polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, itching, obesity, etc. | SMOTE using 5-NN, Pearson coefficient, Gain ratio, AUC of NB and RF, and 10-fold CV | Polyuria, polydipsia, sudden weight loss and gender | Random Forest and KNN: AUCs 0.99 and 0.98, respectively; Accuracy 0.985, Recall 0.986 and Precision 0.986 |
(b)

Study and Purpose | Dataset | Complementary Techniques | Important Features | Best Model |
---|---|---|---|---|
[43] Kopitar et al. FPG regression | 2109 records from ZMHDD hospital, including demographic, anthropometric, blood pressure, cholesterol, pulse rate and FBS data | Outlier detection, MICE imputation, bootstrap random sampling with replacement, R2 model calibration | Hyperglycemia, age, triglyceride, cholesterol and blood pressure results | LightGBM: RMSE 0.8 mmol/L |
[44] Liu et al. Long-term diabetes prediction | 127,031 records, patients older than 65 years | LASSO feature selection, SHAP | FPG, education, exercise, gender and waist circumference as the top-five important predictors | XGBoost model with 21 features: AUC 0.78, Accuracy 75%, Sensitivity 64.5% and Specificity 75.7% |
[45] Lama et al. Long-term diabetes prediction | 7949 records: socioeconomic and psychosocial factors, physical and laboratory results, physical activity, diet information and tobacco use | Median imputation, SHAP, 5-fold CV grid search with objective as 6 and risk profiles | BMI, waist–hip ratio, age, systolic and diastolic BP, and diabetes heredity | Random Forest: AUC 0.7795 |
[46] Shin et al. Long-term diabetes prediction | 38,379 records, including demographic, laboratory, pulmonary test, personal history and family history data | Mean/mode imputation, hyperparameter tuning with stratified 10-fold CV, SHAP and survival analysis | FPG, HbA1c and family history of diabetes | XGBoost: 0.623 AUC, 0.966 Accuracy, 0.970 Sensitivity and 0.690 Specificity |
[21] Mao et al. Long-term diabetes prediction | 3687 records, including demographic, smoking, drinking, history of health condition and laboratory data | LR feature analysis, hyperparameter tuning, 10-fold CV and SHAP | Age, impaired fasting glucose and glucose tolerance | Random Forest: AUC 0.835 |
[28] Fazakis et al. Long-term diabetes prediction | 2331 records from ELSA, including biometric, anthropometric, hematological, lifestyle, sociodemographic and performance index variables | Feature selection techniques: LASSO, correlation, Greedy stepwise. Random undersampling. Adjusted threshold with objective J3, multiobjective optimization | Not applicable | Weighted soft voting ensemble with base classifiers LR and RF: AUC 0.884, Sensitivity 0.856 and Specificity 0.798 |
[47] Deberneh et al. Multiclass long-term diabetes prediction | 500,000 records containing diagnostic results and questionnaires | Majority undersampling and SMOTE, ANOVA, chi-squared test and RFE, grid search, 10-fold cross validation | FPG, HbA1c and gamma-GTP | CIM: Accuracy, Precision and Recall all 0.77 |