Early Identification of Risk Factors in Non-Alcoholic Fatty Liver Disease (NAFLD) Using Machine Learning
Abstract
1. Introduction
2. Related Work
2.1. Bagging Ensembles
2.2. Boosting Ensembles
2.3. Stacking Ensembles
2.4. Bagging and Boosting Ensembles
2.5. Bagging, Boosting, and Stacking Ensembles
3. Materials and Methods
3.1. Datasets
3.2. Machine Learning Classifiers
3.3. Methodology
- Load dataset. Select and load the data from the dataset containing clinical records of patients with liver diseases;
- Pre-process dataset. Inspect the loaded data to understand their content. Then, select the classification variable to obtain the best results;
- Select attributes or main risk factors. Use RF to select the top two and top four attributes from each dataset. Split the data for training and testing (i.e., 70% for training and 30% for testing) and apply cross-validation with k = 10. Similarly, use RandomizedSearchCV to find the best values for n_estimators, max_attributes, and max_depth. Most of the algorithms have these parameters in common, except for KNN. The parameter random_state was set to 42 in all the assessments;
- Run ML classifiers, bagging ensemble, and boosting ensemble. Apply the nine ML classifiers to distinguish patients with liver disease from healthy individuals. Tune bagging and boosting parameters such as n_estimators and max_samples under both the train and test split and the cross-validation techniques;
- Apply evaluation metrics. Analyze the classification performance of each machine learning algorithm (MLA) with respect to five criteria: accuracy, precision, recall, f1-score, and area under the ROC curve (ROC-AUC);
- Process performance results. Collect the performance values of the nine MLAs and of the bagging and boosting ensembles, compare them, and record the outcomes for further analysis. Finally, select the best-performing MLA or ensemble.
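The steps above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic (generated with `make_classification`) rather than one of the liver datasets, the search grids and the bagging settings are illustrative values, and only RF is used as the base estimator. Note that scikit-learn names the attribute-count parameter `max_features`, which is assumed here to correspond to the max_attributes mentioned in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import (RandomizedSearchCV, cross_validate,
                                     train_test_split)

SEED = 42  # random_state value stated in the methodology
METRICS = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# Assumption: synthetic binary data standing in for a liver dataset.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=SEED)

# 70% training / 30% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED)

# Randomized search over the three named hyperparameters
# (the candidate values are illustrative, not taken from the paper).
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_distributions={
        "n_estimators": [10, 25, 50],
        "max_features": [1, 2, "sqrt"],
        "max_depth": [None, 3, 5],
    },
    n_iter=4, cv=10, random_state=SEED)
search.fit(X_train, y_train)

# Bagging ensemble wrapped around the tuned base estimator.
bagging = BaggingClassifier(search.best_estimator_, n_estimators=5,
                            max_samples=0.8, random_state=SEED)
bagging.fit(X_train, y_train)

# The five metrics on the held-out 30%.
pred = bagging.predict(X_test)
proba = bagging.predict_proba(X_test)[:, 1]
split_scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}

# The same metrics under 10-fold cross-validation, averaged over folds.
cv_raw = cross_validate(bagging, X, y, cv=10, scoring=METRICS)
cv_scores = {m: cv_raw[f"test_{m}"].mean() for m in METRICS}

for name in METRICS:
    print(f"{name:9s} split={split_scores[name]:.3f} cv={cv_scores[name]:.3f}")
```

Swapping `RandomForestClassifier` for any of the other classifiers (with a parameter grid that matches it) would repeat the same comparison for that model.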
4. Results and Discussion
4.1. Attribute Selection in Datasets
- (a) The BUPA Liver Disorders Dataset. We applied RF to the six numerical attributes of the dataset to rank them by importance and selected all six. Figure 2a depicts the ranking of these attributes from the most important to the least important;
- (b) HCC Survival Dataset. The 49 attributes were ranked using RF on the HCC Survival dataset. Figure 2b depicts this ranking. As in the previous case, the top six attributes were used to evaluate classifier performance;
- (c) ILPD. The 10 attributes were ranked using RF on the ILPD dataset. Figure 2c shows the six highest-ranked attributes, which were used in the analysis;
- (d) CPD. The 19 attributes were ranked using RF on this dataset. Figure 2d shows the ranking of these attributes, of which the top six were used in the analysis.
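The ranking step described for the four datasets can be sketched as follows. This is a hedged illustration only: it assumes the ranking comes from the impurity-based `feature_importances_` of a scikit-learn `RandomForestClassifier`, which the text does not confirm, and it runs on synthetic data with placeholder attribute names. The helper `top_k_attributes` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_k_attributes(X, y, names, k=6, seed=42):
    """Rank attributes by RF importance and return the k best as (name, score)."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    importances = forest.feature_importances_
    # Indices sorted from most to least important.
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order[:k]]


# Assumption: synthetic data with 10 attributes standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=42)
names = [f"attr_{i}" for i in range(X.shape[1])]

ranking = top_k_attributes(X, y, names, k=6)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The classifiers would then be trained on the columns named in `ranking` only.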
4.2. Results
4.2.1. Classifier Performance on the BUPA Dataset
4.2.2. Classifier Performance on the CPD
4.2.3. Classifier Performance on the HCC Survival Dataset
4.2.4. Classifier Performance on ILPD Dataset
4.3. Most Important Dataset Attributes
5. Conclusions and Future Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Dataset | Number of Attributes | Number of Classes | Number of Records | Prediction/Diagnosis |
---|---|---|---|---|
BUPA Liver Disorders | 6 | 1 | 345 | Prediction/Diagnosis |
HCC Survival | 49 | 1 | 165 | Prediction |
ILPD | 9 | 1 | 313 | Diagnosis |
CPD | 19 | 1 | 424 | Prediction |
Attribute Name | Attribute Description |
---|---|
Mcv | Mean corpuscular volume |
Alkphos | Alkaline phosphatase |
Sgpt | Alanine aminotransferase |
Sgot | Aspartate aminotransferase |
Gammagt | Gamma-glutamyl transpeptidase |
Drinks | Number of half-pint equivalents of alcoholic beverages drunk per day |
Selector | Field created by BUPA researchers to split the data into train/test sets |
Attribute Name | Attribute Description |
---|---|
Gender | Gender of the patient |
Symptoms | Symptoms |
Alcohol | Alcohol |
HBsAg | Hepatitis B Surface Antigen |
HBeAg | Hepatitis B e Antigen |
HBcAb | Hepatitis B Core Antibody |
HCVAb | Hepatitis C Virus Antibody |
Cirrhosis | Cirrhosis |
Endemic countries | Endemic countries |
Smoking | Smoking |
Diabetes | Diabetes |
Obesity | Obesity |
Hemochromatosis | Hemochromatosis |
AHT | Arterial Hypertension |
CRI | Chronic Renal Insufficiency |
HIV | Human Immunodeficiency Virus |
NASH | Nonalcoholic Steatohepatitis |
Esophageal varices | Esophageal varices |
Splenomegaly | Splenomegaly |
Portal hypertension | Portal hypertension |
Portal vein thrombosis | Portal vein thrombosis |
Liver metastasis | Liver metastasis |
Radiological hallmark | Radiological hallmark |
Age at diagnosis | Age at diagnosis |
Grams/day | Grams of Alcohol per day |
Packs/year | Packs of cigarettes per year |
Performance status | Performance status |
Encephalopathy | Encephalopathy |
Ascites | Ascites degree |
INR | International Normalized Ratio |
AFP | Alpha-Fetoprotein (ng/mL) |
Hemoglobin | Hemoglobin (g/dL) |
MCV | Mean Corpuscular Volume (fl) |
Leukocytes | Leukocytes (G/L) |
Platelets | Platelets (G/L) |
Albumin | Albumin (mg/dL) |
Total Bil | Total bilirubin (mg/dL) |
ALT | Alanine Transaminase (U/L) |
AST | Aspartate Transaminase (U/L) |
GGT | Gamma Glutamyl Transferase (U/L) |
ALP | Alkaline phosphatase (U/L) |
TP | Total proteins (g/dL) |
Creatinine | Creatinine (mg/dL) |
Number of nodules | Number of nodules |
Major dimension | Major dimension of nodule |
Dir. Bil | Direct bilirubin (mg/dL) |
Iron | Iron (mcg/dL) |
Sat | Oxygen saturation (%) |
Ferritin | Ferritin (ng/mL) |
Attribute Name | Attribute Description |
---|---|
Age | Age of the patient |
Gender | Gender of the patient |
TB | Total Bilirubin |
DB | Direct Bilirubin |
Alkphos | Alkaline Phosphatase |
Sgpt | Alanine Aminotransferase |
Sgot | Aspartate Aminotransferase |
TP | Total Proteins |
ALB | Albumin |
A/G | Albumin and Globulin Ratio |
Selector | Field used to split the data into two sets (labeled by the experts) |
Attribute Name | Attribute Description |
---|---|
ID | Unique Identifier |
N_Days | Number of days between registration and the earlier of death, transplantation, or study analysis time in July 1986 |
Status | Status of the patient: C (censored), CL (censored due to liver tx), or D (death) |
Drug | Type of drug D-penicillamine or placebo |
Age | Patient age in days |
Sex | M (male) or F (female) |
Ascites | Presence of ascites: N (No) or Y (Yes) |
Hepatomegaly | Presence of hepatomegaly: N (No) or Y (Yes) |
Spiders | Presence of spiders: N (No) or Y (Yes) |
Edema | Presence of edema: N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy) |
Bilirubin | Serum bilirubin in [mg/dL] |
Cholesterol | Serum cholesterol in [mg/dL] |
Albumin | Albumin in [gm/dL] |
Copper | Urine copper in [ug/day] |
Alk_Phos | Alkaline phosphatase in [U/L] |
SGOT | SGOT in [U/mL] |
Triglycerides | Triglycerides in [mg/dL] |
Platelets | Platelets per cubic [mL/1000] |
Prothrombin | Prothrombin time in seconds [s] |
Stage | Histologic stage of disease (1, 2, 3, or 4) |
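Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Non-ensemble learning | Train and test split | AdaBoost | 66.35 | 69.23 | 75.00 | 72.00 | 64.77 |
Non-ensemble learning | Train and test split | DT | 65.38 | 70.00 | 70.00 | 70.00 | 64.55 |
Non-ensemble learning | Train and test split | ETT | 72.12 | 73.85 | 80.00 | 76.80 | 70.68 |
Non-ensemble learning | Train and test split | GB | 68.27 | 70.15 | 78.33 | 74.02 | 66.44 |
Non-ensemble learning | Train and test split | KNN | 66.35 | 74.51 | 63.33 | 68.47 | 66.89 |
Non-ensemble learning | Train and test split | LGBM | 65.38 | 70.00 | 70.00 | 70.00 | 64.55 |
Non-ensemble learning | Train and test split | LR | 73.08 | 74.24 | 81.67 | 77.78 | 71.52 |
Non-ensemble learning | Train and test split | RF | 70.19 | 72.31 | 78.33 | 75.20 | 68.71 |
Non-ensemble learning | Train and test split | SVC | 70.19 | 69.86 | 85.00 | 76.69 | 67.50 |
Non-ensemble learning | Cross-validation | AdaBoost | 66.95 | 69.78 | 75.97 | 72.24 | 70.40 |
Non-ensemble learning | Cross-validation | DT | 58.87 | 64.53 | 65.88 | 64.09 | 57.44 |
Non-ensemble learning | Cross-validation | ETT | 68.09 | 70.25 | 77.70 | 73.48 | 71.96 |
Non-ensemble learning | Cross-validation | GB | 66.69 | 69.32 | 76.20 | 71.93 | 73.02 |
Non-ensemble learning | Cross-validation | KNN | 65.19 | 70.45 | 67.50 | 68.44 | 67.42 |
Non-ensemble learning | Cross-validation | LGBM | 70.45 | 73.53 | 77.94 | 75.11 | 73.00 |
Non-ensemble learning | Cross-validation | LR | 68.66 | 70.18 | 81.42 | 74.59 | 71.59 |
Non-ensemble learning | Cross-validation | RF | 69.61 | 71.50 | 79.66 | 74.92 | 71.41 |
Non-ensemble learning | Cross-validation | SVC | 71.60 | 71.21 | 86.49 | 77.62 | 74.58 |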
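As a quick consistency check on the table above, the f1-score is the harmonic mean of precision and recall, so the train and test split rows can be reproduced from their precision and recall columns. The sketch below uses a hypothetical helper, `f1_from`, and two rows from that table (AdaBoost and LR). The cross-validation rows need not satisfy the identity exactly, presumably because each metric is averaged over the folds separately.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


# Train and test split, AdaBoost: precision 69.23, recall 75.00.
print(round(f1_from(69.23, 75.00), 2))
# Train and test split, LR: precision 74.24, recall 81.67.
print(round(f1_from(74.24, 81.67), 2))
```

Both values agree with the f1-Score column of the corresponding rows (72.00 and 77.78).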
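Ensemble | Technique | Base Estimator | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Bagging ensemble | Train and test split | AdaBoost | 68.27 | 69.57 | 80.00 | 74.42 | 66.14 |
Bagging ensemble | Train and test split | DT | 71.15 | 72.73 | 80.00 | 76.19 | 69.55 |
Bagging ensemble | Train and test split | ETT | 74.04 | 74.63 | 83.33 | 78.74 | 72.35 |
Bagging ensemble | Train and test split | GB | 67.31 | 69.70 | 76.67 | 73.02 | 65.61 |
Bagging ensemble | Train and test split | KNN | 70.19 | 70.42 | 83.33 | 76.34 | 67.80 |
Bagging ensemble | Train and test split | LGBM | 65.38 | 68.18 | 75.00 | 71.43 | 63.64 |
Bagging ensemble | Train and test split | LR | 73.08 | 74.24 | 81.67 | 77.78 | 71.52 |
Bagging ensemble | Train and test split | RF | 68.27 | 69.01 | 81.67 | 74.81 | 65.83 |
Bagging ensemble | Train and test split | SVC | 71.15 | 70.27 | 86.67 | 77.61 | 68.33 |
Bagging ensemble | Cross-validation | AdaBoost | 68.39 | 70.33 | 76.46 | 70.95 | 69.62 |
Bagging ensemble | Cross-validation | DT | 69.01 | 73.99 | 76.67 | 74.12 | 72.84 |
Bagging ensemble | Cross-validation | ETT | 71.29 | 70.67 | 83.09 | 75.91 | 73.88 |
Bagging ensemble | Cross-validation | GB | 71.61 | 71.69 | 80.93 | 74.26 | 75.84 |
Bagging ensemble | Cross-validation | KNN | 67.50 | 67.95 | 81.69 | 74.11 | 70.22 |
Bagging ensemble | Cross-validation | LGBM | 69.86 | 73.17 | 79.83 | 74.74 | 75.28 |
Bagging ensemble | Cross-validation | LR | 69.81 | 70.29 | 81.07 | 74.52 | 71.73 |
Bagging ensemble | Cross-validation | RF | 69.28 | 71.83 | 81.71 | 76.03 | 74.51 |
Bagging ensemble | Cross-validation | SVC | 69.55 | 69.13 | 84.87 | 76.38 | 75.18 |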
Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Boosting ensemble | Train and test split | AdaBoost | 66.35 | 71.19 | 70.00 | 70.59 | 65.68 |
Boosting ensemble | Train and test split | GB | 66.35 | 68.66 | 76.67 | 72.44 | 64.47 |
Boosting ensemble | Train and test split | LGBM | 65.38 | 70.00 | 70.00 | 70.00 | 64.55 |
Boosting ensemble | Cross-validation | AdaBoost | 62.87 | 67.14 | 69.85 | 67.99 | 68.76 |
Boosting ensemble | Cross-validation | GB | 68.14 | 71.37 | 75.35 | 72.60 | 69.05 |
Boosting ensemble | Cross-validation | LGBM | 66.97 | 70.54 | 74.99 | 72.15 | 71.33 |
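Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Non-ensemble learning | Train and test split | AdaBoost | 69.05 | 57.14 | 45.45 | 50.63 | 63.58 |
Non-ensemble learning | Train and test split | DT | 57.14 | 41.07 | 52.27 | 46.00 | 56.01 |
Non-ensemble learning | Train and test split | ETT | 65.08 | 50.00 | 40.91 | 45.00 | 59.48 |
Non-ensemble learning | Train and test split | GB | 66.67 | 52.78 | 43.18 | 47.50 | 61.23 |
Non-ensemble learning | Train and test split | KNN | 62.70 | 44.00 | 25.00 | 31.88 | 53.96 |
Non-ensemble learning | Train and test split | LGBM | 69.05 | 56.76 | 47.73 | 51.85 | 64.11 |
Non-ensemble learning | Train and test split | LR | 65.87 | 51.43 | 40.91 | 45.57 | 60.09 |
Non-ensemble learning | Train and test split | RF | 66.67 | 52.78 | 43.18 | 47.50 | 61.23 |
Non-ensemble learning | Train and test split | SVC | 64.29 | 47.37 | 20.45 | 28.57 | 54.13 |
Non-ensemble learning | Cross-validation | AdaBoost | 71.79 | 60.04 | 48.54 | 53.06 | 68.53 |
Non-ensemble learning | Cross-validation | DT | 61.98 | 45.64 | 50.57 | 47.17 | 59.29 |
Non-ensemble learning | Cross-validation | ETT | 68.20 | 55.01 | 44.48 | 48.57 | 72.01 |
Non-ensemble learning | Cross-validation | GB | 72.74 | 62.84 | 49.90 | 54.83 | 71.64 |
Non-ensemble learning | Cross-validation | KNN | 67.69 | 54.42 | 28.71 | 37.10 | 62.58 |
Non-ensemble learning | Cross-validation | LGBM | 69.86 | 58.37 | 49.44 | 52.36 | 68.32 |
Non-ensemble learning | Cross-validation | LR | 72.25 | 66.00 | 41.75 | 50.04 | 75.84 |
Non-ensemble learning | Cross-validation | RF | 70.10 | 59.51 | 46.65 | 51.60 | 72.01 |
Non-ensemble learning | Cross-validation | SVC | 69.36 | 70.17 | 22.43 | 32.04 | 65.46 |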
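Ensemble | Technique | Base Estimator | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Bagging ensemble | Train and test split | AdaBoost | 66.67 | 52.78 | 43.18 | 47.50 | 61.23 |
Bagging ensemble | Train and test split | DT | 66.67 | 52.94 | 40.91 | 46.15 | 60.70 |
Bagging ensemble | Train and test split | ETT | 65.87 | 51.61 | 36.36 | 42.67 | 59.04 |
Bagging ensemble | Train and test split | GB | 65.87 | 51.52 | 38.64 | 44.16 | 59.56 |
Bagging ensemble | Train and test split | KNN | 61.90 | 42.86 | 27.27 | 33.33 | 53.88 |
Bagging ensemble | Train and test split | LGBM | 67.46 | 54.05 | 45.45 | 49.38 | 62.36 |
Bagging ensemble | Train and test split | LR | 65.87 | 51.43 | 40.91 | 45.57 | 60.09 |
Bagging ensemble | Train and test split | RF | 66.67 | 52.94 | 40.91 | 46.15 | 60.70 |
Bagging ensemble | Train and test split | SVC | 65.08 | 50.00 | 20.45 | 29.03 | 54.74 |
Bagging ensemble | Cross-validation | AdaBoost | 73.21 | 68.28 | 50.39 | 55.73 | 72.68 |
Bagging ensemble | Cross-validation | DT | 69.15 | 58.06 | 44.34 | 51.55 | 72.80 |
Bagging ensemble | Cross-validation | ETT | 71.05 | 65.23 | 43.33 | 53.12 | 74.97 |
Bagging ensemble | Cross-validation | GB | 71.54 | 66.56 | 49.38 | 55.51 | 74.62 |
Bagging ensemble | Cross-validation | KNN | 69.37 | 59.01 | 31.13 | 41.82 | 65.58 |
Bagging ensemble | Cross-validation | LGBM | 70.81 | 65.67 | 49.62 | 54.10 | 72.20 |
Bagging ensemble | Cross-validation | LR | 72.01 | 67.23 | 41.75 | 49.06 | 75.99 |
Bagging ensemble | Cross-validation | RF | 72.02 | 66.37 | 43.09 | 54.18 | 75.15 |
Bagging ensemble | Cross-validation | SVC | 68.39 | 60.33 | 16.24 | 29.15 | 65.09 |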
Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Boosting ensemble | Train and test split | AdaBoost | 66.67 | 52.38 | 50.00 | 51.16 | 62.80 |
Boosting ensemble | Train and test split | GB | 63.49 | 47.62 | 45.45 | 46.51 | 59.31 |
Boosting ensemble | Train and test split | LGBM | 65.08 | 50.00 | 47.73 | 48.84 | 61.06 |
Boosting ensemble | Cross-validation | AdaBoost | 66.51 | 50.60 | 44.22 | 46.69 | 63.68 |
Boosting ensemble | Cross-validation | GB | 67.50 | 52.35 | 45.32 | 48.10 | 68.25 |
Boosting ensemble | Cross-validation | LGBM | 67.72 | 53.89 | 48.65 | 50.00 | 67.33 |
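Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Non-ensemble learning | Train and test split | AdaBoost | 67.74 | 59.38 | 73.08 | 65.52 | 68.48 |
Non-ensemble learning | Train and test split | DT | 75.81 | 68.97 | 76.92 | 72.73 | 75.96 |
Non-ensemble learning | Train and test split | ETT | 69.35 | 62.07 | 69.23 | 65.45 | 69.34 |
Non-ensemble learning | Train and test split | GB | 75.81 | 68.97 | 76.92 | 72.73 | 75.96 |
Non-ensemble learning | Train and test split | KNN | 67.74 | 60.71 | 65.38 | 62.96 | 67.41 |
Non-ensemble learning | Train and test split | LGBM | 70.97 | 64.29 | 69.23 | 66.67 | 70.73 |
Non-ensemble learning | Train and test split | LR | 62.90 | 54.29 | 73.08 | 62.30 | 64.32 |
Non-ensemble learning | Train and test split | RF | 70.97 | 65.38 | 65.38 | 65.38 | 70.19 |
Non-ensemble learning | Train and test split | SVC | 45.16 | 43.33 | 100.00 | 60.47 | 52.78 |
Non-ensemble learning | Cross-validation | AdaBoost | 72.55 | 73.41 | 70.56 | 71.06 | 78.57 |
Non-ensemble learning | Cross-validation | DT | 66.71 | 66.81 | 63.89 | 64.47 | 66.81 |
Non-ensemble learning | Cross-validation | ETT | 74.55 | 72.74 | 74.87 | 73.09 | 82.39 |
Non-ensemble learning | Cross-validation | GB | 74.60 | 74.20 | 73.14 | 73.17 | 83.65 |
Non-ensemble learning | Cross-validation | KNN | 73.12 | 76.00 | 67.81 | 70.12 | 75.54 |
Non-ensemble learning | Cross-validation | LGBM | 69.64 | 70.41 | 69.57 | 68.76 | 79.87 |
Non-ensemble learning | Cross-validation | LR | 57.38 | 41.01 | 52.90 | 45.51 | 74.11 |
Non-ensemble learning | Cross-validation | RF | 73.98 | 74.27 | 71.91 | 72.49 | 82.53 |
Non-ensemble learning | Cross-validation | SVC | 53.95 | 52.51 | 99.23 | 67.73 | 80.02 |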
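Ensemble | Technique | Base Estimator | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Bagging ensemble | Train and test split | AdaBoost | 74.19 | 66.67 | 76.92 | 71.43 | 74.57 |
Bagging ensemble | Train and test split | DT | 69.35 | 65.22 | 57.69 | 61.22 | 67.74 |
Bagging ensemble | Train and test split | ETT | 69.35 | 62.96 | 65.38 | 64.15 | 68.80 |
Bagging ensemble | Train and test split | GB | 74.19 | 69.23 | 69.23 | 69.23 | 73.50 |
Bagging ensemble | Train and test split | KNN | 70.97 | 65.38 | 65.38 | 65.38 | 70.19 |
Bagging ensemble | Train and test split | LGBM | 69.35 | 62.07 | 69.23 | 65.45 | 69.34 |
Bagging ensemble | Train and test split | LR | 62.90 | 54.84 | 65.38 | 59.65 | 63.25 |
Bagging ensemble | Train and test split | RF | 74.19 | 69.23 | 69.23 | 69.23 | 73.50 |
Bagging ensemble | Train and test split | SVC | 45.16 | 43.33 | 100.00 | 60.47 | 52.78 |
Bagging ensemble | Cross-validation | AdaBoost | 75.98 | 77.71 | 79.60 | 78.59 | 82.99 |
Bagging ensemble | Cross-validation | DT | 73.50 | 74.86 | 72.99 | 76.24 | 82.59 |
Bagging ensemble | Cross-validation | ETT | 76.50 | 75.91 | 75.55 | 74.41 | 83.74 |
Bagging ensemble | Cross-validation | GB | 76.98 | 78.45 | 77.67 | 76.48 | 84.98 |
Bagging ensemble | Cross-validation | KNN | 73.57 | 75.28 | 72.17 | 71.95 | 75.82 |
Bagging ensemble | Cross-validation | LGBM | 74.55 | 75.53 | 73.42 | 71.08 | 82.81 |
Bagging ensemble | Cross-validation | LR | 73.14 | 71.86 | 74.91 | 72.02 | 80.06 |
Bagging ensemble | Cross-validation | RF | 77.45 | 76.27 | 76.75 | 76.42 | 84.56 |
Bagging ensemble | Cross-validation | SVC | 49.60 | 45.26 | 69.23 | 51.99 | 79.92 |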
Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Boosting ensemble | Train and test split | AdaBoost | 66.13 | 57.58 | 73.08 | 64.41 | 67.09 |
Boosting ensemble | Train and test split | GB | 75.81 | 68.97 | 76.92 | 72.73 | 75.96 |
Boosting ensemble | Train and test split | LGBM | 69.35 | 62.96 | 65.38 | 64.15 | 68.80 |
Boosting ensemble | Cross-validation | AdaBoost | 72.07 | 72.24 | 71.54 | 71.07 | 78.95 |
Boosting ensemble | Cross-validation | GB | 74.60 | 74.20 | 73.14 | 73.17 | 83.65 |
Boosting ensemble | Cross-validation | LGBM | 71.07 | 71.52 | 71.74 | 70.64 | 80.95 |
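
Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Non-ensemble learning | Train and test split | AdaBoost | 73.14 | 79.41 | 85.04 | 82.13 | 63.35 |
Non-ensemble learning | Train and test split | DT | 66.29 | 78.81 | 73.23 | 75.92 | 60.57 |
Non-ensemble learning | Train and test split | ETT | 69.71 | 76.43 | 84.25 | 80.15 | 57.75 |
Non-ensemble learning | Train and test split | GB | 68.57 | 76.09 | 82.68 | 79.25 | 56.96 |
Non-ensemble learning | Train and test split | KNN | 67.43 | 78.69 | 75.59 | 77.11 | 60.71 |
Non-ensemble learning | Train and test split | LGBM | 73.14 | 79.85 | 84.25 | 81.99 | 64.00 |
Non-ensemble learning | Train and test split | LR | 74.29 | 74.40 | 98.43 | 84.75 | 54.42 |
Non-ensemble learning | Train and test split | RF | 70.29 | 77.37 | 83.46 | 80.30 | 59.44 |
Non-ensemble learning | Train and test split | SVC | 72.57 | 72.57 | 100.00 | 84.11 | 50.00 |
Non-ensemble learning | Cross-validation | AdaBoost | 69.46 | 76.06 | 83.34 | 79.35 | 69.51 |
Non-ensemble learning | Cross-validation | DT | 62.95 | 74.09 | 73.86 | 73.75 | 53.92 |
Non-ensemble learning | Cross-validation | ETT | 68.95 | 76.02 | 82.61 | 78.85 | 71.34 |
Non-ensemble learning | Cross-validation | GB | 69.46 | 76.01 | 84.11 | 79.54 | 71.56 |
Non-ensemble learning | Cross-validation | KNN | 63.29 | 75.04 | 73.01 | 73.83 | 66.49 |
Non-ensemble learning | Cross-validation | LGBM | 70.66 | 77.69 | 82.90 | 79.89 | 71.36 |
Non-ensemble learning | Cross-validation | LR | 71.51 | 73.88 | 93.55 | 82.22 | 72.52 |
Non-ensemble learning | Cross-validation | RF | 68.44 | 75.83 | 81.81 | 78.48 | 71.39 |
Non-ensemble learning | Cross-validation | SVC | 71.35 | 71.35 | 100.00 | 83.08 | 61.78 |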
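
The SVC train and test split row above combines 100.00% recall, identical accuracy and precision (72.57%), and a roc_auc of 50.00%. One plausible reading, not confirmed by the source, is that the model labelled every test sample as positive. The sketch below shows that such a classifier produces exactly this pattern. The class counts (127 positive, 48 negative) are hypothetical, chosen only so that the positive share equals 72.57%; the section does not report the test-set class counts. The helper `binary_metrics` is likewise an illustrative name.

```python
def binary_metrics(y_true, y_pred):
    """Compute the five table metrics (in %) from hard 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # With hard labels the ROC curve has a single operating point, so the
    # area under it reduces to the mean of recall and specificity.
    roc_auc = (recall + specificity) / 2

    values = {"accuracy": accuracy, "precision": precision,
              "recall": recall, "f1": f1, "roc_auc": roc_auc}
    return {name: round(100 * value, 2) for name, value in values.items()}


# Hypothetical test set: counts are an assumption, not taken from the paper.
y_true = [1] * 127 + [0] * 48
# A degenerate classifier that predicts the positive class for every sample.
y_pred = [1] * len(y_true)

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, accuracy and precision both collapse to the positive-class share, which is why a high recall or f1-score alone says little here and the roc_auc column is the more informative one for this row.
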
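
Ensemble | Technique | Base Estimator | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Bagging ensemble | Train and test split | AdaBoost | 76.00 | 78.15 | 92.91 | 84.89 | 62.08 |
Bagging ensemble | Train and test split | DT | 70.86 | 76.76 | 85.83 | 81.04 | 58.54 |
Bagging ensemble | Train and test split | ETT | 71.43 | 75.16 | 90.55 | 82.14 | 55.69 |
Bagging ensemble | Train and test split | GB | 71.43 | 75.84 | 88.98 | 81.88 | 56.99 |
Bagging ensemble | Train and test split | KNN | 72.00 | 74.68 | 92.91 | 82.81 | 54.79 |
Bagging ensemble | Train and test split | LGBM | 71.43 | 76.92 | 86.61 | 81.48 | 58.93 |
Bagging ensemble | Train and test split | LR | 74.86 | 74.85 | 98.43 | 85.03 | 55.46 |
Bagging ensemble | Train and test split | RF | 70.86 | 75.33 | 88.98 | 81.59 | 55.95 |
Bagging ensemble | Train and test split | SVC | 72.57 | 72.57 | 100.00 | 84.11 | 50.00 |
Bagging ensemble | Cross-validation | AdaBoost | 70.65 | 76.32 | 89.09 | 83.01 | 71.23 |
Bagging ensemble | Cross-validation | DT | 70.32 | 75.93 | 84.97 | 79.85 | 72.07 |
Bagging ensemble | Cross-validation | ETT | 70.15 | 74.56 | 88.62 | 81.15 | 73.62 |
Bagging ensemble | Cross-validation | GB | 71.69 | 75.69 | 87.93 | 81.74 | 73.31 |
Bagging ensemble | Cross-validation | KNN | 69.97 | 73.82 | 87.99 | 80.24 | 68.17 |
Bagging ensemble | Cross-validation | LGBM | 69.12 | 74.86 | 87.00 | 79.67 | 72.74 |
Bagging ensemble | Cross-validation | LR | 72.03 | 74.49 | 93.10 | 82.32 | 72.62 |
Bagging ensemble | Cross-validation | RF | 70.65 | 75.22 | 88.14 | 80.97 | 72.72 |
Bagging ensemble | Cross-validation | SVC | 71.35 | 71.35 | 100.00 | 83.08 | 69.71 |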

Ensemble | Technique | Predictive Model | % Accuracy | % Precision | % Recall | % f1-Score | % roc_auc |
---|---|---|---|---|---|---|---|
Boosting ensemble | Train and test split | AdaBoost | 72.00 | 79.55 | 82.68 | 81.08 | 63.21 |
Boosting ensemble | Train and test split | GB | 68.00 | 77.10 | 79.53 | 78.29 | 58.51 |
Boosting ensemble | Train and test split | LGBM | 73.14 | 79.85 | 84.25 | 81.99 | 64.00 |
Boosting ensemble | Cross-validation | AdaBoost | 68.59 | 75.69 | 82.28 | 78.64 | 68.07 |
Boosting ensemble | Cross-validation | GB | 67.57 | 75.89 | 79.94 | 77.64 | 68.97 |
Boosting ensemble | Cross-validation | LGBM | 68.77 | 75.79 | 82.53 | 78.75 | 71.52 |

Dataset | Best Rated Feature | Description |
---|---|---|
BUPA | Gammagt | Gamma-glutamyl transpeptidase |
BUPA | Sgpt | Alanine aminotransferase |
BUPA | Sgot | Aspartate aminotransferase |
BUPA | alkphos | Alkaline phosphatase |
HCC Survival Dataset | AFP | Alpha-fetoprotein (ng/mL) |
HCC Survival Dataset | Hemoglobin | Hemoglobin (g/dL) |
HCC Survival Dataset | ALP | Alkaline phosphatase (U/L) |
HCC Survival Dataset | Albumin | Albumin (mg/dL) |
ILPD | Alkphos | Alkaline phosphatase |
ILPD | Sgot | Aspartate aminotransferase |
ILPD | Sgpt | Alanine aminotransferase |
ILPD | Age | Age of the patient |
CPD | Prothrombin | Prothrombin time in seconds [s] |
CPD | Albumin | Albumin in [gm/dL] |
CPD | Platelets | Platelets per cubic [mL/1000] |
CPD | Age | Age of the patient |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Guarneros-Nolasco, L.R.; Alor-Hernández, G.; Prieto-Avalos, G.; Sánchez-Cervantes, J.L. Early Identification of Risk Factors in Non-Alcoholic Fatty Liver Disease (NAFLD) Using Machine Learning. Mathematics 2023, 11, 3026. https://doi.org/10.3390/math11133026