Machine Learning for an Enhanced Credit Risk Analysis: A Comparative Study of Loan Approval Prediction Models Integrating Mental Health Data
Abstract
1. Introduction
- Propose an intelligent credit risk prediction system that integrates mental health data into supervised machine learning algorithms;
- Conduct a comprehensive evaluation of multiple classification techniques to identify the optimal methodology that minimizes overfitting and maximizes performance in credit risk predictions;
- Analyze the key factors that influence loan approvals and explore their interdependencies with the response variable; and
- Establish a framework for future research that can improve the accuracy of predictive models, while also shedding light on the ethical considerations associated with the use of mental health data.
2. Related Research
3. Input Datasets: Mental Health and Loan Approval
3.1. Mental Health Dataset
3.2. Loan Approval Dataset
4. Machine Learning Algorithms to Use for Loan Predictions
4.1. Decision Tree
4.2. Random Forest
4.3. Naive Bayes
4.4. KNN
4.5. Boosting Algorithms
4.5.1. AdaBoost
4.5.2. Gradient Boosting
4.5.3. XGBoost
5. Methodology
5.1. Importing Libraries and Datasets
5.2. Data Preprocessing
import pandas as pd
from sklearn import preprocessing

# Count the missing values in each column
missing_values = df.isnull().sum()

# Fill numeric gaps with the column mean
numeric_features = ['numeric_attribute_1', 'numeric_attribute_2']
for feature in numeric_features:
    df[feature] = df[feature].fillna(df[feature].mean())

# Fill categorical gaps with the most frequent value
categorical_features = ['categorical_attribute_1', 'categorical_attribute_2']
for feature in categorical_features:
    df[feature] = df[feature].fillna(df[feature].mode()[0])

# One-hot encode the categorical columns and keep Age as a numeric feature
dfX = pd.concat([dataset["Age"], pd.get_dummies(dataset[categorical_columns])], axis = 1)
dfY = dataset["obs_consequence"]
dfX

# Label-encode the selected attributes as integers
label_encoder = preprocessing.LabelEncoder()
encoded_features = ['attribute_1', 'attribute_2', 'attribute_3', 'attribute_4']
for feature in encoded_features:
    df[feature] = label_encoder.fit_transform(df[feature])
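The imputation and encoding steps above can be tried end to end on a small frame. The sketch below is only an illustration: the column names (`income`, `married`, `education`) and their values are hypothetical and are not taken from either dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical columns; the values are illustrative only.
df = pd.DataFrame({
    "income": [4000.0, None, 5200.0, 6100.0],
    "married": ["Yes", "No", None, "Yes"],
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

# Numeric gap -> column mean; categorical gap -> most frequent value.
df["income"] = df["income"].fillna(df["income"].mean())
df["married"] = df["married"].fillna(df["married"].mode()[0])

# One-hot encoding creates one indicator column per category.
one_hot = pd.get_dummies(df[["married"]])

# Label encoding maps each category to an integer (sorted alphabetically).
df["education"] = LabelEncoder().fit_transform(df["education"])

print(df)
print(one_hot.columns.tolist())
```

One-hot encoding avoids implying an order between categories, whereas label encoding keeps a single column; which of the two suits a given attribute depends on the model and on whether the categories are ordinal.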
5.3. Model Selection
5.4. Train–Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
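To see what the `stratify` argument does, the following sketch splits a hypothetical imbalanced label vector (80 negatives, 20 positives) and counts the classes on each side. The data are made up for illustration, and combining `stratify` with `random_state` in one call is an assumption made here for reproducibility.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives and 20 positives.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y keeps the class proportions equal in both partitions;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With stratification, both partitions keep the 20% share of positives, which matters when the minority class is the one the metrics focus on.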
5.5. Model Training
model.fit(X_train, y_train)
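The same `fit` call applies to each classifier listed in Section 4. As a rough sketch, the loop below trains the scikit-learn estimators on synthetic data from `make_classification`; the data, the default hyperparameters, and the dictionary keys are assumptions for illustration, and XGBoost is left out because it lives in a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the paper's datasets).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Estimators with default settings; the hyperparameters are assumptions.
models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient boost": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the same training split and record test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, score in test_accuracy.items():
    print(f"{name}: {score:.2f}")
```

Training all models on one shared split keeps the comparison fair, since every estimator sees exactly the same training and test rows.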
5.6. Model Evaluation
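from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

# Predict on the held-out test set and score the predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)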
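The four scores computed above are all functions of the confusion-matrix counts. The sketch below spells out the standard definitions in plain Python; the function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy counts true negatives while precision, recall, and F1 do not, a model can score high on accuracy and low on the other three when one class dominates the data.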
5.7. Confusion Matrix and Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display names for the two classes, used only as tick labels
labels = ['Negative prediction', 'Affirmative prediction']
# Rows are true classes, columns are predicted classes
confusion_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize = (8, 6))
sns.heatmap(confusion_mat, annot = True, fmt = 'd', cmap = 'Blues', xticklabels = labels, yticklabels = labels, ax = ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
- confusion_mat: the confusion matrix to be visualized.
- annot = True: annotates each cell in the heatmap with the corresponding count.
- fmt = ‘d’: formats the annotations as integers.
- cmap = ‘Blues’: specifies the color map for the heatmap.
- xticklabels = labels: sets the x-axis tick labels to the specified labels.
- yticklabels = labels: sets the y-axis tick labels to the specified labels.
- ax = ax: specifies the subplot on which the heatmap is drawn.
6. Results
6.1. The Evaluation of the First Dataset: Mental Health
6.2. The Evaluation of the Second Dataset: Loan Approval
7. Discussion
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Attribute Name | Description | Value Range |
---|---|---|
age | indicates the age of the participant | 8–62 |
gender | indicates the gender of the participant | Male, Female, Other |
country | indicates the country where the participant is located | |
state | indicates the US state where the participant is located, if applicable | |
self_employed | indicates whether the participant is self-employed | Binary (Y, N) |
family_history | indicates whether the participant has a family history of mental illness | Binary (Y, N) |
treatment | indicates whether the participant has sought treatment for mental illness | Binary (Y, N) |
work_interfere | indicates whether the participant feels that their work has been affected by their mental health | Never, Rarely, Sometimes, Often |
no_employees | indicates the number of employees in the participant’s company or organization | 6–25, 26–100, 100–500, 500–1000, More than 1000 |
remote_work | indicates whether the participant works remotely | Binary (Y, N) |
tech_company | indicates whether the participant works for a tech company | Binary (Y, N) |
benefits | indicates whether the participant’s employer provides mental health benefits | Yes, No, Do not know |
care_options | indicates whether the participant knows about mental healthcare options provided by their employer | Yes, No, Not Sure |
wellness_program | indicates whether the participant knows about or has participated in a wellness program provided by their employer | Yes, No, Not Sure |
seek_help | indicates whether the participant would feel comfortable discussing mental health with their employer | Yes, No, Not Sure |
anonymity | indicates whether the participant feels that they could be anonymous if they discussed mental health with their employer | Yes, No, Not Sure |
leave | indicates whether the participant knows the options for taking time off work for mental health reasons | Difficult, Easy, Do not know |
mental_health_consequence | indicates whether the participant thinks that discussing mental health would have negative consequences on their workplace environment | Yes, No, Maybe |
phys_health_consequence | indicates whether the participant thinks that discussing physical health would have negative consequences on their workplace environment | Yes, No, Maybe |
coworkers | indicates whether the participant would discuss mental health with their coworkers | Yes, No, Some of them |
supervisor | indicates whether the participant would discuss mental health with their supervisor | Yes, No, Some of them |
mental_health_interview | indicates whether the participant has ever discussed mental health in a job interview | Yes, No, Maybe |
phys_health_interview | indicates whether the participant has ever discussed physical health in a job interview | Yes, No, Maybe |
mental_vs_physical | indicates whether the participant feels that their mental health is treated as seriously as their physical health | Yes, No, Do not know |
obs_consequence | indicates whether the participant has heard of or observed negative consequences for coworkers with mental health conditions in their workplace | Binary (Y, N) |
Attribute Name | Description | Value Range |
---|---|---|
Gender | indicates the gender of the loan applicant | Male, Female, Other |
Married | indicates whether the loan applicant is married or not | True, False |
Dependents | indicates the number of dependents (such as children or elderly parents) that the loan applicant has | (0, 3+) |
Education | indicates the education level of the loan applicant | Graduate/Not a graduate |
Self_Employed | indicates whether the loan applicant is self-employed or not | True, False |
Applicant_Income | indicates the income of a loan applicant | Range (150, 81,000) |
Coapplicant_Income | indicates the income of the co-applicant | Range (0, 41,700) |
Loan_Amount | indicates the amount of loan applied for by the applicant | Range (9000, 700,000) |
Loan_Amount_Term | indicates the term or duration of the loan | Range (12, 480) |
Credit_History | indicates the credit history of the loan applicant, i.e., whether they have a history of repaying loans on time or not | True, False |
Attribute Name | Description | Value Range |
---|---|---|
Loan_Status | indicates whether the loan application was approved or not | Binary (Yes, No) |
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Naive Bayes | 20% | 17% | 91% | 28% |
KNN | 80% | 23% | 7% | 11% |
Decision tree | 75% | 24% | 21% | 22% |
Random forest | 83% | 60% | 7% | 12% |
AdaBoost | 81% | 35% | 14% | 20% |
Gradient boost | 83% | 47% | 16% | 24% |
XGBoost | 84% | 62% | 12% | 20% |
Model | True Pos | False Pos | False Neg | True Neg |
---|---|---|---|---|
Naive Bayes | 15.48% | 78.17% | 1.59% | 4.76% |
KNN | 1.19% | 3.97% | 15.87% | 78.97% |
Decision tree | 3.97% | 11.90% | 13.10% | 71.03% |
Random forest | 1.59% | 1.59% | 15.48% | 81.35% |
AdaBoost | 2.38% | 4.37% | 14.68% | 78.57% |
Gradient boost | 3.17% | 3.17% | 13.89% | 79.76% |
XGBoost | 1.98% | 1.19% | 15.08% | 81.75% |
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Naive Bayes | 56% | 50% | 68% | 58% |
KNN | 83% | 80% | 82% | 81% |
Decision tree | 83% | 80% | 83% | 82% |
Random forest | 85% | 86% | 79% | 82% |
AdaBoost | 59% | 58% | 26% | 36% |
Gradient boost | 58% | 60% | 16% | 25% |
XGBoost | 59% | 74% | 12% | 20% |
Model | True Pos | False Pos | False Neg | True Neg |
---|---|---|---|---|
Naive Bayes | 30.18% | 29.70% | 14.05% | 26.06% |
KNN | 36.23% | 9.23% | 8% | 46.54% |
Decision tree | 36.87% | 9.20% | 7.37% | 46.56% |
Random forest | 34.75% | 5.78% | 9.49% | 49.99% |
AdaBoost | 11.16% | 8.56% | 32.60% | 47.21% |
Gradient boost | 11.75% | 8% | 32.48% | 47.05% |
XGBoost | 5.23% | 1.81% | 39.01% | 53.96% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alagic, A.; Zivic, N.; Kadusic, E.; Hamzic, D.; Hadzajlic, N.; Dizdarevic, M.; Selmanovic, E. Machine Learning for an Enhanced Credit Risk Analysis: A Comparative Study of Loan Approval Prediction Models Integrating Mental Health Data. Mach. Learn. Knowl. Extr. 2024, 6, 53-77. https://doi.org/10.3390/make6010004