Feature Selection Engineering for Credit Risk Assessment in Retail Banking
Abstract
1. Introduction
2. Credit Risk Assessment in Retail Banking
- Probability of default (PD): the likelihood that the borrower will fail to repay the loan.
- Exposure at default (EAD): the expected value of the outstanding loan at the time of default (how much remains to be paid?).
- Loss given default (LGD): the amount lost if a default occurs, expressed as a percentage of the EAD. (These three components combine into the expected loss, as sketched after this list.)
- Machine learning methods:
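For context, the three risk components above combine into the standard expected-loss relation used throughout credit risk management; the relation and the worked numbers below are standard and illustrative rather than taken from this paper (the EAD and LGD values are assumptions).

```latex
% Standard expected-loss relation combining the three components (not specific to this paper):
\mathrm{EL} \;=\; \mathrm{PD} \times \mathrm{EAD} \times \mathrm{LGD}
% Illustrative numbers: with PD = 0.2143 (the data set default rate reported in Section 4.1),
% an assumed EAD of \$10{,}000, and an assumed LGD of 45\%:
% \mathrm{EL} = 0.2143 \times 10{,}000 \times 0.45 \approx \$964.
```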
3. Feature Selection Engineering
- Univariate feature selection [6] determines a vector of weights showing the strength of the relationship between each feature and the label. These techniques rely on statistical tests to compute the weight vector; therefore, many variants of univariate feature selection can be implemented depending on the statistic used, such as the ANOVA F-value or the chi-squared (χ²) test. The selected features are the K features with the highest weights. In this paper, we implement a χ²-based univariate feature selection algorithm named chi2UFS.
- Recursive feature elimination (RFE) [7] consists of removing the features showing the least importance. The recursive selection process ranks the features according to their importance in a defined model; the model is rebuilt recursively using the remaining features (initially all features), and the least important attributes are removed at each step. The number of features to return is defined by the parameter K; however, cross-validation can be used to determine the optimal number of features. In this paper, we implement an RFE algorithm named lrRFECV, based on a logistic regression model and using cross-validation. As a recursive algorithm, lrRFECV is time-consuming, depending on the number of features being evaluated. Therefore, in this paper lrRFECV is preceded by the computation of the feature correlation matrix and the removal of highly correlated features.
- Feature importance determination [8], also called model-based feature importance, consists of building a classifier using all available features and then selecting the K most important features based on the model's ranking. The output of this technique depends mainly on the classifier used, which can be a logistic regression, a Bayesian linear classifier, a decision tree, a random forest, a support vector machine (SVM), or an XGBoost classifier. In this paper, we implement feature importance determination using a decision tree classifier and name it FIDT. (A scikit-learn sketch of chi2UFS, lrRFECV, and FIDT is given after this list.)
- The information value (IV) [9] is a statistic that shows how well a feature (predictor) separates the classes of a binary target variable, such as the type of borrower in the credit scoring problem (good or bad). It is based on the weight of evidence (WoE) of the feature, a simple statistic showing the strength of a predictor in separating the values of a binary target. The target in the credit scoring problem is to identify good and bad borrowers, denoted respectively as g and b. Following the standard definitions, for each bin i of feature x the weight of evidence is WoE_i = ln((g_i/G)/(b_i/B)), and the information value of x is IV = Σ_i (g_i/G − b_i/B) · WoE_i, where g_i and b_i are the numbers of good and bad borrowers in bin i and G and B are their totals. (A pandas sketch of this computation also follows the list.)
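The following is a minimal scikit-learn sketch of the three model-based selectors above (chi2UFS, lrRFECV, and FIDT), assuming a pandas feature matrix X (non-negative, as required by the χ² test), a binary label y, and an illustrative parameter k; the exact estimator settings used in the paper are not reproduced here.

```python
# Sketch of the chi2UFS, lrRFECV, and FIDT selectors with scikit-learn.
# X, y, and k are illustrative placeholders, not the paper's exact configuration.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def chi2_ufs(X: pd.DataFrame, y: pd.Series, k: int) -> list[str]:
    """Univariate selection: keep the k features with the highest chi-squared scores."""
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    return X.columns[selector.get_support()].tolist()

def lr_rfecv(X: pd.DataFrame, y: pd.Series) -> list[str]:
    """Recursive feature elimination with cross-validation around a logistic regression."""
    rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5, scoring="f1")
    rfecv.fit(X, y)
    return X.columns[rfecv.support_].tolist()

def fidt(X: pd.DataFrame, y: pd.Series, k: int) -> list[str]:
    """Model-based importance: rank features by a decision tree's impurity-based importances."""
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    ranking = np.argsort(tree.feature_importances_)[::-1][:k]
    return X.columns[ranking].tolist()
```

Likewise, a small pandas sketch of the WoE/IV computation for one pre-binned feature, following the standard definitions given above; the good/bad label encoding and the smoothing constant are assumptions.

```python
# Sketch of the weight-of-evidence (WoE) and information value (IV) statistics for one
# categorical or pre-binned feature; x is the binned feature, y the good/bad label.
import numpy as np
import pandas as pd

def woe_iv(x: pd.Series, y: pd.Series, good_label=0, bad_label=1):
    df = pd.DataFrame({"x": x, "y": y})
    grouped = df.groupby("x")["y"]
    goods = grouped.apply(lambda s: (s == good_label).sum()).astype(float)
    bads = grouped.apply(lambda s: (s == bad_label).sum()).astype(float)
    eps = 1e-6  # small constant to avoid division by zero and log(0) in empty bins
    dist_good = (goods + eps) / (goods.sum() + eps)   # share of goods per bin
    dist_bad = (bads + eps) / (bads.sum() + eps)      # share of bads per bin
    woe = np.log(dist_good / dist_bad)                # WoE per bin
    iv = float(((dist_good - dist_bad) * woe).sum())  # information value of the feature
    return pd.DataFrame({"goods": goods, "bads": bads, "WoE": woe}), iv

# Usage idea: rank all (binned) features by IV and keep the strongest predictors, e.g.
# ivs = {col: woe_iv(X_binned[col], y)[1] for col in X_binned.columns}
```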
4. Implementation and Computational Results
Algorithm 1 Main algorithm
readData()
selectors ← ['UFS', 'RFE', 'IV', 'FIDT']
models ← [XGBoost]
initComparisionFrame()
dataPreprocessing()
for selector in selectors do
    applySelector()    ▹ apply the feature selection technique
    getFeatures()    ▹ read the selected features
    SMOTE()    ▹ apply SMOTE to balance the training sets
    splitData()    ▹ split the learning set
    for model in models do
        fitModel()
        testValidate()
        updateComparisionFrame()
    end for
end for
buildWordCloud()    ▹ plot the wordcloud of the selected features
return
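Below is a hedged Python rendering of Algorithm 1, assuming scikit-learn, imbalanced-learn, and xgboost; the selectors mapping, split ratio, and reported metrics are illustrative stand-ins for the paper's readData, applySelector, and related routines rather than the authors' exact implementation.

```python
# Illustrative rendering of Algorithm 1: loop over feature selection techniques,
# balance the classes with SMOTE, split, fit XGBoost, and collect comparison metrics.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from xgboost import XGBClassifier

def run_pipeline(data: pd.DataFrame, selectors: dict, label: str = "status") -> pd.DataFrame:
    """selectors maps a technique name (e.g. 'UFS') to a function (X, y) -> selected columns."""
    results = []                                    # comparison frame of Algorithm 1
    X, y = data.drop(columns=[label]), data[label]
    for name, select_features in selectors.items():
        features = select_features(X, y)            # apply the feature selection technique
        X_bal, y_bal = SMOTE(random_state=0).fit_resample(X[features], y)   # balance classes
        X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.3, random_state=0)
        model = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)
        pred = model.predict(X_te)
        results.append({
            "selector": name,
            "f1": f1_score(y_te, pred),
            "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
            "accuracy": accuracy_score(y_te, pred),
        })
    return pd.DataFrame(results)
```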
4.1. Data Set
4.2. Data Pre-Processing
- Manual screening: According to the expert’s opinion, some attributes are irrelevant and may not be used to determine the customer’s creditworthiness. In total, 53 features were identified for removal from the initial data set, leaving 99 features.
- Null values handling: The data acquisition phase always comes with some missing values. Null values mean that some attributes cannot be used by the credit scoring system without adjusting their empty records. In this paper, features with more than 50% of their values missing are dropped. In the remaining columns, empty cells are replaced by the column mean if the column is numeric and by the mode if it is discrete. In total, 66 features were removed because of null values, and the empty cells of 22 features were adjusted using their mean/mode value. After handling null values, 33 features were kept.
- Standardization: Standardization is a data transformation method that scales the data to the standard normal distribution (mean equals 0 and standard deviation equals 1).
- Encoding categorical variables: Categorical variables cannot be handled as they are by machine learning algorithms; a transformation step is required to convert categorical data into numerical data. In this paper, we used the label encoding method to transform categorical features into numeric values.
- Features correlation: The data set was initially composed of 152 variables describing whether a customer may default on the payment of their loan. However, some of these features may not be important for assessing a customer’s default risk, while others may be much more significant in separating good customers from bad ones. The financial consultant’s review identified insignificant features and removed 53 columns, and the null-handling process left 33 features. The second step in feature selection is based on the correlation between variables: highly positively or negatively correlated variables inform about the target variable in the same way, so it is enough to keep one of them. Given the correlation matrix between the 33 variables and the defined correlation threshold, eight (8) variables were dropped.
- Correlation to the label: Given the 25 remaining features, we evaluated the correlation of each feature to the label (Status). With the defined threshold, we found that some features are highly correlated with the type of borrower; therefore, six other variables were also dropped. In the end, we obtained a clean data set composed of 217,320 customers described by 19 variables. (A pandas sketch of these pre-processing steps is given after this list.)
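The following is a minimal pandas/scikit-learn sketch of the pre-processing steps listed above; the 50% missing-value cut-off follows the text, while the correlation threshold, the label column name, and the helper structure are illustrative assumptions.

```python
# Sketch of the pre-processing pipeline: drop sparse columns, impute, label-encode,
# standardize, and drop one of each pair of highly correlated predictors.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess(df: pd.DataFrame, label: str = "status", corr_threshold: float = 0.8) -> pd.DataFrame:
    # 1. Drop features with more than 50% missing values.
    df = df.loc[:, df.isna().mean() <= 0.5].copy()
    # 2. Impute remaining gaps (mean for numeric, mode for discrete) and label-encode categoricals.
    for col in df.columns:
        if df[col].dtype.kind in "biufc":            # numeric column
            df[col] = df[col].fillna(df[col].mean())
        else:                                        # discrete/categorical column
            df[col] = df[col].fillna(df[col].mode().iloc[0])
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    # 3. Standardize the predictors (zero mean, unit variance), leaving the label untouched.
    features = [c for c in df.columns if c != label]
    df[features] = StandardScaler().fit_transform(df[features])
    # 4. Drop one variable of each highly correlated pair.
    corr = df[features].corr().abs()
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if corr.loc[a, b] > corr_threshold and b not in to_drop:
                to_drop.add(b)
    return df.drop(columns=list(to_drop))
```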
4.3. Computational Results
5. Research Findings
- Number of open trades in the last 6 months: This feature indicates that if a customer has opened more accounts (trades) in the last 6 months, then there is a high chance that they will default on their loan.
- Interest rate on the loan: Higher interest rates increase the amount of the installments, which may push the borrower to default.
- Balance to the credit limit on all trades: This feature indicates how much debt the customer has and how much of their credit they are using. A customer with a low ratio has a much greater chance of being a good borrower.
- Number of personal finance inquiries: The number of accesses to the customer’s credit records made by authorized entities to check the customer’s score. Such accesses are generally triggered by a request from the customer for a loan or credit card from the accessing entity. Generally, more accesses mean that the customer is seeking more credit from different banks, which increases the probability of future defaults.
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Hull, J. Machine Learning in Business: An Introduction to the World of Data Science; Amazon Fulfillment Poland Sp. zoo: Sady, Poland, 2021.
2. Dumitrescu, E.I.; Hué, S.; Hurlin, C.; Tokpavi, S. Machine Learning or Econometrics for Credit Scoring: Let’s Get the Best of Both Worlds (15 January 2021). SSRN 2020.
3. Chen, W.; Xiang, G.; Liu, Y.; Wang, K. Credit risk evaluation by hybrid data mining technique. Syst. Eng. Procedia 2012, 3, 194–200.
4. Das, S.R. The future of fintech. Financ. Manag. 2019, 48, 981–1007.
5. Tang, J.; Alelyani, S.; Liu, H. Feature selection for classification: A review. In Data Classification: Algorithms and Applications; Chapman & Hall: London, UK, 2014; Volume 37.
6. Kar, M.; Dewangan, L. Univariate feature selection techniques for classification of epileptic EEG signals. In Advances in Biomedical Engineering and Technology; Springer: Berlin/Heidelberg, Germany, 2021; pp. 345–365.
7. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422.
8. Zien, A.; Krämer, N.; Sonnenburg, S.; Rätsch, G. The feature importance ranking measure. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Bled, Slovenia, 7–11 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 694–709.
9. Lund, B.; Brotherton, D. Information Value Statistic. In Proceedings of the MWSUG 2013, Columbus, OH, USA, 22–24 September 2013.
10. Lending Club Platform. 2021. Available online: www.kaggle.com/datasets/wordsforthewise/lending-club (accessed on 10 September 2021).
11. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
12. Berg, T.; Burg, V.; Gombović, A.; Puri, M. On the rise of fintechs: Credit scoring using digital footprints. Rev. Financ. Stud. 2020, 33, 2845–2897.
13. Bumacov, V.; Ashta, A. The conceptual framework of credit scoring from its origins to microfinance. In Proceedings of the Second European Research Conference on Microfinance, Groningen, The Netherlands, 16–18 June 2011.
14. Boyes, W.J.; Hoffman, D.L.; Low, S.A. An econometric analysis of the bank credit scoring problem. J. Econom. 1989, 40, 3–14.
15. Thomas, L.; Crook, J.; Edelman, D. Credit Scoring and Its Applications; SIAM: Bangkok, Thailand, 2017.
16. Avery, R.B.; Bostic, R.W.; Calem, P.S.; Canner, G.B. Credit scoring: Statistical issues and evidence from credit-bureau files. Real Estate Econ. 2000, 28, 523–547.
17. Amaro, M.M. Credit Scoring: Comparison of Non-Parametric Techniques against Logistic Regression. Master’s Thesis, Lisbon, Portugal, February 2020.
18. Dastile, X.; Celik, T.; Potsane, M. Statistical and machine learning models in credit scoring: A systematic literature survey. Appl. Soft Comput. 2020, 91, 106263.
19. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973; Volume 3.
20. Lessmann, S.; Baesens, B.; Seow, H.V.; Thomas, L.C. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur. J. Oper. Res. 2015, 247, 124–136.
21. Liu, W. Enterprise Credit Risk Management Using Multicriteria Decision-Making. Math. Probl. Eng. 2021, 2021, 6191167.
22. Pla-Santamaria, D.; Bravo, M.; Reig-Mullor, J.; Salas-Molina, F. A multicriteria approach to manage credit risk under strict uncertainty. Top 2021, 29, 494–523.
23. Baesens, B.; Van Gestel, T.; Viaene, S.; Stepanova, M.; Suykens, J.; Vanthienen, J. Benchmarking state-of-the-art classification algorithms for credit scoring. J. Oper. Res. Soc. 2003, 54, 627–635.
24. Louzada, F.; Ara, A.; Fernandes, G.B. Classification methods applied to credit scoring: Systematic review and overall comparison. Surv. Oper. Res. Manag. Sci. 2016, 21, 117–134.
25. Teles, G.; Rodrigues, J.J.; Saleem, K.; Kozlov, S.; Rabêlo, R.A. Machine learning and decision support system on credit scoring. Neural Comput. Appl. 2020, 32, 9809–9826.
26. Oreski, S.; Oreski, G. Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Syst. Appl. 2014, 41, 2052–2064.
27. Yu, L.; Yue, W.; Wang, S.; Lai, K.K. Support vector machine based multiagent ensemble learning for credit risk evaluation. Expert Syst. Appl. 2010, 37, 1351–1360.
28. Desai, V.S.; Crook, J.N.; Overstreet, G.A., Jr. A comparison of neural networks and linear scoring models in the credit union environment. Eur. J. Oper. Res. 1996, 95, 24–37.
29. West, D. Neural network credit scoring models. Comput. Oper. Res. 2000, 27, 1131–1152.
30. Lai, K.K.; Yu, L.; Wang, S.; Zhou, L. Credit risk analysis using a reliability-based neural network ensemble model. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4132, p. 682.
31. Zhang, D.; Zhou, X.; Leung, S.C.; Zheng, J. Vertical bagging decision trees model for credit scoring. Expert Syst. Appl. 2010, 37, 7838–7843.
32. Louzada, F.; Anacleto-Junior, O.; Candolo, C.; Mazucheli, J. Poly-bagging predictors for classification modelling for credit scoring. Expert Syst. Appl. 2011, 38, 12717–12720.
33. Wang, G.; Ma, J.; Huang, L.; Xu, K. Two credit scoring models based on dual strategy ensemble trees. Knowl.-Based Syst. 2012, 26, 61–68.
34. Zhang, X.; Yang, Y.; Zhou, Z. A novel credit scoring model based on optimized random forest. In Proceedings of the 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 8–10 January 2018.
35. Qin, C.; Zhang, Y.; Bao, F.; Zhang, C.; Liu, P.; Liu, P. XGBoost Optimized by Adaptive Particle Swarm Optimization for Credit Scoring. Math. Probl. Eng. 2021, 2021, 6655510.
36. Li, H.; Cao, Y.; Li, S.; Zhao, J.; Sun, Y. XGBoost model and its application to personal credit evaluation. IEEE Intell. Syst. 2020, 35, 52–61.
37. Hurley, M.; Adebayo, J. Credit scoring in the era of big data. Yale J. Law Technol. 2016, 18, 148.
38. Andreeva, G.; Altman, E.I. The Value of Personal Credit History in Risk Screening of Entrepreneurs: Evidence from Marketplace Lending. J. Financ. Manag. Mark. Inst. 2021, 9, 2150004.
39. Bastos, J.A.; Matos, S.M. Explainable models of credit losses. Eur. J. Oper. Res. 2022, 301, 386–394.
Category | Number of Samples | Proportion
---|---|---
Good | 345,844 | 78.57%
Bad | 94,307 | 21.43%
Total | 440,151 | 100%

Probability of default: PD = 94,307 / 440,151 = 0.2143
All rows use the XGBoost classifier; the columns report the evaluation metrics obtained with each feature selection technique.

Selector | F1 (Train) | F1 (Test) | ROC AUC (Train) | ROC AUC (Test) | Accuracy (Train) | Accuracy (Test) | Recall | Precision
---|---|---|---|---|---|---|---|---
UFS | 0.780 | 0.780 | 0.784 | 0.777 | 0.790 | 0.783 | 0.781 | 0.770
RFE | 0.773 | 0.776 | 0.784 | 0.780 | 0.787 | 0.786 | 0.782 | 0.779
FIDT | 0.708 | 0.736 | 0.718 | 0.708 | 0.718 | 0.709 | 0.717 | 0.711
IV | 0.777 | 0.780 | 0.792 | 0.779 | 0.788 | 0.776 | 0.795 | 0.779
Feature Name | Feature Description | UFS | FIDT | RFE | IV |
---|---|---|---|---|---|
’open_acc_6m’ | Number of open trades in last 6 months | * | * | * | * |
’int_rate’ | Interest Rate on the loan | * | * | * | * |
’all_util’ | Balance to the credit limit on all trades | * | * | * | |
’inq_fi’ | Number of personal finance inquiries | * | * | * | |
’loan_amnt’ | The listed amount of the loan applied for by the borrower | * | * | * | |
’funded_amnt’ | The total amount committed to that loan at that point in time | * | * | ||
’sub_grade’ | Assigned loan subgrade | * | * | ||
’term’ | The number of payments on the loan | * | * | ||
’total_cu_tl’ | Number of finance trades | * | * | ||
’home_ownership’ | The home ownership status provided by the borrower | * | |||
’num_op_rev_tl’ | Number of open revolving accounts | * | |||
’num_sats’ | Number of satisfactory accounts | * | |||
’open_acc’ | The number of open credit lines in the borrower’s credit file | * | |||
’total_bal_ex_mort’ | Total credit balance excluding mortgage | * | |||
’total_il_high _credit_limit’ | Total installment high credit/credit limit | * | |||
’installment’ | The monthly payment owed by the borrower if the loan originates | * | |||
’annual_inc’ | The self-reported annual income provided by the borrower | * | |||
’debt_to_income’ | Percentage of the gross monthly income used to pay monthly debt | * | |||
’revol_bal’ | Total credit revolving balance | * | |||
’application_type’ | Indicates whether the loan is an individual or a joint application | * |