1. Introduction
Additional risks emerge in the credit life cycle because of the spike in activities from the generation of credit to the awarding of loans. These difficulties have prompted the incorporation of technologies that can create reliable credit risk models and handle the underlying data quality problems. Because data-driven tools rely heavily on data to establish the links between inputs and outputs in an empirical framework, data quality is crucial at every stage of the model-building process. Failing to recognise and remove noise from the data will introduce bias.
Many studies have investigated the effects of poor credit risk management procedures and excessive levels of credit risk in the banking system after the onset of the global financial crisis that caused economic downturns from 2007 to 2009. Banking organisations have since created sophisticated methods for measuring and controlling credit risk across various product lines. The advantages of credit risk management through the Internal Ratings-Based (IRB) approach, where banks use internal data to estimate risk components to quantify exposure to credit loss, have been noted by the Basel Committee (
Häger and Andersen 2010).
Because the data used to build these models often contain missing values, data quality becomes a fundamental difficulty in the new approach. Data are crucial to organisational operations and to making the strategic decisions that lead to business success, according to (
Madnick and Lee 2009). Given the evolution of big data and its impact on business productivity, data scarcity or abnormalities in data management systems will have a significant adverse effect. According to (
Haug et al. 2011), numerous studies have found that poor-quality data continue to be an issue in most businesses, despite the data available within the organisation.
Unfortunately, research on credit risk modelling that addresses poor-data problems is limited, as is the use of reliable imputation strategies based on modern AI approaches to reduce data missingness. This study will eventually determine whether credit risk models reduce the underlying financial loss in banks, which serves as a foundation for expected credit loss.
By learning from data, artificial intelligence enables machines to replicate human intelligence. These programs will produce biased findings if their underlying data does not fully represent the population, eventually influencing their decision-making. This brings us to the research’s main goal: address the significance of data quality, find ways to deal with missing data methodically, and propose effective machine-learning techniques for poor credit data.
The remainder of the paper is outlined as follows:
Section 2 presents a literature review of the methods used in the paper.
Section 3 reviews related work on poor data phenomena, imputation strategies, missingness mechanisms, and the common statistical and machine learning approaches leveraged to tackle the problem;
Section 4 presents the experimental set-up for the proposed use of the generative adversarial network strategy and the experimental results. We close with remarks and conclusions in
Section 5.
2. Background Review
“Credit” is a derivative of the Latin word “credere”, which means to trust or believe. According to this interpretation, the fundamental principle of credit is the mutual trust between the creditor, which can be either an individual or an entity, and the debtors (the credit applicants). Credit risk is one of the most significant risks associated with banks’ operations. Credit risk modelling aims to segregate the customers, identify borrowers likely to default, and eventually calculate the expected credit loss of the loan portfolio. It also assigns customers the proper cost based on their credit score, for instance, charging high rates to high-risk clients and lower rates to low-risk clients.
As the granting of credit is essential to carrying out normal banking operations, it is safe to argue that this is where the majority of the risks the banks are exposed to originate (
Andersen et al. 2001). To adequately manage credit risk exposures, banks need to understand credit risk, according to (
Kim and Wu 2008). Furthermore, (
Ljubić et al. 2015) assert that this calls for a system of credit risk management that offers accurate and adequate credit risk rating and exact segmentations of credits among priorities that should be closely overseen. A good credit risk policy should lead to the adoption of a system that encourages controlled risk, resulting in the development of a strong credit rating.
2.1. Machine Learning Classification Techniques
2.1.1. Support Vector Machines
Machine learning describes a system that can automatically learn information from experience and other sources. Various machine learning algorithms have been used to solve diverse classification issues, and the approach each method takes to the learning process differs significantly from the others (
Danso et al. 2014). As opposed to modelling continuous-valued functions, classification predicts categorical labels. In contrast to classification, which focuses on extending known structures to new data, clustering is the problem of identifying groups and structures in the data that are similar in some way, without using known structures (
David and Balakrishnan 2010).
One of the supervised machine learning approaches found in the literature is the support vector machine (SVM), a classification tool. SVM is based on the structural risk minimisation principle and was introduced by Vapnik to address classification problems. SVM can operate in both linear and non-linear settings: through the kernel mapping technique it can be applied to regression, classification, and other learning tasks, with separate or combined data, and it performs well across these tasks. Kernels (
Cömert and Kocamaz 2017) are used to determine how similar or dissimilar data objects are.
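As a minimal illustration of these operating principles, the sketch below fits an SVM classifier with scikit-learn on synthetic data; the feature matrix, labels, and hyperparameter values are assumptions for demonstration only, not the configuration used in this paper.
```python
# Minimal SVM sketch (illustrative only): scaled features, linear kernel.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = MinMaxScaler().fit(X_train)      # SVMs are sensitive to feature scale
clf = SVC(kernel="linear", C=1.3)         # kernel="rbf" would switch to a non-linear mapping
clf.fit(scaler.transform(X_train), y_train)
print("accuracy:", clf.score(scaler.transform(X_test), y_test))
```
Switching the kernel argument is how the same estimator moves between linear and non-linear decision boundaries.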
In the analytical study of the service quality of Indian railways under soft computing, (
Majumder et al. 2024), used support vector machines, extra trees, and multinomial naive Bayes as the three machine learning classification methods they utilised. They conducted a comparative analysis based on seven performance metrics and predicted the overall train rating. According to their findings, the support vector machine is the best estimator out of the three classifiers due to its improved capacity for making predictions on the train and railway data.
2.1.2. Logistic Regression
A logistic regression is a form of regression model used to forecast the outcome of a categorical dependent variable from one or more predictor variables. Several predictor variables can be related to a dichotomous dependent variable, such as D, using this mathematical modelling technique. When the illness measure is dichotomous, logistic regression is the most frequent modelling technique used to analyse epidemiologic data, although other modelling approaches are also viable. In other words, logistic regression computes probability scores for the dependent variable and uses them to quantify the association between one or more continuous independent variables and a categorical dependent variable (
Chaudhary et al. 2013).
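A minimal sketch of these probability scores with scikit-learn is shown below; the synthetic data are an assumption, and the hyperparameters simply mirror the values reported later in Section 4 for illustration.
```python
# Logistic regression sketch: probability scores for a binary (default / non-default) outcome.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
model = LogisticRegression(penalty="l1", C=0.2, solver="liblinear")
model.fit(X, y)

proba_default = model.predict_proba(X)[:, 1]   # P(y = 1 | x) for each applicant
labels = (proba_default >= 0.5).astype(int)    # dichotomous decision from the probability score
```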
Like ordinary least squares (OLS) regression, logistic regression is a prediction method; the prediction outcome in logistic regression, however, is dichotomous. Logistic regression is one of the most widely used tools for applied statistics and discrete data analysis, and it relies on linear interpolation (
Osisanwo et al. 2017). The approach of choice for credit risk modelling over time has been logistic regression. For instance, (
Kutty 1990) provided a logistic regression model for estimating the likelihood that debt owed by developing nations will default. The analysis included the debts of 79 nations over 19 years. The model anticipated the country’s debt default two years in advance for Mexico, Brazil, and Argentina.
To forecast the likelihood of default based on financial data, (
Hol et al. 2002) used a logistic regression model. Current estimates of the PD of US banks were made (
Gurný and Gurný 2013) utilising various statistical techniques, including linear discriminant analysis (LDA), the probit model, and logistic regression. For model estimation, the authors examined a sample of 298 American commercial banks gathered from 2007 to 2010 during the financial crisis. Logit and probit models were also subjected to stepwise selection. Despite the probit model having one more indicator, the logit model and the probit model exhibited remarkably similar explanatory power based on the fit in the training data (96.30% for the logit model and 95.85% for the probit model in terms of pseudo-R-square) (
Zhang 2015).
2.1.3. Naïve Bayes
The Naive Bayes variant known as Gaussian naïve Bayes assumes that the features within each class are normally distributed and supports continuous values. Based on Bayes’ Theorem, Gaussian naïve Bayes makes the strong assumption that the predictors are independent of one another. For instance, whether we should grant a loan depends on the applicant’s income, age, history of prior loans, geography, and transaction history. It is implausible that such data points in a real-world setting will not interact with one another, yet, unexpectedly, Gaussian naïve Bayes works well in that circumstance.
Despite oversimplified presumptions, it frequently performs better in various challenging real-world scenarios. The naïve Bayes theorem has the significant benefit of requiring less training data to estimate the parameters (
Alam and Pachauri 2017). In agriculture, medicine, and biometrics, NB has been applied and proven successful. Nevertheless, little has been performed with credit scoring to improve banks’ and other financial institutions’ assessments of a customer’s creditworthiness. Finding the likelihood of a label given some observable features is our goal in Bayesian classification. Gaussian, multinomial, and Bernoulli are the three categories of Naive Bayes models.
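The sketch below shows the Gaussian variant on synthetic data; the dataset and the smoothing value (borrowed from the settings reported in Section 4) are illustrative assumptions.
```python
# Gaussian naive Bayes sketch: per-class normal likelihoods under an independence assumption.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
nb = GaussianNB(var_smoothing=0.12)
nb.fit(X, y)
print(nb.predict_proba(X[:5]))   # posterior P(class | features) under the naive assumption
```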
2.1.4. Decision Trees
Decision trees (DTs) are supervised algorithms that repeatedly divide the data into subsets according to its properties until a stopping requirement is satisfied; a tree-like structure results from this iterative partitioning. DTs are white boxes because it is simple to follow the path from the root node to each leaf node in the tree and deduce the classification rules they learned. DTs remain quite effective even with massive quantities of data, owing to the algorithm’s partitioning nature, which works on ever-smaller portions of the dataset. They typically work only with straightforward attribute–value data that is simple to manipulate.
Multistage decision-making can be approached in several ways, including the decision tree classifier (DTC). The capacity of DTCs to break down a complex decision-making process into a set of more straightforward options is its most crucial attribute since it results in a solution that is frequently simpler to understand (
David and Balakrishnan 2010).
Many conventional applications in numerous domains have successfully used decision trees. Although it can be said that DT is an older strategy, it has proven effective. For instance, DT has recently been used as a machine-learning technique to create automatic classification models for data related to pancreatic cancer. By classifying cases and arranging them based on feature values, DT-based algorithms “learn” from training examples. Each node in a DT stands for a feature of an instance that needs to be classified, and each branch stands for a value that the node can take. The image in the following figure shows how DT operates in the feature space (
Danso et al. 2014).
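The white-box property described above can be made concrete with a small sketch: scikit-learn can print the learned splits as readable rules. The data and depth limit are assumptions for illustration.
```python
# Decision tree sketch: printing the learned node/branch structure as classification rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
print(export_text(tree))   # each root-to-leaf path corresponds to a readable rule
```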
Due to the replication problem, decision trees can end up representing some concepts in a substantially more complicated form than necessary. One approach to prevent replication is to use an algorithm to create compound features at the nodes. The FICUS construction technique was first introduced by (
Markovitch and Rosenstein 2002). It employs the typical input of supervised learning and a feature representation specification to create a set of generated features. While FICUS shares certain similarities with other feature-creation algorithms, its key advantages are its flexibility and generality. FICUS was made to generate features from any feature representation specification that complies with its general-purpose language.
2.1.5. Random Forest
A random forest is a group of unpruned classification or regression trees inferred from bootstrap samples of the training data using random feature selection. Classification trees do not require the explanatory variables to be free of multicollinearity or to follow a particular functional form. Since they use non-parametric techniques, no distributional assumptions are necessary. Decision trees can be constructed using a variety of algorithms. Classification trees, first popularised by
Breiman et al. (
1984), are the most common solution for binary problems.
Splitting rules, which we may use to divide the variable space into more manageable chunks, are the foundation of classification trees. Classification trees have the advantage of being easy to interpret and understand. Overfitting is a problem that arises when a full tree is grown: due to its complexity, the tree’s predictive capability may be poor. The typical approach to address this problem is the pruning strategy, which is outlined, for instance, in (
Ortl 2016).
The predictions of the individual trees are combined (by majority vote for classification or by averaging for regression) to produce the ensemble’s prediction. Random forest typically shows a significant performance gain compared to single-tree classifiers like CART and C4.5 (
Yadav and Tiwari 2015). In ensemble learning, RF is a classifier that improves generalisation and classification accuracy for big databases by growing many classification trees. To use numerous prediction models, RF combines a set of base classifiers that operate independently of one another.
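A minimal sketch of bootstrap aggregation with scikit-learn follows; the synthetic data and settings are illustrative assumptions and deliberately enable bootstrapping (with out-of-bag scoring) to show the mechanism, rather than mirroring the exact configuration used later in this paper.
```python
# Random forest sketch: many trees on bootstrap samples with random feature selection,
# combined by majority vote; out-of-bag samples give a built-in performance estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=42)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)
```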
2.1.6. Gradient Boosting
The ensemble model known as gradient boosting (GB), or stochastic gradient boosting, consists of several elementary decision trees. By merging the predictions of various base models and iteratively reducing the error term, ensemble models seek to increase accuracy. After the initial base model (tree) is set up, each succeeding base model is fitted to the residuals of the previous model to minimise the error term and correct the errors of the current ensemble. Since it is a homogeneous ensemble classifier, additional randomness is introduced through bootstrap sampling (
Andrić et al. 2019).
Gradient boosting originated from Leo Breiman’s observation that boosting may be seen as an optimisation technique for an appropriate cost function. Later, Jerome H. Friedman developed explicit regression gradient boosting techniques, in parallel with Llew Mason’s more general functional gradient boosting viewpoint (
Mason et al. 2000).
The two later studies provide an abstract interpretation of boosting techniques as iterative functional gradient descent algorithms. This refers to algorithms that select a function (weak hypothesis) that points toward the negative gradient iteratively to optimise a cost function across function space. In many fields of machine learning and statistics outside of regression and classification, boosting techniques have been developed due to this functional gradient interpretation of boosting.
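The stage-wise fitting described above can be observed directly in scikit-learn, as in the sketch below; the data are synthetic and the hyperparameters (borrowed from Section 4) are assumptions for illustration.
```python
# Gradient boosting sketch: shallow trees added sequentially, each fitted to the
# pseudo-residuals (negative gradient of the loss) of the current ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
gb = GradientBoostingClassifier(n_estimators=117, learning_rate=0.01,
                                max_depth=4, random_state=42)
gb.fit(X, y)

# staged_predict shows how the ensemble's fit evolves as trees are added
for i, stage_pred in enumerate(gb.staged_predict(X)):
    if i % 40 == 0:
        print(i, (stage_pred == y).mean())
```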
2.2. Poor Data Overview
In reality, the data’s qualities frequently vary across its many dimensions. The default coverage, for instance, may be less extensive in the early period of the data than in the later period. Also, smaller firms’ financial statements may not be as accurate or consistent as those of larger organisations. Sample selection biases, data collection difficulties, and default definition variations are just a few of the data factors that (
Dwyer 2007) offer an overview of when validating private enterprises’ default risk models.
Most analysts know that these data issues exist in their validation and development samples. Still, they cannot quantify their exact scope, according to (
Tang et al. 2011). A standard statistical issue in many application areas is missing values in datasets. Internal records may be lacking in credit analysis for various reasons, including improper registration, clients who refuse to respond to enquiries, and database or recording system errors. One possible approach is to drop the missing values from the original dataset, as performed by (
Adams et al. 2002) and (
Berger et al. 2005), or perform a preprocessing to replace the missing values, as performed by
Banasik et al. (
2003) and
Paleologo et al. (
2010). These procedures are missing data imputations (
Louzada et al. 2016).
2.2.1. Listwise Deletion Approach
Listwise deletion, also known as complete-case analysis, is a commonly used approach in data analysis. It involves removing any case with one or more missing values from the dataset. The simplest option for dealing with missing data is to use deletion techniques, in which the dataset’s rows or columns with missing values are simply eliminated. Even though deletion techniques are simple to comprehend and apply, they can yield biased results, particularly when the missing data adhere to a particular missingness mechanism. Deleting data indiscriminately might result in the loss of essential information and introduce biases into the analysis that follows (
Zhou et al. 2024).
If a single value is absent from a record, the entire record is removed from the analysis; this is known as listwise deletion. However, if there are many incomplete records, this strategy can be wasteful and drastically impair statistical power. Listwise deletion is only appropriate when the likelihood of missing data is the same across all cases and the missingness process is unrelated to the data (MCAR). In that case, no bias is added to the dataset; there is only a loss of data and diminished statistical power (
Woods et al. 2021).
A complete-case analysis will provide unbiased estimates in this situation, but standard errors and confidence intervals will reflect the smaller data group containing complete records. Removing any subject with an incomplete set of measurements is comparable to having a lower sampling fraction and a smaller sample size and does not introduce bias. Thus, listwise deletion decreases efficiency even when appropriate, sometimes leading to a dramatically smaller sample size.
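A minimal pandas sketch of listwise deletion follows; the column names and values are hypothetical.
```python
# Listwise deletion sketch: any row containing a missing value is dropped.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income":  [4200, np.nan, 3100, 5800],
                   "age":     [34, 45, np.nan, 29],
                   "default": [0, 1, 0, 1]})

complete_cases = df.dropna()   # only fully observed records remain
print(len(df), "->", len(complete_cases), "records after listwise deletion")
```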
2.2.2. Imputation Techniques
Various imputation strategies might be used to keep records of incomplete data in the analysis. Maximum likelihood and multiple imputation techniques can generate accurate estimates for MCAR and MAR without sacrificing statistical power. Observations that are missing are replaced by values that are in some way projected, frequently from a model, using imputation techniques. In a single imputation, the missing observation may be replaced with the sample mean or median, with a predicted value of the variable (for example, from a regression model, bootstrap, or a random dataset from multiple imputation), or with the value from a study patient who matches the patient with the missing data on a set of chosen covariates.
2.2.3. Single Imputation (Median/Mean) and Mode Imputation
Machine learning algorithms can learn from incomplete data with the support of missing value imputation. The three basic missing value imputation strategies are mean, median, and mode. The mean of a set is its average value, the median is its middle value when the numbers are arranged by size, and the mode is the value that appears most frequently. These are only conjectures; when data are missing completely at random, it is reasonable to assume that the missing values are most likely very close to the mean or the median of the distribution, since these represent the most frequent/average observations. The missing observations most likely look like the majority of the observations in the variable.
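The sketch below shows single imputation with scikit-learn; the tiny numeric and categorical columns are hypothetical examples.
```python
# Single imputation sketch: median for a numeric feature, most-frequent (mode) for a categorical one.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({"income": [4200, np.nan, 3100, 5800]})
cat = pd.DataFrame({"home_ownership": ["RENT", np.nan, "OWN", "RENT"]})

num_imputed = SimpleImputer(strategy="median").fit_transform(num)
cat_imputed = SimpleImputer(strategy="most_frequent").fit_transform(cat)
print(num_imputed.ravel(), cat_imputed.ravel())
```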
2.2.4. Multiple Imputation
Multiple imputation extends single imputation by generating several plausible values for each missing observation, producing multiple completed datasets. Each completed dataset is analysed separately, and the results are pooled so that the final estimates and their standard errors reflect the additional uncertainty introduced by the missing data. Together with maximum likelihood, multiple imputation can generate accurate estimates under MCAR and MAR without sacrificing statistical power.
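A multiple-imputation-style sketch is shown below; scikit-learn's IterativeImputer (a chained-equations engine) is used here only as one possible implementation, drawing several completed datasets with posterior sampling, and is not necessarily the procedure used in this paper.
```python
# Multiple-imputation-style sketch: draw m completed datasets and analyse each separately.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[4200.0, 34.0], [np.nan, 45.0], [3100.0, np.nan], [5800.0, 29.0]])

imputed_sets = []
for seed in range(5):                                   # m = 5 imputed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_sets.append(imp.fit_transform(X))
# each completed dataset would then be modelled separately and the estimates pooled
```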
2.2.5. K-Nearest Neighbour
A supervised technique called
k-nearest neighbours (KNNs) categorises new cases based on a similarity metric (e.g., a distance function) after storing all the existing data. KNN has been employed in statistical estimation and pattern recognition, classifying cases according to their nearest neighbours. Similarity is quantified by a distance function, such as the Euclidean distance (
Payrovnaziri et al. 2020). K is typically chosen to be an odd number when making decisions. If K = 1, the new case is automatically placed in the class of its closest neighbour; otherwise, the class with the most votes among the K closest neighbours is assigned (
Reddy and Babu 2018).
The Euclidean distance is defined as d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, where x = (x_1, \dots, x_n) and y = (y_1, \dots, y_n) are the two cases being compared.
The two types of nearest neighbour algorithms are structure-based KNN and structure-less KNN. The structure-based approach deals with the data’s fundamental structure, with fewer mechanisms connected to training data samples. The data are divided into training and sample points using a structure-less technique. The distance between each training and sample point is determined, and the training point with the least distance is referred to as the nearest neighbour. The KNN technique’s effectiveness for big training datasets and robustness against noisy training sets are among its primary features.
Rather than using all available instances in the data, the KNN imputation algorithm uses only similar cases with incomplete patterns. Given an incomplete pattern x, this method selects the K closest cases that are not missing values in the attributes to be imputed (i.e., features with missing values in x), such that they minimise some distance measures.
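A minimal sketch of this imputation scheme with scikit-learn's KNNImputer follows; the small matrix and the choice of two distance-weighted neighbours are assumptions for illustration.
```python
# KNN imputation sketch: each missing entry is filled from the K most similar rows,
# weighted by Euclidean distance computed over the observed features.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```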
KNN’s benefits include its simplicity, transparency, robustness against noisy training data, ease of understanding, and ease of implementation, while its drawbacks include its computational complexity, memory constraints, poor runtime performance for large training sets, and the potential for issues caused by irrelevant attributes (
Soofi and Awan 2017).
The goal is to commit the training set to memory and then predict the label of any new instance based on the labels of its close training set neighbours. This method’s justification is predicated on the idea that the characteristics utilised to characterise the domain points are pertinent to their labelling in a way that makes nearby points likely to share the same label (
Shalev-Shwartz and Ben-David 2014).
2.2.6. Generative Adversarial Networks (GANs)
A technique called generative adversarial network (GAN) imputation uses GANs to fill in the missing values in a dataset. GANs are a subtype of deep neural network made up of a generator network and a discriminator network. The discriminator network attempts to differentiate between the generated samples and the genuine data, while the generator network creates samples similar to the input data. The GAN is trained on the available data, including the features and the target variable, to produce data samples that resemble the actual data.
The generator network is taught to produce samples that the discriminator network cannot differentiate from actual data. The GAN network can fill in the missing values once trained by providing the available data along with the missing values and utilising the generator to create the imputed data.
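A highly simplified sketch of such an imputation loop is given below, loosely following the GAIN idea discussed later in the paper; it omits the hint mechanism and other details, the network sizes and loss weights are arbitrary assumptions, and it is not the exact architecture used in this study. X is assumed to be a float tensor scaled to [0, 1] and M a float mask (1 = observed, 0 = missing).
```python
# Simplified GAN-imputation sketch (PyTorch), not the paper's exact implementation.
import torch
import torch.nn as nn

def make_net(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                         nn.Linear(64, d_out), nn.Sigmoid())

def gan_impute(X, M, epochs=200, lr=1e-3):
    n, d = X.shape
    G = make_net(2 * d, d)      # input: data with noise in the missing slots + the mask
    D = make_net(d, d)          # predicts, per component, the probability it was observed
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        Z = torch.rand(n, d)
        X_tilde = M * X + (1 - M) * Z                  # noise where values are missing
        G_out = G(torch.cat([X_tilde, M], dim=1))
        X_hat = M * X + (1 - M) * G_out                # completed matrix

        # discriminator step: guess which components were imputed
        d_prob = D(X_hat.detach())
        loss_d = bce(d_prob, M)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # generator step: fool D on the missing slots, stay close to the data on observed slots
        d_prob = D(X_hat)
        loss_g = bce(d_prob * (1 - M) + M, torch.ones_like(M)) + \
                 10.0 * ((M * (X - G_out)) ** 2).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    with torch.no_grad():
        Z = torch.rand(n, d)
        G_out = G(torch.cat([M * X + (1 - M) * Z, M], dim=1))
    return M * X + (1 - M) * G_out

# Example usage with a random mask (hypothetical data):
# X = torch.rand(100, 5); M = (torch.rand(100, 5) > 0.2).float()
# X_completed = gan_impute(X * M, M)
```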
The goal of generative modelling, an unsupervised learning task in machine learning, is to automatically identify and learn the patterns in input data so that the model can produce new examples that could have been reasonably drawn from the original dataset.
Figure 1 below shows the GAN architecture for image completion, where G denotes a generative network trained to produce fake images and D denotes the discriminator that learns to categorise them. The discriminator learns to distinguish the counterfeit images made by the generator from the actual images from the dataset.
It takes two networks to train a GAN: a discriminator network and a generator network. Real data from a dataset are used in the process, along with fake data continuously produced by the generator throughout training. The discriminator is trained like any other deep-neural-network classifier: data from the training set are presented to it and it is trained to classify these data into the “real” class, while the fake data are assigned to the “fake” class. To create the fake data, a random vector z is first sampled from a prior distribution over the model’s latent variables.
Consequently, the generator produces a sample x = G(z). The function G is a neural network that transforms the random, unstructured z vector into a sample meant to be statistically indistinguishable from the training data. The discriminator subsequently classifies this fake material, and its training assigns these data to the “fake” class. The generator can be trained using the derivatives of the discriminator’s output with respect to its input, thanks to the backpropagation process.
The generator is trained to fool the discriminator into assigning its output to the “real” class. Except for using data from a parametric distribution for the “fake” class, which is updated dynamically as the generator learns, the discriminator’s training process resembles that of any other binary classifier. The generator’s learning process is somewhat different: rather than being given explicit output targets, it is rewarded for generating outcomes that fool its opponent; (
Goodfellow et al. 2020).
Motivated by the generative adversarial networks in the previous studies, this paper aims to simulate the GAN architecture in the incomplete-case context to improve the classification accuracy of machine learning models for credit risk modelling. The credit data are mainly classified under a missing-at-random mechanism, and the analysis will simulate the GAN architecture to build a robust strategy to improve the data quality.
3. Related Work
First, the nature of the missingness in this paper must be explained to align the methods and assumptions that will handle the dataset used in the experiment. There are three missing data categories: MCAR, MAR, and MNAR (missing not at random). A data variable is missing completely at random (MCAR) if the probability that the feature is missing is independent of the feature’s value and any other features’ values. This is often the best-case missingness scenario, with no relation to observed or unobserved values and data equally likely to be missing.
In the medical field, (
Jerez et al. 2010) applied machine learning techniques, such as the multi-layer perceptron (MLP), self-organising maps (SOM), and k-nearest neighbour (KNN), to data collected through the “El Álamo-I” project. The results were compared to those obtained from the listwise deletion (LD) method. Imputation techniques based on machine learning algorithms improved the prediction of patient outcomes compared with statistical imputation techniques.
Templ et al. (
2011) mentioned that the imputation method can be used for multiple imputation, producing more than one option for a missing cell if it can handle the inherent randomness in the data. The use of multiple imputations to reflect sampling variability should only be performed after carefully examining the distributional assumptions and underlying models.
In datasets gathered through the delivery of psychological and educational assessments, missing data are a prevalent issue. It is well known that missing data can cause significant problems like skewed parameter estimates and inflated standard errors. An empirical comparison of the effectiveness of the missing data imputation techniques IRT model-based imputation (MBI), expectation–maximization (EM), multiple imputation (MI), and regression imputation (RI) was conducted in this study. Results demonstrated that MBI performed better than other imputation approaches, particularly with larger sample sizes, in retrieving item difficulty and the mean of the ability characteristics. However, when recovering item discrimination parameters, MI delivered the most significant results (
Kalkan et al. 2018).
In practical machine learning tasks, incompleteness is one of the primary data quality concerns. Numerous studies have been carried out to address this problem. Although there is little research on symbolic regression with missing data, most studies concentrate on the classification task. In this study, a new imputation technique for symbolic regression with incomplete data is presented. The method seeks to enhance the effectiveness and efficiency of symbolic regression using imputed missing data.
Genetic programming (GP) and weighted k-nearest neighbours (KNN) are the foundations of this approach. To forecast the missing values of incomplete features, it builds GP-based models utilising other already available features. Such models are built using weighted KNN to choose the instances. The experimental results on real datasets demonstrate that the suggested method outperforms a number of state-of-the-art methods in terms of imputation accuracy, symbolic regression performance, and imputation time (
Al-Helali et al. 2021).
Banks and other lending institutions can build credit risk control models for lending businesses by harnessing machine learning algorithms. The result helps to mitigate the negative aspects of conventional evaluation techniques, such as low efficiency and an overreliance on subjective assessment. Nonetheless, data with missing credit features will always be encountered during the practical evaluation process. For those machine learning algorithms to be trained appropriately, the missing attributes must be filled in, especially when applying the algorithms to small banks that have little credit data. In that study, an autoencoder-based approach was introduced that can recover the missing data items in the features by leveraging the correlation within the data (
Yan 2023).
Thin-file borrowers are consumers whose creditworthiness assessment is uncertain because they do not have a credit history. Many researchers have utilised borrowers’ social interactions as an alternative data source to address missing credit information. Traditionally, manual feature engineering has been used to exploit social networking data; however, in recent times, graph neural networks have emerged as a promising alternative.
Muñoz-Cancino et al. (
2023) introduced an information-processing framework to improve credit scoring models using several methods of graph-based learning: feature engineering and graph embeddings.
In several areas of medical research, missing data frequently occur, especially in questionnaires. In addition to the widely used complete-case analysis, that article describes and compares six conceptually distinct multiple imputation methods. It also examines whether the methodology for handling missing data may affect the clinical conclusions drawn from a regression model when the data are categorical.
The choice of method for handling missing data significantly influences the clinical interpretation of the accompanying statistical analysis. The decision to impute missing data and the choice of imputation method can affect the clinical conclusions drawn from a regression model and should, therefore, be given adequate consideration (
Stavseth et al. 2019).
According to (
Wu et al. 2021), researchers in the study of mining software repositories have to label a sizable amount of data to build a predictive model, and the accuracy of the labels significantly impacts a model’s performance. However, the effect of incorrectly labelled instances on a predictive model has only been the subject of a few small studies. Their case study addressed security bug report (SBR) prediction to close this gap. They discovered that five publicly accessible datasets for SBR prediction contain many instances that have been incorrectly classified, which has negatively impacted the accuracy of SBR prediction models used in recent studies.
Although the concept of a decision tree is not recent, decision tree algorithms have been gaining popularity with the growth of machine learning. This technique uses mathematical formulas like the Gini index to find an attribute of the data and a threshold value of that attribute to make splits of the input space (
Patil et al. 2016;
Amaro 2020). (
Namvar et al. 2018) presented an empirical comparison of various combinations of classifiers and resampling methods within a novel risk assessment methodology that integrates unbalanced data to solve these problems. The credit projections from each combination are assessed using a G-mean measure to prevent bias toward the majority class, which has not been considered in previous studies. As a result of their findings, combining random forest and under-sampling may be a valuable method for determining the credit risk of loan applications in social lending markets.
(
Sharma et al. 2022), in the study titled “A Study on Decision-Making of the Indian Railways Reservation System during COVID-19” and under machine learning application, used a random forest classifier and an extra trees classifier. They also examined the classifiers’ predictive power with respect to the cross-validation score and six performance metrics, which include accuracy, precision, recall, F1-score, Hamming loss, and Matthews correlation coefficient. Their findings indicated no differences in the confusion matrices or values of any performance measures between the two classifiers. However, ETC performs better than RFC in terms of cross-validation score when measured using 10-fold stratified cross-validation.
Yoon et al. adapted the popular generative adversarial networks (GANs) paradigm to provide a novel approach to imputing missing data, which they call generative adversarial imputation networks (GAIN). After observing a portion of an actual data vector, the generator (G) imputes the missing components based on the observed data and returns a completed vector. Next, given a completed vector, the discriminator (D) tries to identify which components were imputed and which were genuinely observed. D is given extra information in the form of a hint vector to ensure that D makes G learn the intended distribution: the hint provides partial information about the original sample’s missingness, which lets D concentrate on the imputation quality of specific components and guarantees that G learns to generate according to the actual data distribution. Tests of the method on multiple datasets showed that GAIN performed much better than state-of-the-art imputation methods (
Yoon et al. 2018).
The possibility of missing credit risk data might significantly diminish the assessment model’s effectiveness. Therefore, building a data imputation technique is highly beneficial for accurate missing data prediction. Due to the complicated random missing patterns and high missing rate of credit risk assessment datasets, creating an efficient imputation model is typically exceedingly tricky. Multiple generative adversarial imputation networks (MGAINs), a novel imputation technique, was proposed in that research, which marked the first time missing credit risk assessment data were imputed using GANs (
Zhao et al. 2022).
The studies above provide various methods for coping with incomplete data in credit risk and other industries. Each study presented one solution in a particular area instead of many solutions to the different issues in the credit risk environment. Although robust methods were identified and acknowledged in the studies above, a gap exists in integrating effective strategies to handle poor data extensively. Given the current fourth industrial revolution age and the complexity of its systems and technology components, this paper aims to illustrate the integration of various techniques holistically, underlining the necessity of remediating faulty data from end to end, even if not in its totality.
4. Experiments
4.1. Experimental Set-Up
This section will propose using generative adversarial network unsupervised learning to deal with poor credit data. The aim is to test the robustness of machine learning classification algorithms, namely SVM, naïve Bayes, decision trees, random forest, gradient boosting, K-NN, and logistic regression, when GANs are harnessed as an imputation strategy for the incomplete-case scenario. In addition, an empirical comparison of the classification algorithms when the complete-case approach, single imputation (mean/median), mode imputation, multiple imputation, and K-NN machine learning imputation are employed is also explored to determine the effectiveness of the machine learning strategies when handling poor data. Furthermore, the complete-case approach refers to the instance where records with missing values are eliminated from the dataset. While the complete-case approach is most appropriate for the MCAR missingness mechanism, depending on how outcomes and missingness are related, the validity of this strategy may hold in scenarios where data are missing at random.
4.1.1. Data Source
The loan application data are sourced from the Lending Club website in Kaggle. Lending Club is a US peer-to-peer lending company headquartered in San Francisco, California. It was the first peer-to-peer lender to register its securities with the Securities and Exchange Commission (SEC) and offer loan trading on a secondary market. The Lending Club is the world’s largest peer-to-peer lending platform.
In terms of missing values, there are a total of 74 identified variables with missing values, composed of 57 numeric features and 17 categorical features. The missingness rate ranges from 0.01% to 100%. Out of the 57 numeric variables, only 16 features have a missingness rate below 50%, 18 features have a missingness rate between 50% and 90%, and the remaining 23 features have a missingness rate above 90%.
4.1.2. Data Processing
We applied the Z-score threshold method and the isolation forest approach as outlier detection methods to identify potential outliers in the numeric features. The results of the two outlier detectors were compared to see which method was more stable when categorising the outlier instances. The identified outlier values were treated using median imputation and winsorisation (at 5% and 10%), and different transformations (logarithm, square root, Yeo-Johnson, and quantile) were tested. These transformations were also experimented with to deal with highly skewed data, complementing the Z-score approach to enhance the technique’s performance. Finally, the outliers that impacted the model performance results were discarded.
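A minimal sketch contrasting the two detectors on synthetic data is shown below; the |z| > 3 threshold and the injected extreme rows are assumptions for illustration, not the exact settings used in the experiments.
```python
# Outlier detection sketch: Z-score threshold vs. isolation forest on numeric features.
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
X[:5] *= 8                                               # inject a few extreme rows

z_outliers = (np.abs(stats.zscore(X)) > 3).any(axis=1)   # |z| > 3 rule, per column
iso_outliers = IsolationForest(random_state=42).fit_predict(X) == -1
print(z_outliers.sum(), iso_outliers.sum())              # rows flagged by each method
```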
Since we are dealing with an incomplete-case scenario, the missingness rate of each feature was assessed. The features with over 50% missingness were all dropped. Features with missing values were treated using mean/median, mode, multiple imputation, K-NN, and GANs imputation techniques. One-hot encoding was applied to convert categorical variables to a numerical format, and the data were normalised using the min–max scaler.
Correlation analysis was conducted to better understand the relationships between the variables. A SelectKBest approach based on a scoring function was used to select the best features, which were then leveraged to train the classifiers. The model performance results were compared with the case where variable selection was not used; the outcomes were almost identical.
4.1.3. Missing Data
The credit data are mainly classified under a missing at-random mechanism, and the analysis will simulate the GANs architecture to build a robust strategy to solve the data quality issue.
Nicoletti and Peracchi (
2006) summarise missing at random (MAR) for the default loan data and state that the missing values depend on the observed data, can be fully described by other variables in the dataset, and do not depend on the missing values themselves.
This assumption underlies most imputation procedures. For example, even though respondents at the lower and upper end of the income distribution are less likely to provide survey responses than those in the middle, these missing data points are related to demographics and other socioeconomic variables, which can be observed in the data (
Pedersen et al. 2017).
4.1.4. Data Partition
According to (
Siddiqi 2006), there are various ways to split the development (the sample on which the scorecard is developed) and validation (“hold-out”) datasets. Typically, 70% to 80% of the sample is used to create each scorecard; the remaining 20% to 30% is set aside and then used to test or validate the scorecard independently. Where sample sizes are small, the scorecard can be developed using 100% of the sample and validated using several randomly selected samples of 50% to 80% each. In this paper, we propose to use a 70% split of the data from the population sample, since the primary purpose is to assess the robustness of the classification algorithms in managing poor data.
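A minimal sketch of the 70/30 split with scikit-learn follows; the synthetic imbalanced dataset and the stratification option are illustrative assumptions (stratification preserves the default rate across the two partitions but is not explicitly stated in the paper).
```python
# 70/30 development/validation split sketch.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.85], random_state=42)
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
print(X_dev.shape, X_val.shape)
```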
4.1.5. Model Evaluation
According to (
Castermans et al. 2010), the primary aim of back-testing PD discrimination is to verify whether the model still correctly distinguishes or separates defaulters from non-defaulters, or provides a correct ordinal ranking of default risk such that defaults are assigned low ratings and non-defaulters high ratings. Discriminatory power, therefore, refers to the model’s ability to discriminate between default and non-default events; even so, the model can still make wrong forecasts for individual events.
With that setting in mind, (
Wójcicka 2012) stated that in times of economic crises, banks (financial institutions) first and foremost need to minimise their losses by limiting the probability of default of the companies they finance. Of course, on the other hand, they would also want to optimise the profit that comes from funding “good” companies, and lowering the number of “good” companies by classifying too many of them into the group of “bad” firms will result in decreasing their income.
There are two scenarios in which default models can produce misleading results. First, if the model shows a low risk when the risk is actually high, this is a type 1 error: customers with a high chance of defaulting are issued credit as if they were of high credit quality. A type 2 error is when the model indicates a high risk when, in fact, the risk is low. Table 1 below illustrates these scenarios further in a contingency table.
4.1.6. Model Performance Analysis
The KS statistic and Gini coefficient are the two most frequently used metrics in an industry context, and the Basel Committee recommends that the Gini coefficient or accuracy ratio (AR) be used in banks to measure the discriminatory power of models. The KS statistic measures the maximum vertical separation (deviation) between two cumulative distributions (good and bad) in scorecard modelling.
To know how to order the attributes from best to worst risk, one must know whether the variable is positively or negatively correlated with the outcome variable. If there is a positive correlation, the higher the variable values, the higher their levels of risk, and vice versa (
Bjornsdottir et al. 2009).
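A minimal sketch of both metrics is shown below, assuming predicted scores and true labels are available as arrays; the KS statistic is taken as the maximum separation between the score distributions of the two classes, and the Gini coefficient follows the standard identity Gini = 2·AUC − 1.
```python
# KS statistic and Gini coefficient sketch from predicted scores and true labels.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.8])

ks = ks_2samp(scores[y_true == 1], scores[y_true == 0]).statistic  # max CDF separation
gini = 2 * roc_auc_score(y_true, scores) - 1                       # Gini = 2*AUC - 1
print(round(ks, 3), round(gini, 3))
```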
(
Guo et al. 2017) state that receiver operating characteristic (ROC) curves display the discrimination potential of fitted logistic models by evaluating the tradeoffs between the true positive rate (sensitivity) and the false positive rate (1 − specificity). Given a cutoff value, subjects can be classified as positive or negative according to their predicted probabilities. Hence, a 2 × 2 classification table can be constructed to show the relationship between the predicted and actual outcomes. Sensitivity is the fraction of positive subjects correctly predicted as positive, and specificity is the fraction of negative subjects correctly predicted as negative.
Hamming loss (HL) is the ratio of incorrect labels to the total number of labels. In multiclass classification, the Hamming distance between y_true and y_pred is used to compute the Hamming loss. Consequently, HL considers the prediction and missing errors normalised across the total number of samples and classes.
The Matthews coefficient is a machine learning performance statistic for binary classifiers, commonly known as the Matthews correlation coefficient (MCC). It evaluates the relationship between the actual and expected binary outcomes, considering each of the confusion matrix’s four components.
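A minimal sketch of both metrics with scikit-learn follows; the label vectors are hypothetical.
```python
# Hamming loss and Matthews correlation coefficient sketch for binary predictions.
from sklearn.metrics import hamming_loss, matthews_corrcoef

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print("HL:", hamming_loss(y_true, y_pred))        # fraction of wrongly predicted labels
print("MCC:", matthews_corrcoef(y_true, y_pred))  # uses all four confusion-matrix cells
```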
The randomised search from the sklearn Python library was leveraged to select the best set of hyperparameters so that the classifiers could achieve improved classification accuracy. The random state for all the classifiers was set to 42. The hyperparameters selected using the randomised search for the seven machine learning classifiers are listed below, followed by a brief sketch of the search procedure.
Support vector machines: C = 1.3, gamma = 0.7, kernel = ‘linear’;
Random forest: n_estimators = 50, min_samples_split = 5, min_samples_leaf = 2, max_features = ‘sqrt’, max_depth = None, bootstrap = False;
Logistic regression: C = 0.2, penalty = ‘l1’, solver = ‘liblinear’;
Gradient boosting: learning_rate = 0.01, max_depth = 4, min_samples_leaf = 2, min_samples_split = 2, n_estimators = 117;
K-nearest neighbourhood: metric = ‘euclidean’, n_neighbors = 20, weights = ‘uniform’;
Naïve Bayes: GaussianNB with var_smoothing = 0.12;
Decision trees: max_depth = 5, max_features = None, min_samples_leaf = 2, min_samples_split = 5.
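The sketch below illustrates the randomised search for one of the classifiers (random forest); the parameter distributions, number of iterations, and synthetic data are assumptions for illustration rather than the exact search space used in this study, and the other classifiers would follow the same pattern with their own distributions.
```python
# RandomizedSearchCV sketch for hyperparameter selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=42)
param_dist = {"n_estimators": [50, 100, 200],
              "max_depth": [None, 4, 8],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 2, 4],
              "max_features": ["sqrt", "log2"],
              "bootstrap": [True, False]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=20, cv=5, random_state=42)
search.fit(X, y)
print(search.best_params_)
```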
4.2. Experimental Results
A comparative analysis of five imputation strategies and a complete-case scenario is conducted. The results of this experiment show the effectiveness of machine learning methods when blended with different missing-data approaches. The random state is set to 42, and the classification accuracy for each model is produced to measure performance.
When GANs imputation is harnessed, the decision tree is the best-performing classifier with an accuracy rate of 93.01%, followed by random forest (92.92%), gradient boosting (92.33%), support vector machine (90.83%), logistic regression (90.76%), and naïve Bayes (89.29%). The K-nearest neighbours classifier is the worst-performing method, with an 88.68% accuracy.
For the complete-case approach, the best-performing machine learning algorithm in terms of robustness when dealing with missing values is the random forest (93.01%), followed by gradient boosting (92.67%), naïve Bayes (90.01%), logistic regression (88.43%), k-NN (87.34%), and SVM (86.84%). The worst performance is by the decision tree, with an accuracy rate of 85.01%. The differences in performance across all seven machine learning algorithms are significant at the 95% confidence level.
The above table shows the performance report concerning six performance metrics when cross-validation is employed. These models are developed under the GANs imputation for incomplete data since the simulation results of GANs from
Table 2 revealed improved classification accuracy for algorithms compared to traditional methods. It can be noted that decision trees outperform all six other classifiers in all six performance metrics, followed by random forest, gradient boosting, SVM, logistic regression, and naïve Bayes. The least-performing classifier is K-NN. Random forest and gradient boosting have almost identical performances, which could be attributed to the fact that both classifiers belong to the ensemble family.
The recall rate of 93% implies that the decision tree classifier generalises the data well, as shown in
Table 3 below. The GANs strategy has been optimised, with the epochs and batch size adjusted to improve the results. The accuracy rate of naïve Bayes was enhanced by over 5% when the GANs strategy was optimised. The results suggest that most algorithms are more robust to incomplete data when GANs are leveraged as an imputation method.
The
Figure 2 below shows the learning curves of the different classifiers, used to evaluate their performance on the training and cross-validation sets. The intention is to visually diagnose whether a model overfits, underfits, or performs well over time during the learning experience. Additionally, the cross-validation ensures that the classifiers’ performance is robust and that the learning curves are reliable. The training-score curve lies above the cross-validation curve, with most classification models around 90% average accuracy on the training scores, which indicates that the models do not overfit and can thus generalise well to the test data. The learning curves look good for all the classifiers because the training and validation accuracies are close to each other. Even though K-NN is the least-performing classifier compared to the others, it shows a consistent trend and performs as observed in
Figure 2.
The
Figure 3 below shows the ROC curves for a comparative analysis of the performance of the seven classifiers. The classifiers with the highest AUC scores are decision trees (82% AUC) and gradient boosting (82% AUC), followed by random forest (81% AUC), logistic regression (79% AUC), support vector machine (74% AUC), and naïve Bayes (73% AUC). K-NN has the lowest AUC score (60%). Compared to the other algorithms, the 82% AUC score for the two leading classifiers indicates better discriminatory power when distinguishing between positive and negative cases. For the support vector machine (74% AUC) and naïve Bayes (73% AUC), this implies good discriminatory power with room to improve the classifiers, which can be achieved through further hyperparameter tuning. Furthermore, although decision trees and gradient boosting have equal (82%) AUC scores, decision trees stand out as the better classifier when taking into account all the other evaluation metrics, such as accuracy, precision, F1 score, recall, HL, and MCC. K-NN has the least discriminatory power compared to the six other algorithms and has shown consistent sensitivity to poor data across the other evaluation metrics.
The
Figure 4 below shows that the decision tree algorithm has the lowest error rate, followed by random forest, gradient boosting, and logistic regression. This low error rate implies that these four algorithms are performing well in classification and generalising well on the validation dataset, as it reveals what percentage of the class predictions are invalid. Although naïve Bayes achieved a high accuracy in the classification report (refer to
Table 2), its error rate sits above the 9% average, although it remains lower than those of SVM and K-NN. This error rate assessment is supplemented with the ROC-AUC curve metrics for a comprehensive model evaluation, to limit cases of bias in the event of class imbalance. Overall, the performances of DT, RF, GB, and LR consistently show more robustness towards poor data than the NB, SVM, and K-NN classifiers.
5. Remarks and Conclusions
Overall, the results suggest that the GANs unsupervised learning strategy is effective for handling incomplete credit data. When this unsupervised method is harnessed, the results show that most single classifiers display robustness towards missing data problems, as reflected in their improved model performance. Additionally, though the sample size impacts the model performance, this strategy can overcome the complexity challenge. The accuracy of the classifiers deteriorated under the traditional methods compared to the GANs strategy.
The decision tree classifier is the best-performing algorithm with respect to the six performance evaluation metrics. DT outperforms all the other six classifiers and has consistent outputs across all performance metrics. Random forest showed consistent classification accuracy compared to other algorithms, regardless of which imputation technique is leveraged. Though random forest is the second-best performing algorithm, its AUC value is slightly lower than that of gradient boosting. This dynamic reveals the tradeoff between the random forest and gradient boosting algorithms, which is unsurprising because these two belong to the ensemble learning family. The other key finding is the efficiency of K-nearest neighbours imputation; the method is faster than the GANs and multiple imputation techniques. Although K-NN is not better than GANs for handling poor data, it works better than traditional statistical methods such as median and mode imputation. By leveraging optimisation, the naïve Bayes classifier accuracy rate significantly improved. Hyperparameter optimisation is a progressive practice that is essential for any modelling process to achieve better performance.
In conclusion, this paper offers the first steps toward a clearer understanding of the relative benefits and drawbacks of methods for estimating credit risk from potentially incomplete information. We hope this work will motivate future empirical and theoretical research into the significance of data quality in the credit risk sector and into cutting-edge deep learning techniques as robust remediation methods.