1. Introduction
Appropriate customer selection is a key element of risk management in the banking industry [
1]. However, achieving accuracy in risk assessment is considered a difficult task. In problems related to credit scoring, the dependent variable is dichotomous, where ‘0’ is assigned to failed loans and ‘1’ to non-failed loans. Thus, techniques such as logistic regression and neural networks (NNs) can be used to estimate the borrower’s probability of default [
2]. To manage financial risks, banks collect information from customers and other financial institutions to distinguish safe borrowers from risky ones. However, current automated lending risk evaluation methods are imperfect, and the failure of credit scoring algorithms to accurately assess loan recipients can result in considerable losses. Thus, from the perspective of the banking sector, appropriate assessment of credit applicants is crucial.
The topic of credit scoring has been at the forefront of the fields of finance and economics in applying machine learning (ML) techniques such as decision trees (DTs) [
3], NNs [
4], and support vector machines (SVMs) [
5]; thus, the performance of various classification algorithms for credit scoring has been intensively researched over the past 50 years. Initially, the accuracy gains of these methods (compared with the logistic regression model) for the assessment of creditworthiness appeared to be limited. However, the performance of ML-based scoring methods has improved considerably since the adoption of ensemble methods, especially bagging [
6] and boosting [
7] methods.
The application of deep learning (DL) to business analytics and operations has also attracted considerable research attention [
8]. Kraus et al. [
8] revealed that DL is a feasible and effective method in these fields and determined that it can outperform its traditional counterparts in terms of predictive accuracy. The development of accurate and analytical credit scoring models has thus emerged as a major area of focus for financial institutions [
8]. Numerous classification algorithms have been proposed for credit scoring. For example, Gunnarsson et al. [
9] reported that XGBoost, which was originally proposed by Chen and Guestrin [
10], is the best-ranking classifier. However, the application of DL algorithms in credit scoring has been largely ignored in the literature [
9].
DL has been successfully used in many real-world applications, especially in domains involving visual and audio recognition or time-series economic and financial data analysis. In these domains, the temporal and/or spatial correlation of data enables DL methods to learn features effectively, leading to superior classification results. DL models such as convolutional neural networks (CNNs) [
11] and long short-term memory (LSTM) networks [
12] commonly use data correlations to learn feature representations. One-dimensional (1D) CNNs have been applied to data with temporal correlations, such as stock indices, whereas convolution has been used to learn meaningful patterns in data. Existing DL methods largely benefit from this learning power to identify meaningful features by capturing temporal/spatial correlations [
13]. In a systematic and comprehensive review, Sezer et al. [
14] reported a lack of review papers focusing solely on DL for credit scoring, despite the growing interest in the development of models incorporating DL for financial time-series forecasting.
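To illustrate this mechanism concretely, the following minimal sketch (in PyTorch; the window length and layer sizes are illustrative assumptions rather than a configuration from any cited study) shows a 1D CNN whose convolution slides over a windowed time series:

```python
# A minimal, illustrative 1D CNN over a windowed time series (e.g., a stock
# index); the window length and layer sizes are assumptions for exposition.
import torch
import torch.nn as nn

window = torch.randn(8, 1, 30)                    # batch of 8 series, 30 time steps
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3), nn.ReLU(),   # convolution learns local temporal patterns
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),        # pool the strongest response per filter
    nn.Linear(16, 2))                             # e.g., up/down classification
print(model(window).shape)                        # torch.Size([8, 2])
```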
Furthermore, to the best of the author’s knowledge, only one review article on the application of DL to credit scoring has been published. Dastile and Celik [
15] conducted a systematic literature survey on statistical and ML models for credit scoring to leverage the performance benefits of DL while complying with the legislative requirements for automated decision-making processes. In their paper, they briefly described the DL techniques for credit scoring published from 2015 to 2018, which represented the first wave of the replacement of statistical and classical ML techniques with DL techniques in credit scoring. Luo et al. [
16] first used credit default swap data to compare the performance of deep belief networks (DBNs) with that of logistic regression, multi-layer perceptrons (MLPs), and SVMs, and revealed that DBNs exhibited superior performance. Tran et al. [
17] proposed a hybrid model combining genetic programming and stacked autoencoder (AE) network models. They compared the proposed hybrid model with logistic regression,
k-nearest neighbor (KNN) classification methods, SVMs, artificial neural networks (ANNs), and DTs for credit scoring datasets. The results revealed that the proposed hybrid model exhibits excellent accuracy.
In a survey of the literature published from 2015 to 2018 on the use of DL for financial applications, Ozbayoglu et al. [
18] observed that the DL methods applied to credit scoring are relatively simple. In addition, Yu et al. [
19] proposed a unique cascade hybrid model of a DBN-based resampling SVM ensemble learning paradigm to classify the German and Japanese credit scoring datasets. However, over the past few years, novel DL-based methods for credit scoring have been rapidly developed.
As the present study aimed to provide in-depth insights rather than a systematic review, studies published between 2019 and 2021 were searched using Web of Science, Science Direct, and IEEE Xplore. For
Section 4 only, a few recent studies from arXiv were also selected. The present review focuses on an emerging trend in which ML techniques are partially being replaced by DL techniques for credit scoring. The architectures used in DL include DBNs [
20], LSTM networks [
12], CNNs [
11], and AEs [
21]. Such comparisons should be performed using a considerable number of real-world credit scoring datasets [
8]. Thus, the models in this review were evaluated and compared in terms of their accuracy and area under the receiver operating characteristic curve (AUC) [
22] for the Australian, German (categorical), German (numerical), Japanese, and Taiwanese datasets, which are commonly used in the credit scoring and other research communities [
22]. Further, the improvements in accuracy and AUC values achieved with these datasets using ensemble classifiers and their hybrids, DL techniques, rule extraction, and rule-based classifiers for credit scoring have been tabulated as well.
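For concreteness, the evaluation protocol used for these comparisons can be sketched as follows; the synthetic data and random forest baseline are illustrative stand-ins for the five datasets and the classifiers reviewed here:

```python
# A minimal sketch of the evaluation protocol used throughout this review:
# train/test split, then accuracy and AUC. The synthetic data and the random
# forest baseline are illustrative stand-ins for the five real datasets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]              # estimated P(class 1), e.g., non-failed loan
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, proba))          # area under the ROC curve
```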
Over 2019–2021, DL-based classifiers achieved the highest reported accuracies, giving rise to a “DL revolution” in credit scoring. A key aspect of DL-inspired ensemble systems involves the hierarchical distribution of ML elements in cascade and/or parallel ensembles [
23,
24,
25]. Another key aspect is the conversion of tabular datasets into images using bins employed to calculate the weight of evidence (WOE) [
26]. Dastile and Celik [
27] considered both continuous and categorical features and achieved the highest accuracy (88%) amongst DL-based classifiers for the German (categorical) dataset.
The objectives of this review are fourfold: (1) to present certain theoretical characteristics of DBNs and the reasons they achieve higher accuracy than shallower networks with one hidden layer by using ML theorems; (2) to review the most recent DL techniques that have been shown to achieve higher accuracies than ensemble classifiers, their hybrids, rule extraction methods, and rule-based classifiers; (3) to reveal the potential classification capabilities of DL-based classifiers and investigate their applicability to credit scoring datasets; and (4) to provide deep insights into the usefulness and interpretability of DL in credit scoring and related financial areas.
The remainder of this paper is structured as follows.
Section 2 presents the fundamentals of DL models used in credit scoring, such as deep multi-layer perceptrons (DMLPs), CNNs, LSTM networks, restricted Boltzmann machines (RBMs) [
28], DBNs, AEs, discretised interpretable multi-layer perceptrons (DIMLPs), 1D CNNs,
gcForest, and DL ensemble systems, as well as data attributes and preprocessing/encoding techniques for DL in credit scoring.
Section 3 provides an overview of the accuracy, AUC, and methods recently reported for the Australian, German (categorical), German (numerical), Japanese, and Taiwanese credit scoring datasets.
Section 4 explains how tabular datasets were converted into images for the application of two-dimensional (2D) CNNs over 2018–2021.
Section 5 presents an explanation of “black box” models using local and global rule extraction and rule-based methods in credit scoring.
Section 6 summarises the emerging trends and accuracies of various methods for the Australian, German (categorical), German (numerical), Japanese, and Taiwanese datasets. Further, it highlights the potential capabilities of DL classifiers, discusses their applicability for credit scoring based on emerging trends reported mainly over 2020–2021 from the perspective of ML with and without DL techniques for the five datasets, and outlines the design of CNN-based classifiers for credit scoring datasets.
Section 7 provides promising research directions, and
Section 8 concludes the paper.
4. Converting Tabular Datasets into Images to Apply Two-Dimensional Convolutional Neural Networks (2D CNNs)
DL differs from conventional ML in that it can learn good feature representations from data. Existing DL methods greatly benefit from this feature-learning ability to identify meaningful features that capture temporal/spatial correlations [
13].
In conventional ML and data classification, by contrast, a different setting is considered: instances are assumed to be independent and identically distributed, and the features used to represent the data are assumed to have weak or no correlations. Because of this assumption, conventional ML methods do not consider feature/data correlations in the learning process. Most conventional ML methods, including multi-layer NNs and randomised learning methods such as stochastic configuration networks, do not explicitly consider feature interactions during learning, mainly because they expect feature correlations to be handled by a data preprocessing step that creates independent features before the ML methods are applied.
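As a concrete illustration of such preprocessing, the following minimal sketch decorrelates features with PCA (one common choice, used here only as an illustrative assumption) before a conventional ML method would be applied:

```python
# A minimal sketch of decorrelating features before conventional ML; PCA is
# one common preprocessing choice (an illustrative assumption) that yields
# uncorrelated features from strongly correlated inputs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base, 0.8 * base + rng.normal(scale=0.1, size=(500, 1))])
print(np.corrcoef(X, rowvar=False)[0, 1])          # close to 1: strongly correlated

X_indep = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X_indep, rowvar=False)[0, 1])    # ~0: decorrelated features
```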
Neagoe et al. [
69] compared deep CNNs and MLPs using credit scoring datasets and reported that deep CNNs achieved higher accuracy for the German and Australian datasets. However, according to Hamori et al. [
70], DL model performance depends on the choice of activation function, the number of hidden layers, and the dropout rate. Their results showed that ensemble methods such as boosting and bagging outperform DNNs on the Taiwanese credit scoring dataset. Nevertheless, these studies suggest the applicability of CNNs to credit scoring datasets.
Zhu et al. [
71] used a hybrid method to perform credit scoring by combining a CNN with a relief algorithm (for feature selection) and found that this hybrid relief–CNN model achieved better performance than logistic regression and RF [
72]. Their method converted tabular credit scoring data into images by bucketing the features and mapping them to image pixels, although only numerical features were considered.
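The bucketing-and-mapping step can be sketched as follows; the bin count, grid size, and raster layout are illustrative assumptions rather than the exact procedure of Zhu et al. [71]:

```python
# A simplified sketch of the bucketing-and-mapping step: each normalised
# numerical feature is quantised into bins, and the bin values of a sample are
# laid out on a square grey-scale grid. The bin count, grid size, and raster
# layout are assumptions, not the exact procedure of Zhu et al. [71].
import numpy as np

def row_to_image(row, n_bins=16, side=5):
    binned = np.floor(row * n_bins).clip(0, n_bins - 1)   # features scaled to [0, 1)
    pixels = np.zeros(side * side)
    pixels[:len(binned)] = binned / (n_bins - 1) * 255    # bins mapped to grey levels
    return pixels.reshape(side, side)

rng = np.random.default_rng(0)
sample = rng.random(24)                                   # e.g., 24 normalised credit features
print(row_to_image(sample).shape)                         # (5, 5) image for the CNN
```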
Inspired by the Super Characters method and 2D embeddings, Sun et al. [
73] proposed the SuperTML method to address the problem of classifying tabular data. In their method, the features of each input are first projected into a 2D embedding (an image), and this image is then fed into fine-tuned 2D CNN models for classification. A conceptual example of SuperTML is shown in
Figure 10.
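The SuperTML projection can be sketched with Pillow as follows; the canvas size, grid layout, and font are illustrative assumptions:

```python
# A minimal sketch of the SuperTML projection using Pillow: feature values are
# rendered as text at fixed positions in a blank image, which would then be
# fed to a fine-tuned 2D CNN. Canvas size, grid layout, and font are assumptions.
from PIL import Image, ImageDraw

def supertml_image(features, size=224, cols=2):
    img = Image.new("L", (size, size), color=0)              # black canvas
    draw = ImageDraw.Draw(img)
    rows = (len(features) + cols - 1) // cols
    cell_w, cell_h = size // cols, size // rows
    for i, value in enumerate(features):
        x, y = (i % cols) * cell_w, (i // cols) * cell_h
        draw.text((x + 4, y + 4), f"{value:.2f}", fill=255)  # feature as white text
    return img

img = supertml_image([0.42, 1310.0, 24.0, 0.0, 67.5, 2.0])   # hypothetical credit features
img.save("supertml_example.png")
```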
Han et al. [
13,
66,
74] proposed using DL for generic data classification, a technique in which rows of data are transformed from tabular into matrix form for use as inputs to CNNs. However, progress on using CNNs to classify tabular data has been slow; as a result, non-NN methods, including SVMs and XGBoost, still predominate for tabular data.
Buturović and Miljković [
75] developed a tabular convolution approach that converts tabular datasets into images by treating each row of tabular data (i.e., a feature vector) as an image filter (kernel) and then applying that filter to a fixed base image. A CNN was then trained to classify these filtered images. In their study, they used gene expression data obtained from the blood of patients with bacterial or viral infections.
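The row-as-kernel idea can be sketched as follows; the base image, its size, and the kernel shape are illustrative assumptions:

```python
# A minimal sketch of tabular convolution: a row of tabular data is reshaped
# into a small kernel and convolved with one fixed base image shared by all
# rows; the filtered images are then classified with a 2D CNN. The base image
# and the 3 x 3 kernel shape are illustrative assumptions.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
base_image = rng.random((28, 28))            # fixed base image shared by all rows

def row_to_filtered_image(row, kernel_shape=(3, 3)):
    kernel = np.resize(row, kernel_shape)    # feature vector used as an image filter
    return convolve2d(base_image, kernel, mode="same")

print(row_to_filtered_image(rng.random(9)).shape)   # (28, 28) image for the CNN
```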
Most tabular data lack meaningful spatial relationships between features and are therefore unsuitable for direct modelling using CNNs. To overcome this challenge, Zhu et al. [
76] developed a novel image generator to convert tabular data into images by assigning features to pixel positions such that similar features are close to each other. Their algorithm obtains an optimised assignment by minimising the difference between the ranking of distances amongst features and the ranking of distances amongst their assigned pixels.
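A greatly simplified sketch of this assignment is given below; the hierarchical-clustering leaf order used here is a cheap stand-in for the ranking-based optimisation of [76] and should not be read as the authors' algorithm:

```python
# A greatly simplified sketch of assigning similar features to nearby pixels:
# hierarchical-clustering leaf order along a raster scan is used here as a
# cheap stand-in for the paper's ranking-based optimisation (an assumption,
# not the authors' algorithm).
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
X = rng.random((200, 25))                            # 200 samples, 25 features

order = leaves_list(linkage(X.T, method="average"))  # similar features become adjacent

def sample_to_image(row, side=5):
    return row[order].reshape(side, side)            # raster-scan pixel assignment

print(sample_to_image(X[0]).shape)                   # (5, 5) image per sample
```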
Using CNNs, Sharma and Kumar [
77] proposed a new data-wrangling preprocessing method that can transform a 1D data vector into a 2D graphical image with appropriate correlations amongst fields. To the author's knowledge, this is the first method capable of converting non-image, non-time-series data into image data. The converted data, processed using a CNN with VGGNet-16, achieved competitive classification accuracy compared with the canonical ANN approach, suggesting considerable potential for further improvement of the method.
7. Promising Research Directions
Following the emergence of DL-based classifiers with high accuracies for credit scoring in 2021, the author believes that three promising research directions exist, as explained below.
The first is the use of a DL-inspired ensemble system [
24,
34]. The key aspect of a DL-inspired ensemble system is the inclusion of ML elements that are distributed in cascade and/or parallel ensembles hierarchically. As shown in
Section 6.2, DGCEC [
23] and DGHNL [
24], as typical DL-inspired ensemble systems, achieved very high accuracies for credit scoring datasets consisting of only numerical attributes, such as the Australian, German (numerical), and Japanese datasets. A hybrid ensemble classifier with DBN [
20] and deep forest [
36,
37] also achieved very high accuracies for the Japanese and Taiwanese datasets.
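The cascade principle can be sketched with a standard stacking classifier; this shallow two-layer stand-in is an illustrative assumption and far simpler than systems such as DGHNL [24]:

```python
# A minimal sketch of the cascade principle behind DL-inspired ensemble
# systems: predictions from a layer of diverse base classifiers feed the next
# layer. scikit-learn's two-layer stacking is used as a stand-in; systems such
# as DGHNL [24] stack many more heterogeneous layers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=14, random_state=0)
cascade = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression())    # next layer consumes base predictions
print(cascade.fit(X, y).score(X, y))
```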
Credit scoring models and their applications in peer-to-peer (P2P) lending (in which individual lenders provide loans to individual borrowers on an electronic platform) are still immature owing to the distinct characteristics of P2P lending [
106]. Chen et al. [
107] proposed a credit assessment model for banks to assess the risk of default for home credit based on DeepGBM [
108]; however, their model did not consider deviations caused by changes in the distribution of the data and cannot be updated online. Although substantial progress has been made, no similar attempts have been made for credit scoring in P2P lending. However, a deep sequential model ensemble [
109] has been proposed for the detection of credit card fraud.
Research on DL-based credit scoring has begun only recently and has the potential to significantly impact the workings of banks and other financial institutions. However, increases in the volume and velocity of credit card transactions can cause class imbalance and concept drift problems in credit card fraud detection datasets, which may make it very difficult for traditional approaches to produce robust detection models. To address this, Sinanc et al. [
110] proposed a novel approach called fraud detection with image conversion.
In the general CNN structure, high-dimensional input data, such as images, are not easily interpretable. Although the DL techniques reviewed here do not make full use of CNN structures, certain DL-based classifiers ranked amongst the top five classifiers with the highest accuracy, and the performances currently being achieved are higher than expected.
Currently, most existing credit scoring models are implemented with shallow structures; thus, DL has been innovatively introduced into credit scoring models, for example, in combination with XGBoost [
67]. Jiao et al. [
67] proposed a unique bidirectional optimisation structure that simultaneously optimises both the CNN and XGBoost by using adaptive particle swarm optimisation (APSO). Optimising the CNN yields deep features that are better suited to XGBoost, while optimising XGBoost matches the model structure to the extracted features, providing a better understanding of the image features. Bidirectional optimisation maintains the characteristics of both parts while allowing them to be combined more closely and enabling the fully extracted image features to be used for classification. The classification accuracy reported by Jiao et al. [
67] ranked very highly for the German (numerical) and Taiwanese datasets; thus, it is reasonable to believe that this simple idea could help DL-based classifiers deal simultaneously with structured and unstructured datasets.
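The underlying CNN-plus-XGBoost combination can be sketched as follows; the APSO-based bidirectional optimisation is omitted, and the architecture and input sizes are illustrative assumptions rather than the model of Jiao et al. [67]:

```python
# A minimal sketch of the CNN-plus-XGBoost combination (without the APSO-based
# bidirectional optimisation): a small CNN extracts deep features from the
# image-converted inputs, and XGBoost classifies them. The architecture and
# input sizes are assumptions, not the model of Jiao et al. [67].
import numpy as np
import torch
import torch.nn as nn
from xgboost import XGBClassifier

cnn = nn.Sequential(                          # feature extractor (untrained here)
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten())    # -> 8 * 4 * 4 = 128 deep features

rng = np.random.default_rng(0)
images = torch.tensor(rng.random((300, 1, 10, 10)), dtype=torch.float32)
labels = rng.integers(0, 2, 300)

with torch.no_grad():
    deep_features = cnn(images).numpy()       # deep features for the booster

booster = XGBClassifier(n_estimators=50).fit(deep_features, labels)
print(booster.predict(deep_features[:5]))
```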
In 2022, Du and Shu [
111] proposed a model that uses logistic regression, a bidirectional recurrent neural network (BRNN), and XGBoost for credit scoring. The model achieved an AUC of 0.9574 and an accuracy of 89.35% for the Australian dataset, and an AUC of 0.8374 and an accuracy of 77.5% for the German (categorical) dataset. In [
112], a novel financial distress prediction model was proposed based on an adaptive whale optimisation algorithm with deep learning (AWOA-DL), in which a multi-layer perceptron (MLP) is tuned by the optimisation algorithm. In the experiments, the AWOA-DL algorithm showed the best performance, with a maximum accuracy of 0.9689 for the Australian dataset.
The second research direction is to develop a class-imbalanced XGBoost as well as multiclass classification; both are of great practical significance in business analytics and can be applied in the areas of credit scoring, credit card fraud detection, bankruptcy prediction, and digital marketing. However, given the nature of the data structure, such datasets are not only imbalanced but also contain many nominal attributes, making it technically difficult to achieve high classification accuracy. DL-based classifiers therefore also constitute an urgent research issue, and many techniques that can classify structured data with high accuracy have been discussed in this review.
The third research direction is to convert tabular datasets into images using the bins employed to calculate WOE. Each pixel of a feature image corresponds to a feature bin, and WOE is used to create meaningful bins that are monotonic with respect to the response variable. Dastile and Celik [
27] considered both continuous and categorical features, and their proposed method achieved the highest accuracy (88%) amongst the DL-based classifiers for the German (categorical) dataset. In 2022, Borisov et al. [
113] proposed the DeepTLF (
https://github.com/unnir/DeepTLF, accessed on 1 January 2022) framework for deep tabular learning. The core idea of their method is to transform the heterogeneous input data into homogeneous data to boost the performance of DNNs considerably.
In contrast to a previous study [
76], Dastile and Celik [27] systematically discretised tabular data into optimal categories by using WOE and utilised both categorical and continuous features. Considering the practical applications of business analytics for credit scoring, such conversions are required to deal with both numerical and categorical datasets. As credit scoring datasets are stored in the databases of banks and other financial institutions, they can also be used for P2P lending [
106].
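For concreteness, WOE binning can be sketched as follows; the quartile bin edges and synthetic data are illustrative assumptions rather than the exact binning of [27]:

```python
# A minimal sketch of WOE binning: a feature is discretised into bins and each
# bin receives ln(%good / %bad), so that each bin (and hence each pixel of a
# feature image) carries a value monotonic in default risk. The quartile bin
# edges and synthetic data are assumptions, not the exact binning of [27].
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(50, 15, 1000)
p_default = 1 / (1 + np.exp((income - 50) / 10))    # default risk falls with income
df = pd.DataFrame({"income": income,
                   "default": (rng.random(1000) < p_default).astype(int)})

df["bin"] = pd.qcut(df["income"], q=4)              # quartile bins
grouped = df.groupby("bin", observed=True)["default"]
bad = grouped.sum()                                 # failed loans per bin
good = grouped.count() - bad
woe = np.log((good / good.sum()) / (bad / bad.sum()))
print(woe)                                          # roughly monotonic across bins
```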
However, the use of credit scoring models in P2P lending involves certain limitations. In particular, the feature space of P2P credit data usually contains two types of features: dense numerical features (e.g., loan amount, asset-to-liability ratio) and sparse categorical features (e.g., gender, credit score), whereas existing classifiers, including DT classifiers and NN models, are typically suited to processing only one data type. Zhang et al. [
106] previously developed an effective model that handles multiple data types for P2P lending credit datasets. Therefore, developing an accurate and efficient method for converting tabular data with both categorical and continuous features is a promising direction for future studies. The accuracy of these approaches should be enhanced, and suitable methods should be investigated to improve interpretability for banks and other financial institutions.
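A hypothetical sketch of a network accepting both feature types is given below (in PyTorch); the embedding sizes and layer widths are assumptions, and this is not the model of Zhang et al. [106]:

```python
# A hypothetical sketch (not the model of Zhang et al. [106]) of a network
# accepting both P2P feature types: sparse categorical features pass through
# embedding layers, dense numerical features are concatenated directly.
# Embedding sizes and layer widths are assumptions.
import torch
import torch.nn as nn

class MixedInputNet(nn.Module):
    def __init__(self, cat_cardinalities, n_numeric, emb_dim=4):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cat_cardinalities])
        in_dim = emb_dim * len(cat_cardinalities) + n_numeric
        self.head = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x_cat, x_num):
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.head(torch.cat(embs + [x_num], dim=1))

net = MixedInputNet(cat_cardinalities=[2, 5], n_numeric=3)  # e.g., gender, credit grade
x_cat = torch.tensor([[0, 3], [1, 1]])                      # category indices
x_num = torch.rand(2, 3)                                    # loan amount, ratios, etc.
print(net(x_cat, x_num))                                    # default probabilities
```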
Finally, a tree diagram of topics for future development is provided in
Figure 17, in which the relevant papers are arranged by reference number. These are very early, pioneering works, and more advanced technologies will be developed in the near future.
8. Concluding Remarks and Future Scope of Work
Based on the above discussion, it can be concluded that there is a need to actively aim towards not only high quantitative performance, such as predictive accuracy, but also high qualitative performance, such as the interpretability shown in
Figure 13. In response to social demands such as the General Data Protection Regulation [
115], xDNN was developed as an innovative approach that showed very high classification accuracy using images; however, its level of explainability was still quite low. As previously discussed, xDNN offers a novel DL architecture that synergistically combines reasoning and learning and has outperformed well-known image classification methods in terms of accuracy. Currently, xDNN algorithms are not easily adaptable to credit scoring because Angelov and Soares [
103] simply prioritised the highest accuracy using complicated
if–then rules, with the
if part consisting of considerably large images. In contrast, various tools have been developed for converting tabular data into images, along with a bidirectional optimisation structure using both CNN and XGBoost.
Based on
Section 6.4,
Section 6.5,
Section 6.6 and
Section 7, it is reasonable to believe that many researchers may assume that there is no significant difference in classification accuracy no matter what method is used; however, this is true only when the degree of mission-criticality is not severe, with exceptions including data in finance and medicine. Therefore, using XGBoost for structured data and DL for the classification of unstructured data (i.e., images) is simple and quite traditional. In addition, if there is no significant difference in accuracy, improving interpretability is an invaluable option for wider adoption in various areas. A very recent study in finance, within a framework similar to credit scoring, extracted rules to classify bank failures [
114]. Research in the area of credit scoring or credit risk can contribute to the modernisation of financial engineering by simply introducing a time-series component, so that the elemental technologies described in this review can be applied to financial distress, bankruptcy, peer-to-peer (P2P) lending, credit card fraud detection, and the inclusion of macro-economic variables. Such findings are useful for bank supervisory authorities, bank executives, and risk management professionals, as well as policymakers in the field of finance.
At present, we are moving towards the intersection of the above research avenues to deal with both structured and unstructured data. DL could then achieve not only very high accuracy for images but also high performance for structured data in explainable credit scoring. In future work, an attempt will be made to bridge images and symbolic rules to realise AI finance.