Incorporating Digital Footprints into Credit-Scoring Models through Model Averaging

Wang, Linhui; Zhu, Jianping; Zheng, Chenlu; Zhang, Zhiyuan

doi:10.3390/math12182907


¹ School of Management, Xiamen University, Xiamen 361005, China

² Data Mining Research Center, Xiamen University, Xiamen 361005, China

³ Public Administration Department, Fujian Police College, Fuzhou 350007, China

⁴ Artificial Intelligence and Model Development Center, Technology Development Department, Xiamen International Bank, Xiamen 361001, China

* Authors to whom correspondence should be addressed.

Mathematics 2024, 12(18), 2907; https://doi.org/10.3390/math12182907

Submission received: 27 August 2024 / Revised: 13 September 2024 / Accepted: 15 September 2024 / Published: 18 September 2024


Abstract

Digital footprints provide crucial insights into individuals’ behaviors and preferences, and their role in credit scoring is becoming increasingly significant. It is therefore important to combine digital footprint data with traditional data for personal credit scoring. This paper proposes a novel credit-scoring model. First, lasso-logistic regression is used to select key variables that significantly impact the prediction results. Then, digital footprint variables are categorized based on business understanding, and candidate models are constructed from various combinations of these groups. Finally, the optimal weights are selected by minimizing the Kullback–Leibler loss, and the final prediction model is constructed as the weighted average of the candidate models. Empirical analysis validates the advantages and feasibility of the proposed method in variable selection, coefficient estimation, and predictive accuracy. Furthermore, the model-averaging method yields a weight for each candidate model, which offers managerial guidance for identifying beneficial variable combinations for credit scoring.

Keywords: digital footprints; credit scoring; model averaging; Kullback–Leibler loss

MSC: 91G40

1. Introduction

Personal credit scoring is a fundamental tool for assessing personal credit risk and making credit decisions. It represents an advanced practice of data-driven decision-making in the financial sector. However, the rising demand for personal credit has resulted in a corresponding increase in loan defaults, presenting new challenges for risk management within financial institutions. Traditional credit-scoring models, such as the FICO score, primarily rely on a borrower’s credit history and repayment records. These models use statistical methods to predict default risk [1]. These traditional data sources have inherent limitations, including limited predictive ability, infrequent updates, and inadequate adaptability to emerging markets [2,3]. Over the past decade, computing infrastructure has rapidly developed, leading to the high digitalization of modern society. People’s primary activities have moved online, generating a large amount of user-generated digital footprint data [4]. Digital footprints—that is, all information generated by users during their online activities and interactions with smart devices—have become valuable data sources for analyzing online behavior in the social sciences [5]. They are particularly valuable in fields such as tourism management and marketing [6,7,8]. These factors have driven the development of credit models, prompting financial institutions to integrate more multisource data and advanced technologies to improve credit-scoring accuracy [9,10].

The concept of digital footprints was first introduced to credit scoring by Berg et al. [11]. Recent studies have shown that digital footprints have information content in predicting consumer default [12,13]. By analyzing digital footprints, financial institutions can create more comprehensive personal credit profiles. For instance, the frequency of borrowers’ activities on social media, financial-related online search behaviors, and shopping habits on e-commerce platforms can provide valuable information for personal credit scoring [11,14].

Some studies have used unique account data to compare fintech credit scoring with traditional credit scoring for default prediction, showing the advantages of fintech credit scoring [15]. It has also been shown that digital footprint variables are highly valuable for default prediction [11]. To the best of our knowledge, however, little work has examined how digital footprints perform alongside traditional credit variables: current studies have not integrated these two types of data to compare predictive accuracy. Therefore, this paper merges digital footprint variables with traditional ones to improve credit-scoring prediction.

Unlike many studies focused on identifying key predictors, model averaging aims to enhance prediction accuracy using multiple predictors [16]. Model averaging extends model selection by combining multiple models, thereby significantly reducing risks [17]. This method has advantages in prediction performance that single model selection methods do not possess, especially in handling multisource data and complex problems, and it can improve prediction accuracy and stability. In credit scoring, different models have advantages in handling specific data, and model averaging can fully utilize the complementarity of these models. Additionally, model averaging has advantages in improving model generalization ability, reducing overfitting, and handling uncertainty.

This paper explores whether employing model averaging with digital footprint variables can enhance the predictive accuracy of personal credit-scoring models. To this end, we first applied lasso variable selection to reduce variable redundancy [18]. We then grouped variables based on our business understanding to create eight candidate models with different group combinations. We determined the weights of these models by minimizing the Kullback–Leibler (KL) loss and used the ensemble model to evaluate credit risk. Our approach combined logistic regression and model averaging for credit scoring using digital footprints.

Our study makes contributions in the following areas: Firstly, we utilized digital footprints in credit models, improving prediction accuracy. Secondly, due to the multitude of digital footprint variables and the lack of prior information, we employed a generalized linear model-averaging approach to address this issue. Thirdly, we utilized a generalized linear model-averaging method based on KL loss to determine the weights of the candidate models, enabling a straightforward recognition of the significance of different variable groups. Finally, we applied the proposed generalized linear model-averaging method based on KL loss to personal credit data, effectively combining traditional credit data with digital footprint data. This approach provides financial institutions with more comprehensive risk assessments.

Our study proceeds as follows: Section 2 reviews literature on personal credit scoring, digital footprints, and model averaging. Section 3 outlines the statistical models used and presents the framework of our proposed method. Section 4 covers empirical analysis. Finally, Section 5 concludes the paper and suggests avenues for further research.

2. Literature Review

2.1. Personal Credit Scoring

Personal credit scoring uses risk assessment tools to manage borrowers’ credit accounts, including evaluating credit status and repayment capacity to approve loans. After loan approval, risk assessment tools are utilized to oversee the borrower’s account, ensuring timely repayment [19]. Initially in credit evaluation, statistical methods were the predominant classification techniques [20]. Durand first proposed using discriminant functions to classify the risk of personal credit [21]. Subsequently, Hand and Henley [1] developed a method to estimate the factors influencing credit scores and employed logistic regression to predict customer credit ratings. Traditional statistical methods encompass linear discriminant analysis (LDA), logistic regression (LR), probit, tobit, and decision trees [9,22]. The two most widely used methods are LDA and LR [23,24]. Research shows that logistic regression outperforms linear discriminant analysis in predictive capability [25]. Due to their strong predictive performance and good comprehensibility, these methods are very popular [26]. Baesens et al. [9] found that most personal credit datasets exhibit weak nonlinearity, allowing both LDA and LR to perform effectively. Furthermore, the strong interpretability of logistic regression has led to its widespread use in the industry.

As digitalization progresses, digital information has become increasingly accessible. Consequently, personal data are now more thoroughly captured through a variety of technologies, including wearable devices, smartphones, and internet platforms [27]. It is therefore necessary to combine personal credit scoring with digital information and to design a personal credit-scoring method based on digital footprints. The remainder of this literature review covers research on digital footprints and model averaging.

2.2. Digital Footprints

Digital footprints have become a vital source of information in areas such as risk management, decision-making, and user profiling. In the field of risk management, Zarate et al. [28] conducted a systematic literature review exploring the effectiveness of digital phenotyping and data in assessing, diagnosing, and monitoring depression. Loutfi [4] proposed a framework to evaluate the practical deployment of consumer credit models using digital footprints, emphasizing their applications in the financial sector. Additionally, Azcona et al. [29] investigated how digital footprints can be used to detect at-risk students in computer programming courses, demonstrating the potential of learning analytics in this area. In terms of decision-making, Feher [30] discussed the conscious decisions made by users to control their digital footprints, emphasizing the importance of online personal strategies. Mou et al. [31] investigated tourism behavior in Qingdao, China, through the analysis of tourists’ digital footprints, demonstrating the potential of such data in understanding travel patterns. Wang et al. [32] proposed a method for predicting personality traits using digital footprints from social networks, incorporating temporal factors through an attention-based recurrent neural network and showcasing its application in decision-making. Yang et al. [33] explored the capability of using search queries and digital footprints for tourism forecasting during the COVID-19 pandemic. In terms of user profiling, Gladstone et al. [34] studied how to infer psychological traits from consumption records, demonstrating the capability of digital footprints in large-scale psychological research. These studies demonstrate the broad applications and significant potential of digital footprints in risk management, decision-making, and user profiling, highlighting their importance in data-driven analysis.

In recent years, fintech lending, as an innovative financing model, has provided unique competitive advantages for emerging fintech companies and traditional financial institutions. On the one hand, the proliferation of smart devices and the development of the Internet of Things (IoT) have made data collection and processing more efficient. Financial institutions can acquire and utilize multidimensional, heterogeneous data for credit risk assessment through collaboration or data purchase [11]. These data include information from borrowers’ device usage, such as phone models and locations, as well as credit scores and shopping behavior data purchased from other sources, providing a more comprehensive perspective for credit decisions. On the other hand, digitalized and automated lending processes reduce manual operations, enhancing the operational efficiency and market competitiveness of financial institutions [35]. A digital footprint augments credit bureau information instead of replacing it, impacting credit access and lowering default rates [11]. How to better utilize digital footprint data to serve the field of credit scoring is a question that requires our current exploration.

2.3. Model Averaging

With advancements in computing technology, the number of digital footprint variables continues to increase. Indiscriminate data collection leads to challenges in variable selection and model choice. Thus, more flexible and advanced methods are needed for applying these variables in personal credit scoring.

Model averaging was initially proposed as a method to reduce prediction errors, particularly when dealing with high variance in predictive models. It serves as an alternative to model selection, which aims to identify an optimal model. Instead, model averaging improves overall predictive performance by combining the results of multiple models and can be viewed as a special form of ensemble learning. Its core idea is to leverage the complementary nature of different model predictions to reduce the overall variance and bias of the prediction. Compared to homogeneous and heterogeneous ensembles, model averaging emphasizes the combination of model outputs rather than the diversity of the models themselves. Research has shown that model averaging can significantly improve predictive accuracy, especially when the base models have similar performance. Model averaging, as a technique to enhance predictive accuracy, is extensively used in economics, finance, and environmental science [36]. Model averaging plays a crucial role in handling model uncertainty. Its main idea is to perform statistical predictions and inferences by considering the weighted average of multiple candidate models, thereby addressing the uncertainty that arises from using a single model. Model averaging has evolved primarily through Bayesian model averaging (BMA) and frequentist model averaging (FMA). BMA has been widely applied in the field of credit scoring [37,38], whereas FMA is relatively less applied. Recently, there has been growing scholarly interest in FMA. In FMA research, one of the core issues is the selection of weights. Specifically, it involves establishing a criterion based on the given data to determine the appropriate weights. Different weight selections can lead to varying risks and asymptotic properties. The widely used weight selection methods include S-AIC (smoothed AIC) and S-BIC (smoothed BIC), introduced by Buckland et al. [39], MMA (Mallows model averaging), proposed by Hansen [40], and optimal model averaging (OPT), introduced by Zhang et al. [41]. Among these, S-AIC and S-BIC are two model-averaging methods derived from the model selection criteria AIC and BIC, respectively.

This study explores modeling strategies in situations where the impact of variables is uncertain, achieved by adjusting the dimensions of independent variables. Recognizing the challenge of identifying which independent variables best explain the dependent variable, this paper employs model averaging to tackle this issue. The core of model averaging is not to exclude any potentially effective variables but to capture information from different combinations of variables by constructing multiple candidate models. Specifically, this study selected eight models, each representing a different combination of variables, covering the diversity of digital footprint variables and traditional variables. Through weighted averaging, this study aims to achieve a balance among these models for more robust and accurate predictions.

3. Digital Footprints Data Processing and Forecasting Method

3.1. Variable Selection

Logistic regression is a prevalent method in credit scoring. For datasets containing many digital footprint variables, indiscriminate data collection can complicate variable and model selection. As technology advances and data volumes increase, flexible methods like lasso are essential in personal credit scoring to identify key variables, minimize redundancy, and improve prediction accuracy and robustness [42]. Therefore, we first use lasso logistic regression for variable selection.

Suppose there are $n$ samples. The response vector is $y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}$, where each $y_{i}$ is a binary variable equal to 0 or 1, and the covariate vector of sample $i$ is $x_{i} = {(x_{i 1}, x_{i 2}, \dots, x_{i p}, x_{i p + 1}, \dots, x_{i d})}^{T}$. An indicator $R \in \{1, \dots, K\}$ specifies which variables are used: $R = k$ indicates that, among all variables, only $X_{k} = (x_{i j} : i = 1, \dots, n,\ j \in Δ_{k})$ is used, where $Δ_{k}$ is a subset of $D = \{1, \dots, p, p + 1, \dots, d\}$ and $K$ is the number of classifications based on the digital footprints.

The logistic regression model estimates the probability $p (y_{i} = 1 | x_{i})$ as follows:

$$p (y_{i} = 1 | x_{i}) = \frac{1}{1 + e^{- (w \cdot x_{i} + ε)}}, \tag{1}$$

where $w$ is the vector of weights, $ε$ is the bias term, and $w \cdot x_{i}$ denotes the dot product between vectors $w$ and $x_{i}$.

The log-likelihood function for logistic regression is the sum of the log-likelihoods for all samples:

$$l (w, ε) = \sum_{i = 1}^{n} [y_{i} \log (p (y_{i} = 1 | x_{i})) + (1 - y_{i}) \log (1 - p (y_{i} = 1 | x_{i}))] . \tag{2}$$
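As an illustration, the probability in Eq. (1) and the log-likelihood in Eq. (2) can be evaluated with a few lines of plain Python. This is only a sketch: the function names `softplus`, `predict_proba`, and `log_likelihood` are hypothetical and not part of the paper, and the identity $y_{i} \log p_{i} + (1 - y_{i}) \log (1 - p_{i}) = y_{i} z_{i} - \log (1 + e^{z_{i}})$ with $z_{i} = w \cdot x_{i} + ε$ is used for numerical stability.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def predict_proba(w, eps, x_i):
    """Eq. (1): p(y_i = 1 | x_i) = 1 / (1 + exp(-(w . x_i + eps)))."""
    z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + eps
    # 1 / (1 + exp(-z)) = exp(-log(1 + exp(-z)))
    return math.exp(-softplus(-z))


def log_likelihood(w, eps, X, y):
    """Eq. (2): sum of the Bernoulli log-likelihoods over all samples."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + eps
        # y*log(p) + (1 - y)*log(1 - p) rewritten as y*z - log(1 + exp(z))
        total += y_i * z - softplus(z)
    return total
```

With all weights and the bias set to zero, every sample has probability 0.5, so the log-likelihood equals $n \log 0.5$.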

The lasso regularization term is the $l_{1}$ norm of the weight vector multiplied by the regularization parameter $λ$:

$$Ω (w) = λ \sum_{j = 1}^{p} |w_{j}|, \tag{3}$$

where $λ \geq 0$ is the regularization parameter controlling the strength of regularization, $w_{j}$ is the $j$th element of $w$, and $p$ is the number of variables. Incorporating the log-likelihood function and the lasso regularization term, the objective function for lasso logistic regression is:

$$J (w, ε) = - l (w, ε) + Ω (w) . \tag{4}$$

The goal is to find $w$ and $ε$ that minimize $J (w, ε)$. The optimization problem for lasso logistic regression is formulated as:

$$\min_{w, ε} \{- \sum_{i = 1}^{n} [y_{i} \log (σ (w \cdot x_{i} + ε)) + (1 - y_{i}) \log (1 - σ (w \cdot x_{i} + ε))] + λ \sum_{j = 1}^{p} |w_{j}|\}, \tag{5}$$

where $σ (z) = 1 / (1 + e^{- z})$ is the logistic sigmoid function. The optimization problem is typically solved with numerical optimization methods. In this paper, the objective function is solved by coordinate descent [43]: in each iteration, all weights except one are held fixed, and only that one weight is updated. The regularization parameter $λ$ is commonly chosen through cross-validation; in this paper, we employ 10-fold cross-validation.
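The coordinate-wise idea can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the solver of [43]: each coordinate step minimizes a quadratic upper bound of the objective in Eq. (5), using the curvature bound $\sum_{i} x_{i j}^{2} / 4$, which reduces to a soft-thresholding update. The names `soft_threshold`, `lasso_logistic_cd`, and `objective` are hypothetical.

```python
import math


def soft_threshold(a, t):
    """S(a, t) = sign(a) * max(|a| - t, 0)."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def objective(w, eps, X, y, lam):
    """Eq. (5): negative log-likelihood plus lam * sum_j |w_j|."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(a * b for a, b in zip(w, x_i)) + eps
        nll += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y_i * z
    return nll + lam * sum(abs(v) for v in w)


def lasso_logistic_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent; the objective never increases."""
    n, p = len(X), len(X[0])
    w, eps = [0.0] * p, 0.0
    z = [0.0] * n  # current linear predictors w . x_i + eps
    for _ in range(n_sweeps):
        # Bias term: unpenalized step with curvature bound n / 4.
        g = sum(_sigmoid(z[i]) - y[i] for i in range(n))
        step = -g / (n / 4.0)
        eps += step
        z = [z_i + step for z_i in z]
        for j in range(p):
            curv = sum(X[i][j] ** 2 for i in range(n)) / 4.0
            if curv == 0.0:
                continue
            g = sum((_sigmoid(z[i]) - y[i]) * X[i][j] for i in range(n))
            new_wj = soft_threshold(w[j] - g / curv, lam / curv)
            delta = new_wj - w[j]
            if delta != 0.0:
                w[j] = new_wj
                z = [z[i] + delta * X[i][j] for i in range(n)]
    return w, eps
```

A very large $λ$ drives every weight to exactly zero, which is the mechanism that lets the lasso act as a variable selector; smaller values keep only the variables whose gradient exceeds the threshold.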

3.2. Model Averaging Estimation and Weight Choice

The OPT method for generalized linear models was developed by Zhang et al. [41]. It is a model-averaging method that obtains the combined estimated weights by minimizing a weight selection criterion based on the KL loss with a penalty term. The method is specifically applied to the logistic model for processing digital footprint data. We continue to use the notation established earlier.

Our goal is to make predictions on the data $\{(y_{i}, x_{i j}) : i = 1, \dots, n,\ j \in Δ_{k},\ k = 1, 2, \dots, K\}$ containing digital footprint variables. We consider the exponential family

$$f (y_{i} | θ_{i}, ϕ) = \exp \{\frac{y_{i} θ_{i} - b (θ_{i})}{ϕ} + c (y_{i}, ϕ)\} = \exp \{y_{i} θ_{i} - \log (1 + e^{θ_{i}})\}, \quad i = 1, 2, \dots, n, \tag{6}$$

where $ϕ = 1$, $b (θ) = \log (1 + e^{θ})$, and $c (y_{i}, ϕ) = 0$. The parameter $θ_{i}$ is linked to the parameter vector $β$ and the $d$-dimensional covariate $X_{i}$ via $θ_{i} = X_{i}^{T} β$.
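For readers who want the intermediate step, the logistic form in Eq. (6) follows from writing the Bernoulli probability mass function in terms of the log-odds. The symbol $p_{i} = p (y_{i} = 1 | x_{i})$ is introduced here only for this derivation:

```latex
\begin{aligned}
p_{i}^{y_{i}} (1 - p_{i})^{1 - y_{i}}
  &= \exp \{ y_{i} \log \tfrac{p_{i}}{1 - p_{i}} + \log (1 - p_{i}) \} \\
  &= \exp \{ y_{i} θ_{i} - \log (1 + e^{θ_{i}}) \},
  \qquad θ_{i} = \log \tfrac{p_{i}}{1 - p_{i}},
\end{aligned}
```

since $1 - p_{i} = 1 / (1 + e^{θ_{i}})$. Differentiating $b (θ) = \log (1 + e^{θ})$ gives $b' (θ_{i}) = e^{θ_{i}} / (1 + e^{θ_{i}}) = p_{i}$, the mean of $y_{i}$.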

1. Construction of the candidate models:

With the notation $y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}$ and $x_{i} = {(x_{i 1}, x_{i 2}, \dots, x_{i p}, x_{i p + 1}, \dots, x_{i d})}^{T}$, assume that the digital footprint variables divide the covariates into $K$ candidate models and that $K$ is finite. For any $k = 1, \dots, K$, on the dataset $\{(y_{i}, x_{i j}), j \in Δ_{k}\}$, that is, for the $k$th candidate model $M_{k}$, we model the canonical parameter $θ_{i}^{k}$ as $x_{i (k)}^{T} β_{(k)}$, so that the density function is

$$f (y_{i} | θ_{i}^{k}, 1) = \exp \{y_{i} θ_{i}^{k} - \log (1 + e^{θ_{i}^{k}})\} = \exp \{y_{i} x_{i (k)}^{T} β_{(k)} - \log (1 + e^{x_{i (k)}^{T} β_{(k)}})\}, \quad i = 1, 2, \dots, n, \tag{7}$$

where $x_{i (k)} = Π_{k} x_{i}$ is a $d_{k}$-dimensional subvector of $x_{i}$, $Π_{k}$ is a $d_{k} \times d$ projection matrix consisting of 0 or 1, and $β_{(k)}$ is a $d_{k} \times 1$ regression parameter.

2. Parameter estimation:

Let the maximum likelihood estimate of $β$ under the $k$th candidate model be denoted by ${\hat{β}}_{(k)}$:

$$\begin{aligned} {\hat{β}}_{(k)} &= \arg \max_{β_{(k)}} \log \prod_{i = 1}^{n} f (y_{i} | θ_{i}^{k}, 1) \\ &= \arg \max_{β_{(k)}} \sum_{i = 1}^{n} \log [\exp \{y_{i} θ_{i}^{k} - \log (1 + e^{θ_{i}^{k}})\}] \\ &= \arg \max_{β_{(k)}} \sum_{i = 1}^{n} \{y_{i} x_{i (k)}^{T} β_{(k)} - \log (1 + e^{x_{i (k)}^{T} β_{(k)}})\}, \end{aligned} \tag{8}$$

where $x_{i (k)} = {(x_{i j}, j \in Δ_{k})}^{T}$. Note that, viewed as an estimate of the full vector $β$, the elements outside $Δ_{k}$ are restricted to 0.

Let the weight vector be $ω = {(ω_{1}, ω_{2}, \dots, ω_{K})}^{T}$, restricted to the following set of weights:

$$W = \{ω \in {[0, 1]}^{K} : \sum_{k = 1}^{K} ω_{k} = 1\} . \tag{9}$$

Let $Π_{k}$ be the $d_{k} \times d$ projection matrix whose elements are all 0 or 1, such that $Π_{k}^{T} β_{(k)}$ is a $d \times 1$ vector whose nonzero elements are exactly the values of $β_{(k)}$. Then the weighted estimate of the regression coefficient $β$ is expressed as:

$$\hat{β} (ω) = \sum_{k = 1}^{K} ω_{k} Π_{k}^{T} {\hat{β}}_{(k)} . \tag{10}$$

Also, for a new sample $x^{*} = {(x_{1}^{*}, x_{2}^{*}, \dots, x_{d}^{*})}^{T}$, the log-odds prediction is ${\hat{θ}}^{*} (ω) = x^{* T} \hat{β} (ω)$.
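The role of the projection matrix and the weighted combination in Eq. (10) can be made concrete with a small sketch. The helper names (`projection_matrix`, `embed`, `averaged_beta`, `log_odds`) and the toy numbers are hypothetical; column indices are 0-based here, unlike the 1-based indices in the text.

```python
def projection_matrix(delta_k, d):
    """d_k x d matrix of 0/1 entries that selects the columns listed in delta_k."""
    return [[1 if col == j else 0 for col in range(d)] for j in delta_k]


def embed(pi_k, beta_k):
    """Pi_k^T beta_(k): a length-d vector with zeros outside delta_k."""
    d = len(pi_k[0])
    return [sum(pi_k[r][c] * beta_k[r] for r in range(len(beta_k)))
            for c in range(d)]


def averaged_beta(weights, pis, betas):
    """Eq. (10): beta_hat(omega) = sum_k omega_k * Pi_k^T beta_hat_(k)."""
    d = len(pis[0][0])
    out = [0.0] * d
    for w_k, pi_k, beta_k in zip(weights, pis, betas):
        for c, value in enumerate(embed(pi_k, beta_k)):
            out[c] += w_k * value
    return out


def log_odds(x_star, beta):
    """Log-odds prediction x*^T beta_hat(omega) for a new sample."""
    return sum(a * b for a, b in zip(x_star, beta))


# Toy example with d = 3 variables and K = 2 candidate models.
pis = [projection_matrix([0, 1], 3), projection_matrix([0, 2], 3)]
betas = [[1.0, 2.0], [3.0, -1.0]]
beta_avg = averaged_beta([0.25, 0.75], pis, betas)
```

In the toy example, the first variable appears in both candidate models, so its averaged coefficient blends both estimates, while the other two variables are shrunk by the weight of the single model that contains them.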

3. Calculation of the KL loss for the model-averaging approach to logistic regression:

Define $X = {(X_{1}, X_{2}, \dots, X_{n})}^{T}$, assumed to have full column rank, $y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}$, and $θ = {(θ_{1}, θ_{2}, \dots, θ_{n})}^{T}$, and let $θ_{0}$ be the true value of $θ$. The model-averaging estimate of $θ$ is then

$$θ \{\hat{β} (ω)\} = {[θ_{1} \{\hat{β} (ω)\}, θ_{2} \{\hat{β} (ω)\}, \dots, θ_{n} \{\hat{β} (ω)\}]}^{T} = X \hat{β} (ω) . \tag{11}$$

For the generalized linear model, we define the weight selection criterion as

$$γ (ω) = 2 ϕ^{- 1} B \{\hat{β} (ω)\} - 2 ϕ^{- 1} y^{T} θ \{\hat{β} (ω)\} + λ_{n} ω^{T} p, \tag{12}$$

where $B \{\hat{β} (ω)\} = \sum_{i = 1}^{n} b [θ_{i} \{\hat{β} (ω)\}] = \sum_{i = 1}^{n} \log (1 + e^{θ_{i} \{\hat{β} (ω)\}})$, $λ_{n} ω^{T} p$ is the penalty term, $p = {(p_{1}, p_{2}, \dots, p_{K})}^{T}$, $p_{k}$ represents the number of columns of the matrix $X$ used under the $k$th model, and $λ_{n}$ is the tuning parameter. This criterion is derived from the KL loss with an added penalty term and can also be formulated by penalizing the negative log-likelihood function. To see this, let $θ_{0} = {(θ_{0, 1}, θ_{0, 2}, \dots, θ_{0, n})}^{T}$ denote the true value of $θ$, let $y^{*} = {(y_{1}^{*}, \dots, y_{n}^{*})}^{T}$ be drawn from the distribution $f (\cdot | θ_{0}, 1)$ independently of $y$, and write $μ = E (y)$ and $B_{0} = \sum_{i = 1}^{n} b (θ_{0, i}) = \sum_{i = 1}^{n} \log (1 + e^{θ_{0, i}})$. The KL loss of $θ \{\hat{β} (ω)\}$ is

$$\begin{aligned} K L (ω) &= 2 E_{y^{*}} \{\log f (y^{*} | θ_{0}, ϕ) - \log f [y^{*} | θ \{\hat{β} (ω)\}, ϕ]\} \\ &= 2 \sum_{i = 1}^{n} E_{y^{*}} \{\log f (y_{i}^{*} | θ_{0, i}, 1) - \log f [y_{i}^{*} | θ_{i} \{\hat{β} (ω)\}, 1]\} \\ &= 2 J (ω) - 2 ϕ^{- 1} B_{0} + 2 ϕ^{- 1} μ^{T} θ_{0} \\ &= 2 \sum_{i = 1}^{n} [\log (1 + e^{θ_{i} \{\hat{β} (ω)\}}) - μ_{i} θ_{i} \{\hat{β} (ω)\}] + 2 μ^{T} θ_{0} - 2 \sum_{i = 1}^{n} \log (1 + e^{θ_{0, i}}), \end{aligned} \tag{13}$$

where $J (ω) = ϕ^{- 1} B \{\hat{β} (ω)\} - ϕ^{- 1} μ^{T} θ \{\hat{β} (ω)\}$ and the last line uses $ϕ = 1$.

4. Selection of the optimal weight vector $ω$:

With $ϕ = 1$, the weight selection criterion (12) becomes

$$ς (ω) = 2 \sum_{i = 1}^{n} [\log (1 + e^{θ_{i} \{\hat{β} (ω)\}}) - y_{i} θ_{i} \{\hat{β} (ω)\}] + λ_{n} \sum_{k = 1}^{K} ω_{k} p_{k}, \tag{14}$$

where the value of $λ_{n}$ follows the setting of Zhang et al. [41] and is taken as 2 or $\log (n)$. The optimal weight vector is therefore specified as

$$\hat{ω} = \arg \min_{ω \in W} ς (ω) . \tag{15}$$
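One way to carry out the minimization over the weight set $W$ is sketched below. This is not the optimizer used by the authors or by Zhang et al. [41], whose solver is not described here. It swaps in an exponentiated-gradient (mirror descent) update, which keeps the weights nonnegative and summing to one by construction. The criterion is written with the observed $y_{i}$, as in Eq. (12) with $ϕ = 1$. The function names, the input `theta_cand` (an $n \times K$ table whose entry $(i, k)$ is the fitted linear predictor of candidate model $k$ for sample $i$), and the step-size schedule are all assumptions.

```python
import math


def _softplus(t):
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def _sigmoid(t):
    return math.exp(-_softplus(-t))


def criterion(omega, theta_cand, y, p, lam_n=2.0):
    """Penalized criterion: 2 * sum_i [log(1 + e^theta_i) - y_i * theta_i] + lam_n * omega . p."""
    total = 0.0
    for row, y_i in zip(theta_cand, y):
        theta_i = sum(w * t for w, t in zip(omega, row))
        total += _softplus(theta_i) - y_i * theta_i
    return 2.0 * total + lam_n * sum(w * p_k for w, p_k in zip(omega, p))


def gradient(omega, theta_cand, y, p, lam_n=2.0):
    grad = [lam_n * p_k for p_k in p]
    for row, y_i in zip(theta_cand, y):
        theta_i = sum(w * t for w, t in zip(omega, row))
        resid = _sigmoid(theta_i) - y_i
        for k, t in enumerate(row):
            grad[k] += 2.0 * resid * t
    return grad


def select_weights(theta_cand, y, p, lam_n=2.0, n_iter=500, step=0.5):
    """Exponentiated-gradient search on the simplex; returns the best iterate."""
    K = len(p)
    omega = [1.0 / K] * K
    best, best_val = omega[:], criterion(omega, theta_cand, y, p, lam_n)
    for t in range(n_iter):
        g = gradient(omega, theta_cand, y, p, lam_n)
        g_min = min(g)
        scale = max(g_k - g_min for g_k in g)
        if scale == 0.0:
            break  # all directions equally good: stationary on the simplex
        eta = step / math.sqrt(t + 1.0)
        raw = [w * math.exp(-eta * (g_k - g_min) / scale)
               for w, g_k in zip(omega, g)]
        total = sum(raw)
        omega = [r / total for r in raw]
        val = criterion(omega, theta_cand, y, p, lam_n)
        if val < best_val:
            best, best_val = omega[:], val
    return best, best_val
```

Because the criterion is convex in $ω$ (the linear predictor is linear in $ω$ and $\log (1 + e^{θ})$ is convex), any standard constrained convex solver could be substituted for this loop.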

4. Empirical Analysis

4.1. Dataset

The dataset was collected from a joint lending business between a commercial bank and an internet-based commerce platform in China, spanning 4 August 2019 to 31 March 2021. The data are generated from the online application information that users fill out while browsing and shopping on the platform, as well as the digital footprints formed during lending.

The objective of this study was to assess whether customers would default, framed as a binary classification problem. There is no universally accepted standard for defining the default threshold; theoretically, a customer is considered to have defaulted if they fail to fulfill their repayment obligation by the agreed-upon date. In practice, different studies often set varied grace periods based on their specific objectives. Berg et al. [11] define a customer as being in default if they fail to make payment after three reminders. Our platform follows a somewhat similar approach. On this platform, customers are initially reminded to repay via SMS and app notifications. If the customer fails to make a payment within three days of the due date, the platform will initiate collection efforts via phone calls. Therefore, in this study, customers who have not repaid after receiving SMS and app notifications are classified as defaulters, with a default indicator coded as 1 and nondefaulters coded as 0.

The first step of preprocessing handled missing data. Specifically, if the missing data rate for a customer exceeds 50%, all records for that customer are excluded from the dataset. This threshold was set to ensure that the data we analyzed were of sufficient quality and usability. After excluding customers with substantial missing values, we conducted variable selection with the lasso algorithm to mitigate collinearity. From the 75 variables, we obtained 16 traditional credit variables, such as loan amount, platform interest rate, and number of loan periods, as well as 16 digital footprint variables, such as platform consumption power level, consumption frequency level, and consumption scene level. Figure 1 shows the lasso coefficient paths, that is, the relationship between the value of $λ$ and the regression coefficients. As $λ$ increases, the regression coefficients shrink toward zero. The study includes 75 variables, resulting in 75 lines of different colors; each curve traces the coefficient of one independent variable. In Figure 2, the black dotted line on the right denotes one standard deviation above the minimum $λ$. At this value of $λ$, the model demonstrates a good fit while incorporating fewer variables, resulting in a more parsimonious model. Using this value of $λ$ as the cutoff point leads to the selection of 32 variables, which are employed as independent variables in the prediction model.
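The choice of $λ$ described above resembles the widely used one-standard-error rule: among all values of $λ$ whose cross-validation error lies within one standard error of the minimum error, take the largest (most parsimonious) one. The sketch below assumes that this is the rule behind Figure 2, which the text does not state explicitly. The function name `lambda_one_se` and the example numbers are hypothetical.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation error curve.

    lambdas : candidate regularization values
    cv_mean : mean cross-validation error for each candidate
    cv_se   : standard error of that mean for each candidate
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Largest lambda whose error is still within one standard error of the minimum.
    eligible = [lam for lam, m in zip(lambdas, cv_mean) if m <= threshold]
    return lambdas[i_min], max(eligible)


# Made-up cross-validation curve for illustration only.
lam_min, lam_1se = lambda_one_se(
    lambdas=[0.001, 0.01, 0.1, 1.0],
    cv_mean=[0.55, 0.50, 0.51, 0.60],
    cv_se=[0.02, 0.02, 0.02, 0.02],
)
```

In the made-up curve, the error is lowest at 0.01, but 0.1 is within one standard error of that minimum, so the rule prefers the larger value and therefore a sparser model.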

The second step is pairwise integration of the original data, resulting in 79,715 samples. We randomly assigned 56,430 samples to the training set and 23,285 samples to the test set, an approximate ratio of 7:3.

For details, please refer to Table 1. We categorize the model variables into the basic variable groups $A_{1}$ and $A_{2}$, which contain loan information and bank credit, and customer characteristics, respectively, and the digital footprint variable groups $B_{1}$, $B_{2}$, and $B_{3}$, which contain consumption performance, transactions, and activity, respectively. In the Value Categories section, we outline the segmentation for each variable, with the specific criteria provided by the platform. Missing values are categorized as ‘Not Found = 0’.

We now have 56,430 training samples, each with a 32-dimensional independent variable and a dependent variable $Y$. The set of all independent variables is defined as $D = \{X_{1}, X_{2}, \dots, X_{16}, X_{17}, \dots, X_{32}\}$. From the above data description, we do not know which independent variables contribute more, so we group them based on the characteristics of the digital footprints and finally divide them into $K = 8$ groups for prediction. Detailed information regarding the specific groupings can be found in Table 2.
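One natural way to arrive at eight candidate models is to keep the basic groups in every model and add each possible subset of the three digital footprint groups, since $2^{3} = 8$. The sketch below enumerates these combinations under that assumption; Table 2 remains the authoritative definition of the groupings, and the ordering produced here is not claimed to match it.

```python
from itertools import combinations

BASE_GROUPS = ("A1", "A2")             # basic variable groups, kept in every model
FOOTPRINT_GROUPS = ("B1", "B2", "B3")  # digital footprint variable groups


def candidate_models(base=BASE_GROUPS, extra=FOOTPRINT_GROUPS):
    """List every model formed by the base groups plus a subset of the extra groups."""
    models = []
    for size in range(len(extra) + 1):
        for subset in combinations(extra, size):
            models.append(base + subset)
    return models


models = candidate_models()
```

The first entry contains no digital footprint group and the last contains all three, which corresponds to the two reference models without and with all digital footprints used in the comparisons.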

We compare the OPT method for generalized linear models with several popular model selection methods.

AIC and BIC, two prevalent information criteria, are defined as

$$\begin{aligned} A I C_{k} &= - 2 \log L_{k} + 2 l_{k}, \\ B I C_{k} &= - 2 \log L_{k} + l_{k} \log n, \end{aligned} \tag{16}$$

where $L_{k}$ and $l_{k}$ are the maximized likelihood and the number of unknown parameters for model $k$, respectively, and $n$ is the sample size. For model averaging, Buckland et al. [39] introduced the smoothed AIC (S-AIC) and smoothed BIC (S-BIC) methods, with combined weights

$$ω_{x I C, s} = \exp (- x I C_{s} / 2) / \sum_{s = 1}^{S} \exp (- x I C_{s} / 2), \tag{17}$$

where $s$ stands for the $s$th model, $ω_{x I C, s}$ is its weight, and $x I C$ denotes AIC or BIC. S-AIC and S-BIC are currently the most used weight selection methods in frequentist model averaging due to their simplicity.
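The smoothed weights in Eq. (17) are straightforward to compute. The sketch below subtracts the smallest criterion value before exponentiating, which leaves the weights unchanged but avoids underflow when the criteria are large. The function names and example values are hypothetical.

```python
import math


def aic(log_lik, n_params):
    """AIC = -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n):
    """BIC = -2 * log-likelihood + number of parameters * log(n)."""
    return -2.0 * log_lik + n_params * math.log(n)


def smoothed_weights(ic_values):
    """Eq. (17): weights proportional to exp(-xIC_s / 2)."""
    best = min(ic_values)
    raw = [math.exp(-(value - best) / 2.0) for value in ic_values]
    total = sum(raw)
    return [r / total for r in raw]


weights = smoothed_weights([100.0, 102.0, 110.0])
```

A difference of 2 in the criterion corresponds to a weight ratio of $e^{- 1}$, so models only a few units worse than the best one quickly receive negligible weight.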

Furthermore, in order to better analyze the combination of weights and digital footprint variables from the generalized linear model-averaging approach, we added separate generalized linear fits for the subgroups containing no digital footprint and the subgroups containing all the digital footprints to the model-averaging comparisons.

4.2. Model Performance Evaluation

In the empirical analysis, we first preprocessed the dataset, which included eliminating samples and variables with serious missing data and screening variables with the lasso algorithm. Eventually, we obtained a sample set containing 32 independent variables and 1 dependent variable, where the independent variables were categorized into a basic variable group and a digital footprint variable group.

In credit risk assessment, the output of the classifier is typically a probability score representing the likelihood of a default event. This score is compared to a predefined threshold to assign a predicted label to each sample. The comparison between the predicted labels and the actual labels produces a confusion matrix. Based on Table 3, five evaluation metrics can be calculated: accuracy, specificity, sensitivity, precision, and F-score. Accuracy is defined as the ratio of correctly predicted samples to the total number of samples. The formula for accuracy (ACC) is as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FN + FP + TN} . \tag{18}$$

Specificity reflects the model’s ability to correctly identify negative samples; that is, it does not misclassify negative samples as positive. The formula for specificity is as follows:

$$\mathrm{Specificity} = \frac{TN}{TN + FP} . \tag{19}$$

Sensitivity, also known as recall, reflects the model’s ability to identify positive samples; that is, it does not miss any true positive samples. The formula for sensitivity is as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} . \tag{20}$$

Precision reflects the proportion of actual positive cases among those samples predicted as positive by the model. The formula for precision is as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} . \tag{21}$$

Specificity, sensitivity, and precision can only assess the prediction performance for a specific class label. Model accuracy and completeness are assessed using the F-score, a balancing act between precision and sensitivity. It is a weighted harmonic average, particularly suitable for imbalanced datasets. The formula for F-score is as follows:

\mathrm{F\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} = \frac{2 \times TP}{2 \times TP + FN + FP} .

(22)
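
As an illustration of Equations (18)–(22), the following minimal Python sketch thresholds probability scores, counts the cells of the confusion matrix in Table 3, and computes the five metrics. The function names, the threshold of 0.5, and the toy labels and scores are assumptions made for this example and are not taken from the study’s data. Specificity is computed with the standard definition TN / (TN + FP).

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count TP, FN, FP, TN, with 1 = defaulter (positive) and 0 = nondefaulter."""
    tp = fn = fp = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0  # assumed rule: score >= threshold means default
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, specificity, sensitivity, precision, F-score.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "specificity": tn / (tn + fp),
        "sensitivity": sensitivity,
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Toy data (hypothetical, for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]

counts = confusion_counts(y_true, scores)
result = metrics(*counts)
print(counts)
print(result)
```

Changing the threshold changes the confusion matrix and therefore every metric above, which is why a threshold-free summary such as the AUC is reported alongside them.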

The area under the receiver operating characteristic (ROC) curve, or AUC, is a metric for evaluating classification models, especially in binary classification problems. The ROC curve plots sensitivity against 1 − specificity as the classification threshold applied to the model’s output probabilities is varied. The AUC summarizes the overall performance of the model, with higher values indicating better performance. Because it does not depend on a single threshold, the AUC offers a comprehensive performance assessment that is particularly suitable for imbalanced datasets, and it serves as a reference evaluation metric in this paper.
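
The AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The sketch below computes it directly from that pairwise definition. It is an illustrative implementation on hypothetical toy data, not the routine used in the study, and its quadratic cost makes it suitable only for small examples.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]
auc_value = auc(y_true, scores)
print(auc_value)
```

In this toy example, 15 of the 16 positive-negative pairs are ordered correctly, so the function returns 0.9375.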

We compared several model selection and model-averaging techniques: AIC, S-AIC, BIC, S-BIC, and OPT, the optimal weighting based on the KL loss. The comparison models also include Logistics1, the credit-scoring model that does not incorporate digital footprint data, and Logistics8, which includes all digital footprint variables without subgrouping (see Table 2). Comparing the prediction outcomes of these methods, we found that the OPT method demonstrated superior performance on the evaluation metrics AUC, ACC, and F-score.

Specifically, as shown in Table 4, the OPT method achieves an AUC of 0.7740, an ACC of 0.7495, and an F-score of 0.7911, the highest values among the compared methods. These results indicate that the model is better at distinguishing credit risk. The model that includes digital footprint variables also shows an advantage over Logistics1, which excludes them (AUC 0.7600, ACC 0.7370, F-score 0.7620). This further supports treating digital footprints as a reasonable and necessary factor in credit assessment.
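
For readers unfamiliar with the smoothed criteria among the comparison methods, the sketch below shows the usual smoothed-AIC weighting, in which candidate model k receives weight exp(−AIC_k/2) divided by the sum of exp(−AIC_j/2) over all candidates. S-BIC is obtained in the same way with BIC values. The criterion values in the example are hypothetical, and the exact implementation used in the study may differ.

```python
import math


def smoothed_ic_weights(ic_values):
    """Smoothed information-criterion weights: w_k proportional to exp(-IC_k / 2).

    The minimum is subtracted first for numerical stability; this does not
    change the normalized weights.
    """
    best = min(ic_values)
    raw = [math.exp(-(ic - best) / 2.0) for ic in ic_values]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical AIC values for three candidate models (not from the paper).
aic_values = [1000.0, 998.0, 1004.0]
weights = smoothed_ic_weights(aic_values)
print([round(w, 4) for w in weights])
```

Plain AIC or BIC selection corresponds to placing all weight on the candidate with the smallest criterion value, whereas the smoothed versions spread weight across candidates according to how close their criterion values are.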

4.3. Validity of Digital Footprint Variables

This section analyzes the model parameters to determine whether digital footprint variables provide valuable information in credit risk assessment models. A comparison of credit risk predictions with and without digital footprints shows that incorporating digital footprints enhances model accuracy. To determine the optimal model weights, this study minimizes the KL loss, which also elucidates the relative contributions of the candidate models to credit risk prediction.

\begin{array}{l} \hat{β} (ω) = \sum_{k = 1}^{8} ω_{k} {\hat{β}}_{(k)} \\ = 0.0243 \times {\hat{β}}_{(1)} + 0.0076 \times {\hat{β}}_{(2)} + 0.0373 \times {\hat{β}}_{(3)} + 0.0001 \times {\hat{β}}_{(4)} + 0.3192 \times {\hat{β}}_{(5)} + 0.0005 \times {\hat{β}}_{(6)} + 0.0772 \times {\hat{β}}_{(7)} + 0.5339 \times {\hat{β}}_{(8)} . \end{array}

(23)

Based on Equation (10) in Section 3.2, we obtain Equation (23). The results indicate that $\hat{β}_{(8)}$ and $\hat{β}_{(5)}$ carry the highest weights, 0.5339 and 0.3192, respectively. Candidate model 5 combines the consumption performance and transaction variables with the basic variables (Table 2), so its high weight suggests that this combination has a more significant impact. The weight of $\hat{β}_{(8)}$ is the highest, further indicating that e-commerce digital footprints related to consumption performance, transactions, and activity significantly improve the accuracy of default prediction, with consumption performance and transactions being particularly influential.
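
The weighted combination in Equation (23) can be sketched in a few lines of Python. The weights below are those reported in Equation (23). The coefficient values are hypothetical placeholders rather than estimates from the study: each candidate model is assumed to contribute one shared coefficient and one coefficient for a variable from group B₁, which is set to zero in candidate models that exclude B₁ according to Table 2. The logistic link in `default_probability` is likewise an assumption made for illustration.

```python
import math

# Weights of the eight candidate models as reported in Equation (23).
WEIGHTS = [0.0243, 0.0076, 0.0373, 0.0001, 0.3192, 0.0005, 0.0772, 0.5339]

# Whether candidate models 1..8 include group B1 (read from Table 2).
INCLUDES_B1 = [False, True, False, False, True, True, False, True]

# Hypothetical coefficients [shared variable, B1 variable]; NOT the paper's estimates.
# A variable excluded from a candidate model has coefficient 0 in that model.
BETAS = [[1.0, 0.5 if has_b1 else 0.0] for has_b1 in INCLUDES_B1]


def average_coefficients(weights, betas):
    """Weighted sum of candidate coefficient vectors: sum_k w_k * beta_(k)."""
    n_coef = len(betas[0])
    return [sum(w * b[j] for w, b in zip(weights, betas)) for j in range(n_coef)]


def default_probability(beta, x):
    """Logistic transform of the linear predictor x' beta (assumed link)."""
    linear = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


beta_avg = average_coefficients(WEIGHTS, BETAS)
prob = default_probability(beta_avg, [1.0, 1.0])
print(beta_avg, prob)
# The reported weights sum to 1.0001 because they are rounded to four decimals.
print(sum(WEIGHTS))
```

Under these placeholder values, the averaged B₁ coefficient is 0.5 × (0.0076 + 0.3192 + 0.0005 + 0.5339) = 0.4306: a variable that appears only in some candidate models is shrunk in proportion to the total weight of the models that contain it.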

Furthermore, the experimental results demonstrate that the coefficient estimates of the shared variables ($β$) remain relatively stable across different models, while variations in the weights ($ω$) primarily arise from the inclusion of nonshared variables. According to this study, digital footprint variables can be leveraged to assess credit risk effectively. Therefore, digital footprint variables not only enrich the variable set of credit assessment models but also enhance the model’s predictive capability.

This study examines the combined application of traditional and nontraditional data in credit risk assessment. By analyzing comprehensive account data from 56,430 borrowers, we categorized borrower information into five main groups: loan information and bank credit, customer characteristics, consumption performance, transaction, and activity. We then selected the first two groups, loan information and bank credit and customer characteristics, as representatives of traditional data and used them as benchmarks to evaluate the effectiveness of the three categories of e-commerce digital footprints (consumption performance, transaction, and activity) in default prediction.

Additionally, this paper demonstrates that using digital footprints in fintech lending allows the construction of digital credit profiles for individuals lacking credit history and provides more accurate credit assessments for borrowers whose credit quality is underestimated by traditional systems. This finding is crucial for expanding credit services and improving the accuracy of credit assessments.

5. Conclusions and Future Research

To handle diverse and complex digital footprint variables, the credit-scoring model proposed in this study first identifies key variables using lasso-logistic regression. Variables are then grouped according to business understanding, and candidate models are constructed from various combinations of these groups. The optimal weights for the final model combination are determined through generalized linear model averaging based on the KL loss. This method helps financial institutions identify high-risk customers more accurately, thereby reducing default risk. The variable selection and digital footprint integration methods presented here offer a valuable reference for banks and have the potential to enhance the predictive accuracy of credit assessment.

This study integrates digital footprint variables with traditional credit assessment data, creating a more comprehensive risk assessment framework. The findings demonstrate that digital footprints significantly enhance the predictive accuracy of borrowers’ credit risks for financial institutions. Additionally, the application of model-averaging methods, particularly when handling digital footprint data, has shown its advantages in improving predictive accuracy. Our study identified that these variables make a substantial contribution to predicting defaults, providing a novel perspective for the financial technology sector. In the context of inadequate coverage by traditional credit reporting systems, the application of digital footprints facilitates the expansion of credit services and enhances the accuracy of credit assessments.

The grouping strategy in this study reveals significant differences among various digital footprints. Combining different digital footprints allows for the creation of credit-scoring models that are more detailed and personalized. Banks can utilize this information for customer segmentation and for devising more precise risk management strategies. For instance, they can offer personalized credit products and services based on customers’ spending habits and recommend products specifically to high-value customers. Prioritizing the collection of variables with significant weights in digital footprint analysis helps reduce costs for financial institutions. This approach also aids in identifying potential high-value customers.

By integrating multiple sources of digital footprint data, the model can be extended to incorporate various types of multidimensional data. For example, analyzing online shopping patterns, social media interactions, and payment histories allows for a comprehensive understanding of an individual’s credit status. This integrated approach can reveal insights overlooked by traditional credit-scoring methods, helping credit-invisible individuals to access consumer credit and improving credit assessments for borrowers whose credit quality has been underestimated.

Despite the progress made in applying digital footprints to credit risk assessment, there is still room for further research. One direction is to integrate a variety of unstructured data types (such as voice, text, images, and videos) to create more personalized and comprehensive individual credit profiles within a big data context. Another is real-time credit risk monitoring using continuously updated digital footprint data, which technological advancements make increasingly essential. Future research should explore leveraging digital footprints for real-time monitoring and for enhancing the intelligence and personalization of credit-scoring models in fintech, benefiting both financial institutions and borrowers.

Author Contributions

Conceptualization, L.W. and C.Z.; methodology, L.W.; software, L.W.; validation, L.W., C.Z. and J.Z.; formal analysis, L.W.; investigation, L.W.; resources, L.W.; data curation, Z.Z.; writing—original draft preparation, L.W.; writing—review and editing, L.W.; visualization, L.W.; supervision, C.Z. and J.Z.; project administration, C.Z. and J.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Office for Philosophy and Social Sciences of China, grant number 23BTJ044.

Data Availability Statement

The data supporting the findings of this study are not publicly available due to privacy/ethical restrictions.

Acknowledgments

We are grateful to the reviewers and the editor for their helpful comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Variable coefficient path diagram.

Figure 2. Cross-validation curve.

Table 1. Description of variables.

Groups	Variables	Description	Value Categories
Loan information and bank credit $A_{1}$	$X_{1}$	Loan amount (RMB)	[99, 40,000]
	$X_{2}$	Term	Repayment term selected by the borrower, ranging from 1 to 12 months.
	$X_{3}$	Penalty interest rate	[0.00045, 0.00097]
	$X_{4}$	Platform interest rates	[0.000277, 0.00065]
	$X_{5}$	Total Credit Limit of Circulating Card (RMB)	(Low, 10,000); [10,000, 30,000); [30,000, 50,000); [50,000, 100,000); [100,000, +∞)
	$X_{6}$	Maximum Month on Book	The longest time period between the account or loan opening date and the current date.
	$X_{7}$	Current highest overdue status	Write-off; Bad Debt; Payment Stopped; Frozen; Account Closed; Normal; Not Activated
	$X_{8}$	Number of loans	0; [1, 3); [3, 7); [7, 12); [12, +∞)
	$X_{9}$	Total Credit Limit (RMB)	(Low, 10,000); [10,000, 30,000); [30,000, 50,000); [50,000, 100,000); [100,000, +∞)
	$X_{10}$	The longest months on book for loan accounts	[0, 6]; (6, 12]; (12, 24); [24, +∞)
	$X_{11}$	Number of CC approval reason inquiries in the last 1 month	[0, 1]; [2, 3]; [4, 5]; [6, 10]; (10, +∞)
Customer Characteristics $A_{2}$	$X_{12}$	Age	[0, 22]; (22, 40]; (40, +∞)
	$X_{13}$	Gender	Male; Female
	$X_{14}$	Marital status	Married; Not Married
	$X_{15}$	Number of active cities in the last 90 days	[0, 1]; (1, 3]; (3, +∞)
	$X_{16}$	Length of registration (days)	[0, 90]; (90, 180]; (180, 360]; (360, 720]; (720, +∞)
Consumer performance $B_{1}$	$X_{17}$	Platform Consumption Power Levels	[0, 30); [30, 70); [70, 90); [90, +∞)
	$X_{18}$	Consumption Frequency Levels	[0, 3); [3, 6); [6, 10); [10, +∞)
	$X_{19}$	Consumption Scenario Level	[0, 2); [2, 4); [4, +∞)
Transactions $B_{2}$	$X_{20}$	User’s Number of Transactions in 360 Days	[0, 2]; (2, 16]; (16, +∞)
	$X_{21}$	User’s Successful Transaction Amount in 180 Days (RMB)	[0, 50]; (50, 400]; (400, +∞)
	$X_{22}$	User’s Successful Transaction Amount in 90 Days (RMB)	[0, 25]; (25, 200]; (200, +∞)
	$X_{23}$	User’s Number of Successful Transactions in 360 Days	[0, 1]; (1, 14]; (14, +∞)
	$X_{24}$	User’s Number of Successful Transactions in 90 Days	[0, 1]; (1, 4]; (4, +∞)
	$X_{25}$	User’s Number of Successful Food Delivery Transactions in 180 Days	[0, 2]; (2, 11]; (11, +∞)
	$X_{26}$	User’s Number of Successful Food Delivery Transactions in 360 Days	[0, 2]; (2, 15]; (15, +∞)
	$X_{27}$	User’s Number of Successful Food Delivery Transactions in 90 Days	[0, 2]; (2, 8]; (8, +∞)
Activity $B_{3}$	$X_{28}$	Number of Channels Visited by User in 90 Days	[0, 4]; (4, 7]; (7, +∞)
	$X_{29}$	Number of Days User Visited in 360 Days	[0, 6]; (6, 46]; (46, +∞)
	$X_{30}$	Number of Days User Visited in 90 Days	[0, 4]; (4, 19]; (19, +∞)
	$X_{31}$	Number of Days User Visited Food Delivery Service in 180 Days	[0, 5]; (5, 30]; (30, +∞)
	$X_{32}$	Annual Active User Tag	Active Customer; Inactive Customer
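
Most variables in Table 1 are recorded as interval categories. As a minimal sketch of how a raw value could be mapped to such a category, the following Python snippet bins age into the right-closed intervals listed for $X_{12}$. The function name and the use of `bisect_left` are choices made for this example, and the study’s actual preprocessing code is not described in the paper.

```python
from bisect import bisect_left

# Upper edges and labels of the age categories listed for X12 in Table 1.
AGE_UPPER_EDGES = [22, 40]
AGE_LABELS = ["[0, 22]", "(22, 40]", "(40, +inf)"]


def bin_right_closed(value, upper_edges, labels):
    """Return the label of the right-closed interval that contains `value`.

    bisect_left returns the index of the first edge >= value, so a value equal
    to an edge falls into the interval that ends at that edge.
    """
    return labels[bisect_left(upper_edges, value)]


ages = [22, 23, 40, 41]
age_bins = [bin_right_closed(a, AGE_UPPER_EDGES, AGE_LABELS) for a in ages]
print(age_bins)
```

Variables whose categories are left-closed, such as $X_{17}$ with [0, 30); [30, 70), would need `bisect_right` on the interior edges instead, so that a value equal to an edge falls into the interval that starts there.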

Table 2. Results of the digital footprint variable grouping.

Candidate model	Y	A₁	A₂	B₁	B₂	B₃
1	*	*	*
2	*	*	*	*
3	*	*	*		*
4	*	*	*			*
5	*	*	*	*	*
6	*	*	*	*		*
7	*	*	*		*	*
8	*	*	*	*	*	*

* Indicates that the variable group is included in the candidate model.
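
The eight rows of Table 2 correspond to keeping the two basic groups in every candidate model and adding each subset of the three digital footprint groups. The short Python sketch below enumerates these combinations in the same order as the table. The group names are plain-text stand-ins for A₁, A₂, B₁, B₂, and B₃, and the function name is an assumption made for this example.

```python
from itertools import combinations

BASE_GROUPS = ("A1", "A2")               # traditional variable groups, always included
FOOTPRINT_GROUPS = ("B1", "B2", "B3")    # digital footprint groups, optionally included


def candidate_models():
    """List the variable groups of each candidate model, ordered by subset size."""
    models = []
    for size in range(len(FOOTPRINT_GROUPS) + 1):
        for subset in combinations(FOOTPRINT_GROUPS, size):
            models.append(BASE_GROUPS + subset)
    return models


models = candidate_models()
for k, groups in enumerate(models, start=1):
    print(k, groups)
```

With three optional groups there are 2³ = 8 candidate models. The first contains only the basic groups and the last contains all five groups, matching rows 1 and 8 of the table.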

Table 3. Confusion matrix.

	Predicted Positive	Predicted Negative
Real positive	TP	FN
Real negative	FP	TN

Note: TP (true positive) denotes the number of borrowers accurately identified by the model as defaulters. FN (false negative) denotes the number of defaulters that the model incorrectly classified as nondefaulters. FP (false positive) denotes the number of nondefaulters that the model incorrectly classified as defaulters. TN (true negative) denotes the number of borrowers accurately identified by the model as nondefaulters.

Table 4. Comparison of model results.

	AUC	ACC	F-Score
AIC	0.7640	0.6676	0.7732
BIC	0.7633	0.6621	0.7654
S-AIC	0.7638	0.6764	0.7632
S-BIC	0.7633	0.6621	0.7670
OPT	0.7740	0.7495	0.7911
Logistics1	0.7600	0.7370	0.7620
Logistics8	0.7601	0.7390	0.7812

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
