1. Introduction and Motivation
Credit scoring is an important technique used in many financial institutions in order to model the probability of default, or some other event of interest, of a potential client. For example, a bank typically has access to data sets containing information pertinent to credit risk, which may be used in order to assess the creditworthiness of potential clients. The characteristics or covariates recorded in such a data set are referred to as attributes throughout; these include information such as income, the total amount of outstanding debt held and the number of recent credit enquiries. A bank may use logistic regression to model an applicant’s probability of default as a function of their recorded attributes; these logistic regression models are referred to as credit risk scorecards. In addition to informing the decision as to whether or not a potential borrower is provided with credit, the scorecard is typically used to determine the quoted interest rate. For a detailed treatment of scorecards, see [1] as well as [2].
The development of credit risk scorecards is expensive and time-consuming. As a result, once a scorecard has been properly trained and validated, a bank may wish to keep it in use for an extended period, provided that the model continues to be a realistic representation of the attributes of the applicants in the population. One way to determine whether or not a scorecard remains a representative model is to test the hypothesis of population stability. This hypothesis states that the distribution of the attributes remains unchanged over time (i.e., that the distribution of the attributes at present is the same as the distribution observed when the scorecard was developed). When the distribution of the attributes changes, this provides the business with an early indication that the scorecard may no longer be a useful model. Further explanations and examples regarding population stability testing can be found in [3,4] as well as [5].
In the context of testing for population stability, performing scenario testing requires the ability to simulate realistic data sets. To this end, this paper proposes a simple technique for the simulation of such data sets. This enables practitioners to consider scenarios with predefined deviations from specified distributions for the attributes, which allows them to gauge the effects that changes in the distribution of one or more attributes have on the predictions made using the model. Furthermore, the business may also wish to consider the effects of a certain strategy before said strategy is implemented. As a concrete example, consider the case where a bank markets more aggressively to younger people. In this case, they may wish to test the effect of a shift in the distribution of the age of their clients.
The concept of population stability can be further illustrated by means of a simple example. Consider a model that predicts whether someone is wealthy based on a single attribute: the value of the property owned. If this attribute exceeds a specified value, the model predicts that the person is wealthy. Due to house price inflation, the overall prices of houses rise over time. Thus, after a substantial amount of time has passed, the data can no longer be interpreted in the same way as before, and the hypothesis of population stability is rejected, meaning that a new model (or perhaps just a new cut-off point) is required.
Population stability metrics measure the magnitude of the change in the distribution of the attributes over time. A number of techniques whereby population stability may be tested have been described in the literature; see [6,7] as well as [8]. For practical implementations of techniques for credit risk scorecards, see [9] in the statistical software R as well as [10] in Statistical Analysis Software (SAS). The papers mentioned typically provide one or more numerical examples illustrating the use of the proposed techniques. The data sets on which these techniques are used are typically protected by regulations, meaning that including examples based on the observed data is problematic. As a result, authors often use simulated data. However, the settings in which these examples are presented are often oversimplified, stylized and not entirely realistic. This can, at least in part, be ascribed to the difficulties associated with the simulation of realistic data sets; these difficulties arise from the complex relationship between the attributes and the response.
The data sets typically used for scorecard development have a number of features in common. They are usually relatively large; the number of observations typically ranges from one thousand to one hundred thousand, while a sample size of one million observations is not unheard of. The data used are multivariate; the number of attributes used varies according to the type of scorecard, the purpose for which the scorecard will be used and other factors, but scorecards based on five to fifteen attributes are common. The inclusion of an attribute in a scorecard depends on its predictive power as well as more practical considerations. These include the ability to obtain the required data in the future (for example, changing legislation may, in the future, prohibit the inclusion of certain attributes such as gender in the model) as well as the stability of the attribute over the expected lifetime of the scorecard. Care is usually taken to include only attributes with a low level of association with each other so as to avoid the problems associated with multicollinearity.
This paper proposes a simple simulation technique, which may be used for the construction of realistic data sets for use in credit risk scorecards. These data sets contain the attributes of hypothetical customers as well as the associated outcomes. The constructed data sets can be used to perform empirical investigations into the effects of changes in the distribution of the attributes as well as changes in the relationship between these attributes and the outcome. In summary, the advantages of the newly proposed simulation technique are:
It is a simple technique.
It allows the generation of realistic data sets.
These data sets can be used to perform scenario testing.
It should be noted at the outset that the proposed technique is not restricted to the context of credit scoring, or even to the case of logistic regression, but rather has a large number of other modeling applications. However, we restrict our attention to this important special case for the remainder of the paper.
The idea underlying the proposed simulation technique can be summarized as follows. When building a scorecard, practitioners cannot be expected to specify realistic values for the parameters in the model which will ultimately be used. The large number of parameters in the model, coupled with the complex relationships between these parameters, conspires to make this task almost impossible. However, practitioners can readily be called on to have intuition regarding the bad ratios associated with different states of an attribute. That is, practitioners are often comfortable making statements such as “on average, new customers are 1.5 times as likely to default as existing customers with similar attributes”. It should be noted that techniques such as the so-called Delphi method can be used in order to make statements such as these; for a recent reference, see [11].
This paper proposes a technique that can be used to choose parameter values that mimic these specified bad ratios. The inputs required for the proposed technique are the overall bad rate, the specified bad ratios and the marginal distributions of the attributes. It should be noted that the proposed technique can be used to generate data without reference to an existing data set. As such, it is not a data augmentation technique. However, in the event that a reference data set is available, data augmentation techniques can be implemented in order to achieve similar goals. An example of a data augmentation technique that can be implemented in this context is the so-called generative adversarial network; see [12]. Another useful reference on data augmentation is [13]. We emphasize that the newly proposed method can be used in cases where classical data augmentation techniques are not appropriate, as the new technique does not require the availability of a data set in order to perform a simulation. As a result, classical data augmentation techniques are not considered further in this paper.
A final noteworthy advantage of the newly proposed technique is its simplicity. Since not all users of scorecards are trained in statistics, the simple nature of the proposed simulation technique (i.e., specifying bad ratios and choosing parameters accordingly) is advantageous.
The remainder of the paper is structured as follows. Section 2 shows several examples of settings in which logistic regression is used in order to model the likelihood of an outcome based on attributes; here, we demonstrate the need for the proposed simulation procedure and specify a realistic setting which is used throughout the paper. Section 3 proposes a method that may be used to translate specified bad ratios into model parameters emulating these bad ratios using simulation, followed by parameter estimation. We discuss the numerical results obtained using the proposed simulation technique in Section 4. Section 5 illustrates a practical application of the proposed technique, and the paper closes with some conclusions as well as directions for future research.
2. Motivating Examples
This section outlines several examples. We begin by considering a simple model and we show that the parameters corresponding to a single specified bad ratio can be calculated explicitly, negating the need for the proposed simulation technique. Thereafter, we consider slightly more complicated settings and demonstrate that, in general, no solution exists for a specified set of bad ratios. We also highlight the difficulties encountered when attempting to find the required parameters, should a solution exist. Finally, we consider a realistic model, similar to what one would use in practice.
It should be noted that we consider both discrete and continuous attributes below. There does not seem to be general consensus among practitioners on whether or not continuous attributes should be included in the model, as these attributes are often discretized during the modeling process (some practitioners may argue that we need only consider discrete attributes, while others argue against this discretization); for a discussion, see pages 45 to 56 of [1]. Since the number of attributes considered simultaneously using the proposed simulation technique is arbitrary, we may simply choose to replace any continuous attribute by its discretized counterpart. As a result, the techniques described below are applicable in either setting mentioned above.
2.1. A Simple Example
Let $X_j$ be a single attribute, associated with the $j$th applicant, with two levels, 0 and 1. Denote the respective frequencies with which these values occur by $p$ and $1-p$; that is, $P(X_j = 0) = p$ and $P(X_j = 1) = 1-p$ for each applicant $j$. Let $Y_j$ be the indicator of default for the $j$th applicant. Denote the overall bad rate by $d$, meaning that the unconditional probability of default is $P(Y_j = 1) = d$. Let $r$ be the bad ratio of $X_j = 1$ relative to $X_j = 0$. That is, $r$ is the ratio of the conditional probabilities that $Y_j = 1$ given $X_j = 1$ and $X_j = 0$, respectively;
$$ r = \frac{P(Y_j = 1 \mid X_j = 1)}{P(Y_j = 1 \mid X_j = 0)}. $$
We may call upon a practitioner to specify appropriate values for $d$ and $r$.
Using the information above, we are able to calculate the conditional default rates $d_0 = P(Y_j = 1 \mid X_j = 0)$ and $d_1 = P(Y_j = 1 \mid X_j = 1)$. Simple calculations yield
$$ d_0 = \frac{d}{p + (1-p)r}, \qquad d_1 = \frac{rd}{p + (1-p)r}. $$
In this setting, building a scorecard requires that the following logistic regression model be fitted:
$$ P(Y_j = 1 \mid X_j = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}, \qquad x \in \{0, 1\}. \tag{1} $$
Calculating the parameters of the model that give rise to the specified bad ratio requires solving the two equations in (1), obtained by equating the right-hand side to $d_0$ and $d_1$ for $x = 0$ and $x = 1$, in two unknowns. The required solution is calculated to be
$$ \beta_0 = \log\!\left(\frac{d_0}{1-d_0}\right), \qquad \beta_1 = \log\!\left(\frac{d_1}{1-d_1}\right) - \log\!\left(\frac{d_0}{1-d_0}\right). $$
As a result, given the values of $p$, $d$ and $r$, we can find a model that perfectly mimics the specified overall probability of default as well as the bad ratio. However, the above example is clearly unrealistically simple.
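To make the closed-form solution concrete, the following short R sketch computes $\beta_0$ and $\beta_1$ and verifies that the resulting model reproduces the specified quantities. The values of $p$, $d$ and $r$ used here are illustrative choices of ours, not taken from any data set.

```r
# Illustrative inputs (hypothetical values)
p <- 0.6   # P(X = 0): proportion of applicants with attribute level 0
d <- 0.10  # overall (unconditional) bad rate
r <- 1.5   # bad ratio: P(Y = 1 | X = 1) / P(Y = 1 | X = 0)

# Conditional default rates implied by p, d and r
d0 <- d / (p + (1 - p) * r)
d1 <- r * d0

# Logistic regression parameters that reproduce d0 and d1 exactly
beta0 <- log(d0 / (1 - d0))
beta1 <- log(d1 / (1 - d1)) - beta0

# Check: the model probabilities recover the specified bad rate and bad ratio
pi0 <- plogis(beta0)            # P(Y = 1 | X = 0)
pi1 <- plogis(beta0 + beta1)    # P(Y = 1 | X = 1)
c(overall_bad_rate = p * pi0 + (1 - p) * pi1, bad_ratio = pi1 / pi0)
```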
2.2. Slightly More Complicated Settings
Consider the case where we have three discrete attributes, each with five nominal levels. In this case, the practitioner in question would be required to specify bad ratios for each level of each attribute. This would translate into fifteen equations in fifteen unknowns (since the model would require fifteen parameters in this setting). Solving such a system of equations is already a taxing task, but two points should be emphasized. First, the models used in practice typically have substantially more parameters than fifteen, making the proposition of finding an analytical solution very difficult. Second, there is no guarantee that a solution will exist in this case.
Next, consider the case where a single continuous attribute, say income, is used in the model. When the scorecard is developed, it is common practice to discretize continuous variables such as income into a number of so-called buckets. As a result, the practitioner may suggest, for example, that the population be split into four categories, and they may specify a bad ratio for each of these buckets. However, the “true” model underlying the data generates income from a continuous distribution and assigns a single parameter to this attribute in the model. Therefore, this example results in a model with a single parameter which needs to be chosen to satisfy four different constraints (in the form of specified bad ratios). Algebraically, this results in an over-specified system in which the number of equations exceeds the number of unknowns. In general, an over-specified system of equations cannot be solved.
The two examples above illustrate that, even in unrealistically simple cases, we may not be able to obtain parameters that result in the specified bad ratios.
2.3. A Realistic Setting
We now turn our attention to a realistic setting. Consider the case where ten attributes are used, some of which are continuous while others are discrete. For the discrete case, we distinguish between attributes measured on a nominal scale and attributes measured on a ratio scale. An example of an attribute measured on a nominal scale is the application method used by the applicant, as the numerical value assigned to this attribute does not allow direct interpretation. On the other hand, the number of credit cards that an applicant has with other credit providers is measured on a ratio scale, and the numerical value of this attribute allows direct interpretation. In the model used, we treat discrete attributes measured on a ratio scale in the same way as continuous variables; that is, each of these attributes is associated with a single parameter in the model.
As mentioned above, we consider a model containing ten attributes. However, since several discrete attributes are measured on a nominal scale, the number of parameters in the model exceeds the number of attributes. To be precise, let $l$ denote the number of parameters in the model and let $m$ denote the number of attributes measured. Note that $l \geq m$, with equality holding only if no discrete attributes measured on a nominal scale are present. Let $\boldsymbol{x}_j$ be the set of attributes associated with the $j$th applicant. This vector contains the values of the observed continuous and discrete (ratio-scaled) attributes. Additionally, $\boldsymbol{x}_j$ includes dummy variables capturing the information contained in the discrete, nominal-scaled attributes. Define $\pi_j = P(Y_j = 1 \mid \boldsymbol{x}_j)$; the conditional probability of default associated with the $j$th applicant. The model used can be expressed as
$$ \pi_j = \frac{\exp(\boldsymbol{x}_j^\top \boldsymbol{\beta})}{1 + \exp(\boldsymbol{x}_j^\top \boldsymbol{\beta})}, \tag{2} $$
where $\boldsymbol{\beta}$ is a vector of $l$ parameters.
The names of the attributes included in the model, as well as the scales on which these attributes are measured, can be found in Table 1. Care has been taken to use attributes which are often included in credit risk scorecards so as to provide a realistic example. For a discussion of the selection of attributes, see pages 60 to 63 of [1]. Additionally, Table 1 reports the information value of each attribute; this value measures the ability of a specified attribute to predict the value of the default indicator (higher information values indicate higher levels of predictive ability). Consider a discrete attribute with $k$ levels. Let $D$ be the number of defaults in the data set, let $D_j$ be the number of defaults associated with the $j$th level of this attribute and let $N_j$ be the total number of observations associated with the $j$th level of this attribute, with $N = \sum_{j=1}^{k} N_j$ the total number of observations. In this case, the information value of the variable in question is
$$ IV = \sum_{j=1}^{k} \left( \frac{D_j}{D} - \frac{N_j - D_j}{N - D} \right) \log\!\left( \frac{D_j / D}{(N_j - D_j)/(N - D)} \right). $$
All calculations below are performed in the statistical software R; see [14].
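As a small illustration of the information value calculation, the following R sketch (our own, not the code accompanying the paper) implements the formula above for a hypothetical discrete attribute; the attribute levels, default probabilities and sample size are purely illustrative.

```r
# Information value of a discrete attribute
# 'attribute' and 'default' are vectors of equal length; 'default' is 0/1.
# The sketch assumes that every level contains at least one default and
# at least one non-default, so that no log(0) terms arise.
information_value <- function(attribute, default) {
  D  <- sum(default == 1)                                    # total defaults
  N  <- length(default)                                      # total observations
  Dj <- tapply(default, attribute, function(y) sum(y == 1))  # defaults per level
  Nj <- tapply(default, attribute, length)                   # observations per level
  bj <- Dj / D                    # distribution of defaults over the levels
  gj <- (Nj - Dj) / (N - D)       # distribution of non-defaults over the levels
  sum((bj - gj) * log(bj / gj))
}

# Usage on simulated data (illustrative only)
set.seed(1)
attribute <- sample(c("A", "B", "C"), size = 10000, replace = TRUE)
default   <- rbinom(10000, size = 1, prob = ifelse(attribute == "A", 0.15, 0.08))
information_value(attribute, default)
```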
For the sake of brevity, we only discuss four of the attributes in detail in the main text of the paper. However, the details of the remaining six attributes, including the numerical results obtained, can be found in Appendix A.
We specify the distribution of the attributes below. For each attribute, we also specify the levels used as well as the bad ratio associated with each of these levels. Care has been taken to use realistic distributions and bad ratios in this example. Admittedly, the process of specifying bad ratios is subjective, but we base these values on many years of practical experience in credit scoring, and we believe that most risk practitioners will consider the chosen values plausible. However, it should be stressed that the modeler is not bound to the specific example used here; the proposed technique is general, and the number and distributions of attributes are easily changed. The attributes are treated separately below.
2.4. Existing Customer
Existing customers are usually assumed to be associated with lower levels of risk than applicants who are not existing customers. This can be due to the fact that existing customers have already shown their ability to repay credit extended to them in the past, or because customers are more likely to keep paying a company with which they hold other products. We specify that 80% of applicants are existing customers and that the bad ratio is 2.7, meaning that the probability of default for a new customer is, on average, 2.7 times higher than the probability of default of an existing customer with the same remaining attributes.
2.5. Credit Cards with Other Providers
This attribute is an indication of the clients exposure to potential credit. A client could, for example, have a low outstanding balance, but through multiple credit cards have access to a large amount of credit. Depending on the type of product being assessed, this could signal higher risk.
Table 2 shows the assumed distribution of this attribute together with the specified bad ratios.
2.6. Application Method
The method of application is often found to be a highly predictive indicator in credit scorecards. A customer actively seeking credit, especially in the unsecured credit space, is often found to be of higher risk than a customer opting in for credit through an outbound method such as a marketing call. We distinguish between four different application methods:
Branch—Applications submitted in a branch.
Online—Applications submitted through an online application channel.
Phone—Applications submitted through a non-direct channel.
Marketing call—Applications submitted after being prompted by the credit provider.
Table 3 specifies the distribution of this attribute as well as the associated bad ratios.
2.7. Age
Younger applicants tend to be of higher risk, with risk decreasing as applicants become older. We assume that the ages of applicants are uniformly distributed between 18 and 75 years. We divide these ages into seven groups; see Table 4.
As was mentioned above, the remaining attributes are discussed in Appendix A. In the next section, we turn our attention to the proposed simulation technique.
3. Proposed Simulation Technique
Having described the details of the attributes included in the model, we turn our attention to finding a model that results in bad ratios approximately equal to those specified. This is done by simulating a large data set containing attributes as well as default indicators. Thereafter, the parameters of the scorecard are estimated by fitting a logistic regression model to the simulated data. We demonstrate in Section 4 that the resulting parameters constitute a model that closely corresponds to the specified bad ratios and other characteristics. The steps used to arrive at the parameters for the model as well as, ultimately, a simulated data set are as follows:
Specify the global parameters.
Simulate each attribute separately.
Combine the simulated attributes.
Fit a logistic regression model.
Simulate the final default indicators.
It should be noted that the procedure detailed below assumes independence between the attributes. We opt to incorporate this assumption because it is often made in credit scoring in practice. However, augmenting the procedure below to incorporate dependence between attributes is a simple matter. For example, we can drop the assumption of independent attributes by simulating a group of attributes from a specified copula. Although we do not pursue the use of copulas further below, the reader is referred to [15] for more details.
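As a sketch of how dependence could be introduced, the following R fragment simulates two dependent attributes from a Gaussian copula, assuming the copula package is installed. The correlation parameter and the marginal distributions (uniform age, lognormal income) are illustrative assumptions of ours and are not part of the specification used in this paper.

```r
# Sketch: simulating two dependent attributes via a Gaussian copula
# (requires the 'copula' package; all parameter values are illustrative)
library(copula)

set.seed(1)
n   <- 50000
cop <- normalCopula(param = 0.4, dim = 2)  # moderate positive dependence
u   <- rCopula(n, cop)                     # dependent uniforms on (0, 1)

# Transform to the desired marginals, e.g., age ~ Uniform(18, 75) and a
# hypothetical lognormal income
age    <- qunif(u[, 1], min = 18, max = 75)
income <- qlnorm(u[, 2], meanlog = 10, sdlog = 0.5)
```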
3.1. Specify the Global Parameters
We specify a fixed, large sample size. It is important that the initial simulated data set be large even in the case where the final simulated sample may be of more modest size, as this will reduce the effect of sample variability. We also specify the overall bad rate. It should be noted that overly small bad rates will tend to decrease the information value of the attributes included in the model (for fixed sets of bad ratios). This is due to the difficulty associated with predicting extremely rare events. We use a sample size of 50,000 and an overall bad rate of 10% to obtain the numerical results shown in the next section.
3.2. Simulate Each Attribute Separately
The next step entails specifying the marginal distribution as well as the bad ratio associated with each attribute. In the case of discrete attributes, a bad ratio is specified for each of the levels of the attribute. In the case of continuous attributes, the attribute is first discretized and a bad ratio is specified for each level of the resulting discrete attribute. Given the marginal distribution and the bad ratios of an attribute, we explicitly calculate the bad rate for each level of the attribute. Consider an attribute with $k$ levels and let $d_j$ be the average bad rate associated with the $j$th level of the attribute for $j = 1, \dots, k$. In this case, with $p_j$ denoting the proportion of applicants in the $j$th level and $r_j$ the specified bad ratio of that level,
$$ d_j = \frac{d \, r_j}{\sum_{i=1}^{k} p_i r_i}, \qquad j = 1, \dots, k, $$
which ensures that the overall bad rate equals $d$. We now simulate a sample of attributes from the specified marginal distribution. Given the values of these attributes, we simulate default indicators from the conditional distribution of these indicators. That is, given that the $j$th level of the specific attribute is observed, we simulate a 1 for the default indicator with probability $d_j$.
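A minimal R sketch of this step for a single discrete attribute is given below. The level names, proportions and bad ratios are hypothetical values chosen for illustration; the overall bad rate of 10% and sample size of 50,000 follow the global parameters above.

```r
# Simulate one discrete attribute and its provisional default indicator
set.seed(1)
n <- 50000
d <- 0.10                                   # overall bad rate
levels_j <- c("level1", "level2", "level3") # hypothetical levels
p_j <- c(0.5, 0.3, 0.2)                     # marginal distribution (illustrative)
r_j <- c(1.0, 1.8, 3.0)                     # specified bad ratios (illustrative)

# Bad rate per level, scaled so that the overall bad rate equals d
d_j <- d * r_j / sum(p_j * r_j)

# Simulate the attribute from its marginal distribution
x <- sample(levels_j, size = n, replace = TRUE, prob = p_j)

# Simulate the provisional default indicator from the conditional distribution
y <- rbinom(n, size = 1, prob = d_j[match(x, levels_j)])

mean(y)  # should be close to the specified overall bad rate of 10%
```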
3.3. Combine the Simulated Attributes
Upon completion of the previous step, we have a realized sample for each of the attributes with a corresponding default indicator. Denoting the sample size by $n$, the expected number of defaults for each attribute is $nd$. However, due to sample variation, the numbers of defaults simulated for the various attributes will differ, which complicates the process of combining the attributes to form a set of simulated attributes for a (simulated) applicant. In order to overcome this problem, we need to ensure that the numbers of defaults per attribute are equal.
For each attribute, the number of defaults follows a binomial distribution with parameters $n$ and $d$. As a result, the number of defaults has expected value $nd$ and variance $nd(1-d)$. Therefore, for large values of $n$, the ratio of the expected and simulated numbers of defaults converges to 1 in probability. To illustrate the effect of sample variation, consider the following example. If a sample size of $n = 500{,}000$ is used and the overall default rate is set to $d = 10\%$, then the expected number of defaults is 50,000 for each attribute. Due to sample variation, the number of defaults will vary. However, this variation is small when compared to the expected number of defaults; in fact, the standard deviation of the number of defaults is $\sqrt{nd(1-d)} \approx 212$ in this case. Stated differently, the probability that the simulated number of defaults will be within 1% of the expected number is approximately 98% in this case, while the probability that the realized number of defaults differs from the expected number by more than 2% is less than 1 in 200,000.
The example above indicates that the simulated number of defaults will generally be close to $nd$, and we may assume that changing the simulated number of defaults to exactly $nd$ will not have a large effect on the relationship between the values of the attribute and the default indicator. As a result, we proceed as follows. If the number of defaults exceeds $nd$, we arbitrarily replace 1s with 0s in the default indicator in order to reduce the simulated number of defaults to $nd$. Similarly, if the number of defaults is less than $nd$, we replace 0s with 1s.
Following the previous step, the numbers of defaults per attribute are equal, and we simply combine these attributes according to the default indicator. That is, in order to arrive at the details of a simulated applicant who defaults, we arbitrarily choose one realization of each attribute that resulted in default. The same procedure is used to combine the attributes of applicants who do not default.
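The adjustment and combination steps might be sketched in R as follows. This is a simplified illustration under our own naming conventions; it assumes that the default indicator of each attribute has already been simulated as in the previous step, and it shows the combination of only two attributes.

```r
# Force the default indicator of one attribute to contain exactly n * d
# defaults by arbitrarily flipping surplus 1s to 0s (or missing 0s to 1s)
adjust_defaults <- function(y, n_defaults) {
  ones  <- which(y == 1)
  zeros <- which(y == 0)
  if (length(ones) > n_defaults) {
    flip <- ones[sample.int(length(ones), length(ones) - n_defaults)]
    y[flip] <- 0
  } else if (length(ones) < n_defaults) {
    flip <- zeros[sample.int(length(zeros), n_defaults - length(ones))]
    y[flip] <- 1
  }
  y
}

# Combine two attributes by matching defaulters with defaulters and
# non-defaulters with non-defaulters (the pairing within each group is arbitrary);
# both indicators are assumed to contain the same number of 1s after adjustment
combine_attributes <- function(x1, y1, x2, y2) {
  data.frame(
    attr1   = c(x1[y1 == 1], x1[y1 == 0]),
    attr2   = c(x2[y2 == 1], x2[y2 == 0]),
    default = c(rep(1, sum(y1 == 1)), rep(0, sum(y1 == 0)))
  )
}
```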
3.4. Fit a Logistic Regression Model
We now have a (large) data set containing all of the required attributes as well as the simulated default indicators. We fit a logistic regression model to these data in order to find a parameter set that mimics the specified bad ratios. That is, we estimate the set of regression coefficients in (2). The required estimation is standard, and the majority of statistical analysis packages include a function to perform it; the results shown below are obtained using the glm function in the stats package of R.
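For example, assuming the combined attributes and provisional default indicators are stored in a data frame called sim_data with a 0/1 column named default (our naming, not that of the code accompanying the paper), the estimation step reduces to a single call:

```r
# Fit the logistic regression model to the combined simulated data
fit <- glm(default ~ ., data = sim_data, family = binomial(link = "logit"))
summary(fit)  # estimated coefficients that mimic the specified bad ratios
```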
3.5. Simulate the Final Default Indicators
When considering the data set constructed up to this point, the simulated values for the individual attributes are realized from the marginal distribution specified for that attribute. As a result, we need only concern ourselves with the distribution of the default indicator. We now replace the initial default indicator by an indicator simulated from the conditional distribution given the attributes (which is a simple matter since the required parameter estimates are now available). The simulated values of the attributes together with this default indicator constitute the final data set.
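Continuing the hypothetical naming used above (fit and sim_data), this final step could be sketched as follows.

```r
# Replace the provisional default indicator by one simulated from the
# conditional distribution implied by the fitted logistic regression model
pd <- predict(fit, newdata = sim_data, type = "response")  # P(default | attributes)
sim_data$default <- rbinom(n = nrow(sim_data), size = 1, prob = pd)
```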
The R code used for the simulation of a data set using the proposed method can be found at the following link: https://bit.ly/3FFLSpp. We emphasize that the user is not bound by the specifications chosen in this paper, as the code is easily amended in order to change the distributions of attributes, to specify other bad ratios and to add or remove attributes from the data set.
4. Performance of the Fitted Model
In order to illustrate the techniques advocated above, we use the proposed method to simulate a number of data sets using the specifications in Section 3. Below, we report the means (denoted “Observed bad rate”) and standard deviations (denoted “Std dev of obs bad rate”) of the observed bad ratios obtained when generating 10,000 data sets, each of size 50,000.
In Table 5, Table 6, Table 7 and Table 8, we consider each of the four attributes discussed in the main text of the previous section, while the results associated with the remaining attributes are considered in Appendix B. Table 5, Table 6, Table 7 and Table 8 indicate that the average observed bad ratios are remarkably close to the nominally specified bad ratios. Furthermore, the standard deviations of the observed bad ratios are quite small, indicating that the proposed method results in data sets in which the specifications provided in Section 3 are closely adhered to.
The marginal distributions of the attributes are not reported in the tables since the average observed proportions coincide with the specified proportions up to 0.01% in all cases. This result is not unexpected when taking the large sample sizes used into account.
Although less common in practice, smaller sample sizes occur from time to time. This is usually due to constraints placed on the sampling itself, for example, a high cost associated with sampling or regulatory restrictions. When considering smaller sample sizes, the proposed method can still be used; however, in this case the standard deviations of the observed bad rates are larger.
5. Practical Application
The method described above provides a way to arrive at a parametric model, which can be used for simulation purposes, via specification of bad ratios for each attribute considered. One interesting application of this procedure is to specify a deviation from the distribution of the attributes and default indicator and to simulate a second data set. This deviation may, for instance, be in the form of specifying a change in the marginal distribution associated with one or more attributes. The newly simulated data set can then be analyzed in order to gauge the effect of the change to, for example, the overall credit risk of the population.
In practice, a common metric used to measure the level of population stability is the aptly named population stability index (PSI). The PSI quantifies the discrepancy between the observed proportions per level of a given attribute in two samples. Typically, the first data set is observed when the scorecard is developed (we refer to this data set as the base data set) and the second is a more recent sample (referred to as the test data set). Letting $k$ be the number of levels of the attribute, the PSI is calculated as follows:
$$ PSI = \sum_{j=1}^{k} (q_j - p_j) \log\!\left(\frac{q_j}{p_j}\right), $$
where $q_j$ and $p_j$, respectively, represent the proportion of the $j$th level of the attribute in question in the test and base data sets. The following rule-of-thumb for the interpretation of PSI values is suggested in [1]: a value of less than 0.1 indicates that the population shows no substantial changes, a PSI between 0.1 and 0.25 indicates a small change and a PSI of more than 0.25 indicates a substantial change.
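The PSI is straightforward to compute in R. The sketch below (our own implementation, not the code from the link above) takes base and test samples of a single discrete attribute, and the worked check uses the shift in the proportion of existing customers considered later in this section.

```r
# Population stability index between a base sample and a test sample of a
# discrete attribute (both given as vectors of level labels).
# The sketch assumes every level occurs in both samples (no zero proportions).
psi <- function(base, test) {
  levels_all <- union(unique(base), unique(test))
  p <- table(factor(base, levels = levels_all)) / length(base)  # base proportions
  q <- table(factor(test, levels = levels_all)) / length(test)  # test proportions
  sum((q - p) * log(q / p))
}

# Worked check using known proportions: existing customers 80% -> 57%
p_base <- c(existing = 0.80, new = 0.20)
p_test <- c(existing = 0.57, new = 0.43)
sum((p_test - p_base) * log(p_test / p_base))  # approximately 0.25
```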
It should be noted that the PSI is closely related to the Kullback–Leibler divergence. Let $\boldsymbol{p} = (p_1, \dots, p_k)$ and $\boldsymbol{q} = (q_1, \dots, q_k)$. The Kullback–Leibler divergence between the base and test populations is defined to be
$$ D_{KL}(\boldsymbol{p}, \boldsymbol{q}) = \sum_{j=1}^{k} p_j \log\!\left(\frac{p_j}{q_j}\right); $$
see [16] as well as [17]. Note that the Kullback–Leibler divergence is an asymmetric discrepancy measure, meaning that the discrepancy between the base and test populations, $D_{KL}(\boldsymbol{p}, \boldsymbol{q})$, need not equal the discrepancy between the test and base populations, $D_{KL}(\boldsymbol{q}, \boldsymbol{p})$. In order to arrive at a symmetric discrepancy measure, one may simply add $D_{KL}(\boldsymbol{q}, \boldsymbol{p})$ to $D_{KL}(\boldsymbol{p}, \boldsymbol{q})$;
$$ D_{KL}(\boldsymbol{p}, \boldsymbol{q}) + D_{KL}(\boldsymbol{q}, \boldsymbol{p}) = \sum_{j=1}^{k} (q_j - p_j) \log\!\left(\frac{q_j}{p_j}\right), $$
which equals the PSI between the base and test populations. A further discussion of the Kullback–Leibler divergence can be found in [18].
In order to illustrate the use of the PSI, consider the following setup. A single realization of the base data set is simulated using the marginal distributions and the bad ratios specified in Section 2 and Appendix A. We also simulate a test data set using the same specifications, with only the following changes:
The proportion of existing customers is changed from 80% to 57%. The new distribution is chosen so as to have a PSI value of approximately 0.25.
The distribution for the number of enquiries is changed from (30%, 25%, 20%, 15%, 5%, 5%) to (10%, 10%, 20%, 50%, 5%, 5%).
Following these changes, a test data set is simulated from the distribution specified above and the resulting PSI is calculated for each attribute. This process is repeated 1000 times in order to arrive at 1000 PSI values for each attribute.
In addition to considering the magnitude of the change in the distribution of the attributes, we are interested in measuring the change in the overall credit risk of the population. In order to achieve this, it is standard practice to divide the applicants into various so-called risk buckets based on their probability of default as calculated by the scorecard. In the example used here, we proceed as follows: at the time when the data for the base data set are collected, the applicants are segmented into ten risk buckets, each containing 10% of the applicants. That is, the $10\%, 20\%, \dots, 90\%$ quantiles of the probabilities of default of the base data set are calculated. Then, given the test data set, we calculate the proportion of applicants for whom the calculated probability of default is between the $10(j-1)\%$ and $10j\%$ quantiles of the base data set, for $j = 1, \dots, 10$. These proportions are then compared to those of the base data set (which are clearly 10% for each risk bucket) in the same way as the proportions associated with the various levels of the attributes are compared.
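A sketch of this risk-bucket comparison in R follows; pd_base and pd_test are hypothetical vectors (our naming) holding the scorecard probabilities of default for the base and test data sets.

```r
# Compare the distribution of applicants over ten risk buckets defined by the
# deciles of the base probabilities of default
breaks <- quantile(pd_base, probs = seq(0, 1, by = 0.1))
breaks[1]  <- -Inf   # ensure every test probability of default
breaks[11] <-  Inf   # falls into one of the ten buckets

p <- rep(0.1, 10)                                            # base proportions
q <- table(cut(pd_test, breaks = breaks)) / length(pd_test)  # test proportions

psi_buckets <- sum((q - p) * log(q / p))
```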
Table 9 contains the average and standard deviations of the PSI calculated for each of the attributes as well as for the risk buckets.
When considering the results in Table 9, three observations are in order. First, the PSI values calculated for the risk buckets are less than 0.1, indicating that no substantial change in the distribution of the data is observed. Second, the PSI values for the attribute “existing customer” are, on average, 0.2557. Based on the average PSI, the analyst would typically conclude that the variable is unstable, as the calculated average PSI value exceeds the cut-off of 0.25. However, in 27.5% of the simulated test data sets, the PSI was calculated to be less than 0.25. This demonstrates that the proposed simulation technique enables us to perform sensitivity analysis in cases where a change in the distribution of the attributes results in PSI values close to the cut-off value of 0.25. Third, when considering the attribute “Number of enquiries”, the PSI indicates that a substantial change has occurred; the PSI values calculated for this attribute have an average of 0.7988 and a standard deviation of 0.0178.