The experimental testing of the FBLearn platform aims to explore and evaluate the suggested approaches for global FL model aggregation. The designed experiments evaluate the aggregated global FL model in comparison to the local models in order to determine whether the aggregation approaches are beneficial for the requestor.
5.3. Experimental Results for Credit Risk Scoring Using FBLearn Platform
Use Case 1 is based on a dataset for Credit Risk Scoring [43]. The original dataset is split into training, test, and validation datasets that are preprocessed using the processing steps described in [45]. The exploratory data analyses applied are aimed at data cleaning, feature and target data transformation, encoding, and proper feature scaling. The preprocessed training dataset is then split into three training files, one for each of the three trainers (nodes): score1.csv, score2.csv, and score3.csv. In addition, a fake training file (fake.csv) comprising randomly generated data is created and used in some of the test cases of the third node. A preprocessed validation dataset (validate.csv) is also used. A summary of the number of data samples and the number of data features of the datasets is given in Table 1.
All data files comprise the same features: numeric columns for age, annual balance, number of credit cards, number of loans, outstanding balance, credit history, money spent, etc., as well as a “TARGET” column with values of {0, 1, 2}, in which “0” represents a “Bad Credit Score”, “1” a “Risky Credit Score”, and “2” a “Good Credit Score”. In this use case, the target variable is a numeric value from a limited set and the model is trained to predict one of the target values; thus, a classification ML algorithm is used for model training. For this use case of the FBLearn platform, a random forest classifier is utilized. Random forest classifiers are well suited for tasks with a large number of input features, such as the credit risk scoring use case discussed here, because they can handle high-dimensional data with a low risk of overfitting.
The parameters of the random forest classifier used for the experimental evaluation in Use Case 1 are based on the default settings of the scikit-learn RandomForestClassifier Python class: the number of trees in the forest is 100, the function to measure the quality of a split is based on Gini impurity, the minimum number of samples required to split an internal node is two, the minimum number of samples required for a leaf node is one, the number of features to consider for the best split is the square root of the number of features, and bootstrap samples are used when building the trees. No hyperparameter tuning is applied in the presented experimental evaluation, as its focus is on distributed learning and global model aggregation. The same parameters are used in model training, testing, and validation.
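For reference, a minimal sketch of how a single trainer might configure and fit this classifier is shown below; the file and column names follow the description above, and the explicit keyword arguments simply restate the scikit-learn defaults listed in the text.

```python
# Minimal sketch: one local trainer fitting the random forest classifier with the
# scikit-learn defaults described above. File and column names follow the text.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("score1.csv")                       # one trainer's preprocessed split
X, y = data.drop(columns=["TARGET"]), data["TARGET"]

clf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    criterion="gini",        # split quality measured by Gini impurity
    min_samples_split=2,     # minimum samples required to split an internal node
    min_samples_leaf=1,      # minimum samples required at a leaf node
    max_features="sqrt",     # features considered per split: sqrt(n_features)
    bootstrap=True,          # bootstrap samples are used when building the trees
)
clf.fit(X, y)
```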
Two approaches are applied in order to assess the quality of the data in all the training datasets (a short code sketch illustrating both approaches is given after their descriptions):
Each dataset is split into training and testing data using a ratio of 80:20. A model is obtained using random forest classification based on the training data. Predictions are calculated using the test data and are compared with the target feature. The results for the selected evaluation metrics are given in Table 2. Figure 3 shows the ROC curves for the data sources used.
For each dataset, a model is obtained using random forest classification based on all the samples from the dataset. Predictions are calculated for the validation samples and are compared with the target feature in the validation dataset. The results for the selected evaluation metrics are given in Table 3. Figure 4 shows the ROC curves for the data sources used, where Class 0 represents the “Bad Credit Score” category, Class 1 the “Risky Credit Score” category, and Class 2 the “Good Credit Score” category.
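The following is a minimal sketch of the two quality checks, assuming the preprocessed CSV files and the “TARGET” column named above; only accuracy is computed here for brevity, whereas the tables and figures report the full metric set and the ROC curves.

```python
# Minimal sketch of the two dataset-quality checks described above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def load(path):
    df = pd.read_csv(path)
    return df.drop(columns=["TARGET"]), df["TARGET"]

X_val, y_val = load("validate.csv")

for path in ["score1.csv", "score2.csv", "score3.csv", "fake.csv"]:
    X, y = load(path)

    # Approach 1: 80:20 split of the dataset itself, evaluate on the held-out 20%.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    m1 = RandomForestClassifier().fit(X_tr, y_tr)
    acc_split = accuracy_score(y_te, m1.predict(X_te))

    # Approach 2: train on all samples, evaluate on the shared validation dataset.
    m2 = RandomForestClassifier().fit(X, y)
    acc_val = accuracy_score(y_val, m2.predict(X_val))

    print(path, f"split acc={acc_split:.3f}", f"validation acc={acc_val:.3f}")
```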
Both dataset assessment approaches show comparable results for the evaluation metrics. The first three data sources have almost equal evaluation metrics and ROC curves, while the fourth (fake.csv), which has a different distribution, shows lower evaluation metrics. That is why the experimental results for Use Case 1 are divided into two parts, with two different hypotheses:
Part 1: Experiments are executed using only the first three data sources (score1, score2, and score3). The hypothesis for the evaluation is that the trainers participate in the distributed learning with high-quality datasets and do not aim to disturb the system or to decrease the quality of the final global FL model. The goal is to evaluate the suggested approaches for global FL model aggregation;
Part 2: Experiments are executed using the first two data sources (score1 and score2) for two trainers, while the third trainer uses the fourth data source (fake.csv). The hypothesis is that the first two trainers participate in the distributed training with data sources of similar quality and the third one uses a lower-quality dataset for local model training; thus, it either aims to decrease the quality of the final global model or simply does not have good-quality data. The goal is to evaluate the resiliency of the distributed training system against trainers that use datasets of differing quality during local training.
For Part 1 of Use Case 1 (described in Table 4), the training datasets score1, score2, and score3 have a similar quality and distribution; thus, the goal is to explore different scenarios of distributed training and validation using the suggested approaches for global FL model aggregation. The test cases correspond to nine scenarios for global model aggregation according to the algorithms presented above.
The results for the test cases are given in Table 5 and show relatively similar values for all the evaluation metrics. The ROC curves shown in Figure 5 also confirm the quality of the global FL models obtained. In general, the models have small MSE and R-squared values and high values for accuracy, F1 score, precision, and recall. When the trainers' data are of similar quality, the aggregation of the global model yields equal results across all the cases regardless of the algorithm used. In all the cases, the achieved quality of the global FL model is good, with quality metrics very similar to those of the original datasets, thus confirming that the model aggregation can be based on any of the suggested approaches. Test Cases 3, 4, and 5 show good performance and quality compared with the models trained on the separate datasets. The quality metrics for the resulting final global models of Test Cases 7, 8, and 9 are particularly good, which indicates that the aggregation techniques used in these test cases perform best.
For Part 2 of Use Case 1 (described in Table 6), the same algorithms for aggregation of the global FL models are used, but the third local trainer node uses a dataset with a significantly lower quality (fake.csv).
The results for the test cases are given in Table 7 and the ROC curves are shown in Figure 6. In Test Case 10, the global FL model is aggregated as a combination of the local models with no weights applied. The ROC curve shows very good results; the combination prevents the fake dataset from causing any serious deviation in the quality of the global model. The results for Test Cases 11–14 show that aggregation using weighted average approaches ensures a high-quality model on which the fake dataset has only a small influence. Test Cases 12, 13, and 14 produce three candidates for the final model, and the results show that the models are of comparable quality, which means that the requestor can choose any one of them. This can be a good approach in the specific situation in which the local trainers are required to use private datasets to create the final model proposition. Test Cases 15, 16, and 17 use the ensemble technique for global FL model aggregation: without weights (Test Case 15) and with different weighting approaches (Test Case 16 using the MSE and Test Case 17 using the MCC). The results show that these approaches yield results comparable with the others, and the quality of the global FL model remains high despite the presence of the fake dataset.
In general, the experimental results confirm that combining locally trained models, with or without weights, mitigates malicious behavior in the system and ensures a high-quality global FL model even when some of the local trainers have lower-quality datasets or try to compromise the system results. The appropriate scenario can be chosen depending on the specific status of the participants in the system and on whether the requestor is willing to share a validation dataset.
5.4. Experimental Results for Credit Card Fraud Using FBLearn Platform
Use Case 2 aims to evaluate the same approaches for global FL model aggregation, but using logistic regression. Logistic regression is particularly well suited for binary classification problems, in which the outcome variable has two classes, and it provides a straightforward probability estimate for class membership. The dataset for this use case is the Credit Card Fraud dataset [44]. The original dataset is split into three training datasets for three trainers (nodes), train1, train2, and train3 (csv files), and a validation file, which is used in the experiments that rely on validation for global FL model aggregation. The datasets for each trainer are presented in Table 8. All data files are preprocessed and consist of 30 features: numeric columns representing credit card transactions, the cardholder’s personal data, financial and demographic data, and historical data for the financial profile of the customer, as well as a target column with values of {0, 1}, in which 0 denotes no fraud and 1 denotes fraud. The first dataset (train1) contains significantly more samples than the other two. The training data of this use case are appropriate for binary logistic regression, which outputs a true or false result for each prediction. The logistic regression parameters used for the experimental evaluation in Use Case 2 are based on the default settings of the relevant scikit-learn Python class: the default lbfgs solver, a limited-memory quasi-Newton optimization method, is used due to its robustness with the l2 regularization penalty, and the maximum number of iterations for the solver to converge is set to 100. Hyperparameter tuning is not applied in the presented experimental evaluation, as the focus is on distributed learning and global model aggregation. The same parameters are used in model training, testing, and validation.
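As with Use Case 1, a minimal sketch of a single trainer's set-up with these defaults is shown below; the file name train1.csv follows Table 8, while the target column name "TARGET" is only an assumption made for illustration.

```python
# Minimal sketch: one local trainer fitting logistic regression with the
# scikit-learn defaults described above (the keyword arguments restate them).
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("train1.csv")
X, y = data.drop(columns=["TARGET"]), data["TARGET"]   # target column name assumed; 0 = no fraud, 1 = fraud

clf = LogisticRegression(
    penalty="l2",      # l2 regularization penalty
    solver="lbfgs",    # limited-memory BFGS solver (scikit-learn default)
    max_iter=100,      # maximum number of iterations for the solver to converge
)
clf.fit(X, y)
```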
In order to assess the quality of the data in the different training datasets, the same two approaches as in Use Case 1 are used:
Each of the datasets is split into training and testing data using a ratio of 80:20. A logistic regression model is trained on the training samples, and the predictions are compared with the test samples’ target values. Table 9 shows the results for the evaluation metrics and Figure 7 shows the ROC curves for the data sources.
For each dataset, a logistic regression model is trained using all the samples from the dataset. Predictions are calculated for the validation samples and are compared with the target feature in the validation dataset. The results for the selected evaluation metrics are given in Table 10 and the ROC curves in Figure 8.
Both assessment approaches show comparable results. The three data sources have almost equal evaluation metrics and ROC curves.
The main goal of Use Case 2 is to evaluate the suggested approaches for global FL model aggregation when logistic regression is used for distributed learning. The test case scenarios for Use Case 2 are presented in Table 11.
The experimental results given in Table 12 show that the global FL model aggregation approaches are applicable to distributed model training using logistic regression and result in a high-quality global FL model. The different approaches can be used in different real-world scenarios depending on whether the requestor intends to share their validation data.
A comparison of the evaluation metrics and the ROC curves (Figure 9) of the resulting final global models from the test cases of Use Case 2 shows that the final models are at least as good as the models trained on the initial training dataset files. The ROC curves of the final models are closer to the top left corner of the frame, which is an indicator of better performance, especially for Test Cases 24 and 25.
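ROC comparisons of this kind can be reproduced with a short routine such as the sketch below, assuming fitted binary models that expose predict_proba and the shared validation data; plot_roc is a hypothetical helper, not part of the FBLearn code.

```python
# Sketch of a binary ROC comparison like the one in Figure 9.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(models, X_val, y_val, labels):
    for model, label in zip(models, labels):
        scores = model.predict_proba(X_val)[:, 1]        # probability of the fraud class
        fpr, tpr, _ = roc_curve(y_val, scores)
        plt.plot(fpr, tpr, label=f"{label} (AUC = {auc(fpr, tpr):.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--")             # chance level
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```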