This section presents the findings from our analysis of the three distinct datasets described above. By evaluating the outcomes for each dataset, we aim to draw conclusions about the underlying patterns and relationships and to clarify the implications of our research for the subject matter under investigation.
3.1. Prudential Dataset
The optimal result for dataset 1 is achieved using boosting. In contrast, for datasets 2 and 3, random forest yields the most favorable outcomes, while dataset 4 exhibits the least desirable results, with no model attaining even 50% accuracy. The elimination of certain variables in this scenario appears to directly affect the categorization, as categories such as 1, 3, and 8 contain no cases in the testing exercises. The results are compared using the accuracy obtained in testing for each model across the four proposed work scenarios (refer to
Table 4). The best model for each scenario is highlighted in bold, while the second best is presented in italics.
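To make this comparison concrete, the following minimal sketch (Python/scikit-learn) shows how test accuracy could be computed for each candidate model in each scenario. The file names, the target column name ("Response"), and the model settings are illustrative assumptions, not the exact configuration used in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Hypothetical pre-processed files, one per work scenario (placeholders).
scenarios = {"dataset 1": "scenario1.csv", "dataset 2": "scenario2.csv"}

models = {
    "GLM": LogisticRegression(max_iter=1000),
    "Tree": DecisionTreeClassifier(max_depth=6, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, path in scenarios.items():
    df = pd.read_csv(path)
    X, y = df.drop(columns="Response"), df["Response"]   # "Response" is an assumed target name
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    results[name] = {m: accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
                     for m, clf in models.items()}

print(pd.DataFrame(results))   # rows: models, columns: scenarios (cf. Table 4)
```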
As anticipated, datasets 2 and 3 exhibit better response levels since they involve binary classification problems, as opposed to the more complex prediction capacity required for multi-class classification problems. The boosting model delivers the best result for the original problem, aligning with the competition outcomes, and appears to be the most suitable model, despite not being directly comparable to the problem statement.
Notably, dataset 4 demonstrates a considerably low accuracy level, with no model identifying all categories during the testing phase, typically capturing only 5 to 6 categories. The least represented categories, 1 and 3, were not identified in this phase, suggesting that some variables eliminated due to missing values may hold crucial information for the classification process.
Interestingly, a model whose variable importance can be assessed directly from its structure, such as the classification tree, ranks second for datasets 1 to 3, outperformed only by harder-to-interpret models such as random forest and XGBoost.
As expected, accuracy results for binary classification are favorable for the more interpretable models, such as decision tree classifiers, although they are surpassed by more sophisticated techniques. For the 8-category classification problem (dataset 1), acceptable and consistent results are achieved, with the best technique matching that reported by the winners of the competition from which the dataset was extracted.
3.1.1. Explainable Results
All models, excluding GLM, generally exhibit a predominance of shared variables, such as BMI, WT, Medical_History_4, and Product_info 4, albeit in varying orders. The best-performing models show a similar ranking, which is elaborated upon below. For dataset 1, these shared variables prevail, with BMI being the most significant for all models other than GLM (see
Figure 5).
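As an illustration of how such per-model rankings can be obtained, the sketch below computes permutation importance for a random forest on a hypothetical scenario-1 file. The file name, target column, and split are assumptions; only the variables discussed in the text are expected to appear at the top.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("prudential_scenario1.csv")           # hypothetical pre-processed file
X, y = df.drop(columns="Response"), df["Response"]     # assumed target column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

ranking = pd.Series(perm.importances_mean, index=X_te.columns).sort_values(ascending=False)
print(ranking.head(5))   # expected to be led by BMI, WT, Medical_History_4, Product_info 4
```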
Regarding the importance of the variables in the different models for dataset 2, BMI, WT, Medical_History_4, and Product_info 4 again predominate, with BMI the most important for all models other than GLM, complemented by additional variables such as Medical_History_39 and Ins_Age.
The most relevant variables in dataset 3 are similar to those described for dataset 2, in the same order for each model. There are no novelties regarding variable importance in the dataset 4 models: BMI, WT, Medical_History_4, and Product_info 4 remain the most important for all models other than GLM, again accompanied by additional variables such as Medical_History_39 and Ins_Age.
In general, despite the limited information on the product, medical history, and family variables, whose actual content is unknown, the analysis points to variables relevant to life coverage, such as BMI and weight, which, together with a person's age, can signal good or bad health and are therefore closely related to mortality risk. In addition, three variables whose content is unknown, Medical_History_4, Product_info 4, and Medical_History_39, appear to carry information characteristic of the analyzed population in all of the approaches, being common variables across the alternative datasets constructed.
3.1.2. Evaluation
Regarding feature importance, the first dimension in our evaluation framework (see
Section 2.3), at least three of the five most relevant variables per model and per scenario are within the a priori most relevant groups (see
Table 5).
Regarding consistency, only boosting achieves a correlation of at least 0.5 among two or more explainability techniques across all datasets. In contrast, decision trees reach this consistency level in only two datasets (3 and 4), while GLM and random forest attain it in only one dataset (2 and 4, respectively).
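This consistency check can be illustrated with a short sketch that computes the Spearman rank correlation between the importance scores returned by two explainability techniques for the same model. The numeric values below are placeholders, not results from the study.

```python
import pandas as pd
from scipy.stats import spearmanr

# Importance scores from two techniques for the same model; values are illustrative only.
vi_perm = pd.Series({"BMI": 0.31, "WT": 0.22, "Medical_History_4": 0.18,
                     "Product_info 4": 0.12, "Ins_Age": 0.05})
vi_shap = pd.Series({"BMI": 0.35, "WT": 0.19, "Medical_History_4": 0.20,
                     "Product_info 4": 0.10, "Ins_Age": 0.07})

rho, _ = spearmanr(vi_perm, vi_shap.reindex(vi_perm.index))
print(f"consistency (Spearman rho) = {rho:.2f}")   # compared against the 0.5 threshold
```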
Concerning stability and robustness, the variable importance (VI) of each model is compared across the four scenarios (see
Table 6). A similar analysis was conducted by comparing the results against dataset 1. The VI obtained with the AI techniques yields high correlations, exceeding 75%, whereas the GLM models produce notably lower values.
The model measure attains the best results in terms of computational cost and availability. While the firm and shap techniques apply to every ML technique, their execution is slow. In contrast, the perm measure executes quickly, but it is not available for all models and data scenarios.
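The cost difference can be illustrated by timing a permutation-based measure against SHAP on the same fitted model, as in the sketch below. The data are synthetic, so absolute timings differ from those observed in the study.

```python
import time
import shap                                            # assumes the shap package is available
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

t0 = time.perf_counter()
permutation_importance(model, X, y, n_repeats=5, random_state=0)
t_perm = time.perf_counter() - t0

t0 = time.perf_counter()
shap.TreeExplainer(model).shap_values(X)
t_shap = time.perf_counter() - t0

print(f"perm: {t_perm:.1f} s, shap: {t_shap:.1f} s")
```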
From a regulatory compliance perspective, this dataset adheres to the GDPR, as it is anonymized and precludes the association of characteristics for inferring personal information. Moreover, in the insurance pricing case, we have a model for defining homogeneous groups that, although only partially replicable, would yield similar results when repeated. This is further reinforced by the XAI analysis, which facilitates understanding and review by the regulator.
Regarding fairness and bias, all analyses reveal that BMI, WT, and certain health-related variables influence the classification process. For example, if we consider that the rating process aims to evaluate the allocation of insurance risk associated with financial products, the identified relevance reaffirms these characteristics, despite their seemingly discriminative nature. Furthermore, the dataset does not include variables penalized for discrimination in insurance pricing, such as sex.
3.2. Health Insurance Results
After running the parameter optimization process for each group of techniques, the parameterization that yields the best results for each model in each dataset is selected. The results obtained are shown in
Table 7.
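This selection step can be sketched with scikit-learn's GridSearchCV using MAE as the selection criterion. The file name, target column, and parameter grid below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

df = pd.get_dummies(pd.read_csv("insurance.csv"))      # hypothetical input file
X, y = df.drop(columns="charges"), df["charges"]       # "charges" is an assumed target name

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)            # best parameterization and its CV MAE
```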
In dataset 2 (see
Table 8), the algorithms were tested on the same data after a prior feature selection process. Their results remained essentially the same, or even worsened slightly, except for the neural networks, which improved significantly. Reducing the dataset even further therefore does not bring significant improvements for a problem with a small dataset.
The results are similar in the scenarios with normalized data with (
Table 9) or without outliers (
Table 10). However, the predictive power of the neural networks is even more outstanding, obtaining the best MAE values in the scenario without outliers.
The performance results obtained in the scenario without outliers (
Table 11) are worse regarding
in both the cross validation and the test. They are better in MAE in cross validation (slight overlearning) but worse in the test step. The best results are obtained by the neural networks, which take advantage of not having to handle these extreme cases.
In most cases, decision forests (random forest or bagging) achieve the best results both in cross validation and in evaluation. The optimized models that achieve the best results are complex models with a reduced capability to provide explanations. Artificial neural networks achieve good predictive results, the best in the dataset with feature selection (Scenario 2) and in the one without outliers (Scenario 5). These techniques present outstanding predictive performance () with lower performance in description power () and difficulty explaining the results. In contrast, self-explainable techniques such as linear regression or decision trees achieve the worst results, showing poor performance in every scenario. The poor performance of the boosting algorithms is also remarkable; these methods need a considerable volume of data to achieve good performance.
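A minimal sketch of this kind of comparison, contrasting cross-validation MAE with held-out MAE for the model families discussed, is given below. The input file and model settings are illustrative assumptions rather than the configurations used in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, BaggingRegressor,
                              GradientBoostingRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

df = pd.get_dummies(pd.read_csv("insurance.csv"))      # hypothetical pre-processed data
X, y = df.drop(columns="charges"), df["charges"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "bagging": BaggingRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
    "ANN": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0),
}
for name, m in models.items():
    cv_mae = -cross_val_score(m, X_tr, y_tr, cv=5, scoring="neg_mean_absolute_error").mean()
    test_mae = mean_absolute_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
    print(f"{name:18s} CV MAE = {cv_mae:10.1f}   test MAE = {test_mae:10.1f}")
```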
Explainability Results
According to this evaluation, if we study the correlation between the variable relevance obtained by the different algorithms (see
Table 12), the result is very similar, presenting very high correlations (average =
). From another point of view, analyzing the relevance of the variables for the same algorithm across different datasets, the correlations are also very high, with the largest differences found for
scenario 5. Artificial neural networks are the most stable in terms of the correlation of results.
Table 13 reflects different results for the models obtained. However, smoker.no is the most relevant variable in all models, while BMI and age take third place in several of the models.
Analyzing the features that significantly impact predictions, it is noteworthy that at least three of the five most relevant variables per model fall within the a priori relevant groups. Moreover, the explanations are consistent across different instances and similar inputs, as all correlations between various techniques in each scenario exceed 85%. This demonstrates that the explainability techniques align with domain knowledge in this case.
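This check can be expressed as a simple set comparison between each model's top five variables and the a priori relevant groups, as in the following sketch with placeholder rankings.

```python
# Both dictionaries are illustrative placeholders, not the paper's exact rankings.
a_priori_relevant = {"age", "bmi", "smoker.no", "smoker.yes", "children"}

top5_by_model = {
    "random forest": ["smoker.no", "bmi", "age", "children", "region.southeast"],
    "boosting":      ["smoker.no", "bmi", "age", "sex.male", "children"],
}

for model, top5 in top5_by_model.items():
    hits = len(a_priori_relevant.intersection(top5))
    status = "ok" if hits >= 3 else "review"
    print(f"{model}: {hits}/5 top variables in a priori relevant groups -> {status}")
```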
The model measure achieves the best results in terms of computational cost and availability. While the firm and shap techniques apply to all machine learning techniques, their execution is slow. In contrast, the perm technique executes quickly but is not available for all models and data scenarios.
All models concur that BMI, age, and smoking status influence the fitting process (see
Figure 6). Considering that the rating process aims to evaluate the allocation of insurance risk associated with financial products, the identified relevance reaffirms these characteristics, despite their seemingly discriminative nature. Conversely, the dataset includes variables such as sex and paternity, which are penalized as discriminatory for insurance pricing, but reflect actual conditions according to experience.
The database adheres to the GDPR, as it is anonymized and precludes the association of characteristics for inferring personal information.
3.3. Claim Results
The results of our analysis, as presented in
Table 14, reveal some interesting insights. Firstly, it is clear that the performance of the machine learning algorithms varies across the datasets. For example, regression trees performed best in dataset 1, with an error of 11,538.02, while bagging achieved the lowest error in dataset 3, at 0.423. Boosting, in turn, performed best in dataset 4, with an error of 2960.106.
It is worth noting that neural networks consistently performed well across all datasets. This highlights the versatility and robustness of neural networks, making them a viable option for a wide range of applications. Another interesting observation is the relatively high errors in datasets 1 and 2 compared to datasets 3 and 4. This can be attributed to datasets 1 and 2 being multi-class classification problems, which are generally more challenging than binary classification problems, such as those in datasets 3 and 4.
Our results demonstrate the importance of carefully considering the problem and selecting an appropriate machine learning algorithm to achieve optimal results. While there is no one-size-fits-all solution, the versatility of neural networks and the varied strengths of different algorithms in different datasets highlight the importance of conducting thorough experimentation and analysis to identify the best solution for each problem.
Explainability Results
The analysis of the prediction techniques used in the study shows that the ANN model with linearly significant, untransformed variables gave the greatest weight to the insured's age and marital status, whereas work-related variables were less relevant. However, this model included a variable previously discarded in the other datasets: the year of occurrence, which is potentially related to inflation or to the growth of the average claim cost over time. Additionally, the boosting model, which considered the normalized dataset without outliers or variables lacking a linear relationship, reinforced the importance of the most relevant variables from the previous analyses, such as weekly income, year of occurrence, age, and gender.
It should be noted that the incidence of the variables in the models cannot be understood solely from the results obtained from each algorithm. Despite the limitations of the information, such as the type of work or cause of the accident, the study was able to clearly identify the influence of weekly income, age, and year of occurrence. The cost of an accident at work is directly related to the injured worker's salary (income), which is usually related to their experience level, so it makes sense that age matters. Furthermore, the year provides a reference for the influence of the value of money over time, that is, the inflationary effect on wages. The results of the study are presented in
Figure 7.
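A hedged sketch of how the boosting model's variable relevance could be derived with SHAP is shown below. The file name, column names, and target are illustrative assumptions rather than the actual claims data used in the study.

```python
import pandas as pd
import shap                                            # assumes the shap package is available
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical normalized claims file; column and target names are placeholders.
df = pd.get_dummies(pd.read_csv("claims.csv"))
X, y = df.drop(columns="UltimateClaimCost"), df["UltimateClaimCost"]

gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(gbm).shap_values(X)

importance = pd.Series(abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False).head(5))
# expected to highlight weekly income, age, and year of occurrence, as in Figure 7
```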
The boosting machine learning (ML) algorithm demonstrated its effectiveness in generating consistent and robust explanations through various explainability techniques. The high level of correlation among the results of these techniques further strengthens the credibility of the explanations. Moreover, the variable selection process achieves high consistency and robustness across different datasets, indicating the reliability of the approach.
In terms of computational cost and availability, the model measure outperforms the firm and shap techniques, which are slow. In contrast, the perm technique executes quickly but is not universally available for all models and datasets.
From a regulatory compliance perspective, the model provides a suitable fit that may be only partially replicable, but repetition would lead to similar results. The XAI analysis further facilitates understanding and review by the regulator, reinforcing the model's reliability. The analysis of the results reveals the significance of weekly income, age, and year of occurrence as influential variables in solving the problem. These variables align with the criteria of a claims specialist, as the cost of the accident claim is directly related to the salary (income) of the injured worker, which in turn is usually associated with the worker's experience level. It makes sense that age matters in such cases, and the year of occurrence reflects the influence of the value of money over time, or the inflationary effect on salaries. Therefore, no weight is given to any discriminant variable.