Balancing Risk and Profit: Predicting the Performance of Potential New Customers in the Insurance Industry
Abstract
1. Introduction
2. Literature Review
3. Overview of Methodology and First Steps
3.1. Data Preparation
- p refers to the premiums or total amount paid by the established customer for the insurance policies.
- c represents the claims recorded on the policy from the policy start date.
- a denotes the amount paid to agents as a percentage of the premiums.
- The constant k represents the portion of the premiums allocated by the insurance company to cover administrative expenses (in our case, ).
- e represents the percentage of exposure, i.e., the percentage of days in a year that the policy has been active.
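
The variables above point to a per-policy profit figure that is annualized by exposure. The formula itself is not shown here, so the following is only a minimal sketch under an assumed form, (p - c - a*p - k*p) / e. The function name `annualized_profit` and every example value, including the value used for k, are hypothetical and not taken from the article.

```python
def annualized_profit(p, c, a, k, e):
    """Annualized profit of one policy under an ASSUMED formula.

    p: premiums paid, c: claims recorded, a: agent commission rate,
    k: share of premiums kept for administrative expenses,
    e: exposure, i.e. the fraction of the year the policy was active.
    """
    if not 0 < e <= 1:
        raise ValueError("exposure e must be in the interval (0, 1]")
    # Premiums minus claims, agent commission, and administrative costs.
    net = p - c - a * p - k * p
    # Divide by exposure to express the result on a full-year basis.
    return net / e


# Hypothetical policy: active for half a year, with one small claim.
example = annualized_profit(p=600.0, c=100.0, a=0.10, k=0.20, e=0.5)
print(example)
```

Dividing by e puts policies with different active periods on a comparable yearly scale, which is one plausible reading of the exposure definition above.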
3.2. Selection of Relevant Features
4. Evaluation of a Pool of Models Using Cross-Validation
5. Evaluation of Selected Models, Using Early Stopping
6. Further Analysis on the Test Dataset
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ML | Machine Learning |
ROC AUC | Receiver Operating Characteristic Area Under the Curve |
AAP | Average Annual Profit |
References
1. Krenn, M.; Buffoni, L.; Coutinho, B.; Eppel, S.; Foster, J.G.; Gritsevskiy, A.; Lee, H.; Lu, Y.; Moutinho, J.P.; Sanjabi, N.; et al. Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network. Nat. Mach. Intell. 2023, 5, 1326–1335.
2. Dinov, I.D. Volume and value of big healthcare data. J. Med. Stat. Inform. 2016, 4, 3.
3. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
4. Kuznetsov, V. Gaining insight from large data volumes with ease. In EPJ Web of Conferences; EDP Sciences: Sofia, Bulgaria, 2019; Volume 214, p. 04027.
5. Rani, S.; Bhambri, P.; Kataria, A. Integration of IoT, Big Data, and Cloud Computing Technologies: Trend of the Era. In Big Data, Cloud Computing and IoT; Chapman and Hall/CRC: Boca Raton, FL, USA, 2023; pp. 1–21.
6. Ionescu, S.A.; Diaconita, V. Transforming financial decision-making: The interplay of AI, cloud computing and advanced data management technologies. Int. J. Comput. Commun. Control 2023, 18, 5735.
7. Siddiqa, A.; Hashem, I.A.T.; Yaqoob, I.; Marjani, M.; Shamshirband, S.; Gani, A.; Nasaruddin, F. A survey of big data management: Taxonomy and state-of-the-art. J. Netw. Comput. Appl. 2016, 71, 151–166.
8. Raghav, R.S.; Pothula, S.; Vengattaraman, T.; Ponnurangam, D. A survey of data visualization tools for analyzing large volume of data in big data platform. In Proceedings of the 2016 International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 21–22 October 2016; pp. 1–6.
9. Jones, K.I.; Sah, S. The Implementation of Machine Learning In The Insurance Industry With Big Data Analytics. Int. J. Data Inform. Intell. Comput. 2023, 2, 21–38.
10. Jamal, S.; Goyal, S.; Grover, A.; Shanker, A. Machine Learning: What, Why, and How? In Bioinformatics: Sequences, Structures, Phylogeny; Springer: Singapore, 2018; pp. 359–374.
11. Tian, X.; Todorovic, J.; Todorovic, Z. A Machine-Learning-Based Business Analytical System for Insurance Customer Relationship Management and Cross-Selling. J. Appl. Bus. Econ. 2023, 25, 273.
12. Hanafy, M.; Ming, R. Machine learning approaches for auto insurance big data. Risks 2021, 9, 42.
13. Rawat, S.; Rawat, A.; Kumar, D.; Sabitha, A.S. Application of machine learning and data visualization techniques for decision support in the insurance sector. Int. J. Inf. Manag. Data Insights 2021, 1, 100012.
14. Mahbobi, M.; Kimiagari, S.; Vasudevan, M. Credit risk classification: An integrated predictive accuracy algorithm using artificial and deep neural networks. Ann. Oper. Res. 2023, 330, 609–637.
15. Hosein, P. A data science approach to risk assessment for automobile insurance policies. Int. J. Data Sci. Anal. 2024, 17, 127–138.
16. Jeong, H.; An, J.; Jeong, J. Are you a good client? Client classification in federated learning. In Proceedings of the 2021 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 20–22 October 2021; pp. 1691–1696.
17. Eluwole, O.T.; Akande, S. Artificial Intelligence in Finance: Possibilities and Threats. In Proceedings of the 2022 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), Virtual, 28–30 July 2022; pp. 268–273.
18. Luciano, E.; Cattaneo, M.; Kenett, R. Adversarial AI in Insurance: Pervasiveness and Resilience. arXiv 2023, arXiv:2301.07520.
19. Finger, D.; Albrecher, H.; Wilhelmy, L. On the cost of risk misspecification in insurance pricing. Jpn. J. Stat. Data Sci. 2024, 1–43.
20. Leo, M.; Sharma, S.; Maddulety, K. Machine learning in banking risk management: A literature review. Risks 2019, 7, 29.
21. Fitriani, M.A.; Febrianto, D.C. Data mining for potential customer segmentation in the marketing bank dataset. JUITA J. Inform. 2021, 9, 25–32.
22. Simester, D.; Timoshenko, A.; Zoumpoulis, S.I. Targeting prospective customers: Robustness of machine-learning methods to typical data challenges. Manag. Sci. 2020, 66, 2495–2522.
23. Hutagaol, B.J.; Mauritsius, T. Risk level prediction of life insurance applicant using machine learning. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 2213–2220.
24. Sadreddini, Z.; Donmez, I.; Yanikomeroglu, H. Cancel-for-Any-Reason Insurance Recommendation Using Customer Transaction-Based Clustering. IEEE Access 2021, 9, 39363–39374.
25. Sari, P.K.; Purwadinata, A. Analysis characteristics of car sales in E-commerce data using clustering model. J. Data Sci. Appl. 2019, 2, 19–28.
26. Dietterich, T.G. Ensemble methods in machine learning. In Proceedings of the International Workshop on Multiple Classifier Systems, Cagliari, Italy, 21–23 June 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
27. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012.
28. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
29. Elbhrawy, A.S.; Belal, M.A.; Hassanein, M.S. CES: Cost Estimation System for Enhancing the Processing of Car Insurance Claims. J. Comput. Commun. 2024, 3, 55–69.
30. De Meulemeester, H.; De Moor, B. Unsupervised embeddings for categorical variables. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
31. Kolambe, S.; Kaur, P. Survey on Insurance Claim analysis using Natural Language Processing and Machine Learning. Int. J. Recent Innov. Trends Comput. Commun. 2023, 11, 30–38.
32. Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 2018, 13, 55–75.
33. Cambria, E.; White, B. Jumping NLP curves: A review of natural language processing research. IEEE Comput. Intell. Mag. 2014, 9, 48–57.
34. Orji, U.; Ukwandu, E. Machine learning for an explainable cost prediction of medical insurance. Mach. Learn. Appl. 2024, 15, 100516.
35. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
36. Le, T.T.H.; Prihatno, A.T.; Oktian, Y.E.; Kang, H.; Kim, H. Exploring local explanation of practical industrial AI applications: A systematic literature review. Appl. Sci. 2023, 13, 5809.
37. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
38. Sharma, A. Demystifying Privacy-preserving AI: Strategies for Responsible Data Handling. MZ J. Artif. Intell. 2024, 1, 1–8.
39. Voigt, P.; Von dem Bussche, A. The EU General Data Protection Regulation (GDPR). In A Practical Guide, 1st ed.; Springer International Publishing: Cham, Switzerland, 2017; Volume 10.
40. Hastie, T.; Mazumder, R.; Lee, J.D.; Zadeh, R. Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 2015, 16, 3367–3402.
41. Rafsunjani, S.; Safa, R.S.; Al Imran, A.; Rahim, M.S.; Nandi, D. An empirical comparison of missing value imputation techniques on APS failure prediction. Int. J. Inf. Technol. Comput. Sci. 2019, 2, 21–29.
42. Hancock, J.; Khoshgoftaar, T.M. Leveraging lightgbm for categorical big data. In Proceedings of the 2021 IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService), Virtual, 23–26 August 2021; pp. 149–154.
43. Li, S.; Zhang, X. Research on orthopedic auxiliary classification and prediction model based on XGBoost algorithm. Neural Comput. Appl. 2020, 32, 1971–1979.
44. Alzamzami, F.; Hoda, M.; El Saddik, A. Light gradient boosting machine for general sentiment classification on short texts: A comparative evaluation. IEEE Access 2020, 8, 101840–101858.
45. Abdurrahman, G.; Sintawati, M. Implementation of Xgboost for Classification of Parkinson’s Disease. J. Phys. Conf. Ser. 2020, 1538, 012024.
46. Sari, L.; Romadloni, A.; Lityaningrum, R.; Hastuti, H.D. Implementation of LightGBM and Random Forest in Potential Customer Classification. TIERS Inf. Technol. J. 2023, 4, 43–55.
47. Dreiseitl, S.; Ohno-Machado, L. Logistic regression and artificial neural network classification models: A methodology review. J. Biomed. Inform. 2002, 35, 352–359.
48. Charbuty, B.; Abdulazeez, A. Classification based on decision tree algorithm for machine learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28.
49. Gladence, L.M.; Karthi, M.; Anu, V.M. A statistical comparison of logistic regression and different Bayes classification methods for machine learning. ARPN J. Eng. Appl. Sci. 2015, 10, 5947–5953.
50. Carrington, A.M.; Manuel, D.G.; Fieguth, P.W.; Ramsay, T.; Osmani, V.; Wernly, B.; Bennett, C.; Hawken, S.; Magwood, O.; Sheikh, Y.; et al. Deep ROC analysis and AUC as balanced average accuracy, for improved classifier selection, audit and explanation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 329–341.
51. Akula, R. Fraud identification of credit card using ML techniques. Int. J. Comput. Artif. Intell. 2020, 1, 31–33.
Process | Description | Method |
---|---|---|
Handling missing values | Imputation of missing data, to ensure dataset completeness and integrity. | The ‘soft impute’ method was employed; it ensured that the imputed values maintained the inherent patterns and relationships within the dataset, minimizing distortion of the original data distribution [40,41]. |
Outlier management | Identification and handling of anomalous records, to refine data quality. | Removal of fleet data and review of anomalous values. |
Date variable transformation | Transformation of date variables to year format, to streamline analysis and simplify temporal data handling. | Conversion to year format. |
Categorical variable encoding | Transformation of categorical variables into a numerical format, to facilitate model interpretation. | By using the ‘one-hot encoding’ method [42], each vehicle type category was substituted by the corresponding dummy variables without imposing hierarchy or order on them. |
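
As a rough illustration of the last two rows of the table above, the sketch below converts a date to its year and one-hot encodes a vehicle type column using only the Python standard library. The field names (`vehicle_type`, `policy_start`) and the category values are hypothetical, and this is not the authors' pipeline; the soft impute step is not shown.

```python
from datetime import date


def one_hot(values, prefix):
    """Return one dict of 0/1 dummy variables per input value."""
    # Sorting only fixes the column order; it implies no ranking of categories.
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


# Hypothetical records; the real dataset's column names are not given.
records = [
    {"vehicle_type": "car", "policy_start": date(2021, 3, 15)},
    {"vehicle_type": "van", "policy_start": date(2019, 11, 2)},
    {"vehicle_type": "car", "policy_start": date(2022, 7, 30)},
]

dummies = one_hot([r["vehicle_type"] for r in records], "vehicle_type")

# Keep only the year of each date and attach the dummy variables.
prepared = [
    {"policy_start_year": r["policy_start"].year, **d}
    for r, d in zip(records, dummies)
]
print(prepared)
```

Each category becomes its own 0/1 column, so no artificial order is imposed on the vehicle types.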
Variable | Description | Data Type | Transformation |
---|---|---|---|
Age | The customer’s age in years at the time the policy is taken out. | Discrete | From date format to discrete number. |
Risk | Level of associated risk, derived from whether the driver, the vehicle owner, and the policy payer are the same or different persons. | Categorical | Elimination of driver, owner, and co-driver identification variables. |
Horse power | The power of the vehicle’s engine. | Discrete | N/A |
Historical family premiums | Accumulated net premiums of policies contracted by the customer’s family members in the company. | Continuous | N/A |
License age | The number of years the customer has held a driver’s license. | Discrete | Calculated from the variables card issue date and policy application date. |
Hiring channel score | Score assigned to the salesperson in charge of the policy contracting process. | Categorical | N/A |
INE income | Average annual net income of the region where the customer resides, according to INE. | Continuous | Missing values were imputed using the soft impute technique. |
DGT accident | Average annual number of fatal accidents recorded by the DGT, computed as the 2016–2022 average. | Continuous | Missing values were imputed using the soft impute technique. |
Exposure | Duration of the customer’s insurance in the last 5 years, extracted from the SINCO database. | Discrete | Calculated from the dates of the policies registered in SINCO. |
Frequency of material damage | Frequency of accidents with material damage obtained after consulting SINCO. | Continuous | Calculated from the dates of occurrence of this type of accident. |
Frequency of personal damage | Frequency of accidents with personal damage obtained after consulting SINCO. | Continuous | Calculated from the dates of occurrence of this type of accident. |
Claim outcome score | Score assessing the claims obtained from SINCO. | Discrete | N/A |
Statistic | Value |
---|---|
Count | 51,618 customers |
Mean | 18 euros |
StDev | 2277 euros |
Min | −185,990 euros |
Q1 | 73 euros |
Median | 178 euros |
Q3 | 262 euros |
Max | 21,747 euros |
Model | ROC AUC (Train) | Low Perform. (Test) | Est. Benefit (Test) | Time (s) |
---|---|---|---|---|
naive Bayes | 0.624 | 1462 (9%) | 1,092,228 € | 9 |
decision trees | 0.633 | 1634 (11%) | 1,228,023 € | 9 |
logistic regression | 0.653 | 194 (1%) | 513,173 € | 11 |
multi-layer perceptron | 0.660 | 48 (0%) | 445,084 € | 18 |
extra trees | 0.669 | 1583 (10%) | 1,249,698 € | 20 |
random forest | 0.687 | 1571 (10%) | 1,270,656 € | 19 |
adaboost | 0.697 | 264 (2%) | 611,936 € | 12 |
XGB | 0.709 | 1099 (7%) | 1,269,588 € | 10 |
gradient boosting | 0.713 | 973 (6%) | 1,183,631 € | 15 |
LGBM | 0.717 | 1000 (6%) | 1,226,696 € | 10 |
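
The ROC AUC column in the table above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that pairwise definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the data or the tooling behind the table.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels; scores: predicted scores.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The double loop is quadratic in the number of samples, so it is only meant to make the definition concrete; a rank-based implementation would be preferable on datasets of the size reported here.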
Model | ROC AUC (Train) | ROC AUC (Val) | Low Perform. (Test) | Est. Benefit (Test) | Time (s) |
---|---|---|---|---|---|
LGBM | 0.738 | 0.723 | 423 (3%) | 763,145 € | 8 |
XGB | 0.748 | 0.742 | 1003 (6%) | 1,232,663 € | 10 |
Statistic | Value |
---|---|
Count | 14,483 customers |
Mean | 85 euros |
StDev | 1964 euros |
Min | −176,973 euros |
Q1 | 92 euros |
Median | 183 euros |
Q3 | 267 euros |
Max | 11,774 euros |
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.84 | 0.97 | 0.90 | 12,513 |
1 | 0.67 | 0.23 | 0.34 | 2973 |
Accuracy | | | 0.83 | 15,486 |
Macro Avg | 0.76 | 0.60 | 0.62 | 15,486 |
Weighted Avg | 0.81 | 0.83 | 0.79 | 15,486 |
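
The derived figures in the classification report above follow from the per-class precision, recall, and support. As a quick consistency check, the sketch below recomputes the F1-scores, the macro and weighted F1 averages, and the accuracy from the rounded values in the table; the variable names are illustrative.

```python
# Per-class (precision, recall, support), copied from the report above.
report = {
    0: (0.84, 0.97, 12513),
    1: (0.67, 0.23, 2973),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


total_support = sum(support for _, _, support in report.values())
f1_by_class = {cls: f1(p, r) for cls, (p, r, _) in report.items()}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(
    f1_by_class[cls] * support for cls, (_, _, support) in report.items()
) / total_support

# Accuracy equals the support-weighted mean of the per-class recalls.
accuracy = sum(r * support for _, r, support in report.values()) / total_support

print({cls: round(v, 2) for cls, v in f1_by_class.items()})
print(round(macro_f1, 2), round(weighted_f1, 2), round(accuracy, 2))
```

The gap between the F1-scores of class 0 and class 1 comes almost entirely from the low recall of class 1 (0.23), which the overall accuracy of 0.83 hides because class 0 accounts for most of the support.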
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Soriano-Gonzalez, R.; Tsertsvadze, V.; Osorio, C.; Fuster, N.; Juan, A.A.; Perez-Bernabeu, E. Balancing Risk and Profit: Predicting the Performance of Potential New Customers in the Insurance Industry. Information 2024, 15, 546. https://doi.org/10.3390/info15090546