GA-CatBoost-Weight Algorithm for Predicting Casualties in Terrorist Attacks: Addressing Data Imbalance and Enhancing Performance
Abstract
1. Introduction
2. Literature Review
- First, we propose a prediction model for innocent civilian casualties in terrorist attacks, named GA-CatBoost-Weight and built on CatBoost. CatBoost can directly handle numerical, categorical, and textual features, and its strong predictive performance has led to wide application across fields. To our knowledge, however, no prior research has applied CatBoost to the issue of terrorist attacks. We therefore combine the CatBoost algorithm with several performance-enhancing strategies to predict whether a terrorist attack will result in casualties;
- Secondly, we employ RF-RFE for feature selection. High-dimensional features not only increase a model's computational cost but can also degrade its performance. In this paper, we combine RF and RFE to reduce feature redundancy and effectively lower computational cost. The method obtains an importance score for each feature using RF, reduces the number of features according to the importance ranking to generate a model performance curve, and selects the optimal feature subset from the trend of that curve (a brief sketch follows this list);
- Thirdly, we conduct hyperparameter tuning for CatBoost. Because traditional data sampling techniques struggle to handle textual information under class imbalance, we propose using CatBoost's built-in parameters to mitigate the imbalance present in terrorist attack data. Rather than applying additional processing at the data level, we address the imbalance from the model side in an end-to-end manner, avoiding the loss of semantic information in textual features. GA is an effective hyperparameter optimization algorithm that has rarely been combined with CatBoost for tuning. Hence, in this study, we choose the genetic algorithm to further enhance the performance of our casualty prediction model for terrorist attacks.
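As a hedged illustration of the RF-RFE idea in the second contribution, the sketch below approximates it with scikit-learn's RFECV wrapped around a random forest: features are eliminated recursively by importance, a cross-validated performance curve is traced, and the best-scoring subset is kept. The synthetic dataset is a stand-in for the paper's real features, and the scorer and fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data; the paper uses GTD-derived features instead.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

# RF-RFE sketch: recursively drop the least-important feature (ranked by
# random forest importances) and keep the subset with the best CV score.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=1,        # drop one feature per elimination round
    cv=5,          # 5-fold CV traces the model performance curve
    scoring="f1",  # illustrative choice of performance metric
)
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
X_reduced = selector.transform(X)  # the optimal feature subset
```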
3. Materials and Methods
3.1. Data Preprocessing
3.2. Feature Processing
3.3. Hyperparameter Tuning Method Based on CatBoost
CatBoost builds on the gradient boosting decision tree (GBDT) framework, which proceeds as follows:
- Initialize the first decision tree:
  $f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$
- For each iteration $d = 1, 2, \ldots, D$:
  - (a) Compute the negative gradient of the loss function to fit the residual values in the current iteration of the model:
    $r_{di} = -\left[ \dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{d-1}(x)}, \quad i = 1, 2, \ldots, N$
  - (b) Fit a decision tree (using a CART regression tree as an example) to the pairs $(x_i, r_{di})$, yielding leaf node regions $R_{dt}$, where $t = 1, 2, \ldots, T$ indexes the leaf nodes. Calculate the optimal value within each leaf node region:
    $c_{dt} = \arg\min_{c} \sum_{x_i \in R_{dt}} L\big(y_i, f_{d-1}(x_i) + c\big)$
  - (c) Update the model:
    $f_d(x) = f_{d-1}(x) + \sum_{t=1}^{T} c_{dt}\, \mathbb{I}(x \in R_{dt})$
- Output the final strong learner $f_D(x)$.
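To ground these steps, here is a minimal, self-contained toy implementation of the generic GBDT loop with squared loss and depth-1 regression stumps. It illustrates steps (a) through (c) only; it is our own sketch, not CatBoost's implementation, and all names and constants in it are illustrative.

```python
import numpy as np

def fit_stump(x, r):
    """Find the single-feature threshold split minimising squared error on r."""
    best = None
    for thr in np.unique(x):
        left, right = r[x <= thr], r[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]  # (threshold, left leaf value, right leaf value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

f = np.full_like(y, y.mean())   # f_0: the constant minimiser of squared loss
lr, D = 0.1, 50                 # learning rate and number of iterations
for d in range(D):
    r = y - f                   # negative gradient of squared loss = residual
    thr, c_l, c_r = fit_stump(x, r)          # step (b): fit tree, get leaf values
    f += lr * np.where(x <= thr, c_l, c_r)   # step (c): f_d = f_{d-1} + lr * tree_d

print("training MSE:", ((y - f) ** 2).mean())
```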
- 1. Feature handling: CatBoost introduces the Ordered Target Statistic method to handle categorical features. Under a random permutation $\sigma$ of the samples, each categorical feature value is encoded with target statistics computed only from the samples that precede it in the permutation, helping the model exploit the relationship between a category and the target without leaking the current sample's label (a runnable sketch follows this list). The formula is as follows:
  $\hat{x}^{i}_{\sigma_p} = \dfrac{\sum_{j=1}^{p-1} \mathbb{I}\big(x^{i}_{\sigma_j} = x^{i}_{\sigma_p}\big)\, y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} \mathbb{I}\big(x^{i}_{\sigma_j} = x^{i}_{\sigma_p}\big) + a}$
  where $x^{i}_{\sigma_p}$ is the value of the $i$-th categorical feature for the sample at position $p$ of the permutation, $a > 0$ is a prior weight, and $P$ is the prior value.
- 2. Addressing gradient bias: Traditional GBDT methods estimate gradients on the same data used to train the model, which can lead to cumulative bias and overfitting because the estimated distribution does not fully match the true one. To address this issue, CatBoost introduces the Ordered Boosting method. The approach first shuffles the sample data; for a permutation $\sigma$, $n$ models are trained, where $n$ is the number of samples, and the model used to compute the residual of the sample at position $t$ is trained only on the samples preceding position $t$ in the permutation.
- 3. Symmetric trees: Compared to conventional decision trees, CatBoost uses an oblivious (symmetric) tree structure, which has the following characteristics:
  - (a) Symmetric splitting: In contrast to traditional decision tree algorithms, which select a separate optimal split at every node, the symmetric tree in CatBoost applies the same feature-threshold split to all nodes at a given level. This level-wise splitting acts as a form of regularization and allows very efficient evaluation, enhancing the model's training efficiency and generalization capability;
  - (b) Feature interaction: In a symmetric tree, the sequence of level-wise splits jointly considers the influence of multiple features. This helps the model capture feature interactions better, enhancing accuracy and robustness.
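The sketch below, our own toy illustration rather than CatBoost internals, implements the ordered target statistic from item 1 under a single random permutation; the same permutation-prefix idea underlies Ordered Boosting in item 2. The prior value and prior weight are illustrative assumptions.

```python
import numpy as np

def ordered_target_statistic(cats, y, prior=0.5, a=1.0, seed=0):
    """Encode each sample using only the target values of samples that
    precede it in a random permutation and share its category."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))
    sums, counts = {}, {}
    enc = np.empty(len(cats))
    for idx in perm:                         # walk the random permutation
        c = cats[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[idx] = (s + a * prior) / (n + a)  # statistics from the prefix only
        sums[c] = s + y[idx]                  # update history after encoding
        counts[c] = n + 1
    return enc

cats = np.array(["a", "b", "a", "a", "b", "c"])
y = np.array([1, 0, 1, 0, 1, 0])
print(ordered_target_statistic(cats, y))
```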
Algorithm 1. GA-CatBoost hyperparameter tuning mechanism.
Input: cross-validation fold K, mutation type MT, fitness function Func, crossover type CT, total iterations I, dataset D, crossover probability C, mutation probability M
Output: The optimal hyperparameter values for CatBoost
1. Initialize i to 0
2. Initialize the population randomly
3. While i < I do:
4.  For each solution in the population do
5.   Extract the CatBoost hyperparameters from the solution
6.   Split D into K parts; in each fold, one part serves as the testing set and the rest as the training set
7.   For each fold from 1 to K do
8.    Train CatBoost on the training set
9.    Predict on the testing set with the trained CatBoost
10.    Calculate the fitness value based on Func
11.   End for
12.   Compare the fold results and record the solution's fitness and hyperparameters
13.  End for
14.  Select solutions using roulette wheel selection
15.  Apply crossover to the selected solutions with CT and C
16.  Mutate the new solutions with MT and M
17.  Generate the new population
18.  Increment i
19. End while
20. Return the optimal hyperparameter values for CatBoost
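To make the mechanism concrete, below is a minimal Python sketch of Algorithm 1. It assumes a prepared feature matrix X and binary label vector y; the population size, operator implementations, and the fixed tree count of 200 are illustrative assumptions rather than the paper's exact settings, and cross-validated F1 plays the role of Func.

```python
import random
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

SPACE = {  # (low, high) bounds for each tuned hyperparameter
    "learning_rate":      (0.01, 1.0),
    "depth":              (1, 16),
    "l2_leaf_reg":        (0.0, 10.0),
    "min_data_in_leaf":   (1, 1000),
    "max_ctr_complexity": (1, 10),
}
INT_PARAMS = {"depth", "min_data_in_leaf", "max_ctr_complexity"}

def random_solution():
    """One chromosome: a random point inside the search intervals."""
    return {k: (random.randint(int(b[0]), int(b[1])) if k in INT_PARAMS
                else random.uniform(*b)) for k, b in SPACE.items()}

def fitness(sol, X, y, k=5):
    """K-fold cross-validated F1 serves as Func in Algorithm 1."""
    model = CatBoostClassifier(**sol, auto_class_weights="Balanced",
                               iterations=200, verbose=False)
    return cross_val_score(model, X, y, cv=k, scoring="f1").mean()

def roulette(pop, fits):
    """Roulette wheel selection: probability proportional to fitness."""
    r, acc = random.uniform(0, sum(fits)), 0.0
    for sol, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return sol
    return pop[-1]

def crossover(a, b, pc=0.8):
    """Uniform crossover applied with probability pc."""
    if random.random() > pc:
        return dict(a)
    return {k: (a[k] if random.random() < 0.5 else b[k]) for k in a}

def mutate(sol, pm=0.1):
    """Re-sample each gene with probability pm."""
    fresh = random_solution()
    return {k: (fresh[k] if random.random() < pm else v)
            for k, v in sol.items()}

def ga_tune(X, y, pop_size=10, iters=20):
    pop = [random_solution() for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(iters):
        fits = [fitness(s, X, y) for s in pop]
        for s, f in zip(pop, fits):            # track the best solution so far
            if f > best_fit:
                best, best_fit = s, f
        pop = [mutate(crossover(roulette(pop, fits), roulette(pop, fits)))
               for _ in range(pop_size)]       # next generation
    return best, best_fit
```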
4. Results
4.1. Model Training
4.2. Comparison among CatBoost and Other Classification Methods
4.3. Comparison between Training Models with Different Fitness Evaluations
4.4. Comparison between Different Hyperparameter Tuning Methods
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Hyperparameter Name | Interval | Description
---|---|---
learning_rate | [0.01, 1] | Step weight of each boosting iteration
depth | [1, 16] | Limits the maximum depth of the tree model
l2_leaf_reg | [0, 10] | Penalizes model complexity
min_data_in_leaf | [1, 1000] | Makes the model more robust
max_ctr_complexity | [1, 10] | Controls the complexity of feature combinations
auto_class_weights | Balanced | Automatically adapts to the class imbalance issue
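For reference, a hedged example of how these hyperparameters map onto the CatBoost API; the concrete values below are placeholders within the intervals above, not the tuned values reported in the paper.

```python
from catboost import CatBoostClassifier

# Illustrative placeholder values inside the search intervals from the table.
model = CatBoostClassifier(
    learning_rate=0.1,              # in [0.01, 1]
    depth=6,                        # in [1, 16]
    l2_leaf_reg=3.0,                # in [0, 10]
    min_data_in_leaf=20,            # in [1, 1000]
    max_ctr_complexity=4,           # in [1, 10]
    auto_class_weights="Balanced",  # built-in handling of class imbalance
    verbose=False,
)
# Categorical and textual columns can be passed directly at fit time, e.g.:
# model.fit(X_train, y_train, cat_features=cat_cols, text_features=text_cols)
```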
Feature Processing | Training Model | AUC (%) | Accuracy (%) | F1 (%) | Sensitivity (%) | Precision (%)
---|---|---|---|---|---|---
LabelEncoder | Logistic regression | 61.54 | 70.93 | 80.48 | 90.74 | 72.31
 | AdaBoost | 72.62 | 78.55 | 84.87 | 91.07 | 79.46
 | Decision tree | 74.98 | 77.25 | 82.65 | 82.04 | 83.27
 | Random forest | 79.22 | 82.65 | 87.26 | 89.90 | 84.76
 | XGBoost | 78.56 | 82.52 | 87.29 | 90.87 | 83.98
Built-in category processing | LightGBM | 78.87 | 82.60 | 87.29 | 90.48 | 84.32
 | CatBoost | 78.24 | 82.24 | 87.09 | 90.69 | 87.37
 | CatBoost (text) | 84.06 | 86.79 | 90.25 | 92.56 | 88.06
 | GA-CatBoost | 85.50 | 87.77 | 90.91 | 92.59 | 89.29
 | GA-CatBoost-Weight | 85.59 | 87.87 | 90.99 | 92.68 | 89.35
Training Model | AUC (%) | Accuracy (%) | F1 (%) | Sensitivity (%) | Precision (%)
---|---|---|---|---|---
GA-CatBoost-Accuracy | 84.92 | 87.39 | 90.65 | 92.59 | 88.80
GA-CatBoost-Sensitivity | 84.92 | 87.39 | 90.65 | 92.59 | 88.80
GA-CatBoost-Precision | 84.91 | 87.38 | 90.65 | 92.59 | 88.78
GA-CatBoost-F1 | 84.92 | 87.39 | 90.65 | 92.59 | 88.80
GA-CatBoost-AUC | 84.91 | 87.38 | 90.65 | 92.59 | 88.78
Training Model | AUC (%) | Accuracy (%) | F1 (%) | Sensitivity (%) | Precision (%)
---|---|---|---|---|---
Manual | 78.52 | 82.22 | 87.00 | 90.04 | 84.16
Grid search | 84.64 | 87.13 | 90.46 | 92.38 | 88.62
Random search | 81.82 | 84.93 | 88.91 | 91.49 | 86.48
Bayesian | 83.81 | 86.51 | 90.03 | 92.19 | 87.96
Genetic algorithm | 84.92 | 87.39 | 90.65 | 92.59 | 88.80