A Comparative Analysis of the TDCGAN Model for Data Balancing and Intrusion Detection
Abstract
1. Introduction
- Use the TDCGAN model to address data imbalance in four benchmark intrusion detection datasets.
- Evaluate several machine learning classifiers on intrusion detection across the four benchmark datasets, before and after data balancing.
- Perform a comparative analysis of the models used.
2. Materials and Methods
2.1. Datasets
2.2. Machine Learning Methods
3. Experimental Setup
3.1. Data Selection and Preparation
- Stratified sampling: The subset was selected with stratified sampling so that each attack type keeps the same proportional representation it has in the full dataset.
- Class balancing: Additional measures were applied to balance the representation of the attack types within the subset. These measures may include oversampling the minority classes or undersampling the majority classes.
- Randomization: Records were drawn at random during subset selection, so that the selection did not depend on record order or on any predetermined bias.
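The three steps above can be illustrated with a minimal sketch that uses only the Python standard library. The function names `stratified_subset` and `oversample_minorities`, the toy label counts, and the 10% sampling fraction are assumptions made for illustration and are not taken from the paper. Random duplication stands in for the balancing step here; it is not the TDCGAN generator.

```python
import random
from collections import Counter, defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices of a random subset that keeps each class's share."""
    rng = random.Random(seed)  # seeded so the random draw is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        # Keep at least one record so that rare attack types are not lost.
        keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


def oversample_minorities(indices, labels, seed=0):
    """Duplicate minority-class indices at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)  # remove any ordering by class
    return balanced


# Toy label vector (hypothetical counts, not the paper's datasets).
labels = ["Benign"] * 900 + ["DDoS"] * 90 + ["Bot"] * 10
subset = stratified_subset(labels, fraction=0.1, seed=42)
subset_counts = Counter(labels[i] for i in subset)
balanced = oversample_minorities(subset, labels, seed=42)
balanced_counts = Counter(labels[i] for i in balanced)
print(subset_counts)
print(balanced_counts)
```

In this toy case the stratified subset keeps the 90:9:1 class ratio of the full label vector, and the oversampling step then raises every class to the size of the largest one. Duplicating records adds no new information, which is one motivation for generative approaches to balancing.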
3.2. Feature Selection
3.3. The Experimental Setup
3.4. Evaluation Metrics
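The result tables report accuracy, precision, recall, and F1-score. As a minimal sketch, the standard definitions of these metrics can be computed from confusion-matrix counts as shown below. The function name `classification_metrics`, the binary (attack versus benign) framing, and the example counts are assumptions for illustration; the averaging scheme used for the multi-class results is not stated in this extract.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Undefined ratios (zero denominator) return 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a heavily imbalanced test set.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=890)
print(metrics)
```

Because F1 is a harmonic mean, it always lies between precision and recall. On imbalanced data, accuracy can remain high even when the minority class is poorly detected, which is why the four metrics are reported together.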
4. Results and Discussion
4.1. Experimental Results
4.2. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Dataset | Original Class | Samples | Selected Class | Samples |
---|---|---|---|---|
CIC-IDS2017 | Benign | 2,273,097 | Benign | 2,273,097 |
CIC-IDS2017 | DOS Hulk | 231,073 | DOS Hulk | 231,073 |
CIC-IDS2017 | PortScan | 158,930 | PortScan | 158,930 |
CIC-IDS2017 | DDoS | 128,027 | DDoS | 128,027 |
CIC-IDS2017 | DOS GoldenEye | 10,293 | DOS GoldenEye | 10,293 |
CIC-IDS2017 | FTP-Patator | 7938 | FTP-Patator | 7938 |
CIC-IDS2017 | SSH-Patator | 5879 | SSH-Patator | 5879 |
CIC-IDS2017 | DoS slowloris | 5796 | DoS slowloris | 5796 |
CIC-IDS2017 | DoS slowhttptest | 5499 | DoS slowhttptest | 5499 |
CIC-IDS2017 | Bot | 1966 | Bot | 1966 |
CIC-IDS2017 | Web Attack Brute Force | 1507 | Other Attack | 2227 |
CIC-IDS2017 | Web Attack XSS | 652 | | |
CIC-IDS2017 | Infiltration | 367 | | |
CIC-IDS2017 | Web Attack Sql Injection | 21 | | |
CIC-IDS2017 | Heartbleed | 11 | | |
CSE-CIC-IDS2018 | Benign | 1,222,612 | Benign | 1,222,612 |
CSE-CIC-IDS2018 | DDOS attack-HOIC | 137,157 | DDOS attack-HOIC | 137,157 |
CSE-CIC-IDS2018 | DOS attack-Hulk | 92,325 | DOS attack-Hulk | 92,325 |
CSE-CIC-IDS2018 | Bot | 57,067 | Bot | 57,067 |
CSE-CIC-IDS2018 | FTP-Bruteforce | 38,881 | FTP-Bruteforce | 38,881 |
CSE-CIC-IDS2018 | SSH-Bruteforce | 37,403 | SSH-Bruteforce | 37,403 |
CSE-CIC-IDS2018 | Infilteration | 32,470 | Infilteration | 32,470 |
CSE-CIC-IDS2018 | DOS attacks-SlowHTTPTest | 27,902 | DOS attacks-SlowHTTPTest | 27,902 |
CSE-CIC-IDS2018 | DOS attacks-GoldenEye | 8284 | DOS attacks-GoldenEye | 8284 |
CSE-CIC-IDS2018 | DDOS attack-LOIC-UDP | 343 | DOS attacks-Slowloris | 2216 |
CSE-CIC-IDS2018 | Brute Force-Web | 113 | Other Attacks | 533 |
CSE-CIC-IDS2018 | Brute Force-XSS | 49 | | |
CSE-CIC-IDS2018 | SQL Injection | 16 | | |
CSE-CIC-IDS2018 | Label | 12 | | |
KDD-cup 99 | Smurf | 2,807,886 | Smurf | 2,807,886 |
KDD-cup 99 | Neptune | 1,072,017 | Neptune | 1,072,017 |
KDD-cup 99 | Normal | 972,780 | Normal | 972,780 |
KDD-cup 99 | Satan | 15,892 | Satan | 15,892 |
KDD-cup 99 | Ipsweep | 12,481 | Ipsweep | 12,481 |
KDD-cup 99 | Portsweep | 10,413 | Portsweep | 10,413 |
KDD-cup 99 | Nmap | 2316 | Nmap | 2316 |
KDD-cup 99 | back | 2203 | back | 2203 |
KDD-cup 99 | Warezclient | 1020 | Other attack | 1422 |
KDD-cup 99 | Teardrop | 979 | warezclient | 1020 |
KDD-cup 99 | Pod | 264 | | |
KDD-cup 99 | Guess_passwd | 53 | | |
KDD-cup 99 | Buffer_overflow | 30 | | |
KDD-cup 99 | Land | 21 | | |
KDD-cup 99 | Warezmaster | 20 | | |
KDD-cup 99 | imap | 12 | | |
KDD-cup 99 | Rootkit | 10 | | |
KDD-cup 99 | Loadmodule | 9 | | |
KDD-cup 99 | ftp_write | 8 | | |
KDD-cup 99 | Multihop | 7 | | |
KDD-cup 99 | Phf | 4 | | |
KDD-cup 99 | Perl | 3 | | |
KDD-cup 99 | spy | 2 | | |
BOT-IOT | UDP | 3,170,060 | UDP | 3,170,060 |
BOT-IOT | TCP | 2,549,206 | TCP | 2,549,206 |
BOT-IOT | Service_Scan | 117,170 | Service_Scan | 117,170 |
BOT-IOT | OS_Fingerprint | 28,479 | OS_Fingerprint | 28,479 |
BOT-IOT | HTTP | 3902 | HTTP | 3902 |
BOT-IOT | Normal | 707 | Normal | 707 |
BOT-IOT | Keylogging | 100 | Other Attack | 111 |
BOT-IOT | Data_Exfiltration | 11 | | |
[Class distribution before and after data balancing for CIC-IDS2017, CSE-CIC-IDS2018, KDD-cup 99, and BOT-IOT]
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.96 | 0.97 | 0.99 | 0.98 |
CSE-CIC-IDS2018 | 0.95 | 0.95 | 0.97 | 0.98 |
KDD-cup 99 | 0.95 | 0.96 | 0.95 | 0.96 |
BOT-IOT | 0.96 | 0.95 | 0.95 | 0.96 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.70 | 0.92 | 0.86 | 0.89 |
CSE-CIC-IDS2018 | 0.69 | 0.89 | 0.82 | 0.86 |
KDD-cup 99 | 0.75 | 0.85 | 0.63 | 0.73 |
BOT-IOT | 0.72 | 0.83 | 0.79 | 0.80 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.85 | 0.92 | 0.90 | 0.90 |
CSE-CIC-IDS2018 | 0.77 | 0.91 | 0.89 | 0.86 |
KDD-cup 99 | 0.88 | 0.89 | 0.78 | 0.82 |
BOT-IOT | 0.88 | 0.92 | 0.91 | 0.90 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.68 | 0.83 | 0.77 | 0.72 |
CSE-CIC-IDS2018 | 0.61 | 0.79 | 0.78 | 0.80 |
KDD-cup 99 | 0.70 | 0.82 | 0.75 | 0.82 |
BOT-IOT | 0.81 | 0.87 | 0.88 | 0.90 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.78 | 0.80 | 0.83 | 0.79 |
CSE-CIC-IDS2018 | 0.82 | 0.84 | 0.86 | 0.87 |
KDD-cup 99 | 0.88 | 0.90 | 0.89 | 0.89 |
BOT-IOT | 0.90 | 0.89 | 0.91 | 0.91 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.82 | 0.84 | 0.85 | 0.87 |
CSE-CIC-IDS2018 | 0.84 | 0.88 | 0.82 | 0.84 |
KDD-cup 99 | 0.83 | 0.82 | 0.84 | 0.82 |
BOT-IOT | 0.85 | 0.83 | 0.81 | 0.82 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.92 | 0.90 | 0.93 | 0.91 |
CSE-CIC-IDS2018 | 0.93 | 0.92 | 0.94 | 0.91 |
KDD-cup 99 | 0.92 | 0.91 | 0.92 | 0.92 |
BOT-IOT | 0.92 | 0.93 | 0.94 | 0.94 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.60 | 0.71 | 0.72 | 0.69 |
CSE-CIC-IDS2018 | 0.65 | 0.67 | 0.70 | 0.68 |
KDD-cup 99 | 0.70 | 0.75 | 0.80 | 0.82 |
BOT-IOT | 0.72 | 0.74 | 0.78 | 0.77 |
Dataset | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
CIC-IDS2017 | 0.80 | 0.85 | 0.82 | 0.81 |
CSE-CIC-IDS2018 | 0.73 | 0.81 | 0.73 | 0.76 |
KDD-cup 99 | 0.85 | 0.87 | 0.85 | 0.87 |
BOT-IOT | 0.80 | 0.82 | 0.81 | 0.83 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jamoos, M.; Mora, A.M.; AlKhanafseh, M.; Surakhi, O. A Comparative Analysis of the TDCGAN Model for Data Balancing and Intrusion Detection. Signals 2024, 5, 580-596. https://doi.org/10.3390/signals5030032