Improving Detection of False Data Injection Attacks Using Machine Learning with Feature Selection and Oversampling
Abstract
1. Introduction
- We provide a comprehensive analysis of machine-learning algorithms for FDIA detection using two representative datasets: a power system dataset and a water treatment dataset.
- We identify the feature subset that yields the best detection performance, using different filter and wrapper approaches.
- We mitigate the performance bias caused by imbalanced datasets using four different oversampling methods.
2. Related Work
3. Critical Infrastructure Experimental Framework
3.1. The Experimental Framework for Power System Data
3.1.1. Dataset Pre-Processing
3.1.2. Description of Features
3.2. Water Treatment Plant
3.2.1. Dataset Pre-Processing
3.2.2. Feature Description
4. Feature Selection
4.1. Filter Method
4.2. Wrapper Method
4.3. Embedded Method
5. Imbalanced Dataset: Issue and Solution
5.1. Synthetic Minority Oversampling Technique (SMOTE)
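As a rough illustration of the idea behind SMOTE (Chawla et al.), the sketch below creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. It is a simplified, standard-library-only sketch and not the implementation used in this work; the function name `smote`, its parameters, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and one of
    its k nearest minority-class neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (illustrative values only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. The borderline and adaptive variants listed in the following subsections differ mainly in which minority samples they choose as the base.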
5.2. Borderline-SMOTE
5.3. Borderline Oversampling
5.4. Adaptive Synthetic (ADASYN) Sampling
6. Experiments and Results
6.1. Experimental System
6.2. Machine Learning Algorithms
6.3. Training and Testing
6.3.1. Percentage Split (70–30)
6.3.2. Testing with Cross-Validation
6.4. Imbalanced Dataset and Impact
6.5. Comparison with Previous Works
7. Conclusions and Future Scope
Author Contributions
Funding
Conflicts of Interest
Sample Availability
Abbreviations
ANN | Artificial Neural Network |
AI | Artificial Intelligence |
ARFF | Attribute-Relation File Format |
AUC | Area Under Curve |
CPPS | Cyber-Physical Power System |
CPU | Central Processing Unit |
CSV | Comma-Separated Value |
DT | Decision Tree |
FDIA | False Data Injection Attack |
GB | Gradient Boost |
GUI | Graphical User Interface |
HMI | Human–Machine Interfaces |
ICT | Information and Communication Technology |
IDS | Intrusion Detection System |
IPS | Intrusion Prevention System |
IED | Intelligent Electronic Device |
kNN | k-Nearest Neighbour |
LR | Linear Regression |
LTS | Long Term Support |
ML | Machine Learning |
NB | Naive Bayes |
OCSVM | One-Class Support Vector Machine |
PCA | Principal Component Analysis |
PDC | Power Distribution Center |
PLC | Programmable Logic Controllers |
PMU | Phasor Measurement Unit |
RF | Random Forest |
ROC | Receiver Operating Characteristic |
SCADA | Supervisory Control and Data Acquisition |
SVC | Support Vector Classifier |
SVM | Support Vector Machine |
References
- Corallo, A.; Lazoi, M.; Lezzi, M. Cybersecurity in the context of industry 4.0: A structured classification of critical assets and business impacts. Comput. Ind. 2020, 114, 103165.
- Griffor, E.R.; Greer, C.; Wollman, D.A.; Burns, M.J. Framework for cyber-physical systems: Volume 1, overview. NIST SP 2017.
- Rodofile, N.R.; Radke, K.; Foo, E. Extending the cyber-attack landscape for SCADA-based critical infrastructure. Int. J. Crit. Infrastruct. Prot. 2019, 25, 14–35.
- Khanna, K.; Panigrahi, B.K.; Joshi, A. AI-based approach to identify compromised meters in data integrity attacks on smart grid. IET Gener. Transm. Distrib. 2017, 12, 1052–1066.
- Maleh, Y.; Shojafar, M.; Darwish, A.; Haqiq, A. Cybersecurity and Privacy in Cyber Physical Systems; CRC Press: Boca Raton, FL, USA, 2019.
- Liang, G.; Weller, S.R.; Zhao, J.; Luo, F.; Dong, Z.Y. The 2015 Ukraine blackout: Implications for false data injection attacks. IEEE Trans. Power Syst. 2016, 32, 3317–3318.
- Reeder, J.R.; Hall, C.T. Cybersecurity’s Pearl Harbor Moment: Lessons Learned from the Colonial Pipeline Ransomware Attack; Government Contractor Cybersecurity: Washington, DC, USA, 2021.
- Gönen, S.; Sayan, H.H.; Yılmaz, E.N.; Üstünsoy, F.; Karacayılmaz, G. False Data Injection Attacks and the Insider Threat in Smart Systems. Comput. Secur. 2020, 97, 101955.
- Aoufi, S.; Derhab, A.; Guerroumi, M. Survey of false data injection in smart power grid: Attacks, countermeasures and challenges. J. Inf. Secur. Appl. 2020, 54, 102518.
- Pan, S.; Morris, T.; Adhikari, U. Developing a hybrid intrusion detection system using data mining for power systems. IEEE Trans. Smart Grid 2015, 6, 3104–3113.
- Goh, J.; Adepu, S.; Junejo, K.N.; Mathur, A. A dataset to support research in the design of secure water treatment systems. In International Conference on Critical Information Infrastructures Security; Springer: Berlin/Heidelberg, Germany, 2016; pp. 88–99.
- Guan, Z.; Sun, N.; Xu, Y.; Yang, T. A comprehensive survey of false data injection in smart grid. Int. J. Wirel. Mob. Comput. 2015, 8, 27–33.
- Liang, G.; Zhao, J.; Luo, F.; Weller, S.R.; Dong, Z.Y. A review of false data injection attacks against modern power systems. IEEE Trans. Smart Grid 2016, 8, 1630–1638.
- Musleh, A.S.; Chen, G.; Dong, Z.Y. A survey on the detection algorithms for false data injection attacks in smart grids. IEEE Trans. Smart Grid 2019, 11, 2218–2234.
- Cao, J.; Wang, D.; Qu, Z.; Cui, M.; Xu, P.; Xue, K.; Hu, K. A Novel False Data Injection Attack Detection Model of the Cyber-Physical Power System. IEEE Access 2020, 8, 95109–95125.
- Maglaras, L.A.; Jiang, J. Intrusion detection in SCADA systems using machine learning techniques. In Proceedings of the 2014 Science and Information Conference, Las Vegas, NV, USA, 25–26 April 2014; pp. 626–631.
- Esmalifalak, M.; Liu, L.; Nguyen, N.; Zheng, R.; Han, Z. Detecting stealthy false data injection using machine learning in smart grid. IEEE Syst. J. 2014, 11, 1644–1652.
- Yan, J.; Tang, B.; He, H. Detection of false data attacks in smart grid with supervised learning. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 1395–1402.
- Wang, Y.; Amin, M.M.; Fu, J.; Moussa, H.B. A novel data analytical approach for false data injection cyber-physical attack mitigation in smart grids. IEEE Access 2017, 5, 26022–26033.
- Wang, D.; Wang, X.; Zhang, Y.; Jin, L. Detection of power grid disturbances and cyber-attacks based on machine learning. J. Inf. Secur. Appl. 2019, 46, 42–52.
- Panthi, M. Anomaly Detection in Smart Grids using Machine Learning Techniques. In Proceedings of the 2020 First International Conference on Power, Control and Computing Technologies (ICPC2T), Raipur, India, 3–5 January 2020; pp. 220–222.
- Ahmed, C.M.; Zhou, J.; Mathur, A.P. Noise matters: Using sensor and process noise fingerprint to detect stealthy cyber attacks and authenticate sensors in CPS. In Proceedings of the 34th Annual Computer Security Applications Conference, San Juan, PR, USA, 3–7 December 2018; pp. 566–581.
- Dutta, A.K.; Negi, R.; Shukla, S.K. Robust Multivariate Anomaly-Based Intrusion Detection System for Cyber-Physical Systems. In International Symposium on Cyber Security Cryptography and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2021; pp. 86–93.
- Jahromi, A.N.; Karimipour, H.; Dehghantanha, A.; Choo, K.K.R. Toward Detection and Attribution of Cyber-Attacks in IoT-enabled Cyber-physical Systems. IEEE Internet Things J. 2021.
- Begli, M.; Derakhshan, F.; Karimipour, H. A layered intrusion detection system for critical infrastructure using machine learning. In Proceedings of the 2019 IEEE 7th International Conference on Smart Energy Grid Engineering (SEGE), UOIT, ON, Canada, 12–14 August 2019; pp. 120–124.
- Dick, K.; Russell, L.; Souley Dosso, Y.; Kwamena, F.; Green, J.R. Deep learning for critical infrastructure resilience. J. Infrastruct. Syst. 2019, 25, 05019003.
- Rodofile, N.R.; Schmidt, T.; Sherry, S.T.; Djamaludin, C.; Radke, K.; Foo, E. Process control cyber-attacks and labelled datasets on S7Comm critical infrastructure. In Australasian Conference on Information Security and Privacy; Springer: Berlin/Heidelberg, Germany, 2017; pp. 452–459.
- Kotsiantis, S. Feature selection for machine learning classification problems: A recent overview. Artif. Intell. Rev. 2011, 42, 157–176.
- He, H.; Ma, Y. Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley-IEEE Press: Hoboken, NJ, USA, 2013.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Chawla, N.V. Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin/Heidelberg, Germany, 2009; pp. 875–886.
- Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
- Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig. 2011, 3, 4–21.
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; pp. 1322–1328.
- Elyan, E.; Moreno-Garcia, C.F.; Jayne, C. CDSMOTE: Class decomposition and synthetic minority class oversampling technique for imbalanced-data classification. Neural Comput. Appl. 2021, 33, 2839–2851.
- Guan, H.; Zhang, Y.; Xian, M.; Cheng, H.D.; Tang, X. SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling. Appl. Intell. 2021, 51, 1394–1409.
- Fajardo, V.A.; Findlay, D.; Jaiswal, C.; Yin, X.; Houmanfar, R.; Xie, H.; Liang, J.; She, X.; Emerson, D. On oversampling imbalanced data with deep conditional generative models. Expert Syst. Appl. 2021, 169, 114463.
- Bellinger, C.; Corizzo, R.; Japkowicz, N. Calibrated Resampling for Imbalanced and Long-Tails in Deep Learning. In International Conference on Discovery Science; Springer: Berlin/Heidelberg, Germany, 2021; pp. 242–252.
- Krawczyk, B.; Bellinger, C.; Corizzo, R.; Japkowicz, N. Undersampling with support vectors for multi-class imbalanced data classification. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–7.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
- Shao, J. Linear model selection by cross-validation. J. Am. Stat. Assoc. 1993, 88, 486–494.
- Wang, J.; Shi, D.; Li, Y.; Chen, J.; Ding, H.; Duan, X. Distributed framework for detecting PMU data manipulation attacks with deep autoencoders. IEEE Trans. Smart Grid 2018, 10, 4401–4410.
- Adhikari, U.; Morris, T.H.; Pan, S. Applying non-nested generalized exemplars classification for cyber-power event and intrusion detection. IEEE Trans. Smart Grid 2016, 9, 3928–3941.
Ref. | Method | Dataset | Samples Ratio (Normal, Attack) | Feature Selection |
---|---|---|---|---|
[18] | Supervised Learning (SVM, kNN, and ENN) | Simulations IEEE 30-bus system | 0.1 | No |
[4] | ANN and ELM | NYISO load data IEEE 14-bus system | NA | NA |
[19] | Data Centric (Big Data and MSA) | Simulated (6 bus power system) and real-world (Texas synchrophasor network) | 100 K, 0.334, 0.196, 0.086 | No |
[20] | Voting on ML-classifier, dataset divided per PMUs | 4 PMUs events and firewall log [10] | Balanced | Yes |
[16] | OCSVM | Network traces | 1570, NA | NA |
[15] | Ensemble Learning | Measurement data and power system audit logs | Balanced | Yes |
[17] | Distributed SVM and PCA | IEEE standard test systems | NA | Yes |
[21] | Machine Learning (One R, J-Ripper, NB, RF) | Power system [10] | NA | No |
[22] | Fingerprinting and OCSVM | Water treatment (SWaT) | NA | NA |
[23] | AutoEncoder | SWaT | NA | Yes |
[24] | DT and Deep learning | SWaT and gas pipeline | 214 K, | Yes |
Class Type | Power System (Number of Samples) | Water Treatment (Number of Samples) |
---|---|---|
Normal | 22,714 | 395,298 |
FDIA | 9,582 | 54,621 |
Total | 32,296 | 449,919 |
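The degree of class imbalance follows directly from the counts in the table above. The short calculation below is only an arithmetic sketch; the names `counts` and `imbalance_summary` are illustrative and not part of the original work.

```python
# Sample counts copied from the class-distribution table above.
counts = {
    "power_system": {"normal": 22714, "fdia": 9582},
    "water_treatment": {"normal": 395298, "fdia": 54621},
}


def imbalance_summary(normal, attack):
    """Return the total, the attack share, and the normal:attack ratio."""
    total = normal + attack
    return {
        "total": total,
        "attack_share": attack / total,
        "ratio": normal / attack,
    }


summaries = {
    name: imbalance_summary(c["normal"], c["fdia"]) for name, c in counts.items()
}
for name, s in summaries.items():
    print(
        f"{name}: total={s['total']}, "
        f"attack share={s['attack_share']:.1%}, "
        f"normal:attack={s['ratio']:.2f}:1"
    )
```

Normal samples outnumber FDIA samples by roughly 2.4 to 1 in the power system data and by roughly 7.2 to 1 in the water treatment data, so the water treatment dataset is the more heavily imbalanced of the two.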
Field Device | Type | Description | Total (51) |
---|---|---|---|
Actuators (27) | MV | Motorized Valve | 6 |
Actuators (27) | P | Pump | 19 |
Actuators (27) | LIT | Level Transmitter | 1 |
Actuators (27) | UV | Dechlorinator | 1 |
Sensors (24) | FIT | Flow Meter | 9 |
Sensors (24) | LIT | Level Transmitter | 2 |
Sensors (24) | AIT | Analyzer | 9 |
Sensors (24) | DPIT | Differential Pressure Indicating Transmitter | 1 |
Sensors (24) | PIT | Pressure Meter | 3 |
Top Feature (Filter) | Value | Bottom Feature (Filter) | Value | Top Feature (Wrapper) | Value | Bottom Feature (Wrapper) | Value |
---|---|---|---|---|---|---|---|
FIT401 | 6.281 | FIT601 | 0.00066 | FIT504 | 0.223181 | P204 | 0.000403 |
FIT504 | 6.218 | P602 | 0.00058 | FIT401 | 0.125924 | P206 | 0.000313 |
FIT503 | 6.105 | P403 | 0.00008 | P501 | 0.105114 | P402 | 0.000114 |
UV401 | 6.076 | P202 | 0.0 | PIT502 | 0.070205 | P403 | 0.000035 |
P501 | 6.075 | P301 | 0.0 | FIT503 | 0.063296 | P202 | 0.000000 |
PIT501 | 5.972 | P401 | 0.0 | P102 | 0.040979 | P301 | 0.000000 |
FIT501 | 5.906 | P404 | 0.0 | LIT301 | 0.040890 | P401 | 0.000000 |
PIT503 | 5.899 | P502 | 0.0 | LIT101 | 0.030181 | P404 | 0.000000 |
FIT502 | 5.860 | P601 | 0.0 | LIT401 | 0.027320 | P502 | 0.000000 |
P402 | 5.550 | P603 | 0.0 | DPIT301 | 0.022423 | P601 | 0.000000 |
ML Model | Precision (Normal) | Precision (FDI) | Recall (Normal) | Recall (FDI) | F1-Score (Normal) | F1-Score (FDI) | Accuracy |
---|---|---|---|---|---|---|---|
NB | 0.71 | 0.33 | 0.98 | 0.02 | 0.82 | 0.04 | 0.70 |
SVM | 0.97 | 0.29 | 0.01 | 1.0 | 0.01 | 0.46 | 0.30 |
kNN | 0.86 | 0.70 | 0.88 | 0.66 | 0.87 | 0.68 | 0.82 |
DT | 0.90 | 0.75 | 0.90 | 0.75 | 0.90 | 0.75 | 0.85 |
RF | 0.91 | 0.93 | 0.98 | 0.78 | 0.94 | 0.85 | 0.92 |
Ada | 0.72 | 0.53 | 0.96 | 0.10 | 0.82 | 0.16 | 0.71 |
LR | 0.71 | 0.49 | 1.0 | 0.01 | 0.83 | 0.02 | 0.71 |
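The F1-score columns can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below recomputes the FDI-class F1 from the rounded values in the table above; the tolerance of 0.015 is an assumption chosen to absorb the two-decimal rounding of the published figures.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# FDI-class (precision, recall, reported F1) copied from the table above.
fdi_rows = {
    "NB": (0.33, 0.02, 0.04),
    "SVM": (0.29, 1.0, 0.46),
    "kNN": (0.70, 0.66, 0.68),
    "DT": (0.75, 0.75, 0.75),
    "RF": (0.93, 0.78, 0.85),
    "Ada": (0.53, 0.10, 0.16),
    "LR": (0.49, 0.01, 0.02),
}

TOLERANCE = 0.015  # assumed allowance for rounding in the published values

recomputed = {name: f1(p, r) for name, (p, r, _) in fdi_rows.items()}
consistent = {
    name: abs(recomputed[name] - reported) <= TOLERANCE
    for name, (_, _, reported) in fdi_rows.items()
}
for name in fdi_rows:
    print(f"{name}: recomputed F1={recomputed[name]:.3f}, consistent={consistent[name]}")
```

The check also makes the accuracy column easier to read: NB and LR reach about 0.70 accuracy while recalling almost no FDI samples, which is why the per-class recall and F1 are more informative than accuracy on imbalanced data.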
ML Model | Precision (Normal) | Precision (FDI) | Recall (Normal) | Recall (FDI) | F1-Score (Normal) | F1-Score (FDI) | Accuracy |
---|---|---|---|---|---|---|---|
NB | 0.96 | 0.98 | 1.0 | 0.70 | 0.98 | 0.82 | 0.96 |
SVM | 0.96 | 0.99 | 1.0 | 0.71 | 0.98 | 0.83 | 0.96 |
kNN | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
DT | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
BagSVC | 0.96 | 0.99 | 1 | 0.66 | 0.98 | 0.80 | 0.96 |
LR | 0.95 | 0.99 | 1 | 0.62 | 0.97 | 0.76 | 0.95 |
Classifiers | NB | SVC | kNN | DT | RF | Ada | BSVC | LR | XGB |
---|---|---|---|---|---|---|---|---|---|
Time (s) | 0.132 | 9.770 | 7.280 | 3.580 | 17.900 | 11.800 | 533.00 | 1.140 | 62.0 |
Classifiers | Imbalanced | SMOTE | Borderline-SMOTE | Borderline-SMOTE SVM | ADASYN |
---|---|---|---|---|---|
LR | 0.592 | 0.590 | 0.550 | 0.595 | 0.565 |
GNB | 0.559 | 0.548 | 0.578 | 0.544 | 0.596 |
GBoost | 0.747 | 0.847 | 0.839 | 0.839 | 0.845 |
BagSVC | 0.504 | 0.505 | 0.512 | 0.515 | 0.513 |
AdaBoost | 0.672 | 0.783 | 0.777 | 0.747 | 0.785 |
kNN | 0.855 | 0.924 | 0.917 | 0.926 | 0.910 |
DT | 0.826 | 0.830 | 0.835 | 0.847 | 0.831 |
RF | 0.974 | 0.984 | 0.982 | 0.988 | 0.983 |
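To make the effect of oversampling easier to read off the table above, the sketch below computes, for each classifier, the best oversampled score and its gain over the imbalanced baseline. The values are copied from the table; the metric is referred to simply as a score because the table itself does not name it, and the identifiers `scores`, `methods`, and `best_gain` are illustrative.

```python
# Column order: imbalanced baseline, then the four oversampling methods.
methods = ["imbalanced", "smote", "borderline_smote", "borderline_svm", "adasyn"]

# Scores copied from the table above.
scores = {
    "LR": [0.592, 0.590, 0.550, 0.595, 0.565],
    "GNB": [0.559, 0.548, 0.578, 0.544, 0.596],
    "GBoost": [0.747, 0.847, 0.839, 0.839, 0.845],
    "BagSVC": [0.504, 0.505, 0.512, 0.515, 0.513],
    "AdaBoost": [0.672, 0.783, 0.777, 0.747, 0.785],
    "kNN": [0.855, 0.924, 0.917, 0.926, 0.910],
    "DT": [0.826, 0.830, 0.835, 0.847, 0.831],
    "RF": [0.974, 0.984, 0.982, 0.988, 0.983],
}


def best_gain(row):
    """Return (best oversampling method, gain over the imbalanced baseline)."""
    baseline, oversampled = row[0], row[1:]
    best = max(oversampled)
    return methods[1 + oversampled.index(best)], round(best - baseline, 3)


gains = {clf: best_gain(row) for clf, row in scores.items()}
for clf, (method, gain) in sorted(gains.items(), key=lambda kv: -kv[1][1]):
    print(f"{clf}: best={method}, gain={gain:+.3f}")
```

Read this way, the boosting-based classifiers and kNN gain the most from oversampling, while RF, which already scores highest on the imbalanced data, improves only slightly.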
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kumar, A.; Saxena, N.; Jung, S.; Choi, B.J. Improving Detection of False Data Injection Attacks Using Machine Learning with Feature Selection and Oversampling. Energies 2022, 15, 212. https://doi.org/10.3390/en15010212