Fuzzy CNN Autoencoder for Unsupervised Anomaly Detection in Log Data
Abstract
1. Introduction
1.1. Relevance of the Work
1.2. Goal of the Work
1.3. Related Work
- Log collection. Software systems constantly write information about ongoing events to dedicated logs: event descriptions stored either in a database or in files. This information forms semi-structured data streams suitable for further analysis.
- Log parsing. After collection, the log data are converted to a text format, which makes them suitable for text mining methods.
- Feature extraction. At this stage, text data modeling algorithms are applied.
- Anomaly detection. Once the data model is built, various machine learning models can be trained and applied to detect anomalies.
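As an illustration, the four stages above can be sketched as a minimal sequential pipeline. All function names and the toy norm-based scoring rule below are illustrative assumptions, not the method proposed in this paper:

```python
def collect_logs(paths):
    """Log collection: read raw, semi-structured event lines from files."""
    lines = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines

def parse_logs(lines):
    """Log parsing: normalize the free-text messages for text mining."""
    return [line.strip().lower() for line in lines if line.strip()]

def extract_features(messages):
    """Feature extraction: a trivial bag-of-words count per message."""
    vocab = sorted({tok for msg in messages for tok in msg.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for msg in messages:
        vec = [0] * len(vocab)
        for tok in msg.split():
            vec[index[tok]] += 1
        vectors.append(vec)
    return vectors

def detect_anomalies(vectors, threshold):
    """Anomaly detection: flag vectors whose token count deviates
    from the mean by more than `threshold` (toy scoring rule)."""
    norms = [sum(v) for v in vectors]
    mean = sum(norms) / len(norms)
    return [abs(n - mean) > threshold for n in norms]
```

In practice each stage is replaced by the far richer components described in Section 2; the sketch only fixes the data flow between them.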
1.3.1. Nature of the Input Data
- The input is a stream of semi-structured descriptions of related (interconnected) events;
- The volume of the input data can be very large (more than 50 GB per hour);
- Information about one event can be duplicated several times in a row;
- The format of event descriptions is highly dependent on the particular system;
- Often, event descriptions are nearly identical and differ only in minor details (for example, identifiers of related subsystems and processes);
- A set of patterns can be extracted from similar event descriptions. The number of such patterns is usually small (up to 1000).
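A simple way to recover such patterns is to mask the variable fields of each description; the regular expressions below are illustrative assumptions, not the parser used in the paper:

```python
import re

def to_template(event: str) -> str:
    """Collapse variable fields (block IDs, IP addresses, bare numbers)
    into placeholders so that similar events map to one pattern."""
    event = re.sub(r"blk_-?\d+", "<BLK>", event)                 # HDFS block IDs
    event = re.sub(r"\d+\.\d+\.\d+\.\d+(:\d+)?", "<IP>", event)  # IPv4(:port)
    event = re.sub(r"\b\d+\b", "<NUM>", event)                   # bare numbers
    return event
```

Two descriptions that differ only in identifiers then yield the same template, so the number of distinct templates stays small even for very large logs.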
1.3.2. Log Collection
1.3.3. Log Parsing
1.3.4. Feature Extraction
1.3.5. Anomaly Detection
2. Materials and Methods
2.1. Log Collection
2.2. Log Parsing
2.3. Data Vectorization and Grouping
2.4. Convolutional Neural Networks for Feature Extraction
2.5. Asymmetric Decoder to Minimize Information Loss When Extracting Features
2.6. Fuzzy Clustering and Anomaly Detection
2.7. Regularization
2.8. Training
2.9. Evaluation Metrics
2.10. Materials
- HDFS1 (Hadoop Distributed File System). This dataset is a file containing textual descriptions of events occurring on more than 200 Amazon EC2 nodes. In total, it contains about 11 million events. Each event description contains a block ID, an identifier of the associated subsystem that can be used to group events. The dataset provides labels marking blocks as anomalous or normal [30].
- BGL (Blue Gene/L). This dataset is a file containing information about events occurring in the BlueGene/L system. In total, it contains about 5 million events. The dataset also provides a label for each event as normal or anomalous [36].
- Villani. This dataset is a file containing textual descriptions of keystroke events by individual users collected during their computer work. In total, it contains on the order of 2 million events (key presses and releases) for 144 different users [40,41,42]. For this dataset, a separate model must be built for each user: events from the current user are considered normal, and events from all other users are anomalies [6].
3. Results
3.1. Evaluation Metrics
3.2. Experimental Setup
3.2.1. Log Collection
- HDFS1: “2023-02-02 20:55:54 INFO dfs.DataNode$DataXceiver: Receiving block blk_5792489080791696128 src: /10.251.30.6:33145 dest: /10.251.30.6:50010”;
- BGL: “2005-06-03-15.42.50.363779 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected”;
- Villani: a CSV file containing, for each key press and release event, the user, the system, the key code, and a timestamp.
3.2.2. Log Parsing
3.2.3. Feature Extraction
- HDFS1.
- (a) Dictionary size: 75.
- (b) Events are grouped by their block ID. To build fixed-size groups, the first 20 events in each block are selected. For blocks shorter than 20 events, a special !EMPTY EVENT! token is appended the required number of times; this token is also added to the dictionary.
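This fixed-size grouping with padding can be sketched as follows (the helper name is assumed; the !EMPTY EVENT! token follows the description above):

```python
PAD = "!EMPTY EVENT!"  # padding token, also added to the dictionary

def make_group(block_events, size=20, pad=PAD):
    """Take the first `size` events of a block; pad shorter blocks
    with the special padding token until the group has `size` events."""
    group = list(block_events[:size])
    group += [pad] * (size - len(group))
    return group
```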
- BGL.
- (a) Dictionary size: 728.
- (b) Since the dataset labels individual events as anomalous, the task is framed as predicting the onset of an anomalous event. Data are combined into groups of five events using a sliding window, and the solution outputs the anomaly degree of the event following each window.
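The sliding-window grouping, with the next event as the prediction target, can be sketched as (helper name assumed):

```python
def sliding_windows(events, size=5):
    """Yield (window, next_event) pairs: the model scores the anomaly
    degree of the event that follows each window of `size` events."""
    for i in range(len(events) - size):
        yield events[i:i + size], events[i + size]
```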
- Villani.
- (a) Dictionary size: 446 (223 distinct keys with two event types per key).
- (b) Events for each user are grouped into blocks of size 100 using a sliding window. For the network to work correctly, the 14 users with at least 312 blocks each are selected (312 is the 90th percentile of data volume across all users).
3.2.4. Anomaly Detection
- HDFS1: blocks are considered anomalous if they are marked as anomalous in the original dataset;
- BGL: anomalous blocks are those that immediately precede an event marked as anomalous in the original dataset;
- Villani: blocks associated with users not included in the training sample are considered anomalous.
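Given per-block anomaly scores and the labels defined above, ROC AUC can be computed directly from the ranking; a dependency-free sketch (the function name is assumed):

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen anomalous block scores higher than a randomly chosen
    normal block (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

PR AUC is obtained analogously from the precision-recall curve, e.g., with scikit-learn's `average_precision_score`.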
3.3. Technical Characteristics
3.4. Evaluation
3.4.1. Evaluation of the Encoder
- F—number of independent convolution layers. For all datasets, ;
- —filter sizes for each convolution layer ( is a filter size for layer numbered i). This parameter must fulfill the constraint described in (16);
- —number of filters per layer. This value strongly affects the complexity of the network, so it should not be very large. In addition, to optimize the running time of the proposed solution on a GPU with mixed precision, only values that are multiples of eight are considered, as shown in [43];
- —is the -regularization constant to compute , , and in (23).
- HDFS1. , , , ;
- BGL. , , , ;
- Villani. , , , .
3.4.2. Evaluation of the Decoder
3.4.3. Evaluation of the Fuzzy Layer
- ;
- ;
- For the initial value of parameter a, the best initializer is based on a continuous uniform distribution with zero mean and unit variance; for the initial value of parameter C, a constant identity diagonal matrix is used (stored as a vector of diagonal values); for the initial value of the parameter , the constant zero works best;
- As a distance metric, the Mahalanobis distance shows the best results, as it allows building an elliptically shaped cluster of normal data.
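For a diagonal covariance matrix stored as a vector of diagonal values, as described above, the Mahalanobis distance reduces to a per-axis scaled Euclidean distance; a minimal sketch:

```python
import math

def mahalanobis_diag(x, center, diag_cov):
    """Mahalanobis distance with a diagonal covariance matrix C:
    d(x, c) = sqrt(sum_i (x_i - c_i)^2 / C_ii).
    With C = I this reduces to the Euclidean distance; otherwise each
    axis is scaled by its variance, giving an elliptical normal cluster."""
    return math.sqrt(sum((xi - ci) ** 2 / v
                         for xi, ci, v in zip(x, center, diag_cov)))
```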
3.5. Comparison with Existing Approaches
- One-class SVM:
- (a) Kernel activation function ;
- (b) Parameter of the RBF kernel .
- Fuzzy:
- (a) Degree of fuzziness ;
- (b) Outlier percentage ;
- (c) Parameter of the RBF kernel .
- LogBERT:
- (a) The number of Transformer layers ;
- (b) Token embedding size ;
- (c) Hidden state embedding size ;
- (d) Masked event percentage for each block ;
- (e) The number of candidates g for determining the anomaly degree of the predicted event .
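The One-class SVM baseline can be reproduced with scikit-learn; since the tuned ν and γ values are elided above, the values below are placeholders, and the data are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 2))  # feature vectors of normal blocks only
test = np.array([[0.1, -0.2],                # near the normal cluster
                 [8.0, 8.0]])                # far outlier

# nu (upper bound on the outlier fraction) and gamma (RBF width) are
# placeholder values; the paper's tuned values are not given here.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(train)
pred = model.predict(test)  # +1 = normal, -1 = anomalous
```

For AUC-based evaluation, the continuous `decision_function` output is used as the anomaly score instead of the hard ±1 predictions.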
3.6. Robustness Estimation
4. Discussion
Further Research
5. Conclusions
- HDFS1 dataset. Median ROC AUC = 0.973, median PR AUC = 0.97;
- BGL dataset. Median ROC AUC = 0.939, median PR AUC = 0.921;
- Villani dataset. Median ROC AUC = 0.856, median PR AUC = 0.867.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, Z.; Liu, J.; Gu, W.; Su, Y.; Lyu, M.R. Experience report: Deep learning-based system log analysis for anomaly detection. arXiv 2021, arXiv:2107.05908.
- Wang, B.; Hua, Q.; Zhang, H.; Tan, X.; Nan, Y.; Chen, R.; Shu, X. Research on anomaly detection and real-time reliability evaluation with the log of cloud platform. Alex. Eng. J. 2022, 61, 7183–7193.
- Landauer, M.; Skopik, F.; Wurzenberger, M.; Rauber, A. System log clustering approaches for cyber security applications: A survey. Comput. Secur. 2020, 92, 101739.
- He, S.; Zhu, J.; He, P.; Lyu, M.R. Experience report: System log analysis for anomaly detection. In Proceedings of the 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), Ottawa, ON, Canada, 23–27 October 2016; pp. 207–218.
- Manevitz, L.M.; Yousef, M. One-class SVMs for document classification. J. Mach. Learn. Res. 2001, 2, 139–154.
- Kazachuk, M.; Petrovskiy, M.; Mashechkin, I.; Gorohov, O. Novelty Detection Using Elliptical Fuzzy Clustering in a Reproducing Kernel Hilbert Space. In Proceedings of the Intelligent Data Engineering and Automated Learning–IDEAL 2018: 19th International Conference, Madrid, Spain, 21–23 November 2018; Proceedings, Part II 19. Springer: Berlin/Heidelberg, Germany, 2018; pp. 221–232.
- Du, M.; Li, F.; Zheng, G.; Srikumar, V. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 1285–1298.
- Guo, H.; Yuan, S.; Wu, X. Logbert: Log anomaly detection via bert. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
- Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58.
- Chalapathy, R.; Chawla, S. Deep learning for anomaly detection: A survey. arXiv 2019, arXiv:1901.03407.
- Yadav, R.B.; Kumar, P.S.; Dhavale, S.V. A survey on log anomaly detection using deep learning. In Proceedings of the 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 4–5 June 2020; pp. 1215–1220.
- Mi, H.; Wang, H.; Zhou, Y.; Lyu, M.R.T.; Cai, H. Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems. IEEE Trans. Parallel Distrib. Syst. 2013, 24, 1245–1255.
- Liu, Y.; Zhang, X.; He, S.; Zhang, H.; Li, L.; Kang, Y.; Xu, Y.; Ma, M.; Lin, Q.; Dang, Y.; et al. Uniparser: A unified log parser for heterogeneous log data. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 1893–1901.
- Zhu, J.; He, S.; Liu, J.; He, P.; Xie, Q.; Zheng, Z.; Lyu, M.R. Tools and benchmarks for automated log parsing. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 25–31 May 2019; pp. 121–130.
- Le, V.H.; Zhang, H. Log-based anomaly detection with deep learning: How far are we? In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; pp. 1356–1367.
- Chollet, F. Deep Learning with Python; Simon and Schuster: New York, NY, USA, 2021.
- What Are Vector Embeddings. Available online: https://www.pinecone.io/learn/vector-embeddings/ (accessed on 1 August 2023).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
- Gorokhov, O.; Petrovskiy, M.; Mashechkin, I. Convolutional neural networks for unsupervised anomaly detection in text data. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Guilin, China, 30 October–1 November 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 500–507.
- Girolami, M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 2002, 13, 780–784.
- Petrovskiy, M. Outlier detection algorithms in data mining systems. Program. Comput. Softw. 2003, 29, 228–237.
- Liu, D.; Qian, H.; Dai, G.; Zhang, Z. An iterative SVM approach to feature selection and classification in high-dimensional datasets. Pattern Recognit. 2013, 46, 2531–2537.
- Erfani, S.M.; Rajasegarar, S.; Karunasekera, S.; Leckie, C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognit. 2016, 58, 121–134.
- Mahalanobis, P.C. On the generalized distance in statistics. Sankhyā: Indian J. Stat. Ser. A 2018, 80, S1–S7.
- Graves, A. Generating sequences with recurrent neural networks. arXiv 2013, arXiv:1308.0850.
- Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-normalizing neural networks. Adv. Neural Inf. Process. Syst. 2017, 30, 972–981.
- Amirian, M.; Schwenker, F. Radial basis function networks for convolutional neural networks to learn similarity distance metric and improve interpretability. IEEE Access 2020, 8, 123087–123097.
- Xu, W.; Huang, L.; Fox, A.; Patterson, D.; Jordan, M.I. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, Big Sky, MT, USA, 11–14 October 2009; pp. 117–132.
- Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
- Lin, Q.; Zhang, H.; Lou, J.G.; Zhang, Y.; Chen, X. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion, Austin, TX, USA, 14–22 May 2016; pp. 102–111.
- Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, S.; Sun, P.; et al. Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; Volume 19, pp. 4739–4745.
- Duan, X.; Ying, S.; Yuan, W.; Cheng, H.; Yin, X. A Generative Adversarial Networks for Log Anomaly Detection. Comput. Syst. Sci. Eng. 2021, 37, 135–148.
- Zhou, Y.; Liang, X.; Zhang, W.; Zhang, L.; Song, X. VAE-based deep SVDD for anomaly detection. Neurocomputing 2021, 453, 131–140.
- Oliner, A.; Stearley, J. What supercomputers say: A study of five system logs. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), Edinburgh, UK, 25–28 June 2007; pp. 575–584.
- Cosine Similarity. Available online: https://www.learndatasci.com/glossary/cosine-similarity/ (accessed on 1 August 2023).
- Hinton, G.; Srivastava, N.; Swersky, K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited 2012, 14, 2.
- He, S.; Zhu, J.; He, P.; Lyu, M.R. Loghub: A large collection of system log datasets towards automated log analytics. arXiv 2020, arXiv:2008.06448.
- Tappert, C.C.; Villani, M.; Cha, S.H. Keystroke biometric identification and authentication on long-text input. In Behavioral Biometrics for Human Identification: Intelligent Applications; IGI Global: Hershey, PA, USA, 2010; pp. 342–367.
- Monaco, J.V.; Bakelman, N.; Cha, S.H.; Tappert, C.C. Developing a keystroke biometric system for continual authentication of computer users. In Proceedings of the 2012 European Intelligence and Security Informatics Conference, Odense, Denmark, 22–24 August 2012; pp. 210–216.
- Villani, M.; Tappert, C.; Ngo, G.; Simone, J.; Fort, H.S.; Cha, S.H. Keystroke biometric recognition studies on long-text input under ideal and application-oriented conditions. In Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06), New York, NY, USA, 17–22 June 2006; p. 39.
- Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv 2017, arXiv:1710.03740.
| Value | | HDFS1 | BGL |
|---|---|---|---|
| Train | | 601,576 | 541,385 |
| Validation | Normal blocks | 3367 | 4263 |
| | Anomalous blocks | 3367 | 4263 |
| Test | Normal blocks | 13,471 | 17,053 |
| | Anomalous blocks | 13,471 | 17,053 |
| Subsample amount | | 10 | 10 |
| Dataset | N |
|---|---|
| HDFS1 | 20 |
| BGL | 5 |
| Villani | 100 |
| Dataset | Number of Layers | ROC AUC (Median) | ROC AUC (Q1) | ROC AUC (Q3) | PR AUC (Median) | PR AUC (Q1) | PR AUC (Q3) |
|---|---|---|---|---|---|---|---|
| HDFS1 | 1 | 0.73 | 0.711 | 0.734 | 0.752 | 0.744 | 0.754 |
| HDFS1 | 2 | 0.734 | 0.713 | 0.735 | 0.753 | 0.749 | 0.755 |
| HDFS1 | 3 | 0.729 | 0.727 | 0.731 | 0.751 | 0.749 | 0.753 |
| BGL | 1 | 0.684 | 0.681 | 0.687 | 0.697 | 0.695 | 0.713 |
| BGL | 2 | 0.685 | 0.682 | 0.686 | 0.698 | 0.696 | 0.711 |
| BGL | 3 | 0.683 | 0.682 | 0.685 | 0.695 | 0.694 | 0.696 |
| Villani | 1 | 0.631 | 0.59 | 0.645 | 0.657 | 0.643 | 0.667 |
| Villani | 2 | 0.655 | 0.631 | 0.671 | 0.678 | 0.666 | 0.682 |
| Villani | 3 | 0.641 | 0.637 | 0.648 | 0.662 | 0.66 | 0.664 |
| Dataset | Approach | ROC AUC (Median) | ROC AUC (Q1) | ROC AUC (Q3) | PR AUC (Median) | PR AUC (Q1) | PR AUC (Q3) |
|---|---|---|---|---|---|---|---|
| HDFS1 | LogBERT | 0.9 | 0.894 | 0.907 | 0.901 | 0.898 | 0.911 |
| HDFS1 | FuzzyCNN | 0.973 | 0.971 | 0.974 | 0.97 | 0.969 | 0.972 |
| BGL | LogBERT | 0.908 | 0.901 | 0.912 | 0.898 | 0.893 | 0.901 |
| BGL | FuzzyCNN | 0.939 | 0.934 | 0.94 | 0.921 | 0.916 | 0.926 |
| Villani | LogBERT | 0.8 | 0.781 | 0.81 | 0.821 | 0.819 | 0.824 |
| Villani | FuzzyCNN | 0.856 | 0.843 | 0.863 | 0.867 | 0.855 | 0.87 |
| Dataset | Percent | ROC AUC (Median) | ROC AUC (Q1) | ROC AUC (Q3) | PR AUC (Median) | PR AUC (Q1) | PR AUC (Q3) |
|---|---|---|---|---|---|---|---|
| HDFS1 | 0.2 | 0.959 | 0.955 | 0.963 | 0.958 | 0.953 | 0.963 |
| HDFS1 | 0.5 | 0.937 | 0.933 | 0.94 | 0.944 | 0.94 | 0.949 |
| BGL | 0.2 | 0.846 | 0.844 | 0.85 | 0.872 | 0.871 | 0.874 |
| BGL | 0.5 | 0.805 | 0.8 | 0.81 | 0.845 | 0.843 | 0.848 |
| Villani | 0.2 | 0.823 | 0.81 | 0.825 | 0.831 | 0.829 | 0.841 |
| Villani | 0.5 | 0.78 | 0.767 | 0.792 | 0.793 | 0.787 | 0.799 |
Gorokhov, O.; Petrovskiy, M.; Mashechkin, I.; Kazachuk, M. Fuzzy CNN Autoencoder for Unsupervised Anomaly Detection in Log Data. Mathematics 2023, 11, 3995. https://doi.org/10.3390/math11183995