Efficient Online Log Parsing with Log Punctuations Signature
Abstract
1. Introduction
- We propose a novel and efficient log signature method for log parsing, based on log punctuations and length information;
- We present LogPunk, an online log parser built on our log signature method, which is more robust and efficient than previous log parsers;
- We conduct extensive experiments on 16 datasets and compare LogPunk with five other log parsers. The results show that LogPunk is accurate, robust, and efficient.
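To make the first contribution more concrete, the following is a minimal sketch of what a signature built from punctuation and length information could look like. It is not the paper's Algorithm 1: the contents of `PUNCTUATION_TABLE`, the function names, and the choice to combine the token count with the ordered punctuation string are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical punctuation table (an assumption, not the paper's table).
# Characters that often occur inside variables (".", ":", "/", "-", "_")
# are deliberately left out.
PUNCTUATION_TABLE: FrozenSet[str] = frozenset('|"()*;,=[]{}')


def log_signature(content: str,
                  table: FrozenSet[str] = PUNCTUATION_TABLE) -> Tuple[int, str]:
    """Return (token count, punctuation string) for one log message."""
    tokens = content.split()
    # Keep only the characters listed in the table, in their original order.
    punctuations = "".join(ch for ch in content if ch in table)
    return len(tokens), punctuations


def group_by_signature(messages: List[str]) -> Dict[Tuple[int, str], List[str]]:
    """Bucket messages by signature, in a hash-like manner."""
    groups: Dict[Tuple[int, str], List[str]] = {}
    for message in messages:
        groups.setdefault(log_signature(message), []).append(message)
    return groups


if __name__ == "__main__":
    logs = [
        "Connection from 10.0.0.1 closed, code=5",
        "Connection from 10.0.0.2 closed, code=7",
        "Session opened [user=root]",
    ]
    for signature, members in group_by_signature(logs).items():
        print(signature, len(members))
```

In this sketch the two connection messages share one signature even though their variables differ, which is the property that lets a signature act as a cheap first-level grouping key.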
2. Problem Description
3. Methodology
3.1. Step 1 Preprocess and Split
3.2. Step 2 Generate Log Signature
Algorithm 1: Log signature
3.3. Step 3 Search Signature Group
Algorithm 2: Log similarity
3.4. Step 4 Update Signature Group
4. Evaluation
4.1. LogHub Dataset and Accuracy Metrics
4.2. Accuracy
4.3. Robustness
4.4. Efficiency
5. Discussion
6. Related Work
- (1) Frequent Pattern Mining: SLCT [19], LFA [20], and LogCluster [31] are automated log parsers that parse log messages by mining the frequent tokens in log files. These approaches first count token frequencies and then use a predefined threshold to identify the static parts of log messages. The intuition is that if a log event occurs frequently, the static template parts will occur more often than the dynamic parts produced by variables. SLCT was the first to apply frequent pattern mining to log parsing. LFA uses the token frequency within each log message, instead of across the whole log data, to parse infrequent logs. LogCluster improves on SLCT and is robust to shifts in token position. All three methods are offline and need to traverse all log data to count token frequencies. In contrast, LogPunk is an online log parser.
- (2) Clustering: Many previous studies regard log parsing as a clustering problem and propose clustering approaches to solve it. From this perspective, log messages sharing the same template are grouped into one cluster, and various approaches to measure the similarity (or distance) between two log messages have been proposed. LKE [32], LogSig [21], and LogMine [22] are offline clustering methods. LKE employs a k-means clustering algorithm based on weighted edit distance to extract log events from free-text messages. LogSig groups log messages with the same frequent subsequence into a predefined number of clusters. LogMine clusters log messages from bottom to top and identifies the most suitable log template to represent each cluster. SHISO [24] and LenMa [25] are both online methods. SHISO employs Euclidean distance to measure the similarity between logs and generate a score. If the score is smaller than a predefined threshold, SHISO groups the two logs into one cluster. LenMa is an online clustering method that uses the length of each word in a log message. It measures the similarity between two log messages with cosine similarity to determine which cluster an incoming log message should be added to. Despite performing well on test datasets, these two methods perform poorly on public datasets. To ensure robustness, LogPunk has been tested on 16 datasets from different systems.
- (3) Heuristics: Unlike general text data, log messages have some unique characteristics that can be used for log parsing. AEL [29] uses heuristics based on domain knowledge to identify dynamic parts (e.g., tokens following “is” or “are”) in log messages, then clusters log messages into the same template set if they have the same structure of dynamic parts. IPLoM [23] iteratively partitions log messages into finer clusters, first by the number of tokens, then by the position of tokens, and lastly by the association between token pairs. Spell [26] assumes that template tokens usually make up most of a log message, while variable tokens make up only a small portion. It therefore uses an LCS-based approach to measure log similarity and to find the most similar template. Drain [17] uses a fixed-depth tree to parse logs. Each layer encodes specially designed rules for log parsing. In the first layer, Drain searches by log message length, and in the following layers, it searches by preceding tokens. As a result, log messages with the same length and preceding tokens are clustered into the same groups, which are placed on the leaf nodes. The tree-based Spell and Drain outperform other methods in the previous benchmark [15] and are the state-of-the-art log parsers at present. LogPunk overcomes the defect of tree structures (cf. Section 2) and parses logs in a hash-like manner.
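Two of the similarity measures mentioned above can be sketched in a few lines. The code below is an illustrative approximation, not the published implementation of either parser: the function names are hypothetical, restricting the word-length cosine similarity to messages with the same token count is an assumption, and normalizing the LCS length by the longer message is also an assumption.

```python
import math
from typing import List, Sequence


def length_cosine_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity between word-length vectors (the idea behind LenMa).

    Assumption: only messages with the same number of tokens are comparable.
    """
    if len(a) != len(b) or not a:
        return 0.0
    va = [len(token) for token in a]
    vb = [len(token) for token in b]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length normalized by the longer message (the idea behind Spell)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    m1 = "Sent 120 bytes to node7".split()
    m2 = "Sent 98 bytes to node12".split()
    print(lcs_similarity(m1, m2))            # shared tokens: Sent, bytes, to
    print(length_cosine_similarity(m1, m2))  # close to 1 for similar shapes
```

Both measures compare an incoming message against every candidate cluster or template, which is the cost that signature-based or tree-based pre-grouping tries to reduce.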
7. Conclusions
- (1) Automated parameter tuning. LogPunk has two hyperparameters: the similarity threshold and the prefix threshold (cf. Section 3.3). In our experiments, these two hyperparameters were tuned manually. This process is time-consuming, and the resulting values may not be optimal. A mechanism for automated parameter tuning could greatly improve this situation.
- (2) Punctuation table generation. The punctuation table (cf. Section 3.2) determines the log signature and affects the whole log parsing process. We built a punctuation table by eliminating the punctuation marks that appear in variables, and it performs well on the 16 evaluated datasets. For a new system, a customized punctuation table may yield better log parsing results. It is desirable to find a way to generate a customized punctuation table automatically for an unknown system.
- (3) Variable type identification. Existing log parsers treat all variables in the parsing result as strings. However, each variable has a specific type (e.g., number, IP, URL, or file path), and this information is useful for detecting variable-related anomalies. If log parsing identified not only the variables but also their types, downstream tasks would have richer information to work with.
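As a rough illustration of the third direction, variable types could be inferred with a handful of ordered rules. This is only a sketch under stated assumptions: the function names, the type labels, the rule order, and the regular expressions are hypothetical and cover just the example types listed above.

```python
import ipaddress
import re
from typing import List, Tuple

# Illustrative patterns (assumptions): plain decimal numbers, common URL
# schemes, and absolute Unix-style paths.
NUMBER_RE = re.compile(r"[+-]?\d+(?:\.\d+)?")
URL_RE = re.compile(r"(?:https?|ftp)://\S+")
PATH_RE = re.compile(r"(?:/[^/\s]+)+/?")


def variable_type(value: str) -> str:
    """Classify one extracted variable; fall back to 'string'."""
    if NUMBER_RE.fullmatch(value):
        return "number"
    try:
        ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
        return "ip"
    except ValueError:
        pass
    if URL_RE.fullmatch(value):
        return "url"
    if PATH_RE.fullmatch(value):
        return "path"
    return "string"


def annotate_variables(variables: List[str]) -> List[Tuple[str, str]]:
    """Pair each variable with its inferred type."""
    return [(value, variable_type(value)) for value in variables]


if __name__ == "__main__":
    print(annotate_variables(["42", "10.0.0.1", "/var/log/syslog", "blk_123"]))
```

A downstream anomaly detector could then, for example, apply range checks only to variables labeled as numbers.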
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AIOps | Artificial Intelligence for IT Operations |
LCS | Longest Common Subsequence |
PA | Parsing Accuracy |
URL | Uniform Resource Locator |
Symbols | |
EV | Log Content |
A Log Template | |
A List of Variables | |
E | The Set of All Log Templates |
L | A Sequence of Log Messages |
A Log Message | |
A List of Tokens | |
The Token with Index j | |
T | The Set of All Tokens |
n | Message Length |
References
- Cito, J.; Leitner, P.; Fritz, T.; Gall, H.C. The Making of Cloud Applications: An Empirical Study on Software Development for the Cloud. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering; Association for Computing Machinery: New York, NY, USA, 2015; pp. 393–403.
- Barik, T.; DeLine, R.; Drucker, S.; Fisher, D. The bones of the system: A case study of logging and telemetry at Microsoft. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering Companion, Austin, TX, USA, 14–22 May 2016; pp. 92–101.
- Forestiero, A.; Mastroianni, C.; Papuzzo, G.; Spezzano, G. A Proximity-Based Self-Organizing Framework for Service Composition and Discovery. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, Melbourne, VIC, Australia, 17–20 May 2010; pp. 428–437.
- Forestiero, A.; Mastroianni, C.; Meo, M.; Papuzzo, G.; Sheikhalishahi, M. Hierarchical approach for green workload management in distributed data centers. In European Conference on Parallel Processing; Springer: New York, NY, USA, 2014; pp. 323–334.
- Mi, H.; Wang, H.; Zhou, Y.; Lyu, M.R.T.; Cai, H. Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems. IEEE Trans. Parallel Distrib. Syst. 2013, 24, 1245–1255.
- Zhang, X.; Xu, Y.; Lin, Q.; Qiao, B.; Zhang, H.; Dang, Y.; Xie, C.; Yang, X.; Cheng, Q.; Li, Z.; et al. Robust Log-Based Anomaly Detection on Unstable Log Data. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 807–817.
- Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, S.; Sun, P.; et al. LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs. In Proceedings of the International Joint Conferences on Artificial Intelligence Organization, Macao, China, 10–16 August 2019; pp. 4739–4745.
- Zhou, X.; Peng, X.; Xie, T.; Sun, J.; Ji, C.; Liu, D.; Xiang, Q.; He, C. Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs; Association for Computing Machinery: New York, NY, USA, 2019; pp. 683–694.
- Chen, Y.; Yang, X.; Lin, Q.; Zhang, H.; Gao, F.; Xu, Z.; Dang, Y.; Zhang, D.; Dong, H.; Xu, Y.; et al. Outage Prediction and Diagnosis for Cloud Service Systems. In The World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2659–2665.
- Zaman, T.S.; Han, X.; Yu, T. SCMiner: Localizing System-Level Concurrency Faults from Large System Call Traces. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, CA, USA, 11–15 November 2019; pp. 515–526.
- Cotroneo, D.; De Simone, L.; Liguori, P.; Natella, R.; Bidokhti, N. How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 200–211.
- Xu, W.; Huang, L.; Fox, A.; Patterson, D.; Jordan, M.I. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, Big Sky, MT, USA, 11–14 October 2009; pp. 117–132.
- Lou, J.G.; Fu, Q.; Yang, S.; Xu, Y.; Li, J. Mining Invariants from Console Logs for System Problem Detection. In Proceedings of the USENIX Annual Technical Conference, Boston, MA, USA, 23–25 June 2010; pp. 1–14.
- Lou, J.G.; Fu, Q.; Yang, S.; Li, J.; Wu, B. Mining program workflow from interleaved traces. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 613–622.
- Zhu, J.; He, S.; Liu, J.; He, P.; Xie, Q.; Zheng, Z.; Lyu, M.R. Tools and benchmarks for automated log parsing. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, Montreal, QC, Canada, 25–31 May 2019; pp. 121–130.
- Beschastnikh, I.; Brun, Y.; Ernst, M.D.; Krishnamurthy, A. Inferring models of concurrent systems from logs of their behavior with CSight. In Proceedings of the 36th International Conference on Software Engineering, Hyderabad, India, 31 May 2014; pp. 468–479.
- He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An online log parsing approach with fixed depth tree. In Proceedings of the 2017 IEEE International Conference on Web Services, Honolulu, HI, USA, 25–30 June 2017; pp. 33–40.
- Dai, H.; Li, H.; Chen, C.S.; Shang, W.; Chen, T.H. Logram: Efficient log parsing using n-gram dictionaries. IEEE Trans. Softw. Eng. 2020.
- Vaarandi, R. A data clustering algorithm for mining patterns from event logs. In Proceedings of the 3rd IEEE Workshop on IP Operations & Management, Kansas City, MO, USA, 3 October 2003; pp. 119–126.
- Nagappan, M.; Vouk, M.A. Abstracting log lines to log event types for mining software system logs. In Proceedings of the 2010 7th IEEE Working Conference on Mining Software Repositories, Cape Town, South Africa, 2–3 May 2010; pp. 114–117.
- Tang, L.; Li, T.; Perng, C.S. LogSig: Generating system events from raw textual logs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, UK, 24–28 October 2011; pp. 785–794.
- Hamooni, H.; Debnath, B.; Xu, J.; Zhang, H.; Jiang, G.; Mueen, A. Logmine: Fast pattern recognition for log analytics. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 1573–1582.
- Makanju, A.A.; Zincir-Heywood, A.N.; Milios, E.E. Clustering event logs using iterative partitioning. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 1255–1264.
- Mizutani, M. Incremental mining of system log format. In Proceedings of the 2013 IEEE International Conference on Services Computing, Santa Clara, CA, USA, 28 June–3 July 2013; pp. 595–602.
- Shima, K. Length matters: Clustering system log messages using length of words. arXiv 2016, arXiv:1611.03213.
- Du, M.; Li, F. Spell: Streaming parsing of system event logs. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 859–864.
- Du, M.; Li, F. Spell: Online Streaming Parsing of Large Unstructured System Logs. IEEE Trans. Knowl. Data Eng. 2019, 31, 2213–2227.
- He, S.; Zhu, J.; He, P.; Lyu, M.R. Loghub: A large collection of system log datasets towards automated log analytics. arXiv 2020, arXiv:2008.06448.
- Jiang, Z.M.; Hassan, A.E.; Flora, P.; Hamann, G. Abstracting execution logs to execution events for enterprise applications (short paper). In Proceedings of the 2008 Eighth International Conference on Quality Software, Oxford, UK, 12–13 August 2008; pp. 181–186.
- Huang, S.; Liu, Y.; Fung, C.; He, R.; Zhao, Y.; Yang, H.; Luan, Z. Paddy: An event log parsing approach using dynamic dictionary. In Proceedings of the NOMS 2020-2020 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 20–24 April 2020; pp. 1–8.
- Vaarandi, R.; Pihelgas, M. Logcluster—A data clustering and pattern mining algorithm for event logs. In Proceedings of the 2015 11th International Conference on Network and Service Management, Barcelona, Spain, 9–13 November 2015; pp. 1–7.
- Fu, Q.; Lou, J.G.; Wang, Y.; Li, J. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, Miami Beach, FL, USA, 6–9 December 2009; pp. 149–158.
Platform | Description | #Templates (2k) | #Templates (Total) | Length (Max, Average) | Size |
---|---|---|---|---|---|
Android | Android framework | 166 | 76,923 | 32, 13.31 | 183.37 MB |
Apache | Apache web server error | 6 | 44 | 14, 12.28 | 4.90 MB |
BGL | Blue Gene/L supercomputer | 120 | 619 | 84, 15.32 | 708.76 MB |
Hadoop | Hadoop map reduce job | 114 | 298 | 50, 14.82 | 48.61 MB |
HDFS | Hadoop distributed file system | 14 | 30 | 111, 12.45 | 1.47 GB |
HealthApp | Health app | 75 | 220 | 14, 4.93 | 22.44 MB |
HPC | High performance cluster | 46 | 104 | 47, 9.56 | 32.00 MB |
Linux | Linux system | 118 | 488 | 24, 14.39 | 2.25 MB |
Mac | Mac OS | 341 | 2214 | 249, 15.49 | 16.09 MB |
OpenSSH | OpenSSH server | 27 | 62 | 19, 13.81 | 70.02 MB |
OpenStack | OpenStack infrastructure | 43 | 51 | 31, 20.63 | 58.61 MB |
Proxifier | Proxifier software | 8 | 9 | 27, 13.73 | 2.42 MB |
Spark | Spark job | 36 | 456 | 22, 12.76 | 2.75 GB |
Thunderbird | Thunderbird supercomputer | 149 | 4040 | 132, 17.52 | 29.60 GB |
Windows | Windows event | 50 | 4833 | 42, 31.93 | 26.09 GB |
Zookeeper | ZooKeeper service | 50 | 95 | 26, 13.46 | 9.95 MB |
Dataset | IPLoM | LenMa | AEL | Spell | Spell+ | Drain | Drain+ | LogPunk | Best |
---|---|---|---|---|---|---|---|---|---|
Android | 0.712 | 0.88 | 0.682 | 0.919 | 0.922 | 0.911 | 0.913 | 0.936 * | 0.936 |
Apache | 1 * | 1 * | 1 * | 1 * | 1 * | 1 * | 1 * | 1 * | 1 |
BGL | 0.939 | 0.69 | 0.957 | 0.787 | 0.822 | 0.963 | 0.97 | 0.979 * | 0.979 |
Hadoop | 0.954 | 0.885 | 0.869 | 0.778 | 0.795 | 0.948 | 0.949 | 0.992 * | 0.992 |
HDFS | 1 * | 0.998 | 0.998 | 1 * | 0.998 | 0.998 | 0.998 | 0.998 | 1 |
HealthApp | 0.822 | 0.174 | 0.568 | 0.639 | 0.686 | 0.78 | 0.78 | 0.901 * | 0.901 |
HPC | 0.829 | 0.83 | 0.903 | 0.654 | 0.898 | 0.887 | 0.926 | 0.939 * | 0.939 |
Linux | 0.672 | 0.701 | 0.673 | 0.605 | 0.739 | 0.69 | 0.749 * | 0.741 | 0.749 |
Mac | 0.671 | 0.698 | 0.764 | 0.757 | 0.804 | 0.787 | 0.858 * | 0.852 | 0.858 |
OpenSSH | 0.54 | 0.925 | 0.538 | 0.554 | 0.803 | 0.788 | 0.788 | 0.995 * | 0.995 |
OpenStack | 0.331 | 0.743 | 0.758 | 0.764 | 0.764 | 0.733 | 0.733 | 1 * | 1 |
Proxifier | 0.517 | 0.508 | 0.495 | 0.527 * | 0.527 * | 0.527 * | 0.527 * | 0.504 | 0.527 |
Spark | 0.92 | 0.884 | 0.905 | 0.905 | 0.905 | 0.92 | 0.92 | 0.923 * | 0.923 |
Thunderbird | 0.663 | 0.943 | 0.941 | 0.844 | 0.95 | 0.955 | 0.955 * | 0.951 | 0.955 |
Windows | 0.567 | 0.566 | 0.69 | 0.989 | 0.99 | 0.997 * | 0.997 * | 0.996 | 0.997 |
Zookeeper | 0.962 | 0.841 | 0.921 | 0.964 | 0.964 | 0.967 | 0.967 | 0.995 * | 0.995 |
Average | 0.756 | 0.767 | 0.791 | 0.793 | 0.848 | 0.866 | 0.877 | 0.919 | N.A. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, S.; Wu, G. Efficient Online Log Parsing with Log Punctuations Signature. Appl. Sci. 2021, 11, 11974. https://doi.org/10.3390/app112411974