A Taxonomy of Techniques for SLO Failure Prediction in Software Systems
Abstract
1. Introduction
2. Related Work
3. Methodology
4. Taxonomy
4.1. Prediction Target
4.2. Time Horizon
4.3. Modeling Type
5. Survey Results
5.1. Event and Anomaly Prediction
5.1.1. Anomaly Detection
Detection Based on Time Series Forecasting
Detection Based on Normal Behavior Modeling
5.1.2. Anomaly Prediction
5.2. Performance Prediction
5.2.1. Black-Box and Machine Learning Models
5.2.2. Models Based on Queueing Theory
5.2.3. Architectural White-Box Models
5.3. Failure Prediction
5.3.1. Offline Prediction
5.3.2. Online Prediction
Black-Box and Machine Learning Models
Rule-Based Models
Architectural Models
5.4. Summarizing Taxonomy Table
6. Open Research Challenges
6.1. Explainability
6.2. Resource Consumption
6.3. Hybrid or Overarching Approaches
7. Limitations and Threats to Validity
8. Conclusions
Author Contributions
Acknowledgments
Conflicts of Interest
References
1. Kounev, S.; Lewis, P.; Bellman, K.; Bencomo, N.; Camara, J.; Diaconescu, A.; Esterle, L.; Geihs, K.; Giese, H.; Götz, S.; et al. The Notion of Self-Aware Computing. In Self-Aware Computing Systems; Kounev, S., Kephart, J.O., Milenkoski, A., Zhu, X., Eds.; Springer Verlag: Berlin/Heidelberg, Germany, 2017.
2. Salfner, F.; Lenk, M.; Malek, M. A survey of online failure prediction methods. ACM Comput. Surv. (CSUR) 2010, 42, 10.
3. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 15.
4. Amiri, M.; Mohammad-Khanli, L. Survey on prediction models of applications for resources provisioning in cloud. J. Netw. Comput. Appl. 2017, 82, 93–113.
5. Witt, C.; Bux, M.; Gusew, W.; Leser, U. Predictive performance modeling for distributed batch processing using black box monitoring and machine learning. Inf. Syst. 2019, 82, 33–52.
6. Koziolek, H. Performance evaluation of component-based software systems: A survey. Perform. Eval. 2010, 67, 634–658.
7. Márquez-Chamorro, A.E.; Resinas, M.; Ruiz-Cortés, A. Predictive Monitoring of Business Processes: A Survey. IEEE Trans. Serv. Comput. 2018, 11, 962–977.
8. Weingärtner, R.; Bräscher, G.B.; Westphall, C.B. Cloud resource management: A survey on forecasting and profiling models. J. Netw. Comput. Appl. 2015, 47, 99–106.
9. Kephart, J.O.; Chess, D.M. The vision of autonomic computing. Computer 2003, 36, 41–50.
10. Webster, J.; Watson, R.T. Analyzing the past to prepare for the future: Writing a literature review. MIS Q. 2002, 26, xiii–xxiii.
11. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Technical report; School of Computer Science and Mathematics, Keele University: Newcastle, UK, 2007.
12. Petersen, K.; Feldt, R.; Mujtaba, S.; Mattsson, M. Systematic mapping studies in software engineering. EASE 2008, 8, 68–77.
13. Booth, A. Unpacking your literature search toolbox: On search styles and tactics. Health Inf. Libr. J. 2008, 25, 313.
14. Oehler, M.; Wert, A.; Heger, C. Online Anomaly Detection Based on Monitoring Traces. In Proceedings of the Symposium on Software Performance (SSP’16), Leipzig, Germany, 5–6 November 2017; pp. 45–50.
15. Song, X.; Wu, M.; Jermaine, C.; Ranka, S. Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 2007, 19, 631–645.
16. Zhang, X.; Meng, F.; Chen, P.; Xu, J. TaskInsight: A fine-grained performance anomaly detection and problem locating system. In Proceedings of the IEEE 9th International Conference on Cloud Computing (CLOUD 2016), San Francisco, CA, USA, 27 June–2 July 2016; pp. 917–920.
17. Monni, C.; Pezzè, M. Energy-Based Anomaly Detection: A New Perspective for Predicting Software Failures. In Proceedings of the International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track, IEEE, Montreal, QC, Canada, 25–31 May 2019.
18. Monni, C.; Pezzè, M.; Prisco, G. An RBM Anomaly Detector for the Cloud. In Proceedings of the 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), Xi’an, China, 22–27 April 2019; pp. 148–159.
19. Chan, P.K.; Mahoney, M.V. Modeling Multiple Time Series for Anomaly Detection. In Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM’05, Houston, TX, USA, 27–30 November 2005; IEEE Computer Society: Washington, DC, USA, 2005; pp. 90–97.
20. Tan, Y.; Gu, X.; Wang, H. Adaptive System Anomaly Prediction for Large-scale Hosting Infrastructures. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC’10, Zurich, Switzerland, 25–28 July 2010; ACM: New York, NY, USA, 2010; pp. 173–182.
21. Schörgenhumer, A.; Kahlhofer, M.; Grünbacher, P.; Mössenböck, H. Can We Predict Performance Events with Time Series Data from Monitoring Multiple Systems? In Proceedings of the Companion of the 2019 ACM/SPEC International Conference on Performance Engineering, ICPE’19, Mumbai, India, 9–13 March 2019; ACM: New York, NY, USA, 2019; pp. 9–12.
22. Faber, M.; Happe, J. Systematic Adoption of Genetic Programming for Deriving Software Performance Curves. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE’12, Boston, MA, USA, 22–25 April 2012; ACM: New York, NY, USA, 2012; pp. 33–44.
23. Westermann, D.; Momm, C. Using Software Performance Curves for Dependable and Cost-Efficient Service Hosting. In Proceedings of the 2nd International Workshop on the Quality of Service-Oriented Software Systems, QUASOSS’10, Oslo, Norway, 4 October 2010; Association for Computing Machinery: New York, NY, USA, 2010.
24. Westermann, D.; Happe, J.; Krebs, R.; Farahbod, R. Automated inference of goal-oriented performance prediction functions. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, Essen, Germany, 3–7 September 2012; pp. 190–199.
25. Kwon, Y.; Lee, S.; Yi, H.; Kwon, D.; Yang, S.; Chun, B.G.; Huang, L.; Maniatis, P.; Naik, M.; Paek, Y. Mantis: Automatic performance prediction for smartphone applications. In Proceedings of the 2013 USENIX Annual Technical Conference, San Jose, CA, USA, 26–28 June 2013; USENIX Association: Berkeley, CA, USA; pp. 297–308.
26. Thereska, E.; Doebel, B.; Zheng, A.X.; Nobel, P. Practical Performance Models for Complex, Popular Applications. SIGMETRICS Perform. Eval. Rev. 2010, 38, 1–12.
27. Abdelzaher, T.F.; Shin, K.G.; Bhatti, N. Performance guarantees for web server end-systems: A control-theoretical approach. IEEE Trans. Parallel Distrib. Syst. 2002, 13, 80–96.
28. Almeida, J.; Almeida, V.; Ardagna, D.; Cunha, Í.; Francalanci, C.; Trubian, M. Joint admission control and resource allocation in virtualized servers. J. Parallel Distrib. Comput. 2010, 70, 344–362.
29. Tesauro, G.; Jong, N.K.; Das, R.; Bennani, M.N. A hybrid reinforcement learning approach to autonomic resource allocation. In Proceedings of the 2006 IEEE International Conference on Autonomic Computing, ICAC’06, Dublin, Ireland, 12–16 June 2006; pp. 65–73.
30. Kephart, J.O.; Chan, H.; Das, R.; Levine, D.W.; Tesauro, G.; Rawson III, F.L.; Lefurgy, C. Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs. ICAC 2007, 7, 24–33.
31. Noorshams, Q.; Bruhn, D.; Kounev, S.; Reussner, R. Predictive performance modeling of virtualized storage systems using optimized statistical regression techniques. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE’13, Prague, Czech Republic, 21–24 April 2013; ACM: New York, NY, USA, 2013; pp. 283–294.
32. Chow, M.; Meisner, D.; Flinn, J.; Peek, D.; Wenisch, T.F. The mystery machine: End-to-end performance analysis of large-scale internet services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO, USA, 6–8 October 2014; pp. 217–231.
33. Chen, Y.; Das, A.; Qin, W.; Sivasubramaniam, A.; Wang, Q.; Gautam, N. Managing server energy and operational costs in hosting centers. ACM SIGMETRICS Perform. Eval. Rev. 2005, 33, 303–314.
34. Zhang, Q.; Cherkasova, L.; Smirni, E. A Regression-Based Analytic Model for Dynamic Resource Provisioning of Multi-Tier Applications. In Proceedings of the Fourth International Conference on Autonomic Computing (ICAC’07), Jacksonville, FL, USA, 11–15 June 2007; pp. 27–37.
35. Urgaonkar, B.; Pacifici, G.; Shenoy, P.; Spreitzer, M.; Tantawi, A. An analytical model for multi-tier internet services and its applications. ACM SIGMETRICS Perform. Eval. Rev. 2005, 33, 291–302.
36. Bennani, M.; Menasce, D. Resource allocation for autonomic data centers using analytic performance models. In Proceedings of the Second International Conference on Autonomic Computing (ICAC 2005), Seattle, WA, USA, 13–16 June 2005; pp. 229–240.
37. Abrahao, B.; Almeida, V.; Almeida, J.; Zhang, A.; Beyer, D.; Safai, F. Self-adaptive SLA-driven capacity management for internet services. In Proceedings of the 10th IEEE/IFIP Network Operations and Management Symposium, Vancouver, BC, Canada, 3–7 April 2006; pp. 557–568.
38. Jung, G.; Joshi, K.R.; Hiltunen, M.A.; Schlichting, R.D.; Pu, C. Generating adaptation policies for multi-tier applications in consolidated server environments. In Proceedings of the International Conference on Autonomic Computing (ICAC), Chicago, IL, USA, 2–6 June 2008; pp. 23–32.
39. Menascé, D.A.; Gomaa, H. A Method for Design and Performance Modeling of Client/Server Systems. IEEE Trans. Softw. Eng. 2000, 26, 1066–1085.
40. Santhi, K.; Saravanan, R. Performance analysis of cloud computing using series of queues with Erlang service. Int. J. Internet Technol. Secur. Trans. 2019, 9, 147–162.
41. Li, J.; Chinneck, J.; Woodside, M.; Litoiu, M.; Iszlai, G. Performance model driven QoS guarantees and optimization in clouds. In Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, Vancouver, BC, Canada, 23 May 2009; pp. 15–22.
42. Kounev, S. Performance modeling and evaluation of distributed component-based systems using queueing Petri nets. IEEE Trans. Softw. Eng. 2006, 32, 486–502.
43. Gilmore, S.; Haenel, V.; Kloul, L.; Maidl, M. Choreographing security and performance analysis for web services. In Formal Techniques for Computer Systems and Business Processes; Springer: Berlin, Germany, 2005; pp. 200–214.
44. Eskenazi, E.; Fioukov, A.; Hammer, D. Performance prediction for component compositions. In Proceedings of the International Symposium on Component-Based Software Engineering, Edinburgh, UK, 24–25 May 2004; pp. 280–293.
45. Garlan, D.; Monroe, R.T.; Wile, D. Acme: An Architecture Description Interchange Language. In Proceedings of the 1997 Conference of the Centre for Advanced Studies on Collaborative Research (CASCON’97), Toronto, ON, Canada, 10–13 November 1997; pp. 169–183.
46. Spitznagel, B.; Garlan, D. Architecture-based performance analysis. In Proceedings of the 1998 Conference on Software Engineering and Knowledge Engineering, San Francisco, CA, USA, 18–20 June 1998; pp. 146–151.
47. Garlan, D.; Monroe, R.; Wile, D. Acme: An Architecture Description Interchange Language; CASCON First Decade High Impact Papers; IBM Corp.: Armonk, NY, USA, 2010; pp. 159–173.
48. Bondarev, E.; Muskens, J.; de With, P.; Chaudron, M.; Lukkien, J. Predicting real-time properties of component assemblies: A scenario-simulation approach. In Proceedings of the 30th Euromicro Conference, Rennes, France, 3 September 2004; pp. 40–47.
49. Smith, C.U.; Lladó, C.M.; Cortellessa, V.; Marco, A.D.; Williams, L.G. From UML models to software performance results: An SPE process based on XML interchange formats. In Proceedings of the 5th International Workshop on Software and Performance, Illes Balears, Spain, 12–14 July 2005; pp. 87–98.
50. Petriu, D.B.; Woodside, M. An intermediate metamodel with scenarios and resources for generating performance models from UML designs. Softw. Syst. Model. 2007, 6, 163–184.
51. Grassi, V.; Mirandola, R.; Sabetta, A. Filling the gap between design and performance/reliability models of component-based systems: A model-driven approach. J. Syst. Softw. 2007, 80, 528–558.
52. Becker, S.; Koziolek, H.; Reussner, R. The Palladio component model for model-driven performance prediction. J. Syst. Softw. 2009, 82, 3–22.
53. Kounev, S.; Huber, N.; Brosig, F.; Zhu, X. A Model-Based Approach to Designing Self-Aware IT Systems and Infrastructures. IEEE Comput. 2016, 49, 53–61.
54. Goševa-Popstojanova, K.; Trivedi, K.S. Architecture-based approach to reliability assessment of software systems. Perform. Eval. 2001, 45, 179–204.
55. Grunske, L.; Han, J. A comparative study into architecture-based safety evaluation methodologies using AADL’s error annex and failure propagation models. In Proceedings of the 11th IEEE High Assurance Systems Engineering Symposium, Nanjing, China, 3–5 December 2008; pp. 283–292.
56. Babar, M.A.; Gorton, I. Comparison of scenario-based software architecture evaluation methods. In Proceedings of the 11th Asia-Pacific Software Engineering Conference, Busan, Korea, 30 November–3 December 2004; pp. 600–607.
57. Babar, M.A.; Zhu, L.; Jeffery, R. A framework for classifying and comparing software architecture evaluation methods. In Proceedings of the 2004 Australian Software Engineering Conference, Melbourne, Victoria, Australia, 13–16 April 2004; pp. 309–318.
58. Grunske, L. Early quality prediction of component-based systems—A generic framework. J. Syst. Softw. 2007, 80, 678–686.
59. Cheung, R.C. A User-Oriented Software Reliability Model. IEEE Trans. Softw. Eng. 1980, SE-6, 118–125.
60. Cortellessa, V.; Grassi, V. A modeling approach to analyze the impact of error propagation on reliability of component-based systems. In Proceedings of the International Symposium on Component-Based Software Engineering, Berlin, Heidelberg, Germany, 14–17 October 2007; pp. 140–156.
61. Yilmaz, C.; Porter, A. Combining Hardware and Software Instrumentation to Classify Program Executions. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE’10, Santa Fe, NM, USA, November 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 67–76.
62. Alonso, J.; Belanche, L.; Avresky, D.R. Predicting Software Anomalies Using Machine Learning Techniques. In Proceedings of the IEEE 10th International Symposium on Network Computing and Applications, Cambridge, MA, USA, 25–27 August 2011; pp. 163–170.
63. Lou, J.; Jiang, Y.; Shen, Q.; Wang, R. Failure prediction by relevance vector regression with improved quantum-inspired gravitational search. J. Netw. Comput. Appl. 2018, 103, 171–177.
64. Li, L.; Lu, M.; Gu, T. Extracting Interaction-Related Failure Indicators for Online Detection and Prediction of Content Failures. In Proceedings of the 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Memphis, TN, USA, 15–18 October 2018; pp. 278–285.
65. Sharma, B.; Jayachandran, P.; Verma, A.; Das, C.R. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Proceedings of the 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Budapest, Hungary, 24–27 June 2013; pp. 1–12.
66. Daraghmeh, M.; Agarwal, A.; Goel, N.; Kozlowskif, J. Local Regression Based Box-Cox Transformations for Resource Management in Cloud Networks. In Proceedings of the Sixth International Conference on Software Defined Systems (SDS), Rome, Italy, 10–13 June 2019; pp. 229–235.
67. Grohmann, J.; Nicholson, P.K.; Iglesias, J.O.; Kounev, S.; Lugones, D. Monitorless: Predicting Performance Degradation in Cloud Applications with Machine Learning. In Proceedings of the 20th ACM/IFIP Middleware Conference, Middleware ’19, Davis, CA, USA, 9–13 December 2019; ACM: New York, NY, USA, 2019.
68. Cavallo, B.; Di Penta, M.; Canfora, G. An empirical comparison of methods to support QoS-aware service selection. In Proceedings of the 2nd International Workshop on Principles of Engineering Service-Oriented Systems, Cape Town, South Africa, 1–2 May 2010; pp. 64–70.
69. Amin, A.; Colman, A.; Grunske, L. Using automated control charts for the runtime evaluation of QoS attributes. In Proceedings of the IEEE 13th International Symposium on High-Assurance Systems Engineering, Boca Raton, FL, USA, 10–12 November 2011; pp. 299–306.
70. Amin, A.; Colman, A.; Grunske, L. An approach to forecasting QoS attributes of web services based on ARIMA and GARCH models. In Proceedings of the IEEE 19th International Conference on Web Services, Honolulu, HI, USA, 24–29 June 2012; pp. 74–81.
71. Amin, A.; Grunske, L.; Colman, A. An automated approach to forecasting QoS attributes based on linear and nonlinear time series modeling. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, Essen, Germany, 3–7 September 2012; pp. 130–139.
72. Van Beek, V.; Oikonomou, G.; Iosup, A. A CPU contention predictor for business-critical workloads in cloud datacenters. In Proceedings of the IEEE 4th International Workshops on Foundations and Applications of Self* Systems (FAS*W), Umeå, Sweden, 16–20 June 2019; pp. 56–61.
73. Clemm, A.; Hartwig, M. NETradamus: A forecasting system for system event messages. In Proceedings of the IEEE Network Operations and Management Symposium (NOMS 2010), Osaka, Japan, 19–23 April 2010; pp. 623–630.
74. Gu, X.; Papadimitriou, S.; Yu, P.S.; Chang, S.P. Online failure forecast for fault-tolerant data stream processing. In Proceedings of the IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008; pp. 1388–1390.
75. Pitakrat, T.; Grunert, J.; Kabierschke, O.; Keller, F.; van Hoorn, A. A Framework for System Event Classification and Prediction by Means of Machine Learning. In Proceedings of the 8th International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS’14, Bratislava, Slovakia, 9–11 December 2014; ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering): Brussels, Belgium, 2014; pp. 173–180.
76. Pitakrat, T.; Okanović, D.; van Hoorn, A.; Grunske, L. Hora: Architecture-aware online failure prediction. J. Syst. Softw. 2018, 137, 669–685.
77. Pertet, S.; Narasimhan, P. Handling cascading failures: The case for topology-aware fault-tolerance. In Proceedings of the IEEE First Workshop on Hot Topics in System Dependability, Yokohama, Japan, 30 June 2005.
78. Capelastegui, P.; Navas, A.; Huertas, F.; Garcia-Carmona, R.; Dueñas, J.C. An online failure prediction system for private IaaS platforms. In Proceedings of the 2nd International Workshop on Dependability Issues in Cloud Computing, Braga, Portugal, 30 September 2013; pp. 1–3.
79. Ozcelik, B.; Yilmaz, C. Seer: A Lightweight Online Failure Prediction Approach. IEEE Trans. Softw. Eng. 2016, 42, 26–46.
80. Mariani, L.; Pezzè, M.; Riganelli, O.; Xin, R. Predicting failures in multi-tier distributed systems. J. Syst. Softw. 2020, 161, 110464.
81. Bielefeld, T.C. Online Performance Anomaly Detection for Large-Scale Software Systems. Master’s Thesis, Kiel University, Kiel, Germany, 2012.
82. Frotscher, T. Architecture-Based Multivariate Anomaly Detection for Software Systems. Master’s Thesis, Kiel University, Kiel, Germany, 2013.
83. Rathfelder, C.; Becker, S.; Krogmann, K.; Reussner, R. Workload-aware System Monitoring Using Performance Predictions Applied to a Large-scale E-Mail System. In Proceedings of the 2012 Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture, Helsinki, Finland, 20–24 August 2012; pp. 31–40.
84. Van Hoorn, A. Model-Driven Online Capacity Management for Component-Based Software Systems. Ph.D. Thesis, Department of Computer Science, Kiel University, Kiel, Germany, 2014.
85. Brosch, F. Integrated Software Architecture-Based Reliability Prediction for IT Systems. Ph.D. Thesis, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, 2012.
86. Uhle, J. On Dependability Modeling in a Deployed Microservice Architecture. Master’s Thesis, Universität Potsdam, Potsdam, Germany, 2014.
87. Pitakrat, T. Architecture-Aware Online Failure Prediction for Software Systems. Ph.D. Thesis, Universität Stuttgart, Stuttgart, Germany, 2018.
88. Mohamed, A. Software Architecture-Based Failure Prediction. Ph.D. Thesis, Queen’s University, Kingston, ON, Canada, 2012.
89. Chan, P.K.; Mahoney, M.V.; Arshad, M.H. A Machine Learning Approach to Anomaly Detection; Technical report; Florida Institute of Technology: Melbourne, FL, USA, 2003.
90. Object Management Group (OMG). UML Profile for Modeling and Analysis of Real-Time and Embedded Systems (MARTE); Object Management Group (OMG): Needham, MA, USA, 2006.
91. Pertet, S.; Narasimhan, P. Causes of Failure in Web Applications; Technical Report CMU-PDL-05-109; Carnegie Mellon University: Pittsburgh, PA, USA, 2005.
92. Iyer, A.; Zhao, Y. Time Series Metric Data Modeling and Prediction. U.S. Patent 9,323,599, 20 April 2016.
93. Mayle, G.E.; Reves, J.P.; Clubb, J.A.; Wilson, L.F. Automated Adaptive Baselining and Thresholding Method and System. U.S. Patent 6,182,022, 30 January 2001.
94. Avizienis, A.; Laprie, J.; Randell, B.; Landwehr, C. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secur. Comput. 2004, 1, 11–33.
95. Pitakrat, T.; Van Hoorn, A.; Grunske, L. A comparison of machine learning algorithms for proactive hard disk drive failure detection. In Proceedings of the 4th International ACM SIGSOFT Symposium on Architecting Critical Systems, Vancouver, BC, Canada, 17–21 June 2013; pp. 1–10.
96. Murray, J.F.; Hughes, G.F.; Kreutz-Delgado, K. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res. 2005, 6, 783–816.
97. Eckart, B.; Chen, X.; He, X.; Scott, S.L. Failure prediction models for proactive fault tolerance within storage systems. In Proceedings of the 2008 IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems, Baltimore, MD, USA, 8–10 September 2008; pp. 1–8.
98. Zhu, B.; Wang, G.; Liu, X.; Hu, D.; Lin, S.; Ma, J. Proactive drive failure prediction for large scale storage systems. In Proceedings of the 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), Long Beach, CA, USA, 6–10 May 2013; pp. 1–5.
99. Ganguly, S.; Consul, A.; Khan, A.; Bussone, B.; Richards, J.; Miguel, A. A practical approach to hard disk failure prediction in cloud platforms: Big data model for failure management in datacenters. In Proceedings of the 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), Oxford, UK, 29 March–1 April 2016; pp. 105–116.
100. Wang, Y.; Ma, E.W.; Chow, T.W.; Tsui, K.L. A two-step parametric method for failure prediction in hard disk drives. IEEE Trans. Ind. Informat. 2013, 10, 419–430.
101. Züfle, M.; Krupitzer, C.; Erhard, F.; Grohmann, J.; Kounev, S. To Fail or Not to Fail: Predicting Hard Disk Drive Failure Time Windows. In Proceedings of the 20th International GI/ITG Conference on Measurement, Modelling and Evaluation of Computing Systems (MMB 2020), Saarbrücken, Germany, 16–18 March 2020.
102. Chalermarrewong, T.; Achalakul, T.; See, S.C.W. Failure prediction of data centers using time series and fault tree analysis. In Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, Singapore, 17–19 December 2012; pp. 794–799.
103. Wold, H. A Study in the Analysis of Stationary Time Series. Ph.D. Thesis, Almqvist & Wiksell, Stockholm, Sweden, 1938.
104. Hyndman, R.J.; Koehler, A.B.; Snyder, R.D.; Grose, S. A state space framework for automatic forecasting using exponential smoothing methods. Int. J. Forecast. 2002, 18, 439–454.
105. Goodwin, P. The Holt-Winters approach to exponential smoothing: 50 years old and going strong. Foresight 2010, 19, 30–33.
106. Herbst, N.R.; Huber, N.; Kounev, S.; Amrehn, E. Self-Adaptive Workload Classification and Forecasting for Proactive Resource Provisioning. In Concurrency and Computation—Practice and Experience; John Wiley and Sons, Ltd.: Hoboken, NJ, USA, 2014; Volume 26, pp. 2053–2078.
107. De Livera, A.M.; Hyndman, R.J.; Snyder, R.D. Forecasting time series with complex seasonal patterns using exponential smoothing. J. Am. Stat. Assoc. 2011, 106, 1513–1527.
108. Züfle, M.; Bauer, A.; Herbst, N.; Curtef, V.; Kounev, S. Telescope: A Hybrid Forecast Method for Univariate Time Series. In Proceedings of the International Work-Conference on Time Series (ITISE 2017), Granada, Spain, 18–20 September 2017.
109. Faloutsos, C.; Flunkert, V.; Gasthaus, J.; Januschowski, T.; Wang, Y. Forecasting Big Time Series: Theory and Practice. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, 4–8 August 2019.
110. Bauer, A.; Züfle, M.; Grohmann, J.; Schmitt, N.; Herbst, N.; Kounev, S. An Automated Forecasting Framework based on Method Recommendation for Seasonal Time Series. In Proceedings of the 11th ACM/SPEC International Conference on Performance Engineering (ICPE 2020), Edmonton, AB, Canada, 24 April 2020; ACM: New York, NY, USA, 2020.
111. Verma, A.; Ahuja, P.; Neogi, A. pMapper: Power and migration cost aware application placement in virtualized systems. In Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Leuven, Belgium, 1–5 December 2008; Springer: New York, NY, USA, 2008; pp. 243–264.
112. Jung, G.; Hiltunen, M.A.; Joshi, K.R.; Schlichting, R.D.; Pu, C. Mistral: Dynamically managing power, performance, and adaptation cost in cloud infrastructures. In Proceedings of the IEEE 30th International Conference on Distributed Computing Systems (ICDCS 2010), Genova, Italy, 21–25 June 2010; pp. 62–73.
113. Mi, H.; Wang, H.; Yin, G.; Zhou, Y.; Shi, D.; Yuan, L. Online self-reconfiguration with performance guarantee for energy-efficient large-scale cloud computing data centers. In Proceedings of the IEEE International Conference on Services Computing (SCC 2010), Miami, FL, USA, 5–10 July 2010; pp. 514–521.
114. Lorido-Botran, T.; Miguel-Alonso, J.; Lozano, J.A. A Review of Auto-scaling Techniques for Elastic Applications in Cloud Environments. J. Grid Comput. 2014, 12, 559–592.
115. Noorshams, Q. Modeling and Prediction of I/O Performance in Virtualized Environments. Ph.D. Thesis, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, 2015.
116. Walter, J.; van Hoorn, A.; Kounev, S. Automated and Adaptable Decision Support for Software Performance Engineering. In Proceedings of the 11th EAI International Conference on Performance Evaluation Methodologies and Tools, Venice, Italy, 5–7 December 2017; pp. 66–73.
117. Walter, J. Automation in Software Performance Engineering Based on a Declarative Specification of Concern. Ph.D. Thesis, University of Würzburg, Würzburg, Germany, 2018.
118. Rygielski, P.; Kounev, S.; Tran-Gia, P. Flexible Performance Prediction of Data Center Networks using Automatically Generated Simulation Models. In Proceedings of the Eighth International Conference on Simulation Tools and Techniques (SIMUTools 2015), Athens, Greece, 24–26 August 2015; pp. 119–128.
119. Rygielski, P. Flexible Modeling of Data Center Networks for Capacity Management. Ph.D. Thesis, University of Würzburg, Würzburg, Germany, 2017.
120. Grohmann, J.; Eismann, S.; Kounev, S. The Vision of Self-Aware Performance Models. In Proceedings of the 2018 IEEE International Conference on Software Architecture Companion (ICSA-C), Seattle, WA, USA, 30 April–4 May 2018; pp. 60–63.
121. Eismann, S.; Grohmann, J.; Walter, J.; von Kistowski, J.; Kounev, S. Integrating Statistical Response Time Models in Architectural Performance Models. In Proceedings of the 2019 IEEE International Conference on Software Architecture (ICSA), Hamburg, Germany, 25–29 March 2019; pp. 71–80.
122. Bezemer, C.; Eismann, S.; Ferme, V.; Grohmann, J.; Heinrich, R.; Jamshidi, P.; Shang, W.; van Hoorn, A.; Villavicencio, M.; Walter, J.; et al. How is Performance Addressed in DevOps? In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, Mumbai, India, 7–11 April 2019; pp. 45–50.
123. Islam, T.; Manivannan, D. Predicting application failure in cloud: A machine learning approach. In Proceedings of the 2017 IEEE International Conference on Cognitive Computing (ICCC), Honolulu, HI, USA, 25–30 June 2017; pp. 24–31.
124. Zheng, Z.; Lan, Z.; Gupta, R.; Coghlan, S.; Beckman, P. A practical failure prediction with location and lead time for Blue Gene/P. In Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), Chicago, IL, USA, 28 June–1 July 2010; pp. 15–22.
125. Yu, L.; Zheng, Z.; Lan, Z.; Coghlan, S. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), Hong Kong, China, 27–30 June 2011; pp. 259–264.
126. Liang, Y.; Zhang, Y.; Xiong, H.; Sahoo, R. Failure prediction in IBM BlueGene/L event logs. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 583–588.
Prediction Target | Time Horizon | Modeling Type | G | A | E | References |
---|---|---|---|---|---|---|
Anomaly Prediction | Detection | Time Series | + | + | - | [14,81,82,92] |
Anomaly Prediction | Detection | Model based | + | + | - | [15,16,17,18,19,83,89,93] |
Anomaly Prediction | Prediction | | + | + | - | [20,21] |
Performance Prediction | | Black box | + | - | - | [22,23,24,25,26,27,28,29,30,31,32] |
Performance Prediction | | Queueing Theory | - | + | - | [33,34,35,36,37,38,39,40,41,42,43,44] |
Performance Prediction | | Architectural | - | + | + | [45,46,47,48,49,50,51,52,53,84,90] |
Failure Prediction | Offline | | - | - | + | [54,55,56,57,58,59,60,61,85,86] |
Failure Prediction | Online | Black box | + | - | - | [62,63,64,65,66,67,68,69,70,71,72] |
Failure Prediction | Online | Rule based | - | - | + | [73,74,75] |
Failure Prediction | Online | Architectural | - | + | + | [76,77,78,79,80,87,88,91] |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Grohmann, J.; Herbst, N.; Chalbani, A.; Arian, Y.; Peretz, N.; Kounev, S. A Taxonomy of Techniques for SLO Failure Prediction in Software Systems. Computers 2020, 9, 10. https://doi.org/10.3390/computers9010010