Improvements to Supercomputing Service Availability Based on Data Analysis
Abstract
1. Introduction
2. System Configuration
2.1. Hardware Configuration
2.2. Software Configuration
3. Problem Statement
4. Method
4.1. Workflow with K-Hook
4.2. Flowchart with K-Hook
5. Result
5.1. Evaluation of Submitted-Job Success Rate
5.2. Analysis of Waiting Time for Main Queues
5.3. MTBI of Supercomputing Service
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Type | Queue | Total Nodes | Total CPU Cores | Wall Clock Limit (Hours) | Max. Submit Jobs | Max. Running Jobs |
---|---|---|---|---|---|---|
KNL | exclusive | 2600 | 176,800 | unlimited | 100 | 100 |
KNL | normal | 4970 | 337,960 | 48 | 40 | 20 |
KNL | long | 300 | 20,400 | 120 | 20 | 10 |
KNL | flat | 180 | 12,240 | 48 | 20 | 10 |
KNL | debug | 20 | 1360 | 48 | 2 | 2 |
SKL | commercial | 118 | 4720 | 48 | 6 | 2 |
SKL | norm_skl | 118 | 4720 | 48 | 10 | 5 |
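
The node and core totals in the queue table can be cross-checked with a little arithmetic. The sketch below is illustrative only: `QUEUES` and `cores_per_node` are hypothetical names, the figures are copied from the table, and the assumption is that every node in a queue contributes the same number of cores.

```python
# Queue figures copied from the table: (type, queue, total nodes, total CPU cores).
QUEUES = [
    ("KNL", "exclusive", 2600, 176_800),
    ("KNL", "normal", 4970, 337_960),
    ("KNL", "long", 300, 20_400),
    ("KNL", "flat", 180, 12_240),
    ("KNL", "debug", 20, 1_360),
    ("SKL", "commercial", 118, 4_720),
    ("SKL", "norm_skl", 118, 4_720),
]


def cores_per_node(queues):
    """Return {(type, queue): cores per node}.

    Assumes homogeneous nodes within a queue; raises ValueError when the
    core total is not an exact multiple of the node count.
    """
    result = {}
    for node_type, queue, nodes, cores in queues:
        per_node, remainder = divmod(cores, nodes)
        if remainder:
            raise ValueError(f"{queue}: {cores} cores do not divide evenly over {nodes} nodes")
        result[(node_type, queue)] = per_node
    return result


if __name__ == "__main__":
    for (node_type, queue), per_node in cores_per_node(QUEUES).items():
        print(f"{node_type:3} {queue:10} {per_node} cores/node")
```

With the figures above, every KNL queue works out to 68 cores per node and both SKL queues to 40, so the table is internally consistent.
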
Item | The Cause of Job Failure | Percentage (%) |
---|---|---|
PBS error | Job was requeued (if rerunnable) or deleted (if not) | 0.01 |
PBS error | Job execution failed, do retry | 0.04 |
PBS error | Job execution failed, before files, no retry | 0.12 |
PBS error | Job deletion with qdel includes walltime limit | 36.37 |
PBS error | Licensed CPUs exceeded | 0.09 |
PBS error | Undefined attribute | 0.36 |
PBS error | PBS etc. | 0.001 |
Program error | Operation not permitted | 41.01 |
Program error | Argument list was too long | 0.04 |
Program error | Exec format error | 0.47 |
Program error | Command not found | 1.91 |
Program error | Stack overflow | 0.56 |
Program error | Segmentation fault | 0.28 |
Program error | Floating-point exception | 0.07 |
Program error | Illegal instruction | 0.1 |
I/O error | Operation requires sequential file organization and access | 0.41 |
I/O error | I/O procedure was truncated | 0.005 |
I/O error | No such file or directory | 10.37 |
I/O error | Input/output error | 0.2 |
I/O error | Bad file descriptor | 1.34 |
I/O error | Too many open files | 0.24 |
H/W error | No such device or address | 0.31 |
H/W error | No child processes | 0.01 |
H/W error | Resource temporarily unavailable | 0.26 |
H/W error | Cannot allocate memory | 0.03 |
H/W error | Bus error | 0.01 |
User error | Abort signal | 0.18 |
User error | Kill signal | 0.09 |
User error | Termination signal | 2.05 |
etc. | etc. | 3.07 |
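
To see how the individual failure causes add up per item, the percentages in the table can be summed by category. This is a minimal sketch rather than the authors' analysis code: `FAILURES` and `share_by_item` are hypothetical names, and the values are copied from the table (cause descriptions are omitted for brevity).

```python
from collections import defaultdict

# (item, percentage) pairs copied row by row from the job-failure table.
FAILURES = [
    ("PBS error", 0.01), ("PBS error", 0.04), ("PBS error", 0.12),
    ("PBS error", 36.37), ("PBS error", 0.09), ("PBS error", 0.36),
    ("PBS error", 0.001),
    ("Program error", 41.01), ("Program error", 0.04), ("Program error", 0.47),
    ("Program error", 1.91), ("Program error", 0.56), ("Program error", 0.28),
    ("Program error", 0.07), ("Program error", 0.1),
    ("I/O error", 0.41), ("I/O error", 0.005), ("I/O error", 10.37),
    ("I/O error", 0.2), ("I/O error", 1.34), ("I/O error", 0.24),
    ("H/W error", 0.31), ("H/W error", 0.01), ("H/W error", 0.26),
    ("H/W error", 0.03), ("H/W error", 0.01),
    ("User error", 0.18), ("User error", 0.09), ("User error", 2.05),
    ("etc.", 3.07),
]


def share_by_item(rows):
    """Sum the failure percentages for each item (category)."""
    totals = defaultdict(float)
    for item, pct in rows:
        totals[item] += pct
    return dict(totals)


if __name__ == "__main__":
    totals = share_by_item(FAILURES)
    # Print categories from largest to smallest share.
    for item, pct in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{item:14} {pct:7.3f} %")
    print(f"{'Total':14} {sum(totals.values()):7.3f} %")
```

Summed this way, program errors account for roughly 44.44% of failures, PBS errors for about 36.99%, and I/O errors for about 12.57%; the grand total comes to about 100.006%, with the small excess attributable to rounding in the published figures.
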
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Interrupts | 1 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1 | 15 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lee, J.-K.; Kwon, M.-W.; An, D.-S.; Yoon, J.; Hong, T.; Woo, J.; Kim, S.-J.; Li, G. Improvements to Supercomputing Service Availability Based on Data Analysis. Appl. Sci. 2021, 11, 6166. https://doi.org/10.3390/app11136166