HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving
Abstract
1. Introduction
2. Background and Related Work
2.1. Machine-Learning Inference Serving
2.2. Autoscaling and Resource Efficiency
2.3. Multi-Tenant Inference
3. Multi-Tenant Inference Characterization Study
3.1. Effectiveness of Multi-Tenant Inference
3.2. Interference of Multi-Tenant Inference
4. Our Approach: HetSev
4.1. Heterogeneity-Aware Instance Autoscaling
4.2. Resource-Efficient Instance Scheduling
Algorithm 1 Resource-efficient instance scheduling. Input: S, the current cluster state. Output: k queues and instances to put into the buffer for each scheduling round.
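The caption above preserves only Algorithm 1's input (the cluster state S) and output (the instances selected for each scheduling round); the authors' actual procedure is not reproduced here. As a loose, hypothetical illustration of what a resource-efficient placement round can look like, the sketch below uses generic best-fit packing on GPU memory. The `Node` and `Instance` types and their fields are assumptions for this sketch, not definitions from the paper.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpu_mem: float  # MB of free GPU memory (hypothetical field)

@dataclass
class Instance:
    model: str
    gpu_mem: float  # MB of GPU memory the instance requests

def schedule_round(cluster, pending, k):
    """Place up to k pending instances, sending each to the node whose
    remaining GPU memory best fits the request (classic best-fit packing,
    which tends to pack instances densely and reduce fragmentation)."""
    placements = []
    for inst in pending[:k]:
        # Candidate nodes that can host this instance.
        feasible = [n for n in cluster if n.free_gpu_mem >= inst.gpu_mem]
        if not feasible:
            continue  # leave the instance queued for the next round
        # Best fit: the node left with the least leftover memory.
        target = min(feasible, key=lambda n: n.free_gpu_mem - inst.gpu_mem)
        target.free_gpu_mem -= inst.gpu_mem
        placements.append((inst.model, target.name))
    return placements
```

Best fit is only one possible packing heuristic; an interference- or heterogeneity-aware scheduler such as HetSev would additionally weigh per-GPU performance and co-location interference when ranking candidate nodes.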
5. System Implementation
6. Evaluation
6.1. HetSev with Production Workload
6.2. HetSev with Fluctuating Workload
6.3. Cost Effectiveness
7. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Model | Type | Size | Input Data
---|---|---|---
ResNet50 | Image classification | 90 MB | ImageNet
Inception-v3 | Image classification | 83 MB | ImageNet
SSD-ResNet50 | Object detection | 77 MB | COCO
Transformer | Language translation | 168 MB | WMT 2014 English-to-German
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Mo, H.; Zhu, L.; Shi, L.; Tan, S.; Wang, S. HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving. Electronics 2023, 12, 240. https://doi.org/10.3390/electronics12010240