Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs
Abstract
1. Introduction
- We propose a new hybrid storage format (HYB5) for sparse matrices, which combines SELL-C-σ and CSR5 and their respective SpMV implementations (Section 4).
- We use a machine-learning-based approach to automatically predict the optimal matrix segmentation threshold (Section 5).
- We demonstrate that, by using the best of both worlds, our approach outperforms state-of-the-art SpMV implementations on three representative multi-core CPU architectures (Section 6).
2. Background
2.1. Sparse Matrix–Vector Multiplication
2.2. Sparse Matrix Storage Format
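To make the kernel concrete before the format discussion, here is a minimal scalar CSR SpMV in Python (NumPy only). It is a reference sketch, not the SIMD-vectorized native kernels this paper targets; the indirection through `col_idx` is what makes SpMV memory-bound and hard to vectorize.

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=vals.dtype)
    for i in range(n_rows):
        # Row i owns the entries row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y
```

Irregular row lengths make the inner loop's trip count vary per row; this is the load-imbalance and SIMD-utilization problem that formats such as SELL-C-σ and CSR5 are designed to mitigate.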
3. Motivation
4. Our Approach
4.1. HYB5 Overview
4.2. Matrix Segmentation
Algorithm 1: Convert COO to HYB5.
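Algorithm 1 itself is not reproduced here. The sketch below shows one plausible reading of the segmentation step, assuming rows are split by their non-zero count against a threshold theta: regular (short) rows feed the SELL-C-σ part and long rows feed the CSR5 part. All names (`segment_coo`, `theta`) are illustrative, not the paper's API.

```python
import numpy as np
from scipy.sparse import coo_matrix

def segment_coo(A, theta):
    """Split a COO matrix into a regular part and an irregular part
    by row length; theta is the predicted segmentation threshold."""
    A = A.tocoo()
    row_nnz = np.bincount(A.row, minlength=A.shape[0])  # non-zeros per row
    in_long = (row_nnz > theta)[A.row]                  # mask over entries

    def pick(mask):
        return coo_matrix((A.data[mask], (A.row[mask], A.col[mask])),
                          shape=A.shape)

    # Regular rows -> SELL-C-sigma-style part; long rows -> CSR5-style part.
    return pick(~in_long), pick(in_long)
```

Both parts keep the original shape, so `A @ x` equals the sum of the two partial products, which is how the two per-format kernels can be combined.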
4.3. Compressed CSR
Algorithm 2: Compress the CSR format.
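The body of Algorithm 2 is likewise not shown. As a generic illustration of CSR index compression (not necessarily what Algorithm 2 does), the sketch below delta-encodes column indices within each row so they fit a narrower integer type:

```python
import numpy as np

def compress_csr_cols(row_ptr, col_idx):
    """Delta-encode CSR column indices within each row.

    Absolute 32-bit column indices become a per-row starting column plus
    small per-entry deltas stored in 16 bits, roughly halving the index
    stream. Assumes columns are sorted within each row and that the gaps
    stay below 2**15. A generic illustration only.
    """
    first_col = np.zeros(len(row_ptr) - 1, dtype=col_idx.dtype)
    deltas = np.zeros(len(col_idx), dtype=np.int16)
    for i in range(len(row_ptr) - 1):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        if lo < hi:
            first_col[i] = col_idx[lo]
            deltas[lo + 1:hi] = np.diff(col_idx[lo:hi])
    return first_col, deltas
```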
4.4. Format Conversion
5. Adaptive Parameter Tuning
5.1. Training the Predictor
5.1.1. Generating Training Data
5.1.2. Building the Model
5.1.3. Training Overhead
5.2. Feature Engineering
5.2.1. Feature Selection
5.2.2. Feature Scaling
5.3. Runtime Deployment
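Sections 5.1–5.3 can be pictured as a small scikit-learn pipeline. Everything below is synthetic and assumed for illustration: the random training data, the decision-tree model family, and the hyper-parameters all stand in for the paper's actual training setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: one row of matrix features
# per training matrix, labelled with the best-performing threshold class.
X_train = rng.random((500, 8))
y_train = rng.integers(0, 4, size=500)

# Feature scaling (Section 5.2.2) feeding a classifier (Section 5.1.2).
predictor = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=6))
predictor.fit(X_train, y_train)

# Runtime deployment (Section 5.3): featurize the incoming matrix once,
# predict the segmentation threshold, then convert and run HYB5 SpMV.
theta_class = predictor.predict(X_train[:1])[0]
```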
6. Evaluation
6.1. Experimental Setup
6.2. Compared with State-of-the-Art
6.3. Compared with Predictive Models
6.4. Compared with Best Individual Formats
7. Related Work
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Representation | Specific Values
---|---
COO |
CSR |
CSR5 |
ELL |
SELL |
SELL-C-σ |
HYB | ELL: ; COO:
CVR |
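For orientation, the two simplest representations in the table can be materialized with scipy for a small matrix; the richer formats (ELL, SELL-C-σ, CSR5, CVR) are, roughly speaking, padded, sliced, or tiled rearrangements of these same arrays.

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[4, 0, 0, 1],
                  [0, 3, 0, 0],
                  [0, 0, 5, 2],
                  [1, 0, 0, 6]], dtype=np.float64)

A = coo_matrix(dense)
print(A.row, A.col, A.data)               # COO: (row, col, value) triples

csr = A.tocsr()
print(csr.indptr, csr.indices, csr.data)  # CSR: row pointers + columns + values
```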
Features | Description | Importance
---|---|---
 | number of rows | 33.71%
 | number of columns |
variation | matrix regularity | 3.96%
 | percentage of non-zero elements | 42.72%
 | total number of non-zero elements |
K | obtained by a histogram algorithm |
nnz_avg | average number of non-zero elements per row | 2.10%
nnz_min | minimum number of non-zero elements per row |
nnz_max | maximum number of non-zero elements per row |
nnz_std | standard deviation of non-zero elements per row | 17.51%
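All of these features are cheap to derive from the CSR arrays. The sketch below computes the row-statistics features for a scipy matrix; key names follow the table where it gives them (nnz_min, nnz_max, nnz_std), while the remaining names are illustrative. The variation and K features are omitted since their exact definitions are not reproduced above.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def extract_features(A):
    """Row-statistics features of a sparse matrix, as in the table above."""
    A = A.tocsr()
    n_rows, n_cols = A.shape
    row_nnz = np.diff(A.indptr)            # non-zero count of every row
    return {
        "n_rows":  n_rows,
        "n_cols":  n_cols,
        "nnz":     A.nnz,
        "density": A.nnz / (n_rows * n_cols),
        "nnz_avg": row_nnz.mean(),
        "nnz_min": int(row_nnz.min()),
        "nnz_max": int(row_nnz.max()),
        "nnz_std": row_nnz.std(),
    }

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=1)
print(extract_features(A))
```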