Elegante: A Machine Learning-Based Threads Configuration Tool for SpMV Computations on Shared Memory Architecture
Abstract
1. Introduction
- Propose, develop, and assess a machine learning tool designed to determine the optimal number of threads for SpMV computations, thereby enhancing the overall performance of the SpMV kernel;
- Train and test the tool on an extensive dataset of nearly 1000 real-world matrices sourced from 46 application domains, such as computational fluid dynamics, 2D/3D problems, computer vision, and robotics;
- Conduct extensive performance modeling and evaluation, employing a range of machine learning techniques to optimize SpMV computations under varying conditions, ensuring robustness and adaptability in real-world scenarios.
2. Literature Survey
3. Methodology and Design
- Data Collection: Real-world matrices were gathered from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) [27];
- Data Transformation: The sparse matrices were converted from the Matrix Market format to the compressed sparse row (CSR) format. SpMV computations were then performed using different numbers of threads;
- Feature Extraction: Fourteen sparse matrix features were extracted, each derived from the structure of the matrix;
- Application of Machine Learning Models: Five different machine learning models were employed to predict the near-optimal number of threads;
- Model Evaluation: The predictive models were evaluated using the relative mean error, which indicates how effectively each model predicts the optimal number of threads. To assess the performance of the proposed model, we also employed the geometric mean of normalized performance (GMNP), computed as the geometric mean of the ratios of the execution times obtained with the predicted thread configuration to the best execution times across all matrices in the test set (see the formula after this list).
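For clarity, one plausible way to write the GMNP metric described above, using the symbols defined in the notation table below (U is the number of matrices in the test set); the exact normalization used by the authors may differ:

```latex
\mathrm{GMNP} \;=\; \left( \prod_{i=1}^{U} \frac{T^{\mathrm{pred}}_{i}}{T^{\mathrm{best}}_{i}} \right)^{1/U}
```

Here, T^pred_i denotes the SpMV execution time of matrix i under the predicted thread configuration and T^best_i the best measured execution time for that matrix; a GMNP of 1 means the predicted configuration matches the best configuration on every matrix in the test set.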
3.1. Construction of Dataset
3.2. Identification of Features
3.3. Data Labeling, Training, and Testing Phases
3.4. Feature Scaling
3.5. Model Evaluation Phase
4. Results and Analysis
4.1. Software and Hardware
4.2. Execution Time Analysis of SpMV Computations
4.3. Speedup
4.4. Predictive Analysis
4.5. Performance Gain
5. Limitations of the Proposed Solution
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Grossman, M.; Thiele, C.; Araya-Polo, M.; Frank, F.; Alpak, F.O.; Sarkar, V. A Survey of Sparse Matrix-Vector Multiplication Performance on Large Matrices. 2016. Available online: https://software.intel.com/en-us/intel-mkl (accessed on 18 April 2024).
- cuSPARSE | NVIDIA Developer. Available online: https://developer.nvidia.com/cusparse (accessed on 18 April 2024).
- Pınar, A.; Heath, M.T. Improving Performance of Sparse Matrix-Vector Multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, OR, USA, 14–19 November 1999.
- Mehmood, R.; Crowcroft, J. Parallel Iterative Solution Method for Large Sparse Linear Equation Systems. 2005. Available online: http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-650.html (accessed on 18 April 2024).
- Pabst, H.; Bachmayer, B.; Klemm, M. Performance of a Structure-Detecting SpMV Using the CSR Matrix Representation. In Proceedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), Munich, Germany, 25–29 June 2012.
- Yang, W.; Li, K.; Li, K. A hybrid computing method of SpMV on CPU–GPU heterogeneous computing systems. J. Parallel Distrib. Comput. 2017, 104, 49–60.
- Gu, A.T.; Liu, X.; Mo, Z.; Xu, X.; Zhu, S. On the Memory Wall and Performance of Symmetric Sparse Matrix Vector Multiplications in Different Data Structures on Shared Memory Machines. In Proceedings of the 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), Beijing, China, 10–14 August 2015.
- Sun, H.; Gainaru, A.; Shantharam, M.; Raghavan, P. Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead. In Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Porto, Portugal, 9–11 September 2020.
- Elafrou, A.; Goumas, G.; Koziris, N. Performance analysis and optimization of sparse matrix-vector multiplication on modern multi- and many-core processors. In Proceedings of the 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 14–17 August 2017; pp. 292–301.
- Asanovic, K.; Bodik, R.; Demmel, J.; Keaveny, T.; Keutzer, K.; Kubiatowicz, J.; Morgan, N.; Patterson, D.; Sen, K.; Wawrzynek, J.; et al. A View of the Parallel Computing Landscape. Commun. ACM 2009, 52, 56–67.
- Accelerate Fast Math with Intel® oneAPI Math Kernel Library. Available online: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html#gs.874u1l (accessed on 18 April 2024).
- Bian, H.; Huang, J.; Dong, R.; Liu, L.; Wang, X. CSR2: A New Format for SIMD-Accelerated SpMV. In Proceedings of the 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, 11–14 May 2020.
- Zheng, C.; Gu, S.; Gu, T.-X.; Yang, B.; Liu, X.-P. BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs. J. Parallel Distrib. Comput. 2014, 74, 2639–2647.
- CUSP: Main Page. Available online: https://cusplibrary.github.io/ (accessed on 18 April 2024).
- Ahmed, M.; Usman, S.; Shah, N.A.; Ashraf, M.U.; Alghamdi, A.M.; Bahadded, A.A.; Almarhabi, K.A. AAQAL: A Machine Learning-Based Tool for Performance Optimization of Parallel SPMV Computations Using Block CSR. Appl. Sci. 2022, 12, 7073.
- Dufrechou, E.; Ezzatti, P.; Quintana-Ortí, E.S. Selecting optimal SpMV realizations for GPUs via machine learning. Int. J. High Perform. Comput. Appl. 2021, 35, 254–267.
- Usman, S.; Mehmood, R.; Katib, I.; Albeshri, A.; Altowaijri, S.M. ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines. Mob. Netw. Appl. 2023, 28, 744–763.
- Nie, J.; Zhang, C.; Zou, D.; Xia, F.; Lu, L.; Wang, X.; Zhao, F. Adaptive sparse matrix-vector multiplication on CPU-GPU heterogeneous architecture. In Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, Guangzhou, China, 22–24 June 2019; pp. 6–10.
- Williams, S.; Oliker, L.; Vuduc, R.; Shalf, J.; Yelick, K.; Demmel, J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC’07, Reno, NV, USA, 10–16 November 2007.
- Yesil, S.; Heidarshenas, A.; Morrison, A.; Torrellas, J. WISE: Predicting the performance of sparse matrix vector multiplication with machine learning. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, 25 February–1 March 2023; pp. 329–341.
- Bylina, B.; Bylina, J.; Stpiczyński, P.; Szałkowski, D. Performance analysis of multicore and multinodal implementation of SpMV operation. In Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, 7–10 September 2014.
- Zhao, H.; Xia, T.; Li, C.; Zhao, W.; Zheng, N.; Ren, P. Exploring Better Speculation and Data Locality in Sparse Matrix-Vector Multiplication on Intel Xeon. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, Hartford, CT, USA, 18–21 October 2020; pp. 601–609.
- Usman, S.; Mehmood, R.; Katib, I.; Albeshri, A. ZAKI+: A Machine Learning Based Process Mapping Tool for SpMV Computations on Distributed Memory Architectures. IEEE Access 2019, 7, 81279–81296.
- Gao, J.; Ji, W.; Liu, J.; Wang, Y.; Shi, F. Revisiting thread configuration of SpMV kernels on GPU: A machine learning based approach. J. Parallel Distrib. Comput. 2024, 185, 104799.
- Nisa, I.; Siegel, C.; Rajam, A.S.; Vishnu, A.; Sadayappan, P. Effective machine learning based format selection and performance modeling for SpMV on GPUs. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018; pp. 1056–1065.
- Furuhata, R.; Zhao, M.; Agung, M.; Egawa, R.; Takizawa, H. Improving the accuracy in SpMV implementation selection with machine learning. In Proceedings of the 2020 Eighth International Symposium on Computing and Networking Workshops (CANDARW), Naha, Japan, 24–27 November 2020; pp. 172–177.
- Davis, T.A.; Hu, Y. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 2011, 38, 1–25.
Symbol | Name
---|---
NNZs | Number of non-zeros
A | M × N sparse matrix
M | Number of rows
N | Number of columns
X | N × 1 dense vector
Y | M × 1 output vector
U | Number of matrices
F | Sparse matrix feature
CSR | Compressed sparse row
BCSR | Block compressed sparse row
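To make the notation concrete, the following is a minimal sketch of the CSR-based SpMV kernel Y = A·X that the symbols above describe, written in Python/SciPy purely for illustration (the paper's benchmarks use an OpenMP implementation, not this code); the function name `spmv_csr` is ours:

```python
import numpy as np
from scipy.sparse import random as sparse_random

def spmv_csr(indptr, indices, data, x):
    """Reference CSR SpMV: y = A @ x, with A given by its CSR arrays."""
    m = len(indptr) - 1
    y = np.zeros(m)
    for i in range(m):                       # one output element per row
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]  # accumulate nonzeros of row i
    return y

# Small demonstration on a random M x N sparse matrix A and dense vector X.
A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
X = np.ones(A.shape[1])
Y = spmv_csr(A.indptr, A.indices, A.data, X)
assert np.allclose(Y, A @ X)  # matches SciPy's built-in SpMV
```

In a shared-memory implementation, the outer loop over rows is the natural unit of parallel work, which is why the number of threads chosen for it has a direct impact on SpMV performance.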
Matrix Name | Application Domain | Rows | Columns | NNZs |
---|---|---|---|---|
mcfe | 2D/3D Problem | 765 | 765 | 24,382 |
fs_760_1 | 2D/3D Problem Sequence | 760 | 760 | 5976 |
mplate | Acoustics Problem | 5962 | 5962 | 74,076 |
Sandi_sandi | Bipartite Graph | 314 | 360 | 613 |
bayer05 | Chemical Process Simulation Problem | 3268 | 3268 | 27,836 |
circuit_3 | Circuit Simulation Problem | 12,127 | 12,127 | 48,137 |
fpga_dcop_01 | Circuit Simulation Problem Sequence | 1220 | 1220 | 5892 |
f855_mat9 | Combinatorial Problem | 2456 | 2511 | 171,214 |
ex31 | Computational Fluid Dynamics Problem | 3909 | 3909 | 115,357 |
cavity16 | Computational Fluid Dynamics Problem Sequence | 4562 | 4562 | 138,187 |
QRpivot | Counter Example Problem | 660 | 749 | 3808 |
ibm32 | Directed Graph | 32 | 32 | 126 |
SmaGri | Directed Multigraph | 1059 | 1059 | 4919 |
CollegeMsg | Directed Temporal Multigraph | 1899 | 1899 | 20,296 |
gre_1107 | Directed Weighted Graph | 1107 | 1107 | 5664 |
soc-sign-bitcoin-alpha | Directed Weighted Temporal Graph | 3783 | 3783 | 24,186 |
young2c | Duplicate Acoustics Problem | 841 | 841 | 4089 |
pcrystk02 | Duplicate Materials Problem | 13,965 | 13,965 | 491,274 |
bcsstk12 | Duplicate Structural Problem | 1473 | 1473 | 17,857 |
lshp1009 | Duplicate Thermal Problem | 1009 | 1009 | 3937 |
orani678 | Economic Problem | 2529 | 2529 | 90,158 |
qc2534 | Electromagnetics Problem | 2534 | 2534 | 232,947 |
onetone2 | Frequency Domain Circuit Simulation Problem | 36,057 | 36,057 | 227,628 |
cari | Linear Programming Problem | 400 | 1200 | 152,800 |
sctap1-2c | Linear Programming Problem Sequence | 3390 | 7458 | 21,854 |
crystm01 | Materials Problem | 4875 | 4875 | 55,107 |
reorientation_4 | Optimal Control Problem | 2717 | 2717 | 17,613 |
reorientation_3 | Optimal Control Problem | 2513 | 2513 | 32,166 |
zenios | Optimization Problem | 2873 | 2873 | 15,032 |
bp_0 | Optimization Problem Sequence | 822 | 822 | 3276 |
gemat1 | Power Network Problem | 4929 | 10,595 | 47,369 |
rbsa480 | Robotics Problem | 480 | 480 | 17,088 |
jpwh_991 | Semiconductor Device Problem | 991 | 991 | 6027 |
rw496 | Statistical/Mathematical Problem | 496 | 496 | 1859 |
pct20stif | Structural Problem | 52,329 | 52,329 | 1,375,396 |
fs_760_2 | Subsequent 2D/3D Problem | 760 | 760 | 5976 |
adder_dcop_02 | Subsequent Circuit Simulation Problem | 1813 | 1813 | 11,246 |
cavity17 | Subsequent Computational Fluid Dynamics Problem | 4562 | 4562 | 138,187 |
bp_1600 | Subsequent Optimization Problem | 822 | 822 | 4841 |
nemeth21 | Subsequent Theoretical/Quantum Chemistry Problem | 9506 | 9506 | 591,626 |
conf5_0-4x4-10 | Theoretical/Quantum Chemistry Problem | 3072 | 3072 | 119,808 |
epb1 | Thermal Problem | 14,734 | 14,734 | 95,053 |
fe_body | Undirected Graph | 45,087 | 45,087 | 163,734 |
cage | Undirected Graph Sequence | 366 | 366 | 2562 |
G27 | Undirected Random Graph | 2000 | 2000 | 19,990 |
nopoly | Undirected Weighted Graph | 10,774 | 10,774 | 40,808 |
G7 | Undirected Weighted Random Graph | 800 | 800 | 19,176 |
Set | Feature Description | Formula
---|---|---
Basic Features | Number of rows | M
 | Number of columns | N
 | Rows + columns | M + N
 | Number of non-zeros | NNZs
High-Complexity Features | Minimum number of nonzero elements per row |
 | Maximum number of nonzero elements per row |
 | Average number of nonzero elements per row |
 | Standard deviation of the number of nonzero elements per row |
 | Average column distance between the first and the last nonzero element in each row |
 | Minimum column distance between the first and the last nonzero element in each row |
 | Maximum column distance between the first and the last nonzero element in each row |
 | Standard deviation of the column distances between the first and the last nonzero element in each row |
 | Clustering |
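As an illustration of how such structural features can be computed, here is a minimal sketch using SciPy's CSR representation. This is our own reconstruction based on the table above, not the authors' extraction code; the clustering feature is omitted because the table does not define it, and the dictionary keys are hypothetical names:

```python
import numpy as np
from scipy.sparse import csr_matrix

def extract_features(A: csr_matrix) -> dict:
    """Compute structural SpMV features of a CSR matrix (cf. feature table)."""
    A.sort_indices()                      # ensure column indices are ordered per row
    m, n = A.shape
    nnz_per_row = np.diff(A.indptr)       # number of nonzeros in each row
    # Column distance between the first and last nonzero of each non-empty row.
    dists = []
    for i in range(m):
        start, end = A.indptr[i], A.indptr[i + 1]
        if end > start:
            dists.append(A.indices[end - 1] - A.indices[start])
    dists = np.array(dists) if dists else np.zeros(1)
    return {
        "rows": m,
        "cols": n,
        "rows_plus_cols": m + n,
        "nnz": A.nnz,
        "min_nnz_per_row": int(nnz_per_row.min()),
        "max_nnz_per_row": int(nnz_per_row.max()),
        "avg_nnz_per_row": float(nnz_per_row.mean()),
        "std_nnz_per_row": float(nnz_per_row.std()),
        "avg_row_dist": float(dists.mean()),
        "min_row_dist": float(dists.min()),
        "max_row_dist": float(dists.max()),
        "std_row_dist": float(dists.std()),
    }
```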
Software/Tools/Library | Version
---|---
Operating System | Windows 11, 22H2
Compiler | Visual Studio Professional 2022, 17.8
Scikit-learn | 0.23.1
OpenMP | 6.0
Python | 3.10 (Google Colab)
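For context, the following is a minimal sketch of how a thread-count predictor might be trained and queried with the scikit-learn toolchain listed above. The choice of a random forest, the placeholder data, and the variable names are our assumptions, since this section only states that five machine learning models were compared:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: one row of structural features per matrix (see the feature table);
# y: best-performing thread count measured for that matrix (the label).
rng = np.random.default_rng(0)
X = rng.random((1000, 13))                   # placeholder feature matrix
y = rng.choice([1, 2, 4, 8, 16], size=1000)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)       # feature scaling (cf. Section 3.4)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(scaler.transform(X_train), y_train)

predicted_threads = model.predict(scaler.transform(X_test))
print(predicted_threads[:10])
```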
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).