Article

Elegante: A Machine Learning-Based Threads Configuration Tool for SpMV Computations on Shared Memory Architecture

1 Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-PN), Mexico City 07738, Mexico
2 School of Informatics and Robotics, Institute of Arts and Culture, Lahore 54000, Pakistan
3 Department of Computer Science, Lahore Leads University, Lahore 54000, Pakistan
4 College of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China
* Author to whom correspondence should be addressed.
Information 2024, 15(11), 685; https://doi.org/10.3390/info15110685
Submission received: 30 September 2024 / Revised: 25 October 2024 / Accepted: 30 October 2024 / Published: 1 November 2024

Abstract

The sparse matrix–vector product (SpMV) is a fundamental computational kernel used in a diverse range of scientific and engineering applications, most commonly to solve linear systems and partial differential equations. Parallel computation of the SpMV product is a challenging task. Existing solutions often employ a fixed assignment of threads to rows based on empirical formulas, leading to sub-optimal configurations and significant performance losses. Elegante, our proposed machine learning-powered tool, takes a data-driven approach to identify the optimal thread configuration for SpMV computations on a shared memory architecture: it predicts the best thread configuration from the sparsity pattern of each sparse matrix. Our approach involves training and testing various base and ensemble machine learning algorithms, such as decision tree, random forest, gradient boosting, logistic regression, and support vector machine. We experimented rigorously with a dataset of more than 1000 real-world matrices originating from 46 distinct application domains, spanning fields such as robotics, power networks, 2D/3D meshing, and computational fluid dynamics. Our proposed methodology achieved 62% of the highest achievable performance and is 7.33 times faster, demonstrating a significant improvement over the default OpenMP configuration policy and the traditional practice of selecting the number of threads manually or at random. This work is the first attempt to use the structure of the matrix to predict the optimal thread configuration for optimizing parallel SpMV computation in a shared memory environment.

1. Introduction

Sparse matrix–vector multiplication (SpMV) is a foundational computational kernel widely applied in diverse scientific and engineering applications. In contrast to dense matrices, in which most entries are nonzero, sparse matrices consist primarily of zero elements. The essence of SpMV lies in multiplying a sparse matrix by a dense vector to yield a new vector. This operation is essential for solving systems of linear equations and supporting iterative methods, simulations, and data analysis. In the equation y = Ax, the matrix A is sparse, while the vector x can be either dense or sparse; the output vector y is always dense. As illustrated in Figure 1, both the dense and the sparse matrix–vector products produce a dense output. Several factors significantly affect the efficiency of SpMV computations [1,2], namely the storage format, the sparsity pattern of the sparse matrix, and implementation details of both the software and the hardware platform [3,4]. Over time, efforts to optimize the performance of the SpMV kernel have focused on the introduction of new storage schemes, including compressed sparse column (CSC), block compressed sparse row (BCSR) [5,6], compressed sparse row (CSR) [7,8,9], and the diagonal and hybrid coordinate formats [5,7,8,10]. Additional efforts have been directed toward software libraries, including Intel MKL [11], the Trilinos project [12,13], cuSPARSE [14], and CUSP [2]. Efficiently utilizing DRAM bandwidth and cache hierarchies has also been a primary focus in improving SpMV performance [1,5].
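For readers less familiar with the CSR layout used throughout this paper, the short Python sketch below (using NumPy and SciPy, which the paper itself does not prescribe) shows the three CSR arrays of a tiny illustrative matrix and the dense result of y = Ax.

import numpy as np
from scipy.sparse import csr_matrix

# A small sparse matrix A and a dense vector x; the product y = A @ x is always dense.
A = csr_matrix(np.array([[4.0, 0.0, 0.0],
                         [0.0, 0.0, 2.0],
                         [1.0, 0.0, 3.0]]))
x = np.array([1.0, 2.0, 3.0])

# CSR stores only the nonzero values, their column indices, and one row pointer per row.
print(A.data)     # [4. 2. 1. 3.]
print(A.indices)  # [0 2 0 2]
print(A.indptr)   # [0 1 2 4]

y = A @ x         # sparse matrix times dense vector -> dense output
print(y)          # [ 4.  6. 10.]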
Although prior research has significantly improved the performance of SpMV computations through optimized storage formats and hardware-specific libraries, the application of machine learning to dynamically predict the optimal number of threads remains underexplored. Additionally, existing studies have not fully utilized the structure of diverse real-world matrices to optimize thread configurations. This study addresses these gaps by introducing a machine learning-driven approach to predict optimal thread counts and enhance SpMV performance across multiple domains. To this end, we propose Elegante, a novel tool designed to improve the efficiency of SpMV within shared-memory architectures. Elegante predicts the optimal number of threads by leveraging the compressed sparse row (CSR) format and identifying the most effective configuration for SpMV computations.
Our dataset comprises matrices obtained from the University of Florida SuiteSparse Matrix Collection [3,6]. We collected more than 1000 real-world matrices from diverse application domains. These sparse matrices were converted from the matrix market format to the compressed sparse row (CSR) format, SpMV computations were performed with multiple thread counts, and pertinent features of the sparse matrices were extracted. Subsequently, the minimum, default OpenMP, and maximum execution times, together with the corresponding numbers of threads, were collected for each matrix. Following dataset preparation, we employed a variety of base and ensemble machine learning models to predict the near-optimal number of threads for SpMV computation. During the training phase, 90% of the dataset was used, while the remaining 10% was reserved for testing, as depicted in Figure 2.
This study makes the following contributions:
  • Propose, develop, and assess a machine learning tool designed to determine the optimal thread configuration for SpMV, thereby enhancing the overall performance of the SpMV kernel;
  • Train and test the tool on an extensive dataset comprising more than 1000 real-world matrices sourced from 46 application domains, such as computational fluid dynamics, 2D/3D problems, computer vision, and robotics;
  • Conduct an extensive performance modeling and evaluation, employing a range of machine learning techniques to optimize SpMV computations under varying conditions, ensuring robustness and adaptability in real-world scenarios.
The sparse matrix was converted from a matrix market format to a CSR format, and then the sparse matrix–vector product was executed with different numbers of threads. Subsequently, the selected features, along with the optimal number of threads, were used for the training phase.
During the training phase, we used the number of threads as the target variable. The training datasets were used to train various base and ensemble predictive models, including random forest (RF), decision tree (DT), logistic regression (LR), gradient boosting (GB), ridge regression (RR), and k-nearest neighbor (K-NN). These models were evaluated using the relative mean error (RME), and the geometric mean was employed to assess the performance gain. To the best of our knowledge, this research represents a pioneering effort in leveraging the inherent structure of matrices to predict the near-optimal number of threads. Elegante, the tool introduced in this research, derives its name from the Spanish word for "elegant".
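The sketch below illustrates this training setup with scikit-learn: several of the named estimators are fitted to a stand-in feature matrix and scored with the relative mean error. The synthetic data, the 90/10 split, and the default hyperparameters are our assumptions for illustration, not the authors' exact configuration, and logistic regression is omitted because scikit-learn implements it as a classifier.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Placeholder data: 14 structural features per matrix, target = thread count with minimal time.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 14)), rng.integers(1, 49, 1000).astype(float)

# 90% of the dataset for training, 10% for testing, mirroring the split described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

models = {
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    "RR": Ridge(),
    "K-NN": KNeighborsRegressor(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rme = np.mean(np.abs(y_te - pred) / y_te)   # relative mean error, as in Equation (4)
    print(f"{name}: RME = {rme:.3f}")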
The rest of this paper is organized as follows: Section 2 provides an in-depth review of the literature and background information related to sparse matrix–vector (SpMV) computations. Section 3 outlines the proposed methodology and design. Detailed results and evaluations are presented in Section 4. Section 5 presents the limitations of the proposed solution. Finally, Section 6 discusses the conclusions of our research and potential avenues for future exploration. Fundamental symbols used in this study are shown in Table 1.

2. Literature Survey

SpMV is the most important kernel used to solve sparse linear systems on both CPU and GPU architectures, and GPU computing overhead is minimized by optimizing parallel SpMV on CPU–GPU heterogeneous platforms to attain high performance. Pinar, A. et al. [3] proposed an adaptive SpMV optimizer that improves SpMV performance and minimizes overhead for all sparse matrices by analyzing the performance bottleneck of the input matrix on a specific platform. Bottleneck detection is treated as a classification problem with two classifiers: a profile-guided classifier, which relies on online profiling to classify the matrix, and a feature-guided classifier, which is pre-trained with machine learning models on matrix properties. After a bottleneck is detected, appropriate optimizations are applied to it on the CSR and inspector–executor CSR storage formats. The proposed optimizer was assessed on one multi-core and two many-core processors (Intel Xeon Phi and Intel Broadwell), where it optimized the maximum number of matrices and provided outstanding speedups.
Ahmed et al. [15] introduced AAQAL, a machine learning-based tool designed to predict the optimal block size setting for BCSR. The researchers employed five machine learning models to predict the near-optimal block size for SpMV computation and attained 93.47% of the best possible performance, a significant improvement over manual or random block size selection.
Dufrechou, E. et al. [16] assessed over 3000 matrices from diverse domains. They utilized machine learning to create classifiers to predict the best SpMV method for each matrix on a specific parallel platform with over 80% accuracy. Their approach resulted in significant reductions in both execution time and energy consumption.
Usman, Sardar, et al. [17] introduced ZAKI, a machine learning tool designed to predict the optimal number of processes for SpMV computations on distributed memory machines, irrespective of the specific sparse matrix. They trained and tested ZAKI on a dataset of nearly 2000 matrices from 45 application domains. The tool utilizes three machine learning algorithms: decision tree, random forest, and gradient boosting. They achieved an 85.1% performance gain by selecting the optimal SpMV kernel.
Nie, Jing, et al. [18] designed a heterogeneous parallel computing method for SpMV built on a hybrid CPU–GPU model that exploits the capabilities of both the CPU and the GPU to improve performance. Because the nonzero elements are unevenly distributed, the sparse matrix is divided into two parts under a hybrid storage format using a sparse matrix partitioning strategy, and the parts are assigned to the CPU and GPU for simultaneous computation, improving the computing performance of SpMV.
Williams, S. et al. [19] described the structure and execution of adaptive SpMV, introducing a dynamic task-scheduling framework that increases CPU and GPU utilization on heterogeneous platforms. To hide the data-allocation overhead between the two processing units, i.e., the CPU and the GPU, a double-buffering scheme is combined with CPU- and GPU-based SpMV kernels, providing adaptability to a variety of sparse matrices and consequent performance improvements.
Yesil, Serif, et al. [20] proposed WISE, a machine learning framework that predicts the speedup of SpMV methods for specific sparse matrices. Using a unique feature set, WISE identifies the best SpMV method for each matrix. Testing on nearly 1500 matrices showed a 2.4× average speedup over Intel’s MKL on a 24-core server.
Bylina, Beata, et al. [21] described a merge-based parallel technique for computing SpMV in the CSR format. The merge-based CsrMV performs an equitable multi-partitioning of the input dataset, so that the workload assigned to each processing element is tightly bounded through a two-dimensional merge-based decomposition. CsrMV was evaluated on both CPU and GPU microarchitectures, and the merge-based method was found to deliver predictable CsrMV performance.
Zhao, Haoran, et al. [22] examined the parallelization of SpMV, i.e., the multiplication of a sparse matrix by a dense vector, on multicore processing units. A distributed multi-node algorithm running on a high-performance cluster is used to speed up SpMV computations, and other SPARSKIT library-based parallel algorithms are employed on multicore architectures. Multiple architectures, tools, and matrices of different sizes are compared across algorithms, and the authors conclude that exploiting hardware architectures, MIC architectures, and GPU cards to increase calculation speed improves overall performance.
Usman S. et al. [23] introduced ZAKI+, a data-driven machine learning approach for optimizing parallel sparse matrix–vector (SpMV) computations in cyber–physical systems (CPSs). They trained and tested the tool using three ML methods (decision tree, random forest, and extreme gradient boosting) and over 2000 real-world sparse matrices from diverse domains. ZAKI+ achieved a 4.24× performance improvement over the default MPI mapping policies, marking a significant advancement in automating and enhancing SpMV computations.
Gao, Jianhua, et al. [24] introduced a machine learning-driven approach to optimize SpMV on GPUs by predicting optimal thread configurations through matrix partitioning and kernel specialization. In contrast, our work focuses on optimizing SpMV on CPU architectures.
Israt et al. [25] investigated the optimization of sparse matrix–vector (SpMV) multiplication on GPUs using machine learning techniques. They explored two key questions: predicting the best sparse matrix format for SpMV on an unseen matrix and forecasting the execution time of that matrix across different formats. By identifying a small set of relevant features of the sparse matrices, they achieved approximately 88% accuracy in predicting the best of six well-known formats and a relative mean error of about 10% in execution time predictions. This approach demonstrates the potential of machine learning to enhance SpMV performance in computational applications.
Furuhata et al. [26] applied a machine learning approach to predict the best sparse matrix–vector multiplication (SpMV) implementation, focusing on minimizing performance loss by considering performance differences between implementations. Unlike previous methods, their model prioritizes avoiding the mis-prediction of matrices where performance losses are high. This approach helps ensure that low-performance implementations are not selected, leading to overall improved expected performance. The evaluation results demonstrate clear advantages over earlier methods that did not account for these performance gaps.

3. Methodology and Design

The proposed methodology consists of four main phases: collection of sparse matrices, conversion of the sparse matrices to the CSR format, feature extraction, and application of machine learning algorithms to predict the optimal thread configuration. In the initial data preparation phase, we collected more than 1000 matrices from 46 diverse application domains. Subsequently, 14 sparse matrix features were extracted by leveraging the sparse matrix structure. Our training dataset contains multiple features and targets the optimal number of threads, determined as the thread count with minimal execution time. The predictive models we developed aim to forecast the optimal number of threads, and their performance is validated against arbitrary matrices. A critical aspect of our approach lies in selecting an appropriate set of features. To mitigate the risk of over-fitting, which may occur when the number of features is excessively large for the training set, we emphasized dimensionality reduction and feature selection. These measures are crucial for maintaining the model’s generalization and preventing it from fitting the training data too closely. Let us denote the number of threads as T and T_i as the execution time of the ith matrix, where 0 < i ≤ N and N is the total number of matrices in our dataset. For each matrix, the execution time is observed for different numbers of threads, and the corresponding minimum and maximum times are recorded, as shown in Formulas (1) and (2):
$T_i^{min} = \min(T_1, T_2, T_3, \dots, T_n)$   (1)
$T_i^{max} = \max(T_1, T_2, T_3, \dots, T_n)$   (2)
In the context of optimizing SpMV computations in a shared memory architecture through machine learning models, we followed these key steps:
  • Data Collection: Real-world matrices were gathered from the SuiteSparse Matrix Collection [27];
  • Data Transformation: The sparse matrices were transformed from the matrix market format to the compressed sparse row (CSR) format. Subsequently, SpMV computations were performed using different numbers of threads;
  • Feature Extraction: In this step, fourteen sparse matrix features, derived from the matrix structure, were extracted;
  • Application of Machine Learning Models: Five different machine learning models were employed in this step to predict the near-optimal number of threads;
  • Model Evaluation: In this phase, the predictive models were evaluated using the relative mean error, which provides insight into how effectively the models predict the optimal number of threads. To assess the performance of our proposed model, we employed the geometric mean of normalized performance (GMNP), computed as the geometric mean of the ratios of the best execution times to the predicted execution times across all matrices in the test set.

3.1. Construction of Dataset

The data used for training were produced from sparse matrices taken from the SuiteSparse Matrix Collection. More than 1000 matrices were selected from a wide range of applications across multiple domains (see Table 2). While creating the dataset, each matrix was executed with different thread configurations, and the target variable (label) was set to the thread count for which the matrix gives the minimum execution time. Table 2 presents the 46 distinct application domains with one sample sparse matrix from each domain. Column 1 displays the names of the matrices, and column 2 indicates the specific application domains. The numbers of rows and columns of the matrices are given in column 3 (row count) and column 4 (column count). The largest matrix used in the study contained 1,375,396 non-zeros and belonged to the Structural Problem domain.

3.2. Identification of Features

Table 3 lists the sparse matrix features as well as their computational complexity. Based on their complexity and significance, the features are categorized into two sets: basic features and high-complexity features. The names of the selected features are given in the second column, their descriptions in column 3, and the formulas for their calculation in column 4. The last column shows the complexity of computing each feature. Features with O(1) complexity are considered basic features, while those with O(M) and O(2M) complexities are classified as high-complexity features.
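As a concrete illustration, the sketch below computes the size, density, per-row nonzero statistics, and per-row bandwidth statistics of Table 3 from a SciPy CSR matrix; the clustering feature and the exact naming follow the authors' definitions and are not reproduced here, so treat this as an approximation of the feature set.

import numpy as np
from scipy.sparse import random as sparse_random

def extract_features(A):
    """Structural features in the spirit of Table 3, computed from a CSR matrix."""
    M, N = A.shape
    nnz_per_row = np.diff(A.indptr)                      # nonzeros in each row: O(M)
    # Per-row bandwidth: column distance between the first and last nonzero of the row.
    bw = np.array([A.indices[s:e].max() - A.indices[s:e].min() if e > s else 0
                   for s, e in zip(A.indptr[:-1], A.indptr[1:])])
    return {
        "M": M, "N": N, "M+N": M + N, "nnz": A.nnz,
        "density": A.nnz / (M * N),
        "nnz_min": int(nnz_per_row.min()), "nnz_max": int(nnz_per_row.max()),
        "nnz_avg": float(nnz_per_row.mean()), "nnz_sd": float(nnz_per_row.std()),
        "bw_min": int(bw.min()), "bw_max": int(bw.max()),
        "bw_avg": float(bw.mean()), "bw_sd": float(bw.std()),
    }

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)  # demo matrix
print(extract_features(A))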

3.3. Data Labeling and Training and Testing Phases

The dataset is generated by executing each matrix with multiple thread counts and observing the minimal execution time. The number of threads with the minimal execution time is taken as the label for every sparse matrix and is considered the best execution configuration for that matrix. Each matrix is executed 2k times, and the execution time for each thread configuration is estimated as the average of the recorded execution times, which removes anomalies. The presented models are trained on 90% of the data and tested on the remaining 10%, as shown in Figure 3.
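The sketch below illustrates this labeling step. The paper's kernel is a C/OpenMP CSR SpMV; here Numba's prange serves as a stand-in for the OpenMP parallel-for, and the thread grid, the repetition count, and the file name matrix.mtx are illustrative assumptions.

import time
import numpy as np
from numba import njit, prange, set_num_threads
from scipy.io import mmread

@njit(parallel=True)
def spmv_csr(indptr, indices, data, x, y):
    # Row-parallel CSR SpMV, analogous to an OpenMP "parallel for" over the rows.
    for i in prange(indptr.shape[0] - 1):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc

A = mmread("matrix.mtx").tocsr()                  # Matrix Market -> CSR (placeholder file name)
x = np.ones(A.shape[1])
y = np.empty(A.shape[0])
data = A.data.astype(np.float64)

avg_time = {}
for t in (1, 2, 4, 8, 16, 24, 48):                # candidate thread counts (illustrative grid)
    set_num_threads(t)                            # t must not exceed the available hardware threads
    spmv_csr(A.indptr, A.indices, data, x, y)     # warm-up run triggers JIT compilation
    start = time.perf_counter()
    for _ in range(2000):                         # 2k repetitions, averaged to smooth out anomalies
        spmv_csr(A.indptr, A.indices, data, x, y)
    avg_time[t] = (time.perf_counter() - start) / 2000

label = min(avg_time, key=avg_time.get)           # label: thread count with the minimal average time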

3.4. Feature Scaling

Feature scaling is a widely used technique in machine learning and statistics that standardizes the range of the independent variables (features) so that no single feature dominates the model simply because of its magnitude. This is crucial because many algorithms are sensitive to the scale of their inputs, which can lead to poor performance. In our study, we used min–max scaling, which transforms each feature to the range 0 to 1, as shown in Equation (3):
$X_{norm} = \dfrac{X - X_{min}}{X_{max} - X_{min}}$   (3)
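Equivalently, scikit-learn's MinMaxScaler (listed among the paper's tools) applies Equation (3) per feature; fitting the scaler on the training split only and reusing it on the test split is our assumption of standard practice rather than a detail stated in the paper.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train, X_test = rng.random((900, 14)) * 100, rng.random((100, 14)) * 100  # placeholder features

scaler = MinMaxScaler()                          # implements Equation (3) per feature
X_train_scaled = scaler.fit_transform(X_train)   # min and max learned from the training data
X_test_scaled = scaler.transform(X_test)         # the same min/max reused for the test data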

3.5. Model Evaluation Phase

After constructing the models, the relative mean error was used to assess their performance. Predictive regression techniques are employed that utilize the sparsity pattern of the matrix to predict the best number of threads for a specific sparse matrix. The objective is to determine the best-performing algorithm by evaluating the RME metrics, as specified in Equation (4):
$RME = \dfrac{1}{P} \sum_{i=1}^{P} \dfrac{\lvert y_i - \hat{y}_i \rvert}{y_i}$   (4)
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and P is the number of observations.
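A direct transcription of Equation (4) is shown below; the example values are arbitrary.

import numpy as np

def relative_mean_error(y_true, y_pred):
    """Relative mean error (Equation (4)): mean of |y - y_hat| / y over all observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / y_true))

print(relative_mean_error([8, 16, 24], [8, 12, 32]))   # ~0.194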

4. Results and Analysis

4.1. Software and Hardware

Experiments were conducted on an AMD EPYC 7401P, a server-grade CPU with 24 physical cores and two threads per core via simultaneous multithreading (SMT). This processor is commonly used in high-performance computing and data center environments for parallel and multitasking workloads. The software and libraries employed in this study for SpMV computations are outlined in Table 4.

4.2. Execution Time Analysis of SpMV Computations

The SpMV computations using the CSR storage format are executed across various numbers of threads. For each matrix, the maximum, default OpenMP, and minimum execution times (in seconds) were determined. Figure 4 and Figure 5 compare the execution times of the SpMV computations on the dataset sorted in ascending order by the number of rows and by the number of non-zero elements (nnzs), respectively. In both figures, the X-axis orders the matrices from diverse application domains from the minimum to the maximum number of rows or nnzs, while the Y-axis displays the SpMV execution times in seconds on a logarithmic scale.

4.3. Speedup

Figure 6 and Figure 7 illustrate the speedup achieved for the complete dataset when organized by row- and nnz-sorted datasets. The default and maximum speedups are calculated by using Formulas (5) and (6). The X-axis shows the number of matrices in our entire dataset, while the Y-axis portrays the speedup relative to both the default OpenMP configuration and worst-case scenarios. The speedup metrics against the worst-case scenario are indicated in red, whereas those against the default OpenMP speedup are in blue. Similarly, Figure 8 and Figure 9 present a comparative analysis of execution times across different application domains, in the same order as presented in Table 2. Conversely, Figure 10 and Figure 11 show the attained speedup leveraging the complete dataset. This comparative visualization illustrates the performance variations and optimizations across diverse application domains, emphasizing the effectiveness of the proposed approach in enhancing computational efficiency.
$Speedup_{max} = \dfrac{T_i^{max}}{T_i^{min}}$   (5)
$Speedup_{default} = \dfrac{T_i^{def}}{T_i^{min}}$   (6)
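Given the recorded per-matrix times, Formulas (5) and (6) reduce to element-wise ratios; the arrays below hold made-up example values.

import numpy as np

t_min = np.array([0.8, 1.2, 0.5])   # best time per matrix over all thread counts (seconds)
t_max = np.array([4.0, 6.0, 2.5])   # worst time per matrix (seconds)
t_def = np.array([1.6, 3.0, 1.0])   # time under the default OpenMP configuration (seconds)

speedup_max = t_max / t_min         # Formula (5): [5.0, 5.0, 5.0]
speedup_default = t_def / t_min     # Formula (6): [2.0, 2.5, 2.0]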

4.4. Predictive Analysis

Our proposed methodology concentrates on predicting the near-optimal number of threads from the structure of the sparse matrix. To achieve this goal, we employed six predictive machine learning models: decision tree (DT), random forest (RF), logistic regression (LR), ridge regression (RR), gradient boosting (GB), and k-nearest neighbor (K-NN). The performance of these models is evaluated using the relative mean error (RME). The basic and important features are detailed in Table 3, and we grouped the features into three distinct sets: the full feature set, the basic feature set, and the important feature set. The basic feature set contains the features that do not require a full scan of the matrix, while the important features are identified using scikit-learn’s integrated feature importance metric, which produces a score for each feature under each machine learning algorithm employed in the study. The features with the highest importance scores are selected as significant contributors within the dataset, as illustrated in Figure 12.
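The important-feature selection described above can be reproduced with the feature_importances_ attribute of scikit-learn's tree-based estimators, as sketched below; the synthetic data, the placeholder feature names, and the top-5 cut-off are our assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(14)]            # placeholders for the 14 Table 3 features
X, y = rng.random((1000, 14)), rng.integers(1, 49, 1000).astype(float)

rf = RandomForestRegressor(random_state=0).fit(X, y)
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
important = [name for name, score in ranking[:5]]       # keep the highest-scoring features
print(important)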
Figure 13 presents the relative mean error (RME) scores with all features for various machine learning models, providing insights into their predictive performance. Lower RME values indicate better model accuracy, meaning that the model predictions are closer to the actual values. In this case, RF is the top-performing model with the lowest RME score of 0.449, indicating it made the fewest prediction errors and is thus the most accurate. K-NN follows with an RME score of 0.482, also showing strong performance. GB, with an RME score of 0.52, and the DT model, with a score of 0.533, perform moderately but are less accurate than RF and K-NN. Finally, LR and RR have the highest RME scores of 0.563, meaning they are the least accurate models in this comparison. Overall, the evaluation shows that RF and K-NN perform the best, while LR and RR lag behind in accuracy.
Figure 14 shows the RME scores of various machine learning models when evaluated using only the most important features. In this case, RF performs the best, achieving the lowest RME score of 0.454, indicating it has the most accurate predictions. K-NN follows closely with an RME score of 0.473, also demonstrating a solid performance. GB has a moderate RME score of 0.498, while both LR and RR have higher RME scores of 0.562, indicating more significant prediction errors. The DT model has the highest RME score at 0.591, meaning it performed the least accurately in this case. Overall, RF and K-NN perform the best with important features, while DT shows the weakest performance.
Figure 15 shows a comparison of the RME scores when the models are evaluated using only the basic features. DT performs the best with the lowest RME score of 0.457, indicating the fewest prediction errors with basic features, followed by K-NN with 0.467 and RF with 0.479. GB comes next with a score of 0.492, showing moderate accuracy. Both LR and RR have the highest RME scores at 0.551, indicating the most significant errors in this context. Overall, DT shows the best performance with basic features, while LR and RR demonstrate the weakest accuracy.

4.5. Performance Gain

In this section, we present an evaluation of our proposed model, focusing on the analysis of execution time as depicted in Figure 16, Figure 17 and Figure 18. We conducted evaluations using all features, as detailed in Table 3. The “minimum time” denotes the ideal-case scenario associated with our target variable selection, while the default corresponds to the OpenMP default configuration. To assess the performance of our proposed model, we employed the geometric mean of normalized performance (GMNP). This metric is the geometric mean of the ratios of the minimum execution times to the execution times obtained with the predicted thread counts across all matrices, as shown in Equation (7):
$GMNP = \left( \prod_{i=1}^{n} \dfrac{minimum_i}{predicted_i} \right)^{1/n}$   (7)
where n is the total number of matrices in the test set, $minimum_i$ is the best execution time achieved for the ith matrix across all thread counts, and $predicted_i$ is the execution time obtained with the thread count predicted by our machine learning model for that matrix.
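Equation (7) can be evaluated in log space for numerical stability; the helper below is a sketch assuming both arrays hold per-matrix execution times for the test set, with made-up example values.

import numpy as np

def gmnp(minimum_times, predicted_times):
    """Geometric mean of minimum_i / predicted_i over the test set (Equation (7))."""
    ratios = np.asarray(minimum_times, dtype=float) / np.asarray(predicted_times, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))

# A GMNP of 1.0 would mean every predicted thread count matched the best observed time.
print(gmnp([0.8, 1.2, 0.5], [1.0, 2.0, 0.9]))   # ~0.64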
Figure 16 shows a comparison of Elegante with the other execution times for all of the more than 1000 matrices. It is evident from the figure that Elegante performs well by a significant margin. Figure 17 compares the execution times of the minimum (best), predicted, maximum, average, and default OpenMP configurations on the test set, where the average corresponds to a scenario in which the number of threads is selected at random.
In Figure 18, we employed Equation (7) to evaluate the performance gain. Our proposed methodology achieved 61.95% of the highest achievable performance and is 7.33 times faster than the default OpenMP configuration policy and the traditional practice of selecting the number of threads manually or at random.

5. Limitations of the Proposed Solution

Our proposed methodology predicts the near-optimal number of threads based on the sparse matrix structure and shows considerable performance gain. The limitations associated with the proposed solution are highlighted in this section. Firstly, the selection of 14 sparse matrix features might not capture all the relevant factors that influence SpMV computation, potentially omitting important characteristics. Secondly, while the machine learning model aims to predict the optimal number of threads for SpMV computation, the actual optimal number may be affected by various factors not considered in the study, such as hardware architecture, memory bandwidth, and specific SpMV algorithm implementations. Thirdly, the machine learning model trained on the specific dataset may not generalize well to different types of matrices or various hardware configurations. Lastly, feature extraction and performing SpMV computations with different numbers of threads could introduce additional computational overhead that is not reflective of real-world scenarios.

6. Conclusions

Currently, a significant number of applications in computer graphics, operations research, robotics, computer networking, social media analytics, and other fields rely on solving sparse linear equation systems. These applications play crucial roles in the design, construction, and management of smart societies and cities. The performance of SpMV, which is used to solve these linear systems, can be influenced by various factors that affect computation time, the most important being the sparsity pattern of the matrices involved. A predictive regression model is proposed, implemented, and assessed to optimize the performance of SpMV across various matrix types. The model is trained and tested on more than 1000 matrices taken from 46 application domains. To predict the optimal number of threads for each matrix, the sparse matrix is first converted into the CSR format and executed with different numbers of threads.
The future development of the proposed model can be significantly enhanced by expanding the dataset size, incorporating additional application domains, and increasing the number of relevant features. Additionally, enhancing the size of sparse matrices, implementing advanced machine learning and deep learning algorithms, and transitioning to a platform with a greater number of cores will optimize parallel processing capabilities and improve the model’s performance. Integrating the model into a hybrid system with scheduling policies can further maximize parallelism and efficiency.

Author Contributions

Conceptualization, G.S., M.A. and U.S.; methodology, M.A. and U.S.; software, M.A.; validation, G.S. and I.B.; formal analysis, U.S.; investigation, M.H. and K.S.; resources, G.S. and I.B.; data curation, M.A.; writing—original draft preparation, M.A. and U.S.; writing—review and editing, M.A.; visualization, G.S.; supervision, G.S.; project administration, I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

This work was performed with partial support from the Mexican Government through the grant A1-S-47854 of CONACYT, Mexico, and grants 20241816, 20241819, and 20240951 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources brought to them through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico and acknowledge support of Microsoft through the Microsoft Latin America PhD.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Grossman, M.; Thiele, C.; Araya-Polo, M.; Frank, F.; Alpak, F.O.; Sarkar, V. A Survey of Sparse Matrix-Vector Multiplication Performance on Large Matrices. 2016. Available online: https://software.intel.com/en-us/intel-mkl (accessed on 18 April 2024).
  2. cuSPARSE | NVIDIA Developer, NVIDIA. Available online: https://developer.nvidia.com/cusparse (accessed on 18 April 2024).
  3. Pınar, A.; Heath, M.T. Improving Performance of Sparse Matrix-Vector Multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing, Portland, OR, USA, 14–19 November 1999. [Google Scholar]
  4. Mehmood, R.; Crowcroft, J. Parallel Iterative Solution Method for Large Sparse Linear Equation Systems. 2005. Available online: http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-650.html (accessed on 18 April 2024).
  5. Pabst, H.; Bachmayer, B.; Klemm, M. Performance of a Structure-detecting SpMV using the CSR Matrix Representation. In Proceedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), Munich, Germany, 25–29 June 2012. [Google Scholar] [CrossRef]
  6. Yang, W.; Li, K.; Li, K. A hybrid computing method of SpMV on CPU–GPU heterogeneous computing systems. J. Parallel Distrib. Comput. 2017, 104, 49–60. [Google Scholar] [CrossRef]
  7. Gu, A.T.; Liu, X.; Mo, Z.; Xu, X.; Zhu, S. On the Memory Wall and Performance of Symmetric Sparse Matrix Vector Multiplications in Different Data Structures on Shared Memory Machines. In Proceedings of the 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), Beijing, China, 10–14 August 2015. [Google Scholar] [CrossRef]
  8. Sun, H.; Gainaru, A.; Shantharam, M.; Raghavan, P. Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead. In Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Porto, Portugal, 9–11 September 2020. [Google Scholar] [CrossRef]
  9. Elafrou, A.; Goumas, G.; Koziris, N. Performance analysis and optimization of sparse matrix-vector multiplication on modern multi-and many-core processors. In Proceedings of the 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 14–17 August 2017; pp. 292–301. [Google Scholar]
  10. Asanovic, K.; Bodik, R.; Demmel, J.; Keaveny, T.; Keutzer, K.; Kubiatowicz, J.; Morgan, N.; Patterson, D.; Sen, K.; Wawrzynek, J.; et al. A View of the Parallel Computing Landscape. Commun. ACM 2009, 52, 56–67. [Google Scholar] [CrossRef]
  11. Accelerate Fast Math with Intel® oneAPI Math Kernel Library. Available online: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html#gs.874u1l (accessed on 18 April 2024).
  12. Bian, H.; Huang, J.; Dong, R.; Liu, L.; Wang, X. CSR2: A New Format for SIMD-accelerated SpMV. In Proceedings of the 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, 11–14 May 2020. [Google Scholar] [CrossRef]
  13. Zheng, C.; Gu, S.; Gu, T.-X.; Yang, B.; Liu, X.-P. BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs. J. Parallel Distrib. Comput. 2014, 74, 2639–2647. [Google Scholar] [CrossRef]
  14. CUSP: Main Page. Available online: https://cusplibrary.github.io/ (accessed on 18 April 2024).
  15. Ahmed, M.; Usman, S.; Shah, N.A.; Ashraf, M.U.; Alghamdi, A.M.; Bahadded, A.A.; Almarhabi, K.A. AAQAL: A Machine Learning-Based Tool for Performance Optimization of Parallel SPMV Computations Using Block CSR. Appl. Sci. 2022, 12, 7073. [Google Scholar] [CrossRef]
  16. Dufrechou, E.; Ezzatti, P.; Quintana-Ortí, E.S. Selecting optimal SpMV realizations for GPUs via machine learning. Int. J. High Perform. Comput. Appl. 2021, 35, 254–267. [Google Scholar] [CrossRef]
  17. Usman, S.; Mehmood, R.; Katib, I.; Albeshri, A.; Altowaijri, S.M. ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines. Mob. Netw. Appl. 2023, 28, 744–763. [Google Scholar] [CrossRef]
  18. Nie, J.; Zhang, C.; Zou, D.; Xia, F.; Lu, L.; Wang, X.; Zhao, F. Adaptive sparse matrix-vector multiplication on CPU-GPU heterogeneous architecture. In Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, Guangzhou, China, 22–24 June 2019; pp. 6–10. [Google Scholar]
  19. Williams, S.; Oliker, L.; Vuduc, R.; Shalf, J.; Yelick, K.; Demmel, J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC’07, Reno, NV, USA, 10–16 November 2007. [Google Scholar] [CrossRef]
  20. Yesil, S.; Heidarshenas, A.; Morrison, A.; Torrellas, J. Wise: Predicting the performance of sparse matrix vector multiplication with machine learning. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, 25 February–1 March 2023; pp. 329–341. [Google Scholar]
  21. Bylina, B.; Bylina, J.; Stpiczyński, P.; Szałkowski, D. Performance analysis of multicore and multinodal implementation of SpMV operation. In Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, 7–10 September 2014. [Google Scholar]
  22. Zhao, H.; Xia, T.; Li, C.; Zhao, W.; Zheng, N.; Ren, P. Exploring Better Speculation and Data Locality in Sparse Matrix-Vector Multiplication on Intel Xeon. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, Hartford, CT, USA, 18–21 October 2020; pp. 601–609. [Google Scholar] [CrossRef]
  23. Usman, S.; Mehmood, R.; Katib, I.; Albeshri, A. ZAKI+: A Machine Learning Based Process Mapping Tool for SpMV Computations on Distributed Memory Architectures. IEEE Access 2019, 7, 81279–81296. [Google Scholar] [CrossRef]
  24. Gao, J.; Ji, W.; Liu, J.; Wang, Y.; Shi, F. Revisiting thread configuration of SpMV kernels on GPU: A machine learning based approach. J. Parallel Distrib. Comput. 2024, 185, 104799. [Google Scholar] [CrossRef]
  25. Nisa, I.; Siegel, C.; Rajam, A.S.; Vishnu, A.; Sadayappan, P. Effective machine learning based format selection and performance modeling for SpMV on GPUs. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018; pp. 1056–1065. [Google Scholar]
  26. Furuhata, R.; Zhao, M.; Agung, M.; Egawa, R.; Takizawa, H. Improving the accuracy in spmv implementation selection with machine learning. In Proceedings of the 2020 Eighth International Symposium on Computing and Networking Workshops (CANDARW), Naha, Japan, 24–27 November 2020; pp. 172–177. [Google Scholar]
  27. Davis, T.A.; Hu, Y. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 2011, 38, 1–25. [Google Scholar] [CrossRef]
Figure 1. Sparse vs. dense matrix.
Figure 2. Proposed methodology. Data preparation and training and testing phases.
Figure 3. Training and testing phases.
Figure 4. Execution time comparison as row-sorted data.
Figure 5. Execution time comparison as nnz-sorted data.
Figure 6. Speedup row-sorted data.
Figure 7. Speedup NNZs-sorted data.
Figure 8. Domain-wise execution time (Application Domains 1 to 24).
Figure 9. Domain-wise execution time (Application Domains 25 to 46).
Figure 10. Domain-wise speedup (Application Domains 1 to 24).
Figure 11. Domain-wise speedup (Application Domains 25 to 46).
Figure 12. Comparison of important features.
Figure 13. Model evaluation with all features.
Figure 14. Model evaluation with important features.
Figure 15. Model evaluation with basic features.
Figure 16. Execution time comparison of Elegante with default, minimum, maximum, and average.
Figure 17. Performance difference among minimum, Elegante, and default.
Figure 18. GMNP of proposed methodology.
Table 1. Fundamental symbols used in the study.
Symbol | Name
NNZs | Number of non-zeros
A | M × N matrix
M | Number of rows
N | Number of columns
X | N × 1 dense vector
Y | M × 1 output vector
U | Number of matrices
F | Sparse matrix feature
CSR | Compressed sparse row
BCSR | Block compressed sparse row
Table 2. Application domains used in this study.
Matrix Name | Application Domain | Rows | Columns | NNZs
mcfe | 2D/3D Problem | 765 | 765 | 24,382
fs_760_1 | 2D/3D Problem Sequence | 760 | 760 | 5976
mplate | Acoustics Problem | 5962 | 5962 | 74,076
Sandi_sandi | Bipartite Graph | 314 | 360 | 613
bayer05 | Chemical Process Simulation Problem | 3268 | 3268 | 27,836
circuit_3 | Circuit Simulation Problem | 12,127 | 12,127 | 48,137
fpga_dcop_01 | Circuit Simulation Problem Sequence | 1220 | 1220 | 5892
f855_mat9 | Combinatorial Problem | 2456 | 2511 | 171,214
ex31 | Computational Fluid Dynamics Problem | 3909 | 3909 | 115,357
cavity16 | Computational Fluid Dynamics Problem Sequence | 4562 | 4562 | 138,187
QRpivot | Counter Example Problem | 660 | 749 | 3808
ibm32 | Directed Graph | 32 | 32 | 126
SmaGri | Directed Multigraph | 1059 | 1059 | 4919
CollegeMsg | Directed Temporal Multigraph | 1899 | 1899 | 20,296
gre_1107 | Directed Weighted Graph | 1107 | 1107 | 5664
soc-sign-bitcoin-alpha | Directed Weighted Temporal Graph | 3783 | 3783 | 24,186
young2c | Duplicate Acoustics Problem | 841 | 841 | 4089
pcrystk02 | Duplicate Materials Problem | 13,965 | 13,965 | 491,274
bcsstk12 | Duplicate Structural Problem | 1473 | 1473 | 17,857
lshp1009 | Duplicate Thermal Problem | 1009 | 1009 | 3937
orani678 | Economic Problem | 2529 | 2529 | 90,158
qc2534 | Electromagnetics Problem | 2534 | 2534 | 232,947
onetone2 | Frequency Domain Circuit Simulation Problem | 36,057 | 36,057 | 227,628
cari | Linear Programming Problem | 400 | 1200 | 152,800
sctap1-2c | Linear Programming Problem Sequence | 3390 | 7458 | 21,854
crystm01 | Materials Problem | 4875 | 4875 | 55,107
reorientation_4 | Optimal Control Problem | 2717 | 2717 | 17,613
reorientation_3 | Optimal Control Problem | 2513 | 2513 | 32,166
zenios | Optimization Problem | 2873 | 2873 | 15,032
bp_0 | Optimization Problem Sequence | 822 | 822 | 3276
gemat1 | Power Network Problem | 4929 | 10,595 | 47,369
rbsa480 | Robotics Problem | 480 | 480 | 17,088
jpwh_991 | Semiconductor Device Problem | 991 | 991 | 6027
rw496 | Statistical/Mathematical Problem | 496 | 496 | 1859
pct20stif | Structural Problem | 52,329 | 52,329 | 1,375,396
fs_760_2 | Subsequent 2D/3D Problem | 760 | 760 | 5976
adder_dcop_02 | Subsequent Circuit Simulation Problem | 1813 | 1813 | 11,246
cavity17 | Subsequent Computational Fluid Dynamics Problem | 4562 | 4562 | 138,187
bp_1600 | Subsequent Optimization Problem | 822 | 822 | 4841
nemeth21 | Subsequent Theoretical/Quantum Chemistry Problem | 9506 | 9506 | 591,626
conf5_0-4x4-10 | Theoretical/Quantum Chemistry Problem | 3072 | 3072 | 119,808
epb1 | Thermal Problem | 14,734 | 14,734 | 95,053
fe_body | Undirected Graph | 45,087 | 45,087 | 163,734
cage | Undirected Graph Sequence | 366 | 366 | 2562
G27 | Undirected Random Graph | 2000 | 2000 | 19,990
nopoly | Undirected Weighted Graph | 10,774 | 10,774 | 40,808
G7 | Undirected Weighted Random Graph | 800 | 800 | 19,176
Table 3. SpMV features.
Set | Feature | Description | Formula | Complexity
Basic features | M | Number of rows | M | O(1)
Basic features | N | Number of columns | N | O(1)
Basic features | M + N | Rows plus columns | M + N | O(1)
Basic features | nnz | Number of non-zeros | nnz | O(1)
Basic features | Density | nnz / (M × N) | nnz / (M × N) | O(1)
High-complexity features | nnz_min | Minimum nnz per row | min(nnz_1, …, nnz_n) | O(M)
High-complexity features | nnz_max | Maximum nnz per row | max(nnz_1, …, nnz_n) | O(M)
High-complexity features | nnz_avg | Average nnz per row | (1/M) Σ_{i=1..M} nnz_i | O(M)
High-complexity features | nnz_sd | Standard deviation of nonzero elements per row | sqrt((1/M) Σ_{i=1..M} (nnz_i − nnz_avg)^2) | O(2M)
High-complexity features | Bw_avg | Average column distance between the first and the last nonzero element in each row | (1/N) Σ_{i=1..N} Bw_i | O(M)
High-complexity features | Bw_min | Minimum column distance between the first and the last nonzero element in each row | min(Bw_1, …, Bw_n) | O(M)
High-complexity features | Bw_max | Maximum column distance between the first and the last nonzero element in each row | max(Bw_1, …, Bw_n) | O(M)
High-complexity features | Bw_sd | Standard deviation of column distances between the first and the last nonzero element in each row | sqrt((1/M) Σ_{i=1..N} (Bw_i − Bw_avg)^2) | O(2M)
High-complexity features | Clustering | Clustering | (Σ_{i=0..N} Clus_i) / nnz | —
Table 4. Software and hardware used in this study.
Software/Tools/Library | Version
Operating System | Windows 11, 22H2
Compiler | Visual Studio Professional 2022, 17.8
Scikit-learn | Sklearn 0.23.1
OpenMP | OpenMP 6.0
Python | Google Colab, Python 3.10
