1. Introduction
Sparse linear algebra is vital to scientific computing and various fields of engineering and has thus been included among the seven dwarfs [1] by the Berkeley researchers. Among the sparse numerical techniques, the iterative solution of sparse linear equation systems is of prime importance due to its application in important areas such as the finite difference solution of partial differential equations (PDEs) [2,3,4], high accuracy surface modelling [5], finding steady-state and transient solutions of Markov chains [6,7,8], probabilistic model checking [9,10,11], solving the time-fractional Schrödinger equation [12], web ranking [13,14,15], inventory control and manufacturing systems [16], queuing systems [17,18,19,20,21,22,23], fault modelling, weather forecasting, stochastic automata networks [24,25], communication systems and networks [26,27,28,29,30,31], reliability analysis [32,44,45], wireless and sensor networks [33,34,35,36,37], computational biology [38], healthcare [27,39,40], transportation [41,42], natural language processing [43], operations research [23], big data [46,47], and smart cities [48,49].
Sparse matrix-vector (SpMV) multiplication is the key operation of iterative methods for solving linear equation systems [50]. Iterative solvers such as Jacobi and Gauss–Seidel mainly consist of multiple iterations of sparse matrix-vector computations, whereas a single iteration of a Krylov solver includes several SpMV operations. SpMV multiplication is a Level-2 Basic Linear Algebra Subprograms (BLAS) operation between a sparse matrix and a dense vector and can be represented mathematically as y ← Ax [51]. Other sparse computations, such as eigenvalue problems (Ax = λx), also require a large number of SpMV operations.
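For reference, the operation can be sketched for the common CSR layout as follows (a minimal sequential C++ illustration assuming the standard CSR arrays row_ptr, col_idx, and values; it is not the SURAA format described later):

```cpp
#include <vector>

// Minimal sequential CSR SpMV sketch: y = A * x.
// Assumes standard CSR arrays; each nonzero contributes one multiply and one add.
void csr_spmv(int m,
              const std::vector<int>& row_ptr,   // size m + 1
              const std::vector<int>& col_idx,   // size nnz
              const std::vector<double>& values, // size nnz
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (int row = 0; row < m; ++row) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += values[j] * x[col_idx[j]];   // 2 flops per nonzero
        y[row] = sum;
    }
}
```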
The implementation of parallel iterative solvers on graphics processing units (GPUs) raises numerous issues. Specialized storage structures are used to improve the performance of SpMV, but these structures pose design challenges when translated to GPUs. The major issues include coalesced memory access to both the sparse matrix A and the vector x, load balance among threads and warps, thread divergence within a warp of threads, performance variation depending on the structure of the sparse matrix, and the amount of memory access required for the computations. These issues stem from the irregular sparsity structure of the matrix, which results in poor performance. Researchers have suggested several storage formats to improve SpMV on GPUs (see Section 3). Leading sparse libraries such as CUSP [52] and cuSPARSE [53] provide a number of sparse storage techniques, including Coordinate (COO), Compressed Sparse Rows (CSR), ELLPACK (ELL), Hybrid (HYB), and Diagonal (DIA), and the associated BLAS operations. However, the performance depends upon the sparsity structure and, due to the diversity in sparsity patterns, they do not perform efficiently for all matrices. The burden of choosing an appropriate format for a given matrix usually lies with the user. Nevertheless, SpMV computations deliver low throughput due to the reasons mentioned above.
In this article, we propose SURAA (an Arabic word meaning speed), a novel method for SpMV computations on GPUs that addresses the major challenges of SpMV on GPUs, i.e., load balanced execution, coalesced memory access, and redundant computations. The novelty lies in the way we group matrix rows into different segments, and the way we dynamically schedule the various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule [54,55]. To the best of our knowledge, no other work exists that has used the Freedman–Diaconis rule for SpMV computations the way we use it, or used adaptive kernels the way we do in this paper. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads used to execute each segment is chosen adaptively. Dynamic parallelism, available on Nvidia GPUs of compute capability 3.5 and above (NVIDIA Corporation, Santa Clara, CA, USA), is utilized to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Note that the row segments in SURAA are dynamically scheduled by the dynamic kernel and are executed at the same time using the hardware streams on the GPU. The use of multiple streams provides higher performance by hiding the kernel launch overhead. This can be considered as reducing the kernel launch overhead, with the caveat that the number of dynamic kernels that can execute on the GPU in parallel depends on the number of available streaming multiprocessors, which is fixed for a given GPU. The number of segments that are added to the dynamically executed group is tunable. In our experiments, we find that, depending upon the structure of the matrix, tuning the size of the dynamically executed group enhances the performance of the scheme; see Section 4.4.
We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP) tools using a widely used benchmark comprising 26 matrices from 13 diverse domains. A total of seven other SpMV storage and computation techniques have been compared with SURAA. We use three widely used performance metrics to evaluate SURAA and the other techniques: throughput in Giga Floating Point Operations per second (GFLOP/s), speedup, and effective memory bandwidth (GB/s). Each scheme is also evaluated in terms of its performance with increasing npr variance (the variance in the number of non-zeros per row) and nnz (the total number of non-zeros in the matrix).
On average, SURAA outperforms all seven other SpMV computation methods, achieving an average speedup of 40× against SELL-P, 29.89× against WPK1, 12.98× against CSR, 11.33× against BSR, 1.50× against HYB, 1.15× against WPK2, and 1.1× against CSR5. SURAA provides the best throughput for 20 of the 26 matrices in the benchmark dataset. The average collective speedup of SURAA against all the other schemes is 13.99×. The highest speedups achieved by SURAA against the other schemes for any single matrix are 562×, 531.56×, 147×, 26.70×, 3.06×, 1.97×, and 1.94× against SELL-P, WPK1, CSR, BSR, HYB, WPK2, and CSR5, respectively. We repeat for clarity that these are SURAA's peak performance values against specific storage schemes for any single matrix. SURAA provides the highest mean throughput of 15.75 GFLOP/s, followed by 14.28 for CSR5. SURAA also leads the table with the highest maximum throughput value of 36.70 GFLOP/s, followed by 26.8 for SELL-P. SURAA also provides the highest effective memory bandwidth of 226.25 GB/s compared to the second best of 160.75 GB/s for SELL-P. SURAA also demonstrates a much lower setup overhead, equal to five SpMV execution times, compared to 69 and 95.5 SpMV execution times for SELL-P and CSR5.
An important aspect of our analysis (see Section 5.4) is that SURAA delivers comparatively higher performance as the npr variance and nnz increase. SURAA achieved the highest throughput (GFLOP/s) among all the schemes for the matrices with the highest npr variance.
We note that the speedup of SURAA against CSR5 and WPK2 was small; however, SURAA provided the highest throughput for 20 out of 26 matrices. More importantly, none of the schemes except SURAA delivered consistently high performance on all the performance benchmarks. The SURAA code can be found in the GitHub online repository (see: http://github.com/stormvirux/suraa). The rest of the paper is organized as follows. In Section 2, we discuss the background material related to this paper. In Section 3, we discuss the state of the art. In Section 4, we describe our proposed technique. The experimental results from the implementation and their analysis are given in Section 5. Section 6 concludes and summarizes future work.
3. Literature Survey
Many studies have attempted to improve the performance of sparse matrix-vector multiplication on GPUs. Many of these have been developed to improve the performance for a specific sparsity structure. In this section, we discuss some of the approaches for SpMV computations on GPUs.
The Jagged Array Diagonals (JAD) format [65,66] can be considered an optimization of the ELL format for GPUs. Initial preprocessing is required to store a sparse matrix in the JAD format. JAD provides better performance in the SpMV kernel than ELL, but its occupancy is lower than that of ELL. Abu-Sufah and Abdel Karim [67] introduced an improvement to the Transpose Jagged Diagonal Storage (TJDS) through the use of a blocking technique, called Blocked Transpose Jagged Diagonal Storage (BTJDS). This technique does not achieve good memory bandwidth utilization compared to HYB and ELL.
Choi et al. [68] implemented a blocked version of the CSR algorithm (BCSR) on the GPU. They also proposed a second sparse storage structure, a blocked version of ELL called blocked ELLPACK (BELLPACK). ELL-R was introduced in [69] to improve the performance of the ELL storage format on GPUs. It has an additional data structure that helps reduce the performance degradation due to thread divergence caused by the padded zero entries in the ELL storage format. Kreutzer et al. [70] proposed a sparse matrix storage scheme to reduce the space overhead of the extra padded zeros in the ELL format and improve the performance of SpMV on GPUs. The major disadvantage of this technique is that the multiplication by a permutation destroys the dense blocks and diagonal elements, which prevents cache reuse and degrades the performance.
ELLR-T is based on a GPU kernel modification of ELL-R [69,71]. In the ELL-R format, each row is handled by one thread during SpMV, whereas in ELLR-T, T threads operate on a row at the same time, similar to vector CSR (in which 32 threads operate on a row). Here, T can be 1, 2, 4, 8, 16, or 32. However, the authors do not propose a way to select the best T based on the input sparse matrix; rather, they leave it to the user to select it by experimentation. Dziekonski et al. [72] proposed a sparse matrix storage scheme called Sliced ELLR-T. The storage scheme was designed specifically to solve complex-valued sparse linear equations in computational electromagnetics. The proposed technique is based on the Sliced ELL [73] and ELLR-T storage schemes. The amount of space required by this scheme is less than that of ELL-R and ELLR-T but more than that of CSR. The bigger the matrix, the better the performance of the SpMV kernel using this scheme.
Padded Sliced ELLPACK (SELL-P) [74] is derived from Sliced ELLPACK [73] and SELL-C-σ [75], wherein rows with a similar number of nonzero elements are gathered into slices, thus minimizing the storage overhead compared to the ELLPACK scheme. By assigning multiple threads per row, as in ELLR-T [69], they modify SELL-C-σ by padding rows with zeros such that the row length of each slice becomes a multiple of the number of threads assigned to each row. The implementation of SELL-P is available in the MAGMA library. The sorting in the MAGMA implementation is performed on the CPU, which increases the overhead.
All the schemes discussed so far focus on the storage of sparse matrices and mostly use just one kernel or a fixed number of threads, in contrast to SURAA.
Maggioni and Berger-Wolf [76] proposed a storage scheme based on ELL, called Adaptive ELL, that balances the nonzero rows across GPU warps. In this technique, the computations on the nonzero elements are distributed over the warps (hardware level blocks) adaptively. Using a best-fit greedy heuristic, they try to fill the warps with the rows, solving an optimization problem of maximizing the efficiency of each warp and minimizing the load imbalance among the warps. This is a finer (lower level) optimization, unlike our technique, which uses multiple parallel kernels and variable numbers of threads to perform scheduling on sorted segments together with dynamic parallelism. Adaptive ELL also requires four extra data structures; three of these have the size of the total number of working warps on the GPU and one has the size of the number of rows in the matrix. Older GPUs, such as the Tesla K20 and Tesla K40, have less memory, and hence this technique leads to difficulty in processing larger matrices. However, many of the newer GPUs, such as the Pascal P100 and Volta V100, have 12 to 16 GB of DRAM and can also access memory on other GPUs over high bandwidth buses.
Ashari et al. [77] propose a sparse storage scheme based on CSR called Adaptive Compressed Sparse Row (ACSR). In this scheme, the rows are grouped into bins such that a bin consists of rows with an equal number of non-zeros per row. The rows of the sparse matrix are moved into bins depending on the number of non-zeros per row, with the binning done based on powers of 2. Since the binning is not done on consecutive rows, it is termed inter-binning. Each bin is executed either by a bin-specific kernel or by a row-specific dynamic kernel. However, as their work is based on CUDA 5.0, their dynamic parent kernel does not process more than 2048 rows. Our work exploits the virtualized launch pool and can execute more than 2048 rows. SURAA also differs from their work in that it achieves coalesced memory access by reordering similar loads (rows with similar npr) together in memory rather than logically performing inter-binning. Moreover, we differ in the way we assemble matrix rows into different segments, and segments into groups, as we use the Freedman–Diaconis rule [54,55] to achieve a relatively balanced segment size. Consequently, we achieve higher levels of load balancing due to the way we devise row segments and allocate them to dynamic kernels. This decreases the number of idle threads and leads to better memory coalescing. The row segments in SURAA are dynamically scheduled by the dynamic kernel and are executed at the same time using the hardware streams on the GPU. The use of multiple streams provides higher performance by hiding the kernel launch overhead. However, the number of dynamic kernels that can be executed on the GPU in parallel depends on the number of available streaming multiprocessors, which is fixed for a given GPU.
ELL-WARP is a variant of ELLPACK-R and padded JAD (JDS) developed for finite element unstructured grids by Wong et al. [78]. They propose two kernels, ELL-WARP (K1), also known as WPK1, and ELL-WARP v2 (K2), also known as WPK2. Both kernels initially sort the rows by length. WPK1 then arranges the rows into groups of warp size, pads them accordingly, and reorders the data within each warp in a column-major coalesced pattern. In WPK2, rows are subdivided recursively whenever the number of elements a thread must process exceeds a prescribed threshold, and the subdivision continues until the per-thread work falls below the threshold. This is done to process the larger rows more efficiently than WPK1.
Lu and Vinter [79] propose a storage format for sparse matrix-vector multiplication called CSR5. The 5 in the name indicates the number of data structures used to store the sparse matrix. The storage scheme is insensitive to the sparsity structure and provides similar performance for sparse as well as dense matrices. The technique uses a combination of the row block method [77] and the segmented sum method [80]. The design objectives are improved load balancing and a reduction in the overhead due to memory access and synchronization. The matrix transpose operations and the complexity of the format cause overhead. Moreover, the higher meta-data requirement increases the memory required for the representation. However, CSR5 does provide improved performance over the earlier SpMV techniques.
Hou et al. [81] proposed an auto-tuning framework for AMD APU platforms to find an appropriate binning scheme and select an appropriate kernel for each bin. The process of grouping rows with a similar number of nonzeros together is referred to as binning by the authors. The proposed binning method is a coarse grained method with different granularities. The auto-tuning technique selects the best granularity for a given matrix using machine learning (C5.0). In their binning strategy, they combine multiple neighboring rows into a single virtual row, and the granularity, i.e., the number of neighboring rows combined, is determined by the auto-tuner. The auto-tuner hence needs to be trained with appropriate data for proper feature selection and prediction, which is non-trivial. Moreover, for irregular matrices with a highly varying number of non-zeros per row, a fixed granularity results in each bin requiring a different kernel. Executing a large number of kernels at the same time (which is subject to the device capability) requires multiple streams, which is hardware dependent and hence might degrade the performance as the number of bins increases. Unlike this work, we do not use machine learning; instead, we segment the matrix on the basis of the Freedman–Diaconis rule [54,55]. The other major difference is the use of dynamic parallelism in our work and the way we adaptively schedule different kernels with varying numbers of threads, which addresses memory coalescing, thread divergence, and load balancing.
Flegar and Anzt [82] propose a load-balanced GPU kernel based on COO for computing the SpMV product. The proposed kernel uses a "warp vote function" along with a warp level segmented scan to avoid atomic collisions. It also uses an "oversubscribing" parameter to determine the number of threads allocated per physical core. The three arrays in the COO scheme are sorted with respect to the row index to decrease the number of atomic operations, such that elements with the same row index require only one atomic operation.
The major issues that need to be addressed are improved coalesced memory access to the matrix, better load balanced execution of SpMV, higher memory bandwidth utilization, preprocessing requirements, utilization and improvement of cache usage, and reduction of thread divergence. A comparative analysis of the related approaches for SpMV computations with respect to these issues is provided in Table 1. The schemes are listed in chronological order of their publication. We observe that no single work has comprehensively addressed all the issues.
From these discussions, it is clear that, due to the variance in the number of nonzeros per row and the differences in sparsity structure, a single kernel with a fixed number of threads cannot guarantee the best performance across various matrix sizes, application domains, and sparsity structures. Hence, in this work, we propose the SURAA method, which balances the load efficiently and achieves coalesced memory access to attain the best overall performance. Sparse matrices have a range of diverse sparsity structures, and dynamic kernels (together with Freedman–Diaconis (FD) rule based segmentation) allow us to dynamically schedule the matrix rows based on the matrix properties, enabling an improved memory and computation layout and, in turn, providing improved load balancing, better memory coalescing, and reduced thread divergence.
4. SURAA: The Proposed Method and Tool
Load-balanced execution and coalesced memory access are the two crucial factors that we consider to improve the performance of SpMV. A load-balanced execution evens out the workload of the threads and eliminates idle threads. When a group of consecutive threads accesses consecutive memory locations, we call it coalesced memory access. Coalesced memory access tremendously improves the throughput of data access from global memory and hence improves the SpMV performance.
The two major CSR based algorithms are the scalar CSR and the vector CSR (VCSR). The issue with scalar CSR is that it does not balance the workload well, as one thread is assigned per row. This results in an imbalanced workload for the threads in a warp. The vector CSR rectifies this issue by assigning a group of threads to each row, up to a warp (32 threads), typically the power of two closest to the mean number of non-zeros per row. Matrices that have a uniform number of non-zeros per row that is a multiple of the warp size benefit the most from this approach. However, there are many matrices with a small number of non-zeros per row, especially matrices with fewer than sixteen non-zeros per row. These matrices are not efficiently executed by VCSR, as many GPU threads remain idle while the other threads in a warp are executing. Conversely, warps that are assigned to rows with a substantial number of nonzeros per row lag behind warps assigned to rows with fewer non-zeros per row.
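The contrast between the two approaches can be sketched as follows (a simplified CUDA illustration assuming standard CSR arrays; these are textbook-style kernels, not the SURAA kernels):

```cpp
// Scalar CSR: one thread per row; warps become unbalanced when row lengths differ.
__global__ void csr_scalar(int m, const int* row_ptr, const int* col_idx,
                           const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}

// Vector CSR: one warp (32 threads) per row; short rows leave most lanes idle.
__global__ void csr_vector(int m, const int* row_ptr, const int* col_idx,
                           const double* val, const double* x, double* y) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (warp < m) {
        double sum = 0.0;
        for (int j = row_ptr[warp] + lane; j < row_ptr[warp + 1]; j += 32)
            sum += val[j] * x[col_idx[j]];
        for (int offset = 16; offset > 0; offset >>= 1)   // warp-level reduction
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        if (lane == 0) y[warp] = sum;
    }
}
```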
Hence, it is essential to consider the sparsity structure and coalesced memory access to derive benefit from the merits, and avoid the flaws, of both the scalar and vector CSR techniques. To this end, we develop a load balanced dynamic technique based on the CSR storage format, called SURAA. SURAA uses sorting, segmentation, asynchronous kernels, streams, and dynamic parallelism. Sorting is used to bring all the rows with a similar number of non-zero elements together, and segmentation is used to divide the work into manageable blocks so as to distribute the workload evenly. The dynamic kernel enables us to launch multiple child kernels from within the parent kernel, which can efficiently execute the sorted segments.
The terminology and the data structures used are listed in Table 2. The format is very similar to the CSR format, except that we introduce an array, perm, of size m to store the sorted row indices. A temporary vector of size m, which stores the number of non-zero elements per row, is generated to speed up the pre-processing (sorting). In the rest of the paper, we use the terminology defined in Table 2 for our discussions.
The SURAA tool is divided into two phases—(a) the setup phase and (b) the execution phase—comprising a total of six algorithms (details to follow). Algorithm 1 shows our main driver program which invokes both the setup phase and the execution phase. We shall discuss the driver algorithm and the two phases in the next subsections.
Algorithm 1 SURAA_Master_Program
Input: the matrix in the SURAA format and the vector x. Output: the vector y.
1: function Main
2:   Setup_Phase()
3:   for each segment in G3 do
4:     launch the scalar kernel for the segment on its own stream
5:   end for                            ▹ Asynchronous returns for the kernel
6:   for each segment in G2 do
7:     compute the mean npr of the segment
8:     Y ← smallest multiple of 32 that is greater than or equal to the mean npr
9:     threads per block ← Y            ▹ Threads per block as a multiple of 32
10:    launch the vector kernel for the segment on its own stream
11:  end for                            ▹ Asynchronous returns for the kernel
12:  launch the dynamic parallelism kernel for the segments in G1
13: end function
4.1. Setup Phase
Algorithm 2 depicts the setup phase, where the matrix is sorted, segmented, and grouped into the various groups. The matrix is sorted row by row in descending order of the number of non-zero elements per row. Rows with the same number of non-zeros are further sorted based on the column index of their first non-zero element, which further balances the load of the vector x. The sorting can be done on the GPU or the CPU; in our work, we use a GPU based parallel radix sort. A key-value sort routine is used to sort the vector of per-row counts and to store the permutation of the original row indices in the perm vector. The input matrix is then rearranged using the perm vector.
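A minimal sketch of this key-value sort using Thrust (assuming the per-row counts already reside on the device; variable names are illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/functional.h>

// Sort rows by their nonzero count (descending) and record the permutation.
// nnz_per_row holds the number of nonzeros of each row; perm ends up holding
// the original row indices in sorted order and is later used to rearrange A.
void sort_rows_by_length(thrust::device_vector<int>& nnz_per_row,
                         thrust::device_vector<int>& perm) {
    perm.resize(nnz_per_row.size());
    thrust::sequence(perm.begin(), perm.end());               // perm = 0,1,...,m-1
    thrust::sort_by_key(nnz_per_row.begin(), nnz_per_row.end(),
                        perm.begin(), thrust::greater<int>()); // descending sort
}
```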
Algorithm 2 SURAA_Setup_Phase
Input: the matrix A in the CSR format. Output: the sorted and segmented matrix and the groups G1, G2, and G3.
1: function Setup_Phase
2:   Initialisation:
3:   temp ← non-zero elements per row
4:   sort temp and record the permutation in perm          ▹ Sort Descending
5:   Rearrange A using perm
6:   IQR ← Q3(temp) − Q1(temp)
7:   segment size ← 2 · IQR / ∛m                           ▹ Freedman–Diaconis rule
8:   number of segments ← ⌈m / segment size⌉
9:   for each segment s do
10:    if mean npr of s ≥ 32 and rows(G1) < pending launch limit then
11:      add s to G1
12:    else if mean npr of s ≥ 32 then
13:      add s to G2
14:    else
15:      add s to G3
16:    end if
17:  end for
18: end function
Secondly, the sorted matrix is divided into various segments. The division into segments is based on the Freedman–Diaconis rule [54,55], which is given below:

s = 2 · IQR(npr) / m^(1/3),

where s is the size of each segment, IQR is the interquartile range, npr has been defined already, and m is the number of rows in the matrix. Moreover, IQR = Q3 − Q1, where Q3 and Q1 are the third and first quartiles of npr, respectively. In our algorithm, we calculate the quartiles of the sorted npr array. Since the matrix is pre-sorted, we can obtain the optimal size of the segments, as well as the number of segments, using the Freedman–Diaconis rule for each matrix with O(1) complexity.
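A minimal host-side sketch of this computation (assuming the per-row counts are already sorted in descending order, so the quartiles can be read directly from the array; function and variable names are illustrative):

```cpp
#include <cmath>
#include <vector>
#include <algorithm>

// Freedman-Diaconis based segment size for a descending-sorted npr array.
// The IQR is read in O(1) from the sorted data; s = 2 * IQR / cbrt(m).
int fd_segment_size(const std::vector<int>& npr_sorted_desc) {
    const size_t m = npr_sorted_desc.size();
    // Descending order: the larger quartile value sits near the front.
    const double q3 = npr_sorted_desc[m / 4];
    const double q1 = npr_sorted_desc[(3 * m) / 4];
    const double iqr = q3 - q1;
    const double s = 2.0 * iqr / std::cbrt(static_cast<double>(m));
    return std::max(1, static_cast<int>(std::ceil(s)));  // at least one row per segment
}
```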
In the third step, the segments are allocated to the various groups depending upon the average number of non-zero elements per row in each segment. This is also depicted in the left part of Figure 5. The numbers 0 to N indicate the npr of each row and the numbers 0 to M indicate the rows of the matrix. The red dashed line in the figure indicates the size of the warp (0 to 31) for GPUs, which is used as a threshold in our method. Since the matrix is sorted in descending order of npr, we add each segment, starting from the beginning of the matrix, to Group 1 (G1) until the mean number of non-zeros per row of a segment becomes less than 32 or the total number of rows in G1 exceeds the pending launch limit (the number of child grids in the queue ready for execution). Since the segment size might not be a multiple of the group size, if adding a whole segment would exceed this limit, we only include rows up to the limit in G1. The remaining segments that have a mean npr greater than or equal to 32 are allocated to G2. The remaining segments that have a mean number of non-zeros per row less than 32 are added to G3.
The last segment might not have the same number of rows as the other segments and hence is executed separately. All the segments in G3 are executed asynchronously on different streams using a scalar CSR kernel, whereas all the segments in G2 are executed asynchronously by a modified vector CSR kernel. Finally, we employ a dynamic parallelism based kernel to run the segments in G1, which we discuss in greater detail in the following subsections.
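In outline, the grouping step can be sketched as follows (host-side C++; the Segment type, the group containers, and the name l_max for the pending-launch cap are illustrative assumptions):

```cpp
#include <vector>

struct Segment { int first_row; int num_rows; double mean_npr; };

// Assign segments (already ordered by decreasing npr) to the three groups.
// G1 -> dynamic-parallelism kernel, G2 -> vector kernel, G3 -> scalar kernel.
// l_max caps the number of rows handed to the dynamic kernel (tunable).
void group_segments(const std::vector<Segment>& segments, int l_max,
                    std::vector<Segment>& g1, std::vector<Segment>& g2,
                    std::vector<Segment>& g3) {
    int rows_in_g1 = 0;
    for (const Segment& s : segments) {
        if (s.mean_npr >= 32.0 && rows_in_g1 + s.num_rows <= l_max) {
            g1.push_back(s);                 // long rows, within the launch budget
            rows_in_g1 += s.num_rows;
        } else if (s.mean_npr >= 32.0) {
            g2.push_back(s);                 // long rows beyond the budget
        } else {
            g3.push_back(s);                 // short rows
        }
    }
}
```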
After setting up the matrix, we proceed to the execution phase where we schedule and allocate various segments to different kernels.
4.2. Execution Phase
4.2.1. Dynamic Parallel Kernel
Figure 5 further illustrates the allocation of the various segments to the different kernels. The segments whose rows contain larger npr values are executed as part of the dynamic kernel, whereas the other segments are executed based on the average length of the rows in each segment. The segments that form part of the dynamic kernel are shown in red. In the figure, the leading segments together have a number of rows less than or equal to the pending launch limit (the blue dashed line shows this parameter) and hence are grouped together in G1, which is executed by the dynamic kernel. The segments whose mean npr is less than 32 are executed by the scalar CSR based kernel. The rest of the segments, lying between these two groups, are grouped together into G2 and executed by the vector kernel.
Algorithm 3 is the parent algorithm for the dynamic kernel that executes the segments of G1. This parent kernel assigns child kernels, depicted in Algorithm 4, to each row. Each thread of the parent kernel works on a row by invoking a child kernel. Multiple warps (depending upon the row size) of the child kernel execute the row assigned to them. As multiple warps are involved for a row in the child kernel, we require a reduction within each warp as well as among the multiple warps. We use the CUDA shuffle function __shfl_down_sync to compute the partial sums of the threads within a warp; this function is faster than the older __shfl_down function and is available in CUDA version 9.0 and later. Each warp stores its partial sum in shared memory, and a final reduction over the shared memory is performed to obtain the sum.
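A typical warp-level reduction of this kind (a generic device-side sketch using __shfl_down_sync, which requires CUDA 9.0 or later) is:

```cpp
// Reduce a per-thread partial sum across a warp using register shuffles.
// After the loop, lane 0 of the warp holds the warp's total.
__device__ double warp_reduce_sum(double val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```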
The size of the grids for the child kernels depends upon the number of non-zeros in each row. In our example shown in Figure 5, the parent kernel invokes child kernels for all the segments in G1, i.e., a child kernel is assigned to process each row of G1: the first child kernel executes the first row of the first segment, the second child kernel executes the second row of the first segment, and so on throughout G1. A segment specific vector kernel is used to execute the segments of G2 at the same time, using a separate CUDA stream for each segment. In addition, the last two segments in the figure are added to G3 and executed using the scalar CSR based algorithm.
Algorithm 3 SURAA_Dynamic_Kernel (parent)
Input: values, col_idx, row_ptr, perm, x. Output: y.
1: function DynamicSpMV
2:   tid ← global thread index
3:   if tid < rows(G1) then
4:     row_start ← row_ptr[tid]
5:     row_end ← row_ptr[tid + 1]
6:     nnz ← row_end − row_start
7:     threads_per_block ← multiple of 32 derived from nnz
8:     stream ← stream assigned to this row
9:     ChildKernel<<<grid, threads_per_block, stream>>>(values, col_idx, nnz, x, y, perm)
10:  end if
11: end function
In the child kernel, we initially calculate the thread Id, lane Id, and warp Id to access the data structures. Each child thread computes a partial sum. A segmented parallel reduction using the shuffle instruction (__shfl_down_sync) is used to accumulate the partial results. The per-warp partial sums are then stored in shared memory, and we perform one more reduction to accumulate the final sum and store it in the result vector (y).
Algorithm 4 SURAA_Child_Kernel
Input: values, col_idx, nnz, x, perm. Output: y.
1: function ChildKernel
2:   Initialisation:
3:   tid ← global thread index; laneId ← tid mod 32; warpId ← tid / 32
4:   sum ← 0
5:   for each element j of the assigned row handled by this thread do
6:     sum ← sum + values[j] · x[col_idx[j]]
7:   end for
8:   for offset ← 16, 8, 4, 2, 1 do
9:     sum ← sum + __shfl_down_sync(mask, sum, offset)
10:  end for
11:  if laneId = 0 then
12:    store sum for warpId in shared memory
13:  end if
14:  synchronize the threads
15:  if tid = 0 then
16:    reduce the per-warp sums from shared memory
17:    write the result to y for the corresponding row (via perm)
18:  end if
19: end function
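To make the parent/child structure concrete, the following is a simplified CUDA dynamic-parallelism sketch (compiled with relocatable device code, -rdc=true, on a device of compute capability 3.5 or higher). It illustrates the technique rather than the exact SURAA kernels; in particular, the launch configuration and array names are assumptions:

```cpp
// Warp-level reduction: after the loop, lane 0 holds the warp's total.
__device__ double warp_reduce_sum(double val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Child kernel: one block cooperates on a single long row. Per-warp partial
// sums are combined in shared memory and thread 0 writes the final result.
__global__ void child_row_spmv(const double* val, const int* col_idx,
                               int row_start, int row_nnz,
                               const double* x, double* y, int out_row) {
    __shared__ double warp_sums[32];
    int lane = threadIdx.x & 31;
    int wid  = threadIdx.x >> 5;

    double sum = 0.0;
    for (int j = threadIdx.x; j < row_nnz; j += blockDim.x)
        sum += val[row_start + j] * x[col_idx[row_start + j]];
    sum = warp_reduce_sum(sum);
    if (lane == 0) warp_sums[wid] = sum;
    __syncthreads();

    if (threadIdx.x == 0) {
        double total = 0.0;
        for (int w = 0; w < (blockDim.x + 31) / 32; ++w)
            total += warp_sums[w];
        y[out_row] = total;
    }
}

// Parent kernel: one parent thread per row of group G1; each thread launches
// a child block whose size grows with the row's nonzero count.
__global__ void parent_dynamic_spmv(int num_rows, const int* row_ptr,
                                    const int* perm, const double* val,
                                    const int* col_idx, const double* x,
                                    double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_rows) {
        int row_start = row_ptr[i];
        int row_nnz   = row_ptr[i + 1] - row_start;
        int threads   = max(32, min(1024, ((row_nnz + 31) / 32) * 32));
        child_row_spmv<<<1, threads>>>(val, col_idx, row_start, row_nnz,
                                       x, y, perm[i]);
    }
}
```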
4.2.2. Scalar Kernel
Algorithm 5 illustrates the segment based scalar SpMV execution; SpMV for the segments in G3 is done by this algorithm. One thread is assigned to each row. For each row, the thread computes the products of the non-zero elements with the corresponding entries of the vector x, sums the products, and writes the result to the corresponding row of the result vector y. This kernel is well suited to the shorter rows compared to the other kernels. For each segment, the kernel is launched with enough blocks to cover all the rows of the segment. The total number of threads per block can be varied as required to achieve optimal performance.
Algorithm 5 SURAA_Scalar_Kernel
Input: values, col_idx, row_ptr, perm, x. Output: y.
1: function scalarSpmv
2:   row ← global thread index (offset by the first row of the segment)
3:   sum ← 0
4:   for j ← row_ptr[row] to row_ptr[row + 1] − 1 do
5:     sum ← sum + values[j] · x[col_idx[j]]
6:   end for
7:   y[perm[row]] ← sum
8: end function
4.2.3. Vector Kernel
Algorithm 6 illustrates the segment based vector SpMV algorithm. In this algorithm, a total of Y threads operate on each row. We assign Y as 32, 64, or 128 depending on the mean npr of the segment; Y is the smallest multiple of 32 greater than the mean npr. The calculation of Y can be seen in lines 7, 8, and 9 of Algorithm 1. As in the child kernel (see the explanation of Algorithm 4), we calculate the thread Id, lane Id, and warp Id to access the data structures. The Y threads assigned to each row use a segmented parallel reduction with the shuffle instruction (__shfl_down_sync) to accumulate the partial results. The partial sums are then stored in shared memory, and we perform one more reduction to accumulate the final sum and store it in the result vector (y). This is very similar to the CSR vector scheme. The kernel for each segment is launched with enough blocks to provide Y threads per row of the segment.
Algorithm 6 SURAA_Vector_Kernel
Input: values, col_idx, row_ptr, perm, x, Y. Output: y.
1: function DCSR_vector
2:   tid ← global thread index; laneId ← tid mod 32
3:   row ← tid / Y (offset by the first row of the segment)
4:   sum ← 0
5:   for j ← row_ptr[row] + (tid mod Y) to row_ptr[row + 1] − 1, step Y do
6:     sum ← sum + values[j] · x[col_idx[j]]
7:   end for
8:   for offset ← 16, 8, 4, 2, 1 do
9:     sum ← sum + __shfl_down_sync(mask, sum, offset)
10:  end for
11:  if laneId = 0 then
12:    store the warp's partial sum in shared memory
13:  end if
14:  if tid mod Y = 0 then
15:    reduce the partial sums of the row from shared memory
16:    y[perm[row]] ← result
17:  end if
18: end function
4.3. A Note on the Pending Launch Limit
The pending launch limit in our code is a tunable parameter. It is the number of child grids that are pending, running, suspended, or waiting in the queue ready for execution. Exceeding this limit in the earlier CUDA versions between 5.0 and 5.9 resulted in the kernel being discarded; such kernels were not executed. The limit signifies the size of the buffer that tracks the running kernels as well as the kernels to be executed. The maximum number of child kernels can be modified in such scenarios using cudaDeviceSetLimit, which takes two parameters, cudaLimitDevRuntimePendingLaunchCount and x, where x is the new number of child kernels. Changing the pending launch count changes the size of the buffer. However, this leads to a degradation in performance on older CUDA devices. The size of the buffer is set to 2048 by default.
However, from CUDA 6.0 onwards, in addition to the fixed sized pool of the size set by this limit, a variable sized virtualized pool has been added to the buffer. The fixed sized pool is used until it is filled, after which the virtualized pool is used. The cost associated with the virtualized pool is slightly higher, but in the overall scenario for SpMV we find that the use of the virtualized pool contributes to better performance up to a threshold. A poorly chosen pool size affects the performance of SpMV. In our experiments, we tried 1024, 2048, 4096, and 8192 as pool sizes. We discuss the effects of the pool sizes on our benchmark suite in Section 5.
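For reference, the limit is adjusted through the CUDA runtime as sketched below (the value 4096 is just one of the pool sizes we experimented with):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Enlarge the fixed-size pool that tracks pending child-kernel launches.
// This must be called from the host before the parent kernel is launched.
void set_pending_launch_pool(size_t pool_size) {
    cudaError_t err =
        cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, pool_size);
    if (err != cudaSuccess)
        std::fprintf(stderr, "cudaDeviceSetLimit failed: %s\n",
                     cudaGetErrorString(err));
}

// Example: set_pending_launch_pool(4096);
```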
4.4. Section Summary and Clarifications
We would like to clarify here that, in using multiple dynamic kernels, we do not imply that a single kernel with a fixed number of threads would never be able to deliver better or even the best performance. The SURAA Master Program (Algorithm 1) first launches the Scalar (Algorithm 5) and Vector (Algorithm 6) kernels asynchronously. Subsequently, it invokes the dynamic parallelism kernel (Algorithm 3). Dynamic parallelism enables executing the child kernels (which manage different rows; Algorithm 4) with different thread configurations, determined dynamically, and this is not possible if we were to use a static kernel.
We reiterate that the numbers of threads for the dynamic kernels are calculated dynamically during the execution, using the pending launch limit parameter and the properties of the matrix being multiplied with the vector. If we were to reuse Algorithm 2 for the setup phase and then spawn three separate asynchronous, non-dynamic kernels, we would not be able to dynamically adjust the configuration of the threads spawned by the kernels. The dynamic kernel allows us to spawn a varying number of child kernels, up to the pending launch limit (as discussed in Section 4.3), each having its own thread configuration parameters, and this is not possible with a non-dynamic kernel. Consequently, a non-dynamic kernel would not allow us to achieve the higher performance that we achieve in the current case (for the results, see Section 5).
In summary, the six algorithms, their descriptions, and the other discussions in this section show that we have utilized streams, warps, and dynamic kernels, while accounting for warp divergence, to use the GPU resources efficiently and to launch the maximum possible number of concurrently executing kernels.
5. Results and Analysis
In this section, we evaluate the performance of SURAA against seven well-known and widely used SpMV storage and computation techniques. It comprises seven subsections. The details of the experimental testbed including the storage schemes used are given in
Section 5.1. The details of the dataset comprising 26 matrices for comparatively benchmarking SURAA and other schemes are provided in
Section 5.2.
Section 5.3 analyzes in detail the comparative performance of SURAA against the seven other schemes using three performance metrics, namely throughput, speedup, and effective memory bandwidth. Section 5.4 provides a comparative analysis of SURAA with the other schemes with respect to the npr variance of all the 26 matrices in our dataset. Section 5.6 compares the setup cost or overhead of SURAA with the other SpMV computation techniques. Section 5.5 discusses the parametric configuration of SURAA in terms of the maximum number of child kernels (the pending launch limit) that can be invoked by the dynamic kernel. Finally,
Section 5.7 relates SURAA with our broader work on developing sparse iterative solvers.
5.1. Experimental Testbed
The platform used for the experiments is a part of the 230 TFLOPS Aziz supercomputer (ranked number 360 in June 2015 among the Top500 machines). It consists of 496 nodes and 11,904 cores in total. Each node contains two Intel Xeon E5-2695v2 processors (Intel Corporation, Santa Clara, CA, USA), each with 12 cores (2.4 GHz). In addition, 380 of these nodes have 96 GB of random access memory, while each of the remaining 112 nodes contains 256 GB, which sums up to a total of 66 TB of memory. It also has two Nvidia K20X GPU (Kepler) accelerator nodes and two Intel Phi 5110P MIC accelerator nodes. It is controlled by 64-bit Red Hat Enterprise Linux v6.5 (Red Hat, Raleigh, NC, USA). We used the K20 nodes to execute our code. We used CUDA toolkit 9.0 with C++ for the compilation of our code, which was compiled using nvcc. We compare our scheme with the following seven SpMV storage and computation techniques:
5.2. Benchmark Suite
Table 3 illustrates the selected sparse matrices in our benchmark suite. We have selected 26 matrices for the benchmarking. Fifteen of these matrices are selected based on the recommendations by [
85], which are the most frequently used matrices by the researchers for SpMV and iterative solvers. The other included matrices have been widely used (see e.g., [
67,
79]). All the selected matrices are from the University of Florida Sparse Matrix Collection [
86]. The chosen matrices are from diverse applications that include computational fluid dynamics, bioengineering, thermal problems, quantum chemistry, circuit simulation, power, and web search.
Figure 6 shows the normalized mean and standard deviation of the non-zeros per row for the matrices in our benchmark suite. The red dots in the figure indicate the normalized mean and the bars indicate the normalized standard deviation. We can observe the irregularity in the number of non-zeros per row for these selected matrices. The high standard deviation for most of the matrices indicates the challenge in designing a uniform storage format and SpMV algorithm.
5.3. SpMV Performance
In this subsection, we evaluate the performance of SURAA in terms of three main performance metrics: (1) throughput (GFLOP/s), (2) average speedup, and (3) effective memory bandwidth (GB/s). We first define these three performance metrics.
Next, the throughput and speedup are evaluated in
Section 5.3.1; it provides a detailed comparative analysis of SURAA including individual performance for all matrices in the dataset, average performance over all the matrices, best performance for any single matrix in the dataset, and a detailed discussion of comparison of SURAA with each of the other seven techniques (see the paragraph headings, WPK1 & WPK2, HYB, etc.).
The aggregate throughput and speedup are analyzed in
Section 5.3.2. The effective memory bandwidth performance is compared in
Section 5.3.3.
The throughput in GFLOP/s is calculated based on the effective flop count of an SpMV execution, as given below in Equation (2):

GFLOP/s = (2 × nnz) / (t × 10^9).   (2)

In the above equation, nnz denotes the total number of non-zero elements in the given matrix and t denotes the execution time in seconds.
The average speedup is calculated using the standard method depicted in Equation (3):

Average speedup = (1/26) × Σ_i (G_i / G_ref,i).   (3)

Here, G_i indicates the throughput in GFLOP/s of SURAA for matrix i from our benchmark suite of 26 matrices and G_ref,i indicates the throughput (in GFLOP/s) of the reference scheme against which we make the comparison, i.e., CSR, HYB, BSR, CSR5, SELL-P, WPK1, or WPK2.
The effective memory bandwidth in GB/s is calculated as in Equation (4):

BW = (B_r + B_w) / (t × 10^9).   (4)

In the above equation, B_r and B_w represent the number of bytes read and written during a single kernel execution, and t is the kernel execution time in seconds.
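For concreteness, these three metrics can be computed from a measured kernel time as in the small helper sketch below (the byte counts are scheme-specific, as discussed in Section 5.3.3):

```cpp
#include <cstdint>

// Throughput, speedup, and effective bandwidth from a measured kernel time.
// The flop count of one SpMV is 2 * nnz (one multiply and one add per nonzero).
double gflops(std::uint64_t nnz, double seconds) {
    return (2.0 * nnz) / (seconds * 1e9);
}
double speedup(double gflops_suraa, double gflops_reference) {
    return gflops_suraa / gflops_reference;
}
double effective_bandwidth_gbs(std::uint64_t bytes_read, std::uint64_t bytes_written,
                               double seconds) {
    return (bytes_read + bytes_written) / (seconds * 1e9);
}
```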
5.3.1. Throughput and Speedup
Figure 7 illustrates the double precision performance for the 26 benchmark matrices on an Nvidia K20 GPU. Each matrix was executed 1000 times using each sparse storage technique and the average was used to calculate the throughput (GFLOP/s) using Equation (
2). These throughput values for all the eight storage schemes are depicted as a bar chart in
Figure 7.
Average Throughput Performance
We observe in
Figure 7 that, overall, our proposed technique performs better than all the compared schemes, CSR, HYB, BSR, CSR5, SELL-P, WPK1, and WPK2 (the SURAA throughput is represented by the dark blue bar, which is higher than the other bars in the figure). SURAA provides higher throughput than any other technique for 20 out of 26 matrices (we will come back to the remaining six matrices in a moment). On average, SURAA outperforms all seven other SpMV computation methods, achieving an average speedup of 40× against SELL-P, 29.89× against WPK1, 12.98× against CSR, 11.33× against BSR, 1.50× against HYB, 1.15× against WPK2, and 1.1× against CSR5.
Best Performance for Individual Matrices
Note in
Figure 7 that the highest speedups achieved by SURAA against other schemes for any single matrix are 562× (matrix
ASIC_680ks), 531.56× (for the matrix
ASIC_680ks), 147× (
FullChip), 26.70× (
flickr), 3.06× (torso1), 1.97× (torso1), and 1.94× (rail4284) against SELL-P, WPK1, CSR, BSR, HYB, WPK2, and CSR5, respectively. We repeat for clarity that these are SURAA's peak performance values against specific storage schemes for any single matrix. Even for blocked matrices, such as
ASIC_680ks and
TSOPF_RS_b2383, we gain higher performance compared to the BSR technique, which is expected to perform better due to the block structure. Clearly, SURAA outperforms SELL-P, WPK1, CSR, BSR, and HYB by fairly large margins; however, its performance gain over WPK2 and CSR5 is small.
Note that stating the highest gain against a particular scheme for a given matrix is meant to illustrate the performance of each individual scheme against SURAA. It does not imply that SURAA has an equally high gain over the other schemes. For example, the 562× speedup over SELL-P for the matrix
ASIC_680ks only applies to SELL-P, and
Figure 7 clearly depicts that SURAA's performance is only slightly higher (1.06×) than that of the next best scheme, CSR5. Comparing each scheme with every other scheme would lengthen the paper, may confuse the reader, and some may find such an analysis inappropriate.
We now examine the throughput performance of each scheme in more detail.
BCSR/BSR
SURAA outperforms the BSR scheme, overall, by moderate margins (average speedup: 11.33×, max: 26.70×).
Figure 7 depicts that BSR delivers poor performance compared to most schemes including SURAA.
HYB
SURAA outperforms the HYB scheme, overall, although by relatively smaller margins (average speedup: 1.50×, max: 3.06×). The highest speedup achieved by SURAA against HYB for a single matrix from the 26 matrices is 3.06× for the matrix
torso1; see
Figure 7. The HYB scheme outperforms the SURAA method for two matrices,
thermal2 and
amazon0312. We noted that this is also the case for a few other schemes, and we will discuss them in the respective paragraphs. Further investigation reveals that SURAA does not utilize dynamic parallelism for these two matrices because the largest rows of these matrices have only 11 and 10 non-zeros, respectively, and the average number of non-zeros is 7 and 8, respectively. Hence, SURAA uses the scalar CSR kernel for these two matrices (
thermal2 and
amazon0312). We also observe (see
Table 3 and
Figure 6) that both of these matrices have low npr variance and standard deviation, and therefore do not benefit from the adaptive kernels in SURAA. We will see in Section 5.4 that SURAA provides higher differential performance for matrices with higher npr variance (a behavior which is highly desirable). The performance of SURAA for low npr variance matrices will be investigated in the future to enable SURAA to gain higher performance for such matrices as well.
CSR
SURAA outperforms the CSR scheme, overall, by moderate margins (average speedup: 12.98×, max: 147×). Note in
Figure 7 that CSR delivers moderate performance compared to SURAA and the other schemes. However, the large average speedup of SURAA over CSR (12.98×) comes mainly from CSR's poor performance for two matrices,
ASIC_680k and
FullChip. SURAA achieves a higher throughput against CSR for all the 26 matrices except one matrix,
amazon0312 (0.91×). The reasons for this low performance of SURAA for this particular matrix have already been discussed alongside the HYB scheme.
SELL-P
SURAA outperforms the SELL-P scheme, overall, by fairly large margins (average speedup: 40×, max: 562×). The high speedup of SURAA against SELL-P is particularly attributed to the poor performance of SELL-P for two matrices,
ASIC_680k (speedup, 562×) and
FullChip (speedup, 378×); see
Figure 7. SELL-P has mostly produced very low throughput for high npr variance matrices such as
ASIC_680k,
FullChip,
rail4284, and others (see
Figure 6 and
Figure 7). This behavior will be investigated further in
Section 5.4. SURAA achieves a higher throughput against SELL-P for all the 26 matrices except four matrices,
thermal2 (0.71×),
amazon0312 (0.72×),
Andrews (0.79×), and
Thermomech_TK (0.70×). The first two of these matrices have already been mentioned while discussing the HYB performance. The reason for this low performance, as before, is that SURAA did not invoke the dynamic parallel kernel for these four matrices because they have a low average number of nonzeros per row (see
Table 3 and
Figure 6). As mentioned earlier, SURAA does not invoke the dynamic kernel for matrices whose average npr value is less than 32. A possible modification in the future to improve the SURAA performance for such matrices is either to optimize our CSR scalar kernel or to replace just the SURAA CSR scalar kernel with the cuSPARSE CSR kernel, which is very likely to make SURAA faster than all the techniques for all the matrices. Another possibility is to tune the parameter that invokes the dynamic kernel to an optimum value or to embed further intelligence into the kernel invocation process.
CSR5
SURAA outperforms the CSR5 scheme, overall, although by relatively smaller margins (average speedup: 1.1×, max: 1.94×). The highest speedup achieved by SURAA against CSR5 for any single matrix among the 26 matrices of our dataset is 1.94× (for the matrix
rail4284). SURAA achieves a higher throughput against CSR5 for all the 26 matrices except for four matrices,
thermal2 (0.71×),
amazon0312 (0.78×),
in-2004 (0.63×), and
bundle1 (0.90×). The reasons for the low performance of SURAA for the first two matrices have already been discussed while discussing the HYB performance (i.e., a low average npr). The average npr for the third matrix, in-2004, is 12.23, and it has a medium npr variance; see Table 3 and Figure 6. Therefore, the reasons for the low performance of SURAA for this matrix are the same, i.e., a low average npr implies no invocation of the dynamic kernel in SURAA. The potential solutions to address these limitations of SURAA have also been mentioned earlier. The npr variance of the fourth matrix, bundle1, is high, with an average npr of 72.84. This matrix has neither a low average npr nor a low npr variance. However, bundle1 is a small (10,581 rows), square, symmetric matrix and the difference in SURAA and CSR5 performance is quite small, 9.30 versus 10.30 GFLOP/s, respectively. Future work will focus on further investigation of SURAA's performance to improve its behavior in such cases.
WPK1 and WPK2
SURAA outperforms the WPK1 scheme by a large margin (average speedup: 29.89×, max: 531.56×), but its gain over WPK2 is small (average speedup: 1.15×, max: 1.97×). Our experiments suggest that WPK1 does not provide good results for asymmetric matrices (see
Figure 7). There are a total of 16 asymmetric matrices in our dataset (see
Table 3) and we noted that WPK1 produced relatively low throughput for all these matrices. Two of these 16 matrices,
Freescale and
FullChip, are exceptions because WPK1 failed to execute SpMV for these two matrices and caused a segmentation error. Particularly, for three matrices,
ASIC_680ks,
12month1, and
rail4284, WPK1 produced very low throughput results, all less than 0.04 GFLOP/s. On the contrary, WPK2 produced good results. Although it never produced the highest throughput for any of the benchmark matrices, it did produce high throughput comparable to the second best performing scheme, CSR5. Overall, it produced the third best average performance after SURAA and CSR5 with small margins. SURAA achieved a higher throughput against WPK1 for all the 26 matrices except for three matrices,
amazon0312 (0.70×),
Andrews (0.74×), and
Thermomech_TK (0.96×). SURAA also achieved a higher throughput against WPK2 for all the 26 matrices except
Andrews (0.75×) and
thermal2 (0.96×). These four matrices belong to the set of six matrices where SURAA did not produce the best performance, and we have discussed the reasons for SURAA's lower performance on these matrices in the earlier paragraphs. Note that WPK2 failed to execute on ten matrices from the dataset, of which seven are from circuit simulation and directed graphs, and the remaining three are rectangular matrices. On inspecting the WPK1 and WPK2 GitHub Web Page [
87], we found that the developers have reported code error issues on the page as follows,
“trying to debug issues with large unsymmetric matrices—now mainly thrust errors and a few numerical errors to debug on certain matrices”. Note that, in computing speedups of SURAA against other schemes, we did not include the results of the matrices where WPK1 and WPK2 schemes failed to execute.
5.3.2. Aggregate Throughput and Speedup
Table 4 provides, for all eight schemes, statistical information such as the mean, standard deviation, variance, maximum, and minimum of the GFLOP/s metric of the SpMV kernel for the 26 benchmark matrices. The values are listed in descending order of the mean value. We observe that SURAA provides the highest mean GFLOP/s of 15.75. The next best scheme in terms of the mean throughput is CSR5, which has the second highest mean GFLOP/s, 14.28, followed by 13.65 for WPK2. In terms of the maximum throughput, SURAA again leads the table with the highest value of 36.7077 GFLOP/s, followed by SELL-P with the second best value of 26.8 GFLOP/s, followed by CSR5. Based on the mean values, WPK2 comes third after SURAA and CSR5. As noted earlier, WPK1 and WPK2 were unable to execute some of the matrices in the benchmark dataset and their results have not been included in computing the mean and the other quantities in the table.
In order to calculate the aggregate performance of SURAA, we calculate the aggregate speedup as in Equation (5), i.e., as the sum of the SURAA throughput (GFLOP/s) over all the matrices divided by the corresponding sum for the reference scheme.
Figure 8 indicates the aggregate GFLOP/s achieved by SURAA and the other tools. These GFLOP/s values are aggregated over all the matrices for each scheme and should not be read as the GFLOP/s of a single GPU at one time. The intention of providing the aggregate throughput is to highlight the aggregate performance gain or loss that could be achieved by using a single SpMV computation scheme for matrices from various domains. SURAA has the highest throughput compared to all the other seven techniques, i.e., CSR, HYB, BSR, CSR5, WPK1, WPK2, and SELL-P. For our benchmark suite of 26 matrices, we achieve an aggregate throughput of 409.7 GFLOP/s compared to the next best CSR5 (371.37 GFLOP/s), HYB (273.38 GFLOP/s), CSR (255.42 GFLOP/s), WPK2 (218.51 GFLOP/s), SELL-P (204.56 GFLOP/s), WPK1 (165.163 GFLOP/s), and BSR (70.751 GFLOP/s). The reason for the low aggregate throughput of WPK1 and WPK2 is that, as noted earlier, they were unable to execute some of the matrices in the benchmark dataset. This means that SURAA has an aggregate speedup of 1.1× compared to CSR5, 1.49× compared to HYB, 1.60× compared to CSR, 1.87× compared with WPK2, 2.0× compared to SELL-P, 2.47× compared to WPK1, and 5.79× compared to BSR for the given 26 matrices.
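As a quick worked check using the reported aggregate values, the aggregate speedup of SURAA over CSR5 is 409.7/371.37 ≈ 1.10× and over BSR it is 409.7/70.751 ≈ 5.79×, consistent with the figures quoted above.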
5.3.3. Effective Memory Bandwidth
Effective memory bandwidth utilization indicates the processor usage efficiency of the kernels [
88]. Specifically, effective memory bandwidth denotes the rate at which the bytes are read and written for a given kernel. The general formula to calculate the effective memory bandwidth was given in Equation (
4).
We did not find specific formulas in the literature for calculating the effective memory bandwidth for each of the schemes that we have considered in this paper. Therefore, we looked at each scheme and its memory requirements and devised equations to calculate the number of bytes read and written (see Equation (4)) based on their storage requirements. Equation (4) was then used to calculate the effective memory bandwidth in GB/s. Each scheme was executed 1000 times for each matrix in our dataset to get the average execution time for each of the 26 benchmark matrices. For further details on calculating effective memory bandwidth, see the Nvidia documentation at
https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/.
The formulas for the CSR and HYB schemes were reported in [58] as given below:
The effective memory bandwidth for the SELL-P scheme was reported in [75], but no formula was given. After consulting [74,75], we devised a formula to calculate the byte count for SELL-P in terms of the number of segments in the matrix (as determined by the SELL-P scheme), the fixed size of the segments, and the maximum number of nonzero elements in any one row of each segment. It was difficult for us to calculate these numbers exactly, and we therefore approximated the byte count.
The effective memory bandwidth for the CSR5 scheme has not been reported before. After consulting [79], where this scheme was first introduced, we devised a formula to calculate the byte count for CSR5 in terms of a parameter that determines the column width of the blocks in the CSR5 data structure; this parameter can assume one of three values (4, 32, or a third, matrix-dependent value) depending on the nonzero distribution of the matrix.
The effective memory bandwidth for WPK1 and WPK2 was reported by Wong et al. [78], but on a different GPU device (Nvidia GeForce GTX 480) and for different matrices, and they did not give formulas to calculate it. We examined their paper and derived a formula to calculate the byte count in terms of the number of warps in the matrix (as determined by the WPK1 and WPK2 schemes), the fixed warp size, and the maximum number of nonzero elements in any one row of each warp. It was difficult for us to calculate these numbers exactly, and we therefore approximated the byte count.
The effective memory bandwidth for the SURAA scheme is calculated using an analogous byte count formula for its CSR-like data structures.
Figure 9 plots the calculated effective memory bandwidth for double precision SpMV for CSR, HYB, SURAA, CSR5, SELL-P, WPK1, and WPK2. The maximum theoretical memory bandwidth of K20M is 208 GBytes/s. On average, SURAA delivers a bandwidth of 50.2% of the theoretical maximum as compared to 44.41%, 43.7%, 34.62%, 30.15%, 28.1%, and 24.02% of the theoretical maximum for CSR5, WPK2, HYB, CSR, WPK1, and SELL-P, respectively.
The highest effective memory bandwidths delivered by SURAA were 226.25 (
torso1), 164.03 (
TSOPF_RS_b2383), and 140.76 (
rail4284) GB/s which are 109%, 78.9%, and 67.7% of the peak, respectively. A higher bandwidth, more than the theoretical bandwidth, signifies an efficient use of the cache and shared memory for SURAA when executing the matrix
torso1. A bandwidth, for SpMV computations, higher than the peak memory bandwidth has also been reported earlier; see, e.g., [
58,
74]. The highest bandwidths obtained by the other techniques in descending order are: 160.75 GB/s (
ASIC_680ks) for SELL-P, 138.87 GB/s (
TSOPF_RS_b2383) for CSR5, 138.48 GB/s (
TSOPF_RS_b2383) for CSR, 137.55 GB/s (
TSOPF_RS_b2383) for WPK2, 112.7 GB/s (
thermal2) for HYB, and 108.39 GB/s (
RM07R) for WPK1. Clearly, SURAA provides significantly higher effective memory bandwidth over all the schemes.
5.4. SURAA: Comparative Performance against High npr Variance
Figure 10 illustrates the relationship between the throughput obtained by the seven schemes SURAA, CSR5, CSR, HYB, WPK2, SELL-P, and WPK1 and the variance of the number of non-zero elements per row for all the 26 matrices in our dataset. The symbols indicate the actual GFLOP/s obtained and the straight lines indicate the smoothed GFLOP/s. We observe that SURAA provides better throughput with the increase in the variance of the nonzero elements per row. On extrapolating the GFLOP/s, we observe that larger variations in the nonzero elements per row will yield higher GFLOP/s for SURAA compared to the next best CSR5, CSR, and HYB (note the upward slope for these schemes); this behavior is highly desirable because higher variations usually decrease SpMV performance. We also observe that larger variations decrease the performance of the WPK2, WPK1, and SELL-P techniques.
Similarly, Figure 11 depicts the speedup of SURAA against CSR5, HYB, SELL-P, WPK1, WPK2, and CSR with respect to the variance of the nonzero elements per row for all the 26 matrices in the benchmark suite. As in Figure 10, the symbols indicate the actual SURAA speedups obtained and the straight lines indicate the smoothed speedups. Note that the npr variance and the speedups are plotted on log10 scales. We observe in the figure that the speedup of SURAA over SELL-P, WPK1, CSR, HYB, WPK2, and CSR5 increases with the variance of the number of nonzeros per row. As seen and discussed before, this indicates an excellent behavior for a sparse matrix computation tool. The rate of increase in the speedup of SURAA against SELL-P, WPK1, and CSR is relatively high compared to the rest of the schemes, which makes the plots for HYB, WPK2, and CSR5 appear as straight lines. To allow a closer look at the SURAA performance, Figure 12 depicts the same graphs as Figure 11 but excludes the SELL-P, WPK1, and CSR data; the speedup scale is no longer log10. It can now be observed that the SURAA speedups against CSR5 and HYB also increase with increasing npr variance, although at a lower rate than against the other schemes. Among these, the highest rate of increase of the SURAA speedup is over the HYB scheme, followed by WPK2 and CSR5.
Figure 13 illustrates the GFLOP/s performance of BSR, SELL-P, WPK1, WPK2, CSR, HYB, CSR5, and SURAA, respectively, against the npr variance and the total number of nonzero elements, nnz.
Note the color legend in the figure, with colors ranging from blue to green, yellow, orange, and red, representing GFLOP/s in increasing order from near zero (SELL-P) to the maximum of 36.70 (SURAA). More yellow/orange/red means higher throughput; blue and green represent lower throughput.
BSR provides low performance for the majority of the cases, as shown by the blue shades that cover most of the figure. The highest GFLOP/s attained by BSR are 12.95 and 12.75, for the matrices nemeth23 and TSOPF_RS_b2383, which have a block structure.
SELL-P shows a performance in the range of 10 to 26.8 GFLOP/s for smaller nnz values and npr variance varying from low to medium. However, a higher npr variance results in poor performance, with GFLOP/s in the range of 0 to 5. We discussed this earlier while evaluating the individual performance of SELL-P. A look at Figure 6 and Figure 7 reveals that SELL-P mostly produced very low throughput for matrices with high npr variance, such as ASIC_680k.
WPK1 delivers relatively low performance over most of this range, as evident from the mostly green and blue colors, although some yellow patches, representing a higher throughput of 17.5 GFLOP/s, can be seen towards the lower and upper middle sections of the plot (medium nnz and low/high npr variance). WPK2 also delivers low performance over most of the plot. This can again be seen from the mostly green and blue colors, although the yellow patches are larger for WPK2 than for WPK1. This seems surprising at first, because WPK2 delivered a superior performance overall after SURAA and CSR5 (maximum throughput of 22.78; see Table 4). A closer look at Figure 6 and Figure 7 reveals that both WPK1 and WPK2 did not successfully execute some of the high npr matrices, including spal_004, 12month1, and rail4284.
CSR provides average performance (depicted by the yellow color; see the GFLOP/s legend) in two scenarios: (1) for matrices with low variance and low nnz, and (2) for medium variance and large nnz. For high variance scenarios, CSR provides low performance, indicated by the blue and green shaded regions in the figure. CSR performance for low variance scenarios is also poor. The blue color indicates that CSR provides low performance at high variance and high nnz.
For HYB, we note a general trend that the performance increases with increasing nnz and variance. However, we do note light green shades towards the right of the plot, which indicate slightly poorer performance. For the most part, the HYB performance is almost constant, within the range of 10 to 18 GFLOP/s.
CSR5 shows a general trend of increasing performance as nnz and the variance increase. It provides an average performance of 15 to 23 GFLOP/s, as seen from the yellowish shade in the figure. Low to average performance (5 to 15 GFLOP/s) can be observed in three scenarios: (1) high variance and medium to high nnz, (2) low nnz and medium variance, and (3) low variance and low to medium nnz.
Finally, SURAA shows an excellent behavior in that it delivers high performance as the npr variance and nnz increase. The red patch on the right of the plot shows the highest performance for the associated range of npr variance and nnz. SURAA achieves the highest GFLOP/s of 36.70 among all the schemes, as can also be seen in Figure 7.
5.5. SURAA: Parametric Configuration
The matrix spal_004 is a rectangular matrix with few rows and a large number of dense columns. It also has a fairly high npr. Due to this high density, it is highly suitable for dynamic parallelism. In the results presented in this paper, the parameter that controls the number of dynamically launched kernels is set to 8192 for spal_004. During the course of our experiments, we found that configuring this parameter to 2048, 4096, and 8192 varies the performance of the SURAA tool. The typical trend was that the highest value provided the best performance for spal_004 due to the increase in dynamic parallelism. To illustrate, spal_004 achieved 19.5 GFLOP/s in our experiments with 8192 kernels, while it provided 13.60 GFLOP/s when the parameter was set to 2048. This indicates that tweaking this parameter affects the performance of the SpMV computation. However, larger values of the parameter cause larger overheads due to the spawning of the dynamic kernels, and, therefore, it should be set to higher values only when necessary. In our benchmark dataset, 23 of the 26 matrices have, in total, no more than 2048 rows whose npr is greater than 32. The three remaining matrices are spal_004, rail4284, and 12month1, with spal_004 being the biggest exception, containing 8192 rows with more than 32 nonzero elements per row. Therefore, we have set the parameter to 2048 for all the matrices except spal_004. An automatic configuration of this parameter can be performed during the set-up phase by calculating the number of rows with npr greater than 32 and setting the parameter to an appropriate value. In future work, we will further investigate this behavior.
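The automatic configuration suggested above amounts to a simple counting step. The sketch below (hypothetical helper names, not part of the SURAA code base) counts the rows whose npr exceeds the 32-nonzero threshold and picks the smallest of the candidate values 2048, 4096, and 8192 that covers them.

// Hypothetical sketch of the set-up step: count long rows and choose the
// dynamic-parallelism parameter from the candidate values used in our experiments.
#include <algorithm>
#include <cstdio>
#include <vector>

int chooseDynamicRows(const std::vector<int>& npr, int threshold = 32) {
    // Number of rows that would be handed to the dynamic parallelism kernel.
    int longRows = static_cast<int>(
        std::count_if(npr.begin(), npr.end(),
                      [threshold](int n) { return n > threshold; }));

    const int kCandidates[] = {2048, 4096, 8192};
    for (int c : kCandidates)
        if (longRows <= c) return c;   // smallest candidate covering all long rows
    return kCandidates[2];             // cap at the largest value tried
}

int main() {
    // npr would normally be derived from the CSR row pointer array.
    std::vector<int> npr = {5, 40, 7, 3, 120, 33, 2, 64};
    std::printf("dynamic-parallelism parameter = %d\n", chooseDynamicRows(npr));
    return 0;
}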
5.6. Preprocessing Cost
The preprocessing time introduced by converting the input matrix from the base format (usually CSR/COO) to a more advanced format can be more than three or four orders of magnitude higher than the time required for executing a single sparse matrix-vector multiplication [77]. Applications that use the same sparse matrix for multiple SpMV operations, such as iterative linear solvers, amortize the preprocessing cost. However, when the number of iterations of the iterative solver is small, or when the matrix structure changes frequently, as in dynamic graph applications such as PageRank, the preprocessing cost can adversely affect the performance gains obtained by using a specific storage format.
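To make the amortization argument concrete, the following back-of-the-envelope sketch (all timings are placeholders, not measurements) estimates how many solver iterations are needed before a format conversion pays for itself, given the conversion time and the per-SpMV times in the base and converted formats.

// Sketch of the amortization reasoning: conversion overhead expressed as
// equivalent SpMV executions, and the break-even iteration count after which
// converting to the faster format is worthwhile. Placeholder numbers only.
#include <cstdio>

int main() {
    double t_convert   = 4.0e-2;   // format conversion time (s)
    double t_spmv_base = 1.5e-3;   // one SpMV in the base (CSR) format (s)
    double t_spmv_new  = 1.0e-3;   // one SpMV in the converted format (s)

    // Iterations after which (conversion + new-format SpMVs) beats staying in CSR.
    double break_even = t_convert / (t_spmv_base - t_spmv_new);
    std::printf("overhead = %.1f SpMVs, break-even after ~%.0f iterations\n",
                t_convert / t_spmv_new, break_even);
    return 0;
}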
The major operation involved in converting CSR to SURAA is sorting the rows by the number of nonzero elements in each row. The sorting in SURAA is by default performed on the GPU side, unlike the SELL-P and CSR5 conversion routines in the MAGMA library. Executing the conversion on the CPU is not necessarily slow; the time it takes depends upon the type of operations involved, and conversions involving auxiliary structures are difficult to implement on the device side. We use the radix sort-by-key algorithm provided by the Thrust library on the device side for fast sorting of the rows by their number of nonzero elements. The rest of the processing, such as the application of the FD rule and the grouping of rows for execution, is not time intensive.
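For illustration, the following sketch (assumed helper code, not the SURAA source) shows the two preprocessing steps just described: sorting row indices by their npr with thrust::sort_by_key on the device, and computing the Freedman–Diaconis bin width, 2·IQR·n^(−1/3), which guides how the sorted rows are segmented.

// Sketch of SURAA-style preprocessing with Thrust: device-side radix sort of
// row ids keyed by npr, followed by a crude Freedman-Diaconis bin-width estimate.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // npr per row, normally computed from the CSR row pointer array.
    std::vector<int> h_npr = {3, 180, 12, 3, 45, 7, 2, 33, 90, 5};
    int n = static_cast<int>(h_npr.size());

    thrust::device_vector<int> d_npr(h_npr.begin(), h_npr.end());
    thrust::device_vector<int> d_rowid(n);
    thrust::sequence(d_rowid.begin(), d_rowid.end());   // 0, 1, ..., n-1

    // Sort row ids by their npr (Thrust dispatches a device radix sort for ints).
    thrust::sort_by_key(d_npr.begin(), d_npr.end(), d_rowid.begin());

    // Freedman-Diaconis bin width on the sorted npr values (crude quartiles).
    std::vector<int> sorted(n);
    thrust::copy(d_npr.begin(), d_npr.end(), sorted.begin());
    double q1 = sorted[n / 4], q3 = sorted[(3 * n) / 4];
    double width = 2.0 * (q3 - q1) * std::pow(double(n), -1.0 / 3.0);
    std::printf("FD bin width (segment size in npr) = %.2f\n", width);
    return 0;
}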
In our experiments, for the purpose of a fair comparison, we assume that the initial format of all the matrices before conversion is CSR. The use of CSR/COO as an initial format for such data structures has been discussed in [85]. Since the conversion cost between COO and CSR is negligible and SURAA loads matrices in CSR by default, we selected CSR as the initial storage format. Moreover, our conversion time includes the sorting time (also required by SELL-P), the data transfer time from the host to the device, and the time for the other processing required for each format.
Figure 14 depicts the normalized overhead time for SELL-P, CSR5, and SURAA for our 26 benchmark matrices. The time is obtained empirically by executing the codes for SURAA, SELL-P, and CSR5. We can observe that SURAA has the lowest conversion overhead compared to SELL-P and CSR5. For a better understanding, we calculate the ratio of the measured conversion time to the time of a single SpMV execution (rounded to the nearest integer), which denotes the number of sparse matrix-vector operations that could have been performed in the overhead time.
Figure 15 presents the ratio of the format conversion overhead to the time for a single SpMV execution for SURAA, SELL-P, and CSR5. For the benchmark set of 26 matrices, the average ratios are 69, 5, and 96 for SELL-P, SURAA, and CSR5, respectively. This implies that, on average, 69, 5, and 96 SpMV executions could have been performed in the time taken for format conversion. CSR5 has a higher overhead than SELL-P when converting matrices from graph-based applications, especially circuit simulation matrices. The lowest ratio we obtained for SURAA is 0.26 (12month1), which rounds to 0, and the largest is 14 (wb-edu).
Figure 16 illustrates the absolute time taken by the device-based radix sort used in SURAA for sorting the matrix by the number of nonzero elements per row. We can observe that graph-based matrices such as FullChip, wb-edu, and Freescale take more time than the other matrices. The average time for sorting the tested matrices is less than 3 ms.
5.7. Sparse Iterative Solver Performance
Iterative solution of sparse linear equation systems is of prime importance due to its applications in various areas such as solving finite differences of PDEs, high accuracy surface modeling, finding steady-state probability vectors of Markov chains, web ranking, inventory control and manufacturing systems, queuing systems, fault modeling, and weather forecasting. This is a main focus of our research. Iterative methods converge at rates that depend on the properties of the matrices, and many matrices take more than 1000 iterations to converge. The per-iteration gains reported in this paper will therefore translate into much larger overall time savings for such solvers.
6. Conclusions
In this paper, we presented and evaluated SURAA, a new load-balanced scheme for SpMV storage and computation that employs a dynamic parallelism kernel for the segments with the longest rows, a CSR vector kernel for segments whose rows have 32 or more nonzeros per row, and a CSR scalar kernel for rows shorter than 32 nonzeros. To further balance the load and reduce idle threads, the threads are distributed adaptively to the rows based on their lengths. We also achieve coalesced memory access and accelerate SpMV computations for a range of matrices with diverse sparsity features. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the number of nonzero elements per row (npr) and forming segments of equal size (containing rows with approximately equal numbers of nonzero elements per row) using the Freedman–Diaconis (FD) rule. To the best of our knowledge, no other work has used the FD rule for SpMV computations the way we have, or used adaptive kernels the way we have in this paper.
We conducted experiments to assess the performance of SURAA on K20 GPUs and compared it with the best techniques from the CUSP (open source) and cuSPARSE (commercial) libraries. A benchmark dataset was formulated from widely used benchmarks, comprising 26 matrices from 13 diverse domains. SURAA was compared with a total of seven other SpMV storage and computation techniques using three widely used performance metrics: throughput (GFLOP/s), speedup, and effective memory bandwidth (GB/s).
SURAA outperformed, overall, all seven other SpMV computation methods, with average speedups ranging from 40× against SELL-P down to 1.15× against WPK2 and 1.1× against CSR5, and an average collective speedup of 13.99×. SURAA provided the best throughput for 20 out of the 26 benchmark matrices. The highest speedups achieved by SURAA against the other schemes for any single matrix ranged from 562× against SELL-P to 1.94× against CSR5. SURAA provided the highest average throughput of 15.75 GFLOP/s, followed by CSR5 with 14.28, and the highest single-matrix throughput of 36.70 GFLOP/s, followed by SELL-P with 26.8. SURAA also provided the highest effective memory bandwidth of 226.25 GB/s, compared to the second best of 160.75 GB/s by SELL-P. Moreover, SURAA has a much lower set-up overhead, equal to five SpMV execution times, compared to 69 and 95.5 SpMV executions for SELL-P and CSR5, respectively. The performance analysis of SURAA revealed that it consistently delivers the highest throughput among all the compared schemes as the npr variance and nnz increase, a property which is highly desirable.
While the speedup of SURAA against CSR5 and WPK2 was small, SURAA provided the highest throughput for 20 out of 26 matrices. More importantly, none of the schemes except SURAA delivered consistently high performance on all of the performance benchmarks. The SURAA code can be found in the GitHub repository at http://github.com/stormvirux/suraa.
In our future work, we shall strive to improve the various parameters used in our work, such as the number of segments and the number of child threads per parent kernel, and use larger datasets to gain a better understanding of the behavior of SURAA. We will also further optimize the scalar, vector, and dynamic kernels. We have made significant progress in addressing the major challenges of GPU SpMV computing, including coalesced memory access and load balancing, and future work will look into further improving our method with respect to these challenges. SURAA did not perform as well for matrices with low npr variance as it did for matrices with high npr variance; future work will also look into improving this behavior of the tool.