In this section, we present an in-depth exploration of the GPU-based implementation of the TCA-SpMM algorithm. We first outline the architectural considerations and optimization strategies employed in adapting the algorithm to the GPU environment, especially to Tensor Cores. Then, we delve into the techniques utilized to maximize Tensor Core utilization and achieve load balancing. Finally, we provide a comprehensive evaluation of the algorithm's computational and memory complexity on the GPU.
4.1. Design Overview of TCA-SpMM
In Algorithm 1, the innermost loop computes the dot product between a row vector from the sparse matrix S and a column vector from the dense matrix D, resulting in a single value of the output matrix O. Using the non-zero element values, S.val, and column indices, S.col_idx, from the CSR format, only the non-zero elements in S are multiplied by the corresponding elements in D. However, since TCs operate on sub-matrices (2D blocks) and perform GEMM operations, it is not possible to directly utilize TCs with the original SpMM algorithm in the CSR format.
Figure 2 illustrates our fundamental concept of adapting the SpMM algorithm to utilize TCs in the TCA-SpMM approach. This concept transforms the dot product problem, which uses vector–vector operations (BLAS-1), into a matrix–matrix multiplication problem, which uses matrix–matrix operations (BLAS-3). Initially, we convert the two vectors involved in the innermost loop in Algorithm 1, the sparse row vector of S and the dense column vector of D, into matrices. Specifically, we reformulate each dot product of two vectors as a smaller blocked matrix–matrix multiplication. Once the vectors are transformed into matrices, these transformed matrices are passed to the TCs to execute the accelerated GEMM operation. However, as shown in Figure 2a, the output matrix of this GEMM contains only intermediate partial results. Thus, following the GEMM operation, the diagonal elements of the output matrix must be accumulated to yield the final result, which corresponds to the dot product of the original row and column vectors.
For example, let us assume that we have two length-4 vectors d = (d1, d2, d3, d4) and s = (s1, s2, s3, s4). Then, we convert them into two 2 × 2 matrices D and S, filling each in a row-wise fashion, as follows:

D = [d1 d2; d3 d4],  S = [s1 s2; s3 s4].

To compute the dot product d · s using the transformed matrices D and S, we multiply D by the transpose of S, yielding O = D × Sᵀ. Each element of O is a sum of products of one row of D with one row of S; in particular, the diagonal elements are O_11 = d1 s1 + d2 s2 and O_22 = d3 s3 + d4 s4, i.e., the sums of element-wise products of corresponding elements in D and S. Since the dot product is a scalar, only these diagonal elements are of interest: summing them, O_11 + O_22 = d1 s1 + d2 s2 + d3 s3 + d4 s4, yields the same result as the original dot product of the vectors.
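As a minimal illustration of this reformulation (our own example, not code from the TCA-SpMM implementation), the following C++ sketch computes a dot product by matricizing both vectors row-wise and summing only the diagonal entries of the resulting small matrix product.

// Illustrative sketch: a dot product of two length R*C vectors computed as a
// small matrix-matrix product followed by a sum over the diagonal of the
// result, mirroring the reformulation shown in Figure 2a.
#include <cstdio>

template <int R, int C>
float dot_via_matmul(const float* d, const float* s) {
    // Matricize both vectors row-wise: D and S are R x C.
    // O = D * S^T is R x R; only its diagonal contributes to the dot product.
    float diag_sum = 0.0f;
    for (int i = 0; i < R; ++i) {
        float o_ii = 0.0f;
        for (int k = 0; k < C; ++k)
            o_ii += d[i * C + k] * s[i * C + k];  // (D * S^T)[i][i]
        diag_sum += o_ii;                          // accumulate the diagonal (trace)
    }
    return diag_sum;
}

int main() {
    float d[4] = {1, 2, 3, 4}, s[4] = {5, 6, 7, 8};
    // Expected: 1*5 + 2*6 + 3*7 + 4*8 = 70
    printf("%f\n", dot_via_matmul<2, 2>(d, s));
    return 0;
}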
As shown in Figure 2a, s and d represent a sparse row vector from matrix S and a dense column vector from matrix D, respectively. Since the element-wise product is zero when either element is zero, the dot product of s and d is identical to the dot product of the compressed vectors s′ and d′, where s′ contains only the non-zero elements from s, and d′ retains the corresponding elements from d with matching indices. With the compressed vectors s′ and d′, a single dot product operation can be reformulated as a matrix–matrix multiplication, with an additional step for aggregating the diagonal elements. While converting the dot product into a matrix–matrix multiplication, we reshape the vectors into operand matrices, with the LHS operand in a row-wise fashion and the RHS operand in a column-wise fashion. Hereafter, we refer to this transformation process as the matricization of vectors. The transformed matrix–matrix multiplication is performed using the MMA operation from the TCs, while the aggregation of diagonal elements, also known as the trace operation, is optimized through parallel reduction using warp shuffle. More specifically, because the most fine-grained shape for the half-precision MMA instruction on the TCs in the Ampere architecture is 16 × 8 × 8 [4], we leverage this by matricizing the sparse row vector into a column-wise RHS operand matrix and converting multiple dense column vectors into the LHS operand matrix, as described in Figure 2b. This approach enables the simultaneous computation of multiple output elements using the MMA instruction.
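For reference, the 16 × 8 × 8 half-precision MMA shape can be issued from CUDA C++ through the mma.sync PTX instruction, as in the sketch below. This is a generic illustration of the instruction for Ampere-class GPUs (compute capability 8.0 or higher), with operand packing following the PTX ISA; it is not an excerpt from the TCA-SpMM kernels.

// Sketch: one warp-wide 16x8x8 MMA (fp16 inputs, fp32 accumulate), sm_80+.
// Per thread, a[2] and b[1] each pack two __half values per 32-bit register
// (A holds 4 halves, B holds 2 halves), and acc[4] holds this thread's four
// fp32 accumulator lanes, as required by the m16n8k8 operand layout.
__device__ void mma_m16n8k8(float acc[4], const unsigned a[2], const unsigned b[1]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%0,%1,%2,%3};\n"
        : "+f"(acc[0]), "+f"(acc[1]), "+f"(acc[2]), "+f"(acc[3])
        : "r"(a[0]), "r"(a[1]), "r"(b[0]));
}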
Figure 3 describes the overall design of our parallelized SpMM implementation. After matricizing the vectors, as illustrated in
Figure 2, each CUDA thread block performs the computations required for either an entire row vector or a part of it, as described in
Figure 3. Since the number of non-zero elements varies across different rows, as illustrated in
Figure 3, we optimize our kernel so that each thread block processes multiple row vectors with fewer non-zero elements, thereby mitigating the load imbalance problem. As all elements in a single row vector of the output matrix are derived from the same sparse row vector, the thread block can share this sparse row vector as an operand for the dot product operation. Moreover, fetching the sparse row vector from global memory is straightforward when using the CSR format. Therefore, we store the sparse row vector in shared memory, allowing all warps within the thread block to access it with minimal memory transactions. Furthermore, since the same non-zero pattern of a sparse row vector is used to compute the corresponding elements across all column vectors in
D, we compress the column vectors from
D by referencing the same indices in
S.
col_idx. The set of compressed column vectors is managed in shared memory and referred to as the
compressed RHS buffer. Once these compressed column vectors are stored in shared memory, all warps within the thread block participate in computing different output elements. For example, when the 16 × 8 × 8 MMA operation is available, two such operations are required to compute an output element if the corresponding sparse row vector contains 128 non-zero elements. In this case, the accumulator fragment of the MMA operation is used to accumulate the results of the two MMA operations, allowing the final output element to be obtained with a single trace operation. Thus, the main challenges and considerations for achieving high performance in SpMM using TCs are as follows.
Since the total number of MMA operations required for SpMM in our approach is deterministic, maximizing the utilization of TCs is crucial for achieving high performance. Increasing arithmetic intensity is a key optimization strategy for improving TC utilization. In other words, maximizing the ratio of MMA operations to data movement is essential. However, minimizing data movement in SpMM poses significant challenges due to the irregular distribution of non-zero elements in the LHS sparse matrix, which results in irregular access patterns in the RHS dense matrix and complicates memory coalescing. To mitigate this issue, we leverage shared memory to optimize global memory access, thereby enhancing the efficiency of TCs.
The sparsity pattern of the sparse matrix can lead to significant variation in the number of non-zero elements across different rows, causing a load imbalance problem. For instance, as shown in
Figure 3, if the numbers of non-zero elements in the j-th row and the i-th row of the sparse matrix S are 0 and 128, respectively, the computation for the j-th row in the output matrix O can be skipped, while the
i-th row must undergo multiplication. In this scenario, parallelizing computations across rows, where each thread block performs a vector–vector multiplication, can result in unused thread blocks for rows containing only zero values. Launching such unused thread blocks leads to performance degradation, as other active thread blocks may be delayed, waiting to be scheduled on Streaming Multiprocessors (SMs). Additionally, if the number of non-zero elements in the
k-th row is 64, the thread block assigned to process the
k-th row performs fewer computations than the thread block processing the
i-th row. To address this load imbalance, we dynamically allocate a variable number of thread blocks to different rows, enabling finer-grained control and more efficient resource utilization.
4.2. Parallelization of TCA-SpMM
Algorithm 2 presents the pseudo-code for the global kernel in the TCA-SpMM, while Algorithms 3 and 4 provide the pseudo-codes for the device kernels, invoked within Algorithm 2, which execute the MMA operations based on the number of non-zero elements. To achieve a high degree of parallelism, TCA-SpMM parallelizes SpMM computations across the rows of the sparse matrix S, taking into account both the distribution of non-zero elements and the warp-level matrix multiply–accumulate (MMA) structure of the Tensor Cores. TCA-SpMM is designed around the 16 × 8 × 8 MMA shape, which requires a 16 × 8 LHS operand matrix for the dense matrix D and an 8 × 8 RHS operand matrix for the sparse matrix S. TCA-SpMM then determines how many rows of S each thread block processes, based on blocks of 64 (i.e., 8 × 8) non-zero elements that can be mapped onto the RHS operand matrix. If a row contains more than 64 non-zero elements, multiple thread blocks will process the row, as described in Algorithm 3. However, if a row contains 64 or fewer non-zero elements, a single thread block will process one or more rows, as described in Algorithm 4.
We assume that each CUDA thread block contains eight warps. In order to distribute SpMM computations appropriately, TCA-SpMM uses three preprocessed information sources for different warps, as shown in Algorithm 2:
row_id_for_each_warp,
num_warps_per_row, and
warpid_within_row. First of all, the array
row_id_for_each_warp specifies the row indices of the output matrix
O that each warp contributes to computing. For example, as shown in
Figure 3, since the
i-th row of the sparse input matrix contains 128 non-zero elements, two thread blocks are launched for the
i-th row to distribute the workload evenly across the blocks, thereby mitigating load imbalance. All warps in Thread Block 0 and Thread Block 1, as illustrated in
Figure 3, maintain the same row index
i for the per-warp row-index variable, as defined in line 5 of Algorithm 2. Secondly, the array
num_warps_per_row specifies the number of warps assigned to compute each row. For example, as shown in
Figure 3, there are 16 warps (16 warps = 8 warps × 2 thread blocks) for the
i-th row because both Thread Block 0 and 1 handle the same row (line 6 in Algorithm 2). Finally, the array
warpid_within_row provides identifiers (IDs) for each warp to manage the output elements that it is responsible for computing (line 7 in Algorithm 2). For example, using the array
warpid_within_row, the 16 distinct warps in both Thread Block 0 and Thread Block 1 can be identified by unique ID values ranging from 0 to 15. Based on the unique warp ID values, we divide the workload of computing the output values in the
i-th row of matrix
O between Thread Block 0 and Thread Block 1 by distributing the output elements assigned to each thread block (lines 8–14 in Algorithm 2).
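For illustration, the following host-side C++ sketch shows one way such mapping arrays could be constructed from the CSR row pointer. The helper warps_for_row(), the assumption of eight warps per thread block, and the rule of eight warps per 64-non-zero chunk for long rows are our own simplifications, consistent with the Figure 3 example and the warp-assignment rule given in Section 4.2.2, rather than the exact TCA-SpMM preprocessing code.

// Sketch of host-side preprocessing that fills the per-warp mapping arrays
// used by Algorithm 2 (assumed layout; 8 warps per thread block).
#include <vector>

static int warps_for_row(int nnz) {                // warp-assignment rule (Section 4.2.2)
    if (nnz == 0)  return 0;                        // rows with no non-zeros launch no warps
    if (nnz <= 8)  return 1;
    if (nnz <= 16) return 2;
    if (nnz <= 32) return 4;
    if (nnz <= 64) return 8;
    return 8 * ((nnz + 63) / 64);                   // >64 nnz: one 8-warp block per 64-nnz chunk
}

void build_warp_mapping(const std::vector<int>& row_ptr,
                        std::vector<int>& row_id_for_each_warp,
                        std::vector<int>& num_warps_per_row,
                        std::vector<int>& warpid_within_row) {
    for (size_t row = 0; row + 1 < row_ptr.size(); ++row) {
        int nnz   = row_ptr[row + 1] - row_ptr[row];
        int warps = warps_for_row(nnz);
        for (int w = 0; w < warps; ++w) {           // one entry per launched warp
            row_id_for_each_warp.push_back(static_cast<int>(row));
            num_warps_per_row.push_back(warps);
            warpid_within_row.push_back(w);
        }
    }
    // A launch would then use (total warps / 8) thread blocks of 8 warps each;
    // short rows sharing a block must be grouped so their warps fill a block.
}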
Algorithm 2: Parallel implementation of TCA-SpMM on GPUs
As shown in the top right of
Figure 3, TCA-SpMM exploits shared memory to reuse and share data within a thread block. After allocating the shared memory resources for the three components, namely the (1) sparse row scratch pad, (2) compressed RHS buffer, and (3) output buffer (lines 16–18 in Algorithm 2), the shared memory is not further divided among different warps in this case, because the per-warp shared-memory offset is always 0 (lines 16 and 19–20), as shown in
Figure 3. Thereafter, Thread Block 0 and Thread Block 1 are directed to the
__device__ function multiblock_per_row() to perform accelerated matrix multiplication for the
i-th row, which requires multiple thread blocks to handle the large workload (Algorithm 3).
After distributing the workloads required to obtain the output elements in the i-th row of O across multiple thread blocks, Algorithm 3 details how these output results are computed. Since the total number of operand elements processed at once by each MMA operation is set to 16 × 8 × 8, our implementation pads the remaining portions of the row vectors of S and the collections of compressed column vectors of D with zero values in the MMA operand when the number of non-zero elements in these vectors is not divisible by the fragment size of 64.
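The padding step can be pictured with the following device-side sketch, written under our own assumptions about buffer layout (it is not the actual Algorithm 3 code): one warp stages a 64-element chunk of a sparse row into a fixed-size buffer and fills any positions beyond the row's remaining non-zeros with zeros before the chunk is used as an MMA operand.

// Sketch: stage one 64-element chunk of a CSR row into a zero-padded buffer.
// row_vals points to this row's values within S.val; chunk_start is the offset
// of the current 64-element chunk; chunk_vals holds one 8x8 RHS operand.
#include <cuda_fp16.h>

__device__ void load_padded_chunk(const __half* row_vals, int nnz_in_row,
                                  int chunk_start, __half* chunk_vals) {
    int lane = threadIdx.x % 32;                    // one warp fills the chunk
    for (int j = lane; j < 64; j += 32) {
        int src = chunk_start + j;
        // Pad with zeros past the last non-zero of this row.
        chunk_vals[j] = (src < nnz_in_row) ? row_vals[src] : __float2half(0.0f);
    }
}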
Algorithm 3: multiblock_per_row() device kernel on GPUs
From lines 5 to 27 in Algorithm 3, each thread block performs tiled matrix multiplication to compute its output elements in parallel. The tilewidth specifies how many operands are computed in the outermost tiled loop (line 5). To minimize the number of aggregations required for the trace operation, we declare multiple output fragments, which serve as different accumulators for the MMA operations. By distributing the output tiles across warps, the number of output fragments per warp is determined (line 2). Using multiple output fragments, lines 6 to 17 describe how our parallel implementation efficiently utilizes MMA operations for the reformulated computation. Specifically, we first initialize all output fragments to zero (line 6) and then apply the sequential computation, dividing the non-zero elements in the sparse row vector into fragments of size 64 (lines 7–8). In order to fill the RHS_buffer in shared memory with the corresponding compressed column vectors of D, we use S.col_idx to identify the set of column vectors corresponding to the current output tile (lines 10–12).
Moreover, the transpose is applied to the set of compressed column vectors to store their elements in a row-wise fashion (line 11). After successfully fetching the elements for the reformulated computation, we matricize the sparse row vector in a column-wise fashion (line 13) and the subset of compressed column vectors in RHS_buffer in a row-wise fashion (line 16), storing these matrices in the RHS and LHS operand fragments, respectively. Then, we perform the MMA operation (line 17), with all compressed column vectors used as operands during the iteration (lines 14–17). Lastly, we perform two trace operations on each output fragment using warp shuffle, since it is possible to obtain two different output values from each 16 × 8 × 8 MMA operation (lines 18–20).
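The trace step can be realized with a standard warp-shuffle reduction. The sketch below is a generic illustration under the assumption that each thread contributes one partial diagonal sum from its accumulator registers; it is not the literal code of lines 18–20.

// Sketch: warp-level reduction that sums per-thread partial diagonal values,
// e.g., the diagonal elements of an output fragment distributed across a warp.
__device__ float warp_trace_sum(float my_partial) {
    unsigned mask = 0xffffffffu;                     // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)  // tree reduction via shuffles
        my_partial += __shfl_down_sync(mask, my_partial, offset);
    return my_partial;                               // lane 0 holds the full sum
}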
Algorithm 4: multirow_per_block() device kernel on GPUs
4.2.1. Maximizing Tensor Core Utilization
One of the most significant factors affecting the performance of our TCA-SpMM implementation is the method used for fetching compressed column vectors of
D, particularly due to strided memory access patterns. Fetching a compressed column vector directly from global memory using a warp results in uncoalesced memory access, leading to an increased number of memory transactions. It is evident that the efficiency of TCs is closely tied to the cost of data movement. Therefore, to maximize TC utilization, we employ the RHS_buffer located in shared memory, corresponding to the
compressed RHS buffer depicted in
Figure 3. This approach enables cooperative memory fetches of compressed column vectors using the warps within a thread block (line 11 in Algorithm 3).
Since the compressed column vectors pass through shared memory, a warp is able to access global memory contiguously by fetching certain parts of different compressed column vectors, while other warps access the remaining parts. Furthermore, since the multiple threads within a warp can only obtain a few output values, directly storing the output results in global memory limits write instruction efficiency. Hence, we use the output_buffer residing in shared memory to temporarily collect multiple output tiles and then flush them into global memory to achieve higher global memory write efficiency (lines 19–27 in Algorithm 3). Moreover, to increase the number of elements accessed by each load/store memory instruction, we explicitly align the address of every memory access by controlling the buffer offsets and the tilewidth, using PTX instructions such as ld.global.b128 [28].
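In CUDA C++, such aligned 128-bit accesses are typically obtained with vector types such as float4, which the compiler lowers to single 128-bit load/store instructions (the ld.global.b128 form in PTX) when the addresses are 16-byte aligned. The following sketch, with illustrative names and assuming proper alignment, shows a warp cooperatively staging data into a shared-memory buffer with such accesses; it is not taken from the TCA-SpMM source.

// Sketch: a warp cooperatively copies 'count' float4 packets from global
// memory into a staging buffer (e.g., in shared memory), so each assignment
// is a single 128-bit load and store when the pointers are 16-byte aligned.
__device__ void copy_vectorized(const float4* __restrict__ src,
                                float4* __restrict__ dst, int count) {
    int lane = threadIdx.x % 32;
    for (int i = lane; i < count; i += 32)
        dst[i] = src[i];   // one 128-bit load and one 128-bit store per packet
}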
4.2.2. Achieving Load Balancing
In SpMM computation, workload imbalance occurs due to the uneven distribution of non-zero elements across rows. To address this, the TCA-SpMM method parallelizes the computation across rows to improve load balancing. Specifically, the number of rows assigned to each thread block is determined by the number of non-zero values in those rows.
Figure 4 presents implementation details for TCA-SpMM. The top section of
Figure 4 shows how to distribute six rows to two thread blocks. Let us assume that Thread Block 0 handles five rows, from row 0 to row 4, and Thread Block 1 handles one row, row 5. As shown in
Figure 4, row 5 has 64 non-zero elements (64 = row_ptr[6] − row_ptr[5]) based on the values in
row_ptr. In this case, the load imbalance problem does not occur when executing Algorithm 3, because it is enough to generate two different output values from a single output fragment, as described in the rightmost MMA fragment layout in
Figure 4, by employing the 16 × 8 × 8 MMA operation from Thread Block 1. However, since there are eight non-zero elements in row index 0, obtaining only two output values from a single output fragment is inefficient. In our approach, as the 16 compressed column vectors can be maintained in a single fragment, one MMA operation is able to compute 16 output elements for row 0 in parallel, as illustrated in the bottom-left MMA fragment layout in
Figure 4. In addition, for row 4, which has 32 non-zero elements, it is possible to obtain four different output values from the MMA operation by applying the
trace operation to the finer-grained sub-matrices of the output fragment. Likewise, eight output values can be computed from the MMA operation for row 1. The number of non-zero elements need not be exactly divisible by 8, 16, 32, or 64, since the remaining part can be padded with zero values, as shown in the cases of row 1 and row 2.
However, applying different optimization schemes for different rows using the same number of warps may cause an additional workload imbalance problem. For example, as described in
Figure 4, we can yield sixteen different output values from a single output fragment in row 0, while we can yield only four different output values from a single output fragment in row 4. Therefore, the total number of MMA operations required to complete the computations for all output values in row 0 differs from that of row 4. Assuming that we assign only one warp to each output row's computation, the warp assigned to row 0 requires ⌈N/16⌉ iterative MMA operations, while the warp assigned to row 4 requires ⌈N/4⌉, where
N stands for the number of column vectors of the output matrix, which is equivalent to the number of output values in a single row. Our fundamental solution to address this situation is to enable a thread block to compute multiple output rows by assigning a number of warps proportional to the number of non-zero elements. After balancing the workloads, we change the configuration of fragments to perform the dot product by placing additional compressed column vectors of
D to maximize the number of output element values produced. Our TCA-SpMM assigns a single warp to process rows with 8 or fewer non-zero elements, doubles the number of warps for rows with 9 to 16 non-zero elements, and doubles it again for rows with 17 to 32 non-zero elements. Finally, eight warps are needed for rows with 33 to 64 non-zero elements. To assign a variable number of warps to different rows, we distribute shared memory resources to regularize the behavior of each warp, as shown in the middle of
Figure 4. Since we set the number of warps based on the number of non-zero elements in the rows, the scratch pad (
sparse row) is distributed in proportion to the number of warps. For example, the memory space required to store all non-zero elements in row index 4 is four times larger than that of row 0; therefore, we assign four times as many warps to row 4 compared to row 0. Similarly, the memory space required to store all elements from the compressed column vectors corresponding to the tile is proportional to the number of assigned warps. This warp assignment scheme also applies to the output buffer, which is designed for efficient writing to global memory.
Algorithm 4 shows the pseudo-code for the device kernel that processes rows with 64 or fewer non-zero elements, where a single thread block performs computations for multiple rows to achieve better load balancing. In Algorithm 4, only the non-zero elements are stored in shared memory (line 5), while zero values are used to fill the positions beyond the range of the non-zero elements in the given sparse row (line 7). For the padded positions of the column indices, we store a column index already referenced by the row instead of 0, to minimize data movement when fetching the set of compressed column vectors from D. Since the number of non-zero elements does not always exceed 8, we use the sparse row scratch pad as the invariant MMA operand that holds the non-zero elements in the sparse row (line 10) and maintain a separate buffer for the corresponding column indices. After computing the number of output values produced by a single MMA operation (line 11), the tiled multiplication for the output row is performed in a manner similar to Algorithm 3. The major change in Algorithm 4 is that the matricization performed in line 16 of Algorithm 3 is now carried out in line 16 of Algorithm 4, allowing the LHS fragment to hold as many different compressed column vectors from D as the number of output values produced per MMA operation. As the number of output values is identical to the number of compressed column vectors held in the LHS fragment, the same number of trace operations is performed over the corresponding sub-matrices of the output fragment. However, it is possible to perform these trace operations simultaneously using warp shuffle instructions, provided that the communication offsets in the warp shuffle are judiciously controlled (line 20). Finally, output_buffer is used to temporarily store the resulting output values, which are then flushed into global memory to reduce the number of global memory write operations.
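One way to carry out several of these small reductions at once within a warp is to limit the shuffle width so that lanes communicate only inside their own segment. The sketch below illustrates this idea with __shfl_down_sync and a segment-sized width argument; the function name and interface are our own, not the literal line 20 of Algorithm 4.

// Sketch: sum values within independent lane segments of one warp.
// seg_size lanes (a power of two, e.g., 2, 4, 8, or 16) form one segment, so
// a single warp reduces 32 / seg_size partial dot products simultaneously.
__device__ float segmented_shuffle_sum(float val, int seg_size) {
    unsigned mask = 0xffffffffu;
    for (int offset = seg_size / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(mask, val, offset, seg_size);  // width-limited shuffle
    return val;  // the first lane of each segment holds its segment's sum
}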
4.3. Detailed Complexity Analysis of TCA-SpMM
A single MMA instruction executes multiple 4 × 4 × 4 matrix multiplications to leverage the high degree of parallelism and throughput of TCs. In this subsection, an MMA instruction is defined as a single instruction, denoted as m × n × k, where m, n, and k denote the size of the row in the left fragment, the size of the column in the right fragment, and the size of the column/row in the left/right fragment, respectively. For an incremental analysis, we first assume that every sparse row vector contains more than n × k non-zero elements. The value n × k represents the size of the right fragment used in a single MMA instruction, which corresponds to the number of non-zero elements from the sparse row vector that are processed simultaneously via one MMA instruction in our approach. As the number of MMA instructions required for our TCA-SpMM is associated with the number of non-zero elements in the sparse matrix, the total number of MMA instructions required for our TCA-SpMM with TCs is

Total_MMA = Σ_{i=1}^{M} ⌈ nnz_i / (n × k) ⌉ × ⌈ N / (m / n) ⌉,        (1)
where M, N, and K denote the number of rows of the LHS matrix, the number of columns of the RHS matrix, and the number of columns/rows of the LHS/RHS matrix, respectively. In addition, nnz_i, nnz, and s denote the number of non-zero elements in the i-th row, the total number of non-zero elements in the matrix (i.e., nnz = Σ_{i=1}^{M} nnz_i), and the sparsity of the matrix, respectively. As the non-zero elements in a specific row are divided into blocks of n × k and these blocks are then applied to MMA instructions (m × n × k) in our approach, ⌈ nnz_i / (n × k) ⌉ MMA instructions are required to complete the computation for a single output value in the i-th row, which is maintained in a single output fragment. On the other hand, if the number of rows in the output fragment, m, is divisible by the number of columns, n, more than one output value can be obtained using a single output fragment by maintaining multiple column vectors of the dense input matrix in the left fragment. For example, assuming that m = 16 and n = 8, two output values are obtainable from a single output fragment by maintaining one column vector input in the top half of the left fragment, while maintaining another column vector in the bottom half of the left fragment. This implies that the number of output values that we can obtain from a single output fragment is m / n. Thus, to compute all the output values in a specific row, ⌈ N / (m / n) ⌉ output fragments need to be maintained. By multiplying the number of MMA instructions required to complete the computation for a single output fragment by the number of output fragments needed to maintain all output values in a specific row, the total number of MMA instructions required to compute all output values in that row can be derived. Finally, the total number of MMA instructions required to compute every entry of the output matrix can be found by summing the number of MMA instructions needed to compute all output values in each row, as organized in Equation (1). Since s represents the sparsity of the LHS matrix, the matrix density, defined as the ratio of the non-zero elements to the total number of elements, can be expressed as 1 − s. Therefore, with the total number of elements in the LHS matrix being M × K, we can utilize the relationship between M, K, and nnz as nnz = (1 − s) × M × K in Equation (1). Equation (1) demonstrates that our TCA-SpMM can effectively reduce the total number of operations required for the computation as the sparsity grows. For example, when the sparsity value s changes from 0.5 to 0.9, the total number of 4 × 4 × 4 matrix multiplications is reduced by a factor of 5, because (1 − 0.5) / (1 − 0.9) = 5.
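For a concrete check of Equation (1), the following host-side C++ sketch counts the predicted number of MMA instructions directly from a CSR row pointer array; the function name and default 16 × 8 × 8 shape parameters are illustrative assumptions, and the count corresponds to the baseline analysis without the per-row load balancing described next.

// Sketch: predicted number of m x n x k MMA instructions per Equation (1),
// assuming every non-empty row uses the baseline scheme (no load balancing).
#include <cstdint>
#include <vector>

int64_t predicted_mma_count(const std::vector<int>& row_ptr, int N,
                            int m = 16, int n = 8, int k = 8) {
    const int block = n * k;           // non-zeros consumed per MMA (64 for 16x8x8)
    const int outs_per_frag = m / n;   // output values per output fragment (2)
    int64_t total = 0;
    for (size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        int nnz_i = row_ptr[i + 1] - row_ptr[i];
        if (nnz_i == 0) continue;      // rows without non-zeros are skipped
        int64_t mma_per_frag = (nnz_i + block - 1) / block;             // ceil(nnz_i / (n*k))
        int64_t num_frags    = (N + outs_per_frag - 1) / outs_per_frag; // ceil(N / (m/n))
        total += mma_per_frag * num_frags;
    }
    return total;
}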
To achieve load balancing across rows, when m = 16, n = 8, and k = 8, the number of MMA instructions (the number of 16 × 8 × 8 MMA operations) required to compute the i-th row of the resulting output matrix O, as a function of the number of non-zero elements in the i-th row (nnz_i), is as follows:

Total_MMA_i = ⌈ N / 16 ⌉ if 0 < nnz_i ≤ 8;  ⌈ N / 8 ⌉ if 8 < nnz_i ≤ 16;  ⌈ N / 4 ⌉ if 16 < nnz_i ≤ 32;  ⌈ N / 2 ⌉ if 32 < nnz_i ≤ 64;  ⌈ nnz_i / 64 ⌉ × ⌈ N / 2 ⌉ if nnz_i > 64.
Since the number of non-zero elements varies across different rows, the number of output values produced by a single MMA instruction can differ in our TCA-SpMM. Therefore, the number of MMA instructions required to compute each output row varies depending on the number of non-zero elements in the corresponding row of the input sparse matrix S. By assigning a number of warps proportional to the number of non-zero elements, our TCA-SpMM is able to evenly distribute the workload required for computing multiple rows across the warps in a thread block.