1. Introduction
Dense linear algebra libraries lay the foundation for scientific and engineering computation. The Basic Linear Algebra Subprograms (BLAS) define a collection of routines that act as basic building blocks for dense linear algebra operations. Because the BLAS APIs are so widely used, processor vendors often provide BLAS implementations that are highly optimized for their processors, e.g., Intel MKL, AMD ACML and NVIDIA cuBLAS. The High-Performance Computing (HPC) community has also contributed several high-quality open-source BLAS implementations such as ATLAS [1], GotoBLAS [2], OpenBLAS [3] and BLIS [4].
The BLAS routines are categorized into three levels: level-1 for vector-vector operations, level-2 for matrix-vector operations, and level-3 for matrix-matrix operations. The three levels have different computational and memory-access complexity. Specifically, for problem size $n$, the computational and memory-access complexities are $O(n)$ and $O(n)$ for level-1, $O(n^2)$ and $O(n^2)$ for level-2, and $O(n^3)$ and $O(n^2)$ for level-3, respectively. Among the three levels, level-3 provides the most opportunities for optimization because it performs $O(n^3)$ computation while accessing only $O(n^2)$ memory.
Among all level-3 operations, GEneral Matrix Multiply (GEMM) is of the most interest because the other level-3 operations can be defined in terms of GEMM plus a small amount of level-1 and level-2 operations [5]. As a consequence, the research community has devoted considerable effort to optimizing GEMM for different architectures [3,6,7,8,9,10].
To fully exploit modern multi-core and many-core processors, GEMM is often parallelized with threading techniques such as OpenMP and pthreads. Since the emergence of multi-core processors, the simplest parallelization strategy has been to partition the whole workload equally among all threads. This strategy works well on homogeneous processors because all threads run at roughly the same speed, so an equalized workload partition leads to balanced utilization of processor cores. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures, where processor cores experience different memory latencies when accessing different memory nodes. For GEMM, the matrix data are distributed across all memory nodes to maximize memory bandwidth and balance memory traffic. As a result, threads on different cores may run at different speeds due to the NUMA effect. GEMM performance suffers from this variation in thread speed because the overall execution time is determined by the slowest thread.
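As a concrete illustration of distributing matrix data across memory nodes, the pages of a matrix buffer can be spread round-robin over all NUMA nodes with an interleaved allocation policy. The sketch below is a minimal example using the standard libnuma API; it is not taken from any particular BLAS implementation.

    #include <numa.h>      /* libnuma: numa_available, numa_alloc_interleaved, numa_free */
    #include <stdio.h>
    #include <stdlib.h>

    /* Allocate an m x n double matrix whose pages are interleaved
     * across all NUMA nodes, spreading memory traffic over all MCUs. */
    static double *alloc_interleaved_matrix(size_t m, size_t n)
    {
        size_t bytes = m * n * sizeof(double);
        if (numa_available() < 0)            /* fall back on non-NUMA systems */
            return malloc(bytes);
        return numa_alloc_interleaved(bytes);
    }

    int main(void)
    {
        size_t m = 4096, n = 4096;
        double *A = alloc_interleaved_matrix(m, n);
        if (!A) { perror("alloc"); return 1; }
        /* ... fill and use A; a core may touch pages on remote nodes,
         * which is exactly the latency variation described above ... */
        if (numa_available() >= 0) numa_free(A, m * n * sizeof(double));
        else free(A);
        return 0;
    }

With such an allocation, bandwidth is balanced, but each thread sees a mix of local and remote accesses, which is the source of the speed variation discussed above.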
In recent years, processor vendors have been putting more and more cores into a single processor. To provide sufficient memory capacity and bandwidth for these cores, recent high-end servers and HPC nodes can have 16 or more memory chips installed on a single board. As the number of memory chips grows, more Memory Controller Units (MCUs) are needed to manage the memory, and future architectures will have more NUMA nodes. The equalized workload partitioning technique used in current BLAS implementations is not sufficient to achieve optimal performance on such large NUMA systems.
In this paper, we present a hybrid-grained dynamic load-balancing method to reduce the penalty caused by the NUMA effect. The key idea is to allow fast threads to steal work from slow ones. Our approach is based on the work-stealing algorithm, but with several improvements specifically designed for GEMM.
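To convey the general idea of letting fast threads steal work from slow ones, the sketch below shows a generic work-stealing loop over fine-grained GEMM tiles. The names (chunk_t, next_tile, run_tile) are hypothetical and this is not the paper's actual implementation, only the baseline algorithm our method builds on.

    #include <stdatomic.h>

    /* One shared counter per task chunk; each tile is one fine-grained
     * unit of GEMM work (e.g., a block of the m-loop). */
    typedef struct {
        atomic_int next_tile;   /* index of the next unclaimed tile   */
        int        tile_count;  /* total number of tiles in the chunk */
    } chunk_t;

    /* Executed by every thread: first drain the thread's own chunk,
     * then steal remaining tiles from other threads' chunks. */
    static void worker(chunk_t *chunks, int nthreads, int my_id,
                       void (*run_tile)(int chunk, int tile))
    {
        /* 1. own work */
        for (;;) {
            int t = atomic_fetch_add(&chunks[my_id].next_tile, 1);
            if (t >= chunks[my_id].tile_count) break;
            run_tile(my_id, t);
        }
        /* 2. steal from threads that are still behind */
        for (int v = 1; v < nthreads; v++) {
            int victim = (my_id + v) % nthreads;
            for (;;) {
                int t = atomic_fetch_add(&chunks[victim].next_tile, 1);
                if (t >= chunks[victim].tile_count) break;
                run_tile(victim, t);
            }
        }
    }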
The main contributions are as follows:
We are the first to address the GEMM performance problem on NUMA architectures.
We propose a dynamic load-balancing method to reduce the penalty of the NUMA effect.
We implemented the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm's AArch64 architecture. Results show that, with our method applied, synchronization overhead is reduced by 51.5% and GEMM performance is improved by 1.9%.
The rest of the paper is organized as follows. Section 2 introduces the GEMM program and current parallelization techniques. Section 3 describes the proposed dynamic load-balancing method. Section 4 presents and analyzes the evaluation results. Section 5 reviews the related work. Finally, Section 6 concludes.
2. Background
GEMM performs a matrix-multiply-accumulate operation, denoted as $C \leftarrow \alpha AB + \beta C$, where $A$, $B$ and $C$ are matrices of shape $M \times K$, $K \times N$ and $M \times N$, respectively, and $\alpha$ and $\beta$ are scalars. While GEMM is algorithmically simple, so that a 3-deep loop nest suffices to accomplish the computation, a high-performance implementation usually uses a blocked algorithm due to the sophisticated memory hierarchies of modern processors.
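For reference, the naive 3-deep loop nest reads as follows; this is only the textbook formulation of $C \leftarrow \alpha AB + \beta C$ (row-major storage assumed), not an optimized implementation.

    /* Naive GEMM: C = alpha*A*B + beta*C, with A (MxK), B (KxN), C (MxN),
     * all stored row-major with leading dimensions lda, ldb, ldc. */
    void gemm_naive(int M, int N, int K, double alpha, double beta,
                    const double *A, int lda, const double *B, int ldb,
                    double *C, int ldc)
    {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                double acc = 0.0;
                for (int k = 0; k < K; k++)
                    acc += A[i * lda + k] * B[k * ldb + j];
                C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
            }
        }
    }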
Listing 1 shows the blocked algorithm for GEMM. Each loop in the original 3-deep loop nest is blocked, resulting in a total of six loops (iterated with variables k, n, m, nn, mm and kk in Listing 1). This blocked algorithm can be viewed as two blocking layers. The first blocking layer consists of the outer three loops and the second blocking layer is formed by the inner three.
Figure 1 shows the structure of the blocking layers. Figure 1a shows blocking layer 1. The n-loop (line 5) and m-loop (line 8) are presented, i.e., only one iteration of the outer-most k-loop (line 2) is drawn. The panels $A[0:M][k:k+kc]$ and $B[k:k+kc][0:N]$ (line 4) are referred to as $A_k$ and $B_k$ for the sake of brevity. In blocking layer 1, the matrices A, B and C are blocked into $m_c \times k_c$, $k_c \times n_c$ and $m_c \times n_c$ sub-matrices, denoted as $A_p$ (line 10), $B_p$ (line 7) and $C_b$ (line 11), respectively. Figure 1b shows blocking layer 2. The nn-loop (line 13) and mm-loop (line 16) are presented, and the inner-most kk-loop (line 21) is represented by a single task (drawn in gray). In blocking layer 2, $A_p$, $B_p$ and $C_b$ are further blocked into $m_r \times k_c$, $k_c \times n_r$ and $m_r \times n_r$ sub-matrices, denoted as $A_r$ (line 18), $B_r$ (line 15) and $C_r$ (line 19), respectively. Note that $A_p$ and $B_p$ are obtained by packing blocks of $A_k$ (line 10) and $B_k$ (line 7) into a special memory layout that guarantees contiguous memory access during the GEMM computation, as shown by the polylines in Figure 1b.
Listing 1: GEMM Blocked Algorithm

     1  // blocked GEMM: C = alpha*A*B + beta*C (scaling by alpha and beta omitted)
     2  for (int k = 0; k < K; k += KC) {
     3      kc = min(K - k, KC);
     4      // A_k = A[0:M][k:k+kc], B_k = B[k:k+kc][0:N]  (blocking layer 1)
     5      for (int n = 0; n < N; n += NC) {
     6          nc = min(N - n, NC);
     7          B_p = pack(B_k[0:kc][n:n+nc]);
     8          for (int m = 0; m < M; m += MC) {
     9              mc = min(M - m, MC);
    10              A_p = pack(A_k[m:m+mc][0:kc]);
    11              C_b = C[m:m+mc][n:n+nc];
    12              // (blocking layer 2)
    13              for (int nn = 0; nn < nc; nn += NR) {
    14                  nr = min(nc - nn, NR);
    15                  B_r = B_p[0:kc][nn:nn+nr];
    16                  for (int mm = 0; mm < mc; mm += MR) {
    17                      mr = min(mc - mm, MR);
    18                      A_r = A_p[mm:mm+mr][0:kc];
    19                      C_r = C_b[mm:mm+mr][nn:nn+nr];
    20                      // C_r += A_r * B_r
    21                      for (int kk = 0; kk < kc; kk++) {
    22                          C_r += A_r[0:mr][kk] * B_r[kk][0:nr];
    23                      }
    24                  }
    25              }
    26          }
    27      }
    28  }
The blocking factors $k_c$, $n_c$, $m_c$, $n_r$ and $m_r$ (written as KC, NC, MC, NR and MR in Listing 1) are carefully selected so that the sub-matrices $A_p$, $B_p$, $C_b$, $A_r$, $B_r$ and $C_r$ fit into a certain level of the memory hierarchy, with the following constraints:

$(m_r n_r + m_r + n_r) \cdot S_e \le C_0 / N_t$  (1)
$(k_c n_r + 2 m_r k_c) \cdot S_e \le C_1 / N_t$  (2)
$(m_c k_c + 2 k_c n_r) \cdot S_e \le C_2 / N_t$  (3)
$(k_c n_c + m_c k_c) \cdot S_e \le C_3 / N_t$  (4)

where $S_e$ denotes the size of a matrix element, e.g., 8 B for a double-precision floating-point number, $N_t$ denotes the number of threads, and $C_l$ denotes the total size of all caches on level $l$. The register file is viewed as a pseudo cache on level 0. By (1), $m_r$ and $n_r$ are so constrained that $m_r$ elements from $A_r$, $n_r$ elements from $B_r$ and $C_r$ ($m_r \times n_r$) fit into the registers (the pseudo level-0 cache). By (2), $k_c$ is so constrained that $B_r$ ($k_c \times n_r$) and two $A_r$s ($m_r \times k_c$) fit into the L1 cache. By (3), $m_c$ is so constrained that $A_p$ ($m_c \times k_c$) and two $B_r$s ($k_c \times n_r$) fit into the L2 cache. Finally, by (4), $n_c$ is so constrained that $B_p$ ($k_c \times n_c$) and $A_p$ ($m_c \times k_c$) fit into the L3 cache (if it exists).
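Under these constraints, candidate blocking factors can be checked mechanically. The snippet below is a small illustrative checker; the cache sizes and factor values in main() are placeholders chosen for the example, not the values used in the paper.

    #include <stdbool.h>
    #include <stdio.h>

    /* Check constraints (1)-(4) for double precision (S_e = 8 bytes).
     * cache[l] is the total size in bytes of all caches on level l
     * (level 0 = register file), shared among nthreads threads. */
    static bool factors_ok(int mr, int nr, int kc, int mc, int nc,
                           const long cache[4], int nthreads)
    {
        const long Se = 8;
        bool c1 = (long)(mr * nr + mr + nr) * Se       <= cache[0] / nthreads;
        bool c2 = (long)(kc * nr + 2 * mr * kc) * Se   <= cache[1] / nthreads;
        bool c3 = (long)(mc * kc + 2 * kc * nr) * Se   <= cache[2] / nthreads;
        bool c4 = ((long)kc * nc + (long)mc * kc) * Se <= cache[3] / nthreads;
        return c1 && c2 && c3 && c4;
    }

    int main(void)
    {
        /* Placeholder totals for levels 0-3; a missing L3 is modeled by an
         * effectively unlimited level-3 size. Purely illustrative numbers. */
        const long cache[4] = { 64L * 32 * 8, 64L * 32 * 1024,
                                16L * 2048 * 1024, 1L << 62 };
        printf("ok = %d\n", factors_ok(4, 4, 256, 192, 4096, cache, 64));
        return 0;
    }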
The first blocking layer represents a coarse-grained workload partition and the second blocking layer represents a fine-grained one. The memory footprints of tasks on layers 1 and 2 are roughly the same as the sizes of the L2 and L1 caches, respectively. GEMM is a computationally intensive operation, and Figure 1 clearly shows that the workload partitions on both layers have a regular shape.
In current GEMM implementations, the whole workload is parallelized on the first blocking layer, as shown in Figure 2a. There are four tasks for packing matrix A, four tasks for packing matrix B, and 16 tasks for computing matrix C. These tasks are statically scheduled onto 4 different threads ($T_0$–$T_3$), denoted by different colors in Figure 2a. Specifically, the $i$th ($0 \le i \le 3$) thread $T_i$ gets one task for packing A ($PA_i$), one task for packing B ($PB_i$), and four tasks for computing C ($C_{i,0}$–$C_{i,3}$). Thread $T_i$ first executes its packing tasks $PA_i$ and $PB_i$. Then $C_{i,i}$ is executed, as its data dependencies $PA_i$ and $PB_i$ are resolved. Then $T_i$ checks whether any other thread $T_j$ ($0 \le j \le 3$, $j \ne i$) has finished $PB_j$, because $T_i$'s computing task $C_{i,j}$ depends on $T_j$'s packing task $PB_j$. If no $PB_j$ is done, $T_i$ has to wait until one becomes available. This is the first kind of synchronization that may decrease GEMM performance, because threads do no effective computation while waiting for other threads. There exists another kind of synchronization overhead. When $T_i$ has finished all its computing tasks $C_{i,0}$–$C_{i,3}$, before it continues to the next k-iteration (of the outer-most loop), it must wait for all computing tasks depending on its packing task $PB_i$, i.e., $C_{0,i}$–$C_{3,i}$, to be finished by their owner threads, as the buffer space of $PB_i$ will be overwritten in the next k-iteration.
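The two waiting points can be expressed with per-thread progress flags. The sketch below is a simplified rendition with hypothetical names (pack_A, pack_B, compute_C, pb_done, consumed); it is not the library's actual code, but it shows where the consume-side and release-side waits occur.

    #include <stdatomic.h>

    #define NT 4                          /* number of threads in this example     */

    extern void pack_A(int thread);       /* packs PA_i  (hypothetical helpers)    */
    extern void pack_B(int thread);       /* packs PB_i                            */
    extern void compute_C(int i, int j);  /* computes C_{i,j} from PA_i and PB_j   */

    atomic_int pb_done[NT];               /* pb_done[j] = 1 once PB_j is packed    */
    atomic_int consumed[NT];              /* number of C_{*,j} that have used PB_j */

    void k_iteration(int me)              /* executed by thread T_me               */
    {
        pack_A(me);
        pack_B(me);
        atomic_store(&pb_done[me], 1);

        for (int d = 0; d < NT; d++) {
            int j = (me + d) % NT;        /* start with the thread's own panel     */
            while (!atomic_load(&pb_done[j]))
                ;                         /* 1st kind of waiting: consume-side     */
            compute_C(me, j);
            atomic_fetch_add(&consumed[j], 1);
        }

        while (atomic_load(&consumed[me]) < NT)
            ;                             /* 2nd kind of waiting: release-side     */
        /* PB_me's buffer may now be reused; resetting the flags for the next
         * k-iteration is omitted for brevity. */
    }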
The code for executing a single $C_b$ task is encapsulated in a standalone function, known as the kernel function. Generally, the inner-most three loops in Listing 1 are factored out as the kernel function. The kernel function is a serial (single-thread) program which is highly optimized to achieve near-peak performance. There is a large body of research on optimizing kernel functions, which is briefly discussed in Section 5.
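For intuition, the sketch below shows a plain-C stand-in for the heart of such a kernel: a register-blocked 4x4 micro-tile update corresponding to the inner-most loop of Listing 1. Real kernels are written in assembly or intrinsics with SIMD, prefetching and software pipelining; this scalar version only illustrates the structure and the packed operand layout assumed here.

    /* Micro-kernel stand-in: C_r (MR x NR, row-major, leading dim ldc) +=
     * A_r (MR x kc micro-panel, packed column by column) *
     * B_r (kc x NR micro-panel, packed row by row). */
    #define MR 4
    #define NR 4

    void micro_kernel_4x4(int kc, const double *Ar, const double *Br,
                          double *Cr, int ldc)
    {
        double acc[MR][NR] = {{0.0}};          /* accumulators held in registers */

        for (int kk = 0; kk < kc; kk++) {
            const double *a = Ar + kk * MR;    /* one packed column of A_r */
            const double *b = Br + kk * NR;    /* one packed row of B_r    */
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)
                    acc[i][j] += a[i] * b[j];  /* rank-1 update of the tile */
        }
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                Cr[i * ldc + j] += acc[i][j];
    }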
According to the above description of GEMM parallelization, the overall execution time of GEMM can be divided into five parts: (1) $T_{comp}$ for effective computation, (2) $T_{packA}$ for packing the $A_p$s, (3) $T_{packB}$ for packing the $B_p$s, and (4) $T_{sync,c}$ and (5) $T_{sync,r}$ for the two kinds of synchronization overhead. The "c" and "r" subscripts are used because the threads are waiting for data they are about to "consume" and "release", respectively. $T_{sync,c}$ and $T_{sync,r}$ harm GEMM performance on NUMA architectures. Reducing this synchronization overhead is the main motivation of this article.
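One simple way to obtain such a breakdown is to time the waiting loops separately from the packing and compute phases. The sketch below uses hypothetical helper names (wait_for_pb, wait_for_release, etc.) and per-thread accumulators; it only outlines the instrumentation, not the paper's measurement harness.

    #include <omp.h>

    #define NT 4                          /* number of threads in this example */

    extern void pack_A(int thread);       /* hypothetical helpers, as above    */
    extern void pack_B(int thread);
    extern void compute_C(int i, int j);
    extern void wait_for_pb(int j);       /* the consume-side waiting loop     */
    extern void wait_for_release(int me); /* the release-side waiting loop     */

    typedef struct {                      /* per-thread time accumulators      */
        double t_comp, t_packA, t_packB, t_sync_c, t_sync_r;
    } time_breakdown_t;

    void timed_k_iteration(int me, time_breakdown_t *tb)
    {
        double t0 = omp_get_wtime();
        pack_A(me);                       tb->t_packA  += omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        pack_B(me);                       tb->t_packB  += omp_get_wtime() - t0;

        for (int j = 0; j < NT; j++) {
            t0 = omp_get_wtime();
            wait_for_pb(j);               tb->t_sync_c += omp_get_wtime() - t0;

            t0 = omp_get_wtime();
            compute_C(me, j);             tb->t_comp   += omp_get_wtime() - t0;
        }

        t0 = omp_get_wtime();
        wait_for_release(me);             tb->t_sync_r += omp_get_wtime() - t0;
    }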
4. Results
We implemented our method in the OpenBLAS [3] library and evaluated it on Phytium 2000+, an emerging high-performance many-core processor based on Arm's AArch64 architecture. We restrict our evaluation to DGEMM, as in prior work [10,11,12], for two reasons. First, the basic idea of the hybrid-grained load-balancing method applies equally to other variants of GEMM such as SGEMM, CGEMM and ZGEMM. Second, the LINPACK benchmark, which is used to build the TOP500 [13] list of the world's most powerful supercomputers, relies on the DGEMM variant.
This section presents the evaluation results. First, we introduce the hardware and software environment used in our experiments. Then, we present the performance results and quantitative analysis. Finally, we give a brief discussion on tuning the granularity.
4.1. Environment
The Phytium 2000+ processor has 64 cores, organized into 16 core clusters with each cluster containing four cores.
Figure 4a shows the memory hierarchy of a core cluster. Every core has its own private L1 data cache, and a unified L2 cache is shared by all cores in the cluster. Figure 4b shows the structure of the Phytium 2000+ machine. The L2 caches of all 16 clusters are connected by a hardware coherence network. The main memory is organized into 8 NUMA nodes, with each NUMA node hosting two clusters.
Table 1 lists the hardware features of the Phytium 2000+ machine and the software environment used in evaluation. In our evaluation, we parallelize the GEMM test process with 64 threads (one thread per core).
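One thread per core can be enforced by pinning threads to cores. The sketch below uses Linux sched_setaffinity inside an OpenMP parallel region; the paper does not specify its exact pinning mechanism, so this is only one common approach (setting OMP_PROC_BIND/OMP_PLACES is an alternative), and it assumes that logical CPU ids 0–63 map one-to-one onto the 64 physical cores.

    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin OpenMP thread i to core i so that the 64 GEMM threads map
     * one-to-one onto the 64 cores. */
    int main(void)
    {
        omp_set_num_threads(64);
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(tid, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");
        }
        /* ... run the GEMM test; most OpenMP runtimes reuse the same
         * worker threads, so the pinning persists across regions ... */
        return 0;
    }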
4.2. Performance
The hybrid-grained dynamic load-balancing method is configured by three parameters: the number of static tasks per task chunk ($n_s$), the number of dynamic tasks per task chunk ($n_d$), and the granularity of a dynamic task ($g$). For convenience, we take the ratio of one dynamic task to the whole task chunk (the sum of all static and dynamic tasks) as the granularity $g$. For instance, the configuration ($n_s = 2$, $n_d = 2$, $g = 10\%$) means that each task chunk is divided into two static tasks and two dynamic tasks, with each static task accounting for 40% and each dynamic task accounting for 10% of the whole task chunk.
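To make the parameterization concrete, the helper below turns ($n_s$, $n_d$, $g$) into task sizes over a chunk of fine-grained iterations; the function name and rounding policy are illustrative assumptions, not the paper's implementation.

    #include <stdio.h>

    /* Split a task chunk of `total` iterations into n_s static tasks and
     * n_d dynamic tasks, where each dynamic task covers a fraction g of
     * the chunk and the remainder is divided evenly among static tasks. */
    static void split_chunk(int total, int n_s, int n_d, double g,
                            int *static_sz, int *dynamic_sz)
    {
        *dynamic_sz = (int)(g * total);
        *static_sz  = (total - n_d * *dynamic_sz) / n_s;
    }

    int main(void)
    {
        int s, d;
        /* Example configuration: 2 static + 2 dynamic tasks, g = 10%,
         * over a chunk of 400 iterations -> static = 160, dynamic = 40,
         * i.e., 40% and 10% of the chunk, matching the example above. */
        split_chunk(400, 2, 2, 0.10, &s, &d);
        printf("static task = %d iters, dynamic task = %d iters\n", s, d);
        return 0;
    }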
We select several values for $n_s$, $n_d$ and $g$, evaluate all resulting parameter compositions, and compare their performance with the coarse-grained implementation.
Both the hybrid-grained and the coarse-grained implementations use the same kernel function generated by the POCA [14] optimizer, so the only difference lies in the way they partition and schedule the computational tasks, as shown in Figure 2.
Figure 5 presents the results. For comparison, we also evaluate the performance of BLIS [4] GEMM and ATLAS [1] GEMM. The average thread efficiency, $E$, is used to describe the performance results. The average thread efficiency is a normalized metric derived from the measured FLOPS (floating-point operations per second), computed as $E = \mathrm{FLOPS} / (N_t \times P_{core})$, where $P_{core}$ stands for the theoretical performance peak of a single core.
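As a worked example of this metric (with made-up numbers, since the per-core peak of the evaluation machine is not restated here):

    #include <stdio.h>

    /* Average thread efficiency E = FLOPS / (N_t * P_core).
     * The values below are purely illustrative, not measured results. */
    int main(void)
    {
        double flops  = 450e9;   /* assumed whole-chip rate: 450 GFLOPS */
        double p_core = 9.6e9;   /* assumed per-core peak: 9.6 GFLOPS   */
        int    n_t    = 64;      /* threads (one per core)              */
        printf("E = %.2f%%\n", 100.0 * flops / (n_t * p_core));  /* ~73% */
        return 0;
    }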
All results reported in Figure 5 are obtained by running the test program five times and taking the mean of the measured values. Each point carries an error bar representing the variation of the five measurements at that point.
ATLAS shows the worst performance of all because it uses an auto-tuning methodology that has no built-in knowledge of the underlying architecture. BLIS performs better than ATLAS, but falls behind all hybrid-grained configurations and the coarse-grained version; the reason is that BLIS's kernel (the inner-most loop in Listing 1) is not as optimized as that of OpenBLAS. For the configuration with the fewest tasks, the hybrid-grained implementation outperforms the coarse-grained implementation at most matrix sizes, the exceptions being 3328, 3584, 5120 and 5888. All other hybrid-grained configurations outperform the coarse-grained implementation consistently at all matrix sizes. The average performance wins over the coarse-grained implementation are 1.29%, 1.88%, 1.94% and 1.91% for the four hybrid-grained configurations, respectively. The results clearly demonstrate the effectiveness of our hybrid-grained dynamic load-balancing method. The configuration with the fewest tasks falls behind the coarse-grained implementation at some points because its tasks are too few to balance the variation in thread speed. Though it may seem marginal at first glance, the 1%–2% performance improvement is still meaningful for GEMM because GEMM is so fundamental in scientific computation.
To analyze the performance gain quantitatively, we measured the $T_{sync,c}$ and $T_{sync,r}$ overhead of the GEMM program. Figure 6 compares the synchronization overhead of the coarse-grained implementation and one hybrid-grained configuration. The y-axis represents the ratio of synchronization overhead to the overall GEMM execution time. Both $T_{sync,c}$ and $T_{sync,r}$ are reduced by our hybrid-grained dynamic load-balancing method. The average overhead decreases from 4.19% to 2.03%, a reduction of 51.5%. Theoretically, the improvement in GEMM performance should be close to the overhead removed, i.e., about 4.19% − 2.03% ≈ 2.2%. The measured improvement (≈1.9%) is quite close to this theoretical estimate, indicating that the hybrid-grained dynamic load-balancing method itself incurs little overhead.
4.3. Tuning Granularity
In Section 4.2, configurations with a small granularity achieve good performance. To analyze how the granularity affects the overall performance of the hybrid-grained dynamic load-balancing method, we fix $n_s$ and $n_d$ and vary the granularity $g$, resulting in a total of four configurations.
Figure 7 presents the performance results of these four hybrid-grained configurations, as well as the coarse-grained implementation.
All hybrid-grained configurations perform better than the coarse-grained implementation, though there are a few points at which the largest-granularity configuration falls behind. The average thread efficiencies of the four hybrid-grained configurations are 78.68%, 78.60%, 77.80% and 77.42%, ordered from the smallest to the largest granularity. The trend is that a smaller granularity yields better performance, which confirms the design in Section 3.1. For the two configurations with the largest granularity, the dynamic tasks are larger than the static ones, which violates the design described in Section 3.1, so they show suboptimal performance compared to the other two configurations. In general, a small granularity is a reasonable setting for the hybrid-grained dynamic load-balancing method.
5. Related Work
The idea that all level-3 BLAS operations can be built on top of a high-performance GEMM implementation was first proposed in [5,15]. Optimizing GEMM has been the central task in developing dense linear algebra software ever since. The GotoBLAS library [2] and its successor OpenBLAS [3] are instantiated based on this insight. Optimization of GEMM has two aspects. One is developing fast kernel functions to accomplish the computation of $C_b$ tasks, and the other is workload partitioning and parallelization, which is the focus of this article. As far as we know, this paper is the first to address the problem of GEMM parallelization on NUMA architectures.
There are several approaches to obtaining optimized kernels, yielding different tradeoffs between performance and portability. In GotoBLAS [2], OpenBLAS [3] and BLIS [4], the kernels are usually written in assembly by domain experts. ATLAS [1] adopts an auto-tuning method: it automatically generates kernels with different parameters in C and finds the best-performing one by running them on the actual computing system. POET [12,16,17] and AUGEM [11] use a directive-based programming approach. POCA [14] is a compiler-based approach which generates and optimizes kernels automatically and portably.
The blocking factors $k_c$, $n_c$, $m_c$, $n_r$ and $m_r$ are essential to GEMM performance. ATLAS [1] relies on auto-tuning to determine optimal values for these factors. Analytic techniques [18,19,20] can also be used instead of auto-tuning. In multi-threaded contexts, the m-loop (line 8 in Listing 1) is generally the one parallelized. BLIS [10] allows developers to specify a sophisticated configuration so that any combination of the n-, m-, nn- and mm-loops can be parallelized simultaneously to suit complex architectural features such as multiple sockets and hyper-threading.
While our hybrid-grained dynamic load-balancing method is specially designed for dense linear algebra computation on shared-memory parallel architectures, generic dynamic load-balancing techniques have been studied extensively for distributed computing architectures. Olga et al. [21] proposed a decoupled load-balancing algorithm that lets the load-balance computation run concurrently with the application, improving the scalability of the load-balancing algorithm. Patni and Aswal [22] proposed a distributed load-balancing algorithm for grid architectures. Kaur and Sharma [23] use a "Central Load Balancer" to balance the load among virtual machines in a cloud data center. In [24], a hierarchical (two-level) dynamic load-balancing model is proposed as a compromise between centralized and fully distributed load-balancing schemes. The LBPSA algorithm [25] aims at balancing the workload on multiprocessor systems in real-time contexts to reduce response time and improve resource utilization. Stavros and Manos [26] address the general block-cyclic redistribution problem on non-all-to-all networks.