1. Introduction
Matrix multiplication has always been a critical part of scientific and engineering applications. Recently, it has found extensive use in machine learning and other AI applications, which are growing in importance at all levels of computing. From the standpoint of high-performance systems, it will likely account for a significant fraction of daily workloads. Therefore, even modest improvements in runtime or energy consumption can result in substantial gains in the overall performance of a typical high-performance system. Those gains may translate to meaningful reductions in the costs of operating such systems, especially those at the extreme scale. At the mobile level, savings in power consumption may help extend battery life.
Kouya [1] noted that multiplication methods that process the matrices in blocks rather than whole rows and columns produced high cache hit ratios. One would expect those methods to exhibit significant memory reference locality, even though they are no better complexity-wise than their non-blocked counterparts. An energy perspective, however, may lend more value to the locality advantage. Unlike time efficiency, which relies on patterns of repetition of operations, energy efficiency is affected more by the use of physical assets. The two effects are not unrelated, of course. Block-oriented methods should therefore show more energy efficiency since they rely on the SRAM-based cache, which is known to be energy efficient; processing whole rows and columns relies more on the DRAM-based main memory, which is less efficient. Functional unit usage patterns should have similar effects. For example, in 1968, S. Winograd suggested a variation on the standard multiplication method [2] that ran faster. It was still cubic in complexity, but since it cut the number of element multiplications in half, some energy benefits may be expected. Also, depending on how algorithmic operations map to the hardware, the rate at which energy is consumed (power) may vary. Consuming energy at high power levels may cause equipment to fail due to high temperatures or may drain batteries too quickly. In those situations, power consumption may be the more significant concern. In the context of this report, the reader should note that the terms energy and power (the time rate of energy) will not be used interchangeably. The authors will use the phrase energy profile or behavior to refer to all energy concerns in general, including power draw.
Strassen’s divide-and-conquer multiplication [3] reduced the number of element multiplications for the 2 × 2 case to 7, down from 8 in the standard procedure. Winograd showed that result to be optimal for element multiplications and reduced the number of additions to 15, down from 18 in the original Strassen procedure, a count that Probert [4] proved optimal for additions across all Strassen-like methods. Strassen-like methods tend to be block-oriented since they are recursive, but they involve large memory overheads; a naive implementation may cause an excessive amount of memory allocation and motion. Strassen-like methods perform significantly fewer multiplications than the school pen-and-paper method at the expense of substantially more additions. Multiplications tend to be more expensive in terms of hardware and cost more energy. In addition, modern processors strive to offer multiplications that cost fewer cycles of latency to keep up with tight pipelines [5]. Aggressive timings may raise power consumption even when energy costs are kept comparable. These variations in hardware and its usage should be interesting for energy concerns.
Based on the preceding ideas, this paper reports on an extensive experimental study of the energy and power characteristics of six matrix multiplication algorithms. The algorithms are described fully in Section 3. Here is a summary:
A straight implementation of the standard/school algorithm;
Winograd’s improvement on the standard method;
A naive implementation of Strassen’s divide-and-conquer algorithm;
A naive implementation of Winograd’s variant of Strassen, implemented similarly to the previous one;
A memory-optimized Strassen that eliminates the allocation of temporary memory blocks used for calculations;
An algorithm similar to the previous one but with some memory motions eliminated.
The study focused on the case of square matrices, which is crucial to most applications. It collected empirical energy performance and other data from an HPC-class AMD EPYC processor, including timings, cache misses, and memory bandwidth data. The algorithms varied in how they used the physical hardware. The study considered the following factors: functional unit usage, memory allocations, and memory motion (copying regions around in memory). It suggested memory allocation strategies to optimize energy behavior for the divide-and-conquer methods based on hardware usage. The choice of processor was significant since it is part of a family used in many high-performance systems (some on the https://top500.org/ list, accessed on 24 March 2024). Thus, the findings should credibly approximate what to expect in those environments. This work attempts to elucidate the links between the usage patterns of energy-significant hardware and expected energy behaviors. It should, hopefully, provide valuable insights for future optimization efforts and may contribute to a deeper understanding of these computations. It may also help design algorithms that address energy concerns in better ways.
The organization of this paper is as follows. Section 2 presents a review of related work. Section 3 describes the characteristics of interest of the investigated algorithms. Section 4 describes the methods and materials used in this research. Section 5 presents the results and a thorough discussion of the findings. Lastly, Section 6 offers some conclusions and proposes directions for future research.
2. A Literature Review
This section reviews a selection of relevant work in chronological order.
Table 1 at the end of the section summarizes the main contributions for convenience.
Boyer et al. [6] focused on reducing the amount of memory used by Strassen–Winograd matrix multiplication. A highly interesting feature of the study was the development of several partially and fully in-place calculation schedules. They also developed a generic way to transform non-in-place algorithms into in-place ones with little overhead. They did not show that the memory-optimized versions ran faster (despite the drastic improvement in cache utilization in many cases). The study did not consider stability or speed issues, although it did include a cursory timing table.
Ren and Suda [7] studied power-efficient software for large matrix multiplication on multi-core CPU–GPU platforms. They measured and modeled the power of each component, considering frequency, voltage, and capacitance. They proposed a method to save power via multithreaded CPUs to control two parallel GPUs. They showed that their method saved 22% energy and sped up kernel time by 71%. However, their method depended on problem size and hardware configuration.
Ren and Suda [8] (second paper) optimized large matrix multiplication on multi-core and GPU platforms for power efficiency. They used block matrix partitioning, shared memory, and multithreading to improve the algorithm. Their enhanced kernel was 10.81 times faster than the original and saved 91% more energy. However, their power measurement, power model, and power-saving strategies may not be accurate or generalizable for other components, problems, or algorithms.
Yan et al. [9] compared parallel matrix multiplication algorithms on shared-memory machines in terms of performance, energy efficiency, and cache behavior. They studied loop chunking, recursive and hybrid tiling, and Strassen’s algorithm. They found that the Intel MKL cblas_dgemm was the best for large matrices and small or medium thread counts. The hybrid tiling improved locality and performance. However, Strassen’s algorithm had a lower cache hit rate and higher memory bandwidth usage. Their study lacked the hardware features and software tools needed for power and energy profiling.
Legaux et al. [10] optimized matrix multiplication for multi-core systems with data access, vector unit, and parallelization techniques. They used Halstead metrics to evaluate software complexity and efficiency trade-offs. They showed that SSE and OpenMP versions were faster, but the SSE version was harder to develop. However, their method depended on architecture and had high development costs for some optimizations. Those factors limited their method’s feasibility in some contexts.
Lai et al. [11] optimized the Strassen–Winograd algorithm for arbitrary matrix sizes on a GPU using techniques such as an empirical model, multi-kernel streaming, dynamic peeling, and two temporary matrices. They surpassed previous GPU implementations and showed Strassen’s algorithm’s practicality. However, they ignored the CPU–GPU communication cost, other fast algorithms, and Strassen’s algorithm’s numerical stability.
Kouya et al. [1] improved multiple precision matrix multiplications with Strassen’s and Winograd’s algorithms. They compared their speeds and accuracies for different precisions and block sizes. They used their technique for LU decomposition and checked the numerical properties on well-conditioned and ill-conditioned matrices. They showed that Winograd’s was faster than Strassen’s, and both recursive algorithms reduced computation time for LU decomposition. However, they lost significant digits for ill-conditioned matrices and required more tuning and parallelization for efficiency.
Sheikh et al. [12] developed a multi-objective evolutionary algorithm for parallel task scheduling on multi-core processors, optimizing performance, energy, and temperature together. Their approach beat existing techniques, as shown by experiments. The paper built on previous work on performance and temperature optimization.
Zhang and Gao [13] compared standard and Strassen tile-based matrix multiplication algorithms on high-density multi-GPU systems. They found that the standard method outperformed the Strassen method on these systems due to the performance gap between multiplication and addition operations. They concluded that the performance ratio of these operations determines the best matrix multiplication algorithm for new architectures. However, their study was specific to the hardware configurations tested and may not apply to other systems or future technologies.
Kouya et al. [14] (second paper) compared parallelized Strassen and Winograd algorithms for multiple precision matrix multiplications using MPFR/GMP and QD libraries. They used thread-based parallelization to improve performance. They also showed that the Strassen and Winograd algorithms can reduce costs and increase speed in QD and DD precision environments.
Jakobs et al. [15] examined the impact of vectorization techniques on the power and energy of dense matrix multiplication algorithms. They used loop unrolling, frequency control, and compiler optimization levels. They showed that vectorization and loop unrolling reduced power by up to 10% or more. The results varied by architecture and frequency. The study was limited to Intel’s Sandy Bridge and Haswell architectures.
Muhammad and Islam [16] improved matrix–matrix multiplication performance on the .NET platform by loop restructuring of sequential and blocking algorithms. They found KIJ and IKJ reordering the best due to fewer cache faults and reads. However, they did not discuss the real-world implications or the applicability of these findings.
Haidar et al. [17] explored power usage control and its impact on the performance of scientific algorithms. They used representative kernels and benchmarks, together with PAPI’s powercap component for measurement and control, to provide a framework for understanding and regulating power usage in scientific applications. They focused on identifying algorithms that benefit from power management strategies and gave observations and recommendations for energy efficiency in scientific algorithm development and execution.
Fahad et al. [18] measured on-chip power sensors and energy predictive models using PMCs against external power meters, finding significant errors in both methods. On-chip sensors had up to 73% average and 300% maximum error. Predictive models had up to 32% average and 100% maximum error. They also found that inaccurate energy measurements can cause energy losses of up to 84% during dynamic energy optimization.
Oo and Chaikan [19] improved energy efficiency and performance of Strassen’s algorithm for matrix multiplication on a shared memory architecture by using loop unrolling and AVX. They achieved 98% speedup and 95% savings in energy compared to the method without unrolling. However, their method depended on the loop unrolling factor, the code size, the number of registers, and the hardware configuration.
Kung et al. [20] addressed performance issues in bandwidth-constrained systems, which are prevalent today. They proposed an algorithm to optimize memory bandwidth for a widely used block-oriented matrix multiplication. The goal was to find a balance between compute and memory bandwidths, a balance that the current state of technology tilts sharply against memory access. Their approach relied on shaping constant bandwidth blocks (self-contained units of computation that fit in local memory with fixed bandwidth needs to external ones). Adjusting block properties allowed the algorithm to control the ratio of (slow) external memory access to (fast) computation rate and overcome memory bottlenecks. Optimizing local memory (cache) block bandwidth while limiting use of the less power-efficient DRAM-based external memory should also provide energy benefits in addition to the reported performance boost, which makes such techniques of interest to our research.
Oo and Chaikan [21] optimized the Strassen–Winograd algorithm for arbitrary matrix sizes on a GPU using techniques such as empirical modeling, multi-kernel streaming, dynamic peeling, and two temporary matrices. They surpassed previous GPU implementations and showed Strassen’s algorithm’s practicality. However, they ignored CPU–GPU communication cost, other fast algorithms, and Strassen’s algorithm’s numerical stability.
Jammal et al. [22] compared the energy performance of a definition-based algorithm against two basic divide-and-conquer ones on an HPC-class Haswell Intel processor. They showed that the divide-and-conquer algorithms were the best in terms of power and energy consumption when the calculations fit in the cache. Both continued to be the most energy-efficient regardless of placement in memory. A notable result was that the definition-based algorithm took the lead in power consumption, but not overall energy, for sizes that spilled over to the main memory. The authors described their results as preliminary.
3. The Algorithms
In classic complexity analysis, designers are concerned with the patterns of mapping the most frequent operations to time or memory; efficiency depends on the growth of the functions that express those patterns. The main idea in this work is that a careful mapping of such operations to the hardware is the key to designing for power efficiency and, ultimately, total energy budgets. In some cases, the algorithms may offer opportunities for choosing an energy-efficient operation, such as dividing by two, which maps to highly efficient hardware. In others, there is a choice of circuits that serve the same purpose functionally but vary in energy behavior. A design focused on speed may result in a circuit that consumes much more energy or, worse in some cases, increases consumption rates (power) to maintain a time advantage. This section describes the investigated algorithms and highlights the differences in hardware usage patterns that are likely significant to their energy profiles. They vary mainly in how the frequent operations map to functional units, the local memory footprint, and the data motions in the memory. This description should clarify the experimental design and perhaps help with the interpretation of the outcomes.
Table 2 presents a coding scheme for the algorithms and a reminder of their generic complexity. In each case, the complexity term is the exact number of element multiplications. For convenience, the codes will be used from this point on to refer to the algorithms.
DEF is a straight implementation of the definition of matrix multiplication based on the inner product of rows into columns; it is simply the school pen-and-paper algorithm. WINOG is Winograd’s improvement [2] on the standard method, which cuts the number of element multiplications roughly in half without changing the base complexity. Both of these algorithms process whole rows and columns, so their performance should suffer relatively quickly as matrix sizes increase and cache utilization drops. The number of additions, however, increases by a factor of 1.5 in WINOG, so the net effect is not readily apparent, given that multiplication tends to be more energy expensive than addition.
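To make the trade-off concrete, the following is a minimal sketch of Winograd’s 1968 idea. It is illustrative only: the function name and data layout are not the WINOG code used in the experiments, and an even dimension n is assumed (an odd n would need one extra correction term a[i][n-1] * b[n-1][j] per output element).

#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Winograd's variant: precompute pairwise products per row of A and per
// column of B, then form each output element from n/2 multiplications of
// sums, roughly halving the element multiplications of the school method.
Matrix winograd_multiply(const Matrix& a, const Matrix& b, int n) {
    std::vector<double> rowFactor(n, 0.0), colFactor(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n / 2; ++k)
            rowFactor[i] += a[i][2 * k] * a[i][2 * k + 1];
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n / 2; ++k)
            colFactor[j] += b[2 * k][j] * b[2 * k + 1][j];

    Matrix c(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = -rowFactor[i] - colFactor[j];
            for (int k = 0; k < n / 2; ++k)
                s += (a[i][2 * k] + b[2 * k + 1][j]) *
                     (a[i][2 * k + 1] + b[2 * k][j]);
            c[i][j] = s;  // the extra additions are the price of fewer multiplications
        }
    return c;
}

The two factor arrays trade roughly half of the inner-loop multiplications for additional additions, which is exactly the shift in functional unit usage discussed above.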
The remaining algorithms are divide-and-conquer and are typically implemented recursively. A quick overview of the base method helps highlight the differences between them. These algorithms generally operate on submatrix blocks (k-blocks, with k = n/2, n/4, …, 1) of the inputs and output on each recursive iteration, dubbed iteration k to simplify the discussion. Seven of these blocks are included for the following iteration (therefore, the number of recursive iterations is log₂ n). These are combined to calculate and return an iteration’s product matrix. Calculating these and holding the iteration input and output submatrices requires extra k-blocks. More are needed to return the iteration product matrix (a 2k × 2k matrix, i.e., four more k-blocks). Those blocks are common to all except STRAS3 and thus are omitted from the discussion. It helps when designing these algorithms to distinguish temporary blocks (temps) used during calculations to hold partial and final expression results from those used to store iteration input and output submatrices (submats).
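For reference, these are the standard one-level Strassen block equations (well known and reproduced here only to make the block counts concrete; the A and B symbols denote the input quadrants and C the output quadrants):

\[
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})\,B_{11}, \\
M_3 &= A_{11}(B_{12}-B_{22}), & M_4 &= A_{22}(B_{21}-B_{11}), \\
M_5 &= (A_{11}+A_{12})\,B_{22}, & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}), \\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & & \\
C_{11} &= M_1+M_4-M_5+M_7, & C_{12} &= M_3+M_5, \\
C_{21} &= M_2+M_4, & C_{22} &= M_1-M_2+M_3+M_6.
\end{aligned}
\]

Each M term costs one recursive block multiplication, which is where the seven products (and recursive calls) per iteration come from; the block sums and differences are what the sum temps hold.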
A careless implementation could use up to 37 blocks per iteration: 25 temps (18 sums and 7 products) and 12 in/out submats. With reasonable effort, one should be able to manage the block allocation to reduce this overhead considerably. Standard calculation sequences offer natural opportunities to reuse temps, which helps reduce the memory footprint, especially in the local memory, which is faster and more power efficient. For example, even a beginner would quickly see that 8 of the sum temps may be reused, reducing the total allocation to 29 blocks. Also, in a straight implementation, an iteration k fills the k-blocks from its inputs and copies the resulting ones to its output; filling those involves considerable back-and-forth copying.
STRAS follows a natural calculation schedule as suggested by a standard formulation (such as the one in [23]) and is considered naive for this reason. The natural flow allows the reuse of eight temps (as mentioned earlier) and four submats. Therefore, STRAS needs 17 temps and 8 submats, a total of 25 in iteration k. No memory allocation or motion optimizations were applied except an obvious allocation strategy: allocate blocks as late as possible and free them as early as possible, which is considered in this work to be the baseline for the family of methods. This allocation strategy needed 18 blocks at first, with four released early. STRAS retains four temps to calculate the outputs used to fill the iteration return matrix. Hence, its block allocation pattern was 18, then 14, and down to 4 before and after the recursive calls in each iteration, compared to the 25 needed otherwise. WINOGV followed the same plan. It benefited from the recursive reduction in additions to hold slightly fewer temps. Together with the significantly fewer overall multiplications shared with STRAS, WINOGV should have better energy performance.
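For orientation, the following compact sketch shows the naive recursive structure that a STRAS-like implementation follows. It is illustrative only: the flat row-major storage, the helper names, and the absence of the late-allocate/early-free bookkeeping are simplifications rather than the authors’ code, but it makes visible where the submats are filled by copying, where the sum temps appear, and where the output quadrants are copied back.

#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // dense n x n, row-major, n a power of two

static Mat add(const Mat& x, const Mat& y) {   // creates a sum temp
    Mat r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + y[i];
    return r;
}
static Mat sub(const Mat& x, const Mat& y) {   // creates a difference temp
    Mat r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] - y[i];
    return r;
}
static Mat quad(const Mat& m, std::size_t n, std::size_t r0, std::size_t c0) {
    std::size_t k = n / 2;                     // input submat: allocation plus copy
    Mat q(k * k);
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j) q[i * k + j] = m[(r0 + i) * n + c0 + j];
    return q;
}
static void put(Mat& m, std::size_t n, std::size_t r0, std::size_t c0, const Mat& q) {
    std::size_t k = n / 2;                     // copy a result quadrant back (memory motion)
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j) m[(r0 + i) * n + c0 + j] = q[i * k + j];
}

Mat strassen(const Mat& a, const Mat& b, std::size_t n) {
    if (n == 1) return Mat{a[0] * b[0]};       // recurse all the way down
    std::size_t k = n / 2;
    Mat a11 = quad(a, n, 0, 0), a12 = quad(a, n, 0, k),
        a21 = quad(a, n, k, 0), a22 = quad(a, n, k, k);
    Mat b11 = quad(b, n, 0, 0), b12 = quad(b, n, 0, k),
        b21 = quad(b, n, k, 0), b22 = quad(b, n, k, k);
    // Sum temps are created just before each recursive call and die right after it.
    Mat m1 = strassen(add(a11, a22), add(b11, b22), k);
    Mat m2 = strassen(add(a21, a22), b11, k);
    Mat m3 = strassen(a11, sub(b12, b22), k);
    Mat m4 = strassen(a22, sub(b21, b11), k);
    Mat m5 = strassen(add(a11, a12), b22, k);
    Mat m6 = strassen(sub(a21, a11), add(b11, b12), k);
    Mat m7 = strassen(sub(a12, a22), add(b21, b22), k);
    Mat c(n * n);
    put(c, n, 0, 0, add(sub(add(m1, m4), m5), m7));   // C11
    put(c, n, 0, k, add(m3, m5));                     // C12
    put(c, n, k, 0, add(m2, m4));                     // C21
    put(c, n, k, k, add(add(sub(m1, m2), m3), m6));   // C22
    return c;                                         // copied into the caller's output
}

Every call to quad or add above is a block allocation plus memory motion of the kind the STRAS2 and STRAS3 variants below work to eliminate.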
Table 3 compares how the algorithms used memory.
STRAS2 is a memory-optimized multiplication technique that focuses on eliminating the temps and keeping only the submats. It employs the calculation schedule of Table 3 from [6], which uses the input submats as temps instead of creating extra ones. STRAS2 needs fewer blocks (good for caching) but keeps those that would have been freed earlier in STRAS until the schedule finishes (so they stay cached longer). The algorithm behind STRAS3, like STRAS2, does not create temps. It also eliminates the need for some submats by constructing the return product matrix in place, i.e., directly in the original output matrix. This eliminates the need to copy from the output submats and thus reduces some of the memory motion. So STRAS3 has the smallest memory footprint per recursive call and the least need to move data around memory; it should show the best energy performance of the lot.
Rather than a complex full implementation, STRAS3 simulates the power behavior. It fully implements the memory behavior of STRAS2 in addition to in-place product matrix generation. However, instead of updating the row and column positions of the output block for each recursive iteration, it reuses the first position. The position calculation requires a constant time overhead per iteration and should have a modest effect on the power draw relative to the memory access. Therefore, STRAS3 exercises the parts of the hardware most relevant to the energy rate. The rationale for the simulation is as follows: (a) it should produce a fairly representative power footprint, and (b) it allows an inexpensive assessment against the relatively cheap STRAS2. More importantly, it illustrates a point about power assessment: a simulation should be reasonably accurate as long as it exercises the same hardware in the same or close enough ways, which should help credibly check whether a more complex algorithm is worth it before committing to a full implementation.
Strassen-like algorithms rely on cutting the input size in half on each iteration, so they work naturally with input sizes that are powers of two. This entails no loss of generality since one may pad the rows and columns of any matrix with zeros to change its dimensions without affecting its arithmetic value. The implementations in this work automatically upscale the dimensions to the next power of two (so 300, 400, and 500 are processed as 512). Therefore, some cases will be sparse while others will be dense, which may be interesting from an energy perspective. Finally, the implementations do not halt the recursion at an appropriately small block size to switch to a faster procedure, as is customary; they recurse all the way down to remain faithful to the method.
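For illustration (hypothetical helper names, assuming flat row-major storage), the upscaling step amounts to the following:

#include <cstddef>
#include <vector>

// Smallest power of two not less than n; e.g., 300, 400, and 500 all map to 512.
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Copy an n x n matrix into a zero-padded p x p matrix, p = next_pow2(n).
// The added zero rows and columns do not change the arithmetic value of the product.
std::vector<double> pad_to_pow2(const std::vector<double>& m, std::size_t n) {
    std::size_t p = next_pow2(n);
    std::vector<double> out(p * p, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            out[i * p + j] = m[i * n + j];
    return out;
}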
4. Materials and Methodology
The methodology used in this study was based on the established techniques for experimental measurement of energy and power on modern CPU platforms that were developed by some of the authors of this report and detailed in [22,24]. The main difference was the shift to another CPU platform. Fortunately, more modern CPUs, circa 2021, provided access to better profiling options. A summary of the methodology is as follows. To empirically determine the energy performance that one could credibly ascribe to the fundamental way a computation is configured, i.e., the algorithm: (a) the code is isolated on one processor core, and (b) as much influence of the runtime environment as possible, considered noise in this study, must be eliminated. Adverse influences could arise from the machine code representation of the computation, the OS threading model, and how the processor executes the code. In particular, power management mechanisms in modern CPUs can completely mask the natural energy behavior. Ultimately, the empirical evidence relies on high-resolution data from the probes inside modern CPUs that are used to dynamically control the run voltages and frequencies in response to loading conditions. Therefore, with a careful setup, there can be a credible argument that most of the measurements stem from the computation style. This black-boxing approach eliminates the need for complex simulations that rely on detailed knowledge of the inner workings of the CPU and keeps the focus on the algorithm.
Timing data were collected as a control to ensure correct programming and the expected asymptotic behavior. Experimentation showed that the readings from the profiling tools were consistent and close in value. An average of twenty to fifty sampling runs was sufficient to obtain a reliably convergent value for a single reading (to at least three fractional digits); for examples, see Table 4. These figures seem to confirm that the experimental setup was quite successful at eliminating noise from the runtime environment. Running up to fifty times revealed little added value in the context of the experiment beyond guaranteeing convergence to four fractional digits. Therefore, readings for dimensions 3000 and 4000 were averaged over 20 runs due to their heavy time cost; the rest were averaged over 40–50 runs.
The following describes the environment and procedures used in the experiments, the characteristics of the test datasets, the executable code, and the tools used to collect data.
4.1. Experimental Environment and Procedures
Table 5 lists the specifications of the testbed machine. The main feature is the AMD CPU from the third generation of the high-performance EPYC 7003 series (code name Milan). Cores are organized in blocks, called core complexes (CCXs), that combine 2–8 Zen 3 cores with a sizable multilevel cache. A silicon die can have one or more CCXs (plus necessary off-core functions). The 7313 has four CCXs, each on a separate die. This configuration should help limit thermal cross-noise affecting the core chosen to run the experiments. Some experimentation revealed that readings were not affected by whether the system case was open or closed or whether the case fans were on or off.
The remainder of the section lists the measures and precautions put in place by the researchers to minimize environmental noise and ensure a reliable estimate of the energy profiles under investigation. These are related to the experiment environment and procedures, the CPU, or the operating system (OS).
Disabled the CPU cooler fan: this prevents the chip package from being cooled differently under different loading conditions, allowing the experimental loads to run under consistent thermals.
Configured a cool-off period between test runs: the run scripts check the CPU temperature between runs and apply a time delay so that the package returns to a consistent baseline temperature of 30 °C before the following run can begin (a sketch of such a check appears after this list).
Disabled SMT: hardware-managed threads share a single physical CPU core in ways that are unpredictable from a computation viewpoint.
Disabled the power management components responsible for regulating the energy behavior of the CPU (core performance boost in AMD terminology). This function depends on hardware behaviors set by the CPU manufacturer and could significantly alter energy patterns in unpredictable ways, preventing an even comparison of the effects due to the test loads.
Set the isolcpus kernel boot parameter to explicitly prevent the OS process scheduler and load balancing algorithms from using a designated set of CPU cores reserved for the experiment (Cores 0–3 from the same CCX in this case).
Set the core affinity for the experiment’s code via the profiling tool to ensure that the code ran exclusively on Core 0.
Set perf_event_paranoid in Linux to -1 (allow all events), as required by the profiler. The default settings usually restrict access to event monitoring for security reasons.
Disabled the NMI (non-maskable interrupt) watchdog in Linux (also recommended by the profiler) to eliminate the overhead of its periodic checks and improve the reported results.
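As an illustration of the cool-off check mentioned above, a minimal poll-and-wait loop might look as follows. The sysfs node path is an assumption (the actual sensor node, e.g., one exposed by the k10temp driver on this EPYC system, must be located per machine), and the real experiments used run scripts rather than this exact code.

#include <chrono>
#include <fstream>
#include <thread>

// Wait until the package temperature is back at the ~30 °C baseline before
// launching the next test run. The hwmon path below is a placeholder.
void wait_for_baseline(double baseline_c = 30.0) {
    const char* node = "/sys/class/hwmon/hwmon0/temp1_input";  // assumed sensor node
    for (;;) {
        std::ifstream f(node);
        long millideg = 0;
        if (!(f >> millideg)) break;                 // stop if the node is unreadable
        if (millideg / 1000.0 <= baseline_c) break;  // baseline reached
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
}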
4.2. Test Datasets
The datasets consisted of random square matrices ranging in dimension from 50 to 2500. Divide-and-conquer methods adjust the dimension to the next power of two. The range span ensured that the algorithms worked across the three levels of cache and well into the DRAM. Matrix elements were double-precision (8-byte) floating point numbers, the most common operand type in typical high-performance applications, generated by randomizing the bit patterns rather than the stored values. There is some evidence [25], based on extensive experimental work, suggesting that power consumption depends on the data value being read or written. Randomizing the stored bits should help eliminate any bias due to those effects.
The patterns used in the study randomized only the lower 52 bits that hold the fractional part of the number in the prevalent IEEE encoding. The upper bits, exponent and sign, were fixed so that all values fell within a single fixed interval. The rationale is as follows. Each application will likely work, for the most part, with a different range of values depending on the most frequent calculations and the application domain. The upper bits will not change frequently, in general. The lower 52 bits, however, are guaranteed to change constantly throughout any computation in every application. Therefore, randomizing them is a good representation of typical behavior involving floating point calculations. The actual range of values does not matter (any will do for this line of research).
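An illustrative sketch of this kind of bit-level randomization follows. The function is hypothetical rather than the authors’ generator, and the fixed upper bits are passed in as a parameter because the exact interval used in the study is not reproduced here.

#include <cstdint>
#include <cstring>
#include <random>

// Randomize only the low 52 mantissa (fraction) bits of an IEEE 754 double,
// holding the sign and exponent fields fixed via fixed_upper_bits.
double random_mantissa_double(std::mt19937_64& rng, std::uint64_t fixed_upper_bits) {
    std::uint64_t mantissa = rng() & ((std::uint64_t{1} << 52) - 1);  // low 52 bits
    std::uint64_t bits = fixed_upper_bits | mantissa;                 // sign and exponent fixed
    double d;
    std::memcpy(&d, &bits, sizeof d);  // reinterpret the bit pattern as a double
    return d;
}

// Example: a sign of 0 and a biased exponent of 1023 keep every value in [1, 2).
//   std::mt19937_64 rng(42);
//   double x = random_mantissa_double(rng, std::uint64_t{1023} << 52);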
Figure 1 shows a sample of the random patterns used to fill test matrices. The datasets were pre-generated and saved in binary files accessible from any C/C++ program in order to be reusable. All algorithms used the same test data to keep things even.
4.3. Executables
The authors implemented the algorithms in standard C++. A multiply function calls code that implements an algorithm in a separate source file. Therefore, separate executables for each case were generated via the GNU compiler collection (gcc 11.4.0) using the base command: gcc -lstdc++ -lm. The experimental methodology calls for producing code as faithful as possible to the algorithms it purports to execute, in order to be able to argue about the underlying methods. The machine code, in practice, is composed of instructions that reflect a combination of the hardware and the compiler. Optimizing compilers are concerned with efficient execution and can make the code bear little resemblance to the original computations described by an algorithm as long as the code runs correctly. Therefore, it may be desirable to turn off optimization. For this work, only the lowest level, the default, was allowed (compiler flag -O0). It performs basic optimizations, such as loop unrolling, which is arguably more faithful to the algorithms; unrolling resembles hand application of the algorithm (repeating based on a cheap decision) more than the costly, hardware-dependent branches that machines go through. Thus, the resultant code should yield a realistic estimate of what users can expect while remaining reasonably faithful to the methods.
Higher optimization levels attempt to improve code performance at the expense of compilation speed, typically via aggressive code changes. For example, the next level in gcc (accessed via the optimization flag -O2) performs common subexpression elimination [26]: it replaces expressions that result in the same values with variables for reuse where helpful. This optimization may alter the codes in different ways, making comparisons of behavior due to hardware use problematic. In general, such optimizations are counterproductive for this line of research as they may obscure the natural energy behaviors in question; it becomes harder to tell whether the effects stem from the algorithms or from the compilation technology.
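To illustrate the kind of transformation involved (hypothetical function and variable names, not code from this study), common subexpression elimination effectively rewrites the first form below into the second:

// Hand-written form: (a + b) appears twice.
double before(double a, double b, double c, double e) {
    return (a + b) * c + (a + b) / e;
}

// What CSE at -O2 effectively produces: the shared subexpression is computed once.
double after(double a, double b, double c, double e) {
    double t = a + b;
    return t * c + t / e;
}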
4.4. Profiling Tools
The study used the Linux version of the AMD μProf profiler [27], version 4.0, which is provided by AMD and supports extensive power and thermal monitoring [28]. Some recent studies utilized the profiler for a variety of purposes. For example, Lane and Lobrano [29] used the profiler to analyze the memory bandwidth limitations in the second-generation EPYC chips. Lu et al. [30] used the profiler’s energy reporting features to construct reliable thermal maps for an AMD Ryzen processor.
The main advantage of the tool for the study was that it provided a dependable source of nuanced core-specific metrics on the EPYC processor. It can report the runtime for a whole executable or break it down by function or thread. Moreover, it separates CPU time from user modules and system libraries. The detailed information helped confirm that core and thread usage was consistent with expectations. It also revealed that most of the user module time (97–99% typically) was in the multiplication function. Therefore, the user module CPU time was an adequate choice since it captured the essential information while being more resilient to inaccuracies due to measurement resolution.
6. Conclusions and Future Work
This study investigated the energy behavior of a selection of standard matrix multiplication algorithms, looking for advantages, in terms of their total energy and power draw budgets, that one may attribute to their design and hardware usage patterns. It also examined cache misses to make sense of the consumption patterns and to determine the effects of interactions with the memory.
The immediate conclusion was that, at least from a power viewpoint, the Winograd variant of Strassen’s original algorithm (WINOGV) should be the baseline for the divide-and-conquer methods. The baseline should include at least in-place temporary block scheduling like that of STRAS2, which involves minimal effort. Also, switching to a fast, non-recursive procedure should be a must for the best effect; either DEF or WINOG would do for such a procedure. This baseline implementation should yield slightly better performance than that reported in this paper for STRAS2. Timing and power data suggest it would not suffer computationally but could be significantly more power efficient (see data points 1000–1100).
However, for the best power and energy performance, a high-performance algorithm for large matrix sizes should go for the full extent of the memory optimizations described in [6] on top of WINOGV. It should perhaps detect recursions that yield zero blocks in sufficiently large sparse cases. WINOG is preferable to DEF here as the fast procedure for appropriately small dimensions. The choice of the point at which the algorithm switches to that procedure should be custom-optimized for a system. That is not as open-ended as it may sound: the state of the art in technology and the tight design trade-offs between hit times and miss rates often lead to similar configurations within a generation of processors, which likely lead to the same ends. The complexity of such an undertaking could be offset by including it in a system-wide library for frequent reuse. A multicore parallel implementation should probably also start from the previous guidelines.
The study also found the usage patterns of energy-significant functional and memory units to be reasonably good predictors of what to expect in terms of power and energy behavior. Improving energy budgets by optimizing the runtime under the same power draw is perhaps expected. However, optimizing hardware usage was also found to improve the power performance. For some applications, lower power budgets may be more critical, and reducing those budgets, in turn, boosts energy gains as well. The power advantage may be more valuable for a ubiquitous computation that is likely to make up a significant fraction of a large workload on a high-performance system, where the savings may accumulate. Classical algorithm design keeps an eye on the growth of a dominant term of a function that expresses the relation of the repetitions of the most frequent operations to the input size. Similarly, an analysis based on the most frequently used energy-significant hardware units could guide the design of algorithms with better energy profiles. Moreover, this could lead to rethinking the design of those units to increase the savings.
The findings from the simulated case using STRAS3 seem to encourage an examination of a full implementation to better assess its advantage in a more realistic setting, especially in terms of power consumption. In particular, a study should perhaps focus on in-place calculations that minimize memory motion. Moreover, the SIMD architecture of GPUs makes them a natural fit for matrix multiplication. These platforms are now primary environments for applications that rely on that vital computation. It would be interesting to replicate the study on a popular GPU setting like those from Nvidia or AMD. It would also be interesting to formalize a framework for power complexity like that for time and space. Additional studies along the suggested lines may lend further support for seeking the efficiency inherent in algorithms by design. In particular, this may be useful for designing algorithms that manage the use of hardware with a critical influence on energy behaviors.