1. Introduction
With the advent of the artificial intelligence (AI) era, AI technology has made remarkable progress and is being widely applied in real-life scenarios. In particular, its integration with IoT devices has rapidly expanded the reach of AI technology to edge devices, which are closely connected to the end-user environment. This shift has increased the need to process data-centric AI workloads more energy-efficiently and swiftly across various edge devices. Traditional centralized data processing methods, such as cloud computing, have enabled large-scale data processing and storage, but they also have several limitations, such as the energy consumption of data transmission, latency, and data privacy concerns [1,2]. To overcome these limitations, edge computing has gained attention. Edge computing processes data directly on the edge device, that is, close to the data source, effectively reducing network load and minimizing the latency caused by data transmission [3].
Since edge devices have more stringent energy constraints compared to centralized data processing methods, energy-efficient data processing is a major challenge [4,5]. In response, various technological efforts have been made to enhance the energy efficiency of edge computing, one of which is the integration of processing-in-memory (PIM) architecture with edge devices [6,7]. PIM performs data processing inside or near the memory array, reducing latency due to data movement and addressing the key design goal of high energy efficiency in edge devices for memory-centric tasks such as AI applications [8]. In traditional computing architectures, data movement between the processor and memory incurs significant latency and energy consumption, but PIM has the potential to alleviate the memory bottleneck by fully utilizing memory bandwidth [9]. However, most current research on PIM has focused on the development of PIM architecture and its integration with existing processors, with performance improvements being evaluated under fixed computational conditions [10,11]. When considering practical applications, there is greater potential for energy efficiency improvements by reflecting dynamic scenarios, where the computational load fluctuates in real time during the runtime of an application. For instance, in autonomous vehicles equipped with convolutional neural network (CNN) applications for object recognition and road condition assessment, the inference workload per hour can vary significantly depending on factors such as weather, traffic, and the movement of surrounding vehicles [12,13]. Using fixed computational resources to meet the maximum performance requirements for all time intervals without accounting for these fluctuations can lead to inefficient energy consumption. Therefore, to maximize the energy efficiency of the PIM architecture in such realistic scenarios, a flexible approach that accommodates the variability in the computational workload is needed, but this area has yet to be deeply explored.
In this paper, we propose a novel PIM architecture that can flexibly respond to real-time variations in the computational workload of edge applications, as well as an operational algorithm to optimize the energy efficiency of the proposed PIM architecture. Firstly, the proposed PIM architecture consists of PIM modules, where each PIM module is a fundamental unit of computation, comprising memory and a processing element (PE). We introduce two types of PIM modules in the proposed architecture: the low-power PIM (LP-PIM) modules and the high-performance PIM (HP-PIM) modules. In other words, the proposed PIM architecture is a heterogeneous architecture composed of both LP-PIM and HP-PIM modules, providing the capability to flexibly respond to varying computational loads in real time. Next, we propose a data placement optimization algorithm that maximizes the potential of the heterogeneous PIM architecture. This algorithm predicts the changing computational workload of the running application and optimally allocates data to the HP-PIM and LP-PIM modules, thereby improving the energy efficiency of the proposed heterogeneous PIM system. For instance, when the computational workload of the application is low, the system allocates a higher proportion of data to the LP-PIM modules to reduce the workload on the HP-PIM modules, minimizing the dynamic energy consumed by the HP-PIM modules. Conversely, when the computational workload is high, the system actively utilizes the HP-PIM modules to increase the processing throughput of the heterogeneous PIM system. Furthermore, we developed the proposed algorithm by taking into account the time and energy overhead caused by data movement between PIM modules, ensuring that the system meets the computational latency requirements of the application while maximizing energy efficiency.
To verify the functionality and evaluate the effectiveness of the proposed technology, we performed modeling of the memory device and PE, followed by the register transfer level (RTL) design of the entire PIM processor, including the proposed heterogeneous PIM architecture. Additionally, we conducted experiments using field-programmable gate array (FPGA) prototyping with various testbench scenarios to validate the energy-saving effects of the proposed PIM architecture and data placement algorithm. The results demonstrated that the proposed approach maximizes energy efficiency while meeting the computational latency requirements of applications in edge computing environments. More precisely, the developed PIM processor showed superior adaptability to real-time variations in computational load compared to the baseline PIM architecture-based processor, achieving average energy efficiency improvements of up to 29.54% and no less than 21.07% across the test scenarios. These results highlight the potential of the heterogeneous PIM architecture in edge computing environments and show that the proposed technology is well suited to maximizing the efficiency of edge processors running AI applications.
The remainder of this paper is organized as follows. In Section 2, we describe the proposed heterogeneous PIM architecture and the computational mechanism of the hardware in detail. Section 3 presents the data placement algorithm for optimizing the energy efficiency of the heterogeneous PIM architecture. Section 4 is dedicated to the experimental work; in this section, we describe the FPGA prototyping of the PIM processor equipped with the proposed PIM architecture and demonstrate the superiority of the proposed technology by running various scenarios on the developed prototype and measuring the results. Finally, the Conclusions section summarizes the research findings and the significance of this study.
2. Proposed Heterogeneous PIM Architecture for Edge Processors
The PIM architecture is designed to fully utilize memory bandwidth, which can significantly improve the performance of memory-centric applications by alleviating memory bottlenecks. However, when applying such PIM architectures to the processors of edge devices, where power efficiency and battery life are critical, energy efficiency must be carefully considered. While various energy optimization techniques for edge processors have been studied, traditional low-power circuit and design methods determine power efficiency at design time, making them effective only in scenarios where the workload is constant or changes in a periodic and predictable pattern [14,15,16,17]. Moreover, power capping techniques such as workload scheduling or dynamic voltage and frequency scaling (DVFS), which dynamically adjust the power efficiency of processors, introduce additional overhead due to operating circuits and real-time power monitoring [18,19,20]. Therefore, we propose a heterogeneous PIM architecture that can dynamically maximize energy efficiency, even in situations where workloads fluctuate irregularly over time, while delivering high performance in memory-centric tasks such as AI applications.
Figure 1 shows the proposed heterogeneous PIM architecture. The gray-colored section in the figure represents the baseline PIM architecture. The overall configuration of the functional blocks in this baseline adopts the basic structure of several previously studied PIM architectures [21,22,23], including multiple PIM modules composed of PEs and memory banks, a controller to manage these modules, and an interface for external communication. The hallmark of the proposed PIM architecture lies in its integration of two distinct types of PIM modules: the HP-PIM, optimized for intensive computations at the cost of higher power consumption; and the LP-PIM, designed to operate at lower power with reduced performance. This configuration allows the heterogeneous PIM to dynamically balance power efficiency and performance. The PIM controller enables each PIM module to independently perform data I/O or computations based on commands received from the core through the interface. Two PIM controllers independently manage the HP-PIM and LP-PIM modules, ensuring stable PIM operation by synchronizing the PIM modules. The PIM interface between the system and the heterogeneous PIM is designed based on a 32-bit-wide AXI interface, facilitating communication with the core. Specifically, the PIM interface either receives PIM operation requests from the core and forwards them to the PIM controller, or notifies the core when PIM operations are completed. This PIM interface is designed as a single channel with a single data path, operating at a data rate of 1.6 Gbps under a 50 MHz system clock frequency.
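For reference, the quoted data rate follows directly from the bus width and clock frequency, assuming one 32-bit transfer per clock cycle on the single data path:

```latex
% Peak data rate of the single-channel, 32-bit PIM interface,
% assuming one transfer per 50 MHz clock cycle (ideal utilization):
32\,\text{bit} \times 50\,\text{MHz} = 1.6 \times 10^{9}\,\text{bit/s} = 1.6\,\text{Gbps}
```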
Meanwhile, the heterogeneous PIM architecture incorporates two types of memory, SRAM and STT-MRAM, each included in the configuration of the PIM modules at the bank level. SRAM primarily serves as a buffer for data to be processed or for the results of computations performed by the PE. Its fast read and write speeds ensure that the PE can quickly access the data required for computation, preventing a decrease in processing speed due to memory access latency. Due to its relatively large footprint, however, SRAM is not suitable for storing large amounts of data, such as the weights of neural networks in AI applications. On the other hand, neural network weights, once trained, are used for inference without further updates until additional training is required. This characteristic makes non-volatile memory (NVM), which retains data even when power is off, an ideal choice for storing such weights [24]. STT-MRAM, in particular, stands out as an NVM with read and write speeds fast enough to be used as cache memory, while consuming less power than DRAM, which requires periodic refreshes. This makes STT-MRAM highly suitable for edge devices [25]. Consequently, we adopted both SRAM and STT-MRAM in the proposed PIM architecture, ensuring that data are stored in the appropriate memory type based on their characteristics.
Next, in designing the heterogeneous PIM architecture, we devised a data storage method and processing flow to minimize data movement overhead. Conventional PIM architectures typically configure independent computation units at the subarray or bank level, whereas in the proposed heterogeneous PIM architecture, the computation unit is the PIM module. The PE within a PIM module thus cannot directly access data stored in another PIM module without the aid of a controller. If data are stored randomly, performance degradation due to data movement overhead becomes inevitable. Since the optimal data storage location varies depending on the types of computations involved in the application, developers must carefully consider this to minimize data movement overhead. To illustrate this, we use the computations of CNN-based AI inference models, which are frequently employed in various applications, as an example to explain the data storage method and processing flow in the proposed heterogeneous PIM architecture that minimizes data movement overhead.
Figure 2 shows the weight allocation scheme of the heterogeneous PIM for the convolution layer in a CNN. In the convolution layer, the weight (w) corresponds to the filter. In the example, where an input image x is convolved with n different filters to generate n output channels y, the output at pixel position (i, j) of the c-th output channel can be expressed by the following convolution operation:

$$ y_{c}[i, j] = \sum_{a}\sum_{b} w_{c}[a, b]\, x[i+a,\, j+b], \quad c = 1, \dots, n \qquad (1) $$
In this convolution operation, the results of the computations between the input data x and each filter are independent of one another. To reduce data movement overhead, as shown in Figure 2, the weights can be distributed across the PIM modules on a per-channel basis. Accordingly, the n filters are divided between the HP-PIM and LP-PIM modules according to a given ratio, with each module storing a portion of the weights. Unlike the distributed storage of weights across the PIM modules, the input data x for the convolution layer are broadcast to all PIM modules to allow parallel processing, and are stored identically in each module's SRAM buffer. During the computation, each PIM module moves the required weights to its SRAM buffer and sequentially feeds the input data and weights to the PE for multiply–accumulate (MAC) operations. The ACC register is used to store intermediate results from the MAC operations, and once the computation for each filter is completed, the output y is stored in the SRAM buffer.
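As an illustration of this per-channel placement, the following Python sketch (with hypothetical helper names and a fixed split ratio of our choosing, standing in for the placement ratio optimized in Section 3) partitions the n filters between HP-PIM and LP-PIM modules and lets each module convolve the broadcast input with its own filters:

```python
import numpy as np

def place_conv_weights(filters, hp_modules, lp_modules, hp_fraction):
    """Channel-wise split of convolution filters between HP-PIM and LP-PIM modules.

    filters:      array of shape (n, k, k), one k x k filter per output channel
    hp_fraction:  fraction of the n filters assigned to the HP-PIM modules
    Returns a dict mapping module id -> list of (channel index, filter).
    """
    n = len(filters)
    n_hp = int(round(n * hp_fraction))           # channels handled by HP-PIM
    placement = {m: [] for m in hp_modules + lp_modules}
    # Round-robin the first n_hp channels over the HP-PIM modules ...
    for idx in range(n_hp):
        placement[hp_modules[idx % len(hp_modules)]].append((idx, filters[idx]))
    # ... and the remaining channels over the LP-PIM modules.
    for j, idx in enumerate(range(n_hp, n)):
        placement[lp_modules[j % len(lp_modules)]].append((idx, filters[idx]))
    return placement

def run_conv_layer(placement, image):
    """Each module convolves the broadcast input with its own filters (valid padding)."""
    k = next(f.shape[0] for chans in placement.values() for _, f in chans)
    out_h = image.shape[0] - k + 1
    outputs = {}
    for module, chans in placement.items():
        for c, f in chans:                       # MAC loop of one PIM module
            y = np.zeros((out_h, out_h))
            for i in range(out_h):
                for j in range(out_h):
                    y[i, j] = np.sum(f * image[i:i + k, j:j + k])
            outputs[c] = y                       # per-channel result, as in Figure 2
    return outputs
```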
Now, turning our attention to the fully connected layer of a CNN, Figure 3 presents the weight allocation scheme for the heterogeneous PIM architecture. In the fully connected layer, the operation involves a matrix–vector multiplication between the input vector X with j input nodes and the weight matrix W of size n × j, producing an output vector Y with n output nodes. Denoting the elements of X, Y, and W as x, y, and w, respectively, the matrix–vector multiplication at the element level can be described as follows:

$$ y_{k} = \sum_{i=1}^{j} w_{k,i}\, x_{i}, \quad k = 1, \dots, n \qquad (2) $$
In the fully connected layer, the weights of the weight matrix are distributed across the HP-PIM and LP-PIM modules according to a specific ratio, as shown in Figure 3, similar to the example of the convolution layer. Since the computation for each output node can be performed independently according to (2), the weight distribution across the PIM modules for the matrix–vector multiplication should ensure that all the weights required to compute a single output node are contained within a single PIM module. In other words, for the multiplication with the column vector X, W must be partitioned row-wise across the PIM modules, so that each module holds complete rows and the modules can compute in parallel while minimizing data movement overhead.
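A corresponding sketch for the fully connected layer, again with hypothetical helper names, keeps complete rows of W inside a single module so that each output node in (2) is computed locally:

```python
import numpy as np

def place_fc_weights(W, hp_modules, lp_modules, hp_fraction):
    """Row-wise split of the fully connected weight matrix W (shape n x j).

    Complete rows stay inside one module, so each output node is computed
    locally without fetching weights from other PIM modules.
    """
    n = W.shape[0]
    n_hp = int(round(n * hp_fraction))
    placement = {m: [] for m in hp_modules + lp_modules}
    for k in range(n):
        pool = hp_modules if k < n_hp else lp_modules
        placement[pool[k % len(pool)]].append(k)   # this module stores row k of W
    return placement

def run_fc_layer(W, X, placement):
    """Each module performs the dot products for its own rows (Equation (2))."""
    Y = np.zeros(W.shape[0])
    for module, rows in placement.items():
        for k in rows:
            Y[k] = W[k, :] @ X                     # y_k = sum_i w_{k,i} * x_i
    return Y
```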
The proposed heterogeneous PIM architecture, as demonstrated in the previous examples, can achieve optimal performance if the weights are appropriately allocated based on the characteristics of the computations within the application during the development process. Additionally, since the ratio of weights stored in the HP-PIM and LP-PIM modules reflects the proportion of computations each PIM module will handle, this allows for the adjustment of the balance between energy consumption and performance in the heterogeneous PIM. In the following section, we introduce a data placement strategy and discuss methods to optimize the energy consumption of the heterogeneous PIM during the application’s runtime.
3. Optimal Data Placement Strategy for the Proposed Heterogeneous PIM
The performance of the proposed heterogeneous PIM for target AI applications is closely related to the placement of the weight data. The overall computation results for each neural network layer are obtained by aggregating the results from multiple HP-PIM and LP-PIM modules within the heterogeneous PIM. In this process, even though the HP-PIM modules complete all assigned tasks quickly, there may be idle time as they wait for the slower LP-PIM modules to finish their computations. This idle time is directly tied to the performance of the PIM. To minimize it and ensure the PIM operates at its maximum performance, the workload allocation between the HP-PIM and LP-PIM modules must be carefully adjusted, allowing for the fastest possible inference results.
However, in real-time AI application processing, the application processes do not always demand the highest inference speed; in other words, they do not always require the PIM to operate at its maximum performance. When the inference frequency of the application is low, it is possible to satisfy the required latency without having the PIM operate at its highest throughput. In this case, more weights can be allocated to the energy-efficient LP-PIM modules to improve the overall energy efficiency of the processor. Leveraging this, we propose a weight placement strategy that periodically optimizes energy efficiency by adjusting the distribution of weights between the HP-PIM and LP-PIM modules during the application runtime.
The proposed weight placement strategy consists of two algorithms: one that determines the weights to be stored in the HP-PIM and LP-PIM modules during a given time period, and another that predicts the inference frequency of the next period in order to adjust the weight allocation ratio for the subsequent period. First, to explain the former in detail, the number of inferences performed during a given time period $T_p$, which is the interval during which a specific weight placement is maintained, is categorized into $N$ levels according to its magnitude. The highest level, level $N$, corresponds to the maximum number of inferences that the baseline processor with only HP-PIM modules (cf. Figure 1) can perform during $T_p$ at its fastest operating speed, and the remaining levels are associated with correspondingly smaller numbers of inferences, scaled according to their level relative to $N$.

To maintain a consistent inference latency for each inference frequency level, a time constraint $T_c^{l}$, within which each inference at level $l$ must be completed, is set. Figure 4 illustrates the relationships between the time parameters in the proposed method. The figure shows the times required for the HP-PIM and LP-PIM modules, respectively, to process all of their assigned computations, together with the time it takes to complete one inference across all PIM modules. As shown in the figure, since the time to complete one inference is directly determined by the computation time of the slower LP-PIM modules, the time constraint can be translated into a bound on the total computation time of the MAC operations performed by the LP-PIM modules; in other words, it defines how many weight data may be allocated to the LP-PIM modules at each level. Defining $N_{LP}^{l}$ as the maximum number of weight data that can be stored in the LP-PIM modules under the given time constraint $T_c^{l}$, $N_{LP}^{l}$ is obtained from $T_c^{l}$ and the time required for a single MAC operation in an LP-PIM module. Once $N_{LP}^{l}$ is determined for each level, the number of weight data stored in the HP-PIM modules, $N_{HP}^{l}$, is determined as the remainder of the total weight data after subtracting $N_{LP}^{l}$. Based on these two values, $N_{LP}^{l}$ and $N_{HP}^{l}$, the total weight data are evenly distributed across the multiple HP-PIM and LP-PIM modules.
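To illustrate how such a per-level placement could be derived, the sketch below assumes, for simplicity, that each stored weight contributes one MAC per inference and that the LP-PIM modules operate in parallel, so the LP-PIM capacity at level l is the time budget divided by the LP-PIM MAC latency, multiplied by the module count; the function names and the numbers in the example are ours, not values from the paper:

```python
from math import floor

def lp_capacity(t_constraint, t_mac_lp, n_lp_modules):
    """Max number of weights the LP-PIM modules can serve within the latency budget.

    Assumes one MAC per stored weight per inference and that the LP-PIM modules
    work in parallel, so the per-module budget is multiplied by the module count.
    """
    return floor(t_constraint / t_mac_lp) * n_lp_modules

def build_placement_table(total_weights, levels, t_constraints, t_mac_lp, n_lp_modules):
    """Pre-compute (N_HP, N_LP) for every inference-frequency level (lookup table)."""
    table = {}
    for level, t_c in zip(levels, t_constraints):
        n_lp = min(lp_capacity(t_c, t_mac_lp, n_lp_modules), total_weights)
        n_hp = total_weights - n_lp          # remainder goes to the HP-PIM modules
        table[level] = (n_hp, n_lp)
    return table

# Example: 10,000 weights, 4 levels, tighter budgets at higher levels (made-up numbers).
table = build_placement_table(
    total_weights=10_000,
    levels=[1, 2, 3, 4],
    t_constraints=[4.0e-3, 2.0e-3, 1.3e-3, 1.0e-3],  # seconds per inference
    t_mac_lp=1.0e-6,                                  # seconds per LP-PIM MAC (assumed)
    n_lp_modules=4,
)
```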
Along with $N_{LP}^{l}$ for each level, the maximum number of weights that can be stored in the HP-PIM modules under the time constraint is stored in a lookup table and used for runtime data placement during the execution of the application. To derive these pre-calculated values to be stored in the table, we introduced an initialization phase. This phase involves storing all weights evenly across the HP-PIM modules and running a few inference tasks using test data before the application is fully deployed, while measuring the execution time.
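The initialization-phase measurement could look like the following sketch, where pim_module.infer is a hypothetical driver call and the per-MAC time again assumes one MAC per stored weight per inference:

```python
import time

def measure_mac_time(pim_module, test_inputs, weights):
    """Initialization phase: run a few test inferences on one module type and
    estimate the average time per MAC operation from the measured execution time."""
    n_macs = len(weights) * len(test_inputs)     # assumes one MAC per weight per inference
    start = time.perf_counter()
    for x in test_inputs:
        pim_module.infer(x, weights)             # hypothetical PIM driver call
    return (time.perf_counter() - start) / n_macs
```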
However, relying solely on the table information filled during this initialization phase may be insufficient to address the additional time and energy overhead caused by the weight placement operations, which repeat every $T_p$ during runtime. These overheads cannot be captured during initialization because they vary depending on the level applied in the previous $T_p$. The potential problem that arises if these overheads are not accounted for is depicted on the right-hand side of Figure 4. In this figure, it can be observed that the inference latency fails to meet the time constraint $T_c^{l}$ due to the overhead time $t_{ov}$. Specifically, if the slack between the actual inference time and the time constraint $T_c^{l}$ is smaller than $t_{ov}$, the application's required inference latency may be violated.
To mitigate this issue, we introduced a turbo mode to the proposed PIM. The turbo mode defines an additional level above level $N$ with the fastest possible weight placement, whose corresponding $N_{LP}$ and $N_{HP}$ values are also determined during the initialization phase. The turbo mode ensures that the resulting reduction in inference time exceeds the worst-case overhead $t_{ov}$. Although the turbo mode could be further refined by introducing multiple levels for more granular control, this would increase design complexity, so we implemented only a single turbo level in this work.
Next, we developed an algorithm to predict the inference frequency of the application for the next weight placement during the period $T_p$ in which the current weight placement is maintained. Various prediction methods, ranging from statistical techniques to machine learning approaches, could be applied. However, to ensure that the algorithm can be executed on edge devices and to minimize the overhead of integrating it into existing applications, we adopted the lightweight and low-complexity simple exponential smoothing (SES) method. With SES, which applies exponential weighting, the influence of the levels observed in earlier periods gradually diminishes, while more weight is assigned to the most recent $T_p$, and the predicted inference frequency level is obtained by the following recursive formula:

$$ \hat{L}_{t+1} = \alpha L_{t} + (1 - \alpha)\,\hat{L}_{t} \qquad (6) $$

where $\hat{L}_{t+1}$ and $\hat{L}_{t}$ represent the predicted levels at times $t+1$ and $t$, respectively, and $L_{t}$ refers to the actual inference frequency level observed during the previous $T_p$ at time $t$. Additionally, $\alpha$ is the smoothing constant; the closer this value is to 1, the more weight is placed on the most recent level during prediction. We implemented an algorithm that maintains a table of the last 10 actual inference frequency levels observed over the preceding periods, updating it every $T_p$. The initial placement corresponds to the weight placement for level $N$ and, until the level table is fully populated, predictions are made using only the actual level data gathered so far.
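A minimal sketch of the SES-based level prediction and the 10-entry level table is shown below; the function names, the default smoothing constant, and the round-up rule are our assumptions for illustration:

```python
import math

def predict_next_level(level_table, alpha=0.5):
    """Predict the inference-frequency level for the next placement period using SES.

    level_table: the last (up to 10) actual inference-frequency levels, oldest first.
    alpha:       smoothing constant in (0, 1]; closer to 1 favors the most recent level.
    """
    if not level_table:
        return None                      # no history yet: caller keeps the level-N placement
    smoothed = float(level_table[0])     # seed the recursion with the oldest entry
    for actual in level_table[1:]:       # apply Eq. (6) across the stored levels
        smoothed = alpha * actual + (1.0 - alpha) * smoothed
    return math.ceil(smoothed)           # round up so the placement is not undersized

def update_level_table(level_table, actual_level, depth=10):
    """Append the level observed in the last period, keeping only the newest `depth` entries."""
    level_table.append(actual_level)
    return level_table[-depth:]
```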
Figure 5 shows the process through which the inference frequency level for the next $T_p$ is predicted and the table is updated. First, after the weight placement is performed, the contents of the actual inference frequency level table are updated. Then, by iterating through elements 0 to 9 in the table and applying (6), the inference frequency level for the upcoming $T_p$ is predicted from the accumulated data of the previous 10 actual levels. However, there may be cases where the predicted level for the next $T_p$ is incorrect. If the predicted level is higher than the level actually required, the prediction fails, but the latency requirement is still met and the system achieves a certain degree of energy saving in the heterogeneous PIM, although it is not optimal. On the other hand, if the predicted level is lower than the level actually required, the inference latency requirement may not be met. In such cases, the next weight placement skips level prediction and immediately applies the weight placement corresponding to the turbo level (turbo mode operation) to quickly handle the remaining inference requests.
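Putting the pieces together, the decision made at the end of each placement period could be sketched as follows, reusing the SES helpers above; lookup_table maps each level (plus a 'turbo' entry) to its pre-computed (N_HP, N_LP) placement, and all names are hypothetical:

```python
def next_placement(level_table, lookup_table, predicted, actual):
    """Choose the weight placement for the upcoming period, following the flow of Figure 5."""
    # The table of actual inference-frequency levels is updated every period.
    level_table[:] = update_level_table(level_table, actual)
    if actual > predicted:
        # Under-prediction: the latency requirement is at risk, so the next period
        # skips prediction and applies the turbo placement immediately.
        return lookup_table['turbo']
    # Over-prediction or exact prediction: predict the next level with Eq. (6).
    return lookup_table[predict_next_level(level_table)]
```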