Article

An On-Chip Architectural Framework Design for Achieving High-Throughput Multi-Channel High-Bandwidth Memory Access in Field-Programmable Gate Array Systems

Xiangcong Kong, Zixuan Zhu, Chujun Feng, Yongxin Zhu and Xiaoying Zheng
1 Shanghai Advanced Research Institute, Chinese Academy of Sciences, 99 Haike Road, Shanghai 201210, China
2 University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100049, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(3), 466; https://doi.org/10.3390/electronics14030466
Submission received: 8 December 2024 / Revised: 4 January 2025 / Accepted: 12 January 2025 / Published: 24 January 2025

Abstract

The integration of High-Bandwidth Memory (HBM) into Field-Programmable Gate Arrays (FPGAs) has significantly enhanced data processing capabilities. However, the segmentation of HBM into 32 pseudo-channels, each managed by a performance-limited crossbar, imposes a significant bottleneck on data throughput. To overcome this challenge, we propose a transparent HBM access framework that integrates a non-blocking network-on-chip (NoC) module and fine-grained burst control transmission, enabling efficient multi-channel memory access in HBM. Our Omega-based NoC achieves a throughput of 692 million packets per second, surpassing state-of-the-art solutions. When implemented on the Xilinx Alveo U280 FPGA board, the proposed framework attains near-maximum single-channel write bandwidth, delivering 12.94 GB/s in many-to-many unicast communication scenarios, demonstrating its effectiveness in optimizing memory access for high-performance applications.

1. Introduction

Field-Programmable Gate Arrays (FPGAs) are acclaimed for their energy efficiency and circuit reconfigurability, which markedly boost the performance of computation-intensive applications within modern datacenters [1,2,3]. One example is deploying FPGAs equipped with HBM as an embedded solution for real-time data acquisition in radio astronomy: projects such as the Five-hundred-meter Aperture Spherical Telescope (FAST) generate data streams at rates as high as 38 GB/s, placing significant demands on memory bandwidth and access efficiency. The limited bandwidth between compute units and off-chip memory poses a major bottleneck for bandwidth-critical workloads. To overcome this limitation, FPGA vendors such as Xilinx and Intel have integrated High-Bandwidth Memory (HBM) into their FPGA boards [4,5,6]. As a result, a high-throughput, multi-channel HBM access framework is essential for handling large-scale data processing applications.
HBM’s high bandwidth is derived from its 32 independent channels, each connected via 32 Advanced eXtensible Interface (AXI) ports. However, two primary challenges hinder the full utilization of HBM’s bandwidth: first, contention in the lateral links between crossbars, and second, conservative burst inference strategies employed by the HLS compiler. These factors limit the maximum transfer capacity between channels, making seamless, efficient access to all channels without bandwidth degradation a significant challenge in multi-channel access scenarios. As HBM increasingly replaces DDR memory, these issues of suboptimal bandwidth utilization become more pronounced, highlighting the need for improved system design to address these limitations [7,8].
In recent years, many efforts have been made to address the bandwidth degradation caused by channel contention, focusing on two main approaches: data reordering and on-chip networks. Furukawa et al. (2021) [9] proposed an application-specific buffering mechanism for implementing the radiative transfer equation, which connects compute units (CUs) with HBM by organizing data before memory access to avoid cross-channel access. However, this approach lacks general applicability to other workloads. Prakash et al. (2022) [10] demonstrated the use of FPGA overlay NoCs in a multi-die FPGA, but this introduced additional complexity due to on-chip routing. Choi et al. (2021) [7] introduced a butterfly network-based interconnect to interface processing elements with HBM. Additionally, Xue et al. (2023) [11] proposed a custom switching network based on a CLOS topology to replace the built-in crossbar, optimizing HBM access scheduling. However, their design is limited to super logic region 0 (SLR0), leaving SLR1 and SLR2 underutilized.
Additionally, Nabavi Larimi et al. (2021) [12] investigated the power consumption and reliability of HBM under voltage underscaling, demonstrating power savings while characterizing bit flip faults. These studies collectively contribute to understanding and addressing HBM-related challenges, as summarized in Table 1.
The Vitis HLS tool may fail to identify burst transmissions in loops with variable bounds or conditionals, defaulting to AXI burst lengths of 1, which results in reduced bandwidth [7,15]. Ferry et al. (2022) [16] proposed a memory allocation technique and provided a proof-of-concept source-to-source compiler pass that enables burst transfers by modifying the data layout in external memory. Choi et al. (2021) [7] introduced a BRAM-efficient HLS buffering scheme that increases AXI burst length and effective bandwidth, but this leads to high routing complexity and significant BRAM consumption.
Existing approaches either introduce additional complexity through advanced NoC structures or fail to adequately consider FPGA resource allocation for balancing data flow forwarding and computation units. As a result, these approaches are not well suited to deployment challenges in memory-constrained environments, such as those encountered in radio astronomy. Maximizing bandwidth utilization hinges on fully utilizing all pseudo-channels (PCs) and minimizing lateral transmissions across switches. The complexity of maximizing bandwidth utilization stems from two challenges:
  • Preventing deadlocks and enhancing the throughput of the switch units present a challenge to the performance of on-chip networks. The proposed strategy should address these challenges by employing low-latency forwarding and lightweight flow control mechanisms.
  • Inefficient memory access patterns, such as non-contiguous and small-burst-size transactions, can cause AXI burst inference failure. The proposed method ensures reliable burst inference.
This paper presents an improved network-on-chip (NoC) to enable high-throughput access of computational units across HBM channels. Firstly, we integrate a high-performance Omega network between the built-in crossbar layer and the compute units, utilizing on-chip routing to replace crossbar-to-crossbar transmission. Then, we introduce a general NoC-based computing framework that is designed to be adaptable to a wide range of application scenarios. We summarize the contributions of our work as follows:
  • A deadlock-free and efficient network-on-chip (NoC) based on the Omega network topology is presented, which eliminates traffic congestion in the lateral links of the HBM internal crossbar, effectively addressing Challenge 1.
  • A manual burst transmission technique is introduced for systematic control of the burst process, ensuring reliable pipeline burst inference and successful data transmission, thereby tackling Challenge 2.
  • A transparent stream-driven HBM access framework is developed, which decouples HBM memory operations located in the SLR0 region from the NoC-based compute units assigned to the SLR1 region. This framework incorporates manual burst transfer and a deadlock-free NoC, enabling adaptability to a wide range of applications.
  • Our proposed method is evaluated using many-to-many unicast as a benchmark, a component of the correlation algorithm in astronomical applications. The multi-channel data access capability achieves an average utilization that exceeds 90% of the peak bandwidth.

2. Background

In this section, we first introduce the HBM-to-FPGA connection structure in multi-die FPGAs, such as the Xilinx Alveo U280, with a focus on the structural reasons behind the competition for bandwidth caused by multi-channel access across the built-in crossbar. We then provide the background on the two key challenges addressed in this work, on-chip network topology and the AXI burst transfer mechanism, followed by a discussion of the relevant research in each area.

2.1. FPGA-HBM Platform Architecture

The integration of the Advanced eXtensible Interface (AXI) serves as a pivotal interconnect between the user domain and the memory controller, ensuring that the high-throughput communication demands of HBM within the FPGA architecture are met. Figure 1 illustrates the internal architecture of the Xilinx Alveo U280, which consists of three super logic regions (SLRs), with two HBM stacks connected to the bottom SLR (SLR0). Each HBM stack is segmented into eight independent memory channels, and each memory channel is further divided into two pseudo-channels (PCs). An AXI3 interface running at 450 MHz connects the memory controller to the user logic, with a 512-bit AXI3 master interface available on the user side. Consequently, the Alveo U280 FPGA can theoretically achieve a bandwidth of 460 GB/s [5,7,17,18].
The traversal of HBM data across the crossbar interconnect introduces a bottleneck that can reduce overall bandwidth efficiency. Xilinx balances resource utilization and throughput by using eight 4 × 4 AXI switches, rather than a full 32 × 32 crossbar, to enable global access to the 32 pseudo-channels by the 32 AXI channels. These built-in crossbars provide flexible channel access, allowing the four user-logic AXI leaders within a crossbar to communicate with any adjacent PC AXI follower.
If an AXI leader needs to access non-adjacent PCs, it can use the lateral connections between crossbars, although network contention may degrade the achievable bandwidth. For example, as shown in Figure 1, compute unit (CU) 0 (M0-M3) can access PC0-PC3 directly without any loss of bandwidth. However, when M24-M27 of CU1 need to access PC28-PC31, this cross-crossbar access causes M24 and M25, as well as M26 and M27, to compete for the lateral connection lines, significantly reducing bandwidth.

2.2. Network-on-Chip Topology

Previous work has shown that instantiating an AXI Master for each HBM channel and using a fully connected crossbar can lead to excessive resource usage and routing challenges [7,10]. To address these issues, we introduce a custom multi-level interconnect network, the Omega network, positioned between the built-in crossbar and the user logic space. The Omega network is chosen for its inherent advantages of structural regularity, resource efficiency, and deterministic routing [19,20,21].
Structural regularity: An Omega network of dimension $N \times N$ is constructed from $\log_c N$ stages of $c \times c$ switch units, where $c$ is the number of input/output ports per switch unit. Each stage requires only $N/c$ switch units, for a total of $(N/c)\log_c N$ across the network, and each stage is interconnected with the next (except after the last stage) following a specific pattern known as a perfect shuffle [22].
Figure 2a illustrates an Omega network with $N = 8$ and $c = 2$. The perfect shuffle arrangement is implemented by rotating the binary representation of an input's index one bit to the left. Specifically, an input whose index has the $n$-bit binary representation $[x_{n-1} x_{n-2} \cdots x_1 x_0]$ moves, after the perfect shuffle, to position $[x_{n-2} x_{n-3} \cdots x_1 x_0 x_{n-1}]$.
To clarify, consider an example of a perfect shuffle with $N = 8$ for rearranging input indices. Let the input indices be $[0, 1, 2, 3, 4, 5, 6, 7]$. After applying the perfect shuffle, the indices are rearranged as $[0, 2, 4, 6, 1, 3, 5, 7]$. This rearrangement follows from rotating the binary representation of each index, for example:
  • Index 0 (binary 000) remains 0.
  • Index 1 (binary 001) becomes 2 (binary 010).
  • Index 2 (binary 010) becomes 4 (binary 100), and so on.
This perfect shuffle pattern ensures the correct interconnection between switch units in the network, facilitating the routing of data through the Omega network efficiently.
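The rotation is easy to express in code. The helper below is our illustration (it is not taken from the paper's source) and assumes the network size is a power of two, i.e., N = 2^n_bits:

```cpp
#include <cstdint>
#include <cassert>

// Compute the perfect-shuffle position of `index` in an N-input Omega
// network by rotating its n_bits-wide binary representation left by one.
uint32_t perfect_shuffle(uint32_t index, uint32_t n_bits) {
    uint32_t msb  = (index >> (n_bits - 1)) & 1u;  // bit x_{n-1}
    uint32_t mask = (1u << n_bits) - 1u;           // keep only n_bits bits
    return ((index << 1) | msb) & mask;            // left rotate by one bit
}

int main() {
    // For N = 8 (n_bits = 3): inputs 1, 2, 4 land at positions 2, 4, 1.
    assert(perfect_shuffle(1, 3) == 2);  // 001 -> 010
    assert(perfect_shuffle(2, 3) == 4);  // 010 -> 100
    assert(perfect_shuffle(4, 3) == 1);  // 100 -> 001
    return 0;
}
```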
Resource efficiency: For an $N \times N$ network, the Omega network requires a total of $O(N \log_c N)$ interconnects across all stages, as opposed to a crossbar network, which requires $O(N^2)$ switch units and links.
Deterministic routing: In an Omega network, each switching unit independently selects the appropriate output port for packet forwarding based on the packet’s destination address. As shown in Figure 2b, the 2 × 2 switching unit performs six basic operations: direct connection, swapping, and the merging of outputs and separating of inputs on the upper and lower ports. Figure 2c illustrates the internal structure of the switching unit, which implements packet routing and forwarding.
During the routing process, the k-th bit of the destination address determines the routing decision in the k-th stage of the network. Specifically, when this bit is 0, the packet is routed to the upper output port of the switching unit, while if this bit is 1, the packet is routed to the lower output port. The packet’s path is fixed as soon as it enters the on-chip network, ensuring that the routing decision is predetermined. This address-based deterministic routing strategy significantly reduces the routing decision time, enhances transmission reliability, and guarantees that packets are delivered to their intended output ports with low latency, high efficiency, and precision.
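In code, the per-stage decision reduces to a single bit test. This small helper is our illustration of the destination-tag rule, with hypothetical names:

```cpp
// Stage k (0-indexed; stage 0 examines the most significant address bit)
// of an Omega network with 2x2 switches inspects one destination bit:
// 0 routes the packet to the upper output port, 1 to the lower port.
bool route_to_lower(unsigned dest, unsigned stage, unsigned num_stages) {
    return (dest >> (num_stages - 1 - stage)) & 1u;
}
```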

2.3. AXI Memory Access and Burst Transfer

The AXI protocol defines five key channels that enable efficient data transfer: the Read Address Channel (AR) and the Read Data Channel (R) are responsible for handling read transactions, while the Write Address Channel (AW), Write Data Channel (W), and Write Response Channel (B) facilitate write operations. These channels work in parallel to ensure high-throughput and low-latency communication between components [23,24,25].
One of the key features of the AXI protocol is its burst transfer mechanism, which optimizes data transfer by enabling the transmission of multiple data units in a single request. This strategy minimizes overhead and improves efficiency compared to individual data transfers. Figure 3 illustrates the AXI burst transfer process, which is divided into two parts: the left side of the figure shows the data transfer path, including the M-AXI adapter, AXI switch interconnect module, memory controller, and HBM; the right side depicts the sequence of burst read and write requests initiated by a user-defined kernel, where each burst has a length of four data units [17].
During a read operation, the system first sends a read request along with the necessary control signals and enters a waiting state for the data response from HBM. HBM responds by transmitting one data unit per clock cycle, in accordance with the read request, until all requested data units have been transferred. In contrast, the write operation involves sending both the write request and the corresponding data, followed by waiting for a write response signal from HBM, which confirms the successful completion of the write operation.
An analysis of Figure 3 reveals several key observations: first, increasing the burst length and burst size results in higher throughput, as larger data volumes can be transferred in fewer cycles. Furthermore, regardless of whether the operation is a read or write transaction, the AXI switch interconnect unit introduces the largest time overhead in the process.

3. Non-Blocking Switch Unit Design in Omega-Based NoC

Network-on-chip (NoC) consists of two types of nodes: terminal nodes and switch nodes [26]. Terminal nodes act as sources or destinations within the network, which, in the context of this paper, correspond to the compute units or HBM channels. Switch nodes, also known as switch units, handle the routing and forwarding of data within the network, directly influencing the performance of the on-chip network. This section focuses on the switch unit of the NoC and presents an optimized design for a non-blocking FIFO-based back-pressure flow control mechanism, which ensures deadlock-free operation and low latency.

3.1. Back-Pressure Flow Control Mechanism for Switch Units

The internal structure of the switch unit in Figure 2c enables each split node to simultaneously transmit input data to an intermediate buffer based on network information, while the merge node collects the data in round-robin order and forwards them to the output ports. The network information cache retains details such as the destination and number of the packets, and the valid buffer retains the network information for a single transfer transaction.
However, this design faces two performance challenges.
  • Pipeline Startup Delay: The dependency between checking if the output stream is full and writing to it can increase the pipeline initiation interval if the compiler cannot schedule these operations in the same cycle, thus degrading performance [27].
  • Deadlock: Inconsistencies between input and output stream operations can cause severe deadlocks. For instance, a deadlock occurs when the split 1 node writes to buffer 3 while the merge 1 node tries to read from empty buffer 1, blocking both operations.
To address the aforementioned challenges, we propose a non-blocking back-pressure flow control method, which allocates a corresponding valid buffer (validStrm) for each data buffer (dataStrm). By using non-blocking operations on validStrm to replace stream checks on dataStrm, we can determine stream status and perform operations within a single cycle, thus resolving Challenge 1.
The back-pressure flow control mechanism provides stream status feedback using non-blocking operations, effectively resolving the deadlock issue described in Challenge 2 [28]. For instance, the merge 1 node, as illustrated in Figure 2b, performs a non-blocking read of validStrm in buffer 1. If this operation fails, indicating that no data are available, the node promptly shifts its attention to buffer 2’s channel; on a further read failure, it transitions to the subsequent channel.
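The idea can be expressed compactly in Vitis HLS C++. The sketch below is our illustration, not the authors' code: stream names follow the validStrm/dataStrm pairing described above, and one packet is forwarded per valid token.

```cpp
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_uint<512> pkt_t;

// Each data buffer is paired with a valid stream. A failed non-blocking
// read on the valid stream signals "no pending transfer", so the merge
// node moves on to the other buffer instead of blocking on an empty FIFO.
void merge_poll(hls::stream<bool>& validStrm1, hls::stream<pkt_t>& dataStrm1,
                hls::stream<bool>& validStrm2, hls::stream<pkt_t>& dataStrm2,
                hls::stream<pkt_t>& out) {
#pragma HLS pipeline II=1
    bool token;
    if (validStrm1.read_nb(token)) {        // buffer 1 has pending data
        out.write(dataStrm1.read());        // safe: a token implies a beat
    } else if (validStrm2.read_nb(token)) { // otherwise try buffer 2
        out.write(dataStrm2.read());
    }
    // If both non-blocking reads fail, nothing happens this cycle: the
    // status check and the forwarding fit within a single cycle.
}
```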

3.2. State Machine Design for Switch Units

Figure 4 demonstrates the operation of the switching unit through a finite state machine (FSM) for its split and merge functionalities. The FSM transitions are controlled by the flags ‘H’, ‘V’, and ‘T’, which are embedded in the first packet of each data stream. These flags indicate the status of the data stream and guide the FSM’s behavior.
To provide a more detailed explanation, the meanings of these control flags and their combinations are summarized in Table 2. Specifically, the 'H' (Head) flag indicates the start of a packet, the 'V' (Valid) flag shows whether the data stream is valid, and the 'T' (Tail) flag signals the end of a data stream. The combination of these flags is used to explicitly represent the current state of the data stream. For example, a flag combination of '110' signifies the start of a valid data stream, indicating that the processing of a valid data flow should begin. Their precise definition helps maintain the consistency and predictability of the system's behavior, particularly in complex network-on-chip (NoC) environments. Furthermore, the definitions provided in Table 2 are aligned with the packet identification scheme used in Section 5.2, ensuring uniformity throughout the design.
The split node distributes incoming data to multiple streams based on their destination address while recording the packet status into a dedicated status stream. The FSM transitions, shown in Figure 4a, are outlined below:
  • S_IDLE (Idle State): The FSM begins in the S_IDLE state. When a valid packet is detected in the status stream, the FSM transitions as follows: If the destination address (Dst) is 0, the FSM retrieves the total packet count (size1) and moves to S_PROCESS_IN1. If Dst is 1, the FSM retrieves the total packet count (size2) and transitions to S_PROCESS_IN2.
  • S_PROCESS_IN1 (Processing Data for Stream 1): In this state, the FSM forwards packets to stream 1 and decrements size1 with each processed packet. Once all packets are processed (size1 == 0), the FSM transitions back to S_IDLE.
  • S_PROCESS_IN2 (Processing Data for Stream 2): Similar to S_PROCESS_IN1, this state forwards packets to stream 2 and decrements size2. When all packets are processed (size2 == 0), the FSM transitions back to S_IDLE.
  • S_LAST (Final State): If the tail flag (T = 1) is detected during the S_IDLE state, the FSM transitions to S_LAST, indicating the end of the data stream. The FSM remains in this state until a reset signal is received, which re-initializes the system.
The merge node combines two data streams into a single output using a non-blocking polling mechanism. The FSM transitions for this operation, depicted in Figure 4b, are described as follows (a code sketch follows the list):
  • M_IDLE1 (Polling Stream 1): The FSM begins in M_IDLE1, where it attempts a non-blocking read (nb_read) from stream 1. If data are available, the FSM transitions to M_PROCESS_IN1 to process the data. If no data are available (nb_read(IN1) == false), the FSM transitions to M_IDLE2 to poll stream 2.
  • M_PROCESS_IN1 (Processing Data from Stream 1): In this state, the FSM forwards packets from stream 1 to the output stream while decrementing the packet count (size1). Once all packets are processed (size1 == 0), the FSM transitions back to M_IDLE1 to poll stream 1 for new data.
  • M_IDLE2 (Polling Stream 2): When no data are available in stream 1, the FSM transitions to M_IDLE2, where it polls stream 2 using nb_read. If data are available (nb_read(IN2) == true), the FSM transitions to M_PROCESS_IN2 to process the data. If no data are found in both streams, the FSM transitions to S_LAST, indicating the end of the data stream.
  • M_PROCESS_IN2 (Processing Data from Stream 2): In this state, the FSM forwards packets from stream 2 to the output stream while decrementing the packet count (size2). When all packets are processed (size2 == 0), the FSM transitions back to M_IDLE2 to continue polling stream 2 for new data.
  • S_LAST (Final State): When both streams are empty, the FSM transitions to S_LAST, signaling the completion of the merge operation. The FSM remains in this state until a reset signal is received, which re-initializes the system.
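To make the state machine concrete, the sketch below renders the merge-side FSM as Vitis HLS-style C++. This is our reconstruction of Figure 4b, not the authors' source: the packet-count field position in the header is hypothetical, and the tail-flag handling that drives the real S_LAST transition is reduced to a comment.

```cpp
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_uint<512> pkt_t;

enum MergeState { M_IDLE1, M_PROC_IN1, M_IDLE2, M_PROC_IN2 };

// One FSM step per invocation. Streams in1/in2 carry a header packet
// (whose low 16 bits are assumed here to hold the packet count) followed
// by that many data packets.
void merge_fsm_step(MergeState& state, ap_uint<16>& size1, ap_uint<16>& size2,
                    hls::stream<pkt_t>& in1, hls::stream<pkt_t>& in2,
                    hls::stream<pkt_t>& out) {
#pragma HLS pipeline II=1
    pkt_t p;
    switch (state) {
    case M_IDLE1:                        // poll stream 1
        if (in1.read_nb(p)) {            // nb_read(IN1) == true
            size1 = p.range(15, 0);      // hypothetical header layout
            state = M_PROC_IN1;
        } else {
            state = M_IDLE2;             // nothing here: poll stream 2
        }
        break;
    case M_PROC_IN1:                     // forward one packet per cycle
        if (in1.read_nb(p)) {
            out.write(p);
            if (--size1 == 0) state = M_IDLE1;
        }
        break;
    case M_IDLE2:                        // poll stream 2
        if (in2.read_nb(p)) {
            size2 = p.range(15, 0);
            state = M_PROC_IN2;
        } else {
            state = M_IDLE1;             // simplified: in the full design, a
        }                                // tail flag here drives S_LAST
        break;
    case M_PROC_IN2:
        if (in2.read_nb(p)) {
            out.write(p);
            if (--size2 == 0) state = M_IDLE2;
        }
        break;
    }
}
```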

4. Fine-Grained Burst Control for Streamlined HBM Access

The burst inference process in HLS tools is conservative. It automatically merges memory accesses within loops/functions and converts them into read/write requests for global memory [11]. Vitis HLS defines two types of burst requests: sequential burst, where accesses within basic blocks are merged into a single burst, and pipeline burst, where read/write sequences within loop iterations are linked into a continuous burst.
Although we aim to ensure continuous data access in chunks, the Vitis HLS tool may default to sequential bursts or fail to infer bursts altogether in loop scenarios with variable boundaries or conditions. To address this, we implement a manual burst technique using the hls::burst_maxi class, which explicitly manages burst lengths to ensure that memory access patterns align with AXI protocol requirements for successful pipeline bursts [29,30].
Listing 1 illustrates a key code snippet of a scenario where burst inference fails. This code defines nine states, with the W_IDLE state responsible for parsing the header, and the remaining eight states corresponding to memory write operations for HBM channels 0 to 7. In the W_IDLE state, the program analyzes the header; if the packet is valid, it switches to the corresponding HBM channel based on the parsed memory information and proceeds with the subsequent block writes. Due to the conditional checks on packet status and the uncertainty in data stream length within the loop, the HLS tool conservatively performs sequential burst inference, resulting in lower write throughput for the HBM memory channels.
Listing 1. Manual burst transfer function.
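The published listing appears as an image in the original article and is not reproduced here. The following is a minimal sketch of the manual burst pattern it describes, written with the Vitis HLS hls::burst_maxi API; the single-channel scope and the function and parameter names are our simplification of the paper's eight-channel state machine.

```cpp
#include <hls_stream.h>
#include <hls_burst_maxi.h>
#include <ap_int.h>

typedef ap_uint<512> pkt_t;

// Manual pipeline burst (sketch): issue one explicit write request covering
// `size` beats so the tool emits a single pipelined AXI burst instead of
// per-beat transactions, then wait for the write response.
void write_channel(hls::stream<pkt_t>& dataStrm,
                   hls::burst_maxi<pkt_t> hbm,   // one HBM pseudo-channel
                   unsigned offset, unsigned size) {
    hbm.write_request(offset, size);     // announce the burst length up front
    for (unsigned i = 0; i < size; ++i) {
#pragma HLS pipeline II=1
        hbm.write(dataStrm.read());      // one 512-bit beat per cycle
    }
    hbm.write_response();                // confirm the burst completed
}
```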
The packet information allows for fine-grained manual control over the burst transfer process, enabling pipelined data transfers. Before starting the transfer, a write request specifying the burst size is sent to the AXI interface. During the burst write phase, data are transferred to the memory continuously. Upon completion, a write response from the memory controller confirms the successful burst transaction.
The synthesized timing diagram in Figure 5 demonstrates the pipeline burst technique applied to the code in Listing 1. In conventional transfer mechanisms, each subsequent request is initiated only after the completion of the previous one, resulting in temporal gaps between transmissions. By contrast, the pipeline burst technique, which explicitly specifies the burst transfer size, allows multiple data transmissions to occur within a single request. This significantly improves data transfer efficiency by enabling concurrent handling of several data packets, eliminating the idle times that would otherwise occur during sequential transfers. For instance, with manual pipeline burst optimization, a single transaction can transfer multiple packets (three packets, as shown in the upper part of Figure 5), whereas in traditional sequential transfers only one packet is transmitted per transaction. As a result, the technique notably increases throughput by reducing the number of transactions and maximizing the data transferred within each burst.

5. Efficient NoC and Burst Transfer for Managing Data Streams

In this section, we propose a transparent HBM access framework tailored to meet the real-time computation and high-bandwidth memory access demands of large-scale data streams, with applications such as radio astronomy, as shown in Figure 6. This framework combines a non-blocking NoC with fine-grained burst transfer techniques to facilitate seamless and efficient multi-channel HBM access.
We begin by introducing the framework’s components and overall design, highlighting the role of data packets as the smallest transmission units for transactional transfers. Subsequently, we delve into the structural details of the data packet.

5.1. Transparent HBM Access Framework

The design of the transparent stream-driven HBM access framework in Figure 6 decouples computing from memory, enabling asynchronous parallel processing to enhance system throughput, with communication facilitated via AXI Stream. The term stream-driven refers to a continuous, batch-oriented approach to memory access, in contrast to handling individual or isolated data transactions. Our multi-channel access framework prioritizes high throughput and low latency during sustained memory transfers. While this approach is highly efficient for large, continuous data streams, it introduces a known trade-off: performance degradation when handling smaller, fragmented transfers. This compromise is inherent to our design, which focuses on optimizing large-scale data transfers.
The on-chip switch kernel supervises memory interfaces for computational units and implements on-chip network routing for the data streams. The NoC integrated within the kernel is specifically designed to mitigate the bandwidth degradation issues that arise from multi-channel access. Concurrently, the HBM wrapper kernel is responsible for parsing memory requests from compute units, managing HBM transactions, and handling data flow communication. The use of wide data paths (512 bits) and manual burst detection improves memory access efficiency.
To enhance routing efficiency and resource allocation, the HBM Wrapper kernel is strategically assigned to the SLR0, which is co-located with the HBM. The compute unit and on-chip switch kernel are positioned in SLR1.

5.2. NoC Packet Design for Multi-Channel Memory Access

A packet is the smallest unit of information that is routed by the network in our framework. Data requests initiated by the computational units to the HBM management module are accomplished through the communication of these data packets. Figure 7a provides a detailed depiction of the format design for the 512-bit request data packet. In this design, each channel is allocated 16 bits, a configuration that sufficiently meets the information requirements of the 32 channels available in the HBM. The least significant bit of each channel plays a critical role, determining whether data are to be read from that channel.
Data exchange between the computational units and HBM channels primarily relies on response data packets. To maximize port width utilization and ensure efficient burst transfers, we implement a two-level data aggregation strategy, as shown in Figure 7b. The first level aggregates the state and control information of the data stream into the header format, and multiple packets are then combined to form a continuous data stream. The second level of aggregation occurs within the data packet itself, as depicted in the data format. Each data block is 32 bits, and by merging 16 such data blocks into a single 512-bit packet, we significantly improve the bandwidth utilization of the write port.
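A minimal sketch of the second-level aggregation in Vitis HLS C++ follows; the function and stream names are ours, and the input count is assumed to be a multiple of 16.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<32>  block_t;
typedef ap_uint<512> pkt_t;

// Pack 16 32-bit data blocks into one 512-bit packet so that each write
// fills the full port width (second-level aggregation, our sketch).
void pack_blocks(hls::stream<block_t>& in, hls::stream<pkt_t>& out,
                 unsigned num_blocks) {   // assumed multiple of 16
    pkt_t pkt = 0;
    for (unsigned i = 0; i < num_blocks; ++i) {
#pragma HLS pipeline II=1
        unsigned lane = i % 16;
        pkt.range(32 * lane + 31, 32 * lane) = in.read();
        if (lane == 15) out.write(pkt);   // a full 512-bit packet is ready
    }
}
```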
For clarity, the key fields of both request and response packets, including the destination channel ID, address offset, data length, and other essential parameters, are summarized in Table 3.
Algorithm 1 demonstrates the process of generating request packets, which are responsible for initiating data requests from the computational units to the HBM channels. The process begins by checking the enable signal array (sel) for each HBM channel, which determines whether a particular channel is active for the request (line 3). For each active channel, the corresponding request packet fields are populated with the data length, offset, and target channel index (lines 3–11). The generated packet is then written to the output request stream (line 13). This process ensures that the computational units can effectively communicate the necessary information to the HBM management module, requesting the required data for further processing.
Algorithm 1: Request packet generation process.
Input: requestStrm: Output request stream; sel: The enable signal array for the HBM channel; Len: Read size; offset: Read offset; Dest: target HBM channel index
Output: Generated request packets written to requestStrm
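The published pseudocode is rendered as an image in the original and is not reproduced here. The sketch below follows the steps described above under stated assumptions: the 16-bit per-channel field layout (enable bit in the LSB, followed by length and offset) is hypothetical.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> pkt_t;

// Sketch of Algorithm 1: build one 512-bit request packet with a 16-bit
// field per HBM channel, then write it to the request stream.
void gen_request(hls::stream<pkt_t>& requestStrm, const bool sel[32],
                 ap_uint<8> Len, ap_uint<7> offset) {
    pkt_t req = 0;
    for (int ch = 0; ch < 32; ++ch) {
#pragma HLS unroll
        if (sel[ch]) {                            // channel enabled?
            ap_uint<16> field = 0;
            field[0] = 1;                         // LSB: read-enable bit
            field.range(8, 1)  = Len;             // hypothetical: read size
            field.range(15, 9) = offset;          // hypothetical: read offset
            req.range(16 * ch + 15, 16 * ch) = field;
        }
    }
    requestStrm.write(req);                       // one request packet
}
```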
Algorithm 2 illustrates the process of generating response packets, which transmit processed data from the computational units to the HBM channels. Initially, a header packet is created, containing control bits and metadata such as the source and destination channel identifiers, along with the total data size (lines 1–3). Once the header is prepared, the data are added to the response packet in sequential blocks (lines 4–7). After all the data have been processed, a final packet is sent with control bits indicating the completion of the data transfer (line 8). These response packets are written to the output data stream (dataStrm) and forwarded to the HBM channels (line 9).
Algorithm 2: Response packet generation process.
Input: src: Input data; dataStrm: Output data stream; pcID: HBM pseudo channel index; data_num: Total data size
Output: Generated packets written to dataStrm
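As with Algorithm 1, the pseudocode image is not reproduced here. The sketch below follows the described header/data/tail sequence; the bit positions of the flags and metadata fields are our assumptions, kept consistent with the H/V/T convention of Table 2.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> pkt_t;

// Sketch of Algorithm 2: emit a header ("110"), the data blocks, and a
// tail packet ("001") marking the end of the transfer.
void gen_response(const pkt_t* src, hls::stream<pkt_t>& dataStrm,
                  ap_uint<5> pcID, ap_uint<32> data_num) {
    pkt_t header = 0;
    header[511] = 1;                     // H: head flag
    header[510] = 1;                     // V: valid flag
    header[509] = 0;                     // T = 0 -> "110": stream starts
    header.range(36, 32) = pcID;         // hypothetical: destination PC
    header.range(31, 0)  = data_num;     // hypothetical: total data size
    dataStrm.write(header);

    for (ap_uint<32> i = 0; i < data_num; ++i) {
#pragma HLS pipeline II=1
        dataStrm.write(src[i]);          // sequential data blocks
    }

    pkt_t tail = 0;
    tail[509] = 1;                       // "001": transfer complete
    dataStrm.write(tail);
}
```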

6. Evaluation and Analysis

In this section, we provide a detailed experimental validation and performance analysis of the proposed framework for transparent high-bandwidth memory access. First, we describe the experimental setup in detail. Next, we focus on the on-chip network, conducting experimental modeling and in-depth analysis of the switch units to validate the effectiveness of the Omega-topology-based NoC design in packet routing and forwarding. Finally, the transparent HBM access framework is applied to a many-to-many unicast communication scenario, thoroughly evaluating its memory data transfer bandwidth performance in radio astronomy processing applications.

6.1. Experimental Setup

The experimental setup is entirely FPGA-based, utilizing the Xilinx Alveo U280 accelerator card equipped with 8 GB of high-bandwidth memory (HBM) [31]. All computations and evaluations are executed solely within the FPGA, without involving PCIe communication bandwidth with the CPU. The FPGA is programmed using the Xilinx Vitis 2022.2 development framework, and all compute kernels are implemented in High-Level Synthesis (HLS) [29,32].
The evaluation comprises two main experiments:
  • Switch Unit Analysis for On-chip Network Performance: We assessed the performance of the Omega network topology by constructing data streams with varying destinations to measure the contention overhead of continuous data transfers. A stability probability P was defined to control the consistency of each stream’s destination. This was achieved by modifying the response header packets. The experiment involved modeling switch unit behavior under these scenarios, followed by hardware emulation using a testbench to simulate a large number of randomized data streams. The measured results were then compared with the predictive model to validate accuracy. By leveraging the structural regularity of the Omega topology, the throughput of the network was ultimately determined.
  • Transparent HBM Access Framework Validation: To evaluate the framework’s performance, we adopted a many-to-many unicast communication scenario, a setup frequently encountered in radio astronomy applications. The complexity of the communication pattern ranged from 1 × 1 (data read from one HBM pseudo-channel and written to one pseudo-channel) to 8 × 8 (data read from eight pseudo-channels and written to eight pseudo-channels). Data are processed in continuous blocks, ensuring equal-sized transfers across all pseudo-channels. Using the same benchmark as Choi et al. (https://github.com/UCLA-VAST/hbmbench (accessed on 2 December 2020)) [13], we modified the response header packets to control the destination and quantity of the data streams. This setup facilitated a comprehensive comparison of the effective bandwidth improvements achieved by our NoC-based framework over the built-in crossbar design.

6.2. Evaluating On-Chip Network Through Switch Unit Analysis

The switch unit plays a decisive role in the routing efficiency of the network [26]. A standard 2 × 2 switch can route both input data streams to their respective outputs when the output ports are distinct. However, if the output ports coincide, one data stream must be temporarily halted, awaiting the subsequent cycle for transmission. The bandwidth of the switch node ($BW_{switch}$) is influenced by the destinations of the input data streams and can be calculated as

$$BW_{switch} = \frac{\#packets}{\#latency} = \frac{2(n \times len)}{(n + \sigma(n,P))(len + \tau)} \quad (1)$$

We define a sequence of continuous packet transfers as a transaction. The variable $n$ represents the number of such transactions required to complete a specific task, while $len$ denotes the number of packets transferred in each individual transaction. The overall delay $\tau$ experienced by a switch unit in forwarding a transaction is attributed to operational delays, including those from logical switching that prevent scheduling within a single pipeline cycle; experimentally, $\tau = 7$. Furthermore, we introduce the stability probability $P$, which describes the likelihood that the destination of the current transfer transaction remains unchanged from the previous one, and $\sigma(n,P)$ denotes the extra delay introduced by data routing conflicts.
To assess the impact of $\sigma$, we conducted multiple simulations examining how variations in the probability $P$ and transaction count $n$ influence the delay, as depicted in Figure 8. At $P = 0$, each input stream targets a different port per cycle, and the switch unit sustains concurrent data production on both output ports. When $P = 100\%$, the input streams have constant destinations, covering both the best and worst cases: $\sigma$ fluctuates between zero and $n$, averaging $n/2$. Moreover, the value of $\sigma$ increases as $P$ increases. In our simulation with $n = 50$ and $P = 50\%$, the average latency was measured to be 5.8968.
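The model in Equation (1) is straightforward to evaluate numerically. The helper below is our utility for exploring it, with $\sigma(n,P)$ supplied from the simulations shown in Figure 8; it is not part of the paper's artifact.

```cpp
#include <cstdio>

// Equation (1): BW_switch = 2(n*len) / ((n + sigma)(len + tau)), in
// packets per cycle; multiply by the clock frequency for packets/s.
double bw_switch(double n, double len, double sigma, double tau = 7.0) {
    return 2.0 * n * len / ((n + sigma) * (len + tau));
}

int main() {
    // n = 50 transactions of len = 64 packets: the conflict-free case
    // versus the P = 100% average case, where sigma ~= n/2 = 25.
    std::printf("sigma = 0:  %.3f packets/cycle\n", bw_switch(50, 64, 0.0));
    std::printf("sigma = 25: %.3f packets/cycle\n", bw_switch(50, 64, 25.0));
    return 0;
}
```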
We measured our circuit structure using simulation data with a 50% stability probability. Table 4 presents a comparative analysis of our Omega-type network against prior NoC designs. The low average latency and high operating frequency of our switch units enable a switching bandwidth of 692 M-packets per second, outperforming other NoCs in the comparison. Designed as an eight-input, eight-output network with three layers, our network excels in zero-load latency due to uniform delay across each layer’s switches. Moreover, the estimated bandwidth and latency, calculated using Equation (1), closely match the empirical data from our circuit, validating the precision of our predictive model.
The benefits of the Omega-based NoC are particularly compelling, as evidenced by the comparative analysis in Table 4. The network achieves a reduction in resource utilization due to its modular and recursive design, which minimizes the complexity of individual switch units while maintaining scalability for larger network sizes. Compared to traditional mesh or torus networks, the Omega-based NoC requires fewer interconnect resources, as its structure inherently reduces the number of cross-layer connections. This resource efficiency directly translates into higher operating frequencies and lower power consumption, allowing the network to sustain high throughput under constrained hardware budgets. Additionally, the deterministic routing mechanism of the Omega-based design ensures predictable latency, which is critical for time-sensitive applications.

6.3. Evaluating the Transparent HBM Access Framework

We assess the FPGA resource utilization and effective bandwidth of our NoC framework using a many-to-many unicast benchmark, essential for distributing antenna data in radio astronomy algorithms. We conducted an analysis of the channel write bandwidth under various working conditions and compared the effective bandwidth of our NoC-based work with that of Choi’s open-source built-in crossbar work [13], as detailed in Table 5. Experiments were conducted at 300 MHz with a 512-bit data width, yielding a theoretical peak average channel bandwidth of 14.375 GB/s.
For the 8 × 8 workload, data must traverse a lateral link from one pseudo-channel to be written to another pseudo-channel. The experimental results indicate that the baseline achieved an average channel bandwidth utilization of only about 61% under the 4 × 4 and 8 × 8 working types, suggesting that insufficient burst size, rather than crossbar contention, was the main bottleneck affecting bandwidth performance. Our work, owing to explicit manual burst transfer and Omega-network-based on-chip routing, achieved an average channel bandwidth utilization above 90% across the working types. In addition, the slight decrease in bandwidth utilization observed as the working type scales from 2 × 2 to 8 × 8 is attributed to the intensified competition among an increasing number of AXI masters for access to the same pseudo-channel.
Figure 9 shows the waveform of multi-channel write operations using the manually pipelined burst transfer technique. After pipeline burst inference, the transfer length for each operation is set to 64, with a burst size of 64 bytes, resulting in a total burst length of 4 KB. This optimization strategy not only improves data transfer efficiency but also enhances the overall bandwidth performance of the system.
We evaluate the resource performance of the Transparent Stream-Driven HBM Access Framework, where the HBM management module is located in the SLR0 region, and the on-chip interconnect module and computation units are deployed in the SLR1 region. In the 8 × 8 communication scenario configuration, only 16 HBM channels were enabled.
Figure 10a presents the FPGA device view after circuit synthesis, clearly indicating the kernel mapping for both the SLR0 and SLR1 regions. Figure 10b shows the resource usage report for the computational framework. In the SLR0 region, the resource utilization of block RAM (BRAM) is significantly increased due to the need to configure the complete AXI protocol for the 16 HBM channels. This highlights the growing challenge of allocating critical resources such as BRAM under high-communication-density configurations.

7. Conclusions

To provide scalable and efficient solutions for large-scale data stream processing, which is increasingly critical in big data streaming domains such as radio astronomy, we propose a transparent HBM access framework designed to address bandwidth limitations in FPGA-HBM systems. By integrating a high-performance Omega network between the built-in crossbars and compute units, combined with fine-grained burst control, the framework enables flexible access to HBM channels. The experimental evaluation demonstrated a switching bandwidth of 692 M-packets/s and effective bandwidth utilization in many-to-many unicast scenarios, achieving near-maximum write transfer rates of 12.94 GB/s per channel with a transfer size of 4 KB. These results validate the effectiveness of the proposed approach in improving HBM bandwidth utilization and resource balance, demonstrating its potential to support similar high-throughput workloads in scientific applications.

Challenges and Future Work

While the proposed framework significantly improves bandwidth and scalability, some challenges remain.
First, the scalability of the framework in larger FPGA-HBM systems, especially under dynamic or irregular memory access patterns, requires further study. Adaptive routing algorithms and better arbitration mechanisms are promising areas for improvement. Second, the framework currently uses only two of the three super logic regions (SLRs) available on the Alveo U280 FPGA, leaving SLR2 underutilized. Lastly, the current implementation focuses on synthetic benchmarks to validate the framework. Expanding the evaluation to real-world applications, such as machine learning or scientific computing workloads, will provide deeper insights into its performance and applicability.
Future work will aim to address these challenges, explore cross-SLR optimizations, and extend the framework to support more diverse FPGA architectures.

Author Contributions

Methodology, X.K.; Software, X.K. and Z.Z.; Validation, C.F.; Formal analysis, X.K., Y.Z. and X.Z.; Writing—original draft, X.K.; Writing—review and editing, X.K. and Y.Z.; Supervision, X.Z., Z.Z. and Y.Z.; Funding acquisition, X.Z. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National SKA Program of China (grant number 2020SKA0120202) and the National Natural Science Foundation of China (grant number 12373113).

Data Availability Statement

The dataset employed in this study is publicly accessible and can be retrieved from the following link: https://github.com/UCLA-VAST/hbmbench (accessed on 2 December 2020). The link also provides the scripts for data simulation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Malakonakis, P.; Isotton, G.; Miliadis, P.; Alverti, C.; Theodoropoulos, D.; Pnevmatikatos, D.; Ioannou, A.; Harteros, K.; Georgopoulos, K.; Papaefstathiou, I.; et al. Preconditioned Conjugate Gradient Acceleration on FPGA-Based Platforms. Electronics 2022, 11, 3039.
2. Du, C.; Yamaguchi, Y. High-Level Synthesis Design for Stencil Computations on FPGA with High Bandwidth Memory. Electronics 2020, 9, 1275.
3. Guo, S.; Zheng, L.; Jin, X. Accelerating a radio astronomy correlator on FPGA. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February 2018; pp. 85–89.
4. Xu, R.; Han, F.; Ta, Q. Deep Learning at Scale on NVIDIA V100 Accelerators. In Proceedings of the 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Dallas, TX, USA, 12 November 2018; pp. 23–32.
5. Iskandar, V.; Ghany, M.A.A.E.; Göhringer, D. Near-Memory Computing on FPGAs with 3D-stacked Memories: Applications, Architectures, and Optimizations. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 1–32.
6. Holzinger, P.; Reiser, D.; Hahn, T.; Reichenbach, M. Fast HBM Access with FPGAs: Analysis, Architectures, and Applications. In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Portland, OR, USA, 17–21 June 2021; pp. 152–159.
7. Choi, Y.-k.; Chi, Y.; Qiao, W.; Samardzic, N.; Cong, J. HBM Connect: High-Performance HLS Interconnect for FPGA HBM. In Proceedings of the FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 28 February–2 March 2021; pp. 116–126.
8. Puranik, S.; Barve, M.; Rodi, S.; Patrikar, R. FPGA-Based High-Throughput Key-Value Store Using Hashing and B-Tree for Securities Trading System. Electronics 2023, 12, 183.
9. Furukawa, K.; Kobayashi, R.; Yokono, T.; Fujita, N.; Yamaguchi, Y.; Boku, T.; Yoshikawa, K.; Umemura, M. An efficient RTL buffering scheme for an FPGA-accelerated simulation of diffuse radiative transfer. In Proceedings of the 2021 International Conference on Field-Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–9.
10. Prakash, S.K.; Patel, H.; Kapre, N. Managing HBM Bandwidth on Multi-Die FPGAs with FPGA Overlay NoCs. In Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), New York, NY, USA, 15–18 May 2022; pp. 1–9.
11. Xue, S.; Liang, H.; Wu, Q.; Jin, X. Scheduling Memory Access Optimization for HBM Based on CLOS. In Proceedings of the 2023 25th International Conference on Advanced Communication Technology (ICACT), Pyeongchang, Republic of Korea, 19–22 February 2023; pp. 448–453.
12. Nabavi Larimi, S.S.; Salami, B.; Unsal, O.S.; Kestelman, A.C.; Sarbazi-Azad, H.; Mutlu, O. Understanding Power Consumption and Reliability of High-Bandwidth Memory with Voltage Underscaling. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 517–522.
13. Choi, Y.-k.; Chi, Y.; Wang, J.; Guo, L.; Cong, J. When HLS meets FPGA HBM: Benchmarking and bandwidth optimization. arXiv 2020, arXiv:2010.06075.
14. Zhou, P.J.; Yu, Q.; Chen, M.; Qiao, G.C.; Zuo, Y.; Zhang, Z.; Liu, Y.; Hu, S.G. Fullerene-Inspired Efficient Neuromorphic Network-on-Chip Scheme. IEEE Trans. Circuits Syst. II: Express Briefs 2024, 71, 1376–1380.
15. Xiao, Z.; Chamberlain, R.D.; Cabrera, A.M. HLS Portability from Intel to Xilinx: A Case Study. In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC), Virtual, 20–24 September 2021; pp. 1–8.
16. Ferry, C.; Yuki, T.; Derrien, S.; Rajopadhye, S. Increasing FPGA Accelerators Memory Bandwidth With a Burst-Friendly Memory Layout. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2023, 42, 1546–1559.
17. Xilinx Inc. AXI High Bandwidth Memory Controller v1.0: LogiCORE IP Product Guide, PG276 (v1.0); Xilinx Inc.: San Jose, CA, USA, 2021; Vivado Design Suite.
18. Lu, A.; Fang, Z.; Liu, W.; Shannon, L. Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking. In Proceedings of the FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 28 February–2 March 2021; pp. 105–115.
19. Sehwag, V.; Prasad, N.; Chakrabarti, I. A Parallel Stochastic Number Generator With Bit Permutation Networks. IEEE Trans. Circuits Syst. II: Express Briefs 2018, 65, 231–235.
20. Almazyad, A.S. Optical omega networks with centralized buffering and wavelength conversion. J. King Saud Univ. Comput. Inf. Sci. 2011, 23, 15–28.
21. Vasiliadis, D.C.; Rizos, G.E.; Margariti, S.V.; Tsiantis, L.E. Comparative study of blocking mechanisms for packet switched Omega networks. In Proceedings of the EHAC '07: The 6th WSEAS International Conference on Electronics, Hardware, Wireless and Optical Communications, Stevens Point, WI, USA, 16–19 February 2007; pp. 18–22.
22. Almazyad, A.S. A New Look-Ahead Algorithm to Improve the Performance of Omega Networks. Math. Comput. Appl. 2010, 15, 156–165.
23. Shalini, N.; Shashikala, K.P. Design and Functional Verification of Axi2OCP Bridge for Highly Optimized Bus Utilization and Closure Using Functional Coverage. In Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications; Satapathy, S.C., Bhateja, V., Udgata, S.K., Pattnaik, P.K., Eds.; Springer: Singapore, 2017; pp. 525–535.
24. Bhaktavatchalu, R.; Rekha, B.S.; Divya, G.A.; Jyothi, V.U.S. Design of AXI bus interface modules on FPGA. In Proceedings of the 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), Ramanathapuram, India, 25–27 May 2016; pp. 141–146.
25. Nakkala, S.; Vaddavalli, S.; Arja, S.S. Design and Verification of AMBA AXI Protocol. In Proceedings of the 2024 International Conference on Electronics, Computing, Communication and Control Technology (ICECCC), Bengaluru, India, 2–3 May 2024; pp. 1–6.
26. Mnejja, S.; Aydi, Y.; Abid, M.; Monteleone, S.; Catania, V.; Palesi, M.; Patti, D. Delta Multi-Stage Interconnection Networks for Scalable Wireless On-Chip Communication. Electronics 2020, 9, 913.
27. Eran, H.; Zeno, L.; István, Z.; Silberstein, M. Design Patterns for Code Reuse in HLS Packet Processing Pipelines. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), San Diego, CA, USA, 28 April–1 May 2019; pp. 208–217.
28. Boraten, T.H.; Kodi, A.K. Securing NoCs Against Timing Attacks with Non-Interference Based Adaptive Routing. In Proceedings of the 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Torino, Italy, 4–5 October 2018; pp. 1–8.
29. Xilinx Inc. Vitis High-Level Synthesis User Guide (UG1399), Version 2023.1; Xilinx Inc.: San Jose, CA, USA, 2023.
30. Li, H.; Rieger, P.; Zeitouni, S.; Picek, S.; Sadeghi, A.R. FLAIRS: FPGA-Accelerated Inference-Resistant & Secure Federated Learning. In Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 4–8 September 2023; pp. 271–276.
31. Xilinx Inc. Alveo U280 Data Center Accelerator Card User Guide (UG1314), Version 1.2.1; Xilinx Inc.: San Jose, CA, USA, 2019.
32. Xilinx Inc. Vivado Design Suite User Guide: High-Level Synthesis (UG902), Version 2020.1; Xilinx Inc.: San Jose, CA, USA, 2021.
Figure 1. Xilinx HBM subsystem in Alveo U280.
Figure 2. Block diagram for an 8 × 8 Omega network structure with dual split–merge switch unit.
Figure 3. The process of burst transfer.
Figure 4. State machine design of the switching unit.
Figure 5. Performance impacts of manual pipeline write burst and non-pipeline burst transfer.
Figure 6. An overview of the design for a transparent HBM access framework.
Figure 7. Design of packet formats for request and response processes.
Figure 8. σ(n, P) vs. P for different n.
Figure 9. Waveform results of burst transfer.
Figure 10. Resource utilization and device view of the HBM transparent access framework on the Alveo U280. (a) Device view of the HBM access framework. (b) Resource usage of SLR0 and SLR1.
Table 1. Summary of related works on HBM and related mechanisms.

| Proposed Method | Work | Year | Contribution | Topology | Routing Logic |
|---|---|---|---|---|---|
| Application-Specific Buffering Mechanism | Furukawa et al. [9] | 2021 | Implementation of radiative transfer equation | / | / |
| | Choi et al. [13] | 2020 | Baseline design for HBM bottleneck analysis | / | / |
| Energy Conversion | Nabavi Larimi et al. [12] | 2021 | Power consumption analysis of HBM under voltage underscaling | / | / |
| NoC | ARouter [14] | 2024 | NoC for neuromorphic systems | Fullerene-60 surface topology | Adaptive |
| NoC | CMRouter [14] | 2024 | NoC for neuromorphic systems | Fullerene-60 surface topology | Reconfigurable |
| NoC | Prakash et al. [10] | 2022 | NoC for improving HBM bandwidth utilization | FPGA overlay NoCs | Adaptive |
| NoC | Xue et al. [11] | 2023 | NoC for improving HBM bandwidth utilization | CLOS | Deterministic |
| NoC | HBM Connect [7] | 2021 | NoC for improving HBM bandwidth utilization | Butterfly | Deterministic |
| NoC | This work | 2025 | Non-blocking NoC and manual burst control for high HBM throughput access | Omega | Deterministic |
Table 2. Finite state machine identifiers and their description.

| Identifier | Description |
|---|---|
| H (Head) | Packet header flag |
| V (Valid) | Packet valid flag |
| T (Tail) | Packet tail flag |
| HVT = 110 | Processing starts, data stream valid |
| HVT = 001 | Current transfer ends, but data stream continues |
| HVT = 011 | End of data stream |
| nb_read | Non-blocking read |
Table 3. Packet format identifiers and their descriptions.

| Category | Identifier | Description |
|---|---|---|
| Control Flags | H, V, T | Flags indicating the status of the data stream (refer to Table 2). |
| Request Packet Information | Dest | HBM channel ID of the requested data. |
| | Offset | Address offset of the requested data in HBM memory. |
| | Len | Length of the requested data (in bytes). |
| Response Packet Information | Size | Length of the data written to the channel (in bytes). |
| | interDst | Address offset within the destination HBM channel. |
| | pcDst | HBM channel ID where the data are written. |
Table 4. Comparative analysis of on-chip networks.

| Work | NoC Type | Zero-Load Latency 1 [Cycles] | Switching Bandwidth 2 [M-Data/s] |
|---|---|---|---|
| ARouter 3a [14] | Fullerene-60 | 15.8 | 40 |
| CMRouter 3b [14] | Fullerene-60 | 6.32 | 100 |
| HBM Connect 3c [7] | Butterfly | 1.55 | 579 |
| Our Work 3d | Omega | 1.73 (Estimate: 1.70) | 692 (Estimate: 704) |

1 The zero-load latency of a network is the latency when only one packet traverses the network. 2 The switching bandwidth is the maximum rate at which the switch can process packets. 3 The networks proposed in a and b are tailored for spike-based neuromorphic systems, with switching bandwidth measured in million spikes per second (M-Spikes/s). In contrast, c and d, designed for high-bandwidth HBM access systems, measure bandwidth in million packets per second (M-Packets/s).
Table 5. Effective bandwidth of many-to-many unicast test.

| Work | Type | Avg Bytes per Transfer [Bytes] | Transfer Rate per Pseudo-Channel [GB/s] | BW Utilization [%] |
|---|---|---|---|---|
| Choi et al. [7,13] | 4 × 4 | 1024 | 8.8 | 61.2 |
| | 8 × 8 | 1024 | 8.75 | 60.9 |
| Our work | 2 × 2 | 4096 | 14.0 | 97.4 |
| | 4 × 4 | 4096 | 13.63 | 94.8 |
| | 8 × 8 | 4096 | 12.94 | 90.0 |