An On-Chip Architectural Framework Design for Achieving High-Throughput Multi-Channel High-Bandwidth Memory Access in Field-Programmable Gate Array Systems
Abstract
1. Introduction
- Challenge 1: Preventing deadlocks while raising the throughput of the switch units limits the performance of on-chip networks. The proposed strategy addresses this by employing low-latency forwarding and lightweight flow control mechanisms.
- Challenge 2: Inefficient memory access patterns, such as non-contiguous and small-burst-size transactions, can cause AXI burst inference to fail. The proposed method ensures reliable burst inference.
- A deadlock-free, efficient network-on-chip (NoC) based on the Omega topology is presented. It eliminates traffic congestion in the lateral links of the HBM internal crossbar, effectively addressing Challenge 1.
- A manual burst transmission technique is introduced for systematic control of the burst process, ensuring reliable pipeline burst inference and successful data transmission, thereby tackling Challenge 2.
- A transparent stream-driven HBM access framework is developed, which decouples HBM memory operations located in the SLR0 region from the NoC-based compute units assigned to the SLR1 region. This framework incorporates manual burst transfer and a deadlock-free NoC, enabling adaptability to a wide range of applications.
2. Background
2.1. FPGA-HBM Platform Architecture
2.2. Network-on-Chip Topology
- Index 0 (binary 000) remains 0.
- Index 1 (binary 001) becomes 2 (binary 010).
- Index 2 (binary 010) becomes 4 (binary 100), and so on.
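The index mapping listed above is the perfect-shuffle permutation, which can be written as a one-bit left rotation of the binary index. The following is a minimal sketch, assuming an 8-port network (3-bit indices, as in the examples); the function name `perfect_shuffle` is illustrative and does not come from the paper.

```python
def perfect_shuffle(index: int, n_bits: int) -> int:
    """Rotate the n_bits-wide binary index left by one position."""
    mask = (1 << n_bits) - 1
    msb = (index >> (n_bits - 1)) & 1      # bit that wraps around to position 0
    return ((index << 1) & mask) | msb

# Mapping for all eight ports of a 3-bit network.
shuffled = [perfect_shuffle(i, 3) for i in range(8)]
print(shuffled)  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Indices 0 and 7 map to themselves because rotating an all-zeros or all-ones pattern leaves it unchanged.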
2.3. AXI Memory Access and Burst Transfer
3. Non-Blocking Switch Unit Design in Omega-Based NoC
3.1. Back-Pressure Flow Control Mechanism for Switch Units
- Pipeline Startup Delay: The dependency between checking if the output stream is full and writing to it can increase the pipeline initiation interval if the compiler cannot schedule these operations in the same cycle, thus degrading performance [27].
- Deadlock: Inconsistencies between input and output stream operations can cause severe deadlocks. For instance, a deadlock occurs when the split 1 node writes to buffer 3 while the merge 1 node tries to read from empty buffer 1, blocking both operations.
3.2. State Machine Design for Switch Units
- S_IDLE (Idle State): The FSM begins in the S_IDLE state. When a valid packet is detected in the status stream, the FSM transitions as follows: If the destination address (Dst) is 0, the FSM retrieves the total packet count (size1) and moves to S_PROCESS_IN1. If Dst is 1, the FSM retrieves the total packet count (size2) and transitions to S_PROCESS_IN2.
- S_PROCESS_IN1 (Processing Data for Stream 1): In this state, the FSM forwards packets to stream 1 and decrements size1 with each processed packet. Once all packets are processed (size1 == 0), the FSM transitions back to S_IDLE.
- S_PROCESS_IN2 (Processing Data for Stream 2): Similar to S_PROCESS_IN1, this state forwards packets to stream 2 and decrements size2. When all packets are processed (size2 == 0), the FSM transitions back to S_IDLE.
- S_LAST (Final State): If the tail flag (T = 1) is detected during the S_IDLE state, the FSM transitions to S_LAST, indicating the end of the data stream. The FSM remains in this state until a reset signal is received, which re-initializes the system.
- M_IDLE1 (Polling Stream 1): The FSM begins in M_IDLE1, where it attempts a non-blocking read (nb_read) from stream 1. If data are available, the FSM transitions to M_PROCESS_IN1 to process the data. If no data are available (nb_read(IN1) == false), the FSM transitions to M_IDLE2 to poll stream 2.
- M_PROCESS_IN1 (Processing Data from Stream 1): In this state, the FSM forwards packets from stream 1 to the output stream while decrementing the packet count (size1). Once all packets are processed (size1 == 0), the FSM transitions back to M_IDLE1 to poll stream 1 for new data.
- M_IDLE2 (Polling Stream 2): When no data are available in stream 1, the FSM transitions to M_IDLE2, where it polls stream 2 using nb_read. If data are available (nb_read(IN2) == true), the FSM transitions to M_PROCESS_IN2 to process the data. If no data are found in either stream, the FSM transitions to S_LAST, indicating the end of the data stream.
- M_PROCESS_IN2 (Processing Data from Stream 2): In this state, the FSM forwards packets from stream 2 to the output stream while decrementing the packet count (size2). When all packets are processed (size2 == 0), the FSM transitions back to M_IDLE2 to continue polling stream 2 for new data.
- S_LAST (Final State): When both streams are empty, the FSM transitions to S_LAST, signaling the completion of the merge operation. The FSM remains in this state until a reset signal is received, which re-initializes the system.
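The two state machines described above can be simulated in software to check their transition logic. The sketch below is a behavioral model only, not the paper's HLS implementation. The `Packet` layout, the function names `run_split`, `run_merge`, and the Python `nb_read` helper are hypothetical. It also assumes that a header packet carries the destination and packet count and is not itself forwarded, and that the merge FSM returns from M_IDLE2 to M_IDLE1 when stream 2 is empty but stream 1 is not (a case the description leaves open).

```python
from collections import deque, namedtuple

# Hypothetical packet layout: a header carries dst and size, a tail packet sets tail=1.
Packet = namedtuple("Packet", "tail dst size payload", defaults=(0, 0, 0, None))

def run_split(packets):
    """Split FSM: route each transfer to output 1 (Dst == 0) or output 2 (Dst == 1)."""
    out1, out2 = [], []
    state, remaining = "S_IDLE", 0
    for pkt in packets:
        if state == "S_LAST":
            break                      # stays here until reset
        if state == "S_IDLE":
            if pkt.tail:
                state = "S_LAST"
            elif pkt.size > 0:         # header: latch packet count, pick output
                remaining = pkt.size
                state = "S_PROCESS_IN1" if pkt.dst == 0 else "S_PROCESS_IN2"
        else:
            (out1 if state == "S_PROCESS_IN1" else out2).append(pkt.payload)
            remaining -= 1
            if remaining == 0:
                state = "S_IDLE"
    return out1, out2, state

def nb_read(stream):
    """Non-blocking read: (True, item) if data are available, else (False, None)."""
    return (True, stream.popleft()) if stream else (False, None)

def run_merge(in1, in2):
    """Merge FSM: poll stream 1, then stream 2, forwarding one transfer at a time."""
    out = []
    state, remaining = "M_IDLE1", 0
    while state != "S_LAST":
        if state == "M_IDLE1":
            ok, header = nb_read(in1)
            state, remaining = ("M_PROCESS_IN1", header.size) if ok else ("M_IDLE2", 0)
        elif state == "M_IDLE2":
            ok, header = nb_read(in2)
            if ok:
                state, remaining = "M_PROCESS_IN2", header.size
            elif in1:
                state = "M_IDLE1"      # assumption: retry stream 1 if it has data
            else:
                state = "S_LAST"       # both streams empty
        else:
            src, idle = (in1, "M_IDLE1") if state == "M_PROCESS_IN1" else (in2, "M_IDLE2")
            if remaining == 0:
                state = idle
            else:                      # assumes well-formed input: payloads follow header
                out.append(src.popleft().payload)
                remaining -= 1
    return out
```

For example, feeding `run_split` a header for two packets to destination 0, a header for one packet to destination 1, and a tail packet yields two payloads on output 1, one on output 2, and a final state of S_LAST. Because each PROCESS state returns to its own IDLE state, a stream that always has data pending is served again before the other stream is polled.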
4. Fine-Grained Burst Control for Streamlined HBM Access
Listing 1. Manual burst transfer function. |
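The code below is not Listing 1; it is a separate, minimal Python sketch of the address arithmetic that any manually controlled burst scheme has to respect. It relies on two AXI4 protocol rules: an INCR burst carries at most 256 beats, and a burst must not cross a 4 KB address boundary. The function name `split_into_bursts` and the default beat width of 32 bytes (a 256-bit data bus) are assumptions made for illustration.

```python
def split_into_bursts(offset, length, beat_bytes=32, max_beats=256, boundary=4096):
    """Split a byte range into (address, beats) bursts.

    offset and length must be multiples of beat_bytes.
    """
    if offset % beat_bytes or length % beat_bytes:
        raise ValueError("offset and length must be beat-aligned")
    bursts = []
    addr, end = offset, offset + length
    while addr < end:
        to_boundary = boundary - (addr % boundary)   # bytes left before the 4 KB line
        chunk = min(end - addr, to_boundary, max_beats * beat_bytes)
        bursts.append((addr, chunk // beat_bytes))
        addr += chunk
    return bursts

print(split_into_bursts(0, 8192))     # [(0, 128), (4096, 128)]
print(split_into_bursts(4064, 64))    # [(4064, 1), (4096, 1)]
```

The second call shows why small or misaligned requests are costly: a 64-byte read that straddles a 4 KB boundary degenerates into two single-beat bursts.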
5. Efficient NoC and Burst Transfer for Managing Data Streams
5.1. Transparent HBM Access Framework
5.2. NoC Packet Design for Multi-Channel Memory Access
Algorithm 1: Request packet generation process. |
Input: requestStrm: Output request stream; sel: The enable signal array for the HBM channel; Len: Read size; offset: Read offset; Dest: target HBM channel index |
Output: Generated request packets written to requestStrm |
Algorithm 2: Response packet generation process. |
Input: src: Input data; dataStrm: Output data stream; pcID: HBM pseudo channel index; data_num: Total data size |
Output: Generated packets written to dataStrm |
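Since only the inputs and outputs of Algorithms 1 and 2 are listed here, the following Python sketch shows one plausible way the packets could be assembled; it is an assumption-laden illustration, not the paper's algorithms. The dictionary-based packet layout, the per-channel lists for `length`, `offset`, and `dest`, the `inter_dst` parameter, the flag values on data packets, and the closing tail packet are all hypothetical. Field names (Dest, Offset, Len, Size, interDst, pcDst) follow the packet-field table in this article.

```python
def generate_requests(sel, length, offset, dest):
    """Sketch of Algorithm 1: emit one request packet per enabled HBM channel."""
    request_strm = []
    for ch, enabled in enumerate(sel):
        if not enabled:
            continue
        request_strm.append({"H": 1, "V": 1, "T": 0,
                             "Dest": dest[ch], "Offset": offset[ch], "Len": length[ch]})
    # Assumed end-of-stream marker so downstream units can stop.
    request_strm.append({"H": 0, "V": 1, "T": 1})
    return request_strm

def generate_response(src, pc_id, data_num, inter_dst=0):
    """Sketch of Algorithm 2: a header packet followed by data_num data packets."""
    data_strm = [{"H": 1, "V": 1, "T": 0,
                  "Size": data_num, "interDst": inter_dst, "pcDst": pc_id}]
    for word in src[:data_num]:
        # Assumed flag values for plain data packets.
        data_strm.append({"H": 0, "V": 1, "T": 0, "data": word})
    return data_strm
```

Placing the size and destination in a header packet is what lets the switch units latch a packet count once and then forward the payload without inspecting each word.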
6. Evaluation and Analysis
6.1. Experimental Setup
- Switch Unit Analysis for On-chip Network Performance: We assessed the performance of the Omega network topology by constructing data streams with varying destinations to measure the contention overhead of continuous data transfers. A stability probability P was defined to control the consistency of each stream’s destination. This was achieved by modifying the response header packets. The experiment involved modeling switch unit behavior under these scenarios, followed by hardware emulation using a testbench to simulate a large number of randomized data streams. The measured results were then compared with the predictive model to validate accuracy. By leveraging the structural regularity of the Omega topology, the throughput of the network was ultimately determined.
- Transparent HBM Access Framework Validation: To evaluate the framework’s performance, a many-to-many unicast communication scenario was adopted, a setup frequently encountered in radio astronomy applications. The complexity of the communication pattern ranged from one-to-one (data read from one HBM pseudo-channel and written to one HBM pseudo-channel) to one-to-eight (data read from one HBM pseudo-channel and written to eight pseudo-channels). Data were processed in contiguous blocks, ensuring equal-sized transfers across all pseudo-channels. Using the same benchmark as Choi et al. (https://github.com/UCLA-VAST/hbmbench (accessed on 2 December 2020)) [13], we modified the response header packets to control the destination and quantity of data streams. This setup enabled a direct comparison of the effective bandwidth achieved by our NoC-based framework against the built-in crossbar design.
6.2. Evaluating On-Chip Network Through Switch Unit Analysis
6.3. Evaluating the Transparent HBM Access Framework
7. Conclusions
Challenges and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- [1] Malakonakis, P.; Isotton, G.; Miliadis, P.; Alverti, C.; Theodoropoulos, D.; Pnevmatikatos, D.; Ioannou, A.; Harteros, K.; Georgopoulos, K.; Papaefstathiou, I.; et al. Preconditioned Conjugate Gradient Acceleration on FPGA-Based Platforms. Electronics 2022, 11, 3039.
- [2] Du, C.; Yamaguchi, Y. High-Level Synthesis Design for Stencil Computations on FPGA with High Bandwidth Memory. Electronics 2020, 9, 1275.
- [3] Guo, S.; Zheng, L.; Jin, X. Accelerating a radio astronomy correlator on FPGA. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 1–14 February 2018; pp. 85–89.
- [4] Xu, R.; Han, F.; Ta, Q. Deep Learning at Scale on NVIDIA V100 Accelerators. In Proceedings of the 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), Dallas, TX, USA, 12 November 2018; pp. 23–32.
- [5] Iskandar, V.; Ghany, M.A.A.E.; Göhringer, D. Near-Memory Computing on FPGAs with 3D-stacked Memories: Applications, Architectures, and Optimizations. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 1–32.
- [6] Holzinger, P.; Reiser, D.; Hahn, T.; Reichenbach, M. Fast HBM Access with FPGAs: Analysis, Architectures, and Applications. In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Portland, OR, USA, 17–21 June 2021; pp. 152–159.
- [7] Choi, Y.k.; Chi, Y.; Qiao, W.; Samardzic, N.; Cong, J. HBM Connect: High-Performance HLS Interconnect for FPGA HBM. In Proceedings of the FPGA ’21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 28 February–2 March 2021; pp. 116–126.
- [8] Puranik, S.; Barve, M.; Rodi, S.; Patrikar, R. FPGA-Based High-Throughput Key-Value Store Using Hashing and B-Tree for Securities Trading System. Electronics 2023, 12, 183.
- [9] Furukawa, K.; Kobayashi, R.; Yokono, T.; Fujita, N.; Yamaguchi, Y.; Boku, T.; Yoshikawa, K.; Umemura, M. An efficient RTL buffering scheme for an FPGA-accelerated simulation of diffuse radiative transfer. In Proceedings of the 2021 International Conference on Field-Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–9.
- [10] Prakash, S.K.; Patel, H.; Kapre, N. Managing HBM Bandwidth on Multi-Die FPGAs with FPGA Overlay NoCs. In Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), New York, NY, USA, 15–18 May 2022; pp. 1–9.
- [11] Xue, S.; Liang, H.; Wu, Q.; Jin, X. Scheduling Memory Access Optimization for HBM Based on CLOS. In Proceedings of the 2023 25th International Conference on Advanced Communication Technology (ICACT), Pyeongchang, Republic of Korea, 19–22 February 2023; pp. 448–453.
- [12] Nabavi Larimi, S.S.; Salami, B.; Unsal, O.S.; Kestelman, A.C.; Sarbazi-Azad, H.; Mutlu, O. Understanding Power Consumption and Reliability of High-Bandwidth Memory with Voltage Underscaling. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 517–522.
- [13] Choi, Y.k.; Chi, Y.; Wang, J.; Guo, L.; Cong, J. When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization. arXiv 2020, arXiv:2010.06075.
- [14] Zhou, P.J.; Yu, Q.; Chen, M.; Qiao, G.C.; Zuo, Y.; Zhang, Z.; Liu, Y.; Hu, S.G. Fullerene-Inspired Efficient Neuromorphic Network-on-Chip Scheme. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 1376–1380.
- [15] Xiao, Z.; Chamberlain, R.D.; Cabrera, A.M. HLS Portability from Intel to Xilinx: A Case Study. In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC), Virtual, 20–24 September 2021; pp. 1–8.
- [16] Ferry, C.; Yuki, T.; Derrien, S.; Rajopadhye, S. Increasing FPGA Accelerators Memory Bandwidth With a Burst-Friendly Memory Layout. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2023, 42, 1546–1559.
- [17] Xilinx Inc. AXI High Bandwidth Memory Controller v1.0: LogiCORE IP Product Guide, PG276 (v1.0); Xilinx Inc.: San Jose, CA, USA, 2021; Vivado Design Suite.
- [18] Lu, A.; Fang, Z.; Liu, W.; Shannon, L. Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking. In Proceedings of the FPGA ’21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 28 February–2 March 2021; pp. 105–115.
- [19] Sehwag, V.; Prasad, N.; Chakrabarti, I. A Parallel Stochastic Number Generator With Bit Permutation Networks. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 231–235.
- [20] Almazyad, A.S. Optical omega networks with centralized buffering and wavelength conversion. J. King Saud Univ. Comput. Inf. Sci. 2011, 23, 15–28.
- [21] Vasiliadis, D.C.; Rizos, G.E.; Margariti, S.V.; Tsiantis, L.E. Comparative study of blocking mechanisms for packet switched Omega networks. In Proceedings of the EHAC ’07: The 6th WSEAS International Conference on Electronics, Hardware, Wireless and Optical Communications, Stevens Point, WI, USA, 16–19 February 2007; pp. 18–22.
- [22] Almazyad, A.S. A New Look-Ahead Algorithm to Improve the Performance of Omega Networks. Math. Comput. Appl. 2010, 15, 156–165.
- [23] Shalini, N.; Shashikala, K.P. Design and Functional Verification of Axi2OCP Bridge for Highly Optimized Bus Utilization and Closure Using Functional Coverage. In Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications; Satapathy, S.C., Bhateja, V., Udgata, S.K., Pattnaik, P.K., Eds.; Springer: Singapore, 2017; pp. 525–535.
- [24] Bhaktavatchalu, R.; Rekha, B.S.; Divya, G.A.; Jyothi, V.U.S. Design of AXI bus interface modules on FPGA. In Proceedings of the 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), Ramanathapuram, India, 25–27 May 2016; pp. 141–146.
- [25] Nakkala, S.; Vaddavalli, S.; Arja, S.S. Design and Verification of AMBA AXI Protocol. In Proceedings of the 2024 International Conference on Electronics, Computing, Communication and Control Technology (ICECCC), Bengaluru, India, 2–3 May 2024; pp. 1–6.
- [26] Mnejja, S.; Aydi, Y.; Abid, M.; Monteleone, S.; Catania, V.; Palesi, M.; Patti, D. Delta Multi-Stage Interconnection Networks for Scalable Wireless On-Chip Communication. Electronics 2020, 9, 913.
- [27] Eran, H.; Zeno, L.; István, Z.; Silberstein, M. Design Patterns for Code Reuse in HLS Packet Processing Pipelines. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), San Diego, CA, USA, 18 April–1 May 2019; pp. 208–217.
- [28] Boraten, T.H.; Kodi, A.K. Securing NoCs Against Timing Attacks with Non-Interference Based Adaptive Routing. In Proceedings of the 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Torino, Italy, 4–5 October 2018; pp. 1–8.
- [29] Xilinx Inc. Vitis High-Level Synthesis User Guide (UG1399), Version 2023.1; Xilinx Inc.: San Jose, CA, USA, 2023.
- [30] Li, H.; Rieger, P.; Zeitouni, S.; Picek, S.; Sadeghi, A.R. FLAIRS: FPGA-Accelerated Inference-Resistant & Secure Federated Learning. In Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 4–8 September 2023; pp. 271–276.
- [31] Xilinx Inc. Alveo U280 Data Center Accelerator Card User Guide (UG1314), Version 1.2.1; Xilinx Inc.: San Jose, CA, USA, 2019.
- [32] Xilinx Inc. Vivado Design Suite User Guide: High-Level Synthesis (UG902), Version 2020.1; Xilinx Inc.: San Jose, CA, USA, 2021.

Proposed Method | Work | Year | Contribution | Topology | Routing Logic |
---|---|---|---|---|---|
Application-specific Buffering Mechanism | Furukawa et al. [9] | 2021 | Implementation of Radiative Transfer Equation | / | / |
Application-specific Buffering Mechanism | Choi et al. [13] | 2020 | Baseline Design for HBM Bottleneck Analysis | / | / |
Energy Conversion | Nabavi Larimi et al. [12] | 2021 | Power Consumption Analysis of HBM Under Voltage Underscaling | / | / |
NoC | ARouter [14] | 2024 | NoC for Neuromorphic Systems | Fullerene-60 Surface Topology | Adaptive |
NoC | CMRouter [14] | 2024 | NoC for Neuromorphic Systems | Fullerene-60 Surface Topology | Reconfigurable |
NoC | Prakash et al. [10] | 2022 | NoC for Improving HBM Bandwidth Utilization | FPGA Overlay NoCs | Adaptive |
NoC | Xue et al. [11] | 2023 | NoC for Improving HBM Bandwidth Utilization | CLOS | Deterministic |
NoC | HBM Connect [7] | 2021 | NoC for Improving HBM Bandwidth Utilization | Butterfly | Deterministic |
NoC | This work | | Non-blocking NoC and Manual Burst Control for High HBM Throughput Access | Omega | Deterministic |

Identifier | Symbol or Value | Description |
---|---|---|
Control Flags | H | Packet header flag (Head) |
Control Flags | V | Packet valid flag (Valid) |
Control Flags | T | Packet tail flag (Tail) |
Control Flag Combinations | H = 1, V = 1, T = 0 | Processing starts, data stream valid |
Control Flag Combinations | H = 0, V = 0, T = 1 | Current transfer ends, but data stream continues |
Control Flag Combinations | H = 0, V = 1, T = 1 | End of data stream |
nb_read | | Non-blocking read |

Category | Identifier | Description |
---|---|---|
Control Flags | H, V, T | Flags indicating the status of the data stream (refer to Table 2). |
Request Packet Information | Dest | HBM channel ID of the requested data. |
Request Packet Information | Offset | Address offset of the requested data in HBM memory. |
Request Packet Information | Len | Length of the requested data (in bytes). |
Response Packet Information | Size | Length of the data written to the channel (in bytes). |
Response Packet Information | interDst | Address offset within the destination HBM channel. |
Response Packet Information | pcDst | HBM channel ID where the data are written. |

Work | NoC Type | Zero-Load Latency 1 [Cycles] | Switching Bandwidth 2 [M-Data/s] |
---|---|---|---|
ARouter 3a [14] | Fullerene-60 | 15.8 | 40 |
CMRouter 3b [14] | Fullerene-60 | 6.32 | 100 |
HBM Connect 3c [7] | Butterfly | 1.55 | 579 |
Our Work 3d | Omega | 1.73 (Estimate: 1.70) | 692 (Estimate: 704) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kong, X.; Zhu, Z.; Feng, C.; Zhu, Y.; Zheng, X. An On-Chip Architectural Framework Design for Achieving High-Throughput Multi-Channel High-Bandwidth Memory Access in Field-Programmable Gate Array Systems. Electronics 2025, 14, 466. https://doi.org/10.3390/electronics14030466