Specially-Designed Out-of-Order Processor Architecture for Microcontrollers
Abstract
1. Introduction
- Sub-modules work through register configuration: reading and writing registers is the main means of controlling a sub-module. For example, NVDLA, an open-source deep learning accelerator from NVIDIA Corporation, works in this way [7]. The controller configures the registers in a certain order, and the sub-module starts once the configuration is complete. This method is simple and offers high hardware portability, since each sub-module is connected directly through a standard bus.
- Issuing instructions through a custom instruction interface: modern microarchitectures often provide a custom instruction interface for specific requirements. Processor designers define custom instructions according to the computing characteristics of the workload. In a typical control flow, once the processor's dispatch unit recognizes a custom instruction, it forwards the instruction directly to the sub-module through the custom interface, and the sub-module sends the result back. This method couples the processor and the sub-module more tightly, so the computing characteristics of both, as well as their interaction, must be considered. Custom instruction interfaces are the key to realizing such control; the most mature and popular of them is the RoCC interface in the Rocket architecture [8].
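
To make the second approach more concrete, the sketch below packs a 32-bit custom instruction in the R-type layout used by the RoCC interface (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode). The field layout and the custom-0 opcode value come from the Rocket Chip and RISC-V documentation rather than from this paper; the function name `encode_rocc` and the sample operands are hypothetical and chosen only for illustration.

```python
CUSTOM0 = 0b0001011  # RISC-V custom-0 major opcode (assumed target opcode for this sketch)


def encode_rocc(funct7, rd, rs1, rs2, xd=True, xs1=True, xs2=True, opcode=CUSTOM0):
    """Pack the RoCC instruction fields into one 32-bit word (hypothetical helper)."""
    fields = (("funct7", funct7, 7), ("rd", rd, 5), ("rs1", rs1, 5),
              ("rs2", rs2, 5), ("opcode", opcode, 7))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (
        (funct7 << 25)       # bits 31..25: function selector inside the sub-module
        | (rs2 << 20)        # bits 24..20: second source register index
        | (rs1 << 15)        # bits 19..15: first source register index
        | (int(xd) << 14)    # bit 14: sub-module writes a result back to rd
        | (int(xs1) << 13)   # bit 13: rs1 value is passed to the sub-module
        | (int(xs2) << 12)   # bit 12: rs2 value is passed to the sub-module
        | (rd << 7)          # bits 11..7: destination register index
        | opcode             # bits 6..0: major opcode
    )


# Illustrative operands: funct7 = 0, rd = x10, rs1 = x11, rs2 = x12.
word = encode_rocc(funct7=0, rd=10, rs1=11, rs2=12)
print(f"0x{word:08X}")
```

With register configuration, the same request would instead be expressed as a sequence of bus writes to the sub-module's registers, which is why the custom-instruction path is the more tightly coupled of the two.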
2. Contributions
- Detailed analysis of the microarchitecture requirements of microcontrollers in VLSI systems and the difficulties that microarchitecture hardware implementation needs to address.
- Based on the analysis of an open-source industrial-grade microcontroller design, this paper discusses the shortcomings, problems and improvement strategies of the existing microarchitecture.
- This paper proposes a hardware design of the microarchitecture for a functional, low-power microcontroller, which is implemented on an FPGA platform.
3. Challenges and Motivation
3.1. Analysis of Out-of-Order Processing
3.2. Analysis of Data Hazards
- WAW (Write After Write): If the subsequent instruction writes back to the same destination register as the previous one, then under out-of-order execution the subsequent instruction may write back first. The result of the previous instruction then overwrites the result already written back by the subsequent instruction, leaving a stale value in the register.
- WAR (Write After Read): If the destination register of the subsequent instruction is a source register of the previous one, then under out-of-order execution the subsequent instruction may write back before the previous one reads the register, at which point the previous instruction reads a wrong source operand.
- RAW (Read After Write): If a source register of the subsequent instruction is the destination register of the previous one, then under out-of-order execution the subsequent instruction may read the register before the previous one writes back, at which point the subsequent instruction reads a wrong source operand.
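
The three definitions above reduce to comparisons between register indices. A minimal sketch follows, assuming each instruction is described only by one optional destination index and a tuple of source indices; the `Instr` type and `classify_hazards` function are hypothetical names, and the hard-wired zero register of RISC-V is not special-cased here.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[int]          # destination register index, None if nothing is written
    srcs: Tuple[int, ...] = ()   # source register indices


def classify_hazards(prev: Instr, nxt: Instr) -> Set[str]:
    """Return the hazards that arise if `nxt` is allowed to overtake `prev`."""
    hazards = set()
    if prev.dest is not None and nxt.dest is not None and prev.dest == nxt.dest:
        hazards.add("WAW")   # both write the same register
    if nxt.dest is not None and nxt.dest in prev.srcs:
        hazards.add("WAR")   # nxt overwrites a register prev still has to read
    if prev.dest is not None and prev.dest in nxt.srcs:
        hazards.add("RAW")   # nxt reads a register prev has not yet written
    return hazards


# Illustrative pair: prev writes x5; nxt reads x5 and also writes x5.
print(classify_hazards(Instr(dest=5, srcs=(1, 2)), Instr(dest=5, srcs=(5, 3))))
```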
3.3. Analysis of Static and Dynamic Hardware Scheduling in Abnormal Cases
4. Architecture and Analysis
4.1. Pipeline of Original Microarchitecture
- Instructions executed in one cycle. Such instructions are executed, committed and written back in one cycle on the second stage of the pipeline.
- Instructions executed in multiple cycles. When multi-cycle instructions are dispatched to the execution units and have not yet been written back, they are typically known as Outstanding Instructions.
4.2. Order of the Instructions Matters
- MI/SI after SI (a multi-cycle or single-cycle instruction following a single-cycle instruction). Since the previous instruction completes all operations in one cycle, the subsequent instruction cannot produce a data hazard with it.
- MI/SI after MI (a multi-cycle or single-cycle instruction following a multi-cycle instruction). Since the previous instruction needs multiple cycles to execute, it has likely not been written back when the subsequent instruction accesses the register file, which results in a data hazard. We call these two cases MAM (MI after MI) and SAM (SI after MI) for short.
4.3. Stalling in Pipeline
- When the instruction dispatch unit dispatches a multi-cycle instruction, it compares the source operand register indices and the result register index of the instruction with every entry in the OITF. If a data hazard is detected, the dispatch stage is stalled.
- When the instruction write-back unit writes back a multi-cycle instruction, the instruction cannot be deregistered from the OITF if the information in the multi-cycle instruction controller differs from the corresponding OITF entry, and the write-back stage is stalled.
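
The two stall conditions can be sketched with a small behavioural model. This is an illustration under stated assumptions, not the paper's RTL: the OITF is modelled as a FIFO holding only the destination register index of each outstanding instruction, only the oldest entry may be deregistered, and the default depth of 2 is illustrative. The class and method names are hypothetical.

```python
from collections import deque


class OITF:
    """Behavioural sketch of an outstanding-instruction FIFO (assumed structure)."""

    def __init__(self, depth=2):
        self.depth = depth
        self.entries = deque()   # destination indices, oldest first

    def has_hazard(self, srcs, dest):
        pending = set(self.entries)
        # A source match is a RAW hazard, a destination match is a WAW hazard.
        return dest in pending or any(s in pending for s in srcs)

    def dispatch(self, srcs, dest):
        """Register a multi-cycle instruction; False means the dispatch stage stalls."""
        if len(self.entries) >= self.depth or self.has_hazard(srcs, dest):
            return False
        self.entries.append(dest)
        return True

    def write_back(self, dest):
        """Deregister on write-back; False means the write-back stage stalls."""
        if not self.entries or self.entries[0] != dest:
            return False         # does not match the oldest entry (assumption)
        self.entries.popleft()
        return True


oitf = OITF(depth=2)
print(oitf.dispatch(srcs=(1, 2), dest=3))   # accepted
print(oitf.dispatch(srcs=(3, 4), dest=5))   # stalls: reads x3, which is still outstanding
```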
4.4. Differences Caused by Different Multi-Cycle Instruction Dispatch Modes
- Stalled Multi-cycle Instruction Unit: the execution unit cannot accept another instruction while it is processing one. The next instruction is accepted only after the current operation completes, e.g., a divider that computes its result serially by looping over many cycles.
- Pipelined Multi-cycle Instruction Unit: the execution unit has a pipeline structure and can execute multiple instructions at the same time, increasing instruction throughput through pipelining. Multi-cycle instructions can therefore be dispatched to the execution unit back to back unless the pipeline is full or stalled.
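
The throughput difference between the two unit types can be estimated with a simple cycle count. The sketch below assumes independent instructions, a fixed latency per instruction, and a pipelined unit that accepts one new instruction per cycle without stalling; these simplifications and the function names are assumptions, not figures from the paper.

```python
def stalled_unit_cycles(n_instr, latency):
    """Cycles to finish n_instr instructions when the unit handles one at a time."""
    if n_instr < 0 or latency < 1:
        raise ValueError("n_instr must be >= 0 and latency >= 1")
    return n_instr * latency


def pipelined_unit_cycles(n_instr, latency):
    """Cycles to finish n_instr instructions when one is accepted every cycle."""
    if n_instr < 0 or latency < 1:
        raise ValueError("n_instr must be >= 0 and latency >= 1")
    # The first result appears after `latency` cycles, then one result per cycle.
    return latency + n_instr - 1 if n_instr else 0


# Illustrative numbers: 4 instructions, 8-cycle latency.
print(stalled_unit_cycles(4, 8), pipelined_unit_cycles(4, 8))
```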
5. Implementation
5.1. Designs for Instruction Flow
5.2. Design of the Multi-Cycle Instruction Queues
5.3. Design of System Architecture
- A fast access interface that directly bypasses the bus is additionally provided. It is used for sub-modules that require highly responsive control, whereas the general access interface has to traverse the multi-level bus.
- A separate access interface for the vector processor. Because the vector processor has a very large memory demand, its data streams may interfere with the execution of basic memory instructions. Therefore, as shown in Figure 6, we decouple the access interface of the vector processor from that of the microcontroller, so that the vector computing data streams are separated from the control streams.
6. Results
6.1. Function Analysis
6.2. Performance
6.3. Hardware Utilization
7. Conclusions
8. Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Vasiljevic, J.; Bajic, L.; Capalija, D.; Sokorac, S.; Ignjatovic, D.; Bajic, L.; Trajkovic, M.; Hamer, I.; Matosevic, I.; Cejkov, A. Compute substrate for Software 2.0. IEEE Micro 2021, 41, 50–55.
- Fleischer, B.; Shukla, S.; Ziegler, M.; Silberman, J.; Oh, J.; Srinivasan, V.; Choi, J.; Mueller, S.; Agrawal, A.; Babinsky, T. A scalable multi-TeraOPS deep learning processor core for AI training and inference. In Proceedings of the 2018 IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 18–22 June 2018; pp. 35–36.
- Fowers, J.; Ovtcharov, K.; Papamichael, M.; Massengill, T.; Liu, M.; Lo, D.; Alkalay, S.; Haselman, M.; Adams, L.; Ghandi, M. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 1–14.
- Saha, S.S.; Sandha, S.S.; Srivastava, M. Machine Learning for Microcontroller-Class Hardware—A Review. arXiv 2022, arXiv:2205.14550.
- Parai, M.K.; Das, B.; Das, G. An overview of microcontroller unit: From proper selection to specific application. Int. J. Soft Comput. Eng. IJSCE 2013, 2, 228–231.
- Babiuch, M.; Foltýnek, P.; Smutný, P. Using the ESP32 microcontroller for data processing. In Proceedings of the 2019 20th International Carpathian Control Conference (ICCC), Kraków, Poland, 26–29 May 2019; pp. 1–6.
- NVIDIA Corporation. NVDLA Open Source Hardware, Version 1.0. Available online: https://github.com/nvdla/hw (accessed on 8 July 2022).
- Asanovic, K.; Avizienis, R.; Bachrach, J.; Beamer, S.; Biancolin, D.; Celio, C.; Cook, H.; Dabbelt, D.; Hauser, J.; Izraelevitz, A. The Rocket Chip Generator; Technical Report UCB/EECS-2016-17; EECS Department, University of California: Berkeley, CA, USA, 2016; p. 4.
- Waterman, A.; Asanović, K. The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213. Available online: https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf (accessed on 8 July 2022).
- Asanović, K.; Patterson, D.A. Instruction Sets Should Be Free: The Case for RISC-V; Technical Report UCB/EECS-2014-146; EECS Department, University of California: Berkeley, CA, USA, 2014.
- Blem, E.; Menon, J.; Sankaralingam, K. Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 23–27 February 2013; pp. 1–12.
- Blem, E.; Menon, J.; Sankaralingam, K. A Detailed Analysis of Contemporary Arm and x86 Architectures. UW-Madison Technical Report. 2013. Available online: https://caxapa.ru/thumbs/788118/10.1.1.364.1145.pdf (accessed on 8 July 2022).
- Liu, S.; Du, Z.; Tao, J.; Han, D.; Luo, T.; Xie, Y.; Chen, Y.; Chen, T. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016; pp. 393–405.
- Celio, C.P. A Highly Productive Implementation of an Out-of-Order Processor Generator; University of California: Berkeley, CA, USA, 2017.
- Palacharla, S.; Jouppi, N.P.; Smith, J.E. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, Denver, CO, USA, 2–4 June 1997; pp. 206–218.
- Hilton, A.; Nagarakatte, S.; Roth, A. iCFP: Tolerating all-level cache misses in in-order processors. In Proceedings of the 2009 IEEE 15th International Symposium on High Performance Computer Architecture, Raleigh, NC, USA, 14–18 February 2009; pp. 431–442.
- Barnes, R.D.; Sias, J.W.; Nystrom, E.M.; Patel, S.J.; Navarro, J.; Hwu, W.-m.W. Beating in-order stalls with “flea-flicker” two-pass pipelining. IEEE Trans. Comput. 2005, 55, 18–33.
- McFarlin, D.S.; Tucker, C.; Zilles, C. Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism? ACM SIGARCH Comput. Archit. News 2013, 41, 241–252.
- Kulkarni, K.N.; Mekala, V.R. A Review of Branch Prediction Schemes and a Study of Branch Predictors in Modern Microprocessors. 2016. Available online: https://www.researchgate.net/profile/Venkata-Mekala/publication/266891966_A_Review_of_Branch_Prediction_Schemes_and_a_Study_of_Branch_Predictors_in_Modern_Microprocessors/links/545ac9ed0cf2c46f6643898c/A-Review-of-Branch-Prediction-Schemes-and-a-Study-of-Branch-Predictors-in-Modern-Microprocessors.pdf (accessed on 8 July 2022).
- Mittal, S. A survey of techniques for dynamic branch prediction. Concurr. Comput. Pract. Exp. 2019, 31, e4666.
- Nuclei System Technology. Hummingbirdv2 E203 Core and SoC. Available online: https://github.com/riscv-mcu/e203_hbirdv2 (accessed on 8 July 2022).
- Abella Ferrer, J.; Canal Corretger, R.; González Colás, A.M. Power-and complexity-aware issue queue designs. IEEE Micro 2003, 23, 50–58.
- Mittal, S. A survey of techniques for designing and managing CPU register file. Concurr. Comput. Pract. Exp. 2017, 29, e3906.
- Yeager, K.C. The MIPS R10000 superscalar microprocessor. IEEE Micro 1996, 16, 28–41.
- Xilinx, Inc. 7 Series FPGAs Configuration (UG470 v1.13.1). Available online: https://docs.xilinx.com/v/u/en-US/ug470_7Series_Config (accessed on 8 July 2022).
Hardware Design | Stalling Cycle | Total Cycles |
---|---|---|
Original | m1 ∗ n1 + m2 ∗ n2 | M + m1 ∗ n1 + m2 ∗ n2 |
Shadow Register | m2 ∗ n2 | M + m2 ∗ n2 |
Waiting Queue | m1 + m2 | M + m1 + m2 |
Shadow Register + Waiting Queue | m2 | M + m2 |
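
The expressions in the table above can be evaluated side by side to compare the four designs. The sketch below only transcribes the table's formulas; the symbols M, m1, n1, m2 and n2 keep whatever meaning the paper's text assigns to them, and the numeric values are placeholders chosen purely to show the arithmetic. The function names are hypothetical.

```python
def stall_cycles(m1, n1, m2, n2):
    """Stalling cycles per hardware design, transcribed from the table."""
    return {
        "Original": m1 * n1 + m2 * n2,
        "Shadow Register": m2 * n2,
        "Waiting Queue": m1 + m2,
        "Shadow Register + Waiting Queue": m2,
    }


def total_cycles(M, m1, n1, m2, n2):
    """Total cycles per hardware design: M plus the stalling cycles."""
    return {design: M + stalls
            for design, stalls in stall_cycles(m1, n1, m2, n2).items()}


# Placeholder values only; they are not measurements from the paper.
for design, cycles in total_cycles(M=100, m1=3, n1=4, m2=2, n2=5).items():
    print(f"{design}: {cycles}")
```
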

Hardware Design | Utilization: LUT | Utilization: FF | Utilization: MUX | Functions: Lanes | Functions: Max-Pipelined Instructions | Functions: Write-Back Optimized | Functions: Issuing Optimized |
---|---|---|---|---|---|---|---|
Original | 3832 | 1832 | 289 | 2 | 2 | × | × |
Extended_OITF | 3972 | 2076 | 328 | 2 | 8 | × | × |
Extended_OITF + Waiting Queue | 4106 | 2134 | 332 | 3 | 8 | √ | × |
Extended_OITF + Shadow Register | 3997 | 2124 | 328 | 3 | 8 | × | √ |
Extended_OITF + Shadow Register + Waiting Queue | 4131 | 2162 | 332 | 4 | 8 | √ | √ |
Extended_OITF + Shadow Register + Waiting Queue + More Lanes | 4209 | 2204 | 332 | 12 | 8 | √ | √ |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Hu, Y.; Chen, J.; Zhu, K.; Xing, Q.; Liu, W.; Shen, J.; Gao, G. Specially-Designed Out-of-Order Processor Architecture for Microcontrollers. Electronics 2022, 11, 2989. https://doi.org/10.3390/electronics11192989