Low-Latency and Minor-Error Architecture for Parallel Computing XY-like Functions with High-Precision Floating-Point Inputs
Abstract
1. Introduction
- We propose a low-latency parallel computing architecture based on the QH CORDIC methodology;
- We extend the feasible range of floating-point (FP) inputs with specific techniques so that the proposed architecture applies to high-precision computing;
- We model the proposed architecture in hardware with the aim of minimizing circuit complexity and resource consumption;
- We compare the hardware implementation results with related works to demonstrate the low error and high accuracy of the proposed architecture.
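
As a rough orientation for the contributions above, XY-like functions (powers and roots) are commonly evaluated through the identity X^Y = exp(Y · ln X), which matches the logarithm, division/multiplication, and exponential stages listed in the word-length comparison later in the paper. The sketch below is only an illustration under that assumption: `xy_like` and `nth_root` are hypothetical names, and `math.log` / `math.exp` stand in for the CORDIC-based stages, which are not reproduced here.

```python
import math


def xy_like(x: float, y: float) -> float:
    """Evaluate x**y as exp(y * ln(x)); valid for x > 0 only."""
    if x <= 0.0:
        raise ValueError("x must be positive for the log/exp decomposition")
    ln_x = math.log(x)        # stage 1: logarithm (stand-in for the CORDIC stage)
    scaled = y * ln_x         # stage 2: multiplication
    return math.exp(scaled)   # stage 3: exponential (stand-in for the CORDIC stage)


def nth_root(x: float, n: float) -> float:
    """N-th root as the special case y = 1/n (hence a division stage)."""
    return xy_like(x, 1.0 / n)


if __name__ == "__main__":
    print(xy_like(2.0, 10.0))
    print(nth_root(27.0, 3.0))
```

In double precision the result is only as accurate as the intermediate product `y * ln(x)`, which is one reason a wide internal word length matters for high-precision inputs.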
2. QH CORDIC-Based Methodology of XY-Like Functions
2.1. Iterative Formulae of QH CORDIC Methodology
2.2. Range of Convergence of QH CORDIC Methodology
2.3. Validity of Computation for Logarithmic Function and Exponential Function with QH CORDIC
3. Hardware Modeling of XY-Like Functions with QH CORDIC
3.1. Preprocessing Module
3.2. QH Module
3.3. Postprocessing Module
4. Implementation Results and Comparisons
4.1. ASIC Implementation Results of the Proposed Architecture
4.2. Evaluation and Comparative Analysis
4.2.1. Computational Correctness
4.2.2. Word Length
4.2.3. Timing Analysis and Power Analysis
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Pineiro, J.-A.; Ercegovac, M.D.; Bruguera, J.D. High-radix iterative algorithm for powering computation. In Proceedings of the 16th IEEE Symposium on Computer Arithmetic, Santiago de Compostela, Spain, 15–18 June 2003; pp. 204–211.
- Harris, D. A powering unit for an OpenGL lighting engine. In Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 4–7 November 2001; pp. 1641–1645.
- Zuras, D.; Cowlishaw, M.; Aiken, A.; Applegate, M.; Bailey, D.; Bass, S.; Bhandarkar, D.; Bhat, M.; Bindel, D.; Boldo, S.; et al. IEEE Standard for Floating-Point Arithmetic. IEEE Std. 2008, 754, 1–70.
- Pineiro, J.-A.; Ercegovac, M.D.; Bruguera, J.D. Algorithm and architecture for logarithm, exponential, and powering computation. IEEE Trans. Comput. 2004, 53, 1085–1096.
- Antelo, E.; Lang, T.; Bruguera, J.D. Very-high radix CORDIC vectoring with scalings and selection by rounding. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic, Adelaide, SA, Australia, 14–16 April 1999; pp. 204–213.
- Vazquez, A.; Bruguera, J.D. Iterative algorithm and architecture for exponential, logarithm, powering, and root extraction. IEEE Trans. Comput. 2013, 62, 1721–1731.
- Oberman, S.F. Floating point division and square root algorithms and implementation in the AMD-K7™ microprocessor. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic, Adelaide, SA, Australia, 14–16 April 1999; pp. 106–115.
- Pineiro, J.-A.; Bruguera, J.D. High-speed double-precision computation of reciprocal, division, square root, and inverse square root. IEEE Trans. Comput. 2002, 51, 1377–1388.
- Langhammer, M.; Pasca, B. Single precision logarithm and exponential architectures for hard floating-point enabled FPGAs. IEEE Trans. Comput. 2017, 66, 2031–2043.
- Muller, J.M. Elementary functions: Algorithms and implementation. Math. Comput. Educ. 1997, 34, 21–52.
- Schulte, M.J.; Stine, J.E. Approximating elementary functions with symmetric bipartite tables. IEEE Trans. Comput. 1999, 48, 842–847.
- Chen, H.; Yang, H.; Song, W.; Lu, Z.; Fu, Y.; Li, L.; Yu, Z. Symmetric-Mapping LUT-Based Method and Architecture for Computing XY-Like Functions. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 1231–1244.
- Luo, Y.; Wang, Y.; Sun, H.; Zha, Y.; Wang, Z.; Pan, H. CORDIC-based architecture for computing Nth root and its implementation. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 4183–4195.
- Mopuri, S.; Acharyya, A. Low complexity generic VLSI architecture design methodology for Nth root and Nth power computations. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 4673–4686.
- Wang, Y.; Luo, Y.; Wang, Z.; Shen, Q.; Pan, H. GH CORDIC-based architecture for computing Nth root of single-precision floating-point number. IEEE Trans. Very Large Scale Integr. Syst. 2020, 28, 864–875.
- Combet, M.; van Zonneveld, H.; Verbeek, L. Computation of the base two logarithm of binary numbers. IEEE Trans. Electron. Comput. 1965, EC-14, 863–867.
- Hall, E.L.; Lynch, D.D.; Dwyer, S.J. Generation of products and quotients using approximate binary logarithms for digital filtering applications. IEEE Trans. Comput. 1970, C-19, 97–105.
- Abed, K.H.; Siferd, R.E. CMOS VLSI implementation of a low-power logarithmic converter. IEEE Trans. Comput. 2003, 52, 1421–1433.
- Abed, K.H.; Siferd, R.E. VLSI implementation of a low-power antilogarithmic converter. IEEE Trans. Comput. 2003, 52, 1221–1228.
- Paul, S.; Jayakumar, N.; Khatri, S.P. A fast hardware approach for approximate, efficient logarithm and antilogarithm computations. IEEE Trans. Very Large Scale Integr. Syst. 2009, 17, 269–277.
- De Dinechin, F.; Pasca, B. Floating-point exponential functions for DSP-enabled FPGAs. In Proceedings of the IEEE International Conference on Field-Programmable Technology, Beijing, China, 8–10 December 2010; pp. 110–117.
- Chen, D.; Han, L.; Ko, S.B. Decimal floating-point antilogarithmic converter based on selection by rounding: Algorithm and architecture. IET Comput. Digit. Technol. 2012, 6, 277–289.
- Chen, D.; Han, L.; Choi, Y.; Ko, S.-B. Improved decimal floating-point logarithmic converter based on selection by rounding. IEEE Trans. Comput. 2012, 61, 607–621.
- Liu, W.; Nannarelli, A. Power efficient division and square root unit. IEEE Trans. Comput. 2012, 61, 1059–1070.
- Seth, A.; Gan, W.-S. Fixed-point square roots using L-b truncation. IEEE Signal Process. Mag. 2011, 28, 149–153.
- Kabuo, H.; Taniguchi, T.; Miyoshi, A.; Yamashita, H.; Urano, M.; Edamatsu, H.; Kuninobu, S. Accurate rounding scheme for the Newton-Raphson method using redundant binary representation. IEEE Trans. Comput. 1994, 43, 43–51.
- Mack, J.; Bellestri, S.; Llamocca, D. Floating point CORDIC-based architecture for powering computation. In Proceedings of the 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Riviera Maya, Mexico, 7–9 December 2015; pp. 1–6.
- Luo, Y.; Wang, Y.; Ha, Y.; Wang, Z.; Chen, S.; Pan, H. Generalized Hyperbolic CORDIC and Its Logarithmic and Exponential Computation with Arbitrary Fixed Base. IEEE Trans. Very Large Scale Integr. Syst. 2019, 27, 2156–2169.
- Duprat, J.; Muller, J.M. The CORDIC algorithm: New results for fast VLSI implementation. IEEE Trans. Comput. 1993, 42, 168–178.
- Phatak, D.S. Double step branching CORDIC: A new algorithm for fast sine and cosine generation. IEEE Trans. Comput. 1998, 47, 587–602.
- Fu, W.; Xia, J.; Lin, X.; Liu, M.; Wang, M. Low-Latency Hardware Implementation of High-Precision Hyperbolic Functions Sinhx and Coshx Based on Improved CORDIC Algorithm. Electronics 2021, 10, 2533.
- Llamocca-Obregón, D.R.; Agurto-Ríos, C.P. A fixed-point implementation of the expanded hyperbolic CORDIC algorithm. Lat. Am. Appl. Res. 2007, 37, 83–91.
- Hao, L.; Ming-Jiang, W.; Mo-Ran, C.; Ming, L. A VLSI Implementation of Double Precision Floating-Point Logarithmic Function. In Proceedings of the 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), Wuxi, China, 19–21 July 2019; pp. 345–349.
Case | $\sigma_n$ | $\sigma_{n+1}$ | $\sigma_{n+2}$ | $\sigma_{n+3}$ | Iterative Formula of $y_{n+4}$ |
---|---|---|---|---|---|
1 | −1 | −1 | −1 | −1 | $y_{n+4} = [1 + 2^{-(4n+6)} + 35 \times 2^{-(2n+5)}] \times y_n + [-15 \times 2^{-(n+3)} - 15 \times 2^{-(3n+6)}] \times x_n$ |
2 | −1 | −1 | −1 | 1 | $y_{n+4} = [1 - 2^{-(4n+6)} + 21 \times 2^{-(2n+5)}] \times y_n + [-13 \times 2^{-(n+3)} - 2^{-(3n+6)}] \times x_n$ |
3 | −1 | −1 | 1 | −1 | $y_{n+4} = [1 - 2^{-(4n+6)} + 9 \times 2^{-(2n+5)}] \times y_n + [-11 \times 2^{-(n+3)} + 7 \times 2^{-(3n+6)}] \times x_n$ |
4 | −1 | 1 | −1 | −1 | $y_{n+4} = [1 - 2^{-(4n+6)} - 9 \times 2^{-(2n+5)}] \times y_n + [-7 \times 2^{-(n+3)} + 11 \times 2^{-(3n+6)}] \times x_n$ |
5 | 1 | −1 | −1 | −1 | $y_{n+4} = [1 - 2^{-(4n+6)} - 21 \times 2^{-(2n+5)}] \times y_n + [2^{-(n+3)} + 13 \times 2^{-(3n+6)}] \times x_n$ |
6 | −1 | −1 | 1 | 1 | $y_{n+4} = [1 + 2^{-(4n+6)} - 2^{-(2n+5)}] \times y_n + [-9 \times 2^{-(n+3)} + 9 \times 2^{-(3n+6)}] \times x_n$ |
7 | −1 | 1 | −1 | 1 | $y_{n+4} = [1 + 2^{-(4n+6)} - 15 \times 2^{-(2n+5)}] \times y_n + [-5 \times 2^{-(n+3)} + 5 \times 2^{-(3n+6)}] \times x_n$ |
8 | −1 | 1 | 1 | −1 | $y_{n+4} = [1 + 2^{-(4n+6)} - 19 \times 2^{-(2n+5)}] \times y_n + [-3 \times 2^{-(n+3)} - 3 \times 2^{-(3n+6)}] \times x_n$ |
9 | 1 | −1 | −1 | 1 | $y_{n+4} = [1 + 2^{-(4n+6)} - 19 \times 2^{-(2n+5)}] \times y_n + [3 \times 2^{-(n+3)} + 3 \times 2^{-(3n+6)}] \times x_n$ |
10 | 1 | −1 | 1 | −1 | $y_{n+4} = [1 + 2^{-(4n+6)} - 15 \times 2^{-(2n+5)}] \times y_n + [5 \times 2^{-(n+3)} - 5 \times 2^{-(3n+6)}] \times x_n$ |
11 | 1 | 1 | −1 | −1 | $y_{n+4} = [1 + 2^{-(4n+6)} - 2^{-(2n+5)}] \times y_n + [9 \times 2^{-(n+3)} - 9 \times 2^{-(3n+6)}] \times x_n$ |
12 | −1 | 1 | 1 | 1 | $y_{n+4} = [1 - 2^{-(4n+6)} - 21 \times 2^{-(2n+5)}] \times y_n + [-2^{-(n+3)} - 13 \times 2^{-(3n+6)}] \times x_n$ |
13 | 1 | −1 | 1 | 1 | $y_{n+4} = [1 - 2^{-(4n+6)} - 9 \times 2^{-(2n+5)}] \times y_n + [7 \times 2^{-(n+3)} - 11 \times 2^{-(3n+6)}] \times x_n$ |
14 | 1 | 1 | −1 | 1 | $y_{n+4} = [1 - 2^{-(4n+6)} + 9 \times 2^{-(2n+5)}] \times y_n + [11 \times 2^{-(n+3)} - 7 \times 2^{-(3n+6)}] \times x_n$ |
15 | 1 | 1 | 1 | −1 | $y_{n+4} = [1 - 2^{-(4n+6)} + 21 \times 2^{-(2n+5)}] \times y_n + [13 \times 2^{-(n+3)} + 2^{-(3n+6)}] \times x_n$ |
16 | 1 | 1 | 1 | 1 | $y_{n+4} = [1 + 2^{-(4n+6)} + 35 \times 2^{-(2n+5)}] \times y_n + [15 \times 2^{-(n+3)} + 15 \times 2^{-(3n+6)}] \times x_n$ |
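
The sixteen formulas in the table above can be cross-checked by composing four ordinary hyperbolic CORDIC micro-rotations. The sketch below assumes the standard single-step recurrence x_{i+1} = x_i + σ_i 2^{-i} y_i, y_{i+1} = y_i + σ_i 2^{-i} x_i (an assumption, since the recurrence itself is not shown in this excerpt) and uses exact rational arithmetic. The helper names `coefficients`, `four_single_steps`, and `merged_y` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def coefficients(sigmas):
    """Integer weights of 2^-(n+3), 2^-(2n+5), 2^-(3n+6), 2^-(4n+6)."""
    s0, s1, s2, s3 = sigmas
    c1 = 8 * s0 + 4 * s1 + 2 * s2 + s3
    c2 = 16 * s0 * s1 + 8 * s0 * s2 + 4 * s0 * s3 + 4 * s1 * s2 + 2 * s1 * s3 + s2 * s3
    c3 = 8 * s0 * s1 * s2 + 4 * s0 * s1 * s3 + 2 * s0 * s2 * s3 + s1 * s2 * s3
    c4 = s0 * s1 * s2 * s3
    return c1, c2, c3, c4


def four_single_steps(x, y, n, sigmas):
    """Apply micro-rotations n, n+1, n+2, n+3 one after another (assumed recurrence)."""
    for i, s in enumerate(sigmas):
        t = Fraction(1, 2 ** (n + i))
        x, y = x + s * t * y, y + s * t * x
    return x, y


def merged_y(x, y, n, sigmas):
    """Compute y_{n+4} in one shot from (x_n, y_n), as in the table."""
    c1, c2, c3, c4 = coefficients(sigmas)
    a = 1 + c4 * Fraction(1, 2 ** (4 * n + 6)) + c2 * Fraction(1, 2 ** (2 * n + 5))
    b = c1 * Fraction(1, 2 ** (n + 3)) + c3 * Fraction(1, 2 ** (3 * n + 6))
    return a * y + b * x


if __name__ == "__main__":
    x0, y0 = Fraction(3, 7), Fraction(2, 5)
    for sig in product((-1, 1), repeat=4):
        assert four_single_steps(x0, y0, 1, sig)[1] == merged_y(x0, y0, 1, sig)
    print(coefficients((-1, -1, -1, -1)))
```

For the all-negative sign pattern the weights come out as −15, 35, −15, and 1, so the x_n term carries −15 × 2^{-(n+3)} − 15 × 2^{-(3n+6)}.
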
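Case | $\sigma_n$ | $\sigma_{n+1}$ | $\sigma_{n+2}$ | $\sigma_{n+3}$ | Iterative Formula of $z_{n+4}$ |
---|---|---|---|---|---|
1 | −1 | −1 | −1 | −1 | $z_{n+4} = z_n + \theta_n + \theta_{n+1} + \theta_{n+2} + \theta_{n+3}$ |
2 | −1 | −1 | −1 | 1 | $z_{n+4} = z_n + \theta_n + \theta_{n+1} + \theta_{n+2} - \theta_{n+3}$ |
3 | −1 | −1 | 1 | −1 | $z_{n+4} = z_n + \theta_n + \theta_{n+1} - \theta_{n+2} + \theta_{n+3}$ |
4 | −1 | 1 | −1 | −1 | $z_{n+4} = z_n + \theta_n - \theta_{n+1} + \theta_{n+2} + \theta_{n+3}$ |
5 | 1 | −1 | −1 | −1 | $z_{n+4} = z_n - \theta_n + \theta_{n+1} + \theta_{n+2} + \theta_{n+3}$ |
6 | −1 | −1 | 1 | 1 | $z_{n+4} = z_n + \theta_n + \theta_{n+1} - \theta_{n+2} - \theta_{n+3}$ |
7 | −1 | 1 | −1 | 1 | $z_{n+4} = z_n + \theta_n - \theta_{n+1} + \theta_{n+2} - \theta_{n+3}$ |
8 | −1 | 1 | 1 | −1 | $z_{n+4} = z_n + \theta_n - \theta_{n+1} - \theta_{n+2} + \theta_{n+3}$ |
9 | 1 | −1 | −1 | 1 | $z_{n+4} = z_n - \theta_n + \theta_{n+1} + \theta_{n+2} - \theta_{n+3}$ |
10 | 1 | −1 | 1 | −1 | $z_{n+4} = z_n - \theta_n + \theta_{n+1} - \theta_{n+2} + \theta_{n+3}$ |
11 | 1 | 1 | −1 | −1 | $z_{n+4} = z_n - \theta_n - \theta_{n+1} + \theta_{n+2} + \theta_{n+3}$ |
12 | −1 | 1 | 1 | 1 | $z_{n+4} = z_n + \theta_n - \theta_{n+1} - \theta_{n+2} - \theta_{n+3}$ |
13 | 1 | −1 | 1 | 1 | $z_{n+4} = z_n - \theta_n + \theta_{n+1} - \theta_{n+2} - \theta_{n+3}$ |
14 | 1 | 1 | −1 | 1 | $z_{n+4} = z_n - \theta_n - \theta_{n+1} + \theta_{n+2} - \theta_{n+3}$ |
15 | 1 | 1 | 1 | −1 | $z_{n+4} = z_n - \theta_n - \theta_{n+1} - \theta_{n+2} + \theta_{n+3}$ |
16 | 1 | 1 | 1 | 1 | $z_{n+4} = z_n - \theta_n - \theta_{n+1} - \theta_{n+2} - \theta_{n+3}$ |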
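
Every row of the z table above has the same shape, z_{n+4} = z_n − Σ σ_{n+i} θ_{n+i} for i = 0..3. The sketch below illustrates this under the assumption that θ_i = atanh(2^{-i}), the usual hyperbolic CORDIC angle (θ is not defined in this excerpt). It also precomputes the sixteen possible offsets for a given n, which is one plausible way to realize the table; `theta`, `merged_z`, and `offset_table` are hypothetical names.

```python
import math
from itertools import product


def theta(i: int) -> float:
    """Assumed elementary hyperbolic angle atanh(2^-i); requires i >= 1."""
    return math.atanh(2.0 ** -i)


def merged_z(z: float, n: int, sigmas) -> float:
    """z_{n+4} = z_n - sum(sigma_{n+i} * theta_{n+i}) for i = 0..3."""
    return z - sum(s * theta(n + i) for i, s in enumerate(sigmas))


def offset_table(n: int) -> dict:
    """All sixteen offsets (z_{n+4} - z_n), keyed by the sign pattern."""
    return {sig: merged_z(0.0, n, sig) for sig in product((-1, 1), repeat=4)}


if __name__ == "__main__":
    table = offset_table(1)
    # All-negative signs add every angle; all-positive signs subtract every angle.
    print(table[(-1, -1, -1, -1)], table[(1, 1, 1, 1)])
```
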
Item | Value |
---|---|
Period (ns) | 3.3 |
Latency (cycle) | 76 |
Area (μm²) | 1417366 |
Power (mW) | 36.2189 |
Precision (bit) | 113 |
Total time (ns) | 250.8 |
ATP (mm²·ns) | 355.4754 |
Total energy (fJ) | 9083.7001 |
Energy efficiency (fJ/bit) | 80.3867 |
Area efficiency (bit/(mm²·ns)) | 0.3179 |
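
The derived rows of the table above appear to follow from the first five rows by simple arithmetic. The sketch below recomputes them under that assumption (total time = period × latency, ATP = area × total time, total energy = power × total time, and the two efficiencies as ratios with the precision). These relations are inferred from the numbers, not quoted from the paper, and the variable names are hypothetical.

```python
# Primary values as printed in the table.
period_ns = 3.3
latency_cycles = 76
area_um2 = 1417366
power_mw = 36.2189
precision_bits = 113

# Derived values (assumed relations, inferred from the printed numbers).
total_time_ns = period_ns * latency_cycles           # about 250.8
area_mm2 = area_um2 / 1e6                            # 1 mm^2 = 1e6 um^2
atp_mm2_ns = area_mm2 * total_time_ns                # about 355.4754
# Numeric product of mW and ns. Dimensionally 1 mW x 1 ns = 1 pJ;
# the table prints this row with the unit label fJ.
total_energy = power_mw * total_time_ns              # about 9083.7001
energy_per_bit = total_energy / precision_bits       # about 80.3867
area_efficiency = precision_bits / atp_mm2_ns        # about 0.3179

if __name__ == "__main__":
    print(round(total_time_ns, 1), round(atp_mm2_ns, 4), round(total_energy, 4))
    print(round(energy_per_bit, 4), round(area_efficiency, 4))
```
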
Item | $\sqrt[N]{X}$, [14] | $\sqrt[N]{X}$, [12] | $\sqrt[N]{X}$, Proposed | $X^N$, [14] | $X^N$, [12] | $X^N$, Proposed |
---|---|---|---|---|---|---|
X | $[10^{-6}, 10^{6}]$ | $[10^{-6}, 10^{6}]$ | $(2^{-16382}, 2^{16383})$ | $[10^{-2}, 10^{2}]$ | $[10^{-2}, 10^{2}]$ | $(2^{-16382}, 2^{16383})$ |
N | [2, 1002] | [2, 1002] | $(2^{-16382}, 2^{16383})$ | [1, 5] | [1, 5] | $(2^{-16382}, 2^{16383})$ |
k | 40,000 | 40,000 | 100,000 | 40,000 | 40,000 | 100,000 |
max(RE) | $1.928 \times 10^{-3}$ | $1.069 \times 10^{-3}$ | $1.688 \times 10^{-34}$ | $1.030 \times 10^{-2}$ | $5.272 \times 10^{-3}$ | $1.610 \times 10^{-34}$ |
ARE | $5.464 \times 10^{-4}$ | $4.160 \times 10^{-4}$ | $1.446 \times 10^{-34}$ | $2.875 \times 10^{-3}$ | $2.095 \times 10^{-3}$ | $1.442 \times 10^{-34}$ |
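
To make the error rows of the table above concrete, the sketch below shows one way such statistics could be gathered. It assumes RE is the relative error against a higher-precision reference, max(RE) is its maximum, ARE is its arithmetic mean, and k is the number of random test pairs; these readings are assumptions, as the excerpt does not define them. Ordinary double-precision `x ** n` plays the role of the unit under test, and Python's `decimal` module provides the reference, so the figures it produces say nothing about the proposed architecture.

```python
import random
from decimal import Decimal, getcontext

getcontext().prec = 50  # reference precision, well beyond double precision


def relative_errors(pairs, approx):
    """Relative error of approx(x, n) against a 50-digit reference of x**n."""
    errors = []
    for x, n in pairs:
        reference = Decimal(x) ** Decimal(n)
        measured = Decimal(approx(x, n))
        errors.append(abs((measured - reference) / reference))
    return errors


rng = random.Random(0)
k = 1000  # number of test pairs (the table uses far more)
# Input ranges borrowed from the X^N columns for [14] and [12].
pairs = [(rng.uniform(1e-2, 1e2), rng.uniform(1.0, 5.0)) for _ in range(k)]

errors = relative_errors(pairs, lambda x, n: x ** n)
max_re = max(errors)
are = sum(errors) / len(errors)

if __name__ == "__main__":
    print(f"max(RE) = {max_re:.3E}, ARE = {are:.3E}")
```
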
Function | Architecture | Type | Logarithm | Division/Multiplication | Exponential |
---|---|---|---|---|---|
$\sqrt[N]{X}$ | [14] | Module | HV CORDIC | LV CORDIC | HR CORDIC |
$\sqrt[N]{X}$ | [14] | S + I + F | 1 + 2 + 45 | 1 + 10 + 27 | 1 + 2 + 27 |
$\sqrt[N]{X}$ | [14] | Total Bits | 48 | 38 | 30 |
$X^N$ | [14] | Module | BV CORDIC | LV CORDIC | BR CORDIC |
$X^N$ | [14] | S + I + F | 1 + 2 + 32 | 1 + 5 + 27 | 1 + 2 + 27 |
$X^N$ | [14] | Total Bits | 35 | 33 | 30 |
$\sqrt[N]{X}$ and $X^N$ | [12] | Module | SM-LUT | Multiplier | SM-LUT |
$\sqrt[N]{X}$ and $X^N$ | [12] | S + I + F | 1 + 0 + 27 | 1 + 10 + 27 | 1 + 0 + 27 |
$\sqrt[N]{X}$ and $X^N$ | [12] | Total Bits | 28 | 38 | 28 |
$\sqrt[N]{X}$ and $X^N$ | Proposed | Module | states INIT_LN, ITE_LN, ONE_STEP_1_LN, ONE_STEP_2_LN | states INNER_DEAL_0, INNER_DEAL_1, INNER_DEAL_2, INNER_DEAL_3 | states INIT_EXP, ITE_EXP, ONE_STEP_1_EXP, ONE_STEP_2_EXP |
$\sqrt[N]{X}$ and $X^N$ | Proposed | S + I + F | 1 + 15 + 112 | 1 + 15 + 112 | 1 + 15 + 112 |
$\sqrt[N]{X}$ and $X^N$ | Proposed | Total Bits | 128 | 128 | 128 |
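
The word lengths in the table above can be read as fixed-point formats. The sketch below assumes S, I, and F denote sign, integer, and fraction bits (an interpretation that agrees with every printed total but is not spelled out in this excerpt). It checks the totals and shows the rounding error bound that F fraction bits imply. `WORD_FORMATS` and `quantize` are hypothetical names.

```python
from fractions import Fraction

# (label, (S, I, F), total bits as printed in the table)
WORD_FORMATS = [
    ("[14] root, logarithm", (1, 2, 45), 48),
    ("[14] root, division", (1, 10, 27), 38),
    ("[14] root, exponential", (1, 2, 27), 30),
    ("[14] power, logarithm", (1, 2, 32), 35),
    ("[14] power, multiplication", (1, 5, 27), 33),
    ("[12] SM-LUT", (1, 0, 27), 28),
    ("[12] multiplier", (1, 10, 27), 38),
    ("Proposed, all stages", (1, 15, 112), 128),
]


def quantize(value, frac_bits: int) -> Fraction:
    """Round value to the nearest multiple of 2^-frac_bits (exact arithmetic)."""
    scale = 1 << frac_bits
    return Fraction(round(Fraction(value) * scale), scale)


if __name__ == "__main__":
    for label, fmt, total in WORD_FORMATS:
        assert sum(fmt) == total, label
    # With 112 fraction bits the rounding error is at most 2^-113.
    err = abs(quantize(Fraction(1, 3), 112) - Fraction(1, 3))
    print(err <= Fraction(1, 2 ** 113))
```

Under this reading, the 112 fraction bits of the proposed design give a resolution of 2^-112, compared with 2^-27 to 2^-45 for the formats listed for [12] and [14].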
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liu, M.; Fu, W.; Xia, J. Low-Latency and Minor-Error Architecture for Parallel Computing XY-like Functions with High-Precision Floating-Point Inputs. Electronics 2022, 11, 69. https://doi.org/10.3390/electronics11010069