1. Introduction
Transcendental functions are functions that cannot be expressed by a finite number of the four basic arithmetic operations, power operations or root extractions; examples include trigonometric, inverse trigonometric, exponential and logarithmic functions. They are basic components of mathematical calculations and are widely used in algorithms in various fields [1].
For data-intensive algorithms with strict real-time requirements, the low-latency computation of transcendental functions is of great significance. Motor control, noise filtering, digital signal processing and similar fields usually require a large number of floating-point trigonometric, exponential and logarithmic operations [2,3,4,5,6]. In the field of electric power, trigonometric functions are widely used in power quality calculation, complex harmonic processing and phase calculation [7]. Computing these functions in software takes many cycles, which cannot meet real-time requirements. Therefore, a low-latency, high-precision transcendental-function hardware accelerator is needed. Furthermore, some high-complexity algorithms frequently need several different transcendental functions. Implementing each function separately requires a lot of hardware resources, resulting in large area overheads [8,9]. Therefore, a hardware accelerator that supports multiple transcendental functions is of great significance.
Researchers have proposed various hardware accelerators for calculating transcendental functions, based on the CORDIC algorithm [10,11,12] and on the piecewise polynomial approximation method [13,14,15]. However, the CORDIC algorithm requires multiple iterations to achieve a high accuracy, and it supports only a limited set of transcendental functions [16,17,18]. Although the lookup table method combined with polynomial operation can easily and effectively implement a transcendental function [19,20,21,22,23,24], implementing multiple transcendental functions often requires multiple data paths, resulting in a waste of hardware resources.
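The iteration cost of CORDIC mentioned above can be illustrated with a minimal sketch. It is a generic textbook rotation-mode CORDIC written in double-precision Python, not the fixed-point datapath of any cited design; the names `cordic_sin_cos` and `max_sin_error` are hypothetical and introduced only for this illustration.

```python
import math


def cordic_sin_cos(theta, iterations):
    """Rotation-mode CORDIC for theta in [-pi/2, pi/2]; returns (sin, cos).

    Illustrative only: real hardware would use fixed-point shifts and adds
    and a ROM of precomputed angles instead of calling math.atan.
    """
    # Elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Pre-scale by the inverse CORDIC gain so the result has unit length.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # rotate towards zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


def max_sin_error(iterations, samples=1000):
    """Worst absolute sine error over a uniform grid on [-pi/2, pi/2]."""
    worst = 0.0
    for k in range(samples + 1):
        theta = -math.pi / 2 + math.pi * k / samples
        s, _ = cordic_sin_cos(theta, iterations)
        worst = max(worst, abs(s - math.sin(theta)))
    return worst
```

Comparing `max_sin_error(8)` with `max_sin_error(24)` shows why the iteration count matters: the residual angle, and with it the error bound, shrinks by roughly a factor of two per iteration, so single-precision accuracy needs on the order of two dozen iterations in this simple form.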
A reconfigurable hardware architecture for miscellaneous floating-point transcendental functions is proposed in this paper. High-precision lookup tables for polynomial coefficients of transcendental functions are generated by polynomial fitting. The whole calculation is divided into preprocessing, core computing and postprocessing to achieve a low-latency hardware design. With reconfigurable technology, multiple transcendental function calculations can be implemented by the same core module, which leads to a lower area cost of hardware.
This paper identifies two key challenges in designing a reconfigurable low-latency hardware architecture for floating-point transcendental functions. The first is how to fit each transcendental function with a piecewise lookup table combined with a polynomial computation, so as to obtain a high-precision polynomial coefficient lookup table and thus a low-latency hardware architecture. The second is how to design a core computing unit that can implement multiple transcendental functions in the smallest possible area while maintaining a high accuracy.
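The general idea of a coefficient lookup table combined with a polynomial computation, split into preprocessing, core computing and postprocessing, can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the paper's design: the segment count `SEGMENTS`, the choice of the exponential function, the range reduction by ln 2, and the use of quadratic interpolation at Chebyshev nodes in place of the paper's polynomial fitting are all assumptions, and `fit_quadratic` and `approx_exp` are hypothetical names.

```python
import math

SEGMENTS = 64                # assumed table size, not taken from the paper
LN2 = math.log(2.0)
WIDTH = LN2 / SEGMENTS       # width of one segment of [0, ln 2)


def fit_quadratic(f, lo, hi):
    """Quadratic through three Chebyshev nodes of [lo, hi].

    Returns (c0, c1, c2) for p(t) = c0 + c1*t + c2*t^2 with t = x - lo.
    Interpolation stands in here for a least-squares or minimax fit.
    """
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    ts = [mid + half * math.cos((2 * k + 1) * math.pi / 6) - lo for k in range(3)]
    ys = [f(lo + t) for t in ts]
    # Newton divided differences, then expansion into monomial coefficients.
    d01 = (ys[1] - ys[0]) / (ts[1] - ts[0])
    d12 = (ys[2] - ys[1]) / (ts[2] - ts[1])
    d012 = (d12 - d01) / (ts[2] - ts[0])
    c2 = d012
    c1 = d01 - d012 * (ts[0] + ts[1])
    c0 = ys[0] - d01 * ts[0] + d012 * ts[0] * ts[1]
    return c0, c1, c2


# Coefficient lookup table: one (c0, c1, c2) triple per segment.
EXP_TABLE = [fit_quadratic(math.exp, i * WIDTH, (i + 1) * WIDTH)
             for i in range(SEGMENTS)]


def approx_exp(x):
    # Preprocessing: range reduction x = k*ln2 + r with r in [0, ln 2).
    k = math.floor(x / LN2)
    r = x - k * LN2
    # Core computing: pick the segment, evaluate the quadratic (Horner form).
    i = max(0, min(int(r / WIDTH), SEGMENTS - 1))
    t = r - i * WIDTH
    c0, c1, c2 = EXP_TABLE[i]
    p = c0 + t * (c1 + t * c2)
    # Postprocessing: scale by 2^k.
    return math.ldexp(p, k)
```

In this sketch only the table contents and the pre/postprocessing depend on the function; the core step is the same multiply-add chain for any function, which is the property a shared, reconfigurable core module can exploit.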
Existing work has not achieved a balance between low delay and high precision for transcendental functions. For example, ref. [25] had high accuracy but a high computation delay. Its error was no more than 1.5
, and its delay was 15 cycles. In contrast, ref. [26] needed only 40.3 ns to compute the transcendental function, with a maximum error of
. In existing designs, different transcendental functions have different implementation processes, so they cannot be implemented using the same data path.
The reconfigurable hardware architecture for miscellaneous transcendental functions proposed in this paper uses only 3.75 KB of lookup tables to implement five high-precision floating-point transcendental functions: sine, cosine, arctangent, exponential and logarithmic functions. The difference between the calculation result of the hardware circuit and that of the C language math library is at most 2 . Under the UMC 40 nm CMOS process, the hardware can reach a maximum frequency of 220 MHz. The synthesis results also show that the total area is m and the full-load power consumption is 0.923 mW.
2. Related Work
The CORDIC algorithm is commonly used to implement transcendental functions in digital circuit design. Muñoz et al. [27] used the CORDIC algorithm and a Taylor series expansion to calculate sine, cosine and arctangent functions, combining floating-point operations with a ROM lookup to achieve a high throughput. Sergiyenko et al. [25] applied the CORDIC algorithm to compute transcendental functions in three stages, whose angles were taken, respectively, from a ROM table, a network of CORDIC microrotations and an approximation network, so as to minimize the area and delay. It took only 15 cycles to achieve an accuracy of 0.5
for the sine and cosine calculations of a small angle.
In addition to the CORDIC algorithm, the lookup table method combined with a polynomial computation is also an important approach to implementing transcendental functions. Chen et al. [13] proposed a logarithmic function hardware accelerator that used 7.8 KB of lookup tables and a large number of basic computing units, achieving an accuracy of 3.5
and a latency of 78 ns. Gener et al. [28] presented a lossless LUT compression method which could be used to replace tables among other applications of LUTs. Their method resulted in a 10% performance improvement, but only two transcendental functions were supported in that work. The high-speed transcendental function hardware unit proposed by Tian et al. [14] used a binomial operation, and its accuracy reached
. However, the multiple data paths that were necessary to implement multiple transcendental functions resulted in an excessive area overhead. Nandagopal et al. [15] proposed a novel piecewise-linear method to approximately represent nonlinear logarithmic and antilogarithmic functions. In that study, the calculation delay reached 15.20 ns, but the accuracy could only reach
.
In summary, the iterative CORDIC algorithm is at a disadvantage in latency for transcendental functions: it needs more clock cycles to complete a single-precision floating-point operation with high accuracy. The lookup table method combined with a polynomial computation can achieve single-precision floating-point operations with low latency by using a small amount of storage and hardware resources.
5. Conclusions
In order to support multiple floating-point transcendental function operations with a small hardware circuit area, this paper proposed a reconfigurable hardware architecture for miscellaneous floating-point transcendental functions. Reconfigurable technology was used to implement multiple transcendental functions, including sine, cosine, arctangent, exponential and logarithmic functions, on shared hardware. The resulting accelerator offers the high accuracy and low latency that many application scenarios require, while consuming few hardware resources. It was designed by combining lookup tables with binomial operations, and the generated lookup tables occupy 3.75 KB of space.
The experimental results showed that the difference between the calculation results of the proposed hardware circuit and those of the C language math library was at most 2 . Under the UMC 40 nm CMOS process library, the clock frequency could reach 220 MHz with a latency of 18.18 ns, a full-load power consumption of 0.923 mW and an area of m. Compared with five separate transcendental-function hardware accelerators, the area was reduced by 47.99% and the power was reduced by 38.91%. In area-sensitive application scenarios that require a low latency and a high precision for transcendental function operations, the floating-point transcendental function hardware architecture proposed in this paper has important application value. Moreover, the reconfigurable architecture proposed in this paper is expected to play an even greater role in the future as various fields pursue high-performance computing.
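As a quick consistency check on the reported figures, and assuming that the 18.18 ns latency was measured at the 220 MHz maximum clock frequency (an assumption, since the measurement condition is not restated here), the latency corresponds to about four clock cycles:

\[
T_{\mathrm{clk}} = \frac{1}{220\ \mathrm{MHz}} \approx 4.545\ \mathrm{ns},
\qquad
\frac{18.18\ \mathrm{ns}}{T_{\mathrm{clk}}} \approx 4\ \text{cycles}.
\]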