1. Introduction
With the advancements in information technology, there is a constantly growing volume of data processing, for example, for medical, financial, social communication, and entertainment purposes. The increasing quality of multimedia content, like photos, videos and audio recordings has led to the need to process but also transmit digital information in a faster and more efficient way. High-speed and errorless data transmission systems are an important part of the information technology development. However, in order to make, for example, a radio communication system errorless, special mechanisms called channel coding or Error Correction Coding (ECC) are needed. ECC makes it possible to correct a certain portion of transmission errors without the need for data retransmission. Such mechanisms are based on algorithms with relatively high computational complexity, and thus the elements responsible for data correction in high-speed communication technology are usually complex and energy-consuming.
The energy-efficiency problems are significant in any Very Large Scale Integration (VLSI) circuit design, where an increase in temperature can lead to serious damage [
1]. However, in the case of mobile communication devices powered by batteries, the efficient use of power is of special importance. By reducing power consumption, the operating time of a mobile device is extended [
2,
3]. Let us also mention the resulting environmental benefits. By limiting electricity consumption of designed devices, we reduce the negative impact on the environment, as well as reduce the cost of operating the device itself.
There exists a relationship between the seemingly independent issues of reliable data transmission and the electricity consumption in data transmission systems. It results from the relationship between the power of the transmitted signal and the expected level of reliability (error rate), as well as from the fact that there are possibilities to limit the energy consumption of hardware systems implementing the advanced correction mechanisms. A properly designed system should ensure error-free data transfer, with the lowest possible consumption of electricity. Most of the published research works in the field of ECC focus on finding fast coding-decoding solutions. However, few works take into account the need to search also for energy-saving solutions.
With the development of communication technology, many methods of ECC have been developed, with different levels of computational complexity on the one hand and error correction effectiveness (performance) on the other hand. One of the basic parameters characterizing the ECC coding system is the Bit Error Rate (BER), which quantifies the number of erroneous bits observed in the receiver to the total number of transmitted bits. Similarly, a Frame Error Rate (FER) indicates the relative number of observed erroneous data frames (blocks) in a coding scheme, for schemes that partition data into blocks (which is the case in a broad range of block coding methods).
Low-Density Parity-Check (LDPC) codes are one of the best block ECC methods known today. They were first presented by R. G. Gallager [
4], but the computing technology available at that time did not allow for their practical use. In 1999, LDPC codes were recalled by D. J. C. MacKay [
5]. The scientific community began to look for methods to simplify the algorithm for implementation in integrated circuits. The proposed simplifications enabled the use of LDPC codes in the telecommunication industry, for example, (DVB-S2) [
6,
7], Ethernet 10GBase-T [
8,
9], 802.16e (WiMax) [
10,
11], 802.11ad (WiGig) [
12,
13], 802.11n/ac/ax (WiFi) [
14,
15], GSFC-STD-9100 (NASA) [
16] and recently 5G mobile [
17,
18].
Along with the growing popularity of LDPC codes, the development of optimized LDPC codes and their decoding methods with improved performance is observed. Due to the high computational complexity of such effective solutions, the power consumption seems to be an important, but often neglected issue. Designing energy-efficient LDPC decoders calls for providing a system solution with power saving potential at several design stages. It seems that at each project level, specific techniques to reduce energy consumption can be used [
19].
Probably great benefits can be obtained by minimizing expected energy consumption at the system level. For example, we have found that it is possible to design the architecture with the possibility of involving elements responsible for power gating [
20,
21,
22] of individual elements, or the clock gating [
23,
24,
25]. Also, while reviewing methods to reduce energy consumption, the method of local voltage reduction was considered. Reducing the supply voltage allows to significantly reduce energy consumption by adjusting the supply voltage of individual system elements at the expense of deteriorating dynamic parameters. Power gating enables great energy consumption reduction according to [
19]. Clock gating can be used in FPGAs, unlike power gating, which is not applicable there.
Clock gating allows for the greatest reduction in dynamic energy consumption. However, there are other methods of reducing energy consumption that are worth mentioning. These methods were not used in the presented article due to the insufficient energy-saving in relation to additional elements increasing the hardware demand of the decoder. The first interesting method is a preliminary calculation of the state of outputs (precomputation logic), which consists of adding a circuit, the purpose of which is to determine the state of the outputs one clock signal period earlier [
26,
27,
28]. The second interesting method is determining the kernel [
29,
30,
31], consisting of determining the dominant states of the sequential system from a subset of all states, and then implementing a separate, smaller element performing operations only for dominant states systems. The third interesting method is the implementation of the system in the Globally Asynchronous Locally Synchronous (GALS) architecture [
32,
33,
34], involving decomposition of a locally synchronous system through which data are exchanged via an asynchronous bus, and the clock frequency of each module is individually adapted to the needs of this module.
The aim of the paper is to implement an energy-saving LDPC decoder characterized by non-deteriorated dynamic parameters compared to a decoder without mechanisms limiting energy consumption. For this reason, and taking into account the limitations resulting from the use of FPGA, the article proposes an LDPC decoder architecture based on Token Ring oriented clock gating. The second chapter presents a theoretical introduction describing the most important issues concerning the energy-saving QC-LDPC decoder. The third chapter presents the implementation of the QC-LDPC decoder based on Token Ring architecture with clock gating, which was then tested and the results are presented in the fourth chapter. The article ends with chapter five, which summarizes the obtained results and then indicates further work.
Therefore, the following list of contributions within the framework of the article can be formulated:
General idea of technology mapping of the proposed QC-LDPC decoder in LUT-based FPGA oriented to clock gating (
Section 2).
Development, implementation and testing of a new QC-LDPC decoder architecture based on Token Ring (
Section 2 and
Section 3).
Proposal and design of original architecture low power QC-LDPC decoder oriented to a reduction in dynamic power into LUT-based FPGA (
Section 3).
Performing experiments and presenting results confirming the effectiveness of the proposed solutions (
Section 4).
2. LDPC Codes and Their Decoding—Preliminaries
LDPC codes are an ECC scheme belonging to the class of linear block coding methods. The encoding process of an code of rate adds parity check elements to the vector , which is an information vector to be ECC protected. As a result, the code vector (codeword) is obtained. Vectors and are over Galois field for the binary codes considered in this paper.
A binary
LDPC code is specified by a low density parity check matrix
with
elements. In the decoder, a vector
of length
N is approved to be a correct codeword if and only if it satisfies the parity check equation
over
field arithmetic. Otherwise, a specific error correction decoding algorithm can be used. The LDPC systems employ iterative belief propagation decoding algorithms [
5].
A parity check matrix that contains the same number of “1” values in every row and the same in every column is called a regular matrix. A matrix that does not satisfy the above condition is called an irregular matrix and a corresponding code is called irregular code. Irregular codes, with properly selected weight distributions of ones in columns of are known to outperform their regular counterparts.
2.1. QC-LDPC Codes
The structure of the parity check matrix may be subject to additional constraints according to design requirements, in particular to the codec implementation method. For example, in order to facilitate the hardware implementation of LDPC decoders, the matrix structured into Quasi-Cyclic (QC) form, as shown in the example in
Figure 1, is commonly used. In this illustration, the matrix
of size
is divided into square submatrices of size
, where each submatrix is either a zero matrix or a matrix formed by cyclically shifting the columns of the identity matrix (CPM—Circulant Permutation Matrix [
35]). In general, the parity check matrix of a QC-LDPC code is partitioned with square submatrices
, where each is either a zero matrix of size
or a CPM of size
. QC-LDPC codes are particularly attractive to systems implemented in hardware, because the block-circulant parity check matrix structure enables fast and efficient message routing in a decoder implementation.
The parity check matrix
of QC-LDPC code can be specified by a table (matrix) of cyclic shifts in CPMs (an example is shown in
Figure 1), in which the nonnegative integers (from 0 to
P) represent the number of cyclic shift positions in CPM, while the values “
” indicate null submatrices. Such a representation of the parity check matrix
is convenient because it allows
to be stored in a compressed form in the memory of the decoder. All possible shifts in the CPMs with
are shown in
Figure 2.
2.2. LDPC Decoding Algorithm
The classical LDPC decoding algorithms belong to the class of iterative belief propagation (BP) algorithms, as recalled by MacKay [
5]. In BP, beliefs are propagated between nodes in a Tanner graph [
36] representation of an LDPC code. This graph is a bipartite graph with so-called control nodes representing rows of
, bit nodes representing columns of
and edges representing nonzero positions in
. The commonly used simplified method for determining the beliefs (messages propagated between nodes in the algorithm) is known as the Min-Sum algorithm [
37,
38,
39]. The decoding process is an estimation process of the transmitted code vector
from input data (received “soft” values
). The following denotations are used in the algorithm presentation:
symbol means “becomes”,
—code vector,
—received vector,
—belief LLR value from the n-th bit vertex to the m-th Tanner graph control vertex,
—LLR a priori probabilities for the n-th bit,
—set of integers between 1 and N,
—a set of column indexes in the parity check matrix containing one in the m-th row,
—a set of row indexes in the parity check matrix containing one in the n-th column,
—message from the m-th control vertex to the n-th bit vertex of Tanner graph,
—the n-th value of the decoded vector.
The Min-Sum decoding algorithm can be presented in consecutive steps as follows:
Initialization of the decoder based on “soft” decisions: Based on the vector of data
received from the demodulator, the probability values are determined, which can be written as
and
for
, which are the probabilities that the
bit was originally transmitted if the
value has been received. From values
and
, Log-Likelihood Ratios (LLR) are determined, which are the actual algorithm input. For every
and
. Initialization of
beliefs (messages) can be expressed as:
Control nodes processing: Calculation of
, messages from
mth control node to
nth bit node. For every
and
:
Bit nodes processing: Recalculation of
, messages from
nth bit node to
mth control node. For every
and
:
Total beliefs update: Calculation of the current beliefs about all bits in
. For every
:
Hard decisions update: Making trial hard decisions. For every
:
Verification of control equations:
Another iteration: If the parity check Equation (
6) is satisfied or the maximum number of iterations is reached, the computations are terminated. Otherwise, the next iteration starts from point (
2).
The hardware implementation of the QC-LDPC decoder is closely related to the content of the parity check matrix . Considering the data processing parallelism in an implementation, the decoder architecture can be serial, partially-parallel or parallel. Partially-parallel decoders are the best choice for most applications as being a compromise between decoding throughput and hardware resources overhead. The parity check matrix of QC-LDPC code fits perfectly into the concept of partially-parallel decoders by structuring the matrix into submatrices that have only single “1” value in each column and each row. This allows the individual decoding steps to be parallelized and messages efficiently propagated with the use of memories, with single memory word grouping messages corresponding to the single submatrix of .
Figure 3 presents the concept of partially-parallel decoder architecture. We have noticed that this architecture of the QC-LDPC decoder is consistent with the idea of a Token Ring network, one of the methods of creating a Local Area Network (LAN) developed by IBM [
40]. In the concept of “token passing”, a token is awarded in a given moment to this network element that is about to start processing and transmitting data. We have applied this concept in our partially-parallel QC-LDPC decoder implementation. Here the token is passed between the control and bit nodes computing units, which require obtaining the result of the previous iteration, according to the presented MS decoding algorithm, to perform the next iteration of computation. The proposed token passing scheme for the partially-parallel QC-LDPC decoder architecture is shown in
Figure 3.
3. Implementation of the QC-LDPC Decoder Based on the Token Ring Architecture with Clock Gating
It has been assumed that in the designed implementation of a decoder with an architecture based on Token Ring, the set of implementable QC-LDPC codes should not be limited. The proposed general concept of the decoder in the form of a block diagram is shown in
Figure 4. The flexibility of the proposed architecture results from the fact that it is easy to design a decoder for various parameters of the parity check matrix
. Such configurable parameters of the
matrix are: regular or irregular matrices, with different block lengths (
N), code rate (
R) and the size of the submatrix,
P.
The decoder comprises five separate memory blocks,
P control node computing units,
P bit node computing units, a control equation verification module and modules responsible for initialization and communication with the environment. In the developed decoder structure, memories constitute a specific interface between the computing module. The message passing between the nodes is done via memory
(messages
in (
2)–(
3)) and memory
(messages
in (
2)–(
3)).
The communication support module receives the data to be decoded in the form of the LLR and saves it in the proper order in the memory
. The received data is also passed to the decoder initialization module as in (
1), which fills the memory
. Starting the initialization while the data transmission is in progress can slightly reduce the time in which the QC-LDPC decoder completes the decoding process. After finishing initialization, the process of calculating the control node messages begins. The computing unit of the control nodes takes the input data (messages) from the memory
, performs the computations according to the algorithm step (
2) and stores results in the memory
. The computing unit core consists of
P-computing units responsible for parallel computations corresponding to output messages of
P nodes.
In the next processing time slot, the messages of bit nodes (
3), total beliefs (
4) and hard decisions (
5) are computed. Due to an obvious similarity in expressions (
3) and (
4), we found that a single complex decoder element can jointly realize these computations. The current hard decisions as in (
5) are equivalent to the sign bits of the result of computations in (
4). Therefore, computations according to (
5) can also be integrated into this complex computing block, which we will call the bit node computing unit. The whole computation core consists of
P such complex bit node computing units. Every bit node computation retrieves data (messages) from memories
and
, determines the bit node messages and saves results in memory
as well as current hard decisions in memory
.
After the bit node processing step is completed, verification of the control equations module for current hard decisions is performed according to expression (
6). This is done in the verification of the control equations module by checking each of the control equations defined by rows of
in (
6). When all controls are satisfied for current hard decisions, the final decoding result is sent to the decoder output via the communication of the result module. If the control equations are not satisfied, the next iteration in the decoder starts. The communication of the results module contains a counter that is incremented with each iteration. If the counter reaches a predetermined maximum value, the decoding is aborted and the decoder signals an erroneous decoding result.
Figure 4 shows the idea of a token passing between elements of the QC-LDPC decoder based on the Token Ring architecture. This idea is directly related to the operation of the distributed control system where each decoder element is responsible for the correct receipt and transfer of the token (token passing) at the time of starting and finishing its work. The presented QC-LDPC decoder based on Token Ring architecture is an innovative design approach in relation to other known solutions from the literature.
The token is passed sequentially up to the point of verification of control equations module, where—depending on the result of the verification—one of the two possible token transfers, or , can be made. The token transfer according to will occur when the current hard decisions do not pass the verification, and the next decoding iteration is performed, with the token returning to the control nodes computing units. The global control system is partitioned into six separate controller modules in this proposed solution, constituting a distributed control system. In the proposed solution, at each stage of the computation, only one control unit is active, which gives a potential for reduction of dynamic energy consumption.
The distributed control system used in the proposed decoder architecture is the result of the search for a solution that is well suited to LUT-based FPGA resources. Typically, the control system is implemented in the form of an integrated control block from which control signals are distributed to individual executive blocks (decoder elements). This type of solution implemented in FPGA introduces a number of problems related to delays. In the proposed QC-LDPC architecture, the integral control system would create problems due to clock-skew [
41]. The result of the search supported by numerous experiments is a distributed implementation of the control system, in which the control unit responsible for blocking the clock signal is integrated with the executive module. This proposed scheme is modeled on the solutions used in GALS systems. Practical experiments have shown that it works well in systems where the clock gating technique is used.
When presenting the concept of implementing a decoder with a distributed control system (Token Ring architecture), it is worth paying attention to the role of memories, which constitute a specific interface between the QC-LDPC decoder modules. The memory blocks in
Figure 4 do not have a separate control module, because the control elements are located in the modules that communicate with them. The memory blocks can, therefore, be associated with interface elements, which are the place of storage of data used by individual decoder modules.
The Token Ring architecture with a distributed control system forms a basis for implementing clock gating of decoder modules for a significant dynamic power reduction. Activation of a selected control unit with a token may be done together with the computing elements controlled by this control unit. In this method, when computations are completed the module and the token is passed, the clock signal distributed in the respective computing unit is blocked (deactivated).
Figure 5 illustrates the clock gating processing in a single control unit of the proposed distributed system. A similar circuit is found in every element of the decoder, to which a token can be passed.
Immediately after receiving the token, the clock gating element shown in
Figure 5 activates the local clock signal (LCLK) by unblocking the gating circuit of the global clock signal (GCLK). The LCLK is passed to the element of the distributed control system, which starts to control the decoder element (computing module and memory). At the same time, the local clock signal LCLK is delivered to this computing module. After local computations are finished, the control system passes the token to the next control system element and blocks the LCLK. This processing scheme allows for significant reduction of the energy consumption due to activating the clock signal delivered to computation logic only in the needed time slot.
Table 1 presents the analysis of the operating time of individual decoder elements and the number of Adaptive Logic Module (ALM) blocks [
37,
42] utilized by the designed QC-LDPC implementation. These results have been obtained for a single-iteration-decoding of a QC-LDPC regular matrix with dimensions
(
),
and
, where
Z is the number of nonzero elements of the parity check matrix
. During normal operation, in the first iteration the decoder performs initialization once and in the last iteration it communicates the results module. For all other iterations, both decoder elements are not active. The communication support module was not taken into account during the analysis because it is an element necessary for the proper operation of the decoder but it strictly depends on the design assumptions and can be implemented in any form (serial, parallel or partially-parallel), so it should not be included in
Table 1. In
Table 1, it is worth noting that for the proposed architecture, a single decoder element works no more than 34% of the total time and no more than 50% of ALM blocks have an active clock signal at any time.
4. Experimental Results
The proposed architecture of the QC-LDPC decoder based on Token Ring with a distributed control system and clock gating has been implemented in the Altera (Intel) Cyclone V FPGA devices, for a number of different parity check matrices
of QC-LDPC codes. The obtained implementation and power loss testing results are presented in
Table 2. The presented results concern decoders with 4-bit data quantization of decoder messages and the Min-Sum computation algorithm.
Table 2 contains characterizations of the parity check matrices
, where
I refers to an irregular matrix and the
R refers to a regular matrix. Two variants were implemented and analyzed for each
: “Without Token Ring”, which does not include clock gating and token forwarding, and a “With Token Ring” variant, which includes a distributed control system, an architecture based on a Token Ring with token passing and clock gating. It is worth noting a significant reduction of the
(dynamic energy consumption) for the architecture based on a Token Ring (on average by
), with a simultaneous slight increase in the demand for ALM units (on average by
). It is worth noting also that the architecture based on a Token Ring does not increase the demand for memory bits (Mem).
Besides the reduction of power in the Token Ring solution, several other observations can be made from the results in
Table 2. Since an important decision in the LDPC coding system design is the choice between regular and irregular codes, it is valuable to compare the energy consumption for these two cases. We have observed a clear difference in energy consumption for regular and irregular decoders in the proposed implementation. In
Table 3, we emphasize results for two selected matrices, regular and irregular, which are direct counterparts; that is, they have the same size and the overall number of nonzero elements (
). In this table, we also include
—the number of GCLK ticks needed to perform one complete decoding iteration and
—the average dynamic energy per iteration. As can be seen, a QC-LDPC decoder of a regular code requires slightly more clock cycles
than an irregular counterpart. This is due to the possibility of faster execution of computations through their parallelization, resulting from a different construction of both matrices
in the form of QC: irregular column weights allow for faster computations for a portion of rows or columns. Due to the different values of
, it is improper to compare the two decoders on the basis of
; and it is valuable to determine the average dynamic energy
. Comparing
reveals that the decoder based on an irregular matrix is characterized by a much higher energy consumption (more by 168%) compared to the decoder based on the regular matrix.
Table 3 also shows an increased hardware demand for ALM units (44% more) in irregular codes decoder. This shows a tradeoff between coding performance (favorable for irregular codes) and energy consumption (favorable for regular codes).
We have also determined the power loss in particular decoded elements. The results are shown in
Table 4. They allow for a comparison of
in regular and irregular decoders. Significant differences can be seen, in particular for the control node computing units, bit node computing units and the clock tree being the power loss of the clock tree designed in the QC-LDPC decoder (for each FPGA device). In relation to the above, it can be concluded that especially regular matrices decoded in a Token Ring architecture can be perceived as an energy-saving QC-LDPC decoding framework.
The dynamic power loss
and the hardware utilization (ALM and Mem) are also dependent on the size
P of the submatrix.
Table 5 presents a comparison of the decoders for the same matrix
size but different values of
P. The submatrix size
P corresponds to the number of processing units (see
Figure 4), and at the same time it affects the number of clock cycles
. As a result, the decoder throughput, but also power loss, is strongly affected by
P. On the other hand, the average energy consumption
is supposed to be independent of
P. However, we still observed
dependent to some extent on
P, which reveals the energy efficiency of the proposed implementation is dependent on
P.
Results presenting the mentioned dependencies are shown in
Table 5 as well as
Figure 6,
Figure 7 and
Figure 8, where “[cyc/it]” denotes the number of ticks of a global clock needed to perform one decoding iteration.
Figure 6 shows the dependence of
on the value of
P.
Figure 7 shows the dependence of the hardware demand for ALM units on the value of
P. One can see an approximately linear increase in hardware requirements with increasing values of
P.
Figure 8 shows the dynamic energy consumption
depends on the value of
P. It can be seen that despite the increase in
with increasing
P, the
significantly decreases with
P. This means that a semi-parallel decoder with a high degree of parallelization of computations will be characterized by a greater energy-efficiency in comparison to a decoder with a lower degree of parallelization. In view of the above, it can be seen that the correct selection of
P for the design requirements is an important system design choice in terms of hardware requirements and dynamic energy losses.
The presented architecture should finally be compared with the results from the literature. This is very complicated due to the enormous number of possible parity check matrices
as well as the decoder design configurations. Even standards (like IEEE802.11ad, WiGig) are characterized by a large number of utilized matrices
upon which the QC-LDPC decoder can be designed. The most popular architectures in the literature are parallel architectures, which also pose a significant difficulty when trying to make a comparison, although in engineering practice the highest decoding speed is not always required. Nevertheless, an attempt was made to compare the proposed Token Ring-based architecture for the WiGig standard code. This is presented in
Table 6, where NMS means Normalized Min-Sum decoder algorithm, which is a slightly modified decoding computation method in control nodes of the presented Min-Sum algorithm [
37].
The design proposed in this article will be recommended in the systems when the maximum possible throughput is not critical, while maintaining a reduced value, which has a direct impact on the temperature generated in the integrated circuit.
5. Conclusions
This article presents an innovative idea of the QC-LDPC decoder design using the Token Ring architecture. The proposed architecture is oriented to reduce energy consumption. A significant reduction in dynamic energy consumption has been demonstrated with the decoder throughput remaining unchanged. Moreover, the analysis of the implementation results of decoders based on regular and irregular matrices , as well as matrices with different submatrix sizes, allowed us to indicate an energy-saving configuration. Such a configuration is the regular code decoder with a sufficiently large submatrix size P. It is likely that similar conclusions can be drawn for other QC-LDPC decoder designs, but we cannot claim that for sure.
The architecture based on a Token Ring can undoubtedly be used in the implementations of QC-LDPC decoders as well as in many other modern digital systems. In the proposed solution, which is not optimized in terms of throughput, significantly lower losses of dynamic power were achieved combined with lower throughput. This type of implementation can be applied wherever the throughput is not a critical parameter of the designed device.
The obtained results of the experiments indicate the general principles of selecting the matrix form, which has a positive effect on the power consumption of the designed decoder. It cannot be denied that it is difficult to judge whether the obtained results are universal in nature, or only apply to a specific family of FPGAs. It seems that they should be universal in nature, but confirming this would require experiments for several families of FPGAs. A comparison of the designed decoder with alternative solutions taken from the literature shows that the proposed solution has high implementation potential. It seems that it is also worth using them in ASICs, which do not have a number of the limitations characteristic of programmable systems.