1. Introduction
In recent years, communication technology has made rapid progress. As an important part of the communication process, demodulators have been widely used in satellite communications [1,2,3,4], navigation [5,6], mobile communications [7,8,9], underwater communications [10,11], and so forth. Among them, BPSK demodulators have attracted wide attention due to their excellent antinoise performance [12,13,14].
To realize a BPSK demodulator, the cooperation of hardware platforms and algorithms is necessary. So far, the most frequently used hardware platforms have been analog circuits built from discrete components, application-specific integrated circuits (ASICs), and general-purpose programmable devices. Analog circuit platforms have rarely been used in recent years, owing to the instability of discrete components [15]. ASICs have a small volume and low power consumption, but their cost is much higher and their function cannot be modified once the tape-out has been completed [16]. Among general-purpose programmable devices, FPGAs offer remarkable stability, repeatable programmability, and high energy efficiency. In recent years, FPGA implementation of BPSK demodulation has become a popular trend [17].
The traditional algorithms of BPSK demodulation can be divided into two categories: coherent demodulation [18,19] and noncoherent demodulation [20]. Coherent demodulation regenerates a local carrier with the same frequency and phase as the modulated carrier by carrier synchronization; it then mixes the local carrier with the modulated signal, performs down-conversion, low-pass filtering, and timing synchronization, and finally completes demodulation. Noncoherent demodulation adds a delayed copy of the input signal to the current input signal to form a pulse sequence; by analysing this sequence, the demodulated data can be obtained. As no carrier synchronization is needed, noncoherent demodulation algorithms are easy to realize, but their performance is much lower than that of coherent algorithms. Nowadays, coherent demodulation is the most widely used BPSK demodulation method. However, it also has the following shortcomings: (1) sensitivity to frequency offset, as large frequency offsets may cause the carrier synchronization to malfunction; (2) high complexity, which means more hardware resource consumption and energy consumption; (3) carrier synchronization takes considerable time, which causes delay at the beginning of a communication task.
Recently, some scholars have attempted to introduce machine learning into demodulation technology. For the neural network demodulators mentioned in [21,22,23], the demodulation principle is to analyse the modulated data in every symbol period with a neural network. The modulated data are divided into symbols according to the number of samples per symbol, which may cause errors when offsets exist. Furthermore, the design of the training data sets is not reasonable, as more nonideal factors need to be considered.
The convolutional neural network (CNN) is an outstanding machine learning model [24]. In particular, owing to its feature extraction ability and one-dimensional structure, the 1D-CNN [25] is well suited to processing discrete time series. In the published studies, there is almost no use of 1D-CNNs in demodulator research. A 1D-CNN is able to handle frequency offset if it is trained on appropriate data, and its complexity can be reduced by a suitable network structure. By making full use of the parallel computing architecture of an FPGA, the demodulation delay may also be reduced. Hence, this paper proposes the design and implementation of a BPSK demodulator based on a 1D-CNN. Five nonideal factors were considered in the training data sets, so that the demodulator gained better adaptability to offsets. Experiments were repeatedly carried out to obtain a suitable network structure, considering both accuracy and complexity. A time-delay network for convolutional calculation and a structure for piecewise approximation of the activation function were designed to further simplify calculation and shorten latency.
The rest of this paper is organized as follows: Section 2 introduces the basic principle of 1D-CNN demodulation. Section 3 describes the FPGA implementation of the 1D-CNN demodulator. Section 4 introduces the training and test process of the 1D-CNN demodulator and analyzes the results. Finally, Section 5 summarizes this paper.
2. Basic Principle of 1D-CNN Demodulation
The modulated BPSK signal in the time domain can be expressed as

$$s_{\mathrm{BPSK}}(t) = \left[\sum_{n} a_n\, g(t - nT_b)\right]\cos(\omega_c t),$$

where an ∈ {−1, +1} is the bipolar symbol sequence, g(t) denotes a single rectangular pulse, Tb is the pulse width, and ωc is the angular frequency of the carrier. Information carried by the BPSK signal is contained in the phase shifts. Through the detection of the phase shifts, the carried information can be recovered.
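To make the signal model concrete, the following NumPy sketch generates a sampled BPSK waveform according to this expression, using the carrier frequency, symbol rate, and sampling rate of the experimental setup in Section 4 (the function name and defaults are ours, not the paper's):

```python
import numpy as np

def bpsk_waveform(bits, fc=10e6, rb=5e6, fs=80e6, phase0=0.0):
    """Generate a sampled BPSK waveform: s(t) = a(t) * cos(2*pi*fc*t + phase0)."""
    sps = int(fs / rb)                                  # samples per symbol (16 here)
    a = np.repeat(2 * np.asarray(bits) - 1, sps)        # {0,1} -> {-1,+1}, rectangular pulse
    t = np.arange(a.size) / fs
    return a * np.cos(2 * np.pi * fc * t + phase0)

s = bpsk_waveform(np.random.randint(0, 2, 100))         # 100 random symbols
```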
The process of 1D-CNN demodulation is shown in Figure 1. The BPSK signal is imported separately into two 1D-CNNs: 1D-CNN1 detects phase shifts from 0 to π, and 1D-CNN2 detects phase shifts from π to 0. Each 1D-CNN outputs a pulse when its type of phase shift is encountered. A decision module handles the outputs of the two 1D-CNNs. When the output of 1D-CNN1 exceeds the predefined threshold, the output of the decision module is switched from 1 to 0; when the output of 1D-CNN2 exceeds the predefined threshold, it is switched from 0 to 1.
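The decision rule can be sketched as follows — essentially a set/reset latch driven by the two network outputs. This is a minimal illustration; the threshold value and initial level are assumptions:

```python
def decide(cnn1_out, cnn2_out, threshold=0.5, initial=1):
    """Toggle the demodulated level according to which 1D-CNN fires.

    A 1D-CNN1 pulse (0 -> pi shift) drives the output to 0;
    a 1D-CNN2 pulse (pi -> 0 shift) drives it to 1.
    """
    level, out = initial, []
    for y1, y2 in zip(cnn1_out, cnn2_out):
        if y1 > threshold:
            level = 0
        elif y2 > threshold:
            level = 1
        out.append(level)
    return out
```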
The two 1D-CNNs have the same structure; only the parameters of the convolution kernels and neurons differ. The structure of the 1D-CNN is shown in Figure 2; it consists of four layers: an input layer, a convolution layer, a hidden layer, and an output layer. The input layer conveys segmented data as an input vector. The convolution layer convolves the input vector with a convolution kernel, and the result is passed to the hidden layer. The hidden layer aims to prevent the network from being trapped in a local optimum and to make the 1D-CNN converge more easily during training. Neurons in the hidden layer are connected to the convolution layer. The weighted results from every neuron in the hidden layer are summed in the output layer and then passed to the decision module. The operation of a 1D-CNN includes forward propagation and backward propagation. In the training process, backward propagation adjusts the network parameters according to the output and the loss, until the loss reaches its minimum. Once the network parameters are determined, forward propagation can deduce the result independently.
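For reference, a minimal Keras sketch consistent with this structure (two convolution kernels with ReLU activation, 20 sigmoid neurons in the hidden layer, and a single summing output neuron, as detailed later in Sections 3 and 4) might look as follows; the optimizer and loss function are our assumptions, since the paper does not specify them:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_1dcnn(input_len=9, n_hidden=20):
    """Four-layer 1D-CNN of Figure 2: input, convolution, hidden, output."""
    return keras.Sequential([
        keras.Input(shape=(input_len, 1)),
        layers.Conv1D(filters=2, kernel_size=input_len,
                      padding="same", activation="relu"),  # two kernels, ReLU
        layers.Flatten(),
        layers.Dense(n_hidden, activation="sigmoid"),      # 20 hidden neurons
        layers.Dense(1),                                   # summing output neuron
    ])

model = build_1dcnn()
model.compile(optimizer="adam", loss="mse")                # training setup assumed
```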
3. FPGA Implementation
This paper implements the 1D-CNN demodulator in an FPGA. The implementation block diagram is shown in Figure 3.
Two 1D-CNNs are implemented in the FPGA. The input of each 1D-CNN is a discrete time series from an analog-to-digital converter (ADC). The input layer segments the series with a sliding window and conveys the resulting input vector to the convolution layer. If we suppose that the input series is x(n), n = 0, 1, 2, …, and N is the length of the input vector, then the input vector Xn can be indicated as

$$X_n = \left[x(n-N+1),\ x(n-N+2),\ \ldots,\ x(n)\right].$$
Registers are cascaded to realize a sliding window. The convolution layer includes two convolution kernels, which store the convolution parameters in block read-only memory (ROM) and complete convolution calculations. A rectified linear unit (ReLU) function is selected as the activation function of the convolution layer. The number of neurons in the hidden layer is set to 20, and the activation function of the hidden layer is a sigmoid function. A single neuron is set in the output layer. The existence of a certain type of phase shift can be concluded directly by referring to the state of this neuron. Comparators are used in the decision module to compare the predefined threshold with the results of the two 1D-CNNs. The valid bus gives the enable signal according to the states of the ADC and each layer. All components are driven by the same clock, provided by the external crystal.
To improve the efficiency of resource utilization and operation speed of the FPGA, three methods were adopted as follows: (1) the use of a time-delay network for convolutional calculation; (2) the use of a look-up table (LUT) together with a piecewise function to achieve the activation function; (3) the use of a parallel structure within layers and a pipeline structure between layers.
3.1. Implementation of the Convolution Kernel
The convolution kernel is the core component of the convolution layer. Repeated experiments showed that the detection accuracy was highest when the length of the input vector was slightly larger than the number of samples per carrier period. In this design, the length of the input vector is M + 1, with M denoting the number of samples per carrier period. In order to accurately detect the phase shifts in the input vectors, the 'same' convolution mode was selected, meaning that the input vector and the output vector have the same length. According to the rule of convolution calculation, an output vector of length M + 1 can be obtained only if the input vector is first expanded to a length of 2M + 1; we padded the vector with zero elements. The expanded input vector is convolved with a convolution kernel of length M + 1, and the result is shown in Figure 4.
As shown in Figure 4, completing this convolution calculation requires (M + 1)² multiplication operations. Some of these operations involve a zero operand and can be ignored. The calculation results for the data inside the boxed frame have all appeared in previous steps or will appear in forthcoming steps. By feeding some of the results into a time-delay network composed of several time-delay queues, a large number of repeated calculations can be avoided. The structure of the time-delay network is shown in Figure 5.
Assuming that Xn is the current input vector, then together with the data held in the time-delay network, the convolution result Yn can be obtained with only M + 1 multiplication operations; the calculation complexity is reduced by one dimension. In an FPGA implementation, the cost of the time-delay queues is far lower than that of multipliers: under the control of the FPGA clock, data can be delayed accurately beat by beat while occupying few hardware resources. The timing sequence of the time-delay network is shown in Figure 6.
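In software terms, the time-delay network behaves like a transposed-form FIR structure: each new sample is multiplied once by every kernel tap, and the products are accumulated into a queue of delayed partial sums. The NumPy sketch below illustrates this reading (variable names are ours):

```python
import numpy as np

def streaming_conv(samples, kernel):
    """Streaming convolution with only len(kernel) multiplications per sample.

    'delay' holds partial sums, playing the role of the time-delay queues:
    each new sample contributes one product to each of the next K outputs.
    """
    K = len(kernel)                                # K = M + 1
    delay = np.zeros(K)                            # time-delay network (partial sums)
    out = []
    for x in samples:
        delay = delay + x * np.asarray(kernel)     # the only K multiplications
        out.append(delay[0])                       # completed convolution result
        delay = np.roll(delay, -1)
        delay[-1] = 0.0
    return np.array(out)

# Sanity check against direct convolution:
x, h = np.random.randn(50), np.random.randn(9)
assert np.allclose(streaming_conv(x, h), np.convolve(x, h)[:len(x)])
```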
3.2. Implementation of a Single Neuron
Neurons play an important role in the hidden layer. Assuming that x1, x2, …, xm are the inputs of a neuron, w1, w2, …, wm are the weights of its synapses, b is the bias, and f(x) is the activation function, the structure of the neuron can be indicated as in Figure 7, and its output is y = f(w1x1 + w2x2 + … + wmxm + b). Each neuron is fully connected to every convolution kernel output through synapses, and each synapse has its own weight. The weighted results are successively summed, biased, activated, and finally output. In the FPGA implementation, a full-adder circuit is used for the summing operation, and a fixed-point multiplication IP core is used for the multiplications.
In this paper, two activation functions are used, the ReLU function and the sigmoid function:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}.$$
The implementation of the ReLU function is very simple, so we mainly discuss the implementation of the sigmoid function. Two methods are combined to realize the sigmoid function: the LUT method and the piecewise linear function method.
In the implementation of the LUT method, we mainly used the internal block RAM in the FPGA. First, we made a table from the input to the output, according to the resolution. Then, letting the input of the function be the address, the corresponding function value was written to the memory cell of this address. In this way, the complicated calculation process was simplified to the straightforward addressing process. By using the symmetry of the sigmoid function, only the positive part of the function was written to the LUT, whereas the negative part could be obtained by some simple adjustments according to the input data. This method has the advantages of high precision and ultrashort delay. However, the disadvantage is that the hardware resource cost is too high.
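As an illustration of the LUT method, the sketch below builds such a table and exploits the symmetry sigmoid(−x) = 1 − sigmoid(x) so that only the positive half is stored; the resolution and tabulated range are assumptions:

```python
import numpy as np

STEP = 1 / 256                 # assumed LUT resolution
XMAX = 8.0                     # assumed upper bound of the tabulated range
LUT = 1.0 / (1.0 + np.exp(-np.arange(0.0, XMAX, STEP)))   # positive half only

def sigmoid_lut(x):
    """Look up sigmoid(x); negative inputs use sigmoid(-x) = 1 - sigmoid(x)."""
    addr = min(int(abs(x) / STEP), len(LUT) - 1)          # input acts as the address
    y = LUT[addr]
    return y if x >= 0 else 1.0 - y
```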
In the piecewise linear function method, we used several linear functions for a piecewise fit of the sigmoid function. Obviously, the more segments the function is divided into, the higher the accuracy, but also the higher the resource consumption. This method is suitable for functions with good linearity. As for the sigmoid function, parts of its domain can be replaced by linear functions, especially in sections near to or far from the longitudinal axis.
In order to combine the advantages of the two methods, the LUT together with the piecewise function was adopted to implement the sigmoid function. In three intervals with good linearity, three linear functions were used to replace the sigmoid function; in the intervals with poor linearity, the LUT was used; and in the remaining intervals far from the longitudinal axis, where the sigmoid function converges to constants, it was replaced by those constants. The structure of the sigmoid function implementation is shown in Figure 8. The control block first analyses the range of the input, then sets the multiplexers (MUXs) to the corresponding position and adjusts the output value of the ROM. The output first-in/first-out (FIFO) buffer is used to adjust the output delay of the LUT path, so as to synchronize the two paths and thus the outputs of all neurons. This method takes into account both resource consumption and accuracy. The implementation result is shown in Figure 9.
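The exact interval boundaries used in the paper are not recoverable from the text, so the breakpoints, slopes, and LUT resolution in the following sketch are purely illustrative; it only demonstrates the range-select structure of Figure 8:

```python
import numpy as np

# Illustrative hybrid sigmoid: a linear segment where the curve is nearly
# straight, a LUT where curvature is high, constants far from the axis.
# All breakpoints and slopes here are assumptions, not the paper's values.
_lut_x = np.arange(1.0, 4.0, 1 / 128)
_LUT = 1.0 / (1.0 + np.exp(-_lut_x))

def sigmoid_hybrid(x):
    ax = abs(x)
    if ax < 1.0:                           # near the axis: linear segment
        return 0.5 + 0.231 * x             # chord fit through (0, 0.5)
    if ax < 4.0:                           # curved region: LUT path
        y = _LUT[int((ax - 1.0) * 128)]
    else:                                  # far region: saturated constant
        y = 1.0
    return y if x >= 0 else 1.0 - y        # symmetry handles negative inputs
```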
3.3. Pipeline and Parallel Structure
The neural network runs in a pipelined mode between layers: data are not cached on the way from input to output, and every layer is active in every clock period. In each clock period, a set of data is fed into the 1D-CNN and a set of results is output, delayed by several clock periods relative to the corresponding input. The pipelined mode maximizes the utilization of resources.
The pipelined mode between layers requires a parallel computing architecture inside each layer. The speed of the data stream is fixed in each layer, with no multiplexing of hardware resources between steps. In the convolution layer, the two convolution kernels are routed separately, so the elements of the two output vectors are obtained synchronously; in the hidden layer, the 20 neurons are computed in parallel and routed separately. This parallel structure avoids routing across clock domains in the FPGA implementation, and also avoids the timing pressure caused by frequent reuse of critical paths. The designed pipeline and parallel structure improves the stability of the circuit and also makes it possible to accelerate operation.
3.4. Precision and Quantization
In the training process of the neural networks on a personal computer (PC), we chose the double-precision floating-point type for the neural network parameters, in order to obtain a network with high precision. In the FPGA implementation, however, such a high-precision data type is neither feasible nor necessary. Taking this into account, a 16-bit signed fixed-point type was chosen: one bit for the sign and 15 bits for data.
During the training process, the network input and parameters are normalized to unity; we treat the amplitude of the input signal as 1, and then obtain the network parameters of each layer. Since the network parameters are represented in double-precision floating-point, we need to convert them from floating-point to fixed-point. Unlike floating-point numbers, fixed-point numbers suffer from width expansion after multiplication: the scale of a product of fixed-point numbers grows with every multiplication, regardless of the actual value.
Assuming that f1 and f2 are two floating-point numbers, after fixed-point quantization with the coefficient n their integer values are I1 = [nf1] and I2 = [nf2]; therefore, I1I2 = [n²f1f2]. However, the quantized value of the product of f1 and f2 is expected to be R = [nf1f2]. It is clear that we can avoid the problem of width expansion through an additional division operation, that is,

$$R = \frac{I_1 I_2}{n}.$$
According to this characteristic, this paper chooses an integer power of 2 as the quantization coefficient, n = 2^k, so that the division operation can be replaced by a bit-shift operation. The actual product value can then be expressed as

$$R = (I_1 I_2) \gg k.$$

Bit-shift operations are very well suited to the FPGA structure and occupy few hardware resources; this quantization method greatly reduces the resource usage of multiplication. In the experiment, n was set to 2048 (k = 11). Examination of the results shows that no overflow occurred during operation.
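A small numeric sketch of this scheme, with n = 2048 as in the experiment (the sample values f1 and f2 are arbitrary):

```python
N_BITS = 11
N = 1 << N_BITS                  # quantization coefficient n = 2048 (a power of 2)

def quantize(f):
    """Fixed-point quantization: I = [n * f]."""
    return round(f * N)

f1, f2 = 0.3721, -0.5548
I1, I2 = quantize(f1), quantize(f2)

# The plain product is scaled by n^2; one arithmetic right shift restores n*f1*f2.
# (Python's >> floors toward -inf, matching a hardware arithmetic shift.)
R = (I1 * I2) >> N_BITS
print(R, quantize(f1 * f2))      # both close to -423
```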
4. Experiment and Results
4.1. Experimental Platform
In this section, we describe the implementation of the proposed 1D-CNN demodulator on a Xilinx KC705 evaluation board. The block diagram of the experimental platform is shown in Figure 10. A PC with the Keras neural network toolkit was used to generate the network parameters, which were then loaded onto the Xilinx KC705 evaluation board. The training data sets Xi were provided during the neural network training process. In addition to the evaluation board itself, three further devices were used to generate the modulated data: a BPSK generator, an FPGA mezzanine card (FMC) sampling subsystem, and an additive white Gaussian noise (AWGN) generator.
The experimental condition was set as follows: (1) carrier frequency was fc = 10 MHz; (2) symbol rate was rb = 5 Msps; (3) sampling frequency of the system was fs = 80 MHz.
4.2. Training Data Sets
The demodulation performance of the 1D-CNN demodulator is obtained by training, so appropriate training data sets must be generated first. To enhance the adaptability of the 1D-CNN demodulator to real channel conditions, five nonideal factors were taken into consideration during the generation of the training data Xi: the signal-to-noise ratio (SNR), the carrier frequency offset oc, the symbol rate offset ob, the sampling frequency offset os, and the initial phase φ0. The training data Xi can thus be regarded as a function of these five variables:

$$X_i = f(\mathrm{SNR},\ o_c,\ o_b,\ o_s,\ \varphi_0).$$

Empirically, we set the range of each variable as shown in Table 1.
MATLAB was employed to generate the training data sets. First, the five variables were assigned random values within their respective ranges, yielding one set of values. According to the carrier frequency, the symbol rate, and the initial phase, a BPSK-modulated waveform was generated, with the length fixed at 100 random symbols. Next, a noise signal of the appropriate power was created according to the SNR value and added to the modulated waveform. Then, samples were extracted from the modulated waveform according to the sampling frequency. Finally, phase shift labels were added to the sample sequence. In this way, the first data set X1 was generated.
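The paper used MATLAB for this step; the NumPy sketch below reproduces the same generation procedure. The ranges of the five factors are placeholders standing in for the values of Table 1, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng()

def gen_training_set(n_sym=100):
    """One training set X_i: BPSK waveform with five randomized nonideal factors."""
    snr_db = rng.uniform(0, 12)            # placeholder ranges standing in
    oc     = rng.uniform(-0.1e6, 0.1e6)    # for the values in Table 1
    ob     = rng.uniform(-5e3, 5e3)
    os_    = rng.uniform(-0.1e6, 0.1e6)
    phi0   = rng.uniform(0, 2 * np.pi)

    fc, rb, fs = 10e6 + oc, 5e6 + ob, 80e6 + os_
    bits = rng.integers(0, 2, n_sym)
    t = np.arange(int(n_sym * fs / rb)) / fs                # sample instants
    a = 2 * bits[np.minimum((t * rb).astype(int), n_sym - 1)] - 1
    x = a * np.cos(2 * np.pi * fc * t + phi0)

    noise_pow = np.mean(x**2) / 10**(snr_db / 10)           # AWGN at the target SNR
    x += rng.normal(0, np.sqrt(noise_pow), x.size)

    # The paper labels 0->pi and pi->0 shifts separately; for brevity we
    # mark the position of any phase shift here.
    labels = np.diff(a, prepend=a[0]) != 0
    return x, labels
```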
By repeating the above steps, a total of 1000 training data sets were generated. Next, the structures of the 1D-CNNs were built on a PC using the Keras neural network toolkit, and the generated data sets X1–X1000 were used to train the neural networks, iterating over them 100 times. The training loss curve is shown in Figure 11; it can be seen that the training process converged well. The well-trained network parameters were then imported into the FPGA.
4.3. Structure Parameters of the Network
In order to obtain the best network structure, several networks were designed by varying the two most important structure parameters: the length of the input vector and the number of neurons in the hidden layer. Modulated signals of different SNRs were used to judge the performance of the networks. As a representative case, the bit error rate (BER) results at an SNR of 6 dB are shown in Figure 12 and Figure 13.
The result shown in Figure 12 indicates that the network cannot converge when the length of the input vector is too short (less than 4). The performance of the network remained at an acceptable and stable level when the length of the input vector was slightly larger than the number of samples per carrier period. In the experiment, the number of samples per carrier period was 8, and the length of the input vector was therefore set to 9.
The result shown in Figure 13 indicates that the network performance improves as the number of neurons in the hidden layer increases. However, the change is not obvious once the number exceeds 20. Considering that too many neurons bring a greater amount of computation, the number was set to 20 in the experiment.
4.4. Results of the Implementation
The Xilinx KC705 evaluation board is equipped with an xc7k325t FPGA. The resource occupancy of the 1D-CNN demodulator is shown in Table 2. The chosen FPGA can meet the requirements of the implementation. However, the occupancy rate of DSP48E1 slices is very high, so simplifying the multiplications was a priority.
In order to fully demonstrate the performance of the 1D-CNN demodulator, its complexity, power consumption, and latency were compared with those of a coherent demodulator. In order to eliminate the influence of the hardware platform, the coherent demodulation algorithm was implemented on the same FPGA chip, using a Costas loop and the Gardner algorithm. The results are shown in Figure 14 and Table 3.
Hardware resource utilization is listed in Figure 14 to illustrate complexity. The 1D-CNN demodulator consumes fewer slice registers, slice LUTs, and block RAMs, but more DSP48E1 slices. This means that the 1D-CNN demodulator has simpler logic but a larger amount of calculation.
Table 3 shows that the 1D-CNN demodulator reduced power consumption by 9.07%. As for latency, because no carrier synchronization is needed, the 1D-CNN demodulator greatly shortened the delay, by 96%.
In order to illustrate the adaptability to frequency offset and symbol rate offset, the 1D-CNN demodulator and the coherent demodulator were tested under an offset channel condition. In the offset channel, two of the abovementioned five factors, oc and ob, were set to 0.1 MHz and 5 kHz, respectively. The BER results are shown in Figure 15, which indicates that the 1D-CNN demodulator has better adaptability to offsets, especially at low SNR.
BER performance under an AWGN channel is generally regarded as an evaluation criterion for demodulators. The theoretical, simulated, and measured values under the AWGN channel condition were obtained, and the results are shown in Figure 16.
The experimental results show that under the AWGN channel condition, the demodulation loss of the 1D-CNN demodulator was kept almost within 2 dB. This can be regarded as relatively good performance, fully meeting most requirements in wireless communication.
5. Conclusions
This paper presented an FPGA implementation of a 1D-CNN demodulator for BPSK. The demodulator contained two 1D-CNNs to detect the types and moments of the phase shifts. A decision module was employed to synthesize the results of the 1D-CNNs, thereby recovering the information carried by the modulated signal. A time-delay network for convolutional calculation and a structure for piecewise approximation of the activation function were adopted, improving the efficiency of resource utilization and the operation speed. Universal and diverse training data were generated, strengthening the adaptability to real channel conditions. The complexity, power consumption, and latency of the 1D-CNN demodulator and a coherent demodulator were compared. The results show that the 1D-CNN demodulator had acceptable complexity and power consumption, and outstanding latency. The performance of the 1D-CNN demodulator was tested in different channels. In complicated channels where high offsets were introduced, the 1D-CNN demodulator showed better adaptability to frequency offset. In an AWGN channel, the demodulation loss of the 1D-CNN demodulator was kept almost within 2 dB. Owing to this good performance, the proposed 1D-CNN demodulator can meet most requirements in wireless communication. However, the deficiencies of the 1D-CNN demodulator cannot be ignored: its complexity could be reduced further, and its adaptability to other nonideal factors, such as multipath effects, should be studied.