1. Introduction
Neuroscience, or neural science, is the study of the development, structure, and function of the nervous system. It is concerned not only with the normal workings of the nervous system but also with what happens to it when people have neurological, psychiatric, or neurodevelopmental abnormalities [1]. Most nervous system problems are caused by a loss of regional sensory or motor function. Neuroprosthetics is a still-developing field that focuses on designing and implementing neural prostheses, such as visual and cochlear implants, that restore lost neural function by capturing signals from active neurons, processing them, and then electrically stimulating a selected population of neurons [2].
Microelectrode arrays (MEAs) are essential tools in neuroscience. Neural sensing and stimulation with MEAs have been the backbone of neuroscience research, brain-machine interfaces, and clinical neuromodulation therapies for decades [3]. When neurons and muscle cells are activated, ion currents flow through their membranes, generating a voltage difference between the interior and exterior of the cell. The electrodes of an MEA convert this change in voltage, which is carried by ions in the environment, into currents carried by electrons for recording [4].
In a closed-loop MEA system, recording and stimulating electrodes can be integrated into the same MEA or connected to an external stimulator, which can be optical or electrical. This setup enables the study of neural system responses by analyzing the spike activity evoked by stimulating the neurons. It also maps bioelectrical signals with great detail and resolution [5]. Open-loop systems can be used in these studies to analyze neuronal activity and history-dependent neural network states without external effects. The success of neuroscience studies relies on the fabrication of the MEA electrodes in contact with the neural tissue and on the associated electronic processing system, which together must deliver consistent, real-time recordings with the highest resolution [6].
The last decade has seen major technological advances in neuroscience. Advances in microtechnology have enabled an exponential increase in the number of neurons that can be recorded simultaneously [7]. MEAs consisting of thousands of sensing electrodes are available to record and monitor the activity of many neural cells in parallel with high resolution [8]. Modern MEA acquisition systems enable simultaneous sampling of four thousand channels, usually at 18–25 k samples/s per channel with 12-bit resolution.
The real-time processing of biological signals acquired by MEAs is a challenging and computation-intensive task [9]. One of the major challenges is to reduce the latency of processing the received signal and detecting spikes in order to improve the stimulation response. The system must be able to acquire data, process it, and generate feedback stimuli with the lowest possible latency. This constraint is difficult to meet since the processing system must filter the received signals and detect neural spiking events, which last only a few milliseconds. Furthermore, reconfigurability is required to run multiple setups and trials. The overall processing system should also have a high degree of parallelism to process the data received from the various MEA channels in parallel. The circuit area of the processing system is another important parameter since the system consists of parallel processing sets. Programmable devices such as microcontrollers and FPGAs have limited resources that should be utilized efficiently: the smaller the circuit area of each set, the more processing sets can be built on the same chip to handle the data channels in parallel. Lower power consumption by the processing units also enhances portability and decreases the overall power consumption of the system. Recent approaches, e.g., [10], tried to overcome these problems by exploiting advanced FPGA chips, which offer more resources to increase the number of channels processed in parallel. Other approaches, e.g., [11], proposed filtering the received signals outside the FPGA with analog electronics and using the chip resources for other processing tasks.
This paper proposes to overcome the abovementioned limitations by applying approximate computing to process the acquired neural signals on FPGA devices, which can be useful in custom implementations. The proposed processing system uses approximate computing to obtain performance gains by reducing the computational complexity and latency associated with real-time processing. Approximate computing is a computing paradigm that trades the accuracy of the result for improvements in area, power, or delay [12]. It has emerged as a new paradigm for the design of circuits and systems with high performance and low energy consumption. It is widely used in error-tolerant applications such as image processing, recognition, and data mining [13], but, to the best of our knowledge, it has not been used to process the neural signals acquired by MEAs. The main contributions of this paper include:
Applying approximate computing to develop neural signal processing systems on FPGA and comparing them with the accurate system, with the aim of reducing latency, area, and power consumption. In this context, three types of approximate adders were utilized, i.e., CPredA [14], GeAr [15], and AA2 [16].
Testing the proposed processing systems using real biological signals. Results show an enhancement in processing speed of up to 37.6% in some approximate implementations, and in other implementations, the area reduction is up to 14.3% without loss in accuracy.
2. Related Work
Several MEA signal processing algorithms and hardware implementations are available in the literature with the common aim of achieving real-time control of neural electrical activities. These architectures vary from software running on a PC to complex FPGA architectures, integrated circuits, and analog electronics. This section presents a brief overview of some recent and notable implementations.
Many implementations process the neural signals on a desktop PC for simplicity. Newman et al. [17] presented a desktop application that simplifies the design of sophisticated real-time, closed-loop multichannel interfacing experiments. The minimum latency achieved with this system when targeting a 64-channel MEA was 7.1 ± 1.5 ms. A 320-channel active probe for high-resolution neuromonitoring and responsive neurostimulation was also reported in [18]. The work presented a bidirectional integrated neural interface and seizure-predicting software. An integrated circuit (IC) cell array was attached to the reverse side of a pitch-matched microelectrode array in the probe. The IC supported 256 neural recording sites and 64 neural stimulation sites. At the application level, the work controlled the seizures in a rat epilepsy model with online closed-loop neurostimulation. In [19], an integrated 64-channel system was presented for recording local field potentials (LFPs) and action potentials (APs). The system was implemented on a 4 × 4 mm² chip die using the SMIC 0.18 µm CMOS standard process. The chip included four-stage signal conditioning circuitry, a successive-approximation analog-to-digital converter, and a digital logic unit for clock and control. The work proposed two novel ideas: a two-stage amplifier with high gain and clock logic that can be used to align the switching clock. The system was connected to PC software and tested on raw neural data downloaded from the Internet. It recorded multiple LFP and AP signals successfully and simultaneously.
Other works used microcontrollers and analog electronics to process the acquired neural signals. The researchers in [20] presented a portable 16-channel microcontroller-based wireless system for bidirectional interaction with the central nervous system. Eight of the sixteen available electrodes were used to stimulate the brain and were directly connected to the stimulating unit on a rat backpack. The remaining eight electrodes were used to record the neural activity. The system consisted of a remote unit, a home unit, and control software running on a personal computer for offline processing. Detection-to-stimulation latency was 3 ms in the local closed loop and 2.6 ms in the remote closed loop. Cong et al. [21] presented a prototype bidirectional neural interface system with closed-loop and embedded DSP capabilities. The system included 32-electrode stimulation capability and eight multiplexed low-noise, low-power bio-potential sensing channels with an on-chip digital FFT. It also used a Cortex M3-based microcontroller for implementing closed-loop algorithms. Liu et al. [22] proposed a fully programmable, bidirectional neural interface device. A proprietary SoC performed noise-sensitive neural signal recording, high-safety neural stimulation, computation-intensive neural feature extraction, and on-chip closed-loop operation in the system. The use of digital-assisted analog parallel neural feature extraction units was presented as a novel implementation. The prototype system was interfaced with standard computers using a programmable general-purpose microcontroller (MCU) with integrated flash memory and the Bluetooth wireless protocol. Lee et al. [23] presented a multichannel neural recording device that records brain signals from many electrodes using fewer recording channels. The system used an adaptive electrode selection technique to automatically scan the electrode arrays and record from chosen electrodes where brain spikes are identified. The grouped signals are connected to a microcontroller unit, which determines the relative occurrence rate of neural spikes between scan groups and decides the adaptive electrode selection. Results from experiments using pre-recorded brain data showed that the proposed system could separate, amplify, and count neural spikes in real time. In [24], an approach based on memristor arrays was proposed to process multichannel neural signals in parallel. The results were verified by demonstrating seizure prediction with high accuracy and improved power dissipation.
FPGA-based systems have also been used to implement real-time processing systems for neural signals. FPGAs offer well-known gains in computational performance compared with PC-based systems. The work by Muller et al. [25] is an example of an FPGA implementation of a closed-loop spike detection and stimulus generation system. It processed data from 126 channels in parallel while generating stimuli on 42 different output channels in less than 1 ms. Another recent closed-loop FPGA implementation, presented by Park et al. [11], was shown to perform spike detection and spike sorting of 128 input channels simultaneously. However, no latency was reported in this work. The paper focused on the hardware implementation of the 128-channel FPGA-based bidirectional neural interface system, including two 64-channel analog front-end boards. Filtering of signals was performed on the analog front-end boards, while spike detection and sorting were performed on the FPGA board. Sue et al. [10] introduced a closed-loop system to interface an HD-MEA with 4K input channels. The latency of the system was less than 2 ms. The system proposed in that work was implemented entirely on an FPGA board, including spike detection and digital filtering of the signals. In [26], the authors used an FPGA to prototype a low-power multichannel neuron activity extraction unit suitable for a wireless neural interface. A neural signal extraction algorithm was proposed to meet the low-power requirement by reducing the data transmission rate. The algorithm transmits only a channel that is recording a high-frequency signal above a specific voltage. The results showed a reduction in the data transmission rate by up to 6000 times, with the unit consuming only 3 mW of FPGA dynamic power.
To the best of our knowledge, approximate computing has never been used in this context before. We believe that approximate computing can bring significant performance enhancements, such as reduced processing time and system area, without compromising much accuracy. With this motivation, we propose approximate computing-based algorithms for MEA signal processing, along with their FPGA implementations, and evaluate their usefulness in the field of neuroscience.
3. Proposed Processing System
In this work, two otherwise identical processing systems are implemented and compared: the accurate system, which uses precise calculations in all processes, and the proposed approximate system, which employs various approximate adders in processing the acquired neural signals. The block diagram of the processing system is shown in Figure 1.
The system consists of three main parts. The acquisition unit can be any commercial HD-MEA device that continuously sends a digitized neural signal at 12-bit resolution to the real-time processing unit. The real-time processing unit is the primary focus of this paper. It is implemented in the Verilog hardware description language using the Xilinx Vivado Design Suite, with the ZedBoard development board for the Xilinx Zynq-7000 specified as the target board. The processing unit consists of parallel processing sets of filtering and spike detection modules, where each set is dedicated to processing the signals of one MEA channel. The schematic diagram in Figure 2 shows the details of the set with its input and output signals.
The set receives the system clock and reset signals used for synchronization. It also receives 12-bit raw neural data generated by the MEA device and sent through one of its channels. The raw signal is forwarded to an FIR low-pass filter module to isolate typical frequency components of neural spikes. The filtered signal is then fed to the spike detection module, which detects a spike in the signal when it is below a preset threshold. At each detection, the spike line will be asserted high for a certain period, and the spike counter will be incremented. The following sections describe the filtering and spike detection modules in more detail.
3.1. Filtering Module
Low-pass filters are used in this work to extract the low-frequency components of neural signals. The proposed system is used to process local field potential (LFP) signals [27]. Therefore, each filter is a finite impulse response (FIR) low-pass filter with a cut-off frequency of 1 kHz.
The FIR filter is the most popular type of filter implemented in software. It is a digital filter structure that can realize practically any frequency response. The impulse response of an FIR filter is of finite duration since it settles to zero in finite time. This differs from infinite impulse response (IIR) filters, which can have internal feedback and may respond indefinitely (usually decaying). Linear phase is another advantage of FIR filters: as all frequencies are shifted in time by the same amount, no phase distortion is introduced into the filtered signal, resulting in a constant group and phase delay. FIR filters also have no feedback and can never become unstable for any input signal, an advantage over IIR filters [28].
The output of an FIR filter of order N is the convolution of the input signal x[n] with the coefficients h[n]. It is a sequence of values in which each value is the weighted sum of the most recent input values, as shown in Equation (1):

$$y[n] = \sum_{i=0}^{N-1} h_i \, x[n-i] \tag{1}$$

where x[n − i] is an input sample belonging to the sliding window of the last N received samples. Each sample is weighted by a coefficient (h_i), and the weighted samples are added together to obtain the output sample y[n].
During filter design, the order of the filter is an important parameter. A higher filter order provides better low-pass selectivity but requires more circuit area and delay. This work aims to design a real-time processing system with minimal latency and resource usage. Thus, we designed low-pass FIR filters of orders 4, 8, 12, 16, and 24 and tested them in the accurate system on real biological signals containing approximately 106,000 samples to detect all spikes. Based on the results shown in Table 1, we decided to use the FIR filter of order N = 8, as it is the minimum order required to precisely detect all spikes in the accurate system.
Figure 3 shows a sample of the test performed with different filter orders. It shows the misdetection of the spikes when using the FIR filter with order 4, while all higher order filters detect all spikes precisely.
The MATLAB Filter Design and Analysis tool is used to obtain the best set of coefficients. The coefficients are then scaled up and rounded to integer values for use in the Verilog code that implements the filter on the FPGA.
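The scaling and rounding step can be illustrated with a short Python sketch. The power-of-two scale factor, the function name, and the example coefficient values are assumptions made for illustration; the paper does not state the scale factor or the coefficients it used.

```python
def quantize_coefficients(coeffs, frac_bits):
    """Scale real-valued coefficients by 2**frac_bits and round to integers.

    A power-of-two scale is assumed here so that the filter output can be
    scaled back with a simple right shift in hardware.
    """
    scale = 1 << frac_bits
    return [round(c * scale) for c in coeffs]


# Hypothetical half-set of symmetric low-pass coefficients (not from the paper).
example_coeffs = [0.02, 0.09, 0.17, 0.22]
print(quantize_coefficients(example_coeffs, 10))
```

A larger `frac_bits` preserves more of the designed frequency response but widens the multipliers and the adders that follow them.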
Equation (1) shows the large number of calculations, especially additions, required by the FIR filter. Therefore, symmetric coefficients are used so that the number of calculations required is reduced by almost half. Parallelization is also considered when implementing the FIR filter in order to exploit the FPGA device used in this work. We implemented the FIR filter as four blocks that work in parallel at each clock cycle, as illustrated in Figure 4.
The first block acquires the new sample and assigns it to the first tap. It also shifts the old samples in parallel to the adjacent taps at each clock cycle, since non-blocking assignments are used in its Verilog code. The second block of the FIR filter uses 12-bit adders to add every two opposite samples at each clock cycle and stores the results in the registers of the first-level addition. The 12-bit adders are either full adders in the case of the accurate processing system or approximate adders in the approximate system. The third block multiplies the addition registers produced by the second block by the corresponding coefficients at each clock cycle and stores the multiplication results in its registers. The fourth and final block adds the multiplication results of the third block at each clock cycle using 24-bit adders. It also crops the result while preserving its sign to keep the signal in its 12-bit format. Adders in this block are also selected according to the type of the system, accurate or approximate. The final output of the filtering module is a 12-bit filtered neural signal sent to the spike detection and counting module, which is described next.
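The four blocks can be modelled in software as follows. This is a behavioural Python sketch, not the authors' Verilog: the pipeline registers between blocks are collapsed into a single function call per sample, an even number of taps is assumed, and the names `symmetric_fir_step`, `half_coeffs`, `shift`, and `add` are hypothetical. The `add` parameter stands in for the adder type, so an approximate adder model can be swapped in for the exact one.

```python
def symmetric_fir_step(taps, sample, half_coeffs, shift, add=lambda a, b: a + b):
    """Advance a symmetric FIR filter by one sample.

    taps        -- delay line, newest sample first (length = 2 * len(half_coeffs))
    half_coeffs -- the first half of the symmetric integer coefficients
    shift       -- right shift that undoes the coefficient scaling (assumed)
    add         -- two-operand adder model (exact by default)
    Returns (new_taps, output_sample).
    """
    # Block 1: shift the delay line and store the new sample in the first tap.
    taps = [sample] + taps[:-1]
    n = len(taps)
    # Block 2: add each pair of opposite taps, which share one coefficient.
    pair_sums = [add(taps[i], taps[n - 1 - i]) for i in range(n // 2)]
    # Block 3: multiply each pair sum by its shared coefficient.
    products = [c * s for c, s in zip(half_coeffs, pair_sums)]
    # Block 4: accumulate the products, then scale the result back down.
    acc = 0
    for p in products:
        acc = add(acc, p)
    return taps, acc >> shift  # Python's >> keeps the sign of negative values


# Usage: filter a short signal with illustrative coefficients.
taps = [0] * 8
for x in [5, -3, 7, 100, -200, 50]:
    taps, y = symmetric_fir_step(taps, x, [1, 2, 3, 4], shift=0)
```

With eight taps, the pre-addition in block 2 halves the number of multiplications from eight to four, which is the saving that symmetric coefficients provide.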
3.2. Spike Detection Module
As illustrated in Figure 5, the spike detection module receives the 12-bit filtered signal produced by the FIR filter to check for spikes and update the spike counter.
This module detects a spike in the signal at the point where it satisfies a specific criterion: the sample is below a certain negative threshold [29]. It compares the received sample with the preset negative threshold value at each clock cycle. If a spike is detected, it pulls the spike signal high for a specific time and increments the spike counter. Estimating the correct threshold value is essential since a low threshold value leads to false detections while a high value leads to a high number of missed spikes. The threshold value is estimated in the accurate system to precisely detect all spikes and is kept at the same value in the approximate systems. This spike detection algorithm is used because its simple hardware implementation suffices to prove the concept of this paper. Other algorithms are also available in the literature. The algorithm in [30] detects the spikes in extracellular recordings by extracting the minimum distance between the trough and peak in a slice of the recorded signal. The absolute distance value is then incorporated into the differential operator applied to the rectified signal to analyze the difference between spikes and noise. The signal is then passed through a convolution filter to suppress the noise. Spikes are finally detected when the sample value exceeds a preset threshold proportional to the mean value of the filtered signal. An adaptive spike detection algorithm is presented in [31]. The algorithm first removes the local field potentials from the recorded signal by mean subtraction and a moving average filter. The signal-to-noise ratio is then enhanced by using an amplitude slope operator, and finally spikes are detected by comparing the samples with an adaptive threshold that considers the local signal statistics. The work in [32] proposed a low-power spike detector using latch-based RAM. The threshold in this algorithm is estimated by dynamically calculating the standard deviation of the noise in the new samples and updating the old value. In [33], multiple spike detection algorithms for multi-transistor arrays (MTA) are investigated based on some variants of the smoothed non-linear energy operator (SNEO). The latter work has shown that the performance of the spike detector benefits from the correlation of the signals detected by the MTA pixels but degrades when a high firing rate of neurons occurs.
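The negative-threshold detector used in this work can be sketched in Python as follows. The sketch is a behavioural model under stated assumptions: the name `detect_spikes`, the `hold_cycles` parameter, and the choice to ignore further threshold crossings while the spike line is still high are illustrative and are not taken from the paper.

```python
def detect_spikes(samples, threshold, hold_cycles):
    """Return (spike_line, spike_count) for a stream of filtered samples.

    A spike is flagged when a sample falls below the negative threshold.
    The spike line then stays high for hold_cycles samples, during which
    new crossings are not counted again (an assumption of this sketch).
    """
    spike_line = []
    count = 0
    hold = 0  # remaining samples for which the spike line stays high
    for s in samples:
        if hold == 0 and s < threshold:
            count += 1
            hold = hold_cycles  # start a new pulse on the spike line
        spike_line.append(1 if hold > 0 else 0)
        if hold > 0:
            hold -= 1
    return spike_line, count


# Usage: two separate excursions below a threshold of -5.
line, n_spikes = detect_spikes([0, -10, -12, 0, 0, -9, 0], threshold=-5, hold_cycles=3)
```

Holding the line high for several samples keeps one long negative excursion from being counted as several spikes.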
4. Implementation of Approximate Computing Algorithms
This paper aims to improve the neural signal processing system by enhancing each filtering and spike detection set in terms of time, area, and power. A smaller area allows the processing system to contain more parallel filtering and spike detection sets and thus to handle more MEA channels, while a shorter processing delay enhances the overall performance of the system.
By using approximate computing algorithms, we can attain high performance by tolerating some loss of quality [34]. In this context, we applied approximate computing algorithms to the most computation-intensive part of the processing unit, the FIR filter. As shown in Equation (1), the number of calculations required by the FIR filter depends on its order N. Every two opposite samples in a symmetric FIR filter are added together. The result is multiplied by a coefficient and then accumulated with the other multiplication results. Due to the time constraints of this study, we chose to focus on approximating the addition operation since it is used more extensively in the FIR filter than the multiplication operation. Approximate signed multipliers will be included in our future work, where further improvements in the system parameters are expected. On the other hand, more approximate operations will in most cases require error correction techniques, which need more power and circuit area. Different static segmented approximate multipliers are analyzed in detail in [35], which also presents new segmentation and correction techniques that can significantly reduce hardware costs.
The workload of addition operations as a function of the sampling frequency is described by Equation (2):

$$\text{Additions per second} = N \times f_s \tag{2}$$

where N is the order of the filter and f_s is the sampling frequency in Hz.
For our symmetric FIR filter of order 8 in this work and a sampling frequency of 7 kHz, the number of additions is 56,000 operations per second for each MEA channel. Therefore, any reduction in the time required to process addition operations can significantly improve the whole system’s performance.
Many approximate adders are available in the literature. The objective of this work is to design a neural signal processing system that reduces the processing time, circuit area, and power consumption on FPGA as the target device. Therefore, to prove the concept of this work within its time limit, we selected the adders primarily based on the parallelism in producing the result, the flexibility of combining the adder with accurate adders to control the precision, and the simplicity of the approximate adder circuit to reduce the area and power dissipation. The approximate processing system proposed in this paper is designed with three approximate adders, which are CPredA [14], GeAr [15], and AA2 [16]. Other adders are planned to be used in future work, such as the Lower-Part OR adder [36], which calculates the sum as the OR of the two operands while discarding the carry.
CPredA adder is chosen as an example of adders that approximate the output carry in their algorithms. In contrast, AA2 is chosen as an example of adders that approximate the output sum. GeAr is an example of the adders that use segmentation. The following sections describe the selected adders and their implementations.
4.1. Carry Predicting Full Adder (CPredA)
Figure 6 shows the circuit diagram of CPredA. In prediction mode, Sum (S) and Carry (Cout) are produced as follows:

$$S = A \oplus B \oplus C_{in} \tag{3}$$

$$C_{out} = A \cdot B \tag{4}$$
Table 2 compares the output of the Full Adder (FA) and CPredA in prediction mode.
We notice that CPredA always generates the correct S value. The Cout value is always correct except when {Cin, A, B} = {1, 0, 1} and {1, 1, 0}. The architecture of the CPredA adder shows that Cout values are predicted almost correctly depending only on the current values of A and B, without using the carry chain. This last point is critical when using this adder in FPGA-based systems, where parallelism is one of the main advantages. Flexibility is another significant feature of CPredA. Higher-order adders can be constructed from CPredA cells only or by combining them with accurate full adders in any configuration, controlling the accuracy to the required level. CPredA also reduces the required circuit area by using simpler logic to produce its results.
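The behaviour described above can be checked with a small Python model. The carry expression used here, Cout = A AND B, is inferred from the text (an exact sum, and a carry that depends only on A and B and is wrong in exactly the two listed cases); it is an assumption of this sketch rather than a transcription of the circuit in Figure 6.

```python
def full_adder(a, b, cin):
    """Exact 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def cpreda(a, b, cin):
    """CPredA cell in prediction mode (inferred model, see lead-in)."""
    s = a ^ b ^ cin  # the sum still uses the incoming carry
    cout = a & b     # the carry is predicted from A and B only
    return s, cout


# List every {Cin, A, B} combination where CPredA differs from the full adder.
mismatches = [
    (cin, a, b)
    for cin in (0, 1)
    for a in (0, 1)
    for b in (0, 1)
    if cpreda(a, b, cin) != full_adder(a, b, cin)
]
print(mismatches)
```

Because the predicted carry never reads Cin, every bit position can compute its carry at the same time, which is what removes the carry chain from the critical path.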
4.2. Generic Accuracy Configurable Adder (GeAr)
The GeAr adder breaks the carry chain by using K L-bit sub-adders to perform the approximate addition of N-bit operands. The length of each sub-adder (L) is less than or equal to the length of each operand (N). Each sub-adder produces an R-bit result, using the P preceding bits for carry prediction, except the first sub-adder, which produces an L-bit result, where L = R + P. The number of required sub-adders K can be calculated using Equation (5):

$$K = \frac{N - L}{R} + 1 \tag{5}$$
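The result of the first sub-adder can be calculated using Equation (6):

$$SR_1 = A[L-1:0] + B[L-1:0] \tag{6}$$

The result of the ith sub-adder can be calculated using Equation (7):

$$SR_i = A[R \cdot i + P - 1 : R \cdot (i-1)] + B[R \cdot i + P - 1 : R \cdot (i-1)] \tag{7}$$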
Figure 7 shows an example of 12-bit GeAr in R2-P2 configuration, where SR is the sub-result of each sub-adder.
In this configuration:
The adder length (N) = 12 bits.
The length of each sub-adder (L) = four bits.
The first sub-adder contributes the first four bits of the final sum.
Each sub-adder, except the first one, produces two bits of the final result, using the two preceding bits for carry prediction. Therefore, R = 2 and P = 2.
The number of required sub-adders (K) is ((12 − 4)/2) + 1 = 5 sub-adders.
The final sum can be formed as:
Sum = [SR5 [3:2], SR4 [3:2], SR3 [3:2], SR2 [3:2], SR1 [3:0]]
All sub-results in GeAr are produced in parallel and combined to produce the result. Compared to an N-bit full adder, the delay is reduced by breaking the carry chain into smaller segments. Thus, the carry propagation will only be limited to the length of the segment represented by L. GeAr R2-P2 is used in this work since it offers low latency and a high degree of configurability. The result accuracy can be controlled by combining GeAr with accurate adders.
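The R2-P2 example can be reproduced with a behavioural Python model. This is a sketch under assumptions: operands are treated as unsigned, the carry out of the top sub-adder is dropped so that the result stays N bits wide, and the function name `gear_add` is hypothetical.

```python
def gear_add(a, b, n=12, r=2, p=2):
    """Approximate N-bit addition with a GeAr-style segmented adder."""
    l = r + p                   # length of each sub-adder
    k = (n - l) // r + 1        # number of sub-adders
    mask_l = (1 << l) - 1
    mask_r = (1 << r) - 1
    # First sub-adder: contributes all L low bits of the sum.
    result = ((a & mask_l) + (b & mask_l)) & mask_l
    for i in range(2, k + 1):
        low = r * (i - 1)       # lowest operand bit seen by sub-adder i
        sub = ((a >> low) & mask_l) + ((b >> low) & mask_l)
        # Keep only the upper R bits of the L-bit window; the lower P bits
        # are used just to predict the carry into those R bits.
        result |= ((sub >> p) & mask_r) << (low + p)
    return result


print(gear_add(0b101010101010, 0b010101010101))  # no carries: exact result
print(gear_add(15, 1))                           # long carry chain: approximate
```

The second call shows the cost of breaking the carry chain: a carry generated at bit 0 has to travel four positions to reach bit 4, which is farther than the two prediction bits that the next sub-adder can see.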
4.3. Approximate Adder (AA2)
Figure 8 shows the circuit diagram of the AA2 adder. AA2 approximates the sum (S) alone: the sum is precise in six out of eight cases, while the carry (Cout) is precise in all cases.
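Table 3 compares the output of the Full Adder (FA) and AA2. S and Cout can be produced as follows:

$$S = \overline{C_{out}} \tag{8}$$

$$C_{out} = A \cdot B + B \cdot C_{in} + A \cdot C_{in} \tag{9}$$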
The advantage of AA2 is that the sum (S) is easily produced by inverting the value of the carry (Cout) signal. This advantage reduces the logic complexity of the adder and the area needed. Flexibility is also another advantage. Like previous adders, accuracy in higher order adders can be controlled by combining AA2 with accurate adders.
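A short Python model makes the six-out-of-eight claim easy to verify. It assumes, as the text states, that the carry is the exact full-adder carry and that the sum is its inverse; the function names are illustrative.

```python
def exact_sum(a, b, cin):
    """Sum bit of an exact 1-bit full adder."""
    return a ^ b ^ cin


def aa2(a, b, cin):
    """AA2 cell: exact carry, sum approximated as the inverted carry."""
    cout = (a & b) | (b & cin) | (a & cin)  # exact carry (majority of inputs)
    s = cout ^ 1                            # sum = NOT carry
    return s, cout


# Collect the (A, B, Cin) combinations where the approximate sum is wrong.
wrong_sum = [
    (a, b, cin)
    for a in (0, 1)
    for b in (0, 1)
    for cin in (0, 1)
    if aa2(a, b, cin)[0] != exact_sum(a, b, cin)
]
print(wrong_sum)
```

The sum is wrong only when all three inputs are equal, which gives the two erroneous cases out of eight.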
5. Evaluation of Approximate Adders
The three selected approximate adders (CPredA, GeAr, and AA2) are tested in two configurations, Half Prediction (HP) and Full Prediction (FP). Multiple adder widths of 8, 12, 24, and 32 bits are selected to obtain more consistent results.
In the Half Prediction configuration, the lower half of the result bits is produced by an approximate adder, while the upper half is added precisely with full adders. For instance, to test CPredA in the Half Prediction configuration with an eight-bit width, the lower four bits are added using CPredA adders while the upper four bits are added using full adders. In the Full Prediction configuration, the entire addition result is produced using approximate adders only. All adders are written in Verilog and implemented with the Xilinx Vivado Design Suite, with the ZedBoard for the Xilinx Zynq-7000 specified as the target board.
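The two configurations can be sketched in Python for adders built from 1-bit cells. The sketch is only a functional model: the loop visits the bits one after another, whereas the hardware cells operate in parallel, and segmented adders such as GeAr do not fit this cell-based form. The CPredA cell uses the carry expression inferred in Section 4.1 (Cout = A AND B), and the names `build_cells` and `add_with_cells` are hypothetical.

```python
def full_adder(a, b, cin):
    """Exact 1-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def cpreda(a, b, cin):
    """CPredA cell: exact sum, carry predicted as A AND B (inferred model)."""
    return a ^ b ^ cin, a & b


def build_cells(width, approx_cell, mode):
    """Return one cell per bit, least significant first.

    mode "HP": lower half approximate, upper half exact.
    Any other mode is treated as "FP": every bit approximate.
    """
    n_approx = width // 2 if mode == "HP" else width
    return [approx_cell] * n_approx + [full_adder] * (width - n_approx)


def add_with_cells(x, y, cells):
    """Add two unsigned integers bit by bit; the final carry out is dropped."""
    carry, total = 0, 0
    for i, cell in enumerate(cells):
        s, carry = cell((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total


hp8 = build_cells(8, cpreda, "HP")
fp8 = build_cells(8, cpreda, "FP")
print(add_with_cells(0x30, 0x10, hp8), add_with_cells(0x30, 0x10, fp8))
```

In the usage example both operands have zeros in the lower four bits, so the Half Prediction adder returns the exact sum, while the Full Prediction adder loses the carry that should propagate through the upper bits.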
The design delay time is not reported directly in the Vivado software. Therefore, we measured it by using a wrapper module and assigning successively tighter timing constraints until the design failed the implementation step. The last timing constraint that succeeded is used as the design delay time. Power and area are also measured using the total on-chip power and chip utilization obtained from the implementation reports. Normalized mean error distance (NMED) [37] is employed to evaluate accuracy by testing the adders with various widths on 10,000 randomly generated samples. The error distance (ED) is defined as the difference between an accurate sum (S) and its approximate sum (S’), as described in Equation (10). The mean error distance (MED) is then calculated as the average of ED, as shown in Equation (11). The normalized mean error distance (NMED) is finally calculated as the ratio between MED and the maximum exact result of the tested addition operations, as shown in Equation (12):

$$ED = |S - S'| \tag{10}$$

$$MED = \frac{1}{n} \sum_{i=1}^{n} ED_i \tag{11}$$

$$NMED = \frac{MED}{S_{max}} \tag{12}$$

where n is the number of tested samples and S_max is the maximum exact result. All characteristics are compared to the full adder as a reference.
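The accuracy measurement can be sketched in Python as follows. The sketch follows the definitions in the text (absolute error distance, its mean, and normalization by the largest exact result among the tested additions); the fixed random seed, the function name `nmed`, and the truncating adder used in the usage example are assumptions for illustration only.

```python
import random


def nmed(exact_add, approx_add, width, n_samples=10_000, seed=0):
    """Estimate the normalized mean error distance of an approximate adder."""
    rng = random.Random(seed)        # fixed seed so that runs are repeatable
    top = (1 << width) - 1
    total_ed = 0
    max_exact = 0
    for _ in range(n_samples):
        a, b = rng.randint(0, top), rng.randint(0, top)
        exact = exact_add(a, b)
        total_ed += abs(exact - approx_add(a, b))   # error distance (ED)
        max_exact = max(max_exact, exact)
    med = total_ed / n_samples                      # mean error distance (MED)
    return med / max_exact                          # NMED


# Usage with a hypothetical adder that simply clears the lowest result bit.
exact = lambda a, b: a + b
truncating = lambda a, b: (a + b) & ~1
print(nmed(exact, truncating, width=8))
```

Normalizing by the largest exact result makes the figures comparable across adder widths, which is why an error confined to the low bits shrinks in relative terms as the operands grow.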
Table 4 and Table 5 summarize the test results of the three approximate adders in the Half Prediction and Full Prediction configurations. The tables show the delay time for each approximate adder at different widths along with the delay time of the full adder for comparison. They also show the reduction in time, area, and power normalized to the full adder.
5.1. Delay Time
Figure 9 shows the actual delay time for all adders in half and full prediction configurations including the full adder for reference.
In both configurations, CPredA and GeAr significantly reduce the delay time due to the parallelism in their architecture. CPredA predicts the carry using only the values of A and B, while GeAr breaks the carry chain into smaller chunks. The delay of AA2 is higher than that of CPredA and GeAr but still less than that of the full adder since this adder simplifies the production of outputs without breaking the carry chain.
5.2. Area
Figure 10 shows the resource utilization, which measures the circuit area of the adders by reporting the utilization of chip resources.
GeAr divides the original full adder into smaller segments. Therefore, its area is the closest to that of the full adder. CPredA and AA2 have smaller areas due to their simpler circuits.
5.3. Power Estimation
Figure 11 shows the power estimation for the adders. All adders consume the same power as the full adder or slightly less. GeAr consists of several segments of the full adder, so its power consumption figures are the closest to those of the full adder. AA2 has the lowest power consumption due to its simplified circuit, in which the sum output is only an inverted version of the output carry.
5.4. Accuracy
Figure 12 shows the normalized mean error distance, which measures adder accuracy. The vertical axis uses a logarithmic scale and shows −10/log(NMED) to enhance the readability of small values.
In the Half Prediction configuration, the lower half of the final result is approximated while the upper half is accurate. All adders in this configuration have a relatively small NMED, which decreases as the operand size increases, because the error is only a tiny portion of the accurate result when the operands are large. In the Full Prediction configuration, NMED is constant for all adder widths. GeAr has the lowest error among the tested adders, since each sub-adder result is calculated accurately using the last few bits, four bits in the case of GeAr R2 P2.
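The behavior of a segmented adder such as GeAr can be sketched in Python as follows. This is a behavioral model of the general GeAr(N, R, P) scheme as commonly described, not the Verilog used in this work: each sub-adder is assumed to span R + P bits with a zero carry-in and to contribute its top R bits, while the first sub-adder contributes all of its bits. The function name and argument names are hypothetical.

```python
def gear_add(a, b, n, r, p):
    """Behavioral sketch of an n-bit GeAr(R, P) addition (assumed scheme).

    Sub-adder i covers bits [i*r, i*r + r + p - 1] with carry-in 0.
    The first sub-adder supplies all of its result bits; every later
    sub-adder supplies only its top r bits. The carry-out of the last
    sub-adder becomes bit n of the result.
    """
    length = r + p
    if n < length or (n - length) % r != 0:
        raise ValueError("n must satisfy n = (r + p) + m * r for an integer m >= 0")
    k = (n - length) // r + 1               # number of sub-adders
    window_mask = (1 << length) - 1
    top_mask = (1 << r) - 1

    s = (a & window_mask) + (b & window_mask)
    result = s & window_mask                # first window is exact
    carry_out = s >> length
    for i in range(1, k):
        lo = i * r
        s = ((a >> lo) & window_mask) + ((b >> lo) & window_mask)
        result |= ((s >> p) & top_mask) << (lo + p)
        carry_out = s >> length
    return result | (carry_out << n)


if __name__ == "__main__":
    print(gear_add(5, 2, 12, 2, 2))    # short carry chain, matches 5 + 2
    print(gear_add(15, 1, 12, 2, 2))   # long carry chain, the carry is lost
```

In this model an error appears only when a carry has to travel further than the P bits a sub-adder looks back over, which is consistent with the low error measure reported for GeAr.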
6. Evaluation of the Proposed Approximate System
The proposed processing system consists of identical parallel sets of filtering and spike detection modules, as shown in Figure 1 and Figure 2. Each set filters the signals acquired from a single MEA channel with its own FIR filter, which uses 12- and 24-bit approximate adders in its calculations. The filtered sample is forwarded to the spike detector, which compares it with a threshold value. A spike is detected when the sample value is below the threshold. The spike output is then pulled high to indicate the presence of the spike, and the spike counter is incremented.
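The data path of one processing set can be sketched in Python as below. This is a software illustration under stated assumptions, not the Verilog design: the coefficients, the threshold, and the function names are hypothetical, the `add` argument is a placeholder where an approximate adder model could be plugged in, and the counter is assumed to increment once per crossing below the threshold rather than once per sample.

```python
def fir_filter(samples, coeffs, add=lambda x, y: x + y):
    """Direct-form FIR filter; `add` can be swapped for an approximate adder model."""
    history = [0] * len(coeffs)
    out = []
    for x in samples:
        history = [x] + history[:-1]        # shift the newest sample in
        acc = 0
        for c, h in zip(coeffs, history):
            acc = add(acc, c * h)           # accumulate the products
        out.append(acc)
    return out


def detect_spikes(filtered, threshold):
    """Return (spike_out, count); spike_out is 1 while the sample is below threshold."""
    spike_out = []
    count = 0
    previously_below = False
    for y in filtered:
        below = y < threshold
        if below and not previously_below:  # assumption: count once per crossing
            count += 1
        spike_out.append(1 if below else 0)
        previously_below = below
    return spike_out, count


if __name__ == "__main__":
    raw = [0, -5, -6, 0, -7, 0]             # hypothetical samples
    filtered = fir_filter(raw, coeffs=[1])  # single-tap filter passes samples through
    print(detect_spikes(filtered, threshold=-4))
```

Keeping the adder as a parameter mirrors the idea of building the accurate and approximate systems identically and changing only the adders.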
This paper proposes using the approximate computing paradigm to minimize the time required by each set to process the signal received from its channel. In addition, the work aims to reduce the footprint of the processing set on the FPGA, as well as to minimize or maintain power consumption. Reducing the area of each set allows the system to handle more MEA channels by adding processing sets in parallel. A faster processing set also enhances the performance of the whole system, since all processing sets run in parallel. This enables the system to acquire data, process it, and generate the required stimuli with minimal latency.
Two systems, accurate and approximate, are built identically. Both are written in Verilog and implemented with the Xilinx Vivado design tool, with the ZedBoard for the Xilinx Zynq-7000 specified as the target board. Full adders are used in the design of the accurate system, while the three selected approximate adders are used in the approximate system versions.
Tests are performed on actual neural data recordings of a mouse acute retina downloaded from the 3Brain website [38] and fed to the accurate and approximate systems. The samples are recordings of spontaneous neural signals from the ganglion cell layer in a knockout mouse used as a model of dystrophic retina. They show strong spontaneous bursting and propagation of the neural signal along axons converging towards the optic disc. The data signals comprise approximately 106,000 samples, which is sufficient to outweigh abnormalities in individual samples and to obtain consistent results across the tests. The data samples are first processed offline with the official BrainWave X software from 3Brain to identify the accurate number of spikes, which is then used as the reference for the tests of the accurate and approximate systems.
To study the effect of the approximation level on system parameters such as latency, accuracy, and chip utilization (which represents the area of the circuit), we started testing the approximate systems with half of the calculation results predicted and then increased the level of approximation while observing the system performance and parameters. The accuracy of spike detection in the approximate systems is measured at the same filtering and threshold settings as in the accurate system.
The design delay time is not reported directly in the Vivado software. Therefore, we measured it by using a wrapper module and assigning successively tighter timing constraints until the design fails the implementation step. The design delay time is determined by the last timing constraint that succeeded.
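The constraint-tightening procedure can be sketched as a simple search loop. The sketch below is tool-agnostic and does not call Vivado: `passes_timing` is a hypothetical callback standing in for "run implementation with this clock period and check whether timing is met", and the starting period and step size are illustrative values in picoseconds.

```python
def find_min_period(passes_timing, start_ps, step_ps):
    """Tighten the clock period step by step until implementation fails.

    passes_timing(period_ps) is a hypothetical callback that returns True
    when the design meets timing at that period. The returned value is the
    last (tightest) period that still succeeded.
    """
    if step_ps <= 0:
        raise ValueError("step_ps must be positive")
    if not passes_timing(start_ps):
        raise ValueError("design fails even at the starting constraint")
    period = start_ps
    while period - step_ps > 0 and passes_timing(period - step_ps):
        period -= step_ps
    return period


if __name__ == "__main__":
    # Stand-in for a real implementation run: pretend the true delay is 7350 ps.
    fake_design = lambda period_ps: period_ps >= 7350
    print(find_min_period(fake_design, start_ps=10_000, step_ps=100))
```

The resolution of the measured delay is bounded by the step size, so a smaller step gives a tighter estimate at the cost of more implementation runs.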
The system area is obtained from the chip utilization report after the implementation step which includes placing and routing. All parameters are then compared with the accurate system.
We noticed that approximation levels below half of the adder width do not result in significant enhancements. Therefore, we started the tests with the 12-bit and 24-bit adders in half prediction mode and then increased the level of approximation until we reached full prediction mode. Multiple approximation modes are tested and explained in the following sections.
Table 6 shows the overall evaluation results for the accurate and approximate systems for all adders. The system that uses CPredA in its operations improves both speed and area. At high approximation levels, the accuracy of the approximate system with this adder is also better than that of the systems using the other adders.
6.1. Mode 1
All three versions of the approximate system in this mode are developed using half prediction 12-bit and 24-bit adders CPredA, AA2, and GeAr R2 P2. In this mode, the lower half of the result is predicted, and the upper half is calculated accurately.
Figure 13 is a sample of the test in half prediction mode. It shows the original and filtered signals, as well as the spikes detected by accurate and approximate systems. Filtering and detection are almost identical in the accurate and approximate systems. Approximate systems detect all the spikes the accurate system detects at the same filtering and threshold settings.
The accuracy is very high for all approximate systems, and the system delay is reduced significantly in the approximate implementations: using GeAr achieves a reduction in time of approximately 37.6%. With CPredA, the chip utilization of the system drops from 0.35% in the accurate system to 0.30%, which is a 14.3% reduction in area, together with an almost 30% reduction in delay. All parameters are improved without losing spike detection accuracy.
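The area figure follows from the usual relative-reduction formula, (reference − value) / reference. A short Python check using the utilization values quoted above is given below; the function name is hypothetical.

```python
def relative_reduction(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (reference - value) / reference * 100.0


if __name__ == "__main__":
    # Chip utilization: 0.35% (accurate system) versus 0.30% (CPredA system).
    print(f"{relative_reduction(0.35, 0.30):.1f}% area reduction")
```

The same function applies to the delay figures once the measured delay of the accurate system is taken as the reference.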
6.2. Mode 2
The approximation level is slightly increased in this mode. Eight bits of the 12-bit adder results are predicted while 24-bit adder results are half-predicted.
Further enhancement is achieved in this mode while the accuracy remains high, reaching 98.7% in all approximate systems. The system delay is reduced further compared to the previous mode: by 38% in the case of the GeAr system and by 36.3% in the case of CPredA, which also decreases the area by 14.3%.
6.3. Mode 3
In this mode, 10 bits of the 12-bit adder results are predicted, whereas the 24-bit adder results are half-predicted. GeAr R2 P2 is not applicable in this mode, since only the last two bits of the result would be left to be calculated accurately. This is fewer than the four bits used for prediction in each segment.
The accuracy drops to 75% in the CPredA system and to 60% in the AA2 system. The speed enhancement in the CPredA system increases further to 37.6%, while the enhancement in area is smaller than in the previous modes because more chip resources must be allocated to achieve the low delay.
6.4. Mode 4
In this mode, 12-bit adder results are fully predicted, while 24-bit adders are kept in half prediction mode.
The maximum improvement in speed is gained in this mode by trading off accuracy. The system speed enhancement reaches 38.6% in the GeAr system and 37.7% in the CPredA system which also provides a reasonable area reduction. In contrast, accuracy has dropped to almost 70% in the CPredA system and to lower values in the GeAr and AA2 systems.
The previous four modes show that the system speed improves depending on the approximate adder used and the level of approximation. With Mode 1, the speed and area of the system are significantly improved without compromising accuracy, so this mode can be applied in scenarios where speed, area, and accuracy are all required. In this mode, all adders are configured to half approximation, i.e., the less significant bits of the result are approximated while the more significant bits are calculated accurately. Therefore, the samples processed by the FIR filter and fed to the spike detector contain some errors, but the spikes still cross the threshold needed for spike detection. High-speed scenarios can use Mode 2 by sacrificing system accuracy within acceptable margins.
Modes 3 and 4 do not provide reliable accuracy, but they offer faster processing for speed-demanding applications with low accuracy requirements. For instance, they can be used in a system that detects the existence of spikes in a neural signal but not their exact number. If required, the accuracy of these modes can be enhanced by changing the threshold value or decreasing the approximation level of the 24-bit adders. Other approximate adders can also be investigated in these modes, since the results show that the accuracy depends on the adder used: CPredA performed better than the other adders tested in our implementation.
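The four modes can be summarized as a small configuration table. The Python sketch below lists the number of predicted bits per adder in each mode, taken from the mode descriptions above, and encodes the GeAr R2 P2 restriction as a rule inferred from Mode 3: the accurately calculated part must either be empty (full prediction) or at least R + P = 4 bits wide. The dictionary layout and the function name are hypothetical.

```python
# Predicted (approximated) result bits per adder width in each mode.
MODES = {
    1: {"pred12": 6, "pred24": 12},    # half prediction for both adders
    2: {"pred12": 8, "pred24": 12},
    3: {"pred12": 10, "pred24": 12},
    4: {"pred12": 12, "pred24": 12},   # 12-bit adders fully predicted
}


def gear_applicable(width, predicted_bits, r=2, p=2):
    """Inferred rule: the accurate part is empty or at least r + p bits wide."""
    if not 0 <= predicted_bits <= width:
        raise ValueError("predicted_bits must be between 0 and width")
    accurate_bits = width - predicted_bits
    return accurate_bits == 0 or accurate_bits >= r + p


if __name__ == "__main__":
    for mode, cfg in MODES.items():
        ok = gear_applicable(12, cfg["pred12"])
        print(f"Mode {mode}: 12-bit predicted={cfg['pred12']:2d}  GeAr R2 P2 usable={ok}")
```

Under this rule, only Mode 3 excludes GeAr R2 P2, which matches the modes for which GeAr results are reported.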
7. Conclusions
Due to major technological advances in the field of neuroscience, MEAs with thousands of electrodes are now available. Recent MEAs monitor and record the activity of thousands of neurons in parallel and in real time, which requires processing systems with low latency, small circuit area, and low power consumption.
MEA system processing time is a critical parameter in neuroscience studies, where the system must be able to acquire data, process it, and generate feedback stimuli with the lowest latency. The circuit area of the processing system is also important, because the system uses parallel processing sets and programmable hardware, such as microcontrollers and FPGAs, has limited resources that need to be used wisely. If the circuit area of each set is smaller, more processing sets can be built on the same chip to handle the data channels in parallel. Lower power consumption by the processing units also enhances portability and decreases the overall power consumption of the system.
This paper proposes a novel neural processing and spike detection system that can be very useful in custom implementations. The system reduces processing latency, circuit area, and power consumption by applying the approximate computing paradigm. Approximate adders are used in the most computationally intensive parts of the system to minimize the latency of processing the signals and detecting the spikes. Additional enhancement techniques are also considered when designing the FIR filters, such as the minimum required filter order, parallelism, and symmetry. The system is described in the Verilog hardware description language and implemented on an FPGA, with the Xilinx Zynq-7000 All Programmable SoC on the ZedBoard as the target board.
The proposed processing system is tested with different adders at different approximation levels. The results of using approximate computing in the system calculations show effective enhancements in processing time and circuit area, depending on the type of adder and the level of approximation used. With some adders (CPredA in our implementation) at the half approximation level, the processing time can be reduced by up to 30% and the system circuit area by 14% without losing spike detection accuracy. A slight increase in the approximation level provides further enhancements at a still acceptable accuracy of 98.7%.
More improvements in speed and area are gained when spike detection accuracy is traded off and the system runs in high approximation modes, where most of the adders’ results are predicted. These modes can be used in applications in which the occurrence of spikes is significant, but not their precise number.
Due to time restrictions, only a limited number of approximate adders could be investigated and benchmarked in this paper. We plan to investigate more adders in the future. In addition, approximate multipliers will also be included, and the system will be connected to physical MEA devices to perform more tests and obtain more results.