1. Introduction
The effects of machine learning on our everyday lives have been far-reaching. In recent years, artificial intelligence and machine learning have proven effective and tremendously useful for solving computationally intensive real-world problems. Detecting and classifying the contents of an image is largely a matter of understanding what is present and where it is. Regional segmentation has garnered a great deal of interest, especially for the design of indoor robotic unmanned ground vehicles. Detecting and mapping the floor requires image processing and computer vision. Technology for indoor robots has been flourishing, and researchers have found significant applications for floor detection by Unmanned Ground Vehicles (UGVs) [1]. Floor detection allows the robotic vehicle to identify the walkable range on the ground so that it can share its visual range and environmental data; a high-performance ground-range-detection design therefore plays a very important role in robotics control. Another application of regional segmentation is the Advanced Driver Assistance System (ADAS) [2]. ADAS employs a regional segmentation algorithm that assists the driver by applying driving rules and warning the driver of any obstacle in the vehicle's path so that accidents can be avoided. Automated driving faces significant challenges in discerning vehicles and preparing driving rules. The most common method in the previous literature was to use parallax or depth information from a dual lens to detect other objects [3]. At present, existing neural networks and machine learning [4], image recognition processing, heterogeneous image fusion [5], depth-based ground segmentation methods [6], and similar approaches all require a high degree of computing power. Moreover, the increased complexity of the calculations needed for ground area segmentation means that such methods are not acceptable in this design case.
Vehicle discernment results support accident safeguards such as maintaining a safe following distance at the speed limit and automatic headlamp dimming. Logistic regression and support vector machines (SVMs) [7] are good options for simple pattern recognition in computer vision. They provide adequate and acceptable accuracy and are cost-effective in comparison with neural networks. However, since vehicles differ in shape, color, and other attributes, the patterns become more complex. For this reason, Deep Neural Networks (DNNs) have advantages over traditional classification models.
Over the past few years, Deep Neural Networks (DNNs) have become an active area of research in the field of machine learning. DNNs have provided outstanding results in computer vision, speech recognition, statistical machine translation, natural language processing, regression, and robotics. DNNs first decompose objects into simple patterns and then use these patterns to identify the objects. Deep Convolutional Neural Networks (CNNs) [8] are widely used within DNNs for data analysis; the data may be images, video, or speech. Using large-scale datasets, several studies have found that CNNs achieve a tremendous amount of accuracy in object recognition, sometimes exceeding humans. However, achieving this accuracy requires a large dataset, and the resulting computational complexity requires an immense amount of storage. These requirements make CNNs unsuitable for real-time embedded applications in which low power consumption is a significant factor. For this reason, a GPU [9] is often the only option for CNNs trained on large datasets.
For a single image detection, contemporary CNNs may need to perform billions of floating-point operations, which increases the need for storage. For instance, AlexNet [10] requires 244 MB of parameters and 1.4 GFLOPs (billions of floating-point operations) per image, while VGG-16 [11] needs 552 MB of parameters and 30.8 GFLOPs per image. Thus, we need to shrink both the data storage and the computational stages of DNNs. To this end, a 32-bit floating-point value can be converted into a low-precision representation using quantization [12]. Quantization refers to the process of reducing a floating-point number to a low-precision number, i.e., a few bits. A Binary Neural Network (BNN) is a binary-quantized version of a CNN; it can significantly alleviate SRAM/cache memory access overhead and on-chip storage constraints.
Low-precision arithmetic requires less memory for storage and less complexity. These advantages enhance speed and make operation more power-efficient for image classification on Field Programmable Gate Array (FPGA) platforms. For binary operations such as XOR, LUT lookups, and addition, FPGAs have a much higher theoretical peak performance than they do for floating-point operations. In addition, the small memory footprint removes the off-chip memory bottleneck by keeping the parameters on-chip, even for extensive networks. The BNN proposed by Courbariaux et al. in 2015 [13] is interesting because it can be implemented almost entirely with binary operations. It has the potential to attain performance in the tera-operations-per-second (TOPS) range on FPGAs.
Additionally, FPGAs are more cost- and power-efficient than GPUs and CPUs for embedded machine learning applications. The significant factors associated with FPGAs are system performance and real-time applicability. FPGAs are used as accelerators to compute, connect, collect, and process the massive quantities of information around us while employing a controlled data path. Moreover, FPGAs can offload computing operations, i.e., they can receive and process data inline without going through the host system. Hence, they free the processor to do other work and provide higher real-time system performance. Therefore, by implementing binary-based machine learning on an FPGA platform, we can obtain a classification and segmentation system for the robot that is efficient, reconfigurable, and power-efficient.
Dr. Taguchi of the Nippon Telegraph and Telephone Company, Japan, proposed a statistical approach based on orthogonal arrays (the Taguchi method) [14,15], which employs experimental design for quality control to ensure excellent performance during the design stage of a process. Especially in the last few years, this method has been used in several analytic research studies [16,17,18,19] to design experiments with the best performance in fields such as engineering, biotechnology, and computer science. Three factors should be considered when applying the Taguchi method to find the best options: the Taguchi loss function, offline quality control, and the orthogonal arrays for experimental design. Lin et al. [20] suggested Taguchi-based CNNs, which are optimized CNNs with an AlexNet architecture used for gender image recognition. Using the MORPH database, they claimed to obtain greater accuracy than the original AlexNet network. Another Taguchi-based 2D CNN has also been suggested [21]; its authors used the Taguchi method for parametric optimization of a 2D CNN to improve computed tomography image recognition of lung cancer. Using two databases, LIDC-IDRI and SPIE-AAPM, they claimed better performance than the original 2D CNN on both. Another study [22] looked at computed tomography image classification of lung nodules; its authors suggested a Taguchi-based CNN architecture that improves the accuracy of lung nodule classification for computed tomography images. In this paper, we use the Taguchi method to search for the sub-optimal B-FCN architecture that best balances hardware resource utilization and accuracy.
In this research, we focus on image segmentation. Many neural network architectures have been previously suggested. Fully Convolutional Networks (FCNs) have a fundamental architecture proposed by Long et al. [23] for segmentation. Segmentation is a more difficult task than classification or detection, as it involves labeling each pixel of the image. The Semantic Segmentation Network (Seg-Net) [24,25] is a semantic analysis network model whose structure is similar to the FCN. High-Resolution Networks (HR-Nets) [26] utilize a network architecture intended mainly for high-resolution data calculations; it maintains high-resolution representations rather than abandoning high-resolution data sources. In the multi-scale fusion layers of the network, the branch feature maps at different resolutions are combined by resampling. Here, we have chosen the FCN architecture for our segmentation architecture.
In this paper, the proposed floor segmentation method is based on a Binary Fully Convolutional Neural Network (B-FCN) using an FPGA accelerator. We search for a sub-optimal B-FCN architecture using the Taguchi method. The BNN accelerator on an FPGA provides exceptionally good image classification speed and accuracy, and in the B-FCN both the operation count and the required memory size are smaller. The experimental results show that the accuracy of the proposed method is very close to that of a traditional Fully Convolutional Neural Network. The advantages of the present architecture are: (1) the average accuracy is higher than that of other methods reported in the literature; (2) the B-FCN accelerator uses only a small fraction of the ZCU104's BRAM storage while supporting a fast image segmentation process; (3) the proposed fast methodology is ideal for UGV rovers using embedded devices that need to solve path searching in motion planning for real-time operation with high energy efficiency.
The remainder of this paper is organized as follows. Section 2 summarizes the Quantized Neural Network algorithm as it concerns deep learning. Section 3 reviews different Binary Neural Network models. Section 4 presents the proposed sub-optimal FPGA hardware accelerator based on a Binary Fully Convolutional Neural Network architecture designed using the Taguchi method. The experimental results and hardware implementation are reviewed in Section 5. Lastly, Section 6 concludes the research.
2. Quantization Neural Network
In this section, we describe the quantization of CNNs. Quantization [27] reduces 32-bit floating-point numbers to low-bit numbers, i.e., 1-bit, 2-bit, 4-bit, 8-bit, or 16-bit values. In deep learning, the 32-bit floating-point format is generally the predominant numerical format used for research and deployment. However, the low bandwidth and reduced computing complexity offered by quantization make it a necessity for deep learning models, driving research into lower-precision numerical formats.
Figure 1 shows a quantized neural network block diagram. First, the full-precision weights and activations are quantized. The total number of layers is N, and each layer consists of convolution + batch normalization + activation; the operations within each layer run with full precision inside its boundaries. Courbariaux et al. [28] suggested binarizing 32-bit floating-point values into 1-bit binary values, which yields the BNN. In this methodology, both the weights and activations are binarized to 1 bit each. Since only bit-wise operations are then required, this reduces the need for memory storage and the computational complexity. As shown in Figure 2, in forward propagation BNNs drastically reduce the memory size because most arithmetic operations can be computed with bit-wise operations, which means less power is consumed than in full-precision networks. In short, BNNs provide power-efficient networks.
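The bit-wise arithmetic mentioned above can be made concrete: when ±1 values are packed into machine words (bit value 1 encoding +1 and bit value 0 encoding −1), a ±1 dot product reduces to an XNOR followed by a population count. The sketch below, with the hypothetical helper `binary_dot`, illustrates the identity dot = 2·popcount(XNOR(a, b)) − n.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element +/-1 vectors packed as integers.

    Bit value 1 encodes +1 and bit value 0 encodes -1, so positions where
    the bits agree contribute +1 and disagreements contribute -1. Hence the
    +/-1 dot product equals 2 * popcount(XNOR(a, b)) - n. Sketch only.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # 1 wherever the two bits agree
    return 2 * bin(xnor).count("1") - n
```

On an FPGA the XNOR and popcount map directly onto LUT logic, which is why binary layers avoid multipliers entirely.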
2.1. Binarization Methods
In a BNN, the weights and activations of each convolutional layer are quantized to 1 bit each, i.e., +1 or −1. Courbariaux et al. 2016 [28] suggested two methods for quantizing a full-precision network into binary values. The first is the deterministic binarization method shown in Equation (1), which is simply the signum function of the weights and activations:

x^b = sign(x) = +1 if x ≥ 0, −1 otherwise,    (1)

where x is the floating-point (weight or activation) value and x^b is the binarized variable of x.
The second method of quantizing the full-precision values, shown in Equation (2), is called stochastic binarization. This quantization method is more accurate and precise than the first one, but it has more computational complexity:

x^b = +1 with probability p = σ(x), −1 with probability 1 − p,    (2)

where σ(x) is the hard-sigmoid function, as shown in Equation (3):

σ(x) = clip((x + 1)/2, 0, 1) = max(0, min(1, (x + 1)/2)).    (3)
The deterministic method of Equation (1) is chosen in this research because the stochastic binarization method requires hardware that generates random bits during the quantization process, while the deterministic binarization method requires less computation and less memory space and needs no random bit generator during quantization.
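The two binarization methods of Equations (1)–(3) can be sketched as follows; the function names are ours, and the stochastic variant uses Python's standard random generator in place of the hardware bit generator discussed above.

```python
import random

def binarize_det(x):
    """Deterministic binarization, Eq. (1): sign(x), with sign(0) -> +1."""
    return 1.0 if x >= 0 else -1.0

def hard_sigmoid(x):
    """Hard sigmoid, Eq. (3): clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))

def binarize_stoch(x, rng=random):
    """Stochastic binarization, Eq. (2): +1 with probability hard_sigmoid(x)."""
    return 1.0 if rng.random() < hard_sigmoid(x) else -1.0
```

Note that the stochastic variant needs one random draw per value, which is exactly the hardware cost that motivates choosing the deterministic method here.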
2.2. Forward and Backward Propagation
Figure 3 shows the BNN architecture, which consists of different layers. In forward propagation, the first image input feeds the convolutional layer; the output of the convolutional layer is the input to the batch normalization layer; the batch normalization output feeds the binarization layer; and finally the signal reaches the fully connected layer. In this process, only binary weights are used for the convolutional and fully connected layers. For a color image input to the first convolutional layer, each of the red, green, and blue channels is an 8-bit fixed-point value. Handling fixed-point continuous-valued inputs of m bits is comparatively simple. For example, in the general case of an 8-bit fixed-point input x and its corresponding binary weight w^b, the output of their multiplication is given in Equation (4):

s = x · w^b = Σ_{n=1}^{8} 2^{n−1} (x^n · w^b),    (4)
where x is a vector of 320 × 280 8-bit inputs, x^8 is the most significant bit of the first input, w^b is a vector of 320 × 280 1-bit weights, and s is the final weighted sum. Since each weight w^b is +1 or −1 and each input bit x^n is binary-valued, the convolution is nothing but simple addition and subtraction. Thus, in the first convolutional layer, every neuron's convolution is a chain of additions and subtractions. Furthermore, for all of the remaining convolutional layers, the input activations and weights are binarized. Hence, all multiply-accumulate operations become XNOR-addition operations, which is ideal for FPGA LUT cell configuration. After completing the forward propagation process and evaluating the loss, backward propagation calculates all parameters associated with the gradient of the BNN. For optimizing the deep learning methodology, the Stochastic Gradient Descent (SGD) process plays an essential role in the evaluation of real-valued gradients; SGD typically requires a lower learning rate. Equation (5) shows the relationship between the real-valued gradient and the updated weights:

W_new = W_old − η · (∂C/∂W),    (5)
where W_old is the old real-valued weight, η is the learning rate, ∂C/∂W is the gradient of the cost with respect to the weight, and W_new is the updated weight.
Here, the generated weights are clipped to the range −1 to +1. If a weight were allowed to fall outside this range, it could grow very large with each update. Large-magnitude weights would not change the BNN's output, because binarization always maps weights back into ±1, so clipping simply keeps the real-valued weights in a range where gradient updates remain effective.
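A minimal sketch of the clipped SGD update described above, assuming element-wise lists for weights and gradients (the helper name is ours):

```python
def sgd_update_clipped(w, grad, lr):
    """One SGD step on the real-valued weights (Eq. (5)), followed by
    clipping to [-1, +1] so the weights stay in the binarization range."""
    return [max(-1.0, min(1.0, wi - lr * gi)) for wi, gi in zip(w, grad)]
```

The binarized weights used in the forward pass are then simply the signs of these clipped real-valued weights.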
One prominent drawback of the signum function is its zero derivative. The Straight-Through Estimator (STE) method overcomes this zero-gradient issue by applying a simple hard-threshold function when calculating the gradient. Consider the signum-function quantization q = sign(r), and suppose an estimator g_q of the gradient ∂C/∂q has been obtained. Then the straight-through estimator of the gradient with respect to r is given in Equation (6):

g_r = g_q · 1_{|r| ≤ 1}.    (6)

This preserves the gradient's information and cancels the gradient for large values of r. STE performs back-propagation while assuming that the derivative of the signum function is equal to one inside the interval [−1, 1].
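The STE rule of Equation (6) amounts to a single comparison per element, as the following sketch (with our own function name) shows:

```python
def ste_backward(g_q, r):
    """Straight-Through Estimator, Eq. (6): pass the incoming gradient
    through unchanged when |r| <= 1, and cancel it otherwise."""
    return g_q if abs(r) <= 1.0 else 0.0
```

In effect, sign(r) is treated as the identity inside [−1, 1] during back-propagation, and saturated pre-activations stop receiving gradient.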
Since in this research we use binarization for data pruning, it should be noted that other techniques exist for data pruning, such as Principal Component Analysis (PCA) [29] and Locally Linear Embedding (LLE) [30]. Moreover, we could also optimize the data size using a Restricted Boltzmann Machine (RBM) combined with a CNN [31,32]. All of the above pruning techniques help to reduce the data size. However, in our case, the binarization technique reduces the computational complexity as well: binary-valued weights and activations require only addition and subtraction for most operations.
5. Experimental Results
For comparison, a total of 6080 indoor and outdoor scenes were evaluated for UGV robot vision testing, drawn from three datasets: 580 KITTI images, 3500 CamVid images, and 2000 MIT images. The complex environments included indoor challenges such as patterned floors, shadows, and reflections, as well as outdoor challenges relevant to the obstacle detection, path planning, and tracking modules. All of these scenes were detectable, and overall the UGV robots benefited from better navigation and route planning when using our approach. All image scenes were annotated for floor and non-floor regions on the ground. To evaluate the proposed methodology, we used 5500 scenes for training with BNN+ and BNN, and 580 scenes to evaluate the accuracy. True positives, false positives, true negatives, and false negatives were counted, and from these we computed the well-known accuracy metric of Equation (7) as well as the G-Mean, which gauges the reliability of the accuracy, as shown in Equation (8):

Accuracy = (TP + TN) / (TP + TN + FP + FN),    (7)

G-Mean = √((TP/(TP + FN)) × (TN/(TN + FP))).    (8)
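Both metrics can be computed directly from the four counts; the following sketch uses the standard definitions of accuracy and G-Mean, which we assume match Equations (7) and (8):

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (7): fraction of correctly labeled pixels."""
    return (tp + tn) / (tp + tn + fp + fn)

def g_mean(tp, tn, fp, fn):
    """Eq. (8): geometric mean of sensitivity and specificity; unlike raw
    accuracy, it penalizes imbalance between floor and non-floor classes."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity * specificity) ** 0.5
```

A classifier that labels every pixel "floor" can still score a high accuracy on floor-heavy scenes, but its specificity (and hence its G-Mean) collapses, which is why both metrics are reported.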
Table 4 shows the performance results comparing the traditional Fully Convolutional Neural Network with the proposed B-FCN in terms of Accuracy and G-Mean.
As shown in Figure 6, the results compare the different methods: the Baseline B-FCN, the Maximum B-FCN, and the Minimum B-FCN. The walkable floor area was identified with some error; therefore, an area edge detection algorithm was used to flatten the edges of the detected floor area. Finally, the improved Taguchi-based B-FCN combined the advantages of the various detection methods to obtain the optimal floor area position. The proposed method was effective at detecting the floor area position and improved the identification accuracy.
The comparison of the binary platform prototype designs on the different datasets is shown in Table 5. This table presents a synthesis of the different binary platform results regarding accuracy, storage size, and power consumption. In this paper, we synthesized three types of B-FCN: the Minimum B-FCN, the Baseline B-FCN, and the Maximum B-FCN. First, we set up the maximum and minimum accuracy, with the minimum accuracy required to exceed a preset threshold. Using the Taguchi method, we searched for the Baseline B-FCN, the most balanced sub-optimal configuration, which improved the storage size with an accuracy loss of only 1–3%. The table shows that the number of BRAMs for the Baseline B-FCN decreased substantially compared with the Maximum B-FCN, at only a small cost in accuracy. The proposed B-FCN has a higher GOPS/W than all of the other architectures, which means it is a more power-efficient architecture for the PYNQ ZCU104 FPGA in a real-time embedded system. In this way, the experimental results show that the proposed B-FCN is a much more power-efficient architecture at this level of accuracy.
5.1. System Architecture
Figure 7 shows the B-FCN system architecture on the PYNQ ZCU104 FPGA [38]. This board belongs to the UltraScale+ research and development platform series and uses a Multiprocessor System on Chip (MPSoC) as its base architecture. The Zynq UltraScale+ MPSoC, fabricated in TSMC's 16 nm FinFET process, builds on the SoC series used in the Zynq-7000. The system began with the initialization of the system environment, importing the underlying bitstream file and the various function modules. It then imported the video or image into the B-FCN computing core, configured the B-FCN endpoint, and started the calculation; finally, the calculation result was produced and displayed on the HDMI screen.
5.2. Hardware Implementation
First, the pre-built bit file and the TCL file were loaded into the PL side to configure the hardware architecture. Then, the PS side initialized the USB video camera, the underlying HDMI-related settings, and the B-FCN weight and threshold loads. After the camera captured the image, the accessible ground area was judged by passing the frame through the AXI4-Stream interface to the B-FCN. After the judgement was completed, the resulting image was sent to the PS side and shown on the display screen.
The Taguchi-based B-FCN system was used on the unmanned vehicle shown in Figure 8. The unmanned vehicle could be controlled through a computer connected over a wireless network, or through a pre-loaded program. The main test was whether the vehicle could travel across the ground region and identify anything in its way. Communication was based on the Robot Operating System (ROS), whose communications infrastructure lets each node communicate with the others. A schematic diagram of the hardware system is shown in Figure 9. While the unmanned vehicle was running, the robot recognized the image of the floor, and the non-travelable area was marked as a black region and displayed on the screen.
As shown in Table 5, we can compare the different platforms with CNN and binary CNN architectures used for image classification. The FCN architecture is 2 to 3 times more complex than a CNN. From the table, we can see that the binary CNN required 2210 BRAMs at 74.96 GOPS/W on the ImageNet classification dataset, while the proposed B-FCN architecture achieved 451.79 GOPS/W while using only 213 BRAMs for full segmentation. In other words, the B-FCN architecture retained a high level of accuracy using only a small fraction of the BRAMs and was 6 times more power-efficient than the binary CNN. In terms of memory storage and power consumption, the proposed B-FCN architecture proved efficient on the KITTI, MIT, and CamVid datasets, because it needed minimal storage and very little power for its computations.
The proposed architecture provided acceptable average accuracy for region segmentation on the UGV, a level sufficient for the robot system. One of the most beneficial advantages of our B-FCN architecture is its low power consumption. Large data sizes would require high resolution and high throughput, which typically places a more significant burden on the robot's battery; in other words, it makes for power-hungry robots. The proposed architecture instead reduces the data size, resulting in lower power use by the robot for region segmentation. The disadvantage observed on the UGV was that it sometimes left the boundary while moving, but its motion system was not strongly affected by the segmentation.
In Section 2, we discussed pruning techniques such as PCA, LLE, and RBM. Table 3 shows that only a modest fraction of the LUTs and flip-flops (FFs) were required for our proposed B-FCN architecture after binarization pruning. Therefore, no node pruning was needed. Applying node pruning would help reduce the size but would cause a more significant amount of design complexity; hence, no further node pruning was performed.
6. Conclusions
In this paper, a region segmentation algorithm is proposed. BNNs provide an essential model for improving storage and making inferences faster than conventional DNNs. A BNN's accuracy is not as great as that of full-precision models, and our refinements overcome the issues that arise. The FPGA is reconfigurable and has a parallel architecture, which gives the neural network architecture great scalability and adaptability. The proposed B-FCN and Taguchi-based B-FCN attain high accuracy on the SoC-FPGA embedded platform and can reach 25 FPS at 1080p resolution. Regarding the experimental results, the storage size was much improved, with a substantial decrease compared with the Maximum B-FCN. The second most significant aspect of this proposed sub-optimal methodology is power consumption: the proposed methodology has better power efficiency, i.e., 451.79 GOPS/W, than other binary platforms, together with a higher throughput of 2135 GOPS.
After the actual image is imported from the video camera, the B-FCN calculates the feasible range of the ground. The FPGA hardware IP is configured, and simple optimizations are performed so that the SoC-FPGA can identify the feasible ground area. Comparing the different cases, we can see the practical applicability of the B-FCN proposed in this paper for the embedded platform, which is essential for future applications. The application of the SoC-FPGA has contributed new solutions, even though practical cases of object recognition in the SoC-FPGA field have so far been few. Most previous studies used a CPU or GPU for ground classification and recognition, and the calculation accuracies of the CPU and GPU are higher than the results published in this research. However, the resource usage and power consumption required for those calculations are relatively high, whereas the FPGA's resource usage and power consumption are low, which benefits system stability. These characteristics should lead to more in-depth research and discussion in the field of SoC-FPGA.
In future work, we will make the B-FCN more efficient by extending it to a binary and quaternary FCN (B-Q-FCN) and removing less significant nodes. Moreover, we can use pruning methods other than binarization, such as RBM and PCA, for larger datasets, which will keep the complexity of the architecture manageable. For the implementation, we can use next-generation FPGAs, which will speed up the system.