In the first subsection, we review the fundamentals of convolutional layers, which are essential to understanding the topology of the SqueezeNet convolutional network. Above all, we consolidate the terminology used throughout the article and establish the theoretical foundations.
Other types of layers commonly used in convolutional architectures and in the selected model, SqueezeNet V1.1, are also discussed.
Once we have all the pieces of the puzzle that make up this type of network, we discuss its main structural element, the one that gives the network its essence: the fire module.
We then review works that deploy this type of network on low-cost FPGA devices capable of running AI at the edge.
Finally, we discuss the OpenCL standard and its two main approaches for addressing different types of problems from the point of view of FPGA solutions.
2.1. Convolution Layer
The convolution layer is normally used to process inputs or feature maps whose spatial distribution is two-dimensional, as in the case of an image, whose size is usually expressed by its pixel width and height, $W \times H$.
However, images usually have several channels that contain the color information, as in the RGB format. Therefore, the inputs to the convolution layers usually have the format $H \times W \times C$, which are the height, width, and channel depth of the input feature map, respectively.
Consequently, the 2D convolution layers are formed by several filters of dimension $K \times K \times C$ to maintain agreement with the input data. Regarding the spatial size of the filters, $K \times K$, we can find different combinations, with 1 × 1, 3 × 3, and 5 × 5 being the common sizes. Another parameter that is part of the convolutions is called the bias. It consists of a scalar value for each convolution filter.
Figure 2 visually represents the dimensions, with their nomenclature, of a feature map and a filter used in a 3 × 3 convolution.
The result of the convolution corresponds to the sum of the element-wise product of the filter with the input data. The filter slides over the input data, producing an output value for each displacement. If there is more than one channel, each channel of the filter is applied to its corresponding input channel, and the results of all channels are summed.
Assume an input $X$ of size $H \times W \times C$ and a single filter $F$ of size $K \times K \times C$ with bias $b$. The first element of the output corresponds to the following:

$$Y_{1,1} = \sum_{c=1}^{C} \sum_{i=1}^{K} \sum_{j=1}^{K} X_{i,j,c} \cdot F_{i,j,c} + b \quad (1)$$
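As an illustration, a minimal NumPy sketch of Equation (1) follows; the sizes of the input and the filter are assumptions chosen for the example, not values taken from the article.

```python
import numpy as np

# First output element of a convolution, as in Equation (1): element-wise
# product of the filter with the top-left input window, summed over all
# positions and channels, plus the scalar bias of the filter.
H, W, C, K = 5, 5, 3, 3          # illustrative sizes
x = np.random.rand(H, W, C)      # input feature map
f = np.random.rand(K, K, C)      # a single K x K x C filter
b = 0.5                          # bias of the filter

y_11 = np.sum(x[0:K, 0:K, :] * f) + b
print(y_11)
```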
The size of the output depends mainly on the size of the kernel, the sliding factor or stride, and an optional setting known as padding that allows a frame, usually of zeros, to be added around the input.
With the input and filter being square, with sizes $N$ and $K$, respectively, we can calculate the size of the output $M$ as:

$$M = \frac{N - K + 2 \cdot padding}{stride} + 1 \quad (2)$$

where $padding$ is the size of the added frame, and $stride$ is the number of positions the filter moves over the input data.
The number of channels of the output depends on the number of filters the convolution applies.
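As a quick aid, Formula (2) can be transcribed directly into Python; the integer division reflects that incomplete final windows are discarded (the function name is ours):

```python
def conv2d_output_size(n: int, k: int, padding: int = 0, stride: int = 1) -> int:
    """Output size M of a square convolution, as in Formula (2)."""
    return (n - k + 2 * padding) // stride + 1

# Example: a 5x5 input with a 3x3 filter, padding 1 and stride 1 keeps size 5.
assert conv2d_output_size(5, 3, padding=1, stride=1) == 5
```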
Figure 3 shows a visual example of a convolution layer operation. We observe an input feature map with a size $N$ of 5 and a filter with a size $K$ of 3, since this is a 3 × 3 convolution, with a stride and padding of 1. Applying Formula (2), we can verify that the output size $M$ is 5. Also, since we have two filters, there are 2 output channels, obtaining an output dimension of 5 × 5 × 2.
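The dimensions of this example can be reproduced with a short NumPy sketch; the values are random, since only the shapes matter here:

```python
import numpy as np

# Figure 3 example: 5x5x1 input, two 3x3 filters, stride 1, padding 1.
x = np.random.rand(5, 5, 1)
filters = np.random.rand(2, 3, 3, 1)         # 2 filters of size 3x3x1
biases = np.zeros(2)

xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))     # padding = 1: frame of zeros
m = (5 - 3 + 2 * 1) // 1 + 1                 # Formula (2) -> 5
y = np.zeros((m, m, 2))
for o in range(2):                           # one output channel per filter
    for i in range(m):
        for j in range(m):                   # stride = 1: move one position
            y[i, j, o] = np.sum(xp[i:i + 3, j:j + 3, :] * filters[o]) + biases[o]
print(y.shape)                               # (5, 5, 2)
```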
To conclude, it is necessary to mention that a widespread practice is to apply an activation function to the results obtained in the convolution, introducing nonlinearities into the data and allowing the network to tackle more complex problems. The most common is ReLU, whose piecewise function is found in Equation (3):

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \quad (3)$$
2.3. SqueezeNet
SqueezeNet [2] is a CNN architecture whose main objective is to reduce the number of parameters without significantly affecting the accuracy of the classified images. Reducing the number of parameters of the convolutional network makes its implementation feasible on FPGAs or embedded systems. Moreover, it allows the parameters to be stored in the on-chip memory, reducing the bottleneck generated by the memory access bandwidth.
To do this, the authors of [2] present the fire modules that make up the global architecture. These modules consist of a squeeze layer, corresponding to a 1 × 1 convolution, whose output connects directly to an expand layer, formed by a mixture of 1 × 1 and 3 × 3 convolutions, whose concatenated results generate the module output.
Using a squeeze layer before the 3 × 3 convolutions allows the number of input channels, and hence the number of kernels in each filter, to be reduced.
Figure 4 shows the layout of the different convolutions in a fire module.
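To make the data flow concrete, the following NumPy sketch mirrors the layout of Figure 4; the convolution helpers are simplified (stride 1, ReLU after every convolution) and the channel sizes are inspired by the first fire module of SqueezeNet v1.1, so this should be read as an illustration rather than a reference implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv1x1(x, w):
    """1x1 convolution: a per-pixel matrix product over the channels.
    x: (H, W, Cin), w: (Cout, Cin) -> (H, W, Cout)."""
    return np.einsum('hwc,oc->hwo', x, w)

def conv3x3(x, w):
    """3x3 convolution with stride 1 and padding 1 (H and W are kept).
    x: (H, W, Cin), w: (Cout, 3, 3, Cin) -> (H, W, Cout)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    y = np.zeros((h, wd, w.shape[0]))
    for o in range(w.shape[0]):
        for i in range(h):
            for j in range(wd):
                y[i, j, o] = np.sum(xp[i:i + 3, j:j + 3, :] * w[o])
    return y

def fire_module(x, w_s, w_e1, w_e3):
    """Squeeze (1x1) followed by the parallel expand branches (1x1 and 3x3),
    whose results are concatenated along the channel axis."""
    s = relu(conv1x1(x, w_s))                 # squeeze: channel reduction
    e1 = relu(conv1x1(s, w_e1))               # expand, 1x1 branch
    e3 = relu(conv3x3(s, w_e3))               # expand, 3x3 branch
    return np.concatenate([e1, e3], axis=-1)

# Channel sizes of the first fire module (64 -> 16 -> 64 + 64); the 8x8
# spatial size is an arbitrary choice for the example.
x = np.random.rand(8, 8, 64)
y = fire_module(x,
                np.random.rand(16, 64),       # squeeze weights
                np.random.rand(64, 16),       # expand 1x1 weights
                np.random.rand(64, 3, 3, 16)) # expand 3x3 weights
print(y.shape)                                # (8, 8, 128)
```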
Subsequently, a new version of SqueezeNet, SqueezeNet v1.1, was released in the official repository [2]. This version modifies the original architecture, keeping the fire modules, slightly reducing the number of parameters, and obtaining a 2.4× reduction in the computational cost without losing accuracy.
A comparison of the classification errors of both versions, for the models pretrained with ImageNet images, is available in the documentation of the PyTorch open-source machine learning library [10]. The results are shown in Table 1.
We can see the structure and configuration of SqueezeNet v1.1 in Figure 5 and Table 2. It is a succession of 14 layers. If we break down each layer into its constituent units (Figure 5), we need to implement four fundamental elements: a 1 × 1 convolution, a 3 × 3 convolution, and max-pool and average-pool operations. Of course, the reuse of these units requires that the implementations be reconfigurable to adapt to the different sizes and parameters listed in Table 2.
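The reuse argument can be sketched by reducing each unit to its effect on the feature-map shape; the layer list below is abbreviated and the input size is an assumption, so it illustrates the dispatch idea rather than transcribing Table 2:

```python
# Each layer of the network is executed by one of four reconfigurable units;
# here every unit is reduced to its effect on the feature-map dimensions.
def conv(shape, k, n_filters, stride, padding=0):
    h, _, _ = shape
    m = (h - k + 2 * padding) // stride + 1   # Formula (2)
    return (m, m, n_filters)

def pool(shape, k, stride):                   # max pool and average pool
    h, _, c = shape
    m = (h - k) // stride + 1
    return (m, m, c)

shape = (224, 224, 3)                              # assumed input image size
shape = conv(shape, k=3, n_filters=64, stride=2)   # conv1
shape = pool(shape, k=3, stride=2)                 # maxpool1
print(shape)                                       # (55, 55, 64)
```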
2.4. Related Work
Shortly after the appearance of the SqueezeNet model, David Gschwend [11] implemented the model on a low-cost Zynqbox development platform that included a Xilinx Zynq-7000 FPGA combined with an ARM Cortex-A9. The implemented solution was called ZynqNet and was divided into two parts.
Modifications were made to the original SqueezeNet model to adapt it to FPGA requirements. One of the most notable modifications was the elimination of the max-pool layers, following the philosophy of "all-convolutional networks", where the network is composed only of convolutional layers. To obtain the same reduction of the output size, they added a stride of two to the convolution immediately after each removed layer. However, in SqueezeNet, this layer always consists of a 1 × 1 convolution, which would result in information loss. Therefore, they replaced this first layer with a 3 × 3 filter. They also implemented the final average-pool layer, which can be considered a convolution whose filters are composed of values $1/N$, with $N$ being the number of elements in the pooling window.
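This equivalence is easy to verify: a filter whose weights all equal $1/N$ reproduces exactly the window average. A minimal sketch, with an illustrative window size:

```python
import numpy as np

# Average pool seen as a convolution: a constant-weight filter of 1/N,
# with N the number of elements in the pooling window.
x = np.random.rand(4, 4)
k = 4                                  # global pooling window (illustrative)
f = np.full((k, k), 1.0 / (k * k))     # constant-weight "filter"

assert np.isclose(np.sum(x * f), x.mean())
```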
Another modification worth mentioning is the resizing of the layers, making their height and width powers of two. This is worth noting because this type of change is common in several implementations, conditioning the CNN topology in exchange for better latency values. On the other hand, in implementing the accelerator, the authors opted to use single-precision floating point for the data representation to maintain compatibility with their GPU implementation of ZynqNet.
In addition, they relied on cache memories to facilitate memory reads and writes and chose to unroll the loops of the 3 × 3 convolutions, thus achieving a top-5 error on ImageNet of 15.4% using a 32-bit floating-point data representation. We highlight that the accelerator was optimized to perform 3 × 3 convolutions, and any other configuration would likely require additional logic, resulting in a low utilization of the multiply-accumulate units. It is worth noting that they performed their implementation using Vivado HLS.
Work similar to ZynqNet was carried out in [12], resulting in EdgeNet. This proposed convolutional network was implemented on a DE10-Nano board, and its accelerator consisted of a configurable computing block. This unit was designed to work in reverse order, with the accelerator input equivalent to an expand layer and the output being the squeeze layer. This allowed fewer input channels to be accessed, as the input of the computation unit was mostly the result of a 1 × 1 convolution of a squeeze layer. Additionally, the unit could incorporate pool-type layers depending on the configuration of the data path.
It should be noted that the results were obtained by representing the data and the parameters resulting from the training of [11] with 8-bit floating points (five bits of exponent and two of mantissa), incurring a reduction of 6% in classification accuracy and achieving 51% top-1 accuracy on ImageNet.
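A rough NumPy sketch of such an 8-bit floating-point format follows; the rounding scheme and the exponent range are our assumptions, since the authors' exact encoding is not detailed here:

```python
import numpy as np

# Quantization to a 1-5-2 minifloat (sign, exponent, mantissa): the mantissa
# is rounded so that only 3 significant binary digits survive per value.
def quantize_fp8(x, mantissa_bits=2):
    m, e = np.frexp(x)                   # x = m * 2**e, with 0.5 <= |m| < 1
    steps = 2.0 ** (mantissa_bits + 1)   # implicit leading bit + 2 stored bits
    m = np.round(m * steps) / steps      # drop the discarded mantissa bits
    e = np.clip(e, -15, 16)              # assumed 5-bit exponent range
    return np.ldexp(m, e)

w = np.array([0.1234, -0.0567, 0.89])
print(quantize_fp8(w))                   # coarse approximations of w
```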
Continuing with the quantization of the parameters of convolutional neural networks, we find in [13] another implementation of SqueezeNet v1.1 where, prior to the design of the accelerator, the authors studied the effect of quantization on the model's accuracy.
They observed that using the 8-bit integer data type considerably reduced both the FPGA resources and the memory accesses, with losses of only 0.69% and 0.72%, resulting in a top-1 accuracy of 57.49% and a top-5 accuracy of 79.90%, respectively.
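For reference, a minimal sketch of 8-bit integer quantization follows, using a symmetric per-tensor scale; the authors' exact scheme is not specified here, so this is only one common way to do it:

```python
import numpy as np

# Symmetric int8 quantization: map the weight range onto [-127, 127]
# with a single scale factor, then reconstruct by multiplying back.
def quantize_int8(w):
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(w - q.astype(np.float32) * s)))  # small reconstruction error
```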
It is worth mentioning that this implementation was carried out on the low-cost DE10-Nano board using the OpenCL standard and HLS.
In addition, they incorporated additional logic that allowed them to read the input data and the convolution filters and to write the results into the global memory. Also, because the ReLU activation function generates feature maps with many results equal to zero, they introduced a specific control with which they avoided performing operations with null input values.
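The zero-skipping control can be illustrated in a few lines; this software sketch only conveys the idea, whereas in hardware the skipped multiply-accumulate operations translate into saved cycles:

```python
import numpy as np

# After ReLU, many activations are zero, so their multiply-accumulate
# operations can be skipped without changing the result.
def sparse_dot(activations, weights):
    acc = 0.0
    for a, w in zip(activations, weights):
        if a == 0.0:          # specific control: skip null inputs
            continue
        acc += a * w
    return acc

a = np.maximum(0.0, np.random.randn(8))   # ReLU output: roughly half zeros
w = np.random.randn(8)
assert np.isclose(sparse_dot(a, w), np.dot(a, w))
```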
Continuing with implementations using reduced fixed-point precision, we have the work of [14]. They used a Zynq-7020 for the implementation, based on the ZC702 development board. The authors employed 8 bits for the parameters (weights and biases) and 16 bits for the activations, carrying out fixed-point arithmetic. This yielded an execution time of 333 ms. Their energy efficiency measurements were obtained from the Xilinx Power Estimator (XPE) and resulted in a consumption of 2.275 watts. The Xilinx HLS tool was used for the design, which required two kernels for its implementation; in particular, 186 DSPs were used.
Similar work with the Zynq-7020 can be found in [15], also using HLS but working with 32-bit floating point. They reached execution times of 1 second and, curiously, with degrees of resource occupation similar to those previously mentioned, they obtained a power consumption of 7.95 watts, again using XPE.
Later, in [16], the researchers developed, on the DE10-Nano SoC board, an implementation of SqueezeNet v1.1 that exploits parallelism at the multithreaded level, using the HLS of OpenCL.
They provided detailed documentation of the steps followed to carry out their design. They included different versions in which kernel optimization techniques, such as coalesced memory access and loop unrolling, reduced the time required for the network's prediction. Reviewing the implementation, we observe that using coalesced memory access through the float4 data structure limits the number of channels of each convolution layer, since they must be a multiple of four, matching the number of data elements read in each coalesced access.
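This constraint can be visualized with a short sketch: if each coalesced read brings four consecutive channel values, a layer whose channel count is not a multiple of four must be padded with dummy channels (the helper name is ours):

```python
import numpy as np

# Pad the channel dimension up to the next multiple of the vector width,
# mirroring the float4 constraint of the coalesced reads.
def pad_channels_to_multiple(x, vector_width=4):
    h, w, c = x.shape
    pad = (-c) % vector_width              # channels missing to reach a multiple
    return np.pad(x, ((0, 0), (0, 0), (0, pad)))

x = np.random.rand(5, 5, 10)               # 10 channels: not a multiple of 4
print(pad_channels_to_multiple(x).shape)    # (5, 5, 12)
```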