1. Introduction
As one of the fundamental tasks in computer vision, object detection is widely used in face detection, object tracking, image segmentation, and autonomous driving [1]. The objective is to localize and classify specific objects in an image: to find all objects of interest and mark each position with a rectangular bounding box [2,3]. In recent years, the field of computer vision has increasingly focused on designing deeper networks to extract valuable feature information, resulting in improved performance [4,5,6,7,8]. However, because of their vast number of parameters, these models often consume a significant amount of computational resources. For example, Khan [9] proposed an end-to-end scale-invariant head detection framework that models a set of specialized scale-specific convolutional neural networks with different receptive fields to handle scale variations. Wang [10] introduced a pyramid structure into the transformer framework, using a progressive shrinking strategy to control the scale of feature maps. While these models demonstrate outstanding detection accuracy, they rely heavily on powerful graphics processing units (GPUs) to achieve rapid detection speed [11]. This poses a significant challenge in balancing accuracy and inference speed on mobile devices with limited computational resources [12,13,14]. Current deep-learning-based detection models thus tend to pair complex network architectures and high detection accuracy with a dependence on powerful GPUs for fast detection [15]. With the rapid development of technologies such as smartphones, drones, and unmanned vehicles, deploying neural networks on devices with limited storage and computing power is becoming an urgent need. Under computing power and storage space constraints, lightweight real-time networks have become a popular research topic for applying deep learning in embedded applications [16].
Recently, some researchers have reduced the number of parameters and the model size by optimizing the network structure, as in SqueezeNet, MobileNetv1-v3 [17,18,19], ShuffleNetv1-v2 [20,21], Xception [22], MixNet [23], and EfficientNet [24]. The MobileNet series replaces traditional convolution with depthwise separable convolutions, achieving results similar to standard convolution while greatly reducing the number of calculations and parameters. The ShuffleNet series uses group convolution to reduce the number of parameters and applies channel shuffling to reorganize the feature maps generated by group convolution. Other researchers have proposed regression-based one-stage object detectors, such as SSD [25], YOLOv1-v4 [26,27,28,29], RetinaNet [30], and MimicDet [31]. Instead of taking two shots, as in the R-CNN series, one-stage detectors predict the target location and category directly from a network without region proposals. Building on the regression concept of YOLOv1, SSD uses predefined boxes of different scales and aspect ratios for prediction and extracts feature maps of different scales for detection. Although SSD is more accurate than YOLOv1, it does not perform well in small object detection. YOLOv3 uses the Darknet backbone network to mine high-level semantic information, which greatly improves classification performance, and adopts a feature-pyramid-like structure for feature fusion to enhance the accuracy of small target detection. Since a large number of easily classified negative samples in the training phase can lead to model degradation, RetinaNet proposes focal loss based on standard cross-entropy loss to effectively eliminate category imbalance, similar to a form of hard sample mining. To improve the accuracy of one-stage detectors, MimicDet uses the features generated by a two-stage detector to train the one-stage detector in the training phase; in the inference phase, however, MimicDet uses the one-stage branch directly for prediction to keep the detection speed relatively fast. The YOLO series achieves an excellent balance between accuracy and speed and has become widely used for target detection in real-world scenarios. Nevertheless, YOLO models have complex network structures and large numbers of parameters, so they require vast computing resources and considerable storage space on embedded devices, and this high computational cost limits their ability to perform multiple tasks that require real-time performance on computationally limited platforms [32]. To reduce the consumption of computing resources, lightweight YOLO variants such as the latest YOLOv4-Tiny [33] use a smaller feature extraction network to reduce the number of parameters and improve detection speed. Therefore, when performing object detection on embedded devices, improving detection accuracy while maintaining real-time performance remains a significant problem to be solved.
We build on YOLOv4 because YOLO-style detectors process images faster than heavier architectures such as Xception, making them more suitable for real-time applications, and because YOLOv4 is widely used and performs well in many object detection tasks. After a comprehensive literature review to identify popular methods and existing limitations in object detection, we decided to exploit multiscale feature maps to promote the extraction of fine-grained features, because they improve the detection of small and medium-sized objects. We also considered how to achieve spatial and channel attention calibration through structural optimization and applied other strategies to improve detection accuracy without increasing computational cost. Starting from YOLOv4-Tiny, we propose Mini-YOLOv4, an efficient object detection model for constrained environments that achieves an excellent trade-off between speed and accuracy. Compared with YOLOv4-Tiny, Mini-YOLOv4 not only improves detection accuracy but also effectively reduces the number of model calculations and the number of parameters from 5.9 M to 4.7 M, enabling efficient object detection on embedded devices. Mini-YOLOv4 also has fewer parameters than YOLOv3, YOLOv3-Tiny, YOLOv4-CSP, YOLOv4-Tiny, and YOLOv5s. To evaluate the effectiveness of our method for lightweight object detection in embedded systems, we conducted experiments on the PASCAL VOC and MS COCO benchmark datasets and compared our method with other models, considering various metrics (e.g., parameters, BFLOPs, FPS, mAP, AP50, and AP75) to measure computational cost and detection performance. We chose the NVIDIA Jetson Nano, a widely used low-cost deep learning platform, as the experimental test environment.
The main contributions of this paper are as follows:
(1) To reduce the number of parameters while improving detection accuracy, we build a multibranch feature aggregation block (MFBlock) to replace the last 3 × 3 convolutional block in the backbone network. In MFBlock, we embed a new attention mechanism called the complete attention module (CAM) that directly explores spatial and channel clues. CAM exploits the spatial structure information ignored by SENet and provides a significant increase in accuracy at a low computational cost.
(2) To exploit long-range dependencies, we design a group self-attention block (GSBlock) to replace the 3 × 3 convolutional block in the prediction head, consisting of a spatial group attention module (SGAM) and channel group attention module (CGAM). SGAM focuses on capturing the spatial association among feature maps, and CGAM aims to aggregate channel-wise feature information. We use SGAM and CGAM jointly to obtain comprehensive feature representations.
(3) To improve the regression accuracy for small and medium targets, we introduce a hierarchical feature pyramid network (H-FPN). In H-FPN, we use upsampling and downsampling to resize the feature maps of different stages in the network. Then, we fuse the high-level semantic features with the low-level feature representations in a hierarchical manner to obtain fine-grained features.
(4) Extensive experiments on the PASCAL VOC and MS COCO datasets verify the effectiveness of each component. Moreover, we compare Mini-YOLOv4 with state-of-the-art object detection algorithms and other lightweight models. Mini-YOLOv4 achieves comparable mAP results with a lower BFLOPs value and a real-time detection speed on the NVIDIA Jetson Nano embedded platform.
3. Method
Figure 1 illustrates the proposed Mini-YOLOv4 framework, which mainly includes three proposed modules: a multibranch feature aggregation block (MFBlock), a group self-attention block (GSBlock), and a hierarchical feature pyramid network (H-FPN). MFBlock introduces an attention module to capture spatial and channel clues directly and fuses feature information from multiple branches to expand the receptive field. GSBlock then explicitly models point-to-point correlations among feature maps to mine long-range dependencies and obtain rich global information, improving detection accuracy without increasing the computational cost or parameter count of the model. Finally, to improve the detection accuracy for small and medium-sized targets, we optimize the multiscale prediction process of YOLOv4-Tiny by fusing feature maps from different layers of the network in a hierarchical manner to obtain fine-grained feature representations.
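To make the overall data flow concrete, the following PyTorch-style sketch shows how the three modules could be wired together around a CSPDarknet-Tiny backbone with two prediction heads; the class and argument names are illustrative assumptions, not the released implementation.

import torch.nn as nn

class MiniYOLOv4(nn.Module):
    # Illustrative skeleton of the Mini-YOLOv4 data flow (assumed structure, not official code).
    def __init__(self, backbone, mfblock, hfpn, gsblock_p4, gsblock_p5, head_p4, head_p5):
        super().__init__()
        self.backbone = backbone        # CSPDarknet-Tiny stages producing C2, C3, C4, C5
        self.mfblock = mfblock          # replaces the last 3 x 3 convolutional block of the backbone
        self.hfpn = hfpn                # hierarchical feature pyramid network
        self.gsblock_p4 = gsblock_p4    # group self-attention before the 26 x 26 head
        self.gsblock_p5 = gsblock_p5    # group self-attention before the 13 x 13 head
        self.head_p4 = head_p4          # 1 x 1 convolution producing box/objectness/class predictions
        self.head_p5 = head_p5

    def forward(self, x):
        c2, c3, c4, c5 = self.backbone(x)    # multiscale features from four stages
        c5 = self.mfblock(c5)                # multibranch aggregation with CAM attention
        p4, p5 = self.hfpn(c2, c3, c4, c5)   # hierarchical fusion into two prediction scales
        p4 = self.gsblock_p4(p4)             # long-range dependencies (SGAM followed by CGAM)
        p5 = self.gsblock_p5(p5)
        return self.head_p4(p4), self.head_p5(p5)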
3.1. Multibranch Feature Aggregation Block
To achieve a high detection speed, YOLOv4-Tiny uses a relatively simple network structure, which makes it difficult to obtain fine-grained features. We therefore design a multibranch feature aggregation block, MFBlock, to replace the last 3 × 3 convolution operation in the backbone network of the original YOLOv4-Tiny. The objective is to reduce the computational complexity and parameter size of the network while improving its feature extraction ability.
Recent studies have shown that channel attention (e.g., SENet) contributes to model performance, but it ignores extremely important spatial structure information. To generate fine-grained features with semantic information, we directly obtain channel and spatial responses through 1 × 1 convolution and enlarge the receptive field by fusing multiscale features. As shown in Figure 2, the multibranch feature aggregation block consists of three branches. The first and second branches extract feature maps with different receptive fields, and the last branch provides a shortcut connection. Specifically, the first branch performs channel reduction through a 1 × 1 convolution and then extracts semantic features through a 3 × 3 convolution. After channel reduction, the second branch learns the weights of each channel and spatial location for feature selection through the CAM. As shown in Figure 3, in addition to calibrating the channel response of the feature map, CAM considers spatial location information: it uses two 1 × 1 convolutions and a Leaky ReLU activation function instead of the fully connected layers used by SENet to obtain an attention mask, and it includes an identity connection to reuse previous feature information effectively. The resulting attention mask has the same size as the input feature map.
Given an input feature Fm, the attention map is calculated as

Am = W2 δ(W1 Fm) + Fm,

where W1 and W2 are the parameter matrices of the two 1 × 1 convolution operations (the former used for excitation and the latter for squeezing), δ denotes the Leaky ReLU activation function, and the added Fm term is the identity connection. After obtaining the attention map Am, the output feature map F′m is calculated as

F′m = ConvBlock(Fm) ⊗ Am,

where ConvBlock contains a 1 × 1 convolution, batch normalization, and a Leaky ReLU activation function, and the operator ⊗ is applied in an elementwise manner. Notably, since CAM does not change the size of the input feature map, it can be applied to any existing CNN to emphasize discriminative features.
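As a concrete reference, the following PyTorch-style sketch implements one plausible reading of CAM and the elementwise application above; the reduction rate r, the sigmoid gating of the mask, and the exact placement of the identity connection are assumptions, with r = 8 matching the value favored in the ablation study.

import torch
import torch.nn as nn

class CAM(nn.Module):
    # Sketch of the complete attention module: two 1 x 1 convolutions with a Leaky ReLU
    # replace SENet's fully connected layers, and an identity connection reuses the input.
    # The sigmoid gating and the reduction rate r are assumptions.
    def __init__(self, channels, r=8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // r, kernel_size=1)   # W1
        self.conv2 = nn.Conv2d(channels // r, channels, kernel_size=1)   # W2
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, f_m):
        a_m = self.conv2(self.act(self.conv1(f_m)))   # joint channel and spatial response
        return torch.sigmoid(a_m + f_m)               # identity connection, mask with input size

def conv_block(in_ch, out_ch):
    # ConvBlock from the text: 1 x 1 convolution, batch normalization, Leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Elementwise application of the attention map: F'm = ConvBlock(Fm) (x) Am.
cam, block = CAM(256), conv_block(256, 256)
f_m = torch.randn(1, 256, 13, 13)
f_out = block(f_m) * cam(f_m)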
3.2. Group Self-Attention Block with Symmetric Structure
Although MFBlock emphasizes important features and mitigates the interference associated with redundant information, it is limited by the local receptive field and cannot obtain rich global information. Here, global information refers to obtaining an attention coefficient matrix through the attention mechanism and using it to select important features from the rich semantic information of the original input. Therefore, to efficiently capture long-range dependencies, we propose GSBlock to replace the 3 × 3 convolutional block in the prediction head. GSBlock consists of two parts: (1) a spatial group self-attention module (SGAM) and (2) a channel group self-attention module (CGAM). SGAM focuses on capturing the spatial association among feature maps, and, since high-level channels tend to be strongly correlated, CGAM aims to aggregate channel-wise feature information. We use SGAM and CGAM jointly to mine contextual information and obtain a comprehensive feature representation.
Spatial group self-attention module: To capture the semantic dependencies among pixels in the spatial domain, we introduce a spatial attention module based on a self-attention mechanism. The non-local approach first performs linear mappings of the input features to obtain query and key feature maps for calculating similarity weights, then corrects the attention information of the original features based on these weights, and finally expands the feature channels for the residual operation; this whole process is relatively inefficient. Unlike the non-local method, our approach considers contextual information by partitioning a feature map into two symmetric groups through group convolution and computes the pairwise relations to form an affinity matrix. The feature map groups are then aggregated through weighted summation with the affinity matrix. This design promotes cross-group interactions to gain rich global information and efficiently reduces the number of self-attention calculations to achieve a lightweight network structure.
As shown in Figure 4, given the feature map Fs ∈ ℝ^(C×H×W), the pairwise feature maps Fs_1 and Fs_2 are computed from Fs using a 1 × 1 convolution and a 3 × 3 group convolution whose parameter matrices are W3 and W4, respectively, where Fs_i ∈ ℝ^(C′×H×W) and C′ = C/2. We then resize Fs_1 and Fs_2 to ℝ^(C′×N), where N = H × W, and compute the spatial affinity matrix As ∈ ℝ^(1×H×W) from their pairwise relations. Finally, we superimpose the attention information encoded in As on Fs_1 and Fs_2 to obtain the final output feature map Es.
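The following PyTorch-style sketch shows one plausible instantiation of SGAM under the stated shapes; the sequential application of the 1 × 1 and 3 × 3 group convolutions, the channelwise dot-product affinity, the softmax over spatial positions, and the residual concatenation are assumptions rather than the paper's exact formulas.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SGAM(nn.Module):
    # Sketch of the spatial group self-attention module: a 1 x 1 convolution (W3) and a
    # 3 x 3 group convolution (W4) split the input into two symmetric groups, a spatial
    # affinity map of shape 1 x H x W weights every position, and the two groups are
    # aggregated with the affinity. The affinity and aggregation details are assumptions.
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)                       # W3
        self.group = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=2)   # W4

    def forward(self, f_s):
        b, c, h, w = f_s.shape
        f_s1, f_s2 = torch.chunk(self.group(self.reduce(f_s)), 2, dim=1)  # each: B x C/2 x H x W
        a_s = (f_s1 * f_s2).sum(dim=1, keepdim=True)                      # per-position similarity
        a_s = F.softmax(a_s.view(b, 1, -1), dim=-1).view(b, 1, h, w)      # affinity A_s over H x W
        return torch.cat([f_s1 + f_s1 * a_s, f_s2 + f_s2 * a_s], dim=1)   # superimpose attention, restore C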
Channel group self-attention module: Since high-level channels tend to be strongly correlated, some channels share similar semantic information. To mine semantically related channels, we designed a new attention module called the CGAM.
Similar to the SGAM, the CGAM uses group convolution to generate query-specific attention weights. Taking the SGAM output Es as input, a 1 × 1 group convolution with parameter matrix W5 produces the pairwise feature maps Fc_1 and Fc_2, where Fc_1, Fc_2 ∈ ℝ^(C′×H×W). We then resize Fc_1 and Fc_2 to ℝ^(C′×N) and compute the channel affinity matrix Ac ∈ ℝ^(1×H×W), which is used to produce the feature map with channel cues, Ec. Finally, the output feature representation Fout ∈ ℝ^(C×H×W) is obtained by applying a 1 × 1 convolution with parameter matrix W6 to Ec and adding a shortcut connection.
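A companion sketch of CGAM and the assembled GSBlock is given below, reusing the SGAM class from the previous sketch; the Gram-style channel affinity, the group fusion, and the shortcut placement are assumptions consistent with the described shapes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CGAM(nn.Module):
    # Sketch of the channel group self-attention module: a 1 x 1 group convolution (W5)
    # splits Es into two groups, and a channel affinity aggregates channel-wise
    # information. The exact affinity and fusion are assumptions.
    def __init__(self, channels):
        super().__init__()
        self.group = nn.Conv2d(channels, channels, kernel_size=1, groups=2)   # W5

    def forward(self, e_s):
        b, c, h, w = e_s.shape
        f_c1, f_c2 = torch.chunk(self.group(e_s), 2, dim=1)          # each: B x C/2 x H x W
        q = f_c1.view(b, c // 2, -1)                                 # B x C' x N
        k = f_c2.view(b, c // 2, -1)
        a_c = F.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)     # assumed C' x C' channel affinity
        e_c = torch.bmm(a_c, k).view(b, c // 2, h, w)                # aggregate channel-wise information
        return torch.cat([f_c1 + e_c, f_c2], dim=1)                  # restore C channels (assumed fusion)

class GSBlock(nn.Module):
    # SGAM followed by CGAM (the sequential spatial-channel order favored in the ablation),
    # a final 1 x 1 convolution (W6), and a shortcut back to the block input.
    def __init__(self, channels):
        super().__init__()
        self.sgam = SGAM(channels)   # SGAM class from the previous sketch
        self.cgam = CGAM(channels)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)   # W6

    def forward(self, f_s):
        return self.proj(self.cgam(self.sgam(f_s))) + f_s          # shortcut operation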
3.3. Hierarchical Feature Pyramid Network
The feature pyramid network (FPN) improves the accuracy of object detection algorithms by fusing multiscale features. As shown in Figure 5, YOLOv4-Tiny first generates feature maps at various stages {C2, C3, C4, C5}. The FPN then obtains P5 from C5 through a 1 × 1 convolution operation and uses top-down upsampling and horizontal connection operations to generate the fused feature P4. However, P4 fails to effectively utilize low-level feature information, leading to low accuracy in detecting small targets. To solve this problem, we propose a new feature fusion network. Beyond previous works, we argue that maximum pooling gathers important clues about distinct object features, while average pooling helps mitigate the loss of feature information as the network deepens. Thus, instead of using maximum pooling alone, as in the FPN, we apply average pooling and maximum pooling simultaneously during downsampling to extract semantic features. Based on the multiscale prediction, we fuse high-level semantic features with low-level features in a hierarchical manner to fully use the multiscale feature maps. As demonstrated in Figure 6, we concatenate the fourfold downsampling result of C2 with the twofold downsampling result of C3 to generate an efficient feature descriptor and then perform channel reduction through a 1 × 1 convolution operation to obtain the feature map M2. We concatenate the upsampling result of C5 with C4 and then reduce the feature dimension to generate the feature map M4. Finally, we perform channel reduction based on M2 and M4 to obtain the feature maps {P4, P5} with detailed information. This structure combines low-resolution but semantically rich features with high-resolution but semantically weaker features through a top-down path and hierarchical feature fusion. Without significantly increasing the computational complexity, this method further enriches the semantic information of the feature maps.
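The sketch below expresses this fusion in PyTorch, assuming C2-C5 halve in spatial resolution from stage to stage; the summation of the two pooling results, the output channel widths, and the final P4/P5 wiring are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def down(x, stride):
    # Downsample with average pooling and max pooling applied simultaneously;
    # summing the two pooled results is an assumption (concatenation is another reading).
    return F.avg_pool2d(x, stride, stride) + F.max_pool2d(x, stride, stride)

class HFPN(nn.Module):
    # Sketch of the hierarchical feature pyramid network: C2 (4x down) and C3 (2x down)
    # are concatenated and reduced to M2; C5 (2x up) and C4 are concatenated and reduced
    # to M4; P4 and P5 are then derived from M2 and M4 (assumed wiring).
    def __init__(self, c2, c3, c4, c5, out_ch=256):
        super().__init__()
        self.reduce_m2 = nn.Conv2d(c2 + c3, out_ch, kernel_size=1)
        self.reduce_m4 = nn.Conv2d(c4 + c5, out_ch, kernel_size=1)
        self.make_p4 = nn.Conv2d(2 * out_ch, out_ch, kernel_size=1)
        self.make_p5 = nn.Conv2d(out_ch, out_ch, kernel_size=1)

    def forward(self, c2, c3, c4, c5):
        m2 = self.reduce_m2(torch.cat([down(c2, 4), down(c3, 2)], dim=1))                 # low-level detail
        m4 = self.reduce_m4(torch.cat([F.interpolate(c5, scale_factor=2.0), c4], dim=1))  # high-level semantics
        p4 = self.make_p4(torch.cat([m2, m4], dim=1))   # hierarchical fusion at the 26 x 26 scale
        p5 = self.make_p5(down(p4, 2))                  # 13 x 13 scale for the second head (assumed)
        return p4, p5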
3.4. Loss Function
The multitask loss function used to train the model is the sum of a coordinate loss, a confidence loss, and a classification loss, which are defined as follows.
CIOU (Complete-IOU) [50] is used to describe the coordinate loss. CIOU considers three critical geometric measures, i.e., the overlap area, the central point distance, and the aspect ratio, which make the predicted box regression stable. In this paper, the CIOU loss used to measure the quality of the predicted box is defined as

L_CIOU = 1 − IOU + ρ²(B, B^gt)/c² + αv,

where IOU is the overlap ratio between the predicted box B and the ground-truth box B^gt, ρ(·) is the Euclidean distance between their center points, c is the diagonal length of the minimum enclosing rectangle of the predicted and ground-truth boxes, and the term αv monitors the aspect ratio of the predicted box.
BCE (binary cross-entropy loss) is used to describe the confidence loss, which accumulates the binary cross-entropy between the predicted and ground-truth object confidences over all S² cells and the B predicted boxes per cell, where the indicators 1_ij^obj and 1_ij^noobj denote whether or not the j-th anchor in the i-th cell is responsible for predicting an object, Ĉ_i^j is the object confidence of the ground-truth box, and C_i^j is the object confidence of the predicted box. The binary cross-entropy loss l(C_i^j, Ĉ_i^j) is defined as

l(C_i^j, Ĉ_i^j) = −[Ĉ_i^j log C_i^j + (1 − Ĉ_i^j) log(1 − C_i^j)].
CE (cross-entropy loss) is used to describe the classification loss: for each cell responsible for an object, the cross-entropy between the ground-truth class distribution and the predicted class probabilities is accumulated over all classes, where p̂_i(c) is the ground-truth probability of class c and p_i(c) is the predicted probability of class c.
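As a reference implementation of the coordinate term, the following PyTorch sketch computes the CIOU loss for corner-format boxes; treating the three loss terms as equally weighted and handling confidence and classification with the standard torch.nn.BCEWithLogitsLoss and torch.nn.CrossEntropyLoss are assumptions.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # CIOU loss: 1 - IoU + rho^2(B, B_gt) / c^2 + alpha * v, for (x1, y1, x2, y2) boxes of shape (N, 4).
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 and squared diagonal c^2 of the minimum enclosing box
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term alpha * v
    wp = pred[:, 2] - pred[:, 0]; hp = pred[:, 3] - pred[:, 1]
    wt = target[:, 2] - target[:, 0]; ht = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()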
5. Discussion
To evaluate the performance of the proposed network compared to that of other YOLO methods, Mini-YOLOv4 is compared with YOLOv3, YOLOv3-Tiny, YOLOv4-CSP, and YOLOv4-Tiny using the PASCAL VOC dataset introduced in Section 4.
As shown in Table 3, YOLOv4-CSP achieves the best detection accuracy but requires the largest number of parameters and has a slow detection speed. In contrast, Mini-YOLOv4 requires fewer parameters and computations than YOLOv4-CSP and obtains detection results close to those of YOLOv3. Mini-YOLOv4 achieves an mAP of 79.0%, which is 17.8% and 3.2% higher than those of YOLOv3-Tiny and YOLOv4-Tiny, respectively. These results imply that Mini-YOLOv4 provides higher classification accuracy for multicategory objects. Compared with YOLOv4-CSP, whose BFLOPs value is 29.9, the BFLOPs value of Mini-YOLOv4 is approximately 10 times smaller. In terms of parameter size, Mini-YOLOv4 requires only 4.7 M parameters, roughly 15 times fewer than YOLOv3 and YOLOv4-CSP. In addition, compared to YOLOv4-Tiny, Mini-YOLOv4 requires 20.3% fewer parameters. The FPS of our network is 172.6, which is 3.1 times that of YOLOv4-CSP. Compared to YOLOv4-Tiny, the real-time performance of Mini-YOLOv4 is reduced by 16.1%.
Figure 7 shows qualitative examples of object detection results for the different detection models on the PASCAL VOC 2007 test dataset. Each rectangular bounding box shown in the images contains the predicted category and the confidence score. Our detection model clearly finds objects of interest more comprehensively and locates them more accurately than YOLOv4-Tiny. This is because MFBlock and GSBlock in the proposed network increase the amount of contextual information by expanding the receptive field, and the hierarchical feature fusion in the H-FPN allows the network to accurately locate small and medium-sized targets. Specifically, MFBlock replaces the last 3 × 3 convolutional layer, focuses the network's attention on regions of interest, and captures important features through the attention mechanism, enhancing the expressive ability of the features. GSBlock contributes a spatial group attention module and a channel group attention module that compute semantic correlations in the spatial and channel domains to capture rich global information and obtain comprehensive feature representations. Finally, the improved H-FPN fuses high-level and low-level semantic features, compensating for the feature information lost during downsampling and enhancing target detection capability. The experimental results on PASCAL VOC show that Mini-YOLOv4 has excellent feature extraction capability while effectively reducing the number of model parameters and calculations.
To further evaluate the superiority of Mini-YOLOv4 over other lightweight object detection architectures, we compare it with MobileNetv1-YOLOv4, MobileNetv2-YOLOv4, MobileNetv3-YOLOv4, ShuffleNetv1-YOLOv4, and ShuffleNetv2-YOLOv4. All models are trained on the PASCAL VOC 2007 + 2012 training dataset and tested on the PASCAL VOC 2007 test dataset.
Table 3 summarizes the comparative results for these models. Compared to the other lightweight YOLOv4 methods, Mini-YOLOv4 uses only two feature map scales for prediction, so it requires fewer parameters and calculations and provides a faster detection speed. PPYOLO-Tiny uses MobileNetv3 as the backbone network and benefits from depthwise separable convolution in place of traditional convolution operations and from a post-quantization strategy; its FPS and BFLOPs results are excellent. However, this pursuit of extreme speed reduces its feature extraction ability, and its mAP is 2.8% lower than that of Mini-YOLOv4. It is worth noting that YOLOv5s, which is also based on a CSPDarknet-Tiny backbone, uses three prediction heads for detection. Compared to our proposed Mini-YOLOv4, which uses two prediction heads, YOLOv5s requires more inference time and post-processing time to generate the final detection results; therefore, it has almost half the FPS and double the BFLOPs of the model proposed in this paper.
As shown in Table 4, MobileNetv3-YOLOv4 outperforms Mini-YOLOv4 by 1.4% mAP owing to its more complex feature extraction network, but Mini-YOLOv4 achieves better results in terms of BFLOPs, FPS, and parameter size. Compared with MobileNetv2-YOLOv4, Mini-YOLOv4 has similar accuracy, but its BFLOPs value and parameter size are reduced by 38.0% and 64.7%, respectively. For real-time performance, MobileNetv1-YOLOv4 achieves 77.6 FPS, ShuffleNetv1-YOLOv4 achieves 89.6 FPS, and ShuffleNetv2-YOLOv4 achieves 120.3 FPS; compared to these three models, Mini-YOLOv4 improves the real-time performance by 122.4%, 92.6%, and 43.5%, respectively. In terms of parameter size, Mini-YOLOv4 requires approximately 3 times fewer parameters than the other models, with 57.3% and 54.4% fewer parameters than ShuffleNetv1-YOLOv4 and ShuffleNetv2-YOLOv4, respectively. Although we use group convolution to reduce the amount of calculation, as in the ShuffleNet series, our model combines it with the attention mechanism to reduce the model size while improving feature extraction capability. According to the experimental results, Mini-YOLOv4 achieves an excellent trade-off between accuracy and speed.
We compare Mini-YOLOv4 with state-of-the-art object detection networks on the MS COCO test-dev dataset. As shown in Table 5, most object detectors achieve excellent real-time performance and high detection accuracy. Notably, our proposed model enlarges the receptive field and enhances the feature extraction capability of the network, yielding significant improvements of 10.2% and 3.5% in overall detection accuracy relative to YOLOv3-Tiny and YOLOv4-Tiny, respectively. Mini-YOLOv4 facilitates the interaction of low-level feature maps with high-level feature maps and enriches the semantic information at different scales. Compared to YOLOv4-Tiny, Mini-YOLOv4 achieves notable increases of 4.9% and 1.4% in detecting medium and small objects, respectively. ASFF provides the highest overall detection accuracy of 38.1 but achieves poor real-time performance at an input size of 320 × 320; in comparison, our method is 4.2 times faster with only slightly lower detection accuracy. At an input size of 512 × 512, YOLOv4-CSP provides the best detection results and achieves a real-time detection speed; in contrast, our method, which is more accurate than YOLOv4-Tiny, is 3.1 times faster than YOLOv4-CSP. Compared to PPYOLO-Tiny, Mini-YOLOv4 achieves improvements of 3.7%, 2.9%, and 3.4% in overall detection accuracy at the three resolutions, respectively. Although YOLOX-Tiny outperforms Mini-YOLOv4 in detection accuracy, the rapid inference of Mini-YOLOv4 allows it to handle more deep learning tasks than YOLOX-Tiny. These results emphasize the robust performance and versatility of Mini-YOLOv4 as an efficient object detection model across resolution settings and real-time applications.
To illustrate why the proposed model improves detection accuracy, we use Grad-CAM [66] as an attention extraction tool to visualize YOLOv3-Tiny, YOLOv4-Tiny, and Mini-YOLOv4 on the MS COCO test dataset. In Figure 8, we select output feature maps with sizes of 13 × 13 and 26 × 26 for the comparison of attention maps. At the 13 × 13 size, the attention map generated by Mini-YOLOv4 locates large target objects more accurately and does not include much of the background area. At the 26 × 26 size, Mini-YOLOv4 effectively confines attention to the semantic area of small and medium targets. Clearly, our proposed network locates foreground objects more accurately than the other networks, regardless of their size and shape. This is because MFBlock enhances the expressive ability of the features through the attention mechanism and feature fusion, providing a solid foundation for subsequent feature extraction. In addition, before the final convolution operation generates the feature vectors, the spatial group self-attention module and the channel group self-attention module model feature correlations in the spatial and channel domains; this attention design effectively suppresses redundant information and captures rich global information to enhance the feature representation. Finally, the proposed H-FPN fuses high-level and low-level semantic features for prediction, further enhancing the interaction of low-level semantic information and compensating for the feature information lost during downsampling. These results indicate that the proposed model further enhances the detection accuracy for small and medium targets.
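For reproducibility of the visualization, a minimal Grad-CAM sketch based on standard PyTorch hooks is shown below; the choice of target layer and of the scalar score to backpropagate (e.g., the objectness score of one prediction) is left to the user and is an assumption, not the exact protocol of [66].

import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    # Capture the target layer's activations and gradients, then form the class activation map
    # as the ReLU of the gradient-weighted sum of activation channels.
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = score_fn(model(image))                               # scalar score to explain
        model.zero_grad()
        score.backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)          # channel-wise importance
        cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True)) # weighted activation map
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)    # normalize to [0, 1]
    finally:
        h1.remove(); h2.remove()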
To address the problems of poor model compatibility across frameworks and slow model execution, model inference deployment frameworks have emerged; they unify the model format and greatly accelerate inference. TensorRT is one of the mainstream deployment frameworks and performs well on NVIDIA Jetson series devices. Our TensorRT deployment adopts a two-stage process. In the first stage, the PyTorch model is converted into an FP32-precision ONNX model file using the PyTorch API. In the second stage, the ONNX file is parsed through the TensorRT API, a TensorRT engine with FP16 precision is built, and the final deployment is completed. Compared with deploying the original model directly, this process solves the compatibility problem of the PyTorch model on the Jetson device and reduces the model precision from FP32 to FP16, accelerating inference. Introducing the deployment framework decouples model training from deployment and optimizes runtime performance, supporting wide use in practical application scenarios.
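A minimal sketch of this two-stage pipeline is given below; the 416 × 416 input size, the tensor names, and the use of the trtexec tool (which wraps the TensorRT API) for the second stage are assumptions for illustration.

import torch

def export_fp32_onnx(model: torch.nn.Module, path: str = "mini_yolov4.onnx") -> None:
    # Stage 1: convert the trained PyTorch model to an FP32 ONNX file via the PyTorch API.
    model.eval()
    dummy = torch.randn(1, 3, 416, 416)
    torch.onnx.export(model, dummy, path, opset_version=11,
                      input_names=["images"], output_names=["preds"])

# Stage 2 (run on the Jetson Nano): parse the ONNX file and build an FP16 TensorRT engine,
# e.g., with the trtexec tool shipped with TensorRT:
#   trtexec --onnx=mini_yolov4.onnx --fp16 --saveEngine=mini_yolov4.engine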
To verify the detection speed of the proposed model on embedded platforms, we use the above method to compare Mini-YOLOv4 with YOLOv4-CSP and YOLOv4-Tiny on a Jetson Nano device [53]. We consider the Jetson Nano to be a mainstream embedded device and therefore a reasonable platform for speed measurement. The comparison results are shown in Table 6. Limited by the computing power of the embedded platform, YOLOv4-CSP, despite its high detection accuracy, achieves a maximum detection speed of only 3.7 FPS on the Jetson Nano. The proposed Mini-YOLOv4 achieves detection speeds of 13.7 FPS, 16.4 FPS, and 25.6 FPS for the three input sizes considered. Because we modify the original YOLOv4-Tiny network structure to improve detection accuracy, the detection speed of Mini-YOLOv4 is slightly lower than that of YOLOv4-Tiny, but it still reaches real-time levels. The experiments show that Mini-YOLOv4 achieves real-time performance on the embedded device while greatly improving detection accuracy compared with YOLOv4-Tiny. This finding suggests that Mini-YOLOv4 is well suited for target detection on embedded devices.
The performance of these three algorithms is also measured in terms of resource usage, such as GPU and memory usage, on the Jetson Nano. We monitor the device for 60 min: the machine is idle for the first 30 min (only background tasks are running), and the detection task is executed during the last 30 min.
Figure 9 shows the comparison of memory usage. Notably, the initial memory usage of the Jetson Nano is approximately 32%. At about 30 min, the memory usage of YOLOv4-CSP jumps to 88%, that of YOLOv4-Tiny to 51%, and that of Mini-YOLOv4 to 45%. This difference is mainly due to the small number of parameters in the proposed network, which is extremely lightweight. As shown in Figure 10, when the detection models start to work, the GPU usage of YOLOv4-CSP far exceeds that of the other two models, remaining at around 88%. The proposed model has a GPU usage similar to that of YOLOv4-Tiny because the number of model calculations is reduced in our approach while the detection accuracy is greatly enhanced.
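The resource curves can be reproduced with a simple logger around the Jetson's tegrastats utility, as in the sketch below; the RAM and GR3D_FREQ field names follow the usual tegrastats output format, which may vary slightly across JetPack versions, so the parsing here is an assumption.

import re
import subprocess

def monitor_jetson(duration_s=3600, out_path="usage.csv"):
    # Log memory and GPU utilization once per second from tegrastats output.
    proc = subprocess.Popen(["tegrastats", "--interval", "1000"],
                            stdout=subprocess.PIPE, text=True)
    with open(out_path, "w") as f:
        f.write("used_mb,total_mb,gpu_percent\n")
        for _ in range(duration_s):
            line = proc.stdout.readline()
            ram = re.search(r"RAM (\d+)/(\d+)MB", line)
            gpu = re.search(r"GR3D_FREQ (\d+)%", line)
            if ram and gpu:
                f.write(f"{ram.group(1)},{ram.group(2)},{gpu.group(1)}\n")
    proc.terminate()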
Ablation Study
To verify the impact of the proposed modules on the final performance, we performed ablation experiments by gradually adding MFBlock, GSBlock, and H-FPN to the baseline YOLOv4-Tiny.
Table 7 shows the detection performance of each model on the MS COCO test-dev dataset. Under IOU = 0.5 and IOU from 0.50 to 0.95, the mAP values improve by 1.8% and 1.3%, respectively, after adding MFBlock. GSBlock makes a significant contribution to detection accuracy by effectively mining contextual information: SGAM improves overall performance by 0.9%, CGAM improves it by 1.2%, and when the two modules are jointly integrated into the baseline, the overall detection accuracy improves by 1.8%. On top of MFBlock and GSBlock, the H-FPN further improves the detection performance for small and medium targets by 0.5% and 0.6%, respectively. Evidently, the H-FPN enriches the semantic information by effectively fusing multiscale features. As described in Section 3.1, we draw on the idea of squeeze and excitation to achieve attention calibration, which has proven useful for improving feature representation. Here, we study how the channel reduction rate in the CAM affects detection performance. In Table 8, we conduct experiments with a series of channel reduction rates. A smaller ratio slightly increases the number of parameters of the model but does not bring a performance improvement. As the reduction rate increases, the mAP value first gradually increases and then decreases. In particular, when the channel reduction rate is set to 8, the mAP reaches 79.0%, achieving a superior balance between accuracy and complexity. In addition, we find that taking the feature representations before the cross-channel interaction as a residual by adding an identity connection, as shown in Figure 3, further improves the detection accuracy; when this extra identity connection is removed, the detection accuracy of the model drops by 0.3%.
To demonstrate the advantages of the proposed CAM over other powerful attention modules, we investigate the performance of various attention modules, including SENet, CBAM, and SKNet. In the experiments, we replaced the CAM in our network with SENet, CBAM, and SKNet, respectively, to verify its efficiency. The BFLOPs, FPS, parameter size, and mAP metrics on the PASCAL VOC dataset are shown in Table 9. Compared with SENet, CAM attends to both channel and spatial attention information, which improves the mAP by 0.3%. CBAM calibrates the feature distribution by applying soft attention in a sequential channel-spatial structure, but it does not perform well on this dataset: the mAP of Mini-YOLOv4-CBAM is 78.1%, which is 0.9% lower than that of Mini-YOLOv4-CAM. SKNet obtains features with different receptive fields through convolution kernels of different sizes and aggregates information from multiple paths to obtain a global and comprehensive representation. In terms of BFLOPs, FPS, and parameter size, Mini-YOLOv4-CAM clearly outperforms Mini-YOLOv4-SKNet. These results demonstrate that CAM has powerful feature extraction capability and excellent computational efficiency.
To further explore the impact of the network structure of the SGAM and CGAM on detection accuracy, we investigate different combinations of the two modules to achieve optimal performance. We study three combinations: parallel with fusion (Mini-YOLOv4-S//C), sequential spatial-channel (Mini-YOLOv4-SC), and sequential channel-spatial (Mini-YOLOv4-CS). Table 10 shows that Mini-YOLOv4-SC achieves the best performance, with an overall detection accuracy 0.9% and 0.3% higher than that of Mini-YOLOv4-S//C and Mini-YOLOv4-CS, respectively, on MS COCO. Under IOU = 0.5, Mini-YOLOv4-SC gains 1.4% and 0.3% in accuracy over the other two combinations. The sequential architecture first generates refined features through the first attention module and then lets the second module learn attention information from them, which facilitates optimization.
In addition to the above ablation experiments, we assess the influence of different downsampling methods in the H-FPN; a comparison of the results is given in Table 11. We resize the feature maps of different layers in the network to the same scale by downsampling and then perform a fusion operation to obtain a comprehensive feature representation. In our experiments, we use different combinations of average pooling and maximum pooling for downsampling and observe the corresponding performance differences. As shown in Table 11, GAP + GMP provides the best detection results on the AP, AP50, and AP75 metrics, significantly outperforming GAP or GMP alone. This is potentially because using both pooling methods together not only extracts key features but also establishes connections among locations within the pooling window, effectively capturing local contextual information.