1. Introduction
In recent years, with the rapid development of driverless cars, vehicles have been equipped with visual perception systems composed of cameras and embedded devices. The visual perception function primarily comprises scene analysis and environment perception, of which scene analysis plays the dominant role. Scene analysis involves three research tasks: object localization, image classification, and semantic segmentation. Visual object detection is the basic functional module in autonomous vehicle scene analysis and is also the module that researchers are keen to study. The safety of autonomous vehicles largely depends on the detection of obstacles ahead. However, complex and changeable traffic scenes pose great challenges to object detection algorithms, primarily including (1) the influence of bad weather, such as haze, and of illumination on detection accuracy; (2) how to maintain high-precision detection performance under real-time constraints; and (3) the limitations of chip computing power and memory. When the object detection algorithm fails to extract complete feature information in time, category errors occur in object recognition, and fog or haze seriously impairs the recognition of traffic road information. Object detection algorithms based on deep learning, which have been popular in recent years, are the primary means of addressing these problems.
Object detection algorithms based on deep learning can be divided into four schemes: (1) algorithms based on region proposals, typified by R-CNN [1] and Faster R-CNN [2]; (2) algorithms based on regression, typified by YOLO [3] and SSD [4]; (3) algorithms based on search, such as AttentionNet [5]; and (4) anchor-free algorithms, such as CornerNet [6], CenterNet [7], and FCOS [8]. These four schemes have laid the foundation for subsequent research. Zou [9] used convolutional neural networks to extract high-level features from sub-images and weight local features, and used recurrent neural networks to classify the feature sequences in the appropriate order. Wang [10] replaced all BN layers with SN layers and used ResNet-101 [11] as the backbone network to improve the CSP pedestrian detector [12]. Song et al. [13] detected pedestrians robustly in various environments based on deep learning and multi-spectral feature fusion. All of the above studies improved the performance of object detection algorithms to varying degrees. These optimization strategies show excellent detection performance on roads in good weather conditions, but they suffer from serious missed and false detections in heavy-haze conditions. To better detect pedestrians in a heavy-haze road environment, more reasonable detection strategies are needed.
To address de-fogging in foggy environments, Fan et al. [14] proposed a new image de-fogging algorithm based on sparse-domain training: it learns fog-free sparse features in good weather, reconstructs the foggy image, and uses nonlinear stretching to enhance the brightness of the image. However, its feature extraction is weak, which leads to incomplete local defogging and missing color features in foggy pictures. Yang et al. [15] proposed a depth-perception method that predicts depth images with a generative adversarial network and then supplies depth features for fog removal within the network framework, which greatly helps remove haze from the environment; its disadvantage is that distant fog in the image is removed poorly. These studies made some improvements to the fog-removal problem, but further optimization is still needed. Han [16] proposed an improved RCNN object detection model for detection in foggy environments: by optimizing the feature pyramid, shallow feature information is merged into the deep network to improve feature extraction, and an optimization module is added to enhance robustness. This model captures rich background information, yet missed and false detections remain when detecting foggy images. Cao et al. [17] enhanced the loss function of the SSD model and adopted a weighted mask for network training, which improved the anti-interference ability of the detector in bad weather, but its detection efficiency is unsatisfactory. Qin et al. [18] proposed the lightweight ThunderNet detector, designing a relatively efficient RPN and detection head and adding a spatial attention module and an enhancement module to improve the feature representation of the image; ThunderNet improves detection accuracy but at reduced detection efficiency compared with single-stage detectors. All of the above studies improve the performance of object detection algorithms to varying degrees, but the balance between detection efficiency and accuracy still needs improvement. Otherwise, when a pedestrian in the traffic environment is obscured by haze, the object detection algorithm cannot extract complete feature information effectively and quickly, resulting in missed or false detections and, eventually, traffic accidents.
In order to solve these problems, four improvement measures were developed in this experiment to enable driverless cars to detect pedestrians on foggy roads in real time. First, the dark channel defogging algorithm, which has a good defogging effect, is adopted for defogging. Second, the large-scale, diverse, and robust BDD100K [19] dataset is selected to enhance pedestrian detection in urban traffic scenes; a good dataset is particularly important for the learning of the detection model, and compared with other pedestrian datasets, BDD100K [19] is more suitable for learning urban traffic scenes. Third, a detection head was added to YOLOv7 so that it fits more object sizes, improving the detection of pedestrian objects. Fourth, to reduce the computing resources consumed by the algorithm, a reasonable pruning strategy was developed so that the algorithm meets the hardware constraints of vehicle chips. During training, the network learns the weights of its layers without supervision, so not all learned weights are equally important, and some network modules are redundant or less important. On the premise of keeping the detection accuracy unchanged, the detection speed can be improved by pruning these modules. Based on this idea, and combined with the research work of Liu [20], Ren [21], Wang [22], and Ye [23], a combined pruning strategy was proposed: sparse L1 regularization is applied to the channel scaling factors to increase channel-level sparsity, and channels with smaller scaling factors are trimmed. In addition, layer pruning is added on top of channel pruning: the convolutional layers are evaluated, and the layers with the lowest mean scaling factor are trimmed, resulting in a "refined" detector. The pruned network is then fine-tuned to exploit the full potential of the pruned model. Compared with the original YOLOv7 model, the trainable parameters of the YOLOv7+-87% model are reduced by 97.66%, the model space is reduced by 96.36%, and the inference speed is 423.30% faster.
3. YOLO-GW
This experiment performs model pruning based on YOLOv7; the optimal network YOLOv7* can be obtained through the process shown in Figure 2.
There is a big difference between the YOLOv7 algorithm and current mainstream object detection algorithms. In addition to optimizing the network architecture, YOLOv7 also adds several optimization methods to the network training process. The optimization of network architecture and training mode enables YOLOv7 to improve detection accuracy without increasing the computational cost. The optimized network structure of YOLOv7, including the optimization of the backbone part, is shown in Figure 3, and the modules of the network framework are decomposed in Figure 4:
(a) The CBS module is composed of a convolutional layer, a BN layer, and a SiLU activation function. CBS modules come in three colors; K denotes the convolution kernel and S the step size, and the three colors denote CBS modules with different step sizes and convolution kernels.
(b) The CBM module is composed of a convolutional layer, a BN layer, and a sigmoid function; the convolution kernel is 1 × 1 and the step size is 1.
(c) The REP module comes in two forms. The training form is composed of three branches: a 3 × 3 convolution, a 1 × 1 convolution, and a third branch without a convolution operation. The inference form contains a 3 × 3 convolution with a step size of 1.
(d) The MP-1 module has two branches and is primarily used for down-sampling. The first branch passes through MaxPool and then a 1 × 1 convolution; the second branch passes through a 1 × 1 convolution and then a 3 × 3 convolution with a step size of 2; the results of the two branches are then added.
(e) The ELAN module has two branches: the first passes through a 1 × 1 convolution; the second passes through a 1 × 1 convolution followed by four 3 × 3 convolutions; finally, the four features are superimposed together, as sketched in the code below.
(f) The ELAN-W module differs from the ELAN module in (e) in that it selects five outputs for superposition.
(g) The up-sample module uses nearest-neighbor interpolation as the up-sampling method.
(h) The MP-2 module has the same architecture as the MP-1 module in (d); the difference is the step size of the convolutional block.
(i) The SPPCSPC module has two branches: the first applies the SPP module's processing, and the second applies routine processing. Finally, the two parts are combined, halving the amount of calculation, so the speed becomes faster and the accuracy improves.
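To make the module descriptions concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the CBS block in (a) and the ELAN block in (e); the channel widths and the choice of which intermediate 3 × 3 outputs are superimposed are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Module (a): convolution + batch normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ELAN(nn.Module):
    """Module (e): a 1x1 branch and a 1x1 branch followed by four 3x3
    convolutions; four features are superimposed (concatenated) and fused."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.branch1 = CBS(c_in, c_mid, k=1)
        self.branch2 = CBS(c_in, c_mid, k=1)
        self.convs = nn.ModuleList(CBS(c_mid, c_mid, k=3) for _ in range(4))
        self.fuse = CBS(4 * c_mid, c_out, k=1)

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x)]
        y = feats[1]
        for i, conv in enumerate(self.convs):
            y = conv(y)
            if i % 2 == 1:            # assumed: tap every second 3x3 output
                feats.append(y)
        return self.fuse(torch.cat(feats, dim=1))
```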
This paper adds an ECA module [36] on the basis of the YOLOv7 algorithm to optimize its performance. ECANet is a partial optimization of the SENet module that proposes a local cross-channel interaction strategy without dimensionality reduction, known as the ECA module. This module adds few parameters but effectively improves the performance of the object detection algorithm. In addition, to enhance the detection performance of YOLOv7, the three detection heads were expanded to four. The detection heads of YOLOv7 are primarily used for object classification and regression, drawing enhanced features from the backbone and FPN networks. The additional detection head not only strengthens object classification and regression but also improves effective feature extraction.
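As a reference for how little machinery the ECA module requires, below is a minimal PyTorch sketch following the published ECA-Net design [36]; note that in the original work the 1D kernel size is chosen adaptively from the channel count, whereas here it is fixed at k_size=3 for brevity.

```python
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel reweighting via a 1D convolution
    over globally pooled features, with no dimensionality reduction."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                     # x: (B, C, H, W)
        y = self.pool(x)                      # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)     # (B, 1, C): channels as a sequence
        y = self.conv(y)                      # local cross-channel interaction
        y = y.transpose(1, 2).unsqueeze(-1)   # back to (B, C, 1, 1)
        return x * self.sigmoid(y)            # reweight the input channels
```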
Basic training: Set the training parameters of the main network and select the appropriate dataset for basic training.
Sparse training: The purpose of sparse training [37,38] is to facilitate the pruning of channels and network layers by the pruning strategy in the later stage. In the experiment, a scaling factor is implanted in each channel for the convenience of channel pruning, and its absolute value is used to express the importance of the channel. In detail, the BN layer after each convolutional layer outside the detection head is used to improve the generalization ability and achieve fast convergence. The BN layer uses mini-batch statistics to normalize internal activations, and the transformation of the BN layer is as follows:
$$z_{\text{out}} = \gamma \cdot \frac{z_{\text{in}} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} + \beta \quad (1)$$

where $z_{\text{in}}$ and $z_{\text{out}}$ represent the input and output features of the BN layer, respectively; $\mu_{\mathcal{B}}$ is the mean value of the mini-batch input; $\sigma_{\mathcal{B}}^{2}$ represents the variance of the mini-batch input; and $\gamma$ and $\beta$ represent the trainable scale and shift.
The trainable scaling factor $\gamma$ in the BN layer is used to measure the importance of the network channel. In the experiment, the scaling factor is multiplied by the output of the channel so that the training weights and scaling factors are combined. After L1 regularization [39,40] is applied to $\gamma$, channel sparse training is started to distinguish the unimportant channels from the important ones. According to Formula (2), the objective of sparse training is given as
$$L = \sum_{(x,y)} l\bigl(f(x, W), y\bigr) + \lambda \sum_{\gamma \in \Gamma} g(\gamma) \quad (2)$$

where $\lambda$ is used to balance $\sum_{(x,y)} l(f(x,W),y)$ (representing the loss of normal training in the network) and $\sum_{\gamma \in \Gamma} g(\gamma)$ (the penalty on the scaling factors introduced by sparsity), and $g(\gamma) = |\gamma|$ is the L1 regularization. The subgradient method is adopted to optimize the non-smooth L1 penalty term.
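Since this follows the network-slimming idea of [20], the sparse-training step of Formula (2) can be sketched in PyTorch as below (an illustrative sketch, not the authors' code): the subgradient of the L1 penalty is added to the gradients of the BN scaling factors after the normal backward pass, and `sparsity_lambda` is an assumed value.

```python
import torch.nn as nn

def add_l1_subgradient(model: nn.Module, sparsity_lambda: float = 1e-4):
    """After loss.backward(), add the subgradient of the sparsity penalty
    lambda * sum(|gamma|) from Formula (2) to every BN scaling factor.
    Per the paper, BN layers inside the detection head would be excluded;
    that filtering is omitted here for brevity."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            gamma = module.weight             # the channel scaling factors
            if gamma.grad is not None:
                # d/d(gamma) of lambda * |gamma| is lambda * sign(gamma)
                gamma.grad.add_(sparsity_lambda * gamma.data.sign())

# Usage inside the training loop (sketch):
#   loss = detection_loss(model(images), targets)
#   loss.backward()
#   add_l1_subgradient(model, sparsity_lambda=1e-4)
#   optimizer.step()
```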
Develop pruning strategy: The optimal pruning strategy combines layer pruning and channel pruning. After sparse training is completed, a global threshold is introduced to limit the degree of pruning, and to reduce destructive pruning of the network, a local safety threshold is adopted to prevent over-pruning of the network model. First, the pruning rate is adjusted by setting the global threshold $\hat{\theta}$ as a percentile of all scaling factors $\gamma$. To avoid over-pruning, a local safety threshold $\theta$ is set over the scaling factors $\gamma$ of each layer requiring pruning. When the scaling factor of a channel is less than the minimum of $\hat{\theta}$ and $\theta$, the channel can be pruned. If the scaling factors of an entire layer fall below the threshold, the channels with the largest scaling factors in that layer are kept so as to avoid pruning the whole layer away. In addition, layer pruning is integrated on top of channel pruning: the convolutional layers are evaluated, the mean scaling factor of each layer is ranked, and the layers corresponding to the smallest means are cut out. A sketch of this thresholding appears below.
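The combined global/local thresholding can be sketched as follows, assuming percentile-based thresholds; the paper does not give the exact percentile values, so `prune_rate` and `local_keep` are illustrative placeholders.

```python
import torch
import torch.nn as nn

def build_channel_masks(model: nn.Module, prune_rate: float = 0.87,
                        local_keep: float = 0.01):
    """Keep a channel iff its |gamma| >= min(global, local) threshold.
    prune_rate sets the global threshold as a quantile of all |gamma|;
    local_keep is an assumed per-layer fraction of channels always kept."""
    all_gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
    global_thr = torch.quantile(all_gammas, prune_rate)

    masks = {}
    for name, m in model.named_modules():
        if not isinstance(m, nn.BatchNorm2d):
            continue
        gamma = m.weight.data.abs()
        # Local safety threshold: protect the layer's top channels.
        n_keep = max(1, int(local_keep * gamma.numel()))
        local_thr = gamma.topk(n_keep).values.min()
        thr = torch.minimum(global_thr, local_thr)
        mask = gamma.ge(thr)
        if not mask.any():                # never remove an entire layer
            mask[gamma.argmax()] = True
        masks[name] = mask
    # Layer pruning would additionally rank layers by gamma.mean() and
    # drop those with the smallest means (not shown here).
    return masks
```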
Fine-tuning the pruned network: After pruning, the model's performance declines temporarily. Fine-tuning [41] is adopted to restore the potential performance of the pruned model to the optimal level.
Evaluation of detection performance: The optimized new network is evaluated according to the evaluation indexes to determine whether it achieves the optimal detection performance. If it does, pruning stops, yielding the optimal model YOLOv7*. If it does not, pruning is performed again; during re-pruning, over-pruning must be prevented, because an over-pruned model cannot be repaired. This loop is sketched below.
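The overall prune–fine-tune–evaluate loop can be summarized by the following sketch, where `prune`, `finetune`, and `evaluate_map` are hypothetical helpers standing in for the steps above.

```python
def iterative_pruning(model, prune_rates=(0.50, 0.86, 0.87, 0.88),
                      map_tolerance=0.01):
    """Try increasing pruning rates and stop before accuracy degrades,
    since an over-pruned model cannot be repaired.
    prune / finetune / evaluate_map are hypothetical helpers."""
    best_model, best_map = model, evaluate_map(model)
    for rate in prune_rates:
        candidate = finetune(prune(model, rate))   # prune, then fine-tune
        cand_map = evaluate_map(candidate)
        if cand_map + map_tolerance < best_map:    # over-pruned: stop here
            break
        best_model, best_map = candidate, cand_map
    return best_model                              # e.g., YOLOv7*
```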
The de-fogging strategy for the traffic environment: The dark channel de-fogging algorithm is selected in this experiment. Its advantages are a good restoration effect and simple application scenarios; its disadvantage is low operating efficiency. To solve this problem, an optimization strategy for the defogging algorithm is formulated. As shown in Figure 5, we first apply Gaussian smoothing to the original foggy image and then down-sample it 4 times. The down-sampled image is then de-fogged in the dark channel [42]; at this point, the processed image is 1/256 of the original image area, so the de-fogging efficiency is effectively improved. The de-fogged image is then up-sampled, and Gaussian smoothing is applied again after up-sampling. The final image meets our resolution requirement and is input into the network for object recognition. YOLO-GW is an optimization algorithm that combines YOLOv7 network-architecture optimization, combined pruning, and the de-fogging strategy.
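The Figure 5 pipeline can be sketched with OpenCV and NumPy as follows (an illustrative sketch: the patch size, omega, t0, and Gaussian kernel are typical dark-channel-prior settings in the spirit of [42], not values given in the paper).

```python
import cv2
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel minimum over the color channels, then a local min filter."""
    min_rgb = img.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def defog_dark_channel(img, omega=0.95, t0=0.1, patch=15):
    """Basic dark channel prior de-fogging, kept minimal for illustration
    (no transmission refinement)."""
    img = img.astype(np.float64) / 255.0
    dark = dark_channel(img, patch)
    # Atmospheric light: mean color of the brightest 0.1% dark-channel pixels.
    n = max(1, int(dark.size * 0.001))
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
    A = img[idx].mean(axis=0)
    # Transmission estimate and scene radiance recovery.
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.clip(t, t0, 1.0)[..., None]
    J = (img - A) / t + A
    return np.clip(J * 255.0, 0, 255).astype(np.uint8)

def fast_defog(frame, levels=4):
    """Figure 5 pipeline: Gaussian smoothing, down-sample `levels` times
    (1/256 of the area for levels=4), dark channel de-fogging, up-sample,
    and Gaussian smoothing again."""
    h, w = frame.shape[:2]
    small = cv2.GaussianBlur(frame, (5, 5), 0)
    for _ in range(levels):
        small = cv2.pyrDown(small)          # halve each dimension
    small = defog_dark_channel(small)
    for _ in range(levels):
        small = cv2.pyrUp(small)            # restore the resolution
    out = cv2.resize(small, (w, h))         # guard against rounding drift
    return cv2.GaussianBlur(out, (5, 5), 0)
```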
5. Experimental Results and Analysis
In Table 2, Table 3 and Table 4, 640 × 640 and 864 × 864 respectively denote the size of the network input images set during model training. The data are the evaluation indexes obtained from training on the BDD100K pedestrian dataset; the performance of the basic model and the optimized models is evaluated through analysis of these data. In YOLOv7-N, N is the pruning rate, taking the values 50%, 86%, 87%, and 88%.
In this experiment, YOLOv7 was selected as the basic network, but the performance of the original network could not meet the requirements of this experiment. After several tests, the ECA module was added to the YOLOv7 algorithm for optimization; this efficient channel attention module effectively improves the performance of YOLOv7 while adding only a few parameters. To further improve performance, we changed the three-detection-head network structure of YOLOv7 into a four-detection-head structure and carried out large-scale 864 × 864 training to further strengthen object classification and regression as well as improve effective feature extraction. As can be seen from Figure 7, mAP was significantly improved by the addition of ECA, the extra detection head, and large-scale network training (the advantage of setting the training image size to 864 × 864 follows the principle that the higher the image resolution, the larger the objects appear and the simpler the feature extraction). Finally, these three optimization methods were added to the YOLOv7 network at the same time; as Figure 7 shows, mAP was further improved. As can be seen from Table 2, these methods improve the network accuracy but also reduce the inference speed of the network to some extent: YOLOv7 drops from 65 frames per second to 22 frames per second. A detection speed of 22 frames per second, combined with the defogging strategy, cannot detect pedestrians in traffic scenarios quickly enough. Therefore, we integrated a pruning strategy on the basis of the YOLOv7 network optimization.
Analysis of test effect: Table 3 lists several representative model evaluation indexes of YOLOv7 and YOLOv7+. Among them, YOLOv7+ is the algorithm obtained through the optimization of the ECA module, the extra detection head, and large-scale network training. There are six pruned models, namely YOLOv7*-86%, YOLOv7*-87%, YOLOv7*-88%, YOLOv7+-86%, YOLOv7+-87%, and YOLOv7+-88%, which represent the changing pruning performance indexes of YOLOv7 and YOLOv7+, respectively. As can be seen from Figure 8, the parameters of YOLOv7 and YOLOv7+ change during pruning: with the increase in the pruning rate, the inference speed of YOLOv7 and YOLOv7+ continuously increases, while the other parameters all decrease at a certain rate. The ultimate purpose of this experiment is to improve the inference speed of the detection algorithm as far as possible on the premise of improving the accuracy of object detection. It can be seen from Figure 8 that the mAP performance of YOLOv7+-86% and YOLOv7+-87% is optimal. The FPS of the YOLOv7+-87% model is 14 frames per second higher than that of the YOLOv7+-86% model while the other evaluation indicators are essentially the same. Therefore, YOLOv7+-87% was selected as the final optimized model. Compared with the original YOLOv7 model, for which the network input image is 640 × 640, the trainable parameters of the YOLOv7+-87% model are reduced by 97.66%, the model space is reduced by 96.36%, and the inference speed is increased by 423.30%. The mAP of the optimized YOLOv7+-87% is increased by 9% compared with the original mAP of YOLOv7. The comparison of the above data confirms the effectiveness of the optimization strategy.
The performance of YOLOv7+-87% is compared with that of existing lightweight algorithms: Table 4 lists the most representative lightweight network performance indicators. These algorithms were all trained with the original authors' default network configurations in the same environment. By comparing the evaluation indexes of YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, and YOLOv7+-87% in Figure 9, an appropriate basic network model is selected. As can be seen from Figure 9, the YOLOv7+-87% network model shows the best performance on the evaluation indicators Recall, F1-score, and mAP compared with YOLOv5-N, YOLOv6-N, and YOLOv7-Tiny. The parameters and volume of the YOLOv7+-87% network model are also the smallest, saving space for deployment on hardware devices. Its inference speed is likewise outstanding compared with YOLOv5-N, YOLOv6-N, and YOLOv7-Tiny. According to the comparison in Figure 9, the evaluation indexes of the YOLOv7+-87% network model trained on the BDD100K dataset are generally better than those of the other network models. Therefore, YOLOv7+-87% was selected as the object detection model of this experiment.
In order to improve the detection of objects on foggy traffic roads, this experiment selects the dark channel de-fogging algorithm to process the foggy environment. Its advantages are a good restoration effect and simple application scenarios; its disadvantage is low operating efficiency. To better meet the real-time requirements of traffic roads, we first apply Gaussian smoothing to the original foggy picture and then down-sample it 4 times. Dark channel defogging is then carried out on the down-sampled image, which effectively improves the de-fogging efficiency. The de-fogged image is then up-sampled, and Gaussian smoothing is performed again after up-sampling. The final image meets our resolution requirement. It is then combined with the YOLOv7+-87% lightweight object detection model to meet the real-time detection requirements of foggy traffic roads.
In order to verify the effectiveness of the fog detection algorithm, several strong object detection algorithms, namely YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, YOLOv7, YOLOv7+-87%, and YOLO-GW, were selected for visual comparison in this experiment. Among them, YOLO-GW adds the de-fogging strategy on the basis of the YOLOv7+-87% algorithm to improve pedestrian detection in a foggy traffic environment (the YOLO-GW model applies the defogging strategy, while the other comparison models do not apply it during detection). The de-fogging strategy does not participate in model training; it only takes effect during image detection. As shown in Table 5, the YOLOv7+-87% and YOLO-GW detection algorithms have the same network parameters and volume. Compared with the parameters and volume of the original YOLOv7 network, the trainable parameters of the YOLOv7+-87% model are reduced by 97.66% and the model space is reduced by 96.36%, making it possible to deploy the model on a chip. The experiment was conducted under the same hardware environment configuration. The YOLO-GW algorithm in the table is the YOLOv7+-87% algorithm augmented with the de-fogging strategy. As shown in Figure 10, each frame is input into the network model for classification after the de-fogging strategy is applied, and the detection accuracy of the model does not change. Due to the addition of the de-fogging strategy, the FPS of the YOLOv7+-87% algorithm is reduced by 240, but it shows an excellent detection effect in a foggy environment. Compared with the YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, and YOLOv7 detection algorithms, the mAP of YOLO-GW shows the best performance.
In order to further verify the effectiveness of the YOLO-GW object detection algorithm, images of an original sunny day, a mildly foggy day, a moderately foggy day, and a severely foggy day were selected for visual detection and verification. The verified images are video data taken from actual road conditions, and the original images are from sunny days. To compare the detection effect of YOLO-GW with that of YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, and YOLOv7 under mild, moderate, and severe fog, the effects of mild, moderate, and severe fog were synthetically added to the selected sunny-day pictures. These data were used to perform visual detection and comparison of YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, YOLOv7, and YOLO-GW, as shown in the figures below.
Visualization effect analysis: As can be seen from the detection of clear-day images in Figure 11, Figure 12 and Figure 13, although YOLOv5-N, YOLOv6-N, and YOLOv7-Tiny have fast inference speeds, their visual detection performance is not ideal, and missed detections are relatively serious. With the increase in fog concentration, false detections appear, and when the fog is heavy, YOLOv5-N, YOLOv6-N, and YOLOv7-Tiny cannot detect pedestrians in the environment at all. As can be seen from Figure 14, YOLOv7 has a better object detection effect than YOLOv5-N, YOLOv6-N, and YOLOv7-Tiny; however, as the fog thickens, YOLOv7's detection effect gradually weakens, and only nearby pedestrians can be detected in thick fog. As can be seen from Figure 15, compared with YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, and YOLOv7, YOLO-GW shows a better detection effect in the visual verification of images from sunny, mildly foggy, moderately foggy, and severely foggy days. Therefore, YOLO-GW is more suitable than the YOLOv5-N, YOLOv6-N, YOLOv7-Tiny, and YOLOv7 detection algorithms for pedestrian detection in a foggy environment.