1. Introduction
AVs encounter several difficulties under adverse weather conditions, such as snow, fog, haze, shadow, and rain [1,2,3,4,5,6,7,8]. AVs may suffer from poor decision making and control if their perception systems are degraded by adverse weather. Fog forms when water vapor condenses into fine droplets suspended near the ground, obscuring the view of the surrounding area. Fog can make driving unsafe because it reduces visibility. Under foggy conditions, the signal-to-noise ratio (SNR) is reduced, while measurement noise rises dramatically. Excessively noisy sensor data can lead to unsafe behavior and road accidents.
The machine vision range in fog can fall to around 1000 m in moderate fog and as low as 50 m in heavy fog [9,10]. Camera sensors are among the most significant sensors used for object detection because of their low cost and the rich features they provide [11]. In fog, the camera's performance is limited by visibility degradation, and the quality of the captured image can be substantially distorted. Lidar undergoes reflectivity degradation and a reduction in measured distance in fog. Radars, however, tend to perform better than cameras and lidars in adverse weather, since they are largely unaffected by changes in environmental conditions [11,12]. Radars determine the distance and velocity of objects by monitoring the reflection of radio waves and exploiting the Doppler effect. With respect to object classification, radars fall short: they can only detect objects, not classify them, because radar detections are far too sparse [13,14]. The sparse nature of radar point clouds collected with many vehicular radars (usually 64 points or fewer) explains this limitation [15].
However, a significant amount of research on imaging radar has been conducted over the past several years [16,17,18,19], resulting in coherent images with centimeter-scale resolution. Manzoni et al. [16] presented detailed research on motion estimation and compensation methods for vehicular multiple-input–multiple-output synthetic aperture radar (MIMO SAR) systems. The authors discuss the difficulties caused by the natural motion of the vehicle, which can result in visual abnormalities and distortions, and developed an innovative motion compensation approach based on estimating the platform's motion characteristics and compensating for them in the SAR processing. The findings demonstrate an improvement in image quality, as well as enhanced perception functionality, emphasizing the potential of MIMO SAR for autonomous vehicles. Tebaldini et al. [17] addressed the potential of vehicular synthetic aperture radar (SAR) imaging, as well as the obstacles it faces in urban contexts. The authors investigated the distinctive features of SAR, such as its capacity to function despite unfavorable weather conditions and its ability to see through vegetation, which make it appropriate for use in urban settings. Solutions to a variety of problems, such as the high computing requirements of SAR processing and the need for effective data-collection methodologies, were provided. The findings highlight the importance of further research and development to realize the full potential of SAR imaging on autonomous vehicles. Wu and Zwick [18] discussed the use of synthetic aperture radar (SAR) in vehicular systems for detecting parking lots. The authors suggested a technique for locating and categorizing parking lots that makes use of the interferometric features of SAR. The suggested method achieves a high level of accuracy in detecting parking lot boundaries thanks to the coherent summation of radar echoes obtained from numerous passes. This research demonstrates the potential of SAR technology to aid autonomous vehicles in traversing complicated environments such as parking lots. Iqbal et al. [19] discussed the fundamental principles of radar imaging, including range–Doppler processing and synthetic aperture radar. The most important difficulties, such as reducing interference and improving resolution, were also investigated. The study emphasized the significance of multi-channel radar systems, sophisticated signal-processing techniques, and efficient data fusion algorithms for the successful use of imaging radar in self-driving automobiles. Imaging radar technology has significant potential to enhance the perception and decision-making capabilities of autonomous vehicles. Notwithstanding its advantages, several obstacles, including computational demands, data collection techniques, and interference suppression, need to be overcome to fully capitalize on its potential. Further research and development in this area are required to advance the state of the art and realize the vision of safe and efficient self-driving cars.
AVs are often outfitted with numerous sensors that provide complementary information, which, when combined, helps to attain the necessary accuracy. Multi-sensor fusion combines data from numerous sensors to achieve a higher object detection/classification accuracy and performance than those obtained with a single-sensor modal system [11]. Therefore, the combination of radar and other sensors, such as cameras, is an essential subject for AVs. Radar–camera fusion systems can offer useful depth information for all observed objects in an autonomous driving scene. Radar sensors produce detections of nearby objects for subsequent use, while the bounding boxes obtained from the camera data with deep-learning-based object detection methods can be used to verify and validate the prior radar detections [14].
There have been significant contributions to object detection and classification using deep learning. In addition to AV technology, object detection has found application in other fields, including surveillance and security [20], medicine [21], robotics [22,23], the military [24], etc. As outlined in [25], a deep convolutional neural network (CNN) was first utilized for large-scale image classification in 2012. However, with respect to vehicular radars, it is not uncommon for part of the observations to be incomplete, distorted, or of poor quality. Beam obstruction, instrument malfunction, blind spots, close-to-the-ground mounting, inclement weather such as fog, and many other factors contribute to these problems. Images obtained with a camera contain color and feature information, and this feature information can be used for label classification in an object detection task. Fog can drastically distort the feature information of an image due to atmospheric scattering and attenuation. These radar and camera problems usually lead to inaccuracies in the real-time detection of an object's bounding box or location in an image, especially when the object is not nearby or is too small, under medium and heavy fog conditions. Thus, the application of single-sensor modal CNN-based object detection algorithms to such distorted data has proven inefficient [1,2].
YOLOv5 [26], a state-of-the-art object detection algorithm, is affected by mis-detections and false positives due to the atmospheric scattering caused by fog particles. Existing deep-learning-based object detection techniques that exhibit a high degree of accuracy tend to have a slow detection speed in foggy weather conditions, while several methods achieve fast detection speeds at the expense of accuracy. Therefore, the lack of balance between detection speed and accuracy in foggy weather applications persists. The uniqueness of radar signals and the scarcity of publicly available datasets [27] containing both camera and radar data [28,29,30,31,32,33] under foggy weather conditions have limited AV research in this area. Very few datasets that include camera and radar information under foggy weather conditions, such as those described in [31], are available for AV research. To accommodate the needs of AVs with respect to the previously mentioned problems of environmental perception in fog, we make the following contributions:
This paper’s remaining sections are structured as follows: we discuss related works in Section 2, we present our methodology in Section 3, we present and discuss our results in Section 4, and Section 5 consists of the conclusion.
3. Method
3.1. Sensor Calibration and Coordinate Transformation
Measurement errors grow with distance, since the radar and the camera are often mounted at different locations on the ego vehicle. As a result, the shared observation region between the camera and the radar requires a joint calibration effort. The vehicle's motion defines a local right-hand coordinate system in the ego vehicle coordinate system. This local coordinate system moves with the vehicle as it travels through the environment. The x-axis indicates the direction of motion, whereas the y-axis is parallel to the front axle, which serves as the origin. The camera models employ three-dimensional coordinates; the camera coordinate system $(x_c, y_c, z_c)$ has its origin at the camera's viewpoint. Images acquired with camera sensors employ an image coordinate frame $(x, y)$ and a pixel coordinate frame $(u, v)$ as reference points for composition. For radar detection, the coordinate system of choice is the polar coordinate system: detected objects are referenced by the azimuth $\theta$ and the distance $r$ between the object $P$ and the sensor's origin, so a target may be recorded as an $(r, \theta)$ vector. By measuring the distance of point $P$ and its azimuth from the radar, we can estimate where $P$ is in the world coordinate system [59,64].
The camera observations and radar detections can be associated using a shared world coordinate system given by $(x_w, y_w, z_w)$. The camera calibration parameters can be used to project the radar detections onto the camera coordinate system and the image plane given by $(u, v)$. The calibration parameters of the camera can be broken down into two matrices: intrinsic and extrinsic. The intrinsic parameter matrix is given as [59,64]:

$$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $f_x = f/d_x$ and $f_y = f/d_y$, such that $f$ represents the focal length of the camera; $d_x$ and $d_y$ represent the physical dimensions of an individual pixel in the x- and y-axis directions, respectively; $f_x$ and $f_y$ represent the scale factors on the $u$ and $v$ axes; and $u_0$ and $v_0$ represent the central point offsets of the camera.
The extrinsic camera parameters can be expressed as:

$$\begin{bmatrix} R & T \end{bmatrix}$$

where $R$ represents the rotation parameter matrix and $T$ is the translation parameter matrix used for mapping a radar detection point to the projection point $P$ coordinates on the image plane. Thus, the radar detections may be mapped to their equivalent visual representations. After the mapping, detections that fall outside the image frame are disregarded to ensure accuracy. The coordinate mapping from the world coordinate system to the image plane of the image coordinate system is as follows:

$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

where $u$ and $v$ represent the projection point $P$ coordinates on the image plane.
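As a concrete illustration of this projection pipeline, the following minimal NumPy sketch converts a radar detection given in polar form into world coordinates and then into pixel coordinates using the intrinsic and extrinsic matrices above; the numerical values of $K$, $R$, and $T$ are placeholders for illustration, not the calibration of the actual sensor setup used in this work.

```python
import numpy as np

def radar_to_pixel(r, azimuth, K, R, T):
    """Project a radar detection (range r [m], azimuth [rad]) onto the image plane.

    Assumes the detection lies in the z_w = 0 plane and that K (3x3 intrinsics),
    R (3x3 rotation), and T (3-vector translation) come from an offline
    camera-radar calibration.
    """
    # Polar radar measurement -> Cartesian world coordinates (homogeneous).
    p_world = np.array([r * np.cos(azimuth), r * np.sin(azimuth), 0.0, 1.0])

    # World -> camera coordinates via the extrinsic matrix [R | T].
    extrinsic = np.hstack([R, T.reshape(3, 1)])        # 3x4
    p_cam = extrinsic @ p_world                        # (x_c, y_c, z_c)

    # Camera -> pixel coordinates via the intrinsic matrix K.
    uv_h = K @ p_cam
    u, v = uv_h[0] / uv_h[2], uv_h[1] / uv_h[2]
    return u, v, p_cam[2]                              # keep depth z_c for later fusion

# Placeholder calibration values for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
# Rotation mapping a world frame (x forward, y left, z up) to a camera frame
# (x right, y down, z forward); the camera and radar share the same origin here.
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
T = np.zeros(3)

u, v, depth = radar_to_pixel(r=20.0, azimuth=np.deg2rad(5.0), K=K, R=R, T=T)
print(u, v, depth)   # detections projecting outside the image frame would be discarded
```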
3.2. Radar Detection Model
Millimeter-wave radar detects objects by transmitting electromagnetic radio-frequency waves in a certain direction and then analyzing the signals reflected from the environment. The target's range and velocity can be determined by measuring the echoes' time lag and phase change, and the target's azimuth can be obtained using directional antennas or phase comparison methods [64]. In a linear-frequency-modulated continuous-wave radar waveform, the distance between the radar and the target delays the echo signal by the propagation time of the electromagnetic wave, which produces a range frequency shift $f_r$; for moving targets, a Doppler frequency shift $f_d$ is also produced. Mixing the transmitted and echo signals yields two beat frequencies, $f_{b+}$ and $f_{b-}$, at the rising and falling edges of the frequency sweep, respectively. The following equations can be used to determine the range $R$ and velocity $v$ of a target:

$$R = \frac{cT}{8B}\left(f_{b+} + f_{b-}\right), \qquad v = \frac{c}{4 f_0}\left(f_{b-} - f_{b+}\right)$$

where $T$ is the period of the frequency modulation, $B$ is the modulation bandwidth, $f_0$ is the center frequency of the transmitted waveform, $c$ is the speed of light, $f_r = \left(f_{b+} + f_{b-}\right)/2$, and $f_d = \left(f_{b-} - f_{b+}\right)/2$.
The phase comparison approach is utilized to estimate the azimuth. The echo signals arriving at two receiving antennas travel slightly different distances, and as a result, they exhibit a phase difference that corresponds to that path difference. The target's azimuth $\theta$ is determined using the following equation:

$$\theta = \arcsin\left(\frac{\lambda \, \Delta\varphi}{2\pi d}\right)$$

where $\lambda$ is the wavelength, $\Delta\varphi$ is the phase difference due to the target echo signal's propagation delay, and $d$ is the distance between the receiving antennas.
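The following short sketch evaluates the triangular LFMCW relations given above for an assumed 77 GHz automotive radar; the waveform parameters, beat frequencies, and phase difference are illustrative values, not measurements from the sensor used in this work.

```python
import numpy as np

C = 3e8          # speed of light [m/s]
F0 = 77e9        # assumed center frequency [Hz]
T = 1e-3         # assumed modulation period [s]
B = 150e6        # assumed sweep bandwidth [Hz]

def range_velocity(f_b_up, f_b_down):
    """Range and radial velocity from the up- and down-sweep beat frequencies."""
    f_r = 0.5 * (f_b_up + f_b_down)     # range-induced frequency shift
    f_d = 0.5 * (f_b_down - f_b_up)     # Doppler frequency shift
    rng = C * T * f_r / (4.0 * B)
    vel = C * f_d / (2.0 * F0)
    return rng, vel

def azimuth(delta_phi, d, lam=C / F0):
    """Azimuth from the phase difference between two receive antennas."""
    return np.arcsin(lam * delta_phi / (2.0 * np.pi * d))

rng, vel = range_velocity(f_b_up=38e3, f_b_down=42e3)
theta = azimuth(delta_phi=0.6, d=0.5 * C / F0)   # half-wavelength antenna spacing
print(f"range = {rng:.1f} m, velocity = {vel:.2f} m/s, azimuth = {np.degrees(theta):.1f} deg")
```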
3.3. Fog Imaging Model
Physical atmospheric scattering models are shown in Figure 2. The physical atmospheric scattering model comprises the attenuation factor, the transmission model, and the airlight model. Atmospheric scattering reduces the amount of light available for imaging under foggy conditions, so the target image's object textures and edge features may be diminished. In foggy weather, attenuation and interference occur before the reflected light reaches the camera. In the airlight model, light rays are scattered before they reach the imaging camera; instead of consisting only of scene light from the object in the photograph, the transmitted light includes fog components that obscure the image.
An image model proposed by Koschmieder [65] has frequently been used in the scientific literature [1]:

$$I(x) = J(x)\,t(x) + A\left(1 - t(x)\right)$$

where $I(x)$ denotes the picture captured by the camera, $J(x)$ indicates the scene radiance image, $t(x)$ denotes the transmission map, and $A$ denotes the airlight vector, which is homogeneous for each pixel in the image. The attenuation factor is represented by $J(x)\,t(x)$, while the atmospheric components are represented by $A\left(1 - t(x)\right)$. The unknown parameters of a hazy single input picture $I(x)$ are $J(x)$, $t(x)$, and $A$. To acquire the restored (recovered) picture $J(x)$, the amount of ambient light $A$ and the transmission $t(x)$ can be determined and used in the following equation:

$$J(x) = \frac{I(x) - A}{t(x)} + A$$
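As an illustration of how this model can be used both to synthesize fog on clear images and to invert it when $A$ and $t$ are known, the following NumPy sketch applies the two equations above; the airlight and transmission values are assumed for demonstration only.

```python
import numpy as np

def add_fog(clear_img, transmission, airlight=0.9):
    """Synthesize a foggy image I = J*t + A*(1 - t).

    clear_img:    HxWx3 float array in [0, 1] (scene radiance J)
    transmission: HxW float array in (0, 1]  (per-pixel t)
    airlight:     assumed global airlight A
    """
    t = transmission[..., None]                       # broadcast over color channels
    return clear_img * t + airlight * (1.0 - t)

def recover(foggy_img, transmission, airlight=0.9, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A."""
    t = np.maximum(transmission, t_min)[..., None]    # avoid division by near-zero t
    return np.clip((foggy_img - airlight) / t + airlight, 0.0, 1.0)

# Toy example: a constant-depth scene with transmission 0.4 everywhere.
J = np.random.rand(4, 4, 3)
t = np.full((4, 4), 0.4)
I = add_fog(J, t)
J_hat = recover(I, t)
print(np.allclose(J, J_hat, atol=1e-6))   # True when A and t are known exactly
```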
According to Narasimhan et al. [66], the visual imaging model of a foggy scenario can be regarded as the result of concatenating the attenuation and interference models, as shown in Figure 2. As a result of both attenuation and interference, fog can seriously degrade the quality of the image captured by a machine. The theoretical model of the visual imaging of a foggy scenario can also be represented as follows [2]:

$$E(d, \lambda) = E_0(\lambda)\, e^{-\beta(\lambda) d} + E_\infty(\lambda)\left(1 - e^{-\beta(\lambda) d}\right)$$

where $E_0(\lambda)\, e^{-\beta(\lambda) d}$ represents the attenuation model; $E_\infty(\lambda)\left(1 - e^{-\beta(\lambda) d}\right)$ represents the interference model; $\lambda$ is the wavelength of the light waves; $\beta(\lambda)$ is the atmospheric scattering coefficient, which measures the light's capacity to disperse per unit volume; $d$ is the depth of the scene; $E_0(\lambda)$ indicates the intensity of the target obstacle's light as it is scattered through the atmosphere and reaches the camera; and $E_\infty(\lambda)$ denotes the intensity of the ambient atmospheric light (airlight).
As mentioned earlier, the scattering of incoming light by airborne particles in the atmosphere reduces the intensity of the light that ultimately reaches the camera [11]. We consider the relationship between the scene depth $d$ and the transmission $t$, as well as the image degradation caused by the attenuation of visibility. Consider an observer (imaging camera) at distance $d(x)$ from a scene point at position $x$. The relationship between the transmission $t$ and the depth $d$ is expressed in the following equation [67]:

$$t(x) = e^{-\int_{0}^{d(x)} \beta(z)\, dz}$$

where $d(x)$ is the distance between the imaging camera and the scene point at $x$ and $\beta$ represents the atmospheric scattering coefficient. If the atmosphere exhibits homogeneous physical properties, the scattering coefficient $\beta$ is spatially constant. Therefore, the transmission relation can be rewritten as:

$$t(x) = e^{-\beta d(x)} \tag{11}$$

The transmission $t$ describes the unscattered portion of the light that reaches the camera. From Equation (11), we can express $d(x)$ as follows:

$$d(x) = -\frac{1}{\beta}\ln t(x) \tag{12}$$

Equation (12) implies that the depth can be calculated up to an unknown scale if the transmission can be estimated [67]. The visibility distance, measured in meters, is the maximum distance at which black and white objects lose their distinct contrast. As the distance increases in fog, a black and white object appears to the human eye as a uniform gray. The standard contrast threshold is therefore 5% [68].
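To make the relationship between scattering, transmission, and visibility concrete, the sketch below computes the transmission for a given depth and scattering coefficient, inverts it to recover depth, and derives the visibility distance from the 5% contrast threshold. The scattering coefficients are assumed values; 0.03 m⁻¹ and 0.12 m⁻¹ correspond to visibility distances of roughly 100 m and 25 m under this threshold, matching the two fog levels used in Figure 3.

```python
import numpy as np

def transmission(depth_m, beta):
    """t(x) = exp(-beta * d(x)) for a homogeneous atmosphere (Equation (11))."""
    return np.exp(-beta * depth_m)

def depth_from_transmission(t, beta):
    """d(x) = -ln(t(x)) / beta (Equation (12))."""
    return -np.log(t) / beta

def visibility_distance(beta, contrast_threshold=0.05):
    """Distance at which contrast falls to the 5% threshold: V = -ln(0.05)/beta ~ 3/beta."""
    return -np.log(contrast_threshold) / beta

for beta in (0.03, 0.12):                 # assumed moderate and heavy fog coefficients [1/m]
    t50 = transmission(50.0, beta)        # transmission of a scene point 50 m away
    print(f"beta={beta:.2f}: t(50 m)={t50:.3f}, "
          f"recovered depth={depth_from_transmission(t50, beta):.1f} m, "
          f"visibility={visibility_distance(beta):.0f} m")
```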
Figure 3a depicts clear and foggy images collected during a real-time autonomous driving simulation at visibility distances of 100 m and 25 m. Figure 3b illustrates the contrast between the grayscale distributions of the clear and foggy images at the 100 m and 25 m visibility distances. Converting an image to grayscale clearly reveals information about its colors and features, and this feature information can be extracted and used for classification purposes in an object detection task. As shown in Figure 3b, the grayscale values of the clear-day image range from around 0 to 250, whereas the grayscale values of the foggy images at the 100 m and 25 m visibility distances are highly concentrated between 30 and 210 and between 100 and 250, respectively. As a result, fog can negatively affect object detection because it drastically distorts the image's feature information [3]. Figure 3c shows a simulation of a real-time autonomous driving scene lasting 12 s in clear (no fog) and heavy fog conditions with a visibility distance of 25 m. Because sensor measurement noise tends to increase significantly in fog, the signal-to-noise ratio (SNR) decreases dramatically: Figure 3c shows a higher SNR in the no-fog scene and a much lower SNR in the heavy fog scene.
3.4. The Baseline YOLOv5 Model
YOLO is a cutting-edge, real-time object detection algorithm, and YOLOv5 [26] is built on earlier versions of the YOLO algorithm. YOLO is one of the most effective object detection methods available, yielding state-of-the-art results on datasets such as Microsoft COCO [69] and Pascal VOC [70].
The backbone, neck, and head are the three fundamental components of the baseline YOLOv5 network, as shown in Figure 4. The backbone extracts relevant feature data from the input images. The neck combines the extracted features to create feature maps on three different scales, which the head uses to detect objects in the image. The YOLOv5 backbone network is CSPDarknet, and the neck consists of the FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) structures.
- (i)
Backbone:
In YOLOv5, Darknet [43] was merged with a cross-stage partial network (CSPNet) [71], resulting in CSPDarknet. CSPDarknet is composed of convolutional neural networks that use numerous iterations of convolution and pooling to generate feature maps of varying sizes from the input image. As a solution to the issues caused by repeated gradient information in large-scale backbones, CSPNet incorporates the gradient changes into the feature map. Reducing the model's size, the number of parameters, and the floating-point operations per second thus guarantees fast and accurate inference. For an object detection task in fog, it is crucial to have a compact model size, a fast detection speed, and high accuracy. The backbone generates four distinct levels of feature maps with resolutions of 152 × 152, 76 × 76, 38 × 38, and 19 × 19 pixels.
The backbone focus module (Figure 5a) is used for slicing operations; its purpose is to improve feature extraction during downsampling. Convolution, batch normalization, and the leaky ReLU activation function (AF) are the sub-modules of the CBL module. YOLOv5 implements two distinct cross-stage partial (CSP) networks, as shown in Figure 5b. Each has a specific function: one is used in the neck of the network, and the other in the backbone. The CSP network uses cross-layer communication between the front and back layers to shrink the model size while preserving accuracy and increasing inference speed. The feature map of the base layer is divided into two distinct parts: the main component and a skip connection. These two parts are then joined through transition, concatenation, and transition to reduce the amount of duplicate gradient information as much as possible. The difference between the backbone and neck CSP networks is that the latter uses CBL modules instead of residual units.
Maximum pooling with varying kernel sizes is carried out by the Spatial Pyramid Pooling (SPP) module [72], as shown in Figure 5c. The pooled features are fused through concatenation. The SPP module performs dimensionality reduction to convey image features at a higher degree of abstraction. Pooling reduces the feature map's size and the network's computational cost while extracting the essential features.
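The following PyTorch sketch shows one common way such an SPP block is implemented, with parallel max-pooling branches whose outputs are concatenated; the 5/9/13 kernel sizes and channel widths are typical YOLO-style choices assumed here, not values taken from this paper.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling: parallel max-pool branches concatenated with the input."""

    def __init__(self, in_channels, out_channels, kernel_sizes=(5, 9, 13)):
        super().__init__()
        hidden = in_channels // 2
        self.reduce = nn.Conv2d(in_channels, hidden, kernel_size=1)   # channel reduction
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )
        self.fuse = nn.Conv2d(hidden * (len(kernel_sizes) + 1), out_channels, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        pooled = [x] + [pool(x) for pool in self.pools]   # same spatial size, varied receptive field
        return self.fuse(torch.cat(pooled, dim=1))        # fuse by concatenation

feat = torch.randn(1, 512, 19, 19)        # e.g., the 19x19 backbone feature map
print(SPP(512, 512)(feat).shape)          # torch.Size([1, 512, 19, 19])
```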
- (ii)
Neck:
The feature maps from each level are fused by the neck (FPN and PAN) network to learn more contextual information and reduce the amount of information lost in the process. The feature maps close to the image layer contain mainly low-level structures, which renders them ineffective on their own for precise object detection. The Feature Pyramid Network (FPN) was designed to extract features in a way that maximizes detection speed and accuracy: FPN uses a top-down mechanism to generate higher-resolution layers from strong semantic feature layers. The PAN architecture effectively transfers localization features through a bottom-up mechanism from lower to higher feature maps to improve the positional accuracy of objects in the image. Thus, feature maps are generated on three different scales in three feature fusion layers.
- (iii)
Detection Head:
The detection head consists of convolution blocks that take the three different scales of feature maps from the neck layer. Through convolution, the detection head yields three distinct sets of detections with resolution levels of 76 × 76, 38 × 38, and 19 × 19, respectively. As the feature map resolution decreases, every grid unit in a feature map corresponds to a larger portion of the original image. This implies that the feature maps can adequately detect both small and large objects.
3.5. Attention Mechanism
Numerous studies have discovered that a deep CNN degenerates when it reaches a particular depth [73]. Studies have also shown that a network's performance does not necessarily improve significantly with depth but that depth can substantially increase computational costs throughout the training phase [74]. The attention mechanism was therefore created to train networks to prioritize and devote more attention to relevant feature information while down-ranking that which is less relevant [75]. The attention mechanism tells CNNs where to focus and improves the representational power of the features, which helps with object detection tasks. The human eye provides proof that attention mechanisms are crucial for collecting relevant data [76]. This behavior has prompted several studies [76,77,78,79,80] aiming to improve the efficiency of convolutional neural networks in image classification problems by including an attention mechanism. In 2018, Woo et al. [78] proposed the Convolutional Block Attention Module (CBAM), which integrates spatial and channel attention into a single lightweight mechanism. A considerable performance boost can be achieved with ECA-Net [80], proposed by Wang et al. in 2020. ECA-Net is an efficient channel attention mechanism that can collect information regarding cross-channel relationships.
CBAM [78] was designed to capture both channel and spatial attention. Since the channels of feature maps are treated as feature detectors, the channel attention module focuses on the most important features in the input images. This makes the channel attention module an essential component for an image-processing task such as object detection in fog. Average pooling and max pooling are employed to aggregate the spatial information of the input feature map into average-pooled and max-pooled features. For an input feature map $F \in \mathbb{R}^{C \times H \times W}$, individual channel weights are estimated, where the number of channels is $C$ and the height and width of the feature map in pixels are $H$ and $W$, respectively. The weighted multiplication of channels is useful for drawing more attention to the primary channel features. A shared network (a multi-layer perceptron with one hidden layer) is applied to both the average-pooled and max-pooled feature descriptors. The element-wise summation of the output vectors of both descriptors then generates the channel attention weight map $M_c$ using Equation (13). The channel-refined feature maps are obtained through the element-wise multiplication of the original feature map and $M_c$:

$$M_c(F) = \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\!\left(\mathrm{MaxPool}(F)\right)\right) = \sigma\!\left(W_1\!\left(W_0\!\left(F_{avg}^{c}\right)\right) + W_1\!\left(W_0\!\left(F_{max}^{c}\right)\right)\right) \tag{13}$$

where $\sigma$ is the sigmoid activation function, $W_0$ and $W_1$ are the multi-layer perceptron weights, $F_{avg}^{c}$ denotes the average-pooled features, and $F_{max}^{c}$ denotes the max-pooled features.
Next, the spatial component uses the channel-refined features from the channel submodule to generate a 2D spatial attention map. The element-wise multiplication of the spatial attention weight map and the input channel attention feature map generates the final refined feature map of the attention mechanism [81]. The spatial attention module pays the most attention to the object's position in the image frame. This is achieved by combining the spatial features at each location using a weighted sum of spatial features. The overall refined features are obtained by multiplying the channel-refined features by the 2D spatial attention map. For a channel-refined feature map $F'$, the convolution of the concatenated average-pooled and max-pooled features using a $7 \times 7$ filter gives the spatial attention weight map $M_s$, as shown in Equation (14):

$$M_s(F') = \sigma\!\left(f^{7 \times 7}\!\left(\left[\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')\right]\right)\right) \tag{14}$$

where $\sigma$ is the sigmoid activation function and $f^{7 \times 7}$ is a convolution with a $7 \times 7$ filter size.
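For reference, a compact PyTorch sketch of a CBAM-style block implementing Equations (13) and (14) is shown below; the reduction ratio of 16 and the 7 × 7 spatial kernel are the defaults from the CBAM paper and are assumptions here rather than the exact settings used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Equation (13): shared MLP over average- and max-pooled channel descriptors."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                    # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                     # max-pooled descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)       # M_c(F)

class SpatialAttention(nn.Module):
    """Equation (14): 7x7 convolution over concatenated channel-wise avg/max maps."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # M_s(F')

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)        # channel-refined features
        return x * self.sa(x)     # spatially refined features

print(CBAM(256)(torch.randn(1, 256, 38, 38)).shape)   # torch.Size([1, 256, 38, 38])
```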
However, to lessen the number of parameters, CBAM uses dimensionality reduction to help manage the model's complexity. Nonlinear cross-channel relationships are captured throughout the dimensionality reduction process, but the dimensionality reduction can lead to an inaccurate capture of the interactions between channels. We adopted the ECA-Net approach [80] to solve this problem. ECA-Net uses global average pooling (GAP) to aggregate convolution features without reducing dimensionality. It increases the number of parameters only to a very modest degree while successfully gathering cross-channel interactions and gaining a substantial performance improvement. To model channel attention, the ECA module adaptively estimates the kernel size $k$, conducts a 1D convolution, and applies a sigmoid function $\sigma$. The kernel size $k$ can be adaptively determined as follows:

$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd}$$

where $\left|\cdot\right|_{odd}$ denotes the nearest odd number, the kernel size $k$ is determined by the mapping $\psi$, and the number of channels (channel dimension) is denoted as $C$. $\gamma$ is set to 2, and $b$ is set to 1.
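A minimal PyTorch sketch of an ECA-style channel attention module is given below, including the adaptive kernel-size rule above with γ = 2 and b = 1; the module structure follows the original ECA-Net formulation and is illustrative rather than the exact implementation used here.

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    """k = |log2(C)/gamma + b/gamma|, rounded to the nearest odd number."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

class ECA(nn.Module):
    """Efficient Channel Attention: GAP + 1D convolution across channels, no dimensionality reduction."""

    def __init__(self, channels):
        super().__init__()
        k = eca_kernel_size(channels)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                       # global average pooling -> (b, c)
        y = self.conv(y.unsqueeze(1)).squeeze(1)     # 1D conv across the channel axis
        w = torch.sigmoid(y).view(b, c, 1, 1)        # per-channel attention weights
        return x * w                                 # channel-refined feature map

print(eca_kernel_size(256))                                  # 5 for C = 256
print(ECA(256)(torch.randn(1, 256, 38, 38)).shape)           # torch.Size([1, 256, 38, 38])
```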
In this work, we combined ECA-Net and CBAM to achieve a powerful attention mechanism, as illustrated in Figure 6. We incorporated the combined ECA-Net/CBAM attention mechanism into the fusion layers of our proposed camera–radar fusion network (CR-YOLOnet), shown in Figure 7. The attention mechanism draws more attention to the relevant features and improves their representation, which aids object detection. We enhanced CR-YOLOnet with this attention framework to detect objects of multiple scales in foggy weather conditions. ECA-Net handles the channel submodule operations, while CBAM handles the spatial submodule operations. The ECA module is applied to the input feature maps: GAP is followed by a 1D convolution, which generates the updated channel weights. The channel-refined feature maps are produced through the element-wise multiplication of the input feature maps and the updated weights. The output of the ECA module is sent to CBAM's spatial attention module, which generates a 2D spatial attention map. The element-wise summation of the original input feature map and the 2D spatial attention output is performed to obtain a residual-like architecture. The ReLU activation function is applied to the aggregated feature map to generate the final feature map, which is sent to the detection head layer shown in Figure 7.
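Under the description above, one plausible, self-contained realization of the combined attention block used in the fusion layers is sketched below: ECA provides the channel attention, a CBAM-style spatial attention follows, and a residual connection with ReLU produces the output. The 7 × 7 spatial kernel and γ = 2, b = 1 follow the defaults discussed earlier; this is an illustrative reading of Figure 6, not the authors' exact code.

```python
import math
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """ECA channel attention followed by CBAM-style spatial attention,
    with a residual connection and ReLU, as used in the fusion layers."""

    def __init__(self, channels, gamma=2, b=1, spatial_kernel=7):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1
        self.channel_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, _, _ = x.shape
        # ECA channel attention: GAP -> 1D conv across channels -> sigmoid weights.
        w = torch.sigmoid(self.channel_conv(x.mean(dim=(2, 3)).unsqueeze(1)))
        refined = x * w.view(b, c, 1, 1)
        # CBAM spatial attention applied to the channel-refined features.
        s = torch.cat([refined.mean(dim=1, keepdim=True),
                       refined.amax(dim=1, keepdim=True)], dim=1)
        refined = refined * torch.sigmoid(self.spatial_conv(s))
        # Residual-like aggregation followed by ReLU.
        return self.act(x + refined)

fused = torch.randn(1, 256, 38, 38)               # e.g., a fusion-layer feature map
print(FusionAttention(256)(fused).shape)          # torch.Size([1, 256, 38, 38])
```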
3.6. Proposed Camera–Radar Fusion Network (CR-YOLOnet)
We present our proposed network, called CR-YOLOnet, in Figure 7: a deep learning multi-sensor fusion object detector based on the baseline YOLOv5 network. To develop CR-YOLOnet, we made several adjustments to the baseline YOLOv5 model. In contrast to the single-modal baseline YOLOv5, CR-YOLOnet takes its input from both camera and radar sources. CR-YOLOnet extracts feature maps with two CSPDarknet backbone networks, one each for the camera and radar inputs. The feature information from the backbone networks is sent to the feature fusion layers through two connections, illustrated as round-dotted lines. These connections, inspired by the concept of residual networks, improve the backpropagation of gradients in our network, prevent gradient fading, and minimize the loss of feature information for relatively small objects in fog.
As previously mentioned in Section 3.5, we included the combined ECA-Net/CBAM attention mechanism in the fusion layers of CR-YOLOnet. The purpose of the attention mechanism is to enhance the capacity of CR-YOLOnet to detect objects of multiple scales in medium and heavy fog weather conditions, especially small objects that are not nearby. The detection head is composed of convolution blocks and utilizes all three scales of the feature maps from the neck layer. The two-dimensional convolution allows the detection head to produce three unique sets of detections with resolution levels of 76 × 76, 38 × 38, and 19 × 19, respectively. The output depth is 12 because the number of object classes is 7, the confidence score adds 1, and the positional parameters number 4, for a total of 12.
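To make the dual-input design concrete, the following skeleton shows, at a high level, how two backbones, an attention-equipped fusion layer, and a detection head could be wired together. The module names, the use of simple concatenation for fusion, the single detection scale, and the placeholder backbone are assumptions for illustration, not the exact CR-YOLOnet implementation.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Placeholder stand-in for a CSPDarknet backbone producing one feature scale."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=8, padding=1),
            nn.BatchNorm2d(out_channels), nn.SiLU(),
        )

    def forward(self, x):
        return self.net(x)

class CameraRadarFusionDetector(nn.Module):
    def __init__(self, num_classes=7, channels=256):
        super().__init__()
        self.cam_backbone = TinyBackbone(3, channels)        # camera image branch
        self.rad_backbone = TinyBackbone(3, channels)        # projected radar image branch
        self.fusion = nn.Conv2d(2 * channels, channels, 1)   # fuse by concatenation + 1x1 conv
        self.attention = nn.Identity()   # the combined ECA/CBAM block from the previous sketch would go here
        self.head = nn.Conv2d(channels, num_classes + 1 + 4, 1)  # classes + confidence + box

    def forward(self, camera_img, radar_img):
        f = torch.cat([self.cam_backbone(camera_img), self.rad_backbone(radar_img)], dim=1)
        f = self.attention(self.fusion(f))
        return self.head(f)                                  # depth 12 when num_classes = 7

cam = torch.randn(1, 3, 608, 608)
rad = torch.randn(1, 3, 608, 608)
print(CameraRadarFusionDetector()(cam, rad).shape)           # torch.Size([1, 12, 76, 76])
```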
3.7. Loss Function
The loss function comprises three components: (i) the bounding box (position) loss, (ii) the confidence loss, and (iii) the classification loss. The bounding box loss is calculated when the intersection of the prediction box and the ground truth box exceeds the set threshold. The confidence and classification losses are calculated when the object center falls within the grid cell.
3.7.1. Bounding Box Loss Functions
We employed the complete intersection over union (CIoU) loss for bounding box regression [82]. The CIoU combines the following: (i) the overlap region between the predicted bounding box and the ground truth bounding box, (ii) the central point distance between the predicted bounding box and the ground truth bounding box, and (iii) the aspect ratios of the predicted and ground truth bounding boxes. Combining these three components improves the average precision (AP) and average recall (AR) for object detection while achieving faster convergence.
The CIoU loss function in Equation (16) builds on the distance intersection over union (DIoU) loss [82] by enforcing a penalty term $\alpha\nu$ for the box aspect ratio, where $\nu$ is given in Equation (17):

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha\nu \tag{16}$$

$$\nu = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{\nu}{\left(1 - IoU\right) + \nu} \tag{17}$$

where $\alpha$ is the weight function, a trade-off parameter that gives the overlap region factor a higher priority for regression, especially in non-overlapping cases; $\nu$ measures the consistency of the aspect ratios of the bounding boxes; $b$ and $b^{gt}$ are the central points of the predicted bounding box $B$ and the ground truth bounding box $B^{gt}$; $\rho(\cdot)$ is the Euclidean distance and $c$ is the diagonal length of the smallest box enclosing both bounding boxes; and the widths and heights of the predicted and ground truth bounding boxes are denoted as $w$ and $h$ and as $w^{gt}$ and $h^{gt}$, respectively.
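A straightforward PyTorch sketch of the CIoU loss in Equations (16) and (17) for axis-aligned boxes in (x1, y1, x2, y2) format is shown below; it is a didactic implementation rather than the exact code used for training.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Intersection and union areas for the IoU term.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance (rho^2) over the enclosing-box diagonal (c^2).
    cpx = (pred[:, 0] + pred[:, 2]) / 2; cpy = (pred[:, 1] + pred[:, 3]) / 2
    ctx = (target[:, 0] + target[:, 2]) / 2; cty = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio penalty term (Equation (17)).
    wp = pred[:, 2] - pred[:, 0]; hp = pred[:, 3] - pred[:, 1]
    wt = target[:, 2] - target[:, 0]; ht = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v          # Equation (16)

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 14.0, 48.0, 58.0]])
print(ciou_loss(pred, gt))
```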
3.7.2. Confidence Loss and Classification Loss Functions
The confidence loss function $L_{conf}$ is as follows:

$$L_{conf} = -\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\hat{C}_i \log C_i + \left(1 - \hat{C}_i\right)\log\left(1 - C_i\right)\right] - \lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left[\hat{C}_i \log C_i + \left(1 - \hat{C}_i\right)\log\left(1 - C_i\right)\right]$$

The classification loss function $L_{cls}$ is as follows:

$$L_{cls} = -\sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{obj}\sum_{c=1}^{n}\left[\hat{p}_i(c) \log p_i(c) + \left(1 - \hat{p}_i(c)\right)\log\left(1 - p_i(c)\right)\right]$$

where $\mathbb{1}_{ij}^{obj}$ indicates that the object is detected by the $j$-th bounding box of the $i$-th grid cell, $S^{2}$ denotes the number of grid cells, $B$ denotes the number of anchors associated with each grid cell, $n$ denotes the number of categories, $p_i(c)$ represents the probability of category $c$, $C_i$ denotes the box confidence score in cell $i$, $\hat{C}_i$ denotes the box confidence score for the predicted object, and $\lambda_{noobj}$ denotes the weight representing the predicted loss of confidence for a bounding box that contains no object.

Therefore, the overall loss function is given as follows:

$$L = L_{CIoU} + L_{conf} + L_{cls}$$
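As a rough illustration of how the three components can be combined during training, the snippet below composes the box, confidence (objectness), and classification terms, using binary cross-entropy for the latter two; the equal weighting, the use of BCEWithLogitsLoss, and the toy targets are assumptions for illustration, and the box term would come from a CIoU computation such as the one sketched above.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss(reduction="mean")

def total_loss(box_loss, obj_logits, obj_targets, cls_logits, cls_targets,
               w_box=1.0, w_obj=1.0, w_cls=1.0):
    """Overall loss = weighted sum of the box (CIoU), confidence, and classification terms.

    box_loss:    precomputed bounding box loss, averaged over matched boxes
    obj_logits:  raw objectness predictions; obj_targets: 0/1 objectness targets
    cls_logits:  raw class predictions;      cls_targets: one-hot class targets
    The unit weights are placeholders; practical trainings often rebalance the terms.
    """
    l_obj = bce(obj_logits, obj_targets)      # confidence loss
    l_cls = bce(cls_logits, cls_targets)      # classification loss
    return w_box * box_loss + w_obj * l_obj + w_cls * l_cls

# Toy example with 8 predictions and 7 classes.
box_l = torch.tensor(0.42)
obj_p, obj_t = torch.randn(8), torch.randint(0, 2, (8,)).float()
cls_p, cls_t = torch.randn(8, 7), torch.eye(7)[torch.randint(0, 7, (8,))]
print(total_loss(box_l, obj_p, obj_t, cls_p, cls_t))
```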
5. Conclusions
In this paper, we introduced an enhanced YOLOv5-based multi-sensor fusion network (CR-YOLOnet) that fuses radar object detections with camera image bounding boxes to locate and identify small and distant objects in fog. We transformed the radar detections by mapping them onto two-dimensional image coordinates and projected the resulting radar image onto the camera image. Using image data, we demonstrated that atmospheric distortion has a negative impact on sensor data in fog. We showed that our CR-YOLOnet, in contrast to the single-modal system used in the baseline YOLOv5, is capable of receiving data from both camera and radar sources. CR-YOLOnet utilizes two different CSPDarknet backbone networks for feature map extraction, one for the camera sensor and the other for the radar sensor.
We emphasized and improved the critical feature representations required for object detection using attention mechanisms and introduced two residual-like connections to reduce the loss of high-level feature information. We simulated autonomous driving scenes under clear and foggy weather conditions using the CARLA simulator to obtain clear and multi-fog weather datasets. We implemented our CR-YOLOnet and the baseline YOLOv5 in model configurations of three sizes (small, medium, and large). We found that both the small CR-YOLOnet and the medium YOLOv5 trained on the clear + fog datasets struck a balance between speed and accuracy, with the small CR-YOLOnet achieving an mAP of 0.847 at a speed of 72 FPS. This was an improvement of 24.19% in mAP when compared to YOLOv5 trained on the clear + fog datasets, which achieved an mAP of 0.765. The performance of CR-YOLOnet was especially improved in medium and heavy fog conditions. Since the large YOLOv5 model is more efficient for the detection of small objects, in the future, we could optimize the speed of our large CR-YOLOnet without a trade-off in accuracy by reducing the dimensions of the input data using half-precision floating points, which lowers memory usage in neural networks, and by enhancing the backbone network with an attention mechanism.