1. Introduction
Information perception is a key technology that enables mobile robots to act autonomously, safely and efficiently in complex environments; the main approaches include vision, radar and multi-sensor fusion.
Vision sensors can acquire information such as the category, color and texture of external objects, with the category being particularly important. Object detection often employs the YOLO (You Only Look Once) [1] algorithm, the first end-to-end framework to enable real-time detection; it treats category and location as a single regression problem and optimizes them jointly. The method achieves real-time detection, but its simplified structure lowers accuracy. To address this problem, the YOLO series has been continuously updated and iterated, producing a series of improved models [2,3,4]. For instance, YOLOv9 [4] relies on the Generalized Efficient Layer Aggregation Network (GELAN) and Programmable Gradient Information (PGI) to achieve higher accuracy and speed than comparable models. However, focusing solely on global information or on local features during detection can easily reduce the recognition rate and cause missed detections. For example, the multimodal method of [5] obtains its final results by training fusion units to produce representation vectors that combine image features and semantic features, but an excessive focus on global information makes it difficult to infer the correct answers. Conversely, overemphasizing local information limits the model's ability to capture global context. Object detection algorithms such as YOLOv9 divide images into grids and extract features from each grid separately, which can easily result in missed detections of distant small objects or overlapping objects. Introducing an attention mechanism can therefore strengthen the focus on local features while simultaneously capturing global information, allowing attention weights to be allocated flexibly to improve model performance and information capture [6].
The attention mechanism is a technique that mimics the human ability to selectively observe and focus on key information as needed while ignoring the rest [5]. It steers the algorithm toward the most relevant parts of the input according to their relative importance, concentrating on the salient aspects when processing large amounts of information [7]. Current attention mechanisms include the channel attention mechanism [8], the hybrid attention mechanism [9] and the self-attention mechanism [10], among others. SENetV2 establishes an attention mechanism based on channel correlation, which automatically learns the importance of feature channels and assigns different weights according to each channel's importance, thereby reducing feature redundancy [11]. Chen [8] introduced Efficient Multi-Scale Attention (EMA), a multi-scale channel attention mechanism, into YOLOv7; it aggregates pixel-level features and reduces missed and false detections through cross-channel interaction. To enhance the accuracy of pipeline leakage detection, Peng et al. [9] integrated the Convolutional Block Attention Module (CBAM) into YOLOv5, enabling the model to focus on feature extraction, attenuating the influence of complex backgrounds and reducing the computational load by combining it with adaptive spatial feature fusion. Cao et al. [10] introduced the Swin Transformer (SWT), an adaptive attention mechanism, into YOLOv7 to reduce the computational load; integrated into a convolutional neural network, it guides the network to focus on contextual spatial information and improves its sensitivity to small objects. Wu et al. [12] extended the receptive field of YOLOX using attention modules, improving average accuracy by approximately 4.24% over the unmodified model. Therefore, this paper introduces SENetV2 into the convolutional neural network to reduce missed detections of small or overlapping objects. At the same time, the Context Guided Block, a module from a lightweight semantic segmentation network, replaces the convolutional module to offset the growth in parameter count and training cost caused by the multi-branch structure of SENetV2 while preserving recognition accuracy.
After visual object detection, further processing of the detected targets yields the color, texture and other attributes of the objects, enabling finer information perception. However, these methods cannot directly provide depth information, are strongly affected by occlusion and cannot directly perceive the shape and size of objects, so relying on a single 2D camera for information perception is highly limiting. Li et al. [13] combined 2D detection boxes from multiple cameras to approximate a 3D detection box and thereby recover depth information. Ding et al. [14] designed a moving-target localization algorithm based on an improved optical flow method and then reconstructed the 3D coordinates of points within the moving-target area using the parallel structure of binocular vision. LiDAR can provide accurate distance information and precise perception of object shape. Radar point clouds are commonly processed with clustering methods [15,16], which are flexible but sensitive to model parameters, data noise and outliers. Convolutional neural network point cloud processing algorithms can improve detection accuracy and robustness [17,18] but tend to suffer from a large computational load and poor real-time performance. To improve training and inference speed, some studies voxelize the point cloud [19] or use sparse convolution [20], but both methods cause a permanent loss of information and are only applicable to special scenarios. Although radar-based environment sensing can effectively determine the spatial location of an object, the large volume of point cloud data and the inability to obtain the color and texture of objects make it difficult to use for information sensing alone.
In summary, existing single-sensor methods have significant limitations in environmental perception. Visual cameras provide the color and texture of objects, while LiDAR provides accurate distance and shape perception, but methods relying on one type of sensor cannot comprehensively understand and perceive targets in complex scenes. Therefore, to address the limitations of single sensors such as cameras and radar, fusion can effectively improve information perception. Veitch-Michaelis [21] fused LiDAR point cloud data with disparity maps obtained by stereo matching, realizing simple calibration between camera and radar, but in low-texture regions the information perception ability degrades dramatically. Chen [22] proposed MV3D (Multi-View 3D Object Detection Network) to fuse LiDAR point cloud data with RGB image information for predicting oriented 3D bounding boxes. Although MV3D streamlines the VGG16 network, its recognition time is as high as 0.36 s per frame, making real-time processing difficult. Some works have optimized MV3D by introducing an encoder-decoder backbone [23] or a voxel feature encoding (VFE) layer [24], gaining advantages in both accuracy and detection speed. In addition, Duan [25], Liu [26] and others have recently used attention mechanisms to fuse data from LiDAR and cameras, achieving competitive 3D object detection performance.
All of the above works are oriented towards 3D object detection. The fusion of LiDAR and camera can also achieve accurate depth measurement, information perception and other tasks. Battrawy [27] proposed a method that fuses LiDAR with images obtained from stereo matching for dense scene flow estimation. Wang [28] utilized Convolutional Neural Networks (CNNs) to fuse radar and camera data for object detection based on depth measurements; although the algorithm markedly improved detection speed, a sparse non-homogeneous pooling layer constructed before conversion to the bird's-eye view discarded a large amount of information. Varuna [29] fused camera and radar information for environment sensing of unmanned vehicles, combining a wide-angle camera with an autonomous mobile robot and establishing a sensor fusion framework; this improves the vehicle's ability to perceive environmental information, but the method applies only to single images and cannot be used for data streams. Wang H. [30] clustered LiDAR point clouds to obtain regions of interest, projected them onto the image and used the YOLOv3 object detection algorithm to detect objects in the corresponding candidate regions; the processing time reached 0.09 s per frame, but due to the limitations of the underlying model it achieves only 69.36% accuracy (Easy) on the KITTI dataset. Wang [31] proposed Bi-Consistency Guidance Incomplete Multi-view Clustering for handling incomplete data with clustering methods; it focuses on instance identification and clustering optimization from multi-view data. While these studies confirmed the feasibility of combining LiDAR and cameras, several challenges emerge when applying the algorithms more broadly, such as the complexity of processing high-dimensional point cloud data and the algorithms' limited accuracy. A promising direction for future research is upgrading deep neural networks with new architectures and attention mechanisms, which may achieve more accurate and efficient perception.
Based on the above analysis, this paper proposes an information sensing system based on multi-sensor fusion. A camera and LiDAR acquire color images and depth information of the environment, respectively; with vision as the core, information perception is accomplished by improving the accuracy of the object detection algorithm, reducing the amount of LiDAR point cloud data that must be processed and fusing the data acquired by the two sensors at the data level. The contributions of this paper are as follows:
We propose an object detection model, YOLO-SCG, which integrates the recent SENetV2 attention mechanism to effectively improve the detection ability of the model, and replaces the convolutions in YOLOv9 with the Context Guided Block, which imitates the way the human visual system relies on contextual information and effectively improves the speed of the model.
Experiments verify that the proposed YOLO-SCG achieves higher accuracy and excellent real-time performance compared with currently popular object detection models.
We propose a joint information perception algorithm with visual detection as the core and radar as assistance. The image detection results are incorporated into the point cloud to obtain finer 3D information. Finally, the effectiveness of the proposed fusion method for information perception is verified experimentally.
3. Experimentation and Analysis
3.1. Hardware Configuration and Evaluation Metrics
The experiments were run on Ubuntu 18.04 with a GeForce RTX 2080Ti GPU, 32 GB of memory, CUDA 10.2 and Python 3.9. During training, the batch size was 4 and the learning rate was 0.01.
The performance evaluation metrics used in this paper for the object detection task are precision (Equation (13)), recall (Equation (14)) and mean average precision (Equation (15)). In addition, the number of model parameters and the computational cost are included in the evaluation metrics; they are computed over the convolutional layers, the fully connected layers and the other parameterized layers of the network.

$$P = \frac{TP}{TP + FP} \tag{13}$$

$$R = \frac{TP}{TP + FN} \tag{14}$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{15}$$

where $TP$ is the number of positive samples predicted to be positive, $FP$ is the number of negative samples predicted to be positive, $FN$ is the number of positive samples predicted to be negative, $AP_i$ is the average precision of class $i$ and $N$ is the number of object classes.
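As a minimal illustration of Equations (13)-(15) rather than the paper's evaluation code, the counting-based metrics can be computed as follows; the function names are ours and the per-class AP values are assumed to be precomputed.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Equations (13)-(14): precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def mean_average_precision(ap_per_class) -> float:
    """Equation (15): mean of the per-class average precision values."""
    return float(np.mean(ap_per_class))

# Example: 90 true positives, 10 false positives, 20 false negatives.
p, r = precision_recall(90, 10, 20)   # p = 0.9, r ≈ 0.818
```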
3.2. Datasets
For this experiment, the VOC 2007 dataset was chosen for training so as to detect as many classes as possible in an open environment. VOC (Visual Object Classes) is a dataset provided by the PASCAL VOC project for object detection and image classification. It consists of 9963 images, with horizontal resolutions of approximately 375 to 500 pixels and vertical resolutions of approximately 350 to 500 pixels. The dataset covers 20 object classes, including common targets such as "aircraft", "bicycle" and "person". The scenes are complex, with large differences between similar targets within a single image and varying degrees of occlusion, so training on this dataset improves the model's generalization ability. The present study uses a 6:3:1 ratio to divide the data into training, testing and validation sets.
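As a minimal sketch of the 6:3:1 split (the exact tooling and output file names are not specified in the paper and are assumed here), the VOC image IDs can be partitioned as follows.

```python
# Partition VOC 2007 image IDs into train/test/val at a 6:3:1 ratio.
# The directory layout is the standard VOCdevkit layout; output file
# names are illustrative placeholders.
import random
from pathlib import Path

random.seed(0)
ids = sorted(p.stem for p in Path("VOCdevkit/VOC2007/JPEGImages").glob("*.jpg"))
random.shuffle(ids)

n = len(ids)
n_train, n_test = int(0.6 * n), int(0.3 * n)
splits = {
    "train": ids[:n_train],
    "test": ids[n_train:n_train + n_test],
    "val": ids[n_train + n_test:],
}
for name, subset in splits.items():
    Path(f"{name}.txt").write_text("\n".join(subset))
```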
For the radar-related experiments, since the VOC dataset only contains image data, the KITTI dataset is used for additional testing. The KITTI dataset, created by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago, is widely used for research in autonomous driving and computer vision. It provides rich sensor data, including high-resolution images, LiDAR point clouds, GPS/IMU data and vehicle pose information, covering various environments such as city streets and rural roads.
3.3. YOLO-SCG Ablation Experiment
To investigate the resulting enhancement effects of integrating the SENetV2 attention mechanism and the Context Guided block into the YOLOv9 network and to evaluate whether these two methods outperform other attention mechanisms and convolutional blocks, we designed ablation experiments. We conducted comparative analyses of the effects produced by incorporating various attention mechanisms and convolutional blocks into the network structure. We made various modifications to YOLOv9 and tested them using different configurations, with each experimental setup corresponding to a comprehensive set of evaluation metrics.
3.3.1. Comparison of Different Attention Mechanisms
Figure 7 shows the results of the ablation experiment after integrating various attention mechanisms into YOLOv9.
Integrating SENet (Squeeze-and-Excitation Networks) and SENetV2 into YOLOv9 increases object detection accuracy by introducing squeeze-and-excitation operations, which process the feature maps by dynamically adjusting the importance of different channels. Compared with SENet, SENetV2 further integrates a multi-branch fully connected layer to learn the correlation between channels, resulting in a 4.29% increase in mAP (mean Average Precision) but also a 0.6% increase in the number of parameters.
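To make the mechanism concrete, the following PyTorch sketch approximates a SENetV2-style block: a squeeze step followed by several parallel fully connected branches whose outputs are aggregated into per-channel weights. It is our own minimal illustration; the reduction ratio and branch count are assumptions rather than the settings used in YOLO-SCG.

```python
import torch
import torch.nn as nn

class MultiBranchSE(nn.Module):
    """Sketch of a SENetV2-style channel attention block.

    Global average pooling squeezes each channel to a scalar; several parallel
    fully connected branches (the multi-branch structure described above) learn
    channel correlations, and their outputs are aggregated and expanded back to
    per-channel weights that re-scale the feature map.
    """

    def __init__(self, channels: int, reduction: int = 16, branches: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        hidden = max(channels // reduction, 4)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
            for _ in range(branches)
        )
        self.excite = nn.Sequential(
            nn.Linear(hidden * branches, channels), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.pool(x).view(b, c)                       # squeeze: (B, C)
        z = torch.cat([branch(s) for branch in self.branches], dim=1)
        w = self.excite(z).view(b, c, 1, 1)               # per-channel weights
        return x * w                                      # re-weight feature map
```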
We also investigated introducing Mixed Local Channel Attention (MLCA), with its local and global average pooling operations, into YOLOv9. MLCA first extracts local spatial information through local pooling and then introduces two branches: the first obtains global information via global average pooling, while the second uses convolution to capture local interactions between channels and exploit useful features. Fusing the two branches yields global contextual information, resulting in a 3.16% increase in detection accuracy but also higher computational complexity, with a 1.23% increase in parameters.
The introduction of EMA into YOLOv9 improves detection accuracy by reshaping some channels of the feature map into the batch dimension and dividing the channel dimension into multiple sub-features to ensure an even distribution of channel information, while encoding global information in parallel and capturing pixel-level relationships through a cross-dimensional interaction module; this yields a 3.39% increase in detection accuracy. However, due to its parallel branches, the number of parameters increases by 5.12%.
When the Global Attention Mechanism (GAM) is integrated into YOLOv9, its 3D permutation preserves global feature information across the three dimensions, and cross-dimensional channel dependencies are strengthened through a Multilayer Perceptron (MLP); this maintains better global consistency but pays less attention to local features, resulting in only a 0.05% increase in detection accuracy.
A comprehensive analysis shows that incorporating the SENetV2 attention mechanism produces the greatest benefit, improving model performance and achieving the best overall results across all evaluation metrics.
3.3.2. Comparison of Different Module Replacement Convolutions
The experimental results of replacing the convolutional layers with different modules are shown in Table 1. The Context Guided block is used to replace the convolutional layers of the original network; it fuses local features, surrounding context and global context to better capture the relationship between an object and its background and to improve detection accuracy. Experimental results show that this replacement significantly improves object detection accuracy, with a 4.18% improvement in mAP, while the number of parameters is reduced by 8.06%. We also used three convolution modules, DynamicConv, DualConv and SPD-Conv, to replace the convolutional layers. DynamicConv improves model performance by dynamically adjusting the convolution kernel but also increases the parameter count and computational complexity. DualConv uses two convolution kernels to optimize information processing and feature extraction, reducing the number of model parameters by 8.8%, but its detection accuracy is lower than that of Context Guided owing to information redundancy. SPD-Conv (Space-to-Depth Convolution) captures multi-scale features and enlarges the receptive field by using different dilation rates; however, its more complex structure leads to longer training times and a higher risk of overfitting. Comparative analysis shows that replacing the convolutional layers with the Context Guided block satisfies the requirements on all performance aspects.
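For concreteness, the following sketch shows a Context Guided block in the spirit of CGNet: a local branch, a surrounding-context (dilated) branch, joint refinement and global-context gating. It is a simplified rendition for illustration; the kernel sizes, dilation rate and reduction ratio are assumptions, not the exact configuration used in YOLO-SCG.

```python
import torch
import torch.nn as nn

class ContextGuidedBlock(nn.Module):
    """Sketch of a Context Guided block (local + surrounding + global context).

    out_ch is assumed to be even: the local and surrounding branches each
    produce out_ch // 2 channels before concatenation.
    """

    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2, reduction: int = 16):
        super().__init__()
        half = out_ch // 2
        mid = max(out_ch // reduction, 4)
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, half, 1, bias=False), nn.BatchNorm2d(half), nn.PReLU(half)
        )
        # Local feature branch: standard depth-wise 3x3 convolution.
        self.f_loc = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        # Surrounding-context branch: dilated depth-wise 3x3 convolution.
        self.f_sur = nn.Conv2d(half, half, 3, padding=dilation, dilation=dilation,
                               groups=half, bias=False)
        self.joint = nn.Sequential(nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))
        # Global-context branch: pooled descriptor gates the joint feature.
        self.f_glo = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        joint = self.joint(torch.cat([self.f_loc(x), self.f_sur(x)], dim=1))
        return joint * self.f_glo(joint)   # global context re-weights the joint feature
```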
3.4. YOLO-SCG Comparison Experiment
In this section, comparison experiments between the proposed model and other models are presented; the specific data are shown in Table 2. The proposed model performs better in terms of precision, recall and F1-score. Owing to the incorporation of the Context Guided lightweight network, the model has fewer parameters. For mAP@0.5, YOLO-SCG (0.880) is 0.45% lower than the original YOLOv9 (0.884) and 6.54% higher than the popular YOLOv5s (0.826). However, in a comparison of
mAP@0.5:0.95, the value improved by 2.3% compared with YOLOv10s, and YOLO-SCG improved by 4.06% relative to the original YOLOv9. This indicates that while the model's performance on regular objects is comparable to the original model, YOLO-SCG maintains high detection accuracy for small targets and overlapping objects. In addition, we compared our model with other real-time object detectors on the MS COCO dataset. As shown in Figure 8, our model exhibits higher accuracy than other models with the same number of parameters.
3.5. Information Acquisition by Multi-Sensor Fusion
Cameras are strongly affected by illumination, which can degrade image quality under different lighting conditions, and they cannot directly provide the distance of objects, posing challenges for depth perception and 3D reconstruction. In this paper, we perform object detection on the image and then apply local clustering to the point cloud according to the detection results, aiming to obtain a more accurate and faster information perception model.
3.5.1. Robustness Evaluation of Object Detection Model
The detection performance of the proposed YOLO-SCG object detection algorithm is shown in Figure 9. As seen in areas A and B of the figure, although the YOLOv9 object detection algorithm achieves good accuracy, it still misses detections in some extreme cases (such as occlusion or low light). YOLO-SCG eliminates these missed detections while preserving the detection performance of the original model.
To evaluate the robustness of the model under external environmental disturbances, we conducted experiments simulating various conditions. For images of the same scene, 10, 8 and 9 objects are detected in the normal environment, the simulated foggy day and the simulated rainy day, respectively. Figure 10b shows the detection results in the normal environment, where the model clearly recognizes both distant and nearby objects. Figure 10c,d simulate detection in foggy and rainy conditions; the proposed model recognizes all objects at normal distances regardless of the weather. However, detecting distant objects is more challenging in heavy rain and thick fog, resulting in 1 to 2 missed detections, which remains within acceptable limits.
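The paper does not specify how the foggy and rainy scenes were synthesized; purely as a hypothetical illustration, weather-style perturbations of test images could be generated with an augmentation library such as albumentations (the file names below are placeholders).

```python
# Hypothetical weather simulation for robustness testing; this is one
# possible tooling choice, not the procedure used in the paper.
import cv2
import albumentations as A

image = cv2.imread("scene.jpg")                 # original clear-weather frame
fog = A.Compose([A.RandomFog(p=1.0)])           # synthetic fog
rain = A.Compose([A.RandomRain(p=1.0)])         # synthetic rain streaks

cv2.imwrite("scene_fog.jpg", fog(image=image)["image"])
cv2.imwrite("scene_rain.jpg", rain(image=image)["image"])
```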
3.5.2. Evaluation of the Effectiveness of Localized Euclidean Clustering
Next, conventional Euclidean clustering is used as a comparison, as shown in Figure 11a. When dealing with independent objects (those far away from other objects), clustering methods can successfully distinguish different objects. However, when objects are close to each other or overlap, clustering the point cloud data alone cannot separate them; worse, the algorithm may merge different objects into a single one, which may result in failing to recognize movable objects or recognizing stationary objects as moving. The clustering result of our proposed method is shown in Figure 11b: vision-based object detection provides the detection frame, which is transformed into the point cloud coordinate system, and Euclidean clustering is then performed only within that frame. This method effectively refines the clustering and reduces the number of points that must be clustered.
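The sketch below illustrates this "detect first, cluster locally" idea under simplifying assumptions: the LiDAR points are assumed to be already transformed into the camera frame, P denotes a 3x4 camera projection matrix, and the eps and min_pts thresholds are illustrative rather than the values used in our experiments. DBSCAN from scikit-learn is used here as a stand-in for Euclidean clustering.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_points_in_box(points_cam, P, box_xyxy, eps=0.5, min_pts=10):
    """Cluster only the LiDAR points whose image projection falls in a 2D box."""
    # Project 3D points (N, 3) in the camera frame to pixel coordinates.
    pts_h = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    uvw = pts_h @ P.T
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Keep only points in front of the camera and inside the detection box.
    x1, y1, x2, y2 = box_xyxy
    in_box = (uvw[:, 2] > 0) & (uv[:, 0] >= x1) & (uv[:, 0] <= x2) \
             & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    local = points_cam[in_box]
    if len(local) < min_pts:
        return local, np.array([])

    # DBSCAN groups points within `eps` of one another, approximating
    # distance-threshold (Euclidean) clustering on the local subset.
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(local)
    return local, labels
```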
Figure 12 compares the results of the proposed method and two point cloud clustering algorithms. Figure 12a shows the detection result of YOLO-SCG, which recognizes six cars in the image despite the heavy occlusion; Figure 12b shows the result of clustering with Region Growing alone, which fails to cluster the distant objects; Figure 12c shows that integrating YOLO-SCG with Euclidean clustering recognizes the six cars more accurately.
3.5.3. Multi-Sensor Fusion for Information Perception
The above experiments verified the advantages of our proposed object detection algorithm and the feasibility of the multi-sensor fusion algorithm. Subsequently, the collected information needs to be organized and summarized. In Figure 11, the visual detection of the objects ahead mainly finds cars and pedestrians; after local point cloud clustering within the same field of view as the camera, we obtain the distance and size of the objects. The information perceived for the pedestrians in areas A and B of Figure 11 is shown in Table 3, which mainly includes category, distance, size, color and texture.
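As an illustration of how this fused per-object information might be organized (the paper does not prescribe a data structure), a simple container could look like the following; the field names and example values are ours, not taken from Table 3.

```python
# Illustrative container for the fused per-object information.
from dataclasses import dataclass

@dataclass
class PerceivedObject:
    category: str            # from YOLO-SCG image detection
    distance_m: float        # from the clustered LiDAR points
    size_m: tuple            # (length, width, height) of the cluster's bounding box
    dominant_color: str      # sampled from the image inside the detection box
    texture: str             # coarse texture descriptor from the image crop

obj = PerceivedObject("pedestrian", 12.4, (0.6, 0.5, 1.7), "blue", "fabric")
```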
To test the effectiveness of the proposed multi-sensor fusion algorithm, we conducted a comparative study on 30 consecutive frames of fused data, taking distant cars as an example, and statistically compared the errors of the measured object dimensions against typical real-world values. The proposed method was compared with three baselines: Euclidean clustering, Region Growing and Hierarchical Clustering.
Table 4 shows the average relative measurement errors and missed detection rates of the four methods. Since visual object detection provides an a priori frame before clustering, the proposed method has a smaller error on all metrics than the traditional clustering methods. Over the 30 frames, the Euclidean clustering method had a missed detection rate of 23.3%, while Region Growing and Hierarchical Clustering produced 13.3% and 16.6% missed frames, respectively, including undetected objects and objects merged with others; in contrast, our method failed in only 2 frames. Regarding perception speed, Hierarchical Clustering runs at 22 fps because it builds a complete hierarchy and involves similarity calculations, whereas using YOLO-SCG as prior information for Euclidean clustering is 32.2% faster than plain Euclidean clustering. In general, the proposed method improves the efficiency, accuracy and stability of clustering.