As computer vision has developed, autonomous driving has gained considerable attention in recent years. To gather precise information about their surroundings, autonomous vehicles are outfitted with multi-modal sensors, and the data acquired by these different sensors must be fused. However, the premise of fusion is precise alignment between the data. Currently, 2D cameras and 3D LiDAR are the most frequently used sensor combination: 2D cameras capture rich environmental information, while 3D LiDAR provides accurate depth information. Fusing LiDAR and camera data therefore yields a complementary effect, which is vital for downstream tasks such as 3D object detection [1], 3D scene reconstruction, and semantic segmentation [2,3].
Early LiDAR-camera calibration work mainly relied on specific calibration targets or man-made targets [4,5,6,7,8,9], such as checkerboards or spherical surfaces. Extrinsic calibration parameters are computed by matching correspondences between the 2D image and the 3D point cloud. These methods are mainly performed offline. However, in practical applications, unavoidable factors such as vehicle vibration, aging, and loosening change the relative position of the camera and the LiDAR, rendering the calibration parameters obtained in this way inaccurate. Because it requires human intervention, the whole calibration process is time-consuming and laborious, which hinders practical application. In recent years, many scholars have studied target-free calibration methods to solve the above problems, but these methods require an accurate initial calibration [10,11] or additional ego-motion data [12]. Recently, Wang et al. [13], Lv et al. [14], Schneider et al. [15], Zhao et al. [16], Iyer et al. [17], Shi et al. [18], and Lv et al. [19] proposed deep learning methods to solve the 6-DOF rigid-body transformation between the LiDAR coordinate system and the camera coordinate system. They attempt to establish correspondences between images and point clouds by exploiting deep learning's strong feature extraction and learning capabilities. However, since images and point clouds are heterogeneous data, the majority of earlier deep learning-based methods exchange information between modalities through simple feature concatenation, which has significant limitations and introduces errors.
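To make the estimated quantity concrete, the following is a minimal NumPy sketch of how a LiDAR point is mapped into the image once the 6-DOF extrinsic rotation R and translation t are known. The function name and the assumption of a known pinhole intrinsic matrix K are illustrative, not part of any cited method.

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, R, t):
    """Project Nx3 LiDAR points into pixel coordinates.

    R (3x3) and t (3,) are the extrinsic rotation and translation
    (the 6-DOF transform calibration seeks); K (3x3) is the camera
    intrinsic matrix, assumed known.
    """
    # Transform points from the LiDAR frame into the camera frame.
    points_cam = points_lidar @ R.T + t
    # Keep only points in front of the camera (positive depth).
    points_cam = points_cam[points_cam[:, 2] > 0]
    # Perspective projection onto the image plane.
    pixels_h = points_cam @ K.T
    pixels = pixels_h[:, :2] / pixels_h[:, 2:3]
    return pixels, points_cam[:, 2]  # pixel coordinates and depths
```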
Overall, for LiDAR-camera calibration, researchers have proposed different types of methods, which can be broadly divided into three categories: (1) target-based methods; (2) target-less methods; and (3) learning-based methods.
1.1. Target-Based Methods
Offline, target-based techniques have historically been the most common approach to this task; [8] advises employing a common checkerboard target that the camera perceives as a pattern and the LiDAR perceives as a 3D object. Other works explore different target forms. For example, [9] proposes a target with a hole and a circle to enhance sensor visibility in outdoor scenarios. Finally, [10] presents an automated calibration procedure that converges to a solution from a single shot in less than a minute, using several checkerboard targets indoors. Conventional target-based techniques are precise, but they require special equipment and environments. Consequently, these algorithms are cumbersome, time-consuming, and only useful offline.
1.2. Target-Less Methods
In contrast to the target-based methods, target-less methods do not require calibration targets to estimate the six-degrees-of-freedom rigid transformation between the LiDAR and the camera. Tamas et al. [20] treated the calibration problem as a 2D–3D registration of a region shared by the LiDAR and camera and presented a nonlinear, explicit correspondence-less calibration method. The nonlinear registration system is built from minimal information, such as region shapes and depth data, and directly yields the LiDAR-camera calibration parameters. From LiDAR data and camera images acquired in urban building environments, Liu et al. [21] extracted features such as rectangles and line segments and then calibrated the sensors by maximizing 2D–3D feature matching. By aligning the natural edge features of the LiDAR and camera, Yuan et al. [22] devised an effective LiDAR edge extraction technique based on point cloud plane fitting and voxel cutting, achieving high-precision matching. Additionally, certain techniques [23,24] estimate the extrinsic parameters from sensor motion. These motion-based techniques search for the transformation that best aligns the two sensors' motion tracks.
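As a rough illustration of the motion-alignment idea (not the exact procedure of [23] or [24]), the following sketch recovers only the rotational part of the extrinsic by aligning time-synchronized per-frame motion vectors with the Kabsch/SVD method; full hand-eye formulations (AX = XB) additionally constrain the translation.

```python
import numpy as np

def align_motion_tracks(motion_cam, motion_lidar):
    """Estimate the rotation aligning two sensors' per-frame motion
    vectors (Nx3 each, assumed time-synchronized) via Kabsch/SVD.
    """
    H = motion_lidar.T @ motion_cam          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    # R maps LiDAR-frame motions onto camera-frame motions.
    return Vt.T @ D @ U.T
```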
1.3. Learning-Based Methods
Owing to its rapid development, deep learning has revolutionized computer vision tasks such as segmentation, classification, object recognition, and object tracking, and it has also demonstrated exceptional extrinsic calibration performance compared with other algorithms. RegNet [15] is the first method to use neural networks for extrinsic parameter regression; it combines the feature extraction, feature matching, and parameter regression modules of traditional methods into a single neural network. Building on this, CalibNet [17] takes into account the spatial geometry of the point cloud and further improves calibration accuracy by introducing a 3D spatial transformer layer. NetCalib [25] is the first method to calibrate extrinsic parameters by predicting depth maps from RGB images, and it achieves good calibration results by obtaining a unified representation of the input data. LCCNet [14] models the correlation between image and point cloud by introducing a feature matching cost and obtains good calibration results. It is also worth noting that current deep learning techniques require approximate initial extrinsic parameter estimates; using these, the point cloud is projected onto the image to create an error (miscalibration) depth map that serves as the network input. Zhu et al. [26] formulated calibration as an optimization approach that leverages semantic features to establish a new evaluation metric for calibration accuracy. SOIC [27] employed semantic segmentation to obtain an initial extrinsic parameter and then turned the initialization challenge into a perspective-n-point (PnP) problem [28] by incorporating semantic centroids.
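For intuition, the following is a minimal NumPy sketch of the miscalibration depth-map input described above, under the simplifying assumptions of a pinhole camera and a rough initial extrinsic (R_init, t_init); it is not the exact preprocessing of any cited network.

```python
import numpy as np

def make_depth_map(points_lidar, K, R_init, t_init, h, w):
    """Render a sparse depth map by projecting LiDAR points with an
    approximate initial extrinsic. When the initialization is off,
    this map is misaligned with the RGB image, and learning-based
    methods regress the residual transform from the (RGB, depth) pair.
    """
    pts_cam = points_lidar @ R_init.T + t_init
    pts_cam = pts_cam[pts_cam[:, 2] > 0]          # keep points in front of camera
    pix = pts_cam @ K.T
    u = (pix[:, 0] / pix[:, 2]).astype(int)
    v = (pix[:, 1] / pix[:, 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    uu, vv, zz = u[valid], v[valid], pts_cam[valid, 2]
    depth = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-zz)                       # write far-to-near so nearer points win
    depth[vv[order], uu[order]] = zz[order]
    return depth
```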
In this paper, we put forward MSANet, a deep learning method that automatically estimates the extrinsic parameters between the LiDAR coordinate system and the camera coordinate system. First, the final performance of a calibration network is usually closely related to its ability to represent the characteristics of the input data: low-level features retain more detailed information, while high-level features contain more semantic information. Therefore, in the feature extraction stage, we fuse features of different scales. In the feature matching stage, we use a multi-layer attention mechanism to fuse the features of the different modalities and complete feature matching and reconstruction. Finally, the translation and rotation parameters are regressed independently with a feed-forward network. The following is a summary of our contributions:
(1) In the feature extraction stage, we propose CCAM, which enhances context dependency, reduces interference from noisy information, and establishes a foundation for calibration between the LiDAR and camera.
(2) Also in the feature extraction stage, we propose the RASPP module to capture context from different receptive fields and strengthen the correlation between features, helping the network learn complex multi-scale features, improving its robustness and training efficiency, and promoting calibration accuracy.
(3) We propose feature matching based on attention mechanisms. We perform iterative self-attention and cross-attention in the feature matching stage to simulate the way humans browse back and forth during matching and thus find better correspondences (see the sketch following this list).
(4) We conducted extensive experiments on the KITTI dataset, and the results demonstrate that our proposed calibration method has better generalization ability and calibration accuracy than most state-of-the-art calibration methods.
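As a rough, hypothetical PyTorch sketch of the matching-and-regression idea in contribution (3) and the regression step above: layer widths, the iteration count, weight sharing, and the quaternion rotation parameterization are our illustrative assumptions, not MSANet's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMatcher(nn.Module):
    """Iterative self- and cross-attention between image tokens and
    point cloud tokens, emulating 'browsing back and forth' during
    matching. Sizes are illustrative; weights are shared across
    modalities purely for brevity."""
    def __init__(self, dim=256, heads=8, iters=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, img_tok, pcd_tok):
        for _ in range(self.iters):
            # Self-attention refines each modality independently.
            img_tok = img_tok + self.self_attn(img_tok, img_tok, img_tok)[0]
            pcd_tok = pcd_tok + self.self_attn(pcd_tok, pcd_tok, pcd_tok)[0]
            # Cross-attention exchanges information across modalities.
            img_tok = img_tok + self.cross_attn(img_tok, pcd_tok, pcd_tok)[0]
            pcd_tok = pcd_tok + self.cross_attn(pcd_tok, img_tok, img_tok)[0]
        return img_tok, pcd_tok

class PoseHead(nn.Module):
    """Separate feed-forward branches regress rotation (as a unit
    quaternion) and translation from pooled matched features."""
    def __init__(self, dim=256):
        super().__init__()
        self.rot = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 4))
        self.trans = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, fused):
        f = fused.mean(dim=1)                                # pool over tokens
        return F.normalize(self.rot(f), dim=-1), self.trans(f)

# Example usage with random tokens of batch size 2:
matcher, head = AttentionMatcher(), PoseHead()
img_tok = torch.randn(2, 300, 256)    # B x N_img x C
pcd_tok = torch.randn(2, 500, 256)    # B x N_pcd x C
fused_img, _ = matcher(img_tok, pcd_tok)
quat, trans = head(fused_img)         # (2, 4) unit quaternions, (2, 3) translations
```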