As computer vision has developed, autonomous driving has gained considerable attention in recent years. To gather precise information about their surroundings, autonomous vehicles are outfitted with multi-modal sensors, and the data acquired by these different sensors must be fused. However, the premise of fusion is precise alignment between the data. Currently, 2D cameras and 3D LiDAR are the most frequently used sensor combination: 2D cameras capture rich environmental information, while 3D LiDAR provides accurate depth information. Fusing LiDAR and camera data therefore yields a complementary effect, which is vital for downstream tasks such as 3D object detection [1], 3D scene reconstruction, and semantic segmentation [2,3].
Early LiDAR-camera calibration work mainly relied on specific calibration targets or man-made targets [4,5,6,7,8,9], such as checkerboards or spherical surfaces. Extrinsic calibration parameters are computed by matching correspondences between the 2D image and the 3D point cloud. These methods are mainly performed offline. However, in practical applications, unavoidable factors such as vehicle vibration, aging, and loosening change the relative position of the camera and the LiDAR, rendering the calibration parameters obtained in this way inaccurate. Because it requires human intervention, the whole calibration process is time-consuming and laborious, which hinders practical application. In recent years, many scholars have studied target-free calibration methods to solve the above problems, but these methods require an accurate initial calibration [10,11] or additional ego-motion data [12]. Recently, Wang et al. [13], Lv et al. [14], Schneider et al. [15], Zhao et al. [16], Iyer et al. [17], Shi et al. [18], and Lv et al. [19] proposed deep learning methods to solve the 6-DOF rigid-body transformation between the LiDAR coordinate system and the camera coordinate system. They attempt to establish correspondences between images and point clouds by exploiting deep learning's strong feature extraction and learning capabilities. However, since images and point clouds are heterogeneous data, the majority of earlier deep learning-based methods exchange information between modalities through simple feature concatenation, which has significant limitations and introduces errors.
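To make the estimated quantity concrete, the following is a minimal NumPy sketch of how a LiDAR point is mapped into the image once the 6-DOF extrinsic rotation R and translation t are known. The function name and the assumption of a known pinhole intrinsic matrix K are illustrative, not part of any cited method.

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, R, t):
    """Project Nx3 LiDAR points into pixel coordinates.

    R (3x3) and t (3,) are the extrinsic rotation and translation
    (the 6-DOF transform calibration seeks); K (3x3) is the camera
    intrinsic matrix, assumed known.
    """
    # Transform points from the LiDAR frame into the camera frame.
    points_cam = points_lidar @ R.T + t
    # Keep only points in front of the camera (positive depth).
    points_cam = points_cam[points_cam[:, 2] > 0]
    # Perspective projection onto the image plane.
    pixels_h = points_cam @ K.T
    pixels = pixels_h[:, :2] / pixels_h[:, 2:3]
    return pixels, points_cam[:, 2]  # pixel coordinates and depths
```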
Overall, for LiDAR-camera calibration, researchers have proposed different types of methods, which can be broadly divided into three categories: (1) target-based methods; (2) target-less methods; and (3) learning-based methods.
1.1. Target-Based Methods
Offline, target-based techniques have historically been the most common approach to this task; [8] advises employing a common checkerboard target that the camera perceives as a pattern and the LiDAR perceives as a 3D object. Other works explore different target forms. For example, [9] proposes a target with a hole and a circle to enhance sensor visibility in outdoor scenarios. Finally, [10] presents an automated calibration procedure that converges to a solution from a single shot in less than a minute, using several checkerboard targets indoors. Conventional target-based techniques are precise, but they require special equipment and environments. Consequently, these algorithms are cumbersome, time-consuming, and only useful offline.
1.2. Target-Less Methods
In contrast to the target-based methods, target-less methods do not require calibration targets to estimate the six-degrees-of-freedom rigid transformation between the LiDAR and the camera. Tamas et al. [20] treated the calibration problem as a 2D–3D registration of a region shared by the LiDAR and camera and presented a nonlinear, explicit correspondence-less calibration method. The nonlinear registration system is built from minimal information, such as region shapes and depth data, and directly yields the LiDAR-camera calibration parameters. From LiDAR data and camera images acquired in urban building environments, Liu et al. [21] extracted features such as rectangles and line segments and then calibrated the sensors by maximizing 2D–3D feature matching. By aligning the natural edge features of the LiDAR and camera, Yuan et al. [22] devised an effective LiDAR edge extraction technique based on point cloud plane fitting and voxel cutting, achieving high-precision matching. Additionally, certain techniques [23,24] estimate the extrinsic parameters from sensor motion. These motion-based techniques search for the transformation that best aligns the two sensors' motion tracks.
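As a rough illustration of the motion-alignment idea (not the exact procedure of [23] or [24]), the following sketch recovers only the rotational part of the extrinsic by aligning time-synchronized per-frame motion vectors with the Kabsch/SVD method; full hand-eye formulations (AX = XB) additionally constrain the translation.

```python
import numpy as np

def align_motion_tracks(motion_cam, motion_lidar):
    """Estimate the rotation aligning two sensors' per-frame motion
    vectors (Nx3 each, assumed time-synchronized) via Kabsch/SVD.
    """
    H = motion_lidar.T @ motion_cam          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    # R maps LiDAR-frame motions onto camera-frame motions.
    return Vt.T @ D @ U.T
```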
1.3. Learning-Based Methods
Owing to its rapid development, deep learning has revolutionized computer vision tasks such as segmentation, classification, object recognition, and object tracking, and it has also demonstrated exceptional extrinsic calibration performance compared with other algorithms. RegNet [15] is the first method to use neural networks for extrinsic parameter regression; it combines the feature extraction, feature matching, and parameter regression modules of traditional methods into a single neural network. Building on this, CalibNet [17] takes into account the spatial geometry of the point cloud and further improves calibration accuracy by introducing a 3D spatial transformer layer. NetCalib [25] is the first method to calibrate extrinsic parameters by predicting depth maps from RGB images, and it achieves good calibration results by obtaining a unified representation of the input data. LCCNet [14] models the correlation between image and point cloud by introducing a feature matching cost and obtains good calibration results. It is also worth noting that current deep learning techniques require approximate initial extrinsic parameter estimates; using these, the point cloud is projected onto the image to create an error (miscalibration) depth map that serves as the network input. Zhu et al. [26] formulated calibration as an optimization approach that leverages semantic features to establish a new evaluation metric for calibration accuracy. SOIC [27] employed semantic segmentation to obtain an initial extrinsic parameter and then turned the initialization challenge into a perspective-n-point (PnP) problem [28] by incorporating semantic centroids.
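For intuition, the following is a minimal NumPy sketch of the miscalibration depth-map input described above, under the simplifying assumptions of a pinhole camera and a rough initial extrinsic (R_init, t_init); it is not the exact preprocessing of any cited network.

```python
import numpy as np

def make_depth_map(points_lidar, K, R_init, t_init, h, w):
    """Render a sparse depth map by projecting LiDAR points with an
    approximate initial extrinsic. When the initialization is off,
    this map is misaligned with the RGB image, and learning-based
    methods regress the residual transform from the (RGB, depth) pair.
    """
    pts_cam = points_lidar @ R_init.T + t_init
    pts_cam = pts_cam[pts_cam[:, 2] > 0]          # keep points in front of camera
    pix = pts_cam @ K.T
    u = (pix[:, 0] / pix[:, 2]).astype(int)
    v = (pix[:, 1] / pix[:, 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    uu, vv, zz = u[valid], v[valid], pts_cam[valid, 2]
    depth = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-zz)                       # write far-to-near so nearer points win
    depth[vv[order], uu[order]] = zz[order]
    return depth
```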
In this paper, we put forward MSANet, a deep learning method that automatically estimates the extrinsic parameters between the LiDAR coordinate system and the camera coordinate system. First, the final performance of a calibration network is usually closely related to its ability to represent the characteristics of the input data: low-level features retain more detailed information, while high-level features contain more semantic information. Therefore, in the feature extraction stage, we fuse features of different scales. In the feature matching stage, we use a multi-layer attention mechanism to fuse the features of the different modalities and complete feature matching and reconstruction. Finally, the translation and rotation parameters are regressed independently with a feed-forward network. The following is a summary of our contributions:
(1) In the feature extraction stage, we propose CCAM, which enhances context dependency, reduces interference from noisy information, and establishes a foundation for calibration between the LiDAR and camera.
(2) Also in the feature extraction stage, we propose the RASPP module to capture context from different receptive fields and strengthen the correlation between features, helping the network learn complex multi-scale features, improving its robustness and training efficiency, and promoting calibration accuracy.
(3) We propose feature matching based on attention mechanisms. We perform iterative self-attention and cross-attention in the feature matching stage to simulate the way humans browse back and forth during matching and thus find better correspondences (see the sketch following this list).
(4) We conducted extensive experiments on the KITTI dataset, and the results demonstrate that our proposed calibration method has better generalization ability and calibration accuracy than most state-of-the-art calibration methods.
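As a rough, hypothetical PyTorch sketch of the matching-and-regression idea in contribution (3) and the regression step above: layer widths, the iteration count, weight sharing, and the quaternion rotation parameterization are our illustrative assumptions, not MSANet's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMatcher(nn.Module):
    """Iterative self- and cross-attention between image tokens and
    point cloud tokens, emulating 'browsing back and forth' during
    matching. Sizes are illustrative; weights are shared across
    modalities purely for brevity."""
    def __init__(self, dim=256, heads=8, iters=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, img_tok, pcd_tok):
        for _ in range(self.iters):
            # Self-attention refines each modality independently.
            img_tok = img_tok + self.self_attn(img_tok, img_tok, img_tok)[0]
            pcd_tok = pcd_tok + self.self_attn(pcd_tok, pcd_tok, pcd_tok)[0]
            # Cross-attention exchanges information across modalities.
            img_tok = img_tok + self.cross_attn(img_tok, pcd_tok, pcd_tok)[0]
            pcd_tok = pcd_tok + self.cross_attn(pcd_tok, img_tok, img_tok)[0]
        return img_tok, pcd_tok

class PoseHead(nn.Module):
    """Separate feed-forward branches regress rotation (as a unit
    quaternion) and translation from pooled matched features."""
    def __init__(self, dim=256):
        super().__init__()
        self.rot = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 4))
        self.trans = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, fused):
        f = fused.mean(dim=1)                                # pool over tokens
        return F.normalize(self.rot(f), dim=-1), self.trans(f)

# Example usage with random tokens of batch size 2:
matcher, head = AttentionMatcher(), PoseHead()
img_tok = torch.randn(2, 300, 256)    # B x N_img x C
pcd_tok = torch.randn(2, 500, 256)    # B x N_pcd x C
fused_img, _ = matcher(img_tok, pcd_tok)
quat, trans = head(fused_img)         # (2, 4) unit quaternions, (2, 3) translations
```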