4.2. Implementation Details
In line with prior studies, we employ standard evaluation metrics for rotation and translation. Translation vectors are assessed using the average absolute error across the three translation components. For rotation matrices, we convert the output quaternions to Euler angles and compute the average absolute error across the three rotation components.
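A minimal sketch of these metrics, assuming SciPy's quaternion convention and per-axis mean absolute errors (function names are illustrative, not from the original implementation):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_mae_deg(pred_quats, gt_quats):
    """Mean absolute Euler-angle error (degrees) per axis (roll, pitch, yaw)."""
    pred = Rotation.from_quat(pred_quats).as_euler("xyz", degrees=True)
    gt = Rotation.from_quat(gt_quats).as_euler("xyz", degrees=True)
    return np.abs(pred - gt).mean(axis=0)

def translation_mae(pred_t, gt_t):
    """Mean absolute translation error per axis (x, y, z)."""
    return np.abs(np.asarray(pred_t) - np.asarray(gt_t)).mean(axis=0)
```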
We use CREStereo [38] to estimate the depth map of the target images from the dataset's stereo pairs, which serves as the input to the image branch of DPCalib. The proposed network was trained on a single RTX 3090, and the depth map from the principal viewpoint was resized to 512 × 256 before being fed to the network. After resizing, the resolution of the compensated viewpoint matches that of the principal viewpoint. For the intrinsics of the compensated view, we use a fixed focal length and set the pixel center to (512, 128). We elaborate further on the rationale behind these design choices in the section on ablation experiments.
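The compensated-view intrinsics above can be sketched as a standard pinhole matrix; the focal length `f` is left as a placeholder since its value is not specified here, while the principal point follows the (512, 128) setting:

```python
import numpy as np

def make_intrinsics(f, cx=512.0, cy=128.0):
    """Build a 3x3 pinhole intrinsic matrix for the compensated view.

    `f` is a placeholder focal length; the principal point defaults to
    the (512, 128) pixel center described in the text.
    """
    return np.array([[f,   0.0, cx],
                     [0.0, f,   cy],
                     [0.0, 0.0, 1.0]])
```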
It is worth noting that while the depth map obtained from stereo matching algorithms is generally close to the ground truth, significant errors remain, particularly at object edges. Through ablation experiments, we found that using these error-prone results directly as input during training hinders loss convergence.
Therefore, we first project a “sparse ground-truth depth” from the point clouds. This sparse ground-truth depth temporarily replaces the depth estimation image for training the neural network, enabling the network to learn corner-matching capabilities; this process is referred to as “guided training”. Subsequently, we fine-tune the model using the “depth estimation image” as input.
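A hedged sketch of the sparse ground-truth depth projection, assuming a standard pinhole projection of the LiDAR points through the camera model (all names and the per-pixel overwrite behavior are illustrative assumptions):

```python
import numpy as np

def project_sparse_depth(points, T_cam_lidar, K, h, w):
    """Project LiDAR points (N, 3) into a sparse (h, w) depth map.

    T_cam_lidar: 4x4 extrinsic transform (LiDAR -> camera frame).
    K: 3x3 camera intrinsic matrix. Pixels hit by no point stay zero.
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]        # points in camera frame
    cam = cam[cam[:, 2] > 0]                      # keep points in front of camera
    uv = (K @ cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w))
    depth[v[inside], u[inside]] = cam[inside, 2]  # last point wins per pixel
    return depth
```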
The training settings for both the “guided training” and “fine-tuning” stages are identical. The batch size is set to 32, the total number of epochs is 40, the optimizer is Adam, and the learning rate is 1 × 10−4 with a decay rate of 0.9. The weights of the loss function terms are set to 0.95, 0.05, 0.7, and 0.3, respectively.
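The reported settings can be sketched in PyTorch as follows; the stand-in model and the per-epoch decay granularity are assumptions (the text does not state when the decay is applied):

```python
import torch

# Stand-in for the DPCalib network (illustrative only).
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay rate of 0.9, assumed to be applied once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

BATCH_SIZE, EPOCHS = 32, 40
for epoch in range(EPOCHS):
    # ... iterate over batches, compute the weighted loss, step the optimizer ...
    scheduler.step()
```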
4.3. Comparison to Existing Methods
According to the experimental conditions specified in
Table 1, we conducted comparisons with different state-of-the-art (SOTA) works. The purpose of this approach is to maintain consistency with previous SOTA works in terms of experimental conditions. For example, CalibNet [
11], CalibDNN [
32], and CalibDepth [
3] utilized the dataset and error range mentioned in Exp 1 for experimentation and comparison. Therefore, we also compared our method with these works under the same conditions. In the quantitative analysis, we directly referenced the error results reported by these works when they were published. The same approach was applied to Exp 2 and 3.
However, it is worth noting that most of these works did not release complete code, trained models, or qualitative analysis results. We attempted to reproduce them but, for most, could not obtain the results reported in their papers. Among all the SOTA works, we only successfully reproduced NetCalib [
13] and obtained results consistent with those reported in its paper. Therefore, in addition to reporting the publicly available NetCalib results for Exp 2, we also followed the training strategy of the original NetCalib to obtain results for Exps 1 and 3. These training results have been added to
Table 2 and
Table 3. In all tables, bold formatting indicates the best performance for the corresponding metric in the row-wise or column-wise comparison. Furthermore, in this section, besides comparing Exp 2 with NetCalib, we directly compare the qualitative results of the other experiments with the ground truth.
The results of Exp 1 are shown in
Table 2, where the maximum rotation error is set to ±10° and the maximum translation error is set to ±0.25 m. This represents a relatively large error range. Consequently, previous algorithms tend to exhibit more noticeable errors in the pitch angle estimation. For example, CalibNet [
11] achieved a pitch angle error of 0.900°, which is five times higher than its error in roll and yaw estimation. Moreover, the absolute value of the error is also relatively large. Similarly, CalibDNN [
32] and CalibDepth [
3] faced similar issues.
In addition to the aforementioned issues, we found that even after model training converged, NetCalib [
13] exhibited relatively larger displacement errors than CalibDepth [
3]. This is attributed to the output head of NetCalib. While most works [
3,
11,
32] have two parallel output heads that separately output rotation matrices and translation vectors, NetCalib has only one output head that simultaneously outputs both values. This can result in the network being less sensitive to displacement errors. In contrast, our proposed DPCalib mitigates this problem by compensating for the viewpoint features. As a result, our DPCalib demonstrates superior performance in pitch angle estimation compared to other SOTA works. Overall, our DPCalib outperforms other SOTA works in both displacement and rotation error estimation. The qualitative analysis results of our DPCalib are depicted in
Figure 8.
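The single-head versus parallel-head designs contrasted above can be sketched as follows (an illustrative reconstruction, not the actual NetCalib or DPCalib code; feature and output sizes are assumptions):

```python
import torch
import torch.nn as nn

class SingleHead(nn.Module):
    """NetCalib-style: one head regresses rotation and translation jointly."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Linear(feat_dim, 7)      # quaternion (4) + translation (3)

    def forward(self, x):
        out = self.head(x)
        return out[:, :4], out[:, 4:]

class DualHead(nn.Module):
    """Two parallel heads, as in most regression-based calibration works."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.rot_head = nn.Linear(feat_dim, 4)    # quaternion
        self.trans_head = nn.Linear(feat_dim, 3)  # translation vector

    def forward(self, x):
        return self.rot_head(x), self.trans_head(x)
```

Separating the two regression targets lets each head specialize, which is one explanation for the single head's reduced sensitivity to displacement errors.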
As shown in Figure 8, we illustrate the calibration precision with 8 images and LiDAR projections under different scenarios. The first row depicts the projections obtained with the provided initial extrinsic parameters, which contain errors; the misalignment between the geometric information of the point cloud and the semantic information of the image is evident. The second row presents the projection with correctly calibrated parameters, where the correspondence between the point cloud's geometric information and the image's semantic information is correct. The closer the algorithm's estimated results are to the ground truth, the better its performance. The fourth and fifth rows provide close-up views of the ground truth and the estimated values, respectively. We use clearly defined objects such as poles, pedestrians, and vehicles to further compare the differences between the algorithm's results and the ground truth. For images like
Figure 8 that present qualitative analysis results, our descriptions follow a sequence from top to bottom and from left to right. For example, in the top row, the second image is referred to as “the 2nd input” or “input 2”, while the third image in the row below is referred to as “the 7th input” or “input 7”. Taking the second and seventh inputs as examples, both the overall images and the close-up views of the trees and pole-like objects show that our algorithm achieves results very close to the ground truth.
We reproduced the results of NetCalib [
13] and conducted a qualitative analysis comparing it with our DPCalib, as shown in
Figure 9. Overall, both our method and NetCalib [
13] achieve the goal of correcting the erroneous initial extrinsic parameters. However, from
Figure 9, especially in inputs 1–5, where pedestrians are present, and inputs 6–8, where pole-like objects are present, it can be observed that our method aligns detailed information more precisely. This is reflected in the metrics, where the estimation errors of our DPCalib are slightly lower than those of NetCalib [
13].
The results of Exp 2 are shown in
Table 4, where the maximum rotation error is set to ±10° and the maximum translation error is set to ±0.2 m. The setup of Exp 2 is similar to Exp 1, but the difference lies in the data split proportions as shown in
Table 1. In Exp 1, the training set proportion is much higher than those of the test and validation sets, allowing the network to receive more comprehensive training. Additionally, the error settings in Exp 2 are slightly smaller than in Exp 1, resulting in better overall performance. The results of Exp 2 are consistent with those of Exp 1: we observed that state-of-the-art methods based on parameter regression still suffer from significant pitch angle errors. Once again, the results of Exp 2 reaffirm the effectiveness of our method. The qualitative analysis of Exp 2 is shown in
Figure 9.
It is worth noting that both Exp 1 and Exp 2 are comparisons with parameter regression-based methods, while DXQ-Net [
31] adopts a different technical approach: it estimates pixel flow to find corresponding matching points and then derives the extrinsic parameters from those correspondences. However, because this method searches for pixel correspondences in the image, it is constrained by computational cost and can only handle small errors. Therefore, their experimental setup limits the maximum angular error to ±5° and the maximum displacement error to ±0.1 m, as specified in
Table 1.
Additionally, we reproduced the results of NetCalib in Exp 3. Being based on parameter regression, NetCalib exhibited errors and characteristics consistent with those in Exps 1 and 2: a deficiency in pitch angle perception and poor regression performance for translation errors. The experimental results of DPCalib outperformed those of NetCalib. This experimental setup, referred to as Exp 3, is shown in
Table 3.
The results of Exp 3 are shown in
Table 3. Although pixel flow-based and iterative methods do not encounter significant pitch angle errors, they are limited to handling small errors. This is because constructing the cost volume requires searching for matching points pixel by pixel between two images, which can incur significant computational costs if the search range is too large. Additionally, DXQ-Net [
31], through iterative updates, mitigates the problem caused by missing perceptual information to some extent. However, even so, our method still performs slightly better than DXQ-Net [
31] in estimating translation errors. It is worth noting that our method produces results in a single end-to-end output, whereas DXQ-Net [
31] iterates multiple times to obtain results. In this scenario, although our method may have a slight disadvantage in angular error metrics, it still maintains a marginal advantage in translation error. The qualitative analysis results of Exp 3 are depicted in
Figure 10. As shown in
Figure 10, the parameters estimated by our method can adjust the initially erroneous extrinsic parameters to be nearly consistent with the ground truth.
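A back-of-the-envelope sketch of the cost-volume argument above: the number of matching candidates, and hence the computational cost, grows quadratically with the search radius (the local-window formulation here is an illustrative assumption, not DXQ-Net's exact construction):

```python
def cost_volume_entries(h, w, radius):
    """Entries in a local cost volume with a (2r+1) x (2r+1) candidate window
    per pixel of an h x w image; cost grows quadratically in `radius`."""
    return h * w * (2 * radius + 1) ** 2
```

For a 512 × 256 image, doubling the search radius roughly quadruples the volume, which is why pixel-flow methods restrict themselves to small initial errors.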
Figure 11 depicts the histograms and error bars of the results for the three experiments. Each column corresponds to one experimental setting: the first row shows the histogram of translation errors, the second row the histogram of rotation errors, and the third row the error bars for both rotation and translation errors. The red segments mark the median of the experimental results, which is typically slightly lower than the average error. The “T”-shaped bars and the boxes represent, respectively, the maximum and minimum errors and half the distance from the median to those extremes. It can be observed that, although the input errors are randomly generated, the errors of our method remain tightly distributed around the mean, indicating that our method is stable and robust.
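The statistics behind such error bars can be computed as follows (a sketch of our reading of the plot description; the function name and dictionary keys are illustrative):

```python
import numpy as np

def error_bar_stats(errors):
    """Median, extrema, and half the median-to-extremum distances for a
    set of per-sample errors, as drawn in the error-bar plots."""
    errors = np.asarray(errors)
    med = np.median(errors)
    lo, hi = errors.min(), errors.max()
    return {"median": med, "min": lo, "max": hi,
            "half_low": (med - lo) / 2, "half_high": (hi - med) / 2}
```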
Additionally, we conducted a comparison and analysis of the real-time performance of our work against the successfully reproduced work, NetCalib [
13], serving as a reference. We selected three experimental environments—CPU (Intel i9-12900KF), GPU (NVIDIA RTX 3090), and onboard system (NVIDIA Jetson Orin)—to validate the real-time performance of our work. We conducted inference on 1000 samples from the dataset and calculated the average inference time. The results are shown in
Table 5.
As shown in
Table 5, our DPCalib achieves a running speed of 185.18 FPS on an RTX 3090 device and 33.36 FPS on the NVIDIA Jetson Orin edge development board, which is significantly better than NetCalib and meets the real-time requirements. However, it’s worth noting that the time shown in
Table 5 only reflects the inference time of the algorithm; data preprocessing steps such as image loading, depth estimation, and projection run at approximately 1.77 FPS (tested in an RTX 3090 and i9-12900KF environment). In the future, we plan to develop multithreaded tools to further enhance the real-time performance of the model.
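The timing protocol described above can be sketched as follows (`infer` is a placeholder for the model's forward pass; warm-up runs and GPU synchronization, which a rigorous benchmark would add, are omitted):

```python
import time

def benchmark(infer, samples):
    """Average wall-clock inference time over `samples`, reported as FPS."""
    start = time.perf_counter()
    for s in samples:
        infer(s)
    elapsed = time.perf_counter() - start
    avg_time = elapsed / len(samples)
    return 1.0 / avg_time  # frames per second
```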
Combining the above results, we conclude that our DPCalib outperforms other parameter regression-based methods, especially in estimating the pitch angle of the extrinsic parameters, and shows significant improvements in the other metrics compared with state-of-the-art methods. Compared with methods based on pixel-flow estimation, our DPCalib can handle larger errors and exhibits a clear advantage in estimating the translation components of the extrinsic parameters. In the section on ablation experiments, we demonstrate the effectiveness and necessity of each module.