1. Introduction
With the development of intelligent vehicles, the demand for environmental perception and precise localization is increasing. Visual Simultaneous Localization and Mapping (SLAM), as a vision-based localization and mapping method, holds broad application prospects for intelligent vehicles [1]. Compared to GPS-based methods, stereo Visual SLAM can achieve stable localization and mapping in GNSS-denied scenes, providing precise and stable localization for autonomous driving [1,2]. Compared to Lidar-based SLAM, stereo cameras can acquire the precise scale of a scene at lower cost. Moreover, stereo cameras capture abundant environmental texture and exhibit more stable performance in structured environments [3].
Generally, stereo Visual SLAM systems assume the camera moves with six degrees of freedom (DoF); therefore, pose estimation is formulated on SE(3). However, intelligent vehicles are subject to stricter motion constraints: the vehicle must adhere to the road, which degrades its DoF [4]. Consequently, in practical applications, the assumption of full 3D pose estimation conflicts with the approximately planar motion of the vehicle, inevitably introducing additional pose estimation errors [5,6].
To address this problem, the most direct approach is to add constraints that limit the DoF of the system. There are two ways to add such constraints. The first is to integrate additional sensors into the system for data fusion [7]. For instance, fusing a camera with an Inertial Measurement Unit (IMU) uses inertial data to further constrain the pose, improving estimation accuracy; typical solutions include OKVIS [8] and VINS-Mono [9]. However, when applied to ground vehicles, such methods suffer from degraded IMU observability under uniform linear motion or start-stop motion, which reduces overall system performance [10,11,12]. The second is to exploit the prior that the vehicle adheres to the road, introducing constraints between the road and the vehicle to improve pose estimation without additional sensors [13,14,15]. However, this approach first requires a parameterized model that accurately represents the road. Since the road is difficult to measure directly with sensors [14], it is often assumed to be a single infinite plane [16], or road parameters are obtained indirectly from low-dimensional features such as points [13] or lines [17].
Road modeling methods based on the infinite-plane assumption have been widely applied in indoor scenes, effectively improving localization accuracy. In outdoor environments, however, an infinite plane cannot accurately represent the road manifold, and the incorrect assumption may even introduce additional system errors [4]. Therefore, in road scenes, methods based on feature-point fitting are commonly used to express the road manifold [18,19,20]. These methods fit the road to a planar model [18,19] or a curved-surface model [20] and use the fitted model to constrain the vehicle pose. However, road feature points are affected by low texture and self-similarity. When their depth is recovered from stereo disparity, they exhibit greater depth uncertainty than non-road feature points. Consequently, the spatial accuracy of road feature points is lower, making them unsuitable for direct use in vehicle pose estimation and road model fitting [18,21,22].
This paper proposes an optimization-based stereo Visual SLAM system combined with road constraints, focusing on two key aspects: maximizing the utilization of road features and incorporating the vehicle's motion on the road. First, a method employing the homography of local road planes (LRPs) is proposed to extract the parameters of local roads. This method approximates the local road as discrete planes and uses 2D-2D matching results of road features from previous keyframes to estimate the LRPs of the current keyframe via homography. Because this process does not rely on the depth of road features, it circumvents the uncertainty introduced by stereo matching of road features. By explicitly establishing constraints between vehicle poses and roads, errors arising from the six-DoF motion assumption are minimized without any additional sensors. Second, to avoid depth uncertainty when using road feature points, reprojection constraints for non-road feature points and epipolar constraints for road feature points are applied jointly to estimate the motion of the vehicle. Finally, a nonlinear optimization model based on graph optimization is developed, which jointly optimizes vehicle trajectories, LRPs, and map points, enhancing the accuracy and robustness of the system.
The contributions of this paper are fourfold:
A tightly coupled graph optimization framework in which explicit constraints between the vehicle and local road planes are established; the framework jointly optimizes the keyframe poses of the vehicle trajectory, the Local Road Planes (LRPs), and the map points;
A homography-based method for extracting local road plane parameters that leverages the 2D-2D matched road feature points from previous keyframes, mitigating the impact of the depth uncertainty of road features on local road plane estimation;
A motion estimation method for road scenes that employs epipolar constraints on road feature points to estimate rotation free of depth errors, and reprojection constraints on non-road feature points to estimate both rotation and translation; joint optimization through bundle adjustment enhances the robustness and precision of motion estimation;
A full SLAM system that builds a global map containing map points and local road planes; extensive validation on multiple datasets and real-world experiments demonstrates its superior performance over state-of-the-art Visual SLAM and Visual-inertial SLAM methods in road scenes.
The rest of the paper is organized as follows: Section 2 discusses related background research. Section 3 introduces the notation and different plane models. Section 4 presents an overview of the entire system and its individual modules. Section 5 details the experimental setup and results with analysis. Finally, Section 6 concludes.
4. Proposed Method
4.1. System Overview
The pipeline of the proposed system is shown in Figure 2. The system consists of three main parts: the front-end, local road modeling, and the back-end. There are two common ways to obtain the road area: clustering [45,46] and semantic segmentation [47]. The system takes as input stereo images together with left semantic images whose road masks are produced by a semantic segmentation network [47]. The output includes the vehicle poses and a global map containing map points and local road planes.
Similar to many Visual SLAM systems [24,25], the front-end processes stereo images through feature extraction, stereo matching, inter-frame feature matching, and motion estimation in sequence. Keyframe selection relies on pose estimation and inter-frame co-visibility. Within the front-end, features are categorized into road and non-road features based on the semantic image, and a more stringent inter-frame matching approach is applied specifically to the road features.
Local road modeling runs in parallel with the back-end and represents the local road as discrete planes. Whenever a new keyframe is generated, the module fits the corresponding local road plane for that keyframe. During plane fitting, measurements of the local road and pose estimates from previous keyframes are used to compute the homography from which the plane parameters of the new keyframe are obtained.
The back-end consists of two parts: Local Bundle Adjustment (LBA) and Loop Correction. Within LBA, road constraints on the vehicle are enforced from two perspectives, and the vehicle trajectory, local road planes, and map points are jointly optimized, leading to more accurate pose estimates and maps. Loop Correction executes when the system detects a loop closure: it performs global optimization on the vehicle trajectory, local road planes, and map points using the loop closure constraint, rectifying the accumulated pose drift.
4.2. Front-End
In the front-end, feature points are extracted from the stereo images. Subsequently, feature points from the left and right images are stereo-matched, and the inter-frame matched feature points are used to calculate the camera’s pose changes.
4.2.1. Feature Extraction and Stereo-Matching
For the input, an image pyramid is first constructed for both the left and right images to ensure feature scale invariance. To ensure an even distribution of feature points across the image, each level of the pyramid is subdivided into grid cells of 60 × 60 pixels. Within each grid cell, ORB features and descriptors [24] are extracted until the number of feature points reaches a preset threshold or no qualifying features remain. After feature extraction, the depth of each feature point is recovered from the stereo-matching results between the left and right images; stereo matching searches along the epipolar line within the same pixel row, and sub-pixel optimization is then applied to obtain more accurate depths.
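To make the grid-based bucketing concrete, the following is a minimal sketch using OpenCV; the per-cell quota and the overall feature budget are illustrative assumptions, and pyramid handling is delegated to OpenCV's internal ORB scale levels rather than managed explicitly as in the text.

```python
import cv2

def extract_grid_orb(image, cell=60, per_cell=5):
    """Detect ORB keypoints, bucket them into fixed-size grid cells, and
    keep only the strongest responses per cell for an even distribution."""
    orb = cv2.ORB_create(nfeatures=5000)  # budget is an illustrative choice
    keypoints = orb.detect(image, None)

    # Bucket keypoints by (row, col) of the 60x60-pixel cell they fall in.
    buckets = {}
    for kp in keypoints:
        key = (int(kp.pt[1] // cell), int(kp.pt[0] // cell))
        buckets.setdefault(key, []).append(kp)

    # Keep the strongest `per_cell` keypoints in each cell.
    kept = []
    for cell_kps in buckets.values():
        cell_kps.sort(key=lambda k: k.response, reverse=True)
        kept.extend(cell_kps[:per_cell])

    # Compute descriptors only for the retained keypoints.
    kept, descriptors = orb.compute(image, kept)
    return kept, descriptors
```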
As mentioned earlier, stereo matching of road features tends to produce larger disparity errors than for other features, increasing the uncertainty of their depth estimates. Moreover, as shown in Figure 3, road feature points may be extracted from shadows on the road, which lack temporal invariance and are unsuitable for inclusion in the map. For these two reasons, road feature points are not added to the map in the proposed system; instead, local road planes are estimated from the 2D road features, and these stable plane features are integrated into the map.
4.2.2. Feature Tracking
For non-road feature points, inter-frame matching uses reprojection for 3D-2D feature tracking. The 3D feature points from the previous frame are projected onto the current image; an association gate of fixed radius is centered on each reprojected point, and features falling within the gate are matched to establish inter-frame correspondences.
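A minimal sketch of this gated 3D-2D association, assuming ORB binary descriptors, a pinhole intrinsic matrix K, and illustrative values for the gate radius and descriptor threshold:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary (uint8) descriptors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def gated_match(points3d_prev, desc_prev, kps_curr, desc_curr,
                R, t, K, radius=15.0, max_hamming=50):
    """Project last frame's 3D points into the current frame and match
    each one against features inside a fixed-radius association gate."""
    kp_xy = np.float32([kp.pt for kp in kps_curr])
    matches = []
    for i, P in enumerate(points3d_prev):
        Pc = R @ P + t                       # point in current camera frame
        if Pc[2] <= 0:                       # behind the camera
            continue
        uv = (K @ Pc)[:2] / Pc[2]            # reprojected pixel position
        cand = np.where(np.linalg.norm(kp_xy - uv, axis=1) < radius)[0]
        if cand.size == 0:
            continue
        dists = [hamming(desc_prev[i], desc_curr[j]) for j in cand]
        best = int(np.argmin(dists))
        if dists[best] < max_hamming:
            matches.append((i, int(cand[best])))
    return matches
```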
However, for features on the road, the large depth error shifts the reprojected point, lowering the matching success rate and accuracy. To obtain accurately matched road feature points, a 2D-2D matching approach is proposed for road features in consecutive left images. The pseudo-code of the road feature matching process is shown in Algorithm 1. In lines 1–9, coarse matching of road feature points is performed based on descriptor distance: the Hamming distance between descriptors of road features from consecutive frames is computed, and feature pairs whose distance is smaller than a preset threshold are stored as matching candidates along with their distances. In lines 10–24, a refined matching step is performed. Owing to the self-similarity of the road, a feature point often matches multiple feature points in the next frame after coarse matching. To achieve a globally optimal match, the Hungarian method is adopted with the reciprocal of the descriptor distance as the weight, yielding the globally minimum-cost fine matching. To further eliminate outliers, in lines 25–40, Random Sample Consensus (RANSAC) repeatedly selects a minimal sample set to estimate an initial epipolar geometry model; each matching pair is then checked against the epipolar constraint and classified as an inlier or outlier. If the inlier ratio meets the preset requirement, all outliers are removed and the inliers are used in subsequent processing.
Algorithm 1: 2D Road Features Matcher
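The body of Algorithm 1 is summarized in the text above. A minimal, illustrative sketch of its three stages follows (not the authors' exact pseudo-code): coarse Hamming matching, globally optimal assignment via the Hungarian method (scipy's solver minimizes total descriptor distance, equivalent in intent to maximizing the reciprocal-distance weights described above), and a RANSAC epipolar check via OpenCV; all thresholds are illustrative.

```python
import numpy as np
import cv2
from scipy.optimize import linear_sum_assignment

def match_road_features(kps_prev, desc_prev, kps_curr, desc_curr,
                        max_hamming=50, min_inlier_ratio=0.6):
    """Three-stage 2D-2D road feature matching between consecutive frames."""
    # Stage 1 (lines 1-9): coarse matching by descriptor Hamming distance.
    dist = np.bitwise_xor(desc_prev[:, None, :], desc_curr[None, :, :])
    dist = np.unpackbits(dist, axis=2).sum(axis=2).astype(float)
    cost = np.where(dist < max_hamming, dist, 1e6)  # forbid weak pairs

    # Stage 2 (lines 10-24): globally optimal one-to-one assignment.
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e6]
    if len(pairs) < 8:
        return []

    # Stage 3 (lines 25-40): RANSAC epipolar-geometry outlier rejection.
    pts_prev = np.float32([kps_prev[r].pt for r, _ in pairs])
    pts_curr = np.float32([kps_curr[c].pt for _, c in pairs])
    F, mask = cv2.findFundamentalMat(pts_prev, pts_curr,
                                     cv2.FM_RANSAC, 1.0, 0.99)
    if F is None or mask is None:
        return []
    inliers = [p for p, ok in zip(pairs, mask.ravel()) if ok]
    if len(inliers) / len(pairs) < min_inlier_ratio:
        return []
    return inliers
```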
4.2.3. Motion Estimation
Because non-road feature points have accurate spatial information while road feature points do not, 3D-2D reprojection constraints and 2D-2D epipolar constraints are established separately for the two types of feature points when estimating inter-frame motion. For non-road feature points, the reprojection error is used to construct constraints:

$$ e_{\text{proj}} = p_{k+1} - \frac{1}{s}\,K\left(R\,P_{k} + t\right), $$

where $P_{k}$ and $p_{k+1}$ are the matched 3D feature and 2D feature in the two frames, $K$ is the camera intrinsic matrix, and $s$ is the depth of $P_{k}$ in the current frame. This constraint can simultaneously estimate the rotation and translation of the vehicle. For road feature points, only the 2D-2D matched features in consecutive frames are used to construct epipolar constraints, avoiding the errors caused by the depth uncertainty of these points. The epipolar constraint describes the geometric relation formed by the 2D observations and the camera optical centers when the same feature is projected onto images from two different viewpoints. The epipolar line can be represented as:

$$ l_{k+1} = E\,x_{k} = \left[t\right]_{\times} R\,x_{k}, $$

where $x_{k}$ is the observation of the road feature point on the normalized plane of the $k$th frame and $E = [t]_{\times}R$ is the essential matrix.
According to [21], the distance between the matched feature point in the $(k{+}1)$th image and the epipolar line is taken as the epipolar error, and the rotation and translation between the two frames are adjusted to minimize it. The distance from the matched feature point in the $(k{+}1)$th image to the epipolar line constructed above is:

$$ e_{\text{epi}} = \frac{x_{k+1}^{\top} E\,x_{k}}{\sqrt{l_{1}^{2} + l_{2}^{2}}}, $$

where $x_{k+1}$ is the matched point of $x_{k}$ in the $(k{+}1)$th image and $l_{1}$, $l_{2}$ are the first two elements of the epipolar line $l_{k+1}$. Because 2D-2D matching does not involve scale, this constraint is used to estimate only the rotation of the vehicle. In contrast to [21], where the epipolar errors are computed for all points, here the epipolar constraint is applied only to road features while non-road points continue to use the reprojection error, which reduces information loss. The Jacobian of the error term with respect to the rotation perturbation $\delta\boldsymbol{\phi}$ is

$$ \frac{\partial e_{\text{epi}}}{\partial \delta\boldsymbol{\phi}} = -\frac{x_{k+1}^{\top}\left[t\right]_{\times} R \left[x_{k}\right]_{\times}}{\sqrt{l_{1}^{2} + l_{2}^{2}}}. $$
Integrating both constraints through their error functions, the following comprehensive optimization problem is formulated:

$$ \{R^{*}, t^{*}\} = \arg\min_{R,\,t}\left(\sum_{m}\left\|e_{\text{proj}}^{m}\right\|^{2} + \sum_{n}\left\|e_{\text{epi}}^{n}\right\|^{2}\right), $$

which is minimized with a nonlinear optimization method. By handling road and non-road feature points differently, this strategy fully utilizes the characteristics of road feature points while enhancing the robustness and precision of motion estimation.
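The two residuals can be sketched numerically as follows, using the notation of the reconstructed equations above; the choice of pixel-space reprojection residuals versus normalized-plane epipolar residuals mirrors the text, while robust weighting is omitted:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]x with skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def reprojection_residual(P_k, p_next, R, t, K):
    """3D-2D residual for a non-road point: project the 3D point P_k into
    the next frame and compare with the matched pixel observation p_next."""
    Pc = R @ P_k + t                 # point expressed in the next camera frame
    uv = (K @ Pc)[:2] / Pc[2]        # pinhole projection to pixels
    return p_next - uv               # 2-vector residual

def epipolar_residual(x_k, x_next, R, t):
    """2D-2D residual for a road point: signed distance from the matched
    normalized observation x_next (homogeneous [x, y, 1]) to the epipolar
    line l = E x_k, with E = [t]x R."""
    E = skew(t) @ R
    l = E @ x_k
    return float(x_next @ l) / np.hypot(l[0], l[1])
```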
4.3. Local Road Modeling
When a new keyframe is detected in the front-end, the local road where the keyframe is located must be estimated. Because roads have continuous shape and small gradient changes, they can be approximated as planes within small-scale areas; a road can therefore be divided into a series of discrete planes. Based on this analysis, a plane model is used to represent the local road.
When the current camera frame is selected as a keyframe, its initial position is determined from the front-end motion estimate. Given the known mounting position of the camera, the projection of the camera onto the road is obtained from the camera-road extrinsics, identifying the road area in which the keyframe is located, shown as the gray region in Figure 4. The size of this region is empirically set to 6 m in length and 4 m in width.
Once the local road area of the vehicle is determined, the previous keyframes that can observe this plane must be found. To improve search efficiency, candidate keyframe groups are determined from the positions of previous keyframes and the field of view (FOV) of the camera. Road feature point pairs lying on the local road plane are then selected from these groups, and the two keyframes with the most point pairs are identified. Note that when selecting road feature points, their 3D information is used solely to confirm whether they lie on the plane to be fitted; the subsequent estimation of the road plane uses only their 2D information.
After obtaining the matched feature point pairs as above, the local road plane is fitted. In contrast to the method in [4], which uses 3D feature points for plane fitting, only the 2D feature points from previous frames and their matching relationships are used here, preventing the influence of the depth errors of road points. A similar method [48] uses 2D feature points to construct a homography between two keyframes; however, it uses this constraint to optimize poses, whereas the proposed method employs the homography to optimize planes. As shown in Figure 4, for previous keyframes $i$ and $j$, some road features lie on the local road plane where the current frame is positioned; in other words, some matched pairs of road points between these two keyframes fall on the same road plane. According to epipolar geometry, points lying on the same plane can be constrained using a homography:

$$ s\,x_{j} = H\,x_{i}, \qquad H = R_{ji} - \frac{t_{ji}\,n^{\top}}{d}, $$

where $s$ is the scale, $x_{i}$ and $x_{j}$ are the observations of the road feature point on the normalized planes of the $i$th and $j$th frames, $H$ is the homography, $R_{ji}$ and $t_{ji}$ are the rotation and translation from the $i$th keyframe to the $j$th keyframe, both of which are known, and $\boldsymbol{\pi} = (n^{\top}, d)^{\top}$ is the local road plane in the $i$th camera frame. In this homography constraint, the variables are the parameters of the plane and the rotation and translation between the two frames. Given the known pose changes between historical keyframes, the plane parameters can be extracted from the homography. The homography constraint for the $k$th matched point pair is:

$$ s_{k}\,x_{j}^{k} = H\,x_{i}^{k}. $$

Expanding this constraint and eliminating the scale $s_{k}$ with a cross product, the homography constraint can be expressed as

$$ \left[x_{j}^{k}\right]_{\times} H\,x_{i}^{k} = \mathbf{0}. $$
Accurate matching relationships of road points are obtained from the front-end. Considering the sensitivity of the epipolar constraint to noise, and to reduce the impact of outliers on plane estimation, the 4-point method with RANSAC is used to select all inliers for computing $H$. After obtaining $H$, the parameters of the plane $\boldsymbol{\pi}$ are calculated as the initial value for the subsequent optimization. Similar to [18], the homography error can be expressed as

$$ e_{\text{hom}}^{k} = \left[x_{j}^{k}\right]_{\times}\left(R_{ji} - t_{ji}\,\boldsymbol{\eta}^{\top}\right) x_{i}^{k}, \qquad \boldsymbol{\eta} = \frac{n}{d}. $$

Given that the rotation $R_{ji}$ and translation $t_{ji}$ between the $i$th and $j$th keyframes have been estimated in the previous sliding window, the optimization is performed solely on the plane. Using $\operatorname{vec}(ABC) = (C^{\top} \otimes A)\operatorname{vec}(B)$, the Jacobian of the error term with respect to the plane parameters $\boldsymbol{\eta}$ is

$$ \frac{\partial e_{\text{hom}}^{k}}{\partial \boldsymbol{\eta}} = -\left(\left(x_{i}^{k}\right)^{\top} \otimes \left[x_{j}^{k}\right]_{\times}\right)\left(I_{3} \otimes t_{ji}\right), $$

where $\otimes$ is the Kronecker product and $I_{3}$ is the identity matrix.
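Because the homography constraint above is linear in $\boldsymbol{\eta}=n/d$ once $R_{ji}$ and $t_{ji}$ are fixed, the plane initialization can be sketched as a simple least-squares solve over the inlier pairs (the RANSAC wrapper and the subsequent nonlinear refinement are omitted); this is a sketch under the stated plane convention, not the authors' exact implementation:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fit_plane_from_homography(x_i, x_j, R_ji, t_ji):
    """Recover eta = n / d (so that H = R_ji - t_ji @ eta^T) from matched
    normalized observations x_i, x_j (each (N, 3)) with known relative pose.

    From [x_j]x (R - t eta^T) x_i = 0 we get, per pair,
        ([x_j]x t) (x_i^T eta) = [x_j]x R x_i,
    which stacks into an overdetermined linear system A eta = b."""
    A, b = [], []
    for xi, xj in zip(x_i, x_j):
        Sx = skew(xj)
        A.append(np.outer(Sx @ t_ji, xi))   # 3x3 block per pair
        b.append(Sx @ (R_ji @ xi))          # 3 equations per pair
    eta, *_ = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)
    return eta   # plane is n^T X + d = 0 with eta = n / d
```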
4.4. Back-End
When a keyframe is detected in the front-end, it is passed to the back-end. There, a Local Bundle Adjustment (LBA) optimizes all keyframes within the sliding window along with the corresponding map points and local road planes. Simultaneously, the system checks whether the current keyframe forms a loop closure with previous keyframes stored in the map; if the loop closure conditions are satisfied, a global BA is executed for loop correction.
4.4.1. Local Bundle Adjustment
In LBA, the optimized variables include the poses of all keyframes in the sliding window, as well as all the map points and local road planes corresponding to these keyframes.
Figure 5 shows the factor graph of the proposed Local Bundle Adjustment with points and road planes. The LBA incorporates four constraint types: reprojection constraints linking keyframes to non-road points; epipolar constraints connecting keyframes to road points; constraints associating the vehicle with Local Road Planes (LRPs); and homography constraints between preceding keyframes and LRPs. These constraints form the basis for constructing the error functions, enabling the concurrent optimization of poses, landmarks, and LRPs. The error functions are expressed in least-squares form and solved iteratively using the Gauss-Newton method from the g2o solver [49], with a maximum of 10 iterations. The optimization goal in LBA is to minimize the following loss function:
$$ F = \sum \left\|e_{\text{proj}}\right\|^{2}_{\Omega_{p}} + \sum \left\|e_{\text{epi}}\right\|^{2}_{\Omega_{e}} + \sum \left\|e_{\text{hom}}\right\|^{2}_{\Omega_{h}} + \sum \left\|e_{\text{LRP}}\right\|^{2}_{\Omega_{v}}, $$

where $e_{\text{proj}}$ is the projection error, $e_{\text{epi}}$ is the epipolar error, $e_{\text{hom}}$ is the homography error, and $e_{\text{LRP}}$ is the error between a keyframe and its LRP; $\Omega_{p}$, $\Omega_{e}$, $\Omega_{h}$, and $\Omega_{v}$ are the information matrices corresponding to the four types of errors. The $e_{\text{proj}}$ and $e_{\text{epi}}$ terms are as in Section 4.2.3, and $e_{\text{hom}}$ is described in Section 4.3.
The $e_{\text{LRP}}$ term constrains the position of the vehicle through the local road plane. Based on the attachment between the vehicle and local road planes, where the vehicle should be in complete contact with the road, constraints can be established between the camera and the road plane. In contrast to [4], the proposed method does not rely on the direct contact between the road plane and the four wheels. Instead, it leverages the extrinsics between the camera and the vehicle's body frame: the camera pose is transformed to the body frame, establishing a single-point constraint with the road plane. Compared to four-point constraints, the single-point constraint minimizes the impact of errors in the plane normal and of changes in the extrinsics. The error is represented as:

$$ e_{\text{LRP}} = \boldsymbol{\pi}_{k}^{\top}\, T_{i}\, O_{b}, $$

where $T_{i}$ is the pose of the $i$th keyframe, $O_{b}$ is the homogeneous form of the origin of the vehicle body frame in the camera frame, and $\boldsymbol{\pi}_{k} = (n_{k}^{\top}, d_{k})^{\top}$ is the $k$th plane in the map, corresponding to the local road plane of the vehicle at the $i$th keyframe. The Jacobian of the error term with respect to the pose perturbation $\delta\boldsymbol{\xi}_{i}$ is

$$ \frac{\partial e_{\text{LRP}}}{\partial \delta\boldsymbol{\xi}_{i}} = \begin{bmatrix} n_{k}^{\top} & -\,n_{k}^{\top}\left[\left(T_{i} O_{b}\right)_{1:3}\right]_{\times} \end{bmatrix}, $$

where $[\cdot]_{\times}$ is the skew-symmetric matrix of the transformed point $\left(T_{i} O_{b}\right)_{1:3}$. The Jacobian of the error term with respect to $\boldsymbol{\pi}_{k}$ is

$$ \frac{\partial e_{\text{LRP}}}{\partial \boldsymbol{\pi}_{k}} = \left(T_{i} O_{b}\right)^{\top}. $$
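A compact sketch of this single-point contact residual and the two Jacobians derived above, under the stated conventions (homogeneous plane $\boldsymbol{\pi}_{k}=(n_{k}^{\top}, d_{k})^{\top}$, left se(3) perturbation ordered translation-then-rotation); the symbol names are assumptions consistent with the reconstruction:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def lrp_residual(T_i, O_b, pi_k):
    """Scalar residual e = pi_k^T (T_i @ O_b): signed distance of the
    body-frame origin (mapped through keyframe pose T_i) from plane pi_k."""
    return float(pi_k @ (T_i @ O_b))

def lrp_jacobians(T_i, O_b, pi_k):
    """Analytic Jacobians of the residual: 1x6 w.r.t. the pose perturbation
    and 1x4 w.r.t. the plane parameters."""
    p = (T_i @ O_b)[:3]          # transformed contact point
    n = pi_k[:3]                 # plane normal
    J_pose = np.hstack([n, -n @ skew(p)])   # [n^T, -n^T [p]x]
    J_plane = T_i @ O_b                     # (T_i O_b)^T
    return J_pose, J_plane
```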
4.4.2. Loop Correction
Loop Correction and LBA run in parallel in the back-end. When a new keyframe is detected, the proposed system, similar to other Visual SLAM systems [24,25], performs loop closure detection using a Bag-of-Words (BoW) model based on DBoW2 [49] to identify whether the current frame forms a loop with previous keyframes in the map. If a loop closure is accepted, a global Bundle Adjustment (BA) is executed to rectify the drift accumulated within the loop. During global optimization, points, keyframes, and local road planes are optimized simultaneously.
Figure 6 shows local road planes before and after loop correction. The poses of keyframes and local road planes are simultaneously adjusted to achieve a more accurate map.
After loop closure, local road planes may overlap. Numerous repeated planes in the map waste computational resources and storage. The key to plane fusion is determining whether two planes should be merged: excessive fusion produces planes that do not conform to the shape of the road, while insufficient fusion fails to resolve plane overlaps. The Intersection over Union (IoU), commonly used in object detection, is adopted to measure the overlap ratio of planes. By first checking the distance between plane centers, the likelihood of overlap can be assessed quickly. If the IoU of two overlapping planes is greater than 0.5, the planes are fused. During fusion, the feature-point matches obtained from local road modeling are used to recompute the parameters of the fused plane, and the size of the fused plane is the union of the two planes.
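As a sketch of the fusion test, treating each plane's footprint as an axis-aligned rectangle in the road frame (a simplification of the fixed-size patches of Section 4.3; the (xmin, ymin, xmax, ymax) tuples are hypothetical):

```python
def rect_iou(a, b):
    """IoU of two axis-aligned rectangles given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def maybe_fuse(rect_a, rect_b, iou_thresh=0.5):
    """Return the union footprint when two plane footprints overlap enough
    (IoU > 0.5); the fused plane parameters would then be re-estimated from
    the pooled feature matches as described in Section 4.3."""
    if rect_iou(rect_a, rect_b) <= iou_thresh:
        return None
    return (min(rect_a[0], rect_b[0]), min(rect_a[1], rect_b[1]),
            max(rect_a[2], rect_b[2]), max(rect_a[3], rect_b[3]))
```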
5. Experiments
RC-SLAM was evaluated using two publicly available datasets, KITTI-360 [50] and KITTI [51], along with a real-world dataset collected by a physical test platform. These datasets were selected for the diversity and complexity of their scenes as well as their widespread usage. For reference, the open-source Visual SLAM systems ORB-SLAM2 [24] and OV2SLAM [26], and the Visual-inertial SLAM system ORB-SLAM3 [25], were also tested on these datasets. Ablation experiments were performed on the KITTI-360 dataset to evaluate the two proposed constraints within RC-SLAM. To minimize the impact of randomness, each system was run five times consecutively on every sequence. Both the proposed system and the three open-source systems were run on a computer equipped with an Intel i7-11700 CPU at 3.6 GHz.
The systems were evaluated using two metrics: Absolute Trajectory Error (ATE) [52] and Relative Pose Error (RPE) [53]. ATE assesses the global consistency of the system by computing the root mean square error (RMSE) between the estimated trajectory and the ground truth. RPE consists of the relative translation error and the relative rotation error; it describes the local accuracy within fixed time intervals and is suitable for assessing the drift of the system. Note that the coordinate systems of each estimated trajectory and the ground truth must be aligned before evaluation; the Umeyama algorithm [54] was used for this alignment.
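For reference, a self-contained sketch of the Umeyama similarity alignment followed by the ATE RMSE on (N, 3) position arrays; this follows the standard closed-form solution rather than any particular evaluation toolbox:

```python
import numpy as np

def umeyama_align(est, gt):
    """Closed-form similarity transform (s, R, t) aligning `est` to `gt`."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    Sigma = G.T @ E / len(est)              # cross-covariance
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((E ** 2).sum() / len(est))
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """ATE: RMSE of position errors after similarity alignment."""
    s, R, t = umeyama_align(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```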
5.1. KITTI-360 Dataset
The KITTI-360 dataset offers nine sequences with ground truth. These sequences encompass various scenes, from low-speed driving in urban areas to high-speed driving on busy highways. The dataset contains multiple sensor data, including a stereo color camera operating at 10 Hz with a baseline of 0.6 m, two fish-eye cameras with a 180-degree FOV, a 64-line Lidar, and an OXTS3003 GPS/IMU unit. The rectified stereo images and the provided ground truth were used in the experiments. Notably, the ground truth in this dataset is obtained through large-scale optimization over OXTS measurements, laser scans, and multi-view images; compared to the KITTI dataset, it is considered more accurate and reliable, which is why the ablation experiments were performed on this dataset. However, ground-truth poses are not available for every frame in all sequences; some frames, such as the first frame of each sequence, lack ground truth. Hence, seven fragments with continuous ground truth from seven sequences were selected to evaluate each algorithm, with the camera frames at the beginning and end of each fragment specified. For a direct comparison, loop correction was enabled for all systems.
Table 1 shows the ablation experiments conducted by gradually adding the proposed methods. The proposed system was divided into three parts. VSLAM serves as the baseline: a basic Visual SLAM relying solely on stereo images, without any special treatment of road features. VSLAM+EG adds road features to VSLAM and enforces 2D-2D epipolar constraints on road feature points, while non-road feature points still use 3D-2D reprojection constraints. VSLAM+EG+LRC is the complete proposed system, which further builds local road planes and adds constraints between the road planes and the vehicle on top of VSLAM+EG. The results show an improvement in the relative rotation error after adding the epipolar geometry constraints, which comes from mitigating the influence of the depth uncertainty of road feature points on rotation estimation. However, 2D road feature points carry no scale, so the epipolar constraints cannot directly constrain the translation of the vehicle, resulting in minimal differences in the relative translation error compared to the baseline. After adding the local road plane constraints, the translation metrics improve further, for two reasons. The direct reason is that the system establishes a local road plane to constrain the motion of the vehicle, so the motion estimate conforms more closely to the real physical conditions, reducing vertical drift. The indirect reason is that local road modeling uses observations from previous frames, strengthening the correlation between the current frame and previous frames and thus enhancing the inter-frame scale consistency of the system.
Table 1 also reports the evaluation of the proposed system against the three open-source systems. The proposed system attained the best results across most sequences. Overall, it outperformed both ORB-SLAM2 and OV2SLAM, two stereo Visual SLAM systems; moreover, in terms of global trajectory accuracy, it shows advantages over the stereo visual system that fuses an IMU. Compared with ORB-SLAM2 and OV2SLAM, the proposed system improves the relative translation error in sequences 00, 02, 04, 05, and 06, suggesting that the road constraint effectively reduces inter-frame drift. There is also a slight improvement in the relative rotation error, because the proposed system utilizes more road feature points close to the vehicle, providing more effective constraints for rotation estimation. Compared to ORB-SLAM3, which fuses stereo vision and an IMU, the proposed system places greater emphasis on constraints from roads and previous frames, enhancing global scale consistency. Therefore, in terms of ATE, RC-SLAM with a stereo camera achieves slightly better performance than the IMU-integrated ORB-SLAM3.
For a more intuitive comparison between the proposed system and the three comparative systems after alignment with the ground truth, Figure 7 and Figure 8 show the estimated trajectories together with the ground truth for KITTI-360 sequences 00 and 02. It is evident that the scale of RC-SLAM is closer to the ground truth than that of the comparative methods, consistent with the smaller absolute trajectory error of the proposed method shown in Table 1. Figure 9 further compares RC-SLAM and the comparative methods in the y position against the ground truth for KITTI-360 sequence 02. The proposed system better matches the ground truth in the vertical direction, substantiating that road constraints can reduce the vertical drift of vehicle motion.
5.2. KITTI Dataset
The KITTI Odometry dataset comprises 11 urban driving scenes with ground truth, including highways, urban streets, and residential areas. It includes data from stereo cameras (color and grayscale), Lidar, and IMU. In this experiment, rectified stereo color images were used, captured by a stereo camera with a baseline of 0.54 m, resolution of 1392 × 512 pixels, and frequency of 10 Hz. The high-precision ground truth of the vehicle generated by an OXTS3003 GPS/IMU unit was employed to evaluate the trajectories of SLAM systems. On the KITTI dataset, the proposed system and comparative system were tested under two conditions: with loop correction disabled and enabled. As the KITTI dataset contains loop closure scenes in sequences 00, 02, 05, 08, and 09, the performance of each method was assessed with loop correction in these five sequences.
Table 2 presents the results of the proposed method, ORB-SLAM2, OV2SLAM, and ORB-SLAM3 without loop correction. The proposed system obtains results similar to those on the KITTI-360 dataset: across both the relative translation and rotation errors, it achieves the best results in most sequences. We attribute this to the proposed local road plane constraint and the epipolar constraint for road features, which enhance the accuracy of rotation and translation estimation. Compared to ORB-SLAM3, which fuses an IMU, the proposed system exhibits higher errors in scenes where ORB-SLAM3 initializes smoothly (such as sequences 03 and 05); however, it obtains similar or even better results in the other sequences. These outcomes suggest that while IMU-fused systems demonstrate increased accuracy, prolonged or failed IMU initialization degrades the entire SLAM system in some scenes. In contrast, the proposed system, independent of other sensors, explicitly expresses the physical constraints between the vehicle and the road, enhancing accuracy and robustness. As illustrated in Figure 10 and Figure 11, the trajectories estimated by the proposed system closely align with the ground truth. In sequence 03, the proposed system obtains a lower error in the y position than the three comparative systems.
Table 3 shows the experimental results of RC-SLAM, ORB-SLAM2, OV2SLAM, and ORB-SLAM3 on sequences 00, 02, 05, 06, 08, and 09 of the KITTI dataset with loop correction. All four systems detected loops and underwent loop correction in these six sequences. Loop correction effectively mitigates accumulated trajectory drift, yielding more consistent and accurate overall trajectories; consequently, there is a notable improvement in ATE for all four systems. Although the proposed system achieves the best ATE only in sequence 08, it consistently ranks second-best in the other five sequences. This demonstrates the effectiveness of using local plane features to represent road characteristics, strengthening the proposed system during global Bundle Adjustment (BA); combined with loop closure's ability to constrain all frames within the loop, the global consistency of the system is further improved. Figure 12 compares the estimated trajectories of RC-SLAM and the comparative systems against the ground truth with loop closure enabled, showing higher consistency between the proposed method and the ground truth across the entire trajectory. This is because, within shorter frame sequences, local road plane features also contribute to inter-frame constraints, enhancing inter-frame scale consistency; combined with loop correction, the proposed system achieves better consistency.
5.3. Real-World Experiments
Data in real-world scenes were gathered by a data collection vehicle. Four sequences were captured on campus; Sequence 01 includes a loop closure scene, while the other three do not. The vehicle is equipped with a stereo color camera with a baseline of 0.2 m, a resolution of 1280 × 720 pixels, and a frame rate of 30 Hz. It also carries an Xsens MTI-300 IMU operating at 200 Hz, a LiDAR operating at 10 Hz, and a Bynav GNSS/IMU unit. Additionally, wheel speed and steering angle are acquired from the vehicle's CAN bus. Data from all sensors are recorded with a data logger. The extrinsics among the sensors and the intrinsics of the stereo camera were calibrated before the experiments. The data collection vehicle and its sensors are shown in Figure 13. In this experiment, we used images from the stereo color camera, with the output of the Bynav GNSS/IMU unit, after coordinate transformation and time synchronization, serving as ground truth. Notably, unlike the stereo cameras employed in KITTI and KITTI-360, the baseline of the camera on the data collection vehicle is only 0.2 m, so the real-world data reflect the performance of the systems with a smaller-baseline camera.
Table 4 shows the experimental results of RC-SLAM, ORB-SLAM2, OV2SLAM, and ORB-SLAM3 on the real-world dataset captured in the campus environment. All systems detected loop closures and performed loop corrections in Sequence 01; the estimated trajectories and the ground truth are shown in Figure 14. The proposed system exhibited the minimum ATE in Sequences 00, 01, and 02, and achieved a near-optimal result in Sequence 03. As shown in Figure 15, this indicates that the proposed system achieves better global consistency even with a small-baseline camera, a performance that still depends on the local road plane constraint. However, with the reduced camera baseline, the number of nearby feature points in the front-end diminishes significantly, unavoidably degrading rotation estimation. Nevertheless, RC-SLAM's 2D matching of road feature points recovers more nearby feature points; combined with the epipolar constraints, this yields more accurate rotation estimation. Consequently, the proposed method also achieved the best relative rotation error in Sequences 01 and 03, and the second-best in Sequence 02.
6. Conclusions
In this work, a stereo Visual SLAM system with road constraints based on graph optimization was proposed for intelligent vehicles. First, the proposed system fully utilizes the matched road feature points between keyframes to construct epipolar constraints, which avoids the impact of the depth uncertainty of road feature points on the system and thereby achieves more accurate rotation estimation. Second, the system uses observations of the local road corresponding to the current keyframe from previous keyframes to estimate the parameters of the local road plane, and establishes constraints on the vehicle based on this plane. Finally, the system obtains precise vehicle poses and global maps by jointly optimizing vehicle trajectories, LRPs, and map points through nonlinear optimization. The ablation experiments demonstrate that the two road constraints in the system, the epipolar constraints and the local road constraints, effectively reduce errors arising from the six-DoF motion assumption of the vehicle. In comparisons with state-of-the-art Visual SLAM and Visual-inertial SLAM on the KITTI-360 and KITTI datasets, the proposed system achieved more accurate vehicle trajectories without additional sensors. Finally, the proposed system was further validated in real-world experiments. In future work, the system needs to be tested in more real-world road scenes. Moreover, the numerous dynamic objects on the road affected localization and mapping during the experiments; to address this problem, dynamic SLAM is a worthwhile research direction.