1. Introduction
High-precision simultaneous localization and mapping (SLAM) plays a critical role in fields such as autonomous robot navigation, self-driving cars, and drone control. Achieving high-precision SLAM requires accurate localization, meaning that the intelligent agent must know its exact position. Currently, precise localization methods rely widely on the Global Navigation Satellite System (GNSS). However, in complex urban environments with tall buildings and dense tree cover, GNSS signals can be blocked or degraded, creating GNSS-denied areas. Therefore, in GNSS-denied urban environments, SLAM requires high-precision localization methods that do not depend on GNSS.
Odometry is a measurement method that uses sensor data to estimate pose changes over time. Traditional odometry relies on a single sensor, such as an inertial measurement unit (IMU), a visual camera, or light detection and ranging (LiDAR). However, because of the limitations of each individual sensor, single-sensor odometry is prone to significant cumulative localization errors [1]. As the number of sensors onboard intelligent agents increases and sensor technology advances, dual- and multi-sensor fusion odometry can combine the advantages of multiple sensors to achieve higher localization accuracy.
In two-sensor fusion odometry, the mainstream approaches are visual–inertial odometry (VIO) and LiDAR–inertial odometry (LIO), both of which exploit the IMU's fast data updates and high autonomy [2]. VIO is lightweight overall and primarily uses feature matching to estimate the agent's pose between frames. The authors of [3] developed monocular and stereo visual–inertial SLAM systems that use oriented FAST and rotated BRIEF (ORB) feature extraction and matching [4] to compute camera pose changes between adjacent frames, improving pose estimation accuracy and robustness through loose coupling with IMU data. The authors of [5], building on the work of [3], tightly coupled visual and IMU information using SuperPoint, a convolutional neural network-based method for feature point extraction and descriptor matching [6], further enhancing the accuracy and robustness of pose estimation. The authors of [7] employed sliding window optimization and factor graph optimization, demonstrating good performance in dynamic environments and under high-speed motion.
Although VIO offers advantages such as lower device cost, higher autonomy, and better concealment, its performance can degrade significantly under poor lighting because of its high sensitivity to environmental conditions. LiDAR, in contrast, can obtain precise distance information even in harsh environments [8], thus providing more accurate self-pose estimation. Research on LiDAR and its coupling into LIO is relatively mature [9]. The studies in [7,10] both used sliding window optimization to align each frame with a local map, ensuring that each frame established a clear constraint relationship with the first frame. The authors of [11], building on [10], constructed a local map by stitching together the latest keyframes and updated the local map based on pose information. Several studies [12,13,14] used iterated error-state Kalman filtering as the filtering framework, performing state updates through iterative solutions to achieve more precise pose estimation.
LiDAR can provide precise distance information with high resolution and reliability, but its performance deteriorates in dynamic environments and severely degrades in feature-poor environments, such as open plains or long, straight corridors. By combining visual cameras, LiDAR, and IMUs to form laser–visual–inertial odometry (LVIO), the advantages of each sensor can be leveraged to effectively mitigate the limitations of individual sensors.
The authors of [14] constructed a tightly coupled factor graph model that integrates measurements from visual cameras, LiDAR, and IMUs, treating primitives as landmarks and tracking them across multiple scans, thereby enhancing robustness in complex scenarios. The authors of [15] built on the LIO system of [13] and the VIO system of [7], combining the two through tight coupling and incorporating high-frequency odometry filtering; this enabled the proposed algorithm to perform dense 3D mapping over large indoor and outdoor areas. The works in [16,17], based on [15], retained the LIO component, separated the VIO into inter-frame tracking and frame-to-map tracking, and updated the iterated extended Kalman filtering process to achieve more accurate map construction.
However, complex urban environments are filled with moving vehicles and pedestrians. Achieving high-precision SLAM in such dynamic environments remains a significant challenge for current LVIO algorithms, because the movement of environmental objects can lead to the misidentification of surrounding obstacles and severely affect keyframe extraction and inter-frame tracking. To adapt to real-world dynamic environments, we draw on the excellent performance of neural networks in dynamic object recognition [18,19]. Traditional neural network structures are complex, computationally demanding, and difficult to integrate with SLAM algorithms. Lightweight neural networks balance dynamic recognition performance with computational efficiency, and combining them with LVIO algorithms can achieve high-precision SLAM in dynamic, complex environments.
4. Dynamic Removal
4.1. Principle of Dynamic Object Removal
Dynamic removal is divided into four stages: dynamic object detection, potential dynamic point detection, actual dynamic point detection, and dynamic point removal. Dynamic object detection identifies dynamic objects in image frames using the lightweight neural network YOLOv5. In this study, targeting the complex urban environment, dynamic objects are defined as vehicles and pedestrians. If dynamic pedestrians or vehicles appear in an image frame, the detector outputs a visual recognition area for each of them. These visual recognition areas are projected onto the corresponding LiDAR frame to obtain the corresponding LiDAR recognition areas, and all points within these LiDAR recognition areas are treated as potential dynamic points, which are stored in the form of an ikd-Tree [20].
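As an illustration of this projection step, the following is a minimal sketch (not the authors' implementation) of how a 2D detection box could be mapped onto a LiDAR scan, assuming a pinhole camera model with a known intrinsic matrix `K` and a LiDAR-to-camera extrinsic transform `T_cam_lidar`; all function and variable names are hypothetical.

```python
import numpy as np

def potential_dynamic_mask(points_lidar, bbox, K, T_cam_lidar):
    """Return a boolean mask of LiDAR points whose image projection
    falls inside a YOLOv5 detection box (i.e., potential dynamic points).

    points_lidar : (N, 3) points in the LiDAR frame.
    bbox         : (u_min, v_min, u_max, v_max) detection box in pixels.
    K            : (3, 3) camera intrinsic matrix.
    T_cam_lidar  : (4, 4) extrinsic transform from LiDAR to camera frame.
    """
    n = len(points_lidar)
    # Transform the scan into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Only points in front of the camera can project into the image.
    in_front = pts_cam[:, 2] > 0.1

    # Pinhole projection of the valid points onto the image plane.
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Test whether each projection lands inside the detection box.
    u_min, v_min, u_max, v_max = bbox
    inside = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))

    mask = np.zeros(n, dtype=bool)
    mask[np.flatnonzero(in_front)[inside]] = True
    return mask
```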
For the potential dynamic points within a single LiDAR recognition area, this paper employs the Euclidean space clustering algorithm to identify the actual dynamic points. Euclidean space clustering is a hierarchical algorithm based on Euclidean distance that groups points that are spatially close to each other, and its result can be represented as a dendrogram in which each node corresponds to a cluster. The steps are as follows: First, a merge distance threshold $d_{th}$ is set and all sample points are traversed. The Euclidean distance between two points $p_i$ and $p_j$ in an $n$-dimensional space is defined as $d(p_i, p_j) = \sqrt{\sum_{k=1}^{n}\left(p_{i,k} - p_{j,k}\right)^{2}}$. During the traversal, if $d(p_i, p_j) < d_{th}$, the points $p_i$ and $p_j$ are merged into the same cluster. This process is repeated until no further merging is possible. The advantage of the Euclidean space clustering algorithm lies in its simplicity and high efficiency on small-scale datasets; however, it is sensitive to noise and less efficient on large-scale data. In this study, the algorithm is applied within LiDAR recognition areas, where the number of points requiring classification is relatively small, making it well suited to this efficient processing approach and meeting the real-time requirements of dynamic odometry.
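The clustering rule can be sketched compactly as follows, assuming a small array of 3D points and a threshold `d_th` (names hypothetical); the sketch implements the pairwise merge described above with a union-find structure rather than an explicit dendrogram.

```python
import numpy as np

def euclidean_clustering(points, d_th):
    """Group points so that any two points closer than d_th end up in the
    same cluster (single-linkage merging, as described above). Quadratic in
    the number of points, which is acceptable for the small point counts
    inside one LiDAR recognition area."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Path-compressing root lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair (p_i, p_j) with d(p_i, p_j) < d_th.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < d_th:
                parent[find(i)] = find(j)

    # Label each point by the root of its cluster.
    return np.array([find(i) for i in range(n)])
```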
The steps of the Euclidean space clustering algorithm used for dynamic point removal are as follows. For the potential dynamic points within a single LiDAR recognition area, Euclidean space clustering is used to identify the actual dynamic points. First, all potential dynamic points $P_i$ within the area, with spatial coordinates $(x_i, y_i, z_i)$, are collected. All potential dynamic points are traversed to identify the point closest to the center of the LiDAR recognition area, denoted $P_c$. The merge distance threshold $d_{th}$ is set, and all potential dynamic points are traversed again: when $d(P_i, P_c) < d_{th}$, the point $P_i$ is marked as an actual dynamic point and removed from the stored LiDAR data. After the traversal, the dynamic removal for that frame is complete, as shown in Figure 3. The orange rectangle represents the visual recognition area, which identifies the specific type of dynamic object and provides a confidence score. The red rectangle denotes the LiDAR recognition area. Green dots indicate LiDAR scan points within the current image frame, red dots represent identified dynamic points, and blue dots are potential dynamic points currently being processed by Euclidean space clustering. As shown in Figure 3, the dynamic removal method eliminates the main dynamic components of moving vehicles, leaving background points that can be used for mapping.
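The per-area removal step itself reduces to a few lines; the sketch below assumes the potential dynamic points and the area center are given as NumPy arrays and follows the seed-and-threshold rule described above (variable names are hypothetical).

```python
import numpy as np

def remove_dynamic_points(area_points, area_center, d_th):
    """Split the potential dynamic points of one LiDAR recognition area
    into retained (static) points and removed (actual dynamic) points.

    area_points : (N, 3) potential dynamic points P_i = (x_i, y_i, z_i).
    area_center : (3,)   center of the LiDAR recognition area.
    d_th        : merge distance threshold.
    """
    # Seed the dynamic cluster with the point closest to the area center.
    p_c = area_points[np.argmin(np.linalg.norm(area_points - area_center, axis=1))]

    # Points within d_th of the seed are marked as actual dynamic points.
    dynamic = np.linalg.norm(area_points - p_c, axis=1) < d_th

    return area_points[~dynamic], area_points[dynamic]
```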
4.2. Dynamic Object Removal Strategy in Real-World Environments
In real urban environments, in addition to vehicles moving on the road, there are often stationary vehicles parked along the roadside. Like the road itself and the surrounding buildings, these stationary vehicles can provide accurate information for odometry, but the information they contribute is limited: compared with flat road surfaces and building facades, static vehicles have only a minor influence on localization accuracy. If we attempted to distinguish truly dynamic vehicles from static ones in real-world environments and remove only the dynamic vehicles, the computational cost would increase significantly, placing a substantial burden on an algorithm that already runs a lightweight neural network.
Given these considerations, we remove all detected vehicles, both static and dynamic. Although this sacrifices some information from static vehicles, it achieves the primary goal of eliminating the dynamic vehicles that most affect real-world road mapping.
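As a simple illustration of this strategy, detections can be filtered purely by class label, with no motion check; the COCO class names below are those used by standard YOLOv5 models, while the detection tuple layout and the confidence threshold are assumptions.

```python
# Vehicles and pedestrians are removed regardless of whether they are moving,
# per the strategy above (no dynamic/static discrimination is attempted).
REMOVABLE_CLASSES = {"person", "bicycle", "car", "motorcycle", "bus", "truck"}

def select_removal_detections(detections, conf_th=0.5):
    """Keep detections whose class should be removed from the point cloud.
    Each detection is assumed to be a (class_name, confidence, bbox) tuple."""
    return [d for d in detections
            if d[0] in REMOVABLE_CLASSES and d[1] >= conf_th]
```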
The results of subsequent simulation experiments confirm the feasibility of our strategy. As shown in Figure 4, when comparing the mapping performance of the FAST-LIVO algorithm with the improved dynamic odometry on sequence 05 of the KITTI dataset [21,22], we observe a noticeable improvement in the mapping quality of the middle of the road. The FAST-LIVO algorithm is significantly affected by dynamic vehicles when mapping around corners, resulting in substantial deviations when re-entering the middle of the road for the second time. In contrast, the improved dynamic odometry, after removing dynamic objects, shows a clear improvement when re-entering the middle of the road for the second time.
Figure 5a provides a more intuitive observation of the improvement in mapping accuracy from a global trajectory perspective. Before encountering dynamic vehicles around corners, the difference in the mapping accuracies between FAST-LIVO and the dynamic odometry is minimal. However, after encountering dynamic vehicles at corners, the dynamic odometry shows a significant improvement in mapping accuracy compared to FAST-LIVO.
5. Simulation Experiments
To validate the performance of the proposed dynamic odometry algorithm in both indoor and real-world dynamic environments, we conducted tests using the KITTI and NTU VIRAL datasets. The NTU VIRAL dataset, collected by a small drone, represents indoor environments. The KITTI dataset, collected by a car equipped with multiple sensors, represents real urban road dynamic environments. The simulations in this study were conducted on an Ubuntu 20.04 system with an Intel i7-1087H processor manufactured by Intel (USA) and an RTX 2060 GPU manufactured by NVIDIA (Santa Clara, CA, USA).
The data in this study were processed offline, in a post-processing manner, through simulation tests on publicly available datasets. The experimental data were collected by the dataset publishers on their own experimental platforms and are stored as "rosbag" files containing data from the various sensors.
5.1. Real-World Urban Road Dynamic Environment Simulation Validation
The KITTI dataset [21,22] was collected using a platform that includes two grayscale cameras, two color cameras, one 3D LiDAR, and a GPS navigation system that provides accurate ground-truth trajectories. The specific sensor models and parameters are as follows: the grayscale and color cameras were PointGrey Flea2 cameras manufactured by PointGrey (Richmond, BC, Canada), with a resolution of 1392 × 512 pixels, providing grayscale and color visual information at an actual publishing frequency of 10 Hz; the 3D LiDAR was a Velodyne HDL-64E manufactured by Velodyne (San Jose, CA, USA), with an effective range of up to 100 m, providing LiDAR point cloud data at an actual publishing frequency of 10 Hz. The dataset contains real-world image data from urban, rural, and highway scenes and includes dynamic vehicles and pedestrians, making it suitable for evaluating the algorithm proposed in this paper.
The dynamic odometry algorithm proposed in this paper was validated on sequences 05, 07, and 09 of the KITTI dataset. These sequences feature complex paths and markedly dynamic environments. Figure 4 shows the mapping results of the proposed algorithm compared with the FAST-LIVO algorithm on sequence 05; the results reveal a noticeable improvement in mapping accuracy with the improved algorithm.
In the figure, green dots represent LiDAR points used for mapping after removing dynamic points, while orange-red dots indicate the removed dynamic points. Figure 4e zooms in on the vehicle removal section within the blue box of Figure 4b, providing a detailed view of the vehicle detection and removal process. In the zoomed-in view of Figure 4c, vehicles driving through the T-intersection area create trailing effects during mapping, and Figure 4d presents these trailing effects more intuitively from the driver's perspective. Dynamic objects scanned by LiDAR are perceived as continuous obstacles, so spurious obstacles appear in the middle of the road, severely affecting the mapping accuracy of traditional algorithms. The improved algorithm removes these trailing effects and retains static LiDAR points for mapping, thereby enhancing mapping accuracy.
A comparison of mapping accuracy between the improved algorithm and the FAST-LIVO algorithm on sequences 05, 07, and 09 is shown in Table 1. The evaluation metrics include the maximum distance error, minimum distance error, and RMSE. The global and axial trajectory comparisons are shown in Figure 5 and Figure 6. The comparison data indicate that the improved algorithm achieves significantly better mapping accuracy metrics. The original algorithm failed to map sequence 07, whereas the improved algorithm successfully mapped it with high accuracy. We define $e_{F,i}$ as the mapping RMSE of FAST-LIVO on the $i$-th dataset sequence, $e_{D,i}$ as the mapping RMSE of our improved dynamic odometry on the same sequence, and the average RMSE reduction over $N$ sequences as $\overline{\Delta}_{\mathrm{RMSE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{e_{F,i}-e_{D,i}}{e_{F,i}} \times 100\%$. The proposed method reduced the mapping RMSE by 72.73% and 57.50% in sequences 05 and 09, respectively. For sequence 07, where the original algorithm failed to map, the improved algorithm achieved high mapping accuracy.
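For reference, the averaging step can be written as a short computation; the sketch below takes per-sequence RMSE pairs (which would come from Table 1; no values are hard-coded here) and returns the average percentage reduction. The function name is hypothetical.

```python
def average_rmse_reduction(rmse_fastlivo, rmse_dynamic):
    """Average relative mapping-RMSE reduction over a set of sequences.

    rmse_fastlivo, rmse_dynamic : equal-length lists of per-sequence RMSEs
    for FAST-LIVO and the improved dynamic odometry, respectively.
    """
    reductions = [(e_f - e_d) / e_f
                  for e_f, e_d in zip(rmse_fastlivo, rmse_dynamic)]
    return 100.0 * sum(reductions) / len(reductions)
```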
5.2. Indoor Small-Scale Environment Simulation Validation
The NTU VIRAL dataset [23] is a public indoor, small-scale environment dataset for autonomous drone navigation. The data were collected by a small drone equipped with two 3D LiDARs, two hardware-synchronized global shutter cameras, multiple IMUs, and several ultra-wideband (UWB) ranging units, with the UWB providing precise pose information. The specific sensor models and parameters are as follows: the IMU was the VectorNav VN100 rugged IMU manufactured by VectorNav (Dallas, TX, USA), which provides angular velocity and acceleration measurements at an effective actual publishing frequency of up to 385 Hz; the 3D LiDAR was a 16-channel OS1 gen1 laser scanner manufactured by Ouster (San Francisco, CA, USA), which provides LiDAR point cloud data at an actual publishing frequency of 10 Hz; the stereo camera was the uEye 1221 LE monochrome global shutter camera manufactured by IDS (Obersulm, Germany), with a resolution of 752 × 480 pixels, providing visual information at an actual publishing frequency of 10 Hz; and the UWB system was the Humatics P440 manufactured by Humatics (Waltham, MA, USA), with two anchor points set by the authors to measure the true trajectory of the drone, at an actual publishing frequency of 10 Hz.
The dynamic odometry algorithm proposed in this study was validated on all sequences of the NTU VIRAL dataset, with the eee_03 sequence used as a specific example for detailed analysis. Figure 7 illustrates the mapping results of the proposed algorithm on the eee_03 sequence; the blue box corresponds to the trees in the scene, while the green and red boxes correspond to the pillars. The figure shows that the improved algorithm effectively captures details such as trees, building supports, and roads, providing a fairly complete reconstruction of the overall environment.
Figure 8 shows the global trajectory, axial trajectory, and axial error variation of the improved algorithm on the eee_03 sequence. As depicted in Figure 8, the predicted global trajectory aligns closely with the ground truth, with minimal error. In the axial trajectories, the X- and Y-axes closely match the ground truth with minimal error, while the Z-axis trajectory shows slight jitter at 40 s and 85 s. In the axial error variation plot, the errors for all axes fluctuate within 0.2 m. The experimental results in Figure 8 indicate that the improved algorithm performs well in indoor small-scale mapping, maintaining high positioning accuracy even after several minutes of flight covering hundreds of meters.
Table 2 compares the mapping accuracy of the improved algorithm and FAST-LIVO across all sequences of the NTU VIRAL dataset, with the root mean square error (RMSE) as the evaluation metric. As shown in Table 2, although the NTU VIRAL dataset does not contain dynamic objects, the improved algorithm consistently outperforms FAST-LIVO in mapping accuracy across all sequences. This improvement indicates that packaging sensor data into computational units enhances positioning accuracy: by sorting the input data by time, the algorithm processes the IMU, image, and LiDAR data more consistently, improving data synchronization and reducing errors caused by time delays. The mapping accuracy is thereby improved from the decimeter level to the centimeter level.
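A minimal sketch of what such packaging could look like is given below, assuming time-stamped (time, payload) message streams; the `ComputeUnit` structure and all names are hypothetical and only illustrate the sort-by-time grouping described above, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Msg = Tuple[float, Any]  # (timestamp in seconds, sensor payload)

@dataclass
class ComputeUnit:
    """All IMU, image, and LiDAR messages belonging to one processing step,
    each stream sorted by timestamp."""
    imu:    List[Msg] = field(default_factory=list)
    images: List[Msg] = field(default_factory=list)
    lidar:  List[Msg] = field(default_factory=list)

def package_unit(imu_msgs, image_msgs, lidar_msgs, t_start, t_end):
    """Collect the messages with t_start < t <= t_end into one unit,
    sorted by time so that downstream processing sees a consistent order."""
    def pick(msgs):
        return sorted((m for m in msgs if t_start < m[0] <= t_end),
                      key=lambda m: m[0])
    return ComputeUnit(imu=pick(imu_msgs),
                       images=pick(image_msgs),
                       lidar=pick(lidar_msgs))
```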
The method for calculating the average RMSE reduction in this section is the same as in the previous section, with the average RMSE reduction being 60.29% in the eee01–03 sequences, 60.71% in the nya01–03 sequences, and 69.86% in the sbs01–03 sequences.
6. Conclusions
In this study, based on the FAST-LIVO algorithm, we devised a multi-sensor fusion dynamic odometry method using the YOLOv5 neural network. Our approach enhances the alignment in LIO by packaging sensor data into computational units, better leveraging the advantages of batch processing and parallel computing, thereby improving the real-time performance of the odometry. It ensures more consistent processing of IMU, image, and LiDAR data, enhancing data synchronization and reducing errors caused by time delays. In terms of dynamic removal, a lightweight neural network is utilized to remove dynamic points at the LiDAR level, classifying the dynamic points within the detection box as potential dynamic points and actual dynamic points. The actual dynamic points are removed, while the potential dynamic points are retained for mapping.
The proposed dynamic odometry method achieved better mapping results than the FAST-LIVO algorithm on both the KITTI and NTU VIRAL datasets, with consistently improved mapping accuracy.