1. Introduction
SLAM (Simultaneous Localization and Mapping) is a robotic technique employed when a robot operates in an unknown environment [1]. Using onboard sensors such as cameras, inertial sensors, and laser sensors, the robot captures information about its external surroundings [2,3]. These data are then used in real time to estimate the robot’s own pose and simultaneously construct a map of the surrounding environment. Cameras are compact, low-power, and inexpensive, and at the same time they capture a comprehensive set of environmental information. Consequently, visual SLAM, which adopts cameras as its sensors, has emerged as a prominent and actively pursued research direction [4].
Researchers have proposed several visual SLAM system frameworks [5], including VINS-Mono [6] and the ORB-SLAM series, which have shown promising results thus far [7]. Among them, the ORB-SLAM series has drawn considerable interest because of its strong stability and real-time performance. Mur-Artal et al. released the open-source ORB-SLAM2 in 2017, which offers three camera modes for greater versatility [8]. In 2021, they released ORB-SLAM3 with a multi-map system and a feature-based visual-inertial odometry that greatly enhances the system’s localization accuracy [9]. However, ORB-SLAM3 still has certain issues in practical situations [10]. Existing algorithms typically assume that the external environment is static [11], but moving objects in real-world environments cause errors in the data association of visual odometry, which degrades the accuracy of the SLAM algorithm [12]. As a result, studying SLAM algorithms in dynamic environments is crucial [13,14].
To improve the robustness and accuracy of SLAM systems in dynamic environments, many researchers have optimized algorithms based on ORB-SLAM [15]. DynaSLAM [16], proposed by Bescos et al., is a hybrid semantic segmentation and multi-view geometry method that segments dynamic objects by instance using a semantic segmentation network, which allows precise culling of feature points on dynamic objects and background repair. DS-SLAM, built on ORB-SLAM2, integrates a semantic segmentation network and greatly enhances localization accuracy in dynamic scenes by minimizing the impact of dynamic targets on the system through a motion consistency detection method [17]. DM-SLAM leverages both distributed and local RANSAC (Random Sample Consensus) to extract static features from dynamic scenes and incorporates them into the system’s initialization [18], based on the characteristics of both static and dynamic objects; in dynamic scenes, it performs better than conventional SLAM systems. Zhu et al. proposed a visual SLAM system that uses a deep feature matcher, extracting features from each image frame with a deep feature extractor in the visual odometry, together with a camera pose fusion estimation method [19]; it runs at 0.08 s per frame with low error. Wei et al. optimized the visual SLAM algorithm with semantic information based on the ORB-SLAM2 framework and added a dynamic region detection module to the visual odometry to improve the accuracy of the system’s pose estimation in dynamic environments. The image is then semantically segmented by the BiSeNetV2 network, dynamic objects are removed using the semantic information and dynamic regions, and finally a map containing semantic information is constructed [20].
Existing SLAM systems are susceptible to the influence of dynamic objects in dynamic scenes, which reduces their localization accuracy. In this research, we propose a SLAM system for dynamic environments that determines whether a feature point influences the system accuracy and whether it should be rejected or retained [21]. The system detects feature points on dynamic objects using an improved lightweight YOLOv5-based method [22]. The pose is then estimated from the retained feature points, and a dense point cloud map is produced to satisfy navigation and path planning requirements [23]. The experimental results demonstrate that the algorithm presented in this research improves system accuracy in dynamic environments while maintaining good real-time performance [24].
The rest of this paper is organized as follows. Section 2 describes the general architecture of the SLAM system used in this work, along with the principles and functions of each module. Section 3 covers the components of the object detection network and the improvement techniques in detail. Section 4 describes and analyzes the comparative experiments conducted to evaluate the system’s performance. Section 5 discusses the research of this paper. Finally, Section 6 summarizes the work and gives an outlook on future work.
2. System Description
The system framework of this paper, as shown in Figure 1, can be divided into five main components: the tracking thread, local mapping, loop closing, the dynamic object detection thread, and dense map building.
The tracking thread is in charge of tracking the position and orientation of the camera in real time. This component matches ORB (Oriented FAST and Rotated BRIEF) feature points to estimate the camera pose; the feature matching process consists of feature extraction, descriptor matching, and pose estimation.
The local mapping thread mainly handles keyframe insertion and deletion to create the local map. It also optimizes the data generated by the tracking thread to increase localization accuracy [25].
The loop closing thread periodically checks whether the camera has returned to a previously visited area, which would indicate a loop closure event. When a loop closure is detected, the thread initiates a loop closure correction.
The dynamic object detection thread uses the YOLOv5s network as its main body, replacing the YOLOv5s Backbone with MobileNetV3-Small to lighten the overall network structure [26].
To create the dense point cloud map, the dense mapping thread first converts the depth and image information into 3D point cloud data, then colors the point cloud using the RGB image’s color information and stitches the point clouds together using keyframes.
The tracking thread extracts ORB feature points from the input image. ORB is an algorithm that identifies and describes important feature points in an image. ORB feature extraction has two basic components: FAST (Features from Accelerated Segment Test) feature point detection and BRIEF (Binary Robust Independent Elementary Features) feature description.
FAST is a high-speed algorithm for detecting corner points in an image. To decide whether a pixel is a corner point, the FAST algorithm selects it as the center pixel and compares its intensity with those of the surrounding pixels. As shown in Figure 2, the center pixel p has a gray value of I_P, and the FAST algorithm selects a fixed number of surrounding pixels, usually 16. It then compares the gray values of these surrounding pixels with the gray value of the center pixel. If there are 12 consecutive points whose gray values satisfy Equation (1), the point is a corner point:

$$ \left| I_x - I_P \right| > T, \qquad (1) $$

where T is the threshold value and I_x represents the gray value of one of the surrounding 16 pixels.
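For illustration, the following is a minimal sketch of this corner test using OpenCV’s FAST implementation; the image path and threshold value are illustrative choices, not parameters taken from this paper.

```python
import cv2

# Load a frame in grayscale (path is illustrative).
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# FAST corner detector: 'threshold' plays the role of T in Equation (1);
# non-maximum suppression keeps only the strongest corner in each neighborhood.
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = fast.detect(img, None)

print(f"Detected {len(keypoints)} FAST corners")
```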
ORB uses the BRIEF algorithm to describe the features. BRIEF’s primary objective is to produce robust and efficient binary feature descriptors that can be used to match and identify important or distinctive points in an image. BRIEF selects a feature point, determines a T × T window around it, and then randomly selects n pairs of pixel points, each pair being compared as defined in Equation (2):

$$ \tau(p; x, y) = \begin{cases} 1, & p(x) < p(y) \\ 0, & \text{otherwise,} \end{cases} \qquad (2) $$

where p(x) and p(y) are the gray values at points x and y, respectively. Then, the descriptor of feature point p is defined as Equation (3):

$$ f_n(p) = \sum_{i=1}^{n} 2^{\,i-1}\, \tau(p; x_i, y_i). \qquad (3) $$
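As a short, hedged example of the extraction and binary matching described above, the sketch below uses OpenCV’s ORB (FAST detection plus a rotation-aware BRIEF descriptor with n = 256 point pairs); the frame paths and feature count are illustrative.

```python
import cv2

# Two consecutive grayscale frames (paths are illustrative).
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# ORB combines the FAST detector with a BRIEF-style binary descriptor.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches")
```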
One of the key components of the system is feature matching; the accuracy of the pose estimation can only be ensured when the feature points are matched accurately and efficiently. Once feature matching is finished, the camera pose must be solved, and the method in this work uses the ICP (Iterative Closest Point) algorithm to estimate the pose from 3D feature points. The principle is as follows: assume two known frames F1 and F2, where P = {p_1, ⋯, p_n} ∈ F1 and Q = {q_1, ⋯, q_n} ∈ F2 are a set of well-matched three-dimensional points. The camera pose is computed by SVD decomposition, and the rotation matrix R and translation vector t are made to satisfy Equation (4):

$$ p_i = R\, q_i + t, \quad i = 1, \dots, n. \qquad (4) $$
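The following is a minimal NumPy sketch of the generic SVD-based closed-form step that such an alignment typically uses (a Kabsch/ICP-style solution under the assumption that the matched point arrays are already given); it is not the paper’s exact implementation.

```python
import numpy as np

def estimate_pose_svd(P, Q):
    """Estimate R, t such that P ≈ R @ Q + t for matched 3D points.

    P, Q: (n, 3) arrays of matched points from frames F1 and F2.
    """
    # 1. Remove the centroids of both point sets.
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    P_c, Q_c = P - p_mean, Q - q_mean

    # 2. SVD of the 3x3 cross-covariance matrix.
    H = Q_c.T @ P_c
    U, _, Vt = np.linalg.svd(H)

    # 3. Rotation (with reflection correction) and translation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = p_mean - R @ q_mean
    return R, t
```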
The local mapping thread is responsible for optimizing keyframes through operations such as eliminating outliers, generating new map points, and removing redundant keyframes. The optimized keyframes are then passed to the loop closing thread. The loop closing thread takes the keyframe at the front of the buffer queue as the current keyframe and computes the similarity between it and candidate keyframes to determine whether a loop closure is present.
According to the depth and image information provided by the camera, the two-dimensional coordinate points are converted into three-dimensional point cloud data. X_i = [x, y, z]^T represents the coordinates of a spatial point, X = {x_1, x_2, ⋯, x_n} represents the set of spatial points, and the corresponding pixel coordinate of a known three-dimensional point is expressed in homogeneous form as [u, v, 1]^T. The spatial position of each point can be obtained according to the pinhole imaging model of the camera, Equation (5):

$$ z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left( R\, X_i + t \right), \qquad (5) $$

where K represents the camera intrinsic matrix; R represents the rotation matrix; t represents the translation vector; and z is the scale factor between the depth value and the actual spatial distance.
In this paper, the obtained point cloud data are processed with the PCL (Point Cloud Library). The coordinates p_ix, p_iy, and p_iz of a point p_i in the cloud can be calculated according to Equation (6):

$$ p_{iz} = \frac{d}{s}, \qquad p_{ix} = \frac{(u - c_x)\, p_{iz}}{f_x}, \qquad p_{iy} = \frac{(v - c_y)\, p_{iz}}{f_y}, \qquad (6) $$

where d is the depth value at pixel (u, v), s is the depth scale factor, f_x and f_y are the focal lengths, and c_x and c_y are the coordinates of the principal point, all given by the camera intrinsic matrix K.
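A minimal NumPy sketch of this back-projection is given below; the intrinsics and depth scale passed to the function would come from the camera calibration, and the default depth scale of 5000 is only an assumed, TUM-style placeholder.

```python
import numpy as np

def depth_to_pointcloud(depth, rgb, fx, fy, cx, cy, depth_scale=5000.0):
    """Back-project a depth image into a colored point cloud (Equation (6)).

    depth: (H, W) depth image; rgb: (H, W, 3) color image.
    depth_scale: factor between stored depth values and metres.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    z = depth.astype(np.float64) / depth_scale   # p_iz = d / s
    x = (u - cx) * z / fx                        # p_ix
    y = (v - cy) * z / fy                        # p_iy

    valid = z > 0                                # drop pixels with no depth
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid]
    return points, colors
```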
For every keyframe, a corresponding point cloud is created. All of the point clouds are then stitched together using the system’s keyframe position data, and the RGB image’s color information is utilized to give the point clouds their color, creating a globally dense point cloud map.
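The stitching step can be sketched as follows, assuming each keyframe provides a 4×4 camera-to-world pose T_wc together with its back-projected, colored points; the data layout here is illustrative, and the paper performs this step with PCL.

```python
import numpy as np

def merge_keyframe_clouds(keyframes):
    """Stitch per-keyframe point clouds into one global colored cloud.

    keyframes: iterable of (points, colors, T_wc), where points is (n, 3),
    colors is (n, 3), and T_wc is the 4x4 camera-to-world pose of the keyframe.
    """
    all_points, all_colors = [], []
    for points, colors, T_wc in keyframes:
        R, t = T_wc[:3, :3], T_wc[:3, 3]
        all_points.append(points @ R.T + t)   # transform into the world frame
        all_colors.append(colors)
    return np.vstack(all_points), np.vstack(all_colors)
```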
4. Experiment and Analysis
This paper uses the publicly available TUM dataset together with the KITTI dataset to validate the algorithm’s performance. The TUM dataset contains image sequences of indoor scenes captured with a Kinect camera together with ground-truth camera poses. Experiments are conducted on its highly dynamic sequences walking_xyz, walking_static, and walking_halfsphere to confirm the algorithm’s accuracy and robustness in real indoor environments. Sequences 01, 02, and 03 of the KITTI dataset, which comprises real image data gathered from urban, rural, and highway scenes, are used to confirm the accuracy and robustness of the algorithm in real outdoor scenes. A laptop running Ubuntu 20.04 with 16 GB of RAM serves as the experimental platform.
4.1. Experiments on Object Detection Models
The object detection network is trained on the VOC-2007 dataset, which covers 20 object classes including people, chairs, and cars. The training set consists of 7653 images; the network is trained for 1000 epochs with a batch size of 128, a learning rate of 0.0032, a momentum of 0.843, and a weight decay of 0.00036.
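For reference, training a YOLOv5s variant with these settings might be launched as sketched below; the dataset and model configuration file names are placeholders (in particular, the MobileNetV3 backbone config is hypothetical), and the learning rate, momentum, and weight decay would normally be set in the hyperparameter YAML rather than on the command line.

```python
import subprocess

# Launch YOLOv5 training from the YOLOv5 repository root.
# 'yolov5s-mobilenetv3.yaml', 'voc2007.yaml', and 'hyp.custom.yaml' are
# placeholder names; lr0=0.0032, momentum=0.843, weight_decay=0.00036
# would be written into hyp.custom.yaml.
subprocess.run([
    "python", "train.py",
    "--img", "640",
    "--batch-size", "128",
    "--epochs", "1000",
    "--data", "voc2007.yaml",
    "--cfg", "yolov5s-mobilenetv3.yaml",
    "--weights", "",
    "--hyp", "hyp.custom.yaml",
], check=True)
```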
Object detection is performed on the images of the walking_halfsphere sequence, and the parameter count, computation, and detection time of the different object detection networks are shown in Table 3. All tests are performed five times and the final results are averaged. Replacing the YOLOv5s Backbone feature extraction network with MobileNetV3 reduces the model parameters by 50.08%, the computation by 62.63%, and the detection time by 10.71%.
4.2. Indoor Pose Estimation Experiment
Rejecting every feature point that falls inside a dynamic detection box would discard static points and reduce the accuracy of the system. This paper therefore develops a set of decision rules to retain the static feature points inside dynamic detection boxes: a feature point is considered dynamic only if it lies inside a dynamic detection box and does not lie inside a static detection box.
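This retention rule can be summarized in a few lines; the box representation below, axis-aligned (x1, y1, x2, y2) rectangles from the detector, is an illustrative assumption rather than the paper’s exact data structure.

```python
def point_in_box(pt, box):
    """pt = (u, v) pixel coordinates; box = (x1, y1, x2, y2) detection rectangle."""
    u, v = pt
    x1, y1, x2, y2 = box
    return x1 <= u <= x2 and y1 <= v <= y2

def is_dynamic_point(pt, dynamic_boxes, static_boxes):
    """A feature point is treated as dynamic only if it falls inside a
    dynamic detection box and inside no static detection box."""
    in_dynamic = any(point_in_box(pt, b) for b in dynamic_boxes)
    in_static = any(point_in_box(pt, b) for b in static_boxes)
    return in_dynamic and not in_static

# Retained feature points are those not classified as dynamic, e.g.:
# kept = [p for p in feature_points if not is_dynamic_point(p, dyn_boxes, stat_boxes)]
```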
Figure 8 illustrates the effect of the culling: Figure 8a displays the image prior to dynamic feature point culling, and Figure 8b displays the image after culling. The man’s plaid shirt in the figure carries a large number of feature points that, if left unculled, would significantly lower the system’s accuracy. The method in this paper successfully identifies the dynamic objects in the image, rejects the feature points on them, and retains the feature points on static objects such as the computers.
Robustness in SLAM refers to the system’s capacity to withstand a range of errors and disturbances. The presence of dynamic objects in the dataset affects the accuracy of the system, so handling the dynamic sequences equally well demonstrates the system’s robustness. The Absolute Pose Error (APE) is the difference between the estimated and true poses and gives a direct assessment of the algorithm’s accuracy. The Relative Pose Error (RPE) is the difference in pose change over the same timestamps and is a useful tool for estimating system drift. After aligning the estimated and ground-truth poses using their timestamps, the pose change over time is computed for both, and the difference between these changes represents the pose error of the camera.
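As a rough illustration of the two metrics, the sketch below computes the translational APE and RPE for trajectories given as lists of time-aligned 4×4 poses; trajectory alignment and full tooling are normally handled by evaluation packages such as evo, so this is only a simplified version of the idea.

```python
import numpy as np

def ape_rmse(est_poses, gt_poses):
    """Translational Absolute Pose Error (RMSE) for time-aligned 4x4 poses."""
    errors = [np.linalg.norm(e[:3, 3] - g[:3, 3]) for e, g in zip(est_poses, gt_poses)]
    return float(np.sqrt(np.mean(np.square(errors))))

def rpe_rmse(est_poses, gt_poses, delta=1):
    """Translational Relative Pose Error (RMSE) over a fixed frame interval."""
    errors = []
    for i in range(len(gt_poses) - delta):
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        err = np.linalg.inv(rel_gt) @ rel_est   # residual relative motion
        errors.append(np.linalg.norm(err[:3, 3]))
    return float(np.sqrt(np.mean(np.square(errors))))
```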
Table 4 and Table 5 present comparative test results between ORB-SLAM3 and the algorithm proposed in this paper across three different sequences. The evaluation metrics include the root-mean-square error (RMSE), mean error (MEAN), and standard deviation (SD). The improvement ratio is calculated using Equation (7):

$$ \text{Improvement} = \frac{m - n}{m} \times 100\%, \qquad (7) $$

where m is the result obtained by ORB-SLAM3 and n is the result obtained by the algorithm in this paper.
From Table 5, it can be seen that after the dynamic feature points are removed, the algorithm in this paper improves the pose estimation accuracy compared with ORB-SLAM3. In terms of absolute trajectory error, the root-mean-square error in the walking_xyz sequence is improved by 80.02%, the mean error by 92.09%, and the standard deviation by 87.92%. In terms of relative trajectory error, the root-mean-square error in the walking_xyz sequence is improved by 71.71%, the mean error by 82.65%, and the standard deviation by 61.58%; the other two dynamic sequences also show significant improvements.
Figure 9a shows the camera trajectory of ORB-SLAM3 in the walking_xyz sequence, and Figure 9b shows the camera trajectory of our SLAM in the same sequence, where the solid line is the trajectory estimated by the system and the dashed line is the ground-truth trajectory of the camera. Figure 9c compares the trajectories of ORB-SLAM3 and our SLAM with the ground truth in xyz mode, and Figure 9d compares them in rpy mode. From the figure, it can be seen that the trajectory computed by ORB-SLAM3 has a large error, while the trajectory computed by our SLAM is closer to the ground truth.
Figure 10a shows the APE comparison between ORB-SLAM3 and our SLAM in the walking_xyz sequence, and Figure 10b shows the RPE comparison in the same sequence. Our SLAM performs significantly better than ORB-SLAM3 in APE and better than ORB-SLAM3 in RPE. It can be seen that, compared with ORB-SLAM3, the algorithm in this paper has less error in dynamic environments and higher accuracy and robustness.
Table 6 displays the absolute trajectory errors of the algorithm in this paper compared with the DS-SLAM and DynaSLAM algorithms. The table shows that in the walking_xyz and walking_static sequences, the algorithm in this study achieves a lower root-mean-square error than the other algorithms, while DS-SLAM and DynaSLAM are comparable in these two sequences. In the walking_halfsphere sequence, DynaSLAM performs better than DS-SLAM with respect to standard deviation and root-mean-square error, and the method presented in this research comes in second. Comparing the errors of this work with similar methods over the different dataset sequences, it performs reasonably well overall.
This paper compares the system with ORB-SLAM3, DS-SLAM, and DynaSLAM to validate the system’s real-time performance.
Table 7 shows that ORB-SLAM3 has the shortest tracking time per frame because it does not process the dynamic objects in the scene. However, among the systems that handle comparable dynamic scenes, the per-frame tracking time of the system in this paper outperforms both DS-SLAM and DynaSLAM.
Table 8 presents the frame-rate comparison between the system in this paper and ORB-SLAM3, DS-SLAM, and DynaSLAM. ORB-SLAM3 achieves the highest frame rate of 37 frames per second because it does not detect dynamic objects. Among the systems that handle dynamic objects, the system described in this paper has the highest frame rate, sustaining over 33 frames per second while the algorithm runs, which demonstrates its improved real-time performance.
4.3. Outdoor Pose Estimation Experiment
Table 9 presents the absolute trajectory errors on the KITTI dataset for the algorithm presented in this paper, DS-SLAM, and DynaSLAM. The table shows that the algorithm in this paper achieves a lower root-mean-square error than the other algorithms in the 01 and 02 sequences, where DS-SLAM and DynaSLAM perform similarly. In the 03 sequence, DS-SLAM performs best in terms of standard deviation and root-mean-square error, while DynaSLAM performs poorly. Overall, the errors of this paper’s method across the outdoor dataset sequences compare favorably with similar algorithms.
4.4. Dense Map Building Experiment
In this paper, the walking_xyz sequence, in which two men walk around a computer desk, is selected as the image sequence for dense map construction. If the dynamic objects are not eliminated before building the map, as shown in Figure 11a, they affect the map construction and leave a large number of ghosting artifacts of the people in the map, which seriously degrades the dense map. The algorithm in this study creates the map depicted in Figure 11b, removing the impact of the dynamic objects and producing a clean map suitable for robot navigation and path planning.