This section describes the proposed SLAM system in detail; Figure 1 gives an overview. The current frame first undergoes an initial pose prediction through Gaussian tracking. It is then evaluated to determine whether it qualifies as a keyframe for the scene. For each keyframe, a new sub-Gaussian is constructed from the point cloud projected from the RGB-D image and its pose. This sub-Gaussian enters the sliding sub-Gaussian window and transitions to the fine-tuning state. To maintain a constant window size, the earliest sub-Gaussian awaiting fine-tuning is removed from the window and united with the already fine-tuned Gaussian map. Meanwhile, a related keyframe window is determined based on the pose of the current keyframe and its visible region. The keyframe is subsequently fine-tuned across this related keyframe window, optimizing its pose and the Gaussian map. Certain scene editing operations are performed simultaneously. Over time, these keyframes progressively densify the representation of the entire scene, ultimately yielding a dense 3D scene map.
Next, we provide a detailed explanation of each component of the system. Section 3.1 introduces the 3DGS-based scene representation for rendering and the differentiable splatting rasterized rendering method. Section 3.2 explains the 3DGS-based tracking process and keyframe selection strategy. In Section 3.3, we introduce the sliding sub-Gaussian window and the related keyframe selection algorithm in detail. Finally, Section 3.4 elaborates on the loss functions for mapping and some direct scene editing operations based on 3DGS.
3.1. 3D Gaussian Scene Representation
In RK-SLAM, the map representing the scene consists of two parts: a set of 3D Gaussians representing the refined scene, and a sliding window containing multiple groups of sub-Gaussians awaiting fine-tuning. Each 3D Gaussian in the refined set contains a center, a spatial covariance, an opacity, and a color c, which is represented by spherical harmonics for view-dependent radiance. The covariance is parameterized by a 3D scale vector and a rotation matrix, which participates in computation as a 4D quaternion. The 3D Gaussians in each sub-Gaussian include the same parameters. The key difference is that these parameters, particularly the centers and rotation matrices, are expressed relative to the pose of the k-th keyframe. Therefore, the parameters of the k-th sub-Gaussian also include the extrinsic matrix of the k-th keyframe (Equation (3)). The extrinsic matrix contains a rotation matrix and a translation vector.
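The structure described above can be written compactly. The block below is a sketch in notation assumed here for illustration: the symbols $\mathcal{M}$, $\mathcal{G}$, $\mathcal{W}$, $\mathcal{G}_k$, $\mu$, $\Sigma$, $o$, $s$, $R$, $T_k$, $R_k$, and $t_k$ are labels chosen for this sketch and may differ from the paper's own. The covariance factorization is the standard 3DGS form.

```latex
\begin{align*}
\mathcal{M} &= \{\mathcal{G},\ \mathcal{W}\}, \qquad \mathcal{W} = \{\mathcal{G}_k\}, \\
\mathcal{G} &= \{(\mu_i,\ \Sigma_i,\ o_i,\ c_i)\}_{i=1}^{N}, \qquad
\Sigma = R\,S\,S^{\top}R^{\top}, \quad S = \operatorname{diag}(s), \\
\mathcal{G}_k &= \{T_k,\ (\mu_i^{k},\ \Sigma_i^{k},\ o_i^{k},\ c_i^{k})\}, \qquad
T_k = \begin{bmatrix} R_k & t_k \\ \mathbf{0}^{\top} & 1 \end{bmatrix}.
\end{align*}
```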
When rendering the map, the 3D Gaussians in each sub-Gaussian are transformed from the camera coordinate system of their keyframe to the world coordinate system using that keyframe's extrinsic matrix (Equation (5)). After being aligned with the coordinate system of the refined Gaussians, they are united with them. The primary parameters involved in this transformation are the center coordinates and the rotation matrices. In practice, the rendered map is therefore the union of the refined Gaussians and the transformed sub-Gaussians in the sliding window.
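The camera-to-world alignment can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and `R_k`, `t_k` are assumed to map camera coordinates to world coordinates (if the stored extrinsic maps world to camera, its inverse would be used instead).

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def sub_gaussian_to_world(centers_cam, rotations_cam, R_k, t_k):
    """Move the centers and rotations of one sub-Gaussian into world coordinates.

    Assumption: (R_k, t_k) is the camera-to-world transform of keyframe k.
    Scales and opacities are left unchanged by a rigid transform.
    """
    centers_world = [
        [a + b for a, b in zip(mat_vec(R_k, mu), t_k)] for mu in centers_cam
    ]
    rotations_world = [mat_mul(R_k, R) for R in rotations_cam]
    return centers_world, rotations_world


# Usage: a 90-degree rotation about z followed by a translation.
R_k = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t_k = [1, 2, 3]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
centers, rotations = sub_gaussian_to_world([[1, 0, 0]], [identity], R_k, t_k)
```

After this step the transformed sub-Gaussian shares a frame with the refined Gaussians, so uniting the two amounts to concatenating their parameter lists.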
The differentiable splatting rasterized rendering follows [18]. First, the 3D Gaussians are projected onto the 2D image plane of the target camera. The projection uses the intrinsic matrix of the target camera, its extrinsic matrix, the Jacobian of the affine approximation of the projective transformation, and d, the depth value on the z-axis after the 3D Gaussian center is transformed into the target camera coordinate system.
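For reference, the projection step in the form commonly used for 3DGS can be written as below. The symbol names are assumptions made for this sketch ($K$ for the intrinsic matrix, $T$ for the extrinsic matrix with rotation part $R_T$, $J$ for the Jacobian, $\mu'$ and $\Sigma'$ for the projected center and covariance) and may differ from the paper's notation.

```latex
\begin{align*}
\mu' &= \frac{K\,[\,T\mu\,]_{xyz}}{d}, &
\Sigma' &= J\,R_T\,\Sigma\,R_T^{\top}J^{\top}.
\end{align*}
```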
After projection, the influence of each 3D Gaussian on the image plane is evaluated from its projected 2D Gaussian distribution. The influences of all 3D Gaussians are accumulated sequentially from near to far according to their depth order after projection. Blending their colors and depths in this manner, as in Equation (9), yields the color and depth of each pixel in the image.
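The near-to-far accumulation can be illustrated for a single pixel. The sketch below assumes the usual 3DGS formulation, in which a Gaussian's influence is its opacity times the exponential of the negative half Mahalanobis distance, and each contribution is weighted by the transmittance left over from closer Gaussians. The function names and the tuple layout are hypothetical.

```python
import math


def gaussian_influence(pixel, mean2d, cov2d, opacity):
    """opacity * exp(-0.5 * d^T cov^{-1} d) for a projected 2D Gaussian."""
    dx, dy = pixel[0] - mean2d[0], pixel[1] - mean2d[1]
    (a, b), (c, d) = cov2d
    det = a * d - b * c
    # Quadratic form with the closed-form inverse of a 2x2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return opacity * math.exp(-0.5 * q)


def composite_pixel(splats):
    """Blend (depth, influence, rgb) tuples front to back for one pixel."""
    color = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0
    for z, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        color = [acc + weight * ch for acc, ch in zip(color, rgb)]
        depth += weight * z
        transmittance *= 1.0 - alpha
    return color, depth


# Usage: a red Gaussian at depth 1 in front of a blue one at depth 2.
color, depth = composite_pixel([
    (2.0, 0.5, (0.0, 0.0, 1.0)),
    (1.0, 0.5, (1.0, 0.0, 0.0)),
])
```

In this toy case the nearer Gaussian contributes with weight 0.5 and the farther one with weight 0.25, which shows how closer Gaussians attenuate those behind them.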
3.2. Gaussian Tracking
In Gaussian tracking, following [28], we adopt the analytical Jacobian of the camera’s extrinsic matrix with respect to the 3D Gaussians proposed there. For the loss function, we compute an RGB loss and a depth loss. Similar to [28], we further optimize affine brightness parameters for exposure and penalize non-edge or low-opacity pixels. The two losses compare the rendered RGB image and depth image against I and D, the ground truth RGB image and depth image. The exposure-related factors are trainable parameters, typically initialized to 0. The final loss combines the RGB loss and the depth loss, weighted by a hyperparameter.
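A minimal sketch of such a tracking loss is given below. Several details are assumptions made for illustration rather than statements about the paper: the brightness model `exp(a) * rendered + b`, the `lam` / `1 - lam` weighting, the default value of `lam`, and all function and parameter names. The pixel masks for non-edge and low-opacity pixels are omitted for brevity.

```python
import math


def l1_mean(pred, target):
    """Mean absolute difference between two flat lists of pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def tracking_loss(rendered_rgb, gt_rgb, rendered_depth, gt_depth,
                  exposure_a=0.0, exposure_b=0.0, lam=0.9):
    """Weighted sum of an exposure-corrected RGB loss and a depth loss.

    exposure_a and exposure_b stand in for the trainable exposure factors,
    both initialized to 0 so that the correction starts as the identity.
    """
    corrected = [math.exp(exposure_a) * v + exposure_b for v in rendered_rgb]
    loss_rgb = l1_mean(corrected, gt_rgb)
    loss_depth = l1_mean(rendered_depth, gt_depth)
    return lam * loss_rgb + (1.0 - lam) * loss_depth


# Usage on two pixels with flat intensity and depth lists.
loss = tracking_loss([0.5, 0.5], [0.5, 0.7], [1.0, 2.0], [1.0, 2.5])
```

In an actual system these quantities would be tensors and the loss would be minimized with respect to the camera pose and the exposure factors by gradient descent.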
After tracking, it is necessary to determine whether the current frame qualifies as a keyframe. Similar to [28], we measure the covisibility as the intersection over union of the Gaussians observed in the current frame and in the last keyframe. If the covisibility drops below a threshold, or if the relative translation is large with respect to the median depth, the current frame is registered as a keyframe. The covisibility between frame i and frame j is computed from the sets of Gaussians observed in each of the two frames (Equation (12)).
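The keyframe test described above can be sketched with Python sets, where each frame is represented by the IDs of the Gaussians it observes. The threshold values and function names below are hypothetical and chosen only for illustration.

```python
def covisibility(observed_i, observed_j):
    """Intersection over union of two sets of observed Gaussian IDs."""
    union = observed_i | observed_j
    if not union:
        return 0.0
    return len(observed_i & observed_j) / len(union)


def is_keyframe(observed_current, observed_last_keyframe,
                translation, median_depth,
                covis_threshold=0.9, translation_ratio=0.08):
    """Register a keyframe on low overlap or large relative motion.

    covis_threshold and translation_ratio are placeholder values,
    not settings taken from the paper.
    """
    low_overlap = covisibility(observed_current,
                               observed_last_keyframe) < covis_threshold
    large_motion = translation > translation_ratio * median_depth
    return low_overlap or large_motion


# Usage: two frames sharing two of four observed Gaussians.
iou = covisibility({1, 2, 3}, {2, 3, 4})
```

Scaling the translation test by the median depth makes the criterion independent of scene scale: the same camera motion matters more in a small room than in a large hall.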
3.3. Keyframe Window
When a new keyframe is obtained, a sub-Gaussian, as described by Equation (3), is constructed. The point cloud is generated from the RGB-D image and downsampled at a certain ratio to initialize the sub-Gaussian. This sub-Gaussian is then added to the sliding keyframe window. The new sub-Gaussian is expressed in the camera coordinate system of the current keyframe, so as the pose of this keyframe changes during fine-tuning, the sub-Gaussian moves with it, as shown in Figure 2. To maintain a constant window size, the sub-Gaussian that exceeds the window limit is removed. The earliest sub-Gaussian has undergone the most fine-tuning, and its result is relatively stable (discussed in Section 4.3); therefore, it is the one removed from the sliding window. The removed sub-Gaussian is transformed into the world coordinate system using Equation (5) and then united with the already refined Gaussians.
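The window bookkeeping can be sketched with a double-ended queue. The class name, its fields, and the `to_world` callback are hypothetical; the callback stands in for the camera-to-world transformation of Equation (5).

```python
from collections import deque


class SlidingSubGaussianWindow:
    """Fixed-size FIFO of sub-Gaussians that are still being fine-tuned."""

    def __init__(self, max_size, to_world):
        self.max_size = max_size
        self.to_world = to_world   # maps a sub-Gaussian to world-frame Gaussians
        self.window = deque()      # sub-Gaussians awaiting fine-tuning
        self.refined = []          # already refined world-frame Gaussians

    def push(self, sub_gaussian):
        """Add a new sub-Gaussian and retire the oldest ones if needed."""
        self.window.append(sub_gaussian)
        while len(self.window) > self.max_size:
            oldest = self.window.popleft()
            self.refined.extend(self.to_world(oldest))


# Usage with a toy 1D "pose": the world position is the point plus the pose.
window = SlidingSubGaussianWindow(
    max_size=2,
    to_world=lambda sg: [p + sg["pose"] for p in sg["points"]],
)
window.push({"pose": 10, "points": [1, 2]})
window.push({"pose": 20, "points": [3]})
window.push({"pose": 30, "points": [4]})
```

Because the oldest entry is always the one retired, every sub-Gaussian receives the same number of fine-tuning rounds before it is frozen into the refined map.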
At the same time, a related keyframe window is also constructed. To ensure a strong correlation between the selected keyframes and the current keyframe, co-visibility is used as the metric for measuring inter-frame correlation. Additionally, to optimize the mapping of the current frame from a broader range of perspectives, the spatial distance between frames is also taken into consideration. Based on these factors, we design the related keyframe selection algorithm (Algorithm 1) as follows:
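Algorithm 1. Related keyframe selection algorithm.
1: Input: the keyframes, the observed Gaussians for all keyframes, the poses of all keyframes, the current keyframe k, and the maximum size l of the related keyframe window
2: Output: the related keyframe window
3: …
4: for each fine-tuned keyframe i do
5:     consistency of i ← squared co-visibility between keyframe i and keyframe k (Equation (12))
6: end for
7: …
8: …
9: …
10: while the related keyframe window holds fewer than l keyframes do
11:     for i in … do
12:         update the nearest distance of keyframe i
13:     end for
14:     add the keyframes with the highest relation (nearest distance × consistency) to the related keyframe window
15:     …
16: end while
17: return the related keyframe window
Here, the input keyframes are the fine-tuned keyframes, the observed Gaussians are those observed by all keyframes, and the poses denote the 3D spatial positions of the keyframes.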
In Algorithm 1, we first calculate the co-visibility between each fine-tuned keyframe and the current keyframe using Equation (12), and use its square as the measure of consistency (line 5). Next, we iteratively update the nearest distances for the keyframes. This iteration prevents the selected keyframes from being overly concentrated in a specific region, so that they are more evenly distributed across the 3D space. During each iteration, after updating the nearest distances (line 12), we compute the relation of each keyframe as the product of its nearest distance and its consistency score. Keyframes with the highest relations are added to the related keyframe window (line 14). This process continues until the number of keyframes in the window reaches the desired size, after which the window is returned.
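The selection procedure can be sketched as a greedy loop. The sketch fills in details that the description leaves open, and these are assumptions: the nearest distance of a candidate is taken to be its Euclidean distance to the closest of the current keyframe and the keyframes already selected, exactly one keyframe is added per iteration, and a selected keyframe is removed from the candidates. All names are hypothetical.

```python
import math


def covisibility(observed_i, observed_j):
    """Intersection over union of two sets of observed Gaussian IDs."""
    union = observed_i | observed_j
    return len(observed_i & observed_j) / len(union) if union else 0.0


def select_related_keyframes(positions, observed, current, max_size):
    """Greedily pick keyframes that are co-visible with `current` yet spread out.

    positions: dict mapping keyframe id -> (x, y, z)
    observed:  dict mapping keyframe id -> set of observed Gaussian IDs
    """
    candidates = [i for i in positions if i != current]
    # Consistency is the squared co-visibility with the current keyframe.
    consistency = {i: covisibility(observed[i], observed[current]) ** 2
                   for i in candidates}
    # Assumed initialization: distance to the current keyframe.
    nearest = {i: math.dist(positions[i], positions[current])
               for i in candidates}
    window = []
    while candidates and len(window) < max_size:
        if window:
            last = positions[window[-1]]
            for i in candidates:
                nearest[i] = min(nearest[i], math.dist(positions[i], last))
        best = max(candidates, key=lambda i: nearest[i] * consistency[i])
        window.append(best)
        candidates.remove(best)
    return window


# Usage: "a" and "b" are nearly co-located; "c" is farther away but overlaps less.
positions = {"k": (0, 0, 0), "a": (1, 0, 0), "b": (1.1, 0, 0), "c": (0, 2, 0)}
observed = {"k": {1, 2, 3, 4}, "a": {1, 2, 3, 4}, "b": {1, 2, 3, 4}, "c": {1, 2}}
related = select_related_keyframes(positions, observed, "k", max_size=2)
```

In this toy case "b" is chosen first; its neighbor "a" then has a nearest distance of about 0.1, so "c" is preferred despite its lower co-visibility. This illustrates how the distance term keeps the window from clustering in one region.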
To give a more intuitive view of the related window, we extracted an example from the experiments, shown in Figure 3. The figure shows the RGB images of the current frame and its related frames, together with their order in the SLAM sequence. The visible area of every related frame overlaps with that of the current frame, and the related frames observe the current scene from multiple views. This observation mode benefits both the localization and the mapping of the current frame. Notably, the earliest of these related frames is the 192nd frame in the sequence, far from the current frame (the 1523rd) in sequence order, yet our algorithm still captures their relation. This is of particular significance for SLAM sequences with cyclic structures.
The keyframes in both types of windows contribute to Gaussian mapping, which optimizes the scene map. The key difference lies in how they are handled: the Gaussians in the sub-Gaussians corresponding to the keyframes in the sliding keyframe window are fine-tuned along with the poses of those keyframes, whereas the keyframes in the related keyframe window no longer have corresponding sub-Gaussians.
3.4. Gaussian Mapping
In Gaussian mapping, in addition to the two windows mentioned above, we randomly select a subset of keyframes during each iteration to participate in the computation of the mapping loss. This random subset prevents the scene map from forgetting past observations. The mapping loss is calculated from three aspects: an RGB loss, a depth loss, and an isotropic regularization term. The RGB loss includes a term based on the structural similarity index measure (SSIM), with a hyperparameter used to control its weight. The isotropic regularization term uses the mean of the scale factors of the 3D Gaussians in the scene map. The final loss is composed of the RGB and depth losses of the keyframes in all windows, together with the scale-factor isotropic regularization term.
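Two of these terms are simple enough to sketch directly. The block below makes two assumptions that the text does not confirm: the mean scale is taken per Gaussian, as in [28], and the RGB loss mixes an L1 term with an SSIM term in the `(1 - lam) * L1 + lam * (1 - SSIM)` form used by the original 3DGS work. The function names and the `lam_ssim` default are hypothetical, and the SSIM value is passed in rather than computed.

```python
def isotropic_regularization(scales):
    """Sum over Gaussians of the L1 distance between each scale and its mean.

    scales: list of per-Gaussian 3D scale vectors.
    The penalty is zero for perfectly spherical Gaussians and grows as
    they become elongated.
    """
    total = 0.0
    for s in scales:
        mean = sum(s) / len(s)
        total += sum(abs(v - mean) for v in s)
    return total


def rgb_loss(l1, ssim, lam_ssim=0.2):
    """Mix an L1 colour error with a structural dissimilarity term."""
    return (1.0 - lam_ssim) * l1 + lam_ssim * (1.0 - ssim)


# Usage: one spherical and one elongated Gaussian.
penalty = isotropic_regularization([[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]])
photometric = rgb_loss(l1=0.2, ssim=0.9)
```

Penalizing anisotropy discourages needle-like Gaussians that fit the training views but produce artifacts from viewpoints that have not been observed.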
Regarding scene editing, following [
28], we densify the 3D Gaussians based on gradients and prune them based on their opacity and scale. Additionally, for resetting scene opacity, we adopt the same approach following [