Article

E²-VINS: An Event-Enhanced Visual–Inertial SLAM Scheme for Dynamic Environments

by Jiafeng Huang, Shengjie Zhao * and Lin Zhang
School of Computer Science and Technology, Tongji University, Shanghai 201804, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1314; https://doi.org/10.3390/app15031314
Submission received: 8 December 2024 / Revised: 25 January 2025 / Accepted: 25 January 2025 / Published: 27 January 2025
(This article belongs to the Special Issue Advances in Audio/Image Signals Processing)

Abstract

Simultaneous Localization and Mapping (SLAM) technology has garnered significant interest in the robotic vision community over the past few decades. The rapid development of SLAM technology has resulted in its widespread application across various fields, including autonomous driving, robot navigation, and virtual reality. Although SLAM, especially Visual–Inertial SLAM (VI-SLAM), has made substantial progress, most classic algorithms in this field are designed based on the assumption that the observed scene is static. In complex real-world environments, the presence of dynamic objects such as pedestrians and vehicles can seriously affect the robustness and accuracy of such systems. Event cameras, recently introduced motion-sensitive biomimetic sensors, efficiently capture scene changes (referred to as “events”) with high temporal resolution, offering new opportunities to enhance VI-SLAM performance in dynamic environments. Integrating this kind of innovative sensor, we propose the first event-enhanced Visual–Inertial SLAM framework specifically designed for dynamic environments, termed E²-VINS. Specifically, the system uses a visual–inertial alignment strategy to estimate IMU biases and correct IMU measurements. The calibrated IMU measurements are used to assist in motion compensation, achieving spatiotemporal alignment of events. Event-based dynamicity metrics, which measure the dynamicity of each pixel, are then generated on these aligned events. Based on these metrics, the visual residual terms of different pixels are adaptively assigned weights, namely, dynamicity weights. Subsequently, E²-VINS jointly and alternately optimizes the system state (camera poses and map points) and the dynamicity weights, effectively filtering out dynamic features through a soft-threshold mechanism. Our scheme enhances the robustness of classic VI-SLAM against dynamic features, which significantly improves VI-SLAM performance in dynamic environments and results in an average improvement of 1.884% in the mean position error compared to state-of-the-art methods. The superior performance of E²-VINS is validated through both qualitative and quantitative experimental results. To ensure that our results are fully reproducible, all the relevant data and code have been released.

1. Introduction

Visual–Inertial Simultaneous Localization and Mapping (VI-SLAM) has become a pivotal technology in robotics and computer vision [1,2,3]. This technique integrates data from visual sensors, typically monocular or stereo cameras, with inertial measurement unit (IMU) data to simultaneously estimate the robot’s pose and construct a map of the environment. Although VI-SLAM has shown considerable success in static environments, its performance is significantly impeded by the presence of dynamic objects [4]. Dynamic objects, such as moving vehicles, pedestrians, or other unpredictable elements, increase the risks faced by VI-SLAM methods when interacting with real-world environments. These dynamic objects disrupt the data association between image frames, causing pose estimation to drift or even fail and resulting in ghosting effects in the generated maps.
To address this issue, extensive research has attempted to eliminate the harm that dynamic objects cause to the entire SLAM system through vision-based dynamic recognition methods. However, each of these solutions has inherent limitations. These schemes can be broadly categorized into two types: constraint-based solutions and deep learning-based solutions. The first category establishes geometric or dynamic constraints to identify and remove dynamic features without deep learning [5,6,7,8]. However, these methods typically treat dynamic points as outliers and require precise camera pose estimation, which becomes particularly challenging when dynamic objects occupy a significant portion of the field of view. The second category utilizes deep learning methods for motion segmentation, thereby marking dynamic regions [9,10,11,12]. However, these approaches impose substantial computational burdens and typically can only handle predefined dynamic objects.
A feasible approach to addressing this challenge is the integration of additional sensors to enhance dynamic segmentation capabilities. Among these, event cameras, a novel type of biologically inspired sensor [13,14,15], stand out as a particularly promising yet underexplored option, offering a new paradigm for high-speed dynamic visual perception. Unlike traditional cameras that capture intensity images at fixed intervals, event cameras measure asynchronous brightness changes at each pixel, which are known as “events”. This operating mechanism gives event cameras low latency, high dynamic range, and microsecond-level temporal resolution. While some studies have applied event cameras to SLAM schemes to explore their performance in high-speed motion and high dynamic range (HDR) scenes (e.g., overexposed or dim environments) [16,17,18,19,20], research focused on dynamic scenes remains limited.
To fill this research gap, a novel event-enhanced VI-SLAM scheme for dynamic environments, namely, E²-VINS, is proposed in this paper. In our E²-VINS, the visual–inertial alignment is used for system initialization and IMU measurement calibration. Next, through IMU-assisted self-motion compensation, the events are spatiotemporally aligned, enabling the generation of normalized mean timestamps, which are regarded as dynamicity metrics. Based on these pixel-wise metrics, dynamicity weights are adaptively assigned to the visual residual terms of different features. The system state (camera poses and map points) and dynamicity weights are then jointly and alternately optimized in E²-VINS. To ensure that our results are fully reproducible, all the relevant data and code have been released at https://cslinzhang.github.io/EE-VINS/EE-VINS.html, accessed on 27 September 2024. Our key contributions are as follows:
1. A novel event-based method is proposed to assign dynamicity weights to visual residuals, supported by a momentum factor that mitigates event noise interference. This method enables robust, dynamically adaptive bundle adjustment that integrates seamlessly into feature-based SLAM frameworks to filter dynamic feature points.
2. A joint alternating optimization framework is employed that optimizes the system state and dynamicity weights, reducing the influence of dynamic features on SLAM robustness in dynamic environments.
3. The first event-enhanced, dynamicity-adaptive VI-SLAM system, termed E²-VINS, is proposed. Extensive qualitative and quantitative experiments on multiple benchmark datasets demonstrate that E²-VINS outperforms competing methods in real dynamic scenarios.
The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 presents the overall framework of the proposed dynamic SLAM scheme. Details of the evaluation on publicly available datasets are presented in Section 4. Section 5 discusses the advantages and limitations of the proposed method. Section 6 concludes the paper.

2. Related Work

In this section, we review studies related to event-based SLAM and dynamic VI-SLAM.

2.1. Event-Based SLAM

Event-based SLAM technology is in an early stage of rapid development, with some research exploring the use of event cameras as novel data sources for visual odometry and SLAM. In 2016, Kim et al. [18] proposed the first complete event-based visual odometry system, incorporating three probabilistic filters to predict camera motion, the brightness gradient, and the inverse depth of the scene. This work assumed constant brightness and a linear gradient; it achieved simultaneous motion tracking and 3D scene reconstruction but required a GPU due to the huge computational burden. Rebecq et al. [19] proposed EVO, an event-based visual odometry system that eliminates the need for brightness reconstruction and operates on a standard CPU. EVO [19] includes two parallel pipelines: a tracking pipeline based on image-to-model alignment and a mapping pipeline for event-based 3D reconstruction. Unfortunately, its localization and mapping accuracy is relatively poor due to its unstable event-based bootstrapping and feature detection method. USLAM [21] is the first state estimation pipeline that tightly couples events, standard frames, and inertial measurements to provide robust and accurate state estimation, and it works reliably in conditions such as low-light scenes or fast motion. Zuo et al. [20] proposed DEVO, which generates semi-dense depth maps by warping depth values from an extrinsically calibrated depth camera and updates the camera pose through geometric 3D–2D edge alignment.
This approach was recently extended to Canny-EVT [22], but it still relies on additional depth priors from the depth camera to ensure system functionality. Hidalgo-Carrió et al. [23] proposed EDS, the first system to use a direct method for 6-DoF visual odometry incorporating both event and color information. EDS [23] uses sparse 3D points to predict the brightness increments of pixels through an ordinary optical camera and estimates camera motion by minimizing the error between these and the event-based brightness increments. Huang et al. [24] proposed the first purely event-based 6-DoF motion compensation method and built a novel visual-event odometry, namely, MC-VEO, which uses the event generation model to align event images with RGB images for camera pose estimation. Guan et al. [25] proposed a robust, high-precision, real-time optimized monocular event-based VIO framework that leverages both point and line features. The framework tightly integrates events, images, and IMU measurements; unfortunately, this approach is not available as open source. Guo and Gallego [26] introduced the first event-based rotation-only bundle adjustment (BA) approach and developed CMax-SLAM, which eliminates the need to convert events into frames. However, due to the limitations of the contrast maximization framework, CMax-SLAM is restricted to 3-DoF purely rotational motion estimation and is unsuitable for long-distance trajectory prediction. Overall, as shown in Table 1, existing event-based SLAM approaches are based on the assumption of static scenes and overlook the potential of exploiting event cameras’ sensitivity to dynamic objects for enhanced performance in dynamic environments.

2.2. Dynamic VI-SLAM

Dynamic VI-SLAM solutions can be broadly classified into two categories: constraint-based methods and deep learning-based methods. Table 2 provides an overview of existing dynamic VI-SLAM methods, including the camera configuration, the use of additional priors, support for temporarily stopped dynamic objects, and consideration of the uncertainty of dynamic objects.
Constraint-based methods, which do not rely on deep learning, typically establish epipolar geometric or dynamic constraints to filter and remove dynamic features. Kim et al. [5] used IMU data to compensate for rotation between consecutive images and calculate motion vectors, filtering out dynamic features exhibiting different motion trends. Fan et al. [6] proposed a dynamic feature point selection method based on multiview geometry. After projecting different images onto the same plane, they located dynamic regions through rotation and translation constraints. Based on self-motion compensation and Dense Inverse Search (DIS) optical flow, Canovas et al. [7] proposed a superpixel-level fast moving-object detection method within their sparse-feature visual odometry, which extracts and discards dynamic elements from the camera tracking and mapping processes. Song et al. [8] proposed a robust bundle adjustment that rejects the features of dynamic objects by utilizing the pose prior estimated by IMU preintegration, discarding dynamic features that deviate significantly from the motion prior while estimating the camera pose. Most of these methods assume accurate camera pose estimation with a high proportion of static features. However, when dynamic objects occupy the majority of the field of view, these methods may fail, resulting in inaccurate camera pose estimation.
Deep learning-based methods focus on motion recognition and dynamic region partitioning. DS-SLAM [9] used a multithreaded system with a deep semantic segmentation network, SegNet, to perform moving-object segmentation in an independent thread. Similar to DS-SLAM [9], DynaSLAM [10] used Mask R-CNN for moving-object segmentation and identified data points with significant depth or angle differences as dynamic points. SLAMANTIC [11] integrated semantic information into a geometric, feature-based VSLAM, using Mask R-CNN to detect objects and construct semantic maps. In this method, the confidence measure for each feature point considers both semantic labels and the combined values of the reprojection error, photometric error, and depth error. Dynamic-VINS [12] used YOLOv3 [27] for real-time moving-object detection and employed depth thresholds and IMU data for motion consistency checks to preserve static background points within the bounding box. Although deep learning-based methods can effectively detect dynamic objects at the semantic level, they incur substantial computational overhead and struggle in scenarios such as (a) the presence of undefined dynamic objects and (b) partially visible dynamic objects due to occlusion. In these situations, deep learning networks may fail to detect dynamic objects effectively, limiting the applicability of the corresponding dynamic SLAM approaches.
In comparison to the existing literature on event-based SLAM and dynamic VI-SLAM, our proposed method introduces several key differences. First, unlike many prior event-based approaches that do not address dynamic environments, our method explicitly handles dynamic scenes by introducing a novel dynamicity weighting mechanism and an alternating optimization framework. Second, our approach incorporates event camera data and fuses them with traditional image inputs, offering improved robustness in dynamic environments. Furthermore, we address the challenges posed by temporarily stopped dynamic objects and their associated uncertainties, which have not been fully considered by previous methods. These distinctions are elaborated in detail in the following sections:
  • In Section 3, we describe the differences in pipeline design, including the introduction of dynamicity weights and alternating optimization.
  • In Section 4, we present quantitative and qualitative comparisons with existing methods, demonstrating the superior performance of our approach in dynamic scenes.
  • In Section 6, we highlight the advantages and limitations of our method in handling dynamic environments.
These sections provide a comprehensive view of how our method differs from, and improves upon, existing research.

3. Methodology

In traditional VI-SLAM systems [1,2,28], visual data from cameras are combined with inertial measurements from an IMU to estimate the pose of the system. The visual data are typically used for feature tracking and map construction, while the IMU provides motion estimates between frames. These systems usually consist of a frontend, where visual features are extracted and tracked, and a backend, which optimizes the pose graph and map using both visual and inertial constraints. However, in dynamic environments, the presence of moving objects can corrupt feature correspondences and lead to errors in pose estimation, reducing the system’s robustness. Our proposed E²-VINS extends this traditional pipeline by introducing a dynamicity-aware mechanism that assigns adaptive weights to dynamic features, improving the system’s performance in challenging dynamic environments.
The overall pipeline of E²-VINS is illustrated in Figure 1. After data preprocessing, during the initialization of E²-VINS, the system performs visual-based motion structure recovery and aligns it with inertial measurements. Gyroscope measurements are corrected through bias estimation during visual–inertial alignment. The system then uses the corrected measurements to assist in motion compensation, achieving spatiotemporal alignment of events. The aligned events are used to generate event-based dynamicity metrics, which assess the dynamicity of each pixel. Based on these metrics, dynamicity weights are assigned to the visual residuals of different pixels. Subsequently, E²-VINS jointly and alternately optimizes camera poses, map points, and dynamicity weights, adaptively filtering dynamic features to enhance the system’s robustness in dynamic environments.
This section begins with a brief review of the working principle of the event camera in Section 3.1. Then, the preprocessing and visual–inertial alignment of the system are introduced in Section 3.2. Next, we discuss the computation of event-based dynamicity metrics in Section 3.3. Finally, Section 3.4 covers the adaptive assignment of visual residual weights and the joint alternating optimization framework.

3.1. Preliminaries of the Event Camera

The observation of each pixel by the event camera is independent and asynchronous. When the event camera detects a change in the logarithmic brightness intensity at a pixel $\mathbf{u}_k = (x_k, y_k)^T$ that exceeds a specified threshold $C$ (known as contrast sensitivity) [13] at timestamp $t_k$, it generates an event $e_k = (\mathbf{u}_k, t_k, p_k)$:

$$\Delta I(\mathbf{u}_k, t_k) = I(\mathbf{u}_k, t_k) - I(\mathbf{u}_k, t_k^{-}) = p_k C,$$

where $I$ represents the logarithmic brightness intensity, $p_k \in \{+1, -1\}$ denotes the polarity, indicating whether the brightness change is an increase (brightening) or a decrease (darkening), and $t_k^{-}$ is the timestamp of the previous event at pixel $\mathbf{u}_k$.
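As a concrete illustration of this event generation model, the following C++ sketch simulates a single pixel and emits an event whenever the change in log brightness since the last event exceeds the contrast sensitivity $C$. The struct and function names are ours and are not taken from the released code; this is a simplified simulation, not the sensor model itself.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>

// One asynchronous event e_k = (u_k, t_k, p_k).
struct Event {
  uint16_t x, y;  // pixel coordinates u_k
  double t;       // timestamp t_k (seconds)
  int8_t p;       // polarity p_k in {+1, -1}
};

// Per-pixel state: log brightness at the last emitted event.
struct PixelState {
  double last_log_intensity;
};

// Emits an event when |log I(u, t) - log I(u, t^-)| >= C, per Equation (1).
std::optional<Event> updatePixel(PixelState& s, uint16_t x, uint16_t y,
                                 double intensity, double t, double C) {
  const double log_intensity = std::log(intensity + 1e-6);  // avoid log(0)
  const double delta = log_intensity - s.last_log_intensity;
  if (std::abs(delta) < C) return std::nullopt;  // below contrast sensitivity
  s.last_log_intensity = log_intensity;          // reset the reference level
  return Event{x, y, t, static_cast<int8_t>(delta > 0 ? +1 : -1)};
}
```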

3.2. Preprocessing and Visual–Inertial Alignment-Based IMU Calibration

Inspired by the approach of the classical frameworks [2,28], the system preintegrates IMU measurements between every two consecutive image frames, tracks features between consecutive frames, and detects new features in the latest frame. The process is as follows:
  • Preintegration of IMU Data
E²-VINS models IMU measurements by taking inspiration from previous works [2,28]. Specifically, accelerometer measurements are the sum of the true linear acceleration $\mathbf{a}$, the gravity-compensating force $\mathbf{g}'$, the accelerometer bias $\mathbf{b}_a$, and Gaussian white noise $\mathbf{n}_a$, while the true angular velocity $\boldsymbol{\omega}$ is affected by the gyroscope bias $\mathbf{b}_\omega$ and additional noise $\mathbf{n}_\omega$:

$$\hat{\mathbf{a}} = \mathbf{a} + \mathbf{b}_a + \mathbf{g}' + \mathbf{n}_a, \qquad \hat{\boldsymbol{\omega}} = \boldsymbol{\omega} + \mathbf{b}_\omega + \mathbf{n}_\omega,$$

where the biases $\mathbf{b}_a$ and $\mathbf{b}_\omega$ are modeled as random walks, and their derivatives are Gaussian white noise. These biases are estimated in the subsequent visual–inertial alignment to calibrate the IMU measurements.
Consider two consecutive images, corresponding to IMU coordinate frames (body frames) $b_k$ and $b_{k+1}$. During this period, the system’s position $\alpha^{b_k}_{b_{k+1}}$, velocity $\beta^{b_k}_{b_{k+1}}$, and orientation $\gamma^{b_k}_{b_{k+1}}$ states are propagated using IMU measurements through a process referred to as IMU preintegration between images. The IMU preintegration strategy aligns with classical methods [2,28].
  • Preprocessing of Images and Visual-based Structure from Motion
The system applies contrast-limited adaptive histogram equalization to input images to cope with lighting variations. For each new image, Shi–Tomasi corner features are extracted and processed. The features are tracked by the KLT sparse optical flow algorithm [29] to avoid time-consuming matching computations. Subsequently, a sparse map composed of camera poses and feature positions is estimated. To limit computational complexity, E²-VINS maintains several frames using a sliding window. For frames with stable matches within the window, the system employs the five-point algorithm [30] to recover the relative rotation and scale-free translation between the two frames. Subsequently, all features between the two frames are triangulated, and the perspective-n-point (PnP) algorithm [31] is used to estimate the poses of all other frames in the window. Finally, global bundle adjustment [32] is applied to minimize the total reprojection error of all feature observations. The first camera frame $(\cdot)^{c_0}$ is used as the reference coordinate system for structure from motion (SFM), and all subsequent frame poses $(\mathbf{p}^{c_0}_{c_k}, \mathbf{q}^{c_0}_{c_k})$ and feature positions are calculated relative to $(\cdot)^{c_0}$. $\mathbf{q}^{c_0}_{c_k}$ and $\mathbf{p}^{c_0}_{c_k}$ represent the rotation quaternion and translation vector, respectively. Given the extrinsic parameters $(\mathbf{p}^{b}_{c}, \mathbf{q}^{b}_{c})$ between the camera and IMU, the system can convert the pose from the camera coordinate system to the body (IMU) coordinate system $(\cdot)^{b_k}$:
$$\mathbf{q}^{c_0}_{b_k} = \mathbf{q}^{c_0}_{c_k} \otimes \left(\mathbf{q}^{b}_{c}\right)^{-1}, \qquad s\,\mathbf{p}^{c_0}_{b_k} = s\,\mathbf{p}^{c_0}_{c_k} - \mathbf{R}^{c_0}_{b_k}\,\mathbf{p}^{b}_{c},$$

where $s$ is the camera scale parameter, which is solved during visual–inertial alignment, $\otimes$ represents quaternion multiplication, and $\mathbf{R}^{c_0}_{b_k}$ is the rotation matrix corresponding to $\mathbf{q}^{c_0}_{b_k}$.
  • Visual–Inertial Alignment
The process of visual–inertial alignment is shown in Figure 2. The basic idea is to match the visual SFM with the IMU preintegration values. For two consecutive frames $b_k$ and $b_{k+1}$, the system obtains the rotations $\mathbf{q}^{c_0}_{c_k}$ and $\mathbf{q}^{c_0}_{c_{k+1}}$ from the visual SFM, as well as the relative rotation $\gamma^{b_k}_{b_{k+1}}$ from IMU preintegration. The system linearizes the IMU preintegration terms with respect to the gyroscope bias and minimizes the following cost function:

$$\min_{\delta \mathbf{b}_\omega} \sum_{k \in \mathcal{B}} \left\| \left(\mathbf{q}^{c_0}_{b_{k+1}}\right)^{-1} \otimes \mathbf{q}^{c_0}_{b_k} \otimes \gamma^{b_k}_{b_{k+1}} \right\|^2, \qquad \gamma^{b_k}_{b_{k+1}} \approx \hat{\gamma}^{b_k}_{b_{k+1}} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J^{\gamma}_{b_\omega} \delta \mathbf{b}_\omega \end{bmatrix},$$

where $\mathcal{B}$ encompasses all the image frames in the sliding window, and the Jacobian $J^{\gamma}_{b_\omega}$ of the preintegration terms with respect to the bias $\mathbf{b}_\omega$ is used for the first-order approximation of the relative rotation $\gamma$. In this way, the gyroscope bias can be estimated to correct the IMU measurements.
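To make this linearized problem concrete, the following Eigen-based sketch (an illustration only; the struct and function names are ours and are not taken from the released code) accumulates the normal equations over the sliding window and solves for the gyroscope bias correction $\delta \mathbf{b}_\omega$, mirroring the structure of the cost function above.

```cpp
#include <vector>
#include <Eigen/Dense>
#include <Eigen/Geometry>

// Quantities available for one pair of consecutive frames in the window.
struct FramePair {
  Eigen::Quaterniond q_c0_bk;    // SfM-derived rotation of frame k (body in c0)
  Eigen::Quaterniond q_c0_bk1;   // SfM-derived rotation of frame k+1
  Eigen::Quaterniond gamma_hat;  // IMU-preintegrated relative rotation
  Eigen::Matrix3d J_gamma_bw;    // Jacobian of gamma w.r.t. the gyro bias
};

// Solves the linearized least-squares problem for the gyroscope bias
// correction delta_bw by accumulating normal equations over the window.
Eigen::Vector3d solveGyroBiasCorrection(const std::vector<FramePair>& pairs) {
  Eigen::Matrix3d A = Eigen::Matrix3d::Zero();
  Eigen::Vector3d b = Eigen::Vector3d::Zero();
  for (const auto& p : pairs) {
    // Relative rotation between the two frames according to visual SfM.
    const Eigen::Quaterniond q_ij = p.q_c0_bk.inverse() * p.q_c0_bk1;
    // Mismatch between IMU preintegration and SfM; its vector part drives the
    // first-order model gamma ~ gamma_hat * [1, 0.5 * J * delta_bw].
    const Eigen::Quaterniond resid = p.gamma_hat.inverse() * q_ij;
    const Eigen::Vector3d rhs = 2.0 * resid.vec();
    A += p.J_gamma_bw.transpose() * p.J_gamma_bw;
    b += p.J_gamma_bw.transpose() * rhs;
  }
  return A.ldlt().solve(b);  // correction to add to the current gyro bias
}
```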

3.3. Event-Based Dynamicity Metrics

To obtain event-based pixel-level dynamicity metrics at a specific time, IMU-assisted self-motion compensation is required to achieve spatiotemporal alignment of the events. Subsequently, normalized mean timestamps are computed from the motion-compensated event frames, serving as dynamicity metrics.
  • IMU-assisted Motion Compensation
In the IMU-assisted self-motion compensation approach used in E²-VINS, the IMU’s rotation information is applied to each event on the pixel plane using a rigid-body rotation model and a pinhole imaging model, resulting in sharper restored frames, as shown in Figure 3. Specifically, on the image plane, the pixel coordinates of a set of events are adjusted back to their corresponding positions at a common timestamp $t_{\mathrm{ref}}$ using a warping field.
To demonstrate this approach, let $\mathcal{E} = \{e_k\}_{k=1}^{N_e}$ be a set of events within a time interval $\mathcal{T} = \{t_k\}_{k=1}^{N_e}$, where $N_e$ is the group size. Assuming the angular rate $\boldsymbol{\omega}_i$ has been converted to the event camera coordinate system, it can be decomposed into three components, $\phi_e$, $\theta_e$, and $\psi_e$, on the X, Y, and Z axes, respectively. In addition, for each event $e_k$ in $\mathcal{E}$, the time difference $\delta t_k = t_k - t_{\mathrm{ref}}$ is calculated. Using the angular rate and time difference, the corresponding rotation angles $\phi_k$, $\theta_k$, and $\psi_k$ for the X, Y, and Z axes are computed. These rotation angles are applied to compensate the pixel coordinates $\mathbf{u}_k = (x_k, y_k)^T$ of event $e_k$, adjusting them to their corresponding positions $\mathbf{u}'_k = (x'_k, y'_k)^T$ at the reference timestamp $t_{\mathrm{ref}}$ using a warping function, as shown in Figure 4. The warping function applied to a single event in the group is defined as
$$\mathbf{u}'_k = \mathbf{R}\left( \left(\mathbf{u}_k - \mathbf{o}\right) - \mathbf{t} \right) + \mathbf{o},$$

where $\mathbf{o}$ denotes the coordinates of the center of the pixel plane, $\mathbf{R} \in \mathbb{R}^{2 \times 2}$ is the rotation matrix resulting from the Z-axis rotation, and $\mathbf{t} \in \mathbb{R}^{2}$ is the two-dimensional translation vector resulting from the rotations around the X axis and Y axis. The rotation matrix $\mathbf{R}$ and the translation vector $\mathbf{t}$ can be calculated as follows:

$$\mathbf{R} = \begin{bmatrix} \cos\psi_k & -\sin\psi_k \\ \sin\psi_k & \cos\psi_k \end{bmatrix}, \quad \mathbf{t} = \left(\mathbf{u}_k - \mathbf{o}\right) - \frac{f}{s}\tan(\boldsymbol{\xi}), \quad \boldsymbol{\xi} = \begin{bmatrix} \xi_x \\ \xi_y \end{bmatrix} = \begin{bmatrix} \arctan\left(x_k\, s / f\right) - \theta_k \\ \arctan\left(y_k\, s / f\right) - \phi_k \end{bmatrix}.$$
Here, $f$ represents the focal length, and $s$ denotes the pixel size. The compensated event set $\mathcal{E}'$ is derived from the initial set $\mathcal{E}$ through the warping function.
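The following C++ sketch illustrates one possible implementation of this per-event warp under the reconstruction of the two equations above. The function name and interface are ours, and the sign conventions follow that reconstruction rather than the released code, so they should be treated as an assumption.

```cpp
#include <cmath>

struct Vec2 { double x, y; };

// Warps one event pixel u_k back to the reference timestamp t_ref using the
// per-event rotation angles (phi_k, theta_k, psi_k) about the X/Y/Z axes.
// f is the focal length, s the pixel size, o the center of the pixel plane.
Vec2 warpEvent(const Vec2& u, const Vec2& o,
               double phi_k, double theta_k, double psi_k,
               double f, double s) {
  // X/Y-axis rotation: compensate via the viewing-angle (tangent) mapping.
  const double xi_x = std::atan(u.x * s / f) - theta_k;
  const double xi_y = std::atan(u.y * s / f) - phi_k;
  const double tx = (u.x - o.x) - (f / s) * std::tan(xi_x);  // t, Eq. (6)
  const double ty = (u.y - o.y) - (f / s) * std::tan(xi_y);
  // Z-axis rotation about the image center, applied to (u - o) - t.
  const double cx = (u.x - o.x) - tx;
  const double cy = (u.y - o.y) - ty;
  const double c = std::cos(psi_k), sn = std::sin(psi_k);
  return Vec2{c * cx - sn * cy + o.x,   // u'_k = R((u_k - o) - t) + o
              sn * cx + c * cy + o.y};
}
```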
  • Event-based Dynamicity Metrics
Our dynamic pixel detection strategy leverages the significant differences between the mean timestamps of events triggered by dynamic objects and those from static backgrounds [33,34,35]. Specifically, events from static backgrounds exhibit concentrated mean timestamps, whereas those from dynamic objects display more evenly distributed mean timestamps across the compensated frame. Upon normalizing the mean timestamps of compensated events, values corresponding to background events tend toward 0, whereas those from dynamic objects tend toward 1.
For the compensated event set $\mathcal{E}'$, we calculate the mean timestamp of events at each pixel position to construct a time image $\mathcal{T}$. The normalized mean timestamp at pixel position $\mathbf{u}_k$ is defined as $nmt(\mathbf{u}_k) = \frac{\mathcal{T}(\mathbf{u}_k) - \bar{\mathcal{T}}}{\Delta t}$, where $\bar{\mathcal{T}}$ is the mean of the time image, and $\Delta t$ represents the duration of the time window.
Unlike other object detection methods [33,34,35], we do not apply hard-threshold processing on $nmt(\mathbf{u})$ for dynamic/static partitioning. Instead, we define the event-based dynamicity metric as $\hat{w}_{ev}(\mathbf{u}_k) = 1 - |nmt(\mathbf{u}_k)| \in [0, 1]$. This metric measures the dynamicity of the corresponding pixel and is used for the adaptive assignment of weights for the visual residual terms.
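A minimal C++ sketch of this metric computation is given below. It is illustrative only; in particular, averaging $\bar{\mathcal{T}}$ over active pixels and treating pixels that receive no events as static are our assumptions, not details confirmed by the released code.

```cpp
#include <cmath>
#include <vector>

struct CompEvent { int x, y; double t; };  // one motion-compensated event

// Computes the event-based dynamicity metric w_ev(u) = 1 - |nmt(u)| per pixel
// from a set of motion-compensated events. width/height give the sensor
// resolution and dt the duration of the time window.
std::vector<double> dynamicityMetric(const std::vector<CompEvent>& events,
                                     int width, int height, double dt) {
  std::vector<double> t_sum(width * height, 0.0);
  std::vector<int> count(width * height, 0);
  for (const auto& e : events) {          // accumulate per-pixel timestamps
    const int idx = e.y * width + e.x;
    t_sum[idx] += e.t;
    ++count[idx];
  }
  // Time image T: per-pixel mean timestamp; T_bar: mean over active pixels.
  std::vector<double> T(width * height, 0.0);
  double T_bar = 0.0;
  int active = 0;
  for (int i = 0; i < width * height; ++i) {
    if (count[i] == 0) continue;
    T[i] = t_sum[i] / count[i];
    T_bar += T[i];
    ++active;
  }
  if (active > 0) T_bar /= active;
  // nmt(u) = (T(u) - T_bar) / dt; w_ev(u) = 1 - |nmt(u)|, clamped to [0, 1].
  std::vector<double> w_ev(width * height, 1.0);  // no events: treated as static
  for (int i = 0; i < width * height; ++i) {
    if (count[i] == 0) continue;
    const double nmt = (T[i] - T_bar) / dt;
    w_ev[i] = std::fmin(1.0, std::fmax(0.0, 1.0 - std::fabs(nmt)));
  }
  return w_ev;
}
```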

3.4. Event-Based Dynamicity-Adaptive Bundle Adjustment

  • Dynamicity-adaptive Objective Function
In the conventional SLAM scheme [2], the visual–inertial BA formulation is defined as follows:
$$\min_{\mathcal{X}} \left\{ \left\| \mathbf{r}_p - \mathbf{H}_p \mathcal{X} \right\|^2 + \sum_{k \in \mathcal{B}} \left\| \mathbf{r}_{\mathcal{B}}\left(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\right) \right\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{(l,j) \in \mathcal{C}} \rho\left( \left\| \mathbf{r}_{\mathcal{C}}\left(\hat{\mathbf{z}}^{c_j}_{l}, \mathcal{X}\right) \right\|^2_{\mathbf{P}^{c_j}_{l}} \right) \right\},$$
where $\mathcal{X}$ represents the full state vector containing the poses and velocities of keyframes, the IMU biases (i.e., acceleration and gyroscope biases), and the estimated depths of features, as detailed in [2]. The terms $\mathbf{r}_p$, $\mathbf{r}_{\mathcal{B}}$, and $\mathbf{r}_{\mathcal{C}}$ represent the residuals for marginalization, IMU, and visual reprojection measurements, respectively. $\hat{\mathbf{z}}^{b_k}_{b_{k+1}}$ and $\hat{\mathbf{z}}^{c_j}_{l}$ denote the observations of the IMU and feature points, respectively. $\mathbf{H}_p$ denotes the measurement matrix for marginalization, and $\rho$ is the Huber loss function. $\left\| \mathbf{r} \right\|^2_{\mathbf{P}} = \mathbf{r}^T \mathbf{P}^{-1} \mathbf{r}$ represents the Mahalanobis distance for a residual term $\mathbf{r}$, where $\mathbf{P}$ denotes the covariance matrix of the term. The Mahalanobis distance is an effective distance measure for two sets of points in multidimensional space that accounts for their correlation. For convenience, we denote $\mathbf{r}_{\mathcal{B}}\left(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\right)$ and $\mathbf{r}_{\mathcal{C}}\left(\hat{\mathbf{z}}^{c_j}_{l}, \mathcal{X}\right)$ as $\mathbf{r}^{k}_{\mathcal{B}}$ and $\mathbf{r}^{l,j}_{\mathcal{C}}$, respectively.
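For reference, the squared Mahalanobis norm used above can be evaluated without explicitly inverting the covariance, for example with Eigen (an illustrative helper, not part of the released code):

```cpp
#include <Eigen/Dense>

// Squared Mahalanobis norm ||r||_P^2 = r^T P^{-1} r for a residual r with
// covariance P, computed by solving P x = r instead of forming P^{-1}.
double squaredMahalanobis(const Eigen::VectorXd& r, const Eigen::MatrixXd& P) {
  return r.dot(P.ldlt().solve(r));
}
```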
In dynamic environments, the Huber loss cannot entirely reject the residuals of outliers [36]. To effectively address this issue, we propose assigning dynamicity-adaptive weights (dynamicity weights) to the visual residuals $\mathbf{r}_{\mathcal{C}}$ of different pixels, thereby filtering out dynamic features through soft thresholding. Specifically, the visual residual term is modified to a novel loss function $\dot{\rho}$:
$$\dot{\rho}\left(w^{l}_{ev}, r^{l}_{\mathcal{C}}\right) = \left(w^{l}_{ev}\right)^2 r^{l}_{\mathcal{C}} + \lambda_w \left(1 - w^{l}_{ev}\right)^2.$$
According to the Black–Rangarajan duality [37], the loss function $\dot{\rho}$ is equivalent to the original visual residual term $r_{\mathcal{C}}$. In Equation (8), $r^{l}_{\mathcal{C}}$ is computed as $r^{l}_{\mathcal{C}} = \sum_{j \in \mathcal{C}(f_l)} \left\| \mathbf{r}^{l,j}_{\mathcal{C}} \right\|^2$, where $\mathbf{r}^{l,j}_{\mathcal{C}}$ represents the residual for the j-th observation of the l-th feature, and $\mathcal{C}(f_l)$ denotes the set of observations associated with feature $f_l$. $w^{l}_{ev}$ represents the dynamicity weight for feature $f_l$. Features with weights close to 1 are considered static, while those with weights close to 0 are considered dynamic. $\lambda_w \in \mathbb{R}^{+}$ is a regularization parameter that adjusts the impact of dynamic features on the loss function.
With the dynamicity metrics derived from event observations, the enhanced BA can be performed in a dynamicity-adaptive manner. Unfortunately, the stability of the dynamicity weights of the same feature across preceding and succeeding frames cannot be ensured solely by introducing event-based metrics. The metric values of the same feature may change abruptly between preceding and succeeding timestamps, which can interfere with the dynamicity assessment of that feature. To address this issue, an additional factor, called the weight momentum factor, is introduced to ensure a smooth transition in the dynamicity weights of the same feature over time.
  • Weight Momentum Factor
Since the features are continuously tracked, we assume that the feature $f_l$ has been optimized $n_l$ times using its previous weight $\bar{w}^{l}_{ev}$. The weight momentum factor $\Psi(w^{l}_{ev})$ is defined as

$$\Psi\left(w^{l}_{ev}\right) = n_l \left(w^{l}_{ev} - \bar{w}^{l}_{ev}\right),$$
where $n_l$ represents the number of times the feature $f_l$ has been optimized, and $\bar{w}^{l}_{ev}$ is the previous weight of the feature. This term helps preserve the weight trend over time by penalizing deviations from the previous weight. Therefore, the weight momentum factor is incorporated into the loss term, resulting in the final enhanced loss term as follows:

$$\ddot{\rho}\left(w^{l}_{ev}, r^{l}_{\mathcal{C}}\right) = \left(w^{l}_{ev}\right)^2 r^{l}_{\mathcal{C}} + \lambda_w \left(1 - w^{l}_{ev}\right)^2 + \lambda_m \Psi^2\left(w^{l}_{ev}\right),$$
where $\lambda_m \in \mathbb{R}^{+}$ is a constant parameter that adjusts the effect of the momentum factor on the BA. Thus, the final BA optimization problem is expressed as

$$\min_{\mathcal{X}, \mathcal{W}} \left\{ \left\| \mathbf{r}_p - \mathbf{H}_p \mathcal{X} \right\|^2 + \sum_{k \in \mathcal{B}} \left\| \mathbf{r}_{\mathcal{B}}\left(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\right) \right\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{l \in \mathcal{F}_{\mathcal{C}}} \ddot{\rho}\left(w^{l}_{ev}, r^{l}_{\mathcal{C}}\right) \right\},$$
where $\mathcal{F}_{\mathcal{C}}$ denotes the set of indices of all tracked features in the current moving window, and $\mathcal{W} = \left\{ w^{l}_{ev} \,\middle|\, l \in \mathcal{F}_{\mathcal{C}} \right\}$ represents the set of all feature weights.
  • Details on Weights Optimization and Update Strategies
Notably, event cameras have a much lower signal-to-noise ratio (SNR) than frame-based cameras [38], and in cases of aggressive motion, event observations become significantly more inaccurate. Therefore, it is crucial to evaluate whether the system remains effective in the presence of high observation noise or observation failures. We propose optimizing and updating the dynamicity weights during each round of state estimation in BA to mitigate the impact of outliers and dynamic features with large visual reprojection residuals. In our E²-VINS, we employ alternating optimization [37] to solve the problem presented in Equation (11). Specifically, we first optimize the dynamicity weights $\mathcal{W}$ while keeping the state vector $\mathcal{X}$ fixed, and subsequently optimize the state vector $\mathcal{X}$ with the dynamicity weights $\mathcal{W}$ held constant. By optimizing $\mathcal{W}$ and $\mathcal{X}$ alternately, the BA converges robustly and stably.
While optimizing $\mathcal{W}$, all terms except the weights are constant. Hence, the objective function for optimizing the weights can be expressed as follows:
$$\min_{w^{l}_{ev} \in [0,1]} \left(w^{l}_{ev}\right)^2 \sum_{j \in \mathcal{C}(f_l)} \left\| \mathbf{r}^{l,j}_{\mathcal{C}} \right\|^2 + \lambda_w \left(1 - w^{l}_{ev}\right)^2 + \lambda_m \Psi^2\left(w^{l}_{ev}\right),$$
where each weight $w^{l}_{ev}$ is independent of the others, so each $w^{l}_{ev}$ can be optimized independently. Since the objective is quadratic in $w^{l}_{ev}$, the optimal $w^{l*}_{ev}$ under the current state estimate is derived as follows:
$$w^{l*}_{ev} = \frac{\lambda_w + \lambda_m n_l^2\, \bar{w}^{l}_{ev}}{r^{l}_{\mathcal{C}} + \lambda_w + \lambda_m n_l^2} \in [0, 1].$$
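For completeness, this closed form follows from setting the derivative of the objective in Equation (12) with respect to $w^{l}_{ev}$ to zero, using $\Psi(w^{l}_{ev}) = n_l (w^{l}_{ev} - \bar{w}^{l}_{ev})$ from Equation (9):

$$\frac{\partial}{\partial w^{l}_{ev}}\left[ \left(w^{l}_{ev}\right)^2 r^{l}_{\mathcal{C}} + \lambda_w \left(1 - w^{l}_{ev}\right)^2 + \lambda_m n_l^2 \left(w^{l}_{ev} - \bar{w}^{l}_{ev}\right)^2 \right] = 2 w^{l}_{ev} r^{l}_{\mathcal{C}} - 2 \lambda_w \left(1 - w^{l}_{ev}\right) + 2 \lambda_m n_l^2 \left(w^{l}_{ev} - \bar{w}^{l}_{ev}\right) = 0,$$

which gives $w^{l}_{ev}\left( r^{l}_{\mathcal{C}} + \lambda_w + \lambda_m n_l^2 \right) = \lambda_w + \lambda_m n_l^2\, \bar{w}^{l}_{ev}$, i.e., Equation (13).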
As expected, the optimal weight value is negatively correlated with the visual residuals. During optimization, the weights of outlier features decrease, and their losses become flatter. As a result, the losses of outlier features approach a zero gradient and do not affect the BA. Before optimizing the state vector $\mathcal{X}$, the dynamicity weight $w$ of each feature on the current frame is integrated from the event-based dynamicity metric $\hat{w}_{ev}$ and the theoretical optimal value $w^{*}_{ev}$:
$$a_w = \left| \hat{w}_{ev} - w^{*}_{ev} \right|, \qquad w = \begin{cases} w^{*}_{ev}, & a_w > \lambda_e, \\ \left(1 - a_w\right) \hat{w}_{ev} + a_w\, w^{*}_{ev}, & a_w \le \lambda_e, \end{cases}$$
where $\lambda_e \in [0, 1]$ is a constant parameter used to adjust the confidence placed in the event-based dynamicity metrics versus the optimized weight values. In this design, the event-based dynamicity metric $\hat{w}_{ev}$ implicitly affects the value $w^{*}_{ev}$ of the current optimized weight by acting on the previous weight $\bar{w}_{ev}$.
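Putting Equations (13) and (14) together, one round of the per-feature weight update can be sketched as follows. This is illustrative C++ only; the bookkeeping of $n_l$ and $\bar{w}^{l}_{ev}$ reflects our assumption about how the update is organized, not the released implementation.

```cpp
#include <algorithm>
#include <cmath>

// Per-feature bookkeeping for the dynamicity weight (illustrative names).
struct FeatureWeight {
  double w_prev = 1.0;  // previous weight \bar{w}_ev^l
  int n_opt = 0;        // number of times the feature has been optimized, n_l
};

// One weight update for a tracked feature. r_C is the accumulated squared
// visual reprojection residual of the feature, w_ev_hat the event-based
// dynamicity metric at the feature's pixel.
double updateDynamicityWeight(FeatureWeight& fw, double r_C, double w_ev_hat,
                              double lambda_w, double lambda_m,
                              double lambda_e) {
  // Closed-form optimum of Eq. (12) with the state fixed (Eq. (13)).
  const double n2 = static_cast<double>(fw.n_opt) * fw.n_opt;
  const double w_star = (lambda_w + lambda_m * n2 * fw.w_prev) /
                        (r_C + lambda_w + lambda_m * n2);
  // Fuse with the event-based metric (Eq. (14)).
  const double a_w = std::fabs(w_ev_hat - w_star);
  const double w = (a_w > lambda_e)
                       ? w_star
                       : (1.0 - a_w) * w_ev_hat + a_w * w_star;
  // Bookkeeping for the momentum factor in the next round.
  fw.w_prev = std::clamp(w, 0.0, 1.0);
  ++fw.n_opt;
  return fw.w_prev;
}
```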

4. Experiments

4.1. System Implementation and Methods for Comparison

We compared our proposed E²-VINS with the following competing methods:
ORB-SLAM3 [1] is a system capable of performing visual, visual–inertial, and multimap SLAM with monocular, stereo, and RGB-D cameras using pinhole and fisheye lens models. It is not only a feature-based, tightly integrated visual–inertial SLAM system that relies on Maximum A Posteriori (MAP) estimation but also a multiple map system that utilizes a novel place recognition method for improved recall.
DSO [3] is a direct and sparse formulation for visual odometry. It combines a fully direct approach that minimizes photometric errors with a consistent and joint optimization of all model parameters, including geometry represented as inverse depth in a reference frame and camera motion.
VINS-FUSION [28] is an optimization-based multisensor state estimator. It features a tightly coupled monocular visual–inertial state estimator designed by fusing preintegrated IMU measurements with feature observations.
USLAM [21] is an indirect monocular method that fuses events, frames, and IMU measurements. Its frontend converts events into frames by motion compensation using the IMU’s gyroscope and the median scene depth. Then, FAST corners [39] are extracted and tracked separately on the event frames and the grayscale frames and passed to a geometric feature-based backend. Similar to the proposed E²-VINS, USLAM [21] constructs a tightly coupled three-modality SLAM system using IMU, events, and image frames as inputs.
DynaVINS [8] is a novel visual–inertial SLAM framework designed for dynamic environments. It uses robust bundle adjustment based on IMU preintegration and a multihypothesis-based constraints grouping method to reduce the impact of dynamic objects and temporarily static objects, respectively.
All of the above state-of-the-art methods support multisensor configurations. To ensure a fair evaluation of the proposed method, we used the versions of these systems that are based on monocular cameras and IMUs for comparison, and we kept the parameters of all comparison methods consistent with their official settings.
Our E²-VINS system was configured with several key parameters to ensure efficient execution of visual–inertial SLAM tasks in dynamic environments. First, the feature tracking section set the maximum number of features to 150 and the minimum distance between features to 15 pixels to prevent overly dense or sparse feature distributions. The frequency for publishing tracking results was set to 10 Hz to ensure accurate estimations. The maximum estimated depth was limited to 10 m. The regularization parameter $\lambda_w$ was set to 2.0 to prevent overfitting, and the momentum coefficient $\lambda_m$ was set to 0.2. The alternating convergence factor was set to 0.9 and served as the convergence threshold for the total residuals during the optimization process. For the optimization parameters, the maximum solver time was set to 0.3 milliseconds to ensure real-time performance. The maximum number of solver iterations was limited to 10 to prevent excessive iterations that could degrade performance. The keyframe selection threshold based on parallax was set to 10 pixels to ensure accurate keyframe selection in scenes with significant changes. Finally, the IMU parameters included the accelerometer and gyroscope noise standard deviations, as well as the accelerometer and gyroscope bias random walk noise values, which were set to empirical values ($\mathbf{n}_a$: 0.2, $\mathbf{b}_a$: 0.05, $\mathbf{n}_\omega$: 0.02, $\mathbf{b}_\omega$: $4 \times 10^{-5}$). The gravity magnitude was set to 9.81007 m/s². These parameter settings ensure that the system has good robustness and real-time performance in dynamic environments while providing sufficient flexibility to adapt to different application scenarios.
All modules in E²-VINS were implemented in C++. In both the frontend and the backend, pose graph optimization was solved using the open-source library Ceres [40]. The proposed system leverages the widely used Robot Operating System (ROS) [41] for inter-process message communication. The evaluation experiments for all VI-SLAM algorithms in this section were conducted on a laptop equipped with an AMD Ryzen 5 4600U CPU.

4.2. Datasets and Metrics

Our E²-VINS and the compared methods (introduced in Section 4.1) were tested on two datasets: the VIODE dataset (VIODE: https://github.com/kminoda/VIODE, accessed on 27 September 2024) [42] and the ECMD dataset (The ECMD dataset: https://arclab-hku.github.io/ecmd/, accessed on 27 September 2024) [43]. The VIODE dataset [42] was recorded from a simulated UAV that navigates in challenging dynamic environments. Each dynamic environment includes four levels of dynamic complexity, progressively adding more moving objects. The ECMD dataset [43] is the first event-based SLAM dataset specifically focused on urbanized autonomous driving. It was recorded with a sensor platform consisting of various novel sensors, including two sets of stereo event cameras with distinct resolutions (640 × 480, 346 × 260), an infrared camera, stereo industrial cameras, three mechanical LiDARs (including two slanted LiDARs), an onboard inertial measurement unit (IMU), and two global navigation satellite system (GNSS) receivers. The ground truths were provided by a centimeter-level positioning system that combines GNSS real-time kinematic (RTK) with a fiber optic gyroscope integrated inertial system (GNSS-RTK/INS). The ECMD dataset [43] was collected in various driving scenarios, including dense streets, urban areas, tunnels, highways, bridges, and suburbs. These sequences were recorded in daylight and at night, providing challenging situations for visual and LiDAR SLAM systems, e.g., dynamic objects, high-speed motion, repetitive scenarios, and HDR scenes.
To assess the performance of the full SLAM method, we quantified pose estimation results using the mean position error (MPE, %). The MPE (%) is calculated as the mean position error (in meters) divided by the total length of the trajectory (in meters). We used the toolbox from [44] to evaluate the poses given by different SLAM solutions.
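Under one straightforward reading of this definition (trajectories of equal length, already time-associated and aligned to the ground truth, e.g., with the evo toolbox [44]), the metric can be computed as in the sketch below. The function is an illustrative helper, not the evaluation code actually used.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// MPE (%) = mean absolute position error (m) / total ground-truth trajectory
// length (m) * 100. Assumes estimate and ground_truth have the same size and
// are time-associated and aligned beforehand.
double meanPositionErrorPercent(const std::vector<Vec3>& estimate,
                                const std::vector<Vec3>& ground_truth) {
  double err_sum = 0.0, length = 0.0;
  for (size_t i = 0; i < estimate.size(); ++i) {
    const double dx = estimate[i].x - ground_truth[i].x;
    const double dy = estimate[i].y - ground_truth[i].y;
    const double dz = estimate[i].z - ground_truth[i].z;
    err_sum += std::sqrt(dx * dx + dy * dy + dz * dz);
    if (i > 0) {  // accumulate ground-truth trajectory length
      const double lx = ground_truth[i].x - ground_truth[i - 1].x;
      const double ly = ground_truth[i].y - ground_truth[i - 1].y;
      const double lz = ground_truth[i].z - ground_truth[i - 1].z;
      length += std::sqrt(lx * lx + ly * ly + lz * lz);
    }
  }
  if (estimate.empty() || length <= 0.0) return 0.0;
  return 100.0 * (err_sum / estimate.size()) / length;
}
```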

4.3. Qualitative Results

Figure 5 shows a visualization of some typical qualitative results generated by our method and its competitors. The weighted feature image frames used in BA by our E²-VINS on three test sequences from the VIODE dataset [42] (city_day, city_night, and parking_lot) and four sequences from the ECMD dataset [43] (Dense_street_day_easy_a, Dense_street_day_easy_b, Dense_street_night_easy_a, and Urban_road_day_easy_a) are shown on the upper side. In the figure, features with low weights (from dynamic objects) are plotted as red points, while features with high weights (from static objects) are shown as green points. From Figure 5, it can be seen that E²-VINS effectively recognized dynamic objects, such as moving trucks, cars, buses, and pedestrians, and reduced their feature weights, even in relatively dim nighttime sequences. The trajectories of the compared algorithms and E²-VINS on the challenging sequences (Dense_street_day_medium_a and Dense_street_night_easy_a) of the ECMD dataset [43] are illustrated on the lower side. The former is a complex scene with multiple dynamic objects (a large number of large trucks waiting for red lights under overpasses), while the latter is a dim nighttime street scene. All other algorithms, except ours, either lost track or produced noisy trajectories in these dynamic and nighttime environments.

4.4. Quantitative Results

Table 3 and Table 4 report the quantitative results of the comparison of our method with other SLAM methods in terms of mean position error (%) on sequences from the VIODE dataset [42] and the ECMD dataset [43], respectively. The optimal results are highlighted in bold, while the runner-up results are underlined. The evaluation experiments were conducted on the complete VIODE dataset [42] and partial sequences of ECMD [43], which correspond to scenes that include dense streets, highways, and nighttime scenes. Due to the absence of event data in the VIODE dataset [42], we used v2e [45] to generate event data from the videos. This generated data served as input for the proposed E²-VINS to obtain reliable quantitative comparison results.
Table 3 shows that E²-VINS outperformed all other baseline methods in translation accuracy except for the nighttime sequences, where its performance was comparatively lower. This lower performance in nighttime sequences is attributed to the relatively low quality of the event data generated by v2e [45] from dim nighttime videos, which lack the high dynamic range characteristics of real event cameras. The dynamic SLAM method DynaVINS [8] showed stronger overall adaptability to dynamic environments than the other classical methods. In contrast, classical SLAM methods such as ORB-SLAM3 [1] and DSO [3], which perform well in static scenes, were unable to effectively complete localization tasks in nighttime sequences and outdoor highly dynamic scenarios. The results indicate that feature-based methods, due to their sparse feature points, are more resilient to dynamic scene disturbances than direct methods.
Furthermore, Table 4 confirms that E²-VINS is superior to all other baseline methods in real dynamic scenarios. It is noteworthy that the other comparative methods failed to complete the test sequences “Dense_street_day_easy_a” and “Dense_street_day_medium_a”. In the former sequence, there are vehicles queuing up and large vehicles occupying most of the field of view, causing traditional pipeline strategies [1,28] to treat the relatively few static points as outliers. In contrast, with IMU-assisted event data, many dynamic points can be identified by the dynamicity metrics and have their weights reduced, resulting in an effect opposite to that of traditional visual–inertial SLAM. In the latter sequence, a large number of large trucks queue under an overpass waiting for traffic lights. In this complex scenario with multiple dynamic targets, classical SLAM methods treat pseudo-static feature points, which are transformed into dynamic points, as lost tracks and then detect new feature points. During this process, motion estimation errors and trajectory “drift” can occur, as shown in the lower side of Figure 5. Among the compared schemes, DynaVINS [8] demonstrated a higher success rate than the traditional static schemes [1,3,21,28] due to its consideration of dynamic environments and its use of IMU motion priors to filter out dynamic features. The event-based approach USLAM [21] showed advantages in scenarios with fewer dynamic targets (Urban_road), but its lack of a strategy for handling moving targets made it unable to cope with scenarios containing more dynamic targets (Dense_street). In contrast, the proposed event-based dynamicity metrics ensure that E²-VINS can measure the dynamicity of feature points with high temporal resolution, and the designed momentum weight factor and alternating optimization strategy ensure that the dynamicity weights of features are smoothly updated, giving the proposed approach strong robustness in dynamic environments where other visual–inertial SLAM methods struggle.

5. Discussions

The primary advantage of E²-VINS lies in its ability to robustly handle dynamic environments, an aspect where traditional VI-SLAM methods often struggle. By introducing dynamicity-aware weighting, our method adaptively modulates the contribution of visual residuals during BA, selectively reducing the influence of dynamic features on the system. This ability to filter out dynamic features through a soft-threshold mechanism significantly improves the robustness of the SLAM system, particularly in scenarios with moving objects. The average improvement of 1.884% in the mean position error (compared to state-of-the-art methods) demonstrates the effectiveness of the dynamicity-adaptive strategy in dynamic environments.
Despite the strengths of E²-VINS, there are some inherent limitations. In cases where the number of static feature points becomes too sparse (e.g., in narrow corridors, tunnels, or low-texture environments), our dynamicity-weighted BA can degrade into an IMU-residual-dominated approach, leading to poor localization performance. This issue arises when the remaining static features no longer provide enough constraints and the system relies too heavily on IMU data for pose estimation. To address this, future improvements should focus on enhancing the system’s robustness in low-texture scenarios by incorporating more advanced feature extraction methods or by developing strategies that better balance the contributions of IMU and visual data. Furthermore, the alignment strategy between image frames and events in E²-VINS currently depends on hardware synchronization, which is not universally available across all event camera models. While synchronization is supported by specific models such as DAVIS346, ATIS, and CeleX5_mp, this limitation restricts the applicability of our method to these particular cameras. Future research should focus on developing robust software-based alignment algorithms that can accommodate a wider range of event camera models, extending the applicability of our method to more diverse systems.
Our work builds on existing VI-SLAM and event-camera-based SLAM systems but differs significantly in its approach to handling dynamic environments. Many traditional VI-SLAM systems fail to adequately deal with dynamic objects, often leading to drift or incorrect trajectory estimates in the presence of moving features. Previous methods in dynamic SLAM often rely on dynamic object detection or motion segmentation; however, these methods typically require either prior knowledge or computationally expensive algorithms that do not run in real time. By contrast, E²-VINS introduces a dynamicity-adaptive approach that filters dynamic features directly within the optimization framework, providing a more efficient and robust solution. Moreover, previous work in event-based SLAM has primarily focused on static environments. Our method is the first to integrate dynamicity weights into the bundle adjustment process, making it significantly more effective in real-world dynamic environments where traditional methods struggle.
As part of future work, we plan to deploy E²-VINS in practical scenarios such as autonomous driving and robotics, where robustness to moving objects is crucial. Real-time testing and performance evaluation in these environments will help identify further optimizations and ensure that the method performs effectively under real-world conditions. Another important direction for future work is improving dynamic feature segmentation to handle more complex dynamic features, such as non-rigid objects or unpredictable movements, which are common in real-world dynamic environments. In light of the current limitation with hardware synchronization, we plan to explore software-based synchronization techniques for event cameras. By enabling synchronization across a broader range of camera models, we can extend the applicability of E²-VINS to more diverse sensor setups.

6. Conclusions

In this paper, we propose a novel event-enhanced VI-SLAM solution for dynamic environments named E²-VINS. In E²-VINS, the normalized mean timestamps are calculated on the spatiotemporally aligned events as event-based dynamicity metrics, which measure the dynamicity of each pixel. Based on these metrics, dynamicity weights are adaptively assigned to the visual residual terms of different pixels. Subsequently, E²-VINS jointly and alternately optimizes the system state (camera poses and map points) and dynamicity weights, effectively filtering out dynamic features through a soft-threshold mechanism. Our approach enhances the robustness of bundle adjustment to dynamic features in a dynamicity-adaptive manner, resulting in an average improvement of 1.884% in the mean position error compared to state-of-the-art methods, as shown in Table 4. Experimental results further demonstrate that E²-VINS can effectively distinguish between dynamic and static features and generate correct trajectories even in challenging dynamic environments where traditional methods struggle, confirming its superior effectiveness. This work provides valuable insights for the deployment of VI-SLAM in dynamic settings, such as autonomous driving and robotics, where robustness to moving objects is crucial. In the future, we plan to integrate our method into real-world applications and evaluate its performance in practical deployment scenarios. Improving dynamic feature segmentation to handle more complex dynamic features will be another key focus of our future work.

Author Contributions

Supervision, L.Z.; Writing—original draft, J.H.; Writing—review and editing, S.Z. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61936014 and 62272343; in part by the Shuguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission under Grant 21SG23; and in part by the Fundamental Research Funds for the Central Universities.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All generated data are available at https://cslinzhang.github.io/EE-VINS/EE-VINS.html (accessed on 27 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  2. Qin, T.; Li, P.; Shen, S. VINS-MONO: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  3. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef]
  4. Saputra, M.; Markham, A.; Trigoni, N. Visual SLAM and structure from motion in dynamic environments: A survey. ACM Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
  5. Kim, D.; Han, S.; Kim, J. Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment. In Proceedings of the Robot Intelligence Technology and Applications, Bucheon, Republic of Korea, 14–16 December 2015; pp. 11–26. [Google Scholar]
  6. Fan, Y.; Han, H.; Tang, Y.; Zhi, T. Dynamic objects elimination in SLAM based on image fusion. Pattern Recognit. Lett. 2019, 127, 191–201. [Google Scholar] [CrossRef]
  7. Canovas, B.; Rombaut, M.; Nègre, A.; Pellerin, D.; Olympieff, S. Speed and memory efficient dense RGB-D SLAM in dynamic scenes. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020; pp. 4996–5001. [Google Scholar]
  8. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 11523–11530. [Google Scholar] [CrossRef]
  9. Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  10. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  11. Schörghuber, M.; Steininger, D.; Cabon, Y.; Humenberger, M.; Gelautz, M. SLAMANTIC-leveraging semantics to improve VSLAM in dynamic environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Republic of Korea, 27–28 October 2019; pp. 3759–3768. [Google Scholar]
  12. Liu, J.; Li, X.; Liu, Y.; Chen, H. RGB-D inertial odometry for a resource-restricted robot in dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 9573–9580. [Google Scholar] [CrossRef]
  13. Lichtsteiner, P.; Posch, C.; Delbruck, T. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576. [Google Scholar] [CrossRef]
  14. Posch, C.; Matolin, D.; Wohlgenannt, R. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE J. Solid-State Circuits 2011, 46, 259–275. [Google Scholar] [CrossRef]
  15. Brandli, C.; Berner, R.; Yang, M.; Liu, S.; Delbruck, T. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circuits 2014, 49, 2333–2341. [Google Scholar] [CrossRef]
  16. Kim, H.; Kim, H.J. Real-time rotational motion estimation with contrast maximization over globally aligned events. IEEE Robot. Autom. Lett. 2021, 6, 6016–6023. [Google Scholar] [CrossRef]
  17. Gallego, G.; Lund, J.E.A.; Mueggler, E.; Rebecq, H.; Delbruck, T.; Scaramuzza, D. Event-based, 6-DoF camera tracking from photometric depth maps. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2402–2412. [Google Scholar] [CrossRef]
  18. Kim, H.; Leutenegger, S.; Davison, A.J. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 349–364. [Google Scholar]
  19. Rebecq, H.; Horstschaefer, T.; Gallego, G.; Scaramuzza, D. EVO: A geometric approach to event-based 6-DoF parallel tracking and mapping in real time. IEEE Robot. Autom. Lett. 2017, 2, 593–600. [Google Scholar] [CrossRef]
  20. Zuo, Y.; Yang, J.; Chen, J.; Wang, X.; Wang, Y.; Kneip, L. DEVO: Depth-event camera visual odometry in challenging conditions. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 2179–2185. [Google Scholar]
  21. Vidal, A.R.; Rebecq, H.; Horstschaefer, T.; Scaramuzza, D. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robot. Autom. Lett. 2018, 3, 994–1001. [Google Scholar] [CrossRef]
  22. Zuo, Y.; Xu, W.; Wang, X.; Wang, Y.; Kneip, L. Cross-modal semi-dense 6-DoF tracking of an event camera in challenging conditions. IEEE Trans. Robot. 2024, 40, 1600–1616. [Google Scholar] [CrossRef]
  23. Hidalgo-Carrió, J.; Gallego, G.; Scaramuzza, D. Event-aided direct sparse odometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5771–5780. [Google Scholar]
  24. Huang, J.; Zhao, S.; Zhang, T.; Zhang, L. MC-VEO: A visual-event odometry with accurate 6-DoF motion compensation. IEEE Trans. Intell. Veh. 2024, 9, 1756–1767. [Google Scholar] [CrossRef]
  25. Guan, W.; Chen, P.; Xie, Y.; Lu, P. PL-EVIO: Robust monocular event-based visual inertial odometry with point and line Features. IEEE Trans. Autom. Sci. Eng. 2024, 21, 6277–6293. [Google Scholar] [CrossRef]
  26. Guo, S.; Gallego, G. CMax-SLAM: Event-based rotational-motion bundle adjustment and SLAM system using contrast maximization. IEEE Trans. Robot. 2024, 40, 2442–2461. [Google Scholar] [CrossRef]
  27. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  28. Qin, T.; Pan, J.; Cao, S.; Shen, S. A general optimization-based framework for local odometry estimation with multiple sensors. arXiv 2019, arXiv:1901.03638. [Google Scholar]
  29. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2, pp. 674–679. [Google Scholar]
  30. Nistér, D. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 756–770. [Google Scholar] [CrossRef]
  31. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  32. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment—A modern synthesis. In Proceedings of the International Workshop on Vision Algorithms, Corfu, Greece, 21–22 September 2000; pp. 298–372. [Google Scholar]
  33. Zhao, C.; Li, Y.; Lyu, Y. Event-based real-time moving object detection based on IMU ego-motion compensation. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 690–696. [Google Scholar]
  34. Delbruck, T.; Villanueva, V.; Longinotti, L. Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision. In Proceedings of the IEEE International Symposium on Circuits and Systems, Melbourne, Australia, 1–5 June 2014; pp. 2636–2639. [Google Scholar]
  35. Falanga, D.; Kleber, K.; Scaramuzza, D. Dynamic obstacle avoidance for quadrotors with event cameras. Sci. Robot. 2020, 5, eaaz9712. [Google Scholar] [CrossRef]
  36. Babin, P.; Giguère, P.; Pomerleau, F. Analysis of robust functions for registration algorithms. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 1451–1457. [Google Scholar]
  37. Black, M.; Rangarajan, A. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. Int. J. Comput. Vis. 1996, 19, 57–91. [Google Scholar] [CrossRef]
  38. Corke, P.I.; Khatib, O. Robotics, Vision and Control: Fundamental Algorithms in MATLAB; Springer: Berlin/Heidelberg, Germany, 2011; Volume 73. [Google Scholar]
  39. Rosten, E.; Drummond, T. Machine learning for high-speed corner detection. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 430–443. [Google Scholar]
  40. Agarwal, S.; Mierle, K.; The Ceres Solver Team. Ceres Solver. 2023. Available online: http://ceres-solver.org (accessed on 27 September 2024).
  41. OpenRobotics. ROS—Robot Operating System. 2021. Available online: https://www.ros.org/ (accessed on 27 September 2024).
  42. Minoda, K.; Schilling, F.; Wüest, V.; Floreano, D.; Yairi, T. VIODE: A simulated dataset to address the challenges of visual-inertial odometry in dynamic environments. IEEE Robot. Autom. Lett. 2021, 6, 1343–1350. [Google Scholar] [CrossRef]
  43. Chen, P.; Guan, W.; Huang, F.; Zhong, Y.; Wen, W.; Hsu, L.; Lu, P. ECMD: An event-centric multisensory driving dataset for SLAM. IEEE Trans. Intell. Veh. 2024, 9, 407–416. [Google Scholar] [CrossRef]
  44. Grupp, M. evo: Python Package for the Evaluation of Odometry and SLAM. 2017. Available online: https://github.com/MichaelGrupp/evo (accessed on 27 September 2024).
  45. Hu, Y.; Liu, S.; Delbruck, T. v2e: From video frames to realistic DVS events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, Nashville, TN, USA, 19–25 June 2021; pp. 1312–1321. [Google Scholar]
Figure 1. The overall pipeline of our proposed E²-VINS system. The preprocessing stage includes IMU preintegration and feature tracking on the RGB frames. During system initialization, E²-VINS performs geometric alignment between the IMU preintegration and the vision-based structure-from-motion results while estimating the IMU biases, yielding calibrated IMU measurements. The calibrated measurements are then used for IMU-assisted event motion compensation. The compensated events are used to generate event-based dynamicity metrics, which assess the motion dynamics of each pixel. Based on these metrics, the visual residuals of different pixels are adaptively assigned dynamicity weights. Finally, E²-VINS jointly and iteratively optimizes the system state (camera poses and map points) and the dynamicity weights.
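To make the alternating scheme in Figure 1 concrete, the sketch below shows one plausible form of the dynamicity-weight update, assuming a Black–Rangarajan-style regularizer [37] and an exponential coupling to the event-based dynamicity metric. The exact residual model, regularizer, and coupling used by E²-VINS are not reproduced here, and the state-optimization half (weighted bundle adjustment over poses and map points) is only indicated in the comments.

```python
# Hedged sketch of the weight-update half of the alternating optimization.
# The regularizer (lam) and the exponential coupling to the event-based
# dynamicity metric are illustrative assumptions, not the paper's exact terms.
import numpy as np

def dynamicity_weights(residuals, dynamicity, lam=1.0, alpha=1.0):
    # Soft-threshold weight from the visual residual r (Black-Rangarajan style [37]):
    #   argmin_w  w * r^2 + lam * (1 - w)^2   ->   w = clip(1 - r^2 / (2 * lam), 0, 1)
    w = np.clip(1.0 - residuals**2 / (2.0 * lam), 0.0, 1.0)
    # Attenuate by the per-pixel dynamicity metric d in [0, 1] (assumed coupling).
    return w * np.exp(-alpha * dynamicity)

# Toy example: two features on static structure, one on a moving object.
residuals  = np.array([0.1, 0.2, 1.5])    # reprojection errors in pixels (illustrative)
dynamicity = np.array([0.05, 0.10, 0.90]) # event-based dynamicity metric per feature
print(dynamicity_weights(residuals, dynamicity))
# Static features retain most of their weight; the dynamic feature is driven to 0.
# In the full system, the state (poses, map points) would then be re-optimized
# with these weights, and the two steps alternate until convergence.
```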
Figure 2. The illustration of the visual–inertial alignment. By matching the visual structure with IMU preintegration, the system estimates the gyroscope bias and calibrates IMU measurements.
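As a rough illustration of the alignment in Figure 2, the sketch below estimates a constant gyroscope bias by least-squares fitting the mismatch between vision-derived relative rotations and IMU-integrated rotations. It relies on a crude small-rotation, first-order model (residual Jacobian approximated by Δt·I and a fixed sign convention); a real VINS-style initializer would use the rotation-preintegration Jacobian with respect to the bias instead. The function name and inputs are illustrative, not taken from the released code.

```python
# Minimal sketch: gyroscope-bias estimation by visual-inertial rotation alignment.
# Assumptions: small rotations, constant bias, and a residual Jacobian of dt * I.
import numpy as np
from scipy.spatial.transform import Rotation as R

def estimate_gyro_bias(R_vis_list, R_imu_list, dt_list):
    A_rows, b_rows = [], []
    for R_vis, R_imu, dt in zip(R_vis_list, R_imu_list, dt_list):
        # Rotation mismatch between the SfM relative rotation and the IMU-integrated
        # one, expressed as a rotation vector (approximately linear in the bias).
        r = (R.from_matrix(R_vis).inv() * R.from_matrix(R_imu)).as_rotvec()
        A_rows.append(dt * np.eye(3))
        b_rows.append(r)
    A, b = np.vstack(A_rows), np.hstack(b_rows)
    bias, *_ = np.linalg.lstsq(A, b, rcond=None)
    return bias  # subtract from raw gyro readings to obtain calibrated measurements
```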
Figure 3. The differences between the distributions of the raw events and the motion-compensated ones (blue: positive events; red: negative events). The event camera moves freely and observes scenes of an office and a moving person. (a) shows the distribution of the raw events in the spatiotemporal space. (b) shows the event frame formed by the accumulation of raw events. (c,d) show the motion-compensated results corresponding to (a,b), respectively.
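The compensation effect visualized in Figure 3 can be approximated, for rotation only, by warping every event to a common reference time using the bias-corrected gyroscope reading. The sketch below assumes a constant angular velocity over the event window and ignores translation and scene depth, which is a simplification of the paper's IMU-assisted compensation rather than its full model.

```python
# Hedged sketch of rotation-only, IMU-assisted motion compensation of events.
import numpy as np
from scipy.spatial.transform import Rotation as R

def compensate_events(events, omega, K, t_ref):
    """events: (N, 3) array of [x_pixel, y_pixel, t]; omega: mean angular velocity
    (rad/s) over the window from the calibrated gyro; K: 3x3 camera intrinsics."""
    K_inv = np.linalg.inv(K)
    warped = []
    for x, y, t in events:
        # Camera rotation between the event time and the reference time
        # (constant angular velocity assumed; sign convention is illustrative).
        R_c = R.from_rotvec(omega * (t_ref - t)).as_matrix()
        ray = K_inv @ np.array([x, y, 1.0])   # back-project the pixel to a bearing ray
        ray_w = K @ (R_c @ ray)               # rotate the ray and re-project it
        warped.append(ray_w[:2] / ray_w[2])
    return np.asarray(warped)
```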
Figure 4. Illustration of self-motion compensation when the rotation angle about the Y axis is θ_k. A Y-axis compensation displacement occurs when the initial coordinate of an event is not at o. u_k is the coordinate of the event triggered by the object p. After the camera rotates by an angle θ_k around the Y axis, u_k moves to u_k′. This process can be viewed as the motion of the 3D point p to the point p′, which corresponds to u_k′ in the original pose. In this view, the incident angles before and after the movement are ξ_x + θ_k and ξ_x, respectively. The angle ξ_x can be calculated using Equation (6). The compensation displacement t_x in the pixel plane can then be obtained.
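A small numeric illustration of the geometry in Figure 4: under a pinhole model, the incident angle of a pixel and the pixel it maps to after a Y-axis rotation give the compensation displacement directly. The sign convention for θ_k below is an assumption; Equation (6) in the paper fixes the exact form.

```python
# Hedged numeric sketch of the Y-axis self-motion compensation displacement.
# xi_x = arctan((u_x - c_x) / f_x); after a rotation theta_k about the Y axis the
# same ray projects to c_x + f_x * tan(xi_x + theta_k), so the compensation
# displacement t_x is the difference of the two pixel coordinates.
import math

def y_axis_displacement(u_x, f_x, c_x, theta_k):
    xi_x = math.atan2(u_x - c_x, f_x)                # incident angle of the event ray
    u_x_new = c_x + f_x * math.tan(xi_x + theta_k)   # pixel after the rotation
    return u_x_new - u_x                             # compensation displacement t_x

# Example: a pixel 100 px right of the principal point, f_x = 300 px,
# camera yawed by 2 degrees -> roughly a 12 px displacement.
print(y_axis_displacement(u_x=420.0, f_x=300.0, c_x=320.0, theta_k=math.radians(2.0)))
```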
Figure 5. Visualization of some typical qualitative results. The weighted feature image frames obtained by E²-VINS on three test sequences from the VIODE dataset [42] (city_day, city_night, and parking_lot) and four sequences from the ECMD dataset [43] (Dense_street_day_easy_a, Dense_street_day_easy_b, Dense_street_night_easy_a, and Urban_road_day_easy_a) are shown on the upper side. Features with low weights (from dynamic objects) are plotted as red points, while features with high weights (from static objects) are shown as green ones. The trajectories of the compared algorithms and E²-VINS on the Dense_street_day_medium_a and Dense_street_night_easy_a sequences of the ECMD dataset [43] are illustrated on the lower side.
Table 1. An overview of existing event-based SLAM.

| Method | Input | Main Limitations | Dynamic Environment |
|---|---|---|---|
| Kim et al. [18] | Events | GPU requirement | × |
| Rebecq et al. [19] | Events | Insufficient robustness | × |
| Guo and Gallego [26] | Events | Rotational motion only | × |
| Zuo et al. [20] | Events and depth images | RGB-D camera brings efficiency bottleneck | × |
| Hidalgo-Carrió et al. [23] | Events and RGB images | Relatively low computational efficiency | × |
| Huang et al. [24] | Events and RGB images | – | × |
| Guan et al. [25] | Events, images, and IMU measurements | Not open source | × |
| Vidal et al. [21] | Events, images, and IMU measurements | Insufficient utilization of the motion sensitivity of events | × |
| E²-VINS (Ours) | Events, images, and IMU measurements | – | ✓ |
Table 2. An overview of existing dynamic VI-SLAM.

| Method | Camera Configuration | Prior Knowledge | Temporary Stopping | Uncertainty |
|---|---|---|---|---|
| Constraint-based methods | | | | |
| Kim et al. [5] | RGB-D | × | ✓ | ✓ |
| Fan et al. [6] | Monocular | × | × | × |
| Canovas et al. [7] | RGB-D | × | × | × |
| Song et al. [8] | Monocular | × | ✓ | ✓ |
| Deep learning-based methods | | | | |
| DS-SLAM [9] | Monocular | ✓ | × | × |
| DynaSLAM [10] | Monocular | ✓ | × | × |
| SLAMANTIC [11] | Monocular | ✓ | × | × |
| Dynamic-VINS [12] | RGB-D | ✓ | × | ✓ |
| E²-VINS (Ours) | Event | × | ✓ | ✓ |

Prior knowledge, temporary stopping, and uncertainty are the practical considerations compared.
Table 3. Mean position error (%) of E²-VINS and compared SLAMs on the VIODE dataset [42].

| Method | City_day: None | Low | Mid | High | City_night: None | Low | Mid | High | Parking_lot: None | Low | Mid | High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ORB-SLAM3 [1] | 1.230 | 0.543 | 2.844 | fail | fail | fail | fail | fail | 0.194 | 0.231 | 0.191 | 0.256 |
| DSO [3] | 6.055 | 5.214 | 5.445 | fail | fail | fail | fail | fail | 3.106 | 1.762 | 8.440 | 5.999 |
| VINS-FUSION [28] | 0.133 | 0.115 | 0.355 | 0.323 | 0.198 | 0.224 | 0.276 | 0.280 | 0.134 | 0.182 | 0.932 | 1.497 |
| DynaVINS [8] | 0.142 | 0.106 | 0.098 | 0.231 | 0.114 | 0.109 | 0.111 | 0.155 | 0.128 | 0.158 | 0.156 | 0.196 |
| Ours+V2E [45] | 0.118 | 0.101 | 0.080 | 0.104 | 0.165 | 0.165 | 0.223 | 0.163 | 0.095 | 0.142 | 0.138 | 0.134 |
Table 4. Mean position error (%) of E²-VINS and compared SLAMs on the ECMD dataset [43].

| Method | Dense_street_day: Easy_a | Easy_b | Medium_a | Difficult | Dense_street_night: Easy_a | Easy_b | Urban_road_day: Easy_a | Easy_b | Urban_road_night: Easy_a |
|---|---|---|---|---|---|---|---|---|---|
| ORB-SLAM3 [1] | fail | 8.558 | fail | 4.332 | fail | 4.482 | fail | fail | fail |
| DSO [3] | fail | 8.509 | fail | 7.647 | fail | 5.560 | 11.190 | 13.663 | 5.534 |
| VINS-FUSION [28] | fail | 3.428 | fail | 1.937 | 3.110 | 9.966 | fail | fail | 12.964 |
| USLAM [21] | fail | 9.435 | fail | fail | 5.363 | 6.141 | 7.947 | 8.861 | 3.939 |
| DynaVINS [8] | fail | 2.701 | fail | 3.472 | 3.500 | 5.014 | 9.239 | 10.318 | 6.496 |
| Ours | 1.356 | 1.562 | 2.404 | 1.393 | 1.427 | 2.736 | 7.339 | 9.933 | 3.159 |
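For reference, one common way to obtain a "mean position error (%)" such as the values reported in Tables 3 and 4 is to normalize the mean translational error of the timestamp-associated, aligned estimated trajectory by the ground-truth trajectory length; trajectory errors of this kind are typically computed with a tool such as evo [44]. The paper's exact evaluation protocol may differ, so the sketch below is only indicative.

```python
# Hedged sketch of one common definition of "mean position error (%)":
# the mean per-pose position error of the aligned estimate divided by the total
# length of the ground-truth trajectory. Trajectories are N x 3 position arrays
# already associated by timestamp; alignment is assumed to have been done upstream.
import numpy as np

def mean_position_error_percent(est_xyz, gt_xyz):
    errors = np.linalg.norm(est_xyz - gt_xyz, axis=1)                     # per-pose error
    traj_length = np.sum(np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1)) # path length
    return 100.0 * errors.mean() / traj_length
```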