1. Introduction
The ability to navigate and map unknown environments in real time is a crucial capability for autonomous systems [1,2]. Visual Simultaneous Localization and Mapping (VSLAM) enables devices such as robots, autonomous vehicles, and augmented reality (AR) platforms to achieve this by utilizing visual information from cameras to simultaneously construct a map of the environment while tracking their position within it [3,4]. As VSLAM technology has progressed, it has become increasingly important for dynamic, real-time applications, where systems must overcome challenges such as moving objects, fluctuating lighting conditions, and limited computational resources. For instance, in AR applications, SLAM enables the accurate overlay of virtual objects onto physical spaces, which requires precise localization and mapping under potentially difficult lighting or environmental conditions. Similarly, autonomous vehicles rely on SLAM to generate maps on-the-fly while adjusting to changes in the environment to ensure safe navigation [5]. SLAM methods have also advanced with improvements in sensor technology, processing power, and algorithmic techniques, which allow for higher accuracy and adaptability. However, real-world environments are rarely static and often demand sophisticated, adaptive SLAM solutions capable of handling dynamic conditions and external disturbances. These challenges underscore the need for innovative methods that can enhance SLAM’s robustness, accuracy, and efficiency in dynamic scenarios. In this work, we propose to perform simultaneous localization and mapping in a VIsual Localization Domain (VILD), i.e., a domain where visually relevant features are suitably represented for SLAM. To this aim, we consider a stereo camera acquisition system as illustrated in Figure 1, and we leverage the known properties of Fisher information to detect and recognize specific image patterns. Specifically, in [6], the authors demonstrate that transforming images into a domain defined by a basis of orthogonal Circular Harmonic Function (CHF) filters with specific radial profiles enables straightforward maximum likelihood localization of 2D patterns. In this domain, the maximum likelihood estimation of visual pattern translation and rotation is achieved using a quadratic loss function. Therefore, the output of Circular Harmonic Filters can be used as a meaningful domain for signal representation. The outputs from filters of different orders highlight visually relevant features. Furthermore, they appear directly in the maximum likelihood estimation of image transformation parameters, such as scale factors, rotation, or translation. Building on this, the VILD-SLAM method adopts filtering based on two-dimensional CHFs, leveraging both magnitude and phase information to refine feature localization and reduce key errors such as mean squared error and scale drift. The VILD-SLAM process consists of two primary stages:
Computation of VILD: VILD highlights visually relevant regions. Specifically, after applying the CHF to detect high-intensity interest points corresponding to prominent structural edges in the environment, we compare the output magnitude against a threshold for feature localization. Then, we refine the output phase by selecting only the most relevant points, thus identifying the directions of visual structures.
VILD feature extraction and tracking: This stage adopts VILD to identify abrupt changes in the local structure direction and uses this information to extract keypoints for tracking and localization.
This domain allows us to improve feature matching and tracking accuracy.
The incorporation of these filtering stages improves accuracy even with lower resolution images, such as those captured at larger distances. This has led to notable improvements in trajectory accuracy by aligning the estimated SLAM trajectory more closely with GPS data. The experimental results indicate that the proposed CHF-based method effectively reduces key errors, thereby providing a more accurate trajectory estimation and improved performance in dynamic environments.
2. Related Works
Traditional SLAM methods [7] generally rely on the assumption of static environments and utilize geometric techniques for localization and mapping, possibly exploiting application-specific constraints [8] or memory-efficient data representation [9]. While effective in controlled settings, these approaches often struggle with dynamic elements commonly encountered in real-world environments, such as moving objects or sudden lighting changes, as discussed in [10,11]. To overcome these limitations, recent research has focused on more adaptive SLAM methods that integrate artificial intelligence (AI), deep learning, and advanced hardware optimizations, enhancing SLAM’s robustness and accuracy in dynamic settings [12,13]. Systems like ORB-SLAM2 [14] and DFT-VSLAM [7] utilize advanced tracking techniques and dynamic feature extraction to improve performance in dynamic environments. Additionally, deep learning-based frameworks, such as AnyFeature-VSLAM [15], adaptively manage visual features across different scenarios, maintaining high accuracy and reliability (see [16] for a comprehensive survey).
VILD-SLAM advances the literature by leveraging Circular Harmonic Filters (CHFs) to improve feature detection and tracking, offering a robust alternative to existing methods. Unlike [17], which integrates points and lines in dynamic environments, the VILD-SLAM approach focuses on CHF coefficients for noise-robust edge and orientation detection, optimizing trajectory accuracy. Similarly, while [18] introduces planar constraints for road-based SLAM, VILD-SLAM excels in extracting image transformations under varying conditions, enhancing adaptability. By refining stereo feature alignment compared to standard methods, VILD-SLAM complements and extends insights from [19] and trajectory evaluations in [20].
4. Visual Features in the CHF Domain: A Review
The literature has shown that the output of Circular Harmonic Filters can be used as a meaningful domain for signal representation. The outputs from filters of different orders highlight visually relevant features. Furthermore, they appear directly in the maximum likelihood estimation of image transformation parameters, such as scale factors, rotation, or translation. In this section, we review the theory, while in the next, we explain how to apply it to the matching and tracking of points between the right and left sequences of a stereo video sequence.
The extraction of visual features has been widely applied in image processing, because it can detect representative image features, such as edges, lines, and intersections.
This procedure provides valuable information about the structures of the output image; it highlights edges while simultaneously measuring their intensity (magnitude) and direction (orientation). Among others, Circular Harmonic Filters (CHFs), originally introduced in a previous study, exhibit theoretical properties related to how they characterize the information associated with visually relevant structures.
The mathematical formulation of CHFs and their impact on visual feature extraction and tracking are presented below.
Let us recall the definition of the CHF in a 2D continuous domain. In polar coordinates $(r, \theta)$, representing the distance from the origin and the angle with respect to the $x$-axis, respectively, the CHF of order $k$ is the complex filter defined by Formula (2), i.e., a polar-separable kernel whose angular dependence is the complex harmonic $e^{jk\theta}$. The radial functions appearing in Equation (2) are isotropic Gaussian-weighted kernels, known as Laguerre–Gauss functions, which satisfy isomorphism with the frequency space. The variable $k$ defines the angular structure of the model: for $k=0$, the CHF output is a low-pass version of the input image. As the order $k$ increases, CHFs highlight increasingly complex directional structures in the visual data, such as edges (for $k=1$), lines (for $k=2$), bifurcations (for $k=3$), and intersections (for $k=4$). In the following, we refer to the first-order CHF ($k=1$), which has a band-pass behaviour. Its frequency response, given in Formula (3), has a magnitude resulting from the product of two factors, namely a radial factor corresponding to a derivative action and a Gaussian low-pass factor, while its phase carries the orientation information.
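To make the structure of the filter concrete, the following Python sketch builds a discrete first-order CHF-like kernel as a Gaussian-weighted radial profile multiplied by the angular harmonic $e^{jk\theta}$; the radial profile, scale parameter, and kernel size are illustrative assumptions rather than the exact Laguerre–Gauss normalization adopted in the paper.

import numpy as np

def chf_kernel(order=1, size=15, sigma=2.0):
    """Discrete approximation of a CHF of the given order.

    The radial profile is a Gaussian-weighted monomial r**order (an
    illustrative stand-in for the exact Laguerre-Gauss profile); the
    angular factor is the complex harmonic exp(1j*order*theta).
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    radial = (r ** order) * np.exp(-(r ** 2) / (2.0 * sigma ** 2))
    kernel = radial * np.exp(1j * order * theta)
    return kernel / np.sqrt(np.sum(np.abs(kernel) ** 2))   # unit-energy normalization

h1 = chf_kernel(order=1)   # first-order CHF: band-pass, edge-sensitive
h0 = chf_kernel(order=0)   # zero-order CHF: low-pass behaviour

Convolving an image with h1 yields a complex output whose magnitude and phase are the quantities used in the following sections.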
The CHF filter output is meaningful in revealing the visually relevant structure of an image. A visual example of the first-order CHF ($k=1$) applied to a real frame is shown in Figure 3: the top panel reports the original frame before the application of the CHF filter, while the bottom panel reports the magnitude and the phase of the CHF output. Notably, the magnitude highlights the strength of the edges, and the phase identifies their orientation. Recently, the CHF filter has been extended to the non-Euclidean domain, over manifolds such as those underlying point-cloud data [21].
The CHF filters also play a relevant role in the maximum likelihood (ML) estimation of translation, scaling, and rotation parameters for natural images, as demonstrated in [6]. The rationale behind this is as follows. Let a reference image be observed after a scaling by a given factor, a rotation by a given angle, and a translation by a given displacement, and let the observation of this transformed image be corrupted by additive white Gaussian noise independent of the image itself. The ML estimate of the transformation parameters is obtained by maximizing the log-likelihood function of the observed image with respect to the unknown parameters. In white Gaussian noise, the ML estimate is directly obtained by minimizing the Euclidean distance between the observation and the transformed template.
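For completeness, this equivalence can be sketched as follows under the stated AWGN assumption, using generic symbols (not the paper’s notation): $g$ for the observation, $f$ for the reference image, and $\mathcal{T}_{\boldsymbol{\vartheta}}$ for the geometric transformation with parameters $\boldsymbol{\vartheta}$ (scale, rotation, translation):
\[
\hat{\boldsymbol{\vartheta}}_{\mathrm{ML}}
= \arg\max_{\boldsymbol{\vartheta}} \, \log p(g \mid \boldsymbol{\vartheta})
= \arg\max_{\boldsymbol{\vartheta}} \left( -\frac{1}{2\sigma_n^{2}} \, \big\| g - \mathcal{T}_{\boldsymbol{\vartheta}}\{f\} \big\|^{2} + \mathrm{const} \right)
= \arg\min_{\boldsymbol{\vartheta}} \big\| g - \mathcal{T}_{\boldsymbol{\vartheta}}\{f\} \big\|^{2},
\]
where $\sigma_n^{2}$ denotes the noise variance.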
The minimization of the Euclidean distance can also be carried out in a transformed orthonormal space. Let us recall the following mathematical results:
- (i)
the development of a generic function in a series of orthonormal polar basis functions [6,22], whose expansion coefficients are obtained from its Fourier transform, and
- (ii)
the development of such functions in a series of Laguerre–Gauss functions [6,23].
The application of these results in the discrete domain allows us to compute the ML cost function in a transformed space. As shown in [6], the ML estimate of the transformation parameters is found by minimizing the distance between the transformed coefficients of the observed image and those of the transformed version of the template. These coefficients can in turn be obtained at the output of CHF filters of different orders.
5. Visual Localization Domain
Expanding on the above-described properties, to build the Visual Localization Domain, we resort to a first-order approximation of the image-series representation, computing the first-order coefficients directly through the application of the first-order CHF. This corresponds to representing the image in a theoretically grounded, visually relevant domain. We leverage this domain for point matching and tracking over the stereo video sequence. We have seen that Circular Harmonic Filters (CHFs) enable a visually relevant representation of signals while simultaneously providing a domain where the maximum likelihood estimation of signal parameters can be directly achieved by minimizing a quadratic distance in the coefficient domain. Here, we focus solely on the coefficients output by the first-order filter. For the point matching and tracking problem under consideration, we can assume that one of the stereo views serves as the reference for matching, while the goal is to identify the most similar version in the other view. This search is conducted not in the original domain but in the coefficient domain of the filter output. These coefficients allow for the identification of parameters such as rotation, as well as translation and scale, in terms of minimal mean squared error. As a result, they facilitate the search for similarities under such local image transformations. Furthermore, they are inherently robust to noise due to the low-pass effect typical of the Gaussian profile at high frequencies. The procedure resulting from these considerations is described below.
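As an illustration of matching in the coefficient domain, the following Python sketch compares candidate patches from one view against a reference patch from the other view using the quadratic distance between first-order CHF outputs; the patch geometry, the search strategy, and the reuse of the chf_kernel helper sketched above are illustrative assumptions, not the paper’s actual implementation.

import numpy as np
from scipy.signal import fftconvolve

def match_in_chf_domain(ref_patch, search_strip, kernel):
    """Find the horizontal offset in `search_strip` whose first-order CHF
    coefficients are closest (in squared distance) to those of `ref_patch`.

    ref_patch:    2D float array (patch from the reference view).
    search_strip: 2D float array with the same height, wider than ref_patch.
    kernel:       complex CHF kernel, e.g., chf_kernel(order=1).
    """
    w = ref_patch.shape[1]
    ref_coeff = fftconvolve(ref_patch, kernel, mode='same')
    best_col, best_cost = 0, np.inf
    for col in range(search_strip.shape[1] - w + 1):
        cand_coeff = fftconvolve(search_strip[:, col:col + w], kernel, mode='same')
        cost = np.sum(np.abs(ref_coeff - cand_coeff) ** 2)   # quadratic cost in the coefficient domain
        if cost < best_cost:
            best_cost, best_col = cost, col
    return best_col, best_cost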
Applying the CHF filtering to the input sequences generates two complex sequences, obtained by convolving the luminance of the left and right original images with the first-order CHF impulse response. Each filtered image is therefore characterized in terms of its magnitude and phase. The filtering [6,24] returns the magnitude and phase of the filtered image, which are useful for extracting the edges of objects present in the reference scene.
The outcome is a complex image in which each edge is associated with a high magnitude value, while the phase provides useful information on the spatial directions of the visually relevant image components. In contrast, uniform areas correspond to low-intensity and pseudo-random phase values [6,24]. Therefore, it can be stated that the CHF filter emphasises the presence of edges and measures their strength and orientation.
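A minimal sketch of this step, assuming grayscale (luminance) left and right frames stored as NumPy arrays and reusing the chf_kernel helper above, could look as follows.

import numpy as np
from scipy.signal import fftconvolve

def chf_magnitude_phase(luminance, kernel):
    """Convolve a luminance image with a complex CHF kernel and return
    the magnitude (edge strength) and phase (edge orientation) maps."""
    response = fftconvolve(luminance, kernel, mode='same')
    return np.abs(response), np.angle(response)

# Hypothetical usage on a stereo pair (dataset-specific I/O not shown):
# h1 = chf_kernel(order=1)
# mag_L, phase_L = chf_magnitude_phase(left_luminance, h1)
# mag_R, phase_R = chf_magnitude_phase(right_luminance, h1)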
The next phase involves the following procedure. First, the histogram of the normalised magnitude is computed. Since the magnitude highlights selected regions (edges), the histogram is typically multimodal, with one peak representing real edges and a second peak, near zero, representing high-frequency noise components. Hence, the areas relevant for point extraction and tracking can be highlighted by suitably selecting a threshold value on the normalised magnitude: the relevant areas are obtained as the set of points whose normalised magnitude exceeds the threshold, so that the resulting map is nonzero only where the magnitude is above the threshold. We then improve the estimate of the orientation information by updating the phase based on the magnitude; in particular, we compute the stereo visual phase sequences (the VILD maps) by retaining the CHF phase only at the locations selected by the magnitude threshold.
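A compact sketch of this thresholding step is given below; the threshold value is treated here as a free parameter in [0, 1] applied to the normalised magnitude (its histogram-based selection is discussed in Section 6), and setting the discarded locations to zero is a simplifying convention of the sketch.

import numpy as np

def vild_map(magnitude, phase, gamma):
    """Build a VILD map: retain the CHF phase only where the normalized
    magnitude exceeds the threshold gamma, and zero it elsewhere."""
    norm_mag = magnitude / (magnitude.max() + 1e-12)   # normalize to [0, 1]
    mask = norm_mag > gamma
    return np.where(mask, phase, 0.0), mask

# vild_L, mask_L = vild_map(mag_L, phase_L, gamma=0.2)   # gamma value is illustrative
# vild_R, mask_R = vild_map(mag_R, phase_R, gamma=0.2)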
An example of the features extracted at the output of the CHF filter appears in Figure 4, which shows the matched features on the magnitude maps and on the phase maps (bottom) for frame 100. Although meaningful, the magnitude map conveys only edge-intensity information, while the phase map is rather noisy. These limitations are overcome by the VILD maps.
An interpretation of the role of the VILD map is provided in Figure 5, where we recognize that the map differs from zero only in correspondence with structured areas, and that the value at each coordinate pair represents the direction of the edge at the corresponding pixel of the original frame.
6. Experimental Results
The effectiveness of the VILD-SLAM algorithm was evaluated using real-world stereo camera datasets. We present the results of VSLAM in the VILD domain, i.e., with tracking performed on the left and right VILD map sequences. For comparison, we also report the results of the state-of-the-art method in [14], in the implementation available at [28], as well as the results obtained when tracking is performed on alternative feature representations.
The experiments use a stereo video sequence from the dataset in [29]. This dataset consists of 1073 stereo image pairs, captured in July under sunny conditions. The selected frames, relative to the fifth run, are used to construct and compare trajectories. Following a training phase, a robot equipped with a stereo camera autonomously traversed a 160 m route, in a natural landscape with occasional artificial structures, over repeated runs (see Figure 6). Stereo keyframes, defined as the frames containing a sufficient number of valid keypoints for mapping purposes, were recorded approximately every 0.2 m. The true trajectory follows a 3D path, while the GPS trajectory and the estimated trajectory refer to a 2D projection. It is important to note that the GPS trajectory is known to be affected by estimation errors, resulting in random fluctuations. However, since these errors are typically smaller than those affecting the V-SLAM algorithm, we will consider the 2D GPS trajectory (disregarding changes in altitude) as the ground truth for validating the V-SLAM algorithm. Validation is performed by comparing the GPS locations with the locations estimated by V-SLAM across a set of keypoints, which are assumed to be reference points.
The stereo camera configuration features a 0.24 m baseline, with images captured at 16 Hz at the resolution specified in [29]; the full stereo stream is subsequently downsampled to extract stereo keyframes at intervals of approximately 0.2 m traveled [29]. The simulation parameters are as follows: (i) the maximum horizontal displacement between corresponding keypoints was limited to 48 pixels; (ii) the image pyramid employed a scale factor of 1.1 for size reduction; (iii) the pyramid included 10 levels.
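As a rough illustration of how these parameters might be wired into a matching configuration (names and structure are purely illustrative, not the actual implementation of [28]):

from dataclasses import dataclass

@dataclass
class MatchingConfig:
    max_horizontal_disp: int = 48   # max displacement between corresponding keypoints [pixels]
    pyramid_scale: float = 1.1      # size-reduction factor between consecutive pyramid levels
    pyramid_levels: int = 10        # number of pyramid levels

def pyramid_widths(base_width, cfg):
    """Image width at each pyramid level, shrinking by cfg.pyramid_scale per level."""
    return [round(base_width / cfg.pyramid_scale ** level) for level in range(cfg.pyramid_levels)]

# Example: pyramid_widths(512, MatchingConfig()) lists the widths of the 10 levels.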
Building on the framework outlined above, we assess the accuracy of the trajectory estimation compared to GPS benchmarks using VILD-based VSLAM. Applying the CHF filter to the luminance channel of the original images yields the magnitude and phase components, which enhance edges and their directions, and from which we compute the VILD maps, where keypoints are well concentrated in visually relevant regions. In Figure 7, we show the left and right VILD maps and the keypoint matches for a sample frame.
Stemming from these calculations, the tracking is then performed in the VILD domain. To evaluate the error between the estimated and GPS trajectories, two performance metrics are used: the mean square error (MSE) and the scale drift (SD).
The MSE quantifies the average squared Euclidean distance, on the 2D plane, between the matrix collecting the $N$ reference key points $\mathbf{p}_n$ obtained from the GPS trajectory and the matrix of their estimated counterparts $\hat{\mathbf{p}}_n$, and it is computed as $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N} \lVert \mathbf{p}_n - \hat{\mathbf{p}}_n \rVert^{2}$. The scale drift (SD), a secondary metric, measures systematic scale deviations between the estimated and GPS trajectories. It is defined as a function of the scale factor estimated through the Helmert transformation, i.e., the scale factor that best aligns the estimated trajectory to the GPS one. SD quantifies the systematic deviation of the estimated trajectory from the true GPS trajectory: negative or positive SD values indicate the need for compression or expansion, respectively, to align the estimated trajectory with the GPS path.
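A sketch of how these two metrics could be computed from matched 2D point sets is given below; the scale factor is taken here as the ratio of RMS spreads about the trajectory centroids, a simplified stand-in for the full Helmert fit, and the convention SD = s - 1 is an assumption consistent with the sign behaviour described above.

import numpy as np

def trajectory_metrics(gps_xy, est_xy):
    """MSE and scale drift between GPS reference points and their estimated
    counterparts; both inputs are arrays of shape (N, 2), already matched."""
    gps_xy = np.asarray(gps_xy, dtype=float)
    est_xy = np.asarray(est_xy, dtype=float)

    # Mean square error: average squared Euclidean distance per point pair.
    mse = np.mean(np.sum((gps_xy - est_xy) ** 2, axis=1))

    # Simplified scale estimate: ratio of RMS spreads about the centroids
    # (a proxy for the Helmert-transformation scale factor).
    g = gps_xy - gps_xy.mean(axis=0)
    e = est_xy - est_xy.mean(axis=0)
    s = np.sqrt(np.sum(g ** 2) / np.sum(e ** 2))

    sd = s - 1.0   # assumed convention: s < 1 (SD < 0) calls for compression of the estimate
    return mse, sd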
Table 2 reports the performance of VILD-SLAM based on the VILD maps. The metrics shown are the mean square error (MSE [m²]), the root mean square error (RMSE [m]), and the scale drift (SD, dimensionless). In addition to the results for the complete path, metrics are provided for the first segment (from the beginning of the path to the end of the curve) and for the second segment (from the end of the curve to the closure of the path). This segmentation enables a more precise assessment of VILD-SLAM performance. Transitory keypoints, i.e., points for which a match is found on fewer than a minimum number of consecutive frames, are discarded [14]. We report results for two different values of the magnitude threshold. For the sake of comparison, Table 3 reports the same metrics for the estimates obtained by the stereo ORB-SLAM2 algorithm in [14]. VILD-SLAM shows improved performance with respect to the literature in both conditions.
Figure 8 illustrates the performance achieved by VILD-SLAM operating on the VILD maps, showing the ground-truth GPS trajectory (green) together with the estimated and optimized trajectories (red and pink, respectively). The estimated 3D keypoint locations are also represented by colored dots [30,31]. The figure refers to the two considered values of the magnitude threshold. For the sake of comparison, the trajectories obtained by the state-of-the-art algorithm in [14] are also reported.
Implementing the CHF filter yielded both the magnitude and phase of the filtered image, followed by the magnitude thresholding and the computation of the VILD maps. The threshold can be selected by analysis of the magnitude histogram, illustrated in Figure 9. We recognize the typical bimodal structure, with small values corresponding to noisy components in flat image areas and large values corresponding to sparse image structures.
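One simple way to pick such a threshold automatically, given the bimodal behaviour just described, is an Otsu-style rule that maximizes the between-class variance of the magnitude histogram; this is offered as a plausible stand-in, not necessarily the selection rule used in the experiments.

import numpy as np

def threshold_from_histogram(norm_mag, bins=256):
    """Otsu-style threshold on the normalized magnitude: pick the bin edge
    that best separates the near-zero noise mode from the edge mode."""
    hist, edges = np.histogram(norm_mag.ravel(), bins=bins, range=(0.0, 1.0))
    prob = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])

    best_t, best_var = 0.0, -1.0
    for i in range(1, bins):
        w0, w1 = prob[:i].sum(), prob[i:].sum()
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (prob[:i] * centers[:i]).sum() / w0
        mu1 = (prob[i:] * centers[i:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if between > best_var:
            best_var, best_t = between, edges[i]
    return best_t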
A further analysis is conducted by degrading the stereo video sequences using a moving average filter defined over a circular support of a given radius. This condition is taken as a proxy of a reduced spatial resolution, such as that encountered when the video sequences are acquired at a larger distance from the scene, or in harsh acquisition conditions, e.g., rain. The accuracy of VILD-SLAM is evaluated in terms of the MSE and SD performance metrics.
Figure 10 presents the MSE and SD as a function of the low-pass filter radius. The blue bars represent the MSE, while the orange bars correspond to the absolute value of the scale drift (|SD|). The algorithm in [14] could not complete the analysis, even for different parameter settings, due to the increased difficulty of valid keypoint identification in the presence of image blur. For the sake of comparison, we report the MSE and |SD| values obtained by the algorithms operating on the original, unblurred sequence, indicated by the horizontal lines. Specifically, we report the results obtained by using the ORB-SLAM2 method with BRISK features and with SIFT features [32,33]. This figure demonstrates that the adoption of VILD for tracking also enables VSLAM on blurred images, maintaining meaningful trajectory estimations at different radii.
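The degradation used in this test can be reproduced, under the stated interpretation of a uniform average over a circular (pillbox) support, with a sketch such as:

import numpy as np
from scipy.signal import fftconvolve

def circular_moving_average(image, radius):
    """Blur an image with a uniform kernel supported on a disc of the given
    radius, used here as a proxy for reduced spatial resolution."""
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    disc = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    disc /= disc.sum()
    return fftconvolve(image, disc, mode='same')

# blurred_left = circular_moving_average(left_luminance, radius=3)   # radius in pixels, illustrative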
We now assess the performance of VILD-SLAM in noisy conditions, specifically when the images are acquired under additive white Gaussian noise.
Figure 11 shows the mean squared error (MSE, left axis) and the scale drift (SD, right axis) as a function of the SNR (dB). To better frame the performance of VILD-SLAM, we also report the results obtained by using the ORB-SLAM2 method with BRISK features and with SIFT features [32,33]. We recognize that VILD-SLAM outperforms the state-of-the-art competitors. Altogether, the results show the improvement brought by VILD in terms of accuracy and resilience.
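For reference, white Gaussian noise at a prescribed SNR can be simulated with a short helper like the one below; defining the SNR with respect to the mean image power is an assumption on the exact convention used in the experiments.

import numpy as np

def add_awgn(image, snr_db, rng=None):
    """Corrupt an image with white Gaussian noise at the requested SNR,
    with SNR(dB) = 10*log10(signal_power / noise_power)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(np.asarray(image, dtype=float) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=image.shape)
    return image + noise

# noisy_left = add_awgn(left_luminance, snr_db=20)   # SNR value is illustrative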
An interpretation of these results is given in Figure 12, where the first row shows an original image, the magnitude of its CHF output, and the thresholded phase, while the second row highlights some details of the captured image as they appear in the original domain and in the VILD domain. We recognize that the VILD domain retains the contribution of structured areas only, implicitly performing a kind of background subtraction. Therefore, using the VILD-SLAM approach, variations occurring within non-structured areas are inherently rejected, making the approach noise resilient. This behavior is also beneficial in dynamic environments, where a number of non-stationary features, e.g., illumination, change throughout the VSLAM process.
A few remarks are in order. VILD-SLAM is dynamically aware, in the sense that it represents the scene with noise rejected; still, it does not explicitly account for moving objects, and this is left for further study. To sum up, VILD-SLAM shows the potential to improve real-time state-of-the-art solutions. Its ability to reject noise components suggests its integration within a deep learning-based system; this relevant point is beyond the scope of this paper and is left for future studies.