StereoVO: Learning Stereo Visual Odometry Approach Based on Optical Flow and Depth Information
Abstract
1. Introduction
2. Related Work
2.1. Methods Based on Geometry
2.2. Methods Based on Learning
3. Approach
3.1. The Structure of StereoVO
3.2. SSIM Cost Volume
3.3. Loss Function
4. Experimental Results
4.1. Dataset and Metrics
4.2. Implementation Details
4.3. Comparison with Other Methods
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Barros, A.M.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A comprehensive survey of visual SLAM algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
- Aslan, M.F.; Durdu, A.; Yusefi, A.; Sabanci, K.; Sungur, C. A tutorial: Mobile robotics, SLAM, Bayesian filter, keyframe bundle adjustment and ROS applications. In Robot Operating System (ROS): The Complete Reference (Volume 6); Springer: Cham, Switzerland, 2021; pp. 227–269. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Song, S.; Chandraker, M.; Guest, C.C. High Accuracy Monocular SFM and Scale Correction for Autonomous Driving. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 730–743. [Google Scholar] [CrossRef] [PubMed]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Xue, F.; Wang, Q.; Wang, X.; Dong, W.; Zha, H. Guided Feature Selection for Deep Visual Odometry. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018. [Google Scholar]
- Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018. [Google Scholar]
- Yin, Z.; Shi, J. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Costante, G.; Mancini, M.; Valigi, P.; Ciarfuglia, T.A. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robot. Autom. Lett. 2016, 1, 18–25. [Google Scholar] [CrossRef]
- Muller, P.; Savakis, A. Flowdometry: An Optical Flow and Deep Learning Based Approach to Visual Odometry. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Santa Rosa, CA, USA, 27–29 March 2017. [Google Scholar]
- Saputra, M.; Gusmao, P.D.; Wang, S.; Markham, A.; Trigoni, N. Learning Monocular Visual Odometry through Geometry-Aware Curriculum Learning. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
- Ganti, P.; Waslander, S. Network Uncertainty Informed Semantic Feature Selection for Visual SLAM. In Proceedings of the Conference on Computer and Robot Vision, Kingston, ON, Canada, 28–31 May 2019. [Google Scholar]
- Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R. Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Almalioglu, Y.; Saputra, M.; Gusmão, P.; Markham, A.; Trigoni, N. GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
- Madhu Babu, V.; Majumder, A.; Das, K.; Kumar, S. UnDEMoN: Unsupervised Deep Network for Depth and Ego-Motion Estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018. [Google Scholar]
- Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Zhan, H.; Garg, R.; Weerasekera, C.S.; Li, K.; Agarwal, H.; Reid, I. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
- Amiri, A.J.; Loo, S.Y.; Zhang, H. Semi-supervised monocular depth estimation with left-right consistency using deep neural network. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, Dali, China, 6–8 December 2019. [Google Scholar]
- Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067. [Google Scholar] [CrossRef] [PubMed]
- Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015. [Google Scholar]
- Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007. [Google Scholar]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
- Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation, Hong Kong, China, 31 May–7 June 2014. [Google Scholar]
- Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
- Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017. [Google Scholar]
- Yang, N.; Von Stumberg, L.; Wang, R.; Cremers, D. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020. [Google Scholar]
- Jiang, Z.; Taira, H.; Miyashita, N.; Okutomi, M. Self-Supervised Ego-Motion Estimation Based on Multi-Layer Fusion of RGB and Inferred Depth. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
- Brox, T.; Bruhn, A.; Papenberg, N.; Weickert, J. High Accuracy Optical Flow Estimation Based on a Theory for Warping. In Proceedings of the European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004. [Google Scholar]
- Fischer, P.; Dosovitskiy, A.; Ilg, E.; Häusser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Wang, W.; Hu, Y.; Scherer, S. TartanVO: A Generalizable Learning-based VO. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020. [Google Scholar]
- Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. TartanAir: A Dataset to Push the Limits of Visual SLAM. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24 October 2020–24 January 2021. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
- Aslan, M.F.; Durdu, A.; Sabanci, K. Visual-inertial image-odometry network (VIIONet): A Gaussian process regression-based deep architecture proposal for UAV pose estimation. Measurement 2022, 194, 111030. [Google Scholar] [CrossRef]
- Han, L.; Lin, Y.; Du, G.; Lian, S. DeepVIO: Self-supervised deep learning of monocular visual inertial odometry using 3D geometric constraints. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, 4–8 November 2019. [Google Scholar]
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Umeyama, S. Least-squares Estimation of Transformation Parameters between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
| Block | Layer | Kernel (k) | Stride (s) | Padding (p) | Channels (in/out) |
|---|---|---|---|---|---|
| Conv Block1 | Conv1_1 | 7 | 2 | 3 | 3/16 |
| | Conv1_2 | 3 | 1 | 1 | 48/16 |
| | Conv Redir1 | 7 | 2 | 3 | 3/2 |
| Conv Block2 | Conv2_1 | 5 | 3 | 2 | 18/32 |
| | Conv2_2 | 3 | 1 | 1 | 96/32 |
| | Conv Redir2 | 5 | 2 | 2 | 18/2 |
| Conv Block3 | Conv3_1 | 5 | 2 | 2 | 34/64 |
| | Conv3_2 | 3 | 1 | 1 | 192/64 |
| | Conv Redir3 | 5 | 2 | 2 | 34/2 |
| Conv Block4 | Conv4_1 | 3 | 2 | 1 | 157/256 |
| | Conv4_2 | 3 | 1 | 1 | 471/256 |
| Conv Block5 | Conv5_1 | 3 | 2 | 1 | 256/512 |
| | Conv5_2 | 3 | 1 | 1 | 768/512 |
| | Conv Redir5 | 3 | 2 | 1 | 256/2 |
| Conv Block6 | Conv6_1 | 3 | 2 | 1 | 514/1024 |
| | Conv6_2 | 3 | 1 | 1 | 1542/1024 |
| | Conv Redir6 | 3 | 2 | 1 | 514/2 |
| FC | FC1 | - | - | - | 4104/1024 |
| | FC2 | - | - | - | 1024/6 |
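To make the table concrete, the following is a minimal PyTorch sketch of how its rows translate into layers. It is an illustration under stated assumptions, not the authors' released implementation: the LeakyReLU activation, the absence of normalization layers, and the variable names are assumptions, and the inter-block wiring that produces the enlarged input channel counts (e.g., the 48 input channels of Conv1_2, presumably a concatenation with cost-volume or redirect features) follows the paper's architecture figure, which is not reproduced here.

```python
import torch.nn as nn

def conv(in_ch, out_ch, k, s, p):
    # One table row -> Conv2d(kernel_size=k, stride=s, padding=p) + LeakyReLU (activation assumed).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Conv Block1, transcribed row by row from the table:
conv1_1 = conv(3, 16, k=7, s=2, p=3)      # 7x7, stride 2, pad 3, 3 -> 16 channels
conv1_2 = conv(48, 16, k=3, s=1, p=1)     # 48 input channels imply a concatenation
                                          # (e.g., with cost-volume features) before this layer
conv_redir1 = conv(3, 2, k=7, s=2, p=3)   # low-dimensional "redirect" branch, 3 -> 2 channels

# Pose head from the FC rows: 4104 -> 1024 -> 6 (a 6-DoF relative pose vector).
pose_head = nn.Sequential(
    nn.Linear(4104, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 6),
)
```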
| Phase | Sequences | Frames per sequence | Total frames |
|---|---|---|---|
| Train | 00, 01, 02, 03, 04, 05, 06, 08 | 4541, 1101, 4661, 801, 271, 2761, 1101, 4071 | 19,308 |
| Test | 07, 09, 10 | 1101, 1591, 1201 | 3893 |
| Type | Method | Seq. 07 Trel (%) | Seq. 07 Rrel (deg/100 m) | Seq. 09 Trel (%) | Seq. 09 Rrel (deg/100 m) | Seq. 10 Trel (%) | Seq. 10 Rrel (deg/100 m) | Avg. Trel (%) | Avg. Rrel (deg/100 m) |
|---|---|---|---|---|---|---|---|---|---|
| MG | VISO2-M [4] | 23.61 | 19.11 | 4.04 | 1.43 | 25.20 | 3.80 | 17.62 | 8.11 |
| MG | ORB-SLAM2-M [18] | 10.96 | 0.37 | 15.30 | 0.26 | 3.71 | 0.30 | 9.99 | 0.31 |
| StG | ORB-SLAM2-S [18] | 0.59 | 0.20 | 1.80 | 0.18 | 1.67 | 0.37 | 1.35 | 0.25 |
| StG | ORB-SLAM3-S [23] | 0.62 | 0.21 | 1.16 | 0.19 | 1.02 | 0.34 | 1.24 | 0.25 |
| MU | GeoNet [8] | - | - | 30.57 | 9.89 | 45.14 | 10.37 | 37.86 | 8.17 |
| MU | D3VO [28] | - | - | 10.40 | 0.75 | 14.24 | 1.21 | 12.32 | 0.98 |
| MU | MLF-VO [29] | 6.11 | 2.59 | 3.64 | 1.33 | 6.39 | 1.14 | 5.38 | 1.68 |
| MS | DeepVO [27] | 3.91 | 4.60 | - | - | 8.11 | 8.83 | 6.01 | 6.72 |
| MS | PCNN [9] | - | - | 6.42 | 2.52 | 19.70 | 3.62 | 13.06 | 7.79 |
| MS | Flowdometry [10] | - | - | 12.64 | 8.04 | 11.65 | 7.28 | 12.15 | 7.66 |
| MS | TartanVO [32] | 8.51 | 4.55 | 7.24 | 2.99 | 9.20 | 2.80 | 8.32 | 3.45 |
| StS | StereoVO (ours) | 3.42 | 1.87 | 5.11 | 1.39 | 5.78 | 1.62 | 4.77 | 1.63 |

Trel: average relative translation error; Rrel: average relative rotation error. Method types: MG, monocular geometry-based; StG, stereo geometry-based; MU, monocular unsupervised learning; MS, monocular supervised learning; StS, stereo supervised learning.
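Trel and Rrel are the standard KITTI odometry drift metrics: average translation error as a percentage of distance traveled, and average rotation error in degrees per 100 m, computed over subsequences of the trajectory. The sketch below is a simplified illustration that evaluates a single 100 m subsequence length; the official KITTI benchmark averages over lengths of 100–800 m, and the function names here (trajectory_distances, relative_errors) are hypothetical.

```python
import numpy as np

def trajectory_distances(poses):
    # Cumulative path length along the ground-truth trajectory (poses: list of 4x4 matrices).
    dist = [0.0]
    for i in range(1, len(poses)):
        dist.append(dist[-1] + np.linalg.norm(poses[i][:3, 3] - poses[i - 1][:3, 3]))
    return np.asarray(dist)

def relative_errors(gt, est, length=100.0):
    # Translation drift (%) and rotation drift (deg/100 m) over ~`length`-meter subsequences.
    dist = trajectory_distances(gt)
    t_errs, r_errs = [], []
    for i in range(len(gt)):
        j = int(np.searchsorted(dist, dist[i] + length))  # first frame ~`length` m ahead of i
        if j >= len(gt):
            break
        d_gt = np.linalg.inv(gt[i]) @ gt[j]     # ground-truth relative motion
        d_est = np.linalg.inv(est[i]) @ est[j]  # estimated relative motion
        err = np.linalg.inv(d_gt) @ d_est       # residual transform between the two
        t_errs.append(np.linalg.norm(err[:3, 3]) / length * 100.0)  # percent of distance
        angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
        r_errs.append(np.degrees(angle) / length * 100.0)           # degrees per 100 m
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```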
| Method | RMSE | Mean | Std. |
|---|---|---|---|
| MLF-VO | 0.105 | 0.078 | 0.068 |
| StereoVO | 0.089 | 0.070 | 0.055 |
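These statistics are trajectory error magnitudes computed after aligning the estimated trajectory to the ground truth, for which the paper cites the Umeyama least-squares method [64]. The sketch below is a minimal illustration of that alignment followed by the RMSE/mean/std computation; the helper names are hypothetical, and for stereo methods with metric scale the estimated scale factor should come out close to 1.

```python
import numpy as np

def umeyama_alignment(src, dst):
    # Least-squares similarity transform (R, t, s) mapping src onto dst (both N x 3), after [64].
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)   # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = ((src - mu_s) ** 2).sum(axis=1).mean()   # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return R, t, s

def ate_stats(gt, est):
    # RMSE / mean / std of absolute trajectory error after alignment (gt, est: N x 3 positions).
    R, t, s = umeyama_alignment(est, gt)
    aligned = s * (R @ est.T).T + t
    err = np.linalg.norm(gt - aligned, axis=1)
    return np.sqrt((err ** 2).mean()), err.mean(), err.std()
```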
Duan, C.; Junginger, S.; Thurow, K.; Liu, H. StereoVO: Learning Stereo Visual Odometry Approach Based on Optical Flow and Depth Information. Appl. Sci. 2023, 13, 5842. https://doi.org/10.3390/app13105842