1. Introduction
Nowadays, most robots utilize visual information for tasks such as grasping, random bin picking, and autonomous navigation. In these robotic applications, visual information, similar to human vision, provides plenty of important and useful cues to robots, and many computational or artificial vision techniques have therefore been developed. In particular, in the field of autonomous vehicles and mobile robots, visual information with a large field of view (FoV) is advantageous because it reduces the blind areas around the vehicle or robot. In this regard, a fisheye camera is an efficient way for a robot to obtain images with a large FoV; moreover, a small number of fisheye cameras, rather than many normal-FoV cameras, is sufficient to obtain around-view images of the robot. Therefore, fisheye cameras are now commonly used in many robotic applications.
As one of these robotic applications, the vision-based or visual localization of a mobile robot is an essential technology for autonomous mapping and navigation. However, for certain types of robots, such as pallet or cleaning robots, the fisheye camera is inevitably attached close to ground level, as shown in Figure 1. Consequently, the ground area occupies half of the captured fisheye image. In this situation, vision-based localization is challenging for two main reasons. First, half of the image feature extraction and tracking is conducted in the ground area, while the texture pattern on the ground is generally uniform compared with the other, more cluttered areas. Common feature extraction algorithms in computer vision tend to extract fewer features in the uniform ground area than in the surrounding areas. Thus, the ground area in a fisheye image is not cost effective in terms of feature extraction and tracking. Second, the undistortion of a fisheye image to a pinhole image causes inevitable image loss along the boundary of the original image. As a result, only the center area of the original image, half of which is also the ground area, can be used in the undistorted pinhole image for feature extraction and tracking.
Even though there are several disadvantages to using a fisheye camera on a low-profile robot, the ground information can be exploited efficiently for visual navigation by assuming that the ground is a planar surface. Because almost all indoor mobile robots move on a planar ground, the motion direction of a mobile robot can be considered parallel to the ground plane. This study is also motivated by the fact that the low-profile robot moves in a 2D planar space parallel to the ground plane and the fisheye camera captures ground images at a very close distance to the ground.
In this study, we propose a visual odometry method with a fisheye camera for low-profile pallet robots. The pose of the robot is estimated using only the ground plane images generated from the fisheye camera. The proposed method consists of several steps, as shown in Figure 2. First, a fisheye image is converted to a cubemap image [1]. Second, a ground plane image (GPI) is obtained from the cubemap image; the GPI is a virtual top-view image generated from the bottom and front faces of the cubemap. Third, the ground plane image is converted again to an ortho-rectified GPI to compensate for distortion caused by errors in the lens distortion parameters. Fourth, image feature points are extracted and tracked using optical flow in the ortho-rectified GPI, and the robot motion is modeled as a 2D vector in the image space [2]. Finally, as shown in Figure 2, the odometry of the robot is estimated by applying the bicycle kinematic model to the motion vector between consecutive image frames [3]. The proposed visual odometry architecture is named Ground Plane Odometry (GPO). In several experiments using a low-profile pallet robot, we compare the localization performance of T265 stereo odometry, GPO, and an Extended Kalman Filter (EKF) fusion of GPO and T265 [4,5]. By fusing GPO and T265 with the EKF, we achieve a mean localization error of 0.058 m over three indoor odometry experiments.
The structure of this study is as follows. In Section 2, we briefly describe previous studies related to this paper. In Section 3, we explain the details of each step of the proposed method. In Section 4, we present several experiments and evaluation results. Finally, in Section 5, we conclude this paper.
2. Literature Review
In this section, we review previous studies on visual odometry techniques for mobile robots that employ image features on the ground plane surface captured with fisheye or wide-angle lens cameras.
M. Flores et al. [6] compare the visual localization performance of four well-known image features extracted from fisheye images: Oriented FAST and Rotated BRIEF (ORB) [7], Speeded-Up Robust Features (SURF) [8], Features from Accelerated Segment Test (FAST) [9], and KAZE [10]. They measure the translation and rotation errors of visual odometry computed from two different approaches: standard visual odometry and adaptive probability-oriented feature matching. A fisheye image dataset by Zhang et al. [11] is used for the experiments. The experimental results show that ORB and KAZE perform relatively better than SURF and FAST.
Y. Wang et al. [1] propose a novel visual SLAM (Simultaneous Localization and Mapping) method using cubemap images. They employ ORB-SLAM [12] as the baseline of their proposed CubemapSLAM method. Cubemap images are consecutively generated from fisheye images and used as the input of ORB-SLAM. The cubemap camera model can use all image information of the fisheye camera without affecting the performance of feature descriptors. CubemapSLAM is efficiently implemented and runs in real time. Despite the limited angular resolution of the sensor, CubemapSLAM shows better localization accuracy than the pinhole camera model [13].
A. J. Swank [14] utilizes a single downward-pointing camera equipped with a wide-angle lens for the localization of a mobile robot. Image features are obtained from the camera, and the optical flow algorithm is employed to measure the location of the robot. The proposed method specifically targets robotic platforms operating in unexplored areas, and the author suggests applying it to any indoor or outdoor area where other localization sensors are not available.
J. Zhang et al. [15] present a robust localization method based on monocular visual odometry that provides accurate position estimation even when a mobile robot operates on undulating terrain. Their method employs a steering model to estimate the rotation and translation of the robot separately. To address localization on undulating terrain, the terrain is approximated by multiple locally flat surface patches, and the inclination of each local surface is estimated together with the robot's motion. The surface inclination is estimated with approximately 1% error.
J. Jordan et al. [16] present a visual odometry technique for a mobile robot operating on a 2D plane by applying its kinematic model. A monocular camera facing downward captures images of the floor, and the robot motion is estimated from the alignment between two consecutive images. The motion is then combined with the kinematic model of the robot to remove outlier image regions that do not reflect the actual motion. To obtain ortho-rectified floor images, they use a simple image homography method.
D. Zhou et al. [17] address ground plane-based odometry for autonomous vehicle driving. They use a single conventional camera to capture the ground image in front of a road-driving vehicle. To estimate the vehicle motion in real metric space, they propose a divide-and-conquer motion scale estimation method based on the ground plane and the camera height. Their work is a typical example of ground plane-based odometry applied to autonomous driving.
J. Jordan et al. [18] introduce a ground plane-based visual odometry using a downward-facing RGBD camera. The RGBD camera captures ground images and 3D point clouds frame by frame, and the images and points are transformed into a virtual orthogonal camera. The projected data are then split into multiple orthographic grid blocks, and each image block is registered frame by frame using Efficient Second-order Minimization (ESM) [19]. Finally, the robot motion is estimated as the weighted average of all block motions. Because this approach uses an RGBD camera, the 3D ground information helps to estimate the robot motion accurately.
M. Ouyang et al. [20] integrate gyroscope and wheel encoder information into visual odometry to reduce odometry drift during planar motion. The gyroscope information compensates for the orientation error of the wheel encoder. In addition, the ground plane is segmented from the camera images to extract image features on the ground. As the backend of the odometry measurement, graph optimization is used to minimize the measurement residuals of the wheel encoder, gyroscope, vision, and ground feature constraints.
X. Chen et al. [21] introduce a SLAM technique called StreetMap. For the localization and mapping of a mobile robot, they use a downward-facing camera to utilize the ground textures. They categorize the ground textures into two groups, feature based and line based, and propose a different localization and mapping method for each category. The ground texture images are obtained with the downward-facing camera; however, no image transformation, such as ortho-rectification, is applied.
H. Lin et al. [22] address the self-localization of a mobile robot using a single catadioptric camera. The geometric properties of the catadioptric projection model are utilized to obtain the intersections of vertical lines with the ground plane. From the catadioptric image, the ground plane is segmented, followed by vertical line extraction. Using the geometry of the catadioptric projection model, the 3D line equation corresponding to each extracted vertical image line is calculated, and the robot motion between two consecutive frames is estimated from the relationship of the 3D vertical lines.
In our previous investigation [23], we designed a pallet-type mobile robot and mounted a fisheye camera at the front of the robot to calculate the visual odometry. The fisheye image is first converted to a cubemap image and then to a ground plane image to track the image pattern on the ground. The visual odometry of the robot is calculated from the motion of the tracked features.
3. Proposed Approach
3.1. Overview
The steps of the proposed method are shown in Figure 3 and summarized as follows. As the first step, a fisheye image of the front view of the pallet robot is converted to a cubemap image. From the cubemap image, we generate a virtual pinhole camera image, the Ground Plane Image (GPI), which was introduced in our previous study [23]. The GPI contains only the image of the ground plane in front of the robot, as if a virtual pinhole camera were facing perpendicular to the ground plane, as shown in Figure 3. Ideally, the GPI should be the ortho-rectified image captured by this virtual top-view camera. However, due to transformation errors both from the fisheye image to the cubemap and from the cubemap to the GPI, it is not perfectly ortho-rectified. The transformation error mainly comes from errors in the fisheye lens model and in the homography computed between the cubemap and the GPI.
To obtain a perfectly ortho-rectified GPI, we therefore propose to transform the ground plane image once more into an ortho-rectified image (the ortho-rectified GPI), in which the ground plane appears exactly as if the top-view camera captured it perpendicular to the ground. As the next step, we track image features in the ortho-rectified GPI using the Kanade–Lucas–Tomasi (KLT) optical flow algorithm [24] to determine the motion vector of the robot. The motion vector between two GPIs is then scaled to obtain the robot's motion in metric space. From the scaled vector, the velocity and steering angle of a virtual front wheel of the robot are estimated. Because the fisheye camera is mounted at the front of the robot, we employ a kinematic bicycle model for the motion estimation of this virtual front wheel [3].
3.2. Calibration of Fisheye Camera Using EUCM
In conventional visual odometry, using a fisheye image in its original form is not common because of the complexity of its multi-view geometry. Thus, the fisheye camera model is usually converted to the pinhole model to simplify the multi-view geometry. However, transforming the fisheye model to the pinhole model causes serious distortion in the undistorted pinhole image. In addition, only the center part of the undistorted image can be used after the rasterization step, and the remaining image areas are discarded.
Therefore, instead of the pinhole model, we use the cubemap model [1] to utilize all image areas after undistortion. To transform the fisheye image into a cubemap image, all calibration parameters of the fisheye camera must be known. As the calibration model, we use the enhanced unified camera model (EUCM) [25], a nonlinear camera model generally used for omnidirectional cameras such as catadioptric systems or fisheye cameras. The EUCM has six parameters, [fx, fy, cx, cy, α, β]; the curvature of the fisheye lens is modeled using α and β, so the EUCM can represent the fisheye camera distortion.
Figure 4 shows the projection model from 3D space to the image plane through the fisheye lens. A 3D point X is first projected onto the curved surface P, as shown in Equations (1)–(5). The projected point is then orthogonally projected onto the normalized image plane M, as shown in Equation (6). Finally, as shown in Equation (7), the point on the normalized plane is multiplied by the intrinsic matrix K to obtain the corresponding pixel on the fisheye image.
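As a concrete illustration, the following Python sketch implements the EUCM forward projection described above (the chain of Equations (1)–(7)); the parameter values are placeholders for illustration, not our calibration results.

```python
import numpy as np

def eucm_project(X, fx, fy, cx, cy, alpha, beta):
    """Project a 3D point X = (x, y, z) onto the fisheye image with the EUCM.

    Mirrors Equations (1)-(7): project onto the curved surface P,
    orthogonally project onto the normalized plane M, then apply K.
    """
    x, y, z = X
    d = np.sqrt(beta * (x * x + y * y) + z * z)   # distance term of the curved surface P
    denom = alpha * d + (1.0 - alpha) * z         # EUCM normalization denominator
    mx, my = x / denom, y / denom                 # point on the normalized plane M
    return fx * mx + cx, fy * my + cy             # pixel after multiplying by K

# Example with placeholder EUCM parameters (not the calibrated values of our camera).
u, v = eucm_project((0.2, -0.1, 1.0), fx=285.0, fy=285.0,
                    cx=424.0, cy=400.0, alpha=0.57, beta=1.02)
```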
To calibrate the fisheye camera's EUCM in this study, we use the Kalibr calibration toolbox [26,27,28,29,30], which is widely used for camera calibration. Kalibr supports a variety of camera models, including the EUCM, pinhole, omnidirectional, and double sphere models.
3.3. Mapping Fisheye Image to Cubemap Image
After calibrating the fisheye camera, we use the calibration parameters to generate the cubemap image from the fisheye image. As shown in Figure 5a, the cubemap model consists of five image planes generated by virtual pinhole camera models. Each virtual pinhole model has the same intrinsic parameters but different extrinsic parameters because of its different viewing direction; thus, the fisheye image is divided among five virtual pinhole camera images. The intrinsic parameters of the virtual cubemap cameras depend on the size of each cubemap face. We determine the intrinsic parameters [fFace_x, fFace_y, cFace_x, cFace_y] of the virtual pinhole cameras so that each face image resolution is half of the cubemap image resolution. Here, f is the focal length, c is the principal point, and the subscript Face indexes the five cubemap faces.
Figure 5b shows the relationship between the cubemap image plane and the cubemap model. To generate a cubemap image, we first define the transformation from the cubemap image coordinates to the cubemap model coordinates, as shown in Equation (14), where the rotation matrix is the 3D rotation of the cubemap face indexed by Face.
We transform the cubemap image coordinates into model coordinates on one of the cubemap faces, as shown in Figure 6. The idea of the EUCM-cubemap projection model is to use the EUCM parameters to transform a point on a cubemap face to a pixel of the fisheye image, or vice versa. First, the point on a cubemap face is transformed to a point on the curved surface P of the EUCM, as shown in Equations (15)–(18); this point replaces the 3D point in Equation (1), and the back-projection is not required in this step. Next, the point is orthogonally projected onto the EUCM normalized image plane M, as shown in Equation (6). Finally, it is projected to a pixel on the fisheye image by multiplying it with the intrinsic matrix K, as shown in Equation (7).
Following this process, the correspondence between each cubemap image pixel and a fisheye image pixel is established through the backward transformation from the cubemap image to the fisheye image. Because this transformation is computationally expensive, it is performed only once when the proposed method starts; the resulting index table of pixel coordinate mappings is then stored and reused for every frame.
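The sketch below illustrates this lookup-table construction for a single face; the EUCM parameters, face size, and face rotation are illustrative assumptions, and the other four faces would be generated the same way with 90° rotations of the viewing direction.

```python
import numpy as np
import cv2

# Placeholder EUCM parameters and cubemap face size (not our calibration values).
FX, FY, CX, CY, ALPHA, BETA = 285.0, 285.0, 424.0, 400.0, 0.57, 1.02
FACE = 400  # side length of one cubemap face in pixels

def face_to_fisheye_lut(R_face):
    """Map every pixel of one cubemap face to the fisheye pixel it samples."""
    f = FACE / 2.0                                  # virtual pinhole focal length and principal point
    u, v = np.meshgrid(np.arange(FACE), np.arange(FACE))
    rays = np.stack([(u - f) / f, (v - f) / f, np.ones((FACE, FACE))], axis=-1)
    rays = rays @ R_face.T                          # rotate the face rays into the fisheye frame
    x, y, z = rays[..., 0], rays[..., 1], rays[..., 2]
    d = np.sqrt(BETA * (x * x + y * y) + z * z)     # EUCM forward projection (Equations (1)-(7))
    denom = ALPHA * d + (1.0 - ALPHA) * z
    return (FX * x / denom + CX).astype(np.float32), (FY * y / denom + CY).astype(np.float32)

# The lookup table is built once at startup; each frame only needs a cheap remap.
map_x, map_y = face_to_fisheye_lut(np.eye(3))       # front face (identity rotation)
fisheye = cv2.imread("fisheye.png")
front_face = cv2.remap(fisheye, map_x, map_y, cv2.INTER_LINEAR)
```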
Figure 6 shows the EUCM–cubemap projection model, and
Figure 7 shows the difference between the fisheye image and the cubemap image projections.
3.4. Generating GPI from Cubemap Image
This section describes how to generate the proposed ground plane image (GPI) from the cubemap image. As shown in Figure 8, the front and bottom faces of the cubemap model are used to generate the GPI, which is an ortho-rectified image from a virtual top-view camera. To generate the front-face part of the GPI, as shown in Figure 8, the front-face image is warped to the ground plane using a homography matrix: a point q on the front-face image is projected to a homogeneous point Q on the ground plane image using Equation (22).
In Equation (22), the symbol between the two sides denotes homogeneous equality, i.e., equality up to scale. The homography matrix H is calculated from the correspondences of the four corner positions of a chessboard: the chessboard is placed on the ground, as shown in Figure 8, a cubemap image of the chessboard is generated, and the four corners of the chessboard are selected in the cubemap image.
The bottom-face image is simply cropped because the bottom face is already captured by a virtual top-view camera, as shown in Figure 5a. Figure 9 shows an example of GPI generation from a fisheye image. The trapezoidal ground area in the front face of the cubemap image is warped to a rectangular area in the GPI, and the cropped bottom-face area is concatenated at the bottom of the GPI. In the ideal case, the generated ground plane image is the ortho-rectified image from a virtual top-view camera.
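The following sketch illustrates the warp and concatenation. The corner coordinates and output size are illustrative only, and the front_face and bottom_face arrays are assumed to come from the cubemap remapping step above (the bottom face would use its own 90° face rotation).

```python
import numpy as np
import cv2

# Four chessboard corners picked in the cubemap front face and their target
# positions in the GPI (all pixel coordinates below are illustrative only).
front_corners = np.float32([[312, 655], [488, 655], [530, 790], [270, 790]])
gpi_corners   = np.float32([[150,   0], [250,   0], [250, 100], [150, 100]])

# Homography of Equation (22): a point q on the front face maps to Q on the GPI.
H = cv2.getPerspectiveTransform(front_corners, gpi_corners)

GPI_W, GPI_H = 400, 200                                   # assumed size of each GPI part
gpi_front  = cv2.warpPerspective(front_face, H, (GPI_W, GPI_H))
gpi_bottom = bottom_face[:GPI_H, :GPI_W]                  # bottom face is simply cropped
gpi = np.vstack([gpi_front, gpi_bottom])                  # warped front part on top, bottom face below
```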
3.5. Ortho-Rectification of GPI
Ideally, the GPI is the ortho-rectified image obtained from the virtual top-view camera, so there should be no distortion in the image of the ground plane. However, due to the inherent errors in the EUCM calibration parameters and in the homography from the front face of the cubemap model to the ground, image distortion remains in the GPI. An example of this distortion is shown in Figure 10: from left to right, an original fisheye image is converted to a cubemap image and finally to a GPI. In this figure, we intentionally place a chessboard pattern in front of the robot so that the distortion is easy to identify; in the generated GPI, the chessboard pattern is visibly distorted. This distortion must be corrected; otherwise, visual odometry using the distorted GPI yields erroneous results.
To correct the image distortion in the GPI, we transform the GPI again into a new image, the ortho-rectified GPI. The correction procedure is as follows. We place the chessboard in front of the robot, obtain a fisheye image, and generate the GPI, as shown in Figure 11. Then, we select four corners in the image, shown as the red-colored rectangle in the figure. Finally, a homography matrix is calculated so that the red-colored quadrilateral is transformed into a true rectangle whose size matches the real metric dimensions of the 2 × 4 chessboard squares. To convert pixels to metric units, we use the pixel-to-meter scale factor, which is used in Section 3.6.
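A sketch of this correction under assumed values is given below; the square size, scale factor, and corner coordinates are hypothetical, and gpi is the image produced by the previous sketch.

```python
import numpy as np
import cv2

SQUARE_M = 0.02                        # assumed chessboard square size in meters
SCALE_M_PER_PX = 0.0005                # assumed pixel-to-meter scale factor
side_px = SQUARE_M / SCALE_M_PER_PX    # one square side expressed in GPI pixels

# Corners of the 2 x 4 square region selected in the distorted GPI (illustrative),
# and their target positions forming a true rectangle of the correct metric size.
src = np.float32([[182, 40], [342, 36], [348, 118], [176, 122]])
dst = np.float32([[180, 40],
                  [180 + 4 * side_px, 40],
                  [180 + 4 * side_px, 40 + 2 * side_px],
                  [180, 40 + 2 * side_px]])

H_rect = cv2.getPerspectiveTransform(src, dst)
ortho_gpi = cv2.warpPerspective(gpi, H_rect, (gpi.shape[1], gpi.shape[0]))
```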
To evaluate the ortho-rectification accuracy of the GPI, we measure the side lengths and area of the 2 × 4 chessboard squares, as shown in Table 1. As shown in Figure 11, the four sides of the red-colored rectangle, a, b, c, and d, are compared before and after ortho-rectification, and the area S of the rectangle is also compared in Table 1. As the table shows, the ortho-rectification corrects the image distortion in the GPI, and the dimensions of the ortho-rectified GPI are very close to the true values.
3.6. Feature Tracking in Ortho-Rectified GPI
The proposed visual odometry method uses the ground motion observed in the ortho-rectified GPI. Because the ortho-rectified GPI is the top-view image of a virtual camera, the ground motion in the GPI can be measured in real metric dimensions using the focal length and the height of the camera. To obtain the ground motion in the GPI, we use the simple and robust KLT optical flow algorithm [24]. Moreover, to obtain a reliable motion vector that represents the motion of the robot, we use only the median of the motion vectors between the previous and current frames, as shown in Figure 10.
Usually, the height of a pallet robot is very low, and a camera mounted this close to the ground captures the fine details of the floor pattern. The ground pattern in the ortho-rectified GPI is random enough to extract and track features in consecutive image frames, so many features are usually obtained in a single frame and tracked using KLT. In the ideal case, all motion vectors from KLT are identical because the ortho-rectified GPI is a perfectly orthogonal view of the ground. In practice, however, KLT tracking errors occur, so the median of the motion vectors provides a robust estimate of the motion.
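A minimal sketch of this tracking step with OpenCV is given below; the detector and tracker parameters are illustrative defaults rather than the values used in our experiments.

```python
import numpy as np
import cv2

def gpi_motion_vector(prev_gpi, curr_gpi):
    """Median KLT motion vector (in pixels) between two ortho-rectified GPIs."""
    prev_gray = cv2.cvtColor(prev_gpi, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_gpi, cv2.COLOR_BGR2GRAY)

    # Detect corners on the ground texture of the previous frame.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.zeros(2)

    # Track them into the current frame with pyramidal KLT optical flow.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None,
                                              winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    flows = (nxt[good] - pts[good]).reshape(-1, 2)

    # The median over all flow vectors is robust to KLT tracking outliers.
    return np.median(flows, axis=0)
```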
3.7. Metric Motion in Ortho-Rectified GPI
As shown in Figure 12, the median motion vector obtained from the ortho-rectified GPI corresponds to the robot motion only up to an unknown scale because the vector is measured in pixel space. Thus, the scale factor from pixel space to metric space is needed to recover the actual robot motion. To find this scale, we assume that the ortho-rectified GPI is obtained by a virtual camera that is perfectly perpendicular to the flat ground; the scale of the motion vector is then calculated from the height z of the camera's lens center above the ground and the focal length f of the camera, as expressed in Equation (23). By multiplying the pixel motion by this scale factor, the robot's actual motion vector is obtained, as shown in Figure 13.
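With the scale factor of Equation (23), the conversion is a single multiplication, as in the short sketch below; the camera height, focal length, and median vector values are illustrative assumptions.

```python
import numpy as np

CAM_HEIGHT_M = 0.10        # assumed height z of the virtual camera above the ground (m)
VIRTUAL_FOCAL_PX = 200.0   # assumed focal length f of the virtual top-view camera (px)
scale = CAM_HEIGHT_M / VIRTUAL_FOCAL_PX   # Equation (23): meters per pixel (0.0005 here)

median_motion_px = np.array([3.0, 14.0])  # hypothetical median KLT vector from Section 3.6
motion_m = scale * median_motion_px       # -> [0.0015, 0.007] meters per frame
```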
3.8. Robot Odometry Using a Kinematic Motion Model
As the motion of the camera mounted on the robot depends on the robot's movement, we apply the motion vector of the ground plane image to a kinematic model of the robot to estimate its odometry. In this step, the motion vector of the ortho-rectified GPI is regarded as the measurement of a virtual front wheel. Thus, we employ the bicycle motion model and estimate the velocity V of the virtual front wheel and the steering angle δ using Equations (25) and (26), whose inputs are the two components of the median motion vector and the time interval between image frames.
Figure 14 shows the bicycle motion model employed in this study. In this model, the robot rotates and moves about the IC, the instantaneous center of rotation in the 2D plane. We use the median motion vector of the GPI to measure the velocity V of the virtual front wheel and the steering angle δ. As a result, the 2D odometry of the robot is estimated; we name it Ground Plane Odometry (GPO). The bicycle motion model relies on three assumptions:
The wheel-based mobile robot moves in a 2D plane;
The center of mass of the mobile robot is located in the center of the robot body;
No slip: neither longitudinal nor lateral slip occurs at the wheels.
Following these assumptions, the robot's pose is estimated by applying the velocity and steering angle of the virtual front wheel to the bicycle model of the robot as follows. All parameters of the bicycle model in Figure 14 are calculated using the following equations, where lr is the distance between the rear wheel axis center and the center of mass of the robot, L is the distance between the rear wheel axis center and the front wheel axis center, and ω is the angular velocity of the robot rotating about the IC. Using these parameters, the velocity of the robot is determined as shown in Equations (32) and (33). Finally, the visual odometry of the mobile robot is estimated using Equation (34).
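To make the update concrete, the sketch below propagates the robot pose with a standard kinematic bicycle model. The way V and δ are derived from the median motion vector follows the spirit of Equations (25) and (26), but the exact form, the axis convention, and all numeric parameters here are assumptions rather than the equations of this paper.

```python
import numpy as np

L_WHEELBASE = 0.60   # assumed distance L between front and rear wheel axes (m)
L_REAR      = 0.30   # assumed distance lr from the rear axis to the center of mass (m)

def gpo_update(pose, motion_m, dt):
    """One GPO step: pose = (x, y, yaw); motion_m is the metric median vector."""
    x, y, yaw = pose
    vx, vy = motion_m

    # Virtual front-wheel measurement: speed from the vector length, steering
    # angle from its direction (assuming forward motion maps to the +y image axis).
    V = np.hypot(vx, vy) / dt
    delta = np.arctan2(vx, vy)

    # Standard kinematic bicycle model referenced at the center of mass; the
    # measured front-wheel speed is used directly as the body speed for simplicity.
    beta = np.arctan2(L_REAR * np.tan(delta), L_WHEELBASE)  # slip angle at the CoM
    x   += V * np.cos(yaw + beta) * dt
    y   += V * np.sin(yaw + beta) * dt
    yaw += (V / L_REAR) * np.sin(beta) * dt                 # rotation about the IC
    return np.array([x, y, yaw])

# Hypothetical usage: 0.007 m forward and 0.0015 m lateral motion at 30 fps.
pose = gpo_update(np.zeros(3), np.array([0.0015, 0.007]), dt=1.0 / 30.0)
```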
5. Conclusions
In this study, we propose a visual odometry method for a low-profile pallet robot using ground plane images obtained from a fisheye camera. To overcome the inherent radial distortion of the fisheye image, we introduce the Ground Plane Image (GPI), which is generated from two faces of the cubemap. We first generate an original GPI and then transform it into an ortho-rectified GPI to obtain a true top-view image from the mobile robot. Consequently, the median of the motion vectors in the ortho-rectified GPI accurately represents the real motion at the front of the robot, and the bicycle motion model is employed to estimate the odometry of the robot. The performance of the proposed Ground Plane Odometry (GPO) is compared with the stereo-based VIO of a T265 camera. The proposed GPO shows better performance than the T265 odometry, and the EKF fusion of GPO and T265 achieves the best performance. In addition, it is shown that using the ortho-rectified GPI is better than using the original GPI. The proposed GPO can be applied to any low-profile robot equipped with a fisheye camera, and it can be combined with other robot sensors for better performance. In future studies, GPO will be extended to multiple fisheye cameras on the pallet robot for more accurate visual odometry.