1. Introduction
For a mobile robot operating in known or unknown areas, localization in a map of its work area is an essential requirement for several practical applications. In many cases, this prior map used for localization is generated through a simultaneous localization and mapping (SLAM) process, in which the robot builds the map of the unknown environment based on the information from sensors while estimating its pose at the same time. Various sensors, including optical cameras [1,2], ultrasonic wave sensors [3], laser range finders [4,5,6], or combinations of multiple sensors, can be used in SLAM systems depending on the application and environment.
In the past years, monocular SLAM, in which only a single camera is used, has proven to be a desirable solution for building high-quality maps with swift robots due to its simple hardware structure. However, monocular vision cannot provide an absolute scale estimate of the landmarks. Consequently, when performing monocular SLAM, the motion of the camera can only be determined up to an ambiguous scale factor. Furthermore, without a stable baseline between adjacent frames, this scale factor is also unstable and unreliable. Accordingly, the map built by monocular SLAM suffers from metric inconsistency, which means that the scale of the map is unobservable and gradually drifts over time.
With the wealth of information provided by images, it is possible to perform localization in the map generated by monocular SLAM using another camera by leveraging the semantic features in the images. However, consider a typical multi-robot system in robot search and rescue applications, as described in [7]. Since the environment to be explored is usually unknown and complicated, it is reasonable to first send a swift “mapper” robot with a single camera to give an overview of the target environment while building a map through monocular SLAM. Then, a “follower” unmanned ground vehicle (UGV), which has load-carrying capacity and often uses an LRF for environment perception, can be deployed to the same area. By localizing themselves in the map provided by the “mapper” robot, the “follower” robots can carry out rescue activities more efficiently. It should be noted that, in this scheme, the “mapper” can only provide a map without absolute scale. Hence, for a “follower” that wants to reuse this metrically inconsistent map with only a range sensor, the unknown scale of the map can be a fatal drawback: a map without reliable range information is meaningless to the LRF and prevents the acquired sensor data from being correctly associated with the map.
Recently, various methods have been proposed to solve the scale-drift problem in monocular SLAM. The most intuitive way to tackle it is to fuse other types of sensors to make the metrical information in visual SLAM observable, e.g., a laser distance meter [8,9,10], an IMU [11], or a stereo system [12]. However, the most remarkable advantage of monocular SLAM is its portability; using additional sensors always means a less compact, less power-efficient, and more costly system compared to one based on a pure monocular camera. Detecting loop closures and recovering the drifted scale by global bundle optimization [13,14] is also a well-studied approach for correcting scale drift. However, a closed loop is not always available, and even if the loop can be successfully closed, the absolute scale remains unknown. In [15], the fixed height of the monocular camera is leveraged to impose restrictions for scale correction. Similarly, as proposed in [16,17], the absolute scale can also be recovered during monocular SLAM by recognizing objects and estimating their sizes.
All of the aforementioned works have provided constructive ideas and achieved good performance. The goal of all of these techniques is to fix the scale drift during the map building process. In the present work, this problem is considered from another perspective: if a map has already been built by monocular SLAM, and its metrical inconsistency is irreversible, is it possible to directly localize another robot in it with only an LRF while estimating the local scale at the same time? In the presence of the range information from the LRF, new constraints can be introduced into the scale-drifted map, which makes the absolute scale of the map observable.
This issue is essentially a problem of localization in an “incomplete map”, which is defined as a map in which the constraints are either absent or incomplete. A few early works have addressed similar problems. Jensfelt et al. [18] proposed an unsupervised method that learns the distances in a topological map built with errors. Kümmerle et al. [19] created a SLAM framework utilizing aerial-view images as prior information, in which the Monte Carlo localization (MCL) method is employed to obtain the global constraints in the map extracted from aerial images. Mielle et al. [20] presented a graph-based localization method in an emergency map; they tackled the deformation of the emergency map by optimizing the shape of the matched graph. A sketch map drawn by hand is another typical metrically inconsistent map. Behzadian et al. [21,22] converted a hand-drawn draft into pixel coordinates and utilized a local scale factor to approximate the deformation of the draft.
In this paper, we aim to design an LRF-based method for localizing a robot in a map built by monocular SLAM. The application scenario of the proposed method is multi-robot cooperation in a single-floor indoor space. The map obtained through the monocular SLAM process or visual odometry is a 3D point cloud, while the sensor used for localization is a 2D LRF. Therefore, the dimensionality of the point cloud is reduced by projecting it onto a 2D grid map so that the landmarks can be matched with the 2D scans. Then, to locate a robot in this map with scale ambiguity, the relative motion of the robot and the measurements of the LRF are adjusted based on a flexible scale factor in order to make them adaptable to the size of the map. To achieve this goal, a variant of the MCL approach is presented, which estimates not only the robot pose but also the relative scale of the local map.
This paper is an extension of our previous work [23]. Our initial paper [23] only addressed the localization problem under a relatively simple condition. In this article, a more detailed explanation of the proposed method is provided. We also address the global localization problem and provide an in-depth evaluation of the extended MCL approach. Furthermore, experiments are conducted in a more challenging real environment to show the flexibility and robustness of the proposed method. The rest of this article is organized as follows: A general description of the proposed system and the problem statement is given in Section 2. The proposed methods are detailed in Section 3 and Section 4. The experimental results are presented in Section 5.
2. System Overview
The outline of the proposed system is shown in Figure 1. The target environment is explored by a “mapper” agent, which can be a swift robot (e.g., a UAV) or even a hand-held monocular camera. First, using monocular SLAM, a 3D point cloud map is generated from the incoming image data and then pre-processed into a 2D representation. Next, a “follower” wheeled robot with an LRF is deployed into the same area to perform localization in the constructed map. The LRF data, encoder data, and the map are the inputs of the localization system, whereas the outputs are the local scale factor and the “follower” robot’s pose.
The idea behind our approach is based on the assumption that the scale drift in monocular SLAM is a gradual process; therefore, the local deformation of the built map is always insignificant and equal in every direction. In a local map of visual SLAM, a point set $P = \{p_1, \ldots, p_n\}$ is maintained to depict the landmarks $\{l_1, \ldots, l_n\}$ in the real world. For the agent “mapper”, the absolute distance $d$ between a landmark $l_i$ and the camera center $c$ is unobservable. Instead, the monocular SLAM produces an arbitrary distance parameter $\hat{d}$ between the corresponding point $p_i$ and $c$ to give an estimate of $d$ in the obtained map. For laser-based localization, the “follower” is sent to this local map, and it measures the same landmark $l_i$ with the LRF from another perspective. The distance between the robot and landmark $l_i$ in the map is modeled as $\hat{z}_t$. The actual distance, which is measured by the laser beam, is denoted as $z_t$. Under the previous assumption, the scale factor in the same local map should be invariable. Therefore, the local scale factor of the robot at time $t$ is denoted as $s_t$:

$$s_t = \frac{z_t}{\hat{z}_t} + \varepsilon_s \quad (1)$$

where $\varepsilon_s$ is the noise of the scale estimation. In Equation (1), $d$ is unknown, while $\hat{z}_t$ is given by the monocular SLAM and $z_t$ is the observation from the laser sensor. Hence, for a pose $x_t$ of the “follower” robot in the map frame, there exists a relative distance $\hat{z}_t$ and its corresponding scale factor $s_t$, which describes the relationship between the drifted scale and the absolute one. Therefore, it is possible to search for the robot poses and the local scale using the Monte Carlo method.
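To make Equation (1) concrete, the following minimal Python sketch estimates the local scale factor from one corresponding distance pair; the numbers are hypothetical and only illustrate the unit bookkeeping (grid units vs. meters):

```python
def local_scale(z_map: float, z_laser: float) -> float:
    """Local scale factor s_t from Equation (1): the laser-measured distance
    to a landmark (meters) divided by the distance to the corresponding
    point in the scale-drifted map (grid units)."""
    return z_laser / z_map

# Hypothetical example: a landmark lies 42 grid cells away in the map
# and the laser measures the same landmark at 3.1 m.
s_t = local_scale(z_map=42.0, z_laser=3.1)
print(f"s_t = {s_t:.4f} m/grid")  # ~0.0738 m/grid
```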
In this work, our main focus is on the development of a localization approach rather than on the visual SLAM problem itself. The maps used in our system are extracted from maps built beforehand by monocular SLAM. In principle, our localization algorithm is not confined to a specific SLAM method and works on point cloud maps generated by any monocular SLAM system. For instance, in Section 5, our approach is experimentally evaluated on different types of 3D maps (semi-dense and sparse) built by various SLAM methods [1,2].
3. 2D Representation of the 3D Environment
In this section, our goal is to convert the 3D point cloud map into a 2D map in order to match the scans from the 2D LRF. Extracting 2D structure from a 3D map is a technique commonly used for 2D–3D sensor fusion and data compression. Morton [24] presented a vertical structure representation method named the Strip Histogram Grid, in which the scene is encoded as histograms and object identification is also enabled. A classifier named HLD (height-length-density) is leveraged in [25]; it parameterizes the obstacles and classifies them into positive and negative ones in order to provide information for self-navigation. An improvement of the HLD classifier is given by Goeddel et al. [26], in which they sorted the obstacles based on slope and developed a compact map structure. Other works proposed probability-based methods to handle the rasterization of 3D maps. For instance, in [27,28], a Gaussian mixture model is exploited to compress large-scale point clouds into a 2D set to achieve fast LiDAR localization.
For achieving the 2D representation, we employ the 2D occupancy grid map [29], where each location in the environment is assumed to be either occupied, free, or unknown. Here, the lines of sight from the camera center to the landmarks are considered as measurements, and every key frame from monocular SLAM is treated as an independent observation. Next, all the measurements from the camera are projected onto a specified plane, and the occupancy probability of each grid is calculated using the approach employed in laser-based mapping methods [30,31,32].
Since the absolute pose of the camera is unknown, the projection direction has to be determined first. Most indoor environments contain a flat floor surface that can be used to determine the coordinate system. To obtain the floor plane, RANSAC-based plane detection is applied to the point cloud map to extract the plane coefficients. However, spurious planes that snatch parts of the points from other items may lead to misidentification [33]. For instance, the planes segmented by the original RANSAC in an indoor environment frequently contain points belonging to the edges of tables or chairs on the floor. We therefore slightly modify the original RANSAC plane detection method by splitting all the 3D map points $P$ with the hypothetical plane. We assume that the $y$-axis of the first camera frame, which is always defined as the $y$-axis of the world frame, points toward the ground. Since the points under the hypothetical plane can be considered as “underground” points that should not be observed by the camera, the hypothesis plane $M^*$ with the maximum score in the modified RANSAC is given as follows:

$$M^* = \arg\max_{M} \sum_{p_i \in P} I_M(p_i) \quad (2)$$

where $I_M(p_i)$ is the indicator of the plane model:

$$I_M(p_i) = \begin{cases} 1, & |f_M(p_i)| \leq \delta \\ -\omega, & f_M(p_i) > \delta \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

where $f_M(\cdot)$ is the plane function of the hypothesis plane $M$, signed so that positive values lie on the “underground” side. If the point-plane distance $|f_M(p_i)|$ between the plane and the point $p_i$ is within a threshold $\delta$, $I_M(p_i)$ returns 1; if $p_i$ is an “underground” point below the hypothesis plane, $I_M(p_i)$ returns a negative value $-\omega$. The parameter $\omega$ is the weighting coefficient of the points under the hypothesis plane, which needs to be adjusted depending on the density of the map. In other words, the idea is to detect the plane with the fewest points under it.
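The following sketch illustrates this modified scoring; the sampling loop, the threshold $\delta$, and the penalty $\omega$ are our own illustrative choices, not the exact implementation of the paper:

```python
import numpy as np

def fit_plane(p1, p2, p3):
    """Plane (n, d) with f(p) = n.p + d from three points; n is oriented
    toward +y, i.e., toward the ground in the world frame."""
    n = np.cross(p2 - p1, p3 - p1)
    norm = np.linalg.norm(n)
    if norm < 1e-9:          # degenerate (collinear) sample
        return None
    n = n / norm
    if n[1] < 0:
        n = -n
    return n, -float(n.dot(p1))

def plane_score(points, n, d, delta=0.02, omega=5.0):
    """Equations (2)-(3): inliers count +1, 'underground' points count -omega."""
    f = points @ n + d                       # signed point-plane distance
    return np.sum(np.abs(f) <= delta) - omega * np.sum(f > delta)

def detect_floor(points, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        plane = fit_plane(*sample)
        if plane is None:
            continue
        score = plane_score(points, *plane)
        if score > best_score:
            best, best_score = plane, score
    return best                              # (normal, offset) of the floor
```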
Figure 2a provides an example of the point cloud from monocular SLAM on the EuRoC MAV dataset [34]. As shown in Figure 2b, the blue points are the inliers that belong to the floor plane extracted from the point cloud, and the “underground” points, which make a negative contribution to the weight of the detected plane, are marked in red.
The world frame of monocular SLAM is denoted as $F_w = (o_w, \mathbf{x}_w, \mathbf{y}_w, \mathbf{z}_w)$, where $\mathbf{x}_w$, $\mathbf{y}_w$, $\mathbf{z}_w$ are the base vectors of the world frame centred at $o_w$. Let $\mathbf{n}$ denote the normal vector of the plane $M$ obtained by RANSAC, which satisfies $\|\mathbf{n}\| = 1$ and is oriented toward the ground. Thus, the new map frame before projection can be defined as $F_m = (o_m, \mathbf{x}_m, \mathbf{y}_m, \mathbf{z}_m)$, in which the unit vector $\mathbf{x}_m$ is obtained by projecting $\mathbf{x}_w$ onto the detected plane $M$; $\mathbf{y}_m = \mathbf{n}$; $\mathbf{z}_m$ is the vector which is perpendicular to both $\mathbf{x}_m$ and $\mathbf{y}_m$; and $o_m$ coincides with the point $o_w$. Therefore, the transformation from the world frame to the map frame is $T_{mw} = [\mathbf{x}_m \; \mathbf{y}_m \; \mathbf{z}_m]^{T}$.
If the resolution of the 2D grid map is $(W, H)$ and $r$ is the size of each grid (m/grid), the transformation from a world point $p^w$ to its corresponding 2D map grid $(u, v)$ is given as:

$$\begin{pmatrix} u \\ v \end{pmatrix} = \left\lfloor \frac{1}{r} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} T_{mw}\, p^w \right\rfloor \quad (4)$$

where the selection matrix keeps the two in-plane coordinates of the transformed point; the component along $\mathbf{y}_m$, normal to the floor, is dropped by the projection.
Consequently, the 3D map can be processed into a 2D grid map by calculating the occupancy probability of each grid. Furthermore, the camera’s line of sight may sometimes pass through the space above an obstacle, so the obstacle is never observed due to the limitation of the field of view. As a result, such unobserved landmarks may be regarded as paths that are free to pass. To get rid of this kind of misjudgement, a grid is considered a constant obstacle once its occupancy probability has stayed above a threshold for a number of frames.
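A minimal sketch of this projection and rasterization step, assuming the map-frame points have already been obtained via $T_{mw}$; the grid centering and the hit threshold are illustrative assumptions, and the paper's occupancy probability update [30,31,32] is abbreviated here to a simple hit counter:

```python
import numpy as np

def rasterize(points_m, r, W=640, H=640, min_hits=3):
    """Project map-frame points onto the floor plane and bin them into a
    W x H grid of resolution r (m/grid). The y_m axis is the floor normal,
    so only the x and z coordinates survive the projection."""
    u = np.floor(points_m[:, 0] / r).astype(int) + W // 2   # centered grid
    v = np.floor(points_m[:, 2] / r).astype(int) + H // 2
    hits = np.zeros((H, W), dtype=int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    np.add.at(hits, (v[ok], u[ok]), 1)          # accumulate repeated hits
    grid = np.full((H, W), -1, dtype=int)       # -1: unknown
    grid[hits > 0] = 0                          #  0: observed, below threshold
    grid[hits >= min_hits] = 1                  #  1: constant obstacle
    return grid
```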
The top view of the point cloud built on the EuRoC dataset can be seen in Figure 2c, and the extracted 2D grid map is shown in Figure 2d. Some of the landmarks and the shape of the target environment have been well captured by the grid map. However, the top-left side of the grid map in Figure 2d is improperly marked as “unknown”. This is because the source map is generated by a sparse SLAM method [2], and the FOV of the camera in this dataset rarely covers this area.
4. 2D Laser-Based Localization in Scale Drifted Map
4.1. Extended Monte Carlo Localization
An extended MCL algorithm is employed to solve the LRF-based localization problem. The extended MCL algorithm is built upon the conventional one proposed by Dellaert et al. [35,36] and additionally estimates the local scale factor $s_t$ at time $t$. Assuming that $s_t$ depends only upon the position $x_t$ and the map $m$, the state of the robot can be updated by:

$$p(x_t, s_t \mid z_{1:t}, u_{1:t}, m) \propto p(z_t \mid x_t, s_t, m) \iint p(x_t, s_t \mid x_{t-1}, s_{t-1}, u_t)\, p(x_{t-1}, s_{t-1} \mid z_{1:t-1}, u_{1:t-1}, m)\, \mathrm{d}x_{t-1}\, \mathrm{d}s_{t-1} \quad (5)$$

where $x_t$ denotes the pose of the robot at time $t$ in the map $m$ and $z_t$ is the sensor observation. Given the control input $u_t$, the motion model $p(x_t, s_t \mid x_{t-1}, s_{t-1}, u_t)$ predicts the probability of updating the robot state to $(x_t, s_t)$. The sensor model, or observation model, $p(z_t \mid x_t, s_t, m)$ denotes the likelihood of receiving the sensor signal $z_t$ given the current state in the map $m$.
In MCL, the probability of the robot states is randomly sampled and represented by a number of particles. Generally, the update of the belief is a three-step recursive process, also known as the particle filter (a one-update sketch is given after the list):
Prediction step: A number of particles are sampled in line with the motion model $p(x_t, s_t \mid x_{t-1}, s_{t-1}, u_t)$ to approximate the unknown robot state $(x_t, s_t)$.
Correction step: All of the sampled particles are weighted according to the observation model $p(z_t \mid x_t, s_t, m)$. The details of the motion and observation models will be described in the following subsections.
Resample step: The algorithm samples new particles based on their weights. In this step, the particles are drawn with replacement; hence, the ones with low probability will be eliminated, while the stability of convergence is also enhanced.
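The sketch below shows one update cycle of the extended particle filter. The particle layout `[x, y, theta, s]` and the helper functions `sample_motion` and `beam_likelihood` (sketched in Sections 4.2 and 4.3 below) are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mcl_update(particles, u, scan, grid, rng=np.random.default_rng(1)):
    """One predict-correct-resample cycle; particles is an (N, 4) array of
    [x, y, theta, s], where s is the local scale factor (m/grid)."""
    # Prediction: propagate every particle through the scaled motion model.
    particles = np.array([sample_motion(p, u, rng) for p in particles])
    # Correction: weight each particle with the beam model in the drifted map.
    w = np.array([beam_likelihood(scan, p, grid) for p in particles])
    w /= w.sum()
    # Resample: low-variance (systematic) resampling with replacement.
    n = len(particles)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(w), positions)
    return particles[idx]
```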
As discussed in Section 2, a prerequisite for achieving robust localization is a correct scale estimate. Although the scale factor $s_t$ and the robot pose $x_t$ are calculated by the particle filter at the same time, the convergence of $s_t$ always precedes that of the robot pose. In this case, the convergence process of the MCL method can be divided into two phases based on the behaviour of the particles: (1) Scale factor initialization: the scale factor $s_t$ first converges to a limited range. A parameter $\sigma_t$ is defined to judge whether the scale estimate is well converged. (2) Robot pose tracking: the pose of the robot is calculated only after the scale factor $s_t$ has converged; meanwhile, $s_t$ is continuously revised. Here, $\sigma_t$ is calculated by:

$$\sigma_t = \sqrt{\frac{1}{N_t} \sum_{i=1}^{N_t} \left(s_t^{[i]} - \bar{s}_t\right)^2} \quad (6)$$

where $N_t$ is the total number of employed particles at time $t$, $s_t^{[i]}$ is the scale factor of the $i$-th particle at time $t$, and $\bar{s}_t$ is the average scale factor of all the particles at time $t$. The scale factor is considered converged only if $\sigma_t$ stays smaller than 0.8 for 5 consecutive updates. This threshold value is determined empirically.
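A small sketch of this two-phase switch, with Equation (6) computed over the particles' scale column (the array layout is the assumption carried over from the sketch in Section 4.1):

```python
import numpy as np

def scale_sigma(particles):
    """sigma_t of Equation (6): standard deviation of all particles' scales."""
    return float(np.std(particles[:, 3]))

def scale_converged(sigma_history, sigma_max=0.8, n_updates=5):
    """True once sigma_t has stayed below sigma_max for n_updates in a row."""
    recent = sigma_history[-n_updates:]
    return len(recent) == n_updates and all(s < sigma_max for s in recent)
```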
4.2. Motion Model
The task of the robot motion model is to compute the posterior distribution $p(x_t \mid x_{t-1}, u_t)$. In our work, the odometry model based on the encoder is used to predict the motion. Figure 3a illustrates the robot odometry model. When the robot advances from $(x_{t-1}, y_{t-1}, \theta_{t-1})$ to $(x_t, y_t, \theta_t)$, the relative motion can be described by a parameter vector $u_t = (\delta_{rot1}, \delta_{trans}, \delta_{rot2})$, in which the element $\delta_{trans}$ is the translation, and $\delta_{rot1}$ and $\delta_{rot2}$ are the first and second rotations of the robot heading. Given the readings of the encoder, $u_t$ can be computed as:

$$\begin{aligned} \delta_{trans} &= \sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2} \\ \delta_{rot1} &= \operatorname{atan2}(y_t - y_{t-1},\, x_t - x_{t-1}) - \theta_{t-1} \\ \delta_{rot2} &= \theta_t - \theta_{t-1} - \delta_{rot1} \end{aligned} \quad (7)$$

where $(x, y, \theta)$ is updated by the raw measurements of the encoder. By modelling the motion error under a Gaussian distribution, the rotation and translation can be sampled by:

$$\begin{aligned} \hat{\delta}_{rot1} &= \delta_{rot1} + \varepsilon(\alpha_1 \delta_{rot1}^2 + \alpha_2 \delta_{trans}^2) \\ \hat{\delta}_{trans} &= \delta_{trans} + \varepsilon(\alpha_3 \delta_{trans}^2 + \alpha_4 \delta_{rot1}^2 + \alpha_4 \delta_{rot2}^2) \\ \hat{\delta}_{rot2} &= \delta_{rot2} + \varepsilon(\alpha_1 \delta_{rot2}^2 + \alpha_2 \delta_{trans}^2) \end{aligned} \quad (8)$$

where the function $\varepsilon(b)$ denotes Gaussian noise with variance $b$ and zero mean. The parameters $\alpha_1, \ldots, \alpha_4$ denote the motion error parameters of the robot, which specify the error of the robot state accrued as the pose changes, similar to the definition in [29].
Additionally, the scale factor $s_t$ should also be updated with noise. It has been observed that one of the main causes of scale drift is large rotations of the camera; as a result, serious scale drift frequently occurs at the corners of the map. Accordingly, the rotation is taken into account in the prediction of $s_t$ in the motion model. The distribution $p(s_t \mid s_{t-1}, u_t)$ is defined as a Gaussian:

$$s_t \sim \mathcal{N}\!\left(s_{t-1},\, (\Delta\theta_t\, \sigma_s)^2\right) \quad (9)$$

where $\Delta\theta_t$ is the angle of the robot’s rotation between $t-1$ and $t$, and $\sigma_s$ is the standard deviation of the scale variable $s$ of the particles in the cluster with the highest weight. Then the pose of the robot can be updated by:

$$\begin{pmatrix} x_t \\ y_t \\ \theta_t \end{pmatrix} = \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ \theta_{t-1} \end{pmatrix} + \begin{pmatrix} (\hat{\delta}_{trans} / s_t) \cos(\theta_{t-1} + \hat{\delta}_{rot1}) \\ (\hat{\delta}_{trans} / s_t) \sin(\theta_{t-1} + \hat{\delta}_{rot1}) \\ \hat{\delta}_{rot1} + \hat{\delta}_{rot2} \end{pmatrix} \quad (10)$$

where the translation measured by the encoder (in meters) is divided by the particle’s scale factor so that the motion is expressed in the grid units of the drifted map.
Figure 3b,c depict the distributions of 1000 samples drawn from the classical odometry model and from the one with an uncertain scale factor, respectively. In these two figures, the sampling has been done with the same motion model parameters; however, it can be seen that our model has higher uncertainty owing to the metrical inconsistencies. Therefore, the proposed localization method usually takes a longer time to converge than the original MCL method.
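A sketch of the scaled motion sampling described above (Equations (7)–(10)); the noise parameters are illustrative placeholders, and $\sigma_s$ is passed in as a constant instead of being recomputed from the highest-weight cluster:

```python
import numpy as np

def sample_motion(p, u, rng, alpha=(0.05, 0.05, 0.05, 0.05), sigma_s=0.05):
    """Sample a successor state [x, y, theta, s] from the odometry model
    with an uncertain scale (Equations (8)-(10))."""
    x, y, theta, s = p
    d_rot1, d_trans, d_rot2 = u
    a1, a2, a3, a4 = alpha
    # Equation (8): perturb the odometry reading with zero-mean Gaussians.
    d_rot1 += rng.normal(0.0, np.sqrt(a1 * d_rot1**2 + a2 * d_trans**2))
    d_trans += rng.normal(0.0, np.sqrt(a3 * d_trans**2 + a4 * (d_rot1**2 + d_rot2**2)))
    d_rot2 += rng.normal(0.0, np.sqrt(a1 * d_rot2**2 + a2 * d_trans**2))
    # Equation (9): diffuse the scale in proportion to the rotation.
    s = max(rng.normal(s, abs(d_rot1 + d_rot2) * sigma_s + 1e-6), 1e-3)
    # Equation (10): apply the translation in grid units of the drifted map.
    x += (d_trans / s) * np.cos(theta + d_rot1)
    y += (d_trans / s) * np.sin(theta + d_rot1)
    theta = (theta + d_rot1 + d_rot2) % (2 * np.pi)
    return np.array([x, y, theta, s])
```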
4.3. Sensor Model
The beam model, or laser ray model [29], is utilized to calculate the importance of each particle. Intuitively, the working principle of the beam model in an MCL algorithm is to place the laser scanner at the pose of each particle in the map $m$. These hypothetical beams are cast along their current directions until hitting obstacles or reaching their maximum range. The weight of each particle is then determined by the degree of agreement between these beams and the map $m$. Let the $i$-th measurement on map $m$ be $z_t^{i}$. The likelihood of the full scan is composed of all the $K$ beams:

$$p(z_t \mid x_t, s_t, m) = \prod_{i=1}^{K} p(z_t^{i} \mid x_t, s_t, m) \quad (11)$$

where the model of a single beam incorporates four error types stemming from four different conditions (correct detection, unexpected obstacles, missed measurements, and random measurements):

$$p(z_t^{i} \mid x_t, s_t, m) = w_{hit}\, p_{hit} + w_{short}\, p_{short} + w_{max}\, p_{max} + w_{rand}\, p_{rand} \quad (12)$$

where $w_{hit}$, $w_{short}$, $w_{max}$, and $w_{rand}$ are the weights defined for averaging the four types of distributions. The method to calculate $p_{hit}$, $p_{short}$, $p_{max}$, and $p_{rand}$ is identical to the original beam model described in [29].
Since the map built by monocular SLAM cannot perfectly depict the target scene, some free space gets marked as obstacles by mistake. Therefore, the particles are allowed to travel through the unreachable parts of the grid map. However, this causes a new problem: some particles may pass through the “walls” in the map and increase the risk of localization failure. Therefore, the weights of the particles located in unknown grids are penalized by multiplying them with a coefficient $\gamma_{unknown} < 1$. Similarly, the penalty coefficient for the particles located in occupied grids is $\gamma_{occupied}$.
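A per-beam sketch of this weighting, with the ray casting hidden behind a hypothetical `raycast` helper (returning the expected range in grid units) and all mixture weights, beam parameters, and penalty coefficients chosen arbitrarily for illustration:

```python
import numpy as np

W_HIT, W_SHORT, W_MAX, W_RAND = 0.7, 0.1, 0.1, 0.1   # mixture weights (assumed)
Z_MAX, SIGMA_HIT, LAM_SHORT = 30.0, 0.2, 0.5         # beam parameters (assumed)

def beam_prob(z, z_star):
    """Four-component mixture of Equation (12) for one beam: measured range z
    versus the expected range z_star, both in meters."""
    p_hit = np.exp(-0.5 * ((z - z_star) / SIGMA_HIT) ** 2) / (SIGMA_HIT * np.sqrt(2 * np.pi))
    p_short = LAM_SHORT * np.exp(-LAM_SHORT * z) if z < z_star else 0.0
    p_max = 1.0 if z >= Z_MAX else 0.0
    p_rand = 1.0 / Z_MAX
    return W_HIT * p_hit + W_SHORT * p_short + W_MAX * p_max + W_RAND * p_rand

def beam_likelihood(scan, p, grid, gamma_unknown=0.5, gamma_occupied=0.2):
    """Particle weight: product over all beams (Equation (11)), then penalized
    if the particle sits in an unknown (-1) or occupied (1) grid cell."""
    x, y, theta, s = p
    q = 1.0
    for angle, z in scan:                                # z in meters
        z_star = s * raycast(grid, x, y, theta + angle)  # grid units -> meters
        q *= beam_prob(z, z_star)
    cell = grid[int(y), int(x)]
    return q * (gamma_unknown if cell == -1 else gamma_occupied if cell == 1 else 1.0)
```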
4.4. Initialization of Particle Filter
At the start of the MCL approach, the particles need to be set based on an initial guess of the robot’s state. Since the true size of the original map is unknown, the initial scale factors of all the particles are drawn from a uniform distribution over a preset range. For example, if the particles are initialized with scale factors in the range [0.01 m/grid, 3 m/grid], a grid map of size 640 × 640 will be able to represent environments with sizes from 6.4 m × 6.4 m to 1920 m × 1920 m, which covers all the possible sizes of a typical indoor environment.
Furthermore, if the robot has no knowledge about its starting position, it is necessary to estimate its pose over the entire map, which is known as the global localization problem. In this work, one of the key advantages of using the MCL algorithm is its ability to handle global localization by distributing particles over all the possible grids in the map. For the evaluations in Section 5, the method is tested with a manually set initial guess around the starting point, and also under a global distribution covering all the possible hypothesis states for global localization.
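A minimal sketch of this global initialization, reusing the `[x, y, theta, s]` particle layout assumed in the earlier sketches:

```python
import numpy as np

def init_particles(n, grid, s_range=(0.01, 3.0), seed=2):
    """Global initialization: poses drawn uniformly over the grid and scale
    factors drawn uniformly over the assumed range (m/grid)."""
    rng = np.random.default_rng(seed)
    H, W = grid.shape
    return np.stack([
        rng.uniform(0, W, n),                    # x (grid units)
        rng.uniform(0, H, n),                    # y (grid units)
        rng.uniform(0, 2 * np.pi, n),            # theta
        rng.uniform(s_range[0], s_range[1], n),  # scale factor s
    ], axis=1)
```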
During the initialization process, a large number of particles are required for searching among all the possible scale factors. However, after a few steps, most of the particles with incorrect scales are eliminated in the resampling step, and the variance of $s$ decreases accordingly. Then, the behaviour of the particle filter in our method becomes similar to the original MCL, and the robot pose can be tracked with far fewer particles. To determine the number of particles, the KLD (Kullback–Leibler divergence) sampling approach [37] is utilized, which gives an approximate estimation of the KL divergence between the true distribution and the estimated one. With the help of KLD sampling, the computational cost of our approach is reduced so that the algorithm can run in real time.
4.5. The Problem of Particle Collapse
In the resampling process of the particle filter, if only a small number of particles carry the vast majority of the weight while the rest have minimal weights, the diversity of the particles is reduced. This may lead to a situation where the algorithm falls into a single peak, a common problem in particle filters known as particle collapse. Normally, this can be mitigated by setting trigger conditions for the resampling and applying low-variance sampling.
Nevertheless, in our case, the uncertain scale can also be an underlying cause of particle collapse. For instance, Figure 4a,b shows two possible robot states carrying different scale factors but observing similar laser measurements from different poses. With similar probabilities, two clusters of particles will be generated around these two positions, as shown in Figure 4c. Assume that Cluster I in Figure 4c represents the correct robot pose. If Cluster I is eliminated after a few updates, all the resampled particles will incorrectly gather at Cluster II; this is how particle collapse happens.
In our practical tests, the proposed method is prone to falling into local optima with very small scale factors because of the ambiguity of the map scale, especially during global localization. To avoid this collapse problem, the candidate particles in a few sub-optimal clusters should be prevented from being discarded, which makes the system more robust against multi-optimum situations, similar to the idea in [38]. A kd-tree is maintained to store all the particles in each update. The particles are then clustered according to their distance in the pose space and sorted by their total weights. A number of clusters with the highest weights are selected for tracking multi-modal hypotheses in the initialization step.
In the clustered MCL [38], several clusters of particles are updated independently for global tracking, which is not required in our case. Instead, our strategy is to put a part of the particles of each selected cluster into a “safe box” and keep them from being abandoned during the resampling operation. Moreover, the particles of all the clusters are involved in the same MCL process, which means that these clusters change dynamically in every update. However, this operation introduces bias into the system, since it artificially increases the probability of the states carried by the particles in the “safe box”. Hence, this method is applied only until the scale is considered converged.
6. Conclusions
In this study, we propose a novel localization approach that allows a robot to reuse the scale-drifted map from monocular SLAM with only a laser range finder. To the best of our knowledge, the framework proposed in this article is the first attempt to address the problem with this setup. By extracting 2D structures from the point cloud, the 3D point cloud from monocular SLAM is converted to a 2D grid map for localization. Our localization approach is built upon a variant of the Monte Carlo localization method, in which both the robot’s pose and the map scale are estimated. The proposed method is tested on a public dataset and in real environments. The results show that the localization method provides a relatively accurate estimate of the robot pose. Moreover, in feature-less environments, where monocular SLAM produces significant scale drift, our method is still able to give a quick and consistent localization result with a correct scale estimate.
In the future, the current work can be extended in multiple directions. Some limitations of the proposed method have already been discussed; we believe that utilizing a more robust measurement model, such as the likelihood field model [41] or the normal distribution representation [42], may be a possible way to enhance the localization accuracy. Also, the pose estimation can be extended to 6 DoF by involving 3D LiDAR sensors in order to improve robustness when matching against the point cloud map. Furthermore, autonomous navigation can also be achieved by combining the system with existing path planning methods [43].