1. Introduction
The demand for 3D graphical models in computer graphics, virtual reality, communication, and related fields is increasing rapidly, and much progress in multi-view stereo (MVS) algorithms has been made in recent years, both in precision and in performance. Computation of the fundamental matrix F is a basic problem in multi-view geometry for the reconstruction of 3D scenes: once F is obtained, a projective reconstruction of the scene or object can be inferred from it [1].
The eight-point method (8-pt) [2,3], seven-point method (7-pt), five-point method (5-pt) [4,5], and gold standard method (Gold) [6] are usually all run within the RANSAC (RANdom SAmple Consensus) framework. If picture quality and feature matching are precise enough, accurate results can be achieved; however, the estimate from the seven-point method is sensitive to noise, and the five-point method exploits many constraints but is not very robust. These methods often include a non-linear optimization as the last step, which is time-consuming. The sampling selects the best F from the candidates by the distance from feature points to their epipolar lines in the image: the F with the least average distance wins and is output as the final result. The quality of F is therefore judged by this single distance number.
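The point-to-epipolar-line scoring that these RANSAC variants share can be sketched as follows (a minimal NumPy sketch; the function name and the symmetric averaging over both images are our choices, not taken from the paper):

```python
import numpy as np

def epipolar_distances(F, u, u_prime):
    """Symmetric distance (in pixels) from matched points to their
    epipolar lines -- a sketch of the usual RANSAC scoring rule.

    F          : 3x3 fundamental matrix
    u, u_prime : Nx2 arrays of matched pixel coordinates
    """
    n = u.shape[0]
    uh = np.hstack([u, np.ones((n, 1))])          # homogeneous coords
    uh_p = np.hstack([u_prime, np.ones((n, 1))])
    l_p = uh @ F.T      # epipolar lines in image 2: l' = F u
    l = uh_p @ F        # epipolar lines in image 1: l  = F^T u'
    # point-line distance |ax + by + c| / sqrt(a^2 + b^2)
    d_p = np.abs(np.sum(l_p * uh_p, axis=1)) / np.hypot(l_p[:, 0], l_p[:, 1])
    d = np.abs(np.sum(l * uh, axis=1)) / np.hypot(l[:, 0], l[:, 1])
    return 0.5 * (d + d_p)
```

Inliers are then the matches whose distance falls below a threshold, and the candidate F with the smallest average over its inliers wins.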
According to epipolar theory in the multi-view geometry of computer vision, as demonstrated in Figure 1, all the epipolar lines in an image intersect at a single point, the epipole e or e′ of that image plane. An epipolar line is the intersection of an epipolar plane with an image plane, and an epipolar plane is a plane containing the baseline, the line joining the two camera centers. The epipole is the point of intersection of the baseline with the image plane; equivalently, it is the image in one view of the camera center of the other view. For example, e is the projection of the second camera center onto the left view in Figure 1. It is also the vanishing point of the baseline (translation) direction.
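Numerically, the epipoles can be recovered from F itself: since Fe = 0 and Fᵀe′ = 0, they are the right and left null vectors of the rank-2 matrix F. A minimal sketch (function name ours):

```python
import numpy as np

def epipoles(F):
    """Return epipoles e, e' as the right/left null vectors of F:
    F e = 0 and F^T e' = 0, computed from the SVD of the rank-2 F."""
    U, S, Vt = np.linalg.svd(F)
    e = Vt[-1]          # right null vector -> epipole in image 1
    e_prime = U[:, -1]  # left null vector  -> epipole in image 2
    # dehomogenize to pixel coordinates (assumes finite epipoles)
    return e[:2] / e[2], e_prime[:2] / e_prime[2]
```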
This geometric relation between corresponding points in the two images is encoded by the fundamental matrix F. Like most other algorithms, our method also computes F within the RANSAC framework.
Although every result produced by RANSAC can have a very small average distance from the matching feature pixels to the epipolar lines, the reconstruction is sometimes still unstable when judged by a different standard. For example, the epipoles of different final best samplings are scattered rather than concentrated in a small region; an epipole may even lie inside the image in one run and outside it in another. Sometimes the epipoles computed from two estimates of F lie far apart, although both Fs have a very small average distance to the epipolar lines. When we used such an F to reconstruct the two-view stereo, the reconstructed 3D scene was occasionally abnormal and distinctly projective. This means the average distance to epipolar lines is neither the only nor a very robust criterion for selecting the fundamental matrix. In fact, a fundamental matrix with a very small average distance to epipolar lines sometimes led to very poor 3D reconstruction results, such as in two-view reconstruction based on F. This problem is critical in affine and projective reconstruction, where there is no meaningful metric information about the object space.
A new error selection criterion is proposed in this paper to address the problem that an F with a very small average distance to epipolar lines sometimes yields a very bad, or even totally wrong, 3D metric reconstruction. The advantage of our criterion is that it can visually present the reconstructed 3D scene to indicate the quality of F. The criterion is still a projection error, but it is measured in 3D metric space rather than in projective space. Metric space, or metric reconstruction, means the structure is defined in Euclidean coordinates, so that the reconstructed scene closely resembles the real scene; "metric" is used in contrast with "projective" and "affine". One speaks of projective reconstruction, affine reconstruction, similarity reconstruction, and so on, to indicate the type of transformation involved. In this paper, the term metric reconstruction is used in preference to similarity reconstruction, the two being identical in meaning.
In particular, this modified error selection criterion can improve 3D reconstruction results if we assume that the principal point of the camera lies at the center of the image. The F selected under our criterion is suitable for 3D metric reconstruction, whereas the other Fs are generally only usable for projective reconstruction. When the re-projection error extracted from the metric reconstruction based on F is fed back to judge the quality of F, and F is then recomputed, its quality improves accordingly. This is the novelty of this paper.
2. Previous Research
Much work has been done on fundamental matrix computation over the past 30 years [2,7,8,9,10,11,12]. It is a well and deeply developed area of computer vision; nevertheless, papers on the fundamental matrix are still published every year. Some traditional methods yield good results, meaning very low projection errors to the epipolar lines. This average distance to the epipolar lines remains the universal, almost unique, criterion for judging the quality of the fundamental matrix.
The five-point method [4,5] is the most popular method for finding the possible relative camera poses between two calibrated views from five given corresponding points. The algorithm computes the coefficients of a tenth-degree polynomial in closed form and subsequently finds its roots. Kukelova et al. [13] proposed polynomial eigenvalue solutions to the five-point and six-point relative pose problems, which can be solved with standard efficient numerical algorithms and are somewhat more stable than the solutions of Nister [4,5] and Stewenius [14]. Another five-point method [15] estimates the fundamental matrix between two uncalibrated cameras from five correspondences of rotation-invariant features; three of the points must be co-planar and the other two in general position. The solver, combined with Graph-Cut RANSAC, was superior to the seven- and eight-point algorithms in both accuracy and the required number of samples on 561 publicly available real image pairs.
Reference [11] proposed a novel technique for estimating the fundamental matrix based on Least Absolute Deviation (LAD), also known as the L1-norm method. Banno et al. [16] proposed a parameterization for the nonlinear optimization that includes the rank-2 constraint, adopting a double quaternion and a scalar as the parameter set. Menglong et al. [17] even proposed a two-point fundamental matrix method, which uses epipolar theory to estimate the fundamental matrix from three corresponding epipolar lines instead of seven or eight corresponding points. Sengupta et al. [12] introduced a novel rank constraint on collections of fundamental matrices in multi-view settings, then used iteratively re-weighted least squares and the alternating direction method of multipliers (ADMM) to minimize the resulting cost function. Reference [18] developed the MAPSAC algorithm to obtain the robust MAP estimate of an arbitrary manifold for computing F; however, with its open-source code, the SfM result is not stable, in the sense that the reconstructed scene varies considerably from run to run.
As [19] points out, use of the symmetric epipolar distance should be avoided; the Kanatani distance (REK) [20] was the most effective error criterion found in their experiments for the outlier removal phase of F estimation. Some methods [21,22] searched for corresponding epipolar lines using points on the convex hull of the silhouette of a single moving object; these methods fail when the scene includes multiple moving objects. Reference [23] extends this work to scenes with multiple moving objects by using "Motion Barcodes", a temporal signature of lines. Reference [24] developed a new method for calculating the fundamental matrix combined with feature lines: the camera orientation elements and relative orientation elements are used as the parameters of the fundamental matrix, and the equivalent relationships are deduced from horizontal and vertical feature lines.
Recently, following the extensive study of deep learning with neural networks, fundamental matrix estimation without correspondences has been proposed using novel end-to-end neural network architectures [25]. New modules and layers are designed to preserve the mathematical properties of the fundamental matrix as a homogeneous rank-2 matrix with seven degrees of freedom.
Most of the methods above focus on additional constraints or new optimization methods. Unlike them, our algorithm uses a new, visual criterion for metric reconstruction under the RANSAC framework and achieves results similar to [4,5]; that is, our F connects better with the 3D Euclidean coordinate frame and is therefore better suited to reconstructing the real world in Euclidean coordinates.
4. Computing in Metric Space Based on Geometric Distance
As we know, F encodes information in projective coordinates, not Euclidean coordinates. If a metric reconstruction is required, it can only be estimated from F via the camera's intrinsic matrix K. Auto- (or self-) calibration, which yields K automatically, is the computation of metric properties of the cameras and/or the scene from a set of uncalibrated images. Our method does not need any intrinsic parameter measured in advance to reconstruct in the metric frame of Euclidean coordinates; in other words, we use a simplified auto-calibration method to estimate K.
In this paper, some constraints [28] are enforced on the intrinsic parameters of the camera to estimate K. These constraints are: (1) the aspect ratio is typically around 1; (2) the skew can be assumed to be 0; (3) the principal point of the camera projects to the center of the image, (w/2, h/2), where w and h are the width and height of the image in pixels. After moving the origin of the image coordinates to this center, the intrinsic parameter matrix is the K defined in Equation (10), where f is the focal length of the camera. This is a basic assumption of our method, because the condition is easily satisfied now and will be even more so as camera manufacturing improves. Our method tries to make full use of this assumption; theoretically, it can achieve a perfect result if the real images satisfy this constraint on K exactly.
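Under these three constraints, K reduces to a single unknown, the focal length f. A minimal sketch of building such a K before the origin shift (the function name is ours; we read Equation (10) as the standard pinhole form):

```python
import numpy as np

def intrinsic_matrix(f, width, height):
    """K under the paper's constraints: aspect ratio 1, zero skew,
    principal point at the image center (w/2, h/2). After shifting
    the image origin to the center, the last column becomes (0,0,1)."""
    return np.array([[f,   0.0, width / 2.0],
                     [0.0, f,   height / 2.0],
                     [0.0, 0.0, 1.0]])
```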
4.1. 3D Metric Distance Criterion of F
As mentioned above, almost all computations of F use the distance from feature pixels to epipolar lines as the criterion for discarding outliers in feature matching. Our algorithm instead uses a 3D metric distance criterion, detailed as follows.
Compared with traditional computations of F, the 3D metric distance criterion is more robust and has similar precision. It is the average distance between the projections of each reconstructed point X onto the two image planes and the corresponding feature pixels, as in Equation (11). If this distance for one point is less than a preset threshold, the match is an inlier. Equation (11) then computes the average distance over all N matching inliers; this average is our 3D metric distance criterion for F.
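A hedged sketch of this criterion in the spirit of Equation (11) (function and variable names are ours): given the two camera matrices, triangulated points X, and the measured pixels, it averages the re-projection distances over both images:

```python
import numpy as np

def metric_reprojection_error(P, P_prime, X, u, u_prime):
    """Average distance between the projections of the triangulated
    3D points X and the measured features u, u' (Eq. (11)-style).

    P, P_prime : 3x4 camera matrices
    X          : Nx3 points;  u, u_prime : Nx2 pixels
    """
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    def project(P):
        x = Xh @ P.T
        return x[:, :2] / x[:, 2:3]   # dehomogenize
    d1 = np.linalg.norm(project(P) - u, axis=1)
    d2 = np.linalg.norm(project(P_prime) - u_prime, axis=1)
    return 0.5 * np.mean(d1 + d2)
```

A match whose per-point distance exceeds the threshold is treated as an outlier before the average is taken.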
This 3D metric distance criterion can be applied in any RANSAC framework for F computation to judge whether a feature match is an outlier. The criterion is more robust and performs better, especially for images satisfying the assumption that the optical center lies at the center of the image. The experimental validation of this criterion is discussed in Section 5.
4.2. Optimization Based on Metric Geometric Distance
Our optimization based on metric geometric distance follows the same framework as the gold standard method [6], but works in a metric space for X and the cameras rather than the projective space described by Equation (8) in Section 3.3. In other words, we use a different method to compute the projection matrices, which here represent the metric scene in Euclidean coordinates.
The input of our optimization is an initial F, the image size, and the matching feature points. The initial F for an image pair is computed by the eight-point method, as described in Section 3.1, in the RANSAC framework based on the traditional distance from features to epipolar lines. The output is the final F. The goal of the optimization is to minimize Equation (11). The steps of the algorithm are as follows:
1. Estimate E from F with the K of Equation (10).
2. Decompose E into the camera matrices by the method in Section 3.2.
3. Normalize the matching points with respect to K and the image size.
4. With these normalized correspondences, estimate X by the triangulation method [6].
5. Minimize the energy function of Equation (11) over X and the camera matrices. The cost is minimized using the Levenberg–Marquardt algorithm: 3N variables for the N 3D points, and 24 for the camera matrices.
6. Retrieve F from the camera matrices. First compute the epipole of the second image as e′ = P′C, where C is the null space of P; then return the final F = [e′]×P′P⁺.
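The last step, retrieving F from the camera matrices, can be sketched as follows (a NumPy sketch of the standard formula F = [e′]×P′P⁺; the function name is ours):

```python
import numpy as np

def F_from_cameras(P, P_prime):
    """Recover F from camera matrices P, P': e' = P' C with P C = 0,
    then F = [e']_x P' P^+ where P^+ is the pseudo-inverse of P."""
    # camera centre C of the first view: right null vector of P
    _, _, Vt = np.linalg.svd(P)
    C = Vt[-1]
    e_prime = P_prime @ C                  # epipole in the second image
    ex = np.array([[0, -e_prime[2], e_prime[1]],
                   [e_prime[2], 0, -e_prime[0]],
                   [-e_prime[1], e_prime[0], 0]])   # [e']_x
    return ex @ P_prime @ np.linalg.pinv(P)
```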
As mentioned above, K is estimated through the constraints rather than precisely measured. Of course, if a precisely measured K is supplied to our algorithm, better results are achieved; but in most scenarios the precise K cannot be known in advance. Therefore this estimated, feasible K is adopted in our optimization. In practice it works very well and consistently achieves satisfying results.
Compared with the normalization in Section 3.1, our method normalizes feature coordinates differently when computing the final F. In Section 3.1, the origin is placed at the centroid (barycenter) of the features u or u′, and the average distance from the features to the origin is fixed at √2. In our algorithm, the origin is at the center of the image, and the average distance from the origin is less than 1 and not fixed. Our normalization is more efficient: it is a linear operation that does not need to compute the barycenter. Furthermore, our method can visualize the metric 3D reconstruction of X, as described in Section 5.3, whereas the other normalization methods cannot directly produce a reasonable visualization of the reconstruction.
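A minimal sketch of this image-center normalization (the exact scale factor is our assumption; the paper only states that the origin moves to the image center and the average distance from it becomes less than 1):

```python
import numpy as np

def normalize_center(u, width, height):
    """Shift the origin to the image centre and scale so coordinates
    fall roughly in [-1, 1]. Returns the normalized points and the
    3x3 transform T such that x_n = T x (homogeneous)."""
    s = 2.0 / max(width, height)              # assumed scale choice
    T = np.array([[s,   0.0, -s * width / 2.0],
                  [0.0, s,   -s * height / 2.0],
                  [0.0, 0.0, 1.0]])
    uh = np.hstack([u, np.ones((u.shape[0], 1))])
    return (uh @ T.T)[:, :2], T
```

Being a fixed linear map of the pixel coordinates, it needs no barycenter computation, which is the efficiency gain claimed above.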
4.3. The Framework of Computing F Visually
The input data of our algorithm are a pair of images of the scene to be reconstructed, taken with a hand-held camera with a fixed focal length. The framework of the experiment is as follows:
- Step 1: Extract SIFT features on both images and compute the feature matches.
- Step 2: Use a traditional method such as the eight-point method to compute F as the initial input of the next step.
- Step 3: Optimize F with the algorithm of Section 4.2 based on the metric projection error.
- Step 4: If the metric projection error is smaller than that of Step 2, output the result as the final F; if not, go back to Step 2 and repeat.

In Step 3, we have E, from which the camera matrices can be deduced. The figure of the reconstructed 3D points X can then be shown in a Cartesian coordinate system, even at each iteration.
5. Experiment
5.1. Experimental Settings
Three different datasets, Dino, ZhangLan, and Simu, were used in our experiment. Dino is a publicly accessible dataset; ZhangLan was captured by us on our own campus; Simu is a hand-made simulation of camera projection.
5.2. Feature Matching
Figure 2 shows the features extracted by SIFT on the datasets Dino and ZhangLan. Many different stereo matching methods have been thoroughly researched [29]; our algorithm adopts SIFT [30] for the feature matching of the two images, which is the first step in computing F. Euclidean distance is the criterion for measuring descriptor difference, and a candidate is accepted as a match only if the ratio of the nearest distance to the second-nearest distance is below a threshold; otherwise it is not labeled as a feature match.
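The ratio test can be sketched as follows (a brute-force NumPy version; the 0.8 threshold is an assumed value, since the paper's threshold is not reproduced here):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Match descriptor rows of desc1 against desc2 by Euclidean
    distance, keeping a match only when nearest / second-nearest
    < ratio (Lowe's ratio test; 0.8 is an assumed threshold).
    Returns a list of (index_in_desc1, index_in_desc2)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dists)[:2]       # nearest and second nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches
```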
Other feature matching algorithms such as Harris, BRISK, SURF, and ORB would also work. We chose SIFT only because it is typical and good enough for comparing F computation methods, not because it is the best or most accurate matching method. SIFT produces some mismatched features most of the time; because of the RANSAC framework used in this paper, the mismatches are culled by the sampling and iteration mechanism, as shown in Figure 2. This is one difference from most F computation methods, which often carefully remove wrong feature matches manually.
For the simulated scene Simu, displayed in Figure 3, all the points are perfectly matched because the scene was made by hand, so its matching is not listed in Figure 2. The 3D points X and the camera positions are displayed in Figure 3a, and the projection images of X at the first and second viewpoints are displayed in Figure 3b,c, respectively.
Among the SIFT feature matches in Figure 2, there must be some outliers with long distances to the epipolar lines or obvious re-projection errors of X. When outliers are detected by the criteria mentioned in this paper, the incorrect matches are removed from the feature matching sequence and are not counted in the error calculation. This is an important step for improving the precision and robustness of our method.
Figure 4 shows some typical epipoles and their corresponding epipolar lines, chosen randomly and computed with our 3D metric criterion for the final F under the RANSAC framework, like the other traditional methods. As can be seen, the matched features lie close to their epipolar lines: the distances are less than one pixel on average, as shown in Table 1.
5.3. Reconstructed 3D Points
The visual results in Figure 5 show the 3D reconstruction under our 3D metric projection error criterion, based on the E derived from F. The eight-point [6], seven-point [6], and Gold [6] methods, by contrast, compute only a projective F and cannot reconstruct a metric 3D result directly.
In our experiment, the eight-point method always produced good reconstruction results under each criterion: every point of X was recovered precisely, with projection errors of 1.50, 0.5, and 0.45 for the three scenarios, respectively.
For the seven-point method, on the other hand, no matter how carefully the parameters are adjusted, the reconstruction is not good enough, because the seven-point method is very sensitive to noise. The reconstructed 3D points X almost lie on one plane, when in fact they lie on two planes. The reason is that the rank-2 restriction det(F) = 0 is somewhat arbitrary and has only mathematical meaning; for the reconstruction here it has a kind of side effect when computing E. In our experiment, det(F) = 0 was a useful and necessary but weak constraint. It is useful because it adds one equation, reducing the required input by one point pair, and because the epipoles must be calculated under this constraint. It is weak in the sense that it significantly affects the result when applied: most of the time, better projection errors or average epipolar-line distances are obtained without it, although this does not mean better 3D reconstruction results. In effect, det(F) = 0 limits the solution space of F; when the constraint is applied to the algorithms above, it brings the advantage of computing the epipoles on the corresponding images.
The appended material gives the results of the five-point method, which outputs E directly, so the epipolar criterion cannot be applied to it. Compared over the distance to epipolar lines, Gold and Ours have results similar to the seven-point method. The five-point method's projection error of X, listed in the fourth row of Table 1, is obviously larger than the other four; the five-point method was not robust or reliable enough in our experiments.
For the scene Simu, if a random projection error of only 0.5 pixels or less is added, the reconstruction is better than for the other scenarios, because there the optical center of the camera lies precisely at the image center.
5.4. Numerical Result Comparison
Table 1 compares the metric criterion with the traditional epipolar distance for the 8-pt [6], 7-pt [6], 5-pt [13], and Gold [6] methods. The second to sixth rows of Table 1 are results based on the average distance from feature pixels to epipolar lines for the five methods of computing F. The columns correspond to the three scenes Dino, ZhangLan, and Simu, with 100, 123, and 100 initial matching features, respectively. The numbers under the column 'Error' are the average distances (in image pixels) from feature pixels to epipolar lines, while the numbers under 'Inliers #' are the numbers of inliers. Our method and Gold have the best re-projection errors among the five.
Table 2 lists the average re-projection error from feature pixels to the projections of X, i.e., the 3D metric criterion described in Section 4.1, for the same datasets Dino, ZhangLan, and Simu. The last column is our method, which outperforms the other four. For a fair comparison, the same Fs as in Table 1 were used to decompose out E with K and reconstruct the scene; that is, their inliers were used to compute the average re-projection error. Our method achieves comparatively good results against the other four.
Projective validation is a common way to judge the quality of F. In this paper, we propose metric validation for the comparison. Metric validation is based on the assumption that the intrinsic matrix K is subject to the constraints in Section 4; the reconstruction is then obtained through the deduced essential matrix E, and the inlier criterion becomes the average re-projection error of X onto the images. If the images are captured with an intrinsic matrix K satisfying these constraints, the resulting F will give a better final reconstruction in metric coordinates; this is in fact the case for most cameras, which is what makes metric validation work. The biggest benefit of our metric validation is that it produces a reasonable, visual metric reconstruction rather than a sort of projective reconstruction with small numerical projection errors. If a more precise result is required, K should be measured in advance; here we focus on an automatic approach, so a pre-measured K is not adopted. As can be seen in Table 1, our method achieves better results in metric space than the five-point method.
In our experiments, the inlier thresholds are of two types: (1) for metric validation, the threshold is on the metric re-projection error of X onto the two images; (2) for RANSAC inliers, it is on the projective error from the features to the corresponding epipolar lines. A smaller threshold generally means fewer inliers and better re-projection errors, especially for real image pairs. For example, for one of the scenes with the five-point method, the re-projection error decreased from 5.5 to 4.7 while the number of inliers dropped from 94 to 86.
In Section 4.3, a larger metric projection error can be produced in Step 3, which optimizes F using the algorithm in Section 4.2, because the initialization of the global optimization is crucial. Once the initialization of Step 3 is recalculated, a smaller error is usually obtained within three iterations.
To measure the precision of reconstruction in metric space, we add noise to the matching features' positions, reconstruct the scene, and compare the inter-point distances with the simulated ground-truth distances. Because there are 100 points in Simu, 4950 pairwise distances can be compared against the ground truth. We compute the mean error of these 4950 distances and repeat the process 100 times with different noise at the same level to obtain an average error distance. Since the extent of the scene would otherwise be an unbounded factor affecting the distance errors, we report the ratio of this average error distance to the maximum distance among the 100 points. This ratio stays small most of the time when pixel-level Gaussian noise is added to the features. Our method provides reasonable, high-precision 3D reconstruction.
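The evaluation just described can be sketched as follows (names are ours; scale alignment between the reconstruction and the ground truth is assumed to have been done already):

```python
import numpy as np

def distance_ratio_error(X_rec, X_gt):
    """Scale-normalized reconstruction error: mean |d_rec - d_gt|
    over all C(n,2) point pairs, divided by the largest ground-truth
    pairwise distance, so scene extent does not inflate the number."""
    n = X_gt.shape[0]
    i, j = np.triu_indices(n, k=1)          # all C(n,2) index pairs
    d_rec = np.linalg.norm(X_rec[i] - X_rec[j], axis=1)
    d_gt = np.linalg.norm(X_gt[i] - X_gt[j], axis=1)
    return np.mean(np.abs(d_rec - d_gt)) / d_gt.max()
```

For n = 100 points this averages over the 4950 pairwise distances mentioned above; repeating over many noise draws gives the reported ratio.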