5.1. Ellipse Prediction
We used a sequence from the TUM dataset [33] and a sequence from the 7-Scenes dataset [34] to evaluate DEllipse-Net for ellipse prediction. In the TUM sequence, we selected 11 categories and 12 objects to construct the ellipsoid environment; 1000 images were selected for training DEllipse-Net and YOLOv8, while 500 images were reserved for testing. In the 7-Scenes sequence, we selected 7 categories and 11 objects to construct the ellipsoid environment; six sequences totaling 6000 images were randomly selected, with 3000 images used for training and the remaining 3000 for testing. During network training, the batch size was set to 16 for 200 epochs, using the Adam optimizer.
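For illustration, a minimal PyTorch training loop consistent with the stated settings (batch size 16, 200 epochs, Adam) might look as follows. The tiny stand-in model, the toy data, and the smooth-L1 loss are our assumptions for the sake of a runnable sketch; they are not the published DEllipse-Net architecture or training code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model: a tiny CNN regressing 5 ellipse parameters
# (center x/y, semi-axes a/b, angle) from an image crop. The real
# DEllipse-Net architecture is described in the paper; this placeholder
# only makes the training loop below self-contained.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5))

# Toy tensors in place of the real image/ellipse training pairs.
images = torch.randn(64, 3, 64, 64)
ellipses = torch.randn(64, 5)
loader = DataLoader(TensorDataset(images, ellipses), batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters())  # initial LR not reproduced here

for epoch in range(200):  # 200 epochs, batch size 16, as reported
    for x, y in loader:
        loss = nn.functional.smooth_l1_loss(model(x), y)  # assumed loss choice
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```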
To demonstrate the superiority of our proposed ellipse network, we compared it with two other methods: one based on classical mathematical ellipse fitting (Mathematical) [15], and an ellipse prediction network proposed by Zins et al. (Ellipse-Net) [31]. We evaluated performance using the following metrics: $IoU$ (Intersection over Union), $IoU_{avg}$, $E_{IoU}$, $ATE$ (Absolute Translation Error), and $ARE$ (Absolute Rotation Error). $IoU$ measures the overlap between the predicted ellipse and the ground-truth ellipse, with higher values indicating closer alignment between the predicted and actual ellipsoid projections. $IoU_{avg}$ is the average $IoU$ over all predicted objects of the same category. $E_{IoU}$ is the difference between the predicted $IoU$ and the ideal $IoU$, with smaller errors indicating more accurate prediction of the ellipsoid projection. $ATE$ assesses the translation error, with lower values indicating that the estimated translation is closer to the true translation. $ARE$ measures the rotational error, with lower values reflecting smaller discrepancies between the predicted and ground-truth rotations.
$$IoU = \frac{\operatorname{area}(\mathcal{E} \cap \mathcal{E}_{gt})}{\operatorname{area}(\mathcal{E} \cup \mathcal{E}_{gt})}, \qquad ATE = \left\| \hat{\mathbf{t}} - \mathbf{t} \right\|_{2}, \qquad ARE = \arccos\!\left( \frac{\operatorname{tr}\!\left(\mathbf{R}^{\top}\hat{\mathbf{R}}\right) - 1}{2} \right)$$
where $\mathcal{E}$ denotes the predicted ellipse, $\mathcal{E}_{gt}$ represents the ground-truth ellipse projected from the ellipsoid, $\hat{\mathbf{t}}$ and $\mathbf{t}$ are the estimated translation vector and the true translation vector, respectively, and $\hat{\mathbf{R}}$ and $\mathbf{R}$ denote the estimated rotation matrix and the true rotation matrix, respectively.
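A minimal sketch of how these metrics can be computed is shown below. The rasterization-based overlap computation and the image size are our assumptions; the paper does not specify its implementation.

```python
import numpy as np
import cv2

def ellipse_mask(center, axes, angle_deg, shape):
    """Rasterize an ellipse (center, semi-axes, rotation) into a binary mask."""
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.ellipse(mask, (int(center[0]), int(center[1])),
                (int(axes[0]), int(axes[1])), angle_deg, 0, 360, 1, -1)
    return mask.astype(bool)

def iou(pred, gt, shape=(480, 640)):
    """IoU between predicted and ground-truth ellipses via rasterized masks."""
    a, b = ellipse_mask(*pred, shape), ellipse_mask(*gt, shape)
    return np.logical_and(a, b).sum() / max(np.logical_or(a, b).sum(), 1)

def ate(t_est, t_gt):
    """Absolute Translation Error: Euclidean distance between translations."""
    return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))

def are(R_est, R_gt):
    """Absolute Rotation Error: geodesic angle between rotation matrices."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: two nearly identical ellipses give an IoU close to 1.
print(iou(((320, 240), (60, 40), 30.0), ((325, 238), (58, 42), 27.0)))
```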
We applied noise of 5, 10, and 15 pixels to the bottom-left and top-right corners of the bounding box to evaluate the robustness of the three methods for ellipse prediction.
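A sketch of this perturbation follows, assuming axis-aligned boxes stored as (x_min, y_min, x_max, y_max); the uniform noise model is our assumption, as the distribution is not stated.

```python
import numpy as np

def perturb_bbox(bbox, max_noise, rng=np.random.default_rng()):
    """Shift the bottom-left and top-right corners by up to max_noise pixels."""
    x_min, y_min, x_max, y_max = bbox
    dx1, dy1, dx2, dy2 = rng.integers(-max_noise, max_noise + 1, size=4)
    return (x_min + dx1, y_min + dy1, x_max + dx2, y_max + dy2)

# Evaluate robustness at the three noise levels used in the experiments.
for level in (5, 10, 15):
    noisy = perturb_bbox((120, 80, 260, 210), level)
```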
Table 2 presents the $IoU_{avg}$ between ellipse predictions and the corresponding ellipsoid projections for different objects under bounding-box noise across the two datasets. The table shows that when the detection box is undisturbed, all three methods achieve good $IoU_{avg}$ values of 0.9 or above. However, as the noise increases, the $IoU_{avg}$ values of all three methods decrease. Notably, among the 23 objects in the two datasets, our method did not attain the highest $IoU_{avg}$ for 5 of these objects, and it did not achieve the highest value for 7 objects under partial disturbance. Nevertheless, for the remaining 15 objects, our method produced the best results. This indicates that our approach effectively mitigates the inaccuracies in ellipse prediction caused by bounding-box detection errors.
Figure 5 illustrates the performance of the three methods on the two datasets with 15 pixels of noise. The white ellipse represents the true ellipsoid projection, while the white bounding box indicates the area with added noise; all three methods predicted ellipses within this region. It is evident that the added disturbance has the greatest impact on the Mathematical method, while Ellipse-Net demonstrates some degree of robustness against interference. Our method, however, maintains relatively strong performance even under noise; for instance, in one of the sequences our predictions almost completely overlap with the true ellipsoid projection.
With the enhanced robustness of ellipse prediction to interference comes a corresponding improvement in localization capability. Table 3 illustrates the impact of the three ellipse prediction methods on localization performance under various noise levels. Note that testing was conducted using our object-level topology and alignment, so the only variable in this test is the choice of ellipse prediction method. The table demonstrates that, under different disturbances, the stable ellipse predictions generated by DEllipse-Net significantly enhance the robustness of ellipse-ellipsoid-based localization: in eight localization experiments, we achieved the best results in seven.
More specifically, we present the statistical results of $E_{IoU}$ for different objects under undisturbed conditions in Figure 6 and Figure 7, respectively. These line graphs illustrate the percentage of predicted ellipses falling under various $E_{IoU}$ thresholds. For the TUM sequence (Figure 6), eight objects achieved optimal results at different thresholds. For instance, for the "TV", all predicted $E_{IoU}$ values were below 0.08. When predicting the "Cup" with the $E_{IoU}$ threshold set at 0.1, our method achieved a higher success rate than both Ellipse-Net and Mathematical. Furthermore, there are notable differences in ellipse prediction accuracy between different objects of the same category. For example, in the prediction of "Cola", the Mathematical method and Ellipse-Net exhibit varying predictive capabilities across instances, while our approach demonstrates relatively high robustness. Similarly, as shown in Figure 7, our method also achieved the best results on the 7-Scenes sequence.
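Such cumulative threshold curves can be computed as in the following sketch; the variable names and toy values are illustrative.

```python
import numpy as np

def success_rate_curve(e_iou_values, thresholds):
    """Fraction of predictions whose E_IoU falls below each threshold."""
    e = np.asarray(e_iou_values)
    return [(e < t).mean() for t in thresholds]

thresholds = np.linspace(0.0, 0.3, 31)
curve = success_rate_curve([0.02, 0.05, 0.11, 0.07], thresholds)  # toy values
```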
Finally, we compared the ellipse-prediction model sizes of the three methods and their average prediction time per frame, as shown in Table 4. Since the Mathematical method relies purely on mathematical techniques for ellipse prediction, it has no trained model and requires minimal time to predict ellipses from image frames. In contrast, a single Ellipse-Net model is 82.6 MB, and each object requires its own trained model in actual operation; consequently, when the environment contains many objects, the space occupied becomes substantial, and Ellipse-Net also takes the most time to predict ellipses for a single frame. DEllipse-Net, however, requires only one trained model to handle ellipse prediction for all objects; our model is only 18.3 MB, with a running time shorter than that of Ellipse-Net. This demonstrates that DEllipse-Net is both lightweight and portable.
5.2. Localization Accuracy with Object-Level Instance Topology-Alignment
Localization involves matching the predicted ellipses in the image with the corresponding ellipsoids in the environment. In previous works, such as [15,31], ellipses and ellipsoids are matched by category, so a single ellipse may correspond to the projections of multiple ellipsoids, leading to pose estimation errors. By applying our object-level instance topology to ellipse-ellipsoid localization, we achieve more accurate matching between ellipses and ellipsoids, as sketched below.
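The following sketch contrasts category-level association with instance-level association. The descriptor-distance matcher is a simplified stand-in for our topology-alignment procedure, not its full implementation.

```python
import numpy as np

def match_by_category(ellipses, ellipsoids):
    """Category-level matching: ambiguous when a category has several instances."""
    return [(e, [s for s in ellipsoids if s["category"] == e["category"]])
            for e in ellipses]

def match_by_instance(ellipses, ellipsoids):
    """Instance-level matching: pick the candidate whose topology descriptor
    is closest to the detected ellipse's descriptor (simplified stand-in)."""
    pairs = []
    for e in ellipses:
        candidates = [s for s in ellipsoids if s["category"] == e["category"]]
        if candidates:
            best = min(candidates, key=lambda s: np.linalg.norm(
                np.asarray(e["descriptor"]) - np.asarray(s["descriptor"])))
            pairs.append((e, best))
    return pairs

# Two cups in the map: category matching is ambiguous, instance matching is not.
ellipsoids = [{"category": "cup", "descriptor": [0.9, 0.1]},
              {"category": "cup", "descriptor": [0.2, 0.8]}]
ellipses = [{"category": "cup", "descriptor": [0.15, 0.75]}]
print(match_by_instance(ellipses, ellipsoids))  # picks the second cup
```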
We use a scene from the 7-Scenes dataset [34] for our experiments; it contains 6 sequences, each with 1000 images. Since our method requires two or more detected ellipse-ellipsoid pairs for pose estimation, we use this criterion for image filtering. The primary aim of this experiment is to evaluate the impact of our instance topology on pose estimation; we therefore conducted localization tests using the three ellipse estimation models under two different ellipse-ellipsoid matching modes. $ATE$ and $ARE$ are used as metrics to assess localization accuracy.
As shown in Table 5, "×" indicates that instance topology is not used, while "√" indicates the use of our instance topology. In the "×" mode, our method effectively reduces both translation and rotation errors: in the translation test it achieved the lowest $ATE$ in four sequences and ranked second in seq-04 and seq-06, while in the rotation test it produced second-best results in seq-01 and seq-02 and achieved the lowest $ARE$ in the remaining sequences. In the "√" mode, our method demonstrated similar performance.
Compared with the "×" mode, using our instance topology greatly reduces both the translation and rotation errors of the camera. In the seq-04 sequence, instance topology reduces the translation error of Mathematical from 0.208 m to 0.036 m, that of Ellipse-Net from 0.164 m to 0.037 m, and that of our method from 0.165 m to 0.032 m, with correspondingly large reductions in rotation error. This shows that our instance topology greatly improves the probability of correct object matching, thereby improving localization accuracy.
5.3. Pose Accuracy in Different Lighting Environments
When performing localization in indoor environments, accuracy is affected by various factors due to the time gap between map creation and subsequent localization. Among these factors, lighting variation has the most significant impact on visual localization. In this section, we compare a feature-based method, OVS (OpenVSLAM) [35], with two ellipse-ellipsoid-based methods (Ellipse-Net and ours) to evaluate their localization performance under different lighting conditions. We use four sequences as test data and adjust the HSV values of the images to simulate low-light conditions, as shown in Figure 8. Under normal lighting, we use the RGB-D model of the full OVS system for map creation, but only its relocalization module during the localization phase. Ellipse-Net follows the approach provided by [31], while our method is based on DEllipse-Net and object-level instance topology-alignment localization. $ATE$ and $ARE$ are used as metrics to evaluate localization accuracy.
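A sketch of the darkening step is given below, assuming an OpenCV HSV conversion with a scale factor applied to the V channel; the exact adjustment used in our experiments is not specified here, so the 0.4 factor is illustrative.

```python
import cv2
import numpy as np

def simulate_low_light(image_bgr, v_scale=0.4):
    """Darken an image by scaling the V (value) channel in HSV space."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] = np.clip(hsv[..., 2] * v_scale, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```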
As shown in Table 6, the localization accuracy of all three methods decreases in dark environments compared to normal lighting. In one of the sequences, under normal lighting, OVS has a translation error of 0.102 m, Ellipse-Net 0.024 m, and our method 0.012 m; in dark conditions these errors rise to 0.119 m, 0.065 m, and 0.022 m, respectively, with the rotation errors showing the same ordering. Overall, our method consistently outperforms OVS and Ellipse-Net across all lighting conditions, with similar advantages in the other three environments.
Although both Ellipse-Net and our approach are based on the ellipse-ellipsoid model, our method is more robust to lighting variations. Across all four sequences, Ellipse-Net shows significant changes in localization accuracy under different lighting conditions. For example, in one sequence its translation error increases by 0.039 m in dark conditions, and OVS's by 0.023 m, whereas our method sees only a minimal increase of 0.009 m, with comparably small changes in rotation error, demonstrating its robustness against lighting variations.
5.4. Pose Accuracy When Objects Are Occluded
This experiment tests the localization ability of the classical point-based method and the ellipse-ellipsoid methods under object occlusion. In four sequences, we simulated occlusion by applying a black mask over the corresponding objects in the RGB images, sequentially occluding 1, 2, and 3 objects in the query images, as shown in Figure 9. To ensure that the ellipse-ellipsoid model can still predict poses, we selected query images containing five or more objects from the sequences. We used images without occlusions for map construction and then performed localization testing with the occluded images. The localization performance under occlusion was evaluated using $ATE$ and $ARE$.
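A sketch of this masking step follows, assuming objects are occluded by black ellipses inscribed in their bounding boxes (consistent with the occlusion shapes described at the end of this section); the box coordinates are illustrative.

```python
import cv2
import numpy as np

def occlude_objects(image_bgr, boxes, n_occluded):
    """Black out the first n_occluded objects as black ellipses inscribed
    in their bounding boxes (x_min, y_min, x_max, y_max)."""
    out = image_bgr.copy()
    for x0, y0, x1, y1 in boxes[:n_occluded]:
        center = ((x0 + x1) // 2, (y0 + y1) // 2)
        axes = ((x1 - x0) // 2, (y1 - y0) // 2)
        cv2.ellipse(out, center, axes, 0, 0, 360, (0, 0, 0), -1)
    return out

img = np.zeros((480, 640, 3), np.uint8)
occluded = occlude_objects(img, [(100, 80, 220, 200)], n_occluded=1)
```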
As shown in Table 7, in one of the sequences, when the number of occluded objects increased from one to two, OVS's translation error rose slightly from 0.238 m to 0.241 m; when the number increased from two to three, it rose to 0.260 m, with the rotation error growing accordingly. Ellipse-Net showed a similar trend, with translation errors increasing by 0.002 m and 0.016 m across the two stages. In contrast, our method showed no increase in error during the first stage and only a small rise of 0.017 m in translation during the second stage, with a correspondingly small increase in rotation error. These results indicate that localization accuracy decreases with more occlusions for OVS, Ellipse-Net, and our method alike; the same trend was observed in the other three sequences.
The ellipse-ellipsoid model relies on object detection for ellipse prediction, matching, and global localization. When landmarks are occluded, both Ellipse-Net and our method fail to detect them, which can lead to localization failure. In contrast, OVS uses ORB feature points matched against keyframes via BoW [16] and applies PnP for pose estimation, making it less sensitive to occlusion once matching succeeds. During testing, we used black ellipses to occlude objects, preventing feature detection in the darkened areas and reducing incorrect matches. Although our method is more affected by occlusion, it still outperforms OVS, demonstrating superior localization accuracy under occlusion conditions.
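For reference, a minimal sketch of the point-based relocalization step described above, using OpenCV's RANSAC PnP solver; the 3D map points, 2D keypoints, and intrinsics are illustrative inputs.

```python
import cv2
import numpy as np

def relocalize_pnp(points_3d, points_2d, K):
    """Estimate the camera pose from matched 3D map points and 2D keypoints."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        K, distCoeffs=None)
    if not ok:
        return None  # relocalization failed, e.g., too few valid matches
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    return R, tvec
```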