A Survey of 6D Object Detection Based on 3D Models for Industrial Applications
Abstract
1. Introduction
- RGBD cameras (i.e., color and depth) are available for providing input to the algorithm;
- Only 3D object models (CAD or reconstructed) are required to set up the algorithm (i.e., no recordings by real cameras).
- A listing of requirements that typical industrial use cases have for object detectors.
- A comprehensive collection of empirical data from experiments with 6D object detectors that meet the identified criteria.
- Empirical data on the performance of the object detector FFB6D [2], which has not yet been evaluated with purely model-based training.
2. Related Work
2.1. Reviews and Benchmarks
2.2. 6D Object Detectors and Pose Estimators
2.3. Model-Based Training and Image Synthesis
3. Background
3.1. Problem Definition: 6D Object Detection
- The parametrization of the algorithm is different. For localization, we can accept the N best hypotheses that the object detector produced, while for detection, we need to set a score threshold for P as an acceptance criterion for the hypotheses.
- The required metrics for evaluating the performance differ. For localization, a score that only regards the rate of positive detections is sufficient (e.g., recall). As the detector outputs a maximum of N results, every false positive implies a false negative, i.e., the precision is always at least as good as the recall here (see the small example below). For detection tasks, this is not true, so we need metrics that take both true and false positives into account (e.g., recall and precision).
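To make the difference concrete, the following minimal Python sketch works through hypothetical counts for a localization task with a known number of instances; all numbers are purely illustrative and not taken from the surveyed experiments.

```python
# Hypothetical counts for a localization task with N = 5 annotated instances.
# The detector returns at most N hypotheses, so every false positive occupies
# a slot that a correct hypothesis could otherwise have filled.
n_gt = 5                    # number of instances, known a priori
tp, fp = 3, 2               # 3 correct hypotheses, 2 wrong ones (5 outputs in total)
fn = n_gt - tp              # = 2: here, every false positive implies a false negative

precision = tp / (tp + fp)  # 0.6
recall = tp / (tp + fn)     # 0.6 -> precision is never lower than recall in this setting
```

For detection with an unknown number of instances, the detector may output arbitrarily many hypotheses, so the number of false positives is no longer bounded by the number of false negatives, and precision must be evaluated explicitly.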
3.2. Industrial Applications
- CAD models are available, which means that generating reference data is cheap.
- High-end and RGBD cameras are available, as higher costs and a larger form factor compared to RGB cameras are negligible in large-scale production environments.
- Scene setups are controlled. Production mostly happens indoors, and the placement of lights and cameras can be controlled easily. Indoor setups also allow for a broader range of possible RGBD cameras, as active cameras often do not work well in sunlight.
- For many automation tasks, the minimum required frame rate is dictated by the production’s takt time, which is usually lower than the frame rate required for interactive applications; for example, a takt time of 30 s only requires a new detection every 30 s, whereas interactive applications typically need several frames per second.
- Lots of industrially manufactured objects are textureless. Specifically, workpieces that are at the beginning of production chains are often made of a single material with flat and untextured surfaces.
- A lot of man-made objects, especially those with simple geometry, are rotationally symmetric, or at least appear so under certain perspectives. This makes their poses ambiguous, which can be a difficult problem for algorithms relying on optimization.
- A common task in the area of robotic manipulation is bin picking. Here, individual objects can be highly occluded.
- Additionally, especially in bin-picking tasks, we have an unknown number of instances of the same object class. As described in Section 3.1, we refer to this task as object detection in contrast to object localization, where the number of objects to detect is known a priori. When attempting to detect an unknown number of instances, false positives can be a major problem.
- Object colors are often unspecified in the reference data. CAD models generally store an object’s geometric and kinematic properties, but not its surface properties, which define color and reflective behavior.
- There are objects with difficult surface properties that hinder the recognition of geometric properties based on optical recordings, i.e., objects made from materials with high specular reflections, such as metals, or objects made from translucent or transparent materials, such as glass.
3.3. Model-Based Training
- 3D models:
- Here, features in the latent space are derived directly from the information contained in a 3D model, i.e., from its vertices and normals. For example, PPFs only require a 3D model of an object at training time.
- Augmented real images:
- In this strategy, real images are augmented to generate a higher variety of training images. This can be done by simulating varied recording conditions, e.g., changing an image’s size or aspect ratio, its brightness or sharpness, or adding noise. A more involved mode of image augmentation is the “Render and Paste” strategy in which an object is cropped from its original scene and pasted onto a different background to simulate a varying background, or covered by another cropping to simulate occlusion.
- Renderings:
- Rendering is the process of simulating the full image-recording pipeline and thus generating 2D images from 3D models. There is great variety in how this simulation is implemented and how realistic the resulting output is. The simplest and quickest method for rendering images is a rasterization-based renderer, such as OpenGL. This type of renderer usually produces plausible, but not necessarily physically accurate, renderings in order to achieve real-time performance. A more ambitious way of generating realistic images is physically based rendering (PBR), which is not a strictly defined term, but usually entails a more realistic simulation of the behavior of light and surfaces than the commonly used Blinn–Phong model [46], e.g., by employing ray tracing. A minimal rasterization example is sketched below.
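As a concrete illustration of the simplest variant, the following sketch renders an RGB image and a depth map of a 3D model with a rasterization-based renderer (the pyrender and trimesh libraries). The file name, camera intrinsics and camera distance are assumptions for illustration, and the result is plausible rather than physically accurate, unlike the PBR images used for BOP Challenge 2020.

```python
# Minimal rasterization-based rendering sketch (OpenGL-style, not PBR).
# File name, intrinsics and camera distance are illustrative assumptions;
# the model is assumed to be scaled in meters.
import numpy as np
import trimesh
import pyrender

mesh = pyrender.Mesh.from_trimesh(trimesh.load("obj_000001.ply"))

scene = pyrender.Scene(bg_color=[0.0, 0.0, 0.0], ambient_light=[0.3, 0.3, 0.3])
scene.add(mesh, pose=np.eye(4))                      # object at the world origin

camera_pose = np.eye(4)
camera_pose[2, 3] = 0.5                              # camera 0.5 m in front of the object
camera = pyrender.IntrinsicsCamera(fx=572.4, fy=573.6, cx=325.3, cy=242.0)
scene.add(camera, pose=camera_pose)
scene.add(pyrender.DirectionalLight(color=np.ones(3), intensity=3.0), pose=camera_pose)

renderer = pyrender.OffscreenRenderer(viewport_width=640, viewport_height=480)
color, depth = renderer.render(scene)                # RGB image and depth map
renderer.delete()
```

A PBR pipeline such as BlenderProc [4] replaces the rasterizer with a path tracer and adds physically motivated materials and lighting, at the cost of considerably longer rendering times per image.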
3.4. Modalities
- Besides CosyPose [5], we did not find any multi-view approaches that fit the scope given by our use case.
- Point-cloud-based object detectors are very popular in the area of autonomous driving. Consequently, they are commonly evaluated on datasets and metrics tailored to this use case (e.g., the KITTI dataset [48]), and the evaluation scores found in the literature cannot be compared to those of most RGBD-based object detectors.
- Multi-view images as well as point clouds usually cover a larger portion of a scene than single-view images. Thus, they could mitigate problems due to occlusion, pose ambiguities and specular reflections.
- Point clouds are primarily geometrical representations of scenes, and thus, object detection based on geometrical 3D models potentially requires less preprocessing of training data, as input and training data are already in the same domain. In particular, the involved generation of synthetic images can be skipped.
4. Materials and Methods
4.1. Methods
- Modality
- describes which type of input a method accepts at training time and runtime. RGB-based methods tend to have a larger error when estimating the distance of objects to the camera. Depth-based methods are based on geometry only, so they cannot use color cues or textures visible on objects. RGBD-based methods can use the best of both worlds. We only regarded the modality that the core method uses, i.e., without optional refinement steps. Of course, every RGB-based detector can be extended to RGBD by, for example, post-processing the results with ICP [47] (see the sketch after this list), and every depth-based detector can be extended to RGBD by employing some kind of 2D-edge-based pose refinement.
- Features
- states whether a method uses learned or hand-crafted features for object detection, i.e., whether the algorithm is data- or model-driven. As the name suggests, data-driven methods tend to require large amounts of training data: in our case, synthetic images. The generation of these data and the subsequent training can be computationally very demanding, in some cases needing several days for a full setup. Hand-crafted features usually do not require as much data, and the conversion of training data to features is straightforward, as no weight optimization takes place. However, the latter tend to have more parameters that need to be fine-tuned for optimal results.
- Scope
- describes whether a feature in the object-detection step represents the full target object (e.g., a “template”) or a single point of interest (e.g., a single pixel or an image patch). Global features, representing the whole object, are usually more robust when detecting multiple instances of a single object class that are close to or even occluding each other. Local features tend to be more robust against general occlusion or difficult lighting conditions.
- Output
- gives the type of space that the output pose is in. Regression-based methods predict continuous results, i.e., the poses they estimate are theoretically infinitely accurate. Classification-based methods predict discrete results, i.e., their output is one of a previously learned finite number of classes. Whether a discrete estimation is good enough depends on the use-case requirements and whether there are enough computational resources to perform a refinement step.
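As mentioned in the Modality paragraph, RGB-based detectors are often extended to RGBD by refining their pose hypotheses with ICP against the depth image. The following sketch, based on the Open3D library, shows one way such a refinement step could look; file names, the correspondence distance and the initial pose are illustrative assumptions and not taken from any of the surveyed methods.

```python
# Minimal ICP refinement sketch using Open3D; all file names and parameters
# are illustrative assumptions.
import numpy as np
import open3d as o3d

model = o3d.io.read_point_cloud("model_points.ply")        # object model sampled as a point cloud
scene = o3d.io.read_point_cloud("scene_from_depth.ply")    # point cloud from the depth image
scene.estimate_normals()                                   # point-to-plane ICP needs target normals

initial_pose = np.eye(4)   # coarse 6D hypothesis, e.g., from an RGB-based detector

result = o3d.pipelines.registration.registration_icp(
    model, scene,
    max_correspondence_distance=0.01,   # 1 cm search radius (assumed metric scale)
    init=initial_pose,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPlane(),
)
refined_pose = result.transformation    # refined 4x4 model-to-scene transform
```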
Remarks for Individual Methods
- We trained the learning-based method FFB6D [2] on synthetic images ourselves. For training, we used the synthetic images generated with BlenderProc [4] for BOP Challenge 2020 [3], using scene 2 as the validation set. We deactivated all data augmentation and trained on the renderings as they are. The training ran for iterations at a batch size of 3.
- PoseRBPF [33] is a tracking and not an object-detection method. However, the algorithm can actually be used for object detection (referred to as initialization in the respective paper), and the pose estimation accuracy is improved over consecutive frames. For this reason, we regarded it in this work, despite not fully fitting the required profile.
4.2. Datasets
- LineMOD (LM) [14]:
- First presented by Hinterstoisser et al. to evaluate their algorithm of the same name, the LM dataset provides 15 scenes. In each scene, 1 of 15 different objects from an office environment is annotated and placed on a desktop with severe clutter.
- LineMOD occluded (LMO) [19]:
- This dataset includes scene number 2 of the original LineMOD datasets, but with ground truth annotations for multiple objects from different classes in a single frame. In addition to the background clutter, this poses the challenge of a lot of occlusion between objects.
- TLESS [51]:
- The T-LESS dataset comprises 20 scenes with annotations for 30 different object classes. The depicted objects are all typical industrially manufactured objects, made from textureless white plastic, many of which are rotationally symmetric. The objects are all placed on a black background, so there is little background clutter. All scenes show different combinations of objects with different placements, with cases of multiple instances of one object in a scene and objects occluding each other.
4.3. Metrics
- The distances of detected instances and ground-truth annotations are calculated with a geometric metric. Based on a metric-specific threshold, every detected instance and ground truth annotation is classified as one of true positive (TP), false positive (FP), and false negative (FN).
- The numbers of TPs, FPs and FNs are aggregated based on a metric for the evaluation of binary classifiers, which then gives the final evaluation score (a minimal sketch of this two-step procedure is given below).
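The following sketch illustrates the two-step procedure; the greedy matching strategy and the example error values are illustrative assumptions and simplify the matching rules used by the individual benchmarks.

```python
import numpy as np

def evaluate(errors, threshold):
    """errors[i][j]: geometric error (e.g., ADD) between detection i and ground truth j."""
    errors = np.asarray(errors, dtype=float)
    n_det, n_gt = errors.shape
    matched_det, matched_gt = set(), set()
    # Step 1: greedily match the lowest-error pairs that lie below the threshold.
    for i, j in sorted(np.ndindex(n_det, n_gt), key=lambda ij: errors[ij]):
        if errors[i, j] <= threshold and i not in matched_det and j not in matched_gt:
            matched_det.add(i)
            matched_gt.add(j)
    # Step 2: aggregate TP/FP/FN into scores for binary classifiers.
    tp = len(matched_det)
    fp = n_det - tp
    fn = n_gt - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two ground-truth instances, three detections, error threshold of 0.1 (illustrative units).
print(evaluate([[0.05, 0.90], [0.80, 0.02], [0.70, 0.60]], threshold=0.1))
```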
- Average distance (symmetric) (ADD(S)) [14]:
- This metric measures the average distance of 3D points of an object’s model transformed with two different poses. ADD-S (also ADI) is a variant which takes into account that rotationally symmetric objects can have multiple valid pose estimates. ADD(S) denotes that the symmetric variant ADD-S is used for objects with rotational symmetries and ADD for non-symmetric objects. The most commonly used threshold for classifying an estimate as correct is 0.1d, where d is the target object’s diameter; some publications use 0.15d, which is marked in the respective locations. A minimal sketch of both variants is given after this list.
- Visual surface discrepancy (VSD) [52]:
- As the name suggests, this metric measures the difference between the visible surfaces of an object transformed with two different poses relative to the camera, i.e., if an object looks exactly the same under both poses, the VSD is 0. In particular, this handles rotational symmetries more intuitively than ADD(S). The metric has two threshold parameters that determine whether a pose is considered correct: τ is the maximum allowed difference in the camera distance of overlapping pixels; θ is the minimally allowed percentage of object pixels that need to be considered correct according to the τ condition for the whole hypothesis to be considered correct. A widely used combination of thresholds is τ = 20 mm and θ = 70%. BOP Challenge 2020 [3] used a different approach by varying both thresholds over a range of values (with τ given relative to the object diameter), determining the score for every τ–θ pair and taking the average as the total score. We refer to this configuration as VSDBOP.
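For reference, a minimal NumPy sketch of the ADD and ADD-S distances referred to above is given here; `pts` denotes the sampled model points, and the variable names and the k·d decision rule are illustrative, following the conventions described above.

```python
# Minimal sketch of ADD and ADD-S between an estimated pose (R_est, t_est)
# and a ground-truth pose (R_gt, t_gt); pts is an (N, 3) array of model points.
import numpy as np
from scipy.spatial import cKDTree

def add(pts, R_est, t_est, R_gt, t_gt):
    # Mean distance between corresponding transformed model points.
    p_est = pts @ R_est.T + t_est
    p_gt = pts @ R_gt.T + t_gt
    return np.mean(np.linalg.norm(p_est - p_gt, axis=1))

def add_s(pts, R_est, t_est, R_gt, t_gt):
    # Symmetric variant: mean distance to the closest transformed model point,
    # so that rotationally symmetric objects are not penalized.
    p_est = pts @ R_est.T + t_est
    p_gt = pts @ R_gt.T + t_gt
    nearest_dist, _ = cKDTree(p_est).query(p_gt, k=1)
    return np.mean(nearest_dist)

def pose_is_correct(error, diameter, k=0.1):
    # Common acceptance criterion: error below k * object diameter (k = 0.1 or 0.15).
    return error < k * diameter
```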
5. Evaluation
5.1. Discussion
5.1.1. Method Scores
- Object localization:
- For LM-ADD(S), LM-VSD, LMO-VSDBOP, TLESS-VSD and TLESS-VSDBOP, the following respective methods perform best: LCHFs [17], PPFs by Vidal et al. [11], SurfEmb [36], PoseRBPF [33] and again SurfEmb [36]. LMO-VSDBOP allows a direct comparison of PPFs and SurfEmb, from which we can assume that the latter is the overall better method. We cannot compare the other top runners because they were not evaluated on the same metric–dataset combination, so the best overall object localizer remains inconclusive.
- Object detection:
- For LMO-ADD(S)-F1, LCHFs [17] perform best. As they also perform very well for object localization in LM-ADD(S), we conclude that this method can outperform many other object detectors, albeit with some reservations.
- Occlusion:
- Workpiece-detection (textureless, rotationally symmetric):
- A lot of newer methods in the literature are trained on real data or a combination of real and synthetic data, and for many generally promising methods, there are currently little or no empirical data available on their performance with purely model-based training; where data are available, they are not comparable.
- LineMOD as well as PPFs have drawbacks compared to learning-based methods that are not reflected in the scores, such as the need for manual parameter optimization (both), fragility against occlusion (LineMOD) and slow runtimes (PPF).
- Both LineMOD and PPFs show mediocre performance for ADD(S)-F1 on LM, while being good at generating high recalls. We assume this is because both methods are not discriminative (i.e., they do not explicitly “know” what to exclude) and thus tend to have lower precision than learning-based methods.
5.1.2. Runtime
5.1.3. Availability and Comparability of Empirical Data
6. Conclusions and Future Work
- Train established and promising object detectors with model-based data and evaluate them.
- Evaluate established and promising object detectors with metrics that take precision into consideration.
- Take methods based on point clouds and multi-view images into consideration.
- Allow researchers to produce meaningful and comparable data by providing tools and frameworks that offer uniform formats and interfaces for evaluating object detectors on a multitude of different datasets and metrics. Additionally, provide an online database that simplifies collecting, categorizing and analyzing evaluation results. We consider BOP to be a good start in this direction, but in order to be a general-purpose framework for evaluating object detection, it should be extended with more metrics and simpler interfaces.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- He, Z.; Feng, W.; Zhao, X.; Lv, Y. 6D pose estimation of objects: Recent technologies and challenges. Appl. Sci. 2020, 11, 228. [Google Scholar] [CrossRef]
- He, Y.; Huang, H.; Fan, H.; Chen, Q.; Sun, J. FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3002–3012. [Google Scholar] [CrossRef]
- Hodaň, T.; Sundermeyer, M.; Drost, B.; Labbé, Y.; Brachmann, E.; Michel, F.; Rother, C.; Matas, J. BOP Challenge 2020 on 6D Object Localization. In Computer Vision—ECCV 2020 Workshops; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020; Volume 12536, pp. 577–594. [Google Scholar] [CrossRef]
- Denninger, M.; Sundermeyer, M.; Winkelbauer, D.; Zidan, Y.; Olefir, D.; Elbadrawy, M.; Lodhi, A.; Katam, H. BlenderProc. arXiv 2019, arXiv:1911.01911. Available online: https://arxiv.org/abs/1911.01911 (accessed on 28 December 2021).
- Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. CosyPose: Consistent Multi-View Multi-Object 6D Pose Estimation. In Computer Vision—ECCV 2020; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020; Volume 12362, pp. 574–591. [Google Scholar] [CrossRef]
- Sahin, C.; Garcia-Hernando, G.; Sock, J.; Kim, T.K. A review on object pose recovery: From 3D bounding box detectors to full 6D pose estimators. Image Vis. Comput. 2020, 96, 103898:1–103898:25. [Google Scholar] [CrossRef] [Green Version]
- Cong, Y.; Chen, R.; Ma, B.; Liu, H.; Hou, D.; Yang, C. A Comprehensive Study of 3-D Vision-Based Robot Manipulation. IEEE Trans. Cybern. 2021. [Google Scholar] [CrossRef] [PubMed]
- Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. 2021, 54, 1677–1734. [Google Scholar] [CrossRef]
- Drost, B.; Ulrich, M.; Navab, N.; Ilic, S. Model globally, match locally: Efficient and robust 3D object recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 998–1005. [Google Scholar] [CrossRef] [Green Version]
- Hinterstoisser, S.; Lepetit, V.; Rajkumar, N.; Konolige, K. Going further with point pair features. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; Volume 9907, pp. 834–848. [Google Scholar] [CrossRef] [Green Version]
- Vidal, J.; Lin, C.Y.; Lladó, X.; Martí, R. A method for 6D pose estimation of free-form rigid objects using point pair features on range data. Sensors 2018, 18, 2678. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Hodaň, T.; Michel, F.; Brachmann, E.; Kehl, W.; Buch, A.G.; Kraft, D.; Drost, B.; Vidal, J.; Ihrke, S.; Zabulis, X.; et al. BOP: Benchmark for 6D object pose estimation. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; Volume 11214, pp. 19–35. [Google Scholar] [CrossRef] [Green Version]
- Hinterstoisser, S.; Holzer, S.; Cagniart, C.; Ilic, S.; Konolige, K.; Navab, N.; Lepetit, V. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 858–865. [Google Scholar] [CrossRef]
- Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; Navab, N. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Computer Vision—ACCV 2012; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7724, pp. 548–562. [Google Scholar] [CrossRef] [Green Version]
- Rios-Cabrera, R.; Tuytelaars, T. Discriminatively trained templates for 3D object detection: A real time scalable approach. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2013; pp. 2048–2055. [Google Scholar] [CrossRef] [Green Version]
- Tejani, A.; Tang, D.; Kouskouridas, R.; Kim, T.K. Latent-class Hough Forests for 3D Object Detection and Pose Estimation. In Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8694, pp. 462–477. [Google Scholar] [CrossRef]
- Tejani, A.; Kouskouridas, R.; Doumanoglou, A.; Tang, D.; Kim, T.K. Latent-Class Hough Forests for 6 DoF Object Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 119–132. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Hodan, T.; Zabulis, X.; Lourakis, M.; Obdrzalek, S.; Matas, J. Detection and fine 3D pose estimation of texture-less objects in RGB-D images. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2015; pp. 4421–4428. [Google Scholar] [CrossRef]
- Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; Rother, C. Learning 6D object pose estimation using 3D object coordinates. In Computer Vision—ECCV 2014; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8690, pp. 536–551. [Google Scholar] [CrossRef] [Green Version]
- Brachmann, E.; Michel, F.; Krull, A.; Yang, M.Y.; Gumhold, S.; Rother, C. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3364–3372. [Google Scholar] [CrossRef]
- Kehl, W.; Milletari, F.; Tombari, F.; Ilic, S.; Navab, N. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; Volume 9907, pp. 205–220. [Google Scholar] [CrossRef] [Green Version]
- Buch, A.G.; Kiforenko, L.; Kraft, D. Rotational Subgroup Voting and Pose Clustering for Robust 3D Object Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017; pp. 4137–4145. [Google Scholar] [CrossRef] [Green Version]
- Rambach, J.; Deng, C.; Pagani, A.; Stricker, D. Learning 6DoF Object Poses from Synthetic Single Channel Images. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality, ISMAR-Adjunct 2018, Munich, Germany, 16–20 October 2018; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2018; pp. 164–169. [Google Scholar] [CrossRef]
- Tekin, B.; Sinha, S.N.; Fua, P. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 292–301. [Google Scholar] [CrossRef] [Green Version]
- Sundermeyer, M.; Marton, Z.C.; Durner, M.; Brucker, M.; Triebel, R. Implicit 3D orientation learning for 6D object detection from RGB images. In Computer Vision—ECCV 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11210, pp. 712–729. [Google Scholar] [CrossRef] [Green Version]
- Park, K.; Patten, T.; Vincze, M. Pix2pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 7667–7676. [Google Scholar] [CrossRef] [Green Version]
- Zakharov, S.; Shugurov, I.; Ilic, S. DPOD: 6D pose object detector and refiner. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 1941–1950. [Google Scholar] [CrossRef] [Green Version]
- Thalhammer, S.; Patten, T.; Vincze, M. SyDPose: Object Detection and Pose Estimation in Cluttered Real-World Depth Images Trained using only Synthetic Data. In Proceedings of the 2019 International Conference on 3D Vision, 3DV 2019, Quebec City, QC, Canada, 16–19 September 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 106–115. [Google Scholar] [CrossRef]
- Li, Z.; Wang, G.; Ji, X. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 7677–7686. [Google Scholar] [CrossRef]
- Hagelskjar, F.; Buch, A.G. Pointvotenet: Accurate Object Detection and 6 DOF Pose Estimation in Point Clouds. In Proceedings of the International Conference on Image Processing, ICIP, Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 2641–2645. [Google Scholar] [CrossRef]
- Hodaň, T.; Baráth, D.; Matas, J. EPOS: Estimating 6D pose of objects with symmetries. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 11700–11709. [Google Scholar] [CrossRef]
- Su, Y.; Rambach, J.; Pagani, A.; Stricker, D. Synpo-net—Accurate and fast CNN-based 6DoF object pose estimation using synthetic training. Sensors 2021, 21, 300. [Google Scholar] [CrossRef] [PubMed]
- Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; Fox, D. PoseRBPF: A rao-blackwellized particle filter for 6-D object pose tracking. IEEE Trans. Robot. 2021, 37, 1328–1342. [Google Scholar] [CrossRef]
- He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; Sun, J. PVN3D: A deep point-wise 3D keypoints voting network for 6DoF pose estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11629–11638. [Google Scholar] [CrossRef]
- Wang, C.; Xu, D.; Zhu, Y.; Martin-Martin, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3338–3347. [Google Scholar] [CrossRef] [Green Version]
- Haugaard, R.L.; Buch, A.G. SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings. arXiv 2021, arXiv:2111.13489. [Google Scholar]
- Rudorfer, M.; Neumann, L.; Krüger, J. Towards Learning 3d Object Detection and 6d Pose Estimation from Synthetic Data. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA, Zaragoza, Spain, 10–13 September 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 1540–1543. [Google Scholar] [CrossRef]
- Hodan, T.; Vineet, V.; Gal, R.; Shalev, E.; Hanzelka, J.; Connell, T.; Urbina, P.; Sinha, S.N.; Guenter, B. Photorealistic Image Synthesis for Object Instance Detection. In Proceedings of the International Conference on Image Processing, ICIP, Taipei, Taiwan, 22–25 September 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 66–70. [Google Scholar] [CrossRef] [Green Version]
- Hinterstoisser, S.; Pauly, O.; Heibel, H.; Martina, M.; Bokeloh, M. An annotation saved is an annotation earned: Using fully synthetic training for object detection. In Proceedings of the 2019 International Conference on Computer Vision Workshop, ICCVW 2019, Seoul, Korea, 27–28 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 2787–2796. [Google Scholar] [CrossRef]
- Rojtberg, P.; Pöllabauer, T.; Kuijper, A. Style-transfer GANs for bridging the domain gap in synthetic pose estimator training. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Virtual Reality, AIVR 2020, Utrecht, The Netherlands, 14–18 December 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020; pp. 188–195. [Google Scholar] [CrossRef]
- Eversberg, L.; Lambrecht, J. Generating images with physics-based rendering for an industrial object detection task: Realism versus domain randomization. Sensors 2021, 21, 7901. [Google Scholar] [CrossRef] [PubMed]
- König, R.; Drost, B. A Hybrid Approach for 6DoF Pose Estimation. In Computer Vision—ECCV 2020 Workshops; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020; Volume 12536, pp. 700–706. [Google Scholar] [CrossRef]
- Sundermeyer, M.; Durner, M.; Puang, E.Y.; Marton, Z.C.; Vaskevicius, N.; Arras, K.O.; Triebel, R. Multi-path learning for object pose estimation across domains. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 13913–13922. [Google Scholar] [CrossRef]
- Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1530–1538. [Google Scholar] [CrossRef] [Green Version]
- Shugurov, I.; Zakharov, S.; Ilic, S. DPODv2: Dense Correspondence-Based 6 DoF Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021. [Google Scholar] [CrossRef] [PubMed]
- Blinn, J.F. Models of light reflection for computer synthesized pictures. ACM Siggraph Comput. Graph. 1977, 11, 192–198. [Google Scholar] [CrossRef]
- Rusinkiewicz, S.; Levoy, M. Efficient variants of the ICP algorithm. In Proceedings of the International Conference on 3-D Digital Imaging and Modeling, 3DIM, Quebec City, QC, Canada, 28 May–1 June 2001; pp. 145–152. [Google Scholar] [CrossRef] [Green Version]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
- Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNET: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 4556–4565. [Google Scholar] [CrossRef] [Green Version]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017; pp. 77–85. [Google Scholar] [CrossRef] [Green Version]
- Hodaň, T.; Haluza, P.; Obdrzalek, Š.; Matas, J.; Lourakis, M.; Zabulis, X. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, 24–31 March 2017; pp. 880–888. [Google Scholar] [CrossRef] [Green Version]
- Hodaň, T.; Matas, J.; Obdržálek, Š. On evaluation of 6D object pose estimation. In Computer Vision—ECCV 2016 Workshops; Springer: Cham, Switzerland, 2016; Volume 9915, pp. 609–619. [Google Scholar] [CrossRef]
- Hodan, T. BOP: Benchmark for 6D Object Pose Estimation. Available online: https://bop.felk.cvut.cz/leaderboards/ (accessed on 28 December 2021).
Method | By | Year | Modality | Features | Scope | Output |
---|---|---|---|---|---|---|
AAE | Sundermeyer et al. [43] | 2020 | RGB | Learned | Global | Cont. |
CAE | Kehl et al. [21] | 2016 | RGBD | Learned | Local | Cont. |
CDPNv2 | Li et al. [29] | 2019 | RGBD | Learned | Local | Cont. |
CosyPose | Labbé et al. [5] | 2020 | RGB | Learned | Global | Cont. |
DPOD | Zakharov et al. [27] | 2019 | RGB | Learned | Local | Cont. |
DTT-OPT-3D | Rios-Cabrera and Tuytelaars [15] | 2013 | RGBD | Learned | Global | Disc. |
EPOS | Hodaň et al. [3] | 2020 | RGB | Learned | Local | Cont. |
FFB6D | He et al. [2] | 2021 | RGBD | Learned | Local | Cont. |
HybridPose | König and Drost [42] | 2020 | RGBD | Learned | Global | Cont. |
LCHF | Tejani et al. [16] | 2014 | RGBD | Learned | Local | Cont. |
LCHF | Tejani et al. [17] | 2018 | RGBD | Learned | Local | Cont. |
LineMOD | Hinterstoisser et al. [14] | 2013 | RGBD | Hand-crafted | Global | Disc. |
ObjPoseFromSyn | Rambach et al. [23] | 2018 | RGB | Learned | Global | Cont. |
Pix2Pose | Park et al. [26] | 2019 | RGB | Learned | Local | Cont. |
PointVoteNet | Hagelskjar and Buch [30] | 2020 | D | Learned | Both | Cont. |
PoseCluster | Buch et al. [22] | 2017 | D | Learned | Local | Cont. |
PoseRBPF | Deng et al. [33] | 2021 | RGBD | Learned | Global | Cont. |
PPF | Drost et al. [9] | 2010 | D | Hand-crafted | Local | Cont. |
PPF | Hinterstoisser et al. [10] | 2016 | D | Hand-crafted | Local | Cont. |
PPF | Vidal et al. [11] | 2018 | D | Hand-crafted | Local | Cont. |
PVNet | Peng et al. [49] | 2019 | RGB | Learned | Local | Cont. |
RandomForest | Brachmann et al. [19] | 2014 | RGB | Learned | Local | Cont. |
SSD6D | Kehl et al. [44] | 2017 | RGB | Learned | Global | Cont. |
SurfEmb | Haugaard and Buch [36] | 2021 | RGBD | Learned | Global | Cont. |
SyDPose | Thalhammer et al. [28] | 2019 | D | Learned | Global | Cont. |
SynPo-Net | Su et al. [32] | 2021 | RGB | Learned | Global | Cont. |
TemplateBased | Hodan et al. [18] | 2015 | RGBD | Hand-crafted | Global | Cont. |
UncertaintyDriven | Brachmann et al. [20] | 2016 | RGB | Learned | Local | Cont. |
YOLO6D | Tekin et al. [24] | 2018 | RGB | Learned | Local | Cont. |
Dataset | Metric | Task | Challenges | Data-Points |
---|---|---|---|---|
LM | ADD(S)-Recall | Localization | Background-clutter | 19 |
LM | ADD(S)-F1 | Detection | Background-clutter | 7 |
LM | VSD-Recall | Localization | Background-clutter | 11 |
LMO | VSDBOP-Recall | Localization | Background-clutter, occlusion | 17 |
TLESS | VSD-Recall | Localization | Texturelessness, symmetry | 11 |
TLESS | VSDBOP-Recall | Localization | Texturelessness, symmetry | 12 |
Method | Modality | LM ADD(S) | LM ADD(S)-F1 | LM VSD | LMO VSDBOP | TLESS VSD | TLESS VSDBOP
---|---|---|---|---|---|---|---
AAE [43] (ICP) | RGBD | 71.58 [32] | 69.53 [43] | ||||
AAE [43] | RGB | 32.63 [32] | 20.53 [43] | ||||
CAE [21] (ICP) | RGBD | 58.2 [12] | 24.6 [12] | ||||
CDPNv2 [29] (ICP) | RGBD | 46.9 [53] | 36.8 [53] | ||||
CDPNv2 [29] | RGBD | 44.5 [53] | 30.3 [53] | ||||
CosyPose [5] | RGB | 48 [53] | 57.1 [53] | ||||
DPOD [27] | RGB | 10.1 [53] | 4.8 [53] | ||||
DTT-OPT-3D [15] | RGBD | 96.5 [6] | |||||
EPOS [3] | RGB | 38.9 [53] | 38 [53] | ||||
FFB6D [2] | RGBD | 54.08 | 55.5 | 37.7 | |||
HybridPose [42] (ICP) | RGBD | 51.7 [53] | 58 [53] | ||||
LCHF [16] (co-tra) | RGBD | 78.6 [6] | 82 [6] | ||||
LCHF [16] | RGBD | 12.1 [12] | |||||
LCHF [17] (Iterated) | RGBD | 81.7 † [17] | |||||
LCHF [17] | RGBD | 98.2 † [17] | 78.8 † [17] | ||||
LineMOD [14] (ICP) | RGBD | 96.3 [6] | |||||
LineMOD [14] | RGBD | 96.6 [14] | 63 [6] | ||||
ObjPoseFromSyn [23] | RGB | 10.22 [32] | |||||
Pix2Pose [26] | RGB | 11.32 [32] | 15.6 [53] | ||||
PointVoteNet [30] (ICP) | D | 53.5 [53] | 0.3 [53] | ||||
PoseCluster [22] (ICP, PPFH) | D | 56.6 [12] | |||||
PoseCluster [22] (ICP, SI) | D | 33.33 [12] | |||||
PoseRBPF [33] (SDF) | RGBD | 82.58 [33] | |||||
PoseRBPF [33] | RGBD | 80.52 [33] | |||||
PPF [9] (Edge) | RGBD | 42.5 [53] | 67.5 [43] | 46.9 [53] | |||
PPF [9] (ICP) | D | 82 [12] | 43.7 [53] | 37.5 [53] | |||
PPF [9] (ICP, Edge) | RGBD | 79.13 [12] | 39.2 [53] | 37 [53] | |||
PPF [9] | D | 78.9 [6] | 51.7 † [17] | 56.81 [43] | |||
PPF [10] | D | 96.4 [6] | |||||
PPF [11] (ICP) | D | 87.83 [12] | 47.3 [53] | 66.51 [43] | 46.4 [53] | ||
PPF [11] | D | 66.3 [33] | |||||
PVNet [49] (ICP) | RGBD | 50.2 [53] | |||||
PVNet [49] | RGB | 42.8 [53] | |||||
RandomForest [19] | RGB | 67.6 [12] | |||||
SSD6D [44] (ICP) | RGBD | 79 [32] | |||||
SSD6D [44] | RGB | 2.42 [32] | 4.7 [53] | ||||
SurfEmb [36] | RGBD | 61.5 [53] | 79.7 [53] | ||||
SyDPose [28] | D | 30.21 [28] | 59.1 † [28] | ||||
SynPo-Net [32] (ICP) | RGBD | 72.29 [32] | |||||
SynPo-Net [32] | RGB | 44.13 [32] | |||||
TemplateBased [18] (PSO) | RGBD | 94.9 [6] | 87.1 [12] | ||||
TemplateBased [18] | RGBD | 69.83 [12] | 63.18 [43] | ||||
UncertaintyDriven [20] | RGB | 75.33 [12] | 17.84 [43] | ||||
YOLO6D [24] | RGB | 21.43 [32] |
Method | Variant | Modality | Runtime [s] |
---|---|---|---|
SynPo-Net [32] | – | RGB | 0.015
YOLO6D [24] | – | RGB | 0.02
DTT-OPT-3D [15] | – | RGBD | 0.055
PoseRBPF [33] | – | RGBD | 0.071
SSD6D [44] | – | RGB | 0.083
PoseRBPF [33] | SDF | RGBD | 0.156
FFB6D [2] | – | RGBD | 0.196
AAE [43] | – | RGB | 0.2
DPOD [27] | – | RGB | 0.206
HybridPose [42] | ICP | RGBD | 0.337
CosyPose [5] | RGB | RGB | 0.47
CosyPose [5] | – | RGB | 0.493
AAE [43] | ICP | RGBD | 0.8
CDPNv2 [29] | – | RGB | 0.98
PPF [9] | ICP | D | 1.38
LCHF [16] | – | RGBD | 1.4
RandomForest [19] | – | RGBD | 1.4
CDPNv2 [29] | ICP | RGBD | 1.49
CAE [21] | ICP | RGBD | 1.8
EPOS [3] | – | RGB | 1.87
LCHF [17] | – | RGBD | 1.96
TemplateBased [18] | PSO | RGBD | 2.1
PPF [9] | – | D | 2.3
PPF [11] | ICP | D | 3.22
UncertaintyDriven [20] | – | RGBD | 4.4
SurfEmb [36] | – | RGBD | 9.227
TemplateBased [18] | – | RGBD | 12.3
PoseCluster [22] | ICP, PPFH | D | 14.2
PoseCluster [22] | ICP, SI | D | 15.9
PPF [9] | Edge | RGBD | 21.5
PPF [9] | ICP, Edge | D | 21.5
PPF [9] | ICP | RGBD | 87.57