Vision-Guided Object Recognition and 6D Pose Estimation System Based on Deep Neural Network for Unmanned Aerial Vehicles towards Intelligent Logistics
Abstract
1. Introduction
- Object detection: The RGB-D camera on the UAV continuously captures images during flight and transmits them to the server over Wi-Fi. The target of interest is detected and located by the single shot multibox detector (SSD) [5] algorithm.
- Object tracking: Once the target is detected, the UAV continues to track it and approaches it steadily.
- Semantic segmentation: The semantic segmentation network processes the image using combined color and depth information and outputs an accurate segmentation mask. Furthermore, for objects with the same texture but different 3D geometries, we introduce a classification network to distinguish them (see Section 3.6, Object Classification).
- 6D pose estimation: The 6D pose estimation network computes the pose parameters of the target from the segmented image and transmits them to the UAV. A minimal sketch of the full pipeline follows this list.
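Taken together, these four stages form a sequential perception loop: detect once, then track, segment, and estimate pose on each frame. The Python sketch below shows one plausible way to orchestrate such a loop on the server side; the function signatures and the generator-based frame feed are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Iterable, Optional, Tuple
import numpy as np

Frame = Tuple[np.ndarray, np.ndarray]  # (rgb image, depth image) from the UAV

def run_pipeline(
    frames: Iterable[Frame],
    detect: Callable,         # SSD detector: rgb -> bounding box or None
    track: Callable,          # tracker: (rgb, previous box) -> box or None
    segment: Callable,        # fusion network: (rgb, depth, box) -> mask
    estimate_pose: Callable,  # pose network: (rgb, depth, mask) -> (R, t)
):
    """Yield a 6D pose (R, t) for each frame in which the target is visible."""
    box: Optional[np.ndarray] = None
    for rgb, depth in frames:
        # Detect the target once, then hand it over to the tracker.
        box = detect(rgb) if box is None else track(rgb, box)
        if box is None:
            continue                      # target lost; re-detect on the next frame
        mask = segment(rgb, depth, box)   # color + depth -> segmentation mask
        yield estimate_pose(rgb, depth, mask)  # pose is sent back to the UAV
```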
The main contributions of this work are summarized as follows:
- We built a practical vision system that provides reliable visual assistance for express delivery, expanding collaborative work between humans and UAVs by supplying accurate localization and 6D pose parameters of the targets.
- We proposed a semantic segmentation network with a novel feature fusion structure, which provides more comprehensive semantic information for segmentation by connecting features at different layers to fuse color and depth information.
- We proposed an innovative 6D pose estimation network that uses the pyramid pooling module (PPM) and the refined residual block (RRB) in the color feature extraction backbone to enhance features and accurately generate 6D pose parameters for the target (illustrative sketches of both modules follow this list).
- We constructed a dataset, named the SIAT dataset, with ground-truth segmentation masks and 6D pose parameters to evaluate the performance of the system's algorithms.
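As a concrete reference for the two modules named above, here is a minimal PyTorch sketch of a PSPNet-style pyramid pooling module and a DFN-style refined residual block. The channel widths, pooling scales, and layer ordering are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid pooling module (PSPNet-style): pool the feature map to several
    grid sizes, project each with a 1x1 conv, upsample, and concatenate."""

    def __init__(self, in_ch: int = 512, branch_ch: int = 128,
                 scales=(1, 2, 3, 6)):  # grid sizes are illustrative assumptions
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),                  # pool to an s x s grid
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for s in scales
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pooled = [F.interpolate(b(x), size=(h, w), mode="bilinear",
                                align_corners=False) for b in self.branches]
        return torch.cat([x, *pooled], dim=1)   # append multi-scale context


class RRB(nn.Module):
    """Refined residual block (DFN-style): a 1x1 conv unifies channels,
    then a small residual branch refines the features."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        return F.relu(x + self.refine(x))       # residual refinement
```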
2. Related Work
3. System and Methodology
3.1. Hardware Setup
3.2. System Overview
3.3. Object Detection
3.4. Object Tracking
3.5. Semantic Segmentation
- To better utilize the geometric information of the environment, we input depth images into the segmentation network in addition to the RGB images. Our encoder therefore consists of two branches, one extracting color features and the other extracting geometric features. At each downsampling stage, the color and geometric features after the max-pooling layer are concatenated to reinforce the expressiveness of the features, as in the sketch below.
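A minimal PyTorch sketch of such a two-branch encoder with stage-wise concatenation; the number of stages, channel widths, and VGG-style blocks are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convs followed by max pooling (one downsampling stage)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class FusionEncoder(nn.Module):
    """Two-branch encoder: one branch for RGB, one for depth. After each
    max-pooling stage the color and geometric features are concatenated
    and fed to the next color stage."""

    def __init__(self, widths=(64, 128, 256)):  # channel widths are assumptions
        super().__init__()
        self.rgb_stages = nn.ModuleList()
        self.depth_stages = nn.ModuleList()
        in_rgb, in_d = 3, 1                     # RGB has 3 channels, depth has 1
        for w in widths:
            self.rgb_stages.append(conv_block(in_rgb, w))
            self.depth_stages.append(conv_block(in_d, w))
            in_rgb, in_d = 2 * w, w             # RGB stage consumes the fused features

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        x, d = rgb, depth
        skips = []
        for rgb_stage, depth_stage in zip(self.rgb_stages, self.depth_stages):
            x = rgb_stage(x)
            d = depth_stage(d)
            x = torch.cat([x, d], dim=1)        # fuse color + geometry at this stage
            skips.append(x)                     # skip connections for a decoder
        return x, skips
```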
3.6. Object Classification
3.7. Object 6D Pose Estimation
4. Experiments
4.1. Datasets
4.2. Metrics
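The result tables below report two standard 6D pose metrics: the ADD-S distance (average distance to the closest transformed model point, suited to symmetric objects) and the area under the accuracy–threshold curve (AUC), following the convention popularized by PoseCNN and DenseFusion. A minimal NumPy sketch of the common formulation; the 10 cm AUC cutoff is the conventional choice and an assumption here.

```python
import numpy as np

def add_s(points: np.ndarray, R_pred: np.ndarray, t_pred: np.ndarray,
          R_gt: np.ndarray, t_gt: np.ndarray) -> float:
    """ADD-S: for each ground-truth model point, take the distance to the
    closest predicted model point, then average over the model. Suitable
    when point correspondence is ambiguous (symmetric objects)."""
    pred = points @ R_pred.T + t_pred        # model points under predicted pose
    gt = points @ R_gt.T + t_gt              # model points under ground-truth pose
    # pairwise distances: each ground-truth point to its nearest predicted point
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

def auc_of_add_s(errors, max_threshold: float = 0.10) -> float:
    """Area under the accuracy-vs-threshold curve for thresholds in [0, 10 cm]."""
    ts = np.linspace(0.0, max_threshold, 1000)
    acc = [(np.asarray(errors) < t).mean() for t in ts]
    return float(np.mean(acc))
```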
4.3. Experiments on the SIAT Dataset
4.4. Experiments on the YCB-Video Dataset
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Yang, Q.; Ye, H.; Huang, K.; Zha, Y.; Shi, L. Estimation of leaf area index of sugarcane using crop surface model based on UAV image. Trans. Chin. Soc. Agric. Eng. 2017, 33, 104–111.
- Viguier, R.; Lin, C.C.; Aliakbarpour, H.; Bunyak, F.; Pankanti, S.; Seetharaman, G.; Palaniappan, K. Automatic Video Content Summarization Using Geospatial Mosaics of Aerial Imagery. In Proceedings of the 2015 IEEE International Symposium on Multimedia (ISM), Miami, FL, USA, 14–16 December 2015.
- Thomas, J.; Loianno, G.; Daniilidis, K.; Kumar, V. The role of vision in perching and grasping for MAVs. In Proceedings of Micro- and Nanotechnology Sensors, Systems, and Applications VIII, Baltimore, MD, USA, 17–21 April 2016.
- Thomas, J.; Loianno, G.; Daniilidis, K.; Kumar, V. Visual Servoing of Quadrotors for Perching by Hanging from Cylindrical Objects. IEEE Robot. Autom. Lett. 2016, 1, 57–64.
- Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1521–1529.
- Smolyanskiy, N.; Kamenev, A.; Smith, J.; Birchfield, S. Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 4241–4247.
- Kainuma, A.; Madokoro, H.; Sato, K.; Shimoi, N. Occlusion-robust segmentation for multiple objects using a micro air vehicle. In Proceedings of the 2016 16th International Conference on Control, Automation and Systems (ICCAS), Gyeongju, Republic of Korea, 16–19 October 2016.
- Ke, Y.; Sukthankar, R. PCA-SIFT: A more distinctive representation for local image descriptors. Proc. CVPR 2004, 2, 506–513.
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 386–397.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596.
- Bruni, V.; Vitulano, D. An improvement of kernel-based object tracking based on human perception. IEEE Trans. Syst. Man Cybern. Syst. 2014, 44, 1474–1485.
- Xiao, C.; Yilmaz, A. Efficient tracking with distinctive target colors and silhouette. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2728–2733.
- Lychkov, I.I.; Alfimtsev, A.N.; Sakulin, S.A. Tracking of moving objects with regeneration of object feature points. In Proceedings of the 2018 Global Smart Industry Conference (GloSIC), Chelyabinsk, Russia, 13–15 November 2018; pp. 1–6.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016.
- Fan, H.; Ling, H. SANet: Structure-Aware Network for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017.
- Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with dense queries for multiple-object tracking. arXiv 2021, arXiv:2103.15145.
- Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854.
- Ilea, D.E.; Whelan, P.F. Image segmentation based on the integration of colour–texture descriptors—A review. Pattern Recognit. 2011, 44, 2479–2501.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Nejatishahidin, N.; Fayyazsanavi, P.; Kosecka, J. Object pose estimation using mid-level visual representations. arXiv 2022, arXiv:2203.01449.
- Zhu, M.; Derpanis, K.G.; Yang, Y.; Brahmbhatt, S.; Zhang, M.; Phillips, C.; Lecce, M.; Daniilidis, K. Single image 3D object detection and pose estimation for grasping. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–5 June 2014.
- Tekin, B.; Sinha, S.N.; Fua, P. Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 292–301.
- Rad, M.; Lepetit, V. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3828–3836.
- Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166.
- Doumanoglou, A.; Kouskouridas, R.; Malassiotis, S.; Kim, T.K. Recovering 6D Object Pose and Predicting Next-Best-View in the Crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. arXiv 2019, arXiv:1901.04780.
- Kuo, W.; Angelova, A.; Lin, T.Y.; Dai, A. Mask2CAD: 3D shape prediction by learning to segment and retrieve. In Computer Vision—ECCV 2020, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 260–277.
- Kuo, W.; Angelova, A.; Lin, T.Y.; Dai, A. Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 12589–12599.
- Liang, G.; Chen, F.; Liang, Y.; Feng, Y.; Wang, C.; Wu, X. A manufacturing-oriented intelligent vision system based on deep neural network for object recognition and 6D pose estimation. Front. Neurorobot. 2021, 14, 616775.
- He, Y.; Wang, Y.; Fan, H.; Sun, J.; Chen, Q. FS6D: Few-Shot 6D Pose Estimation of Novel Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6814–6824.
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, Proceedings of the Annual Conference on Neural Information Processing Systems 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1857–1866.
- Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv 2017, arXiv:1711.00199.
- Zhan, S.; Chung, R.; Zhang, X.T. An Accurate and Robust Strip-Edge-Based Structured Light Means for Shiny Surface Micromeasurement in 3-D. IEEE Trans. Ind. Electron. 2013, 60, 1023–1032.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
Results on the SIAT dataset (AUC and ADD-S, %):

| Object | Dense(per) AUC | Dense(per) ADD-S | Dense(iter) AUC | Dense(iter) ADD-S | Ours(per) AUC | Ours(per) ADD-S | Ours(iter) AUC | Ours(iter) ADD-S |
|---|---|---|---|---|---|---|---|---|
| Toy | 97.4 | 88.6 | 97.4 | 88.6 | 88.8 | 97.2 | 88.8 | 97.8 |
| Lay's | 68.0 | 73.8 | 72.7 | 75.3 | 76.0 | 74.0 | 78.6 | 82.2 |
| Bowl | 97.4 | 91.5 | 97.4 | 91.5 | 92.2 | 97.7 | 92.3 | 99.1 |
| Thermos cup | 50.0 | 58.9 | 50.0 | 58.9 | 73.6 | 61.6 | 73.6 | 61.6 |
| Tea box | 69.2 | 82.0 | 69.2 | 82.0 | 75.2 | 68.8 | 79.5 | 68.8 |
| Blue moon | 52.2 | 75.7 | 60.8 | 76.4 | 78.1 | 62.5 | 84.0 | 73.1 |
| Metal block | 64.3 | 78.5 | 64.5 | 78.5 | 76.7 | 64.4 | 81.4 | 76.9 |
| Carton | 71.7 | 83.4 | 71.7 | 83.4 | 75.1 | 66.1 | 80.1 | 75.3 |
| Cup | 96.3 | 85.9 | 97.6 | 87.6 | 87.9 | 97.7 | 88.9 | 99.5 |
| Back of cup | 92.7 | 88.2 | 92.7 | 88.2 | 87.7 | 94.0 | 90.1 | 98.1 |
| Mean | 75.7 | 79.5 | 77.3 | 81.0 | 81.4 | 78.7 | 83.7 | 83.2 |
Results on the YCB-Video dataset (AUC and ADD-S, %):

| Object | PoseCNN+ICP AUC | PoseCNN+ICP ADD-S | Dense(per) AUC | Dense(per) ADD-S | Dense(iter) AUC | Dense(iter) ADD-S | Ours(per) AUC | Ours(per) ADD-S | Ours(iter) AUC | Ours(iter) ADD-S |
|---|---|---|---|---|---|---|---|---|---|---|
| 002 master chef can | 95.8 | 100.0 | 95.2 | 100.0 | 96.4 | 100.0 | 95.3 | 100.0 | 96.3 | 100.0 |
| 003 cracker box | 92.7 | 93.0 | 92.5 | 99.3 | 95.5 | 99.5 | 92.6 | 100.0 | 96.3 | 100.0 |
| 004 sugar box | 98.2 | 100.0 | 95.1 | 100.0 | 97.5 | 100.0 | 95.5 | 100.0 | 97.7 | 100.0 |
| 005 tomato soup can | 94.5 | 96.8 | 93.7 | 96.9 | 94.6 | 96.9 | 96.8 | 100.0 | 97.7 | 100.0 |
| 006 mustard bottle | 98.6 | 100.0 | 95.9 | 100.0 | 97.2 | 100.0 | 96.0 | 100.0 | 97.8 | 100.0 |
| 007 tuna fish can | 97.1 | 97.9 | 94.9 | 100.0 | 96.6 | 100.0 | 96.0 | 100.0 | 97.2 | 100.0 |
| 008 pudding box | 97.9 | 100.0 | 94.7 | 100.0 | 96.5 | 100.0 | 94.3 | 100.0 | 96.8 | 100.0 |
| 009 gelatin box | 98.8 | 100.0 | 95.8 | 100.0 | 98.1 | 100.0 | 97.3 | 100.0 | 98.2 | 100.0 |
| 010 potted meat can | 92.7 | 97.2 | 90.1 | 93.1 | 91.3 | 93.1 | 93.0 | 95.4 | 94.0 | 95.3 |
| 011 banana | 97.1 | 99.7 | 91.5 | 93.9 | 96.6 | 100.0 | 93.5 | 96.8 | 97.1 | 100.0 |
| 019 pitcher base | 97.8 | 100.0 | 94.6 | 100.0 | 97.1 | 100.0 | 93.4 | 99.5 | 97.9 | 100.0 |
| 021 bleach cleanser | 96.9 | 99.9 | 94.3 | 99.8 | 95.8 | 100.0 | 95.0 | 99.7 | 96.7 | 100.0 |
| 024 bowl | 81.0 | 58.8 | 86.6 | 69.5 | 88.2 | 98.8 | 84.4 | 73.9 | 88.8 | 96.8 |
| 025 mug | 95.0 | 99.5 | 95.5 | 100.0 | 97.1 | 100.0 | 96.0 | 100.0 | 97.3 | 100.0 |
| 035 power drill | 98.2 | 99.9 | 92.4 | 97.1 | 96.0 | 98.7 | 92.9 | 97.3 | 96.1 | 98.3 |
| 036 wood block | 87.6 | 82.6 | 85.5 | 93.4 | 89.7 | 94.6 | 85.8 | 84.3 | 91.7 | 96.7 |
| 037 scissors | 91.7 | 100.0 | 96.4 | 100.0 | 95.2 | 100.0 | 96.6 | 100.0 | 93.1 | 99.5 |
| 040 large marker | 97.2 | 98.0 | 94.7 | 99.2 | 97.5 | 100.0 | 95.9 | 99.7 | 97.8 | 100.0 |
| 051 large clamp | 75.2 | 75.6 | 71.6 | 78.5 | 72.9 | 79.2 | 73.7 | 79.2 | 75.7 | 80.1 |
| 052 extra large clamp | 64.4 | 55.6 | 69.0 | 69.5 | 69.8 | 76.3 | 83.4 | 83.6 | 83.3 | 88.9 |
| 061 foam brick | 97.2 | 99.6 | 92.4 | 100.0 | 92.5 | 100.0 | 94.8 | 100.0 | 96.4 | 100.0 |
| Mean | 93.0 | 93.1 | 91.2 | 95.3 | 93.1 | 96.8 | 92.8 | 96.5 | 94.8 | 97.9 |
Average runtime of each module in the pipeline:

|  | Detection | Segmentation | Classification | 6D Pose Estimation | Image Transmission |
|---|---|---|---|---|---|
| Time (s) | 0.049 | 0.020 | 0.002 | 0.023 | 0.110 |