Improving Robot Perception Skills Using a Fast Image-Labelling Method with Minimal Human Intervention
Abstract
Featured Application
1. Introduction
2. Materials and Methods
- Define a controlled background with a constant colour that is different from the colours of the objects. An example is shown in Figure 1a.
Figure 2. Proposed fast object detection algorithm with data augmentation. Images with a controlled background are labelled as shown in Figure 1. Detected objects are overlapped with a texturized background to create a standard image taken under uncontrolled conditions; moreover, several objects can be overlapped with the same background. f1 represents an image subtraction to obtain the object mask, f2 is a pixel selection from the original image using the mask detected with f1, and f3 synthesizes the selected pixels with real background images to improve the training capabilities of the image set.
- Produce images with the controlled background. The viewfinder should be framed by the controlled background and the objects should be clearly inside it. If part of an object lies outside the viewfinder, it can be considered an occluded object.
- If necessary, accentuate the differences between object and background pixels. As discussed, the aim is to obtain an image in which the background pixel colour is easy to differentiate from the object pixels. If this difference needs to be accentuated, two operations are proposed (a combined segmentation sketch is given after this list):
  - If necessary, perform an RGB to HSV conversion to highlight the object pixels in the image. Some objects, such as white skin tones, are easier to detect in a colour space other than RGB. This operation is represented in Figure 1.
  - Subtract an image of the background without objects from the image with objects; this operation is represented in Figure 2 as f1. It requires an image of the empty background, helps to remove shadows and bright areas in the image background, and increases the difference between object and background pixels.
- Remove the background pixels by pixel segmentation. In an image similar to the one shown in Figure 1b, the contrast between object and background pixels is high, so object pixels can be selected simply by value. The result is a binary image in which pixels that belong to an object are set to 1 and pixels that belong to the background are set to 0.
- Apply a combination of dilation and erosion operations to close holes and remove noise in the segmented image.
- Detect objects in the image by grouping the selected pixels into object masks.
- Perform data augmentation by overlapping the objects on images that contain only background. This step is represented by function f3 in Figure 2. Several objects can be synthesized with the same background to augment the data (see the compositing sketch after this list).
- Save the masks defined by the object silhouettes in several annotation formats (see the export sketch after this list):
  - The bounding box of each mask defines a rectangle; rectangles are saved in an XML file.
  - Polygon approximation yields the polygon vertices of each object silhouette, which are saved in JSON format.
  - Masks are saved as PNG images.
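The colour-space conversion, background subtraction (f1), thresholding and morphological cleaning described above can be combined as in the following sketch. This is only a minimal illustration with OpenCV and NumPy, not the authors' implementation; the threshold value, kernel size and minimum blob area are hypothetical parameters that would need tuning for a particular background and camera.

```python
import cv2
import numpy as np

def segment_objects(image_path, background_path, min_area=500):
    """Return the binary segmentation and one mask per detected object."""
    img = cv2.imread(image_path)      # objects on the controlled background
    bg = cv2.imread(background_path)  # the same background without objects

    # f1: subtract the empty background to suppress shadows and uneven brightness
    diff = cv2.absdiff(img, bg)

    # Optional RGB -> HSV conversion; thresholding the value channel of the
    # difference image separates object pixels from the constant-colour background
    hsv = cv2.cvtColor(diff, cv2.COLOR_BGR2HSV)
    _, binary = cv2.threshold(hsv[:, :, 2], 30, 255, cv2.THRESH_BINARY)  # 30 is illustrative

    # Dilation + erosion (closing, then opening) to fill holes and remove noise
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Group the remaining pixels into object masks via connected components
    n_labels, labels = cv2.connectedComponents(binary)
    masks = [np.uint8(labels == i) * 255 for i in range(1, n_labels)
             if np.count_nonzero(labels == i) >= min_area]  # drop tiny noise blobs
    return binary, masks
```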
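Data augmentation (f2 and f3 in Figure 2) then amounts to copying the masked object pixels onto photographs of real backgrounds; because the script chooses where to place each object, the bounding box is known without any human labelling. The helper below is a hedged sketch under the assumption that the object crop and its mask have the same size and fit inside the background; the function name and the random placement are illustrative, not the paper's code.

```python
import random

def composite(obj_img, obj_mask, background):
    """f2/f3: paste the masked object pixels onto a real background photograph."""
    out = background.copy()
    h, w = obj_mask.shape[:2]

    # Random placement; the same object can be reused on many backgrounds
    x = random.randint(0, background.shape[1] - w)
    y = random.randint(0, background.shape[0] - h)

    # f2: keep only the object pixels selected by the mask; f3: blend them into the background
    region = out[y:y + h, x:x + w]
    region[obj_mask.astype(bool)] = obj_img[obj_mask.astype(bool)]

    # The bounding box is known by construction, so the label comes for free
    bbox = (x, y, x + w, y + h)
    return out, bbox
```

Repeating this for several objects and several background photographs yields the augmented image set sketched in Figure 2.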
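The three annotation outputs listed above (rectangles in XML, polygon vertices in JSON, masks as PNG) could be written out roughly as follows. The VOC-like XML layout, the file naming and the use of OpenCV 4's findContours/approxPolyDP are assumptions made for illustration; the paper does not specify its export code.

```python
import json
import cv2

def export_annotations(mask, class_name, stem):
    """Save one object's PNG mask, JSON polygon and a minimal VOC-style XML box."""
    cv2.imwrite(f"{stem}_mask.png", mask)  # mask as a PNG image

    # Polygon approximation of the silhouette, saved as JSON
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    polygon = cv2.approxPolyDP(contour, 2.0, True).reshape(-1, 2).tolist()
    with open(f"{stem}.json", "w") as f:
        json.dump({"label": class_name, "points": polygon}, f)

    # Bounding box of the mask, saved in a minimal XML annotation
    x, y, w, h = cv2.boundingRect(contour)
    xml = (f"<annotation><object><name>{class_name}</name><bndbox>"
           f"<xmin>{x}</xmin><ymin>{y}</ymin><xmax>{x + w}</xmax><ymax>{y + h}</ymax>"
           f"</bndbox></object></annotation>")
    with open(f"{stem}.xml", "w") as f:
        f.write(xml)
```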
3. Results
3.1. Background Effects
3.2. Creating a New Data Set
4. Discussion
4.1. Background Effects
4.2. Data Augmentation
4.3. Real or Virtual Images
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| | | SSD MobileNet | SSD ResNet | Faster R-CNN ResNet | Faster R-CNN Inception |
|---|---|---|---|---|---|
| Training data | Original images | 0.81/0.88 | 0.80/0.74 | 0.85/0.81 | 0.84/0.86 |
| | Flat background | 0.82/0.85 | 0.81/0.73 | 0.86/0.89 | 0.82/0.83 |
| | Data augmentation | 0.77/0.87 | 0.78/0.75 | 0.80/0.89 | 0.83/0.84 |
| | Data augmentation, flat background | 0.78/0.86 | 0.77/0.74 | 0.79/0.88 | 0.81/0.82 |
| Testing data | Original images | 0.41/0.47 | 0.40/0.44 | 0.45/0.51 | 0.44/0.56 |
| | Flat background | 0.42/0.45 | 0.41/0.43 | 0.46/0.49 | 0.42/0.53 |
| | Data augmentation | 0.47/0.47 | 0.48/0.45 | 0.40/0.49 | 0.43/0.54 |
| | Data augmentation, flat background | 0.48/0.46 | 0.47/0.44 | 0.49/0.48 | 0.41/0.52 |
Confusion matrix when the training data are flat-background images. Rows give the predicted class ("classified as", %); columns give the actual class.

| Classified as (%) | ONE | TWO | THREE | FOUR | FIVE |
|---|---|---|---|---|---|
| ONE | 0.7000 | 0.1167 | 0.0100 | 0.0167 | 0.0033 |
| TWO | 0.1733 | 0.7333 | 0.1600 | 0.0233 | 0.0233 |
| THREE | 0.0767 | 0.1300 | 0.6667 | 0.1100 | 0.0300 |
| FOUR | 0.0333 | 0.0167 | 0.1400 | 0.6933 | 0.1333 |
| FIVE | 0.0167 | 0.0033 | 0.0233 | 0.1567 | 0.8100 |
| Precision | 0.8268 | 0.6587 | 0.6579 | 0.6820 | 0.8020 |
| Recall | 0.7000 | 0.7333 | 0.6667 | 0.6933 | 0.8100 |

Confusion matrix when the training data use texturized backgrounds, i.e. objects composited with images of real backgrounds.

| Classified as (%) | ONE | TWO | THREE | FOUR | FIVE |
|---|---|---|---|---|---|
| ONE | 0.7033 | 0.1167 | 0.0100 | 0.0167 | 0.0033 |
| TWO | 0.1767 | 0.7400 | 0.1333 | 0.0300 | 0.0167 |
| THREE | 0.0633 | 0.1233 | 0.6833 | 0.1200 | 0.0267 |
| FOUR | 0.0400 | 0.0167 | 0.1500 | 0.7000 | 0.1300 |
| FIVE | 0.0167 | 0.0033 | 0.0233 | 0.1333 | 0.8233 |
| Precision | 0.8275 | 0.6748 | 0.6721 | 0.6752 | 0.8233 |
| Recall | 0.7033 | 0.7400 | 0.6833 | 0.7000 | 0.8233 |
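For reference, the Precision and Recall rows follow directly from each matrix: the columns are normalised per true class, so recall is the diagonal entry, and, assuming every class has the same number of test images, precision is the diagonal divided by the corresponding row sum. A quick check with NumPy (values reproduce the first table up to rounding):

```python
import numpy as np

# Confusion matrix for the flat-background training case
# (rows: predicted class, columns: true class, each column normalised to 1)
cm = np.array([
    [0.7000, 0.1167, 0.0100, 0.0167, 0.0033],
    [0.1733, 0.7333, 0.1600, 0.0233, 0.0233],
    [0.0767, 0.1300, 0.6667, 0.1100, 0.0300],
    [0.0333, 0.0167, 0.1400, 0.6933, 0.1333],
    [0.0167, 0.0033, 0.0233, 0.1567, 0.8100],
])

recall = np.diag(cm)                       # columns are already normalised per true class
precision = np.diag(cm) / cm.sum(axis=1)   # valid when all classes have equal test counts
print(np.round(precision, 4))  # matches the Precision row of the table up to rounding
print(np.round(recall, 4))     # equals the diagonal, i.e. the Recall row
```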
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).