Human 3D Pose Estimation with a Tilting Camera for Social Mobile Robot Interaction
Abstract
1. Introduction
- Single-View. The first method requires only a single image of the scene (provided that the human feet are visible) to reconstruct a simplified, planar version of the detected human’s skeleton. Since it is fast, this method performs well even if the robot and/or users are moving, allowing its use during normal robot navigation.
- Multi-View. This second approach is more precise and can be applied with the user in any arbitrary bodily position. It uses several (at least two) views of the scene taken from different angles of the tilting robot head, which requires both the user and the robot to remain reasonably still during the acquisition time (≈2 s). It is important to highlight that, in this case, and due to the particular camera setup and motion, common epipolar constraints for stereo vision are not applicable. Thus, this paper also contributes an implementation of specific 3D reconstruction equations, as well as the exploitation of the body structure to improve the obtained results.
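The single-view idea above can be illustrated by back-projecting the feet pixel and intersecting the resulting ray with the floor plane. The following sketch assumes the camera pose is already known from the extrinsic calibration; the function name and conventions (floor frame with y up and y = 0 on the floor) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def backproject_to_floor(u, v, K, R_wc, cam_height):
    """Intersect the viewing ray through pixel (u, v) with the floor
    plane y = 0, given the camera intrinsics K, the camera-to-floor
    rotation R_wc and the camera height above the floor."""
    # Ray direction in the camera frame, then rotated into the floor frame.
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_floor = R_wc @ d_cam
    # The camera center sits at cam_height above the floor plane.
    c = np.array([0.0, cam_height, 0.0])
    # Solve c_y + t * d_y = 0 for the ray parameter t.
    t = -c[1] / d_floor[1]
    return c + t * d_floor
```

A pixel at the detected feet thus yields the 3D point where the person touches the ground, from which the planar skeleton can be anchored.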
2. Related Work
3. Camera Extrinsic Calibration
3.1. Robotic Platform
3.2. Camera Pose Estimation
- A rotation transformation R that aligns the camera’s y-axis with the y-axis of the floor frame, so that the camera’s XZ plane becomes parallel to its counterpart in the floor frame.
- The y-coordinate of the camera frame with respect to the floor plane, that is, the distance (or height) from the floor to the camera center.
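Together, the two quantities above determine the camera extrinsics with respect to the floor, so a point in the camera frame can be mapped into the floor frame. A minimal sketch (R and height follow the description above; the function name is an assumption):

```python
import numpy as np

def camera_to_floor(p_cam, R, height):
    """Map a 3D point from the camera frame to the floor frame using the
    aligning rotation R and the camera height above the floor."""
    # Rotate into floor-aligned axes, then shift the origin to floor level.
    return R @ p_cam + np.array([0.0, height, 0.0])
```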
- Base: The image coordinates of the robot’s action buttons on the base are stored as the visual feature vector for each camera pose. This is employed for positive camera rotations, since only the robot base is visible in the image, as shown in Figure 3c.
- Head Frame: The appearance of the head frame changes significantly during the camera movements, which precludes employing a fixed template, as before, to extract the features. For this reason, we employ the angles between the head frame and the image borders to build the feature vector. This is done for negative rotations (see Figure 3b).
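As an illustration of such angle-based features, the orientation of each detected head-frame segment with respect to the horizontal image border can be computed as follows (a sketch under assumed conventions; the paper's actual feature extraction may differ):

```python
import math

def border_angles(segments):
    """Angle (degrees) of each detected head-frame line segment with
    respect to the horizontal image border, stacked as a feature vector."""
    feats = []
    for (x1, y1), (x2, y2) in segments:
        feats.append(math.degrees(math.atan2(y2 - y1, x2 - x1)))
    return feats
```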
4. Human Body Pose Estimation
4.1. Single-View Estimation
4.2. Multi-View Estimation
- Initial rotation. Once we have a pair of images taken from two different camera poses, we first transform the second image by applying the rotation matrix R between the cameras, thereby aligning both reference systems. This rotation induces a linear transformation of the image through the expression H = K R K⁻¹, where K is the camera’s intrinsic calibration matrix, which can be directly applied to the image. However, this alignment presents small errors that have to be corrected by finding a second (residual) rotation that completely matches the axes.
- Residual rotation. The residual rotation matrix Rres aligns both epipoles e and e′, so that e′ = Rres e [32,37], and it can be computed from the homogeneous coordinates of the epipoles as the rotation about the axis e × e′ by the angle between them.
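Both rotation steps can be sketched in code: the infinite homography H = K R K⁻¹ warps the second image under a pure rotation, and a Rodrigues-style construction yields a rotation aligning one epipole with the other. This is an illustrative implementation of the standard textbook constructions, not the paper's exact equations:

```python
import numpy as np

def rotation_homography(K, R):
    """Homography induced by a pure camera rotation R (infinite homography)."""
    return K @ R @ np.linalg.inv(K)

def residual_rotation(e1, e2):
    """Rotation aligning epipole direction e1 with e2 (Rodrigues formula)."""
    a = e1 / np.linalg.norm(e1)
    b = e2 / np.linalg.norm(e2)
    v = np.cross(a, b)                   # rotation axis (unnormalized)
    s, c = np.linalg.norm(v), np.dot(a, b)
    if s < 1e-12:                        # parallel; opposite case not handled here
        return np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])  # skew-symmetric cross-product matrix
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)
```

The returned residual rotation can then be composed with the initial one and re-applied as a homography to finish the alignment.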
5. Experimental Validation
5.1. Camera Retrieval
5.2. Position
5.2.1. Single-View Method
5.2.2. Multi-View Method
5.3. Human Orientation
6. Conclusions and Future Work
- Single-View. It requires only a single view of the scene and allows the robot and the user to move during the pose estimation. Provided that the feet are detected, a simplified, planar human body is then reconstructed from the camera pose and the feet coordinates, even determining whether the user is standing up or lying down.
- Multi-View. It overcomes the visible-feet limitation by employing several views of the scene taken from different angles, requiring both the human and the robot to remain still during the process (≈2 s). Due to the particular camera setup and motion, common epipolar constraints for stereo vision cannot be applied, so we have provided specific 3D reconstruction equations and exploited the body structure to improve the reconstruction.
Author Contributions
Funding
Conflicts of Interest
References
- Goodrich, M.A.; Schultz, A.C. Human–robot interaction: A survey. Found. Trends® Hum. Comput. Interact. 2008, 1, 203–275.
- Canal, G.; Escalera, S.; Angulo, C. A real-time human-robot interaction system based on gestures for assistive scenarios. Comput. Vision Image Underst. 2016, 149, 65–77.
- Saleh, S.; Sahu, M.; Zafar, Z.; Berns, K. A multimodal nonverbal human-robot communication system. In Proceedings of the Sixth International Conference on Computational Bioengineering, ICCB, Belgrade, Serbia, 4–6 September 2015; pp. 1–10.
- Gockley, R.; Forlizzi, J.; Simmons, R. Natural person-following behavior for social robots. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, Arlington, VA, USA, 10–12 March 2007; pp. 17–24.
- Cesta, A.; Coradeschi, S.; Cortellessa, G.; Gonzalez, J.; Tiberio, L.; Von Rump, S. Enabling social interaction through embodiment in ExCITE. In Proceedings of the ForItAAL: Second Italian Forum on Ambient Assisted Living, Trento, Italy, 5–7 October 2010; pp. 1–7.
- Shi, Y.; Dong, X.; Shi, D.; Yang, Q. Human Detection Using Color and Depth Information by Kinect Based on the Fusion Method of Decision Template. ICIC Express Lett. 2016, 7, 1567–1574.
- Zimmermann, C.; Welschehold, T.; Dornhege, C.; Burgard, W.; Brox, T. 3D human pose estimation in RGBD images for robotic task learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1986–1992.
- Moreno, F.A.; Ruiz Sarmiento, J.R.; Monroy, J.; Fernandez, M.; Gonzalez-Jimenez, J. Analyzing interference between RGB-D cameras for human motion tracking. In Proceedings of the International Conference on Applications of Intelligent Systems (APPIS), Las Palmas de Gran Canaria, Spain, 8–12 January 2018; pp. 1–4.
- Butler, D.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Hodges, S.; Kim, D. Shake’n’sense: reducing interference for overlapping structured light depth cameras. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; pp. 1933–1936.
- González-Jiménez, J.; Galindo, C.; Ruiz-Sarmiento, J. Technical improvements of the Giraff telepresence robot based on users’ evaluation. In Proceedings of the 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, Paris, France, 9–13 September 2012; pp. 827–832.
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. arXiv 2018, arXiv:1812.08008.
- Gong, W.; Zhang, X.; Gonzàlez, J.; Sobral, A.; Bouwmans, T.; Tu, C.; Zahzah, E.H. Human pose estimation from monocular images: A comprehensive survey. Sensors 2016, 16, 1966.
- Choo, K.; Fleet, D.J. People tracking using hybrid Monte Carlo filtering. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 321–328.
- Andriluka, M.; Roth, S.; Schiele, B. Monocular 3d pose estimation and tracking by detection. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 623–630.
- Taylor, C.J. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Comput. Vision Image Underst. 2000, 80, 349–363.
- Guan, P.; Weiss, A.; Balan, A.O.; Black, M.J. Estimating human shape and pose from a single image. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 1381–1388.
- Ramakrishna, V.; Kanade, T.; Sheikh, Y. Reconstructing 3d human pose from 2d image landmarks. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 573–586.
- Freifeld, O.; Black, M.J. Lie bodies: A manifold representation of 3D human shape. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 1–14.
- Elgammal, A.; Lee, C.S. Tracking people on a torus. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 520–538.
- Urtasun, R.; Fleet, D.J.; Fua, P. Temporal motion models for monocular and multiview 3D human body tracking. Comput. Vision Image Underst. 2006, 104, 157–177.
- Alldieck, T.; Kassubeck, M.; Wandt, B.; Rosenhahn, B.; Magnor, M. Optical flow-based 3d human motion estimation from monocular video. In Proceedings of the German Conference on Pattern Recognition, Stuttgart, Germany, 9–12 October 2017; pp. 347–360.
- Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1653–1660.
- Tompson, J.J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1799–1807.
- Li, S.; Zhang, W.; Chan, A.B. Maximum-margin structured learning with deep networks for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2848–2856.
- Rogez, G.; Schmid, C. Image-based synthesis for deep 3D human pose estimation. Int. J. Comput. Vision 2018, 126, 993–1008.
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2640–2649.
- Li, X.; Fan, Z.; Liu, Y.; Li, Y.; Dai, Q. 3D Pose Detection of Closely Interactive Humans Using Multi-View Cameras. Sensors 2019, 19, 2831.
- Orlandini, A.; Kristoffersson, A.; Almquist, L.; Björkman, P.; Cesta, A.; Cortellessa, G.; Galindo, C.; Gonzalez-Jimenez, J.; Gustafsson, K.; Kiselev, A.; et al. ExCITE project: A review of forty-two months of robotic telepresence technology evolution. Presence Teleoperators Virtual Environ. 2016, 25, 204–221.
- Coradeschi, S.; Cesta, A.; Cortellessa, G.; Coraci, L.; Galindo, C.; Gonzalez, J.; Karlsson, L.; Forsberg, A.; Frennert, S.; Furfari, F.; et al. GiraffPlus: A system for monitoring activities and physiological parameters and promoting social interaction for elderly. In Human-Computer Systems Interaction: Backgrounds and Applications 3; Springer: Cham, Switzerland, 2014; pp. 261–271.
- Luperto, M.; Monroy, J.; Ruiz-Sarmiento, J.R.; Moreno, F.A.; Basilico, N.; Gonzalez-Jimenez, J.; Borghese, N.A. Towards Long-Term Deployment of a Mobile Robot for at-Home Ambient Assisted Living of the Elderly. In Proceedings of the 2019 European Conference on Mobile Robots (ECMR), Prague, Czech Republic, 4–6 September 2019; pp. 1–6.
- Cheng, C.; Hao, X.; Li, J. Relative camera pose estimation method using optimization on the manifold. In Proceedings of the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Salzburg, Austria, 13–16 August 2017; pp. 41–46.
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003.
- Moreno, F.A.; Blanco, J.L.; Gonzalez, J. Stereo vision specific models for particle filter-based SLAM. Robot. Auton. Syst. 2009, 57, 955–970.
- Moreno, F.A.; Blanco, J.L.; Gonzalez-Jimenez, J. A constant-time SLAM back-end in the continuum between global mapping and submapping: application to visual stereo SLAM. Int. J. Robot. Res. 2016, 35, 1036–1056.
- Wei, Y.; Lhuillier, M.; Quan, L. Fast segmentation-based dense stereo from quasi-dense matching. In Proceedings of the Asian Conference on Computer Vision, Jeju, Korea, 27–30 July 2004; p. CDROM.
- Lazaros, N.; Sirakoulis, G.C.; Gasteratos, A. Review of stereo vision algorithms: from software to hardware. Int. J. Optomechatronics 2008, 2, 435–462.
- Monasse, P.; Morel, J.M.; Tang, Z. Three-step image rectification. In Proceedings of the British Machine Vision Conference (BMVA), Aberystwyth, UK, 30 August–2 September 2010; pp. 89.1–89.10.
- Laveau, S.; Faugeras, O. Oriented projective geometry for computer vision. In Proceedings of the European Conference on Computer Vision, Cambridge, UK, 14–18 April 1996; pp. 147–156.
- Body Tracking SDK. Available online: https://orbbec3d.com/bodytracking-sdk/ (accessed on 29 July 2019).
- Garcia-Salguero, M.; Monroy, J.; Solano, A.; Gonzalez-Jimenez, J. Socially Acceptable Approach to Humans by a Mobile Robot. In Proceedings of the 2nd International Conference on Applications of Intelligent Systems (APPIS), Las Palmas de Gran Canaria, Spain, 7–9 January 2019; pp. 1–7.
- Coroiu, A.D.C.A.; Coroiu, A. Interchangeability of Kinect and Orbbec Sensors for Gesture Recognition. In Proceedings of the 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 6–8 September 2018; pp. 309–315.
- Microsoft. Microsoft Kinect. Available online: https://developer.microsoft.com/en-us/windows/kinect (accessed on 24 October 2019).
- Lange, B.; Koenig, S.; McConnell, E.; Chang, C.Y.; Juang, R.; Suma, E.; Bolas, M.; Rizzo, A. Interactive game-based rehabilitation using the Microsoft Kinect. In Proceedings of the 2012 IEEE Virtual Reality Workshops (VRW), Costa Mesa, CA, USA, 4–8 March 2012; pp. 171–172.
- El-laithy, R.A.; Huang, J.; Yeh, M. Study on the use of Microsoft Kinect for robotics applications. In Proceedings of the 2012 IEEE/ION Position, Location and Navigation Symposium, Myrtle Beach, SC, USA, 23–26 April 2012; pp. 1280–1288.
- Lun, R.; Zhao, W. A survey of applications and human motion recognition with microsoft kinect. Int. J. Pattern Recognit. Artif. Intell. 2015, 29, 1555008.
- Călin, A.; Cantea, A.; Dascălu, A.; Mihaiu, C.; Suciu, D. MIRA-Upper Limb Rehabilitation System Using Microsoft Kinect. Studia Univ. Babes-Bolyai Inform. 2011, 56, 63–74.
- Gaber, A.; Taher, M.F.; Waheb, M. A comparison of virtual rehabilitation techniques. In Proceedings of the World Congress on Electrical Engineering and Computer Systems and Science (EECSS), Barcelona, Spain, 13–14 July 2015; p. 317.
- Hughes, C.; Glavin, M.; Jones, E.; Denny, P. Review of geometric distortion compensation in fish-eye cameras. In Proceedings of the IET Irish Signals and Systems Conference (ISSC), Galway, Ireland, 18–19 June 2008; pp. 162–167.
- MAPIR-UMA Youtube Channel. Available online: https://www.youtube.com/channel/UC-thsUlVVKvB_vIANQXLLeA (accessed on 11 November 2019).
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Garcia-Salguero, M.; Gonzalez-Jimenez, J.; Moreno, F.-A. Human 3D Pose Estimation with a Tilting Camera for Social Mobile Robot Interaction. Sensors 2019, 19, 4943. https://doi.org/10.3390/s19224943