Improved Convolutional Pose Machines for Human Pose Estimation Using Image Sensor Data
Abstract
1. Introduction
2. Improved Convolutional Pose Machines
2.1. Convolutional Pose Machines
2.1.1. Pose Machines
2.1.2. Convolutional Pose Machines
2.2. Improved Convolutional Pose Machines
2.2.1. Design of the Improved Convolutional Pose Machines
2.2.2. Training and Testing
2.3. Learning in GoogLeNet13-CPM-Stage6
3. Experimental Results
3.1. Experimental Environment and Datasets
3.2. Experimental Procedure
3.3. Experimental Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Wang, L.; Zang, J.L.; Zhang, Q.L.; Niu, Z.X.; Hua, G.; Zheng, N.N. Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network. Sensors 2018, 18, 1979. [Google Scholar] [CrossRef] [PubMed]
- Gong, W.J.; Zhang, X.N.; Gonzàlez, J.; Sobral, A.; Bouwmans, T.; Tu, C.H.; Zahzah, E.-H. Human Pose Estimation from Monocular Images: A Comprehensive Survey. Sensors 2016, 16, 1966. [Google Scholar] [CrossRef] [PubMed]
- Han, J.G.; Shen, J.D. Progress in two-dimensional human pose estimation. J. Xi’an Univ. Posts Telecom. 2017, 4, 1–9. [Google Scholar] [CrossRef]
- Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.; Schiele, B. Deepcut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient object localization using convolutional networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
- Tompson, J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the 2014 International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Wang, R. Human Posture Estimation Based on Deep Convolution Neural Network. University of Electronic Science and Technology of China. Available online: http://nvsm.cnki.net/kns/brief/default_result.aspx (accessed on 27 March 2016).
- Pfister, T.; Charles, J.; Zisserman, A. Flowing ConvNets for Human Pose Estimation in Videos. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–16 December 2015. [Google Scholar]
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 8–16 October 2016. [Google Scholar]
- Wei, S.-E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- MPII Human Pose Dataset. Available online: http://human-pose.mpi-inf.mpg.de (accessed on 11 November 2018).
- Leeds Sports Pose. Available online: http://sam.johnson.io/research/lsp.html (accessed on 11 November 2018).
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 8–16 October 2016. [Google Scholar]
- Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context attention for human pose estimation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–29 July 2017. [Google Scholar]
- Chou, C.; Chien, J.; Chen, H. Self adversarial training for human pose estimation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–29 July 2017. [Google Scholar]
- Yang, W.; Li, S.; Ouyang, W.; Li, H.; Wang, X. Learning feature pyramids for human pose estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
- Ramakrishna, V.; Munoz, D.; Hebert, M.; Bagnell, J.; Sheikh, Y. Pose Machines: Articulated Pose Estimation via Inference Machines. In Proceedings of the 2014 European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 2012 International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
- Jia, Y.; Shelhamer, E.; Donahue, J. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
- Zhou, F.Y.; Jin, L.P.; Dong, J. Review of Convolutional Neural Networks. J. Comput. Sci. 2017, 40, 1229–1251. [Google Scholar]
- Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply supervised nets. In Proceedings of the 2015 International Conference on Artificial Intelligence and Statistics (AISTATS), San Diego, CA, USA, 9–12 May 2015. [Google Scholar]
- Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
- Bradley, D. Learning in Modular Systems. Ph.D. Thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 2010. [Google Scholar]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 2010 International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 13–15 May 2010. [Google Scholar]
- Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.7321 (accessed on 10 October 2018).
- Deng, F.; Pu, S.L.; Chen, X.H.; Shi, Y.S.; Yuan, T.; Pu, S.Y. Hyperspectral Image Classification with Capsule Network Using Limited Training Samples. Sensors 2018, 18, 3153. [Google Scholar] [CrossRef] [PubMed]
- Mohamed, A.; Hinton, G.; Penn, G. Understanding how deep belief networks perform acoustic modeling. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 25–30 March 2012; pp. 4273–4276. [Google Scholar]
- Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 30–42. [Google Scholar] [CrossRef] [Green Version]
- The Extended Leeds Sports Pose. Available online: http://sam.johnson.io/research/lspet.html (accessed on 11 November 2018).
- Wu, Y.; Zhang, L.M. A Survey of Research Work on Neural Network Generalization and Structure Optimization Algorithms. Appl. Res. Comput. 2002, 19, 21–25. [Google Scholar] [CrossRef]
- Lifshitz, I.; Fetaya, E.; Ullman, S. Human pose estimation using deep consensus voting. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 8–16 October 2016. [Google Scholar]
- Tang, Z.; Peng, X.; Geng, S.; Zhu, Y.; Metaxas, D. CU-Net: Coupled U-Nets. In Proceedings of the 2018 British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018. [Google Scholar]
- Tang, Z.; Peng, X.; Geng, S.; Wu, L.; Zhang, S.; Metaxas, D. Quantized Densely Connected U-Nets for Efficient Landmark Localization. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
Type | Kernel Size/Stride | Depth | Output Size |
---|---|---|---|
conv1 | 7 × 7/2 | 1 | 184 × 184 × 64 |
max pool | 3 × 3/2 | 0 | 92 × 92 × 64 |
conv2 | 3 × 3/1 | 2 | 92 × 92 × 192 |
max pool | 3 × 3/2 | 0 | 46 × 46 × 192 |
Inception(3a) | - | 2 | 46 × 46 × 292 |
Inception(3b) | - | 2 | 46 × 46 × 480 |
Inception(4a) | - | 2 | 46 × 46 × 512 |
Inception(4b) | - | 2 | 46 × 46 × 512 |
Inception(4c) | - | 2 | 46 × 46 × 512 |
Inception(4d) | - | 2 | 46 × 46 × 528 |
Inception(4e) | - | 2 | 46 × 46 × 832 |
conv4_3_CPM | 3 × 3/1 | 1 | 46 × 46 × 256 |
conv4_4_CPM | 3 × 3/1 | 1 | 46 × 46 × 256 |
conv4_5_CPM | 3 × 3/1 | 1 | 46 × 46 × 256 |
conv4_6_CPM | 3 × 3/1 | 1 | 46 × 46 × 256 |
conv4_7_CPM | 3 × 3/1 | 1 | 46 × 46 × 128 |
conv5_1_CPM | 1 × 1/1 | 1 | 46 × 46 × 512 |
conv5_2_CPM | 1 × 1/1 | 1 | 46 × 46 × 15 |
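
The stage-1 table above ends with the CPM prediction head that maps the 46 × 46 × 832 Inception(4e) features to 15 belief maps. The PyTorch sketch below mirrors those rows (conv4_3_CPM through conv5_2_CPM); the framework choice (the experiments themselves use Caffe), the padding values, and the ReLU placement are assumptions made only to preserve the 46 × 46 resolution and illustrate the layer sequence.

```python
import torch
import torch.nn as nn

class Stage1Head(nn.Module):
    """Sketch of the conv4_*_CPM / conv5_*_CPM head from the stage-1 table.

    Assumes 3x3 convolutions use padding=1 so the 46x46 resolution is kept;
    the Inception backbone (conv1 ... Inception(4e)) is not reproduced here.
    """
    def __init__(self, in_channels=832, num_maps=15):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # conv4_3_CPM
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv4_4_CPM
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv4_5_CPM
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv4_6_CPM
            nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv4_7_CPM
            nn.Conv2d(128, 512, kernel_size=1), nn.ReLU(inplace=True),                     # conv5_1_CPM
            nn.Conv2d(512, num_maps, kernel_size=1),                                       # conv5_2_CPM: belief maps
        )

    def forward(self, x):
        return self.head(x)

# Shape check with a dummy backbone feature map at the 46x46 working resolution.
maps = Stage1Head()(torch.zeros(1, 832, 46, 46))
print(maps.shape)  # torch.Size([1, 15, 46, 46])
```
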
Type | Kernel Size/Stride | Depth | Output Size |
---|---|---|---|
conv4_7_CPM | 3 × 3/1 | 1 | 46 × 46 × 128 |
pool_center_lower | 9 × 9/8 | 0 | 46 × 46 × 1 |
conv5_2_CPM | 1 × 1/1 | 1 | 46 × 46 × 15 |
concat_Stage2 | - | 0 | 46 × 46 × 144 |
Mconv1_Stage2 | 7 × 7/1 | 1 | 46 × 46 × 128 |
Mconv2_Stage2 | 7 × 7/1 | 1 | 46 × 46 × 128 |
Mconv3_Stage2 | 7 × 7/1 | 1 | 46 × 46 × 128 |
Mconv4_Stage2 | 7 × 7/1 | 1 | 46 × 46 × 128 |
Mconv5_Stage2 | 7 × 7/1 | 1 | 46 × 46 × 128 |
Mconv6_Stage2 | 1 × 1/1 | 1 | 46 × 46 × 128 |
Mconv7_Stage2 | 1 × 1/1 | 1 | 46 × 46 × 15 |
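
The refinement-stage table shows why concat_Stage2 has 144 channels: it stacks the shared conv4_7_CPM features (128), the 15 belief maps from the previous stage, and the pooled center map (1). A minimal PyTorch sketch of one such stage follows; as before, the framework, padding values, and activation placement are assumptions rather than the paper's Caffe definition.

```python
import torch
import torch.nn as nn

class RefinementStage(nn.Module):
    """Sketch of one refinement stage (Mconv1_StageT ... Mconv7_StageT).

    Input channels: 128 (shared features) + 15 (previous belief maps) + 1 (center map) = 144.
    The 7x7 convolutions are assumed to use padding=3 to preserve the 46x46 resolution.
    """
    def __init__(self, num_maps=15):
        super().__init__()
        layers = []
        in_ch = 128 + num_maps + 1  # concat_StageT: 144 channels
        for _ in range(5):          # Mconv1 ... Mconv5: 7x7, 128 channels
            layers += [nn.Conv2d(in_ch, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True)]
            in_ch = 128
        layers += [nn.Conv2d(128, 128, kernel_size=1), nn.ReLU(inplace=True)]  # Mconv6
        layers += [nn.Conv2d(128, num_maps, kernel_size=1)]                    # Mconv7: refined belief maps
        self.stage = nn.Sequential(*layers)

    def forward(self, features, prev_maps, center_map):
        x = torch.cat([features, prev_maps, center_map], dim=1)
        return self.stage(x)

# Shape check with dummy inputs.
stage = RefinementStage()
out = stage(torch.zeros(1, 128, 46, 46), torch.zeros(1, 15, 46, 46), torch.zeros(1, 1, 46, 46))
print(out.shape)  # torch.Size([1, 15, 46, 46])
```
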
Parameters | Meaning |
---|---|
batch_size = 16 | Number of training samples per iteration |
Backend = LMDB | Database format |
lr_policy = “step” | Learning-rate policy (step decay) |
stepsize = 120,000 | Number of iterations between learning-rate adjustments |
weight_decay = 0.0005 | Weight decay coefficient |
base_lr = 0.000080 | Initial learning rate |
momentum = 0.9 | Momentum |
max_iter = 350,000 | Maximum number of iterations |
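
The values above are Caffe solver settings. For orientation, a roughly equivalent optimizer setup in PyTorch might look like the sketch below; the step-decay factor (gamma) is not given in the table and the parameter list is a placeholder, so both are assumptions.

```python
import torch

# Placeholder parameter list; a real run would pass model.parameters().
params = [torch.nn.Parameter(torch.zeros(1))]

# SGD mirroring the solver table: base_lr = 0.000080, momentum = 0.9, weight_decay = 0.0005.
optimizer = torch.optim.SGD(params, lr=8e-5, momentum=0.9, weight_decay=5e-4)

# lr_policy = "step" with stepsize = 120,000 iterations.
# The decay factor gamma is an assumption; the table only gives the step size.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=120_000, gamma=0.333)

batch_size = 16      # training samples per iteration
max_iter = 350_000   # maximum number of iterations

for it in range(3):  # sketch only: a real run would loop to max_iter over batches of batch_size
    optimizer.step()   # would follow the forward pass and loss.backward()
    scheduler.step()   # advance the per-iteration step-decay schedule
```
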
Dataset Name | Category | Number of Keypoints | Training/Validation Images |
---|---|---|---|
MPII | Whole body | 14 | 25,000/3000 |
Extended-LSP + LSP | Whole body | 14 | 11,000/1000 |
Models | Iterations | Accuracy (1000 Images) |
---|---|---|
CPM-Stage6 | 630,000 | 0.8554 |
VGG10-CPM-Stage6 | 320,000 | 0.8798 |
Improved Model | 175,000 | 0.8823 |
Models | Iterations | Training Time | Speed (ms/Image) |
---|---|---|---|
CPM-Stage6 | 630,000 | 180 h | 260.7 |
VGG10-CPM-Stage6 | 320,000 | 91 h | 255.6 |
Improved Model | 175,000 | 36.37 h | 181.2 |
Models | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | PCK |
---|---|---|---|---|---|---|---|---|
Lifshitz et al. [32] ** | 96.8 | 89.0 | 82.7 | 79.1 | 90.9 | 86.0 | 82.5 | 86.7 |
Pishchulin et al. [4] * | 97.0 | 91.0 | 83.8 | 78.1 | 91.0 | 86.7 | 82.0 | 87.1 |
Insafutdinov et al. [9] * | 97.4 | 92.7 | 87.5 | 84.4 | 91.5 | 89.9 | 87.2 | 90.1 |
Wei et al. [10] * | 97.8 | 92.5 | 87.0 | 83.9 | 91.5 | 90.8 | 89.9 | 90.5 |
CU-Net-8 [33] | 97.1 | 94.7 | 91.6 | 89.0 | 93.7 | 94.2 | 93.7 | 93.4 |
Tang et al. [34] | 97.5 | 95.0 | 92.5 | 90.1 | 93.7 | 95.2 | 94.2 | 94.0 |
Chu et al. [14] * | 98.1 | 93.7 | 89.3 | 86.9 | 93.4 | 94.0 | 92.5 | 92.6 |
Improved model | 96.8 | 90.5 | 85.3 | 81.7 | 90.3 | 87.8 | 86.3 | 88.4 |
Improved model * | 98.2 | 93.7 | 89.8 | 87.3 | 92.7 | 93.8 | 92.3 | 92.6 |
Models | Iterations | Training Time | Speed (ms/Image) |
---|---|---|---|
Lifshitz et al. [32] ** | - | - | 700 |
Pishchulin et al. [4] * | 1,000,000 | - | 57,995 |
Insafutdinov et al. [9] * | 1,000,000 | 120 h | 230 |
Wei et al. [10] * | 985,000 | 280 h | 260.7 |
Chu et al. [14] * | - | - | - |
Improved model * | 275,000 | 82 h | 180.9 |
Models | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | PCKh |
---|---|---|---|---|---|---|---|---|
Lifshitz et al. [32] | 97.8 | 93.3 | 85.7 | 80.4 | 85.3 | 76.6 | 70.2 | 85.0 |
Pishchulin et al. [4] * | 94.1 | 90.2 | 83.4 | 77.3 | 82.6 | 75.7 | 68.6 | 82.4 |
Insafutdinov et al. [9] | 96.8 | 95.2 | 89.3 | 84.4 | 88.4 | 83.4 | 78.0 | 88.5 |
Wei et al. [10] * | 97.8 | 95.0 | 88.7 | 84.0 | 88.4 | 82.8 | 79.4 | 88.5 |
Newell et al. [13] | 98.2 | 96.3 | 91.2 | 87.1 | 90.1 | 87.4 | 83.6 | 90.9 |
Chu et al. [14] | 98.5 | 96.3 | 91.9 | 88.1 | 90.6 | 88.0 | 85.0 | 91.5 |
CU-Net-8 [33] | 97.4 | 96.2 | 91.8 | 87.3 | 90.0 | 87.0 | 83.3 | 90.8 |
Tang et al. [34] | 97.4 | 96.4 | 92.1 | 87.7 | 90.2 | 87.7 | 84.3 | 91.2 |
Improved model | 98.6 | 96.4 | 91.9 | 88.5 | 90.4 | 87.8 | 84.8 | 91.5 |
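
The comparison tables use PCK (LSP) and PCKh@0.5 (MPII): a predicted joint counts as correct when its distance to the ground truth falls within a fraction of a per-image reference length, the torso diagonal for PCK and the head-segment length for PCKh. The NumPy sketch below illustrates that computation; the array layout, joint ordering, and default threshold are assumptions for illustration, not the exact evaluation protocol.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """Fraction of joints whose prediction lies within alpha * ref_len of the ground truth.

    pred, gt : (num_images, num_joints, 2) arrays of (x, y) coordinates.
    ref_len  : (num_images,) per-image reference lengths
               (torso diagonal for PCK on LSP, head-segment length for PCKh on MPII).
    alpha    : threshold fraction, e.g. 0.5 for PCKh@0.5.
    Returns per-joint accuracy, from which a mean PCK/PCKh can be taken.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)        # (num_images, num_joints) distances
    correct = dist <= alpha * ref_len[:, None]       # compare against per-image threshold
    return correct.mean(axis=0)                      # accuracy per joint

# Toy example: 14 joints on 3 images with random coordinates.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, size=(3, 14, 2))
pred = gt + rng.normal(0, 5, size=(3, 14, 2))
head_len = np.full(3, 30.0)
print(pck(pred, gt, head_len, alpha=0.5))
```
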
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).