Cofopose: Conditional 2D Pose Estimation with Transformers
Abstract
1. Introduction
- We propose Cofopose, a two-stage approach consisting of person- and keypoint-detection transformers for 2D human pose estimation.
- Cofopose combines conditional cross-attention, conditional DETR, and encoder-decoder transformers to perform person and keypoint detection. Specifically, we use conditional cross-attention with a fine-tuned conditional DETR for person detection, and encoder-decoder transformers for keypoint detection (a minimal illustrative sketch of this two-stage design is given after this list).
- Cofopose achieves state-of-the-art accuracy on both the MPII and MS-COCO benchmark datasets. Furthermore, ablation studies confirm the contribution of each component of the proposed architecture.
2. Related Work
2.1. Transformers
2.2. Human Pose Estimation
3. Model
3.1. Revisiting Conditional DETR
3.2. Cofopose Architecture
3.2.1. Transformer Encoder
3.2.2. Transformer Decoder
3.2.3. Conditional Cross-Attention
3.2.4. Keypoint Detection
4. Experiments
4.1. Setup
4.2. Model Settings
4.3. Implementation Details
4.4. Comparison with Existing State-of-the-Art Architectures
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
COCO | Common Objects in Context
MPII | Max Planck Institut Informatik
DETR | Detection Transformer
DCNN | Deep Convolutional Neural Network
VATT | Video–Audio–Text Transformer
HPE | Human Pose Estimation
SOTA | State-of-the-art
References
- Belagiannis, V.; Zisserman, A. Recurrent Human Pose Estimation. arXiv 2016, arXiv:1605.02914. [Google Scholar]
- Ji, X.; Fang, Q.; Dong, J.; Shuai, Q.; Jiang, W.; Zhou, X. A Survey on Monocular 3D Human Pose Estimation. Virtual Real. Intell. Hardw. 2020, 2, 471–500. [Google Scholar] [CrossRef]
- Cristani, M.; Raghavendra, R.; del Bue, A.; Murino, V. Human Behavior Analysis in Video Surveillance: A Social Signal Processing Perspective. Neurocomputing 2013, 100, 86–97. [Google Scholar] [CrossRef]
- Shotton, J.; Sharp, T.; Fitzgibbon, A.; Blake, A.; Cook, M.; Kipman, A.; Finocchio, M.; Moore, R. Real-Time Human Pose Recognition in Parts from Single Depth Images. Commun. ACM 2013, 56, 116–124. [Google Scholar] [CrossRef]
- Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Aggarwal, J.K.; Ryoo, M.S. Human Activity Analysis: A Review. ACM Comput. Surv. 2011, 43, 16. [Google Scholar] [CrossRef]
- Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-Aware Representation Learning for Bottom-up Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. RMPE: Regional Multi-Person Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards Accurate Multi-Person Pose Estimation in the Wild. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose Recognition with Cascade Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Farokhian, M.; Rafe, V.; Veisi, H. Fake News Detection Using Parallel BERT Deep Neural Networks. arXiv 2022, arXiv:2204.04793. [Google Scholar]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022. [Google Scholar] [CrossRef]
- Zhang, S.; Loweimi, E.; Bell, P.; Renals, S. On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop, SLT 2021—Proceedings, Shenzhen, China, 19–22 January 2021. [Google Scholar]
- Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing—Proceedings, Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.T.; Zhou, E. TokenPose: Learning Keypoint Tokens for Human Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Smith, S.M.; Brady, J.M. SUSAN—A New Approach to Low Level Image Processing. Int. J. Comput. Vis. 1997, 23, 45–78. [Google Scholar] [CrossRef]
- Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
- Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing—Proceedings, Barcelona, Spain, 4–8 May 2020. [Google Scholar]
- Veličković, P.; Casanova, A.; Liò, P.; Cucurull, G.; Romero, A.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018—Conference Track Proceedings, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.H.; Chang, S.F.; Cui, Y.; Gong, B. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. Adv. Neural Inf. Process. Syst. 2021, 34, 24206–24221. [Google Scholar]
- Huang, L.; Tan, J.; Liu, J.; Yuan, J. Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation. In Proceedings of the ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; Volume 12370. [Google Scholar]
- Miech, A.; Alayrac, J.B.; Laptev, I.; Sivic, J.; Zisserman, A. Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; Volume 12346. [Google Scholar]
- Kortylewski, A.; Liu, Q.; Wang, A.; Sun, Y.; Yuille, A. Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion. Int. J. Comput. Vis. 2021, 129, 736–760. [Google Scholar] [CrossRef]
- Li, J.; Bian, S.; Zeng, A.; Wang, C.; Pang, B.; Liu, W.; Lu, C. Human Pose Regression with Residual Log-Likelihood Estimation. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Zhang, Y.; Wang, Y.; Camps, O.; Sznaier, M. Key Frame Proposal Network for Efficient Pose Estimation in Videos. In Proceedings of the ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; Volume 12362. [Google Scholar]
- Ning, G.; Liu, P.; Fan, X.; Zhang, C. A Top-down Approach to Articulated Human Pose Estimation and Tracking. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Volume 11130. [Google Scholar]
- Zhang, J.; Zhu, Z.; Lu, J.; Huang, J.; Huang, G.; Zhou, J. SIMPLE: SIngle-Network with Mimicking and Point Learning for Bottom-up Human Pose Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021. [Google Scholar]
- Luo, Z.; Golestaneh, S.A.; Kitani, K.M. 3D Human Motion Estimation via Motion Compression and Refinement. In Proceedings of the 15th Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020; Volume 12626. [Google Scholar]
- Clark, R.; Wang, S.; Markham, A.; Trigoni, N.; Wen, H. VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Phon-Amnuaisuk, S.; Murata, K.T.; Kovavisaruch, L.O.; Lim, T.H.; Pavarangkoon, P.; Mizuhara, T. Visual-Based Positioning and Pose Estimation. In Proceedings of the Communications in Computer and Information Science, Valletta, Malta, 25–27 February 2020; Volume 1332. [Google Scholar]
- Tao, C.; Jiang, Q.; Duan, L.; Luo, P. Dynamic and Static Context-Aware LSTM for Multi-Agent Motion Prediction. In Proceedings of the ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; Volume 12366. [Google Scholar]
- Singh, G.; Cuzzolin, F. Recurrent Convolutions for Causal 3D CNNs. In Proceedings of the 2019 International Conference on Computer Vision Workshop, ICCVW 2019, Seoul, Korea, 27–28 October 2019. [Google Scholar]
- Shu, X.; Zhang, L.; Qi, G.J.; Liu, W.; Tang, J. Spatiotemporal Co-Attention Recurrent Neural Networks for Human-Skeleton Motion Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3300–3315. [Google Scholar] [CrossRef] [PubMed]
- Raaj, Y.; Idrees, H.; Hidalgo, G.; Sheikh, Y. Efficient Online Multi-Person 2D Pose Tracking with Recurrent Spatio-Temporal Affinity Fields. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Liu, Z.; Chen, H.; Feng, R.; Wu, S.; Ji, S.; Yang, B.; Wang, X. Deep Dual Consecutive Network for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
- Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-Aware Coordinate Representation for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Luvizon, D.C.; Tabia, H.; Picard, D. Human Pose Regression by Combining Indirect Part Detection and Contextual Information. Comput. Graph. 2019, 85, 15–22. [Google Scholar] [CrossRef]
- Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient Object Localization Using Convolutional Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016. [Google Scholar]
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Volume 9910. [Google Scholar]
- Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–23 June 2018. [Google Scholar]
- Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Volume 11210. [Google Scholar]
- Su, Z.; Ye, M.; Zhang, G.; Dai, L.; Sheng, J. Cascade Feature Aggregation for Human Pose Estimation. arXiv 2019, arXiv:1902.07837. [Google Scholar]
- Golda, T.; Kalb, T.; Schumann, A.; Beyerer, J. Human Pose Estimation for Real-World Crowded Scenarios. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2019, Taipei, Taiwan, 18–21 September 2019. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Volume 11210. [Google Scholar]
- Wei, F.; Sun, X.; Li, H.; Wang, J.; Lin, S. Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation. In Proceedings of the ECCV 2020 16th European Conference, Glasgow, UK, 23–28 August 2020; Volume 12355. [Google Scholar]
- Papandreou, G.; Zhu, T.; Chen, L.-C.; Gidaris, S.; Tompson, J.; Murphy, K. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Tian, Z.; Chen, H.; Shen, C. DirectPose: Direct End-to-End Multi-Person Pose Estimation. arXiv 2019, arXiv:1911.07451. [Google Scholar]
- Nie, X.; Feng, J.; Zhang, J.; Yan, S. Single-Stage Multi-Person Pose Machines. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
- Newell, A.; Huang, Z.; Deng, J. Associative Embedding: End-to-End Learning for Joint Detection and Grouping. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Zhou, X.; Wang, D.; Krähenbühl, P. CenterNet: Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
- Liu, Z.; Feng, R.; Chen, H.; Wu, S.; Gao, Y.; Gao, Y.; Wang, X. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
Models | Remarks | Limitations |
---|---|---|
DeepPose [44] | A model was created to study the results of jointly training a multi-staged framework with repeated intermediate inspection. | Regressing to a location is extremely difficult, increasing the complexity of the learning and reducing generalization. |
ConvNet Pose [45] | Proposed an architecture that generates discrete heatmaps instead of continuous ones. | The architecture lacks structural modeling. |
CPM [46] | Integrated convolutional networks into pose machines, allowing them to learn image features and image-dependent spatial models for estimating human poses. | Vulnerable when multiple individuals are close together, computationally costly, and, if person detection fails, there is no possibility of recovery. |
Stacked-Hglass [47] | Utilized repeated bottom-up/top-down processing with intermediate supervision to improve the network’s performance. | Very large number of parameters, and the loss functions become highly complex. |
DeeperCut [48] | Introduced strong body part detectors to produce effective bottom-up proposals for body joints, and utilized the deep ResNet for human pose estimation. | The pairwise representations are very hard to regress. |
PAF [49] | Proposed a model to connect human body parts via Part Affinity Fields (PAF), a non-parametric method, to achieve bottom-up pose estimation. | Grouping body parts is very challenging when there is a large overlap between people. |
CPN [50] | Proposed a CPN structure composed of GlobalNet and RefineNet. Easy keypoints are estimated by the GlobalNet, while the estimation of hard keypoints is performed by RefineNet. | High computational costs, and vulnerable when multiple individuals are nearby. |
SB [51] | Introduced an intuitive and simplified architecture consisting of a few deconvolutional layers added at the end of a ResNet to estimate the keypoint heatmaps. | High computational cost, and vulnerable when multiple individuals are close together. |
HRNet [8] | Proposed an innovative and intuitive method to keep a high-resolution representation throughout the process. | Fails to capture long-range interactions between joints, and has high computational complexity. |
CFA [52] | Cascades multiple hourglass networks and aggregates high-, medium-, and low-level features to better capture global semantic and local detailed information. | If person detection fails, there is no possibility of recovery, and it has a high computational cost. |
occNet [53] | Introduced two occlusion detection networks, namely Occlusion Net (OccNet) and Occlusion Net Cross Branch (OccNetCB), to perform pose estimation for all detected persons. | Suffers from early commitment; hence, if the detection of an individual person fails, recovery becomes very difficult. |
Dark [42] | Identified the design limitations of the existing standard coordinate-decoding method and introduced a principled distribution-aware decoding method. | Encounters the problem of sub-pixel localization. |
Method | Backbone | Epochs | Head | Shou | Elbow | Wrist | Hip | Knee | Ankle | Mean
---|---|---|---|---|---|---|---|---|---|---
CPM [46] | CPM | 200 | 96.2 | 95.0 | 87.2 | 82.2 | 87.6 | 82.7 | 78.4 | 87.7
SBL [51] | Res-152 | 200 | 97.0 | 95.9 | 90.3 | 85.0 | 89.2 | 85.3 | 81.3 | 89.6
Integral [57] | Res-101 | 200 | - | - | - | - | - | - | - | 87.3
PRTR [11] | HRNet-W32 | 200 | 97.3 | 96.0 | 90.6 | 84.5 | 89.7 | 85.5 | 79.0 | 89.5
 | HRNet-W32 | 50 | 93.3 | 91.4 | 73.5 | 60.0 | 81.0 | 58.1 | 41.7 | 73.2
Cofopose | Res-101 * | 50 | 96.0 | 94.2 | 84.3 | 75.8 | 86.9 | 78.0 | 71.1 | 84.6
 | Res-101 ** | 50 | 97.6 | 95.8 | 90.5 | 84.9 | 89.8 | 85.1 | 79.1 | 89.6
 | Res-101 ** | 75 | 97.9 | 96.2 | 90.3 | 85.3 | 90.3 | 85.7 | 80.4 | 90.1
 | Res-152 * | 50 | 96.8 | 94.5 | 85.2 | 77.3 | 88.8 | 78.8 | 73.4 | 85.6
 | Res-152 ** | 50 | 97.1 | 95.5 | 88.6 | 82.3 | 88.6 | 82.5 | 75.5 | 87.9
 | HRNet-W32 ** | 50 | 96.5 | 94.0 | 84.8 | 77.1 | 87.3 | 77.1 | 79.0 | 84.5
Performance Gain | | | +0.6 | +0.2 | | +0.3 | +0.6 | +0.2 | | +0.5
Method | Backbone | Input | #Params | GFLOPs | AP | AP50 | AP75 | APM | APL | AR
---|---|---|---|---|---|---|---|---|---|---
H-B** | | | | | | | | | |
8-stage Hglass [47] | Hglass-8 stacked | 256 × 192 | 25.1 M | 14.3 | 66.9 | - | - | - | - | -
CPN [50] | Res-50 | 256 × 192 | 27.0 M | 6.20 | 68.6 | - | - | - | - | -
SB [51] | Res-50 | 384 × 288 | 34.0 M | 18.6 | 72.2 | 89.3 | 78.9 | 68.1 | 79.7 | 77.6
SB [51] | Res-101 | 384 × 288 | 53.0 M | 26.7 | 73.6 | 69.9 | 80.3 | 79.1 | 81.1 | 79.1
R-B** | | | | | | | | | |
PointSetNet [58] | ResNeXt-101-DCN | - | - | - | 65.7 | 85.4 | 71.8 | - | - | -
 | HRNet-W48 | - | - | - | 69.8 | 88.8 | 76.3 | - | - | -
PRTR [11] | HRNet-W32 | 512 × 384 | 57.2 M | 37.8 | 73.3 | 89.2 | 79.9 | 69.0 | 80.9 | 80.2
Cofopose | Res-50 | 384 × 288 | 39.2 M | 10.2 | 69.3 | 89.4 | 76.3 | 64.0 | 77.1 | 76.9
 | Res-50 | 512 × 384 | 40.4 M | 17.7 | 71.9 | 90.4 | 79.1 | 67.3 | 79.9 | 79.1
 | Res-101 | 512 × 384 | 59.3 M | 32.3 | 73.1 | 90.4 | 80.3 | 68.4 | 80.8 | 80.1
 | HRNet-W32 | 384 × 288 | 56.0 M | 20.7 | 74.1 | 90.3 | 80.8 | 69.9 | 81.3 | 80.9
 | HRNet-W32 | 512 × 384 | 56.0 M | 36.9 | 74.2 | 90.2 | 81.0 | 70.1 | 81.8 | 81.3
Performance Gain (R-B**) | | | | | +0.9 | +1.2 | +1.1 | +1.1 | +0.9 | +1.1
Performance Gain (H-B**) | | | | | +0.6 | +1.1 | +0.7 | | +0.7 | +2.2
Method | Backbone | Input | #Params | GFLOPs | AP | AP50 | AP75 | APM | APL | AR
---|---|---|---|---|---|---|---|---|---|---
H-B*** | | | | | | | | | |
Mask R-CNN [62] | Res-50 | - | - | - | 63.1 | 87.3 | 68.7 | 57.8 | 71.4 | -
G-RMI [10] | Res-50 | 353 × 257 | 42.6 M | 57.0 | 64.9 | 85.5 | 71.3 | 62.3 | 70.0 | 69.7
Assoc. Embed. [63] | Hglass-4 stack | - | - | - | 65.5 | 86.8 | 72.3 | 60.6 | 72.6 | 70.2
PifPaf [49] | Res-101 | - | - | - | 65.5 | - | - | 62.4 | 72.9 | -
PersonLab [59] | Res-101 | - | - | - | 65.5 | 87.1 | 71.4 | 61.3 | 71.5 | 70.1
HigherHRNet [7] | HRNet-W48 | - | - | - | 70.5 | 89.3 | 77.2 | 66.6 | 75.8 | 74.9
CPN [50] | ResNet-Inception | 384 × 288 | - | - | 72.1 | 91.4 | 80.0 | 68.7 | 77.2 | 78.5
SB [51] | Res-152 | 384 × 288 | 68.6 M | 35.6 | 73.7 | 91.9 | 81.1 | 70.3 | 80.0 | 79.0
Dark [42] | HRNet-W48 | 384 × 288 | 63.6 M | 32.9 | 76.2 | 92.5 | 83.6 | 72.5 | 82.4 | 81.1
R-B*** | | | | | | | | | |
CenterNet [64] | Hglass-2 stack | - | - | - | 63.0 | 86.8 | 69.6 | 58.9 | 70.4 | -
DirectPose [60] | Res-101 | - | - | - | 63.3 | 86.7 | 69.4 | 57.8 | 71.2 | -
SPM [61] | Hglass-8 stack | 384 × 384 | - | - | 66.9 | 88.5 | 72.9 | 62.6 | 73.1 | -
Integral [11,57] | Res-101 | 256 × 256 | 45.0 M | 11.0 | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 | -
PointSetNet [58] | HRNet-W48 | - | - | - | 68.7 | 89.9 | 76.3 | 64.8 | 75.3 | -
PRTR [11] | HRNet-W32 | 512 × 384 | 57.2 M | 37.8 | 72.1 | 90.4 | 79.6 | 68.1 | 79.0 | 79.4
Cofopose | Res-101 | 384 × 288 | 58.9 M | 18.3 | 69.9 | 91.0 | 77.8 | 65.7 | 76.9 | 77.5
 | HRNet-W32 | 384 × 288 | 56.1 M | 21.0 | 72.8 | 91.5 | 80.7 | 68.7 | 79.3 | 79.7
 | HRNet-W32 | 512 × 384 | 56.1 M | 36.9 | 74.1 | 91.3 | 80.7 | 69.0 | 80.1 | 80.3
Performance Gain (R-B***) | | | | | +2.0 | +1.1 | +1.1 | +0.9 | +1.1 | +0.9
Method | AP | Inference Speed (FPS) |
---|---|---|
HRNet-W48 | 73.3 | 27 |
HRNet-W32 | 72.5 | 28 |
TransPose-H | 74.2 | 38 |
Cofopose | 74.2 | 36 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).