Skeleton-Based Attention Mask for Pedestrian Attribute Recognition Network
Abstract
1. Introduction
- The proposed method presents a soft attention mask formulated from skeleton data, which is insensitive to variations in human posture (see the sketch after this list).
- Besides local features from the soft attention model, features from neighboring background regions are retained to handle various viewpoints and postures.
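As a rough illustration of the first idea, the sketch below builds a soft attention mask by placing a 2D Gaussian at each detected skeleton joint, so that weights decay smoothly into the surrounding region instead of cutting it off. The helper name, joint coordinates, crop size, and `sigma` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def soft_attention_mask(joints, height, width, sigma=8.0):
    """Build a soft attention mask from skeleton joints.

    joints: iterable of (x, y) keypoint coordinates in pixels.
    sigma:  spread of the Gaussian placed at each joint (assumed value).
    Returns an (height, width) float mask in [0, 1]; pixels far from
    every joint keep small but non-zero weights, so neighboring
    background is attenuated rather than discarded.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    mask = np.zeros((height, width), dtype=np.float32)
    for x, y in joints:
        gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, gauss)  # take the strongest joint response
    return mask

# Hypothetical joints for a 256x128 pedestrian crop.
mask = soft_attention_mask([(64, 30), (40, 80), (88, 80)], height=256, width=128)
```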
2. Related Work
2.1. Pedestrian Attribute Recognition
2.2. Visual Attention Model
2.3. Human Skeleton and Pose Estimation
3. Attention Mask
Index | Joint | Index | Joint
---|---|---|---
0 | Nose | 9 | Left wrist
1 | Left eye | 10 | Right wrist
2 | Right eye | 11 | Left hip
3 | Left ear | 12 | Right hip
4 | Right ear | 13 | Left knee
5 | Left shoulder | 14 | Right knee
6 | Right shoulder | 15 | Left ankle
7 | Left elbow | 16 | Right ankle
8 | Right elbow | |
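This index order matches the 17-keypoint COCO convention produced by common pose estimators such as OpenPose or PifPaf; as a plain constant it reads:

```python
# 17-keypoint order from the table above (COCO convention).
SKELETON_JOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
```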
4. PAR Network Architecture
4.1. Backbone Network
4.2. Human-Part Attention Module
4.3. Classification Layers
5. Training Method
5.1. Network Optimization
5.2. Human Attribute Augmentation
6. Experiment
6.1. Dataset
6.2. Implementation Details
6.3. Evaluation Metrics
6.4. Experimental Results
6.4.1. Overall Performance
6.4.2. Attribute-Level Performance
6.4.3. Time Complexity
7. Discussion
7.1. Surrounding Region
7.2. Occlusion
7.3. Irregular Human Posture
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Class Index | Class Name | Skeleton Joint Indices
---|---|---
1 | Head | 0, 1, 2, 3, 4
2 | Upper body | 5, 6, 7, 8, 9, 10
3 | Lower body | 11, 12, 13, 14
4 | Foot | 15, 16
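Under the same assumptions as the earlier sketch, the grouping in this table could be encoded as a lookup and used to produce one soft mask per part class. `PART_JOINTS` and `part_masks` are illustrative names, and `part_masks` reuses the `soft_attention_mask` helper sketched in the Introduction.

```python
# Joint indices for each part class, copied from the table above.
PART_JOINTS = {
    "head":       [0, 1, 2, 3, 4],
    "upper_body": [5, 6, 7, 8, 9, 10],
    "lower_body": [11, 12, 13, 14],
    "foot":       [15, 16],
}

def part_masks(joints, height, width, sigma=8.0):
    """One soft mask per part class.

    joints: full 17-keypoint list of (x, y) coordinates in table order.
    Reuses soft_attention_mask from the earlier sketch.
    """
    return {part: soft_attention_mask([joints[i] for i in ids], height, width, sigma)
            for part, ids in PART_JOINTS.items()}
```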
Group | Attributes
---|---
Global (G) | gender, age
Head (H) | hair length, muffler, hat, glasses
Upper body (U) | clothes style, logo, casual or formal
Lower body (L) | clothes style, logo, casual or formal
Foot (F) | footwear style
Attachment (At) | backpack, messenger bag, plastic bags
Group | Attributes
---|---
Global (G) | gender, age, body shape, role
Head (H) | hair style, hair color, hat, glasses
Upper body (U) | clothes style, clothes color
Lower body (L) | clothes style, clothes color
Foot (F) | footwear style, footwear color
Attachment (At) | backpack, single shoulder bag, handbag
Action (Ac) | telephoning, gathering, talking, pushing
Network | Recall (%) | Precision (%) | F1-Score (%) | mA (%)
---|---|---|---|---
ResNet-50 | 50.88 | 61.77 | 55.80 | 71.86
ResNet-50 with single mask | 55.83 | 62.12 | 58.81 | 74.19
ResNet-50 with separated mask | 45.05 | 54.71 | 49.41 | 73.33
Inception V3 | 49.82 | 59.72 | 53.95 | 71.21
Inception V3 with single mask | 54.06 | 60.47 | 57.08 | 73.32
Inception V3 with separated mask | 50.04 | 56.68 | 53.15 | 71.25
I-ResNet V2 | 51.66 | 55.41 | 54.45 | 72.00
I-ResNet V2 with single mask | 53.16 | 59.00 | 55.93 | 72.17
I-ResNet V2 with separated mask | 51.23 | 59.90 | 53.47 | 72.08
Network | Recall (%) | Precision (%) | F1-Score (%) | mA (%)
---|---|---|---|---
ResNet-50 | 56.09 | 69.20 | 61.96 | 76.33
ResNet-50 with single mask | 58.02 | 70.80 | 63.77 | 77.16
ResNet-50 with separated mask | 55.43 | 62.92 | 58.94 | 76.90
Inception V3 | 54.05 | 68.85 | 60.56 | 75.26
Inception V3 with single mask | 56.76 | 65.86 | 60.97 | 76.44
Inception V3 with separated mask | 53.15 | 65.12 | 58.53 | 74.46
I-ResNet V2 | 51.55 | 62.27 | 54.77 | 73.51
I-ResNet V2 with single mask | 55.77 | 66.56 | 59.73 | 76.09
I-ResNet V2 with separated mask | 49.34 | 63.43 | 53.43 | 72.40
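The four columns above follow the evaluation protocol common in the pedestrian attribute recognition literature: example-based Recall, Precision, and F1 averaged over test images, and label-based mean accuracy (mA) averaged over attributes. A minimal numpy sketch under that assumption follows; the epsilon guard for images with no positive labels is an implementation choice the protocol leaves unspecified.

```python
import numpy as np

def par_metrics(pred, gt, eps=1e-12):
    """Example-based Recall/Precision/F1 and label-based mA.

    pred, gt: (num_images, num_attributes) binary arrays of
    predictions and ground-truth labels.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = (pred & gt).sum(axis=1)  # per-image true positives
    recall = np.mean(inter / np.maximum(gt.sum(axis=1), eps))
    precision = np.mean(inter / np.maximum(pred.sum(axis=1), eps))
    f1 = np.mean(2 * inter / np.maximum(gt.sum(axis=1) + pred.sum(axis=1), eps))
    # mA: per-attribute balanced accuracy (TPR + TNR) / 2, averaged.
    tpr = (pred & gt).sum(axis=0) / np.maximum(gt.sum(axis=0), eps)
    tnr = (~pred & ~gt).sum(axis=0) / np.maximum((~gt).sum(axis=0), eps)
    ma = np.mean((tpr + tnr) / 2)
    return recall, precision, f1, ma
```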
Network | Global (%) | Head (%) | Upper body (%) | Lower body (%) | Foot (%) | Attachment (%) | Action (%)
---|---|---|---|---|---|---|---
ResNet-50 | 68.46 | 80.44 | 73.10 | 77.88 | 70.93 | 71.60 | 66.32
ResNet-50 with single mask | 71.12 | 80.52 | 75.89 | 82.49 | 73.52 | 74.05 | 67.22
ResNet-50 with separated mask | 66.65 | 72.14 | 70.90 | 81.19 | 71.17 | 63.70 | 60.95
Inception V3 | 68.61 | 77.14 | 72.43 | 79.90 | 68.74 | 70.25 | 66.15
Inception V3 with single mask | 70.98 | 77.63 | 75.40 | 82.28 | 71.55 | 73.14 | 66.57
Inception V3 with separated mask | 69.81 | 75.61 | 73.96 | 83.78 | 70.73 | 66.70 | 63.43
I-ResNet V2 | 68.43 | 77.99 | 73.60 | 79.76 | 68.21 | 72.66 | 67.62
I-ResNet V2 with single mask | 71.66 | 75.83 | 74.50 | 78.59 | 68.37 | 72.40 | 66.17
I-ResNet V2 with separated mask | 68.59 | 75.31 | 73.07 | 82.13 | 70.80 | 72.82 | 66.53
Network | Global (%) | Head (%) | Upper body (%) | Lower body (%) | Foot (%) | Attachment (%)
---|---|---|---|---|---|---
ResNet-50 | 74.68 | 69.89 | 80.76 | 78.36 | 71.23 | 78.57
ResNet-50 with single mask | 75.25 | 72.29 | 80.47 | 78.94 | 72.84 | 79.39
ResNet-50 with separated mask | 74.53 | 68.96 | 79.17 | 78.64 | 70.97 | 77.45
Inception V3 | 74.76 | 68.79 | 79.54 | 77.73 | 69.73 | 77.66
Inception V3 with single mask | 74.81 | 69.42 | 79.86 | 79.74 | 71.48 | 78.74
Inception V3 with separated mask | 74.13 | 67.67 | 78.53 | 76.38 | 69.69 | 77.36
I-ResNet V2 | 73.12 | 67.12 | 77.83 | 75.11 | 68.96 | 75.99
I-ResNet V2 with single mask | 74.80 | 69.49 | 80.65 | 78.97 | 70.12 | 78.14
I-ResNet V2 with separated mask | 73.23 | 66.20 | 76.34 | 73.98 | 68.05 | 75.00
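If the per-group values in the two tables above are obtained by restricting the evaluation to each group's attribute columns, that restriction could look like the sketch below, which reuses `par_metrics` from the earlier example; the column indices are purely illustrative and do not reflect the datasets' real column layout.

```python
# Illustrative attribute-column indices per group; the real datasets'
# column layout is not specified here.
ATTRIBUTE_GROUPS = {
    "global": [0, 1],
    "head":   [2, 3, 4, 5],
    # ... remaining groups follow the same pattern
}

def groupwise_scores(pred, gt, groups):
    """Per-group mA via par_metrics, restricted to each group's columns."""
    return {name: par_metrics(pred[:, cols], gt[:, cols])[3]
            for name, cols in groups.items()}
```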
Network | Without Mask (fps) | With Single Mask (fps) | With Separated Mask (fps)
---|---|---|---
ResNet-50 | 38.85 | 37.26 | 36.49
Inception V3 | 36.49 | 35.17 | 34.80
I-ResNet V2 | 35.75 | 33.89 | 33.43
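Frame rates like those above are typically measured by timing repeated forward passes and dividing the iteration count by the elapsed time. A minimal sketch follows; `infer` stands for one single-frame forward pass of the network under test, and the warm-up and iteration counts are arbitrary choices, not the paper's settings.

```python
import time

def measure_fps(infer, num_iters=200, warmup=20):
    """Average throughput (frames per second) of a zero-argument
    inference callable that processes one frame per call."""
    for _ in range(warmup):      # warm-up: stabilize caches and lazy init
        infer()
    start = time.perf_counter()
    for _ in range(num_iters):
        infer()
    return num_iters / (time.perf_counter() - start)
```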