A Spatial-Temporal Multi-Feature Network (STMF-Net) for Skeleton-Based Construction Worker Action Recognition
Abstract
1. Introduction
- The study introduces the Spatial-Temporal Multi-Feature Network (STMF-Net), which combines GCN and TCN models to learn spatial-temporal feature sequences. Through the GCN, the network aggregates node features from their neighbors at the spatial level; through the stacked TCN, it continuously extracts sequence features at the temporal level. Fusing the two helps the model learn robust features and boosts recognition accuracy.
- For spatial features, we designed four hierarchical skeleton topologies (Body-level, Part2-level, Part5-level, Joint-level) and used a graph convolutional network to extract features from each. In particular, an innovative joint-level structure is proposed: the root joint is selected as the center and every other joint is connected directly to it, forming a star-like topology. This graph significantly reduces the distance between nodes and captures more detailed features.
- The study adopts a spatial-temporal two-step fusion strategy in place of a naive six-stream direct fusion, balancing the independent learning of each feature stream against adequate correlation in the fused stream to achieve the best fusion performance.
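The joint-level star topology described above can be sketched with a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the joint count, the choice of joint 0 as the root, and the feature dimensions are assumptions, and a single symmetric-normalized graph convolution stands in for the full GCN.

```python
import numpy as np

# Hypothetical 17-joint skeleton; joint 0 is assumed to be the root
# (e.g., the pelvis). The star topology connects every other joint
# directly to the root, so any two joints are at most two hops apart.
NUM_JOINTS = 17
ROOT = 0

def star_adjacency(num_joints: int, root: int) -> np.ndarray:
    """Adjacency matrix of the star topology, with self-loops."""
    A = np.eye(num_joints)          # self-loops on every joint
    A[root, :] = 1.0                # root connects to all joints
    A[:, root] = 1.0
    return A

def normalized_graph_conv(X: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One symmetric-normalized GCN layer: D^{-1/2} A D^{-1/2} X W."""
    D = np.diag(A.sum(axis=1) ** -0.5)
    return D @ A @ D @ X @ W

A = star_adjacency(NUM_JOINTS, ROOT)
X = np.random.randn(NUM_JOINTS, 3)   # per-joint 3D coordinates for one frame
W = np.random.randn(3, 64)           # layer weights (random stand-ins here)
H = normalized_graph_conv(X, A, W)
print(H.shape)  # → (17, 64)
```

Because every joint is one hop from the root, a single aggregation step already mixes information from the whole skeleton, which is the intuition behind the claim that the star graph reduces inter-node distances.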
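The two-step fusion strategy can likewise be sketched as score-level fusion. Everything here is a hedged illustration under stated assumptions: the class scores are random placeholders, averaging is used for step one, and `alpha` is a hypothetical balancing weight; the paper's actual fusion operator is not reproduced.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_classes = 10  # assumed number of worker-action classes

# Step 1: fuse the four independently trained spatial streams
# (Body, Part2, Part5, Joint) by averaging their class scores.
spatial_scores = [rng.normal(size=num_classes) for _ in range(4)]
spatial_fused = np.mean(spatial_scores, axis=0)

# Step 2: fuse the spatial result with the temporal-stream scores
# using a hypothetical balancing weight alpha.
temporal_scores = rng.normal(size=num_classes)
alpha = 0.6
final_scores = softmax(alpha * spatial_fused + (1 - alpha) * temporal_scores)
pred = int(np.argmax(final_scores))
print(pred)
```

Fusing the spatial streams first lets each stream learn independently before the temporal stream is correlated in, which mirrors the two-step rationale stated above.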
2. Literature Review
2.1. Human Skeleton Model
2.1.1. Human Skeleton Structure
2.1.2. Spatial-Temporal Features of Skeletons
2.2. Features Based on Skeletons
2.2.1. Joint-Based Approaches
2.2.2. Part-Based Approaches
2.3. Skeleton-Based Action Recognition Algorithm
2.3.1. Deep Learning Algorithm
2.3.2. Multi-Stream Neural Network
3. Methodology
3.1. Pipeline of the Proposed Sequence Network
3.2. Input Feature
3.2.1. Intra-Frame Input
3.2.2. Inter-Frame Input
4. Experiment
4.1. Dataset and Implementation Details
4.2. Overall Performance of Multi-Stream Network
4.2.1. The Overall Performance of Single Stream
4.2.2. The Overall Performance of Fusion Spatial Stream
4.2.3. The Performance of the Spatial-Temporal Fusion Data Stream
4.2.4. Comparison with Other State-of-the-Art Methods
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Disclaimer/Publisher’s Note: The statements, opinions, and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions, or products referred to in the content.
| Epoch | Body-Stream Acc (%) | Part2-Stream Acc (%) | Part5-Stream Acc (%) | Joint-Stream Acc (%) |
|---|---|---|---|---|
| 200 | 63.78 | 59.94 | 65.12 | 48.60 |
| 500 | 69.53 | 65.51 | 70.51 | 56.89 |
| 1000 | 72.94 | 68.87 | 73.43 | 59.98 |
| Algorithm | Acc (%) |
|---|---|
| ST-GCN | 79.51 |
| 2-layer LSTM | 77.37 |
| 1-layer LSTM | 76.45 |
| Traditional RNN | 72.71 |
| Our approach | 79.36 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tian, Y.; Lin, S.; Xu, H.; Chen, G. A Spatial-Temporal Multi-Feature Network (STMF-Net) for Skeleton-Based Construction Worker Action Recognition. Sensors 2024, 24, 7455. https://doi.org/10.3390/s24237455