TFC-GCN: Lightweight Temporal Feature Cross-Extraction Graph Convolutional Network for Skeleton-Based Action Recognition
Abstract
1. Introduction
- Richer input features are extracted during preprocessing. The relative displacements of the joints across three frames are introduced as a new type of input feature. Together with the relative positions of the joints and the geometric information of the bones (lengths and angles), they form a three-branch feature extraction structure in the data preprocessing phase, and the extracted features are concatenated in series immediately afterwards. This combination outperforms using velocity (the two-frame displacement) as an input feature; a rough sketch of these branches is given after this list;
- A temporal feature cross-extraction convolution block is proposed to extract high-level features from temporal information. It consists of a convolution block that cross-extracts temporal features and a gated CNN unit that filters information;
- A stitching spatial–temporal attention (SST-Att) block is proposed. SST-Att not only applies spatial attention over the joints but also attends to the salient information of joints along the temporal sequence, so that critical details at both scales can be further distinguished and extracted;
- Three large public datasets (NTU RGB+D 60, NTU RGB+D 120 and UAV-Human) are used to train and evaluate TFC-GCN in extensive experiments, and competitive accuracies are obtained.
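To make the three-branch preprocessing concrete, the following is a minimal PyTorch sketch of how such input branches could be computed from raw joint coordinates. Only the general recipe (relative positions, multi-frame displacements, bone lengths and angles) comes from the paper; the parent table, reference joint and padding scheme are our assumptions for illustration.

```python
import torch

# Hypothetical 25-joint parent table (NTU-style skeleton); the paper's exact
# bone pairing is not reproduced here.
PARENTS = [1, 20, 20, 2, 20, 4, 5, 6, 20, 8, 9, 10,
           0, 12, 13, 14, 0, 16, 17, 18, 1, 7, 7, 11, 11]

def build_input_branches(x, center=20, eps=1e-6):
    """Sketch of the input branches (J, C, B) plus the velocity V.

    x: (N, 3, T, V) joint coordinates for a single person.
    """
    parent = torch.tensor(PARENTS, device=x.device)

    # J: joint positions relative to a reference joint (assumed: spine, index 20).
    J = x - x[:, :, :, center:center + 1]

    # V: two-frame relative displacement (velocity), zero-padded at the start.
    V = torch.zeros_like(x)
    V[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]

    # C: relative displacement across three frames (the paper's new feature;
    # this centered-difference padding is our assumption).
    C = torch.zeros_like(x)
    C[:, :, 1:-1] = x[:, :, 2:] - x[:, :, :-2]

    # B: bone vectors plus their lengths and per-axis direction cosines.
    bone = x - x[:, :, :, parent]
    length = bone.norm(dim=1, keepdim=True)        # (N, 1, T, V)
    angle = bone / (length + eps)                  # cosine w.r.t. each axis
    B = torch.cat([bone, length, angle], dim=1)    # (N, 7, T, V)

    return J, V, C, B
```

The J/V/B/C naming matches the input-feature ablation in Section 4.4.1, where the final model concatenates the J, B and C branches.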
2. Related Work
3. Methodology
3.1. Graph Convolutional Network
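The network builds on the standard skeleton graph convolution popularized by ST-GCN [5]. As a textbook-style reminder (not the paper's exact layer), a spatial graph convolution mixes joint features along a normalized adjacency and then across channels:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Minimal ST-GCN-style spatial graph convolution (generic form).

    Features are propagated along the skeleton graph with a symmetrically
    normalized adjacency, then mixed across channels by a 1x1 convolution.
    """
    def __init__(self, in_c, out_c, A):
        super().__init__()
        # D^{-1/2} (A + I) D^{-1/2}: add self-loops, then normalize.
        A_hat = A + torch.eye(A.size(0))
        d = A_hat.sum(dim=1)
        self.register_buffer("A_norm",
                             A_hat / torch.sqrt(d[:, None] * d[None, :]))
        self.conv = nn.Conv2d(in_c, out_c, 1)

    def forward(self, x):  # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A_norm)
        return self.conv(x)
```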
3.2. Preprocessing
3.3. Temporal Feature Cross-Extraction Convolution
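As described in the contribution list, this block pairs a temporal cross-extraction convolution with a gated CNN unit that filters information. Below is a speculative PyTorch sketch of one such pairing; the kernel sizes, the cross pairing and the gating order are our assumptions, not the paper's verified design.

```python
import torch
import torch.nn as nn

class GatedTemporalCross(nn.Module):
    """Speculative sketch: two temporal convolutions with different receptive
    fields cross-weight each other's output, then a GLU-style gate filters
    the result. The paper's exact wiring is not reproduced here.
    """
    def __init__(self, channels, k1=3, k2=5):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, (k1, 1), padding=(k1 // 2, 0))
        self.conv_b = nn.Conv2d(channels, channels, (k2, 1), padding=(k2 // 2, 0))
        self.gate = nn.Conv2d(channels, channels, (k1, 1), padding=(k1 // 2, 0))

    def forward(self, x):                                      # x: (N, C, T, V)
        a, b = self.conv_a(x), self.conv_b(x)
        crossed = a * torch.sigmoid(b) + b * torch.sigmoid(a)  # cross pairing
        return crossed * torch.sigmoid(self.gate(x))           # information filter
```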
3.4. The Stitching Spatial–Temporal Attention Mechanism
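Per the contribution list, SST-Att attends to joints spatially and to salient joint information along the temporal sequence. A hedged sketch of one way such spatial and temporal attention maps could be computed and "stitched" follows; the pooling and gating choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SSTAttention(nn.Module):
    """Hedged sketch of a stitching spatial-temporal attention block.

    A joint (spatial) map and a frame (temporal) map are computed from
    pooled features and applied jointly; the paper's exact "stitching"
    operation is an assumption here.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        inner = max(channels // reduction, 1)
        self.joint_fc = nn.Sequential(
            nn.Conv1d(channels, inner, 1), nn.ReLU(),
            nn.Conv1d(inner, channels, 1), nn.Sigmoid())
        self.frame_fc = nn.Sequential(
            nn.Conv1d(channels, inner, 1), nn.ReLU(),
            nn.Conv1d(inner, channels, 1), nn.Sigmoid())

    def forward(self, x):                       # x: (N, C, T, V)
        a_joint = self.joint_fc(x.mean(dim=2))  # (N, C, V): pool over time
        a_frame = self.frame_fc(x.mean(dim=3))  # (N, C, T): pool over joints
        # "Stitch": apply both attention maps to the same feature tensor.
        return x * a_joint.unsqueeze(2) * a_frame.unsqueeze(3)
```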
3.5. Evaluation of the Model's Lightweightness
4. Experimental Results
4.1. Datasets
4.2. Implementation Details
4.3. Comparisons with SOTA Methods
4.3.1. NTU RGB+D 60 and NTU RGB+D 120
4.3.2. UAV-Human
4.3.3. Model Performance
4.4. Ablation Experiment
4.4.1. Input Features
4.4.2. Temporal Feature Cross-Extraction Convolution
4.5. Visualization of Results
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cao, Z.; Simon, T.; Wei, S. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
- Zhang, Z. Microsoft Kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10.
- Vemulapalli, R.; Arrate, F.; Chellappa, R. Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 588–595.
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Thakkar, K.; Narayanan, P. Part-based graph convolutional network for action recognition. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018.
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1227–1236.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7912–7921.
- Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362.
- Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 597–600.
- Kim, T.; Reiter, A. Interpretable 3D human action analysis with temporal convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1623–1631.
- Yang, F.; Wu, Y.; Sakti, S.; Nakamura, S. Make skeleton-based action recognition model smaller, faster and better. In Proceedings of the ACM Multimedia Asia, Beijing, China, 16–18 December 2019; pp. 1–6.
- Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton-based action recognition using regularized deep LSTM networks. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Wang, H.; Wang, L. Beyond joints: Learning representations from primitive geometries for skeleton-based action recognition and detection. IEEE Trans. Image Process. 2018, 27, 4382–4394.
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126.
- Li, L.; Zheng, W.; Zhang, Z.; Huang, Y.; Wang, L. Skeleton-based relational modeling for action recognition. arXiv 2018, arXiv:1805.02556.
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3595–3603.
- Song, Y.; Zhang, Z.; Wang, L. Richly activated graph convolutional network for action recognition with incomplete skeletons. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1–5.
- Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13339–13348.
- Hou, R.; Wang, Z.; Ren, R.; Cao, Y.; Wang, Z. Multi-channel network: Constructing efficient GCN baselines for skeleton-based action recognition. Comput. Graph. 2023, 110, 111–117.
- Song, Y.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1625–1633.
- Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1109–1118.
- Song, Y.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488.
- Lv, W.; Zhou, Y. Human action recognition based on multi-scale feature augmented graph convolutional network. In Proceedings of the 6th International Conference on Innovation in Artificial Intelligence (ICIAI), New York, NY, USA, 4–6 March 2022; pp. 112–118.
- Dang, R.; Liu, C.; Liu, M.; Chen, Q. Channel attention and multi-scale graph neural networks for skeleton-based action recognition. AI Commun. 2022, 35, 187–205.
- Kong, J.; Bian, Y.; Jiang, M. MTT: Multi-scale temporal Transformer for skeleton-based action recognition. IEEE Signal Process. Lett. 2022, 29, 528–532.
- Chan, W.; Tian, Z.; Wu, Y. GAS-GCN: Gated action-specific graph convolutional networks for skeleton-based action recognition. Sensors 2020, 20, 3499.
- Jang, S.; Lee, H.; Cho, S.; Woo, S.; Lee, S. Ghost graph convolutional network for skeleton-based action recognition. In Proceedings of the IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Seoul, Republic of Korea, 1–3 November 2021; pp. 1–4.
- Shahroudy, A.; Liu, J.; Ng, T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.; Kot, A. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701.
- Li, T.; Liu, J.; Zhang, W.; Ni, Y.; Wang, W.; Li, Z. UAV-Human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16266–16275.
- Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118.
- Huang, L.; Huang, Y.; Ouyang, W.; Wang, L. Part-level graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11045–11052.
- Peng, W.; Hong, X.; Chen, H.; Zhao, G. Learning graph convolutional network for skeleton-based human action recognition by neural searching. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 2669–2676.
- Cai, J.; Jiang, N.; Han, X.; Jia, K.; Lu, J. JOLO-GCN: Mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 2734–2743.
- Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 55–63.
- Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; pp. 1113–1122.
- Asif, U.; Mehta, D.; Cavallar, S.; Tang, J.; Harrer, S. DeepActsNet: A deep ensemble framework combining features from face, hands, and body for action recognition. Pattern Recognit. 2023, 139, 109484.
- Kim, S.-B.; Jung, C.; Kim, B.-I.; Ko, B.C. Lightweight semantic-guided neural networks based on single head attention for action recognition. Sensors 2022, 22, 9249.
- Bai, Z.; Xu, H.; Zhang, H.; Gu, H. Multi-person interaction action recognition based on co-occurrence graph convolutional network. In Proceedings of the 34th Chinese Control and Decision Conference, Hefei, China, 21–23 May 2022; pp. 5030–5035.
- Cao, Y.; Liu, C.; Huang, Z. Skeleton-based action recognition with temporal action graph and temporal adaptive graph convolution structure. Multimed. Tools Appl. 2021, 80, 29139–29162.
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 183–192.
- Yang, S.; Wang, X.; Gao, L.; Song, J. MKE-GCN: Multi-modal knowledge embedded graph convolutional network for skeleton-based action recognition in the wild. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
- Tan, Z.; Zhu, Y.; Liu, B. Learning spatial-temporal feature with graph product. Signal Process. 2023, 210, 109062.
- Li, T.; Liu, J.; Zhang, W.; Duan, L. HARD-Net: Hardness-aware discrimination network for 3D early activity prediction. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 420–436.
Comparison with SOTA methods on NTU RGB+D 60 (X-Sub, X-View) and NTU RGB+D 120 (X-Sub120, X-Set120):

Model | Param. (M) | X-Sub (%) | X-View (%) | X-Sub120 (%) | X-Set120 (%)
---|---|---|---|---|---
ST-GCN [5] | 3.10 * | 81.5 | 88.3 | 70.7 * | 73.2 * |
SR-TSL [34] | 19.07 * | 84.8 | 92.4 | 74.1 * | 79.9 * |
RA-GCN [20] | 6.21 * | 87.3 | 93.6 | 81.1 | 82.7 |
AS-GCN [19] | 9.05 * | 86.8 | 94.2 | 77.9 * | 78.5 * |
2s-AGCN [8] | 6.94 * | 88.5 | 95.1 | 82.5 * | 84.2 * |
AGC-LSTM [7] | 22.89 | 89.2 | 95.0 | — | — |
DGNN [9] | 26.24 | 89.9 | 96.1 | — | — |
SGN [24] | 0.69 | 89.0 | 94.5 | 79.2 | 81.5 |
Ghost-GCN [30] | 0.51 | 89.0 | 94.6 | — | — |
PL-GCN [35] | 20.70 | 89.2 | 95.0 | — | — |
NAS-GCN [36] | 6.57 | 89.4 | 95.7 | — | — |
ResGCN-N51 [23] | 0.77 | 89.1 | 93.5 | 84.0 | 84.2 |
JOLO-GCN [37] | 10.42 | 93.8 | 98.1 | 87.6 | 89.7 |
CTR-GCN [21] | 1.46 | 92.4 | 96.8 | 88.9 | 90.6 |
Dynamic-GCN [38] | 14.40 | 91.5 | 96.0 | 87.3 | 88.6 |
MST-GCN [39] | 12.00 | 91.5 | 96.6 | 87.5 | 88.8 |
EfficientGCN-B0 [25] | 0.29 | 90.2 | 94.9 | 86.6 | 85.0 |
DeepActsNet [40] | 19.4 | 94.0 | 97.4 | 89.3 | 88.4 |
SGN-SHA [41] | 0.32 | 87.5 | 92.6 | 78.5 | 87.1 |
CGCN [42] | — | 92.9 | 96.0 | — | — |
ST-AGCN [43] | — | 88.2 | 94.3 | — | — |
TFC-GCN (ours) | 0.18 | 87.9 | 91.5 | 83.0 | 81.6 |
Comparison on the UAV-Human dataset (the final column is the lightweight evaluation score of Section 3.5):

Model | Param. (M) | CSv1 (%) | CSv2 (%) | Score
---|---|---|---|---
DGNN [9] | 26.24 | 29.9 | — | 0.76 |
ST-GCN [5] | 3.10 | 30.3 | 56.1 | 2.38 |
2s-AGCN [8] | 6.94 | 34.5 | 66.7 | 1.79 |
CTR-GCN [21] | 1.46 | 41.7 | — | 3.79 |
EfficientGCN-B0 [25] | 0.29 | 39.2 | 63.2 | 4.91 |
Shift-GCN [44] | — | 38.0 | 67.0 | — |
MKE-GCN [45] | 1.46 | 43.8 | — | 3.43 |
STGPCN [46] | 1.70 | 41.5 | 67.8 | 3.24 |
HARD-Net [47] | — | 36.97 | — | — |
TFC-GCN (ours) | 0.18 | 39.6 | 64.7 | 5.40 |
Overall model performance (accuracy on NTU RGB+D 60 X-Sub; the final column is the lightweight evaluation score of Section 3.5):

Model | Param. (M) | Acc. (%) | FLOPs (G) | Score
---|---|---|---|---
ST-GCN [5] | 3.10 * | 81.5 | 16.32 * | 3.31 |
SR-TSL [34] | 19.07 * | 84.8 | 4.20 * | 1.70 |
RA-GCN [20] | 6.21 * | 87.3 | 32.80 * | 2.71 |
AS-GCN [19] | 9.50 * | 86.8 | 26.76 * | 2.32 |
2s-AGCN [8] | 6.94 * | 88.5 | 37.32 * | 2.62 |
SGN [24] | 0.69 | 89.0 | — | 4.86 |
Ghost-GCN [30] | 0.51 | 89.0 | — | 5.16 |
PL-GCN [35] | 20.70 | 89.2 | — | 1.67 |
NAS-GCN [36] | 6.57 | 89.4 | — | 2.68 |
ResGCN-N51 [23] | 0.77 | 89.1 | — | 4.76 |
JOLO-GCN [37] | 10.42 | 93.8 | 41.80 | 2.30 |
CTR-GCN [21] | 1.46 | 92.4 | — | 4.16 |
Dynamic-GCN [38] | 14.40 | 91.5 | — | 2.00 |
MST-GCN [39] | 12.00 | 91.5 | — | 2.15 |
EfficientGCN-B0 [25] | 0.29 | 90.2 | 2.73 | 5.74 |
DeepActsNet [40] | 19.4 | 94.0 | 6.6 | 1.76 |
TFC-GCN (ours) | 0.18 | 87.9 | 1.90 | 6.20 |
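For context on how Param. and FLOPs columns like the ones above are typically obtained, here is a small measurement sketch using the `thop` profiler. This illustrates the bookkeeping only, not the authors' actual tooling, and the input shape (1, 3, 300, 25) is an assumed NTU-style single-person sequence.

```python
import torch
from thop import profile  # pip install thop

def count_cost(model, n_frames=300, n_joints=25, n_channels=3):
    """Report parameters (M) and FLOPs (G) for one skeleton sequence."""
    x = torch.randn(1, n_channels, n_frames, n_joints)
    macs, params = profile(model, inputs=(x,), verbose=False)
    # thop counts multiply-accumulates; FLOPs ~= 2 x MACs by one common
    # convention (some papers report MACs directly as FLOPs).
    print(f"Params: {params / 1e6:.2f} M, FLOPs: {2 * macs / 1e9:.2f} G")
```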
Ablation on the input features (J: relative joint positions; V: two-frame relative displacement, i.e., velocity; B: bone lengths and angles; C: three-frame relative displacement):

Input | Param. (M) | Acc. (%)
---|---|---
J | 0.12 | 85.1 |
V | 0.12 | 83.4 |
B | 0.12 | 85.0 |
C | 0.12 | 86.3 |
JVB | 0.18 | 87.6 |
JVC | 0.18 | 86.9 |
VBC | 0.18 | 87.3 |
JBC | 0.18 | 87.9 |
Ablation on the components of the temporal feature cross-extraction convolution (CE: cross-extraction convolution; Gate: gated CNN unit):

Temporal Convolution | Param. (M) | Acc. (%)
---|---|---
Only CE | 0.17 | 86.7 |
CE + PC | 0.17 | 87.2 |
CE + PCP | 0.17 | 87.2 |
CE + Gate | 0.17 | 87.2 |
CE + PC + PCP | 0.18 | 87.3 |
CE + PC + Gate | 0.18 | 87.5 |
CE + PCP + Gate | 0.18 | 87.0 |
CE + PC + PCP + Gate | 0.18 | 87.9 |
Wang, K.; Deng, H. TFC-GCN: Lightweight Temporal Feature Cross-Extraction Graph Convolutional Network for Skeleton-Based Action Recognition. Sensors 2023, 23, 5593. https://doi.org/10.3390/s23125593