Light-Weight Classification of Human Actions in Video with Skeleton-Based Features
Abstract
1. Introduction
- based on signals from worn sensors or mobile devices;
- vision-based: people are depicted in sequences of images, such as videos or video clips. Such data can be obtained using various types of equipment, e.g., a stationary or mobile digital camera or a Kinect-like sensor.
- it is useful to extract meaningful information from the tracked skeletal joints by employing classic functions, instead of training relational networks to perform function approximation;
- a feature engineering step, giving a 2D map of spatio-temporal joint locations, allows the use of a light-weight CNN-based network for action classification, removing the need for heavy-weight LSTMs or 3D CNNs for this task.
2. Related Work
2.1. Human Pose Estimation
2.2. Human Action Classification
- SIMPLE: 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14
- TREE-LIKE: 1-0-1-2-3-4-3-2-1-8-9-10-11-10-9-8-12-13-14-13-12-8-1-5-6-7-6-5-1
2.3. Analysis
2.4. Contribution of This Work
- proposing a new mapping of the skeleton joint set to a feature vector that makes local joint neighborhoods explicit; this representation is experimentally shown to be the most suitable one for a CNN classifier;
- proposing a feature engineering algorithm that refines the skeleton information by discarding noisy joints, filling gaps in joint information by tracking skeletons over time, and normalizing the spatial information contained in the feature vectors;
- defining a network architecture based on a CNN with skip connections that is shown to perform better than LSTM and Transformer networks, even with a much lower number of trainable parameters than competing solutions;
- applying the proposed approach both to 2D and 3D skeleton data extracted from video frames or depth images; several extremely light-weight models (with at most 250k parameters) were trained and tested on the popular NTU-RGB+D video dataset [25].
2.5. Video Datasets
- Contains RGB videos with a resolution of 1920 × 1080 pixels.
- Includes depth and infrared maps with a resolution of 512 × 424 pixels.
- Each action in the set is captured by three cameras.
- Each action was performed by the subjects in two settings, showing the activity from different viewpoints.
- It consists of 56,880 video samples covering 60 classes of behavior.
- Includes RGB videos with a resolution of 640 × 480 pixels.
- Includes depth maps with a resolution of 320 × 240 pixels.
- Activities were performed by 10 people, each of whom performed every activity twice; there are 10 activity classes.
3. The Approach
3.1. Structure
- Frame selection: selecting a sparse sequence of frames from the video clip.
- Pose estimation: this step is performed by the external OpenPose library (if skeleton data are not already provided in the dataset); it detects and localizes 2D or 3D human skeletons in images or video frames.
- Feature engineering: this step is accomplished by skeleton tracking, joint refinement, normalization, and a final ordering of joints. Several persons can be present in a frame, but we are interested in the “main” person. The positions and sizes of persons vary, so a smart normalization is needed. Some joints are not always detected, and others are detected with low certainty; thus, smart tracking of a skeleton and refinement of missing joints are needed.
- Model training and testing: different networks and configurations are trained on the main dataset. A trained model is then applied to classify the action encoded in the feature map provided as its input. Several models are tested on the two datasets, and their performance is measured. A sketch of this pipeline is given below.
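To make the flow of these steps concrete, the following minimal Python sketch strings them together. The function names (select_frames, estimate_skeletons, track_main_skeleton, fill_missing_joints, normalize_joints, build_feature_map) are hypothetical placeholders; minimal versions of most of them are sketched in the subsections below.

```python
import numpy as np

def classify_action(video_frames, model, n_frames=20):
    """Sketch of the overall processing structure (hypothetical helper names)."""
    frames = select_frames(video_frames, n_frames)        # sparse frame sampling (Section 3.2)
    skeletons = [estimate_skeletons(f) for f in frames]   # e.g., OpenPose per frame (Section 3.3)
    main = track_main_skeleton(skeletons)                 # keep the "main" person (Section 3.4.1)
    seq = fill_missing_joints(np.stack(main))             # interpolate missing joints (Section 3.4.2)
    seq = normalize_joints(seq)                           # position/size normalization (Section 3.4.3)
    fmap = build_feature_map(seq)                         # joint ordering + 2D map (Sections 3.4.4-3.4.5)
    return model.predict(fmap[np.newaxis])                # light-weight CNN classifier (Section 3.5)
```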
3.2. Frame Selection
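The reported experiments use a small, fixed number of frames per clip (e.g., 20). A minimal sketch, assuming uniform temporal spacing (the exact sampling rule is not restated here):

```python
import numpy as np

def select_frames(video_frames, n_frames=20):
    """Pick n_frames frames spread uniformly over the clip (assumed rule)."""
    idx = np.linspace(0, len(video_frames) - 1, num=n_frames)
    return [video_frames[int(round(i))] for i in idx]
```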
3.3. Skeleton Estimation
- (x, y, c), i.e., image coordinates and a detection confidence score, for 2D joints;
- (x, y, z), i.e., spatial coordinates, for 3D joints.
3.4. Feature Engineering
3.4.1. Skeleton Tracking
3.4.2. Filling of Empty Joints
- for the initial frame, the position of a missing joint is set to its first detection in the frame sequence;
- in the middle of the frame sequence, a missing joint is set to a linear interpolation of the two closest-in-time known positions of this joint;
- joints that are lost at the end of the frame sequence keep their last seen positions (a minimal sketch of this gap-filling scheme is given below).
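A minimal NumPy sketch of these three rules, assuming missing detections are marked as NaN in a (frames × joints × 2) array; this is an illustration, not the authors' implementation:

```python
import numpy as np

def fill_missing_joints(seq):
    """Fill NaN joint positions in a (T, J, 2) sequence.

    np.interp implements all three rules at once: frames before the first
    detection take that first detection, inner gaps are linearly interpolated
    between the two closest-in-time detections, and trailing gaps keep the
    last seen position.
    """
    seq = seq.copy()
    T, J, D = seq.shape
    for j in range(J):
        for d in range(D):
            col = seq[:, j, d]
            known = np.flatnonzero(~np.isnan(col))
            if known.size == 0:
                continue  # joint never detected; left for other refinement steps
            seq[:, j, d] = np.interp(np.arange(T), known, col[known])
    return seq
```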
Algorithm 1: Skeleton tracking (the brackets /* */ represent comments of the pseudo-code)

SkeletonTracking() {
(1) introduce a list that temporarily refers to all new (current-frame) skeletons;
(2) for every new skeleton do { /* find the nearest tracked skeleton */ take the tracked path whose last skeleton is nearest (arg-min distance) to the new skeleton; if this distance is below the acceptance threshold then { assign the new skeleton to that path; } }
(3) for every new skeleton that was not assigned do { start a new path and refer the new skeleton in it; }
(4) for every path that was not extended in the current frame do { /* virtually extend the previous path by the previous-frame skeleton */ }
(5) SelectPaths(); /* eventually prune weak paths if needed; at the end of the sequence, select the best one */
}
Algorithm 2: Select paths

SelectPaths() {
if (the last frame) then { select the single best path };
if the number of active paths exceeds the allowed limit do { for every path do: GetScore(); PrunePaths(); /* remove the weakest paths */ }
}
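A minimal Python sketch of the tracking scheme, reconstructed from the comments of Algorithms 1 and 2; the distance measure, the acceptance threshold, and the path score (here simply the path length) are assumptions, not the authors' exact rules:

```python
import numpy as np

def track_main_skeleton(frames, dist_threshold=50.0):
    """Greedy nearest-neighbour tracking over a list of per-frame skeleton lists.

    Each skeleton is a (J, 2) array of joint positions (NaN for missing joints);
    a path is the sequence of skeletons assigned to one tracked person.
    dist_threshold is an assumed value in pixel units.
    """
    paths = []
    for skeletons in frames:
        used = set()
        for path in paths:
            last = path[-1]
            best, best_d = None, np.inf
            for i, sk in enumerate(skeletons):      # find the nearest new skeleton
                if i in used:
                    continue
                d = np.nanmean(np.linalg.norm(sk - last, axis=-1))
                if d < best_d:
                    best, best_d = i, d
            if best is not None and best_d < dist_threshold:
                path.append(skeletons[best])
                used.add(best)
            else:
                path.append(last)                   # virtually extend by the previous-frame skeleton
        for i, sk in enumerate(skeletons):          # unmatched skeletons start new paths
            if i not in used:
                paths.append([sk])
    # at the end of the sequence, keep the "best" path (longest, as a stand-in score)
    return max(paths, key=len) if paths else None
```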
3.4.3. Normalization of Joints
- Min_max: the reference point is the lower-left corner of the bounding box covering the performed action over all frames; this point is determined once over all frames, so it does not change its position in time. The reference line segment is the diagonal of the area that the person occupied.
- Independent min_max: a method similar to “min_max”, with the difference that there are two reference segments, one for the horizontal coordinates (the width of the occupied area) and one for the vertical coordinates (its height); scaling is performed independently for each coordinate using the length of the corresponding segment.
- Spine_using: the reference point becomes the position of joint number 8, and the reference segment is the line segment between joints 1 and 8 (interpreted as the spine length); the longest such segment over all considered frames of a video clip is taken. Note that these two joints are usually well detected in an image. A sketch of the three variants is given below.
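A minimal NumPy sketch of the three normalization variants for a (frames × joints × 2) sequence; joint indices 1 (neck base) and 8 (spine base) follow the joint numbering used in the paper, and the exact formulas are assumptions:

```python
import numpy as np

def normalize_joints(seq, method="independent_min_max"):
    """Normalize a (T, J, 2) joint sequence with one of the described variants."""
    pts = seq.reshape(-1, 2)
    if method == "min_max":
        ref = np.nanmin(pts, axis=0)                         # fixed corner of the box over all frames
        diag = np.linalg.norm(np.nanmax(pts, axis=0) - ref)  # diagonal of the occupied area
        return (seq - ref) / diag
    if method == "independent_min_max":
        ref = np.nanmin(pts, axis=0)
        size = np.nanmax(pts, axis=0) - ref                  # per-axis width and height
        return (seq - ref) / size
    if method == "spine_using":
        ref = seq[:, 8:9, :]                                 # joint 8 (spine base) as reference point
        spine = np.nanmax(np.linalg.norm(seq[:, 1, :] - seq[:, 8, :], axis=-1))  # longest spine
        return (seq - ref) / spine
    raise ValueError(f"unknown normalization method: {method}")
```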
3.4.4. Palm-Based Ordering of Joints
3.4.5. Feature Map
- SIMPLIFIED: 0-1-2-4-5-7-8-11-14 (a 9-element simplification of the 15-element joint set, without the joints representing elbows, knees and hips)
- SIMPLE: 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14 (Figure 1a)
- TREE-LIKE: 1-0-1-2-3-4-3-2-1-8-9-10-11-10-9-8-12-13-14-13-12-8-1-5-6-7-6-5-1 (Figure 1b)
- Palm-Based: 1-2-3-4-1-4-0-4-11-4-8-4-7-1-7-0-7-14-7-8-7-6-5-1-0-1-8-9-10-11-14-13-12-8 (Figure 8)
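As an illustration, the sketch below arranges a normalized joint sequence into a 2D map by traversing the joints in one of the orders above (the Palm-Based order is copied from the list). The exact axis layout is an assumption; the paper only specifies that the map serves as an image-like input for the CNN.

```python
import numpy as np

# Palm-Based traversal order, copied from the list above
PALM_BASED = [1, 2, 3, 4, 1, 4, 0, 4, 11, 4, 8, 4, 7, 1, 7, 0, 7, 14,
              7, 8, 7, 6, 5, 1, 0, 1, 8, 9, 10, 11, 14, 13, 12, 8]

def build_feature_map(seq, order=PALM_BASED):
    """Turn a (T, J, 2) joint sequence into a (len(order), T, 2) feature map.

    One axis enumerates the joints in the chosen traversal order, so that
    neighbouring rows correspond to neighbouring joints of the skeleton; the
    other axis enumerates the selected frames; the x/y coordinates form the
    two channels.
    """
    fmap = seq[:, order, :]           # (T, len(order), 2)
    return fmap.transpose(1, 0, 2)    # joints x frames x coordinates
```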
3.5. Neural Network Models
3.5.1. LSTM
3.5.2. Transformer
3.5.3. CNN
3.6. Implementation
4. Results
4.1. Model Comparison
- Hyperparameters: the learning rate starts from an initial value and is reduced by a constant factor of 0.3 down to a minimum value; batch size = 100;
- Data sampling: frame number = 20;
- Feature engineering: normalization method = independent min max.
- Network parameters: number of LSTM units (L) = 100, number of LSTM layers (P) = 1, hidden-layer activation function = tanh (a minimal sketch of this configuration is given after this list);
- Feature engineering: joint order = SIMPLE.
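A rough Keras sketch of this LSTM baseline under the listed settings; flattening each skeleton into one time step and the softmax head are assumptions added only to obtain a runnable model:

```python
from tensorflow.keras import layers, models

def build_lstm(n_frames=20, n_joints=15, n_classes=60):
    """LSTM baseline with L = 100 units, P = 1 layer, tanh activation."""
    inp = layers.Input(shape=(n_frames, n_joints * 2))   # one flattened 2D skeleton per time step
    x = layers.LSTM(100, activation="tanh")(inp)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```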
- Hyperparameters (fixed): the learning rate starts from an initial value and is reduced by a constant factor of 0.3 down to a minimum value; batch size = 100;
- Data sampling (fixed): frame number = 20;
- Network parameters: number of MHSA units per layer (L) = 100, number of Transformer layers (P) = 2, hidden-layer activation function = ReLU (a minimal sketch of this configuration is given after this list);
- Feature engineering: normalization method (fixed) = “independent min max”; joint order = SIMPLE.
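A rough Keras sketch of this Transformer baseline (two encoder layers with ReLU feed-forward blocks); the head count, the single-layer feed-forward, and the omission of positional encoding are assumptions, since the text does not report them:

```python
from tensorflow.keras import layers, models

def build_transformer(n_frames=20, n_joints=15, n_classes=60, d_model=100, n_layers=2):
    """Transformer-encoder baseline over the sequence of flattened skeletons."""
    inp = layers.Input(shape=(n_frames, n_joints * 2))
    x = layers.Dense(d_model)(inp)                 # project each skeleton to the model width
    for _ in range(n_layers):
        att = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(x, x)
        x = layers.LayerNormalization()(x + att)   # residual + norm
        ff = layers.Dense(d_model, activation="relu")(x)
        x = layers.LayerNormalization()(x + ff)    # simplified feed-forward block
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```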
- Hyperparameters: the learning rate starts from an initial value and is reduced by a constant factor of 0.3 down to a minimum value; batch size = 100;
- Network layers: two convolution stages with three layers each (arranged in two streams: 2 + 1), two dense layers with two 50% dropout layers; 32 and 64 filters in the convolution layers of stages 1 and 2, respectively (a rough sketch of this layout is given after this list);
- Data sampling: frame number = 20;
- Feature engineering: normalization method = Independent min_max.
- Network parameters: number of CNN units (L) = 100, number of CNN layers (P) = 1, hidden-layer activation function = GELU, convolution-layer activation function = GELU;
- Feature engineering: joint order = Palm-based.
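A rough Keras sketch of the described CNN layout (two convolution stages, each with a two-layer stream plus a one-layer skip stream, 32 and 64 filters, GELU activations, two dense layers with 50% dropout). Kernel sizes, pooling, and the dense width (interpreted from L = 100) are assumptions, so parameter counts will not match the tables exactly:

```python
from tensorflow.keras import layers, models

def build_cnn(n_joints=34, n_frames=20, n_classes=60):
    """Light-weight CNN with skip connections over the joints x frames feature map."""
    inp = layers.Input(shape=(n_joints, n_frames, 2))
    x = inp
    for filters in (32, 64):                       # stage 1: 32 filters, stage 2: 64 filters
        main = layers.Conv2D(filters, 3, padding="same", activation="gelu")(x)
        main = layers.Conv2D(filters, 3, padding="same", activation="gelu")(main)
        skip = layers.Conv2D(filters, 1, padding="same", activation="gelu")(x)   # skip stream
        x = layers.Add()([main, skip])
        x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(100, activation="gelu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```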
4.2. CNN Model Optimization
4.3. Experiment with the UTKinect Dataset
4.4. Comparison with Top Works
4.5. Analysis of Class Confusions
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Coppola, C.; Cosar, S.; Faria, D.R.; Bellotto, N. Automatic detection of human interactions from RGB-D data for social activity classification. In Proceedings of the 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Lisbon, Portugal, 28–31 August 2017.
2. Zhang, S.; Wei, Z.; Nie, J.; Huang, L.; Wang, S.; Li, Z. A review on human activity recognition using vision-based method. J. Healthc. Eng. 2017, 2017, 3090343.
3. Hussain, Z.; Sheng, M.; Zhang, W.E. Different Approaches for Human Activity Recognition: A Survey. J. Netw. Comput. Appl. 2020, 167, 102738.
4. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
5. Liu, L.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 816–833.
6. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based Action Recognition with Convolutional Neural Networks. arXiv 2017, arXiv:1704.07595.
7. Bevilacqua, A.; MacDonald, K.; Rangarej, A.; Widjaya, V.; Caulfield, B.; Kechadi, T. Human Activity Recognition with Convolutional Neural Networks. In Machine Learning and Knowledge Discovery in Databases; LNAI; Springer: Cham, Switzerland, 2019; Volume 11053, pp. 541–552.
8. Liang, D.; Fan, G.; Lin, G.; Chen, W.; Pan, X.; Zhu, H. Three-Stream Convolutional Neural Network with Multi-Task and Ensemble Learning for 3D Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019.
9. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3590–3598.
10. Liu, M.; Yuan, J. Recognizing Human Actions as the Evolution of Pose Estimation Maps. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1159–1168.
11. Cippitelli, E.; Gambi, E.; Spinsante, S.; Florez-Revuelta, F. Evaluation of a skeleton-based method for human activity recognition on a large-scale RGB-D dataset. In Proceedings of the 2nd IET International Conference on Technologies for Active and Assisted Living (TechAAL 2016), London, UK, 24–25 October 2016.
12. Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
13. Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; LNCS; Springer: Cham, Switzerland, 2016; Volume 9907, pp. 34–50.
14. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
15. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
16. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
17. NTU RGB+D 120 Dataset. Papers with Code. Available online: https://paperswithcode.com/dataset/ntu-rgb-d-120 (accessed on 30 June 2022).
18. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; p. 25.
19. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
21. Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The Progress of Human Pose Estimation: A Survey and Taxonomy of Models Applied in 2D Human Pose Estimation. IEEE Access 2020, 8, 133330–133348.
22. Wei, K.; Zhao, X. Multiple-Branches Faster RCNN for Human Parts Detection and Pose Estimation. In Proceedings of the Computer Vision—ACCV 2016 Workshops, Taipei, Taiwan, 20–24 November 2017.
23. Su, Z.; Ye, M.; Zhang, G.; Dai, L.; Sheng, J. Cascade feature aggregation for human pose estimation. arXiv 2019, arXiv:1902.07837.
24. Duan, H.; Zhao, Y.; Chen, K.; Shao, D.; Lin, D.; Dai, B. Revisiting Skeleton-based Action Recognition. arXiv 2021, arXiv:2104.13586.
25. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. arXiv 2016, arXiv:1604.02808.
26. UTKinect-3D Database. Available online: http://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html (accessed on 30 June 2022).
27. Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks. Comput. Vis. Image Underst. 2021, 208–209, 103219.
A1: drink water | A2: eat meal | A3: brush teeth | A4: brush hair |
A5: drop | A6: pick up | A7: throw | A8: sit down |
A9: stand up | A10: clapping | A11: reading | A12: writing |
A13: tear up paper | A14: put on jacket | A15: take off jacket | A16: put on a shoe |
A17: take off a shoe | A18: put on glasses | A19: take off glasses | A20: put on a hat/cap |
A21: take off a hat/cap | A22: cheer up | A23: hand waving | A24: kicking something |
A25: reach into pocket | A26: hopping | A27: jump up | A28: phone call |
A29: play with phone/tablet | A30: type on a keyboard | A31: point to something | A32: taking a selfie |
A33: check time (from watch) | A34: rub two hands | A35: nod head/bow | A36: shake head |
A37: wipe face | A38: salute | A39: put palms together | A40: cross hands in front |
A1: walk | A2: sit down | A3: stand up | A4: pick up |
A5: carry | A6: throw | A7: push | A8: pull |
A9: wave hands | A10: clap hands |
Number | Description |
---|---|
0 | The main point of the head |
1 | Neck base |
2, 5 | Shoulders |
3, 6 | Elbows |
4, 7 | Wrists |
8 | The base of the spine |
9, 12 | Hips |
10, 13 | Knees |
11, 14 | Ankles |
15, 16, 17, 18 | Extra head points |
19, 20, 21, 22, 23, 24 | Extra foot points |
Metric | Simplified | Simple | Tree-like | Palm-Based |
---|---|---|---|---|
Training accuracy (%) | 95.3 | 94.2 | 90.3 | 92.8
Test accuracy (%) | 86.1 | 86.8 | 83.5 | 85.3
Trainable parameters | 142,140 | 146,940 | 158,140 | 162,140
Metric | Simplified | Simple | Tree-like | Palm-Based |
---|---|---|---|---|
Training accuracy (%) | 92.5 | 96.5 | 87.4 | 84.2
Test accuracy (%) | 81.5 | 84.7 | 80.3 | 78.3
Trainable parameters | 76,794 | 185,790 | 620,434 | 836,844
Metric | Simplified | Simple | Tree-like | Palm-Based |
---|---|---|---|---|
Training accuracy (%) | 95.4 | 96.5 | 94.7 | 97.7
Test accuracy (%) | 86.0 | 87.5 | 87.0 | 89.7
Trainable parameters | 149,068 | 181,068 | 213,068 | 245,068
Joint Order | 7 Frames | 10 Frames | 15 Frames | 20 Frames | 25 Frames |
---|---|---|---|---|---|
SIMPLE: accuracy (%) | 83.8 | 85.7 | 87.2 | 88.1 | 88.5
SIMPLE: parameters | 135k | 143k | 156k | 180k | 194k
Palm-Based: accuracy (%) | 85.0 | 87.6 | 89.4 | 90.1 | 90.3
Palm-Based: parameters | 153k | 168k | 194k | 219k | 245k
Batch Size | 10 | 50 | 75 | 100 |
---|---|---|---|---|
SIMPLE (%) | 87.8 | 88.3 | 87.5 | 87.4 |
Palm-Based (%) | 88.8 | 90.1 | 89.7 | 89.7 |
Normalization | Min_Max | Independent Min_Max | Spine_Using |
---|---|---|---|
SIMPLE (%) | 87.2 | 87.5 | 88.3 |
Palm-Based (%) | 89.8 | 89.9 | 90.1 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).