Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities
Abstract
1. Introduction
- Smart Cameras: These cameras use algorithms based on artificial intelligence (AI) and machine learning (ML) to track and identify people's actions in real time.
- Wearable Devices: Wearable technology uses sensors to monitor the wearer's every move, allowing common physical motions such as running, jumping, and walking to be recognized accurately.
- Health and Fitness Apps: Apps for health and fitness track and analyze user data with AI and ML algorithms to make suggestions and give feedback based on specific activities such as running, cycling, and swimming.
- Automated Surveillance Systems: Automated surveillance systems use AI and ML algorithms to identify human actions for security and safety purposes.
- Human–Computer Interaction: Systems that employ human action recognition for human–computer interaction are also available, with examples including gesture recognition in gaming and virtual reality.
- Daily activities, such as walking, running, jumping, sitting, standing, etc.
- Sports activities, such as basketball, soccer, tennis, etc.
- Exercise activities, such as weightlifting, yoga, aerobics, etc.
- Medical activities, such as gait analysis for patients with mobility impairments.
- Industrial activities, such as assembly line work, machine operation, etc.
- Interpersonal activities, such as handshaking, hugging, pointing, etc.
- Artistic activities, such as dancing, playing musical instruments, etc.
- Household activities, such as cooking, cleaning, etc.
Our Contributions
- We provide a detailed introduction to human activity recognition using computer vision.
- We provide a comprehensive analysis of action recognition by examining both conventional and deep learning-based approaches.
- We present a generic framework for recognizing human actions in videos.
- To classify the different approaches to human action recognition, we propose a new taxonomy and present a detailed discussion of recent work organized around it.
- We explore the challenges associated with existing approaches to action and interaction recognition, as well as emerging trends and possible future directions for detecting complex human behavior and online activity.
2. Overview
Significant Achievements
- Stanford University, USA: Developed convolutional neural networks (ConvNets) for large-scale video action recognition, which have become standard in the field [38].
- University of Oxford, UK: Developed the Two-Stream Convolutional Networks for action recognition in videos [24].
- Carnegie Mellon University, USA: Developed the Deep Structured Semantic Model for human action recognition [39].
- Max Planck Institute for Informatics, Germany: Conducted research on human action recognition in the context of egocentric videos [40].
- Ecole Centrale de Lyon, France: Made important progress in deep learning-based action recognition, including algorithms that recognize actions from unstructured feature points [41].
- National Institute of Information and Communications Technology, Japan: Conducted research on human action recognition in the context of wearable sensors [42].
- University of California, USA: Conducted extensive research on 3D human action recognition using deep learning [43].
- Chinese Academy of Sciences, China: Developed skeleton-based adaptive graph convolutional models for human action recognition in videos [44].
- Technical University of Munich, Germany: Conducted research on human action recognition in the context of ego–motion representation [45].
- INRIA, France: Conducted research on human action recognition using deep learning and introduced the concept of spatiotemporal convolutional networks [46].
3. Human Action Recognition Framework
4. Research Method and Taxonomy
- Defining the scope and objectives: We first established the goals and scope of this study, which centers on the many aspects of human action recognition. The article provides an overview of human action recognition, covering its origins, how it has evolved over time, and its current state.
- Conducting a comprehensive literature search: We searched academic literature extensively to find studies, articles, and publications pertinent to the study of human action recognition. We used Google Scholar, MDPI, PubMed, and IEEE Xplore, among many others, to accomplish this.
- Evaluating the quality of the literature: We evaluated the quality of the literature we found by looking at aspects like the validity and reliability of the research methods used, how well the results fit with the goals of our review, and how well the data was analyzed and interpreted.
- Classifying the literature: We organized the collected material according to the specific components of human action recognition under examination, using a classification system that covered feature extraction-based methods, activity type-based methods, and related categories.
- Synthesizing the literature: To synthesize the literature, we summarized the main points of each article we studied, compared and contrasted their methods and results, and added our own observations and conclusions.
- Analyzing and interpreting the data: We analyzed and interpreted the data from the literature review to address the research questions, draw conclusions, and identify gaps in the current body of research.
4.1. Feature Extraction-Based Action Recognition
4.1.1. Handcrafted Representation Method
Depth-Based Approaches
Skeleton-Based Approaches
Hybrid Feature-Based Approaches
4.1.2. Deep Learning Representation Method
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Autoencoders
Hybrid Deep Learning Models
4.1.3. Attention-Based Methods
- Self-Attention: Self-attention is the fundamental mechanism of both the transformer encoder and decoder blocks. In the vision setting, the self-attention layer takes an input sequence X (e.g., video-clip or entity tokens) and linearly projects it into three distinct vectors: a query Q, a key K, and a value V; attention weights computed from the similarity between queries and keys are then used to aggregate the values.
- Multi-Head Attention: Multi-head attention [195] runs several self-attention operations in parallel so that the model can capture the complicated interactions among token entities from diverse representation subspaces (see the sketch after this list).
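To make the Q/K/V formulation above concrete, the following minimal NumPy sketch implements scaled dot-product self-attention and a simple multi-head variant in the spirit of [195]. The token count, dimensions, and random projection matrices are illustrative assumptions, not values taken from any specific model in the surveyed literature.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention.

    X:  (n_tokens, d_model) input sequence (e.g., clip or entity tokens).
    Wq, Wk, Wv: (d_model, d_k) projection matrices producing Q, K, V.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # linear projections of the input
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise query-key similarities
    weights = softmax(scores, axis=-1)        # attention distribution per query
    return weights @ V                        # weighted aggregation of values

def multi_head_attention(X, heads):
    """Run several attention heads in parallel and concatenate their outputs.

    heads: list of (Wq, Wk, Wv) tuples, one per head.
    """
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)

# Illustrative usage with assumed sizes: 8 tokens, d_model = 16, 2 heads of size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
heads = [tuple(rng.normal(size=(16, 8)) for _ in range(3)) for _ in range(2)]
out = multi_head_attention(X, heads)
print(out.shape)  # (8, 16)
```

A full transformer block would additionally apply an output projection, residual connections, and layer normalization, which are omitted here for brevity.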
4.2. Activity Type-Based Human Action Recognition
4.2.1. Atomic Action
4.2.2. Behavior
4.2.3. Interaction
4.2.4. Group Activities
5. Popular Datasets and Approaches
5.1. Atomic Action Datasets
5.1.1. KTH Dataset
5.1.2. NTU RGB+D
5.1.3. MSR Action 3D
5.2. Behavior Dataset
Multi-Camera Action Dataset (MCAD)
5.3. Interaction Dataset
5.3.1. MSR Daily Activity 3D Dataset
5.3.2. Multi-Camera Human Action Video Dataset (MuHAVI)
5.3.3. UCF50
5.4. Group Activities
5.4.1. ActivityNet Dataset
5.4.2. The Kinetics Human Action Video Dataset
5.4.3. HMDB-51 Dataset
5.4.4. HOLLYWOOD Dataset
5.4.5. HOLLYWOOD 2 Dataset
5.4.6. UCF-101 Action Recognition Dataset
6. Evaluation Metrics and Performance
- True Positive: The predicted and actual activity categories are the same.
- False Positive: Activities that do not belong to the sought category but are predicted to belong to it.
- True Negative: Activities for which neither the actual nor the predicted label belongs to the sought class.
- False Negative: Activities that belong to the sought category but are predicted to fall outside it.
- Recall: Recall is also known as sensitivity, true positive rate, or probability of detection. It measures the fraction of actual positive instances that are correctly predicted as positive, i.e., the proportion of activities of a given class that the system detects; a low recall therefore indicates that the system misses activities of that class. Mathematically, Recall = TP / (TP + FN).
- Precision: Precision is the probability that an activity reported by the recognizer actually occurred; equivalently, the likelihood that a reported activity is wrongly identified equals one minus the precision. Mathematically, Precision = TP / (TP + FP).
- F Score: The F score is the harmonic mean of precision and recall, measuring both the accuracy and the robustness of a classifier at the same time. Its best value is 1 and its worst value is 0. Mathematically, F1 = 2 × (Precision × Recall) / (Precision + Recall).
- Accuracy: This metric measures the proportion of correct predictions over the total number of samples. As long as the classes are evenly balanced, accuracy is a satisfactory indicator of performance. Mathematically, Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Confusion Matrix: Also known as an "error matrix", this summarizes the model's prediction outcomes and indicates its overall accuracy. Each kind of misclassification appears as an off-diagonal entry, with one row per predicted class and one column per actual class (or vice versa). Figure 27 shows the structure of a confusion matrix. A minimal code sketch computing these metrics is given below.
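As a concrete illustration of these metrics, the short Python sketch below builds a confusion matrix and derives per-class precision, recall, and F1 score, plus overall accuracy, from ground-truth and predicted labels. The label arrays and the three-class setup are hypothetical examples, not data from any surveyed benchmark.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Rows index the actual class, columns the predicted class.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm, c):
    # TP, FP, FN, TN for class c, derived from the confusion matrix.
    tp = cm[c, c]
    fp = cm[:, c].sum() - tp          # predicted as c but actually another class
    fn = cm[c, :].sum() - tp          # actually c but predicted as another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for a 3-class activity recognition problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()   # correct predictions over all samples
print("Confusion matrix:\n", cm)
print("Accuracy:", accuracy)
for c in range(3):
    p, r, f1 = per_class_metrics(cm, c)
    print(f"Class {c}: precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}")
```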
7. Research Issues, Opportunities, and Future Directions
7.1. Research Issues
7.2. Opportunities
7.3. Future Directions
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cippitelli, E.; Fioranelli, F.; Gambi, E.; Spinsante, S. Radar and RGB-depth sensors for fall detection: A review. IEEE Sens. J. 2017, 17, 3585–3604. [Google Scholar]
- Cai, H.; Fang, Y.; Ju, Z.; Costescu, C.; David, D.; Billing, E.; Ziemke, T.; Thill, S.; Belpaeme, T.; Vanderborght, B.; et al. Sensing-enhanced therapy system for assessing children with autism spectrum disorders: A feasibility study. IEEE Sens. J. 2018, 19, 1508–1518. [Google Scholar]
- Kong, Y.; Fu, Y. Modeling supporting regions for close human interaction recognition. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 29–44. [Google Scholar]
- Zhang, J.; Li, W.; Ogunbona, P.O.; Wang, P.; Tang, C. RGB-D-based action recognition datasets: A survey. Pattern Recognit. 2016, 60, 86–105. [Google Scholar]
- Chen, L.; Wei, H.; Ferryman, J. A survey of human motion analysis using depth imagery. Pattern Recognit. Lett. 2013, 34, 1995–2006. [Google Scholar]
- Lun, R.; Zhao, W. A survey of applications and human motion recognition with microsoft kinect. Int. J. Pattern Recognit. Artif. Intell. 2015, 29, 1555008. [Google Scholar]
- Presti, L.L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147. [Google Scholar]
- Han, F.; Reily, B.; Hoff, W.; Zhang, H. Space-time representation of people based on 3D skeletal data: A review. Comput. Vis. Image Underst. 2017, 158, 85–105. [Google Scholar]
- Ye, M.; Zhang, Q.; Wang, L.; Zhu, J.; Yang, R.; Gall, J. A survey on human motion analysis from depth data. In Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications; Springer: Berlin/Heidelberg, Germany, 2013; pp. 149–187. [Google Scholar]
- Aggarwal, J.K.; Xia, L. Human activity recognition from 3d data: A review. Pattern Recognit. Lett. 2014, 48, 70–80. [Google Scholar]
- Zhu, F.; Shao, L.; Xie, J.; Fang, Y. From handcrafted to learned representations for human action recognition: A survey. Image Vis. Comput. 2016, 55, 42–52. [Google Scholar]
- Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review. ACM Comput. Surv. (CSUR) 2011, 43, 1–43. [Google Scholar]
- Dawn, D.D.; Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 2016, 32, 289–306. [Google Scholar]
- Zhang, Z.; Liu, S.; Liu, S.; Han, L.; Shao, Y.; Zhou, W. Human action recognition using salient region detection in complex scenes. In Proceedings of the Third International Conference on Communications, Signal Processing, and Systems, Hohhot, Inner Mongolia, China, 14–15 July 2014; Springer: Berlin/Heidelberg, Germany, 2015; pp. 565–572. [Google Scholar]
- Nguyen, T.V.; Song, Z.; Yan, S. STAP: Spatial-temporal attention-aware pooling for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2014, 25, 77–86. [Google Scholar]
- Zhang, H.B.; Lei, Q.; Zhong, B.N.; Du, J.X.; Peng, J.; Hsiao, T.C.; Chen, D.S. Multi-surface analysis for human action recognition in video. SpringerPlus 2016, 5, 1–14. [Google Scholar]
- Burghouts, G.; Schutte, K.; ten Hove, R.M.; van den Broek, S.; Baan, J.; Rajadell, O.; van Huis, J.; van Rest, J.; Hanckmann, P.; Bouma, H.; et al. Instantaneous threat detection based on a semantic representation of activities, zones and trajectories. Signal Image Video Process. 2014, 8, 191–200. [Google Scholar]
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
- Oreifej, O.; Liu, Z. Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 716–723. [Google Scholar]
- Li, M.; Leung, H.; Shum, H.P. Human action recognition via skeletal and depth based feature fusion. In Proceedings of the 9th International Conference on Motion in Games, Burlingame, CA, USA, 10–12 October 2016; pp. 123–132. [Google Scholar]
- Yang, X.; Tian, Y. Effective 3d action recognition using eigenjoints. J. Vis. Commun. Image Represent. 2014, 25, 2–11. [Google Scholar]
- Chen, C.; Liu, K.; Kehtarnavaz, N. Real-time human action recognition based on depth motion maps. J. Real-Time Image Process. 2016, 12, 155–163. [Google Scholar]
- Azure Kinect DK. Available online: https://azure.microsoft.com/en-us/products/kinect-dk/ (accessed on 6 February 2023).
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
- Wang, P.; Li, W.; Gao, Z.; Zhang, J.; Tang, C.; Ogunbona, P.O. Action recognition from depth maps using deep convolutional neural networks. IEEE Trans. Hum.-Mach. Syst. 2015, 46, 498–509. [Google Scholar]
- Güler, R.A.; Neverova, N.; Kokkinos, I. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7297–7306. [Google Scholar]
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2334–2343. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 4–6 February 2018; Volume 32. [Google Scholar]
- Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; Lin, D. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2914–2923. [Google Scholar]
- Morshed, M.G.; Lee, Y.K. MNSSD: A Real-time DNN based Companion Image Data Annotation using MobileNet and Single Shot Multibox Detector. In Proceedings of the 2022 IEEE International Conference on Big Data and Smart Computing (BigComp), Daegu, Republic of Korea, 17–20 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 251–258. [Google Scholar]
- Zhou, Z.; Shi, F.; Wu, W. Learning spatial and temporal extents of human actions for action detection. IEEE Trans. Multimed. 2015, 17, 512–525. [Google Scholar]
- Zhang, H.B.; Li, S.Z.; Chen, S.Y.; Su, S.Z.; Lin, X.M.; Cao, D.L. Locating and recognizing multiple human actions by searching for maximum score subsequences. Signal Image Video Process. 2015, 9, 705–714. [Google Scholar]
- Shu, Z.; Yun, K.; Samaras, D. Action detection with improved dense trajectories and sliding window. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 541–551. [Google Scholar]
- Oneata, D.; Verbeek, J.; Schmid, C. Efficient action localization with approximately normalized fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2545–2552. [Google Scholar]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
- De la Torre, F.; Hodgins, J.; Bargteil, A.; Martin, X.; Macey, J.; Collado, A.; Beltran, P. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database; Citeseer: Princeton, NJ, USA, 2009. [Google Scholar]
- Steil, J.; Bulling, A. Discovery of everyday human activities from long-term visual behaviour using topic models. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan, 7–11 September 2015; pp. 75–85. [Google Scholar]
- Baradel, F.; Wolf, C.; Mille, J.; Taylor, G.W. Glimpse clouds: Human activity recognition from unstructured feature points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 469–478. [Google Scholar]
- Takizawa, K.; Aoyagi, T.; Takada, J.i.; Katayama, N.; Yekeh, K.; Takehiko, Y.; Kohno, K.R. Channel models for wireless body area networks. In Proceedings of the 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vancouver, BC, Canada, 20–25 August 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1549–1552. [Google Scholar]
- Ohn-Bar, E.; Trivedi, M. Joint angles similarities and HOG2 for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 465–470. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
- Tenorth, M.; Bandouch, J.; Beetz, M. The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, Kyoto, Japan, 27 September–4 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1089–1096. [Google Scholar]
- Weinland, D.; Ronfard, R.; Boyer, E. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 2006, 104, 249–257. [Google Scholar]
- Abdallah, Z.S.; Gaber, M.M.; Srinivasan, B.; Krishnaswamy, S. Activity recognition with evolving data streams: A review. ACM Comput. Surv. (CSUR) 2018, 51, 1–36. [Google Scholar]
- Herath, S.; Harandi, M.; Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 2017, 60, 4–21. [Google Scholar]
- Jalal, A.; Kim, Y.H.; Kim, Y.J.; Kamal, S.; Kim, D. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognit. 2017, 61, 295–308. [Google Scholar]
- Yang, X.; Tian, Y. Super normal vector for human activity recognition with depth cameras. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1028–1039. [Google Scholar]
- Xu, C.; Govindarajan, L.N.; Cheng, L. Hand action detection from ego-centric depth sequences with error-correcting Hough transform. Pattern Recognit. 2017, 72, 494–503. [Google Scholar]
- Qi, J.; Yang, P.; Hanneghan, M.; Tang, S.; Zhou, B. A hybrid hierarchical framework for gym physical activity recognition and measurement using wearable sensors. IEEE Internet Things J. 2018, 6, 1384–1393. [Google Scholar]
- Alsinglawi, B.; Nguyen, Q.V.; Gunawardana, U.; Maeder, A.; Simoff, S.J. RFID systems in healthcare settings and activity of daily living in smart homes: A review. E-Health Telecommun. Syst. Netw. 2017, 6, 1–17. [Google Scholar]
- Lara, O.D.; Labrador, M.A. A survey on human activity recognition using wearable sensors. IEEE Commun. Surv. Tutor. 2012, 15, 1192–1209. [Google Scholar]
- Cornacchia, M.; Ozcan, K.; Zheng, Y.; Velipasalar, S. A survey on activity detection and classification using wearable sensors. IEEE Sens. J. 2016, 17, 386–403. [Google Scholar]
- Prati, A.; Shan, C.; Wang, K.I.K. Sensors, vision and networks: From video surveillance to activity recognition and health monitoring. J. Ambient Intell. Smart Environ. 2019, 11, 5–22. [Google Scholar]
- Kumar, K.S.; Bhavani, R. Human activity recognition in egocentric video using HOG, GiST and color features. Multimed. Tools Appl. 2020, 79, 3543–3559. [Google Scholar]
- Roy, P.K.; Om, H. Suspicious and violent activity detection of humans using HOG features and SVM classifier in surveillance videos. In Advances in Soft Computing and Machine Learning in Image Processing; Springer: Berlin/Heidelberg, Germany, 2018; pp. 277–294. [Google Scholar]
- Thyagarajmurthy, A.; Ninad, M.; Rakesh, B.; Niranjan, S.; Manvi, B. Anomaly detection in surveillance video using pose estimation. In Emerging Research in Electronics, Computer Science and Technology; Springer: Berlin/Heidelberg, Germany, 2019; pp. 753–766. [Google Scholar]
- Martínez-Villaseñor, L.; Ponce, H. A concise review on sensor signal acquisition and transformation applied to human activity recognition and human–robot interaction. Int. J. Distrib. Sens. Netw. 2019, 15, 1550147719853987. [Google Scholar]
- Yang, H.; Yuan, C.; Li, B.; Du, Y.; Xing, J.; Hu, W.; Maybank, S.J. Asymmetric 3d convolutional neural networks for action recognition. Pattern Recognit. 2019, 85, 1–12. [Google Scholar]
- Nunez, J.C.; Cabido, R.; Pantrigo, J.J.; Montemayor, A.S.; Velez, J.F. Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognit. 2018, 76, 80–94. [Google Scholar]
- Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3d points. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 9–14. [Google Scholar]
- Bulbul, M.F.; Jiang, Y.; Ma, J. Human action recognition based on dmms, hogs and contourlet transform. In Proceedings of the 2015 IEEE International Conference on Multimedia Big Data, Beijing, China, 20–22 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 389–394. [Google Scholar]
- Chen, C.; Liu, M.; Liu, H.; Zhang, B.; Han, J.; Kehtarnavaz, N. Multi-temporal depth motion maps-based local binary patterns for 3-D human action recognition. IEEE Access 2017, 5, 22590–22604. [Google Scholar]
- Zhang, B.; Yang, Y.; Chen, C.; Yang, L.; Han, J.; Shao, L. Action recognition using 3D histograms of texture and a multi-class boosting classifier. IEEE Trans. Image Process. 2017, 26, 4648–4660. [Google Scholar]
- Yang, X.; Zhang, C.; Tian, Y. Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; pp. 1057–1060. [Google Scholar]
- Lai, K.; Bo, L.; Ren, X.; Fox, D. A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1817–1824. [Google Scholar]
- Yang, X.; Tian, Y. Super normal vector for activity recognition using depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 804–811. [Google Scholar]
- Slama, R.; Wannous, H.; Daoudi, M. Grassmannian representation of motion depth for 3D human gesture and action recognition. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 3499–3504. [Google Scholar]
- Wang, J.; Liu, Z.; Chorowski, J.; Chen, Z.; Wu, Y. Robust 3d action recognition with random occupancy patterns. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 872–885. [Google Scholar]
- Xia, L.; Aggarwal, J. Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2834–2841. [Google Scholar]
- Liu, M.; Liu, H. Depth context: A new descriptor for human activity recognition by using sole depth sequences. Neurocomputing 2016, 175, 747–758. [Google Scholar]
- Liu, M.; Liu, H.; Chen, C. Robust 3D action recognition through sampling local appearances and global distributions. IEEE Trans. Multimed. 2017, 20, 1932–1947. [Google Scholar]
- Ji, X.; Cheng, J.; Feng, W.; Tao, D. Skeleton embedded motion body partition for human action recognition using depth sequences. Signal Process. 2018, 143, 56–68. [Google Scholar]
- Gowayyed, M.A.; Torki, M.; Hussein, M.E.; El-Saban, M. Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013. [Google Scholar]
- Qiao, R.; Liu, L.; Shen, C.; van den Hengel, A. Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition. Pattern Recognit. 2017, 66, 202–212. [Google Scholar]
- Devanne, M.; Wannous, H.; Berretti, S.; Pala, P.; Daoudi, M.; Del Bimbo, A. 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold. IEEE Trans. Cybern. 2014, 45, 1340–1352. [Google Scholar]
- Guo, Y.; Li, Y.; Shao, Z. DSRF: A flexible trajectory descriptor for articulated human action recognition. Pattern Recognit. 2018, 76, 137–148. [Google Scholar]
- Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar]
- Dollár, P.; Rabaud, V.; Cottrell, G.; Belongie, S. Behavior recognition via sparse spatio-temporal features. In Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, 15–16 October 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 65–72. [Google Scholar]
- Chaaraoui, A.A.; Padilla-López, J.R.; Climent-Pérez, P.; Flórez-Revuelta, F. Evolutionary joint selection to improve human action recognition with RGB-D devices. Expert Syst. Appl. 2014, 41, 786–794. [Google Scholar]
- Vemulapalli, R.; Arrate, F.; Chellappa, R. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 588–595. [Google Scholar]
- Perez, M.; Liu, J.; Kot, A.C. Skeleton-based relational reasoning for group activity analysis. Pattern Recognit. 2022, 122, 108360. [Google Scholar]
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Learning actionlet ensemble for 3D human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 914–927. [Google Scholar]
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1290–1297. [Google Scholar]
- Raman, N.; Maybank, S.J. Activity recognition using a supervised non-parametric hierarchical HMM. Neurocomputing 2016, 199, 163–177. [Google Scholar]
- Zhu, Y.; Chen, W.; Guo, G. Fusing spatiotemporal features and joints for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 486–491. [Google Scholar]
- Sung, J.; Ponce, C.; Selman, B.; Saxena, A. Unstructured human activity detection from rgbd images. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, St Paul, MN, USA, 14–18 May 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 842–849. [Google Scholar]
- Liu, A.A.; Nie, W.Z.; Su, Y.T.; Ma, L.; Hao, T.; Yang, Z.X. Coupled hidden conditional random fields for RGB-D human action recognition. Signal Process. 2015, 112, 74–82. [Google Scholar]
- Kong, Y.; Fu, Y. Bilinear heterogeneous information machine for RGB-D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1054–1062. [Google Scholar]
- Kong, Y.; Fu, Y. Max-margin heterogeneous information machine for RGB-D action recognition. Int. J. Comput. Vis. 2017, 123, 350–371. [Google Scholar]
- Hejazi, S.M.; Abhayaratne, C. Handcrafted localized phase features for human action recognition. Image Vis. Comput. 2022, 123, 104465. [Google Scholar]
- Al-Obaidi, S.; Al-Khafaji, H.; Abhayaratne, C. Making sense of neuromorphic event data for human action recognition. IEEE Access 2021, 9, 82686–82700. [Google Scholar]
- Singh, D.; Mohan, C.K. Graph formulation of video activities for abnormal activity recognition. Pattern Recognit. 2017, 65, 265–272. [Google Scholar]
- Everts, I.; Van Gemert, J.C.; Gevers, T. Evaluation of color spatio-temporal interest points for human action recognition. IEEE Trans. Image Process. 2014, 23, 1569–1580. [Google Scholar]
- Zhu, Y.; Chen, W.; Guo, G. Evaluating spatiotemporal interest point features for depth-based action recognition. Image Vis. Comput. 2014, 32, 453–464. [Google Scholar]
- Chakraborty, B.; Holte, M.B.; Moeslund, T.B.; Gonzalez, J.; Roca, F.X. A selective spatio-temporal interest point detector for human action recognition in complex scenes. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1776–1783. [Google Scholar]
- Vishwakarma, D.K.; Kapoor, R.; Dhiman, A. A proposed unified framework for the recognition of human activity by exploiting the characteristics of action dynamics. Robot. Auton. Syst. 2016, 77, 25–38. [Google Scholar]
- Nazir, S.; Yousaf, M.H.; Velastin, S.A. Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Comput. Electr. Eng. 2018, 72, 660–669. [Google Scholar]
- Miao, Y.; Song, J. Abnormal event detection based on SVM in video surveillance. In Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), Ottawa, ON, Canada, 29–30 September 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1379–1383. [Google Scholar]
- Xu, D.; Xiao, X.; Wang, X.; Wang, J. Human action recognition based on Kinect and PSO-SVM by representing 3D skeletons as points in lie group. In Proceedings of the 2016 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 11–12 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 568–573. [Google Scholar]
- Liu, L.; Shao, L.; Li, X.; Lu, K. Learning spatio-temporal representations for action recognition: A genetic programming approach. IEEE Trans. Cybern. 2015, 46, 158–170. [Google Scholar]
- Vishwakarma, D.K.; Kapoor, R. Hybrid classifier based human activity recognition using the silhouette and cells. Expert Syst. Appl. 2015, 42, 6957–6965. [Google Scholar]
- Gan, L.; Chen, F. Human Action Recognition Using APJ3D and Random Forests. J. Softw. 2013, 8, 2238–2245. [Google Scholar]
- Khan, Z.A.; Sohn, W. Abnormal human activity recognition system based on R-transform and kernel discriminant technique for elderly home care. IEEE Trans. Consum. Electron. 2011, 57, 1843–1850. [Google Scholar]
- Chaaraoui, A.A.; Florez-Revuelta, F. Optimizing human action recognition based on a cooperative coevolutionary algorithm. Eng. Appl. Artif. Intell. 2014, 31, 116–125. [Google Scholar]
- Chen, C.; Jafari, R.; Kehtarnavaz, N. Action recognition from depth sequences using depth motion maps-based local binary patterns. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1092–1099. [Google Scholar]
- Li, C.; Hou, Y.; Wang, P.; Li, W. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Process. Lett. 2017, 24, 624–628. [Google Scholar]
- Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297. [Google Scholar]
- Liu, J.; Akhtar, N.; Mian, A. Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 807–811. [Google Scholar]
- Xie, C.; Li, C.; Zhang, B.; Chen, C.; Han, J.; Zou, C.; Liu, J. Memory attention networks for skeleton-based action recognition. arXiv 2018, arXiv:1804.08254. [Google Scholar]
- Huang, Z.; Wan, C.; Probst, T.; Van Gool, L. Deep learning on lie groups for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6099–6108. [Google Scholar]
- Vemulapalli, R.; Chellapa, R. Rolling rotations for recognizing human actions from 3d skeletal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4471–4479. [Google Scholar]
- Liu, M.; Yuan, J. Recognizing human actions as the evolution of pose estimation maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1159–1168. [Google Scholar]
- Tang, Y.; Liu, X.; Yu, X.; Zhang, D.; Lu, J.; Zhou, J. Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–24. [Google Scholar]
- Li, X.; Liu, C.; Shuai, B.; Zhu, Y.; Chen, H.; Tighe, J. Nuta: Non-uniform temporal aggregation for action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3683–3692. [Google Scholar]
- Xu, Y.; Wei, F.; Sun, X.; Yang, C.; Shen, Y.; Dai, B.; Zhou, B.; Lin, S. Cross-model pseudo-labeling for semi-supervised action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2959–2968. [Google Scholar]
- Qian, Y.; Kang, G.; Yu, L.; Liu, W.; Hauptmann, A.G. Trm: Temporal relocation module for video recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 151–160. [Google Scholar]
- Yu, L.; Qian, Y.; Liu, W.; Hauptmann, A.G. Argus++: Robust real-time activity detection for unconstrained video streams with overlapping cube proposals. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 112–121. [Google Scholar]
- Wang, L.; Tong, Z.; Ji, B.; Wu, G. Tdn: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1895–1904. [Google Scholar]
- Gowda, S.N.; Rohrbach, M.; Sevilla-Lara, L. SMART Frame Selection for Action Recognition. arXiv 2020, arXiv:2012.10671. [Google Scholar]
- Shi, Y.; Tian, Y.; Wang, Y.; Huang, T. Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans. Multimed. 2017, 19, 1510–1520. [Google Scholar]
- Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar]
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar]
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
- Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Gener. Comput. Syst. 2019, 96, 386–397. [Google Scholar]
- Ijjina, E.P.; Chalavadi, K.M. Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recognit. 2016, 59, 199–212. [Google Scholar]
- Akilan, T.; Wu, Q.J.; Safaei, A.; Jiang, W. A late fusion approach for harnessing multi-CNN model high-level features. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 566–571. [Google Scholar]
- Kim, T.S.; Reiter, A. Interpretable 3d human action analysis with temporal convolutional networks. In Proceedings of the 2017 IEEE conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1623–1631. [Google Scholar]
- Huynh-The, T.; Hua, C.H.; Kim, D.S. Encoding pose features to images with data augmentation for 3-D action recognition. IEEE Trans. Ind. Inform. 2019, 16, 3100–3111. [Google Scholar]
- Gowda, S.N. Human activity recognition using combinatorial Deep Belief Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1–6. [Google Scholar]
- Li, C.; Wang, P.; Wang, S.; Hou, Y.; Li, W. Skeleton-based action recognition using LSTM and CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 585–590. [Google Scholar]
- Das, S.; Chaudhary, A.; Bremond, F.; Thonnat, M. Where to focus on for human action recognition? In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 71–80. [Google Scholar]
- Veeriah, V.; Zhuang, N.; Qi, G.J. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4041–4049. [Google Scholar]
- Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
- Du, Y.; Fu, Y.; Wang, L. Representation learning of temporal dynamics for skeleton-based action recognition. IEEE Trans. Image Process. 2016, 25, 3010–3022. [Google Scholar]
- Zhang, S.; Liu, X.; Xiao, J. On geometric features for skeleton-based action recognition using multilayer lstm networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 148–157. [Google Scholar]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
- Mahasseni, B.; Todorovic, S. Regularizing long short term memory with 3D human-skeleton sequences for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3054–3062. [Google Scholar]
- Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
- Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599. [Google Scholar]
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Wang, H.; Wang, L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 499–508. [Google Scholar]
- Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
- Liou, C.Y.; Cheng, W.C.; Liou, J.W.; Liou, D.R. Autoencoder for words. Neurocomputing 2014, 139, 84–96. [Google Scholar]
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar]
- Zhang, J.; Shan, S.; Kan, M.; Chen, X. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 1–16. [Google Scholar]
- Jiang, X.; Zhang, Y.; Zhang, W.; Xiao, X. A novel sparse auto-encoder for deep unsupervised learning. In Proceedings of the 2013 Sixth International Conference on Advanced Computational Intelligence (ICACI), Hangzhou, China, 19–21 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 256–261. [Google Scholar]
- Zhou, Y.; Arpit, D.; Nwogu, I.; Govindaraju, V. Is joint training better for deep auto-encoders? arXiv 2014, arXiv:1405.1380. [Google Scholar]
- Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
- Zhang, Q.; Yang, L.T.; Yan, Z.; Chen, Z.; Li, P. An efficient deep learning model to predict cloud workload for industry informatics. IEEE Trans. Ind. Inform. 2018, 14, 3170–3178. [Google Scholar]
- Baccouche, M.; Mamalet, F.; Wolf, C.; Garcia, C.; Baskurt, A. Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification. In Proceedings of the BMVC, Surrey, UK, 3–7 September 2012; Volume 1, p. 12. [Google Scholar]
- Hinton, G.E.; Sejnowski, T.J. Learning and relearning in Boltzmann machines. Parallel Distrib. Process. Explor. Microstruct. Cogn. 1986, 1, 2. [Google Scholar]
- Carreira-Perpinan, M.A.; Hinton, G.E. On contrastive divergence learning. In Proceedings of the Aistats, Bridgetown, Barbados, 6–8 January 2005; Volume 10, pp. 33–40. [Google Scholar]
- Hinton, G.E. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 599–619. [Google Scholar]
- Cho, K.; Raiko, T.; Ilin, A. Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In Proceedings of the ICML, Bellevue, WA, USA, 28 June–2 July 2011. [Google Scholar]
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the ICML, Haifa, Israel, 21–24 June 2010. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
- Zeiler, M.D.; Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. arXiv 2013, arXiv:1301.3557. [Google Scholar]
- Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar]
- Chen, B. Deep Learning of Invariant Spatio-Temporal Features from Video. Ph.D. Thesis, University of British Columbia, Vancouver, BC, Canada, 2010. [Google Scholar]
- Zhang, L.; Zhu, G.; Shen, P.; Song, J.; Afaq Shah, S.; Bennamoun, M. Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3120–3128. [Google Scholar]
- Kamel, A.; Sheng, B.; Yang, P.; Li, P.; Shen, R.; Feng, D.D. Deep convolutional neural networks for human action recognition using depth maps and postures. IEEE Trans. Syst. Man, Cybern. Syst. 2018, 49, 1806–1819. [Google Scholar]
- Khan, I.U.; Afzal, S.; Lee, J.W. Human activity recognition via hybrid deep learning based model. Sensors 2022, 22, 323. [Google Scholar]
- Wu, D.; Pigou, L.; Kindermans, P.J.; Le, N.D.H.; Shao, L.; Dambre, J.; Odobez, J.M. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1583–1597. [Google Scholar]
- Wang, P.; Li, W.; Gao, Z.; Zhang, Y.; Tang, C.; Ogunbona, P. Scene flow to action map: A new representation for rgb-d based action recognition with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 595–604. [Google Scholar]
- Shi, Z.; Kim, T.K. Learning and refining of privileged information-based RNNs for action recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3461–3470. [Google Scholar]
- Liu, Z.; Zhang, C.; Tian, Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image Vis. Comput. 2016, 55, 93–100. [Google Scholar]
- Wang, X.; Zhang, S.; Qing, Z.; Tang, M.; Zuo, Z.; Gao, C.; Jin, R.; Sang, N. Hybrid relation guided set matching for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19948–19957. [Google Scholar]
- Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar]
- Duan, H.; Wang, J.; Chen, K.; Lin, D. Pyskl: Towards good practices for skeleton action recognition. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 7351–7354. [Google Scholar]
- Wang, M.; Xing, J.; Liu, Y. Actionclip: A new paradigm for video action recognition. arXiv 2021, arXiv:2109.08472. [Google Scholar]
- Gao, R.; Oh, T.H.; Grauman, K.; Torresani, L. Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10457–10467. [Google Scholar]
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
- Das, S.; Koperski, M.; Bremond, F.; Francesca, G. Deep-temporal lstm for daily living action recognition. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
- Sharma, S.; Kiros, R.; Salakhutdinov, R. Action recognition using visual attention. arXiv 2015, arXiv:1511.04119. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 20–36. [Google Scholar]
- Jian, M.; Zhang, S.; Wu, L.; Zhang, S.; Wang, X.; He, Y. Deep key frame extraction for sport training. Neurocomputing 2019, 328, 147–156. [Google Scholar]
- Zhou, Y.; Sun, X.; Zha, Z.J.; Zeng, W. Mict: Mixed 3d/2d convolutional tube for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 449–458. [Google Scholar]
- Foggia, P.; Saggese, A.; Strisciuglio, N.; Vento, M. Exploiting the deep learning paradigm for recognizing human actions. In Proceedings of the 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, Republic of Korea, 26–29 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 93–98. [Google Scholar]
- Ahsan, U.; Sun, C.; Essa, I. Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. arXiv 2018, arXiv:1801.07230. [Google Scholar]
- Saghafi, B.; Rajan, D. Human action recognition using pose-based discriminant embedding. Signal Process. Image Commun. 2012, 27, 96–111. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Guo, H.; Wang, H.; Ji, Q. Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20052–20061. [Google Scholar]
- Liu, Z.; Tian, Y.; Wang, Z. Improving human action recognition by temporal attention. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 870–874. [Google Scholar]
- Gharaee, Z.; Gärdenfors, P.; Johnsson, M. First and second order dynamics in a hierarchical SOM system for action recognition. Appl. Soft Comput. 2017, 59, 574–585. [Google Scholar]
- Chen, J.; Mittal, G.; Yu, Y.; Kong, Y.; Chen, M. GateHUB: Gated History Unit with Background Suppression for Online Action Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19925–19934. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. Squad: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar]
- Zellers, R.; Bisk, Y.; Schwartz, R.; Choi, Y. Swag: A large-scale adversarial dataset for grounded commonsense inference. arXiv 2018, arXiv:1808.05326. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Rae, J.W.; Potapenko, A.; Jayakumar, S.M.; Lillicrap, T.P. Compressive transformers for long-range sequence modelling. arXiv 2019, arXiv:1911.05507. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Wei, Y.; Liu, H.; Xie, T.; Ke, Q.; Guo, Y. Spatial-temporal transformer for 3d point cloud sequences. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1171–1180. [Google Scholar]
- Chen, J.; Ho, C.M. MM-ViT: Multi-modal video transformer for compressed video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1910–1921. [Google Scholar]
- Wu, C.Y.; Li, Y.; Mangalam, K.; Fan, H.; Xiong, B.; Malik, J.; Feichtenhofer, C. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13587–13597. [Google Scholar]
- Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3333–3343. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473. [Google Scholar]
- Xu, H.; Ghosh, G.; Huang, P.Y.; Arora, P.; Aminzadeh, M.; Feichtenhofer, C.; Metze, F.; Zettlemoyer, L. VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. arXiv 2021, arXiv:2105.09996. [Google Scholar]
- Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.H.; Chang, S.F.; Cui, Y.; Gong, B. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv 2021, arXiv:2104.11178. [Google Scholar]
- Sun, C.; Baradel, F.; Murphy, K.; Schmid, C. Learning video representations using contrastive bidirectional transformer. arXiv 2019, arXiv:1906.05743. [Google Scholar]
- Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. Exploring the limits of language modeling. arXiv 2016, arXiv:1602.02410. [Google Scholar]
- Zhu, L.; Yang, Y. ActBERT: Learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8746–8755. [Google Scholar]
- Luo, H.; Ji, L.; Shi, B.; Huang, H.; Duan, N.; Li, T.; Li, J.; Bharti, T.; Zhou, M. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv 2020, arXiv:2002.06353. [Google Scholar]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
- Marszalek, M.; Laptev, I.; Schmid, C. Actions in context. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 22–24 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 2929–2936. [Google Scholar]
- Reddy, K.K.; Shah, M. Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 2013, 24, 971–981. [Google Scholar]
- Li, W.; Wong, Y.; Liu, A.A.; Li, Y.; Su, Y.T.; Kankanhalli, M. Multi-camera action dataset (MCAD): A dataset for studying non-overlapped cross-camera action recognition. arXiv 2016, arXiv:1607.06408. [Google Scholar]
- Bhardwaj, R.; Singh, P.K. Analytical review on human activity recognition in video. In Proceedings of the 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence), Noida, India, 14–15 January 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 531–536. [Google Scholar]
- Chahuara, P.; Fleury, A.; Vacher, M.; Portet, F. Méthodes SVM et MLN pour la reconnaissance automatique d’activités humaines dans les habitats perceptifs: Tests et perspectives [SVM and MLN methods for the automatic recognition of human activities in perceptive environments: Tests and perspectives]. In Proceedings of the RFIA 2012 (Reconnaissance des Formes et Intelligence Artificielle), Lyon, France, 22–24 January 2012; pp. 978–982. [Google Scholar]
- Nguyen-Duc-Thanh, N.; Stonier, D.; Lee, S.; Kim, D.H. A new approach for human-robot interaction using human body language. In Proceedings of the International Conference on Hybrid Information Technology, Daejeon, Republic of Korea, 22–24 September 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 762–769. [Google Scholar]
- Mollet, N.; Chellali, R. Détection et interprétation des Gestes de la Main [Detection and interpretation of hand gestures]. In Proceedings of the 2005 3rd International Conference on SETIT, Sousse, Tunisia, 27–31 March 2005. [Google Scholar]
- Wenkai, X.; Lee, E.J. Continuous gesture trajectory recognition system based on computer vision. Int. J. Appl. Math. Inf. Sci. 2012, 6, 339–346. [Google Scholar]
- Xu, W.; Lee, E.J. A novel method for hand posture recognition based on depth information descriptor. KSII Trans. Internet Inf. Syst. (TIIS) 2015, 9, 763–774. [Google Scholar]
- Youssef, M.B.; Trabelsi, I.; Bouhlel, M.S. Human action analysis for assistance with daily activities. Int. J. Hum. Mach. Interact. 2016, 7. [Google Scholar]
- Shao, J.; Kang, K.; Change Loy, C.; Wang, X. Deeply learned attributes for crowded scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4657–4666. [Google Scholar]
- Shu, T.; Xie, D.; Rothrock, B.; Todorovic, S.; Chun Zhu, S. Joint inference of groups, events and human roles in aerial videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4576–4584. [Google Scholar]
- Ryoo, M.S.; Aggarwal, J.K. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1593–1600. [Google Scholar]
- Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28. [Google Scholar] [CrossRef] [Green Version]
- Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition 2004, ICPR 2004, Cambridge, UK, 23–26 August 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 32–36. [Google Scholar]
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [Green Version]
- Singh, S.; Velastin, S.A.; Ragheb, H. MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods. In Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, Boston, MA, USA, 29 August–1 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 48–55. [Google Scholar]
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2556–2563. [Google Scholar]
- Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–8. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Minnen, D.; Westeyn, T.; Starner, T.; Ward, J.A.; Lukowicz, P. Performance metrics and evaluation issues for continuous activity recognition. Perform. Metrics Intell. Syst. 2006, 4, 141–148. [Google Scholar]
- Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning. arXiv 2021, arXiv:2103.09504. [Google Scholar]
- Paoletti, G.; Cavazza, J.; Beyan, C.; Del Bue, A. Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtual, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6035–6042. [Google Scholar]
- Ullah, A.; Muhammad, K.; Hussain, T.; Baik, S.W. Conflux LSTMs network: A novel approach for multi-view action recognition. Neurocomputing 2021, 435, 321–329. [Google Scholar] [CrossRef]
- Shahroudy, A.; Ng, T.T.; Gong, Y.; Wang, G. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1045–1058. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Lan, Z.; Lin, M.; Li, X.; Hauptmann, A.G.; Raj, B. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 204–212. [Google Scholar]
- Wu, W.; Sun, Z.; Ouyang, W. Revisiting classifier: Transferring vision-language models for video recognition. In Proceedings of the AAAI, Washington, DC, USA, 7–8 February 2023; Volume 1, p. 5. [Google Scholar]
- Wang, Y.; Li, K.; Li, Y.; He, Y.; Huang, B.; Zhao, Z.; Zhang, H.; Xu, J.; Liu, Y.; Wang, Z.; et al. InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv 2022, arXiv:2212.03191. [Google Scholar]
- Wang, L.; Koniusz, P. Self-supervising action recognition by statistical moment and subspace descriptors. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4324–4333. [Google Scholar]
- Ullah, A.; Muhammad, K.; Ding, W.; Palade, V.; Haq, I.U.; Baik, S.W. Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Appl. Soft Comput. 2021, 103, 107102. [Google Scholar] [CrossRef]
- Negin, F.; Koperski, M.; Crispim, C.F.; Bremond, F.; Coşar, S.; Avgerinakis, K. A hybrid framework for online recognition of activities of daily living in real-world settings. In Proceedings of the 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Colorado Springs, CO, USA, 23–26 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 37–43. [Google Scholar]
- Rautaray, S.S.; Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 2015, 43, 1–54. [Google Scholar] [CrossRef]
- Xu, K.; Qin, Z.; Wang, G. Recognize human activities from multi-part missing videos. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6. [Google Scholar]
- Nweke, H.F.; Teh, Y.W.; Mujtaba, G.; Al-Garadi, M.A. Data fusion and multiple classifier systems for human activity detection and health monitoring: Review and open research directions. Inf. Fusion 2019, 46, 147–170. [Google Scholar] [CrossRef]
- Akansha, U.A.; Shailendra, M.; Singh, N. Analytical review on video-based human activity recognition. In Proceedings of the 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 16–18 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3839–3844. [Google Scholar]
Methods | Data Type | Dataset | Performance | Source | Year |
---|---|---|---|---|---|
Fast Fourier Transform | RGB | UCF101 / Kinetics | Acc: 99.21 / 98.24 | [94] | 2022 |
QSVM | RGB | UCF11 / HMDB51 | Acc: 94.43 / 87.61 | [95] | 2021 |
SVM | RGB | UCSDped-1 / UCSDped-2 / UMN | Acc: 97.14 / 91.13 / 95.24 | [96] | 2017 |
SVM | RGB | UCF11 / UCF50 | Acc: 78.6 / 72.9 | [97] | 2014 |
SVM | RGB | MSRAction3D / UTKinectAction / CAD-60 / MSRDailyActivity3D | Acc: 94.3 / 91.9 / 87.5 / 80.0 | [98] | 2014 |
SVM | RGB | Weizmann / KTH / Hollywood2 | Acc: 100 / 96.3; mAP: 58.46 | [99] | 2011 |
SVM | RGB | KTH / Weizmann / i3Dpost / Ballet / IXMAS | Avg. Acc: 95.5 / 100 / 92.92 / 93.25 / 85.5 | [100] | 2016 |
SVM | RGB | KTH / UCFSports / Hollywood2 | Avg. Acc: 91.8 / 94; mAP: 68.1 | [101] | 2018 |
SVM with ASAGA | RGB | UCSDped-1 | Acc: 87.2 | [102] | 2014 |
SVM with PSO | Skeleton | MSRAction3D / UTKinect / Florence3D Action | Acc: 93.75 / 97.45 / 91.20 | [103] | 2016 |
SVM with GA | RGB | KTH / HMDB51 / UCF YouTube / Hollywood2 | Acc: 95.0 / 48.4 / 82.3 / 46.8 | [104] | 2015 |
SVM-Neural Network | RGB | KTH / Weizmann | Avg. Acc: 96.4 / 100 | [105] | 2015 |
RF | Skeleton | UTKinect | Acc: 92 | [106] | 2013 |
NBNN | Skeleton (3D joints) | MSRAction3D-Test1 / MSRAction3D-Test2 / MSRAction3D-cross-subject | Acc: 95.8 / 97.8 / 83.3 | [21] | 2014 |
HMM-Kernel Discriminant Analysis | Silhouette | Elder care data | Acc: 95.8 | [107] | 2011 |
HMM | Skeleton | Im-DailyDepthActivity / MSRAction3D (CS) / MSRDailyActivity3D (CS) | Acc: 74.23 / 93.3 / 94.1 | [49] | 2017 |
Dynamic Time Warping | RGB | MuHAVi (LOSO) / MuHAVi (LOAO) | Acc: 100 / 100 | [108] | 2014 |
KELM | Depth | MSRGesture (LOSO) / MSRAction3D (CS) | Acc: 93.4 / 91.94 | [109] | 2015 |
KELM | Depth | DHA / MSRAction3D / MSRGesture3D / MSRDailyActivity3D | Acc: 96.7 / 96.70 / 99.39 / 89 | [65] | 2017 |
Methods | Data Type | Dataset | Performance | Source | Year |
---|---|---|---|---|---|
PoseConv3D | RGB+Depth | NTU-RGBD | Acc: 97.1 | [80] | 2022 |
Temporal Difference Networks | RGB | Something-Something V1 / Kinetics | Acc: 68.2 / 79.4 | [123] | 2021 |
CNN | RGB | UCF101 / HMDB51 / FCVID / ActivityNet | Acc: 98.6 / 84.3 / 82.1 / 84.4 | [124] | 2020 |
Two-stream Convolutional Network | RGB | UCF101 / HMDB51 | Acc: 91.5 / 65.9 | [27] | 2015 |
3-stream CNN | RGB | KTH / UCF101 / HMDB51 | Acc: 96.8 / 92.2 / 65.2 | [125] | 2017 |
Multi-stream CNN | Skeleton | NTU-RGBD (CS) / NTU-RGBD (CV) / MSRC-12 (CS) / Northwestern-UCLA | Acc: 80.03 / 87.21 / 96.62 / 92.61 | [126] | 2017 |
3D CNN | RGB | KTH | Acc: 90.2 | [127] | 2012 |
Actional-graph-based CNN | Skeleton | NTU-RGBD (CS) / NTU-RGBD (CV) / Kinetics | Acc: 86.8 / 94.2; Top-1 acc: 34.8, Top-5 acc: 56.5 | [128] | 2019 |
CNN | RGB | UCF101 / HMDB51 | Acc: 92.5 / 65.2 | [129] | 2016 |
CNN | RGB | UCF50 / UCF101 / YouTube Action / HMDB51 | Acc: 96.4 / 94.33 / 96.21 / 70.33 | [130] | 2019 |
CNN-Genetic Algorithm | RGB | UCF50 | Acc: 99.98 | [131] | 2016 |
CNN | Skeleton | UTD-MHAD / NTU-RGBD (CV) / NTU-RGBD (CS) | Acc: 88.10 / 82.3 / 76.2 | [110] | 2017 |
ConvNets | RGB | CIFAR100 / Caltech101 / CIFAR10 | Acc: 75.87 / 95.54 / 91.83 | [132] | 2017 |
Temporal CNN | Skeleton | NTU-RGBD (CV) / NTU-RGBD (CS) | Acc: 83.1 / 74.3 | [133] | 2017 |
ConvNets | Skeleton | MSRAction3D / UTKinect-3D / SBU-Kinect Interaction | Acc: 97.9 / 98.5 / 96.2 | [134] | 2019 |
DBN and CNN | Skeleton | HMDB51 / Hollywood2 | Acc: 80.48 / 91.21 | [135] | 2017 |
CNN-LSTM | Skeleton | NTU-RGBD (CV) / NTU-RGBD (CS) | Acc: 90.10 / 82.89 | [136] | 2017 |
3D-ConvNets-LSTM | Depth | NTU-RGBD (CV) / NTU-RGBD (CS) / UCLA | Acc: 95.4 / 93 / 93.1 | [137] | 2019 |
Methods | Data Type | Dataset | Performance | Source | Year |
---|---|---|---|---|---|
HyRSM | RGB | UCF101 | Acc: 93.0 | [175] | 2022 |
GCN | Skeleton | NTU-RGBD | Acc: 96.1 | [176] | 2022 |
PYSKL | Skeleton | NTU-RGBD / UCF101 | Acc: 97.4 / 86.9 | [177] | 2022 |
ActionCLIP | RGB+Text | Kinetics | Acc: 83.8 | [178] | 2021 |
IMGAUD2VID | RGB+Audio | ActivityNet | Acc: 80.3 | [179] | 2020 |
AGCN-LSTM | Skeleton | NTU-RGBD (CS) / NTU-RGBD (CV) / UCLA | Acc: 89.2 / 95 / 93.3 | [180] | 2019 |
Stacked LSTM | Skeleton | SBU Kinect / HDM05 / CMU | Acc: 90.41 / 97.25 / 81.04 | [144] | 2016 |
Stacked LSTM | Skeleton | MSRDailyActivity3D / NTU-RGBD (CS) / CAD-60 | Acc: 91.56 / 64.9 / 67.64 | [181] | 2018 |
Stacked LSTM | RGB | HMDB51 / UCF101 / Hollywood2 | Acc: 41.31 / 84.96; mAP: 43.91 | [182] | 2015 |
Differential RNN | RGB and Skeleton | MSRAction3D (CV) / KTH-1 (CV) / KTH-2 (CV) | Acc: 92.03 / 93.96 / 92.12 | [138] | 2015 |
TSN | RGB | HMDB51 / UCF101 | Acc: 69.4 / 94.2 | [183] | 2016 |
FCN | RGB | Sports Video | Acc: 97.4 | [184] | 2019 |
AGCN | Skeleton | NTU-RGBD (CS) / NTU-RGBD (CV) / Kinetics | Acc: 88.5 / 95.1; Top-1 acc: 36.1, Top-5 acc: 58.7 | [44] | 2019 |
Two-stream MiCT | RGB | HMDB51 / UCF101 | Acc: 70.5 / 94.7 | [185] | 2018 |
DBN | Depth | MHAD / MIVIA | Acc: 85.8 / 84.7 | [186] | 2014 |
GAN | RGB | UCF101 / HMDB51 | Acc: 47.2 / 14.40 | [187] | 2018 |
Dataset | Input Type | Action Type | #Classes | #Videos | Year | Ref. |
---|---|---|---|---|---|---|
HMDB51 | RGB | Group | 51 | 6766 | 2011 | [235] |
UCF101 | RGB | Group | 101 | 13,320 | 2012 | [237] |
NTU RGB+D | RGB + D + S | Atomic | 60 | 56,880 | 2016 | [142] |
ActivityNet | RGB | Group | 200 | 19,994 | 2016 | [233] |
Kinetics | RGB | Group | 400 | 306,245 | 2017 | [234] |
Hollywood2 | RGB | Group | 12 | 1707 | 2009 | [216] |
KTH | RGB | Atomic | 6 | 2391 | 2004 | [230] |
UCF50 | RGB | Interaction | 50 | 6618 | 2012 | [217] |
MSR Daily Activity 3D | RGB + D + S | Interaction | 16 | 320 | 2012 | [87] |
MSR Action 3D | D + S | Atomic | 20 | 567 | 2010 | [63] |
MuHAVi | RGB | Interaction | 17 | 1904 | 2010 | [232] |
MCAD | RGB | Behavior | 18 | 14,298 | 2016 | [218] |

Input types: RGB = color video, D = depth maps, S = skeleton joints.
Action Type / Dataset | Accuracy | Method | Year |
---|---|---|---|
Atomic Action | | | |
KTH | 99.86% | PredRNN-V2 [239] | 2021 |
NTU RGB+D | 97.1% | PoseC3D [80] | 2022 |
MSR Action 3D | 98.02% | Temporal Subspace Clustering [240] | 2021 |
Behavior | | | |
MCAD | 86.9% | Conflux LSTMs network [241] | 2021 |
Interaction | | | |
MSR Daily Activity 3D | 97.5% | DSSCA-SSLM [242] | 2017 |
MuHAVi | 100% | ST-LSTM (Tree) + Trust Gate [26] | 2016 |
UCF50 | 94.4% | MIFS [243] | 2015 |
Group Activities | | | |
ActivityNet | 96.9% | Text4Vis (w/ ViT-L) [244] | 2023 |
Kinetics | 91.1% | InternVideo-T [245] | 2022 |
HMDB51 | 87.56% | DEEP-HAL with ODF+SDF [246] | 2021 |
Hollywood2 | 71.3% | DS-GRU [247] | 2021 |
UCF101 | 98.64% | SMART [124] | 2020 |
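Most entries in the tables above report top-1 classification accuracy over test clips, and a few also report top-5 accuracy. As a minimal, illustrative sketch of how these two metrics are typically computed from per-clip class scores (not drawn from any of the cited works; the array shapes and the 400-class label space are hypothetical):

```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, ks=(1, 5)) -> dict:
    """Top-k accuracy from per-clip class scores.

    scores: (N, C) array of class scores (logits or probabilities), one row per clip.
    labels: (N,) array of ground-truth class indices.
    """
    # Sort class indices by descending score for every clip.
    ranked = np.argsort(scores, axis=1)[:, ::-1]
    results = {}
    for k in ks:
        # A clip counts as correct if its true label is among the k highest-scoring classes.
        hits = (ranked[:, :k] == labels[:, None]).any(axis=1)
        results[f"top-{k}"] = float(hits.mean())
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(8, 400))     # 8 clips, 400 classes (a Kinetics-sized label space)
    labels = rng.integers(0, 400, size=8)  # hypothetical ground-truth labels
    print(topk_accuracy(scores, labels))   # prints top-1 and top-5 accuracy for this random example
```

The mean average precision (mAP) values quoted for Hollywood2 are computed differently, from per-class ranked retrieval lists rather than per-clip argmax predictions.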
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).