Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework
Abstract
1. Introduction
- We propose a computationally efficient cascaded spatial–temporal learning approach for human activity recognition. The proposed system combines deep discriminative RGB features, guided by a channel–spatial attention mechanism, with long-term modeling of action-centric features for reliable recognition of human activities in video streams (a structural sketch of this cascade is given after this list).
- We propose a lightweight CNN architecture comprising eight convolutional layers, where each layer uses at most 64 kernels with spatial dimensions of 3 × 3. With these constrained settings, we develop a compact yet efficient CNN architecture for deep discriminative feature extraction, in contrast to the complex deep CNNs that other contemporary activity-recognition models adopt through transfer learning.
- We design a stacked dual channel–spatial attention mechanism with a residual skip connection for extracting spatial saliency from video frames. The dual attention module is placed after every two consecutive convolutional layers of the developed CNN model, helping the network extract saliency-aware deep discriminative features that localize the action-specific regions in video frames.
- We propose a bi-directional GRU network with three bi-directional layers (each having a forward and a backward pass) that efficiently captures the long-term temporal patterns of human actions in both directions, which greatly enhances feature reusability, improves feature propagation, and alleviates the vanishing-gradient problem.
- We demonstrate the effectiveness and suitability of the proposed encapsulated dual attention CNN and bi-directional GRU framework (DA-CNN+Bi-GRU) for resource-constrained IoT and edge devices by comparing its accuracy and execution/inference time with those of various baseline methods as well as contemporary human action-recognition methods.
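As a structural illustration of the cascade described above, the following is a minimal PyTorch-style sketch that wires a per-frame feature extractor into a three-layer bi-directional GRU and a classification layer. The class and argument names (e.g., `CascadedRecognizer`, `feat_dim`) are illustrative assumptions rather than the authors' released code, and the dual attention CNN used as the frame encoder is sketched separately in Section 3.

```python
# Illustrative sketch only: a per-frame encoder (spatial block) cascaded with a
# three-layer bi-directional GRU (temporal block). Names and default sizes are
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn

class CascadedRecognizer(nn.Module):
    def __init__(self, frame_encoder: nn.Module, feat_dim: int,
                 hidden: int = 256, num_classes: int = 101):
        super().__init__()
        self.frame_encoder = frame_encoder          # e.g., the dual attention CNN
        self.temporal = nn.GRU(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t = clip.shape[:2]
        feats = self.frame_encoder(clip.flatten(0, 1))   # (b*t, feat_dim)
        out, _ = self.temporal(feats.view(b, t, -1))     # (b, t, 2*hidden)
        return self.classifier(out[:, -1])               # clip-level class scores
```

In such an arrangement the spatial block runs independently on each frame, so its cost grows linearly with sequence length, while the temporal block operates only on compact per-frame feature vectors, which is what keeps the cascade lightweight.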
2. Related Works on Human Activity Recognition
2.1. Handcrafted Feature-Based Methods
2.2. Deep Learning Feature-Based Methods
2.3. Temporal Modeling-Based Methods
2.4. Attention Mechanism-Based Methods
3. Proposed Human Activity-Recognition Framework
3.1. Overview of Proposed CNN Architecture
3.2. Dual Attention Module
3.2.1. Channel Attention
3.2.2. Spatial Attention
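Since the channel attention (Section 3.2.1) and spatial attention (Section 3.2.2) are applied back to back with a residual skip connection, a hedged, CBAM-style sketch of such a dual attention block is given below. The reduction ratio and the 7 × 7 spatial-attention kernel are assumptions borrowed from the CBAM design, not values taken from this paper.

```python
# Hedged sketch of a dual (channel + spatial) attention block with a residual
# skip connection, in the spirit of CBAM. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))                  # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                   # global max pooling
        return torch.sigmoid(avg + mx)[:, :, None, None]    # per-channel weights

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, C, H, W)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))             # per-pixel weights

class DualAttention(nn.Module):
    """Channel attention, then spatial attention, with a residual skip."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        y = x * self.ca(x)       # re-weight feature channels
        y = y * self.sa(y)       # re-weight spatial locations
        return x + y             # residual skip connection
```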
3.3. Learning Human Action Patterns via Bi-Directional GRU
4. Experimental Results and Discussion
4.1. Implementation Details
4.2. Datasets
4.2.1. YouTube Action Dataset
4.2.2. UCF50 Dataset
4.2.3. HMDB51 Dataset
4.2.4. UCF101 Dataset
4.2.5. Kinetics-600 Dataset
4.3. Assessment of Our Framework with Baseline Methods
4.4. Comparison with State-of-the-Art Methods
Method | Year | Accuracy (%) on HMDB51 |
---|---|---|
Multi-task hierarchical clustering [68] | 2017 | 51.4 |
STPP+LSTM [79] | 2017 | 70.5 |
Optical flow + multi-layer LSTM [47] | 2018 | 72.2 |
TSN [80] | 2018 | 70.7 |
IP-LSTM [81] | 2019 | 58.6 |
Deep autoencoder [70] | 2019 | 70.3 |
TS-LSTM + temporal-inception [82] | 2019 | 69.0 |
HATNet [83] | 2019 | 74.8 |
Correlational CNN + LSTM [84] | 2020 | 66.2 |
STDAN [67] | 2020 | 56.5 |
DB-LSTM+SSPF [48] | 2021 | 75.1 |
DS-GRU [52] | 2021 | 72.3 |
TCLC [85] | 2021 | 71.5 |
Evidential deep learning [78] | 2021 | 77.0 |
ViT+LSTM [76] | 2021 | 73.7 |
Semi-supervised temporal gradient learning [86] | 2022 | 75.9 |
AdaptFormer [87] | 2022 | 55.6 |
SVT (Linear) [88] | 2022 | 57.8 |
SVT (Fine-tune) [88] | 2022 | 67.2 |
SVFormer-S [89] | 2023 | 59.7 |
SVFormer-B [89] | 2023 | 68.2 |
DA-CNN+Bi-GRU (Proposed) | 2023 | 79.3 |
Method | Year | Accuracy (%) on UCF101 |
---|---|---|
Multi-task hierarchical clustering [68] | 2017 | 76.3 |
Saliency-aware 3DCNN with LSTM [91] | 2017 | 84.0 |
Spatiotemporal multiplier networks [92] | 2017 | 87.0 |
Long-term temporal convolutions [39] | 2017 | 82.4 |
RTS [90] | 2018 | 96.4 |
OFF [93] | 2018 | 96.0 |
TVNet [94] | 2018 | 95.4 |
Attention cluster [95] | 2018 | 94.6 |
CNN with Bi-LSTM [96] | 2018 | 92.8 |
Videolstm [97] | 2018 | 89.2 |
Two stream convnets [98] | 2018 | 84.9 |
Mixed 3D-2D convolutional tube [99] | 2018 | 88.9 |
TS-LSTM + temporal-inception [82] | 2019 | 91.1 |
TSN+TSM [100] | 2019 | 94.3 |
STM [101] | 2019 | 96.2 |
Correlational CNN + LSTM [84] | 2020 | 92.8 |
SVT (Linear) [88] | 2022 | 90.8 |
SVT (Fine-tune) [88] | 2022 | 93.7 |
ConvNet Transformer [102] | 2023 | 86.1 |
SVFormer-S [89] | 2023 | 79.1 |
SVFormer-B [89] | 2023 | 86.7 |
DA-CNN+Bi-GRU (Proposed) | 2023 | 97.6 |
Method | Year | Accuracy (%) on Kinetics-600 |
---|---|---|
SlowFast [106] | 2019 | 81.8 |
Stnet [107] | 2019 | 76.3 |
LGD-3D [108] | 2019 | 82.7 |
GCF-Net [104] | 2020 | 70.0 |
D3D+S3D-G [109] | 2020 | 79.1 |
MoviNet [110] | 2021 | 83.5 |
Global and local-aware attention [105] | 2021 | 70.0 |
MM-ViT [111] | 2022 | 83.5 |
Swin-B [112] | 2022 | 83.8 |
Swin-L [112] | 2022 | 85.9 |
MTV-B [103] | 2022 | 84.0 |
MTV-H [103] | 2022 | 89.6 |
DA-CNN+Bi-GRU (Proposed) | 2023 | 86.7 |
4.5. Action-Recognition Visualization
4.6. Runtime Analysis
5. Conclusions and Future Research Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
CNN | Convolutional neural network |
DA-CNN | Dual attention convolutional neural network |
CBAM | Convolutional block attention module |
IoT | Internet of things |
RNN | Recurrent neural network |
GRU | Gated recurrent unit |
Bi-GRU | Bi-directional gated recurrent unit |
SPF | Seconds per frame |
FPS | Frames per second |
References
- Munir, A.; Blasch, E.; Kwon, J.; Kong, J.; Aved, A. Artificial Intelligence and Data Fusion at the Edge. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 62–78. [Google Scholar] [CrossRef]
- Munir, A.; Kwon, J.; Lee, J.H.; Kong, J.; Blasch, E.; Aved, A.; Muhammad, K. FogSurv: A Fog-Assisted Architecture for Urban Surveillance Using Artificial Intelligence and Data Fusion. IEEE Access 2021, 9, 111938–111959. [Google Scholar] [CrossRef]
- Huang, C.; Wu, Z.; Wen, J.; Xu, Y.; Jiang, Q.; Wang, Y. Abnormal Event Detection Using Deep Contrastive Learning for Intelligent Video Surveillance System. IEEE Trans. Ind. Inform. 2021, 18, 5171–5179. [Google Scholar] [CrossRef]
- Sahu, A.; Chowdhury, A.S. Together Recognizing, Localizing and Summarizing Actions in Egocentric Videos. IEEE Trans. Image Process. 2021, 30, 4330–4340. [Google Scholar] [CrossRef] [PubMed]
- Qi, M.; Qin, J.; Yang, Y.; Wang, Y.; Luo, J. Semantics-Aware Spatial–Temporal Binaries for Cross-Modal Video Retrieval. IEEE Trans. Image Process. 2021, 30, 2989–3004. [Google Scholar] [CrossRef]
- Muhammad, K.; Ullah, H.; Obaidat, M.S.; Ullah, A.; Munir, A.; Sajjad, M.; de Albuquerque, V.H.C. AI-Driven Salient Soccer Events Recognition Framework for Next Generation IoT-Enabled Environments. IEEE Internet Things J. 2021, 2202–2214. [Google Scholar] [CrossRef]
- Ng, W.; Zhang, M.; Wang, T. Multi-Localized Sensitive Autoencoder-Attention-Lstm for Skeleton-Based Action Recognition. IEEE Trans. Multimed. 2021, 24, 1678–1690. [Google Scholar] [CrossRef]
- Asghari, P.; Soleimani, E.; Nazerfard, E. Online Human Activity Recognition Employing Hierarchical Hidden Markov Models. J. Ambient Intell. Humaniz. Comput. 2020, 11, 1141–1152. [Google Scholar] [CrossRef] [Green Version]
- Ehatisham-Ul-Haq, M.; Javed, A.; Azam, M.A.; Malik, H.M.; Irtaza, A.; Lee, I.H.; Mahmood, M.T. Robust Human Activity Recognition Using Multimodal Feature-Level Fusion. IEEE Access 2019, 7, 60736–60751. [Google Scholar] [CrossRef]
- Naveed, H.; Khan, G.; Khan, A.U.; Siddiqi, A.; Khan, M.U.G. Human Activity Recognition Using Mixture of Heterogeneous Features and Sequential Minimal Optimization. Int. J. Mach. Learn. Cybern. 2019, 10, 2329–2340. [Google Scholar] [CrossRef]
- Franco, A.; Magnani, A.; Maio, D. A Multimodal Approach for Human Activity Recognition Based on Skeleton and RGB Data. Pattern Recognit. Lett. 2020, 131, 293–299. [Google Scholar] [CrossRef]
- Elmadany, N.E.D.; He, Y.; Guan, L. Information Fusion for Human Action Recognition via Biset/Multiset Globality Locality Preserving Canonical Correlation Analysis. IEEE Trans. Image Process. 2018, 27, 5275–5287. [Google Scholar] [CrossRef]
- Dileep, D.; Sreeni, K. Anomalous Event Detection in Crowd Scenes using Histogram of Optical Flow and Entropy. In Proceedings of the 2021 Fourth International Conference on Microelectronics, Signals & Systems (ICMSS), Kollam, India, 18–19 November 2021; pp. 1–6. [Google Scholar]
- Yenduri, S.; Perveen, N.; Chalavadi, V. Fine-Grained Action Recognition Using Dynamic Kernels. Pattern Recognit. 2022, 122, 108282. [Google Scholar] [CrossRef]
- Luvizon, D.C.; Picard, D.; Tabia, H. Multi-Task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2752–2764. [Google Scholar] [CrossRef] [Green Version]
- Li, J.; Liu, X.; Zhang, W.; Zhang, M.; Song, J.; Sebe, N. Spatio-Temporal Attention Networks for Action Recognition and Detection. IEEE Trans. Multimed. 2020, 22, 2990–3001. [Google Scholar] [CrossRef]
- Ghose, S.; Prevost, J.J. Autofoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos with Deep Learning. IEEE Trans. Multimed. 2020, 23, 1895–1907. [Google Scholar] [CrossRef]
- Lu, L.; Lu, Y.; Yu, R.; Di, H.; Zhang, L.; Wang, S. GAIM: Graph Attention Interaction Model for Collective Activity Recognition. IEEE Trans. Multimed. 2019, 22, 524–539. [Google Scholar] [CrossRef]
- Liu, K.; Gao, L.; Khan, N.M.; Qi, L.; Guan, L. A Multi-Stream Graph Convolutional Networks-Hidden Conditional Random Field Model for Skeleton-Based Action Recognition. IEEE Trans. Multimed. 2020, 23, 64–76. [Google Scholar] [CrossRef]
- Hu, P.; Ho, E.S.L.; Munteanu, A. 3DBodyNet: Fast Reconstruction of 3D Animatable Human Body Shape From a Single Commodity Depth Camera. IEEE Trans. Multimed. 2021, 24, 2139–2149. [Google Scholar] [CrossRef]
- Yan, C.; Hao, Y.; Li, L.; Yin, J.; Liu, A.; Mao, Z.; Chen, Z.; Gao, X. Task-Adaptive Attention for Image Captioning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 43–51. [Google Scholar] [CrossRef]
- Xia, W.; Yang, Y.; Xue, J.H.; Wu, B. Tedigan: Text-Guided Diverse Face Image Generation and Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Pareek, P.; Thakkar, A. A Survey on Video-Based Human Action Recognition: Recent Updates, Datasets, Challenges, and Applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
- Kong, Y.; Fu, Y. Human Action Recognition and Prediction: A Survey. Int. J. Comput. Vis. 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
- Scovanner, P.; Ali, S.; Shah, M. A 3-Dimensional Sift Descriptor and Its Application to Action Recognition. In Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 25–29 September 2007; pp. 357–360. [Google Scholar]
- Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning Realistic Human Actions from Movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
- Ryoo, M.S.; Matthies, L. First-Person Activity Recognition: Feature, Temporal Structure, and Prediction. Int. J. Comput. Vis. 2016, 119, 307–328. [Google Scholar] [CrossRef]
- Ullah, H.; Muhammad, K.; Irfan, M.; Anwar, S.; Sajjad, M.; Imran, A.S.; de Albuquerque, V.H.C. Light-DehazeNet: A Novel Lightweight CNN Architecture for Single Image Dehazing. IEEE Trans. Image Process. 2021, 30, 8968–8982. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Yao, Y.; Zhang, L.; Wang, Q.; Xie, G.; Shen, F. Saliency Guided Inter-and Intra-Class Relation Constraints for Weakly Supervised Semantic Segmentation. IEEE Trans. Multimed. 2022, 25, 1727–1737. [Google Scholar] [CrossRef]
- Aafaq, N.; Mian, A.S.; Akhtar, N.; Liu, W.; Shah, M. Dense Video Captioning with Early Linguistic Information Fusion. IEEE Trans. Multimed. 2022, 25, 2309–2322. [Google Scholar] [CrossRef]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 27 June–2 July 2014; pp. 1725–1732. [Google Scholar]
- Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y. Towards Good Practices for Very Deep Two-Stream Convnets. arXiv 2015, arXiv:1507.02159. [Google Scholar]
- Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond Short Snippets: Deep Networks for Video Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4694–4702. [Google Scholar]
- Wu, Z.; Wang, X.; Jiang, Y.G.; Ye, H.; Xue, X. Modeling Spatial–Temporal Clues in a Hybrid Deep Learning Framework for Video Classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26 October 2015; pp. 461–470. [Google Scholar]
- Wang, X.; Farhadi, A.; Gupta, A. Actions Transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2658–2667. [Google Scholar]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1933–1941. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
- Varol, G.; Laptev, I.; Schmid, C. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef] [Green Version]
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [Green Version]
- Oikonomou, K.M.; Kansizoglou, I.; Manaveli, P.; Grekidis, A.; Menychtas, D.; Aggelousis, N.; Sirakoulis, G.C.; Gasteratos, A. Joint-Aware Action Recognition for Ambient Assisted Living. In Proceedings of the 2022 IEEE International Conference on Imaging Systems and Techniques (IST), Kaohsiung, Taiwan, 21–23 June 2022; pp. 1–6. [Google Scholar]
- Shah, A.; Mishra, S.; Bansal, A.; Chen, J.C.; Chellappa, R.; Shrivastava, A. Pose and Joint-Aware Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3850–3860. [Google Scholar]
- Holte, M.B.; Tran, C.; Trivedi, M.M.; Moeslund, T.B. Human pose estimation and activity recognition from multi-view videos: Comparative explorations of recent developments. IEEE J. Sel. Top. Signal Process. 2012, 6, 538–552. [Google Scholar] [CrossRef]
- Nandagopal, S.; Karthy, G.; Oliver, A.S.; Subha, M. Optimal Deep Convolutional Neural Network with Pose Estimation for Human Activity Recognition. Comput. Syst. Sci. Eng. 2022, 44, 1719–1733. [Google Scholar] [CrossRef]
- Zhou, T.; Wang, W.; Qi, S.; Ling, H.; Shen, J. Cascaded human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4263–4272. [Google Scholar]
- Zhou, T.; Yang, Y.; Wang, W. Differentiable Multi-Granularity Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8296–8310. [Google Scholar] [CrossRef]
- Ullah, A.; Muhammad, K.; Del Ser, J.; Baik, S.W.; de Albuquerque, V.H.C. Activity Recognition Using Temporal Optical Flow Convolutional Features and Multilayer LSTM. IEEE Trans. Ind. Electron. 2018, 66, 9692–9702. [Google Scholar] [CrossRef]
- He, J.Y.; Wu, X.; Cheng, Z.Q.; Yuan, Z.; Jiang, Y.G. DB-LSTM: Densely Connected Bi-Directional LSTM for Human Action Recognition. Neurocomputing 2021, 444, 319–331. [Google Scholar] [CrossRef]
- Sun, X.; Xu, H.; Dong, Z.; Shi, L.; Liu, Q.; Li, J.; Li, T.; Fan, S.; Wang, Y. CapsGaNet: Deep Neural Network Based on Capsule and GRU for Human Activity Recognition. IEEE Syst. J. 2022, 16, 5845–5855. [Google Scholar] [CrossRef]
- Ibrahim, M.S.; Muralidharan, S.; Deng, Z.; Vahdat, A.; Mori, G. A Hierarchical Deep Temporal Model for Group Activity Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1971–1980. [Google Scholar]
- Biswas, S.; Gall, J. Structural Recurrent Neural Network (SRNN) for Group Activity Analysis. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1625–1632. [Google Scholar]
- Ullah, A.; Muhammad, K.; Ding, W.; Palade, V.; Haq, I.U.; Baik, S.W. Efficient Activity Recognition Using Lightweight CNN and DS-GRU Network for Surveillance Applications. Appl. Soft Comput. 2021, 103, 107102. [Google Scholar] [CrossRef]
- Li, X.; Zhao, Z.; Wang, Q. Abssnet: Attention-Based Spatial Segmentation Network for Traffic Scene Understanding. IEEE Trans. Cybern. 2021, 52, 9352–9362. [Google Scholar] [CrossRef]
- Deng, J.; Li, L.; Zhang, B.; Wang, S.; Zha, Z.; Huang, Q. Syntax-Guided Hierarchical Attention Network for Video Captioning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 880–892. [Google Scholar] [CrossRef]
- Yang, Z.; He, X.; Gao, J.; Deng, L.; Smola, A. Stacked Attention Networks for Image Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 21–29. [Google Scholar]
- Baradel, F.; Wolf, C.; Mille, J. Human Action Recognition: Pose-Based Attention Draws Focus to Hands. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 604–613. [Google Scholar]
- Islam, M.M.; Iqbal, T. Multi-gat: A Graphical Attention-Based Hierarchical Multimodal Representation Learning Approach for Human Activity Recognition. IEEE Robot. Autom. Lett. 2021, 6, 1729–1736. [Google Scholar] [CrossRef]
- Long, X.; Gan, C.; Melo, G.; Liu, X.; Li, Y.; Li, F.; Wen, S. Multimodal Keyless Attention Fusion for Video Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection. IEEE Trans. Image Process. 2018, 27, 3459–3471. [Google Scholar] [CrossRef]
- Cho, S.; Maqbool, M.; Liu, F.; Foroosh, H. Self-Attention Network for Skeleton-Based Human Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 635–644. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Liu, J.; Luo, J.; Shah, M. Recognizing Realistic Actions from Videos “in the Wild”. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1996–2003. [Google Scholar]
- Reddy, K.K.; Shah, M. Recognizing 50 Human Action Categories of Web Videos. Mach. Vis. Appl. 2013, 24, 971–981. [Google Scholar] [CrossRef] [Green Version]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A Large Video Database for Human Motion Recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Carreira, J.; Noland, E.; Banki-Horvath, A.; Hillier, C.; Zisserman, A. A Short Note About Kinetics-600. arXiv 2018, arXiv:1808.01340. [Google Scholar]
- Zhang, Z.; Lv, Z.; Gan, C.; Zhu, Q. Human Action Recognition Using Convolutional LSTM and Fully Connected LSTM With Different Attentions. Neurocomputing 2020, 410, 304–316. [Google Scholar] [CrossRef]
- Liu, A.A.; Su, Y.T.; Nie, W.Z.; Kankanhalli, M. Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 102–114. [Google Scholar] [CrossRef]
- Ye, J.; Wang, L.; Li, G.; Chen, D.; Zhe, S.; Chu, X.; Xu, Z. Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9378–9387. [Google Scholar]
- Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Action Recognition Using Optimized Deep Autoencoder and CNN for surveillance Data Streams of Non-Stationary Environments. Future Gener. Comput. Syst. 2019, 96, 386–397. [Google Scholar] [CrossRef]
- Dai, C.; Liu, X.; Lai, J. Human Action Recognition Using Two-Stream Attention Based LSTM Networks. Appl. Soft Comput. 2020, 86, 105820. [Google Scholar] [CrossRef]
- Afza, F.; Khan, M.A.; Sharif, M.; Kadry, S.; Manogaran, G.; Saba, T.; Ashraf, I.; Damaševičius, R. A Framework of Human Action Recognition Using Length Control Features Fusion and Weighted Entropy-Variances Based Feature Selection. Image Vis. Comput. 2021, 106, 104090. [Google Scholar] [CrossRef]
- Muhammad, K.; Ullah, A.; Imran, A.S.; Sajjad, M.; Kiran, M.S.; Sannino, G.; de Albuquerque, V.H.C. Human Action Recognition Using Attention Based LSTM Network with Dilated CNN Features. Future Gener. Comput. Syst. 2021, 125, 820–830. [Google Scholar] [CrossRef]
- Al-Obaidi, S.; Al-Khafaji, H.; Abhayaratne, C. Making Sense of Neuromorphic Event Data for Human Action Recognition. IEEE Access 2021, 9, 82686–82700. [Google Scholar] [CrossRef]
- Zhang, L.; Lim, C.P.; Yu, Y. Intelligent Human Action Recognition Using an Ensemble Model of Evolving Deep Networks with Swarm-Based Optimization. Knowl.-Based Syst. 2021, 220, 106918. [Google Scholar] [CrossRef]
- Hussain, A.; Hussain, T.; Ullah, W.; Baik, S.W. Vision transformer and deep sequence learning for human activity recognition in surveillance videos. Comput. Intell. Neurosci. 2022, 2022, 3454167. [Google Scholar] [CrossRef]
- Du, Z.; Mukaidani, H. Linear Dynamical Systems Approach for Human Action Recognition with Dual-Stream Deep Features. Appl. Intell. 2022, 52, 452–470. [Google Scholar] [CrossRef]
- Bao, W.; Yu, Q.; Kong, Y. Evidential Deep Learning for Open Set Action Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13349–13358. [Google Scholar]
- Wang, X.; Gao, L.; Wang, P.; Sun, X.; Liu, X. Two-Stream 3-D Convnet Fusion for Action Recognition in Videos with Arbitrary Size and Length. IEEE Trans. Multimed. 2017, 20, 634–644. [Google Scholar] [CrossRef]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks for Action Recognition in Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2740–2755. [Google Scholar] [CrossRef] [Green Version]
- Yu, S.; Xie, L.; Liu, L.; Xia, D. Learning Long-Term Temporal Features with Deep Neural Networks for Human Action Recognition. IEEE Access 2019, 8, 1840–1850. [Google Scholar] [CrossRef]
- Ma, C.Y.; Chen, M.H.; Kira, Z.; AlRegib, G. TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition. Signal Process. Image Commun. 2019, 71, 76–87. [Google Scholar] [CrossRef] [Green Version]
- Diba, A.; Fayyaz, M.; Sharma, V.; Paluri, M.; Gall, J.; Stiefelhagen, R.; Van Gool, L. Holistic Large Scale Video Understanding. arXiv 2019, arXiv:1904.11451. [Google Scholar]
- Majd, M.; Safabakhsh, R. Correlational Convolutional LSTM for Human Action Recognition. Neurocomputing 2020, 396, 224–229. [Google Scholar] [CrossRef]
- Zhu, L.; Fan, H.; Luo, Y.; Xu, M.; Yang, Y. Temporal Cross-Layer Correlation Mining for Action Recognition. IEEE Trans. Multimed. 2021, 24, 668–676. [Google Scholar] [CrossRef]
- Xiao, J.; Jing, L.; Zhang, L.; He, J.; She, Q.; Zhou, Z.; Yuille, A.; Li, Y. Learning from Temporal Gradient for Semi-Supervised Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3252–3262. [Google Scholar]
- Chen, S.; Ge, C.; Tong, Z.; Wang, J.; Song, Y.; Wang, J.; Luo, P. Adaptformer: Adapting vision transformers for scalable visual recognition. arXiv 2022, arXiv:2205.13535. [Google Scholar]
- Ranasinghe, K.; Naseer, M.; Khan, S.; Khan, F.S.; Ryoo, M.S. Self-supervised video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2874–2884. [Google Scholar]
- Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.G. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18816–18826. [Google Scholar]
- Zhu, Y.; Newsam, S. Random Temporal Skipping for Multirate Video Analysis. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 542–557. [Google Scholar]
- Wang, X.; Gao, L.; Song, J.; Shen, H. Beyond Frame-Level CNN: Saliency-Aware 3-D CNN with LSTM for Video Action Recognition. IEEE Signal Process. Lett. 2016, 24, 510–514. [Google Scholar] [CrossRef]
- Feichtenhofer, C.; Pinz, A.; Wildes, R.P. Spatiotemporal Multiplier Networks for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4768–4777. [Google Scholar]
- Sun, S.; Kuang, Z.; Sheng, L.; Ouyang, W.; Zhang, W. Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1390–1399. [Google Scholar]
- Fan, L.; Huang, W.; Gan, C.; Ermon, S.; Gong, B.; Huang, J. End-to-End Learning of Motion Representation for Video Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6016–6025. [Google Scholar]
- Long, X.; Gan, C.; De Melo, G.; Wu, J.; Liu, X.; Wen, S. Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7834–7843. [Google Scholar]
- Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action Recognition in Video Sequences Using Deep Bi-Directional LSTM with CNN Features. IEEE Access 2017, 6, 1155–1166. [Google Scholar] [CrossRef]
- Li, Z.; Gavrilyuk, K.; Gavves, E.; Jain, M.; Snoek, C.G. Videolstm Convolves, Attends and Flows for Action Recognition. Comput. Vis. Image Underst. 2018, 166, 41–50. [Google Scholar] [CrossRef] [Green Version]
- Han, Y.; Zhang, P.; Zhuo, T.; Huang, W.; Zhang, Y. Going Deeper with Two-Stream ConvNets for Action Recognition in Video Surveillance. Pattern Recognit. Lett. 2018, 107, 83–90. [Google Scholar] [CrossRef]
- Zhou, Y.; Sun, X.; Zha, Z.J.; Zeng, W. Mict: Mixed 3D/2D Convolutional Tube for Human Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 449–458. [Google Scholar]
- Song, X.; Lan, C.; Zeng, W.; Xing, J.; Sun, X.; Yang, J. Temporal–Spatial Mapping for Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 748–759. [Google Scholar] [CrossRef]
- Jiang, B.; Wang, M.; Gan, W.; Wu, W.; Yan, J. STM: Spatiotemporal and Motion Encoding for Action Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2000–2009. [Google Scholar]
- Phong, N.H.; Ribeiro, B. Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer. arXiv 2023, arXiv:2302.09187. [Google Scholar]
- Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview Transformers for Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3333–3343. [Google Scholar]
- Hsiao, J.; Chen, J.; Ho, C. Gcf-Net: Gated Clip Fusion Network for Video Action Recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 699–713. [Google Scholar]
- Zheng, Z.; An, G.; Wu, D.; Ruan, Q. Global and Local Knowledge-Aware Attention Network for Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 334–347. [Google Scholar] [CrossRef]
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
- He, D.; Zhou, Z.; Gan, C.; Li, F.; Liu, X.; Li, Y.; Wang, L.; Wen, S. STNET: Local and Global Spatial–Temporal Modeling for Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8401–8408. [Google Scholar]
- Qiu, Z.; Yao, T.; Ngo, C.W.; Tian, X.; Mei, T. Learning Spatio-Temporal Representation with Local and Global Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12056–12065. [Google Scholar]
- Stroud, J.; Ross, D.; Sun, C.; Deng, J.; Sukthankar, R. D3d: Distilled 3D Networks for Video Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 625–634. [Google Scholar]
- Kondratyuk, D.; Yuan, L.; Li, Y.; Zhang, L.; Tan, M.; Brown, M.; Gong, B. Movinets: Mobile Video Networks for Efficient Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 19–25 June 2021; pp. 16020–16030. [Google Scholar]
- Chen, J.; Ho, C.M. MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1910–1921. [Google Scholar]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
- Munir, A.; Gordon-Ross, A.; Lysecky, S.; Lysecky, R. A Lightweight Dynamic Optimization Methodology and Application Metrics Estimation Model for Wireless Sensor Networks. Elsevier Sustain. Comput. Inform. Syst. 2013, 3, 94–108. [Google Scholar] [CrossRef]
- Alghamdi, Y.; Munir, A.; Ahmad, J. A Lightweight Image Encryption Algorithm Based on Chaotic Map and Random Substitution. Entropy 2022, 24, 1344. [Google Scholar] [CrossRef]
Layer | Input Channels | Number of Kernels | Kernel Size | Stride | Padding | Output Channels |
---|---|---|---|---|---|---|
Conv 1 | 3 | 16 | 3 × 3 | 1 | 1 | 16 |
Conv 2 | 16 | 16 | 3 × 3 | 1 | 1 | 16 |
Max pooling | | | | | | |
Channel Attention | | | | | | |
Spatial Attention | | | | | | |
Conv 3 | 32 | 32 | 3 × 3 | 1 | 1 | 32 |
Conv 4 | 32 | 32 | 3 × 3 | 1 | 1 | 32 |
Max pooling | | | | | | |
Channel Attention | | | | | | |
Spatial Attention | | | | | | |
Conv 5 | 32 | 32 | 3 × 3 | 1 | 1 | 32 |
Conv 6 | 32 | 32 | 3 × 3 | 1 | 1 | 32 |
Max pooling | | | | | | |
Channel Attention | | | | | | |
Spatial Attention | | | | | | |
Conv 7 | 32 | 64 | 3 × 3 | 1 | 1 | 64 |
Conv 8 | 32 | 64 | 3 × 3 | 1 | 1 | 64 |
Max pooling | | | | | | |
Channel Attention | | | | | | |
Spatial Attention | | | | | | |
Global Average Pooling | | | | | | |
Flatten | | | | | | |
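Read together with the attention placement described in Section 3.2, the table above implies a backbone of four convolution pairs (16, 32, 32, and 64 kernels per layer), each followed by max pooling and a dual attention block, then global average pooling and flattening. The sketch below is a hedged reconstruction under those assumptions: ReLU activations are assumed, the attention block is a pluggable module (an identity placeholder here), and each pair's input channels are taken from the previous pair's output.

```python
# Hedged reconstruction of the backbone summarized in the table above.
# Activations, the identity attention placeholder, and channel wiring are
# assumptions; substitute the dual attention module of Section 3.2.
import torch.nn as nn

def conv_pair(in_ch: int, out_ch: int, attention: nn.Module = None) -> nn.Sequential:
    """Two 3x3 convolutions and max pooling, followed by an attention block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        attention if attention is not None else nn.Identity())

class DACNNBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        widths = [16, 32, 32, 64]           # kernels per layer in each conv pair
        blocks, in_ch = [], 3
        for w in widths:
            blocks.append(conv_pair(in_ch, w))
            in_ch = w
        self.features = nn.Sequential(*blocks)
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling

    def forward(self, x):                   # x: (batch, 3, H, W) RGB frame
        return self.gap(self.features(x)).flatten(1)   # (batch, 64) feature vector
```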
Method | Spatial Block Layers | Temporal Block Layers |
---|---|---|
CNN+LSTM | 8 convolutional | 3 LSTM |
CNN+Bi-LSTM | 8 convolutional | 6 LSTM (3 forward and 3 backward) |
CNN+GRU | 8 convolutional | 3 GRU |
CNN+Bi-GRU | 8 convolutional | 6 GRU (3 forward and 3 backward) |
DA-CNN+Bi-GRU | 12 (8 convolutional and 4 dual attention) | 6 GRU (3 forward and 3 backward) |
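To make the baseline configurations above concrete, the following hedged snippet shows how the temporal blocks differ: the same eight-layer CNN feeds either an LSTM or a GRU, in uni- or bi-directional form. The hidden size and feature dimension are illustrative assumptions.

```python
# Hedged sketch of the temporal-block variants listed in the table above.
import torch.nn as nn

def temporal_block(kind: str = "gru", bidirectional: bool = True,
                   feat_dim: int = 64, hidden: int = 256) -> nn.Module:
    rnn = {"lstm": nn.LSTM, "gru": nn.GRU}[kind]
    return rnn(feat_dim, hidden, num_layers=3,
               batch_first=True, bidirectional=bidirectional)

# CNN+LSTM       -> temporal_block("lstm", bidirectional=False)  # 3 LSTM layers
# CNN+Bi-LSTM    -> temporal_block("lstm", bidirectional=True)   # 3 forward + 3 backward
# CNN+GRU        -> temporal_block("gru",  bidirectional=False)  # 3 GRU layers
# DA-CNN+Bi-GRU  -> temporal_block("gru",  bidirectional=True)   # 3 forward + 3 backward
```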
Method | Dataset | Accuracy (%) |
---|---|---|
CNN+LSTM | YouTube action | 64.7 |
CNN+Bi-LSTM | YouTube action | 84.2 |
CNN+GRU | YouTube action | 88.5 |
CNN+Bi-GRU | YouTube action | 92.1 |
CNN (channel attention only)+Bi-GRU | YouTube action | 94.2 |
CNN (spatial attention only)+Bi-GRU | YouTube action | 95.6 |
DA-CNN+Bi-GRU (Proposed) | YouTube action | 98.0 |
CNN+LSTM | UCF50 | 76.3 |
CNN+Bi-LSTM | UCF50 | 83.3 |
CNN+GRU | UCF50 | 87.6 |
CNN+Bi-GRU | UCF50 | 93.6 |
CNN (channel attention only)+Bi-GRU | UCF50 | 95.1 |
CNN (spatial attention only)+Bi-GRU | UCF50 | 95.7 |
DA-CNN+Bi-GRU (Proposed) | UCF50 | 98.5 |
CNN+LSTM | HMDB51 | 56.7 |
CNN+Bi-LSTM | HMDB51 | 63.2 |
CNN+GRU | HMDB51 | 68.0 |
CNN+Bi-GRU | HMDB51 | 72.4 |
CNN (channel attention only)+Bi-GRU | HMDB51 | 73.9 |
CNN (spatial attention only)+Bi-GRU | HMDB51 | 74.5 |
DA-CNN+Bi-GRU (Proposed) | HMDB51 | 79.3 |
CNN+LSTM | UCF101 | 83.9 |
CNN+Bi-LSTM | UCF101 | 86.8 |
CNN+GRU | UCF101 | 90.7 |
CNN+Bi-GRU | UCF101 | 94.2 |
CNN (channel attention only)+Bi-GRU | UCF101 | 95.1 |
CNN (spatial attention only)+Bi-GRU | UCF101 | 95.8 |
DA-CNN+Bi-GRU (Proposed) | UCF101 | 97.6 |
CNN+LSTM | Kinetics-600 | 73.2 |
CNN+Bi-LSTM | Kinetics-600 | 77.9 |
CNN+GRU | Kinetics-600 | 81.5 |
CNN+Bi-GRU | Kinetics-600 | 84.3 |
CNN (channel attention only)+Bi-GRU | Kinetics-600 | 84.9 |
CNN (spatial attention only)+Bi-GRU | Kinetics-600 | 85.6 |
DA-CNN+Bi-GRU (Proposed) | Kinetics-600 | 86.7 |
Method | Year | Accuracy (%) on YouTube Action |
---|---|---|
Multi-task hierarchical clustering [68] | 2017 | 89.7 |
BT-LSTM [69] | 2018 | 85.3 |
Deep autoencoder [70] | 2019 | 96.2 |
STDAN [67] | 2020 | 98.2 |
Two-stream attention LSTM [71] | 2020 | 96.9 |
Weighted entropy-variance-based feature selection [72] | 2021 | 94.5 |
Dilated CNN+BiLSTM+RB [73] | 2021 | 89.0 |
DS-GRU [52] | 2021 | 97.1 |
Local-global features + QSVM [74] | 2021 | 82.6 |
DA-CNN+Bi-GRU (Proposed) | 2023 | 98.0 |
Method | Year | Accuracy (%) on UCF50 |
---|---|---|
Multi-task hierarchical clustering [68] | 2017 | 93.2 |
Deep autoencoder [70] | 2019 | 96.4 |
Ensemble model with swarm-based optimization [75] | 2021 | 92.2 |
DS-GRU [52] | 2021 | 95.2 |
Local-global features + QSVM [74] | 2021 | 69.4 |
ViT+LSTM [76] | 2021 | 96.1 |
(LD-BF) + (LD-DF) [77] | 2022 | 97.5 |
DA-CNN+Bi-GRU (Proposed) | 2023 | 98.5 |
Method | SPF (GPU) | SPF (CPU) | Year | FPS (GPU) | FPS (CPU) |
---|---|---|---|---|---|
STPP+LSTM [79] | 0.0053 | - | 2017 | 186.6 | - |
CNN with Bi-LSTM [96] | 0.0570 | - | 2017 | 20 | - |
OFF [93] | 0.0048 | - | 2018 | 206 | - |
Videolstm [97] | 0.0940 | - | 2018 | 10.6 | - |
Optical flow + multi-layer LSTM [47] | 0.0356 | 0.18 | 2018 | 30 | 3.5 |
Deep autoencoder [70] | 0.0430 | 0.43 | 2019 | 24 | 1.5 |
TSN+TSM [100] | 0.0167 | - | 2019 | 60 | - |
IP-LSTM [81] | 0.0431 | - | 2019 | 23.2 | - |
STDAN [67] | 0.0075 | - | 2020 | 132 | - |
DS-GRU [52] | 0.0400 | - | 2021 | 25 | - |
MoviNet [110] | 0.0833 | - | 2021 | 12 | - |
(LD-BF) + (LD-DF) [77] | 0.0670 | - | 2022 | 14 | - |
DA-CNN+Bi-GRU (Proposed) | 0.0036 | 0.0049 | 2023 | 300 | 250 |
Method | SPF (GPU) | SPF (CPU) | Year | FPS (GPU) | FPS (CPU) |
---|---|---|---|---|---|
STPP+LSTM [79] | 0.0023 | - | 2017 | 423.58 | - |
CNN with Bi-LSTM [96] | 0.0354 | - | 2017 | 32.14 | - |
OFF [93] | 0.0029 | - | 2018 | 331.04 | - |
Videolstm [97] | 0.0584 | - | 2018 | 17.03 | - |
Optical flow + multi-layer LSTM [47] | 0.0221 | 0.17 | 2018 | 48.21 | 3.71 |
Deep autoencoder [70] | 0.0267 | 0.40 | 2019 | 38.56 | 1.59 |
TSN+TSM [100] | 0.0167 | - | 2019 | 60 | - |
IP-LSTM [81] | 0.0268 | - | 2019 | 37.28 | - |
STDAN [67] | 0.0046 | - | 2020 | 212.12 | - |
DS-GRU [52] | 0.0248 | - | 2021 | 40.17 | - |
MoviNet [110] | 0.0645 | - | 2021 | 15.48 | - |
(LD-BF) + (LD-DF) [77] | 0.0416 | - | 2022 | 22.49 | - |
DA-CNN+Bi-GRU (Proposed) | 0.0036 | 0.0049 | 2023 | 300 | 250 |
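Since SPF and FPS are reciprocals (FPS = 1/SPF), the per-frame timings reported above can be cross-checked directly. The snippet below is a minimal sketch of how such per-frame inference time might be measured for a PyTorch model; the warm-up iterations and device synchronization are common timing practice assumed here, not necessarily the exact protocol used in this paper.

```python
# Hedged sketch of per-frame runtime measurement; SPF and FPS are reciprocals.
import time
import torch

def measure_spf(model, frames, device="cuda", warmup=10, iters=100):
    model = model.to(device).eval()
    frames = frames.to(device)                   # a batch of input frames
    with torch.no_grad():
        for _ in range(warmup):                  # warm-up to exclude start-up cost
            model(frames)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(frames)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    spf = elapsed / (iters * frames.shape[0])    # seconds per frame
    return spf, 1.0 / spf                        # (SPF, FPS)
```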
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Ullah, H.; Munir, A. Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework. J. Imaging 2023, 9, 130. https://doi.org/10.3390/jimaging9070130