A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition
Abstract
1. Introduction
2. Related Work
- Effective extraction of spatial and temporal features. Temporal feature extraction in particular is a focus of current research and remains one of the major challenges.
- Systematic integration of spatial and temporal features. From early score fusion [23] to recent pixel-level fusion [29], researchers have proposed many novel spatial-temporal fusion methods, yet each still leaves considerable room for improvement in effectiveness (the two fusion levels are contrasted in the sketch after this list).
- Improving the efficiency of action recognition networks. Efficiency is a key constraint on whether a proposed algorithm can be applied in practice; it is also a focus of current research and is discussed in the following sections.
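To make the contrast in the second point concrete, below is a minimal PyTorch sketch of score-level versus pixel-level (feature-level) fusion for a two-stream model. The toy backbones, tensor shapes, and the 174-class output (the Something-Something V1 label count) are illustrative assumptions, not the architectures of [23] or [29].

```python
import torch
import torch.nn as nn

# Toy stand-ins for the RGB and optical-flow streams (shapes are assumptions).
rgb_stream = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # spatial features from RGB
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 174))
flow_stream = nn.Sequential(
    nn.Conv2d(2, 64, kernel_size=3, padding=1),    # features from stacked flow (x, y)
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 174))

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 2, 224, 224)

# Score-level fusion: run each stream to the end, then average class scores.
score_fused = (rgb_stream(rgb).softmax(-1) + flow_stream(flow).softmax(-1)) / 2

# Pixel-level (feature-level) fusion: merge intermediate feature maps at
# corresponding spatial positions (here by channel concatenation), then
# classify from the fused map.
rgb_feat = rgb_stream[0](rgb)                      # B x 64 x 224 x 224
flow_feat = flow_stream[0](flow)                   # B x 64 x 224 x 224
fused = torch.cat([rgb_feat, flow_feat], dim=1)    # B x 128 x 224 x 224
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 174))
feat_fused_scores = head(fused)
```

Score fusion keeps the two streams independent until the final prediction, while feature-level fusion lets subsequent layers learn cross-modal spatial correspondences, which [29] found beneficial.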
3. The Proposed Approach
3.1. Motion Attention Module (MA)
3.2. Spatial-Temporal Channel Attention Module (STCA)
4. Experimental Results and Analysis
4.1. Datasets
4.2. Implementation Details
4.3. Comparisons on Various Network Structures
4.4. Ablation Analysis
4.5. Results on Something-Something Dataset
4.6. Results on Diving48 and EGTEA Gaze+
4.7. Spatial-Temporal Feature Distribution Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Chen, Y.; Wang, L.; Li, C.; Hou, Y.; Li, W. ConvNets-based action recognition from skeleton motion maps. Multimed. Tools Appl. 2020, 79, 1707–1725.
2. Kanojia, G.; Kumawat, S.; Raman, S. Attentive Spatio-Temporal Representation Learning for Diving Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
3. Sudhakaran, S.; Escalera, S.; Lanz, O. Gate-Shift Networks for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
4. Aggarwal, J.; Ryoo, M. Human Activity Analysis: A Review. ACM Comput. Surv. 2011, 43, 1–43.
5. Kong, Y.; Fu, Y. Human Action Recognition and Prediction: A Survey. arXiv 2018, arXiv:1806.11230.
6. Turaga, P.; Chellappa, R.; Subrahmanian, V.; Udrea, O. Machine Recognition of Human Activities: A Survey. IEEE Trans. Circuits Syst. Video Technol. 2008, 18, 1473–1488.
7. Guo, G.; Lai, A. A survey on still image based human action recognition. Pattern Recognit. 2014, 47, 3343–3361.
8. Ziaeefard, M.; Bergevin, R. Semantic human activity recognition: A literature review. Pattern Recognit. 2015, 48, 2329–2345.
9. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Action recognition via pose-based graph convolutional networks with intermediate dense supervision. Pattern Recognit. 2022, 121, 108170.
10. Agahian, S.; Negin, F.; Köse, C. An efficient human action recognition framework with pose-based spatiotemporal features. Eng. Sci. Technol. Int. J. 2020, 23, 196–203.
11. Ikizler-Cinbis, N.; Sclaroff, S. Object, Scene and Actions: Combining Multiple Features for Human Action Recognition. In Proceedings of the 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010.
12. Zhang, Y.; Qu, W.; Wang, D. Action-scene Model for Human Action Recognition from Videos. AASRI Procedia 2014, 6, 111–117.
13. Li, J.; Xie, X.; Pan, Q.; Cao, Y.; Zhao, Z.; Shi, G. SGM-Net: Skeleton-guided multimodal network for action recognition. Pattern Recognit. 2020, 104, 107356.
14. Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network. Pattern Recognit. 2020, 107, 107511.
15. Elahi, G.M.E.; Yang, Y.H. Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition. Pattern Recognit. 2021, 122, 108273.
16. Zhang, Z.; Wang, C.; Xiao, B.; Zhou, W.; Liu, S. Human Action Recognition with Attribute Regularization. In Proceedings of the 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, Beijing, China, 18–21 September 2012; pp. 112–117.
17. Liu, L.; Wang, S.; Hu, B.; Qiong, Q.; Wen, J.; Rosenblum, D.S. Learning structures of interval-based Bayesian networks in probabilistic generative model for human complex activity recognition. Pattern Recognit. 2018, 81, 545–561.
18. Wang, H.; Schmid, C. Action Recognition with Improved Trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558.
19. Wang, H.; Oneata, D.; Verbeek, J.; Schmid, C. A Robust and Efficient Video Representation for Action Recognition. Int. J. Comput. Vis. 2015, 119, 219–238.
20. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231.
21. Zhou, X.; Zhu, M.; Pavlakos, G.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. MonoCap: Monocular Human Motion Capture using a CNN Coupled with a Geometric Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 901–914.
22. Martínez, B.M.; Modolo, D.; Xiong, Y.; Tighe, J. Action Recognition With Spatial-Temporal Discriminative Filter Banks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5481–5490.
23. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: New York, NY, USA, 2014; Volume 27.
24. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks for Action Recognition in Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2740–2755.
25. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
26. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 318–335.
27. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
28. Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7083–7093.
29. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941.
30. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Gool, L.V. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36.
31. Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 803–818.
32. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
33. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 1–17.
34. Wang, L.; Li, W.; Li, W.; Gool, L.V. Appearance-and-Relation Networks for Video Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1430–1439.
35. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6202–6211.
36. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
37. Ng, J.Y.H.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
38. Feichtenhofer, C. X3D: Expanding Architectures for Efficient Video Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 200–210.
39. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
40. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. arXiv 2014, arXiv:1406.2199.
41. Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Mahdi Arzani, M.; Yousefzadeh, R.; Van Gool, L. Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification. arXiv 2017, arXiv:1711.08200.
42. Qiu, Z.; Yao, T.; Mei, T. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
43. Sun, L.; Jia, K.; Yeung, D.Y.; Shi, B.E. Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4597–4605.
44. Zolfaghari, M.; Singh, K.; Brox, T. ECO: Efficient Convolutional Network for Online Video Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 695–712.
45. Jiang, B.; Wang, M.; Gan, W.; Wu, W.; Yan, J. STM: SpatioTemporal and Motion Encoding for Action Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
47. Luo, C.; Yuille, A. Grouped Spatial-Temporal Aggregation for Efficient Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5512–5521.
48. Goyal, R.; Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fründ, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5843–5851.
49. Mahdisoltani, F.; Berger, G.; Gharbieh, W.; Fleet, D.; Memisevic, R. On the effectiveness of task granularity for transfer learning. arXiv 2018, arXiv:1804.09235.
50. Li, Y.; Li, Y.; Vasconcelos, N. Resound: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 513–528.
51. Li, Y.; Liu, M.; Rehg, J.M. In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
52. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
53. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
54. Bertasius, G.; Feichtenhofer, C.; Tran, D.; Shi, J.; Torresani, L. Learning discriminative motion features through detection. arXiv 2018, arXiv:1812.04172.
55. Wang, H.; Tran, D.; Torresani, L.; Feiszli, M. Video Modeling with Correlation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
56. Sudhakaran, S.; Lanz, O. Attention is all we need: Nailing down object-centric attention for egocentric activity recognition. arXiv 2018, arXiv:1807.11794.
| Method | Params | GFLOPs | Accuracy (%) |
|---|---|---|---|
| C2D | 23.9 M | 33 | 3.0 |
| C3D | 46.5 M | 62 | 25.9 |
| Cascade 3D | 27.6 M | 37 | 26.9 |
| Reversed Cascade 3D | 27.6 M | 40.6 | 27.8 |
| Parallel | 27.6 M | 40.6 | 31.7 |
| Our DTP | 23.9 M | 33 | 32.5 |
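The DTP row matches C2D's footprint (23.9 M parameters, 33 GFLOPs) because channel splitting adds temporal modeling without adding channels. As a rough illustration of that idea, here is a minimal PyTorch sketch of a dual-path block; the split ratio and kernel sizes are assumptions, and this is not the paper's exact DTP definition.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Hypothetical channel-split block: one group of channels gets a 2D
    spatial conv, the other a 1D temporal conv, and the results are
    concatenated. The ratio and kernels are illustrative assumptions."""

    def __init__(self, channels: int, spatial_ratio: float = 0.5):
        super().__init__()
        self.c_s = int(channels * spatial_ratio)  # channels on the spatial path
        self.c_t = channels - self.c_s            # channels on the temporal path
        self.spatial = nn.Conv3d(self.c_s, self.c_s, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))
        self.temporal = nn.Conv3d(self.c_t, self.c_t, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        xs, xt = torch.split(x, [self.c_s, self.c_t], dim=1)
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

block = DualPathBlock(64)
out = block(torch.randn(2, 64, 8, 56, 56))  # output shape: (2, 64, 8, 56, 56)
```

Because each path convolves only a fraction of the channels, the cost of such a block stays at or below that of a plain 2D block, which is consistent with C2D and DTP sharing identical Params/GFLOPs in the table above.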
| Structure | MSth (%) | SthV1 (%) |
|---|---|---|
| DTP | 32.5 | 45.1 |
| DTP+MA | 33.0 | 46.2 |
| DTP+STCA | 33.1 | 46.5 |
| DTP+MA+STCA | 34.2 | 47.4 |
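The ablation above shows that MA and STCA (Sections 3.1 and 3.2) each improve accuracy on both benchmarks and combine well. As a rough picture of the kind of mechanism such a module can use, the sketch below gates each frame's features with an attention map computed from adjacent-frame differences, a common pattern in motion-encoding modules such as STM [45]; it is an assumption-based illustration, not the paper's exact MA (or STCA) formulation.

```python
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    """Illustrative motion attention: adjacent-frame feature differences
    drive a per-pixel, per-channel sigmoid gate (cf. STM [45])."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.excite = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # Differences between consecutive frames approximate motion; pad the
        # last time step with zeros so the output keeps all T frames.
        diff = torch.cat([x[:, 1:] - x[:, :-1],
                          torch.zeros_like(x[:, :1])], dim=1)
        attn = torch.sigmoid(self.excite(torch.relu(
            self.squeeze(diff.reshape(b * t, c, h, w)))))
        return (x.reshape(b * t, c, h, w) * attn).reshape(b, t, c, h, w)

ma = MotionAttention(64)
y = ma(torch.randn(2, 8, 64, 28, 28))  # shape preserved: (2, 8, 64, 28, 28)
```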
| Models | Backbone | Frame Number | Params | GFLOPs | Top-1 Acc (%) |
|---|---|---|---|---|---|
| TSN [30] (ECCV’16) | ResNet-50 | 8 | 23.9 M | 33 | 19.7 |
| I3D-RGB [32] (CVPR’17) | ResNet-50 | 32 × 2 clips | 28 M | 153 × 2 | 41.6 |
| TRN-2stream [31] (ECCV’18) | BN-Inception | 8 | - | - | 42.0 |
| ECO-RGB [44] (ECCV’18) | BN-Inception | 8 | 47.5 M | 32 | 39.6 |
| | | 16 | 47.5 M | 64 | 41.4 |
| S3D [33] (ECCV’18) | BN-Inception | 64 | - | 66 | 47.3 |
| NL I3D-RGB [53] (CVPR’18) | 3D-ResNet-50 | 32 × 2 clips | 28 M | 117 × 2 | 44.4 |
| TSM [28] (ICCV’19) | ResNet-50 | 8 | 23.9 M | 33 | 43.4 |
| | | 16 | 23.9 M | 65 | 44.8 |
| GST [47] (ICCV’19) | ResNet-50 | 8 | 21.0 M | 29.5 | 46.6 |
| | | 16 | 21.0 M | 59 | 48.6 |
| STM [45] (ICCV’19) | ResNet-50 | 8 × 30 | - | 33.2 × 30 | 49.2 |
| | | 16 × 30 | - | 66.5 × 30 | 50.7 |
| GSM [3] (CVPR’20) | BN-Inception | 8 | - | 16.5 | 47.24 |
| | | 16 | - | 33 | 49.56 |
| CSTANet | ResNet-50 | 8 | 24.1 M | 33 | 47.4 |
| | | 8 × 2 clips | 24.1 M | 33 × 2 | 48.6 |
| | | 16 | 24.1 M | 66 | 48.8 |
| | | 16 × 2 clips | 24.1 M | 66 × 2 | 49.5 |
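In the table above, "× 2 clips" denotes the standard multi-clip test protocol: the network is run on two clips sampled from each video and the softmax scores are averaged, which doubles inference GFLOPs (33 × 2) while leaving the parameter count unchanged. A minimal sketch of that protocol follows; the model and clip tensors are placeholders.

```python
import torch

def multi_clip_score(model, clips):
    """Average softmax scores over test-time clips; inference cost scales
    linearly with the number of clips (e.g., 33 GFLOPs x 2 for two clips)."""
    model.eval()
    with torch.no_grad():
        scores = [model(clip).softmax(dim=-1) for clip in clips]
    return torch.stack(scores, dim=0).mean(dim=0)
```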
| Model | Backbone | Frame | Top-1 Acc (%) |
|---|---|---|---|
| TRN [31] | BN-Inception | 8f | 48.8 |
| TSM [28] (*) | ResNet-50 | 8f | 59.1 |
| | | 16f | 59.4 |
| GST [47] | ResNet-50 | 8f | 58.8 |
| TRN-2Stream [31] | BN-Inception | 8f | 55.5 |
| TSM-2Stream [28] | ResNet-50 | 16f | 63.5 |
| CSTANet | ResNet-50 | 8f | 60.0 |
| | | 16f | 61.6 |
| Method | Pretrain | Top-1 Acc (%) |
|---|---|---|
| C3D (64 frames) [25] | - | 27.6 |
| R(2+1)D [54] | Kinetics | 28.9 |
| R(2+1)D+DIMOFS [54] | Kinetics + PoseTrack | 31.4 |
| C3D-ResNet18 [25] (from [47]) | ImageNet | 33.0 |
| P3D-ResNet18 [42] (from [47]) | ImageNet | 30.8 |
| CSTANet-ResNet18 (ours) | ImageNet | 35.3 |
| C3D-ResNet50 [25] (from [47]) | ImageNet | 34.0 |
| P3D-ResNet50 [42] (from [47]) | ImageNet | 32.4 |
| GST-ResNet50 [47] | ImageNet | 38.8 |
| CorrNet [55] | - | 37.7 |
| Attentive STRL [2] | ImageNet | 35.64 |
| CSTANet-ResNet50 (ours) | ImageNet | 39.5 |
| CSTANet-ResNet50 (×2 clips) | ImageNet | 40.0 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).