Separable ConvNet Spatiotemporal Mixer for Action Recognition
Abstract
1. Introduction
2. Methodology
2.1. Network Architecture
2.2. Frame Compression Method
2.3. Spatial Domain
Algorithm 1: Time Wrapper

Input: a batch of $B$ video clips $X \in \mathbb{R}^{B \times T \times C \times H \times W}$ and the spatial extractor $Se$
Output: per-frame features $F$

Step 1: Reshape each clip so that frames sharing the same index $i$ are grouped across the batch:

$$X_i \in \mathbb{R}^{B \times C \times H \times W}, \quad i = 1, \dots, T \quad (3)$$

Step 2: Extract the features $f_i$ of the frames of the same index $i$ from different videos using $Se$ as the feature extractor:

$$f_i = Se(X_i), \quad i = 1, \dots, T \quad (4)$$

Step 3: Concatenate the features along the temporal axis:

$$F = \mathrm{Concat}(f_1, f_2, \dots, f_T) \quad (5)$$
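To make the wrapper concrete, below is a minimal PyTorch sketch of the "along to batch" grouping. It assumes a clip tensor of shape (B, T, C, H, W) and any 2-D backbone mapping (B, C, H, W) to (B, D) as the spatial extractor Se; the class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class TimeWrapper(nn.Module):
    """Fold the time axis into the batch axis ("along to batch"):
    frames with the same index i from different clips are batched
    together and passed through one shared spatial extractor."""

    def __init__(self, spatial_extractor: nn.Module):
        super().__init__()
        self.se = spatial_extractor  # weights shared across all T passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -- a batch of B clips with T frames each
        B, T, C, H, W = x.shape
        x = x.transpose(0, 1)                        # Step 1 (Eq. 3): (T, B, C, H, W)
        feats = [self.se(x[i]) for i in range(T)]    # Step 2 (Eq. 4): f_i = Se(X_i)
        return torch.stack(feats, dim=1)             # Step 3 (Eq. 5): (B, T, D)
```

Because the same `se` module processes every frame group, the extractor acts as a Siamese network [16]: all T forward passes share one set of weights.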
2.4. Temporal Domain
2.5. MLP Head
3. Experimental Results
3.1. Dataset
3.2. Training Method
3.3. Various Spatiotemporal Mechanisms
3.4. Spatial Extractor
3.5. Patch Size Analysis
3.6. Number of Spatiotemporal Mixer Layers
3.7. Time Wrapper
3.8. ImageNet Pretrain
3.9. Comparison with State-of-the-Art
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Zhang, H.B.; Zhang, Y.X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.X.; Chen, D.S. A comprehensive survey of vision-based human action recognition methods. Sensors 2019, 19, 1005.
2. Liang, C.; Yang, J.; Du, R.; Hu, W.; Tie, Y. Non-Uniform Motion Aggregation with Graph Convolutional Networks for Skeleton-Based Human Action Recognition. Electronics 2023, 12, 4466.
3. Ji, R. Research on basketball shooting action based on image feature extraction and machine learning. IEEE Access 2020, 8, 138743–138751.
4. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27.
5. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231.
6. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
7. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
8. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
10. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6836–6846.
11. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 10347–10357.
12. Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362.
13. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
14. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272.
15. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
16. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “Siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6.
17. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
18. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950.
19. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
22. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436.
23. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
24. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019.
25. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021.
26. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555.
27. Tran, D.; Wang, H.; Torresani, L.; Feiszli, M. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5552–5561.
28. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1997–2017.
| Model | Mechanism | Top1 (%) | Top5 (%) |
|---|---|---|---|
| *Spatial extractor (ResNet50) pretrained on ImageNet1k* | | | |
| SC_winslide | Multi-scale window + Pooling | 57.4 | 81.4 |
| SCTA | Transformer | 62.3 | 85.2 |
| SCTAX | Transformer (row) + Transformer (column) | 57.0 | 81.7 |
| SCTA_SlowFast | Transformer (slow) + Transformer (fast) | 63.8 | 86.2 |
| SCSM | Spatiotemporal Mixer | 67.4 | 87.7 |
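The Spatiotemporal Mixer row above replaces attention with MLP-style mixing. As a rough illustration of that family, the sketch below shows one MLP-Mixer-style block [14] operating on the (B, T, D) features produced by the time wrapper; the hidden size and exact wiring are assumptions, not the published SCSM layer.

```python
import torch
import torch.nn as nn

class TemporalMixerBlock(nn.Module):
    """MLP-Mixer-style block: token mixing across the T frame tokens,
    then channel mixing across the D feature channels."""

    def __init__(self, t: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(              # mixes across time
            nn.Linear(t, hidden), nn.GELU(), nn.Linear(hidden, t))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(            # mixes across channels
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) per-frame features
        y = self.norm1(x).transpose(1, 2)            # (B, D, T)
        x = x + self.token_mlp(y).transpose(1, 2)    # residual token mixing
        x = x + self.channel_mlp(self.norm2(x))      # residual channel mixing
        return x
```

Stacking N such blocks corresponds to the depth ablation in Section 3.6.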
| Spatial Extractor (Se) | ImageNet1k Top1 (%) | Se Params (M) | Se GFLOPs | Top1 (%) | Top5 (%) |
|---|---|---|---|---|---|
| *Spatial extractor pretrained on ImageNet1k, using the SCTA model* | | | | | |
| RegNet_X_800MF [22] | 75.2 | 7.3 | 0.80 | 56.0 | 78.1 |
| EfficientNet_B3 [23] | 82.0 | 12.2 | 1.83 | 60.5 | 83.6 |
| ResNet50 | 76.1 | 25.6 | 4.09 | 62.3 | 85.2 |
| Swin_V2_T [24] | 82.0 | 28.4 | 5.94 | 73.0 | 90.8 |
| ConvNeXt_Tiny | 82.5 | 28.6 | 4.46 | 73.5 | 90.5 |
| *Spatial extractor pretrained on ImageNet1k, using the SCSM model* | | | | | |
| RegNet_X_800MF | 75.2 | 7.3 | 0.80 | 67.8 | 87.3 |
| ResNet50 | 76.1 | 25.6 | 4.09 | 67.4 | 87.7 |
| ConvNeXt_Tiny | 82.5 | 28.6 | 4.46 | 73.4 | 90.1 |
| ConvNeXt_Tiny (5-layer Spatiotemporal Mixer) | 82.5 | 28.6 | 4.46 | 73.7 | 90.8 |
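Swapping the spatial extractor in the table above amounts to loading a different ImageNet1k-pretrained backbone and dropping its classification head. A sketch using torchvision (an assumed tooling choice; the paper does not specify its implementation):

```python
import torch.nn as nn
from torchvision.models import (convnext_tiny, ConvNeXt_Tiny_Weights,
                                resnet50, ResNet50_Weights)

def make_spatial_extractor(name: str = "convnext_tiny") -> nn.Module:
    """Return an ImageNet1k-pretrained backbone with the classifier
    removed, mapping (B, 3, H, W) -> (B, D) frame features."""
    if name == "convnext_tiny":
        model = convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1)
        model.classifier[-1] = nn.Identity()  # drop the final Linear (D = 768)
    elif name == "resnet50":
        model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        model.fc = nn.Identity()              # drop the final Linear (D = 2048)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return model
```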
| Spatiotemporal Mixer Layers | Top1 (%) | Top5 (%) |
|---|---|---|
| 2 | 63.6 | 85.4 |
| 3 | 67.4 | 87.7 |
| 4 | 67.0 | 87.4 |
| Time Wrapper | Top1 (%) | Top5 (%) |
|---|---|---|
| *ResNet-50 pretrained on ImageNet1k, using SCSM* | | |
| All to batch | 64.2 | 85.3 |
| Along to batch | 67.4 | 87.8 |
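The two variants differ in how the 5-D clip tensor is folded into the batch axis. The reshapes below are one plausible reading, inferred from Algorithm 1 rather than quoted from the paper: "all to batch" flattens clip-major, while "along to batch" groups same-index frames from different clips together.

```python
import torch

x = torch.randn(4, 16, 3, 224, 224)   # (B, T, C, H, W)
B, T, C, H, W = x.shape

# "All to batch": clip-major flattening; all frames of clip 0 come first.
all_to_batch = x.reshape(B * T, C, H, W)

# "Along to batch": time-major; frames with the same index i from
# different clips are adjacent, matching Algorithm 1's grouping.
along_to_batch = x.transpose(0, 1).reshape(T * B, C, H, W)
```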
| Model | Frames | Pretrain | Params (M) | GFLOPs × Views | Top1 (%) | Top5 (%) |
|---|---|---|---|---|---|---|
| *SCSM backbone utilizes ConvNeXt* | | | | | | |
| ResNet3D-50 [26] | 16 | - | 47.0 | 80.3 × 1 | 61.3 | 83.1 |
| I3D-RGB [7] | 25 | - | 12.0 | 108 × 1 | 68.4 | 88.0 |
| I3D-RGB [7] | 25 | IN1k | 12.0 | 108 × 1 | 71.1 | 89.3 |
| R(2+1)D [6] | 32 | - | 63.6 | 152 × 115 | 72.0 | 90.0 |
| ip-CSN-152 [27] | - | - | 32.8 | 109 × 30 | 77.8 | 92.8 |
| SlowFast 4×16, R50 [8] | - | - | 34.4 | 36.1 × 30 | 75.6 | 92.1 |
| SlowFast 8×8, R101 [8] | - | - | 53.7 | 106 × 30 | 77.9 | 93.2 |
| ViViT-L/16×2 FE [10] | 32 | IN1k | - | 3980 × 3 | 80.6 | 92.7 |
| TimeSformer [25] | 8 | IN1k | 121.4 | 196.6 × 3 | 75.8 | - |
| TimeSformer-HR [25] | 16 | IN21k | - | 1703.3 × 3 | 79.7 | 94.4 |
| SCSM_T | 16 | IN1k | 34.7 | 71.4 × 1 | 73.4 | 90.1 |
| SCSM_B | 16 | IN1k | 60.6 | 138.9 × 1 | 75.1 | 91.6 |