Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization
Abstract
1. Introduction
- We propose a novel method for adapting CLIP to the video domain for efficient action recognition. The method is simple to implement, leaves the original pre-trained parameters untouched, and introduces very few trainable parameters. Extensive experiments demonstrate its strong performance and generalization across a variety of learning settings;
- On the visual side, we design a temporal prompt reparameterization encoder that strengthens the model's temporal modeling capability. Instead of learning prompt vectors directly, the encoder reparameterizes them, allowing the model to learn more generalizable temporal representations for specific domains while remaining lightweight and efficient;
- On the textual side, we predefine a Chinese label dictionary and introduce a corresponding Chinese text encoder, imposing joint Chinese–English semantic constraints that enrich the video representations (a minimal illustrative sketch of both components is given after this list).
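To make these two components concrete, below is a minimal, hypothetical PyTorch-style sketch; it is not the authors' released code. The module and function names (`ReparamTemporalPrompts`, `dual_language_logits`), the use of a 1D convolution as a stand-in for β(·), the way prompts are tiled over frames, and the equal weighting of the English and Chinese branches are all our assumptions; only the general ideas — reparameterized temporal prompts with a residual connection, and joint English/Chinese text supervision — come from the contributions above.

```python
# Hypothetical sketch (not the authors' code). Shapes, names, and the prompt
# injection strategy are assumptions; only the overall structure follows the
# contributions described in the Introduction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReparamTemporalPrompts(nn.Module):
    """Generate temporal prompts from learnable global prompts via beta(.)."""

    def __init__(self, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        # Learnable global prompt vectors, shared across frames.
        self.global_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # beta(.): a lightweight temporal network; a 1D conv stands in here for
        # the TAda / C3D / R(2+1)D variants compared in Section 4.2.4.
        self.beta = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, num_frames: int) -> torch.Tensor:
        # Tile the prompts along the temporal axis: (T, P, D).
        p = self.global_prompts.unsqueeze(0).expand(num_frames, -1, -1)
        x = p.permute(1, 2, 0)             # (P, D, T) for the temporal conv
        x = self.beta(x).permute(2, 0, 1)  # back to (T, P, D)
        return p + x                       # residual connection (see 4.2.5)


def dual_language_logits(video_feat, en_text_feat, cn_text_feat, tau=0.01):
    """Combine class logits from the English and Chinese text branches.

    video_feat:   (B, D) pooled video embedding from the frozen CLIP encoder
    en_text_feat: (C, D) English class-name embeddings (CLIP text encoder)
    cn_text_feat: (C, D) Chinese label-dictionary embeddings (Chinese-CLIP)
    """
    v = F.normalize(video_feat, dim=-1)
    te = F.normalize(en_text_feat, dim=-1)
    tc = F.normalize(cn_text_feat, dim=-1)
    logits_en = v @ te.t() / tau
    logits_cn = v @ tc.t() / tau
    # Equal weighting is an assumption; the paper's exact fusion may differ.
    return 0.5 * (logits_en + logits_cn)
```

In the actual framework, β(·) is instantiated with TAda, C3D, or R(2+1)D variants, which are compared in the ablation study (Section 4.2.4), and the residual connection is ablated in Section 4.2.5.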
2. Related Works
2.1. Video Action Recognition
2.2. Vision–Language Pre-trained Models
2.3. Parameter-Efficient Transfer Learning
3. Methods
3.1. Action Recognition with VLMs
3.1.1. Video Representations
3.1.2. Text Representations
3.1.3. Similarity Calculation
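The body of this subsection is not reproduced here; as a reference, the standard CLIP-style video–text matching it builds on can be written as follows (our notation, not necessarily the paper's symbols): cosine similarity between the video embedding and each class-name text embedding, converted to class probabilities with a temperature.

```latex
% Reference formulation of CLIP-style video--text matching (our notation):
% v is the video embedding, t_c the text embedding of class c, tau the
% temperature, and C the number of action classes.
\[
  \operatorname{sim}(\mathbf{v}, \mathbf{t}_c)
    = \frac{\mathbf{v}^{\top}\mathbf{t}_c}{\lVert \mathbf{v} \rVert\, \lVert \mathbf{t}_c \rVert},
  \qquad
  p(c \mid \mathbf{v})
    = \frac{\exp\bigl(\operatorname{sim}(\mathbf{v}, \mathbf{t}_c)/\tau\bigr)}
           {\sum_{j=1}^{C} \exp\bigl(\operatorname{sim}(\mathbf{v}, \mathbf{t}_j)/\tau\bigr)}.
\]
```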
3.2. Proposed CLIP-Based Framework
3.3. Video Encoder
3.3.1. Temporal Prompts
3.3.2. Reparameterization Encoder
3.3.3. Spatial Semantic Guidance of CLIP
3.4. Text Encoder
3.5. Learning Objectives
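The section body is not reproduced here; one plausible instantiation of the objective, consistent with the dual Chinese–English supervision described in the Introduction, is a cross-entropy loss applied to the class probabilities of each text branch. The weighting λ and the exact loss form are our assumptions, not a statement of the paper's definition.

```latex
% Assumed objective: per-branch cross-entropy combined with weight lambda
% (lambda = 0.5 would weight the English and Chinese branches equally);
% B is the batch size and y_i the ground-truth class of video v_i.
\[
  \mathcal{L}
    = -\frac{1}{B}\sum_{i=1}^{B}
      \Bigl[\lambda \log p_{\mathrm{en}}(y_i \mid \mathbf{v}_i)
            + (1-\lambda)\, \log p_{\mathrm{cn}}(y_i \mid \mathbf{v}_i)\Bigr].
\]
```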
4. Experiments
4.1. Experimental Configurations and Details
4.1.1. Datasets and Evaluation
4.1.2. Training Details
4.1.3. Baseline
4.2. Ablation Studies
4.2.1. Effect of Key Components
4.2.2. Number of Sampled Frames
4.2.3. Length of Learnable Global Prompts
4.2.4. The Influence of Different Networks
4.2.5. The Role of Residual Connection in β(·)
4.2.6. Hand-Crafted [CLS] Prefix Prompt
4.2.7. Trainable Parameters and Time Efficiency
4.2.8. Visualization
4.3. Few-Shot Video Recognition
4.4. Comparison with the State-of-the-Art
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sahoo, J.P.; Prakash, A.J.; Plawiak, P.; Samantray, S. Real-time hand gesture recognition using fine-tuned convolutional neural network. Sensors 2022, 22, 706. [Google Scholar] [CrossRef] [PubMed]
- Jiang, Q.; Li, G.; Yu, J.; Li, X. A model based method of pedestrian abnormal behavior detection in traffic scene. In Proceedings of the 2015 IEEE First International Smart Cities Conference (ISC2), Guadalajara, Mexico, 25–28 October 2015. [Google Scholar]
- Lentzas, A.; Vrakas, D. Non-intrusive human activity recognition and abnormal behavior detection on elderly people: A review. Artif. Intell. Rev. 2020, 53, 1975–2021. [Google Scholar] [CrossRef]
- Tang, Z.; Gu, R.; Hwang, J.N. Joint multi-view people tracking and pose estimation for 3D scene reconstruction. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Feiszli, M. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Selva, J.; Johansen, A.S.; Escalera, S.; Nasrollahi, K.; Moeslund, T.B.; Clapés, A. Video transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12922–12943. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021. [Google Scholar]
- Yuan, L.; Chen, D.; Chen, Y.L.; Codella, N.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A new foundation model for computer vision. arXiv 2021, arXiv:2111.11432. [Google Scholar]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Xu, H.; Ghosh, G.; Huang, P.Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; Feichtenhofer, C. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv 2021, arXiv:2109.14084. [Google Scholar]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar]
- Ju, C.; Han, T.; Zheng, K.; Zhang, Y.; Xie, W. Prompting visual-language models for efficient video understanding. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Lin, Z.; Geng, S.; Zhang, R.; Gao, P.; De Melo, G.; Wang, X.; Dai, J.; Qiao, Y.; Li, H. Frozen clip models are efficient video learners. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Pan, J.; Lin, Z.; Zhu, X.; Shao, J.; Li, H. St-adapter: Parameter-efficient image-to-video transfer learning. Adv. Neural Inf. Process. Syst. 2022, 35, 26462–26477. [Google Scholar]
- Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Bahng, H.; Jahanian, A.; Sankaranarayanan, S.; Isola, P. Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv 2022, arXiv:2203.17274. [Google Scholar]
- Chen, S.; Ge, C.; Tong, Z.; Wang, J.; Song, Y.; Wang, J.; Luo, P. Adaptformer: Adapting vision transformers for scalable visual recognition. Adv. Neural Inf. Process. Syst. 2022, 35, 16664–16678. [Google Scholar]
- Jie, S.; Deng, Z.H. Convolutional bypasses are better vision transformer adapters. arXiv 2022, arXiv:2207.07039. [Google Scholar]
- Gao, Y.; Shi, X.; Zhu, Y.; Wang, H.; Tang, Z.; Zhou, X.; Li, M.; Metaxas, D.N. Visual prompt tuning for test-time domain adaptation. arXiv 2022, arXiv:2210.04831. [Google Scholar]
- Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
- Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT understands, too. AI Open, 2023; in press. [Google Scholar]
- Wang, X.; Zhu, L.; Wang, H.; Yang, Y. Interactive prototype learning for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Stroud, J.; Ross, D.; Sun, C.; Deng, J.; Sukthankar, R. D3d: Distilled 3d networks for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Wang, L.; Tong, Z.; Ji, B.; Wu, G. Tdn: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Wang, M.; Xing, J.; Liu, Y. Actionclip: A new paradigm for video action recognition. arXiv 2021, arXiv:2109.08472. [Google Scholar]
- Ni, B.; Peng, H.; Chen, M.; Zhang, S.; Meng, G.; Fu, J.; Xiang, S.; Ling, H. Expanding language-image pretrained models for general video recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Du, Y.; Wei, F.; Zhang, Z.; Shi, M.; Gao, Y.; Li, G. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Zhang, R.; Guo, Z.; Zhang, W.; Li, K.; Miao, X.; Cui, B.; Qiao, Y.; Gao, P.; Li, H. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Lei, J.; Li, L.; Zhou, L.; Gan, Z.; Berg, T.L.; Bansal, M.; Liu, J. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv 2021, arXiv:2110.04366. [Google Scholar]
- Guo, D.; Rush, A.M.; Kim, Y. Parameter-efficient transfer learning with diff pruning. arXiv 2020, arXiv:2012.07463. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Yang, A.; Pan, J.; Lin, J.; Men, R.; Zhang, Y.; Zhou, J.; Zhou, C. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv 2022, arXiv:2211.01335. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Huang, Z.; Zhang, S.; Pan, L.; Qing, Z.; Tang, M.; Liu, Z.; Ang Jr, M.H. Tada! temporally-adaptive convolutions for video understanding. arXiv 2021, arXiv:2110.06178. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Wang, Q.; Du, J.; Yan, K.; Ding, S. Seeing in flowing: Adapting clip for action recognition with motion prompts learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar]
- Wasim, S.T.; Naseer, M.; Khan, S.; Khan, F.S.; Shah, M. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Duan, H.; Zhao, Y.; Xiong, Y.; Liu, W.; Lin, D. Omni-sourced webly-supervised learning for video recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Wu, W.; Wang, X.; Luo, H.; Wang, J.; Yang, Y.; Ouyang, W. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
| Datasets | Categories | Training Set | Test Set | Total | Splits | Sources |
|---|---|---|---|---|---|---|
| HMDB-51 | 51 | 3570 | 1530 | 7000 | 3 | Movies and web videos |
| UCF-101 | 101 | 9537 | 3783 | 13,320 | 3 | YouTube |
| Something-Something V1 | 174 | ∼86 K | ∼12 K | ∼100 K | 1 | Crowdsourced collection |
| Setting | Value |
|---|---|
| **Training hyperparameters** | |
| Batch size | 256 (fully supervised), 64 (few-shot) |
| Training epochs | 30 (ViT-B), 20 (ViT-L) |
| Optimizer | AdamW, betas = (0.9, 0.999) |
| Learning rate | 5 × 10⁻⁶ (fully supervised), 4 × 10⁻⁶ (few-shot) |
| Learning rate schedule | Cosine |
| Linear warm-up epochs | 5 |
| Weight decay | 1 × 10⁻² |
| **Data augmentation** | |
| Training resize | RandomSizedCrop |
| Training crop size | 224 |
| Random flip (probability) | 0.5 |
| Grayscale (probability) | 0.2 |
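For convenience, the following is a minimal sketch of how the optimization settings in the table above could be wired up in PyTorch. The function name, `model`, `steps_per_epoch`, and the warm-up/cosine implementation are placeholders of ours, not the authors' training code.

```python
# Sketch only: AdamW with betas (0.9, 0.999), weight decay 1e-2, base LR 5e-6,
# 30 epochs, 5 linear warm-up epochs, cosine decay — matching the table above.
import math
import torch


def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=30,
                                  warmup_epochs=5, base_lr=5e-6):
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad),  # tunable params only
        lr=base_lr, betas=(0.9, 0.999), weight_decay=1e-2)

    total_steps = epochs * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch

    def lr_lambda(step):
        # Linear warm-up followed by cosine decay to zero.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```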
| β(·) | TAda | C3D | R(2+1)D | HMDB-51 Top-1 | HMDB-51 Top-5 | UCF-101 Top-1 | UCF-101 Top-5 |
|---|---|---|---|---|---|---|---|
| MLP | - | - | - | 70.1 | 91.3 | 92.6 | 96.2 |
| LSTM | - | - | - | 71.0 | 91.8 | 93.9 | 97.0 |
| Transformer Encoders | - | - | - | 71.8 | 92.5 | 94.4 | 97.5 |
| Ours | ✓ | | | 72.3 | 92.9 | 94.8 | 98.2 |
| Ours | | ✓ | | 72.8 | 93.5 | 94.6 | 98.2 |
| Ours | | | ✓ | 73.2 | 93.6 | 95.8 | 99.0 |
| Method | β(·) | Residual Connection | HMDB-51 Top-1 | HMDB-51 Top-5 | UCF-101 Top-1 | UCF-101 Top-5 |
|---|---|---|---|---|---|---|
| Ours | TAda | × | 70.5 | 92.0 | 95.1 | 97.8 |
| Ours | TAda | ✓ | 72.3 | 92.9 | 94.8 | 98.2 |
| Ours | C3D | × | 71.2 | 92.5 | 92.9 | 97.4 |
| Ours | C3D | ✓ | 72.8 | 93.5 | 94.6 | 98.2 |
| Ours | R(2+1)D | × | 70.0 | 90.9 | 93.0 | 96.0 |
| Ours | R(2+1)D | ✓ | 73.2 | 93.6 | 95.8 | 99.0 |
| Manual Prefix Initialization for [CLS] | HMDB-51 Top-1 | HMDB-51 Top-5 | UCF-101 Top-1 | UCF-101 Top-5 |
|---|---|---|---|---|
| a video about [CLS]. | 73.1 | 93.7 | 95.8 | 98.8 |
| This is a video about [CLS]. | 73.2 | 93.6 | 95.8 | 99.0 |
| the video is about [CLS]. | 72.9 | 93.6 | 95.6 | 98.9 |
| Methods | Backbone | Frames (T) | Tunable Params (M) | Epochs | Batch Size | Training GPU Minutes (HMDB) | Memory (HMDB, MB) | HMDB Top-1 (%) | UCF Top-1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Vita | ViT-B/16 | 8 | 38.88 | 30 | 96 | 45 | 17,721 | 67.25 | 91.54 |
| Vita | ViT-B/16 | 16 | 38.88 | 30 | 96 | 48 | 18,182 | 63.79 | 90.38 |
| BIKE | ViT-B/16 | 8 | 106.8 (+100%) | 30 | 32 | 24 | 6445 | 72.22 | 96.15 |
| BIKE | ViT-B/16 | 16 | 106.8 (+100%) | 30 | 32 | 45 | 10,603 | 73.31 | 96.63 |
| X-CLIP | ViT-B/16 | 8 | 131.5 (+147%) | 30 | 16 | 32 | 17,674 | 70.22 | 94.20 |
| X-CLIP | ViT-B/16 | 16 | 131.5 (+147%) | 30 | 8 | 54 | | 70.75 | 94.10 |
| Ours | ViT-B/16 | 8 | 54.2 | 30 | 32 | 22 | 6601 | 72.50 | 95.33 |
| Ours | ViT-B/16 | 16 | 54.2 | 30 | 32 | 41 | 10,759 | 73.21 | 95.81 |
| Methods | Pretrain | TP | β(·) | CN. Branch | HMDB K=2 | HMDB K=4 | HMDB K=8 | HMDB K=16 | UCF K=2 | UCF K=4 | UCF K=8 | UCF K=16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STM | ImageNet-1k | - | - | - | 35.8 | 39.0 | 43.6 | - | 65.4 | 73.9 | 81.3 | - |
| 3D-ResNet-50 | ImageNet-1k | - | - | - | 43.2 | 44.3 | 49.9 | - | 68.8 | 71.1 | 85.8 | - |
| TSM-R50 | ImageNet-1k | - | - | - | 17.5 | 20.9 | 18.4 | 31.0 | 25.3 | 47.0 | 64.4 | 61.0 |
| TimeSformer | ImageNet-21k | - | - | - | 19.6 | 40.6 | 49.4 | 55.4 | 48.5 | 75.6 | 83.7 | 89.4 |
| Video Swin-B | ImageNet-21k | - | - | - | 20.9 | 41.3 | 47.9 | 56.1 | 53.3 | 74.1 | 85.8 | 88.7 |
| X-Florence | FLD-900M | - | - | - | 51.6 | 57.8 | 64.1 | 64.2 | 84.0 | 88.5 | 92.5 | 94.8 |
| ActionCLIP | CLIP + Kinetics-400 | - | - | - | 54.8 | 56.7 | 57.3 | - | 80.7 | 85.3 | 89.2 | - |
| X-CLIP-B/16 | CLIP-400M | - | - | - | 53.0 | 57.3 | 62.8 | 64.0 | 76.4 | 83.4 | 88.3 | 91.4 |
| Vita-B/16 | CLIP-400M | - | - | - | 39.9 | 44.5 | 54.0 | 57.0 | 70.1 | 79.3 | 83.7 | 90.0 |
| BIKE-B/16 | CLIP-400M | - | - | - | 64.3 | 67.6 | 71.3 | 71.9 | 88.6 | 91.5 | 92.8 | 93.3 |
| [47] | CLIP-400M | - | - | - | 55.3 | 58.7 | 64.0 | 64.6 | 82.4 | 85.8 | 89.1 | 91.6 |
| Baseline | CLIP-400M | - | - | - | 49.6 | 52.8 | 57.3 | 60.3 | 65.1 | 74.8 | 77.6 | 80.2 |
| Ours | CLIP-400M | ✓ | | | 55.8 | 59.1 | 61.8 | 63.7 | 74.9 | 77.8 | 80.4 | 84.3 |
| Ours | CLIP-400M | ✓ | ✓ | | 59.3 | 62.4 | 66.0 | 67.9 | 79.8 | 82.1 | 84.8 | 88.2 |
| Ours | CLIP-400M | ✓ | ✓ | ✓ | 61.6 | 64.8 | 69.6 | 70.4 | 82.7 | 86.0 | 89.5 | 92.1 |
| Methods | Pretrain Data | Modalities | Frozen | UCF-101 Top-1 (%) | HMDB-51 Top-1 (%) |
|---|---|---|---|---|---|
| ARTNet | - | RGB | × | 94.3 | 70.9 |
| TSM | - | RGB | × | 95.9 | 73.5 |
| STM | - | RGB | × | 96.2 | 72.2 |
| MVFNet | - | RGB | × | 96.6 | 75.7 |
| TDN | - | RGB | × | 97.4 | 76.4 |
| R3D-50 | - | RGB | × | 92.0 | 66.0 |
| NL-I3D | - | RGB | × | - | 66.0 |
| **Methods with Kinetics pre-training** | | | | | |
| STC | K400 | RGB | × | 95.8 | 72.6 |
| ECO | K400 | RGB | × | 93.6 | 68.4 |
| R(2+1)D-34 | K400 | RGB | × | 96.8 | 74.5 |
| FASTER32 | K400 | RGB | × | 96.9 | 75.7 |
| SlowOnly-8x8-R101 | K400 + OmniSource | RGB | × | 97.3 | 79.0 |
| **Methods with ImageNet pre-training** | | | | | |
| I3D | ImageNet + K400 | RGB | × | 95.6 | 74.8 |
| S3D | ImageNet + K400 | RGB | × | 96.8 | 75.9 |
| LGD-3D | ImageNet + K600 | RGB | × | 97.0 | 75.7 |
| **Methods with large-scale image-language pre-training** | | | | | |
| BIKE ViT-L | CLIP + K400 | RGB | × | 98.8 | 82.2 |
| ViT-B/16 w/ ST-Adapter | CLIP + K400 | RGB | ✓ | 96.4 | 77.7 |
| VideoPrompt [17] | CLIP | RGB | ✓ | 93.6 | 66.4 |
| [47] | CLIP | RGB | ✓ | 96.3 | 72.9 |
| BIKE ViT-B | CLIP | RGB | × | 96.6 | 73.3 |
| X-CLIP-B | CLIP | RGB | × | 94.2 | 70.8 |
| Vita ViT-B [48] | CLIP | RGB | ✓ | 91.5 | 67.3 |
| Ours ViT-B | CLIP | RGB | ✓ | 96.5 | 73.8 |
| **Methods with additional modalities** | | | | | |
| Two-Stream I3D | ImageNet + K400 | RGB + Flow | × | 98.0 | 80.7 |
| Two-Stream LGD-3D | ImageNet + K600 | RGB + Flow | × | 98.2 | 80.5 |
| PERF-Net | ImageNet + K700 | RGB + Flow + Pose | × | 98.6 | 83.2 |
| SlowOnly-R101-RGB + I3D-Flow | OmniSource | RGB + Flow | × | 98.6 | 83.8 |
| SMART | ImageNet + K400 | RGB + Flow | × | 98.6 | 84.3 |
| Methods | Pretrain Data | Architecture | Frozen | SSv1 Top-1 | SSv1 Top-5 |
|---|---|---|---|---|---|
| **Methods with ImageNet pre-training** | | | | | |
| TANet-R50 | ImageNet-1K | CNN | × | 47.6 | 77.7 |
| TSM | ImageNet-1K | CNN | × | 47.2 | 78.1 |
| TEANet | ImageNet-1K | CNN | × | 48.9 | - |
| SmallBig | ImageNet-1K | CNN | × | 50.0 | 79.8 |
| STM | ImageNet-1K | CNN | × | 50.7 | 80.4 |
| TEINet | ImageNet-1K | CNN | × | 51.0 | - |
| AIA (TSM) | ImageNet-1K | CNN | × | 51.6 | 79.9 |
| MSNet | ImageNet-1K | CNN | × | 52.1 | 82.3 |
| TEA | ImageNet-1K | CNN | × | 52.3 | 81.9 |
| SDA-TSM | ImageNet-1K | CNN | × | 52.8 | 81.3 |
| CT-NET | ImageNet-1K | CNN | × | 53.4 | 81.7 |
| TDN | ImageNet-1K | CNN | × | 53.9 | 82.1 |
| TAdaConvNeXtV2-T | ImageNet-1K | CNN | × | 54.1 | - |
| TAdaConvNeXtV2-S | ImageNet-21K + K400 | CNN | × | 59.7 | - |
| TAdaConvNeXtV2-B | ImageNet-21K + K400 | CNN | × | 60.7 | - |
| **Methods with large-scale image-language pre-training** | | | | | |
| Ours-B/16 | CLIP-400M | Transformer | ✓ | 54.9 | 83.8 |
| TAdaFormer-B/16 | CLIP-400M | Transformer | × | 59.2 | - |
| Side4Video-B/16 | CLIP-400M | Transformer | ✓ | 60.7 | 86.0 |
| UniFormerV2-B/16 | CLIP-400M | Transformer | × | 56.8 | 84.2 |