A Short Video Classification Framework Based on Cross-Modal Fusion
Abstract
1. Introduction
- We propose an improved hierarchical clustering approach for keyframe extraction. Unlike traditional keyframe extraction algorithms, it does not require the number of keyframes to be fixed in advance; the clustering adaptively determines which frames are keyframes, which offers greater flexibility. The method preserves the essential information in a video while removing redundant frames, so the extracted keyframes are more representative (a minimal clustering sketch follows this list).
- We investigate methods for extracting visual and textual features from videos and combine the two modalities for video classification. The visual stream is first reduced by keyframe extraction to a set of images that represent the main content of the video, and a pre-trained TimeSformer network extracts visual features from these keyframes. In parallel, the textual information is encoded with a fine-tuning-based BERT model. Finally, the two kinds of features are fused by a feature aggregation algorithm for video classification.
- We propose a cross-modal fusion short video classification (CFVC) framework. The framework combines the visual features extracted from the video frames with the text features extracted from subtitles, fusing them into joint features for the downstream classification task (see the fusion sketch after this list).
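The following is a minimal sketch, not the authors' implementation, of keyframe extraction by agglomerative (hierarchical) clustering of per-frame colour-histogram descriptors. The histogram size and the cosine-distance threshold are illustrative assumptions; the point is that the number of keyframes is not fixed in advance but emerges from the distance threshold.

```python
# Minimal sketch (not the authors' implementation) of hierarchical-clustering
# keyframe extraction; descriptor choice and threshold are illustrative.
import cv2
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def extract_keyframes(video_path, distance_threshold=0.4):
    """Cluster frames by colour histogram; keep one representative frame per cluster."""
    cap = cv2.VideoCapture(video_path)
    frames, feats = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
        frames.append(frame)
    cap.release()

    if len(frames) < 2:  # nothing to cluster
        return frames

    # Agglomerative clustering: the number of keyframes is not predefined,
    # it falls out of the distance threshold.
    Z = linkage(np.asarray(feats), method="average", metric="cosine")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")

    # For each cluster, keep the frame closest to the cluster centroid.
    feats = np.asarray(feats)
    keyframes = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = feats[idx].mean(axis=0)
        best = idx[np.argmin(np.linalg.norm(feats[idx] - centroid, axis=1))]
        keyframes.append(frames[best])
    return keyframes
```

Similarly, the fusion idea can be sketched as follows: subtitle features from a fine-tuned BERT encoder are concatenated with pooled visual keyframe features (e.g., from TimeSformer) and passed to a classifier head. The checkpoint name, feature dimensions, and the plain concatenation are assumptions for illustration only; in the paper the features are combined by aggregation models such as NextVLAD and AttentionCluster (see Section 4.3).

```python
# Minimal sketch (assumed checkpoint, dimensions, and head; not the paper's
# exact architecture) of fusing BERT subtitle features with pooled visual
# keyframe features into a joint vector for classification.
import torch
import torch.nn as nn
from transformers import BertModel


class CrossModalClassifier(nn.Module):
    def __init__(self, num_classes, visual_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        joint_dim = self.bert.config.hidden_size + visual_dim
        self.head = nn.Sequential(
            nn.Linear(joint_dim, 512), nn.ReLU(), nn.Linear(512, num_classes)
        )

    def forward(self, visual_feat, input_ids, attention_mask):
        # visual_feat: (B, visual_dim) pooled keyframe features, e.g. from TimeSformer
        text_feat = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        joint = torch.cat([visual_feat, text_feat], dim=-1)  # concatenation fusion
        return self.head(joint)
```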
2. Related Work
2.1. I3D Networks
2.2. Two-Stream Networks
2.3. SlowFast Networks
3. System Model and Problem Formulation
3.1. Visual Feature Extraction
3.2. Text Feature Extraction
3.3. Cross-Modal Feature Fusion Framework
4. Experimental Results and Discussion
4.1. Experimental Dataset
4.2. Performance Evaluation Index
4.3. Experimental Results and Analysis
- (1) Single-mode feature
- (2) Cross-modal fusion
- (3) Comparison of public datasets
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Single-mode feature results:

| Model | Feature | Top@1 (%) | Top@5 (%) | F1 (%) |
|---|---|---|---|---|
| NextVLAD | Video frame | 60.1 | 70.9 | 65.3 |
| NextVLAD | Subtitle | 55.9 | 63.2 | 58.7 |
| AttentionCluster | Video frame | 58.2 | 69.4 | 63.3 |
| AttentionCluster | Subtitle | 52.0 | 62.3 | 56.2 |
| NextVLAD-AttentionCluster | Video frame | 61.1 | 79.0 | 67.9 |
| NextVLAD-AttentionCluster | Subtitle | 57.3 | 63.1 | 59.4 |
Cross-modal fusion results:

| Model | Feature | Top@1 (%) | Top@5 (%) | F1 (%) |
|---|---|---|---|---|
| NextVLAD | Video frame and Subtitle | 64.3 | 72.8 | 68.2 |
| AttentionCluster | Video frame and Subtitle | 63.2 | 71.2 | 65.9 |
| NextVLAD-AttentionCluster | Video frame and Subtitle | 65.8 | 82.2 | 73.2 |
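For reference, here is a minimal sketch (assuming class-score and label arrays; not the authors' evaluation code) of how the metrics reported above can be computed; macro averaging for the F1 score is an assumption.

```python
# Minimal sketch of Top@1/Top@5 accuracy and F1 computation (assumed inputs:
# scores of shape (N, C) and integer labels of shape (N,)).
import numpy as np
from sklearn.metrics import f1_score


def top_k_accuracy(scores, labels, k):
    topk = np.argsort(scores, axis=1)[:, -k:]        # indices of the k highest scores
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()


def evaluate(scores, labels):
    preds = scores.argmax(axis=1)
    return {
        "Top@1": top_k_accuracy(scores, labels, 1),
        "Top@5": top_k_accuracy(scores, labels, 5),
        "F1": f1_score(labels, preds, average="macro"),  # macro average assumed
    }
```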