A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning
Abstract
1. Introduction
- A hierarchical spatial–temporal video feature extraction approach is developed to retain as much characteristic information as possible for generating the video summary;
- A cross-attention cell based on DB-ConvLSTM is proposed to fuse local and global feature information. It emphasizes the differences between related frames and enables more accurate screening within clusters of similar keyframes for summary generation (a minimal sketch of this idea appears after this list);
- Verification experiments and comparative analyses of the proposed algorithm are performed on two benchmark datasets (TVSum and SumMe). The results demonstrate that the algorithm is rational, effective, and practical.
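The cross-attention cell above can be pictured as local (shot-level) features attending to global (video-level) context. The following is a minimal PyTorch sketch of that idea only; the class name, feature dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' DB-ConvLSTM-based implementation.

```python
# Hedged sketch: local features query global context via cross-attention.
# Names and shapes below are assumptions for illustration only.
import torch
import torch.nn as nn


class LocalGlobalCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Multi-head attention where local features attend to global context.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feats: torch.Tensor, global_feats: torch.Tensor) -> torch.Tensor:
        # local_feats:  (batch, n_frames, dim)  per-shot features
        # global_feats: (batch, n_tokens, dim)  video-level context features
        fused, _ = self.attn(query=local_feats, key=global_feats, value=global_feats)
        # Residual connection keeps local detail while injecting global context.
        return self.norm(local_feats + fused)


if __name__ == "__main__":
    local = torch.randn(2, 120, 512)       # 120 sampled frames per video (assumed)
    global_ctx = torch.randn(2, 16, 512)   # 16 pooled global tokens (assumed)
    cell = LocalGlobalCrossAttention()
    print(cell(local, global_ctx).shape)   # torch.Size([2, 120, 512])
```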
2. Related Work
2.1. Video Summarization
2.2. Cross Attention
2.3. Graph Attention Networks (GATs)
3. Materials and Methods
3.1. DB-ConvLSTM and CB-ConvLSTM
3.1.1. DB-ConvLSTM
3.1.2. CB-ConvLSTM
3.2. Contrastive Adjustment Learning
3.3. Multi-Conv-Attention and Cross-Attention
3.3.1. Multi-Conv-Attention
3.3.2. Cross-Attention
3.4. Loss Function
4. Experimental Analysis
4.1. Datasets
4.2. Evaluation Metrics
4.3. Experimental Environment and Parameter Settings
4.4. Comparative Analysis of Schemes
4.4.1. Self-Verification
4.4.2. Comparative Analysis with Related Approaches
- (1) Comparison with “Bi-LSTM+” Methods
- (2) Comparison with “Attention+” Methods
- (3) Comparison with “Graph Attention+” Methods
4.4.3. Comparison Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Zhong, R.; Wang, R.; Zou, Y.; Hong, Z.; Hu, M. Graph attention networks adjusted bi-LSTM for video summarization. IEEE Signal Proc. Lett. 2021, 28, 663–667.
2. Yoon, U.-N.; Hong, M.-D.; Jo, G.-S. Interp-SUM: Unsupervised Video Summarization with Piecewise Linear Interpolation. Sensors 2021, 21, 4562.
3. Liu, T.; Meng, Q.; Huang, J.-J.; Vlontzos, A.; Rueckert, D.; Kainz, B. Video summarization through reinforcement learning with a 3D spatio-temporal u-net. IEEE Trans. Image Proc. 2022, 31, 1573–1586.
4. Li, W.; Pan, G.; Wang, C.; Xing, Z.; Han, Z. From coarse to fine: Hierarchical structure-aware video summarization. ACM Trans. Mult. Comput. Commun. Appl. (TOMM) 2022, 18, 1–16.
5. Zhang, K.; Chao, W.-L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 766–782.
6. Zhao, B.; Li, H.; Lu, X.; Li, X. Reconstructive sequence-graph network for video summarization. IEEE Trans. Patt. Anal. Mach. Intell. 2021, 44, 2793–2801.
7. Teng, X.; Gui, X.; Xu, P. A Multi-Flexible Video Summarization Scheme Using Property-Constraint Decision Tree. Neurocomputing 2022, 506, 406–417.
8. Ji, Z.; Zhang, Y.; Pang, Y.; Li, X.; Pan, J. Multi-video summarization with query-dependent weighted archetypal analysis. Neurocomputing 2019, 332, 406–416.
9. Rafiq, M.; Rafiq, G.; Agyeman, R.; Choi, G.S.; Jin, S.-I. Scene classification for sports video summarization using transfer learning. Sensors 2020, 20, 1702.
10. Zhu, W.; Han, Y.; Lu, J.; Zhou, J. Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization. IEEE Trans. Image Proc. 2022, 31, 3017–3031.
11. De Avila, S.E.F.; Lopes, A.P.B.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Patt. Recognit. Lett. 2011, 32, 56–68.
12. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, New York, NY, USA, 23–27 October 2017; pp. 863–871.
13. An, Y.; Zhao, S. A Video Summarization Method Using Temporal Interest Detection and Key Frame Prediction. arXiv 2021, arXiv:2109.12581.
14. Sahu, A.; Chowdhury, A.S. First person video summarization using different graph representations. Patt. Recognit. Lett. 2021, 146, 185–192.
15. Fu, H.; Wang, H. Self-attention binary neural tree for video summarization. Patt. Recognit. Lett. 2021, 143, 19–26.
16. Ji, Z.; Zhao, Y.; Pang, Y.; Li, X.; Han, J. Deep attentive video summarization with distribution consistency learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 1765–1775.
17. Köprü, B.; Erzin, E. Use of Affective Visual Information for Summarization of Human-Centric Videos. arXiv 2021, arXiv:2107.03783.
18. Mi, L.; Chen, Z. Hierarchical Graph Attention Network for Visual Relationship Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
19. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 603–612.
20. Lin, W.; Deng, Y.; Gao, Y.; Wang, N.; Zhou, J.; Liu, L.; Zhang, L.; Wang, P. CAT: Cross-Attention Transformer for One-Shot Object Detection. arXiv 2021, arXiv:2104.14984.
21. Sanabria, M.; Precioso, F.; Menguy, T. Hierarchical multimodal attention for deep video summarization. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7977–7984.
22. Petit, O.; Thome, N.; Rambour, C.; Themyr, L.; Collins, T.; Soler, L. U-Net Transformer: Self and cross attention for medical image segmentation. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Strasbourg, France, 27 September 2021; pp. 267–276.
23. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
24. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; Woo, W.-c. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
25. Song, H.; Wang, W.; Zhao, S.; Shen, J.; Lam, K.-M. Pyramid dilated deeper ConvLSTM for video salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 715–731.
26. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821.
27. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations 2016, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–15.
28. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5179–5187.
29. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating Summaries from User Videos; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2014; pp. 505–520.
30. Open Video Project. Available online: https://open-video.org/ (accessed on 22 September 2022).
31. Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkila, J. Rethinking the evaluation of video summaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7596–7604.
32. Zhao, B.; Li, X.; Lu, X. HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7405–7414.
33. Lin, J.; Zhong, S.-h.; Fares, A. Deep hierarchical LSTM networks with attention for video summarization. Comput. Electr. Eng. 2022, 97, 107618.
34. Liang, G.; Lv, Y.; Li, S.; Wang, X.; Zhang, Y. Video summarization with a dual-path attentive network. Neurocomputing 2022, 467, 1–9.
35. Zhu, W.; Lu, J.; Han, Y.; Zhou, J. Learning multiscale hierarchical attention for video summarization. Patt. Recognit. 2022, 122, 108312.
36. Ji, Z.; Zhao, Y.; Pang, Y.; Li, X.; Han, J. Video summarization with attention-based encoder–decoder networks. IEEE Trans. Circ. Syst. Video Technol. 2019, 30, 1709–1717.
37. Li, P.; Tang, C.; Xu, X. Video summarization with a graph convolutional attention network. Front. Inform. Technol. Electr. Eng. 2021, 22, 902–913.
38. Park, J.; Lee, J.; Kim, I.-J.; Sohn, K. SumGraph: Video summarization via recursive graph modeling. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 647–663.
Datasets | Description
---|---
TVSum | The title-based video summarization dataset contains 50 videos of various genres (e.g., news, documentary, egocentric) and 1000 annotations of shot-level importance scores (20 user annotations per video). Video durations range from 2 to 10 min.
SumMe | The SumMe dataset consists of 25 videos, each annotated with at least 15 human-created summaries. Video durations range from 1.5 to 6.5 min.
Datasets | Setting | Training Phase | Testing Phase
---|---|---|---
TVSum | C | 80% TVSum | The rest of TVSum
TVSum | A | TVSum + SumMe + OVP + YouTube | The rest of TVSum
TVSum | T | SumMe + OVP + YouTube | TVSum
SumMe | C | 80% SumMe | The rest of SumMe
SumMe | A | TVSum + SumMe + OVP + YouTube | The rest of SumMe
SumMe | T | TVSum + OVP + YouTube | SumMe
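The C, A, and T settings above correspond to the canonical, augmented, and transfer evaluation protocols commonly used on TVSum and SumMe. The helper below is a hedged sketch of one plausible way to build such splits; the dataset sizes, the 80%/20% ratio in the augmented case, and the function name are assumptions for illustration, not the authors' exact pipeline.

```python
# Hypothetical split builder for the C/A/T settings; all counts are assumed.
import random


def build_split(setting: str, target: str, seed: int = 0):
    datasets = {"TVSum": 50, "SumMe": 25, "OVP": 50, "YouTube": 39}  # assumed sizes
    videos = {name: [f"{name}_{i:02d}" for i in range(n)] for name, n in datasets.items()}
    rng = random.Random(seed)
    held_in = videos[target][:]
    rng.shuffle(held_in)
    cut = int(0.8 * len(held_in))
    others = [v for name, vs in videos.items() if name != target for v in vs]

    if setting == "C":   # Canonical: 80/20 split of the target dataset
        return held_in[:cut], held_in[cut:]
    if setting == "A":   # Augmented: other datasets added to the training portion
        return held_in[:cut] + others, held_in[cut:]
    if setting == "T":   # Transfer: train on the other datasets, test on the target
        return others, videos[target]
    raise ValueError(f"unknown setting: {setting}")


train, test = build_split("T", "SumMe")
print(len(train), len(test))  # sizes depend on the assumed dataset counts above
```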
Metric | TVSum C (%) | TVSum A (%) | TVSum T (%) | SumMe C (%) | SumMe A (%) | SumMe T (%)
---|---|---|---|---|---|---
MAX | 65.3 | 67.4 | 66.2 | 61.6 | 63.1 | 64.5
MIN | 50.8 | 50.2 | 55.9 | 53.3 | 52.4 | 51.3
AVERAGE | 60.57 | 58.62 | 61.26 | 58.4 | 58.4 | 60.01
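The percentages in this table and in the comparison table that follows are presumably keyshot-level F-scores, the standard metric on TVSum and SumMe. Assuming that convention, they are computed from the overlap between the generated summary $S$ and a user summary $G$:

```latex
% Standard keyshot F-score (assumed to be the metric behind the percentages above)
P = \frac{|S \cap G|}{|S|}, \qquad
R = \frac{|S \cap G|}{|G|}, \qquad
F = \frac{2\,P\,R}{P + R} \times 100\%
```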
Method | TVSum C (%) | TVSum A (%) | TVSum T (%) | SumMe C (%) | SumMe A (%) | SumMe T (%)
---|---|---|---|---|---|---
vsLSTM [5] | 54.2 | 57.9 | 56.9 | 37.6 | 41.6 | 40.7
dppLSTM [5] | 54.7 | 59.6 | 58.7 | 38.6 | 42.9 | 41.8
H-RNN [12] | 57.9 | 61.9 | − | 42.1 | 43.8 | −
HSA-RNN [32] | 58.7 | 59.8 | − | 42.3 | 42.1 | −
DHAVS [33] | 60.8 | 61.2 | 57.5 | 45.6 | 46.5 | 43.5
Ours | 65.3 | 67.4 | 66.2 | 58.4 | 58.4 | 60.01