Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping
Abstract
1. Introduction
2. Auditory and Visual Feature Extraction Method
2.1. Auditory Feature Extraction Based on CNN-LSTM
2.2. Visual Feature Extraction Based on CNN-ConvLSTM
3. The Deep Network for Auditory-Visual Information Fusion
3.1. Shared Semantic Subspace Based on Autoencoder
3.1.1. Shared Semantic Subspace
3.1.2. Shared Semantic Subspace Based on Autoencoder
3.1.3. Model Optimization Based on Semantic Correspondence
3.2. Violent Behavior Recognition Model Based on Visual and Auditory Fusion
3.2.1. Network Structure
3.2.2. Algorithm Realization
Algorithm 1. Auditory-Visual Fusion Based on Autoencoder Mapping

Input: video frame sequence, audio frame waveform images, labels, number of iterations T
Output: weights of the network model

1: Initialize the network weights, freeze some parameters of AlexNet, and set t = 1;
2: for t = 1:T do
3:   Compute the network model outputs and the predicted label;
4:   Calculate the error value at time t according to Formula (10);
5:   Calculate the error gradient δk of hidden-layer unit k at time t;
6:   Calculate the error gradient δCt of the cell state Ct at time t;
7:   Update the network weight vector W;
8: end for
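To make the training procedure concrete, the following is a minimal PyTorch-style sketch of Algorithm 1. The names `model`, `loader`, and `num_iterations` are illustrative assumptions, not the authors' code, and the hand-derived gradient computations of steps 5 and 6 are delegated to autograd.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, num_iterations: int, lr: float = 1e-5):
    # Step 1: weights are assumed pre-initialized; the AlexNet layers the
    # paper freezes would already have requires_grad=False.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    for t in range(num_iterations):            # step 2: for t = 1:T
        for video, audio, label in loader:
            logits = model(video, audio)       # step 3: model outputs
            loss = criterion(logits, label)    # step 4: error value
            optimizer.zero_grad()
            loss.backward()                    # steps 5-6: error gradients
            optimizer.step()                   # step 7: weight update
    return model
```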
4. Experiments and Results Analysis
4.1. The Experimental Setup
4.1.1. Dataset
4.1.2. Experimental Parameter Configuration
4.2. Experimental Results
4.2.1. Validation of Feature Combination Method
4.2.2. Visual and Auditory Information Fusion Visualization Based on Autoencoder Mapping
4.2.3. Violence Test Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Dataset Name | Type | Data Scale | Length/Clip (s) | Scenario | Annotation |
| --- | --- | --- | --- | --- | --- |
| MediaEval 2015 | Violence | 502 | 8–12 | Movie | Frame-level |
| Module Name | Type | Input/Output Data Dimension | Repeat Times |
| --- | --- | --- | --- |
| Auditory feature extraction | AlexNet | (227 × 227 × 3, 4096) | 1 |
| | LSTM | (4096, 4096) | 1 |
| Visual feature extraction | Subtract | (227 × 227 × 3 × 2, 227 × 227 × 3) | 1 |
| | AlexNet | (227 × 227 × 3, 4096) | 1 |
| | ConvLSTM | (4096, 4096) | 1 |
| Autoencoder | FC + ReLU | (4096, 2048) | 1 |
| | FC + ReLU | (2048, 4096) | 2 |
| | Concat | (4 × 4096, 4 × 4096) | 1 |
| Classifier | FC + ReLU | (4 × 4096, 4096) | 1 |
| | FC + ReLU | (4096, 2) | 1 |
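One plausible reading of the autoencoder and classifier rows above is sketched below in PyTorch: a shared 4096 → 2048 encoder, one 2048 → 4096 decoder per modality (the FC + ReLU row with repeat 2), and a classifier over the 4 × 4096 concatenation. The class name `FusionHead` and the per-modality decoder split are assumptions; the two 4096-d branch features are taken as computed upstream by the CNN-LSTM and CNN-ConvLSTM branches.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Autoencoder mapping plus classifier, with dimensions from the table."""
    def __init__(self, feat_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        # Shared semantic-subspace encoder: 4096 -> 2048
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # One 2048 -> 4096 decoder per modality (the FC + ReLU row, repeat 2)
        self.dec_audio = nn.Sequential(nn.Linear(hidden, feat_dim), nn.ReLU())
        self.dec_video = nn.Sequential(nn.Linear(hidden, feat_dim), nn.ReLU())
        # Classifier over the 4 x 4096 concatenation: -> 4096 -> 2 classes
        self.classifier = nn.Sequential(
            nn.Linear(4 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2))

    def forward(self, f_audio: torch.Tensor, f_video: torch.Tensor):
        m_audio = self.dec_audio(self.encoder(f_audio))  # mapped auditory feature
        m_video = self.dec_video(self.encoder(f_video))  # mapped visual feature
        fused = torch.cat([f_audio, f_video, m_audio, m_video], dim=1)
        return self.classifier(fused)
```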
| Hyperparameter | Default Value |
| --- | --- |
| Learning rate | 10⁻⁵ |
| LR decay rate | 0.5 |
| Batch size | 16 |
| Hidden size | 128 |
| Loss function | Cross entropy |
| Penalty coefficient ratio | 1:16 |
| Optimizer | Adam |
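Translated into code, the configuration above might look like the sketch below. Reading the penalty coefficient ratio 1:16 as class weights in the cross-entropy loss, and the decay rate 0.5 as a multiplicative schedule, are both assumptions, since the table does not state how either is applied.

```python
import torch
import torch.nn as nn

def make_training_setup(model: nn.Module):
    # Penalty coefficient ratio 1:16, read here as class weights for the
    # imbalanced non-violent/violent classes (an assumption).
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 16.0]))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    # LR decay rate 0.5; the decay interval is not given in the table, so
    # a per-epoch multiplicative schedule is assumed.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)
    return criterion, optimizer, scheduler
```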
| Method | | P | R | F1 | MAP |
| --- | --- | --- | --- | --- | --- |
| Late fusion | SVM | 0.29 | 0.70 | 0.41 | 17.4% |
| | Average fusion | 0.34 | 0.75 | 0.46 | 19.1% |
| | 3-layer perceptron | 0.33 | 0.73 | 0.45 | 18.6% |
| Feature fusion | Add | 0.42 | 0.79 | 0.54 | 29.2% |
| | Concat | 0.51 | 0.84 | 0.63 | 31.54% |
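The "Add" and "Concat" rows correspond to the two standard feature-fusion operations sketched below; concatenation preserves both modality vectors intact (and, in the full model, grows to 4 × 4096 once the autoencoder mappings are included), which is consistent with its higher MAP here. Tensor shapes are illustrative.

```python
import torch

f_audio = torch.randn(16, 4096)  # batch of auditory features
f_video = torch.randn(16, 4096)  # batch of visual features

fused_add = f_audio + f_video                      # "Add":    (16, 4096)
fused_cat = torch.cat([f_audio, f_video], dim=1)   # "Concat": (16, 8192)
```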
| Feature | P | R | F1 | MAP |
| --- | --- | --- | --- | --- |
| Auditory feature | 0.46 | 0.73 | 0.56 | 16.47% |
| Visual feature | 0.36 | 0.82 | 0.50 | 20.21% |
| Fused-modality feature | 0.51 | 0.84 | 0.63 | 31.54% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).