Detection of Dangerous Human Behavior by Using Optical Flow and Hybrid Deep Learning
Abstract
1. Introduction
2. Related Works
3. Materials and Methods
3.1. Optical Flow
3.2. Stacked Hybrid 3D Deep Learning Autoencoder (3D SAE-LSTM-CNN)
3.3. Proposed Algorithm
3.3.1. Global Matching
3.3.2. Frameworks
- Layer 1 (batch normalization): Batch normalization normalizes the input by subtracting the batch mean and then dividing by the batch standard deviation. Input layer: the input image size is 128 × 128, the frame length is 15, the batch size is 64, and the momentum is 0.8.
- Layer 2: Bidirectional (Conv-LSTM 2D) layer with 16 filters and a (3 × 3) kernel. The activation function is tanh, and the recurrent dropout is 0.2. This layer reads the input data and passes its features to the max pooling layer.
- Layer 3: Max pooling 3D layer with a pooling size of (1, 2, 2). Its output is the encoded feature vector of the input from layer 2, with the spatial dimensions reduced.
- Layer 4 (time-distributed layer): This wrapper applies a layer to every temporal slice of the input. The input must be at least three-dimensional, and the dimension at index one is treated as the temporal dimension. The recurrent dropout is 0.2.
- Layer 5: Bidirectional (Conv-LSTM 2D) layer with 16 filters and a (3 × 3) kernel. The activation function is tanh, and the recurrent dropout is 0.2. This layer reads the input data and passes its features to the max pooling layer.
- Layer 6 (batch normalization layer): This layer normalizes the previous layer's activations by subtracting the batch mean and dividing by the batch standard deviation. The momentum is 0.8.
- Layer 7 (max pooling 3D layer): with a pooling size of (1, 2, 2). Its output is the encoded feature vector of the input from layer 6, with the spatial dimensions reduced.
- Layer 8 (time-distributed layer): This layer applies the same convolution operation to each input sequentially in time. The recurrent dropout is 0.2.
- Layer 9: Bidirectional (Conv-LSTM 2D) layer with 16 filters and a (3 × 3) kernel. The activation function is tanh, and the recurrent dropout is 0.2. This layer reads the input data and passes its features to the max pooling layer.
- Layer 10 (batch normalization layer): This layer normalizes the previous layer's activations by subtracting the batch mean and then dividing by the batch standard deviation. The momentum is 0.8.
- Layer 11 (max pooling 3D layer): with a pooling size of (1, 2, 2). Its output is the encoded feature vector of the input from layer 10, with the spatial dimensions reduced.
- Layer 12 (time-distributed layer): This layer applies the same convolution operation to each input sequentially in time. The recurrent dropout is 0.3.
- Layer 13 (flatten layer): By flattening, the data are reduced to a one-dimensional array in preparation for their input into the dense layer (layer 14).
- Layer 14 (dense layer): This layer connects to all neurons of the flattened layer. This layer maps every neuron in layer 13 to every neuron in the output layer (layer 15).
- Layer 15 (dense layer): This layer is the output layer, whose output is the dot product of the weight matrix (kernel) and the input tensor.
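Under the stated settings (15-frame clips of 128 × 128 RGB frames, 16-filter bidirectional Conv-LSTM 2D blocks, a batch-normalization momentum of 0.8, and a 51-class output), the stack above maps onto standard Keras layers. The following is a minimal sketch, not the authors' exact implementation: the 'same' padding on the pooling layers, the softmax output, the ReLU on the 128-unit dense layer, and the use of TimeDistributed(Dropout(...)) for layers 4, 8, and 12 are assumptions.

```python
# Minimal, illustrative Keras sketch of the 15-layer stack described above.
# Assumptions not stated in the text: 'same' padding on the pooling layers
# (needed to reproduce the reported output shapes), a softmax output, ReLU
# on the 128-unit dense layer, and TimeDistributed(Dropout) for the
# time-distributed layers.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (BatchNormalization, Bidirectional,
                                     ConvLSTM2D, Dense, Dropout, Flatten,
                                     Input, MaxPooling3D, TimeDistributed)

FRAMES, HEIGHT, WIDTH, CHANNELS, NUM_CLASSES = 15, 128, 128, 3, 51

model = Sequential([
    Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
    BatchNormalization(momentum=0.8),  # layer 1
])

# Layers 2-12: three encoder blocks, each a bidirectional ConvLSTM2D
# (16 filters, 3 x 3 kernel, tanh, recurrent dropout 0.2), followed by
# batch normalization (second and third blocks only), (1, 2, 2) max
# pooling, and a time-distributed dropout of 0.2, 0.2, and 0.3.
for block, rate in enumerate((0.2, 0.2, 0.3)):
    model.add(Bidirectional(ConvLSTM2D(filters=16, kernel_size=(3, 3),
                                       activation='tanh',
                                       recurrent_dropout=0.2,
                                       return_sequences=True)))
    if block > 0:
        model.add(BatchNormalization(momentum=0.8))  # layers 6 and 10
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding='same'))
    model.add(TimeDistributed(Dropout(rate)))

model.add(Flatten())                                  # layer 13
model.add(Dense(128, activation='relu'))              # layer 14 (activation assumed)
model.add(Dense(NUM_CLASSES, activation='softmax'))   # layer 15

model.summary()
```

With these assumptions, model.summary() reproduces the output shapes listed in the model-summary table (e.g., a flattened vector of 15 × 15 × 15 × 32 = 108,000 features, since each bidirectional block concatenates two 16-filter directions into 32 channels).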
4. Results
4.1. Optical Flow Analysis
4.2. Quantitative Results
4.3. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Diraco, G.; Rescio, G.; Siciliano, P.; Leone, A. Review on Human Action Recognition in Smart Living: Sensing Technology, Multimodality, Real-Time Processing, Interoperability, and Resource-Constrained Processing. Sensors 2023, 23, 5281.
- Yang, J.; Zhang, Z.; Xiao, S.; Ma, S.; Li, Y.; Lu, W.; Gao, X. Efficient data-driven behavior identification based on vision transformers for human activity understanding. Neurocomputing 2023, 530, 104–115.
- Ko, K.E.; Sim, K.B. Deep convolutional framework for abnormal behavior detection in a smart surveillance system. Eng. Appl. Artif. Intell. 2018, 67, 226–234.
- Wang, F.; Zhang, J.; Wang, S.; Li, S.; Hou, W. Analysis of Driving Behavior Based on Dynamic Changes of Personality States. Int. J. Environ. Res. Public Health 2020, 17, 430.
- Mohammed, H.A. Assessment of distracted pedestrian crossing behavior at midblock crosswalks. IATSS Res. 2021, 45, 584–593.
- Zhou, X.; Ren, H.; Zhang, T.; Mou, X.; He, Y.; Chan, C.Y. Prediction of Pedestrian Crossing Behavior Based on Surveillance Video. Sensors 2022, 22, 1467.
- Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177.
- Wang, J.; Huang, H.; Li, K.; Li, J. Towards the Unified Principles for Level 5 Autonomous Vehicles. Engineering 2021, 7, 1313–1325.
- Gesnouin, J. Analysis of Pedestrian Movements and Gestures Using an On-Board Camera to Predict Their Intentions. September 2022. Available online: https://pastel.hal.science/tel-03813520 (accessed on 20 January 2024).
- Zhang, D.; He, L.; Tu, Z.; Zhang, S.; Han, F.; Yang, B. Learning motion representation for real-time spatio-temporal action localization. Pattern Recognit. 2020, 103, 107312.
- Prabono, A.G.; Yahya, B.N.; Lee, S.L. Atypical Sample Regularizer Autoencoder for Cross-Domain Human Activity Recognition. Inf. Syst. Front. 2021, 23, 71–80.
- Garcia, K.D.; de Sá, C.R.; Poel, M.; Carvalho, T.; Mendes-Moreira, J.; Cardoso, J.M.; de Carvalho, A.C.; Kok, J.N. An ensemble of autonomous auto-encoders for human activity recognition. Neurocomputing 2021, 439, 271–280.
- Huang, W.; Zhang, L.; Gao, W.; Min, F.; He, J. Shallow Convolutional Neural Networks for Human Activity Recognition Using Wearable Sensors. IEEE Trans. Instrum. Meas. 2021, 70, 2510811.
- Zhang, D.; Wu, Y.; Guo, M.; Chen, Y. Deep Learning Methods for 3D Human Pose Estimation under Different Supervision Paradigms: A Survey. Electronics 2021, 10, 2267.
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human Action Recognition From Various Data Modalities: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3200–3225.
- Aljabri, M.; AlAmir, M.; AlGhamdi, M.; Abdel-Mottaleb, M.; Collado-Mesa, F. Towards a better understanding of annotation tools for medical imaging: A survey. Multimed. Tools Appl. 2022, 81, 25877–25911.
- Su, J.; An, Y.; Wu, J.; Zhang, K. Pedestrian Detection Based on Feature Enhancement in Complex Scenes. Algorithms 2024, 17, 39.
- Karadeniz, A.T.; Çelik, Y.; Başaran, E. Classification of Walnut Varieties Obtained from Walnut Leaf Images by the Recommended Residual Block Based CNN Model. Eur. Food Res. Technol. 2023, 249, 727–738.
- Iftikhar, S.; Zhang, Z.; Asim, M.; Muthanna, A.; Koucheryavy, A.; El-Latif, A.A.A. Deep Learning-Based Pedestrian Detection in Autonomous Vehicles: Substantial Issues and Challenges. Electronics 2022, 11, 3551.
- Lo, K.M. Optical Flow Based Motion Detection for Autonomous Driving. arXiv 2022, arXiv:2203.11693.
- Ladjailia, A.; Bouchrika, I.; Merouani, H.F.; Harrati, N.; Mahfouf, Z. Human activity recognition via optical flow: Decomposing activities into basic actions. Neural Comput. Appl. 2019, 32, 16387–16400.
- OpenCV: Optical Flow. Available online: https://docs.opencv.org/3.4/db/d7f/tutorial_js_lucas_kanade.html (accessed on 20 January 2024).
- Wang, T.; Snoussi, H. Detection of abnormal events via optical flow feature analysis. Sensors 2015, 15, 7156–7171.
- Vrskova, R.; Hudec, R.; Kamencay, P.; Sykora, P. Human Activity Classification Using the 3DCNN Architecture. Appl. Sci. 2022, 12, 931.
- Hu, V.T.; Zhang, D.W.; Mettes, P.; Tang, M.; Zhao, D.; Snoek, C.G.M. Latent Space Editing in Transformer-Based Flow Matching. arXiv 2023, arXiv:2312.10825.
- Chen, Z.; Ramachandra, B.; Wu, T.; Vatsavai, R.R. Relational Long Short-Term Memory for Video Action Recognition. arXiv 2018, arXiv:1811.07059.
- Yang, H.; Zhang, J.; Li, S.; Luo, T. Bi-direction hierarchical LSTM with spatial-temporal attention for action recognition. J. Intell. Fuzzy Syst. 2019, 36, 775–786.
- Anvarov, F.; Kim, D.H.; Song, B.C. Action recognition using deep 3D CNNs with sequential feature aggregation and attention. Electronics 2020, 9, 147.
- Cheng, Y.; Yang, Y.; Chen, H.B.; Wong, N.; Yu, H. S3-Net: A Fast Scene Understanding Network by Single-Shot Segmentation for Autonomous Driving. ACM Trans. Intell. Syst. Technol. 2021, 12, 1–19.
- Ullah, A.; Muhammad, K.; Ding, W.; Palade, V.; Haq, I.U.; Baik, S.W. Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Appl. Soft Comput. 2021, 103, 107102.
- Patrick, M.; Asano, Y.M.; Kuznetsova, P.; Fong, R.; Henriques, J.F.; Zweig, G.; Vedaldi, A. On Compositions of Transformations in Contrastive Self-Supervised Learning. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9557–9567.
- Tan, K.S.; Lim, K.M.; Lee, C.P.; Kwek, L.C. Bidirectional Long Short-Term Memory with Temporal Dense Sampling for human action recognition. Expert Syst. Appl. 2022, 210, 118484.
- Hussain, A.; Hussain, T.; Ullah, W.; Baik, S.W. Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos. Comput. Intell. Neurosci. 2022, 2022, 3454167.
- Liu, T.; Ma, Y.; Yang, W.; Ji, W.; Wang, R.; Jiang, P. Spatial-temporal interaction learning based two-stream network for action recognition. Inf. Sci. 2022, 606, 864–876.
- Ullah, H.; Munir, A. Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework. J. Imaging 2023, 9, 130.
Layer (Type) | Output Shape |
---|---|
Batch normalization | (None, 15, 128, 128, 3) |
Bidirectional (ConvLSTM2D) | (None, 15, 126, 126, 32) |
Max pooling 3D | (None, 15, 63, 63, 32) |
Time-distributed | (None, 15, 63, 63, 32) |
Bidirectional (ConvLSTM2D) | (None, 15, 61, 61, 32) |
Batch normalization | (None, 15, 61, 61, 32) |
Max pooling 3D | (None, 15, 31, 31, 32) |
Time-distributed | (None, 15, 31, 31, 32) |
Bidirectional (ConvLSTM2D) | (None, 15, 29, 29, 32) |
Batch normalization | (None, 15, 29, 29, 32) |
Max pooling 3D | (None, 15, 15, 15, 32) |
Time-distributed | (None, 15, 15, 15, 32) |
Flatten | (None, 108,000) |
Dense | (None, 128) |
Dense | (None, 51) |
Layer (Type), 3D CNN | Output Shape | Layer (Type), 2 CNN + LSTM | Output Shape |
---|---|---|---|
Conv3D | (None, 15, 128, 128, 16) | ConvLSTM2D | (None, 20, 62, 62, 4) |
Max pooling 3D | (None, 15, 64, 64, 16) | Max pooling 3D | (None, 20, 31, 31, 4) |
Batch normalization | (None, 15, 64, 64, 16) | Time-distributed | (None, 20, 31, 31, 4) |
Conv3D | (None, 15, 64, 64, 32) | ConvLSTM2D | (None, 20, 29, 29, 8) |
Max pooling 3D | (None, 15, 32, 32, 32) | Max pooling 3D | (None, 20, 15, 15, 8) |
Batch normalization | (None, 15, 32, 32, 32) | Time-distributed | (None, 20, 15, 15, 8) |
Conv3D | (None, 15, 32, 32, 64) | ConvLSTM2D | (None, 20, 13, 13, 14) |
Max pooling 3D | (None, 15, 16, 16, 64) | Max pooling 3D | (None, 20, 7, 7, 14) |
Batch normalization | (None, 15, 16, 16, 64) | Time-distributed | (None, 20, 7, 7, 14) |
Conv3D | (None, 15, 16, 16, 128) | ConvLSTM2D | (None, 20, 5, 5, 16) |
Max pooling 3D | (None, 15, 8, 8, 128) | Max pooling 3D | (None, 20, 3, 3, 16) |
Batch normalization | (None, 15, 8, 8, 128) | Flatten | (None, 2880) |
Flatten | (None, 90,112) | Dense | (None, 51) |
Dense | (None, 256) | | |
Dense | (None, 51) | | |
Structure | Training Accuracy (%) | Test Accuracy (%) |
---|---|---|
3D CNN | 99.9 | 52.76 |
2 CNN + LSTM | 93.22 | 58 |
Proposed model | 100 | 86.86 |
Ref. | Structure | Accuracy (%) |
---|---|---|
Chen et al. (2018) [26] | Relational LSTM | 70.4 |
Yang et al. (2019) [27] | 3DCNNs + BDH-LSTM | 72.2 |
Anvarov et al. (2020) [28] | Squeeze-and-excitation (SE) and self-attention (SA) modules with 3D CNN | 74.1 |
Cheng et al. (2021) [29] | S3-Net | 80.8 |
Ullah et al. (2021) [30] | Lightweight CNN and DS-GRU | 72.21 |
Patrick et al. (2021) [31] | GDT | 72.8 |
Tan et al. (2022) [32] | Fusion network | 70.72 |
Hussain et al. (2022) [33] | ViT + LSTM | 73.714 |
Liu et al. (2022) [34] | Spatial-Temporal Interaction Learning Two-stream network (STILT) | 72.1 |
Ullah and Munir (2023) [35] | Dual attentional convolutional neural network (DA-CNN) and bidirectional GRU (Bi-GRU) | 79.3 |
Proposed method | Stacked autoencoder CNN + LSTM | 86.86 |