Isolated Video-Based Sign Language Recognition Using a Hybrid CNN-LSTM Framework Based on Attention Mechanism
Abstract
1. Introduction
- We propose a hybrid CNN-LSTM method with an attention mechanism placed over the output layer of the LSTM to capture spatiotemporal features.
- The attention layer assigns different weights through a probability distribution, focusing on the cues in the sequence that are most relevant to recognizing sign language gestures (a standard formulation is sketched after this list).
- The designed architecture is lightweight, with a low parameter count, and computationally efficient compared to related existing methods.
- Various performance metrics and K-fold cross-validation were used to assess the model's performance and to confirm its robustness.
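The exact weighting used here is defined by Equations (9)–(11) in Section 2.3; as a reference point, a standard additive (Bahdanau-style [27]) attention over the LSTM hidden states $h_1, \dots, h_T$ takes the form

$$e_t = v^{\top}\tanh(W h_t + b), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T}\exp(e_k)}, \qquad c = \sum_{t=1}^{T}\alpha_t h_t,$$

where the softmax makes the weights $\alpha_t$ a probability distribution over time steps and the context vector $c$ summarizes the sequence for classification.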
2. Materials and Methods
2.1. Dataset
2.2. Dataset Pre-Processing
2.3. Model Architecture
Algorithm 1. Hybrid CNN-LSTM with attention for sign language recognition.

Input: Sign gesture video frame sequence
Output: Predicted sign class label

Step 1: Spatial feature extraction
    # Loop over each frame in the frame sequence to compute spatial features
    for frame in frame sequence do
        features ← MobileNetV2 feature extractor(frame)
    end for

Step 2: Attention-based LSTM
    for t in range(feature sequence length T) do
        # Compute the hidden state h_t using Equations (1)–(6)
    end for
    Set the LSTM output h to the sequence of hidden states h = (h_1, h_2, …, h_T)
    # Compute the attention weights α_t and the context vector using Equations (9)–(11)
    context vector ← Attention(h)
    # Apply a Dense layer with ReLU activation to the context vector
    dense1 ← Dense(128, activation="relu")(context vector)
    # Apply a Dropout layer with a dropout rate of 0.7
    dropout ← Dropout(0.7)(dense1)
    # Apply a Dense layer with Softmax activation to classify the sign gesture
    output ← Dense(100, activation="softmax")(dropout)
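The following is a minimal TensorFlow/Keras sketch of the pipeline in Algorithm 1. The clip length (30 frames), input resolution (224 × 224), LSTM width (64), and the additive attention scoring are illustrative assumptions, not the authors' exact configuration; the paper's Equations (1)–(6) and (9)–(11) define the recurrences and weighting actually used.

```python
# Minimal sketch of Algorithm 1 in TensorFlow/Keras. Clip length, frame
# resolution, LSTM width, and the attention scoring are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

T, H, W, C = 30, 224, 224, 3   # frames per clip, frame height/width/channels
NUM_CLASSES = 100              # sign gesture vocabulary size

class AdditiveAttention(layers.Layer):
    """Scores each LSTM hidden state, softmax-normalizes the scores into
    weights, and returns their weighted sum (the context vector)."""
    def build(self, input_shape):
        d = input_shape[-1]
        self.W = self.add_weight(name="W", shape=(d, d), initializer="glorot_uniform")
        self.v = self.add_weight(name="v", shape=(d, 1), initializer="glorot_uniform")

    def call(self, h):                                           # h: (batch, T, d)
        e = tf.matmul(tf.tanh(tf.matmul(h, self.W)), self.v)     # scores: (batch, T, 1)
        alpha = tf.nn.softmax(e, axis=1)                         # weights over time steps
        return tf.reduce_sum(alpha * h, axis=1)                  # context: (batch, d)

# Frozen MobileNetV2 backbone extracts one feature vector per frame.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(H, W, C))
backbone.trainable = False

inputs = layers.Input(shape=(T, H, W, C))
x = layers.TimeDistributed(backbone)(inputs)   # per-frame features: (batch, T, 1280)
x = layers.LSTM(64, return_sequences=True)(x)  # hidden states h_1..h_T
context = AdditiveAttention()(x)               # attention pooling, cf. Eqs. (9)-(11)
x = layers.Dense(128, activation="relu")(context)
x = layers.Dropout(0.7)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

With the backbone frozen, the trainable head under these assumptions comes to roughly 0.37 M parameters, consistent in magnitude with the 0.374 M the paper reports.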
3. Results and Discussion
3.1. Implementation Details
3.2. Performance Measures
3.3. Comparison with the Existing Techniques
3.4. Computational Performance Analysis
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Naz, N.; Sajid, H.; Ali, S.; Hasan, O.; Ehsan, M.K. Signgraph: An Efficient and Accurate Pose-Based Graph Convolution Approach Toward Sign Language Recognition. IEEE Access 2023, 11, 19135–19147. [Google Scholar] [CrossRef]
- Naz, N.; Sajid, H.; Ali, S.; Hasan, O.; Ehsan, M.K. MIPA-ResGCN: A multi-input part attention enhanced residual graph convolutional framework for sign language recognition. Comput. Electr. Eng. 2023, 112, 109009. [Google Scholar] [CrossRef]
- Wang, F.; Zhang, L.; Yan, H.; Han, S. TIM-SLR: A lightweight network for video isolated sign language recognition. Neural Comput. Appl. 2023, 35, 22265–22280. [Google Scholar] [CrossRef]
- Huang, J.; Zhou, W.; Li, H.; Li, W. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2822–2832. [Google Scholar] [CrossRef]
- Das, S.; Biswas, S.K.; Purkayastha, B. A deep sign language recognition system for Indian sign language. Neural Comput. Appl. 2023, 35, 1469–1481. [Google Scholar] [CrossRef]
- Starner, T.; Weaver, J.; Pentland, A. Real-time American sign language recognition using desk and wearable computer-based video. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1371–1375. [Google Scholar] [CrossRef]
- Grobel, K.; Assan, M. Isolated sign language recognition using hidden Markov models. In Proceedings of the 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, Orlando, FL, USA, 12–15 October 1997; Volume 1, pp. 162–167. [Google Scholar]
- Huang, C.L.; Huang, W.Y. Sign language recognition using model-based tracking and a 3D Hopfield neural network. Mach. Vis. Appl. 1998, 10, 292–307. [Google Scholar] [CrossRef]
- Wang, L.C.; Wang, R.; Kong, D.H.; Yin, B.C. Similarity assessment model for Chinese sign language videos. IEEE Trans. Multimed. 2014, 16, 751–761. [Google Scholar] [CrossRef]
- Hikawa, H.; Kaida, K. Novel FPGA implementation of hand sign recognition system with SOM–Hebb classifier. IEEE Trans. Circuits Syst. Video Technol. 2014, 25, 153–166. [Google Scholar] [CrossRef]
- Pigou, L.; Dieleman, S.; Kindermans, P.J.; Schrauwen, B. Sign language recognition using convolutional neural networks. In Computer Vision-ECCV 2014 Workshops: Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part I 13; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 572–578. [Google Scholar]
- Molchanov, P.; Gupta, S.; Kim, K.; Kautz, J. Hand gesture recognition with 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 1–7. [Google Scholar]
- Huang, Y.; Huang, J.; Wu, X.; Jia, Y. Dynamic Sign Language Recognition Based on CBAM with Autoencoder Time Series Neural Network. Mob. Inf. Syst. 2022, 2022, 3247781. [Google Scholar] [CrossRef]
- Bantupalli, K.; Xie, Y. American sign language recognition using deep learning and computer vision. In Proceedings of the 2018 IEEE International Conference on Big Data, Seattle, WA, USA, 10–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4896–4899. [Google Scholar]
- Aparna, C.; Geetha, M. CNN and stacked LSTM model for Indian sign language recognition. In Machine Learning and Metaheuristics Algorithms, and Applications: First Symposium, SoMMA 2019, Trivandrum, India, December 18–21, 2019, Revised Selected Papers 1; Springer: Singapore, 2020; pp. 126–134. [Google Scholar]
- Rastgoo, R.; Kiani, K.; Escalera, S. Video-based isolated hand sign language recognition using a deep cascaded model. Multimed. Tools Appl. 2020, 79, 22965–22987. [Google Scholar] [CrossRef]
- Ming, Y.; Qian, H.; Guangyuan, L. CNN-LSTM Facial Expression Recognition Method Fused with Two-Layer Attention Mechanism. Comput. Intell. Neurosci. 2022, 2022, 7450637. [Google Scholar] [CrossRef] [PubMed]
- Bousbai, K.; Merah, M. A comparative study of hand gestures recognition based on MobileNetV2 and ConvNet models. In Proceedings of the 2019 6th International Conference on Image and Signal Processing and their Applications (ISPA), Mostaganem, Algeria, 24–25 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
- Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1459–1469. [Google Scholar]
- Boháček, M.; Hrúz, M. Sign pose-based transformer for word-level sign language recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 182–191. [Google Scholar]
- Das, S.; Biswas, S.K.; Purkayastha, B. Automated Indian sign language recognition system by fusing deep and handcrafted feature. Multimed. Tools Appl. 2023, 82, 16905–16927. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
- Hassan, N.; Miah, A.S.M.; Shin, J. A Deep Bidirectional LSTM Model Enhanced by Transfer-Learning-Based Feature Extraction for Dynamic Human Activity Recognition. Appl. Sci. 2024, 14, 603. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Venugopalan, A.; Reghunadhan, R. Applying Hybrid Deep Neural Network for the Recognition of Sign Language Words Used by the Deaf COVID-19 Patients. Arab. J. Sci. Eng. 2023, 48, 1349–1362. [Google Scholar] [CrossRef] [PubMed]
- Tay, N.C.; Tee, C.; Ong, T.S.; Teh, P.S. Abnormal behavior recognition using CNN-LSTM with attention mechanism. In Proceedings of the 2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE), Kuala Lumpur, Malaysia, 25 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Natarajan, B.; Rajalakshmi, E.; Elakkiya, R.; Kotecha, K.; Abraham, A.; Gabralla, L.A.; Subramaniyaswamy, V. Development of an end-to-end deep learning framework for sign language recognition, translation, and video generation. IEEE Access 2022, 10, 104358–104374. [Google Scholar] [CrossRef]
- Lanjewar, M.G.; Panchbhai, K.G.; Patle, L.B. Fusion of transfer learning models with LSTM for detection of breast cancer using ultrasound images. Comput. Biol. Med. 2024, 169, 107914. [Google Scholar] [CrossRef] [PubMed]
- Li, D.; Yu, X.; Xu, C.; Petersson, L.; Li, H. Transferring cross-domain knowledge for video sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6205–6214. [Google Scholar]
- Du, Y.; Xie, P.; Wang, M.; Hu, X.; Zhao, Z.; Liu, J. Full transformer network with masking future for word-level sign language recognition. Neurocomputing 2022, 500, 115–123. [Google Scholar] [CrossRef]
- Tunga, A.; Nuthalapati, S.V.; Wachs, J. Pose-based sign language recognition using GCN and BERT. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 31–40. [Google Scholar]
- Umar, S.S.I.; Iro, Z.S.; Zandam, A.Y.; Shitu, S.S. Accelerated Histogram of Oriented Gradients for Human Detection. Ph.D. Thesis, Universiti Teknologi Malaysia, Johor Bahru, Malaysia, 2016. [Google Scholar]
| Method Used | Input | Parameters | Training Epochs | Avg. Inference Time (s) | Top-1 Accuracy (%) |
|---|---|---|---|---|---|
| I3D [19] (2020) | RGB | 12.4 M | 200 | 0.55 | 65.89 |
| TK-3DConvNet [30] (2020) | RGB | - | - | - | 77.55 |
| Full Transformer Network [31] (2022) | RGB | - | - | - | 80.72 |
| GCN-BERT [32] (2021) | Pose | - | 100 | - | 60.15 |
| SPOTER [20] (2022) | Pose | 5.92 M | - | 0.05 | 63.18 |
| MIPA-ResGCN [2] (2023) | Pose | 0.99 M | 350 | 0.0391 | 83.33 |
| SIGNGRAPH [1] (2023) | Pose | 0.62 M | 350 | 0.0404 | 72.09 |
| Ours (MobileNetV2 + LSTM + Attention) | RGB | 0.374 M | 150 | 0.0302 | 84.65 |
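For context on the average inference time column, the following is a minimal sketch of one way a per-clip figure like this can be measured; the warm-up run, batch size of 1, and clip count are illustrative choices, not the authors' exact protocol.

```python
# Hypothetical helper for estimating average per-clip inference time.
# Assumes the Keras `model` from the sketch in Section 2.3; the warm-up
# and batch size of 1 are assumptions, not the authors' protocol.
import time
import numpy as np

def average_inference_time(model, num_clips=100, clip_shape=(30, 224, 224, 3)):
    clip = np.random.rand(1, *clip_shape).astype("float32")  # one dummy clip
    model.predict(clip, verbose=0)   # warm-up to exclude graph-build cost
    start = time.perf_counter()
    for _ in range(num_clips):
        model.predict(clip, verbose=0)
    return (time.perf_counter() - start) / num_clips  # seconds per clip
```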
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).