Spelling Correction Real-Time American Sign Language Alphabet Translation System Based on YOLO Network and LSTM
Abstract
1. Introduction
- A novel approach to SLR is proposed in which the output does not depend on a single algorithm (classifier); a second stage considers past, present, and future input information, increasing the robustness of the system on unseen data (see the pipeline sketch after this list);
- Evidence that, relative to state-of-the-art approaches, adding a bidirectional LSTM for spelling correction to our multimodal SLR model improves the interpretation accuracy of complex words;
- A methodology for building SLR systems from regression DNNs that achieves high accuracy at real-time speeds, demonstrated in our experiments with different users;
- An open-source ASL dataset that can be used for training ASL translation systems and natural language processing algorithms.
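To make the two-stage design concrete, the sketch below chains a per-frame letter detector (stage 1, the YOLO network) with a word-level spelling corrector (stage 2, the bidirectional LSTM). It is a minimal illustration, not the authors' implementation: the function names, the "space" end-of-word token, and the toy stand-ins are ours.

```python
from typing import Callable, List

def translate_stream(
    frames: List[str],
    detect_letter: Callable[[str], str],  # stage 1: YOLO detector stand-in
    correct_word: Callable[[str], str],   # stage 2: biLSTM corrector stand-in
) -> str:
    """Accumulate per-frame letter detections into words, then pass each
    finished word through the spelling corrector."""
    words: List[str] = []
    current: List[str] = []
    for frame in frames:
        letter = detect_letter(frame)
        if letter == "space":  # hypothetical end-of-word signal
            if current:
                words.append(correct_word("".join(current)))
                current = []
        else:
            current.append(letter)
    if current:  # flush the last word
        words.append(correct_word("".join(current)))
    return " ".join(words)

# Toy stand-ins so the sketch runs without any trained models:
fixes = {"liue": "like"}
stream = list("i") + ["space"] + list("liue") + ["space"] + list("pancakes")
print(translate_stream(stream, lambda f: f, lambda w: fixes.get(w, w)))
# -> "i like pancakes"
```

In the full system, detect_letter is the trained YOLO model applied to each camera frame and correct_word is the bidirectional LSTM described in Section 2.4.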
2. Materials and Methods
2.1. YOLO Networks
2.2. Images Dataset
- All of the images in this dataset were captured against complex background scenes, with different persons and handshapes, which allows DNNs to be trained more effectively for real-world conditions;
- Researchers in computer vision for natural language processing, sign language recognition, and human–computer interfacing via handshape commands can benefit from this dataset;
- These data can be used to train DNNs for object detection in images to perform American Sign Language alphabet translation;
- The dataset contains not only images for ASL alphabet translation; every image is also labeled in YOLO format (see the parser sketch after this list), making the data faster and easier to use;
- The practical results obtained with DNNs depend heavily on the quality and quantity of the training data; this dataset can be used to train a YOLO network or to enrich another sign language recognition dataset.
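For reference, each YOLO-format annotation file stores one line per object as class x_center y_center width height, with coordinates normalized to the image dimensions. A minimal parser sketch (the file layout follows the YOLO convention; the helper names are ours):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class YoloBox:
    class_id: int     # index into the class list (e.g., 0 -> 'A')
    x_center: float   # box center x, normalized to image width
    y_center: float   # box center y, normalized to image height
    width: float      # box width, normalized
    height: float     # box height, normalized

def load_yolo_labels(path: str) -> List[YoloBox]:
    """Parse one YOLO-format .txt annotation file."""
    boxes = []
    with open(path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            boxes.append(YoloBox(int(cls), float(xc), float(yc),
                                 float(w), float(h)))
    return boxes
```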
2.3. YOLO Network Models
2.4. Word Spelling Correction
RNN Architecture
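As a rough sketch of the idea only (hyperparameters, input encoding, and output granularity here are assumptions, not the authors' exact architecture; the network may, for example, use a semi-character input representation and predict whole words, as in [37]), a character-level bidirectional LSTM corrector could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class SpellingCorrector(nn.Module):
    """Generic character-level bidirectional LSTM corrector (a sketch)."""
    def __init__(self, vocab_size: int = 28, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(64, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        # chars: (batch, seq_len) integer-encoded characters.
        # Returns per-position logits over the character vocabulary,
        # computed from both left (past) and right (future) context.
        h, _ = self.lstm(self.embed(chars))
        return self.out(h)

# Shape check: a batch of 2 sequences, 20 characters each.
logits = SpellingCorrector()(torch.zeros(2, 20, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 20, 28])
```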
2.5. Spelling Correction Dataset
3. Results
3.1. YOLO Models Results
3.2. Bidirectional LSTM Results
3.3. Real-Time Experiment Results
4. Discussion
Supplementary Materials
Author Contributions
Funding
Informed Consent Statement
Conflicts of Interest
References
- World Health Organization. Deafness and Hearing Loss. Available online: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss (accessed on 9 September 2020).
- World Health Organization. WHO Global Estimates on Prevalence of Hearing Loss, Prevention of Deafness WHO. 2018. Available online: https://www.who.int/deafness/Global-estimates-on-prevalence-of-hearing-loss-for-website.pptx?ua=1 (accessed on 9 September 2020).
- Dong, C.; Leu, M.C.; Yin, Z. Sign Language Alphabet Recognition Using Microsoft Kinect. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015.
- Yang, H.-D. Sign Language Recognition with the Kinect Sensor Based on Conditional Random Fields. Sensors 2015, 15, 135–147.
- Oz, C.; Leu, M.C. American Sign Language word recognition with a sensory glove using artificial neural networks. Eng. Appl. Artif. Intell. 2011, 24, 1204–1213.
- Luzanin, O.; Plancak, M. Hand gesture recognition using low-budget data glove and cluster-trained probabilistic neural network. Assem. Autom. 2014, 34, 94–105.
- Rivera-Acosta, M.; Ortega-Cisneros, S.; Rivera, J.; Sandoval-Ibarra, F. American Sign Language Alphabet Recognition Using a Neuromorphic Sensor and an Artificial Neural Network. Sensors 2017, 17, 2176.
- Guo, J.; Zhou, W.; Li, H.; Li, W. Sign Language Recognition Using Real-Sense. In Proceedings of the 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, China, 12–15 July 2015.
- Uddin, M.A.; Chowdhury, S.A. Hand Sign Language Recognition for Bangla Alphabet using Support Vector Machine. In Proceedings of the International Conference on Innovations in Science, Engineering and Technology (ICISET), Dhaka, Bangladesh, 28–29 October 2016.
- Tao, W.; Leu, M.C.; Yin, Z. American Sign Language alphabet recognition using Convolutional Neural Networks with multiview augmentation and inference fusion. Eng. Appl. Artif. Intell. 2018, 76, 202–213.
- Masood, S.; Thuwal, H.C.; Srivastava, A. American Sign Language Character Recognition Using Convolution Neural Network. Smart Innov. Syst. Technol. 2018, 78, 403–412.
- Ye, Y.; Tian, Y.; Huenerfauth, M.; Liu, Y. Recognizing American Sign Language Gestures from within Continuous Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2064–2073.
- Dinesh, S.; Sivaprakash, S.; Keshav, M.; Ramya, K. Real-Time American Sign Language Recognition with Faster Regional Convolutional Neural Networks. Int. J. Innov. Res. Sci. Eng. Technol. 2018, 7, 297–305.
- Hoque, O.B.; Jubair, M.I.; Islam, M.S.; Akash, A.F.; Paulson, A.S. Real Time Bangladeshi Sign Language Detection using Faster R-CNN. In Proceedings of the International Conference on Innovation in Engineering and Technology (ICIET), Dhaka, Bangladesh, 27–28 December 2018.
- Rastgoo, R.; Kiani, K.; Escalera, S. Multi-Modal Deep Hand Sign Language Recognition in Still Images Using Restricted Boltzmann Machine. Entropy 2018, 20, 809.
- Yang, L.; Chen, J.; Zhu, W. Dynamic Hand Gesture Recognition Based on a Leap Motion Controller and Two-Layer Bidirectional Recurrent Neural Network. Sensors 2020, 20, 2106.
- Bird, J.J.; Ekárt, A.; Faria, D.R. British Sign Language Recognition via Late Fusion of Computer Vision and Leap Motion with Transfer Learning to American Sign Language. Sensors 2020, 20, 5151.
- Hernandez, V.; Suzuki, T.; Venture, G. Convolutional and Recurrent Neural Network for Human Activity Recognition: Application on American Sign Language. PLoS ONE 2020, 15, 1–12.
- Kim, M.; Cho, J.; Lee, S.; Jung, Y. IMU Sensor-Based Hand Gesture Recognition for Human-Machine Interfaces. Sensors 2019, 19, 3827.
- Akash. ASL Alphabet Image Data Set for Alphabets in the American Sign Language. 2018. Available online: https://www.kaggle.com/grassknoted/asl-alphabet (accessed on 9 September 2020).
- NVIDIA. CUDA GPUs. Available online: https://developer.nvidia.com/cuda-gpus (accessed on 9 September 2020).
- Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014.
- Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497.
- Lan, W.; Dang, J.; Wang, Y.; Wang, S. Pedestrian Detection Based on YOLO Network Model. In Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation, Changchun, China, 5–8 August 2018.
- Min, W.; Li, X.; Wang, Q.; Zeng, Q.; Liao, Y. New approach to vehicle license plate location based on new model YOLO-L and plate pre-identification. IET Image Process. 2019, 13, 1041–1049.
- Krawczyk, Z.; Starzyński, J. Bones detection in the pelvic area on the basis of YOLO neural network. In Proceedings of the 19th International Conference Computational Problems of Electrical Engineering, Banska Stiavnica, Slovakia, 9–12 September 2018.
- Daniels, S.; Suciati, N.; Fatichah, C. Indonesian Sign Language Recognition using YOLO Method. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1077, 012029.
- Tzutalin. LabelImg. Git code. 2015. Available online: https://github.com/tzutalin/labelImg/ (accessed on 9 September 2020).
- YOLO: Real-Time Object Detection. Available online: https://pjreddie.com/darknet/yolo/ (accessed on 5 July 2019).
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Sakaguchi, K.; Duh, K.; Post, M.; Van Durme, B. Robust Word Recognition via Semi-Character Recurrent Neural Network. arXiv 2017, arXiv:1608.02214.
- Liu, P.; Qiu, X.; Huang, X. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York City, NY, USA, 9–15 July 2016.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312v3.
A1 (416 × 416) 1 | A2 (352 × 352) 1 | A2 (352 × 352) 1 (cont.)
---|---|---
C1—416 × 416, 32 × 3 × 3, 1 | C1—352 × 352, 32 × 3 × 3, 1 | C22—11 × 11, 256 × 1 × 1, 1
P2—416 × 416, 2 × 2, 2 | C2—352 × 352, 64 × 3 × 3, 2 | U23
C3—208 × 208, 64 × 3 × 3, 1 | C3—176 × 176, 32 × 1 × 1, 1 | C24—22 × 22, 256 × 1 × 1, 1
P4—208 × 208, 2 × 2, 2 | C4—176 × 176, 64 × 3 × 3, 1 | C25—22 × 22, 512 × 3 × 3, 1
C5—104 × 104, 128 × 3 × 3, 1 | C5—176 × 176, 128 × 3 × 3, 2 | C26—22 × 22, 256 × 1 × 1, 1
P6—104 × 104, 2 × 2, 2 | C6—88 × 88, 64 × 1 × 1, 1 | C27—22 × 22, 512 × 3 × 3, 1
C7—52 × 52, 256 × 3 × 3, 1 | C7—88 × 88, 128 × 3 × 3, 1 | C28—22 × 22, 256 × 1 × 1, 1
P8—52 × 52, 2 × 2, 2 | C8—88 × 88, 256 × 3 × 3, 2 | C29—22 × 22, 512 × 3 × 3, 1
C9—26 × 26, 512 × 3 × 3, 1 | C9—44 × 44, 128 × 1 × 1, 1 | C30—22 × 22, 93 × 1 × 1, 1
P10—26 × 26, 2 × 2, 1 | C10—44 × 44, 512 × 3 × 3, 1 | D31
C11—26 × 26, 1024 × 3 × 3, 1 | C11—44 × 44, 1024 × 3 × 3, 2 | C32—22 × 22, 128 × 1 × 1, 1
C12—26 × 26, 256 × 1 × 1, 1 | C12—22 × 22, 512 × 1 × 1, 1 | U33
C13—26 × 26, 512 × 3 × 3, 1 | C13—22 × 22, 1024 × 3 × 3, 1 | C34—44 × 44, 128 × 1 × 1, 1
C14—26 × 26, 93 × 1 × 1, 1 | C14—22 × 22, 512 × 1 × 1, 1 | C35—44 × 44, 256 × 3 × 3, 1
D15 | C15—22 × 22, 1024 × 3 × 3, 2 | C36—44 × 44, 128 × 1 × 1, 1
C16—26 × 26, 128 × 1 × 1, 1 | C16—11 × 11, 512 × 1 × 1, 1 | C37—44 × 44, 256 × 3 × 3, 1
U17 | C17—11 × 11, 1024 × 3 × 3, 1 | C38—44 × 44, 128 × 1 × 1, 1
C18—52 × 52, 256 × 1 × 1, 1 | C18—11 × 11, 512 × 1 × 1, 1 | C39—44 × 44, 256 × 3 × 3, 1
C19—52 × 52, 93 × 1 × 1, 1 | C19—11 × 11, 1024 × 3 × 3, 1 | C40—44 × 44, 93 × 1 × 1, 1
D20 | C20—11 × 11, 93 × 1 × 1, 1 | D41
 | D21 | 
1 Layer notation: C = convolution (input size, filters × kernel size, stride), P = max pooling (input size, pool size, stride), U = upsampling, D = detection (YOLO output) layer.
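Reading the table notation: for example, C1—416 × 416, 32 × 3 × 3, 1 denotes a convolutional layer with a 416 × 416 input, 32 filters of size 3 × 3, and stride 1. A minimal PyTorch sketch of one such block, assuming the conventional Darknet convolution, batch-normalization, and LeakyReLU composition (an assumption here, since the networks were trained in Darknet, not PyTorch):

```python
import torch.nn as nn

def conv_block(in_ch: int, filters: int, kernel: int, stride: int) -> nn.Sequential:
    """One 'C' table entry: convolution + batch norm + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, filters, kernel, stride,
                  padding=kernel // 2, bias=False),
        nn.BatchNorm2d(filters),
        nn.LeakyReLU(0.1),
    )

# C1—416 × 416, 32 × 3 × 3, 1: 32 filters, 3 × 3 kernel, stride 1,
# applied to a 3-channel RGB input.
c1 = conv_block(3, 32, 3, 1)
```

The pooling (P) and upsampling (U) entries map to nn.MaxPool2d and nn.Upsample in the same way.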
Model, GPU | Frame Rate (fps) | | | mAP@50 (%), ASLYtest/Test Set | |
---|---|---|---|---|---|---
(Image Size) | 288 × 288 | 352 × 352 | 416 × 416 | 288 × 288 | 352 × 352 | 416 × 416
A1, GTX-1050 | 47.15 | 36.00 | 29.54 | 97.07/72.66 | 98.86/74.26 | 99.46/75.63
A1, GTX-1080 | 122.24 | 95.32 | 72.90 | | |
A2, GTX-1050 | 29.81 | 22.85 | 16.65 | 98.06/69.87 | 99.81/81.74 | 99.85/81.76
A2, GTX-1080 | 81.79 | 61.35 | 46.03 | | |
mAP@50 values depend only on the model and input size, not on the GPU used for inference, so they are listed once per model.
Correct Sentence | Wrong Sentence |
---|---|
I would like to make an appointment | i would liue to mave an apointsent |
is there a restaurant close to us | is theue a restaubant clote to us |
oscar is helping me with my homework | oscar is helpsng me winh my honework |
hello I am cesar | hslo i am cesyr |
I think it tastes good | i thimk it tases god |
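The wrong sentences mimic misdetections through character substitutions, insertions, and deletions. As an illustration only (not necessarily the authors' exact generation procedure; the corrupt function and its error_rate parameter are hypothetical), such training pairs could be produced as follows:

```python
import random
import string

def corrupt(sentence: str, error_rate: float = 0.15, seed: int = 0) -> str:
    """Randomly substitute, delete, or insert lowercase letters to derive
    a 'wrong sentence' from a correct one (illustrative sketch only)."""
    rng = random.Random(seed)
    out = []
    for ch in sentence.lower():
        if ch.isalpha() and rng.random() < error_rate:
            op = rng.choice(["substitute", "delete", "insert"])
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("i would like to make an appointment", seed=3))
```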
User | Sentence | A2 (352 × 352) Prediction | WSC
---|---|---|---
1 | I like pancakes | i like patcakes | i like pancakes
2 | Is there a restaurant close to us | is there a reistaurat close to us | is there a restaurant close to us
3 | Please wait for me | pletse wait for me | please wait for me
4 | Andrea plays the piano | atdrea plays the siatosts | andrea plays the piano
5 | The boss is busy | the bcs is brusy | the boss is busy
6 | We need to rent a room | we ned to rent a rom | we need to rent a room
7 | I am getting my license | si am getting my licenxe | of am getting my outside
8 | Could be better | could be beaer | could be better
9 | I should leave | i shculd letve | i should leave
10 | I am not sure about that | xi am not sure tabout thta | of am not sure vacant wanna
User | Levenshtein Distance: Sentence vs. A2 (352 × 352) Prediction | Levenshtein Distance: Sentence vs. WSC
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 1 | 0 |
4 | 6 | 0 |
5 | 3 | 0 |
6 | 2 | 0 |
7 | 2 | 8 |
8 | 2 | 0 |
9 | 2 | 0 |
10 | 4 | 11 |
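The entries above are Levenshtein (edit) distances: the minimum number of single-character insertions, deletions, and substitutions required to transform the recognized sentence into the target sentence. A standard dynamic-programming implementation, shown for reference:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# e.g., user 1: target sentence vs. the raw A2 prediction
print(levenshtein("i like pancakes", "i like patcakes"))  # -> 1
```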
Proposal | Sensor | Classification Level | Dataset Samples | Accuracy (%) |
---|---|---|---|---|
Ours | RGB | Letters (single sign) | 29,200 | 99.81/81.74 (mAP@50) |
[9] | RGB | Letters (single sign) | 2400 | 99.5 |
[10] | RGB–depth | Letters (single sign) | 48,000 | 99.9 |
[33] | RGB | Letters (single sign) | 4547 | 99.91 (mAP@50) |
[15] | RGB–depth | Letters (single sign) | 2524–131,000 | 99.31–98.13 |
[19] | IMU | Numbers (dynamic gesture) | 1000 | 98.60 |
[16] | LMC | Words (dynamic gesture) | 300–480 | 96.67–95.23 |
[11] | RGB | Letters (single sign) | 14,781 | 95.54 |
[17] | RGB–LMC | Gestures (dynamic–BSL) | 16,498 | 94.44 |
[13] | RGB | Letters (single sign) | 130,000 | 93.33 |
[18] | LMC | Words and numbers | 16,890 | 91.1 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).