Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices
Abstract
1. Introduction
2. Related Work
2.1. Audio-Visual Speech Recognition
2.2. Gesture Recognition
3. Research Datasets
3.1. Audio-Visual Speech Recognition
3.2. Gesture Recognition
- Multimodality (video data in RGB format with depth map);
- All gestures are performed dynamically;
- A relatively large number of signers (43 people);
- A relatively large number of gestures (226 Turkish SL gestures);
- Various background settings.
4. Methodology
4.1. Audio-Visual Speech Recognition
4.1.1. Visual Speech Recognition
4.1.2. Audio Speech Recognition
4.1.3. Audio-Visual Fusion
4.2. Gesture Recognition
- Preparation of the gesture;
- The functional component of the gesture (its core);
- Retraction [128].
- Size of recognition dictionary;
- Variation of signers (gender and age) and gestures;
- Characteristics of the visual information transmission channel.
- Hand configuration (shape of hand or hands);
- Place of performance (position of the hands in space during the gesture);
- The nature of the movement;
- Facial expressions;
- Lip articulation.
- 2D distances from face to hands are calculated for each frame (an illustrative computation is sketched after this list);
- Areas of face and hands intersection are calculated for each frame (see the sketch after this list);
- Zones of hands location, which are illustrated in Figure 7d. The five presented zones for showing gestures make it possible to describe all available gestures in the Y-plane. A hand region belongs to one of the five gesture zones if the area of their intersection is greater than 50%. In rare cases, when a hand region intersects two of the five zones by 50% each, the zone with the smaller initial coordinate relative to the Y-plane is selected (see the sketch after this list).
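A minimal sketch of how these three spatial features could be computed is given below. It assumes the face and hands are available as axis-aligned bounding boxes in normalized image coordinates; the box format, the Euclidean distance between box centers, and the tie-breaking rule for zones are assumptions for illustration, not the exact formulation used in the article.

```python
import numpy as np

# A region is an axis-aligned bounding box (x_min, y_min, x_max, y_max)
# in normalized image coordinates.

def center(box):
    """Geometric center of a bounding box."""
    x0, y0, x1, y1 = box
    return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])

def face_hand_distance(face, hand):
    """2D Euclidean distance between the face center and a hand center."""
    return float(np.linalg.norm(center(face) - center(hand)))

def intersection_area(a, b):
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def hand_zone(hand, zones):
    """Index of the gesture zone the hand belongs to (-1 if none).

    A hand belongs to a zone if their intersection covers at least half of
    the hand box; a tie between two zones is resolved by the smaller initial
    coordinate of the zone (an assumption about the tie-breaking rule).
    """
    hand_area = (hand[2] - hand[0]) * (hand[3] - hand[1])
    candidates = [(i, z) for i, z in enumerate(zones)
                  if intersection_area(hand, z) >= 0.5 * hand_area]
    if not candidates:
        return -1
    return min(candidates, key=lambda item: item[1][1])[0]
```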
5. Evaluation Experiments
5.1. Audio-Visual Speech Recognition
5.1.1. Visual Speech Recognition
- The selection of model architecture;
- The selection of optimal input image resolution;
- The selection of optimal data augmentation methods [128].
- Two learning rate schedulers (constant learning rate, cosine annealing learning rate [141]). The learning rate under cosine annealing follows the standard schedule from [141] (sketched after this list).
- Two optimizers (Adam, SGD). The maximum SR accuracy with the Adam optimizer is achieved at a learning rate 10 times lower than with the SGD optimizer.
- MixUp [142] mixes two images and their labels with a random mixing coefficient. It is applied both to the images and to their binary label vectors; the new image and label vector follow the standard formulation from [142] (sketched after this list).
- Label smoothing [143] softens one-hot label vectors. It is applied to all binary label vectors to which MixUp has not been applied, following the standard formulation from [143] (sketched after this list).
- Affine transformations modify training images by horizontal and vertical shifts, horizontal flips, counter-clockwise shears, and rotations.
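The formulas below sketch the standard formulations from the cited works [141,142,143] that the descriptions above refer to; the notation (η for the learning rate, λ for the MixUp coefficient, ε for the smoothing factor) is chosen here and may differ from the article.

```latex
% Cosine annealing learning rate [141]: eta_min/eta_max are the schedule bounds,
% t is the current training step and T the total number of steps.
\eta_t = \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max}-\eta_{\min}\right)
         \left(1 + \cos\!\left(\tfrac{t}{T}\,\pi\right)\right)

% MixUp [142]: a new sample is a convex combination of two training pairs,
% with the coefficient drawn from a Beta distribution.
\tilde{x} = \lambda x_i + (1-\lambda)\,x_j,\qquad
\tilde{y} = \lambda y_i + (1-\lambda)\,y_j,\qquad
\lambda \sim \mathrm{Beta}(\alpha,\alpha)

% Label smoothing [143]: a one-hot vector y over K classes is softened
% with smoothing factor epsilon.
y_k^{\mathrm{LS}} = y_k\,(1-\varepsilon) + \frac{\varepsilon}{K}
```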
5.1.2. Audio Speech Recognition
- Two learning rate schedulers (constant learning rate, cosine annealing learning rate);
- Two optimizers (Adam, SGD).
5.1.3. Audio-Visual Fusion
5.2. Gesture Recognition
- 2D distances from face to hands (two features per frame);
- Areas of face and hands intersection (two features per frame);
- Zones of hands location (two features per frame);
- Age estimate (one feature per frame);
- Gender estimate (one feature per frame).
- 8 basic features;
- 20 features to represent hand configurations;
- 5 features to represent lip regions.
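As a point of reference, the sketch below shows how a per-frame descriptor of this size (8 + 20 + 5 = 33 features) could be assembled; the individual extractors (face/hand boxes, hand configuration, lip region, age and gender estimates) are assumed to be computed elsewhere, and the concatenation order is an assumption.

```python
import numpy as np

def frame_descriptor(distances, areas, zones, age, gender,
                     hand_config, lip_region):
    """Concatenate the per-frame gesture features listed above.

    distances   : 2 values - 2D face-to-hand distances (left/right hand)
    areas       : 2 values - face/hand intersection areas (left/right hand)
    zones       : 2 values - hand-location zone indices (left/right hand)
    age, gender : 1 value each - per-frame estimates
    hand_config : 20 values - hand configuration representation
    lip_region  : 5 values - lip region representation
    """
    basic = np.asarray([*distances, *areas, *zones, age, gender],
                       dtype=np.float32)                   # 8 basic features
    vec = np.concatenate([basic,
                          np.asarray(hand_config, dtype=np.float32),
                          np.asarray(lip_region, dtype=np.float32)])
    assert vec.shape == (33,)                              # 8 + 20 + 5 per frame
    return vec
```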
6. Conclusions
- Data dependency: the performance of both AVSR and SLR methods relies heavily on the quantity and quality of the training data. If real-world data deviate substantially from the training data, recognition accuracy drops significantly.
- Sensitivity to noise: in practical applications, both AVSR and SLR methods may encounter acoustic and visual noise that can negatively impact their performance. However, the presence of two information streams (video and audio) provides some level of robustness against noise.
- Training time: the proposed NN models require substantial computational resources, making the training process time-consuming. This process involves multiple iterations and calculations in order to optimize the model’s parameters and achieve the desired accuracy. The longer training time not only requires more computational power but also increases the demand for storage and memory resources. Therefore, a trade-off between computational resources, training time, and accuracy should be carefully considered when implementing these models.
- Requirement for real-time processing: in order for the proposed AVSR and SLR methods to function in real-time, it is crucial to have access to modern mobile devices equipped with high-performance processors. These powerful devices are necessary to ensure that the NN models can process and analyze the video and audio data quickly and efficiently.
- Improving the accuracy and robustness of AVSR and SLR in real-world scenarios where data can be noisy and diverse, and addressing variations in speech and gesture styles, accents, and other sources of variability;
- Investigating and creating new models that can effectively handle multilingual and cross-lingual recognition, and demonstrating robust performance across different cultures and dialects.
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ASR | Automatic Speech Recognition |
AUTSL | Ankara University Turkish Sign Language Dataset |
AV | Audio-Visual |
AVSR | Automatic Audio-Visual Speech Recognition |
BiGRU | Bidirectional Gated Recurrent Unit |
CNN | Convolutional Neural Network |
CTC | Connectionist Temporal Classification |
CV | Computer Vision |
CVPR | Computer Vision and Pattern Recognition |
DBF | Deep Bottleneck Features |
DRT | Dimensionality Reduction Technique |
E2E | End-to-End |
FCNN | Fully Connected Neural Network |
GRU | Gated Recurrent Unit |
HCI | Human-Computer Interaction |
HMM | Hidden Markov Model |
LDA | Linear Discriminant Analysis |
LRW | Lip Reading in the Wild Dataset |
LSTM | Long Short-Term Memory |
MFCC | Mel-Frequency Cepstral Coefficient |
NN | Neural Network |
PCA | Principal Component Analysis |
ROI | Region-of-Interest |
SL | Sign Language |
SLR | Sign Language Recognition |
SNR | Signal-to-Noise Ratio |
SR | Speech Recognition |
STF | Spatio-Temporal Features |
SVM | Support Vector Machine |
t-SNE | t-distributed Stochastic Neighbor Embedding |
References
- Miao, Z.; Liu, H.; Yang, B. Part-based Lipreading for Audio-Visual Speech Recognition. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, Toronto, ON, Canada, 11–14 October 2020; pp. 2722–2726. [Google Scholar] [CrossRef]
- Cho, J.W.; Park, J.H.; Chang, J.H.; Park, H.M. Bayesian Feature Enhancement using Independent Vector Analysis and Reverberation Parameter Re-Estimation for Noisy Reverberant Speech Recognition. Comput. Speech Lang. 2017, 46, 496–516. [Google Scholar] [CrossRef]
- Yu, W.; Zeiler, S.; Kolossa, D. Fusing Information Streams in End-to-End Audio-Visual Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Toronto, ON, Canada, 6–11 June 2021; pp. 3430–3434. [Google Scholar] [CrossRef]
- Crosse, M.J.; Di Liberto, G.M.; Lalor, E.C. Eye can Hear Clearly Now: Inverse Effectiveness in Natural Audiovisual Speech Processing Relies on Long-Term Crossmodal Temporal Integration. J. Neurosci. 2016, 36, 9888–9895. [Google Scholar] [CrossRef] [Green Version]
- McGurk, H.; MacDonald, J. Hearing Lips and Seeing Voices. Nature 1976, 264, 746–748. [Google Scholar] [CrossRef]
- Lee, Y.H.; Jang, D.W.; Kim, J.B.; Park, R.H.; Park, H.M. Audio-visual Speech Recognition based on Dual Cross-Modality Attentions with the Transformer Model. Appl. Sci. 2020, 10, 7263. [Google Scholar] [CrossRef]
- Ivanko, D.; Ryumin, D.; Karpov, A. Automatic Lip-Reading of Hearing Impaired People. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, XLII-2/W12, 97–101. [Google Scholar] [CrossRef] [Green Version]
- Guo, L.; Lu, Z.; Yao, L. Human-Machine Interaction Sensing Technology based on Hand Gesture Recognition: A Review. IEEE Trans. Hum.-Mach. Syst. 2021, 51, 300–309. [Google Scholar] [CrossRef]
- Mahmud, S.; Lin, X.; Kim, J.H. Interface for Human Machine Interaction for Assistant Devices: A Review. In Proceedings of the 10th Annual Computing and Communication Workshop and Conference (CCWC), IEEE, Las Vegas, NV, USA, 6–8 January 2020; pp. 768–773. [Google Scholar] [CrossRef]
- Ryumin, D.; Kagirov, I.; Axyonov, A.; Pavlyuk, N.; Saveliev, A.; Kipyatkova, I.; Zelezny, M.; Mporas, I.; Karpov, A. A Multimodal User Interface for an Assistive Robotic Shopping Cart. Electronics 2020, 9, 2093. [Google Scholar] [CrossRef]
- Ryumin, D.; Karpov, A.A. Towards Automatic Recognition of Sign Language Gestures using Kinect 2.0. In Proceedings of the International Conference on Universal Access in Human-Computer Interaction (UAHCI), Springer, Vancouver, BC, Canada, 9–14 July 2017; pp. 89–101. [Google Scholar] [CrossRef]
- Wang, Y.; Fan, X.; Chen, I.F.; Liu, Y.; Chen, T.; Hoffmeister, B. End-to-End Anchored Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK, 12–17 May 2019; pp. 7090–7094. [Google Scholar] [CrossRef] [Green Version]
- Krishna, G.; Tran, C.; Yu, J.; Tewfik, A.H. Speech Recognition with no Speech or with Noisy Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK, 12–17 May 2019; pp. 1090–1094. [Google Scholar] [CrossRef] [Green Version]
- Wang, Y.; Shen, J.; Zheng, Y. Push the Limit of Acoustic Gesture Recognition. IEEE Trans. Mob. Comput. 2020, 21, 1798–1811. [Google Scholar] [CrossRef]
- Carli, L.L.; LaFleur, S.J.; Loeber, C.C. Nonverbal Behavior, Gender, and Influence. J. Personal. Soc. Psychol. 1995, 68, 1030. [Google Scholar] [CrossRef]
- Iriskhanova, O.; Cienki, A. The Semiotics of Gestures in Cognitive Linguistics: Contribution and Challenges. Vopr. Kogn. Lingvist. 2018, 4, 25–36. [Google Scholar] [CrossRef]
- Nathan, M.J.; Schenck, K.E.; Vinsonhaler, R.; Michaelis, J.E.; Swart, M.I.; Walkington, C. Embodied Geometric Reasoning: Dynamic Gestures During Intuition, Insight, and Proof. J. Educ. Psychol. 2021, 113, 929. [Google Scholar] [CrossRef]
- Lin, W.; Orton, I.; Li, Q.; Pavarini, G.; Mahmoud, M. Looking at the Body: Automatic Analysis of Body Gestures and Self-Adaptors in Psychological Distress. IEEE Trans. Affect. Comput. 2021, 1. [Google Scholar] [CrossRef]
- Von Agris, U.; Knorr, M.; Kraiss, K.F. The Significance of Facial Features for Automatic Sign Language Recognition. In Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition, IEEE, Amsterdam, The Netherlands, 17–19 September 2008; pp. 1–6. [Google Scholar] [CrossRef]
- Chung, J.S.; Zisserman, A. Lip Reading in the Wild. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; pp. 87–103. [Google Scholar] [CrossRef]
- Sincan, O.M.; Keles, H.Y. AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods. IEEE Access 2020, 8, 181340–181355. [Google Scholar] [CrossRef]
- Petridis, S.; Stafylakis, T.; Ma, P.; Tzimiropoulos, G.; Pantic, M. Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), IEEE, Athens, Greece, 18–21 December 2018; pp. 513–520. [Google Scholar]
- Ivanko, D. Audio-Visual Russian Speech Recognition. Ph.D. Thesis, Universität Ulm, Ulm, Germany, 2022. [Google Scholar]
- Dupont, S.; Luettin, J. Audio-Visual Speech Modeling for Continuous Speech Recognition. IEEE Trans. Multimed. 2000, 2, 141–151. [Google Scholar] [CrossRef]
- Ivanko, D.; Karpov, A.; Fedotov, D.; Kipyatkova, I.; Ryumin, D.; Ivanko, D.; Minker, W.; Zelezny, M. Multimodal Speech Recognition: Increasing Accuracy using High Speed Video Data. J. Multimodal User Interfaces 2018, 12, 319–328. [Google Scholar] [CrossRef]
- Ivanko, D.; Ryumin, D.; Axyonov, A.; Železnỳ, M. Designing Advanced Geometric Features for Automatic Russian Visual Speech Recognition. In Proceedings of the International Conference on Speech and Computer, Leipzig, Germany, 18–22 September 2018; pp. 245–254. [Google Scholar] [CrossRef]
- Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
- Izenman, A.J. Linear Discriminant Analysis. In Modern Multivariate Statistical Techniques; Springer: New York, USA, 2013; pp. 237–280. [Google Scholar] [CrossRef]
- Belkina, A.C.; Ciccolella, C.O.; Anno, R.; Halpert, R.; Spidlen, J.; Snyder-Cappione, J.E. Automated Optimized Parameters for T-Distributed Stochastic Neighbor Embedding Improve Visualization and Analysis of Large Datasets. Nat. Commun. 2019, 10, 5415. [Google Scholar] [CrossRef] [Green Version]
- Petridis, S.; Pantic, M. Deep Complementary Bottleneck Features for Visual Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Shanghai, China, 20–25 March 2016; pp. 2304–2308. [Google Scholar] [CrossRef]
- Takashima, Y.; Aihara, R.; Takiguchi, T.; Ariki, Y.; Mitani, N.; Omori, K.; Nakazono, K. Audio-Visual Speech Recognition using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 277–281. [Google Scholar] [CrossRef]
- Ninomiya, H.; Kitaoka, N.; Tamura, S.; Iribe, Y.; Takeda, K. Integration of Deep Bottleneck Features for Audio-Visual Speech Recognition. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; pp. 563–567. [Google Scholar] [CrossRef]
- Potamianos, G.; Neti, C.; Gravier, G.; Garg, A.; Senior, A.W. Recent Advances in the Automatic Recognition of Audiovisual Speech. Proc. IEEE 2003, 91, 1306–1326. [Google Scholar] [CrossRef]
- Ivanko, D.; Karpov, A.; Ryumin, D.; Kipyatkova, I.; Saveliev, A.; Budkov, V.; Ivanko, D.; Železnỳ, M. Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions. In Proceedings of the International Conference on Speech and Computer, Springer, Hatfield, Hertfordshire, UK, 12–16 September 2017; pp. 757–766. [Google Scholar] [CrossRef]
- Argones Rua, E.; Bredin, H.; García Mateo, C.; Chollet, G.; Gonzalez Jimenez, D. Audio-Visual Speech Asynchrony Detection using co-Inertia Analysis and Coupled Hidden Markov Models. Pattern Anal. Appl. 2009, 12, 271–284. [Google Scholar] [CrossRef]
- Koller, O.; Ney, H.; Bowden, R. Deep Learning of Mouth Shapes for Sign Language. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Santiago, Chile, 7–13 December 2015; pp. 85–91. [Google Scholar] [CrossRef]
- Noda, K.; Yamaguchi, Y.; Nakadai, K.; Okuno, H.G.; Ogata, T. Lipreading using Convolutional Neural Network. In Proceedings of the Interspeech, Singapore, 14–18 September 2014; pp. 1149–1153. [Google Scholar] [CrossRef]
- Tamura, S.; Ninomiya, H.; Kitaoka, N.; Osuga, S.; Iribe, Y.; Takeda, K.; Hayamizu, S. Audio-Visual Speech Recognition using Deep Bottleneck Features and High-Performance Lipreading. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, Hong Kong, China, 16–19 December 2015; pp. 575–582. [Google Scholar] [CrossRef]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 07–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef] [Green Version]
- Son Chung, J.; Senior, A.; Vinyals, O.; Zisserman, A. Lip Reading Sentences in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6447–6456. [Google Scholar] [CrossRef] [Green Version]
- Petridis, S.; Wang, Y.; Li, Z.; Pantic, M. End-to-End Audiovisual Fusion with LSTMs. In Proceedings of the 14th International Conference on Auditory-Visual Speech Processing, Stockholm, Sweden, 25–26 August 2017; pp. 36–40. [Google Scholar] [CrossRef] [Green Version]
- Wand, M.; Koutník, J.; Schmidhuber, J. Lipreading with Long Short-Term Memory. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Shanghai, China, 20–25 March 2016; pp. 6115–6119. [Google Scholar] [CrossRef] [Green Version]
- Assael, Y.M.; Shillingford, B.; Whiteson, S.; De Freitas, N. LipNet: End-to-End Sentence-Level Lipreading. arXiv 2016, arXiv:1611.01599. [Google Scholar]
- Shi, B.; Hsu, W.N.; Mohamed, A. Robust Self-Supervised Audio-Visual Speech Recognition. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 2118–2122. [Google Scholar] [CrossRef]
- Ivanko, D.; Ryumin, D.; Kashevnik, A.; Axyonov, A.; Kitenko, A.; Lashkov, I.; Karpov, A. DAVIS: Driver’s Audio-Visual Speech Recognition. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 1141–1142. [Google Scholar]
- Ryumina, E.; Ivanko, D. Emotional Speech Recognition Based on Lip-Reading. In Proceedings of the International Conference on Speech and Computer, Springer, Gurugram, India, 14–16 November 2022; pp. 616–625. [Google Scholar] [CrossRef]
- Ivanko, D.; Kashevnik, A.; Ryumin, D.; Kitenko, A.; Axyonov, A.; Lashkov, I.; Karpov, A. MIDriveSafely: Multimodal Interaction for Drive Safely. In Proceedings of the International Conference on Multimodal Interaction (ICMI), Bengaluru, India, 7–11 November 2022; pp. 733–735. [Google Scholar] [CrossRef]
- Zhou, P.; Yang, W.; Chen, W.; Wang, Y.; Jia, J. Modality Attention for End-to-End Audio-Visual Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK, 12–17 May 2019; pp. 6565–6569. [Google Scholar] [CrossRef] [Green Version]
- Makino, T.; Liao, H.; Assael, Y.; Shillingford, B.; Garcia, B.; Braga, O.; Siohan, O. Recurrent Neural Network Transducer for Audio-Visual Speech Recognition. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, Sentosa, Singapore, 14–18 December 2019; pp. 905–912. [Google Scholar] [CrossRef] [Green Version]
- Petridis, S.; Stafylakis, T.; Ma, P.; Cai, F.; Tzimiropoulos, G.; Pantic, M. End-to-End Audiovisual Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Calgary, AB, Canada, 15–20 April 2018; pp. 6548–6552. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
- Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep Audio-Visual Speech Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8717–8727. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Sterpu, G.; Saam, C.; Harte, N. Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; pp. 111–115. [Google Scholar] [CrossRef] [Green Version]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 433–459. [Google Scholar]
- Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, Sentosa, Singapore, 14–18 December 2019; pp. 8–15. [Google Scholar] [CrossRef]
- Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-based Acoustic Modeling for Hybrid Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Barcelona, Spain, 4–8 May 2020; pp. 6874–6878. [Google Scholar] [CrossRef] [Green Version]
- Yeh, C.F.; Mahadeokar, J.; Kalgaonkar, K.; Wang, Y.; Le, D.; Jain, M.; Schubert, K.; Fuegen, C.; Seltzer, M.L. Transformer-Transducer: End-to-End Speech Recognition with Self-Attention. arXiv 2019, arXiv:1910.12977. [Google Scholar]
- Paraskevopoulos, G.; Parthasarathy, S.; Khare, A.; Sundaram, S. Multimodal and Multiresolution Speech Recognition with Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2381–2387. [Google Scholar] [CrossRef]
- Fernandez-Lopez, A.; Sukno, F.M. Survey on Automatic Lip-Reading in the Era of Deep Learning. Image Vis. Comput. 2018, 78, 53–72. [Google Scholar] [CrossRef]
- Ivanko, D.; Axyonov, A.; Ryumin, D.; Kashevnik, A.; Karpov, A. RUSAVIC Corpus: Russian Audio-Visual Speech in Cars. In Proceedings of the 13th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 1555–1559. [Google Scholar]
- Ivanko, D.; Ryumin, D.; Axyonov, A.; Kashevnik, A. Speaker-Dependent Visual Command Recognition in Vehicle Cabin: Methodology and Evaluation. In Proceedings of the International Conference on Speech and Computer, Springer, St. Petersburg, Russia, 27–30 September 2021; pp. 291–302. [Google Scholar] [CrossRef]
- Lee, B.; Hasegawa-Johnson, M.; Goudeseune, C.; Kamdar, S.; Borys, S.; Liu, M.; Huang, T. AVICAR: Audio-Visual Speech Corpus in a Car Environment. In Proceedings of the 8th International Conference on Spoken Language Processing, Jeju Island, Republic of Korea, 4–8 October 2004; pp. 1–4. [Google Scholar]
- Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition. arXiv 2018, arXiv:1809.00496. [Google Scholar]
- Chen, H.; Xie, W.; Vedaldi, A.; Zisserman, A. VGGSound: A Large-Scale Audio-Visual Dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Barcelona, Spain, 4–8 May 2020; pp. 721–725. [Google Scholar] [CrossRef]
- Czyzewski, A.; Kostek, B.; Bratoszewski, P.; Kotus, J.; Szykulski, M. An Audio-Visual Corpus for Multimodal Automatic Speech Recognition. J. Intell. Inf. Syst. 2017, 49, 167–192. [Google Scholar] [CrossRef] [Green Version]
- Kashevnik, A.; Lashkov, I.; Axyonov, A.; Ivanko, D.; Ryumin, D.; Kolchin, A.; Karpov, A. Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin. IEEE Access 2021, 9, 34986–35003. [Google Scholar] [CrossRef]
- Zhu, H.; Luo, M.D.; Wang, R.; Zheng, A.H.; He, R. Deep Audio-Visual Learning: A Survey. Int. J. Autom. Comput. 2021, 18, 351–376. [Google Scholar] [CrossRef]
- Keskin, C.; Kıraç, F.; Kara, Y.E.; Akarun, L. Hand Pose Estimation and Hand Shape Classification using Multi-Layered Randomized Decision Forests. In Proceedings of the European Conference on Computer Vision (ECCV), Springer, Firenze, Italy, 7–13 October 2012; pp. 852–863. [Google Scholar] [CrossRef]
- Keskin, C.; Kıraç, F.; Kara, Y.E.; Akarun, L. Real Time Hand Pose Estimation using Depth Sensors. In Consumer Depth Cameras for Computer Vision; Springer: London, UK, 2013; pp. 119–137. [Google Scholar] [CrossRef]
- Taylor, J.; Tankovich, V.; Tang, D.; Keskin, C.; Kim, D.; Davidson, P.; Kowdle, A.; Izadi, S. Articulated Distance Fields for Ultra-Fast Tracking of Hands Interacting. ACM Trans. Graph. (TOG) 2017, 36, 1–12. [Google Scholar] [CrossRef]
- Camgöz, N.C.; Kındıroğlu, A.A.; Akarun, L. Sign Language Recognition for Assisting the Deaf in Hospitals. In Proceedings of the International Workshop on Human Behavior Understanding, Springer, Amsterdam, The Netherlands, 16 October 2016; pp. 89–101. [Google Scholar] [CrossRef]
- Kindiroglu, A.A.; Ozdemir, O.; Akarun, L. Temporal Accumulative Features for Sign Language Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), IEEE Computer Society, Seoul, Republic of Korea, 27–28 October 2019; pp. 1288–1297. [Google Scholar] [CrossRef] [Green Version]
- Orbay, A.; Akarun, L. Neural Sign Language Translation by Learning Tokenization. In Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG), IEEE, Buenos Aires, Argentina, 16–20 November 2020; pp. 222–228. [Google Scholar] [CrossRef]
- Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7784–7793. [Google Scholar] [CrossRef]
- Koller, O.; Camgoz, N.C.; Ney, H.; Bowden, R. Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2306–2320. [Google Scholar] [CrossRef] [Green Version]
- Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Multi-Channel Transformers for Multi-Articulatory Sign Language Translation. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 301–319. [Google Scholar] [CrossRef]
- Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10023–10033. [Google Scholar] [CrossRef]
- Bragg, D.; Koller, O.; Caselli, N.; Thies, W. Exploring Collection of Sign Language Datasets: Privacy, Participation, and Model Performance. In Proceedings of the The 22nd International ACM SIGACCESS Conference on Computers and Accessibility, Online, 26–28 October 2020; pp. 1–14. [Google Scholar] [CrossRef]
- Bragg, D.; Caselli, N.; Hochgesang, J.A.; Huenerfauth, M.; Katz-Hernandez, L.; Koller, O.; Kushalnagar, R.; Vogler, C.; Ladner, R.E. The FATE Landscape of Sign Language AI Datasets: An Interdisciplinary Perspective. ACM Trans. Access. Comput. (TACCESS) 2021, 14, 1–45. [Google Scholar] [CrossRef]
- Dey, S.; Pal, A.; Chaabani, C.; Koller, O. Clean Text and Full-Body Transformer: Microsoft’s Submission to the WMT22 Shared Task on Sign Language Translation. arXiv 2022, arXiv:2210.13326. [Google Scholar] [CrossRef]
- Narayana, P.; Beveridge, R.; Draper, B.A. Gesture Recognition: Focus on the Hands. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5235–5244. [Google Scholar] [CrossRef]
- Zhu, G.; Zhang, L.; Shen, P.; Song, J. Multimodal Gesture Recognition using 3-D Convolution and Convolutional LSTM. IEEE Access 2017, 5, 4517–4524. [Google Scholar] [CrossRef]
- Abavisani, M.; Joze, H.R.V.; Patel, V.M. Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1165–1174. [Google Scholar] [CrossRef] [Green Version]
- Elboushaki, A.; Hannane, R.; Afdel, K.; Koutti, L. MultiD-CNN: A Multi-Dimensional Feature Learning Approach based on Deep Convolutional Networks for Gesture Recognition in RGB-D Image Sequences. Expert Syst. Appl. 2020, 139, 112829. [Google Scholar] [CrossRef]
- Yu, Z.; Zhou, B.; Wan, J.; Wang, P.; Chen, H.; Liu, X.; Li, S.Z.; Zhao, G. Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition. IEEE Trans. Image Process. 2021, 30, 5626–5640. [Google Scholar] [CrossRef]
- van Amsterdam, B.; Clarkson, M.J.; Stoyanov, D. Gesture Recognition in Robotic Surgery: A Review. IEEE Trans. Biomed. Eng. 2021, 68, 2021–2035. [Google Scholar] [CrossRef]
- Mujahid, A.; Awan, M.J.; Yasin, A.; Mohammed, M.A.; Damaševičius, R.; Maskeliūnas, R.; Abdulkareem, K.H. Real-Time Hand Gesture Recognition based on Deep Learning YOLOv3 Model. Appl. Sci. 2021, 11, 4164. [Google Scholar] [CrossRef]
- Qi, W.; Ovur, S.E.; Li, Z.; Marzullo, A.; Song, R. Multi-Sensor Guided Hand Gesture Recognition for a Teleoperated Robot using a Recurrent Neural Network. IEEE Robot. Autom. Lett. 2021, 6, 6039–6045. [Google Scholar] [CrossRef]
- Sluÿters, A.; Lambot, S.; Vanderdonckt, J. Hand Gesture Recognition for an Off-the-Shelf Radar by Electromagnetic Modeling and Inversion. In Proceedings of the 27th International Conference on Intelligent User Interfaces, Helsinki, Finland, 21–25 March 2022; pp. 506–522. [Google Scholar] [CrossRef]
- Hrúz, M.; Gruber, I.; Kanis, J.; Boháček, M.; Hlaváč, M.; Krňoul, Z. One Model is Not Enough: Ensembles for Isolated Sign Language Recognition. Sensors 2022, 22, 5043. [Google Scholar] [CrossRef]
- Boháček, M.; Hrúz, M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 182–191. [Google Scholar] [CrossRef]
- Amangeldy, N.; Kudubayeva, S.; Kassymova, A.; Karipzhanova, A.; Razakhova, B.; Kuralov, S. Sign Language Recognition Method based on Palm Definition Model and Multiple Classification. Sensors 2022, 22, 6621. [Google Scholar] [CrossRef]
- Ma, Y.; Xu, T.; Han, S.; Kim, K. Ensemble Learning of Multiple Deep CNNs using Accuracy-Based Weighted Voting for ASL Recognition. Appl. Sci. 2022, 12, 11766. [Google Scholar] [CrossRef]
- Boháček, M.; Hrúz, M. Learning from What is Already Out There: Few-shot Sign Language Recognition with Online Dictionaries. arXiv 2023, arXiv:2301.03769. [Google Scholar]
- Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732. [Google Scholar] [CrossRef] [Green Version]
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar] [CrossRef] [Green Version]
- Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. Hand Keypoint Detection in Single Images using Multiview Bootstrapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1145–1153. [Google Scholar] [CrossRef] [Green Version]
- Bazarevsky, V.; Kartynnik, Y.; Vakunov, A.; Raveendran, K.; Grundmann, M. BlazeFace: Sub-Millisecond Neural Face Detection on Mobile GPUs. arXiv 2019, arXiv:1907.05047. [Google Scholar]
- Kartynnik, Y.; Ablavatski, A.; Grishchenko, I.; Grundmann, M. Real-Time Facial Surface Geometry from Monocular Video on Mobile GPUs. arXiv 2019, arXiv:1907.06724. [Google Scholar]
- Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-Device Real-Time Hand Tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar]
- Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. arXiv 2020, arXiv:2006.10204. [Google Scholar]
- Joo, H.; Neverova, N.; Vedaldi, A. Exemplar Fine-Tuning for 3D Human Model Fitting Towards in-the-Wild 3D Human Pose Estimation. In Proceedings of the International Conference on 3D Vision (3DV), IEEE, London, UK, 1–3 December 2021; pp. 42–52. [Google Scholar] [CrossRef]
- Rong, Y.; Shiratori, T.; Joo, H. FrankMocap: A Monocular 3D whole-Body Pose Estimation System via Regression and Integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 1749–1759. [Google Scholar] [CrossRef]
- Ronchetti, F.; Quiroga, F.; Estrebou, C.A.; Lanzarini, L.C.; Rosete, A. LSA64: An Argentinian Sign Language Dataset. In Proceedings of the Congreso Argentino de Ciencias de la Computación (CACIC), San Luis, Argentina, 3–7 October 2016; pp. 794–803. [Google Scholar]
- Joze, H.R.V.; Koller, O. MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language. arXiv 2018, arXiv:1812.01053. [Google Scholar]
- Huang, J.; Zhou, W.; Li, H.; Li, W. Attention-based 3D-CNNs for Large-Vocabulary Sign Language Recognition. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2822–2832. [Google Scholar] [CrossRef]
- Kagirov, I.; Ivanko, D.; Ryumin, D.; Axyonov, A.; Karpov, A. TheRuSLan: Database of Russian Sign Language. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 6079–6085. [Google Scholar]
- Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-Level Deep Sign Language Recognition from Video: A New Large-Scale Dataset and Methods Comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1459–1469. [Google Scholar] [CrossRef]
- Tavella, F.; Schlegel, V.; Romeo, M.; Galata, A.; Cangelosi, A. WLASL-LEX: A Dataset for Recognising Phonological Properties in American Sign Language. arXiv 2022, arXiv:2203.06096. [Google Scholar] [CrossRef]
- Grishchenko, I.; Ablavatski, A.; Kartynnik, Y.; Raveendran, K.; Grundmann, M. Attention Mesh: High-Fidelity Face Mesh Prediction in Real-Time. In Proceedings of the CVPRW on Computer Vision for Augmented and Virtual Reality, Seattle, WA, USA, 14–19 June 2020; pp. 1–4. [Google Scholar]
- McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. Librosa: Audio and Music Signal Analysis in Python. In Proceedings of the Python in Science Conference, Austin, Texas, USA, 6–12 July 2015; pp. 18–25. [Google Scholar] [CrossRef] [Green Version]
- Liu, D.; Wang, Z.; Wang, L.; Chen, L. Multi-Modal Fusion Emotion Recognition Method of Speech Expression based on Deep Learning. Front. Neurorobotics 2021, 86, 1–13. [Google Scholar] [CrossRef]
- Zhang, L.; Zhu, G.; Shen, P.; Song, J.; Afaq Shah, S.; Bennamoun, M. Learning Spatiotemporal Features using 3DCNN and Convolutional LSTM for Gesture Recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 3120–3128. [Google Scholar] [CrossRef]
- Verkholyak, O.; Dresvyanskiy, D.; Dvoynikova, A.; Kotov, D.; Ryumina, E.; Velichko, A.; Mamontov, D.; Minker, W.; Karpov, A. Ensemble-within-Ensemble Classification for Escalation Prediction from Speech. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 481–485. [Google Scholar] [CrossRef]
- Xu, Y.; Kong, Q.; Wang, W.; Plumbley, M.D. Large-Scale Weakly Supervised Audio Classification using Gated Convolutional Neural Network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Calgary, AB, Canada, 15–20 April 2018; pp. 121–125. [Google Scholar] [CrossRef] [Green Version]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar] [CrossRef] [Green Version]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
- Dresvyanskiy, D.; Ryumina, E.; Kaya, H.; Markitantov, M.; Karpov, A.; Minker, W. End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild. Multimodal Technol. Interact. 2022, 6, 11. [Google Scholar] [CrossRef]
- Ryumina, E.; Verkholyak, O.; Karpov, A. Annotation Confidence vs. Training Sample Size: Trade-off Solution for Partially-Continuous Categorical Emotion Recognition. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 3690–3694. [Google Scholar] [CrossRef]
- Markitantov, M.; Ryumina, E.; Ryumin, D.; Karpov, A. Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 1756–1760. [Google Scholar] [CrossRef]
- Debnath, S.; Roy, P. Appearance and Shape-based Hybrid Visual Feature Extraction: Toward Audio-Visual Automatic Speech Recognition. Signal Image Video Process. 2021, 15, 25–32. [Google Scholar] [CrossRef]
- Pavlovic, V.I.; Sharma, R.; Huang, T.S. Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 677–695. [Google Scholar] [CrossRef] [Green Version]
- Vuletic, T.; Duffy, A.; Hay, L.; McTeague, C.; Campbell, G.; Grealy, M. Systematic Literature Review of Hand Gestures used in Human Computer Interaction Interfaces. Int. J. Hum.-Comput. Stud. 2019, 129, 74–94. [Google Scholar] [CrossRef] [Green Version]
- Ryumin, D. Automated Hand Detection Method for Tasks of Gesture Recognition in Human-Machine Interfaces. Sci. Tech. J. Inf. Technol. Mech. Opt. 2020, 20, 525–531. [Google Scholar] [CrossRef]
- Gruber, I.; Ryumin, D.; Hrúz, M.; Karpov, A. Sign Language Numeral Gestures Recognition using Convolutional Neural Network. In Proceedings of the International Conference on Interactive Collaborative Robotics, Leipzig, Germany, 18–22 September 2018; pp. 70–77. [Google Scholar] [CrossRef]
- Rezende, T.M.; Almeida, S.G.M.; Guimarães, F.G. Development and Validation of a Brazilian Sign Language Database for Human Gesture Recognition. Neural Comput. Appl. 2021, 33, 10449–10467. [Google Scholar] [CrossRef]
- Gavrila, D.M. The Visual Analysis of Human Movement: A Survey. Comput. Vis. Image Underst. 1999, 73, 82–98. [Google Scholar] [CrossRef] [Green Version]
- Wu, Y.; Zheng, B.; Zhao, Y. Dynamic Gesture Recognition based on LSTM-CNN. In Proceedings of the Chinese Automation Congress (CAC), IEEE, Xi’an, China, 30 November–2 December 2018; pp. 2446–2450. [Google Scholar] [CrossRef]
- Ryumin, D.; Kagirov, I.; Ivanko, D.; Axyonov, A.; Karpov, A. Automatic Detection and Recognition of 3D Manual Gestures for Human-Machine Interaction. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, XLII-2/W12, 179–183. [Google Scholar] [CrossRef] [Green Version]
- Kagirov, I.; Ryumin, D.; Axyonov, A. Method for Multimodal Recognition of One-Handed Sign Language Gestures through 3D Convolution and LSTM Neural Networks. In Proceedings of the International Conference on Speech and Computer, Istanbul, Turkey, 20–25 August 2019; pp. 191–200. [Google Scholar] [CrossRef]
- De Coster, M.; Van Herreweghe, M.; Dambre, J. Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3441–3450. [Google Scholar] [CrossRef]
- Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton aware Multi-Modal Sign Language Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3413–3423. [Google Scholar] [CrossRef]
- Innocenti, S.U.; Becattini, F.; Pernici, F.; Del Bimbo, A. Temporal Binary Representation for Event-based Action Recognition. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), IEEE, Milan, Italy, 10–15 January 2021; pp. 10426–10432. [Google Scholar] [CrossRef]
- Serengil, S.I.; Ozpinar, A. LightFace: A Hybrid Deep Face Recognition Framework. In Proceedings of the Innovations in Intelligent Systems and Applications Conference (ASYU), IEEE, Istanbul, Turkey, 15–17 October 2020; pp. 1–5. [Google Scholar] [CrossRef]
- Serengil, S.I.; Ozpinar, A. Hyperextended LightFace: A Facial Attribute Analysis Framework. In Proceedings of the International Conference on Engineering and Emerging Technologies (ICEET), IEEE, Istanbul, Turkey, 27–28 October 2021; pp. 1–4. [Google Scholar] [CrossRef]
- Axyonov, A.A.; Kagirov, I.A.; Ryumin, D.A. A Method of Multimodal Machine Sign Language Translation for Natural Human-Computer Interaction. Sci. Tech. J. Inf. Technol. Mech. Opt. 2022, 139, 585. [Google Scholar] [CrossRef]
- Axyonov, A.; Ryumin, D.; Kagirov, I. Method of Multi-Modal Video Analysis of Hand Movements For Automatic Recognition of Isolated Signs of Russian Sign Language. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, XLIV-2/W1-2021, 7–13. [Google Scholar] [CrossRef]
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical Attention Networks for Document Classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar] [CrossRef] [Green Version]
- Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In Search of a Robust Facial Expressions Recognition Model: A Large-Scale Visual Cross-Corpus Study. Neurocomputing 2022, 514, 435–450. [Google Scholar] [CrossRef]
- Axyonov, A.; Ryumin, D.; Kashevnik, A.; Ivanko, D.; Karpov, A. Method for Visual Analysis of Driver’s Face for Automatic Lip-Reading in the Wild. Comput. Opt. 2022, 46, 955–962. [Google Scholar]
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. MixUp: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
- Müller, R.; Kornblith, S.; Hinton, G.E. When Does Label Smoothing Help? Adv. Neural Inf. Process. Syst. 2019, 32, 1–10. [Google Scholar]
- Ivanko, D.; Ryumin, D.; Kashevnik, A.; Axyonov, A.; Karpov, A. Visual Speech Recognition in a Driver Assistance System. In Proceedings of the European Signal Processing Conference, IEEE, Belgrade, Serbia, 29 August–2 September 2022; pp. 1131–1135. [Google Scholar]
- Zhong, Z.; Lin, Z.Q.; Bidart, R.; Hu, X.; Daya, I.B.; Li, Z.; Zheng, W.S.; Li, J.; Wong, A. Squeeze-and-Attention Networks for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13065–13074. [Google Scholar] [CrossRef]
- Sincan, O.M.; Junior, J.; Jacques, C.; Escalera, S.; Keles, H.Y. ChaLearn LAP Large Scale Signer Independent Isolated Sign Language Recognition Challenge: Design, Results and Future Research. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3472–3481. [Google Scholar] [CrossRef]
Set | # Classes | # Samples for Each Class | # Frames |
---|---|---|---|
Train | 500 (words) | 800–1000 | 29
Val | 500 | 50 | 29
Test | 500 | 50 | 29
Characteristic | Train | Val | Test |
---|---|---|---|
Number of signers | 31 | 6 | 6 |
Number of articulate signers | 19 | 5 | 3 |
Number of gesture repetitions by one signer | 1–12 | 2–6 | 1–3 |
Average number of gesture repetitions by one signer | 4.0 | 3.3 | 2.8 |
Average gesture repetitions | 124.5 | 19.5 | 16.6 |
Number of videos | 28,142 | 4418 | 3742 |
Model | Optimizer | Learning Rate | Accuracy, % |
---|---|---|---|
Constant learning rate | | |
2DCNN+BiLSTM | Adam | 0.0001 | 83.38
2DCNN+BiLSTM | SGD | 0.001 | 83.10
3DCNN | Adam | 0.0001 | 81.41
3DCNN | SGD | 0.001 | 81.01
3DCNN+BiLSTM | Adam | 0.0001 | 83.19
3DCNN+BiLSTM | SGD | 0.001 | 82.99
Cosine annealing learning rate | | |
2DCNN+BiLSTM | Adam | 0.0001 | 85.35 *
2DCNN+BiLSTM | SGD | 0.001 | 84.63
3DCNN | Adam | 0.0001 | 83.72
3DCNN | SGD | 0.001 | 83.51
3DCNN+BiLSTM | Adam | 0.0001 | 85.12
3DCNN+BiLSTM | SGD | 0.001 | 84.39
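For illustration, a simplified PyTorch sketch of the best-performing configuration in the table above (a 2D CNN applied per frame followed by a bidirectional LSTM over the 29-frame clip) is shown below; the backbone, layer sizes, and classification head are placeholder assumptions, not the exact architecture evaluated here.

```python
import torch
import torch.nn as nn

class Lip2DCNNBiLSTM(nn.Module):
    """Simplified 2DCNN+BiLSTM lip-reading model (500 LRW word classes)."""

    def __init__(self, num_classes: int = 500, hidden: int = 256):
        super().__init__()
        # Small per-frame 2D CNN encoder (placeholder for the real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # -> (B*T, 64, 1, 1)
        )
        self.bilstm = nn.LSTM(input_size=64, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width), e.g. (B, 29, 3, 44, 44)
        b, t, c, h, w = clips.shape
        feats = self.encoder(clips.reshape(b * t, c, h, w)).reshape(b, t, 64)
        seq, _ = self.bilstm(feats)                        # (B, T, 2*hidden)
        return self.head(seq.mean(dim=1))                  # pool over time, classify

# Example: a batch of two 29-frame clips of 44 x 44 RGB mouth ROIs.
logits = Lip2DCNNBiLSTM()(torch.randn(2, 29, 3, 44, 44))
print(logits.shape)  # torch.Size([2, 500])
```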
Image Size | # Channels | Image Normalization | Accuracy, % |
---|---|---|---|
88 × 88 | 3 | Padding | 85.35
88 × 88 | 1 | Padding | 84.95
112 × 112 | 3 | Padding | 85.75
44 × 44 | 3 | Padding | 86.24
22 × 22 | 3 | Padding | 81.00
44 × 44 | 3 | Resize | 84.84
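The two normalization variants compared in the table above could be implemented roughly as follows, assuming "Padding" means padding the mouth ROI to a square before scaling and "Resize" means direct resizing; OpenCV is used here only for illustration, and this interpretation is an assumption.

```python
import cv2
import numpy as np

def normalize_roi(roi, size=44, mode="padding"):
    """Bring a mouth ROI to size x size pixels by padding or plain resizing."""
    if mode == "padding":
        h, w = roi.shape[:2]
        side = max(h, w)
        top, left = (side - h) // 2, (side - w) // 2
        roi = cv2.copyMakeBorder(roi, top, side - h - top, left, side - w - left,
                                 borderType=cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(roi, (size, size))

# Example: a 60 x 80 crop padded to a square and scaled to 44 x 44.
patch = np.zeros((60, 80, 3), dtype=np.uint8)
print(normalize_roi(patch).shape)  # (44, 44, 3)
```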
MixUp, p | Label Smoothing | Affine Transform, p | Accuracy, % |
---|---|---|---|
– | – | – | 86.24 |
20 | – | – | 86.76 |
40 | – | – | 80.47 |
– | 0.1 | – | 86.72 |
– | 0.2 | – | 86.07 |
– | – | 20 | 87.03 |
– | – | 40 | 85.72 |
20 | 0.1 | 20 | 87.19 |
Model | Optimizer | Learning Rate | Accuracy, % |
---|---|---|---|
Constant learning rate | | |
ResNet | Adam | 0.0001 | 91.19
ResNet | SGD | 0.001 | 91.86
PANN | Adam | 0.00001 | 70.88
PANN | SGD | 0.0001 | 70.44
VGG | Adam | 0.0001 | 91.15
VGG | SGD | 0.0001 | 91.44
Cosine annealing learning rate | | |
ResNet | Adam | 0.0001 | 92.04
ResNet | SGD | 0.001 | 92.24
PANN | Adam | 0.00001 | 84.84
PANN | SGD | 0.0001 | 78.46
VGG | Adam | 0.0001 | 92.08
VGG | SGD | 0.0001 | 91.86
# Mels | Step Size | Image Size | # Channels | Accuracy, % |
---|---|---|---|---|
128 | 512 | 128 × 39 | 3 | 92.24
128 | 512 | 128 × 39 | 1 | 92.77
256 | 512 | 256 × 39 | 1 | 91.77
64 | 512 | 64 × 39 | 1 | 93.77
64 | 256 | 64 × 77 | 1 | 94.45
64 | 128 | 64 × 153 | 1 | 94.58
64 | 64 | 64 × 305 | 1 | 95.36
64 | 32 | 64 × 609 | 1 | 95.35
32 | 64 | 32 × 305 | 1 | 94.79
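A minimal Librosa sketch of extracting the best-performing log-mel representation from the table above (64 mel filters, step size 64) follows; the sampling rate, FFT size, and dB scaling are assumptions for illustration.

```python
import numpy as np
import librosa

def log_mel_spectrogram(wav_path, sr=16000, n_mels=64, hop_length=64):
    """Log-mel spectrogram roughly matching the best 64-mel / step-64 setting."""
    y, sr = librosa.load(wav_path, sr=sr)                  # mono waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         n_mels=n_mels, hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)            # shape: (n_mels, frames)
```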
MixUp, p | Label Smoothing | SpecAugment, p | Accuracy, % |
---|---|---|---|
– | – | – | 95.36 |
20 | – | – | 95.59 |
40 | – | – | 95.04 |
– | 0.1 | – | 95.86 |
– | 0.2 | – | 95.68 |
– | – | time mask (20) | 95.84 |
– | – | freq mask (20) | 95.35 |
20 | 0.1 | time mask (20) | 96.07 |
SysID | Method | Fusion | Accuracy, % |
---|---|---|---|
1 | 2DCNN + BiLSTM | – | 87.16 |
2 | ResNet | – | 96.07 |
3 | SysID 1 & 2 | Prediction-level | 96.87 |
4 | SysID 1 & 2 | Feature-level | 98.44 |
5 | SysID 1 & 2 | Model-level | 98.76 |
– | E2E AVSR [50] | Model-level | 98.00 |
– | PBL AVSR [1] | Model-level | 98.30 |
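Of the three fusion strategies compared above, prediction-level fusion is the simplest; the sketch below illustrates it as a weighted average of per-modality class probabilities. The weights are illustrative placeholders and are not taken from the experiments; the feature- and model-level variants are not shown.

```python
import numpy as np

def prediction_level_fusion(p_video, p_audio, w_video=0.3, w_audio=0.7):
    """Weighted average of per-modality class probabilities (illustrative weights)."""
    fused = w_video * np.asarray(p_video) + w_audio * np.asarray(p_audio)
    return fused / fused.sum(axis=-1, keepdims=True)       # renormalize

# Example with dummy softmax outputs over 500 word classes.
p_v = np.random.dirichlet(np.ones(500))
p_a = np.random.dirichlet(np.ones(500))
predicted_word = int(prediction_level_fusion(p_v, p_a).argmax())
```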
DRT \ # Components | 2 | 5 | 10 | 15 |
---|---|---|---|---|
PCA | 87.52 | 89.60 | 94.95 | 95.54
LDA | 88.91 | 92.65 | 97.19 | 96.82
t-SNE | 90.78 | – | – | –
DRT \ # Components | 2 | 5 | 10 | 15 |
---|---|---|---|---|
For the entire test set | | | |
PCA | 98.16 | 98.40 | 98.45 | 98.48
LDA | 98.21 | 98.56 | 98.48 | 98.32
t-SNE | 98.37 | – | – | –
For articulating speakers | | | |
PCA | 98.98 | 99.28 | 99.44 | 99.54
LDA | 99.52 | 99.59 | 99.48 | 99.23
t-SNE | 99.34 | – | – | –
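A minimal scikit-learn sketch of the three dimensionality reduction techniques compared in the two tables above (PCA, LDA, and t-SNE applied to extracted features before classification); the feature dimensionality, sample counts, and class counts are placeholders. t-SNE appears only in the two-component column because it is an embedding method typically limited to two or three output dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

# Dummy data: 1000 samples of 512-dimensional features, 20 classes.
X = np.random.randn(1000, 512)
y = np.random.randint(0, 20, size=1000)

X_pca = PCA(n_components=15).fit_transform(X)                            # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=15).fit_transform(X, y)  # supervised
X_tsne = TSNE(n_components=2).fit_transform(X)                           # 2D embedding
```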
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).