Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review
Abstract
1. Introduction
2. Biosignal-Based Speech Recognition
2.1. Muscle Activity
EMG-Based Speech Recognition
2.2. Brain Activity
EEG-Based Speech Recognition
2.3. Articulatory Activity
2.3.1. Correcting Motion Interface
2.3.2. EPG-Based Speech Recognition
2.3.3. Magnetic Articulography-Based Speech Recognition
Electromagnetic Articulography (EMA)-Based Speech Recognition
Permanent-Magnetic Articulography (PMA)-Based Speech Recognition
2.4. Application of Device Interface
2.4.1. Interface Technologies by Tongue and Mouth Sensors
2.4.2. Internal Implantable Devices
2.4.3. Language Interfaces
3. Deep Learning-Based Voice Recognition
3.1. Visual Speech Recognition
3.1.1. LipNet
3.1.2. Automatic Lip-Reading System
3.2. Silent Speech Interface
3.2.1. Articulation-to-Speech Synthesis
3.2.2. Sensors-AI Integration
3.2.3. Deep Learning in Silent Speech Recognition
4. Challenges and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
- Voice Disorders: Overview. Available online: https://www.asha.org/practice-portal/clinical-topics/voice-disorders/ (accessed on 29 October 2019).
- Cheah, L.A.; Gilbert, J.M.; Gonzalez, J.A.; Bai, J.; Ell, S.R.; Green, P.D.; Moore, R.K. Towards an Intraoral-Based Silent Speech Restoration System for Post-laryngectomy Voice Replacement. In Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies, Rome, Italy, 21–23 February 2016. [Google Scholar]
- Shin, Y.H.; Seo, J. Towards contactless silent speech recognition based on detection of active and visible articulators using IR-UWB radar. Sensors 2016, 16, 1812. [Google Scholar] [CrossRef] [Green Version]
- Sharpe, G.; Camoes Costa, V.; Doubé, W.; Sita, J.; McCarthy, C.; Carding, P. Communication changes with laryngectomy and impact on quality of life: A review. Qual. Life Res. 2019, 28, 863–877. [Google Scholar] [CrossRef]
- Li, W. Silent speech interface design methodology and case study. Chin. J. Electron. 2016, 25, 88–92. [Google Scholar] [CrossRef]
- Denby, B.; Schultz, T.; Honda, K.; Hueber, T.; Gilbert, J.M.; Brumberg, J.S. Silent speech interfaces. Speech Commun. 2010, 52, 270–287. [Google Scholar] [CrossRef] [Green Version]
- Ji, Y.; Liu, L.; Wang, H.; Liu, Z.; Niu, Z.; Denby, B. Updating the Silent Speech Challenge benchmark with deep learning. Speech Commun. 2018, 98, 42–50. [Google Scholar] [CrossRef] [Green Version]
- Meltzner, G.S.; Heaton, J.T.; Deng, Y.; De Luca, G.; Roy, S.H.; Kline, J.C. Silent speech recognition as an alternative communication device for persons with laryngectomy. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2386–2398. [Google Scholar] [CrossRef]
- Schultz, T.; Wand, M.; Hueber, T.; Krusienski, D.J.; Herff, C.; Brumberg, J.S. Biosignal-Based Spoken Communication: A Survey. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2257–2271. [Google Scholar] [CrossRef]
- Meltzner, G.S.; Heaton, J.T.; Deng, Y.; De Luca, G.; Roy, S.H.; Kline, J.C. Development of sEMG sensors and algorithms for silent speech recognition. J. Neural Eng. 2018, 15, 1–12. [Google Scholar] [CrossRef] [PubMed]
- Bi, L.; Feleke, A.; Guan, C. A review on EMG-based motor intention prediction of continuous human upper limb motion for human-robot collaboration. Biomed. Signal Process. Control 2019, 51, 113–127. [Google Scholar] [CrossRef]
- Levis, J.; Suvorov, R. Automatic speech recognition. In The Encyclopedia of Applied Linguistics; Springer: Berlin, Germany, 2012. [Google Scholar]
- Burileanu, D. Spoken language interfaces for embedded applications. In Human Factors and Voice Interactive Systems; Springer: Boston, MA, USA, 2008; pp. 135–161. [Google Scholar]
- Koenecke, A.; Nam, A.; Lake, E.; Nudell, J.; Quartey, M.; Mengesha, Z.; Toups, C.; Rickford, J.R.; Jurafsky, D.; Goel, S. Racial disparities in automated speech recognition. Proc. Natl. Acad. Sci. USA 2020, 117, 7684–7689. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Janke, M.; Wand, M.; Schultz, T. A Spectral Mapping Method for EMG-Based Recognition of Silent Speech. Available online: https://www.scitepress.org/papers/2010/28141/28141.pdf (accessed on 13 February 2021).
- Diener, L.; Schultz, T. Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion. Available online: https://www.csl.uni-bremen.de/cms/images/documents/publications/IS2018_EMG_Realtime.pdf (accessed on 13 February 2021).
- Liu, H.; Dong, W.; Li, Y.; Li, F.; Geng, J.; Zhu, M.; Chen, T.; Zhang, H.; Sun, L.; Lee, C. An epidermal sEMG tattoo-like patch as a new human–machine interface for patients with loss of voice. Microsyst. Nanoeng. 2020, 6, 1–13. [Google Scholar] [CrossRef] [Green Version]
- Rapin, L.; Dohen, M.; Polosan, M.; Perrier, P.; Loevenbruck, H. An EMG study of the lip muscles during covert auditory verbal hallucinations in schizophrenia. J. Speech Lang. Hear. Res. 2013. [Google Scholar] [CrossRef]
- Janke, M.; Diener, L. EMG-to-speech: Direct generation of speech from facial electromyographic signals. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2375–2385. [Google Scholar] [CrossRef] [Green Version]
- Jong, N.S.; Phukpattaranont, P. A speech recognition system based on electromyography for the rehabilitation of dysarthric patients: A Thai syllable study. Biocybern. Biomed. Eng. 2019, 39, 234–245. [Google Scholar] [CrossRef]
- Sugie, N.; Tsunoda, K. A Speech Prosthesis Employing a Speech Synthesizer—Vowel Discrimination from Perioral Muscle Activities and Vowel Production. IEEE Trans. Biomed. Eng. 1985, BME-32, 485–490. [Google Scholar] [CrossRef] [PubMed]
- Schultz, T.; Wand, M. Modeling coarticulation in EMG-based continuous speech recognition. Speech Commun. 2010, 52, 341–353. [Google Scholar] [CrossRef] [Green Version]
- Lee, W.; Kim, D.; Shim, B.; Park, S.; Yu, C.; Ryu, J. Survey on Mouth Interface for Voice Reproduction and Volitional Control. J. Inf. Technol. Archit. 2015, 12, 171–181. [Google Scholar]
- Srisuwan, N.; Wand, M.; Janke, M.; Phukpattaranont, P.; Schultz, T.; Limsakul, C. Enhancement of EMG-based Thai number words classification using frame-based time domain features with stacking filter. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, Siem Reap, Cambodia, 9–12 December 2014. [Google Scholar]
- Gaddy, D.; Klein, D. Digital Voicing of Silent Speech. arXiv 2020, arXiv:2010.02960. Available online: https://arxiv.org/abs/2010.02960 (accessed on 6 October 2020).
- Debry, C.; Dupret-Bories, A.; Vrana, N.E.; Hemar, P.; Lavalle, P.; Schultz, P. Laryngeal replacement with an artificial larynx after total laryngectomy: The possibility of restoring larynx functionality in the future. Head Neck 2014, 36, 1669–1673. [Google Scholar] [CrossRef]
- Pinheiro, A.P.; Schwartze, M.; Kotz, S.A. Voice-selective prediction alterations in nonclinical voice hearers. Sci. Rep. 2018, 8, 14717. [Google Scholar] [CrossRef] [Green Version]
- Fiedler, L.; Wöstmann, M.; Graversen, C.; Brandmeyer, A.; Lunner, T.; Obleser, J. Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech. J. Neural Eng. 2017, 14, 036020. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Lee, A.; Gibbon, F.E.; Kearney, E.; Murphy, D. Tongue-palate contact during selected vowels in children with speech sound disorders. Int. J. Speech Lang. Pathol. 2014, 16, 562–570. [Google Scholar] [CrossRef] [PubMed]
- Gibbon, F.E. Abnormal patterns of tongue-palate contact in the speech of individuals with cleft palate. Clin. Linguist. Phon. 2004, 18, 285–311. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Dromey, C.; Sanders, M. Intra-speaker variability in palatometric measures of consonant articulation. J. Commun. Disord. 2009, 42, 397–407. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Mantie-Kozlowski, A.; Pitt, K. Treating myofunctional disorders: A multiple-baseline study of a new treatment using electropalatography. Am. J. Speech-Language Pathol. 2014. [Google Scholar] [CrossRef]
- Park, H.; Ghovanloo, M. An arch-shaped intraoral tongue drive system with built-in tongue-computer interfacing SoC. Sensors 2014, 14, 21565–21587. [Google Scholar] [CrossRef] [Green Version]
- Huo, X.; Wang, J.; Ghovanloo, M. A magneto-inductive sensor based wireless tongue-computer interface. IEEE Trans. Neural Syst. Rehabil. Eng. 2008, 16, 497–504. [Google Scholar]
- Sebkhi, N.; Yunusova, Y.; Ghovanloo, M. Towards Phoneme Landmarks Identification for American-English using a Multimodal Speech Capture System. In Proceedings of the 2018 IEEE Biomedical Circuits and Systems Conference, BioCAS 2018–Proceedings, Cleveland, OH, USA, 17–19 October 2018. [Google Scholar]
- Chan, A.D.; Englehart, K.; Hudgins, B.; Lovely, D.F. Myo-electric signals to augment speech recognition. Med. Biol. Eng. Comp. 2001, 39, 500–504. [Google Scholar] [CrossRef]
- Manabe, H.; Hiraiwa, A.; Sugimura, T. Unvoice Speech Recognition Using EMG-mime Speech Recognition. In Proceedings of the CHI’03 Extended Abstracts on Human Factors in Computing Systems, Ft. Lauderdale, FL, USA, 5–10 April 2003. [Google Scholar]
- Maier-Hein, L.; Metze, F.; Schultz, T.; Waibel, A. Session independent non-audible speech recognition using surface electromyography. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Cancun, Mexico, 27 November–1 December 2005. [Google Scholar]
- Manoni, L.; Turchetti, C.; Falaschetti, L.; Crippa, P. A Comparative Study of Computational Methods for Compressed Sensing Reconstruction of EMG Signal. Sensors 2019, 19, 3531. [Google Scholar] [CrossRef] [Green Version]
- Donchin, E.; Spencer, K.M.; Wijesinghe, R. The mental prosthesis: Assessing the speed of a P300-based brain- computer interface. IEEE Trans. Rehabil. Eng. 2000, 8, 174–179. [Google Scholar] [CrossRef] [Green Version]
- Millán, J.D.R.; Rupp, R.; Müller-Putz, G.R.; Murray-Smith, R.; Giugliemma, C.; Tangermann, M.; Vidaurre, C.; Cincotti, F.; Kübler, A.; Leeb, R.; et al. Combining brain-computer interfaces and assistive technologies: State-of-the-art and challenges. Front. Neurosci. 2010. [Google Scholar] [CrossRef] [PubMed]
- Poulos, M.; Rangoussi, M.; Alexandris, N.; Evangelou, A. On the use of EEG features towards person identification via neural networks. Med. Inform. Internet Med. 2001, 26, 35–48. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Helmstaedter, C.; Kurthen, M.; Linke, D.B.; Elger, C.E. Patterns of language dominance in focal left and right hemisphere epilepsies: Relation to MRI findings, EEG, sex, and age at onset of epilepsy. Brain Cogn. 1997, 33, 135–150. [Google Scholar] [CrossRef] [Green Version]
- Harle, R. A survey of indoor inertial positioning systems for pedestrians. IEEE Commun. Surv. Tutor. 2013, 15, 1281–1293. [Google Scholar] [CrossRef]
- Lane, N.D.; Miluzzo, E.; Lu, H.; Peebles, D.; Choudhury, T.; Campbell, A.T. A survey of mobile phone sensing. IEEE Commun. Mag. 2010, 48, 140–150. [Google Scholar] [CrossRef]
- Wrench, A.A. Advances in EPG palate design. Adv. Speech Lang. Pathol. 2007, 9, 3–12. [Google Scholar] [CrossRef]
- Gilbert, J.M.; Rybchenko, S.I.; Hofe, R.; Ell, S.R.; Fagan, M.J.; Moore, R.K.; Green, P. Isolated word recognition of silent speech using magnetic implants and sensors. Med. Eng. Phys. 2010, 32, 1189–1197. [Google Scholar] [CrossRef] [PubMed]
- Ono, T.; Hori, K.; Masuda, Y.; Hayashi, T. Recent advances in sensing oropharyngeal swallowing function in Japan. Sensors 2010, 10, 176–202. [Google Scholar] [CrossRef] [Green Version]
- Hofe, R.; Ell, S.R.; Fagan, M.J.; Gilbert, J.M.; Green, P.D.; Moore, R.K.; Rybchenko, S.I. Small-vocabulary speech recognition using a silent speech interface based on magnetic sensing. Speech Commun. 2013, 55, 22–32. [Google Scholar] [CrossRef]
- Heracleous, P.; Badin, P.; Bailly, G.; Hagita, N. A pilot study on augmented speech communication based on Electro-Magnetic Articulography. Pattern Recognit. Lett. 2011, 32, 1119–1125. [Google Scholar] [CrossRef]
- Van Wassenhove, V. Speech through ears and eyes: Interfacing the senses with the supramodal brain. Front. Psychol. 2013. [Google Scholar] [CrossRef] [Green Version]
- Lobo-Prat, J.; Kooren, P.N.; Stienen, A.H.; Herder, J.L.; Koopman, B.F.J.M.; Veltink, P.H. Non-invasive control interfaces for intention detection in active movement-assistive devices. J. Neuroeng. Rehabil. 2014, 11, 168. [Google Scholar] [CrossRef] [Green Version]
- Rosso, P.; Hurtado, L.F.; Segarra, E.; Sanchis, E. On the voice-activated question answering. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2012, 42, 75–85. [Google Scholar] [CrossRef] [Green Version]
- Poncela, A.; Gallardo-Estrella, L. Command-based voice teleoperation of a mobile robot via a human-robot interface. Robotica 2015, 33, 1–18. [Google Scholar] [CrossRef] [Green Version]
- Hwang, S.; Jin, Y.G.; Shin, J.W. Dual Microphone Voice Activity Detection Based on Reliable Spatial Cues. Sensors 2019, 19, 3056. [Google Scholar] [CrossRef] [Green Version]
- Prasad, R.; Saruwatari, H.; Shikano, K. Robots that can hear, understand and talk. Adv. Robot. 2004, 18, 533–564. [Google Scholar] [CrossRef]
- Maas, A.L.; Qi, P.; Xie, Z.; Hannun, A.Y.; Lengerich, C.T.; Jurafsky, D.; Ng, A.Y. Building DNN acoustic models for large vocabulary speech recognition. Comput. Speech Lang. 2016, 41, 195–213. [Google Scholar] [CrossRef] [Green Version]
- Ravanelli, M.; Omologo, M. Contaminated speech training methods for robust DNN-HMM distant speech recognition. arXiv 2017, arXiv:1710.03538. [Google Scholar]
- Zeyer, A.; Irie, K.; Schlüter, R.; Ney, H. Improved training of end-to-end attention models for speech recognition. arXiv 2018, arXiv:1805.03294. [Google Scholar]
- Hori, T.; Cho, J.; Watanabe, S. End-to-end speech recognition with word-based RNN language models. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018. [Google Scholar]
- Sak, H.; Senior, A.; Rao, K.; Beaufays, F. Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv 2015, arXiv:1507.06947. [Google Scholar]
- Takahashi, N.; Gygli, M.; Van Gool, L. Aenet: Learning deep audio features for video analysis. IEEE Trans. Multimedia 2017, 20, 513–524. [Google Scholar] [CrossRef] [Green Version]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. Int. Conf. Mach. Learn. 2016, 48, 173–182. [Google Scholar]
- Assael, Y.M.; Shillingford, B.; Whiteson, S.; De Freitas, N. Lipnet: End-to-end sentence-level lipreading. arXiv 2016, arXiv:1611.01599. [Google Scholar]
- Ephrat, A.; Peleg, S. Vid2speech: Speech reconstruction from silent video. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
- Chen, Y.C.; Yang, Z.; Yeh, C.F.; Jain, M.; Seltzer, M.L. AIPNet: Generative Adversarial Pre-training of Accent-invariant Networks for End-to-end Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020. [Google Scholar]
- Biadsy, F.; Weiss, R.J.; Moreno, P.J.; Kanevsky, D.; Jia, Y. Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation. arXiv 2019, arXiv:1904.04169. [Google Scholar]
- Sun, C.; Yang, Y.; Wen, C.; Xie, K.; Wen, F. Voiceprint identification for limited dataset using the deep migration hybrid model based on transfer learning. Sensors 2018, 18, 2399. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wang, D.; Brown, G.J. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications; Wiley-IEEE Press: Piscataway, NJ, USA, 2006. [Google Scholar]
- Xu, R.; Ren, Z.; Dai, W.; Lao, D.; Kwan, C. Multimodal speech enhancement in noisy environment. In Proceedings of the 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, China, 20–22 October 2004. [Google Scholar]
- Kamath, S.; Loizou, P. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002. [Google Scholar]
- Reddy, A.M.; Raj, B. Soft mask methods for single-channel speaker separation. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1766–1776. [Google Scholar] [CrossRef]
- Scalart, P.; Filho, J.V. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Atlanta, GA, USA, 9 May 1996. [Google Scholar]
- Lim, J.S.; Oppenheim, A.V. Enhancement and Bandwidth Compression of Noisy Speech. Proc. IEEE 1979, 67, 1586–1604. [Google Scholar] [CrossRef]
- De Almeida, F.L.; Rosa, R.L.; Rodriguez, D.Z. Voice quality assessment in communication services using deep learning. In Proceedings of the 15th International Symposium on Wireless Communication Systems (ISWCS), Lisbon, Portugal, 28–31 August 2018. [Google Scholar]
- Gosztolya, G.; Pintér, Á.; Tóth, L.; Grósz, T.; Markó, A.; Csapó, T.G. Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces. In Proceedings of the IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019. [Google Scholar]
- Cao, B.; Kim, M.J.; van Santen, J.P.; Mau, T.; Wang, J. Integrating Articulatory Information in Deep Learning-Based Text-to-Speech Synthesis. INTERSPEECH 2017, 254–258. [Google Scholar] [CrossRef] [Green Version]
- Cieri, C.; Miller, D.; Walker, K. The fisher corpus: A resource for the next generations of speech-to-text. LREC 2004, 4, 69–71. [Google Scholar]
- Gretter, R. Euronews: A multilingual speech corpus for ASR. LREC 2014, 2635–2638. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1083.2378&rep=rep1&type=pdf (accessed on 1 May 2014).
- Angelini, B.; Brugnara, F.; Falavigna, D.; Giuliani, D.; Gretter, R.; Omologo, M. Speaker independent continuous speech recognition using an acoustic-phonetic Italian corpus. In Proceedings of the Third International Conference on Spoken Language Processing, Yokohama, Japan, 18–22 September 1994; pp. 1391–1394. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015. [Google Scholar]
- Linguistic Data Consortium, CSR-II (wsj1) Complete. Available online: https://doi.org/10.35111/q7sb-vv12 (accessed on 2 July 1994). [CrossRef]
- Garofalo, J.; Graff, D.; Paul, D.; Pallett, D. CSR-I (wsj0) Complete. Available online: https://doi.org/10.35111/ewkm-cg47 (accessed on 30 May 2007).
- Kingsbury, B. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009. [Google Scholar]
- Font, F.; Roma, G.; Serra, X. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, Barcelona, Spain, 21–25 October 2013. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Cooke, M.; Barker, J.; Cunningham, S.; Shao, X. An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 2006, 120, 2421–2424. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- NIST Multimodal Information Group. 2008 NIST Speaker Recognition Evaluation Training Set Part 1. Available online: https://doi.org/10.35111/pr4h-n676 (accessed on 15 August 2011).
- DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. Available online: https://catalog.ldc.upenn.edu/ldc93s1 (accessed on 25 December 2017).
- Lu, Y.; Li, H. Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory. Appl. Sci. 2019, 9, 1599. [Google Scholar] [CrossRef] [Green Version]
- Akbari, H.; Arora, H.; Cao, L.; Mesgarani, N. Lip2audspec: Speech reconstruction from silent lip movements video. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
- Li, X.; Kwan, C. Geometrical feature extraction for robust speech recognition. In Proceedings of the IEEE International Conference on Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 30 October–2 November 2005. [Google Scholar]
- Fernandez-Lopez, A.; Sukno, F.M. Survey on automatic lip-reading in the era of deep learning. Image Vis. Comput. 2018, 78, 53–72. [Google Scholar] [CrossRef]
- Hao, M.; Mamut, M.; Yadikar, N.; Aysa, A.; Ubul, K. A Survey of Research on Lipreading Technology. IEEE Access 2020, 8, 204518–204544. [Google Scholar] [CrossRef]
- Fernandez-Lopez, A.; Martinez, O.; Sukno, F.M. Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, Washington, DC, USA, 30 May–3 June 2017. [Google Scholar]
- Eom, C.S.H.; Lee, C.C.; Lee, W.; Leung, C.K. Effective privacy preserving data publishing by vectorization. Inform. Sci. 2019, 527, 311–328. [Google Scholar] [CrossRef]
- Wang, J.; Hahm, S. Speaker-independent silent speech recognition with across-speaker articulatory normalization and speaker adaptive training. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015. Available online: https://www.isca-speech.org/archive/interspeech_2015/i15_2415.html (accessed on 15 August 2020).
- Gonzalez-Lopez, J.A.; Gomez-Alanis, A.; Martín-Doñas, J.M.; Pérez-Córdoba, J.L.; Gomez, A.M. Silent Speech Interfaces for Speech Restoration: A Review. IEEE Access 2020, 8, 177995–178021. [Google Scholar] [CrossRef]
- Kapur, A.; Kapur, S.; Maes, P. Alterego: A personalized wearable silent speech interface. In Proceedings of the 2018 International Conference Intelligent User Interfaces, Tokyo, Japan, 7–11 March 2018. [Google Scholar]
- Kimura, N.; Kono, M.; Rekimoto, J. SottoVoce: An ultrasound imaging-based silent speech interaction using deep neural networks. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 4–9 May 2019. [Google Scholar]
- Sebkhi, N.; Sahadat, N.; Hersek, S.; Bhavsar, A.; Siahpoushan, S.; Ghovanloo, M.; Inan, O.T. A deep neural network-based permanent magnet localization for tongue tracking. IEEE Sens. J. 2019, 19, 9324–9331. [CrossRef]
- Kim, M.; Sebkhi, N.; Cao, B.; Ghovanloo, M.; Wang, J. Preliminary test of a wireless magnetic tongue tracking system for silent speech interface. In Proceedings of the Biomedical Circuits and Systems Conference (BioCAS), Cleveland, OH, USA, 17–19 October 2018. [Google Scholar]
- Csapó, T.G.; Al-Radhi, M.S.; Németh, G.; Gosztolya, G.; Grósz, T.; Tóth, L.; Markó, A. Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder. arXiv 2019, arXiv:1906.09885. Available online: https://arxiv.org/abs/1906.09885 (accessed on 15 August 2020).
- Cao, B.; Kim, M.J.; Wang, J.R.; van Santen, J.P.; Mau, T.; Wang, J. Articulation-to-Speech Synthesis Using Articulatory Flesh Point Sensors’ Orientation Information. In Proceedings of INTERSPEECH 2018. Available online: https://www.researchgate.net/profile/Jun_Wang121/publication/327350739_Articulation-to-Speech_Synthesis_Using_Articula-tory_Flesh_Point_Sensors’_Orientation_Information/links/5b89a729299bf1d5a735a574/Articulation-to-Speech-Synthesis-Using-Articulatory-Flesh-Point-Sensors-Orientation-Information.pdf (accessed on 15 August 2020).
- Baddeley, A.; Eldridge, M.; Lewis, V. The role of subvocalisation in reading. Q. J. Exp. Psychol. 1981, 33, 439–454. [Google Scholar] [CrossRef]
- Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.W.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Boles, A.; Rad, P. Voice biometrics: Deep learning-based voiceprint authentication system. In Proceedings of the IEEE System of Systems Engineering Conference, Waikoloa, HI, USA, 18–21 June 2017. [Google Scholar]
- Wang, J.; Samal, A.; Green, J.R. Across-speaker articulatory normalization for speaker-independent silent speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014. Available online: https://www.isca-speech.org/archive/interspeech_2014/i14_1179.html (accessed on 15 August 2020).
- Hahm, S.; Wang, J.; Friedman, J. Silent speech recognition from articulatory movements using deep neural network. Int. Congr. Phon. Sci. 2015, 1–5. Available online: http://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0524.pdf (accessed on 15 August 2020).
- Kim, M.; Cao, B.; Mau, T.; Wang, J. Speaker-independent silent speech recognition from flesh-point articulatory movements using an LSTM neural network. IEEE/ACM Trans. Audio Speech Lang. Process. 2017. Available online: https://ieeexplore.ieee.org/abstract/document/8114350 (accessed on 15 August 2020).
- Beigi, H. Speaker recognition: Advancements and challenges. In New Trends and Developments in Biometrics; InTech: London, UK, 2012. [Google Scholar] [CrossRef] [Green Version]
- Kim, M.J.; Cao, B.; Mau, T.; Wang, J. Multiview Representation Learning via Deep CCA for Silent Speech Recognition. INTERSPEECH 2017, 7, 2769–2773. [Google Scholar]
- Patil, P.; Gujarathi, G.; Sonawane, G. Different Approaches for Artifact Removal in Electromyography based Silent Speech Interface. Int. J. Sci. Eng. Technol. 2016, 5. Available online: http://ijsetr.org/wp-content/uploads/2016/01/IJSETR-VOL-5-ISSUE-1-282-285.pdf (accessed on 15 August 2020).
- Yates, A.J. Delayed auditory feedback. Psychol. Bull. 1963, 60, 213–232. [Google Scholar] [CrossRef] [PubMed]
- Jou, S.C.; Schultz, T.; Walliczek, M.; Kraft, F.; Waibel, A. Towards continuous speech recognition using surface electromyography. Int. Conf. Spok. Lang. Process. 2006, 573–576. Available online: https://www.isca-speech.org/archive/interspeech_2006/i06_1592.html (accessed on 15 August 2020).
Application | Organs | References | ||||||
---|---|---|---|---|---|---|---|---|
Oral Cavity | Muscle | Brain | ||||||
Tongue | Palate | Lip | Larynx | Face | Neck | |||
EMG | √ | [15,16,17,18] | ||||||
√ | √ | [6,9,10,15,19,20,21,22,23,24,25] | ||||||
√ | [26] | |||||||
EEG | √ | [6,8,9,27,28] | ||||||
EGG | √ | [6,9] | ||||||
EPG | √ | √ | [9,23,29,30,31,32] | |||||
TDS | √ | √ | [33,34] | |||||
MSCS * | √ | √ | [35] |
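Several of the EMG-based systems classified above band-pass filter the raw surface EMG and then extract frame-based time-domain features (e.g., RMS) before classification or synthesis. The Python sketch below illustrates this kind of preprocessing for a single channel; the 20–450 Hz band, 2 kHz sampling rate, and 25 ms/10 ms framing are typical values assumed for illustration, not parameters taken from any cited study.

```python
# Minimal sketch of frame-based time-domain feature extraction for one surface-EMG channel.
import numpy as np
from scipy.signal import butter, filtfilt

def emg_frame_features(signal: np.ndarray, fs: int = 2000,
                       frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    # Band-pass filter to keep the usual surface-EMG frequency band (assumed 20-450 Hz).
    b, a = butter(4, [20 / (fs / 2), 450 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signal)

    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    feats = []
    for start in range(0, len(filtered) - frame + 1, hop):
        w = filtered[start:start + frame]
        feats.append([
            np.sqrt(np.mean(w ** 2)),    # root mean square (RMS)
            np.mean(np.abs(w)),          # mean absolute value (MAV)
            np.sum(np.abs(np.diff(w))),  # waveform length (WL)
        ])
    return np.asarray(feats)             # shape: (n_frames, 3)

# Example: 2 s of synthetic data at 2 kHz yields a (198, 3) feature matrix.
features = emg_frame_features(np.random.randn(4000))
```

In a full recognizer, per-frame features from multiple channels would be stacked and passed to the classifier or sequence model (HMM, DNN, or LSTM) summarized in the next table.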
Name | Model | Dataset | Result | Ref. |
---|---|---|---|---|
Very large DNN models | DNN | 2100 h training corpus combining Switchboard and Fisher [78] | - In ASR experiments with up to 400M parameters and 7 hidden layers, configurations using the NAG optimizer [79] and 3 to 5 hidden layers performed best. | [57] |
Robust DNN-HMM | DNN | - Euronews database [79] - APASCI [80] | - The approach based on an asymmetric context window, close-talk supervision, and a supervised close-talk pre-training showed more than 15% performance improvement over the baseline system for contaminated voice training. | [58] |
Encoder-Decoder-Attention model | LSTM | - Switchboard 300 h - LibriSpeech 1000 h [81] | Comparing WER: - Switchboard achieved competitive results with existing end-to-end models. - LibriSpeech achieved a WER of 3.54% on the dev-clean and 3.82% on the test-clean subsets, the best performance. | [59] |
Look-ahead LM *** | RNN | - Wall Street Journal (WSJ) [82,83] - LibriSpeech [81] | - Compared with other end-to-end systems, 5.1% WER for WSJ eval92 and 8.4% WER for WSJ dev93. - When comparing WER with other language models, the model obtained consistent error reduction as the vocabulary size increased. | [60] |
LSTM RNN acoustic models | LSTM RNN | 3 million utterances with an average duration of about 4 s, taken from real 16 kHz Google voice search traffic | - Models using the state-level minimum Bayes risk sequence discriminative training criterion [84] have achieved continuous WER improvement. | [61] |
AENet | CNN | - Freesound [85] to create a novel audio event classification database - UCF101 dataset [86] to evaluate the AENet features | - By recognizing a wide variety of audio events, audio event detection improved by 16% and video highlight detection by more than 8% compared to commonly used audio features. | [62] |
Deep Speech2 | RNN CNN | - English: 11,940 h of speech 8 million utterances - Mandarin: 9400 h of speech 11 million utterances | - English: 2 layers of 2D CNN, 3 layers of unidirectional RNN. - Mandarin: 9 layers (7 RNN layers with 2D convolution and BatchNorm). | [63] |
LipNet | CNN GRU | GRID corpus [87] | - Sentence-level accuracy on the GRID dataset is 95.2%. | [64] |
Vid2speech | CNN | GRID corpus | - An audio-visual test using Amazon MTurk showed word intelligibility of 79%. | [65] |
AIPNet | GAN * LSTM | 9 English accents with 4M utterances (3.8K h) from crowd-sourced workers | - Supervised setting achieved 2.3~4.5% relative WER reduction with LASR. - Semi-supervised setting achieved 3.4~11.3% WER reduction. | [66] |
Parrotron | LSTM CNN | 30,000 h training set 24 million English utterances | WER of 32.7% from a deaf speaker with nonsense words | [67] |
TLCNN **-RBM | CNN RBM | NIST 2008 SRE dataset [88], self-built speech database, TIMIT dataset [89] | - CNN with FBN reduces training time by 48.04% compared to CNN without FBN. - Average accuracy of about 97%, higher than using the CNN or TL-CNN network alone. | [68] |
Lip reading model | CNN LSTM | audio-visual database of 10 independent English digit utterances | - The accuracy is 88.2% on the test dataset. | [90] |
Lip2Audspec | CNN LSTM | GRID corpus [87] | - The average accuracy of 20 workers is 55.8%. | [91] |
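Nearly all results in the table above are reported as word error rate (WER): the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. As a generic illustration only, not the evaluation code of any cited system, a minimal Python implementation looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum word edits needed, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lamp on", "turn lamp on"))  # 0.25: one deletion out of four reference words
```

Because WER is normalized by the reference length, values above 100% are possible when the hypothesis inserts many extra words.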
Model | Method | Data | Result | Ref. |
---|---|---|---|---|
DNN-HMM to reduce the WER compared to the GMM-HMM approach used in SSC * | Ultrasonic probe (tongue), Infrared-illuminated video camera (lip) | - SSC data recorded without any vocalization - 320 × 240 pixel tongue images and 640 × 480 pixel lip images in black and white | WER of 6.4% is obtained, which is lower than the published benchmark [112] value of 17.4%. | [7] |
Voicing Silent Speech | EMG | 20 h of facial EMG signals from a single speaker collected during both silent and vocalized speech | WER of 3.6% from closed-vocabulary data condition and 68% from the open vocabulary condition. | [25] |
AlterEgo | Electrodes attached over facial and neck muscles to capture neuromuscular signals | Synthetic data corpus | The average word accuracy of 10 users is 92.01%. | [99] |
SottoVoce | Ultrasonic probe | - rescaled ultrasound images with 500 speech commands | - The success rate of a smart speaker (Amazon Echo) recognizing the generated speech is 65%. | [100] |
DCCA ** to find the correlation between articulatory movement data and acoustic features | Electromagnetic Articulograph | - speaker-independent (7 speakers) - 3-dimensional movement data of articulators (tongue and lip) - included acoustic data | Phoneme error rate of 57.3% using DNN-HMM only, reduced to 45.9% when combined with DCCA and to 42.5% with DCCA + fMLLR ***. | [113] |
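Several of the silent speech systems above map a sequence of articulatory measurements (flesh-point positions, ultrasound frames, or EMG features) to words or phonemes with recurrent networks. As a rough architectural sketch only, with assumed dimensions and not reproducing any cited model, a bidirectional LSTM classifier over articulatory flesh-point trajectories could look like the following PyTorch code:

```python
import torch
import torch.nn as nn

class ArticulatoryWordClassifier(nn.Module):
    """Toy BiLSTM that classifies an utterance from articulatory trajectories."""
    def __init__(self, n_features: int = 12, hidden: int = 128, n_classes: int = 25):
        # n_features: flattened sensor coordinates per frame (placeholder, e.g., 4 sensors x 3 axes).
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) articulatory movement sequences.
        out, _ = self.lstm(x)
        # Summarize the utterance with the last time step of the BiLSTM output.
        return self.head(out[:, -1, :])

# Example: a batch of 8 utterances, 200 frames each, 12 coordinates per frame.
model = ArticulatoryWordClassifier()
logits = model(torch.randn(8, 200, 12))  # -> (8, 25) word-class scores
```

A continuous recognizer or an articulation-to-speech system would instead emit per-frame outputs (e.g., trained with a CTC loss or feeding a vocoder), but the sequence-in, prediction-out structure is the same.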
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Lee, W.; Seong, J.J.; Ozlu, B.; Shim, B.S.; Marakhimov, A.; Lee, S. Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review. Sensors 2021, 21, 1399. https://doi.org/10.3390/s21041399