D2StarGAN: A Near-Far End Noise Adaptive StarGAN for Speech Intelligibility Enhancement
Abstract
1. Introduction
- (1) In real life, the far-end speaker is never in a completely noise-free environment and may already be producing Lombard speech, so methods that simply convert normal speech to Lombard speech are inherently limited.
- (2) Although StarGAN-based approaches improve on CycleGAN-based ones, their feature mapping remains clearly insufficient, and even with StarGAN a gap persists between real speech and converted speech. Moreover, intelligibility enhancement that relies on the Lombard effect alone still performs poorly under strong noise interference at very low signal-to-noise ratios.
2. Related Work
3. Baseline
3.1. Traditional NELE System Structure
3.2. AdaSAStarGAN
4. Proposed D2StarGAN
4.1. System Structure
4.2. Mapping Model
5. Experimental Section and Results
5.1. Experimental Setup
5.1.1. Dataset
5.1.2. Comparison with Existing Approaches
5.2. Objective Evaluations
5.3. Subjective Listening Tests
5.4. Acoustic Analysis
5.5. Analysis of System Robustness
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Noiseless → LowNoise:

| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 18.71 | 1.92 | 0.24 | 3.76 | 0.31 |
| CGAN | 36.21 | 2.58 | 0.32 | 4.13 | 0.40 |
| SGAN | 40.49 | 2.64 | 0.37 | 6.42 | 0.44 |
| ASSGAN | 47.25 | 2.66 | 0.37 | 6.63 | 0.43 |
| D2StarGAN | 49.56 | 2.70 | 0.39 | 7.43 | 0.48 |

Noiseless → HighNoise:

| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 13.08 | 1.58 | 0.20 | 2.17 | 0.20 |
| CGAN | 23.86 | 2.08 | 0.24 | 4.39 | 0.33 |
| SGAN | 32.17 | 2.29 | 0.25 | 5.42 | 0.34 |
| ASSGAN | 30.38 | 2.25 | 0.25 | 5.33 | 0.33 |
| D2StarGAN | 31.57 | 2.31 | 0.27 | 6.16 | 0.37 |
Noiseless → LowNoise:

| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 15.86 | 1.66 | 0.21 | 2.23 | 0.28 |
| CGAN | 25.12 | 2.13 | 0.28 | 4.59 | 0.36 |
| SGAN | 31.58 | 2.47 | 0.30 | 5.20 | 0.44 |
| ASSGAN | 36.82 | 2.34 | 0.29 | 5.39 | 0.45 |
| D2StarGAN | 39.90 | 2.53 | 0.31 | 5.83 | 0.47 |

Noiseless → HighNoise:

| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 10.94 | 1.49 | 0.13 | 1.42 | 0.19 |
| CGAN | 16.84 | 1.83 | 0.19 | 2.68 | 0.28 |
| SGAN | 20.83 | 2.00 | 0.19 | 3.32 | 0.32 |
| ASSGAN | 21.28 | 1.97 | 0.20 | 3.37 | 0.33 |
| D2StarGAN | 23.13 | 2.01 | 0.21 | 3.77 | 0.36 |
Babble:

| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 14.31 | 1.61 | 0.21 | 2.23 | 0.24 |
| SGAN | 33.30 | 2.37 | 0.26 | 5.33 | 0.37 |
| ASSGAN | 32.59 | 2.38 | 0.26 | 5.59 | 0.37 |
| D2StarGAN | 33.31 | 2.42 | 0.28 | 6.14 | 0.39 |

Restaurant:

| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 9.37 | 1.43 | 0.14 | 1.43 | 0.19 |
| SGAN | 20.66 | 2.05 | 0.20 | 3.30 | 0.35 |
| ASSGAN | 21.37 | 2.03 | 0.20 | 3.31 | 0.35 |
| D2StarGAN | 22.37 | 2.05 | 0.21 | 3.64 | 0.37 |
| System | SIIB | HASPI | ESTOI | SRMR | NCM |
|---|---|---|---|---|---|
| UN | 13.64 | 1.60 | 0.21 | 2.22 | 0.23 |
| SGAN | 27.12 | 2.24 | 0.23 | 5.01 | 0.32 |
| D2StarGAN | 29.85 | 2.29 | 0.26 | 5.83 | 0.35 |
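As a quick sanity check, the relative gains in the robustness table above can be recomputed directly from the reported scores. The sketch below is illustrative only: the `scores` dictionary and `rel_improvement` helper are hypothetical names, and the values are simply copied from the table, not re-measured.

```python
# Metric scores copied from the robustness table above.
metrics = ["SIIB", "HASPI", "ESTOI", "SRMR", "NCM"]
scores = {
    "UN":        [13.64, 1.60, 0.21, 2.22, 0.23],
    "SGAN":      [27.12, 2.24, 0.23, 5.01, 0.32],
    "D2StarGAN": [29.85, 2.29, 0.26, 5.83, 0.35],
}

def rel_improvement(system, baseline):
    """Percent improvement of `system` over `baseline`, per metric."""
    return {m: round(100 * (s - b) / b, 1)
            for m, s, b in zip(metrics, scores[system], scores[baseline])}

print(rel_improvement("D2StarGAN", "UN"))
print(rel_improvement("D2StarGAN", "SGAN"))
```

On these numbers, D2StarGAN's largest relative gains over the unprocessed (UN) condition come from SRMR and SIIB, while its margin over SGAN is more modest across all five metrics.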
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, D.; Zhu, C.; Zhao, L. D2StarGAN: A Near-Far End Noise Adaptive StarGAN for Speech Intelligibility Enhancement. Electronics 2023, 12, 3620. https://doi.org/10.3390/electronics12173620