Hybrid Dilated and Recursive Recurrent Convolution Network for Time-Domain Speech Enhancement
Abstract
1. Introduction
- We propose a hybrid dilated convolution module (HDM), which combines dilated and conventional convolutions (see the sketch after this list);
- We propose a recursive recurrent speech enhancement network (RRSENet), which reapplies a shared enhancement stage and links the repetitions with a GRU module, avoiding the large number of training parameters and the complex structure of stacked models;
- Extensive experiments are performed on datasets synthesized from the TIMIT corpus and NOISEX-92, and the proposed model achieves excellent results.
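To make the first contribution concrete, the following is a minimal PyTorch sketch of an HDM-style block, assuming the module runs a dilated branch and a conventional branch in parallel and fuses them with a 1×1 convolution; the channel count, kernel size, dilation rate, and residual shortcut are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HybridDilatedModule(nn.Module):
    """Sketch of an HDM-style block: the dilated branch enlarges the
    receptive field while the conventional branch preserves local detail.
    All hyperparameters here are assumed values, not the paper's."""
    def __init__(self, channels=64, kernel_size=3, dilation=4):
        super().__init__()
        self.dilated = nn.Conv1d(channels, channels, kernel_size,
                                 padding=(kernel_size - 1) * dilation // 2,
                                 dilation=dilation)
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.fuse = nn.Conv1d(2 * channels, channels, 1)  # 1x1 channel fusion
        self.act = nn.PReLU()

    def forward(self, x):
        # concatenate both branches along the channel axis, then fuse
        y = torch.cat([self.dilated(x), self.conv(x)], dim=1)
        return self.act(self.fuse(y)) + x  # residual shortcut (assumed)

# usage: a 64-channel feature map over 16,000 time steps (1 s at 16 kHz)
x = torch.randn(1, 64, 16000)
print(HybridDilatedModule()(x).shape)  # torch.Size([1, 64, 16000])
```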
2. Related Work
2.1. Speech Enhancement
2.2. Recurrent Neural Network
2.3. Dilated Convolution
3. Model Description
3.1. Problem Formulation
3.2. Hybrid Dilated Convolution Module (HDM)
3.3. Recursive Learning Speech Enhancement Network (RLSENet)
3.4. Recursive Recurrent Speech Enhancement Network (RRSENet)
4. Experiments
4.1. Datasets
4.2. Experimental Settings
4.3. Experimental Results
4.3.1. Compared with Typical Algorithms
4.3.2. Module Comparisons
4.3.3. GRU Module and Recursive Times
4.3.4. Speech Spectrogram
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
- Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121.
- Ephraim, Y.; Van Trees, H.L. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 1995, 3, 251–266.
- Wang, D. On ideal binary mask as the computational goal of auditory scene analysis. In Speech Separation by Humans and Machines; Springer: Berlin, Germany, 2005; pp. 181–197.
- Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858.
- Xu, Y.; Du, J.; Dai, L.R.; Lee, C.H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 7–19.
- Tan, K.; Wang, D. A convolutional recurrent neural network for real-time speech enhancement. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3229–3233.
- Zhao, H.; Zarar, S.; Tashev, I.; Lee, C.H. Convolutional-recurrent neural networks for speech enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2401–2405.
- Paliwal, K.; Wójcicki, K.; Shannon, B. The importance of phase in speech enhancement. Speech Commun. 2011, 53, 465–494.
- Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452.
- Kolbæk, M.; Tan, Z.H.; Jensen, S.H.; Jensen, J. On loss functions for supervised monaural time-domain speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 825–838.
- Abdulbaqi, J.; Gu, Y.; Chen, S.; Marsic, I. Residual recurrent neural network for speech enhancement. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6659–6663.
- Fu, S.W.; Tsao, Y.; Lu, X.; Kawai, H. Raw waveform-based speech enhancement by fully convolutional networks. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 6–12.
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
- Rethage, D.; Pons, J.; Serra, X. A WaveNet for speech denoising. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5069–5073.
- Pandey, A.; Wang, D. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6875–6879.
- Ren, D.; Zuo, W.; Hu, Q.; Zhu, P.; Meng, D. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’79), Washington, DC, USA, 2–4 April 1979.
- Sim, B.L.; Tong, Y.C.; Chang, J.S.; Tan, C.T. A parametric formulation of the generalized spectral subtraction method. IEEE Trans. Speech Audio Process. 1998, 6, 328–337.
- Kamath, S.D.; Loizou, P.C. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, FL, USA, 13–17 May 2002; Volume 4.
- Sovka, P.; Pollak, P.; Kybic, J. Extended spectral subtraction. In Proceedings of the 8th European Signal Processing Conference (EUSIPCO), Trieste, Italy, 10–13 September 1996.
- Cohen, I. Relaxed statistical model for speech enhancement and a priori SNR estimation. IEEE Trans. Speech Audio Process. 2005, 13, 870–881.
- Cohen, I.; Berdugo, B. Speech enhancement for non-stationary noise environments. Signal Process. 2001, 81, 2403–2418.
- Hasan, M.K.; Salahuddin, S.; Khan, M.R. A modified a priori SNR for speech enhancement using spectral subtraction rules. IEEE Signal Process. Lett. 2004, 11, 450–453.
- Cappé, O. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process. 1994, 2, 345–349.
- Li, A.; Zheng, C.; Cheng, L.; Peng, R.; Li, X. A time-domain monaural speech enhancement with feedback learning. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 769–774.
- Yuliani, A.R.; Amri, M.F.; Suryawati, E.; Ramdan, A.; Pardede, H.F. Speech enhancement using deep learning methods: A review. J. Elektron. Dan Telekomun. 2021, 21, 19–26.
- Yan, Z.; Xu, B.; Giri, R.; Tao, Z. Perceptually guided speech enhancement using deep neural networks. In Proceedings of the ICASSP 2018—2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
- Karjol, P.; Ajay Kumar, M.; Ghosh, P.K. Speech enhancement using multiple deep neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5049–5052.
- Xu, Y.; Du, J.; Huang, Z.; Dai, L.R.; Lee, C.H. Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. arXiv 2017, arXiv:1703.07172.
- Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
- Gao, T.; Du, J.; Dai, L.R.; Lee, C.H. Densely connected progressive learning for LSTM-based speech enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5054–5058.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- Ye, S.; Hu, X.; Xu, X. TDCGAN: Temporal dilated convolutional generative adversarial network for end-to-end speech enhancement. arXiv 2020, arXiv:2008.07787.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
- Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1; NISTIR 4930; 1993. Available online: https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir4930.pdf (accessed on 8 March 2022).
- Varga, A.; Steeneken, H.J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 1993, 12, 247–251.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445.
- Pandey, A.; Wang, D. A new framework for CNN-based speech enhancement in the time domain. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1179–1188.
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’01), Salt Lake City, UT, USA, 7–11 May 2001.
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136.
- Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
| Model | PESQ −5 dB | PESQ 0 dB | PESQ 5 dB | PESQ Avg. | STOI (%) −5 dB | STOI (%) 0 dB | STOI (%) 5 dB | STOI (%) Avg. |
|---|---|---|---|---|---|---|---|---|
| Noisy | 1.537 | 1.843 | 2.083 | 1.821 | 66.40 | 77.56 | 84.92 | 76.30 |
| LogMMSE | 1.579 | 1.906 | 2.176 | 1.887 | 66.77 | 78.25 | 85.52 | 76.84 |
| Wiener | 1.710 | 2.180 | 2.610 | 2.166 | 65.15 | 77.84 | 85.90 | 76.29 |
| TCNN | 1.969 | 2.510 | 2.830 | 2.436 | 80.87 | 89.11 | 92.71 | 87.56 |
| AECNN | 2.013 | 2.557 | 2.945 | 2.505 | 81.33 | 89.45 | 92.99 | 87.92 |
| HDMNet | 2.122 | 2.654 | 3.013 | 2.596 | 82.03 | 89.52 | 93.09 | 88.22 |
| RLSENet | 2.151 | 2.692 | 3.039 | 2.627 | 83.06 | 90.18 | 93.40 | 88.80 |
| RRSENet | 2.220 | 2.750 | 3.051 | 2.674 | 83.17 | 90.19 | 93.30 | 88.89 |
| Model | PESQ −5 dB | PESQ 0 dB | PESQ 5 dB | PESQ Avg. | STOI (%) −5 dB | STOI (%) 0 dB | STOI (%) 5 dB | STOI (%) Avg. |
|---|---|---|---|---|---|---|---|---|
| Noisy | 1.419 | 1.603 | 1.868 | 1.630 | 65.05 | 72.99 | 82.10 | 73.38 |
| LogMMSE | 1.439 | 1.654 | 1.941 | 1.678 | 65.23 | 73.66 | 82.67 | 73.85 |
| Wiener | 1.617 | 2.030 | 2.387 | 2.011 | 64.50 | 76.20 | 84.90 | 75.20 |
| TCNN | 1.785 | 2.054 | 2.359 | 2.066 | 74.76 | 83.71 | 89.88 | 82.79 |
| AECNN | 1.771 | 2.066 | 2.437 | 2.091 | 74.28 | 83.50 | 89.79 | 82.52 |
| HDMNet | 1.937 | 2.234 | 2.611 | 2.261 | 77.99 | 85.06 | 90.25 | 84.43 |
| RLSENet | 1.899 | 2.204 | 2.586 | 2.230 | 77.10 | 85.62 | 90.97 | 84.56 |
| RRSENet | 1.959 | 2.276 | 2.649 | 2.295 | 78.87 | 85.99 | 91.03 | 85.29 |
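The PESQ and STOI figures above (STOI reported in percent) can be reproduced for any clean/enhanced utterance pair with the community `pesq` and `pystoi` packages. The sketch below is illustrative and is not the authors' evaluation code; in particular, the 16 kHz wideband PESQ setting is an assumption based on the TIMIT sampling rate.

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

FS = 16000  # TIMIT sampling rate

def evaluate(clean: np.ndarray, enhanced: np.ndarray):
    """Return (PESQ, STOI in %) for one clean/enhanced utterance pair."""
    p = pesq(FS, clean, enhanced, 'wb')    # wideband PESQ (assumed mode)
    s = stoi(clean, enhanced, FS) * 100.0  # STOI scaled to a percentage
    return p, s

# usage (file names are placeholders):
# import soundfile as sf
# clean, _ = sf.read('clean.wav')
# enhanced, _ = sf.read('enhanced.wav')
# print(evaluate(clean, enhanced))
```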
| Model | PESQ −5 dB | PESQ 0 dB | PESQ 5 dB | PESQ Avg. | STOI (%) −5 dB | STOI (%) 0 dB | STOI (%) 5 dB | STOI (%) Avg. |
|---|---|---|---|---|---|---|---|---|
| Noisy | 1.537 | 1.843 | 2.083 | 1.821 | 66.40 | 77.56 | 84.92 | 76.30 |
| FCMNet | 1.832 | 2.268 | 2.528 | 2.209 | 76.45 | 86.02 | 90.50 | 84.32 |
| FDMNet | 1.959 | 2.441 | 2.755 | 2.385 | 78.46 | 87.31 | 91.56 | 85.77 |
| HDMNet | 2.122 | 2.654 | 3.013 | 2.596 | 82.03 | 89.52 | 93.09 | 88.22 |
| Model | PESQ −5 dB | PESQ 0 dB | PESQ 5 dB | PESQ Avg. | STOI (%) −5 dB | STOI (%) 0 dB | STOI (%) 5 dB | STOI (%) Avg. |
|---|---|---|---|---|---|---|---|---|
| Noisy | 1.419 | 1.603 | 1.868 | 1.630 | 65.05 | 72.99 | 82.10 | 73.38 |
| FCMNet | 1.714 | 1.976 | 2.279 | 1.990 | 74.24 | 82.19 | 88.17 | 81.53 |
| FDMNet | 1.903 | 2.183 | 2.499 | 2.195 | 76.79 | 83.66 | 89.18 | 83.20 |
| HDMNet | 1.937 | 2.234 | 2.611 | 2.261 | 77.99 | 85.06 | 90.25 | 84.43 |
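The ordering FCMNet (conventional convolutions only) < FDMNet (dilated only) < HDMNet in the two module-comparison tables is consistent with the receptive-field argument for dilation: at equal depth and kernel size, exponentially growing dilation rates cover far more temporal context. A small illustrative calculation follows; the depth and dilation schedule are assumptions for illustration, not the paper's exact settings.

```python
def receptive_field(kernel: int, dilations) -> int:
    """Receptive field of stacked stride-1 1-D convolutions:
    rf = 1 + sum of (kernel - 1) * d over each layer's dilation d."""
    return 1 + sum((kernel - 1) * d for d in dilations)

layers, k = 6, 3
print(receptive_field(k, [1] * layers))                     # conventional: 13 samples
print(receptive_field(k, [2 ** i for i in range(layers)]))  # dilations 1..32: 127 samples
```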
| Recursive Times | PESQ (RLSENet) | PESQ (RRSENet) | STOI % (RLSENet) | STOI % (RRSENet) |
|---|---|---|---|---|
| t1 | 2.351 | 2.326 | 84.94 | 83.45 |
| t2 | 2.502 | 2.581 | 86.96 | 87.11 |
| t3 | 2.591 | 2.623 | 88.64 | 88.50 |
| t4 | 2.627 | 2.688 | 88.78 | 88.82 |
| t5 | 2.625 | 2.704 | 88.75 | 88.22 |
| Recursive Times | PESQ (RLSENet) | PESQ (RRSENet) | STOI % (RLSENet) | STOI % (RRSENet) |
|---|---|---|---|---|
| t1 | 2.151 | 2.135 | 81.92 | 81.54 |
| t2 | 2.170 | 2.238 | 81.89 | 83.13 |
| t3 | 2.189 | 2.244 | 89.37 | 84.26 |
| t4 | 2.230 | 2.268 | 84.56 | 84.59 |
| t5 | 2.262 | 2.245 | 84.56 | 83.48 |
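The recursion index t1–t5 in the two tables above counts how many times the shared enhancement stage is applied. The following is a schematic sketch of the recursive recurrent idea, assuming a single shared convolutional stage whose repetitions are bridged by a GRU hidden state; the layer shapes are assumptions, and the point is only that the parameter count stays fixed while the number of recursions grows.

```python
import torch
import torch.nn as nn

class RecursiveEnhancer(nn.Module):
    """Schematic recursive-recurrent loop: the same stage is applied
    `stages` times, and a GRU hidden state carries information between
    repetitions, so parameters are independent of the recursion depth."""
    def __init__(self, feat=64):
        super().__init__()
        self.encode = nn.Conv1d(1, feat, 3, padding=1)
        self.gru = nn.GRU(feat, feat, batch_first=True)
        self.decode = nn.Conv1d(feat, 1, 3, padding=1)

    def forward(self, noisy, stages=4):
        est, h = noisy, None
        for _ in range(stages):                    # t1 ... t_stages
            f = self.encode(est)                   # (B, C, T)
            f, h = self.gru(f.transpose(1, 2), h)  # h bridges the stages
            est = self.decode(f.transpose(1, 2))   # refined waveform estimate
        return est

x = torch.randn(2, 1, 16000)  # a batch of 1-s noisy waveforms at 16 kHz
print(RecursiveEnhancer()(x, stages=4).shape)  # torch.Size([2, 1, 16000])
```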
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Song, Z.; Ma, Y.; Tan, F.; Feng, X. Hybrid Dilated and Recursive Recurrent Convolution Network for Time-Domain Speech Enhancement. Appl. Sci. 2022, 12, 3461. https://doi.org/10.3390/app12073461