A Feature Integration Network for Multi-Channel Speech Enhancement
Abstract
1. Introduction
2. Signal Model
3. Proposed Algorithms
3.1. Full- and Sub-Band LSTM Module
3.2. Global–Local Attention Fusion Module
3.3. Spatial Attention Module
4. Experimental Setup
4.1. Dataset
- The SPA-DNS dataset: This dataset was created with simulated room sizes ranging from to , covering common indoor dimensions. Reverberation times (RT60) were varied between 0.2 and 1.2 s, covering low to moderately reverberant conditions typical of indoor environments. A circular array of four microphones with a radius of 10 cm was randomly placed in each room. Both the array and the two sources (clean speech and noise) were positioned at random locations at least 0.5 m from the walls, with a source-to-source distance of 0.75 to 2 m. Clean speech and noise samples were sourced from the DNS Challenge 2020 corpus [26]; the noise clips were selected from Audioset and Freesound and span a wide range of real-world noise types commonly encountered in daily life. In total, we generated 85,000 training utterances (3–6 s), 4400 validation utterances (3–10 s), and 2700 test utterances (3–10 s) with signal-to-noise ratios (SNRs) between −5 dB and 10 dB, mimicking real-world noisy conditions. The training, validation, and test sets account for 92%, 5%, and 3% of the data, respectively.
- The Libri-wham dataset: The rooms were created with sizes ranging from to , covering common indoor dimensions. The RT60 values ranged from 0.3 to 0.6 s, covering low to moderately reverberant conditions typical of indoor environments. A linear array of four microphones, with a spacing of 0.5 cm between adjacent microphones, was randomly positioned within each room. The two sources, speech and noise, were randomly placed within the room at heights between 1.2 and 1.6 m to simulate typical speaking heights and realistic noise scenarios in daily life. Clean speech files were obtained from Librispeech (train-clean-100) [27], while noise files were selected from the MUSAN dataset [28], which contains 929 diverse noise samples representing various real-world noise types encountered in everyday environments. We generated 16,724 training utterances, 1872 validation utterances, and 543 test utterances, each 6 s long, corresponding to 87%, 10%, and 3% of the data, respectively, with signal-to-noise ratios (SNRs) ranging from −5 dB to 10 dB. A hedged code sketch of this room-simulation recipe is given after this list.
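Both datasets follow the same general room-simulation recipe: sample a room and an RT60, place a microphone array and two sources, render room impulse responses, and mix speech and noise at a target SNR. As a rough, hedged illustration only, the sketch below shows how one such 4-channel mixture could be generated with, e.g., the pyroomacoustics toolkit [24]; the room size, array center, source positions, and signals are placeholder values, not the exact configuration used for SPA-DNS or Libri-wham.

```python
# Illustrative sketch (not the authors' exact pipeline) of simulating one
# 4-channel SPA-DNS-style mixture with pyroomacoustics [24]. Room size, array
# center, source positions, and signals below are placeholder values.
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = np.array([6.0, 5.0, 3.0])        # placeholder room dimensions (m)
rt60 = np.random.uniform(0.2, 1.2)          # SPA-DNS RT60 range (s)

# Derive wall absorption and image-source order from the target RT60 (Sabine).
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# Circular 4-microphone array with a 10 cm radius at an assumed 1.5 m height.
mic_xy = pra.circular_2D_array(center=[3.0, 2.5], M=4, phi0=0.0, radius=0.10)
mic_pos = np.vstack([mic_xy, np.full((1, 4), 1.5)])
room.add_microphone_array(pra.MicrophoneArray(mic_pos, room.fs))

# Speech and noise sources placed at least 0.5 m from the walls; white noise
# stands in for the DNS clean/noise clips, and the noise source is pre-scaled
# to a target SNR in [-5, 10] dB relative to the speech source.
speech = np.random.randn(4 * fs)
noise = np.random.randn(4 * fs)
snr_db = np.random.uniform(-5.0, 10.0)
noise *= np.sqrt(np.sum(speech**2) / np.sum(noise**2)) * 10 ** (-snr_db / 20)
room.add_source([2.0, 1.5, 1.5], signal=speech)
room.add_source([4.0, 3.5, 1.5], signal=noise)

room.simulate()                              # convolve sources with simulated RIRs and sum
mixture = room.mic_array.signals             # (4, num_samples) multi-channel mixture
```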
4.2. Model Configurations
5. Experimental Results and Analysis
5.1. Ablation Study
- Full- and Sub-Band LSTM Module Contribution: Compared to the noisy input, Case A, which contained only the two LSTM layers, showed significant improvements across all evaluation metrics. Specifically, the PESQ score improved from 1.56 to 2.77, the STOI score from 0.652 to 0.902, and the SI-SDR from −6.56 to 5.16, demonstrating the module's effectiveness in capturing spectral information. This module provided a foundational improvement by allowing the model to learn from both full- and sub-band spectral features, which are crucial for enhancing noisy speech.
- GLAF Module Contribution: Case A, which lacked the GLAF module, exhibited the poorest performance among all cases. Adding the GLAF module in Case B yielded gains of +0.09 in PESQ and +0.004 in STOI over Case A, highlighting the benefit of refining spectral information with global–local attention. This module sharpens specific details by applying both global and local attention, thereby improving the precision of the extracted features.
- Fusion Type Impact: A comparison between Case B and Case C showed that fusion using SA (spatial attention) outperformed the fusion by summation, resulting in improvements of +0.13 in PESQ, +0.013 in STOI, and +1.28 in SI-SDR. This improvement highlights that using spatial attention to selectively integrate feature information is more effective than a simple summation approach, leading to more significant enhancements in speech quality and intelligibility.
- Number of Feature Integration Blocks: Increasing the number of feature integration blocks from one (Case C) to three (Case E) consistently improved performance: the PESQ score increased from 2.99 to 3.62, the STOI score from 0.919 to 0.965, and the SI-SDR from 6.20 to 12.00. More feature integration blocks therefore yield better performance, although at the cost of a larger parameter count and higher computational complexity. The gains in PESQ, STOI, and SI-SDR make the additional complexity worthwhile in scenarios where high-quality speech enhancement is crucial, whereas in real-time or resource-constrained deployments the increase in parameters and computational cost must be balanced against these improvements. A minimal code sketch of one feature integration block is given after this list.
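To make the ablation entries concrete, the following is a minimal PyTorch-style sketch of one feature-integration block in the spirit of Sections 3.1, 3.2, and 3.3: a full-band LSTM that runs along frequency, a sub-band LSTM that runs along time, and a spatial-attention (SA) gate that fuses the two branches instead of simply summing them. The layer sizes and the exact attention design are illustrative assumptions, not the published configuration (in particular, the GLAF module is omitted here).

```python
# Illustrative feature-integration block: full-band LSTM + sub-band LSTM,
# fused by a spatial-attention gate. Sizes are assumptions for demonstration.
import torch
import torch.nn as nn

class FullSubBandBlock(nn.Module):
    def __init__(self, channels: int = 48, hidden: int = 96):
        super().__init__()
        # Full-band branch: a bidirectional LSTM scans the frequency axis per frame.
        self.full_lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.full_proj = nn.Linear(2 * hidden, channels)
        # Sub-band branch: a unidirectional LSTM scans the time axis per frequency bin.
        self.sub_lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.sub_proj = nn.Linear(hidden, channels)
        # Spatial attention: a small conv produces a (time, freq) gate in [0, 1].
        self.sa = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq) feature map
        b, c, t, f = x.shape

        # Full-band branch: frequency is the sequence dimension.
        full = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        full = self.full_proj(self.full_lstm(full)[0])
        full = full.reshape(b, t, f, c).permute(0, 3, 1, 2)

        # Sub-band branch: time is the sequence dimension.
        sub = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        sub = self.sub_proj(self.sub_lstm(sub)[0])
        sub = sub.reshape(b, f, t, c).permute(0, 3, 2, 1)

        # Spatial-attention fusion: pool each branch over channels, predict a gate,
        # and blend the branches rather than summing them; keep a residual path.
        pooled = torch.cat([full.mean(dim=1, keepdim=True),
                            sub.mean(dim=1, keepdim=True)], dim=1)
        gate = self.sa(pooled)                         # (b, 1, t, f)
        return x + gate * full + (1.0 - gate) * sub


# Example: three stacked blocks on a dummy feature map (batch, channels, time, freq).
blocks = nn.Sequential(*[FullSubBandBlock() for _ in range(3)])
y = blocks(torch.randn(2, 48, 100, 257))               # -> (2, 48, 100, 257)
```

Under these assumptions, stacking N such blocks corresponds to varying N in the ablation (Cases C, D, and E), which is where the parameter count and memory usage grow roughly linearly with N.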
5.2. Comparison to the Baseline Models
5.3. Robustness to Unseen Noise and Reverberation Time
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Tammen, M.; Doclo, S. Deep Multi-Frame MVDR Filtering for Binaural Noise Reduction. arXiv 2022, arXiv:2205.08983. [Google Scholar]
- Jannu, C.; Vanambathina, S.D. An Overview of Speech Enhancement Based on Deep Learning Techniques. Int. J. Image Graph. 2023, 2550001. [Google Scholar] [CrossRef]
- Wang, D. Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplif. 2008, 12, 332–353. [Google Scholar] [CrossRef] [PubMed]
- Heymann, J.; Drude, L.; Haeb-Umbach, R. Neural network based spectral mask estimation for acoustic beamforming. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 196–200. [Google Scholar]
- Erdogan, H.; Hershey, J.R.; Watanabe, S.; Mandel, M.I.; Le Roux, J. Improved MVDR beamforming using single-channel mask prediction networks. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 1981–1985. [Google Scholar]
- Ni, Z.; Grezes, F.; Trinh, V.A.; Mandel, M.I. Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks. arXiv 2020, arXiv:2012.02191. [Google Scholar]
- Tesch, K.; Gerkmann, T. Insights into deep non-linear filters for improved multi-channel speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 31, 563–575. [Google Scholar] [CrossRef]
- Yang, Y.; Quan, C.; Li, X. McNet: Fuse multiple cues for multichannel speech enhancement. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Lee, D.; Choi, J.W. DeFT-AN: Dense frequency-time attentive network for multichannel speech enhancement. IEEE Signal Process. Lett. 2023, 30, 155–159. [Google Scholar] [CrossRef]
- Li, A.; Liu, W.; Zheng, C.; Li, X. Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6487–6491. [Google Scholar]
- Tolooshams, B.; Giri, R.; Song, A.H.; Isik, U.; Krishnaswamy, A. Channel-attention dense u-net for multichannel speech enhancement. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 836–840. [Google Scholar]
- Lee, C.H.; Patel, K.; Yang, C.; Shen, Y.; Jin, H. An MVDR-Embedded U-Net Beamformer for Effective and Robust Multichannel Speech Enhancement. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 8541–8545. [Google Scholar]
- Wang, Z.Q.; Cornell, S.; Choi, S.; Lee, Y.; Kim, B.Y.; Watanabe, S. Neural speech enhancement with very low algorithmic latency and complexity via integrated full- and sub-band modeling. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Wang, Z.Q.; Cornell, S.; Choi, S.; Lee, Y.; Kim, B.Y.; Watanabe, S. TF-GridNet: Integrating full- and sub-band modeling for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 3221–3236. [Google Scholar] [CrossRef]
- Hao, X.; Su, X.; Horaud, R.; Li, X. Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6633–6637. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Tokala, V.; Grinstein, E.; Brookes, M.; Doclo, S.; Jensen, J.; Naylor, P.A. Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 681–685. [Google Scholar]
- Fu, Y.; Liu, Y.; Li, J.; Luo, D.; Lv, S.; Jv, Y.; Xie, L. Uformer: A unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7417–7421. [Google Scholar]
- Hoang, N.C.; Nguyen, T.N.L.; Doan, T.K.; Nguyen, Q.C. Multi-stage temporal representation learning via global and local perspectives for real-time speech enhancement. Appl. Acoust. 2024, 223, 110067. [Google Scholar] [CrossRef]
- Xiang, X.; Zhang, X.; Chen, H. A nested u-net with self-attention and dense connectivity for monaural speech enhancement. IEEE Signal Process. Lett. 2021, 29, 105–109. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
- Pan, H.; Gao, F.; Dong, J.; Du, Q. Multiscale adaptive fusion network for hyperspectral image denoising. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3045–3059. [Google Scholar] [CrossRef]
- Scheibler, R.; Bezzam, E.; Dokmanić, I. Pyroomacoustics: A python package for audio room simulation and array processing algorithms. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 351–355. [Google Scholar]
- Pandey, A.; Xu, B.; Kumar, A.; Donley, J.; Calamia, P.; Wang, D. TPARN: Triple-path attentive recurrent network for time-domain multichannel speech enhancement. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6497–6501. [Google Scholar]
- Reddy, C.K.; Gopal, V.; Cutler, R.; Beyrami, E.; Cheng, R.; Dubey, H.; Matusevych, S.; Aichner, R.; Aazami, A.; Braun, S.; et al. The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv 2020, arXiv:2005.13981. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5206–5210. [Google Scholar]
- Snyder, D.; Chen, G.; Povey, D. Musan: A music, speech, and noise corpus. arXiv 2015, arXiv:1510.08484. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Zhang, Z.; Xu, S.; Zhuang, X.; Qian, Y.; Wang, M. Dual branch deep interactive UNet for monaural noisy-reverberant speech enhancement. Appl. Acoust. 2023, 212, 109574. [Google Scholar] [CrossRef]
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; Proceedings (Cat. No. 01CH37221). IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 749–752. [Google Scholar]
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 4214–4217. [Google Scholar]
- Le Roux, J.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR–half-baked or well done? In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 626–630. [Google Scholar]
- Luo, Y.; Chen, Z.; Mesgarani, N.; Yoshioka, T. End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation. arXiv 2019, arXiv:1910.14104. [Google Scholar]
| Case | Params (M) | FLOPs (G/s) | RTF | N (Blocks) | GLAF | Fusion Type | Memory Usage (MB) | PESQ | STOI | SI-SDR |
|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | - | - | - | - | - | - | - | 1.56 | 0.652 | −6.56 |
| A | 0.85 | 5.67 | 0.12 | 1 | × | - | 8087 | 2.77 | 0.902 | 5.16 |
| B | 0.90 | 5.69 | 0.71 | 1 | ✓ | sum | 8091 | 2.86 | 0.906 | 4.92 |
| C | 0.91 | 5.70 | 0.72 | 1 | ✓ | SA | 8093 | 2.99 | 0.919 | 6.20 |
| D | 1.8 | 11.40 | 0.83 | 2 | ✓ | SA | 11,585 | 3.50 | 0.957 | 10.75 |
| E | 2.7 | 17.09 | 0.95 | 3 | ✓ | SA | 17,632 | 3.62 | 0.965 | 12.00 |
| Dataset | Model | Causal | Params (M) | PESQ | STOI | SI-SDR |
|---|---|---|---|---|---|---|
| SPA-DNS | Noisy | - | - | 1.56 | 0.652 | −6.56 |
| SPA-DNS | FasNet-TAC [34] | × | 4.1 | 2.301 | 0.824 | 4.321 |
| SPA-DNS | EaBNet [10] | × | 2.8 | 2.718 | 0.878 | 3.904 |
| SPA-DNS | FT-JNF [7] | × | 3.3 | 2.886 | 0.885 | 5.269 |
| SPA-DNS | Proposed | × | 2.7 | 3.62 | 0.965 | 12.00 |
| Libri-wham | Noisy | - | - | 1.44 | 0.604 | −12.436 |
| Libri-wham | FasNet-TAC [34] | × | 4.1 | 1.91 | 0.702 | −5.160 |
| Libri-wham | EaBNet [10] | × | 2.8 | 2.37 | 0.810 | −2.292 |
| Libri-wham | FT-JNF [7] | × | 3.3 | 2.50 | 0.854 | 1.894 |
| Libri-wham | Proposed | × | 2.7 | 3.43 | 0.942 | 8.38 |