Improved Speech Spatial Covariance Matrix Estimation for Online Multi-Microphone Speech Enhancement
Abstract
1. Introduction
2. Previous Work and Contributions
- A speech PSD estimator based on the temporal cepstrum smoothing (TCS) scheme, which incorporates prior knowledge of the speech signal in the cepstral domain;
- An RTF estimator based on the time difference of arrival (TDoA) estimate, which exploits information from all frequency bins and is therefore robust when the signal-to-noise ratio (SNR) is low;
- A refinement of the acoustic parameter estimates that exploits the clean speech spectrum and the clean speech power spectrum estimated in the first pass.
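The TDoA-based RTF idea in the second bullet can be illustrated with a minimal sketch. This is not the authors' estimator: it uses the classic GCC-PHAT delay estimator and a free-field (pure-delay) RTF model, and all function names are ours.

```python
import numpy as np

def gcc_phat_tdoa(x_ref, x_m, fs, max_tau=None, n_fft=None):
    """Estimate the TDoA (seconds) between two microphone signals with
    GCC-PHAT: whiten the cross-spectrum, then pick the correlation peak."""
    n = len(x_ref) + len(x_m)
    if n_fft is None:
        n_fft = int(2 ** np.ceil(np.log2(n)))
    X_ref = np.fft.rfft(x_ref, n_fft)
    X_m = np.fft.rfft(x_m, n_fft)
    cross = X_m * np.conj(X_ref)
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting
    cc = np.fft.irfft(cross, n_fft)
    max_shift = n_fft // 2 if max_tau is None else min(int(max_tau * fs), n_fft // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def tdoa_steering_vector(taus, freqs):
    """Free-field RTF implied by per-microphone TDoAs relative to the
    reference microphone: a pure delay exp(-j 2 pi f tau_m) per bin."""
    taus = np.asarray(taus)[:, None]                # shape (M, 1)
    freqs = np.asarray(freqs)[None, :]              # shape (1, F)
    return np.exp(-2j * np.pi * freqs * taus)       # shape (M, F)
```

Because the delay is estimated once per frame from the whole whitened cross-spectrum, every frequency bin contributes to a single TDoA, which is what makes this style of RTF estimate comparatively robust at low SNR.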
3. MMSE Multi-Microphone Speech Enhancement
3.1. Signal Model
3.2. MWF and MVDR–Wiener Filter Factorization
3.3. Speech and Noise SCM Estimation
4. Proposed Speech SCM Estimation
Algorithm 1: Proposed multi-microphone speech enhancement algorithm with improved speech SCM estimation.
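Algorithm 1 itself is not reproduced here. As a generic reference point (not the proposed estimator), online speech and noise SCM tracking is commonly implemented as speech-presence-gated exponential smoothing, with an MVDR beamformer then built from the noise SCM and the RTF; the sketch below uses illustrative names and smoothing constants of our choosing.

```python
import numpy as np

def update_scms(Phi_s, Phi_n, x, p_speech, alpha_s=0.95, alpha_n=0.95):
    """One recursive update of the speech/noise spatial covariance matrices
    from a multichannel STFT frame x (shape (M,)), gated by a speech
    presence probability p_speech in [0, 1]."""
    outer = np.outer(x, x.conj())
    # Effective smoothing: update Phi_s mostly when speech is present,
    # Phi_n mostly when it is absent (generic scheme, not Algorithm 1).
    a_s = alpha_s + (1 - alpha_s) * (1 - p_speech)
    a_n = alpha_n + (1 - alpha_n) * p_speech
    Phi_s = a_s * Phi_s + (1 - a_s) * outer
    Phi_n = a_n * Phi_n + (1 - a_n) * outer
    return Phi_s, Phi_n

def mvdr_weights(Phi_n, d):
    """MVDR beamformer w = Phi_n^{-1} d / (d^H Phi_n^{-1} d) for RTF d."""
    num = np.linalg.solve(Phi_n, d)
    return num / (d.conj() @ num)
```

With `p_speech = 1` the noise SCM is frozen and the speech SCM is updated, and vice versa; the MVDR weights satisfy the distortionless constraint `w^H d = 1` by construction.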
5. Experiments
5.1. Experimental Settings
5.2. Experimental Results
5.3. Ablation Study
5.4. Computational Complexity
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Vary, P.; Martin, R. Digital Speech Transmission: Enhancement, Coding and Error Concealment; John Wiley & Sons: Chichester, UK, 2006.
2. Kates, J.M. Digital Hearing Aids; Plural Publishing: San Diego, CA, USA, 2008.
3. Rabiner, L.; Juang, B.H. Fundamentals of Speech Recognition; Prentice-Hall, Inc.: Hoboken, NJ, USA, 1993.
4. Kim, M.; Shin, J.W. Improved Speech Enhancement Considering Speech PSD Uncertainty. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1939–1951.
5. Kim, M.; Song, H.; Cheong, S.; Shin, J.W. iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement. Proc. Interspeech 2022, 2022, 181–185.
6. Benesty, J.; Chen, J.; Huang, Y. Microphone Array Signal Processing; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
7. Gannot, S.; Vincent, E.; Markovich-Golan, S.; Ozerov, A. A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 692–730.
8. Souden, M.; Benesty, J.; Affes, S. On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 260–276.
9. Markovich-Golan, S.; Gannot, S.; Cohen, I. A weighted multichannel Wiener filter for multiple sources scenarios. In Proceedings of the 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, Eilat, Israel, 14–17 November 2012; pp. 1–5.
10. Doclo, S.; Spriet, A.; Wouters, J.; Moonen, M. Speech distortion weighted multichannel Wiener filtering techniques for noise reduction. In Speech Enhancement; Springer: Berlin/Heidelberg, Germany, 2005; pp. 199–228.
11. Balan, R.; Rosca, J. Microphone array speech enhancement by Bayesian estimation of spectral amplitude and phase. In Proceedings of the Sensor Array and Multichannel Signal Processing Workshop Proceedings, Rosslyn, VA, USA, 6 August 2002; pp. 209–213.
12. Thüne, P.; Enzner, G. Maximum-likelihood approach with Bayesian refinement for multichannel-Wiener postfiltering. IEEE Trans. Signal Process. 2017, 65, 3399–3413.
13. Heymann, J.; Drude, L.; Haeb-Umbach, R. A generic neural acoustic beamforming architecture for robust multi-channel speech processing. Comput. Speech Lang. 2017, 46, 374–385.
14. Schwartz, O.; Gannot, S.; Habets, E.A. An expectation-maximization algorithm for multimicrophone speech dereverberation and noise reduction with coherence matrix estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 1495–1510.
15. Thiergart, O.; Taseska, M.; Habets, E.A. An informed MMSE filter based on multiple instantaneous direction-of-arrival estimates. In Proceedings of the 21st IEEE European Signal Processing Conference (EUSIPCO 2013), Marrakech, Morocco, 9–13 September 2013; pp. 1–5.
16. Thiergart, O.; Taseska, M.; Habets, E.A. An informed parametric spatial filter based on instantaneous direction-of-arrival estimates. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 2182–2196.
17. Taseska, M.; Habets, E.A. Informed spatial filtering for sound extraction using distributed microphone arrays. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1195–1207.
18. Chakrabarty, S.; Habets, E.A. A Bayesian approach to informed spatial filtering with robustness against DOA estimation errors. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 145–160.
19. Higuchi, T.; Ito, N.; Araki, S.; Yoshioka, T.; Delcroix, M.; Nakatani, T. Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 780–793.
20. Jin, Y.G.; Shin, J.W.; Kim, N.S. Spectro-temporal filtering for multichannel speech enhancement in short-time Fourier transform domain. IEEE Signal Process. Lett. 2014, 21, 352–355.
21. Serizel, R.; Moonen, M.; Van Dijk, B.; Wouters, J. Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 785–799.
22. Wang, Z.; Vincent, E.; Serizel, R.; Yan, Y. Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments. Comput. Speech Lang. 2018, 49, 37–51.
23. Schwartz, O.; Gannot, S.; Habets, E.A. Joint maximum likelihood estimation of late reverberant and speech power spectral density in noisy environments. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 151–155.
24. Souden, M.; Chen, J.; Benesty, J.; Affes, S. Gaussian model-based multichannel speech presence probability. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1072–1077.
25. Souden, M.; Chen, J.; Benesty, J.; Affes, S. An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2159–2169.
26. Taseska, M.; Habets, E.A. Nonstationary noise PSD matrix estimation for multichannel blind speech extraction. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2223–2236.
27. Martín-Doñas, J.M.; Jensen, J.; Tan, Z.H.; Gomez, A.M.; Peinado, A.M. Online Multichannel Speech Enhancement Based on Recursive EM and DNN-Based Speech Presence Estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 3080–3094.
28. Schwartz, O.; Gannot, S. A recursive expectation-maximization algorithm for online multi-microphone noise reduction. In Proceedings of the 2018 IEEE 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 1542–1546.
29. Jin, Y.G.; Shin, J.W.; Kim, N.S. Decision-directed speech power spectral density matrix estimation for multichannel speech enhancement. J. Acoust. Soc. Am. 2017, 141, EL228–EL233.
30. Markovich, S.; Gannot, S.; Cohen, I. Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1071–1086.
31. Hwang, S.; Kim, M.; Shin, J.W. Dual microphone speech enhancement based on statistical modeling of interchannel phase difference. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2865–2874.
32. Dietzen, T.; Moonen, M.; van Waterschoot, T. Instantaneous PSD Estimation for Speech Enhancement based on Generalized Principal Components. In Proceedings of the 2020 IEEE 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–21 January 2021; pp. 191–195.
33. Dietzen, T.; Doclo, S.; Moonen, M.; van Waterschoot, T. Square root-based multi-source early PSD estimation and recursive RETF update in reverberant environments by means of the orthogonal Procrustes problem. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 755–769.
34. Mitsufuji, Y.; Takamune, N.; Koyama, S.; Saruwatari, H. Multichannel blind source separation based on evanescent-region-aware non-negative tensor factorization in spherical harmonic domain. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 607–617.
35. Dietzen, T.; Doclo, S.; Moonen, M.; van Waterschoot, T. Integrated sidelobe cancellation and linear prediction Kalman filter for joint multi-microphone speech dereverberation, interfering speech cancellation, and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 740–754.
36. Pezzoli, M.; Cobos, M.; Antonacci, F.; Sarti, A. Sparsity-Based Sound Field Separation in The Spherical Harmonics Domain. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 1051–1055.
37. Pezzoli, M.; Carabias-Orti, J.J.; Cobos, M.; Antonacci, F.; Sarti, A. Ray-space-based multichannel nonnegative matrix factorization for audio source separation. IEEE Signal Process. Lett. 2021, 28, 369–373.
38. Wang, Z.Q.; Wang, P.; Wang, D. Complex spectral mapping for single-and multi-channel speech enhancement and robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1778–1787.
39. Kim, H.; Kang, K.; Shin, J.W. Factorized MVDR Deep Beamforming for Multi-Channel Speech Enhancement. IEEE Signal Process. Lett. 2022, 29, 1898–1902.
40. Markovic, D.; Defossez, A.; Richard, A. Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain. arXiv 2022, arXiv:2206.15423.
41. Luo, Y.; Chen, Z.; Mesgarani, N.; Yoshioka, T. End-to-end microphone permutation and number invariant multi-channel speech separation. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6394–6398.
42. Luo, Y.; Mesgarani, N. Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 3071–3075.
43. Liu, W.; Li, A.; Wang, X.; Yuan, M.; Chen, Y.; Zheng, C.; Li, X. A Neural Beamspace-Domain Filter for Real-Time Multi-Channel Speech Enhancement. Symmetry 2022, 14, 1081.
44. Cohen, I. Relative transfer function identification using speech signals. IEEE Trans. Speech Audio Process. 2004, 12, 451–459.
45. Varzandeh, R.; Taseska, M.; Habets, E.A. An iterative multichannel subspace-based covariance subtraction method for relative transfer function estimation. In Proceedings of the 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), San Francisco, CA, USA, 1–3 March 2017; pp. 11–15.
46. Zhang, J.; Heusdens, R.; Hendriks, R.C. Relative acoustic transfer function estimation in wireless acoustic sensor networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1507–1519.
47. Pak, J.; Shin, J.W. Sound localization based on phase difference enhancement using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1335–1345.
48. Song, H.; Shin, J.W. Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 671–675.
49. Cohen, I.; Berdugo, B. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett. 2002, 9, 12–15.
50. Cohen, I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 2003, 11, 466–475.
51. Breithaupt, C.; Gerkmann, T.; Martin, R. A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing. In Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, 31 March–4 April 2008; pp. 4897–4900.
52. Roy, R.; Kailath, T. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 984–995.
53. Schmidt, R. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280.
54. Markovich-Golan, S.; Gannot, S.; Kellermann, W. Performance analysis of the covariance-whitening and the covariance-subtraction methods for estimating the relative transfer function. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2499–2503.
55. Noll, A.M. Cepstrum pitch determination. J. Acoust. Soc. Am. 1967, 41, 293–309.
56. Gerkmann, T.; Martin, R. On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Trans. Signal Process. 2009, 57, 4165–4174.
57. Knapp, C.; Carter, G. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 320–327.
58. Vincent, E.; Watanabe, S.; Nugraha, A.A.; Barker, J.; Marxer, R. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Comput. Speech Lang. 2017, 46, 535–557.
59. Gerkmann, T.; Breithaupt, C.; Martin, R. Improved a posteriori speech presence probability estimation based on a likelihood ratio with fixed priors. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 910–919.
60. International Telecommunication Union. Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs; International Telecommunication Union: Geneva, Switzerland, 2007.
61. Jensen, J.; Taal, C.H. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 2009–2022.
62. Le Roux, J.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR–half-baked or well done? In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 626–630.
PESQ scores by noise type:

| Method | BUS | CAF | PED | STR | Avg. |
|---|---|---|---|---|---|
| Noisy | 1.32 | 1.24 | 1.26 | 1.28 | 1.27 |
| CDR-MWF [26] | 1.93 | 1.74 | 1.86 | 1.74 | 1.82 |
| CDR-Proposed | 2.35 | 2.15 | 2.18 | 2.17 | 2.21 |
| DNN-REMWF [27] | 2.10 | 2.01 | 2.12 | 2.03 | 2.07 |
| DNN-REMKF [27] | 2.13 | 2.03 | 2.15 | 2.07 | 2.10 |
| DNN-Proposed | 2.43 | 2.24 | 2.28 | 2.29 | 2.31 |
eSTOI (×100) scores by noise type:

| Method | BUS | CAF | PED | STR | Avg. |
|---|---|---|---|---|---|
| Noisy | 71.0 | 66.0 | 68.8 | 67.2 | 68.2 |
| CDR-MWF [26] | 83.1 | 79.7 | 82.4 | 80.2 | 81.4 |
| CDR-Proposed | 85.1 | 83.4 | 83.7 | 82.4 | 83.6 |
| DNN-REMWF [27] | 83.7 | 82.0 | 83.9 | 83.2 | 83.2 |
| DNN-REMKF [27] | 84.3 | 82.6 | 84.5 | 83.9 | 83.8 |
| DNN-Proposed | 86.0 | 85.3 | 85.9 | 85.0 | 85.5 |
SISDR (dB) by noise type:

| Method | BUS | CAF | PED | STR | Avg. |
|---|---|---|---|---|---|
| Noisy | 6.79 | 7.77 | 8.60 | 6.85 | 7.51 |
| CDR-MWF [26] | 9.68 | 9.44 | 10.69 | 9.75 | 9.89 |
| CDR-Proposed | 14.25 | 14.55 | 14.56 | 13.40 | 14.19 |
| DNN-REMWF [27] | 14.40 | 14.18 | 14.96 | 14.64 | 14.54 |
| DNN-REMKF [27] | 14.71 | 14.42 | 15.25 | 14.99 | 14.84 |
| DNN-Proposed | 15.90 | 16.07 | 16.34 | 15.83 | 16.04 |
PESQ, eSTOI (×100), and SISDR (dB), each grouped into three ranges (−∞, 6.5), (6.5, 8.5), and (8.5, ∞):

| Method | PESQ (−∞, 6.5) | PESQ (6.5, 8.5) | PESQ (8.5, ∞) | eSTOI (−∞, 6.5) | eSTOI (6.5, 8.5) | eSTOI (8.5, ∞) | SISDR (−∞, 6.5) | SISDR (6.5, 8.5) | SISDR (8.5, ∞) |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.22 | 1.26 | 1.33 | 63.3 | 67.1 | 74.2 | 5.04 | 7.48 | 9.74 |
| CDR-MWF [26] | 1.66 | 1.83 | 1.95 | 76.7 | 80.7 | 86.5 | 8.81 | 9.80 | 10.97 |
| CDR-Proposed | 2.07 | 2.22 | 2.33 | 80.6 | 83.2 | 86.9 | 12.35 | 13.95 | 16.14 |
| DNN-REMWF [27] | 1.84 | 2.07 | 2.26 | 79.3 | 82.8 | 87.2 | 12.80 | 14.41 | 16.28 |
| DNN-REMKF [27] | 1.87 | 2.10 | 2.29 | 80.0 | 83.4 | 87.7 | 13.07 | 14.70 | 16.60 |
| DNN-Proposed | 2.15 | 2.32 | 2.44 | 82.7 | 85.2 | 88.5 | 14.38 | 15.83 | 17.78 |
Ablation results (averages over all noise types):

| Method | PESQ | eSTOI | SISDR |
|---|---|---|---|
| DNN-REMWF [27] | 2.07 | 83.2 | 14.54 |
| + | 2.07 | 84.5 | 15.79 |
| + | 2.12 | 85.4 | 16.07 |
| + | 2.19 | 83.8 | 15.52 |
| + (DNN-Proposed) | 2.31 | 85.5 | 16.04 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, M.; Cheong, S.; Song, H.; Shin, J.W. Improved Speech Spatial Covariance Matrix Estimation for Online Multi-Microphone Speech Enhancement. Sensors 2023, 23, 111. https://doi.org/10.3390/s23010111