Learning Ratio Mask with Cascaded Deep Neural Networks for Echo Cancellation in Laser Monitoring Signals
Abstract
1. Introduction
- For echo cancellation in laser monitoring, we formulate the problem with a simple but effective additive echo noise model. Treating echoes as additive noise under this model, we propose cascaded DNNs (C-DNNs) to learn the ratio mask (see the sketch after this list). Experiments validate that the proposed C-DNNs can update the training target to improve the estimation accuracy of the ratio mask.
- We construct a new database containing rich speech material and are ready to release it to the speech-enhancement/separation research community.
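To make the additive formulation above concrete, here is a minimal sketch of an additive echo model and of applying an estimated ratio mask in the time-frequency domain. The attenuation/delay parameterization, function names, and STFT settings are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def add_echo(s, alpha=0.5, delay_samples=3000, noise_std=0.01):
    """Assumed additive echo model: y[n] = s[n] + alpha * s[n - d] + v[n]."""
    echo = np.zeros_like(s)
    echo[delay_samples:] = alpha * s[:-delay_samples]   # attenuated, delayed copy
    return s + echo + noise_std * np.random.randn(len(s))

def apply_ratio_mask(y, mask, fs=16000, nperseg=512):
    """Scale the mixture's STFT magnitude by an estimated ratio mask
    (same shape as the STFT), keep the mixture phase, and resynthesize."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)
    S_hat = mask * np.abs(Y) * np.exp(1j * np.angle(Y))
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    return s_hat
```

Keeping the mixture phase and resynthesizing with the inverse STFT is the usual practice in mask-based enhancement; only the magnitude is modified by the mask.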
2. Methodology
2.1. Problem Formulation
2.2. Acoustic Features
2.3. Training Targets
2.3.1. Ideal Ratio Mask (IRM)
2.3.2. Spectral Magnitude Mask (SMM)
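For reference, the standard definitions of these two training targets from the supervised speech-separation literature are reproduced below; the paper's own formulation may differ in detail. Here $S$, $N$, and $Y$ denote the clean-speech, noise (echo), and mixture spectra at T-F unit $(t,f)$, and $\beta$ is a tunable exponent (commonly 0.5).

```latex
% Standard ideal ratio mask (IRM):
\mathrm{IRM}(t,f) = \left( \frac{S^2(t,f)}{S^2(t,f) + N^2(t,f)} \right)^{\beta}

% Standard spectral magnitude mask (SMM); unlike the IRM, it is not bounded by 1:
\mathrm{SMM}(t,f) = \frac{|S(t,f)|}{|Y(t,f)|}
```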
2.3.3. Corrected Ratio Mask (CRM)
2.4. Cascaded DNNs (C-DNNs)
2.4.1. Network Architecture
2.4.2. Learning Strategy
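As a rough illustration of the cascade described in the introduction, where a second network refines (corrects) the mask estimated by the first, here is a minimal PyTorch sketch. The layer sizes, depth, activation choices, and the exact wiring between stages are assumptions, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class MaskDNN(nn.Module):
    """One stage: maps an acoustic feature vector to a per-frequency ratio mask."""
    def __init__(self, in_dim, mask_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, mask_dim), nn.Sigmoid(),  # mask values in (0, 1)
        )

    def forward(self, x):
        return self.net(x)

# Assumed cascade wiring: stage 2 sees the input features plus stage 1's
# mask estimate, so it can correct the first-stage output.
feat_dim, mask_dim = 257, 257
stage1 = MaskDNN(feat_dim, mask_dim)
stage2 = MaskDNN(feat_dim + mask_dim, mask_dim)

features = torch.randn(8, feat_dim)             # dummy batch of frames
m1 = stage1(features)                           # initial mask estimate
m2 = stage2(torch.cat([features, m1], dim=1))   # refined mask estimate
```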
3. Experimental Data, Protocol, and Evaluation Metrics
3.1. Experimental Data
3.2. Experimental Protocol
3.3. Evaluation Metrics
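STOI and PESQ are the standard objective measures of intelligibility and quality in this literature, and the score ranges in the result tables below match them. A minimal sketch of computing both with the third-party pystoi and pesq packages follows; the choice of these implementations is an assumption, as the paper does not state which it used.

```python
import numpy as np
from pystoi import stoi  # pip install pystoi
from pesq import pesq    # pip install pesq

fs = 16000                          # sampling rate (PESQ supports 8 or 16 kHz)
clean = np.random.randn(3 * fs)     # placeholder: clean reference signal
enhanced = np.random.randn(3 * fs)  # placeholder: enhanced output signal

stoi_score = stoi(clean, enhanced, fs, extended=False)  # roughly 0..1, higher is better
pesq_score = pesq(fs, clean, enhanced, 'wb')            # wideband MOS-LQO, about 1..4.5
print(f"STOI: {stoi_score:.3f}  PESQ: {pesq_score:.2f}")
```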
4. Experimental Results and Analysis
4.1. The Effect of Echo Intensity
4.2. The Effect of Echo Delay
4.3. The Effect of Training Target
4.4. Comparison with Existing Methods
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
STOI scores at different echo intensities (SNRs):

SNR (dB) | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Unprocessed | 0.830 | 0.863 | 0.875 | 0.890 | 0.906 | 0.921 | 0.933 | 0.941 | 0.952 | 0.956 | 0.964 | 0.912 |
IRM | 0.856 | 0.888 | 0.912 | 0.921 | 0.934 | 0.946 | 0.951 | 0.956 | 0.959 | 0.966 | 0.972 | 0.933 |
SMM | 0.866 | 0.898 | 0.916 | 0.926 | 0.940 | 0.948 | 0.954 | 0.960 | 0.962 | 0.968 | 0.972 | 0.937 |
CRM | 0.866 | 0.898 | 0.914 | 0.924 | 0.939 | 0.948 | 0.955 | 0.960 | 0.962 | 0.967 | 0.973 | 0.938 |
PESQ scores at different echo intensities (SNRs):

SNR (dB) | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Unprocessed | 1.97 | 2.05 | 2.12 | 2.14 | 2.30 | 2.33 | 2.42 | 2.50 | 2.60 | 2.64 | 2.78 | 2.35 |
IRM | 2.18 | 2.32 | 2.46 | 2.54 | 2.69 | 2.78 | 2.85 | 2.92 | 2.96 | 3.08 | 3.12 | 2.72 |
SMM | 2.21 | 2.35 | 2.49 | 2.58 | 2.72 | 2.82 | 2.88 | 2.94 | 3.00 | 3.10 | 3.14 | 2.75 |
CRM | 2.20 | 2.34 | 2.47 | 2.56 | 2.73 | 2.82 | 2.89 | 2.95 | 3.00 | 3.09 | 3.14 | 2.75 |
STOI scores at different echo delays:

Delay (s) | 0.188 | 0.206 | 0.225 | 0.245 | 0.263 | 0.281 |
---|---|---|---|---|---|---|
Unprocessed | 0.948 | 0.945 | 0.942 | 0.940 | 0.938 | 0.937 |
IRM | 0.956 | 0.954 | 0.954 | 0.951 | 0.948 | 0.946 |
SMM | 0.959 | 0.958 | 0.956 | 0.953 | 0.950 | 0.947 |
CRM | 0.959 | 0.959 | 0.957 | 0.953 | 0.951 | 0.948 |
PESQ scores at different echo delays:

Delay (s) | 0.188 | 0.206 | 0.225 | 0.245 | 0.263 | 0.281 |
---|---|---|---|---|---|---|
Unprocessed | 2.51 | 2.50 | 2.50 | 2.51 | 2.52 | 2.52 |
IRM | 2.89 | 2.90 | 2.89 | 2.86 | 2.83 | 2.81 |
SMM | 2.91 | 2.92 | 2.90 | 2.88 | 2.86 | 2.84 |
CRM | 2.91 | 2.93 | 2.91 | 2.88 | 2.87 | 2.84 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).