Robust Detection of Background Acoustic Scene in the Presence of Foreground Speech
Abstract
1. Introduction
2. Previous Work
3. DNN-Based ASC
3.1. Model Architecture
3.2. Multi-Condition Training
3.3. Input Features
4. Proposed Methods
4.1. Noise-Floor-Based Feature for ASC
4.2. Incorporating Speech Presence Information
4.3. Ensemble Methods
5. Results and Discussion
5.1. Dataset
5.2. Training Setup
5.3. Performance Evaluation
- the ResNet baseline system;
- the ResNet system with NF-MFBE (see the feature sketch following this list);
- the ResNet system with both MFBE and NF-MFBE;
- the ResNet system with both MFBE and speech mask;
- the ResNet system with MFBE, NF-MFBE, and speech mask.
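To make the feature pipeline concrete, the following is a minimal sketch of how a noise-floor-based mel filterbank energy (NF-MFBE) feature can be computed. It substitutes a crude sliding-minimum tracker for the optimally smoothed minimum-statistics estimator of [19]; the function name, window length, and all defaults are illustrative rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def nf_mfbe(y, sr=16000, n_fft=512, hop=256, n_mels=40, win_frames=96):
    """Illustrative NF-MFBE: log-mel energies of a noise-floor estimate
    obtained by a sliding per-bin minimum over the power spectrogram."""
    # Power spectrogram of the mixture (foreground speech + background scene).
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2  # (freq, time)

    # Crude noise-floor tracker: per-bin minimum over a trailing window,
    # standing in for the minimum-statistics estimator of [19].
    nf = np.empty_like(S)
    for t in range(S.shape[1]):
        lo = max(0, t - win_frames + 1)
        nf[:, t] = S[:, lo:t + 1].min(axis=1)

    # Mel filterbank energies of the noise floor, then log compression.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ nf + 1e-10)  # (n_mels, time)
```

The mixture MFBE is obtained analogously by applying the same filterbank and log compression to the power spectrogram `S` directly.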
5.3.1. Performance of Small-Footprint Models
- NF-based features benefit certain acoustic scenes and can thus be utilized as an extra input feature for ASC;
- Incorporating speech masks improves ASC accuracy at high SBRs, but discards useful information at low SBRs;
- Ensemble methods in general help ASC, and the best performance is obtained by the ensemble that combines models with the three different input features (a voting and meta-learner sketch follows this list).
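The two ensemble flavors referenced in the result tables, majority voting (the `*_vote` systems) and a trained meta-learner, can be sketched as follows. The logistic-regression meta-learner is one plausible realization, in the spirit of calibrated posterior combination [24,25]; the paper's exact learner and tie-breaking rules may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def majority_vote(posteriors):
    """posteriors: list of (n_clips, n_classes) arrays, one per member model.
    Hard majority vote; ties are broken by the summed posterior."""
    votes = np.stack([p.argmax(axis=1) for p in posteriors])      # (n_models, n_clips)
    soft = np.sum(posteriors, axis=0)                             # (n_clips, n_classes)
    n_classes = posteriors[0].shape[1]
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes)  # (n_classes, n_clips)
    # Small soft-score term so equal vote counts are resolved deterministically.
    return (counts.T + 1e-6 * soft).argmax(axis=1)

def fit_meta_learner(dev_posteriors, dev_labels):
    """Stack member posteriors and train a simple logistic-regression
    meta-learner on held-out data (one plausible 'Meta_learner' system)."""
    X = np.concatenate(dev_posteriors, axis=1)  # (n_clips, n_models * n_classes)
    return LogisticRegression(max_iter=1000).fit(X, dev_labels)
```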
5.3.2. Performance of Large-Footprint Model
- The statistical noise-floor estimate does not convincingly help ASC, presumably because the model is large enough to faithfully extract the relevant information from the features of the mixed signal;
- For the large model, incorporating DL-based speech-mask estimates consistently improves the classification accuracy (a minimal fusion sketch follows this list);
- Ensemble methods bring further benefits to the large-footprint model.
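One plausible way to incorporate a speech-presence mask, given here only as a sketch, is to stack a mel-domain mask with the mixture log-MFBE as a second input channel of the ResNet; the paper's exact fusion strategy may differ.

```python
import numpy as np
import torch

def stack_input_channels(log_mfbe, speech_mask_mel):
    """Stack the mixture log-MFBE and a mel-domain speech-presence mask as
    separate input channels for the ResNet classifier.
    Both arguments: (n_mels, n_frames) numpy arrays; mask values in [0, 1]."""
    x = np.stack([log_mfbe, speech_mask_mel], axis=0)  # (2, n_mels, n_frames)
    return torch.from_numpy(x).float().unsqueeze(0)    # (1, 2, n_mels, n_frames)
```

Channel stacking lets the first convolutional layer learn how strongly to weight the mask against the spectral evidence, instead of hard-gating speech-dominated frames.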
5.3.3. Complexity Considerations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Barchiesi, D.; Giannoulis, D.; Stowell, D.; Plumbley, M.D. Acoustic Scene Classification: Classifying environments from the sounds they produce. IEEE Signal Process. Mag. 2015, 32, 16–34.
2. Eronen, A.; Tuomi, J.; Klapuri, A.; Fagerlund, S.; Sorsa, T.; Lorho, G.; Huopaniemi, J. Audio-based context awareness-acoustic modeling and perceptual evaluation. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), Hong Kong, China, 6–10 April 2003; Volume 5, pp. V–529.
3. Eronen, A.; Peltonen, V.; Tuomi, J.; Klapuri, A.; Fagerlund, S.; Sorsa, T.; Lorho, G.; Huopaniemi, J. Audio-based context recognition. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 321–329.
4. Elizalde, B.; Lei, H.; Friedland, G.; Peters, N. An i-vector based approach for audio scene detection. In Proceedings of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, Online, 31 March–14 April 2013; pp. 114–117.
5. Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798.
6. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132.
7. Mesaros, A.; Heittola, T.; Virtanen, T. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE 2018), Surrey, UK, 19–20 November 2018; pp. 9–13.
8. Abeßer, J. A Review of Deep Learning Based Methods for Acoustic Scene Classification. Appl. Sci. 2020, 10, 2020.
9. Valenti, M.; Squartini, S.; Diment, A.; Parascandolo, G.; Virtanen, T. A convolutional neural network approach for acoustic scene classification. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 1547–1554.
10. Seo, H.; Park, J.; Park, Y. Acoustic scene classification using various pre-processed features and convolutional neural networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE 2019), New York, NY, USA, 25–26 October 2019; pp. 25–26.
11. Jallet, H.; Cakir, E.; Virtanen, T. Acoustic Scene Classification Using CRNN. In Proceedings of the DCASE2017 Challenge, Munich, Germany, 16 November 2017.
12. Mulimani, M.; Koolagudi, S.G. Acoustic scene classification using deep learning architectures. In Proceedings of the 2021 6th International Conference for Convergence in Technology (I2CT), Maharashtra, India, 2–4 April 2021; pp. 1–6.
13. Bisot, V.; Serizel, R.; Essid, S.; Richard, G. Nonnegative Feature Learning Methods for Acoustic Scene Classification. In Proceedings of the DCASE2017 Challenge, Munich, Germany, 16 November 2017.
14. Takahashi, G.; Yamada, T.; Ono, N.; Makino, S. Performance evaluation of acoustic scene classification using DNN-GMM and frame-concatenated acoustic features. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 1739–1743.
15. Song, S.; Desplanques, B.; De Moor, C.; Demuynck, K.; Madhu, N. Robust Acoustic Scene Classification in the Presence of Active Foreground Speech. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 995–999.
16. Song, S.; Desplanques, B.; Demuynck, K.; Madhu, N. SoftVAD in iVector-Based Acoustic Scene Classification for Robustness to Foreground Speech. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; pp. 404–408.
17. Elshamy, S.; Fingscheidt, T. DNN-Based Cepstral Excitation Manipulation for Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1803–1814.
18. Liu, S.; Triantafyllopoulos, A.; Ren, Z.; Schuller, B.W. Towards Speech Robustness for Acoustic Scene Classification. In Proceedings of Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3087–3091.
19. Martin, R. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 2001, 9, 504–512.
20. Rangachari, S.; Loizou, P.C. A noise-estimation algorithm for highly non-stationary environments. Speech Commun. 2006, 48, 220–231.
21. Gerkmann, T.; Hendriks, R.C. Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 1383–1393.
22. Byttebier, L.; Desplanques, B.; Thienpondt, J.; Song, S.; Demuynck, K.; Madhu, N. Small-footprint acoustic scene classification through 8-bit quantization-aware training and pruning of ResNet models. In Proceedings of the DCASE2021 Challenge, Online, 15–19 November 2021.
23. Braun, S.; Gamper, H.; Reddy, C.K.; Tashev, I. Towards Efficient Models for Real-Time Deep Noise Suppression. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 656–660.
24. Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classif. 1999, 10, 61–74.
25. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1321–1330.
26. Reddy, C.K.A.; Dubey, H.; Gopal, V.; Cutler, R.; Braun, S.; Gamper, H.; Aichner, R.; Srinivasan, S. ICASSP 2021 Deep Noise Suppression Challenge. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6623–6627.
27. Huang, J.; Lu, H.; Lopez Meyer, P.; Cordourier, H.; Del Hoyo Ontiveros, J. Acoustic Scene Classification Using Deep Learning-based Ensemble Averaging. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events 2019 (DCASE 2019), New York, NY, USA, 25–26 October 2019; pp. 94–98.
28. Alamir, M.A. A novel acoustic scene classification model using the late fusion of convolutional neural networks and different ensemble classifiers. Appl. Acoust. 2021, 175, 107829.
29. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
30. Smith, L.N. Cyclical Learning Rates for Training Neural Networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472.
| Paper | System | Contribution |
|---|---|---|
| [15] | iVector | Noise-floor-based features; multi-condition training. |
| [16] | iVector | SoftVAD; modified Baum–Welch statistics; score fusion. |
| This paper | ResNet | Noise-floor-based features; CRUSE; meta-learner. |
| Encoder Parameter | Value |
|---|---|
| Number of channels | 16, 32, 64, 128 |
| Kernel size (time, frequency) | (2, 3) for all layers |
| Stride (time, frequency) | (1, 2) for all layers |
| Default activation function | Leaky ReLU, slope |
| Final-layer activation function | Sigmoid |
| Skip connection | Convolution |
| Number of training epochs | 40 |
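The table above translates almost directly into code. Below is a hedged PyTorch sketch of such a CRUSE-style convolutional encoder; the leaky-ReLU slope of 0.2 is an assumed value (the table does not specify it), and the full architecture of [23] additionally contains a recurrent bottleneck, decoder, and skip connections that are omitted here.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Conv encoder per the table: channels 16-32-64-128, kernel (2, 3) and
    stride (1, 2) in (time, frequency) for all layers, leaky-ReLU activations.
    A sketch only; the slope 0.2 is assumed, and [23] differs in detail."""
    def __init__(self, in_ch=1, channels=(16, 32, 64, 128), leaky_slope=0.2):
        super().__init__()
        layers, prev = [], in_ch
        for ch in channels:
            layers += [nn.Conv2d(prev, ch, kernel_size=(2, 3), stride=(1, 2)),
                       nn.LeakyReLU(leaky_slope)]
            prev = ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, in_ch, time, freq)
        return self.net(x)
```

With stride (1, 2), each layer halves the frequency resolution while preserving the time resolution, which keeps the encoder usable for frame-wise mask estimation.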
| Accuracy (%) | Clean | −5 dB | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB |
|---|---|---|---|---|---|---|---|
| | 66.66 ± 0.74 | 62.35 ± 0.40 | 61.64 ± 0.58 | 57.86 ± 0.44 | 54.24 ± 0.93 | 49.71 ± 0.67 | 46.39 ± 0.69 |
| | 63.40 ± 0.64 | 57.69 ± 0.71 | 56.97 ± 0.71 | 55.91 ± 0.95 | 52.35 ± 1.10 | 47.97 ± 0.83 | 41.50 ± 0.87 |
| | 65.34 ± 1.14 | 61.57 ± 1.08 | 60.61 ± 0.72 | 59.11 ± 0.85 | 55.62 ± 1.49 | 52.78 ± 0.78 | 47.92 ± 1.15 |
| NF_vote_1 | 66.93 ± 0.35 | 62.80 ± 0.20 | 62.90 ± 0.56 | 59.92 ± 0.72 | 55.87 ± 0.35 | 51.39 ± 0.58 | 47.10 ± 0.58 |
| NF_vote_2 | 68.17 ± 0.75 | 64.53 ± 0.67 | 63.19 ± 0.66 | 60.25 ± 0.53 | 56.70 ± 0.50 | 53.18 ± 0.48 | 48.71 ± 0.36 |
| | 63.07 ± 0.32 | 60.15 ± 0.96 | 60.04 ± 1.21 | 59.01 ± 0.92 | 56.56 ± 0.58 | 54.56 ± 0.96 | 52.18 ± 1.10 |
| M_vote | 68.66 ± 0.41 | 66.02 ± 0.46 | 64.35 ± 0.49 | 61.65 ± 0.55 | 58.90 ± 0.70 | 56.45 ± 0.77 | 54.19 ± 1.77 |
| | 55.06 ± 1.46 | 56.36 ± 1.52 | 56.09 ± 1.59 | 56.52 ± 1.71 | 55.24 ± 1.66 | 53.21 ± 1.54 | 49.55 ± 1.64 |
| M_NF_vote | 70.48 ± 0.43 | 66.69 ± 1.03 | 65.76 ± 1.58 | 63.29 ± 1.37 | 59.90 ± 1.07 | 56.68 ± 1.00 | 53.80 ± 0.71 |
| Meta_learner | 70.94 ± 0.58 | 67.37 ± 0.92 | 66.29 ± 1.44 | 64.25 ± 1.11 | 61.43 ± 0.95 | 58.60 ± 0.89 | 55.50 ± 1.19 |
| Accuracy (%) | Clean | −5 dB | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB |
|---|---|---|---|---|---|---|---|
| | 71.70 ± 0.53 | 66.99 ± 1.10 | 66.31 ± 1.12 | 64.79 ± 0.98 | 62.14 ± 0.87 | 58.50 ± 1.30 | 53.14 ± 1.54 |
| | 67.81 ± 0.57 | 61.45 ± 1.34 | 61.33 ± 1.88 | 60.34 ± 1.34 | 57.25 ± 1.67 | 53.38 ± 1.49 | 48.73 ± 1.97 |
| | 67.23 ± 1.34 | 61.33 ± 1.18 | 61.67 ± 0.91 | 60.85 ± 1.34 | 56.87 ± 1.48 | 53.47 ± 1.54 | 49.54 ± 1.48 |
| NF_vote_1 | 72.72 ± 0.31 | 67.32 ± 0.68 | 66.59 ± 1.54 | 65.28 ± 1.10 | 62.66 ± 1.22 | 59.75 ± 1.33 | 54.71 ± 1.25 |
| NF_vote_2 | 73.16 ± 0.07 | 67.52 ± 1.07 | 67.75 ± 1.33 | 65.61 ± 0.91 | 62.68 ± 1.06 | 59.06 ± 1.08 | 55.18 ± 0.90 |
| | 70.82 ± 0.79 | 66.69 ± 0.75 | 66.66 ± 0.48 | 65.51 ± 0.84 | 64.41 ± 1.46 | 62.56 ± 1.72 | 59.26 ± 1.80 |
| M_vote | 75.33 ± 0.61 | 70.96 ± 0.87 | 70.47 ± 1.01 | 68.80 ± 0.99 | 67.15 ± 0.66 | 64.45 ± 0.50 | 60.89 ± 0.88 |
| | 64.90 ± 1.09 | 62.63 ± 1.42 | 62.90 ± 1.96 | 62.19 ± 1.90 | 60.52 ± 1.74 | 59.56 ± 1.36 | 56.61 ± 1.05 |
| M_NF_vote | 74.73 ± 0.51 | 70.00 ± 0.91 | 69.99 ± 1.06 | 68.90 ± 1.18 | 67.57 ± 0.77 | 64.38 ± 0.72 | 60.36 ± 1.09 |
| Meta_learner | 74.91 ± 0.66 | 71.43 ± 0.82 | 70.65 ± 0.86 | 69.64 ± 1.07 | 67.65 ± 0.99 | 65.81 ± 0.61 | 61.94 ± 1.17 |
| Non-Zero Parameters/MACs | Small-Footprint Systems | Large-Footprint Systems |
|---|---|---|
| | 336 KB/1.58 G | 4450 KB/24.72 G |
| | 336 KB/1.58 G | 4450 KB/24.72 G |
| | 336 KB/1.58 G | 4450 KB/24.72 G |
| NF_vote_1 | 672 KB/3.16 G | 8900 KB/49.44 G |
| NF_vote_2 | 672 KB/3.16 G | 8900 KB/49.44 G |
| | 336 KB/1.58 G | 4450 KB/24.72 G |
| M_vote | 672 KB/3.16 G | 8900 KB/49.44 G |
| | 336 KB/1.58 G | 4450 KB/24.72 G |
| M_NF_vote | 1008 KB/4.74 G | 13,350 KB/74.16 G |
| Meta_learner | 1008 KB/4.74 G | 13,350 KB/74.16 G |
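Since the models are pruned and quantized [22], the table reports non-zero parameters, and an ensemble's cost is simply the sum over its members. A small PyTorch helper matching that column is sketched below; MAC counts would require a profiler and are not attempted here.

```python
import torch

def count_nonzero_params(model: torch.nn.Module) -> int:
    """Non-zero parameter count of a (possibly pruned) model, matching the
    'Non-Zero Parameters' column of the complexity table."""
    return int(sum(torch.count_nonzero(p).item() for p in model.parameters()))
```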