Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection
Abstract
1. Introduction
- For the first time, we conceptualize sound events as spatio-temporal-frequency entities, analogous to actions in videos; this charts a new research trajectory for the SELD task.
- We propose a novel STFF-Net architecture that cohesively integrates the detection and localization of sound events, simplifies the polyphonic SELD optimization process, and provides a spatio-temporal analysis of sound events.
- We conduct comprehensive experiments on three public synthetic and real spatial audio datasets to validate the effectiveness of our proposed method.
2. Background and Related Works
2.1. Sound Event Localization and Detection
2.2. Attention Mechanisms
3. Methodology
3.1. Spatio-Temporal-Frequency Fusion Network
3.2. Spatio-Temporal-Frequency Volume
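Concretely, the spatio-temporal-frequency volume can be read as the multichannel time-frequency feature stack re-indexed so that 3D convolutions slide jointly over the feature-channel, time, and frequency axes. A minimal sketch of one plausible construction (our assumption; the paper's exact layout may differ), using the 7-channel LinSpecIV features of Section 4.2.1 as an example:

```python
import torch

# x: a batch of SELD features, (batch, feature_channels, time, freq),
# e.g. 7 channels = 4 log spectrograms + 3 intensity-vector components
x = torch.randn(32, 7, 256, 512)

# Treat the feature channels as the depth axis of a single-channel 3D volume,
# so Conv3d kernels couple spatial, temporal, and frequency context jointly.
x_vol = x.unsqueeze(1)        # (batch, 1, feature_channels, time, freq)
```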
3.3. Enhanced-3D Residual Block
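The enhanced-3D residual block follows the standard residual design of He et al. [31] with 3D convolutions in place of 2D ones. A minimal sketch of the basic block, assuming that layout; the paper's enhancement (the non-parametric attention of Algorithm 1, sketched below) is omitted here:

```python
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Basic 3D residual block: two Conv3d-BN layers plus an identity/projection skip."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        # Project the skip path when the shape changes, otherwise pass through.
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch else
                     nn.Sequential(nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
                                   nn.BatchNorm3d(out_ch)))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, depth, time, freq)
        return self.act(self.body(x) + self.skip(x))
```

Applied to the volume from Section 3.2, `ResBlock3D(1, 64)(x_vol)` yields a 64-channel spatio-temporal-frequency feature map.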
Algorithm 1 PyTorch-like Implementation of a Non-parametric Attention Module
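The algorithm body did not survive extraction. Since the module is described as non-parametric attention, a sketch following the published SimAM formulation [30] (our reconstruction, not the authors' exact listing): each neuron is reweighted by the sigmoid of the inverse of its minimal energy, governed by a single regularization coefficient λ and no learnable parameters:

```python
import torch

def nonparam_attention(x, lam=1e-4):
    """SimAM-style non-parametric attention [30].
    x: (B, C, *spatial), e.g. (B, C, D, T, F); lam is the coefficient lambda."""
    spatial_dims = tuple(range(2, x.dim()))
    n = x[0, 0].numel() - 1                                   # neighbors per channel
    d = (x - x.mean(dim=spatial_dims, keepdim=True)).pow(2)   # squared deviation from channel mean
    v = d.sum(dim=spatial_dims, keepdim=True) / n             # channel variance estimate
    e_inv = d / (4 * (v + lam)) + 0.5                         # inverse minimal energy per neuron
    return x * torch.sigmoid(e_inv)                           # reweight, no extra parameters
```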
3.4. Class- and Track-Wise Output Format
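The class- and track-wise format can be pictured as a multi-ACCDOA-style target tensor [36]: each of N tracks carries one Cartesian DOA vector per class, whose length encodes activity. A decoding sketch under that assumption (the 0.5 threshold is illustrative, not the paper's setting):

```python
import torch

def decode_output(y, thresh=0.5):
    """y: (batch, time, tracks, classes, 3) Cartesian ACCDOA vectors."""
    norm = y.norm(dim=-1)                            # vector length encodes class activity
    active = norm > thresh                           # detection: length above threshold
    doa = y / norm.clamp(min=1e-8).unsqueeze(-1)     # localization: unit direction vector
    return active, doa
```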
4. Evaluation
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Data Augmentation
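Of the augmentations cited in this work, SpecAugment-style masking [40] is the simplest to picture: random frequency and time bands of the spectrogram are zeroed. A minimal sketch (mask widths are illustrative, not the paper's settings):

```python
import torch

def spec_augment(x, max_f=8, max_t=16):
    """x: (channels, freq, time). Zero one random frequency band and one time band."""
    F_bins, T_frames = x.shape[-2], x.shape[-1]
    f = torch.randint(0, max_f + 1, (1,)).item()
    f0 = torch.randint(0, max(F_bins - f, 1), (1,)).item()
    t = torch.randint(0, max_t + 1, (1,)).item()
    t0 = torch.randint(0, max(T_frames - t, 1), (1,)).item()
    x = x.clone()
    x[..., f0:f0 + f, :] = 0.0    # frequency mask across all time frames
    x[..., t0:t0 + t] = 0.0       # time mask across all frequency bins
    return x
```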
4.1.3. Evaluation Metrics
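The result tables below use the joint localization/detection metrics of Mesaros et al. [43]: a location-dependent error rate ER20° and F-score F20° (a detection counts as correct only within 20° of the reference DOA), the class-dependent localization error LE_CD and localization recall LR_CD, and their aggregate SELD score:

$$\mathcal{E}_{\mathrm{SELD}} = \frac{1}{4}\left(\mathrm{ER}_{20^{\circ}} + \left(1-\mathrm{F}_{20^{\circ}}\right) + \frac{\mathrm{LE}_{\mathrm{CD}}}{180^{\circ}} + \left(1-\mathrm{LR}_{\mathrm{CD}}\right)\right)$$

Lower is better for ER, LE, and the SELD score; higher is better for F and LR.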
4.1.4. Parameter Configuration
4.2. Experimental Results
4.2.1. Choice of SELD Features
4.2.2. Effect of Attention Modules
4.2.3. Analysis of λ in the Attention Module
4.2.4. Experiments on a Synthetic Spatial Audio Dataset
4.2.5. Experiments on a Real Spatial Audio Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transp. Syst. 2015, 17, 279–288.
- Despotovic, V.; Pocta, P.; Zgank, A. Audio-based Active and Assisted Living: A review of selected applications and future trends. Comput. Biol. Med. 2022, 149, 106027.
- Stowell, D.; Wood, M.D.; Pamuła, H.; Stylianou, Y.; Glotin, H. Automatic acoustic detection of birds through deep learning: The first bird audio detection challenge. Methods Ecol. Evol. 2019, 10, 368–380.
- Elizalde, B.; Zarar, S.; Raj, B. Cross modal audio search and retrieval with joint embeddings based on text and audio. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 4095–4099.
- Zürn, J.; Burgard, W. Self-supervised moving vehicle detection from audio-visual cues. IEEE Robot. Autom. Lett. 2022, 7, 7415–7422.
- He, Y.; Trigoni, N.; Markham, A. SoundDet: Polyphonic moving sound event detection and localization from raw waveform. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 4160–4170.
- Shimada, K.; Takahashi, N.; Koyama, Y.; Takahashi, S.; Tsunoo, E.; Takahashi, M.; Mitsufuji, Y. Ensemble of ACCDOA- and EINV2-Based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection. Technical Report, 2021. Available online: https://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Shimada_117_t3.pdf (accessed on 17 September 2024).
- Wang, Q.; Chai, L.; Wu, H.; Nian, Z.; Niu, S.; Zheng, S.; Wang, Y.; Sun, L.; Fang, Y.; Pan, J.; et al. The NERC-SLIP System for Sound Event Localization and Detection of DCASE2022 Challenge. Technical Report, 2022. Available online: https://dcase.community/documents/challenge2022/technical_reports/DCASE2022_Du_122_t3.pdf (accessed on 17 September 2024).
- Hu, J.; Cao, Y.; Wu, M.; Kong, Q.; Yang, F.; Plumbley, M.D.; Yang, J. A track-wise ensemble event independent network for polyphonic sound event localization and detection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9196–9200.
- Politis, A.; Adavanne, S.; Krause, D.; Deleforge, A.; Srivastava, P.; Virtanen, T. A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain, 15–19 November 2021; pp. 125–129.
- Cao, Y.; Iqbal, T.; Kong, Q.; An, F.; Wang, W.; Plumbley, M.D. An improved event-independent network for polyphonic sound event localization and detection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 885–889.
- Xie, R.; Shi, C.; Le Zhang, Y.L.; Li, H. Ensemble of Attention Based CRNN for Sound Event Detection and Localization. Technical Report, 2022. Available online: https://dcase.community/documents/challenge2022/technical_reports/DCASE2022_Xie_18_t3.pdf (accessed on 17 September 2024).
- Kim, J.S.; Park, H.J.; Shin, W.; Han, S.W. A Robust Framework for Sound Event Localization and Detection on Real Recordings. Technical Report, 2022. Available online: https://dcase.community/documents/challenge2022/technical_reports/DCASE2022_Han_54_t3.pdf (accessed on 17 September 2024).
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Global context networks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6881–6895.
- Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound event detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83.
- Grumiaux, P.A.; Kitić, S.; Girin, L.; Guérin, A. A survey of sound source localization with deep learning methods. J. Acoust. Soc. Am. 2022, 152, 107–151.
- Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 2019, 13, 34–48.
- Politis, A.; Mesaros, A.; Adavanne, S.; Heittola, T.; Virtanen, T. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 684–698.
- Cao, Y.; Kong, Q.; Iqbal, T.; An, F.; Wang, W.; Plumbley, M.D. Polyphonic sound event detection and localization using a two-stage strategy. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA, 25–26 October 2019; pp. 30–34.
- Shimada, K.; Koyama, Y.; Takahashi, N.; Takahashi, S.; Mitsufuji, Y. ACCDOA: Activity-coupled cartesian direction of arrival representation for sound event localization and detection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 915–919.
- Nguyen, T.N.T.; Watcharasupat, K.N.; Nguyen, N.K.; Jones, D.L.; Gan, W.S. SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1749–1762.
- Nguyen, T.N.T.; Jones, D.L.; Watcharasupat, K.N.; Phan, H.; Gan, W.S. SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 716–720.
- Rosero, K.; Grijalva, F.; Masiero, B. Sound events localization and detection using bio-inspired gammatone filters and temporal convolutional neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2314–2324.
- Huang, W.; Huang, Q.; Ma, L.; Chen, Z.; Wang, C. SwG-former: Sliding-window graph convolutional network integrated with Conformer for sound event localization and detection. arXiv 2023, arXiv:2310.14016.
- Guo, M.; Xu, T.; Liu, J.; Liu, Z.; Jiang, P.; Mu, T.; Zhang, S.; Martin, R.; Cheng, M.; Hu, S. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 11863–11874.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 1–9.
- Cao, Y.; Iqbal, T.; Kong, Q.; Zhong, Y.; Wang, W.; Plumbley, M.D. Event-independent network for polyphonic sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan, 2–4 November 2020; pp. 11–15.
- Zhao, S.; Saluev, T.; Jones, D.L. Underdetermined direction of arrival estimation using acoustic vector sensor. Signal Process. 2014, 100, 160–168.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231.
- Shimada, K.; Koyama, Y.; Takahashi, S.; Takahashi, N.; Tsunoo, E.; Mitsufuji, Y. Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 316–320.
- Politis, A.; Shimada, K.; Sudarsanam, P.; Adavanne, S.; Krause, D.; Koyama, Y.; Takahashi, N.; Takahashi, S.; Mitsufuji, Y.; Virtanen, T. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France, 3–4 November 2022; pp. 125–129.
- Guizzo, E.; Marinoni, C.; Pennese, M.; Ren, X.; Zheng, X.; Zhang, C.; Masiero, B.; Uncini, A.; Comminiello, D. L3DAS22 challenge: Learning 3D audio sources in a real office environment. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9186–9190.
- Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13001–13008.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
- Mazzon, L.; Koizumi, Y.; Yasuda, M.; Harada, N. First order ambisonics domain spatial augmentation for DNN-based direction of arrival estimation. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA, 25–26 October 2019; pp. 154–158.
- Wang, Q.; Du, J.; Wu, H.X.; Pan, J.; Ma, F.; Lee, C.H. A four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1251–1264.
- Mesaros, A.; Adavanne, S.; Politis, A.; Heittola, T.; Virtanen, T. Joint measurement of localization and detection of sound events. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 333–337.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
- Mao, Y.; Zeng, Y.; Liu, H.; Zhu, W.; Zhou, Y. ICASSP 2022 L3DAS22 Challenge: Ensemble of ResNet-Conformers with ambisonics data augmentation for sound event localization and detection. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9191–9195.
- Wu, S.; Wang, Y.; Hu, Z.; Liu, J. HAAC: Hierarchical audio augmentation chain for ACCDOA described sound event localization and detection. Appl. Acoust. 2023, 211, 109541.
- Hu, J.; Cao, Y.; Wu, M.; Kong, Q.; Yang, F.; Plumbley, M.D.; Yang, J. Sound event localization and detection for real spatial sound scenes: Event-independent network and data augmentation chains. Technical Report, 2022. Available online: https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Hu_61.pdf (accessed on 17 September 2024).
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
| Feature Components | Format | Abbreviation | # Feature Channels | Scale | TF Alignment |
|---|---|---|---|---|---|
| Log-mel spectrogram, mel-scale GCC-PHAT | MIC | MelSpecGCC | | mel | ✗ |
| Log-linear spectrogram, linear-scale GCC-PHAT | MIC | LinSpecGCC | | linear | ✗ |
| Log-linear spectrogram, NIPD | MIC | SALSA-Lite | | linear | ✓ |
| Log-linear spectrogram, normalized principal eigenvector | MIC/FOA | SALSA | | linear | ✓ |
| Log-mel spectrogram, mel-scale IV | FOA | MelSpecIV | | mel | ✓ |
| Log-linear spectrogram, linear-scale IV | FOA | LinSpecIV | | linear | ✓ |
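For illustration, the FOA intensity-vector features above (MelSpecIV/LinSpecIV) pair per-channel log spectrograms with a unit-normalized acoustic intensity vector computed from the omnidirectional (W) and dipole (X, Y, Z) STFTs. A minimal sketch, assuming a complex STFT tensor of shape (4, F, T) and unit-norm IV normalization (other normalizations exist in the literature):

```python
import torch

def linspec_iv(stft, eps=1e-8):
    """stft: complex (4, freq, time) for FOA channels (W, X, Y, Z)."""
    w, xyz = stft[0], stft[1:]
    iv = torch.real(torch.conj(w).unsqueeze(0) * xyz)   # active intensity, (3, F, T)
    iv = iv / (iv.norm(dim=0, keepdim=True) + eps)      # unit-norm direction per TF bin
    logspec = torch.log(stft.abs().pow(2) + eps)        # 4 log-power spectrograms
    return torch.cat([logspec, iv], dim=0)              # stacked feature channels
```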
| Characteristics | TNSSE | STARSS | L3DAS22 |
|---|---|---|---|
| Spatial audio format | MIC, FOA | MIC, FOA | FOA |
| Sampling rate | 24 kHz | 24 kHz | 32 kHz |
| Type of spatial recordings | synthetic | real | synthetic |
| Moving sources | ✓ | ✓ | ✗ |
| Ambient noise | ✓ | ✓ | ✓ |
| Reverberation | ✓ | ✓ | ✓ |
| Non-target interfering events | ✓ | ✓ | ✗ |
| Maximum degree of polyphony | 3 | 5 | 3 |
| Overlap of same-class events | high | low | low |
| Number of target sound classes | 12 | 13 | 14 |
| Feature | Cutoff Freq. | Micro ER↓ | Micro F↑ | Micro LE↓ | Micro LR↑ | Micro SELD↓ | Macro ER↓ | Macro F↑ | Macro LE↓ | Macro LR↑ | Macro SELD↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MIC format | | | | | | | | | | | |
| MelSpecGCC | 12 kHz | 0.616 | 0.466 | 22.2° | 0.720 | 0.388 | 0.616 | 0.439 | 21.3° | 0.685 | 0.402 |
| LinSpecGCC | 12 kHz | 0.630 | 0.441 | 21.7° | 0.681 | 0.407 | 0.630 | 0.411 | 20.9° | 0.635 | 0.425 |
| MIC SALSA | 4 kHz | 0.529 | 0.532 | 17.3° | 0.690 | 0.351 | 0.529 | 0.473 | 18.5° | 0.697 | 0.365 |
| SALSA-Lite | 2 kHz | 0.500 | 0.567 | 15.5° | 0.691 | 0.332 | 0.500 | 0.515 | 16.6° | 0.703 | 0.343 |
| FOA format | | | | | | | | | | | |
| MelSpecIV | 12 kHz | 0.498 | 0.564 | 15.8° | 0.685 | 0.334 | 0.498 | 0.522 | 16.2° | 0.716 | 0.338 |
| LinSpecIV | 12 kHz | 0.488 | 0.574 | 13.9° | 0.667 | 0.331 | 0.488 | 0.525 | 15.2° | 0.708 | 0.335 |
| FOA SALSA | 9 kHz | 0.503 | 0.560 | 15.4° | 0.664 | 0.341 | 0.503 | 0.513 | 16.4° | 0.695 | 0.347 |
| Approach | Params. | # FOA Mics | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| (’22) revised SELDnet [38] | 7.0 M | single | 0.423 | 0.289 | 0.343 |
| (’22 #1) Hu et al. [9] † | - | double | 0.706 | 0.691 | 0.699 |
| (’22 #2) Mao et al. [45] † | 37.8 M | single | 0.600 | 0.584 | 0.592 |
| CRNN | 13.6 M | single | 0.700 | 0.586 | 0.638 |
| CRNN + SE | 13.7 M | single | 0.710 | 0.588 | 0.643 |
| CRNN + CBAM | 13.7 M | single | 0.705 | 0.585 | 0.640 |
| CRNN + GC | 13.7 M | single | 0.702 | 0.584 | 0.638 |
| CRNN + SimAM | 13.6 M | single | 0.701 | 0.595 | 0.644 |
| Approach | Parameters | Format | Micro ER↓ | Micro F↑ | Micro LE↓ | Micro LR↑ | Micro SELD↓ | Macro ER↓ | Macro F↑ | Macro LE↓ | Macro LR↑ | Macro SELD↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (’21) SELDnet [10] | 0.5 M | MIC | 0.75 | 0.234 | 30.6° | 0.378 | 0.577 | 0.75 | - | - | - | - |
| | | FOA | 0.73 | 0.307 | 24.5° | 0.405 | 0.539 | 0.73 | - | - | - | - |
| (’21 #1) Shimada et al. [7] † | 42 M | FOA | 0.43 | 0.699 | 11.1° | 0.732 | 0.265 | 0.43 | - | - | - | - |
| G-SELD [26] | - | FOA | 0.65 | 0.439 | 22.5° | 0.559 | 0.444 | 0.65 | - | - | - | - |
| HAAC-enhanced EINV2 [46] | 85 M | FOA | 0.49 | 0.60 | 15.26° | 0.70 | 0.321 | 0.49 | - | - | - | - |
| CRNN | 13.7 M | MIC | 0.453 | 0.624 | 14.24° | 0.721 | 0.297 | 0.453 | 0.627 | 13.70° | 0.738 | 0.291 |
| CRNN + CBAM | 13.8 M | MIC | 0.476 | 0.605 | 15.04° | 0.723 | 0.308 | 0.476 | 0.602 | 14.49° | 0.737 | 0.304 |
| CRNN + SimAM | 13.7 M | MIC | 0.446 | 0.633 | 13.96° | 0.710 | 0.295 | 0.446 | 0.633 | 13.11° | 0.731 | 0.289 |
| STFF-Net | 35.8 M | MIC | 0.494 | 0.593 | 16.16° | 0.753 | 0.310 | 0.494 | 0.609 | 15.39° | 0.755 | 0.304 |
| CRNN | 13.7 M | FOA | 0.411 | 0.660 | 12.67° | 0.719 | 0.275 | 0.411 | 0.660 | 12.03° | 0.740 | 0.270 |
| CRNN + CBAM | 13.8 M | FOA | 0.407 | 0.663 | 12.93° | 0.699 | 0.279 | 0.407 | 0.664 | 12.34° | 0.735 | 0.269 |
| CRNN + SimAM | 13.7 M | FOA | 0.394 | 0.682 | 12.12° | 0.738 | 0.260 | 0.394 | 0.675 | 11.52° | 0.758 | 0.256 |
| STFF-Net | 35.8 M | FOA | 0.415 | 0.677 | 11.95° | 0.740 | 0.266 | 0.415 | 0.680 | 11.33° | 0.756 | 0.261 |
| Approach | Parameters | Format | Micro ER↓ | Micro F↑ | Micro LE↓ | Micro LR↑ | Micro SELD↓ | Macro ER↓ | Macro F↑ | Macro LE↓ | Macro LR↑ | Macro SELD↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (’22) SELDnet [37] | 604 k | MIC | 0.71 | 0.36 | - | - | - | 0.71 | 0.18 | 32.2° | 0.47 | 0.560 |
| | | FOA | 0.71 | 0.36 | - | - | - | 0.71 | 0.21 | 29.3° | 0.46 | 0.551 |
| (’22 #2) Hu et al. [47] † | 85 M | FOA | 0.53 | - | - | - | - | 0.53 | 0.481 | 17.8° | 0.626 | 0.381 |
| HAAC-enhanced EINV2 [46] | 85 M | FOA | 0.54 | - | - | - | - | 0.54 | 0.45 | 17.20° | 0.62 | 0.391 |
| SwG-former [27] | 110.8 M | FOA | 0.64 | - | - | - | - | 0.64 | 0.452 | 24.5° | 0.657 | 0.416 |
| SwG-EINV2 [27] | 288.5 M | FOA | 0.63 | - | - | - | - | 0.63 | 0.489 | 20.9° | 0.718 | 0.385 |
| CRNN | 13.7 M | MIC | 0.566 | 0.483 | 21.62° | 0.756 | 0.362 | 0.566 | 0.417 | 21.38° | 0.621 | 0.412 |
| CRNN + CBAM | 13.8 M | MIC | 0.585 | 0.471 | 21.46° | 0.744 | 0.372 | 0.585 | 0.413 | 20.52° | 0.632 | 0.413 |
| CRNN + SimAM | 13.7 M | MIC | 0.545 | 0.515 | 19.27° | 0.737 | 0.350 | 0.545 | 0.439 | 18.77° | 0.598 | 0.403 |
| STFF-Net | 35.8 M | MIC | 0.613 | 0.447 | 23.31° | 0.793 | 0.376 | 0.613 | 0.393 | 21.78° | 0.631 | 0.428 |
| CRNN | 13.7 M | FOA | 0.547 | 0.521 | 19.69° | 0.792 | 0.336 | 0.547 | 0.417 | 18.50° | 0.658 | 0.380 |
| CRNN + CBAM | 13.8 M | FOA | 0.526 | 0.530 | 20.54° | 0.763 | 0.337 | 0.526 | 0.445 | 19.14° | 0.630 | 0.389 |
| CRNN + SimAM | 13.7 M | FOA | 0.531 | 0.532 | 19.33° | 0.771 | 0.334 | 0.531 | 0.481 | 17.31° | 0.654 | 0.373 |
| STFF-Net | 35.8 M | FOA | 0.499 | 0.570 | 18.00° | 0.794 | 0.309 | 0.499 | 0.511 | 17.96° | 0.663 | 0.356 |