Effective Sample Selection and Enhancement of Long Short-Term Dependencies in Signal Detection: HDC-Inception and Hybrid CE Loss
Abstract
1. Introduction
1.1. Background and Significance
1.2. Current Status of the Problem
- The first challenge involves obtaining richer long short-term dependencies. As described in Refs. [21,31], audio signals (such as sound events) have different durations, so multi-scale information is indispensable for signal detection. Although SS-FCN [29] demonstrated that dilated convolution is effective for modeling temporal context, a single dilated convolutional layer in SS-FCN can only capture temporal dependencies of a single length. MS-FCN captures multi-scale information by merging the features output by dilated convolutions at different layers, but the scales it covers are limited by the number of dilated convolutional layers. Similarly, MSFF-Net [30] can only capture and merge a limited number of temporal dependencies. Thus, a network is required that can obtain richer temporal context information and merge it to capture richer long short-term dependencies; the receptive-field calculation after this list makes this limitation concrete.
- The second challenge involves selecting effective samples (support vectors) and using them to guide training. As audio signal detection–classification is treated as a frame-wise prediction task, samples are defined over time frames (or segments) of several tens of milliseconds. In a recording, sound events such as gunshots and glass breaking usually have extremely short durations, so background samples account for a dominant proportion, resulting in a high class imbalance between foreground and background in audio signal detection and classification datasets. Class imbalance also exists among different kinds of audio signals; e.g., in the TUT-SED 2017 dataset, the duration of car events exceeds that of the brakes squeaking, children, large vehicle, and people speaking events combined. Moreover, the background contains long silent clips that are highly similar and easily classified, whereas some events, e.g., gunshot, glass break, car, and large vehicle, have small sample sizes, high inter-sample similarity, and high concurrency, making them difficult to detect. In short, a serious class imbalance exists in these datasets: some events have large sample sizes and are easy to detect, contributing little to training, while others have small sample sizes and are difficult to fit, requiring more attention from the network. How to manage these two types of samples and allocate different levels of attention during training to improve detection performance is a significant challenge.
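To make the limitation described in the first challenge concrete, the temporal receptive field of stacked dilated 1-D convolutions follows the standard closed-form expression below; the kernel size and dilation factors in the example are illustrative values, not the exact configurations of SS-FCN or MS-FCN.

$$ R \;=\; 1 + \sum_{l=1}^{L} (k_l - 1)\, d_l $$

For instance, a single dilated layer with kernel size $k = 3$ and dilation $d = 8$ covers $R = 1 + 2 \cdot 8 = 17$ frames, but only at that single scale and with element-skipping gaps between the sampled positions, whereas stacking layers with $d = 1, 2, 4, 8$ yields $R = 1 + 2(1 + 2 + 4 + 8) = 31$ frames while also covering the intermediate scales.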
1.3. Proposed Solutions and Contributions
- For the first challenge, this paper proposes the HDC-Inception module to capture richer long short-term dependencies, combining the advantages of the Inception module [32,33] and the hybrid dilated convolution (HDC) framework [34]. In the Inception module (shown in Figure 1a), several parallel paths are constructed, and convolutional filters of different sizes are used in these paths to capture and fuse features with different views. In our earlier SS-FCN [29], dilated convolution was used to model temporal context, as it ensures high time resolution and captures longer temporal dependencies without altering the filter size or network depth. Combining the Inception module and dilated convolution naturally yields the Dilated-Inception module (shown in Figure 1b), which keeps the architecture of the Inception module but replaces the convolutional filters of different sizes with dilated convolutions of different dilation factors. However, the Dilated-Inception module suffers from the "gridding issue" [34] caused by the element-skipping sampling of standard dilated convolution, under which some neighboring information is ignored [35]. Fortunately, the "gridding issue" can be alleviated by HDC [34], which uses a range of dilation factors and concatenates them serially to gradually enlarge the receptive field. Based on these analyses, we propose the HDC-Inception module (shown in Figure 1c), which combines the advantages of the Inception module and HDC. The proposed module has several parallel paths that use dilated convolutions with different dilation factors to capture different temporal dependencies. Inspired by HDC, the dilation factors increase path by path, and the output features of the previous path are fed into the next path to alleviate the "gridding issue". Features from all paths are concatenated and passed through a convolutional filter to fuse the information, and skip connections are used between HDC-Inception modules to alleviate gradient vanishing. By stacking HDC-Inception modules, temporal dependencies of diverse durations can be acquired, enabling the network to capture more comprehensive temporal context (see the first sketch following this list).
- For the second challenge, this paper proposes the soft margin cross-entropy loss (soft margin CE loss) to adaptively select effective samples (support vectors) and use them to guide training, inspired by the soft margin support vector machine (soft margin SVM) [36,37]. In an SVM, when samples are not linearly separable, a non-linear kernel maps them into a feature space in which a linear separating hyperplane can be found. Since neural networks can be regarded as the non-linear kernels of an SVM, in the following discussion of SVMs and loss functions, samples are assumed to lie in the space after this non-linear mapping. The SVM (shown in Figure 2a) [36,37] finds a hyperplane that separates the samples into two classes with the maximum margin. Relative to this hyperplane, samples are divided into support vectors and non-support vectors, and only the support vectors are used to optimize the separating hyperplane. However, the SVM is only feasible when the samples are linearly separable. To overcome this limitation, the soft margin SVM (shown in Figure 2b) allows misclassified samples and minimizes their errors. Inspired by the soft margin SVM, we propose the soft margin CE loss (shown in Figure 2c). Like the soft margin SVM, it presupposes a separating hyperplane and sets a margin that divides samples into support vectors and non-support vectors; the support vectors are then used to compute the optimal separating hyperplane. Through identity transforms, this computation is converted into an optimization problem that can serve as the loss function of a neural network. Although the soft margin CE loss originates from the soft margin SVM, which only handles binary classification, it applies to multi-class tasks. Combining the advantages of the soft margin CE loss and the standard CE loss, this paper also proposes a hybrid CE loss, which uses all samples during training while focusing on the support vectors (see the second sketch following this list).
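The structure of the HDC-Inception module described above can be summarized in the following minimal PyTorch-style sketch. The channel width, kernel size, and dilation factors (1, 2, 4, 8), as well as the class name `HDCInception1D`, are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HDCInception1D(nn.Module):
    """Sketch of an HDC-Inception block over the time axis.

    Each parallel path applies a dilated 1-D convolution; dilation factors grow
    path by path, and each path also receives the previous path's output
    (HDC-style chaining) to alleviate the "gridding issue". All path outputs are
    concatenated and fused by a 1x1 convolution, and a skip connection adds the
    block input to the fused output. Hyperparameters here are illustrative only.
    """

    def __init__(self, channels: int, dilations=(1, 2, 4, 8), kernel_size: int = 3):
        super().__init__()
        self.paths = nn.ModuleList()
        for i, d in enumerate(dilations):
            in_ch = channels if i == 0 else 2 * channels  # block input + previous path
            self.paths.append(nn.Sequential(
                nn.Conv1d(in_ch, channels, kernel_size,
                          padding=d * (kernel_size - 1) // 2, dilation=d),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            ))
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, channels, time)
        outputs, prev = [], None
        for i, path in enumerate(self.paths):
            inp = x if i == 0 else torch.cat([x, prev], dim=1)
            prev = path(inp)                   # same time length as the input
            outputs.append(prev)
        fused = self.fuse(torch.cat(outputs, dim=1))
        return fused + x                       # skip connection
```

A stack of such blocks, e.g. `nn.Sequential(*[HDCInception1D(64) for _ in range(6)])` applied to features of shape `(batch, channels, time)`, would expose progressively longer temporal dependencies while the 1×1 fusion and skip connection keep the channel width and time resolution unchanged.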
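The sample-selection idea behind the soft margin CE loss and the hybrid CE loss can likewise be sketched as follows. The exact selection rule, margin definition, and weighting used in the paper are not reproduced here; treating a frame–class entry as a support vector when its score lies within a margin of the decision boundary (or on the wrong side of it), and blending the masked loss with the ordinary CE via a weight `lam`, is one plausible reading stated purely for illustration.

```python
import torch
import torch.nn.functional as F

def hybrid_ce_loss(logits, targets, margin: float = 0.3, lam: float = 0.5):
    """Sketch of a soft-margin / hybrid cross-entropy loss for frame-wise
    multi-label detection (assumed formulation, not the paper's exact one).

    logits, targets: tensors of shape (batch, time, classes); targets are
    float values in {0, 1}.
    """
    probs = torch.sigmoid(logits)
    per_entry_ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")

    # Signed distance to the 0.5 decision boundary: positive when correctly
    # classified, negative when misclassified.
    signed = torch.where(targets > 0.5, probs - 0.5, 0.5 - probs)
    support = (signed < margin).float()        # inside the margin or misclassified

    soft_margin_ce = (per_entry_ce * support).sum() / support.sum().clamp(min=1.0)
    plain_ce = per_entry_ce.mean()
    return lam * soft_margin_ce + (1.0 - lam) * plain_ce
```

With `lam = 1.0` the function reduces to a soft-margin-style loss over the selected support vectors only, and with `lam = 0.0` it falls back to the standard CE over all frames; intermediate values use all samples while focusing on the support vectors, as described above.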
2. Related Works
2.1. Audio Signal Detection and Classification
2.2. Inception Module and Hybrid Dilated Convolution
2.3. Soft Margin Support Vector Machine
3. Method
3.1. HDC-Inception
3.2. Soft Margin CE Loss
3.3. Hybrid CE Loss
3.4. Proposed Method
3.4.1. Preprocessing
3.4.2. Convolutional Layers
3.4.3. HDC-Inception Layers
3.4.4. Prediction Layer
3.4.5. Loss Function
4. Experiments
4.1. Dataset
4.2. Metrics
4.3. Baseline
4.4. Experimental Setup
5. Results and Discussion
5.1. Ablation Studies
5.1.1. HDC-Inception Module
5.1.2. Soft Margin CE Loss
5.1.3. Hybrid CE Loss
5.2. Comparisons with Existing Methods
5.2.1. TUT Rare Sound Event 2017 Dataset
5.2.2. TUT-SED 2017 Dataset
5.2.3. TUT-SED 2016 Dataset
5.2.4. Computational Costs
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Crocco, M.; Cristani, M.; Trucco, A.; Murino, V. Audio Surveillance: A Systematic Review. arXiv 2014, arXiv:1409.7787. [Google Scholar] [CrossRef]
- Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Reliable detection of audio events in highly noisy environments. Pattern Recognit. Lett. 2015, 65, 22–28. [Google Scholar] [CrossRef]
- Zhang, S.; Li, X.; Zhang, C. Neural Network Quantization Methods for Voice Wake up Network. arXiv 2021, arXiv:1808.06676. [Google Scholar] [CrossRef]
- Xu, C.; Rao, W.; Wu, J.; Li, H. Target Speaker Verification with Selective Auditory Attention for Single and Multi-Talker Speech. arXiv 2021, arXiv:2103.16269. [Google Scholar] [CrossRef]
- Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 2020, 60, 101027. [Google Scholar] [CrossRef]
- Chu, S.; Narayanan, S.; Kuo, C.C. Environmental sound recognition with time—Frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158. [Google Scholar] [CrossRef]
- Salamon, J.; Bello, J.P. Feature learning with deep scattering for urban sound analysis. In Proceedings of the 2015 23rd European Signal Processing Conference, EUSIPCO 2015, Nice, France, 31 August–4 September 2015; pp. 724–728. [Google Scholar] [CrossRef]
- Stowell, D.; Clayton, D. Acoustic event detection for multiple overlapping similar sources. arXiv 2015, arXiv:1503.07150. [Google Scholar] [CrossRef]
- Huang, Y.; Cui, H.; Hou, Y.; Hao, C.; Wang, W.; Zhu, Q.; Li, J.; Wu, Q.; Wang, J. Space-Based Electromagnetic Spectrum Sensing and Situation Awareness. Space Sci. Technol. 2024, 4, 0109. [Google Scholar] [CrossRef]
- Xu, L.; Song, G. A Recursive Parameter Estimation Algorithm for Modeling Signals with Multi-frequencies. Circuits, Syst. Signal Process. 2020, 39, 4198–4224. [Google Scholar] [CrossRef]
- Wan, Z.; Yang, R.; Huang, M.; Zeng, N.; Liu, X. A review on transfer learning in EEG signal analysis. Neurocomputing 2021, 421, 1–14. [Google Scholar] [CrossRef]
- Zhang, Z.; Luo, H.; Wang, C.; Gan, C.; Xiang, Y. Automatic Modulation Classification Using CNN-LSTM Based Dual-Stream Structure. IEEE Trans. Veh. Technol. 2020, 69, 13521–13531. [Google Scholar] [CrossRef]
- Heittola, T.; Mesaros, A.; Eronen, A.J.; Virtanen, T. Acoustic event detection in real life recordings. In Proceedings of the European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, 9–13 September 2013. [Google Scholar]
- Gencoglu, O.; Virtanen, T.; Huttunen, H. Recognition of acoustic events using deep neural networks. In Proceedings of the European Signal Processing Conference, Lisbon, Portugal, 13 November 2014. [Google Scholar]
- Cakir, E.; Heittola, T.; Huttunen, H.; Virtanen, T. Polyphonic sound event detection using multi label deep neural networks. In Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 12–17 July 2015. [Google Scholar] [CrossRef]
- Zhang, H.; McLoughlin, I.; Song, Y. Robust sound event recognition using convolutional neural networks. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, QLD, Australia, 19–24 April 2015; pp. 559–563. [Google Scholar] [CrossRef]
- Phan, H.; Hertel, L.; Maass, M.; Mertins, A. Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv 2016, arXiv:1604.06338. [Google Scholar] [CrossRef]
- Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent neural networks for polyphonic sound event detection in real life recordings. arXiv 2016, arXiv:1604.00861v1. [Google Scholar] [CrossRef]
- Cakir, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. arXiv 2017, arXiv:1702.06286. [Google Scholar] [CrossRef]
- Lu, R.; Duan, Z.; Zhang, C. Multi-Scale Recurrent Neural Network for Sound Event Detection. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 131–135. [Google Scholar] [CrossRef]
- Zhang, J.; Ding, W.; Kang, J.; He, L. Multi-scale time-frequency attention for acoustic event detection. arXiv 2019, arXiv:1904.00063. [Google Scholar] [CrossRef]
- Gong, Y.; Chung, Y.; Glass, J. Ast: Audio spectrogram transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar]
- Li, K.; Song, Y.; Dai, L.; McLoughlin, I.; Fang, X.; Liu, L. AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
- Kong, Q.; Xu, Y.; Wang, W.; Plumbley, M. Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2450–2460. [Google Scholar] [CrossRef]
- Ye, Z.; Wang, X.; Liu, H.; Qian, Y.; Tao, R.; Yan, L.; Ouchi, K. Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection. arXiv 2021, arXiv:2110.02011. [Google Scholar] [CrossRef]
- Wakayama, K.; Saito, S. Cnn-Transformer with Self-Attention Network for Sound Event Detection. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; pp. 6332–6336. [Google Scholar]
- Wang, M.; Yao, Y.; Qiu, H.; Song, X. Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection. Symmetry 2022, 14, 366. [Google Scholar] [CrossRef]
- Pankajakshan, A. Sound Event Detection by Exploring Audio Sequence Modelling. Ph.D. Thesis, Queen Mary University of London, London, UK, 2023. [Google Scholar]
- Wang, Y.; Zhao, G.; Xiong, K.; Shi, G.; Zhang, Y. Multi-Scale and Single-Scale Fully Convolutional Networks for Sound Event Detection. Neurocomputing 2021, 421, 51–65. [Google Scholar] [CrossRef]
- Wang, Y.; Zhao, G.; Xiong, K.; Shi, G. MSFF-Net: Multi-scale feature fusing networks with dilated mixed convolution and cascaded parallel framework for sound event detection. Digit. Signal Process. A Rev. J. 2022, 122, 103319. [Google Scholar] [CrossRef]
- Wang, W.; Kao, C.C.; Wang, C. A simple model for detection of Rare Sound Events. arXiv 2018, arXiv:1808.06676. [Google Scholar] [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2016, arXiv:1512.00567. [Google Scholar] [CrossRef]
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. arXiv 2018, arXiv:1702.08502. [Google Scholar] [CrossRef]
- Li, J.; Yu, Z.L.; Gu, Z.; Liu, H.; Li, Y. Dilated-Inception Net: Multi-Scale Feature Aggregation for Cardiac Right Ventricle Segmentation. IEEE Trans. Biomed. Eng. 2019, 66, 3499–3508. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Platt, J.C.; Labs, R. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines Review. In Advances in Kernel Methods: Support Vector Learning; MIT Press: Cambridge, MA, USA, 1997. [Google Scholar]
- Mesaros, A.; Heittola, T.; Dikmen, O.; Virtanen, T. Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, QLD, Australia, 19–24 April 2015. [Google Scholar] [CrossRef]
- Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Hayashi, T.; Watanabe, S.; Toda, T.; Hori, T.; Le Roux, J.; Takeda, K. Duration-Controlled LSTM for Polyphonic Sound Event Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2059–2070. [Google Scholar] [CrossRef]
- Adavanne, S.; Politis, A.; Virtanen, T. Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-Channel Features. arXiv 2018, arXiv:1801.09522v1. [Google Scholar] [CrossRef]
- Kao, C.C.; Wang, W.; Sun, M.; Wang, C. R-CRNN: Region-based convolutional recurrent neural network for audio event detection. arXiv 2018, arXiv:1808.06627. [Google Scholar] [CrossRef]
- Huang, G.; Heittola, T.; Virtanen, T. Using sequential information in polyphonic sound event detection. In Proceedings of the 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, Tokyo, Japan, 17–20 September 2018; pp. 291–295. [Google Scholar] [CrossRef]
- Li, Y.; Liu, M.; Drossos, K.; Virtanen, T. Sound Event Detection via Dilated Convolutional Recurrent Neural Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 286–290. [Google Scholar] [CrossRef]
- Baade, A.; Peng, P.; Harwath, D. MAE-AST: Masked Autoencoding Audio Spectrogram Transformer. arXiv 2022, arXiv:2203.16691. [Google Scholar]
- Alex, T.; Ahmed, S.; Mustafa, A.; Awais, M.; Jackson, P. Max-Ast: Combining Convolution, Local and Global Self-Attentions for Audio Event Classification. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 1061–1065. [Google Scholar]
- Gong, Y.; Chung, Y.; Glass, J. PSLA: Improving Audio Tagging with Pretraining. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3292–3306. [Google Scholar] [CrossRef]
- Gong, Y.; Lai, C.; Chung, Y.; Glass, J. SSAST: Self-Supervised Audio Spectrogram Transformer. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual Event, 1–22 March 2022; Volume 36, pp. 10699–10709. [Google Scholar]
- Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. arXiv 2015, arXiv:1409.1259. [Google Scholar] [CrossRef]
- Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Improved inception-residual convolutional neural network for object recognition. arXiv 2020, arXiv:1712.09888. [Google Scholar] [CrossRef]
- Liu, W.; Chen, J.; Li, C.; Qian, C.; Chu, X.; Hu, X. A cascaded inception of inception network with attention modulated feature fusion for human pose estimation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Cho, S.; Foroosh, H. Spatio-Temporal Fusion Networks for Action Recognition. arXiv 2019, arXiv:1906.06822. [Google Scholar] [CrossRef]
- Hussein, N.; Gavves, E.; Smeulders, A.W. Timeception for complex action recognition. arXiv 2019, arXiv:1812.01289. [Google Scholar] [CrossRef]
- Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal pyramid network for action recognition. arXiv 2020, arXiv:2004.03548. [Google Scholar] [CrossRef]
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Mesaros, A.; Heittola, T.; Diment, A.; Elizalde, B.; Shah, A.; Vincent, E.; Raj, B.; Virtanen, T. DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System. In Proceedings of the DCASE 2017—Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16–17 November 2017. [Google Scholar]
- Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the European Signal Processing Conference 2016, Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132. [Google Scholar] [CrossRef]
- Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for Polyphonic Sound Event Detection. Appl. Sci. 2016, 6, 162. [Google Scholar] [CrossRef]
- Shen, Y.H.; He, K.X.; Zhang, W.Q. Learning how to listen: A temporal-frequential attention model for sound event detection. arXiv 2019, arXiv:1810.11939. [Google Scholar] [CrossRef]
- Cakir, E.; Virtanen, T. Convolutional Recurrent Neural Networks for Rare Sound Event Detection. In Proceedings of the DCASE 2017—Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017; pp. 1–5. [Google Scholar]
- Lim, H.; Park, J.; Lee, K.; Han, Y. Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks. In Proceedings of the DCASE 2017 Proceedings—Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017; pp. 2–6. [Google Scholar]
- Baumann, J.; Lohrenz, T.; Roy, A.; Fingscheidt, T. Beyond the Dcase 2017 Challenge on Rare Sound Event Detection: A Proposal for a More Realistic Training and Test Framework. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 611–615. [Google Scholar] [CrossRef]
- Lu, R. Bidirectional GRU for Sound Event Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017; pp. 1–4. [Google Scholar]
- Zhou, J. Sound Event Detection in Multichannel Audio LSTM Network. arXiv 2017. [Google Scholar]
- Chen, Y.; Zhang, Y.; Duan, Z. Dcase2017 Sound Event Detection Using Convolutional Neural Networks. In Proceedings of the DCASE 2017—Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017. [Google Scholar]
- Adavanne, S.; Virtanen, T. A report on sound event detection with different binaural features. arXiv 2017, arXiv:1710.02997. [Google Scholar] [CrossRef]
- Yang, H.; Luo, L.; Wang, M.; Song, X.; Mi, F. Sound Event Detection Using Multi-Scale Dense Convolutional Recurrent Neural Network with Lightweight Attention. In Proceedings of the 2023 3rd International Conference on Electronic Information Engineering and Computer, EIECT 2023, Shenzhen, China, 17–19 November 2023; pp. 35–40. [Google Scholar]
- Lu, Z.; Xie, H.; Liu, C.; Zhang, Y. Bridging the Gap between Vision Transformers and Convolutional Neural Networks on Small Datasets. Adv. Neural Inf. Process. Syst. 2022, 35, 14663–14677. [Google Scholar]
- Le, T.; Jouvet, P.; Noumeir, R. A Small-Scale Switch Transformer and NLP-Based Model for Clinical Narratives Classification. arXiv 2023, arXiv:2303.12892. [Google Scholar] [CrossRef]
- Panopoulos, I.; Nikolaidis, S.; Venieris, S.; Venieris, I. Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices. In Proceedings of the IEEE Symposium on Computers and Communications, Gammarth, Tunisia, 9–12 July 2023. [Google Scholar]
| TUT Rare Sound Events 2017: Event | Development | Evaluation | TUT-SED 2017: Event | Development | Evaluation | TUT-SED 2016 (Home): Event | Count | TUT-SED 2016 (Residential Area): Event | Count |
|---|---|---|---|---|---|---|---|---|---|
| baby cry | 106 | 42 | brakes squeaking | 52/96.99 s | 24 | (object) Rustling | 60 | (object) Banging | 23 |
| glass break | 96 | 43 | car | 304/2471.14 s | 110 | (object) Snapping | 57 | Bird singing | 271 |
| gunshot | 134 | 53 | children | 44/35.99 s | 19 | Cupboard | 40 | Car passing by | 108 |
| | | | large vehicle | 61/923.74 s | 24 | Cutlery | 76 | Children shouting | 31 |
| | | | people speaking | 89/715.66 s | 47 | Dishes | 151 | People speaking | 52 |
| | | | people walking | 109/1246.95 s | 48 | Drawer | 51 | People walking | 44 |
| | | | | | | Glass jingling | 36 | Wind blowing | 30 |
| | | | | | | Object impact | 250 | | |
| | | | | | | People walking | 54 | | |
| | | | | | | Washing dishes | 84 | | |
| | | | | | | Water tap running | 47 | | |
| Layers | SS-FCN ER (Dev / Eval) | SS-FCN F1 % (Dev / Eval) | SS-FCN+HDC-Inception ER (Dev / Eval) | SS-FCN+HDC-Inception F1 % (Dev / Eval) |
|---|---|---|---|---|
| 4 | 0.15 / 0.27 | 92.03 / 86.74 | 0.12 / 0.18 | 93.84 / 90.99 |
| 5 | 0.17 / 0.20 | 91.59 / 90.16 | 0.10 / 0.16 | 94.38 / 91.43 |
| 6 | 0.12 / 0.17 | 94.12 / 91.47 | 0.11 / 0.16 | 94.52 / 91.87 |
| 7 | 0.10 / 0.18 | 94.81 / 90.60 | 0.11 / 0.17 | 94.12 / 91.58 |
| Layers | Baby Cry (Dev) ER / F1 | Glass Break (Dev) ER / F1 | Gunshot (Dev) ER / F1 | Average (Dev) ER / F1 | Baby Cry (Eval) ER / F1 | Glass Break (Eval) ER / F1 | Gunshot (Eval) ER / F1 | Average (Eval) ER / F1 |
|---|---|---|---|---|---|---|---|---|
| 4 | 0.09 / 95.39 | 0.05 / 97.34 | 0.21 / 88.79 | 0.12 / 93.84 | 0.28 / 86.91 | 0.14 / 92.34 | 0.12 / 93.71 | 0.18 / 90.99 |
| 5 | 0.07 / 96.54 | 0.04 / 97.97 | 0.20 / 88.64 | 0.10 / 94.38 | 0.15 / 92.40 | 0.16 / 91.07 | 0.17 / 90.83 | 0.16 / 91.43 |
| 6 | 0.13 / 93.44 | 0.06 / 97.14 | 0.14 / 92.98 | 0.11 / 94.52 | 0.16 / 92.25 | 0.17 / 91.10 | 0.16 / 92.25 | 0.16 / 91.87 |
| 7 | 0.11 / 94.80 | 0.06 / 96.69 | 0.17 / 90.87 | 0.11 / 94.12 | 0.28 / 86.91 | 0.14 / 92.47 | 0.09 / 95.37 | 0.17 / 91.58 |
| 8 | 0.09 / 95.51 | 0.07 / 96.52 | 0.21 / 88.40 | 0.12 / 93.48 | 0.23 / 88.49 | 0.15 / 91.95 | 0.15 / 92.21 | 0.18 / 90.88 |
| 9 | 0.14 / 93.15 | 0.06 / 97.13 | 0.21 / 88.60 | 0.14 / 92.96 | 0.17 / 91.62 | 0.17 / 91.18 | 0.15 / 92.43 | 0.16 / 91.74 |
| Margin (ε) | Baby Cry (Dev) ER / F1 | Glass Break (Dev) ER / F1 | Gunshot (Dev) ER / F1 | Average (Dev) ER / F1 | Baby Cry (Eval) ER / F1 | Glass Break (Eval) ER / F1 | Gunshot (Eval) ER / F1 | Average (Eval) ER / F1 |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.16 / 91.87 | 0.10 / 95.00 | 0.18 / 90.11 | 0.15 / 92.33 | 0.25 / 87.80 | 0.13 / 92.93 | 0.12 / 93.70 | 0.17 / 91.48 |
| 0.2 | 0.15 / 92.71 | 0.06 / 96.69 | 0.24 / 87.01 | 0.15 / 92.14 | 0.17 / 91.55 | 0.16 / 91.18 | 0.14 / 92.89 | 0.16 / 91.87 |
| 0.3 | 0.17 / 91.55 | 0.10 / 94.51 | 0.18 / 90.64 | 0.15 / 92.23 | 0.21 / 89.47 | 0.15 / 92.01 | 0.11 / 94.50 | 0.15 / 91.99 |
| 0.4 | 0.14 / 93.03 | 0.10 / 94.74 | 0.18 / 89.87 | 0.14 / 92.55 | 0.19 / 90.47 | 0.14 / 92.27 | 0.16 / 91.45 | 0.16 / 91.40 |
| 0.45 | 0.13 / 93.76 | 0.05 / 97.56 | 0.26 / 85.90 | 0.15 / 92.41 | 0.20 / 90.02 | 0.14 / 92.89 | 0.16 / 91.96 | 0.17 / 91.62 |
| Method | Baby Cry (Dev) ER / F1 | Glass Break (Dev) ER / F1 | Gunshot (Dev) ER / F1 | Average (Dev) ER / F1 | Baby Cry (Eval) ER / F1 | Glass Break (Eval) ER / F1 | Gunshot (Eval) ER / F1 | Average (Eval) ER / F1 |
|---|---|---|---|---|---|---|---|---|
| SS-FCN | 0.11 / 94.67 | 0.10 / 94.65 | 0.14 / 93.03 | 0.12 / 94.12 | 0.26 / 86.96 | 0.15 / 92.08 | 0.09 / 95.37 | 0.17 / 91.47 |
| +HDC-Inception | 0.13 / 93.44 | 0.06 / 97.14 | 0.14 / 92.98 | 0.11 / 94.52 | 0.16 / 92.25 | 0.17 / 91.10 | 0.16 / 92.25 | 0.16 / 91.87 |
| +HDC-Inception+SM-CE | 0.17 / 91.55 | 0.10 / 94.51 | 0.18 / 90.64 | 0.15 / 92.23 | 0.21 / 89.47 | 0.15 / 92.01 | 0.11 / 94.50 | 0.15 / 91.99 |
| +HDC-Inception+Hybrid-CE | 0.14 / 93.09 | 0.07 / 96.54 | 0.14 / 92.66 | 0.12 / 94.10 | 0.20 / 90.41 | 0.10 / 94.61 | 0.10 / 95.12 | 0.13 / 93.38 |
| MS-FCN | 0.06 / 96.76 | 0.09 / 95.40 | 0.14 / 92.86 | 0.10 / 95.01 | 0.18 / 90.98 | 0.18 / 89.91 | 0.10 / 95.10 | 0.15 / 92.00 |
| +SM-CE | 0.15 / 92.56 | 0.10 / 95.00 | 0.13 / 93.19 | 0.12 / 93.58 | 0.17 / 91.53 | 0.16 / 91.30 | 0.12 / 93.50 | 0.15 / 92.11 |
| +Hybrid-CE | 0.15 / 92.77 | 0.04 / 97.97 | 0.15 / 92.01 | 0.11 / 94.25 | 0.20 / 90.20 | 0.14 / 92.74 | 0.10 / 95.10 | 0.14 / 92.68 |
| Method | Baby Cry (Dev) ER / F1 | Glass Break (Dev) ER / F1 | Gunshot (Dev) ER / F1 | Average (Dev) ER / F1 | Baby Cry (Eval) ER / F1 | Glass Break (Eval) ER / F1 | Gunshot (Eval) ER / F1 | Average (Eval) ER / F1 |
|---|---|---|---|---|---|---|---|---|
| CRNN+TA [61] | ******* | ******* | ******* | ******* | 0.25 / 87.4 | 0.05 / 97.4 | 0.18 / 90.6 | 0.16 / 91.8 |
| CRNN+Attention [61] | 0.10 / 95.1 | 0.01 / 99.4 | 0.16 / 91.5 | 0.09 / 95.3 | 0.18 / 91.3 | 0.04 / 98.2 | 0.17 / 90.8 | 0.13 / 93.4 |
| Multi-Scale RNN [31] | 0.11 / 94.3 | 0.04 / 97.8 | 0.18 / 90.6 | 0.11 / 94.2 | 0.26 / 86.5 | 0.16 / 92.1 | 0.18 / 91.1 | 0.20 / 89.9 |
| R-CRNN [43] | 0.09 / *** | 0.04 / *** | 0.14 / *** | 0.09 / 95.5 | ******* | ******* | ******* | 0.23 / 87.9 |
| CRNN [62] | ******* | ******* | ******* | 0.14 / 92.9 | 0.18 / 90.8 | 0.10 / 94.7 | 0.23 / 87.4 | 0.17 / 91.0 |
| 1D-CRNN [63] | 0.05 / 97.6 | 0.01 / 99.6 | 0.16 / 91.6 | 0.07 / 96.3 | 0.15 / 92.2 | 0.05 / 97.6 | 0.19 / 89.6 | 0.13 / 93.1 |
| Baumann et al. [64] | ******* | ******* | ******* | ******* | 0.29 / 86.20 | 0.09 / 95.47 | 0.28 / 85.60 | 0.22 / 89.09 |
| MTFA [21] | 0.06 / 96.7 | 0.02 / 99.0 | 0.14 / 92.7 | 0.07 / 96.1 | 0.10 / 95.1 | 0.02 / 98.8 | 0.14 / 92.5 | 0.09 / 95.5 |
| MS-FCN [29] | 0.10 / 95.1 | 0.02 / 99.0 | 0.13 / 93.1 | 0.08 / 95.7 | 0.11 / 94.4 | 0.06 / 96.8 | 0.08 / 96.2 | 0.08 / 95.8 |
| MSFF-Net [30] | 0.10 / 94.97 | 0.03 / 98.60 | 0.11 / 94.43 | 0.08 / 96.00 | 0.10 / 94.82 | 0.05 / 97.59 | 0.08 / 95.97 | 0.08 / 96.13 |
| MS-FCN+Hybrid-CE | 0.12 / 93.9 | 0.20 / 99.0 | 0.12 / 94.0 | 0.08 / 95.6 | 0.11 / 94.4 | 0.05 / 97.4 | 0.06 / 96.8 | 0.07 / 96.2 |
| SS-FCN+HDCi+Hybrid-CE | 0.11 / 94.31 | 0.01 / 99.60 | 0.13 / 93.3 | 0.08 / 95.7 | 0.12 / 93.8 | 0.06 / 96.8 | 0.06 / 97.0 | 0.08 / 95.9 |
| Method | Development ER | Development F1 (%) | Evaluation ER | Evaluation F1 (%) |
|---|---|---|---|---|
| SS-RNN [65] | 0.61 ± 0.003 | 56.7 | 0.825 | 39.6 |
| LSRM [66] | 0.66 | 54.5 | 0.853 | 39.1 |
| CNN [67] | 0.81 | 37 | 0.858 | 30.9 |
| DCASE2017 Baseline [58] | 0.69 | 56.7 | 0.9358 | 42.8 |
| MS-RNN [20] | 0.604 ± 0.001 | *** | *** | *** |
| CRNN [68] | 0.6 | 59 | 0.7914 | 41.7 |
| SS-FCN [29] | 0.5724 ± 0.008 | 61.02 ± 0.63 | 0.8132 ± 0.0145 | 47.03 ± 1.73 |
| MS-FCN [29] | 0.5714 ± 0.0097 | 61.21 ± 0.85 | 0.7843 ± 0.019 | 48.60 ± 0.74 |
| MSFF-Net [30] | 0.5805 ± 0.0047 | 59.76 ± 0.79 | 0.7519 ± 0.0074 | 49.93 ± 1.5 |
| LocEnc100 [28] | *** | *** | 0.90 | 49.86 |
| attn200 [27] | *** | *** | 0.681 | 49.6 |
| MS-AttDenseNet-RNN [69] | *** | *** | 0.76 | 49.6 |
| MS-FCN+Hybrid-CE | 0.5745 ± 0.0044 | 60.27 ± 1.28 | 0.7717 ± 0.0402 | 48.86 ± 0.91 |
| SS-FCN+HDCi+Hybrid-CE | 0.5893 ± 0.0135 | 59.84 ± 0.99 | 0.7662 ± 0.0209 | 49.62 ± 0.96 |
| Method | Development ER | Development F1 (%) | Evaluation ER | Evaluation F1 (%) |
|---|---|---|---|---|
| GMM [59] | 1.13 | 17.9 | *** | *** |
| FNN [15] | 1.32 ± 0.06 | 32.5 ± 1.2 | *** | *** |
| CNN [19] | 1.09 ± 0.06 | 26.4 ± 1.9 | *** | *** |
| RNN [19] | 1.10 ± 0.04 | 29.7 ± 1.4 | *** | *** |
| MS-RNN [20] | 0.82 ± 0.01 | 31.5 ± 0.8 | *** | *** |
| CRNN [19] | 0.95 ± 0.02 | 30.3 ± 1.7 | *** | *** |
| SS-FCN [29] | 0.7807 ± 0.0088 | 41.62 ± 0.94 | 0.9367 ± 0.0084 | 26.66 ± 1.46 |
| MS-FCN [29] | 0.7780 ± 0.0127 | 42.02 ± 1.85 | 0.9328 ± 0.0411 | 25.39 ± 3.05 |
| MSFF-Net [30] | 0.7806 ± 0.0103 | 40.85 ± 0.0179 | 0.9264 ± 0.0099 | 25.95 ± 0.0156 |
| MS-FCN+Hybrid-CE | 0.7769 ± 0.0234 | 40.68 ± 0.02.23 | 0.9287 ± 0.0218 | 27.03 ± 1.92 |
| SS-FCN+HDCi+Hybrid-CE | 0.8078 ± 0.0067 | 37.11 ± 1.41 | 0.9289 ± 0.0118 | 26.33 ± 1.40 |