Article

Effective Sample Selection and Enhancement of Long Short-Term Dependencies in Signal Detection: HDC-Inception and Hybrid CE Loss

1 National Key Laboratory of Science and Technology on Space Microwave, Xi’an Institute of Space Radio Technology, No. 504 East Chang’an Street, Xi’an 710100, China
2 Xi’an Institute of Space Radio Technology, No. 504 East Chang’an Street, Xi’an 710100, China
3 School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3194; https://doi.org/10.3390/electronics13163194
Submission received: 28 June 2024 / Revised: 19 July 2024 / Accepted: 6 August 2024 / Published: 13 August 2024
(This article belongs to the Special Issue Machine Learning Methods for Solving Optical Imaging Problems)

Abstract

Signal detection and classification tasks, especially in the audio domain, suffer from difficulties in capturing long short-term dependencies and in using samples effectively. Firstly, audio signal detection and classification need to classify audio signals and detect their onset and offset times; therefore, obtaining long short-term dependencies is necessary. RNN-based methods have high time complexity, and dilated convolution-based methods suffer from the “gridding issue”; thus, the HDC-Inception module is proposed to extract long short-term dependencies efficiently. Combining the advantages of the Inception module and the hybrid dilated convolution (HDC) framework, the HDC-Inception module can both alleviate the “gridding issue” and obtain long short-term dependencies. Secondly, datasets contain large numbers of silent segments and too many samples of some signal types, which are redundant and easy to detect and, therefore, should not be overly prioritized. Thus, selecting effective samples and guiding the training based on them is of great importance. Inspired by the support vector machine (SVM), and combining the soft margin SVM with cross-entropy loss (CE loss), the soft margin CE loss is proposed. Soft margin CE loss can adaptively select support vectors (effective samples) in datasets and guide training based on the selected samples. To utilize datasets more fully, a hybrid CE loss is proposed. Drawing on the benefits of soft margin CE loss and CE loss, hybrid CE loss guides the training with all samples while giving additional weight to support vectors. Soft margin CE loss and hybrid CE loss can be extended to most classification tasks and offer a wide range of applications and great potential.

1. Introduction

1.1. Background and Significance

In the specialized domain of signal detection and classification, the subdiscipline of audio signal detection/classification is used to identify the signal types in audio recordings and to estimate their onset and offset times (known as audio event detection). This technology has broad application potential, for example in audio surveillance [1,2], voice wakeup [3], speaker verification [4,5], environmental sound recognition [6,7], and bird call detection [8]. Audio signal detection/classification is also highly versatile, with its techniques being applicable in areas such as spatial electromagnetic spectrum awareness [9], signal parameter estimation [10], signal analysis [11], and modulation recognition [12]. This broad applicability and technological versatility underscore its critical role in advancing the capabilities of contemporary signal detection and classification frameworks.

1.2. Current Status of the Problem

The most typical signal detection–classification method is the Gaussian Mixture model–hidden Markov model (GMM-HMM) method [13], in which each type of signal is modeled by a three-state hidden Markov model, and each state is modeled by a Gaussian mixture model. With the development of deep learning, an increasing number of signal detection–classification methods based on deep learning have been proposed, such as feedforward neural networks (FNNs) [14,15], convolutional neural networks (CNNs) [16,17], and recurrent neural networks (RNNs) [18]. In 2017, Cakir et al. proposed the convolutional recurrent neural network (CRNN) [19], which combines CNN and RNN, with the former extracting frequency invariance and the latter modeling temporal dependencies. As signals have different durations, many researchers have given weight to multi-scale information in recent years. Lu et al. proposed the multi-scale recurrent neural network (MS-RNN) [20], which uses the Mel-band energies feature of audio recordings at different time resolutions to capture both fine-grained and long-term dependencies. Zhang et al. proposed the multi-scale time-frequency attention (MTFA) method [21], which extracts features of different scales through convolutional operation and max-pooling. Transformer-based methods have been successfully used for sound event classification and weakly labeled sound event detection, including AST [22], AST-SED [23], CNN-Transformer [24], and SEDT [25,26]. Refs. [27,28] constructed Transformer/self-attention-based models for polyphonic sound event detection, but their results indicate that Transformer/self-attention-based methods do not exhibit significant advantages in polyphonic sound event detection. To accelerate the running speed and improve performance, in our earlier study [29], single-scale fully convolutional networks (SS-FCNs) and multi-scale fully convolutional networks (MS-FCNs) were proposed. The SS-FCN employs dilated convolution to encapsulate temporal context, offering more rapid execution compared to RNNs. However, its effectiveness in signal detection is limited to specific types, as it operates with a single temporal dependency length. Conversely, the MS-FCN integrates a feature fusion module that combines temporal dependencies of varying lengths, enhancing its ability to handle long short-term dependencies and, thus, improving the detection capabilities across a broader range of audio signals [29]. However, MS-FCN neglects fine-grained dependencies and intermediate-length temporal dependencies. To overcome the limitations of MS-FCN [29], in a subsequent manuscript, MSFF-Net [30], the dilated mixed convolution module, and the cascaded parallel module were proposed. The former merges dilated and standard convolutions to model temporal information, while the latter merges temporal dependencies of different lengths in a cascading and parallel manner to capture rich temporal information.
Although audio signal detection and classification have made great progress, there are still significant challenges that need to be addressed. This manuscript focuses on two of these challenges:
  • The first challenge involves obtaining richer long short-term dependencies. As described in Refs. [21,31], audio signals (such as sound events) have different durations, so multi-scale information is indispensable for signal detection. Although SS-FCN [29] demonstrated that dilated convolution is effective for modeling temporal context, a single dilated convolutional layer in SS-FCN can only capture features with a single length of temporal dependency. MS-FCN can capture multi-scale information by merging features output from the dilated convolutions of different layers, but the number of scales it captures is limited by the number of dilated convolution layers. Similar to MS-FCN, MSFF-Net [30] can only capture and merge a limited number of temporal dependencies. Thus, a network that can obtain richer temporal context information and merge it to capture richer long short-term dependencies is required.
  • The second challenge involves selecting effective samples (support vectors) and using them to guide the training. As audio signal detection–classification is treated as a frame-wise prediction task, samples are defined according to time frames (or time segments) of several tens of milliseconds. In a recording, sound events such as gunshots and glass breaks usually have extremely limited durations, so background samples account for a dominant proportion. Thus, there is a high class imbalance between the foreground and background in audio signal detection and classification datasets. In addition, class imbalances exist among different kinds of audio signals; e.g., in the TUT-SED 2017 dataset, the duration of car events exceeds that of brakes squeaking, children, large vehicle, and people speaking events combined. Moreover, the dataset contains long silent clips in the background, which show high similarity and are easily classified, whereas some events, e.g., gunshot, glass break, car, and large vehicle events, have smaller sample sizes, high inter-sample similarity, and high concurrency, making them difficult to detect. In short, a serious class imbalance exists in these datasets. Some events have large sample sizes and are easy to detect, contributing little to the training, while other events have small sample sizes and are difficult to fit, requiring the network to focus more on them. How to manage these two types of samples and allocate different levels of attention during training to improve detection performance is a significant challenge.

1.3. Proposed Solutions and Contributions

In response to these challenges, this paper proposes the HDC-Inception module, soft margin cross-entropy loss, and hybrid cross-entropy loss (soft margin CE loss and hybrid CE loss) to capture richer long short-term dependencies and improve the utilization of effective samples.
  • For the first challenge, this paper proposes the HDC-Inception module to capture richer long short-term dependencies, with the advantages of the Inception module [32,33] and hybrid dilated convolution (HDC) framework [34]. In the Inception module (as shown in Figure 1a), several parallel paths are constructed, and convolutional filters of different sizes are used in the parallel paths to capture and fuse features with different views. In our early SS-FCN [29], dilated convolution was used to model temporal context, as dilated convolution can ensure high time resolution and capture longer temporal dependencies without altering the filter size or network depth. Combining the Inception module and dilated convolution, the Dilated-Inception module (as shown in Figure 1b) is naturally proposed, which uses the architecture of the Inception module and replaces the convolutional filters of different sizes with dilated convolutions of different dilation factors. However, the Dilated-Inception module suffers from a “gridding issue” [34] caused by the element-skipping sampling of standard dilated convolution, under which some neighboring information is ignored [35]. Fortunately, the “gridding issue” can be alleviated by HDC [34], which uses a range of dilation factors and concatenates them serially to gradually increase the views. Based on the above analyses, we propose the HDC-Inception module (as shown in Figure 1c), which combines the advantages of the Inception module and HDC. The proposed HDC-Inception module has several parallel paths, which use dilated convolutions of different dilation factors to capture different temporal dependencies. Inspired by HDC, the dilation factors are increased path-by-path, and the output features of the previous path are sent to the next path to alleviate the “gridding issue”. Features from all paths are concatenated and fed into a convolutional filter to fuse information altogether. In addition, skip connections are used between two HDC-Inception modules to alleviate gradient vanishing. Obviously, by employing HDC-Inception modules in a stacked configuration, temporal dependencies of diverse durations can be acquired, thereby enabling networks to capture more comprehensive temporal contextual information.
  • For the second challenge, this paper proposes soft margin cross-entropy loss (soft margin CE loss) to adaptively select effective samples (support vectors) and use them to guide the training, as inspired by the soft margin support vector machine (soft margin SVM) [36,37]. In SVM, when samples are inseparable in linear space, a non-linear kernel will be used to map the samples into a feature space, where a linear separating hyperplane can be found. As neural networks can be regarded as non-linear kernels of SVM, in the following discussion on SVM and loss function, samples are located in the space after non-linear mapping. In SVM (as shown in Figure 2a) [36,37], a hyperplane is found to separate the samples into two classes with the max margin. Based on the hyperplane, samples are divided into support vectors and non-support vectors. Only support vectors are used to optimize the separating hyperplane. However, SVM is only feasible when the samples are linearly separable. To overcome this limitation, the soft margin SVM (as shown in Figure 2b) is proposed, which allows misclassified samples and minimizes errors. Inspired by soft margin SVM, we propose soft margin CE loss (as shown in Figure 2c). Similar to soft margin SVM, the proposed soft margin CE loss presupposes a separating hyperplane and sets a margin to divide samples into support vectors and non-support vectors. Support vectors are used to calculate the optimum separating hyperplane. Through identity transforms, the calculation of this hyperplane is transformed into an optimization model/problem, which can be used as the loss function of neural networks. Although the proposed soft margin CE loss originated from soft margin SVM, which only works for binary classification problems, it applies to multi-classification tasks. Using the advantages of soft margin CE loss and CE loss, this paper also proposes a hybrid CE loss approach. This method utilizes all samples in training while focusing on support vectors.
In summary, this paper proposes the HDC-Inception module and soft margin CE loss to capture richer long short-term dependencies and improve the utilization of effective samples. The HDC-Inception module combines the advantages of the Inception module [32,33] and HDC [34], which can extract rich long short-term dependencies and alleviate the “gridding issue”. Soft margin CE loss couples the advantages of soft margin SVM and CE loss, which selects support vectors adaptively and uses them to guide the training. Combining soft margin CE loss and CE loss, this paper also proposes the hybrid CE loss approach, which uses all samples in the training and gives weight to support vectors. Soft margin CE loss and hybrid CE loss can be used as the loss functions of networks and are suitable for most classification tasks.
The rest of this paper is organized as follows: the related works are shown in Section 2. The proposed HDC-Inception module and soft margin CE loss are described in Section 3. Section 4 introduces the datasets and metrics. The results and discussion are presented in Section 5. Section 6 presents the conclusion and further works.

2. Related Works

2.1. Audio Signal Detection and Classification

Early signal detection and classification methods included the Gaussian mixture model–hidden Markov model (GMM-HMM) method [13] and the non-negative matrix factorization (NMF) method [38]. As the technology evolved, deep learning methods emerged, leading to the adoption of feedforward neural networks (FNNs) [14,15] and convolutional neural networks (CNNs) [16,17] in this field. Limited by time resolution, FNN [14,15] and CNN [16,17] are not suitable for signal detection and classification. To overcome this limitation, Parascandolo et al. proposed the RNN-based method [18], which uses RNNs [39] to predict the class of each signal frame by frame, ensuring high time resolution. Furthermore, by combining RNNs (such as long short-term memory (LSTM) [40]) with hidden Markov models, Hayashi et al. proposed the duration-controlled long short-term memory [41], which utilizes three-state left-to-right hidden Markov models to describe the course of audio events. Using the advantages of CNNs [16] and RNNs [18], Cakir et al. proposed well-known convolutional recurrent neural networks (CRNNs) [19] for audio signal detection and classification, where CNNs are used to extract frequency invariance and RNNs ensure high time resolution. Despite the attractive performance of CRNNs, which has led to expanded versions such as C3RNN [42], R-CRNN [43], and [44,45], these methods have seen limited performance improvements. Although Transformer-based approaches have been successful in sound event classification and weakly labeled sound event detection [22,23,24,25,26,46,47,48,49], for polyphonic sound event detection, these methods do not offer substantial benefits [27,28]. In recent years, the concept of long short-term dependencies has garnered attention from researchers. Lu et al. proposed multi-scale recurrent neural networks (MS-RNNs) [20], in which audio recordings of different time resolutions are used to capture temporal dependencies of different lengths. Zhang et al. proposed the multi-scale time-frequency attention (MTFA) method [21], employing convolutional operations and max-pooling to capture multi-scale features.
However, all the methods mentioned above use RNNs (e.g., LSTM [40] or the gated recurrent unit (GRU) [50]) to model temporal context information; therefore, they share the RNN limitations of training challenges and low running speeds. To overcome these challenges, the SS-FCN and MS-FCN approaches were proposed [29]. SS-FCN demonstrates that dilated convolution can be used to model temporal context information [29]. MS-FCN demonstrates that fusing temporal dependencies of different lengths can enhance the performance on different kinds of signals. As an improvement of MS-FCN, MSFF-Net [30] is capable of capturing richer temporal dependencies. The SS-FCN, MS-FCN, and MSFF-Net approaches achieve competitive performance, and their running speeds are well above those of previous methods [29,30].

2.2. Inception Module and Hybrid Dilated Convolution

Inception module: The Inception module (as shown in Figure 1a) was first proposed by Szegedy et al. for object detection and classification [33]; it arranges parallel convolutions of different kernels to capture different views from the input feature. As the Inception module can capture features with different views and deal with multi-scale problems of objects, it has been extensively applied to object recognition [51], human pose estimation [52], action recognition [53,54,55], and so on.
Hybrid dilated convolution: Hybrid dilated convolution (HDC) [34] was proposed by Wang et al. to alleviate the “gridding issue” caused by dilated convolution. In dilated convolution, zeros are padded between two pixels in a convolutional kernel; thus, the receptive field of this kernel only covers an area with checkerboard patterns, and some neighboring information is ignored. To alleviate the “gridding issue”, HDC uses a range of dilated convolutions with different dilation factors and concatenates them in series, under which the receptive field grows gradually and covers the whole area.

2.3. Soft Margin Support Vector Machine

Support vector machine (SVM), proposed by Cortes et al. [36], is a classical machine learning algorithm and has been extensively applied to classification tasks. As shown in Figure 2a, SVM calculates a separating hyperplane with the maximum margin to separate the samples into two classes. The samples closest to the hyperplane are defined as support vectors. In SVM, only support vectors are used to calculate the optimum hyperplane, and non-support vectors do not contribute to the calculation of the hyperplane. The training samples are denoted by $x_i \in \mathbb{R}^p$ with labels $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, n$. As shown in Figure 2a, the optimum hyperplane $w \cdot x + b = 0$ separates the two classes with the largest margin, and all samples meet the constraint $y(w \cdot x + b) \geq 1$, where $w \in \mathbb{R}^p$ represents the weight vector of the separating hyperplane and $b$ is the bias. To maximize the margin, the optimum separating hyperplane can be obtained by solving the optimization problem in Equation (1):
$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1, \; i = 1, \ldots, n \tag{1}$$
However, SVM requires that all samples satisfy the constraint $y(w \cdot x + b) \geq 1$, so SVM is feasible only when the samples are linearly separable. To overcome this limitation, the soft margin SVM is proposed, which allows misclassified samples that do not meet the constraint, as shown in Figure 2b. In soft margin SVM, a misclassified sample incurs the error $1 - y(w \cdot x + b)$. Taking this error as a penalty and integrating it into Equation (1), the optimization problem transforms into Equation (2):
$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\max\left(0, 1 - y_i(w \cdot x_i + b)\right) \tag{2}$$
where C is a regularization parameter that defines the trade-off between the maximum margin and the misclassified error.
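As a point of reference (an illustrative addition, not part of the original formulation), the regularization parameter C in Equation (2) is exposed directly by standard soft margin SVM implementations; the sketch below uses scikit-learn on synthetic, non-separable data, and all data and parameter values here are assumptions for demonstration only.

```python
# Illustrative sketch (not from the paper): the regularization parameter C in
# Equation (2) trades off margin maximization against margin violations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters, so the data are not linearly separable.
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

# A small C tolerates more margin violations; a large C penalizes them heavily.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors per class:", clf.n_support_)
```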

3. Method

3.1. HDC-Inception

The proposed HDC-Inception approach takes advantage of both the HDC [34] and Inception modules [32,33], with the former alleviating the “gridding issue” and the latter capturing temporal dependencies of different lengths. Ref. [29] demonstrated that dilated convolution is effective at modeling temporal context information in audio signal detection and classification systems. Dilated convolution can address the limitations of RNNs, including restricted temporal dependencies, training difficulties, and low efficiency. In the HDC-Inception model, dilated convolution is specifically chosen to model temporal context.
As shown in Figure 1a, in video tasks, through constructing several parallel paths with convolutional filters of different sizes, the Inception module can capture features of different scales. Extending to signal processing, the Dilated-Inception module is proposed, as shown in Figure 1b, in which parallel paths with different dilation factors are used to capture temporal dependencies of different lengths. However, dilated convolution suffers from the “gridding issue”. To overcome this limitation, inspired by HDC [34], HDC-Inception is proposed (Figure 1c). This approach gradually increases the dilation factor across the parallel paths; the output of the previous path, concatenated with the input, is fed into the next path to ensure a gradual increase in the temporal dependency length. Obviously, in HDC-Inception, different paths output temporal dependencies of different lengths. To better fuse temporal dependencies of different lengths, the outputs of all paths and the input features are concatenated and then passed through a convolution with a filter size of 1. In addition, skip connections are used between two HDC-Inception modules to alleviate gradient vanishing. Referring to [56], the exponential dilation factors of {1, 2, 4, 8, 16,…} are selected in HDC-Inception. As too-large dilation factors lead to too-sparse sampling, failing to cover local temporal context information, the proposed HDC-Inception module utilizes four parallel paths with dilation factors of {1, 2, 4, 8}. Hence, one HDC-Inception module obtains temporal dependencies of four different lengths. As HDC-Inception modules are connected in series, if there are $N_{HI}$ HDC-Inception modules in the proposed method, $4N_{HI}$ temporal dependencies are captured. Thus, through HDC-Inception, the proposed method can capture rich temporal contextual information.
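To make the structure concrete, the following PyTorch sketch illustrates the HDC-Inception idea described above; the layer widths, the placement of the skip connection inside the module, and the interface are assumptions for illustration and not the authors' released implementation.

```python
# Minimal PyTorch sketch of the HDC-Inception idea (assumed hyperparameters).
import torch
import torch.nn as nn

class HDCInception(nn.Module):
    """Parallel 1-D dilated convolutions with dilations {1, 2, 4, 8}.
    Each later path receives the input concatenated with the previous path's
    output, so the temporal receptive field grows gradually (HDC-style); the
    outputs of all paths plus the input are fused by a 1x1 convolution."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.paths = nn.ModuleList()
        in_ch = channels
        for d in (1, 2, 4, 8):
            self.paths.append(nn.Conv1d(in_ch, channels, kernel_size,
                                        padding=d, dilation=d))
            in_ch = 2 * channels  # next path sees [input, previous output]
        self.fuse = nn.Conv1d(5 * channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, channels, time)
        outputs, prev = [], None
        for path in self.paths:
            inp = x if prev is None else torch.cat([x, prev], dim=1)
            prev = torch.relu(path(inp))
            outputs.append(prev)
        fused = self.fuse(torch.cat([x] + outputs, dim=1))
        return fused + x                       # skip connection around the module

x = torch.randn(2, 64, 256)                    # (batch, hidden units, frames)
print(HDCInception(64)(x).shape)               # -> torch.Size([2, 64, 256])
```

Stacking several such modules in series then yields the $4N_{HI}$ temporal dependency lengths described above.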

3.2. Soft Margin CE Loss

Inspired by soft margin SVM, the proposed soft margin CE loss can identify support vectors and non-support vectors. Furthermore, soft margin CE loss meets the general requirements of classification tasks and can be used for multi-label and multi-classification tasks. In classification tasks, classification methods based on neural networks usually output a prediction, ranging from 0 to 1, to represent the activity probability of one class. In this process, the neural networks can be regarded as non-linear kernels of SVM, which map the samples into a feature space, where samples satisfy linear separability. Based on this, as shown in Figure 2c, the activity probability for the sample is in the range of [0, 1] ($p \in [0, 1]$). Analogous to SVM, we assume that the separating hyperplane is $p = 0.5$, which is used to judge whether an event occurred.
When the sound event occurs in this audio segment (the target output y = 1), the constraint is shown in Equation (3), as follows:
$$p \geq 0.5 + E \tag{3}$$
where E represents the margin. Via an identity transformation, this constraint converts to Equation (4), as follows:
$$\log(p) \geq \log(0.5 + E) \tag{4}$$
In the proposed soft margin CE loss, misclassified samples that do not meet the constraint in Equation (3) are allowed. Similar to soft margin SVM, a misclassified sample incurs the error $-\log(p) + \log(0.5 + E)$ and is regarded as a support vector. Taking this error as a penalty and considering all samples, the optimization function can be formulated as in Equation (5):
$$\min\left[\max\left(0, -\log(p) + \log(0.5 + E)\right)\right] \tag{5}$$
When there is no sound event in this audio segment (target output y = 0), the constraint can be denoted as in Equation (6):
$$p \leq 0.5 - E \;\Rightarrow\; 1 - p \geq 0.5 + E \;\Rightarrow\; \log(1 - p) \geq \log(0.5 + E) \tag{6}$$
In this condition, a misclassified sample incurs the error $-\log(1 - p) + \log(0.5 + E)$, and the optimization function is shown in Equation (7), as follows:
$$\min\left(\max\left(0, -\log(1 - p) + \log(0.5 + E)\right)\right) \tag{7}$$
Considering all situations and all samples, we can obtain the soft margin CE loss, as shown in Equation (8):
$$\min\left[\sum_{i=1}^{n}\left(y_i \max\left(0, -\log(p_i) + \log(0.5 + E)\right) + (1 - y_i)\max\left(0, -\log(1 - p_i) + \log(0.5 + E)\right)\right)\right] \tag{8}$$
where $y_i \in \{0, 1\}$ is the target output and $E \in [0, 0.5]$ represents the margin.
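The following PyTorch sketch implements Equation (8) as a loss function; the function name, tensor shapes, and the numerical-stability epsilon are assumptions, not the authors' code.

```python
# Minimal sketch of the soft margin CE loss in Equation (8).
import torch

def soft_margin_ce_loss(p: torch.Tensor, y: torch.Tensor, E: float = 0.3) -> torch.Tensor:
    """p, y: tensors of the same shape with p in (0, 1) and y in {0, 1}.
    Samples already satisfying the margin (p >= 0.5 + E for positives,
    p <= 0.5 - E for negatives) contribute zero loss and zero gradient,
    i.e. only support vectors drive the training."""
    eps = 1e-7
    margin = torch.log(torch.tensor(0.5 + E))
    pos = torch.clamp(-torch.log(p + eps) + margin, min=0.0)
    neg = torch.clamp(-torch.log(1.0 - p + eps) + margin, min=0.0)
    return (y * pos + (1.0 - y) * neg).sum()

p = torch.sigmoid(torch.randn(4, 6, 256))   # e.g. (batch, classes, frames)
y = torch.randint(0, 2, p.shape).float()
print(soft_margin_ce_loss(p, y, E=0.3))
```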
Analysis: In this section, we analyze the characteristics of the proposed soft margin CE loss and compare them with CE loss. In this analysis, a simple classification model is selected, which consists of a single-layer fully connected network followed by a sigmoid. In this model, the forward propagation is shown as follows:
$$s = w \odot x = \sum_{j=1}^{J} w_j x_j \tag{9}$$
$$p = \frac{1}{1 + e^{-s}} \tag{10}$$
$$L_{CE} = \sum_{i=1}^{N}\left(-y_i \log(p_i) - (1 - y_i)\log(1 - p_i)\right) \tag{11}$$
$$L_{SMCE} = \sum_{i=1}^{N}\left(y_i \max\left(0, -\log(p_i) + \log(0.5 + E)\right) + (1 - y_i)\max\left(0, -\log(1 - p_i) + \log(0.5 + E)\right)\right) \tag{12}$$
where $s$ is the output of the fully connected network, and $p$ is the output of the sigmoid representing the activity probability of a sound event. $L_{CE}$ is the CE loss and $L_{SMCE}$ is the soft margin CE loss. $\odot$ denotes the inner product. The purpose is to minimize $L_{CE}$ or $L_{SMCE}$. The partial derivative $\frac{\partial L}{\partial w_j}$ represents the speed of convergence, where $L \in \{L_{CE}, L_{SMCE}\}$.
When CE loss is selected as the loss function, the calculation of the partial derivative $\frac{\partial L_{CE}}{\partial w_j}$ is as follows:
$$\begin{aligned} \frac{\partial L_{CE}}{\partial p} &= -\frac{y}{p} + \frac{1 - y}{1 - p} \\ \frac{\partial p}{\partial s} &= \frac{1}{1 + e^{-s}} \cdot \frac{e^{-s}}{1 + e^{-s}} = p(1 - p) \\ \frac{\partial s}{\partial w_j} &= x_j \\ \text{so:}\quad \frac{\partial L_{CE}}{\partial w_j} &= \frac{\partial L_{CE}}{\partial p} \cdot \frac{\partial p}{\partial s} \cdot \frac{\partial s}{\partial w_j} = \left[-\frac{y}{p} + \frac{1 - y}{1 - p}\right] p(1 - p) x_j = \left[-y(1 - p) + (1 - y)p\right] x_j = (p - y) x_j \end{aligned} \tag{13}$$
Referring to Equation (13), when soft margin CE loss is selected as the loss function, the computational formula for the partial derivative $\frac{\partial L_{SMCE}}{\partial w_j}$ is as follows:
$$\frac{\partial L_{SMCE}}{\partial w_j} = \begin{cases} 0, & (y = 0 \text{ and } p \leq 0.5 - E) \text{ or } (y = 1 \text{ and } p \geq 0.5 + E) \\ (p - y)x_j, & \text{otherwise} \end{cases} \tag{14}$$
For CE loss and the proposed soft margin CE loss, the comparisons between loss values and their corresponding partial derivatives $\frac{\partial L}{\partial w_j}$, $L \in \{L_{CE}, L_{SMCE}\}$, are shown in Figure 3. As Figure 3 shows, the proposed soft margin CE loss has similar characteristics to CE loss; for this reason, soft margin CE loss can replace CE loss in most classification tasks. The difference is that CE loss uses all samples to train the networks, whereas soft margin CE loss selects support vectors and uses them to guide the training, so non-support vectors do not contribute to training.

3.3. Hybrid CE Loss

Although soft margin CE loss can select support vectors, which play an indispensable role in training, the number of support vectors is limited, whereas neural networks need huge numbers of samples for training. Taking both considerations into account, the hybrid CE loss function, which mixes the proposed soft margin CE loss with the classical CE loss, is proposed, as shown in Equation (15):
$$L = rL_{SMCE} + (1 - r)L_{CE} \tag{15}$$
where $L_{SMCE}$ denotes soft margin CE loss, $L_{CE}$ denotes classical CE loss, and $L$ denotes the hybrid loss. $r$ denotes the trade-off between soft margin CE loss and classical CE loss.
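A minimal sketch of Equation (15) is given below, reusing the soft_margin_ce_loss sketch shown after Equation (8); the default value of r here is an assumption, since r is tuned by grid search in Section 4.4.

```python
# Sketch of the hybrid CE loss in Equation (15); assumes soft_margin_ce_loss
# from the earlier sketch is in scope.
import torch.nn.functional as F

def hybrid_ce_loss(p, y, E: float = 0.3, r: float = 0.5):
    l_ce = F.binary_cross_entropy(p, y, reduction="sum")   # classical CE loss
    l_smce = soft_margin_ce_loss(p, y, E)                  # support-vector term
    return r * l_smce + (1.0 - r) * l_ce
```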

3.4. Proposed Method

Employing HDC-Inception, along with soft margin CE loss and hybrid CE loss, forms the core of the proposed methodology. As shown in Figure 4, the proposed method consists of three parts: (1) Convolutional layers used to obtain features with frequency invariance, (2) HDC-Inception layers, which can model temporal context and capture long short-term dependencies, and (3) loss functions, under which CE loss, soft margin CE loss, and a hybrid loss function that combines both are selected. Next, we will go into detail about each of the three parts.

3.4.1. Preprocessing

In this paper, we select the log Mel-band energy (MBE) as the input for the neural networks, which has been extensively used for audio signal detection and classification [19,29]. Firstly, to increase comparability with previous methods, each audio recording is divided into 46.4 ms segments (or frames) with 50% overlap using a Hamming window, similar to [19,20,29,30,42]. Then, the 40-band MBE $x_t \in \mathbb{R}^F$ is extracted from each audio signal segment, considering the computational load and comparability with previous methods [19,20,29,30,42], where $F = 40$ denotes the number of features. Referring to [19,29], we select 256 consecutive MBEs (i.e., $X = \{x_0, x_1, \ldots, x_{T-1}\} \in \mathbb{R}^{F \times T}$, $T = 256$) as the input for the networks. The target output corresponding to the input $X$ is denoted by $Y = \{y_0, y_1, \ldots, y_{T-1}\} \in \mathbb{R}^{N_c \times T}$, where $N_c$ indicates the number of target event classes; $y_t = \{y_{(0,t)}, y_{(1,t)}, \ldots, y_{(N_c-1,t)}\}^T \in \mathbb{R}^{N_c}$ represents the label of the $t$-th frame, and if the $c$-th class occurs in the $t$-th frame, $y_{(c,t)}$ equals 1, otherwise 0.
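As an illustration of this preprocessing step, the sketch below extracts the log MBE with librosa under the stated frame settings (46.4 ms frames, 50% overlap, 40 mel bands); the sample rate and the flooring constant are assumptions, not values taken from the paper.

```python
# Illustrative log Mel-band energy extraction (assumed sample rate).
import numpy as np
import librosa

def log_mel_band_energy(path: str, sr: int = 44100, n_mels: int = 40) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(round(0.0464 * sr))          # 46.4 ms analysis window
    hop_length = n_fft // 2                  # 50% overlap
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, window="hamming", power=2.0)
    return np.log(mel + 1e-10)               # shape: (n_mels, n_frames) = (F, T)
```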

3.4.2. Convolutional Layers

The MBE sequence $X$ is fed into the convolutional layers, in which two-dimensional convolutional filters are used and each convolution is followed by batch normalization, ReLU activation, and non-overlapped max-pooling. As the convolutional layers are used to enhance the frequency invariance, the max-pooling is computed only along the frequency axis and the time resolution of features remains the same. The output of the convolutional layers is indicated by $F_c \in \mathbb{R}^{M \times \bar{F} \times T}$, where $M$ represents the number of channels (or the hidden units) and $\bar{F}$ denotes the number of frequency bands.
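A minimal sketch of such a convolutional front-end follows, assuming the 3x3 filters, the hidden-unit count, and the frequency-only max-pooling arrangement (5, 4, 2) mentioned in Section 4.4; this is an illustration, not the authors' implementation.

```python
# Convolutional front-end sketch: frequency invariance with time resolution kept.
import torch
import torch.nn as nn

def conv_frontend(hidden: int = 64, freq_pools=(5, 4, 2)) -> nn.Sequential:
    layers, in_ch = [], 1
    for pool in freq_pools:
        layers += [nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
                   nn.BatchNorm2d(hidden),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=(pool, 1))]  # pool frequency only
        in_ch = hidden
    return nn.Sequential(*layers)

x = torch.randn(2, 1, 40, 256)               # (batch, 1, mel bands, frames)
print(conv_frontend()(x).shape)              # -> torch.Size([2, 64, 1, 256])
```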

3.4.3. HDC-Inception Layers

The HDC-Inception layers are composed of multiple HDC-Inception modules; each layer is followed by batch normalization and ReLU activation. The outputs of the convolutional layers are stacked along the frequency axis and then fed into the HDC-Inception layers. The output of the $n$-th HDC-Inception module is represented by $F_{HI}^{n} \in \mathbb{R}^{W_n \times T}$, with $n \leq N_{HI}$, where $N_{HI}$ denotes the number of HDC-Inception modules in the proposed method.

3.4.4. Prediction Layer

The output of the HDC-Inception layers $F_{HI}^{N_{HI}}$ is fed into the prediction layer to predict the activity probabilities of sound events. The prediction layer is a single convolutional layer with a filter size of 3 and sigmoid activation. The output is denoted by $P \in \mathbb{R}^{N_c \times T}$, which is a $T$-length sequence, in which $P_{(c,t)} \in (0, 1)$ denotes the activity probability of class $c$ in the $t$-th audio frame.

3.4.5. Loss Function

The proposed soft margin CE loss and hybrid CE loss are selected as the loss functions in the proposed method. Soft margin CE loss gives weight to difficult samples (referred to as support vectors), and hybrid CE loss takes advantage of both soft margin CE loss and classical CE loss; this approach strikes a balance between considering all samples and focusing on the more difficult ones.
Batch normalization is used after each convolutional or dilated convolutional layer to reduce the internal covariate shift, and L2 regularization is selected to prevent overfitting. Adam [57] is used as the gradient descent optimizer.

4. Experiments

4.1. Dataset

We test the proposed method on three datasets: the TUT Rare Sound Events 2017 dataset, the TUT-SED 2017 dataset, and the TUT-SED 2016 dataset. (1) The TUT Rare Sound Events 2017 dataset [58] is a synthetic dataset, with three types of isolated audio signals as the target events and backgrounds coming from 15 different scenes. The target events include ‘baby cry’ (106 training, 42 test instances, mean duration 2.25 s), ‘glass break’ (96 training, 43 test instances, mean duration 1.16 s), and ‘gunshot’ (134 training, 53 test instances, mean duration 1.32 s). For the development training dataset, 5000 mixtures were generated per target event, incorporating event-to-background ratios of −6, 0, and 6 dB, alongside an event presence probability of 0.99. In contrast, the development testing set and the evaluation set utilize the official set of 500 audio samples per target event provided for testing. (2) The TUT-SED 2017 dataset [58] comprises recordings from real-life street scenes featuring six sound event classes. The development dataset includes 24 files, totaling 70 min. The dataset is divided into training and testing sets using the official four-fold cross-validation method for experiments. In this dataset, the number of instances for each event class includes ‘brakes squeaking’ (52 development, 24 evaluation), ‘car’ (304 development, 110 evaluation), ‘children’ (44 development, 19 evaluation), ‘large vehicle’ (61 development, 24 evaluation), ‘people speaking’ (89 development, 47 evaluation), and ‘people walking’ (109 development, 48 evaluation) events. (3) Recorded in two real-life scenes, the TUT-SED 2016 dataset [59] consists of 22 recordings ranging from three to five minutes each. Of these, 10 recordings are from homes and feature 11 sound events, while 12 recordings are from residential areas with 7 sound events. Details on the number of instances per class are provided in Table 1 of [59].
For these three datasets, the per-class event instances can be found in Table 1. One instance represents a complete process of one sound event, with a duration ranging from a few tenths of a second to several tens of seconds. All three datasets are publicly available datasets and more detailed information about them can be found in Refs. [58,59].

4.2. Metrics

Following previous methods [19,29,30], event-based metrics [60] are used to assess the TUT Rare Sound Events 2017 dataset by comparing system outputs to the corresponding reference outputs on an event-by-event basis, while segment-based metrics [60] are used to evaluate the TUT-SED 2016/2017 datasets by comparing the system output and the reference (the target output) in short time segments. Both employ intermediate statistics (true positives, false positives, and false negatives), which are used to calculate the F1 score and error rate (ER). More details about these metrics can be found in Ref. [60].
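For illustration, the sketch below computes segment-based F1 and ER from these intermediate statistics in a simplified way; in practice the official sed_eval toolkit [60] is used, and the binary input format assumed here is for demonstration only.

```python
# Simplified segment-based F1 and error rate from intermediate statistics.
# pred, ref: binary matrices of shape (n_segments, n_classes).
import numpy as np

def segment_based_f1_er(pred: np.ndarray, ref: np.ndarray):
    tp = np.logical_and(pred == 1, ref == 1).sum()
    fp = np.logical_and(pred == 1, ref == 0).sum()
    fn = np.logical_and(pred == 0, ref == 1).sum()
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
    # Per-segment substitutions, deletions, and insertions, as in sed_eval.
    seg_fn = np.logical_and(pred == 0, ref == 1).sum(axis=1)
    seg_fp = np.logical_and(pred == 1, ref == 0).sum(axis=1)
    subs = np.minimum(seg_fn, seg_fp).sum()
    dels = np.maximum(0, seg_fn - seg_fp).sum()
    ins = np.maximum(0, seg_fp - seg_fn).sum()
    er = (subs + dels + ins) / (ref.sum() + 1e-12)
    return f1, er
```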

4.3. Baseline

Our earlier proposed SS-FCN [29] is selected as the baseline. The proposed method has a framework similar to SS-FCN, which is composed of convolutional layers, dilated convolutional layers, and CE loss. Convolutional layers in SS-FCN and the proposed method are the same. The dilated convolutional layers of SS-FCN and CE loss correspond to the proposed HDC-Inception layers and soft margin CE loss (or hybrid loss), respectively.

4.4. Experiments Setup

Referring to [19,29,30], for each dataset, the Mel-band energy sequences used as input are consistently set to a length of 256, except for the evaluation dataset of the TUT Rare Sound Events 2017, where the entire file with an MBE sequence length of 1293 is used. The convolutional layers are 2D-CNNs with a filter size of 3; according to Refs. [19,29], the hyperparameter grid search is launched on the hidden units of {32, 64, 128} and the frequency max-pooling arrangements {(5, 4, 2), (5, 2, 2)}. The proposed HDC-Inception module has five parallel paths, in which four are dilated convolutions with dilation factors {1, 2, 4, 8} and one is a skip connection. The HDC-Inception model undergoes hyperparameter optimization using a grid search across the values {32, 64, 128} for its hidden units. A convolutional layer with a filter size of 3 and a sigmoid activation function is chosen for the prediction layer. For the proposed soft margin CE loss, we run a hyperparameter grid search on the margin $E \in \{0.1, 0.2, 0.3, 0.4\}$. For the hybrid loss function, we run a hyperparameter grid search on the trade-off coefficient $r \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$. As the separating hyperplane is $p = 0.5$ in the soft margin CE loss approach, the event activity probabilities $P$ are thresholded at 0.5. The starting learning rate is set to 0.001. For the TUT Rare Sound Events dataset, the training is capped at 50 epochs, implementing early stopping if there is no decrease in validation loss for ten consecutive epochs. For the TUT-SED 2017/2016 datasets, the training is extended to a maximum of 1000 epochs, with early stopping triggered if there is no reduction in the overall error rate on the validation set after 100 epochs. Each experiment is conducted five times using the official four-fold cross-validation setup to obtain average results.

5. Results and Discussion

The experiments are essentially split into two segments. Initially, ablation experiments are performed on the TUT Rare Sound Events 2017 dataset [58] to validate the efficiency of HDC-Inception, soft margin CE loss, and hybrid CE loss. Subsequently, comparisons with existing methods are conducted on the TUT Rare Sound Events 2017 dataset [58], TUT-SED 2017 dataset [58], and TUT-SED 2016 dataset [59]. The experimental results show that the HDC-Inception and hybrid CE loss proposed in this paper can effectively enhance performance and outperform existing audio signal detection and classification methods across several datasets.

5.1. Ablation Studies

This section first validates the HDC-Inception module and then examines the performance of soft margin CE loss and hybrid CE loss. It was observed that the performance of the method based on soft margin CE loss is limited. The reason is that, under soft margin CE loss, the number of support vectors in a batch is too small, making it difficult to improve the performance. To tackle this problem, this paper proposes the hybrid CE loss, and experiments show that the hybrid CE loss can effectively improve the performance of the method.

5.1.1. HDC-Inception Module

Effect of HDC-Inception. In this section, SS-FCN [29] is chosen as the baseline, which consists of convolutional layers, dilated convolutional layers, and CE loss. The proposed HDC-Inception layers are used to replace the dilated convolution layers in SS-FCN, yielding the method named SS-FCN+HDC-Inception. To verify the effectiveness of the proposed HDC-Inception, this section compares the performance of SS-FCN and SS-FCN+HDC-Inception. In this experiment, the number of dilated convolution layers and HDC-Inception layers is kept the same, set to 4, 5, 6, and 7, respectively. For the sake of fairness, the results of SS-FCN are taken from Refs. [29,30]. The comparison results are shown in Table 2. Four networks with two metrics, on both the development dataset and the evaluation dataset, give a total of 16 comparable indicators. As shown in Table 2, SS-FCN+HDC-Inception outperforms SS-FCN in 14 of the 16 indicators. Thus, the proposed HDC-Inception approach has a better ability to model temporal contextual information than dilated convolution.
The number of HDC-Inception layers. This section analyzes the impact of the number of HDC-Inception layers on detection performance. The results are displayed in Table 3, with the best performance in bold. As shown in Table 3, the best performance is achieved with 4 to 7 HDC-Inception layers. As the number of HDC-Inception layers increases from 4 to 6, the class-wise average error rate tends to decrease, and the F1 score tends to improve on both the evaluation and development datasets. As the number of layers increases from 6 to 8, both the class-wise average error rate and the F1 score deteriorate on the evaluation dataset, and the performance on the development dataset also worsens. Therefore, in the task of audio signal detection and classification, more HDC-Inception layers do not necessarily result in better performance. Although the network with nine HDC-Inception layers performs well on the evaluation dataset, its performance is poor on the development dataset. With six HDC-Inception layers, the network achieves the best class-wise average F1 score and the lowest class-wise average error rate on both the development and evaluation datasets. Thus, the optimal number of HDC-Inception layers is set to 6.

5.1.2. Soft-Margin CE Loss

Effect of margins (E). This section verifies the validity of the soft margin CE loss and the effect of the margin $E$ in this loss function; the results are shown in Table 4 and Figure 5. For the baby cry and glass break events, the best error rates and F1 scores are obtained by soft margin CE loss with $E = 0.45$ on the development dataset, whereas on the evaluation dataset, the best performances are obtained at $E = 0.2$ and $E = 0.1$, respectively. For gunshot, the best performance on both the development and evaluation datasets is attained when $E = 0.3$. Observing the best performance for each event in Table 4 and Figure 5, the best performances on the development dataset are distributed at $E$ equal to {0.3, 0.4, 0.45}, and on the evaluation dataset, at $E$ equal to {0.1, 0.2, 0.3}. The larger the $E$, the closer the soft margin CE loss is to the traditional CE loss; when $E = 0.5$, soft margin CE loss is equal to CE loss. With a large $E$, soft margin CE loss performs better on the development dataset and worse on the evaluation dataset, while with a small $E$, the performance on the development dataset degrades but the performance on the evaluation dataset is relatively good. This result, to some extent, indicates that the soft margin CE loss proposed in this paper is somewhat resistant to overfitting. With $E = 0.3$, the soft margin CE loss shows good performance on both the development and evaluation datasets; therefore, $E = 0.3$ was chosen for subsequent experiments.
Effect of soft margin CE loss. As shown in Table 5, compared with SS-FCN+HDC-Inception, after soft margin CE loss is used, the performance improves slightly on the evaluation dataset and decreases significantly on the development dataset. To verify whether this drop is a coincidence, the comparative experiments based on MS-FCN are launched, in which the CE loss and soft margin CE loss are selected as the loss functions, respectively; the results are shown in the last two rows of Table 5. The results are similar to those of SS-FCN, with limited performance improvement on the evaluation dataset and more significant performance degradation on the development dataset after using soft margin CE loss instead of CE loss.
This degradation is caused by the difference between soft margin CE loss and CE loss. Observing Figure 3, CE loss can be regarded as a special soft margin CE loss with $E = 0.5$, so the difference between soft margin CE loss and CE loss lies in their points of convergence. CE loss pushes predictions toward 1 or 0, whereas soft margin CE loss only requires samples to satisfy $p \geq 0.5 + E$ or $p \leq 0.5 - E$. In soft margin CE loss, when the output $p$ of a positive sample satisfies $p \geq 0.5 + E$, or $p$ of a negative sample satisfies $p \leq 0.5 - E$, the partial derivative $\frac{\partial L}{\partial w}$ is equal to 0. Therefore, this kind of sample is no longer valid for training and is called a non-support vector. It is known that training a neural network requires a large dataset. However, when soft margin CE loss is used, although the training of support vectors receives full attention, this loss function can cause many samples to become non-support vectors prematurely, which is detrimental to the training. In the task of audio signal detection–classification, the input is a continuous sequence of Mel spectra, represented as $X = \{x_0, x_1, \ldots, x_{T-1}\} \in \mathbb{R}^{F \times T}$, where $T = 256$ denotes the sequence length. In this experiment, $X$ is counted as a support vector if at least one element $x_t$ of $X$ is a support vector, as sketched below. Figure 6 presents the number of support vectors and their ratio in the dataset after each epoch. Observing Figure 6, the proportion of support vectors is about 25% in the later epochs. In this experiment, the batch size is 256, and considering only the support vectors, the effective batch size is about 256 × 0.25 = 64. Note that not all elements of a support vector $X$ contribute; thus, the actual number of effective samples is extremely low. To address this problem, this paper proposes the hybrid CE loss, which can use all samples to train the network while focusing on the support vectors.
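The following sketch shows this support-vector bookkeeping (an illustration with assumed tensor shapes, not the authors' code): a sequence is counted as a support vector if at least one of its frames still violates the margin.

```python
# Fraction of input sequences that are support vectors under soft margin CE loss.
import torch

def support_vector_ratio(p: torch.Tensor, y: torch.Tensor, E: float = 0.3) -> float:
    # p, y: (batch, classes, frames); a frame is a support vector while it
    # still violates the margin of the soft margin CE loss.
    pos_violation = (y == 1) & (p < 0.5 + E)
    neg_violation = (y == 0) & (p > 0.5 - E)
    is_sv = (pos_violation | neg_violation).flatten(start_dim=1).any(dim=1)
    return is_sv.float().mean().item()
```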

5.1.3. Hybrid CE Loss

The hybrid CE loss can be trained using all samples while focusing on the support vectors. The experimental results are shown in the fourth and last row of Table 5. As shown in Table 5, when SS-FCN is selected as the baseline and hybrid CE loss is used to replace soft margin CE loss, for the development dataset, the class-wise average error rate decreases by 3% and the class-wise average F1 score improves by 1.87%. For the evaluation dataset, the corresponding improvements are 2% and 1.39%. Comparing MS-FCN with soft margin CE loss, after hybrid CE loss is selected, the class-wise average error rate improves by 1%, and the class-wise F1 score improves by 0.67%. Compared to MS-FCN with traditional CE loss, hybrid CE loss is less effective for the development dataset, while for the evaluation dataset, improvements of 1% and 0.68% are obtained for the class-wise error rate and F1 score. In summary, hybrid CE loss provides overall improvement compared to soft margin CE loss and shows greater robustness than traditional CE loss.

5.2. Comparisons with Existing Methods

In this section, the hybrid CE loss is selected as the loss function for SS-FCN+HDC-Inception and MS-FCN to compare with existing methods. The experiment encompasses the TUT Rare Sound Event 2017 dataset, the TUT-SED 2017 dataset, and the TUT-SED 2016 dataset.

5.2.1. TUT Rare Sound Event 2017 Dataset

In the ablation studies, for the TUT Rare Sound Event 2017 dataset, all sound events, such as baby cry, glass break, and gunshot, share a single network, whereas in previous methods, a separate network was trained for each kind of sound event to improve performance. To be fair, this section constructs dedicated networks for each type of sound event, and the results are shown in Table 6.
Observing Table 6, compared to MS-FCN with CE loss, after hybrid CE loss is used, on the development dataset, the performance is not improved significantly, while on the evaluation dataset, improvements of 1% and 0.4% are obtained for the class-wise average error rate and F1 score. This further suggests hybrid CE loss is superior to CE loss. As for the ability to model long short-term dependencies of MS-FCN and HDC-Inception, the results of MS-FCN with hybrid CE loss and SS-FCN with hybrid CE loss and HDC-Inception are shown in the penultimate and the last rows of Table 6. For the development dataset, the performance of both methods is similar, while for the evaluation dataset, the network based on MS-FCN achieves improvements of 0.01 and 0.3% in the class-wise average error rate and F1 score. Clearly, the two methods of obtaining long short-term dependencies do not show significant differences.
Observing all methods in Table 6, on the development dataset, for gunshot, MSFF-Net [30] achieves the best performance, and for glass break, SS-FCN with hybrid CE loss and HDC-Inception delivers the best performance. Moreover, 1D-CRNN [63] and MTFA [21] achieve high performances for the baby cry event. On the evaluation dataset, MS-FCN based on hybrid CE loss shows the best class-wise average error rate and F1 score. For gunshot, SS-FCN based on hybrid CE loss and HDC-Inception has the best performance. For baby cry and glass break events, MTFA [21] has the best performance. Although 1D-CRNN performs better on the development dataset, it has a much lower score on the evaluation dataset and uses an ensemble strategy with high complexity. MTFA, while performing better on the development dataset, has a class-wise average error rate and F1 score that are worse than those of the proposed method by 0.02, 0.01, 0.07%, and 0.04%, respectively. Moreover, MTFA uses RNNs to model temporal dependencies, processing data in a serial rather than parallel manner. This results in an operating speed approximately 20 times lower than that of similar fully convolutional networks [29,30], causing MTFA to require longer training times. MSFF-Net [30] and the proposed method in this manuscript are both improvements based on SS-FCN and MS-FCN [29]. Observing Table 6, both exhibit similar performance on the development dataset, while the proposed method shows improvements of 0.01 and 0.07% compared to MSFF-Net [30] on the evaluation dataset for the class-wise average error rate and F1 score, respectively.

5.2.2. TUT-SED 2017 Dataset

The TUT-SED 2017 dataset, extracted from real life, is used to evaluate industrial application value; the results are shown in Table 7. Observing the second and fourth columns in the last row of Table 7, after hybrid CE loss replaces CE loss, the performance on the development dataset decreases slightly, while on the evaluation dataset, improvements of 1.26% and 0.26% are obtained for the class-wise average ER and F1 scores. This pattern is similar to the results from the TUT Rare Sound Event 2017 dataset: performance drops slightly on the development dataset while improving on the evaluation dataset. The differing results on the development and evaluation datasets demonstrate that the proposed hybrid CE loss can alleviate overfitting. For HDC-Inception, when comparing its ability to capture long short-term dependencies against MS-FCN, the results are displayed in the last and second-to-last rows of Table 7. As the DCASE Challenge focuses on the performance of the evaluation dataset, the proposed HDC-Inception is more competitive. On the evaluation dataset, the method based on HDC-Inception shows improvements of 0.55% and 1.24% in class-wise error rate and F1 score, respectively, compared to the method using MS-FCN. This result differs from that of the TUT Rare Sound Event 2017 dataset, which indicates that both approaches have their own advantages and disadvantages. After HDC-Inception and hybrid CE loss are used, compared with SS-FCN, SS-FCN+HDCi+Hybrid-CE achieves improvements of 4.7% and 2.59% in class-wise ER and F1 score, indicating that the proposed method shows a significant improvement on this dataset.
In order to present the results comprehensively and compare the misclassification of different categories, confusion matrices based on quantity and proportion are constructed, as shown in Figure 7. When constructing the confusion matrix, the segment lengths are set to 1 s, as in the segment-based metrics. As shown in Figure 7, for the brakes squeaking event, all samples are misclassified, with 79% of them being misjudged as the car event. This result is similar to that obtained by the dataset developers, where the ER for brakes squeaking was 98% and the F1 score was 4.1% [58], nearly a complete misclassification. Analyzing the reasons, as shown in Table 1, the brakes squeaking event has 52 samples with a total duration of 96.99 s, and only about 70 s is selected during the training process, resulting in an extremely small total duration for this event. Moreover, car engine running and car idling were also labeled as the car event. It is evident that car engine and car idling events have a high similarity to brakes squeaking, which is also the cause of the misjudgment. For the car event, the accuracy is relatively high. For the children event, all samples are misclassified, with 73% being misjudged as people walking. This result is similar to the dataset developers’ findings, where the children event was almost completely misclassified and the F1 score was 0 [58]. The children event has the second-shortest total duration, 350.99 s. Beyond that, during dataset partitioning, the authors divided children yelling and children talking into the children category, which also led to the majority of children samples being misclassified as people speaking and people walking. For the large vehicle event, the proportion of correct judgments is 49%, while the proportions of misclassifications as car and background are 21% and 17%, respectively. It is understandable that large vehicle and car events have a high similarity. For people speaking and people walking, the proportions of correct judgments are 39% and 55%, respectively. The two categories have a relatively high mutual misclassification rate, as it is common for people to walk while talking or talk while walking. For the background event, the proportion of correct judgments is 63%, with 23% misclassified as the car event. Overall, the proportions of misclassifications as car and people walking events are relatively high, due to the longer total durations of these two events in the training dataset, as car has the largest duration of 2471.14 s and people walking has the second-largest duration of 1246.95 s. Therefore, the primary reasons for the model’s misjudgments are inferred to be class imbalance and inter-sample similarity.
Compared to the classical CRNN [62], methods based on the hybrid CE loss and HDC-Inception gain large improvements on both the development and evaluation datasets. SS-FCN with hybrid CE loss and HDC-Inception reduces ER by 0.0107 and improves the F1 score by 0.84% on the development dataset, and reduces ER by 0.0152 and improves the F1 score by 7.92% on the evaluation dataset. For MS-FCN, a noticeable improvement is also achieved by the proposed method. Similar to MSFF-Net [30], the proposed method is an improved method based on MS-FCN [29], and both exhibit comparable performance. Ref. [69] is a method based on dense convolution and attention mechanisms, and its performance is similar to ours. Refs. [27,28] are methods based on Transformer or self-attention mechanisms. Specifically, Ref. [28] discussed the impacts of two self-attention mechanisms, time-restricted and full-length, on polyphonic sound event detection, suggesting that polyphonic sound event detection benefits more from features within a limited time frame; hence, the time-restricted self-attention prevails. On the TUT-SED 2017 dataset, Ref. [28] achieved an ER of 0.9 and an F1 score of 49.86%. Its F1 score is slightly superior to that of the proposed method, which is 49.42%, but its ER is much worse than our 0.7662. Similarly, Ref. [27] employed a self-attention-based method. During training, Ref. [27] incorporated the DCASE 2021 dataset, resulting in an ER of 0.6810 and an F1 score of 49.6%. Although their F1 score is comparable to ours, their ER is 8.52% better than ours. However, the question remains whether the improvement comes from data augmentation or from the Transformer/self-attention architecture. If it originates from the architecture, why does Ref. [28] not show any improvement?
This issue has been discussed in the fields of image processing and natural language processing (NLP). Researchers have consistently indicated that Transformer/self-attention-based methods require large datasets to demonstrate their advantages, which are not as evident on smaller datasets [70,71,72]. In image processing, Ref. [70] suggested that Vision Transformers struggle to effectively learn inductive biases on small datasets. Similarly, in the NLP domain, Ref. [71] found that Switch Transformers did not exhibit significant advantages in clinical narrative classification when applied to small datasets. Both [71,72] emphasized the necessity of special strategies such as data augmentation, pre-trained models, and knowledge distillation when using Transformer models on small datasets. This also explains why, in the signal classification and detection domain, Transformer-based models are predominantly used with large datasets or weak labels, such as AST-SED [23], CNN-Transformer [24], and SEDT [25]. For this reason, in polyphonic sound event detection, Transformer/self-attention-based methods frequently employ data augmentation or knowledge distillation, as seen in Ref. [27] with the DCASE 2021 dataset. Even under these unequal conditions, without using data augmentation or external datasets, the proposed method exhibits comparable performance to methods based on Transformer or self-attention mechanisms on the TUT-SED 2017 dataset.

5.2.3. TUT-SED 2016 Dataset

The TUT-SED 2016 dataset is small, containing 78 min of audio recordings with 17 types of events, and it is used to validate performance on small real-world datasets. For MS-FCN, switching the loss function from CE loss to hybrid CE loss produces no significant change on the development dataset, while on the evaluation dataset the ER and F1 score improve by 0.0041 and 1.64%, respectively. This behavior is similar to that observed on the TUT Rare Sound Event 2017 and TUT-SED 2017 datasets, further validating the effectiveness of the proposed hybrid CE loss. Comparing the MS-FCN-based and HDC-Inception-based methods, as shown in the last two rows of Table 8, the former performs better on the development dataset, while the two perform similarly on the evaluation dataset. Observing the last row and the fourth row from the bottom of Table 8, the transition from MS-FCN to the method based on HDC-Inception and hybrid CE loss improves the ER and F1 score by 0.0039 and 0.94% on the evaluation dataset, while there is a noticeable decline on the development dataset. Compared with MSFF-Net [30], which is also based on MS-FCN [29], the proposed method improves the class-wise average F1 score by 1.08% on the evaluation dataset, indicating that the proposed HDC-Inception and hybrid CE loss are the more effective improvement measures. As shown in Table 8, the proposed method significantly outperforms the existing CRNN [19] and MS-RNN [20] methods, demonstrating a larger performance gain and good adaptability to small datasets.
The experimental results are summarized as follows: (1) Comparing MS-FCN with MS-FCN+Hybrid-CE: (i) on the TUT Rare Sound Event 2017 evaluation dataset, MS-FCN+Hybrid-CE improves the ER and F1 score by 1% and 0.4%, as shown in Table 6; (ii) on the TUT-SED 2017 evaluation dataset, the ER and F1 score improve by 1.26% and 0.26%, as shown in Table 7; (iii) and on the TUT-SED 2016 evaluation dataset, improvements of 0.41% and 1.64% are obtained, as shown in Table 8. (2) Comparing SS-FCN with SS-FCN+HDCi+Hybrid-CE: (i) on the TUT Rare Sound Event 2017 evaluation dataset, SS-FCN+HDCi+Hybrid-CE improves the ER and F1 score by 4% and 1.91%, as shown in Table 5; (ii) on the TUT-SED 2017 evaluation dataset, improvements of 4.7% and 2.59% are obtained in the class-wise ER and F1 score, as shown in Table 7; (iii) and on the TUT-SED 2016 evaluation dataset, the ER improves by 0.78%, while the F1 score remains essentially unchanged. These results indicate that the proposed HDC-Inception and hybrid CE loss are effective, achieving a maximum improvement of 4.7%. Compared with HDC-Inception, the hybrid CE loss yields the clearer improvement, contributing a 3% gain in ER, as shown in Table 5.

5.2.4. Computational Costs

Model size and FLOPs (floating-point operations) are selected to measure model complexity. The model size reflects the total number of parameters in a neural network; a larger model size means higher memory usage. FLOPs refer to the number of basic operations during a single forward or backward propagation of a neural network and reflect its computational load. The larger the FLOPs, the more computational resources and time the network requires, and the higher the model complexity. As shown in Table 9, SS-FCN [29] has the smallest model size and FLOPs, while MS-FCN [29] has the largest. The proposed hybrid CE loss only replaces the loss function; hence, the model size and FLOPs of MS-FCN+Hybrid-CE are identical to those of MS-FCN. The proposed HDC-Inception, although it introduces additional branches to capture long short-term dependencies, does not significantly increase the model size or FLOPs, and both remain smaller than those of MS-FCN. This indicates that, compared to MS-FCN, the proposed HDC-Inception is more favorable in terms of model size and computational complexity.
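As an illustration of how figures of the kind reported in Table 9 can be obtained, the sketch below counts parameters and estimates forward-pass FLOPs for a 1D convolutional network in PyTorch; the toy network, the 4-bytes-per-parameter size convention, and the multiply-add counting rule are assumptions rather than the authors' exact accounting.

```python
# Minimal sketch (not tied to the authors' implementation) of estimating
# model size and FLOPs for a 1D fully convolutional network in PyTorch.
import torch
import torch.nn as nn

def model_size_mb(model: nn.Module) -> float:
    # "Model size" taken as parameter count; MB assumes 4 bytes per float32 parameter.
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * 4 / (1024 ** 2)

def conv1d_flops(layer: nn.Conv1d, out_len: int) -> int:
    # Multiply-add count for one forward pass of a Conv1d layer:
    # 2 * kernel_size * (C_in / groups) * C_out * output_length.
    k = layer.kernel_size[0]
    return 2 * k * (layer.in_channels // layer.groups) * layer.out_channels * out_len

# Example with an assumed toy network and 500 feature frames (e.g. 10 s at 50 frames/s).
net = nn.Sequential(nn.Conv1d(64, 128, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.Conv1d(128, 7, kernel_size=1))
frames = 500
total_flops = sum(conv1d_flops(m, frames) for m in net if isinstance(m, nn.Conv1d))
print(f"size: {model_size_mb(net):.4f} MB, FLOPs: {total_flops / 1e9:.4f} G")
```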

6. Conclusions

To extract richer long short-term dependencies, HDC-Inception is proposed, which combines the advantages of HDC [34] and the Inception module [32,33], extracting multi-scale temporal dependencies and alleviating the "gridding issue" of dilated convolution. To select support vectors and focus on them during training, the soft margin CE loss is proposed, inspired by the support vector machine (SVM). This loss function can discriminate between support vectors and non-support vectors, guide training by focusing on the support vectors, and handle multi-class classification tasks, giving it a wide range of applications. To address the problem of insufficient support vectors in the soft margin CE loss, the hybrid CE loss is proposed. The hybrid CE loss combines the advantages of CE loss and soft margin CE loss, guiding training with all samples while giving additional weight to the support vectors. With the hybrid CE loss, the ER and F1 score improve by 3% and 1.51%, respectively, on the TUT Rare Sound Event 2017 dataset; with both the hybrid CE loss and HDC-Inception, the ER and F1 score improve by 4% and 1.91%, respectively. On the TUT-SED 2017 dataset, with both selected, the ER and F1 score improve by 4.7% and 2.59%, respectively, and on the TUT-SED 2016 dataset, the ER and F1 score also improve. These results indicate that the proposed HDC-Inception and hybrid CE loss can effectively enhance detection performance.
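For readers who want to experiment with the idea, the following is a hedged sketch of a hybrid CE loss of the kind described above, written for per-class binary activity probabilities; the margin value, the support-vector criterion, and the weighting factor alpha are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the hybrid CE loss idea for per-class binary event activity.
# The margin E, the support-vector criterion and the weighting factor `alpha`
# below are illustrative assumptions, not the authors' exact definitions.
import torch
import torch.nn.functional as F

def hybrid_ce_loss(prob, target, margin_e=0.3, alpha=1.0):
    """prob: predicted activity probabilities in [0, 1], shape (N, C).
    target: binary ground-truth activity (float), same shape."""
    # Plain (binary) CE over all samples keeps every sample contributing.
    ce_all = F.binary_cross_entropy(prob, target, reduction="none")

    # "Support vectors": samples whose probability lies inside the margin around
    # the decision boundary or on the wrong side of it, i.e. hard or uncertain
    # samples. Non-support vectors receive no extra weight.
    distance = torch.where(target > 0.5, prob - 0.5, 0.5 - prob)  # signed distance to boundary
    is_support = (distance < margin_e).float()

    # Hybrid CE: CE over all samples plus an extra CE-shaped penalty restricted
    # to the adaptively selected support vectors.
    soft_margin_term = is_support * ce_all
    return (ce_all + alpha * soft_margin_term).mean()

# Usage sketch: prob = torch.sigmoid(logits); loss = hybrid_ce_loss(prob, target)
```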
Through theoretical analysis and experimental verification, we have reason to believe that the proposed HDC-Inception and hybrid CE loss have great potential and are worth further exploration. In particular, the hybrid CE loss and soft margin CE loss can adaptively screen samples, selecting the more valuable samples for training. Moreover, the proposed HDC-Inception is highly compatible with other network structures (see the sketch below), and the soft margin CE loss and hybrid CE loss are applicable to most classification tasks. We will further study the sample selection mechanism of the soft margin CE loss, aiming to link sample selection to model performance, and we expect these methods to find use in industrial application scenarios.
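As an illustration of this compatibility, the sketch below shows what an HDC-Inception-style 1D block might look like when dropped into an existing fully convolutional backbone: Inception-style parallel branches whose stacked dilated convolutions use hybrid dilation rates (e.g., 1, 2, 3) to avoid the "gridding issue". The kernel sizes, branch layout, and channel counts are assumptions, not the paper's exact design.

```python
# Hedged sketch of an HDC-Inception-style 1D block: parallel Inception-style
# branches whose stacked dilated convolutions use hybrid dilation rates rather
# than a single repeated rate, the usual remedy for the "gridding issue".
import torch
import torch.nn as nn

class HDCInceptionBlock1d(nn.Module):
    def __init__(self, channels: int, dilation_sets=((1,), (1, 2), (1, 2, 3))):
        super().__init__()
        self.branches = nn.ModuleList()
        for rates in dilation_sets:
            layers = []
            for d in rates:  # hybrid dilation rates within one branch
                layers += [nn.Conv1d(channels, channels, kernel_size=3,
                                     padding=d, dilation=d),
                           nn.BatchNorm1d(channels),
                           nn.ReLU(inplace=True)]
            self.branches.append(nn.Sequential(*layers))
        # A 1x1 convolution fuses the concatenated multi-scale branch outputs.
        self.fuse = nn.Conv1d(channels * len(dilation_sets), channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)

# Usage sketch: y = HDCInceptionBlock1d(channels=128)(torch.randn(4, 128, 500))
```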

Author Contributions

Conceptualization, Y.W. and Q.L.; methodology, Y.W. and Q.L.; software, Y.W.; validation, Y.W., Q.L. and W.W.; formal analysis, Y.W., Y.C. and J.C.; investigation, W.W.; resources, C.D.; data curation, Y.W., Y.C. and X.S.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W. and Q.L.; visualization, Y.W., Y.C., X.S. and W.Y.; supervision, Y.W., Y.C. and J.C.; project administration, Q.L., Y.C. and W.Y.; funding acquisition, C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under grant number 62107033, and by the National Key Laboratory of Science and Technology on Space Microwave under grants No. Y23-SYSJJ-02 and No. HTKJ2024KL504001.

Data Availability Statement

The datasets used in this article can be obtained from the following link: https://webpages.tuni.fi/arg/datasets (accessed on 8 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Crocco, M.; Cristani, M.; Trucco, A.; Murino, V. Audio Surveillance: A Systematic Review. arXiv 2014, arXiv:1409.7787. [Google Scholar] [CrossRef]
  2. Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Reliable detection of audio events in highly noisy environments. Pattern Recognit. Lett. 2015, 65, 22–28. [Google Scholar] [CrossRef]
  3. Zhang, S.; Li, X.; Zhang, C. Neural Network Quantization Methods for Voice Wake up Network. arXiv 2021, arXiv:1808.06676. [Google Scholar] [CrossRef]
  4. Xu, C.; Rao, W.; Wu, J.; Li, H. Target Speaker Verification with Selective Auditory Attention for Single and Multi-Talker Speech. arXiv 2021, arXiv:2103.16269. [Google Scholar] [CrossRef]
  5. Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 2020, 60, 101027. [Google Scholar] [CrossRef]
  6. Chu, S.; Narayanan, S.; Kuo, C.C. Environmental sound recognition with time—Frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158. [Google Scholar] [CrossRef]
  7. Salamon, J.; Bello, J.P. Feature learning with deep scattering for urban sound analysis. In Proceedings of the 2015 23rd European Signal Processing Conference, EUSIPCO 2015, Nice, France, 31 August–4 September 2015; pp. 724–728. [Google Scholar] [CrossRef]
  8. Stowell, D.; Clayton, D. Acoustic event detection for multiple overlapping similar sources. arXiv 2015, arXiv:1503.07150. [Google Scholar] [CrossRef]
  9. Huang, Y.; Cui, H.; Hou, Y.; Hao, C.; Wang, W.; Zhu, Q.; Li, J.; Wu, Q.; Wang, J. Space-Based Electromagnetic Spectrum Sensing and Situation Awareness. Space Sci. Technol. 2024, 4, 0109. [Google Scholar] [CrossRef]
  10. Xu, L.; Song, G. A Recursive Parameter Estimation Algorithm for Modeling Signals with Multi-frequencies. Circuits, Syst. Signal Process. 2020, 39, 4198–4224. [Google Scholar] [CrossRef]
  11. Wan, Z.; Yang, R.; Huang, M.; Zeng, N.; Liu, X. A review on transfer learning in EEG signal analysis. Neurocomputing 2021, 421, 1–14. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Luo, H.; Wang, C.; Gan, C.; Xiang, Y. Automatic Modulation Classification Using CNN-LSTM Based Dual-Stream Structure. IEEE Trans. Veh. Technol. 2020, 69, 13521–13531. [Google Scholar] [CrossRef]
  13. Heittola, T.; Mesaros, A.; Eronen, A.J.; Virtanen, T. Acoustic event detection in real life recordings. In Proceedings of the European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, 9–13 September 2013. [Google Scholar]
  14. Gencoglu, O.; Virtanen, T.; Huttunen, H. Recognition of acoustic events using deep neural networks. In Proceedings of the European Signal Processing Conference, Lisbon, Portugal, 13 November 2014. [Google Scholar]
  15. Cakir, E.; Heittola, T.; Huttunen, H.; Virtanen, T. Polyphonic sound event detection using multi label deep neural networks. In Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 12–17 July 2015. [Google Scholar] [CrossRef]
  16. Zhang, H.; McLoughlin, I.; Song, Y. Robust sound event recognition using convolutional neural networks. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, QLD, Australia, 19–24 April 2015; pp. 559–563. [Google Scholar] [CrossRef]
  17. Phan, H.; Hertel, L.; Maass, M.; Mertins, A. Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv 2016, arXiv:1604.06338. [Google Scholar] [CrossRef]
  18. Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent neural networks for polyphonic sound event detection in real life recordings. arXiv 2016, arXiv:1604.00861v1. [Google Scholar] [CrossRef]
  19. Cakir, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. arXiv 2017, arXiv:1702.06286. [Google Scholar] [CrossRef]
  20. Lu, R.; Duan, Z.; Zhang, C. Multi-Scale Recurrent Neural Network for Sound Event Detection. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 131–135. [Google Scholar] [CrossRef]
  21. Zhang, J.; Ding, W.; Kang, J.; He, L. Multi-scale time-frequency attention for acoustic event detection. arXiv 2019, arXiv:1904.00063. [Google Scholar] [CrossRef]
  22. Gong, Y.; Chung, Y.; Glass, J. Ast: Audio spectrogram transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar]
  23. Li, K.; Song, Y.; Dai, L.; McLoughlin, I.; Fang, X.; Liu, L. AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
  24. Kong, Q.; Xu, Y.; Wang, W.; Plumbley, M. Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2450–2460. [Google Scholar] [CrossRef]
  25. Ye, Z.; Wang, X.; Liu, H.; Qian, Y.; Tao, R.; Yan, L.; Ouchi, K. Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection. arXiv 2021, arXiv:2110.02011. [Google Scholar] [CrossRef]
  26. Wakayama, K.; Saito, S. Cnn-Transformer with Self-Attention Network for Sound Event Detection. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; pp. 6332–6336. [Google Scholar]
  27. Wang, M.; Yao, Y.; Qiu, H.; Song, X. Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection. Symmetry 2022, 14, 366. [Google Scholar] [CrossRef]
  28. Pankajakshan, A. Sound Event Detection by Exploring Audio Sequence Modelling. Ph.D. Thesis, Queen Mary University of London, London, UK, 2023. [Google Scholar]
  29. Wang, Y.; Zhao, G.; Xiong, K.; Shi, G.; Zhang, Y. Multi-Scale and Single-Scale Fully Convolutional Networks for Sound Event Detection. Neurocomputing 2021, 421, 51–65. [Google Scholar] [CrossRef]
  30. Wang, Y.; Zhao, G.; Xiong, K.; Shi, G. MSFF-Net: Multi-scale feature fusing networks with dilated mixed convolution and cascaded parallel framework for sound event detection. Digit. Signal Process. A Rev. J. 2022, 122, 103319. [Google Scholar] [CrossRef]
  31. Wang, W.; Kao, C.C.; Wang, C. A simple model for detection of Rare Sound Events. arXiv 2018, arXiv:1808.06676. [Google Scholar] [CrossRef]
  32. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  33. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2016, arXiv:1512.00567. [Google Scholar] [CrossRef]
  34. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. arXiv 2018, arXiv:1702.08502. [Google Scholar] [CrossRef]
  35. Li, J.; Yu, Z.L.; Gu, Z.; Liu, H.; Li, Y. Dilated-Inception Net: Multi-Scale Feature Aggregation for Cardiac Right Ventricle Segmentation. IEEE Trans. Biomed. Eng. 2019, 66, 3499–3508. [Google Scholar] [CrossRef]
  36. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  37. Platt, J.C.; Labs, R. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines Review. In Advances in Kernel Methods: Support Vector Learning; MIT Press: Cambridge, MA, USA, 1997. [Google Scholar]
  38. Mesaros, A.; Heittola, T.; Dikmen, O.; Virtanen, T. Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, QL, Australia, 19–24 April 2015. [Google Scholar] [CrossRef]
  39. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  40. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  41. Hayashi, T.; Watanabe, S.; Toda, T.; Hori, T.; Le Roux, J.; Takeda, K. Duration-Controlled LSTM for Polyphonic Sound Event Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2059–2070. [Google Scholar] [CrossRef]
  42. Adavanne, S.; Politis, A.; Virtanen, T. Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-Channel Features. arXiv 2018, arXiv:1801.09522v1. [Google Scholar] [CrossRef]
  43. Kao, C.C.; Wang, W.; Sun, M.; Wang, C. R-CRNN: Region-based convolutional recurrent neural network for audio event detection. arXiv 2018, arXiv:1808.06627. [Google Scholar] [CrossRef]
  44. Huang, G.; Heittola, T.; Virtanen, T. Using sequential information in polyphonic sound event detection. In Proceedings of the 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018, Tokyo, Japan, 17–20 September 2018; pp. 291–295. [Google Scholar] [CrossRef]
  45. Li, Y.; Liu, M.; Drossos, K.; Virtanen, T. Sound Event Detection via Dilated Convolutional Recurrent Neural Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 286–290. [Google Scholar] [CrossRef]
  46. Baade, A.; Peng, P.; Harwath, D. MAE-AST: Masked Autoencoding Audio Spectrogram Transformer. arXiv 2022, arXiv:2203.16691. [Google Scholar]
  47. Alex, T.; Ahmed, S.; Mustafa, A.; Awais, M.; Jackson, P. Max-Ast: Combining Convolution, Local and Global Self-Attentions for Audio Event Classification. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 1061–1065. [Google Scholar]
  48. Gong, Y.; Chung, Y.; Glass, J. PSLA: Improving Audio Tagging with Pretraining. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3292–3306. [Google Scholar] [CrossRef]
  49. Gong, Y.; Lai, C.; Chung, Y.; Glass, J. SSAST: Self-Supervised Audio Spectrogram Transformer. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual Event, 1–22 March 2022; Volume 36, pp. 10699–10709. [Google Scholar]
  50. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. arXiv 2015, arXiv:1409.1259. [Google Scholar] [CrossRef]
  51. Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Improved inception-residual convolutional neural network for object recognition. arXiv 2020, arXiv:1712.09888. [Google Scholar] [CrossRef]
  52. Liu, W.; Chen, J.; Li, C.; Qian, C.; Chu, X.; Hu, X. A cascaded inception of inception network with attention modulated feature fusion for human pose estimation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  53. Cho, S.; Foroosh, H. Spatio-Temporal Fusion Networks for Action Recognition. arXiv 2019, arXiv:1906.06822. [Google Scholar] [CrossRef]
  54. Hussein, N.; Gavves, E.; Smeulders, A.W. Timeception for complex action recognition. arXiv 2019, arXiv:1812.01289. [Google Scholar] [CrossRef]
  55. Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal pyramid network for action recognition. arXiv 2020, arXiv:2004.03548. [Google Scholar] [CrossRef]
  56. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
  57. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar] [CrossRef]
  58. Mesaros, A.; Heittola, T.; Diment, A.; Elizalde, B.; Shah, A.; Vincent, E.; Raj, B.; Virtanen, T. DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System. In Proceedings of the DCASE 2017—Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16–17 November 2017. [Google Scholar]
  59. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the European Signal Processing Conference 2016, Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132. [CrossRef]
  60. Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for Polyphonic Sound Event Detection. Appl. Sci. 2016, 6, 162. [Google Scholar] [CrossRef]
  61. Shen, Y.H.; He, K.X.; Zhang, W.Q. Learning how to listen: A temporal-frequential attention model for sound event detection. arXiv 2019, arXiv:1810.11939. [Google Scholar] [CrossRef]
  62. Cakir, E.; Virtanen, T. Convolutional Recurrent Neural Networks for Rare Sound Event Detection. In Proceedings of the DCASE 2017—Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017; pp. 1–5. [Google Scholar]
  63. Lim, H.; Park, J.; Lee, K.; Han, Y. Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks. In Proceedings of the DCASE 2017 Proceedings—Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017; pp. 2–6. [Google Scholar]
  64. Baumann, J.; Lohrenz, T.; Roy, A.; Fingscheidt, T. Beyond the Dcase 2017 Challenge on Rare Sound Event Detection: A Proposal for a More Realistic Training and Test Framework. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 611–615. [Google Scholar] [CrossRef]
  65. Lu, R. Bidirectional GRU for Sound Event Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017; pp. 1–4. [Google Scholar]
  66. Zhou, J. Sound Event Detection in Multichannel Audio LSTM Network. arXiv 2017. [Google Scholar]
  67. Chen, Y.; Zhang, Y.; Duan, Z. Dcase2017 Sound Event Detection Using Convolutional Neural Networks. In Proceedings of the DCASE 2017—Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 16 November 2017. [Google Scholar]
  68. Adavanne, S.; Virtanen, T. A report on sound event detection with different binaural features. arXiv 2017, arXiv:1710.02997. [Google Scholar] [CrossRef]
  69. Yang, H.; Luo, L.; Wang, M.; Song, X.; Mi, F. Sound Event Detection Using Multi-Scale Dense Convolutional Recurrent Neural Network with Lightweight Attention. In Proceedings of the 2023 3rd International Conference on Electronic Information Engineering and Computer, EIECT 2023, Shenzhen, China, 17–19 November 2023; pp. 35–40. [Google Scholar]
  70. Lu, Z.; Xie, H.; Liu, C.; Zhang, Y. Bridging the Gap between Vision Transformers and Convolutional Neural Networks on Small Datasets. Adv. Neural Inf. Process. Syst. 2022, 35, 14663–14677. [Google Scholar]
  71. Le, T.; Jouvet, P.; Noumeir, R. A Small-Scale Switch Transformer and NLP-Based Model for Clinical Narratives Classification. arXiv 2023, arXiv:2303.12892. [Google Scholar] [CrossRef]
  72. Panopoulos, I.; Nikolaidis, S.; Venieris, S.; Venieris, I. Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices. In Proceedings of the IEEE Symposium on Computers and Communications, Gammarth, Tunisia, 9–12 July 2023. [Google Scholar]
Figure 1. Inspired by the Inception module (a) and Dilated-Inception module (b), the HDC-Inception module (c) is proposed to capture long short-term dependencies.
Figure 2. Inspired by SVM (a), using the advantages of soft margin SVM (b) and cross-entropy loss (CE loss), the soft margin CE loss approach (c) is proposed, which adaptively selects effective samples (support vectors) and uses them to guide the training. Dark samples: support vectors; white samples: non-support vectors. wx + b = 0 is the separating hyperplane, where w ∈ R^p represents the weight vector and b is the bias. p ∈ [0, 1] represents the activity probability of sample x, E represents the margin.
Figure 3. The loss functions of the cross-entropy loss (CE loss) (a) and the soft margin CE loss (b); x-axis: the output of the neural networks, indicating the activity probability of the sound event; y-axis: the output of the loss function, indicating the error or penalty value. Their partial derivatives ∂L/∂w_j are shown in (c,d); x-axis: the same as in (a,b); y-axis: the partial derivative ∂L/∂w_j, indicating the convergence speed of the CE loss and soft margin CE loss.
Figure 4. Overview of the proposed networks, which consist of convolutional layers (a), the proposed HDC-Inception layers (b), a prediction layer (c), and the soft margin CE loss or hybrid CE loss as the loss function (d).
Figure 5. In soft margin CE loss, the effect of the margin (E) on (a) the class-wise error rate (ER) and (b) the class-wise F1 score. Dataset: TUT Rare Sound Events 2017 development/evaluation dataset. x-axis: margin (E); y-axis: error rate or F1 score. "Dev": development dataset, "Eva": evaluation dataset.
Figure 6. During training, the number of effective samples (support vectors) (a) and the effective sample ratio (b) vary after each epoch.
Figure 7. Confusion matrices based on the results of the TUT-SED 2017 dataset. When constructing the confusion matrices, the segment length is set to 1 s, consistent with the segment-based metrics. (a) Confusion matrix based on quantity; (b) confusion matrix based on proportion.
Table 1. Event instances per class in the TUT Rare Sound Events 2017 dataset, the TUT-SED 2017 dataset, and the TUT-SED 2016 dataset.
TUT Rare Sound Events 2017 dataset (instances, development/evaluation):
baby cry: 106/42; glass break: 96/43; gunshot: 134/53.
TUT-SED 2017 dataset (development instances/total duration, evaluation instances):
brakes squeaking: 52/96.99 s, 24; car: 304/2471.14 s, 110; children: 44/35.99 s, 19; large vehicle: 61/923.74 s, 24; people speaking: 89/715.66 s, 47; people walking: 109/1246.95 s, 48.
TUT-SED 2016 dataset, home (instances):
(object) Rustling: 60; (object) Snapping: 57; Cupboard: 40; Cutlery: 76; Dishes: 151; Drawer: 51; Glass jingling: 36; Object impact: 250; People walking: 54; Washing dishes: 84; Water tap running: 47.
TUT-SED 2016 dataset, residential area (instances):
(object) Banging: 23; Bird singing: 271; Car passing by: 108; Children shouting: 31; People speaking: 52; People walking: 44; Wind blowing: 30.
Table 2. HDC-Inception vs. dilated convolution. Layers: the number of dilated convolution layers or HDC-Inception layers. ER: class-wise averaging error rate, F1: class-wise averaging F1 score, development|evaluation: TUT Rare Sound Event 2017 development/evaluation dataset. The best results are marked in bold face.
Values are given as development|evaluation.
Layer 4: SS-FCN ER 0.15|0.27, F1 92.03|86.74; SS-FCN+HDC-Inception ER 0.12|0.18, F1 93.84|90.99.
Layer 5: SS-FCN ER 0.17|0.20, F1 91.59|90.16; SS-FCN+HDC-Inception ER 0.10|0.16, F1 94.38|91.43.
Layer 6: SS-FCN ER 0.12|0.17, F1 94.12|91.47; SS-FCN+HDC-Inception ER 0.11|0.16, F1 94.52|91.87.
Layer 7: SS-FCN ER 0.10|0.18, F1 94.81|90.60; SS-FCN+HDC-Inception ER 0.11|0.17, F1 94.12|91.58.
Table 3. Effects of the layer of HDC-Inception. Networks: SS-FCN+HDC-Inception. Layer: the layer of HDC-Inception. ER: error rate. F1: F1 score. The best results are marked in bold face.
Values are ER|F1(%) for Baby Cry, Glass Break, Gunshot, and Average, on the development and evaluation datasets.
Layer 4: Development 0.09|95.39, 0.05|97.34, 0.21|88.79, 0.12|93.84; Evaluation 0.28|86.91, 0.14|92.34, 0.12|93.71, 0.18|90.99.
Layer 5: Development 0.07|96.54, 0.04|97.97, 0.20|88.64, 0.10|94.38; Evaluation 0.15|92.40, 0.16|91.07, 0.17|90.83, 0.16|91.43.
Layer 6: Development 0.13|93.44, 0.06|97.14, 0.14|92.98, 0.11|94.52; Evaluation 0.16|92.25, 0.17|91.10, 0.16|92.25, 0.16|91.87.
Layer 7: Development 0.11|94.80, 0.06|96.69, 0.17|90.87, 0.11|94.12; Evaluation 0.28|86.91, 0.14|92.47, 0.09|95.37, 0.17|91.58.
Layer 8: Development 0.09|95.51, 0.07|96.52, 0.21|88.40, 0.12|93.48; Evaluation 0.23|88.49, 0.15|91.95, 0.15|92.21, 0.18|90.88.
Layer 9: Development 0.14|93.15, 0.06|97.13, 0.21|88.60, 0.14|92.96; Evaluation 0.17|91.62, 0.17|91.18, 0.15|92.43, 0.16|91.74.
Table 4. Effect of margin in soft margin CE loss. Dataset: TUT Rare Sound Events 2017 dataset. Margin (E): the margin of the soft margin CE loss. The best results are marked in bold face.
Values are ER|F1(%) for Baby Cry, Glass Break, Gunshot, and Average, on the development and evaluation datasets.
Margin 0.1: Development 0.16|91.87, 0.10|95.00, 0.18|90.11, 0.15|92.33; Evaluation 0.25|87.80, 0.13|92.93, 0.12|93.70, 0.17|91.48.
Margin 0.2: Development 0.15|92.71, 0.06|96.69, 0.24|87.01, 0.15|92.14; Evaluation 0.17|91.55, 0.16|91.18, 0.14|92.89, 0.16|91.87.
Margin 0.3: Development 0.17|91.55, 0.10|94.51, 0.18|90.64, 0.15|92.23; Evaluation 0.21|89.47, 0.15|92.01, 0.11|94.50, 0.15|91.99.
Margin 0.4: Development 0.14|93.03, 0.10|94.74, 0.18|89.87, 0.14|92.55; Evaluation 0.19|90.47, 0.14|92.27, 0.16|91.45, 0.16|91.40.
Margin 0.45: Development 0.13|93.76, 0.05|97.56, 0.26|85.90, 0.15|92.41; Evaluation 0.20|90.02, 0.14|92.89, 0.16|91.96, 0.17|91.62.
Table 5. Results of ablation studies. Dataset: TUT Rare Sound Events 2017 dataset. SM-CE: soft margin CE loss. The best results are marked in bold face.
Values are ER|F1(%) for Baby Cry, Glass Break, Gunshot, and Average, on the development and evaluation datasets.
SS-FCN: Development 0.11|94.67, 0.10|94.65, 0.14|93.03, 0.12|94.12; Evaluation 0.26|86.96, 0.15|92.08, 0.09|95.37, 0.17|91.47.
SS-FCN+HDC-Inception: Development 0.13|93.44, 0.06|97.14, 0.14|92.98, 0.11|94.52; Evaluation 0.16|92.25, 0.17|91.10, 0.16|92.25, 0.16|91.87.
SS-FCN+HDC-Inception+SM-CE: Development 0.17|91.55, 0.10|94.51, 0.18|90.64, 0.15|92.23; Evaluation 0.21|89.47, 0.15|92.01, 0.11|94.50, 0.15|91.99.
SS-FCN+HDC-Inception+hybrid-CE: Development 0.14|93.09, 0.07|96.54, 0.14|92.66, 0.12|94.10; Evaluation 0.20|90.41, 0.10|94.61, 0.10|95.12, 0.13|93.38.
MS-FCN: Development 0.06|96.76, 0.09|95.40, 0.14|92.86, 0.10|95.01; Evaluation 0.18|90.98, 0.18|89.91, 0.10|95.10, 0.15|92.00.
MS-FCN+SM-CE: Development 0.15|92.56, 0.10|95.00, 0.13|93.19, 0.12|93.58; Evaluation 0.17|91.53, 0.16|91.30, 0.12|93.50, 0.15|92.11.
MS-FCN+hybrid CE loss: Development 0.15|92.77, 0.04|97.97, 0.15|92.01, 0.11|94.25; Evaluation 0.20|90.20, 0.14|92.74, 0.10|95.10, 0.14|92.68.
Table 6. The results on the TUT Rare Sound Event 2017 dataset. The symbol "***" indicates that a specific result is not available. HDCi: HDC-Inception. The best results are marked in bold face.
Values are ER|F1(%) for Baby Cry, Glass Break, Gunshot, and Average, on the development and evaluation datasets.
CRNN+TA [61]: Development ***; Evaluation 0.25|87.4, 0.05|97.4, 0.18|90.6, 0.16|91.8.
CRNN+Attention [61]: Development 0.10|95.1, 0.01|99.4, 0.16|91.5, 0.09|95.3; Evaluation 0.18|91.3, 0.04|98.2, 0.17|90.8, 0.13|93.4.
Multi-Scale RNN [31]: Development 0.11|94.3, 0.04|97.8, 0.18|90.6, 0.11|94.2; Evaluation 0.26|86.5, 0.16|92.1, 0.18|91.1, 0.20|89.9.
R-CRNN [43]: Development 0.09|***, 0.04|***, 0.14|***, 0.09|95.5; Evaluation ***, ***, ***, 0.23|87.9.
CRNN [62]: Development ***, ***, ***, 0.14|92.9; Evaluation 0.18|90.8, 0.10|94.7, 0.23|87.4, 0.17|91.0.
1D-CRNN [63]: Development 0.05|97.6, 0.01|99.6, 0.16|91.6, 0.07|96.3; Evaluation 0.15|92.2, 0.05|97.6, 0.19|89.6, 0.13|93.1.
Baumann et al. [64]: Development ***; Evaluation 0.29|86.20, 0.09|95.47, 0.28|85.60, 0.22|89.09.
MTFA [21]: Development 0.06|96.7, 0.02|99.0, 0.14|92.7, 0.07|96.1; Evaluation 0.10|95.1, 0.02|98.8, 0.14|92.5, 0.09|95.5.
MS-FCN [29]: Development 0.10|95.1, 0.02|99.0, 0.13|93.1, 0.08|95.7; Evaluation 0.11|94.4, 0.06|96.8, 0.08|96.2, 0.08|95.8.
MSFF-Net [30]: Development 0.10|94.97, 0.03|98.60, 0.11|94.43, 0.08|96.00; Evaluation 0.10|94.82, 0.05|97.59, 0.08|95.97, 0.08|96.13.
MS-FCN+Hybrid-CE: Development 0.12|93.9, 0.02|99.0, 0.12|94.0, 0.08|95.6; Evaluation 0.11|94.4, 0.05|97.4, 0.06|96.8, 0.07|96.2.
SS-FCN+HDCi+Hybrid-CE: Development 0.11|94.31, 0.01|99.60, 0.13|93.3, 0.08|95.7; Evaluation 0.12|93.8, 0.06|96.8, 0.06|97.0, 0.08|95.9.
Table 7. Performance comparison. Dataset: TUT-SED 2017 dataset. The symbol “***” indicates that specific results are not available. HDCi: HDC-Inception. The best results are marked in bold face.
Columns: Development ER, Development F1 (%), Evaluation ER, Evaluation F1 (%); "***" indicates a result that is not available.
SS-RNN [65]: 0.61 ± 0.003, 56.7, 0.825, 39.6.
LSRM [66]: 0.66, 54.5, 0.853, 39.1.
CNN [67]: 0.81, 37, 0.858, 30.9.
DCASE2017 Baseline [58]: 0.69, 56.7, 0.9358, 42.8.
MS-RNN [20]: 0.604 ± 0.001, ***, ***, ***.
CRNN [68]: 0.659, ***, 0.7914, 41.7.
SS-FCN [29]: 0.5724 ± 0.008, 61.02 ± 0.63, 0.8132 ± 0.0145, 47.03 ± 1.73.
MS-FCN [29]: 0.5714 ± 0.0097, 61.21 ± 0.85, 0.7843 ± 0.019, 48.60 ± 0.74.
MSFF-Net [30]: 0.5805 ± 0.0047, 59.76 ± 0.79, 0.7519 ± 0.0074, 49.93 ± 1.5.
LocEnc100 [28]: ***, ***, 0.90, 49.86.
attn200 [27]: ***, ***, 0.681, 49.6.
MS-AttDenseNet-RNN [69]: ***, ***, 0.76, 49.6.
MS-FCN+Hybrid-CE: 0.5745 ± 0.0044, 60.27 ± 1.28, 0.7717 ± 0.0402, 48.86 ± 0.91.
SS-FCN+HDCi+Hybrid-CE: 0.5893 ± 0.0135, 59.84 ± 0.99, 0.7662 ± 0.0209, 49.62 ± 0.96.
Table 8. Results on the TUT-SED 2016 dataset. The symbol “***” indicates that specific results are not available. HDCi: HDC-Inception. The best results are marked in bold face.
Columns: Development ER, Development F1 (%), Evaluation ER, Evaluation F1 (%); "***" indicates a result that is not available.
GMM [59]: 1.13, 17.9, ***, ***.
FNN [15]: 1.32 ± 0.06, 32.5 ± 1.2, ***, ***.
CNN [19]: 1.09 ± 0.06, 26.4 ± 1.9, ***, ***.
RNN [19]: 1.10 ± 0.04, 29.7 ± 1.4, ***, ***.
MS-RNN [20]: 0.82 ± 0.01, 31.5 ± 0.8, ***, ***.
CRNN [19]: 0.95 ± 0.02, 30.3 ± 1.7, ***, ***.
SS-FCN [29]: 0.7807 ± 0.0088, 41.62 ± 0.94, 0.9367 ± 0.0084, 26.66 ± 1.46.
MS-FCN [29]: 0.7780 ± 0.0127, 42.02 ± 1.85, 0.9328 ± 0.0411, 25.39 ± 3.05.
MSFF-Net [30]: 0.7806 ± 0.0103, 40.85 ± 0.0179, 0.9264 ± 0.0099, 25.95 ± 0.0156.
MS-FCN+hybrid-CE: 0.7769 ± 0.0234, 40.68 ± 2.23, 0.9287 ± 0.0218, 27.03 ± 1.92.
SS-FCN+HDCi+hybrid-CE: 0.8078 ± 0.0067, 37.11 ± 1.41, 0.9289 ± 0.0118, 26.33 ± 1.40.
Table 9. Computational costs of the proposed networks. Model size: the number of parameters of the model. FLOPs: the number of basic operations during a single forward or backward propagation in the networks.
SS-FCN [29]: model size 2.5445 MB, FLOPs 0.9936 G.
SS-FCN+HDCi+Hybrid-CE: model size 2.6761 MB, FLOPs 1.0086 G.
MS-FCN [29]: model size 3.8741 MB, FLOPs 1.171 G.
MS-FCN+Hybrid-CE: model size 3.8741 MB, FLOPs 1.171 G.