1. Introduction
Sound serves as a crucial pathway for conveying information, allowing humans to comprehend the conditions and changes in their surroundings through auditory cues. All audible sounds that humans can perceive are collectively referred to as audio. In practical environments, the acquisition and transmission of audio signals are often affected by a multitude of diverse noise sources. Strong background noise becomes intertwined with the intended audio signals, significantly masking the inherent features of the target audio signals. Amidst strong noise interference, the recognition of noise segments enables audio enhancement procedures, such as noise reduction and echo elimination, thus improving the quality and audibility of the audio. When analyzing environmental sounds, identifying relevant sound segments contributes to detecting specific events and sound patterns, or to isolating abnormal sounds like sirens, passing vehicles, and vocal conversations. Concurrently, the recognition of noise segments allows for a precise analysis of the attributes of environmental noise, thereby enhancing safety measures.
Audio type recognition poses a significant challenge within the realm of pattern recognition. The early 1990s saw the initiation of research into methodologies for audio type recognition. Notably, in 1994, B. Feiten and S. Gunzel employed a technique based on Self-Organizing Neural Nets to automatically identify auditory features with similar acoustic qualities [1]. As computational power has advanced and the volume of audio feature data has expanded significantly, recognition models rooted in machine learning and deep learning have become indispensable for audio signal recognition [2]. These models encompass a range of approaches, including convolutional neural networks (CNNs) [3,4,5,6], recurrent neural networks (RNNs) [7], convolutional recurrent neural networks (CRNNs), randomized learning [8], deep convolutional neural networks (DCNNs) [9], support vector machines (SVMs) [4], Gaussian mixture models (GMMs), deep attention networks [10], transfer learning [9], and ensemble learning [11], among others. These methods can be applied independently or in combination to enhance the performance of audio category recognition.
Enhancing the accuracy of audio type recognition hinges on two pivotal considerations: firstly, selecting the optimal feature or feature combination that captures the fundamental characteristics of the audio signal; and secondly, choosing the appropriate method or model for recognizing audio signal types [12,13]. The Mel acoustic spectrogram aligns with the perceptual attributes of the human ear, enabling a more effective capture of crucial audio signal information. It finds applicability across various audio processing tasks, contributing to heightened efficiency and performance in feature representation. Our study proposes the adoption of the Mel acoustic spectrogram to characterize audio signals.
Neural networks operate akin to the human brain, yet their initial performance lagged behind that of traditional machine learning models of the same era. Because most neural networks struggled to handle the dynamic attributes inherent in audio signals, and because recognizing phonemes requires contextual information, the Time Delay Neural Network (TDNN) was introduced as a solution by Waibel et al. in 1989 [14]. The TDNN boasts two remarkable attributes: its capacity to dynamically adapt to temporal changes in features and its minimal parameter count [15,16]. In a traditional deep neural network, each hidden-layer node is linked only to the input at the current moment. In contrast, in the TDNN the features of the hidden layer are jointly influenced by the inputs at the current moment and at adjacent moments. This approach effectively leverages the temporal context information within audio signals by processing multiple consecutive frames of audio input. The Emphasized Channel Attention, Propagation, and Aggregation Time Delay Neural Network (ECAPA-TDNN) is a neural network model designed for speech recognition tasks, introduced in 2020 [17]. This model amalgamates the conventional TDNN architecture with attention mechanisms, emphasizing channel attention, propagation, and aggregation in TDNN-based speaker verification [18,19,20,21]. Furthermore, it enhances feature extraction and representation capabilities by incorporating extended context aggregation and introducing expansion layers. To discern the type of both target audio and noise, our research employs an audio type recognition method founded on the deep learning model ECAPA-TDNN. The experimental data were sourced from the THCHS-30 dataset of Tsinghua University [22], comprising speech signals serving as event signals. Additionally, the dataset includes noise signals encompassing 12 distinct types from the NoiseX-92 dataset [23].
2. Mel Sound Spectrogram
The time-domain waveform represents the most straightforward and readily obtainable feature for recognizing audio signal types. However, due to its susceptibility to interference and its inherent instability, time-domain information tends to yield suboptimal results as a recognition feature. Conversely, the frequency-domain information of audio signals offers greater accuracy in capturing audio characteristics and is less prone to interference. Currently, the conversion of an audio signal’s time-domain information into frequency-domain information can be achieved through methods like the Fourier transform or wavelet transform. However, these approaches often lead to the loss of certain signal features. The time–frequency characterization of a signal encompasses both time-domain and frequency-domain information, endowing it with heightened identification capabilities. An acoustic spectrogram, built on spectral analysis with the incorporation of the time dimension, offers a more intuitive depiction of signal changes. Essentially, it embodies a time–frequency characterization of the audio signal [24].
The spectrum portrays signal distribution across various frequencies. However, the human auditory system discriminates between frequencies with varying sensitivity. Research reveals that the frequency resolution of the human ear is not linear but logarithmic. This means that two pairs of frequencies situated at equal distances in the frequency domain might not be perceived equally by the human ear [25]. This issue finds an effective resolution through the introduction of Mel frequency. Mel frequency characterizes the human ear’s sensitivity to audio signal frequencies [26]. The logarithmic relationship between linear frequency and Mel frequency is defined by Equation (1) [27]:
$$f_{\mathrm{Mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \quad (1)$$

where $f_{\mathrm{Mel}}$ is the perceived frequency in Mel, and $f$ is the actual frequency in Hz.
Figure 1 illustrates a schematic representation of the relationship between Mel frequency and actual frequency. Notably, as the frequency decreases, the Mel frequency exhibits a more rapid alteration concerning linear frequency, resulting in a steeper curve slope. Conversely, at higher frequencies, the Mel frequency experiences a gentler ascent, leading to a smaller curve slope. This phenomenon underscores the concept that high-frequency sounds are less distinguishable to the human ear, while low-frequency sounds are more easily discerned. This variation in perceptual sensitivity by the human ear is distinctly portrayed.
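As a quick illustration of Equation (1), the curve in Figure 1 can be reproduced with a few lines of Python; the constants 2595 and 700 correspond to the common formulation of the Mel scale assumed here, and the printed values are only examples.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale, following Equation (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# At low frequencies the mapping is nearly linear; above roughly 1 kHz it
# is compressed logarithmically, matching the curve shape in Figure 1.
print(hz_to_mel(1000.0))   # ~1000 Mel
print(hz_to_mel(8000.0))   # ~2840 Mel
```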
Building upon the investigation into Mel frequency, the Mel filter is introduced to simulate the phenomenon where higher frequencies are perceived less distinctly by the human ear, exhibiting a more gradual auditory response. This involves constructing numerous triangular filters, with a greater emphasis on low-frequency filters and fewer high-frequency filters, forming a filter bank aligned with their frequency distribution. The frequency response characteristic curve of the Mel filter bank is depicted in Figure 2.
By subjecting the spectrogram to Mel-scaled filtering through a bank of Mel filters, the transformation to Mel spectrogram for the audio signal is achieved. Similar to the spectrogram, the Mel spectrogram is also a representation in the time–frequency domain.
Figure 3 displays the time-domain waveform, spectrogram, and Mel spectrogram corresponding to a segment of noise signal.
The Mel acoustic spectrogram employs the Mel scale on the frequency axis, with the Mel filter bank designed to align with the human ear’s sound perception characteristics. This design facilitates the mapping of denser frequency regions to sparser ones, effectively reducing unnecessary redundancy within the spectrogram. By remapping frequency axis information, the spectrogram’s dimensionality is reduced. This not only lessens the computational complexity of the features but also accelerates the training and inference processes of the model. Ultimately, the Mel sound spectral feature parameters are chosen for audio event detection and noise type identification, effectively accommodating various signal types.
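As a sketch of how such Mel sound spectral features might be extracted in practice, the following Python snippet uses torchaudio with the 16 kHz sampling rate and 20 ms/10 ms framing adopted later in the experiments; the FFT size, the choice of 80 Mel bands, and the file name are illustrative assumptions rather than the exact settings reported in this work.

```python
import torchaudio

# Illustrative Mel spectrogram front end: 16 kHz audio, 20 ms frames
# (320 samples) with a 10 ms shift (160 samples).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,        # FFT size covering the 320-sample analysis window
    win_length=320,   # 20 ms window
    hop_length=160,   # 10 ms frame shift
    n_mels=80,        # number of Mel bands (assumption)
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform, sr = torchaudio.load("example.wav")     # hypothetical mono 16 kHz file
mel_spectrogram = to_db(mel_transform(waveform))  # shape: (1, 80, num_frames)
```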
Figure 4 presents the Mel sound spectral characterizations for the 12 types of noise sourced from the NoiseX-92 dataset. The first and second plots depict Mel acoustic spectrograms of white noise and pink noise, respectively. In these plots, time is represented on the horizontal axis, frequency on the vertical axis, and the plot color corresponds to signal amplitude. A comparison across the plots highlights significant disparities in the Mel acoustic spectrograms across different scenarios. For instance, there is minimal high-frequency information in the Volvo vehicle noise scenario, while the f16 fighter noise scenario prominently features a higher high-frequency component. Additionally, the Mel acoustic spectrograms of the f16 fighter noise scenario exhibit distinct horizontal stripes, whereas the factory1 and factory2 scenarios display predominant vertical stripes. The irregular “speckling” observed in the factory2 scene arises from the recording’s context in an automobile production plant, where abrupt acoustic events like knocks are common.
Figure 5 shows the Mel acoustic spectrograms of three distinct audio event signals: speech, alarm, and explosion. Evidently, the variations in characteristics among different audio events are pronounced, enabling clear differentiation between several audio events.
With deep learning models demonstrating their prowess across diverse domains, many challenges that elude effective resolution via traditional machine learning find improved outcomes through deep neural networks. This effect is particularly prominent in image recognition [28]. Because the Mel spectrogram encapsulates fundamental audio features in an image-like representation, inputting Mel spectrograms into deep neural network models leverages the networks’ proficiency in image processing and enables the recognition of audio types.
3. ECAPA-TDNN Deep Learning Model
The ECAPA-TDNN model was introduced by Desplanques et al. at Ghent University, Belgium, in 2020. Drawing on the latest advancements in computer vision-related fields, the ECAPA-TDNN model brings forth several enhancements to the TDNN model. This model places heightened emphasis on inter-channel attention and multilayer feature aggregation [17,18].
Figure 6 illustrates the structure of the ECAPA-TDNN model, comprising key components such as TDNN+ReLU+BN, SE-Res2Block, Attentive Statistics Pooling (ASP), and Multilayer Feature Aggregation (MFA). The model employs the AAM-Softmax loss function for optimization.
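For orientation, the TDNN+ReLU+BN component can be viewed as a dilated 1-D convolution over the time axis followed by activation and batch normalization. The sketch below is a minimal PyTorch rendering of such a frame layer; the channel sizes, kernel size, and dilation are chosen purely for illustration and are not taken from the model configuration used in this work.

```python
import torch.nn as nn

class TDNNBlock(nn.Module):
    """One TDNN+ReLU+BN frame layer realised as a dilated 1-D convolution
    over time; layer sizes here are illustrative assumptions."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              dilation=dilation,
                              padding=dilation * (kernel_size - 1) // 2)
        self.relu = nn.ReLU()
        self.bn = nn.BatchNorm1d(out_channels)

    def forward(self, x):          # x: (batch, channels, time)
        return self.bn(self.relu(self.conv(x)))

# e.g. a first frame layer acting on an 80-band Mel spectrogram input
layer1 = TDNNBlock(in_channels=80, out_channels=512, kernel_size=5, dilation=1)
```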
3.1. SE-Res2Block
The ECAPA-TDNN model includes a section composed of multiple SE-Res2Block modules linked sequentially. The core constituents of this module encompass TDNN, SE-Net, and Res2Net components.
Figure 7 presents an illustrative representation of the network structure of the SE-Res2Block modules.
In recent years, to enhance the effectiveness of deep learning models, researchers have introduced concepts like inception structures and attention mechanisms. These innovations focus on optimizing the spatial dimensions of input feature maps. By aggregating features from various receptive fields and adeptly capturing both global and local connections, these mechanisms improve the overall performance of deep learning models. The distinctive aspect of SE-Net lies in its approach of modeling the channel dimension of input feature maps. This enables a recalibration of the feature maps, thereby boosting the model’s performance [29].
Figure 8 provides an illustrative depiction of the SE-Net structure.
The input X of the SE-Net possesses a dimension of $H' \times W' \times C'$, which can be processed and translated into a feature map U with the dimensions $H \times W \times C$. Subsequently, the feature map U undergoes additional compression through global average pooling, resulting in a channel vector that encapsulates global information for each channel. The subsequent step involves excitation, entailing two fully connected layer operations performed on the channel vectors. The initial fully connected layer executes dimensionality reduction, which curtails parameters to lower computational complexity. The subsequent fully connected layer conducts dimensionality enhancement, aiming to restore the dimensions of the channel numbers and weight vectors. Using two fully connected layers often reveals channel correlations more effectively than a single layer, offering augmented nonlinear capabilities, while trimming parameters to bolster computational efficiency. Employing the Sigmoid activation function, the output from the fully connected layers is employed to compute weights for each channel feature. Consequently, the original features can be adjusted based on these weights, generating a feature map, $\tilde{X}$, that more accurately captures type-specific characteristics.
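A minimal PyTorch sketch of this squeeze-and-excitation idea, written for the 1-D (channel × time) feature maps that appear inside SE-Res2Block rather than the 2-D image case above, is given below; the bottleneck width is an illustrative assumption.

```python
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation over the channel dimension of a (batch, C, T)
    feature map; the bottleneck size is an illustrative assumption."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)   # dimensionality reduction
        self.fc2 = nn.Linear(bottleneck, channels)   # dimensionality restoration
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        s = x.mean(dim=2)                                      # squeeze: global average pooling over time
        s = self.sigmoid(self.fc2(self.relu(self.fc1(s))))     # excitation: per-channel weights in (0, 1)
        return x * s.unsqueeze(2)                              # recalibrate the original features
```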
Diverging from preceding network architectures that rely on features of varying resolutions to enhance multiscale capabilities, Res2Net captures features across distinct receptive fields and scales, thus acquiring a comprehensive blend of global and local information [30,31]. Res2Net is an advancement built upon the Bottleneck structure, segmenting the 3 × 3 convolutional layer within each residual block into multiple sub-branches whose features are then fused through residual connections. The Bottleneck structure is depicted in Figure 9a, comprising sequentially connected 1 × 1, 3 × 3, and 1 × 1 convolutions that are integrated through a residual connection. Figure 9b illustrates the Res2Net structure for s = 4. The distinct differentiation between Res2Net and Bottleneck lies in the approach of Res2Net, which splits the feature map produced by the first 1 × 1 convolution into s sub-branches. Notably, x1 serves directly as the output y1 without any modification. Meanwhile, x2 undergoes a 3 × 3 convolution, producing the output y2, which is also passed on to x3. In a cascading manner, x3 and x4 execute the same sequence of operations.
3.2. Attentive Statistics Pooling
To address the limitation of the average pooling layer, which is susceptible to information loss, attentive statistics pooling was introduced. This approach simultaneously considers disparities in information across both the time and channel dimensions. Thus, the network can allocate attention to crucial details across various time intervals and feature maps. Attentive statistics pooling is implemented through Equation (2):

$$e_{t,c} = \boldsymbol{v}_c^{\top} f(\boldsymbol{W}\boldsymbol{h}_t + \boldsymbol{b}) + k_c \quad (2)$$

In Equation (2), $\boldsymbol{h}_t$ represents the activation value of the preceding layer’s network at time step $t$. Following the transformation by the weight matrix $\boldsymbol{W} \in \mathbb{R}^{R \times C}$ and the bias $\boldsymbol{b}$, the dimensionality of $\boldsymbol{h}_t$ is reduced from $C$ channels to $R$ channels. This streamlines the parameter configuration and mitigates the potential for overfitting.
Moreover, $\boldsymbol{v}_c$ and $k_c$ signify that the $R$-dimensional vector derived from the activation function $f(\cdot)$ undergoes a linear transformation and projection, resulting in a $C$-dimensional spatial representation. To calculate the attention weight $\alpha_{t,c}$ of time step $t$ on channel $c$, the Softmax transformation is applied to $e_{t,c}$ across all time steps. Equation (3) represents the formula for this computation:

$$\alpha_{t,c} = \frac{\exp(e_{t,c})}{\sum_{\tau}^{T} \exp(e_{\tau,c})} \quad (3)$$
Equations (4) and (5) yield the weighted mean vector, $\tilde{\mu}_c$, and the weighted standard deviation vector, $\tilde{\sigma}_c$, on channel $c$, respectively:

$$\tilde{\mu}_c = \sum_{t}^{T} \alpha_{t,c} h_{t,c} \quad (4)$$

$$\tilde{\sigma}_c = \sqrt{\sum_{t}^{T} \alpha_{t,c} h_{t,c}^2 - \tilde{\mu}_c^2} \quad (5)$$

These vectors are then concatenated to yield the ultimate output of the attentive statistics pooling layer.
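Pulling Equations (2)–(5) together, a compact PyTorch sketch of channel-wise attentive statistics pooling might look as follows; the bottleneck dimension R and the use of tanh as the activation $f(\cdot)$ are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Channel-wise attentive statistics pooling following Equations (2)-(5);
    the bottleneck size R is an illustrative assumption."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.W = nn.Conv1d(channels, bottleneck, kernel_size=1)   # W h_t + b  (C -> R)
        self.v = nn.Conv1d(bottleneck, channels, kernel_size=1)   # projection back to C channels
        self.tanh = nn.Tanh()

    def forward(self, h):                       # h: (batch, C, T)
        e = self.v(self.tanh(self.W(h)))        # Equation (2): a score per channel and frame
        alpha = torch.softmax(e, dim=2)         # Equation (3): attention weights over time
        mu = torch.sum(alpha * h, dim=2)                     # Equation (4): weighted mean
        var = torch.sum(alpha * h ** 2, dim=2) - mu ** 2     # Equation (5): weighted variance
        sigma = torch.sqrt(var.clamp(min=1e-8))              # weighted standard deviation
        return torch.cat([mu, sigma], dim=1)    # (batch, 2C) pooled representation
```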
3.3. Multilayer Feature Aggregation
Compared to earlier TDNN-based systems, ECAPA-TDNN stands out because of its pioneering approach to multilayer feature aggregation. Rather than incorporating features solely from the last frame layer, it also integrates those from two earlier layers. This is achieved by amalgamating the features produced by the first, second, and final SE-Res2Block modules in the channel dimension, facilitated by residual connections. Subsequently, the deep features are further extracted through a fully connected layer, and these features are then utilized in the computation of attentive statistics pooling. The specific progression is elucidated in Figure 6.
In the realm of deep learning, a variety of feature types and sources exist that necessitate integration to enhance the model’s effectiveness and generalization capability. Among the commonly employed methods for integration are merge (concatenation) operations and element-wise summation. Figure 6 illustrates a multilayer feature aggregation process employing the merge operation to integrate features across different levels.
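In code, the merge operation in Figure 6 amounts to a channel-wise concatenation of the three SE-Res2Block outputs followed by a framewise fully connected (1 × 1 convolution) layer; the sketch below uses illustrative tensor shapes rather than the model's actual dimensions.

```python
import torch
import torch.nn as nn

# Illustrative outputs of the three SE-Res2Block stages: (batch, C, T)
out1 = torch.randn(8, 512, 200)   # first  SE-Res2Block
out2 = torch.randn(8, 512, 200)   # second SE-Res2Block
out3 = torch.randn(8, 512, 200)   # final  SE-Res2Block

mfa = torch.cat([out1, out2, out3], dim=1)     # merge operation -> (8, 1536, 200)
fc = nn.Conv1d(1536, 1536, kernel_size=1)      # fully connected layer applied framewise
deep_features = nn.ReLU()(fc(mfa))             # fed into attentive statistics pooling
```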
3.4. AAM-Softmax Loss Function
Refining the choice of a more effective loss function remains a persistent challenge in the realm of deep learning. Particularly for classification tasks, the selected loss function must strike a delicate balance between maximizing the distance between different classes and minimizing the distance within the same class. While neural network classification models often adopt the Softmax loss function, this approach does not encode the angular relationships between classes within the feature space, consequently yielding suboptimal results. To address this problem, researchers have introduced a fixed angular interval as a penalty term into the Softmax loss function, proposing the Additive Angular Margin Softmax (AAM-Softmax) loss. This approach effectively narrows intra-class gaps and enlarges inter-class distances [32,33]. The precise formulation of AAM-Softmax is presented in Equation (6):
$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\,\cos(\theta_{y_i} + m)}}{e^{s\,\cos(\theta_{y_i} + m)} + \sum_{j=1,\, j \neq y_i}^{n} e^{s\,\cos\theta_j}} \quad (6)$$

In Equation (6), $N$ represents the total number of samples, $n$ denotes the number of classes, $x_i$ is the $i$-th sample, $\theta_{y_i}$ stands for the angle between sample $x_i$ and the weight vector of its corresponding class $y_i$, $s$ represents the scaling factor, and $m$ refers to the edge angle. The edge angle serves the purpose of fostering more closely knit samples within the same class while simultaneously enhancing the disparities between different classes. This serves to enhance the efficacy of classification or recognition.
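A compact PyTorch sketch of the AAM-Softmax loss in Equation (6) is shown below; the scale s and margin m values are illustrative assumptions rather than the hyperparameters used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    """Additive angular margin Softmax loss following Equation (6);
    s and m are illustrative assumptions."""
    def __init__(self, embedding_dim, num_classes, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class weight vector
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # add the angular margin m only to the target-class angle
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)   # negative log of the margin Softmax
```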
ECAPA-TDNN entails a slightly elevated computational load in comparison to alternative neural network models. The training and inference processes of ECAPA-TDNN demand a greater allocation of computational resources and time as opposed to conventional DNN and LSTM (Long Short-Term Memory Network) models. Nevertheless, the computational overhead of ECAPA-TDNN remains relatively modest when contrasted with certain more recent speech recognition models, such as Transformers.
4. Experimental Paradigm
The experimental simulations were conducted on the Shenzhou laptop platform, featuring the following specific configuration: CPU—Intel Core i5-8400; graphics card—NVIDIA GTX 1060 6 GB; operating system—Windows 10; simulation software—PyCharm Community Edition; development language—Python 3.9; and deep learning framework—PyTorch 1.10.
In the experimental simulation, the noise signals were chosen from the 12 available types in the NoiseX-92 dataset. Simultaneously, due to the scarcity of audio event-related datasets and the representative nature of speech signals in the audio domain, audio event signals were sourced from the speech signals within Tsinghua University’s THCHS-30 dataset. These signals were employed for speaker recognition testing. The data were standardized into 16 kHz mono audio files, with sample data randomly allocated for training and testing sets. The frame length was set at 20 ms (320 samples), while the frame shift was established at 10 ms (160 samples) during frame segmentation.
For the speaker recognition test in the experiment, speech data from nine individuals in the THCHS-30 dataset were randomly chosen and designated as participants A to I. These individuals’ speech recordings were utilized for both training and testing. The frame segmentation was performed with a frame length of 20 ms (320 samples) and a frame shift of 10 ms (160 samples).
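The frame segmentation described above (20 ms frames with a 10 ms shift at 16 kHz) can be sketched in NumPy as follows; dropping the final partial frame is an assumption of this simplified version.

```python
import numpy as np

def frame_signal(signal, frame_len=320, frame_shift=160):
    """Split a 16 kHz signal into 20 ms frames (320 samples) with a 10 ms
    shift (160 samples); trailing samples that do not fill a complete frame
    are dropped in this sketch."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(num_frames)[:, None])
    return signal[idx]                 # shape: (num_frames, 320)

frames = frame_signal(np.random.randn(16000))   # 1 s of dummy audio -> 99 frames
```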
The effectiveness of noise signal type recognition is assessed using the accuracy metric R. Prior to computing the accuracy rate, the recognized noise signals need to be manually labeled with their corresponding frame noise signal types. The accuracy rate, R, for each noise type recognition can be computed using Equation (7):

$$R = \frac{N_c}{N} \times 100\% \quad (7)$$

where $N_c$ represents the count of frames in which the manually labeled noise type matches the noise type obtained through the deep learning model’s recognition, and $N$ denotes the total number of frames containing noise signals.
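A direct implementation of Equation (7) as frame-level accuracy might look like the following sketch; the label strings are purely illustrative.

```python
import numpy as np

def noise_type_accuracy(labels, predictions):
    """Frame-level accuracy R from Equation (7): the percentage of noise
    frames whose predicted type matches the manually labeled type."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    n_correct = np.sum(labels == predictions)   # N_c: correctly recognized frames
    n_total = labels.size                       # N: total frames containing noise
    return 100.0 * n_correct / n_total

# e.g. R for five labeled frames compared with the model output
print(noise_type_accuracy(["white", "pink", "babble", "pink", "white"],
                          ["white", "pink", "babble", "white", "white"]))  # 80.0
```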