1. Introduction
The ability of marine mammals to use sound for communication is one of the key features of their adaptation to the underwater environment, and their acoustic signals exhibit remarkable diversity. The American scholar Lilly categorized these acoustic signals into three main types based on their functions and parameters: clicks, whistles, and burst pulses, each serving distinct purposes [1]. Studying the acoustic signals of marine mammals is of great significance for understanding their biological behavior, for the rational utilization of marine resources, and for conservation efforts to protect these species. Passive Acoustic Monitoring (PAM) of marine mammals is a widely employed biomonitoring method. Since acoustic waves attenuate less than light waves as they propagate in the ocean, they can travel longer distances, making them particularly suitable for monitoring and identifying marine mammals [2]. In the pursuit of developing PAM systems for detecting and classifying marine mammals, numerous algorithms have been devised specifically to analyze the acoustic characteristics of these animals. Notably, most of these detection and classification algorithms rely on the distinct acoustic features exhibited by marine mammals [3].
Feature extraction is a crucial step in acoustic signal processing. By employing feature extraction techniques, one can obtain the acoustic features of a target, which then serve as a reliable basis for target identification. Ibrahim et al. demonstrated the effectiveness of Mel-Frequency Cepstral Coefficients (MFCC) and discrete wavelet transform coefficients in classifying marine mammal calls with Support Vector Machines (SVM) [4]. Brown et al. pioneered the use of the Gaussian Mixture Model combined with the Hidden Markov Model (GMM-HMM) for recognizing MFCC features [5]; they achieved a classification consistency of over 90% on a set of 75 killer whale calls. Dugan extracted time-frequency features from cetacean calls and classified them using three different models, achieving a highest assignment rate of 86.45% [6]. Maheen et al. acoustically classified six marine mammal species by fusing one-dimensional local binary patterns with MFCC features, yielding a training-test accuracy of 90.4% [7]. Zhong Mingtuo et al. fused MFCC, linear cepstral coefficients, and time-domain features from 61 species of marine mammals as feature parameters [8]; using SVM for classification, they improved the recognition rate by 5.5% over traditional MFCC-based methods. Li Songbin et al. extracted six features (MFCC, FBanks, PNCC, PSRCC, GFCC, and MSRCC) from three marine mammal species for comparative analysis [9]; employing a Convolutional Neural Network-Gated Recurrent Unit (CNN-GRU) structure for recognition, they achieved a classification accuracy of 74%. Some of these studies identify marine mammal acoustic signals with a single feature and some with several, but all of the features they use are confined to the time-frequency domain. Since climatic factors such as rainfall and typhoons strongly affect the marine acoustic environment, recognizing marine mammal acoustic signals by time-domain or frequency-domain features alone has clear limitations. Zhang Xuebo et al. coherently synthesized the signals in the range-Doppler domain associated with each receiver after performing range cell migration correction (RCMC) for each receiver, and then corrected the azimuth offset [10,11]. To improve the simulation efficiency of the multi-receiver synthetic aperture sonar (SAS) echo signal, Zhang Xuebo et al. multiplied the spectrum of the transmitted signal by the delay-related phase shift so that the spectrum of the echo signal could be obtained accurately [12,13]. Compared with traditional echo simulation algorithms, this method significantly improves the simulation efficiency of the echo signal without sacrificing performance. Although MFCC features have achieved good recognition results in this field, Delay-Doppler (DD) domain features capture the velocity information of the target and are not easily affected by environmental factors such as climate; combining them with MFCC features can therefore increase the reliability of recognition, and a dual-feature fusion learning method may help improve recognition accuracy.
Convolutional Neural Networks (CNNs) are widely used in speech recognition. Traditionally, acoustic signal recognition involves time-frequency analysis of the signal to generate a spectrogram, from which the signal is identified by its characteristic patterns. Aslam et al. discuss machine learning and deep learning methods for marine ship sound classification and fish sound classification in detail, covering underwater sound sources, features, classifiers, datasets, related techniques, challenges, and future trends [14]. Bianco et al. reviewed the development of machine learning in four acoustic research areas: source localization in speech processing, source localization in marine acoustics, bioacoustics, and environmental sounds in everyday scenes [15]. In feature extraction for speech signals, the original acoustic features are often replaced with acoustic feature images, and CNN-based image recognition techniques are employed to recognize the acoustic signals. This approach has achieved accuracy rates that are difficult to match with traditional methods, especially on large-sample datasets. Zhang Xuebo et al. focused on the application of a nonlinear chirp scaling algorithm in SAS and validated the proposed method on simulated and real data; the processing results show that imaging efficiency is greatly improved compared with the phase center approximation (PCA) method [16]. Wang et al. proposed a comprehensive underwater image enhancement framework, Metalantis, which strengthens state-of-the-art physical models of underwater imaging by using virtually generated data for reinforcement learning [17,18], and gave two further examples in [19,20]. Shiu et al. explored the use of deep CNNs and Recurrent Neural Networks (RNNs) with spectrograms to detect vocalizations of North Atlantic right whales [21]; their findings indicate that deep learning architectures can produce false positive rates several orders of magnitude lower than other algorithms. Griffiths et al. proposed a multivariate clustering method to identify distinct click vocal clusters of Dall's porpoise in the U.S.A., and the validity of the three clusters was verified with the random forest method [22]. Cai et al. designed a multichannel classification model with a parallel structure, fusing predictions and introducing data augmentation techniques to further improve classification accuracy [23]. Duan Dexin et al. trained a random forest classifier on time-frequency image features to detect and distinguish echolocation signals, achieving higher recall and accuracy under low Signal-to-Noise Ratio (SNR) conditions [24]. Cominelli et al. combined pre-trained acoustic classification models (VGGish, NOAA, and the Google Humpback Whale Detector), dimensionality reduction (UMAP), and balanced random forest algorithms to demonstrate how machine-learned acoustic features can capture different aspects of the marine acoustic environment [25]. The current state of research suggests that applying deep learning to marine mammal acoustic recognition has become a trend, but most of the methods proposed so far are single-feature recognition models. Although neural network models for marine mammal acoustic recognition offer higher accuracy and reduce time and labor costs, traditional neural network models often suffer from structural complexity, high computational cost, and related problems.
To address the limitations of traditional recognition methods that rely solely on a single feature input, this paper proposes a dual-feature fusion learning target recognition model. To validate the effectiveness of the proposed model for marine mammal acoustic recognition, this study combines three common CNN recognition models with each of the two signal features for single-feature recognition and compares the results with dual-feature recognition. The innovations of the dual-feature fusion learning method presented in this paper are as follows:
1. The marine mammal acoustic signal is preprocessed using adaptive filtering to enhance the SNR and mitigate the interference of environmental noise.
2. Delay-Doppler domain features are introduced into the acoustic feature recognition of marine mammals, effectively addressing the impact of seasonal changes in the marine environment on marine mammal acoustic signals.
3. A dual-feature fusion learning target recognition model is developed, capable of recognizing both MFCC features and Delay-Doppler domain features simultaneously. This model exhibits high recognition accuracy and strong generalization ability for mammal acoustic signal recognition in complex marine environments.
3. Method
First, this research performs adaptive filtering on the acoustic signals of marine mammals to enhance their SNR. Subsequently, it extracts the MFCC and DD domain features of the marine mammals as the two input features of this recognition method. This study constructs a dual-feature fusion learning target recognition model that can be trained on two marine mammal acoustic signal features simultaneously, improving target recognition accuracy. The overall idea of the paper is shown in Figure 5.
3.1. Framing and Normalization
Framing: The purpose of framing is to extract a series of shorter, discrete time segments (frames) from a continuous signal so that each frame can be further analyzed and processed. A frame is a fixed-length sequence of samples extracted from an audio signal. The length of a frame is usually a power of 2, such as 256, 512, or 1024, because such a length makes subsequent Fast Fourier Transform (FFT) calculations more efficient. The frameshift is the number of samples between two consecutive frames. The frameshift determines the degree of overlap between frames. Smaller frameshifts provide higher temporal resolution, but increase computational effort; larger frameshifts reduce computational effort, but decrease temporal resolution.
For example, after inputting a segment of the signal, set the number of samples per frame to 1024 and the frameshift to 512, i.e., move 512 samples at a time to obtain a new frame, which ensures 50% overlap; the data of the first three frames are drawn in Figure 6.
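As an illustration, the framing step can be sketched in a few lines of NumPy; the function name and the 48 kHz example signal below are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames; hop=512 with
    frame_len=1024 gives the 50% overlap described above."""
    n_frames = 1 + (len(x) - frame_len) // hop  # incomplete tail samples are dropped
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

# Example: one second of a 48 kHz signal yields (48000 - 1024) // 512 + 1 = 92 frames
x = np.random.randn(48000)
print(frame_signal(x).shape)  # (92, 1024)
```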
Normalization: Min-Max Normalization is a method of scaling data features to a specific range (usually between 0 and 1). This method is implemented through the maximum and minimum values of each feature, using a linear transformation to map the data to the new range. The formula is [37]:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $x$ is the original data point, $\min(x)$ is the minimum value in the data set, $\max(x)$ is the maximum value in the data set, and $x'$ is the normalized data point.
The advantage of Min-Max Normalization is that the data can easily be scaled to any specified range; its drawback is sensitivity to outliers, since changes in the maximum and minimum values directly affect the normalized result.
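A minimal sketch of this normalization follows; the small epsilon is our addition (not in the formula above) to guard against a constant-valued input:

```python
import numpy as np

def min_max_normalize(x, eps=1e-12):
    """Map data linearly to [0, 1] using the formula above."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)

print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.  0.5 1. ]
```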
3.2. Dual-Feature Extraction Analysis
Due to its excellent nonlinear perception ability, MFCC has been widely applied in acoustic signal recognition research. However, this feature remains a traditional time-frequency domain feature, susceptible to environmental changes. In this study, we extract the DD domain feature of marine mammals, which complements the MFCC feature and can reflect the motion characteristics of marine mammals. This approach addresses the limitations of recognizing marine mammals using a single feature.
Figure 7 depicts the MFCC features of the Fraser's Dolphin acoustic signal. The horizontal axis represents the frame index; each frame encompasses a fixed number of samples with a certain overlap between neighboring frames. The more frames there are, the higher the temporal resolution, which more accurately captures the dynamic properties of the signal and changes in its short-term characteristics, since more frames mean the signal is more finely segmented and its evolution over time is better reflected. The vertical axis indicates the MFCC parameter dimension, i.e., the number of MFCCs extracted per frame, which determines the dimensionality of each frame's feature vector. For example, if the order of the DCT is 30, then 30 MFCC coefficients are generated for each frame, but usually only the first 13 coefficients are retained, because these lower-order coefficients contain the main spectral information while the higher-order coefficients tend to be associated with noise.
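The paper does not state which toolkit computes its MFCCs; as a sketch, the librosa library extracts 13 coefficients per frame using the framing parameters from Section 3.1 (the file name is illustrative):

```python
import librosa

# sr=None preserves the recording's native sample rate
y, sr = librosa.load("frasers_dolphin.wav", sr=None)

# 13 MFCCs per frame, 1024-sample frames, 50% overlap
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=512)
print(mfcc.shape)  # (13, n_frames): rows are coefficient dimensions, columns are frames
```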
Figure 8 illustrates the Delay-Doppler domain features of the Fraser's Dolphin acoustic signal. The horizontal axis represents the Doppler frequency shift, indicating the change in signal frequency resulting from the relative motion between the target and the receiver. The vertical axis depicts the time delay, i.e., the duration of signal propagation. Delay-Doppler domain features effectively reflect the speed characteristics of organisms, since varying speeds among different organisms inherently lead to distinct Delay-Doppler domain features. As evident from the figure, the signal exhibits a pronounced frequency shift at approximately 8 Hz.
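The paper does not spell out its DD transform. One common construction, borrowed from OTFS signal processing, maps an STFT grid to the delay-Doppler plane with a symplectic finite Fourier transform: an FFT across frames resolves Doppler, and an IFFT across frequency bins resolves delay. The sketch below implements that assumed construction:

```python
import numpy as np
from scipy.signal import stft

def delay_doppler_map(x, fs, frame_len=1024, hop=512):
    """Assumed DD construction: STFT grid -> FFT over the frame (time) axis,
    IFFT over the frequency axis (symplectic finite Fourier transform)."""
    _, _, tf = stft(x, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    dd = np.fft.ifft(np.fft.fft(tf, axis=1), axis=0)
    return np.fft.fftshift(np.abs(dd))  # center zero delay/Doppler for display

x = np.random.randn(48000)
print(delay_doppler_map(x, fs=48000).shape)  # axis 0: delay, axis 1: Doppler
```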
3.3. Dual-Feature Fusion Learning Target Recognition Model
Traditional CNN models are typically based on a single feature and can only be trained on that specific feature, which often limits their generalization ability and robustness. Consequently, in this paper, we introduce a dual-feature fusion learning target recognition model that accepts features from both the time-frequency and Delay-Doppler domains simultaneously. The acoustic signals generated by different targets may overlap in frequency, so it is challenging to identify targets from the time-frequency domain features of the signals alone. Since different targets move at different speeds, they are characterized differently in the DD domain, so DD domain features describe the target attributes in a second feature domain of the signal. The classification approach is equivalent to combining the MFCC features and the DD domain features into a joint feature and classifying the different marine mammals along this joint-feature axis, on which each species corresponds to a different point. Because the dimensionality of the signal description increases, the target information is reflected more comprehensively. We compare the recognition performance of this model with three widely used single-feature models: VGG16, GoogleNet, and ResNet.
As depicted in Figure 9, our dual-feature fusion learning target recognition model comprises nine convolutional layers and two fully connected layers. Notably, a max pooling layer is inserted between every two convolutional layers, and the recognition task is ultimately carried out by a SoftMax layer. The max pooling layer reduces the size of the feature map by selecting the maximum value of each region, which reduces the amount of computation and the number of parameters in subsequent layers, improving the computational efficiency of the model and reducing the risk of overfitting. In addition, the max pooling layer provides a degree of translation invariance, so that even if the image is slightly translated, the same features are extracted. Through the pooling operation, the model retains the most important features and ignores unimportant details, improving the robustness of the features. The SoftMax layer transforms the output of the neural network into a probability distribution, such that the output value of each category lies between 0 and 1 and the output values of all categories sum to 1. This allows the model to perform better on multi-classification tasks and improves classification accuracy. The fully connected layers integrate the features extracted by the convolutional and pooling layers and map them to the sample label space, enhancing the feature integration capability of the model.
Figure 10 presents the specific parameters of the nine convolutional layers. For instance, the parameter "1 × 1 × 32" signifies the following: the convolutional kernel size is 1 × 1, allowing cross-channel information integration without altering the spatial dimensions, and the number of convolution kernels is 32, meaning the convolution operation yields 32 feature maps reflecting the outcomes of convolving the input data with these kernels. Additionally, the activation function employed is ReLU (Rectified Linear Unit), valued for its computational simplicity, rapid convergence, and effectiveness in mitigating the vanishing gradient problem. The normalization method is Min-Max Normalization, which uses the maximum and minimum values of each feature to map the data linearly to the new range.
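A minimal PyTorch sketch of one plausible realization of this architecture follows. The paper fixes nine convolutional layers, max pooling between every two of them, two fully connected layers, and a SoftMax output, but the channel counts, the 4 + 4 + 1 split of the convolutional layers across the two branches, and the fusion point are our assumptions:

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k):
    # convolution + ReLU; padding keeps the spatial size unchanged
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2), nn.ReLU())

class DualFeatureNet(nn.Module):
    """Two CNN branches (MFCC and DD inputs), channel concatenation, one fusion
    convolution, then two fully connected layers: nine conv layers in total."""
    def __init__(self, n_classes=3):
        super().__init__()
        def branch():  # four conv layers with a max pooling layer after every two
            return nn.Sequential(
                conv(1, 32, 1), conv(32, 32, 3), nn.MaxPool2d(2),
                conv(32, 64, 3), conv(64, 64, 3), nn.MaxPool2d(2),
            )
        self.mfcc_branch, self.dd_branch = branch(), branch()
        self.pool = nn.AdaptiveAvgPool2d(4)      # fixed 4x4 maps regardless of input size
        self.fuse = conv(128, 128, 3)            # ninth conv layer, after channel fusion
        self.head = nn.Sequential(               # two fully connected layers
            nn.Flatten(), nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, mfcc, dd):
        z = torch.cat([self.pool(self.mfcc_branch(mfcc)),
                       self.pool(self.dd_branch(dd))], dim=1)
        return self.head(self.fuse(z))

model = DualFeatureNet()
logits = model(torch.randn(8, 1, 13, 64),      # batch of MFCC feature images
               torch.randn(8, 1, 128, 128))    # batch of DD-domain images
probs = torch.softmax(logits, dim=1)           # SoftMax: per-class probabilities
print(probs.shape)                             # torch.Size([8, 3])
```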
4. Experiment and Analysis
In this section, the effectiveness of the aforementioned dual-feature fusion learning target recognition model is verified. First, the pre-processed signal undergoes MFCC and DD domain feature extraction. Subsequently, these two types of features are input into VGG16, GoogleNet, ResNet, and the dual-feature fusion learning target recognition model for training, and the recognition performances of the different models are analyzed and compared. The experimental environment configuration for this study is as follows: the GPU is an NVIDIA GeForce RTX 4080 (NVIDIA Corporation, Santa Clara, CA, USA), the CPU is an Intel i9-13900K (Intel Corporation, Santa Clara, CA, USA), and the neural network is trained on the GPU. The operating system is Windows 10, with 128 GB of RAM (Samsung, Seoul, Republic of Korea). The Python version used is 3.9, and the neural network is built with the PyTorch framework. The development environment is PyCharm 2018.
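For concreteness, a skeleton of a PyTorch training loop under this configuration is shown below; DualFeatureNet refers to the sketch given for Section 3.3, and the epoch count, learning rate, batch size, and randomly generated stand-in data are all illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DualFeatureNet().to(device)          # sketch from Section 3.3

# Stand-in tensors in place of the real MFCC/DD feature images and species labels
data = TensorDataset(torch.randn(80, 1, 13, 64),
                     torch.randn(80, 1, 128, 128),
                     torch.randint(0, 3, (80,)))
train_loader = DataLoader(data, batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()            # applies log-softmax internally

for epoch in range(50):                      # epoch count is illustrative
    for mfcc, dd, label in train_loader:
        mfcc, dd, label = mfcc.to(device), dd.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(mfcc, dd), label)
        loss.backward()
        optimizer.step()
```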
4.1. Experimental Data and Evaluation Metrics
The marine mammal acoustic data used in this study were obtained from the open-source Watkins Marine Mammal Sound Database, collected by William Watkins, one of the founding fathers of marine mammal bioacoustics. The database contains recordings spanning seven decades, from the 1940s to the 2000s: approximately 2000 unique recordings of more than 60 species of marine mammals, ranging in length from one second to several minutes, all in .wav format. Figure 11 shows the three marine mammals used in the experiment [38].
Because the audio files in the dataset vary in length and marine mammal vocalizations are short, the longer recordings are clipped into multiple short segments, which increases the amount of input data and improves training accuracy. To ensure that each clipped segment contains at least one complete animal vocalization, the time-frequency diagrams of the audio data were inspected with Adobe Audition 2024 before clipping. The dataset is divided into training and test sets in a 4:1 ratio, and the experimental results below are analyzed for both.
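A sketch of the clipping and 4:1 split is given below; the two-second segment length and the stratified split are our assumptions (the paper verified segment contents manually in Adobe Audition):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def clip_segments(x, fs, seg_seconds=2.0):
    """Cut a long recording into consecutive fixed-length segments."""
    seg = int(seg_seconds * fs)
    return [x[i : i + seg] for i in range(0, len(x) - seg + 1, seg)]

# Stand-in segments and species labels; test_size=0.2 gives the 4:1 split
segments = np.random.randn(100, 96000)
labels = np.repeat([0, 1, 2], [34, 33, 33])
X_tr, X_te, y_tr, y_te = train_test_split(
    segments, labels, test_size=0.2, stratify=labels, random_state=0)
print(len(X_tr), len(X_te))  # 80 20
```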
In this study, four metrics were selected to evaluate the experimental results: Accuracy, Precision, Recall, and F1 Score. These metrics are derived from the confusion matrix, which evaluates the model's accuracy by comparing the predicted category labels with the actual category labels. The structure of the confusion matrix is presented in Table 1 [39].
True Positives (TP): the number of samples that the model correctly predicts as positive.
False Positives (FP): the number of samples that the model incorrectly predicts as positive.
True Negatives (TN): the number of samples that the model correctly predicts as negative.
False Negatives (FN): the number of samples that the model incorrectly predicts as negative.
Accuracy is the proportion of correctly predicted samples among all samples. Precision is the proportion of true positives among all samples predicted as positive. Recall is the proportion of true positives among all actually positive samples. The F1-score is the harmonic mean of precision and recall, used to measure the balanced performance of the model. They are calculated as follows [40]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
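As a worked example, the four metrics can be computed directly from confusion-matrix counts (the counts below are made up for illustration):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# e.g., 90 true positives, 10 false positives, 85 true negatives, 15 false negatives
print(classification_metrics(tp=90, fp=10, tn=85, fn=15))
# accuracy 0.875, precision 0.900, recall ~0.857, F1 ~0.878
```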
4.2. Experimental Validation
In this paper, Section 2 mentions that a set of triangular filters is used in extracting MFCC features, which serves to improve the SNR of the original signal. DD feature extraction involves no such filtering, so the signal is processed with LMS adaptive filtering before DD feature extraction. The analysis results are presented in Figure 12, which shows the DD domain feature extraction of the Spinner Dolphin before and after filtering. The results differ markedly: after filtering, the DD domain features are visually more distinct and the image contrast is enhanced.
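The paper does not give the LMS configuration; the sketch below assumes an adaptive line enhancer, where the reference input is a delayed copy of the noisy signal, so the filter learns to predict the correlated (vocalization) component and reject broadband noise. The filter order, step size, and test signal are illustrative:

```python
import numpy as np

def lms_filter(d, x, order=32, mu=0.005):
    """LMS adaptive filter: adapt weights w so that y = w . u tracks d.
    Returns (y, e) = (filter output, prediction error)."""
    w = np.zeros(order)
    y, e = np.zeros(len(d)), np.zeros(len(d))
    for n in range(order, len(d)):
        u = x[n - order:n][::-1]   # most recent reference samples first
        y[n] = w @ u               # filter output
        e[n] = d[n] - y[n]         # error driving the adaptation
        w += 2 * mu * e[n] * u     # LMS weight update
    return y, e

# Adaptive line enhancer: reference = one-sample-delayed copy of the noisy signal;
# y is then the enhanced (noise-reduced) output.
fs = 48000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.5 * np.random.randn(fs)
enhanced, _ = lms_filter(noisy, np.r_[0.0, noisy[:-1]])
```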
In this study, all marine mammal acoustic signals were filtered with the least mean squares (LMS) adaptive filter, and then the DD domain features of the original signals and the DD domain features of the filtered signals were input into the three single-feature models, VGG16, GoogleNet, and ResNet, as the target recognition features. The accuracy, precision, recall, and F1 scores of the two groups were obtained for comparison, and the results are shown in Table 2 and Table 3.
Each table contains the results of both the training set and the test set, separated by "/": the value before the "/" is the training-set result, and the value after it is the test-set result. Comparing the recognition results of DD domain features from the single-feature models before and after filtering, it is evident that the recognition accuracy of each model improves by 3% to 13% after filtering. Consequently, applying LMS adaptive filtering before feature extraction and recognition effectively enhances recognition accuracy.
MFCC and DD domain features are extracted from the preprocessed data and individually input into the VGG16, GoogleNet, and ResNet models for single-feature training. Subsequently, both types of features are simultaneously input into the dual-feature fusion learning target recognition model for training. Table 4 presents the recognition results for the two features using the four models.
In the final row, "Dual-feature" signifies that the dual-feature fusion learning target recognition model is employed to recognize both MFCC features and DD domain features. Analyzing the training results reveals that accuracy stands at 59% to 80% for DD domain features alone, 93% to 98% for MFCC features alone, and a remarkable 98% to 99% when both MFCC and DD domain features are utilized through the dual-feature fusion learning target recognition model. The dual-feature fusion learning target recognition model therefore outperforms the other models in recognizing the acoustic signals of the three marine mammals: the Fraser's Dolphin, the Spinner Dolphin, and the Long-Finned Pilot Whale. However, comparison with the other models alone does not adequately demonstrate the superiority of the proposed model, so this study also conducts generalization and ablation experiments to validate the model's ability to generalize and the reasonableness of the methodology.
4.3. Generalization Ability Analysis
To verify the generalizability of this dual-feature fusion learning target recognition model, two distinct marine mammal acoustic signals, the Ross Seal and the Bearded Seal, were chosen for training in this study. The training results are presented in Table 5.
When DD domain features are recognized in isolation, the accuracy ranges from 59% to 90%. When MFCC features are recognized alone, the accuracy lies between 73% and 97%. However, when the dual-feature fusion learning target recognition model recognizes both MFCC and DD domain features, the accuracy soars to 91% to 98%. The superior recognition accuracy achieved by the dual-feature fusion learning target recognition model, as compared to other models, underscores its excellent generalization capabilities.
4.4. Ablation Experiment
An ablation study is an experimental design method commonly used in scientific research, especially in machine learning and deep learning. The core idea is to gain a deeper understanding of how the model works and how its components interact by systematically removing or modifying certain parts of the model (e.g., layers, nodes, features, or parameters) and observing how such changes affect the model's performance.
An ablation experiment was conducted in this study to validate the efficacy of MFCC features, DD domain features, and LMS adaptive filtering within the proposed method for marine mammal acoustic signal recognition. By systematically removing each component from the model, we aimed to assess their contributions to the overall performance.
Ablation Experiment 1: remove MFCC features and train on DD domain features alone using the CNN.
Ablation Experiment 2: remove DD domain features and train on MFCC features alone using the CNN.
Ablation Experiment 3: remove LMS adaptive filtering and extract features directly from the original signal.
The training results of the three ablation experiments and the complete target recognition model are compared in Table 6, which shows that removing MFCC features decreases target recognition accuracy by about 43%, removing DD domain features decreases it by 0% to 3%, and removing LMS adaptive filtering decreases it by 1% to 3%.
These findings demonstrate that the inclusion of MFCC features, DD domain features, and LMS adaptive filtering is crucial for achieving optimal performance in the target recognition model. Removing any of these components leads to a decrease in model accuracy, highlighting their contributions to the overall performance.
4.5. Qualitative Validation
The loss function is a non-negative real-valued function used to quantify the difference between model predictions and true labels [41]. By calculating the value of the loss function, the accuracy of the model's predictions can be quantified, and thus the performance of the model can be evaluated. By minimizing the value of the loss function during training, the parameters of the model can be optimized so that its predictions move closer to the true labels. To provide a more intuitive analysis of the training process and the recognition performance of each model and feature, we plot the loss function curves during training and the recognition accuracies of each model in Figure 13 and Figure 14.
An explanation of the legend follows:
VGG-MFCC. Recognition of MFCC features using the VGG16 model.
VGG-DD. Recognition of DD domain features using the VGG16 model.
GoogleNet-MFCC. Recognition of MFCC features using the GoogleNet model.
GoogleNet-DD. Recognition of DD domain features using the GoogleNet model.
ResNet-MFCC. Recognition of MFCC features using the ResNet model.
ResNet-DD. Recognition of DD domain features using the ResNet model.
MFCC-DD. Recognition of MFCC and DD domain features using the dual-feature fusion learning target recognition model.
Figure 13 illustrates the variation of the loss function during the training of the three single-feature models and the dual-feature fusion learning target recognition model.
Figure 14, on the other hand, shows the accuracies of different models for two types of feature recognition, including both the training and test sets, enabling a more direct comparison of the recognition performance of each model.
(1) When recognizing DD domain features alone, the loss function decreases slowly and exhibits significant fluctuations. In contrast, recognizing MFCC features alone results in a relatively faster decrease in the loss function with fewer fluctuations. Notably, the dual-feature fusion learning target recognition model, which incorporates both MFCC and DD domain features, achieves an even faster decrease in the loss function with a smoother curve.
(2) Compared to the model using MFCC features alone, the dual-feature fusion learning target recognition model improves the accuracy of the training set by 3% to 6% and the accuracy of the test set by 1% to 3%. When compared to the model using DD domain features alone, the improvement in accuracy is even more pronounced, with an increase of 20% to 23% for the training set and 25% to 38% for the test set. Additionally, compared to models utilizing single features such as VGG16, GoogleNet, and ResNet, the structure of the dual-feature fusion learning target recognition model is simpler, contributing to an overall enhancement in model recognition efficiency.
5. Conclusions
In this paper, we propose a dual-feature fusion learning method that mainly consists of two parts: feature extraction and target recognition.
(1) Feature Extraction: The MFCC and DD domain features of marine mammals are extracted as input features. The MFCC, being closer to the human auditory system than other spectral features, captures the auditory characteristics of the marine mammals’ vocalizations. Meanwhile, the DD domain features reflect the motion characteristics of these animals, providing complementary information. By combining these two features, the model ensures robust recognition performance even under low SNR conditions.
(2) Dual-Feature Fusion Learning Target Recognition Model: We introduce a novel dual-feature fusion learning target recognition model that can simultaneously input both features into a convolutional neural network for target recognition. In addition, generalizability experiments and ablation experiments are carried out in this study, which prove that the model has good generalization ability.
Compared with traditional single-feature recognition models, the method proposed in this paper simplifies the model structure, improves recognition accuracy and training efficiency, and has good generalization ability, providing a useful reference for research in marine mammal acoustic recognition and related fields. Because the method recognizes marine mammals by passive sonar, it is suitable for species that actively emit sound but less effective for some fish. The acoustic signals emitted by underwater targets are transmitted through ocean acoustic channels, where their signal-to-noise ratio is inevitably reduced, so ocean noise is an important issue affecting target recognition. With increasing changes in the marine environment, including the impacts of climate change and human activities, the habitats and behavioral patterns of marine mammals are also changing [42]. Effective acoustic signal recognition techniques can help monitor these changes and support the non-invasive study and protection of animals and their habitats at ecologically relevant temporal and spatial scales. Future work can apply this research to (1) combining active and passive sonar for the identification of marine organisms; (2) incorporating marine environmental noise filtering technology for detection; and (3) upgrading hardware and software so that the system can be mounted on a variety of marine observation platforms to observe ocean organisms in real time, achieving the goal of monitoring and protecting the marine biological environment.
In this paper, three models, VGG16, GoogleNet, and ResNet, are used as references in the comparison experiments; future research will consider more up-to-date algorithms for comparison, to improve the recognition efficiency of other models and to find more appropriate models. This study relies mainly on marine mammal acoustic signal data from the open-source Watkins Marine Mammal Sound Database; future studies will consider more diverse datasets, including real-world data, to explore the performance of the dual-feature fusion learning target recognition model in different environments.