1. Introduction
The use of underwater acoustic signals to identify underwater targets is of great importance in marine resource exploitation and national defense security. Underwater target identification can be separated into two stages: extracting signal features and developing an automated target classifier. Early feature extraction methods for radiated noise centered on the energy spectrum and line spectrum characteristics of the target. The continuous spectrum of the radiated noise can be modeled using either the least squares method or power spectrum estimation. By removing the continuous spectrum component from the original signal, the line spectrum features of the signal can be extracted [1]. With the progress of neural networks, the energy spectrum features of underwater acoustic signals can also be extracted using neural networks. As a relatively mature neural network, the restricted Boltzmann machine can achieve high-precision extraction of ship target power spectrum features [2]. When underwater acoustic targets are examined using time domain analysis, the cross-power spectrum obtained from the coherence of the signal in the time domain can emphasize the distinctive line spectrum features of underwater targets [3]. Early studies of underwater acoustic targets focused primarily on the temporal characteristics of the signal. However, underwater acoustic signals are non-stationary, so time domain analysis cannot describe the signal characteristics accurately, whereas the frequency domain reflects them more stably. The double logarithmic spectrum feature has been used to analyze the frequency domain features of ship-radiated noise [4], improving the accuracy of target recognition. A single time domain or frequency domain feature contains limited target information, and combined Time–Frequency analysis can more comprehensively analyze the components of underwater acoustic signals that do not change with time. Wang et al. [5] extracted multi-beam low-frequency analysis and recording (LOFAR) features and used convolutional neural networks to identify and detect targets. As wavelet analysis has matured, its excellent ability to analyze non-stationary signals has attracted attention in the underwater acoustic field, and it has been widely used in the analysis of radiated noise. Wavelet analysis is used to decompose underwater acoustic signals, and the wavelet coefficients extracted from the respective signals reflect the characteristics of the target [6], which not only improves the accuracy of target recognition but also provides a new idea for signal denoising. The combination of empirical mode decomposition (EMD) and wavelet decomposition has been used to remove noise from underwater acoustic targets, yielding signals with a high signal-to-noise ratio and enhancing the accuracy of target detection [7]. With the development of chaos theory and nonlinear dynamics, these tools are frequently applied to the feature analysis of underwater acoustic targets. Refined composite multiscale fluctuation-based dispersion entropy (RCMFDE) has been proposed to extract ship features, and the experimental results show the validity and universality of the classification [8]. Combining improved intrinsic time-scale decomposition (IITD) with multiscale dispersion entropy (MDE) can significantly raise target recognition accuracy [9]. The majority of the aforementioned feature extraction techniques concentrate on the signal’s Time–Frequency domain features. However, with the rapid progress of ocean technology, the characteristic frequency ranges of different underwater acoustic targets increasingly overlap, which makes it difficult to identify targets accurately using Time–Frequency domain features alone.
To address the difficulty of recognizing more complex targets using Time–Frequency features, this paper analyzes the signal from the perspective of the Delay-Doppler domain. The Delay-Doppler domain components of the signal characterize the motion of the target. Because of differences in volume and shape, the velocities of different underwater acoustic targets differ considerably, so Delay-Doppler features can represent the differences between targets within a specific range. The coherent integration method of the power spectrum has been used to compensate for the Delay-Doppler factor of passive sonar signals, which can significantly enhance the detection of moving targets [10]. An underwater acoustic communication receiver constructed using the signal characteristics found in the Delay-Doppler domain achieves favorable bit error rate (BER) performance [11]. As the field advances, the accuracy of signal features can be further increased by using deep learning to extract features of signals in the Delay-Doppler domain [12]. Delay-Doppler features are widely used in underwater acoustic communication. Sun et al. [13] proposed an orthogonal operation based on the Gram–Schmidt method to estimate the sparse multipath Delay-Doppler parameters of linear frequency modulated (LFM) signals in shallow water; compared to the traditional method, it has a lower mean square error. Guo et al. [14] introduced index modulation (IM) into the Delay-Doppler domain of the signal and enhanced the system’s BER performance using a Hamming distance optimization model. Zhang et al. [15] developed a generalized approximate message passing (GAMP) algorithm to estimate underwater acoustic channels in a communication system, which effectively reduces computational complexity. The Delay-Doppler characteristics of a signal reflect the motion of the target. However, the target’s speed varies over time, so these characteristics alone cannot stably represent the target. To address this problem, this study combines the Delay-Doppler features with the Time–Frequency features of the signal. Using the two features together mitigates the identification errors that may occur when either is used on its own. This paper therefore proposes a target feature extraction approach based on both domains.
Automatic recognition and classification technology is essential to realizing intelligent and automated underwater acoustic equipment. Early recognition of underwater targets relied mainly on experienced professionals. With the progress of computers and related algorithms, target recognition methods based on statistical analysis emerged, such as the Bayesian pattern classification method, cluster analysis, support vector machine (SVM), restricted Boltzmann machine, decision tree, hidden Markov model, and nearest neighbor method. In [16], the wavelet packet transform was used to process the signal and extract the wavelet energy spectrum features of the radiated noise, and an SVM was then used to identify the signal effectively. Spampinato et al. [17] used the hidden Markov model to compare the trajectories of underwater fish and thereby detect those trajectories. Luo et al. [18] used a restricted Boltzmann machine to examine the normalized spectrum of the signal, obtained the deep structural features of the data, and finally classified the data with a BP neural network. With the emergence of artificial intelligence, a growing body of research addresses the recognition of underwater acoustic targets using neural networks. Convolutional neural networks (CNNs) are often used for underwater target detection. Song et al. [19] extracted five characteristic parameters of radiated noise in the frequency, time, and Mel transform domains and established a convolutional neural network to recognize the target. The recognition accuracy was improved by 7.8% compared to the SVM method. Hu et al. [20] used a deep neural network to extract features of radiated noise and an extreme learning machine to recognize the signal, achieving an identification accuracy of 93.04%. Wang et al. [21] extracted the modified empirical mode decomposition and Gammatone frequency cepstral coefficient features of the signal and fused the two into new features. The structure of the deep neural network was optimized using a Gaussian mixture model, and the target was recognized by the optimized network. With the iterative progress of computer technology, artificial intelligence represented by deep learning will be used more widely to recognize and classify underwater target radiated noise. Using neural networks to identify underwater acoustic targets can significantly reduce labor costs and is an area for significant future development in this field. Traditional neural network models, however, have complex structures and are therefore not well suited to underwater acoustic target recognition.
In this work, to thoroughly examine the target’s characteristics from two angles, we first propose a feature extraction method that utilizes the Delay-Doppler and Time–Frequency domains. Then, this research introduces a target recognition network specifically tailored for the proposed feature extraction approach. Our structure is more straightforward than traditional neural network models, with no overfitting or underfitting problems. The innovative contributions of this study are outlined below:
(1) This work proposes a joint feature identification method based on the Delay-Doppler and Time–Frequency domains. The purpose of this method is to analyze ship-radiated noise from a new signal processing perspective and provide a new basis for underwater acoustic target recognition. The algorithm extracts signal features that carry a broader range of target information.
(2) This research presents a target recognition model that uses joint features in conjunction with a convolutional neural network (TF-DD-CNN). This model streamlines the network architecture and improves training efficiency. The experimental findings demonstrate that its recognition accuracy surpasses that of the conventional neural network models.
4. Experiment
The Time–Frequency and Delay-Doppler features together can comprehensively represent the information of the target. In this paper, the two signal features were combined into joint features, and a CNN approach was used to recognize the underwater targets. A multi-input target recognition network was designed, and the combined Time–Frequency and Delay-Doppler features of the signals were used to identify the targets. To assess the efficacy of the proposed joint features and recognition network, several mature CNNs, namely the VGG16, GoogleNet, and ResNet models, were also constructed, and the classification accuracies of the different recognition models were compared to validate the new model.
4.1. Experimental Environment Configuration
This section describes the experimental environment, namely the computer hardware and the software versions used in the experiment, since the environment affects the speed at which the model is trained. The graphics processing unit (GPU) was the NVIDIA GeForce RTX 4080 (NVIDIA, Santa Clara, CA, USA), while the central processing unit (CPU) was the Intel i9-13900K (Intel, Santa Clara, CA, USA). To speed up model training, the neural network was trained on the GPU. The computer’s operating system was Windows 10, with a total RAM capacity of 128 GB. The model was implemented in Python 3.9 using the PyTorch framework, and PyCharm 2018 was used as the development environment.
4.2. Data Introduction and Preprocessing
The data used in this paper were sourced from the ShipsEar database, which can be consulted online at http://atlanttic.uvigo.es/underwaternoise/ (accessed on 1 November 2023). The hydrophone that collected these data was moored on the seafloor and connected to an underwater buoy to ensure a vertical position, and the upper end was connected to a surface buoy [28]. The database comprises 11 distinct categories of ship-radiated noise data, as well as a collection of marine environmental background noise data. The data are stored in audio format, and Figure 13 shows the time domain waveform of the first 20 s of a passenger ship recording in this database. Since the collection time of each ship is different, the total data duration of each type of ship is also different. This paper uses the four ship types with the longest total recording duration, together with marine environmental noise, as targets.
A recognition model trained using only the ShipsEar database would not be sufficiently convincing, so the authors’ research group collected an additional set of radiated noise data from fishing boats, passenger ships, and motorboats to verify the effectiveness of the model. The data collection site is Jimiya Port in the West Coast New Area of Qingdao. The port is busy with freight transportation, and a large number of ships of different types enter and leave it throughout the day. The wind and waves were low during data collection, so the impact of marine environmental noise on the data is relatively small.
Figure 14 shows images of some targets in the ShipsEar database and data collected by the research team.
Table 2 shows the target types and data durations used in this article. These targets differ significantly in size, which makes them suitable for analyzing the target recognition performance of Delay-Doppler features.
Figure 15 shows how the original data are preprocessed in this paper. The aim is to obtain the Time–Frequency and Delay-Doppler features for target recognition and to use these two features as the training data for the recognition model (TF-DD-CNN). First, the original data are divided into frames to expand the data volume. Second, the FFT is used to analyze the frequency domain features of the framed data, which are combined with the time domain feature to form the Time–Frequency feature. Third, the SFFT is used to analyze the Delay-Doppler feature of the data. Finally, the Time–Frequency and Delay-Doppler features of the signal are used to create the training data for the TF-DD-CNN. The ratio of training samples to test samples is set to 5:1.
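The 5:1 division of samples described above can be illustrated with a short sketch. It is only a minimal example under stated assumptions: the function name `split_train_test`, the fixed random seed, and the placeholder samples are hypothetical, and the paper does not say whether the split was random or how it was stratified.

```python
import random


def split_train_test(samples, train_parts=5, test_parts=1, seed=0):
    """Shuffle samples and split them in the ratio train_parts:test_parts.

    Assumption: a plain random split; the paper does not describe its procedure.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = len(samples) * train_parts // (train_parts + test_parts)
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


# Hypothetical data: 600 framed samples, each a (TF feature, DD feature, label) placeholder.
samples = [(f"tf_{i}", f"dd_{i}", i % 5) for i in range(600)]
train_set, test_set = split_train_test(samples)
print(len(train_set), len(test_set))  # 500 100
```

Every sample lands in exactly one of the two sets, so the training and test sets do not share items.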
This research applies frame processing techniques to the original radiation noise data in order to enhance the training set.
Figure 16 depicts the process of dividing the radiated noise signal into frames. The length of the radiated noise signal is denoted as n; the data frame length (DFL) is the length of each intercepted frame, where DFL < n; and the frame shift (FS) is the offset between the start of one frame and the start of the next. In this paper, the radiated noise data are divided into frames of two seconds each. To preserve the continuity of information between neighboring frames, adjacent frames overlap by 50%. To obtain a 50% overlap rate, the step size is chosen as half of the DFL; that is, the frame shift is one second.
Figure 17 is the time domain diagram of a four-frame signal of certain data, with one frame of data every two seconds and the step size set to one second so that the data of two adjacent frames has a 50% overlap rate.
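The framing scheme above (two-second frames, one-second frame shift, 50% overlap) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `frame_signal` and the 1000 Hz sampling rate in the example are assumptions. The number of frames follows from the definitions of DFL and FS as floor((n − DFL)/FS) + 1.

```python
def frame_signal(signal, sample_rate, frame_seconds=2.0, shift_seconds=1.0):
    """Cut a 1-D signal into overlapping frames.

    DFL (data frame length) and FS (frame shift) are converted to samples.
    A trailing remainder shorter than one frame is discarded.
    """
    dfl = int(round(frame_seconds * sample_rate))
    fs = int(round(shift_seconds * sample_rate))
    if dfl <= 0 or fs <= 0:
        raise ValueError("frame length and frame shift must be positive")
    n = len(signal)
    if n < dfl:
        return []
    n_frames = (n - dfl) // fs + 1
    return [signal[i * fs : i * fs + dfl] for i in range(n_frames)]


# Hypothetical 5 s recording at an assumed sampling rate of 1000 Hz.
sample_rate = 1000
signal = list(range(5 * sample_rate))
frames = frame_signal(signal, sample_rate)
print(len(frames))  # 4 frames, starting at 0 s, 1 s, 2 s, and 3 s
```

With these settings the second half of each frame is identical to the first half of the next one, which is the 50% overlap described in the text.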
Figure 18 shows the Time–Frequency and Delay-Doppler features of a passenger ship.
Figure 18a shows the Delay-Doppler feature of the ship, and Figure 18b shows the Time–Frequency feature of the ship, in which the signal energy is mainly concentrated in the low-frequency band. Figure 18a indicates that the target has a relatively obvious Doppler frequency shift. In order to improve the accuracy of target recognition, all coordinate information is hidden in the feature images used as input.
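As a rough illustration of how the two feature maps relate, the sketch below computes a short-time spectrum for one frame and maps it to the Delay-Doppler domain. It rests on several assumptions that are not taken from this paper: SFFT is read as the symplectic finite Fourier transform in the convention common in the OTFS literature; the window type, `n_fft`, `hop`, and the 2048 Hz sampling rate are arbitrary; and the function names are hypothetical. The paper's own definitions should take precedence.

```python
import numpy as np


def time_frequency_grid(frame, n_fft=256, hop=128):
    """Complex short-time spectrum X[n, m]: n = time segment, m = frequency bin."""
    frame = np.asarray(frame, dtype=float)
    if len(frame) < n_fft:
        raise ValueError("frame is shorter than one FFT segment")
    window = np.hanning(n_fft)
    n_seg = (len(frame) - n_fft) // hop + 1
    segments = np.stack(
        [frame[i * hop : i * hop + n_fft] * window for i in range(n_seg)]
    )
    return np.fft.rfft(segments, axis=1)


def sfft(tf_grid):
    """Symplectic finite Fourier transform (assumed OTFS-style convention).

    x[k, l] = 1/sqrt(N*M) * sum_n sum_m X[n, m] * exp(-2j*pi*(n*k/N - m*l/M)),
    i.e. a DFT along the time axis (time -> Doppler) and an inverse DFT along
    the frequency axis (frequency -> delay), both with orthonormal scaling.
    """
    doppler = np.fft.fft(tf_grid, axis=0, norm="ortho")
    return np.fft.ifft(doppler, axis=1, norm="ortho")


# Synthetic two-second frame: a 200 Hz tone at an assumed 2048 Hz sampling rate.
sample_rate = 2048
t = np.arange(2 * sample_rate) / sample_rate
frame = np.sin(2 * np.pi * 200.0 * t)

tf_grid = time_frequency_grid(frame)
dd_grid = sfft(tf_grid)
tf_feature = np.abs(tf_grid)  # magnitude map, Time-Frequency domain
dd_feature = np.abs(dd_grid)  # magnitude map, Delay-Doppler domain
print(tf_feature.shape, dd_feature.shape)  # (31, 129) (31, 129)
```

Because both transforms use orthonormal scaling, the total energy of the Delay-Doppler map equals that of the Time–Frequency grid; only its distribution over the two axes changes.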
4.3. Evaluation Criteria for Target Recognition Results
This study employed four metrics, namely accuracy, precision, recall, and F1 score [29], to assess the experimental outcomes. Accuracy (Acc) is the ratio of samples correctly predicted by the classification model to the total number of samples; it measures the overall prediction accuracy of the model across all samples. Precision (Pre) is the number of correctly predicted positive samples divided by the total number of samples predicted as positive; it measures how reliable the model’s positive predictions are. Recall (Rec) is the number of correctly predicted positive samples divided by the total number of actual positive samples; it quantifies the rate at which the model correctly identifies positive samples. The F1 score (F1) is the harmonic mean of precision and recall and therefore balances the two in a single measure of the model’s performance. For multi-classification problems, the confusion matrix is constructed as shown in Table 3.
In the formulas for these four metrics, PP is the number of samples that the model recognizes as positive and whose actual label is also positive. PN is the number of samples that the model recognizes as positive but whose actual label is negative. NP is the number of samples that the model recognizes as negative but whose actual label is positive. NN is the number of samples that the model recognizes as negative and whose actual label is also negative. The formulas for accuracy, precision, recall, and F1 score are derived from the confusion matrix: Acc = (PP + NN)/(PP + PN + NP + NN), Pre = PP/(PP + PN), Rec = PP/(PP + NP), and F1 = 2 × Pre × Rec/(Pre + Rec).
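The definitions above can be turned into a small calculation on a multi-class confusion matrix. The sketch below is illustrative only. It assumes that rows hold the true class and columns the predicted class, and that per-class precision, recall, and F1 are macro-averaged; the paper does not state how per-class values are aggregated, and the example matrix is made up.

```python
def classification_metrics(confusion):
    """Compute Acc, Pre, Rec, and F1 from a square confusion matrix.

    Assumption: confusion[i][j] counts samples of true class i predicted as
    class j. Each class is treated one-vs-rest and the per-class scores are
    macro-averaged (an assumed aggregation rule).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n_classes))
    per_class = []
    for c in range(n_classes):
        pp = confusion[c][c]  # predicted c, actually c
        pn = sum(confusion[r][c] for r in range(n_classes)) - pp  # predicted c, actually another class
        n_p = sum(confusion[c]) - pp  # predicted another class, actually c
        pre = pp / (pp + pn) if pp + pn else 0.0
        rec = pp / (pp + n_p) if pp + n_p else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        per_class.append((pre, rec, f1))
    macro = [sum(scores[i] for scores in per_class) / n_classes for i in range(3)]
    return {
        "acc": correct / total,
        "pre": macro[0],
        "rec": macro[1],
        "f1": macro[2],
        "per_class": per_class,
    }


# Made-up three-class confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = classification_metrics(confusion)
print(round(metrics["acc"], 3))  # 0.8
```

In the multi-class case, overall accuracy is the sum of the diagonal divided by the total number of samples, while precision and recall are computed per class from the corresponding column and row.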
4.4. Results and Analysis
This study performed two sets of experiments to validate the effectiveness of the Delay-Doppler domain features and the multi-input target recognition model (TF-DD-CNN). In experiment 1, the signals’ Time–Frequency domain features (TF), Delay-Doppler domain features (DD), and joint features (TF-DD) were each used as inputs. The joint features were obtained using the Feature Fusion method in Figure 12. The VGG16 model, ResNet model [30], and GoogleNet model [31] were used for target recognition to verify the effectiveness of Delay-Doppler domain features. The objective of experiment 2 was to validate the efficacy of the developed recognition model by using the TF-DD-CNN model to identify the targets and comparing its accuracy with that of the other models.
4.4.1. Joint Feature Recognition Experiment
Experiment 1 verified the effectiveness of Delay-Doppler domain features for target recognition. The joint features were obtained using the Feature Fusion method in Figure 12, which takes the Time–Frequency feature and the Delay-Doppler feature of the signal as inputs. This experiment tests whether the joint features improve the accuracy of target recognition compared to a single Time–Frequency feature.
Figure 19 shows the loss curves obtained from all nine experimental groups. Each legend label denotes a combination of network structure and input feature. “VGG-TF” refers to the VGG model with Time–Frequency features as input, “VGG-DD” to the VGG model with Delay-Doppler features as input, and “VGG-TF-DD” to the VGG model with joint features as input. Each experimental group was trained for 70 rounds. The learning rate was set to 0.001, and the ratio of training samples to test samples was 5:1.
Table 4 displays the Acc, Pre, Rec, and F1 achieved by each experimental group, with each entry given as “training set result/testing set result”, and Figure 20 shows the recognition accuracy of the three models on the training set.
The findings of experiment 1 confirm that combining Delay-Doppler features with Time–Frequency features significantly enhances the accuracy of ship target recognition.
Figure 19 shows that the three models’ accuracy is best after 30 training iterations.
Table 4 shows that the recognition accuracy of several models is between 60% and 80%, among which the ResNet model has the best recognition effect.
Figure 20 displays the distribution of identification accuracy for various recognition models and input characteristics.
Figure 20 demonstrates that the ResNet network has the highest recognition accuracy and that, for each model, the joint features give the highest recognition accuracy, followed by the Time–Frequency domain features. The Delay-Doppler feature gives the lowest recognition accuracy, which is consistent with the idea proposed in Section 3.1 of this paper. To a certain extent, the distribution of results in Figure 20 supports the analysis of Figure 10: because the Delay-Doppler feature may contain multiple overlapping target features, its recognition accuracy is lower than that of the Time–Frequency domain features. However, combining the two features greatly enhances the recognition accuracy of the final model. According to Figure 20 and Table 4, the joint feature (TF-DD) described in this research increases target identification accuracy by around 6–8% on average compared to a single feature, which is a noticeable gain.
4.4.2. TF-DD-CNN Model Target Recognition Experiment
Experiment 2 aimed to validate the efficacy of the network model that was constructed. Experiment 1 already provided the recognition results of the joint features (TF-DD) for the VGG, ResNet, and GoogleNet models, so Figure 21 displays only the training loss function of the TF-DD-CNN model, while the model’s recognition accuracy is shown in Table 5.
Figure 22 shows the prediction accuracy of the VGG, ResNet, GoogleNet, and TF-DD-CNN models.
Figure 21 shows how the loss function changes with the number of training iterations. The loss function of the model reaches its lowest level after 15 training iterations, and the training efficiency is higher than that of the three models in Section 4.4.1. This efficient training benefits from the simple model structure, which reduces the computational burden during training. Table 5 shows that the multi-input recognition model achieves a recognition accuracy of 94% on the training set and 92% on the test set. Figure 22 displays the accuracy distribution of the multi-input recognition model with joint features. Compared to the other three recognition models, this model improves the recognition accuracy of ship targets by 15% to 20%. The experimental results verify that the multi-input target recognition model can recognize ship targets effectively and with high accuracy.
4.4.3. Analysis
The experiment in Section 4.4.1 shows that the joint features based on the Time–Frequency domain and the Delay-Doppler domain proposed in this article can effectively improve the recognition accuracy of ship targets. With the same recognition model, the joint features improve recognition accuracy by 5–7% compared with the Time–Frequency domain features or the Delay-Doppler domain features alone, which is a relatively significant improvement. The recognition results also indicate that the Delay-Doppler domain feature on its own does not improve the recognition accuracy of the model and yields the lowest accuracy of the three inputs.
The multi-input target recognition model can significantly improve the accuracy of target recognition. The recognition results in Section 4.4.1 show that the recognition accuracy of a single-input target recognition model reaches at most 80%, and even when joint features are used, the recognition accuracy is only slightly higher than 80%. The recognition results in Section 4.4.2 show that the multi-input target recognition model can effectively improve the accuracy of ship target recognition. Compared to the VGG16, GoogLeNet, and ResNet models, the multi-input target recognition model has a simpler structure, the highest training efficiency, and significantly higher recognition accuracy. Its recognition accuracy exceeds 90%, an improvement of 15% to 20%.