2.1. Network Pruning
Mainstream pruning methods currently fall into two categories: unstructured pruning and structured pruning. Unstructured pruning does not alter the network’s structure; instead, it zeros out individual weights. Because the resulting zeros are distributed irregularly, specialized hardware or software support is required to turn the sparsity into actual acceleration. In contrast, structured pruning modifies the network’s structure by physically removing groups of parameters according to predefined rules. Since it does not depend on specific AI accelerators or software stacks, structured pruning is more widely applicable in engineering practice than unstructured pruning.
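The difference between the two families can be illustrated with a minimal PyTorch sketch; the layer sizes, the 0.05 magnitude threshold, and the number of retained filters are arbitrary illustrative choices rather than values used in this paper. Unstructured pruning masks individual weights and leaves the layer shape intact, whereas structured pruning removes whole filters and yields a smaller dense layer.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# Unstructured pruning: zero individual weights whose magnitude falls below a
# threshold. The layer keeps its original shape, so real speedups require
# sparse-aware hardware or software.
threshold = 0.05  # illustrative value
mask = conv.weight.detach().abs() > threshold
conv.weight.data *= mask

# Structured pruning: physically remove whole output filters (here, keep the
# 24 filters with the largest L1 norm), producing a smaller dense layer that
# accelerates inference on ordinary hardware.
l1_per_filter = conv.weight.detach().abs().sum(dim=(1, 2, 3))
keep = torch.argsort(l1_per_filter, descending=True)[:24]
pruned = nn.Conv2d(16, 24, kernel_size=3, padding=1)
pruned.weight.data = conv.weight.data[keep].clone()
pruned.bias.data = conv.bias.data[keep].clone()
```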
The origins of unstructured pruning can be traced back to 1989, when LeCun et al. [23] used second-order derivative information to remove unimportant weights while balancing network complexity against training-set error, aiming for a network with better generalization and faster inference. Han et al. [24] later proposed a magnitude-based unstructured pruning method that removes weights whose magnitudes fall below a threshold. However, setting such thresholds requires a strong mathematical background and manual intervention, which severely limits engineering application. In response, Li et al. [25] introduced an optimization-based method for automatically adjusting thresholds, recasting threshold selection as a constrained optimization problem and solving it with derivative-free optimization algorithms. Lee et al. [26] highlighted the importance of global pruning by studying layer-wise sparsity and introduced an adaptive magnitude pruning score that does not rely on manual hyperparameter tuning. Moreover, determining weight importance in a principled way is a major research focus. Carreira et al. [27], for instance, framed pruning as an optimization problem that minimizes the loss with respect to the weights subject to pruning constraints, automatically learning the optimal number of weights to prune in each layer. Kwon et al. [28] proposed a novel representation scheme for sparsely quantized neural network weights, achieving high compression ratios through fine-grained unstructured pruning while maintaining model accuracy.
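As a concrete illustration of magnitude-based unstructured pruning in the spirit of [24], the sketch below zeros the globally smallest-magnitude weights; deriving the threshold from a target sparsity ratio (0.8 here, an arbitrary choice) rather than hand-tuning it loosely echoes the automatic threshold selection pursued in [25,26]. The function name and toy model are illustrative.

```python
import torch
import torch.nn as nn

def global_magnitude_prune(model: nn.Module, sparsity: float = 0.8):
    """Zero the globally smallest-magnitude weights so that roughly `sparsity`
    of all conv/linear parameters become zero. Returns the binary masks so the
    zeros can be re-applied during fine-tuning."""
    weights = [m.weight for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]
    scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = max(1, int(sparsity * scores.numel()))
    threshold = torch.kthvalue(scores, k).values   # global magnitude cut-off
    masks = []
    with torch.no_grad():
        for w in weights:
            mask = (w.abs() > threshold).float()
            w.mul_(mask)                           # zero pruned weights in place
            masks.append(mask)
    return masks

# Toy usage: prune 80% of the weights of a small CNN; in practice the masks
# are re-applied after every optimizer step while fine-tuning.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 32 * 32, 10))
masks = global_magnitude_prune(model, sparsity=0.8)
```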
Structured pruning methods can be broadly categorized by pruning granularity into neuron pruning, filter pruning, channel pruning, and layer pruning. Molchanov et al. [29] approximated each neuron’s contribution to the loss function with first- and second-order Taylor expansions and iteratively removed neurons with low scores. Wang et al. [30] proposed a layer-adaptive filter pruning method driven by structural redundancy, constructing a graph for each convolutional layer of a CNN to measure its redundancy and pruning unimportant filters only in the most redundant layers rather than across all layers. To address the extra hyperparameters and long training cycles associated with filter pruning, Ruan et al. [31] introduced a dynamic and progressive filter pruning scheme (DPFPS) that solves a pruning-ratio-based optimization problem with an iterative soft-thresholding algorithm exhibiting dynamic sparsity; at the end of training, only the redundant parameters need to be removed, without any additional stage. Ding et al. [32] proposed ResRep, a lossless channel pruning method that shrinks a CNN by narrowing the widths of its convolutional layers. However, pruning channels prematurely can severely degrade model accuracy; to mitigate this, Hou et al. [33] iteratively pruned and regrew channels throughout training, reducing the risk of discarding important channels too early. Yang et al. [34] observed that merely reducing a CNN’s model size or computational complexity does not necessarily reduce its energy consumption and, to bridge the gap between CNN design and energy optimization, proposed an energy-aware pruning algorithm that uses the network’s energy consumption directly to guide layer-by-layer pruning.
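To make the importance-scoring step concrete, the sketch below computes a first-order Taylor importance per filter in the spirit of [29]. The tiny model, random batch, and the choice to flag the two lowest-scoring filters are purely illustrative; actually removing those filters, together with the matching input channels of downstream layers, is omitted.

```python
import torch
import torch.nn as nn

def taylor_filter_importance(conv: nn.Conv2d) -> torch.Tensor:
    """First-order Taylor importance per output filter: |sum of weight * grad|
    over each filter. Assumes loss.backward() has been called, so
    conv.weight.grad is populated."""
    contribution = conv.weight.detach() * conv.weight.grad.detach()
    return contribution.sum(dim=(1, 2, 3)).abs()   # one score per filter

# Illustrative usage on a toy model and random batch.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
scores = taylor_filter_importance(model[0])
prune_idx = torch.argsort(scores)[:2]   # the two least important filters
```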
2.2. Knowledge Distillation
Knowledge distillation methods can currently be classified into three types according to the kind of knowledge transferred: output-based distillation, feature-based distillation, and relation-based distillation. The distillation method should be chosen according to the specific task, and different methods can also be combined for optimal results.
Buciluǎ et al. [35] were the first to propose training fast, compact models to mimic the performance of larger, slower, and more complex models, doing so by training the small model on data labeled by the large model’s predictions. The resulting small models suffered minimal performance loss, demonstrating that knowledge acquired by a large ensemble of models can be transferred to simpler, smaller models. Building upon this, Hinton et al. [36] introduced a more general solution termed “knowledge distillation”. They defined soft targets (the class probability distribution produced by the teacher network’s Softmax layer) and hard targets (the one-hot encoding of the true labels) and introduced a “temperature” variable to control the smoothness of the class probability distribution. The student model is trained on both soft and hard targets simultaneously, with a weighted average of the two losses serving as the overall loss, thereby improving student performance. Subsequently, Zhang et al. [37] proposed deep mutual learning, in which a cohort of student networks trains collaboratively, each learning from the predictions of the others. To deepen the student model’s understanding of the teacher, Kim et al. [38] introduced a method that encodes and decodes the teacher network’s outputs, so that the student learns from the decoded knowledge with improved results. To exploit the teacher’s outputs more fully, Zhao et al. [39] proposed decoupled knowledge distillation (DKD), which splits the distillation loss into a target-class term and a non-target-class term and significantly improves distillation performance.
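A minimal sketch of the Hinton-style output distillation loss described above is given below; the temperature T = 4 and weight α = 0.7 are illustrative hyperparameter values, not ones reported in the cited works.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of the soft-target and hard-target losses used in
    Hinton-style knowledge distillation. T and alpha are illustrative."""
    # Soft targets: temperature-smoothed teacher distribution; the T*T factor
    # keeps soft-target gradient magnitudes comparable across temperatures.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random logits for a 10-class problem.
s = torch.randn(8, 10)                 # student logits
t = torch.randn(8, 10)                 # teacher logits
y = torch.randint(0, 10, (8,))
loss = kd_loss(s, t, y)
```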
The methods above are all based on output distillation; however, some researchers argue that output-based distillation has inherent limitations and have proposed feature-based alternatives. Adriana et al. [40] argued that, in deep neural networks, it is difficult to propagate supervisory signals all the way back to the early layers, and that constraining only the label-level outputs is insufficient. They therefore proposed FitNets, which introduces supervisory signals at intermediate hidden layers: hint training constrains the intermediate-layer outputs of the two models to be as close as possible, allowing the student to learn the teacher’s intermediate-layer features. Zagoruyko et al. [41] proposed attention transfer, enabling the student to learn the teacher’s attention distribution and thereby improve performance. Huang et al. [42] introduced neuron selectivity transfer (NST), which helps the student mimic the teacher’s neuron activation patterns and improves distillation effectiveness. Ahn et al. [43] presented an information-theoretic framework that formulates knowledge transfer as maximizing the mutual information between the teacher and student networks, which is particularly advantageous for transfer across heterogeneous network architectures. To simplify the student’s learning task, Yang et al. [44] applied adaptive instance normalization to match feature statistics on each channel, making it easier for the student to capture the teacher’s higher-level feature representations.
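A minimal sketch of a FitNets-style hint loss, under the assumptions that the two feature maps share spatial resolution and that a 1×1 convolution serves as the regressor (the 64/256 channel counts and batch shapes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint: a 1x1 conv regressor maps the student's
    intermediate feature map to the teacher's channel dimension, and an MSE
    loss pulls the two feature maps together."""
    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Teacher features are detached: only the student (and regressor) learn.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())

# Example with random feature maps of matching spatial size.
hint = HintLoss()
loss = hint(torch.randn(2, 64, 28, 28), torch.randn(2, 256, 28, 28))
```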
In addition to output-based and feature-based methods, some researchers argue that it is crucial for the student model to learn the relationships between network feature layers or between data instances, known as relation-based distillation. Yim et al. [45] held that a true teacher teaches the process of solving a problem, so they redefined knowledge distillation as learning that solution process, specifically the relationships between network feature layers: knowledge is distilled by fitting the inter-layer relationships of the teacher and student, allowing the student to learn the teacher’s feature-extraction process. Peng et al. [46] subsequently proposed a distillation method based on correlation congruence, transferring knowledge by aligning the correlations among feature representations in the student with those in the teacher. Park et al. [47] introduced relational knowledge distillation (RKD), proposing distance-wise and angle-wise distillation losses that transfer the mutual relations between data instances. To further refine the teacher’s knowledge, Liu et al. [48] introduced an instance relationship graph (IRG)-based method that models instance features, instance relationships, and feature-space transformations, letting the student’s IRG mimic the structure of the teacher’s IRG. Zhou et al. [49] designed a student model based on graph neural networks (GNNs), leveraging GNNs to capture and transfer the teacher’s global knowledge and effectively improving student performance.
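A minimal sketch of the distance-wise term of an RKD-style relational loss (the angle-wise term is omitted for brevity; the embedding dimensions and batch size in the usage line are illustrative):

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(student_emb, teacher_emb):
    """Distance-wise relational loss in the spirit of RKD: match the
    pairwise-distance structure of the student's embeddings to the teacher's,
    each normalized by its mean pairwise distance."""
    def normalized_pairwise_dist(e):
        d = torch.cdist(e, e, p=2)            # N x N Euclidean distance matrix
        mean = d[d > 0].mean()                # mean of off-diagonal distances
        return d / (mean + 1e-8)

    return F.smooth_l1_loss(normalized_pairwise_dist(student_emb),
                            normalized_pairwise_dist(teacher_emb.detach()))

# Example: a batch of 16 embeddings; dimensions may differ between networks
# because only the relational (distance) structure is compared.
loss = rkd_distance_loss(torch.randn(16, 128), torch.randn(16, 256))
```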
In summary, unstructured pruning can achieve large-scale model compression but requires specialized hardware support for deployment. Current structured pruning methods are mostly designed for specific network architectures and rely on manually crafted pruning rules, which are labor-intensive and generalize poorly. In knowledge distillation, output-based and feature-based methods are easier to implement, whereas relation-based methods are more complex and harder to combine with other distillation techniques. Additionally, owing to the high cost and difficulty of underwater acoustic experiments, there remains a significant research gap in model compression for underwater acoustic target detection. This paper therefore employs an automated pruning scheme based on dependency graphs, combined with output-based and feature-based distillation. The approach achieves substantial model compression while maintaining detection accuracy, and its performance is validated on embedded devices in underwater environments.
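For readers unfamiliar with dependency-graph-driven pruning, the sketch below outlines the general workflow using the open-source Torch-Pruning library; the API calls follow that library’s public examples and may differ across versions, and the ResNet-18 backbone, the chosen layer, and the channel indices are illustrative rather than the detector configuration used in this paper.

```python
import torch
import torch_pruning as tp                 # open-source Torch-Pruning library
from torchvision.models import resnet18

model = resnet18(weights=None)
example_inputs = torch.randn(1, 3, 224, 224)

# Trace the model once to build a dependency graph recording which layers must
# be pruned together (e.g., a conv's output channels, the following BatchNorm,
# and the input channels of the next conv).
DG = tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)

# Collect the full group of coupled pruning operations triggered by removing
# three output channels of the first convolution (indices are illustrative).
group = DG.get_pruning_group(model.conv1, tp.prune_conv_out_channels, idxs=[2, 6, 9])

# Apply the whole group at once so the network remains structurally consistent.
if DG.check_pruning_group(group):
    group.prune()
```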