1. Introduction
Synthetic aperture radar (SAR) [1] has achieved extensive utilization in diverse fields such as reconnaissance detection, geological exploration, disaster detection, and public area security screening. Its ability to operate around the clock and in all weather conditions makes it a vital tool. Automatic target recognition (ATR) [2] plays a crucial role in SAR image interpretation, encompassing the recognition of target regions of interest and the inference of target class attributes. This paper focuses on SAR-ATR, which holds immense practical value and theoretical significance, providing valuable insights for image recognition [3], target recognition [4], image matching [5], and other remote sensing applications.
SAR achieves high-resolution imaging in the range direction through linear frequency-modulated signals and matched filtering, while the azimuth direction relies on motion-based virtual aperture synthesis. It possesses remarkable attributes such as all-day [6] and all-weather operability [7], high penetration capability [8], long-range observation [9], and high-resolution ground imaging [10]. The wealth of surface electromagnetic scattering information offered by SAR has contributed to its widespread adoption in various domains, including ocean exploration [11], forestry census [12], topographic mapping [13], land resource survey, and traffic control, as well as military applications such as battlefield reconnaissance, radar guidance, and strike effect evaluation [14]. Enhancing image quality plays a crucial role in SAR applications, encompassing image super-resolution, denoising, deblurring, and contrast enhancement. These techniques not only improve visual effects but also enhance feature extraction quality, thereby facilitating subsequent image understanding and interpretation. However, traditional SAR imaging algorithms face challenges in effectively detecting and imaging moving targets due to their complex scattering characteristics and motion features.
With the rapid advancements in deep learning, convolutional neural networks (CNNs) have been increasingly employed for ship recognition in SAR images, exhibiting promising outcomes. However, deeper CNNs pursue accuracy at the expense of real-time performance, speed, and compatibility with resource-constrained embedded platforms. Hence, there is a pressing need for lightweight models that strike a balance between speed and accuracy, enabling real-time ship target recognition in SAR images and seamless deployment on embedded platforms [15]. In the realm of traditional machine learning, feature extraction and classification algorithms are commonly used for target recognition. Feature extraction techniques draw on edge, texture, shape, and polarization characteristics to extract target feature information from SAR images; the wavelet transform, Gabor filter, gray-level co-occurrence matrix, and principal component analysis are popular feature extraction algorithms. Following feature extraction, classifiers such as support vector machines (SVMs) [16], artificial neural networks, and decision trees are utilized for target classification. Traditional machine learning methods also encompass feature-matching-based approaches, such as polarized scattering similarity and adaptive local orientation patterns. Du proposed the Fast C&W algorithm to counter adversarial attacks on deep-CNN-based SAR target recognition [17], and Peng proposed a speckle-variant attack algorithm for adversarial attacks on deep-CNN-based SAR target recognition [18].
While traditional machine learning methods can achieve satisfactory results in SAR target recognition, they have limitations. Subjectivity and incompleteness arise from the need for manual feature selection during extraction. Furthermore, the performance of traditional machine learning methods is constrained by the capabilities of feature extraction and classifiers.
In recent years, the emergence of deep learning has prompted researchers to explore its application in SAR target recognition. Two-stage target recognition methods generally offer higher accuracy compared to single-stage methods. However, two-stage algorithms often exhibit slower training and recognition speeds compared to their single-stage counterparts. To address this limitation, researchers have increasingly turned to single-stage algorithms to ensure real-time recognition. However, single-stage methods are more susceptible to false recognitions and localization errors, particularly for small targets. Thus, there is a need to enhance the performance of single-stage algorithms in detecting small targets in real-time applications.
Existing target recognition algorithms primarily cater to optical images, focusing on accuracy improvement. Few detectors have been specifically tailored for SAR images. Directly applying target recognition algorithms designed for optical images to SAR images may yield suboptimal results due to differences in imaging mechanisms, target characteristics, and resolution disparities. Therefore, it is crucial to develop target recognition algorithms that consider the unique complexities and characteristics of SAR images.
Given the extensive SAR image data and high feature dimensionality, traditional deep learning models often suffer from a large number of parameters and high computational complexity. Hence, the research focus has shifted toward lightweight networks. Lightweight networks refer to neural network structures that achieve computational efficiency by reducing parameters and computations. In SAR automatic target recognition, lightweight networks offer improved computational speed and memory efficiency without compromising classification accuracy.
In conclusion, the demand for a lightweight and readily deployable single-stage target recognition algorithm for SAR image target recognition, especially on embedded platforms, is becoming increasingly urgent. This paper introduces the FCCD-SAR method, which addresses the SAR image target recognition challenge by striking a well-balanced approach. Our approach significantly improves recognition accuracy while minimizing the number of parameters and floating-point operations (FLOPs) required. The main contributions of this paper can be summarized as follows.
(1) To facilitate the development of a SAR image target recognition algorithm well-suited to embedded platforms, we adopt a more rational approach by employing the FasterNet backbone for lightweight feature extraction, which aligns better with the unique characteristics of SAR image data. Moreover, we effectively reduce the number of parameters while preserving satisfactory performance.
(2) To enhance the extraction of scattering information from targets and improve target recognition accuracy, we propose the utilization of a lightweight upsampling operator called CARAFE. This operator exhibits a wide perception field during reconfiguration, allowing for effective improvement in the recognition performance of SAR targets. Additionally, this design enables a reduction in the number of parameters and floating-point operations (FLOPs) required while maintaining high recognition performance.
(3) To further improve the model’s recognition accuracy and computational efficiency, a faster and lighter module C3-Faster is used to reduce the number of parameters and computation while ensuring recognition accuracy.
(4) To address the multi-scale nature and large scale variation of SAR targets, DyHead, an attention-based detection head, is used to more adequately capture feature information at different scales and improve the recognition of SAR targets.
(5) To obtain the final network model, a pruning operation is introduced to trim the network structure, yielding the smallest optimal model while maintaining accuracy.
2. Related Work
Numerous publicly available datasets exist for SAR target recognition, with the MSTAR ten-class classification dataset being the most renowned. Various algorithms, including traditional and deep learning-based approaches, have been employed for SAR target recognition. Traditional algorithms often prove ineffective, relying heavily on manual parameter setting and design, lacking robustness, and exhibiting poor generalization to other SAR datasets. Additionally, their recognition speed and real-time performance fall short of engineering application requirements.
Consequently, the benefits of end-to-end deep learning algorithms have become increasingly evident. In recent years, researchers have shifted towards deep learning algorithms for SAR image target recognition, capitalizing on advancements in deep learning techniques. These algorithms eliminate the need for intricate manual feature extraction, instead focusing on designing robust network structures to effectively extract SAR target features. Convolutional neural networks (CNNs) have gained significant popularity in SAR image target recognition, particularly for ship targets. Pre-trained CNN models [19] have yielded promising results for feature extraction in SAR images, followed by classification using traditional classifiers. With the continuous evolution of deep learning techniques, researchers have explored complex deep neural network models such as RNNs and graph convolutional neural networks (GCNNs) [20] for SAR target-recognition tasks.
The development of deep learning algorithms has introduced new methods for SAR target recognition. For instance, region extraction algorithms commonly used in target recognition, such as region-based convolutional neural networks (R-CNN) [21] and Single Shot MultiBox Detectors (SSD) [22], have been adapted for SAR target recognition. Additionally, novel network architectures and techniques have been proposed, such as the feature pyramid network for target recognition [23] and the fully convolutional attention block algorithm for SAR target recognition [24]. However, these methods often rely on deep network structures without considering practical engineering applications, leading to an imbalance between parameters, FLOPs, and recognition accuracy.
Improving the accuracy and robustness of target recognition can be achieved by fusing SAR images with data from multiple sources, such as optical and infrared images. Multi-source data fusion plays a crucial role in SAR automatic target recognition [25], utilizing information from various sources for comprehensive analysis [26] and feature extraction to enhance accuracy and robustness. Data-level fusion and feature-level fusion are the two main aspects of multi-source data fusion [27].
Given the growing interest in lightweight SAR target-recognition models, the focus has shifted toward networks with reduced parameters and computational requirements. Lightweight networks offer improved computational speed and memory efficiency without sacrificing classification accuracy [28]. Notable lightweight designs include a lossless lightweight CNN proposed by Zhang [29] and a modified convolutional random vector functional link network [30] for SAR target recognition. However, existing models often fail to strike the appropriate balance between accuracy and lightweight design and neglect the need to tailor recognition models specifically for SAR image target recognition datasets.
Therefore, to address the requirements of real-world engineering applications, we have devised an innovative SAR target recognition algorithm that is both lightweight and highly precise. This algorithm has been specifically designed to cater to SAR image target recognition datasets.
3. Materials and Methods
In this paper, we propose FCCD-SAR, a lightweight SAR ATR algorithm based on FasterNet, and in doing so strike the best balance between accuracy and lightweight design. The FCCD-SAR model mainly consists of the following modules and strategies: the state-of-the-art target-recognition benchmark framework YOLOV5; the faster neural network backbone FasterNet; the lightweight upsampling operator CARAFE, which addresses the shortcomings of some general-purpose modules and operators; the faster, lighter module C3-Faster; and DyHead, which uses an attention mechanism to unify different target-detection heads. Compared with other current state-of-the-art methods, YOLOV8 was only recently released and its model is still undergoing frequent modification, so it is not yet stable enough. YOLOV7, on the other hand, has a slightly lower inference speed than YOLOV5 and requires more memory. In contrast, YOLOV5, after many official revisions, offers more stable performance, a more mature network, a faster inference speed, and lower memory consumption. Therefore, YOLOV5 was chosen as the benchmark framework for this study.
3.1. Architectural Overview of FCCD-SAR Network
The schematic representation in Figure 1 depicts the holistic network architecture of our FCCD-SAR model, while Figure 2 shows the basic YOLOV5 network framework, which facilitates the comparison with the improved structure. The model depicted in the figure comprises five distinct components: input, backbone, neck, head, and output. Notably, the input image initially undergoes processing through FasterNet [31], a custom-designed lightweight backbone network. This backbone network is adept at extracting the discrete scattering features of SAR images more rationally.
The features produced by the backbone then enter the neck, where they meet the lightweight, universal upsampling operator CARAFE [32], which achieves significant improvements on different tasks while introducing only a small number of parameters and a modest computational cost. Then, before the small-target-scale features are output, a faster, lighter module, C3-Faster, is introduced; it combines the respective advantages of CNN and self-attention so that they complement each other, improving the recognition performance of SAR targets while reducing the number of parameters and FLOPs. Finally, the processed features flow into DyHead [33], our chosen attention-based detection head. DyHead incorporates attention mechanisms across feature levels for scale perception, spatial locations for spatial perception, and output channels for task perception. This innovative approach greatly enhances the expressiveness of the model's target-detection head without imposing an additional computational burden. We discuss the detailed improvements in these four areas below.
3.2. Faster and Better Neural Networks: FasterNet
To design fast neural networks, much previous work has focused on reducing the number of FLOPs. However, we observe that a reduction in FLOPs does not necessarily lead to a comparable reduction in latency; this mainly stems from operators running at an inefficiently low number of floating-point operations per second (FLOPS).
To achieve faster networks, we revisited popular operators and found that such low FLOPS is mainly due to the frequent memory accesses of these operators, especially depthwise convolution. Therefore, we adopted a new partial convolution (PConv) that extracts spatial features more efficiently by reducing both redundant computation and memory accesses.
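For a concrete picture of PConv, the following PyTorch sketch (a minimal re-implementation of the idea, not the official FasterNet code) convolves only a quarter of the channels, the default partial ratio reported for FasterNet, and passes the remaining channels through untouched:

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Partial convolution (PConv) sketch: apply a 3x3 convolution to a subset
    of channels only, leaving the rest untouched, which reduces both FLOPs and
    memory accesses compared with a full convolution."""
    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.conv_channels = channels // n_div              # channels that get convolved
        self.pass_channels = channels - self.conv_channels  # channels passed through
        self.conv = nn.Conv2d(self.conv_channels, self.conv_channels,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_channels, self.pass_channels], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# usage: a 64-channel feature map in which only 16 channels are convolved
x = torch.randn(1, 64, 32, 32)
print(PartialConv(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```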
To optimize the backbone network, we incorporated FasterNet-T0, the smallest version of FasterNet, and retained its MLPBlock architecture, while streamlining the stacking of MLPBlocks by removing unnecessary ones. The resulting lightweight backbone, FasterBackbone, was specifically designed to efficiently extract scattering features from SAR datasets.
Figure 3 provides an overview of the structural details of FasterBackbone, while Table 1 presents the specific parameters used. Through extensive experimental validation on the SAR dataset, we demonstrated the strong feature extraction capability of the designed backbone.
3.3. Lightweight Upsampling Operator: CARAFE
The upsampling operation was achieved through feature recombination, which involves the dot product between the upsampling kernel and the corresponding neighborhood pixels in the input feature map. The fundamental network structure, with a small receptive field, ignores some useful information, and therefore, the receptive field needs to be enlarged. The upsampling operation CARAFE, on the other hand, can have a large receptive field during reorganization and guides the reorganization process based on the input features. Meanwhile, the whole CARAFE operator structure is small, which meets the lightweight requirement. Specifically, the input feature map is utilized to predict unique upsampling kernels for each position, followed by feature recombination based on these predicted kernels. CARAFE demonstrates significant performance improvements across various tasks while only introducing minimal additional parameters and computational overhead.
CARAFE consists of two primary modules: the upsampling kernel prediction module and the feature recombination module, as depicted in Figure 4. Assuming an upsampling ratio of $\sigma$ and an input feature map of size $C \times H \times W$, the process begins by predicting the upsampling kernels through the upsampling kernel prediction module. Subsequently, the feature recombination module is employed to complete the upsampling procedure, resulting in an output feature map of size $C \times \sigma H \times \sigma W$.
Given an input feature map of shape $C \times H \times W$, our initial step involves channel compression, reducing the channel number from $C$ to $C_m$ using a $1 \times 1$ convolution. The primary objective of this compression is to alleviate the computational burden of the subsequent steps. Following that, we proceeded with content encoding and upsampling kernel prediction, assuming an upsampling kernel size of $k_{up} \times k_{up}$. It is worth noting that a larger upsampling kernel offers a broader perceptual field but also entails a higher computational cost. To incorporate a distinct upsampling kernel for each position in the output feature map, the predicted upsampling kernels must have the shape $\sigma H \times \sigma W \times k_{up}^{2}$. In the initial step, after compressing the input feature map, we employed a convolutional layer of kernel size $k_{encoder}$ to predict the upsampling kernels; the number of input channels is $C_m$, and the number of output channels is $\sigma^{2} k_{up}^{2}$. Following this, we expanded the channel dimension across the spatial dimension, resulting in upsampling kernels of shape $\sigma H \times \sigma W \times k_{up}^{2}$.
For each location within the output feature map, we mapped it back to the corresponding region in the input feature map. This region, centered on that location, covers an area of size $k_{up} \times k_{up}$. Subsequently, we computed the dot product between this region and the predicted upsampling kernel specific to that point, yielding the output value. It is worth noting that different channels at the same location share the same upsampling kernel.
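To make the two modules concrete, the following PyTorch sketch gives a minimal, functional re-implementation of the CARAFE idea (not the official, CUDA-optimized operator); the hyperparameters $C_m = 64$, $k_{encoder} = 3$, and $k_{up} = 5$ are the commonly reported CARAFE defaults and are assumptions here rather than the exact values used in our model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """CARAFE sketch: predict a k_up x k_up reassembly kernel for every output
    position, then reassemble the corresponding input neighborhood with a
    per-position dot product."""
    def __init__(self, c: int, scale: int = 2, c_mid: int = 64,
                 k_encoder: int = 3, k_up: int = 5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                        # channel compressor
        self.encoder = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2,       # content encoder
                                 k_encoder, padding=k_encoder // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # kernel prediction: compress, encode, spread channels to space, normalize
        kernels = self.encoder(self.compress(x))                      # (b, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, self.scale)                # (b, k^2, sh, sw)
        kernels = F.softmax(kernels, dim=1)                           # kernel normalizer
        # feature reassembly: gather k_up x k_up neighborhoods of the input
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)      # (b, c*k^2, h*w)
        patches = patches.view(b, c * self.k_up ** 2, h, w)
        patches = F.interpolate(patches, scale_factor=self.scale, mode="nearest")
        patches = patches.view(b, c, self.k_up ** 2,
                               h * self.scale, w * self.scale)
        # per-position dot product between each neighborhood and its kernel
        return torch.einsum("bckhw,bkhw->bchw", patches, kernels)

# usage: upsample a 64-channel feature map by a factor of 2
x = torch.randn(1, 64, 16, 16)
print(CARAFE(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```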
3.4. Faster and Lighter Modules: C3-Faster
In recent years, the field of computer vision (CV) has witnessed a surge of interest in convolutional neural networks (CNNs) and self-attention networks. CNNs have achieved remarkable breakthroughs in CV tasks, including image classification, target recognition, and target tracking, consistently attaining state-of-the-art performance across diverse datasets. Concurrently, the rapid development of vision transformers has led to the emergence of transformer-based models with various self-attention mechanisms that have begun to surpass CNNs in several vision tasks, thereby redefining the performance benchmarks in these areas.
ACmix offers a compelling fusion of convolution and self-attention, making it a suitable approach for enhancing hybrid representation learning in SAR image target recognition. With the challenge of detecting small targets in SAR images in mind, we opted to replace the original YOLOV5 C3 module with C3-Faster, built from FasterNet's Faster-Block. The Faster-Block and BottleNeck structures are shown in Figure 5; compared with BottleNeck, Faster-Block contains one more partial convolution for spatial fusion and a DropPath to reduce the amount of computation. Figure 5 showcases the design of this faster, lighter module. By incorporating the lightweight C3-Faster, we further enhanced the speed of target recognition, addressing the need for efficient and swift target identification.
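The sketch below shows, in simplified PyTorch under our own naming, how a Faster-Block (a partial convolution for spatial fusion followed by two point-wise convolutions, a residual shortcut, and a DropPath placeholder) can be wrapped into a C3-style module; it is meant only to illustrate the structure described above, not to reproduce the exact C3-Faster implementation:

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    # same idea as the PConv sketch in Section 3.2: convolve 1/4 of the channels
    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.c1 = channels // n_div
        self.conv = nn.Conv2d(self.c1, self.c1, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c1, x.shape[1] - self.c1], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterBlock(nn.Module):
    """Faster-Block sketch: PConv for spatial fusion, two 1x1 convolutions,
    and a residual shortcut; DropPath is replaced by Identity for brevity."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )
        self.drop_path = nn.Identity()   # stand-in for stochastic depth

    def forward(self, x):
        return x + self.drop_path(self.mlp(self.pconv(x)))

class C3Faster(nn.Module):
    """C3-style wrapper: two 1x1 branches, n Faster-Blocks on one branch,
    concatenation, and a fusing 1x1 convolution."""
    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_half, 1)
        self.cv2 = nn.Conv2d(c_in, c_half, 1)
        self.blocks = nn.Sequential(*[FasterBlock(c_half) for _ in range(n)])
        self.cv3 = nn.Conv2d(2 * c_half, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat((self.blocks(self.cv1(x)), self.cv2(x)), dim=1))
```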
3.5. Detection Head Based on Attention Mechanism: DyHead
In Figure 1, the third component showcases our attention mechanism-based detection head, DyHead, which was tailored to the SAR image dataset. DyHead introduces a novel dynamic head framework that unifies various target detection heads using an attention mechanism. By leveraging attention between feature levels for scale perception, spatial locations for spatial perception, and output channels for task perception, this approach substantially enhances the expressiveness of the model's target detection head without imposing additional computational burden.
DyHead is a fusion of three attention mechanisms: scale-aware attention $\pi_L$, spatial-aware attention $\pi_S$, and task-aware (channel) attention $\pi_C$. These attention mechanisms are stacked together to form a single block, and the final head consists of multiple such blocks.
Given the feature tensor $\mathcal{F} \in \mathbb{R}^{L \times S \times C}$, where $L$ denotes the number of feature levels, $S$ the number of spatial locations, and $C$ the number of channels, the generalized form of self-attention can be written as
$$W(\mathcal{F}) = \pi(\mathcal{F}) \cdot \mathcal{F},$$
where $\pi(\cdot)$ is an attention function. The simplest approach would be to implement $\pi(\cdot)$ with a fully connected layer, but directly learning the attention function across all dimensions would result in excessive computational requirements and prove impractical due to the high dimensionality. Instead, we tackled this challenge by decomposing the attention function into three sequential attentions, each targeting a single dimension:
$$W(\mathcal{F}) = \pi_C\!\left(\pi_S\!\left(\pi_L(\mathcal{F}) \cdot \mathcal{F}\right) \cdot \mathcal{F}\right) \cdot \mathcal{F},$$
where $\pi_L(\cdot)$, $\pi_S(\cdot)$, and $\pi_C(\cdot)$ act on the level, spatial, and channel dimensions, respectively.
Scale-aware attention $\pi_L$: to fuse features of different scales according to their semantic significance, we began by introducing a scale-aware attention:
$$\pi_L(\mathcal{F}) \cdot \mathcal{F} = \sigma\!\left(f\!\left(\frac{1}{SC}\sum_{S,C}\mathcal{F}\right)\right) \cdot \mathcal{F},$$
where $f(\cdot)$ is a linear function approximated by a $1 \times 1$ convolution and $\sigma(x) = \max\!\left(0, \min\!\left(1, \tfrac{x+1}{2}\right)\right)$ is a hard-sigmoid activation function.
Spatial-aware attention $\pi_S$: continuing, we introduced another module, spatial-location-aware attention, to emphasize the discriminative capabilities of different spatial locations. Given the large size of $S$, we decoupled it into two stages: first, deformable convolution is employed to achieve sparse attention learning, and then features at the same spatial locations are aggregated across scales:
$$\pi_S(\mathcal{F}) \cdot \mathcal{F} = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot \mathcal{F}\left(l;\, p_k + \Delta p_k;\, c\right) \cdot \Delta m_k,$$
where $K$ is the number of sparsely sampled positions, $p_k + \Delta p_k$ is a position shifted by the self-learned offset $\Delta p_k$ (analogous to that of deformable convolution) so as to focus on a discriminative region, and $\Delta m_k$ is a self-learned importance factor for adaptive weighting at position $p_k$; the remaining details are omitted for brevity.
Task-aware attention $\pi_C$: to facilitate collaborative learning and enhance the generalizability of the target representation, we devised a task-aware attention mechanism that dynamically switches feature channels on and off to assist different tasks as needed:
$$\pi_C(\mathcal{F}) \cdot \mathcal{F} = \max\!\left(\alpha^{1}(\mathcal{F}) \cdot \mathcal{F}_c + \beta^{1}(\mathcal{F}),\ \alpha^{2}(\mathcal{F}) \cdot \mathcal{F}_c + \beta^{2}(\mathcal{F})\right),$$
where $\mathcal{F}_c$ is the feature slice of the $c$-th channel and $\left[\alpha^{1}, \alpha^{2}, \beta^{1}, \beta^{2}\right]^{T} = \theta(\cdot)$ is a hyper-function that plays a crucial role in controlling the activation thresholds, akin to DyReLU; $\alpha^{1}, \alpha^{2}$ and $\beta^{1}, \beta^{2}$ are used for rescaling and reshifting, respectively. By sequentially applying the aforementioned attention mechanisms, multiple such blocks can be stacked. The configuration of the DynamicHead is illustrated in Figure 6, providing a visual representation of its structure.
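As a minimal illustration of how one of these attentions acts on the $L \times S \times C$ feature tensor, the PyTorch sketch below implements only the scale-aware term $\pi_L$ (a mean over space and channels per level, a $1 \times 1$ convolution as the linear function $f$, and a hard-sigmoid); the spatial- and task-aware terms and the block stacking are omitted, so this is an illustrative simplification rather than the full DyHead block:

```python
import torch
import torch.nn as nn

class ScaleAwareAttention(nn.Module):
    """Scale-aware attention pi_L sketch: compute one weight per pyramid level
    from the mean over spatial positions and channels, then rescale that level."""
    def __init__(self):
        super().__init__()
        self.f = nn.Conv2d(1, 1, kernel_size=1)   # linear function f(.) as a 1x1 conv
        self.hard_sigmoid = nn.Hardsigmoid()

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (L, S, C) = (levels, spatial positions, channels)
        L, S, C = feats.shape
        level_ctx = feats.mean(dim=(1, 2))                          # 1/(S*C) * sum over S, C
        w = self.hard_sigmoid(self.f(level_ctx.view(L, 1, 1, 1)))   # one weight per level
        return w.view(L, 1, 1) * feats                              # pi_L(F) . F

# usage on a toy pyramid tensor with 3 levels, 256 positions, 64 channels
feats = torch.randn(3, 256, 64)
print(ScaleAwareAttention()(feats).shape)   # torch.Size([3, 256, 64])
```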
3.6. Pruning
The deployment of CNNs in practical applications is often hindered by their high computational requirements. In this study, we adopt a straightforward and efficient approach, network pruning, based on sparsifying network channels. This method is particularly well-suited to CNN architectures: it adds little training overhead and yields models that can be deployed without specialized hardware or software acceleration while still maintaining high performance. By training a larger, over-parameterized network and automatically identifying and removing redundant channels during training, we can generate streamlined networks that achieve comparable accuracy.
This process involves applying L1 regularization to the scaling factors within the Batch Normalization (BN) layers and iteratively adjusting them. By driving the scaling factors toward zero, we can identify and remove unimportant channels. This can be visualized in Figure 7, where the regularization gradually reduces the scaling factor values, leading to the elimination of unnecessary channels.
By applying L1 regularization to the scaling factors, each corresponding to a specific convolutional channel or neuron in the fully connected layer, we can effectively discriminate and prune unimportant channels in subsequent operations. Although the additional regularization term has a minimal impact on model performance, it can potentially enhance training accuracy. While pruning unimportant channels may initially lead to a temporary performance drop, this can be rectified through subsequent fine-tuning.
The pruned network obtained after the pruning process exhibits a more compact size, reduced running time, and decreased computational operations compared to the original network.
For each channel in the network, a scaling factor $\gamma$ is introduced, which is multiplied by the output of that channel. Both the network weights and these scaling factors are trained jointly, with sparse regularization continuously applied to the scaling factors throughout the training process. Eventually, channels with very small scaling factor values are pruned, and the network is further fine-tuned. The specific details of the process are as follows:
During training, the loss function is defined as
$$L = \sum_{(x,y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma),$$
where the first term is the normal training loss of the CNN, with $(x, y)$ denoting the input and target and $W$ the trainable weights; the second term introduces a sparsity penalty on the scaling factors, and the balance factor $\lambda$ weights the two terms. In our experiments, we choose $g(\gamma) = |\gamma|$, i.e., the L1 norm, for the sparsity training. To optimize the non-smooth L1 penalty term, we use the subgradient descent method; alternatively, the non-smooth L1 penalty can be replaced with a smoothed L1 penalty to avoid the need for subgradients at non-smooth points.
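As a minimal sketch of this sparsity term (assuming PyTorch and a BN-based model; the value $\lambda = 10^{-4}$ is an illustrative choice rather than the exact setting used in our experiments), the L1 norms of all BN scaling factors are summed and added to the ordinary task loss:

```python
import torch
import torch.nn as nn

def bn_sparsity_loss(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 penalty on all BN scaling factors (the gamma parameters),
    to be added to the normal task loss during sparsity training."""
    gammas = [m.weight.abs().sum() for m in model.modules()
              if isinstance(m, nn.BatchNorm2d)]
    return lam * torch.stack(gammas).sum()

# usage inside the training loop (task_loss computed as usual):
#   loss = task_loss + bn_sparsity_loss(model, lam=1e-4)
#   loss.backward()
```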
Channel pruning involves removing the input-output connectivity related to the channel, leading to a narrower network. The scaling factors serve as channel selectors, and when optimized with the network, they facilitate the removal of unimportant channels without significantly affecting the generalization performance.
BN finds extensive application in contemporary CNNs, facilitating rapid model convergence and enhancing generalization performance. We draw inspiration from BN's technique of normalizing activation values and employ it as a basis for a straightforward yet efficient way of incorporating channel scaling factors. Specifically, the BN layer uses mini-batch statistics to normalize the activation values. Considering $z_{in}$ as the input and $z_{out}$ as the output of a BN layer, with $B$ representing the current mini-batch, the BN layer executes the following transformation:
$$\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta,$$
where $\mu_B$ represents the mean and $\sigma_B^{2}$ represents the variance of the inputs over $B$, and $\gamma$ and $\beta$ denote trainable affine transformation parameters responsible for rescaling and shifting the normalized activations to an arbitrary scale.
The conventional approach is to add a BN layer with channel-wise scaling/shifting factors after each convolutional layer. This allows us to directly use the $\gamma$ parameters in the BN layers as the scaling factors for network pruning, which has the advantage of introducing no additional overhead to the network and is an efficient way to implement channel pruning.
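The channel-selection step can be sketched as follows (a simplified PyTorch illustration, not our full pruning pipeline): collect the absolute BN scaling factors of the whole network, take a global percentile as the threshold, and keep only the channels whose $\gamma$ exceeds it; rebuilding the slimmer network and the subsequent fine-tuning are omitted, and the pruning ratio of 0.5 is illustrative:

```python
import torch
import torch.nn as nn

def bn_prune_masks(model: nn.Module, prune_ratio: float = 0.5) -> dict:
    """Select channels to keep by thresholding BN gamma magnitudes at a
    global percentile computed across all BN layers of the network."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)          # global threshold
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = m.weight.detach().abs() > threshold   # True = keep channel
    return masks

# usage: keep roughly the top 50% of channels by |gamma|
# masks = bn_prune_masks(model, prune_ratio=0.5)
```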
5. Discussion
This paper presented FCCD-SAR, a lightweight algorithm for SAR target recognition based on FasterNet. The method was specifically designed for deployment on embedded devices, taking into consideration the lightweight requirements. Moreover, it incorporates the unique feature information characteristics of SAR images.
In this study, the lightweight benchmark model YOLOv5 was initially introduced, and subsequently, FasterNet, a more efficient and faster neural network, was used to replace the main network. The choice of FasterNet was motivated by its compatibility with the unique characteristics of the SAR image dataset, striking a balance between speed and accuracy.
To minimize both the model size and computational effort, we employed the lightweight upsampling operator CARAFE. CARAFE performs a dot product between the upsampling kernel and the pixels in the surrounding neighborhood of each position within the input feature map. This operation allows for a broader perceptual field during recombination and guides the recombination process using input features. As a result, it enhances the recognition performance of SAR targets while simultaneously reducing the number of parameters.
To improve both the recognition accuracy and computational efficiency of the model, we incorporated the C3-Faster module, which is faster and lighter. This module effectively reduced the number of parameters and computational requirements by selectively discarding unimportant information while maintaining the required level of recognition accuracy.
The attention mechanism-based detection head, DyHead, was incorporated to handle the multi-scale features of SAR targets. DyHead is a dynamic head framework that utilizes an attention mechanism to unify various target detection heads. It leverages attention mechanisms across feature levels for scale perception, spatial locations for spatial perception, and output channels for task perception. By employing this approach, the model’s target detection head achieved enhanced expressiveness and improved target recognition accuracy without increasing the computational effort.
To obtain the optimal model, we employed a pruning technique to reduce the network's complexity while preserving its accuracy. Subsequently, we evaluated the proposed method on the MSTAR dataset, and the results demonstrated its exceptional performance, achieving a mean average precision (mAP) of 99.5%. Notably, the number of parameters was merely 2.72 M and the FLOPs amounted to 6.11 G, showcasing the model's efficiency.