1. Introduction
Synthetic Aperture Radar (SAR) is a microwave imaging sensor that is largely unaffected by external environmental factors such as clouds, fog, snow, and nighttime conditions. It can continuously monitor local terrain scenes, and its strong penetration capability and high-resolution imaging enable the accurate detection of obscured or camouflaged targets [1]. It finds widespread application in civilian and military sectors, including topographic mapping, disaster assessment, environmental monitoring, target reconnaissance, and target localization. Among these applications, marine target detection is a significant subdivision of SAR object detection, with ship detection being a primary focus within it.
Among traditional ship detection algorithms, constant false alarm rate (CFAR) detectors [2,3] and other adaptive algorithms are widely used because they can adaptively scan images. CFAR estimates the statistics of the background clutter to set a detection threshold, declaring a target present when the energy of the input signal exceeds that threshold. To meet the diverse requirements of different SAR image applications, multiple statistical clutter models have been proposed, including the Gaussian, gamma, Weibull, log-normal, G0, and K distributions [4,5]. Moreover, enhancements and variants of the CFAR algorithm continue to emerge [6,7,8]. Nevertheless, these approaches often require laborious manual feature configuration and exhibit limited transferability. While they excel in scenarios with single-class ships and locally uniform background clutter, their efficacy wanes in nearshore ship detection with intense interference and in multi-scale ship detection [9,10]. Additionally, they cannot process targets end-to-end. Hence, more sophisticated and robust algorithms are needed to tackle these challenges.
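As a concrete illustration of the thresholding principle described above, the following is a minimal sketch of a one-dimensional cell-averaging CFAR detector under an assumed exponential clutter model; the window sizes and false-alarm rate are illustrative choices, not taken from the cited works.

```python
import numpy as np

def ca_cfar_1d(signal, num_train=16, num_guard=4, pfa=1e-3):
    """Cell-averaging CFAR along a 1-D intensity profile (sketch).

    For each cell under test (CUT), the clutter level is estimated from
    `num_train` training cells on each side (skipping `num_guard` guard
    cells), and the threshold is the scaled clutter estimate.
    """
    n = len(signal)
    num_cells = 2 * num_train  # total training cells
    # Scaling factor giving the desired false-alarm rate for exponential clutter.
    alpha = num_cells * (pfa ** (-1.0 / num_cells) - 1.0)
    detections = np.zeros(n, dtype=bool)
    for i in range(num_train + num_guard, n - num_train - num_guard):
        lead = signal[i - num_train - num_guard : i - num_guard]
        lag = signal[i + num_guard + 1 : i + num_guard + 1 + num_train]
        noise = (lead.sum() + lag.sum()) / num_cells
        detections[i] = signal[i] > alpha * noise  # target if CUT exceeds threshold
    return detections
```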
After AlexNet [11] achieved significant acclaim in the 2012 ImageNet competition, convolutional neural networks (CNNs) saw a resurgence within the domain of image processing. Represented by R-CNN [12], CNN-based algorithms were employed in object detection, pioneering the development of two-stage detection. Subsequent advancements such as SPPNet [13], Fast R-CNN [14], and Faster R-CNN [15] further refined two-stage detection algorithms, improving both accuracy and speed. The evolution of two-stage detection has produced models such as Feature Pyramid Networks (FPNs) [16], Cascade R-CNN [17], Mask R-CNN [18], and Libra R-CNN [19], among others [20]. A two-stage algorithm first generates region proposals and then classifies them and refines their bounding boxes in a second-stage network. While more accurate than one-stage algorithms, it suffers from much slower processing speeds.
Two-stage algorithms still face speed bottlenecks, leaving a gap to real-time object detection. To address this, the You Only Look Once (YOLO) [21] algorithm was proposed. As the pioneering single-stage detector, it no longer generates region proposals and processes them in two steps, but directly outputs bounding boxes and class predictions, achieving a nearly 10-fold speedup over earlier two-stage algorithms. Liu et al. proposed the Single Shot MultiBox Detector (SSD) [22], which introduced multi-scale, multi-resolution detection. Subsequently, YOLOv2 [23] and YOLOv3 [24] addressed the poor accuracy of single-stage algorithms by incorporating ideas such as multi-box detection, feature fusion, and multi-scale outputs; while maintaining fast processing speeds, these enhancements significantly increased accuracy. With RetinaNet [25], single-stage networks surpassed the accuracy of the best two-stage detectors of the time. CornerNet [26] and CenterNet [27] further introduced the concepts of corner points and center points into deep learning based detection. YOLOv4 [28] integrated numerous contemporary ideas such as Complete IoU (CIoU) [29], PANet [30], and Mixup data augmentation [31] to achieve both fast processing and higher accuracy. YOLOX [32] introduced a decoupled head, achieving better results on top of existing algorithms. This paper selects the YOLOv5 [33] framework as the experimental baseline.
While existing networks achieve good results on optical images, SAR ship detection still exhibits notable false alarms and missed detections, particularly near shorelines with strong interference and in scenes involving multi-scale and small targets, as depicted in Figure 1. Therefore, algorithmic improvements tailored to SAR images are urgently needed.
With the introduction of the SAR Ship Detection Dataset (SSDD) [34] and the emergence of more SAR target detection datasets [35,36], a plethora of papers on SAR domain object detection have been proposed [37]. The earliest works typically employed classical networks such as Faster R-CNN [34], SSD [38], and YOLOv2 [39], without improvements specifically tailored to SAR ship target problems, resulting in relatively mediocre performance.
Attention mechanisms, which weight important feature maps and spatial regions, are commonly employed to deeply mine multi-scale and small-object information, serving as an effective means of handling targets in complex nearshore scenes [40,41,42,43,44,45,46,47,48,49,50].
In earlier endeavors, integrating the Squeeze-and-Excitation (SE) attention mechanism with Faster R-CNN demonstrated excellent detection results on the early version of the SSDD dataset [40]. Zhao et al. [41] utilized the Convolutional Block Attention Module (CBAM) and Receptive Fields Block (RFB) on top of YOLOv5 to address detection and recognition challenges. Wang et al. [42] introduced the sim attention mechanism and C3 channel shuffling to tackle multi-scale ship detection in complex scenarios. Li et al. [43] presented coordinate attention to enhance small-object detection. Tang et al. [44] devised a Multiscale Receptive Field Convolution Block with Attention Mechanism (AMMRF) to leverage positional information in feature maps, accurately capturing regions crucial for detection as well as relationships between feature map channels to better model the ship–background dynamics. A study [45] proposed the United Attention Module (UAM) and Global Context-guided Feature Balanced Pyramid (GC-FBP) to enhance ship detection performance. Wu et al. [46] introduced a method based on the coordinate attention (CA) mechanism and Asymptotic Feature Fusion (AFF) to alleviate the loss of small-object position information and enhance multi-scale detection. Hu et al. [47] put forward a Balance Attention Network (BANet), employing a Local Attention Module (LAM) and a Non-Local Attention Module (NLAM) to capture local ship information, strengthen network robustness, and balance local and non-local features. Ren et al. [48] incorporated a Channel and Position Enhancement Attention (CPEA) module to improve target localization by exploiting positional information. DSF-Net [49] incorporated a Pixel-wise Shuffle Attention (PWSA) module to boost feature extraction and employed Non-Local Shuffle Attention (NLSA) to strengthen long-range feature dependencies, thereby promoting information exchange. Cui et al. [50] added a Spatial Shuffle-Group Enhance (SSE) attention module to the CenterNet network to enhance its performance. Cai et al. [51] introduced FS-YOLO, which incorporates a Feature Enhancement Module (FEM) and a Spatial Channel Pooling Module (ESPPCSPC) on top of the original YOLO backbone, improving network performance. Wang et al. [52] combined a Global Context-Aware Subimage Selection (GCSS) module with a Local Context-Aware False Alarm Suppression (LCFS) module to improve the network's adaptability to duplicated scenes. Cheng et al. [53] improved the YOLOX backbone with the proposed S2D network, which better integrates information from the neck and enhances small-object detection. Additionally, Zhang et al. [54] characterized the modulation effects of target motion on polarization and Doppler, while Gao et al. [55] employed a dualistic cascade convolutional method to enhance ship detection performance.
Many papers have also focused on improving the loss function to enhance detection performance. Zhang et al. [56] introduced a center loss to ensure an equitable allocation of loss contributions among different factors and to reduce the sensitivity of detection to changes in ground-truth box shapes. YOLO-Lite [48] utilized a confidence loss function to improve ship detection accuracy. DSF-Net [49] employed an R-tradeoff loss to improve small-object detection, accelerate training, and reduce false positives. Zhou et al. [57] developed a loss function based on a dual Euclidean distance over the corner coordinates of the predicted and ground-truth boxes, which accurately describes various overlapping scenarios. Zhang et al. [58] used a global average precision loss (GAP loss) to enable the model to quickly differentiate positive from negative samples. The work in [59] utilized a KLD loss to improve accuracy, and Chen et al. [60] used the SIoU loss to aid network training.
These loss functions enhance the detection capability for small objects to some degree, accelerate training convergence, and elevate accuracy. However, they do not consider the impediment caused by inferior instances to the learning ability of the object detection model, resulting in limited performance improvement.
Many articles have also explored decoupled heads [43,47,61], separating the semantic-information head from the bounding-box head to prevent interference between different features and achieve better results. However, these simple decoupled heads provide only limited gains because they do not account for the differences between semantic and bounding-box information.
Therefore, in this paper, based on the YOLOv5 backbone, we propose the SAR Ship Context Decoupled Head (SSCDH), which is designed around the distinct characteristics of localization and semantic information. We use shuffle attention to strengthen the network's understanding of complex backgrounds. Additionally, we introduce a Wise IoU loss grounded in a dynamic non-monotonic focusing framework that exploits the degree of anomaly of anchor boxes, with the goal of improving ship detection accuracy. The primary advancements of this paper include the following:
- 1.
To enhance the effectiveness of the original decoupled head design, we devise dedicated decoupled heads that align with the distinct characteristics of localization and semantic information.
- 2.
To improve the model’s capability in detecting objects of varying scales, we incorporate a shuffle attention module into the larger feature layers of the original model’s neck.
- 3.
To boost the accuracy of object detection, we utilize the Wise IoU loss function, which leverages attention-based bounding box regression loss and a dynamic non-monotonic focus mechanism.
- 4.
To demonstrate the effectiveness of the proposed technique, we conduct extensive experiments using the HRSID dataset and the SAR-Ship-Dataset.
The first part of this paper is the introduction, presenting the background, related work pertinent to this study, and the identified issues. The second part describes the methods, including the network structure and the design of each module. The third part presents the experimental details and results. The fourth part discusses the effectiveness of the chosen head and attention mechanism. Finally, the fifth part concludes the paper.
2. Methods
This section introduces the method of the proposed SSCDH. The first part provides an overview of the architecture of the proposed model. The second part discusses the shuffle attention module utilized in our model, along with its principles of spatial and channel attention mechanisms. The third part introduces the decoupled heads based on contextual information from ships. Lastly, the fourth part describes the Wise IoU loss function employed.
2.1. Network Architecture
The network is based on the YOLOv5 architecture [33]. The overall structure of the proposed method is shown in Figure 2. The input RGB image size is $3 \times H \times W$, where $H$ represents the height of the image and $W$ represents the width of the image. The input image passes through one large-kernel convolutional module and two convolutional and residual convolutional modules, resulting in a feature map of size $256 \times \frac{H}{8} \times \frac{W}{8}$ after three downsampling operations. Subsequently, another convolutional and residual module produces a feature map of size $512 \times \frac{H}{16} \times \frac{W}{16}$, followed by another similar module yielding a feature map of size $1024 \times \frac{H}{32} \times \frac{W}{32}$. The deepest feature map is then forwarded to the SPP bottleneck module and subsequently to the neck module, still retaining the dimensions $1024 \times \frac{H}{32} \times \frac{W}{32}$.
The feature map of size $1024 \times \frac{H}{32} \times \frac{W}{32}$ is processed through a 512-channel 1 × 1 convolutional layer, resulting in a feature map of size $512 \times \frac{H}{32} \times \frac{W}{32}$. This branch is then upsampled twice and fused with backbone feature maps. The feature map obtained after the first upsampling, of size $512 \times \frac{H}{16} \times \frac{W}{16}$, is concatenated with the corresponding feature map from the backbone, resulting in a feature map of size $1024 \times \frac{H}{16} \times \frac{W}{16}$. This is followed by another convolutional block to obtain a feature map measuring $512 \times \frac{H}{16} \times \frac{W}{16}$, to which a 256-channel 1 × 1 convolution is applied; the result is then upsampled to obtain the feature map of size $256 \times \frac{H}{8} \times \frac{W}{8}$.
Additionally, the feature map of size $256 \times \frac{H}{8} \times \frac{W}{8}$ undergoes downsampling using a 256-channel convolution with a kernel size of 3, padding of 1, and a stride of 2, resulting in a feature map of size $256 \times \frac{H}{16} \times \frac{W}{16}$. This is concatenated with the output of the second (256-channel) 1 × 1 convolution, resulting in a feature map of size $512 \times \frac{H}{16} \times \frac{W}{16}$, which is then passed through a convolutional residual block to obtain the feature map measuring $512 \times \frac{H}{16} \times \frac{W}{16}$.
Similarly, the feature map of size $512 \times \frac{H}{16} \times \frac{W}{16}$ undergoes downsampling using a 512-channel convolution with a kernel size of 3, padding of 1, and a stride of 2, resulting in a feature map of size $512 \times \frac{H}{32} \times \frac{W}{32}$. This is concatenated with the output of the earlier 512-channel 1 × 1 convolution, resulting in a feature map measuring $1024 \times \frac{H}{32} \times \frac{W}{32}$. Another convolutional residual block is applied to obtain the feature map measuring $1024 \times \frac{H}{32} \times \frac{W}{32}$. A shuffle attention module is then applied to this feature map to enhance feature extraction.
Subsequently, the model applies another convolution with 1024 channels, a stride of 2, a kernel size of 3, and padding of 1. The generated feature map is then passed to the next C3 module, yielding a feature map of size $1024 \times \frac{H}{64} \times \frac{W}{64}$ that serves as the extra top pyramid level.
Finally, the SAR Ship Context Decoupled Head is utilized to fuse features from multiple hierarchical levels. The feature map of size $128 \times \frac{H}{4} \times \frac{W}{4}$ obtained after the second downsampling in the backbone is used as the $P_2$ feature map. Consequently, the $P_3$ head is derived by incorporating information from the $P_2$, $P_3$, and $P_4$ feature maps. Similarly, the $P_4$ head incorporates information from the $P_3$, $P_4$, and $P_5$ feature maps, and the $P_5$ head incorporates information from the $P_4$, $P_5$, and $P_6$ feature maps. This process ultimately yields the final bounding box positions and confidence scores for target classification.
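To make the data flow concrete, the following shape trace summarizes the feature maps described above, assuming a YOLOv5l-scale width multiplier; the layer names are illustrative rather than taken from the authors' code.

```python
# Minimal shape trace of the described backbone/neck (assumed YOLOv5l widths).
shapes = {
    "input":       (3,    "H",    "W"),
    "backbone_P2": (128,  "H/4",  "W/4"),   # after 2nd downsampling, used as P2
    "backbone_P3": (256,  "H/8",  "W/8"),
    "backbone_P4": (512,  "H/16", "W/16"),
    "backbone_P5": (1024, "H/32", "W/32"),  # fed to the SPP bottleneck
    "neck_P3":     (256,  "H/8",  "W/8"),   # after the top-down path
    "neck_P4":     (512,  "H/16", "W/16"),  # after the bottom-up path
    "neck_P5":     (1024, "H/32", "W/32"),  # shuffle attention applied here
    "neck_P6":     (1024, "H/64", "W/64"),  # extra stride-2 conv + C3 module
}
```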
2.2. Shuffle Attention Module
The application of the SE [
62] mechanism considers the crucial role of channel attention in target recognition and detection, which has found widespread application in object detection. CBAM [
63] combines both channel attention and spatial attention mechanisms, resulting in a notable enhancement in the accuracy of computation. The shuffle attention (SA) module [
64] also integrates channel attention and spatial attention mechanisms while incorporating the concept of group convolutional kernel channel rearrangement. This achieves superior results compared to other attention mechanisms. In this proposed method, we chose to integrate the shuffle attention component after 32× downsampling layers, aiming to elevate the understanding of the semantic and channel information for the final layer, thereby achieving more accurate detection capabilities for complex scenes, small targets, and multi-scale objects.
Figure 3 illustrates the shuffle attention process framework.
First, shuffle attention employs “channel partitioning” to concurrently process sub-features for each group. Next, in the channel attention pathway, global average pooling is utilized to compute statistics at the channel level. This is followed by the application of a pair of parameters to adjust the scaling and shifting of the channel vectors. For the spatial attention pathway, group normalization (GN) is utilized to derive statistics at the spatial level, resulting in a condensed feature representation similar to that of the channel pathway. Then, these two pathways are combined. Following this, all the derived sub-features are consolidated and, ultimately, the channel shuffle technique is applied to enhance the data exchange between the various sub-features.
Shuffle attention first groups the features by partitioning a feature map of size $C \times H \times W$ into $G$ groups. Here, $C$ indicates the total number of channels, while $H$ signifies the vertical dimension of the feature map and $W$ corresponds to its horizontal dimension. Specifically, shuffle attention divides the feature map $X \in \mathbb{R}^{C \times H \times W}$ into $G$ clusters, denoted as $X = [X_1, \dots, X_G]$, where each $X_k$ is of size $\frac{C}{G} \times H \times W$. Consequently, during training, every individual component map progressively captures different interpretive insights.
Subsequently, an attention module is used to generate the corresponding significance weights for each component map. In detail, each attention unit splits the input feature map $X_k$ into two separate pathways, denoted as $X_{k1}$ and $X_{k2}$, each of size $\frac{C}{2G} \times H \times W$. One branch, $X_{k1}$, is used to create channel attention maps using connections between channels to improve channel effectiveness. Meanwhile, the other branch, $X_{k2}$, produces spatial attention maps using connections between spatial positions to identify more useful spatial characteristics.
First, we extract channel-level statistical information from the input $X_{k1}$ by utilizing global average pooling (GAP), embedding global information into $s$ of size $\frac{C}{2G} \times 1 \times 1$. $s$ can be obtained by performing spatial average pooling over the $H \times W$ dimensions, defined as

$$s = \mathcal{F}_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j).$$

Next, employing a basic gating function combined with a Sigmoid activation, we construct a compact feature to precisely and adaptively select. The ultimate result of channel attention $X'_{k1}$ can be derived as follows:

$$X'_{k1} = \sigma(\mathcal{F}_c(s)) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1},$$

where $W_1$ and $b_1$ are parameters of size $\frac{C}{2G} \times 1 \times 1$ used for the fully connected layer and bias term, $\mathcal{F}_c$ represents the fully connected operation, and $\sigma$ represents the Sigmoid activation function.
Simultaneously, we obtain spatial-level statistical information by enhancing the representation of $X_{k2}$ with a group normalization (GN) operation. The ultimate result of spatial attention can be derived as follows:

$$X'_{k2} = \sigma(W_2 \cdot GN(X_{k2}) + b_2) \cdot X_{k2},$$

where $W_2$ and $b_2$ are parameters of size $\frac{C}{2G} \times 1 \times 1$.
Finally, we merge the channel and spatial attention pathways to obtain an output of the same size as the input, $X'_k = [X'_{k1}, X'_{k2}]$, with dimensions $\frac{C}{G} \times H \times W$. Subsequently, all components are aggregated. Lastly, we employ a "channel shuffle" operation that enhances the flow of information between groups across channel dimensions. The final output of the SA module matches the size of the input $X$.
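For concreteness, the following is a minimal PyTorch sketch of the SA module following the public SA-Net [64] formulation; the default group count and per-channel gate parameters are assumptions from that formulation, not necessarily the exact configuration used here.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Minimal sketch of shuffle attention (SA), after SA-Net."""

    def __init__(self, channels: int, groups: int = 64):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)  # channels per half-branch
        # Scale/shift parameters for the channel-attention gate (W1, b1).
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))
        # Scale/shift parameters for the spatial-attention gate (W2, b2).
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
        b, c, h, w = x.shape
        return (x.view(b, groups, c // groups, h, w)
                 .transpose(1, 2).reshape(b, c, h, w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x1, x2 = x.chunk(2, dim=1)           # X_k1 (channel), X_k2 (spatial)
        # Channel attention: GAP -> scale/shift -> sigmoid gate.
        s = x1.mean(dim=(2, 3), keepdim=True)
        x1 = x1 * self.sigmoid(self.cw * s + self.cb)
        # Spatial attention: GN -> scale/shift -> sigmoid gate.
        x2 = x2 * self.sigmoid(self.sw * self.gn(x2) + self.sb)
        out = torch.cat([x1, x2], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, 2)  # mix information across groups
```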
2.3. SAR Ship Context Decoupled Head
Classification and localization exhibit a strong inconsistency in their preferred feature context. Specifically, localization emphasizes boundary features for accurate bounding box regression, whereas classification leans towards semantic context. Existing methods such as YOLOX use decoupled heads to handle the different feature contexts of the two tasks. However, since both heads operate on the same input features, an imbalance between classification and localization remains.
Based on the structure and principles of Task-Specific Context Decoupling (TSCODE) [65], we encode the features for classification and localization separately, a strategy known as context decoupling, so that each task receives a more suitable semantic context. The classification branch typically requires rich semantic context to infer object categories, so we feed it feature encodings that are spatially coarse but semantically strong. The localization branch requires precise boundary information, so we provide it with high-resolution feature maps that better delineate object edges.
Classification in object detection is relatively coarse: it focuses on identifying the object within a bounding box, so using downsampled feature maps for classification does not significantly impact performance but lowers computational cost. Moreover, object categories can often be inferred from their surroundings; for instance, ship targets are likely to appear on the sea surface or docked at port edges. Employing broad context derived from rich semantic information therefore improves classification performance.
Building on these findings, we developed Semantic Context Encoding (SCE) to enhance classification efficiency and accuracy. As illustrated in Figure 4, SCE uses two levels of feature maps, $P_l$ and $P_{l+1}$, at each pyramid level $l$ to produce a feature map with rich semantic information for classification. Initially, we downsample $P_l$ by a factor of two and then concatenate it with $P_{l+1}$ to yield the final classification feature map, $G_l^{cls}$:

$$G_l^{cls} = \mathrm{Concat}\left(\mathrm{DConv}(P_l),\; P_{l+1}\right),$$

where $\mathrm{Concat}(\cdot)$ signifies a concatenation operation and $\mathrm{DConv}(\cdot)$ refers to a shared convolutional layer used for downsampling. It is noteworthy that the resolution of $G_l^{cls}$ is half that of $P_l$.
Subsequently, $G_l^{cls}$ is passed to $\mathcal{L}_{cls}(\mathcal{H}_{cls}(G_l^{cls}))$ to predict classification scores, where $\mathcal{L}_{cls}$ represents the classification loss function and $\mathcal{H}_{cls}$ represents the further classification and objectness operations. We employ an $\mathcal{H}_{cls}$ consisting of two convolutional layers with 512 channels. Given that $G_l^{cls}$ is downsampled by a factor of 2 compared with $P_l$, at each position $(x, y)$ in $G_l^{cls}$ the predicted classification scores of its four nearest neighbors in $P_l$ are computed, denoted as $\hat{s}_{x,y} \in \mathbb{R}^{4 \times N}$, where $N$ is the number of classes and $H_l$ and $W_l$ represent the height and width of the feature map $P_l$. Subsequently, $\hat{s} \in \mathbb{R}^{\frac{H_l}{2} \times \frac{W_l}{2} \times 4N}$ is reshaped to $\mathbb{R}^{H_l \times W_l \times N}$ to recover the resolution of $P_l$. This approach not only leverages the sparse key features from $P_l$ but also incorporates the rich semantic information from the higher pyramid level $P_{l+1}$.
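A minimal PyTorch sketch of the SCE branch under the TSCODE formulation above might look as follows; the channel widths and the pixel-shuffle reshape are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCE(nn.Module):
    """Sketch of Semantic Context Encoding for the classification branch."""

    def __init__(self, c_l: int, c_hi: int, num_classes: int, hidden: int = 512):
        super().__init__()
        # DConv: shared 3x3 stride-2 conv that downsamples P_l by 2.
        self.dconv = nn.Conv2d(c_l, c_l, 3, stride=2, padding=1)
        # H_cls: two 512-channel convolutional layers.
        self.head = nn.Sequential(
            nn.Conv2d(c_l + c_hi, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
        )
        # Each coarse cell predicts scores for its 4 nearest neighbours in P_l.
        self.pred = nn.Conv2d(hidden, 4 * num_classes, 1)

    def forward(self, p_l: torch.Tensor, p_hi: torch.Tensor) -> torch.Tensor:
        g_cls = torch.cat([self.dconv(p_l), p_hi], dim=1)  # G_l^cls, half resolution
        s = self.pred(self.head(g_cls))                    # (B, 4N, H_l/2, W_l/2)
        return F.pixel_shuffle(s, 2)                       # reshape to (B, N, H_l, W_l)
```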
Localization is more demanding than classification, as keypoint prediction requires additional detail. Methods usually use a single-scale feature map $P_l$, although lower pyramid levels often respond more strongly to object contours, edges, and fine textures. Nevertheless, higher-level feature maps are also crucial for localization, as they facilitate observation of the entire object, providing the context needed to understand its complete shape.
Based on these findings, we adopt Detail-Preserving Encoding (DPE) for accurate localization. At each layer $l$ of the pyramid, our DPE integrates feature maps from three layers: $P_{l-1}$, $P_l$, and $P_{l+1}$. $P_{l-1}$ supplies detailed edge features, whereas $P_{l+1}$ gives a broader view of the object.
Figure 5 shows the DPE structure. The feature map $P_l$ is first upsampled by a factor of 2 and then aggregated with $P_{l-1}$. Subsequently, the result is downsampled back to the resolution of $P_l$ through a 3 × 3 convolutional layer with a stride of 2. Finally, $P_{l+1}$ is upsampled and added to produce the final localization feature map, $G_l^{reg}$. The computation process is as follows:

$$G_l^{reg} = \mathrm{DConv}\left(P_{l-1} + u(P_l)\right) + u(P_{l+1}),$$

where $u(\cdot)$ signifies upsampling and $\mathrm{DConv}(\cdot)$ indicates a shared convolutional layer for downsampling. Specifically, we compute $G_l^{reg}$ using $P_{l-1}$, $P_l$, and $P_{l+1}$. Bounding box predictions at the $l$-th pyramid level are then obtained through $\mathcal{L}_{reg}(\mathcal{H}_{reg}(G_l^{reg}))$, where $\mathcal{L}_{reg}$ represents the localization loss function and $\mathcal{H}_{reg}$ represents the further bounding box regression operation.
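Similarly, a minimal sketch of the DPE branch under the equation above could be written as follows; the 1 × 1 channel-alignment projections are assumptions added so that maps from different pyramid levels can be summed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPE(nn.Module):
    """Sketch of Detail-Preserving Encoding for the localization branch."""

    def __init__(self, c_lo: int, c_l: int, c_hi: int, out_ch: int = 256):
        super().__init__()
        self.proj_lo = nn.Conv2d(c_lo, out_ch, 1)  # align P_{l-1} channels
        self.proj_l = nn.Conv2d(c_l, out_ch, 1)    # align P_l channels
        self.proj_hi = nn.Conv2d(c_hi, out_ch, 1)  # align P_{l+1} channels
        # DConv: 3x3 stride-2 conv bringing the fused map back to P_l's stride.
        self.dconv = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, p_lo, p_l, p_hi):
        up_l = F.interpolate(self.proj_l(p_l), scale_factor=2)     # u(P_l)
        fused = self.dconv(self.proj_lo(p_lo) + up_l)              # DConv(P_{l-1} + u(P_l))
        up_hi = F.interpolate(self.proj_hi(p_hi), scale_factor=2)  # u(P_{l+1})
        return fused + up_hi                                       # G_l^reg
```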
2.4. Wise IoU Loss
In the field of object detection, Intersection over Union (IoU) evaluates the overlap between anchor boxes and target boxes. Compared with the $\ell_n$ norm as the bounding box loss function, IoU loss effectively mitigates interference from the absolute sizes of bounding boxes, which allows the model to balance learning between large and small objects when IoU loss is utilized for bounding box regression. IoU loss is defined as

$$\mathcal{L}_{IoU} = 1 - IoU = 1 - \frac{|B \cap B_{gt}|}{|B \cup B_{gt}|}.$$
However, when the IoU is zero (i.e., the predicted box $B$ and target box $B_{gt}$ do not overlap), the gradient of the IoU loss vanishes, so back-propagation cannot update the distance between the non-overlapping boxes.
To address this issue, existing research accounts for various geometric aspects of bounding boxes and incorporates a penalty term $\mathcal{R}_i$. The existing bounding box regression (BBR) losses follow the paradigm

$$\mathcal{L}_i = \mathcal{L}_{IoU} + \mathcal{R}_i.$$

The Generalized Intersection over Union (GIoU) loss function extends the standard IoU loss by incorporating such a penalty term. Unlike traditional IoU, which only assesses the overlap between boxes, GIoU also evaluates the surrounding non-overlapping region. However, when one box is fully enclosed within another, GIoU cannot differentiate their relative positional relationships.
To address the limitations of GIoU, Distance-IoU (DIoU) [29] adjusts the penalty term to maximize the overlap area by minimizing the normalized distance between the center points of the two bounding boxes. This modification also prevents the divergence issues that can occur during training with the IoU and GIoU losses. The DIoU penalty is defined via the relative distance between the centers of the two boxes:

$$\mathcal{R}_{DIoU} = \frac{\rho^2(b, b_{gt})}{c^2},$$

where $b$ and $b_{gt}$ are the centers of the predicted and ground-truth bounding boxes, respectively. The term $\rho(\cdot)$ represents the Euclidean distance between these centers, while $c$ refers to the diagonal length of the minimal bounding rectangle that encloses both boxes. This method effectively addresses the gradient-vanishing problem of $\mathcal{L}_{IoU}$ and incorporates a geometric aspect: by utilizing $\mathcal{R}_{DIoU}$, DIoU can make more intuitive selections among anchor boxes that have identical IoU values.
Furthermore, considering the aspect ratio in addition to DIoU leads to the proposed CIoU:

$$\mathcal{R}_{CIoU} = \frac{\rho^2(b, b_{gt})}{c^2} + \alpha v,$$

where $\alpha = \frac{v}{(1 - IoU) + v}$ is a trade-off coefficient and $v$ describes the consistency of aspect ratios:

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w}{h} - \arctan\frac{w_{gt}}{h_{gt}}\right)^2.$$

Here, $w$ and $w_{gt}$ denote the widths of the prediction box and the ground truth box, while $h$ and $h_{gt}$ represent their heights, respectively. The unavoidable presence of poor-quality instances in a dataset inflates such penalties, especially those driven by geometry, distance, and aspect ratio, thus diminishing the model's generalization performance. In order to reduce the effects of geometry when anchor boxes align closely with target boxes, while intervening less during training to elevate the model's ability to generalize, we construct WIoU v1 [66] as

$$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\,\mathcal{L}_{IoU}, \qquad \mathcal{R}_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right).$$
The IoU score $\mathcal{L}_{IoU} \in [0, 1]$ significantly diminishes the penalty $\mathcal{R}_{WIoU}$ for high-quality anchor boxes, emphasizing the distance between center points when anchor boxes closely match target boxes, while $\mathcal{R}_{WIoU} \in [1, e)$ is the term amplifying $\mathcal{L}_{IoU}$ for regular-quality anchor boxes.
Here, $W_g$ and $H_g$ denote the size of the minimum enclosing box, while the numerator measures the distance between the centers of the prediction box and the ground truth. To prevent $\mathcal{R}_{WIoU}$ from producing gradients that hinder optimization, $W_g$ and $H_g$ are detached from the computation graph, which is denoted by the superscript *. This effectively eliminates factors hindering convergence without introducing new metrics such as the aspect ratio.
Inspired by focal loss, which concentrates model attention on challenging samples to improve classification performance, we introduce a monotonic focusing coefficient $\mathcal{L}_{IoU}^{\gamma*}$ for $\mathcal{L}_{WIoUv1}$:

$$\mathcal{L}_{WIoUv2} = \mathcal{L}_{IoU}^{\gamma*}\,\mathcal{L}_{WIoUv1}, \qquad \gamma > 0.$$
The introduction of the focusing coefficient alters the gradient propagation of WIoU v2:

$$\frac{\partial \mathcal{L}_{WIoUv2}}{\partial \mathcal{L}_{IoU}} = \mathcal{L}_{IoU}^{\gamma*}\,\frac{\partial \mathcal{L}_{WIoUv1}}{\partial \mathcal{L}_{IoU}}.$$
It is noteworthy that the gradient gain $r = \mathcal{L}_{IoU}^{\gamma*} \in [0, 1]$. During model training, as $\mathcal{L}_{IoU}$ decreases, the gradient gain also diminishes, resulting in reduced efficiency in the final training phases. Thus, we introduce the average of $\mathcal{L}_{IoU}$ as a normalization factor:

$$\mathcal{L}_{WIoUv2} = \left(\frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}}\right)^{\gamma} \mathcal{L}_{WIoUv1},$$
where $\overline{\mathcal{L}_{IoU}}$ denotes the exponentially weighted moving average with momentum $m$. Dynamically adjusting the normalization factor keeps the gradient gain at a relatively high level overall, thereby dealing with the challenge of reduced convergence speed in later training phases.
The abnormality of an anchor box is characterized by the ratio of $\mathcal{L}_{IoU}^{*}$ to $\overline{\mathcal{L}_{IoU}}$:

$$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty).$$
Lower abnormality implies a higher-quality anchor box. We assign smaller gradient gain to such boxes, focusing the regression on anchor boxes of normal quality. Additionally, assigning reduced gradient gain to anchor boxes with higher abnormality effectively prevents large harmful gradients from low-quality samples. We construct a non-monotonic focusing coefficient $r$ and apply it to WIoU v1:

$$\mathcal{L}_{WIoUv3} = r\,\mathcal{L}_{WIoUv1}, \qquad r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}}.$$
Here, when $\beta = \delta$, $r = 1$. When the abnormality of an anchor box satisfies $\beta = C$, where $C$ represents a constant, the anchor box obtains the maximum gradient gain. Since $\overline{\mathcal{L}_{IoU}}$ is variable, the standard for categorizing anchor box quality is likewise dynamic, enabling WIoU v3 to adopt the most suitable gradient gain allocation at each moment.
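The following is a minimal PyTorch sketch of the WIoU v3 computation described above, assuming axis-aligned boxes in (x1, y1, x2, y2) format; α = 1.9 and δ = 3 follow the defaults reported in [66], while the running-mean update rule is an illustrative choice.

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, momentum=0.01, alpha=1.9, delta=3.0):
    """Sketch of the WIoU v3 bounding-box loss for (x1, y1, x2, y2) boxes.

    `iou_mean` is the running mean of L_IoU; the caller keeps it across
    steps using the second returned value.
    """
    # Intersection and union.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    l_iou = 1.0 - inter / union.clamp(min=1e-7)

    # R_WIoU: center distance normalized by the detached enclosing-box diagonal.
    cx_p = (pred[:, 0] + pred[:, 2]) / 2
    cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2
    cy_t = (target[:, 1] + target[:, 3]) / 2
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    diag2 = (enc_w ** 2 + enc_h ** 2).detach()  # the superscript-* detach
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / diag2.clamp(min=1e-7))

    # Non-monotonic focusing: beta = L*_IoU / mean(L_IoU), r = beta / (delta * alpha^(beta - delta)).
    beta = l_iou.detach() / iou_mean
    r = beta / (delta * alpha ** (beta - delta))
    loss = (r * r_wiou * l_iou).mean()

    # Exponentially weighted running mean of L_IoU for the next step.
    new_iou_mean = (1 - momentum) * iou_mean + momentum * l_iou.detach().mean()
    return loss, new_iou_mean
```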
4. Discussion
4.1. Attention Mechanism
The integration of shuffle attention serves as a critical enhancement in feature representation. Unlike traditional attention mechanisms that often prioritize spatial or channel-wise features in isolation, shuffle attention dynamically adjusts the attention weights across both dimensions simultaneously. This dual approach enables the model to effectively capture contextual relationships among objects and their surroundings, which is particularly beneficial in cluttered environments. By concentrating on relevant spatial features while maintaining a holistic view of the input data, the model’s ability to infer object categories and their contextual significance is markedly improved. Furthermore, the adaptability of shuffle attention to multi-scale objects allows for a more nuanced understanding of features, thereby enhancing the model’s overall performance across varying object sizes.
In this part, we conducted extensive experiments applying various attention mechanisms on the HRSID dataset and the SAR-Ship-Dataset, analyzing their effectiveness in object detection tasks.
Concerning the HRSID dataset, the comparative experiment results are shown in Table 6. Among the various attention mechanisms examined, shuffle attention demonstrated outstanding performance in enhancing recall, achieving precision and recall rates of 92.4% and 89.4%, along with an F1 Score of 90.9%. Furthermore, it attained high levels of 94.5% and 72.1% on the AP50 and AP50-95 evaluation metrics, respectively. These results indicate that, compared with many other attention mechanisms, shuffle attention effectively elevates the network's ability to identify ship targets in object detection tasks.
Apart from shuffle attention, the other attention mechanisms exhibited relatively weaker recall. For instance, SE, CBAM, and Efficient Channel Attention (ECA) achieved recall rates of 87.4%, 87.7%, and 87.6%, respectively, well below shuffle attention's 89.4%. Coordinate attention and sim attention achieved recall rates of 86.9% and 87.2%, respectively, also lower than shuffle attention. Beyond recall, the other performance metrics (F1 Score, AP50, and AP50-95) likewise failed to surpass shuffle attention, which achieved 90.9%, 94.5%, and 72.1% on these metrics, respectively. In comparison, SE, CBAM, and ECA scored 90.4%, 90.1%, and 90.3% (F1 Score); 94.1%, 94.2%, and 94.1% (AP50); and 71.5%, 71.2%, and 71.1% (AP50-95), respectively. Although their performance remains respectable, it cannot match the overall performance of shuffle attention. Thus, shuffle attention not only excels in recall but also achieves high levels on the other crucial metrics, further demonstrating its superiority in object detection tasks.
Furthermore, the results in Table 7 indicate that shuffle attention also performs optimally on the SAR-Ship-Dataset. It surpasses the other attention mechanisms in key performance indicators such as precision (92.5%), recall (90.5%), F1 Score (91.5%), AP50 (95.5%), and AP50-95 (58.3%). This underscores the significant advantage of shuffle attention in object detection tasks, particularly in improving recall and overall performance.
Consequently, we conclude that shuffle attention is the optimal choice among the attention mechanisms evaluated for ship detection on both the HRSID dataset and the SAR-Ship-Dataset.
4.2. Decoupled Head
In object detection, classification and localization are two main sub-tasks, but there is an inconsistency in their requirements for feature context. The localization task focuses more on boundary features to accurately regress bounding boxes, while the classification task tends to rely on a rich semantic context. Existing methods typically employ decoupled heads to address this issue, attempting to learn different feature contexts for each task. However, these decoupled heads still operate based on the same input features, resulting in an unsatisfactory balance between classification and localization. Specifically, bounding box regression requires more texture details and edge information to precisely locate the object’s boundaries, whereas the classification task necessitates a stronger semantic context to identify the object’s category.
This situation means that traditional decoupled head detectors cannot effectively meet the demands of these two tasks because they still share the same input feature maps, limiting their ability to select task-specific contexts. Although traditional decoupling designs achieve parameter decoupling by learning independent parameters, they still fail to fully resolve the issue, as the semantic context is largely determined by the shared input features. This leads to the phenomenon of feature redundancy in the classification task, while the localization task relies on more detailed texture and boundary information, making it difficult to achieve accurate corner predictions.
To demonstrate that decoupled heads designed around different contextual semantics for the classification and regression branches achieve better SAR ship detection results than simple decoupled heads, we conducted comparative experiments with the simple decoupled head and the Context Decoupled head.
Table 8 and Table 9 below present the performance metrics of the two heads, the simple decoupled head and the Context Decoupled head, on the HRSID and SAR-Ship-Datasets. The methods were evaluated based on precision (Pre), recall (Rec), AP50, AP50-95, and Giga Floating-point Operations (GFLOPs).
For the simple decoupled head on the HRSID dataset, precision is 91.6%, recall is 88.4%, AP50 is 94.2%, AP50-95 is 70.1%, and the computational complexity is 7.1 GFLOPs. On the SAR-Ship-Dataset, its precision is 91.3%, recall is 90.2%, AP50 is 94.8%, and AP50-95 is 57.1%, with the complexity remaining at 7.1 GFLOPs. In contrast, the Context Decoupled head demonstrates superior performance on both datasets: on HRSID, its precision is 92.4%, recall is 89.4%, AP50 is 94.5%, AP50-95 is 72.1%, and the complexity is 9.8 GFLOPs; on the SAR-Ship-Dataset, its precision is 92.5%, recall is 90.5%, AP50 is 95.5%, and AP50-95 is 58.3%, with the complexity still at 9.8 GFLOPs.
These results show that the Context Decoupled head approach outperforms the simple decoupled head method regarding precision, recall, and AP on both datasets, albeit with slightly higher computational complexity.
4.3. Wise IoU Loss
The Wise IoU loss introduces a sophisticated mechanism to mitigate the negative impact of low-quality samples during training. Traditional loss functions often penalize the model heavily for geometric discrepancies, which can disproportionately affect generalization, especially in datasets with noisy annotations. By employing a distance attention mechanism alongside a dynamic focus mechanism, our loss function alleviates the penalty on well-aligned anchor boxes while downplaying the influence of poorly aligned ones. This novel approach not only fosters better training dynamics but also enhances the model’s robustness against false positives and negatives. The result is a model that excels in precise localization, particularly in challenging scenarios where object overlap and occlusion are prevalent.
The comparison experiments of the loss functions on the HRSID and SAR-Ship-Dataset are shown in Table 10 and Table 11, respectively.
The loss function used in the original baseline method is the CIoU loss function, while the loss function used in this paper is the Wise IoU loss. We conducted comparative experiments on the HRSID and SAR-Ship-Dataset, demonstrating the superiority of the Wise IoU algorithm.
The results from the experiments clearly demonstrate that the use of Wise IoU leads to improvements in various aspects of object detection on the HRSID and SAR-Ship-Dataset.