Article

AFMSFFNet: An Anchor-Free-Based Feature Fusion Model for Ship Detection

by Yuxin Zhang, Chunlei Dong *, Lixin Guo, Xiao Meng, Yue Liu and Qihao Wei
School of Physics, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(18), 3465; https://doi.org/10.3390/rs16183465
Submission received: 29 July 2024 / Revised: 10 September 2024 / Accepted: 13 September 2024 / Published: 18 September 2024

Abstract

This paper aims to improve a small-scale object detection model to achieve detection accuracy matching or even surpassing that of complex models. Efforts are made in the module design phase to minimize the parameter count as much as possible, thereby providing the potential for rapid detection of maritime targets. This paper introduces an innovative Anchor-Free-based Multi-Scale Feature Fusion Network (AFMSFFNet), which alleviates the problems of missed detections and false positives, particularly in inshore or small-target scenarios. Leveraging YOLOX tiny as the foundational architecture, the proposed AFMSFFNet incorporates a novel Adaptive Bidirectional Fusion Pyramid Network (AB-FPN) for efficient multi-scale feature fusion, enhancing the saliency representation of targets and reducing interference from complex backgrounds. Simultaneously, the designed Multi-Scale Global Attention Detection Head (MGAHead) utilizes a larger receptive field to learn object features, generating high-quality reconstructed features for enhanced semantic information integration. Extensive experiments conducted on publicly available Synthetic Aperture Radar (SAR) image ship datasets demonstrate that AFMSFFNet outperforms the traditional baseline models in detection performance. The results indicate an improvement of 2.32% in detection accuracy compared to the YOLOX tiny model. Additionally, AFMSFFNet achieves 78.26 Frames Per Second (FPS) on SSDD, showcasing superior efficiency compared to well-established high-performance networks, such as Faster R-CNN and CenterNet, with efficiency improvements ranging from 4.7 to 6.7 times. This research provides a valuable solution for efficient ship detection in complex backgrounds, demonstrating the efficacy of AFMSFFNet through quantitative improvements in accuracy and efficiency compared to existing models.


1. Introduction

With the rapid development of SAR imaging technology, imaging resolution has been continuously improved, making SAR images widely used for maritime surveillance [1], fishery management [2], maritime safety [3], etc. SAR images play a crucial role in various maritime applications due to their ability to provide high-resolution imagery regardless of weather conditions or time of day. For instance, in maritime surveillance, SAR images enhance the efficiency and accuracy of sea traffic management by offering continuous monitoring. In fishery management, SAR technology aids in detecting illegal fishing activities, thereby ensuring the sustainable use of marine resources. Moreover, in maritime safety, SAR images facilitate the rapid detection and localization of distressed vessels and floating objects, providing critical information for timely rescue operations. Therefore, the investigation of ship detection technology based on SAR images has become an increasingly active topic in maritime remote sensing.
Traditional methods for SAR object detection mainly include approaches relying on auxiliary features [4], multi-polarization techniques based on polarization data, constant false alarm rate (CFAR) methods rooted in statistical features [5], etc. With the rise of computer vision technology, deep learning object detection methods have significantly advanced in feature extraction, notably boosting the performance of recognition and detection. The deep learning-based object detection algorithms can generally be broadly categorized into two main types: the two-stage algorithms typified by R-CNN [6] and its derivatives, and the one-stage algorithms, exemplified by the YOLO (You Only Look Once) [7,8,9,10,11] series models. Although two-stage algorithms generally exhibit higher accuracy, they incur additional computational overhead. In contrast, one-stage algorithms prioritize speed, albeit with a trade-off in detection accuracy, especially for small targets and occluded objects. In recent years, with notable advancements in traditional target detection, an increasing number of researchers have applied deep learning techniques to SAR image target detection, achieving substantial research outcomes in this field. For instance, Zhang et al. [12] proposed an enhanced YOLOv3 model named Darknet-19, which achieved a 2.3 times speedup over the original YOLOv3 by removing redundant feature maps and incorporating a cascade path to enhance feature extraction. Guo et al. [13] effectively addressed the problem of missing detections of multi-scale targets by incorporating the CBAM module and BiFPN into YOLOv5, resulting in a 1.9% increase in average precision. Anchor-free methods, such as FCOS-based ship target detection [14] and CenterNet-based SAR ship detection [15], have also been explored. Correspondingly, we summarize the principles and characteristics of traditional methods and deep learning-based methods in Table 1. Given the pronounced advantages of deep learning in SAR image target detection, this paper primarily focuses on the in-depth study of deep learning-based algorithms for SAR image target detection.
Although deep learning-based detection methods exhibit superior performance, certain issues remain. Overall, current deep learning-based SAR object detection methods face the following challenges:
(1)
The complex background conditions and strong clutter interference make it difficult to distinguish targets from the background, especially in inshore areas where ships are easily affected by noise interference from the land and other equipment, resulting in false alarm problems.
(2)
The detection performance for small-sized targets is poor. Because their features are indistinct, the network struggles to learn the features of small ships accurately, leading to missed detections.
(3)
In the pursuit of enhanced detection accuracy, many leading detection algorithms often adopt complex models with a substantial number of parameters. However, these models typically incur significant computational costs, making deployment on devices challenging and thereby constraining their widespread applicability.
Considering the problems mentioned above, this paper proposes an efficient anchor-free detector named AFMSFFNet, which is used for multi-scale ship detection in complex background SAR images. The main contributions of this paper are as follows:
(1)
The method proposed in this paper adopts the latest anchor-free architecture, YOLOX tiny, as its basic framework and then applies a series of optimization designs. Extensive tests on the SSDD [25] and the SAR ship dataset [26] demonstrate a noteworthy improvement in detection accuracy compared to the original YOLOX tiny network, achieving a 2.32% performance increase on SSDD.
(2)
This paper significantly enhances the FPN structure of YOLOX tiny by introducing the Local Cross-Channel Attention (LCA) module and the adaptive weighted fusion (AWF) mechanism. The LCA module strengthens feature extraction in target regions, effectively suppressing the interference of complex backgrounds in the detection results. Additionally, the introduction of the AWF module makes the fusion of multi-scale features more flexible and adaptive, maximizing the complementary information between different scales. These two improvements together form the final AB-FPN structure, significantly enhancing the detection capability across different scales, with a particular boost in the sensitivity of small target detection.
(3)
By introducing the MGA module into the detection head, multi-scale receptive field cascades are employed to capture multi-scale contextual information, effectively expanding the receptive field of the detection head. This enhancement significantly improves the network’s global perception capability in complex backgrounds, enabling better differentiation between ships and background clutter.
(4)
The proposed model achieves a high inference speed (FPS) while maintaining a low number of parameters. This indicates that our network not only offers significant improvements in detection performance but also remains lightweight, making it suitable for efficient deployment in practical applications. The collaborative effect of the three improved modules results in enhanced performance in small target detection. Furthermore, compared to similar methods, the proposed model demonstrates a distinct advantage in nearshore target detection, making it better equipped to handle challenges in complex scenarios.
The following section outlines the structure of the remainder of this paper. The research on related works is presented in Section 2. Section 3 details the implementation of the proposed detection network. Section 4 specifically discusses the experimental setup, results, and corresponding analysis. Section 5 summarizes the performance of our proposed method and provides future research prospects.

2. Related Work

Object detection has long been a central challenge in the field of computer vision. With the rapid development of deep learning technology, significant breakthroughs have been achieved in object detection algorithms based on deep learning. Given the success of deep learning in traditional object detection tasks, an increasing number of researchers are actively refining and applying it to SAR image target detection. Consequently, this section initiates an in-depth exploration of the current status of two traditional object detection frameworks, ultimately focusing on deep learning-based SAR image target detection methods. The review of these key studies provides strong background support for our research.

2.1. Two-Stage Methods Based on Region Proposals

The two-stage method starts by extracting features and generating regions of interest, followed by additional detection and recognition for each region. While these algorithms typically achieve higher accuracy, they also incur increased computational overhead. R-CNN [6] pioneered deep learning in object detection, departing from manual feature extraction toward data-driven approaches and marking a significant advancement in object detection algorithms. Fast R-CNN [27] streamlined the pipeline by taking whole images directly as input, enhancing both speed and accuracy. The evolution continued with Faster R-CNN [28], which introduced the innovative Region Proposal Network (RPN) to eliminate redundant computations by sharing the convolutional features of the input image. Addressing the challenge of small target detection, feature pyramid networks (FPNs) [29] emerged to reconcile the resolution and semantic issues of convolutional neural networks. The FPN integrated multi-scale feature information, creating a feature pyramid that found widespread application in subsequent detection networks. Subsequently, Cascade R-CNN [30] innovatively used cascade regression to address prediction mismatches during overfitting, gradually increasing the Intersection over Union (IoU) threshold. Through continuous development, current two-stage algorithms excel in detection accuracy, particularly for small targets and complex scenes. Nevertheless, such a design often entails high computational costs, prolonged training and inference times, and the introduction of additional hyperparameters.

2.2. One-Stage Method Based on Target Regression

One-stage target detection completes the entire detection task in a single regression process, which is computationally fast but sacrifices some accuracy. As the pioneering one-stage object detection algorithm, YOLOv1 significantly improved detection speed but had lower accuracy than two-stage algorithms, especially for small objects. Addressing these issues, Liu et al. [31] proposed SSD, which detects objects on multi-scale feature maps. Later iterations of the YOLO series further enhanced performance through advancements in data augmentation, the elimination of anchor box structures, and the integration of multi-scale feature fusion. RetinaNet [32] introduced focal loss, a refined cross-entropy loss that addresses the accuracy gap caused by class imbalance. While maintaining high detection speed, it achieved detection accuracy comparable to that of the two-stage detection algorithms of the time.
The aforementioned anchor-based detection algorithms often involve designing various anchor sizes for specific tasks, requiring significant manual input and potentially impacting the generalization of the detector. Detection accuracy is sensitive to these hyperparameter settings, and inappropriate anchor sizes can lead to missed detections. DenseBox [33], inspired by pixel-level semantic segmentation, pioneered an early exploration of the anchor-free concept by classifying and regressing each pixel position on the output feature map. In 2018, CornerNet [34] represented the target region with a pair of keypoints, predicting the positions of the top-left and bottom-right bounding box corners and grouping them into detections. In addition to keypoint-based anchor-free algorithms, there is another approach using anchor-point detection, which applies ideas from semantic segmentation to object detection. Models such as CenterNet [35], FCOS [36], and YOLOX [11] classify and regress at each pixel, broadening the anchor-free methodology.

2.3. Deep Learning-Based SAR Image Object Detection Algorithm

Expanding on the foundation laid by the aforementioned object detection models, numerous scholars have delved into the realm of object detection based on SAR imagery. Zhou et al. [21] proposed a lightweight object detection model, Lira-YOLO, based on PeleeNet. With only 4.3 MB of model parameters, Lira-YOLO achieved an average precision on SSDD comparable to YOLOv3_tiny. Cui et al.’s [22] research focuses on multi-scale ship detection, integrating CBAM into a densely connected pyramid network. This integration selectively emphasizes feature maps at different scales, preserving spatial and semantic information in multi-scale fused feature maps. Zhao et al. [23] combined a Receptive Field Block (RFB) and the Convolutional Block Attention Module (CBAM), introducing a novel lateral connection called an Attention Receptive Block (ARB) into feature pyramid networks (FPNs) to enhance the construction of fine-grained feature pyramids, thereby improving network performance. In 2021, Cui et al. [24] proposed a large-scale SAR image ship detection method based on CenterNet. This method significantly enhances nearshore vessel detection performance through the channel shuffle and spatial enhancement operations of the spatial shuffle-group enhance (SSE) attention module. Despite the significant achievements of the aforementioned studies, research still grapples with the delicate balance between accuracy and model complexity, which is also the central concern addressed in this paper. Therefore, this paper focuses on further optimizing the deep learning-based SAR image target detection algorithm, striving for superior performance while conscientiously controlling the number of model parameters.
In recent years, with further advancements in the YOLO series (such as YOLOv7 and YOLOv8), Transformer-based algorithms, and other approaches showing outstanding performance in the field of computer vision, many researchers have begun exploring the application of these advanced algorithms in the domain of SAR image ship detection [37,38,39,40,41,42]. The YOLO series algorithms, known for their fast and accurate object detection capabilities, are particularly suitable for applications with high real-time requirements. On the other hand, Transformer algorithms, especially Vision Transformer (ViT) [43], Swin Transformer [44], and DETR [45], effectively capture global information through self-attention mechanisms, demonstrating significant potential in image detection tasks. Guo et al. [38] adopted a decoupling strategy to reconstruct the backbone network of YOLOv8, designing a lightweight SAR ship detection method. Chen et al. [42] designed a high-order spatial-channel controllable convolution module, CSn, based on cascaded recursive convolution, overcoming the second-order limitation of Transformer features, and proposed a new CSn-[feature pyramid network (FPN) + path aggregation network (PAN)] network (CSnNet). Liu et al. [46] introduced active learning into SAR ship detection tasks to reduce the number of samples required for training ship detection models, proposing an active learning method based on a hybrid density network to estimate a probability distribution for each localization and classification head output. Wu et al. [47] utilized the OTSU method to achieve rough sea–land segmentation, constructing background resolution (BR) through feature fusion to distinguish between false and true targets, thereby realizing a multi-feature fusion-based SAR image nearshore ship detection method.
Significant research progress and achievements have also been made in the detection of small ships in SAR images [38,48,49,50,51,52,53]. Sun et al. [48] proposed a small object detection network based on the FFCLC network, improving feature fusion and detection capability for small objects by introducing an attention feature fusion and multi-scale object detection (multi-detect) submodule. Zhou et al. [49] addressed the sidelobe and contour blur issues of small-sized ships in SAR images by using dual pooling and a loss function based on the dual Euclidean distance between the corner coordinates of predicted and actual boxes. Hu et al. [50], building on Transformer, introduced a small-object-friendly detection head with SwTR (Swin Transformer) and a loss function integrating normalized Gaussian Wasserstein distance (NWD) and IoU fusion, significantly enhancing the detection accuracy of small ships in SAR images. Ge et al. [51] improved detection performance by 3.82% by incorporating the CA attention mechanism into YOLOv7 and replacing PANet with BiFPN. To address the issue of small ship detection in SAR images, Yao et al. [53] designed a shallow feature reconstruction module (SFR) to extract semantic information of small ships and utilized FEP to extract multi-level features with strong semantic information, reconstructing shallow feature maps through feature alignment and spatial information enhancement.

3. Materials and Methods

This section provides a detailed description of the proposed AFMSFFNet method; information on each module is provided in the following subsections.

3.1. The Overall Architecture of YOLOX

To reduce the model’s complexity, the latest anchor-free architecture, YOLOX tiny, is introduced as the basic framework. This choice eliminates the need for numerous hyperparameter settings and anchor-related architecture designs. Figure 1 shows the overall structure diagram of YOLOX. A typical object detector consists of three parts: the backbone, neck, and head. YOLOX utilizes CSPDarknet as the backbone for feature extraction. The three effective feature maps, C3, C4, and C5, obtained from the backbone will be sent to the neck for enhanced feature extraction. In YOLOX, PAFPN is used as a multi-level feature fusion method in the neck to improve the problems of shallow feature loss and limited fusion effect in FPN. The head generates location boxes and classification results. YOLOX uses a decoupled head, which is different from the previous generations’ coupled head, as it separates the classification and regression tasks for better results.
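For readers unfamiliar with the decoupled head, the following minimal Keras-style sketch illustrates the idea of separating classification and regression into parallel branches on one feature level. Channel widths, activations, and layer arrangement are illustrative assumptions, not the exact YOLOX implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoupled_head(feat, num_classes, width=128):
    """Sketch of a YOLOX-style decoupled head for one feature level:
    classification and regression are predicted by separate branches
    instead of one coupled convolution. Widths/activations are illustrative.
    """
    stem = layers.Conv2D(width, 1, padding="same", activation="relu")(feat)

    # Classification branch: per-location class scores.
    cls = layers.Conv2D(width, 3, padding="same", activation="relu")(stem)
    cls = layers.Conv2D(width, 3, padding="same", activation="relu")(cls)
    cls_out = layers.Conv2D(num_classes, 1, padding="same")(cls)

    # Regression branch: box offsets (4 values) plus objectness (1 value).
    reg = layers.Conv2D(width, 3, padding="same", activation="relu")(stem)
    reg = layers.Conv2D(width, 3, padding="same", activation="relu")(reg)
    box_out = layers.Conv2D(4, 1, padding="same")(reg)
    obj_out = layers.Conv2D(1, 1, padding="same")(reg)

    return cls_out, box_out, obj_out
```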

3.2. The Overall Architecture of the AFMSFFNet Model

This paper builds on YOLOX to improve the network’s detection performance for small ships and nearshore ships in SAR images. The overall architecture of the proposed AFMSFFNet is shown in Figure 2. In particular, the main feature network of AFMSFFNet retains the original structure of YOLOX tiny, still using CSPDarknet-53. In the neck part, the PAFPN used in YOLOX tiny is improved, and an AB-FPN structure is proposed. For the final detection head part, an MGAHead is designed.
The overall process is as follows:
(1)
A SAR image is sent to the main feature network for initial feature extraction, generating multiple feature maps from C1 to C5.
(2)
Then, the three output maps, C3, C4, and C5, are further enhanced for feature extraction. The feature map undergoes adaptive feature fusion in the AB-FPN to effectively extract multi-scale target information, ultimately generating three output maps, P3_out, P4_out, and P5_out.
(3)
Finally, the feature maps, P3_out, P4_out, and P5_out, are passed to the MGAHead for final ship position detection and category classification.
Compared with the original YOLOX, AFMSFFNet can better utilize information from the different feature layers. The designed attention module can accurately distinguish clutter interference from targets in SAR images, and the proposed MGA module increases the receptive field of the detection head, ultimately improving detection accuracy.
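The data flow described above can be summarized in the following schematic sketch. The `backbone`, `ab_fpn`, and `mga_head` callables are placeholders for the CSPDarknet-53 backbone, the AB-FPN neck, and the MGAHead, whose internals are detailed in the following subsections; this is an outline of the wiring, not the authors' code.

```python
def afmsffnet_forward(image, backbone, ab_fpn, mga_head):
    """Schematic data flow of AFMSFFNet; the three callables stand in for the
    CSPDarknet-53 backbone, the AB-FPN neck, and the MGAHead described below.
    """
    # (1) Backbone: multi-level feature extraction; C3, C4, C5 are kept.
    c3, c4, c5 = backbone(image)

    # (2) Neck: adaptive multi-scale fusion (LCA + AWF) inside the AB-FPN.
    p3_out, p4_out, p5_out = ab_fpn([c3, c4, c5])

    # (3) Head: the MGA module enlarges the receptive field before the
    #     decoupled classification / regression branches.
    return mga_head([p3_out, p4_out, p5_out])
```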

3.3. AB-FPN

In our network, an AB-FPN structure is designed to improve the feature fusion process used in the original PAFPN of YOLOX in two ways. (1) The FPN structure was enhanced by incorporating an LCA module, which aims to improve the network’s capability to extract target features and effectively suppress false alarms caused by background clutter. (2) Weighted fusion was employed to enable the final feature map to adaptively learn the significance of various scale feature maps and enhance the representation ability of the fused features. A detailed description of each improvement will be provided subsequently.
(1) LCA. Small ships in SAR images are often difficult to detect due to their size, making them easily overlooked or misclassified. The LCA module enhances the network’s ability to capture the positional information of small targets by extracting comprehensive data and texture features. The structure of the LCA module is shown in Figure 3. By leveraging global average pooling to reflect the overall information of the feature map and global max pooling to highlight prominent local features, LCA can capture both global and local spatial and texture features of the image. This allows for a more comprehensive description of the importance of different regions, enabling the model to focus more on the features of the target area, thereby improving target detection accuracy. Specifically, assume that the feature map U extracted from the input image X has height H, width W, and C channels.
Firstly, the channel statistics T1 and T2, each of size 1 × 1 × C, are obtained by global average pooling and global max pooling, respectively, which can be expressed as follows:
$$T_1^{c} = F_{sq1}(U^{c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} U^{c}(i,j),\qquad(1)$$
$$T_2^{c} = F_{sq2}(U^{c}) = \max_{1 \le i \le H,\ 1 \le j \le W} U^{c}(i,j),\qquad(2)$$
where $T_1^{c}$ and $T_2^{c}$ are the $c$-th elements of $T_1$ and $T_2$, respectively, and $U^{c}$ is the $c$-th channel of the feature map $U$.
To avoid the negative impact of the fully connected layer’s dimensionality reduction on the prediction of channel attention, to capture the necessary dependencies between channels, and to reduce redundant calculations, each channel c and its k adjacent channels are considered to capture local cross-channel interactions. Specifically, the weight coefficient of each channel is learned by a one-dimensional convolution with a kernel size of k, as shown in Equation (3). The attention coefficients are obtained by adding the two results produced by the shared convolution layer.
$$\omega^{c} = \sigma\left(\mathrm{C1D}_{k}(T_1^{c}) + \mathrm{C1D}_{k}(T_2^{c})\right),\qquad(3)$$
where σ represents the sigmoid activation function, C1D represents the one-dimensional convolution, and k represents the kernel size.
Finally, the weights of each channel (mapped to between 0 and 1) are multiplied by the original feature to generate the final feature map, as specified in Equation (4).
$$\tilde{X}^{c} = F_{\mathrm{scale}}(U^{c}, \omega^{c}) = U^{c} \times \omega^{c}.\qquad(4)$$
The dimension of the feature map is restored to H × W × C. The LCA attention block can further enhance the saliency information of ships in feature extraction and effectively suppress the interference of complex backgrounds.
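As a concrete illustration of Equations (1)–(4), the following TensorFlow sketch implements the LCA computation: parallel global average and max pooling, a shared one-dimensional convolution of kernel size k, a sigmoid, and channel-wise rescaling. It is a minimal reading of the equations, not the authors’ released code.

```python
import tensorflow as tf

class LCA(tf.keras.layers.Layer):
    """Local Cross-Channel Attention, a minimal reading of Eqs. (1)-(4)."""

    def __init__(self, k=3, **kwargs):
        super().__init__(**kwargs)
        # Shared 1-D convolution of kernel size k over the channel axis (Eq. 3).
        self.conv1d = tf.keras.layers.Conv1D(1, k, padding="same", use_bias=False)

    def call(self, u):                            # u: (B, H, W, C)
        t1 = tf.reduce_mean(u, axis=[1, 2])       # global average pooling, Eq. (1)
        t2 = tf.reduce_max(u, axis=[1, 2])        # global max pooling, Eq. (2)
        # Treat the C channel statistics as a length-C sequence so that the
        # k-sized kernel mixes each channel with its k neighbours.
        t1 = self.conv1d(tf.expand_dims(t1, -1))  # (B, C, 1)
        t2 = self.conv1d(tf.expand_dims(t2, -1))
        w = tf.sigmoid(t1 + t2)                   # channel weights in (0, 1), Eq. (3)
        w = tf.squeeze(w, axis=-1)[:, tf.newaxis, tf.newaxis, :]  # (B, 1, 1, C)
        return u * w                              # Eq. (4): rescale the feature map
```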
(2) Adaptive weighted fusion mechanism (AWF). To enhance the feature fusion capability of the PAFPN component in YOLOX, we have designed the AWF (adaptive weighted fusion) module. Unlike the traditional fusion methods in YOLOX, which utilize a fixed weighting scheme, the AWF module employs an adaptive weighting fusion mechanism. This mechanism learns the optimal weight coefficients for each input feature map, dynamically adjusting their contributions to achieve more accurate and efficient feature representation. This adaptability allows the system to better accommodate different targets and scenarios. The adaptive fusion mechanism effectively leverages the complementary information among feature maps, thereby enhancing target detection performance in complex backgrounds.
As shown in Figure 4, the AWF module efficiently exchanges information among these three feature maps, enabling all features to be fully utilized. Specifically, as represented in Equations (5)–(7), for the output feature map P3_out, the AWF module assigns different adaptive weighting coefficients $\omega_3^1$, $\omega_4^1$, and $\omega_5^1$ to the feature maps P3, P4, and P5, respectively. Through adaptive learning of the weights between different features, the final P3_out is fused from multiple feature maps with different contributions. Similarly, P4_out and P5_out are obtained. Through the above improvements, the AB-FPN can further improve the accuracy of feature representation and enhance the detection of small ships.
$$\mathrm{P3\_out} = \mathrm{P3\_in} \times \omega_3^1 + \mathrm{P4\_in} \times \omega_4^1 + \mathrm{P5\_in} \times \omega_5^1,\qquad(5)$$
$$\mathrm{P4\_out} = \mathrm{P3\_in} \times \omega_3^2 + \mathrm{P4\_in} \times \omega_4^2 + \mathrm{P5\_in} \times \omega_5^2,\qquad(6)$$
$$\mathrm{P5\_out} = \mathrm{P3\_in} \times \omega_3^3 + \mathrm{P4\_in} \times \omega_4^3 + \mathrm{P5\_in} \times \omega_5^3.\qquad(7)$$
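A minimal sketch of the adaptive weighted fusion in Equations (5)–(7) is given below. It assumes the three input maps have already been resized to a common resolution and channel count, and it normalizes the learned weights with a softmax; the paper does not specify the normalization, so that choice is an assumption.

```python
import tensorflow as tf

class AWF(tf.keras.layers.Layer):
    """Adaptive weighted fusion for one output level, cf. Eqs. (5)-(7)."""

    def build(self, input_shape):
        # One learnable scalar per input feature map (e.g. P3_in, P4_in, P5_in).
        self.w = self.add_weight(name="fusion_weights", shape=(3,),
                                 initializer="ones", trainable=True)

    def call(self, inputs):
        # inputs: [P3_in, P4_in, P5_in], assumed resized to the same H, W, C.
        w = tf.nn.softmax(self.w)   # normalized adaptive contributions (assumption)
        return w[0] * inputs[0] + w[1] * inputs[1] + w[2] * inputs[2]
```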

3.4. MGA

In order to achieve more accurate detection based on the efficient feature fusion brought by the AB-FPN, the detection head for object detection has been redesigned, and the MGA module is introduced before the head to increase the receptive field. The core idea of the MGA module is to use a multi-scale receptive field cascade to obtain multi-scale contextual information, which is mainly composed of a series of dilated convolutions with different dilation rates. Dilated convolutions can obtain a larger receptive field than ordinary convolutions with the same size convolution kernel, thus increasing the receptive field without sacrificing detailed contextual information. Nearshore detection is complicated by the presence of clutter and background interference. The introduction of the MGA module effectively enhances the global perception capability of targets within complex backgrounds. This enables the network to better distinguish between ships and background clutter, thereby improving detection accuracy in nearshore areas.
The MGA module draws inspiration from the modules in the InceptionV3 algorithm, and its overall structure is shown in Figure 5. The input data are divided into five branches, one of which is directly concatenated with the output features as a residual shortcut. Another branch passes through a global average pooling layer, and then a 1 × 1 convolution layer is used to change the number of channels. The remaining three branches use 1 × 1 convolution layers to reduce dimensionality at first and then apply convolutional operations with kernel sizes of 1 × 3 and 3 × 3 to two of them, respectively, to change the feature scales. For the branch with a 1 × 3 kernel size, a 3 × 1 convolution is applied after the initial convolution, replacing a 3 × 3 convolution operation. For the branch with a 3 × 3 kernel size, a second 3 × 3 convolution is applied after the first convolution to approximate a 5 × 5 convolution. Then, the three branches undergo dilated convolution operations with a dilation rate of 1, 3, and 5, with a 3 × 3 kernel size. Finally, all features from the five branches are concatenated to form the output features.
This strategy not only increases the receptive field while maintaining high spatial resolution but also utilizes multiple convolutional layers of different sizes to bring parallel channels, flexibly extracting multi-scale feature information, thus providing better classification and localization predictions for the final detection head.
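The five-branch structure described above can be sketched as follows. The intermediate channel count, the assignment of dilation rates to branches, and the way the globally pooled branch is broadcast back to the feature-map size are assumptions made for illustration; Figure 5 is the authoritative description.

```python
from tensorflow.keras import layers

def mga_block(x, mid=64):
    """Five-branch MGA sketch (cf. Figure 5); assumes static spatial dims."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]

    # Branch 1: identity shortcut concatenated directly with the outputs.
    b1 = x

    # Branch 2: global context (global average pooling + 1x1 conv),
    # broadcast back to the spatial size so it can be concatenated.
    b2 = layers.GlobalAveragePooling2D()(x)
    b2 = layers.Reshape((1, 1, c))(b2)
    b2 = layers.Conv2D(mid, 1, padding="same", activation="relu")(b2)
    b2 = layers.UpSampling2D(size=(h, w))(b2)

    # Branch 3: 1x1 reduction, then a dilated 3x3 conv (rate 1).
    b3 = layers.Conv2D(mid, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(mid, 3, padding="same", dilation_rate=1, activation="relu")(b3)

    # Branch 4: 1x1 reduction, factorized 1x3 + 3x1 convs, then dilated 3x3 (rate 3).
    b4 = layers.Conv2D(mid, 1, padding="same", activation="relu")(x)
    b4 = layers.Conv2D(mid, (1, 3), padding="same", activation="relu")(b4)
    b4 = layers.Conv2D(mid, (3, 1), padding="same", activation="relu")(b4)
    b4 = layers.Conv2D(mid, 3, padding="same", dilation_rate=3, activation="relu")(b4)

    # Branch 5: 1x1 reduction, two stacked 3x3 convs (approx. 5x5), dilated 3x3 (rate 5).
    b5 = layers.Conv2D(mid, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(mid, 3, padding="same", activation="relu")(b5)
    b5 = layers.Conv2D(mid, 3, padding="same", activation="relu")(b5)
    b5 = layers.Conv2D(mid, 3, padding="same", dilation_rate=5, activation="relu")(b5)

    return layers.Concatenate()([b1, b2, b3, b4, b5])
```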

4. Experiments and Analysis

In order to verify the performance of the proposed ship detection method, a series of experiments are conducted on the publicly available SAR ship datasets named the SSDD and the SAR ship dataset. Following the format of the PASCAL VOC detection dataset, each dataset is divided into three parts, with the ratio of the training set, validation set, and test set being 7:2:1. The model training process uses transfer learning, with a pretrained model as the initial weights to avoid training the network from scratch. The maximum learning rate is set to 1 × 10−2, with a minimum of 0.01 times that value, utilizing the cosine annealing learning rate approach for reduction. The Mosaic and Mixup data augmentation methods are used in the experiments, and the optimizer is SGD. All experiments are performed under the TensorFlow 2.2.0 deep learning framework on a machine equipped with an NVIDIA GeForce RTX 2080Ti GPU. In order to quantitatively analyze the experimental results, several common evaluation metrics are introduced, such as precision, recall, F1 score, and Mean Average Precision (mAP), to measure the detection accuracy of the model. Generally speaking, the higher the AP of a model, the better its detection performance. Model complexity and efficiency are measured using the number of model parameters and frames per second (FPS). The number of model parameters describes the complexity of a model and its corresponding memory resource consumption. FPS represents the overall detection speed of the algorithm, indicating the number of images the algorithm can process per second.
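The optimizer and learning-rate settings described above correspond to a configuration along the following lines, using TensorFlow’s built-in cosine decay schedule (exposed as tf.keras.experimental.CosineDecay in older releases). The number of decay steps and the momentum value are assumptions, since the paper does not state them.

```python
import tensorflow as tf

# Cosine-annealed learning rate from 1e-2 down to 0.01x that value, with SGD.
max_lr = 1e-2
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=max_lr,
    decay_steps=10_000,   # assumed total number of training iterations
    alpha=0.01)           # floor = alpha * initial learning rate = 0.01 * max_lr
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```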
(1) Performance Analysis of AFMSFFNet. The proposed AFMSFFNet and the original YOLOX tiny network were trained on SSDD. Figure 6 illustrates the loss curves on the validation set during the training process. As observed in Figure 6, our proposed method demonstrates superior generalization capabilities on unseen data, yielding better performance.
Table 2 presents a comparison of the complexity of these two models. The values in Table 2 reflect the effectiveness and applicability of our model. The increase in model complexity, when compared to the original YOLOX tiny, can be considered relatively minimal. The model’s parameter count is 8.46 M, emphasizing its compact size and showcasing the effectiveness of designed modules for enhanced detection without complex computations. The training time of 48.4 s demonstrates the model’s efficiency, indicating its ability to efficiently update parameters. Additionally, compared to traditional high-performance models, such as YOLOX-x or faster R-CNN, this model’s short inference time and high FPS make it suitable for real-time applications, especially those requiring rapid decision making. Overall, these performance metrics reflect that our designed model can achieve outstanding results while maintaining computational efficiency. This indicates the model’s potential for deployment in resource-limited situations.
(2) Comparative experiments. To evaluate the performance of the proposed method, the proposed AFMSFFNet and other mainstream algorithms are trained on the SSDD dataset. The detection results are shown in Table 3. As shown in Table 3, our method achieves an AP50 of 97.87% and the highest recall rate, outperforming the other methods. Compared to the baseline model, YOLOX tiny, our improved method brings a 2.32% increase in AP50, a 1.19% improvement in precision, and a 2.29% improvement in recall. Compared to other models in the YOLO series, our method achieves an approximately 2% to 4% improvement in AP50. Furthermore, compared to the common two-stage algorithm Faster R-CNN, our algorithm also exhibits a 1.28% advantage in AP50. The last two columns in Table 3 illustrate the memory consumption and efficiency of the proposed method compared to other mainstream models. The proposed method has a parameter count of only 8.46 M, which is significantly smaller than several high-performing mainstream methods, achieving real-time detection while ensuring high detection accuracy.
In order to further analyze the performance of the model, we compared the performance of the proposed model in nearshore and offshore scenarios with that of the previously mentioned models, and the results are shown in Table 4. It can be seen that our proposed model achieves good results in both scenarios, especially in the nearshore scenario, with performance improvements of 25.55% and 8.96% over DAPN and ARPN, respectively. At the same time, the proposed model has a faster detection speed than the ARPN model.
(3) Ablation experiment. To assess the effectiveness of each module in the proposed method, a series of ablation experiments were performed on the SSDD dataset. The results are listed in Table 5. The obtained results were compared with those of the baseline model, YOLOX tiny, to examine the influence of each module. LCA improves AP from 54.70% to 56.40% and AP50 from 95.55% to 97.17%, indicating its efficacy in capturing precise target information, especially in complex backgrounds. The multi-scale global aggregation of MGAHead may aid in improving the perception of multi-scale information, thereby increasing AP50 by 1.61%. AWF boosts AP50 (95.55% to 96.91%) but causes a marginal dip in AP and AP75. With the concurrent integration of the LCA and AWF modules, AP50 reaches 97.43%. There is a possibility of complementary effects between them in certain aspects, leveraging their respective strengths to achieve more pronounced performance improvements in AP50. Through the combination of MGA and LCA, AP50 is enhanced to 97.23%. We infer that the introduction of these two modules enables the model to concurrently capture multi-scale background information and local channel attention. In summary, LCA emphasizes target position, MGAHead highlights multi-scale global information, and AWF provides adaptive feature fusion. Their combination likely enables the model to comprehensively capture information about target position, scale, and context, particularly excelling in handling small-sized targets and multi-scale information. When all modules are introduced, there is a significant improvement in model performance across AP, AP50, and AP75.
(4) Performance Comparison of LCA and Other Attention Mechanisms. Different attention modules are incorporated into the FPN at the same position of the YOLOX tiny model to verify the effectiveness of LCA. The experiments were conducted on the SSDD dataset. The results in Table 6 show that almost all of these attention mechanisms improved the network performance. Furthermore, the proposed LCA module exhibited the best overall performance, achieving the highest results in AP, AP50, and APs, at 56.4%, 97.17%, and 49.5%, respectively. This indicates that the LCA module indeed enhances the network’s detection capability for small targets.
(5) Visualization Results. Figure 7 illustrates the detection performance of our method in small targets and nearshore target scenarios on SSDD. The ground truth and predicted bounding boxes are depicted in green and yellow, respectively. Red boxes represent false alarms, and blue boxes indicate missed detections. It can be observed that our method effectively performs the detection task in these scenarios. To further demonstrate the performance of the proposed network, Figure 8 presents heatmaps of selected results. It is evident that our method enhances the network’s focus on targets, enabling effective detection tasks.
(6) Generalization Experiment. To further validate the generalization of the proposed algorithm, experiments were conducted on the SAR ship dataset. Table 7 presents the training results of the proposed method compared to several mainstream algorithms. It can be observed that even on this larger dataset, the proposed method still exhibits favorable performance compared with other mainstream models. Compared to the original YOLOX tiny algorithm, there is a 0.72% increase in AP50. Notably, when compared with a large model such as YOLOv5-x, our algorithm does not improve the accuracy significantly (only 0.3% higher than YOLOv5-x), because larger datasets often yield better training results with larger models. However, the proposed method demonstrates good generalization by maintaining its detection performance on medium and large datasets while keeping the parameter count low. Furthermore, most algorithms show an improvement in AP50 compared to the training results on SSDD, indicating that increasing the number and diversity of samples can indeed enhance detection accuracy.
Finally, visualizations of the detection results for these methods are presented in Figure 9. The ground truth and predicted bounding boxes are depicted in green and yellow, respectively. Red boxes represent false alarms, and blue boxes indicate missed detections. It can be seen that the proposed method has achieved excellent detection performance in various complex scenarios, which alleviates the challenge of nearshore small target detection to a certain extent.

5. Conclusions

This article proposes a novel anchor-free object detection method for SAR images called AFMSFFNet, which effectively addresses the detection challenges of inshore and small-scale ships. The method leverages YOLOX tiny as its base architecture, achieving remarkable improvements in detection accuracy through the incorporation of two novel modules, the AB-FPN and MGAHead. A series of experiments conducted on the SSDD and the SAR ship dataset demonstrate the effectiveness of the proposed AFMSFFNet. Compared to existing mainstream algorithms, our model exhibits advantages in terms of both accuracy and speed. Its compact model size and short training and inference times demonstrate its deployment potential in resource-limited situations, making it well suited to industrial hardware platforms such as radar chips and embedded devices rather than being limited to large computing devices such as PCs and servers. Additionally, the model proposed in this paper is currently applied to ship detection in SAR images. In future work, we will engage in more detailed module design and research, enabling the model to be applied to various downstream tasks, including ship detection based on Doppler images [54], and considering the design of models that better suit practical industrial applications. This study also has application prospects in other microwave imaging detection fields, such as resource exploration, device condition detection, and medical detection. In the future, we will dedicate efforts to a more comprehensive exploration of the industrial domain.

Author Contributions

Conceptualization, Y.Z. and C.D.; methodology, Y.Z.; software, Y.Z. and C.D.; validation, Y.Z., Y.L. and Q.W.; investigation, Y.Z. and Y.L.; data curation, Y.Z. and Q.W.; writing—original draft, Y.Z. and X.M.; writing—review and editing, Y.Z., X.M. and C.D.; supervision, L.G., X.M. and C.D.; project administration, L.G., X.M. and C.D.; funding acquisition, L.G., X.M. and C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62301401) and the Fundamental Research Funds for the Central Universities (Grant No. QTZX23019).

Data Availability Statement

The data presented in this study are openly available in the SSDD and SAR ship dataset at https://doi.org/10.3390/rs13183690 and https://doi.org/10.3390/rs11070765. The reference numbers are [25,26].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Barber, B.C.; Barker, J.E. The use of SAR-ATI for maritime surveillance and difficult target detection. In Proceedings of the 2009 International Radar Conference Surveillance for a Safer World (RADAR 2009), Bordeaux, France, 12–16 October 2009; pp. 1–6. [Google Scholar]
  2. Friedman, K.; Wackerman, C.; Funk, F.; Schwenzfeier, M.; Pichel, W.; Colon-Clemente, P.; Li, X. Analyzing the dependence between RADARSAT-1 vessel detection and vessel heading using CFAR algorithm for use on fishery management. In Proceedings of the Oceans 2003. Celebrating the Past... Teaming Toward the Future (IEEE Cat. No.03CH37492), San Diego, CA, USA, 22–26 September 2003; Volume 5, pp. P2819–P2823. [Google Scholar]
  3. Mazzarella, F.; Vespe, M.; Santamaria, C. SAR Ship Detection and Self-Reporting Data Fusion Based on Traffic Knowledge. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1685–1689. [Google Scholar] [CrossRef]
  4. Rey, M.; Tunaley, J.; Folinsbee, J.; Jahans, P.; Dixon, J.; Vant, M. Application of Radon Transform Techniques to Wake Detection in Seasat-A SAR Images. IEEE Trans. Geosci. Remote Sens. 1990, 28, 553–560. [Google Scholar] [CrossRef]
  5. Goldstein, G.B. False-Alarm Regulation in Log-Normal and Weibull Clutter. IEEE Trans. Aerosp. Electron. Syst. 1973, AES-9, 84–92. [Google Scholar] [CrossRef]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  12. Zhang, T.; Zhang, X.; Shi, J.; Wei, S. High-Speed Ship Detection in SAR Images by Improved Yolov3. In Proceedings of the 2019 16th International Computer Conference on Wavelet Active Media Technology and Information Processing, Chengdu, China, 14–15 December 2019; pp. 149–152. [Google Scholar]
  13. Guo, Y.; Chen, S.; Zhan, R.; Wang, W.; Zhang, J. SAR Ship Detection Based on YOLOv5 Using CBAM and BiFPN. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 2147–2150. [Google Scholar]
  14. Sun, Z.; Dai, M.; Leng, X.; Lei, Y.; Xiong, B.; Ji, K.; Kuang, G. An Anchor-Free Detection Method for Ship Targets in High-Resolution SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7799–7816. [Google Scholar] [CrossRef]
  15. Ma, X.; Hou, S.; Wang, Y.; Wang, J.; Wang, H. Multiscale and Dense Ship Detection in SAR Images Based on Key-Point Estimation and Attention Mechanism. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  16. Rey, M.; Tunaley, J.; Sibbald, T. Use of the Dempster-Shafer algorithm for the detection of SAR ship wakes. IEEE Trans. Geosci. Remote Sens. 1993, 31, 1114–1118. [Google Scholar] [CrossRef]
  17. Copeland, A.C.; Ravichandran, G.; Trivedi, M.M. Localized Radon transform-based detection of ship wakes in SAR images. IEEE Trans. Geosci. Remote Sens. 1995, 33, 35–45. [Google Scholar] [CrossRef]
  18. Ringrose, R.; Harris, N. Ship detection using polarimetric SAR data. In Proceedings of the SAR workshop: CEOS Committee on Earth Observation Satellites, Toulouse, France, 26–29 October 1999; Volume 450, p. 687. [Google Scholar]
  19. Ritcey, J. An Order-Statistics-Based CFAR for SAR Applications; Electrical Engineering Department, University of Washington: Seattle, WA, USA, 1990. [Google Scholar]
  20. Xing, X.; Chen, Z.; Zou, H.; Zhou, S. A fast algorithm based on two-stage CFAR for detecting ships in SAR images. In Proceedings of the 2009 2nd Asian-Pacific Conference on Synthetic Aperture Radar (APSAR), Xi’an, China, 26–30 October 2009; pp. 506–509. [Google Scholar]
  21. Long, Z.; Suyuan, W.; Zhongma, C.; Jiaqi, F.; Xiaoting, Y.; Wei, D. Lira-YOLO: A lightweight model for ship detection in radar images. J. Syst. Eng. Electron. 2020, 31, 950–956. [Google Scholar] [CrossRef]
  22. Cui, Z.; Li, Q.; Cao, Z.; Liu, N. Dense Attention Pyramid Networks for Multi-Scale Ship Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8983–8997. [Google Scholar] [CrossRef]
  23. Zhao, Y.; Zhao, L.; Xiong, B.; Kuang, G. Attention Receptive Pyramid Network for Ship Detection in SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2738–2756. [Google Scholar] [CrossRef]
  24. Cui, Z.; Wang, X.; Liu, N.; Cao, Z.; Yang, J. Ship Detection in Large-Scale SAR Images Via Spatial Shuffle-Group Enhance Attention. IEEE Trans. Geosci. Remote Sens. 2020, 59, 379–391. [Google Scholar] [CrossRef]
  25. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  26. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. A SAR dataset of ship detection for deep learning under complex backgrounds. Remote Sens. 2019, 11, 765. [Google Scholar] [CrossRef]
  27. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  28. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  29. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  30. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  31. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  32. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  33. Huang, L.; Yang, Y.; Deng, Y.; Yu, Y. Densebox: Unifying landmark localization with end to end object detection. arXiv 2015, arXiv:1509.04874. [Google Scholar]
  34. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  35. Guo, H.; Yang, X.; Wang, N.; Gao, X. A CenterNet++ model for ship detection in SAR images. Pattern Recognit. 2021, 112, 107787. [Google Scholar] [CrossRef]
  36. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  37. Fu, H.; Wang, X.; Peng, C.; Che, Z.; Wang, Y. A dual-task algorithm for ship target detection and semantic segmentation based on improved YOLOv5. In Proceedings of the OCEANS 2023—Limerick, Limerick, Ireland, 5–8 June 2023; pp. 1–7. [Google Scholar]
  38. Guo, Y.; Zhan, R.; Chen, S.; Li, L.; Zhang, J. A lightweight SAR ship detection method based on improved YOLOv8. In Proceedings of the IET International Radar Conference (IRC 2023), Chongqing, China, 3–5 December 2023; pp. 1322–1327. [Google Scholar]
  39. Tan, X.; Leng, X.; Wang, J.; Ji, K. A ship detection method based on YOLOv7 in range-compressed SAR data. In Proceedings of the IET International Radar Conference (IRC 2023), Chongqing, China, 3–5 December 2023; pp. 948–952. [Google Scholar]
  40. Yang, Y.; Ju, Y.; Zhou, Z. A Super Lightweight and Efficient SAR Image Ship Detector. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  41. Wei, H.; Wang, Z.; Hua, G.; Ni, Y. A Zero-Shot NAS Method for SAR Ship Detection Under Polynomial Search Complexity. IEEE Signal Process. Lett. 2024, 31, 1329–1333. [Google Scholar] [CrossRef]
  42. Chen, C.; Zeng, W.; Zhang, X.; Zhou, Y. CSnNet: A Remote Sensing Detection Network Breaking the Second-Order Limitation of Transformers with Recursive Convolutions. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4207315. [Google Scholar]
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  44. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  45. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  46. Liu, J.; Ma, F.; Yin, Q.; Zhang, F. An improved deep active learning method for SAR ship detection. In Proceedings of the IET International Radar Conference (IRC 2023), Chongqing, China, 3–5 December 2023; pp. 3336–3341. [Google Scholar]
  47. Wu, S.; Wang, W.; Ruan, F.; Zhang, H.; Deng, J.; Guo, P.; Fan, H. Inshore ship detection using high-resolution SAR images based on multi-feature fusion. In Proceedings of the IET International Radar Conference (IRC 2023), Chongqing, China, 3–5 December 2023; pp. 1067–1072. [Google Scholar]
  48. Sun, M.; Li, Y.; Chen, X.; Zhou, Y.; Niu, J.; Zhu, J. A Fast and Accurate Small Target Detection Algorithm Based on Feature Fusion and Cross-Layer Connection Network for the SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8969–8981. [Google Scholar] [CrossRef]
  49. Zhou, Y.; Liu, H.; Ma, F.; Pan, Z.; Zhang, F. A Sidelobe-Aware Small Ship Detection Network for Synthetic Aperture Radar Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  50. Hu, B.; Miao, H. An Improved Deep Neural Network for Small-Ship Detection in SAR Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sensing 2024, 17, 2596–2609. [Google Scholar] [CrossRef]
  51. Ge, R.; Mao, Y.; Li, S.; Wei, H. Research on Ship Small Target Detection in SAR Image Based on Improved YOLO-v7. In Proceedings of the 2023 International Applied Computational Electromagnetics Society Symposium (ACES-China), Hangzhou, China, 15–18 August 2023; pp. 1–3. [Google Scholar]
  52. Zhang, A.; Zhu, X. Research on ship target detection based on improved YOLOv5 algorithm. In Proceedings of the 2023 5th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 14–16 April 2023; pp. 459–463. [Google Scholar]
  53. Bai, L.; Yao, C.; Ye, Z.; Xue, D.; Lin, X.; Hui, M. Feature Enhancement Pyramid and Shallow Feature Reconstruction Network for SAR Ship Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1042–1056. [Google Scholar] [CrossRef]
  54. Zhang, W.; Wu, Q.M.J.; Yang, Y.; Akilan, T.; Zhao, W.G.W.; Li, Q.; Niu, J. Fast Ship Detection with Spatial-Frequency Analysis and ANOVA-Based Feature Fusion. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of the YOLOX method. (“Res Unit*n” indicates that there are a total of n Res Units at this location).
Figure 2. The overall architecture of the proposed method.
Figure 3. The structure of the LCA module.
Figure 4. The structure of AWF.
Figure 5. The structure of MGA.
Figure 6. The loss curve of the validation set.
Figure 7. Visualization in complex scenarios on SSDD. (a) Ground truth. (b) YOLOX tiny. (c) Our method.
Figure 8. The heatmaps for some of the results on SSDD. (a) YOLOX tiny. (b) Our method.
Figure 9. Visualization of the detection results on the SAR ship dataset. (a) Ground truth. (b) YOLOv3. (c) YOLOv4. (d) YOLOv5-X. (e) YOLOX tiny. (f) YOLOX-x. (g) YOLOv7 tiny. (h) Efficientdet. (i) SSD. (j) Faster R-CNN. (k) Our method.
Table 1. A comparison of different SAR image object detection methods.
Approaches relying on auxiliary features
- Principle: This approach primarily leverages auxiliary features in the vicinity of the target, such as the wake or oil spills from vessels. By detecting these auxiliary features, the presence of the target is indirectly inferred.
- Advantages: Simple and intuitive, requiring no complex sensors or data processing; effective target detection can be carried out when the target itself is difficult to detect directly but has obvious auxiliary features.
- Disadvantages: There is a certain dependence on the detection environment, and effective detection may be challenging in situations where the target is directly visible but lacks clear auxiliary features.
- References: Ref. [4]; Ref. [16] (correctly classified 86 out of 93 SEASAT ship images and 21 out of 24 ocean scenes); Ref. [17].

Multi-polarization techniques based on polarization data
- Principle: Target detection accuracy is enhanced by analyzing the response of targets to different polarization modes using multi-polarization data acquired by SAR sensors.
- Advantages: Utilizes the electromagnetic scattering characteristics of the target to provide additional information and enhance target identification and detection; enhances the contrast between targets and backgrounds in SAR images.
- Disadvantages: Handling polarized data introduces complexity to the algorithm and processing workflow; imposes greater demands on sensors and data collection.
- References: Ref. [18] (PMD = 3 × 10−3, PFA = 2.5 × 10−7, where PFA is the probability of a false alarm and PMD is the probability of missed detection).

CFAR methods
- Principle: Based on the statistical distribution of background clutter, targets are detected by comparing pixel grayscale values with a threshold within a specific region.
- Advantages: Usually relies on statistical methods with relatively simple calculations; by considering local background statistics, it adapts to targets of different sizes, so small targets can usually be detected effectively.
- Disadvantages: Prone to multiple-target occlusion and false alarms at the edges of clutter; manual adjustment of parameters, such as window size and threshold, is required to adapt to different scenarios.
- References: Ref. [5]; Ref. [19]; Ref. [20].

Deep learning-based detection methods
- Principle: Employing deep learning models, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), features representing targets are learned from SAR images for target detection.
- Advantages: Capable of automatically learning complex feature representations without manual feature extraction; independent of specific prior knowledge and applicable to various scenarios; typically exhibits strong generalization capability and good robustness.
- Disadvantages: Training the model typically requires a large amount of well-annotated data; models typically require high computational resources, especially complex deep networks, which may need high-performance computing devices such as GPUs or TPUs.
- References: Ref. [12] (SSDD, mAP = 90.08%); Ref. [13] (GF3 SAR images, AP = 92.8%); Ref. [14] (HRSID high-resolution SAR image dataset, mAP = 96.01%); Ref. [15] (SSDD, mAP = 96.3%); Ref. [21] (SSDD, mAP = 85.46%); Ref. [22] (SSDD, mAP = 89.8%); Ref. [23] (SSDD, offshore mAP = 98.2%, inshore mAP = 84.1%); Ref. [24] (SAR ship dataset, mAP = 94.7%).
Table 2. Comparison of complexity between YOLOX tiny and the proposed model.
| Method | Parameters/Trainable Params (M) | Training Time | Inference Time | FPS |
| --- | --- | --- | --- | --- |
| YOLOX tiny | 5.05/5.03 | 52.21 s | 40.4 ms | 117.89 |
| Ours | 8.46/8.42 | 48.96 s | 45.9 ms | 78.26 |
Table 3. Performance comparison of different detection methods on the SSDD dataset.
| Method | Recall | Precision | F1 | AP50 | Parameters/Trainable Parameters (M) | FPS |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv3 | 87.79% | 90.91% | 0.89 | 88.19% | 61.58/61.52 | 63.93 |
| YOLOv4 tiny | 85.88% | 88.24% | 0.87 | 88.23% | 5.88/5.87 | 96.31 |
| YOLOv4 | 91.60% | 93.02% | 0.92 | 92.00% | 64.00/63.94 | 52.54 |
| YOLOv5-x | 94.66% | 96.12% | 0.95 | 94.44% | 87.34/87.24 | 25.86 |
| YOLOX-x | 95.80% | 95.80% | 0.96 | 95.36% | 99.10/99.00 | 23.58 |
| YOLOX tiny | 94.66% | 96.50% | 0.96 | 95.55% | 5.05/5.03 | 117.89 |
| YOLOv7 tiny | 93.13% | 94.21% | 0.94 | 93.83% | 6.03/6.01 | 125.81 |
| YOLOv8 tiny | 90.84% | 94.44% | 0.93 | 94.23% | 3.01/3.01 | 63.05 |
| YOLOv8-s | 92.75% | 95.29% | 0.94 | 94.43% | 11.14/11.14 | 60.17 |
| Efficientdet | 67.56% | 94.15% | 0.79 | 91.56% | 3.89/3.84 | 44.19 |
| SSD | 48.85% | 93.43% | 0.64 | 85.67% | 23.61/23.61 | 103.90 |
| FCOS | 92.37% | 94.90% | 0.94 | 96.23% | 32.17/32.12 | 22.48 |
| CenterNet | 91.98% | 97.18% | 0.94 | 94.68% | 191.36/191.24 | 16.66 |
| Faster R-CNN | 95.80% | 88.38% | 0.92 | 96.59% | 28.34/28.24 | 11.61 |
| Ours | 96.95% | 97.69% | 0.97 | 97.87% | 8.46/8.42 | 78.26 |
Table 4. Comparison of model performance in different scenarios on the SSDD dataset.
| Method | AP50 (Inshore) | AP50 (Offshore) | FPS |
| --- | --- | --- | --- |
| DAPN [22] | 67.55% | 95.93% | — |
| ARPN [23] | 84.1% | 98.2% | 13 |
| Ours | 93.06% | 98.59% | 78.26 |
Table 5. Ablation experiment on the SSDD dataset.
| Method | AP | AP50 | AP75 |
| --- | --- | --- | --- |
| YOLOX tiny | 54.70% | 95.55% | 61.36% |
| YOLOX tiny + LCA | 56.40% | 97.17% | 61.65% |
| YOLOX tiny + AWF | 54.60% | 96.91% | 59.96% |
| YOLOX tiny + MGAHead | 54.90% | 97.16% | 61.49% |
| YOLOX tiny + LCA + AWF | 56.30% | 97.43% | 62.02% |
| YOLOX tiny + MGAHead + LCA | 55.80% | 97.23% | 62.52% |
| YOLOX tiny + MGAHead + AWF | 54.30% | 97.38% | 61.56% |
| Ours | 55.90% | 97.87% | 63.67% |
Table 6. A performance comparison of different attention mechanisms.
| Method | AP | AP50 | AP75 | APs | APm | APl |
| --- | --- | --- | --- | --- | --- | --- |
| Basic | 54.7% | 95.55% | 61.36% | 47.1% | 65.3% | 62.5% |
| SE | 51.6% | 95.82% | 59.51% | 43.1% | 63.5% | 71.2% |
| ECA | 55.4% | 96.54% | 62.76% | 48.1% | 65.5% | 70.7% |
| CBAM | 55.0% | 96.23% | 62.46% | 46.3% | 67.0% | 67.4% |
| LCA | 56.4% | 97.17% | 61.65% | 49.5% | 66.0% | 70.3% |
Table 7. A performance comparison on the SAR ship dataset.
| Method | Recall | Precision | F1 | AP50 |
| --- | --- | --- | --- | --- |
| YOLOv3 | 87.78% | 94.43% | 0.91 | 95.62% |
| YOLOv4 | 79.83% | 91.97% | 0.86 | 89.72% |
| YOLOv5-x | 95.44% | 93.32% | 0.94 | 96.91% |
| YOLOX tiny | 93.12% | 93.25% | 0.93 | 96.49% |
| YOLOX-x | 92.51% | 93.38% | 0.93 | 96.12% |
| YOLOv7 tiny | 90.99% | 93.18% | 0.92 | 95.56% |
| Efficientdet | 90.02% | 92.85% | 0.91 | 95.80% |
| SSD | 84.66% | 93.34% | 0.89 | 94.72% |
| Faster R-CNN | 93.18% | 93.18% | 0.93 | 95.63% |
| DETR | 95.56% | 80.48% | 0.87 | 94.97% |
| Ours | 95.78% | 92.46% | 0.94 | 97.21% |