1. Introduction
Maritime transport has propelled the growth of coastal industries and international trade, providing a steady supply for modern economic development [1]. Global trade exchanges have become more active and maritime transport volume has surged rapidly. However, this continuous progress has also presented mounting challenges for maritime supervision. Ship detection serves as an essential means within the maritime supervision system, enabling swift ship positioning and tracking. Synthetic Aperture Radar (SAR) is widely acknowledged as one of the most extensively employed techniques for maritime supervision due to its exceptional capability to operate continuously under diverse weather conditions [2]. Given the importance of real-time ship detection, the field of maritime surveillance requires a rapid and efficient SAR ship detection method.
Numerous efficient and accurate methods have been developed for ship detection, and scholars have extensively researched real-time detection using various forms of SAR data, including SAR imagery, raw SAR echoes, and SAR range-compressed data. The earliest and most extensive research on SAR ship detection concerns detection in SAR imagery. Based on the presence or absence of artificially designed features, this field can be categorized into traditional methods and deep learning-based methods. Traditional ship detection methods rely primarily on manually crafted features, with the constant false alarm rate (CFAR) algorithm being the most representative [3]. CFAR uses the statistical properties of the background to determine an appropriate threshold through rigorous statistical analysis, thereby achieving a binary classification between ships and background. Apart from CFAR, there are detection techniques based on global thresholding [4], visual saliency [5], domain transforms [6], etc. However, these conventional methods exhibit limited robustness in detection performance and involve pixel-level traversal during processing, rendering them inadequate for real-time applications. Deep learning ship detection methods employ deep neural networks for automated feature extraction. Owing to the robust feature extraction capabilities of convolutional neural networks (CNNs), CNN-based detection methods have garnered significant research attention, giving rise to two-stage detection networks exemplified by Faster R-CNN [7] and one-stage detection networks exemplified by YOLOv1 [8]. Moreover, the Transformer [9], an attention-based architecture, has also proved applicable: Transformer-based networks such as the Vision Transformer (ViT) [10] and the Detection Transformer (DETR) [11] have gradually emerged within computer vision. These deep neural networks have been progressively integrated into SAR ship detection, showcasing precise detection performance and efficient detection times.
Although deep learning-based methods can rapidly detect ships in SAR images, the real-time detection of SAR ships is often constrained by imaging time in practical application scenarios. During the SAR imaging process, the echo data must be transmitted to the ground station; only after various intricate and time-consuming operations, such as range cell migration correction, azimuth pulse compression, and radiometric correction, can the SAR imagery products essential for subsequent ship detection be obtained. Therefore, the overall time required for ship detection using SAR images encompasses both the detection time and the time needed to acquire a fully focused SAR image product.
In light of the inherent limitations introduced by the SAR imaging process, researchers have attempted to move the detection step earlier and explore on-board detection before imaging to enhance the real-time performance of SAR ship detection. Several scholars [12,13] have attempted detection based on the analysis of raw SAR echo data. However, due to the low peak signal-to-noise ratio of raw SAR echoes, accurate detection of ship targets is challenging. Consequently, there has been growing interest in detecting ships in the partially focused range-compressed domain. Joshi et al. [14] argued that the real-time requirements of ship detection cannot be met by relying solely on fully focused SAR images, since additional processing effort is needed to generate them; the use of range-compressed data is therefore relatively appealing. Subsequently, CFAR [14], Faster R-CNN [15], YOLO [16], LSTM [17], Inception [18], and other ship detection methods commonly employed for SAR imagery have also been applied to range-compressed data. Nevertheless, these methods were originally designed for optical images, and their network structures were not redesigned when applied directly to range-compressed data. As a result, their performance may fall short of expectations, since the properties of range-compressed SAR images differ significantly from those of optical remote sensing images and conventional SAR representations of ships. Additionally, these range-compressed-domain methods do not address the challenges of optimizing computational efficiency and reducing parameter counts. It is therefore imperative to design an efficient, lightweight ship detection method based on the characteristics of the range-compressed domain.
In this paper, we propose FastRCDet, a novel lightweight network that achieves performance comparable to mainstream lightweight detection networks while significantly reducing the number of parameters and floating-point operations (FLOPs). The key contributions of this research can be summarized as follows:
We propose a novel lightweight detection network framework for ship detection in the SAR range-compressed domain, considering both the limited resources of embedded platforms and the unique characteristics of the SAR range-compressed domain to design an optimized network structure.
Considering the specific geometric characteristics of the SAR range-compressed domain and combining them with a lightweight network design concept, the Lightweight Adaptive Network (LANet) is proposed as the backbone. To address the large scale and high aspect ratio of ships in the range-compressed domain, Adaptive Kernel Convolution (AKConv) is introduced as a fundamental component of the backbone, allowing the receptive field to adapt its shape to the characteristics of ship data in this domain. LANet outperforms existing lightweight networks in ship detection within the range-compressed domain, exhibiting a smaller model, faster speed, and more precise detection.
To further enhance the efficiency and simplicity of network models, an innovative single Multi-Scale Fusion Head (MSFH) is proposed. We incorporate an Atrous Spatial Pyramid Pooling (ASPP) module to effectively fuse feature maps from multiple scales and reduce computational complexity. This module effectively combines features at different scales to better capture detailed information of the target object. Compared to traditional multi-detection head structures, this design significantly reduces the number of parameters in the network model and minimizes computational resource consumption while maintaining high-precision detection results.
To enhance the adaptability of ship detection in the range-compressed domain of SAR, a novel loss function, Direction IoU (DIoU), tailored to the target's shape characteristics, is proposed. By meticulously designing the angular cost component, this function ensures higher predicted bounding box accuracy in the range dimension through an increased cost for horizontal movement. Consequently, our model exhibits exceptional detection performance even on SAR range-compressed images.
To assess the performance of our proposed network, we conduct experiments using publicly available ship datasets in the range-compressed domain. In comparison to mainstream lightweight detection networks, our proposed network achieves significant reduction in parameter count and computational requirements without compromising detection performance.
The rest of this paper is organized as follows. Section 2 reviews related work, providing a literature review of lightweight networks, lightweight ship detection methods in SAR imagery, and ship detection methods in the SAR range-compressed domain. Section 3 presents the theoretical explanation of the range-compressed domain, accompanied by a brief analysis of its characteristics. Section 4 elaborates on the proposed method in detail. Section 5 presents the evaluation metrics, experimental design, experimental results, and their analysis. Section 6 concludes the paper.
2. Related Works
Performing ship detection prior to image acquisition from airborne or spaceborne SAR payload platforms reduces data transmission and enables a rapid determination of the ship’s position. Considering the limited availability of computing resources, it is imperative to develop a lightweight neural network model for smart ship detection that can effectively operate on resource-constrained edge computing platforms. Therefore, this section provides a comprehensive literature review encompassing three key aspects: (1) lightweight networks; (2) lightweight ship detection methods in SAR imagery; and (3) ship detection methods in the SAR range-compressed domain.
2.1. Lightweight Networks
Although deep learning models are effective, they typically impose high computational and storage requirements. Consequently, relevant scholars have been actively investigating lightweight networks.
2.1.1. Convolution Neural Networks
Currently, convolutional neural networks (CNNs) are predominantly utilized for object detection and recognition tasks. Since AlexNet [19], CNNs have progressively evolved into deeper and more intricate architectures. However, their efficiency in terms of time and memory has not necessarily improved. In real-world scenarios, real-time computation with limited computational resources is imperative, necessitating a delicate trade-off between speed and accuracy. To achieve higher efficiency, numerous lightweight CNNs have been studied. MobileNets [20,21,22], proposed by Google, are a series of lightweight networks that effectively reduce model parameters and computational complexity while only marginally compromising precision. ShuffleNets [23,24], introduced by the Megvii Institute in 2017, demonstrate efficient performance on mobile devices. Huawei's Noah's Ark Laboratory proposed an end-to-end network, GhostNet [25], which introduces an innovative Ghost module specifically designed to generate additional feature maps efficiently. EfficientNets [26,27], proposed by Google Brain, achieve efficient model design through the uniform scaling of network depth, width, and resolution, effectively reducing computational complexity and parameter count while maintaining high accuracy.
2.1.2. Vision Transformer and Variants
Since the proposal of ViT, which extends the application of the Transformer from machine translation or prediction [
28] to computer vision, there has been a growing research interest in ViT. Subsequent studies have focused on enhancing ViT through improvements in lightweight model design. LeViT [
29], proposed by Facebook AI Research, stands as a prominent benchmark among lightweight ViT models. It effectively amalgamates the global information transmission capabilities of Transformers with the local structural perception capabilities of CNNs. Wang et al. [
30] proposed the Pyramid Vision Transformer (PVT), which integrates the pyramid structure of convolutional neural networks with the global receptive field of Transformers. PVT is specifically designed to address challenges such as low resolution, high computational requirements, and memory overhead associated with traditional Transformers when handling demanding prediction tasks. Chen et al. [
31] have proposed a novel network called Mobile-Former, which effectively parallelizes the functionalities of MobileNet and Transformers while establishing a bidirectional connection between them. Mehta et al. [
32] proposed a network named MobileViT that aims to integrate the strengths of CNN and ViT.
2.2. Lightweight Ship Detection in SAR Imagery
Lightweight ship detection in SAR imagery is a critical technology for maritime surveillance and search and rescue operations. Although traditional deep learning models are effective, their high computational and storage demands limit their suitability for deployment on resource-constrained platforms.
To address these challenges, recent advancements have focused on developing lightweight neural networks that maintain high detection accuracy while significantly reducing model size and computational complexity. These networks employ innovative techniques such as optimized convolutional layers, attention mechanisms, and advanced loss functions to efficiently process SAR images and accurately detect ships under various environmental conditions. The objective is to achieve real-time processing speeds and robust performance even in cluttered maritime scenes.
Currently, there exist two categories of lightweight SAR ship detection methods. One approach adapts a mainstream object detection network, making modifications that reduce its computational burden. Miao et al. [33] proposed an improved lightweight RetinaNet model for ship detection in SAR images; by replacing shallow convolutional layers and reducing the number of deep convolutional layers in the backbone, adding spatial and channel attention modules, and applying a K-means clustering algorithm to adjust the model parameters, the computation and parameter count are significantly reduced. Li et al. [34] proposed a novel detection framework based on Faster R-CNN to enhance detection speed, improving its feature extraction, recognition, and positioning networks to strengthen recognition and localization capabilities. Yu et al. [35] proposed a lightweight ship detection network based on YOLOX-s [36], which eliminates the computationally intensive pyramid structure and establishes a streamlined network relying on first-order features to enhance detection efficiency. Liu et al. [37] presented YOLOv7oSAR, a lightweight approach for ship detection in SAR imagery that advances YOLOv7 [38] by introducing a lightweight rotated-box structure to reduce computational cost and adopting specific loss functions to enhance accuracy.
Another type of approach redesigns a new network framework. Wang et al. [39] proposed a lightweight network for ship target detection in SAR images, addressing the high cost of traditional deep CNNs through a network structure optimization algorithm based on a multi-objective firefly algorithm. Ren et al. [40] presented YOLO-Lite, a lightweight network based on the YOLO [41] framework; by devising a feature-enhanced backbone and incorporating specialized modules, accurate SAR ship detection across diverse backgrounds is achieved while maintaining low computational overhead. Chang et al. [42] developed MLSDNet, a lightweight multi-class SAR detection network that leverages adaptive scale-distributed attention and a streamlined backbone to enhance target detection performance. Tian et al. [43] proposed the LFer-Net detector to address challenges in ship detection such as low resolution, small target size, and densely arranged ships. Zhou et al. [44] introduced EGTB-Net, a lightweight SAR ship detection network that integrates Transformer and feature enhancement techniques to improve both detection speed and accuracy.
2.3. Ship Detection in the SAR Range-Compressed Domain
Joshi et al. [14] proposed using the CFAR algorithm to detect ships outside the clutter region of the range-Doppler domain, a robust method for dealing with different clutter environments; however, it may not scale effectively to larger datasets or more diverse environmental conditions. Leng et al. [45] proposed a ship detection method based on SAR range-compressed data that uses Complex Signal Kurtosis (CSK) to pre-screen potential ship areas and CNN-based classification to detect potential target areas. The combined use of CSK and CNN could increase computing demands, potentially hampering real-time processing.
In recent years, Gao et al. [17] proposed utilizing LSTM networks from natural language processing to detect two-dimensional objects as multiple one-dimensional sequences in the range-compressed domain and achieved preliminary results; however, further experiments are required to validate the real-time capability and speed of this detector, and since LSTMs are primarily designed for sequential data, their application to spatial tasks is conceptually challenging and requires extensive fine-tuning. Loran et al. [15] introduced an airborne SAR ship detector based on the Faster R-CNN network in the range-Doppler domain. Nevertheless, these methods lack a comprehensive analysis of the features of the range-compressed domain and fail to integrate them into their approaches. Zeng et al. [18] presented a ship target detection method in the SAR range-compressed domain using a novel Inception-Text convolutional neural network model; however, this method can only provide an approximate measurement of the range dimension, and further experimentation is required to verify its applicability.
4. Methods
In this section, a new lightweight network is proposed for ship detection in the SAR range-compressed domain, named FastRCDet, which is designed to be efficient and rapid. Instead of building upon an existing network as a baseline, we have reengineered the network specifically considering ship characteristics in the SAR range-compressed domain. FastRCDet introduces a lightweight backbone, utilizes a single detection head with a novel loss function, and incorporates an anchor-free algorithm to form our detection framework.
4.1. Overall Structure and Process
The overall network structure of FastRCDet is illustrated in Figure 2. It is an anchor-free, single-stage ship detector for the range-compressed domain. To cater to the specific characteristics of ship data in this domain, our design deviates from the conventional structure comprising input, backbone, neck, detection head, and output components. Instead, we adopt a simplified framework consisting of four fundamental modules: input, lightweight backbone, lightweight single head, and output. This approach not only simplifies the network structure but also significantly reduces computational complexity.
Specifically, during the process of forward propagation, we initially subject the imagery in the range-compressed domain to the Lightweight Adaptive Network (LANet), which has been meticulously designed, thereby obtaining a deep and high-dimensional feature map. Subsequently, the feature maps originating from diverse spatial scales are effectively fused through a novel single Multi-Scale Fusion Head (MSFH) for accurate ship prediction.
In the process of backpropagation, a new loss function is proposed by analyzing the shape characteristics of ship images in the range-compressed domain. The resulting loss function, named Direction IoU (DIoU), facilitates network convergence by aligning the predicted bounding box with the vertical line of the ground truth box during training, improving the model's perception of the position and shape of the ship.
4.2. Backbone: Lightweight Adaptive Network (LANet)
4.2.1. Adaptive Kernel Convolution
Considering the distinctive characteristics of ships in the range-compressed domain, we propose adjusting the shape of the convolution kernel to modify the receptive field region and enhance extraction accuracy. We incorporate AKConv [48], whose shape adjustability offers significant advantages in addressing the defects of traditional convolution: it enhances the capture of surrounding-area information while reducing parameter count and computational burden, yielding better results and higher efficiency in image processing tasks.
Traditional convolution exhibits limitations in processing range-compressed domain images, and we introduce AKConv into the backbone to address them; its specific structure is illustrated in Figure 3. Firstly, conventional convolution operations are confined to a fixed-size window and cannot capture information beyond its boundaries, so important details may be lost when dealing with larger targets. In contrast, AKConv adjusts the shape of its kernel through learned offsets, adapting flexibly to windows of varying sizes and shapes during convolution and thereby capturing additional information from the surrounding area. Furthermore, as the window size k increases, traditional convolution incurs a substantial increase in parameters, heightening model complexity and computational demands and making overfitting more likely. AKConv mitigates these issues by dynamically learning offsets and adjusting its shape; thus it reduces model complexity and computational burden while enhancing generalization.
AKConv stands out for accommodating an arbitrary number of parameters, setting it apart from traditional convolution methods that rely on fixed kernel sizes such as 3 × 3 or 5 × 5. As depicted in Figure 4, AKConv allows the number of kernel parameters to be set to any value, including but not limited to 1, 2, 3, and 4, which endows it with enhanced flexibility for model design. In addition, the initial kernel shape is not constrained by convention: while conventional convolutions typically employ square or rectangular kernels, AKConv admits a wider range of shapes and sizes. Figure 5 illustrates how different shapes and sizes can be utilized as initial kernels when employing n convolution parameters.
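To make this mechanism concrete, the sketch below shows one way an offset-learning convolution of this kind can be written in PyTorch. It is a simplified stand-in for AKConv [48] under our own assumptions (the module name, the 3 × 3 offset predictor, the vertical initial sampling pattern, and the 1 × 1 mixing layer are all illustrative choices), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AKConvSketch(nn.Module):
    """Offset-learning convolution with an arbitrary number n of sampling
    points (a simplified AKConv-style layer; hypothetical, for illustration)."""

    def __init__(self, in_ch: int, out_ch: int, num_params: int = 5):
        super().__init__()
        self.n = num_params
        # Predicts a 2D offset for each of the n sampling points at every pixel.
        self.offset_pred = nn.Conv2d(in_ch, 2 * num_params, 3, padding=1)
        # Mixes the n resampled copies of the input into the output channels.
        self.mix = nn.Conv2d(in_ch * num_params, out_ch, 1)
        # Initial pattern: a vertical line of n points (one plausible choice
        # for the elongated, azimuth-defocused targets discussed in the text).
        base = torch.arange(num_params, dtype=torch.float32) - (num_params - 1) / 2
        self.register_buffer("base_y", base)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        offsets = self.offset_pred(x).view(b, self.n, 2, h, w)
        # Identity sampling grid in normalized [-1, 1] coordinates, (H, W, 2).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1)
        samples = []
        for i in range(self.n):
            # Initial pattern plus learned offset, converted from pixels to
            # normalized grid units, then bilinear resampling of the input.
            dx = offsets[:, i, 0] * 2.0 / max(w - 1, 1)
            dy = (offsets[:, i, 1] + self.base_y[i]) * 2.0 / max(h - 1, 1)
            shifted = grid.unsqueeze(0) + torch.stack((dx, dy), dim=-1)
            samples.append(F.grid_sample(x, shifted, align_corners=True))
        return self.mix(torch.cat(samples, dim=1))
```

Because the number of sampling points n is a free argument rather than a k × k constraint, the parameter count grows linearly in n; for example, AKConvSketch(64, 128, num_params=5)(torch.randn(1, 64, 32, 32)) produces a (1, 128, 32, 32) output.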
4.2.2. Adaptive Kernel Block (AK Block)
We have developed two types of blocks within LANet, namely the basic block and the Adaptive Kernel (AK) block. The schematic representation of the basic block is illustrated in Figure 6a, while Figure 6b depicts the architecture of the AK block. The basic block consists of an AKConv layer and two 1 × 1 convolution layers, and, inspired by ResNet [49], a shortcut connection is incorporated between the input and output of the block.
In the proposed LANet, the network structure is constructed by sequentially connecting an AK block and three basic blocks through merging layers. The design aims to fully exploit the unique advantages of the AKConv layer while complementing it with conventional convolution operations.
Through this deep convolutional architecture, image feature extraction becomes more comprehensive and accurate, capturing the rich and diverse information present in the image. The basic block provides the essential functions of normal convolution and effectively extracts deep features, while the AK block introduces the AKConv layer, enhancing the model's ability to capture intricate relationships between details and local regions while preserving its original information extraction capability. The components of the network are tightly integrated and play a pivotal role in information transmission and the acquisition of abstract representations, progressing efficiently from low-level visual features to high-level semantic representations and ultimately generating high-dimensional features with rich semantic information and good generalization performance.
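A minimal PyTorch sketch of this block structure follows, under stated assumptions: the normalization and activation layers (BatchNorm + SiLU), the stride-2 "merging" convolutions, and the single-channel input are our guesses at unstated details, and the AK-block mid-layer would be an AKConv-style module such as the AKConvSketch above.

```python
import torch
import torch.nn as nn

class BlockSketch(nn.Module):
    """Mid-layer sandwiched by two 1x1 convolutions with a ResNet-style
    shortcut [49]. A plain 3x3 conv as `mid` plays the role of the basic
    block; an AKConv-style layer plays the role of the AK block."""

    def __init__(self, ch: int, mid: nn.Module):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.SiLU(),
            mid, nn.BatchNorm2d(ch), nn.SiLU(),
            nn.Conv2d(ch, ch, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # shortcut between block input and output

class LANetSketch(nn.Module):
    """One AK block followed by three basic blocks, joined by stride-2
    'merging' convolutions (the exact merging layer is not specified)."""

    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, in_ch = [], 1  # single-channel range-compressed input assumed
        for w in widths:
            stages.append(nn.Conv2d(in_ch, w, 3, stride=2, padding=1))
            # The first stage would use an AKConv-style mid-layer (AK block);
            # a plain 3x3 conv stands in here so the sketch runs standalone.
            stages.append(BlockSketch(w, nn.Conv2d(w, w, 3, padding=1)))
            in_ch = w
        self.features = nn.Sequential(*stages)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)  # deep, high-dimensional feature map
```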
4.3. Detection Head: Multi-Scale Fusion Head (MSFH)
Most object detection networks use multiple detection heads to adapt to objects of different scales: shallow, high-resolution heads handle small objects, while deep, low-resolution heads handle large objects, in a divide-and-conquer strategy. In fact, the key to detecting objects of different scales is the size of the receptive field, because objects of different scales require different receptive fields and each layer of the model has a receptive field of a different size. FPN, for example, integrates and fuses features with different receptive fields.
In this paper, a novel single detection head named the Multi-Scale Fusion Head (MSFH) is proposed. It draws on the idea of YOLOF [50] and uses a parallel grouped-convolution structure similar to Inception [51] to fuse features with different receptive fields. In this way, a single detection head can adapt to objects of different scales, and the network structure is simplified to reduce computation. The structure of the detection head is illustrated in Figure 7.
Specifically, in the MSFH module we employ Atrous Spatial Pyramid Pooling (ASPP) [52] to fuse multi-scale features. This module replaces the conventional pooling operation with multiple parallel dilated convolution layers featuring different dilation rates. The features extracted by each parallel convolution layer are processed in separate branches and fused to generate the final outcome. The core concept is to divide the input into distinct levels, each incorporating parallel dilated convolutions with varying dilation rates, and to concatenate the outputs of all levels into a comprehensive feature representation. Because it combines outputs from diverse levels, the MSFH module can effectively extract features from images of various sizes, preserving image information more comprehensively. The background, position, and category information of the ship target are then obtained by three parallel 5 × 5 convolution layers, respectively.
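The following is a minimal PyTorch sketch of such a head, assuming illustrative channel widths and dilation rates (the paper does not state them here); the three 5 × 5 branches correspond to the background/objectness, position, and category outputs mentioned above.

```python
import torch
import torch.nn as nn

class MSFHSketch(nn.Module):
    """ASPP-fused single detection head (hypothetical configuration)."""

    def __init__(self, in_ch: int = 256, mid_ch: int = 128, rates=(1, 2, 4, 8)):
        super().__init__()
        # Parallel dilated 3x3 convolutions: each rate sees a different
        # receptive field, which is the ASPP idea of [52].
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(mid_ch * len(rates), mid_ch, 1)
        # Three parallel 5x5 heads for background/objectness, box position,
        # and category (a single 'ship' class here).
        self.obj = nn.Conv2d(mid_ch, 1, 5, padding=2)
        self.box = nn.Conv2d(mid_ch, 4, 5, padding=2)
        self.cls = nn.Conv2d(mid_ch, 1, 5, padding=2)

    def forward(self, x: torch.Tensor):
        fused = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return self.obj(fused), self.box(fused), self.cls(fused)
```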
4.4. Loss Function: Direction IoU (DIoU)
The object detection loss function, as depicted in (7), is decomposed into two components: an Intersection over Union (IoU) loss and a category loss:

$$L = L_{IoU} + L_{cls} \tag{7}$$

where, since the proposed method in our study solely focuses on ship position detection without considering categorization, $L_{cls}$ is zero and will not be discussed further.

Firstly, in object detection, IoU, initially employed to quantify the agreement between the predicted bounding box and the ground truth bounding box, is defined as

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{8}$$

where $A$ and $B$ represent any two polygonal boxes. In early target detection networks, the loss function used during backpropagation was defined as

$$L_{IoU} = 1 - IoU \tag{9}$$
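For reference, the standard axis-aligned IoU of (8) can be computed as follows (a plain-Python sketch; the box format (x1, y1, x2, y2) is assumed), so that the loss of (9) is simply 1 - iou(pred, gt):

```python
def iou(a, b):
    """Axis-aligned IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```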
The ship range-compressed data differ significantly from conventional optical remote sensing images and SAR images, necessitating a redesign of the loss function to enhance network training. To better suit ship detection in the range-compressed domain of SAR, we devise a novel IoU loss function named Direction IoU (DIoU) that aligns more effectively with the target's shape characteristics. The expression of DIoU is given in (10), where $\Delta$ represents the distance loss and $\Omega$ represents the shape loss. The following provides the precise definition and calculation of each in turn.
4.4.1. Distance Loss
As depicted in Figure 1, the SAR range-compressed domain image demonstrates range-dimensional focusing of the ship while exhibiting azimuth-dimensional defocusing. It is therefore crucial to ensure utmost precision in detecting range cells while allowing a certain level of compromise in azimuthal accuracy. Consequently, during network training, the prediction of bounding boxes should be treated meticulously with respect to horizontal movement, whereas vertical movement can be handled with relatively more flexibility. In conclusion, it is imperative to devise a loss function that fully accounts for the impact of the distance between the predicted and ground-truth bounding boxes when the angle $\alpha$ approaches 0 and minimizes its influence as $\alpha$ trends towards $\pi/2$.
The calculation of the proposed distance loss is illustrated by (11), where $\Lambda$ represents the angle cost; the calculation of $\Lambda$ is presented in (12). $\rho_x$ ($\rho_y$) represents the squared ratio of the difference between the center-point abscissa (ordinate) of the ground-truth box and that of the predicted box to the width (height) of the smallest enclosing rectangle, as depicted in (13).
As depicted in Figure 8, $\alpha$ represents the azimuth angle between the ground truth bounding box and the predicted bounding box, $c_h$ denotes the vertical displacement between their center points, $c_w$ signifies their horizontal separation, and $\sigma$ indicates the distance between the two center points. The variables $C_w$ and $C_h$ represent the width and height, respectively, of the minimum enclosing box encompassing both bounding boxes.
where $(b_{cx}, b_{cy})$ and $(b_{cx}^{gt}, b_{cy}^{gt})$ represent the center-point coordinates of the predicted box and the ground-truth box, respectively.
As $\alpha$ approaches 0, it can be observed in Figure 9c that $\Lambda$ tends to approach 0. In Formula (11), when the distance ratio remains constant, a smaller value of $\Lambda$ corresponds to a smaller exponential term, indicating a greater distance loss, as shown in Figure 9d. Consequently, in order to achieve the same low loss during descent, the predicted bounding box needs to undergo more displacement towards the ground-truth bounding box as $\alpha$ increases.
4.4.2. Shape Loss
The calculation of the shape loss is presented in (17). The term $\omega_w$ ($\omega_h$) is defined as the absolute difference between the widths (heights) of the ground-truth and predicted bounding boxes, divided by the maximum of the two widths (heights). The geometric relationship between the ground truth bounding box and the predicted bounding box is shown in Figure 10. The calculation of $\omega_w$ ($\omega_h$) is presented in (18).
By combining Formulas (9)–(11) and (17), the final expression of the DIoU loss is obtained.
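For readability, one plausible assembled form is sketched below, assuming the distance and shape terms enter with the equal weighting used by SIoU-style losses; the exact weighting and the modified angle cost $\Lambda$ are defined by Equations (10)–(12) and (17) of the original, so this is a reading aid rather than the authors' formula.

```latex
% Hedged sketch: an SIoU-style assembly is assumed, not confirmed by the text;
% gamma couples the modified angle cost \Lambda into the distance term.
L_{DIoU} = 1 - IoU + \frac{\Delta + \Omega}{2},
\qquad
\Delta = \sum_{t \in \{x,\,y\}} \bigl(1 - e^{-\gamma \rho_t}\bigr),
\qquad
\Omega = \sum_{t \in \{w,\,h\}} \bigl(1 - e^{-\omega_t}\bigr)^{\theta}
```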
4.5. Anchor-Free
Among the conventional object detection algorithms, the anchor-based algorithm is a prevalent approach. This technique necessitates performing an anchor-bias operation on the dataset, which involves clustering the width and height of annotated objects to derive a set of prior dimensions. Subsequently, during model training, the network optimizes its predicted bounding box dimensions based on this set of priors.
However, our proposed FastRCDet adopts an anchor-free algorithm. In contrast to anchor-based algorithms, it eliminates the need to predefine anchor boxes from prior width and height statistics of the ground truth. Instead, it transforms target detection into a key-point detection problem: the target is treated as a single point represented by its center coordinates, and detection is achieved by regressing the center point and its distances to the sides of the target box.
This simplification streamlines post-processing and reduces computational overhead. Another distinction is that, whereas anchor-based algorithms typically associate N anchors with each feature point on a feature map, our anchor-free method associates only one candidate box with each feature point. Consequently, it also offers significant advantages in inference speed.
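As an illustration of this center-plus-distances formulation, the sketch below decodes FCOS-style regression maps into boxes; the (left, top, right, bottom) parameterisation and the stride-based grid are assumptions, since the paper does not spell out its exact encoding.

```python
import torch

def decode_anchor_free(reg: torch.Tensor, stride: int) -> torch.Tensor:
    """Decode per-location distances into boxes.

    reg: (B, 4, H, W) predicted pixel distances (left, top, right, bottom)
    from each feature-map location to the box sides.
    Returns (B, H*W, 4) boxes as (x1, y1, x2, y2) in image coordinates.
    """
    b, _, h, w = reg.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=reg.device, dtype=torch.float32),
        torch.arange(w, device=reg.device, dtype=torch.float32),
        indexing="ij",
    )
    # Each feature point maps to the centre of its stride-sized image cell.
    cx, cy = (xs + 0.5) * stride, (ys + 0.5) * stride
    left, top, right, bottom = reg.unbind(dim=1)          # each (B, H, W)
    boxes = torch.stack((cx - left, cy - top, cx + right, cy + bottom), dim=-1)
    return boxes.view(b, h * w, 4)
```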
5. Experiments
To validate the efficacy of FastRCDet, an extensive array of experiments was conducted on an available ship dataset within the range-compressed domain of SAR. This section initially outlines the dataset preparation, experimental setup, evaluation metrics, and training specifics. Subsequently, ablation experiments were performed to substantiate the effectiveness of each proposed module. Furthermore, a comparative analysis was carried out to juxtapose the proposed FastRCDet with the most commonly used detectors.
5.1. Dataset
FastRCDet is a supervised deep learning detector. It is crucial to have an adequate number of range-compressed domain data samples with target labeling in order to ensure that the trained network model possesses strong reasoning capabilities, specifically for detecting ships in test images.
In this study, we employ RCShip [53], a publicly available ship detection dataset in the range-compressed domain. Table 1 presents the relevant parameters of the images. The dataset consists of 18,672 images of 1024 × 1024 pixels each, showing distinct ships over extensive sea-surface areas. The samples encompass diverse ships across various scenarios and have been meticulously annotated by experts to ensure precise and dependable reference data.
During the training phase, utilizing these labeled compressed domain data samples enables the network model to better comprehend and identify various types of targets. Consequently, it effectively guides the network model in acquiring the capability to differentiate and locate each target within test images.
5.2. Implementation Details
The experiments in this paper were conducted on the same equipment, whose specific environmental parameters are presented in Table 2. For network training, a batch size of 16 was employed, with an input image size of 1024 × 1024. Training ran for 200 epochs, starting from an initial learning rate of 0.001 and gradually decreasing to a final learning rate of 0.0001. Stochastic Gradient Descent (SGD) was used as the optimizer, with a momentum of 0.937 and a weight decay of 0.0005.
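In PyTorch terms, the stated optimizer settings correspond to a configuration like the sketch below; the model is a placeholder, and the linear shape of the decay from 0.001 to 0.0001 is our assumption (the text only gives the two endpoints).

```python
import torch

def make_optimizer(model: torch.nn.Module, epochs: int = 200):
    """SGD with the stated hyperparameters plus a 0.001 -> 0.0001 LR decay."""
    opt = torch.optim.SGD(
        model.parameters(),
        lr=1e-3,             # initial learning rate
        momentum=0.937,
        weight_decay=5e-4,
    )
    # end_factor 0.1 scales the initial 1e-3 down to the final 1e-4.
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1.0, end_factor=0.1, total_iters=epochs
    )
    return opt, sched
```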
5.3. Evaluation Metrics
The performance evaluation of the implemented detector incorporates the F1-score, average precision (AP), frames per second (FPS), parameters (Params), and floating-point operations (FLOPs) as evaluation metrics. These metrics are computed as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$FLOPs = \sum_{i=1}^{L} M_i^2 \, K_i^2 \, C_{i-1} \, C_i$$

where $TP$ represents the count of accurately detected ships, $FP$ denotes false alarms (the number of non-ship targets mistakenly identified), and $FN$ stands for missed alarms (the count of undetected ships). $N_I$ represents the total number of images processed when measuring FPS, $M_i$ represents the spatial dimension of the $i$-th feature map, $C_i$ the number of channels of the $i$-th feature map, $K_i$ the size of the $i$-th convolution kernel, and $L$ the total number of layers.
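A minimal sketch of the count-based part of these metrics follows; TP/FP/FN are assumed to come from IoU-based matching of detections to labels at a fixed threshold (e.g., 0.5), which the text does not restate here.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```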
5.4. Experimental Results
5.4.1. Comparison Experiments between the Lightweight Detection Methods and FastRCDet
In the experiments, we conducted a comparative analysis between FastRCDet and other state-of-the-art CNN-based object detectors, including a two-stage network (Faster R-CNN), one-stage networks (YOLOv4-tiny, YOLOv5-n, and YOLOv7-tiny), and the anchor-free network YOLOX-nano. The specific results can be found in Table 3.
According to Table 3, the various detection networks exhibit clearly different performance. Among the lightweight models in particular, the proposed FastRCDet demonstrates conspicuous advantages.
Firstly, in terms of accuracy, FastRCDet achieved a relatively high level compared to the lightweight models YOLOv4-tiny, YOLOv5-s, and YOLOv7-tiny, with its F1-score reaching 68.81%. This surpasses the performance of YOLOv5-s (47.55%) and YOLOv7-tiny (49.09%), indicating that FastRCDet exhibits superior overall performance in target detection tasks. Secondly, in terms of computational efficiency, FastRCDet achieves a frame rate (FPS) of 38.02, surpassing the respective rates of 30.03 for YOLOv4-tiny and 24.03 for YOLOv5-s. This improvement highlights its suitability for real-time scenarios with stringent time constraints.
The FastRCDet model also demonstrates competitiveness in terms of model size and computation, with a mere 2.49 M parameters and 8.73 GFLOPs required. This confers a significant advantage for deployment in resource-constrained environments. In summary, as a lightweight ship detection network, FastRCDet not only exhibits exceptional accuracy and recall rate but also offers substantial benefits in real-time performance and resource utilization. These attributes render it suitable for application scenarios that demand stringent computing resources and rapid response times.
According to the experimental results presented in Table 3, FastRCDet achieves a superior detection speed and the most lightweight model size. Among models of similar size, YOLOv4-tiny and YOLOX-tiny are the closest contenders. To provide visual evidence, we include the test results of FastRCDet alongside these two models in Figure 11 and Figure 12. Figure 11 depicts scenarios with sparser ship distributions, while Figure 12 shows scenarios with denser ship distributions. Notably, across the eight selected scenarios, FastRCDet performs on par with YOLOX-tiny, whereas YOLOv4-tiny exhibits more false and missed detections, in line with our actual test findings.
In summary, the FastRCDet model achieves rapid detection speed with minimal model parameters and computational requirements, while maintaining performance that is comparable to the state-of-the-art real-time detector YOLOX.
5.4.2. Comparison Experiments between the Mainstream Lightweight Backbone and LANet
To validate the efficacy of LANet in ship detection, we conducted experiments employing diverse backbones while keeping the detection head constant. The experimental findings are presented in Table 4.
According to Table 4, LANet, as proposed in this paper, exhibits significant superiority over mainstream lightweight backbones (MobileNetv2 [21], GhostNetv2 [25], and ShuffleNetv2 [24]) when employing the same detection head (MSFH + DIoU). Specifically, LANet achieves an AP50 of 77.12%, surpassing the other backbone networks (48.71% for MobileNetv2, 49.26% for GhostNetv2, and 42.13% for ShuffleNetv2). These results demonstrate that LANet enhances detection accuracy in ship detection tasks by a substantial margin.
The performance of LANet on the AP50:95 indicator was also notable, reaching 39.70% against the other backbones (17.30% for MobileNetv2, 18.42% for GhostNetv2, and 13.21% for ShuffleNetv2), which demonstrates the robustness of LANet across a wider range of IoU thresholds.
Furthermore, LANet exhibits remarkable computational efficiency, surpassing other backbone networks with an impressive FPS of 38.02 (compared to MobileNetv2’s 33.34, GhostNetv2’s 27.45, and ShuffleNetv2’s 15.55). This exceptional performance enables real-time target detection at significantly faster speeds, making it well-suited for scenarios requiring prompt responses.
In conclusion, LANet, proposed as the novel backbone, exhibits remarkable performance advantages for ship detection tasks, encompassing high precision, efficiency, and adaptability. It is well-suited for application scenarios that necessitate real-time inspections at high speeds while ensuring accuracy.
5.4.3. Comparison Experiments of the YOLOX Head and CIoU Loss
To evaluate the efficacy of MSFH in ship detection, we conducted comparative experiments utilizing the detection head of YOLOX [36], a widely acknowledged state-of-the-art anchor-free detector. Simultaneously, to assess the effectiveness of DIoU, we employed the CIoU (Complete IoU) loss function for comparison while keeping the backbone consistent. The experimental findings are presented in Table 5.
- (1)
YOLOX Head vs. MSFH:
According to Table 5, the YOLOX head paired with the two loss functions (CIoU and DIoU) yields 58.24%/20.35% and 60.30%/22.61% on the AP50/AP50:95 metrics, respectively. MSFH, using the same two loss functions under identical conditions, achieves significantly better results of 71.26%/35.19% and 77.12%/39.70% on the same indicators, demonstrating superior detection accuracy compared to the YOLOX head.
The YOLOX head achieves 25.36 FPS, while MSFH outperforms it at 38.02 FPS, indicating superior processing speed as well. Moreover, the parameter count is 9.02 M for the YOLOX head versus only 2.49 M for MSFH, indicating lower model complexity, and the computational cost is significantly higher for the YOLOX head at 34.62 GFLOPs compared to 8.73 GFLOPs for MSFH.
In conclusion, from the perspective of the detection head, MSFH outperforms YOLOX Head in all metrics, indicating that MSFH exhibits higher accuracy and efficiency in ship detection tasks in the SAR range-compressed domain.
- (2)
CIoU vs. DIoU:
The experimental results in Table 5 demonstrate the significant impact of the choice of loss function. Specifically, under the YOLOX head, DIoU outperforms CIoU, increasing AP50 from 58.24% to 60.30% and AP50:95 from 20.35% to 22.61%.
When employing MSFH, CIoU yields an AP50 of 71.26% and an AP50:95 of 35.19%, whereas DIoU achieves superior performance with an AP50 of 77.12% and an AP50:95 of 39.70%. These results indicate that DIoU outperforms CIoU on both indicators under MSFH.
From the perspective of the loss function, DIoU outperforms CIoU across all metrics, thereby suggesting that employing the DIoU loss is more appropriate for ship detection in the range-compressed domain.
5.4.4. Ablation Experiments to Examine the Effect of Neck
To ascertain whether the neck is dispensable for ship detection in the range-compressed domain of SAR, we conducted experiments with different necks while maintaining otherwise consistent conditions. The experimental results are presented in Table 6.
By eliminating the neck structure and utilizing LANet as the backbone, a remarkable improvement in frame rate to 38.02 FPS was achieved, surpassing the performance of FPN (15.86 FPS) and PAFPN (14.01 FPS). This underscores the exceptional efficiency of processing extensive data without relying on the neck structure.
The number of parameters in the no-neck configuration is merely 2.49 M, accompanied by a computation of 8.73 GFLOPs, which significantly contrasts with the parameter count and computation involved when employing the neck structure. This not only diminishes hardware requisites and operational expenses but also enhances the model’s applicability within resource-constrained environments.
Despite the absence of the neck structure, our proposed method exhibits commendable detection accuracy. The achieved AP50 of 77.12% and AP50:95 of 39.70% are comparable to those obtained with a neck, underscoring that the neck-free configuration significantly enhances real-time performance and computational efficiency without compromising precision.
The advantage of the neck-free structure is particularly suitable for applications that require rapid processing of large amounts of data. High frame rates and low latency improve the system’s response speed, helping real-time ship monitoring systems operate more efficiently.
6. Conclusions
To address the inadequate lightweighting of existing methods for the SAR range-compressed domain, we propose FastRCDet, a novel ship detection method for the range-compressed domain. It simplifies the network structure and reduces parameter count and computational complexity, thereby enabling real-time ship detection on board. The specific innovations are outlined below:
A new lightweight backbone, LANet. LANet employs adaptive kernel convolution to dynamically adjust convolution parameters, thereby reducing the number of model parameters and enhancing the effectiveness of depth feature extraction in the range compression domain.
A new single detection head, MSFH. By integrating feature maps of different scales, MSFH adeptly adapts to ship shape features within the range-compressed domain, thereby mitigating the potential degradation of detection performance caused by reduced network models.
A new loss function, DIoU. Considering the geometric attributes of ships, a new loss function called DIoU is designed to enhance the adaptability of our network to range-compressed domain characteristics.
To validate the feasibility and efficacy of our approach, ship detection experiments were conducted employing publicly available range-compressed datasets. Our results demonstrate that FastRCDet achieves a significant reduction in parameter count and computational complexity, with about a 27% improvement in real-time detection speed compared to existing methods.
Future research will further optimize the network structure to improve detection performance and real-time capability. We will also carry out targeted optimization for the characteristics of embedded edge platforms, including but not limited to adopting more advanced algorithms, increasing hardware resources, and improving imaging algorithms. The model demonstrates accurate results in validation; however, collecting more appropriate ship data is necessary to enrich the dataset and to evaluate and refine the network thoroughly. Through these efforts, we hope to make deep learning technology serve SAR ship detection better and to provide a reference for the radar community.