1. Introduction
China has more than 18,000 km of coastline and a maritime area of more than 3 million square kilometers. Maritime security plays a vital role in our national defense security. However, in recent years, illegal fishing, drug trafficking, illegal immigration, and other illegal maritime activities have become common, and some countries frequently send ships to perform “tours” in China’s waters. Therefore, ship detection technology plays an important role in protecting homeland security and monitoring illegal maritime activities [1].
Synthetic aperture radar (SAR) is an active Earth observation system that can be installed on aircraft, satellites, spacecraft, and other flight platforms. It penetrates cloud, fog, and rain well, is unaffected by illumination, can observe the Earth in real time all day and in all weather conditions, and has a certain surface penetration capability [2]. Therefore, the SAR system has unique advantages in disaster, environmental, and marine monitoring [1,2,3,4,5]; resource exploration; crop yield estimation; surveying; mapping; military operations [4,5,6,7]; and other applications. It fills a role that other remote sensing methods find difficult to play, and an increasing number of countries have therefore paid attention to it.
Owing to the characteristics of SAR imagery, ship detection in SAR images has become an important research direction for SAR image applications [1,2,3,4,5,6,7]. Numerous high-resolution SAR images can be obtained from airborne and spaceborne SAR systems, and ships on the ocean are clearly visible in them. Such images can therefore be used to detect ships and other targets, which is conducive to improving the coastal defense capability of our country.
At present, much research has been conducted on deep learning [8,9,10,11]. Traditional ship detection in SAR images commonly relies on methods such as template matching [12], support vector machines [13], linear interpolation [14], principal component analysis [15], a combination of multimode dictionary learning and sparse representation [16], and the constant false alarm rate (CFAR) [17]. SAR images are easily affected by various environmental factors, such as speckle noise and background clutter, which makes it difficult to extract the features of the target of interest. Traditional methods usually rely on manually extracted features, but they extract relatively few features, manual feature selection is difficult, and the missed-detection rate is high, which ultimately degrades detection performance. Therefore, an increasing number of scholars and research institutions have begun to study real-time detection of SAR ship targets.
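To make the CFAR idea concrete, the sketch below shows a minimal two-dimensional cell-averaging CFAR detector. This is a generic illustration rather than the specific CFAR variant used in the cited work; the guard/training window sizes and the exponential-clutter threshold factor are assumptions chosen for illustration.

```python
import numpy as np

def ca_cfar_2d(image, guard=2, train=8, pfa=1e-4):
    """Minimal cell-averaging CFAR on a SAR intensity image.

    For every cell under test (CUT), the clutter level is estimated from the
    ring of training cells surrounding the CUT (excluding a guard band), and
    the CUT is declared a detection if it exceeds that estimate scaled by a
    threshold factor derived from the desired false-alarm rate.
    """
    h, w = image.shape
    half = guard + train                                   # half-width of full window
    n_train = (2 * half + 1) ** 2 - (2 * guard + 1) ** 2   # number of training cells
    # Threshold multiplier assuming exponentially distributed clutter.
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)

    detections = np.zeros_like(image, dtype=bool)
    for r in range(half, h - half):
        for c in range(half, w - half):
            window = image[r - half:r + half + 1, c - half:c + half + 1]
            guard_block = image[r - guard:r + guard + 1, c - guard:c + guard + 1]
            clutter = (window.sum() - guard_block.sum()) / n_train
            detections[r, c] = image[r, c] > alpha * clutter
    return detections
```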
Since the ImageNet competition in 2012, deep learning has developed rapidly. Its powerful data processing and feature learning abilities have attracted widespread attention and recognition. The convolutional neural network (CNN) is the architecture most commonly used in deep learning [9,10,11], and many scholars have studied it extensively [18,19,20,21,22,23,24,25,26,27,28,29,30]. However, most current algorithms [31,32,33,34,35] focus on improving accuracy while ignoring speed and computational cost. Large companies can afford to pretrain large models to optimize the algorithm and improve accuracy, but for smaller companies the cost of training is prohibitive. In addition, accuracy and speed are often in tension: improving accuracy usually slows the model down, which is a weakness in applications that require real-time detection. Lightweight models compress and reduce network redundancy, which greatly reduces storage requirements, effectively improves the training speed and efficiency of the model [36,37], and makes real-time detection feasible in a broader range of scenarios.
Unlike traditional methods that require manual feature design, deep learning methods automatically extract features and achieve end-to-end target detection, and their detection performance is generally superior. Deep-learning-based target detection methods can be divided into two categories. The first is single-stage detection; mainstream single-stage models include the YOLO (You Only Look Once) series [1,2] and the SSD (Single Shot MultiBox Detector) algorithm [38]. Formulating detection as regression, these models directly predict category confidence and target locations on the image. The other category is the two-stage model, which uses a region proposal network to generate a series of candidate boxes containing potential targets and then determines the target category and refines the bounding boxes. Faster R-CNN [9], Feature Pyramid Networks (FPNs) [10], Mask R-CNN [11], and other algorithms based on multiscale feature fusion have been developed along this line. Single-stage models are faster and suited to real-time detection, whereas two-stage models usually achieve higher detection accuracy.
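Both families share the same box post-processing ideas, such as intersection-over-union (IoU) scoring and non-maximum suppression (NMS), which is also the component modified by the Sig-NMS work discussed in the next paragraph. The following is a minimal, generic sketch of IoU and greedy NMS, not the exact procedure used by any particular detector cited here:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, discard boxes that overlap
    it by more than iou_thr, and repeat on the remaining boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thr]
    return keep
```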
In terms of target detection, Dong et al. [31] improved Faster R-CNN by replacing traditional non-maximum suppression (NMS) with Sig-NMS in the region proposal network stage, significantly reducing the probability of missing small targets. Cui et al. [32] proposed a detection method based on a dense attention pyramid network (DAPN), which extracts rich features containing resolution and semantic information and improves the detection performance for multiscale ship targets. For multidirectional target detection, An et al. [33] improved DRBox-v1 with an FPN, focal loss, and an improved encoding scheme, proposing the DRBox-v2 detector, which detects ships in any direction. Li et al. [34] proposed a rotating-region-based residual network (R3-Net) to detect multidirectional vehicles in remote sensing images and videos with high robustness and accuracy. For dense target detection, Wang et al. [35] added the spatial group-wise enhancement (SGE) attention module to CenterNet, which detects densely docked ships well. However, although these methods have achieved satisfactory detection accuracy, they are still computationally expensive, time-consuming, and unsuitable for deployment on devices with limited computing resources and memory. Therefore, it is necessary to design a lightweight target detection model for remote sensing images.
In terms of target detection using SAR images, Feng et al. [1] proposed a new lightweight position-enhanced anchor-free SAR ship detection algorithm called LPEDet, which redesigns a lightweight multiscale backbone with a new position-enhanced attention strategy. Xu et al. [2] designed a lightweight cross stage partial (L-CSP) module to reduce the amount of computation and applied network pruning to obtain a more compact detector. The FASC-Net proposed by Yu et al. [26] is mainly composed of an ASIR block, a focus block, an SPP block, and a CAPE block; this network reduces the number of parameters to a certain extent while maintaining accuracy without losing information. To ensure good detection performance, Hou et al. [39] proposed a ship detection method for SAR images based on a visual attention model that exploits existing priors of ships in the water; this method accurately detects ocean-going ships, but its missed-detection and false-alarm rates for berthing ships are high. Liu and Cao [40] proposed a SAR image target detection method based on a visual attention pyramid model and singular value decomposition (VA-SVD), which has a slow calculation speed and poor detection performance for high-resolution SAR images. Wang et al. [41] proposed a target detection algorithm for high-resolution SAR images in complex scenes based on visual attention; it achieves high detection accuracy but cannot retain the original shape of the target. Yu et al. [42] proposed an efficient lightweight network, Efficient-YOLO, in which a new regression loss function, ECIoU, improves positioning accuracy and model convergence speed, an SCUPA module enhances the generalization ability of the model, and a GCHE module enhances the feature extraction ability of the network. Jiang et al. [43] proposed a three-channel image construction scheme based on NSLP contour extraction, which enriches the contour information of the dataset while reducing the impact of noise. Liu et al. [44] proposed a lightweight YOLOv4-Lite model in which the MobileNetv2 network serves as the backbone feature extraction network and depthwise separable convolution reduces the computational overhead of network training while ensuring the lightweight characteristics of the network. Sun et al. [45] proposed a novel YOLO-based arbitrary-oriented SAR ship detector using bidirectional feature fusion and angular classification (BiFA-YOLO), in which a bidirectional feature fusion module (Bi-DFFM) designed specifically for SAR ship detection is applied to the YOLO framework to effectively aggregate multiscale features for detecting multiscale ships, and an angle classification structure is added to obtain ship angle information more accurately.
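Several of the lightweight designs above rest on depthwise separable convolution, in which a per-channel (depthwise) convolution is followed by a 1x1 pointwise convolution. A minimal PyTorch-style sketch is shown below as a generic illustration; the block structure, activation, and channel sizes are assumptions and do not reproduce any specific cited network.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per input channel) followed by a 1x1
    pointwise convolution. For a k x k kernel this factorization reduces the
    multiply-add count to roughly 1/k^2 + 1/C_out of a standard convolution,
    which is the main source of the savings in MobileNet-style backbones."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: a 3x3 depthwise separable block replacing a standard 3x3 convolution.
x = torch.randn(1, 32, 128, 128)
y = DepthwiseSeparableConv(32, 64)(x)   # -> shape (1, 64, 128, 128)
```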
At present, significant research has been conducted on SAR ship detection [1,2,3,4,5,6,7]. However, SAR images differ greatly from optical images. SAR-based target detection generally relies only on amplitude information, and SAR images are susceptible to environmental factors such as speckle noise and background clutter, which complicates feature extraction for targets of interest. In addition, the sensitivity of the imagery to sensor motion and target pose leads to instability in SAR target signatures. Consequently, target detection algorithms designed for optical images are not entirely applicable to SAR images.
In summary, the following problems must be resolved:
(1) To further improve accuracy, most existing work simply enlarges the model structure, resulting in a large number of parameters that slow down training and inference. This is not only detrimental to real-time detection but also reduces the practicality of the model. At the same time, the complexity and parameter count of the model limit its deployment and adoption to a certain extent.
(2) Some models do not adequately consider location information or computational overhead, which can lead to inaccurate target localization during detection or poor detection performance unless high-end hardware is available.
To this end, we propose a new lightweight network, YOLOv5-MNE, which improves training speed and reduces the memory footprint of SAR ship detection, and we conducted extensive ablation studies for comparison. The main contributions are as follows:
(1) To address the slowdown caused by a large number of model parameters, we designed a lightweight module, MNEBlock. Based on the YOLOv5 [46] network, a lightweight YOLOv5-MNE network was formed by fusing the MNEBlock into the backbone of the base network.
(2) To help the model locate and identify objects of interest more accurately, the coordinate attention (CA) mechanism is introduced. The CA mechanism is flexible and lightweight, avoiding substantial computational overhead while recovering accuracy to a certain extent (a sketch of the general mechanism follows this list).
(3) Extensive ablation experiments were performed to confirm the validity of these contributions. The same experiments were repeated on different datasets, and on datasets of different orders of magnitude, to assess the applicability of the proposed method across datasets and dataset sizes.
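For reference, the sketch below illustrates the general coordinate attention mechanism: features are pooled separately along the height and width directions, encoded jointly, and split into two direction-aware attention maps that rescale the input. It follows the published CA design in spirit but is not the exact configuration used in YOLOv5-MNE; the reduction ratio and the use of ReLU in place of h-swish are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of a coordinate attention (CA) block."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)     # the original CA block uses h-swish
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Directional pooling: (N, C, H, 1) over width and (N, C, W, 1) over height.
        x_h = x.mean(dim=3, keepdim=True)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)
        # Joint encoding of both directions with a shared 1x1 convolution.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Direction-aware attention maps that rescale the input feature map.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w
```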
The remainder of this paper is organized as follows: Section 2 describes the proposed method. Section 3 presents the experimental results. Section 4 describes the ablation experiments. Finally, Section 5 summarizes the article and presents our conclusions.