SAR-NTV-YOLOv8: A Neural Network Aircraft Detection Method in SAR Images Based on Despeckling Preprocessing

Guo, Xiaomeng; Xu, Baoyi

doi:10.3390/rs16183420


by Xiaomeng Guo ¹ and Baoyi Xu ¹,²,*

¹ The State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an 710071, China

² Hangzhou Institute of Technology, Xidian University, Hangzhou 311231, China

* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(18), 3420; https://doi.org/10.3390/rs16183420

Submission received: 21 July 2024 / Revised: 10 September 2024 / Accepted: 11 September 2024 / Published: 14 September 2024

(This article belongs to the Special Issue Artificial Intelligence Algorithm for Remote Sensing Imagery Processing (4th Edition))


Abstract

Monitoring aircraft in synthetic aperture radar (SAR) images is an important task. Because SAR is a coherent imaging system, its images contain a large amount of speckle. Speckle masks the scattering information of aircraft targets, which is then easily confused with background scattering points, so automatic aircraft detection in SAR images remains challenging. This paper proposes a framework that first applies speckle reduction preprocessing to SAR images and then detects aircraft with an improved deep learning method. First, to alleviate the artifacts and excessive smoothing that total variation (TV) methods introduce during speckle reduction, a new nonconvex total variation (NTV) method is proposed. It aims to reduce speckle effectively while preserving the original scattering information as much as possible. Next, an aircraft detection framework for SAR images based on You Only Look Once v8 (YOLOv8) is presented; the complete framework is called SAR-NTV-YOLOv8. A high-resolution small target feature head is proposed to mitigate the impact of scale changes and the loss of deep feature details on detection accuracy. An efficient multi-scale attention module is then introduced to establish short-term and long-term dependencies between feature grouping and multi-scale structures. In addition, a progressive feature pyramid network is adopted to avoid information loss or degradation in multi-level transmission during bottom-up feature extraction in the Backbone. Comparative, speckle reduction, and ablation experiments are conducted on the SAR-Aircraft-1.0 and SADD datasets. The results demonstrate the effectiveness of SAR-NTV-YOLOv8, which achieves state-of-the-art performance compared with other mainstream algorithms.

Keywords: synthetic aperture radar (SAR); deep learning; feature extraction; aircraft detection

1. Introduction

Synthetic Aperture Radar (SAR) represents a sophisticated coherent imaging system capable of generating high-resolution remote sensing imagery irrespective of diurnal cycles or adverse weather conditions [1]. Owing to its inherent advantages, such as uninterrupted all-day and all-weather imaging capabilities, SAR has found extensive applications in diverse domains including forest monitoring, urban feature extraction, maritime surveillance, and beyond [2,3,4,5,6]. Target detection is an important application scenario for SAR, and this topic has aroused great research interest in both academic and industrial fields. The fundamental principle of target detection in SAR imagery involves exploiting the disparities in scattering characteristics to effectively segregate targets from the background and accurately determine their locations. However, the interpretation of SAR images is inherently complex due to the coherent scattering and imaging mechanisms [7]. Specifically, aircraft detection within SAR imagery under intricate conditions remains a formidable challenge, yet it is of paramount importance for critical applications such as airport management and battlefield reconnaissance [8,9,10,11,12].

Due to the intricate nature of SAR imaging mechanisms, aircraft targets manifest as irregular scattering bright spots within SAR imagery [13]. The critical semantic information pertaining to aircraft targets is embedded within these scattering bright spots, necessitating a comprehensive joint analysis for effective detection [14]. Factors that interfere with aircraft detection in SAR data include significant variation in aircraft size, the discrete distribution characteristics of targets, and complex background interference. Classical target detection algorithms for SAR imagery fall into three main categories: mining structural and geometric features of the targets, extracting texture features, and statistical analysis of the data distribution. For specific targets, prior knowledge of their structural or geometric features can significantly enhance detection accuracy. Methods based on the constant false alarm rate (CFAR) and its improved variants were once among the classic approaches to aircraft detection in SAR images. However, in environments with complex backgrounds, scattering points from ground clutter can interfere with the aircraft targets, rendering these methods less effective.

An algorithm that detects aircraft in SAR images by focusing on their geometric features is proposed in [15]. Using the Canny operator, Guo et al. [16] propose an aircraft edge detection method. Texture features describe visual attributes through directional gradient distribution and visual saliency. A multi-component model that combines target structural information with a mixed statistical distribution is proposed in [17]. Nevertheless, these conventional feature extraction techniques struggle to capture high-level feature information from SAR data.

With the evolution of neural network architectures, convolutional neural networks (CNNs) have demonstrated powerful feature mining capabilities. CNNs learn high-dimensional visual features that possess robust discrimination and generalization abilities. However, deep learning-based target detection methods have some drawbacks. They require large amounts of training data to enhance model generalization, and conventional optical image detection methods struggle with complex background interference and aircraft edge detection in SAR images. Thus, a specialized network design is needed for better detection results. Popular CNN architectures include Faster R-CNN [18], RetinaNet [19], and FCOS [20]. Nowadays, target detection tasks are primarily based on feature learning and representation. CNNs also perform exceptionally well in SAR-based target detection.

Zhao et al. [21] introduce a novel SAR aircraft detection framework that fully exploits variant convolution and attention, establishing a new backbone structure that enhances the ability to extract aircraft features from SAR images. Building on the SSD target detection framework, the authors in [22] incorporate a synergistic approach to improve the precision of ordinary object detection methods in detecting aircraft targets from SAR data. To mitigate false alarms, the method in [23] implements an airport runway mask and devises a weighted feature fusion module to improve detection precision. The authors in [24] propose an advanced feature enhancement strategy coupled with an attention pyramid network, facilitating precise detection of SAR aircraft. The authors in [25] introduce an innovative scattering feature extraction framework, which strengthens the correlation between aircraft scattering points, ensuring accurate aircraft detection. A novel pyramid model based on attention fusion is proposed in [26] for detecting aircraft targets with blurred features in SAR images and is reported to achieve high accuracy.

However, the coherent imaging mechanism inherent to SAR images introduces multiplicative speckle noise, which significantly degrades the visual quality of the images and poses substantial challenges to their processing, interpretation, and analysis [27]. Applying deep learning for aircraft detection is greatly enhanced by performing speckle reduction preprocessing on the SAR dataset. After speckle reduction, the quality of the SAR images improves, significantly enhancing the clarity and distinguishability of features, which makes the aircraft more recognizable and provides better input for subsequent deep learning models. If the quality of the input SAR images is poor, the deep learning model may require more complex architectures and additional training data to learn the characteristics of the noise. In fact, this process can be likened to the data augmentation techniques commonly used in the field of deep learning, where preprocessing the original dataset allows the object detection neural network to focus more on extracting the features of the target.

Variational methods have gained popularity due to their ability to preserve edge features effectively. Rudin et al. [28] pioneered the total variation (TV) based model, which remains the most widely utilized approach within this category [29]. Building on this, the authors in [30] develop a nonconvex model, called AA, designed to mitigate speckle noise in SAR images; it combines a nonconvex fidelity term with a variational regularization term. Despite the AA model’s efficacy in despeckling, its practical application is hindered by high computational complexity. To address this, Bioucas introduces a strictly convex model through a logarithmic transformation [31], which can be solved efficiently using a variable splitting method. Nevertheless, this approach also falls short in preserving textures and frequently produces a staircase effect, which compromises its speckle reduction performance.

In 2017, a CNN-based speckle reduction method for SAR images was proposed [32]. In 2020, a new CNN-based speckle reduction method was introduced that leverages the non-local characteristics of SAR images [33]. At the same time, Dalsasso proposed a pre-training approach using a Gaussian noise model, followed by speckle reduction training [34]. However, these CNN-based methods for SAR image despeckling often require substantial computational resources, especially with large training datasets. To reduce computational complexity, a lightweight despeckling CNN, SAR-CNN-M-xUnit, was proposed [35]. Although this method reduces computational costs, it does not improve despeckling performance. Overall, current CNN-based methods do not demonstrate sufficient advantages, mainly because high-quality training data are lacking. These methods typically train on synthetic speckle images, using the original clean images as labels. This can be seen as using a CNN to invert the formula that generates the speckle, rather than actually performing despeckling.

As mentioned above, most current research on speckle reduction is still based on non-machine learning methods. We believe this is primarily due to the challenges in effectively training neural networks. Achieving good denoising results using Convolutional Neural Networks (CNNs) typically requires a large amount of labeled data for training, and obtaining high-quality labeled data in SAR images can be relatively difficult. If we want to enhance the speckle reduction capability in object detection models, the number of parameters in the target network will inevitably increase, requiring more computational resources and time. Furthermore, in some cases, the model may learn noise instead of useful signals, leading to suboptimal denoising performance.

We propose a new framework for SAR images to achieve aircraft detection in complex scenes. The proposed method performs speckle reduction preprocessing on SAR images, and then inputs it into the YOLOv8 based neural network designed for SAR images for feature extraction.

To alleviate the impact of scale mutation and loss of deep feature detail information on detection accuracy, we adjust the original YOLOv8 detection network. The small target detection head module based on a high-resolution feature map replaces the original ordinary detection head module. Meanwhile, a cross-space learning module is employed to adaptively focus on target features, thereby mitigating the adverse effect of speckle. Subsequently, the network merges the high-level mappings and passes them to the global attention module, aiming to extract contextual information from the entire feature extraction net. The proposed method enhances the aircraft detection capability based on YOLOv8 backbone. The main contributions of this paper are as follows:

A paradigm is proposed in which speckle reduction preprocessing of SAR images is followed by a CNN-based aircraft detection network. In the despeckling preprocessing stage, a new noise reduction model based on nonconvex total variation is proposed. This approach effectively smooths the image while preserving its details. By significantly enhancing aircraft scattering information and reducing background clutter and speckle noise, despeckling allows the subsequent neural detection network to extract aircraft features more effectively.
For the aircraft detection task in SAR images, this paper chooses YOLOv8 as the basic structure (SAR-YOLOv8). A high-resolution small target feature head detector suitable for YOLOv8 is adopted. We introduce an innovative attention pyramid that combines local and contextual features. In SAR-YOLOv8, local attention dynamically captures the target’s local characteristics, while the cross-scale attention mechanism helps SAR-YOLOv8 collect crucial contextual information, thereby effectively minimizing false positives. We then propose a progressive feature network with spatial feature fusion, which lets the multi-scale feature information generated at each stage make effective use of the multi-stage context, so as to significantly strengthen the detection precision for aircraft targets in the scattering area.
We construct the SAR-NTV-YOLOv8 aircraft detection framework and conduct extensive comparative, ablation, and despeckling experiments. The high accuracy achieved on SAR datasets shows that SAR-NTV-YOLOv8 is suitable for aircraft detection, and the experiments show that it achieves state-of-the-art detection capability compared with other object detection networks.

The structure of this paper is as follows. Section 2 reviews related work, mainly SAR image speckle reduction and the YOLOv8 network structure. Section 3 presents the overall architecture of the SAR-NTV-YOLOv8 method, the speckle reduction process, and the network structure designed in this paper for aircraft recognition in SAR images. Section 4 presents the datasets, experiments, and corresponding analysis. Section 5 provides a comprehensive discussion of the results. Finally, Section 6 summarizes the paper.

2. Related Work

2.1. SAR Image Despeckling Method

We focus on despeckling methods based on spatial filtering. Some early algorithms, such as the Lee filter, Kuan filter, and Frost filter, suppress speckle noise well and need very few computing resources. However, these methods often reduce the resolution of the image and obscure the details of scattering targets. Recently, variational methods have made significant progress as a powerful tool for image denoising. Total variation regularization is widely used in variational methods [36]. Under the Bayesian maximum a posteriori (MAP) framework, Aubert et al. [30] propose the nonconvex AA model. Because the nonconvex terms in the AA model are computationally complex, Shi and Osher propose a low-complexity, strictly convex model as an improvement [29]. To protect the texture of the target structure and ensure the feasibility of the result, Bioucas et al. [31] propose the MIDAL method, which uses a convex data fidelity term and a TV regularizer to reduce speckle. Ordinary TV regularizers are still widely used due to their convexity, but they are prone to generating staircase effects [37,38]. To address these issues, Han Yu propose a new nonconvex TV regularizer that can effectively despeckle uniform regions of the image while preserving edges.

2.2. YOLOv8

YOLOv8 is a target detection method released by Ultralytics in 2023 and built on YOLOv5 [39]. It integrates state-of-the-art techniques and introduces new features and improvements that further enhance model performance and flexibility, making YOLOv8 faster, more accurate, and easier to use. Moreover, YOLOv8 supports a wide range of usage scenarios, such as object tracking, instance segmentation, image classification, and pose estimation. Its most significant improvement lies in its excellent scalability: the designers have created a framework that is compatible with earlier YOLO versions, allowing developers to switch freely between them and conduct experimental comparisons. Compared to YOLOv5 [40,41], YOLOv8 is more suitable for engineering application scenarios.

3. Proposed Method

3.1. Overall Scheme of the Proposed Method

To facilitate understanding, the complete aircraft detection process proposed in this paper is shown in Figure 1. The first module, with a yellow background, is the despeckling preprocessing, which is detailed in Section 3.2. The second part is the backbone network for SAR image aircraft detection, which is detailed in Section 3.3. The aircraft detection network is developed based on the YOLOv8 framework and the SAR image features of aircraft, with improvements to the detection head, feature extraction network, and attention mechanism. This network can export the fine and discriminative features of continuous detectors to accurately achieve the final regression detection.

3.2. Despeckling Preprocessing Module

3.2.1. SAR Image Multiplicative Noise Signal Model

Let $Y \in \mathbb{R}_{+}^{n}$ represent the imaging result of the raw SAR data, $X \in \mathbb{R}_{+}^{n}$ represent the clean SAR image to be recovered by despeckling, and $N \in \mathbb{R}_{+}^{n}$ be the multiplicative noise, which follows the Gamma distribution. SAR data are therefore usually represented as

$$Y = X \cdot N. \tag{1}$$

Taking the logarithm of both sides of (1) yields

$$\underbrace{\log Y}_{G} = \underbrace{\log X}_{U} + \underbrace{\log N}_{W}. \tag{2}$$
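As an illustration of Equations (1) and (2), the following minimal sketch simulates unit-mean Gamma speckle on a flat scene and checks that the logarithm turns the product into a sum. The function name `add_speckle`, the image size, the intensity value 5.0, and `looks=4` are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def add_speckle(clean, looks, rng):
    """Return (Y, N) with Y = X * N and N ~ Gamma(shape=M, scale=1/M), as in Eq. (1)."""
    # Gamma(shape=M, scale=1/M) has mean 1 and variance 1/M, so the speckle
    # does not change the average intensity of the scene.
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * noise, noise

rng = np.random.default_rng(0)
clean = np.full((256, 256), 5.0)      # hypothetical flat, strictly positive scene X
noisy, noise = add_speckle(clean, looks=4, rng=rng)

# Eq. (2): in the log domain the noise is additive, G = U + W.
g, u, w = np.log(noisy), np.log(clean), np.log(noise)
additive = np.allclose(g, u + w)

# On a flat region, mean^2 / variance of the intensity estimates the number of looks M.
enl_estimate = noisy.mean() ** 2 / noisy.var()
```

On this flat scene `enl_estimate` should land close to the chosen number of looks, which is one common way to sanity-check a speckle simulation.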

Here, the probability density function (PDF) of the noise term $W$ is

$$p_{W}(w) = p_{N}(e^{w})\, e^{w} = \frac{M^{M}}{\Gamma(M)}\, e^{M w}\, e^{-M e^{w}}, \tag{3}$$

where $\Gamma(\cdot)$ denotes the Gamma function and $M$ is the equivalent number of looks (ENL) of the SAR image. The distribution in (3), also known as the Fisher-Tippett distribution, is well studied in probability and statistics. Under the assumption of conditional independence, the authors in [31] rigorously derived the following relationship

$$\begin{aligned} \log p_{G|U}(G \mid U) &= \log p_{W}(G - U) \\ &= \underbrace{M \log M + M G - \log \Gamma(M)}_{\text{constant}} - M U - M e^{G - U} \\ &= C - M \left(U + e^{G - U}\right), \end{aligned} \tag{4}$$

where C is a constant that does not affect the optimization objective. In general, despeckling methods take the negative of the non-constant part of (4) as the data fidelity term. The classical despeckling problem is then modeled as the following unconstrained minimization problem

$$\min_{u} \left\{ M \sum_{i=1}^{n} \left(u_{i} + e^{g_{i} - u_{i}}\right) + \lambda\, \phi(u) \right\}, \tag{5}$$

where $\phi(u)$ is the regularization term, $\lambda$ is its weight, $g_{i} = \log y_{i}$ is the logarithm of the pixel value of the SAR imaging result at pixel $i$, and $u_{i} = \log x_{i}$ is the logarithm of the pixel value of the optimized result at pixel $i$.
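The data fidelity term in (5) can be checked numerically. The sketch below, with the hypothetical names `data_fidelity` and `data_fidelity_grad` and made-up pixel values, evaluates the term and its gradient $M(1 - e^{g_i - u_i})$. The gradient vanishes at $u = g$, so without the regularizer the minimizer is simply the noisy log-image, and all smoothing comes from $\lambda \phi(u)$.

```python
import numpy as np

def data_fidelity(u, g, looks):
    """M * sum_i (u_i + exp(g_i - u_i)), the first term of Eq. (5)."""
    return looks * np.sum(u + np.exp(g - u))

def data_fidelity_grad(u, g, looks):
    """Gradient of the fidelity term with respect to u: M * (1 - exp(g - u))."""
    return looks * (1.0 - np.exp(g - u))

looks = 4                                     # assumed ENL for the example
g = np.log(np.array([0.5, 2.0, 7.0]))         # toy log-intensities g_i = log y_i

grad_at_g = data_fidelity_grad(g, g, looks)   # zero at u = g
value_at_g = data_fidelity(g, g, looks)
value_shifted_up = data_fidelity(g + 0.3, g, looks)    # larger than value_at_g
value_shifted_down = data_fidelity(g - 0.3, g, looks)  # larger than value_at_g
```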

3.2.2. Based on Convex Total Variation Regularization Model

In speckle reduction research, the TV method proposed in [29] is very important. The TV of an image is expressed as

$$\|u\|_{TV} = \sum_{i} \sqrt{(\nabla_{x} u)_{i}^{2} + (\nabla_{y} u)_{i}^{2}}, \tag{6}$$

where $(\nabla_{x} u)_{i}$ denotes the horizontal pixel difference at pixel $i$ and $(\nabla_{y} u)_{i}$ denotes the vertical pixel difference at pixel $i$. Because SAR images contain a large amount of detailed structural texture, TV optimization results may exhibit staircase artifacts. To overcome this problem while also considering the effect of nonconvex relaxation methods in alleviating computational pressure, a new TV-based regularizer is proposed, which takes the following form

$$\phi(\|u\|_{TV}) = \frac{\alpha \|u\|_{TV}}{(\varepsilon + \alpha \|u\|_{TV})^{p}}, \tag{7}$$

where $\varepsilon$ is a small constant that keeps the expression well defined and $0 \leq p \leq 1$. Merging (5) and (7) gives a new speckle reduction model

$$\min_{u} \left\{ M \sum_{i=1}^{n} \left(u_{i} + e^{g_{i} - u_{i}}\right) + \lambda\, \phi(\|u\|_{TV}) \right\}. \tag{8}$$
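To make the regularizer concrete, here is a minimal sketch of Equations (6) and (7). It assumes forward differences that are set to zero at the far image border, and the default values of `alpha`, `eps`, and `p` are placeholders, since the text does not fix them.

```python
import numpy as np

def total_variation(u):
    """Isotropic TV of Eq. (6) using forward differences (zero at the far border)."""
    u = np.asarray(u, dtype=float)
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[:, :-1] = u[:, 1:] - u[:, :-1]   # horizontal pixel difference
    dy[:-1, :] = u[1:, :] - u[:-1, :]   # vertical pixel difference
    return np.sum(np.sqrt(dx ** 2 + dy ** 2))

def nonconvex_tv(u, alpha=1.0, eps=1e-3, p=0.5):
    """phi(||u||_TV) = alpha * TV / (eps + alpha * TV)^p, Eq. (7)."""
    tv = alpha * total_variation(u)
    return tv / (eps + tv) ** p

step = np.zeros((8, 8))
step[:, 4:] = 1.0                        # one vertical edge of height 1

tv_step = total_variation(step)          # 8 rows, one unit jump each
phi_p0 = nonconvex_tv(step, p=0.0)       # p = 0 recovers alpha * TV
phi_p05 = nonconvex_tv(step, p=0.5)      # p > 0 penalizes large TV less than linearly
```

With `p = 0` the denominator equals one and the regularizer reduces to ordinary (scaled) TV; larger `p` flattens the penalty for images whose TV is large.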

We introduce the auxiliary variable $V \in \mathbb{R}^{n}$ to obtain the constrained form of (8) as follows

$$\min_{u, v} \left\{ M \sum_{i=1}^{n} \left(u_{i} + e^{g_{i} - u_{i}}\right) + \lambda\, \phi(\|v\|_{TV}) \right\} \quad \text{s.t.} \quad U = V. \tag{9}$$

Then, the augmented Lagrangian function of (9) is expressed as follows

$$L(U, V, \Lambda) = M \sum_{i=1}^{n} \left(u_{i} + e^{g_{i} - u_{i}}\right) + \lambda\, \phi(\|v\|_{TV}) + \langle \Lambda, U - V \rangle + \frac{\mu}{2} \|U - V\|_{F}^{2}. \tag{10}$$

The problem now has the form required by the alternating direction method of multipliers (ADMM). The update equations for each variable are

$$U^{k+1} = \arg\min_{U}\; M \sum_{i=1}^{n} \left(u_{i} + e^{g_{i} - u_{i}}\right) + \langle \Lambda^{k}, U - V^{k} \rangle + \frac{\mu^{k}}{2} \|U - V^{k}\|_{F}^{2}, \tag{11}$$

$$V^{k+1} = \arg\min_{V}\; \lambda\, \phi(\|v\|_{TV}) + \langle \Lambda^{k}, U^{k+1} - V \rangle + \frac{\mu^{k}}{2} \|U^{k+1} - V\|_{F}^{2}, \tag{12}$$

$$\Lambda^{k+1} = \Lambda^{k} + \mu^{k} \left(U^{k+1} - V^{k+1}\right), \tag{13}$$

$$\mu^{k+1} = \rho\, \mu^{k}. \tag{14}$$
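Subproblem (11) separates over pixels: each $u_i$ minimizes $M(u_i + e^{g_i - u_i}) + \Lambda_i (u_i - v_i) + \frac{\mu}{2}(u_i - v_i)^2$, whose second derivative $M e^{g_i - u_i} + \mu$ is always positive. The following sketch applies a fixed number of Newton steps per pixel. The function names, the starting point $V^k$, the iteration count, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def update_u(g, v, multiplier, mu, looks, n_iter=30):
    """Approximately solve the U-subproblem (11) pixel by pixel with Newton's method."""
    u = np.array(v, dtype=float)          # start from V^k (a choice made for this sketch)
    for _ in range(n_iter):
        e = np.exp(g - u)
        grad = looks * (1.0 - e) + multiplier + mu * (u - v)   # first derivative
        hess = looks * e + mu                                  # second derivative, > 0
        u -= grad / hess
    return u

def u_gradient(u, g, v, multiplier, mu, looks):
    """First derivative of the per-pixel objective, used to check optimality."""
    return looks * (1.0 - np.exp(g - u)) + multiplier + mu * (u - v)

rng = np.random.default_rng(1)
g = np.log(rng.uniform(0.5, 7.0, size=(16, 16)))   # toy noisy log-image
v = np.full_like(g, g.mean())                      # toy auxiliary variable V^k
multiplier = np.zeros_like(g)                      # Lambda^k = 0 for the demo

u_new = update_u(g, v, multiplier, mu=1.0, looks=4)
residual = np.abs(u_gradient(u_new, g, v, multiplier, 1.0, 4)).max()
```

With a zero multiplier the updated pixel lies between the noisy value $g_i$ and the auxiliary value $v_i$, and a larger $\mu$ pulls it closer to $v_i$.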

Newton’s method can be used to solve (11); its solution is obtained after several iterations. Because (12) contains nonconvex terms, it can be solved with the iterative reweighting method. Combining (7) and (12) gives the following expansion, which is convenient to compute

$$V^{k+1} = \arg\min_{V}\; \lambda \cdot \beta \cdot \|v\|_{TV} + \langle \Lambda^{k}, U^{k+1} - V \rangle + \frac{\mu^{k}}{2} \|U^{k+1} - V\|_{F}^{2}, \tag{15}$$

where $\beta$ is defined as

$$\beta = \begin{cases} 1, & \text{if } k = 1, \\ \dfrac{1}{\left(\varepsilon + \|v^{k-1}\|_{TV}\right)^{p}}, & k > 1. \end{cases} \tag{16}$$

Obviously, when the value of $\|v\|_{TV}$ decreases, the value of $\beta$ increases, so the smoothing is less tightly constrained and the edges and textures of the aircraft scattering become smoother. Once the weight $\beta$ has been obtained in closed form from (16), (15) can be regarded as an optimization problem that is easily solved. As in [31], the Chambolle algorithm is typically chosen to solve problem (15).
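Completing the square in (15) shows that the V-update is a standard TV denoising problem: $V^{k+1} = \arg\min_{V} \|V\|_{TV} + \frac{1}{2\theta}\|V - F\|_{F}^{2}$ with $F = U^{k+1} + \Lambda^{k}/\mu^{k}$ and $\theta = \lambda\beta/\mu^{k}$. The sketch below solves it with Chambolle's projection iteration in plain NumPy. The step size `tau`, the iteration count, the default `eps` and `p`, and the toy image are assumptions, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def grad(u):
    """Forward differences, set to zero at the far border."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad."""
    dx = px.copy()
    dx[:, -1] = 0.0
    dx[:, 1:] -= px[:, :-1]
    dy = py.copy()
    dy[-1, :] = 0.0
    dy[1:, :] -= py[:-1, :]
    return dx + dy

def total_variation(u):
    gx, gy = grad(u)
    return np.sum(np.sqrt(gx ** 2 + gy ** 2))

def tv_denoise_chambolle(f, theta, n_iter=200, tau=0.125):
    """Minimize TV(v) + ||v - f||^2 / (2 * theta) with Chambolle's dual iteration."""
    px = np.zeros_like(f)
    py = np.zeros_like(f)
    for _ in range(n_iter):
        gx, gy = grad(div(px, py) - f / theta)
        norm = np.sqrt(gx ** 2 + gy ** 2)
        px = (px + tau * gx) / (1.0 + tau * norm)
        py = (py + tau * gy) / (1.0 + tau * norm)
    return f - theta * div(px, py)

def reweight(v_prev, k, eps=1e-3, p=0.5):
    """Weight beta of Eq. (16); eps and p are placeholder values."""
    if k == 1:
        return 1.0
    return 1.0 / (eps + total_variation(v_prev)) ** p

def update_v(u, multiplier, v_prev, k, lam, mu):
    """V-update (15): TV denoising of U + Lambda / mu with theta = lam * beta / mu."""
    beta = reweight(v_prev, k)
    return tv_denoise_chambolle(u + multiplier / mu, lam * beta / mu)

rng = np.random.default_rng(2)
step = np.zeros((32, 32))
step[:, 16:] = 1.0
u = step + 0.2 * rng.standard_normal(step.shape)   # toy noisy log-image
v1 = update_v(u, np.zeros_like(u), v_prev=u, k=1, lam=0.5, mu=1.0)
```

Because $\theta$ grows with $\beta$, a larger weight from (16) means stronger smoothing in the next V-update.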

3.3. Aircraft Target Detection Module

The YOLOv8 algorithm is a fast target detection algorithm with good overall performance, but it still has some shortcomings in aircraft detection.

In SAR images, targets vary significantly, many aircraft have few scattering points, and the scattering of background objects is complex and blurry. Many similar image features increase the difficulty of extracting key features.
The network model is very complex, which is not conducive to deployment on on-board or airborne equipment.

To tackle these issues, we have conducted thorough research and analysis and made corresponding improvements to YOLOv8. The resulting SAR-YOLOv8, which is suitable for aircraft detection, is shown in Figure 2.

3.3.1. High Resolution Small Target Feature Heads

To alleviate the impact of scale changes and the loss of deep feature details on detection accuracy, this section proposes a small target detection head (STDH) module based on high-resolution feature maps. The Head module of YOLOv8 is optimized to compensate for missed small targets and improve prediction accuracy. The main change is to add a 4× downsampling path alongside the existing 8×, 16×, and 32× downsampling paths of the YOLOv8 backbone network, so that the low-level map P2 is integrated into the high-level maps. This generates a high-resolution 160 × 160 feature map with a smaller receptive field and richer location information, which helps the network detect smaller targets. Using the feature pyramid fusion network, the position information carried by the low-level maps is integrated with the semantics carried by the high-level maps. This fusion enables the model to learn the features of very small targets effectively, improving both accuracy and recall for small targets. The improvements are indicated in the red dashed box labeled 1 in Figure 3. After the feature fusion network, a detection branch dedicated to very small targets is added to the detection head; it is designed to identify the smallest aircraft targets in the input images. False positives and missed detections caused by the small scale of the aircraft or by the imaging position are thereby mitigated, and the network’s performance in detecting small targets is significantly enhanced.
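The effect of the extra 4× branch can be seen with simple arithmetic. The sketch below assumes a 640 × 640 network input, which is consistent with the 160 × 160 map at 4× downsampling but is not stated explicitly in the text; the 12-pixel target and the function names are likewise hypothetical.

```python
def head_grid_sizes(input_size, strides=(4, 8, 16, 32)):
    """Map each stride to its pyramid level name and feature-map side length."""
    # A stride of 2**k corresponds to level Pk (4 -> P2, 8 -> P3, ...).
    return {f"P{s.bit_length() - 1}": input_size // s for s in strides}

def cells_covered(target_size, strides=(4, 8, 16, 32)):
    """Number of grid cells a target of the given side length spans at each stride."""
    return {s: target_size / s for s in strides}

sizes = head_grid_sizes(640)   # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}
cover = cells_covered(12)      # {4: 3.0, 8: 1.5, 16: 0.75, 32: 0.375}
```

A 12-pixel aircraft spans three cells on the P2 grid but less than one cell on the P4 and P5 grids, which is why the added high-resolution head helps with the smallest targets.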

3.3.2. Efficient Multi-Scale Attention Based on Cross Spatial Learning

Most SAR image aircraft datasets exhibit small target sizes and blurred local detail features, yet these details are crucial for image target recognition tasks. As a result, the original YOLOv8 model often generates false alarms in aircraft detection, such as misidentifying buildings, ground textures, and even shadows or noise in the background as aircraft.

The main reason is that the model fails to distinguish effectively between background and foreground targets, so discriminative features are learned insufficiently, which lowers recognition precision and raises the false detection rate. In addition, the imaging angle of the aircraft and the complexity of the environment, for example aircraft partially occluded by other objects or aircraft gathered in airports and other densely populated areas, make it difficult to learn the overall features, which in turn increases the challenge of target detection.

In 2023, an efficient multi-scale attention (EMA) module is proposed, which utilizes a grouping structure and does not require dimensionality reduction. EMA establishes short-term and long-term dependencies by designing multi-scale parallel subnetworks. This module segments the information of each channel into multiple sub features while ensuring its accuracy.

The EMA attention mechanism, illustrated in Figure 3, uses a parallel subnetwork structure. This design prevents performance drops from complex sequential processing and convolutions, enabling efficient extraction of pixel-level features. By parallelizing 1 × 1 and 3 × 3 convolutions, EMA combines multi-scale spatial information, enhancing the network’s overall performance.

Here, let $X \in \mathbb{R}^{C \times H \times W}$ represent the input feature map and $G$ the number of groups into which EMA splits the channel dimension, so that there are $G$ sub-features in total, $X = [X_{0}, X_{1}, \dots, X_{G-1}]$ with $X_{i} \in \mathbb{R}^{C//G \times H \times W}$. Usually $G \ll C$, and the learned attention weight parameters are used to enhance the feature extraction ability for the region of interest (RoI). As the cross-spatial learning part of the figure shows, EMA uses three branches to extract attention weights: the $1 \times 1$ convolution is located in the first two branches, and the $3 \times 3$ convolution is located in the third branch.

Branch 1 compresses the original feature map in terms of channel count and fuses it with information from the other branches. Branch 2 draws inspiration from the coordinate attention (CA) mechanism and performs one-dimensional global average pooling on the feature maps along both the height and width directions. Branch 3 uses convolution operations to process the feature maps, effectively capturing cross-dimensional interactions and establishing relationships between different dimensions with the other branches.

In the cross-spatial learning module, EMA lets the feature information of branches 2 and 3 interact across channels. The output of the $1 \times 1$ convolutional branch passes through group normalization, global average pooling, and softmax normalization and is then multiplied with the output of branch 3. The output of the $3 \times 3$ convolutional branch passes through global average pooling and softmax normalization and is then multiplied with the output of branch 2.

The two extracted multi-scale spatial attention maps, each of size $1 \times H \times W$, are then added. Because channels and spatial positions are interdependent, EMA achieves more comprehensive feature aggregation by aggregating information across different sub-channels. Two tensors are introduced: one from the $1 \times 1$ branch and another from the $3 \times 3$ branch. To extract spatial features at different levels, two-dimensional global average pooling is applied to the tensor output by the $1 \times 1$ branch. The output feature dimension of the smallest branch in the network is $\mathbb{R}_{1}^{1 \times C//G} \times \mathbb{R}_{1}^{C//G \times HW}$, and the two-dimensional global average pooling operation is as follows

$$z_{c} = \frac{1}{H \times W} \sum_{j}^{H} \sum_{i}^{W} x_{c}(i, j), \tag{17}$$

where $H$ is the height and $W$ the width of the feature map, and $x_{c}$ represents the feature of channel $c$. EMA applies the softmax normalization function to the output of the two-dimensional global average pooling to fit the linear transformation.
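The grouping, pooling, and cross-spatial multiplication described above can be sketched in NumPy as follows. This is a shape-level illustration only: the branch outputs are random stand-ins (no convolution or group normalization is applied), and all function names, the channel count `C = 32`, and the group count `G = 4` are assumptions.

```python
import numpy as np

def split_into_groups(x, groups):
    """Reshape (C, H, W) into G sub-features of shape (C // G, H, W)."""
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("C must be divisible by G")
    return x.reshape(groups, c // groups, h, w)

def global_avg_pool_2d(x):
    """Eq. (17): average every channel over its H x W positions."""
    return x.mean(axis=(-2, -1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_spatial_map(pooled_branch, other_branch):
    """(1 x C//G) channel weights times (C//G x HW) features gives a (1, H, W) map."""
    cg, h, w = other_branch.shape
    weights = softmax(global_avg_pool_2d(pooled_branch))        # shape (C//G,)
    return (weights @ other_branch.reshape(cg, h * w)).reshape(1, h, w)

rng = np.random.default_rng(3)
x = rng.standard_normal((32, 8, 8))          # hypothetical feature map, C = 32
groups = split_into_groups(x, groups=4)      # G = 4, so C // G = 8
branch_1x1 = groups[0]                       # stand-in for the 1x1 branch output
branch_3x3 = 0.5 * groups[0]                 # stand-in for the 3x3 branch output

# Each branch is pooled to channel weights and applied to the other branch;
# the two resulting 1 x H x W maps are then added.
attention = (cross_spatial_map(branch_1x1, branch_3x3)
             + cross_spatial_map(branch_3x3, branch_1x1))
```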

Finally, the output features of the three branches are reweighted and aggregated, giving an output $X \in \mathbb{R}^{C \times H \times W}$ with the same dimensions as the input. The yellow dashed box labeled 2 in Figure 2 shows the improved part. By adding an EMA module at the front of the neck network of YOLOv8, attention during aircraft scattering feature extraction is strengthened. In this way, the YOLOv8 model pays more attention to the detailed features and blurred parts of aircraft, extracts key aircraft feature points better, and distinguishes aircraft targets from the background more effectively, thereby improving detection accuracy and reducing missed and false detections of aircraft.

3.3.3. Progressive Feature Pyramid Based on Adaptive Spatial Feature Fusion

Low-level features come from shallow networks and are rich in spatial information, with higher feature resolution and smaller receptive fields. They pay more attention to detail information, such as texture and edges, and are suitable for capturing the details of small targets. High-level features come from deep networks and are rich in semantic information, with lower feature resolution and larger receptive fields. They pay more attention to global information and provide the model with semantic information such as context and background knowledge.

The bottom-up strategy of AFPN introduces a different issue: fine details from lower-level features might be lost or diminished as they propagate and interact. The Adaptive Spatial Feature Fusion (ASFF) strategy introduces a feature pyramid network, as shown in Figure 4, which achieves effective information extraction for multi-scale object detection and solves the problem of different feature focus points at different levels by introducing fully connected feature fusion and weight parameter adjustment.

The core idea of HRNet is to connect and fuse feature outputs from top to bottom, so as to produce results that take into account both high and low features. Inspired by ASFF and HRNet, the Asymptotic Feature Pyramid Network (AFPN) adopts a comparable approach. It begins the fusion process during the initial stage of bottom-up feature extraction in the Backbone by integrating two low-level features, P2 and P3, which have varying resolutions. As we enter the later stage, we gradually incorporate advanced feature P4 into the fusion process, ultimately fusing Backbone’s top-level feature P5.

Meanwhile, this method also avoids the degradation of information during the deep network extraction process. In the process of extracting low-level and high-level features, AFPN progressively merges features from different levels, including low, high, and top-level features. Initially, AFPN integrates detailed features from lower levels, followed by the integration of semantic features from higher levels, and finally incorporates the most abstract top-level features. Directly fusing P2, P3, P4, and P5 is impractical due to the significant semantic gap between distant features, particularly between the lowest and highest levels, which results in suboptimal fusion outcomes. To address this, a progressive feature pyramid is employed to gradually merge features from different layers, ensuring that the semantic and detailed information of non-adjacent layers are more closely aligned during the fusion process, thus mitigating the aforementioned issues.

AFPN can dynamically generate feature pyramids of different scales according to the content of the input SAR image. For images containing targets of different scales, feature pyramids suited to the content are generated, thereby improving the precision and robustness of target detection. Because only standard convolutional components are incorporated, the increase in parameters remains minimal. In each feature fusion module, conflicts between multiple target information may arise; to address this, Adaptive Spatial Feature Fusion (ASFF) is employed to mitigate such conflicts. ASFF is typically integrated into the feature fusion network, enabling the comprehensive fusion of features from different levels in a fully connected manner. This approach allows the model to adapt to varying spatial scales and structures present in the input data, dynamically adjusting the weights of features. Therefore, the ability of the model to handle complex data is enhanced, and the overall modeling accuracy and generalization ability can be expected to improve.

The adaptive spatial feature fusion process is shown in Figure 5, where

x_{i j}^{n \to l}

is the two dimensional feature vector at position

(i, j)

from the n-th to the l-th layer.

l = 1, 2, 3

denotes the Level 1, Level 2, and Level 3 feature layers in the figure, and

y_{i j}^{l}

represents the fused output vector. It is obtained with the multi-level adaptive spatial feature fusion method as a linear combination, weighted by the parameters

α_{i j}^{l}

,

β_{i j}^{l}

,

γ_{i j}^{l}

of feature vector

x_{i j}^{1 \to l}

,

x_{i j}^{2 \to l}

,

x_{i j}^{3 \to l}

. The specific representation is as follows:

y_{i j}^{l} = α_{i j}^{l} \cdot x_{i j}^{1 \to l} + β_{i j}^{l} \cdot x_{i j}^{2 \to l} + γ_{i j}^{l} \cdot x_{i j}^{3 \to l},

(18)

For the linear combination in (18), the feature maps from Levels 1–3 must have the same spatial dimensions and the same number of channels. The weight parameters

α_{i j}^{l}, β_{i j}^{l}, γ_{i j}^{l} \in [0, 1]

are defined by a softmax function, in which

λ_{α_{i j}}^{l}, λ_{β_{i j}}^{l}, λ_{γ_{i j}}^{l}

are the control parameters.

α_{i j}^{l} + β_{i j}^{l} + γ_{i j}^{l} = 1 .

(19)

α_{i j}^{l} = \frac{e^{λ_{α_{i j}}^{l}}}{e^{λ_{α_{i j}}^{l}} + e^{λ_{β_{i j}}^{l}} + e^{λ_{γ_{i j}}^{l}}} .

(20)
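As a minimal sketch of (18)–(20) at a single spatial position, the snippet below computes the three softmax weights from the control parameters and forms the weighted sum of three feature vectors. The function names, the sample vectors, and the control parameter values are illustrative assumptions; in the actual network the control parameters would be produced by learned layers, which is not modeled here.

```python
import math

def asff_weights(lam_alpha, lam_beta, lam_gamma):
    """Softmax over the three control parameters, following Eq. (20)."""
    # Subtracting the maximum improves numerical stability without changing the result.
    m = max(lam_alpha, lam_beta, lam_gamma)
    exps = [math.exp(v - m) for v in (lam_alpha, lam_beta, lam_gamma)]
    total = sum(exps)
    return tuple(e / total for e in exps)

def asff_fuse(x1, x2, x3, lam_alpha, lam_beta, lam_gamma):
    """Weighted sum of three same-length feature vectors at one position, Eq. (18)."""
    alpha, beta, gamma = asff_weights(lam_alpha, lam_beta, lam_gamma)
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(x1, x2, x3)]

# Hypothetical 4-channel feature vectors, assumed already resized to level l.
x1 = [1.0, 0.0, 2.0, 4.0]
x2 = [0.0, 1.0, 2.0, 0.0]
x3 = [3.0, 3.0, 2.0, 0.0]

# Equal control parameters give equal weights of 1/3 each (hypothetical values).
weights = asff_weights(0.5, 0.5, 0.5)
fused = asff_fuse(x1, x2, x3, 0.5, 0.5, 0.5)
```

Because the weights come from a softmax, they are non-negative and sum to one, which is the constraint stated in (19).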

The blue dashed box with number 3 in Figure 2 is an improved feature pyramid fusion network. An improved adaptive spatial fusion progressive feature pyramid module is proposed, which consistently preserves low-level features while incrementally integrating high-level features throughout the fusion process. This module utilizes a fully connected approach for adaptive spatial feature fusion, ensuring continuous interaction between different levels of features. Incorporating this module into YOLOv8 addresses the challenge of suboptimal fusion performance due to significant semantic disparities when merging features from non-adjacent levels. Therefore, in the backbone network, the features of all convolutional layers are fully integrated.

Floating-point operations (FLOPs) are an important index for measuring the complexity of network modules in the model. According to the complexity calculation method in [42], the computational cost of a standard convolution module can be expressed as:

FLOPs_{CONV} = C_{in} \cdot C_{out} \cdot K^{2} \cdot H \cdot W,

(21)

where H and W denote the height and width of the output feature map, K represents the size of the convolution kernel (K = 1 for a

1 \times 1

convolution and K = 3 for a

3 \times 3

convolution),

C_{in}

represents the number of input channels, and

C_{out}

represents the number of output channels.

Here, we analyze the complexity of the ASFF module. First, a

1 \times 1

convolution is applied to each input feature map, so the complexity is

n \times FLOPs_{CONV}

, where n represents the total number of input feature maps. Next, the adjusted feature maps are weighted and summed, with a complexity of

FLOPs_{fusion} = C_{out} \cdot H \cdot W.

(22)

Based on the above, the total computational complexity of the ASFF module can be expressed as

FLOPs_{ASFF} = n \cdot (FLOPs_{CONV} + FLOPs_{fusion}) = n \cdot C_{in} \cdot C_{out} \cdot K^{2} \cdot H \cdot W + n \cdot C_{out} \cdot H \cdot W .

(23)
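The counts in (21)–(23) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the example configuration (three inputs, 256 channels, a 40 × 40 feature map) are assumptions chosen for the example and are not taken from the paper.

```python
def flops_conv(c_in, c_out, k, h, w):
    """Standard convolution cost, Eq. (21)."""
    return c_in * c_out * k * k * h * w

def flops_fusion(c_out, h, w):
    """Cost of the weighted sum of adjusted feature maps, Eq. (22)."""
    return c_out * h * w

def flops_asff(n, c_in, c_out, k, h, w):
    """Total ASFF cost for n input feature maps, Eq. (23)."""
    return n * (flops_conv(c_in, c_out, k, h, w) + flops_fusion(c_out, h, w))

# Hypothetical configuration: n = 3 inputs, 256 -> 256 channels,
# 1 x 1 convolution (K = 1), 40 x 40 output feature map.
total = flops_asff(n=3, c_in=256, c_out=256, k=1, h=40, w=40)
conv_share = 3 * flops_conv(256, 256, 1, 40, 40) / total
```

For this configuration the convolution term accounts for more than 99% of the total, so the fusion term in (22) is a small overhead.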

3.4. Loss Function

In target detection, the loss function quantifies the difference between predicted outcomes and actual labels, playing a crucial role in model optimization. The SAR-YOLOv8 model introduced here uses regression loss, specifically distribution focal loss (DFL) and complete intersection over union (CIoU) Loss, combined with a weighted ratio.

When detecting SAR images, multiple targets are often occluded or overlapping, and the coordinates of the target boxes are not flexible enough, especially when the boundaries are not clear enough. Therefore, the introduction of DFL significantly enhances the value of the output result difference (

y_{i}

,

y_{i - 1}

), making the network more flexible in representing losses, expressed as follows

Loss_{DFL} (D_{i - 1}, D_{i}) = - ((y_{i} - y) \log (D_{i - 1}) + (y - y_{i - 1}) \log (D_{i})) .

(24)

\hat{y} = \sum_{k = 0}^{n} D (y_{k}) y_{k} = D_{i} y_{i} + D_{i - 1} y_{i - 1} = \frac{y_{i} - y}{y_{i} - y_{i - 1}} y_{i - 1} + \frac{y - y_{i - 1}}{y_{i} - y_{i - 1}} y_{i} = y,

(25)

where

D (y_{i})

is abbreviated as

D_{i}

. To concentrate probability on the two values nearest the label y, namely

y_{i}

and

y_{i - 1}

, the optimal solution value for DFL can be obtained at

D_{i - 1} = \frac{y_{i} - y}{y_{i} - y_{i - 1}}

and

D_{i} = \frac{y - y_{i - 1}}{y_{i} - y_{i - 1}}

, which can ensure that the estimated regression label

\hat{y}

is infinitely close to the corresponding label y. The derivation is shown in (25), which ensures its correctness as a loss function.
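The optimum stated above can be verified numerically. The sketch below evaluates (24) for a hypothetical label lying between two adjacent discrete values and checks that the stated optimal probabilities reproduce the label through the expectation in (25). The variable names and the sample values are assumptions made for illustration.

```python
import math

def dfl_optimal(y, y_prev, y_next):
    """Optimal (D_{i-1}, D_i) for a label y with y_prev <= y <= y_next."""
    span = y_next - y_prev
    return (y_next - y) / span, (y - y_prev) / span

def dfl_loss(d_prev, d_next, y, y_prev, y_next):
    """Distribution focal loss, Eq. (24)."""
    return -((y_next - y) * math.log(d_prev) + (y - y_prev) * math.log(d_next))

# Hypothetical label between the discrete values y_{i-1} = 4 and y_i = 5.
y, y_prev, y_next = 4.3, 4.0, 5.0

d_prev, d_next = dfl_optimal(y, y_prev, y_next)
y_hat = d_prev * y_prev + d_next * y_next            # expectation in Eq. (25)

loss_at_optimum = dfl_loss(d_prev, d_next, y, y_prev, y_next)
loss_at_uniform = dfl_loss(0.5, 0.5, y, y_prev, y_next)
```

With unit spacing between the two discrete values, (24) is a cross-entropy between the target weights and the predicted probabilities, so any other probability pair, such as the uniform one, gives a larger loss.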

Intersection over Union (IoU) is a metric used in object detection that measures the overlap between the predicted and ground-truth boxes. However, when the two boxes do not overlap, the IoU is zero regardless of the distance between them, so the IoU loss provides no useful gradient, hindering proper network training. To address this, the generalized intersection over union (GIoU) was introduced, which mitigates some of IoU’s issues. Nonetheless, GIoU struggles to differentiate the relative positions of boxes when the predicted box lies entirely within the ground-truth box. Subsequently, the distance intersection over union (DIoU) was proposed to resolve these issues. However, DIoU also has limitations: when the predicted boxes lie within the ground-truth box and their center points coincide but their shapes differ, DIoU produces identical results.

CIoU was subsequently proposed to address the limitations of DIoU by additionally considering the aspect ratios of the predicted and ground-truth boxes. The CIoU loss can be written as follows:

Loss_{CIoU} = 1 - CIoU = 1 - IoU + \frac{ρ^{2} (b, b^{gt})}{c^{2}} + α υ,

(26)

where b and

b^{g t}

represent the center coordinates of the predicted box and the ground-truth box, respectively.

ρ

is the distance between these center coordinates, c is the diagonal length of the smallest enclosing box that covers both the predicted box and the ground-truth box,

α

is a coefficient for balancing the ratio, and

υ

measures the aspect ratio consistency between output results and labels. The expressions of

α

and

υ

are as follows

α = \frac{υ}{(1 - IoU) + υ},

(27)

υ = \frac{4}{π^{2}} {(\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h})}^{2} .

(28)
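A minimal sketch of (26)–(28) for two axis-aligned boxes is given below. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates with positive width and height; this box format, the function name, and the sample boxes are assumptions for illustration and do not reflect the training code used in the paper.

```python
import math

def ciou_loss(box, box_gt):
    """CIoU loss of Eq. (26) for boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter)

    # Squared distance between the box centers (rho^2).
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes (c^2).
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Aspect-ratio consistency, Eq. (28), and its balancing coefficient, Eq. (27).
    v = 4 / math.pi ** 2 * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    denom = (1 - iou) + v
    alpha = v / denom if denom > 0 else 0.0  # identical boxes: avoid 0/0

    return 1 - iou + rho2 / c2 + alpha * v

identical = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
near_miss = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap, close
far_miss = ciou_loss((0, 0, 1, 1), (5, 5, 6, 6))   # no overlap, farther away
```

The two non-overlapping cases both have an IoU of zero, yet the loss is larger for the more distant box, which illustrates why the center-distance term keeps the loss informative when the boxes do not intersect.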

4. Experiments and Results

4.1. Dataset and Settings

The SAR-AIRcraft-1.0 dataset, created by Wang et al. [43], is used in this experiment. The dataset consists of images captured by the Gaofen-3 satellite, characterized by single polarization, 1-m spatial resolution, and spotlight imaging mode. It focuses on images from three major civil airports: Shanghai Hongqiao, Beijing Capital, and Taiwan Taoyuan. It contains 4368 aircraft slices covering 7 fine-grained aircraft types and is characterized by complex scenes, rich categories, dense targets, noise interference, diverse tasks, and multi-scale targets. The dataset has been publicly released [43] and can be accessed at https://radars.ac.cn/web/data/getData?dataType=SARDataset; the version used in this article was downloaded on 15 June 2024. In this experiment, we split the data between the training set and the test set at a ratio of approximately 4:1, with 3495 images in the training set and 873 images in the test set.

To better verify the robustness of the proposed method on SAR images, we also selected the SADD dataset published by Zhang et al. [44]. SADD is a SAR aircraft detection dataset collected from the TerraSAR-X satellite, which operates in X-band and HH polarization mode, with an image resolution range of 0.5 to 3 m. By cropping the complete SAR images, we obtained 2966 slices of 224 × 224 pixels, containing 7835 aircraft targets in total. For this dataset, we use 1780 images for training and 1186 images for testing, a ratio of approximately 3:2.

The whole experiment is implemented with PyTorch 1.7, CUDA 11.0, and the MMDetection code library. The training process spans 60 epochs, utilizing the Adam optimizer with an initial learning rate of 0.001, a momentum parameter of 0.9, and a weight decay of 0.0001. A cosine annealing strategy is employed to adjust the learning rate dynamically throughout the training. All experiments are conducted on the Ubuntu 20.04 operating system. The hardware setup includes an Intel i7-12700K CPU and an NVIDIA RTX 3080 GPU.
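For reference, the commonly used form of a cosine annealing schedule is sketched below with the stated initial learning rate of 0.001 and 60 epochs. The minimum learning rate of zero and the per-epoch update are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=60, lr_init=1e-3, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing.

    lr_min = 0.0 and per-epoch stepping are assumptions, not settings from the paper.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + cosine)

# Learning rate at the start, midpoint, and end of the 60-epoch schedule.
schedule = [cosine_annealing_lr(e) for e in (0, 30, 60)]
```

Under these assumptions the rate starts at 0.001, falls to half of that at epoch 30, and reaches the minimum at epoch 60.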

4.2. Evaluation Metrics

4.2.1. SAR Image Despeckling Evaluation Metrics

The ENL is a crucial evaluation metric for assessing the effectiveness of speckle noise suppression in SAR images, which is expressed as follows

E N L = \frac{μ^{2}}{σ^{2}},

(29)

where

μ

denotes the mean value of input SAR image intensity,

σ

is its standard deviation. The larger the ENL value, the weaker the speckle interference in the SAR image.
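A minimal sketch of the ENL in (29) over a homogeneous patch is shown below. The use of the population variance, the function name, and the sample intensity values are assumptions for illustration; they are not measurements from the datasets.

```python
def enl(region):
    """Equivalent number of looks, Eq. (29), for a flat list of intensities."""
    n = len(region)
    mean = sum(region) / n
    # Population variance (an assumption; a sample variance could also be used).
    variance = sum((v - mean) ** 2 for v in region) / n
    if variance == 0:
        return float("inf")  # perfectly uniform patch
    return mean ** 2 / variance

# Hypothetical intensities from the same homogeneous patch before and after despeckling.
before = [8.0, 12.0, 9.0, 11.0, 10.0, 10.0]
after = [9.5, 10.5, 10.0, 10.0, 9.8, 10.2]

enl_before = enl(before)
enl_after = enl(after)
```

Both patches have the same mean, so the higher ENL of the second patch reflects only its smaller intensity fluctuation.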

The texture details of an SAR image are crucial for quality, and the edge preserving index (EPI) can be used to evaluate how well different filtering algorithms preserve the texture details of the original SAR data. Edge and texture features are reflected in the gradient map of the input data, so the EPI is computed from the image gradient information as

EPI = \frac{G (Z)}{G (W)},

(30)

where

G (\cdot)

represents the sum of the gradient magnitudes of an SAR image, which is a scalar value, W denotes the SAR data input to the filter, and Z denotes the image output by the speckle reduction filter. The calculation methods of

G (Z)

and

G (W)

are as follows

G (Z) = \sum_{n = 1}^{r} \sum_{m = 1}^{c} \sqrt{{[Z (n, m) - Z (n + 1, m)]}^{2} + {[Z (n, m) - Z (n, m + 1)]}^{2}},

(31)

G (W) = \sum_{n = 1}^{r} \sum_{m = 1}^{c} \sqrt{{[W (n, m) - W (n + 1, m)]}^{2} + {[W (n, m) - W (n, m + 1)]}^{2}} .

(32)

According to (30), as the image edge preservation ability of the filter becomes stronger, the value of EPI should naturally increase.
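The EPI can be sketched as the ratio of the gradient sum of the filtered image Z to that of the input image W. The snippet below assumes the usual gradient-magnitude form, in which both the vertical and the horizontal differences are squared, and restricts the sums to pixels that have a lower and a right neighbor. The function names and the tiny sample images are assumptions for illustration.

```python
import math

def gradient_sum(img):
    """Sum of gradient magnitudes over a 2-D image given as a list of rows."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    # Assumption: stop one pixel short of the border so both neighbors exist.
    for n in range(rows - 1):
        for m in range(cols - 1):
            dv = img[n][m] - img[n + 1][m]   # vertical difference
            dh = img[n][m] - img[n][m + 1]   # horizontal difference
            total += math.sqrt(dv * dv + dh * dh)
    return total

def epi(filtered, original):
    """Edge preserving index: gradient sum of the output over that of the input."""
    return gradient_sum(filtered) / gradient_sum(original)

# Hypothetical 3 x 3 images: a sharp vertical edge and an over-smoothed version of it.
original = [[0.0, 0.0, 10.0] for _ in range(3)]
smoothed = [[4.0, 5.0, 6.0] for _ in range(3)]

epi_identity = epi(original, original)
epi_smoothed = epi(smoothed, original)
```

An unchanged image gives an EPI of 1, while the over-smoothed image retains only a fifth of the original gradient sum.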

4.2.2. Aircraft Detection Evaluation Metrics

In object detection performance evaluation, metrics such as precision (P), recall (R), false alarm (FA), F1 score, and average precision (AP) are utilized. These metrics are derived from four key values: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

\begin{matrix} P = \frac{T P}{T P + F P}, R = \frac{T P}{T P + F N}, F A = \frac{F P}{T P + F P}, \\ F_{1} = \frac{2 \times P \times R}{P + R}, A P = \int_{0}^{1} P (R) d R . \end{matrix}

(33)

mAP is the mean average precision, which is the average of the APs over all categories and comprehensively reflects the detection performance of the entire model. The

{mAP}_{50}

is the mAP computed at an IoU threshold of

IoU = 50 %

.
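The metrics in (33) can be sketched as follows. The counts and the precision-recall points are hypothetical, and the AP integral is approximated here with a plain trapezoidal rule, which is an assumption; evaluation toolkits often use interpolated precision instead.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, false alarm rate, and F1 score as defined in Eq. (33)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    false_alarm = fp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, false_alarm, f1

def average_precision(recalls, precisions):
    """Trapezoidal approximation of AP = integral of P(R) dR.

    Assumes recalls are sorted in ascending order and span the range of interest.
    """
    ap = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        ap += width * (precisions[k] + precisions[k - 1]) / 2
    return ap

# Hypothetical counts: 90 correct detections, 10 false detections, 10 missed targets.
p, r, fa, f1 = detection_metrics(tp=90, fp=10, fn=10)

# Hypothetical precision-recall curve sampled at three recall levels.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
```

With these definitions the false alarm rate is the complement of precision, so the two always sum to one.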

4.3. Results and Analysis

4.3.1. Comparison with Other Despeckling Methods

To further illustrate the ability of noise reduction preprocessing designed in this paper, two images with complex scenes in the SAR-Aircraft-1.0 dataset are despeckled. The NTV is compared with classical and CNN based speckle reduction algorithms such as MIDAL, TV, and SAR-CNN-M-xUnit [35].

To comprehensively assess the despeckling performance, the quantitative evaluation results are given in Table 1, with the top-performing outcomes emphasized in bold. For each real SAR image, we manually selected a homogeneous region to compute the ENL. The ENL values of MIDAL and TV are lower, while SAR-CNN-M-xUnit exhibits a stronger speckle reduction effect. The NTV method has the highest ENL value, indicating an excellent despeckling effect in homogeneous regions.

As illustrated in Table 1, the NTV demonstrates enhanced performance in despeckling uniform regions, aircraft region speckle reduction, and target detail preservation when compared to the TV method. When evaluated against the MIDAL algorithm, the NTV shows significant improvement in edge preservation while maintaining effective despeckling in uniform regions. SAR-CNN-M-xUnit can indeed reduce the speckle well, but its effect is weaker than that of NTV.

In terms of EPI, MIDAL and SAR-CNN-M-xUnit both exhibit good edge preservation capabilities, while TV performs poorly. The average EPI difference between NTV and SAR-CNN-M-xUnit is approximately 0.04, indicating that NTV still retains a strong ability to preserve edge details. The comparison between TV and NTV reveals that incorporating the new TV-based regularizer enhances the preservation of edge details.

Comparing the despeckling time in Table 1, it can be seen that TV is the fastest, followed by NTV. The CNN-based method has a long despeckling time, which is also a factor limiting its application.

Visually, from Figure 6 and Figure 7, all the despeckling algorithms can well suppress speckle in homogeneous regions. In order to highlight the details of each algorithm, the texture area within the red box has been enlarged and displayed. However, the performance of the MIDAL method is notably compromised by the presence of staircase artifacts. As depicted in Figure 6b and Figure 7b, residual speckle remains visible in the images post-despeckling. The TV method, on the other hand, introduces point-wise artifacts in the despeckled SAR images, particularly evident in the image shown in Figure 7. This indicates that the TV method struggles to adequately preserve texture details. It can be seen that TV has the most severe degradation of texture, especially in the enlarged areas of Figure 6c and Figure 7c.

From the enlarged regions of Figure 6d and Figure 7d, it can be seen that SAR-CNN-M-xUnit still leaves some residual speckle, although the CNN-based method reduces speckle effectively and performs better than MIDAL and TV. Figure 6e and Figure 7e show that NTV preserves details well, especially in heterogeneous regions, and gives the best visual quality among the compared methods. In conclusion, the proposed NTV method effectively reduces speckle in SAR images across most areas while avoiding the introduction of artifacts.

To better compare the effectiveness of the despeckling method, we compare the results of aircraft target detection after applying different algorithms, as shown in Figure 8. According to Figure 8a, the presence of speckles in the original image leads to some speckle regions being misidentified as aircraft. From Figure 8b,c, it can be seen that the over-smoothing caused by speckle reduction eliminates most of the aircraft scattering features. Figure 8d shows better algorithm performance, but there is still some residual speckle, resulting in misidentification. The algorithm proposed in this paper achieves a good detection accuracy, as shown in Figure 8e.

In summary, these results indicate that NTV performs better than traditional algorithms such as TV and MIDAL and deep learning methods such as SAR-CNN-M-xUnit. Therefore, NTV is a reasonable choice as the despeckling preprocessing algorithm. Next, based on the NTV preprocessing method, we conduct in-depth research on the aircraft target detector in SAR images.

4.3.2. Comparison with Other Target Detecting Methods

To illustrate the effectiveness of SAR-NTV-YOLOv8, its results are compared with those of other popular general-purpose, ship, and aircraft target detectors, such as FCOS, YOLOv7, LMSD-YOLO [42], PFF-ADN [9], and SEFEPNet [44], as shown in Table 2.

On the SAR-AIRcraft-1.0 dataset, SAR-NTV-YOLOv8 achieves an AP of 83.41%, surpassing LMSD-YOLO’s 73.43% and SEFEPNet’s 80.33% by 9.98 and 3.08 percentage points, respectively. The algorithm also reaches a precision of 93.53%, a recall of 92.17%, and an F1 score of 93.10%, which is higher than YOLOv7’s F1 score of 87.41% and SEFEPNet’s F1 score of 91.34%. On the SADD dataset, SAR-NTV-YOLOv8 exhibits outstanding performance, achieving an AP of 88.64%, significantly higher than FCOS’s 73.89%, YOLOv7’s 81.47%, and SEFEPNet’s 86.98%, with improvements of 14.75, 7.17, and 1.66 percentage points, respectively. The algorithm’s precision is 93.82%, its recall is 93.15%, and its F1 score is 92.73%, with the recall exceeding YOLOv7’s recall of 90.35%. SAR-NTV-YOLOv8 excels in efficiently extracting and learning textures of aircraft details, and can detect large-scale distributed aircraft scenes.

Based on the design of SAR-NTV-YOLOv8, this improvement can be attributed to the effective suppression of speckle noise by NTV. Meanwhile, the attention of the backbone network has been strengthened, which improves its ability to extract key features and to distinguish target regions from the background.

Although SAR-NTV-YOLOv8 has a higher parameter count than some comparison algorithms, including PFF-ADN (43.15 M) and SEFEPNet (40.23 M), it maintains relatively low FLOPs of 117.61 G. This is significantly lower than FCOS (206.2 G) and PFF-ADN (198.5 G), indicating that SAR-NTV-YOLOv8 achieves a good balance between model complexity and computational efficiency. The parameters and FLOPs of LMSD-YOLO are the smallest. However, because the network is highly customized for ship detection in SAR images, its performance in aircraft detection is limited. The combination of lower FLOPs and high detection performance indicates that SAR-NTV-YOLOv8 is a suitable choice for real-time applications where both accuracy and speed are crucial.

Figure 9 presents the

{AP}_{50}

and Precision-Recall (PR) curves for various methods, providing additional evidence of our method’s effectiveness. In particular, Figure 9b provides a visualization of the PR curve for evaluating the performance of the SAR-NTV-YOLOv8 model. Overall, the SAR-NTV-YOLOv8 exhibits exceptional performance.

Figure 10 shows the visual experimental results on the SAR-AIRcraft-1.0 dataset, where the SAR-NTV-YOLOv8 model outperforms the other methods, achieving an excellent balance between precision and recall. Notably, in the first column of Figure 10, an aircraft that blends into the background building is missed by other methods but is accurately detected by our proposed approach, demonstrating its enhanced learning capability.

In the second column, both FCOS and PFF-ADN mistakenly identified buildings as aircraft targets, and LMSD-YOLO, FCOS, YOLOv7, and PFF-ADN all had missed detections. The proposed method successfully detects all aircraft, because the efficient cross-space learning detection framework improves the detection success rate.

From the third column, it can be seen that SAR-NTV-YOLOv8 has fewer missed detections and false alarms than SEFEPNet. From the fourth column of the figure, it can be seen that large aircraft are arranged very close in the SAR image. Due to the interference caused by strong clutter, compared with our method, other methods have some missed aircraft targets. However, due to the use of high-resolution precise detection heads and spatial feature fusion pyramid structures, the proposed method can avoid missed detections and false alarms.

In addition, Figure 11 shows the visual experimental results on the SADD dataset. The target background of SADD is relatively simple, with less scattering interference, resulting in higher detection accuracy for all algorithms. Because many of the cropped images contain incomplete aircraft components, these fragments should not be detected. In short, the detection results are similar to those in Figure 10, which supports the effectiveness and robustness of this method.

Regarding the ship detection methods, LMSD-YOLO can achieve a better detection ability. It can be seen from Figure 10 and Figure 11 that its target detection ability on SAR images is stronger than the general target detection method. However, LMSD-YOLO does not achieve the same high detection accuracy in aircraft detection tasks as in ship detection tasks. Here we briefly discuss the reasons:

Shape feature difference: Ships typically have a more regular shape, making them easier to identify in SAR images, especially in stationary environments with minimal background interference. In contrast, aircraft are smaller, usually ranging from 10 to 80 m in length. For example, a Boeing 737 has a wingspan of about 34 m, a fuselage length of 39 m, and a total surface area of approximately 556 square meters. In comparison, a container ship like the “Maersk Magellan” measures about 400 m in length, 60 m in width, and has a deck area of around 24,000 square meters. This indicates that the area of a ship is roughly 43 times that of a medium-sized passenger aircraft.
Differences in SAR Scattering Characteristics: Ships typically have flat metal surfaces that exhibit strong backscattering characteristics. In contrast, the scattering characteristics of aircraft are more complex, primarily arising from components such as the wings, fuselage, and tail. While the surfaces of aircraft are generally smoother, their complex shapes can lead to more dispersed scattering characteristics, often resulting in weaker echoes in SAR images.
Neural Network Architecture: Neural networks customized for ship target detection typically utilize deeper convolutional layers to extract richer features. Additionally, these networks often employ specific pooling strategies to preserve the geometric shapes of the ships. In contrast, neural networks designed for aircraft target detection commonly incorporate attention mechanisms to enhance focus on aircraft features, particularly in complex backgrounds.
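The size comparison in the first point above can be reproduced directly from the quoted figures. The snippet is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Approximate figures quoted in the text above.
aircraft_area_m2 = 556                   # Boeing 737 total surface area
ship_length_m, ship_width_m = 400, 60    # container ship dimensions

ship_deck_area_m2 = ship_length_m * ship_width_m
area_ratio = ship_deck_area_m2 / aircraft_area_m2
```

The deck area evaluates to 24,000 square meters and the ratio to roughly 43, matching the values stated in the text.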

Although the direct application of ship detection methods has not achieved exciting detection results, due to some commonalities, many ideas in the field of ship detection can be referred to.

The visual experimental results from different datasets demonstrate that SAR-NTV-YOLOv8, which employs a feature pyramid architecture, effectively combines the semantic richness of high-level layers with the image texture details of low-level maps. By making predictions at multiple levels of the feature hierarchy, this method greatly enhances the detection precision for small aircraft. The multi-scale mechanism of cross-space learning successfully enhances the scattering characteristics of aircraft in SAR images, which is conducive to capturing the less obvious characteristics of aircraft.

4.3.3. Ablation Experiment

To demonstrate the detection capability of the proposed improved SAR-NTV-YOLOv8, ablation experiments are designed on the SAR-Aircraft-1.0 dataset. The first group of experiments used YOLOv8 as the baseline, the second group added an STDH module, the third group added an EMA module, and the fourth group added an AFPN module, in order to verify the effectiveness of each module of the model. The detection results are shown in Table 3. In the second experiment, after adding STDH, precision increased by 1.58% and mAP50 increased by 1.76%, indicating that STDH enhanced sensitivity to small-sized aircraft.

In the third group of experiments, EMA is added to the backbone network of YOLOv8, which increases precision by 1.58% and mAP50 by 2.44%. This verifies that the EMA module alleviates the problems of distracted attention and insufficient feature extraction, which helps the model extract more key features.

The fourth group of experiments adds the AFPN module. The results show that precision is increased by 1% to 91.21%, and mAP50 is increased by 0.42% to 80.23%.

Unlike training directly with the original SAR images, training with despeckled SAR images can significantly improve detection performance. Therefore, in the fifth group of experiments, the NTV despeckling module is added to the complete SAR-YOLOv8, which increases the precision to 93.53%, mAP50 to 83.41%, and F1 score to 93.17%. In summary, SAR-NTV-YOLOv8 shows better overall performance.

The above ablation experiments illustrate that the three improvements in the SAR-YOLOv8 model effectively address the insufficient feature extraction caused by multi-scale aircraft and the difficulty of distinguishing scattering characteristics in SAR images. Meanwhile, the proposed NTV preprocessing module effectively improves the quality of the SAR images used for object detection, further improving the ability of YOLOv8 to detect aircraft in SAR images.

5. Discussion

Current research on SAR aircraft detection has largely overlooked the issue of image degradation caused by speckle noise. The presence of strong clutter and speckle noise in these images hinders the network’s ability to accurately learn target features, leading to a high incidence of false alarms and missed detections. We propose NTV preprocessing to convert SAR images severely affected by speckle interference into clean SAR images, which helps the detection network mine aircraft target features. The results given in Section 4.3.1 show that NTV reduces the staircase artifacts and over-smoothing that arise during SAR despeckling. At the same time, in the ablation experiment of Section 4.3.3, adding an NTV preprocessing module effectively improves the success rate of aircraft detection. In brief, SAR image despeckling is very important, and the NTV preprocessing module should be considered in practical applications. In the future, we can try to add a speckle reduction module to the SAR aircraft detection backbone network, making the method a single-stage detection scheme and simplifying the processing flow.

The ablation experiments in Section 4.3.3 verify the effectiveness of STDH, EMA, and AFPN. These results show that the representation ability of the aircraft detection network is greatly improved by coordinating the inconsistency on different feature scales. By adding EMA, the precision increased from 88.63% to 90.21%. It shows that the EMA module enhances the extraction ability of different scale features. AFPN enhances the extraction of texture and other detailed features of aircraft targets by combining ASFF and HRNet mechanisms, so that YOLOv8 can better adaptively focus on aircraft. However, EMA and AFPN require a certain amount of calculation. Therefore, in the follow-up research, we will focus on exploring a more lightweight and high-precision aircraft target detection network.

6. Conclusions

This paper proposes an aircraft detection method for SAR images with speckle interference in complex scenes. Firstly, this method adds an NTV preprocessing module, which enhances the scattering information of aircraft targets and suppresses sea clutter and speckle noise, thus avoiding a large number of false alarms. Next, the method is based on the YOLOv8 backbone and incorporates three improvements: STDH, EMA, and AFPN. By adding a detection branch to the detection head for extremely small-scale aircraft, the precision and recall of detecting extremely small targets can be improved. Then, in response to small target sizes and blurred local detail features, an EMA module is added to combine local and contextual information, aggregating semantic information at different levels. Finally, to achieve effective information extraction for multi-scale target detection, the AFPN network is used to address the differing feature focus at different levels. In this way, each feature layer can fully integrate high-level feature maps and low-level features. The experimental results show that SAR-NTV-YOLOv8 can accurately detect aircraft from SAR images. In addition, the speckle reduction method NTV can also be applied to other tasks of detecting targets from SAR images.

Author Contributions

Conceptualization, X.G. and B.X.; methodology, X.G.; software, X.G.; validation, X.G. and B.X.; formal analysis, X.G.; investigation, X.G.; resources, B.X.; data curation, X.G.; writing—original draft preparation, X.G.; writing—review and editing, B.X.; visualization, B.X.; supervision, B.X.; project administration, B.X.; funding acquisition, B.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Basic Research Program of Shaanxi under Grant No. 2023-JC-QN-0717.

Data Availability Statement

The dataset presented in this paper is SAR-AIRcraft-1.0, which is available at https://radars.ac.cn/web/data/getData?dataType=SARDataset, accessed on 15 June 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AA	Aubert and Aujol
AFPN	Asymptotic Feature Pyramid Network
ASFF	Adaptive Spatial Feature Fusion
CNN	Convolutional neural network
DFL	distribution focal loss
EMA	efficient multi-scale attention
ENL	equivalent number of looks
EPI	edge preserving index
IoU	Intersection over Union
MAP	maximum a posteriori
NTV	nonconvex total variation
SAR	Synthetic aperture radar
SFR-Net	scattering feature relationship network
STDH	small target detection head
TV	total variation
YOLOv8	You Only Look Once v8

Figure 1. The overall network structure of SAR-NTV-YOLOv8.

Figure 2. Overall structure of SAR-YOLOv8 aircraft target detection network.

Figure 3. EMA module structure based on cross spatial learning.

Figure 4. Structure diagram based on adaptive spatial feature fusion.

Figure 5. The adaptive spatial feature fusion process.

Figure 6. Comparison of NTV with some popular methods for the despeckling of real SAR image 1. (a) Original SAR image. (b) MIDAL. (c) TV. (d) SAR-CNN-M-xUnit. (e) NTV.

Figure 7. Comparison of NTV with some popular methods for the despeckling of real SAR image 2. (a) Original SAR image. (b) MIDAL. (c) TV. (d) SAR-CNN-M-xUnit. (e) NTV.

Figure 8. Comparison of detection results based on different speckle reduction methods using SAR-YOLOv8. The green, yellow, and red rectangles represent correct detections, missed detections, and false alarms, respectively. (a) Original SAR image. (b) MIDAL. (c) TV. (d) SAR-CNN-M-xUnit. (e) NTV.

Figure 9. Comparison with other methods. (a) AP50 curves. (b) PR curves.

Figure 10. Detection results of each algorithm in four typical SAR scenarios from the SAR-AIRcraft-1.0 dataset. The green, yellow, and red rectangles represent correct detections, missed detections, and false alarms, respectively. The rows, from top to bottom, show the results of FCOS, YOLOv7, LMSD-YOLO, PFF-ADN, SEFEPNet, and SAR-NTV-YOLOv8.

Figure 11. Detection results of each algorithm in four typical SAR scenarios from the SADD dataset. The green, yellow, and red rectangles represent correct detections, missed detections, and false alarms, respectively. The rows, from top to bottom, show the results of FCOS, YOLOv7, LMSD-YOLO, PFF-ADN, SEFEPNet, and SAR-NTV-YOLOv8.

Table 1. ENL, EPI, and runtime of different despeckling algorithms on two real SAR images.

Image	Algorithm	ENL	EPI	Time (s)
SAR Image 1	MIDAL	11.43	0.31	7.13
SAR Image 1	TV	5.96	0.27	0.98
SAR Image 1	SAR-CNN-M-xUnit	12.57	0.39	2.73
SAR Image 1	NTV	21.73	0.43	1.02
SAR Image 2	MIDAL	9.15	0.33	7.01
SAR Image 2	TV	7.78	0.25	0.93
SAR Image 2	SAR-CNN-M-xUnit	14.91	0.36	2.67
SAR Image 2	NTV	23.94	0.40	0.99

Best results are in bold.
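
For readers who want to reproduce an ENL figure like those in Table 1, the following is a minimal sketch of the commonly used estimator, ENL = mean² / variance, evaluated over a visually homogeneous intensity patch. It is not the authors' evaluation code: the function name `equivalent_number_of_looks`, the synthetic gamma-distributed speckle, and the 2 × 2 block averaging used as a stand-in "despeckler" are all assumptions made for illustration.

```python
import numpy as np


def equivalent_number_of_looks(region: np.ndarray) -> float:
    """Estimate ENL = mean^2 / variance on a homogeneous intensity patch.

    Hypothetical helper for illustration; the patch is assumed to be
    intensity (not amplitude or dB) data from a textureless area.
    """
    region = np.asarray(region, dtype=np.float64)
    variance = region.var()
    if variance == 0.0:
        raise ValueError("ENL is undefined for a constant patch")
    return float(region.mean() ** 2 / variance)


# Synthetic 4-look speckle on a constant scene: unit-mean gamma noise
# with shape L has variance 1/L, so the expected ENL is about L.
rng = np.random.default_rng(0)
looks = 4
patch = rng.gamma(shape=looks, scale=1.0 / looks, size=(256, 256))
enl_raw = equivalent_number_of_looks(patch)

# Averaging non-overlapping 2 x 2 blocks acts as a crude smoother and
# should raise the ENL by roughly a factor of four.
smoothed = patch.reshape(128, 2, 128, 2).mean(axis=(1, 3))
enl_smoothed = equivalent_number_of_looks(smoothed)

print(f"ENL before smoothing: {enl_raw:.2f}")
print(f"ENL after smoothing:  {enl_smoothed:.2f}")
```

A higher ENL indicates stronger smoothing in homogeneous areas, which is why it is normally reported together with an edge-preservation measure such as EPI: block averaging raises the ENL but also blurs edges.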

Table 2. Comparison with other target detection methods on the two datasets.

Algorithm	Params (M)	FLOPs (G)	SAR-AIRcraft-1.0 AP (%)	SAR-AIRcraft-1.0 P (%)	SAR-AIRcraft-1.0 R (%)	SAR-AIRcraft-1.0 F1 (%)	SADD AP (%)	SADD P (%)	SADD R (%)	SADD F1 (%)
FCOS	31.89	206.20	57.93	82.86	85.01	83.37	73.89	87.92	86.45	87.18
YOLOv7	37.23	105.23	70.64	89.71	88.45	87.41	81.47	92.11	90.35	91.21
LMSD-YOLO	3.5	6.6	73.43	88.24	89.07	87.41	83.18	92.47	90.92	92.00
PFF-ADN	43.15	198.50	76.15	87.36	90.97	90.33	85.62	93.12	91.75	93.16
SEFEPNet	40.23	167.21	80.33	90.78	90.62	91.34	86.98	94.13	92.88	93.05
SAR-NTV-YOLOv8	49.33	117.61	83.41	93.53	92.17	93.10	88.64	93.82	93.15	92.73

Best results are in bold.
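
As a reminder of how the P, R, and F1 columns relate, the sketch below computes them from detection counts using the textbook definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The function name `detection_scores` and the counts are hypothetical, and the values reported in Table 2 may follow a different thresholding or averaging convention, so this sketch is not expected to reproduce them exactly.

```python
def detection_scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts.

    Hypothetical helper using the standard definitions; a detection is
    assumed to have already been matched to ground truth (e.g. by IoU).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 90 correct detections, 10 false alarms,
# and 30 missed aircraft.
precision, recall, f1 = detection_scores(tp=90, fp=10, fn=30)
print(f"P = {precision:.2%}, R = {recall:.2%}, F1 = {f1:.2%}")
```

In the colour coding of Figures 8, 10, and 11, the green boxes correspond to TP, the red boxes to FP, and the yellow boxes to FN.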

Table 3. Ablation studies of STDH, EMA, AFPN, and NTV.

No.	STDH	EMA	AFPN	NTV	Precision (%)	Recall (%)	F1 (%)	mAP50 (%)
1	x	x	x	x	87.05	81.72	82.13	75.61
2	✓	x	x	x	88.63	86.43	84.43	77.37
3	✓	✓	x	x	90.21	87.23	87.36	79.81
4	✓	✓	✓	x	91.21	89.74	90.08	80.23
5	✓	✓	✓	✓	93.53	92.17	93.17	83.41

Best results are in bold.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
