1. Introduction
Synthetic aperture radar (SAR) is a type of high-resolution coherent imaging radar that is not affected by lighting and weather conditions, enabling all-day and all-weather ground observation [1]. SAR is widely used in various fields [2] owing to its long-range, high-resolution imaging. In recent years, with the maturity of SAR technology and the enhancement of data acquisition capabilities, rapidly extracting useful information from massive high-resolution SAR image data has become an important problem in SAR applications. Automatic target recognition (ATR) technology for SAR images [3], which aims to solve this problem, has thus become a hot research topic.
In the past few decades, SAR ATR technology has made significant progress from theoretical research to practical application. Classical methods for SAR ATR are based on templates and models. Template-based methods can be further divided into direct template-matching methods [4], which calculate the correlation or distance between the test sample and templates obtained from the training samples themselves, and feature template-matching methods, which use common classifiers such as SVM [5], KNN [6], and Bayesian classifiers [7] to compare the geometric [8], mathematical [9,10], or electromagnetic scattering features [11,12] extracted from the training and test samples. Although template-matching methods are simple in principle and easy to implement, it is difficult to establish a complete template library in practical applications, and such methods require large storage and computation resources. To overcome these limitations, model-based SAR target recognition methods [13,14] have been proposed, which use electromagnetic simulation software to compute electromagnetic scattering images from established target models and perform feature matching with test samples to achieve target recognition.
Classical SAR target recognition methods rely on laborious, manually designed feature engineering, which is limited to some extent by data structure and feature extraction capability, leading to unstable recognition performance. The automatic feature extraction ability of neural networks has therefore enabled the application of deep learning methods to SAR target recognition. Initially, deep learning models used for computer vision tasks were directly fine-tuned for SAR target recognition [15], and some unsupervised learning methods were also directly used for SAR target feature extraction [16,17,18]. Subsequently, methods specifically designed for the amplitude characteristics of SAR images were proposed. Chen et al. [19] designed A-ConvNet based on VGG for the MSTAR target recognition task and achieved a recognition rate above 99%; Lin et al. [20] proposed a deep convolutional Highway Unit for SAR ATR with few samples; Gao et al. [21] proposed a dual-branch deep convolutional neural network (Dual-CNN) to extract polarization and spatial features and fuse them together; Jiao et al. [22] designed a Wishart deep stacking network (DSN) specifically for polarimetric SAR target classification, which performed well on real polarimetric SAR data; Li et al. [23] proposed a fully convolutional attention module that focuses on important channels and target areas, improving computational efficiency and significantly improving SAR target recognition performance. Recent studies have attempted to combine deep learning with physical models, focusing on the characteristics brought about by the special imaging mechanism of SAR. Zhang et al. [24] introduced a network called DKTS-N to combine deep learning networks with SAR domain-specific knowledge. Huang et al. [24] proposed Deep SAR-Net, which uses CNNs and convolutional autoencoders to extract the spatial and scattering features of SAR images for classification tasks. Feng et al. [25] proposed a method based on integrated partial models and deep learning algorithms to combine electromagnetic scattering characteristics with deep neural networks.
In practical applications, due to the special imaging mechanism of SAR, the visual features of the same target vary greatly under different observation azimuths, which poses challenges for single-aspect SAR target recognition. With the advancement of SAR systems, multi-aspect SAR techniques such as circular SAR enable continuous observation of a given target from various viewing angles. Multi-aspect SAR target recognition utilizes multiple images of the same target obtained from different observation angles, combining the scattering characteristics from different perspectives. By fully exploiting the complementary and correlated recognition information of the target at different angles, multi-aspect SAR target recognition can significantly improve the accuracy and anti-interference ability of target recognition.
The deep learning methods used for multi-aspect SAR target recognition are mainly based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). For example, MA-BiLSTM [26] and BCRN [27] are based on long short-term memory networks, MVDCNN [28] is based on parallel CNNs with hierarchical merging and fusion structures, and MVDFLN [29] combines recurrent units and convolution. Existing multi-aspect SAR target recognition methods mainly face the following challenges:
Multi-aspect SAR image recognition methods based on RNN or CNN are limited by sequence constraints. It is difficult to learn the correlation between two images that are far apart in the multi-aspect SAR image sequence, leading to information loss.
The number of SAR image samples is insufficient for training deep networks. Most existing methods adopt supervised learning combined with data augmentation, and the limited SAR data restrict the generalization ability of deep learning models.
To address the limitations of existing methods, a multi-aspect SAR target recognition method based on contrastive learning (CL) and Non-Local is proposed in this paper. After pre-training, the encoder part of the CL network is used to extract feature maps from each image in the multi-aspect SAR image sequence. Based on the obtained feature maps, high-dimensional features are further extracted, while Non-Local computation is inserted between different feature extraction layers to achieve multi-aspect feature learning.
This paper proposes an innovative approach that exploits the correlation between multi-aspect SAR images by utilizing Non-Local [30]. Self-attention calculation is not affected by the order of images in the sequence; thus, it can mine the correlation information between images more effectively. As a classic application of self-attention in computer vision, Non-Local operates directly on two-dimensional images and achieves pixel-level self-attention calculation between feature maps with a simple and flexible structure. Considering the loss of local detailed information in self-attention calculation, a ResNet [31] structure is designed to extract feature maps. Convolutional operations and pre-training based on CL are employed to reduce the required sample quantity and improve the generalization ability of the network.
Compared to existing methods, the novelty and contribution of the proposed method in this paper can be summarized as follows.
A Non-Local structure is introduced for multi-aspect SAR feature learning. By implementing self-attention calculation multiple times in feature spaces of different dimensions, Non-Local can effectively capture the correlation information among multi-aspect SAR images.
The lightweight contrastive learning network is applied to SAR image feature extraction tasks in order to fully utilize the limited SAR data to train an effective feature extraction network.
Compared with existing methods, our method achieves higher recognition accuracy on the MSTAR dataset and demonstrates better generalization performance in cases of few samples and strong interference.
The remainder of this paper is organized as follows. A comprehensive description of the proposed network structure is provided in Section 2. Section 3 outlines the experimental details and discusses the obtained results. Section 4 discusses the benefits of the proposed method and outlines potential future work. Section 5 provides a summary of the entire paper.
2. Proposed Method
The overall architecture of the proposed network includes four parts, i.e., sequence construction, feature extraction based on pre-training by contrastive learning, multi-aspect feature learning based on Non-Local, and classification, as shown in Figure 1.
The multi-aspect SAR image sequences are constructed from single-aspect SAR images, which are also used for pre-training. In the CL network for pre-training, traditional data augmentation methods are used to generate two enhanced views of a single SAR image, each serving as the input of one of the two branches. The pre-training network is optimized by reducing the difference between the output features of the two branches. After pre-training, the encoder of the upper branch is transferred to extract the feature map of each image in the multi-aspect SAR image sequences. Then, during multi-aspect feature learning, the extracted feature maps of each image are input into the multi-aspect encoder based on Non-Local and ResNet to learn the correlation between multi-aspect SAR images. The output features of the multi-aspect encoder are dimensionally reduced and then averaged along the sample dimension for feature fusion. Finally, the softmax classifier is used to obtain the prediction probability. The following sections provide the details and training process of the proposed method.
2.1. Sequence Construction
From different azimuth and depression angles, multi-aspect SAR images of the target can be obtained through one or more platforms, which can be used to construct multi-aspect SAR image sequences through the following steps [32]. The original SAR image set $X$ consists of multiple categories $\{X_1, X_2, \ldots, X_C\}$. The images in each category are sorted according to azimuth, which is denoted as $X_c = \{x_1, x_2, \ldots, x_{N_c}\}$, where $C$ represents the total number of target categories, $N_c$ represents the number of images contained in each class, and $\theta_i$ is assumed to be the azimuth angle of the image $x_i$. For a given sequence length $k$, $k$ images are selected from the original image set by a sliding window with a step size of 1 and combined in different permutations to obtain sequences. The sequences in which the azimuth difference between any two images is less than the given angle range $\Delta\theta$ are selected as experimental samples. Assume that the final constructed sequences of a certain class are denoted as $S_c = \{s_1, s_2, \ldots, s_{P_c}\}$, where $P_c$ represents the number of sequences. The construction process is shown in Algorithm 1, and an example of multi-aspect SAR image sequence construction is shown in Figure 2.
Algorithm 1 Sequence construction algorithm
Initialization: the angle range Δθ, the sequence length k
Input: original images and the number of classes C
Output: the constructed sequence set S
for c = 1 to C do
    for i = 1 to N_c − k + 1 do
        take the window (x_i, x_{i+1}, …, x_{i+k−1})
        if θ_{i+k−1} − θ_i < Δθ then
            combine the window images in different permutations and add all resulting sequences of length k to S
        end if
    end for
end for
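For concreteness, the following minimal Python sketch implements the sliding-window construction of Algorithm 1 for a single class; the names (`build_sequences`, `max_range`) are our own, not from the original implementation, and azimuth wraparound at 360° is ignored.

```python
from itertools import permutations

def build_sequences(images, azimuths, k, max_range):
    """Sliding-window sequence construction for one class (a sketch of
    Algorithm 1). `images` are sorted by azimuth, `azimuths[i]` is the
    azimuth angle of images[i], `k` is the sequence length, and
    `max_range` is the allowed angle range (Delta-theta)."""
    sequences = []
    for i in range(len(images) - k + 1):            # window slides with step size 1
        # Since the images are sorted, the two extreme images bound every
        # pairwise azimuth difference inside the window.
        if azimuths[i + k - 1] - azimuths[i] < max_range:
            window = images[i:i + k]
            sequences.extend(permutations(window))  # all orderings of the window
    return sequences
```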
2.2. Feature Extraction Based on Contrastive Learning
To introduce contrastive learning into SAR image feature extraction, considering that the visual features presented in SAR images are often quite similar, we choose the Bootstrap Your Own Latent (BYOL) [
33] network, which is based on the asymmetry of two branches instead of negative samples, for pre-training the ResNet model used for single-aspect SAR image feature extraction.
As shown in Figure 3, the online branch of the pre-training network based on BYOL consists of an encoder based on ResNet and a projector and predictor based on multi-layer perceptrons (MLPs) with the same structure. The target branch only includes an encoder and projector with the same structure as those in the online branch.
The encoder based on ResNet is used to output the feature representation of the input image. To obtain a lightweight model, depthwise separable convolution (DSC) [34] is designed to replace the original convolution layers in ResNet; it consists of depthwise convolution and pointwise convolution, as shown in Figure 4. Depthwise convolution applies a separate convolution kernel to each channel of the input, while pointwise convolution uses a $1 \times 1$ convolution to fuse the outputs of the depthwise convolution across channels and change the channel number of the final output. Compared to ordinary convolution, DSC achieves a significant reduction in parameter and computational costs.
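As a sketch of the DSC layer described above (the class name and default hyperparameters are our own assumptions), in PyTorch:

```python
import torch.nn as nn

class DSC(nn.Module):
    """Depthwise separable convolution: a depthwise convolution that applies
    one kernel per input channel, followed by a 1x1 pointwise convolution
    that fuses channels and sets the output channel number."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```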
The input convolutional layer of the encoder contains a DSC layer and a ReLU activation function. The core structure then consists of several identical units stacked together, each containing a max pooling layer and two Lite ResBlocks. The Lite ResBlock includes two DSC layers and a residual connection, as shown in Figure 5. The output pooling layer of the encoder uses average pooling, which takes the average of the values within the pooling window as the pooled value.
Assuming that the input image of the encoder is $x$, the DSC operation is denoted as $F_{DSC}(\cdot)$, and $\sigma(\cdot)$ represents the ReLU activation function, the calculation of the input convolutional layer can be expressed as:
$$h_0 = \sigma(F_{DSC}(x))$$
Then, suppose the encoder contains $L$ stacked units, the input of the $i$-th unit is $h_{i-1}$, and its output is $h_i$, where the input of the first unit is $h_0$. Suppose $F_{RB}(\cdot)$ denotes the computation of the Lite ResBlock and $F_{MP}(\cdot)$ denotes the max pooling operation. The operation of the $i$-th unit can be represented as:
$$h_i = F_{RB}(F_{RB}(F_{MP}(h_{i-1})))$$
Suppose $F_{BN}(\cdot)$ is the batch normalization operation; then, the calculation process of $F_{RB}(\cdot)$ is defined as:
$$F_{RB}(h) = \sigma\big(h + F_{BN}(F_{DSC}(\sigma(F_{BN}(F_{DSC}(h)))))\big)$$
The output of the last unit $h_L$ passes through the average pooling layer, denoted as $F_{AP}(\cdot)$, to obtain the final output $y$ of the encoder, which can be expressed as:
$$y = F_{AP}(h_L)$$
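Reusing the hypothetical `DSC` module sketched above, the Lite ResBlock and one stacked unit could be written as follows; channel widths are kept constant here for brevity, which is an assumption rather than the paper's exact configuration.

```python
import torch.nn as nn

class LiteResBlock(nn.Module):
    """Two DSC layers with batch normalization and a residual connection:
    F_RB(h) = ReLU(h + BN(DSC(ReLU(BN(DSC(h))))))."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            DSC(channels, channels), nn.BatchNorm2d(channels), nn.ReLU(),
            DSC(channels, channels), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, h):
        return self.relu(h + self.body(h))

def make_unit(channels):
    # One stacked unit: max pooling followed by two Lite ResBlocks.
    return nn.Sequential(nn.MaxPool2d(2),
                         LiteResBlock(channels), LiteResBlock(channels))
```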
The projector and predictor are based on multi-layer perceptrons with the same structure, consisting of two fully connected layers separated by batch normalization and a ReLU activation function, which first expand and then reduce the dimensionality of the feature vector. The input of the projector is the output of the encoder $y$, and the output of the projector is denoted as $z$. The predictor takes the output of the projector as its input, and its output is denoted as $q$. The calculation process of the projector and predictor is as follows:
$$z = W_2^{proj}\,\sigma(F_{BN}(W_1^{proj} y)), \qquad q = W_2^{pred}\,\sigma(F_{BN}(W_1^{pred} z))$$
where $W_1^{proj}$ and $W_2^{proj}$ represent the two fully connected sub-layers in the projector, while $W_1^{pred}$ and $W_2^{pred}$ represent the two fully connected sub-layers in the predictor.
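Both heads share the same two-layer structure, so a single helper suffices in a sketch; the dimensions below are placeholders, not the paper's configuration.

```python
import torch.nn as nn

def mlp_head(in_dim, hidden_dim, out_dim):
    """Projector/predictor: two fully connected layers separated by
    batch normalization and ReLU (expand, then reduce)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g., projector = mlp_head(256, 1024, 128); predictor = mlp_head(128, 1024, 128)
```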
The pre-training network takes single-aspect SAR images as input and generates two augmented views through methods such as rotation, flipping, cropping, scaling, and brightness adjustment. The online branch passes through the encoder, projector, and predictor to obtain its output vector, while the target branch obtains its output vector through only the encoder and projector. During pre-training, the network is optimized by minimizing the error between the output vectors of the two branches, as introduced in Section 2.5.1. After pre-training, the input convolutional layer and the first $N$ stacked units of the online branch encoder are transferred to extract the feature maps of each image in the sequence. The obtained feature maps are denoted as $f_j$, $j = 1, 2, \ldots, k$.
2.3. Multi-Aspect Encoder Based on Non-Local
The multi-aspect encoder, which combines Non-Local and ResNet, is designed to achieve multi-aspect feature learning. The main structure and details of the multi-aspect encoder are shown in Figure 6.
The multi-aspect encoder is composed of multiple layers. Each layer concatenates the feature maps extracted from each SAR image in the input multi-aspect SAR image sequence and performs self-attention calculation via the Non-Local layer. Then, the output of Non-Local is split along the sample dimension to extract higher-level features separately for each image by ResBottleneckBlock. Finally, the output is downsampled through the maxpool layer. Suppose there are M layers in the multi-aspect encoder. The detailed calculation process for each layer will be described next.
Assume the $m$-th layer of the multi-aspect encoder takes the inputs $X_j^{m-1} \in \mathbb{R}^{C_{m-1} \times H_{m-1} \times W_{m-1}}$ and produces the outputs $X_j^{m} \in \mathbb{R}^{C_{m} \times H_{m} \times W_{m}}$, $j = 1, 2, \ldots, k$, where $k$ is the length of the multi-aspect SAR image sequence, $H_{m-1} \times W_{m-1}$ and $H_m \times W_m$ are the feature map sizes of the input and output of the $m$-th layer, and $C_{m-1}$ and $C_m$ represent the numbers of input and output channels of the $m$-th layer. The input of the first layer is the output $f_j$, $j = 1, 2, \ldots, k$, of the feature extraction part of the network, i.e., $X_j^{0} = f_j$. Firstly, we concatenate the input feature maps into $X \in \mathbb{R}^{k \times C_{m-1} \times H_{m-1} \times W_{m-1}}$, which serves as the input vector for Non-Local.
Non-Local is a residual structure that computes its output by adding the input vector to the result of the self-attention computation. Given an input vector $X$, three convolution operations with kernel size $1 \times 1$ are performed to obtain the query vector $Q$, key vector $K$, and value vector $V$ with the same dimensions as the input vector. The convolution kernels are denoted as $W_Q$, $W_K$, and $W_V$, and the corresponding biases are denoted as $b_Q$, $b_K$, and $b_V$. The calculation process can be expressed as:
$$Q = W_Q * X + b_Q, \qquad K = W_K * X + b_K, \qquad V = W_V * X + b_V$$
The shapes of $Q$, $K$, and $V$ are flattened into $C_{m-1} \times N$, where $N = k \times H_{m-1} \times W_{m-1}$. Next, the original correlation matrix $R$ is calculated through the dot product, i.e., by transposing the key vector $K$ and multiplying it with the query vector $Q$. Then, the weight matrix $W$ is obtained by normalizing $R$ with the softmax function. The calculation process is as follows:
$$R = K^{T} Q, \qquad W_{ij} = \frac{\exp(R_{ij})}{\sum_{n=1}^{N} \exp(R_{nj})}$$
where $n$ represents the position number in the input vector, that is, $n = 1, 2, \ldots, N$. The weight matrix $W$ is multiplied with the value vector $V$ to obtain the output $Y_{att}$ of self-attention, which is given by:
$$Y_{att} = V W$$
Finally, $Y_{att}$ is reshaped into $k \times C_{m-1} \times H_{m-1} \times W_{m-1}$ and added to the input vector $X$ of Non-Local to complete the residual calculation and obtain the output $Y$, which can be expressed as:
$$Y = X + Y_{att}$$
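The computation above can be sketched as a PyTorch module; this is our schematic reading of the equations (treating the $k$ sequence images as the leading dimension), not the authors' released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class NonLocal(nn.Module):
    """Pixel-level self-attention across all k feature maps of a sequence:
    1x1 convolutions produce Q, K, V; attention is computed over all
    N = k*H*W positions; the result is added back to the input (residual)."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                   # x: (k, C, H, W)
        k, c, h, w = x.shape
        q   = self.q(x).permute(1, 0, 2, 3).reshape(c, -1)  # C x N, N = k*H*W
        key = self.k(x).permute(1, 0, 2, 3).reshape(c, -1)  # C x N
        v   = self.v(x).permute(1, 0, 2, 3).reshape(c, -1)  # C x N
        r = key.t() @ q                                     # N x N correlation matrix R
        w_mat = F.softmax(r, dim=0)                         # normalize over positions n
        y = v @ w_mat                                       # C x N attention output
        y = y.reshape(c, k, h, w).permute(1, 0, 2, 3)       # back to (k, C, H, W)
        return x + y                                        # residual connection
```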
The output of Non-Local $Y$ is split back into individual feature maps $Y_j$, $j = 1, 2, \ldots, k$, which are then fed into ResBottleneckBlocks with shared parameters for further feature extraction. ResBottleneckBlock is a residual structure in which the input vector passes through a $1 \times 1$ convolution layer with batch normalization (BN) and ReLU, followed by a $3 \times 3$ convolution layer with BN and ReLU, and finally a $1 \times 1$ convolution layer with BN only. The output of this branch is then added to the input vector and passed through another ReLU activation function. The calculation process can be expressed as follows:
$$Z_j = \sigma\big(Y_j + F_3(F_2(F_1(Y_j)))\big)$$
where $F_1(\cdot)$ represents the $1 \times 1$ convolution with BN and ReLU, $F_2(\cdot)$ represents the $3 \times 3$ convolution with BN and ReLU, and $F_3(\cdot)$ represents the $1 \times 1$ convolution with only BN. The output of the ResBottleneckBlock is downsampled using a maxpool layer to obtain the output of the $m$-th layer of the multi-aspect encoder, i.e., $X_j^{m} = F_{MP}(Z_j)$.
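A corresponding sketch of the ResBottleneckBlock; the bottleneck width is an illustrative assumption.

```python
import torch.nn as nn

class ResBottleneckBlock(nn.Module):
    """1x1 conv (BN+ReLU) -> 3x3 conv (BN+ReLU) -> 1x1 conv (BN only),
    added to the input and passed through a final ReLU."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.f1 = nn.Sequential(nn.Conv2d(channels, bottleneck, 1),
                                nn.BatchNorm2d(bottleneck), nn.ReLU())
        self.f2 = nn.Sequential(nn.Conv2d(bottleneck, bottleneck, 3, padding=1),
                                nn.BatchNorm2d(bottleneck), nn.ReLU())
        self.f3 = nn.Sequential(nn.Conv2d(bottleneck, channels, 1),
                                nn.BatchNorm2d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.f3(self.f2(self.f1(x))))
```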
2.4. Feature Dimensionality Reduction and Classification
The output of the last ($M$-th) layer of the multi-aspect encoder is $k$ one-dimensional feature vectors, denoted as $v_j$, $j = 1, 2, \ldots, k$. They are concatenated along the sample dimension to obtain $U$ with size $k \times D$, where $D$ is the feature dimension. The feature dimension of $U$ is reduced by a $1 \times 1$ convolution layer to obtain the output $Z$ with size $k \times C$, where $C$ is the number of sample classes. After averaging $Z$ along the sample dimension to achieve feature fusion, the softmax classifier is applied to obtain the predicted probabilities of the input samples.
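Under these definitions, the reduction-and-fusion head could be sketched as follows, where `reduce_conv` stands for the $1 \times 1$ convolution, e.g., `nn.Conv1d(D, C, kernel_size=1)`; the helper name is hypothetical.

```python
import torch.nn.functional as F

def classify(u, reduce_conv):
    """u: (k, D) concatenated feature vectors; reduce_conv maps D -> C."""
    z = reduce_conv(u.t().unsqueeze(0)).squeeze(0).t()  # (k, C)
    fused = z.mean(dim=0)                               # average along the sample dimension
    return F.softmax(fused, dim=0)                      # predicted class probabilities
```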
2.5. Training Process
2.5.1. Pre-Training Based on CL and Layer Transfer
As described in Section 2.2, the CL network is optimized by minimizing the difference between the outputs of the online and target branches. The MSE loss function is used to calculate the error between the two branches, computed as the distance between the L2-normalized output vectors of the two branches. The loss function can be formulated as:
$$\mathcal{L}_{1} = \left\| \frac{q_1}{\|q_1\|_2} - \frac{z'_2}{\|z'_2\|_2} \right\|_2^2$$
where $q_1$ represents the output of the online branch for the augmented view Aug1, and $z'_2$ represents the output of the target branch for the augmented view Aug2. Due to the asymmetry of the structure, the network needs to exchange the input views of the two branches; the complete loss function of the network is then:
$$\mathcal{L} = \mathcal{L}_{1} + \mathcal{L}_{2}$$
where $\mathcal{L}_{2}$ is obtained by feeding Aug2 to the online branch and Aug1 to the target branch.
The parameters of the online branch are updated by backpropagation (BP), while the parameters of the target branch are updated using the momentum update mechanism, meaning that each update is determined by the corresponding parameters of the online branch. The momentum update mechanism can be expressed as:
$$\xi \leftarrow m \xi + (1 - m)\theta$$
where $\xi$ represents the parameters of the target branch and $\theta$ represents the updated parameters of the online branch. $m$ is the weight coefficient, which is usually large, so that the parameters of the target branch change slowly and steadily approach $\theta$.
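A minimal sketch of the symmetric loss and the momentum update; `m = 0.99` is an illustrative value, and detaching the target output (no gradient through the target branch) follows standard BYOL practice [33].

```python
import torch
import torch.nn.functional as F

def byol_loss(q_online, z_target):
    """MSE between the L2-normalized online prediction and target projection;
    the complete loss adds the same term with the two views exchanged."""
    q = F.normalize(q_online, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)  # no gradient through the target
    return ((q - z) ** 2).sum(dim=-1).mean()

@torch.no_grad()
def momentum_update(target_net, online_net, m=0.99):
    # xi <- m * xi + (1 - m) * theta
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(m).add_((1 - m) * p_o.data)
```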
After the pre-training network converges, the first few layers of the online branch encoder are transferred to the overall network to extract feature maps from each image in the sequences. During the training of the overall network, the parameters of the feature extraction part remain fixed.
2.5.2. Training of Overall Network
Considering that the DSC used in the Lite ResBlock may cause a loss of model accuracy under certain conditions [35], knowledge distillation (KD) [36] is introduced into the overall network training. Through KD, the supervised information of a better-performing but more complex model can be involved in the training of a lightweight model, thereby improving the lightweight model's performance.
The proposed lightweight network is the student model, and the network using ordinary convolution layers instead of DSC is the teacher model. One of the key points of KD is to add a temperature parameter $T$ to the softmax classifier, which can be described as:
$$p_j(T) = \frac{\exp(z_j / T)}{\sum_{n=1}^{N} \exp(z_n / T)}$$
where $z_j$ is the logit for class $j$.
In the training process of the student model, the parameters of the teacher model, which has already been trained with the cross-entropy loss function, remain unchanged. The loss function of the student model includes the cross-entropy loss between the output and the sample label, and the distillation loss, which measures the gap between the student and teacher models at the same temperature $t$ using the Kullback–Leibler divergence. The loss function can be formulated as:
$$\mathcal{L} = -\sum_{j=1}^{N} y_j \log p_j^{s}(1) + \alpha \sum_{j=1}^{N} p_j^{t}(t) \log \frac{p_j^{t}(t)}{p_j^{s}(t)}$$
where $N$ represents the number of class types, $p_j^{t}(t)$ and $p_j^{s}(t)$ represent the prediction probabilities for class $j$ output by the teacher model and the student model at temperature $t$, $y_j$ represents the value of the sample label corresponding to class $j$, and $\alpha$ is the proportion coefficient.
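As a sketch of this objective (the `t * t` scaling of the KL term follows common KD practice [36]; the default `t` and `alpha` values are illustrative):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, t=4.0, alpha=0.5):
    """Cross-entropy on the hard labels plus temperature-softened
    Kullback-Leibler divergence between teacher and student outputs."""
    ce = F.cross_entropy(student_logits, labels)          # hard-label term
    log_p_s = F.log_softmax(student_logits / t, dim=-1)   # student at temperature t
    p_t = F.softmax(teacher_logits / t, dim=-1)           # teacher at temperature t
    kd = F.kl_div(log_p_s, p_t, reduction="batchmean") * t * t
    return ce + alpha * kd
```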