1. Introduction
Synthetic aperture radar (SAR) is an active ground observation system that can be installed on aircraft, satellites, spaceships and other flight platforms. Compared with the optical and infrared observation methods, SAR can overcome the adverse effects of weather and perform dynamic observations of ground and ocean targets, so it has bright application prospects in the field of remote sensing. Compared with natural images, SAR images reflect the backscattering intensity of electromagnetic information, so specialist systems are needed to interpret them, but searching for targets of interest in the massive SAR images by humans is time-consuming and extremely difficult, which justifies the urgent need for SAR automatic target recognition (SAR-ATR) algorithms [
1]. In the era of big data, there are tons of SAR image data waiting to be processed every day. Therefore, SAR-ATR requires not only high recognition accuracy, but also efficient data processing flows.
The traditional SAR image target recognition methods are mainly composed of independent steps such as preprocessing, feature extraction, recognition and classification. The feature extraction process usually needs scale invariant feature transform (SIFT) [
2], histogram of oriented gradient (HOG) [
3] and other algorithms to extract good distinguishing features to better complete the classification task. However, both the accuracy and efficiency of SAR image recognition are seriously restricted due to the complicated process and hand-designed features [
4,
5].
In 2012, the deep CNN [
6] proposed by Krizhevsky et al. achieved the error rate considerably lower than the previous state of the art results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and CNN became of great interest to the academic community. Since then, many CNN models such as VGGNet [
7], GoogLeNet [
8], ResNet [
9], DenseNet [
10] and SENet [
11], have been proposed and constantly challenged the computer’s limit of image cognitive ability. In the last ILSVRC competition in 2017, the Top-5 error rate of the Image Classification task reached 2.251%, exceeding the human recognition level.
The exciting progress of CNN in the field of computer vision (CV) has encouraged people to think about how to apply CNN to target recognition in SAR images, and many scholars have made intensive studies of this topic. Some papers used CNN to accomplish the SAR image target classification experiments on the Moving and Stationary Target Acquisition and Recognition (MSTAR) [
12] public data set. The accuracy has gradually increased from 84.7% [
1] to more than 99% [
13,
14], which is higher than that of SVM [
15], Cond Gauss [
16], AdaBoost [
17] and so on. Ding et al. [
4] investigated the capability of a CNN combined with three types of data augmentation operations in SAR target recognition. Issues such as translation of target, randomness of speckle noise in different observations, and lack of pose images in training data are intensive studied. Huang et al. [
18] studied the influence of different optimization methods on the recognition results of SAR images in CNN model. Huang et al. [
19] discussed the problem of SAR image recognition under limited labeled data. Bentes et al. [
20] compared four CNN models [
4,
21,
22,
23] used in SAR-ATR in recent years, and put forward a multiple resolution input CNN model (CNN-MR). In order to improve the learning ability of the network, the SAR images are processed to different resolution slices in CNN-MR. The performance of the CNN-MR in the experiments indicating that the informative features make a significant contribution to obtain higher accuracy in CNN models, but such data preprocessing will bring extra work. As summarized in [
24], while deep learning has become the main tool for tasks like detection in CV on RGB imagery, however, it has not yet had the same impact on remote sensing. As far as we know, many CNN models [
1,
4,
14,
22] designed for SAR image recognition are shallow networks, and only a few frontier technologies are utilized in the field of CV. We consider that the following three aspects can be further studied in the task of SAR image recognition with CNN:
Many shallow CNN models just consist of several convolution layers, pooling layers and an output layer. The interdependencies between channels and spaces of feature maps are often overlooked. How to improve the expressive ability of CNN and extract informative features through network designing is a valuable research direction.
Because the source of SAR image acquisition is greatly limited, some data sets are highly imbalanced. When the traditional machine learning classification method is applied to the imbalanced dataset, the classifier is biased to minority classes in order to improve the overall accuracy, and the classification performance is seriously affected. As far as we know, the problem of data imbalance in SAR image target recognition has not been paid enough attention in the current research yet.
The huge amount of parameters is an obstacle when CNN is applied in practice. In SAR image recognition, attention should also be paid to reducing network parameters and computation consumption while ensuring accuracy.
The human visual system (HVS) can automatically locate the salient regions in visual images. Inspired by the HVS mechanism, several attention models are proposed to better understand how the regions of interest (ROIs) are selected in images [
25]. The visual attention mechanism has been wildly applied in many prediction tasks such as natural language processing (NLP) [
26], image/video caption [
27,
28], image classification [
11,
29] etc. In SAR image recognition, Karine et al. [
25] combined the SIFT method with a saliency attention model and built a new feature named multiple salient keypoints descriptors (MSKD). MSKD is not used on the whole SAR image, but only the target area. The recognition experiments for both ISAR and SAR images show that MSKD can achieve a significant advantage over SIFT, which indicates that the application of the visual attention mechanism in SAR image recognition is feasible.
SENet [
11] is a CNN model based on visual attention mechanism. It uses a gating mechanism to model channel-wise relationships and enhances the representation power of modules throughout the networks [
30]. The authors of SENet developed a series of SE blocks that integrate with ResNet [
9], ResNext [
31] and Inception-ResNet [
32], respectively. Experimental results on the ImageNet dataset show that the introduction of SEblock can effectively reduce the error rate. In ILSVRC 2017, SENet won the first place in image classification competition, indicating its effectiveness.
Depthwise separable convolution [
33] is a kind of model compression technique that reduces the number of parameters and amount of computation used in convolutional operations while increasing representational efficiency [
34]. It consists of a depthwise (DW) convolution, i.e., a spatial convolution performed independently over every channel of an input, followed by a pointwise (PW) convolution, i.e., a regular convolution with 1 × 1 kernel, projecting the channels computed by the DW convolution onto a new channel space. Depthwise separable convolution have been previously shown in Xception [
33] to allow for image classification models that outperform similar networks with the same number of parameters, by making more efficient use of the parameters available for representation learning. Many state of the art CNN models such as MobileNets [
35], ResNext [
31], ShuffelNet [
36], SqueezeNet [
37] etc. also adopt depthwise separable convolutions to reduce model parameters and accelerate their calculations.
Data imbalance exists widely in practical applications, such as detecting sea surface oil pollution through satellite radar images [
38], monitoring illegal trade in credit cards [
39], and classifying medical data [
40], etc. The general methods of dealing with imbalanced classification problems can be divided into two categories. The first one is data level methods including over-sampling and under-sampling [
41,
42,
43]. The core idea of over-sampling is to randomly copy or expand the data of minority classes, but it easily leads to over fitting problems and deteriorates the generalization ability of the model. The under-sampling method balances the number of each class by removing part of the samples in the majority class, but it often losses some important data, which cause large offset or distortion in the decision boundary. The second is the algorithm level methods represented by the cost sensitive learning [
44]. This method generally does not change the original distribution of the training data, but it gives different misclassification costs for different classes, i.e., the misclassification cost of a minority classes is higher than that of majority classes. The cost matrix in cost-sensitive learning is difficult to obtain directly from the data set and misclassification costs are often unknown [
45,
46]. Buda et al. [
47] investigated the impact of class imbalance on the classification performance of CNNs and compared some frequently used methods. Experimental results indicate that over-sampling is almost universally effective in most situations where data imbalance occurs.
Inspired by SENet [
11] and the extensive application of depthwise separable convolution, we consider applying them to SAR image recognition tasks. Based on the visual attention mechanism, we first designed a channel-wise and spatial attention block as the basic unit to construct our CNN model. Then, depthwise separable convolution wis utilized to replace the standard convolution in order to decrease network parameters and model size. We also use a new loss function named weighted distance measure (WDM) loss to reduce the influence of data imbalance on the accuracy. The main contributions of our work are:
Propose a lightweight CNN model based on visual attention mechanism for SAR image classification. The utilization of channel-wise and spatial attention mechanism can boost the representational power of network. Experiment on MSTAR [
12] dataset indicate that compare with CNN model without visual attention mechanism (e.g., ResNet [
9], Network in literature [
23] and A-ConvNet [
22]), our network achieves higher recognition accuracy. Meanwhile, the model parameters and calculation consumption are significantly reduced by using depthwise separable convolution.
A new WDM loss function is proposed to solve the data imbalance problem in the data set, and a comparative analysis is done of different ways to deal with the data imbalance problem. Experimental results of MSTAR [
12] and OpenSARShip [
48] indicate the new loss function has a good adaptability for the imbalanced data set.
The rest of this paper is organized as follows:
Section 2 illustrates the key technologies used to build our lightweight CNN, including channel-wise and spatial attention, depthwise separable convolution and its implementation, and WDM loss function. Furthermore, the technical details of network construction and network topology are also given.
Section 3 conducts a series of comparative experiments based on two open datasets, i.e., MSTAR [
12] and OpenSARShip [
48]. The performance of the proposed network is demonstrated, and how to choose the hyper-parameters is discussed.
Section 4 summarizes our work and puts forward the future research.
2. Lightweight CNN Based on Visual Attention Mechanism
2.1. Channel-Wise and Spatial Attention
Convolution layers are the basic structure for CNNs. It learns filters that capturing local spatial features along all input channels, and generates feature maps of jointly encoding space and channel information. Squeeze and excitation (SE) block in [
11] can be considered as a kind of channel-wise attention mechanism. It squeezes features along the spatial domain and reweights features along the channels. The structure of SE block is shown in the upper part of
Figure 1. In SAR image target recognition, regions of interest are generally concentrated in a small area. Meanwhile, spatial information usually contains important features for accurate recognition, so it should also be used rationally. Inspired by SE block, we carry out similar operations on spatial, and introduce channel attention and spatial attention mechanisms on two parallel branches. Finally, we add the results from the two channels as the output. We call the above operation as channel-wise and spatial attention (CSA) mechanism, and the convolution unit is named CSA block, the structure of it is shown in
Figure 1.
Suppose that the feature maps entering into CSA block is
, where
,
and
are the spatial height, width and channel depth respectively. In channel attention by-pass,
is represented as
,
represents the feature maps on each channel. Spatial squeeze is performed by global average pooling (GAP), a statistic
is generated by shrinking
through spatial dimensions
, where the
-th element of
is calculated by:
After that, channel excitation is completed through a gating mechanism with sigmoid activation, vector
is transformed to:
In Equation (2),
refers to the ReLU [
49] function and
represent sigmoid function,
and
. The utilization of two fully-connected (FC) layers aims at limiting model complexity and aiding generalization, it is composed of a dimensionality reduction layer with parameters
with reduction ratio r (we set it to be 8, and the parameter choice is discussed in
Section 3.5), a ReLU function, and then a dimensionality-increasing layer with parameters
. The final output of the block is obtained by rescaling the transformation output
with the activations:
The output of channel attention by-pass is , which represents the fusion features between channels.
In spatial attention by-pass, the input feature map is represented as
,
with
and
represents the spatial features that contain all the channel information. Channel squeeze is performed by a 1 × 1 convolution kernel
, generating a projection tensor
, i.e.,
. Each
of
represents the linearly combination for all
channels in a spatial location
. Similar to channel attention by-pass, we use the sigmoid function as nonlinear activation to complete spatial excitation. The output of spatial attention by-pass can be illustrated as:
where,
and
.
Finally, we add the results of two by-passes (channel attention by-pass and spatial attention by-pass) to get the output of CSA block, i.e., . For the input feature map , CSA block carries the future recalibrated through the channel and spatial, and it can enhance the expression ability of networks.
2.2. Depthwise Separable Convolution
In standard convolution, the channel of every kernel is the same as that of the current feature map
, and every channel is convoluted at the same time. The distribution of convolution kernel in standard convolution is shown in
Figure 2a. Kernel size is
, and the number is
.
Depthwise separable convolution [
33] uses DW convolution and 1 × 1 PW convolution to decompose convolution in channel level. DW refers to a convolution kernel that no longer carry out convolutions in all channels of the input image, but one input channel, i.e., one convolution kernel corresponds to one channel. After that, the PW convolution aggregates the multichannel output of the DW convolution layer to get the weight of the global response. The distribution of convolution kernel in depthwise separable convolution is shown in
Figure 2b.
Through
Figure 2a,b, we can make a brief analysis of the computation consumption of two convolution methods. The size of the input image is
, with
channels, the size of
kernels is
. In order to unify the output and input feature map in size, we assume the stride of convolution is 1, so the size of output features is
. Ignoring the addition of features aggregation, the calculation amount required is
, the first two items are the size of the input image, and the other four are the space dimensions of the convolution kernel. When deep separable convolution is used, the calculation consumption of DW convolution is
and the calculation consumption of PW convolution is
. So we can get the ratio of calculation consumption of two convolutions is as follows:
It can be seen from the above formula that the calculation consumption of deep separable convolution can be effectively reduced compared with the standard convolution, and the ratio of calculation consumption is only related to the number and size of the convolution kernel.
2.3. Weighted Distance Measure Loss Function
Imbalanced data have a great influence on the classification results, mainly because majority class data have more influence on classifiers than minority classes, so the classification boundaries are biased toward the majority classes.
The common loss function in the field of machine learning, such as 0–1 loss function, log loss function and cross entropy loss function, have the same misclassification cost for all samples, and fail to be used directly in the problem of imbalance data classification. Therefore, new loss functions need to be designed for imbalanced data. On the other hand, the classification problem is a core problem in the research of pattern recognition, and a basic criterion in pattern recognition is to keep the inter class distance as large as possible and the intra class distance as small as possible.
Through the above analysis, we can conclude that the loss function used for imbalanced data classification in CNN should meet the following requirements:
It should strengthen the influence of minority samples on training process, and avoid the submergence of minority samples by majority samples.
It should be well compatible with the CNN training process and can be calculated in batches.
It should enhance the inter class distance and reduce the intra class distance.
Contrastive loss [
50] is used to solve the face recognition problem with long tailed distribution (which mean the number of categories is very large and not known during training, and the number of training samples for a single category is very small, and it can be regarded as a form of data imbalance.) data. This method requires a pair of samples as input, learning a similarity measure based on the input data, and then using the similarity measure to determine whether the two samples belong to one class and achieve the recognition results. The core idea of contrastive loss is put a small distance between similar samples, and large distance for dissimilar samples [
51]. In generally, the purpose of the SAR image classification is not to judge whether the two slices belong to one class, but to identify what category the image belongs to. So contrastive loss function cannot be used directly. Even so, the thought of the contrastive loss function is of great reference. We combine the idea of contrastive loss and cost sensitive learning to design a weighted distance measure (WDM) loss function used for the problem of imbalanced data classification in CNN. The target of WDM loss function lies in two aspects, the first one is maximize the inter class distance and minimize the intra class distance, and the second one is make the samples of minority classes obtain a large compensation weight.
The WDM loss function can be expressed as the following form.
In Equation (6), represents intra class loss and represents inter class loss, and are loss weights of intra class and inter class respectively. is set to and is set to .
We use
indicates the compensation weight, which is used to control the wrong cost of different classes. Supposing that there are
samples in
class totally, and the number of each class are arranged from large to small as
(
). Then, compensation weight
can be expressed as
, which ensuring the minority classes can obtain a large compensation weight.
can be further expressed as:
represents the total classes of samples in a training batch, and is defined as intra class distance measure. is the -th longest Euclidean distance in one class. Suppose and are the two samples with the farthest distance in this class, and are the two samples with second-farthest distance, then there is , . is a hyper-parameter ( is not a sensitive parameter, it can be set to 1 or 2, experience shows is a better choice.), showing the punishment strength of loss function to the intra class distance. The greater value of means the greater the intensity of the punishment. Through Equation (7), we can see that the essence of intra class loss is the harmonic mean of the first maximum distance measure.
In Equation (8), supposing that the inter class distance between the class
A and
B is the shortest.
is defined as inter class distance measure, representing the shortest inter class.
and
denote the arithmetic mean of samples in class
A and
B after the last layer of CNN, which represents the center of the class characteristics.
is the threshold of loss function to punish the inter class distance. The smaller inter class distance will cause greater loss. We set
to
and the results sensitive to it is discussed in
Section 3.5.
In general, in the WDM loss function, we introduce the intra class distance measure and the inter class distance measure to punish the problem that the intra class distance is too large and the inter class distance is too small.
It should be explained that the contrastive loss function is based on a pair of samples, the optimization process is also aimed at a pair of samples and is a local optimization. The WDM loss function is based on a training batch, and the optimization process is also a global optimization for all kinds of samples.
2.4. Network Construction
2.4.1. The Implementation of Depthwise Separable Convolution and CSA Block
When we build the network, we learn from the basic structure of ResNet [
9]. When ResNet works, the core unit of it, i.e., the residual block first uses 1 × 1 convolution to compress the dimension of the input feature maps. Therefore, the subsequent 3 × 3 convolution will be completed on a lower data dimension. Finally, the data dimension will be restored by 1 × 1 convolution.
The structure of the residual block is shown in
Figure 3a. In the whole process, data is compressed firstly and then expanded, so this structure is also called the bottleneck block. The data processing process in the bottleneck structure is shown in
Table 1,
represents expansion factor and generally takes 0.25 in residual structure.
As introduced in
Section 2.2, DW convolution uses a convolution kernel with one channel (as shown in
Figure 2b), feature extraction capability has decreased compared with standard convolution. If depthwise separable convolution is directly used to replace the 3 × 3 standard convolution in the bottleneck structure, DW convolution will face the data of compressed dimension, which is more unfavorable for DW convolution to extract features. Therefore, refer to literature [
52], we first enhance the dimension of data by a PW unit before using DW, that is, set expansion factor
to an integer bigger than 1 (we take
= 6, and the choice of it is discussed in
Section 3.5) to make DW convolution reach a higher dimension of data. After that, a PW convolution is used to compress the data dimension. This structure is called inverted residual block, as shown in
Figure 3b. In addition, related studies [
52] also show that using non-linear layers in bottlenecks indeed hurts the performance by several percent, so in the inverted residual block, we remove the ReLU layer after the last 1 × 1 convolution to better retain the features.
Finally, the CSA block mentioned in
Section 2.1 is added to the inverted residual structure to complete the fusion of the channel and the spatial features. The structure of the inverted residual block with channel-wise and spatial attention (IR-CSA) is shown in
Figure 3c. We use IR-CSA structure as the basic convolution block to form the main structure of the CNN we propose. It is similar to ResNet [
9] and many other networks, the main structure of the network is constructed by continuously stacking the basic convolution units.
2.4.2. Network Topology
The main steps used in designing our network are summarized as below:
We use depthwise separable convolution instead of the 3 × 3 standard convolution in network to reduce the computational cost, and use the inverted residual block to improve the feature extraction ability of depthwise separable convolution.
The CSA block mentioned in
Section 2.1 is introduced into the inverted residual structure to improve feature learning and fusion capabilities.
WDM loss function is applied to reduce the impact of imbalance data.
For the SAR image slice with input size 128 × 128, the larger size of convolution kernels are adopted to cope with the possible noise. We design the convolution kernel size in the first convolution layer to be 7 × 7, the performance of convolution kernels of different sizes under noise interference will be illustrated in
Section 3.3.
The structure of lightweight network presented in this paper and ResNet50 [
9] are shown in
Table 2. Our network contains 12 IR-CSA blocks, and each IR-CSA block has 4 convolution layers and one CSA block. Similar to ResNet50, our network is also a 50-layer deep network, but its computing consumption is obviously less than it.
Only the main structure of the network is given in the
Table 2. Other operations, such as batch normalization (BN), ReLU, etc. are not embodied in the table. The reduction of the size of the feature maps is achieved by setting the convolution step of 2.