1. Introduction
Synthetic aperture radar (SAR), an active imaging sensor, can operate day and night under all weather conditions and deliver high-resolution images [1]. SAR has extensive applications in civilian and military domains, such as geological surveying, climate change monitoring, and environmental surveillance [2]. Despite the wealth of data generated by SAR, manually extracting relevant information is impractical; hence, automatic target recognition (ATR) has become a crucial aspect of SAR image interpretation.
SAR ATR is generally divided into three stages: detection, discrimination, and classification [3]. The classification stage can be further divided into feature extraction and classifier design. Feature extraction reduces the dimensionality of the raw SAR images and provides highly discriminative features for the classifier. Standard classifiers in the SAR ATR field include the support vector machine (SVM) [4,5,6], the sparse representation classifier (SRC) [7,8], and the multilayer perceptron (MLP) [9]. In recent years, researchers have designed various methods to extract features from SAR images, which can be categorized into three types: handcrafted features, depth features, and fusion features.
Handcrafted features are mainly designed around the unique characteristics of SAR images and include geometric structure features, transform-domain features, and scattering features. For example, moment features describe the geometric structure of the target and shadow regions, such as the area, center, and centroid [10,11,12,13]. In addition, descriptors such as Fourier descriptors, elliptic Fourier descriptors, and Zernike moments encode features from the contours of the target and shadow [4,12,14,15]. The Fourier transform, wavelet transform, Gabor transform, and principal component analysis (PCA) can be used to extract transform-domain features from SAR images [9,16]. Scattering features in SAR ATR mainly involve attributed scattering centers (ASCs) [17]. ATR methods employing scattering features typically rely on template matching or region matching, which define a similarity measure between these features and assign the target the label of the template class with the highest similarity [18,19,20,21]. Although handcrafted features designed for the target and shadow in SAR images offer physical interpretability in terms of geometric information or the scattering mechanism, their overall ATR performance is not outstanding. The reason is that a single type of feature cannot describe in-depth information about the target or shadow, while combining multiple features may still fail to provide a robust representation because of redundancy or high correlation between different features.
Depth features are extracted by convolutional neural networks (CNNs). Recently, CNN-based methods have achieved remarkable recognition accuracy in SAR image classification [22,23,24,25,26,27]. Profeta et al. [22] developed AFRLeNet, a network specifically designed for a seven-class SAR classification problem. To address overfitting of deep neural networks in SAR image classification, Chen et al. [27] proposed a fully convolutional network called A-ConvNets. Furthermore, with advances in computer vision, attention mechanisms have been introduced into SAR target recognition. For instance, Zhan et al. [28] proposed AM-CNN, which combines the CBAM attention module with a CNN and achieved a classification accuracy of 99.35% on the 10-class MSTAR dataset. Lang et al. [29] integrated a multidomain attention module into a CNN, fusing features from the frequency domain and the wavelet transform domain to enhance the model's feature extraction capability. Park et al. [30] proposed a novel channel attention network, DS-AE, based on the squeeze-and-excitation (SE) mechanism, to preserve the integrity of the model's channel information. Although depth-feature-based ATR models demonstrate outstanding classification accuracy, the mapping between a model's input and output is difficult to interpret intuitively. Moreover, mainstream CNN models typically take the original SAR image as input, which makes it difficult for them to extract helpful depth information from shadows because of the unique properties of shadows.
Fusion features exploit the complementarity between different features to further improve ATR performance. Examples include the fusion of Gabor features and depth features in [31] and the combination of Gabor features and texture features in [32]. In [31], Gabor features and depth features are combined by initializing the inception blocks of the Inception network with multi-scale, multi-directional Gabor filters. The combination of depth features with other handcrafted features has also achieved good recognition results, for example, depth features with gradient features [33], with transform-domain features [34,35], and with texture features [36]. The fusion of depth and scattering features is also gaining attention. On the one hand, data-driven depth features provide highly discriminative features for classification; on the other hand, ASC features based on scattering theory provide the physical interpretability that depth features lack. Their effective combination has spawned a wealth of research on SAR ATR [37,38,39,40,41].
Although fusion features have become prevalent in SAR ATR, the fusion of shadow and depth features has not been explored in depth. SAR sensors image the scene from a slant view, which produces shadow regions in the resulting SAR image. Shadows indirectly characterize targets, for example, through their outlines and heights. Accordingly, traditional methods focus on extracting geometric properties or contour information from shadows [7,13,14,15]. Although these methods are computationally efficient, they struggle to capture deep representations of shadows. In principle, a CNN can serve as a feature extractor that automatically fuses depth information from shadows and targets for classification. However, existing CNN-based SAR ATR methods usually take the original image directly as input, which suppresses the expression of shadow features. There are two likely reasons: shadows have low amplitude, and they are sensitive to the depression angle. These two attributes make it difficult for a CNN to exploit shadow features effectively. First, a shadow forms when a tall object occludes part of the scene so that no radar echoes are produced there [42]; the intensity of the shadow is therefore much lower than that of the target, see Figure 1d,e. If the target and shadow regions are fed to a CNN without preprocessing, the extraction of the target's depth features is impaired [43]. Second, the shadow struggles to provide a stable representation of the target because of its high sensitivity to the radar's depression angle. To the best of our knowledge, research combining shadows with deep CNNs remains limited. Choi et al. [44] proposed a dual-branch CNN that separately extracts depth features from the preprocessed target region and shadow region. However, this ignores the relative positional relationship between the target and shadow, which reflects the radar viewing angle and target attitude during imaging and can provide helpful discriminative information for the classifier [13].
Therefore, to enable a CNN to comprehensively utilize the depth features of both targets and shadows, the contributions of this paper are as follows.
(1) We first propose a segmentation method based on the statistical characteristics of the SAR image to extract the target and shadow regions. This preprocessing compensates for the unique attributes of shadows and helps the CNN extract their depth information. We then use the target region and the shadow mask as the input to the CNN, which not only solves the low-intensity problem of the shadow but also restricts the CNN to extracting depth features from the shadow contour, see Figure 1c.
(2) A data-augmentation method is proposed to provide a robust representation of shadows. Based on the shadow imaging geometry, this method can not only compensate for the geometric distortion caused by different imaging depression angles but also increase the diversity of the training set to prevent overfitting.
(3) We propose a novel feature-enhancement module (FEM) based on depthwise separable convolution (DSC) and the convolutional block attention module (CBAM). The attention-based FEM comprehensively extracts highly discriminative features from target regions and shadow masks. Specifically, we introduce a spatial attention mechanism in the FEM, allowing it to adaptively fuse the depth features of targets and shadows. We also perform an interpretability analysis of the FEM and its spatial attention to further explore the enhancement effect.
The rest of this paper is organized as follows. Section 2 first introduces the SAR image segmentation method, followed by the data-augmentation method and the details of the FEM. Experiments and analysis on the MSTAR dataset are presented in Section 3. Finally, Section 4 concludes the paper.
2. Methodology
The overall framework proposed in this paper is shown in Figure 2. This framework includes three main modules. First, SAR images are segmented to extract target regions and shadow masks. Then, data augmentation is applied to the segmentation results to increase the diversity of training samples and compensate for the geometric distortion of shadows. Finally, the proposed FEM is embedded into existing deep CNN models for feature extraction and classification. Each module is explained in detail below.
2.1. SAR Image Segmentation
A simple SAR image scene typically consists of three components: the target region, the shadow region, and the background clutter. The intensity distributions of the target and shadow regions exhibit different characteristics, as illustrated in Figure 1. Therefore, a simple threshold-based method relying on statistical models can be employed to separate the target and shadow regions from the SAR image [13,23,44]. Although threshold-based segmentation effectively extracts target regions, it is not entirely suitable for shadow extraction because of speckle noise in SAR images and occlusion caused by other objects. Filtering methods commonly used for optical images, such as median filtering and Gaussian filtering, are not appropriate for mitigating the non-additive speckle noise in SAR images [14]. Therefore, anisotropic diffusion filtering has been introduced for denoising SAR images [14,15]; it effectively suppresses noise while preserving the structural information of the target and shadow regions. Motivated by [15,44,45], we propose a target-centroid-guided method for extracting the shadow mask. This method first employs anisotropic diffusion filtering to denoise the SAR image, followed by dual thresholding to roughly segment the target and shadow regions. Finally, the Euclidean distance between the centroid of the target contour and the centroids of candidate shadow contours is used to filter out false shadows, thereby enhancing the robustness of shadow segmentation.
This paper first extracts the target mask using the method in [44]. Then, the centroid of the target mask is used as auxiliary information to extract the shadow mask. Suppose an original SAR image of size $M \times N$ is represented as $\mathbf{I} = \{I(x, y)\}$, where $1 \le x \le M$ and $1 \le y \le N$. The following is the detailed process of segmentation.
Step 1: Apply a logarithmic transformation to $\mathbf{I}$ to enhance low grayscale value regions, resulting in $\mathbf{I}_{\log}$.
Step 2: Perform anisotropic diffusion filtering on $\mathbf{I}_{\log}$ to obtain $\mathbf{I}_{f}$.
Step 3: Normalize $\mathbf{I}_{f}$ to obtain $\mathbf{I}_{n}$, where $I_{n}(x, y) \in [0, 1]$.
Step 4: Binarize $\mathbf{I}_{n}$ by marking positions with intensities above the target threshold $T_{t}$ as 1 and the rest as 0, resulting in the target mask $\mathbf{M}_{t}$. Similarly, mark positions with intensities below the shadow threshold $T_{s}$ as 1 and the rest as 0, obtaining the shadow mask $\mathbf{M}_{s}$.
Step 5: Apply a sliding window of size $k \times k$ to perform counting-filter processing on $\mathbf{M}_{t}$ and $\mathbf{M}_{s}$ separately, yielding the counting-filter results $\mathbf{M}_{t}'$ and $\mathbf{M}_{s}'$.
Step 6: Perform morphological dilation and closing operations on $\mathbf{M}_{t}'$ and $\mathbf{M}_{s}'$ separately.
Step 7: Select the largest connected region as the final target mask $\mathbf{M}_{T}$, and compute its centroid $c_{t}$.
Step 8: Calculate the centroids of the binary regions in $\mathbf{M}_{s}'$ and the Euclidean distance $d$ between each centroid and $c_{t}$. Select the largest connected region whose distance $d$ is below the threshold $d_{th}$ as the final shadow mask $\mathbf{M}_{S}$.
Step 9: Obtain the target region and shadow mask image by combining the target region $\mathbf{I} \odot \mathbf{M}_{T}$ (the original intensities within the target mask) with the binary shadow mask $\mathbf{M}_{S}$.
The parameter details of the proposed SAR image segmentation algorithm are given in Section 3.2. To provide a more intuitive understanding of each step, Figure 3 presents the stepwise output of the segmentation. As seen in Figure 3c, applying anisotropic diffusion filtering to the SAR image not only helps suppress speckle noise but also preserves the structural and detailed information of the target and shadow regions, resulting in smooth contours of the segmented shadow masks.
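For concreteness, the sketch below re-implements Steps 1–9 in Python with NumPy and SciPy. It is a minimal illustration rather than the exact implementation: the diffusion parameters, the thresholds `t_target` and `t_shadow`, the window size `win`, and the distance threshold `d_max` are placeholders for the values given in Section 3.2.

```python
import numpy as np
from scipy import ndimage


def perona_malik(img, n_iter=15, kappa=0.1, gamma=0.2):
    """Minimal Perona-Malik (anisotropic) diffusion for speckle smoothing."""
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # Finite differences toward the four neighbours.
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # Edge-stopping coefficients keep target/shadow contours sharp.
        cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
        ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
        u += gamma * (cn * dn + cs * ds + ce * de + cw * dw)
    return u


def segment_target_shadow(img, t_target=0.8, t_shadow=0.2, win=5, d_max=40.0):
    """Rough re-implementation of Steps 1-9; all parameters are placeholders."""
    x = np.log1p(img)                                   # Step 1: log transform
    x = perona_malik(x)                                 # Step 2: diffusion filtering
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)     # Step 3: normalisation
    tgt = x > t_target                                  # Step 4: dual thresholding
    shd = x < t_shadow
    # Step 5: counting filter -- keep pixels whose neighbourhood is mostly "on".
    tgt = ndimage.uniform_filter(tgt.astype(float), win) > 0.5
    shd = ndimage.uniform_filter(shd.astype(float), win) > 0.5
    # Step 6: morphological dilation followed by closing.
    tgt = ndimage.binary_closing(ndimage.binary_dilation(tgt))
    shd = ndimage.binary_closing(ndimage.binary_dilation(shd))
    # Step 7: largest connected target component and its centroid.
    lab_t, n_t = ndimage.label(tgt)
    sizes = ndimage.sum(tgt, lab_t, range(1, n_t + 1))
    tgt_mask = lab_t == (np.argmax(sizes) + 1)
    c_t = np.array(ndimage.center_of_mass(tgt_mask))
    # Step 8: largest shadow component whose centroid lies near the target centroid.
    lab_s, n_s = ndimage.label(shd)
    shd_mask, best_size = np.zeros_like(tgt_mask), 0
    for k in range(1, n_s + 1):
        comp = lab_s == k
        if (np.linalg.norm(np.array(ndimage.center_of_mass(comp)) - c_t) < d_max
                and comp.sum() > best_size):
            shd_mask, best_size = comp, comp.sum()
    # Step 9: target intensities inside the target mask plus the binary shadow mask.
    return img * tgt_mask + shd_mask.astype(img.dtype)
```

The sketch assumes at least one target component is found; a production implementation would add the corresponding checks.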
2.2. Data Augmentation
Because shadows do not arise from the high backscattering of targets, the intensity of shadow regions tends to be very low or even close to 0, as depicted in Figure 1. Therefore, shadows can only provide auxiliary information about targets, such as their contours. Some traditional SAR ATR methods exploit this characteristic by extracting geometric features from the binarized shadow mask (contour) instead of extracting features directly from the shadow region. For instance, geometric properties such as the shadow mask's center, centroid, and moment features can be extracted [10,11,12,13]. Alternatively, descriptors can be employed to encode the shadow contours directly, enabling the extraction of contour features [13,14,15]. Motivated by these approaches, we propose to combine the shadow mask with the target region as the input to the deep-learning model. This circumvents the large intensity difference between the shadow and the target and guides subsequent deep networks to extract features from the shadow contour. Moreover, this processing preserves the relative positional relationship between the target and the shadow.
However, shadows tend to exhibit unstable characteristics due to their sensitivity to depression angles in SAR images. Geometric distortions occur in both the target and shadow areas of SAR images at different radar depression angles. These distortions lead to variations in the shape and position of targets and shadows in training and test data, posing challenges for SAR target recognition.
Figure 4 illustrates the projection of ground objects under different radar line of sight (RLOS) conditions. As depicted in Figure 4, the range-direction projections of the target and shadow areas are compressed by scaling factors of $\cos\theta$ and $\cos\theta/\tan\theta$, respectively, where $\theta$ represents the depression angle of the radar [44,46]. For example, the SOC training and test images of MSTAR (described in Section 3.1) are acquired at depression angles of 17° and 15°, respectively. Consequently, the scaling factor for the target region is

$$k_{t} = \frac{\cos 15^{\circ}}{\cos 17^{\circ}} \approx 1.01, \quad (1)$$

where $k_{t}$ denotes the scaling factor of the target region. However, the scaling factor of the shadow is larger than that of the target region, namely

$$k_{s} = \frac{\cos 15^{\circ}/\tan 15^{\circ}}{\cos 17^{\circ}/\tan 17^{\circ}} \approx 1.15, \quad (2)$$

where $k_{s}$ is the scaling factor of the shadow region. Because of this scaling behaviour of targets and shadows, we apply an affine transformation to geometrically adjust the images in the training set, compensating for the geometric distortion of the training set relative to the test set. Take the affine transformation of the shadow as an example. Assuming that the shadow mask in the Cartesian coordinate system is $\mathbf{M}_{S}(x, y)$ and that it becomes $\mathbf{M}_{S}'(x', y')$ after the affine transformation, the coordinate mapping can be calculated as follows [44,46]:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & k_{s} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}, \quad (3)$$

where $y$ is the coordinate along the range direction.
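As a quick numerical check of the two factors for the 17°/15° SOC split, assuming the slant-plane relations written above (our reading of the geometry in [44,46]):

```python
import math

# Train / test depression angles of the MSTAR SOC split.
theta_tr, theta_te = math.radians(17.0), math.radians(15.0)

# Assumed relations: target range extent ~ cos(theta), shadow ~ cos(theta)/tan(theta).
k_t = math.cos(theta_te) / math.cos(theta_tr)
k_s = (math.cos(theta_te) / math.tan(theta_te)) / (math.cos(theta_tr) / math.tan(theta_tr))

print(f"target scaling k_t ~ {k_t:.3f}")   # ~1.010
print(f"shadow scaling k_s ~ {k_s:.3f}")   # ~1.153
```

Under these assumptions, the shadow requires a noticeably larger range scaling than the target, which motivates augmenting the shadow with several scaling values around this ratio.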
Considering that the images in the MSTAR dataset are collected at azimuth angles ranging from 0° to 360° with intervals of 5° to 6°, there may be some deviation in the scaling factor. To address this, we applied four scaling parameters, spaced with a step size of 0.05, to the shadow mask. As a result, the newly generated training set is five times larger than the original. Figure 5 illustrates the augmented images obtained by applying different scaling factors to the 2S1 and BRDM2 images in the training set. Note that although the scaling factor for the target region is small, we applied the affine transformation to both the target and the shadow simultaneously to preserve their relative positional relationship. This data-augmentation technique not only increases the diversity of training samples to prevent overfitting of the deep-learning model but also compensates for the geometric distortion of the target and shadow regions caused by the different depression angles during imaging. Thus, the augmented training set becomes more representative of the data distribution of the test set.
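A minimal sketch of the range-direction scaling used for augmentation is shown below. It assumes that rows correspond to the range direction and, for simplicity, scales the whole (target-region plus shadow-mask) chip about its center; the helper name and the example scale list are ours.

```python
import numpy as np
from scipy import ndimage


def scale_range_direction(chip, k, center=None):
    """Stretch a chip by factor k along the range (row) axis to mimic a
    different depression angle; the azimuth (column) axis is left unchanged."""
    h, w = chip.shape
    if center is None:
        center = (h / 2.0, w / 2.0)
    # affine_transform maps output coordinates to input coordinates:
    # input = A @ output + offset, so stretching by k uses 1/k on the row axis.
    A = np.array([[1.0 / k, 0.0], [0.0, 1.0]])
    offset = np.array(center) - A @ np.array(center)
    return ndimage.affine_transform(chip, A, offset=offset, order=1, mode="constant")


# Hypothetical usage: augment each training chip with several range scalings
# (values are illustrative, spaced by the 0.05 step mentioned in the text).
# scales = [1.00, 1.05, 1.10, 1.15]
# augmented = [scale_range_direction(chip, k) for k in scales]
```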
2.3. Feature-Enhancement Module
The low intensity and instability of shadows can be addressed by binarized masking and data augmentation, respectively, but the importance of targets and shadows differs: the target region contains rich scattering information, while the shadow mask provides only an indirect expression of the target. Moreover, compared with the original image, the target region and shadow mask fed to the CNN carry significantly less information for deep feature extraction, and the pooling layers further compress the spatial resolution, aggravating the information loss.
Considering these issues, we propose a feature-enhancement module (FEM) based on DSC and CBAM. First, the CBAM in the FEM adaptively fuses the essential features of the target and shadow for classification. Second, the module is generic enough that existing backbone networks need not be modified; the FEM can be embedded directly into their downsampling layers. Finally, it enhances the feature extraction capability of the deep-learning model and compensates for the loss of features after pooling. This section first introduces DSC and CBAM, and then provides the details of the FEM.
2.3.1. Depthwise Separable Convolution
The MobileNet series has recently gained popularity for achieving high accuracy in image classification while being lightweight enough to run on mobile and embedded devices [47,48,49]. A key innovation in these networks is the depthwise separable convolution (DSC).
DSC decomposes a standard convolution into two separate steps: depthwise convolution and pointwise convolution. In a standard convolution, computations are performed simultaneously over the spatial and channel dimensions, whereas DSC performs them in two distinct stages. First, the depthwise convolution filters each channel of the input feature map individually. Then, the pointwise convolution linearly combines the results of the depthwise convolution using $1 \times 1$ convolutions [47]. By decomposing the convolution in this way, DSC significantly reduces the number of trainable parameters of the CNN.
Figure 6 illustrates the differences between standard convolution and DSC.
Assume a standard convolution with kernel $\mathbf{K} \in \mathbb{R}^{D_{K} \times D_{K} \times M \times N}$ is applied to the input feature map $\mathbf{F} \in \mathbb{R}^{D_{F} \times D_{F} \times M}$, resulting in the output feature map $\mathbf{G} \in \mathbb{R}^{D_{G} \times D_{G} \times N}$, where $D_{K}$ represents the spatial size of the convolution kernel, $D_{F}$ and $D_{G}$ are the heights and widths of the input and output feature maps, respectively (assumed square), and $M$ and $N$ denote the numbers of channels of the input and output feature maps, respectively:

$$\mathbf{G}_{k,l,n} = \sum_{i,j,m} \mathbf{K}_{i,j,m,n} \cdot \mathbf{F}_{k+i-1,\, l+j-1,\, m}. \quad (4)$$

The computation of DSC is divided into two processes, namely depthwise convolution and $1 \times 1$ (pointwise) convolution. The depthwise convolution kernel $\hat{\mathbf{K}} \in \mathbb{R}^{D_{K} \times D_{K} \times M}$ is used for channel-wise filtering of the feature map, i.e., the $m$-th filter of $\hat{\mathbf{K}}$ is convolved with the $m$-th channel of $\mathbf{F}$:

$$\hat{\mathbf{G}}_{k,l,m} = \sum_{i,j} \hat{\mathbf{K}}_{i,j,m} \cdot \mathbf{F}_{k+i-1,\, l+j-1,\, m}. \quad (5)$$

A pointwise convolution is then performed on the result of the depthwise convolution. Finally, the cost reduction of DSC compared with the standard convolution can be calculated as:

$$\frac{D_{K} \cdot D_{K} \cdot M \cdot D_{F} \cdot D_{F} + M \cdot N \cdot D_{F} \cdot D_{F}}{D_{K} \cdot D_{K} \cdot M \cdot N \cdot D_{F} \cdot D_{F}} = \frac{1}{N} + \frac{1}{D_{K}^{2}}. \quad (6)$$

As seen from (6), DSC can significantly reduce the computational cost of the model compared with standard convolution. Considering that the number of output channels $N$ is usually large, the computational expense of a DSC is approximately $1/D_{K}^{2}$ that of a standard convolution.
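As an illustration, a depthwise separable convolution can be written in a few lines of PyTorch. The class name and channel arguments are ours, and the 3×3 kernel corresponds to the common $D_{K} = 3$ case.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """3x3 DSC block: per-channel (depthwise) conv followed by a 1x1 (pointwise) conv."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each filter see exactly one input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


# Weight count: 3*3*in_ch + in_ch*out_ch, versus 3*3*in_ch*out_ch for a standard
# convolution, i.e. roughly a (1/out_ch + 1/9) fraction, matching Equation (6).
```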
2.3.2. CBAM
The convolutional block attention module (CBAM) is an attention mechanism that adaptively adjusts the weights of different spatial positions and channels in the feature map to improve model performance [50]. The CBAM module consists of channel attention and spatial attention, as shown in Figure 7. Given a feature map $\mathbf{F} \in \mathbb{R}^{H \times W \times C}$, CBAM first infers the attention weights $\mathbf{M}_{c} \in \mathbb{R}^{1 \times 1 \times C}$ along the channel dimension and then the attention weights $\mathbf{M}_{s} \in \mathbb{R}^{H \times W \times 1}$ over the spatial positions, where $H$, $W$ and $C$ represent the height, width and number of channels of the feature map, respectively. The calculation process is as follows [50]:

$$\mathbf{F}' = \mathbf{M}_{c}(\mathbf{F}) \otimes \mathbf{F}, \qquad \mathbf{F}'' = \mathbf{M}_{s}(\mathbf{F}') \otimes \mathbf{F}', \quad (7)$$

where $\otimes$ represents element-wise multiplication, and $\mathbf{F}''$ is the refined output of $\mathbf{F}$ after passing through the CBAM. The channel attention weight $\mathbf{M}_{c}$ is computed as:

$$\mathbf{M}_{c}(\mathbf{F}) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(\mathbf{F})) + \mathrm{MLP}(\mathrm{MaxPool}(\mathbf{F}))\big) = \sigma\big(\mathbf{W}_{1}(\mathbf{W}_{0}(\mathbf{F}_{\mathrm{avg}}^{c})) + \mathbf{W}_{1}(\mathbf{W}_{0}(\mathbf{F}_{\mathrm{max}}^{c}))\big), \quad (8)$$

where $\sigma$ represents the nonlinear activation function, MLP denotes a multilayer perceptron with weights $\mathbf{W}_{0} \in \mathbb{R}^{C/r \times C}$ and $\mathbf{W}_{1} \in \mathbb{R}^{C \times C/r}$, $r$ is the reduction ratio, and $\mathbf{F}_{\mathrm{avg}}^{c}$ and $\mathbf{F}_{\mathrm{max}}^{c}$ represent the average-pooling and max-pooling results of $\mathbf{F}$ over the spatial dimension, respectively. As can be seen from the channel attention module in Figure 7, the computation of channel attention first applies global average pooling and global max pooling over the spatial dimension of the feature map $\mathbf{F}$ to generate the average-pooled feature $\mathbf{F}_{\mathrm{avg}}^{c}$ and the max-pooled feature $\mathbf{F}_{\mathrm{max}}^{c}$, which describe the spatial context. Then, a shared fully connected layer weights the average-pooled and max-pooled features further. As a result, the channel attention mechanism can adaptively adjust the weight of each channel, enhancing the representation of valuable features and reducing interference from irrelevant ones.
The computation of spatial attention is similar to that of channel attention. However, it performs global average pooling and global max pooling on the feature map $\mathbf{F}'$ along the channel dimension to obtain the average-pooled feature $\mathbf{F}_{\mathrm{avg}}^{s} \in \mathbb{R}^{H \times W \times 1}$ and the max-pooled feature $\mathbf{F}_{\mathrm{max}}^{s} \in \mathbb{R}^{H \times W \times 1}$, respectively. The two are then concatenated along the channel dimension and passed through a standard convolution to obtain a 2D spatial attention weight. That is:

$$\mathbf{M}_{s}(\mathbf{F}') = \sigma\big(f\big([\mathrm{AvgPool}(\mathbf{F}'); \mathrm{MaxPool}(\mathbf{F}')]\big)\big) = \sigma\big(f\big([\mathbf{F}_{\mathrm{avg}}^{s}; \mathbf{F}_{\mathrm{max}}^{s}]\big)\big), \quad (9)$$

where $\sigma$ represents the nonlinear activation function, and $f$ denotes a standard convolution operation. As shown in Figure 7, spatial attention focuses on which positions in the feature map carry richer information. In other words, it adaptively weights the different spatial positions of the feature maps of targets and shadows to emphasize the features most useful for classification.
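A compact PyTorch sketch of CBAM following Equations (7)–(9) is given below. The reduction ratio r = 16 and the 7×7 spatial kernel are the defaults of the original CBAM paper [50], not values stated here.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Channel attention followed by spatial attention, after Woo et al. [50]."""

    def __init__(self, channels, r=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP for channel attention (Equation (8)).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        # 2-channel (avg, max) -> 1-channel spatial attention map (Equation (9)).
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over spatially avg- and max-pooled features.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 2D map from channel-wise avg and max, via a convolution.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```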
2.3.3. Feature-Enhancement Module
The FEM primarily comprises an inverted residual block and CBAM, as illustrated in Figure 8. The inverted residual block utilizes DSC to expand the input feature map in the channel dimension and downsample it in the spatial dimension [49]. CBAM then assigns distinct weights to the spatial positions and channels of the feature map, emphasizing the spatial and channel importance of the target and shadow features, respectively [50].
In more detail, given a feature map $\mathbf{F} \in \mathbb{R}^{H \times W \times C}$, the pooling operation first downsamples $\mathbf{F}$ to obtain $\mathbf{F}_{p} \in \mathbb{R}^{H/2 \times W/2 \times C}$. Subsequently, a $1 \times 1$ convolution is used to expand the channel dimension of the input feature map $\mathbf{F}$, producing a new feature map $\mathbf{F}_{e}$. Then, depthwise convolution (see (5)) is applied for further feature extraction and spatial downsampling, resulting in $\mathbf{F}_{d}$. Using Equation (7), the spatial and channel dimensions of $\mathbf{F}_{d}$ are weighted to generate the CBAM-refined feature map $\mathbf{F}_{r}$. A second $1 \times 1$ convolution is then applied to $\mathbf{F}_{r}$ to acquire the enhanced feature map $\mathbf{F}_{o} \in \mathbb{R}^{H/2 \times W/2 \times C}$. Lastly, a residual connection combines the pooled feature map $\mathbf{F}_{p}$ and the enhanced feature map $\mathbf{F}_{o}$:

$$\mathbf{Y} = \mathbf{F}_{p} + \mathbf{F}_{o}, \quad (10)$$

where $\mathbf{Y}$ represents the output enhanced feature map. The FEM employs the inverted residual block based on DSC, which is lightweight and does not significantly increase the number of trainable parameters of the original models. Furthermore, by integrating the spatial and channel attention of CBAM, the FEM can adaptively fuse the depth representations of the target region and the shadow mask, prioritizing the parts most relevant for classification. For example, Figure 9 displays the detailed network structure of A-ConvNets [27] with the added FEM.
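The PyTorch sketch below reflects our reading of Figure 8: a pooled shortcut plus an inverted-residual branch (1×1 expansion, stride-2 depthwise convolution, CBAM, 1×1 projection) joined by Equation (10). The expansion factor t, the use of max pooling on the shortcut, and the BatchNorm/ReLU6 placement are assumptions; the CBAM class is the sketch from Section 2.3.2.

```python
import torch
import torch.nn as nn
# Assumes the CBAM class defined in the Section 2.3.2 sketch is importable.


class FEM(nn.Module):
    """Feature-enhancement module sketch: pooled shortcut + inverted residual + CBAM."""

    def __init__(self, channels, t=4):
        super().__init__()
        hidden = channels * t
        self.pool = nn.MaxPool2d(2)                        # shortcut downsampling (assumed max pooling)
        self.expand = nn.Sequential(                       # 1x1 channel expansion
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
        self.depthwise = nn.Sequential(                    # stride-2 depthwise conv, Equation (5)
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
        self.cbam = CBAM(hidden)                           # Equations (7)-(9)
        self.project = nn.Sequential(                      # 1x1 projection back to C channels
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        shortcut = self.pool(x)
        y = self.project(self.cbam(self.depthwise(self.expand(x))))
        return shortcut + y                                # Equation (10)
```

In use, a module such as `FEM(channels=64)` would replace a downsampling (pooling) layer of an existing backbone such as A-ConvNets, in the spirit of Figure 9.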
4. Conclusions
Shadows in SAR images can reveal the structural information of a target from a side perspective, providing features distinct from those of the target itself. However, shadows exhibit low intensity and depression-angle sensitivity, which make it challenging for a CNN to extract useful information from them. To address this problem, we propose a novel strategy for fusing target and shadow information that enables a CNN to comprehensively extract depth features from both. First, we introduce a segmentation method to extract the target and shadow information. Taking the target region and shadow mask as the CNN input alleviates the shadow's low-amplitude issue and enables subsequent networks to extract a deep representation from the shadow contour. Second, we propose a data-augmentation technique to compensate for the geometric distortion of shadows caused by different depression angles. Finally, we present an FEM that can adaptively fuse the target and shadow information while emphasizing the most important parts of the target and shadow. Extensive experiments on the MSTAR dataset demonstrate that the FEM improves the ability of existing networks to extract target and shadow information, thereby achieving state-of-the-art performance in both SOC and EOC scenarios.
Future work includes the following aspects. First, advanced segmentation methods, such as deep-learning-based SAR image segmentation, can be utilized to improve target and shadow extraction in complex scenes. Second, the proposed FEM can be integrated into deep backbone networks to enhance recognition accuracy; however, this may increase the complexity of the models. Lastly, integrating the proposed method with the modern SAR ATR framework can help in handling SAR images with multiple targets.