3.1. Overview
The overall procedure of the proposed network is illustrated in Figure 1. It consists of three parts: a feature encoder, a deep attentive high-resolution change decoder, and a classifier. We adopt a modified HRNet18 [52] as our encoder to extract multi-level features. The encoded features are then fed to the proposed change decoder, which aims to learn long-range differences between bitemporal features. The change decoder is comprised of hierarchical feature fusion modules and parallel convolution blocks (PCBs). During decoding, the extracted bitemporal features are fused in a bottom-up manner by SCAM to generate change features. Next, the bitemporal features and change features are refined by stacked convolutional blocks in parallel. In this way, the bitemporal images are mapped into an embedding space whose features represent the semantic information of the bitemporal images, and the change features are refined to reduce misalignment. Finally, the classifier predicts a pixel-level change map from the concatenation of the change features. In the training process, the change map is used to calculate the classification loss (Equation (8)) with the labels, and the bitemporal features are deeply supervised by the contrastive loss (Equation (7)). The hybrid loss (Equation (12)) is employed to improve the convergence stability of the proposed network and yield better detection results.
The inference procedure of our DSAHRNet is presented in Algorithm 1. Let (I_1, I_2) represent a pair of bitemporal images, and y represent the ground truth. The inference and training process of DSAHRNet can be summarized as follows:
- (1) First, the bitemporal images are input into the weight-sharing Siamese encoder, each yielding a group of multi-level features F_i^l, i ∈ {1, 2}, l ∈ {1, 2, 3, 4}.
- (2) Next, the same-level features are merged along the channel dimension to obtain change features D^l, where SCAMs are applied hierarchically.
- (3) Then, both the bitemporal features and the change features are refined by PCBs and SCAMs.
- (4) After that, the classifier generates a change mask M from the multi-level change features. L_bce and L_dice are calculated between the predicted change mask M and the ground truth y by Equations (8) and (9), while L_bcl is calculated for the intermediate bitemporal features by Equation (7).
- (5) Finally, the sum of these losses (the hybrid loss, Equation (12)) is back-propagated to optimize the model weights.
Algorithm 1 Inference of DSAHRNet for change detection
1: Input: (I_1, I_2) (a pair of bitemporal images)
2: Output: M (a prediction change mask)
3: // step1: feature extraction by HRNet18
4: for i in {1, 2} do
5:  F_i^1, F_i^2, F_i^3, F_i^4 = HRNet18(I_i)
6: end for
7: // step2: use SCAM to generate change features hierarchically
8: D^0 = None
9: for i in {1, 2, 3, 4} do
10:  D^i = SCAM(Concat(F_1^i, F_2^i, D^{i-1}))
11: end for
12: // step3: refine features
13: for i in {1, 2, 3, 4} do
14:  F_1^i, F_2^i, D^i = PCB(F_1^i), PCB(F_2^i), PCB(D^i)
15: end for
16: for i in {1, 2, 3, 4} do
17:  D^i = SCAM(Concat(F_1^i, F_2^i, D^i))
18: end for
19: // step4: obtain change mask by the classifier
20: M = Classifier(Concat(D^1, D^2, D^3, D^4))
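For readers who prefer code, a minimal PyTorch-style sketch of this inference procedure is given below. The module names (encoder, scam_fuse, pcbs, scam_refine, classifier) are placeholders for the components described in Section 3.2, and the exact way features are upsampled and concatenated here is our reading of Algorithm 1 under those assumptions, not a verbatim implementation.

```python
import torch
import torch.nn.functional as F

def dsahrnet_inference(img1, img2, encoder, scam_fuse, pcbs, scam_refine, classifier):
    """Sketch of Algorithm 1; all arguments are assumed nn.Module (or list-of-module)
    placeholders for the encoder, SCAMs, PCBs, and classifier described in the text."""
    # step 1: weight-sharing Siamese feature extraction (4 levels per image)
    feats1 = list(encoder(img1))          # feature strides 4, 8, 16, 32
    feats2 = list(encoder(img2))

    # step 2: hierarchical change-feature generation with SCAM (Algorithm 1, lines 9-10)
    change, prev = [], None
    for l in range(4):
        pair = [feats1[l], feats2[l]]
        if prev is not None:              # resize the previously fused level before concatenation
            pair.append(F.interpolate(prev, size=feats1[l].shape[2:],
                                      mode="bilinear", align_corners=False))
        prev = scam_fuse[l](torch.cat(pair, dim=1))
        change.append(prev)

    # step 3: refine bitemporal and change features with PCBs, then fuse again with SCAM
    for l in range(4):
        # whether PCB weights are shared across the three branches is an implementation choice
        feats1[l], feats2[l], change[l] = pcbs[l](feats1[l]), pcbs[l](feats2[l]), pcbs[l](change[l])
        change[l] = scam_refine[l](torch.cat([feats1[l], feats2[l], change[l]], dim=1))

    # step 4: the classifier upsamples/concatenates the multi-level change features -> change mask M
    return classifier(change)
```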
3.2. Deep Attentive High-Resolution Change Decoder
The change decoder is designed to generate the multi-level change features and high-resolution semantic features required for CD. The decoding process consists of the following three steps:
Fusing and Upsampling. After feature extraction by the encoder, we obtain four features of sizes (h/4, w/4), (h/8, w/8), (h/16, w/16), and (h/32, w/32), where h and w denote the height and width of the input images. We first concatenate feature pairs hierarchically along the channel dimension to generate four change features at different levels, and then decode the change features with an attention module named SCAM. Finally, we upsample low-level features and concatenate them with high-level features in a bottom-up manner. As a result, we obtain multi-level attention-refined change features for change detection. This step corresponds to the pseudo-code in lines 9–10 of Algorithm 1.
Our intuition for refining features is to emphasize the features relevant to the changes of interest and suppress those irrelevant to change. Consequently, we design a spatial-channel attention module, which is shown in Figure 2. The channel attention module (see Figure 2b) aims to capture channel-wise importance through a channel attention map; it focuses on 'which channels' of the concatenated feature to emphasize or suppress. The channel attention module is calculated as follows:
$$M_c = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$
where F denotes the input feature, and C, H, and W are its channel dimension, height, and width, respectively. Firstly, we perform global average pooling and global max pooling to obtain two vectors of size C × 1 × 1, denoted by F_avg^c and F_max^c. Then, a weight-shared multi-layer perceptron (MLP) assigns a weight to each channel of the two vectors. Afterward, we apply a sigmoid operator to the element-wise sum of the two vectors to obtain the channel attention map M_c. Finally, the original feature is multiplied by M_c to obtain the channel-refined feature F'.
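As a concrete reference, a minimal PyTorch sketch of such a channel attention module is shown below; the reduction ratio r and the use of 1×1 convolutions for the shared MLP are illustrative assumptions rather than settings taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), then rescale F."""
    def __init__(self, channels, r=16):
        super().__init__()
        # weight-shared MLP applied to both pooled vectors (1x1 convs avoid explicit reshaping)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):                                          # x: (B, C, H, W)
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))    # global average pooling branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))     # global max pooling branch
        attn = torch.sigmoid(avg + mx)                             # channel attention map M_c: (B, C, 1, 1)
        return x * attn                                            # channel-refined feature F'
```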
The spatial attention module (see Figure 2c) focuses on 'which areas' of the channel-refined feature F' to emphasize or suppress. The key step is producing an attention map that indicates the importance of each pixel. A well-known approach uses large-kernel convolution [51,53]. However, large-kernel convolution brings a large computational cost and many parameters. Inspired by the use of visual attention [54], we decompose a large-kernel convolution into three convolutional layers to decrease the computational complexity while capturing long-range dependence. The three convolutional layers are a spatial local convolution (depth-wise convolution, DW-Conv), a spatial long-range convolution (depth-wise dilation convolution, DW-D-Conv), and a 1 × 1 channel convolution. Therefore, the proposed spatial attention module can be expressed as:
$$M_s = \mathrm{Conv}_{1\times 1}\big(\text{DW-D-Conv}(\text{DW-Conv}(F'))\big)$$
where F' is the input feature, and C, H, and W are its channel dimension, height, and width, respectively. ⊗ denotes the element-wise product, and the subscript of each convolutional operator represents the kernel size. M_s denotes the final spatial attention map. Finally, the input feature F' is refined as follows:
$$F'' = M_s \otimes F'$$
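The sketch below illustrates this decomposition in PyTorch. The specific kernel sizes and dilation (a 5×5 depth-wise convolution followed by a 7×7 depth-wise dilated convolution with dilation 3, as commonly used in the large-kernel-attention literature) are assumptions for illustration, not the paper's exact settings.

```python
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Decomposed large-kernel spatial attention: DW-Conv -> DW-D-Conv -> 1x1 Conv,
    followed by element-wise multiplication with the input feature."""
    def __init__(self, channels, k_local=5, k_dilated=7, dilation=3):
        super().__init__()
        self.dw_conv = nn.Conv2d(channels, channels, k_local, padding=k_local // 2,
                                 groups=channels)                        # spatial local convolution
        self.dw_d_conv = nn.Conv2d(channels, channels, k_dilated,
                                   padding=(k_dilated // 2) * dilation,
                                   dilation=dilation, groups=channels)   # spatial long-range convolution
        self.pw_conv = nn.Conv2d(channels, channels, kernel_size=1)      # 1x1 channel convolution

    def forward(self, x):                                   # x: channel-refined feature F'
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(x)))  # spatial attention map M_s
        return x * attn                                     # refined output F''
```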
We validated this module by visualizing features with Grad-CAM [55]. As shown in Figure 3, the features refined by SCAM are mostly concentrated on building areas with clearer boundaries, which indicates that our SCAM indeed produces more accurate bitemporal features and change features.
Convolutional Forward Pass. To obtain the final change mask, most existing methods fuse the extracted multi-level features by concatenation or element-wise addition. These methods are able to take advantage of multi-level features. However, simply fusing features from different levels may introduce misalignment into the change features, which leads to noisy change borders or pseudo-changes. To overcome these drawbacks, we propose parallel convolutional blocks to filter out this noise. As shown in Figure 1, each PCB is composed of gray nodes, and there are four PCBs at different levels in our network. The attention-refined multi-level change features are fed into stacked convolutional blocks in parallel, and the bitemporal features are processed in the same manner to extract more information of interest. Both the bitemporal features and the change features have four branches, whose channel dimensions are 18, 36, 72, and 144, respectively, which are lower than those of most existing methods. The convolutional block we use is the basic block (shown in Figure 4) from ResNet [27]. The block consists of two 3 × 3 convolutional layers and two batch normalization [56] layers, followed by ReLU [57] activation layers. In this way, we further refine the extracted features without downsampling, maintaining their high resolution for the subsequent decoding process. Lines 13–14 of Algorithm 1 reflect this convolutional forward pass.
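A minimal PyTorch version of this residual basic block, as we read Figure 4, might look like the following; the stacking depth of blocks inside a PCB and whether weights are shared across branches are implementation details we do not pin down here.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style basic block used inside a PCB: two 3x3 conv + BN layers with ReLU,
    plus an identity skip connection; resolution and channel count are preserved."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual connection keeps gradients flowing
```

In this reading, a PCB at one level would simply stack a few such blocks, e.g., nn.Sequential(*[BasicBlock(c) for _ in range(n)]) with c ∈ {18, 36, 72, 144}; the depth n is left unspecified here.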
Fusing and Classification. Following the high-resolution convolutional forward pass, we apply SCAMs hierarchically once more (see lines 16–17 of Algorithm 1) to assemble information across the network. Then, the concatenation of the multi-level change features is fed into a classifier to generate a change mask (see line 20 of Algorithm 1). The classifier consists of two convolutional layers, an upsampling operation, and an Argmax operation. We apply the two convolutional layers and the upsampling operation to the concatenation of the multi-level change features to obtain the discriminative map D. The value in the first channel of D indicates the probability that the corresponding pixel belongs to the unchanged class, while the other channel indicates the probability that the corresponding pixel belongs to the changed class. Finally, an Argmax operation generates the final change mask by selecting, for each pixel of the discriminative map D, the class with the maximum value.
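A sketch of such a classifier head is given below; the intermediate channel width, kernel sizes, and upsampling factor are assumptions, but the structure (two convolutions, upsampling to input resolution, then a per-pixel argmax over the two-channel discriminative map D) follows the description above.

```python
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    """Two conv layers -> upsample to input size -> argmax over {unchanged, changed}."""
    def __init__(self, in_channels, hidden=64, scale=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, 2, kernel_size=1)   # 2 channels: unchanged / changed
        self.scale = scale

    def forward(self, fused_change_feats):
        d = self.conv2(F.relu(self.conv1(fused_change_feats)))          # discriminative map D
        d = F.interpolate(d, scale_factor=self.scale, mode="bilinear",
                          align_corners=False)                          # back to input resolution
        return d.argmax(dim=1), d                                       # change mask M and map D
```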
The visualization of feature maps is displayed in Figure 5. It can be seen that the bitemporal features have higher values in the building areas, and there is less interference in the change feature map after refinement by the PCBs and SCAMs. Hence, we conclude that the proposed deep attentive high-resolution change decoder can extract discriminative features and effectively model long-range dependence for change detection.
3.3. Deep Supervision and Loss Function
The batch contrastive loss (BCL) [58] and the binary cross-entropy (BCE) loss are widely used for change detection in remote sensing. The BCL measures the similarity between the distance map and the ground truth and is defined as follows:
$$\mathcal{L}_{bcl} = \frac{1}{2N}\sum_{i,j}\Big[(1 - y_{i,j})\, d_{i,j}^{2} + y_{i,j}\,\max(0,\, m - d_{i,j})^{2}\Big] \qquad (7)$$
where d_{i,j} represents the Euclidean distance of the feature pair at point (i, j), y_{i,j} denotes the label at point (i, j), which is 0 or 1, N is the size of the distance map, and m is the margin used to filter out pixel pairs with a distance lower than this value.
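Following this definition, the BCL could be implemented as sketched below; the reduction over the batch (a simple mean) and the default margin value are our assumptions.

```python
import torch

def batch_contrastive_loss(dist, label, margin=2.0):
    """Contrastive loss on a Euclidean distance map.
    dist:  (B, H, W) distances between bitemporal feature pairs
    label: (B, H, W) ground truth, 0 = unchanged, 1 = changed
    """
    label = label.float()
    # unchanged pairs are pulled together, changed pairs pushed beyond the margin
    loss_unchanged = (1.0 - label) * dist.pow(2)
    loss_changed = label * torch.clamp(margin - dist, min=0.0).pow(2)
    return 0.5 * (loss_unchanged + loss_changed).mean()
```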
The BCE loss can be calculated as follows:
$$\mathcal{L}_{bce} = -\frac{1}{H \times W}\sum_{i,j}\Big[y_{i,j}\log(p_{i,j}) + (1 - y_{i,j})\log(1 - p_{i,j})\Big] \qquad (8)$$
where y_{i,j} and p_{i,j} denote the ground-truth label and the predicted probability of the change class at point (i, j), respectively.
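For completeness, a per-pixel BCE computed from the predicted change probability map might look like this, assuming the two-channel discriminative map has already been converted to a change probability (e.g., via softmax):

```python
import torch.nn.functional as F

def bce_loss(prob_change, label):
    """prob_change, label: (B, H, W); label is the 0/1 ground truth."""
    return F.binary_cross_entropy(prob_change, label.float())
```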
To weaken the effect of unbalanced categories, the dice coefficient loss is combined with the BCE loss; it is defined as:
$$\mathcal{L}_{dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|} \qquad (9)$$
where X and Y denote the predicted change map and the ground-truth label, respectively, and X ∩ Y represents the intersection of X and Y.
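A soft dice loss consistent with this definition is sketched below; the smoothing constant eps is a common numerical-stability assumption rather than a value from the paper.

```python
def dice_loss(prob_change, label, eps=1.0):
    """Soft dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|).
    prob_change: (B, H, W) predicted change probabilities (X)
    label:       (B, H, W) binary ground truth (Y)
    """
    label = label.float()
    inter = (prob_change * label).sum(dim=(1, 2))                  # soft |X ∩ Y|
    union = prob_change.sum(dim=(1, 2)) + label.sum(dim=(1, 2))    # |X| + |Y|
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()
```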
Unlike previous networks, our network has two auxiliary branches to extract bitemporal semantic features (see Figure 1) and one branch for change detection. To improve the convergence of the network by alleviating the vanishing gradient problem, and to learn multi-level discriminative features in the intermediate layers, we introduce a deeply supervised module into the change decoder based on the auxiliary branches. Figure 6 illustrates the calculation of the change loss, which is described in the fourth step of Section 3.1. As shown in Figure 6, the multi-level change features are concatenated to generate the change map. For the two groups of intermediate features F_1^l and F_2^l, a Euclidean distance map d^l is calculated at the l-th level. We leverage the BCL to pull unchanged bitemporal features closer and push changed bitemporal features apart between F_1^l and F_2^l, and use the BCE loss to measure the similarity between the probability distributions of the changed and unchanged classes. In addition, we introduce the dice coefficient loss to alleviate the class imbalance problem. The final change loss function is formulated as:
$$\mathcal{L} = \mathcal{L}_{bce}(M, y) + \mathcal{L}_{dice}(M, y) + \lambda \sum_{l} \mathcal{L}_{bcl}(d^{l}, y) \qquad (12)$$
where λ represents the weight for deep supervision; according to our experimental results in Section 4.7, λ was set to 0.1. d^l represents the Euclidean distance map of the feature pair (F_1^l, F_2^l). M denotes the final change mask, and H and W are its height and width.
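Putting the pieces together, the deeply supervised hybrid loss could be assembled roughly as below, reusing the loss sketches above. The set of supervised levels and the way the label is downsampled to match each intermediate distance map are assumptions consistent with the description, not details taken from the paper.

```python
import torch.nn.functional as F

def hybrid_loss(prob_change, label, dist_maps, lam=0.1, margin=2.0):
    """Hybrid change loss: BCE + Dice on the predicted change map, plus a lambda-weighted
    BCL on the Euclidean distance maps of the intermediate bitemporal feature pairs."""
    loss = bce_loss(prob_change, label) + dice_loss(prob_change, label)
    for dist in dist_maps:                        # one distance map per supervised level
        # downsample the label to the resolution of this intermediate level
        lvl_label = F.interpolate(label.float().unsqueeze(1),
                                  size=dist.shape[-2:], mode="nearest").squeeze(1)
        loss = loss + lam * batch_contrastive_loss(dist, lvl_label, margin)
    return loss
```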