1. Introduction
Change detection (CD) is a critical task that aims to identify changes between images captured at different times, and it has numerous applications in fields such as land cover change analysis, ecosystem monitoring, city management, and damage assessment [1,2,3,4,5,6]. In recent years, the rapid development of imaging equipment has made it easy to acquire very-high-resolution (VHR) remote sensing images and aerial images. Xing et al. [7] proposed a spectrally enhanced multi-feature fusion method with change magnitude map integration for effective CD. Du et al. [8] proposed a tri-temporal logic-verified change vector analysis approach that identifies errors of change vector analysis through logical reasoning and judgment with the assistance of an additional temporal image. Lv et al. [9] utilized a multi-feature probabilistic ensemble conditional random field model to perform CD for VHR remote sensing imagery. However, traditional CD methods face challenges in effectively extracting features and learning high-level information, limiting their practical applications.
Nowadays, deep learning-based methods have achieved remarkable success and become the de facto standard for VHR image change detection [10]. In the literature, a large number of attempts have been made to solve CD problems using deep learning techniques [11,12,13,14]. Among these, the convolutional neural network (CNN) has emerged as the most commonly used architecture due to its ability to preserve fine detail. To extract discriminative features, many CNN-based methods [15,16,17] use a Siamese architecture [18], often accompanied by stacked convolution layers [10,19], dilated convolutions [19], spatial pyramid pooling layers, and various attention mechanisms (e.g., channel and spatial attention) [20,21,22,23] to increase the receptive field, which is essential for capturing long-range contextual information within the spatial and temporal scope. This type of approach is straightforward, and the introduction of attention-based mechanisms is effective in capturing global details. However, these methods use attention to re-weight the bi-temporal features from a Siamese CNN architecture to relate long-range spatio-temporal information [21,22,23,24], where more attention is paid to balancing and integrating high-level semantic features, while the interactions between low-level details are usually ignored. The tremendous advancements and successes of transformers in natural language processing (NLP) [25] have led to their widespread adoption and development for vision tasks, such as ViT [26], Swin Transformer [27], and SETR [28].
The self-attention mechanism is an important component of transformers; it explicitly models the interactions between all entities of a sequence and captures the interaction among $n$ entities by encoding each one with global contextual information. In detail, using learnable weight matrices $W_Q$, $W_K$, and $W_V$, the input $X$ is first projected to obtain the queries $Q$, keys $K$, and values $V$, respectively. Then, attention scores are obtained by computing the dot products of the queries with all keys, which can be seen as a query-matching process over all values. Due to the token structure, ViT and its variants have a larger receptive field and capture low frequencies, mainly the global shape and structure. In contrast, a CNN extracts features by pixel-wise convolution; it has a smaller receptive field and captures high frequencies (such as local details and textures). According to the findings presented in [29], the multi-head self-attentions (MSAs) used in ViT and CNNs exhibit contrasting behaviors: while MSAs function as low-pass filters, CNNs are more akin to high-pass filters. This is because self-attention is a global operation that exchanges information between patch tokens, so it is better at capturing global information (low frequencies) than local information (high frequencies). Therefore, the high-frequency information captured by CNNs should be combined with the low-frequency information captured by ViT, which is especially important for pixel-level segmentation tasks, e.g., (fine-grained) change detection. However, most transformer methods focus on image classification [26,30,31], object detection [32,33], and semantic segmentation [28,31], and very few works on transformers for CD have been reported in the literature [34,35,36].
We note that traditional Siamese-based CD methods split two-stream branches to extract features separately and thus ignore information fusion between bi-temporal images. Transformers utilize queries and keys to calculate attention weights, which implicitly contains a routine of comparing and querying. For CD, it is therefore natural to use a transformer-based mechanism to extract change features and enhance the information interaction between bi-temporal images along the spatio-temporal dimension. In fact, global information, such as smooth regions, mainly resides in low-frequency components, while high-frequency components contain local detail information such as object boundaries [37]. The human visual system realizes the classification and identification of object semantics by extracting visual feature representations from different frequencies. Some scholars have explored the high-frequency and low-frequency information extraction capabilities of networks from the frequency perspective and combined them, achieving competitive results in various visual tasks [38,39]. Therefore, a new vision architecture is needed to capture the semantic change information in pairs of CD images from both high-frequency and low-frequency perspectives. In light of these requirements, a specially designed, novel cross-attention-based transformer framework is proposed for CD, which is efficient, accurate, and robust. We base our approach on the intuition that CD is an alternating process of comparing image patches, similar to human visual change observation through patch scanning and comparison [40,41]. Joint information fusion and feature extraction at different scales are more intuitive than two separate streams, as they better capture the complex bi-temporal features of interest. More importantly, the transformer and the CNN are complementary in feature representation with respect to high-frequency-like fast change and low-frequency-like slow change [29], and the pure usage of either type may be limited in representing complex bi-temporal features [39]. To this end, a new dual cross-attention mechanism block is proposed to combine the CNN's outstanding high-frequency capture ability with the transformer's global attention on low frequencies. Our proposed framework is not only efficient and accurate, but also robust to changes in remote sensing imagery. The key novelties of our work lie in the following aspects:
A novel dual cross-attention-based module is introduced into the transformer encoder block. This module features a dual branch structure that enables the extraction of high- and low-frequency information, resulting in a more powerful semantic association modeling between bi-temporal patches;
A hierarchical transformer encoder is presented to extract and merge multi-scale features simultaneously, and a feature pyramid fusion network decoder is designed to yield a discriminable feature difference representation that alleviates the problems caused by naive context aggregation;
Extensive experiments on four CD datasets demonstrate that our proposed method outperforms previous state-of-the-art methods in terms of accuracy and robustness.
The structure of this paper is as follows: Section 2 provides a review of deep-learning-based CD methods. In Section 3, we present our proposed methodology in detail, outlining the advantages of the dual cross-attention-based transformer (DCAT) framework. Section 4 presents comprehensive comparisons between DCAT and other SOTA methods, and Section 5 provides a thorough evaluation of its effectiveness through ablation studies. Finally, Section 6 concludes the paper, summarizing the contributions of our work and discussing future directions for research in the field of CD.
3. Materials and Methods
In this section, the overall architecture of DCAT is briefly described, and key components of the proposed approach are then elaborated, including overlap patch embedding module (OPEM) and dual cross-attention block (DCAB). Next, the feature pyramid fusion network and loss function are introduced.
3.1. Overall Architecture
The general process of our CD model based on DCAT is presented in
Figure 1a. Our model comprises two main modules: a hierarchical encoder and a feature pyramid fusion network (FPFN). The hierarchical encoder is designed to capture fine-grained features that preserve high-resolution remote sensing image details, making it an essential component for CD tasks. By leveraging a hierarchical feature encoder, the network can effectively capture semantic details at different scales, allowing for a more comprehensive and informative feature representation. In addition, our proposed method employs a pyramid decoder to gradually aggregate multi-scale features, thereby obtaining more reliable CD maps. This approach enhances the network’s ability to capture and exploit multi-scale information, leading to more accurate and robust change detection results. Overall, our DCAT-based CD model offers a promising solution for the challenging task of CD in remote sensing imagery. Code and models can be found at
https://github.com/zhouyy314/DCAT, accessed on 24 March 2023.
Each stage of the model includes an overlapping patch embedding layer that downsamples the feature maps, as well as a set of dual cross-attention blocks that leverage transformer and convolutional operations in parallel. The last block of each stage is the DCAB, which contains a low- and high-frequency cross-attention mixer. It is worth noting that the local contextual change features captured by the cross-attention mechanism help avoid the split into two independent streams. Furthermore, low- and high-frequency information communication between bi-temporal patches within each window is achieved by cross-attention, which behaves like a process of information query and comparison. The low- and high-frequency mixers employ the cross-attention mechanism twice to enhance the interaction between bi-temporal feature tokens. Moreover, joint feature fusion and comparison are implemented at all stages, rather than sequential comparison after feature extraction. To elaborate, pairs of patches are input into the hierarchical encoder, which gradually reduces the spatial resolution and doubles the channel dimension at each stage. After that, four levels of bi-temporal feature maps at progressively smaller fractions of the original image resolution are obtained. Finally, the four-level features with different resolutions are fed into the FPFN to obtain the final CD mask at the original image size.
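To make the overall data flow concrete, the following PyTorch-style sketch mirrors the description above; the class and argument names (`DCATSketch`, `patch_embeds`, `stages`, `decoder`) are hypothetical placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class DCATSketch(nn.Module):
    """Illustrative data flow: a four-stage hierarchical encoder processes both
    temporal images jointly, and a pyramid decoder (FPFN) fuses the multi-scale
    bi-temporal features into a full-resolution change mask."""

    def __init__(self, patch_embeds: nn.ModuleList, stages: nn.ModuleList, decoder: nn.Module):
        super().__init__()
        self.patch_embeds = patch_embeds  # one overlapping patch embedding per stage
        self.stages = stages              # one list of dual cross-attention blocks per stage
        self.decoder = decoder            # the feature pyramid fusion network

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        feats_a, feats_b = [], []
        x_a, x_b = img_a, img_b
        for embed, blocks in zip(self.patch_embeds, self.stages):
            x_a, x_b = embed(x_a), embed(x_b)   # downsample, increase channels
            for block in blocks:
                x_a, x_b = block(x_a, x_b)      # joint extraction + bi-temporal interaction
            feats_a.append(x_a)
            feats_b.append(x_b)
        return self.decoder(feats_a, feats_b)   # full-resolution change map
```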
The central idea behind DCAT is to leverage dual cross-attention to effectively capture and incorporate the global context that accounts for multi-level semantic differences. By combining the exceptional local modeling capability of CNN with the powerful global dependency perception of transformer, high- and low-frequency information can be extracted, respectively. These two branches correspond to the mixer, as detailed in
Section 3.3. For ease of reference in the following section, we use the terms “high-frequency features” and “low-frequency features” to refer to the CNN features and transformer features, respectively.
3.2. Overlapped Patch Embedding Module
In visual tasks, multi-level and multi-scale feature representations help aggregate semantic context information and spatial details [27,55,60]. Consequently, to improve dense-pixel-prediction performance, it is important to introduce a hierarchical structure capable of mixing low- and high-level features.
An overlapping patch merging process is used to preserve local continuity and make full use of the local context information around small-size patches. OPEM shrinks the hierarchical feature map from the resolution of the i-th stage to that of the (i+1)-th stage by a strided convolution. As the network goes deeper, the number of tokens is reduced and the receptive field is increased thanks to the progressive downsampling applied by the model.
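A minimal sketch of such an overlapping patch embedding is given below, assuming it is realized as a strided convolution whose kernel is larger than its stride; the default kernel and stride values are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Downsamples a feature map with a strided convolution whose receptive
    fields overlap, preserving local continuity across patch borders."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, stride: int = 2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                       # (B, out_ch, H/stride, W/stride)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, C) token layout
        tokens = self.norm(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```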
3.3. Dual Cross-Attention Transformer Block
Vision transformers (ViTs) are widely used in visual tasks, and many change detection approaches adopt the classic self-attention mechanism. The vanilla vision transformer consists of a multi-head self-attention (MSA) module and a feed-forward network (FFN). However, MSA computes dependencies between all patch tokens, resulting in a strong emphasis on aggregating global information across all layers. This inevitably leads to a significant increase in computational complexity, which scales quadratically with the number of input tokens.
Although certain approaches such as PVT [55] and SegFormer [56] attempt to decrease the spatial scale of keys and values prior to the self-attention operation, using a single branch such as convolution can lead to the loss of small object details and the degradation of high-frequency information (e.g., local textures). As a result, the modeling capability of ViTs can be weakened. In other words, high-frequency details are easily ignored by ViTs, whereas CNNs have outstanding local modeling ability, and in visual tasks high-frequency information is very important. Therefore, we propose a low- and high-frequency cross-attention mixer to aggregate global and local information, respectively.
As shown in Figure 1b, the mixer consists of two sub-mixers, a low-frequency mixer and a high-frequency mixer, both of which employ a cross-attention mechanism to interact with features from the other phase. For simplicity, the residual connection and FFN are omitted. It is worth noting that we transplant the effective high-frequency representation extraction ability of CNNs into the transformer block and adopt a multi-branch structure in the high-frequency mixer, inspired by Inception [61]. Rather than directly feeding image tokens into the low- and high-frequency mixers, the mixer introduced in this study separates the input feature along the channel dimension using a pre-defined ramp division strategy. The separated components are subsequently fed into the high- and low-frequency mixers, respectively. In contrast to the conventional self-attention mixer, our mixer employs a hybrid architecture that integrates the strengths of CNN and transformer models. The novelty of our approach lies in its ability to perform the attention calculation twice while maintaining a smaller number of parameters.
Specifically, the input feature map $X$ is first partitioned into two components along the channel dimension: $X_h$ with $C_h$ channels and $X_l$ with $C_l$ channels, where $C = C_h + C_l$. Then, $X_l$ and $X_h$ are fed into the low-frequency mixer (LFM) and the high-frequency mixer (HFM) to compute the cross-attention features $Y_l$ and $Y_h$, respectively. Finally, $Y_h$ and $Y_l$ are concatenated, and the final output feature $Y$ is obtained through a convolutional layer:
$Y = \mathrm{Conv}\big(\mathrm{LN}([Y_h; Y_l])\big),$
where LN is layer normalization and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
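The split-and-merge step can be sketched as follows; the sub-mixer interfaces and the 1×1 merge convolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DualFrequencyMixerSketch(nn.Module):
    """Splits channels into a high-frequency part (CNN-style mixer) and a
    low-frequency part (attention-style mixer), then merges the two outputs."""

    def __init__(self, dim: int, high_dim: int, high_mixer: nn.Module, low_mixer: nn.Module):
        super().__init__()
        self.high_dim = high_dim          # C_h channels go to the high-frequency mixer
        self.low_dim = dim - high_dim     # C_l channels go to the low-frequency mixer
        self.high_mixer = high_mixer
        self.low_mixer = low_mixer
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)   # merge the concatenated outputs

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # Split both temporal feature maps along the channel dimension.
        xa_h, xa_l = torch.split(x_a, [self.high_dim, self.low_dim], dim=1)
        xb_h, xb_l = torch.split(x_b, [self.high_dim, self.low_dim], dim=1)
        # Each mixer exchanges information between the two phases via cross-attention.
        ya_h, yb_h = self.high_mixer(xa_h, xb_h)
        ya_l, yb_l = self.low_mixer(xa_l, xb_l)
        y_a = self.fuse(torch.cat([ya_h, ya_l], dim=1))
        y_b = self.fuse(torch.cat([yb_h, yb_l], dim=1))
        return y_a, y_b
```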
In many transformer-based methods [26,27,54,56], the traditional self-attention mechanism is only applicable to a single input sequence. The attention function maps a query and a set of key-value pairs to an output that is a weighted sum of the values. In order to adapt to the dual-input situation in CD, a novel cross-attention is presented to model the inner connection and information interaction between token pairs. To compute the cross-attention scores between queries and keys and obtain the weighted sum of values, we first compute the dot product of each query with all keys belonging to the tokens of the other temporal image. As in many transformer-based methods, we divide the dot products by $\sqrt{d_k}$ before applying the SoftMax function to obtain the weights on the values. For computational convenience, we pack the queries, keys, and values into matrices $Q$, $K$, and $V$, respectively. Taking the cross-attention in the high-frequency mixer as an example, the above calculation process is described by the following formulas:
$Y_h^A = \mathrm{SoftMax}\!\left(\frac{Q^A (K^B)^\top}{\sqrt{d_k}}\right) V^B, \qquad Y_h^B = \mathrm{SoftMax}\!\left(\frac{Q^B (K^A)^\top}{\sqrt{d_k}}\right) V^A,$
where $Y_h^A$ and $Y_h^B$ denote the cross-attention feature maps of the bi-temporal tokens at the i-th ($i \in \{1,2,3,4\}$) stage in the high-frequency mixer. The cross-attention calculation process in the low-frequency mixer is similar.
Finally, our DCAT block is also equipped with a feed-forward network (FFN); the difference from the vanilla transformer block is that the self-attention mixer is replaced by the LFM and HFM described above, while LayerNorm (LN) and residual connections are also applied, following the standard transformer block structure.
3.3.1. High-Frequency Cross-Attention Mixer
Considering the sensitivity of the convolution operation to high-frequency detail information and the need to keep the number of parameters manageable, we adopt three parallel branches to capture high-frequency components and use depth-wise convolution for spatial reduction before the cross-attention operation.
In detail, the high-frequency input $X_h$ is convolved twice in the first branch, using a depth-wise convolution (DWConv) followed by a strided DWConv for spatial reduction. Similarly, the second branch applies a single strided DWConv. Here, $R$ is the reduction ratio, which differs across stages and is realized by a convolution with stride $R$. The third branch applies a max-pooling operation before a DWConv. The three spatially reduced feature maps are then concatenated and flattened, and layer normalization is applied to obtain the shortened token sequence $\tilde{X}_h$, which serves as the input of the cross-attention.
Obviously, the obtained token sequence $\tilde{X}_h$ is shorter than the flattened input feature $X_h$. Referring to the $Q$, $K$, and $V$ design in the vanilla self-attention mechanism, we obtain the new query, key, and value in the high-frequency mixer by $Q_h = X_h W_Q$, $K_h = \tilde{X}_h W_K$, and $V_h = \tilde{X}_h W_V$, where $W_Q$, $W_K$, and $W_V$ are learnable matrices that generate the query, key, and value, respectively. Then, following the cross-attention defined above, the cross-attention feature maps of the bi-temporal token inputs in the high-frequency mixer are obtained.
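To illustrate the three-branch spatial reduction that produces the shortened key/value tokens, a hedged PyTorch sketch follows; the kernel sizes, the per-branch layouts, and keeping the channel count unchanged in each branch are assumptions, and a real implementation would additionally project the concatenated tokens to the key/value dimension.

```python
import torch
import torch.nn as nn

class HighFreqSpatialReduction(nn.Module):
    """Reduces key/value resolution with three parallel high-frequency branches:
    stacked depth-wise convolutions, a strided depth-wise convolution, and a
    max-pooling branch, each downsampling by the stage-specific ratio R."""

    def __init__(self, dim: int, reduction: int):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),                 # DWConv
            nn.Conv2d(dim, dim, reduction, stride=reduction, groups=dim))  # strided DWConv
        self.branch2 = nn.Conv2d(dim, dim, reduction, stride=reduction, groups=dim)
        self.branch3 = nn.Sequential(
            nn.MaxPool2d(reduction, stride=reduction),
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim))                 # DWConv
        self.norm = nn.LayerNorm(3 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [self.branch1(x), self.branch2(x), self.branch3(x)]
        y = torch.cat(outs, dim=1)            # (B, 3*dim, H/R, W/R)
        y = y.flatten(2).transpose(1, 2)      # shortened token sequence
        return self.norm(y)                   # keys/values are projected from this
```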
3.3.2. Low-Frequency Cross-Attention Mixer
We use cross-attention again in the low-frequency mixer to communicate semantic information between tokens from different phases. Although the attention mechanism has a stronger ability to learn global representations, it inevitably introduces a significant computational burden, particularly in the shallow layers of the network, where the feature resolution is relatively large. To address this issue, we adopt a simple approach of using an average pooling (AvePool) layer to reduce the feature resolution of the low-frequency input $X_l$ before the cross-attention operation, and an upsampling operation is then utilized to recover the spatial resolution. The low-frequency mixer can be defined as
$Y_l = \mathrm{Upsample}\big(\mathrm{CA}(\mathrm{AvePool}(X_l))\big),$
where CA is the cross-attention operation and $Y_l$ is the output of the LFM. In this way, the computational overhead of the low-frequency branch is reduced, while global information embedding and communication are enhanced.
3.3.3. Frequency Ramp Structure
From the perspective of signal processing, the high-frequency component corresponds to the parts of an image that change violently, i.e., the edges (contours) or noise, while the low-frequency component represents the areas where the brightness or gray value changes slowly; it describes the main body of the image and is a comprehensive measure of the overall image intensity. Accordingly, in many visual frameworks, the lower layers play a larger role in capturing high-frequency details, while the deeper layers focus more on modeling low-frequency global information: by capturing high-frequency components, the lower layers extract elementary visual features and gradually gather local information to achieve a global understanding of the input. To exploit this characteristic, we design a frequency ramp structure that gradually allocates more channel dimensions to the low-frequency mixer from the lower to the higher layers, leaving fewer channel dimensions for the high-frequency mixer; in other words, more channels are assigned to the high-frequency mixer in the lower layers and fewer in the higher layers. This approach enables our network to effectively capture both high-frequency details and low-frequency global information and facilitates the efficient processing of visual inputs across different frequencies.
Figure 1 illustrates our proposed DCAT architecture, which comprises four stages with varying feature sizes and channel dimensions. For each DCAB, we pre-define a simple channel split strategy that divides the channels between the low- and high-frequency mixers into $C_l$ and $C_h$ channels, respectively. Assuming that the total channel number of the input feature map is $C$ with $C = C_h + C_l$, $C_h$ gradually decreases from shallow to deep layers, while $C_l$ increases accordingly. By employing this ramp division strategy, DCAT is able to effectively balance high- and low-frequency components across all layers. Different channel division methods are discussed in the ablation study.
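As an illustration of the ramp division, the short snippet below allocates hypothetical per-stage channel budgets; the stage dimensions and ratios are invented for demonstration and do not reproduce the paper's configuration.

```python
# Hypothetical frequency ramp split: the high-frequency share of the channels
# shrinks with depth while the low-frequency share grows.
stage_dims  = [64, 128, 256, 512]      # total channels C per stage (illustrative)
high_ratios = [0.75, 0.5, 0.3, 0.2]    # fraction of C given to the high-frequency mixer
for stage, (c, r) in enumerate(zip(stage_dims, high_ratios), start=1):
    c_h = int(c * r)                   # channels for the high-frequency mixer
    c_l = c - c_h                      # channels for the low-frequency mixer
    print(f"stage {stage}: C={c}, C_h={c_h}, C_l={c_l}")
```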
In short, the advantages of the proposed dual cross-attention block lie in the following aspects:
- (1)
We introduce a high- and low-frequency mixer that combines the strengths of convolution and transformer to enhance the perception capability of cross-attention from a frequency viewpoint. By capturing both high- and low-frequency information from the bi-temporal inputs, the mixer improves the network's ability to model high-frequency information while leveraging the transformer's sensitivity to low-frequency information. This approach enhances the richness and diversity of the feature representation, leading to more accurate and robust change detection results;
- (2)
The cross-attention mechanism is utilized twice to query and compare semantic information between bi-temporal features. This enables the network to capture more complex and subtle changes between the input images, leading to a more comprehensive and informative feature representation.
3.4. Feature Pyramid Fusion Network
The feature pyramid fusion network (FPFN) aims to aggregate the multi-scale bi-temporal feature inputs and to predict distinguishable change maps. Our proposed FPFN restores the original resolution by progressively up-sampling and merging early-stage information.
As shown in
Figure 2, the proposed FPFN consists of four layers. We obtain the bi-temporal feature maps $F_i^A$ and $F_i^B$ ($i = 1, 2, 3, 4$) from the encoder; a bi-temporal feature fusion module (BFF) is then applied to merge the features of the two temporal phases into a fused feature $F_i$, followed by upsampling and concatenation layers, which aggregate both shallow and deep context information. The BFF fuses semantic change information by further enhancing the difference and consistency of the bi-temporal features, which correspond to the difference enhancement module (DEM) and the consistency selection module (CSM), respectively. Finally, the representation is fused and restored through a linear layer and two deconvolution layers, and the final per-pixel change result $P$ is obtained by a convolutional layer.
In the above procedure, the BFF merges the bi-temporal features at each stage, and the fused feature is further refined by a convolution operation. It is worth noting that, before concatenation, the fused features from deeper stages are upsampled so that they have the same resolution as the feature of the stage they are merged with.
Generally, the fusion of bi-temporal features in CD task can be accomplished through direct subtraction followed by the calculation of the absolute value or by concatenating the features in the channel dimension. However, these simple extraction schemes fail to distinguish between the changed and unchanged classes during the feature extraction process, and they have limited effectiveness in mining and combining mixed features. To overcome this limitation, we emphasize the difference and consistency between different temporal features to further explore and enhance the fusion feature representation.
As shown in
Figure 3, the outputs of the parallel DEM and CSM are added separately for each temporal branch to obtain the input features of the next stage. When extracting cross-temporal features, the model learns not only the differences between changed and unchanged regions but also their consistency. In other words, the most effective and important channel features from the two phases are selected according to the common feature, which tends to correspond to unchanged regions or ground objects, since they share similar texture patterns, whereas the difference features obtained by the DEM correspond to the changed regions. In this way, the differences and consistencies between bi-temporal features are further enhanced and distinguished, and the information useful for change detection is separated from the mixed features. After this refinement, a more informative and robust feature map is generated, and the change map is then obtained by fusing the features gradually, layer by layer.
3.4.1. Consistency Selection Module
The goal of the Consistency Selection Module (CSM) is to select bi-temporal channel features in an adaptive manner, based on the common-modality feature that corresponds to regions that remain unchanged over time.
Figure 3a illustrates how CSM dynamically selects features using a SoftMax attention mechanism guided by information from the two temporal branches. Firstly, the common-attribute feature map is extracted by directly summing the input bi-temporal features $F^A$ and $F^B$. Secondly, the summed feature is passed through an average pooling layer and two two-layer fully connected networks, which share the weights of the first layer to reduce parameters, yielding one attention vector per temporal branch. Finally, the selected features are obtained by applying the SoftMax operation across the two attention vectors and multiplying the resulting attention maps element-wise with the corresponding input features.
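A hedged sketch of this selection step, implemented as an SK-style channel competition driven by the summed common feature, is shown below; the hidden width, the ReLU activation, and the pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class ConsistencySelectionSketch(nn.Module):
    """Selects channel features shared by both phases: a common descriptor (the sum
    of the two inputs, globally pooled) drives two small FC branches whose outputs
    compete through a SoftMax and re-weight the corresponding input feature maps."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.shared_fc = nn.Linear(channels, hidden)          # first layer shared by both branches
        self.fc_a = nn.Linear(hidden, channels)
        self.fc_b = nn.Linear(hidden, channels)

    def forward(self, f_a: torch.Tensor, f_b: torch.Tensor):
        common = self.pool(f_a + f_b).flatten(1)              # (B, C) common-attribute descriptor
        z = torch.relu(self.shared_fc(common))
        w = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)  # (B, 2, C)
        w = torch.softmax(w, dim=1)                           # compete across the two phases
        w_a, w_b = w[:, 0], w[:, 1]
        out_a = f_a * w_a.unsqueeze(-1).unsqueeze(-1)         # re-weight phase A channels
        out_b = f_b * w_b.unsqueeze(-1).unsqueeze(-1)         # re-weight phase B channels
        return out_a, out_b
```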
3.4.2. Difference Enhancement Module
The difference enhancement module (DEM) aims at enhancing the difference of changed regions. As illustrated in
Figure 3b, this module leverages a channel-wise attention weighting mechanism to enhance the temporal-related differences. Firstly, we obtain the differential feature by directly subtracting the bi-temporal inputs $F^A$ and $F^B$. Secondly, the DEM infers channel attention maps from the differential feature by applying global max pooling and global average pooling followed by a shared two-layer MLP and a sigmoid function. The attention maps are then multiplied element-wise with each input feature map to refine the features adaptively, and the refined feature maps are added back to the input feature maps to obtain the enhanced differential features.
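The difference branch can be sketched analogously; the CBAM-style pooling pair, the shared MLP width, and applying the same attention map to both phases are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DifferenceEnhancementSketch(nn.Module):
    """Builds a channel-attention map from the bi-temporal difference (global max
    and average pooling + shared MLP) and uses it to re-weight and residually
    enhance both input feature maps."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.mlp = nn.Sequential(nn.Conv2d(channels, hidden, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(hidden, channels, 1))

    def forward(self, f_a: torch.Tensor, f_b: torch.Tensor):
        diff = f_a - f_b                                      # differential feature
        gmp = torch.amax(diff, dim=(2, 3), keepdim=True)      # global max pooling
        gap = torch.mean(diff, dim=(2, 3), keepdim=True)      # global average pooling
        attn = torch.sigmoid(self.mlp(gmp) + self.mlp(gap))   # (B, C, 1, 1) channel attention
        out_a = f_a + f_a * attn                              # refine, then add back (residual)
        out_b = f_b + f_b * attn
        return out_a, out_b
```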
After extracting the consistency features and difference features, the interaction and aggregation of the overall information are completed by direct summation of the refined bi-temporal features, yielding the output feature $F_i$ of the BFF at the i-th stage.
The proposed FPFN offers several advantages, which can be summarized in the following three aspects.
- (1)
FPFN employs a pyramid architecture to fuse features layer by layer, incorporating both early coarse features and deep fine features. This approach enhances the network’s ability to capture complex and multi-scale information;
- (2)
The network utilizes CSM to select similarities and takes advantage of the SoftMax function’s normalization to reassign weights to feature channels, which allows for the screening of bi-temporal unchanged features. This adaptive selection of channel features from two input features minimizes the introduction of redundant features, allowing the network to focus on the most reliable common features;
- (3)
The proposed DEM enhances changed attributes by utilizing direct subtraction to obtain the difference representation of input features. By doing so, the network improves the feature separability between changed regions and unchanged regions, leading to more accurate change detection results.
3.5. Loss Function
The binary cross-entropy loss is a commonly used loss function for CD tasks. Since pixel-wise CD can be considered a binary classification task, we use the following binary cross-entropy loss as the loss function of DCAT:
$\mathcal{L}_{bce} = -\frac{1}{N}\sum_{n=1}^{N}\big[y_n \log p_n + (1 - y_n)\log(1 - p_n)\big],$
where $N$ is the number of image pixels; $y_n$ is the true label of the n-th pixel, with $y_n = 1$ representing the changed class and $y_n = 0$ the unchanged class; and $p_n$ denotes the change probability in the predicted map. Technically, the four change maps $P_i$ ($i = 1, 2, 3, 4$) with different resolutions compute the binary cross-entropy loss against the correspondingly scaled ground truth, and the final change map $P$ also computes the binary cross-entropy loss against the full-resolution ground truth. All losses are then weighted and summed to obtain the overall loss $L$:
$L = \mathcal{L}_{bce}(P, Y) + \lambda \sum_{i=1}^{4} \mathcal{L}_{bce}(P_i, Y_i),$
where $\lambda$ is a trade-off parameter, $Y$ and $Y_i$ denote the ground truth at the corresponding resolutions, and $\mathcal{L}_{bce}$ is the binary cross-entropy defined above.
5. Discussion
To provide a detailed analysis, we present ablation experiments on the LEVIR-CD dataset in this section. By performing ablation experiments on our proposed method, we aim to identify the contribution of each component to the overall performance and provide insights into the effectiveness of our approach.
5.1. Ablation Study on Cross-Attention
Within the transformer framework, our proposed approach introduces a cross-attention mechanism, which is the core idea and key innovation of DCAT. As shown in
Figure 1c, the DCAB computes the dot products of the queries from a patch of image T1 with all keys belonging to the corresponding patch of image T2, which is consistent with human visual change observation. However, frequent cross-attention calculation during feature extraction may lead to unnecessary feature confusion, making it more difficult to distinguish similarity from inconsistency. To investigate how well the cross-attention mechanism works, ablation experiments were conducted on the LEVIR-CD dataset.
As shown in
Table 6, DCAT (with cross-attention) uses the cross-attention mechanism in the backbone during feature extraction. In contrast, the variant DCAT (with self-attention) replaces cross-attention with self-attention in all transformer blocks, i.e., the dot products are computed not between the queries of one temporal image and the keys of the other, but between the queries and keys of the same image. In addition, our proposed configuration, which utilizes the cross-attention mechanism only in the last DCAB of each stage, was also compared.
From
Table 6, DCAT based on the self-attention mechanism achieves the lowest F1 score of 91.16%. Meanwhile, when all the DCABs adopt the cross-attention mechanism, the F1 score rises to 91.23%. The performance differences indicate that it is the cross-attention mechanism that improves the CD performance, i.e., similarities should be measured between a query on one image and the keys on the other image. By taking advantage of this interactive query calculation, DCAT imitates the visual principle by which human eyes perceive changes. In contrast, DCAT (with self-attention) computes self-attention independently, with no information interaction between the two branches. This lack of short-term memory makes the extracted change features less representative and fails to compensate for misregistration, so the performance is degraded. Furthermore, compared with the previous two variants, our proposed DCAT uses the cross-attention mechanism only in the last DCAB of each stage and the self-attention mechanism in the other blocks, which further improves the F1 score to 91.40%. This improvement shows that a separation-aggregation architecture helps the model grasp the commonness and differences between changed and unchanged regions during feature extraction. Our cross-attention mechanism aggregates and analyzes context information at the end of each backbone stage, which alleviates the difficulty of distinguishing mixed bi-temporal features during feature extraction and provides a reliable input for the subsequent bi-temporal feature fusion in the FPFN.
5.2. Ablation Study on Mixer
As an important component of the DCAB, the mixer uses a pre-defined channel division strategy to calculate the cross-attention of the high-frequency and low-frequency components separately. This strategy enhances the perception capability of the transformer across the frequency spectrum. As demonstrated in previous research [65], the lower layers of the transformer require more local information, while the higher layers require more global information. Therefore, the mixer's channel division strategy complements the transformer's hierarchical structure and optimizes the information flow through the network. To achieve a better trade-off between high- and low-frequency components across all layers, we pre-define a frequency ramp structure, in which the dimension of the high-frequency branch decreases and that of the low-frequency branch increases from the lower to the higher layers. This approach effectively attenuates high-frequency noise in the input signal and emphasizes the low-frequency components, which are often more informative for various visual recognition tasks. Additionally, the division structure enables the mixer to better utilize the hierarchical feature representations in the transformer, as it matches the lower layers' preference for local details and the higher layers' preference for global information. Therefore, this design choice enhances the model's ability to capture and utilize relevant information from the input signal.
To assess the impact of the mixer's components, we analyze the effect of the channel division ratios between the high- and low-frequency branches in Table 7. The experimental results presented in Table 7 indicate that the model whose low-frequency channel share increases and whose high-frequency channel share decreases with depth outperforms the other four mixer structures evaluated in this study, which is consistent with prior research. Specifically, the frequency ramp division structure enables the mixer to effectively balance the representation of high-frequency and low-frequency components across all layers of the network, which facilitates the learning of more discriminative visual features. Therefore, our findings suggest that the frequency ramp division structure is a promising method for enhancing the performance of the dual cross-attention mixer.
5.3. Ablation Study on BFF
Change detection is a challenging visual task that involves processing bi-temporal image inputs. Integrating features from different temporal phases is a crucial aspect of this task. In the ablation study, we compared the impact of different strategies for combining DEM and CSM features in BFF on change detection performance. Our study considered both parallel and sequential arrangements for integrating DEM and CSM features, as illustrated in
Figure 9. In the sequential arrangement, we evaluated two orders: DEM-first and CSM-first. The parallel arrangement, in this context, was found to yield better performance than the sequential arrangement.
In addition to comparing different integration strategies, we also examined two common fusion methods: absolute value subtraction and direct concatenation. Absolute value subtraction takes the absolute value of the difference between corresponding pixels of the two feature maps, while direct concatenation concatenates the two maps along the channel dimension. The results in
Table 8 demonstrate that the performance of the two feature combination methods mentioned is significantly lower than that achieved by parallel or sequential approaches.
Overall, our study highlights the importance of selecting appropriate integration strategies and fusion methods in BFF for effective change detection. This indicates that incorporating both DEM and CSM features is valuable for CD and using the parallel arrangement strategy further improves performance.
5.4. Ablation Study on FPFN
Benefiting from the hierarchical structure design, our encoder can extract features at different scales from shallow to deep levels. Therefore, the way in which semantic features at different levels are mixed is crucial for CD performance. Usually, most deep neural networks use only the deepest extracted features for information aggregation, whereas the FPFN fully incorporates feature representations from different layers to enhance the separability of change features. To verify the effectiveness of layer-by-layer feature fusion, we conducted comparative experiments on the fusion strategy of the FPFN. In detail, starting from the deepest (stage-4) features, we take different combinations that progressively merge the feature maps of the earlier stages (stages 3, 2, and 1).
The quantitative performances are presented in
Table 9. The variant that mixes only the deepest bi-temporal features at stage 4 to obtain change maps is compared with the variant that feeds the bi-temporal features of stages 3 and 4 into the decoder, and so on for the remaining combinations. Obviously, the decoder incorporating multi-layer features brings significant improvements, which also illustrates that the hierarchical encoder is effective at extracting features of different scales and is well matched to the multi-layer decoder. As shown in
Table 9, the more early features are fused by the decoder, the better the CD performance. Our proposed multi-layer feature fusion network, which merges the features of all four stages, achieves the highest precision, recall, and F1 values.
We visualized attention maps after feature fusion and corresponding prediction maps to visually evaluate performance differences with different fusion methods. Observing the results in the third column of
Figure 10, we can see that the decoder that uses only the stage-4 features suffers from significant adhesion and misjudgment. With the integration of lower-level features, the variants that fuse two or three stages still show some adhesion, but they achieve clear improvements. The necessity of fusing early features is also demonstrated by the fact that, as the fusion goes deeper, the attention maps gradually highlight the changed areas more clearly and with less noise. In terms of building edge recognition, the detection results obtained by merging multi-layer features have sharper edges. In short, the above comparisons of FPFN variants with different structures justify that the proposed fusion network merging multi-layer features is superior to decoders that only partially fuse deep features in discriminating changed regions.
5.5. Ablation on Robustness
In the LEVIR-CD+ experiment, some misregistered samples were observed, as shown in
Figure 6. To further evaluate the generalization and robustness of DCAT against object shifts, we conducted ablation experiments on the synthetic images and real season-varying remote sensing images change detection dataset (SVCD) [
66]. SVCD includes a sub-dataset of synthetic images with object shifts. These shifts are in the range [−5, 5] pixels in horizontal and vertical directions without intersections with the other objects and image boundaries. The results presented in
Table 10 demonstrate that our approach outperforms other methods on the SVCD test set by a significant margin. This may be attributed to the ability of our method, DCAT, to model context within a global, highly abstract, spatio-temporal scope. Additionally, our method utilizes cross-attention for contrastive queries, which enhances the feature representation and spatial tolerance of the model.
Some visualization results are presented in
Figure 11. It can be seen intuitively that the proposed approach outperformed the other methods, achieving more accurate detection boundaries. Our model is able to better avoid false positives that may result from the similar appearance of objects in the scene, which is a common challenge in CD tasks. For instance, in
Figure 11(1), we can observe that the FC-EF method incorrectly recognizes a small circle as a changed target, even though there was no actual change in the corresponding position. This is likely due to the high color similarity between the small circle and the background, which makes it difficult for the model to distinguish between them. In contrast, our proposed approach is able to better avoid false positives in such scenarios. By utilizing a more comprehensive feature representation and a dual cross-attention mechanism, our model is able to capture more contextual information and make more informed predictions. As a result, our approach achieves more accurate and reliable CD results, even in challenging scenarios where false positives are common.
Moreover, our DCAT approach demonstrates remarkable ability in handling small gaps caused by close proximity. An instance of close adjacent shapes in
Figure 11(3) highlights the accuracy of our method. DCAT learns effective context within the spatio-temporal domain, which enables it to better represent real semantic changes and obtain non-adhesive, separable boundaries. This ability to effectively capture context and enhance feature representation is one of the key factors contributing to the superior performance of our approach in CD tasks. Consequently, the proposed method handles CD with small object shifts well, demonstrating good robustness and generalization.
5.6. Ablation Study on Pretraining
In recent years, researchers have increasingly focused on remote sensing pretraining, leading to impressive breakthroughs in the CD field. We also explore the potential of pretraining with DCAT in this section;
Table 11 presents the performance of several typical pretraining methods on the LEVIR-CD test set.
SwinSUNet [36] utilizes a pure transformer network with a Siamese U-shaped structure for change detection. The method is initialized with a Swin transformer model pretrained on the ImageNet dataset and achieves a significant improvement in the F1 score, reaching 90.14%. ViTAEv2-s [59] proposes an advanced ViTAE-transformer-based model pretrained on the ImageNet dataset, which achieves an F1 score of 91.26%; when pretrained on the large-scale remote sensing dataset MillionAID [67] instead, it obtains an F1 score of 90.93%. The training set of MillionAID contains 10,000 images for classification, and we pretrain the DCAT encoder on this set as is done in most segmentation methods. Pretraining the DCAT encoder leads to an improvement of 0.11% in the F1 score of the proposed approach. We note that our method outperforms those pretrained models even without using any pretraining weights or training on other large-scale datasets, while the small gain from pretraining suggests that extra pretraining information can still help improve change feature discrimination.
5.7. Visualization of the Encoder
Our dual cross-attention transformer block plays a crucial role in capturing the semantic change differences between bi-temporal features at each layer. To gain a better understanding of the dual cross-attention transformer block, we visualize the attention maps at different stages. By examining the attention maps, we can observe that our transformer block is capable of identifying regions with significant change in the feature maps. In addition, the attention maps reveal that the dual cross-attention transformer block can effectively capture both high-level and low-level semantic concepts. At the earlier stages, the attention maps are focused on capturing low-level features, while at the later stages, the attention maps become more focused on high-level features such as objects and scenes.
Figure 12 shows the visualization results obtained by fusing the bi-temporal attention maps at the i-th stage on the LEVIR-CD dataset. We select several typical change scenarios, including changes in isolated small areas and in dense areas. From the visualization of the attention maps, it can be seen that the features extracted by the encoder are mainly concentrated in the regions corresponding to the semantic concepts of interest. Specifically, the attention maps highlight the changed areas in red, indicating higher attention values, while the unchanged areas are marked in blue, indicating lower attention values.
At the first stage, the encoder pays little attention to building pixels. At the second and third stages, our encoder is able to identify the changed regions of interest, and at the fourth stage this phenomenon is even more obvious. Interestingly, from
Figure 12f,g, our encoder is effective in highlighting the pixels surrounding the buildings (e.g., shadows and trees), and larger areas have higher attention values in
Figure 12g. The shift of the learned attention towards the changed regions related to buildings as the network deepens can be attributed to two main reasons. Firstly, the encoder of our model can capture long-range spatio-temporal dependencies, which helps it extract more comprehensive features that are sensitive to changes in the environment. Secondly, the semantic context surrounding buildings plays a crucial role in object recognition, making it an essential cue for identifying ground objects. As a result, the attention of the dual cross-attention transformer block gradually shifts towards the changed regions related to buildings in deeper layers, indicating that the proposed block effectively captures the semantic differences between bi-temporal features and extracts informative features for change detection. The visualization results indicate that our model is powerful in capturing semantic differences from coarse to fine and implicitly learns some extra concepts that promote change detection.
6. Conclusions
In this paper, we present a novel approach called DCAT that leverages the power of dual cross-attention transformers for accurate change detection in remote sensing imagery. Inspired by the visual change observation of human eyes, where changes are detected by alternating comparisons from left to right and successive matching back and forth [68], we introduce a dual cross-attention mechanism that combines the transformer's query, key, and value to achieve alternating attention. This approach enables the capture of change-specific feature salience and compensates for change features affected by misregistration.
Based on dual cross-attention, we present a mixer that combines high- and low-frequency information, replacing the traditional self-attention mixer. Two branches of cross-attention are calculated to obtain feature representations under high- and low-frequency mixers, respectively. DCAT utilizes a channel ramp splitting method to couple convolution-based CNN and cross-attention-based transformer, which enables the model to expand its perception capability and capture more diverse features from the input paired data. The CNN and transformer modules are fused in a parallel and hierarchical manner, enabling the efficient combination of local and global features, resulting in improved performance in remote sensing CD tasks.
Furthermore, the encoder of DCAT has a hierarchical structure to adapt to changed objects of various sizes. Hierarchical features extracted at different stages effectively capture changed objects of small sizes and reduce missed alarms. In addition to the powerful dual cross-attention-based transformer encoder, we adopt a feature pyramid fusion network to aggregate multi-scale dual-input features. The parallel structure of BFF effectively enhances the differences between bi-temporal features and uses consistent features to select similarity.
Extensive experiments demonstrate that DCAT has significant advantages over other self-attention transformers, indicating its potential as a general-purpose backbone for change detection.
Limitation
During the ablation study, we noticed that incorporating a pretraining dataset of 10,000 images resulted in only a slight improvement in DCAT's performance. We infer that this could be attributed to the limited size of the pretraining dataset and the inherent differences between the pretraining method (which involves classifying two images simultaneously) and the image pairs used for change detection. Consequently, exploring and developing large-scale pretrained models in the field of remote sensing remains an important open topic.
Another limitation of the DCAT architecture is the manual definition of the channel numbers in the frequency ramp structure, namely, $C_l$ and $C_h$ for the low- and high-frequency mixers, respectively. Choosing these values requires considerable experience to obtain good performance. To address this limitation, we plan to consider using neural architecture search to adapt the number of channels allocated to the high-frequency and low-frequency branches.
Beyond the present study, future work will focus on extending DCAT to other object types, such as roads, street views, and vegetation. We also plan to explore the use of DCAT for multi-class semantic change type identification and for pretraining models in remote sensing, which will be our future research directions.