Article

A Conditionally Parameterized Feature Fusion U-Net for Building Change Detection

by Yao Gu 1, Chao Ren 1,2,*, Qinyi Chen 1, Haoming Bai 1, Zhenzhong Huang 1 and Lei Zou 1

1 College of Geomatics and Geoinformation, Guilin University of Technology, Guilin 541006, China
2 Guangxi Key Laboratory of Spatial Information and Geomatics, Guilin 541106, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(21), 9232; https://doi.org/10.3390/su16219232
Submission received: 19 September 2024 / Revised: 12 October 2024 / Accepted: 21 October 2024 / Published: 24 October 2024
(This article belongs to the Special Issue Urban Planning and Built Environment)

Abstract: The semantic richness of remote sensing images often presents challenges in building detection, such as edge blurring, loss of detail, and low resolution. To address these issues and improve boundary precision, this paper proposes CCCUnet, a hybrid architecture developed for enhanced building extraction. CCCUnet integrates CondConv, Coord Attention, and a CGAFusion module to overcome the limitations of traditional U-Net-based methods. Additionally, the NLLLoss function is utilized in classification tasks to optimize model parameters during training. CondConv replaces standard convolution operations in the U-Net encoder, boosting model capacity and performance in building change detection while ensuring efficient inference. Coord Attention enhances the detection of complex contours in small buildings by utilizing its attention mechanism. Furthermore, the CGAFusion module combines channel and spatial attention in the skip connection structure, capturing both spatial and channel-wise correlations. Experimental results demonstrate that CCCUnet achieves high accuracy in building change detection, with improved edge refinement and the better detection of small building contours. Thus, CCCUnet serves as a valuable tool for precise building extraction from remote sensing images, with broad applications in urban planning, land use, and disaster monitoring.

1. Introduction

Building change detection using remote sensing imagery involves the monitoring and analysis of the morphological, structural, and functional attributes of buildings over time to reveal the processes and trends in their evolution. With the acceleration of urbanization and the increasing emphasis on the safety and sustainability of the built environment, the significance of building change detection through remote sensing imagery has become more pronounced. It not only provides a foundation for decision making in urban planning and management but also offers essential technical support for disaster monitoring, assessment, and cultural heritage preservation [1].
According to relevant statistics, safety incidents resulting from building aging, damage, and other factors occur frequently around the world each year, posing significant threats to public safety and property. The application of remote sensing change detection technology enables the timely identification of potential safety hazards in buildings, providing early warnings and countermeasures for relevant authorities and effectively reducing the incidence of safety accidents. Furthermore, given the escalating impacts of global climate change, the resilience and sustainability of buildings, as critical components of urban environments, are garnering increased attention. Provided that the multi-temporal imagery obtained possesses adequate spatial and temporal resolution, it is theoretically feasible to acquire systematic and comprehensive information on surface changes, including change time, location, extent, type, degree, and status [2]. The utilization of remote sensing imagery-based building change detection technology offers accurate data support for disaster monitoring and assessment, provides a scientific foundation for urban planning and construction, and fosters sustainable urban development.

2. The Current State of Research

In recent years, deep neural networks have garnered substantial attention in the field of image understanding due to their remarkable ability to automatically learn complex and contextually relevant features. The primary objective has been to design advanced architectures for multi-level feature representations [3]. Despite these advancements, challenges such as low accuracy, suboptimal model performance, a scarcity of training samples, and the inefficient utilization of image features have increasingly surfaced. The introduction of U-Net has proven to be an effective solution to these issues, with numerous U-Net-based frameworks showing promising results in change detection through deep feature extraction and optimization techniques.
Recent advances in deep learning for remote sensing image analysis have significantly enhanced the U-Net architecture and its variants. One notable development is the MultiRes-UNet network, which extends the original U-Net model by introducing MultiRes blocks to capture features at multiple scales. This network design incorporates additional convolutional operations and skip connections to bridge discrepancies between encoder and decoder features, while integrating semantic edge information to improve boundary accuracy and address challenges with irregular polygons.
However, existing deep learning-based methods still suffer from problems such as extracted features not being discriminative enough, leading to false detections and a loss of detail, particularly in complex remote sensing scenarios. To address these issues, the Fusion-Former method was proposed for building change detection. Fusion-Former fuses window-based self-attention with depth-wise convolution in a module termed Fusion-Block, combining convolutional neural networks (CNNs) and a Transformer to effectively integrate information at different scales.
In the Fusion-Former model, the Fusion-Block enables the extraction of more effective change information from multi-channel remote sensing data, and the Vision-Module significantly enhances the performance of the Fusion-Block. Efficiency comparisons demonstrate that Fusion-Former achieves strong performance. However, certain challenges remain, particularly in edge feature extraction, indicating that there is still room for further improvement in model accuracy. Follow-up research continues to explore the combination of CNNs and Transformers, as well as ways to enhance feature extraction efficiency and achieve lightweight models. The approach is applicable not only to remote sensing image change detection but also to general semantic segmentation tasks, making it a promising area for further study and exploration [4,5].
This innovation presents a new solution for change detection in multi-temporal image analysis while reflecting ongoing efforts to address the challenges of complex data relationships and multi-scale feature extraction in Transformer-based models [6,7].
Recent advancements in remote sensing image analysis have led to several notable improvements in building extraction methodologies based on the U-Net architecture. One such advancement is EfficientUNet+, which enhances the traditional U-Net model by employing EfficientNet-b0 as the encoder and integrating scSE blocks into the decoder. This design not only improves the accuracy and efficiency of building extraction but also addresses the problem of boundary blurring. By combining a building boundary-weighted cross-entropy loss with Dice loss, EfficientUNet+ imposes stronger constraints on the building boundaries, leading to enhanced segmentation performance [8]. Another significant development in this field is the introduction of Seg-Unet, a novel deep neural network that merges SegNet and U-Net techniques to effectively extract building objects from high-resolution aerial images. By combining the strengths of both architectures, Seg-Unet achieves superior segmentation results, demonstrating an effective solution for extracting building objects from high-resolution imagery [9]. Additionally, a geometry-aware deep learning method has been specifically proposed for post-earthquake building component segmentation. This method extends the U-Net model with a composite loss function that incorporates a geometric consistency (GC) term. The GC term imposes constraints on both the perimeter and area of each component region, which helps address the challenge of accurately segmenting homogeneous structural components. This enhanced U-Net network features six encoder layers and five decoder layers, with additional convolutional layers in the lower levels designed to improve high-level feature extraction capabilities [10]. A further advancement is the two-stage, dual-branch U-Net architecture that incorporates a selective kernel (SK) module. This model addresses issues of inaccurate building footprint localization and damage level classification by sharing weights between the two branches and using the SK module to adaptively adjust receptive fields at different scales. This approach significantly improves feature extraction, enhances localization accuracy, and improves the damage classification performance [11]. Another recent innovation is the Feature Self-Attention U-Net (FSAU-Net) model, which introduces feature self-attention (FSA) in the encoding phase to refine feature extraction and spatial attention (SA) in the decoding phase to emphasize important spatial locations. The FSA module enhances feature representation by computing attention weights based on the features’ own characteristics, while the SA module focuses on relevant regions to improve the overall segmentation accuracy. Furthermore, skip connections are employed to fuse shallow and deep features, thereby reducing the loss of building information during the segmentation process [12]. Lastly, the improved MS-ResUNet network integrates multi-scale pyramid image inputs and a multi-layer attention module for building extraction from novel temporal satellite images. This network leverages multi-scale inputs to capture features at various scales and utilizes a multi-layer attention mechanism to refine feature representations. Additionally, a novel spatial analysis framework is introduced to detect and quantify changes in building vector information across different time points, advancing the capabilities for change detection in remote sensing imagery [13].
These advancements reflect ongoing efforts to refine building extraction methodologies by addressing the challenges related to feature extraction, edge segmentation, and multi-scale analysis. They collectively aim to achieve more accurate and effective building detection in high-resolution remote sensing imagery.
Recent advancements in change detection for high-resolution remote sensing images have introduced several notable enhancements to the U-Net architecture. One such approach proposes an improved U-Net architecture that integrates dilated convolutions and a spatial pyramid pooling structure to expand the receptive field of the model. This design effectively reduces interference from phenomena such as spectral confusion and object confusion in high-resolution images. Despite these advancements, the model’s feature extraction capability remains limited for complex tasks, particularly for detecting small buildings, indicating a need for further improvements in future research [14]. Another significant development is the FlowS-Unet model, which builds upon a fully convolutional neural network framework to combine low-level high-resolution information with high-level semantic information. This model performs predictions on each fused feature map to minimize loss and enhances the network’s representational power. Additionally, a series of post-processing steps—including dilation, erosion, and hole-filling—are applied to refine the segmentation results. However, challenges remain with edge segmentation precision and training time performance, suggesting that there is still room for improvement in achieving accurate change detection [15]. A further advancement is the MAUnet model, which introduces a multi-scale information aggregation approach by combining building feature maps filtered through a convolutional block attention module with multi-scale features extracted via a dilated spatial pyramid pooling module. This method improves the model’s ability to detect small buildings in complex backgrounds. However, the performance of MAUnet in extracting buildings from even more challenging background information remains suboptimal and requires additional enhancements [16].
Recent advancements in building change detection have led to significant improvements in the U-Net++ architecture. One notable approach introduced a twin-differential structure in the U-Net++ encoder to separately learn features from two temporal images, utilizing residual blocks to facilitate gradient convergence and detailed feature extraction. However, this method suffers from high computational complexity and a large number of parameters. Building on this foundation, an enhanced U-Net++ structure was developed for end-to-end change detection in high-resolution satellite images, improving multi-scale global feature utilization but still facing challenges with environmental variations. To further advance detection capabilities, attention blocks were incorporated to effectively fuse feature maps from different layers, allowing for adaptive weight allocation and reducing background interference. The proposed method integrates geometric boundary information through a distance-based class map and employs a multi-task loss function to jointly optimize building and boundary extraction, achieving refined detection results during the testing phase [17,18,19]. One approach utilizes the first three levels of features from the encoder’s feature hierarchy as low-level inputs for spatial feature pyramid (SFP) and variable resolution encoding (VRE) modules to capture local spatial details, while high-level features are fed into a Transformer to model long-range dependencies through its self-attention mechanism. Although this method effectively simulates feature relationships, it still faces challenges in terms of accuracy and robustness for large-scale datasets. Another method proposes a Mobile-Unet model that employs the bneck module from MobileNetV3 for efficient feature extraction with fewer parameters and reduced training time, though it may miss the finer details of small detection targets. A third method introduces the SERG-UNet algorithm, which integrates an attention mechanism into the deep residual structure of the SEG-ResNet framework to enhance the feature extraction efficiency and address issues such as missing small buildings and incomplete edges in high-resolution imagery. Despite its advancements, SERG-UNet’s increased network depth complicates its deployment. These methods highlight ongoing efforts to balance feature detail, model efficiency, and deployment challenges in the field of building change detection [20,21,22].
This paper proposes a novel method for building change detection in remote sensing images, referred to as CCCUnet. This method is based on Conditionally Parameterized Convolutions for Efficient Inference (CondConv) and Coord Attention for feature fusion. CCCUnet employs a comprehensive encoder-decoder architecture in which CondConv replaces the traditional convolution operations in the U-Net encoder during the encoding phase. This enhancement significantly improves the model’s capacity and performance in effectively extracting edge features from images. Following the convolution layers, Coord Attention is integrated to combine positional coordinates with content information, allowing for a better capture of structural and sequential characteristics, thereby enhancing the detection of small-scale changes.
During the decoding phase, CGAFusion is utilized in the skip connections to fully integrate the features extracted by the encoder. This approach captures significant information across various semantic layers and improves the detection of edge details. The effectiveness of the proposed method is validated through a comparative analysis with state-of-the-art techniques on publicly available datasets.
The remainder of this paper is structured as follows: Section 3 delves into the detailed methodology of CCCUnet, including descriptions of the CondConv, Coord Attention, and CGAFusion modules. Section 4 presents the experimental setup, datasets utilized, and results obtained, while Section 5 discusses the implications of our findings and suggests directions for future research. Finally, Section 6 concludes the paper by summarizing the key takeaways and contributions of our work.

3. Methodology

3.1. U-Net Model

The U-Net model [23] is a classic deep learning architecture primarily designed for image segmentation tasks. Introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015 in their seminal paper “U-Net: Convolutional Networks for Biomedical Image Segmentation”, the model achieved outstanding performance in the ISBI 2015 cell tracking challenge and is characterized by a symmetric encoder–decoder structure.
The encoder part of U-Net, also known as the down-sampling or contracting path, captures contextual information from the image through a series of convolutional layers, ReLU activation functions, and max-pooling layers, progressively reducing the spatial dimensions of the feature maps. Conversely, the decoder part, also known as the up-sampling or expansive path, restores the spatial dimensions of the image using transposed convolution layers and merges features from the corresponding levels of the encoder through skip connections. These skip connections integrate low-level and high-level features, thus capturing more detailed information. At the end of the model, a 1 × 1 convolutional layer maps the feature maps to the desired number of classes, and a Sigmoid function performs pixel-wise classification.
A key feature of U-Net is the use of skip connections between the encoder and decoder, which transfer feature maps directly from the encoder to the decoder, preserving fine-grained details and enhancing the accuracy of target boundary localization. Additionally, U-Net supports end-to-end training, enabling direct learning from raw image inputs to final segmentation outputs without the need for complex preprocessing or post-processing steps. The model also employs multi-scale feature fusion through skip connections, which improves its ability to segment targets at various scales. Furthermore, as a fully convolutional architecture, U-Net reuses the same convolutional kernels across all spatial positions, keeping the number of model parameters moderate and improving training efficiency.
The U-Net model can be divided into three main components, as illustrated in Figure 1. The first component is the backbone feature extraction stage, which generates a series of feature layers and produces five initial effective feature maps. The second component is the enhanced feature extraction stage, where these five effective feature maps are subjected to up-sampling and feature fusion to produce a final, integrated feature map. The third component is the prediction stage, which classifies each pixel in the final feature map, thus performing pixel-level segmentation tasks.
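As a concrete illustration of this three-stage structure, the following is a minimal two-level PyTorch sketch of the encoder–decoder pattern with skip connections described above (a generic illustration with layer widths and names of our choosing, not the exact configuration used in this paper):

```python
import torch
import torch.nn as nn

def double_conv(cin, cout):
    # Two 3x3 conv + ReLU layers, as in the original U-Net stages.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class MiniUNet(nn.Module):
    """Two-level sketch of the encoder-decoder-skip structure."""
    def __init__(self, in_ch=3, num_classes=2, base=64):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)       # skip concat doubles channels
        self.head = nn.Conv2d(base, num_classes, 1)   # 1x1 conv to class scores

    def forward(self, x):
        e1 = self.enc1(x)                  # high-resolution, low-level features
        e2 = self.enc2(self.pool(e1))      # contracted, context-rich features
        d1 = self.up(e2)                   # up-sample back to e1's resolution
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip-connection fusion
        return self.head(d1)               # per-pixel class scores
```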

3.2. Design of the Improved U-Net Model

Although the U-Net model has demonstrated excellent performance in building change detection tasks, it still has several limitations. For example, in complex scenarios, the model’s feature extraction capability might be insufficient, which can lead to false negatives or false positives when detecting small buildings. Moreover, building change detection tasks often involve processing time-series images, necessitating improvements in the model’s temporal processing capabilities. To address these challenges, this paper proposes an improved U-Net model named CCCUnet, which is designed to enhance performance in building change detection tasks. The technical roadmap is illustrated in Figure 2.
To begin with, this study adopts CondConv dynamic convolutions (conditionally parameterized convolutions for efficient inference, abbreviated as CondConv) to replace traditional convolution operations in the encoder segment. This approach aims to enhance the model’s capacity and performance for change detection tasks while maintaining inference efficiency. CondConv’s unique feature is its ability to dynamically compute convolution kernels based on input data, thus overcoming the limitation of traditional static convolution methods where all samples share the same kernel. Additionally, the negative log-likelihood loss function (NLLLoss) is introduced to replace the conventional cross-entropy loss function, addressing class imbalance issues and optimizing the training process.
Next, the attention mechanism is incorporated to enhance the model’s focus on critical features. Specifically, the Coord Attention module is integrated into each layer of the encoder, which allows the model to adaptively adjust the importance of different features and thereby improve the capture of key features related to small building changes.
Finally, the CGAFusion feature fusion module is employed. CGAFusion is an innovative module designed for feature fusion, aimed at enhancing the performance of deep neural networks. By combining channel attention and spatial attention mechanisms, CGAFusion effectively strengthens the model’s ability to capture spatial and channel dependencies in image features. Although feature maps carry rich image details, the interrelationships between features across different channels and spatial locations may not be sufficiently pronounced. The introduction of the CGAFusion module addresses this issue by mitigating the impact of non-uniform fog distribution on detection results, thereby enabling the network to better understand and utilize these latent dependencies.
In summary, this paper presents a comprehensive and in-depth improvement and optimization of the U-Net model through a series of innovative strategies. These strategies include replacing traditional convolution operations with CondConv, integrating the Coord Attention mechanism, introducing the efficient CGAFusion feature fusion module and refining the loss function. These carefully designed modifications not only significantly enhance the model’s performance in building change detection tasks but also improve detection results for small buildings in complex scenarios.

3.2.1. Improvements in Convolutional Layer Architectures

A core challenge faced by convolutional neural networks (CNNs) today is the reliance on fixed convolution kernels that uniformly process all samples in a dataset. This limitation constrains the adaptability and capacity scalability of the model. To enhance model performance, conventional approaches often involve increasing the number of model parameters, deepening the network layers, or widening the channels. However, these strategies are frequently accompanied by a significant rise in computational complexity and deployment difficulty. To address these issues, this paper introduces CondConv, a novel method for conditionally parameterized convolutions designed to meet the high parameter efficiency demands of contemporary network designs, particularly in real-time computer vision applications such as terminal video processing and autonomous driving systems. These applications not only require rapid model responses but also aim to balance performance improvements with minimal parameter overhead.
CondConv overcomes the drawbacks of static convolutions where a single kernel is shared across all samples by dynamically computing the convolution kernels based on the input. Specifically, in CondConv layers, the convolution kernels are parameterized as a linear combination of n experts (kernels). This approach allows for more efficient model capacity expansion compared to increasing kernel sizes, as the increased number of experts can be combined dynamically without having to apply larger kernels to multiple positions in the input. Thus, CondConv enables the enhancement of model capacity while maintaining efficient inference.
In a standard convolutional layer, the same convolution kernel is applied to all input samples. In contrast, in a CondConv layer, the convolution kernels are dynamically computed based on the input samples, as illustrated in Figure 3. Specifically, the convolution kernels in CondConv are parameterized as follows:
K(x) = \sum_{i=1}^{n} \alpha_i(x) \, W_i        (1)
where W_i denotes the i-th expert kernel, \alpha_i(x) represents the input-dependent weight computed by a routing function with learnable parameters and a Sigmoid activation, and n is the number of experts (kernels). When CondConv is used to replace standard convolution layers, each convolution kernel retains the same dimensions as the kernels in the original convolutional layers.
In previous work, the capacity of conventional convolutional layers is typically increased by enlarging the height/width of convolution kernels or by increasing the number of input/output channels. However, expanding the convolution kernel size results in additional computational costs that scale with the size of the input feature map, which can be quite large.
In contrast, the CondConv layer computes the convolution kernels as a linear combination of a set of kernels for each sample before applying the convolution operation. Each kernel is computed only once but is applied to many different positions in the input image, which allows for increased model capacity while controlling the computational cost. By increasing the number of kernels, the network capacity can be expanded with relatively low additional inference cost compared to performing multiple convolutions. Moreover, a routing function can be designed to be computationally efficient, capable of meaningfully distinguishing between input samples, and easily interpretable. This routing function calculates the input-dependent routing weights through three steps: global average pooling, a fully connected layer, and a Sigmoid activation function:
\alpha_i(x) = \mathrm{Sigmoid}(\mathrm{FC}(\mathrm{GAP}(x)))        (2)
where the learnable routing weight matrix maps the pooled input to n expert weights. Based on these dynamically aggregated convolution kernels, the normal convolution operation is then performed. Thus, the proposed routing function in this work enables adaptive local operations using the global context of the entire input.
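The following PyTorch sketch illustrates Equations (1) and (2) under these definitions; the grouped-convolution trick for applying a different aggregated kernel to each sample, and all module and parameter names, are our own illustrative choices rather than the reference CondConv implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Minimal conditionally parameterized convolution sketch (Equations (1)-(2))."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_experts=4, padding=1):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.kernel_size, self.padding = kernel_size, padding
        # Expert kernels W_i, each shaped like a standard conv weight.
        self.experts = nn.Parameter(
            0.02 * torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size))
        # Routing function (Equation (2)): GAP -> FC -> sigmoid.
        self.routing_fc = nn.Linear(in_ch, num_experts)

    def forward(self, x):
        b = x.size(0)
        # alpha_i(x): per-sample expert weights, shape (B, num_experts).
        alpha = torch.sigmoid(self.routing_fc(x.mean(dim=(2, 3))))
        # K(x) = sum_i alpha_i(x) * W_i: one aggregated kernel per sample.
        kernels = torch.einsum('bn,noihw->boihw', alpha, self.experts)
        # Fold the batch into the group dimension so each sample is
        # convolved with its own kernel in a single grouped conv call.
        x = x.reshape(1, b * self.in_ch, *x.shape[2:])
        kernels = kernels.reshape(
            b * self.out_ch, self.in_ch, self.kernel_size, self.kernel_size)
        out = F.conv2d(x, kernels, padding=self.padding, groups=b)
        return out.reshape(b, self.out_ch, *out.shape[2:])
```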
Additionally, the ReLU activation function is applied. As a classic nonlinear activation function, ReLU is particularly well suited for modeling nonlinear relationships in neural networks. Compared to Tanh, ReLU is simpler in form, faster in computation, and effectively mitigates the vanishing gradient problem and overfitting issues. Its mathematical expression and derivative are given by:
\mathrm{ReLU}(x) = \max(x, 0) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}        (3)
\mathrm{ReLU}'(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}        (4)
where x represents the input feature value, and max(x,0) denotes the maximum of x and 0. However, during training, ReLU can encounter issues. As described by Equation (4), a high learning rate can cause large gradients during backpropagation, which may excessively reduce the ReLU layer’s weights and biases, resulting in zero outputs during forward propagation and halting the normal parameter updates. This can lead to extreme data distributions. To address this issue, batch normalization (BN) is introduced. BN is a normalization technique designed to accelerate the training of deep neural networks and maintain a consistent distribution of inputs across each layer [24]. By normalizing data to a standard normal distribution before it enters the ReLU layer, BN helps generate reasonable gradients during backpropagation, reduces the sensitivity of ReLU to high learning rates, speeds up convergence, enhances feature propagation, and prevents the vanishing gradient problem.
The main functions of BN include the following:
  • Accelerating training speed: By standardizing the output to have a mean of 0 and a variance of 1, BN speeds up the convergence of the model training process.
  • Mitigating gradient vanishing: By keeping activations within a well-scaled range, BN preserves informative gradients, allowing each layer to learn more independently and facilitating faster learning across the entire network.
  • Reducing internal covariate shift: During network training, changes in parameters alter the input data distribution for subsequent layers. As the network deepens, this shift becomes more pronounced. BN normalizes the output to a standard normal distribution, thereby alleviating the “gradient vanishing” issue in deeper networks.
  • Regularization effect: By computing the mean and variance of the data across the batch, BN introduces a form of noise that regularizes the network, preventing over-reliance on specific nodes’ outputs.
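As a minimal sketch, the Conv → BN → ReLU ordering described above can be expressed as a standard PyTorch block (a generic pattern, not the paper’s exact layer configuration); BN standardizes the activations before they reach ReLU, which is what stabilizes the gradients discussed above:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size=3, padding=1):
    # BN is applied before ReLU so activations are standardized before the
    # nonlinearity, mitigating the dying-ReLU problem described in the text.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```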
In summary, CondConv enhances the model’s ability to capture intricate patterns in remote sensing images, particularly in tasks requiring fine detail extraction, such as building change detection. By integrating CondConv within the U-Net encoder of our proposed CCCUnet model, we achieve a more robust and adaptable feature extraction process, ultimately improving performance in detecting small-scale changes in complex environments.

3.2.2. Coord Attention

In change detection tasks, particularly with small buildings, existing methods often face challenges such as missed detections, false detections, and blurred edges [25]. To address these challenges, this paper incorporates the Coord Attention mechanism into the U-Net encoder. Coord Attention is designed to enhance spatial relationships and positional sensitivity while maintaining the lightweight nature of the network, making it suitable for large-scale remote sensing data processing. Developed by Hou et al. from the National University of Singapore, Coord Attention introduces a novel approach to handling spatial dependencies by encoding both coordinate and content information.
The core principle of Coord Attention lies in addressing the limitations of traditional attention mechanisms, which often fail to effectively capture positional information. Coord Attention splits the attention mechanism into two complementary processes: horizontal and vertical attention, ensuring that spatial dependencies in both directions are encoded explicitly.
The structure of the Coord Attention mechanism is illustrated in Figure 4. The mechanism performs average pooling along the horizontal and vertical dimensions to obtain two 1D vectors, which are concatenated and compressed using a 1 × 1 convolution. Batch normalization (BN) and a nonlinear activation function are then applied to encode spatial information jointly across the two directions. The encoded features are split back into the two directions and processed through separate 1 × 1 convolutions to restore the same number of channels as the input feature maps, followed by Sigmoid normalization and element-wise weighting of the input.
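A simplified PyTorch rendering of this structure is given below (the reduction ratio and layer names are illustrative assumptions; the published Coordinate Attention module differs in minor details such as the choice of nonlinearity):

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of coordinate attention following the structure in Figure 4."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Direction-aware average pooling: (B, C, H, 1) and (B, C, W, 1).
        x_h = x.mean(dim=3, keepdim=True)                      # pool over W
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # pool over H
        # Concatenate along the spatial axis and jointly encode (conv + BN + act).
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        # Split back into the two directions and form attention weights.
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w  # element-wise weighting of the input
```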
Coord Attention enhances the U-Net encoder’s ability to accurately localize small-scale changes by leveraging the coordinate-based attention mechanism. This is particularly effective for detecting subtle structural changes in complex urban environments where traditional methods might struggle with spatial resolution. The integration of Coord Attention into our proposed CCCUnet framework results in improved performance in both edge detection and small building differentiation, as demonstrated by our experimental results.

3.2.3. Channel and Spatial Attention Fusion

CGAFusion (channel and spatial attention fusion) is a feature fusion module that enhances the performance of deep neural networks. It integrates channel attention and spatial attention mechanisms to more effectively capture both the spatial and channel dependencies of image features [26]. Feature maps in deep neural networks contain rich information about the image; however, the dependencies between different channels and spatial locations may not be immediately apparent. The CGAFusion module utilizes two attention mechanisms to help the network better exploit this information, as shown in Figure 5:
  • The channel attention [27] mechanism is pivotal in evaluating the importance of each feature channel within the feature maps. It enables the network to concentrate more effectively on the features that are most relevant for the current task. The implementation of channel attention begins with global average pooling across the channels of the feature map, which aggregates the information and computes a set of importance weights. These weights are then applied to the respective channels, allowing the model to emphasize more informative features while suppressing less relevant ones.
  • The spatial attention [28] mechanism focuses on capturing dependencies between different spatial locations within the feature map. This is crucial for identifying which areas of the image contribute most significantly to the task at hand. The implementation process involves first performing global average pooling across channels to create a spatial context representation. Subsequently, convolutional layers are employed to learn importance weights for each spatial location. These weights are applied to the feature map, effectively enhancing the network’s sensitivity to spatial distributions.
Figure 5. Structure of the CGAFusion module.
By seamlessly integrating both channel and spatial attention mechanisms, the CGAFusion module effectively captures the intricate dependencies across both dimensions. This comprehensive approach improves the model’s ability to understand and manage complex scenarios, such as non-uniform fog [29] distributions in images, thereby enhancing the overall performance of the network across various visual tasks. The results demonstrate that CGAFusion significantly contributes to improved feature extraction and detail recognition, making it a critical component in the proposed CCCUnet architecture.
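To make the fusion mechanism concrete, the following sketch shows how channel and spatial attention weights can jointly gate the fusion of encoder (skip) and decoder features. This is our illustrative reading of the module: the layer names and exact topology are chosen for clarity rather than taken from the original CGAFusion implementation:

```python
import torch
import torch.nn as nn

class CGAFusionSketch(nn.Module):
    """Illustrative channel + spatial attention fusion of two feature maps."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: GAP -> bottleneck MLP -> per-channel weights.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: channel-pooled statistics -> 7x7 conv -> weight map.
        self.spatial_att = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, skip, up):
        s = skip + up
        ca = self.channel_att(s)                          # (B, C, 1, 1)
        pooled = torch.cat([s.mean(dim=1, keepdim=True),
                            s.max(dim=1, keepdim=True).values], dim=1)
        sa = self.spatial_att(pooled)                     # (B, 1, H, W)
        w = torch.sigmoid(ca + sa)                        # broadcast gate (B, C, H, W)
        # Complementary weighting of the two branches, then a 1x1 projection.
        return self.proj(skip * w + up * (1.0 - w))
```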

3.2.4. Negative Log-Likelihood Loss

In this study, the NLLLoss function is employed instead of the traditional cross-entropy loss function [30]. This loss function is typically used in conjunction with a Log-Softmax layer in the final network layer and is mainly applied to multi-class classification problems. During model training, the NLLLoss minimizes the discrepancy between the model’s predictions and the true labels, thus bringing the model’s predictions closer to the actual results. The formula for NLLLoss is as follows:
\mathrm{NLLLoss} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i}        (5)
where N denotes the number of samples in the batch, and p_{i, y_i} represents the predicted probability of the true class y_i for the i-th sample.
The NLLLoss measures the uncertainty of model predictions by computing the negative log probability of the correct class. When the model assigns a higher probability to the correct class, the NLLLoss value decreases, indicating more accurate predictions by the model. By minimizing the NLLLoss, the model’s parameters are adjusted to better fit the training data and improve classification accuracy during the training process. NLLLoss has been widely applied in many machine learning tasks, particularly in multi-class classification and probability modeling problems [31].
NLLLoss and cross-entropy loss (CELoss) are both commonly used in classification tasks in machine learning and deep learning. The key difference lies in their input requirements: NLLLoss expects log probabilities from a Log-Softmax function [32], while CELoss takes raw logits (the model’s direct outputs) and applies the softmax and log operations internally. This pairing allows NLLLoss to optimize the model by focusing on the log probability of the correct class. Compared to CELoss, NLLLoss thus offers a more explicit formulation for multi-class classification, emphasizing the negative log probability of the true class.
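This relationship is easy to verify in PyTorch: pairing Log-Softmax with NLLLoss reproduces CrossEntropyLoss exactly (the toy tensors below are purely illustrative):

```python
import torch
import torch.nn as nn

# NLLLoss expects log-probabilities, so it is paired with LogSoftmax;
# CrossEntropyLoss takes raw logits and applies log-softmax internally.
logits = torch.randn(4, 2)            # e.g., 4 pixels, 2 classes (change / no change)
targets = torch.tensor([0, 1, 1, 0])  # true class indices

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)
ce = nn.CrossEntropyLoss()(logits, targets)

assert torch.allclose(nll, ce)  # identical by construction
```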
In summary, NLLLoss is an effective and widely used loss function for classification tasks. It measures the discrepancy between the model’s predictions and the true labels by computing the negative log-likelihood value and helps optimize model parameters during training.

4. Results

4.1. WHU Change Detection Dataset

In this study, the WHU Change Detection (WHU-CD) dataset [33] was utilized as the experimental dataset, with a sample shown in Figure 6. As depicted in Figure 6, T1 represents the aerial imagery captured in 2012, and T2 represents the aerial imagery captured in 2016.
The WHU-CD dataset focuses on the building changes in Christchurch, New Zealand, during a specific historical period. Specifically, this dataset documents the aftermath of the 6.3 magnitude earthquake that occurred in February 2011 and the subsequent reconstruction efforts over the following years. The core of the dataset consists of aerial images captured in April 2012, which provide a detailed record of the post-earthquake urban environment and serve as a benchmark for comparing pre-earthquake conditions. The dataset includes 12,796 buildings across a 20.5 square kilometer area (with the number of buildings increasing to 16,077 in the 2016 dataset for the same region), with precise annotations of building locations, shapes, and quantities. The images have a spatial resolution of 0.075 m and a total size of 32,507 × 15,354 pixels, ensuring detailed and accurate data.
For model training and evaluation, we cropped the WHU-CD dataset to 256 × 256 pixel images. The training set consists of 6624 image pairs, which provide a rich set of building change cases to facilitate effective feature learning and change detection strategies.

4.2. Experimental Settings

The network model was created using the PyTorch 2.1.0 deep learning framework, with Python version 3.10. To accelerate the computation process, CUDA 12.1 was utilized as the acceleration library. In terms of hardware, the processor used was an Intel(R) Xeon(R) Platinum 8352V, with a memory capacity of 80 GB, and the GPU used was an RTX 4090D. The Adam optimizer was employed with a learning rate set to 1 × 10−4, a batch size of 2, and a total of 100 epochs for training.

4.3. Model Evaluation Criteria

To quantitatively evaluate the performance of the experiment on the remote sensing image change detection task, this study selected several metrics including precision (P) [34], recall (Re) [34], F1 score (F1) [35], intersection over union (IoU) [36], and overall accuracy (OA) [37]. Precision measures the proportion of true positive predictions out of all predicted positives, and is defined as the ratio of true positives to the sum of true positives and false positives. Recall represents the proportion of true positive predictions out of all actual positives, and is defined as the ratio of true positives to the sum of true positives and false negatives. F1 score is the harmonic mean of precision and recall, providing a balance between these two metrics. IoU evaluates the overlap between the predicted and ground-truth targets in object detection and other tasks. OA measures the proportion of correctly classified instances over the entire dataset. The formulas for these metrics are as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}        (6)
\mathrm{Recall} = \frac{TP}{TP + FN}        (7)
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}        (8)
\mathrm{IoU} = \frac{TP}{TP + FN + FP}        (9)
\mathrm{OA} = \frac{TP + TN}{TP + TN + FN + FP}        (10)
In the formula, TP (true positive) and TN (true negative) represent the correctly predicted positive and negative samples by the model, respectively. FN (false negative) indicates the number of positive samples that were incorrectly predicted as negative, and FP (false positive) refers to the number of negative samples that were incorrectly predicted as positive.
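For reference, the five metrics follow directly from the confusion-matrix counts; a small helper function (ours, for illustration only) makes the definitions explicit:

```python
def change_detection_metrics(tp, tn, fp, fn):
    """Evaluation metrics of Equations (6)-(10) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fn + fp)
    oa = (tp + tn) / (tp + tn + fn + fp)
    return {"P": precision, "Re": recall, "F1": f1, "IoU": iou, "OA": oa}
```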

4.4. Comparative Experiment Results and Analysis

4.4.1. Introduction to Comparison Methods

To validate the effectiveness of the proposed CCCUnet, we compared it with four established change detection methods: FCEF [38], SiamUnet_conc [38], SiamUnet_diff [38], and the original U-Net network. All methods were trained and tested under identical conditions to ensure a fair comparison.
Three of the compared methods are foundational fully convolutional change detection networks: FCEF employs an early fusion strategy, while SiamUnet_conc and SiamUnet_diff use late fusion strategies. The primary difference between the latter two lies in their feature fusion techniques: SiamUnet_conc employs concatenation, while SiamUnet_diff uses a difference operation.

4.4.2. Performance Evaluation

Table 1 presents the accuracy results of each change detection method on the WHU-CD dataset under the experimental conditions, with the bolded values indicating the highest score in each column. As illustrated in Table 1, CCCUnet demonstrated superior performance on most evaluation metrics. Specifically, CCCUnet achieved the highest scores on all metrics except recall, where it performed slightly below the other methods.

4.4.3. Detailed Comparison of Change Detection Results

Figure 7 provides a detailed comparison of the different models’ performance in building change detection on the WHU dataset. In the figure, white areas represent buildings that have undergone changes, while black areas represent regions with no changes. Comparing the results of the proposed model with the ground-truth labels and the other models reveals certain issues. For instance, as indicated by the first row, some roads were mistakenly detected as building changes, and these false detections are clearly visible in the output images. Additionally, as shown in the second row, the proposed model outperforms the others in terms of edge clarity, handling detailed information better. Furthermore, the third row highlights the limitations of the other models in detecting changes in some small buildings, failing to accurately identify certain changes in these structures.
To improve the detection performance, this study explored model optimization. First, the CGAFusion module was introduced into the skip connections of the U-Net model to enhance the model’s sensitivity to changes in small buildings. Experimental results demonstrated a significant improvement in the model’s ability to detect small building changes, successfully identifying changes that were previously missed. However, despite these improvements, the model still faces challenges such as irregular target shapes and the incomplete detection of certain changes.
To further enhance the detection performance, the Coord Attention mechanism was introduced. This mechanism helps the model more accurately delineate building contours and reduces false positives and missed detections. Experimental results showed that the addition of Coord Attention improved the model’s handling of boundaries and edges, not only reducing false detections but also producing more regular building shapes, thereby narrowing the gap between detected changes and ground truth labels. This improvement not only increases the model’s detection accuracy but also provides more reliable data for subsequent building change analysis and applications.

4.5. Ablation Study Results and Analysis

4.5.1. Experimental Design and Setup

To evaluate the effectiveness of the proposed modules, we designed a series of ablation experiments based on the baseline model. In these experiments, we systematically added each proposed module to the baseline model and assessed the impact of each module on the model’s performance. The results of these experiments are summarized in Table 2. The bolded data represents the highest value in the experimental group.
In our analysis, we observed that, among all the model configurations tested, the configuration excluding all three proposed modules (i.e., the baseline) exhibited the lowest accuracy. This observation clearly indicates that incorporating these modules significantly improves the model’s predictive performance.

4.5.2. Analysis of the CGAFusion Module

Among the ablation studies, excluding the CGAFusion module had a significant impact on the model’s overall performance. The model’s F1 score dropped to 93.79%, and its IoU fell to 80.61%, indicating that this module plays a crucial role in accurately detecting building changes, particularly in refining the spatial details. The CGAFusion module’s ability to combine spatial and channel attention enables the model to focus on both the global context and local features, making it highly effective in addressing complex scenarios like small building changes or irregular building edges. Its absence shows that the model becomes less efficient in fine-grained detection tasks, emphasizing the importance of this module for precise change detection.

4.5.3. Analysis of CoordAttention Module

The ablation study of the CoordAttention module reveals its importance in the overall performance of CCCUnet. When this module was removed, the model achieved the highest precision at 97.03%, indicating that CoordAttention might not be strictly necessary for optimizing precision in certain cases. However, the model’s Recall and F1 score decreased, suggesting that CoordAttention improves the model’s ability to locate building changes comprehensively. Specifically, the IoU metric dropped to 77.56%, demonstrating that, while precision benefits from the removal of CoordAttention, the overall accuracy of detecting changes across the image suffers. Therefore, CoordAttention is crucial for detecting complex contours and boundaries, particularly for irregularly shaped buildings, which aligns with its design of focusing on both spatial and directional information.

4.5.4. Analysis of CondConv Module

In the case of the CondConv module, removing it resulted in the highest recall value of 95.75%, suggesting that the model without CondConv is particularly effective in detecting most building changes. However, the F1 score and IoU metrics decreased compared to the full CCCUnet model, highlighting that CondConv improves the model’s ability to balance between precision and recall. The lower IoU (79.15%) in the ablation study without CondConv indicates that this module plays an essential role in the overall segmentation quality, especially when dealing with small or subtle building changes. CondConv enhances the model’s adaptability by allowing the convolution operations to conditionally adjust to the data, making the network more flexible and capable of handling variations in building shapes and sizes.

4.5.5. Combined Analysis of CoordAttention, CondConv, and CGAFusion Modules

When two or more of the advanced modules were removed, the model’s performance showed varying degrees of degradation. For instance, the combination of excluding CoordAttention and CGAFusion resulted in an F1 score of 95.79%, but the IoU dropped to 77.73%, showing that these modules collectively contribute to the model’s ability to detect fine details and edges. Similarly, the removal of CondConv and CGAFusion resulted in an F1 score of 96.35% and IoU of 80.23%, reflecting a compromise in performance. Finally, excluding CondConv and CoordAttention led to an IoU of 80.42%, showing the robustness of the remaining CGAFusion module but also highlighting how the interaction between the modules results in optimal detection.

4.5.6. Overall Performance Improvement with CCCUnet

The full integration of CondConv, CoordAttention, and CGAFusion modules into the CCCUnet architecture has resulted in significant improvements in model performance. The final CCCUnet model achieved an F1 score of 93.98%, an IoU of 80.97%, and an overall accuracy (OA) of 96.99%, outperforming the baseline U-Net and other ablation models. These improvements demonstrate the complementary nature of these modules, each enhancing different aspects of the model’s performance. CondConv enables better adaptation to local changes, CoordAttention improves edge detection and spatial awareness, and CGAFusion strengthens the model’s ability to process complex multi-scale features.
This combination not only leads to enhanced detection accuracy but also improves the model’s ability to capture subtle changes in small buildings, refine contours, and minimize errors in edge detection. These enhancements validate the effectiveness of CCCUnet in urban planning, land use analysis, and disaster monitoring tasks, where accurate building change detection is crucial.

5. Discussion

As urbanization accelerates, building change detection becomes increasingly critical for effective urban planning and land use management. CCCUnet, as an innovative tool for building change detection, not only demonstrates significant advantages in detection accuracy but also effectively addresses changes in complex environments. Below is an in-depth analysis of CCCUnet’s potential and advantages in three key application scenarios: urban planning, land use management, and disaster monitoring.
In urban planning, accurate building change detection is essential for developing rational policies and strategies. CCCUnet provides data support for urban planners, helping them identify newly constructed buildings. This capability enables planners to gain a better understanding of the dynamics of urban development, facilitating more informed decision making. Furthermore, CCCUnet’s edge refinement capabilities ensure high-quality detection results, giving planners clearer insights when analyzing building outlines and spatial layouts.
Land use management is vital for achieving sustainable development. CCCUnet effectively monitors changes between different land use types, such as urban expansion. By accurately analyzing remote sensing images, CCCUnet assists land managers in timely identifying potential environmental issues and taking appropriate measures. For instance, in land development projects, CCCUnet can quickly assess the impacts of development on the surrounding environment, providing a scientific basis for decision making. Additionally, CCCUnet’s robustness allows it to deliver reliable change detection results even under variable environmental conditions.
CCCUnet also excels in disaster monitoring. Natural disasters like earthquakes, floods, and fires often cause sudden changes in buildings and infrastructure. CCCUnet’s rapid detection capabilities enable it to assess damage immediately after disasters, assisting relevant agencies in timely rescue and recovery efforts. For example, in flood events, CCCUnet can swiftly identify submerged buildings, providing crucial data for post-disaster assessments and recovery operations. Furthermore, CCCUnet can analyze historical data to identify high-risk areas, offering valuable insights for future disaster prevention and management.
A comparative analysis of the experimental results reveals the strengths and weaknesses of the various change detection methods. CCCUnet stands out by excelling across all evaluation metrics, particularly in detecting changes in small buildings and capturing fine details. The addition of CondConv and Coord Attention modules has significantly enhanced CCCUnet’s accuracy and stability, positioning it as the most effective solution for change detection tasks.
To more clearly illustrate the strengths and limitations of each model, they are presented in Table 3.
In summary, CCCUnet is not only a technological innovation but also a crucial tool for advancing urban planning, land use management, and disaster monitoring. Through its application in these domains, CCCUnet demonstrates significant advantages in improving building change detection accuracy and addressing challenges in complex environments.

6. Conclusions

This study proposes a novel U-Net-based architecture for dynamic conditional feature fusion, termed CCCUnet. This innovative architecture is specifically designed to address a range of challenging issues encountered in building extraction from remote sensing images. These challenges include the semantic richness of remote sensing data, the blurring of building edges caused by non-uniform fog distribution, the loss of critical details, and a reduction in overall image resolution.
Although CCCUnet does not introduce a completely new architecture, it demonstrates significant advancements in this field through the thoughtful integration of existing components. First, CCCUnet leverages the CondConv (conditional convolution) technique, which significantly enhances the model capacity and performance of convolutional neural networks (CNNs) through the dynamic adjustment of convolutional kernel parameters. This technique allows the network to adapt more effectively to the complexities of remote sensing image environments, thus improving both the accuracy and efficiency of building extraction.
Additionally, CCCUnet introduces the NLLLoss (Negative Log Likelihood Loss) function to optimize the model’s parameters. This loss function offers a more precise measure of the discrepancy between the predicted and true values, effectively reflecting the model’s performance in building extraction tasks and guiding the model towards more effective learning.
To enhance the detection capability of small building contours, CCCUnet incorporates the Coord Attention mechanism. This mechanism, along with the CGAFusion module, innovatively combines spatial and channel-wise attention to address the challenges of detecting small and complex buildings that are often overlooked by other models. The result is improved performance in refining edges and detecting complex contours, especially in urban settings.
Furthermore, by applying these techniques to a real dataset like WHU-CD, CCCUnet provides insights into practical applications, demonstrating its ability to reduce false detections and improve edge clarity. These improvements have important implications for urban monitoring and geographic information systems.
Finally, to further improve the overall performance of the network, CCCUnet employs the CGAFusion (channel and spatial attention fusion) module. This module dynamically fuses feature information from different levels, effectively integrating multi-scale information to enhance the precision of building extraction and refine edge details.
Through a series of experimental validations, CCCUnet has demonstrated significant advantages in building extraction accuracy, edge refinement, and complex contour detection. Extensive experimentation highlights how CCCUnet excels at detecting small building changes, which are often difficult for traditional methods. This focus on small building detection adds practical value for applications in urban planning and land use analysis. Compared to traditional building extraction methods, CCCUnet not only achieves more accurate building extraction from remote sensing images but also better preserves edge details and captures complex building contours. Moreover, this method exhibits strong robustness and effectively addresses complex environmental factors such as non-uniform fog interference. Therefore, CCCUnet holds significant implications for practical applications in urban planning, land use, and disaster monitoring.
Additionally, this study offers a comprehensive analysis of error cases and detailed comparisons with other models. This level of analysis is crucial in understanding the specific strengths and weaknesses of CCCUnet, providing valuable insight into why the proposed modifications work effectively.
However, we acknowledge that this study does not include a comprehensive comparison with more recent state-of-the-art (SOTA) models. Due to time constraints, we were unable to conduct this analysis in the current study. Nevertheless, we plan to address this limitation in future work by incorporating and benchmarking CCCUnet against the latest SOTA models. This will allow us to assess how our approach compares to the most recent advancements in the field, further enhancing the practical relevance and impact of this work.
In conclusion, while CCCUnet does not introduce an entirely new model, it delivers meaningful methodological refinements, performance improvements, and practical contributions that are valuable to both the research community and practitioners in the field of building change detection.

Author Contributions

Conceptualization, C.R.; methodology, Y.G.; validation, Q.C.; investigation, H.B.; resources, Z.H.; data curation, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 42064003). The authors are affiliated with Guilin University of Technology.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

Our research on the CCCUnet architecture for building change detection contributes to sustainable development in several concrete ways:
(1) Improved urban planning: by accurately detecting changes in urban structures from remote sensing images, our research provides valuable data for urban planners, supporting better decisions on land use, infrastructure development, and resource allocation and promoting efficient, sustainable urban growth.
(2) Disaster monitoring and management: the ability to monitor building changes in real time aids disaster response efforts. Our model can help identify damaged structures following natural disasters, enabling timely interventions and resource distribution and thereby enhancing community resilience and recovery.
(3) Resource efficiency: by enhancing the accuracy of building change detection, our research reduces the need for extensive ground surveys, saving time and resources and minimizing the environmental impact of traditional data collection methods, in line with sustainable practices.
(4) Data for policy development: the insights gained from our research can inform policymakers about urban dynamics, enabling regulations and policies that promote sustainable land use and environmental protection.
(5) Integration of advanced technologies: our work incorporates advanced methodologies, such as CondConv and Coord Attention, which can set a precedent for future research and encourage the adoption of cutting-edge techniques in geoinformatics.
(6) Facilitating smart cities: the outcomes of our research can support smart city initiatives by providing essential data for monitoring urban environments, contributing to more livable, efficient, and sustainable urban spaces.
(7) Community engagement: improved change detection can help residents visualize urban transformations and participate in discussions about development projects, fostering a sense of ownership and responsibility toward sustainable practices.
By addressing these areas, our research not only advances the field of remote sensing and geoinformatics but also supports the broader goals of sustainable development.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, C.; Wang, S.; Chen, X.; Li, J.; Xie, T. Object-level change detection of multi-sensor optical remote sensing images combined with UNet++ and multi-level difference module. Acta Geod. Cartogr. Sin. 2023, 52, 283. [Google Scholar]
  2. Liu, S.C.; Du, K.C.; Zheng, Y.J.; Chen, J.; Du, P.J.; Tong, X.H. Remote sensing change detection technology in the Era of artificial intelligence: Inheritance, development and challenges. Natl. Remote Sens. Bull. 2023, 27, 1975–1987. [Google Scholar] [CrossRef]
  3. Jing, Z.; Guan, H.; Zang, Y.; Ni, H.; Li, D.; Yu, Y. Survey of point cloud semantic segmentation based on deep learning. J. Front. Comput. Sci. Technol. 2021, 15, 1. [Google Scholar]
  4. Abdollahi, A.; Pradhan, B. Integrating semantic edges and segmentation information for building extraction from aerial images using UNet. Mach. Learn. Appl. 2021, 6, 100194. [Google Scholar]
  5. Fan, Z.; Wang, S.; Pu, X.; Wei, H.; Liu, Y.; Sui, X.; Chen, Q. Fusion-Former: Fusion Features across Transformer and Convolution for Building Change Detection. Electronics 2023, 12, 4823. [Google Scholar] [CrossRef]
  6. Xu, S.; He, X.; Cao, X.; Hu, J. Damaged Building Detection with Improved Swin-Unet Model. Wirel. Commun. Mob. Comput. 2022, 2022, 2124949. [Google Scholar] [CrossRef]
  7. Chen, P.; Lin, J.; Zhao, Q.; Zhou, L.; Yang, T.; Huang, X.; Wu, J. ADF-Net: An Attention-Guided Dual-Branch Fusion Network for Building Change Detection near the Shanghai Metro Line Using Sequences of TerraSAR-X Images. Remote Sens. 2024, 16, 1070. [Google Scholar] [CrossRef]
  8. You, D.; Wang, S.; Wang, F.; Zhou, Y.; Wang, Z.; Wang, J.; Xiong, Y. EfficientUNet+: A Building Extraction Method for Emergency Shelters Based on Deep Learning. Remote Sens. 2022, 14, 2207. [Google Scholar] [CrossRef]
  9. Abdollahi, A.; Pradhan, B.; Alamri, A.M. An ensemble architecture of deep convolutional Segnet and Unet networks for building semantic segmentation from high-resolution aerial images. Geocarto Int. 2022, 37, 3355–3370. [Google Scholar]
  10. Wang, Y.; Jing, X.; Chen, W.; Li, H.; Xu, Y.; Zhang, Q. Geometry-informed deep learning-based structural component segmentation of post-earthquake buildings. Mech. Syst. Signal Process. 2023, 188, 110028. [Google Scholar] [CrossRef]
  11. Ahmadi, S.A.; Mohammadzadeh, A.; Yokoya, N.; Ghorbanian, A. BD-SKUNet: Selective-Kernel UNets for Building Damage Assessment in High-Resolution Satellite Images. Remote Sens. 2024, 16, 182. [Google Scholar] [CrossRef]
  12. Hu, M.; Li, J.; A, X.; Zhao, Y.; Lu, M.; Li, W. FSAU-Net: A network for extracting buildings from remote sensing imagery using feature self-attention. Int. J. Remote Sens. 2023, 44, 1643–1664. [Google Scholar] [CrossRef]
  13. Feng, W.; Guan, F.; Tu, J.; Sun, C.; Xu, W. Detection of Changes in Buildings in Remote Sensing Images via Self-Supervised Contrastive Pre-Training and Historical Geographic Information System Vector Maps. Remote Sens. 2023, 15, 5670. [Google Scholar] [CrossRef]
  14. Zhou, Y.; Wang, M.; Wang, F.; Yang, Y.; Liu, Z. Detection method of high-resolution remote sensing building area change based on improved U-Net. Glob. Geol. 2023, 42, 159–167. [Google Scholar]
  15. Gu, L.; Xu, S.; Zhu, L. Detection of building changes in remote sensing images via flows-unet. Acta Autom. Sin. 2020, 46, 1291–1300. [Google Scholar]
  16. Jin, S.; Guan, M.; Bian, Y.C.; Wang, S. Building Extraction from Remote Sensing Images Based on Improved U-Net. Laser Optoelectron. Prog. 2023, 60, 59–65. [Google Scholar]
  17. Li, X.; Huang, Y. Siam-differential feature fusion network for change detection of high-resolution images. Sci. Surv. Mapp. 2023, 48, 129–139. [Google Scholar] [CrossRef]
  18. Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
  19. Hua, Z.; Hua, Z.; Xiangcheng, Z. A Multiscale Attention-Guided UNet++ with Edge Constraint for Building Extraction from High Spatial Resolution Imagery. Appl. Sci. 2022, 12, 5960. [Google Scholar] [CrossRef]
  20. Wu, J.; Li, Z.; Cai, Y.; Liang, H.; Zhou, L.; Chen, M.; Guan, J. A Novel Tongue Coating Segmentation Method Based on Improved TransUNet. Sensors 2024, 24, 4455. [Google Scholar] [CrossRef]
  21. Zhang, Q.; Zhang, X.; Yu, H.; Lu, X.; Li, G. A water extraction method for remote sensing with light weight network model. Sci. Surv. Mapp. 2022, 47, 64–72. [Google Scholar] [CrossRef]
  22. Hu, M.; Li, J.; Yao, Y.; A, X.; Lu, M.; Li, W. SER-UNet algorithm for building extraction from high-resolution remote sensing image combined with multipath. Acta Geod. Cartogr. Sin. 2023, 52, 808–817. [Google Scholar]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  24. Zhu, H.; Zhang, Z.; Qin, Y.; Song, W.; Zhang, J. Research on the detection method of multi-type diseases on rural pavement. Sci. Surv. Mapp. 2022, 47, 170–180. [Google Scholar]
  25. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928. [Google Scholar]
  26. Lee, D.; Jang, K.; Cho, S.Y.; Lee, S.; Son, K. A Study on the Super Resolution Combining Spatial Attention and Channel Attention. Appl. Sci. 2023, 13, 3408. [Google Scholar] [CrossRef]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  29. Qi, S.; Liu, B. A multimodal fusion-based deep learning framework combined with keyframe extraction and spatial and channel attention for group emotion recognition from videos. Pattern Anal. Appl. 2023, 26, 1493–1503. [Google Scholar]
  30. Mozafari, A.S.; Gomes, H.S.; Janny, S.; Gagné, C. A new loss function for temperature scaling to have better calibrated deep networks. arXiv 2018, arXiv:1810.11586. [Google Scholar]
  31. Oh, D.; Shin, B. Improving evidential deep learning via multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 7895–7903. [Google Scholar]
  32. Kulathunga, N.; Ranasinghe, N.R.; Vrinceanu, D.; Kinsman, Z.; Huang, L.; Wang, Y. Effects of nonlinearity and network architecture on the performance of supervised neural networks. Algorithms 2021, 14, 51. [Google Scholar] [CrossRef]
  33. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  34. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
  35. Van Rijsbergen, C.J. A non-classical logic for information retrieval. Comput. J. 1986, 29, 481–485. [Google Scholar] [CrossRef]
  36. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  37. Congalton, R.G.; Green, K. Assessing the Accuracy of Remotely Sensed Data: Principles and Practices; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar]
  38. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
Figure 1. U-Net model architecture diagram.
Figure 2. Model structure.
Figure 3. Structure of the CondConv mechanism.
Figure 4. Structure of the Coord Attention mechanism.
Figure 6. WHU Change Detection dataset. Panels (a–f) show a sample of the dataset.
Figure 7. Extraction results of the proposed method and the comparison methods on the WHU Change Detection test set. (a) T1, (b) T2, (c) Label, (d) FCEF, (e) SiamUnet_conc, (f) SiamUnet_diff, (g) U-Net, (h) CCCUnet.
Table 1. Evaluation of the change detection accuracy of different methods.

Network         Pre (%)   Re (%)   F1 (%)   IOU (%)   OA (%)
FCEF            93.98     93.53    93.75    80.82     96.80
SiamUnet_conc   95.57     91.37    93.32    79.46     96.72
SiamUnet_diff   92.70     93.24    92.97    78.71     96.36
Unet            94.23     92.84    93.51    80.12     96.71
CCCUnet         96.28     91.79    93.86    80.97     96.99
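For clarity, the metrics reported in Tables 1 and 2 follow the standard confusion-matrix definitions (a restatement of common formulas, where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives):

```latex
\mathrm{Pre} = \frac{TP}{TP+FP}, \quad
\mathrm{Re} = \frac{TP}{TP+FN}, \quad
F1 = \frac{2\,\mathrm{Pre}\cdot\mathrm{Re}}{\mathrm{Pre}+\mathrm{Re}}, \quad
\mathrm{IOU} = \frac{TP}{TP+FP+FN}, \quad
\mathrm{OA} = \frac{TP+TN}{TP+TN+FP+FN}.
```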
Table 2. Comparison of the accuracy of ablation tests.

Network                    Pre (%)   Re (%)   F1 (%)   IOU (%)   OA (%)
Unet                       94.23     92.84    93.53    80.12     96.71
No CoordAtt                97.03     94.27    95.63    77.56     92.69
No CondConv                96.62     95.75    96.18    79.15     93.55
No CGAFusion               95.55     92.10    93.79    80.61     96.89
No CoordAtt + CGAFusion    96.59     95.00    95.79    77.73     92.92
No CondConv + CGAFusion    97.11     95.61    96.35    80.23     93.86
No CondConv + CoordAtt     97.10     95.72    96.41    80.42     93.94
CCCUnet                    96.28     91.79    93.98    80.97     96.99
Table 3. The strengths and limitations of each model.

FCEF (early fusion)
  Strengths: effectively consolidates multi-scale information; performs well across most metrics for change detection tasks.
  Limitations: struggles with detecting subtle changes and small buildings; early fusion may lead to over-integration, missing finer details.

SiamUnet_conc (late fusion)
  Strengths: utilizes feature learning to enhance detection accuracy; effective feature concatenation.
  Limitations: less sensitive to changes than SiamUnet_diff; limited performance in extreme and subtle change scenarios.

SiamUnet_diff (difference-based feature fusion)
  Strengths: emphasizes change regions, enhancing detection sensitivity; excels at detecting subtle changes and small buildings.
  Limitations: sensitive to background noise, leading to increased false positives in complex or subtle change environments.

U-Net (classic U-Net architecture)
  Strengths: provides robust results on standard datasets; well-established and widely used framework.
  Limitations: struggles with detecting changes in small buildings and subtle variations.

CCCUnet (CondConv + Coord Attention + CGAFusion mechanisms)
  Strengths: enhances sensitivity to minor changes; improves accuracy in detecting contours and small buildings; excels across key metrics such as F1, IOU, and OA.
  Limitations: no significant limitations noted compared with the other models.