1. Introduction
Urban expansion is currently a critical concern in developing countries such as India. The United Nations projects India’s population to reach 1.66 billion by 2050 [1]. Management flaws and excessive urbanization in developing nations have led to an increase in poverty and imbalanced growth, resulting in the emergence and expansion of slums, often referred to as “informal settlements” [1]. Sustainable Development Goal 11.1 (SDG 11.1) aims to guarantee that every individual has access to suitable, secure, and reasonably priced housing, and to improve the quality of slums by the year 2030 [2]. Several organizations, including government agencies and non-governmental organizations (NGOs), conduct slum surveys and produce slum maps to support slum improvement programs [3,4,5]. Nevertheless, conventional survey operations incur substantial expenses in manpower and time, and the acknowledged differences in judgment among the specialists and surveyors involved inevitably lead to inconsistent results [6].
Remotely sensed images have emerged as a key data source for acquiring geographical information on slums, and a growing number of researchers are using such imagery for slum mapping [7,8]. The precise identification and categorization of informal settlements from remote sensing data nonetheless poses significant difficulties [9,10,11]. Identifying distinct urban structures is difficult because their characteristics differ from those of typical land-cover classes, and the task requires the retrieval of texture and spatial characteristics [12,13,14,15]. Unlike agricultural land or other natural vegetation types, urban structures lack unique and easily distinguishable spectral signatures. On the other hand, the internal spatial features of slums, such as housing density, the size of individual dwelling units, and their structure, show potential as effective cues for identifying slums.
Conventional approaches employ texture, shape, spectral, and spatial characteristics, which are then processed using clustering or classification algorithms [16,17,18]. Recently, the use of deep learning techniques in remote sensing applications, such as slum mapping and urban analyses for building and road detection, has led to significant performance improvements [19,20,21,22]. The authors in [23] proposed a dual-resolution U-Net model to enhance the analysis of multisource data; the model captures features at different scales, leading to improved inference and enriched contextual information, with a focus on optimizing the borders of built-up areas. An encoder–decoder convolutional neural network (CNN) was designed to extract and combine multi-scale characteristics for building footprint extraction in urban analysis [24]. The authors in [25] examined a network that utilizes various features to derive hierarchical features for urban segmentation. A fully connected neural network with discrete wavelet transform features is proposed in [26]; this combination allows for the detection and analysis of multi-scale features, resulting in enhanced performance. However, the wavelet transform is only well suited to representing linear edges. The authors in [27] introduced a deep learning method that effectively distinguishes between built-up areas and background pixels in a heavily imbalanced dataset by utilizing a cross-entropy-based loss. Although deep learning methods are suitable for these applications, such architectures must handle a wide range of variations, including differences in imaging characteristics, backgrounds, and the shape, size, and appearance of urban areas in aerial images. This work proposes a network that incorporates different image characteristics, including nonlinear boundaries; such differences can be captured by multiresolution analysis techniques.
Traditionally, wavelet transforms have been widely used for multiresolution analysis [26]. However, it is well documented that they are limited to capturing directional information in only the horizontal, vertical, and diagonal directions [28,29]. The wavelet transform employs a collection of approximately isotropic basis functions that exist at all scales and positions, making it more suitable for isotropic features or features that are only slightly anisotropic. Beyond the wavelet transform, other sets of basis functions have been employed to capture properties such as alignments, elongations, edges, and curvilinear features. Advances in multiresolution analysis, such as curvelets [30], have been demonstrated to surpass these constraints. This work investigates the application of curvelet features for semantic segmentation with self-attention modules in remotely sensed image analysis, focusing on the potential of these recently developed multiresolution algorithms for identifying slums.
This research presents a novel approach that utilizes curvelet-based multiresolution analysis (MRA) features integrated into a dual self-attention (SA) network to enhance the accuracy of detecting and extracting slum regions from remotely sensed images. The ability of MRA to extract features at the various scales and orientations exhibited by different classes, including slum areas, motivated the integration of MRA features into a self-attentive framework. This framework enables the model to concentrate on and extract regions that contain significant information by assigning different levels of importance to different regions. The use of MRA features in conjunction with an attention module presents a pioneering approach to extracting multi-scale information for identifying slum areas. The main contributions of this work can be summarized as follows:
- (1)
A new framework: a dual self-attention network integrated with curvelet-based multiresolution analysis features.
- (2)
To the best of our knowledge, the proposed method achieves state-of-the-art performance in semantic segmentation for informal settlement identification on the IRS-1C and WorldView-2 datasets.
- (3)
Building on our experiments and analyses, this paper explains why integrating multiresolution features into the network improves these metrics when applied to remotely sensed images.
2. Methods
The proposed network’s architecture is depicted in Figure 1. The system is composed of three components: the feature extraction backbone, the multi-level multiresolution feature augmentation, and the dual-attention module. The feature extraction backbone extracts characteristics from a given input aerial image at various scales. The input image is decomposed using the multiresolution technique, yielding four outcomes whose bands manifest directional details of the scene. In the wavelet-based multiresolution, B1, B2, B3, and B4 represent the approximation, horizontal details, vertical details, and diagonal details of the input image; for the curvelet decomposition, these bands represent curvilinear details present in the image. The four bands (B1–B4) augment the features extracted from different levels of the backbone, resulting in a combined multi-scale feature map of the aerial image. Finally, the acquired features, together with the combined multi-scale feature map, are fed into a dual-attention module. This module consists of position and channel self-attention modules, allowing the network to focus more on specific regions of the feature map while disregarding others. The combination of a multiresolution framework and self-attention enables the flexible integration of local characteristics with their global dependencies. This study explores two distinct backbones for feature extraction: VGG-16 [31] and ResNet50 [32]. Although both backbones yield similar outcomes, this work opted for the VGG-16 backbone because of its lightweight design and reduced inference latency for practical use. Moreover, only a subset of the layers from the original VGG-16 network has been utilized, specifically the first layers up to the fourth max-pooling block; this truncated structure retains a comprehensive feature representation while prioritizing shorter response times [33]. The VGG-16 [31] network contains thirteen convolutional layers, five max-pooling layers, and three dense layers. Instead of increasing the number of hyper-parameters, it emphasizes the use of 3 × 3 convolutional filters with stride 1 and 2 × 2 max-pooling filters with stride 2. The Conv-1 layer is equipped with 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and both Conv-4 and Conv-5 have 512 filters. The details of the VGG-16 network can be found in [31].
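To make the backbone concrete, a minimal PyTorch sketch is given below. It assumes the standard torchvision layer ordering for VGG-16 rather than the authors’ exact implementation, and the tap points used to collect multi-scale features are our assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TruncatedVGG16(nn.Module):
    """VGG-16 backbone cut after the fourth max-pooling block.

    In torchvision's vgg16().features, the first four max-pool layers sit at
    indices 4, 9, 16, and 23, so slicing up to index 24 keeps the ten
    convolutional layers (64/128/256/512 filters) up to and including the
    fourth max-pool and drops Conv-5 and the dense layers.
    """

    def __init__(self):
        super().__init__()
        features = vgg16(weights=None).features
        self.backbone = nn.Sequential(*list(features.children())[:24])
        # Layer indices after which multi-scale features are tapped (assumed).
        self.tap_points = {4, 9, 16, 23}

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in self.tap_points:
                feats.append(x)  # 1/2, 1/4, 1/8, and 1/16 resolution maps
        return feats

model = TruncatedVGG16()
outputs = model(torch.randn(1, 3, 256, 256))
print([tuple(o.shape) for o in outputs])  # channel widths: 64, 128, 256, 512
```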
2.1. Multiresolution Analysis
Multiresolution analysis (MRA) refers to the process of analyzing data at multiple levels of detail or resolution.
2.1.1. Wavelet Transform
The wavelet transform is a mathematical technique that represents signals or images at multiple resolutions in terms of wavelets, which are small waves or oscillations with certain properties. Wavelets can be understood as the result of projecting a signal onto a particular set of scaling functions $\phi(t)$ and wavelet basis functions $\psi(t)$ inside a vector space; the calculated wavelet coefficients correspond to these projection values. A discrete wavelet transform is implemented using filter banks. The basis functions are represented using the dilation Equations (1) and (2), as described in [34]:

$$\phi(t) = \sqrt{2}\, \sum_{n} h[n]\, \phi(2t - n), \qquad (1)$$

$$\psi(t) = \sqrt{2}\, \sum_{n} g[n]\, \phi(2t - n). \qquad (2)$$

The low-pass filter coefficients are represented by $h[\cdot]$ and the high-pass filter coefficients are represented by $g[\cdot]$ in the filter bank. The dilated and translated family $\psi_{m,n}(t) = 2^{m/2}\, \psi(2^m t - n)$ is indexed by the scaling index $m$ and the translating index $n$.
The wavelet-based MRA method is highly efficient in handling one- and two-dimensional signals that have linear discontinuities. The wavelet transform decomposes the image into high-pass and low-pass filter bands, allowing for the extraction of directional features that encompass horizontal, vertical, and diagonal elements. Nevertheless, these three linear directions are restrictive and may not adequately capture the necessary directional information in remotely sensed images.
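As a concrete illustration of the four-band decomposition described above, here is a minimal sketch using the PyWavelets library (our choice; the paper does not name an implementation). A single-level 2-D DWT yields exactly the bands B1–B4: approximation, horizontal, vertical, and diagonal details.

```python
import numpy as np
import pywt

# Hypothetical single-band 256 x 256 image patch.
image = np.random.rand(256, 256).astype(np.float32)

# Single-level 2-D DWT; the 'db2' low- and high-pass coefficients play the
# roles of h[.] and g[.] in the dilation Equations (1) and (2).
B1, (B2, B3, B4) = pywt.dwt2(image, "db2")

# B1: approximation, B2: horizontal, B3: vertical, B4: diagonal details.
for name, band in zip(("B1", "B2", "B3", "B4"), (B1, B2, B3, B4)):
    print(name, band.shape)
```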
2.1.2. Curvelet Transform
The curvelet transform is a mathematical operation used to analyze and represent data in a way that captures and highlights curved features. It operates at several scales and directions using basis functions that are curved like wedges. Basis functions are mathematical functions that decompose the input signal and express it in the transform domain. Wavelet basis functions are isotropic, meaning they have the same properties in all directions. As a result, a high number of coefficients are needed to accurately describe curve singularities [28]. The curvelets encompass the complete frequency space, with basis functions that are formed by grouping wavelet basis functions into linear structures at various scales and orientations. This enables them to effectively capture curvilinear discontinuities, as shown in Figure 2.
Curvelets divide the frequency spectrum into dyadic scales $2^j$ (where $j$ is an integer representing the scale) and further divide them into angular wedges that exhibit a parabolic aspect ratio. The curvelet transform operates in two dimensions, utilizing the spatial variable $x$, the frequency-domain variable $\omega$, and the polar coordinates $r$ and $\theta$ in the frequency domain. The definition involves two windows, namely the radial window $W(r)$ and the angular window $V(t)$, which are characterized by Equations (3)–(6). Both windows are smooth, nonnegative, and real-valued, and satisfy the admissibility conditions

$$\sum_{j=-\infty}^{\infty} W^2(2^j r) = 1, \quad r \in (3/4,\, 3/2), \qquad (3)$$

$$\sum_{l=-\infty}^{\infty} V^2(t - l) = 1, \quad t \in (-1/2,\, 1/2), \qquad (4)$$

where each window is constructed from a smooth function $\nu$ satisfying the following:

$$\nu(x) = 0 \ \text{for} \ x \le 0, \qquad \nu(x) = 1 \ \text{for} \ x \ge 1, \qquad \nu(x) + \nu(1 - x) = 1. \qquad (5)$$

The polar “wedge” denoted as $U_j$ is supported by the radial window $W(r)$ and the angular window $V(t)$. In the frequency domain, these wedges are defined in [30] as follows:

$$U_j(r, \theta) = 2^{-3j/4}\, W(2^{-j} r)\, V\!\left(\frac{2^{\lfloor j/2 \rfloor}\, \theta}{2\pi}\right). \qquad (6)$$

Here, $W(\cdot)$ represents the radial window, $V(\cdot)$ represents the angular window, $r$ represents the radius, $\theta$ represents the angle of orientation, and $j$ represents the scale. This article utilizes the wrapping-based fast discrete curvelet transform [35].
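For intuition, the continuous-domain wedge of Equation (6) can be evaluated on a frequency grid in NumPy. The sketch below uses Meyer-type windows built from the standard polynomial choice of the smooth function ν; this particular construction is our assumption (one common choice satisfying Equation (5)), not necessarily the windows used in [35].

```python
import numpy as np

def nu(x):
    """Smooth step satisfying Eq. (5): nu(x) + nu(1 - x) = 1 on [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return x**4 * (35 - 84 * x + 70 * x**2 - 20 * x**3)

def V(t):
    """Angular window: 1 on |t| <= 1/3, decays smoothly to 0 by |t| = 2/3."""
    t = np.abs(t)
    out = np.zeros_like(t)
    out[t <= 1 / 3] = 1.0
    mid = (t > 1 / 3) & (t <= 2 / 3)
    out[mid] = np.cos(np.pi / 2 * nu(3 * t[mid] - 1))
    return out

def W(r):
    """Radial window supported on [2/3, 5/3], satisfying Eq. (3)."""
    out = np.zeros_like(r)
    out[(r >= 5 / 6) & (r <= 4 / 3)] = 1.0
    lo = (r >= 2 / 3) & (r < 5 / 6)
    out[lo] = np.cos(np.pi / 2 * nu(5 - 6 * r[lo]))
    hi = (r > 4 / 3) & (r <= 5 / 3)
    out[hi] = np.cos(np.pi / 2 * nu(3 * r[hi] - 4))
    return out

def wedge_Uj(n, j, theta0=0.0):
    """Evaluate the polar wedge U_j of Eq. (6) on an n x n frequency grid."""
    w = np.linspace(-8, 8, n)
    WX, WY = np.meshgrid(w, w)
    r = np.hypot(WX, WY)
    theta = np.arctan2(WY, WX) - theta0
    return (2.0 ** (-3 * j / 4) * W(2.0 ** (-j) * r)
            * V(2.0 ** np.floor(j / 2) * theta / (2 * np.pi)))

U = wedge_Uj(256, j=2)  # one scale-2^2 wedge at orientation theta0 = 0
print(U.shape, float(U.max()))
```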
The input image is decomposed into one approximation band (B1) and three detailed sub-bands (B2–B4) by the wavelet and curvelet MRA. The three detail sub-bands are combined to enhance the multi-scale feature maps, providing the network with additional details about singularities. The approximation band is used as input for the next level of MRA decomposition, and multiple levels of wavelet and curvelet decomposition are applied to augment all of the acquired multi-scale feature maps. The feature maps are combined and convolved to generate a unified multi-scale feature map, enabling the efficient capture of both global and local contextual directional information.
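A minimal sketch of this multi-level scheme is shown below (PyWavelets and SciPy are used for illustration; fusing the detail bands by summing their magnitudes and stacking levels after resizing is our simplification of the paper’s combine-and-convolve step).

```python
import numpy as np
import pywt
from scipy.ndimage import zoom

def multilevel_mra_features(image: np.ndarray, levels: int = 3) -> np.ndarray:
    """Level-by-level wavelet MRA: at each level the detail bands (B2-B4)
    are fused, and the approximation band (B1) feeds the next level."""
    fused_details = []
    approx = image
    for _ in range(levels):
        approx, (h, v, d) = pywt.dwt2(approx, "db2")
        fused_details.append(np.abs(h) + np.abs(v) + np.abs(d))
    # Resize every fused detail map to the finest level's size and stack them
    # into a unified multi-scale feature map.
    H, W = fused_details[0].shape
    stacked = np.stack(
        [zoom(f, (H / f.shape[0], W / f.shape[1]), order=1) for f in fused_details]
    )
    return stacked  # shape: (levels, ~H/2, ~W/2)

features = multilevel_mra_features(np.random.rand(256, 256), levels=3)
print(features.shape)
```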
2.2. Dual Attention
This work utilizes a self-attention framework that draws inspiration from the dual-attention network proposed in [36]. Two synchronous attention modules, namely the position-attention module and the channel-attention module, are used to efficiently capture the interconnectedness of features in both the spatial and channel dimensions. This structure facilitates the mutual enhancement of any two positions with similar characteristics, whether within the same channel or across several channels, in response to a given trait. The multiresolution framework captures local as well as global contextual information owing to the nature of the decomposed sub-bands of the input image. The attention modules streamline the extracted features by prioritizing, through their weighting scheme, elements that are essential for enhancing the segmentation performance. The attention module can be understood as a process that both removes noise and improves the quality of the feature space by transforming it into a domain where spatial and channel features are not uniformly weighted. Furthermore, because the receptive fields of conventional fully convolutional networks are reduced to a local scale, such networks lack the ability to represent intricate long-range contextual information derived from a diverse combination of local and global properties. The dual-attention technique effectively addresses these issues by utilizing position- and channel-attention modules. Figure 3 provides a comprehensive diagram of the arrangement of the position- and channel-attention modules, showcasing the various matrix linkages.
2.2.1. Position-Attention Module
In order to capture the long-range dependencies inside each channel and to identify the most significant elements in a spatial context, we employ the position-attention module. By subjecting an input feature map $I \in \mathbb{R}^{C \times H \times W}$ to two distinct adaptive convolutional layers, two additional feature maps $I_1$ and $I_2$ are obtained. In order to generate the spatial-attention matrix $S$, $I_1$ and $I_2$ are reshaped into two-dimensional matrices of shape $C \times N$, where $N = H \times W$, and the matrices are then multiplied and normalized as $S = \mathrm{softmax}(I_1^{\top} I_2)$ to produce the spatial-attention matrix $S$ of shape $N \times N$.

The spatial-attention matrix is built adaptively from the input multi-resolution feature map. It reflects the correlation between distinct regions within a channel map. After undergoing an additional adaptive convolutional layer, the input feature map $I$ is multiplied by $S$ and then reshaped back to its original form $C \times H \times W$. The spatial-attention matrix exhibits distinct responses to various regions of the feature map, hence enhancing focus on certain areas of the image. Thus, we have the capability to selectively combine characteristics from various resolutions within a broader context.
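A minimal PyTorch sketch of this position-attention computation, following the general DANet formulation of [36] (the reduced query/key width and the learnable fusion weight γ are assumptions):

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Spatial self-attention: S = softmax(I1^T I2); output re-weights I by S."""

    def __init__(self, channels: int):
        super().__init__()
        inner = max(channels // 8, 1)  # reduced width for I1/I2 (assumed)
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        i1 = self.query(x).view(b, -1, n)               # B x C' x N
        i2 = self.key(x).view(b, -1, n)                 # B x C' x N
        s = torch.softmax(i1.transpose(1, 2) @ i2, -1)  # B x N x N attention
        v = self.value(x).view(b, c, n)                 # B x C x N
        out = (v @ s.transpose(1, 2)).view(b, c, h, w)  # re-weighted features
        return self.gamma * out + x
```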
2.2.2. The Channel-Attention Module
Each input channel in the attention module is associated with a class-specific response at a certain resolution and in the combined multi-scale feature. Identifying the connections within and between these feature maps is crucial, since it significantly enhances the semantic representation by minimizing misclassifications. The computation of channel attention is analogous to the position-attention module, with the exception that it involves calculating the channel-attention matrix $X \in \mathbb{R}^{C \times C}$. This matrix is computed along the channel dimension to capture long-range dependencies among the feature maps. The channel-attention matrix is derived from the same reshaped feature maps ($I_1$, $I_2$) as $X = \mathrm{softmax}(I_1 I_2^{\top})$.
The spatial- ($P$) and channel-attention ($Q$) modules produce outputs that are fused element-wise. This combined output is then passed through a softmax layer to generate the overall output of the attention block, represented as $A_i$ (the $i$-th attention block).
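Correspondingly, a sketch of the channel-attention branch and the element-wise fusion of the two outputs P and Q (the trailing softmax follows the text; all layer details are assumptions), reusing the PositionAttention module from the previous sketch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel self-attention: X = softmax(I1 I2^T), a C x C affinity matrix."""

    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                                # B x C x N
        attn = torch.softmax(flat @ flat.transpose(1, 2), -1)  # B x C x C
        out = (attn @ flat).view(b, c, h, w)
        return self.gamma * out + x

class DualAttentionBlock(nn.Module):
    """Element-wise fusion of position (P) and channel (Q) branch outputs."""

    def __init__(self, channels: int = 4):  # four channels per block (Sec. 2.2)
        super().__init__()
        self.pam = PositionAttention(channels)  # from the previous sketch
        self.cam = ChannelAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.pam(x) + self.cam(x)   # element-wise fusion of P and Q
        return torch.softmax(fused, dim=1)  # attention-block output A_i
```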
The combined joint multi-scale feature map, as explained earlier, is integrated with the feature maps produced at other scales through up-sampling. These combined feature maps are then fed into the dual-attention network, where each scale has its own dedicated attention block. Each attention block processes feature maps that are half the size of the original image and have only four feature channels. Each class has its own feature map at a certain resolution, as well as a joint multi-scale feature map. The choice of four channels per attention block reflects the fact that each pair of channels encodes a class-specific response at a specific scale, given the joint multi-scale nature of the problem. Since this is a single-class identification problem, four channels were deemed sufficient to leverage the interdependencies among channel spaces and enhance the feature representation of specific semantics.
The output of each attention block $A_i$ at the various scales is combined using element-wise summation to create the joint-scale attention map $A$. This map enhances the feature representation, leading to more precise outcomes. The joint-scale attention map is calculated as follows:

$$A = \sum_{i} A_i.$$
The resulting feature map is processed through an up-sampling and pixel-wise deconvolution block to produce the final semantically segmented output, which has the same spatial resolution as the input data. The output consists of two channels, one reflecting the response for each class.
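A sketch of this final head (sizes assumed from the text: four input channels at half resolution, two output channels):

```python
import torch.nn as nn

# Up-sampling plus pixel-wise deconvolution head: a stride-2 transposed
# convolution restores the input resolution and a 1 x 1 convolution produces
# the two per-class response channels.
segmentation_head = nn.Sequential(
    nn.ConvTranspose2d(4, 4, kernel_size=4, stride=2, padding=1),  # x2 upsample
    nn.Conv2d(4, 2, kernel_size=1),  # channel 0: background, channel 1: slum
)
```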
2.3. Progressive Generation
A progressively growing network architecture has been used for training. The finest-resolution aerial image $I_0 \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are the image dimensions, was downscaled to the coarsest-resolution image $I_N \in \mathbb{R}^{(H/2^N) \times (W/2^N)}$, where $N$ is the number of iterations. The network underwent training beginning at the coarsest resolution and progressively increasing to the finest possible resolution. First, a segmentation mask is constructed at a lower resolution; then, the weights learned at this resolution are used to compute the next-highest resolution and update the weights. This procedure leads to the successive refinement of the segmentation mask. The transition from a lower to a higher level of detail at a given iteration was accomplished in accordance with the rule outlined in [37]:
$$S_{k+1} = \mathrm{upscale}(S_k) \quad \text{once} \quad \mathcal{L} < \epsilon,$$

where the $\mathrm{upscale}(\cdot)$ function implements bilinear interpolation with a scale factor of two, $\mathcal{L}$ is the network supervision signal (as detailed in the following section), and $\epsilon$ is an empirically determined threshold. This allows for the iterative improvement of the segmentation mask, resulting in homogeneous semantics.
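A schematic of this coarse-to-fine schedule is sketched below; the transition rule is our reading of the text, and `model`, `loss_fn`, and the data pipeline are hypothetical placeholders.

```python
import torch.nn.functional as F

def progressive_training(model, loader, optimizer, loss_fn,
                         n_levels: int = 3, eps: float = 0.05, epochs: int = 10):
    """Train from the coarsest (H/2^N x W/2^N) to the finest (H x W) resolution,
    doubling the working resolution once the supervision signal drops below eps."""
    scale = 2 ** n_levels
    for _ in range(epochs):
        for image, mask in loader:
            # Down-scale the input and target to the current working resolution.
            img = F.interpolate(image, scale_factor=1.0 / scale, mode="bilinear")
            msk = F.interpolate(mask, scale_factor=1.0 / scale, mode="nearest")
            loss = loss_fn(model(img), msk)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Move to the next-finer level (bilinear up-scaling by two) once
            # the supervision signal falls below the empirical threshold.
            if scale > 1 and loss.item() < eps:
                scale //= 2
    return model
```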
2.4. Loss Function
Typically, semantic segmentation tasks involving remote sensing images exhibit a significant imbalance in the class distribution. Therefore, to address the imbalance of classes in the training data, the network uses focal loss as the supervision signal instead of the standard cross-entropy loss [38].

The focal loss incorporates a modulating term into the usual cross-entropy loss, enabling the supervision signal to place greater emphasis on regions that are misclassified and challenging to segment due to the underlying characteristics of the scene. The modulating term gradually tends to zero as the confidence in the segmentation improves, at which point the focal loss converges to the conventional cross-entropy loss. The focal loss is mathematically represented as follows:

$$FL(p_t) = -\alpha\, (1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise,} \end{cases}$$

where $(1 - p_t)^{\gamma}$ is the modulating factor with focusing parameter $\gamma$, $\alpha$ is the weighting parameter that accounts for class imbalance, $p$ is the predicted probability, and $y$ is the ground-truth class.
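A minimal PyTorch implementation of this binary focal loss (α = 0.25 and γ = 2 are the illustrative defaults from [38], not values tuned in this paper):

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probabilities in (0, 1); y: ground-truth labels in {0, 1}.
    With gamma = 0 the modulating factor is 1 and the loss reduces to a
    weighted cross-entropy.
    """
    y = y.float()
    p = p.clamp(1e-6, 1 - 1e-6)           # numerical stability
    p_t = y * p + (1 - y) * (1 - p)       # probability of the true class
    alpha_t = y * alpha + (1 - y) * (1 - alpha)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()
```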
4. Discussion
In this work, the effect of integrating wavelet- and curvelet-based MRA features into a semantic segmentation architecture trained for informal settlement identification on two datasets was studied quantitatively and qualitatively. High values of the overall accuracy, mIoU score, F-score, precision, and recall metrics were observed, with some variability in performance. The improvements in the quantitative measurements highlight the effectiveness and practicality of combining a multi-resolution framework with self-attention to enhance the accuracy of semantic segmentation in identifying informal settlements. The trade-off between precision and recall in the binary semantic segmentation of slum areas can be understood as follows: a higher recall signifies a model better able to identify a large number of slum-area pixels, even at the cost of misclassifying some background pixels as slum and hence reducing the precision score, as observed in Table 2.
Figure 5i demonstrates the enhanced segmentation outcomes of the proposed network in terms of its capacity to extract distinct and realistic slum masks. In contrast, the U-Net produces slum footprints that are irregular, shapeless blobs. The proposed network accurately extracts both straight-edged areas and curve-shaped regions with high structural precision, whereas ResNet, for example, produces noisy footprints that lack consistent semantics. This improvement arises because the curvelet coefficients efficiently capture curved boundaries and represent variously oriented edges, owing to the anisotropic properties of their basis functions. U-Net and ResUnet fail to separate individual footprints and instead merge all of the built-up masks into a single large mass, a characteristic that is unsuitable for several applications related to urban planning. In Figure 6i, the proposed method effectively detects and extracts very narrow slum structures while minimizing misclassifications, as is clearly seen when compared with the segmentation results of ResUnet.
To further assess the results, a visual comparison for qualitative analysis was carried out. Comparing the input images, ground-truth reference masks, and segmented outcomes shows that, in general, the slum representations obtained with the proposed method are a clear improvement over the other models. In Figure 5 and Figure 6, misclassified portions are marked with red circles, areas encircled in blue show irregular boundaries in the result, and areas encircled in green show proper, regular boundary detection. The wavelet-based approach and the other models are found to be unsuitable in terms of the orientation and anisotropic properties of the classes, as illustrated in Figure 5d,f,g and Figure 6c,e,f,h; the curvelet features, however, efficiently capture the curved edges, as shown in Figure 5i and Figure 6i. With the proposed method, the slum areas do not appear intermixed with partially built-up areas and other classes, as is observed with the other models. The accuracy of the boundary shape and edge continuity is observed to be better, especially in the middle-left portion of Figure 5. Similarly, a deviation from the reference mask (Figure 6b) boundaries can be observed in the bottom-right part of the image (Figure 6c–h).
In the ablation investigation, we employed the single-scale VGG-16 network, with a structure closely resembling the feature extraction backbone of the proposed method, as the baseline. We then showcase the efficacy of the distinct elements of the proposed approach by gradually integrating each module. The comparative findings are displayed in Table 3. By default, all trials in the ablation employ progressive generation training, unless specifically stated otherwise.
Table 3 clearly demonstrates that incorporating the multiresolution features and the dual-attention mechanism leads to enhancements of 14% and 10%, respectively, compared to the baseline. Moreover, the effectiveness of the attention mechanism in enhancing the semantic segmentation outcomes is observed, which can be attributed to its capacity to progressively improve the segmentation mask, leading to consistent and uniform semantics.
Figure 7 displays the segmentation outcome, illustrating the effectiveness of exclusively utilizing MRA features in the proposed network. The MRA framework is valuable because it accurately represents informal settlement footprints and effectively detects and extracts intricate structures and informal settlements of different shapes and sizes. The dual-attention process aids in capturing the essential long-range relationships needed for accurate segmentation outcomes, which is especially valuable in cluttered scenes. The attention mechanism also makes the network insensitive to specific image acquisition properties, such as spatial resolution, resulting in consistent and uniform semantics.
Table 4 describes the epoch sensitivity of the VGG-16 network. An analysis of the performance measures indicates a correlation between the number of epochs and the model’s effectiveness, with the model achieving optimal performance at a midway value. Specifically, the optimal results for both datasets are observed at 170 epochs, where the mIoU/OA scores reach their highest point. For the IRS-1C dataset, the performance improves consistently as the number of epochs increases from 50 to 170, suggesting that the network is undertrained at lower epoch counts. Beyond this value, however, there is a notable decrease in performance: the network overfits, which hinders its ability to generalize to unseen samples.
The U-Net architecture [39] is a lightweight network that skips fully connected layers and only utilizes the valid portion of each convolution. Additionally, it employs extensive data augmentation by applying elastic deformations to the training images, which allows the network to acquire invariance to such deformations regardless of whether they are present in the labeled set. To propose ResUnet, the authors in [40] substituted regular neural units with residual units as fundamental components within the U-Net framework; with just 25% of U-Net’s parameters, this network performs better. Nevertheless, these networks [39,40] lack multiresolution capabilities to handle different resolutions, a crucial characteristic of remotely sensed images. The authors of [41] use a fully convolutional network to combine features from different layers, creating an adjustable nonlinear local-to-global representation block. Their network integrates semantic information from a deep, coarse layer with appearance information from a shallow, fine layer in order to generate precise and detailed segmentations. Due to the variety and complexity of ground objects, high-resolution remotely sensed images have a large intraclass variance and a small interclass variance, which makes the semantic segmentation task very difficult for fully convolutional networks. The authors in [43] propose an end-to-end network (ScattNet) to overcome this problem. ScattNet incorporates lightweight spatial and channel-attention modules that enable the adaptive refinement of features. Nevertheless, that study exclusively employs two widely used attention modules; hence, designing a highly effective attention module that captures additional discriminative information remains a challenging task. In [44], the authors propose an adaptive screening feature (ASF) network to adjust the receptive field and enhance useful feature information in an adaptive manner. This network employs a wide space up-sampling block and an adaptive information utilization block (AIUB) to enhance the number of feature maps and improve the accuracy of incomplete target footprints; its accuracy and mean Intersection over Union (mIoU) are comparable to other state-of-the-art methods (Table 1 and Table 2). Compared with conventional methods, CNN-based methods learn semantic features automatically, thereby achieving strong representation capability. Nevertheless, the small receptive field of the convolution operation restricts CNN-based approaches from obtaining global information. In [45], the authors propose a class-guided Swin Transformer (CG-Swin) with the aim of efficiently collecting both local and global information in remote sensing images. In [46], the authors propose a similar approach to capture both local and global feature refinement within a CNN, employing a global–local transformer-block-based decoder to construct a transformer analogous to U-Net (UNetFormer); this approach produces findings comparable to those of other state-of-the-art methods. An adversarial generative model, ResiDualGAN [48], has been proposed for remote sensing image translation, in which an in-network resizer module addresses the scale variation of image datasets and a residual connection strengthens the stability of real-to-real image translation, improving performance in cross-domain semantic segmentation tasks. However, ResiDualGAN only minimizes the pixel-level domain gap; the minimization of feature-level and output-level domain gaps and the self-training strategies need improvement. Furthermore, because of the restrictions imposed by the CNN down-sampling procedure, the ResiDualGAN model has rigid input image size limitations.
The improved outcomes in this work result from incorporating four MRA bands alongside the dual-attention mechanism. This suggests that further investigations with a larger number of decompositions, introducing more MRA bands into the integration process, would be beneficial. Nevertheless, this is difficult because the images resulting from the various MRA decompositions have different sizes, requiring the sampling rate to be adjusted. Moreover, compared with the other existing models, this method requires additional time to decompose the original image into MRA bands.