1. Introduction
Urban expansion is currently a critical concern in developing countries such as India. The United Nations projects India’s population to reach 1.66 billion by 2050 [1]. Management flaws and excessive urbanization in developing nations have led to an increase in poverty and imbalanced growth, resulting in the emergence and expansion of slums, often referred to as “informal settlements” [1]. Sustainable Development Goal 11.1 (SDG 11.1) aims to guarantee that every individual has access to suitable, secure, and reasonably priced housing, and to improve the quality of slums by the year 2030 [2]. Several organizations, including government agencies and non-governmental organizations (NGOs), conduct slum surveys and produce slum maps to support slum improvement programs [3,4,5]. Nevertheless, conventional survey operations incur substantial expenses in manpower and time, and the acknowledged differences in judgment among the specialists and surveyors involved inevitably lead to inconsistent results [6].
Remotely sensed images have emerged as a key data source for acquiring geographical information on slums, and a growing number of researchers are using such imagery for slum mapping [7,8]. The precise identification and categorization of informal settlements from remote sensing data nonetheless poses significant difficulties [9,10,11]. Identifying distinct urban structures is difficult because their characteristics differ from those of typical land-cover classes, and the task requires the retrieval of texture and spatial characteristics [12,13,14,15]. Unlike agricultural land or other natural vegetation types, urban structures lack unique and easily distinguishable spectral signatures. On the other hand, the internal spatial features of slums, such as housing density, the size of individual dwelling units, and their structure, show potential as effective cues for identifying slums.
Conventional approaches employ texture, shape, spectral, and spatial characteristics, which are then processed using clustering or classification algorithms [16,17,18]. Recently, the use of deep learning techniques in remote sensing applications, such as slum mapping and urban analyses for building and road detection, has led to significant performance improvements [19,20,21,22]. The authors in [23] proposed a dual-resolution U-Net model to enhance the analysis of multisource data; the model captures features at different scales, leading to improved inference and enriched contextual information, with a focus on optimizing the borders of built-up areas. An encoder–decoder convolutional neural network (CNN) was designed to extract and combine multi-scale characteristics for building footprint extraction in urban analysis [24]. The authors in [25] examined a network that utilizes various features to derive hierarchical features for urban segmentation. A fully connected neural network with discrete wavelet transform features is proposed in [26]; this combination allows for the detection and analysis of multi-scale features, resulting in enhanced performance. However, the wavelet transform is only well suited to representing linear edges. The authors in [27] introduced a deep learning method that effectively distinguishes between built-up areas and background pixels in a heavily imbalanced dataset by utilizing a cross-entropy-based loss. Although deep learning methods are suitable for these applications, such architectures must handle a wide range of variations, including differences in imaging characteristics, backgrounds, and the shape, size, and appearance of urban areas in aerial images. This work proposes a network that incorporates different image characteristics, including nonlinear boundaries; such differences can be captured by multiresolution analysis techniques.
Traditionally, wavelet transforms have been widely used for multiresolution analysis [26]. However, it is well documented that they are limited to capturing directional information in only the horizontal, vertical, and diagonal directions [28,29]. The wavelet transform employs a collection of approximately isotropic basis functions that exist at all scales and positions, making it more suitable for isotropic features or features that are only slightly anisotropic. Beyond the wavelet transform, other sets of basis functions have been employed to capture properties such as alignments, elongations, edges, and curvilinear features. Advances in multiresolution analysis, such as curvelets [30], have been demonstrated to surpass these constraints. This work investigates the application of curvelet features for semantic segmentation with self-attention modules in remotely sensed image analysis, focusing on the potential of these recently developed multiresolution algorithms for identifying slums.
This research presents a novel approach that utilizes curvelet-based multiresolution analysis (MRA) features integrated into a dual self-attention (SA) network to enhance the accuracy of detecting and extracting slum regions from remotely sensed images. The ability of MRA to extract features at the various scales and orientations exhibited by different classes, including slum areas, motivated the integration of MRA features into a self-attentive framework. This framework enables the model to concentrate on and extract regions that contain significant information by assigning different levels of importance to different regions. The use of MRA features in conjunction with an attention module presents a pioneering approach to extracting multi-scale information for identifying slum areas. The main contributions of this work can be summarized as follows:
- (1)
A new framework: a dual self-attention network integrated with curvelet-based multiresolution analysis features.
- (2)
To the best of our knowledge, the proposed method achieves state-of-the-art performance in semantic segmentation for informal settlement identification on the IRS-1C and WorldView-2 datasets.
- (3)
Building on our experiments and analyses, this paper explains why integrating multiresolution features into the network improves these metrics when applied to remotely sensed images.
2. Methods
The proposed network’s architecture is depicted in Figure 1. The system is composed of three components: the feature extraction backbone, the multi-level multiresolution feature augmentation, and the dual-attention module. The feature extraction backbone extracts characteristics from a given input aerial image at various scales. The input image is decomposed using the multiresolution technique, yielding four outcomes whose bands manifest directional details of the scene. In the wavelet-based multiresolution, B1, B2, B3, and B4 represent the approximation, horizontal details, vertical details, and diagonal details of the input image; for the curvelet decomposition, these bands represent curvilinear details present in the image. The four bands (B1–B4) augment the features extracted from different levels of the backbone, resulting in a combined multi-scale feature map of the aerial image. Finally, the acquired features, together with the combined multi-scale feature map, are fed into a dual-attention module. This module consists of position and channel self-attention modules, allowing the network to focus more on specific regions of the feature map while disregarding others. The combination of a multiresolution framework and self-attention enables the flexible integration of local characteristics with their global dependencies. This study explores two distinct backbones for feature extraction: VGG-16 [31] and ResNet50 [32]. Although both backbones yield similar outcomes, this work opted for the VGG-16 backbone because of its lightweight design and reduced inference latency for practical use. Moreover, only a subset of the layers from the original VGG-16 network has been utilized, specifically the first layers up to the fourth max-pooling block; this truncated structure retains a comprehensive feature representation while prioritizing shorter response times [33]. The VGG-16 [31] network contains thirteen convolutional layers, five max-pooling layers, and three dense layers. Instead of increasing the number of hyper-parameters, it emphasizes the use of 3 × 3 convolutional filters with stride 1 and 2 × 2 max-pooling filters with stride 2. The Conv-1 layer is equipped with 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and both Conv-4 and Conv-5 have 512 filters. The details of the VGG-16 network can be found in [31].
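To make the backbone concrete, a minimal PyTorch sketch is given below. It assumes the standard torchvision layer ordering for VGG-16 rather than the authors’ exact implementation, and the tap points used to collect multi-scale features are our assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TruncatedVGG16(nn.Module):
    """VGG-16 backbone cut after the fourth max-pooling block.

    In torchvision's vgg16().features, the first four max-pool layers sit at
    indices 4, 9, 16, and 23, so slicing up to index 24 keeps the ten
    convolutional layers (64/128/256/512 filters) up to and including the
    fourth max-pool and drops Conv-5 and the dense layers.
    """

    def __init__(self):
        super().__init__()
        features = vgg16(weights=None).features
        self.backbone = nn.Sequential(*list(features.children())[:24])
        # Layer indices after which multi-scale features are tapped (assumed).
        self.tap_points = {4, 9, 16, 23}

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in self.tap_points:
                feats.append(x)  # 1/2, 1/4, 1/8, and 1/16 resolution maps
        return feats

model = TruncatedVGG16()
outputs = model(torch.randn(1, 3, 256, 256))
print([tuple(o.shape) for o in outputs])  # channel widths: 64, 128, 256, 512
```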
2.1. Multiresolution Analysis
Multiresolution analysis (MRA) refers to the process of analyzing data at multiple levels of detail or resolution.
2.1.1. Wavelet Transform
The wavelet transform is a mathematical technique that represents signals or images at multiple resolutions in terms of wavelets, which are small waves or oscillations with certain properties. Wavelets can be understood as the result of projecting a signal onto a particular set of scaling functions $\phi(t)$ and wavelet basis functions $\psi(t)$ inside a vector space; the calculated wavelet coefficients correspond to these projection values. A discrete wavelet transform is implemented using filter banks. The basis functions are represented using the dilation Equations (1) and (2), as described in [34]:

$$\phi(t) = \sqrt{2}\, \sum_{n} h[n]\, \phi(2t - n), \qquad (1)$$

$$\psi(t) = \sqrt{2}\, \sum_{n} g[n]\, \phi(2t - n). \qquad (2)$$

The low-pass filter coefficients are represented by $h[\cdot]$ and the high-pass filter coefficients are represented by $g[\cdot]$ in the filter bank. The dilated and translated family $\psi_{m,n}(t) = 2^{m/2}\, \psi(2^m t - n)$ is indexed by the scaling index $m$ and the translating index $n$.
The wavelet-based MRA method is highly efficient in handling one- and two-dimensional signals that have linear discontinuities. The wavelet transform decomposes the image into high-pass and low-pass filter bands, allowing for the extraction of directional features that encompass horizontal, vertical, and diagonal elements. Nevertheless, these three linear directions are restrictive and may not adequately capture the necessary directional information in remotely sensed images.
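As a concrete illustration of the four-band decomposition described above, here is a minimal sketch using the PyWavelets library (our choice; the paper does not name an implementation). A single-level 2-D DWT yields exactly the bands B1–B4: approximation, horizontal, vertical, and diagonal details.

```python
import numpy as np
import pywt

# Hypothetical single-band 256 x 256 image patch.
image = np.random.rand(256, 256).astype(np.float32)

# Single-level 2-D DWT; the 'db2' low- and high-pass coefficients play the
# roles of h[.] and g[.] in the dilation Equations (1) and (2).
B1, (B2, B3, B4) = pywt.dwt2(image, "db2")

# B1: approximation, B2: horizontal, B3: vertical, B4: diagonal details.
for name, band in zip(("B1", "B2", "B3", "B4"), (B1, B2, B3, B4)):
    print(name, band.shape)
```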
2.1.2. Curvelet Transform
The curvelet transform is a mathematical operation used to analyze and represent data in a way that captures and highlights curved features. It operates at several scales and directions using basis functions that are curved like wedges. Basis functions are mathematical functions that decompose the input signal and express it in the transform domain. Wavelet basis functions are isotropic, meaning they have the same properties in all directions. As a result, a high number of coefficients are needed to accurately describe curve singularities [28]. The curvelets encompass the complete frequency space, with basis functions that are formed by grouping wavelet basis functions into linear structures at various scales and orientations. This enables them to effectively capture curvilinear discontinuities, as shown in Figure 2.
Curvelets divide the frequency spectrum into dyadic scales $2^j$ (where $j$ is an integer representing the scale) and further divide them into angular wedges that exhibit a parabolic aspect ratio. The curvelet transform operates in two dimensions, utilizing the spatial variable $x$, the frequency-domain variable $\omega$, and the polar coordinates $r$ and $\theta$ in the frequency domain. The definition involves two windows, namely the radial window $W(r)$ and the angular window $V(t)$, which are characterized by Equations (3)–(6). Both windows are smooth, nonnegative, and real-valued, and satisfy the admissibility conditions

$$\sum_{j=-\infty}^{\infty} W^2(2^j r) = 1, \quad r \in (3/4,\, 3/2), \qquad (3)$$

$$\sum_{l=-\infty}^{\infty} V^2(t - l) = 1, \quad t \in (-1/2,\, 1/2), \qquad (4)$$

where each window is constructed from a smooth function $\nu$ satisfying the following:

$$\nu(x) = 0 \ \text{for} \ x \le 0, \qquad \nu(x) = 1 \ \text{for} \ x \ge 1, \qquad \nu(x) + \nu(1 - x) = 1. \qquad (5)$$

The polar “wedge” denoted as $U_j$ is supported by the radial window $W(r)$ and the angular window $V(t)$. In the frequency domain, these wedges are defined in [30] as follows:

$$U_j(r, \theta) = 2^{-3j/4}\, W(2^{-j} r)\, V\!\left(\frac{2^{\lfloor j/2 \rfloor}\, \theta}{2\pi}\right). \qquad (6)$$

Here, $W(\cdot)$ represents the radial window, $V(\cdot)$ represents the angular window, $r$ represents the radius, $\theta$ represents the angle of orientation, and $j$ represents the scale. This article utilizes the wrapping-based fast discrete curvelet transform [35].
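For intuition, the continuous-domain wedge of Equation (6) can be evaluated on a frequency grid in NumPy. The sketch below uses Meyer-type windows built from the standard polynomial choice of the smooth function ν; this particular construction is our assumption (one common choice satisfying Equation (5)), not necessarily the windows used in [35].

```python
import numpy as np

def nu(x):
    """Smooth step satisfying Eq. (5): nu(x) + nu(1 - x) = 1 on [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return x**4 * (35 - 84 * x + 70 * x**2 - 20 * x**3)

def V(t):
    """Angular window: 1 on |t| <= 1/3, decays smoothly to 0 by |t| = 2/3."""
    t = np.abs(t)
    out = np.zeros_like(t)
    out[t <= 1 / 3] = 1.0
    mid = (t > 1 / 3) & (t <= 2 / 3)
    out[mid] = np.cos(np.pi / 2 * nu(3 * t[mid] - 1))
    return out

def W(r):
    """Radial window supported on [2/3, 5/3], satisfying Eq. (3)."""
    out = np.zeros_like(r)
    out[(r >= 5 / 6) & (r <= 4 / 3)] = 1.0
    lo = (r >= 2 / 3) & (r < 5 / 6)
    out[lo] = np.cos(np.pi / 2 * nu(5 - 6 * r[lo]))
    hi = (r > 4 / 3) & (r <= 5 / 3)
    out[hi] = np.cos(np.pi / 2 * nu(3 * r[hi] - 4))
    return out

def wedge_Uj(n, j, theta0=0.0):
    """Evaluate the polar wedge U_j of Eq. (6) on an n x n frequency grid."""
    w = np.linspace(-8, 8, n)
    WX, WY = np.meshgrid(w, w)
    r = np.hypot(WX, WY)
    theta = np.arctan2(WY, WX) - theta0
    return (2.0 ** (-3 * j / 4) * W(2.0 ** (-j) * r)
            * V(2.0 ** np.floor(j / 2) * theta / (2 * np.pi)))

U = wedge_Uj(256, j=2)  # one scale-2^2 wedge at orientation theta0 = 0
print(U.shape, float(U.max()))
```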
The input image is decomposed into one approximation band (B1) and three detailed sub-bands (B2–B4) by the wavelet and curvelet MRA. The three detail sub-bands are combined to enhance the multi-scale feature maps, providing the network with additional details about singularities. The approximation band is used as input for the next level of MRA decomposition, and multiple levels of wavelet and curvelet decomposition are applied to augment all of the acquired multi-scale feature maps. The feature maps are combined and convolved to generate a unified multi-scale feature map, enabling the efficient capture of both global and local contextual directional information.
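A minimal sketch of this multi-level scheme is shown below (PyWavelets and SciPy are used for illustration; fusing the detail bands by summing their magnitudes and stacking levels after resizing is our simplification of the paper’s combine-and-convolve step).

```python
import numpy as np
import pywt
from scipy.ndimage import zoom

def multilevel_mra_features(image: np.ndarray, levels: int = 3) -> np.ndarray:
    """Level-by-level wavelet MRA: at each level the detail bands (B2-B4)
    are fused, and the approximation band (B1) feeds the next level."""
    fused_details = []
    approx = image
    for _ in range(levels):
        approx, (h, v, d) = pywt.dwt2(approx, "db2")
        fused_details.append(np.abs(h) + np.abs(v) + np.abs(d))
    # Resize every fused detail map to the finest level's size and stack them
    # into a unified multi-scale feature map.
    H, W = fused_details[0].shape
    stacked = np.stack(
        [zoom(f, (H / f.shape[0], W / f.shape[1]), order=1) for f in fused_details]
    )
    return stacked  # shape: (levels, ~H/2, ~W/2)

features = multilevel_mra_features(np.random.rand(256, 256), levels=3)
print(features.shape)
```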
2.2. Dual Attention
This work utilizes a self-attention framework that draws inspiration from the dual-attention network proposed in [36]. Two synchronous attention modules, namely the position-attention module and the channel-attention module, are used to efficiently capture the interconnectedness of features in both the spatial and channel dimensions. This structure facilitates the mutual enhancement of any two positions with similar characteristics, whether within the same channel or across several channels, in response to a given trait. The multiresolution framework captures local as well as global contextual information owing to the nature of the decomposed sub-bands of the input image. The attention modules streamline the extracted features by prioritizing, through their weighting scheme, elements that are essential for enhancing the segmentation performance. The attention module can be understood as a process that both removes noise and improves the quality of the feature space by transforming it into a domain where spatial and channel features are not uniformly weighted. Furthermore, because the receptive fields of conventional fully convolutional networks are reduced to a local scale, such networks lack the ability to represent intricate long-range contextual information derived from a diverse combination of local and global properties. The dual-attention technique effectively addresses these issues by utilizing position- and channel-attention modules. Figure 3 provides a comprehensive diagram of the arrangement of the position- and channel-attention modules, showcasing the various matrix linkages.
2.2.1. Position-Attention Module
In order to capture the long-range dependencies inside each channel and to identify the most significant elements in a spatial context, we employ the position-attention module. By subjecting an input feature map $I \in \mathbb{R}^{C \times H \times W}$ to two distinct adaptive convolutional layers, two additional feature maps $I_1$ and $I_2$ are obtained. In order to generate the spatial-attention matrix $S$, $I_1$ and $I_2$ are reshaped into two-dimensional matrices of shape $C \times N$, where $N = H \times W$, and the matrices are then multiplied and normalized as $S = \mathrm{softmax}(I_1^{\top} I_2)$ to produce the spatial-attention matrix $S$ of shape $N \times N$.

The spatial-attention matrix is built adaptively from the input multi-resolution feature map. It reflects the correlation between distinct regions within a channel map. After undergoing an additional adaptive convolutional layer, the input feature map $I$ is multiplied by $S$ and then reshaped back to its original form $C \times H \times W$. The spatial-attention matrix exhibits distinct responses to various regions of the feature map, hence enhancing focus on certain areas of the image. Thus, we have the capability to selectively combine characteristics from various resolutions within a broader context.
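A minimal PyTorch sketch of this position-attention computation, following the general DANet formulation of [36] (the reduced query/key width and the learnable fusion weight γ are assumptions):

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Spatial self-attention: S = softmax(I1^T I2); output re-weights I by S."""

    def __init__(self, channels: int):
        super().__init__()
        inner = max(channels // 8, 1)  # reduced width for I1/I2 (assumed)
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        i1 = self.query(x).view(b, -1, n)               # B x C' x N
        i2 = self.key(x).view(b, -1, n)                 # B x C' x N
        s = torch.softmax(i1.transpose(1, 2) @ i2, -1)  # B x N x N attention
        v = self.value(x).view(b, c, n)                 # B x C x N
        out = (v @ s.transpose(1, 2)).view(b, c, h, w)  # re-weighted features
        return self.gamma * out + x
```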
2.2.2. The Channel-Attention Module
Each input channel in the attention module is associated with a class-specific response at a certain resolution and in the combined multi-scale feature. Identifying the connections within and between these feature maps is crucial, since it significantly enhances the semantic representation by minimizing misclassifications. The computation of channel attention is analogous to the position-attention module, with the exception that it involves calculating the channel-attention matrix $X \in \mathbb{R}^{C \times C}$. This matrix is computed along the channel dimension to capture long-range dependencies among the feature maps. The channel-attention matrix is derived from the same reshaped feature maps ($I_1$, $I_2$) as $X = \mathrm{softmax}(I_1 I_2^{\top})$.
The spatial- ($P$) and channel-attention ($Q$) modules produce outputs that are fused element-wise. This combined output is then passed through a softmax layer to generate the overall output of the attention block, represented as $A_i$ (the $i$-th attention block).
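Correspondingly, a sketch of the channel-attention branch and the element-wise fusion of the two outputs P and Q (the trailing softmax follows the text; all layer details are assumptions), reusing the PositionAttention module from the previous sketch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel self-attention: X = softmax(I1 I2^T), a C x C affinity matrix."""

    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                                # B x C x N
        attn = torch.softmax(flat @ flat.transpose(1, 2), -1)  # B x C x C
        out = (attn @ flat).view(b, c, h, w)
        return self.gamma * out + x

class DualAttentionBlock(nn.Module):
    """Element-wise fusion of position (P) and channel (Q) branch outputs."""

    def __init__(self, channels: int = 4):  # four channels per block (Sec. 2.2)
        super().__init__()
        self.pam = PositionAttention(channels)  # from the previous sketch
        self.cam = ChannelAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.pam(x) + self.cam(x)   # element-wise fusion of P and Q
        return torch.softmax(fused, dim=1)  # attention-block output A_i
```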
The combined joint multi-scale feature map, as explained earlier, is integrated with the feature maps produced at other scales through up-sampling. These combined feature maps are then fed into the dual-attention network, where each scale has its own dedicated attention block. Each attention block processes feature maps that are half the size of the original image and have only four feature channels. Each class has its own feature map at a certain resolution, as well as a joint multi-scale feature map. The choice of four channels per attention block reflects the fact that each pair of channels encodes a class-specific response at a specific scale, given the joint multi-scale nature of the problem. Since this is a single-class identification problem, four channels were deemed sufficient to leverage the interdependencies among channel spaces and enhance the feature representation of specific semantics.
The output of each attention block $A_i$ at the various scales is combined using element-wise summation to create the joint-scale attention map $A$. This map enhances the feature representation, leading to more precise outcomes. The joint-scale attention map is calculated as follows:

$$A = \sum_{i} A_i.$$
The resulting feature map is processed through an up-sampling and pixel-wise deconvolution block to produce the final semantically segmented output, which has the same spatial resolution as the input data. The output consists of two channels, one reflecting the response for each class.
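A sketch of this final head (sizes assumed from the text: four input channels at half resolution, two output channels):

```python
import torch.nn as nn

# Up-sampling plus pixel-wise deconvolution head: a stride-2 transposed
# convolution restores the input resolution and a 1 x 1 convolution produces
# the two per-class response channels.
segmentation_head = nn.Sequential(
    nn.ConvTranspose2d(4, 4, kernel_size=4, stride=2, padding=1),  # x2 upsample
    nn.Conv2d(4, 2, kernel_size=1),  # channel 0: background, channel 1: slum
)
```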
2.3. Progressive Generation
A progressively growing network architecture has been used for training. The finest-resolution aerial image $I_0 \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are the image dimensions, was downscaled to the coarsest-resolution image $I_N \in \mathbb{R}^{(H/2^N) \times (W/2^N)}$, where $N$ is the number of iterations. The network underwent training beginning at the coarsest resolution and progressively increasing to the finest possible resolution. First, a segmentation mask is constructed at a lower resolution; then, the weights learned at this resolution are used to compute the next-highest resolution and update the weights. This procedure leads to the successive refinement of the segmentation mask. The transition from a lower to a higher level of detail at a given iteration was accomplished in accordance with the rule outlined in [37]:
$$S_{k+1} = \mathrm{upscale}(S_k) \quad \text{once} \quad \mathcal{L} < \epsilon,$$

where the $\mathrm{upscale}(\cdot)$ function implements bilinear interpolation with a scale factor of two, $\mathcal{L}$ is the network supervision signal (as detailed in the following section), and $\epsilon$ is an empirically determined threshold. This allows for the iterative improvement of the segmentation mask, resulting in homogeneous semantics.
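A schematic of this coarse-to-fine schedule is sketched below; the transition rule is our reading of the text, and `model`, `loss_fn`, and the data pipeline are hypothetical placeholders.

```python
import torch.nn.functional as F

def progressive_training(model, loader, optimizer, loss_fn,
                         n_levels: int = 3, eps: float = 0.05, epochs: int = 10):
    """Train from the coarsest (H/2^N x W/2^N) to the finest (H x W) resolution,
    doubling the working resolution once the supervision signal drops below eps."""
    scale = 2 ** n_levels
    for _ in range(epochs):
        for image, mask in loader:
            # Down-scale the input and target to the current working resolution.
            img = F.interpolate(image, scale_factor=1.0 / scale, mode="bilinear")
            msk = F.interpolate(mask, scale_factor=1.0 / scale, mode="nearest")
            loss = loss_fn(model(img), msk)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Move to the next-finer level (bilinear up-scaling by two) once
            # the supervision signal falls below the empirical threshold.
            if scale > 1 and loss.item() < eps:
                scale //= 2
    return model
```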
2.4. Loss Function
Typically, semantic segmentation tasks involving remote sensing images exhibit a significant imbalance in the class distribution. Therefore, to address the imbalance of classes in the training data, the network uses focal loss as the supervision signal instead of the standard cross-entropy loss [38].

The focal loss incorporates a modulating term into the usual cross-entropy loss, enabling the supervision signal to place greater emphasis on regions that are misclassified and challenging to segment due to the underlying characteristics of the scene. The modulating term gradually tends to zero as the confidence in the segmentation improves, at which point the focal loss converges to the conventional cross-entropy loss. The focal loss is mathematically represented as follows:

$$FL(p_t) = -\alpha\, (1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise,} \end{cases}$$

where $(1 - p_t)^{\gamma}$ is the modulating factor with focusing parameter $\gamma$, $\alpha$ is the weighting parameter that accounts for class imbalance, $p$ is the predicted probability, and $y$ is the ground-truth class.
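A minimal PyTorch implementation of this binary focal loss (α = 0.25 and γ = 2 are the illustrative defaults from [38], not values tuned in this paper):

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probabilities in (0, 1); y: ground-truth labels in {0, 1}.
    With gamma = 0 the modulating factor is 1 and the loss reduces to a
    weighted cross-entropy.
    """
    y = y.float()
    p = p.clamp(1e-6, 1 - 1e-6)           # numerical stability
    p_t = y * p + (1 - y) * (1 - p)       # probability of the true class
    alpha_t = y * alpha + (1 - y) * (1 - alpha)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()
```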
4. Discussion
In this work, the effect of integrating wavelet- and curvelet-based MRA features into a semantic segmentation architecture trained for informal settlement identification on two datasets was studied quantitatively and qualitatively. High values of the overall accuracy, mIoU score, F-score, precision, and recall metrics were observed, with some variability in performance. The improvements in the quantitative measurements highlight the effectiveness and practicality of combining a multi-resolution framework with self-attention to enhance the accuracy of semantic segmentation in identifying informal settlements. The trade-off between precision and recall in the binary semantic segmentation of slum areas can be understood as follows: a higher recall signifies a model better able to identify a large number of slum-area pixels, even at the cost of misclassifying some background pixels as slum and hence reducing the precision score, as observed in Table 2.
Figure 5i demonstrates the enhanced segmentation outcomes of the proposed network in terms of its capacity to extract distinct and realistic slum masks. In contrast, the U-Net produces slum footprints that are irregular, shapeless blobs. The proposed network accurately extracts both straight-edged areas and curve-shaped regions with high structural precision, whereas ResNet, for example, produces noisy footprints that lack consistent semantics. This improvement arises because the curvelet coefficients efficiently capture curved boundaries and represent variously oriented edges, owing to the anisotropic properties of their basis functions. U-Net and ResUnet fail to separate individual footprints and instead merge all of the built-up masks into a single large mass, a characteristic that is unsuitable for several applications related to urban planning. In Figure 6i, the proposed method effectively detects and extracts very narrow slum structures while minimizing misclassifications, as is clearly seen when compared with the segmentation results of ResUnet.
To further assess the results, a visual comparison for qualitative analysis was carried out. Comparing the input images, ground-truth reference masks, and segmented outcomes shows that, in general, the slum representations obtained with the proposed method are a clear improvement over the other models. In Figure 5 and Figure 6, misclassified portions are marked with red circles, areas encircled in blue show irregular boundaries in the result, and areas encircled in green show proper, regular boundary detection. The wavelet-based approach and the other models are found to be unsuitable in terms of the orientation and anisotropic properties of the classes, as illustrated in Figure 5d,f,g and Figure 6c,e,f,h; the curvelet features, however, efficiently capture the curved edges, as shown in Figure 5i and Figure 6i. With the proposed method, the slum areas do not appear intermixed with partially built-up areas and other classes, as is observed with the other models. The accuracy of the boundary shape and edge continuity is observed to be better, especially in the middle-left portion of Figure 5. Similarly, a deviation from the reference mask (Figure 6b) boundaries can be observed in the bottom-right part of the image (Figure 6c–h).
In the ablation investigation, we employed the single-scale VGG-16 network, with a structure closely resembling the feature extraction backbone of the proposed method, as the baseline. We then showcase the efficacy of the distinct elements of the proposed approach by gradually integrating each module. The comparative findings are displayed in Table 3. By default, all trials in the ablation employ progressive generation training, unless specifically stated otherwise.
Table 3 clearly demonstrates that incorporating the multiresolution features and the dual-attention mechanism leads to enhancements of 14% and 10%, respectively, compared to the baseline. Moreover, the effectiveness of the attention mechanism in enhancing the semantic segmentation outcomes is observed, which can be attributed to its capacity to progressively improve the segmentation mask, leading to consistent and uniform semantics.
Figure 7 displays the segmentation outcome, illustrating the effectiveness of exclusively utilizing MRA features in the proposed network. The MRA framework is valuable because it accurately represents informal settlement footprints and effectively detects and extracts intricate structures and informal settlements of different shapes and sizes. The dual-attention process aids in capturing the essential long-range relationships needed for accurate segmentation outcomes, which is especially valuable in cluttered scenes. The attention mechanism also makes the network insensitive to specific image acquisition properties, such as spatial resolution, resulting in consistent and uniform semantics.
Table 4 describes the epoch sensitivity of the VGG-16 network. An analysis of the performance measures indicates a correlation between the number of epochs and the model’s effectiveness, with the model achieving optimal performance at a midway value. Specifically, the optimal results for both datasets are observed at 170 epochs, where the mIoU/OA scores reach their highest point. For the IRS-1C dataset, the performance improves consistently as the number of epochs increases from 50 to 170, suggesting that the network is undertrained at lower epoch counts. Beyond this value, however, there is a notable decrease in performance: the network overfits, which hinders its ability to generalize to unseen samples.
The U-Net architecture [39] is a lightweight network that skips fully connected layers and only utilizes the valid portion of each convolution. Additionally, it employs extensive data augmentation by applying elastic deformations to the training images, which allows the network to acquire invariance to such deformations regardless of whether they are present in the labeled set. To propose ResUnet, the authors in [40] substituted regular neural units with residual units as fundamental components within the U-Net framework; with just 25% of U-Net’s parameters, this network performs better. Nevertheless, these networks [39,40] lack multiresolution capabilities to handle different resolutions, a crucial characteristic of remotely sensed images. The authors of [41] use a fully convolutional network to combine features from different layers, creating an adjustable nonlinear local-to-global representation block. Their network integrates semantic information from a deep, coarse layer with appearance information from a shallow, fine layer in order to generate precise and detailed segmentations. Due to the variety and complexity of ground objects, high-resolution remotely sensed images have a large intraclass variance and a small interclass variance, which makes the semantic segmentation task very difficult for fully convolutional networks. The authors in [43] propose an end-to-end network (ScattNet) to overcome this problem. ScattNet incorporates lightweight spatial and channel-attention modules that enable the adaptive refinement of features. Nevertheless, that study exclusively employs two widely used attention modules; hence, designing a highly effective attention module that captures additional discriminative information remains a challenging task. In [44], the authors propose an adaptive screening feature (ASF) network to adjust the receptive field and enhance useful feature information in an adaptive manner. This network employs a wide space up-sampling block and an adaptive information utilization block (AIUB) to enhance the number of feature maps and improve the accuracy of incomplete target footprints; its accuracy and mean Intersection over Union (mIoU) are comparable to other state-of-the-art methods (Table 1 and Table 2). Compared with conventional methods, CNN-based methods learn semantic features automatically, thereby achieving strong representation capability. Nevertheless, the small receptive field of the convolution operation restricts CNN-based approaches from obtaining global information. In [45], the authors propose a class-guided Swin Transformer (CG-Swin) with the aim of efficiently collecting both local and global information in remote sensing images. In [46], the authors propose a similar approach to capture both local and global feature refinement within a CNN, employing a global–local transformer-block-based decoder to construct a transformer analogous to U-Net (UNetFormer); this approach produces findings comparable to those of other state-of-the-art methods. An adversarial generative model, ResiDualGAN [48], has been proposed for remote sensing image translation, in which an in-network resizer module addresses the scale variation of image datasets and a residual connection strengthens the stability of real-to-real image translation, improving performance in cross-domain semantic segmentation tasks. However, ResiDualGAN only minimizes the pixel-level domain gap; the minimization of feature-level and output-level domain gaps and the self-training strategies need improvement. Furthermore, because of the restrictions imposed by the CNN down-sampling procedure, the ResiDualGAN model has rigid input image size limitations.
The improved outcomes in this work result from incorporating four MRA bands alongside the dual-attention mechanism. This suggests that further investigations with a larger number of decompositions, introducing more MRA bands into the integration process, would be beneficial. Nevertheless, this is difficult because the images resulting from the various MRA decompositions have different sizes, requiring the sampling rate to be adjusted. Moreover, compared with the other existing models, this method requires additional time to decompose the original image into MRA bands.