1. Introduction
Semantic segmentation, a fundamental task for interpreting remote sensing imagery (RSI), is now essential in various fields, such as water resource management [1,2], land cover classification [3,4,5], urban planning [6,7] and precision agriculture [8,9]. The task aims to produce a raster map corresponding to an input image by assigning a categorical label to every pixel [10]. With such a labeled raster map, the observed objects and terrain information can be readily recognized and analyzed, yielding structured and readable knowledge. However, the reliability and usability of this knowledge depend heavily on the accuracy of the semantic segmentation.
Conventional segmentation methods for remote sensing imagery essentially rely on statistical analysis of prior distributions. For example, Arivazhagan et al. [11] extracted and combined wavelet statistical features and co-occurrence features to characterize textures at different scales; their experiments on monochrome images achieved competitive performance. For segmenting plant leaves, Gitelson and Merzlyak [12] developed a new spectral index, the green normalized difference vegetation index (GNDVI), based on spectral properties. Although it works well, its dependence on a specific spectral range limits its applicability. Likewise, the normalized difference index for ASTER bands 5–6 was devised by incorporating several vegetation indices [13]; it helps segment the major crops by grouping crop fields with similar index values. Subsequently, Blaschke summarized these methods as object-based image analysis (OBIA) for remote sensing, which utilizes spectral and spatial information in an integrative fashion [14]. In summary, traditional methods are target-specific and spectrum-fixed, which makes the resulting segmentation models fragile. Moreover, with the arrival of the big data era, such approaches are far from practical when applied to the task on their own.
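Normalized-difference indices such as GNDVI are simple per-pixel ratios. As a minimal NumPy sketch (the band arrays and the epsilon guard are our illustrative assumptions):

```python
import numpy as np

def gndvi(nir: np.ndarray, green: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Green NDVI, (NIR - Green) / (NIR + Green), computed per pixel.

    `nir` and `green` are hypothetical 2D reflectance arrays from the
    corresponding sensor bands; `eps` guards against division by zero.
    """
    nir = nir.astype(np.float64)
    green = green.astype(np.float64)
    return (nir - green) / (nir + green + eps)
```

Thresholding or clustering such an index map then yields the target-specific segments described above.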
More recently, machine learning models have been extensively applied to classify pixels based on handcrafted features. For example, Yang et al. [15] captured texture and context by fusing the Texton descriptor with the association potential in a conditional random field (CRF) framework. Mountrakis et al. [16] reviewed support vector machine (SVM) classifiers in remote sensing; SVMs need only a small number of training samples while reaching competitive accuracy. Random forest (RF) has also exhibited a strong ability to classify remote sensing data with high dimensionality and multicollinearity [17]. Nevertheless, conventional machine learning methods are neither automatic nor adaptive: although their efficiency is acceptable, their accuracy is often criticized, especially on multi-sensor and multi-platform data.
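The pixel-classification pipeline these works share is compact. A minimal scikit-learn sketch, where the feature files and the forest size are hypothetical placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical inputs: one handcrafted descriptor per pixel and the
# class index of each labeled training pixel.
features = np.load("pixel_features.npy")  # shape (n_pixels, n_features)
labels = np.load("pixel_labels.npy")      # shape (n_pixels,)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(features, labels)                 # learn from handcrafted features
pred = clf.predict(features)              # per-pixel class predictions
```

The handcrafted descriptor, not the classifier, is the bottleneck here, which is exactly what the CNN-based methods below remove.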
Since the successful development of convolutional neural networks (CNNs), numerous studies have examined their application to segmentation. Compared with traditional machine learning methods, CNNs have demonstrated a powerful capacity for feature extraction and object representation. One of the most significant breakthroughs was the fully convolutional network (FCN) [18], the first end-to-end framework, which uses deconvolution layers to recover the spatial size of feature maps. Its major drawback, however, is the information loss that accompanies shrinking and dilating the spatial size of the features. To alleviate this transformation loss, the encoder-decoder segmentation network SegNet [19] was designed with a symmetric architecture: in the encoder, the indices of the maximal pixels are recorded, and in the corresponding decoder stage the recorded values are re-assigned to the same positions. Similarly, U-Net [20] better retains detailed information through skip connections between encoder and decoder; its initial application to medical images verified remarkable progress over FCN and SegNet. In addition, these works revealed that comprehensively capturing contextual information enables accurate segmentation results.
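SegNet's index-preserving pooling can be reproduced directly with PyTorch primitives. A minimal sketch (the tensor shape is illustrative):

```python
import torch
import torch.nn.functional as F

# The encoder records where each maximum came from; the decoder
# writes the values back to exactly those positions.
x = torch.randn(1, 64, 32, 32)  # illustrative encoder feature map
pooled, idx = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
# ... decoder convolutions on `pooled` would run here ...
unpooled = F.max_unpool2d(pooled, idx, kernel_size=2, stride=2)  # same spatial layout as x
```

U-Net's skip connections take the complementary route: instead of replaying pooling positions, the encoder feature map itself is concatenated with the upsampled decoder features.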
Endeavoring to enrich the learnt representations with contextual information, atrous convolution was proposed [21]. This unit introduces a dilation rate that controls the spacing at which neighboring pixels are convolved when generating the central position's representation. DeepLab V2 [22] built a novel atrous spatial pyramid pooling (ASPP) module on top of it to sample features at different scales. Furthermore, DeepLab V3 [23] integrated global average pooling to embed more global information. For execution efficiency, DeepLab V3+ [24] opted for Xception as the backbone together with depth-wise separable convolutions. Unfortunately, enlarging the receptive field causes edge distortions, leaving the pixels surrounding boundaries error-prone.
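An ASPP-style module is a set of parallel dilated convolutions plus an image-level pooling branch. A minimal PyTorch sketch, where the dilation rates and channel widths are illustrative rather than the DeepLab settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel atrous branches at several rates plus image-level pooling."""
    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]  # 1x1 branch
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Conv2d(in_ch, out_ch, 1)  # applied after global pooling
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = self.image_pool(F.adaptive_avg_pool2d(x, 1))
        feats.append(F.interpolate(pooled, size=(h, w), mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```

Each branch sees the same input at a different effective receptive field, which is what lets the module sample context at multiple scales in one pass.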
Regarding the properties of geographic objects in RSI, context aggregation is advocated. Even within the same class, the scale of the optimal segments differs. Context aggregation methods therefore strive to minimize the heterogeneity of intra-class objects and maximize the heterogeneity of inter-class objects by fusing multi-level feature maps at various scales. For example, Zhang et al. [25] designed a multi-scale context aggregation network that encodes the raw image with the high-resolution network (HRNet) [26], in which four parallel branches generate feature maps at four sizes; these are enhanced with corresponding refinements before concatenation, and the results on the ISPRS Vaihingen and Potsdam datasets are competitive. Coincidentally, a multi-level feature aggregation network (MFANet) [27] was proposed with the same motivation: two modules, channel feature compression (CFC) and multi-level feature aggregation upsample (MFAU), were designed to reduce the loss of details and sharpen edges. Moreover, Wang et al. [28] defined a cost-sensitive loss function in addition to fusing multi-scale deep features.
Alternatively, the attention mechanism was initially applied to boost performance in labeling natural images. SENet [29] built a channel-wise attention module that recalibrates channel weights from the learnt relationships between arbitrary channels; in this way, SENet helps the network pay more attention to channels carrying complementary information. Addressing spatial and channel correlations simultaneously, CBAM [30] and DANet [31] achieved perceptible improvements on natural image data. The subsequently proposed self-attention fashion further refines the representations: the non-local neural network [32] learns position-wise attention maps in both the spatial and channel domains. Exploiting this capability of self-attentively modeling long-range dependencies, ACFNet [33] offered a coarse-to-fine segmentation pipeline. The self-attention module also inspired OCRNet [34], in which relational context is extracted and fed into the prediction, sharpening object boundaries.
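SENet's channel recalibration is a few lines in practice. A minimal PyTorch sketch (the reduction ratio is illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pool -> (N, C)
        w = self.fc(w).view(n, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                     # recalibrate the channels
```

Spatial and self-attention variants follow the same recipe, only computing the weights over positions (or position pairs) instead of channels.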
As attention modules have been transplanted to remote sensing, diverse and easily-confused geo-objects are better distinguished than before [35]. To this end, many variant networks introducing the attention rationale have been investigated. CAM-DFCN [36] incorporates channel attention into the FCN architecture to exploit multi-modal auxiliary data, enhancing the distinguishability of features. HMANet [37] adaptively and effectively captures the correlations lying in the space, channel and category domains, benefiting from an extensible self-attention mechanism. Li et al. [38] observed that the most challenging issue is coping with large intra-class variance and inconspicuous inter-class variance; their SCAttNet therefore integrates spatial and channel attention into a lightweight yet efficient network. More recently, Lei et al. [39] proposed LANet, which bridges the gap between high- and low-level features by embedding the local focus of high-level features with a designed patch attention module. From these successful attention-based methods, it can be concluded that attention modules strengthen the separability of the learned representations, making error-prone objects more distinguishable. Marmanis et al. [40] pointed out that edge regions carry semantically meaningful boundaries. However, the existing attentive techniques learn the representation of every pixel equally. As a result, pixels surrounding edges are easily misjudged, leading to blurry edge regions and even raising the uncertainty of long-distance pixels.
To perceive and transmit edge knowledge, Marmanis et al. combined semantically informed edge detection with an encoder-decoder architecture, so that class boundaries are explicitly modeled and steer the training phase; this memory-efficient method yields more than 90% accuracy on the ISPRS Vaihingen benchmark. Afterward, PEGNet [41] presented a multipath atrous convolution module that generates dilated edge information via Canny and morphological operations. The resulting edge-region maps help the network identify the pixels around edges with high consistency. In addition, a recalibration module regulated by the training loss guides the misclassified pixels, reporting an overall accuracy of more than 91% on the Vaihingen dataset.
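For intuition, a dilated edge-region map of this kind can be produced with two OpenCV calls. A minimal sketch in the spirit of such preprocessing; the file name, Canny thresholds and kernel size are our illustrative assumptions, not PEGNet's settings:

```python
import cv2
import numpy as np

img = cv2.imread("tile.png", cv2.IMREAD_GRAYSCALE)     # hypothetical input tile
edges = cv2.Canny(img, 100, 200)                       # thin binary edge map
kernel = np.ones((5, 5), np.uint8)
edge_region = cv2.dilate(edges, kernel, iterations=1)  # widen edges into a region mask
```

The dilation is what turns one-pixel-wide edges into the edge regions in which pixels need special treatment.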
To sum up, the commonly used combination of a boundary detector and a segmentation network is complex and incurs much higher time costs: the independent boundary detector requires its own loss computation and an embedding interface into the trunk network. In addition, the existing methods are far from adaptively extracting and injecting edge distributions. Hence, the purpose of this study is twofold: (1) edge knowledge should be explicitly modeled and incorporated into the learnt representations, facilitating the network's discriminative capability in labeling pixels positioned in marginal areas; (2) the extraction and incorporation of edge distributions should be learnable and end-to-end trainable without breaking the inherent spatial structure.
Generally, CNNs have demonstrated superiority in segmenting remote sensing imagery by learning local patterns. Nevertheless, remote sensing imagery usually covers wide areas and observes various ground objects, which leaves such networks insufficient and degrades accuracy. Furthermore, although attention modules have brought striking improvements by learning contextual information, treating edge pixels the same as all others induces blur around boundaries, which in turn causes massive numbers of misclassified pixels. Hence, it is necessary to incorporate the edge distributions to assist the network in enhancing edge delineation and recognizing various objects. Motivated by the attention mechanism and the two-dimensional principal component analysis (2DPCA) diagram [42], we find that edge distributions can be injected in a learnable way. Therefore, in this study, we first formulate and re-define the covariance matrix inspired by 2DPCA. To refine the representations, efforts are devoted from two perspectives. One is learning the edge distributions modeled by the re-defined covariance matrix, following the inherent spatial structure of the encoded feature maps; the other is employing the non-local block, a typical self-attention module with high efficiency, to enhance the representations with local and global contextual information. This hybrid strategy lets the segmentation network determine the dominating features and filter out irrelevant noise. In summary, the contributions are as follows:
- (1)
Inspired by the image covariance analysis of 2DPCA, the covariance matrix (CM) is re-defined over the feature maps learnt in the network. The edge distribution attention (EDA) module is then devised on this covariance matrix analysis, explicitly modeling the dependencies of edge distributions in a self-attentive way. Through column-wise and row-wise edge attention maps, the vertical and horizontal relationships are both quantified and leveraged. In EDA, a handcrafted feature is thus successfully combined with learnt ones (a rough illustrative sketch follows this list).
- (2)
A hybrid attention module (HAM) that emphasizes both edge distributions and position-wise dependencies is devised, so that more complementary edge and contextual information is collected and injected. Its parallel architecture supports independent and flexible embedding.
- (3)
A conceptually end-to-end neural network, named the edge distribution-enhanced semantic segmentation network (EDENet), is proposed. EDENet hierarchically integrates HAM to generate representative and discriminative encoded features, providing reliable and reasonable cues for dense prediction.
- (4)
Extensive experiments are conducted on three benchmarks: ISPRS Vaihingen [43], Potsdam [44] and DeepGlobe [45]. The results indicate that EDENet is superior to other state-of-the-art methods, and the ablation study further verifies the efficacy of EDA.
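Section 3 gives the exact definition used in EDA; as a rough orientation, classical 2DPCA forms a covariance matrix from 2D arrays directly, without flattening them into vectors. A minimal sketch under the assumption (ours, for illustration only) that each channel of an encoded feature map is treated as one 2D sample:

```python
import torch

def twodpca_style_covariance(feat: torch.Tensor):
    """Column-wise (W, W) and row-wise (H, H) covariance of a (C, H, W) map.

    Each channel is treated as a 2D sample, an illustrative assumption
    rather than the paper's exact formulation.
    """
    centered = feat - feat.mean(dim=0, keepdim=True)  # subtract the mean image
    c = feat.shape[0]
    col_cov = torch.einsum("chw,chv->wv", centered, centered) / c  # average of A^T A
    row_cov = torch.einsum("chw,cvw->hv", centered, centered) / c  # average of A A^T
    return col_cov, row_cov
```

Row and column covariances of this form quantify how strongly horizontal and vertical slices of a feature map co-vary, which is the kind of directional statistic the column-wise and row-wise edge attention maps build upon.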
The remainder of this paper is organized as follows: Section 2 introduces the related work, including the attention mechanism, 2DPCA and the non-local block. Section 3 concretely presents the devised framework and the pipeline of its sub-modules. Section 4 quantitatively and qualitatively evaluates the proposed method on both aerial and satellite images. Finally, the conclusions are drawn in Section 5.
5. Conclusions
The semantic segmentation of RSI plays a pivotal role in various downstream applications, and boosting its accuracy has been a hot topic in the field. Existing approaches have produced competitive results with attention-based deep convolutional neural networks. However, the deficiency of edge information leads to blurry boundaries and even raises high uncertainty for long-distance pixels during recognition.
In this study, we have investigated an end-to-end trainable semantic segmentation neural network for RSI. Essentially, we first formulate and model the edge distributions of the encoded feature maps, inspired by covariance matrix analysis. The designed EDA then learns column-wise and row-wise edge attention maps in a self-attentive fashion. As a result, edge knowledge is successfully modeled and injected into the learnt representations, improving their representativeness and distinguishability. In addition to leveraging edge distributions, HAM employs the non-local block as a parallel branch to capture position-wise dependencies, so that complementary contextual and edge information is learned to enhance the discriminative capability of the network. In the experiments, three diverse datasets from multiple sensors and different imaging platforms are examined. The results indicate the efficacy and superiority of the proposed model, and the ablation study further demonstrates the effects of EDA.
Nevertheless, several challenging issues remain to be addressed. First of all, multi-modal data, such as DSM information and SAR data, need to be fused to improve semantic segmentation performance. Moreover, transferable models are of great concern for adaptively coping with increasingly diverse imaging sensors. Furthermore, the basic convolution units also have the potential to be optimized in convergence rate while producing a globally optimal solution, as previous work has validated [60]. In addition, beyond semantic segmentation, image fusion [61], image denoising [62] and image restoration [63] of remote sensing images also rely on feature extraction; extending the proposed module to these tasks should be promising yet challenging.