Article

Attention-Based Pyramid Network for Segmentation and Classification of High-Resolution and Hyperspectral Remote Sensing Images

1 Key Laboratory of Mountain Hazards and Surface Process, Institute of Mountain Hazards and Environment, Chinese Academy of Sciences, Chengdu 610041, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Bell Labs, Murray Hill, NJ 07974, USA
4 CAS Center for Excellence in Tibetan Plateau Earth Sciences, Chinese Academy of Sciences (CAS), Beijing 100101, China
5 School of Economics and Management, Southwest Jiaotong University, Chengdu 610031, China
* Author to whom correspondence should be addressed.
Remote Sens. 2020, 12(21), 3501; https://doi.org/10.3390/rs12213501
Submission received: 22 September 2020 / Revised: 19 October 2020 / Accepted: 20 October 2020 / Published: 24 October 2020
(This article belongs to the Special Issue Deep Learning for Remote Sensing Image Classification)

Abstract: Unlike conventional natural (RGB) images, the inherent large scale and complex structures of remote sensing images pose major challenges such as spatial object distribution diversity and spectral information extraction when existing models are directly applied for image classification. In this study, we develop an attention-based pyramid network for segmentation and classification of remote sensing datasets. Attention mechanisms are used to develop the following modules: (i) a novel and robust attention-based multi-scale fusion method effectively fuses useful spatial or spectral information at different and same scales; (ii) a region pyramid attention mechanism using region-based attention addresses the target geometric size diversity in large-scale remote sensing images; and (iii) cross-scale attention in our adaptive atrous spatial pyramid pooling network adapts to varied contents in a feature-embedded space. Different forms of feature fusion pyramid frameworks are established by combining these attention-based modules. First, a novel segmentation framework, called the heavy-weight spatial feature fusion pyramid network (FFPNet), is proposed to address the spatial problem of high-resolution remote sensing images. Second, an end-to-end spatial-spectral FFPNet is presented for classifying hyperspectral images. Experiments conducted on the ISPRS Vaihingen and ISPRS Potsdam high-resolution datasets demonstrate the competitive segmentation accuracy achieved by the proposed heavy-weight spatial FFPNet. Furthermore, experiments on the Indian Pines and the University of Pavia hyperspectral datasets indicate that the proposed spatial-spectral FFPNet outperforms the current state-of-the-art methods in hyperspectral image classification.

Graphical Abstract

1. Introduction

Supervised segmentation and classification are important processes in remote sensing image perception. Many socioeconomic and environmental applications, including urban and regional planning, hazard detection and avoidance, land use and land cover, as well as target mapping and tracking, can be handled by using suitable remote sensing data and effective classifiers [1,2]. With the development of modern remote sensing technology, a great deal of data with different spectral and spatial resolutions is currently available for different applications. Among these massive remote sensing data, high-resolution and hyperspectral images are two important types. High-resolution remote sensing images usually have rich spatial distribution information and a few spectral bands, which capture the detailed shape and appearance of objects [3]. Semantic segmentation is a powerful and promising scheme for assigning class labels to pixels in high-resolution images [4,5]. Hyperspectral images can capture hundreds of narrow spectral channels with an extremely fine spectral resolution, allowing accurate characterization of the electromagnetic spectrum of an object and facilitating a precise analysis of soils and materials [6]. Because each pixel can be considered a high-dimensional vector surrounded by a local spatial neighborhood, supervised spatial-spectral classification methods are suitable for hyperspectral images.
However, segmentation or classification of different types of remote sensing images is an exceedingly difficult process, which includes major challenges of spatial object distribution diversity (Figure 1) and spectral information extraction. Specifically, the following are the challenges with segmentation and classification of remote sensing images:
  • Missing pixels or occlusion of objects: different from traditional (RGB) imaging methods, remote sensing examines an area from a significantly long distance and gathers information and images remotely. Due to the large areas contained in one sample and the effects of the atmosphere, clouds, and shadows, missing pixels or occlusion of objects are inevitable problems in remote sensing images.
  • Geometric size diversity: because a single remote sensing image covers a large area containing many different objects (e.g., cars, trees, buildings, and roads in Figure 1), the geometric sizes of objects may vary greatly, and some objects are small and crowded.
  • High intra-class variance and low inter-class variance: this is a unique problem in remote sensing images and it inspires us to study superior methods for effectively fusing multiscale features. For example, in Figure 1, buildings commonly vary in shape, style, and scale, whereas low vegetation and impervious surfaces are similar in appearance.
  • Spectral information extraction: hyperspectral image datasets contain hundreds of spectral bands, and it is challenging to extract spectral information because of the similarity between the spectral bands of different classes and complexity of the spectral structure, leading to the Hughes phenomenon or curse of dimensionality [7]. More importantly, hyperspectral datasets usually contain a limited number of labeled samples, thus making it difficult to extract effective spectral information from hyperspectral images.
A. Review of Semantic Segmentation of High-Resolution Remote Sensing Images by Multiscale Feature Processing.
First, to solve the problem of spatial object distribution diversity in high-resolution images, it is necessary to effectively extract and fuse features at multiple scales. Recently, deep-learning methods have shown excellent performance in remote sensing image processing, especially deep convolutional neural networks (DCNNs), which have a strong ability to express multiscale features (such as FCNs [8], S-RA-FCN [9], DeepLabv3 [10], and DeepLabv3+ [11]). To date, many DCNN-based models for semantic segmentation of remote sensing images have been proposed. Sun and Wang [4] established a semantic segmentation scheme based on fully convolutional networks [8]. Wang et al. [12] proposed a gated network based on the information entropy of feature maps, which can effectively integrate local details with contextual information. Cascaded convolutional neural networks [5,13] were utilized for the segmentation of remote sensing images by successively aggregating contexts. Most recently, many multiscale context-augmented models [9,14,15] have been proposed to exploit contextual information in remote sensing images. Remote sensing target segmentation problems such as object occlusion, geometric size diversity, and small objects have attracted increasing research attention [16,17,18,19].
Further analysis of these multiscale/contextual feature fusion models reveals that their common objective is to establish an effective feature attention weight fusion method. Attention mechanisms are widely used for various tasks such as machine translation [20], scene classification, and semantic segmentation. The non-local network [21] first adopts a self-attention mechanism as a submodule for computer vision tasks. Recently, many attention-reinforced mechanisms [9,22,23] have been proposed on the basis of non-local operation in semantic segmentation. Attention U-Net [24] learns to suppress irrelevant areas in an input image while highlighting useful features for a specific task on the basis of cross-layer self-attention. CCNet [25] harvests the contextual information of all the positions in one image by stacking two serial criss-cross attention modules. ACFNet [26] is a coarse-to-fine segmentation network based on the attention class feature module, which can be embedded in any base network. Most recently, various self-attention mechanisms have proven to be effective for solving the problem of multiscale feature fusion in feature pyramid-based models  [27,28,29,30].
In summary, the above-mentioned multiscale feature fusion models based on attention mechanisms apply convolutional neural networks (CNNs) in three-band data, which have achieved significant breakthroughs in semantic segmentation. However, these models still cannot effectively solve the problem of spatial distribution diversity in remote sensing for the following reasons:
(1) Most models only consider the fusion of two or three adjacent scales and do not further consider how to fuse features across more, or even all, of the different scale layers. Improved classification accuracy can be achieved by combining useful features at more scales.
(2) Although a few attention mechanisms (such as GFF [31]) consider the fusion of more layers, they do not successfully bridge the semantic gaps between high- and low-level features. A detailed analysis of the different feature layers is given in Section 2.1.
(3) Novel attention mechanisms based on self-attention mainly focus on spatial and channel relations for semantic segmentation (such as the non-local network [21]). Regional relations are not considered for remote sensing images, and thus the relationships between object regions cannot be strengthened.
B. Review of Spatial-Spectral Classification of Hyperspectral Images by Multiscale Feature Processing.
To solve the problem of spectral information extraction in hyperspectral images and enhance classification performance, spatial-spectral classification methods have gained prominent application in hyperspectral image processing, mainly including handcrafted feature-based approaches [32,33,34,35] and deep learning methods. Since deep learning methods (especially DCNNs) have proven to be more advantageous in feature extraction and representation than traditional shallow learning methods, this paper mainly focuses on deep spatial-spectral feature extraction and representation by multiscale feature processing in DCNNs. A review of DCNN-based classification methods for spatial-spectral approaches is given in [6], including 1D or 2D CNNs [36,37], 2D + 1D CNNs [38], and 3D CNNs [39,40,41]. Although these methods achieve promising performance for hyperspectral classification, they cannot fully extract and represent features, because they utilize the features of only the last convolutional layer for classification without considering the multiscale features produced by the previous convolutional layers. To this end, Zhao et al. [42] proposed a multiple convolutional layer fusion framework to fuse features extracted from different convolutional layers; the fusion mainly involves majority voting or direct concatenation after applying a fully connected layer to each convolutional layer. CNNs with multiscale convolution (MS-CNNs) [43] were proposed to address the limited number of training samples and class variance differences in hyperspectral images by extracting deep multiscale features. By conducting experiments on three popular hyperspectral images, Imani and Ghassemian [44] demonstrated that although feature fusion methods are time-consuming, they can provide superior classification accuracy compared to other methods. Imani and Ghassemian [44] also showed that multiscale feature fusion has developed into one of the trends in hyperspectral image classification. Furthermore, attention mechanisms are used to extract and fuse contextual features. Haut et al. [45] were the first to develop a visual attention-driven mechanism for spatial-spectral hyperspectral image classification, applying the attention mechanism to residual neural networks. Mei et al. [46] proposed a spatial-spectral attention network for hyperspectral image classification that equips both an RNN and a CNN with attention mechanisms. However, these methods are only initial applications of multiscale fusion and the attention mechanism to hyperspectral datasets. There is still room for improvement in the following aspects of hyperspectral image classification:
(1) When dealing with the spatial neighborhood of a considered hyperspectral pixel, the semantic gap between multiscale convolutional layers is not considered, and simple fusion is not the most effective strategy.
(2) The spectral redundancy problem is not sufficiently considered in existing hyperspectral classification models. Given such a complex spectral distribution, there is exceedingly little work on extracting spectral information from coarse to fine (multiscale) scales through different channel dimensions.
C. Contributions.
Bearing the above challenges in mind, in this study, we propose an attention-based pyramid network by using a self-attention mechanism flexibly. Our model utilizes attention mechanisms in the following three areas:
(1) We propose attention-based multiscale fusion to fuse useful features at different and the same scales, achieving effective extraction and fusion of spatial multiscale information and extraction of spectral information from coarse to fine scales.
(2) We propose cross-scale attention in our adaptive atrous spatial pyramid pooling (adaptive-ASPP) network to adapt to varied contents in a feature-embedded space, leading to effective extraction of context features.
(3) We propose a region pyramid attention module based on region-based attention to address the target geometric size diversity in large-scale remote sensing images.
Through different combinations of these attention modules, different forms of feature fusion pyramid frameworks (two-layer and three-layer pyramids) are established. First, a novel and practical segmentation model, called the heavy-weight spatial feature fusion pyramid network (FFPNet), is proposed to solve the spatial object distribution diversity problem in high-resolution remote sensing images. The heavy-weight spatial FFPNet is a three-level feature fusion pyramid built on the basis of region pyramid attention and attention-based multiscale fusion modules. Furthermore, boundary-aware (BA) loss [47] is used to train the heavy-weight spatial FFPNet in an end-to-end manner. Second, a spatial-spectral FFPNet is developed to extract and integrate multiscale spatial features and multi-dimensional spectral features of hyperspectral images using the attention-based multiscale fusion module. The spatial-spectral FFPNet mainly consists of two modules: a light-weight spatial feature fusion pyramid (FFP) and a spectral FFP. The light-weight spatial FFP is a two-level pyramid, whose trainable parameters are less than one-third those of the heavy-weight spatial FFPNet. Thus, the light-weight module is suitable for a small number of labeled samples of the hyperspectral dataset. In addition, the spectral FFP, which is also a two-level pyramid, is proposed to better extract the spectral features from hyperspectral datasets by compressing spectral information from coarse to fine scales.
To evaluate the accuracy and efficiency of the proposed models, first, extensive experiments are conducted on two challenging high-resolution semantic segmentation benchmark datasets, namely the ISPRS (International Society for Photogrammetry and Remote Sensing) Vaihingen dataset and the ISPRS Potsdam dataset. The local experimental results demonstrate that the heavy-weight spatial FFPNet outperforms other predominant DCNN-based models (with DeepLabv3+ [11] considered as the baseline). In addition, the effectiveness and practicability of these novel attention-based mechanisms are demonstrated through an ablation study. Furthermore, we apply the spatial-spectral FFPNet to two popular hyperspectral datasets, namely the Indian Pines dataset and the University of Pavia dataset. The experimental results (with the well-known CNN model [40] considered as the baseline) indicate that the spatial-spectral FFPNet is more robust to small numbers of training samples and obtains state-of-the-art results under different training sample settings. Our proposed spatial-spectral FFPNet has an excellent ability to extract and express multiscale spatial and spectral information. It is worth noting that the spatial-spectral FFPNet with data enhancement is a better choice for hyperspectral image classification when the sample size is extremely small.

2. Proposed Spatial-Spectral FFPNet

2.1. Overview

In this study, we focus on the challenge of spatial and spectral distribution of remote sensing images in the “encoder–decoder” frameworks [9,11,12,48,49,50]. The encoder part is based on a convolutional model to generate a feature pyramid with different spatial levels or spectral dimensions. Then, the decoder fuses multiscale contextual features. The interaction of adjacent scales can be formulated as
$$F_l = H(f_l, f_{l+1}),$$
where $F_l$ is the fused feature at the $l$-th level, and $H$ represents a combination of multiplication [49,51], weighted sum [31], concatenation [50], attention mechanisms [27,48,52], and other operations [12].
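To make the abstract operator $H$ concrete, the following minimal PyTorch sketch instantiates $H$ as upsampling plus concatenation and a 1 × 1 convolution, one of the options listed above; the module and channel names are illustrative assumptions, not the implementation used in this paper.

```python
# A minimal sketch of the adjacent-scale fusion F_l = H(f_l, f_{l+1}), with H
# instantiated as upsample + concatenation + 1x1 convolution (names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentScaleFusion(nn.Module):
    def __init__(self, ch_l: int, ch_lp1: int, ch_out: int):
        super().__init__()
        self.proj = nn.Conv2d(ch_l + ch_lp1, ch_out, kernel_size=1)

    def forward(self, f_l, f_lp1):
        # Bring the coarser level l+1 to the spatial size of level l before fusing.
        f_lp1 = F.interpolate(f_lp1, size=f_l.shape[-2:], mode="bilinear", align_corners=False)
        return self.proj(torch.cat([f_l, f_lp1], dim=1))
```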
However, these operations cannot solve the problem of multiscale feature fusion of objects in remote sensing images. The main reason is that the feature maps from the lower layers are of high resolution but may contain excessive noise, while the high-level features lack sufficient spatial detail. Further, these integrated operations may suppress necessary details in the low-level features, and most of these fusion methods do not consider the large semantic gaps between the feature pyramid levels generated by the encoder. Furthermore, these operations do not consider effective extraction and fusion of multiscale spectral information in hyperspectral images.
Therefore, we propose a multi-feature fusion model based on attention mechanisms in this paper. Current attention mechanisms [9,22,24,53] are based on the non-local operation [21], which usually deals with spatial pixel and channel selections. These mechanisms cannot capture regional relations of objects and cannot effectively extract and integrate multiscale features in remote sensing images. To address these issues, three novel attention modules are proposed: (1) a region pyramid attention (RePyAtt) module effectively establishes relations between different object region features and relationships between local region features by applying a self-attention mechanism to different feature pyramid regions; (2) an adaptive-ASPP module adaptively selects different spatial receptive fields to tackle large appearance variations in remote sensing images by adding an adaptive attention mechanism to the ASPP [10,11]; (3) a multiscale attention fusion (MuAttFusion) module effectively fuses useful features at different scales and the same scale.
As shown in Figure 2, segmentation and classification schemes of remote sensing images are achieved through the different combinations of the proposed attention modules. First, for high-resolution images, most of the information is concentrated in spatial dimensions. The proposed heavy-weight spatial FFPNet segmentation model solves the spatial object distribution diversity problem in remote sensing images. We adopt ResNet-101 [54] pretrained on ImageNet [55] as the backbone of the segmentation model. A three-level feature fusion pyramid is designed as shown in Figure 2. In addition, the residual convolution (ResConv) module is used as the basic processing unit, while the adaptive-ASPP module is used to adaptively combine the context features generated from the ResNet-101 and ResConv.
Second, for hyperspectral images, the proposed spatial-spectral FFPNet extracts and integrates multiscale spatial and spectral features. Recalling Figure 2, the spatial-spectral FFPNet includes three parts: (1) multiscale spatial feature extraction with the light-weight spatial FFP; (2) multi-dimensional spectral feature extraction with the spectral FFP; (3) fusion of spatial and spectral features as well as classification prediction with fully connected layers. Specifically, the light-weight spatial FFP module is a shallow classification framework, which uses the blocks of VGGNet-16 [56]. It only has a two-level feature fusion pyramid based on MuAttFusion. In comparison, the trainable parameters of the light-weight spatial FFP module are less than one-third those of the heavy-weight spatial FFPNet. This is because of the small number of labeled samples in hyperspectral datasets: the more parameters a model has, the greater its capacity, but also the more labeled data it needs to prevent overfitting. Similarly, the spectral FFP module has a two-level feature fusion pyramid based on MuAttFusion, which reduces the number of parameters while capturing as much spectral information as possible.

2.2. Region Pyramid Attention Module

Currently, the soft attention-based methods mainly aim to capture long-range contextual dependencies on the basis of the non-local mechanism and its variants. However, the geometric size of different objects in remote sensing images varies significantly, so it is challenging to achieve regional dependencies of objects using existing models. Inspired by the ideas of the feature pyramid, we propose the region pyramid to address the target scale diversity. After this, we combine the region pyramid and self-attention to effectively establish dependencies between different object region features and relationships between local region features. We illustrate our approach via a simple schematic in Figure 3.

2.2.1. Region Pyramid

We partition the input feature maps into different regions via a chunk operation. The region block sizes defined in this article are {single-pixel level, 8 × 8 level, 4 × 4 level, 2 × 2 level, and 1 × 1 level}. In addition, we conduct an ablation study on different combinations, as detailed in Section 3.3.2. For each group of the pyramid, we first feed the region blocks into a global pooling layer to obtain the regional representations. Then, we concatenate the representations of the region blocks to generate a regional representation of the whole input feature. It is worth noting that the single-pixel level is sent directly to the self-attention module without the global pooling operation.
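As a concrete illustration of the chunk-and-pool step, the hedged PyTorch sketch below realizes one region-pyramid level with adaptive average pooling; the function and argument names are assumptions for illustration only.

```python
# Sketch of the region pyramid step: partition the feature map into grid x grid
# blocks and pool each block to a single regional descriptor. adaptive_avg_pool2d
# is one way to realize the chunk + global pooling described above.
import torch
import torch.nn.functional as F

def regional_representation(x: torch.Tensor, grid: int) -> torch.Tensor:
    """x: (B, C, H, W) feature map; grid: region level (e.g., 8, 4, 2, or 1).
    Returns (B, C, grid*grid) regional descriptors; the single-pixel level skips pooling."""
    pooled = F.adaptive_avg_pool2d(x, output_size=grid)   # (B, C, grid, grid)
    return pooled.flatten(start_dim=2)                    # (B, C, grid*grid)
```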

2.2.2. Self-Attention on the Regional Representation

To exploit more explicit regional dependencies of objects, we compute the self-attention representations within the regional representation. Self-attention consists of one 3 × 3 convolution and one 1 × 1 convolution, with the number of channels F / 2 and 1, respectively, where F denotes the number of channels of the input feature maps. Further experiments show that the parallel form of attention-weighted representations of different region groups can effectively enhance the dependencies across different region features and the relationships between local region features better than pixel-wise and channel-wise self-attention operators.
As illustrated in Group 3 of Figure 3, we first divide the input feature $X$ into $G$ ($2 \times 2$) partitions. Then, we concatenate the point statistics after global pooling to obtain the regional representation $X_{m3} \in \mathbb{R}^{F \times G}$. We apply self-attention on $X_{m3}$ as follows:
$$A_{m3} = \mathrm{softmax}(W_1 * X_{m3}), \quad Z_{m3} = A_{m3}\, f(X_{m3}) + X_{m3},$$
where $A_{m3} \in \mathbb{R}^{1 \times G}$ is an attention matrix based on the global information across all spectral bands, and $Z_{m3} \in \mathbb{R}^{F \times G}$ denotes the weighted output features. $W_1$ denotes the combined operation of one $3 \times 3$ convolution and one $1 \times 1$ convolution, $f(\cdot)$ represents a $1 \times 1$ convolution, and $*$ denotes convolution.
Finally, the output of the RePyAtt module is obtained by the weighted sum of different region groups, which is formulated as
$$\sum_{i=1}^{M} Up(Z_{mi}) \otimes X,$$
where $M$ represents the total number of groups in the region pyramid, $\otimes$ denotes region-wise multiplication, and $Up(\cdot)$ is the upsampling layer using nearest-neighbor interpolation.
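A hedged PyTorch sketch of one RePyAtt group is given below: attention is computed on the pooled regional grid with the 3 × 3 and 1 × 1 convolutions described above, a residual re-weighting follows, and the result is upsampled and multiplied region-wise with the input. All wiring details beyond the text (class names, exact normalization) are assumptions rather than the released code.

```python
# Sketch of one region pyramid attention group: regional pooling, self-attention
# over the region grid, residual re-weighting, upsampling, and region-wise
# multiplication with the input feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionGroup(nn.Module):
    def __init__(self, channels: int, grid: int):
        super().__init__()
        self.grid = grid
        self.attn = nn.Sequential(                     # W1: 3x3 conv (F/2 channels) + 1x1 conv (1 channel)
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.Conv2d(channels // 2, 1, 1),
        )
        self.f = nn.Conv2d(channels, channels, 1)      # f(.): 1x1 convolution

    def forward(self, x):                              # x: (B, F, H, W)
        xm = F.adaptive_avg_pool2d(x, self.grid)       # regional representation (B, F, g, g)
        a = self.attn(xm)                              # (B, 1, g, g)
        a = torch.softmax(a.flatten(2), dim=-1).view_as(a)
        zm = a * self.f(xm) + xm                       # Z_m = A_m f(X_m) + X_m
        up = F.interpolate(zm, size=x.shape[-2:], mode="nearest")
        return up * x                                  # region-wise re-weighted features

# The RePyAtt output sums several such groups, e.g. grids {4, 2, 1}:
# out = sum(group(x) for group in region_groups)
```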

2.3. Multi-Scale Attention Fusion

The main task of the proposed MuAttFusion module is to effectively integrate multiscale spatial and spectral features of different objects in remote sensing images. MuAttFusion selectively fuses same-layer, high-layer, and low-layer features by an adaptive attention method as shown in Figure 4.

2.3.1. Higher- and Lower-scales

The lower-layer branch propagates spatial information from the lower layers ( T < t ) to the current layer (t) by the downsampling aggregation module (DAM). As shown in Figure 4a, to minimize memory consumption, we first use a 1 × 1 convolutional layer to compress the incoming feature maps. To achieve a consistent size for all feature maps, low-level features are downsampled to the feature size of the current layer by using bilinear interpolation. To fully use the entire feature information, all lower-layer feature maps are concatenated. Introducing the lower layers into the current layer inadvertently passes noise as well. To tackle this, high-level ( T > t ) contextual information is simultaneously propagated into the current layer by the upsampling aggregation module (UAM). The UAM structure is similar to that of the DAM, as shown in Figure 4b.
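The following minimal sketch, written under the assumptions stated in its comments, illustrates the DAM as described: 1 × 1 compression of each lower-layer map, bilinear resizing to the current layer's size, and concatenation; the UAM would mirror it with upsampling.

```python
# Sketch of the downsampling aggregation module (DAM). Channel widths and the
# module interface are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsamplingAggregation(nn.Module):
    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        # One 1x1 compression convolution per incoming lower-layer feature map.
        self.compress = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list
        )

    def forward(self, lower_feats, target_size):
        # lower_feats: list of (B, C_i, H_i, W_i) maps from layers T < t.
        resized = [
            F.interpolate(conv(f), size=target_size, mode="bilinear", align_corners=False)
            for conv, f in zip(self.compress, lower_feats)
        ]
        return torch.cat(resized, dim=1)   # concatenated lower-layer context F_LL
```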

2.3.2. Attention Fuse Module

The lower-layer features, although refined, may contain some unnecessary background clutter, whereas in the higher-layer features, the detailed information may be oversuppressed in the current layer. To address these issues, we introduce an attention fuse (AttFuse) module, shown in Figure 4c. This module combines the features of these two branches by adaptive attention weights. Consider the two feature maps $F_{LL}$ and $F_{HL}$; the attention module concatenates them and feeds them through a set of convolution layers ($3 \times 3$ conv and $1 \times 1$ conv) and a sigmoid layer to produce an attention map with two channels, with each channel specifying the importance of the corresponding feature map. The attention maps are calculated as $A_f = \mathrm{sigmoid}(\mathrm{Concat}(F_{LL}, F_{HL}))$. The attention maps thus generated are then multiplied element-wise to produce the final higher- and lower-layer fusion feature maps: $F_f = A_{f1} F_{LL} + A_{f2} F_{HL}$. $F_f$ is a powerful and enriched multiscale feature that combines the advantages of the lower-layer features $F_{LL}$ and the higher-layer features $F_{HL}$.
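A small PyTorch sketch of the AttFuse step as described above follows; the channel widths and layer ordering inside the attention branch are illustrative assumptions.

```python
# Sketch of AttFuse: concatenate F_LL and F_HL, pass through 3x3 and 1x1
# convolutions and a sigmoid to obtain a two-channel attention map, then take
# the channel-weighted sum of the two inputs.
import torch
import torch.nn as nn

class AttFuse(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Conv2d(channels, 2, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_ll, f_hl):                      # both (B, C, H, W)
        a = self.attn(torch.cat([f_ll, f_hl], dim=1))   # (B, 2, H, W)
        # F_f = A_f1 * F_LL + A_f2 * F_HL
        return a[:, 0:1] * f_ll + a[:, 1:2] * f_hl
```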
Finally, the output feature $\tilde{F}_t$ of the MuAttFusion module is then fused with the same-layer features by the RePyAtt module: $\tilde{F}_t = \mathrm{Concat}(F_{SL}, F_f)$. It is worth noting that for the light-weight model, the output feature and the same-layer features are directly fused to reduce the model parameters.
To further refine the features and reduce network parameters, the ResConv block shown in Figure 5 is introduced. The ResConv block consists of one $1 \times 1$ convolution and two $3 \times 3$ dilated convolutions with dilation rates of 1 and 3. The $1 \times 1$ convolution reduces the number of channels, thereby reducing the network parameters, and the two $3 \times 3$ dilated convolutions deepen the network to enhance its ability to capture sophisticated features.
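The ResConv block can be sketched as follows; the residual skip connection and normalization placement are assumptions suggested by the block's name rather than details stated in the text.

```python
# Sketch of the ResConv block: 1x1 channel reduction followed by two 3x3 dilated
# convolutions (rates 1 and 3), with an assumed residual projection.
import torch.nn as nn

class ResConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, dilation=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=3, dilation=3),
            nn.BatchNorm2d(out_channels),
        )
        self.skip = nn.Conv2d(in_channels, out_channels, 1)  # assumed residual projection
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))
```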

2.4. Adaptive-ASPP Module

Objects within a remote sensing image typically have different sizes. Existing multiple-branch structures such as ASPP [10,11] and DenseASPP [57] learn features using filters of different dilation rates in order to adapt to these scales. However, these approaches share the same problem: different local regions may correspond to objects with different scales and geometric variations. Thus, spatially adaptive filters are desired at different scales to tackle large feature variations in remote sensing images.
Toward this end, inspired by the MuAttFusion module described in Section 2.3, an adaptive-ASPP is designed to adapt to varied contents. The core of adaptive-ASPP is to adjust the combination weights for different contents in a feature-embedded space. CASINet [30] was proposed to solve this problem; it uses a non-local operation to achieve adaptive information interaction across scales. However, the non-local operation [21] was designed to exploit long-range contexts for feature refinement, and its computational cost is high. The non-local operation is not well suited to cross-scale attention problems; this is also verified by our experiments. Different from the non-local operation, we propose a novel cross-scale attention (CrsAtt) module based on the self-attention mechanism.

Cross-Scale Attention Module

The structure of the proposed CrsAtt module is shown in the top of Figure 6. CrsAtt first uses two different scales to obtain the attention coefficients; then, it adaptively adjusts different scale feature weights by element-wise multiplication of the input scale feature maps and attention coefficients.
As depicted in Figure 6, consider five intermediate feature maps $\{X_1, X_2, X_3, X_4, X_5\}$ obtained from the five branches of the ASPP, with each $X_i \in \mathbb{R}^{H \times W \times C}$ (except $X_5$, which is obtained by image pooling of the features). Information interaction is performed across the four scale features $X_1, X_2, X_3, X_4$, with each scale being a feature node. Then, CrsAtt operations are performed on the four features. The feature of the $i$-th scale is calculated as
$$A_{ji} = \sigma_2\!\left(\varphi^{T} \sigma_1\!\left(W_g^{T} * X_j + W_x^{T} * X_i\right)\right), \quad \hat{X}_i = \sum_{j=1, j \neq i}^{4} A_{ji} X_i + X_i,$$
where $\sigma_1(x) = \max(0, x)$ and $\sigma_2(x) = \frac{1}{1 + e^{-x}}$ correspond to the ReLU and Sigmoid activation functions, respectively, and $*$ denotes a channel-wise $1 \times 1 \times 1$ convolutional layer parameterized by $W_x \in \mathbb{R}^{C \times C_{int}}$ or $W_g \in \mathbb{R}^{C \times C_{int}}$. In addition, $\varphi^{T} \in \mathbb{R}^{C_{int} \times 1}$ is computed using a channel-wise $1 \times 1 \times 1$ convolution.
Finally, the output of the adaptive-ASPP is obtained by concatenating $\{\hat{X}_1, \hat{X}_2, \hat{X}_3, \hat{X}_4, X_5\}$.
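A hedged sketch of the CrsAtt operation is shown below: attention coefficients are computed from pairs of scale features with 1 × 1 convolutions, ReLU, and a sigmoid, and each scale is refined by the attention-weighted residual sum over the other ASPP branches. The intermediate width C_int and the module interface (including whether weights are shared across scale pairs) are assumptions.

```python
# Sketch of cross-scale attention (CrsAtt) over the four dilated ASPP branches.
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    def __init__(self, channels: int, inter_channels: int):
        super().__init__()
        self.w_g = nn.Conv2d(channels, inter_channels, 1)
        self.w_x = nn.Conv2d(channels, inter_channels, 1)
        self.phi = nn.Conv2d(inter_channels, 1, 1)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def coefficient(self, x_j, x_i):
        # A_ji = sigmoid(phi^T ReLU(W_g^T * X_j + W_x^T * X_i))
        return self.sigmoid(self.phi(self.relu(self.w_g(x_j) + self.w_x(x_i))))

    def forward(self, scales):                  # list [X_1 .. X_4], each (B, C, H, W)
        refined = []
        for i, x_i in enumerate(scales):
            agg = sum(self.coefficient(x_j, x_i) * x_i
                      for j, x_j in enumerate(scales) if j != i)
            refined.append(agg + x_i)           # X_i_hat
        return refined                          # concatenated with X_5 downstream
```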

2.5. Heavy-Weight Spatial FFPNet Model

The heavy-weight spatial FFPNet model is achieved through the combination of the three attention-based modules introduced in the previous sections. The configurations of the three-level heavy-weight FFPNet are shown in Table 1. Concretely, consider an input image $X \in \mathbb{R}^{C \times H \times W}$, in which $C$, $H$, and $W$ denote the number of channels, height, and width of the image, respectively. First, the image is fed into ResNet-101 [54] pretrained on the ImageNet dataset [55] to generate feature maps at different scales. In the first level of the pyramid, the features from the four stages of the backbone are fed into ResConv to generate different scale feature maps $x_1$, $x_2$, $x_3$, and $x_4$, each with 256 channels. In addition, the output of the backbone is fed into the adaptive-ASPP module to generate the feature map $x_5$, which adaptively combines these context features. In the second level of the pyramid, the intermediate features $x_2$ and $x_3$ are sent to RePyAtt based on region-based attention; $x_6$ and $x_7$ are then generated after MuAttFusion. In the third level of the pyramid, the final predicted segmentation map is generated after applying MuAttFusion to $x_6$, $x_7$, and $x_5$ again. Furthermore, BA loss [47] is utilized to train the heavy-weight spatial FFPNet in an end-to-end manner to optimize the model parameters. As a simple modification of the cross-entropy loss, the BA loss addresses the issue that pixels surrounding object boundaries are hard to predict.

2.6. Spatial-Spectral FFPNet Model

To maximize the use of hyperspectral spatial and spectral information, instead of dealing with the hypercube as a whole, the proposed spatial-spectral FFPNet model includes two CNN modules: the light-weight spatial FFP module for learning multiscale spatial features and the spectral FFP module for extracting spectral features along multiple dimensions. The features from the two modules are then concatenated and fed to a fully connected classifier to perform spatial-spectral classification.

2.6.1. Light-Weight Spatial Feature Fusion Pyramid Module

The light-weight spatial FFP module is a relatively shallow spatial feature extraction framework, which only uses VGGNet-16 [56] as the backbone; the configurations of the two-level light-weight spatial FFP module are shown in Table 2. Compared with the heavy-weight spatial FFPNet, the light-weight module has only 24.8 million trainable parameters owing to the small number of labeled hyperspectral samples. Furthermore, MuAttFusion is utilized to fuse the useful features from $x_1$, $x_2$, and $x_3$, which are generated by the backbone followed by ResConv.

2.6.2. Spectral Feature Fusion Pyramid Module

As shown in Figure 2, the spectral module uses multiple convolutional kernels to automatically extract features from fine to coarse scales as the convolutional layers progress. Similarly, to solve the problem of spectral redundancy in hyperspectral images, the spectral information can be compressed through different channel dimensions from coarse to fine scales, and the useful features at multiple scales are selected and merged by the attention mechanism. Thus, the spectral FFP module extracts the spectral features of hyperspectral data more effectively. The configurations of the two-level spectral FFP module are presented in Table 3. Specifically, the multiscale features can be divided into three stages by different channels, with depths of 64, 32, and 16. Every stage contains a $3 \times 3$ convolutional layer to reduce the dimension of the features and a $1 \times 1$ convolutional layer to further enhance the expression ability of the spectral features. MuAttFusion is then harnessed to extract and combine the useful features generated from the three stages.
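The spectral FFP stages can be sketched as below, with three channel depths (64, 32, and 16), each combining a 3 × 3 convolution for dimension reduction and a 1 × 1 convolution for refinement; treating the spectral patch as a (bands, h, w) tensor and the exact activation placement are assumptions.

```python
# Sketch of the spectral FFP stages; their outputs are fused downstream by MuAttFusion.
import torch.nn as nn

def spectral_stage(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),  # compress spectral dimension
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),            # refine spectral features
    )

class SpectralFFP(nn.Module):
    def __init__(self, bands: int):
        super().__init__()
        self.stage1 = spectral_stage(bands, 64)   # coarse spectral compression
        self.stage2 = spectral_stage(64, 32)
        self.stage3 = spectral_stage(32, 16)      # fine spectral representation

    def forward(self, x):                         # x: (B, bands, h, w) hyperspectral patch
        s1 = self.stage1(x)
        s2 = self.stage2(s1)
        s3 = self.stage3(s2)
        return s1, s2, s3
```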

2.6.3. Merge

In the spatial-spectral FFPNet, the last step is the combination of the output features of the light-weight spatial FFP and spectral FFP modules. The overall framework is shown in Figure 2. To effectively merge the spatial and spectral features as well as express the fused spatial-spectral features, first, the multiscale spatial features generated by the light-weight spatial FFP module and the multi-dimensional spectral features extracted by the spectral FFP module are converted into a one-dimensional tensor by a fully connected layer with the ReLU activation function. Then, the two types of features are directly merged through concatenation. Finally, another fully connected layer with the ReLU activation function is used to further refine and represent the combined spectral–spatial features, and a softmax activation layer predicts the probability distribution of each class.
Furthermore, to prevent the model from overfitting in case of limited hyperspectral datasets, the dropout method [58] is used for the fully connected layers. Specifically, the dropout method randomly selects hidden neurons as zero with a probability of 0.5. These dropped neurons will not play a role in the forward and backward processes of the model.
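A minimal sketch of this merge-and-classify head, under assumed hidden sizes, is given below; during training one would typically feed the pre-softmax logits to a cross-entropy loss.

```python
# Sketch of the spatial-spectral merge step: flatten, project with FC + ReLU,
# concatenate, refine with another FC layer, apply dropout (p = 0.5), and
# predict class probabilities with softmax.
import torch
import torch.nn as nn

class SpatialSpectralHead(nn.Module):
    def __init__(self, spatial_dim: int, spectral_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.fc_spatial = nn.Sequential(nn.Linear(spatial_dim, hidden), nn.ReLU(inplace=True))
        self.fc_spectral = nn.Sequential(nn.Linear(spectral_dim, hidden), nn.ReLU(inplace=True))
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, spatial_feat, spectral_feat):
        merged = torch.cat([self.fc_spatial(spatial_feat.flatten(1)),
                            self.fc_spectral(spectral_feat.flatten(1))], dim=1)
        return torch.softmax(self.classifier(merged), dim=1)
```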

3. Experiments

Numerical experiments were carried out on two high-resolution remote sensing datasets, namely ISPRS Vaihingen dataset (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html) and ISPRS Potsdam dataset (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html), to validate the effectiveness of the heavy-weight spatial FFPNet segmentation model. Furthermore, in order to evaluate the performance of our newly presented spatial-spectral FFPNet classification architecture, two popular hyperspectral image datasets, namely the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) Indian Pines dataset (http://www.ehu.eus/ccwintco/index.php/Hyperspectral-Remote-Sensing-Scenes#Indian-Pines) and the University of Pavia dataset (http://www.ehu.eus/ccwintco/index.php/Hyperspectral-Remote-Sensing-Scenes#Pavia-University-scene) were utilized.

3.1. Dataset Description and Baselines

3.1.1. High-Resolution Datasets

The Vaihingen dataset consists of 3-band IRRG (Infrared, Red, and Green) image data acquired by airborne sensors. There are 33 images with a spatial resolution of 9 cm, and the average size of each image is 2494 × 2064 pixels. All images are labeled with five foreground classes (impervious surfaces, buildings, low vegetation, trees, and cars) and one background class (see Figure 7). Following the setup of the online test, 16 images were used as the training set, while the remaining 17 images (Image IDs: 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35, 38) were used as the testing set. We randomly sampled 512 × 512 patches from the 33 images, and the patches were processed at the training stage with normalization, random horizontal flips, and Gaussian blur. Finally, 6200 images were generated for the training set and 1000 images for the testing set. The Potsdam dataset is composed of 38 images with a spatial resolution of 5 cm and consists of IRRGB channels; IRRG and RGB images are utilized for the segmentation model. The size of all images is 6000 × 6000 pixels, and they are annotated with pixel-level labels of the same six classes as the Vaihingen dataset. To train and evaluate the heavy-weight spatial FFPNet, we also followed the data partition used by the benchmark methods: 24 images were selected as the training set and 14 images (Image IDs: 02_13, 02_14, 03_13, 03_14, 04_13, 04_14, 04_15, 05_13, 05_14, 05_15, 06_13, 06_14, 06_15, 07_13) as the testing set. We randomly sampled 512 × 512 patches from the original images and generated 14,000 patches for the training set and 3000 patches for the testing set. As with the Vaihingen dataset, the patches were processed at the training stage with normalization, random horizontal flips, and Gaussian blur.

3.1.2. Hyperspectral Datasets

The AVIRIS Indian Pines (IP) dataset was gathered by the AVIRIS sensor. The image contains 224 spectral channels in the 400–2500 nm region of the visible and infrared spectrum. As a conventional setup, 24 spectral bands were removed owing to noise, and the remaining 200 bands were utilized for the experiments. The image has a size of 145 × 145 pixels with a spatial resolution of 20 m per pixel, and its ground truth contains sixteen different land-cover classes, as shown in Figure 7. In total, 10,249 pixels were selected for manual labeling according to the ground truth map. The University of Pavia (UP) dataset was recorded by the ROSIS-03 sensor. The image consists of 610 × 340 pixels and 115 bands with a spectral coverage ranging from 0.43 to 0.86 μm. After removing noisy bands, 103 bands were used. Nine land-cover classes were considered in the ground truth of this image, as shown in Figure 7.

3.1.3. Baselines

In order to evaluate the heavy-weight spatial FFPNet segmentation model, we chose DeepLabv3+ [11] as our baseline. DeepLabv3+ fuses multiscale features by introducing low-level features to refine high-level features and thus achieves state-of-the-art performance on many public datasets. Furthermore, for the spatial-spectral classification model, a widely recognized deep convolutional neural network proposed by Paoletti et al. [40] was utilized as the baseline for hyperspectral image classification. This CNN is a 3-D network using spatial and spectral information, which performs accurately and efficiently on hyperspectral datasets.

3.2. Evaluation Metrics

To evaluate the performance of the proposed models for segmentation and classification of remote sensing images, the F1 score, Overall Accuracy (OA), Average Accuracy (AA), mean Intersection over Union (mIoU), and Kappa coefficient were used. First, to measure the performance of the heavy-weight spatial FFPNet, the F1 score was calculated for the foreground object classes; for a comprehensive comparison of the heavy-weight spatial FFPNet with different models, OA over all categories (including background) and mIoU were adopted. In addition, following a previous study [28], the ground truth with eroded boundaries was utilized for the evaluation in order to reduce the impact of uncertain border definitions. Second, AA, OA, and the Kappa coefficient were used to measure the performance of the spatial-spectral FFPNet classification. More importantly, the reported results for all experiments are the averages of three runs on the training and testing sets.
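For reference, the sketch below computes OA, AA, the Kappa coefficient, and mIoU from a multi-class confusion matrix; it is a generic implementation, not the evaluation script used for the reported numbers.

```python
# Generic classification/segmentation metrics from a confusion matrix
# (rows = ground truth classes, columns = predicted classes).
import numpy as np

def classification_metrics(confusion: np.ndarray) -> dict:
    total = confusion.sum()
    per_class_acc = np.diag(confusion) / np.maximum(confusion.sum(axis=1), 1)
    oa = np.diag(confusion).sum() / total                          # Overall Accuracy
    aa = per_class_acc.mean()                                      # Average Accuracy
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1 - expected)                       # Kappa coefficient
    iou = np.diag(confusion) / np.maximum(
        confusion.sum(axis=1) + confusion.sum(axis=0) - np.diag(confusion), 1)
    return {"OA": oa, "AA": aa, "Kappa": kappa, "mIoU": iou.mean()}
```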

3.3. Heavy-Weight Spatial FFPNet Evaluation on High-Resolution Datasets

3.3.1. Implementation Details

Stochastic gradient descent (SGD) was employed as the optimizer of the heavy-weight spatial FFPNet with momentum = 0.9 and weight decay = $5 \times 10^{-4}$. The initial learning rate was $2.5 \times 10^{-4}$, and a poly learning rate policy was used for the optimizer. The mini-batch size was set to 4 and the maximum number of epochs was 10. In addition, batch normalization and the ReLU function were used in all layers, except for the output layers, where softmax units were used. Furthermore, inspired by the baseline model (DeepLabv3+), the dropout method (with probabilities of 0.5 and 0.1) was employed in the last layers of the decoder module to effectively avoid overfitting. We used PyTorch for implementation on a high-performance computing cluster with one NVIDIA Titan RTX 24 GB GPU. The average training time for each experiment was approximately 20 h. Code will be made publicly available.
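The optimizer settings above can be reproduced with a sketch such as the following; the poly power of 0.9 is an assumed default, as the exponent is not stated in the text.

```python
# SGD with momentum 0.9, weight decay 5e-4, base learning rate 2.5e-4, and a
# poly learning-rate policy: lr = base_lr * (1 - iter / max_iter) ** power.
import torch

def build_optimizer_and_scheduler(model, max_iters: int, base_lr: float = 2.5e-4, power: float = 0.9):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: (1 - it / max_iters) ** power)
    return optimizer, scheduler
```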

3.3.2. Experiments on Vaihingen Dataset

Ablation study. In the proposed heavy-weight spatial FFPNet, three novel attention-based modules extract and integrate multiscale features adaptively and effectively in remote sensing images. To verify the effectiveness of these attention-based modules, extensive experiments under different settings were conducted, and the results are listed in Table 4. In addition, to study the adaptability of different combinations of region sizes to object features, we investigated different combinations of groups in the RePyAtt module, and the results are presented in Table 5.
As can be seen in Table 4, the three novel attention modules result in significant improvements compared to the baseline (DeepLabv3+ with ResNet-101). Specifically, the use of the feature fusion pyramid framework yields an OA of 90.64% and an mIoU of 80.37%, improvements of 2.55% and 7.54%, respectively, over the values yielded by the baseline. However, employing the ASPP module in the framework leads to a slight decline in model performance, mainly because the ASPP module cannot adequately handle context feature fusion under the geometric variations in remote sensing images. By contrast, the adaptive-ASPP adapts to varied contents through the CrsAtt module, improving the performance over the baseline by 2.82% and 8.51% in terms of OA and mIoU, respectively. This demonstrates that the adaptive-ASPP can be widely used in other related models in cases of large appearance variations. Furthermore, the introduction of the BA loss improves the performance by approximately 0.20% and 0.51% in terms of OA and mIoU compared with the CE loss. Overall, the novel heavy-weight spatial FFPNet is of great benefit in dealing with the spatial object distribution diversity challenge in remote sensing images.
We further studied the effects of different combinations of groups in the RePyAtt module. Table 5 shows that the performance is optimal and robust when the combination is set to { 4 × 4 level, 2 × 2 level, and 1 × 1 level}, in which the best OA and mIoU of 90.91% and 81.33%, respectively, are achieved. In addition, it can be observed that more combinations of groups do not necessarily result in better model performance. Thus, an optimal combination can be used to more effectively achieve region-wise dependencies of objects, resulting in improved model performance.
Comparison with existing methods. To evaluate the effectiveness of the segmentation model, we compare our model with other leading benchmark models and the results are shown in Table 6. Specifically, FCNs [8] connect multiscale features by the skip architecture. DeepLabv3 [10] adopts the ASPP module with global pooling operation to capture contextual features. UZ_1 [59] is a CNN model based on encoder–decoder. Attention U-Net [24] fuses the adjacent-layer features based on attention mechanisms. DeepLabv3+ [11] fuses multiscale features by introducing low-level features to refine high-level features based on DeepLabv3 [10]. RefineNet [60] refines low-resolution semantic features with fine-grained low-level features in a recursive manner to generate high-resolution semantic feature maps. S-RA-FCN [9] produces relation-augmented feature representations by the spatial and channel relation modules. ONE_7 [61] fuses the output of two multiscale SegNets [62]. DANet [22] adaptively integrates local features with their global dependencies by two types of attention modules. GSN5 [12] utilizes entropy as a gate function to select features. DLR_10 [63] combines boundary detection with SegNet and FCN. PSPNet [64] exploits the capability of global context information by the pyramid pooling module. Importantly, most of the models adopt the ResNet-101 as the backbone.
Table 6 indicates that the heavy-weight spatial FFPNet outperforms other context aggregation or attention-based models in terms of OA. Specifically, we can see that the heavy-weight spatial FFPNet outperforms the baseline model (DeepLabv3+ [11]), with 2.82% and 8.5% increase in OA and mIoU, respectively. Importantly, the qualitative comparisons between our proposed model and the baseline model are provided in Figure 8. The quantitative and qualitative analyses indicate that our method outperforms the DeepLabv3+ [11] method by a large margin. Furthermore, compared with PSPNet [64], the heavy-weight spatial FFPNet achieves 0.06% improvement in OA but slightly inferior results in some categories such as low vegetation, trees, and cars. However, compared with other high-performance models (such as HMANet [28]) on the Vaihingen dataset, the performance of the heavy-weight spatial FFPNet can be further improved by adopting some strategies such as data augmentation and left-right flipping counterparts during inference.

3.3.3. Experiments on Potsdam Dataset

In order to further validate the effectiveness of the segmentation model, we conducted experiments on the Potsdam dataset. Table 7 shows the result of a comparison of the proposed model with other excellent models, including DAT_6 [65], an end-to-end self-cascaded network CASIA3 [5], and the other methods used for the comparison on the Vaihingen dataset. Notably, the heavy-weight spatial FFPNet achieves an OA of 92.44% and mIoU of 86.20%, which are 0.02% and 1.32% improvement, respectively, compared to the values achieved by PSPNet. In addition, our F1 score of cars is much higher than that achieved by PSPNet and exceeds the second-best value achieved by CCNet by 1.03%. Thus, the effectiveness of our proposed model in handling multiscale feature fusion for the segmentation of remote sensing images is demonstrated.
In addition, the qualitative comparison results are shown in Figure 9. The third and fourth columns represent the results of the baseline and the proposed models, respectively. The visualization results in the fourth column are clearly better than those in the third column. Moreover, as Table 7 indicates, the proposed model shows an improvement of 1.5% in OA and 4.51% in mIoU compared with the values achieved by DeepLabv3+ [11]. Therefore, it is further demonstrated that the heavy-weight spatial FFPNet can effectively extract and fuse the spatial features of remote sensing images, thereby improving the segmentation performance on high-resolution images.
Comparison between IRRG and RGB. Generally, high-resolution remote sensing images are RGB band data. Therefore, it is necessary to compare the two available input modes for the heavy-weight spatial FFPNet, namely IRRG and RGB. The results in the last two rows of Table 7 show that the overall results of the two modes differ only slightly. Specifically, the IRRG images improve the average performance by about 0.5% compared to the RGB images. Again, the visualization results in Figure 10 show that the IRRG images yield better segmentation maps.

3.4. Spatial-Spectral FFPNet Evaluation on Hyperspectral Datasets

3.4.1. Implementation Details

Data preprocessing. When the hyperspectral images are divided into training and testing sets, the imbalance between categories brings difficulties for model training. For example, in the IP dataset, the “Oats” class has 20 labeled pixels, while the “Soybean-mintill” class has 2455 labeled pixels. To ensure the comparability of the data as much as possible, the same data processing strategy as the baseline [40] was used to deal with the imbalance of categories, that is, optimally selecting the number of samples of each category. Importantly, some lower data setups were added to highlight the superiority of the spatial-spectral FFPNet. Specifically, a maximum number of samples per category was set as a threshold to select the number of samples, that is, 50, 100, 150, and 200 per category. For the classes with more samples, we simply split the samples on the basis of the threshold. On the contrary, when the number of class samples was less than twice the threshold, 50% of the samples of the corresponding class were selected. Detailed training sample schemes for the IP dataset are listed in Table 8, and the same schemes are adopted for the UP dataset, as shown in Table 9. It is worth noting that the number of samples in both datasets is less than or equal to that in the baseline [40]; we conducted more sampling schemes for the subsequent experiments.
Model setup. The proposed spatial-spectral FFPNet was initialized with two strategies: the backbone of the light-weight spatial FFP module was initialized with VGGNet-16 pretrained on ImageNet [55]. Importantly, the parameters (weights and biases) of the first convolutional layer in the pretrained network only cover three channels, while the hyperspectral classification task requires p-channel inputs (for example, 200 channels for the IP dataset and 103 channels for the UP dataset). Therefore, we copied the initialization parameters of the first convolutional layer in the pretrained network until the p-channel inputs were reached, similarly to CoinNet [66]. By contrast, the spectral FFP module was initialized with the Kaiming uniform distribution. In addition, different from the high-resolution experiments, the cross-entropy loss function was used to optimize the spatial-spectral FFPNet parameters because of the limited quantity of labeled hyperspectral samples. Batch normalization and ReLU were used in all layers, except for the classifier layer. Adam [67] was employed as the optimizer with a learning rate of 0.001. The mini-batch size was set to 24 for both datasets, and the maximum number of epochs was 200. The experiments were conducted on a high-performance computing cluster with one NVIDIA Titan RTX 24 GB GPU.
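The first-layer weight copying described above can be sketched as follows; the tiling scheme is an assumption in the spirit of CoinNet, not the authors' exact procedure.

```python
# Inflate the pretrained 3-channel first convolution to p hyperspectral bands by
# tiling its kernel along the input-channel dimension.
import torch
import torch.nn as nn

def inflate_first_conv(pretrained_conv: nn.Conv2d, num_bands: int) -> nn.Conv2d:
    out_ch, _, kh, kw = pretrained_conv.weight.shape          # e.g., (64, 3, 3, 3) for VGGNet-16
    new_conv = nn.Conv2d(num_bands, out_ch, (kh, kw),
                         stride=pretrained_conv.stride, padding=pretrained_conv.padding)
    repeats = -(-num_bands // 3)                              # ceil(num_bands / 3)
    with torch.no_grad():
        tiled = pretrained_conv.weight.repeat(1, repeats, 1, 1)[:, :num_bands]
        new_conv.weight.copy_(tiled)
        if pretrained_conv.bias is not None and new_conv.bias is not None:
            new_conv.bias.copy_(pretrained_conv.bias)
    return new_conv
```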

3.4.2. Ablation Study

In order to analyze the effectiveness of the proposed spatial-spectral FFPNet in hyperspectral image classification, two aspects are mainly considered. First, to analyze how the training set, including the number of samples and sample patch size, and data augmentation affect the performance of hyperspectral image classification, we conducted different ablation studies. Second, in order to analyze the impact of spatial and spectral models on the performance of hyperspectral image classification, the ablation experiments of the spatial FFPNet, spectral FFPNet, and spatial-spectral FFPNet were conducted for the IP and UP hyperspectral image datasets.
(1) Sample patch size. We conducted an ablation study for different sample patch sizes. For the IP dataset, patch sizes d = 9, 15, 19 and 29 were considered, and for the UP dataset, d = 9, 15, 21, and 27 were tested. The different patch sizes determine the different amount of spatial information that can be utilized for hyperspectral image classification. Table 10 shows the total training times and accuracy results with different patch sizes for a fixed number of samples (100) per category. On both datasets, as more pixels are added, more useful contextual spatial information could be utilized by the spatial-spectral FFPNet model. Thus, the model achieves better performance in the case of more spatial information while also spending more training time. However, as the patch size is further increased, such as d = 27 for the UP dataset, the model performance slightly decreases; this is because a patch containing too many other classes can detract from the target pixel. Specifically, for the IP data, d = 29 obtains the best performance, with an OA of 98.76%. However, the average training time for d = 29 is considerably longer than that for the other groups (almost twice that for d = 19). In terms of the accuracy to time ratio, d = 19 yields the best performance for the IP dataset, with an OA of 98.50% for an average training time of 14.66 min. For the UP dataset, d = 21 achieves the best result in terms of the accuracy–time ratio, resulting in an OA of 98.82% for an average training time of 8.16 min. In addition, d = 15 requires the minimum training time (7.06 min) to achieve an acceptable accuracy (96.41%).
(2) Samples per category. In order to evaluate the impact of the number of samples per category on the model performance, many experiments with different patch sizes and different numbers of training samples (50, 100, 150, and 200 per category) were conducted.
The classification accuracy results in terms of OA, AA, and kappa coefficients obtained for the IP dataset are presented in Table 11. Obviously, according to the results of each patch size (d = 9, 15, 19, and 29), as more training samples per category are added, the accuracy of the proposed model classification increases, and the training time also increases. Concretely, when the number of samples is small, the model can achieve a superior classification result; that is, with d = 9, 15, 19, and 29 and 50 samples per category, OA values of 82.12%, 86.44%, 89.79%, and 94.64%, respectively, are achieved. Therefore, it is confirmed that the spatial-spectral FFPNet model can fully use multiscale spatial and spectral information to achieve more robust and accurate end-to-end hyperspectral image classification with a small number of training samples. Furthermore, with 200 samples per category, the best OA value of 99.84% of all groups is achieved for d = 29, and the OA values for d = 9, 15, and 19 vary by not more than 0.8%. All of the groups with 200 samples per category attain values above 99%; this further shows the spatial-spectral FFPNet’s robustness and the ability to express and extract multiscale features.
The qualitative results obtained for the IP dataset for patch sizes of d = 9, 15, 19, and 29, with 50, 100, 150, and 200 samples per category, are provided in Figure 11, Figure 12, Figure 13 and Figure 14. First, the visualization results of the confusion matrix for each category indicate that as more training samples are added, the color of the diagonal area gets brighter, while the other areas become more uniformly blue. This indicates that the classification results of each class are improving. In addition, as the patch size increases, the accuracy of each class increases. However, when relatively adequate training samples are used in the network, the accuracy of each class is relatively similar (e.g., d = 9, 15, 19, and 29 with 200 samples per category, and d = 29 with 150 samples per category). Second, according to the classification maps acquired from each experiment, shown in Figure 11, Figure 12, Figure 13 and Figure 14, the best results are achieved with 200 samples per category, especially for d = 19 and 29; these are the most similar to the ground truth map of the IP image. Specifically, when the number of spatial pixels is small (d = 9, 15, and 19 with 50 samples per category), a small portion of the interior pixels of some category regions may be misclassified, especially for d = 9 with 50 samples per category. However, when the number of training samples per category increases to 100, these interior pixels are accurately classified. Furthermore, when the number of training samples is small, a small number of pixels near the edges are easily misclassified; we call this the “boundary error effect”. However, with increasing training samples, the boundary error effect gradually weakens or even disappears at 150 training samples per category. More importantly, the overall classification result is generally excellent even when the sample size is extremely small (i.e., for d = 9, 15, 19, and 29 with 50 samples per category). This demonstrates that the proposed model can better address the problem of overfitting when fewer hyperspectral samples are available.
Table 12 lists the results for the UP dataset. For every patch size, the model performance gradually improves as the number of training samples increases. The model is most sensitive when the sample size is very small: for d = 9 and 15 with 50 samples per category, the OA on the UP dataset is below 80%, but it rises above 90% when the number of samples per category is increased to 100. This sensitivity to a small training set relative to the number of model parameters can be mitigated by data enhancement methods such as random rotation and the addition of random noise; the effectiveness of data augmentation is discussed in the third ablation analysis. With 200 samples per category, d = 27 can be regarded as the best setting for the UP dataset, reaching an OA of 99.53%; however, the OA for d = 15 and 21 with 200 samples per category differs from that of d = 27 by less than 0.4%, so in terms of training time, d = 15 or d = 21 is a better choice.
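Throughout these experiments, each sample is the d × d spatial neighborhood (with all spectral bands) centered on a labeled pixel, and the network classifies the center pixel of the patch. A minimal NumPy sketch of this patch extraction is given below; the mirror padding at the image border and the final axis order are implementation assumptions of this illustration, not details taken from the released code.

```python
import numpy as np

def extract_patch(cube, row, col, d):
    """Return the d x d spatial neighborhood (all bands) centered on (row, col).

    cube: hyperspectral image of shape (H, W, B); d is assumed odd.
    Borders are handled with mirror padding, which is an implementation
    choice of this sketch rather than a detail taken from the paper.
    """
    r = d // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    patch = padded[row:row + d, col:col + d, :]          # centered window
    return np.transpose(patch, (2, 0, 1))                # (B, d, d) for the network
```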
(3) Data enhancement. We used random horizontal and vertical flips and random rotations (by 90°, 180°, and 270°) to augment the small hyperspectral datasets. To test the effectiveness of the spatial-spectral FFPNet with data enhancement when the training samples are severely limited, 50 samples per category were used for both the IP and UP datasets. It is worth noting that data enhancement was not used in the other experiments in this paper, so that the proposed model could be compared fairly with the baseline [40], which does not use data enhancement.
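A minimal sketch of the augmentation described above, applied to a patch of shape (bands, d, d), might look as follows; the 50% flip probabilities and the inclusion of the identity rotation are assumptions of this illustration, not details taken from the paper.

```python
import numpy as np

def augment_patch(patch, rng):
    """Randomly flip and rotate a hyperspectral patch of shape (bands, d, d).

    Mirrors the augmentation described in the text: random horizontal and
    vertical flips plus rotation in the spatial plane; k = 0 keeps the patch
    unchanged, k = 1, 2, 3 rotate it by 90, 180, and 270 degrees.
    """
    if rng.random() < 0.5:
        patch = patch[:, :, ::-1]              # horizontal flip
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]              # vertical flip
    k = rng.integers(0, 4)
    patch = np.rot90(patch, k, axes=(1, 2))    # rotate height/width axes
    return np.ascontiguousarray(patch)

# usage: rng = np.random.default_rng(0); augmented = augment_patch(patch, rng)
```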
Table 13 and Table 14 report the ablation results for data enhancement on the IP and UP datasets, respectively, with 50 samples per category (patch sizes d = 9, 15, 19, and 29 for the IP dataset and d = 9, 15, 21, and 27 for the UP dataset). The classification accuracy of the spatial-spectral FFPNet with data enhancement is clearly higher than that of the spatial-spectral FFPNet alone, with OA gains ranging from about 2% to 20% across the two datasets. The gain is larger on the UP dataset, which has a lower ratio of training samples to total samples. Specifically, with d = 9, data enhancement improves the OA by 8.20% on the IP dataset and by 19.93% on the UP dataset. As the spatial information increases (i.e., as the patch size grows), the advantage of the data-enhanced model gradually shrinks: the OA gain is 3.22% on the IP dataset with d = 29 and 2.71% on the UP dataset with d = 27. Thus, when only an extremely small amount of labeled hyperspectral data is available, the spatial-spectral FFPNet with data enhancement may be the best choice.
(4) Spatial FFPNet, spectral FFPNet, and spatial-spectral FFPNet. As described in the method section, the spectral FFPNet focuses on the extraction and fusion of multiscale spectral features, whereas the spatial FFPNet concentrates on the effective extraction and integration of contextual spatial features through the attention-based modules. To quantify the contribution of each branch, we conducted ablation studies comparing spatial-only, spectral-only, and spatial-spectral FFPNet models. The spatial-only and spectral-only models correspond, respectively, to the light-weight spatial FFP module and the spectral FFP module in Figure 2, each followed by a fully connected classifier; a schematic sketch of how the two branches are fused is given below.
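The following PyTorch sketch shows how the two feature branches can be concatenated and classified by fully connected layers. Here `spatial_net`, `spectral_net`, the hidden width of 512, and the dropout rate are placeholders for illustration and do not reproduce the exact configuration of the spatial-spectral FFPNet.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenate spatial and spectral feature vectors and classify them.

    spatial_net / spectral_net stand in for the light-weight spatial FFP and
    spectral FFP modules; the hidden size and two-layer classifier are
    placeholders, not the paper's exact configuration.
    """
    def __init__(self, spatial_net, spectral_net, spatial_dim, spectral_dim, n_classes):
        super().__init__()
        self.spatial_net = spatial_net
        self.spectral_net = spectral_net
        self.classifier = nn.Sequential(
            nn.Linear(spatial_dim + spectral_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, n_classes),
        )

    def forward(self, patch):
        f_spatial = torch.flatten(self.spatial_net(patch), 1)    # (N, spatial_dim)
        f_spectral = torch.flatten(self.spectral_net(patch), 1)  # (N, spectral_dim)
        # spatial-only or spectral-only variants simply drop one of the branches
        return self.classifier(torch.cat([f_spatial, f_spectral], dim=1))
```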
Table 15 and Table 16 present the comparison of the three variants on the IP and UP datasets, respectively, with patch sizes d = 9, 15, 21, and 27 and 100 samples per category. On both datasets, the spatial-spectral FFPNet clearly outperforms the spatial-only and spectral-only variants, especially when little spatial information is available. Specifically, for d = 9 on the IP dataset, the OA of the spatial-spectral model is 3.15% and 6.77% higher than that of the spatial-only and spectral-only models, respectively. The spatial-only and spectral-only models are also less stable and less accurate on the UP dataset when the spatial neighborhood is small: for d = 9, the spatial-spectral model improves the OA by 12.39% and 33.26% over the spatial-only and spectral-only models, respectively. These results demonstrate that spatial information (the neighboring pixels) and spectral information should be considered jointly to obtain good classification results, and that the proposed spatial-spectral FFPNet can effectively extract and fuse multiscale spatial and spectral features to achieve high accuracy even with few training samples. However, Table 15 and Table 16 also indicate that when the number of training samples is sufficient, the performance gap between the three models is not large, especially on the IP dataset.

3.4.3. Comparison with Existing CNN Methods

To further verify the effectiveness and superiority of the spatial-spectral FFPNet, we compare it with several state-of-the-art and well-known CNN models proposed for hyperspectral classification in recent years. The configurations and training settings of these models are briefly described as follows:
(1) CNNs by Chen et al. [41]: The model configuration includes 1-D, 2-D, and 3-D CNNs. The 1-D CNN, which extracts only spectral information, consists of five convolutional layers with ReLU and five pooling layers for the IP dataset, and three convolutional layers with ReLU and three pooling layers for the UP dataset. The 2-D CNN contains three 2-D convolutional layers and two pooling layers, with dropout applied to the last two convolutional layers to prevent overfitting. The 3-D CNN is designed to extract spatial and spectral information jointly; it includes three 3-D convolutional layers with ReLU activations and also uses dropout to prevent overfitting. These models represent an early application of deep CNNs to hyperspectral image classification; although simple and effective, they do not account for the spatial and spectral distribution diversity of hyperspectral data. Regarding the training settings, 1765 labeled samples were used as the training set for the IP dataset and 3930 for the UP dataset, and experiments were conducted with different patch sizes (d = 9, 19, and 29 for the IP dataset; d = 15, 21, and 27 for the UP dataset), following the baseline [40].
(2) CNN by Paoletti et al. [40]: This CNN serves as the baseline model for our hyperspectral experiments. In its configuration, three 3-D convolutional layers (600 × 5 × 5 × 200, 200 × 3 × 3 × 600, and 200 × 1 × 1 × 200 for the IP dataset; 380 × 7 × 7 × 103, 350 × 5 × 5 × 380, and 350 × 1 × 1 × 350 for the UP dataset) extract the classification features, and each convolutional layer is followed by a ReLU activation. The first two convolutional layers are each followed by 2 × 2 max pooling to reduce the spatial resolution, and dropout is applied to them to prevent overfitting, with probability 0.1 for the IP dataset and 0.2 for the UP dataset. A four-layer fully connected classifier then classifies the extracted features. Although this 3-D model requires fewer parameters and layers, it cannot address the diversity of spatial object distributions in hyperspectral data and does not make full use of the spectral information; moreover, 3-D convolution treats hyperspectral data as uniform volumetric data, whereas the actual object distribution is asymmetric. Regarding the training settings, detailed experiments were conducted with different numbers of training samples and different patch sizes on the IP and UP datasets; the best results of the baseline (d = 9, 19, and 29 with 2466 training samples for the IP dataset; d = 15, 21, and 27 with 1800 training samples for the UP dataset) were used for the model comparison.
(3) Attention networks [45]: A visual attention-driven mechanism applied to residual networks (ResNets) facilitates spatial-spectral hyperspectral image classification. The attention mechanism is integrated into the residual part of the ResNet and consists of two components, a trunk and a mask: the trunk is a stack of residual blocks that extract features from the data, while the mask is a symmetric downsampler-upsampler structure that highlights the useful features of the current layer. Although the attention mechanism is successfully applied to the ResNet, this approach does not solve the problems of spatial distribution (the different geometric shapes of objects) and spectral redundancy in hyperspectral data. Regarding the training settings, the network was optimized with 1537 training samples and 300 epochs for the IP dataset and 4278 training samples and 300 epochs for the UP dataset.
(4) Multiple CNN fusion [42]: Although multiscale spectral and spatial feature fusion models are time-consuming compared with other models, they can achieve superior classification accuracy and have therefore been gaining prominence in hyperspectral image classification. Zhao et al. [42] presented a multiple convolutional layer fusion framework that fuses the features extracted from different convolutional layers for hyperspectral image classification; it considers only the fusion of spatial features at different scales, not the effective extraction of spatial and spectral features at multiple scales. The framework comes in two variants according to the fusion mechanism: the side output decision fusion network (SODFN), which applies majority voting to the side classification maps generated by the individual convolutional layers, and the fully convolutional layer fusion network (FCLFN), which combines the features generated by all convolutional layers. Regarding the training settings, the SODFN and FCLFN parameters were tuned using 1029 training samples for the IP dataset and 436 training samples for the UP dataset.
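To make the decision-fusion idea behind SODFN concrete, the sketch below applies per-pixel majority voting to a stack of side classification maps; it illustrates the general scheme only and is not the implementation of [42].

```python
import numpy as np

def majority_vote(side_maps):
    """Fuse several side classification maps by per-pixel majority voting.

    side_maps: array of shape (n_maps, H, W) holding integer class labels
    predicted by the individual side outputs. Returns an (H, W) fused map.
    Ties are resolved in favor of the smallest class index (an arbitrary choice).
    """
    side_maps = np.asarray(side_maps)
    n_classes = side_maps.max() + 1
    # count votes per class at every pixel, then pick the most voted class
    votes = np.stack([(side_maps == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)
```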
Indian Pines dataset benchmark evaluation. Table 17 shows the classification results obtained on the IP dataset by the different CNN models and by our proposed model. The spatial-spectral FFPNet without data enhancement attains the highest OA, AA, and kappa coefficient, and thus the best overall performance among all benchmark models; it also performs well on every individual class. Its best result exceeds the best result of the baseline model [40] by 1.47% in terms of OA, and it outperforms the baseline under every experimental configuration. In particular, when the spatial information is limited (patch size d = 9), the proposed model outperforms the baseline with the same configuration by 8.96% and even exceeds the best configuration of the baseline (d = 29, 2466 training samples) by 0.7%. The results in Table 17 further demonstrate the superiority of the proposed spatial-spectral FFPNet and its robustness when the number of training samples is small.
Figure 15 provides a graphical comparison of the OA values obtained on the IP dataset by the different CNN models; the results of our proposed model are shown in red and those of the benchmark models in black. Our model performs favorably over the whole range of training set sizes compared with the well-known hyperspectral classification CNNs. In particular, when the number of training samples is relatively small (600–800), the spatial-spectral FFPNet with data enhancement performs outstandingly, and its advantage over the other methods becomes more pronounced as the number of samples increases. Note also that the attention networks [45] and the multiple CNN fusion [42] perform better than the CNN models of [40,41], which is in line with the current trend in hyperspectral classification, namely the use of multiscale feature fusion and attention mechanisms in the spatial and spectral dimensions.
University of Pavia dataset benchmark evaluation. Table 18 lists the classification results on the UP dataset for CNN models developed from 2016 to 2019 and for our proposed model. The spatial-spectral FFPNet (without data enhancement) delivers excellent results for all d values. Specifically, with d = 27 the proposed model improves the OA by 1.73% over the CNN model of [40], and it also clearly outperforms the baseline [40] under the same experimental configuration. Furthermore, the performance of the spatial-spectral FFPNet with a smaller patch size (d = 9) is only slightly below that of the baseline [40] with d = 15. Notably, the attention networks [45] achieve the best OA and AA among all models owing to their large number of training samples (4278), whereas our proposed model achieves the best kappa coefficient.
Figure 16 provides a graphical comparison of the OA values obtained on the UP dataset by the different CNN models. Again, our model attains more consistent and favorable classification results. As Figure 16 shows, the results of our proposed model lie in the range of roughly 400–1900 training samples. The local OA comparison indicates that the multiple CNN fusion [42] is superior (98.17%) when the number of training samples is extremely small (approximately 400); as the number of training samples increases, the advantage of the spatial-spectral FFPNet gradually becomes apparent. In addition, for large training sets (>3000 samples), the attention networks [45] perform considerably better than the traditional CNN model [41].

4. Conclusions

In this study, we focused on spatial object distribution diversity and spectral information extraction, the major challenges posed by high-resolution and hyperspectral remote sensing images. To address these spatial and spectral problems, three novel and practical attention-based modules were proposed: attention-based multiscale fusion, region pyramid attention, and adaptive-ASPP. Different forms of feature fusion pyramid frameworks (two-layer or three-layer pyramids) were constructed by combining these attention-based modules. First, we developed a new semantic segmentation framework for high-resolution images, the heavy-weight spatial FFPNet. Second, for the classification of hyperspectral images, an end-to-end spatial-spectral FFPNet was presented to extract and fuse multiscale spatial and spectral features. Experiments on two high-resolution datasets demonstrated that the proposed heavy-weight spatial FFPNet achieves excellent segmentation accuracy, and detailed ablation studies further revealed the benefit of the three attention-based modules in handling the spatial distribution diversity of remote sensing images. Furthermore, a detailed analysis of the training parameters and a comparison with other state-of-the-art CNNs (such as [40]) were performed on the two hyperspectral datasets. The results showed that the spatial-spectral FFPNet is more robust and more accurate when the number of training samples of the hyperspectral dataset is small, and that it obtains state-of-the-art results across different training set sizes. Overall, the proposed methods can serve as a new baseline for remote sensing image segmentation and classification. In future work, we will focus on few-shot or zero-shot segmentation and classification of high-resolution and hyperspectral remote sensing data to promote the practical application of deep learning in remote sensing image perception.

5. Code and Model Availability

The algorithms and trained models (including the heavy-weight spatial FFPNet for segmentation of high-resolution remote sensing images and the spatial-spectral FFPNet for classification of hyperspectral images) are publicly available on GitHub under a GNU General Public License (https://github.com/xupine/FFPNet).

Author Contributions

Conceptualization, X.Y. and C.O.; Funding acquisition, C.O.; Investigation, Y.Z.; Methodology, Q.X.; Software, Q.X.; Supervision, C.O.; Visualization, Y.Z.; Writing–original draft, Q.X.; Writing–review and editing, Q.X., X.Y. and C.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the NSFC (Grant No. 42022054), the Strategic Priority Research Program of CAS (Grant No. XDA23090303), the National Key Research and Development Program of China (Project No. 2017YFC1501000), and the CAS Youth Innovation Promotion Association.

Acknowledgments

The authors sincerely acknowledge the work of the reviewers on the manuscript, with special thanks to Holly Huang, Assistant Editor.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ghamisi, P.; Mura, M.D.; Benediktsson, J.A. A Survey on Spectral–Spatial Classification Techniques Based on Attribute Profiles. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2335–2353. [Google Scholar] [CrossRef]
  2. Wei, K.; Ouyang, C.; Duan, H.; Li, Y.; Chen, M.; Ma, J.; An, H.; Zhou, S. Reflections on the catastrophic 2020 Yangtze River Basin flooding in southern China. Innovation 2020, 1, 100038. [Google Scholar] [CrossRef]
  3. Wang, N.; Chen, F.; Yu, B.; Qin, Y. Segmentation of large-scale remotely sensed images on a Spark platform: A strategy for handling massive image tiles with the MapReduce model. ISPRS J. Photogramm. Remote Sens. 2020, 162, 137–147. [Google Scholar] [CrossRef]
  4. Sun, W.; Wang, R. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geosci. Remote Sens. Lett. 2018, 15, 474–478. [Google Scholar] [CrossRef]
  5. Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95. [Google Scholar] [CrossRef] [Green Version]
  6. Audebert, N.; Le Saux, B.; Lefèvre, S. Deep learning for classification of hyperspectral data: A comparative review. IEEE Geosci. Remote Sens. Mag. 2019, 7, 159–173. [Google Scholar] [CrossRef] [Green Version]
  7. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2014, 42, 1778–1790. [Google Scholar] [CrossRef] [Green Version]
  8. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  9. Mou, L.; Hua, Y.; Zhu, X.X. A relation-augmented fully convolutional network for semantic segmentation in aerial scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12416–12425. [Google Scholar]
  10. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  11. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  12. Wang, H.; Wang, Y.; Zhang, Q.; Xiang, S.; Pan, C. Gated convolutional neural network for semantic segmentation in high-resolution images. Remote Sens. 2017, 9, 446. [Google Scholar] [CrossRef] [Green Version]
  13. Cheng, G.; Wang, Y.; Xu, S.; Wang, H.; Xiang, S.; Pan, C. Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3322–3337. [Google Scholar] [CrossRef]
  14. Cheng, W.; Yang, W.; Wang, M.; Wang, G.; Chen, J. Context aggregation network for semantic labeling in aerial images. Remote Sens. 2019, 11, 1158. [Google Scholar] [CrossRef] [Green Version]
  15. Li, P.; Lin, Y.; Schultz-Fellenz, E. Contextual Hourglass Network for Semantic Segmentation of High Resolution Aerial Imagery. arXiv 2018, arXiv:1810.12813. [Google Scholar]
  16. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  17. Sebastian, C.; Imbriaco, R.; Bondarev, E.; de With, P.H. Adversarial Loss for Semantic Segmentation of Aerial Imagery. arXiv 2020, arXiv:2001.04269. [Google Scholar]
  18. Dong, R.; Pan, X.; Li, F. DenseU-net-based semantic segmentation of small objects in urban remote sensing images. IEEE Access 2019, 7, 65347–65356. [Google Scholar] [CrossRef]
  19. Du, Y.; Song, W.; He, Q.; Huang, D.; Liotta, A.; Su, C. Deep learning with multi-scale feature fusion in remote sensing for automatic oceanic eddy detection. Inf. Fusion 2019, 49, 89–99. [Google Scholar] [CrossRef] [Green Version]
  20. Jain, S.; Wallace, B.C. Attention is not explanation. arXiv 2019, arXiv:1902.10186. [Google Scholar]
  21. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  22. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
  23. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 593–602. [Google Scholar]
  24. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  25. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  26. Zhang, F.; Chen, Y.; Li, Z.; Hong, Z.; Liu, J.; Ma, F.; Han, J.; Ding, E. ACFNet: Attentional Class Feature Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6798–6807. [Google Scholar]
  27. Sindagi, V.A.; Patel, V.M. Multi-level bottom-top and top-bottom feature fusion for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1002–1012. [Google Scholar]
  28. Niu, R. HMANet: Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images. arXiv 2020, arXiv:2001.02870. [Google Scholar]
  29. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-scale Feature Learning for Object Detection. arXiv 2019, arXiv:1912.05384. [Google Scholar]
  30. Jin, X.; Lan, C.; Zeng, W.; Zhang, Z.; Chen, Z. CaseNet: Content-adaptive scale interaction networks for scene parsing. arXiv 2019, arXiv:1904.08170. [Google Scholar]
  31. Li, X.; Zhao, H.; Han, L.; Tong, Y.; Yang, K. GFF: Gated Fully Fusion for Semantic Segmentation. arXiv 2019, arXiv:1904.01803. [Google Scholar]
  32. Tarabalka, Y.; Benediktsson, J.A.; Chanussot, J. Spectral–Spatial Classification of Hyperspectral Imagery Based on Partitional Clustering Techniques. IEEE Trans. Geosci. Remote Sens. 2009, 47, 2973–2987. [Google Scholar] [CrossRef]
  33. Archibald, R.; Fann, G. Feature Selection and Classification of Hyperspectral Images With Support Vector Machines. IEEE Geosci. Remote Sens. Lett. 2007, 4, 674–677. [Google Scholar] [CrossRef]
  34. Sun, S.; Zhong, P.; Xiao, H.; Wang, R. Active Learning With Gaussian Process Classifier for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1746–1760. [Google Scholar] [CrossRef]
  35. Chen, Y.; Nasrabadi, N.M.; Tran, T.D. Hyperspectral Image Classification Using Dictionary-Based Sparse Representation. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3973–3985. [Google Scholar] [CrossRef]
  36. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the Deep Supervised Learning for Hyperspectral Data Classification Through Convolutional Neural Networks, Milan, Italy, 26–31 July 2015; pp. 4959–4962. [Google Scholar]
  37. Zhao, W.; Du, S. Spectral–Spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and Deep Learning Approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [Google Scholar] [CrossRef]
  38. Luo, Y.; Zou, J.; Yao, C.; Li, T.; Bai, G. HSI-CNN: A Novel Convolution Neural Network for Hyperspectral Image. In Proceedings of the 2018 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 16–17 July 2018. [Google Scholar]
  39. Li, Y.; Zhang, H.; Shen, Q. Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network. Remote Sens. 2017, 9, 67. [Google Scholar] [CrossRef] [Green Version]
  40. Paoletti, M.; Haut, J.; Plaza, J.; Plaza, A. A new deep convolutional neural network for fast hyperspectral image classification. ISPRS J. Photogramm. Remote Sens. 2018, 145, 120–147. [Google Scholar] [CrossRef]
  41. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef] [Green Version]
  42. Zhao, G.; Liu, G.; Fang, L.; Tu, B.; Ghamisi, P. Multiple convolutional layers fusion framework for hyperspectral image classification. Neurocomputing 2019, 339, 149–160. [Google Scholar] [CrossRef]
  43. Gong, Z.; Zhong, P.; Yu, Y.; Hu, W.; Li, S. A CNN With Multiscale Convolution and Diversified Metric for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3599–3618. [Google Scholar] [CrossRef]
  44. Imani, M.; Ghassemian, H. An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges. Inf. Fusion 2020, 59, 59–83. [Google Scholar] [CrossRef]
  45. Haut, J.M.; Paoletti, M.E.; Plaza, J.; Plaza, A.; Li, J. Visual Attention-Driven Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8065–8080. [Google Scholar] [CrossRef]
  46. Mei, X.; Pan, E.; Ma, Y.; Dai, X.; Huang, J.; Fan, F.; Du, Q.; Zheng, H.; Ma, J. Spectral-Spatial Attention Networks for Hyperspectral Image Classification. Remote Sens. 2019, 11, 963. [Google Scholar] [CrossRef] [Green Version]
  47. Xu, Q.; Ouyang, C.; Jiang, T.; Fan, X.; Cheng, D. DFPENet-geology: A Deep Learning Framework for High Precision Recognition and Segmentation of Co-seismic Landslides. arXiv 2019, arXiv:1908.10907. [Google Scholar]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing And Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  49. Lin, D.; Shen, D.; Shen, S.; Ji, Y.; Lischinski, D.; Cohen-Or, D.; Huang, H. ZigZagNet: Fusing Top-Down and Bottom-Up Context for Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7490–7499. [Google Scholar]
  50. Zhen, M.; Wang, J.; Zhou, L.; Fang, T.; Quan, L. Learning Fully Dense Neural Networks for Image Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Hilton Hawaiian Village, Honolulu, Hawaii, USA, 27 January–1 February 2019; Volume 33, pp. 9283–9290. [Google Scholar]
  51. Zhang, Z.; Zhang, X.; Peng, C.; Xue, X.; Sun, J. Exfuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 269–284. [Google Scholar]
  52. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  53. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Change Loy, C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  55. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  56. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  57. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3684–3692. [Google Scholar]
  58. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  59. Volpi, M.; Tuia, D. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 55, 881–893. [Google Scholar] [CrossRef] [Green Version]
  60. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  61. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef] [Green Version]
  62. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  63. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172. [Google Scholar] [CrossRef] [Green Version]
  64. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  65. Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv 2016, arXiv:1606.02585. [Google Scholar]
  66. Pan, B.; Shi, Z.; Xu, X.; Shi, T.; Zhang, N.; Zhu, X. CoinNet: Copy initialization network for multispectral imagery semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2018, 16, 816–820. [Google Scholar] [CrossRef]
  67. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Challenges of object segmentation in spatial distribution in remote sensing images.
Figure 2. Proposed segmentation scheme of high-resolution images and classification scheme of hyperspectral images (upper left). For the heavy-weight spatial FFPNet, given a high-resolution image (3-band), ResNet-101 pretrained on ImageNet [55] is used as the backbone for feature extraction (middle, where W denotes the height or width of the image). The heavy-weight spatial FFPNet is a three-level feature fusion pyramid. The detailed configurations of the heavy-weight spatial FFPNet are described in Section 2.5. Furthermore, BA loss is used to train the heavy-weight FFPNet in an end-to-end manner. For the spatial-spectral FFPNet, given a hyperspectral image with the size of p × H × W , where p is the number of spectral bands, the image is sent to the light-weight spatial FFP and the spectral FFP modules simultaneously. The light-weight spatial FFP is a two-level pyramid, and VGGNet16 pretrained on ImageNet [55] is used as the backbone. Notably, the initial parameters of the first convolutional layer in the pretrained network are copied until the p-channel inputs are attained. Furthermore, fully connected layers are used to effectively merge multiscale spatial feature obtained by the light-weight spatial FFP and spectral feature obtained by the spectral FFP and predict the class of all pixels. The detailed description of the spatial-spectral FFPNet is presented in Section 2.6.
Figure 3. Proposed RePyAtt Module. We first generate a region pyramid by partitioning the input feature maps (left) into four groups and employing the self-attention mechanism to extract the regional dependence. Finally, the output of the RePyAtt module is obtained by summation (‘SUM’) of different region groups.
Figure 4. Proposed MuAttFusion module. It selectively fuses the same-layer, higher-layer, and lower-layer features using an adaptive attention method.
Figure 5. ResConv module used to refine the features and reduce network parameters.
Figure 6. Structure of the proposed adaptive-ASPP module. It is designed to adjust the combination weights for varied contents in a feature-embedded space by using the proposed cross-scale attention (CrsAtt) module (top). Image Pooling represents a global average pooling operation.
Figure 7. Ground-truth images of different datasets and the number of image samples for high-resolution datasets (Vaihingen and Potsdam) and pixel samples for hyperspectral datasets (the IP and UP datasets).
Figure 8. Qualitative comparisons between our method and the baseline (Deeplabv3+) on the Vaihingen dataset with 512 × 512 patches.
Figure 9. Qualitative comparisons between our method and the baseline (Deeplabv3+) on the Potsdam dataset with 512 × 512 patches.
Figure 10. Qualitative comparisons of the two types of available input images (IRRG and RGB) on the Potsdam test set.
Figure 11. Classification results for the IP image with d = 9 and number of samples per category = 50 (first column), 100 (second column), 150 (third column), and 200 (fourth column). The upper row represents the visualization results of the confusion matrix for each category on the testing set (the more prominent the color of the diagonal area, the better the result) and the lower row represents the qualitative results of classification.
Figure 12. Classification results for the IP image with d = 15 and number of samples per category = 50 (first column), 100 (second column), 150 (third column), and 200 (fourth column). The upper row represents the visualization results of the confusion matrix for each category on the testing set (the more prominent the color of the diagonal area, the better the result) and the lower row represents the qualitative results of classification.
Figure 13. Classification results for the IP image with d = 19 and number of samples per category = 50 (first column), 100 (second column), 150 (third column), and 200 (fourth column). The upper row represents the visualization results of the confusion matrix for each category on the testing set (the more prominent the color of the diagonal area, the better the result) and the lower row represents the qualitative results of classification.
Figure 14. Classification results for the IP image with d = 29 and number of samples per category = 50 (first column), 100 (second column), 150 (third column), and 200 (fourth column). The upper row represents the visualization results of the confusion matrix for each category on the testing set (the more prominent the color of the diagonal area, the better the result) and the lower row represents the qualitative results of classification.
Figure 15. Comparison of the OA obtained by different CNN models for the IP dataset. The abscissa represents the total number of training samples (600–2500), and the ordinate represents the OA (%) of the CNN models. The figure mainly compares the accuracy of the existing CNN methods and the proposed spatial-spectral FFPNet under different training samples. In the legend, different shapes represent different methods for hyperspectral classification. In the same shape, different colors indicate different configurations of the same method. The OA results obtained by our proposed model is presented in red, and those obtained by other CNNs [41], CNN [40], attention networks [45], and multiple CNN fusion [42] are presented in black.
Figure 16. Comparison of the OA obtained by different CNN models for the UP dataset. The abscissa represents the total number of training samples (400–4400), and the ordinate represents the OA (%) of the CNN models. The figure mainly compares the accuracy of different models (existing CNN methods and the proposed spatial-spectral FFPNet) under different training samples. In the legend, different shapes represent different methods for hyperspectral classification. In the same shape, different colors indicate different configurations of the same method. The OA results obtained by our proposed model are presented in red and those obtained by the existing CNNs [40,41], attention networks [45], and multiple CNN fusion [42] are presented in black.
Table 1. Three-level heavy-weight spatial FFPNet configurations. The module parameters are denoted as “module name(receptive fields of different convolutions)-number of modules-number of module output channels”. Note that some complex modules only give the module name.
Level | Detailed Configurations
First-level pyramid | x1: Conv2d(7 × 7)-1-64 + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x2: Maxpool + Block1(1 × 1 + 3 × 3 + 1 × 1)-3-256 + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x3: Block2(1 × 1 + 3 × 3 + 1 × 1)-4-512 + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x4: Block3(1 × 1 + 3 × 3 + 1 × 1)-23-1024 + Block4(1 × 1 + 3 × 3 + 1 × 1)-3-2048 + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x5: Adaptive-ASPP
Second-level pyramid | x7: RePyAtt + MuAttFusion(x1, x2, x3, x4) + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x6: RePyAtt + MuAttFusion(x1, x2, x3, x4) + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
Third-level pyramid | MuAttFusion(x5, x6, x7)
Parameter | 78.8 million
Table 2. Two-level light-weight spatial feature fusion pyramid configurations. The module parameters are denoted as “module name(receptive fields)-number of modules-number of module output channels”.
Level | Detailed Configurations
First-level pyramid | x1: Conv2d(3 × 3)-2-64 + Conv2d(3 × 3)-2-128 + Maxpool + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x2: Conv2d(3 × 3)-3-256 + Conv2d(3 × 3)-3-512 + Maxpool + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
 | x3: Conv2d(3 × 3)-3-512 + Maxpool + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
Second-level pyramid | MuAttFusion(x1, x2, x3) + ResConv(1 × 1 + 3 × 3 + 3 × 3)-1-256
Parameter | 24.8 million
Table 3. Two-level spectral feature fusion pyramid configurations. The module parameters are denoted as “module name(receptive fields)-number of modules-number of module output channels”.
Level | Detailed Configurations
First-level pyramid | x1: Conv2d(3 × 3 + 1 × 1)-1-64
 | x2: Conv2d(3 × 3 + 1 × 1)-1-32
 | x3: Conv2d(3 × 3 + 1 × 1)-1-16
Second-level pyramid | MuAttFusion(x1, x2, x3)
Parameter | 0.20 million
Table 4. Results of the ablation study on the Vaihingen testing dataset; the values in bold are the best. All results are the average of three runs with maximum epoch = 10 and mini-batch size = 4.
Method | RePyAtt | MuAttFusion | Adaptive-ASPP | ASPP | CE Loss | BA Loss | OA(%) | mIoU(%)
ResNet-101 Baseline (Deeplabv3+) | | | | ✓ | ✓ | | 88.09 | 72.83
ResNet-101 + RePyAtt + MuAttFusion + BA | ✓ | ✓ | | | | ✓ | 90.64 | 80.37
ResNet-101 + RePyAtt + MuAttFusion + ASPP + BA | ✓ | ✓ | | ✓ | | ✓ | 90.37 | 79.96
ResNet-101 + RePyAtt + MuAttFusion + Adaptive-ASPP + BA | ✓ | ✓ | ✓ | | | ✓ | 90.91 | 81.33
ResNet-101 + RePyAtt + MuAttFusion + Adaptive-ASPP + CE | ✓ | ✓ | ✓ | | ✓ | | 90.71 | 80.82
Table 5. Results of the ablation study with different combinations of groups in the RePyAtt module; the values in bold are the best. All results are the average of three runs with maximum epoch = 10 and mini-batch size = 4.
Pyramid Combinations | OA(%) | mIoU(%)
{single pixel, 8, 4, 2, 1} | 90.28 | 79.63
{single pixel, 4, 2, 1} | 90.91 | 81.33
{single pixel, 2, 1} | 90.49 | 80.08
{single pixel, 1} | 90.66 | 80.39
Table 6. Experimental results on the Vaihingen dataset; the values in bold are the best.
Method | Imp. Surf. | Build. | Low Veg. | Tree | Car | Mean F1(%) | OA(%) | mIoU(%)
FCNs [8] | 88.11 | 91.36 | 77.10 | 85.70 | 75.03 | 83.46 | 85.73 | 72.12
DeepLabv3 [10] | 87.75 | 92.04 | 77.47 | 85.85 | 65.21 | 81.66 | 86.48 | 70.05
UZ_1 [59] | 89.20 | 92.50 | 81.60 | 86.90 | 57.30 | 81.50 | 87.30 | -
Attention U-Net [24] | 90.44 | 92.91 | 80.30 | 87.90 | 79.10 | 86.13 | 87.95 | 76.05
DeepLabv3+ [11] | 90.03 | 93.13 | 79.08 | 87.09 | 68.94 | 83.65 | 88.09 | 72.83
RefineNet [60] | 90.82 | 94.11 | 81.07 | 88.92 | 82.17 | 87.42 | 88.98 | 78.01
S-RA-FCN [9] | 91.47 | 94.97 | 80.63 | 88.57 | 87.05 | 88.54 | 89.23 | -
ONE_7 [61] | 91.00 | 94.50 | 84.40 | 89.90 | 77.80 | 87.52 | 89.80 | -
DANet [22] | 91.63 | 95.02 | 83.25 | 88.87 | 87.16 | 89.19 | 89.85 | 80.53
GSN5 [12] | 91.80 | 95.00 | 83.70 | 89.70 | 81.90 | 88.42 | 90.10 | -
DLR_10 [63] | 92.30 | 95.20 | 84.10 | 90.00 | 79.30 | 88.18 | 90.30 | -
PSPNet [64] | 92.79 | 95.46 | 84.51 | 89.94 | 88.61 | 90.26 | 90.85 | 82.58
Heavy-weight Spatial FFPNet | 92.80 | 95.24 | 83.75 | 89.38 | 86.56 | 89.55 | 90.91 | 81.33
Table 7. Experimental results on the Potsdam dataset; the values in bold are the best.
Method | Image Type | Imp. Surf. | Build. | Low Veg. | Tree | Car | Mean F1(%) | OA(%) | mIoU(%)
UZ_1 [59] | IRRG | 89.30 | 95.40 | 81.80 | 80.50 | 86.50 | 86.70 | 85.80 | -
FCNs [8] | IRRG | 89.05 | 93.34 | 83.54 | 83.67 | 89.48 | 87.82 | 86.40 | 78.48
Attention U-Net [24] | IRRG | 90.26 | 92.47 | 85.49 | 85.90 | 94.70 | 89.76 | 87.64 | 81.62
DeepLabv3 [10] | IRRG | 89.90 | 94.58 | 83.58 | 85.48 | 73.24 | 85.36 | 87.73 | 75.12
S-RA-FCN [9] | IRRG | 91.33 | 94.70 | 86.81 | 83.47 | 94.52 | 90.17 | 88.59 | 82.38
RefineNet [60] | IRRG | 91.17 | 95.13 | 85.22 | 87.69 | 95.15 | 90.87 | 89.16 | 83.51
DeepLabv3+ [11] | IRRG | 92.27 | 95.52 | 85.71 | 86.04 | 89.42 | 89.79 | 89.60 | 81.69
DST_6 [65] | IRRG | 92.40 | 96.40 | 86.80 | 87.70 | 93.40 | 91.34 | 90.20 | -
DANet [22] | IRRG | 91.50 | 95.83 | 87.21 | 88.79 | 95.16 | 91.70 | 90.56 | 83.77
AZ3 | IRRG | 93.10 | 96.30 | 87.20 | 88.60 | 96.00 | 92.24 | 90.70 | -
CASIA3 [5] | IRRG | 93.40 | 96.80 | 87.60 | 88.30 | 96.10 | 92.44 | 91.00 | -
PSPNet [64] | IRRG | 93.36 | 96.97 | 87.75 | 88.50 | 95.42 | 92.40 | 91.08 | 84.88
Heavy-weight Spatial FFPNet | IRRG | 93.61 | 96.70 | 87.31 | 88.11 | 96.46 | 92.44 | 91.10 | 86.20
Heavy-weight Spatial FFPNet | RGB | 92.82 | 96.29 | 86.71 | 88.52 | 96.48 | 92.16 | 90.54 | 85.72
Table 8. Number of training samples used by the spatial-spectral FFPNet for the IP dataset.
IP
Class | Pixels | 200 Samples per Category | 150 Samples per Category | 100 Samples per Category | 50 Samples per Category | 200 Samples per Category in [40]
Alfalfa | 46 | 23 | 23 | 23 | 23 | 33
Corn-notill | 1428 | 200 | 150 | 100 | 50 | 200
Corn-mintill | 830 | 200 | 150 | 100 | 50 | 200
Corn | 237 | 118 | 118 | 100 | 50 | 181
Grass-pasture | 483 | 200 | 150 | 100 | 50 | 200
Grass-trees | 730 | 200 | 150 | 100 | 50 | 200
Grass-pasture-mowed | 28 | 14 | 14 | 14 | 14 | 20
Hay-windrowed | 478 | 200 | 150 | 100 | 50 | 200
Oats | 20 | 10 | 10 | 10 | 10 | 14
Soybeans-notill | 972 | 200 | 150 | 100 | 50 | 200
Soybeans-mintill | 2455 | 200 | 150 | 100 | 50 | 200
Soybeans-clean | 593 | 200 | 150 | 100 | 50 | 200
Wheat | 205 | 102 | 102 | 100 | 50 | 143
Woods | 1265 | 200 | 150 | 100 | 50 | 200
Bldg-grass-tree-drives | 386 | 193 | 150 | 100 | 50 | 200
Stone-steel-towers | 93 | 46 | 46 | 46 | 46 | 75
Total | 10,249 | 2306 | 1813 | 1293 | 693 | 2466
Table 9. Number of training samples used by the spatial-spectral FFPNet for the UP dataset.
UP
Class | Pixels | 200 Samples per Category | 150 Samples per Category | 100 Samples per Category | 50 Samples per Category | 200 Samples per Category in [40]
Asphalt | 6631 | 200 | 150 | 100 | 50 | 200
Meadows | 18,649 | 200 | 150 | 100 | 50 | 200
Gravel | 2099 | 200 | 150 | 100 | 50 | 200
Trees | 3064 | 200 | 150 | 100 | 50 | 200
Painted metal sheets | 1345 | 200 | 150 | 100 | 50 | 200
Bare soil | 5029 | 200 | 150 | 100 | 50 | 200
Bitumen | 1330 | 200 | 150 | 100 | 50 | 200
Self-blocking bricks | 3682 | 200 | 150 | 100 | 50 | 200
Shadows | 947 | 200 | 150 | 100 | 50 | 200
Total | 42,776 | 1800 | 1350 | 900 | 450 | 1800
Table 10. Total training time (in minutes) and accuracy evaluation with different patch sizes d = 9, 15, 19, and 29 for the IP dataset and d = 9, 15, 21, and 27 for the UP dataset; the values in bold are the best. All results are the average of three runs with 100 samples per category; maximum epoch = 200 and mini-batch size = 24.
Dataset | Patch Size | Total Time | OA | AA | Kappa
IP | d = 9 | 15.13 | 96.30 | 98.31 | 95.73
IP | d = 15 | 17.86 | 96.78 | 98.68 | 96.29
IP | d = 19 | 14.66 | 98.50 | 99.16 | 98.27
IP | d = 29 | 21.57 | 98.74 | 99.43 | 98.57
UP | d = 9 | 9.12 | 91.37 | 91.28 | 88.60
UP | d = 15 | 7.06 | 96.41 | 96.14 | 95.25
UP | d = 21 | 8.16 | 98.82 | 98.37 | 98.44
UP | d = 27 | 9.60 | 97.29 | 97.26 | 96.42
Table 11. Classification accuracies obtained by the proposed spatial-spectral FFPNet (with sample patch sizes d = 9, 15, 19, and 29) for the IP dataset. The values in bold are the best under different sample patch sizes. All results are the average of three runs with maximum epoch = 200 and mini-batch size = 24.
Sample Patch Size | d = 9 | d = 15 | d = 19 | d = 29
Samples per Category | 50 | 100 | 150 | 200 | 50 | 100 | 150 | 200 | 50 | 100 | 150 | 200 | 50 | 100 | 150 | 200
Alfalfa | 100.00 | 100.00 | 95.65 | 95.65 | 100.00 | 100.00 | 95.65 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Corn-notill | 73.66 | 95.63 | 98.83 | 98.78 | 74.38 | 94.65 | 98.51 | 99.35 | 83.16 | 99.32 | 99.53 | 99.59 | 87.11 | 99.72 | 100.00 | 100.00
Corn-mintill | 72.69 | 98.63 | 98.97 | 99.84 | 86.03 | 97.40 | 99.71 | 98.25 | 97.44 | 96.30 | 99.56 | 99.05 | 95.65 | 100.00 | 100.00 | 100.00
Corn | 93.58 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 98.93 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Grass-pasture | 90.76 | 98.96 | 98.80 | 100.00 | 97.46 | 99.48 | 100.00 | 100.00 | 93.07 | 97.65 | 98.80 | 98.94 | 95.83 | 98.33 | 100.00 | 100.00
Grass-trees | 97.06 | 98.10 | 99.31 | 99.62 | 97.50 | 99.84 | 100.00 | 100.00 | 97.35 | 99.05 | 100.00 | 98.87 | 95.05 | 98.35 | 99.45 | 100.00
Grass-pasture-mowed | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Hay-windrowed | 97.20 | 100.00 | 100.00 | 100.00 | 98.36 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Oats | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Soybeans-notill | 78.98 | 94.01 | 99.75 | 99.33 | 87.04 | 99.19 | 97.78 | 99.73 | 87.15 | 97.81 | 99.51 | 100.00 | 98.26 | 98.70 | 99.57 | 100.00
Soybeans-mintill | 70.73 | 93.25 | 94.84 | 98.58 | 74.80 | 93.25 | 97.53 | 99.38 | 80.79 | 97.54 | 95.75 | 99.78 | 91.19 | 96.41 | 99.35 | 99.35
Soybean-clean | 86.37 | 99.19 | 99.55 | 99.24 | 94.66 | 98.58 | 98.87 | 100.00 | 96.87 | 98.99 | 100.00 | 100.00 | 98.65 | 99.32 | 99.32 | 100.00
Wheat | 98.71 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Woods | 94.07 | 97.94 | 98.57 | 98.78 | 96.79 | 98.63 | 99.64 | 99.34 | 93.83 | 99.91 | 100.00 | 100.00 | 98.73 | 100.00 | 99.68 | 100.00
Bldg-grass-tree-drives | 99.11 | 97.20 | 100.00 | 100.00 | 98.51 | 97.90 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 96.88 | 100.00 | 100.00 | 100.00
Stone-steel-towers | 100.00 | 100.00 | 100.00 | 95.74 | 100.00 | 100.00 | 100.00 | 100.00 | 97.87 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
OA | 82.12 | 96.30 | 97.98 | 99.07 | 86.44 | 96.78 | 98.74 | 99.47 | 89.79 | 98.50 | 98.63 | 99.68 | 94.65 | 98.74 | 99.69 | 99.84
AA | 90.81 | 98.31 | 99.02 | 99.10 | 94.10 | 98.68 | 99.23 | 99.75 | 95.40 | 99.16 | 99.57 | 99.76 | 97.34 | 99.43 | 99.84 | 99.96
Kappa | 79.65 | 95.73 | 97.65 | 98.90 | 84.56 | 96.29 | 98.53 | 99.38 | 88.35 | 98.27 | 98.41 | 99.63 | 93.93 | 98.57 | 99.64 | 99.82
Run time | 9.24 | 15.13 | 20.39 | 22.55 | 8.21 | 17.86 | 28.47 | 31.44 | 10.37 | 14.66 | 23.12 | 34.05 | 15.43 | 21.57 | 27.41 | 37.38
Table 12. Classification accuracies obtained by the proposed spatial-spectral FFPNet (with sample patch sizes of d = 9, 15, 21, and 27) for the UP dataset. The values in bold are the best under different sample patch sizes. All results are the average of three runs with maximum epoch = 200 and mini-batch size = 24.

Sample Patch Size | d = 9 | d = 9 | d = 9 | d = 9 | d = 15 | d = 15 | d = 15 | d = 15 | d = 21 | d = 21 | d = 21 | d = 21 | d = 27 | d = 27 | d = 27 | d = 27
Samples per Category | 50 | 100 | 150 | 200 | 50 | 100 | 150 | 200 | 50 | 100 | 150 | 200 | 50 | 100 | 150 | 200
Asphalt | 66.14 | 88.65 | 95.72 | 96.32 | 78.03 | 94.57 | 98.19 | 98.19 | 93.54 | 97.34 | 98.97 | 97.59 | 92.28 | 91.61 | 99.70 | 99.82
Meadows | 87.47 | 94.64 | 98.09 | 97.96 | 92.86 | 97.98 | 99.44 | 99.98 | 97.68 | 99.81 | 99.89 | 99.91 | 97.45 | 99.31 | 99.68 | 99.85
Gravel | 49.62 | 90.27 | 97.14 | 95.99 | 91.03 | 96.37 | 99.43 | 98.66 | 89.12 | 98.09 | 99.81 | 99.81 | 93.32 | 98.66 | 99.05 | 99.05
Trees | 57.44 | 86.03 | 87.86 | 90.60 | 59.79 | 92.95 | 96.08 | 96.61 | 72.19 | 94.39 | 94.39 | 96.34 | 83.16 | 93.34 | 94.26 | 96.61
Painted metal sheets | 95.24 | 97.62 | 99.40 | 100.00 | 97.92 | 98.81 | 99.70 | 99.11 | 98.51 | 98.51 | 100.00 | 100.00 | 96.13 | 99.40 | 99.70 | 100.00
Bare Soil | 18.46 | 87.59 | 96.26 | 93.95 | 60.14 | 98.41 | 99.68 | 100.00 | 94.99 | 99.68 | 100.00 | 100.00 | 95.78 | 100.00 | 100.00 | 100.00
Bitumen | 94.88 | 98.49 | 99.40 | 97.59 | 98.49 | 98.80 | 99.40 | 100.00 | 97.59 | 100.00 | 100.00 | 100.00 | 97.59 | 100.00 | 100.00 | 100.00
Self-Blocking Bricks | 36.96 | 84.46 | 95.98 | 96.30 | 36.74 | 90.00 | 98.70 | 99.02 | 77.39 | 99.24 | 97.28 | 98.04 | 71.09 | 93.91 | 97.72 | 99.13
Shadows | 79.65 | 93.81 | 96.02 | 97.35 | 79.65 | 97.35 | 100.00 | 98.23 | 93.36 | 98.23 | 98.67 | 100.00 | 95.13 | 99.12 | 95.58 | 99.56
OA | 67.99 | 91.37 | 96.58 | 96.51 | 79.47 | 96.41 | 98.99 | 99.25 | 92.66 | 98.82 | 99.12 | 99.15 | 92.87 | 97.29 | 99.05 | 99.53
AA | 65.09 | 91.28 | 96.21 | 96.23 | 77.18 | 96.14 | 98.96 | 98.87 | 90.49 | 98.37 | 98.78 | 99.08 | 91.33 | 97.26 | 98.41 | 99.33
Kappa | 56.52 | 88.60 | 95.47 | 95.37 | 72.73 | 95.25 | 98.66 | 99.01 | 90.24 | 98.44 | 98.83 | 98.87 | 90.55 | 96.42 | 98.75 | 99.38
Run time (min) | 4.91 | 9.12 | 13.23 | 15.48 | 4.79 | 7.06 | 12.40 | 14.05 | 4.90 | 8.16 | 11.50 | 14.55 | 5.04 | 9.60 | 13.66 | 18.45
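The accuracy figures reported in these tables follow their standard definitions: overall accuracy (OA) is the fraction of correctly classified test pixels, average accuracy (AA) is the mean of the per-class accuracies, and the Kappa coefficient corrects OA for chance agreement. A minimal sketch that computes all three from predicted and reference labels is given below; the array and function names are illustrative, and every class is assumed to appear in the test set.

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, num_classes):
    """Overall accuracy, average accuracy, and Cohen's kappa (as percentages)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):                 # confusion matrix: rows = reference
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                        # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)         # per-class accuracy (recall)
    aa = per_class.mean()                            # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return 100 * oa, 100 * aa, 100 * kappa
```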
Table 13. Ablation study of the effect of data enhancement on spatial-spectral FFPNet performance (with sample patch sizes of d = 9, 15, 19, and 29 and 50 samples per category) on the IP dataset; the values in bold are the best. All results are the average of three runs with maximum epoch = 200 and mini-batch size = 24.

Methods | d = 9 | d = 9 | d = 15 | d = 15 | d = 19 | d = 19 | d = 29 | d = 29
Class | Spatial-Spectral FFPNet | + Data Enhancement | Spatial-Spectral FFPNet | + Data Enhancement | Spatial-Spectral FFPNet | + Data Enhancement | Spatial-Spectral FFPNet | + Data Enhancement
Alfalfa | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Corn-notill | 73.66 | 82.22 | 74.38 | 89.84 | 83.16 | 96.81 | 87.11 | 98.60
Corn-min | 72.69 | 95.90 | 86.03 | 95.64 | 97.44 | 98.46 | 95.65 | 98.55
Corn | 93.58 | 98.40 | 100.00 | 100.00 | 98.93 | 100.00 | 100.00 | 98.31
Grass/Pasture | 90.76 | 91.69 | 97.46 | 97.00 | 93.07 | 94.46 | 95.83 | 95.83
Grass/Trees | 97.06 | 94.12 | 97.50 | 98.68 | 97.35 | 98.09 | 95.05 | 100.00
Grass/pasture-mowed | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Hay-windrowed | 97.20 | 99.77 | 98.36 | 99.30 | 100.00 | 100.00 | 100.00 | 100.00
Oats | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Soybeans-notill | 78.98 | 87.25 | 87.04 | 84.53 | 87.15 | 96.08 | 98.26 | 97.39
Soybeans-min | 70.73 | 84.24 | 74.80 | 93.43 | 80.79 | 94.97 | 91.19 | 94.62
Soybean-clean | 86.37 | 97.97 | 94.66 | 99.08 | 96.87 | 97.05 | 98.65 | 99.32
Wheat | 98.71 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Woods | 94.07 | 95.06 | 96.79 | 99.92 | 93.83 | 98.60 | 98.73 | 100.00
Bldg-Grass-Tree-Drives | 99.11 | 99.70 | 98.51 | 99.11 | 100.00 | 99.70 | 96.88 | 100.00
Stone-steel | 100.00 | 100.00 | 100.00 | 100.00 | 97.87 | 97.87 | 100.00 | 100.00
OA | 82.12 | 90.32 | 86.44 | 94.68 | 89.79 | 97.02 | 94.65 | 97.88
AA | 90.81 | 95.39 | 94.10 | 97.28 | 95.40 | 98.26 | 97.34 | 98.91
Kappa | 79.65 | 88.95 | 84.56 | 93.89 | 88.35 | 96.58 | 93.93 | 97.58
Table 14. Ablation study of the effect of data enhancement on spatial-spectral FFPNet performance (with sample patch sizes of d = 9, 15, 21, and 27 and 50 samples per category) on the UP dataset; the values in bold are the best. All results are the average of three runs with maximum epoch = 200 and mini-batch size = 24.

Methods | d = 9 | d = 9 | d = 15 | d = 15 | d = 21 | d = 21 | d = 27 | d = 27
Class | Spatial-Spectral FFPNet | + Data Enhancement | Spatial-Spectral FFPNet | + Data Enhancement | Spatial-Spectral FFPNet | + Data Enhancement | Spatial-Spectral FFPNet | + Data Enhancement
Asphalt | 66.14 | 91.37 | 78.03 | 90.83 | 93.54 | 97.16 | 92.28 | 88.17
Meadows | 87.47 | 89.94 | 92.86 | 98.07 | 97.68 | 98.33 | 97.45 | 97.73
Gravel | 49.62 | 72.90 | 91.03 | 88.17 | 89.12 | 91.22 | 93.32 | 96.95
Trees | 57.44 | 84.73 | 59.79 | 90.08 | 72.19 | 81.07 | 83.16 | 90.73
Painted metal sheets | 95.24 | 96.73 | 97.92 | 98.81 | 98.51 | 99.40 | 96.13 | 98.21
Bare Soil | 18.46 | 85.28 | 60.14 | 99.28 | 94.99 | 97.69 | 95.78 | 98.81
Bitumen | 94.88 | 92.47 | 98.49 | 99.70 | 97.59 | 100.00 | 97.59 | 100.00
Self-Blocking Bricks | 36.96 | 80.00 | 36.74 | 87.72 | 77.39 | 91.74 | 71.09 | 94.35
Shadows | 79.65 | 93.81 | 79.65 | 92.48 | 93.36 | 97.79 | 95.13 | 96.02
OA | 67.99 | 87.92 | 79.47 | 95.09 | 92.66 | 95.99 | 92.87 | 95.59
AA | 65.09 | 87.47 | 77.18 | 93.90 | 90.49 | 94.93 | 91.33 | 95.66
Kappa | 56.52 | 84.19 | 72.73 | 93.52 | 90.24 | 94.68 | 90.55 | 94.19
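Tables 13 and 14 quantify only the effect of the data-enhancement step; the specific scheme is defined earlier in the paper and is not restated here. Purely as a hypothetical illustration of patch-level augmentation for hyperspectral samples, the sketch below applies random flips and 90° rotations to the spatial window while leaving the spectral axis untouched; this generic variant is an assumption, not necessarily the authors' procedure.

```python
import numpy as np

def augment_patch(patch, rng):
    """Random flips and 90-degree rotations of a (d, d, bands) patch.

    A generic spatial augmentation used here for illustration only;
    the spectral dimension is left unchanged.
    """
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]                  # vertical flip
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]                  # horizontal flip
    k = rng.integers(0, 4)                         # 0-3 quarter turns in the spatial plane
    return np.rot90(patch, k=k, axes=(0, 1)).copy()

# rng = np.random.default_rng(0)
# augmented = augment_patch(patch, rng)
```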
Table 15. Ablation study of the spatial FFPNet, spectral FFPNet, and spatial-spectral FFPNet with sample patch sizes of d = 9, 15, 19, and 29 and 100 samples per category on the IP dataset; the values in bold are the best. All results are the average of three runs with maximum epoch = 200 and mini-batch size = 24.

Sample Patch Size | d = 9 | d = 9 | d = 9 | d = 15 | d = 15 | d = 15 | d = 19 | d = 19 | d = 19 | d = 29 | d = 29 | d = 29
Models | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet
Alfalfa | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 95.65 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Corn-notill | 92.77 | 81.10 | 95.63 | 93.07 | 95.48 | 94.65 | 99.47 | 96.39 | 99.32 | 98.04 | 99.72 | 99.72
Corn-mintill | 96.30 | 95.75 | 98.63 | 92.74 | 96.30 | 97.40 | 96.71 | 97.53 | 96.30 | 99.52 | 100.00 | 100.00
Corn | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Grass-pasture | 96.61 | 98.43 | 98.96 | 97.13 | 97.13 | 99.48 | 97.91 | 95.82 | 97.65 | 94.17 | 96.67 | 98.33
Grass-trees | 95.87 | 97.94 | 98.10 | 97.78 | 95.56 | 99.84 | 98.10 | 99.52 | 99.05 | 98.90 | 97.80 | 98.35
Grass-pasture-mowed | 57.14 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Hay-windrowed | 99.74 | 98.94 | 100.00 | 98.41 | 99.47 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Oats | 100.00 | 90.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Soybeans-notill | 90.09 | 86.87 | 94.01 | 96.43 | 95.28 | 99.19 | 97.70 | 96.31 | 97.81 | 98.26 | 98.26 | 98.70
Soybeans-mintill | 87.90 | 81.44 | 93.25 | 95.16 | 91.00 | 93.25 | 95.54 | 95.12 | 97.54 | 98.04 | 96.90 | 96.41
Soybean-clean | 95.94 | 96.55 | 99.19 | 96.15 | 97.77 | 98.58 | 99.19 | 96.75 | 98.99 | 97.30 | 100.00 | 99.32
Wheat | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Woods | 95.71 | 95.62 | 97.94 | 98.28 | 98.45 | 98.63 | 99.91 | 99.48 | 99.91 | 100.00 | 100.00 | 100.00
Bldg-grass-tree-drives | 99.30 | 96.50 | 97.20 | 96.50 | 97.20 | 97.90 | 100.00 | 100.00 | 100.00 | 100.00 | 98.96 | 100.00
Stone-steel towers | 97.87 | 95.74 | 100.00 | 97.87 | 95.74 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
OA | 93.15 | 89.53 | 96.30 | 95.86 | 95.31 | 96.78 | 97.97 | 97.16 | 98.50 | 98.55 | 98.70 | 98.74
AA | 94.08 | 94.68 | 98.31 | 97.47 | 97.46 | 98.68 | 98.76 | 98.56 | 99.16 | 99.01 | 99.27 | 99.43
Kappa | 92.10 | 87.95 | 95.73 | 95.20 | 94.58 | 96.29 | 97.65 | 96.72 | 98.27 | 98.34 | 98.52 | 98.57
Run time (min) | 13.37 | 5.80 | 15.13 | 15.88 | 8.68 | 17.86 | 14.42 | 8.48 | 14.66 | 16.67 | 16.01 | 21.57
Table 16. Ablation study of the spatial FFPNet, spectral FFPNet, and spatial-spectral FFPNet with sample patch sizes of d = 9, 15, 21, and 27 and 100 samples per category on the UP dataset; the values in bold are the best. All results are the average of three runs with maximum epoch = 200 and mini-batch size = 24.

Sample Patch Size | d = 9 | d = 9 | d = 9 | d = 15 | d = 15 | d = 15 | d = 21 | d = 21 | d = 21 | d = 27 | d = 27 | d = 27
Models | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet | Spatial FFPNet | Spectral FFPNet | Spatial-Spectral FFPNet
Asphalt | 79.48 | 57.82 | 88.65 | 90.04 | 81.41 | 94.57 | 91.43 | 94.39 | 97.34 | 88.47 | 84.67 | 91.61
Meadows | 84.34 | 62.46 | 94.64 | 95.35 | 96.42 | 97.98 | 98.01 | 97.81 | 99.81 | 98.26 | 95.92 | 99.31
Gravel | 76.91 | 68.70 | 90.27 | 90.27 | 91.79 | 96.37 | 98.09 | 94.47 | 98.09 | 97.90 | 97.52 | 98.66
Trees | 76.89 | 54.44 | 86.03 | 80.68 | 78.98 | 92.95 | 87.60 | 69.84 | 94.39 | 93.08 | 83.29 | 93.34
Painted metal sheets | 97.62 | 93.45 | 97.62 | 94.35 | 97.92 | 98.81 | 99.11 | 96.73 | 98.51 | 99.40 | 98.51 | 99.40
Bare Soil | 60.78 | 29.91 | 87.59 | 90.06 | 89.58 | 98.41 | 99.92 | 99.12 | 99.68 | 99.68 | 99.84 | 100.00
Bitumen | 94.58 | 96.08 | 98.49 | 97.89 | 97.89 | 98.80 | 99.40 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Self-Blocking Bricks | 63.15 | 37.28 | 84.46 | 86.30 | 64.13 | 90.00 | 88.70 | 84.67 | 99.24 | 85.65 | 87.61 | 93.91
Shadows | 91.59 | 91.59 | 93.81 | 92.48 | 91.59 | 97.35 | 97.35 | 97.35 | 98.23 | 97.35 | 93.36 | 99.12
OA | 78.98 | 58.11 | 91.37 | 91.81 | 89.02 | 96.41 | 95.73 | 94.16 | 98.82 | 95.51 | 93.25 | 97.29
AA | 80.59 | 65.75 | 91.28 | 90.82 | 87.75 | 96.14 | 95.51 | 92.71 | 98.37 | 95.53 | 93.41 | 97.26
Kappa | 72.42 | 47.10 | 88.60 | 89.23 | 85.50 | 95.25 | 94.37 | 92.22 | 98.44 | 94.07 | 91.09 | 96.42
Run time (min) | 5.83 | 2.24 | 9.12 | 5.04 | 3.12 | 7.06 | 6.57 | 3.39 | 8.16 | 7.63 | 4.97 | 9.60
Table 17. Classification accuracies (in %) of different CNN models developed from 2016 to 2019 and the proposed model for the IP dataset (with maximum epoch = 200 and mini-batch size = 24). The values in bold are the best. The CNN by [40] is considered the baseline.

CNN Models | Attention Networks [45] | Attention Networks [45] | Multiple CNN Fusion [42] | Multiple CNN Fusion [42] | Multiple CNN Fusion [42] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNN [40] | CNN [40] | CNN [40] | CNN [40] | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet
Class | A-ResNet | Samples | SODFN | FCLFN | Samples | 1-D | 2-D | 3-D | d = 9 | d = 19 | d = 29 | Samples | d = 9 | d = 19 | d = 29 | Samples | d = 9 | d = 15 | d = 19 | d = 29 | Samples
Alfalfa | 89.23 | 7 | 95.12 | 95.12 | 5 | 89.58 | 99.65 | 100.00 | 100.00 | 100.00 | 100.00 | 30 | 99.13 | 99.57 | 99.13 | 23 | 95.65 | 100.00 | 100.00 | 100.00 | 33
Corn-notill | 97.69 | 214 | 98.91 | 99.38 | 143 | 85.68 | 90.64 | 96.34 | 90.57 | 94.06 | 97.17 | 150 | 80.48 | 94.47 | 98.17 | 200 | 98.78 | 99.35 | 99.59 | 100.00 | 200
Corn-min | 99.29 | 125 | 99.06 | 100.00 | 83 | 87.36 | 99.11 | 99.49 | 97.69 | 96.43 | 98.17 | 150 | 96.65 | 98.22 | 98.92 | 200 | 99.84 | 98.25 | 99.05 | 100.00 | 200
Corn | 92.24 | 36 | 98.12 | 100.00 | 24 | 93.33 | 100.00 | 100.00 | 99.92 | 100.00 | 100.00 | 100 | 99.66 | 100.00 | 100.00 | 118 | 100.00 | 100.00 | 100.00 | 100.00 | 181
Grass/Pasture | 99.02 | 72 | 95.62 | 95.16 | 49 | 96.88 | 98.48 | 99.91 | 98.10 | 98.72 | 98.76 | 150 | 99.46 | 99.75 | 99.71 | 200 | 100.00 | 100.00 | 98.94 | 100.00 | 200
Grass/Trees | 99.77 | 110 | 99.09 | 99.24 | 73 | 98.99 | 97.95 | 99.75 | 99.34 | 99.67 | 100.00 | 150 | 99.53 | 98.90 | 99.40 | 200 | 99.62 | 100.00 | 98.87 | 100.00 | 200
Grass/pasture-mowed | 93.04 | 4 | 80.03 | 72.10 | 3 | 91.67 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 20 | 100.00 | 100.00 | 100.00 | 14 | 100.00 | 100.00 | 100.00 | 100.00 | 20
Hay-windrowed | 100.00 | 72 | 100.00 | 99.53 | 48 | 99.49 | 100.00 | 100.00 | 99.58 | 99.92 | 100.00 | 150 | 99.67 | 99.62 | 100.00 | 200 | 100.00 | 100.00 | 100.00 | 100.00 | 200
Oats | 90.59 | 3 | 100.00 | 88.89 | 2 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 15 | 100.00 | 100.00 | 100.00 | 14 | 100.00 | 100.00 | 100.00 | 100.00 | 14
Soybeans-notill | 98.57 | 146 | 99.31 | 99.54 | 98 | 90.35 | 95.33 | 98.72 | 94.28 | 97.63 | 99.14 | 150 | 92.43 | 98.00 | 98.62 | 200 | 99.33 | 99.73 | 100.00 | 100.00 | 200
Soybeans-min | 99.37 | 368 | 98.87 | 98.64 | 245 | 77.90 | 78.21 | 95.52 | 87.75 | 92.93 | 94.59 | 150 | 76.42 | 94.32 | 96.15 | 200 | 98.58 | 99.38 | 99.78 | 99.35 | 200
Soybean-clean | 97.14 | 89 | 87.99 | 92.68 | 60 | 95.82 | 99.39 | 99.47 | 94.81 | 97.17 | 99.06 | 150 | 97.74 | 99.09 | 99.33 | 200 | 99.24 | 100.00 | 100.00 | 100.00 | 200
Wheat | 100.00 | 31 | 97.83 | 100.00 | 21 | 98.59 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 150 | 99.71 | 100.00 | 99.90 | 102 | 100.00 | 100.00 | 100.00 | 100.00 | 143
Woods | 99.57 | 190 | 99.91 | 99.91 | 126 | 98.55 | 97.71 | 99.55 | 98.09 | 97.88 | 99.76 | 150 | 97.71 | 98.85 | 98.96 | 200 | 98.78 | 99.34 | 100.00 | 100.00 | 200
Bldg-Grass-Tree-Drives | 99.58 | 58 | 98.56 | 97.41 | 39 | 87.41 | 99.31 | 99.54 | 89.79 | 95.80 | 98.39 | 50 | 99.27 | 99.90 | 100.00 | 193 | 100.00 | 100.00 | 100.00 | 100.00 | 200
Stone-steel | 97.72 | 14 | 96.39 | 97.59 | 10 | 98.06 | 99.22 | 99.34 | 100.00 | 99.57 | 98.92 | 50 | 100.00 | 100.00 | 100.00 | 46 | 95.74 | 100.00 | 100.00 | 100.00 | 75
OA | 98.75 | – | 98.21 | 98.56 | – | 87.81 | 89.99 | 97.56 | 93.94 | 96.29 | 97.87 | – | 90.11 | 97.23 | 98.37 | – | 99.07 | 99.47 | 99.68 | 99.84 | –
AA | 97.05 | – | 96.54 | 95.94 | – | 93.12 | 97.19 | 99.23 | 96.87 | 98.11 | 99.00 | – | 96.12 | 98.79 | 99.27 | – | 99.10 | 99.75 | 99.76 | 99.96 | –
Kappa | 98.58 | – | 97.97 | 98.36 | – | 85.30 | 87.95 | 97.02 | 93.12 | 95.78 | 97.57 | – | 88.81 | 96.85 | 98.15 | – | 98.90 | 99.38 | 99.63 | 99.82 | –
Total samples | – | 1537 | – | – | 1029 | – | – | – | – | – | – | 1765 | – | – | – | 2466 | – | – | – | – | 2306
Table 18. Classification accuracies (in %) of different CNN models developed from 2016 to 2019 and the proposed model for the UP dataset (with maximum epoch = 200, mini-batch size = 24). The values in bold are the best. The CNN by [40] is considered the baseline.

CNN Models | Attention Networks [45] | Attention Networks [45] | Multiple CNN Fusion [42] | Multiple CNN Fusion [42] | Multiple CNN Fusion [42] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNNs [41] | CNN [40] | CNN [40] | CNN [40] | CNN [40] | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet | Spatial-Spectral FFPNet
Class | A-ResNet | Samples | SODFN | FCLFN | Samples | 1-D | 2-D | 3-D | d = 15 | d = 21 | d = 27 | Samples | d = 15 | d = 21 | d = 27 | Samples | d = 9 | d = 15 | d = 21 | d = 27 | Samples
Asphalt | 99.80 | 663 | 99.62 | 97.03 | 67 | 92.06 | 97.11 | 99.36 | 97.53 | 98.80 | 98.59 | 548 | 92.81 | 95.31 | 96.31 | 200 | 96.32 | 98.19 | 97.59 | 99.82 | 200
Meadows | 99.97 | 1865 | 99.98 | 100.00 | 186 | 92.80 | 87.66 | 99.36 | 98.98 | 99.46 | 99.60 | 540 | 97.20 | 98.16 | 97.54 | 200 | 97.96 | 99.98 | 99.91 | 99.85 | 200
Gravel | 99.56 | 210 | 97.47 | 95.14 | 22 | 83.67 | 99.69 | 99.69 | 98.96 | 99.59 | 99.45 | 392 | 96.97 | 97.92 | 96.84 | 200 | 95.99 | 98.66 | 99.81 | 99.05 | 200
Trees | 99.74 | 306 | 91.79 | 88.49 | 31 | 93.85 | 98.49 | 99.63 | 99.75 | 99.68 | 99.57 | 542 | 98.62 | 98.74 | 97.58 | 200 | 90.60 | 96.61 | 96.34 | 96.61 | 200
Painted metal sheets | 99.97 | 135 | 99.77 | 99.18 | 15 | 98.91 | 100.00 | 99.95 | 99.93 | 99.78 | 99.61 | 256 | 100.00 | 100.00 | 99.65 | 200 | 100.00 | 99.11 | 100.00 | 100.00 | 200
Bare Soil | 100.00 | 503 | 96.77 | 99.46 | 51 | 94.17 | 98.00 | 99.96 | 99.42 | 99.93 | 99.84 | 532 | 98.57 | 99.57 | 99.33 | 200 | 93.95 | 100.00 | 100.00 | 100.00 | 200
Bitumen | 99.16 | 133 | 89.51 | 95.89 | 15 | 92.68 | 99.89 | 100.00 | 98.71 | 99.88 | 100.00 | 375 | 97.27 | 99.75 | 98.90 | 200 | 97.59 | 100.00 | 100.00 | 100.00 | 200
Self-Blocking Bricks | 99.73 | 368 | 97.59 | 100.00 | 38 | 89.09 | 99.70 | 99.65 | 98.58 | 99.53 | 99.67 | 514 | 96.17 | 98.20 | 98.89 | 200 | 96.30 | 99.02 | 98.04 | 99.13 | 200
Shadows | 99.88 | 95 | 92.52 | 96.20 | 11 | 97.84 | 97.11 | 99.38 | 99.87 | 99.79 | 99.83 | 231 | 99.86 | 99.82 | 99.58 | 200 | 97.35 | 98.23 | 100.00 | 99.56 | 200
OA | 99.86 | – | 98.13 | 98.17 | – | 92.28 | 94.04 | 99.54 | 98.87 | 99.47 | 99.48 | – | 96.83 | 98.06 | 97.80 | – | 96.51 | 99.25 | 99.15 | 99.53 | –
AA | 99.76 | – | 96.11 | 96.80 | – | 92.55 | 97.52 | 99.66 | 99.08 | 99.60 | 99.57 | – | 97.50 | 98.61 | 98.29 | – | 96.23 | 98.87 | 99.08 | 99.33 | –
Kappa | 99.82 | – | 97.53 | 97.58 | – | 90.37 | 92.43 | 99.41 | 98.51 | 99.30 | 99.32 | – | 95.83 | 97.44 | 97.09 | – | 95.37 | 99.01 | 98.87 | 99.38 | –
Total samples | – | 4278 | – | – | 436 | – | – | – | – | – | – | 3930 | – | – | – | 1800 | – | – | – | – | 1800