1. Introduction
Hyperspectral images (HSIs) collect both spectral and spatial information about the scene of interest. Since the spectral signature varies with material composition and there is a spectrum for each pixel in the image, an HSI gives the spatial distribution of the different materials in the scene. The number of spectral bands in an HSI depends on the imaging system, but there are usually hundreds of bands, providing a rich source of information. HSIs are therefore useful in many areas such as remote sensing [1,2,3,4], agriculture [5], food processing [6,7,8], face recognition [9], etc. In some of these applications, such as most remote sensing related areas, the goal is to perform pixel-wise segmentation. The algorithms proposed in the literature for remote sensing HSI classification can be divided into two categories: traditional and deep learning-based methods.
Traditional methods such as the k-nearest neighbor (KNN) classifier [10], random forest [11], and support vector machine (SVM) [12] have been successfully applied and have achieved good results. However, these methods use only spectral information and do not take into account the spatial location of the input pixels. To further exploit the information present in an HSI and advance classification accuracy, spectral-spatial feature extraction methods have been proposed and widely used in the literature. The logic of these methods is that, in an HSI, pixels in a local region most likely share similar spectra. The extended multi-attribute profile (EMAP) [13], one example of these joint spectral-spatial methods, processes the input image by applying attribute filters to extract geometric or textural information. This method models spatial features with operators based on fixed structural elements (SEs).
Deep learning-based methods have been used for remote sensing HSI classification since 2014 [14] and have become very popular ever since. In that work, Chen et al. used joint spectral-spatial features and a stacked auto-encoder (SAE) as the deep learning framework. In 2016, Ma et al. [15] proposed a variation of the SAE called the spatial updated deep auto-encoder (SDAE). In this method, a regularization term is added to the energy function to account for sample similarity and contextual information. More recently, Teffahi et al. [16] used EMAP features along with a sparse auto-encoder to perform HSI classification. The deep belief network (DBN), another type of deep neural network, is built of layers of restricted Boltzmann machines (RBMs) and has been successfully employed in the classification of remote sensing hyperspectral images. In 2015, Chen et al. [17] employed a framework similar to that of [14] but replaced the SAE with a DBN. In 2017, Zhou et al. [18] also used a DBN to perform HSI classification; however, they applied a group-based weight-decay process when training the group-restricted Boltzmann machine (GRBM), the bottom layer of the DBN. The convolutional neural network (CNN), another type of deep learning model, has been successfully applied to HSI classification as well [19,20,21]. In 2017, Li et al. [20] proposed a data augmentation method to deal with the large number of training samples required to train a CNN; to incorporate spatial information, the authors used a majority voting strategy at test time. Pan et al. [21] proposed a CNN-based deep learning model to classify HSI datasets, in which spectral and spatial information are merged through the rolling guidance filter (RGF), an edge-preserving filter. In 2018, Shu et al. [19] used two shallow CNNs whose inputs are a one-channel spectral quilt obtained by stacking spectral patches.
In remote sensing HSI classification, in addition to the spectral data, spatial relationships between pixels provide additional information that can be used to enhance classification accuracy. In other words, rather than considering each pixel as an independent spectrum during the feature extraction phase, the pixel's location and geometric relationship to nearby pixels are used to help label it. One question we consider in this work is whether all pixels should weigh equally in this computation, or whether specific pixels should be weighted according to geometric relationships. If a target pixel is close to an edge, there is a high probability that some of its neighbors are located on the other side of the border and thus belong to a different class. If all the neighboring pixels contribute equally to the target pixel's spatial feature vector, the result will not be a good representation of the pixel's spatial information. The goal of this study is therefore to create a more accurate representation of the spatial feature vectors by assigning different weights to the neighbors.
In this paper, we propose a novel approach for spectral-spatial feature extraction. We weight pixels based on their distance to significant edges in the image, on the assumption that pixels well separated from the boundaries of geometric regions are more likely to be representative of the whole region than pixels near an edge. We first calculate the gradient image of the input HSI. We then use the result of a distance transform, computed for each pixel with respect to the dominant edges, to augment the pixel's spatial feature vector. To the best of our knowledge, this is the first time distance transform values have been used in this way to weight a target pixel's surrounding neighbors. Next, we augment the spatial feature vector with EMAP features in order to integrate further geometric information. Finally, the joint spectral-spatial feature vectors provide the input to a two-layer stacked auto-encoder with a sparse AE in each layer. The workflow diagram of our proposed HSI classification method is shown in Figure 1. We performed extensive experiments on three hyperspectral datasets, and the results show that our proposed method outperforms the other methods examined in this study.
The rest of this paper is organized as follows: Section 2 gives a brief description of the algorithms included in our proposed method. Section 3 presents our proposed feature extraction and classification framework. Experimental results and comparisons to other methods are presented in Section 4 and Section 5. Finally, Section 6 concludes this paper.
3. Methods
In pixel-wise classification, our goal is to group pixels according to their membership in a discrete set of classes (types). Spatial information can be used to enhance classification because multiple pixels of the same class are likely to be geometrically grouped together; thus, the classification of a particular pixel provides additional evidence towards the classification of its neighbours [15]. In our work, we take advantage of this tendency in a nuanced way. We predict that pixels away from the edges of a class should weigh more heavily in classification, because the likelihood that they are similar to nearby pixels is assumed to be greater. A pixel at the edge of a region is likely to be near pixels of different classes, which means that its geometric location is less useful for classification. Consider the cases shown in Figure 3. In Figure 3a, the orange border encloses pixels of the same class. The blue square identifies a pixel that is located well within the orange area, and the green squares depict the neighborhood around the blue pixel. Because the blue pixel is located well within the orange area, far from any boundary, our hypothesis is that this pixel and its neighbors belong to the same class. Therefore, when we form the feature vector for the blue pixel, the spectral values of the green region are included as features of the blue pixel. We treat pixels close to or on the border in the opposite way. Consider the red pixel and its pink neighbourhood depicted in Figure 3b. The red pixel is located close to a boundary, so its neighbours belong to multiple classes. In such cases, we decrease the effect of the neighbors on the target pixel's spatial feature by factoring in the (small) distance from the edge of the region. Our approach is to augment the feature vector with the value of each neighbor along with its distance transform value. This ensures that neighbours further from an edge are weighted more heavily, whereas neighbours nearer to an edge have their effect suppressed.
To find the distance transform image of an HSI cube (see Figure 4), we propose the following procedure. First, we independently apply a Gaussian smoothing filter to each band of the HSI to suppress input noise. Next, we calculate the gradient image using the method described in [26]. The aerial remote sensing scenes used in this study consist of structures such as agricultural areas, roads, and buildings, which produce edges at the major directions, as can be seen in Figure 5, Figure 6 and Figure 7. These edges represent the borders between regions of different classes in each dataset, so we want a gradient image that highlights them. Since HSI datasets are composed of many bands, we used the method described in [26] to obtain a one-band gradient image from the entire dataset. This gradient image highlights the borders of the desired regions. The process is as follows:
First, for each spectral band, we computed the image gradients at the four major directions 0°, 45°, 90°, and 135° using Sobel filters [27] according to Equation (10):

$$G_{i,j} = S_i * I_j, \quad i = 1, \dots, 4, \quad j = 1, \dots, N \tag{10}$$

where $I_j$ is the image at band $j$, $S_i$ is the Sobel filter at one of the four specified directions, $N$ is the total number of spectral bands, and $G_{i,j}$ is the gradient image of band $j$ obtained from the $i$th Sobel filter. Also, $*$ represents the convolution operation.
Next, to obtain the single image gradient for each direction, the corresponding gradients of all bands are added together:

$$G_i = \sum_{j=1}^{N} G_{i,j}$$

where $G_i$ is the gradient image at direction $i$ and $N$ is the total number of spectral bands.
Finally, to compute the final gradient image of the whole HSI cube, the average of the four directional image gradients is computed:

$$G = \frac{1}{4} \sum_{i=1}^{4} G_i$$

where $G$ is the output gradient image of the input HSI.
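As a concrete illustration, the following is a minimal sketch of this gradient computation, assuming the HSI is a NumPy array of shape (rows, cols, bands); the function name `hsi_gradient`, the Gaussian width `sigma`, and the use of absolute gradient magnitudes before summing are our own illustrative choices, not specified by the method description.

```python
# Minimal sketch of the multi-band gradient computation described above.
import numpy as np
from scipy.ndimage import gaussian_filter, convolve

# Sobel kernels at 0, 45, 90, and 135 degrees (standard diagonal variants).
SOBELS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),   # 0 deg
    np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], float),   # 45 deg
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),   # 90 deg
    np.array([[2, 1, 0], [1, 0, -1], [0, -1, -2]], float),   # 135 deg
]

def hsi_gradient(hsi: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Smooth each band, convolve with the four Sobel kernels (G_{i,j}),
    sum over bands per direction (G_i), then average the four directions (G)."""
    rows, cols, bands = hsi.shape
    directional = np.zeros((4, rows, cols))
    for j in range(bands):
        band = gaussian_filter(hsi[:, :, j], sigma=sigma)  # noise suppression
        for i, kernel in enumerate(SOBELS):
            # Absolute magnitude is an assumption; signed sums are also possible.
            directional[i] += np.abs(convolve(band, kernel))
    return directional.mean(axis=0)
```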
Having calculated the gradient image, we apply a threshold to extract strong edges: pixels with gradient values greater than the threshold are labeled as foreground, and the rest are labeled as background in the resulting binary mask. To remove small connected components from the thresholded image, morphological opening is first applied to the binary mask, and then connected components having fewer than a minimum number of pixels are removed. In Section 4.2.3, we explain what values are chosen for these two hyperparameters. Finally, in order to find the distance of the pixels in the original image to the foreground (edge) pixels in the thresholded image, we used the distance transform [28] with the Euclidean distance metric:

$$d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$$

where $p$ and $q$ represent two pixels' locations in the image. The distance transform yields an image in which the brightness of each pixel indicates its distance from the nearest edge.
Figure 4 illustrates how the process acts on an image from the Salinas HSI dataset.
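The edge-mask and distance-transform steps can be sketched as follows, assuming `gradient` is the single-band gradient image from the previous step; `threshold` and `min_size` stand in for the two hyperparameters discussed in Section 4.2.3, and the SciPy/scikit-image calls are one possible realization.

```python
# Sketch of the thresholding, morphological cleanup, and distance transform.
import numpy as np
from scipy.ndimage import binary_opening, distance_transform_edt
from skimage.morphology import remove_small_objects

def edge_distance_map(gradient: np.ndarray, threshold: float, min_size: int) -> np.ndarray:
    edges = gradient > threshold                              # strong edges -> foreground
    edges = binary_opening(edges)                             # morphological opening
    edges = remove_small_objects(edges, min_size=min_size)    # drop tiny components
    # Euclidean distance of every pixel to the nearest foreground (edge) pixel;
    # edges are inverted so edge pixels are the zeros the transform measures to.
    return distance_transform_edt(~edges)
```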
The spatial feature vector is then augmented with the distance transform values, as shown in Figure 8. To find the spatial features of the pixels in the HSI, similar to [14], we first reduced the dimension of the data along its spectral axis using principal component analysis (PCA). PCA searches for the most accurate representation of the data in a lower-dimensional subspace composed of uncorrelated linear combinations of the original variables, called principal components (PCs); in effect, it maps the input data onto the dimensions along which the data vary the most. Since we are interested in reducing the dimensionality of the input data, we keep only the first few PCs, which contain most of the information in the original data. In Section 4.2.1, we discuss the number of PCs preserved in our experiments. Then, for each target pixel (blue square in Figure 8), a surrounding neighborhood area (green region) is considered. To form the primary spatial feature vector, the PC values of the neighbors and the distance transform values from the corresponding distance transform image (Figure 4) are combined according to Equation (14):

$$\mathbf{F}_{p_0} = \mathrm{PC}(p_1) \oplus dt(p_1) \oplus \mathrm{PC}(p_2) \oplus dt(p_2) \oplus \cdots \oplus \mathrm{PC}(p_M) \oplus dt(p_M) \tag{14}$$

where $\mathbf{F}_{p_0}$ and $p_i$ represent the primary spatial feature vector associated with the target pixel $p_0$ and the $i$th pixel in the neighborhood region around this pixel, respectively; $\mathrm{PC}(p_i)$ and $dt(p_i)$ represent a vector containing the PC score values and the distance transform value (scalar) associated with pixel $p_i$, respectively; and $\oplus$ indicates horizontal concatenation. Note that the distance transform values ($dt(p_1)$ to $dt(p_M)$) can be placed either before or after each set of PC values for each neighbour of the target pixel. With this transformation, we ensure that the contribution of neighbour pixels to the target pixel's spatial feature declines with proximity to the edges of a region.
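A minimal sketch of Equation (14) might look as follows, assuming `pc` holds the first few PC score images with shape (rows, cols, k) (obtained, e.g., with a standard PCA on the unfolded spectra) and `dt` is the distance transform image computed above; the window size `w` and the reflective border padding are illustrative choices.

```python
# Sketch of building the primary spatial feature vector for one target pixel.
import numpy as np

def primary_spatial_feature(pc: np.ndarray, dt: np.ndarray,
                            r: int, c: int, w: int = 5) -> np.ndarray:
    """Concatenate [PC(p_i), dt(p_i)] for every pixel p_i in the w x w
    neighborhood of the target pixel at (r, c)."""
    h = w // 2
    pc_pad = np.pad(pc, ((h, h), (h, h), (0, 0)), mode="reflect")
    dt_pad = np.pad(dt, h, mode="reflect")
    feats = []
    for dr in range(w):
        for dc in range(w):
            feats.append(pc_pad[r + dr, c + dc])    # PC score vector of neighbor
            feats.append([dt_pad[r + dr, c + dc]])  # its distance-transform weight
    return np.concatenate(feats)
```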
To compute the final proposed spatial feature vector and extract further spatial information by adding geometric attributes to the primary spatial feature vector, we also incorporated the EMAP features (see Section 2.4), similar to [16]. Finally, the pixel's spectrum is appended to the proposed spatial feature vector to form the input to the stacked auto-encoder.
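As a sketch of how the joint feature vector might feed a two-layer stacked sparse auto-encoder, the snippet below uses Keras for illustration; the paper does not prescribe a framework, and the layer sizes, sigmoid activations, and L1 activity penalty are placeholder choices.

```python
# Sketch of one sparse auto-encoder layer; stacking two of these (greedy
# layer-wise training) gives the two-layer SAE used in this work.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def sparse_ae(input_dim, hidden_dim, l1=1e-4):
    inp = keras.Input(shape=(input_dim,))
    code = layers.Dense(hidden_dim, activation="sigmoid",
                        activity_regularizer=regularizers.l1(l1))(inp)  # sparsity
    out = layers.Dense(input_dim, activation="sigmoid")(code)
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    return ae, keras.Model(inp, code)  # full AE and its encoder

# x = np.hstack([spectrum, spatial_features])  # joint spectral-spatial input
# Greedy training: fit AE 1 on x, encode x, then fit AE 2 on the codes.
```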
In Section 4, we test the effectiveness of our proposed method in classifying various remote sensing hyperspectral scenes. We performed two sets of experiments. In the first set (Proposed-P), we used only our primary proposed feature vector. In the second set (Proposed-F), we tested our final feature vector, which includes the EMAP features as well as the distance transform information.
To build the EMAP features in our Proposed-F feature vector, for the Salinas and University of Pavia datasets we used the area of the regions and the length of the diagonal of the regions' bounding boxes as the region attributes. For the Surrey dataset, the area of the regions, the length of the diagonal of the regions' bounding boxes, and the standard deviation of the pixels in the regions were considered. For each attribute, an EAP is obtained from the first four PC images of the input HSI for all three datasets. For computing each AP, four different threshold values for filtering are used, which are listed in Table 1.
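As an illustration of a single attribute profile (AP), the sketch below uses scikit-image's area opening and closing as the attribute filters; a full EMAP stacks such profiles over the attributes named above and over the first four PC images, and the thresholds here are placeholders for those listed in Table 1.

```python
# Sketch of an area-attribute profile (AP) on one PC image.
import numpy as np
from skimage.morphology import area_opening, area_closing

def area_attribute_profile(pc_image: np.ndarray,
                           thresholds=(100, 500, 1000, 5000)) -> np.ndarray:
    """Stack of area closings, the original image, and area openings."""
    closings = [area_closing(pc_image, area_threshold=t) for t in reversed(thresholds)]
    openings = [area_opening(pc_image, area_threshold=t) for t in thresholds]
    return np.stack(closings + [pc_image] + openings, axis=-1)
```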
5. Discussion
To check the effectiveness of our proposed method, we compared it with several traditional and recent hyperspectral classification methods: linear SVM, kernel SVM with a radial basis function (RBF), extended multi-attribute profiles (EMAP), DAE [14], PPF-CNN [20], and EMAP-SAE [16]. We implemented both the linear and kernel SVM using the LIBSVM library [29].
We performed 10-fold cross-validation to find the best values for the regularization parameter C and the width of the Gaussian kernel gamma. For the University of Pavia dataset, C = 10 gave the best linear SVM, and gamma = 0.1 gave the best kernel SVM; for the Salinas and Surrey datasets, the best kernel widths were gamma = 1 and gamma = 0.01, respectively, with the corresponding C values selected by the same cross-validation procedure. For the kernel SVM classifier used with the EMAP method, the cross-validated kernel widths for the three datasets were gamma = 0.1 (Salinas), gamma = 0.001 (University of Pavia), and gamma = 0.01 (Surrey). With all other methods, we used a logistic regression (LR) classifier.
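The parameter search can be sketched with scikit-learn, whose `SVC` wraps LIBSVM; the candidate grids below are illustrative, not the exact values searched in our experiments.

```python
# Sketch of the 10-fold cross-validated grid search for C and gamma.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    grid = {"C": [0.1, 1, 10, 100, 1000], "gamma": [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)
    search.fit(X, y)
    return search.best_params_  # e.g., {"C": 10, "gamma": 0.1}
```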
Table 4, Table 5 and Table 6 compare the results of our proposed methods (using the Proposed-P and Proposed-F spatial feature vectors) with those of the other methods examined in this study.
From these tables, we can draw the following conclusions. First, methods that combine spectral and spatial information achieve higher classification accuracies than spectral-only approaches (L-SVM and K-SVM). Second, deep learning-based classifiers (i.e., DAE, PPF-CNN, EMAP-SAE, Proposed-P, and Proposed-F) outperform traditional HSI classification methods on almost all the accuracy metrics. The DAE and EMAP-SAE methods, like our algorithm, employ joint spectral-spatial information and an SAE, but they do not use the weighting strategy for neighboring pixels. Comparing the results of our Proposed-P feature vector with these two methods, we can therefore conclude that assigning different weights to the neighboring pixels improves classification accuracy; this improvement also shows the effectiveness of our proposed way of computing such weights. The last column of these tables shows the results of our Proposed-F feature vector. The increase in the accuracy metrics over our primary feature vector shows the benefit of adding more geometric attributes on top of the Proposed-P spatial feature vector. Therefore, as can be seen from these tables, adding the distance transform value to the feature vector and combining it with EMAP features improves the accuracy metrics as well as the test time compared to the other methods. As an example, Table 7 summarizes the improvements made by our proposed methods compared to the EMAP-SAE [16] algorithm. For the Salinas dataset, adding only the distance transform features (Proposed-P) improves the OA by 2.27% compared to the EMAP-SAE method; for the University of Pavia and Surrey datasets, the corresponding increases in OA are 0.57% and 2.07%, respectively. Furthermore, the kappa coefficient is improved by 0.03, 0.01, and 0.03 for the three datasets, respectively, and the increases in AA over the EMAP-SAE method are 1.38%, 0.65%, and 2.35%, respectively. The improvements made by our final feature vector (Proposed-F) are as follows: for the Salinas dataset, the OA, AA, and kappa coefficient are improved by 3.03%, 1.75%, and 0.04, respectively; for the University of Pavia dataset, the corresponding increases are 1.21%, 1.55%, and 0.02; and for the Surrey dataset, we see improvements of 3.19%, 3.61%, and 0.04.
Figure 12, Figure 13 and Figure 14 show the classification maps obtained from our method and the other methods explored in this study. As can be seen from these figures, there are fewer misclassified pixels in our classification maps, which is consistent with the results shown in Table 4, Table 5 and Table 6. For instance, according to Figure 12i (and also Table 4), our proposed final spatial feature vector especially improves the classification accuracies of classes 8 and 15 of the Salinas dataset. As Figure 13i shows, our proposed feature vector also improves the classification accuracy of class 3 of the University of Pavia dataset significantly compared to the other methods.
6. Conclusions
In this paper, we proposed a new method for the HSI classification task. In an HSI, pixels in the same region belong, with high probability, to the same category; therefore, the information from a target pixel's neighbors may be used as the pixel's spatial features. However, our conjecture in this work is that not all neighbours should contribute to the same extent in forming the spatial feature vector of a target pixel. We hypothesize that pixels farthest from geometric boundaries are most likely to be located near neighbours of the same class, and that pixel distance from geometric edges should therefore be used to weight the contribution of neighbouring pixels. Our proposed approach incorporates a new distance transform-based spatial feature vector, which augments the spatial feature vector with the distance of pixels from the boundaries of geometric regions. The distance measure is obtained from a distance transform after some morphological processing of the input image. We use the stacked sparse auto-encoder as the deep learning framework and train it on a training set with known class labels. To find the best values for the parameters of our method, we performed extensive experiments in the hyperparameter space. We also investigated the impact of geometric information by further augmenting the feature vector with EMAP features. We tested our method on three HSI datasets and, based on our experimental results, obtained the following findings. First, assigning different weights to the target pixel's adjacent neighbors improved classification accuracies, in accordance with our hypothesis. We also saw an increase in classification accuracy with the inclusion of EMAP features in the spatial feature vector, which provides further support for the importance of spatial information in hyperspectral images. Despite the improvements made by our algorithm, it has some limitations. First, since our model must be trained and evaluated on each dataset separately, three different sets of SAE weights and model hyperparameters are obtained. Second, since our model introduces several hyperparameters that need to be tuned, a careful search of the hyperparameter space is required.
Although the experimental results show that our method outperforms the other traditional and recent HSI classifiers examined in this study, there is still work to be done in the future. For example, to test the generalization power of our algorithm, it should be evaluated on more remote sensing hyperspectral scenes acquired with various sensors. Training a unified model that can perform pixel-wise classification of these various remote sensing HSI scenes with high accuracy is another future research goal. Moreover, with more computational power, we could explore more combinations of hyperparameter values in less time to ensure that classification accuracy is maximized.