1. Introduction
A hyperspectral image (HSI) offers more refined spectral information than other remote sensing images (e.g., optical and multispectral images), making it advantageous for hyperspectral image classification (HSIC) tasks. HSIC has various applications in military detection, mineral exploration, agricultural production, urban planning, and environmental monitoring [1,2,3,4,5]. However, HSI has inherent characteristics such as high-dimensional data, a limited training sample size, and spectral uncertainty. Predicting the true class of each pixel with high accuracy based on spatial–spectral features in HSI remains a challenge [3] and merits further research.
During early HSIC research, most studies focused on investigating how spectral characteristics functioned in classification, proposing many traditional pixel-based classification methods, including support vector machines (SVMs) [6], K-nearest neighbors [7], and multinomial logistic regression [8,9]. The main disadvantage of these methods is that the feature engineering step is time-consuming, and classification accuracy tends to suffer when highly correlated features or features with low information content are extracted in this step. Given the inherent nonlinear relationship between spectral information and the corresponding materials in an HSI, it is challenging to accurately classify such data using traditional machine learning methods.
Deep learning (DL) is regarded as a powerful tool for solving nonlinear problems. It has been utilized extensively in several image processing tasks, including image classification [10,11], target detection [12], natural language processing [13], etc. Inspired by the success of these applications, HSIC has also benefited from DL, demonstrating promising results. Currently, a key indicator of a classification model's quality is how well it avoids the curse of dimensionality and extracts useful, discriminative feature information from the high-dimensional feature space of HSIs. In [14,15,16,17,18], most scholars employ 2D CNNs to extract the spatial characteristics in the hyperspectral pixel neighborhood after first performing a principal component analysis (PCA) on the entire set of hyperspectral data to lower the dimensionality of the original space. By integrating PCA and CNNs, this method can efficiently extract spatial information and lower computing costs. Nevertheless, spectral information is unavoidably lost during the dimensionality reduction process, which might impact the model's ability to comprehend input features and its overall classification performance. Li et al. proposed a three-dimensional convolutional neural network (3-D-CNN) that can synchronously analyze spatial–spectral characteristics and produce impressive classification performance [19]. It is worth noting that a diverse-region CNN (DR-CNN) utilized various neighboring areas of the center target pixel; nevertheless, when the labeled sample size was insufficient, the model's generalization performance was a cause for concern [20].
In addition, as the data volume increases, the issue of scarce labeled samples worsens, making it challenging to use the above-mentioned CNN-based supervised learning approaches in the absence of a sufficient number of labeled training samples. Consequently, scholars began focusing on semi-supervised learning (SSL) approaches, which supplement the labeled data with information from unlabeled samples. For instance, Zhou et al. suggested a label propagation approach that makes use of labeled samples to execute label propagation on the full HSI and obtain labels for unlabeled samples [21]. However, because of parameter sensitivity, this approach is prone to noise. The semi-supervised support vector machine (S3VM) constructs a support vector machine classifier by combining existing labeled samples with a certain proportion of unlabeled samples [22]. Although it produces good classification results, good generalization performance necessitates careful parameter adjustment. Makhzani et al. provided a semi-supervised classification method that reconstructs HSIs using autoencoders while restricting the reconstruction error [23]. Common variants include sparse autoencoders and variational autoencoders. Good classification results have been obtained using this strategy; however, it is too time-consuming.
Notably, semi-supervised graph convolutional networks (SSGCNs) have demonstrated notable efficacy as one of the most efficient SSL techniques. By effectively processing the local spatial and global semantic characteristics in HSIs, utilizing features from unlabeled nodes, and comprehensively learning the interaction and feature transfer between nodes, they have achieved a notable increase in classification accuracy. However, while a traditional GCN can aggregate and transform features from every graph node's neighbors, it only utilizes spectral features and overlooks the significant spatial structures embedded in the original HSI data [24]. Moreover, when dealing with a large number of pixels, the construction and computing costs associated with the graph structure become infeasible. Compared with the original GCN, the spectral–spatial GCN (S²GCN) proposed by Qin et al. achieved superior classification accuracy [25]. Nevertheless, this approach employs only a fixed neighborhood size and a fixed graph throughout the graph convolution process, making it unable to flexibly capture spectral–spatial information from various local areas or accurately portray the intrinsic relationships between pixels. Consequently, Wan et al. [26] suggested a multi-scale dynamic GCN (MDGCN) that incorporated superpixels into multi-hop graph learning, saving training time while reducing computational complexity. However, integrating multi-scale spatial information using a spatial multi-hop graph structure may lead to classifier deviation, which would impact the classification performance. Li et al. [27] proposed a novel framework called SGML, which combines graph-embedding technology and metric learning methods to better capture the similarities and differences between samples and improve classification efficiency. However, this network adopts a multi-scale superpixel segmentation technology to process hyperspectral images, which is likely to ignore the pixel features of local details. Dong et al. [28] proposed a weighted feature fusion method combining a convolutional neural network (CNN) and a graph attention network (GAT), offering a new solution for dual-branch fusion networks, but the classification performance still needs improvement.
Hence, in this paper, we propose a dual-branch fusion of a GCN and CNN, namely DFGCN, to achieve superior hyperspectral image classification outcomes. First, a multi-scale superpixel segmentation method is employed in the GCN branch to fully exploit feature information from regions of various shapes and sizes. Additionally, this approach significantly reduces the computational cost by converting the algorithmic calculation unit from individual pixels to superpixels. Next, fusion adjacency matrices are created based on the superpixels at each scale to better measure the similarity between the graph nodes, resulting in more efficient graph convolution and stronger node representations. Then, a spectral feature enhancement module between the two graph convolutions strengthens the most important channels of information during data transmission. In the CNN branch, we design a convolutional network with an attention mechanism to concentrate on extracting detailed features of local areas. Through the fusion of the multi-scale superpixel features from the GCN branch and the local pixel features from the CNN branch, our proposed approach comprehensively captures and fully learns rich spatial–spectral information, thereby enhancing classification performance. The following are the novel aspects of this study:
- (1)
The methodology adopted in this research involves the construction of a fusion adjacency matrix following the segmentation of an HSI using multi-scale superpixel segmentation. The incorporation of the Pearson correlation coefficient as a supplement to the Euclidean-distance-based similarity function is a critical aspect of this study, and the weight ratio between the two is of paramount importance. The introduced adjacency matrices play a vital role in discovering novel graph structures, facilitating the learning of more powerful node representations and enhancing the effectiveness of the graph convolutions. The proposed technique enables the extraction of spatial information features that are more comprehensive and discriminative than those extracted by existing methods.
- (2)
A spectral feature enhancement module is designed between the two graph convolutions to strengthen important channel information in a self-supervised way and extract more discriminative spectral information.
- (3)
We fused the GCN branch based on multi-scale superpixel segmentation with the CNN branch, which included an attention mechanism, to fully extract the long-distance contextual information and local detail features of the HSI. Furthermore, our extensive experiments demonstrated that the proposed DFGCN outperforms several widely used and advanced classification techniques in terms of classification results.
The remainder of this article is arranged as follows: Section 2 introduces our methods, including the entire architecture, a synopsis of superpixel segmentation, and a detailed implementation of the proposed DFGCN. Section 3 describes our experimental data sets and evaluation indicators. Section 4 presents our extensive experimental results. Section 5 provides further analysis and discussion. Section 6 presents our conclusions.
2. Methods
2.1. Architecture of the Proposed DFGCN
This section describes the proposed DFGCN, which can be seen in Figure 1. It is primarily divided into two branches: the GCN branch, based on multi-scale superpixel segmentation, and the CNN branch with an attention mechanism. In the GCN branch, we perform adaptive multi-scale superpixel segmentation on the first principal component after reducing the dimensionality of the HSI using the PCA method. For each scale, we map the graph nodes from the pixel scale to the superpixel scale and then carry out fusion adjacency matrix construction (FAMC). We establish a spectral feature enhancement module between the two graph convolutions and employ these components for spectral feature extraction. In the CNN branch, we design a convolutional network with an attention mechanism to focus on extracting detailed features of local areas. Finally, we fuse the complementary features of the two branches and send them to the classifier. In the following sections, we provide a detailed description of the main DFGCN implementation procedures, including the multi-scale superpixel segmentation method, the construction of the fusion adjacency matrix, the design of the spectral feature enhancement module, and the structure of the CNN branch.
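For orientation, below is a minimal PyTorch sketch of this dual-branch design, assuming the two branches are provided as submodules and that fusion is performed by feature concatenation followed by a linear classifier; the module and parameter names (`DualBranchFusion`, `gcn_dim`, `cnn_dim`) are hypothetical illustrations, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative skeleton: fuse superpixel-level GCN features with
    pixel-level CNN features before classification (hypothetical sizes)."""

    def __init__(self, gcn_branch: nn.Module, cnn_branch: nn.Module,
                 gcn_dim: int, cnn_dim: int, num_classes: int):
        super().__init__()
        self.gcn_branch = gcn_branch  # multi-scale superpixel GCN (Sections 2.2-2.5)
        self.cnn_branch = cnn_branch  # attention CNN (Section 2.6)
        self.classifier = nn.Linear(gcn_dim + cnn_dim, num_classes)

    def forward(self, x_pixels, graph_inputs):
        f_gcn = self.gcn_branch(graph_inputs)      # (N, gcn_dim) per-pixel features
        f_cnn = self.cnn_branch(x_pixels)          # (N, cnn_dim) per-pixel features
        fused = torch.cat([f_gcn, f_cnn], dim=-1)  # complementary feature fusion
        return self.classifier(fused)

# Toy usage with identity branches standing in for the real ones:
model = DualBranchFusion(nn.Identity(), nn.Identity(), gcn_dim=8, cnn_dim=8, num_classes=3)
logits = model(torch.rand(5, 8), torch.rand(5, 8))  # shape (5, 3)
```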
2.2. Superpixel Segmentation
Superpixel segmentation is a technique that enhances the ability to extract semantic information from images by aggregating pixels with similar color and texture features into more significant and recognizable portions [29]. These new portions serve as the fundamental units of subsequent image processing, which can greatly reduce the computational burden, as seen in Figure 2 below. Furthermore, superpixel segmentation has already been employed as a preprocessing technique in many HSIC methods and has proven to be effective [30]. For example, Li et al. proposed a symmetric graph metric learning framework based on a multi-scale adaptive superpixel segmentation technique to increase classification efficiency using the graph's structural characteristics and metric learning technology [27]. Jia et al. presented methods for clustering pixels with similar spectral characteristics by carrying out weighted label propagation on superpixels [31], reducing the computing time and obtaining a notable classification performance. Specifically, entropy rate segmentation (ERS) is usually chosen to produce superpixels because of its efficacy compared with other methods [32]. In summary, since ERS is a graph-based technique, it can be formulated as the solution to the following objective function:
$$\max_{A}\; H(A) + \lambda B(A), \quad \text{s.t.}\; A \subseteq E \quad (1)$$

Here, $H(A)$ is the entropy rate term, which is used to create homogeneous clusters. $B(A)$ is a balancing constraint that lowers the number of unbalanced superpixels by requiring clusters to have comparable spatial sizes. $\lambda$ represents the weight coefficient balancing the constraint, which must be greater than or equal to 0.
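As a concrete illustration of this preprocessing step, the sketch below segments the first principal component of an HSI cube into superpixels at several scales. Since ERS [32] is not available in common Python packages, SLIC from scikit-image is used here purely as a stand-in segmenter; the cube shape, scale list, and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.segmentation import slic

def multiscale_superpixels(hsi: np.ndarray, scales=(50, 100, 200)):
    """Segment the first principal component of an HSI cube (H, W, B)
    at several superpixel counts. SLIC stands in for ERS here."""
    h, w, b = hsi.shape
    pc1 = PCA(n_components=1).fit_transform(hsi.reshape(-1, b)).reshape(h, w)
    # Rescale to [0, 1] so the segmenter's compactness term behaves sensibly.
    pc1 = (pc1 - pc1.min()) / (pc1.max() - pc1.min() + 1e-12)
    return [slic(pc1, n_segments=s, compactness=0.1, channel_axis=None)
            for s in scales]

# Example: three label maps, one per scale, each of shape (H, W).
segments = multiscale_superpixels(np.random.rand(64, 64, 100))
```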
2.3. Fusion Adjacency Matrix Construction
Next, we will introduce how to build a fused graph adjacency matrix from an HSI after multi-scale superpixel segmentation. First, we briefly introduce the process of converting pixels into superpixels. Because superpixels automatically adapt their size and shape to the HSI content, they are an excellent way to describe land cover. Consequently, we leverage superpixels to make further graph learning easier. In this case, each superpixel's value is determined by averaging the pixels that make up the superpixel:

$$\mathbf{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbf{p}_j$$

where $\mathbf{p}_j$ denotes the $j$th pixel contained in superpixel $S_i$ and $n_i$ is the number of pixels it contains.
Let $S_l$ and $S_u$ be the labeled and unlabeled superpixels, respectively. The length (number of pixels) of every superpixel is denoted by $n_i$. The numbers of superpixels with labeled and unlabeled samples are denoted by $N_l$ and $N_u$, respectively, with $N = N_l + N_u$. By means of majority voting over the contained pixels, the corresponding labels of $S_l$ are selected. After that, every superpixel is utilized to create the graph $G = (V, E)$, with $\mathbf{x}_i$ representing each graph node's superpixel feature.
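A minimal numpy sketch of this pixel-to-superpixel mapping under the definitions above; the array names and the convention that pixel label 0 means "unlabeled" are illustrative assumptions.

```python
import numpy as np

def superpixel_nodes(hsi: np.ndarray, segments: np.ndarray, pixel_labels: np.ndarray):
    """Average pixel spectra within each superpixel (graph node features)
    and derive node labels by majority vote (label 0 = unlabeled pixel)."""
    h, w, b = hsi.shape
    flat_x = hsi.reshape(-1, b)
    flat_seg = segments.reshape(-1)
    flat_y = pixel_labels.reshape(-1)
    node_ids = np.unique(flat_seg)
    features = np.zeros((len(node_ids), b))
    labels = np.zeros(len(node_ids), dtype=int)
    for i, s in enumerate(node_ids):
        mask = flat_seg == s
        features[i] = flat_x[mask].mean(axis=0)   # mean spectrum of the superpixel
        member = flat_y[mask]
        member = member[member > 0]
        if member.size:                           # majority vote over labeled pixels
            labels[i] = np.bincount(member).argmax()
    return features, labels

# Toy usage with random data:
hsi = np.random.rand(8, 8, 20)
seg = np.random.randint(0, 4, (8, 8))   # toy segmentation map with 4 superpixels
y = np.random.randint(0, 3, (8, 8))     # toy pixel labels, 0 = unlabeled
feats, labels = superpixel_nodes(hsi, seg, y)
```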
The majority of existing GCN-based HSIC methods employ a single Euclidean distance to build the similarity function of the graph adjacency matrix [27,33]. As an intuitive distance measure, the Euclidean distance quantifies the degree of difference between hyperspectral pixels through their geometric distance (Formula (2)):

$$d(\mathbf{x}_i, \mathbf{x}_j) = \left\| \mathbf{x}_i - \mathbf{x}_j \right\|_2 \quad (2)$$

However, it does not properly account for the linear correlation between data, so the resulting graph adjacency matrix may not accurately capture the complex data characteristics of the HSI, impacting the accuracy and stability of classification and weakening the robustness of the algorithm.
Therefore, we introduce the Pearson correlation coefficient as a supplement to the Euclidean-distance-based similarity function. The Pearson correlation coefficient measures the similarity between variables based on their covariance (Formula (3)), taking the linear relationship between variables into account to determine whether they change with similar or opposite trends:

$$\rho(\mathbf{x}_i, \mathbf{x}_j) = \frac{\operatorname{cov}(\mathbf{x}_i, \mathbf{x}_j)}{\sigma_{\mathbf{x}_i}\,\sigma_{\mathbf{x}_j}} \quad (3)$$
Combining the Euclidean distance and the Pearson correlation coefficient yields a similarity function that more effectively captures the spatial relationships, feature similarity, and correlations between different pixels in an HSI. This produces more comprehensive feature information while improving the performance of HSIC tasks and enhancing classification efficacy and accuracy. The structural function of the graph adjacency matrix is shown in Formula (4):

$$A_{ij} = \begin{cases} \beta \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{\sigma^2}\right) + (1-\beta)\,\rho(\mathbf{x}_i, \mathbf{x}_j), & \text{if nodes } i \text{ and } j \text{ are adjacent} \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

where $A_{ij} \neq 0$ denotes that two graph nodes are adjacent, $\beta$ represents the weight ratio between the two similarity measures, and $\sigma$ is a bandwidth parameter controlling the distance-based term.
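A minimal numpy sketch of this fused construction, assuming a Gaussian kernel for the distance term; the weight `beta`, bandwidth `sigma`, and the dense (all-pairs) adjacency are illustrative simplifications, since in practice only neighboring superpixels would be connected.

```python
import numpy as np

def fusion_adjacency(features: np.ndarray, beta: float = 0.5, sigma: float = 1.0):
    """Fuse a Gaussian (Euclidean) similarity with the Pearson correlation
    coefficient into one adjacency matrix. features: (N, B) node spectra."""
    # Pairwise squared Euclidean distances -> Gaussian similarity.
    sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    gauss = np.exp(-sq_dists / sigma**2)
    # Pearson correlation between node spectra (rows of `features`).
    pearson = np.corrcoef(features)
    adj = beta * gauss + (1.0 - beta) * pearson
    np.fill_diagonal(adj, 0.0)  # no self-loops; added later during renormalization
    return adj

adj = fusion_adjacency(np.random.rand(16, 100))  # 16 superpixel nodes, 100 bands
```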
2.4. Graph Convolutional Network
When it comes to spectral-based graph convolutional neural networks, the GCN is one of the most widely used techniques. Its main use in topological graphs is to extract the spatial characteristics of the relevant vertices and edges [34,35]. Notably, Kipf and Welling [24] developed an efficient layer-by-layer propagation method that can encode node properties and the local graph structure, leading to a more stable state, using Chebyshev polynomials to approximate the convolution kernel. In short, from the Fourier perspective of the graph Laplacian [36], the convolution operation is defined as follows:

$$g_\theta \star x = U\, g_\theta\, U^\top x$$
Among them, the convolution filter parameterized by $\theta$ is represented by $g_\theta = \operatorname{diag}(\theta)$, while $x \in \mathbb{R}^N$ represents the graph signal. The eigenvector matrix of the normalized graph Laplacian is represented by the symbol $U$, where the Laplacian may be written as $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$. The identity matrix of the proper size is represented by $I_N$. The graph's degree matrix is denoted by $D$, while the adjacency matrix is represented by $A$. The diagonal matrix $\Lambda$ corresponds to the eigenvalues of $L$. Next, the authors in [34] used the truncated, shifted Chebyshev polynomials $T_k$ to approximate $g_\theta$. This can be stated as follows:

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{L})\, x$$
where $\theta'_k$ is the $k$th Chebyshev coefficient, the shifted Laplacian is $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$, and $\lambda_{\max}$ is the greatest eigenvalue of $L$. Notably, this operation is $K$-localized, since it uses the $K$th-order polynomial of the Laplacian. The layer-by-layer convolution process is further approximated and restricted to $K = 1$ (with $\lambda_{\max} \approx 2$) using the GCN [24]. The computation formula is as follows:

$$g_{\theta'} \star x \approx \theta'_0\, x - \theta'_1\, D^{-1/2} A D^{-1/2}\, x$$
where $\theta'_0$ and $\theta'_1$ are the two free parameters shared across the entire graph. Under the restrictive condition $\theta = \theta'_0 = -\theta'_1$, the above expression can be simplified to

$$g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x$$
To prevent the issue of vanishing/exploding gradients and numerical instability, the renormalization trick $I_N + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is applied, in conjunction with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Lastly, the graph convolution can be expressed as follows for the signal $X \in \mathbb{R}^{N \times C}$ ($N$ nodes, each with a $C$-dimensional feature vector):

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta$$
where $\Theta \in \mathbb{R}^{C \times F}$ contains the trainable convolutional variables and $F$ is the number of kernels. The graph convolution's output is represented by $Z \in \mathbb{R}^{N \times F}$.
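A minimal PyTorch sketch of one renormalized graph convolution layer implementing the formula above (a generic GCN layer, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One GCN layer: Z = D~^{-1/2} (A + I) D~^{-1/2} X Theta."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)  # trainable Theta

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)      # add self-loops
        d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1e-12).pow(-0.5)     # D~^{-1/2}
        a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]    # renormalize
        return a_hat @ self.theta(x)

# Example: 16 nodes with 100-band features -> 32-dimensional output.
layer = GraphConv(100, 32)
out = layer(torch.rand(16, 100), torch.rand(16, 16))  # shape (16, 32)
```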
Moreover, considering the highly nonlinear geometric nature of an HSI in feature space, which is susceptible to changes in illumination, environment, atmosphere, and time conditions, working on the graph can potentially improve the robustness of the experiment [37]. Several studies have used a GCN to classify HSIs, achieving encouraging results. In this work, we further explore how to fully utilize the advantages of graph convolution by supplementing spatial information across scales and taking the similarities and correlations across nodes into account. Specifically, we spread the labeled sample features into the unlabeled samples using graph convolution, designing the spectral feature enhancement module to study the local correlation of spectral features within nodes. Through multi-scale interaction and deep feature mining operations, we obtain more representative and discriminative features and achieve highly accurate classification results.
2.5. Spectral Feature Enhancement Module
According to GCN theory, the primary purpose of graph convolution is to propagate information across nodes without taking into account how important the internal relationships of nodes are. On the other hand, local and non-local spectral features are highly significant for classification tasks when processing an HSI, as they are tightly associated with the nodes in the graph. Thus, we sandwich the spectral feature enhancement module (SFEM) between the two graph convolutions. The purpose of this module is to enhance the expressive ability of spectral features so that the network can more effectively discern distinctions between various categories. Figure 3 depicts the design of this module.
To better capture details and heterogeneous information, we initially perform two lightweight one-dimensional convolution (1-D-Conv) operations on the spectral features, first increasing the dimension and then reducing it. The output is then scaled to within the range of 0 to 1 using the sigmoid function, yielding significance factors for the various channels of the graph nodes. Features of significant channels are then highlighted by performing an element-wise multiplication of these factors with the input graph node features. We also add the original features to the aforementioned outcome to prevent unnecessary information loss. With this self-supervised approach, less significant spectral features are relatively suppressed, while the expression of critical channel features is boosted. The following formula represents the SFEM:

$$X_{\text{out}} = X \odot \sigma\!\left( W_2 * \left( W_1 * X \right) \right) + X$$

where $X$ represents the input graph node features, $W_1$ and $W_2$ are the weights of the two 1-D-Conv layers, respectively, $\sigma(\cdot)$ is the sigmoid function, and $\odot$ represents the element-wise product operation.
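A minimal PyTorch sketch of the SFEM as described (dimension-expanding then dimension-reducing 1-D convolutions, sigmoid gating, and a residual addition); the expansion ratio and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SFEM(nn.Module):
    """Spectral feature enhancement: gate node channels with factors in (0, 1)
    produced by two lightweight 1-D convolutions, then add the input back."""

    def __init__(self, expand: int = 2, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.up = nn.Conv1d(1, expand, kernel_size, padding=pad)    # increase dimension
        self.down = nn.Conv1d(expand, 1, kernel_size, padding=pad)  # reduce dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C) graph node features; treat the spectral axis as a 1-D signal.
        s = x.unsqueeze(1)                                       # (N, 1, C)
        gate = torch.sigmoid(self.down(self.up(s))).squeeze(1)   # (N, C) in (0, 1)
        return x * gate + x                                      # highlight + residual

print(SFEM()(torch.rand(16, 100)).shape)  # torch.Size([16, 100])
```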
2.6. Structure of CNN Branch
Superpixel segmentation technology aggregates pixels in an HSI and represents them with the same features, but the phenomena of the same material exhibiting different spectra and different materials exhibiting the same spectrum in HSI data may lead to erroneous superpixel segmentation, thus affecting the subsequent classification accuracy. Secondly, after treating each superpixel as a graph node, information can only be propagated between superpixels, ignoring the local spatial–spectral information within each superpixel. Considering the above factors, we designed a CNN branch with an attention mechanism to obtain local detail features, solving the problems of edge smoothing and detail loss during classification that may be caused by superpixel segmentation. This branch consists of two Squeeze-and-Excitation (SE) attention mechanisms and depthwise separable convolutions. The SE attention mechanism ensures better classification results with a small amount of calculation, while depthwise separable convolution is a special convolution operation in a CNN that reduces the number of parameters and calculations of the model while improving its efficiency and performance.
The SE module includes three key steps: First, compress the spatial dimensions of the input features from a three-dimensional tensor of H × W × C to a tensor of 1 × 1 × C. Second, generate an excitation weight for each channel through fully connected layers; these weights characterize each channel's importance. Finally, multiply these weights by the original feature tensor to adjust the importance of each channel in the feature map, highlighting important features and suppressing unimportant ones, thereby achieving adaptive attention weighting. A structural diagram of this process is shown in Figure 4.
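A minimal PyTorch sketch of a standard SE block matching this description; the reduction ratio of the fully connected layers is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze H x W x C to 1 x 1 x C, excite per-channel
    weights with two FC layers, then rescale the input channels."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze + excite: (N, C) weights in (0, 1)
        return x * w.view(n, c, 1, 1)      # reweight channels

print(SEBlock(64)(torch.rand(2, 64, 9, 9)).shape)  # torch.Size([2, 64, 9, 9])
```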
Depthwise separable convolution consists of two steps: depthwise convolution and pointwise convolution. First, a convolution kernel of size K × K × 1 is applied to each channel of an input image of size H × W × B, producing B feature maps of size H × W, where each feature map corresponds to one channel of the input image. Then, a convolution kernel of size 1 × 1 × B performs a point-by-point convolution on the feature maps obtained via depthwise convolution, which is equivalent to a linear combination between channels and produces an output of size H × W × M. Here, M is the number of output channels of the pointwise convolution. Compared with traditional convolution, depthwise separable convolution reduces the computational cost by a factor of 1/M + 1/K², i.e., to roughly 1/K² of the original cost when M is large.
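A minimal PyTorch sketch of this two-step operation; the kernel size and channel counts in the usage example are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise K x K convolution per input channel, then 1 x 1 pointwise
    convolution to linearly combine channels: (N, B, H, W) -> (N, M, H, W)."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # groups=in_ch makes each kernel see exactly one input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

print(DepthwiseSeparableConv(100, 64)(torch.rand(2, 100, 9, 9)).shape)
# torch.Size([2, 64, 9, 9])
```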